Thursday, 29 March 2012
Extracting All Hyperlinks From Webpages - Python
In this example, I am going to show how easily you can extract all the links in a webpage using Python. If you are learning to write a small-scale crawler, this can be a quick start on how to extract the links from any webpage.
Basically, we will send an HTTP request to a webpage and read the HTML response, except when the connection cannot be established. In that case, we will simply inform the user that we could not connect to the website.
For all of this, we will import a few modules; the most important ones are re and urllib2, for the regular expression work and the HTTP request/response work respectively.
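As a minimal sketch of that request/response step (the URL below is just a placeholder), the whole fetch with urllib2 looks something like this:

import urllib2

try:
    response = urllib2.urlopen("http://www.example.com")  # send the HTTP request
    html = response.read()                                # read back the raw HTML
except urllib2.URLError:
    print "Can't connect to the website"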
We then write the regex for the hyperlinks, which we will search for in the HTML data we get back after sending the request to the server. Note the <a href=[\'"]?([^\'" >]+). The parentheses are there to let us capture the information we need, i.e. the actual links.
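To see the capturing group in action, here is a quick sketch with a made-up HTML snippet; findall() returns only what the parentheses capture:

import re

links_regex = re.compile('<a href=[\'"]?([^\'" >]+)', re.IGNORECASE)
sample = '<a href="http://www.techgaun.com">Home</a> <a href=/about>About</a>'
print links_regex.findall(sample)   # prints ['http://www.techgaun.com', '/about']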
Now that you understand what we'll be doing, below is the Python script to extract the hyperlinks from any webpage.
#!/usr/bin/python
# extracter.py - extract all hyperlinks from a webpage
import re, urllib2
from sys import argv

if len(argv) != 2:
    print "No URL specified. Taking default URL for link extraction"
    url = "http://www.techgaun.com"
else:
    url = str(argv[1])

# regex with a capturing group for the value of the href attribute
links_regex = re.compile('<a href=[\'"]?([^\'" >]+)', re.IGNORECASE)

url_request = urllib2.Request(url)
try:
    response = urllib2.urlopen(url_request)  # send the HTTP request
    html = response.read()                   # read the HTML response
    links = links_regex.findall(html)        # grab every captured link
    print '\n'.join(links)
except urllib2.URLError:
    print "Can't connect to the website"
Now run the script as python extracter.py http://www.techgaun.com, or with any URL you wish.
So isn't it a good start for writing your own simple web crawler? :P
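One thing the script above does not handle is relative links such as /about. If you take this further into a crawler, a small sketch (the links list below is made up for illustration) using urljoin from the standard urlparse module can turn every match into an absolute URL:

from urlparse import urljoin

base_url = "http://www.techgaun.com"
# e.g. links as returned by links_regex.findall(html) in the script above
links = ['/contact', 'http://www.example.com/page', 'about.html']
for link in links:
    print urljoin(base_url, link)   # relative links become absolute, full URLs pass through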