Convert Web Page to Text
It has been a while since I last published a post. Finally it is time to come back to this blog and keep learning new stuff I can share. Let’s get back to business.
As the title suggests, we will take the URL of a web page and save that page as a text document. This is particularly useful when you work on NLP problems and need textual information about something. The web is an abundant source of information, Wikipedia being a good example. But copying and pasting manually from the web is not efficient when you need to process a lot of pages. So here comes the solution: automate the web-to-text conversion with a little help from Python.
While looking for a way to do this, I came across BeautifulSoup. It is a great Python tool for processing HTML, and it has a method called get_text, how lucky we are 😀 Here is a very short function that requests a web page and extracts its text:
    import urllib.request
    from bs4 import BeautifulSoup

    def Web2Text(url, outname):
        # Header for the HTTP request; some sites reject the default urllib user agent
        user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
        headers = {'User-Agent': user_agent}

        # Request and read the HTML from the given URL
        request = urllib.request.Request(url, None, headers)
        with urllib.request.urlopen(request) as response:
            data = response.read()  # Raw bytes of the web page's source

        # Clean the HTML; naming the parser explicitly avoids a BeautifulSoup warning
        raw = BeautifulSoup(data, 'html.parser').get_text()
        print(raw)

        # Save the extracted text as UTF-8
        with open(outname, 'w', encoding="utf-8") as outf:
            outf.write(raw)
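Depending on the BeautifulSoup version and parser, get_text may also return the contents of script and style tags, and it keeps the whitespace of the original markup. The sketch below is one possible way to tidy that up, not part of the function above: the helper name html_to_text and the sample HTML are made up for illustration, and it runs on an in-memory string so it can be tried without a network connection.

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Return the readable text of an HTML string, one text fragment per line.

    Hypothetical helper for illustration; not part of the original post.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are code rather than readable text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # strip=True trims each text fragment and skips whitespace-only ones;
    # separator puts every remaining fragment on its own line.
    return soup.get_text(separator="\n", strip=True)

# Made-up sample page used only to demonstrate the helper.
sample = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><h1>Hello</h1><p>Some <b>bold</b> text.</p>
<script>alert("hi");</script></body></html>
"""

text = html_to_text(sample)
print(text)
```

To apply the same cleanup to a live page, the two lines that build the soup and call get_text in the function above could be replaced with a call to this helper on the downloaded data.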