It has been a while since I last published a post. Finally, it is time to come back to this blog and keep learning new stuff I can share. Let's get back to business.
As the title suggests, we will take the URL of a web page and save that page's text in a text document. This is particularly useful when you are working on NLP-based problems and need textual information about something. The web is an abundant source of information, Wikipedia being a good example. But copying and pasting manually from the web is not efficient when you need to process a lot of pages. So here comes the solution: automate the web-to-text conversion with a little help from Python.
While looking for a way to do this, I came across BeautifulSoup. It is a great Python tool for processing HTML, and it even has a method called get_text, how lucky we are 😀 Here is a very short function that requests a web page and extracts its text:
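Before the full function, here is a minimal sketch of what get_text does on its own, with no network request involved. The HTML string and the variable names are made up for illustration, and the built-in "html.parser" backend is assumed; other parsers may handle whitespace slightly differently.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
sample_html = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>"

# Name the parser explicitly; "html.parser" ships with Python itself
soup = BeautifulSoup(sample_html, "html.parser")

# get_text() drops the tags and joins the remaining text nodes;
# separator and strip keep the pieces from running together
text = soup.get_text(separator=" ", strip=True)
print(text)
```

The function itself adds the HTTP request in front of this step and the file output after it.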
import urllib.request

from bs4 import BeautifulSoup

def Web2Text(url, outname):
    # Header for the HTTP request: a browser-like User-Agent
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    headers = {'User-Agent': user_agent}
    # Request and read the HTML from the given url
    request = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(request)
    data = response.read()  # HTML data of the web page's source
    # Clean the HTML: keep only the text, parsed with Python's built-in parser
    raw = BeautifulSoup(data, 'html.parser').get_text()
    # Write the extracted text to the output file
    with open(outname, 'w', encoding="utf-8") as outf:
        outf.write(raw)