Yusuf Tas

December 31, 2021 / General Programming

Convert Web Page to Text

It has been a while since I last published a post. Finally it is time to come back to this blog and keep learning new stuff I can share. Let’s get back to business.

As the title suggests, we will take the URL of a web page and save that page as a text document. This is particularly useful when you work on NLP problems and need textual information about something. The web is an abundant source of information, Wikipedia being a good example. But manually copying and pasting from the web is not efficient when you need to process a lot of pages. So here comes the solution: automate the web-to-text conversion with a little help from Python.

While looking for a way to do this, I came across BeautifulSoup. It is a great Python tool for processing HTML, and it has a function called get_text, how lucky we are 😀 Here is a very short function for requesting a web page and getting its text:

import urllib.request
from bs4 import BeautifulSoup


def Web2Text(url, outname):
    # Header for the HTTP request: imitate a browser user agent
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    headers = {'User-Agent': user_agent}

    # Request and read the HTML from the given url
    request = urllib.request.Request(url, None, headers)
    with urllib.request.urlopen(request) as response:
        data = response.read()  # HTML data of the web page's source

    # Clean the HTML; naming the parser explicitly avoids a BeautifulSoup warning
    raw = BeautifulSoup(data, 'html.parser').get_text()
    print(raw)

    # Write the extracted text to the output file
    with open(outname, 'w', encoding="utf-8") as outf:
        outf.write(raw)

The user agent identifies the browser we are imitating. urllib.request is then used to read the web page's HTML. This HTML data is simply processed by BeautifulSoup to get the text out. Finally, the text is written to a text file. Let's see this in action by converting a Wikipedia page:

url = 'https://en.wikipedia.org/wiki/Grevillea_buxifolia'
output_name = 'test.txt'
Web2Text(url, output_name)

It should save the content of the page in a text file. Unfortunately, you will get a lot of junk as well, since a Wikipedia page contains many links, references, tables, etc. There might be better ways to do this for Wikipedia pages, but in general this method worked well enough for me to get text from web pages. Part of the output I get for this page:

Family:
 
Proteaceae
 
 
Genus:
 
Grevillea
 
 
Species:
 
G. buxifolia
 
 
Binomial name
 
 
Grevillea buxifolia(Sm.) R.Br.
 
 
Grevillea buxifolia, commonly known as the grey spider flower, is a species of the family Proteaceae. It grows in coastal New South Wales, Australia. First described in 1793 by James Edward Smith, he gave the new species the name Embothrium buxifolium. It is widely cultivated and contains a number[clarification needed] of subspecies and cultivars. These vary most in the presentation of the attractive flower.
 
Description[edit]
The species forms a short shrub, three or four feet high. The numerous branches are covered in a reddish or brown hair and many leaves.  Flowers sit at the termination of these: yellowish and white, pendulous star-shaped petals, set to appear in November.
The flowers of the plant sit alone, erect in umbels, on stalks covered in reddish brown hairs. The corolla is likewise clothed and is partly fused to form a cavity. This interior is white and the petals are otherwise very pale to yellow; this spills above the hairy parts.  The single elliptic leaves are veiny, with a very rough dark green upper; margins entire and roll to the downy underside. These are arranged, almost directly to the stem, alternate and numerous up the branches.  They end in a little sharp point.
 
References[edit]
 
 
"Grevillea buxifolia". Plant Name Details. IPNI. Retrieved 2007-07-29. Basionym: Proteaceae Embothrium buxifolium Sm. Spec. Bot. New Holland 1793
"Grevillea buxifolia". Grevillea page. (ASGAP). 14 February 2006. Archived from the original on 2007-07-16. Retrieved 2007-08-01. G.buxifolia subsp. buxifolia and subsp. phylicoides are both well known in cultivation and are generally reliable and attractive shrubs
External links[edit]
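
Some of the junk mentioned above can be trimmed before calling get_text. What follows is only a rough sketch, not part of the original function: clean_text and its drop_selectors parameter are names made up for illustration, and the assumption that Wikipedia marks its "[edit]" links with the mw-editsection class should be checked against the actual page source. Depending on the parser and the BeautifulSoup version, script and style contents may already be left out of get_text, so removing them explicitly is a precaution.

```python
from bs4 import BeautifulSoup


def clean_text(html, drop_selectors=()):
    """Extract text from HTML, skipping non-content elements (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements whose contents are never readable text
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Optionally drop site-specific clutter, given as CSS selectors (assumption)
    for selector in drop_selectors:
        for tag in soup.select(selector):
            tag.decompose()

    # One text fragment per line; whitespace-only fragments are skipped
    return soup.get_text(separator="\n", strip=True)


sample = """
<html><head><style>p {color: red;}</style></head>
<body><script>var x = 1;</script>
<h1>Grevillea buxifolia</h1>
<h2>Description<span class="mw-editsection">[edit]</span></h2>
<p>It grows in coastal New South Wales.</p>
</body></html>
"""
print(clean_text(sample, drop_selectors=(".mw-editsection",)))
```

Inside Web2Text, the same idea would replace the single get_text call. Note that separator="\n" also puts inline elements such as links on their own lines, so a space separator may suit running prose better.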

And this is where I will finish this post. I discovered this beautiful library, BeautifulSoup, and wanted to share how it can be used to extract text from web pages. Hopefully I will find a good use for all the text I will collect 🙂 (hint: nl*) And as always, keep learning.
