Web scraping with Python (using BeautifulSoup)

As usual, the first set of questions goes like this: what is web scraping? What is it useful for? And how do I do it? To answer the first two questions in the simplest of words: web scraping is the collection of specific data or information from a website or a single web page, so that the data can be used for analysis or whatever else the scraper needs it for. Several programming languages can be used for web scraping, but as stated above we will be using the Python programming language to scrape a website. How do I do it? Let's get right to it with a simple example. First, it is worth noting that one of the languages used in building a website is the HyperText Markup Language (HTML), and an HTML page carries a large amount of data in text form. To scrape data from a website, we will use BeautifulSoup from the bs4 Python library and the lxml parser (there are other parsers, but we will be using lxml for this example). These tools are very helpful and easy to use when it comes to web scraping.

Getting Started:

First create a folder, then create a virtual environment in that folder, and install the tools and libraries inside it. All of these steps are done at the command prompt using pip.

# on the command prompt
mkdir work
cd work
virtualenv work_flow_env
work_flow_env\Scripts\activate
pip install beautifulsoup4
pip install lxml
pip install html5lib  # optional
pip install requests
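
Note: the activation command above is for Windows. If you are on macOS or Linux, the equivalent (assuming the same environment name) is:

source work_flow_env/bin/activate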

I will assume that most readers have at least a little knowledge of HTML, but if you have none, you can always skim through a good free resource; I recommend "w3schools.com".

Now if you are using Sublime Text, all you have to do is drag your folder named "work" into the Sublime Text application and you are ready to code. If you have Anaconda installed, Jupyter Notebook already comes with all of these libraries, so you can skip the 'Getting started' phase.

Our task is really simple: we are to get the names of movies from 'http://toxicwap.com/New_Movies/' along with their links. In your editor, start by importing the libraries.

Importing Libraries:

from bs4 import BeautifulSoup
import requests

Getting the raw data:

After importing the libraries, you get the page content in text form from the web page using requests.get().text.

source = requests.get('http://toxicwap.com/New_Movies/').text 
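
As an aside, requests will happily hand you the HTML of an error page if the request fails, so a slightly more defensive sketch (optional, not required for this example) checks the status first:

response = requests.get('http://toxicwap.com/New_Movies/')
response.raise_for_status()  # raises an exception on 4xx/5xx responses
source = response.text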

Parsing:

Now that you have the information you need, you parse the text using lxml, and you make it clean and readable using prettify().

soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())

Inspecting the web page:

Go back to the web page and inspect it (right-click and select Inspect). Navigate through the source code shown until you can highlight the part of the page you want to scrape. When you have found what you need, go back to your code and look through your "prettified" text; you will see the same markup you highlighted in the browser. Depending on the site you are scraping, you may have to dig a lot deeper before getting to what you want.
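
For this page, the relevant slice of the prettified output should look roughly like the following (the tag structure is inferred from the find() calls used below; the actual titles and URLs will differ):

<div data-role="content">
 <ul>
  <li>
   <a href="/New_Movies/some-movie">
    Some Movie Title
   </a>
  </li>
  ...
 </ul>
</div>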

Digging (navigating) through the text data:

div = soup.find('div', attrs={'data-role':'content'})
ul = div.find('ul')
li = ul.find('li')
print(li)

To explain the code: if you are doing exactly what I am doing, you will notice when inspecting the page that the tags are mostly div tags. The first line selects the particular div that holds the content we want to scrape. The second line digs deeper into that div to the ul (unordered list), and the third digs into the ul to the first li (list item). The fourth line simply displays that list item.
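
As a side note, BeautifulSoup also supports CSS selectors, so the same navigation could be written in one line (an equivalent sketch, not what the code above uses):

li = soup.select_one('div[data-role="content"] ul li')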

Note: some sources online use class_= when trying to reach a particular div, but you will notice that in this particular case the prettified text did not display a class on the div, hence the use of attrs={}.
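
For comparison, both forms are valid BeautifulSoup calls; the class name below is hypothetical, just to show the syntax:

div = soup.find('div', class_='movie-list')             # when the tag has a class
div = soup.find('div', attrs={'data-role': 'content'})  # when matching any other attribute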

title = li.a.text
print(title)
link = li.a
print(link['href'])

Now, the first line says: put into the variable named title the text found in the 'a' tag (<a> is the link tag in HTML), which in turn sits inside the 'li' tag. It is literally just digging from 'li' into 'a' and then to its text. From that, you should be able to interpret the third and fourth lines: they grab the 'a' tag itself and print its href attribute. Basically, you are already done, but this code will get you just the first title and the first link; to get all the titles and links, you use a for loop.
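
One caveat worth knowing: if an li happens to contain no a tag, li.a is None and li.a.text raises an AttributeError. A slightly more defensive sketch (optional), using BeautifulSoup's .get() for attributes:

a = li.find('a')
if a is not None:
    title = a.text
    link = a.get('href')  # returns None instead of raising if href is missing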

Displaying the whole output:

for li in ul.find_all('li'):
    title = li.a.text
    print(title)

    link = li.a
    print(link['href'])

Notice how .find() changed to .find_all()? That is what you do when you want every matching tag rather than just the first. It is good practice to use .find() first while you are still navigating; once the structure is clear, you switch to .find_all() to collect the remaining data.
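Since .find_all() simply returns a list of tags, the same loop can also be written compactly (an equivalent sketch, not required):

titles = [li.a.text for li in ul.find_all('li')]
links = [li.a['href'] for li in ul.find_all('li')]

So now the whole code should look like this.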

Complete code:

from bs4 import BeautifulSoup
import requests

source = requests.get('http://toxicwap.com/New_Movies/').text
soup = BeautifulSoup(source, 'lxml')

div = soup.find('div', attrs={'data-role': 'content'})
ul = div.find('ul')

for li in ul.find_all('li'):
    title = li.a.text
    print(title)

    link = li.a
    print(link['href'])

You are done, but if you want to save the scraped data into a text file or CSV file, you can. I'll be saving mine into a CSV file.

Saving in a format:

from bs4 import BeautifulSoup
import requests
import csv
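
A minimal sketch of the saving step, reusing the navigation from the complete code above; the filename movies.csv and the header row are hypothetical choices:

source = requests.get('http://toxicwap.com/New_Movies/').text
soup = BeautifulSoup(source, 'lxml')

div = soup.find('div', attrs={'data-role': 'content'})
ul = div.find('ul')

with open('movies.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['title', 'link'])  # header row
    for li in ul.find_all('li'):
        writer.writerow([li.a.text, li.a['href']])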

I would like to add that some websites make it really hard to scrape their pages, and for some sites scraping is against their terms of service or even illegal.

That is it. It is all done. I guess I could say you just learnt how to scrape a website.
