Saturday, September 24, 2011

Web Scraping: How to harvest web data using Ruby and Nokogiri

Web Scraping with Nokogiri

In this post I will walk through how to use Nokogiri to harvest data from retailer web pages and save that data into a spreadsheet, instead of copying and pasting by hand. I am using Ubuntu 10.10, Nokogiri 1.5.0, and Ruby 1.9.2. Update: I've learned that this technique is commonly called "web scraping," so I've updated the text to reflect that.

Web Scraping Background and Introduction

Recently I was assigned the task of populating a spreadsheet with fan data pulled from the retailer Industrial Fans Direct. My client needed the price, description, and serial number of a lot of fans, from each of the categories visible below (e.g. ceiling fans, exhaust fans, contractor fans). Some of these categories have sub-categories, and some of those sub-categories have further sub-categories. The point is that there are many hundreds of fans listed on this web site, and doing the traditional copy-paste into an Excel spreadsheet was going to take a long time.
Industrial Fans Direct -- Home Page

Below is a screenshot of a product summary page of ceiling fans. This page contains all the data I need: price, serial number, and description. I noticed that the formats are the same for all the ceiling fans, and it turns out that this retailer has used the same format across all categories of fans.

Industrial Fans Direct -- showing ceiling fan product summary page.

Since the format is consistent, this is a great candidate for using an HTML parser to gather the data. This technique is known as "web scraping."

Introducing Nokogiri

Nokogiri is a Ruby gem designed to help parse HTML and XML. Its creators describe it as an "HTML, XML, SAX, & Reader parser with the ability to search documents via XPath or CSS3." Since we only want to read a simple HTML page, we can ignore the part about XML and SAX (I have no idea what SAX is). We can also ignore the part about XPath, which I'm also unfamiliar with. The takeaway is that Nokogiri can parse HTML and search it via CSS. That's how we're going to perform our web scraping. The parsing part we can largely ignore as well; it basically means Nokogiri will load the document. The really important part for us is that we can use Nokogiri to search HTML using CSS.

Searching with CSS

Searching HTML with CSS means using CSS selectors to identify parts of an HTML document. Consider the following simple HTML page (borrowed from tenderlove):
<html>
  <head>
    <title>Hello World</title>
  </head>
  <body>
    <h1>This is an awesome document</h1>
    <p>
      I am a paragraph
        <a href="http://google.ca">I am a link</a>
    </p>
  </body>
</html>

If we wanted to change that h1 heading to red text, we would use CSS. First we would select the h1 heading using the CSS selector "h1", and then we would apply the "color" property with the value "red". In a separate style sheet, that would look like this:

h1 { 
  color: red;
}

The point here is the selector. We use the selector "h1" to identify the discrete text string "This is an awesome document", which then turns red. Using CSS, we can identify just about any element in an HTML document, assuming that document is properly marked up. Using these exact selector rules from CSS, we can tell Nokogiri which elements we want to grab.
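To make that connection concrete, here is a quick preview of the same idea in Nokogiri (a minimal sketch; the HTML string is just the example document from above, condensed):

require 'nokogiri'

html = '<html><body>
          <h1>This is an awesome document</h1>
          <p>I am a paragraph <a href="http://google.ca">I am a link</a></p>
        </body></html>'

doc = Nokogiri::HTML(html)
puts doc.css('h1').first.content     # => This is an awesome document
puts doc.css('p > a').first['href']  # => http://google.ca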

An important lesson here: know how to use CSS selectors. The CSS2 specification has a short and useful list of selectors. These will get you far.
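For reference, these are the selector patterns that will come up in the rest of this post, each paired with a throwaway highlight rule of the kind we'll write below (the colors are arbitrary):

/* descendant: any <b> anywhere inside a <div> */
div b { background-color: yellow; }

/* child: a <b> whose direct parent is a <div> */
div > b { background-color: yellow; }

/* adjacent sibling: a <table> that immediately follows another <table> */
table + table { background-color: yellow; }

/* id: the element with id="contentalt1" */
#contentalt1 { background-color: yellow; }

/* pseudo-class: a <div> that is the first child of its parent */
div:first-child { background-color: yellow; }

/* attribute: a <table> that has an align attribute */
table[align] { background-color: yellow; }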

Set up your own CSS file

Before jumping into Nokogiri, we have to know what we want to grab from the web site, and how to grab it using CSS selectors. On a properly marked-up page with semantic classes and IDs, that should be fairly easy. However, the fan data I need is on Industrial Fans Direct, a web site with atrocious mark-up. That's okay--Nokogiri can handle it. It just means this will be a rather advanced lesson in selectors.

First, save a local copy of the HTML document, so that we can play around with its CSS. I started with this page of exhaust fans, and saved it onto my computer as "fans.html."

Second, create a style sheet (I called mine "andrew.css") and save it in the same location that you saved your local copy of the HTML page. I put both my local copy of the HTML and my style sheet in a folder called "nokogiri_testing".

Third, look at the source code in the browser. Specifically, look at the stylesheets. The "head" section from fans.html is below:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> 
<head> 
<title>INDUSTRIAL I - BELT :: Industrial Fans Direct</title> 
 
<base href="http://www.industrialfansdirect.com/Merchant2/"> 
<meta http-equiv="content-type" content="text/html; charset=utf-8" /> 
<link rel="canonical" href="http://www.industrialfansdirect.com/IND-FA-EF-CM-I1.html" /> 
<link href="css/andreas09.css" rel="stylesheet" type="text/css" /> 
<link href="css/dropdown.css" rel="stylesheet" type="text/css" /> 
<link href="css/tab-view.css" rel="stylesheet" type="text/css" /> 
<link href="css/IFD_Print.css" rel="stylesheet" type="text/css" media="print"> 
<link href="file:///home/andy/nokogiri_testing/andrew.css" rel="stylesheet" type="text/css" />
<script language="javascript"> 
function cfm_calc (form) {
form.cfm.value = Math.round((form.height.value * 1.2) * form.width.value *
form.length.value) ; }
</script> 
</head> 

Notice the base tag, which makes all of the page's relative links (including its stylesheets) resolve against the live site rather than our local folder. With this in mind, we know we can insert our own stylesheet into this local copy by including a full file:// path, as I have done above with the andrew.css link. Notice also that it comes after all the other style sheets, so that its rules override anything that comes before them.

Fourth, populate this CSS file with something obnoxious just so that we know it works. Here's mine:

body {
	background-color: blue;
}
If that turns the page blue when you render the page in a browser, you'll know that you have a working style sheet.

Fifth, open the page in a browser to see if your CSS modifications are working. Remember: load your local copy (in my case, "fans.html"), not the online version.

Once you have a working stylesheet, the next step is to start using it to figure out what CSS selectors to use.

Identify your CSS selectors

The next step is to decide what you want to grab from the web page, and then figure out how to use CSS selectors to get to it. This is where it starts to get a bit difficult, especially with a page marked up as badly as this one is, with tables nested in tables nested in tables, and with countless divs, few of which have identifiers or classes.


A local copy of the exhaust fans page. Note the address bar (local copy!) and the prices next to each product.
The first piece of data I want from the Industrial Fans Direct summary product page is the price. Looking at the HTML document, I see that the price is embedded in lines that look like this:

<div align="left"><b>Your Price: <font color="#003366">$1,349.00</font></b></div>
The piece we want is the price with the dollar sign. We can see that it's wrapped in a font tag, which is in turn wrapped in a <b> tag, which is in turn wrapped in a <div> tag. A CSS selector that represents this chain is div > b > font.

Let's use a CSS selector to grab this.

First, go back to your CSS file and add the following line:
div > b > font {
	background-color: green;
}

Second, go back and refresh the browser (the local copy!). That should turn all of the prices green. If it works, then we've achieved our objective, which is to discover a suitable CSS selector for grabbing the information we want from the web page. For the price, that selector is div > b > font. Note that there are usually several ways to drill down to the information you need. As your knowledge of CSS selectors grows, you'll discover the most efficient ways.

The exhaust fans page again, this time with prices highlighted in green using the CSS selector "div > b > font". Notice that the selector didn't pick up any other elements on the page.
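For example, judging from the price markup shown earlier, an attribute selector on the font tag might have worked just as well (untested here; you would want to confirm in the browser that the site doesn't use that font color for anything else):

font[color="#003366"] {
	background-color: green;
}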
Third, pick another piece of desired information, and find a CSS selector to identify it. The next piece of information I want is the serial number.

Fourth, go back to the HTML and take a guess at how you would drill down to the serial number. Here's a line that contains the serial number.
<div align="left"><b style="font-size:8px;">LFI-XB24SLB10050</b></div>

The serial number is wrapped in a <b> tag, which is wrapped in a <div> tag. Using the same logic as above, I try out the CSS selector div > b, as shown in my style sheet, which now has two styles:

div > b > font {
	background-color: green;
}

div > b {
	background-color: red;
}

Fifth, go back to the browser again and refresh the page. I've given my serial number style a background color of red, and the result is shown in the following figure.
The exhaust fans page after a first attempt at selecting price (green) and serial number (red). Notice that the serial number selector was too liberal.
As you can see, the div > b CSS selector picked up more than just the serial number, so we'll have to get more precise.

From this point, it's an iterative process. Keep adding more and more specificity to your CSS selector chain until you highlight exactly the elements you need, and nothing more. My completed stylesheet is shown below:

/* price */
div > b > font {
	background-color: green;
}

/* serial number */
div#contentalt1 div:first-child b {
	background-color: red;
}

/* description */
table + table tr + tr td a {
	background-color: blue;
}
The result of all this work (downloading the page, adding a CSS file, highlighting elements) is a set of three CSS selectors:
  1. div > b > font
  2. div#contentalt1 div:first-child b
  3. table + table tr + tr td a
In the next section we will provide those CSS selectors to Nokogiri, which will use them to speed through HTML pages and pull out prices, serial numbers, and descriptions for all sorts of fans.

Dive into Nokogiri

Now that we've identified our CSS selectors, we're done with HTML and CSS. From here, we'll be in Ruby. I find it's always easiest to start in an Interactive Ruby (IRb) session. So type irb at the command prompt and type in the following commands:
$ irb
ruby-1.9.2-p180 :001 > require 'nokogiri'
 => true 
ruby-1.9.2-p180 :002 > require 'open-uri'
 => true 
ruby-1.9.2-p180 :003 > doc = Nokogiri::HTML(open('http://www.industrialfansdirect.com/IND-FA-PC-EC.html'))
[output truncated]
ruby-1.9.2-p180 :004 > doc.class
 => Nokogiri::HTML::Document 
The first two lines loaded Nokogiri and open-uri, a Ruby standard library that lets open accept a URL. The third line told Nokogiri to fetch an HTML document from the web, parse it as HTML, and save the result in an object called "doc". Since irb echoes the result of every expression, that third line produces a huge amount of output, which you can ignore. But now that you have the object called "doc", you can use Nokogiri's css method to search it. Simply pass the css method the CSS selector that you want it to use. That's it.
ruby-1.9.2-p180 :005 > puts doc.css('div > b > font')
<font color="#003366">$739.00</font>
<font color="#003366">$1,019.00</font>
<font color="#003366">$1,779.00</font>
<font color="#003366">$2,099.00</font>
<font color="#003366">$2,329.00</font>
<font color="#003366">$2,499.00</font>
<font color="#003366">$3,849.00</font>
<font color="#003366">$3,599.00</font>
 => nil 
As you can see, Nokogiri returned the font tags in their entirety. Later we'll use the content method to return just what's inside those tags. But for the moment, the takeaway is:
  1. Load Nokogiri
  2. Pass it a file or a web page to parse and return a Nokogiri object
  3. Use the css method to search that object
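Incidentally, while you're experimenting in irb you don't have to hit the live site every time. Nokogiri will just as happily parse the local copy we saved earlier (a quick sketch, assuming fans.html is in your current directory and nokogiri is already required):

local_doc = Nokogiri::HTML(File.open('fans.html'))
local_doc.css('div > b > font').each do |price|
  puts price.content   # content returns just the text inside the tag, e.g. "$1,349.00"
end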
Now that we know how to use Nokogiri, let's start a Ruby script to start doing the heavy lifting.

A Nokogiri Ruby Script

First, create a Ruby file as follows. I called mine "fans.rb".
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.industrialfansdirect.com/IND-FA-PC-EC.html'))

doc.css('div > b > font').each do |price|
  puts price.content
end
Run this file and note that the output only includes the content of the font tags.
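If everything is wired up correctly, running it prints just the prices, the same eight values we saw in irb, minus the font tags (prices as of the time of writing):

$ ruby fans.rb
$739.00
$1,019.00
$1,779.00
$2,099.00
$2,329.00
$2,499.00
$3,849.00
$3,599.00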
However, we don't want to just print data to the terminal window; we want to store it. Let's take an intermediate step by filling out the program with all three attributes (price, description, serial number), and storing those attributes in Ruby arrays. To check that this is working, we can still print the output to the terminal window. Here's the new script:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.industrialfansdirect.com/IND-FA-PC-EC.html'))

prices = Array.new
serial_numbers = Array.new
descriptions = Array.new

doc.css('div > b > font').each do |price|
  prices << price.content
end

doc.css('div#contentalt1 table + table div:first-child b').each do |serial_number|
  serial_numbers << serial_number.content
end

doc.css('div#contentalt1 table + table tr + tr td a').each do |description|
  descriptions << description.content unless description.content.length < 2
end

(0..prices.length - 1).each do |index|
  puts "serial number: #{serial_numbers[index]}"
  puts "price: #{prices[index]}"
  puts "description: #{descriptions[index]}"
  puts ""
end

Note the unless modifier on the description line: I had to add it because I couldn't find a CSS selector that would select the descriptions and nothing else. Instead, it selected the descriptions plus random bits of empty tables. Since I don't want to store those random bits (which appeared in my array as strings of length 0 or 1), I required a description to be at least 2 characters long.
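If you prefer, the same filter can be collapsed into one expression with map and reject (a sketch; the strip call is my addition and wasn't in the original script):

descriptions = doc.css('div#contentalt1 table + table tr + tr td a')
                  .map { |link| link.content.strip }
                  .reject { |text| text.length < 2 }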

Running the full script produces the following output:
$ ruby fans.rb
serial number: PC-PAC2KCYC01
price: $739.00
description: CYCLONE 3000 Portable 2 Speed Evaporative Cooler (2,400 / 3,000 CFM)

serial number: PC-PAC2K163SHD
price: $1,019.00
description: Portable 3 Speed Evaporative Cooler: 16 in Blade (2,500 / 3,280 / 3,900 CFM)

serial number: PC-PAC2K24HPVS
price: $1,779.00
description: Portable Variable Speed Evaporative Cooler: 24 in Blade (6,700 CFM)

serial number: PC-PAC2K361S
price: $2,099.00
description: Portable 1 Speed Evaporative Cooler: 36 in Blade (9,600 CFM)

serial number: PC-PAC2K363S
price: $2,329.00
description: Portable 3 Speed Evaporative Cooler: 36 in Blade (4,800 / 6,600 / 9,600 CFM)

serial number: PC-PAC2K36HPVS
price: $2,499.00
description: Portable Variable Speed Evaporative Cooler: 36 in Blade (10,100 CFM)

serial number: SCF-PROK142-2HV
price: $3,849.00
description: Portable 2 Speed Evaporative Cooler (high velocity): 42 in Blade (9,406 / 14,232 CFM)

serial number: PC-PAC2K482S
price: $3,599.00
description: Portable 2 Speed Evaporative Cooler: 48 in Blade (11,000 / 20,000 CFM)
It tells me that it knows the serial number, price, and description of eight fans. I tested this script on several different web pages from this retailer, and found that it works for each category and sub-category.

Now that we know we can harvest (web scrape) and store the data in Ruby, we have to get it into a spreadsheet.

Storing the Harvested Data

For this part, we'll use Ruby's CSV class to store the data in a CSV file. Simply require csv at the top of the file, then open a CSV file and loop over the three arrays, writing one row per fan. Below is the complete new script:
require 'nokogiri'
require 'open-uri'
require 'csv'

doc = Nokogiri::HTML(open('http://www.industrialfansdirect.com/IND-FA-PC-EC.html'))

prices = Array.new
serial_numbers = Array.new
descriptions = Array.new

doc.css('div > b > font').each do |price|
  prices << price.content
end

doc.css('div#contentalt1 div:first-child b').each do |serial_number|
  serial_numbers << serial_number.content
end

doc.css('table + table tr + tr td a').each do |description|
  descriptions << description.content unless description.content.length < 2
end

(0..prices.length - 1).each do |index|
  puts "serial number: #{serial_numbers[index]}"
  puts "price: #{prices[index]}"
  puts "description: #{descriptions[index]}"
  puts ""
end

CSV.open("fans.csv", "wb") do |row|
  row << ["serial number", "price", "description"]
  (0..prices.length - 1).each do |index|
    row << [serial_numbers[index], prices[index], descriptions[index]]
  end
end
That works correctly, which means we've completed the hard part. The script is parsing the HTML file, pulling out the data we want, and storing it in a csv file called "fans.csv". But we're not done yet; this script only handles one HTML page, and we have lots of web pages from which we want to harvest data. The next step is to find a way to efficiently go through all these web pages without having to insert a new URL each time.

Running the script over multiple web pages

There are several ways to make this script "crawl" the web site. I think the simplest is to create an array of all the URLs that contain my data, and pass those URLs from the array, one at a time, to the script we wrote. This means we'll establish the array of URLs and the three attribute arrays (serial numbers, prices, descriptions), and then wrap the rest of our code in a loop that goes through all the URLs. Here's the script, with the URL array and the loop. Notice that I've made the attribute arrays instance variables; because they're declared before the loop, plain local variables would also have worked, but the instance variables make it obvious they're meant to outlive the loop.
require 'nokogiri'
require 'open-uri'
require 'csv'

urls = Array[
  'http://www.industrialfansdirect.com/IND-FA-AF-S.html',
  'http://www.industrialfansdirect.com/IND-FA-AF-WE.html',
  'http://www.industrialfansdirect.com/IND-FA-AF-SS.html',
  'http://www.industrialfansdirect.com/IND-FA-AF-CF.html',
  'http://www.industrialfansdirect.com/IND-FA-BL.html',
  'http://www.industrialfansdirect.com/IND-FI-CF.html'
]

@prices = Array.new
@serial_numbers = Array.new
@descriptions = Array.new

urls.each do |url|
  doc = Nokogiri::HTML(open(url))
  doc.css('div > b > font').each do |price|
    @prices << price.content
  end

  doc.css('div#contentalt1 div:first-child b').each do |serial_number|
	  @serial_numbers << serial_number.content
  end

  doc.css('table + table tr + tr td a').each do |description|
    @descriptions << description.content unless description.content.length < 2
  end

  (0..@prices.length - 1).each do |index|
    puts "serial number: #{@serial_numbers[index]}"
    puts "price: #{@prices[index]}"
    puts "description: #{@descriptions[index]}"
    puts ""
  end
end
  
CSV.open("fans.csv", "wb") do |row|
  row << ["serial number", "price", "description"]
  (0..@prices.length - 1).each do |index|
    row << [@serial_numbers[index], @prices[index], @descriptions[index]]
  end
end
That completes the objectives of this task. With this Ruby script, using the power of Nokogiri, we can "web scrape," or harvest data from, as many pages as we want to include in the url array.

Special thanks are due to Aaron Patterson, creator of Nokogiri, and all who contribute to it.

Update

Without going into all the specifics of how I did it, below is the completed script. It has a few extra features:
  • URLs are stored in an external CSV file
  • CSS selectors are updated to be slightly more robust
  • Includes category, sub-category, and sub-sub-category
The if statements near the top of the loop organize how the category and sub-categories are identified. They're pulled from the "bread crumb" navigation, which changes structure depending on how deep the category hierarchy goes. Again, all my thanks go to the creators of Nokogiri. With their Ruby gem, I pulled out more than 1,700 rows of data in 68 lines of code, which runs in about one minute. Including the 5 or so hours it took me to write this script, it probably saved me about 10 hours of work, and increased the accuracy of the finished product.

require 'nokogiri'
require 'open-uri'
require 'csv'

@prices = Array.new
@serial_numbers = Array.new
@descriptions = Array.new
@urls = Array.new
@categories = Array.new
@subcategories = Array.new
@subsubcategories = Array.new

urls = CSV.read("fan_urls.csv")
(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
  
  #the last bread crumb does not have an anchor tag, which allows the following logic
  bread_crumbs_length = doc.css('div[style="padding-left:10px;"] a').length + 1
  puts "bread crumbs length: #{bread_crumbs_length}"
  if bread_crumbs_length == 2
    category = doc.css('a + font')[0].content
    sub_category = "na" 
    sub_sub_category = "na" 
  elsif bread_crumbs_length == 3
    category = doc.css('div[style="padding-left:10px;"] a:first-child + a')[0].content
    sub_category = doc.css('a + font')[0].content
    sub_sub_category = "na" 
  elsif bread_crumbs_length == 4
    category = doc.css('div[style="padding-left:10px;"] a:first-child + a')[0].content
    sub_category = doc.css('div[style="padding-left:10px;"] a:first-child + a + a')[0].content
    sub_sub_category = doc.css('a + font')[0].content
  else
    category = "na"
    sub_category = "na"
    sub_sub_category = "na"
  end

  doc.css('div > b > font').each do |price|
    @prices << price.content
    @urls << urls[index][0]
    @categories << category
    @subcategories << sub_category
    @subsubcategories << sub_sub_category
  end

  doc.css('div#contentalt1 table[align] div:first-child b').each do |serial_number|
	  @serial_numbers << serial_number.content
  end

  doc.css('table + table tr + tr td a').each do |description|
    @descriptions << description.content unless description.content.length < 2
  end
end
 
CSV.open("fans.csv", "wb") do |row|
  row << ["category", "sub-category", "sub-sub-category", "serial number", "price", "description", "url"]
  (0..@prices.length - 1).each do |index|
    row << [
      @categories[index], 
      @subcategories[index], 
      @subsubcategories[index], 
      @serial_numbers[index], 
      @prices[index], 
      @descriptions[index], 
      @urls[index]]
  end
end
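For reference, the script above reads the URLs with CSV.read and only ever looks at the first column of each row (urls[index][0]), so fan_urls.csv can simply be one URL per line. A minimal example, using a few of the category pages mentioned earlier, would look like this:

http://www.industrialfansdirect.com/IND-FA-AF-S.html
http://www.industrialfansdirect.com/IND-FA-PC-EC.html
http://www.industrialfansdirect.com/IND-FI-CF.html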

5 comments:

  1. Hi Andy,

    Great post. To your knowledge, is there a way to make nokogiri work when you scrape a page where the content is loaded by javascript?

    I've got a small script working; I can grab the title of the page, but the content is loaded after the page loads via a js, and I'm never getting content.

    Thanks!

    Chris

  2. @Chris, I had the same issue with a different page, but didn't have the time to research it. To my knowledge, it's not possible to scrape a page with Nokogiri unless you can feed it HTML. Have you looked into Firebug or other tools that might be able to parse AJAX-loaded pages? Sorry I don't have a better answer.

  3. Hi all,

    Web scraping is the process of automatically collecting Web information. It is a field with active developments sharing a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding and human-computer interactions. Thanks for sharing it......

    Web Data Scraping

  4. Andrew, you've probably already figured this out, but you can use Capybara to do what you're looking for. You would do something like:

    require 'capybara'
    require 'capybara/dsl'
    include Capybara::DSL

    Capybara.current_driver = :selenium
    Capybara.app_host = "http://www.google.com"

    page.visit('/path/to/page')

    Now you can feed page.html to Nokogiri. Good luck.

  5. Web scraping services are provided by computer software which extracts the required facts from the website. Web scraping services mainly aims at converting unstructured data collected from the websites into structured data.Web scraping services mainly aims at collecting, storing and analyzing data. In this article good Advantages of web scraping services and Importance of web scraping services.. So more helpful..

    Web scraping services-importance of scraped data
