
How to Scrape Websites in Ruby on Rails using scRUBYt!

Ever since launching Contrastream, I’ve wanted to create a script that would automatically add new albums to the site as they were released (saving me a lot of time). The idea was to scrape a bunch of album titles from review sites and combine them with information from Last.fm.

The task seemed daunting at the time, so I never jumped in and tried it. A year later I finally wrote my first script and realized how wrong I was: scraping sites with Ruby is surprisingly easy.

For my first project, I decided to try the simple task of pulling product information from homedepot.com. Here’s how I did it after some trial and error. See the completed script: http://www.pastie.org/267676

The Plan

  • Go to http://homedepot.com
  • Search for “Hoover Vacuums”
  • Find the First Product in the Results
  • Click Product Details Link
  • Fetch Name + Description
  • Repeat for Each Product in the Results
  • Save Data to MySQL

What You Need

This is purely a Ruby script, but I used Rails to save the data via mass assignment and ActiveRecord (MySQL).

  1. Firefox + Firebug + XPath Checker
  2. Set up basic Rails application framework
  3. Install the scRUBYt! gem (sudo gem install scrubyt --include-dependencies)
  4. Set up the database.yml file (a sample follows this list)
  5. Create homedepot.rb in [rails_root]/scripts folder
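
For step 4, here's a minimal database.yml sketch for MySQL (the database name and credentials are placeholders; use your own):

    development:
      adapter: mysql
      database: scraper_development
      username: root
      password:
      host: localhost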

Now the fun part begins…

Fetching the site

Add this to the top of the script.

    require 'rubygems'
    require 'scrubyt'
    # Load the Rails environment so ActiveRecord models are available
    require File.dirname(__FILE__) + '/../config/environment'

    Scrubyt.logger = Scrubyt::Logger.new

    product_data = Scrubyt::Extractor.define do

      fetch 'http://www.homedepot.com/'

These are the basic scRUBYt requires, plus the Rails environment so we can use ActiveRecord. After that, you need to point the script at the site you’re scraping (called fetching).

Searching the site

Now that the page is loaded up, it’s time to search the site.

      fill_textfield 'keyword', 'hoover vacuums'
      submit

I highlighted the search input box in Firefox and viewed the source; the input’s name was “keyword”. The second string is the search query, “hoover vacuums”.

Find the First Product in the Results (Create a Pattern)

The search results page shows a list of Hoover products. The next step is to create a pattern that the script will follow. In this case, the pattern is: for each product in the results I want to click and see the product details.

      product_row "//div[@class='product']" do

From viewing the source I found that each product is wrapped in a div with class=”product”. For each of these divs, the next action will take place.

Click Product Details Link

The link to the product details page is the name of the product. For example, the link “Hoover Legacy Pet Rewind Vacuum” goes to that product’s details page. Because the link text is different on each row, I need to use the XPath of the link to create a pattern. (Otherwise, I could have used product_link “View Details”.)

    product_link "/form/div[1]/p[1]/a", :generalize => false do

To find the XPath, right-click the first link in Firefox and choose View XPath (from the XPath Checker extension). This is the XPath for the link inside the div: “/form/div[1]/p[1]/a”. It shows that the link is wrapped in a form tag, followed by a div, followed by a p (all within div class=”product”).

Now, scRUBYt has a pattern: for each product in the search results, click the link (located inside a form, div and p tag).

Each product in the results has the same HTML structure so the script can easily find the next product in the results.
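
Here's how the pieces shown so far nest together (a sketch; the details pattern from the next section goes where the comment is):

    product_data = Scrubyt::Extractor.define do
      fetch 'http://www.homedepot.com/'
      fill_textfield 'keyword', 'hoover vacuums'
      submit

      product_row "//div[@class='product']" do
        product_link "/form/div[1]/p[1]/a", :generalize => false do
          # the product_details pattern from the next section goes here
        end
      end
    end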

Grab Title + Description

Now that the script is on the product details page, it’s time to start scraping some data. I wanted two things from this page: the name and the description.

    product_details do
      product_record "//p[@class='product-name']" do
        name "Homelite 20 In. Homelite Corded Electric Mower"
      end
      parent "//div[@id='tab-features']" do
        description "/p[1]"
      end
    end

First, from looking at the HTML, I found the name was wrapped in a p tag with the class “product-name”, so I can grab its contents. I copied the title from the first product so scRUBYt can create a pattern for the rest of the products. Now the script knows exactly what data to kick back (the text within the “product-name” paragraph).

The description is similar, except I used the XPath “/p[1]” to create a pattern instead of copying the whole paragraph of text.

The “name” and “description” labels preceding the paths are attached to the data when you export it later on.

Save to MySQL

The script is pretty much done; it is now searching for hoover vacuums, going to each product page and grabbing the name/description. Now you can save the data to MySQL (or any ActiveRecord DB).

The first step is to name your database columns after the labels you gave the data above; in this case, “name” and “description”. Mass assignment will then automatically save the data to the corresponding columns.
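
Here's a minimal sketch of what that might look like, assuming a standard Product model (the migration and file names are illustrative):

    # db/migrate/..._create_products.rb (Rails 2-era migration syntax)
    class CreateProducts < ActiveRecord::Migration
      def self.up
        create_table :products do |t|
          # Columns match the scRUBYt labels so mass assignment lines up
          t.string :name
          t.text :description
          t.timestamps
        end
      end

      def self.down
        drop_table :products
      end
    end

    # app/models/product.rb
    class Product < ActiveRecord::Base
    end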

    product_data_hash = product_data.to_hash

    product_data_hash.each do |item|
      # Product.create instantiates the record via mass assignment and saves it
      Product.create(item)
    end

Now, we save it to the DB by converting the data to a hash. Each item creates a DB entry and mass assignment automatically sorts the data into the columns.

Note: you could also save the data to XML or a basic text file.
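
For example, here's a sketch of dumping the results to an XML file instead of the database (I'm assuming the result object's to_xml method, which scRUBYt releases of this era provided; the filename is arbitrary):

    # Write the scraped data out as XML instead of saving to the database
    File.open('products.xml', 'w') do |f|
      f.write(product_data.to_xml)
    end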

The Complete Script

You can view the complete script here: http://www.pastie.org/267654

To run the script, run “ruby homedepot.rb” from the command line.