Thursday, April 14, 2011

Simple Ruby Screen Scraper using Mechanize, Hpricot and XPath

In my previous post, simple-ruby-screen-scraper-in-just-5.html, I explained how to do screen scraping in just 5 lines without using XPath. In this post, I will achieve the same task using XPath. Before going through it, I recommend reading the previous post to get an idea of the complete scenario.

Here, too, I will scrape the "Quality Assurance" link shown in the left panel under the "My Blogs" section of my website, "http://www.kumarritesh.com/", this time with XPath. Here is the code:


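require 'rubygems'
require 'mechanize'
require 'hpricot'

# Fetch the page and hand its raw HTML to Hpricot
agent = Mechanize.new
page = agent.get('http://www.kumarritesh.com')
@response = page.content
doc = Hpricot(@response)
# The XPath matches every link on that path; print the inner HTML of the first one
puts((doc/"/html/body/div/div/div[2]/div[2]/div[3]/div/ul/li/a")[0].innerHTML)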


The code is the same as in the previous post; only the last line changes.
In my previous post, I parsed the page by searching for the div class, the li id and the anchor link.
Here, I will use XPath instead.

To get the XPath, inspect the "Quality Assurance" link in Firebug. You will see this markup:
<a href="/index.php/quality-assurance">Quality Assurance</a>

Right-click the element and select the "Copy XPath" option. You will get the following path:
/html/body/div/div/div[2]/div[2]/div[3]/div/ul/li/a
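
To see how the numbered steps in such a path behave, here is a minimal sketch. It uses REXML from Ruby's standard library instead of Hpricot, only so that it runs without extra gems, and the markup in `sample_html` is a hypothetical, much smaller stand-in for the real page. The point is that a step like `div[2]` picks the second `div` child, because XPath positions start at 1.

```ruby
require 'rexml/document'

# Hypothetical markup, far smaller than the real page.
sample_html = <<~HTML
  <html>
    <body>
      <div>
        <div>header</div>
        <div>
          <ul>
            <li><a href="/index.php/quality-assurance">Quality Assurance</a></li>
          </ul>
        </div>
      </div>
    </body>
  </html>
HTML

doc = REXML::Document.new(sample_html)

# div[2] selects the second div child: XPath positions are 1-based.
link = REXML::XPath.first(doc, '/html/body/div/div[2]/ul/li/a')

# div[1] selects the first div child at the same level.
first_div = REXML::XPath.first(doc, '/html/body/div/div[1]')

puts link.text
puts first_div.text
```

If the page layout changes, for example when another `div` is inserted before the sidebar, a copied absolute path like this one may stop matching, so it is worth re-copying the XPath from the browser when the scraper returns nothing.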
 
Pass this path to the document as shown in the last line of the code. It returns all the links that match the path, so to get the first link we take the zeroth element and then its inner HTML. In the same way, we can copy the XPath of any text, label or link and scrape it very easily.
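
The "all links, then the zeroth element" step can be sketched as follows. This is an illustration under stated assumptions, not the code from the post: it uses REXML from the standard library rather than Hpricot, and both the markup in `sidebar_html` and the second link in it are made up for the example.

```ruby
require 'rexml/document'

# Hypothetical sidebar markup; the second link is invented for illustration.
sidebar_html = <<~HTML
  <html>
    <body>
      <div>
        <ul>
          <li><a href="/index.php/quality-assurance">Quality Assurance</a></li>
          <li><a href="/index.php/another-blog">Another Blog</a></li>
        </ul>
      </div>
    </body>
  </html>
HTML

doc = REXML::Document.new(sidebar_html)

# A path without position numbers matches every element along it,
# so the result is an array of all matching links.
links = REXML::XPath.match(doc, '/html/body/div/ul/li/a')

# Ruby arrays are 0-based, so index 0 is the first matching link.
first_link = links[0]

puts links.size
puts first_link.text
puts first_link.attributes['href']
```

Note the two numbering schemes: positions inside the XPath string count from 1, while the Ruby array returned by the query counts from 0.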

3 comments:

  1. I used the above code, but the inner HTML is not displayed. Is there any other way to do this scraping?

  2. ERROR: Error installing mechanize:
    ERROR: Failed to build gem native extension.
    I get this error. How do I install mechanize?

  3. ERROR: Error installing mechanize:
    ERROR: Failed to build gem native extension.
    I get this error. How do I install mechanize?
