Poetry of Programming

It's about Ruby on Rails – Kiran Soumya


Import Tool – ScRUBYt

At first I tried a string-wrapper tool built on open-uri, which treats the fetched page as plain text.
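A minimal sketch of what such a string wrapper can look like: the page is read as one string and a regular expression picks out the value. The helper names `extract_title` and `fetch_title` are made up for illustration, and the regex is only an example of the approach, not the code used for this post.

```ruby
require 'open-uri'

# Pull the text of the first <title> element out of raw HTML using a regex.
# Returns nil when no title is present.
def extract_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}im)
  match ? match[1].strip : nil
end

# Download a page as one big string and run the regex over it.
# Needs network access; the URL is whatever the caller passes in.
def fetch_title(url)
  extract_title(URI.open(url).read)
end

puts extract_title('<html><head><title> Google Finance </title></head></html>')
```

This works for simple values, but the regex knows nothing about the structure of the page, so it breaks easily when the markup changes.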

Then I looked at tree-wrapper tools and understood the problem they solve: an HTML document can look very good in a browser, yet still be seriously malformed (unclosed or misused tags). Parsing such a document into a structured format like XML is a non-trivial problem, since XML parsers can work with well-formed documents only.
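To see the problem concretely, here is a small sketch that feeds two snippets to REXML directly. The helper name `well_formed?` is hypothetical; it simply reports whether REXML accepted the markup or raised a parse error.

```ruby
require 'rexml/document'

# Returns true when REXML can parse the markup, false when it rejects it.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts well_formed?('<p><b>bold</b> text</p>')  # properly nested tags
puts well_formed?('<p><b>bold text</p>')      # <b> is never closed
```

A browser would render both snippets without complaint, which is why a cleanup step is needed before any XML tooling can be used.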

But HTree, together with REXML, is capable of transforming the input into the nicest possible XML from our point of view: a REXML Document. (REXML is Ruby’s standard XML/XPath processing library.)

After preprocessing the page content with HTree, we can unleash the full power of XPath, a very powerful XML document querying language that is highly suitable for web extraction.
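As a rough illustration of the XPath step, the sketch below queries a small, already well-formed document with REXML. The markup and the helper name `symbol_links` are invented for the example; in practice the document would be the output of the HTree cleanup.

```ruby
require 'rexml/document'

# Return [text, href] pairs for every <a class="symbol"> in a well-formed document.
def symbol_links(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//a[@class='symbol']").map do |a|
    [a.text, a.attributes['href']]
  end
end

# A tiny stand-in for a page that has already been cleaned up (made-up markup).
page = <<~XML
  <html>
    <body>
      <a class="symbol" href="/q?GOOG">GOOG</a>
      <a class="nav" href="/help">Help</a>
      <a class="symbol" href="/q?YHOO">YHOO</a>
    </body>
  </html>
XML

symbol_links(page).each { |text, href| puts "#{text} -> #{href}" }
```

The single expression `//a[@class='symbol']` replaces what would otherwise be a fragile regex over the raw text.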

The powerful web scraping tools in Ruby are mainly Mechanize and Hpricot. Hpricot bills itself as “a Faster HTML Parser for Ruby”; other options include Rubyful Soup (HTree + XPath), scrAPI, and ARIEL.

WWW::Mechanize can automatically navigate through web pages as a result of interaction (filling in forms, etc.) while keeping cookies, automatically following redirects, and simulating everything else that a real user (or the browser in response to that user) would do.

Mechanize is a powerful library, BUT it cannot fully interact with JavaScript-driven websites. That is, it cannot handle more than one redirect done through JavaScript.

Using Mechanize, my attempts to scrape Gmail for all mails (a feed exists, but only for new mails) and to extract Orkut scraps worked only as long as there was no complex JavaScript to break them.

At first I tried to run a Google search and extract Reddit articles using almost all of these libraries.

Then, using XPath and scRUBYt, I was able to extract details from finance.google.com in XML format.

To a certain extent scRUBYt, which is a combination of Hpricot and Mechanize on steroids, seems like a good starting step for getting a finance.google.com portfolio.

Also, scRUBYt is faster than Mechanize.


require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  # Navigate to the portfolio page, logging in along the way
  fetch 'http://finance.google.com'
  click_link 'Portfolios'
  fill_textfield 'Email', 'kiransoumi@gmail.com'
  fill_textfield 'Passwd', '----'
  submit
  fetch 'http://finance.google.com/finance/portfolio?action=view&pid=1'
  click_link 'Transactions'

  # Construct the wrapper: one stockinfo record per table row
  stockinfo "/html/body/div/div/table/tbody/tr" do
    symbol "/td[1]/a[1]"
    qty "/td[5]"
    price "/td[6]"
  end
end

# Print the extracted records as XML
google_data.to_xml.write($stdout, 1)

[MODE] Learning
[ACTION] fetching document: http://finance.google.com
[ACTION] clicking link: Portfolios
[ACTION] fetched https://www.google.com/accounts/ServiceLogin?hl=en&service=finance&nui=1&continue=http%3A%2F%2Ffinance.google.com%3A80%2Ffinance%2Fportfolio%3Faction%3Dview
[ACTION] typing kiransoumi@gmail.com into the textfield named 'Email'
[ACTION] typing ---- into the textfield named 'Passwd'
[ACTION] submitting form...
[ACTION] fetched https://www.google.com/accounts/CheckCookie?continue=http%3A%2F%2Ffinance.google.com%3A80%2Ffinance%2Fportfolio%3Faction%3Dview&service=finance&hl=en&chtml=LoginDoneHtml
[ACTION] fetching document: http://finance.google.com/finance/portfolio?action=view&pid=1
[ACTION] clicking link: Transactions
[ACTION] fetched http://finance.google.com/finance/portfolio?action=viewt&pid=1
Extraction finished succesfully!

stockinfo extracted 4 instances.
symbol extracted 4 instances.
qty extracted 4 instances.
price extracted 4 instances.
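The wrapper idea behind that result, a parent pattern that matches each row plus child patterns evaluated relative to that row, can be sketched with plain REXML. This is not how scRUBYt is implemented; the table markup, the values in it, and the helper name `extract_rows` are all made up for illustration.

```ruby
require 'rexml/document'

# Match every row with row_xpath, then evaluate each field's XPath
# relative to that row. Returns one hash per row.
def extract_rows(markup, row_xpath, fields)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, row_xpath).map do |row|
    fields.transform_values { |xpath| REXML::XPath.first(row, xpath)&.text }
  end
end

# Hypothetical, simplified transactions table (not the real page markup).
table = <<~XML
  <table>
    <tr><td><a>GOOG</a></td><td>10</td><td>450.00</td></tr>
    <tr><td><a>YHOO</a></td><td>25</td><td>28.50</td></tr>
  </table>
XML

# Child patterns, each relative to one matched row.
fields = { 'symbol' => 'td[1]/a[1]', 'qty' => 'td[2]', 'price' => 'td[3]' }

extract_rows(table, '/table/tr', fields).each { |record| puts record.inspect }
```

Each matched row becomes one record, which is why the run above reports the same number of instances for stockinfo, symbol, qty, and price.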

One Response to Import Tool – ScRUBYt

  1. ben2k7 says:

    I like your blog, this post is really good, but please vary your topics, it will broad your readership.
