27 Apr 2009

I am currently in the market to buy my first home, so I've been spending a lot of time on various real estate websites, searching through listings for the perfect property. I live in a competitive housing market, so it is important that I am informed whenever a new property becomes available. Logging into multiple real estate websites each day to check for new listings is repetitive and time-consuming. Fortunately, this information can be gathered automatically using a technique called screen scraping.

Since most web pages are just HTML, it is easy for a computer to parse and store the information contained within them. Most programming languages have a host of libraries to assist in the screen scraping/parsing process, and Ruby is no exception. To create simple screen scrapers in Ruby I have been using a library called scRUBYt!. scRUBYt! provides methods to access a given website and scrape its content; all the programmer needs to do is provide an XPath expression for the desired information.
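
To get a feel for the API before we dive into the real script, here is a minimal sketch of a scRUBYt! extractor (the URL and the "heading" pattern name are purely illustrative):

require 'rubygems'
require 'scrubyt'

# Fetch a page and pull out every <h2> heading. The pattern
# name ("heading") becomes the element name in the XML output.
data = Scrubyt::Extractor.define do
  fetch 'http://example.com/'
  heading '//h2'
end

puts data.to_xml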

Using the scRUBYt! library has allowed me to write a small screen scraper script to access the FranklyMLS.com website, check for new listings, and then report back with the results. This has saved me a lot of time and effort. Let's dive into some code to see how this is done.

First, we'll need to create a simple class to store the information that we scrape from the FranklyMLS.com website. The Property class will hold various property-related information (price, MLS number, square footage, etc.):

require 'rubygems'
require 'hpricot' # used to pull fields out of the scraped XML

# Holds the details of a single listing. Each attribute is
# extracted from a <property> XML element using Hpricot's
# (element/:tag) search syntax.
class Property
  attr_accessor :mls, :list_price, :dom, :address,
                :city, :zip, :bed, :bath, :sqft, :built

  def initialize(property)
    @mls        = (property/:mls).inner_html
    @list_price = (property/:list_price).inner_html
    @dom        = (property/:dom).inner_html
    @address    = (property/:address).inner_html
    @city       = (property/:city).inner_html
    @zip        = (property/:zip).inner_html
    @bed        = (property/:bed).inner_html
    @bath       = (property/:bath).inner_html
    @sqft       = (property/:sqft).inner_html
    @built      = (property/:built).inner_html
  end
end

Next, we'll need to make sure that scRUBYt! is installed. If you don't already have Github set up as one of your gem sources, do so now by executing the following command:

gem sources -a http://gems.github.com

Then install the scRUBYt! gem:

gem install jspradlin-scrubyt

Side note: I've built some extra functionality into the scRUBYt! library, so you will need to grab the gem from my Github repository (i.e. jspradlin-scrubyt). I've spoken with the lead developer on the scRUBYt! project, and it looks like my changes might make it into a future version of the official gem.

At this point we need to give scRUBYt! the URL of the website that we wish to scrape. The FranklyMLS.com website has its own URL query syntax that displays only the properties meeting specific criteria. For example, if we only wanted to find active listings in the zip codes 22201, 22202, and 22203, the FranklyMLS.com URL would be:

http://franklymls.com/default.aspx?m=R&s=(22201,22202,22203)+active

We can dynamically generate a URL with our specific housing criteria by including the following code in our script:

# Generate the URL for FranklyMLS.com given
# the following criteria:
zips       = [22201, 22202, 22203, 22209]
beds       = [2, 3].collect { |bed| "#{bed}bdr" }
min_price  = 150 # in thousands
max_price  = 350 # in thousands
exclusions = ['JEFFERSON'].collect { |exclude| "+-#{exclude}" }

fmls_url  = "http://franklymls.com/default.aspx?"
fmls_url += "m=R&l=#{min_price}K&h=#{max_price}K"
fmls_url += "&s=(#{zips.join(',')})+active"
fmls_url += "+(#{beds.join(',')})"
fmls_url += exclusions.join
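
With the criteria above, the generated URL works out to:

http://franklymls.com/default.aspx?m=R&l=150K&h=350K&s=(22201,22202,22203,22209)+active+(2bdr,3bdr)+-JEFFERSON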

Now we're ready to scrape some housing data. Once the FranklyMLS.com property page loads we are presented with a table that contains information about the listings that meet our criteria (image modified to save space):

[Image: FranklyMLS.com property table]

The HTML that generates this table looks like this (modified to save space):

<table id="dgRealtorStyle">
  ...
  <tr style="display:visible">
    <td><a>...</a><a>AR6552162</a></td><!-- td[1]-->
    <td>$256,000</td><!-- td[2]-->
    <td>$339,000</td>
    <td>$</td>
    <td> </td>
    <td> </td>
    <td>562</td>
    <td>1931 CLEVELAND #313</td><!-- td[8]-->
    ...
    <td>ARLINGTON</td>
    <td>22201</td>
    <td> </td>
    <td>CLEVLAND HO</td>
    <td>3/1</td>
    <td>860</td>
    <td>1960</td>
    <td>...</td>
    <td>1</td>
    <td>160</td>
    <td><a>x</a></td>
  </tr>
  ...
</table>

Finally, the Ruby code to scrape this data using scRUBYt!:

# Scrape the FranklyMLS.com website using scRUBYt!
property_data = Scrubyt::Extractor.define do
  fetch fmls_url

  properties '//table[@id="dgRealtorStyle"]' do
    property "//tr" do
      mls "/td[1]//a[2]"
      list_price "/td[2]"
      dom "/td[7]"
      address "/td[8]"
      city "/td[9]"
      zip "/td[10]"
      bed "/td[13]",
        :format_output => lambda {|bed_bath| bed_bath.split('/')[0]}
      bath "/td[13]",
        :format_output => lambda {|bed_bath| bed_bath.split('/')[1]}
      sqft "/td[14]"
      built "/td[15]"
    end
  end
end

If you compare the table above, the HTML, and the Ruby code, you can trace how each piece of information is parsed and then stored. The scRUBYt! library will "fetch" the given URL, locate the HTML elements matching each XPath expression, and store the extracted data under the corresponding pattern name.
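
The extractor's result can be serialized with to_xml. Based on the pattern names above, the output is shaped roughly like this (abbreviated; the exact wrapper element may vary by scRUBYt! version):

<root>
  <properties>
    <property>
      <mls>AR6552162</mls>
      <list_price>$256,000</list_price>
      <dom>562</dom>
      <address>1931 CLEVELAND #313</address>
      <city>ARLINGTON</city>
      <zip>22201</zip>
      <bed>3</bed>
      <bath>1</bath>
      <sqft>860</sqft>
      <built>1960</built>
    </property>
    ...
  </properties>
</root>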

Once we have collected all of the data, we may want to do something useful with it, such as converting it into an RSS feed. We can accomplish this using the Hpricot and Builder libraries (which should already be installed as dependencies of scRUBYt!). The code for the RSS conversion looks like this:

require 'builder'

# Read in the XML generated by scRUBYt!, then
# convert the data into Property objects and
# store them in property_hash, keyed by MLS number.
property_hash = {}

hp = Hpricot.XML(property_data.to_xml)
(hp/:property).each do |property|
  property_hash[(property/:mls).inner_html] = Property.new(property)
end

# Using the Builder library, iterate through
# property_hash to generate an RSS 2.0 feed on stdout.
xml = Builder::XmlMarkup.new(:target => $stdout, :indent => 2)
xml.instruct! :xml, :version => "1.0"
xml.rss :version => "2.0" do
  xml.channel do
    xml.title "Property Feed"
    xml.link "<--YOUR URL-->"
    xml.description "This is my property feed"

    property_hash.each do |key, property|
      # Approximate the listing date by subtracting the
      # days-on-market (dom) value from today.
      pub_date = (Time.now - property.dom.to_i * 60 * 60 * 24)
      pub_date = pub_date.strftime("%a, %d %b %Y %H:%M:%S") # 24-hour clock for RFC 822 pubDate

      xml.item do
        xml.title property.address
        xml.link "http://franklymls.com/#{property.mls}"
        xml.pubDate "#{pub_date} EST"
        xml.description "City: #{property.city}
          Address: #{property.address}
          Price: #{property.list_price}
          Bed: #{property.bed}
          Bath: #{property.bath}
          Sqft: #{property.sqft}
          Built: #{property.built}"
      end
    end
  end
end
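
Run end to end, the script prints an RSS document whose items look like this (the pubDate value below is illustrative; in practice it is computed from the listing's days-on-market):

<item>
  <title>1931 CLEVELAND #313</title>
  <link>http://franklymls.com/AR6552162</link>
  <pubDate>Mon, 13 Oct 2008 09:30:00 EST</pubDate>
  <description>City: ARLINGTON
    Address: 1931 CLEVELAND #313
    Price: $256,000
    Bed: 3
    Bath: 1
    Sqft: 860
    Built: 1960</description>
</item>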

To make sure I get routine updates, I run this Ruby script on my server every hour using a cron job and write its output to an RSS feed file. I am subscribed to the generated feed, so now I know exactly when a new property becomes available in my area!
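
A crontab entry for this setup might look like the following (the paths here are hypothetical; adjust them for your server):

# Run the scraper at the top of every hour and overwrite the feed file
0 * * * * /usr/bin/ruby /home/justin/scraper/fmls_feed.rb > /var/www/feeds/properties.rss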

Overall, scRUBYt! is very easy to use, and for simple screen scraping tasks it works fine. However, I have found that it can run into problems when the HTML gets complex. In those cases I would recommend using Hpricot directly for fine-grained scraping.
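
For reference, here is a minimal sketch of that approach using Hpricot directly (fmls_url is the URL we built earlier; the column indexes come from the table layout above):

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Fetch the listings page and walk the results table by hand.
doc = Hpricot(open(fmls_url))

(doc/'table#dgRealtorStyle tr').each do |row|
  cells = row/'td'
  next if cells.length < 15                              # skip header/blank rows
  puts "#{cells[7].inner_text} - #{cells[1].inner_text}" # address - list price
end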

To view the source code for this entry, along with other screen scraping code that I have written, check out my Github page.

If you'd like to see another example of scRUBYt! in action, feel free to read the post I wrote for my company's blog.

6 Responses to “Ruby Screen Scraping with scRUBYt!”

  1. Pretty cool. I have used franklymls to find open houses in Northern Virginia.

    By Tom on Apr 29, 2009

  2. Interesting point on screen scrapers. For simple stuff I use Python to screen scrape, but for larger projects I used the extractingdata.com screen scraper, which worked great; they build custom screen scrapers and data extraction programs.

    By Rachel on Oct 27, 2009

  3. Hey, great article!

    I'm a bit of a Ruby newbie and just had a few questions.

    I've never seen the syntax you use in your initialize function:

    def initialize(property)
      @mls = (property/:mls).inner_html
      @list_price = (property/:list_price).inner_html

    I've never seen that (var/:symbol) syntax. Is that just a typo? What does the /: mean when in parentheses?

    Also, I've tried running this example, and after fixing up a lot of versioning errors with the dependencies, I get a 'scan': ran out of buffer space error… has this happened to you?

    Thanks!

    By Justin Reynen on Apr 1, 2010

  4. @Justin

    Thanks for the comment.

    The syntax is a little funky for sure, but it is valid. I used an XML parser library called Hpricot for this example. Hpricot takes a block of XML and allows you to parse out individual elements by referencing their XPath. For the example you gave above, if the XML looked like this:

    <property>
      <mls>1234</mls>
      <list_price>123123</list_price>
    </property>

    You could pass that block of XML to Hpricot and access the different elements using the following syntax:

    @mls = (property/:mls).inner_html
    @list_price = (property/:list_price).inner_html

    As far as the errors are concerned, a lot has happened since I last used this script. For one, Ruby gems are no longer hosted on Github, so my customized jspradlin-scrubyt gem may no longer be available.

    Anyway, I now do most of my screen scraping using a library called Nokogiri. I'd check that library out; I find the syntax a little more intuitive.

    By Justin Spradlin on Apr 1, 2010

  5. Thanks for posting, great article!
    Are you still using scRUBYt! these days? It looks like the official site is inactive.

    By Piotr on Feb 28, 2012

1 Trackback

  May 9, 2009: Ennuyer.net » Blog Archive » I am way behind on my rails link blogging. Link dump and reboot.
