I manage a lot of web scraping where I work, and I have stumbled into a pattern I very much enjoy using to maintain page parsers. The technique is a variant of the page object pattern, which represents each web page as a full-fledged object that is aware of its own capabilities. It uses a class in the Ruby standard library called SimpleDelegator, which helps keep a sharp focus on those capabilities while preserving the generic features of your parser.

Page Objects in the Shell of Nut

The notion of “Page Objects” is most commonly associated with writing tests against a website. It recognizes that the knowledge of how to look up data on a page is not a concern of the test itself, and provides an abstraction API to the page under test. It brings the context of a page to life by giving it an object-oriented API, and a solid home for things like tag finders.

A basic example:

# an_imaginary_test_snippet.rb

page = some_nokogiri_style_page

title = page.css('h1').text
listings = page.css('div.results li')

assert title =~ /Lovely Title/
assert listings.size > 0

We have some code that is responsible both for finding components on a page in a detailed way, and testing the expected values in those components. A page object is a class that is responsible for representing the page, encapsulating the detailed structure away from places where it’s not important.

Our new test snippet might now be simply:

# The details are now contained in a page object class: MyPage.
# Anyone can use this from anywhere, and when the markup changes, it's nice and DRY.
page = MyPage.new(some_nokogiri_style_page)

assert page.title =~ /Lovely Title/
assert page.listings.size > 0

The test no longer knows or cares how to find the attributes in the HTML structure, only that the page elements say “the right thing”. If you change how to locate those elements, there is now a single, specific home for this knowledge.

You can take this concept as far as you want, creating more interactivity such as:

page.login! user: 'tim', password: 'wow, great password, tim.'
# Please don't hack me!

In a nutshell, page objects are simply a way to give context to your pages so that they can be integrated more easily with the rest of your application.

Decorating with SimpleDelegator

It’s really easy to imagine different ways this concept could be implemented. There are even full gems to help you get up and running.

However, a simple and effective style I’ve been using is the decorator pattern, which Ruby makes very easy. It looks a bit like this:

class MyPage
  attr_accessor :page

  def initialize(page)
    @page = page
  end

  def title
    @page.css('h1').text
  end

  def listings
    @page.css('div.results li')
  end
end

This class wraps an existing page object with new functionality in a straightforward way. If your page quacks like several common HTML parsers, this new class will work nicely.

I find that I want to expose the other methods from @page at the same level as these new methods. To me, this is still just a page; it’s just a bit more aware of itself. In the current example, I can only reach those methods by chaining through the #page accessor. We can simplify this using any of Ruby’s delegation techniques, and my favorite has been SimpleDelegator.
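For contrast, here is a rough sketch of one of those other delegation techniques: explicit forwarding with Forwardable from the standard library. The StubbedHtmlPage class and its #title_text method are made up for illustration; a real parser object would take their place. The thing to notice is that every forwarded method has to be listed by name.

```ruby
require 'forwardable'

# Hypothetical stand-in for a parsed page; not a real parser class.
class StubbedHtmlPage
  def css(selector)
    "nodes matching #{selector}"
  end

  def title_text
    'Lovely Title'
  end
end

class ForwardingPage
  extend Forwardable

  # Each method we want to expose from @page must be named here.
  def_delegators :@page, :css, :title_text

  def initialize(page)
    @page = page
  end

  def title
    title_text.upcase
  end
end

forwarding_page = ForwardingPage.new(StubbedHtmlPage.new)
forwarding_page.css('h1') # => "nodes matching h1"
forwarding_page.title     # => "LOVELY TITLE"
```

This is a fine choice when you want a tight, explicit interface. For a page object, where I want everything the parser offers plus a few extras, the list of forwarded names gets long quickly.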

SimpleDelegator is a class in the Ruby standard library that provides an initializer to store a target object for delegation, and automatically sends all undefined messages to that target. Hey, that’s what we just talked about!

require 'delegate'

class StubbedPage
  def text
    'This is some text'
  end
end

class LoudPage < SimpleDelegator
  # SimpleDelegator already takes care of an initialization method along the lines of:
  #
  #   def initialize(target_object)
  #     @target_object = target_object
  #   end
  #
  # It also delegates all unknown method calls to this target.
  # It's like a smarter version of:
  #
  #   def method_missing(name, *args, &block)
  #     @target_object.send(name, *args, &block)
  #   end
  #

  def yell
    text.upcase + '!!!'
    # Being that `#text` is not specifically defined in this class, it will get
    # delegated out to the `@target_object`.
  end
end


page = StubbedPage.new

loud_page = LoudPage.new(page)
loud_page.yell # => 'THIS IS SOME TEXT!!!'

# Conveniently, our `loud_page` still also knows all the same moves our `page` did.
loud_page.text #=> 'This is some text'

In this simple example, it feels a bit like normal inheritance. But this LoudPage can be used with any object that quacks #text, not just whatever we specify as a parent class. That sounds a bit like a mixin module. However, we don’t need to modify our original object or class – we are creating a separate, new object to play with that has a specific purpose to fulfill.
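To make the duck-typing point concrete, here is a small sketch that wraps two unrelated objects with the same LoudPage. Tweet and Memo are hypothetical classes invented for this example; the only thing they share is a #text method.

```ruby
require 'delegate'

class LoudPage < SimpleDelegator
  def yell
    text.upcase + '!!!'
  end
end

# Two unrelated, made-up classes that both quack #text.
Tweet = Struct.new(:text)

class Memo
  def text
    'remember the milk'
  end
end

loud_tweet = LoudPage.new(Tweet.new('hello there')).yell # => "HELLO THERE!!!"
loud_memo  = LoudPage.new(Memo.new).yell                 # => "REMEMBER THE MILK!!!"

# Wrapping an object without #text is allowed; it only fails once #yell
# tries to call #text. NoMethodError is a subclass of NameError, so this
# rescue covers either.
error = begin
  LoudPage.new(42).yell
rescue NameError => e
  e
end
```

Neither Tweet nor Memo was modified or even knows LoudPage exists.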

I find this chemistry perfect for page objects. The main libraries you use for parsing or navigation already have very robust representations of a page; they just lack context.

If we apply SimpleDelegator to our MyPage class from above, it is now simply:

class MyPage < SimpleDelegator
  def title
    css('h1').text
  end

  def listings
    css('div.results li')
  end

  # Wow, that's simple!
end


page = MyPage.new(some_nokogiri_style_page)
page.title # => 'This Here is a Lovely Title'
page.listings # => [<listing>, <listing>, <listing>]
page.css('form#that-one') # => <Nokogiri or Ox or Mechanize Page or Whatever>


# Pleasant.

Components

I’ve enjoyed the ability to easily create components for pages using this concept. Here’s an example using a Listing component:

# pages/results_page.rb
require 'delegate'
require_relative 'components/listing'

class ResultsPage < SimpleDelegator
  # Wrap each result item (not the list container) in its own Listing.
  def listings
    css('ul.results li').map { |li| Listing.new(li) }
  end

  def query
    css('.search-title').text
  end
end


# pages/components/listing.rb
class Listing < SimpleDelegator
  def title
    css('div.product-name').text
  end

  def price
    css('div.price').text
  end

  def size
    specs['Size']
  end

  def color
    specs['Color']
  end

  private

  def specs_container
    css('div.product_specs')
  end

  def specs
    specs_container.text
      .split("\n")
      .map { |txt| txt.split(':', 2).map(&:strip) }
      .select { |pair| pair.size == 2 } # skip blank or colon-less lines
      .to_h
  end

end

I can now add a lot of detail to what a Listing is without trampling on the full results page context. I find this isolation helpful when trying to diagnose issues.
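The private #specs helper is the densest part of that class, so here is the same chain run on its own against a sample string. The sample text is an assumption about what a div.product_specs node might contain, and the blank-line guard is my own addition for markup that includes empty lines.

```ruby
# Hypothetical text content of a `div.product_specs` node.
specs_text = "Size: Large\n\nColor: Red\n"

specs = specs_text
  .split("\n")
  .map { |line| line.split(':', 2).map(&:strip) }
  .select { |pair| pair.size == 2 }
  .to_h

specs          # => {"Size"=>"Large", "Color"=>"Red"}
specs['Size']  # => "Large"
specs['Color'] # => "Red"
```

Because the parsing lives in one small private method, it is easy to lift out and poke at like this when a listing comes back with odd values.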

Why do I like this?

Ubiquitous

I really love relying on things that are in everyone’s basic Ruby kit. There are gems that create a similar (perhaps better?) experience, but then you must understand the gem. SimpleDelegator is very simple and effective, and available everywhere Ruby is.

Simple

It requires very little code and is therefore fast to write and easy to jam with during development. You can pry into an instance of that object and easily poke it while you give names to the things you are finding.

Organized

I love how laser-focused each page and component becomes. This makes it easier to diagnose parsing problems, as each component has its own encapsulation in a file in a specific place. If I have a problem with parsing a dollar amount out of a product listing, I have a tidy file specifically related to parsing that.

Discoverable and to-the-point

Most importantly, I like that I can imply the interface that a developer should use for pages and components by giving names to specific data targets, while keeping helper methods private. It indicates the public contract this object should fulfil, simplifying testing. We can easily test against an array of different listing scenarios by keeping a catalog of “weird cases” using only the listing HTML, not the full page HTML.
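Here is a minimal sketch of what such a catalog could look like. To keep it runnable with only the standard library, it uses FakeFragment and FakeNode, which are hypothetical stand-ins for a parsed HTML fragment; in a real suite each case would hold a snippet of listing HTML run through your parser instead.

```ruby
require 'delegate'

# Made-up stand-ins: a fragment answers #css with something that has #text.
FakeNode = Struct.new(:text)

class FakeFragment
  def initialize(text_by_selector)
    @text_by_selector = text_by_selector
  end

  def css(selector)
    FakeNode.new(@text_by_selector.fetch(selector, ''))
  end
end

# Trimmed copy of the Listing component from earlier.
class Listing < SimpleDelegator
  def title
    css('div.product-name').text
  end

  def price
    css('div.price').text
  end
end

# Each weird case carries only listing-level data, never a full page.
WEIRD_CASES = [
  { name: 'missing price',
    fragment: { 'div.product-name' => 'Red Hat' },
    title: 'Red Hat', price: '' },
  { name: 'free item',
    fragment: { 'div.product-name' => 'Sticker', 'div.price' => 'FREE' },
    title: 'Sticker', price: 'FREE' }
].freeze

failures = WEIRD_CASES.reject do |weird_case|
  listing = Listing.new(FakeFragment.new(weird_case[:fragment]))
  listing.title == weird_case[:title] && listing.price == weird_case[:price]
end

failures.map { |weird_case| weird_case[:name] } # => []
```

The same loop drops neatly into whatever test framework you already use, with one assertion per case.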

This is all possible without SimpleDelegator of course, but it seems like a pleasurable means to this end.

What seems off?

There is something a bit mysterious about the SimpleDelegator. It asks you to make assumptions about the object you are passing in. The interface is the epitome of duck-typing, and it leaves no trace of the objects you were expecting to play with in the first place. Ultimately, this is true of many other decorator and page object implementations I’ve seen. They all eventually need to collaborate with something to “get the job done”.
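One way to make those assumptions visible is to check them when the object is wrapped. This is only a sketch of an idea, not something SimpleDelegator does for you; the REQUIRED_METHODS constant and the error message are my own invention.

```ruby
require 'delegate'

class MyPage < SimpleDelegator
  # Hypothetical list of methods this page object relies on.
  REQUIRED_METHODS = %i[css].freeze

  def initialize(page)
    missing = REQUIRED_METHODS.reject { |name| page.respond_to?(name) }
    unless missing.empty?
      raise ArgumentError, "#{page.class} is missing: #{missing.join(', ')}"
    end
    super # hands `page` to SimpleDelegator's own initializer
  end

  def title
    css('h1').text
  end
end

message = begin
  MyPage.new('just a string')
rescue ArgumentError => e
  e.message
end

message # => "String is missing: css"
```

The trade-off is a little ceremony in exchange for failing at construction time rather than deep inside a parse.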

When you are in pry and you cd into the object to get a feel for it, there still isn’t an obvious reference to the methods available from the delegated object. If you ls, you’ll see the methods you’ve added, but all of the methods available by delegation are hidden behind generic methods like #public_methods. This is both awesome and obnoxious, depending on how comfortable you are with the wrapped object’s API.

In fact, it’s hard to understand how to even find the object that was sucked into this black hole in the first place. There is the awkwardly named method #__getobj__ which will return the object in play, but it’s not obvious. Surely, this name is created in such a way as to not conflict with other method names.
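A short sketch of poking at a delegator from the outside, reusing the StubbedPage and LoudPage classes from earlier. The #page reader is a hypothetical convenience I’ve added here to give #__getobj__ a friendlier name.

```ruby
require 'delegate'

class StubbedPage
  def text
    'This is some text'
  end
end

class LoudPage < SimpleDelegator
  def yell
    text.upcase + '!!!'
  end

  # Hypothetical, more readable name for the wrapped object.
  def page
    __getobj__
  end
end

page = StubbedPage.new
loud_page = LoudPage.new(page)

loud_page.class                    # => LoudPage
loud_page.__getobj__.class         # => StubbedPage
loud_page.page.equal?(page)        # => true

# Methods defined directly on the decorator class:
own_methods = LoudPage.public_instance_methods(false)

# Delegated methods still show up through the generic reflection calls.
loud_page.respond_to?(:text)       # => true
loud_page.methods.include?(:text)  # => true
```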

Okay then.

It’s clear that there are many ways to solve these problems.

I’ve found SimpleDelegator to be a very nice addition to my arsenal when parsing scraped data. It fits well with both simple cases and more complex abstractions. It is quick to use, always available, versatile, and provides a level of organization and simplicity that speaks to the heart of Ruby: write the code that is important for your problem, and let the language do the lifting.

I’d love to read about your thoughts or experiences.