Using SimpleDelegator for Page Objects
I manage a lot of web scraping where I work, and have stumbled into a pattern I very much enjoy using to maintain page parsers. This technique is a variant of the page object pattern, which helps represent web pages as full-fledged objects, each aware of its own capabilities. It uses a tool in the Ruby standard library called SimpleDelegator that helps keep a sharp focus on those capabilities while maintaining the generic features of your parser.
Page Objects in the Shell of Nut
The notion of “Page Objects” is most commonly associated with writing tests against a website. It recognizes that the knowledge of how to look up data on a page is not a concern of the test itself, and provides an abstraction API to the page under test. It brings the context of a page to life by giving it an object-oriented API, and a solid home for things like tag finders.
A basic example:
# an_imaginary_test_snippet.rb
page = some_nokogiri_style_page
title = page.css('h1').text
listings = page.css('div.results li')
assert title =~ /Lovely Title/
assert listings.size > 0
We have some code that is responsible both for finding components on a page in a detailed way, and testing the expected values in those components. A page object is a class that is responsible for representing the page, encapsulating the detailed structure away from places where it’s not important.
Our new test snippet might now be simply:
# The details are now contained in a page object class: MyPage.
# Anyone can use this from anywhere, and when the markup changes, it's nice and DRY.
page = MyPage.new(some_nokogiri_style_page)
assert page.title =~ /Lovely Title/
assert page.listings.size > 0
The test no longer knows or cares how to find the attributes in the HTML structure, only that the page elements say “the right thing”. If you change how to locate those elements, there is now a single, specific home for this knowledge.
You can take this concept as far as you want, adding interactions such as:
page.login! user: 'tim', password: 'wow, great password, tim.'
# Please don't hack me!
In a nutshell, page objects are simply a way to give context to your pages so that they can be integrated more easily with the rest of your application.
Decorating with SimpleDelegator
It’s really easy to imagine different ways this concept could be created. There are even full gems to help you get up and running.
However, a simple and effective style I’ve been using is the decorator pattern, which Ruby makes very easy. It looks a bit like this:
class MyPage
  attr_accessor :page

  def initialize(page)
    @page = page
  end

  def title
    @page.css('h1').text
  end

  def listings
    @page.css('div.results li')
  end
end
This class wraps an existing page object with new functionality in a straightforward way. If your page quacks like any of several common HTML parsers (that is, it responds to #css), this new class will work nicely.
I find that I want to expose the other methods from @page at the same level as these new methods. To me, this is still just a page; it’s only a bit more aware of itself. In the current example, I can access those methods by chaining through the #page accessor. We can simplify this using any of Ruby’s delegation techniques, and my favorite has been SimpleDelegator.
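For comparison, here is a minimal sketch of one of those other techniques: Forwardable, also in the standard library, which forwards only the methods you name. FakePage and its return values are hypothetical stand-ins for a real parser page, not part of any library.

```ruby
require 'forwardable'

# Hypothetical stand-in for a parsed page; a real one would come from your parser.
class FakePage
  Node = Struct.new(:text)

  def css(selector)
    Node.new("text for #{selector}")
  end

  def url
    'http://example.com/results'
  end
end

class MyPage
  extend Forwardable

  # Forward only the listed methods to @page; everything else stays hidden.
  def_delegators :@page, :css, :url

  def initialize(page)
    @page = page
  end

  def title
    css('h1').text
  end
end

page = MyPage.new(FakePage.new)
page.title             # => "text for h1"
page.url               # => "http://example.com/results"
page.respond_to?(:css) # => true
```

The explicit list is the trade-off: nothing leaks through by accident, but every parser method you want to expose has to be named.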
SimpleDelegator is a class in the Ruby standard library that provides an initializer to store a target object for delegation, and automatically sends all undefined messages to that target. Hey, that’s what we just talked about!
require 'delegate'

class StubbedPage
  def text
    'This is some text'
  end
end

class LoudPage < SimpleDelegator
  # SimpleDelegator already takes care of an initialization method along the lines of:
  #
  #   def initialize(target_object)
  #     @target_object = target_object
  #   end
  #
  # It also delegates all unknown method calls to this target.
  # It's like a smarter version of:
  #
  #   def method_missing(msg, *args, &block)
  #     @target_object.public_send(msg, *args, &block)
  #   end
  #
  def yell
    # `#text` is not defined in this class, so it gets delegated
    # out to the `@target_object`.
    text.upcase + '!!!'
  end
end

page = StubbedPage.new
loud_page = LoudPage.new(page)
loud_page.yell # => 'THIS IS SOME TEXT!!!'

# Conveniently, our `loud_page` still also knows all the same moves our `page` did.
loud_page.text # => 'This is some text'
In this simple example, it feels a bit like normal inheritance. But this LoudPage can be used with any object that quacks #text, not just whatever we specify as a parent class. That sounds a bit like a mixin module. However, we don’t need to modify our original object or class; we are creating a separate, new object to play with that has a specific purpose to fulfill.
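A small sketch of that duck-typing claim, wrapping two unrelated objects. Heading and Tweet are hypothetical classes invented for this illustration; the only thing they share is a #text method.

```ruby
require 'delegate'

class LoudPage < SimpleDelegator
  def yell
    text.upcase + '!!!'
  end
end

# Two unrelated, hypothetical objects that both happen to respond to #text.
Heading = Struct.new(:text)

class Tweet
  def text
    'so quiet in here'
  end
end

loud_heading = LoudPage.new(Heading.new('breaking news'))
loud_tweet   = LoudPage.new(Tweet.new)

loud_heading.yell # => "BREAKING NEWS!!!"
loud_tweet.yell   # => "SO QUIET IN HERE!!!"

# Neither wrapped class was modified along the way.
Tweet.new.respond_to?(:yell) # => false
```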
I find this chemistry perfect for page objects. The main libraries you use for parsing or navigation already have very robust representations of a page; they just lack context.
If we apply SimpleDelegator to our MyPage class from above, it is now simply:
require 'delegate'

class MyPage < SimpleDelegator
  def title
    css('h1').text
  end

  def listings
    css('div.results li')
  end
  # Wow, that's simple!
end

page = MyPage.new(some_nokogiri_style_page)
page.title # => 'This Here is a Lovely Title'
page.listings # => [<listing>, <listing>, <listing>]
page.css('form#that-one') # => <Nokogiri or Ox or Mechanize Page or Whatever>
# Pleasant.
Components
I’ve enjoyed the ability to easily create components for pages using this concept. Here’s an example using a Listing component:
# pages/results_page.rb
require 'delegate'
require_relative 'components/listing'

class ResultsPage < SimpleDelegator
  def listings
    # Wrap each result item (not the whole list) in a Listing component.
    css('ul.results li').map { |l| Listing.new(l) }
  end

  def query
    css('.search-title')
  end
end

# pages/components/listing.rb
require 'delegate'

class Listing < SimpleDelegator
  def title
    css('div.product-name').text
  end

  def price
    css('div.price').text
  end

  def size
    specs['Size']
  end

  def color
    specs['Color']
  end

  private

  def specs_container
    css('div.product_specs')
  end

  # Turns "Key: Value" lines of text into a hash of specs.
  def specs
    specs_container.text
                   .split("\n")
                   .map { |txt| txt.split(':').map(&:strip) }
                   .to_h
  end
end
I can now add a lot of detail to what a Listing is without trampling on the full results page context. I find this isolation helpful when trying to diagnose issues.
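To make the private specs helper concrete, here is the same chain applied to plain strings, so no parser is needed. The sample text is invented for illustration. Note that to_h expects every line to split into exactly two parts, so blank lines or values containing a colon would trip it up; parse_specs is a hypothetical, more forgiving variant and not part of the Listing class above.

```ruby
# The same chain Listing#specs uses, applied to tidy text.
tidy = "Size: Large\nColor: Red"
tidy_specs = tidy.split("\n").map { |txt| txt.split(':').map(&:strip) }.to_h
# => {"Size"=>"Large", "Color"=>"Red"}

# Hypothetical, more forgiving variant: skips lines without a colon and
# keeps colons that appear inside a value.
def parse_specs(text)
  text.each_line
      .map { |line| line.split(':', 2).map(&:strip) }
      .select { |pair| pair.size == 2 }
      .to_h
end

messy = "\n  Size: Large\n\n  Dimensions: 10:20\n  Free shipping\n"
messy_specs = parse_specs(messy)
# => {"Size"=>"Large", "Dimensions"=>"10:20"}
```

Because the helper is private and lives in one small file, swapping in a stricter or looser parser like this touches nothing else.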
Why do I like this?
Ubiquitous
I really love relying on things that are in everyone’s basic Ruby kit. There are gems that create a similar (perhaps better?) experience, but then you must understand the gem. SimpleDelegator is very simple and effective, and available everywhere Ruby is.
Simple
It requires very little code and therefore is fast to write, and easy to jam with during development. You can pry into an instance of that object and easily poke it while you give names to the things you are finding.
Organized
I love how laser-focused each page and component becomes. This makes it easier to diagnose parsing problems, as each component has its own encapsulation in a file in a specific place. If I have a problem with parsing a dollar amount out of a product listing, I have a tidy file specifically related to parsing that.
Discoverable and to-the-point
Most importantly, I like that I can imply the interface that a developer should use for pages and components by giving names to specific data targets, while keeping helper methods private. It indicates the public contract this object should fulfil, simplifying testing. We can easily test against an array of different listing scenarios by keeping a catalog of “weird cases” using only the listing HTML, not the full page HTML.
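A rough sketch of what such a catalog of “weird cases” might look like. To stay runnable without a parser gem, FakeFragment is a hypothetical stand-in that maps CSS selectors to text; in practice you would feed the component small snippets of real listing HTML parsed by your library of choice. The Listing here is a trimmed-down copy of the one above, and the sample data is invented.

```ruby
require 'delegate'

# Hypothetical stand-in for a parsed HTML fragment: maps a CSS selector to
# the text a real parser would have found there ('' when nothing matches).
class FakeFragment
  Node = Struct.new(:text)

  def initialize(text_by_selector)
    @text_by_selector = text_by_selector
  end

  def css(selector)
    Node.new(@text_by_selector.fetch(selector, ''))
  end
end

# Trimmed-down copy of the Listing component.
class Listing < SimpleDelegator
  def title
    css('div.product-name').text
  end

  def price
    css('div.price').text
  end
end

# A small catalog of "weird cases": fragment contents plus what we expect.
weird_cases = [
  { name: 'ordinary listing',
    fragment: { 'div.product-name' => 'Red Shoes', 'div.price' => '$10.00' },
    title: 'Red Shoes', price: '$10.00' },
  { name: 'missing price',
    fragment: { 'div.product-name' => 'Mystery Box' },
    title: 'Mystery Box', price: '' }
]

# Collect every case where the component disagrees with the expectation.
failures = weird_cases.reject do |weird|
  listing = Listing.new(FakeFragment.new(weird[:fragment]))
  listing.title == weird[:title] && listing.price == weird[:price]
end

failures.map { |weird| weird[:name] } # => []
```

Only the public methods are exercised, which is the point: the private helpers can change freely as long as the named data targets keep returning the right thing.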
This is all possible without SimpleDelegator of course, but it seems like a pleasurable means to this end.
What seems off?
There is something a bit mysterious about SimpleDelegator. It asks you to make assumptions about the object you are passing in. The interface is the epitome of duck-typing, and it leaves no trace of the objects you were expecting to play with in the first place. Ultimately, this is true of many other decorator and page object implementations I’ve seen. They all eventually need to collaborate with something to “get the job done”.
When you are in pry and you cd into the object to get a feel for it, there still isn’t an obvious reference to the methods available from the delegated class. If you ls, you’ll see the methods you’ve added, but all of the methods available by delegation are hidden behind generic methods like #public_methods. This is both awesome and obnoxious, depending on how comfortable you are with the wrapped object’s API.
In fact, it’s hard to understand how to even find the object that was sucked into this black hole in the first place. There is the awkwardly named method #__getobj__, which will return the object in play, but it’s not obvious. Presumably, the name was chosen so as not to conflict with other method names.
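A quick sketch of poking at a delegator from the console, reusing the StubbedPage and LoudPage stubs from earlier (repeated here so the snippet runs on its own):

```ruby
require 'delegate'

class StubbedPage
  def text
    'This is some text'
  end
end

class LoudPage < SimpleDelegator
  def yell
    text.upcase + '!!!'
  end
end

page = StubbedPage.new
loud_page = LoudPage.new(page)

# The wrapped object is always one call away.
loud_page.__getobj__.equal?(page) # => true
loud_page.__getobj__.class        # => StubbedPage

# Methods defined directly on the decorator class...
LoudPage.instance_methods(false)  # => [:yell]

# ...versus what the wrapped object's class brings along.
loud_page.__getobj__.class.instance_methods(false) # => [:text]
loud_page.respond_to?(:text)                       # => true
```

Asking the wrapped object’s class for its own instance methods is one way to see what is available by delegation without wading through everything inherited from Object.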
Okay then.
It’s clear that there are many ways to solve these problems.
I’ve found using SimpleDelegator to be a very nice addition to my arsenal when parsing scraped data. It fits well with both simple cases and more complex abstractions. It is quick to use, always available, versatile, and provides a level of organization and simplicity that speaks to the heart of Ruby: Write the code that is important for your problem, and let the language do the lifting.
I’d love to read about your thoughts or experiences.