Better Testing and Organization for 1-off Block Transforms in Kiba
Kiba provides flexibility when defining transformations for your data. Common, reusable, configurable data transformations can be set up as Class Transforms. They are isolated from your pipeline definition, and as such, easily testable. Your pipeline code tends to be easier to read, as it just describes what it’s going to do, leaving the implementation behind the scenes.
Sometimes, you need something that is extremely specific to the pipeline, and would be harder to reason about in a separate class or file. Or maybe you are prototyping, and not really sure what things look like yet. Or, let’s be honest, maybe this is just a one-and-done throwaway script. For this, you can use Block Transforms. Block Transforms are very convenient, but can leave your pipeline definition hard to read. They also are harder to test, since they are inlined into the pipeline definition.
Tonight I came up with a strategy that makes those 1-off block transforms more testable, and the pipeline as a whole easier to read, while still not affording them the pomp and circumstance of their own file. In addition to Class Transforms and Block Transforms, by leveraging `&` (which calls `#to_proc`), you can also define transformations with lambdas or method references.
I’ll take a moment to caution here: I haven’t actually used this yet outside of prototyping and exploring for this post. I wanted to get it out on paper to force myself to think about it more. Judge for yourself whether it’s helpful; I’m biased by the excitement of discovery right now.
First, let’s set up some context.
The example pipeline:
```ruby
require 'kiba'

class Input
  def each
    [
      {brand: "Good", model: "Assuredly 2", size: "205/55R16", sku: "123456789", price: 50.00},
      {brand: "BFF Wowrich", model: "Advant", size: "205/55R16", sku: "123456", price: 50.00},
      {brand: "Cooped Upper", model: "CSS Ultra Style", size: "205/55R16", sku: "900012345", price: 50.00},
    ].each { |row| yield row }
  end
end

class Writer
  def write(row)
    puts row
  end
end

class ETLTask
  def job
    Kiba.parse do
      source Input
      # ???
      destination Writer
    end
  end
end
```
Class Transforms
Kiba can use any class that defines a `#process(row)` method. The catch is that you must pass a reference to the class itself, so it can be instantiated in the Kiba context.
```ruby
# A Kiba transform to make the row's values LOUD (UPPERCASE).
#
class MakeLoud
  def process(row)
    row.transform_values { |val| val.is_a?(String) ? val.upcase : val }
  end
end
```
When you use this in a Kiba job, it looks like this:
```ruby
source Input
transform MakeLoud
destination Writer
```
Block Transforms
We can also pass a block to `transform`. This allows for easy, informal transformations. It looks like this:
```ruby
source Input

# Be heard with many !s!!!
#
transform do |row|
  row.transform_values { |val| "#{val}!!!" }
end

destination Writer
```
This transform is coded right in the middle of the job definition. It’s really convenient when prototyping, and it just feels right for certain types of transformations that wouldn’t be shared anywhere else. The downsides: there’s practically no way to test it, and these transforms often come with contextual documentation (who requested it, and why) that can make the main pipeline harder to follow.
The Guts
Internally, if you use a Class Transform, Kiba initializes the class on your behalf. I’m not sure why this design decision was made. Perhaps it helps insulate thread safety? I’ve thought about adding a check along the lines of `args.first.is_a?(Class)` and using the existing `AliasingProc` for other types of objects. `AliasingProc` creates a proxy from `#process` to `#call`, which is what enables the `&block` transform. It could also open other doors to functional Ruby, where `#call` is frequently leveraged.
Without hacking the guts, our only options supported by the API are to pass something that responds to `.new`, or a block.
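To make that proxy concrete, here is a rough sketch of the `AliasingProc` idea. This is my simplified reading of it, not Kiba’s exact code, so check the source before relying on it: subclass `Proc` and alias `#process` to `#call`, so a plain block can stand in anywhere the runner expects an object with a `#process` method.

```ruby
# Sketch of the AliasingProc idea (simplified; see Kiba's source for the real thing).
# A Proc subclass where #process is just another name for #call.
class AliasingProc < Proc
  alias_method :process, :call
end

# Wrap a block, then use it like any other transform object.
shout = AliasingProc.new do |row|
  row.transform_values { |val| val.is_a?(String) ? val.upcase : val }
end

shout.process({brand: "Good", price: 50.0}) # => {brand: "GOOD", price: 50.0}
```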
On Organization
In my ETL projects, I seem to have a mix of three types of pipeline objects.
- Everyday workers
- Specific to a project, but abstract or configurable
- Specific to a script, not really reusable
Everyday Workers
These are the classes I find almost universally useful: things like CSV and JL readers and writers. I tend to copy-paste them from project to project, as I like having the source code available to review and tweak if needed. I’ve crafted these to my own liking with various configuration options, but generally they would be suitable for any Kiba project.
Specific to a Project, Still Abstract or Configurable
Projects often need a little repeatable special sauce. Maybe it’s a source for a specific file type, like `products.xlsx`, or a specialized data lookup with thoughtful caching. When I find myself writing a transformation that feels complicated, or requires persisting some state, I like to write these out as classes as well.
I make these configurable within reason. Custom input/output keys, injecting datasets, etc. Because they are full-fledged, isolated classes, they are next-of-kin to the Everyday Workers.
Specific to a Single Job
And finally, you get those data munging needs that are only used in a specific context. “Append 00 to the SKUs”, “Calculate some value”, “Do this one weird trick to make $50,000/week”, etc. These often feel great as an in-definition block transform.
The problem is, you end up with code that looks like this:
```ruby
class ETLTask
  def job
    Kiba.parse do
      source Input

      transform MakeLoud

      ## Be heard with many !s!!!
      #
      # Steve likes when we are excited.
      # For our sake and future reference, we'll want
      # to make note of some times he demonstrated his excitement.
      #
      # 2021-01-01 08:15:31 <Steve> WOW! THIS IS COOL!!!
      # 2021-01-01 08:15:32 <Steve> This is it!!!
      # 2021-01-01 08:17:13 <Steve> WHY IS THIS SO BROKEN?!!!
      # 2021-01-01 08:19:41 <Steve> I CAN'T EVEN!!!
      #
      # Summer Sun makes clear,
      # This is totally silly.
      # Please don't grant it Class.
      #
      # -- Tim's First Haiku
      #
      transform do |row|
        row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
      end

      ## Add Description
      #
      # Percy requested a column that will help make the pivot table easier to work with.
      #
      # This might have a fair amount of context that
      # is important to document, but the move is just not doing enough
      # to feel like it warrants its own class.
      # (Disagreers gunna disagree.)
      #
      # Sure, we could make a `ValueConcatenator`...
      # But good grief.
      #
      transform do |row|
        row[:description] = row.values_at(:brand, :model, :size, :sku).join(' ')
        row
      end

      destination Writer
    end
  end
end
```
Following the pipeline gets a lot harder with all the extra fluff. Assuming we feel very strongly against granting these moves grandeur with their own files and classes, we have two (or more) strategies we could consider here.
Two Moves (of many)
Nested Classes
```ruby
class ETLTask
  def job
    Kiba.parse do
      source Input
      transform MakeLoud
      transform SteveIt

      # Set up description, persuaded by Percy's Pivot.
      transform ValueConcatenator,
        output_key: :description,
        input_keys: [:brand, :model, :size, :sku]

      destination Writer
    end
  end

  ## Be heard with many !s!!!
  #
  # Steve likes when we are excited.
  # For our sake and future reference, we'll want
  # to make note of some times he demonstrated his excitement.
  #
  # 2021-01-01 08:15:31 <Steve> WOW! THIS IS COOL!!!
  # 2021-01-01 08:15:32 <Steve> This is it!!!
  # 2021-01-01 08:17:13 <Steve> WHY IS THIS SO BROKEN?!!!
  # 2021-01-01 08:19:41 <Steve> I CAN'T EVEN!!!
  #
  # Nested class down here.
  # So much fluff, like Winter Snow.
  # I'm still not a fan
  #
  # -- Tim's Second Haiku
  #
  class SteveIt
    def process(row)
      row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
    end
  end

  ## An abstract value concatenator
  #
  # Well, lookie here. It turns out, this is actually a winner!
  # After making the extra effort to extract this out of a block transform,
  # it actually feels like a generically reusable piece of code.
  #
  # At this point, I'd be happy to extract this one to its own file.
  #
  class ValueConcatenator
    # The list of keys to concatenate
    attr_reader :input_keys

    # The key to write the new value to
    attr_reader :output_key

    def initialize(input_keys:, output_key:)
      @input_keys = input_keys
      @output_key = output_key
    end

    def process(row)
      row[output_key] = row.values_at(*input_keys).compact.map(&:to_s).join(' ')
      row
    end
  end
end
```
My thoughts:

- We’ve kept the transformations in the same class context, near the usage. I do not find `SteveIt` interesting enough to extract to a separate file. It creates too much context switching for such a “One Weird Trick”.
- We actually found a decent abstraction in the `ValueConcatenator`. That one oughta go in its own file. I think it’s a winner. Sometimes this process is worthwhile!
- The individual moves can now be easily tested, as they are no longer in the middle of a Kiba job context.
- The Job is a bit easier to read, and importantly, easier to reorganize.
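To back up the testability claim, here is what a test could look like, using minitest from the standard library. The classes are duplicated inline so the snippet stands alone:

```ruby
require "minitest/autorun"

# Stand-alone copies of the nested classes above, so this file runs on its own.
class SteveIt
  def process(row)
    row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
  end
end

class ValueConcatenator
  attr_reader :input_keys, :output_key

  def initialize(input_keys:, output_key:)
    @input_keys = input_keys
    @output_key = output_key
  end

  def process(row)
    row[output_key] = row.values_at(*input_keys).compact.map(&:to_s).join(' ')
    row
  end
end

class TransformTest < Minitest::Test
  def test_steve_it_only_shouts_strings
    row = SteveIt.new.process({model: "Advant", price: 50.0})
    assert_equal({model: "Advant!!!", price: 50.0}, row)
  end

  def test_value_concatenator_builds_description
    transform = ValueConcatenator.new(input_keys: [:brand, :model], output_key: :description)
    row = transform.process({brand: "Good", model: "Assuredly 2"})
    assert_equal "Good Assuredly 2", row[:description]
  end
end
```

No Kiba context required; each transform is just an object you can instantiate and poke at.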
&proc (to_proc)
And here I have arrived tonight. Realizing that `&` can cast any `Proc` (or anything that implements `#to_proc`) into a block, you can set up `transform` with any proc-able object, like method references and lambdas.
```ruby
class ETLTask
  def job
    Kiba.parse do
      source Input
      transform MakeLoud

      # Get buy-in from Steve:
      transform(&ETLTask.method(:steve_it))

      # Add Description for Percy.
      # (The & calls #to_proc and hands the result to transform as its block.)
      transform(&AddDescription)

      destination Writer
    end
  end

  ## Add Description (via Lambda (Or Proc.new, Or whatever floats you.))
  #
  # Requested by Percy in Accounting
  #
  # Add a single column that will make that pivot table easier to work with.
  #
  # This might have a fair amount of context that
  # is important to document, but it's just not doing enough to feel like
  # it warrants its own class.
  #
  # This actually feels kinda dope.
  #
  AddDescription = ->(row) do
    row[:description] = row.values_at(:brand, :model, :size, :sku).join(' ')
    row
  end

  ## Be heard with many !s!!!
  #
  # Steve likes when we are excited.
  # For our sake and future reference, we'll want
  # to make note of some times he demonstrated his excitement.
  #
  # 2021-01-01 08:15:31 <Steve> WOW! THIS IS COOL!!!
  # 2021-01-01 08:15:32 <Steve> This is it!!!
  # 2021-01-01 08:17:13 <Steve> WHY IS THIS SO BROKEN?!!!
  # 2021-01-01 08:19:41 <Steve> I CAN'T EVEN!!!
  #
  # At the same indent.
  # As if it were on the team.
  # I can live with this.
  #
  # -- Tim's Third Haiku
  #
  def self.steve_it(row)
    row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
  end
end
```
Thoughts:

- I feel a sense of cohesiveness within the ETL task and some of its quirky transformations. It allows the quirks to get codified without ceremony, but manages to create isolated seams for testability. Nothing gets separated out to its own weird desert-class unless I feel it’s warranted.
- Subjectively, it looks pleasing! The Job is even easier to read, and even easier to organize.
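Those testability seams work the same way here: because the transforms are now plain callables, a test can exercise them directly, with no Kiba in sight. (Stand-alone copies of the pieces above.)

```ruby
# The lambda and class-method transforms, copied so this snippet runs alone.
AddDescription = ->(row) do
  row[:description] = row.values_at(:brand, :model, :size, :sku).join(' ')
  row
end

class ETLTask
  def self.steve_it(row)
    row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
  end
end

row = {brand: "Good", model: "Assuredly 2", size: "205/55R16", sku: "123456789", price: 50.0}

AddDescription.call(row.dup)[:description] # => "Good Assuredly 2 205/55R16 123456789"
ETLTask.steve_it(row.dup)[:model]          # => "Assuredly 2!!!"
```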
What the proc?
Finally, a quick note on lambdas. I think lambdas are cool. Data processing pipelines are well-served by functional programming. In fact, the class-based transform methods are simply an OO wrapper around functions, even when they carry some state. They receive a row, they return a row. Boom. You’re not passing messages to them other than their primary method, `#process`. In recent versions of Ruby, you can compose functions with `<<` and `>>`, which is cool, but also perhaps esoteric and unidiomatic. Behold:
```ruby
class ETLTask
  def job
    Kiba.parse do
      source Input
      transform(&FormatStringValues)
      destination Writer
    end
  end

  # More likely these moves live in a StringProcessing module!

  Strip = ->(val) { val.is_a?(String) ? val.strip : val }

  NormalizeSpaces = ->(val) do
    return val unless val.is_a?(String)
    val.gsub(/[[:space:]]+/, ' ')
  end

  BadTitleCase = ->(val) do
    return val unless val.is_a?(String)
    val.split.each do |word|
      word[0] = word[0].upcase
    end.join(' ')
  end

  # This composes each of these smaller functions together, ready for use.
  # Notably, I like that it protects against not-String values.
  #
  FormatString = BadTitleCase << Strip << NormalizeSpaces

  # And finally, create a function to apply it to all of the String values in a row.
  FormatStringValues = ->(row) { row.transform_values(&FormatString) }
end
```
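One caution with `<<` and `>>`: they compose in opposite directions, and it’s easy to get them backwards. A quick sanity check:

```ruby
# `f << g` runs g first, then f; `f >> g` runs f first, then g.
double = ->(x) { x * 2 }
inc    = ->(x) { x + 1 }

(double << inc).call(3) # => 8, i.e. double(inc(3))
(double >> inc).call(3) # => 7, i.e. inc(double(3))
```

So `BadTitleCase << Strip << NormalizeSpaces` above runs right-to-left: spaces are normalized first, then the result is stripped, then title-cased.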
People who do functional programming in Ruby do weird things. I’m not used to it, but I can usually follow it. For example, here’s some code that I think is pretty dang cool. I suspect it violates most people’s that’s-too-clever-ometer. I came up with it while over-engineering tonight’s convoluted example:
```ruby
module SomeFunctions
  # Create a new guard-clause function that only operates on a given type.
  # If the type matches, call the function. If it does not, just return the original value.
  #
  Guard = ->(type, function) { ->(val) { val.is_a?(type) ? function[val] : val } }.curry

  # Only operate on Strings. If it's something else, let it be.
  # There are many ways to trigger `#call` in Ruby. `[]` and `.()` are some.
  StringGuard = Guard[String]
  NumberGuard = Guard[Numeric]

  # The classic String#strip, including a guard to only operate on String things.
  Strip = StringGuard[->(val) { val.strip }]

  # The quite useful "I don't care how many spaces you have. Fix it."
  # I think Rails adds this as `#squish`.
  NormalizeSpaces = StringGuard[->(val) { val.gsub(/[[:space:]]+/, ' ') }]

  # Glue them together!
  # Okay, I know. This isn't that impressive. Just use `"str".strip.gsub(...)`...
  #
  # I'm trying to show something I find neat, and haven't explored it enough
  # to rip out a better example from nowhere!
  #
  FormatString = Strip << NormalizeSpaces
end
```
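To see the curried guard pay off in use, here are the relevant pieces again as a stand-alone snippet, exercised against both a messy string and a non-String value:

```ruby
# Stand-alone copy of the curried Guard and the two guarded string functions.
Guard = ->(type, function) { ->(val) { val.is_a?(type) ? function[val] : val } }.curry
StringGuard = Guard[String]

Strip           = StringGuard[->(val) { val.strip }]
NormalizeSpaces = StringGuard[->(val) { val.gsub(/[[:space:]]+/, ' ') }]
FormatString    = Strip << NormalizeSpaces

FormatString.call("  hello   world ") # => "hello world"
FormatString.call(42)                 # => 42 -- non-Strings pass through untouched
```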
This all can of course be expressed in method definitions more idiomatically. But you know what? It’s interesting to explore.
Conclusion
`&proc` is cool and useful beyond just `map(&:ping)` things. We can use it to pass references to a Kiba `transform`, which lets us reference “arbitrary” blocks of code outside of the pipeline, such as to test them.