Better Testing and Organization for 1-off Block Transforms in Kiba
Kiba provides flexibility when defining transformations for your data. Common, reusable, configurable data transformations can be set up as Class Transforms. They are isolated from your pipeline definition, and as such, easily testable. Your pipeline code tends to be easier to read, as it just describes what it’s going to do, leaving the implementation behind the scenes.
Sometimes, you need something that is extremely specific to the pipeline, and would be harder to reason about in a separate class or file. Or maybe you are prototyping, and not really sure what things look like yet. Or, let’s be honest, maybe this is just a one-and-done throwaway script. For this, you can use Block Transforms. Block Transforms are very convenient, but can leave your pipeline definition hard to read. They also are harder to test, since they are inlined into the pipeline definition.
Tonight I came up with a strategy that makes those 1-off block transforms more testable, and the pipeline as a whole easier to read, while still not affording them the pomp and circumstance of their own file. In addition to Class Transforms and Block Transforms, by leveraging `&` (which calls `#to_proc`), you can also define transformations with lambdas or method references.
I’ll take a moment to caution here: I haven’t actually used this yet outside of prototyping and exploring for this post. I wanted to get it out on paper to force myself to think about it more. Judge for yourself whether it’s helpful; I’m biased by the excitement of discovery right now.
First, let’s set up some context.
The example pipeline:
```ruby
require 'kiba'

class Input
  def each
    [
      {brand: "Good", model: "Assuredly 2", size: "205/55R16", sku: "123456789", price: 50.00},
      {brand: "BFF Wowrich", model: "Advant", size: "205/55R16", sku: "123456", price: 50.00},
      {brand: "Cooped Upper", model: "CSS Ultra Style", size: "205/55R16", sku: "900012345", price: 50.00},
    ].each { |row| yield row }
  end
end

class Writer
  def write(row)
    puts row
  end
end

class ETLTask
  def job
    Kiba.parse do
      source Input
      # ???
      destination Writer
    end
  end
end
```
Class Transforms
Kiba can use any class that defines a `#process(row)` method. The catch is that you must pass a reference to the class itself, so it can be instantiated in the Kiba context.
```ruby
# A Kiba transform to make the row's values LOUD (UPPERCASE).
#
class MakeLoud
  def process(row)
    row.transform_values { |val| val.is_a?(String) ? val.upcase : val }
  end
end
```
When you use this in a Kiba job, it looks like this:
```ruby
source Input
transform MakeLoud
destination Writer
```
Block Transforms
We can also pass a block to `transform`. This allows for easy, informal transformations. It looks like this:
```ruby
source Input

# Be heard with many !s!!!
#
transform do |row|
  row.transform_values { |val| "#{val}!!!" }
end

destination Writer
```
This transform is coded right in the middle of the job definition. It’s really convenient when prototyping, and it just feels right for certain types of transformations that wouldn’t be shared anywhere else. The downsides: there’s practically no way to test it, and these transforms often come with contextual documentation (who requested it, and why) that can make the main pipeline harder to follow.
The Guts
Internally, if you use a Class Transform, Kiba initializes the class on your behalf. I’m not sure why this design decision was made. Perhaps it helps insulate thread safety? I’ve thought about adding a check along the lines of `args.first.is_a?(Class)` and using the existing `AliasingProc` for other types of objects. `AliasingProc` creates a proxy from `#process` to `#call`, which is what enables the `&block` transform. It could also open other doors to functional Ruby, where `#call` is frequently leveraged.
Without hacking the guts, our only options supported by the API are to pass something that responds to `.new`, or a block.
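To make that proxy concrete, here is a rough sketch of the `AliasingProc` idea. This is my simplified reading of it, not Kiba’s exact code, so check the source before relying on it: subclass `Proc` and alias `#process` to `#call`, so a plain block can stand in anywhere the runner expects an object with a `#process` method.

```ruby
# Sketch of the AliasingProc idea (simplified; see Kiba's source for the real thing).
# A Proc subclass where #process is just another name for #call.
class AliasingProc < Proc
  alias_method :process, :call
end

# Wrap a block, then use it like any other transform object.
shout = AliasingProc.new do |row|
  row.transform_values { |val| val.is_a?(String) ? val.upcase : val }
end

shout.process({brand: "Good", price: 50.0}) # => {brand: "GOOD", price: 50.0}
```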
On Organization
In my ETL projects, I seem to have a mix of three types of pipeline objects.
- Everyday workers
- Specific to a project, but abstract or configurable
- Specific to a script, not really reusable
Everyday Workers
These are the classes I find almost universally useful: things like CSV and JL readers and writers. I tend to copy-paste them from project to project, as I like having the source code available to review and tweak if needed. I’ve crafted these to my own liking with various configuration options, but generally they would be suitable for any Kiba project.
Specific to a Project, Still Abstract or Configurable
Projects often need a little repeatable special sauce. Maybe it’s a source for a specific file type, like `products.xlsx`, or a specialized data lookup with thoughtful caching. When I find myself writing a transformation that feels complicated, or requires persisting some state, I like to write these out as classes as well.
I make these configurable within reason. Custom input/output keys, injecting datasets, etc. Because they are full-fledged, isolated classes, they are next-of-kin to the Everyday Workers.
Specific to a Single Job
And finally, you get those data munging needs that are only used in a specific context. “Append 00 to the SKUs”, “Calculate some value”, “Do this one weird trick to make $50,000/week”, etc. These often feel great as an in-definition block transform.
The problem is, you end up with code that looks like this:
```ruby
class ETLTask
  def job
    Kiba.parse do
      source Input

      transform MakeLoud

      ## Be heard with many !s!!!
      #
      # Steve likes when we are excited.
      # For our sake and future reference, we'll want
      # to make note of some times he demonstrated his excitement.
      #
      # 2021-01-01 08:15:31 <Steve> WOW! THIS IS COOL!!!
      # 2021-01-01 08:15:32 <Steve> This is it!!!
      # 2021-01-01 08:17:13 <Steve> WHY IS THIS SO BROKEN?!!!
      # 2021-01-01 08:19:41 <Steve> I CAN'T EVEN!!!
      #
      # Summer Sun makes clear,
      # This is totally silly.
      # Please don't grant it Class.
      #
      # -- Tim's First Haiku
      #
      transform do |row|
        row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
      end

      ## Add Description
      #
      # Percy requested a column that will help make the pivot table easier to work with.
      #
      # This might have a fair amount of context that
      # is important to document, but the move is just not doing enough
      # to feel like it warrants its own class.
      # (Disagreers gunna disagree.)
      #
      # Sure, we could make a `ValueConcatenator`...
      # But good grief.
      #
      transform do |row|
        row[:description] = row.values_at(:brand, :model, :size, :sku).join(' ')
        row
      end

      destination Writer
    end
  end
end
```
Following the pipeline gets a lot harder with all the extra fluff. Assuming we feel very strongly against granting these moves grandeur with their own files and classes, we have two (or more) strategies we could consider here.
Two Moves (of many)
Nested Classes
```ruby
class ETLTask
  def job
    Kiba.parse do
      source Input
      transform MakeLoud
      transform SteveIt

      # Set up description, persuaded by Percy's Pivot.
      transform ValueConcatenator,
        output_key: :description,
        input_keys: [:brand, :model, :size, :sku]

      destination Writer
    end
  end

  ## Be heard with many !s!!!
  #
  # Steve likes when we are excited.
  # For our sake and future reference, we'll want
  # to make note of some times he demonstrated his excitement.
  #
  # 2021-01-01 08:15:31 <Steve> WOW! THIS IS COOL!!!
  # 2021-01-01 08:15:32 <Steve> This is it!!!
  # 2021-01-01 08:17:13 <Steve> WHY IS THIS SO BROKEN?!!!
  # 2021-01-01 08:19:41 <Steve> I CAN'T EVEN!!!
  #
  # Nested class down here.
  # So much fluff, like Winter Snow.
  # I'm still not a fan
  #
  # -- Tim's Second Haiku
  #
  class SteveIt
    def process(row)
      row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
    end
  end

  ## An abstract value concatenator
  #
  # Well, lookie here. It turns out, this is actually a winner!
  # After making the extra effort to extract this out of a block transform,
  # it actually feels like a generically reusable piece of code.
  #
  # At this point, I'd be happy to extract this one to its own file.
  #
  class ValueConcatenator
    # The list of keys to concatenate
    attr_reader :input_keys

    # The key to write the new value to
    attr_reader :output_key

    def initialize(input_keys:, output_key:)
      @input_keys = input_keys
      @output_key = output_key
    end

    def process(row)
      row[output_key] = row.values_at(*input_keys).compact.map(&:to_s).join(' ')
      row
    end
  end
end
```
My thoughts:

- We’ve kept the transformations in the same class context, near the usage. I do not find `SteveIt` interesting enough to extract to a separate file. It creates too much context switching for such a “One Weird Trick”.
- We actually found a decent abstraction in the `ValueConcatenator`. That one oughta go in its own file. I think it’s a winner. Sometimes this process is worthwhile!
- The individual moves can now be easily tested, as they are no longer in the middle of a Kiba job context.
- The Job is a bit easier to read, and importantly, easier to reorganize.
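To back up the testability claim, here is what a test could look like, using minitest from the standard library. The classes are duplicated inline so the snippet stands alone:

```ruby
require "minitest/autorun"

# Stand-alone copies of the nested classes above, so this file runs on its own.
class SteveIt
  def process(row)
    row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
  end
end

class ValueConcatenator
  attr_reader :input_keys, :output_key

  def initialize(input_keys:, output_key:)
    @input_keys = input_keys
    @output_key = output_key
  end

  def process(row)
    row[output_key] = row.values_at(*input_keys).compact.map(&:to_s).join(' ')
    row
  end
end

class TransformTest < Minitest::Test
  def test_steve_it_only_shouts_strings
    row = SteveIt.new.process({model: "Advant", price: 50.0})
    assert_equal({model: "Advant!!!", price: 50.0}, row)
  end

  def test_value_concatenator_builds_description
    transform = ValueConcatenator.new(input_keys: [:brand, :model], output_key: :description)
    row = transform.process({brand: "Good", model: "Assuredly 2"})
    assert_equal "Good Assuredly 2", row[:description]
  end
end
```

No Kiba context required; each transform is just an object you can instantiate and poke at.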
&proc (to_proc)
And here I have arrived tonight. Realizing that `&` can cast any `Proc` (or anything that implements `#to_proc`) into a block, you can set up `transform` with any proc-able object, like method references and lambdas.
```ruby
class ETLTask
  def job
    Kiba.parse do
      source Input
      transform MakeLoud

      # Get buy-in from Steve:
      transform(&ETLTask.method(:steve_it))

      # Add Description for Percy.
      # (The & calls #to_proc and hands the result to transform as its block.)
      transform(&AddDescription)

      destination Writer
    end
  end

  ## Add Description (via Lambda (Or Proc.new, Or whatever floats you.))
  #
  # Requested by Percy in Accounting
  #
  # Add a single column that will make that pivot table easier to work with.
  #
  # This might have a fair amount of context that
  # is important to document, but it's just not doing enough to feel like
  # it warrants its own class.
  #
  # This actually feels kinda dope.
  #
  AddDescription = ->(row) do
    row[:description] = row.values_at(:brand, :model, :size, :sku).join(' ')
    row
  end

  ## Be heard with many !s!!!
  #
  # Steve likes when we are excited.
  # For our sake and future reference, we'll want
  # to make note of some times he demonstrated his excitement.
  #
  # 2021-01-01 08:15:31 <Steve> WOW! THIS IS COOL!!!
  # 2021-01-01 08:15:32 <Steve> This is it!!!
  # 2021-01-01 08:17:13 <Steve> WHY IS THIS SO BROKEN?!!!
  # 2021-01-01 08:19:41 <Steve> I CAN'T EVEN!!!
  #
  # At the same indent.
  # As if it were on the team.
  # I can live with this.
  #
  # -- Tim's Third Haiku
  #
  def self.steve_it(row)
    row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
  end
end
```
Thoughts:

- I feel a sense of cohesiveness within the ETL task and some of its quirky transformations. It allows the quirks to get codified without ceremony, but manages to create isolated seams for testability. Nothing gets separated out to its own weird desert-class unless I feel it’s warranted.
- Subjectively, it looks pleasing! The Job is even easier to read, and even easier to organize.
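Those testability seams work the same way here: because the transforms are now plain callables, a test can exercise them directly, with no Kiba in sight. (Stand-alone copies of the pieces above.)

```ruby
# The lambda and class-method transforms, copied so this snippet runs alone.
AddDescription = ->(row) do
  row[:description] = row.values_at(:brand, :model, :size, :sku).join(' ')
  row
end

class ETLTask
  def self.steve_it(row)
    row.transform_values { |val| val.is_a?(String) ? "#{val}!!!" : val }
  end
end

row = {brand: "Good", model: "Assuredly 2", size: "205/55R16", sku: "123456789", price: 50.0}

AddDescription.call(row.dup)[:description] # => "Good Assuredly 2 205/55R16 123456789"
ETLTask.steve_it(row.dup)[:model]          # => "Assuredly 2!!!"
```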
What the proc?
Finally, a quick note on lambdas. I think lambdas are cool. Data processing pipelines are well-served by functional programming. In fact, the class-based transform methods are simply an OO wrapper around functions, even when they carry some state. They receive a row, they return a row. Boom. You’re not passing messages to them other than their primary method, `#process`. In recent versions of Ruby, you can compose functions with `<<` and `>>`, which is cool, but also perhaps esoteric and unidiomatic. Behold:
```ruby
class ETLTask
  def job
    Kiba.parse do
      source Input
      transform(&FormatStringValues)
      destination Writer
    end
  end

  # More likely these moves live in a StringProcessing module!

  Strip = ->(val) { val.is_a?(String) ? val.strip : val }

  NormalizeSpaces = ->(val) do
    return val unless val.is_a?(String)
    val.gsub(/[[:space:]]+/, ' ')
  end

  BadTitleCase = ->(val) do
    return val unless val.is_a?(String)
    val.split.each do |word|
      word[0] = word[0].upcase
    end.join(' ')
  end

  # This composes each of these smaller functions together, ready for use.
  # Notably, I like that it protects against not-String values.
  #
  FormatString = BadTitleCase << Strip << NormalizeSpaces

  # And finally, create a function to apply it to all of the String values in a row.
  FormatStringValues = ->(row) { row.transform_values(&FormatString) }
end
```
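One caution with `<<` and `>>`: they compose in opposite directions, and it’s easy to get them backwards. A quick sanity check:

```ruby
# `f << g` runs g first, then f; `f >> g` runs f first, then g.
double = ->(x) { x * 2 }
inc    = ->(x) { x + 1 }

(double << inc).call(3) # => 8, i.e. double(inc(3))
(double >> inc).call(3) # => 7, i.e. inc(double(3))
```

So `BadTitleCase << Strip << NormalizeSpaces` above runs right-to-left: spaces are normalized first, then the result is stripped, then title-cased.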
People who do functional programming in Ruby do weird things. I’m not used to it, but I can usually follow it. For example, here’s some code that I think is pretty dang cool. I suspect it violates most people’s that’s-too-clever-ometer. I came up with it while over-engineering tonight’s convoluted example:
```ruby
module SomeFunctions
  # Create a new guard-clause function that only operates on a given type.
  # If the type matches, call the function. If it does not, just return the original value.
  #
  Guard = ->(type, function) { ->(val) { val.is_a?(type) ? function[val] : val } }.curry

  # Only operate on Strings. If it's something else, let it be.
  # There are many ways to trigger `#call` in Ruby. `[]` and `.()` are some.
  StringGuard = Guard[String]
  NumberGuard = Guard[Numeric]

  # The classic String#strip, including a guard to only operate on String things.
  Strip = StringGuard[->(val) { val.strip }]

  # The quite useful "I don't care how many spaces you have. Fix it."
  # I think Rails adds this as `#squish`.
  NormalizeSpaces = StringGuard[->(val) { val.gsub(/[[:space:]]+/, ' ') }]

  # Glue them together!
  # Okay, I know. This isn't that impressive. Just use `"str".strip.gsub(...)`...
  #
  # I'm trying to show something I find neat, and haven't explored it enough
  # to rip out a better example from nowhere!
  #
  FormatString = Strip << NormalizeSpaces
end
```
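To see the curried guard pay off in use, here are the relevant pieces again as a stand-alone snippet, exercised against both a messy string and a non-String value:

```ruby
# Stand-alone copy of the curried Guard and the two guarded string functions.
Guard = ->(type, function) { ->(val) { val.is_a?(type) ? function[val] : val } }.curry
StringGuard = Guard[String]

Strip           = StringGuard[->(val) { val.strip }]
NormalizeSpaces = StringGuard[->(val) { val.gsub(/[[:space:]]+/, ' ') }]
FormatString    = Strip << NormalizeSpaces

FormatString.call("  hello   world ") # => "hello world"
FormatString.call(42)                 # => 42 -- non-Strings pass through untouched
```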
This all can of course be expressed in method definitions more idiomatically. But you know what? It’s interesting to explore.
Conclusion
`&proc` is cool and useful beyond just `map(&:ping)` things. We can use it to pass references to a Kiba `transform`, which lets us reference “arbitrary” blocks of code outside of the pipeline, such as to test them.