The problem

Recently I was on a team responsible for migrating data from a legacy application into an entirely new schema. Users of this application would add up to tens of thousands of pieces of data for a single company, which together built up a custom 'configuration'. In the legacy application the tables were very 'wide' — some with well over 100 columns — and the new schema we were moving to was much more normalized, resulting in deep tables and many more associations. In addition, we were converting some pieces of data to new types: numbers that had been stored as strings became integers, some strings were normalized and became foreign keys, and some HTML lists that had been stored simply as text were now broken up into multiple rows in a table.

As we built out the classes that did the conversion and importing of this data, we had lots of unit and integration tests to make sure that the data was going where it was supposed to and that the values were what we'd expect them to be. However, these tests did not feel like enough. Considering how important this data was, and just how much of it we had, we couldn't honestly say that the conversion and import tasks were 'done' until we were confident that every piece of data had been successfully migrated over to its new location (and sometimes even a new format).

Fortunately, one of the devs we were working with realized that we had a source of truth to test against: our application published a massive JSON document for each configuration, and given the same set of inputs, the new app should publish the exact same JSON as the legacy app. He whipped up a CompareVersions class that could take the two JSON structures, recursively walk down them, and then log any differences. It started us off on the right path, but like any useful piece of software, we quickly realized that we wanted to tweak a few things to make it even better.
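
The heart of that recursive walk was something along these lines (a rough sketch rather than his actual code; the difference message here just mirrors the log format we were seeing):

class CompareVersions
  def initialize(new_data, legacy_data)
    @new_data    = new_data
    @legacy_data = legacy_data
    @differences = []
  end

  # Walk the legacy structure and follow the same path through the new
  # structure, recording any leaf values that don't match.
  def call(legacy_node = @legacy_data, new_node = @new_data, path = [])
    case legacy_node
    when Hash
      legacy_node.each do |key, legacy_value|
        new_value = new_node.is_a?(Hash) ? new_node[key] : nil
        call(legacy_value, new_value, path + [key])
      end
    when Array
      legacy_node.each_with_index do |legacy_value, index|
        new_value = new_node.is_a?(Array) ? new_node[index] : nil
        call(legacy_value, new_value, path + [index])
      end
    else
      unless legacy_node == new_node
        @differences << "Error in data field '#{path.join('.')}', old value: #{legacy_node} - new value: #{new_node}"
      end
    end
    @differences
  end
end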

The first thing we wanted was to beef up the output of the differences so we could quickly see what types of values we were getting. For example, we had a lot of error messages that looked like one of the following:

Error in data field 'widget', old value: 1 - new value: 1
Error in data field 'cool_field', old value: - new value: 1

In the first case, one of the values had ended up as a string and the other as an integer, but the output didn't tell us which was which. In the second case we didn't know if the old value was nil or an empty string. One option for beefing up the output was to update the comparison to start doing type checks and nil checks (in addition to the equality checks), but forking the comparison logic would have just added more complexity to what had already become a hairy algorithm, since the JSON structures weren't exactly the same. Fortunately, as any Rubyist has seen, there are already great tools that provide well-formatted output when comparing two values: our unit test frameworks. Mulling over our options, I quickly realized that what we really wanted was RSpec-like output that read:

it converts attribute: 'max_amount'
   expected: 1
        got: '1'

or:

it converts attribute: 'max_amount'
   expected: 1
        got: nil

RSpec to the rescue

So instead of reinventing the wheel, we realized we could require rspec-expectations in our CompareVersions class, include RSpec::Matchers, and then use the standard RSpec expect(x).to eq('something') syntax inside our own method:

require 'rspec/expectations'

class CompareVersions
  include RSpec::Matchers

  def initialize(new_data, legacy_data)
    @new_data = new_data
    @legacy_data = legacy_data
  end

  def call
    @legacy_data.each do |key, value|
      # ... some code to extract values from the new and old data ...
      expect(new_value).to eq(legacy_value)
    end
  end
end
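
In use, the intent was roughly this (the file names are made up for illustration; in reality the two documents came from the legacy and new applications):

require 'json'

legacy_json = JSON.parse(File.read('legacy_configuration.json'))
new_json    = JSON.parse(File.read('new_configuration.json'))

CompareVersions.new(new_json, legacy_json).call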

This solution worked great… right up until the first failed expectation, at which point execution of our CompareVersions#call exited with the following:

.../rspec/expectations/fail_with.rb:30:in `fail_with':  (RSpec::Expectations::ExpectationNotMetError)
expected: 2
     got: 1
(compared using ==)
    from .../rspec/expectations/handler.rb:35:in `handle_failure'
    from .../rspec/expectations/handler.rb:48:in `handle_matcher'
    from .../rspec/expectations/expectation_target.rb:54:in `to'
    from compare_versions.rb:8:in `block in call'

This resulted in a nice big face-palm — of course it raised an error, because all that happens when an RSpec expectation fails is that an RSpec::Expectations::ExpectationNotMetError is raised (normally it's the test framework that rescues it and reports the failure). So, in order for our CompareVersions to be truly useful, I needed to collect all the failed comparisons so they could be returned later.

rescue to the rescue

Collecting the errors in the example I've started above is relatively easy: I can initialize each instance of the CompareVersions class with an empty @errors hash, and then specifically rescue RSpec::Expectations::ExpectationNotMetError. (We used a hash for our errors so that we could use the JSON key, which corresponded to a field, as the key.) Our call method now looks like this:

  def call
    @legacy_data.each do |key, value|
      begin
        # ... some code to extract values from the new and old data ...
        expect(new_value).to eq(legacy_value)
      rescue RSpec::Expectations::ExpectationNotMetError => e
        @errors[key] = e
      end
    end
  end
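
Putting the pieces together, the class now looked roughly like the following. (The attr_reader and the simple hash lookup for the new value are simplifications for this sketch; the real extraction logic was more involved.)

require 'rspec/expectations'

class CompareVersions
  include RSpec::Matchers

  attr_reader :errors

  def initialize(new_data, legacy_data)
    @new_data    = new_data
    @legacy_data = legacy_data
    @errors      = {} # field name => ExpectationNotMetError
  end

  def call
    @legacy_data.each do |key, legacy_value|
      begin
        # Simplified here; the real extraction of the new value was more involved.
        new_value = @new_data[key]
        expect(new_value).to eq(legacy_value)
      rescue RSpec::Expectations::ExpectationNotMetError => e
        @errors[key] = e
      end
    end
    @errors
  end
end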

Now we could create an instance of the CompareVersions class, call it, and iterate over any errors that were found. At first we just ran one instance at a time against some particularly hairy legacy data sets, but as we refined our migrations we were able to instantiate more and more "comparers" — making sure that we covered every case necessary. Ultimately, only when the comparison for every single customer came back without any errors could we move our 'migration' task to the done pile.
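
A reporting loop over those errors needs to be not much more than this (a sketch; the variable names and output format are illustrative), with RSpec's own error messages supplying the nicely formatted expected/got output for every mismatched field:

comparer = CompareVersions.new(new_json, legacy_json)
comparer.call

comparer.errors.each do |field, error|
  puts "it converts attribute: '#{field}'"
  puts error.message # RSpec's own "expected: ... got: ..." output
end

puts 'All data migrated!' if comparer.errors.empty?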