Let’s set the stage: my team messed up a data migration (also known as a backfill), and now some data was missing 😰. The backfill aimed to set a new timestamp column (cancelled_at) based on an existing timestamp (updated_at) for specific records that no longer met our criteria for being active. Given how frequently the updated_at timestamp changes, it was not our first choice for sourcing this data. However, we were low on choices, and since this was purely a data analytics and reporting requirement, we were comfortable using the updated_at column.


Upon running what seemed like a relatively simple backfill in production, surprise surprise, we encountered an error after processing a decent chunk of records! Without thinking too much about the situation once we saw the error, we leapt into fixer mode: we fixed the error and restarted the entire data migration. Because setting the cancelled_at column also updates the updated_at timestamp, the restart gave the records processed in the original run an incorrect cancelled_at timestamp. Upon QA-ing the data, we noticed something was wrong, as over 98% of the cancelled_at timestamps were set to the day we ran the backfill. After scratching our heads for a little bit, we realized what had happened. Thankfully, we work at a place like Fullscript, where we are given a chance to fix our mistakes and fail forward.

We spent some time discussing whether we could get that data another way without having to go to a backup, as most of us had felt the pain of getting data from backup. After spending some time solutionizing and discussing with our data team, we concluded that going to a backup was the only way we could get close to the level of accuracy we wanted for the cancelled_at column. We also knew we couldn’t simply restore the table to its original state, because it was actively in use; a straight restore would’ve meant losing newer customer data.


In a former job, and what seems like a former life, I was part of the incident response/management team, and I’ve done the usual slog of selectively getting data from a backup (sometimes from multiple backups 😅) to piece together the information I wanted. Typically, I’d do the selection at the SQL level, then dump the data I wanted into some sort of JSON map or CSV that I could later access from a production server where the fix script or rake task would run. Depending on the size of the table(s) I was going through, the different states of the data I had to account for, and the severity of the problem, validating my SQL selection could take anywhere from a few days to a few weeks.

I opted to try something new to see if we could find another way to restore from a backup that wouldn’t take weeks to perform and validate. After spending some time on the QA piece with data from a recent production database snapshot, our solution allowed us to get the data we needed back in less than a week from the creation of the script.

One of our cheat codes as engineers at Fullscript is our stellar DevOps team that helps make things like going to a backup a lot easier. That said, this process can be done even without having an amazing DevOps team (they just make it MUCH easier)!

The first step is to dump your SQL backup(s) into a MySQL instance for querying. Thanks to our DevOps team, we can create a MySQL instance and get credentials for it with one command 🥇. Once you have a working MySQL instance, make sure it’s accessible across your environments: you’ll want to test locally or in some kind of review environment before doing any of this against production.
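
A quick way to sanity-check that the restored instance is reachable is to hit it directly before wiring anything into Rails. Here’s a minimal sketch assuming the mysql2 gem; the credentials and the original_models table name are placeholders for your own backup instance and table:

require "mysql2"

# Placeholder credentials for the restored backup instance
client = Mysql2::Client.new(
  host: "backup-instance-host",
  port: 3306,
  username: "username",
  password: "password",
  database: "database_name"
)

# A cheap query to confirm the backup actually contains the table we care about
row_count = client.query("SELECT COUNT(*) AS count FROM original_models").first["count"]
puts "Backup reachable; original_models has #{row_count} rows"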

The next step is to start crafting your rake task or script. To do this, you want to define an archive model for the table you want to selectively restore from (or multiple archive models if you want to get data from multiple backups). Here’s an example of how I did mine:

class ArchiveModel < OriginalModel
  CONNECTION_URL = "mysql2://username:password@host_url:port/database_name"
  establish_connection CONNECTION_URL
end

The archive model is defined as a subclass of the original model to ensure that we have access to the same methods and validations the original model has. Where this class deviates from the original model is simply which database it establishes a connection with. There are two ways to manually choose which database your model connects to:

  • establish_connection allows us to specify the database URL to which we want to connect.
  • connected_to allows you to do something similar, but its main benefit is that you won’t need to define a separate model class. You wrap your calls to the archive database in a block, and connected_to ensures you’re connected to the right database.

If your database.yml in production can be temporarily modified, using a connected_to block (with connects_to) instead of an archive model might be a better approach for you; a rough sketch of that follows below. For this solution, I wasn’t able to vet how a connected_to block would work, since our app isn’t configured that way, which is why I decided to go with the establish_connection approach.
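
For completeness, here’s roughly what that alternative could look like. This is only a sketch I haven’t vetted against our app: it assumes Rails 6.1+ with the multiple-database layout in database.yml (a primary entry plus a temporary archive entry pointing at the backup instance):

# Temporary addition to config/database.yml (values are placeholders):
#   archive:
#     <<: *default
#     host: backup-instance-host
#     database: database_name

# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, archive: :archive }
end

# Then, wherever you need the backup data (record_id is a placeholder):
ApplicationRecord.connected_to(role: :archive) do
  OriginalModel.find_by(id: record_id)&.updated_at
end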

Another decision point you might encounter is where your archive class temporarily lives. You can choose to define this as a separate model in your models folder, or go a more contained route and define it as a class within your rake task, since this is (hopefully) a one-off task. The latter is the option I chose.
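
If you also go the contained route, the rake task skeleton can be as small as the following. The file path, task name, and ARCHIVE_DATABASE_URL environment variable are purely illustrative:

# lib/tasks/selective_restore.rake
namespace :selective_restore do
  desc "Selectively restore cancelled_at timestamps from a backup snapshot"
  task cancelled_at: :environment do
    # Defining the archive class inside the task keeps it out of app/models,
    # since this is (hopefully) a one-off task
    class ArchiveModel < OriginalModel
      establish_connection ENV.fetch("ARCHIVE_DATABASE_URL")
    end

    # ... selective restore logic (shown below) goes here ...
  end
end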

After you’ve figured out how to access your backup(s) from your Rails app, the next step is where the fun begins: defining your selective restore behaviour.

Once the setup was defined, querying and restoring the data we needed became easy-peasy lemon-squeezy. Here are the relevant parts of the rake task after the archive class definition:

ActiveRecord::Base.transaction do
  # date_range covers when the original backfill went awry
  selected_subscription_variants = OriginalModel.where(active: false, cancelled_at: date_range)
  selected_subscription_variants.find_each do |variant|
    original_updated_at = ArchiveModel.find_by(id: variant.id)&.updated_at
    next if original_updated_at.blank? || original_updated_at == variant.updated_at

    variant.update_column(:cancelled_at, original_updated_at)
  end
end

We can query our backup like a typical Active Record model and get the information we need to update the field(s) we want. In our case, we opted to use update_column because we wanted to skip validations and preserve the newer updated_at values on the records we’d be updating, in case something went wrong again. However, you can update your database in any way you see fit.

And that’s it! If you’ve done your due diligence, your script runs without failing, and you’ve validated that your data is now restored. 🎉


A few final notes about this kind of work to keep in mind if you ever have to do something like this:

  • Doing a data QA before running on production — Making sure there are no unexplained anomalies will save you a lot of time verifying that your fix worked. It also prepares you to answer any questions your team might have about said anomalies. If you have a data team, lean on them for things like this! Thankfully, we also have an amazing data/reporting team here at Fullscript!
  • Idempotency — When dealing with a ton of customer-entered data, there’s no telling what state your backup or production data is in. Making sure your rake task can be re-run after fixing an error will save you time and possibly a headache. Figuring out how to fix the fix from the last fix you made about the other fix you did is not fun.
  • Transactions! — Transactions ensure you’re not leaving your table(s) in an inconsistent state that may have far-reaching side effects. Just being able to say “it’ll continue from where it left off, and since that record failed, all the associated updates are rolled back” is something that’s also going to save you time and further headaches 😅.
  • Doing a dry run with a recent production backup — This point is super important: you do not want your data team or yourself doing QA in the live production database. That route will lead to questions such as “did our fix work as expected?”, “can we explain anomalies we’re seeing?”, all while customers are currently mutating or using the data you’re attempting to analyze.
  • Use find_each — If you have a lot of records (typically over 100K), you’ll want to consider using find_each instead of find or where alone. This fetches records in batches and can prevent timeouts or degraded app performance compared to loading all the records at once (there’s a short sketch right after this list).
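
To make the idempotency and find_each bullets concrete, here’s a condensed version of the earlier loop with an explicit batch size; the 10,000 figure is arbitrary and just for illustration:

OriginalModel.where(active: false, cancelled_at: date_range).find_each(batch_size: 10_000) do |variant|
  original_updated_at = ArchiveModel.find_by(id: variant.id)&.updated_at

  # Skipping records we can't fix (or that already look correct) is what keeps re-runs safe
  next if original_updated_at.blank? || original_updated_at == variant.updated_at

  variant.update_column(:cancelled_at, original_updated_at)
end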

Connecting directly to a SQL instance from the application layer eliminated the need to verify our restored data several times, as we would have with the typical approach of first validating the selected data at the SQL level and then again at the application layer. This saved my team and me a lot of work and time.

I had two huge takeaways from this process. The first was understanding the priority of your work: I would never have thought to experiment with how I solved this problem if we hadn’t paused to assess the priority of the work. The second: planning a backfill is a great time to think slowly and do a lot of validation. This won’t always guarantee zero errors when D-day comes, but it can help you better prepare for any anomalies. This kind of work can be stressful, time-consuming, and sometimes overwhelming, so be sure to have a good group of folks to rubber duck with, who you can also have a little fun with to lighten the mood! Happy backfilling!


Special Shout-outs!

@Anthony Jones — My DevOps point of contact! He was super helpful and insightful when I had random questions, considerations, and decisions around the approach I was going to take.

@Chantelle Brule — My data reporting point of contact! She helped QA the hell out of the data to make sure we wouldn’t need a fix for our fix haha 😅

@Jonathan Pike — I enjoyed rubber ducking and talking through some challenges together, and thanks for the reviews!