Your database is a distributed system

James Coglan, a Developer at FutureLearn, discusses a trade-off in how Rails handles database migrations, and how you and your team can handle this.

One of the major features that Rails introduced early on was database migrations: you should store changes to your database as part of your source code in version control, so that you can keep your software and the schema it works with in sync, and automatically update your schema when you deploy the related changes.

The workflow is fairly simple: you run rails generate migration to generate a migration file, which will have a name like db/migrate/20151028163218_create_posts.rb. You edit this file to describe the change you want to make:

class CreatePosts < ActiveRecord::Migration
  def change
    create_table :posts do |t|
      t.string :title
      t.text :body
    end
  end
end

Then you run rake db:migrate to apply the change to your local database, and update db/schema.rb:

$ bundle exec rake db:migrate

== 20151028163218 CreatePosts: migrating ======================================
-- create_table(:posts)
   -> 0.0515s
== 20151028163218 CreatePosts: migrated (0.0516s) =============================

You commit your migration file and your updated db/schema.rb and carry on working on your new feature. Usually, the only changes to schema.rb will be those you just made. But sometimes you’ll see changes you didn’t make when you run db:migrate. Maybe another developer you work with forgot to commit schema.rb along with their last migration. Maybe you made a manual change to your database that’s been reflected when Rails generates schema.rb. These problems are simple to fix; you amend things to put your database and code back in sync, commit, push, and the problem’s gone.

In some cases, however, the problem doesn’t go away. You might see a couple of columns swap positions in a table you’ve not changed. No problem, you regenerate the schema, commit it, and forget about it. But later in the day, someone else runs a migration and the columns reverse position again. “Oh,” they’ll say, “someone forgot to commit this,” and so they push a change that flips the columns back. The change flip-flops around the team until somebody notices: “Hey, I’m sure I fixed these columns last week, and this change just came back. What gives?”

Well, what gives is that your team is accidentally operating a distributed system and failing to reach a consensus about what state it should be in. And they’re doing that because of a trade-off in how Rails handles migrations.

How did this happen?

Let’s start simple. Suppose you have a codebase, one developer (Alice), and one database. Alice makes commits, and some of those commits contain migrations, which are applied to the database.

      A         B         C         D [master]
      o---------o---------o---------o
     M1                  M2

    DB{1}               DB{1,2}

A to D identify commits, Mx represent migrations, and DB{x,y} represents the state of the database schema after applying migrations x and y in that order. In this scenario, Alice doesn’t get any surprises: they’re the only developer, and you apply migrations in the same order that you write them, and the schema always ends up in the state you were expecting.

Now let’s add another developer, Bob, to the project. Alice and Bob work on different features, and branch off of master to do their work. Let’s say Alice forks from C and commits E and G, where E contains migration M3. Bob forks from D and commits F and H, where F contains M4. I’ve named the commits for the order they occur in time, but it’s important to remember that Alice and Bob have not merged their branches yet so they can’t see each other’s commits.

       B         C         D [master]
    ---o---------o---------o
               M2 \         \
                   \         \       F         H [bob]
                    \         `------o---------o
                     \              M4
                      \          DB{1,2,4}
                       \
                        \      E         G [alice]
                         `-----o---------o
                              M3
                           DB{1,2,3}

Both Alice and Bob have seen all commits up to C, and so they both start with their own local databases in the same state: they’ve applied the first two migrations to get DB{1,2}. When Alice applies migration M3 to this state, they get the database state DB{1,2,3}. But when Bob applies M4 – also beginning from state DB{1,2}, remember – they get DB{1,2,4}. But while Alice and Bob each work on their respective branches, they’ll be fine; Bob isn’t yet running any of Alice’s code, and so they don’t need migration M3 applied to their database. It’s only when these branches are merged that they’ll encounter problems.

Let’s look at what happens when Alice and Bob merge back into master. Bob merges first, creating commit J, and then Alice merges to create K.

       C         D                            J         K [master]
    ---o---------o----------------------------o---------o
     M2 \         \                          /         /
         \         \       F         H      /         /
          \         `------o---------o-----'         /
           \              M4                        /
            \          DB{1,2,4} - - - - - - - - - / - - > DB{1,2,4,3}
             \                                    /
              \      E         G                 /
               `-----o---------o----------------'
                    M3
                 DB{1,2,3} - - - - - - - - - - - - - - - > DB{1,2,3,4}

When both developers are back on master after deploying their work, they run db:migrate to catch up on any migrations that have been merged in since they forked. For Alice, Rails detects that M4 is missing, and applies it to their database to create state DB{1,2,3,4}. Similarly, Bob is missing M3 and applying it to their database gives DB{1,2,4,3}. Alice and Bob’s databases have diverged.

Not only have Alice and Bob’s databases diverged, the databases of anyone else on their team might have done, too. Consider Claire, who pulls from master just after J is commited. They see one new migration M4 in their tree, and they apply it to get DB{1,2,4}. Later, Claire pulls again and receives M3, and applies it to get DB{1,2,4,3}. Their colleague Dave doesn’t pull from master as frequently, and gets J and K during the same pull. When they run db:migrate, both M3 and M4 are applied to their database and it goes into state DB{1,2,3,4}, because Rails gathers up all pending migrations and sorts them by timestamp before applying them.

Finally, imagine what happens when you deploy. Say you deploy commit J, then deploy commit K in a separate release. Now your production database sees the same changes as Claire, and goes to DB{1,2,4,3}. But if you wait until both J and K have landed, then production is like Dave’s database: DB{1,2,3,4}.

Here comes the sciencey bit

You might be thinking, how is DB{1,2,3,4} different from DB{1,2,4,3}? Don’t both states contain the same changes? You branch and merge all the time, and your database still works, right?

The scenario I just described happens all the time once your team has more than a handful of developers, and usually it doesn’t cause any problems. So how does it relate to the game of schema.rb whack-a-mole from earlier?

In general, database schema changes are not commutative. This means that it matters which order you run them in. For example, I have two commits: one adds an index to the posts.title column, and another removes that same column. Running them in the order I described them is fine, but running them in the opposite order is not: I cannot index a column that no longer exists.

Things that do not commute, that is, things where order matters, are hard to do concurrently without failures. When developers work on a codebase in parallel branches, for example, they might both change the same file and generate a merge conflict: Alice adds some code to a file that Bob has deleted or moved. Git will tell you about merge conflicts so that you can resolve them by hand while rebasing or making a merge commit. It’s not exactly a failure as such, but it’s something the computer can’t automatically figure out without your intervention, and that’s fundamentally because arbitrary changes to a mutable set of files (at least the way Git models them) do not commute.

Just as with code, developers are making concurrent changes to the database on their branches, and sometimes this results in the equivalent of a merge conflict. But, whereas Git will just stop and ask you what to do if it finds two changes it can’t automatically merge, your database and Rails might not. Or at least, not in as helpful a way as Git does. It just goes ahead and executes your migrations, and if they don’t work then the database raises an error and the migrator crashes. If you’re using a good database, the migration will run inside a transaction so that if it crashes part-way through you don’t end up with a half-applied migration to clean up. But either way, you find out some migration didn’t work, and you can go and amend it, or remove it. That’s the good kind of conflict: your code no longer works, it crashes as soon as you try to run it, so you fix it immediately.

The bad kind of conflict is where reordering migrations doesn’t make them wrong, it just makes them non-deterministic. For example, say Alice and Bob had both added a column to the same table. Database columns are ordered, and so changing the application order changes the order of the columns in the table. Depending on which order you accept the migrations in, you’ll see Alice’s new column first, or Bob’s.

This kind of difference isn’t an error, because the order of columns in a relation doesn’t make a semantic difference in SQL. But databases do store column data differently depending on column order, and this can make an observable performance difference on some engines, and so schema representations like mysqldump and schema.rb must preserve the order to produce a faithful copy. Different people apply Alice’s and Bob’s migrations in different orders, their databases all work, but they keep changing the column order and generating diff noise.

A history lesson

Let’s go back in time for a minute, to April 2008. You’re running a cutting-edge web stack, which in those days meant Rails 2.0.2 on Ruby 1.8.6. When you run ./script/generate migration, this happens:

$ ./script/generate migration create_posts
      exists  db/migrate
      create  db/migrate/001_create_posts.rb

Instead of a timestamp on the migration, we have 001. Your next migration will be 002, then 003 and so on. Rails stored a single version number in your database, which it would check using SELECT version FROM schema_info. When you ran rake db:migrate, it ran all the migrations numbered higher than your current database version. Simple.

This works great for Alice as employee number 1 at the Chicago Computer Company, writing migrations and running them in a single chain of commits. Everyone was using Subversion back then and nobody wanted to fork or merge. But then, Bob joins the company and he insists on migrating to this new “GitHub” thing everyone’s talking about. You start having concurrent development branches. Alice and Bob both fork from version 2, and write a migration. They both get migration number 3. When they merge back into master, they run db:migrate and Rails goes, “I see you’re at version 3. There are no pending migrations to run”, because their migrations look like this:

    db/
        migrate/
            001_create_posts.rb
            002_create_comments.rb
            003_alices_changes.rb
            003_bobs_changes.rb

Alice and Bob each fail to apply each other’s migrations, and who knows what happens to the rest of their team.

So, in May 2008 Rails 2.1 is released, and it fixes this problem in two ways. First, migrations will be identified by a timestamp rather than a sequential integer. This means people writing migrations on concurrent branches will (unless they’re extremely unfortunate) get distinct migration IDs. Second, rather than storing a single version number in the database, Rails stores one record for each migration that’s run, so it can tell when you’ve not applied migrations that are older than the latest one. So, creating a migration now looks like this:

$ ./script/generate migration create_posts
      exists  db/migrate
      create  db/migrate/20151028160324_create_posts.rb

And when you run your migrations, Rails executes INSERT INTO schema_migrations (version) VALUES (20151028160324) rather than UPDATE schema_info SET version=1. The old way is like replacing our notation DB{1,2,3} with simply DB{3}. You can’t represent a state where M1 and M3 have been applied but M2 has not.

The trade-off Rails is making

Why stick to numbers at all? Why not just record the names of the migrations that have been applied? This gets us to the heart of the trade-off: by recording exactly which migrations have been applied, Rails can make sure Alice gets Bob’s migrations when they return to master. This part of the solution does not need numbers at all, it just needs a way to spot that some migrations in the codebase have not been applied.

Rails is making an assumption here: that it’s okay to have people write concurrent migrations and apply them all, possibly in different orders. Most of the time, this will be okay: people working in different branches are usually changing different parts of the application. Alice adds a column to posts while Bob creates a users table. The order of these changes does not matter; some database changes commute, and people working independently typically make commutative changes.

But, what about the migrations on a single branch? Migrations that are part of a single work branch typically do have ordering requirements, or at least it’s far more likely that they will than migrations on different branches, simply because they are related. Maybe I need a new table for my feature, and I need to add more columns as I go along, or add foreign keys and indexes that reference the table. It would be pretty terrible if someone merged my changes and Rails decided to add an index before it had created the table!

So, for migrations on a single branch, we still want to know what order to run them in, and so tagging migrations with their creation time seems like a reasonable way to do that, assuming the developer creates the migrations in the order in which they should run.

It’s worth mentioning though that the Rails migrator has no knowledge of Git. It doesn’t model your branches when it applies migrations. It just figures out which haven’t been applied by consulting the db/migrate directory and the schema_migrations table, it sorts those by their ID and applies them. Timestamps are just a convenient ID mechanism that give you distinct IDs and correct ordering most of the time.

That’s what makes it a trade-off: in general, database changes don’t commute and so to guarantee correctness you need to force a total ordering and say exactly in which order the entire set of migrations should run. That is, you should use sequential integers. But that stops people working concurrently, and for most concurrent workflows, you don’t need total ordering. Local ordering is good enough, most of the time. And for cases where it isn’t, there are other solutions. If you find out at stand-up that someone’s working on the same tables as you, you can talk to them and co-ordinate your changes. You can make sure to rebase against master before merging your work, so if there are any conflicts you resolve them on your branch before you dump them on everyone else. Or you can have a more elaborate workflow where instead of just rebasing your branch you do this:

  • Roll back all your migrations since you forked from master
  • git-rebase against master
  • Renumber all your migrations to have IDs later than everything on master
  • rake db:migrate

This makes sure your migrations apply after everything on master, which makes sure that anyone else running your migrations when you land them will get a working database. But, in the vast majority of cases these scenarios don’t turn up that often, and when they do, it’s usually easier to sort them out by talking to people rather than with arcane technical solutions.

Want to know more about the way we work? Take a look at all of our “Making FutureLearn” posts.

Category Making FutureLearn

Comments (1)

0/1200

  • Waleed Hassan Mahadi

    My best freind Khalid Ahmed Moh.Nour inspeir me to join futurelearn and helped me alot to do this