Merge vs. Rebase – A deep dive into the mysteries of revision control

I remember when I started learning Git about two years ago. I crawled through all the available commands, read the man pages to find out what they are for, and got stuck when I stumbled over rebase. After figuring out what it actually does, I started loving it, but I didn’t understand how dangerous it is until one day I ended up with duplicated commits after pulling from another repository. So let me explain what goes wrong and why merge and rebase are often misunderstood. I’ll also present a list of golden rules for their usage. Before we start explaining both commands, I would like to give you one of the most important rules, in case you don’t want to read the complete article.

Never rebase branches or trees that you pulled. Only rebase local branches.


Disclaimer: I never read this article myself


Merge
Merging is one of the most important operations in distributed revision control systems. Basically, a merge is a new commit on top of the two branches being merged. It is like welding two steel pipes together. Neither pipe breaks; it is just joined with the other one. The commit itself knows that it is a merge commit. It is important to notice that this new commit tells the revision control system that there was a merge between two branches.

Let’s visualize this:
Let’s assume that we have a linear history. In our graphic we see the last three commits in that history.
simple-linear

Okay. Now let’s say there is a bug in the software. Of course, we don’t make any mistakes; certainly a co-worker made that one. Unfortunately, we are the only ones smart enough to know how to fix it. So we check out our stable branch and commit our fixes to that branch. Our fix needs two commits. Let’s see how this is reflected in our visualization:
Branch off

The commits 433fa and 322ac are our commits that fix the problem. Now we need to integrate these commits into our current experimental development version. But as we also keep our stable branch public for other employees to push to, we push this branch first before we merge it into our experimental branch. Let’s repeat this. We pushed this branch to a place where other people can pull it. Again, we pushed it. Remember that!

So we have to get that fix into our experimental branch. So we use the merge command. In Git or Mercurial this would look like:

$ git checkout stable
..fix fix fix fix..
$ git commit

$ git checkout experimental
$ git merge stable

$ hg update stable
..fix fix fix..
$ hg commit
….
$ hg update default
$ hg merge stable
$ hg commit

So we create our merge commit. This is a new commit that reflects our merge.
merge
Please note: The actual IDs of the commits that contain our fixes did not change.
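
A minimal sketch of this in Git, run in a throwaway repository. The file names and commit messages are made up for illustration; only the branch names stable and experimental come from the example above. It records the ID of the last fix before the merge and compares it with what the merge commit points to afterwards.

```shell
set -eu
# Throwaway repository; file names and messages are illustrative assumptions.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# Shared history, then stable forks off.
echo base > app.txt
git add app.txt
git commit -q -m "base"
git branch -m experimental
git branch stable

# experimental moves on in the meantime.
echo feature > feature.txt
git add feature.txt
git commit -q -m "experimental work"

# Two fix commits on stable.
git checkout -q stable
echo "fix part 1" > fix.txt
git add fix.txt
git commit -q -m "fix part 1"
echo "fix part 2" >> fix.txt
git commit -q -am "fix part 2"
fix_before=$(git rev-parse stable)

# Merge stable into experimental; this creates the merge commit.
git checkout -q experimental
git merge -q -m "Merge stable into experimental" stable

fix_after=$(git rev-parse stable)
second_parent=$(git rev-parse HEAD^2)   # the merged-in side of the merge commit
echo "fix before merge: $fix_before"
echo "fix after merge:  $fix_after"
echo "merge 2nd parent: $second_parent"
```

All three printed IDs should be identical: the merge added one new commit with two parents and left the fix commits untouched.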


Deep dive
So here is the deep dive into merging. Our revision control system now knows that there was a merge. Therefore, if we later work on stable again, commit new fixes on top of the old ones and then merge into the experimental branch once more, the system tries to find the common ancestor of the two branches. Thanks to the merge commit, that ancestor is now the last stable commit we merged, so only commits and changes made after the last merge are used for the new merge commit. This results in fewer merge conflicts and in better tracking of when fixes really get back into the experimental branch. Because revision control systems like Git and Mercurial are designed to be fast and to deal with a humongous number of commits and data, they don’t compare the actual changes to determine the first common ancestor of two heads. They use the calculated SHA1 keys. So if we merge, the SHA1s don’t change, and the system has a way to detect that some commits already exist in our tree when we pull from other people who once pulled from us.
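
One way to watch the common ancestor move is `git merge-base`. The sketch below is only an illustration in a scratch repository (file names and commit messages are assumptions): it asks Git for the common ancestor of stable and experimental before the merge, after the merge, and after one more fix on stable.

```shell
set -eu
# Scratch repository; file names and messages are illustrative assumptions.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

echo base > app.txt
git add app.txt
git commit -q -m "base"
git branch -m experimental
git branch stable
fork_point=$(git rev-parse HEAD)

echo feature > feature.txt
git add feature.txt
git commit -q -m "experimental work"

git checkout -q stable
echo "first fix" > fix.txt
git add fix.txt
git commit -q -m "first fix"

# Before any merge, the common ancestor is the commit where stable forked off.
base_before=$(git merge-base experimental stable)

git checkout -q experimental
git merge -q -m "Merge stable into experimental" stable
merged_tip=$(git rev-parse stable)

# After the merge, the common ancestor is the stable commit that was merged.
base_after=$(git merge-base experimental stable)

# A later fix on stable does not move the common ancestor back.
git checkout -q stable
echo "second fix" >> fix.txt
git commit -q -am "second fix"
base_later=$(git merge-base experimental stable)

echo "before merge: $base_before"
echo "after merge:  $base_after"
echo "after 2nd fix: $base_later"
```

The next merge of stable into experimental therefore only has to consider the second fix.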


Rebase
Rebase takes a different approach than merging. It is one of the more sophisticated and hard-to-learn commands. It’s about cutting off a piece of pipe and welding it onto another pipe. In Git it is a built-in command, but as it is usually not used very often, it is just an extension in Mercurial. Basically, a rebase is a way to cut off a set of commits from a branch and apply those commits to another branch. This seems pretty easy, but developers usually run into a lot of problems with rebase if they are not familiar with all the implications of ‘cut off’.

So let’s get back to our ‘stable’ branch, to which we committed our fixes:
Branch off

So we want to rebase this branch on top of our experimental branch. What does rebase do? It cuts off these commits. The commits no longer have any information about their parents. The system then applies them on top of the new branch, which in our example is the experimental branch. If everything goes fine, we get a nice linear history. Our old branch doesn’t exist anymore; instead, all the important commits are applied on top of our experimental branch. As people tend to dislike merge commits, rebase looks like the perfect tool to get rid of them. Let’s visualize this:
cut-rebase
We literally cut off these commits and then apply them on top of the new branch:
rebase

But what are the implications? Why does merge even exist if we have found such a nice way to handle our history? Ohhhhh. Let’s wait… take a look at the graphics. Our commit IDs changed! Why? We have a new parent, and we might even have completely new changesets (depending on the changes on the new branch). Our revision control system really thinks of our commits as patches and applies them on top of the new branch. Therefore it thinks that our commits are new commits. Let’s recapitulate. IT THINKS THAT THESE ARE NEW COMMITS. Why? Because they have new IDs. It’s that simple. The common ancestor of two commits is determined by traversing the commit IDs and their parent IDs. As the parents changed, the IDs have changed too.
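
The effect can be reproduced in a scratch repository. This is only a sketch under assumptions (file names and commit messages are invented); it compares the commit ID of the last fix before and after the rebase, and uses `git patch-id` to show that the change itself is still the same patch.

```shell
set -eu
# Scratch repository; file names and messages are illustrative assumptions.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

echo base > app.txt
git add app.txt
git commit -q -m "base"
git branch -m experimental
git branch stable

echo feature > feature.txt
git add feature.txt
git commit -q -m "experimental work"

git checkout -q stable
echo "fix part 1" > fix.txt
git add fix.txt
git commit -q -m "fix part 1"
echo "fix part 2" >> fix.txt
git commit -q -am "fix part 2"

# Commit ID and patch ID of the last fix before the rebase.
old_tip=$(git rev-parse stable)
old_patch=$(git show "$old_tip" | git patch-id | cut -d' ' -f1)

# Cut the two fixes off and re-apply them on top of experimental.
git rebase -q experimental

new_tip=$(git rev-parse stable)
new_patch=$(git show "$new_tip" | git patch-id | cut -d' ' -f1)

echo "commit ID before: $old_tip"
echo "commit ID after:  $new_tip"
echo "patch ID before:  $old_patch"
echo "patch ID after:   $new_patch"
```

The commit IDs differ while the patch IDs match: the same change, but a new commit as far as the revision control system is concerned.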


git rebase [--onto branch] [branch]
When you rebase in Git, you provide an upstream branch. The upstream branch is the branch onto which the commits should be moved. The cut-off point is calculated as the common ancestor of your current branch and the provided upstream. You can also specify the branch used for that calculation manually as a parameter to git rebase.
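
As a rough illustration of the --onto form, the sketch below assumes three hypothetical branches: master, feature (forked from master) and topic (forked from feature). In Git’s own synopsis the arguments are `git rebase --onto <newbase> <upstream> <branch>`; the commits that get moved are those on <branch> that are not on <upstream>. File names and commit messages are invented.

```shell
set -eu
# Scratch repository; branch names, file names and messages are assumptions.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

echo base > app.txt
git add app.txt
git commit -q -m "base"
git branch -m master

# feature forks from master, topic forks from feature.
git checkout -q -b feature
echo feature > feature.txt
git add feature.txt
git commit -q -m "feature work"

git checkout -q -b topic
echo topic > topic.txt
git add topic.txt
git commit -q -m "topic work"

# Move only the commits that are on topic but not on feature onto master.
git rebase -q --onto master feature topic

echo "master:         $(git rev-parse master)"
echo "parent of topic: $(git rev-parse topic~1)"
ls
```

After this, topic sits directly on master and no longer contains the feature commit, so feature.txt is absent from its working tree.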


Deep dive
Git, for example, literally takes these commits as patches: it removes the commits from the history of the branch, creates patches out of them and applies them on top of the new branch. If you rebase for the first time, it will result in more or less the same conflicts as merging, but if you rebase a second time, the revision control system doesn’t know about the earlier rebase and uses the same common ancestor that was used by the first rebase. Hence, if you rebase frequently between the same branches, Git uses more and more commits to do the actual merge, which is equivalent to having branches that become more and more divergent. This usually leads to many more merge conflicts. Think of rebasing as popping all those commits from the branch and pushing them onto the new branch (to speak in the terms of a patch management system like quilt).


So if you have pushed your branch to some place and you now pull it back, you will get duplicated commits. Okay, you might be smart enough to get around this, but other people who pulled from your stable branch might not know what happened, so they have all these duplicated commits, or if you pull from them, you get these commits back. Actually, you don’t even see them until you try to merge them, because their IDs are completely different. As a result you get millions of merge conflicts and have to find a way to get rid of the duplicated commits. But why, and how can you prevent this? And why do these systems include a command that can cause so much pain?
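
The duplication is easy to provoke on purpose. The following sketch is an illustration under assumptions: shared.git stands in for the public repository, and the directory names, file names and commit messages are invented. It pushes stable, rebases it, and then pulls the published version back.

```shell
set -eu
# Scratch area; repository names, file names and messages are assumptions.
work=$(mktemp -d)
cd "$work"
git init -q --bare shared.git      # stands in for the public repository
git init -q mine
cd mine
git config user.name "Example"
git config user.email "example@example.com"
git remote add origin ../shared.git

echo base > app.txt
git add app.txt
git commit -q -m "base"
git branch -m experimental
git branch stable

echo feature > feature.txt
git add feature.txt
git commit -q -m "experimental work"

git checkout -q stable
echo fix > fix.txt
git add fix.txt
git commit -q -m "the fix"

# Publish stable, then rewrite it anyway.
git push -q origin stable
git rebase -q experimental

# Pull the published (un-rebased) stable back in with a plain merge.
git pull -q --no-rebase --no-edit origin stable

# Count how many commits in our history carry the fix.
dup_count=$(git rev-list --count --grep="the fix" HEAD)
echo "commits named 'the fix' in history: $dup_count"
git log --oneline --graph
```

The history now contains the fix twice, once with its original ID and once with the ID it got from the rebase. Here the merge happens to be clean because both copies make the identical change; in a real project with further edits on top, this is where the conflicts come from.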

So let’s see.

Rebase done right
We saw that rebasing results in duplicated commits if we rebase branches that we already pushed to some repository that either we or others pulled from. Let’s see why there is still a point in using rebase at all. Imagine you want to fix a bug locally. This means you do not push this branch anywhere. Did you get that point? If you then rebase your fancy local branch on top of the branch that you usually push, everything is fine. Let’s take a look at a good example:

$ git branch pdo-mysql-config-fix
$ git checkout pdo-mysql-config-fix
..hack.hack..
$ git commit
$ git rebase master
$ git checkout master
$ git merge pdo-mysql-config-fix

Got it? We never cut off the pipe. It’s as if we prepared our piece of pipe on another stream. We realized the main pipe had changed, so we cut our piece off and welded it back on top of the main pipe. As all the other people only look at our main pipe, it looks like we just added a new piece on top of the old pipe. We brought the feature branch in line with master, so the merge will be trivial and Git will treat it as a “fast-forward”, which means there won’t be a merge commit. We are never going to push the branch that we rebase. If we pushed it, we would get duplicated commits.
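
To check the fast-forward claim, here is the same workflow as a self-contained sketch in a scratch repository. The branch name pdo-mysql-config-fix is taken from the example above; the file names and commit messages are assumptions. `--ff-only` makes Git refuse the merge if it would need a merge commit.

```shell
set -eu
# Scratch repository; file names and messages are illustrative assumptions.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

echo base > app.txt
git add app.txt
git commit -q -m "base"
git branch -m master

# Local-only fix branch, never pushed anywhere.
git checkout -q -b pdo-mysql-config-fix
echo fix > config.txt
git add config.txt
git commit -q -m "fix pdo mysql config"

# Meanwhile master receives an unrelated change.
git checkout -q master
echo other > other.txt
git add other.txt
git commit -q -m "unrelated change on master"

# Bring the local branch in line with master, then merge it.
git checkout -q pdo-mysql-config-fix
git rebase -q master
git checkout -q master
git merge -q --ff-only pdo-mysql-config-fix

merge_commits=$(git rev-list --merges --count master)
echo "merge commits on master: $merge_commits"
git log --oneline
```

The history on master stays linear, and since the fix branch was never published, nobody else can hold a copy of the pre-rebase commits.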

So let’s rehearse the Golden Rule Of Rebasing:

Never ever rebase a branch that you pushed, or that you pulled from another person

It’s that simple. If you rebased a branch from another person and they pulled the integrated changes back from you, they would get duplicated commits. So never ever do this.

The bad news: Too many developers get rebasing wrong and think that it’s a fancy way to get a linear history.
The good news: Believe me, even well-known top-level maintainers didn’t get what rebasing is about.


UPDATE
A few people at reddit.com pointed out that it might not be a good idea to rebase a dev branch on top of master. If you are making drastic changes in your dev branch, that’s right. People might use rebase to keep their dev branch up to date on top of the master branch and therefore rebase against master. At the end they can merge the dev branch into the master branch (in fact the result is the same here, as either way it is a fast-forward).

14 thoughts on “Merge vs. Rebase – A deep dive into the mysteries of revision control”

  1. Bjarne Christiansen

    > Disclaimer: I never read this article myself
    You should, it’s really good! ;-)

    Seriously, it’s the best article I have read on rebase so far. Good and important point about the duplicated commits, definitely worth sharing!

    It reminded me about the movie “Back to the Future”, you really need to be careful when re-writing history :o) In Mercurial re-writing history can’t be done, unless you enable some ext., there is a good reason for that…

    Thanks for sharing – keep up the work!

    1. Bjarne Christiansen

      > Thanks for sharing – keep up the work!
      That should of course have been:
      “Thanks for sharing – keep up the _good_ work!” ;-)

  2. Basil

    Thanks for the article, really helpful.

    I think now that I’ll never use rebase.

    As far as I see, the only disadvantage of using merge is that there will exist some extra changesets in the repository (corresponding to the stable branch). Is that really a problem?
    Are there any other advantages of using rebase which I overlooked?

  3. david

    Don’t understand what you mean in your deep dive explanation when you say rebasing will produce more conflicts. You make it sound like conflict resolution becomes harder, but surely it gets easier, as you are splitting your conflicts across multiple commits rather than having to handle them in one changeset merge.

    1. dsp Post author

      Maybe it’s not written correctly, but there are two situations where rebasing can cause more conflicts:
      1) If you try to merge a rebased branch with a branch that contains the non-rebased changesets. In that case you have commits that introduce the same change, which will cause a lot of conflicts.
      2) During rebasing itself, when you rebase a branch and a conflict appears. As far as I know, during a rebase you lose the information about the common ancestor, resulting in a merge similar to what SVN would do, as it just applies the changes on top of the current working directory without using the common ancestor to resolve parts of the conflict. In that case you might end up with more conflicts.

      Otherwise you are absolutely right. Once you have rebased successfully and then merge with a branch, it usually becomes easier and you will have fewer conflicts, as your common ancestor is more recent and therefore fewer changesets can introduce a conflict.

      1. Geoffrey

        Aleksandar, I’ve just read that article you linked, and I can tell you it’s an actually worse explanation.

        David’s schemas are right, when you rebase a branch (for example “feature”) on another (let’s say “master”), all commits from “feature” not reachable from “master” are applied to “master”. Nothing ever gets “uncommited” or whatever nonsense.

        You can actually DIY rebase your branch by cherry-pick’ing every commit from “feature” not on “master” onto “master”, then reset --hard your “feature” branch to the last cherry-pick you did; the end result will be exactly the same as a rebase.

        I suggest you read http://think-like-a-git.net/sections/rebase-from-the-ground-up.html to get a good understanding of how rebase works.

  4. Geoffrey

    “Git literally take this commits as patches by removing the commits from the history, creating patches out of it and applythem on top of the new branch”

    The “removing the commits from the history” part is false: they actually are still in git’s commit graph, but they are now “HEADless” (which means they cannot be reached from any existing branch and will be collected by git gc at some point). This means you could very well reset --hard 322ac (I took the commit ID from your example) and simply cancel your rebase.

    About conflict resolution, git has a command to REcord conflict REsolution and REplay it, it’s called rerere. Could prove handy if you have to solve the same conflict over and over again.
