Merge vs. Rebase – A deep dive into the mysteries of revision control
I remember the days when I started learning Git about two years ago. I crawled through all the available commands and read the man pages to find out what they are for, and I remember stumbling over rebase and getting stuck. After figuring out what it actually does, I started loving it, but I didn’t understand how dangerous it is until one day I somehow got duplicated commits after pulling from another repository. So let me explain what goes wrong and why merge and rebase are often misunderstood. I’ll also present a list of golden rules about their usage. Before we start explaining both commands, I would like to give you one of the most important rules, in case you don’t want to read the complete article.
Never rebase branches or trees that you pulled. Only rebase local branches.
Disclaimer: I never read this article myself
Merging is one of the most important operations in distributed revision control systems. Basically, a merge is a new commit on top of both branches that should be merged. It is like welding two different steel pipes together. The pipes themselves don’t break; one is just combined with the other. The commit itself knows that it is a merge commit, because it records the tips of both branches as its parents. It is important to note that it is a new commit that tells the revision control system that there was a merge between two branches.
Let’s visualize this:
Let’s assume that we have a linear history. In our graphic we see the last three commits in that history.
Okay. Now let’s say there is a bug in the software. For sure, we don’t make any mistakes; certainly a co-worker made that mistake. Unfortunately we are the only ones smart enough to know how to fix it. So we check out our stable branch and commit our fixes to that branch. Our fix needs two commits. Let’s see how this is reflected in our visualization:
The commits 433fa and 322ac are our commits that fix the problem. Now we need to integrate these commits into our current experimental development version. But as we also keep our stable branch public for other employees to push to, we push this branch first before we merge it into our experimental branch. Let’s repeat this. We pushed this branch to a place where other people can pull it. Again, we pushed it. Remember that!
So we have to get that fix into our experimental branch. So we use the merge command. In Git or Mercurial this would look like:
$ git checkout stable
..fix fix fix fix..
$ git commit
$ git checkout experimental
$ git merge stable
$ hg update stable
..fix fix fix..
$ hg commit
$ hg update default
$ hg merge stable
$ hg commit
So we create our merge commit. This is a new commit that reflects our merge.
Please note: The actual IDs of our commits that include the fixes did not change.
So here is the deep dive into merging. Our revision control system now knows that there was a merge. Therefore, if we later start to work on stable again, commit new fixes on top of the old fixes and then merge into the experimental branch again, the system tries to find the common ancestor. Thanks to our merge commit, that ancestor is now the tip of stable at the time of the last merge, so only commits and changes made after the last merge are used for the new merge commit. This results in fewer merge conflicts and in better tracking of when fixes really get back into the experimental branch. Because all these fancy new revision control systems like Git or Mercurial are designed to be fast and to deal with a humongous amount of commits and data, they don’t compare the actual changes to determine the first common ancestor of two heads. They use the calculated SHA-1 IDs. So if we merge, the SHA-1s don’t change, and the system has a way to detect that some commits already exist in our tree when we pull from other people who once pulled from us.
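To check the two claims above (a merge leaves the existing commit IDs alone, and the next merge starts from where the last one left off), here is a minimal sketch that plays the scenario through in a throwaway repository. The file names, the demo identity and the small `commit` helper are my own assumptions for illustration; only the git commands themselves are real.

```shell
#!/bin/sh
set -eu
# Throwaway repository; the identity below is a placeholder for this demo.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/experimental
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: write a file and commit it with the given message.
commit() { echo "$2" > "$1"; git add "$1"; git commit -q -m "$2"; }

commit base.txt "base"
git branch stable                      # stable forks off here
commit feature.txt "experimental work"

git checkout -q stable
commit fix1.txt "fix 1"
commit fix2.txt "fix 2"
fix_tip_before=$(git rev-parse stable)

git checkout -q experimental
git merge -q -m "Merge stable into experimental" stable

fix_tip_after=$(git rev-parse stable)            # stable's tip after the merge
merge_parent=$(git rev-parse "HEAD^2")           # second parent of the merge commit
next_base=$(git merge-base experimental stable)  # where a later merge would start

echo "stable before merge: $fix_tip_before"
echo "stable after merge:  $fix_tip_after"
echo "next common ancestor: $next_base"
```

All three printed IDs should be identical: the fix commits keep their IDs, the merge commit points at them as a parent, and a later merge only has to look at what comes after them.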
Rebase takes a different approach than merging. It is one of the more sophisticated and hard-to-learn commands. It’s about cutting off a pipe and welding it onto another pipe. In Git it is a built-in command, but as it is usually not used very often, it is just an extension in Mercurial. Basically, a rebase is a way to cut off a set of commits from a branch and apply those commits on another branch. This seems pretty easy, but developers usually run into a lot of problems with rebase if they are not familiar with all the implications of ‘cut off’.
So let’s get back to our ‘stable’ branch, to which we committed our fixes:
So we want to rebase this branch on top of our experimental branch. What does rebase do? It cuts off these commits. The commits lose the connection to their old parents. The system then applies them on top of the new branch, which in our example is the experimental branch. If everything goes fine, we get a nice linear history. Our old branch doesn’t exist anymore; instead we have all the important commits applied on top of our experimental branch. As people tend to dislike merge commits, rebase is the perfect tool to get rid of them. Let’s visualize this:
We literally cut off these commits and then apply them on top of the new branch:
But what are the implications? Why does merge even exist if we found such a nice way to handle our history? Ohhhhh. Let’s wait… take a look at the graphics. Our commit IDs changed! Why? We have a new parent, and we might even have completely new changesets (depending on the changes on the new branch). Our revision control system really thinks of our commits as patches and applies them on top of the new branch. Therefore our revision control system thinks that our commits are new commits. Let me repeat this. IT THINKS THAT THESE ARE NEW COMMITS. Why? Because they have new IDs. It’s that simple. The common ancestor of a commit is determined by traversing the list of commit IDs and their parent IDs. As the parents changed, the IDs have changed too.
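If you want to see the ID change with your own eyes, the following sketch rebases a one-commit stable branch onto experimental and compares the IDs before and after. It also compares the output of `git patch-id`, which hashes only the diff, to show that the change itself is the same while the commit is "new". The repository layout, the demo identity and the `commit` helper are assumptions made up for this example.

```shell
#!/bin/sh
set -eu
# Throwaway repository; the identity below is a placeholder for this demo.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/experimental
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: write a file and commit it with the given message.
commit() { echo "$2" > "$1"; git add "$1"; git commit -q -m "$2"; }

commit base.txt "base"
git branch stable
commit feature.txt "experimental work"

git checkout -q stable
commit fix.txt "fix"

# Commit ID and patch ID (a hash of the diff only) before the rebase.
old_id=$(git rev-parse stable)
old_patch=$(git show stable | git patch-id | cut -d' ' -f1)

git rebase -q experimental

# The same two values after the rebase.
new_id=$(git rev-parse stable)
new_patch=$(git show stable | git patch-id | cut -d' ' -f1)

echo "commit ID before: $old_id"
echo "commit ID after:  $new_id"
echo "patch ID before:  $old_patch"
echo "patch ID after:   $new_patch"
```

The commit IDs should differ while the patch IDs match. Git's ancestry logic only looks at the commit IDs, which is exactly why the rebased commit counts as a stranger.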
git rebase [--onto <newbase>] <upstream>
When you rebase in Git, you provide an upstream branch. The upstream branch is the branch onto which the commits should be moved. The cut-off point is calculated as the common ancestor of your current branch and the provided upstream. With --onto you can also name the new base separately, so that the upstream is used only to calculate the cut-off point.
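Here is a small sketch of the difference the --onto parameter makes. I assume a hypothetical `topic` branch that was started from stable; with `git rebase --onto experimental stable topic`, stable only marks the cut-off point and experimental becomes the new base, so the fix from stable is left behind. Branch contents, demo identity and the `commit` helper are invented for the example.

```shell
#!/bin/sh
set -eu
# Throwaway repository; the identity below is a placeholder for this demo.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/experimental
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: write a file and commit it with the given message.
commit() { echo "$2" > "$1"; git add "$1"; git commit -q -m "$2"; }

commit base.txt "base"
git checkout -q -b stable
commit fix.txt "fix"
git checkout -q -b topic               # topic starts on top of stable
commit topic.txt "topic work"
git checkout -q experimental
commit feature.txt "experimental work"

# Cut topic off where it left stable and weld it onto experimental.
git rebase -q --onto experimental stable topic

# Commit subjects on topic, newest first, joined with commas.
subjects=$(git log --format=%s topic | tr '\n' ',')
echo "$subjects"
```

The history of topic should now read "topic work", "experimental work", "base": the commit from stable is no longer part of it.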
Git literally takes these commits as patches: it removes the commits from the history, creates patches out of them and applies them on top of the new branch. The first time you rebase, this results in more or less the same conflicts as merging. But unlike a merge, a rebase leaves no record in the history, so wherever the original commits are still around, the revision control system doesn’t know about the rebase and keeps using the same common ancestor as before. Hence, if you rebase frequently between the same branches, Git has to use more and more commits to do the actual merge, which is the same as having branches that get more and more divergent. This usually leads to many more merge conflicts. Think of rebasing as popping all those commits from the branch and pushing them onto the new branch (to speak in the terms of a patch management system like quilt).
So if you have pushed your branch to some place and you now pull it back, you will get duplicated commits. Okay, you might be smart enough to get around this, but other people who pulled from your stable branch might not know what happened, so they have all these duplicated commits, or if you pull from them, you get these commits back. Actually you don’t even see them until you try to merge them, because their IDs are completely different. As a result you get millions of merge conflicts and have to find a way to get rid of the duplicated commits. But why, and how can you prevent this? And why do these systems include a command that can cause so much pain?
So let’s see.
Rebase done right
We saw that rebasing results in duplicated commits if we rebase branches that we already pushed to some repository that either we or others pulled from. Let’s see why there is still a point in using rebase at all. Imagine you want to fix a bug locally. This means you do not push this branch anywhere. Did you get that point? If you then rebase your fancy local branch on top of the branch that you usually push, everything is fine. Let’s take a look at a good example:
$ git branch pdo-mysql-config-fix
$ git checkout pdo-mysql-config-fix
$ git commit
$ git rebase master
$ git checkout master
$ git merge pdo-mysql-config-fix
Got it? We never cut off the pipe. It’s just like we prepared that pipe on another stream. We realized the main pipe had changed, so we cut our piece off and welded it back on top of the main pipe. As all the people just look at our main pipe, it looks like we just added a new piece on top of the old pipe. We brought the feature branch in line with master, so the merge will be trivial and Git will treat it as a “fast-forward”, which means there won’t be a merge commit. We never push the branch that we rebase. If we had pushed it before rebasing, we would get duplicated commits.
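The workflow above can be replayed end to end to confirm that the final merge really is a fast-forward. This is only a sketch: the unrelated commit on master, the file names, the demo identity and the `commit` helper are assumptions I added so that there is something to rebase onto.

```shell
#!/bin/sh
set -eu
# Throwaway repository; the identity below is a placeholder for this demo.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: write a file and commit it with the given message.
commit() { echo "$2" > "$1"; git add "$1"; git commit -q -m "$2"; }

commit base.txt "base"
git checkout -q -b pdo-mysql-config-fix   # local branch, never pushed
commit fix.txt "fix config"

git checkout -q master
commit other.txt "unrelated work on master"   # master moves on meanwhile

git checkout -q pdo-mysql-config-fix
git rebase -q master                      # weld our piece onto the new master
git checkout -q master
git merge -q --ff-only pdo-mysql-config-fix   # refuses anything but a fast-forward

merges=$(git rev-list --merges --count master)
echo "merge commits on master: $merges"
```

I used `--ff-only` here so that the script would fail loudly if Git needed a real merge. The count of merge commits on master should be 0.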
So let’s rehearse the Golden Rule Of Rebasing:
Never ever rebase a branch that you pushed, or that you pulled from another person
It’s that simple. If you rebased a branch from another person and they pulled the integrated changes back from you, they would get duplicated commits. So never ever do this.
The bad news: Too many developers get rebasing wrong and think that it’s just a fancy way to get a linear history.
The good news: Believe me, even well-known top-level maintainers didn’t get what rebasing is about.
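To make the duplicated commits tangible, here is a sketch that breaks the rule on purpose. A branch called `colleague` stands in for a copy of stable that somebody pulled before the rebase; that name, the files, the demo identity and the `commit` helper are all hypothetical.

```shell
#!/bin/sh
set -eu
# Throwaway repository; the identity below is a placeholder for this demo.
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/experimental
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: write a file and commit it with the given message.
commit() { echo "$2" > "$1"; git add "$1"; git commit -q -m "$2"; }

commit base.txt "base"
git branch stable
commit feature.txt "experimental work"

git checkout -q stable
commit fix.txt "fix"
git branch colleague stable    # the copy somebody pulled before the rebase

git rebase -q experimental     # breaks the rule: stable was already shared

# Later the colleague's copy comes back to us.
git merge -q -m "Merge colleague's copy of stable" colleague

# Count how many commits with the subject "fix" the history now contains.
dupes=$(git log --format=%s | grep -c '^fix$')
echo "commits named 'fix': $dupes"
```

The history should now contain the fix twice, once with the old ID and once with the new one. In this tiny case the merge is clean because both copies are identical; with real-world changes on top, this is where the conflicts start.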
This article is about rebasing. Rebasing is a way to push commits from a branch on top of another branch. Decentralized revision control systems such as Git or Mercurial implement this feature.
A few people at reddit.com pointed out that it might not be a good idea to rebase a dev branch on top of master. If you are making drastic changes in your dev branch, that’s right. People might then use rebase to keep their dev branch up to date on top of the master branch and therefore rebase against master. At the end they can merge the dev branch into the master branch (in fact the result is the same here, as it’s a fast-forward either way).