Monday, July 28, 2008

On rebasing

In a recent blog post, Andrew Bennetts criticises Git users for rebasing their work, thereby changing their commit history and perhaps losing valuable information.

While it is true that information is lost, and that this may make it difficult to sync up with other users, his suggestion also has some problems that lie at the core of how Bazaar users use Bazaar. This is basically the same issue I raised earlier. He suggests merging the experimental commits into the mainline and providing a useful commit message at that point. The Bazaar developer community follows basically the same practice: they use Bundle Buggy to track their development. When a patch series must be tweaked, fixes are uploaded to Bundle Buggy. Once the series is complete, the whole branch is merged into Bazaar’s mainline. This is also why it is possible in Bazaar to talk about a “mainline”: the repository is fundamentally treated in a linear way.

This means that small fixes, as in this bundle, also get merged into the mainline.

This is exactly what Andrew means: you can still see all the little fixups and you still maintain your history, but when you run “bzr log --short”, you only see the merges, AKA the real commit messages.

The problem with this, of course, is that your revision control system then suddenly becomes a linear system, unable to do real merges. For if you do a merge of a branch with useful commit messages, those disappear unless you actually copy them in the merge message. If you don’t use “bzr log --short”, your log will be full of useless commits like “fixing newlines”, “oops, a typo”, and so on.

The problem is perhaps more apparent if you go and bisect a bug. In Git, with the commits rebased and each having a nice clear commit message, you can understand in what context the bug happened and why the change was made. If you do the same bisect in Bazaar, it might land on a commit like “oops, forgot to add this”. Of course, you can look at the merge commit that brought the change in, to see if that has a clearer commit message, but you can never be sure that it does. Perhaps the commit itself did have a clear message. Or perhaps the merge will just be a “Synced with mainline” merge, in which case you aren’t any further than you were before. Perhaps you should look at the merge above that one?
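To make the bisect argument concrete, here is a minimal sketch of the binary search behind a bisect over a linear list of commits. The commit messages and the `is_bad` test are made up for illustration, and this is not how Git or Bazaar implement bisection internally; it only shows why the search ends on whatever single commit introduced the bug, fixup or not.

```python
def bisect_first_bad(commits, is_bad):
    """Return the first commit for which is_bad(commit) is true.

    Assumes commits are ordered oldest to newest, the first commit is
    good, the last one is bad, and every commit after the first bad one
    is bad as well (the usual bisect assumption).
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history in which the fixups were kept as separate commits.
history = [
    "Base commit",
    "Add parser",
    "oops, a typo",
    "oops, forgot to add this",  # assumption: the bug is introduced here
    "fixing newlines",
    "Merge parser feature",
]
first_bad_index = 3
culprit = bisect_first_bad(history, lambda c: history.index(c) >= first_bad_index)
print(culprit)  # prints: oops, forgot to add this
```

The search only tells you which commit is the culprit. Whether its message explains anything depends entirely on how the history was recorded.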

The same is true when doing a merge from non-feature branches. Let’s say that someone has made a branch with ten new features in it. Each of those features was developed as above: small fixes, and a merge for the real feature. How are you going to merge that branch? If you merge it in at once, with a commit message like “Merged ten new features from John”, then “bzr log --short” won’t display which features were actually merged. You could merge them all by hand, but that is a lot of work. Or you can expand the commit message to list all changes. In any case, you don’t want to view the full “bzr log” history, because that shows all the little fixups and errors.

As you can hopefully see, merging those commits in might seem easy, but it can cause a lot of problems afterwards. They certainly don’t make the history easier to understand. The way Bazaar is developed, you basically get the same merge power as Subversion, as “bzr log --short” won’t show what was merged in, but “bzr log” itself shows too much information to be useful.

Sunday, June 22, 2008

Plugins in Version Control

What is up with these version control systems that use plugins to do everything? I understand that having an extensible system is very useful if users want to create something custom. Mercurial and Bazaar both support plugins written in Python, while Git has an extensive set of commands that allow you to easily write custom scripts in any language.

However, being able to extend a program is not an excuse to ship a product missing features. One of the things that makes Git powerful is that it ships with a lot of capabilities built in.

Take Bazaar for example. If you want to commit only part of the changes in a file, you need the BzrTools extension. To rebase revisions you need the Rebase extension. You need extensions to show diffs in colour, to push all branches at once, and even to remove untracked files in your working tree! Why aren’t these included by default? Are they worried that a user might be scared by a coloured diff? Or is it too much work to support these tools directly?

This is exactly the problem with plugins: they separate the functionality from the main program. Who is going to install some weird plugin from some website to gain functionality? Who created the plugin anyway, and can it be trusted as much as the program itself? How is it updated? What if there’s an incompatibility between the plugin version and the program version? What if I’m using some feature, but others that I’m working with can’t do the same, because they don’t have the plugin installed yet?

Having an extensive set of tools shipped by the program isn’t a bad thing, it’s good. It allows you to use the same workflow and suggest the same solutions to everybody else, without requiring a certain plugin. It also makes features easier to use, as you don’t have to manually install something first.

Git is easily capable of supporting its full feature set release after release, and keeps adding useful features. When upgrading, you can be sure that all your previous commands are still usable. As they are integrated into the system, you can be sure that they get the same amount of attention as anything else.

This last point is more important than you might think. After all, you’re talking about your source code here. You want the program that you use to be reliable, to work consistently across releases. You don’t want to depend on something that you don’t fully trust.

Relying on plugins is just a way for lazy programmers to maintain less code. I really prefer Git’s builtin tools to some lousy plugin that doesn’t even have its own website. If you stick to only core functionality, at least be the fastest ;).

Friday, June 6, 2008

Git Repack Parameters

A few people have asked me if I chose the parameters for Git’s repack correctly. Shouldn’t I use a higher --depth value than the default? Why did I pick a --window value of 250? Shouldn’t I have repacked with the default values?

To answer the last question first: no. I did these conversions as best as I could, in order to make a fair comparison. My assumption is that anyone converting their repository to Bazaar, Git or Mercurial knows what he or she is doing. Then why should I settle for less? Repacking a repository as tightly as I did is necessary only once, but it is an important step: git fast-import creates really bad packs in order to be fast, so a repack really helps. The manpage also suggests using a higher --window value than normal.

However, it got me curious about how the parameters (--depth and --window) influence final repository size. First I wanted to see if changing the --depth would have made a difference in final size. I repacked all repositories with a depth value of either 50 (the default) or 100. I varied the window parameter over the values [10, 20, 50, 100, 150, 200, 250].
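The parameter sweep described above can be sketched as follows. This is only an illustration of the grid of depth and window values, not the script that was actually used; it builds the command lines without running them, and the function name is made up.

```python
from itertools import product

DEPTHS = [50, 100]
WINDOWS = [10, 20, 50, 100, 150, 200, 250]


def repack_commands(depths=DEPTHS, windows=WINDOWS):
    """Build one full-repack command line per (depth, window) pair."""
    return [
        ["git", "repack", "-a", "-d", "-f", f"--depth={d}", f"--window={w}"]
        for d, w in product(depths, windows)
    ]


commands = repack_commands()
for cmd in commands:
    print(" ".join(cmd))
```

Each entry could be handed to something like `subprocess.run(cmd, cwd=repo_path, check=True)`, where `repo_path` is a placeholder for a converted repository, with the pack size and wall-clock time recorded after each run.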

First let’s look at how the depth variable influences repository size.

Window vs. Depth vs. Size

As can be seen from this figure, increasing the repack depth only influences repository size on a repack with a small window. As I used a window size of 250, the depth variable did not influence results much.

However, it’s also interesting to see how these variables affect other parameters. An example of this is repack time.

Window size vs. Repack Time

Repack time still increases with increasing window size. As a repository won’t be packed much tighter on a window of 250 than on a window of 100, you might as well choose a lower value for your window when doing an aggressive repack.

However, there is a more interesting interaction going on: the effect of the window parameter depends on the size of your repository. Let’s look at repositories of different sizes (See “Meet the Candidates” for a description of the repositories):

Window Size vs. Repository Size

As can be seen, a higher window value will have an effect only on repositories that are actually quite large, like the Emacs repository. If you have a small repository, there’s not much use in repacking with anything higher than --window=50, but if your repository is several hundred MB, it can skim off a few more megabytes.

(Please note that the repack times were measured on an Intel iMac Core Duo, 2 GHz with 2GB RAM running OS X. Repacks were done with git repack -adf, which means that a repository will be completely repacked. If you do a normal, incremental repack, expect to see much faster repacks.)

Tuesday, June 3, 2008

On mainline merges and fast forwards

Bazaar has a somewhat different notion of merging in the case of no new commits than Git and Mercurial do. The reason for this is the notion of a “mainline” in Bazaar. This supposed mainline is meant as a silver line through your history. All commits should be merged into this mainline, giving you a nice overview of development. Bazaar has even integrated this into their “log” tool: commits that have been merged “into the mainline” are indented to show this.

Git and Mercurial use another approach, based on the fast-forward method: if there are no new commits on your branch, but there are new ones on the remote, Git and Mercurial just fast-forward you to that commit. There is no merge; your new HEAD will simply be the same revision as the remote.
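The fast-forward rule can be sketched on a toy commit graph. This is an illustration only, not how Git or Mercurial implement it; the commit names are hypothetical, and history is represented as a plain dictionary mapping each commit to its parents.

```python
def ancestors(commit, parents):
    """Return the set containing commit and everything reachable from it."""
    seen = set()
    stack = [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen


def update(local_head, remote_head, parents):
    """Decide what bringing remote_head into local_head requires."""
    if remote_head in ancestors(local_head, parents):
        return ("up-to-date", local_head)
    if local_head in ancestors(remote_head, parents):
        # No local commits: just move the head, no merge commit is created.
        return ("fast-forward", remote_head)
    return ("merge", None)  # both sides have commits the other lacks


# Toy history: base <- a <- b, plus an unrelated local commit x on base.
parents = {"base": [], "a": ["base"], "b": ["a"], "x": ["base"]}
print(update("a", "b", parents))  # prints: ('fast-forward', 'b')
print(update("x", "b", parents))  # prints: ('merge', None)
```

Only the last case produces a new commit. In the fast-forward case the local branch ends up identical to the remote one, which is the behaviour this post argues for.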

The reason they do this is that Bazaar’s approach comes with some problems. The first and most obvious of these is performance. The “bzr log” command is really slow, as it reconstructs history every time, figuring out what the mainline actually is and then showing the history in a neat way. This scales badly: for Cairo, with 4000 commits, “bzr log -l10” takes just over a second to display the first 10 log messages. Mozilla-central, with 15000 commits, already takes 5 seconds. Samba, with 24000 commits, takes 10 seconds to display the first few log messages, which I would call unacceptable. It gets even worse when you try the Emacs repository.

However, that is not the biggest problem. The problem with explicit merges is the pollution of your branches. The outline Bazaar makes is only useful if it is “correct” and does not show pollution. This can be a real problem when two developers work together on a single feature and merge with each other.

An example

Let’s say that we have a single upstream with some commits. There are two people working on something, one in the branch1 branch and the other in the branch2 branch. Both do a commit on their own branch, then branch1 decides to merge with branch2. After this, branch2 merges with branch1, to get their features.

What this means is that after both have merged, they should have the same tree (given that the merge was conflict-free). One might therefore also assume that they have the same branch history. This is not true however.

Show branch1 log

Show branch2 log

Who is right here? Should the “Add b” commit be indented or the “Add c” commit? As you can see, the notion of “mainline” is then local to a developer, defeating the whole point of having a “global line through history”.
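The disagreement can be reproduced with a small model. The sketch below assumes that the “mainline” is simply the chain of first parents starting at a branch tip, which is consistent with the indentation in the log output further down; the revision names are hypothetical stand-ins for the commits in the example.

```python
def mainline(head, parents):
    """Follow first parents from head back to the root commit."""
    line = []
    commit = head
    while commit is not None:
        line.append(commit)
        ps = parents[commit]
        commit = ps[0] if ps else None
    return line


# Parent lists, first parent first.
parents = {
    "base": [],
    "add-b": ["base"],  # committed on branch1
    "add-c": ["base"],  # committed on branch2
    "merge-in-branch1": ["add-b", "add-c"],  # branch1 merges branch2
    "merge-in-branch2": ["add-c", "merge-in-branch1"],  # branch2 merges branch1
}

branch1_mainline = mainline("merge-in-branch1", parents)
branch2_mainline = mainline("merge-in-branch2", parents)
print(branch1_mainline)  # prints: ['merge-in-branch1', 'add-b', 'base']
print(branch2_mainline)  # prints: ['merge-in-branch2', 'add-c', 'base']
```

Both branches contain the same changes, yet each one considers its own commit part of the mainline and the other developer’s commit a merged-in side branch.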

It gets even worse. Suppose branch1 merges the changes from branch2 again (just to make sure he has everything) and pushes it to upstream. Then branch2 updates from upstream in order to continue working. The final output of his log looks like this:

--------------------------------------------
revno: 4
committer: Pieter de Bie <pdebie@ai.rug.nl>
branch nick: branch2
timestamp: Tue 2008-06-03 16:25:55 +0200
message:
  Merge with upstream
    --------------------------------------------
    revno: 1.1.3
    committer: Pieter de Bie <pdebie@ai.rug.nl>
    branch nick: branch1
    timestamp: Tue 2008-06-03 16:25:52 +0200
    message:
      Merge with branch2
--------------------------------------------
revno: 3
committer: Pieter de Bie <pdebie@ai.rug.nl>
branch nick: branch2
timestamp: Tue 2008-06-03 16:25:51 +0200
message:
  Merge with branch1
    --------------------------------------------
    revno: 1.1.2
    committer: Pieter de Bie <pdebie@ai.rug.nl>
    branch nick: branch1
    timestamp: Tue 2008-06-03 16:25:49 +0200
    message:
      Merge with branch2
    --------------------------------------------
    revno: 1.1.1
    committer: Pieter de Bie <pdebie@ai.rug.nl>
    branch nick: branch1
    timestamp: Tue 2008-06-03 16:25:45 +0200
    message:
      Add b
--------------------------------------------
revno: 2
committer: Pieter de Bie <pdebie@ai.rug.nl>
branch nick: branch2
timestamp: Tue 2008-06-03 16:25:47 +0200
message:
  Add c
--------------------------------------------
revno: 1
committer: Pieter de Bie <pdebie@ai.rug.nl>
branch nick: upstream
timestamp: Tue 2008-06-03 16:25:42 +0200
message:
  Base commit

Does that look readable to you? This does not show a nice overview of what has happened during the development of a feature. Instead, it is littered with merges. This kind of noise may make developers wary of merging, which is not good.

Compare this to how Git and Mercurial handle this. Even after the to-and-fro merging, the log still shows only four commits:

commit 78cd0c44cd93800a169617a72832bcb6a984f9a3
Merge: 598b3f8... ecc0685...
Author: Pieter de Bie <pdebie@ai.rug.nl>
Date:   Tue Jun 3 17:04:27 2008 +0200

    Merge comparison/temp-dir-2/branch2

    * comparison/temp-dir-2/branch2:
      Add c

commit 598b3f859ff9f69733a2ddd1406229cd3c203591
Author: Pieter de Bie <pdebie@ai.rug.nl>
Date:   Tue Jun 3 17:04:26 2008 +0200

    Add b

commit ecc06852b9f76cdcf9285f6eac7c39c002e566e7
Author: Pieter de Bie <pdebie@ai.rug.nl>
Date:   Tue Jun 3 17:04:26 2008 +0200

    Add c

commit 5828b28ff1fc41c674cabe9b839621a72f3effa5
Author: Pieter de Bie <pdebie@ai.rug.nl>
Date:   Tue Jun 3 17:04:26 2008 +0200

    Base commit

Implications

The important point of this is that in the way Git and Mercurial work, it doesn’t matter who does the merge. Their workflow encourages the use of forking, giving more freedom to the developers to do what they want and merge with whom they want. The Bazaar approach, in contrast, discourages merging by anyone other than those in control of the mainline, as otherwise the history will look ugly and unreadable: the first branch is somehow “special” and must be maintained that way. In a distributed VCS, it should not matter who does the merge. When the maintainers have to be the ones who merge from you, you get less freedom in deciding how to do your development.

While the outline may seem nice, sometimes having it is just plain wrong. If I want to update my branch to the latest upstream, I want to be equal to the latest upstream, not to make it a merge. Also, as we just saw with cross-developer merging, there sometimes is no way to determine who is “the mainline”. Making a simple branch something that it isn’t is only bound to spread confusion among the users.

See also this reply by Linus Torvalds that basically says the same thing. Also take a look at the rest of that thread if you’re interested.

Sunday, June 1, 2008

Git, Mercurial, Bazaar Repository Size Benchmark

I have finished all conversions of the repositories that I’m going to test. The first interesting metric is of course the repository size. This is interesting because Bazaar claims on its homepage that “Bazaar’s default storage format is highly efficient, better than its main competitors in most cases”. They include benchmarks which should show that they are more efficient than specifically Git and Mercurial.

As I mentioned before, this benchmark is bogus because it does not include project history. While Bazaar acknowledges this, they still mention space efficiency as one of their prime benefits. They also mention that their benchmarks are done on real use cases, not arbitrary processes.

At the same time, Mercurial states it has an “Extremely high-performance delta-compressed storage scheme”. Git mentions “It also uses an extremely efficient packed format for long-term revision storage that currently tops any other open source version control system.”

Test method

For all repositories, only a single branch was converted. For all repositories except Samba, this meant the development branch. As Samba has multiple development branches, I chose the v3-3-test branch.

I made use of the fast-export/import interface where possible. This means that all conversions were done using fast-export, except the conversions to Mercurial, for which I used “hg convert”. Bazaar to Mercurial was a bit tricky, as Mercurial has no Bazaar importer. I therefore converted the Git version of the same repository to Mercurial.

After conversion, I ran a pack command for the repositories that support this. For Git, this meant “git repack -adf --window=250”; for Bazaar it meant “bzr pack” and removing the obsolete packs.

Tests were done using Git v1.5.5.3, Bzr v1.5 and Hg v1.0.

Results

So then, using actual, real world data, which system has the best storage efficiency? Below is the table of all projects.

Repository    Git      Mercurial   Bazaar

Git
Cairo         15MB     24MB        30MB
Coreutils     29MB     44MB        76MB
Samba         82MB     146MB       310MB

Mercurial
Octave        22MB     49MB        57MB
Mozilla       78MB     205MB       255MB
Dovecot       9MB      14MB        23MB

Bazaar
Emacs         120MB    163MB       300MB
Mailman       42MB     75MB        73MB
Pkgconfig     1.1MB    1.3MB       1.8MB

Total         398MB    721MB       1125MB
Relative      1        1.8         2.8

Finally

As can be seen from the table, Git really is the most efficient in storing the data. Next up is Mercurial, which also does a nice job. Bazaar is the least efficient by far, taking on average 2.8 times the space of an equivalent Git repository.
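For readers who want to check the arithmetic, the totals and relative sizes follow directly from the per-repository numbers in the table. The snippet below is just that calculation; the variable and function names are made up.

```python
# Sizes in MB, copied from the results table: (Git, Mercurial, Bazaar).
SIZES_MB = {
    "Cairo": (15, 24, 30),
    "Coreutils": (29, 44, 76),
    "Samba": (82, 146, 310),
    "Octave": (22, 49, 57),
    "Mozilla": (78, 205, 255),
    "Dovecot": (9, 14, 23),
    "Emacs": (120, 163, 300),
    "Mailman": (42, 75, 73),
    "Pkgconfig": (1.1, 1.3, 1.8),
}


def totals(sizes):
    """Sum each system's column over all repositories."""
    return tuple(sum(row[i] for row in sizes.values()) for i in range(3))


def relative(totals_mb):
    """Express each total as a multiple of the Git total, to one decimal."""
    base = totals_mb[0]
    return tuple(round(t / base, 1) for t in totals_mb)


git_total, hg_total, bzr_total = totals(SIZES_MB)
ratios = relative((git_total, hg_total, bzr_total))
print(ratios)  # prints: (1.0, 1.8, 2.8)
```

Note that 2.8 is the ratio of the totals. The per-repository Bazaar-to-Git ratios in the table range from roughly 1.6 (Pkgconfig) to roughly 3.8 (Samba).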

Friday, May 30, 2008

On History Rewriting

This is just a short note on history rewriting.

During my conversion process, I made some significant changes to the bzr-fast-export tool, as it can’t export some repositories by default. As I was hacking away quickly, I made some fast bzr commits, so that I could make nice patches out of them later on.

Today I wanted to do just that, but it appears to be impossible to rewrite your history with Bazaar. I would like to merge some commits, reorder them, change the log message and split one commit up in two parts. Also, I’ll have to adjust the author info, as I didn’t set it up correctly before.

Not too difficult, you’d think. However, I can’t figure out how to do it. The best thing I’ve found is bzr uncommit, which will return you to a previous state. In order to work with this, I’ll have to export a diff for every commit I did, then revert to the parent commit, apply a patch, and commit it again. If I want to split up a commit, I’ll either have to install and use the shelve plugin, or split up the patches myself.

Compare this to Git’s excellent rebase --interactive script. It will allow you to do everything I just mentioned, and then some. This has obvious advantages: you don’t have to worry about your patch series until you’re done with it. Some changes might not be obvious from the start, and being able to edit a log message to make it clearer is a valuable tool. With Bazaar, once you’ve committed, you’re pretty much committed to it (pun intended). You’ll have to think ahead of time about what you’re going to commit, in what order, and with what message. Obviously I prefer the freedom of Git in this case.

Tuesday, May 27, 2008

Import tools

The last few days I have spent some time using the different import tools that exist. Basically there are two ways to convert a repository: by using a converter native to one of the systems, or by using the fast-{import,export} tools.

A bit of history.

Somewhere in 2006, Git introduced the git fast-import tool. It was created to allow fast importing of tarballs and to allow other revision systems to export their data easily and fast to Git.

Examples of front-ends are cvs2svn and hg-fast-export.py, which allow exporting CVS and Mercurial repositories to Git.

In the meantime, more importers and exporters have been created. For example, Bazaar has both a bzr fastimport plugin and an export script. Git has also implemented a git-fast-export, which should allow you to export your Git repository to any of the importers. Mercurial does not have a fast-import variant yet.

The reality

The reality, however, is that it is very difficult to make these tools work together. Git’s fast-import tool seems the most robust and is willing to take in anything well-formed. There was a bug in the fast-export tool when having a commit with multiple parents, but this has been fixed since.

The hg-fast-export tool also seems to work fairly well. It exports all branches it has, and is not likely to crash. It has successfully exported all the repositories I gave it.

Bazaar’s tools seem to have the most problems. The fast-import front-end will crash on invalid encodings and other peculiarities that Git’s tool seems to have no problem with. The fast-export tool has several problems which I tried to fix. One of these is handling ghost commits, which is a somewhat bizarre feature of Bazaar where you can say a commit was a merge without supplying one of its parents. When Bazaar has recursive renames (for example, renaming “a/” to “b/” and “a/a” to “b/c”) it will produce output that is invalid for git fast-import. I also get a different number of revisions after the import is complete. I’m not sure yet what’s going on there. Furthermore, Bazaar’s tool might as well be called slow-import, as it can take a day to import a somewhat large repository.

The same was true when importing the Emacs repository; I couldn’t find any way to convert it into a Git repository. Currently I’m using the Emacs repository on repo.or.cz, but that one differs in history from the Bazaar one.

HG-convert

As Mercurial does not have a fast-import tool, I have to use the “hg convert” command/plugin to import repositories into Mercurial. This tool seems to crash every so often, especially if you provide it with a non-standard Git repository, like git.git or the Linux kernel. I’m also not sure if it handles branches correctly; sometimes it seems to import them, and sometimes it seems to just ignore them. As Mercurial has no way to import Bazaar repositories, I had to import the Git versions of them. This means that most conversions have a different number of commits in all three versions.

The branching problem

All three systems have a different way to handle branches. With Git, branches are just references to specific commits that are kept totally out of the repository history. Mercurial records the branch name in each changeset, which allows you to see the history of the branches. Bazaar seems to have the most bizarre way of keeping branches. They use the old and confusing concept of SVN’s “branches are directories”. This can be a bit difficult, especially when importing. The bzr fast-import tool allows you to import multiple branches, and they will be created as different directories. I’m not sure yet how to handle this. Is there a way to get the total number of revisions of all branches? What happens if I delete a branch, can I get it back easily? Will space be freed in the repository after deleting the branch? Can I switch between branches easily, or do I need to keep these working trees checked out?

All this vagueness and general incompatibility make me think about dropping branch support in my conversions and only working on a head branch. I’ll still do benchmarks on branching and merging speed, but I just won’t import all the branches.

Meet the candidates

As a first real post, I’d like to introduce you to the candidate repositories that will be used in the performance tests. This list is not final yet, as the importers can have problems with anything remotely weird. I tried to pick a range of projects from each system.

Bazaar repositories

I had the most trouble finding suitable projects for Bazaar. While there are a lot of small projects, there’s almost no large project that has chosen Bazaar for their version control. The WhoUsesBzr wiki page lists some large projects, for example Drupal. However, their official development still takes place in SVN or CVS, which means these clones miss any branching / merging.

  • Emacs. Finally, I settled on Emacs. Emacs recently switched to Bazaar. Their choice was mostly motivated by political reasons, and there have been some complaints, but most projects seem to have complainers when switching version control systems. This is a big repository: it has almost 90000 commits and its repository is 300MB. It has a working tree of 104MB, which perhaps makes it one of the biggest repositories in the test.

  • Pkg-config. This is the smallest repository in the test. The repository is 1.8MB in size, with a working tree of less than a megabyte. It has just 187 commits.

  • Mailman. This is a reasonably large repository. It has 6700 commits, a repository of 73MB and a working tree of 20MB.

Mercurial repositories

I found a couple of nice Mercurial repositories that were used in the tests:

  • Mozilla-central. This is one of the repositories found on Mozilla’s site. It has a repository of 205MB, more than 15000 commits and a working tree of 284MB, which makes it the largest repo in the test.

  • Dovecot. Dovecot’s repo has 7500 commits and is 14MB, with a working tree of 6MB.

  • Octave. Octave is an open-source clone of MATLAB, without all the cool packages. It has just 8000 commits, but a repo of 60MB and a working tree of 29MB.

Git repositories

This was somewhat challenging too: while there are a lot of projects using Git, almost all the importers have trouble importing them. I will discuss the importers in another post, so I’ll just list the projects here.

  • Cairo. This is the smallest Git repository I’ve used. Cairo has a mostly linear history, with some merges happening. The repository has only about 5000 commits, but is still 16MB in size. This is probably because the project is quite large: the working directory is 10MB.

  • Coreutils. The Coreutils repository is about 30MB. It has around 25000 commits and a working dir of 9MB. It has some merges, but is mostly linear, like Cairo’s.

And the final candidate…

I had a lot of trouble finding the last repository. At first I wanted to use Git’s repository itself. However, it uses some octopus merges (merges with more than 2 parents) which cannot be imported correctly by “hg convert”, which ignores them. Furthermore, there was a bug in git-fast-export which made “bzr fast-import” crash on them.

Similarly, I had trouble importing both Linux-2.6 and Wine, on which the “hg convert” tool crashes because of an invalid byte encoding. Mercurial also had trouble importing the VLC repository, while bzr-fast-import couldn’t handle the Rubinius repository. Therefore, I’m still looking for a third repository to use with Git.

Bazaar, Git, Mercurial comparison: Introduction

This blog will describe my adventures in comparing different version control systems. In particular, I will look at three distributed systems: Bazaar, Mercurial, Git.

Why?

There are already some existing benchmarks comparing these systems. However, most of them are bogus in some way. For example, the Git Benchmarks are mostly outdated. Bazaar has some benchmarks too, but these measure the uninteresting tasks. There are some other benchmarks, but these are small scale, don’t specify how the measurements were done and generally aren’t a good measure.

Performance, of course, isn’t everything. If two systems are “fast enough”, then you shouldn’t pick one over the other on performance alone. However, if a system is so fast that you can do things you couldn’t do before (like merging within a second, or displaying differences between two branches instantly), then performance becomes a factor. If you’re working on a large repository, and your log command takes several seconds to run, then performance is a factor too.

What’s wrong with Bazaar’s benchmarks?

At first sight, Bazaar seems to have made some nice benchmarks. However, if you look more closely, you will see that they are lacking in several ways:

  • They measure project size wrongly. For example, Git uses hard links when cloning a repository, so repository size is not doubled. These benchmarks do not take that into account.
  • They don’t use full repository history. Their benchmarks are based on importing a single tarball from a project. This does not tell us anything about how a project scales in time. As we will see in future posts, Bazaar for example scales badly on a repository with a large number of commits.
  • They do not measure useful things. Things that they do measure, for example, are time for the first import. However, in my opinion, this is not a measurement to make a final decision on, as importing a project will only be done once.
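The hard-link point in the first item can be illustrated with a small, self-contained sketch. It does not touch Git at all: it creates one file, hard-links it under a second name, and compares a naive size count with one that counts each underlying file only once. The directory and file names are hypothetical, and the sketch assumes a filesystem that supports hard links.

```python
import os
import tempfile


def naive_size(root):
    """Add up the size of every file path under root."""
    return sum(
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _dirs, files in os.walk(root)
        for name in files
    )


def real_size(root):
    """Add up file sizes under root, counting each hard-linked file once."""
    seen = set()
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            if (st.st_dev, st.st_ino) not in seen:
                seen.add((st.st_dev, st.st_ino))
                total += st.st_size
    return total


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "original"))
    os.makedirs(os.path.join(tmp, "clone"))
    pack = os.path.join(tmp, "original", "objects.pack")
    with open(pack, "wb") as f:
        f.write(b"x" * 1000)
    # A hard link is a second name for the same data on disk.
    os.link(pack, os.path.join(tmp, "clone", "objects.pack"))
    naive = naive_size(tmp)
    real = real_size(tmp)

print(naive, real)  # prints: 2000 1000
```

A benchmark that adds up file sizes per path, as `naive_size` does, reports twice the space that the two directories actually occupy.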

Bazaar’s benchmarks are just one example of what is wrong with existing benchmarks, but they should illustrate the problem.

Comparison

So then, what will I measure? This blog will test the performance of all three systems on existing repositories. These repositories will differ in size, though there is an emphasis on bigger repositories (with tens of thousands of commits). To compare the systems, all repositories will be converted to all systems. Three native repositories will be picked from each system, so conversion shouldn’t be a factor in performance.

I will not look at the time it takes to convert the repositories, as these one-time tasks should play no role in the final developer workflow. I will look at things that you’ll actually do when working in a repository: merging branches, branching off, repository size and size increase, the time it takes to diff, commit or ask the status of your repository. These quantitative measurements should be acceptable for everybody and offer little room for discussion.

However, as said before, performance isn’t everything. Part of this series will also look at the qualitative aspect of the systems: how easy is it to do different tasks? What workflows do the systems allow? How reliable is the system: if I kill a process halfway in committing, will it hurt my repository? What if I corrupt some data? These posts will be somewhat subjective and offer room for discussion.

Finally, I’m not without an opinion of my own. Some of my posts will be more like columns on what difficulties I encounter during my tests. I have more experience with some systems than with others, so I can be wrong sometimes. These posts will allow others to correct me and offer another point of view.

Finally

The first posts will be on the conversion of the different systems. I’m mostly done, but it is hard to find repositories that can be converted to all systems. I hope readers will find this information useful, or at least entertaining :)