Sunday, June 1, 2008

Git, Mercurial, Bazaar Repository Size Benchmark

I have finished all conversions of the repositories that I’m going to test. The first interesting metric is of course the repository size. This is interesting because Bazaar claims on its homepage that “Bazaar’s default storage format is highly efficient, better than its main competitors in most cases”. They include benchmarks that are meant to show Bazaar is more efficient than Git and Mercurial specifically.

As I mentioned before, this benchmark is bogus because it does not include project history. While Bazaar acknowledges this, they still mention space efficiency as one of their prime benefits. They also mention that their benchmarks are done on real use cases, not arbitrary processes.

At the same time, Mercurial states it has an “Extremely high-performance delta-compressed storage scheme”. Git mentions “It also uses an extremely efficient packed format for long-term revision storage that currently tops any other open source version control system.”

Test method

For all repositories, only a single branch was converted. For all repositories except Samba, this meant the development branch. As Samba has multiple development branches, I chose the v3-3-test branch.

I made use of the fast-export/import interface where possible. This means that all conversions were done using fast-export, except Git to Mercurial, for which I used “hg convert”, and Bazaar to Mercurial. This last one was a bit tricky, as Mercurial has no Bazaar importer. I therefore converted the Git version of the same repository to Mercurial.

After conversion, I ran a pack command for the repositories that support this. For Git, this meant “git repack -adf --window=250”; for Bazaar, it meant “bzr pack” followed by removing the obsolete packs.

Tests were done using Git v1.5.5.3, Bzr v1.5 and Hg v1.0.
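The post does not say how the repository sizes were measured. One plausible way is to sum the sizes of all files under each repository's metadata directory. The following is a minimal sketch under that assumption; the function names are illustrative and this is not necessarily how the numbers in this post were produced.

```python
import os


def directory_size_bytes(path):
    """Return the total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so that linked files are not counted.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def directory_size_mb(path):
    """Same as directory_size_bytes, expressed in MB (1 MB = 1024 * 1024 bytes)."""
    return directory_size_bytes(path) / (1024 * 1024)
```

Under this assumption one would call, for example, `directory_size_mb("project/.git")` after repacking, and the equivalent for the `.hg` and `.bzr` directories of the converted repositories (the `project` path is hypothetical).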


So then, using actual, real-world data, which system has the best storage efficiency? Below is the table of all projects.

Project      Git      Mercurial   Bazaar
Cairo        15MB     24MB        30MB
Coreutils    29MB     44MB        76MB
Samba        82MB     146MB       310MB
Octave       22MB     49MB        57MB
Mozilla      78MB     205MB       255MB
Dovecot      9MB      14MB        23MB
Emacs        120MB    163MB       300MB
Mailman      42MB     75MB        73MB
Pkgconfig    1.1MB    1.3MB       1.8MB


As can be seen from the table, Git really is the most efficient in storing the data. Next up is Mercurial, which also does a nice job. Bazaar is the least efficient by far, taking on average 2.8 times the space of an equivalent Git repository.
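The 2.8 figure can be reproduced from the table if it is read as the ratio of total sizes across all projects. This is a minimal sketch, assuming the three size columns are Git, Mercurial and Bazaar in that order (an inference from the text above); the names used are illustrative.

```python
# Sizes in MB copied from the table. The column order (Git, Mercurial, Bazaar)
# is an assumption inferred from the surrounding text.
SIZES_MB = {
    "Cairo": (15, 24, 30),
    "Coreutils": (29, 44, 76),
    "Samba": (82, 146, 310),
    "Octave": (22, 49, 57),
    "Mozilla": (78, 205, 255),
    "Dovecot": (9, 14, 23),
    "Emacs": (120, 163, 300),
    "Mailman": (42, 75, 73),
    "Pkgconfig": (1.1, 1.3, 1.8),
}

GIT, MERCURIAL, BAZAAR = 0, 1, 2


def total_ratio(sizes, column):
    """Summed size of `column` divided by the summed Git size."""
    git_total = sum(row[GIT] for row in sizes.values())
    other_total = sum(row[column] for row in sizes.values())
    return other_total / git_total


def mean_ratio(sizes, column):
    """Plain mean of the per-project ratios of `column` to Git."""
    ratios = [row[column] / row[GIT] for row in sizes.values()]
    return sum(ratios) / len(ratios)


print(f"Bazaar/Git, ratio of totals:       {total_ratio(SIZES_MB, BAZAAR):.2f}")
print(f"Bazaar/Git, mean of project ratios: {mean_ratio(SIZES_MB, BAZAAR):.2f}")
print(f"Mercurial/Git, ratio of totals:    {total_ratio(SIZES_MB, MERCURIAL):.2f}")
```

Under that reading, the Bazaar-to-Git ratio of totals comes out at roughly 2.8, while the plain mean of the per-project ratios is lower, at roughly 2.5, because the largest repositories (Samba, Mozilla, Emacs) dominate the totals.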


Anonymous said...

Nice blog and it's nice to get some up-to-date information. Looking forward to benchmarks.

About the popularity of these three DVCS tools: the Debian popularity contest is a system which automatically collects information about which Debian packages are installed, which are used most often, etc. Behind the following link there's a graph which shows the number of users who use these DVCS tools regularly. It's pretty clear that Git is the most popular among Debian users.

Debian popularity contest: bzr, git-core and mercurial

gebi said...

to really shrink a big git repository for longterm storage something like this should be used:
git repack -a -d --depth=250 --window=250 -f

this step has to be done only once and only on the server. But it takes quite some time and you should have quite some spare ram.
the last repository i compressed with this used ~7GB ram at compression (which resulted in an 800MB .git directory).

no such amount of ram is required at using the repository though.

Pieter said...

@gebi: I did some benchmarks with the --depth parameter to see how much it would influence results (I will post those shortly).

The results show that increasing the --depth size does not really change how tightly packed a repository is (for example, the emacs repository had 3MB gain (about 2%) when using --depth=100). Increasing the depth will seriously make the repacking slower though. That is why I only repacked with --window=250 (which, as the results will show, is also not necessary, --window=150 is usually enough).

gebi said...

yea it really depends on the repository type and behaviour pattern within.

the gcc repository, e.g., halves in size with this because of the endless changelog files.

brandon said...

Actually, I think you should have used the default git repack parameters. The default --window setting for git is 10. By "tweaking" the options for only git's repository packing, you may lead people to believe that your experiments have some bias in favor of git.

It may be true that the others have no ability to affect repository repacking, but still, I think you should have used git's defaults. A separate comparison would be more appropriate to demonstrate how git's packing options affect pack size.

That said, I think the results using git's default repacking options would have produced similarly small repositories which would still have bested the competitors.

Pieter said...

@brandon: I disagree with that. It's true that I didn't use the default values. However, I stated explicitly in my test method which value I used. Furthermore, there is also a remark in the fast-import man page that you should repack the repository with higher values.

Also remember that you only have to do this once. After that, you'll probably only do a "git gc" which reuses an existing pack. The initial packing should be done by the one doing the conversion, and he should know how to repack the repository.

Furthermore, this is of course some kind of "best-case" scenario, where I hope that the people doing the conversion have at least some clue about what they are doing. I could also create random benchmarks, just doing things until I have a working repository. The results would be different, and less interesting.

However, looking at some tests I did, it seems it does not matter much what value you use on these repositories -- anything above 20 or so will produce a nice result.

Anonymous said...

An interesting remark is that Mercurial achieves quite a good compression ratio without needing a repack interface (the size of the unpacked git repos would be interesting).

Anonymous said...

As you have already started, could you do some more interesting benchmarks, like the time to do operations such as checkout, reviewing history, etc.?

Space consumed nowadays should be almost everyone's last criterion...

Anonymous said...

"An interesting remark is that Mercurial achieve a quite good compression ratio without needing a repack interface (the size of the unpacked git repos would be interesting)."

I think the difference here is that Mercurial _only_ creates a packed repository. This method has pros and cons.

One advantage of this method is that an explicit packing operation is not needed. Space is also minimally used since the content of new commits is packed immediately.

A disadvantage is that packing in this on-going fashion increases the amount of work that must be performed during _each_ commit. This can make things slower, or require the use of a less complex pack algorithm (to make things fast enough).

Comparing an unpacked git repository to a mercurial repository is not really appropriate. This is because a git repository is generally _not_ used without packing and so creating one just to compare with mercurial would be a contrived example.

Creating packs is a normal operation in git and may be done automatically or manually. It is automatically done based on a threshold of unpacked objects. It may also be performed manually when it is convenient for the user (say at the end of a work day as the last command before leaving your computer). Deferring pack operations in this way provides a tradeoff in terms of disk space for speed, and allows a more complex pack algorithm to be employed at a later time.

Emanuele Aina said...

Mh, Mercurial repos are half-way between git packed and unpacked repositories as they are delta compressed but still fast to operate on.

There is some ongoing work to do real packed repositories in Mercurial, by basing them upon a bundle (the same used for exchanging changesets) as they are very compact.

Maybe someone could also measure the size of a full Mercurial bundle, to see what can be accomplished by pursuing this approach.

Still, the price for not having to repack (which takes its time) seems fair enough for me. :)

Jakub Narebski said...

It would be nice/interesting to have a graph (plot) showing the size of the (packed) repository depending on the number of commits ("git rev-list HEAD | wc -l") for all 9 tested repositories, for all tested version control systems.

Pieter de Bie said...

@Jakub: I tried to do something similar, but the problem is that repository size on these benchmarks is not very correlated to the number of commits, making the graph very hard to read. I actually wanted to see if the size increases linearly or less with number of commits, but such a trend can't be found with this data.

Ebrahim said...

Please repeat your benchmark with latest versions:
Git 1.6.5
Mercurial 1.4.1
Bazaar 2.0
