Tuesday, May 27, 2008

Import tools

Over the last few days I have spent some time with the different import tools that exist. Basically there are two ways to convert a repository: by using a converter that is native to the target system, or by using the fast-{import,export} tools.

A bit of history.

Somewhere in 2006, Git introduced the git fast-import tool. It was created to allow fast importing of tarballs and to allow other revision systems to export their data easily and quickly to Git.
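To give an idea of what such a front-end has to produce, here is a minimal sketch in Python of a stream with one blob and one commit, in the format git fast-import reads on standard input. The function names, the committer details and the file name are made up for illustration, and the sketch assumes a simple path that needs no quoting.

```python
def data_block(payload: bytes) -> bytes:
    # fast-import's "data" command: exact byte count, the raw bytes, optional newline.
    return b"data %d\n" % len(payload) + payload + b"\n"


def single_commit_stream(path, content, message, committer, email, timestamp,
                         ref="refs/heads/master"):
    # Hypothetical helper: one file, one commit, on one branch.
    parts = [
        b"blob\n",
        b"mark :1\n",                      # remember the blob as mark :1
        data_block(content),
        b"commit " + ref.encode("utf-8") + b"\n",
        b"mark :2\n",
        ("committer %s <%s> %d +0000\n" % (committer, email, timestamp)).encode("utf-8"),
        data_block(message.encode("utf-8")),
        b"M 100644 :1 " + path.encode("utf-8") + b"\n",  # add the file from mark :1
    ]
    return b"".join(parts)


stream = single_commit_stream("README", b"hello\n", "Initial import",
                              "A U Thor", "author@example.com", 1211846400)
```

Writing these bytes to the standard input of git fast-import inside an empty repository should create a single commit on the given branch.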

Examples of front-ends are cvs2svn and hg-fast-export.py, which allow exporting CVS and Mercurial repositories to Git.

In the meantime, more importers and exporters have been created. For example, Bazaar has both a bzr fastimport plugin and an export script. Git has also implemented a git-fast-export, which should allow you to export your Git repository to any of the importers. Mercurial does not have a fast-import variant yet.
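In principle, a conversion is then just a pipe from one system's exporter into another system's importer. A rough sketch in Python, where the function name and the directory names are placeholders of mine and the target repository is assumed to exist already:

```python
import subprocess


def pipe_export_to_import(export_cmd, import_cmd, export_cwd=None, import_cwd=None):
    # Start the exporter and feed its standard output straight into the importer.
    exporter = subprocess.Popen(export_cmd, cwd=export_cwd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(import_cmd, cwd=import_cwd, stdin=exporter.stdout)
    # Close our copy of the pipe so the exporter notices if the importer dies early.
    exporter.stdout.close()
    import_status = importer.wait()
    export_status = exporter.wait()
    return export_status, import_status


# Example: copy all branches of one Git repository into another (already initialised) one.
# pipe_export_to_import(["git", "fast-export", "--all"], ["git", "fast-import"],
#                       export_cwd="old-repo", import_cwd="new-repo")
```

Both exit codes are returned because either side can fail on its own; a non-zero value on the import side usually means the importer rejected something in the stream.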

The reality

The reality, however, is that it is very difficult to make these tools work together. Git’s fast-import tool seems the most robust and is willing to take in anything well-formed. There was a bug in Git’s fast-export tool with commits that have multiple parents, but this has since been fixed.

The hg-fast-export tool also seems to work fairly well. It exports all branches it has, and is not likely to crash. It has successfully exported all the repositories I gave it.

Bazaar’s tools seem to have the most problems. The fast-import front-end will crash on invalid encodings and other peculiarities that Git’s tool seems to have no problem with. The fast-export tool has several problems which I tried to fix. One of these is handling ghost commits, which is a somewhat bizarre feature of Bazaar where you can say a commit was a merge without supplying one of its parents. When Bazaar has recursive renames (for example, renaming “a/” to “b/” and “a/a” to “b/c”) it will produce output that is invalid for git fast-import. I also get a different number of revisions after the import is complete. I’m not sure yet what’s going on there. Furthermore, Bazaar’s tool might as well be called slow-import, as it can take a day to import a somewhat large repository.

The same was true when importing the Emacs repository; I couldn’t find any way to convert it into a Git repository. Currently I’m using the Emacs repository on repo.or.cz, but that one differs in history from the Bazaar one.


As Mercurial does not have a fast-import tool, I have to use the “hg convert” command/plugin to import repositories into Mercurial. This tool seems to crash every so often, especially if you provide it with a non-standard Git repository, like git.git or the Linux kernel. I’m also not sure if it handles branches correctly; sometimes it seems to import them, and sometimes it seems to just ignore them. As Mercurial has no way to import Bazaar repositories, I had to import the Git versions of them. This means that most conversions have a different number of commits in all three versions.

The branching problem

All three systems handle branches differently. With Git, branches are just references to specific commits that are kept totally out of the repository history. Mercurial records the branch name in each changeset, so the branches are part of the history and you can see how they developed. Bazaar seems to have the most bizarre way of keeping branches: it uses the old and confusing SVN concept of “branches are directories”. This can be a bit difficult, especially when importing. The bzr fast-import tool allows you to import multiple branches, and they will be created as different directories. I’m not sure yet how to handle this. Is there a way to get the total number of revisions of all branches? What happens if I delete a branch; can I get it back easily? Will space be freed in the repository after deleting the branch? Can I switch between branches easily, or do I need to keep these working trees checked out?
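To make the Git side of this concrete: a branch there is nothing more than a small file holding a commit ID. The sketch below, in Python, lists the loose branch files of a repository; the function name is my own, and it deliberately ignores branches that Git has moved into the packed-refs file, so treat it as an illustration rather than a replacement for git branch.

```python
from pathlib import Path


def loose_branches(git_dir):
    # Map branch name -> commit ID for every loose ref under refs/heads.
    # Assumption: refs stored in packed-refs are not considered here.
    heads = Path(git_dir) / "refs" / "heads"
    branches = {}
    if not heads.is_dir():
        return branches
    for ref_file in sorted(heads.rglob("*")):
        if ref_file.is_file():
            name = ref_file.relative_to(heads).as_posix()
            branches[name] = ref_file.read_text().strip()
    return branches


# Example: loose_branches(".git") inside a working tree.
```

Deleting such a file removes the branch name but not the commits it pointed to, which is why the branch is not part of the history itself.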

All this vagueness and general incompatibility make me think about dropping branch support in my conversions and only working on a head branch. I’ll still do benchmarks on branching and merging speed, but just not import all the branches.


Michael Haggerty said...

To import emacs into git, you could take a recent CVS snapshot and use cvs2svn/cvs2git to convert directly from CVS to git. AFAIK emacs has only recently migrated to bzr, so the repository contents shouldn't be that much different from the bzr contents.

Pieter said...

After some hacking on Bazaar's fast-export tool, I was able to export it successfully.

You're right that the Emacs repository currently still mirrors the CVS repo. I don't think any real development is going on in Bazaar. This is actually one of the situations I tried to avoid. However, the Emacs repository might still be interesting because it's huge.

Zooko said...

Have you tried tailor?


I've used it quite a bit over the years. Its main limitation is that it doesn't try to manage branches -- it just tracks a single line of history.

SamB said...

Well, in defense of Bazaar:

Those branches are not just directories within an undifferentiated repository tree like in SVN; they are clearly marked (by stuff in the .bzr directory) as branches. If it isn't a branch, you can't branch, checkout, or merge from it.

And, assuming you use a shared repository for a collection of branches, you don't lose any commits when you delete a branch; all of the commits are stored in the shared repository, and no garbage collection is done. (At least, I don't think they've implemented any garbage collection yet. Even if they have/when they do, it will almost certainly require manual invocation.)

Really, the main differences between bzr and git are:

#1. bzr tracks files, directories, and their histories (that is, it tracks renames explicitly); git just tracks the history of the tree as a whole

#2. bzr is ridiculously extensible, to the point where you can install plugins that allow (nearly) seamless integration with other VCSs -- even SVN! (Obviously, the sloppy treatment of branches in SVN makes this potentially not seamless at all, depending on what kind of scheme (if any) the branches are arranged in, but with most projects, which follow one of a small number of standard schemes, it works great.)

#3. various minor details about the layout of repositories/branches in the filesystem / URL space. Like, each branch has its own URL, you don't often grab all of the branches from another repository (and there's no place in your repository where such branches "ought" to go), branches are stored in directories rather than as a tree of files containing nothing but revision IDs, you can have working trees in branch directories if you want...

Of course, the downside of #2 is that, since bzr plugins are much more tightly integrated than git add-on packages are, they are also more subject to bit-rot and tend to need relatively frequent updating (if only to update the allowed bzr version ranges). Thankfully, this seems to happen pretty well with most of the ones that are really useful.
