Cleaning up after migrating from Hg to Git

There is a lot of guidance out there on how to migrate from Mercurial to Git, but it often leaves you with a repository in a bad state, even more so if it was originally a Subversion repository, then migrated to Mercurial and now finally to Git.

The Lokad.Cloud repository was such a case. The committers and authors in the commit history were a complete mess, but that's not much of an issue in practice. Worse is the fact that most text files were stored internally with CRLF line endings instead of LF. Git supports platform-native checkouts (CRLF on Windows, LF on Linux) quite nicely, but this only works well if text files are normalized to LF internally when committed. I strongly recommend doing that, as it will save you a lot of trouble later on. Luckily it is also the default behavior for new repositories.

Migration: Fast-Export to Git

This is the usual procedure that properly converts branches and tags to the git equivalents:

git clone git://repo.or.cz/fast-export.git
mkdir git_repo && cd git_repo
git init
/path/to/hg-fast-export.sh -r /path/to/mercurial_repo
git checkout HEAD

Normalize the whole history to LF line-endings

This step is only needed if some or all of the commits were stored with non-LF line endings internally. If the repository once lived in Subversion on Windows, this is almost certainly the case; for pure Mercurial repositories, not necessarily. To find out whether it affects you, remove your Git index and then reset. If a lot of files are now listed as modified, you should fix it as described here; if not, you can skip this step.

rm .git/index
git reset
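
If you would rather not touch the index, a rough alternative check is to count the text files in a commit that contain a carriage return. This is only a sketch: the function name count_crlf_files is made up for illustration, and it assumes a POSIX shell with git on the PATH.

```shell
# Hypothetical helper (not part of Git): count the text files in a
# revision that contain at least one carriage return.
count_crlf_files() {
    rev="${1:-HEAD}"
    cr="$(printf '\r')"   # a literal CR character, used as the search pattern
    # -I skips files Git considers binary, -l prints each matching file once
    git grep -I -l -e "$cr" "$rev" | wc -l | tr -d ' '
}

# Usage: count_crlf_files HEAD
```

A count well above zero suggests the history is worth normalizing; a count of zero for the commits you care about suggests you can skip this step.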

I recommend doing this step on Linux, as it didn't work well for me on Windows.

First we need to turn off any automated Git end-of-line handling. Unfortunately, for historical reasons, this is controlled in multiple places. To begin with, there is the core.autocrlf config option, which we need to turn off:

git config core.autocrlf false

Then we need to get rid of all the .gitattributes files in the repository, in case they specify any automatic EOL handling. This is not necessary in most cases, but the repository I was dealing with used to be a hybrid Git/Mercurial repo some time ago and thus already had a .gitattributes file. If there is one, delete it and commit. Afterwards your working directory should be clean, since Git no longer wants to fix the line endings of any touched text files.
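
To see whether the current checkout tracks any such files, something along these lines may help. It is a sketch under the assumption of a POSIX shell; list_gitattributes is a hypothetical name, not a Git command.

```shell
# Hypothetical helper: list every tracked .gitattributes file, at any depth.
list_gitattributes() {
    # The pattern matches ".gitattributes" at the top level or below any directory
    git ls-files | grep -E '(^|/)\.gitattributes$'
}

# Usage: list_gitattributes
```

If this prints files in subdirectories, keep in mind that a plain path such as .gitattributes passed to git rm refers to the top-level file only, so nested files would need to be named separately.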

But to make sure the .gitattributes files in previous commits don't mess with us, we need to drop them from all commits (a single command, split across two lines here):

git filter-branch --prune-empty --index-filter \
   'git rm --cached --ignore-unmatch .gitattributes' -- --all

After that we can finally convert all the text files to LF line endings, with another history rewrite (a single command, split across two lines here):

git filter-branch -f --prune-empty --tree-filter \
   'git ls-files -z | xargs -0 dos2unix --skipbin' -- --all

For every commit, this converts all non-binary files to LF endings using dos2unix. In my case some paths contain spaces (don't ask...), so I switched to NUL-character separation using the -z and -0 options.

To ensure the normalization is enforced in future commits (especially from people forking your repository and then sending you pull requests), create a new .gitattributes file containing at least something like * text=auto. The config option core.autocrlf, on the other hand, is local to each clone and has been largely superseded by .gitattributes. You can remove it completely using

git config --unset core.autocrlf
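
As for the new .gitattributes file mentioned above, a minimal version could be created like this. It is a sketch only: the function name write_gitattributes is hypothetical, and the *.png line is an illustrative assumption about what your repository contains.

```shell
# Hypothetical helper: write a minimal .gitattributes into the current directory.
write_gitattributes() {
    cat > .gitattributes <<'EOF'
* text=auto
*.png binary
EOF
}

# Usage (README.txt and logo.png are placeholder paths):
#   write_gitattributes
#   git check-attr text -- README.txt logo.png
```

git check-attr reports which attribute value applies to a path, so it is a quick way to confirm the patterns match what you expect before committing the file.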

Clean up committers and authors

You can get a quick overview of how bad the author information is using

git shortlog -se

Luckily, fixing them is not that difficult, with yet another history rewrite:

git filter-branch -f --env-filter '
if [ "$GIT_COMMITTER_NAME" = "bad user name" ]
then
export GIT_COMMITTER_NAME="correct user name"
export GIT_COMMITTER_EMAIL="correct email address"
fi
if [ "$GIT_AUTHOR_NAME" = "bad user name" ]
then
export GIT_AUTHOR_NAME="correct user name"
export GIT_AUTHOR_EMAIL="correct email address"
fi
' -- --all
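
If there are several bad names to fix, the same filter can be wrapped in a small function and called once per name. This is a sketch, not a tested tool: fix_identity and the OLD_NAME, NEW_NAME and NEW_EMAIL variables are names I made up, and it relies on the filter inheriting environment variables from the calling shell.

```shell
# Hypothetical wrapper around the env-filter shown above.
# Arguments: <bad name> <correct name> <correct email>
fix_identity() {
    # The three variables are placed in git's environment so that the
    # filter script can read them.
    OLD_NAME="$1" NEW_NAME="$2" NEW_EMAIL="$3" \
    git filter-branch -f --env-filter '
        if [ "$GIT_COMMITTER_NAME" = "$OLD_NAME" ]; then
            export GIT_COMMITTER_NAME="$NEW_NAME"
            export GIT_COMMITTER_EMAIL="$NEW_EMAIL"
        fi
        if [ "$GIT_AUTHOR_NAME" = "$OLD_NAME" ]; then
            export GIT_AUTHOR_NAME="$NEW_NAME"
            export GIT_AUTHOR_EMAIL="$NEW_EMAIL"
        fi
    ' -- --all
}

# Usage: fix_identity "bad user name" "correct user name" "correct email address"
```

Each call rewrites the whole history again, so for a long list of names it may be faster to extend the filter with more conditions instead.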

Housekeeping

After all these rewrites it would be a good time to do some git maintenance, i.e.

git fsck --full

to check and verify your repository, drop objects that are no longer reachable with

git prune

and then clean up and optimize your local repository using

git gc --aggressive
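
One caveat: git filter-branch keeps the pre-rewrite refs as backups under refs/original/, and the reflog still points to the old commits as well, so the old objects stay reachable and pruning finds little to remove until both are gone. A minimal sketch of dropping them follows; the function name drop_filter_branch_backups is hypothetical. It discards the backups for good, so only run it once you are happy with the rewritten history.

```shell
# Hypothetical helper: remove filter-branch backups so old objects can be pruned.
drop_filter_branch_backups() {
    # Delete every backup ref that filter-branch stored under refs/original/
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do
            git update-ref -d "$ref"
        done
    # Expire all reflog entries so they no longer keep old commits alive
    git reflog expire --expire=now --all
    # Repack and prune unreachable objects immediately
    git gc --quiet --prune=now
}

# Usage: drop_filter_branch_backups
```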