Re: Is there a way to speed up remote-hg?

Felipe Contreras <felipe.contreras@xxxxxxxxx> · Sat, 20 Apr 2013 18:07:41 -0500

On Sat, Apr 20, 2013 at 6:07 AM, John Szakmeister <john@xxxxxxxxxxxxxxx> wrote:
> I really like the idea of remote-hg, but it appears to be awfully slow
> on the clone step:

The short answer is no. I do have a couple of patches that improve
performance, but not by a huge factor.

I have profiled the code, and there are two significant places where
performance is wasted:

1) Fetching the file contents

Extracting, decompressing, transferring, and then compressing and
storing the file contents is mostly unavoidable, unless we already
have the contents of that file. In Git that would be easy to check
by looking at the checksum (SHA-1). Unfortunately Mercurial doesn't
have that information. The SHA-1 it stores is not of the contents
alone, but of the contents plus the parent checksums. So if you
revert a modification you made to a file, or move a file, or do any
other operation that ends up with the same contents but reached by a
different path, the SHA-1 is different. This means the only way to
know if the contents are the same is to extract them and calculate
the SHA-1 yourself, which defeats the purpose of having the checksum
in the first place.
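
To make the difference concrete, here is a rough sketch of the two
hashing schemes as I understand them. The function names are made up
for illustration, and the Mercurial side ignores the copy metadata
that can be prepended to file data, so treat it as an approximation
rather than the real revlog code:

```python
import hashlib

NULL_ID = b"\0" * 20  # Mercurial's "no parent" node id


def git_blob_id(data: bytes) -> str:
    """Git blob id: SHA-1 over a small header plus the contents."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def hg_file_node(data: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> str:
    """Mercurial file node: SHA-1 over the sorted parent ids plus the contents."""
    a, b = sorted((p1, p2))
    return hashlib.sha1(a + b + data).hexdigest()


original = b"hello\n"
modified = b"hello world\n"

# Three revisions of one file: create, modify, revert to the original.
v1 = hg_file_node(original)
v2 = hg_file_node(modified, p1=bytes.fromhex(v1))
v3 = hg_file_node(original, p1=bytes.fromhex(v2))

# Same contents, but the Mercurial ids differ because the parents differ.
print(v1 == v3)
# The Git id depends only on the contents, so it is the same both times.
print(git_blob_id(original))
```

With Git's scheme the id alone tells you whether you already have the
object; with Mercurial's you have to extract the contents first.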

I've tried both approaches: calculating the SHA-1 and using a
previous reference to avoid the transfer, or doing the transfer and
letting Git check for existing objects. It doesn't make a difference.

This is a consequence of Mercurial's stupid design, and there's
nothing we, or anybody else, can do about it until they change it.

2) Checking for file changes

For each commit (or revision), we need to figure out which files were
modified, and for that, Mercurial has a neat shortcut that stores such
modifications in the commit context itself, so it's easy to retrieve.
Unfortunately, it's sometimes wrong.

Since the Mercurial tools never use this information for any real
work, only to show the changes to users, the Mercurial folks never
noticed that the contents they were storing were wrong. This means
that if you have a repository that was started with old versions of
Mercurial, chances are this information is wrong, and there's no
real guarantee that future versions won't have this problem, since
to this day the information is used only to display stuff to the
user.

So, since we cannot rely on this, we need to manually check for
differences the way Mercurial does, which kills performance, because
you need to get the contents of the two parent revisions and compare
them. By contents I mean the manifest, or list of files, and
retrieving it takes a considerable amount of time.
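
As a rough sketch of what that manual check amounts to, assume each
manifest is available as a plain dict mapping a path to a (file node,
flags) pair. The function name and data layout here are hypothetical,
not Mercurial's API, and the sketch compares a revision against a
single parent:

```python
def manifest_changes(parent_manifest, current_manifest):
    """Compare two manifests and return (changed, removed) sets of paths.

    Each manifest is assumed to be a dict: path -> (file_node, flags).
    """
    changed = set()
    remaining = dict(parent_manifest)  # copy, so the caller's dict is untouched

    for path, entry in current_manifest.items():
        if path not in remaining:
            changed.add(path)          # new file
        else:
            if remaining[path] != entry:
                changed.add(path)      # node or flags differ
            del remaining[path]

    removed = set(remaining)           # in the parent, gone from the current one
    return changed, removed


parent = {
    "README": ("a1", ""),
    "run.sh": ("b1", ""),
    "old.txt": ("c1", ""),
}
current = {
    "README": ("a2", ""),    # contents changed
    "run.sh": ("b1", "x"),   # only the exec flag changed
    "new.txt": ("d1", ""),   # added
}

changed, removed = manifest_changes(parent, current)
print(sorted(changed), sorted(removed))
```

The comparison itself is cheap; the cost is in loading the full
manifests for every revision just to build those two dicts.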

For 1) there's nothing we can do, and for 2) we could trust the files
Mercurial thinks were modified, and that gives us a very significant
boost, but the repository will sometimes end up wrong. Most of the
time is spent on 2).

So unfortunately there's nothing we can do. That's just Mercurial's
design, and it really has nothing to do with Git. Any other tool
would have the same problems, even a tool that converts a Mercurial
repository to Mercurial (without using tricks).

It seems Bazaar is more sensible in this regard: 1) the checksums are
truly of the file contents, and 2) each revision does store the file
modifications correctly. So a clone in Bazaar is much faster. In my
opinion Mercurial just screwed up their design.

Cheers.

-- 
Felipe Contreras