Re: Is there a way to speed up remote-hg?

John Szakmeister <john@xxxxxxxxxxxxxxx> · Sun, 21 Apr 2013 08:59:06 -0400

On Sat, Apr 20, 2013 at 7:07 PM, Felipe Contreras
<felipe.contreras@xxxxxxxxx> wrote:
> On Sat, Apr 20, 2013 at 6:07 AM, John Szakmeister <john@xxxxxxxxxxxxxxx> wrote:
>> I really like the idea of remote-hg, but it appears to be awfully slow
>> on the clone step:
>
> The short answer is no. I do have a couple of patches that improve
> performance, but not by a huge factor.
>
> I have profiled the code, and there are two significant places where
> performance is wasted:
>
> 1) Fetching the file contents
>
> Extracting, decompressing, transferring, and then compressing and
> storing the file contents is mostly unavoidable, unless we already
> have the contents of that file, which in Git would be easy to
> check by comparing the checksum (SHA-1). Unfortunately, Mercurial
> doesn't have that information. The SHA-1 that is stored is not of the
> contents, but of the contents and the parent checksum, which means that
> if you revert a modification you made to a file, or move a file, or do
> any operation that ends up with the same contents, but from a different
> path, the SHA-1 is different. This means the only way to know if the
> contents are the same is to extract them and calculate the SHA-1
> yourself, which defeats the purpose of the calculation.
>
> I've tried calculating the SHA-1 and using a previous reference to
> avoid the transfer. Compared with doing the transfer and letting Git
> check for existing objects, it doesn't make a difference.
>
> This is by Mercurial's stupid design, and there's nothing we, or
> anybody could do about it until they change it.

That's a bummer. :-(
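
To make the hashing point above concrete, here is a minimal sketch in
Python. It assumes Mercurial's file node ID is the SHA-1 of the two
parent node IDs (sorted) followed by the file text, and that a Git blob
ID is the SHA-1 of a "blob <size>\0" header followed by the contents.
The function names are made up for illustration and are not part of
remote-hg or of either tool's API.

```python
import hashlib

# Mercurial uses 20 zero bytes for a missing parent.
NULL_ID = b"\0" * 20


def hg_file_node(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Illustrative Mercurial-style node ID: SHA-1 over sorted parents, then text."""
    first, second = sorted((p1, p2))
    digest = hashlib.sha1(first)
    digest.update(second)
    digest.update(text)
    return digest.digest()


def git_blob_id(text: bytes) -> bytes:
    """Git blob ID: SHA-1 over a 'blob <size>\\0' header plus the contents."""
    header = b"blob %d\0" % len(text)
    return hashlib.sha1(header + text).digest()


if __name__ == "__main__":
    original = b"hello\n"
    changed = b"hello, world\n"

    # A file is added, modified, then reverted to its original contents.
    node_added = hg_file_node(original)
    node_changed = hg_file_node(changed, p1=node_added)
    node_reverted = hg_file_node(original, p1=node_changed)

    # Same contents, different history: the Mercurial-style IDs differ.
    print(node_added.hex() == node_reverted.hex())
    # The Git blob ID depends on the contents only, so nothing changes.
    print(git_blob_id(original).hex())
```

Under these assumptions the node ID alone cannot tell an importer that
the reverted file matches a blob it has already stored, which is why the
contents have to be extracted first.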

> 2) Checking for file changes
>
> For each commit (or revision), we need to figure out which files were
> modified, and for that, Mercurial has a neat shortcut that stores such
> modifications in the commit context itself, so it's easy to retrieve.
> Unfortunately, it's sometimes wrong.
>
> Since the Mercurial tools never use this information for any real
> work, only to show the changes to the users, Mercurial folks never
> noticed the contents they were storing were wrong. Which means if you
> have a repository that started with old versions of Mercurial, chances
> are this information would be wrong, and there's no real guarantee
> that future versions won't have this problem, since to this day this
> information continues to be used only to display stuff to the user.
>
> So, since we cannot rely on this, we need to manually check for
> differences the way Mercurial does, which blows performance away,
> because you need to get the contents of the two parent revisions and
> compare them. By contents I mean the manifest, or list of
> files, which takes a considerable amount of time.

Eek!
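
A rough sketch of the manifest comparison described above, in Python.
It models a manifest as a plain dict mapping file path to file node ID,
which is an assumption made for illustration; it does not use
Mercurial's API, and `manifest_changes` is a hypothetical helper, not
remote-hg's actual code.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a manifest maps each tracked path to its file node ID.
Manifest = Dict[str, str]


def manifest_changes(parent: Manifest, current: Manifest) -> Tuple[List[str], List[str]]:
    """Return (modified_or_added, removed) paths between two manifests."""
    # Every path in the current manifest is looked up in the parent,
    # whether or not the commit touched it.
    modified = sorted(
        path for path, node in current.items() if parent.get(path) != node
    )
    # Paths present in the parent but absent now were removed.
    removed = sorted(path for path in parent if path not in current)
    return modified, removed


if __name__ == "__main__":
    parent = {"README": "a1", "main.c": "b2", "old.c": "c3"}
    current = {"README": "a1", "main.c": "d4", "new.c": "e5"}
    print(manifest_changes(parent, current))
```

The work here grows with the number of files in the repository rather
than with the number of files the commit changed, and it has to be
repeated for every commit, which is consistent with most of the time
being spent in this step.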

> For 1) there's nothing we can do, and for 2) we could trust the files
> Mercurial thinks were modified, and that gives us a very significant
> boost, but the repository will sometimes end up wrong. Most of the
> time is spent on 2).
>
> So unfortunately there's nothing we can do; that's just Mercurial's
> design, and it really has nothing to do with Git. Any other tool would
> have the same problems, even a tool that converts a Mercurial
> repository to Mercurial (without using tricks).
[snip]

That's unfortunate, but thank you for taking the time to explain!

-John