Re: Benchmarks regarding git's gc

On 11/08/2011 12:34 PM, Felipe Contreras wrote:
> Has anybody seen these?
> http://draketo.de/proj/hg-vs-git-server/test-results.html#results
> 
> Seems like a potential area of improvement.

The fact that git requires periodic garbage collection is indeed
annoying (even in interactive use) and even more annoying in the
scenario discussed by the author of this article.

With respect to the article's claims about the overall efficiency of
Mercurial vs. git, I would like to point out that the author's use of a
test repository with a linear history avoids one of Mercurial's big
design weaknesses.  If the repository had had a branching history,
Mercurial's numbers would probably be significantly less flattering.

Mercurial's revlog repository format [1] (at least the last time I
checked) uses a single data file to hold the contents of all versions of
a single file in the working copy.  It appends a delta to the end of the
revlog file for each revision, with periodic fulltexts.  It is designed
to make it possible to reconstruct any file revision via a single seek
and a single read of at most twice the length of the file's fulltext
(assuming that the index is already known).  The avoidance of disk seeks
goes a long way toward explaining Mercurial's competitive performance
despite the fact that it is written in Python.
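
To make the layout above concrete, here is a toy sketch in Python.  It
is not Mercurial's code or on-disk format: the names ToyRevlog,
make_delta and apply_delta are invented for illustration, the
prefix/suffix delta is a stand-in for Mercurial's real delta algorithm,
and the "store a fulltext once the chain would exceed twice the
fulltext" rule is my reading of the design goal described above.

```python
import struct


def make_delta(old: bytes, new: bytes) -> bytes:
    """Toy delta: keep only the bytes of `new` between the common prefix and suffix."""
    limit = min(len(old), len(new))
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return struct.pack(">II", p, s) + new[p:len(new) - s]


def apply_delta(old: bytes, delta: bytes) -> bytes:
    p, s = struct.unpack(">II", delta[:8])
    return old[:p] + delta[8:] + old[len(old) - s:]


class ToyRevlog:
    """Append-only data buffer plus an index of (offset, length, base_rev)."""

    def __init__(self):
        self.data = bytearray()
        self.index = []
        self._tip = b""  # fulltext of the most recently appended revision

    def add(self, text: bytes) -> int:
        rev = len(self.index)
        offset = len(self.data)
        if rev > 0:
            # The delta is always against the previous entry in the file.
            delta = make_delta(self._tip, text)
            base = self.index[rev - 1][2]
            span = offset + len(delta) - self.index[base][0]
        if rev == 0 or span > 2 * len(text):
            chunk, base = text, rev  # start a new chain with a fulltext
        else:
            chunk = delta
        self.data += chunk
        self.index.append((offset, len(chunk), base))
        self._tip = text
        return rev

    def read(self, rev: int):
        """Return (fulltext, bytes_read) using one contiguous slice of the data."""
        base = self.index[rev][2]
        start = self.index[base][0]
        end = self.index[rev][0] + self.index[rev][1]
        blob = bytes(self.data[start:end])  # the "single seek, single read"
        text, pos = b"", 0
        for r in range(base, rev + 1):
            length = self.index[r][1]
            chunk = blob[pos:pos + length]
            text = chunk if r == base else apply_delta(text, chunk)
            pos += length
        return text, len(blob)
```

Calling read(rev) touches only the slice from the chain's fulltext up to
the requested entry, and the snapshot rule in add() keeps that slice at
or below twice the size of the revision being read.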

However, the deltas stored in revlog are not relative to a revision's
parent(s), but rather relative to the previous revision in the revlog
file, which is typically the most recent revision committed *to any
branch*.  Therefore, revlog is very good at storing a linear series of
commits, but is considerably less efficient at storing a history with
lots of branches that were under development concurrently.  The net
result is that the history of a branchy repository can take up much more
space than that of a linear repository.
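
The effect can be sketched with a small simulation.  This is a toy
model under stated assumptions, not a measurement of Mercurial: the
helper names are hypothetical, the prefix/suffix "delta" is far cruder
than a real delta algorithm, periodic fulltexts are ignored, and the two
synthetic branches are constructed so that one edits the top of the file
and the other the bottom.

```python
def delta_size(old: bytes, new: bytes) -> int:
    """Bytes of `new` left after stripping the common prefix and suffix."""
    limit = min(len(old), len(new))
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return len(new) - p - s


def total_delta_bytes(history, against: str) -> int:
    """history: list of (text, parent_index or None) in storage order.

    against="previous" deltas each entry against the entry stored just
    before it; against="parent" deltas it against its parent revision.
    """
    total = 0
    for rev, (text, parent) in enumerate(history):
        if rev == 0:
            total += len(text)  # the root is stored as a fulltext
            continue
        base = rev - 1 if against == "previous" else parent
        total += delta_size(history[base][0], text)
    return total


def make_history(n: int, interleaved: bool):
    """Two branches off one root: A edits the first line, B the last."""
    body = b"".join(b"line %d\n" % i for i in range(200))
    branch_a = [("a", b"A%d\n" % i + body) for i in range(1, n + 1)]
    branch_b = [("b", body + b"B%d\n" % i) for i in range(1, n + 1)]
    if interleaved:
        order = [x for pair in zip(branch_a, branch_b) for x in pair]
    else:
        order = branch_a + branch_b
    history = [(body, None)]
    tip = {"a": 0, "b": 0}  # index of each branch's latest revision
    for branch, text in order:
        history.append((text, tip[branch]))
        tip[branch] = len(history) - 1
    return history


if __name__ == "__main__":
    for interleaved in (False, True):
        h = make_history(10, interleaved)
        print("interleaved" if interleaved else "sequential ",
              total_delta_bytes(h, "previous"),
              total_delta_bytes(h, "parent"))
```

In this model the "previous" total balloons once commits from the two
branches alternate, while the "parent" total does not depend on the
storage order at all.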

There was a GSoC "parentdelta" project to allow deltas to be computed
against parents [2], later replaced by a second "generaldelta" scheme
[3], but AFAICT this is still experimental and they are struggling with
its performance.

There is also a script in contrib that reorders the revisions in a
revlog file to put topological neighbors closer together [4].  This can
shrink the size of the file dramatically.  But of course this script is
something like "git gc" in the sense that it would presumably need to be
run periodically, and each run would have to lock the repo for some time.
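
The idea of such a reordering can be sketched as follows.  This is not
the algorithm used by the contrib script, which I have not reproduced
here; it is a hypothetical depth-first ordering over a single-parent
history (merges are ignored), with function names invented for the
example.

```python
def reorder_depth_first(parents):
    """parents[i] is the parent index of revision i, or None for a root.

    Returns the revisions in an order that follows each branch to its
    end before starting the next one.
    """
    children = {}
    roots = []
    for rev, parent in enumerate(parents):
        if parent is None:
            roots.append(rev)
        else:
            children.setdefault(parent, []).append(rev)
    order = []
    stack = list(reversed(roots))
    while stack:
        rev = stack.pop()
        order.append(rev)
        # Push children in reverse so the oldest child is visited first.
        stack.extend(reversed(children.get(rev, [])))
    return order


def stored_next_to_parent(order, parents):
    """Count revisions whose parent is the entry stored just before them."""
    position = {rev: i for i, rev in enumerate(order)}
    return sum(
        1
        for rev, parent in enumerate(parents)
        if parent is not None and position[parent] == position[rev] - 1
    )


if __name__ == "__main__":
    # Root 0, then two branches committed alternately: 1, 3, 5 and 2, 4, 6.
    parents = [None, 0, 0, 1, 2, 3, 4]
    original = list(range(len(parents)))
    reordered = reorder_depth_first(parents)
    print(reordered)
    print(stored_next_to_parent(original, parents),
          stored_next_to_parent(reordered, parents))
```

The more revisions that sit directly after their parent, the more often
a delta against the previous entry is also a delta against the parent.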

All this is not to detract from the fact that Mercurial, by not
requiring garbage collection, has a big advantage over git in certain
scenarios.

Michael

[1] http://mercurial.selenic.com/wiki/FAQ#FAQ.2BAC8-TechnicalDetails.How_does_Mercurial_store_its_data.3F
[2] http://mercurial.selenic.com/wiki/ParentDeltaPlan
[3] http://mercurial.selenic.com/wiki/WhatsNew#Mercurial_1.9_.282011-07-01.29
[4] http://selenic.com/hg/file/54c0517c0fe8/contrib/shrink-revlog.py

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/