Re: newbie questions about git design and features (some wrt hg)

On Thu, Feb 01, 2007 at 12:58:42AM +0100, Jakub Narebski wrote:
> Matt Mackall wrote:
> > On Wed, Jan 31, 2007 at 11:56:01AM +0100, Jakub Narebski wrote:
> >> Theodore Tso wrote:
> >> 
> >>> On Tue, Jan 30, 2007 at 11:55:48AM -0500, Shawn O. Pearce wrote:
> >>>> I think hg modifies files as it goes, which could cause some issues
> >>>> when a writer is aborted.  I'm sure they have thought about the
> >>>> problem and tried to make it safe, but there isn't anything safer
> >>>> than just leaving the damn thing alone.  :)
> >>> 
> >>> To be fair hg modifies files using O_APPEND only.  That isn't quite
> >>> as safe as "only creating new files", but it is relatively safe.
> >> 
> >> From (libc.info):
> >> 
> >>  -- Macro: int O_APPEND
> [...] 
> >> I don't quite understand how that would help hg (Mercurial) to have
> >> operations like commit, pull/fetch or push atomic, i.e. all or
> >> nothing. 
> > 
> > That's because it's unrelated.
> [...]
> > Mercurial has write-side locks so there can only ever be one writer at
> > a time. There are no locks needed on the read side, so there can be
> > any number of readers, even while commits are happening.
> > 
> >> What happens if an operation is interrupted (e.g. a lost network
> >> connection during fetch)?
> > 
> > We keep a simple transaction journal. As Mercurial revlogs are
> > append-only, rolling back a transaction just means truncating all
> > files in a transaction to their original length.
> 
> Thanks a lot for the complete answer. So Mercurial uses write-side locks
> for dealing with concurrent operations, and a transaction journal for
> dealing with interrupted operations. I guess that incomplete transactions
> are rolled back on the next hg command...

They are either rolled back automatically on abort, or, if that fails
for some reason such as a power failure, the user is prompted to run
"hg recover" to complete the rollback. We also save the last
transaction journal, which allows one level of undo for pulls/commits.
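
Roughly, the journal/rollback idea looks like this (a simplified sketch
in Python, not Mercurial's actual code; the file names and journal
format here are purely illustrative):

    import os

    def write_journal(journal_path, paths):
        # Before the transaction appends anything, record each file's
        # current length so we know where to cut back to.
        with open(journal_path, "w") as j:
            for path in paths:
                size = os.path.getsize(path) if os.path.exists(path) else 0
                j.write("%s\0%d\n" % (path, size))

    def rollback(journal_path):
        # Undo an interrupted transaction: truncate every journalled
        # file back to its recorded pre-transaction length.
        with open(journal_path) as j:
            for line in j:
                path, size = line.rstrip("\n").rsplit("\0", 1)
                if os.path.exists(path):
                    with open(path, "r+b") as f:
                        f.truncate(int(size))
        os.unlink(journal_path)

Since the revlogs are append-only, truncating them back to the recorded
lengths is all a rollback ever has to do.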

> I guess (please correct me if I'm wrong) that git uses a "put reference
> after putting data" scheme, and a write-side lock in a few places where
> it is needed.

Mercurial also uses a "put reference after putting data" scheme, which
is what allows us to avoid read vs. write locking.
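
A minimal sketch of that ordering (illustrative Python, not either
tool's real on-disk layout; the data file and reference file names are
made up):

    import os

    def append_data(datafile, payload):
        # Step 1: append the new data.  Existing bytes are never touched,
        # so any concurrent reader still sees a consistent old view.
        with open(datafile, "ab") as f:
            offset = f.tell()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        return offset

    def publish_reference(reffile, offset):
        # Step 2: only after the data is safely on disk, update the small
        # reference that readers use to find it.  Write-to-temp plus
        # rename makes the switch atomic, so a reader sees either the old
        # or the new reference, never a half-written one.
        tmp = reffile + ".tmp"
        with open(tmp, "w") as f:
            f.write("%d\n" % offset)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, reffile)

Readers never need a lock: they follow whatever reference is currently
published, and the data it points at is already complete.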
  
> >> In git both situations result in some prune-able and fsck-visible crud
> >> in the repository, but the repository stays uncorrupted, and all
> >> operations are atomic (all or nothing).
> > 
> > If a Mercurial transaction is interrupted and not rolled back, the
> > result is prune-able and fsck-visible crud. But this doesn't happen
> > much in practice.
> > 
> > The claim that's been made is that a) truncate is unsafe because Linux
> > has historically had problems in this area and b) git is safer because
> > it doesn't do this sort of thing. 
> > 
> > My response is a) those problems are overstated and Linux has never
> > had difficulty with the sorts of straightforward single writer
> > operations Mercurial uses and b) normal git usage involves regular
> > rewrites of data with packing operations that makes its exposure to
> > filesystem bugs equivalent or greater.
> 
> Rewrites in git perhaps are (or should be) regular, but they need not be
> frequent. And with the new idea/feature of kept packs, a rewrite need not
> cover the full data.

If the set of files in a given commit (say tip) gets spread out across
an arbitrary number of packs ordered by last modification time,
performance degrades to O(n) lookups and random seeking.
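
To make the point concrete: with no index spanning the packs, a lookup
has to probe them one by one (the pack objects here are hypothetical,
just to show the shape of the cost):

    def find_object(sha1, packs):
        # packs: newest first.  Each miss costs an index probe (and
        # typically a seek), so the worst case is one probe per pack,
        # i.e. O(number of packs) rather than O(1).
        for pack in packs:
            if sha1 in pack.index:
                return pack.read(sha1)
        return None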

> One command which _is_ (a bit) unsafe in git is git-prune. I'm not sure
> if it could be made safe. But not pruning only slightly affects repository
> size (where git is, I think, the best of all SCMs) and not performance.
> 
> On the other hand, the hg repository structure (namely the log-like,
> append-only changelog/revlog used to store commits) makes it, I think,
> hard to have multiple persistent branches.

Not sure why you think that. There are some difficulties here, but
they're mostly owing to the fact that we've always emphasized the one
branch per repo approach as being the most user-friendly.

> Sidenote 1: it looks like git is optimized for speed of merge and checkout
> (branch switching, or going to given point in history for bisect), and
> probably accidentally for multi-branch repos, while Mercurial is optimized
> for speed of commit and patch.

I think all of these things are comparable.

> Sidenote 2: Mercurial repository structure might make it use "file-ids"
> (perhaps implicitly), with all the disadvantages (different renames
> on different branches) of those.

Nope.

> > In either case, both provide strong integrity checks with recursive
> > SHA1 hashing, zlib CRCs, and GPG signatures (as well as distributed
> > "back-up"!) so this is largely a non-issue relative to traditional
> > systems.
> 
> Integrity checks can tell you that a repository is corrupted, but it would
> be better if it didn't get corrupted in the first place.

Obviously. Hence our append-only design. Data that's written to a repo
is never rewritten, which minimizes exposure to software bugs and I/O
errors.
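
That property falls out of how the files are opened; roughly
(illustrative only, not our actual revlog code, and the file name and
payload are made up):

    import os

    new_entry = b"example revision data"   # placeholder payload

    # O_APPEND makes the kernel position every write at end-of-file, so
    # bytes already in the file can never be overwritten through this
    # descriptor; an interrupted write can only leave a short tail.
    fd = os.open("somefile.i", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, new_entry)
    finally:
        os.close(fd)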
 
> Besides: zlib CRC for Mercurial? I thought that hg didn't compress the
> data, only stored it as a delta chain?

We use zlib compression of deltas and have since April 6, 2005.
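
For what it's worth, the storage step is conceptually as simple as this
(illustrative only; the real revlog code does more, and the delta bytes
here are made up):

    import zlib

    delta = b"@@ example delta payload @@"   # made-up delta bytes

    compressed = zlib.compress(delta)
    # The zlib stream carries its own checksum, which is part of the
    # integrity checking mentioned above; decompression verifies it.
    assert zlib.decompress(compressed) == delta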

-- 
Mathematics is the supreme nostalgia of our time.