RE: reftable [v5]: new ref storage format

David Turner <David.Turner@xxxxxxxxxxxx> · Mon, 14 Aug 2017 16:05:05 +0000

> -----Original Message-----
> From: Howard Chu [mailto:hyc@xxxxxxxxx]
> Sent: Monday, August 14, 2017 8:31 AM
> To: spearce@xxxxxxxxxxx
> Cc: David Turner <David.Turner@xxxxxxxxxxxx>; avarab@xxxxxxxxx;
> ben.alex@xxxxxxxxxxxx; dborowitz@xxxxxxxxxx; git@xxxxxxxxxxxxxxx;
> gitster@xxxxxxxxx; mhagger@xxxxxxxxxxxx; peff@xxxxxxxx;
> sbeller@xxxxxxxxxx; stoffe@xxxxxxxxx
> Subject: Re: reftable [v5]: new ref storage format
> 
> Howard Chu wrote:
> > The primary issue with using LMDB over NFS is with performance. All
> > reads are performed thru accesses of mapped memory, and in general,
> > NFS implementations don't cache mmap'd pages. I believe this is a
> > consequence of the fact that they also can't guarantee cache
> > coherence, so the only way for an NFS client to see a write from
> > another NFS client is by always refetching pages whenever they're accessed.
> 
> > LMDB's read lock management also wouldn't perform well over NFS; it
> > also uses an mmap'd file. On a local filesystem LMDB read locks are
> > zero cost since they just atomically update a word in the mmap. Over
> > NFS, each update to the mmap would also require an msync() to
> > propagate the change back to the server. This would seriously limit
> > the speed with which read transactions may be opened and closed.
> > (Ordinarily opening and closing a read txn can be done with zero
> > system calls.)
> 
> All that aside, we could simply add an EXCLUSIVE open-flag to LMDB, and
> prevent multiple processes from using the DB concurrently. In that case,
> maintaining coherence with other NFS clients is a non-issue. It strikes me that git
> doesn't require concurrent multi-process access anyway, and any particular
> process would only use the DB for a short time before closing it and going away.

Git does, in general, require concurrent multi-process access, depending on
how one defines it.

For example, a post-receive hook might call some git command which opens the
ref database.  With an exclusive open, git receive-pack would have to close the
ref database before running the hook and re-open it afterwards.  More generally,
a fair number of git commands are implemented in terms of other git commands,
and might need the same treatment.  We could, in general, close and re-open the
database around fork/exec, but I am not sure that this solves the general
problem -- by mere happenstance, one might be e.g. pushing in one terminal
while running git checkout in another.  This is especially likely with git
worktrees, which share one ref database across multiple working directories.
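
To make the hook case concrete, here is a minimal sketch in Python (Unix-only).
It is not git or LMDB code: a non-blocking flock() on a scratch file stands in
for the proposed EXCLUSIVE open-flag, and the "hook" is a hypothetical child
process that just tries to open the same file exclusively.  The file name
refs.db and all function names are made up for illustration.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a hook that runs a git command needing the ref
# database: it tries to take the same exclusive lock and reports the outcome.
HOOK = r"""
import fcntl, sys
with open(sys.argv[1], "a") as f:
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("hook: opened")
    except BlockingIOError:
        print("hook: blocked")
"""


def open_exclusive(path):
    """Open the stand-in database and take a non-blocking exclusive lock."""
    f = open(path, "a")
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return f


def run_hook(path):
    """Run the hook as a separate process and return what it printed."""
    res = subprocess.run([sys.executable, "-c", HOOK, path],
                         capture_output=True, text=True, check=True)
    return res.stdout.strip()


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "refs.db")
        db = open_exclusive(path)
        while_held = run_hook(path)      # parent still holds the database
        db.close()                       # closing the file drops the flock
        after_close = run_hook(path)     # hook can now open it
        db = open_exclusive(path)        # parent re-opens after the hook exits
        db.close()
        return while_held, after_close
```

Under these assumptions, demo() should show the hook blocked while the parent
holds the database and succeeding once the parent has closed it.  Note that
this only covers the parent/child case; it does nothing for two unrelated
processes (the push-plus-checkout example), where one of them would simply
fail or have to wait.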