Re: RFC/Pull Request: Refs db backend

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, 2015-06-23 at 07:47 -0400, Jeff King wrote:
> On Mon, Jun 22, 2015 at 08:50:56PM -0400, David Turner wrote:
> 
> > The db backend runs git for-each-ref about 30% faster than the files
> > backend with fully-packed refs on a repo with ~120k refs.  It's also
> > about 4x faster than using fully-unpacked refs.  In addition, and
> > perhaps more importantly, it avoids case-conflict issues on OS X.
> 
> Neat.
> 
> Can you describe a bit more about the reflog handling?
> 
> One of the problems we've had with large-ref repos is that the reflog
> storage is quite inefficient. You can pack all the refs, but you may
> still be stuck with a bunch of reflog files with one entry, wasting a
> whole inode. Doing a "git repack" when you have a million of those has
> horrible cold-cache performance. Basically anything that isn't
> one-file-per-reflog would be a welcome change. :)

Reflogs are stored in the database as well.  There is one header entry
per ref to indicate that a reflog is present, and then one database
entry per reflog entry; the entries are stored consecutively and
immediately following the header so that it's fast to iterate over them.

> It has also been a dream of mine to stop tying the reflogs specifically
> to the refs. I.e., have a spot for reflogs of branches that no longer
> exist, which allows us to retain them for deleted branches. Then you can
> possibly recover from a branch deletion, whereas now you have to dig
> through "git fsck"'s dangling output. And the reflog, if you don't
> expire it, becomes a suitable audit log to find out what happened to
> each branch when (whereas now it is full of holes when things get
> deleted).

That would be cool, and I don't think it would be hard to add to my
current code; we could simply replace the header with a "tombstone".
But I would prefer to wait until the series is merged; then we can build
on top of it.

> I dunno. Maybe I am overthinking it. But it really feels like the _refs_
> are a key/value thing, but the _reflogs_ are not. You can cram them into
> a key/value store, but you're probably operating on them as a big blob,
> then.

Reflogs are, conceptually, queues. I agree that a raw key-value store is
not a good way to store queues, but a B-Tree is not so terrible, since
it offers relatively fast iteration (amortized constant time IIRC).

> > I chose to use LMDB for the database.  LMDB has a few features that make
> > it suitable for usage in git:
> 
> One of the complaints that Shawn had about sqlite is that there is no
> native Java implementation, which makes it hard for JGit to ship a
> compatible backend. I suspect the same is true for LMDB, but it is
> probably a lot simpler than sqlite (so reimplementation might be
> possible).
> 
> But it may also be worth going with a slightly slower database if we can
> get wider compatibility for free.

There's a JNI interface to LMDB, which is, of course, not native.  I
don't think it would be too hard to entirely rewrite LMDB in Java, but
I'm not going to have time to do it for the forseeable future.  I've
asked Howard Chu if he knows of any efforts in progress.

> > To test this backend's correctness, I hacked test-lib.sh and
> > test-lib-functions.sh to run all tests under the refs backend. Dozens
> > of tests use manual ref/reflog reading/writing, or create submodules
> > without passing --refs-backend-type to git init.  If those tests are
> > changed to use the update-ref machinery or test-refs-be-db (or, in the
> > case of packed-refs, corrupt refs, and dumb fetch tests, are skipped),
> > the only remaining failing tests are the git-new-workdir tests and the
> > gitweb tests.
> 
> I think we'll need to bump core.repositoryformatversion, too. See the
> patches I just posted here:
> 
>   http://thread.gmane.org/gmane.comp.version-control.git/272447

Thanks, that's valuable.  For the refs backend, opening the LMDB
database for writing is sufficient to block other writers.  Do you think
it would be valuable to provide a git hold-ref-lock command that simply
reads refs from stdin and keeps them locked until it reads EOF from
stdin?  That would allow cross-backend ref locking. 

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]