Shawn O. Pearce wrote:
>> The point you raised earlier, that there would be a lot of ambiguity if
>> we allow both flat and fan-out directory structures, is a valid point,
>> though.
>
> Yup.  The flat vs. fan-out is a problem.
[...]
> Notes on commits though are a hell of a problem.  SHA-1 is just so
> uniform at distributing the commits around the namespace that even
> with just the 200 most recent commits we wind up with a commit in
> almost every "bucket", assuming a two hex digit fan-out bucket like
> the loose object directory.

I think my patch from 1 Feb addressed this, at least for the operations
it implemented. I just don't see why you need to decide up front what
the split is going to be: just read the next tree, descend into the
closest matching tree until you find the record you are looking for,
and that's it.

Sure, my patch just loads it all and throws it into a hash - this
should still be efficient for short log operations even if the hash
table ends up 1MB. But why take my guess? Let's stress test it.

'lorem' is the command installed by the Text::Lorem Perl module; it
generates a paragraph of random Latin text.

wilber:~/src/git$ time git-log | wc -l
256072

real    0m0.709s
user    0m0.608s
sys     0m0.116s
wilber:~/src/git$ git rev-list HEAD | wc -l
17678
wilber:~/src/git$ cat > my-editor
#!/bin/sh
( lorem; echo ) > $1
wilber:~/src/git$ chmod +x my-editor
wilber:~/src/git$ export EDITOR=`pwd`/my-editor
wilber:~/src/git$ export GIT_NOTES_SPLIT=2
wilber:~/src/git$ time git-rev-list HEAD | while read rev
> do ./git-notes.sh edit $rev; done
fatal: unable to create '.git/refs/notes/commits.lock': File exists
error: Ref refs/notes/commits is at 5f0732975b4acf237912a31e7ce14aa86d2e8179 but expected 725a2d119d2725e7d821906ad085bfbadbf43c8e
fatal: Cannot lock the ref 'refs/notes/commits'.
[...]
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index

real    76m16.927s
user    43m55.909s
sys     19m33.005s

Oo. Nasty errors there, but never mind that for now - obviously there
are some remaining issues in the shell script. What did I get out of
that?

wilber:~/src/git$ git-ls-tree -r refs/notes/commits | wc
  12043   48172 1144085
wilber:~/src/git$

Hey, well, that's not too bad. Enough to be a good test. How long does
"git-log" take now?

wilber:~/src/git$ time ./git-log | wc -l
292201

real    0m13.740s
user    0m0.852s
sys     0m0.716s
wilber:~/src/git$ time ./git-log | wc -l
292201

real    0m1.335s
user    0m0.856s
sys     0m0.512s

Not bad! Cold cache performance sucked there, but only a 50% slowdown
for reading almost twice the number of objects. Let's try 200 commits:

wilber:~/src/git$ time git-log -200 | wc -l
2877

real    0m0.027s
user    0m0.008s
sys     0m0.020s
wilber:~/src/git$ time ./git-log -200 | wc -l
3477

real    0m0.081s
user    0m0.056s
sys     0m0.020s

Quite a big slowdown proportionally, but not a huge amount in absolute
terms. And we didn't even make the builtin-log machinery smart enough
to skip unneeded trees!
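To make the "just descend" lookup concrete, here is a rough sketch in
plain shell, using only plumbing commands. This is not the code from my
patch (which keeps everything in memory); the prefix-probing loop is
just an illustration of the idea, and it assumes the refs/notes/commits
ref used above:

#!/bin/sh
# Illustrative sketch: look up the note for one commit without knowing
# the fan-out in advance.  Try the remaining hex digits as a blob at
# this level; failing that, probe successively longer prefixes for a
# subtree and descend into the first one that exists.
# (Variables are not local - good enough for a sketch.)
notes_ref=refs/notes/commits

lookup () {
	tree=$1 rest=$2
	# Flat at this level: the remaining digits name the note blob.
	blob=$(git ls-tree "$tree" "$rest" | awk '$2 == "blob" { print $3 }')
	if test -n "$blob"
	then
		git cat-file blob "$blob"
		return 0
	fi
	# Fanned out: find the subtree whose name is a prefix of the
	# remaining digits and recurse with that prefix stripped off.
	i=1
	while test "$i" -lt "${#rest}"
	do
		prefix=$(printf '%s' "$rest" | cut -c1-"$i")
		sub=$(git ls-tree "$tree" "$prefix" | awk '$2 == "tree" { print $3 }')
		if test -n "$sub"
		then
			lookup "$sub" "${rest#$prefix}"
			return $?
		fi
		i=$(($i + 1))
	done
	return 1
}

lookup "$notes_ref" "$1"    # $1: full 40-digit commit name

Looking it up this way only ever reads the trees on the path down to
the one note, which is exactly the property the log machinery would
want once it learns to skip unneeded trees.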
> In a slightly unrelated
> thread offlist I have been talking with Sam Vilain about using Git
> as a database backend for tuple storage.
[...]
> This would make the git-notes.sh code a *lot* more complex, as you
> can't just toss everything into an index file and then update it with
> a single update-index call.  Doing a tree split is much more work and
> requires removing and adding back all of the affected path names.
> (It's also perhaps unreasonable anyway to load 17,491 paths into a
> temporary index just to twiddle a note for the latest commit.)

Hehe, horribly overcomplicated for this use case... there are many
applicable ideas in there, though.

> For the "git database" thing above, I've been contemplating the
> idea of an index stored external from the Git object database.
> Sam thinks indexes should be in the object database tree, but
> I'm considering storing them outside entirely because we can
> make the indexes more easily searched by a hash or binary search,
> like pack-*.idx.  Whenever the "database ref" gets moved we'd need
> to run a "sync" utility to bring these external indexes current.
> But they could also be more efficiently scanned.

Well, either way it's a file you've got to scan somehow ... I guess it
doesn't matter much whether it's in-tree or not. I was actually saying
that there are some use cases where you might want to keep indexes in
the history and some where you don't. Keeping them in-tree is not
normalised, but there are good use cases for it - e.g. efficient
retrieval of pre-computed aggregates that don't need to be up to the
second, or instances where you want your nodes to be able to "hit the
ground running" after synchronisation without having to reindex. For
the use case we originally talked about, I don't think you'd want any
indexes in-tree at all.

But I'd like to steer this thread well away from the database stuff
I'm drafting ... it's a lot more comprehensive; notes are a very
simple hash relationship.

--
Sam Vilain, Perl Hacker, Catalyst IT (NZ) Ltd.
phone: +64 4 499 2267                        PGP ID: 0x66B25843
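For reference, the "toss everything into an index file and update it
with a single update-index call" approach that works for a flat notes
tree looks roughly like the sketch below. It is not the actual
git-notes.sh code; the temporary index path, the commit message, and
the $msgfile argument are made up for illustration:

#!/bin/sh
# Sketch: edit one note when the notes tree is flat.  Pull the whole
# tree into a throwaway index, let a single update-index call add or
# replace the one entry, then write the tree back and advance the ref.
# $1 is the annotated commit, $2 a file containing the note text.
notes_ref=refs/notes/commits
commit=$1 msgfile=$2

export GIT_INDEX_FILE=.git/notes-index.tmp
rm -f "$GIT_INDEX_FILE"

# Start from the current notes tree, if the ref exists.
parent=$(git rev-parse -q --verify "$notes_ref") &&
	git read-tree "$notes_ref"

blob=$(git hash-object -w "$msgfile") &&
git update-index --add --cacheinfo 100644 "$blob" "$commit" &&
tree=$(git write-tree) &&
new=$(echo "notes edit" |
	git commit-tree "$tree" ${parent:+-p "$parent"}) &&
git update-ref "$notes_ref" "$new"

rm -f "$GIT_INDEX_FILE"

With a fan-out tree the update-index call would instead have to remove
the old path and add the note back under whatever split is in effect,
which is the extra complexity being discussed above.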