Shawn O. Pearce wrote:
>> The point you raised earlier, that there would be a lot of ambiguity if
>> we allow both flat and fan-out directory structures, is a valid point,
>> though.
>
> Yup.  The flat vs. fan-out is a problem.
[...]
> Notes on commits though are a hell of a problem.  SHA-1 is just so
> uniform at distributing the commits around the namespace that even
> with just the 200 most recent commits we wind up with a commit in
> almost every "bucket", assuming a two hex digit fan-out bucket like
> the loose object directory.

I think my patch from 1 Feb addressed this, at least for the operations
it implemented. I just don't see why you need to decide up front what
the split is going to be: just read the next tree, descend into the
closest matching tree until you find the record you are looking for,
and that's it.

Sure, my patch just loads it all and throws it into a hash - this
should still be efficient for short log operations even if the hash
table ends up 1MB. But why take my guess? Let's stress test it.

'lorem' is the command installed by the Text::Lorem Perl module; it
generates a paragraph of random Latin text.

wilber:~/src/git$ time git-log | wc -l
256072

real    0m0.709s
user    0m0.608s
sys     0m0.116s
wilber:~/src/git$ git rev-list HEAD | wc -l
17678
wilber:~/src/git$ cat > my-editor
#!/bin/sh
( lorem; echo ) > $1
wilber:~/src/git$ chmod +x my-editor
wilber:~/src/git$ export EDITOR=`pwd`/my-editor
wilber:~/src/git$ export GIT_NOTES_SPLIT=2
wilber:~/src/git$ time git-rev-list HEAD | while read rev
> do ./git-notes.sh edit $rev; done
fatal: unable to create '.git/refs/notes/commits.lock': File exists
error: Ref refs/notes/commits is at 5f0732975b4acf237912a31e7ce14aa86d2e8179 but expected 725a2d119d2725e7d821906ad085bfbadbf43c8e
fatal: Cannot lock the ref 'refs/notes/commits'.
[...]
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index
fatal: unable to write new index file
Could not read index

real    76m16.927s
user    43m55.909s
sys     19m33.005s

Oo. Nasty errors there, but never mind that for now - obviously there
are some remaining issues in the shell script. What did I get out of
that?

wilber:~/src/git$ git-ls-tree -r refs/notes/commits | wc
  12043   48172 1144085
wilber:~/src/git$

Hey, well, that's not too bad. Enough to be a good test. How long does
"git-log" take now?

wilber:~/src/git$ time ./git-log | wc -l
292201

real    0m13.740s
user    0m0.852s
sys     0m0.716s
wilber:~/src/git$ time ./git-log | wc -l
292201

real    0m1.335s
user    0m0.856s
sys     0m0.512s

Not bad! Cold cache performance sucked there, but only a 50% slowdown
for reading almost twice the number of objects. Let's try 200 commits:

wilber:~/src/git$ time git-log -200 | wc -l
2877

real    0m0.027s
user    0m0.008s
sys     0m0.020s
wilber:~/src/git$ time ./git-log -200 | wc -l
3477

real    0m0.081s
user    0m0.056s
sys     0m0.020s

Quite a big slowdown proportionally, but not a huge amount in absolute
terms. And we didn't even make the builtin-log machinery smart enough
to skip unneeded trees!
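To make the "just descend" lookup concrete, here is a rough sketch in
plain shell, using only plumbing commands. This is not the code from my
patch (which keeps everything in memory); the prefix-probing loop is
just an illustration of the idea, and it assumes the refs/notes/commits
ref used above:

#!/bin/sh
# Illustrative sketch: look up the note for one commit without knowing
# the fan-out in advance.  Try the remaining hex digits as a blob at
# this level; failing that, probe successively longer prefixes for a
# subtree and descend into the first one that exists.
# (Variables are not local - good enough for a sketch.)
notes_ref=refs/notes/commits

lookup () {
	tree=$1 rest=$2
	# Flat at this level: the remaining digits name the note blob.
	blob=$(git ls-tree "$tree" "$rest" | awk '$2 == "blob" { print $3 }')
	if test -n "$blob"
	then
		git cat-file blob "$blob"
		return 0
	fi
	# Fanned out: find the subtree whose name is a prefix of the
	# remaining digits and recurse with that prefix stripped off.
	i=1
	while test "$i" -lt "${#rest}"
	do
		prefix=$(printf '%s' "$rest" | cut -c1-"$i")
		sub=$(git ls-tree "$tree" "$prefix" | awk '$2 == "tree" { print $3 }')
		if test -n "$sub"
		then
			lookup "$sub" "${rest#$prefix}"
			return $?
		fi
		i=$(($i + 1))
	done
	return 1
}

lookup "$notes_ref" "$1"    # $1: full 40-digit commit name

Looking it up this way only ever reads the trees on the path down to
the one note, which is exactly the property the log machinery would
want once it learns to skip unneeded trees.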
> In a slightly unrelated
> thread offlist I have been talking with Sam Vilain about using Git
> as a database backend for tuple storage.
[...]
> This would make the git-notes.sh code a *lot* more complex, as you
> can't just toss everything into an index file and then update it with
> a single update-index call.  Doing a tree split is much more work and
> requires removing and adding back all of the affected path names.
> (It's also perhaps unreasonable anyway to load 17,491 paths into a
> temporary index just to twiddle a note for the latest commit.)

Hehe, horribly overcomplicated for this use case... there are many
applicable ideas in there, though.

> For the "git database" thing above, I've been contemplating the
> idea of an index stored external from the Git object database.
> Sam thinks indexes should be in the object database tree, but
> I'm considering storing them outside entirely because we can
> make the indexes more easily searched by a hash or binary search,
> like pack-*.idx.  Whenever the "database ref" gets moved we'd need
> to run a "sync" utility to bring these external indexes current.
> But they could also be more efficiently scanned.

Well, either way it's a file you've got to scan somehow ... I guess it
doesn't matter much whether it's in-tree or not. I was actually saying
that there are some use cases where you might want to keep indexes in
the history and some where you don't. Keeping them in-tree is not
normalised, but there are good use cases for it - e.g. efficient
retrieval of pre-computed aggregates that don't need to be up to the
second, or instances where you want your nodes to be able to "hit the
ground running" after synchronisation without having to reindex. For
the use case we originally talked about, I don't think you'd want any
indexes in-tree at all.

But I'd like to steer this thread well away from the database stuff
I'm drafting ... it's a lot more comprehensive; notes are a very
simple hash relationship.

--
Sam Vilain, Perl Hacker, Catalyst IT (NZ) Ltd.
phone: +64 4 499 2267                        PGP ID: 0x66B25843
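For reference, the "toss everything into an index file and update it
with a single update-index call" approach that works for a flat notes
tree looks roughly like the sketch below. It is not the actual
git-notes.sh code; the temporary index path, the commit message, and
the $msgfile argument are made up for illustration:

#!/bin/sh
# Sketch: edit one note when the notes tree is flat.  Pull the whole
# tree into a throwaway index, let a single update-index call add or
# replace the one entry, then write the tree back and advance the ref.
# $1 is the annotated commit, $2 a file containing the note text.
notes_ref=refs/notes/commits
commit=$1 msgfile=$2

export GIT_INDEX_FILE=.git/notes-index.tmp
rm -f "$GIT_INDEX_FILE"

# Start from the current notes tree, if the ref exists.
parent=$(git rev-parse -q --verify "$notes_ref") &&
	git read-tree "$notes_ref"

blob=$(git hash-object -w "$msgfile") &&
git update-index --add --cacheinfo 100644 "$blob" "$commit" &&
tree=$(git write-tree) &&
new=$(echo "notes edit" |
	git commit-tree "$tree" ${parent:+-p "$parent"}) &&
git update-ref "$notes_ref" "$new"

rm -f "$GIT_INDEX_FILE"

With a fan-out tree the update-index call would instead have to remove
the old path and add the note back under whatever split is in effect,
which is the extra complexity being discussed above.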