On Tuesday 02 November 2010, Shawn Pearce wrote: > Why doesn't the fan-out work for this case? > > git notes --ref buggy.fanout add -m test HEAD > perl -e 'for($i=0;$i<1024;$i++){printf "%s %s%4.4x\n", $ARGV[0], > "0"x36, $i}' $(git rev-parse HEAD) | git notes --ref buggy.fanout copy > --stdin > git ls-tree refs/notes/buggy.fanout > 100644 blob ... 0000000000000000000000000000000000000000 > 100644 blob ... 0000000000000000000000000000000000000001 > 100644 blob ... 0000000000000000000000000000000000000002 > 100644 blob ... 0000000000000000000000000000000000000003 > ... > 100644 blob ... 000000000000000000000000000000000000018c > ... > 100644 blob ... 00000000000000000000000000000000000003e1 > ... > 100644 blob ... 00000000000000000000000000000000000003e2 > ... > 100644 blob ... 00000000000000000000000000000000000003fd > 100644 blob ... 00000000000000000000000000000000000003fe > 100644 blob ... 00000000000000000000000000000000000003ff > 100644 blob ... 638d3d9244720e0f07f22a953d25db878e9ad3f5 Yes, for your extremely far-fetched testcase, the current implementation is pessimal. However, please, consider exactly _what_ the notes tree contains. Yes, that's right: SHA1 sums. Now, try to come up with a real-world scenario where your notes tree would end up looking like the above. I guess the same "problem" exists with the 2/38 fanout that is done for loose objects in .git/objects/. If you happen to only add objects whose SHA1 start with 00..., you will get worst case behaviour there as well. Again, if you can construct a real-world testcase for this, I am probably not the only one interested in looking at it... > I thought the entire points of the notes fanout being 0/40, 2/38, > 2/2/36, 2/2/2/34 was to prevent git from having a big linear search > within any single tree when a large number of notes are added to a > note branch. Yes, and by assuming that SHA1 gives a fairly uniform distribution of keys (a pretty safe assumption if the rest of Git's design is anything to go by), we can conclude that the large number of keys would be fairly evenly spread out across the notes tree. > IIRC when the notes stuff was being debated on this list we insisted > that the fan-out algorithm had to be consistent, ensuring that if I > create a note for 638d3d92 and Junio creates a note on the same > object, they will wind up at the same path in the notes, and we won't > have doppleganger notes (two different notes in the same tree with the > same object). Let's separate the cases here: A. There is absolutely no guarantee that two different notes trees will place a note for 638d3d92 at the same location/path. I.e. if you add a note to your notes tree, and Junio adds one to his, they might end up at different paths. This depends on how many, and which other notes are in your respective notes trees. B. When combining/merging those notes tree, the notes code _does_ guarantee that there won't be doppelganger notes. I.e. if you're both working in the same notes tree (in the same repo), the notes code will not allow you to add two notes to 638d3d92 at different locations in the notes tree. Of course you might artificially construct a notes tree that contains doppelganger notes, and the current implementation will successfully read that notes tree (combining the doppelganger notes using the configured combine_notes function), but - crucially - the notes code will _never_ generate a notes tree with doppelganger notes. > Basically this issue arises because we are adding node support to > JGit. Our planned implementation would have inserted > 00000000000000000000000000000000000003ff into > 00/00/00/00/00/00/00/00/00/00/00/00/00/00/00/00/00/00/03/ff because > there are so many notes that at each depth we wind up with >255 > entries, and split the tree again. But that results in a very > different structure from what the C implementation is doing today. > > As far as I can tell from the C code, it won't split unless every hex > digit in the 0..f range is used. That is just a weird heuristic, and > actual paths used depend on the actual SHA-1s that we add to the tree. Read that sentence again. As explained above, since SHA1 provides a near- uniform distribution, looking at hex digits in the 0..f range is a very inexpensive and moderately good heuristic to estimating whether it's time to make a split. Granted, there might be better heuristics, and the current design allows us to implement such heuristics without making pre-existing notes trees invalid. > If I get "unlucky" and add 5 commits that start with "f" and never > add one that starts with "0", I won't get fan-out, even though my > notes tree has more than 256 items in it. True, however, the uniform distribution of SHA1 pretty much guarantees that as you keep adding notes, you're going to get one that starts with "0", and then you'll get the needed split. Hope this helps, ...Johan -- Johan Herland, <johan@xxxxxxxxxxx> www.herland.net -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html