On Sat, 5 Apr 2008, Roman Shaposhnik wrote: > > It was pretty much the result of (cd .git/objects ; find . -type f | tr -d > './') Ok, that explains it. And yes, it definitely works, but I think you now understand why it gave you that oddly pessimised pack-file. That said, I think it's also a really good example of how the git pack-files really are just a "bag of objects", and the fact that it worked but was non-optimal is a very good way to show something very fundamental about git. > So it seems that my list of objects was different from what > git-rev-list/setup_revisions() > would have provided in two ways: > 1. the order of objects was arbitrary > 2. the naming of blobs and trees was missing > #2 was, indeed, a huge oversight on my part. At the same time, I've always > thought > that #1 shouldn't matter, because, as you pointed out, the objects get > sorted by <type, namehash, length> anyway. However, it seems that because > of the lack of naming there was much less sorting done by type_size_sort() > and the original order persevered (at least within type-size partitions). > It did matter after all! Well, a pack actually as two *different* orderings, in that there's one ordering that is used for laying out the result in the pack-file, and another ordering that is used for actually finding the deltas with the sliding window. And the input order is actually used for the layout ordering. And both matter. The <type, namehash, length> is the one that determines how large the resulting pack is, but the layout order is what causes the IO patterns for most common uses, so if the pack-file is cold in the cache, it can matter quite a bit for performance. So the layout order is the one you give as input to "git pack-objects". And if that input has no particular ordering, then the final layout will also have no particular ordering and you get random IO patterns when loading from disk. > Now, in my particular case, the ordering was braindead and it resulted in > a highly visible inefficiency. On the other hand, it seems that creative > ordering could very well be used to exploit localities which go beyond > how default name hashing in builtin-pack-objects.c works. Yes. We've changed some of the heuristics subtly over time (eg the whole name-hashing was a tweak that was added later), but you're absolutely right - it's an area where some *particular* usage model could come up with a special ordering to optimize a some special load. The defaults are meant to be "good enough", and there has been some thought put into them, but yes, there's probably room for specialization there. In particular, one thing I've wanted to do is to do some "fingerprint hash" and perhaps use that as a sorting guide for finding deltas even more aggressively than we do now, but I've never really found the energy to try something like that out. > In fact, it starts to look awfully like custom MPEG profiles where if you know > your footage you can achieve a much higher compression ratio > compared to what the default might give you. It also makes it very > clear that Git's approach of doing away with per-file content tracking > is quite superior to the in-file deltas. Cool! Well, one thing that git doesn't do, but that I find intriguing is if it could be possible to do deltas not even between different files, but even *across* files. One thing that the git model sucks at is how it's not very good at handling large objects. I've often wondered if I should have made "object" be more fine-grained and tried to build up large files from multiple smaller objects. [ That said, I think git does the right thing - for source code. The blocking-up of files would cause a rather more complex model, and one of the great things about git is how simple the basic model is. But the large-file thing does mean that git potentially sucks really badly for some other loads ] > Here's my final question on that issue: wouldn't it be great to give users > a direct control over specifying the list of objects in exactly the order > they would like them to be tried for deltifying? Something like > --preserve-order option available for git-pack-objects? See above: there's actually two orders, so you'd need to specify *both* orders some way, since the input order already has meaning (it's assumed to be the "topological ordering". You *can* approximate what you descibe by playing games with the filenames (which you can give in the input too - they're just strings on the same line as the SHA1), and that would be a total hack to give that secondary ordering. And yes, I agree that it might be cool to do this explicitly some way. If only to give a way to test experimental versions of the type sorting. [ To see how the filename thing works, do git rev-list --objects --all > object-list git pack-objects ... < object-list and you can look at the "object-list" file to see how it's not just the list of SHA1, it now has the filename info in it too. Imagine changing that filename to generate a "custom" pack order ] Linus -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html