On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:

> > It's not just metadata; it's actually part of what we hash to get the
> > object id (though of course it doesn't _have_ to be stored in a linear
> > buffer, as the pack storage shows).
>
> I'm still puzzled why git calculates the object id based on
> {type, size, data} together instead of just {data}?

You'd have to ask Linus for the original reasoning. ;) But one nice
thing about including these, especially the type, in the hash is that
the object id gives the complete context for an object. So if another
object claims to point to a tree, say, but points to a blob instead,
we can detect that problem immediately.

Or worse, think about something like "git show 1234abcd". If the
metadata were not part of the object, then how would we know whether
you wanted to show a commit, or a blob (that happens to look like a
commit), etc?

That metadata could be carried outside the hash, but then it has to
be stored somewhere, and is subject to ending up mismatched with the
contents. Hashing all of it (including the size) makes consistency
checking much easier.
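You can see this for yourself (assuming a SHA-1 repository here; a
SHA-256 repository works the same way, just with a different hash
function). The object id is the hash of a "<type> <size>" header, a
NUL byte, and then the data:

  # hash a 6-byte blob the normal way...
  $ printf 'foobar' | git hash-object --stdin

  # ...and by constructing the header by hand
  $ printf 'blob 6\0foobar' | sha1sum

Both commands print the same id, since the second is exactly what
hash-object computes internally.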
> > For packed objects, it effectively is metadata, just stuck at the
> > front of the object contents, rather than in a separate table. That
> > lets us use the same .idx file for finding that metadata as we do for
> > the contents themselves (at the slight cost that if you're _just_
> > accessing metadata, the results are sparser within the file, which
> > has worse behavior for cold-cache disks).
>
> Agree. But what if there is a metadata table in the .idx file?
> We can even know the type and size of the object without accessing
> the packfile.

I'm not sure it would be any faster than accessing the packfile. If
you stick the metadata in the .idx file's oid lookup table, then
lookups perform a bit worse because you're wasting memory cache. If
you make a separate table in the .idx file, that's OK, but I'm not
sure it's consistently better than finding the data in the packfile.
The oid lookup table gives you a way to index the table in constant
time (if you store the table as fixed-size entries in sha1 order),
but we can also access the packfile in constant time (the idx table
gives us offsets).

The idx metadata table would have better cache behavior if you're
only looking at metadata, and not contents. But otherwise it's worse
(since you have to hit the packfile, too). And I cheated a bit to say
"fixed-size" above; the packfile metadata is in a variable-length
encoding, so in some ways it's more efficient.

So I doubt it would make any operations appreciably faster, and even
if it did, you'd possibly be trading off against other operations.

I think the more interesting metadata is not type/size, but
properties such as those stored by the commit graph. And there we do
have separate tables for fast access (and it's a _lot_ faster,
because it helps us avoid inflating the object contents).

> > I'm not sure how much of a speedup it would yield in practice,
> > though. If you're printing the object contents, then the extra lookup
> > is probably not that expensive by comparison.
>
> I feel like this solution may not be feasible. After we get the type
> and size for the first time, we go through different output paths for
> different object types: `stream_blob()` for blobs, and
> `read_object_file()` with `batch_write()` for other objects. If we
> obtain the content of a blob in one single read operation, then the
> performance optimization provided by `stream_blob()` would be
> invalidated.

Good point. So yeah, even to use it in today's code you'd need
something conditional.

A few years ago I played with an option for object_info that would
let the caller say "please give me the object contents if they are
smaller than N bytes, otherwise don't". That would let many
call-sites get type, size, and content together most of the time (for
small objects), and then stream only when necessary.

I still have the patches, and running them now, it looks like there's
about a 10% speedup for:

  git cat-file --unordered --batch-all-objects --batch >/dev/null

Other code paths dealing with blobs would likewise get a small
speedup, I'd think.

I don't remember why I didn't send it. I think there was some ugly
refactoring that I needed to double-check, and my attention just got
pulled elsewhere. The messy patches are at:

  https://github.com/peff/git jk/object-info-round-trip

if you're interested.

-Peff