On Fri, Apr 14, 2023 at 15:30, Jeff King <peff@xxxxxxxx> wrote:
>
> On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
>
> > > It's not just metadata; it's actually part of what we hash to get the
> > > object id (though of course it doesn't _have_ to be stored in a linear
> > > buffer, as the pack storage shows).
> >
> > I'm still puzzled why git calculated the object id based on {type, size, data}
> > together instead of just {data}?
>
> You'd have to ask Linus for the original reasoning. ;)
>
> But one nice thing about including these, especially the type, in the
> hash, is that the object id gives the complete context for an object.
> So if another object claims to point to a tree, say, and points to a blob
> instead, we can detect that problem immediately.
>
> Or worse, think about something like "git show 1234abcd". If the
> metadata was not part of the object, then how would we know if you
> wanted to show a commit, or a blob (that happens to look like a commit),
> etc? That metadata could be carried outside the hash, but then it has to
> be stored somewhere, and is subject to ending up mismatched to the
> contents. Hashing all of it (including the size) makes consistency
> checking much easier.
>

Oh, you are right: this could be to keep Git objects with identical
content but different types from sharing the same id. However, I always
associate Git with a file system, where metadata such as file type and
size is stored in the inode, while the file data is stored in separate
chunks.

> > > For packed object, it effectively is metadata, just stuck at the front
> > > of the object contents, rather than in a separate table. That lets us
> > > use the same .idx file for finding that metadata as we do for the
> > > contents themselves (at the slight cost that if you're _just_ accessing
> > > metadata, the results are sparser within the file, which has worse
> > > behavior for cold-cache disks).
> >
> > Agree. But what if there is a metadata table in the .idx file?
> > We can even know the type and size of the object without accessing
> > the packfile.
>
> I'm not sure it would be any faster than accessing the packfile. If you
> stick the metadata in the .idx file's oid lookup table, then lookups
> perform a bit worse because you're wasting memory cache. If you make a
> separate table in the .idx file that's OK, but I'm not sure it's
> consistently better than finding the data in the packfile.
>

Yes, but it may be very convenient if we need to filter by object type
or size.

> The oid lookup table gives you a way to index the table in
> constant-time (if you store the table as fixed-size entries in sha1
> order), but we can also access the packfile in constant-time (the idx
> table gives us offsets). The idx metadata table would have better cache
> behavior if you're only looking at metadata, and not contents. But
> otherwise it's worse (since you have to hit the packfile, too). And I
> cheated a bit to say "fixed-size" above; the packfile metadata is in a
> variable-length encoding, so in some ways it's more efficient.
>

Yes, if we only use git cat-file --batch-check, we may be able to
improve performance by avoiding access to the packfile. Additionally, I
think such a metadata table is well suited to filtering and aggregating
operations.
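Just to make the use case concrete (a rough sketch, not a proposal):
with today's tooling a type/size filter has to post-process the
--batch-check output, e.g. to find blobs over an arbitrary 1MB
threshold:

    # type and size currently come from the pack entry headers; a
    # metadata table in the .idx could answer the same query on its own
    git cat-file --batch-all-objects \
        --batch-check='%(objectname) %(objecttype) %(objectsize)' |
        awk '$2 == "blob" && $3 > 1048576'

An internal --type-filter could fold the awk step into the cat-file
process itself.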
> So I doubt it would make any operations appreciably faster, and even if
> it did, you'd possibly be trading off versus other operations. I think
> the more interesting metadata is not type/size, but properties such as
> those stored by the commit graph. And there we do have separate tables
> for fast access (and it's a _lot_ faster, because it's helping us avoid
> inflating the object contents).
>

Yeah, optimizing the retrieval of metadata such as type/size may not
provide as much benefit as recording commit properties in a separate
table, the way the commit-graph optimization does.

> > > I'm not sure how much of a speedup it would yield in practice, though.
> > > If you're printing the object contents, then the extra lookup is
> > > probably not that expensive by comparison.
> > >
> >
> > I feel like this solution may not be feasible. After we get the type and size
> > for the first time, we go through different output processes for different types
> > of objects: use `stream_blob()` for blobs, and `read_object_file()` with
> > `batch_write()` for other objects. If we obtain the content of a blob in one
> > single read operation, then the performance optimization provided by
> > `stream_blob()` would be invalidated.
>
> Good point. So yeah, even to use it in today's code you'd need something
> conditional. A few years ago I played with an option for object_info
> that would let the caller say "please give me the object contents if
> they are smaller than N bytes, otherwise don't".
>
> And that would let many call-sites get type, size, and content together
> most of the time (for small objects), and then stream only when
> necessary. I still have the patches, and running them now it looks like
> there's about a 10% speedup running:
>
>   git cat-file --unordered --batch-all-objects --batch >/dev/null
>
> Other code paths dealing with blobs would likewise get a small speedup,
> I'd think. I don't remember why I didn't send it. I think there was some
> ugly refactoring that I needed to double-check, and my attention just
> got pulled elsewhere. The messy patches are at:
>
>   https://github.com/peff/git jk/object-info-round-trip
>
> if you're interested.
>

Alright, this does feel a bit hackish: most objects would have their
content fetched on the first read, while blobs larger than N would still
be streamed via stream_blob(). But I feel this single-read optimization
is a bit off topic here; to quote your earlier message:

> So a nice thing about being able to do the filtering in one process is
> that we could _eventually_ do it all with one object lookup. But I'd
> probably wait on adding something like --type-filter until we have an
> internal single-lookup API, and then we could time it to see how much
> speedup we can get.

The single-read optimization doesn't seem to provide much benefit for
implementing object filters, because by that point we have already read
the content of the object in advance, haven't we?

ZheNing Hu
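P.S. Going back to my question about hashing {type, size, data}
together: it is easy to check by hand that the object id is just the
hash of a "<type> <size>\0" header followed by the data (assuming the
default SHA-1 object format), for example:

    # both commands print the same id: the header is part of what is hashed
    $ printf 'hello\n' | git hash-object --stdin
    ce013625030ba8dba906f756967f9e9ca394464a
    $ printf 'blob 6\0hello\n' | sha1sum
    ce013625030ba8dba906f756967f9e9ca394464a  -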