On Fri, Apr 14, 2023 at 15:30, Jeff King <peff@xxxxxxxx> wrote:
>
> On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
>
> > > It's not just metadata; it's actually part of what we hash to get the
> > > object id (though of course it doesn't _have_ to be stored in a linear
> > > buffer, as the pack storage shows).
> >
> > I'm still puzzled why git calculated the object id based on {type, size, data}
> > together instead of just {data}?
>
> You'd have to ask Linus for the original reasoning. ;)
>
> But one nice thing about including these, especially the type, in the
> hash, is that the object id gives the complete context for an object.
> So if another object claims to point to a tree, say, and points to a blob
> instead, we can detect that problem immediately.
>
> Or worse, think about something like "git show 1234abcd". If the
> metadata was not part of the object, then how would we know if you
> wanted to show a commit, or a blob (that happens to look like a commit),
> etc? That metadata could be carried outside the hash, but then it has to
> be stored somewhere, and is subject to ending up mismatched to the
> contents. Hashing all of it (including the size) makes consistency
> checking much easier.
>

Oh, you are right: this could be to keep Git objects with identical
content but different types from sharing the same id. However, I always
associate Git with a file system, where metadata such as file type and
size is stored in the inode, while the file data is stored in separate
chunks.

> > > For packed object, it effectively is metadata, just stuck at the front
> > > of the object contents, rather than in a separate table. That lets us
> > > use the same .idx file for finding that metadata as we do for the
> > > contents themselves (at the slight cost that if you're _just_ accessing
> > > metadata, the results are sparser within the file, which has worse
> > > behavior for cold-cache disks).
> >
> > Agree. But what if there is a metadata table in the .idx file?
> > We can even know the type and size of the object without accessing
> > the packfile.
>
> I'm not sure it would be any faster than accessing the packfile. If you
> stick the metadata in the .idx file's oid lookup table, then lookups
> perform a bit worse because you're wasting memory cache. If you make a
> separate table in the .idx file that's OK, but I'm not sure it's
> consistently better than finding the data in the packfile.
>

Yes, but it may be very convenient if we need to filter by object type
or size.

> The oid lookup table gives you a way to index the table in
> constant-time (if you store the table as fixed-size entries in sha1
> order), but we can also access the packfile in constant-time (the idx
> table gives us offsets). The idx metadata table would have better cache
> behavior if you're only looking at metadata, and not contents. But
> otherwise it's worse (since you have to hit the packfile, too). And I
> cheated a bit to say "fixed-size" above; the packfile metadata is in a
> variable-length encoding, so in some ways it's more efficient.
>

Yes, if we only use git cat-file --batch-check, we may be able to
improve performance by avoiding access to the packfile. Additionally, I
think such a metadata table is well suited to filtering and aggregating
operations.
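Just to make the use case concrete (a rough sketch, not a proposal):
with today's tooling a type/size filter has to post-process the
--batch-check output, e.g. to find blobs over an arbitrary 1MB
threshold:

    # type and size currently come from the pack entry headers; a
    # metadata table in the .idx could answer the same query on its own
    git cat-file --batch-all-objects \
        --batch-check='%(objectname) %(objecttype) %(objectsize)' |
        awk '$2 == "blob" && $3 > 1048576'

An internal --type-filter could fold the awk step into the cat-file
process itself.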
> So I doubt it would make any operations appreciably faster, and even if
> it did, you'd possibly be trading off versus other operations. I think
> the more interesting metadata is not type/size, but properties such as
> those stored by the commit graph. And there we do have separate tables
> for fast access (and it's a _lot_ faster, because it's helping us avoid
> inflating the object contents).
>

Yeah, optimizing the retrieval of metadata such as type/size may not
provide as much benefit as recording commit properties in a separate
table, the way the commit-graph optimization does.

> > > I'm not sure how much of a speedup it would yield in practice, though.
> > > If you're printing the object contents, then the extra lookup is
> > > probably not that expensive by comparison.
> > >
> >
> > I feel like this solution may not be feasible. After we get the type and size
> > for the first time, we go through different output processes for different types
> > of objects: use `stream_blob()` for blobs, and `read_object_file()` with
> > `batch_write()` for other objects. If we obtain the content of a blob in one
> > single read operation, then the performance optimization provided by
> > `stream_blob()` would be invalidated.
>
> Good point. So yeah, even to use it in today's code you'd need something
> conditional. A few years ago I played with an option for object_info
> that would let the caller say "please give me the object contents if
> they are smaller than N bytes, otherwise don't".
>
> And that would let many call-sites get type, size, and content together
> most of the time (for small objects), and then stream only when
> necessary. I still have the patches, and running them now it looks like
> there's about a 10% speedup running:
>
>   git cat-file --unordered --batch-all-objects --batch >/dev/null
>
> Other code paths dealing with blobs would likewise get a small speedup,
> I'd think. I don't remember why I didn't send it. I think there was some
> ugly refactoring that I needed to double-check, and my attention just
> got pulled elsewhere. The messy patches are at:
>
>   https://github.com/peff/git jk/object-info-round-trip
>
> if you're interested.
>

Alright, this does feel a bit hackish: most objects would have their
content fetched on the first read, while blobs larger than N would still
be streamed via stream_blob(). But I feel this single-read optimization
is a bit off topic here; to quote your earlier message:

> So a nice thing about being able to do the filtering in one process is
> that we could _eventually_ do it all with one object lookup. But I'd
> probably wait on adding something like --type-filter until we have an
> internal single-lookup API, and then we could time it to see how much
> speedup we can get.

The single-read optimization doesn't seem to provide much benefit for
implementing object filters, because by that point we have already read
the content of the object in advance, haven't we?

ZheNing Hu
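P.S. Going back to my question about hashing {type, size, data}
together: it is easy to check by hand that the object id is just the
hash of a "<type> <size>\0" header followed by the data (assuming the
default SHA-1 object format), for example:

    # both commands print the same id: the header is part of what is hashed
    $ printf 'hello\n' | git hash-object --stdin
    ce013625030ba8dba906f756967f9e9ca394464a
    $ printf 'blob 6\0hello\n' | sha1sum
    ce013625030ba8dba906f756967f9e9ca394464a  -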