Re: [Question] Can git cat-file have a type filtering option?

On Wed, Apr 12, 2023 at 15:43, Jeff King <peff@xxxxxxxx> wrote:
>
> On Tue, Apr 11, 2023 at 10:09:33PM +0800, ZheNing Hu wrote:
>
> > > So reducing the size of the actual --batch printing may make the
> > > relative cost of using multiple processes much higher (I didn't apply
> > > your --type-filter patches to test myself).
> > >
> >
> > You are right. Adding the --unordered option can avoid the
> > time-consuming sorting process from affecting the test results.
>
> Just to be clear: it's not the cost of sorting, but rather that
> accessing the object contents in a sub-optimal order is much worse (and
> that sub-optimal order happens to be "sorted by sha1", since that is
> effectively random with respect to the contents).
>

Okay, thanks for the correction. Reading the packfile in SHA-1 order is
effectively a random-access pattern with respect to the on-disk layout,
so it does add extra overhead.

> > time git cat-file --unordered --batch-all-objects \
> > --batch-check="%(objectname) %(objecttype)" | \
> > awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
> >
> > 4.17s user 0.23s system 109% cpu 4.025 total
> >
> > time git cat-file --unordered --batch-all-objects --batch
> > --type-filter=blob >/dev/null
> >
> > 3.84s user 0.17s system 97% cpu 4.099 total
> >
> > It looks like the difference is not significant either.
>
> OK, good, that means we can probably not worry about it. :)
>
> > > In general, I do think having a processing pipeline like this is OK, as
> > > it's pretty flexible. But especially for smaller queries (even ones that
> > > don't ask for the whole object contents), the per-object lookup costs
> > > can start to dominate (especially in a repository that hasn't been
> > > recently packed). Right now, even your "--batch --type-filter" example
> > > is probably making at least two lookups per object, because we don't
> > > have a way to open a "handle" to an object to check its type, and then
> > > extract the contents conditionally. And of course with multiple
> > > processes, we're naturally doing a separate lookup in each one.
> > >
> >
> > Yes, the type of the object is encapsulated in the header of the loose
> > object file or the object entry header of the pack file. We have to read
> > it to get the object type. This may be a lingering question I have had:
> > why does git put the type/size in the file data instead of storing it as some
> > kind of metadata elsewhere?
>
> It's not just metadata; it's actually part of what we hash to get the
> object id (though of course it doesn't _have_ to be stored in a linear
> buffer, as the pack storage shows).

I'm still puzzled about why git computes the object id over {type, size, data}
together instead of just {data}.
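
Just to confirm my understanding of what is hashed: the preimage is the
header "<type> <size>\0" followed by the data. Here is a minimal standalone
sketch of that (assuming a SHA-1 repository; it uses OpenSSL's deprecated
SHA1() for the digest and is not git code, built with -lcrypto). Its output
should match `printf 'hello\n' | git hash-object --stdin`:

  #include <openssl/sha.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          const char *data = "hello\n";
          char buf[64];
          unsigned char md[SHA_DIGEST_LENGTH];
          size_t hdr_len, i;

          /* the object header: "<type> <size>" plus a NUL byte */
          hdr_len = snprintf(buf, sizeof(buf), "blob %zu", strlen(data)) + 1;
          memcpy(buf + hdr_len, data, strlen(data));

          /* the id covers header + data, not the data alone */
          SHA1((unsigned char *)buf, hdr_len + strlen(data), md);

          for (i = 0; i < SHA_DIGEST_LENGTH; i++)
                  printf("%02x", md[i]);
          printf("\n");
          return 0;
  }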

> But for loose objects, where would
> such metadata be? And accessing it isn't too expensive; we only zlib
> inflate the first few bytes (the main cost is in the syscalls to find
> and open the file).
>

I don't have much hands-on experience here. It looks like I should do some
performance testing to compare the cost of finding and opening loose objects
vs. reading and inflating them.
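
As a first step, here is a rough standalone sketch of the "inflate only the
first few bytes" part (it uses zlib directly and takes a loose object path as
an example argument; it is not git's code, just to get a feel for the cost;
build with -lz):

  #include <stdio.h>
  #include <string.h>
  #include <zlib.h>

  /* usage: ./a.out .git/objects/xx/yyyy... (a loose object file) */
  int main(int argc, char **argv)
  {
          unsigned char in[256];
          char hdr[64];
          z_stream z;
          FILE *f;

          if (argc < 2 || !(f = fopen(argv[1], "rb")))
                  return 1;

          memset(&z, 0, sizeof(z));
          inflateInit(&z);
          z.avail_in = fread(in, 1, sizeof(in), f);
          z.next_in = in;
          z.avail_out = sizeof(hdr) - 1;
          z.next_out = (unsigned char *)hdr;

          /* one inflate() call is enough to recover "<type> <size>\0" */
          inflate(&z, Z_SYNC_FLUSH);
          hdr[sizeof(hdr) - 1 - z.avail_out] = '\0';

          printf("%s\n", hdr);    /* e.g. "blob 6"; the header's own NUL ends it */

          inflateEnd(&z);
          fclose(f);
          return 0;
  }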

> For packed object, it effectively is metadata, just stuck at the front
> of the object contents, rather than in a separate table. That lets us
> use the same .idx file for finding that metadata as we do for the
> contents themselves (at the slight cost that if you're _just_ accessing
> metadata, the results are sparser within the file, which has worse
> behavior for cold-cache disks).
>

Agreed. But what if there were a metadata table in the .idx file? Then we
could learn an object's type and size without touching the packfile at all.

> But when I say that lookup costs dominate, what I mean is that we'd
> spend a lot of our time binary searching within the pack .idx file, or
> falling back to syscalls to look for loose objects.
>

Alright, the binary search in the .idx may indeed cost more than reading the
type and size out of the packfile itself.
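
For my own understanding, the per-object lookup cost looks roughly like this
(a simplified in-memory illustration with a made-up `struct idx_view`, not
git's actual .idx code): a fanout step on the first byte, then a binary
search over the sorted object names:

  #include <stdint.h>
  #include <string.h>

  /* hypothetical in-memory view of an .idx: fanout + sorted 20-byte names */
  struct idx_view {
          uint32_t fanout[256];   /* fanout[b] = #objects with first byte <= b */
          const unsigned char (*names)[20];
  };

  static int idx_find(const struct idx_view *idx, const unsigned char *oid)
  {
          uint32_t lo = oid[0] ? idx->fanout[oid[0] - 1] : 0;
          uint32_t hi = idx->fanout[oid[0]];

          while (lo < hi) {
                  uint32_t mid = lo + (hi - lo) / 2;
                  int cmp = memcmp(idx->names[mid], oid, 20);

                  if (!cmp)
                          return (int)mid;  /* index into the offset table */
                  if (cmp < 0)
                          lo = mid + 1;
                  else
                          hi = mid;
          }
          return -1;              /* not in this pack */
  }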

> > > So a nice thing about being able to do the filtering in one process is
> > > that we could _eventually_ do it all with one object lookup. But I'd
> > > probably wait on adding something like --type-filter until we have an
> > > internal single-lookup API, and then we could time it to see how much
> > > speedup we can get.
> >
> > I am highly skeptical of this "internal single-lookup API". Do we really
> > need an extra metadata table to record all objects?
> > Something like: metadata: {oid: type, size}?
>
> No, I don't mean changing the storage at all. I mean that rather than
> doing this:
>
>   /* get type, size, etc, for --batch format */
>   type = oid_object_info(&oid, &size);
>
>   /* now get the contents for --batch to write them itself; but note
>    * that this searches for the entry again within all packs, etc */
>   contents = read_object_file(oid, &type, &size);
>
> as the cat-file code now does (because the first call is in
> batch_object_write(), and the latter in print_object_or_die()), they
> could be a single call that does the lookup once.
>
> We could actually do that today, since the object contents are
> eventually fed from oid_object_info_extended(), and we know ahead of
> time that we want both the metadata and the contents. But that wouldn't
> work if we filtered by type, etc.
>

So the single lookup you mentioned earlier means combining the two reads,
one for the type and size and one for the contents, into a single call when
we already know we want the contents, so that we only pay for the object
lookup (the binary search) once.
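
If I understand it correctly, that would look something like the following
(a hypothetical sketch against the internal object_info API, not an actual
cat-file patch), and it also shows why a type filter does not fit naturally:
the contents are requested before we know whether we will keep them:

  #include "git-compat-util.h"
  #include "object-store.h"

  static int print_blob_single_lookup(struct repository *r,
                                      const struct object_id *oid)
  {
          enum object_type type;
          unsigned long size;
          void *contents = NULL;
          struct object_info oi = OBJECT_INFO_INIT;

          oi.typep = &type;
          oi.sizep = &size;
          oi.contentp = &contents;        /* ask for the data in the same lookup */

          if (oid_object_info_extended(r, oid, &oi, 0) < 0)
                  return -1;

          if (type != OBJ_BLOB) {         /* e.g. --type-filter=blob */
                  free(contents);         /* we already paid for contents we discard */
                  return 0;
          }

          fwrite(contents, 1, size, stdout);
          free(contents);
          return 0;
  }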

> I'm not sure how much of a speedup it would yield in practice, though.
> If you're printing the object contents, then the extra lookup is
> probably not that expensive by comparison.
>

I feel like this solution may not be feasible as-is. After we get the type and
size the first time, we take different output paths depending on the object
type: `stream_blob()` for blobs, and `read_object_file()` plus `batch_write()`
for everything else. If we fetched a blob's contents as part of that one read,
the performance benefit of `stream_blob()` would be lost.
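
For reference, the split I mean looks roughly like this (simplified from what
cat-file does today; the function name, includes, and error handling are
approximate, not the exact code):

  #include "cache.h"
  #include "object-store.h"
  #include "streaming.h"

  static void print_object(const struct object_id *oid, enum object_type type)
  {
          unsigned long size;
          void *buf;

          if (type == OBJ_BLOB) {
                  /* stream straight to stdout; the blob is never held in memory */
                  if (stream_blob_to_fd(1, oid, NULL, 0) < 0)
                          die("could not stream %s", oid_to_hex(oid));
                  return;
          }

          buf = read_object_file(oid, &type, &size);
          if (!buf)
                  die("could not read %s", oid_to_hex(oid));
          write_or_die(1, buf, size);
          free(buf);
  }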

> -Peff
