Junio C Hamano <gitster@xxxxxxxxx> writes:

>> Another difference with "write_loose_object()" is that we have no
>> chance to run "write_object_file_prepare()" to calculate the oid in
>> advance.
>
> That is somewhat curious.  Is it fundamentally impossible, or is it
> just that this patch was written in such a way that conflates the
> two and it is cumbersome to split the "we repeat the sequence of
> reading and deflating just a bit until we process all" and the "we
> compute the hash over the data first and then we write out for
> real"?

OK, the answer lies somewhere in between.

The initial user of this streaming interface reads from an incoming
packfile and feeds the inflated bytestream to the interface, which
means we cannot seek.  That makes it "fundamentally impossible" for
that codepath (i.e. unpack-objects reading from the packstream and
writing to on-disk loose objects).

But if the input source is seekable (e.g. a file in the working tree),
there is no fundamental reason why the new interface has "no chance to
run prepare to calculate the oid in advance".  It is just that such a
caller is not added by this series, and we chose not to allow the
"prepare and then write" two-step process because we do not currently
need it when this series lands.

> I am very tempted to ask why we do not do this to _all_ loose object
> files.  Instead of running the machinery twice over the data (once to
> compute the object name, then to compute the contents and write out),
> if we can produce loose object files of any size with a single pass,
> wouldn't that be an overall win?

There is a patch later in the series whose proposed log message has
benchmarks showing that it is slower in general.  It still is curious
where the slowness comes from, and whether it is something we can
tune, though.

Thanks.
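P.S.  For readers following along, the single-pass shape discussed
above can be sketched as follows.  This is a Python illustration of
the idea, not the series' C code; the function name, chunk size, and
use of a temporary-file-then-rename step are my assumptions about the
general technique.  Because the input stream cannot be seeked, the
object name (the hash) is only known after the last byte has been
processed, so the object is written under a temporary name and moved
into place at the end.  Note that the object's size is still needed up
front for the loose object header; in the unpack-objects case it is
known from the pack entry header even though the stream itself is not
seekable.

```python
import hashlib
import os
import tempfile
import zlib


def stream_loose_object(stream, size, objdir):
    """Single-pass write of a blob-like loose object from a
    non-seekable stream.

    Hashing and deflating happen in the same loop over the input, so
    the data is read exactly once.  The final pathname depends on the
    hash, which is unknown until the end, hence the temp file and the
    final rename.  (Illustrative sketch only.)
    """
    sha = hashlib.sha1()
    z = zlib.compressobj()
    header = b"blob %d\x00" % size
    sha.update(header)

    fd, tmp = tempfile.mkstemp(dir=objdir, prefix="tmp_obj_")
    with os.fdopen(fd, "wb") as out:
        out.write(z.compress(header))
        remaining = size
        while remaining:
            chunk = stream.read(min(8192, remaining))
            if not chunk:
                raise EOFError("short input stream")
            sha.update(chunk)           # hash ...
            out.write(z.compress(chunk))  # ... and deflate, same pass
            remaining -= len(chunk)
        out.write(z.flush())

    # Only now do we know the object name; move the file into place.
    oid = sha.hexdigest()
    final_dir = os.path.join(objdir, oid[:2])
    os.makedirs(final_dir, exist_ok=True)
    os.replace(tmp, os.path.join(final_dir, oid[2:]))
    return oid
```

The two-pass alternative being discussed would instead read the data
once to compute `oid`, then read it again to deflate and write, which
is only possible when the source is seekable.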