Steffen Prohaska <prohaska@xxxxxx> writes:

>> Couldn't we do that with an lseek (or even an mmap with offset 0)? That
>> obviously would not work for non-file inputs, but I think we address
>> that already in index_fd: we push non-seekable things off to index_pipe,
>> where we spool them to memory.
>
> It could be handled that way, but we would be back to the original problem
> that 32-bit git fails for large files.

Correct, and you are making an incremental improvement so that such a
large blob can be handled _when_ the filters can successfully munge it
back and forth.  If we fail due to running out of memory when the
filters cannot, that would be the same as without your improvement, so
you are still making progress.

> To implement something like the ideal strategy below, the entire convert
> machinery for crlf and ident would have to be converted to a streaming
> approach.

Yes, that has always been the longer-term vision since the day the
streaming infrastructure was introduced.

>> So it seems like the ideal strategy would be:
>>
>>   1. If it's seekable, try streaming. If not, fall back to lseek/mmap.
>>
>>   2. If it's not seekable and the filter is required, try streaming. We
>>      die anyway if we fail.

Puzzled...  Is it assumed that any content for which the filters tell
us to use the contents from the db as-is (by exiting with a non-zero
status) will always be too large to fit in-core?  For small contents,
isn't this "ideal" strategy a regression?

>>   3. If it's not seekable and the filter is not required, decide based
>>      on file size:
>>
>>      a. If it's small, spool to memory and proceed as we do now.
>>
>>      b. If it's big, spool to a seekable tempfile.
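
To make the quoted decision concrete, here is a minimal sketch of the
three cases as one helper.  The names (decide_spool, spool_choice,
size_hint, big_threshold) are made up for illustration and are not
part of git's actual API; note also that with a pipe the size is
usually not known up front, so case 3 would in practice mean spooling
to memory until a threshold is exceeded and only then switching to a
tempfile:

/*
 * Illustrative sketch only -- not git's code.  It merely spells out
 * the three cases of the "ideal strategy" quoted above.
 */
#include <sys/types.h>
#include <unistd.h>

enum spool_choice {
	SPOOL_STREAM,      /* stream the content through the filter */
	SPOOL_TO_MEMORY,   /* small: read everything into core first */
	SPOOL_TO_TEMPFILE  /* big: spool to a seekable temporary file */
};

static enum spool_choice decide_spool(int fd, int filter_required,
				      off_t size_hint, off_t big_threshold)
{
	/* A pipe or socket reports ESPIPE here; a regular file does not. */
	int seekable = lseek(fd, 0, SEEK_CUR) != (off_t)-1;

	if (seekable)
		/*
		 * 1. Try streaming; if the filter declines or fails we
		 *    can lseek()/mmap() back to offset 0 and redo the
		 *    non-filtered path.
		 */
		return SPOOL_STREAM;

	if (filter_required)
		/*
		 * 2. A failing required filter is fatal either way, so
		 *    streaming loses nothing.
		 */
		return SPOOL_STREAM;

	/*
	 * 3. Not seekable and the filter is optional: decide by size.
	 *    size_hint is negative when the size is unknown.
	 */
	if (size_hint >= 0 && size_hint > big_threshold)
		return SPOOL_TO_TEMPFILE;   /* 3b. big */
	return SPOOL_TO_MEMORY;             /* 3a. small or unknown */
}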
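
And as a rough illustration of what "converting the convert machinery
for crlf to a streaming approach" could look like in the long run,
here is a hypothetical chunk-at-a-time CRLF-to-LF pass (this is not
convert.c's actual code); the only state that has to survive a chunk
boundary is a possibly pending CR:

#include <unistd.h>

static int stream_crlf_to_lf(int in_fd, int out_fd)
{
	char ibuf[8192];
	char obuf[8193];	/* +1: a CR carried over from the previous chunk */
	int pending_cr = 0;
	ssize_t n;

	while ((n = read(in_fd, ibuf, sizeof(ibuf))) > 0) {
		size_t o = 0;
		for (ssize_t i = 0; i < n; i++) {
			char c = ibuf[i];
			if (pending_cr) {
				pending_cr = 0;
				if (c != '\n')
					obuf[o++] = '\r'; /* lone CR is kept */
			}
			if (c == '\r') {
				pending_cr = 1;	/* decide on the next byte */
				continue;
			}
			obuf[o++] = c;
		}
		if (o && write(out_fd, obuf, o) != (ssize_t)o)
			return -1;
	}
	if (pending_cr && write(out_fd, "\r", 1) != 1)
		return -1;
	return n < 0 ? -1 : 0;
}

A real implementation would also have to retry short writes and EINTR,
but the point is that memory use stays bounded by the chunk size
instead of the blob size.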