Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

Lars Schneider <larsxschneider@xxxxxxxxx> · Sat, 30 Jul 2016 01:44:40 +0200

> On 30 Jul 2016, at 01:11, Jakub Narębski <jnareb@xxxxxxxxx> wrote:
> 
> On 2016-07-29 at 19:35, Junio C Hamano wrote:
>> Lars Schneider <larsxschneider@xxxxxxxxx> writes:
>> 
>>> I think sending it upfront is nice for buffer allocations of big files
>>> and it doesn't cost us anything to do it.
>> 
>> While I do NOT think "total size upfront" MUST BE avoided at all costs,
>> I do not think the above statement to justify it makes ANY sense.
>> 
>> Big files are by definition something you cannot afford to hold in
>> core in their entirety, so you do not want to be told that you'd be
>> fed 40GB and ask xmalloc to allocate that much.
> 
> I don't know much about how filter drivers work internally, but in some
> cases Git reads or writes from a file (file descriptor), in other cases
> it reads or writes from a str+len pair (it probably predates strbuf) - I
> think in those cases the file needs to fit in memory (in size_t).  So in
> some cases Git reads the file into memory.  Whether it uses xmalloc or
> mmap, I don't know.
> 
>> 
>> It allows the reader to be lazy for buffer allocations as long as
>> you know the file fits in-core, at the cost of forcing the writer to
>> somehow come up with the total number of bytes even before sending a
>> single byte (in other words, if the writer cannot produce and hold
>> the data in-core, it may even have to spool the data in a temporary
>> file only to count, and then play it back after showing the total
>> size).
> 
> For some types of filters you can know the size upfront:
> - for filters such as rot13, with 1-to-1 transformation, you know
>   that the output size is the same as the input size
> - for block encodings, and for constant-width to constant-width
>   encoding conversion, filter can calculate output size from the
>   input size (e.g. <output size> = 2*<input size>)
> - a filter may get the size from somewhere, for example the LFS filter
>   stub is constant size, and files are stored in artifactory with
>   their length
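
To make the first two cases concrete, here is a minimal Python sketch of
filters whose output size follows from the input size alone. The helper
names are made up for this mail and are not part of the patch, and the
UTF-16 conversion is only one example of a constant-width encoding that
doubles the size (for ASCII input).

```python
import codecs


def rot13_filter(data: bytes) -> bytes:
    # 1-to-1 transformation: every input byte maps to exactly one output byte.
    # Assumes ASCII input, which is all rot13 is defined for here.
    return codecs.encode(data.decode("ascii"), "rot_13").encode("ascii")


def rot13_output_size(input_size: int) -> int:
    # Output size equals input size, so it can be announced upfront.
    return input_size


def utf16le_filter(data: bytes) -> bytes:
    # Constant-width to constant-width conversion: ASCII (1 byte per char)
    # to UTF-16LE without a BOM (2 bytes per char).
    return data.decode("ascii").encode("utf-16-le")


def utf16le_output_size(input_size: int) -> int:
    # <output size> = 2 * <input size>
    return 2 * input_size
```

In both cases the writer can state the size before producing a single
byte of output, without spooling anything.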
> 
>> 
>> It is good that you allow both modes of operation and the size of
>> the data can either be given upfront (which allows a single fixed
>> allocation upfront without realloc, as long as the data fits in
>> core), or be left "(atend)".
> 
> I think the protocol should be either: <size> + <contents>, or
> <size unknown> + <contents> + <flush>, that is do not use flush
> packet if size is known upfront -- it would be a second point
> of truth (SPOT principle).

As I mentioned elsewhere, a <flush> packet is always sent right now.
I have no strong opinion on whether this is good or bad. The
implementation was a little bit simpler that way, which is why I did it.
I will implement whichever option the majority prefers :-)
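
To make the two alternatives concrete, here is a rough Python sketch of
the framing Jakub describes, built on pkt-line style packets (4 hex
digits of length that include the length field itself, and "0000" as the
flush packet). The "size=..." header lines and the function names are
assumptions made up for illustration; this is not the wire format of
the patch.

```python
FLUSH = b"0000"          # pkt-line flush packet
MAX_PAYLOAD = 65516      # 65520 byte packet limit minus the 4 length digits


def pkt_line(payload: bytes) -> bytes:
    # The length prefix counts its own 4 bytes plus the payload.
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one packet")
    return f"{len(payload) + 4:04x}".encode("ascii") + payload


def send_with_size(content: bytes) -> bytes:
    # <size> + <contents>: the reader knows when to stop, so no flush.
    out = pkt_line(f"size={len(content)}\n".encode("ascii"))
    for i in range(0, len(content), MAX_PAYLOAD):
        out += pkt_line(content[i:i + MAX_PAYLOAD])
    return out


def send_without_size(chunks) -> bytes:
    # <size unknown> + <contents> + <flush>: the flush marks the end.
    out = pkt_line(b"size=unknown\n")
    for chunk in chunks:
        if chunk:  # skip empty chunks rather than emit empty packets
            out += pkt_line(chunk)
    return out + FLUSH
```

The current implementation corresponds to sending the flush in both
cases; dropping it from the first function's counterpart is the only
difference between the two proposals.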

Cheers,
Lars

> 
>> I just don't want to see it oversold as a "feature" that the size
>> has to come before data.  That is a limitation, not a feature.
>> 
>> Thanks.
>> 
> 
