Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

Lars Schneider <larsxschneider@xxxxxxxxx> · Thu, 28 Jul 2016 14:10:12 +0200

> On 27 Jul 2016, at 20:11, Jeff King <peff@xxxxxxxx> wrote:
> 
> On Wed, Jul 27, 2016 at 07:31:26PM +0200, Lars Schneider wrote:
> 
>>>> +	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet flush
>>> 
>>> What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
>>> 32-bit system)?
>> 
>> Would that be an acceptable solution?
>> 
>> if (size + 1 > SIZE_MAX)
>> 	return die("unrepresentable length for filter buffer");
> 
> No, because by definition "size" will wrap to 0. :)
> 
> You have to do:
> 
>  if (size > SIZE_MAX - 1)
> 	die("whoops");
> 
>> Can you point me to an example in the Git source how this kind of thing should
>> be handled?
> 
> The strbuf code itself checks for overflows. So you could do:
> 
>  strbuf_grow(sb, size);
>  ... fill up with size bytes ...
>  strbuf_addch(sb, ...); /* extra byte for whatever */
> 
> That does mean _possibly_ making a second allocation just to add the
> extra byte, but in practice it's not likely (unless the input exactly
> matches the strbuf's growth pattern).
> 
> If you want to do it yourself, I think:
> 
>  strbuf_grow(sb, st_add(size, 1));

I like that solution! Thanks!

> would work.
> 
>>>> +	while (
>>>> +		bytes_read > 0 && 					// the last packet was no flush
>>>> +		sb->len - total_bytes_read - 1 > 0 	// we still have space left in the buffer
>>>> +	);
>>> 
>>> And I'm not sure if you need to distinguish between "0" and "-1" when
>>> checking byte_read here.
>> 
>> We want to finish reading in both cases, no?
> 
> If we get "-1", that's from an unexpected EOF during the packet_read(),
> because you set GENTLE_ON_EOF. So there's nothing left to read, and we
> should break and return an error.

Right.

> I guess "0" would come from a flush packet? Why would the filter send
> back a flush packet (unless you were using them to signal end-of-input,
> but then why does the filter have to send back the number of bytes ahead
> of time?).

Sending the bytes ahead of time (if available) might be nice for efficient
buffer allocation. I am changing the code so that both cases can be handled
(size ahead of time and no size ahead of time).

>>> Why 8K? The pkt-line format naturally restricts us to just under 64K, so
>>> why not take advantage of that and minimize the framing overhead for
>>> large data?
>> 
>> I took inspiration from here for 8K MAX_IO_SIZE:
>> https://github.com/git/git/blob/master/copy.c#L6
>> 
>> Is this read limit correct? Should I read 8 times to fill a pkt-line?
> 
> MAX_IO_SIZE is generally 8 _megabytes_, not 8K. The loop in copy.c just
> haad to pick an arbitrary size for doing its read/write proxying.  I
> think in practice you are not likely to get much benefit from going
> beyond 8K or so there, just because operating systems tend to do things
> in page-sizes anyway, which are usually 4K.
> 
> But since you are formatting the result into a form that has framing
> overhead, anything up to LARGE_PACKET_MAX will see benefits (though
> admittedly even 4 bytes per 8K is not much).
> 
> I don't think it's worth the complexity of reading 8 times, but just
> using a buffer size of LARGE_PACKET_MAX-4 would be the most efficient.
> 
> I doubt it matters _that much_ in practice, but any time I see a magic
> number I have to wonder at the "why". At least basing it on
> LARGE_PACKET_MAX has some basis, whereas 8K is largely just made-up. :)

Sounds good. I will use LARGE_PACKET_MAX-4 !

> 
>>> We do sometimes do "ret |= something()" but that is in cases where
>>> "ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
>>> your function's error-reporting is inverted from our usual style?
>> 
>> I thought it makes the code easier to read and the filter doesn't care
>> at what point the error happens anyways. The filter either succeeds
>> or fails. What style would you suggest?
> 
> I think that's orthogonal. I just mean that using zero for success puts
> you in our usual style, and then accumulating errors can be done with
> "|=".

Ah. I guess I was misguided by the way errors are currently handled
in `apply_filter` (success = 1; failure = 0):
https://github.com/git/git/blob/8c6d1f9807c67532e7fb545a944b064faff0f70b/convert.c#L437-L479

I wouldn't like if the different filter protocols would use different
error exit codes. Would it be OK to adjust the existing `apply_filter`
function in a cleanup patch?

> I didn't look carefully at whether the accumulating style you're using
> makes sense or not. But something like:
> 
>>>> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
>>>> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
> 
> does mean that we call the second write() even if the first one failed.
> That's a waste of time (albeit a minor one), but it also means you could
> potentially cover up the value of "errno" from the first one (though in
> practice I'd expect the second one to fail for the same reason).

Oh. You're right. For some reason I thought the second operator would
never be evaluated if the first operator is 0. Apparently that is not
the case for bit-wise & ... only for logical & ... thanks for the lesson!

- Lars--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html