Re: [PATCH v3 10/10] convert: add filter.<driver>.process option

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> On 31 Jul 2016, at 11:42, Jakub Narębski <jnareb@xxxxxxxxx> wrote:
> 
> [Excuse me replying to myself, but there are a few things I forgot,
> or realized only later]

No worries :)

> 
> W dniu 31.07.2016 o 00:05, Jakub Narębski pisze:
>> W dniu 30.07.2016 o 01:38, larsxschneider@xxxxxxxxx pisze:
>>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>>> 
>>> Git's clean/smudge mechanism invokes an external filter process for every
>>> single blob that is affected by a filter. If Git filters a lot of blobs
>>> then the startup time of the external filter processes can become a
>>> significant part of the overall Git execution time.
>>> 
>>> This patch adds the filter.<driver>.process string option which, if used,
>>> keeps the external filter process running and processes all blobs with
>>> the following packet format (pkt-line) based protocol over standard input
>>> and standard output.
>> 
>> I think it would be nice to have here at least summary of the benchmarks
>> you did in https://github.com/github/git-lfs/pull/1382
> 
> Note that this feature is especially useful if startup time is long,
> that is if you are using an operating system with costly fork / new process
> startup time like MS Windows (which you have mentioned), or writing
> filter in a programming language with large startup time like Java
> or Python (the latter may have changed since).
> 
>  https://gnustavo.wordpress.com/2012/06/28/programming-languages-start-up-times/

OK, I will add this. Is it OK to add the link to the commit message?
(since I don't know how long the link will be available).


> [...]
>> I was thinking about having possible responses to receiving file
>> contents (or starting receiving in the streaming case) to be:
>> 
>>  packet:          git< ok size=7\n    (or "ok 7\n", if size is known)
>> 
>> or
>> 
>>  packet:          git< ok\n           (if filter does not know size upfront)
>> 
>> or
>> 
>>  packet:          git< fail <msg>\n   (or just "fail" + packet with msg)
>> 
>> The last would be when filter knows upfront that it cannot perform
>> the operation.  Though sending an empty file with non-"success" final
>> would work as well.
> 
> [...]
> 
>>> In case the filter cannot process the content, it is expected
>>> to respond with the result content size 0 (only if "stream" is
>>> not defined) and a "reject" packet.
>>> ------------------------
>>> packet:          git< size=0\n    (omitted with capability "stream")
>>> packet:          git< reject\n
>>> ------------------------
>> 
>> This is *wrong* idea!  Empty file, with size=0, can be a perfectly
>> legitimate response.  
> 
> Actually, I think I have misunderstood your intent.  If you want to have
> simpler protocol, with only one place to signal errors, that is after
> sending a response, then proper way of signaling the error condition
> would be to send empty file and then "reject" instead of "success":
> 
>   packet:          git< size=0\n    (omitted with capability "stream")
>   packet:          git< 0000        (we need this flush packet)
>   packet:          git< reject\n
> 
> Otherwise in the case without size upfront (capability "stream")
> file with contents "reject" would be mistaken for the "reject" packet.
> 
> See below for proposal with two places to signal errors: before sending
> first byte, and after.

Right now the protocol is implemented covering the following cases:

## CASE 1 - no stream success

packet:          git< size=57\n
packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< success\n


## CASE 2 - no stream success but 0 byte response

packet:          git< size=0\n
packet:          git< success\n


## CASE 3 - no stream filter; filter doesn't want to process the file

packet:          git< size=0\n
packet:          git< reject\n


## CASE 4 - no stream filter; filter error

packet:          git< size=57\n
packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< error\n

CASE 4 is not explicitly checked. If a final message is neither
"success" nor "reject" then it is interpreted as error. If that
happens then Git will shutdown and restart the filter process
if there is another file to filter. 

Alternatively a filter process can shutdown itself, too, to signal
an error.

The corresponding stream filter look like this:

## CASE 1 - stream success

packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< success\n


## CASE 2 - stream success but 0 byte response

packet:          git< 0000
packet:          git< success\n


## CASE 3 - stream filter; filter doesn't want to process the file

packet:          git< 0000
packet:          git< reject\n


## CASE 4 - stream filter; filter error

packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< error\n

--

I just realized that the size 0 case is a bit inconsistent
in the no stream case as it has no flush packet. Maybe I 
should indeed remove the flush packet in the no stream case
completely?!

Do the cases above make sense to you?

Regarding error handling. I would prefer it if the filter prints
all errors to STDERR by itself. I think that is the safest
option to communicate errors to the users because if the communication
got into a bad state then Git might not be able to read the errors
properly.

See Peff's response on the topic, too:
http://public-inbox.org/git/20160729165018.GA6553%40sigill.intra.peff.net/


> NOTE: there is a bit of mixed and possibly confusing notation, that
> is 0000 is flush packet, not packet with 0000 as content.  Perhaps
> write pkt-line in full?

I am not sure I understand what you mean (maybe it's too late for me...).
Can you try to rephrase or give an example?

Thank you,
Lars



> 
> 
> [...]
>>> ---
>>> Documentation/gitattributes.txt |  84 ++++++++-
>>> convert.c                       | 400 +++++++++++++++++++++++++++++++++++++--
>>> t/t0021-conversion.sh           | 405 ++++++++++++++++++++++++++++++++++++++++
>>> t/t0021/rot13-filter.pl         | 177 ++++++++++++++++++
>>> 4 files changed, 1053 insertions(+), 13 deletions(-)
>>> create mode 100755 t/t0021/rot13-filter.pl
> 
> Wouldn't it be better for easier review to split it into separate patches?
> Perhaps at least the new test...
> 
> [...]
>> I would assume that we have two error conditions.  
>> 
>> First situation is when the filter knows upfront (after receiving name
>> and size of file, and after receiving contents for not-streaming filters)
>> that it cannot process the file (like e.g. LFS filter with artifactory
>> replica/shard being a bit behind master, and not including contents of
>> the file being filtered).
>> 
>> My proposal is to reply with "fail" _in place of_ size of reply:
>> 
>>   packet:         git< fail\n       (any case: size known or not, stream or not)
>> 
>> It could be "reject", or "error" instead of "fail".
>> 
>> 
>> Another situation is if filter encounters error during output,
>> either with streaming filter (or non-stream, but not storing whole
>> input upfront) realizing in the middle of output that there is something
>> wrong with input (e.g. converting between encoding, and encountering
>> character that cannot be represented in output encoding), or e.g. filter
>> process being killed, or network connection dropping with LFS filter, etc.
>> The filter has send some packets with output already.  In this case
>> filter should flush, and send "reject" or "error" packet.
>> 
>>   <error condition>
>>   packet:         git< "0000"       (flush packet)
>>   packet:         git< reject\n
>> 
>> Should there be a place for an error message, or would standard error
>> (stderr) be used for this?
> 

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]