Re: [PATCH v8 11/11] convert: add filter.<driver>.process option

Lars Schneider <larsxschneider@xxxxxxxxx> · Fri, 30 Sep 2016 20:56:12 +0200

> On 27 Sep 2016, at 00:41, Jakub Narębski <jnareb@xxxxxxxxx> wrote:
> 
> Part first of the review of 11/11.
> 
> W dniu 20.09.2016 o 21:02, larsxschneider@xxxxxxxxx pisze:
>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>> 
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> index 7aff940..946dcad 100644
>> --- a/Documentation/gitattributes.txt
>> +++ b/Documentation/gitattributes.txt
>> @@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the command is
>> fed the blob object from its standard input, and its standard
>> output is used to update the worktree file.  Similarly, the
>> `clean` command is used to convert the contents of worktree file
>> -upon checkin.
>> +upon checkin. By default these commands process only a single
>> +blob and terminate.  If a long running `process` filter is used
>   ^^^^
> 
> Should we use this terminology here?  I have not read the preceding
> part of documentation, so I don't know if it talks about "blobs" or
> if it uses "files" and/or "file contents".

I used that because it was used in the paragraph above already.

>> +Long Running Filter Process
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +If the filter command (a string value) is defined via
>> +`filter.<driver>.process` then Git can process all blobs with a
>> +single filter invocation for the entire life of a single Git
>> +command. This is achieved by using a packet format (pkt-line,
>> +see technical/protocol-common.txt) based protocol over standard
>> +input and standard output as follows. All packets are considered
>> +text and therefore are terminated by an LF. Exceptions are the
>> +"*CONTENT" packets and the flush packet.
> 
> I guess that reasoning here is that all but CONTENT packets are
> metadata, and thus to aid debuggability of the protocol are "text",
> as considered by pkt-line.
> 
> Perhaps a bit more readable would be the following (but current is
> just fine; I am nitpicking):
> 
>  All packets, except for the "{star}CONTENT" packets and the "0000"
>  flush packer, are considered text and therefore are terminated by
>  a LF.

OK, I use that!

> I think it might be a good idea to describe what flush packet is
> somewhere in this document; on the other hand referring (especially
> if hyperlinked) to pkt-line technical documentation might be good
> enough / better.  I'm unsure, but I tend on the side that referring
> to technical documentation is better.

I have this line in the first paragraph of the Long Running Filter process:
"packet format (pkt-line, see technical/protocol-common.txt) based protocol"

> 
>> +to read a welcome response message ("git-filter-server") and exactly
>> +one protocol version number from the previously sent list. All further
> 
> I guess that is to provide forward-compatibility, isn't it?  Also,
> "Git expects..." probably means filter process MUST send, in the
> RFC2119 (https://tools.ietf.org/html/rfc2119) meaning.

True. I feel "expects" reads better but I am happy to change it if
you feel strong about it.

>> +
>> +After the version negotiation Git sends a list of supported capabilities
>> +and a flush packet.
> 
> Is it that Git SHOULD send list of ALL supported capabilities, or is
> it that Git SHOULD NOT send capabilities it does not support, and that
> it MAY send only those capabilities it needs (so for example if command
> uses only `smudge`, it may not send `clean`, so that filter driver doesn't
> need to initialize data it would not need).

"After the version negotiation Git sends a list of all capabilities that
it supports and a flush packet."

Better?

> I wonder why it is "<capability>=true", and not "capability=<capability>".
> Is there a case where we would want to send "<capability>=false".  Or
> is it to allow configurable / value based capabilities?  Isn't it going
> a bit too far: is there even a hind of an idea for parametrize-able
> capability? YAGNI is a thing...

Peff suggested that format and I think it is OK:
http://public-inbox.org/git/20160803224619.bwtbvmslhuicx2qi@xxxxxxxxxxxxxxxxxxxxx/

> A few new capabilities that we might want to support in the near future
> is "size", "stream", which are options describing how to communicate,
> and "cleanFromFile", "smudgeToFile", which are new types of operations...
> but neither needs any parameter.
> 
> I guess that adding new capabilities doesn't require having to come up
> with the new version of the protocol, isn't it.

Correct.

>> +packet:          git< git-filter-server
>> +packet:          git< version=2
>> +packet:          git> clean=true
>> +packet:          git> smudge=true
>> +packet:          git> not-yet-invented=true
> 
> Hmmm... should we hint at the use of kebab-case versus snake_case
> or camelCase for new capabilities?

I personally prefer kebab-case but I think that is a discussion for
future contributions ;-)

>> +------------------------
>> +packet:          git> command=smudge
>> +packet:          git> pathname=path/testfile.dat
>> +packet:          git> 0000
>> +packet:          git> CONTENT
>> +packet:          git> 0000
>> +------------------------
> 
> I think it is important to mention that (at least with current
> `filter.<driver>.process` implementation, that is absent future
> "stream" capability / option) the filter process needs to read
> *whole contents* at once, *before* writing anything.  Otherwise
> it can lead to deadlock.
> 
> This is especially important in that it is different (!) from the
> current behavior of `clean` and `smudge` filters, which can
> stream their response because Git invokes them async.

I added this:
" Please note, that the filter
must not send any response before it received the content and the
final flush packet. "

>> +
>> +If the filter experiences an error during processing, then it can
>> +send the status "error" after the content was (partially or
>> +completely) sent. Depending on the `filter.<driver>.required` flag
>> +Git will interpret that as error but it will not stop or restart the
>> +filter process.
>> +------------------------
>> +packet:          git< status=success
>> +packet:          git< 0000
>> +packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
>> +packet:          git< 0000
>> +packet:          git< status=error
>> +packet:          git< 0000
>> +------------------------
> 
> Good.  A question is if the filter process can send "status=abort"
> after partial contents, or does it need to wait for the next command?

I added:
"expected to respond with an "abort" status at any point in
the protocol."

>> +
>> +After the filter has processed a blob it is expected to wait for
>> +the next "key=value" list containing a command. Git will close
>> +the command pipe on exit. The filter is expected to detect EOF
>> +and exit gracefully on its own.
> 
> Good to have it documented.  
> 
> Anyway, as it is Git command that spawns the filter driver process,
> assuming that the filter process doesn't daemonize itself, wouldn't
> the operating system reap it after its parent process, that is the
> git command it invoked, dies? So detecting EOF is good, but not
> strictly necessary for simple filter that do not need to free
> its resources, or can leave freeing resources to the operating
> system? But I may be wrong here.

The filter process runs independent of Git.

>> +
>> +
>> Interaction between checkin/checkout attributes
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> 
>> diff --git a/contrib/long-running-filter/example.pl b/contrib/long-running-filter/example.pl
>> new file mode 100755
>> index 0000000..c13a631
>> --- /dev/null
>> +++ b/contrib/long-running-filter/example.pl
> 
> To repeat myself, I think it would serve better as a separate patch.

OK

>> +        die "invalid packet size '$bytes_read' field";
> 
> This would read "invalid packet size '000' field", for example.
> Perhaps the following would be (slightly) better:
> 
>  +        die "invalid packet size field: '$bytes_read'";

OK

>> +    }
>> +    elsif ( $pkt_size > 4 ) {
> 
> Isn't a packet of $pkt_size == 4 a valid packet, a keep-alive
> one?  Or is it forbidden?

"Implementations SHOULD NOT send an empty pkt-line ("0004")."
Source: Documentation/technical/protocol-common.txt

>> +            die "invalid packet ($content_size expected; $bytes_read read)";
> 
> This error message would read "invalid packet (12 expected; 10 read)";
> I think it would be better to rephrase it as
> 
>  +            die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";

OK

>> +        die "invalid packet size";
> 
> I'm not sure if it is worth it (especially for the demo script),
> but perhaps we could show what this invalid size was?
> 
>  +        die "invalid packet size value '$pkt_size'";

OK

>> +sub packet_txt_read {
>> +    my ( $res, $buf ) = packet_bin_read();
>> +    unless ( $buf =~ /\n$/ ) {
> 
> Wouldn't
> 
>  +    unless ( $buf =~ s/\n$// ) {
> 
> or (less so)
> 
>  +    unless ( $buf =~ s/\n$\z// ) {
> 
> be more idiomatic (and not require use of 'substr')?  Remember,
> the s/// substitution quote-like operator returns number of
> substitutions in the scalar context.

OK.

>> +        die "A non-binary line SHOULD BE terminated by an LF.";
> 
> This is SHOULD be, not MUST be, so perhaps 'warn' would be enough.
> Not that Git should send us such line.

Actually it MUST per protocol definition. I'll change it to MUST.

>> +    my ($packet) = @_;
> 
> This is equivalent to
> 
>  +    my $packet = shift;
> 
> which, I think, is more common for single-parameter subroutines.
> 
> Also, this is $data (or $buf), not $packet.

OK

> Perhaps some comment that main begins here?
> 
>> +( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
>> +( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
>> +( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";
> 
> Actually, it is overly strict.  It should not fail if there
> are other "version=3", "version=4" etc. lines.

True, but I think for an example this is OK. I'll add a note
to the file header.

>> +
>> +while (1) {
>> +    my ($command)  = packet_txt_read() =~ /^command=([^=]+)$/;
>> +    my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/;
> 
> Do we require this order?  If it is, is that explained in the
> documentation?

Git sends that order right now but the filter should not rely
on that order.

>> +    packet_flush();    # empty list!
> 
> This is less "empty list!", and more keeping "status=success" unchanged.

OK

OK means, I agree and I added your suggestion to v9.
Thanks a lot for your review and the comments!

Cheers,
Lars