> On 27 Sep 2016, at 00:41, Jakub Narębski <jnareb@xxxxxxxxx> wrote: > > Part first of the review of 11/11. > > W dniu 20.09.2016 o 21:02, larsxschneider@xxxxxxxxx pisze: >> From: Lars Schneider <larsxschneider@xxxxxxxxx> >> >> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt >> index 7aff940..946dcad 100644 >> --- a/Documentation/gitattributes.txt >> +++ b/Documentation/gitattributes.txt >> @@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the command is >> fed the blob object from its standard input, and its standard >> output is used to update the worktree file. Similarly, the >> `clean` command is used to convert the contents of worktree file >> -upon checkin. >> +upon checkin. By default these commands process only a single >> +blob and terminate. If a long running `process` filter is used > ^^^^ > > Should we use this terminology here? I have not read the preceding > part of documentation, so I don't know if it talks about "blobs" or > if it uses "files" and/or "file contents". I used that because it was used in the paragraph above already. >> +Long Running Filter Process >> +^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> + >> +If the filter command (a string value) is defined via >> +`filter.<driver>.process` then Git can process all blobs with a >> +single filter invocation for the entire life of a single Git >> +command. This is achieved by using a packet format (pkt-line, >> +see technical/protocol-common.txt) based protocol over standard >> +input and standard output as follows. All packets are considered >> +text and therefore are terminated by an LF. Exceptions are the >> +"*CONTENT" packets and the flush packet. > > I guess that reasoning here is that all but CONTENT packets are > metadata, and thus to aid debuggability of the protocol are "text", > as considered by pkt-line. > > Perhaps a bit more readable would be the following (but current is > just fine; I am nitpicking): > > All packets, except for the "{star}CONTENT" packets and the "0000" > flush packer, are considered text and therefore are terminated by > a LF. OK, I use that! > I think it might be a good idea to describe what flush packet is > somewhere in this document; on the other hand referring (especially > if hyperlinked) to pkt-line technical documentation might be good > enough / better. I'm unsure, but I tend on the side that referring > to technical documentation is better. I have this line in the first paragraph of the Long Running Filter process: "packet format (pkt-line, see technical/protocol-common.txt) based protocol" > >> +to read a welcome response message ("git-filter-server") and exactly >> +one protocol version number from the previously sent list. All further > > I guess that is to provide forward-compatibility, isn't it? Also, > "Git expects..." probably means filter process MUST send, in the > RFC2119 (https://tools.ietf.org/html/rfc2119) meaning. True. I feel "expects" reads better but I am happy to change it if you feel strong about it. >> + >> +After the version negotiation Git sends a list of supported capabilities >> +and a flush packet. > > Is it that Git SHOULD send list of ALL supported capabilities, or is > it that Git SHOULD NOT send capabilities it does not support, and that > it MAY send only those capabilities it needs (so for example if command > uses only `smudge`, it may not send `clean`, so that filter driver doesn't > need to initialize data it would not need). "After the version negotiation Git sends a list of all capabilities that it supports and a flush packet." Better? > I wonder why it is "<capability>=true", and not "capability=<capability>". > Is there a case where we would want to send "<capability>=false". Or > is it to allow configurable / value based capabilities? Isn't it going > a bit too far: is there even a hind of an idea for parametrize-able > capability? YAGNI is a thing... Peff suggested that format and I think it is OK: http://public-inbox.org/git/20160803224619.bwtbvmslhuicx2qi@xxxxxxxxxxxxxxxxxxxxx/ > A few new capabilities that we might want to support in the near future > is "size", "stream", which are options describing how to communicate, > and "cleanFromFile", "smudgeToFile", which are new types of operations... > but neither needs any parameter. > > I guess that adding new capabilities doesn't require having to come up > with the new version of the protocol, isn't it. Correct. >> +packet: git< git-filter-server >> +packet: git< version=2 >> +packet: git> clean=true >> +packet: git> smudge=true >> +packet: git> not-yet-invented=true > > Hmmm... should we hint at the use of kebab-case versus snake_case > or camelCase for new capabilities? I personally prefer kebab-case but I think that is a discussion for future contributions ;-) >> +------------------------ >> +packet: git> command=smudge >> +packet: git> pathname=path/testfile.dat >> +packet: git> 0000 >> +packet: git> CONTENT >> +packet: git> 0000 >> +------------------------ > > I think it is important to mention that (at least with current > `filter.<driver>.process` implementation, that is absent future > "stream" capability / option) the filter process needs to read > *whole contents* at once, *before* writing anything. Otherwise > it can lead to deadlock. > > This is especially important in that it is different (!) from the > current behavior of `clean` and `smudge` filters, which can > stream their response because Git invokes them async. I added this: " Please note, that the filter must not send any response before it received the content and the final flush packet. " >> + >> +If the filter experiences an error during processing, then it can >> +send the status "error" after the content was (partially or >> +completely) sent. Depending on the `filter.<driver>.required` flag >> +Git will interpret that as error but it will not stop or restart the >> +filter process. >> +------------------------ >> +packet: git< status=success >> +packet: git< 0000 >> +packet: git< HALF_WRITTEN_ERRONEOUS_CONTENT >> +packet: git< 0000 >> +packet: git< status=error >> +packet: git< 0000 >> +------------------------ > > Good. A question is if the filter process can send "status=abort" > after partial contents, or does it need to wait for the next command? I added: "expected to respond with an "abort" status at any point in the protocol." >> + >> +After the filter has processed a blob it is expected to wait for >> +the next "key=value" list containing a command. Git will close >> +the command pipe on exit. The filter is expected to detect EOF >> +and exit gracefully on its own. > > Good to have it documented. > > Anyway, as it is Git command that spawns the filter driver process, > assuming that the filter process doesn't daemonize itself, wouldn't > the operating system reap it after its parent process, that is the > git command it invoked, dies? So detecting EOF is good, but not > strictly necessary for simple filter that do not need to free > its resources, or can leave freeing resources to the operating > system? But I may be wrong here. The filter process runs independent of Git. >> + >> + >> Interaction between checkin/checkout attributes >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> >> diff --git a/contrib/long-running-filter/example.pl b/contrib/long-running-filter/example.pl >> new file mode 100755 >> index 0000000..c13a631 >> --- /dev/null >> +++ b/contrib/long-running-filter/example.pl > > To repeat myself, I think it would serve better as a separate patch. OK >> + die "invalid packet size '$bytes_read' field"; > > This would read "invalid packet size '000' field", for example. > Perhaps the following would be (slightly) better: > > + die "invalid packet size field: '$bytes_read'"; OK >> + } >> + elsif ( $pkt_size > 4 ) { > > Isn't a packet of $pkt_size == 4 a valid packet, a keep-alive > one? Or is it forbidden? "Implementations SHOULD NOT send an empty pkt-line ("0004")." Source: Documentation/technical/protocol-common.txt >> + die "invalid packet ($content_size expected; $bytes_read read)"; > > This error message would read "invalid packet (12 expected; 10 read)"; > I think it would be better to rephrase it as > > + die "invalid packet ($content_size bytes expected; $bytes_read bytes read)"; OK >> + die "invalid packet size"; > > I'm not sure if it is worth it (especially for the demo script), > but perhaps we could show what this invalid size was? > > + die "invalid packet size value '$pkt_size'"; OK >> +sub packet_txt_read { >> + my ( $res, $buf ) = packet_bin_read(); >> + unless ( $buf =~ /\n$/ ) { > > Wouldn't > > + unless ( $buf =~ s/\n$// ) { > > or (less so) > > + unless ( $buf =~ s/\n$\z// ) { > > be more idiomatic (and not require use of 'substr')? Remember, > the s/// substitution quote-like operator returns number of > substitutions in the scalar context. OK. >> + die "A non-binary line SHOULD BE terminated by an LF."; > > This is SHOULD be, not MUST be, so perhaps 'warn' would be enough. > Not that Git should send us such line. Actually it MUST per protocol definition. I'll change it to MUST. >> + my ($packet) = @_; > > This is equivalent to > > + my $packet = shift; > > which, I think, is more common for single-parameter subroutines. > > Also, this is $data (or $buf), not $packet. OK > Perhaps some comment that main begins here? > >> +( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize"; >> +( packet_txt_read() eq ( 0, "version=2" ) ) || die "bad version"; >> +( packet_bin_read() eq ( 1, "" ) ) || die "bad version end"; > > Actually, it is overly strict. It should not fail if there > are other "version=3", "version=4" etc. lines. True, but I think for an example this is OK. I'll add a note to the file header. >> + >> +while (1) { >> + my ($command) = packet_txt_read() =~ /^command=([^=]+)$/; >> + my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/; > > Do we require this order? If it is, is that explained in the > documentation? Git sends that order right now but the filter should not rely on that order. >> + packet_flush(); # empty list! > > This is less "empty list!", and more keeping "status=success" unchanged. OK OK means, I agree and I added your suggestion to v9. Thanks a lot for your review and the comments! Cheers, Lars