Part 1 of review, starting with the protocol v2 itself. W dniu 08.10.2016 o 13:25, larsxschneider@xxxxxxxxx pisze: > From: Lars Schneider <larsxschneider@xxxxxxxxx> > > Git's clean/smudge mechanism invokes an external filter process for > every single blob that is affected by a filter. If Git filters a lot of > blobs then the startup time of the external filter processes can become > a significant part of the overall Git execution time. > > In a preliminary performance test this developer used a clean/smudge > filter written in golang to filter 12,000 files. This process took 364s > with the existing filter mechanism and 5s with the new mechanism. See > details here: https://github.com/github/git-lfs/pull/1382 > > This patch adds the `filter.<driver>.process` string option which, if > used, keeps the external filter process running and processes all blobs > with the packet format (pkt-line) based protocol over standard input and > standard output. The full protocol is explained in detail in > `Documentation/gitattributes.txt`. > > A few key decisions: > > * The long running filter process is referred to as filter protocol > version 2 because the existing single shot filter invocation is > considered version 1. > * Git sends a welcome message and expects a response right after the > external filter process has started. This ensures that Git will not > hang if a version 1 filter is incorrectly used with the > filter.<driver>.process option for version 2 filters. In addition, > Git can detect this kind of error and warn the user. > * The status of a filter operation (e.g. "success" or "error) is set > before the actual response and (if necessary!) re-set after the > response. The advantage of this two step status response is that if > the filter detects an error early, then the filter can communicate > this and Git does not even need to create structures to read the > response. > * All status responses are pkt-line lists terminated with a flush > packet. This allows us to send other status fields with the same > protocol in the future. Looks good to me. > > Helped-by: Martin-Louis Bright <mlbright@xxxxxxxxx> > Reviewed-by: Jakub Narebski <jnareb@xxxxxxxxx> > Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx> > Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx> > --- > Documentation/gitattributes.txt | 157 +++++++++++++- > convert.c | 297 +++++++++++++++++++++++++- > t/t0021-conversion.sh | 447 +++++++++++++++++++++++++++++++++++++++- > t/t0021/rot13-filter.pl | 191 +++++++++++++++++ > 4 files changed, 1082 insertions(+), 10 deletions(-) > create mode 100755 t/t0021/rot13-filter.pl > > diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt > index 7aff940..5868f00 100644 > --- a/Documentation/gitattributes.txt > +++ b/Documentation/gitattributes.txt > @@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the command is > fed the blob object from its standard input, and its standard > output is used to update the worktree file. Similarly, the > `clean` command is used to convert the contents of worktree file > -upon checkin. > +upon checkin. By default these commands process only a single > +blob and terminate. If a long running `process` filter is used > +in place of `clean` and/or `smudge` filters, then Git can process > +all blobs with a single filter command invocation for the entire > +life of a single Git command, for example `git add --all`. See > +section below for the description of the protocol used to > +communicate with a `process` filter. I don't remember how this part looked like in previous versions of this patch series, but "... is used in place of `clean` ..." does not tell explicitly about the precedence of those configuration variables. I think it should be stated explicitly that `process` takes precedence over any `clean` and/or `smudge` settings for the same `filter.<driver>` (regardless of whether the long running `process` filter support "clean" and/or "smudge" operations or not). > > One use of the content filtering is to massage the content into a shape > that is more convenient for the platform, filesystem, and the user to use. > @@ -373,6 +379,155 @@ not exist, or may have different contents. So, smudge and clean commands > should not try to access the file on disk, but only act as filters on the > content provided to them on standard input. > > +Long Running Filter Process > +^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +If the filter command (a string value) is defined via > +`filter.<driver>.process` then Git can process all blobs with a > +single filter invocation for the entire life of a single Git > +command. This is achieved by using a packet format (pkt-line, > +see technical/protocol-common.txt) based protocol over standard > +input and standard output as follows. All packets, except for the > +"*CONTENT" packets and the "0000" flush packet, are considered > +text and therefore are terminated by a LF. Maybe s/standard input and output/\& of filter process,/ (that is, add "... of filter process," to the third sentence in the above paragraph). I guess what LF (line-feed character, "\n") is should be obvious for anybody who will be reading this part. All right. > + > +Git starts the filter when it encounters the first file > +that needs to be cleaned or smudged. After the filter started > +Git sends a welcome message ("git-filter-client"), a list of > +supported protocol version numbers, and a flush packet. I guess there is no need to be more explicit in description here, as the exact format should be obvious from the example below. We could add that the list of supported protocol version numbers is send as series of "version=<integer number>" text-packet lines. > Git expects > +to read a welcome response message ("git-filter-server") and exactly > +one protocol version number from the previously sent list. All further > +communication will be based on the selected version. The remaining > +protocol description below documents "version=2". Please note that > +"version=42" in the example below does not exist and is only there > +to illustrate how the protocol would look like with more than one > +version. > + > +After the version negotiation Git sends a list of all capabilities that > +it supports and a flush packet. Git expects to read a list of desired > +capabilities, which must be a subset of the supported capabilities list, > +and a flush packet as response: > +------------------------ > +packet: git> git-filter-client > +packet: git> version=2 > +packet: git> version=42 > +packet: git> 0000 > +packet: git< git-filter-server > +packet: git< version=2 > +packet: git> clean=true > +packet: git> smudge=true > +packet: git> not-yet-invented=true > +packet: git> 0000 > +packet: git< clean=true > +packet: git< smudge=true > +packet: git< 0000 WARNING: This example is different from description!!! In example you have Git sending "git-filter-client" and list of supported protocol versions, terminated with flush packet, then filter driver process sends "git-filter-server", exactly one version, *AND* list of supported capabilities in "<capability>=true" format, terminated with flush packet. In description above the example you have 4-part handshake, not 3-part; the filter is described to send list of supported capabilities last (a subset of what Git command supports). Moreover in the example in previous version at least as far as v8 of this series, the response from filter driver was fixed length list of two lines: magic string "git-filter-server" and exactly one line with protocol version; this part was *not* terminated with a flush packet (complicating code of filter driver program a bit, I think). I think this version of protocol is *better*, just the text needs to be updated to match. I wanted to propose something like this in v9,... By the way, now I look at it, the argument for using the "<capability>=true" format instead of "capability=<capability>" (or "supported-command=<capability>") is weak. The argument for using "<variable>=<value>" to make it easier to implement parsing is sound, but the argument for "<capability>=true" is weak. The argument was that with "<capability>=true" one can simply parse metadata into hash / dictionary / hashmap, and choose response based on that. Hash / hashmap / associative array needs different keys, so the reasoning went for "<capability>=true" over "capability=<capability>"... but the filter process still needs to handle lines with repeating keys, namely "version=<N>" lines! So the argument doesn't hold water IMVHO, and we can choose version which reads better / is more natural. > +------------------------ > +Supported filter capabilities in version 2 are "clean" and > +"smudge". I think it would be good to have something here separating the handshake part of protocol from the description of the working part. The latter loops over each file to be affected by given filter driver; handshake is done only once per filter. Maybe subsections? > + > +Afterwards Git sends a list of "key=value" pairs terminated with > +a flush packet. The list will contain at least the filter command > +(based on the supported capabilities) and the pathname of the file > +to filter relative to the repository root. Right after these packets I think you meant here "right after the flush packet", isn't it? It would be more explicit. > +Git sends the content split in zero or more pkt-line packets and a > +flush packet to terminate content. Please note, that the filter > +must not send any response before it received the content and the > +final flush packet. That's good to have this information in the documentation. BTW. I hope that in the future this restriction could be lifted with "stream" capability / option, by having Git read response from filter driver in an asynchronous process, like for one-shot v1 filters. But that is certainly for later. Let's polish this series and have it accepted first. > +------------------------ > +packet: git> command=smudge > +packet: git> pathname=path/testfile.dat > +packet: git> 0000 > +packet: git> CONTENT > +packet: git> 0000 > +------------------------ > + > +The filter is expected to respond with a list of "key=value" pairs > +terminated with a flush packet. I wonder if we could be more explicit that it is about "status" response. But I don't have good idea how to improve this sentence; not that it is really needed, I don't think. > If the filter does not experience > +problems then the list must contain a "success" status. Right after > +these packets the filter is expected to send the content in zero > +or more pkt-line packets and a flush packet at the end. Perhaps "terminating it with a flush packet"? But it is quite all right as it is now. > Finally, a > +second list of "key=value" pairs terminated with a flush packet > +is expected. The filter can change the status in the second list. I would add here, to be more explicit: This second list of "key=value" pairs may be empty, and usually would be if there is nothing wrong with response or filter; the terminating flush packet must be here regardless. Or something like that. The above proposal could be certainly improved. > +------------------------ > +packet: git< status=success > +packet: git< 0000 > +packet: git< SMUDGED_CONTENT > +packet: git< 0000 > +packet: git< 0000 # empty list, keep "status=success" unchanged! All right, looks good. Is this exclamation mark "!" necessary / wanted? > +------------------------ > + > +If the result content is empty then the filter is expected to respond > +with a "success" status and an empty list. Actually, it is empty content, not empty list; that is response (filter output) composed entirely of flush packet. Sidenote: I first thought that "empty list" here was about the post-content information, which may be empty, and for empty contents it would almost certainly be empty list - there is nothing I think that can change status of filter... > +------------------------ > +packet: git< status=success > +packet: git< 0000 > +packet: git< 0000 # empty content! > +packet: git< 0000 # empty list, keep "status=success" unchanged! > +------------------------ > + > +In case the filter cannot or does not want to process the content, > +it is expected to respond with an "error" status. Depending on the > +`filter.<driver>.required` flag Git will interpret that as error > +but it will not stop or restart the filter process. I think those two parts of last sentence: the part of 'required' flag, and the part about restarting process would be better either split, or their order reversed: first, tell that Git would not restart filter; second, that it would continue with next file without 'required' flag, and error-out if filter has 'required' flag. Though perhaps this should be known at this point of documentation, and doesn't need repeating... Ugh. Well, it's quite good as it is now. > +------------------------ > +packet: git< status=error > +packet: git< 0000 > +------------------------ > + > +If the filter experiences an error during processing, then it can > +send the status "error" after the content was (partially or > +completely) sent. Depending on the `filter.<driver>.required` flag > +Git will interpret that as error but it will not stop or restart the > +filter process. Errr... this is literal repetition. You need to decide whether to put it before example, or after example. Or maybe split it. > +------------------------ > +packet: git< status=success > +packet: git< 0000 > +packet: git< HALF_WRITTEN_ERRONEOUS_CONTENT > +packet: git< 0000 > +packet: git< status=error > +packet: git< 0000 > +------------------------ > + > +If the filter dies during the communication or does not adhere to > +the protocol then Git will stop the filter process and restart it > +with the next file that needs to be processed. Depending on the > +`filter.<driver>.required` flag Git will interpret that as error. Uhh... until now the order was explanation, then example. From the duplicated description above, it is now first example, then description. Consistency would be good. > + > +The error handling for all cases above mimic the behavior of > +the `filter.<driver>.clean` / `filter.<driver>.smudge` error > +handling. You have "error handling" repeated here. > + > +In case the filter cannot or does not want to process the content > +as well as any future content for the lifetime of the Git process, > +it is expected to respond with an "abort" status at any point in > +the protocol. Depending on the `filter.<driver>.required` flag Git > +will interpret that as error for the content as well as any future > +content for the lifetime of the Git process but it will not stop or > +restart the filter process. Here the order, first about `required` then without, looks all right to me: the result wrt process is the same, only the error or not changes. And here we have description first, example second. > +------------------------ > +packet: git< status=abort > +packet: git< 0000 > +------------------------ > + > +After the filter has processed a blob it is expected to wait for > +the next "key=value" list containing a command. Git will close > +the command pipe on exit. The filter is expected to detect EOF > +and exit gracefully on its own. Any "kill filter" solutions should probably be put here. I guess that filter exiting means EOF on its standard output when read by Git command, isn't it? > + > +If you develop your own long running filter > +process then the `GIT_TRACE_PACKET` environment variables can be > +very helpful for debugging (see linkgit:git[1]). s/environment variables/environment variable/ - there is only one GIT_TRACE_PACKET. Unless you wanted to write about GIT_TRACE? > + > +If a `filter.<driver>.process` command is configured then it > +always takes precedence over a configured `filter.<driver>.clean` > +or `filter.<driver>.smudge` command. Ah, it is here! I think it would be better to put it upfront; you don't need information about the protocol to *use* the existing filter, but you need this info. Or maybe we can repeat this information. > + > +Please note that you cannot use an existing `filter.<driver>.clean` > +or `filter.<driver>.smudge` command with `filter.<driver>.process` > +because the former two use a different inter process communication > +protocol than the latter one. I'm not sure where this should be (but it is needed) > + > + > Interaction between checkin/checkout attributes > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -- Jakub Narębski