[I'm sorry for taking so long in writing this, as I see there is v4 already]

Greetings,

I'll reply to individual emails in more detail later, but I'd like to go
back to the drawing board and attempt to summarize the discussion and
the proposal so far.

The ultimate goal is to be able to run filter drivers faster for both
`clean` and `smudge` operations.  This is done by starting the filter
driver once per Git command invocation, instead of once per file being
processed.  Git needs to pass the actual contents of files to the filter
driver, and get its output.

We want the protocol between Git and the filter driver process to be
extensible, so that new features can be added without modifying the
protocol.


1. CONFIGURATION

As I wrote, there are different ways of configuring a new-type filter
driver:

* Using a separate variable to mark the filter as using the new protocol
  (the original approach):

    [filter "protocol"]
        protocolVersion = v2
        clean  = rot13-clean-filter.pl
        smudge = rot13-smudge-filter.pl

  PROS: allows separate clean and smudge filters
  CONS: does not allow using an old-style per-file filter together with
        a new one; it is easy to make a mistake and configure an
        old-style filter, leading to a hang

* Creating new variables for the new filter type, separate for each
  phase, for example `cleanProcess` and `smudgeProcess` (or
  `processClean` and `processSmudge`):

    [filter "protocol"]
        cleanProcess  = rot13-clean-filter.pl
        smudgeProcess = rot13-smudge-filter.pl

  PROS: allows separate clean and smudge filters; makes it possible to
        use per-file and per-command filters together
  CONS: proliferation of additional variables (esp. when extending it)
  NOTE: need to decide precedence between `clean` and `cleanProcess`,
        etc.
* Using a single variable for the new filter type, and deciding during
  the handshake which phases (operations) the filter driver supports
  (the current approach):

    [filter "protocol"]
        process = rot13-filters.pl

  PROS: per-file and per-command filters possible with a precedence
        rule; extensible to other types of drivers (textconv, diff,
        etc.); only one invocation for commands which use both clean
        and smudge
  CONS: a single driver must be responsible for both clean and smudge;
        the driver has to be run to find out that it does not support
        a given operation (a workaround exists)


2. HANDSHAKE (INITIALIZATION)

Next, there is deciding on and designing the handshake between Git (the
Git command) and the filter driver process.

With the `filter.<driver>.process` solution the driver needs to tell
which operations among (for now) "clean" and "smudge" it supports.
The handshake also provides a way to extend the protocol with new
features, such as support for streaming, cleaning from file or smudging
to file, providing size upfront, perhaps even progress reports.

The current handshake consists of the filter driver printing a
signature, version number and capabilities, in that order.  Git checks
that it is well formed and matches expectations, and notes which of the
"clean" and "smudge" operations are supported by the filter.

There is no interaction from the Git side in the handshake, for example
to set options and expectations common to all files being filtered.
Take one possible extension of the protocol: supporting streaming.  The
filter driver needs to know whether it needs to read all the input, or
whether it can start printing output while input is incoming (e.g. to
reduce memory consumption)... though we may simply decide that this is
the next version of the protocol.

On the other hand, if the handshake began with Git sending some
initializer info to the filter driver, we could probably detect a
one-shot filter misconfigured as a process filter.
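To make the positional handshake concrete, here is a minimal sketch of
how the reading side could validate it.  It is written in Python for
brevity and is not taken from the patch series.  It assumes that each
of the three entries arrives as one pkt-line (four hex digits of total
length, then the payload) terminated by "\n", and that the entries are
the signature, version and capabilities strings from the symbolic form
in this message.  The names `pkt_line`, `read_pkt` and `read_handshake`
are made up for illustration.

```python
import io

# Handshake strings follow the symbolic form in this message; treating
# each as a "\n"-terminated pkt-line is an assumption of this sketch.
SIGNATURE = "git-filter-protocol"
VERSION = "version 2"


def pkt_line(payload: bytes) -> bytes:
    """Encode one pkt-line: 4 hex digits of total length, then payload."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt(stream):
    """Read one pkt-line payload; return None for a flush packet (0000)."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("stream ended inside a pkt-line header")
    size = int(header, 16)
    if size == 0:
        return None
    if size < 4:
        raise ValueError("invalid pkt-line length %d" % size)
    payload = stream.read(size - 4)
    if len(payload) != size - 4:
        raise EOFError("stream ended inside a pkt-line payload")
    return payload


def read_text(stream) -> str:
    """Read one text packet and strip its trailing LF."""
    payload = read_pkt(stream)
    if payload is None:
        raise ValueError("unexpected flush packet in handshake")
    return payload.decode("utf-8").rstrip("\n")


def read_handshake(stream) -> set:
    """Check signature and version, return the advertised capabilities."""
    if read_text(stream) != SIGNATURE:
        raise ValueError("not a filter protocol driver")
    if read_text(stream) != VERSION:
        raise ValueError("unsupported protocol version")
    return set(read_text(stream).split())


# Example: what a driver supporting both operations would print.
driver_output = io.BytesIO(
    pkt_line(b"git-filter-protocol\n")
    + pkt_line(b"version 2\n")
    + pkt_line(b"clean smudge\n")
)
capabilities = read_handshake(driver_output)
```

A one-shot filter misconfigured as a process filter would typically
fail the signature check here, provided it prints anything at all
before reading its input; if it waits for input first, this check
alone does not prevent the hang.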
Note that we need some way of deciding where the handshake ends, either
by specifying the number of entries (currently: three lines / pkt-line
packets), or by providing a terminator (the "smart" transport protocol
uses a flush packet for this).

Current handshake (in symbolic form):

    git< [signature]     git-filter-protocol
    git< [version]       version 2
    git< [capabilities]  clean smudge

It is expected that the handshake is limited to this information, and
that the entries come in this order; so naming them doesn't buy us
much:

    git< [capabilities]  capabilities clean smudge
or
    git< [capabilities]  capabilities=clean smudge
or
    git< [capabilities]  capabilities: clean smudge

If capabilities are to be the third item, adding "capabilities", as if
Git would look at the name and select what to do based on it, doesn't
buy us anything.  Well, besides making the protocol self-documenting.
The "smart" protocol does not use "capabilities" as a prefix/name
either.

We probably do not want to move away from the strict order of
information, that is, "positional parameters".  Doing so would require
implementing a parser, both on the Git side and on the filter driver
process side.

On the other hand, requiring a flush packet to end the handshake
doesn't bring much overhead (it is 4 bytes, and it is not sent over the
network), and improves extensibility.  Well, so does using names, be it
"<var> <value>", "<var>=<value>", "<var>: <value>...",
"<var>=[<value>, <value>...]", etc.

Let's take a look at how other parts of Git communicate with an
external process (a "helper").

The git-credential(1) protocol uses the <variable>=<value> syntax.  But
capabilities form a list; "<var>=<val1> <val2>" doesn't look that good.
Credential helpers only use scalar (single) values.

The gitremote-helpers(1) protocol is command / response; for example
the helper responds to the "capabilities" command with the list of
capabilities.  Here commands and parameters are space separated,
e.g. "option <name> <value>".
The "smart" transport protocol (send-pack and receive-pack) had to
(ab)use a quirk of the implementation to extend the protocol with
capabilities negotiation.  Here the capabilities list is sent without
any prefix; some capabilities are parametrized, and use the
<capability>=<value> syntax (for example
"symref=HEAD:refs/heads/master").  The handshake is closed with a flush
packet, but as it consists of a variable-length ref advertisement, it
needs an explicit terminator for each part of the "handshake".


3. SENDING CONTENTS (FILE TO BE FILTERED AND FILTER OUTPUT)

The next thing to design is how to send the contents to be filtered to
the filter driver process, and how to get the filtered output back from
it.

One thing I think we can agree on early is sending data to the filter
process on its standard input, and receiving the filtered result from
its standard output.

Because Git is sending (and receiving) multiple files, it needs some
way to distinguish where one file ends and the next begins, in both
directions, to and from the filter.

Also, the `clean` and `smudge` filters support expansion of the '%f'
placeholder, so at least some filter drivers need the name of the file
being filtered.  So the protocol must send it somehow to the filter
driver.

There are different approaches possible; here are the ones that were
used, and the ones I thought about.

* Send all the data to the filter at once, and receive all the data at
  once, for example using something akin to the 'tar' archive, or an
  uncompressed 'zip' archive (both are implemented in Git for the
  `git archive` command).  Or just a list of sizes and pathnames, an
  empty entry as terminator, and then the contents of all files
  concatenated.

  PROS:
   - can use the one-shot infrastructure implemented already
  CONS:
   - complicates Git code and filter driver code unnecessarily
   - difficult to implement error handling, esp.
     soft errors on the filter driver side (an error for a single file,
     perhaps during output)
   - in the synchronous (non-streaming) version requires an absurd
     amount of memory / storage for the filter driver process

* Send/receive data file by file, using <size> + <content>, that is,
  send the size (plus other data like the filename), then the file
  contents.  This was the protocol used in the first iteration of the
  series.

  PROS:
   - simple to implement on the Git side and on the filter driver side
  NOTE:
   - you need to loop over read / use read_in_full anyway
  CONS:
   - no way to signal an error encountered during output, e.g. an LFS
     network/server failure after some contents were already sent
   - impossible to implement streaming for filters that do not know
     the size of the output without examining the full input

* Send/receive data file by file, using some kind of chunking, with an
  end-of-file marker.  The solution used by Git is pkt-line, with a
  flush packet used to signal end of file.  This is the protocol used
  by the current implementation.

  PROS:
   - no need to know the size upfront, so easier streaming support
   - you can signal an error that happened during output, after some
     data was sent, as well as an error known upfront
   - tracing support for free (GIT_TRACE_PACKET)
  CONS:
   - filter driver program slightly more difficult to implement
   - some negligible amount of overhead

If we want in the end to implement streaming, then the last solution is
the way to go.


4. PER-FILE HANDSHAKE - SENDING FILE TO FILTER

Let's assume that for simplicity we want to implement (for now) only
the synchronous (non-streaming) case, where we send the whole contents
of a file to the filter driver process, and *then* read the filter
driver output.  This is enough for git-LFS solutions, which were the
reason for this patch series.

But we want to keep the protocol flexible enough so that streaming and
other features could be added easily.
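The chunked variant from section 3 (pkt-lines plus a flush packet as
end-of-file marker) can be sketched as follows.  This is an
illustration in Python, not the code from the series; the helper names
`write_content` and `read_content` are made up.  It assumes Git's usual
pkt-line framing: four hex digits giving the total packet length,
a 65520-byte limit per packet, and "0000" as the flush packet.

```python
import io

# Git's pkt-line limit is 65520 bytes per packet, including the 4-byte
# length header, which leaves 65516 bytes of payload.
MAX_PAYLOAD = 65516
FLUSH = b"0000"


def write_content(stream, data: bytes) -> None:
    """Send file contents as pkt-lines, then a flush packet as EOF marker."""
    for start in range(0, len(data), MAX_PAYLOAD):
        chunk = data[start:start + MAX_PAYLOAD]
        stream.write(b"%04x" % (len(chunk) + 4) + chunk)
    stream.write(FLUSH)


def read_content(stream) -> bytes:
    """Collect pkt-line payloads until the flush packet is seen."""
    chunks = []
    while True:
        header = stream.read(4)
        if len(header) != 4:
            raise EOFError("stream ended before the flush packet")
        size = int(header, 16)
        if size == 0:  # flush packet: end of this file
            return b"".join(chunks)
        if size < 4:
            raise ValueError("invalid pkt-line length %d" % size)
        chunk = stream.read(size - 4)
        if len(chunk) != size - 4:
            raise EOFError("stream ended inside a pkt-line payload")
        chunks.append(chunk)


# Round trip through an in-memory buffer standing in for the pipe.
original = b"x" * 100000
pipe = io.BytesIO()
write_content(pipe, original)
pipe.seek(0)
received = read_content(pipe)
```

Neither side needs to know the size in advance: the writer emits
packets as data becomes available, and the reader stops at the flush
packet.  An empty file is simply a lone flush packet.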
First, if we choose the solution where one process is responsible for
both "clean" and "smudge" operations (and in the future possibly also
"cleanFromFile" and "smudgeToFile"), Git needs to tell the driver which
operation to perform.

Together with the operation Git can send additional information
(sub-capabilities)... or we can use a separate line / packet to send
it.

If we are using pkt-line, then the convention is that text lines are
terminated with the LF ("\n") character.  This needs to be stated
explicitly in the documentation for filter.<driver>.process writers.

    git> packet:  [operation] clean size=67\n

We could denote that it is the operation name, but it is obvious from
its position in the stream, thus not really needed.

Then we need to provide the filename; some filters supposedly need this
('%f' in per-file `clean` / `smudge`).  Note that a filename can
contain internal space characters, and could contain newlines, equal
signs; anything that is not the NUL ("\0") character.

    git> packet:  [pathname]  subdir/sample-file.r\n

In most cases the filename would be text, so perhaps we should use a
"\n" terminator (which the filter driver would have to strip).  We
could use a "filename=" prefix, but it is not necessary.  We know where
/ when to expect the pathname (relative to the project root).

If we wanted to be able to add a variable number of packets to the
handshake, then Git should send a flush packet to signal the end of the
handshake.  But IMVHO it is an unnecessary complication of the
protocol; there is enough flexibility in it.  We know that the per-file
handshake consists of two packets.

Git would then send the contents of the file to be filtered, using as
many pkt-lines as needed (note: large file support needs to be tested,
at least as an expensive test).  A flush packet is used to signal the
end of the file.

    git> packets:  <file contents>
    git> flush packet


5. FILTER DRIVER PROCESS RESPONSE

First, the filter should, in my opinion, reply that it received the
request (or the command, if streaming is supported).
Also, in this response it can provide further information to the Git
process.

    git< packet:  [received]  ok size=67\n

This response could be used to refuse to filter a specific file upfront
(for example if the file is not present in the artifactory for git-LFS
solutions).

    git< packet:  [rejected]  reject\n

We can even provide the reasoning to Git (maybe in a future
extension)... or the filter driver can print the explanation to the
standard error (but then, no --quiet / --verbose support).

    git< packet:  [rejected]  reject with-message\n
    git< packet:  [message]   File not found on server\n
    git< flush packet

Another response, which I think should be standardized, or at least
described in the documentation, is the filter driver refusing to filter
any further (e.g. git-LFS when the network is down), so that it is not
restarted by Git.

    git< packet:  [quit]      quit msg=Server error\n
or
    git< packet:  [quit]      quit Server error\n
or
    git< packet:  [quit]      quit with-message\n
    git< packet:  [message]   Server error\n
    git< flush packet

Maybe this is over-engineering, but I don't think so.

Next comes the output from the filter driver (the filtered contents),
using possibly multiple pkt-lines, ending with a flush packet:

    git< packets:  <filtered contents>
    git< flush packet

Note that an empty file would consist of zero pkt-lines of contents,
and one flush packet.

Finally, to allow handling of [resumable] errors that occurred while
sending file contents, especially for the future streaming filters
case, we want to confirm that the whole file was sent successfully.

    git< packet:  [status]    success\n

If there was an error during processing, making the data received so
far invalid, the filter driver should report it:

    git< packet:  [status]    fail\n
or
    git< packet:  [status]    reject\n

This may happen for example for a UCS-2 <-> UTF-8 filter when an
invalid byte sequence is encountered.  This may happen for git-LFS if
the server fails during fetch, and the spare / slave server doesn't
have the file.

We may want to quit filtering at this point, and not send another file.
    git< packet:  [status]    quit\n

There is room for extra information after the status, and in the future
we can allow variable-length information too.

Best,
--
Jakub Narębski