> On 03 Aug 2016, at 20:30, Jakub Narębski <jnareb@xxxxxxxxx> wrote: > > The ultimate goal is to be able to run filter drivers faster for both `clean` > and `smudge` operations. This is done by starting filter driver once per > git command invocation, instead of once per file being processed. Git needs > to pass actual contents of files to filter driver, and get its output. > > We want the protocol between Git and filter driver process to be extensible, > so that new features can be added without modifying protocol. > > > 1. CONFIGURATION > > As I wrote, there are different ways of configuring new-type filter driver: > > ... > > # Using a single variable for new filter type, and decide on which phase > (which operation) is supported by filter driver during the handshake > *(current approach)* > > [filter "protocol"] > process = rot13-filtes.pl > > PROS: per-file and per-command filters possible with precedence rule; > extensible to other types of drivers: textconv, diff, etc. > only one invocation for commands which use both clean and smudge > CONS: need single driver to be responsible for both clean and smudge; > need to run driver to know that it does not support given > operation (workaround exists) > > > 2. HANDSHAKE (INITIALIZATION) > > Next, there is deciding on and designing the handshake between Git (between > Git command) and the filter driver process. With the `filter.<driver>.process` > solution the driver needs to tell which operations among (for now) "clean" > and "smudge" it does support. Plus it provides a way to extend protocol, > adding new features, like support for streaming, cleaning from file or > smudging to file, providing size upfront, perhaps even progress report. > > Current handshake consist of filter driver printing a signature, version > number and capabilities, in that order. Git checks that it is well formed > and matches expectations, and notes which of "clean" and "smudge" operations > are supported by the filter. > > There is no interaction from the Git side in the handshake, for example to > set options and expectations common to all files being filtered. Take > one possible extension of protocol: supporting streaming. The filter > driver needs to know whether it needs to read all the input, or whether > it can start printing output while input is incoming (e.g. to reduce > memory consumption)... though we may simply decide it to be next version > of the protocol. > > On the other hand if the handshake began with Git sending some initializer > info to the filter driver, we probably could detect one-shot filter > misconfigured as process-filter. OK, I'll look into this. > Note that we need some way of deciding where handshake ends, either by > specifying number of entries (currently: three lines / pkt-line packets), > or providing some terminator ("smart" transport protocol uses flush packet > for this). > > ... Would you be OK with Peff's suggestion? version=2 clean=true smudge=true 0000 http://public-inbox.org/git/20160803224619.bwtbvmslhuicx2qi%40sigill.intra.peff.net/ > 3. SENDING CONTENTS (FILE TO BE FILTERED AND FILTER OUTPUT) > > Next thing to design is decision how to send contents to be filtered > to the filter driver process, and how to get filtered output from the > filter driver process. > > ... > > # Send/receive data file by file, using some kind of chunking, > with a end-of-file marker. The solution used by Git is > pkt-line, with flush packet used to signal end of file. > > This is protocol used by the current implementation. > > PROS: > - no need to know size upfront, so easier streaming support > - you can signal error that happened during output, after > some data were sent, as well as error known upfront > - tracing support for free (GIT_TRACE_PACKET) > CONS: > - filter driver program slightly more difficult to implement > - some negligible amount of overhead > > If we want in the end to implement streaming, then the last solution > is the way to go. > > > 4. PER-FILE HANDSHAKE - SENDING FILE TO FILTER > > Let's assume that for simplicity we want to implement (for now) only > the synchronous (non-streaming) case, where we send whole contents > of a file to filter driver process, and *then* read filter driver > output. > ... > > If we are using pkt-line, then the convention is that text lines > are terminated using LF ("\n") character. This needs to be stated > explicitly in the documentation for filter.<driver>.process writers. > > git> packet: [operation] clean size=67\n > > We could denote that it is operation name, but it is obvious from > position in the stream, thus not really needed. I would prefer not to mix command and size in one packet as it makes parsing a little more difficult. > ... > > The Git would sent contents of the file to be filtered, using > as many pack lines as needed (note: large file support needs > to be tested, at least as expensive test). Flush packet is > used to signal the end of the file. > > git> packets: <file contents> > git> flush packet If expensive tests are enabled the test suite will process data larger then max pkt size. > 5. FILTER DRIVER PROCESS RESPONSE > > First filter should, in my opinion, reply that it received the > request (or the command, in the case of streaming supported). > Also, in this response it can provide further information to > Git process. > > git< packet: [received] ok size=67\n I think this would be different for real streaming and the current non-streaming... therefore it would complicate the protocol?! I wonder if it is truly necessary. > This response could be used to refuse to filter specific file > upfront (for example if the file is not present in the artifactory > for git-LFS solutions). > > git< packet: [rejected] reject\n Reject is already supported in v4. > We can even provide the reasoning to Git (maybe in the future > extension)... or filter driver can print the explanation to the > standard error (but then, no --quiet / --verbose support). > > git< packet: [rejected] reject with-message\n > git< packet: [message] File not found on server\n > git< flush packet I think Git shouldn't care about these details. If the filter needs to tell something then it should use stderr. > Another response, which I think should be standarized, or at > least described in the documentation, is filter driver refusing > to filter further (e.g. git-LFS and network is down), to be not > restarted by Git. > > git< packet: [quit] quit msg=Server error\n > > or > > git< packet: [quit] quit Server error\n > > or > > git< packet: [quit] quit with-message\n > git< packet: [message] Server error\n > git< flush packet > > Maybe this is over-engineering, but I don't think so. Interesting idea! I will look into this for v5! > Next comes the output from the filter driver (filtered contents), > using possibly multiple pkt-lines, ending with a flush packet: > > git< packets: <filtered contents> > git< flush packet > > Note that empty file would consist of zero pack lines of contents, > and one flush packet. > > Finally, to allow handling of [resumable] errors that occurred > during sending file contents, especially for the future streaming > filters case, we want to confirm that we send whole file > successfully. > > git< packet: [status] success\n > > If there was an error during process, making data receives so far > invalid, filter driver should tell about it > > git< packet: [status] fail\n > > or > > git< packet: [status] reject\n > > This may happen for example for UCS-2 <-> UTF-8 filter when invalid > byte sequence is encountered. This may happen for git-LFS if the > server fails during fetch, and spare / slave server doesn't have > a file. Correct! > We may want to quit filtering at this point, and not to send another > file. > > git< packet: [status] quit\n I don't get this one. Git would restart the filter as soon as it finds another file that needs to be filtered, right? Thanks a lot for this write up! - Lars-- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html