> On 12 Aug 2016, at 18:33, Stefan Beller <sbeller@xxxxxxxxxx> wrote:
>
> On Wed, Aug 10, 2016 at 6:04 AM, <larsxschneider@xxxxxxxxx> wrote:
>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>>
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
>>
>> In a preliminary performance test this developer used a clean/smudge filter
>> written in golang to filter 12,000 files. This process took 364s with the
>> existing filter mechanism and 5s with the new mechanism. See details here:
>> https://github.com/github/git-lfs/pull/1382
>>
>> This patch adds the `filter.<driver>.process` string option which, if used,
>> keeps the external filter process running and processes all blobs with
>> the packet format (pkt-line) based protocol over standard input and standard
>> output described below.
>>
>> Git starts the filter when it encounters the first file
>> that needs to be cleaned or smudged. After the filter started
>> Git sends a welcome message, a list of supported protocol
>> version numbers, and a flush packet. Git expects to read the
>> welcome message and one protocol version number from the
>> previously sent list. Afterwards Git sends a list of supported
>> capabilities and a flush packet. Git expects to read a list of
>> desired capabilities, which must be a subset of the supported
>> capabilities list, and a flush packet as response:
>> ------------------------
>> packet: git> git-filter-client
>> packet: git> version=2
>> packet: git> version=42
>> packet: git> 0000
>> packet: git< git-filter-server
>> packet: git< version=2
>
> what follows is specific to version=2?
> version 42 may deem capabilities a bad idea?

"version=42" is just an example to show what the initialization could look
like in a distant future when we support yet another protocol version. You
are correct, what follows is specific to version=2. I will state that more
clearly in the documentation.

Can you try to rephrase "version 42 may deem capabilities a bad idea?" I am
not sure I understand what you mean.

>
>> packet: git> clean=true
>> packet: git> smudge=true
>> packet: git> not-yet-invented=true
>> packet: git> 0000
>> packet: git< clean=true
>> packet: git< smudge=true
>> packet: git< 0000
>> ------------------------
>> Supported filter capabilities in version 2 are "clean" and
>> "smudge".
>
> I assume version 2 is an example here and we actually start with v1?

No, it is actually called version 2 because I consider the current
clean/smudge protocol version 1.

> Can you clarify why we need welcome messages?
> (Is there a technical reason, or better debuggability for humans?)

The welcome message is necessary to distinguish the long-running filter
protocol (v2) from the current one-shot filter protocol (v1). This becomes
important if a user tries to use a v1 clean/smudge filter with the v2 git
config settings.

>> Afterwards Git sends a list of "key=value" pairs terminated with
>> a flush packet. The list will contain at least the filter command
>> (based on the supported capabilities) and the pathname of the file
>> to filter relative to the repository root. Right after these packets
>> Git sends the content split in zero or more pkt-line packets and a
>> flush packet to terminate content.
>> ------------------------
>> packet: git> command=smudge\n
>> packet: git> pathname=path/testfile.dat\n
>> packet: git> 0000
>> packet: git> CONTENT
>> packet: git> 0000
>> ------------------------
>>
>> The filter is expected to respond with a list of "key=value" pairs
>> terminated with a flush packet. If the filter does not experience
>> problems then the list must contain a "success" status.
>> Right after these packets the filter is expected to send the content
>> in zero or more pkt-line packets and a flush packet at the end. Finally, a
>> second list of "key=value" pairs terminated with a flush packet
>> is expected. The filter can change the status in the second list.
>> ------------------------
>> packet: git< status=success\n
>> packet: git< 0000
>> packet: git< SMUDGED_CONTENT
>> packet: git< 0000
>> packet: git< 0000 # empty list!
>> ------------------------
>>
>> If the result content is empty then the filter is expected to respond
>> with a success status and an empty list.
>> ------------------------
>> packet: git< status=success\n
>> packet: git< 0000
>> packet: git< 0000 # empty content!
>> packet: git< 0000 # empty list!
>> ------------------------
>
> Why do we need the last flush packet? We'd expect as many successes
> as we send out contents? Do we plan on interleaving operation, i.e.
> Git sends out 10 files but the filter process is not as fast as Git sending
> out and the answers trickle in slowly?

Git filter processes run sequentially right now (unfortunately).
re flush: please see Peff's answer:
http://public-inbox.org/git/20160812163809.3wdkuqegxfjam2yn%40sigill.intra.peff.net/

>> In case the filter cannot or does not want to process the content,
>> it is expected to respond with an "error" status. Depending on the
>> `filter.<driver>.required` flag Git will interpret that as error
>> but it will not stop or restart the filter process.
>> ------------------------
>> packet: git< status=error\n
>> packet: git< 0000
>> ------------------------
>>
>> In case the filter cannot or does not want to process the content
>> as well as any future content for the lifetime of the Git process,
>> it is expected to respond with an "error-all" status. Depending on
>> the `filter.<driver>.required` flag Git will interpret that as error
>> but it will not stop or restart the filter process.
>> ------------------------
>> packet: git< status=error-all\n
>> packet: git< 0000
>> ------------------------
>>
>> If the filter experiences an error during processing, then it can
>> send the status "error". Depending on the `filter.<driver>.required`
>> flag Git will interpret that as error but it will not stop or restart
>> the filter process.
>> ------------------------
>> packet: git< status=success\n
>
> So the first success is meaningless essentially?
> Would it make sense to move the success behind the content sending
> in all cases?

Again, I refer to Peff's answer.

>> packet: git< 0000
>> packet: git< HALF_WRITTEN_ERRONEOUS_CONTENT
>> packet: git< 0000
>> packet: git< status=error\n
>> packet: git< 0000
>> ------------------------
>>
>> If the filter dies during the communication or does not adhere to
>> the protocol then Git will stop the filter process and restart it
>> with the next file that needs to be processed.
>>
>> After the filter has processed a blob it is expected to wait for
>> the next "key=value" list containing a command. When the Git process
>> terminates, it will send a kill signal to the filter in that stage.
>>
>> If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
>> is configured then these commands always take precedence over
>> a configured `filter.<driver>.process` command.
>
> okay. I think you can omit most of the commit message as it is a duplicate
> of the documentation?

Yes, it duplicates the documentation.

> Instead the commit message can answer questions that are not part of
> the documentation. (See the questions above which can be summarized
> as "Why do we do it this way and not differently?")

OK, point taken. I will write a new commit message for v6.

>
>> +	if (err || errno == EPIPE) {
>> +		if (!strcmp(filter_status.buf, "error")) {
>> +			/*
>> +			 * The filter signaled a problem with the file.
>> +			 */
>
> /* This could go into a single line comment. */

OK, will change.

>> +		} else if (!strcmp(filter_status.buf, "error-all")) {
>> +			/*
>> +			 * The filter signaled a permanent problem. Don't try to filter
>> +			 * files with the same command for the lifetime of the current
>> +			 * Git process.
>> +			 */
>> +			entry->supported_capabilities &= ~wanted_capability;
>> +		} else {
>> +			/*
>> +			 * Something went wrong with the protocol filter.
>> +			 * Force shutdown and restart if another blob requires filtering!
>> +			 */
>> +			error("external filter '%s' failed", cmd);
>
> failed .. Can you give more information to the user so that they can debug
> more easily? (blob/path or state / expected state)

Agreed, will add! However, we don't give this information with the current
clean/smudge interface.

>> +
>>  static int read_convert_config(const char *var, const char *value, void *cb)
>>  {
>>  	const char *key, *name;
>> @@ -526,6 +818,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
>>  	if (!strcmp("clean", key))
>>  		return git_config_string(&drv->clean, var, value);
>>
>> +	if (!strcmp("process", key)) {
>> +		return git_config_string(&drv->process, var, value);
>> +	}
>
> optional nit: braces unnecessary

Agreed, will remove!

Thanks a lot for the review,
Lars
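
P.S. For readers who have not worked with pkt-line before, the framing used in the packet listings quoted above can be sketched roughly as below. This is only an illustration and not code from the patch: the helper names pkt_encode, pkt_flush and build_smudge_request are made up for this sketch. It assumes the usual pkt-line layout, where each packet starts with four hex digits giving the packet length including those four bytes, a flush packet is the literal "0000", and a packet is at most 65520 bytes long.

```c
#include <stdio.h>
#include <string.h>

#define PKT_HEADER_LEN 4
#define PKT_MAX_LEN 65520 /* assumed pkt-line upper bound, header included */

/*
 * Hypothetical helper: encode `len` payload bytes as one pkt-line into `out`.
 * Returns the number of bytes written, or 0 if the packet does not fit.
 */
size_t pkt_encode(char *out, size_t out_size, const char *payload, size_t len)
{
	size_t total = len + PKT_HEADER_LEN;
	char header[PKT_HEADER_LEN + 1];

	if (total > PKT_MAX_LEN || total > out_size)
		return 0;
	/* The length field counts the four header bytes as well. */
	snprintf(header, sizeof(header), "%04x", (unsigned int)total);
	memcpy(out, header, PKT_HEADER_LEN);
	memcpy(out + PKT_HEADER_LEN, payload, len);
	return total;
}

/* Hypothetical helper: write a flush packet ("0000") into `out`. */
size_t pkt_flush(char *out, size_t out_size)
{
	if (out_size < PKT_HEADER_LEN)
		return 0;
	memcpy(out, "0000", PKT_HEADER_LEN);
	return PKT_HEADER_LEN;
}

/*
 * Hypothetical helper: build the "key=value" list that precedes the content
 * of a smudge request: command, pathname, then a flush packet.
 * Returns the number of bytes written, or 0 on failure.
 */
size_t build_smudge_request(char *out, size_t out_size, const char *path)
{
	static const char command[] = "command=smudge\n";
	char line[4096];
	size_t used = 0, n;
	int len;

	n = pkt_encode(out, out_size, command, strlen(command));
	if (!n)
		return 0;
	used += n;

	len = snprintf(line, sizeof(line), "pathname=%s\n", path);
	if (len < 0 || (size_t)len >= sizeof(line))
		return 0; /* path too long for this sketch */
	n = pkt_encode(out + used, out_size - used, line, (size_t)len);
	if (!n)
		return 0;
	used += n;

	n = pkt_flush(out + used, out_size - used);
	if (!n)
		return 0;
	return used + n;
}
```

Calling build_smudge_request() with "path/testfile.dat" produces the three packets from the smudge listing quoted above; the content packets and the terminating flush packet would follow on the same stream.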