Re: [PATCH v5 14/15] convert: add filter.<driver>.process option

Lars Schneider <larsxschneider@xxxxxxxxx> · Fri, 12 Aug 2016 18:59:18 +0200

> On 12 Aug 2016, at 18:33, Stefan Beller <sbeller@xxxxxxxxxx> wrote:
> 
> On Wed, Aug 10, 2016 at 6:04 AM,  <larsxschneider@xxxxxxxxx> wrote:
>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>> 
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
>> 
>> In a preliminary performance test this developer used a clean/smudge filter
>> written in golang to filter 12,000 files. This process took 364s with the
>> existing filter mechanism and 5s with the new mechanism. See details here:
>> https://github.com/github/git-lfs/pull/1382
>> 
>> This patch adds the `filter.<driver>.process` string option which, if used,
>> keeps the external filter process running and processes all blobs with
>> the packet format (pkt-line) based protocol over standard input and standard
>> output described below.
>> 
>> Git starts the filter when it encounters the first file
>> that needs to be cleaned or smudged. After the filter started
>> Git sends a welcome message, a list of supported protocol
>> version numbers, and a flush packet. Git expects to read the
>> welcome message and one protocol version number from the
>> previously sent list. Afterwards Git sends a list of supported
>> capabilities and a flush packet. Git expects to read a list of
>> desired capabilities, which must be a subset of the supported
>> capabilities list, and a flush packet as response:
>> ------------------------
>> packet:          git> git-filter-client
>> packet:          git> version=2
>> packet:          git> version=42
>> packet:          git> 0000
>> packet:          git< git-filter-server
>> packet:          git< version=2
> 
> what follows is specific to version=2?
> version 42 may deem capabilities a bad idea?

"version=42" is just an example to show how the initialization could look
in a distant future, when we support yet another protocol version.

You are correct, what follows is specific to version=2. I will state
that more clearly in the documentation.
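
To illustrate the version handshake, here is a minimal sketch of how a
filter might pick one version from the list Git sends. It is not code from
the patch: parse_version_line(), pick_version() and FILTER_MAX_VERSION are
hypothetical names, and it assumes the filter simply wants the highest
offered version it knows about.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: the highest protocol version this example filter speaks. */
#define FILTER_MAX_VERSION 2

/*
 * Parse one "version=<n>" line (a single trailing newline is allowed).
 * Returns the version number, or -1 if the line is not a version line.
 */
int parse_version_line(const char *line)
{
	static const char prefix[] = "version=";
	char *end;
	long v;

	assert(line != NULL);
	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return -1;
	line += sizeof(prefix) - 1;
	if (*line < '0' || *line > '9')
		return -1;
	v = strtol(line, &end, 10);
	if (*end == '\n')
		end++;
	if (*end != '\0' || v > 1000000)
		return -1;
	return (int)v;
}

/*
 * Pick the highest offered version that the filter also supports.
 * Returns -1 if there is no common version.
 */
int pick_version(const char *const *offered, size_t n)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int v = parse_version_line(offered[i]);
		if (v > best && v <= FILTER_MAX_VERSION)
			best = v;
	}
	return best;
}
```

With the list from the example above ("version=2", "version=42") this
sketch answers with version 2 and ignores the version it does not know.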

Can you try to rephrase "version 42 may deem capabilities a bad idea?"
I am not sure I understand what you mean.

> 
>> packet:          git> clean=true
>> packet:          git> smudge=true
>> packet:          git> not-yet-invented=true
>> packet:          git> 0000
>> packet:          git< clean=true
>> packet:          git< smudge=true
>> packet:          git< 0000
>> ------------------------
>> Supported filter capabilities in version 2 are "clean" and
>> "smudge".
> 
> I assume version 2 is an example here and we actually start with v1?

No, it is actually called version 2 because I consider the current
clean/smudge protocol version 1.

> Can you clarify why we need welcome messages?
> (Is there a technical reason, or better debuggability for humans?)

The welcome message is necessary to distinguish the long running
filter protocol (v2) from the current one-shot filter protocol (v1).
This becomes important if a user tries to use a v1 clean/smudge
filter with the v2 git config settings.
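
A rough sketch of that check, assuming the usual pkt-line framing (four
hex digits giving the total length including the header, "0000" as the
flush packet). The helper names pkt_encode(), pkt_payload_len() and
pkt_is_line() are made up for this mail and are not the functions used
in the patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PKT_HEADER_LEN 4
#define PKT_MAX_PAYLOAD 65516	/* 65520 bytes per packet, minus the header */
#define PKT_FLUSH (-1L)
#define PKT_BAD (-2L)

/* Write one pkt-line into out; returns bytes written, or 0 if it does not fit. */
size_t pkt_encode(char *out, size_t out_size, const char *payload, size_t len)
{
	char header[PKT_HEADER_LEN + 1];

	assert(out != NULL && payload != NULL);
	if (len > PKT_MAX_PAYLOAD || out_size < len + PKT_HEADER_LEN)
		return 0;
	snprintf(header, sizeof(header), "%04x",
		 (unsigned int)(len + PKT_HEADER_LEN));
	memcpy(out, header, PKT_HEADER_LEN);
	memcpy(out + PKT_HEADER_LEN, payload, len);
	return len + PKT_HEADER_LEN;
}

/* Returns the payload length, PKT_FLUSH for "0000", PKT_BAD if malformed. */
long pkt_payload_len(const char *buf, size_t avail)
{
	unsigned int total = 0;
	int i;

	if (avail < PKT_HEADER_LEN)
		return PKT_BAD;
	for (i = 0; i < PKT_HEADER_LEN; i++) {
		char c = buf[i];
		unsigned int digit;

		if (c >= '0' && c <= '9')
			digit = (unsigned int)(c - '0');
		else if (c >= 'a' && c <= 'f')
			digit = (unsigned int)(c - 'a') + 10;
		else if (c >= 'A' && c <= 'F')
			digit = (unsigned int)(c - 'A') + 10;
		else
			return PKT_BAD;
		total = total * 16 + digit;
	}
	if (total == 0)
		return PKT_FLUSH;
	if (total < PKT_HEADER_LEN)
		return PKT_BAD;
	return (long)total - PKT_HEADER_LEN;
}

/* Returns 1 if buf starts with a packet carrying exactly `expected`. */
int pkt_is_line(const char *buf, size_t avail, const char *expected)
{
	long len = pkt_payload_len(buf, avail);
	size_t want = strlen(expected);

	if (len < 0 || (size_t)len > avail - PKT_HEADER_LEN)
		return 0;
	/* Tolerate one trailing newline after the expected text. */
	if ((size_t)len == want + 1 && buf[PKT_HEADER_LEN + want] == '\n')
		len--;
	return (size_t)len == want &&
	       !memcmp(buf + PKT_HEADER_LEN, expected, want);
}
```

A v1 filter just writes the filtered content to stdout, so its first
bytes will (almost always) fail this check for "git-filter-server" and
Git can report a useful error instead of misreading the content.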

>> Afterwards Git sends a list of "key=value" pairs terminated with
>> a flush packet. The list will contain at least the filter command
>> (based on the supported capabilities) and the pathname of the file
>> to filter relative to the repository root. Right after these packets
>> Git sends the content split in zero or more pkt-line packets and a
>> flush packet to terminate content.
>> ------------------------
>> packet:          git> command=smudge\n
>> packet:          git> pathname=path/testfile.dat\n
>> packet:          git> 0000
>> packet:          git> CONTENT
>> packet:          git> 0000
>> ------------------------
>> 
>> The filter is expected to respond with a list of "key=value" pairs
>> terminated with a flush packet. If the filter does not experience
>> problems then the list must contain a "success" status. Right after
>> these packets the filter is expected to send the content in zero
>> or more pkt-line packets and a flush packet at the end. Finally, a
>> second list of "key=value" pairs terminated with a flush packet
>> is expected. The filter can change the status in the second list.
>> ------------------------
>> packet:          git< status=success\n
>> packet:          git< 0000
>> packet:          git< SMUDGED_CONTENT
>> packet:          git< 0000
>> packet:          git< 0000  # empty list!
>> ------------------------
>> 
>> If the result content is empty then the filter is expected to respond
>> with a success status and an empty list.
>> ------------------------
>> packet:          git< status=success\n
>> packet:          git< 0000
>> packet:          git< 0000  # empty content!
>> packet:          git< 0000  # empty list!
>> ------------------------
> 
> Why do we need the last flush packet? We'd expect as many successes
> as we send out contents? Do we plan on interleaving operation, i.e.
> Git sends out 10 files but the filter process is not as fast as Git sending
> out and the answers trickle in slowly?

Git filter processes run sequentially right now (unfortunately).

re flush: please see Peff's answer:
http://public-inbox.org/git/20160812163809.3wdkuqegxfjam2yn%40sigill.intra.peff.net/
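
For readers who only have this mail: the documentation quoted above says
the filter sends a status list before the content and a second list
after it, and that the second list can change the status. A minimal
sketch of that rule, assuming a later "status=" line simply replaces an
earlier one (the enum and function names are hypothetical, not taken
from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum filter_status {
	FILTER_STATUS_NONE,	/* the line is not a "status=" line */
	FILTER_STATUS_SUCCESS,
	FILTER_STATUS_ERROR,
	FILTER_STATUS_ERROR_ALL,
	FILTER_STATUS_INVALID	/* "status=" with a value unknown to this sketch */
};

/* Classify one "key=value" line; a single trailing newline is ignored. */
enum filter_status parse_status_line(const char *line)
{
	static const char prefix[] = "status=";
	const char *value;
	size_t len;

	assert(line != NULL);
	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return FILTER_STATUS_NONE;
	value = line + sizeof(prefix) - 1;
	len = strlen(value);
	if (len > 0 && value[len - 1] == '\n')
		len--;
	if (len == 7 && !memcmp(value, "success", 7))
		return FILTER_STATUS_SUCCESS;
	if (len == 5 && !memcmp(value, "error", 5))
		return FILTER_STATUS_ERROR;
	if (len == 9 && !memcmp(value, "error-all", 9))
		return FILTER_STATUS_ERROR_ALL;
	return FILTER_STATUS_INVALID;
}

/*
 * Fold one "key=value" list into the current status. Lines that are not
 * status lines are skipped; an empty list leaves the status unchanged.
 */
enum filter_status apply_status_list(enum filter_status current,
				     const char *const *list, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		enum filter_status s = parse_status_line(list[i]);
		if (s != FILTER_STATUS_NONE)
			current = s;
	}
	return current;
}
```

Calling apply_status_list() once for the list before the content and
once for the list after it gives the final result: an empty second list
keeps "success", while a late "status=error" overrides it.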

>> In case the filter cannot or does not want to process the content,
>> it is expected to respond with an "error" status. Depending on the
>> `filter.<driver>.required` flag Git will interpret that as error
>> but it will not stop or restart the filter process.
>> ------------------------
>> packet:          git< status=error\n
>> packet:          git< 0000
>> ------------------------
>> 
>> In case the filter cannot or does not want to process the content
>> as well as any future content for the lifetime of the Git process,
>> it is expected to respond with an "error-all" status. Depending on
>> the `filter.<driver>.required` flag Git will interpret that as error
>> but it will not stop or restart the filter process.
>> ------------------------
>> packet:          git< status=error-all\n
>> packet:          git< 0000
>> ------------------------
>> 
>> If the filter experiences an error during processing, then it can
>> send the status "error". Depending on the `filter.<driver>.required`
>> flag Git will interpret that as error but it will not stop or restart
>> the filter process.
>> ------------------------
>> packet:          git< status=success\n
> 
> So the first success is meaningless essentially?
> Would it make sense to move the success behind the content sending
> in all cases?

Again, I refer to Peff's answer.

>> packet:          git< 0000
>> packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
>> packet:          git< 0000
>> packet:          git< status=error\n
>> packet:          git< 0000
>> ------------------------
>> 
>> If the filter dies during the communication or does not adhere to
>> the protocol then Git will stop the filter process and restart it
>> with the next file that needs to be processed.
>> 
>> After the filter has processed a blob it is expected to wait for
>> the next "key=value" list containing a command. When the Git process
>> terminates, it will send a kill signal to the filter in that stage.
>> 
>> If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
>> is configured then these commands always take precedence over
>> a configured `filter.<driver>.process` command.
> 
> okay. I think you can omit most of the commit message as it is a duplicate
> of the documentation?

Yes, it duplicates the documentation.

> Instead the commit message can answer questions that are not part of
> the documentation. (See the questions above which can be summarized
> as "Why do we do it this way and not differently?")

OK, point taken. I will write a new commit message for v6.

> 
>> +       if (err || errno == EPIPE) {
>> +               if (!strcmp(filter_status.buf, "error")) {
>> +                       /*
>> +                        * The filter signaled a problem with the file.
>> +                        */
> 
> /* This could go into a single line comment. */

OK, will change.

>> +               } else if (!strcmp(filter_status.buf, "error-all")) {
>> +                       /*
>> +                        * The filter signaled a permanent problem. Don't try to filter
>> +                        * files with the same command for the lifetime of the current
>> +                        * Git process.
>> +                        */
>> +                        entry->supported_capabilities &= ~wanted_capability;
>> +               } else {
>> +                       /*
>> +                        * Something went wrong with the protocol filter.
>> +                        * Force shutdown and restart if another blob requires filtering!
>> +                        */
>> +                       error("external filter '%s' failed", cmd);
> 
> failed .. Can you give more information to the user so that they can debug
> more easily? (blob/path or state / expected state)

Agreed, will add!
However, we don't give this information with the current clean/smudge interface.

>> +
>> static int read_convert_config(const char *var, const char *value, void *cb)
>> {
>>        const char *key, *name;
>> @@ -526,6 +818,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
>>        if (!strcmp("clean", key))
>>                return git_config_string(&drv->clean, var, value);
>> 
>> +       if (!strcmp("process", key)) {
>> +               return git_config_string(&drv->process, var, value);
>> +       }
> 
> optional nit: braces unnecessary

Agreed, will remove!

Thanks a lot for the review,
Lars