Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option

Lars Schneider <larsxschneider@xxxxxxxxx> · Mon, 25 Jul 2016 22:09:10 +0200

On 24 Jul 2016, at 22:14, Jakub Narębski <jnareb@xxxxxxxxx> wrote:

> W dniu 2016-07-24 o 20:36, Lars Schneider pisze:
>> On 23 Jul 2016, at 02:11, Jakub Narębski <jnareb@xxxxxxxxx> wrote:
>>> W dniu 2016-07-22 o 17:49, larsxschneider@xxxxxxxxx pisze:
>>>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>>> 
>>> Nb. this line is only needed if you want author name and/or date
>>> different from the email sender, or if you have sender line misconfigured
>>> (e.g. lacking the human readable name).
>> 
>> I use "git format-patch" to generate these emails:
>> 
>> git format-patch --cover-letter --subject-prefix="PATCH ..." -M $BASE -o $OUTPUT
>> 
>> How would I disable this line? (I already checked the man page to no avail).
> 
> If you are using `git send-email` or equivalent, I think it is
> stripped automatically if it is not needed (in you case it was,
> because Sender was lacking human readable name... at least I think
> it was because of what my email reader inserted as reply line).
> If you are using an ordinary email client, you need to remove it
> yourself, if needed.

Weird. I am sending the patches with this command:

git send-email mystuff/* --to=git@xxxxxxxxxxxxxxx --cc=...

Maybe I need to try the "--suppress-from" switch?!

>> Plus, what does "Nb" stand for? :-)
> 
> Nb. (or N.b.) stands for "nota bene", which I meant do denote
> as a note on a certain side aspect; I'll switch to "note", or
> "BTW" / "by the way".

OK, thanks for the explanation :-)

>>>> Git's clean/smudge mechanism invokes an external filter process for every
>>>> single blob that is affected by a filter. If Git filters a lot of blobs
>>>> then the startup time of the external filter processes can become a
>>>> significant part of the overall Git execution time.
>>> 
>>> Do I understand it correctly (from the commit description) that with
>>> this new option you start one filter process for the whole life of
>>> the Git command (e.g. `git add .`), which may perform more than one
>>> cleanup, but do not create a daemon or daemon-like process which
>>> would live for several commands invocations?
>> 
>> Correct!
> 
> It would be nice to make it more obvious.

OK, I will try in v2.

> 
>>>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>>>> keeps the external filter process running and processes all blobs with
>>>> the following protocol over stdin/stdout.
>>> 
>>> I agree with Junio that the name "useProtocol" is bad, and not quite
>>> right. Perhaps "persistent" would be better? Also, what is the value
>>> of `filter.<driver>.useProtocol`: boolean? or a script name?
> 
> As you see I was not sure if `useProtocol` was boolean or a script name,
> which means that it should be stated more explicitly.  Of course this
> would end to not matter if the way new protocol is used were changed.
> 
>> I agree that the name is not ideal. "UseProtocol" as it is would be a boolean. 
>> I thought about "persistent" but this name wouldn't convey the scope of the 
>> persistency ("persistent for one Git operation" vs. "persistent for many Git 
>> operations"). What do you think about the protocol as int version idea
>> described in $gmane/300155 ?
> 
> You mean the `protocol` as a config variable name (fully name being
> `filter.<driver>.protocol`), being integer-valued, isn't it? Wouldn't
> `protocolVersion` be a more explicit?

Yes, but based on your other feedback I plan to use this variable differently
anyways.

>>> I also agree that we might wat to be able to keep clean and smudge
>>> filters separate, but be able to run a single program if they are
>>> both the same. I think there is a special case for filter unset,
>>> and/or filter being "cat" -- we would want to keep that.
>> 
>> Since 1a8630d there is a more efficient way to unset a filter ;-)
>> Can you think of other cases where the separation would be useful?
> 
> I can't think of any, but it doesn't mean that it does not exist.
> It also does not mean that you need to consider situation that may
> not happen. Covering one-way filters, like "indent" filter for `clean`,
> should be enough... they do work with your proposal, don't they?

This should work right now but it would be a bit inefficient (the filter
would just pass the data unchanged through the smudge command). I plan to
add a "capabilities" flag to the protocol. Then you can define only
the "clean" capability and nothing or the current filter mechanism 
would happen for smudge (I will make a test case to demonstrate that
behavior in v2).

>>> My proposal is to use `filter.<driver>.persistent` as an addition
>>> to 'clean' and 'smudge' variables, with the following possible
>>> values:
>>> 
>>> * none (the default)
>>> * clean
>>> * smudge
>>> * both
>> 
>> That could work. However, I am not convinced, yet, that separate
>> filters are an actual use case.
> 
> YAGNI (You Ain't Gonna Need It), right.

That will work in v2.

>>> I assume that either Git would have to start multiple filter
>>> commands for multi-threaded operation, or the protocol would have
>>> to be extended to make persistent filter fork itself.
>> 
>> I think it would be better to have Git start multiple filter commands
>> to keep the protocol as simple and error free as possible.
> 
> Right. Also, I am not sure if exec+fork would be much faster than
> fork+exec (where fork is n-way fork, and n is number of threads
> that Git command invoking filter is using).
> 
>>> BTW. what would happen in your original proposal if the user had
>>> *both* filter.<driver>.useProtocol and filter.<driver>.smudge
>>> (and/or filter.<driver>.clean) set?
>> 
>> That wouldn't be an issue as "useProtocol" is just a boolean that
>> tells Git how to talk to "filter.<driver>.smudge" and "filter.<driver>.clean".
>> I need to make this more clear in the documentation.
>> 
>> 
>>>> 1. Git starts the filter on first usage and expects a welcome message
>>>> with protocol version number:
>>>> 	Git <-- Filter: "git-filter-protocol\n"
>>>> 	Git <-- Filter: "version 1"
>>> 
>>> I was wondering how Git would know that filter executable was started,
>>> but then I realized it was once-per-command invocation, not a daemon.
>>> 
>>> I agree with Torsten that there should be a terminator after the
>>> version number.
>> 
>> I agree, too :)
> 
> Note that if we agree about switch to `protocol` / `protocolVersion`
> as a way to specify this protocol, it would probably need to be "protocol 2"
> (assuming that "protocol 1" is the original implementation, with one fork
> per affected file).

Agreed.

>>> Also, for future extendability this should be probably followed by
>>> possibly empty list of script capabilities, that is:
>>> 
>>> 	Git <-- Filter: "git-filter-protocol\n"
>>> 	Git <-- Filter: "version 1.1\n"
> 
> Note that "version 1.1" would not work with current implementation;
> it accepts only integer version numbers. Which might be a good idea,
> anyway.

Agreed.

>>> 	Git <-- Filter: "capabilities clean smudge\n"
>>> 
>>> Or we can add capabilities in later version...
>> 
>> That is an interesting idea. My initial thought was to make the capabilities
>> of a certain version fix. If we want to add new capabilities then we would 
>> bump the version. I wonder what others think about your suggestion!
> 
> Using capabilities (like git-upload-pack / git-receive-pack, that is
> smart Git transfer protocols do) is probably slightly more difficult on
> the Git side (assuming no capabilities negotiation), but also much more
> flexible than pure version numbers.
> 
> One possible idea for a capability is support for passing input
> and output of a filter via filesystem, like cleanToFile and smudgeFromFile
> proposal in 'jh/clean-smudge-annex' (in 'pu').
> 
> For example:
> 
> 	Git <-- Filter: "capabilities clean smudge cleanToFile smudgeFromFile\n"

Yes, I like that very much. As stated above, I will add that in v2.

>>> BTW. why not follow e.g. HTTP protocol example, and use
>>> 
>>> 	Git <-- Filter: "git-filter-protocol/1\n"
>> 
>> I think my proposal is a bit more explicit as it states "version". If
>> you feel strongly about it, I could be convinced otherwise.
> 
> No, I don't feel strongly about this. I think SSH also uses a separate
> "version"-like line.
> 
>>>> 2. Git sends the command (either "smudge" or "clean"), the filename, the
>>>> content size in bytes, and the content separated by a newline character:
>>>> 	Git --> Filter: "smudge\n"
>>> 
>>> Would it help (for some cases) to pass the name of filter that
>>> is being invoked?
>> 
>> Interesting thought! Can you imagine a case where this would be useful?
> 
> Actually... no, I don't think so. I don't think there is a situation
> where we might want to use the same filter commands for different filters
> and have it behave differently depending on the filter name.

OK

>>>> 	Git --> Filter: "testfile.dat\n"
>>> 
>>> Unfortunately, while sane filenames should not contain newlines[1],
>>> the unfortunate fact is that *filenames can include newlines*, and
>>> you need to be able to handle that[2].  Therefore you need either to
>>> choose a different separator (the only one that can be safely used
>>> is "\0", i.e. the NUL character - but it is not something easy to
>>> handle by shell scripts), or C-quote filenames as needed, or always
>>> C-quote filenames.  C-quoting at minimum should include quoting newline
>>> character, and the escape character itself.
>>> 
>>> BTW. is it the basename of a file, or a full pathname?
>>> 
>>> [1]: http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
>>> [2]: http://www.dwheeler.com/essays/filenames-in-shell.html
>> 
>> Thanks for this explanation. A bash version of the protocol is not
>> trivial (I tried it but ended up using Perl). Therefore I think "\0"
>> would be a good choice?
> 
> That, or use git convention of surrounding C-quoted filenames in
> double quotes (which means that if it begins with quote, it is C-quoted).
> 
> For example:
> 
>  $ git commit -m 'Initial commit'
>  [master (root-commit) 266dab0] Initial commit
>   2 files changed, 2 insertions(+)
>   create mode 100644 foo
>   create mode 100644 "foo \" \\ ,"
>  $ ls -1
>  foo " \ ,
>  foo
> 
> I'm not sure which solution would be easier for filter writers,
> NUL termination, or C-quoting.

Unless someone has a convincing argument for one solution or the other
I will go with the \0 termination as it seems easier.

>>>> 	Git --> Filter: "7\n"
>>> 
>>> That's the content size in bytes written as an ASCII number.
>> 
>> Correct.
> 
> But not obvious from the description / documentation.

I will improve that in v2. Should I add the info that it is base10 or
would you consider that a given?

>>>> 	Git --> Filter: "CONTENT"
>>> 
>>> Can filter ignore the content size, and just read all what it was
>>> sent, that is until eof or something?
>> 
>> What would that something be? Since CONTENT is binary it can contain
>> any character (even "\0")...
> 
> Here by "or something" I meant some other way of detecting that there
> is nothing more to read. But providing the size upfront (or size of
> chunk / packet in the streaming interface, if/when it gets implemented)
> is probably a better idea. Git knows it anyway, cheaply.
> 
>>>> 3. The filter is expected to answer with the result content size in
>>>> bytes and the result content separated by a newline character:
>>>> 	Git <-- Filter: "15\n"
>>>> 	Git <-- Filter: "SMUDGED_CONTENT"
>>> 
>>> I wonder how hard would be to write filters for this protocol...
>> 
>> Easy :-) Plus you can look at a Perl (see t/t0021) and a golang implementation
>> already (https://github.com/github/git-lfs/pull/1382).
> 
> Right. Any programming language that has a way to specify "read N bytes"
> would work. I think even bash would work, with 'read -N $len -r'... I think.
> 
>>>> ---
>>>> Documentation/gitattributes.txt |  41 +++++++-
>>>> convert.c                       | 210 ++++++++++++++++++++++++++++++++++++++--
>>>> t/t0021-conversion.sh           | 170 ++++++++++++++++++++++++++++++++
>>> 
>>> Wouldn't it be better to name the test case something more
>>> descriptive, for example
>>> 
>>>  t/t0021-filter-driver-useProtocol.sh
>>> 
>>> The name of test should be adjusted to final name of the feature,
>>> of course.
>> 
>> I think the prefix numbers should be unique, no? And t0022 is already taken.
> 
> I meant here that the "conversion" part of "t/t0021-conversion.sh" test
> filename is not descriptive enough.

Ah, I see. You suggest to rename the test case? Would that be OK with the Git
community?

>>>> t/t0021/rot13.pl                |  80 +++++++++++++++
> 
> This is all right, because it is in t0021 context.
> 
>>>> 4 files changed, 494 insertions(+), 7 deletions(-)
>>>> create mode 100755 t/t0021/rot13.pl
>>>> 
>>>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>>>> index 8882a3e..7026d62 100644
>>>> --- a/Documentation/gitattributes.txt
>>>> +++ b/Documentation/gitattributes.txt
>>>> @@ -300,7 +300,10 @@ checkout, when the `smudge` command is specified, the command is
>>>> fed the blob object from its standard input, and its standard
>>>> output is used to update the worktree file.  Similarly, the
>>>> `clean` command is used to convert the contents of worktree file
>>>> -upon checkin.
>>>> +upon checkin. By default these commands process only a single
>>>> +blob and terminate. If the setting filter.<driver>.useProtocol is
>>>> +enabled then Git can process all blobs with a single filter command
>>>> +invocation (see filter protocol below).
>>> 
>>> This does not tell the precedence between `smudge`, `clean` and
>>> filter.<driver>.useProtocol, see above. Also, discrepancy in how
>>> config variables are referenced.
>> 
>> As mentioned above "useProtocol" is a boolean. Therefore precedence shouldn't
>> be a problem.
> 
> Which was not obvious (but might not matter in the end).
> 
>>             What do you mean by "discrepancy in how config variables are 
>> referenced"?
> 
> What I meant here that filter.<driver>.smudge and filter.<driver>.clean
> were referenced as "`smudge` command" and "`clean` command" in the paragraph
> you modified.
> 
> Perhaps filter.<driver>.useProtocol is all right (I have not looked further),
> but it should be formatted as `filter.<driver>.useProtocol` IMVHO.

Initially I thought so, too. But "filter.<driver>.required", which is already
mentioned in gitattributes.txt, does not use this style. Should I change that, too,
or use the existing style?

> [...]
>>>> diff --git a/convert.c b/convert.c
>>>> index 522e2c5..91ce86f 100644
>>>> --- a/convert.c
>>>> +++ b/convert.c
>>>> @@ -481,12 +481,188 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>>>> 	return ret;
>>>> }
>>>> 
>>>> +static int cmd_process_map_init = 0;
>>>> +static struct hashmap cmd_process_map;
>>>> +
>>>> +struct cmd2process {
>>>> +	struct hashmap_entry ent; /* must be the first member! */
>>>> +	const char *cmd;
>>>> +	long protocol;
>>>> +	struct child_process process;
>>>> +};
>>> [...]
>>>> +static struct cmd2process *find_protocol_filter_entry(const char *cmd)
>>>> +{
>>>> +	struct cmd2process k;
>>>> +	hashmap_entry_init(&k, strhash(cmd));
>>>> +	k.cmd = cmd;
>>>> +	return hashmap_get(&cmd_process_map, &k, NULL);
>>> 
>>> Should we use global variable cmd_process_map, or pass it as parameter?
>>> The same question apply for other procedures and functions.
>>> 
>>> Note that I am not saying that it is a bad thing to use global
>>> variable here.
>> 
>> Passing it would be nicer as this would make at least a few functions "pure".
>> I will change that!
> 
> You can always provide convenience functions that use global variable.
> That's what Git code does with the_index, if I remember it correctly.
> 
> [...]
>>>> +
>>>> +	sigchain_push(SIGPIPE, SIG_IGN);
>>>> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
>>>> +	sigchain_pop(SIGPIPE);
>>>> +
>>>> +	strbuf_stripspace(&nbuf, 0);
>>>> +	string_list_split_in_place(&split, nbuf.buf, ' ', 2);
>>>> +	ret &= split.nr > 1;
>>>> +	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;
>>>> +	if (ret) {
>>>> +		entry->protocol = strtol(split.items[1].string, NULL, 10);
>>> 
>>> This does not handle at least some errors in version number parsing,
>>> for example junk after version number. Don't we have some helper
>>> functions for this?
>> 
>> I am not sure. I haven't found one.
> 
> Hmmm... I remember there were some patches about this, but I don't know
> if they were accepted.  We have strtol_i() in git-compat-util.h. 
> 
> And you can always check where the parsing ended (by not passing NULL,
> of course).

OK, try to make this nicer in v2.

> [...]
>>>> +		switch (entry->protocol) {
>>>> +			case 1:
>>>> +				break;
>>>> +			default:
>>>> +				ret = 0;
>>>> +				error("unsupported protocol version %s for external persistent filter '%s'",
>>>> +					nbuf.buf, cmd);
>>>> +		}
>>>> +	}
>>>> +	string_list_clear(&split, 0);
>>>> +	strbuf_release(&nbuf);
>>>> +
>>>> +	if (!ret) {
>>>> +		error("initialization for external persistent filter '%s' failed", cmd);
>>>> +		return NULL;
>>>> +	}
>>> 
>>> Do we handle persistent filter command being killed before it finishes?
>>> Or exiting with error? I don't know this Git API...
>> 
>> If the "apply_filter" function fails then Git would proceed and just not
>> filter the content. If you define the "required" flag for the filter then
>> Git would error in that case.
> 
> Ah, right. 
> 
> [...]
>>>> diff --git a/t/t0021/rot13.pl b/t/t0021/rot13.pl
>>> 
>>> That's bit more than rot13... but it might be O.K. for a filename here.
>> 
>> "rot13-$FEATURE_NAME.pl" ?
> 
> As I said, rot13.pl is all right; if change, then perhaps to rot13-filter.pl 

OK.

> [...]
>>>> +        my $input;
>>>> +        {
>>>> +            binmode(STDIN);
>>>> +            my $bytes_read = 0;
>>>> +            $bytes_read = read STDIN, $input, $filelen;
>>>> +            if ( $bytes_read != $filelen ) {
>>>> +                die "not enough to read";
>>> 
>>> I know it's only a test script (well, a part of one), but we would probably
>>> want to have more information in the case of a real filter.
>> 
>> True. Do you think there is anything to change in the script, though?
> 
> No, I don't think so. It is enough that the test script would crash
> if fed incorrect data from Git. Better error messages would be nice,
> but are not necessary.
> 
>    +                die "not enough to read: expected $filelen, got $bytes_read";
> 
> I have noticed that some 'die' have "\n" at the end, and some do
> not. If I remember correctly it is for supressing error message from
> Perl, with filename and line number, isn't it? Anyway, we probably
> want to be consistent.

OK.

>>>> +        }
>>>> +    }
>>> 
>>> What happens if $filelen is zero, or negative? Ah, I see that $output
>>> would be undef... which is bad, I think.
>> 
>> Is this something we need to consider in the test script?
> 
> I think the test script should die if it gets incorrect content length
> (e.g. negative), so that it catches bug on the Git side. It should
> work for zero-length files, even if we don't test it -- we can in the
> future.

OK.

> [...]
>>>> +            print STDOUT "fail";
>>> 
>>> This is not defined in the protocol description!  Unless anything that
>>> does not conform to the specification would work here, but at least it
>>> is a recommended practice to be described in the documentation, don't
>>> you think?
>>> 
>>> What would happen in $output_len is 4?
>> 
>> Then it would work :D
>> I understand your point. However, this is not a reference implementation.
>> It is a test script that is supposed to trigger bad behavior which we can test. 
>> Therefore, I would argue that such a return value is OK. I will document it in 
>> the header, though. 
> 
> Why print "fail", and not die?

Agree, "die" is better.

> The problem is do the protocol need to have some way of communicating
> errors from the filter to Git?  Perhaps using stderr would be enough
> (but then Git would need to drain it, I think... unless it is not
> redirected), perhaps some command is needed?
> 
> For example, instead of:
> 
> 	Git <-- Filter: "15\n"
> 	Git <-- Filter: "SMUDGED_CONTENT"
> 
> perhaps filter should return
> 
> 	Git <-- Filter: "error\n"
> 	Git <-- Filter: "ONE_LINE_OF_ERROR_DESCRIPTION\n"
> 
> on error? Or if printing expected output length upfront is easier,
> use a signal (but that is supposedly not that reliable as message
> passing mechanism)?
> 
> It might be the case that some files return errors, but some do not.

I would prefer it if the filter just dies in case of trouble and that
way communicates to Git that something went wrong. Everything else
just complicates the protocol.

> BTW. do we test the case where filter fails, or returns wrong output?

Yes, I added a failure test in v2.

Thanks a lot,
Lars--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html