RE: [RFE] Teach git textconv to support %f

"Randall S. Becker" <rsbecker@xxxxxxxxxxxxx> · Wed, 9 Jun 2021 10:16:49 -0400

On June 9, 2021 12:12 AM, Peff wrote:
>On Wed, Jun 09, 2021 at 09:42:39AM +0900, Junio C Hamano wrote:
>
>> "Randall S. Becker" <rsbecker@xxxxxxxxxxxxx> writes:
>>
>> > The filter structure provides a mechanism for providing the working
>> > directory's file name path to a filter using a %f argument. This
>> > request is to teach the textconv mechanism to support the same
>> > capability.
>> >
>> > The use case comes from a complex content renderer that needs to
>> > know what the original file name is, so as to be able to find
>> > additional content, by name, that describes the file (base
>> > name+different extension).
>> >
>> > If this is considered a good idea, I would be happy to implement
>> > this but need a pointer or two of where to look in the code to make
>> > it happen.
>>
>> Both in diff, grep and cat-file, textconv eventually triggers
>> diff.c::fill_textconv() and calls run_textconv() unless there is a
>> cached copy of the resut of running textconv earlier on the same
>> contents.  This is because for each textconv driver, the output is
>> expected to be purely a function of the input bytestream, and that is
>> why it does not take any other input.
>>
>> So, if we have two identical blobs in a tree object under different
>> pathnames, making the output from textconv different for them because
>> they sit at different pathnames directly goes against the basic design
>> of the system at the philosophical level.
>>
>> Having said that, I _suspect_ (but not verified) that as long as the
>> driver is marked as non-cacheable, it may be acceptable to export a
>> new environment variable, say, GIT_TEXTCONV_PATH, and allow the
>> textconv program to produce different results for the same input.  I
>> am not offhand sure if it is OK to allow command line substitutions
>> like the filter scripts, though.  It would be nice from the point of
>> view of consistency if we could do so, but those who use an existing
>> textconv program at a pathname with per-cent in it may get negatively
>> affected.
>
>As the person who implemented both textconv and its caching, all of that sounds quite sensible to me. :)
>
>The caching mechanism has never been turned on by default, and probably will never be (because one of the original uses for which I
>wanted textconv is decrypting blobs on the fly, and storing the result would defeat the purpose). So it may be sufficient to say "if it hurts to
>turn on caching with a filter that depends on the path, then don't do it".
>
>It may be possible to make it work with a cache, too. The caching mechanism also has a validity key, which looks something like this:
>
>  [prime the cache with a silly filter]
>  $ git -c diff.perl.textconv='tr a-z A-Z <' \
>        -c diff.perl.cachetextconv=true \
>        log -p '*.perl' >/dev/null
>
>  [the result is stored via git-notes; the body is the cache-key]
>  $ git cat-file commit refs/notes/textconv/perl
>  tree a27c3c18d222e47a7943c2844f9ed6c75710d8c4
>  author Jeff King <peff@xxxxxxxx> 1623211550 -0400
>  committer Jeff King <peff@xxxxxxxx> 1623211550 -0400
>
>  tr a-z A-Z <
>
>I thought at first we could put the pathname into that key, but it's really a key for the _whole_ cache, not an individual entry. So that
>wouldn't work. You'd have to redesign the cache mechanism (which in turn might require assistance from the notes code). So I lean
>towards "if it hurts, don't do it". :)
>
>One final note. Randall said:
>
>> > The use case comes from a complex content renderer that needs to
>> > know what the original file name is, so as to be able to find
>> > additional content, by name, that describes the file (base
>> > name+different extension).
>
>That needs the filename, but there's also the implication there that ancillary data would be coming from _other_ files based on that
>name.
>That seems a lot more complicated. If I'm a filter and am asked to convert "foo.one", I might want to see "foo.two". But how do I know
>from which commit to get it? It is OK if "foo.two" is a globally unambiguous name across time, but I'd imagine in most cases you want
>"foo.two" that accompanied "foo.one" at the time. And now the cache is not even a property of a blob/filename pair, but rather of the
>whole tree.

Well, sadly, I see the points you both make. It might be better for my team to come up with a self-describing file format with embedded metadata instead. Thing is, the whole approach of idempotency in git where the commit would immutably contain both the file and its metadata in a separate blob but same commit is appealing - which is the only condition where I would see my original suggestion applying.

-Randall