On June 9, 2021 12:12 AM, Peff wrote: >On Wed, Jun 09, 2021 at 09:42:39AM +0900, Junio C Hamano wrote: > >> "Randall S. Becker" <rsbecker@xxxxxxxxxxxxx> writes: >> >> > The filter structure provides a mechanism for providing the working >> > directory's file name path to a filter using a %f argument. This >> > request is to teach the textconv mechanism to support the same >> > capability. >> > >> > The use case comes from a complex content renderer that needs to >> > know what the original file name is, so as to be able to find >> > additional content, by name, that describes the file (base >> > name+different extension). >> > >> > If this is considered a good idea, I would be happy to implement >> > this but need a pointer or two of where to look in the code to make >> > it happen. >> >> Both in diff, grep and cat-file, textconv eventually triggers >> diff.c::fill_textconv() and calls run_textconv() unless there is a >> cached copy of the resut of running textconv earlier on the same >> contents. This is because for each textconv driver, the output is >> expected to be purely a function of the input bytestream, and that is >> why it does not take any other input. >> >> So, if we have two identical blobs in a tree object under different >> pathnames, making the output from textconv different for them because >> they sit at different pathnames directly goes against the basic design >> of the system at the philosophical level. >> >> Having said that, I _suspect_ (but not verified) that as long as the >> driver is marked as non-cacheable, it may be acceptable to export a >> new environment variable, say, GIT_TEXTCONV_PATH, and allow the >> textconv program to produce different results for the same input. I >> am not offhand sure if it is OK to allow command line substitutions >> like the filter scripts, though. It would be nice from the point of >> view of consistency if we could do so, but those who use an existing >> textconv program at a pathname with per-cent in it may get negatively >> affected. > >As the person who implemented both textconv and its caching, all of that sounds quite sensible to me. :) > >The caching mechanism has never been turned on by default, and probably will never be (because one of the original uses for which I >wanted textconv is decrypting blobs on the fly, and storing the result would defeat the purpose). So it may be sufficient to say "if it hurts to >turn on caching with a filter that depends on the path, then don't do it". > >It may be possible to make it work with a cache, too. The caching mechanism also has a validity key, which looks something like this: > > [prime the cache with a silly filter] > $ git -c diff.perl.textconv='tr a-z A-Z <' \ > -c diff.perl.cachetextconv=true \ > log -p '*.perl' >/dev/null > > [the result is stored via git-notes; the body is the cache-key] > $ git cat-file commit refs/notes/textconv/perl > tree a27c3c18d222e47a7943c2844f9ed6c75710d8c4 > author Jeff King <peff@xxxxxxxx> 1623211550 -0400 > committer Jeff King <peff@xxxxxxxx> 1623211550 -0400 > > tr a-z A-Z < > >I thought at first we could put the pathname into that key, but it's really a key for the _whole_ cache, not an individual entry. So that >wouldn't work. You'd have to redesign the cache mechanism (which in turn might require assistance from the notes code). So I lean >towards "if it hurts, don't do it". :) > >One final note. Randall said: > >> > The use case comes from a complex content renderer that needs to >> > know what the original file name is, so as to be able to find >> > additional content, by name, that describes the file (base >> > name+different extension). > >That needs the filename, but there's also the implication there that ancillary data would be coming from _other_ files based on that >name. >That seems a lot more complicated. If I'm a filter and am asked to convert "foo.one", I might want to see "foo.two". But how do I know >from which commit to get it? It is OK if "foo.two" is a globally unambiguous name across time, but I'd imagine in most cases you want >"foo.two" that accompanied "foo.one" at the time. And now the cache is not even a property of a blob/filename pair, but rather of the >whole tree. Well, sadly, I see the points you both make. It might be better for my team to come up with a self-describing file format with embedded metadata instead. Thing is, the whole approach of idempotency in git where the commit would immutably contain both the file and its metadata in a separate blob but same commit is appealing - which is the only condition where I would see my original suggestion applying. -Randall