proposal for extending smudge/clean filters with raw file access

Joey Hess <id@xxxxxxxxxx> · Thu, 12 May 2016 14:24:32 -0400

I'm using smudge/clean filters in git-annex now, and the fit between
the interface and what git-annex wants to do has not been entirely
smooth.

The clean filter has to consume the whole file content on stdin;
if it does not read all of it, git thinks the clean filter failed.
But git-annex often doesn't need to read the whole content of a
work-tree file in order to clean it.

The smudge filter has to output the whole file content to stdout. But
git-annex often has the file's content on disk already, and could just
move it into place in the working tree. This would save CPU and IO and
often disk space too. But the smudge interface doesn't let git-annex use
the efficient approach.

So I propose extending the filter driver with two more optional
commands. Call them raw-clean and raw-smudge for now.

raw-clean would be like clean, but rather than being fed the whole
content of a large file on stdin, it would be passed the filename, and
can access the file itself. Like the clean filter, it outputs the
cleaned version on stdout.
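
To make the idea concrete, here is a rough sketch of what a raw-clean
command could do once it is handed a filename. It is only an
illustration in Python: the pointer format, the function name and the
in-memory cache keyed on size and mtime are all made up for this
example and are not how git-annex actually works. The point is that
the file is only opened when the cheap stat check misses.

```python
import hashlib
import os


def raw_clean(path, cache):
    """Return the cleaned (pointer) bytes for the work-tree file at `path`.

    `cache` is a hypothetical dict mapping path -> ((size, mtime_ns), pointer).
    """
    st = os.stat(path)
    stamp = (st.st_size, st.st_mtime_ns)

    cached = cache.get(path)
    if cached is not None and cached[0] == stamp:
        # File looks unchanged: no need to read its content at all.
        return cached[1]

    # Cache miss: read the file ourselves, in chunks.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)

    # Hypothetical pointer format, for illustration only.
    pointer = f"sha256:{digest.hexdigest()} size:{st.st_size}\n".encode()
    cache[path] = (stamp, pointer)
    return pointer
```

With the existing clean interface, the cache-hit branch would still
have to drain all of stdin before returning.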

raw-smudge would be like smudge, but rather than needing to output the
whole content of a large file on stdout, it would be passed a filename,
and can create that file itself.
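
A similarly rough sketch of the raw-smudge side, again in Python and
again with invented names: `store_dir` stands in for wherever the
content already lives on disk, and `key` for whatever identifies it.
This version hard links the content into place and falls back to a
copy when linking is not possible; a real implementation might move
the file instead.

```python
import os
import shutil


def raw_smudge(key, dest, store_dir):
    """Create the work-tree file `dest` from content already on disk.

    `store_dir` and `key` are hypothetical: the content is assumed to
    live at store_dir/key.
    """
    src = os.path.join(store_dir, key)
    tmp = dest + ".tmp"
    try:
        # Hard link: no data is copied and no extra disk space is used.
        os.link(src, tmp)
    except OSError:
        # Different filesystem, no link support, or stale tmp file:
        # fall back to an ordinary copy.
        shutil.copyfile(src, tmp)
    # Atomically put the file into place in the working tree.
    os.replace(tmp, dest)
```

Either way nothing has to be streamed through a pipe to git.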

To keep this backwards compatible, and to handle the cases where the
object being filtered is not a file on disk, raw-clean and raw-smudge
would only be used when the smudge and clean filters are configured
too.
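
The selection rule could look something like the following sketch.
It is Python pseudologic, not git code: `driver` is a hypothetical
dict of the commands configured for a filter driver, and pairing
raw-clean with clean (rather than requiring both clean and smudge) is
my assumption about how the fallback would be wired up.

```python
def pick_clean_command(driver, has_worktree_path):
    """Pick which configured command should clean an object.

    `driver` is a hypothetical dict such as
    {"clean": "...", "raw-clean": "..."}.
    """
    if "clean" not in driver:
        # Without a regular clean filter the raw variant is ignored,
        # so an old git and a new git behave the same way.
        return None
    if has_worktree_path and "raw-clean" in driver:
        # The object is a file on disk: let the filter open it itself.
        return "raw-clean"
    # Content is only available as a stream: use the existing interface.
    return "clean"
```

The raw-smudge side would follow the same pattern with smudge as the
fallback.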

It seems fairly easy to implement raw-clean. In sha1_file.c, index_path
would use raw-clean when available, while index_fd etc. would keep
using the clean filter. I have not yet investigated what would be
needed to implement raw-smudge.

-- 
see shy jo