From: Lars Schneider <larsxschneider@xxxxxxxxx> Git's clean/smudge mechanism invokes an external filter process for every single file that is affected by the filter. If you have *a lot* of files to filter then the startup time of the external filter process becomes a problem as discussed here: http://thread.gmane.org/gmane.comp.version-control.git/298293 This patch might be a solution to this problem and I am looking for feedback. Please note that this is a RFC/proof-of-concept patch. If the list agrees with this approach then I will post a proper patch series with more tests, error handling, and documentation. ## How does it work? Git's "convert.c" got a hashmap that maps a filter command to a process. If we need to filter a file, then we look into the hashmap. If the filter command is not there then we create and start a new filter process and setup in/out pipes. If the filter command is there then we just take the existing and running process. In case of a clean filter we send the filename of the file to clean via pipe to the filter process. The filter process reads the file, generates the cleaned version and sends it back to Git via pipe. In case of a smudge filter we send the destination filename of the file to smudge via pipe as well as its content size and content. The filter process reads all this info, generates the smudged version and stores the smudged version at the location of the destination filename. Finally, the filter process sends an "OK" to tell Git that everything worked as expected. This patch is based on Joey Hess' great "clean-smudge-annex" topic, idea, and implementation: https://github.com/gitster/git/tree/jh/clean-smudge-annex v1: http://thread.gmane.org/gmane.comp.version-control.git/297475 v4 (latest): http://thread.gmane.org/gmane.comp.version-control.git/297994 You can find a branch with with my patch on top of Joey's work here: https://github.com/larsxschneider/git/commits/clean-smudge-file-long-running/v1 ## Questions (1) Do you see a general problem with this kind of "pipe" protocol and long running Git filter processes? Other than the test cases I haven't implemented a real world filter, yet (on my todo list for tomorrow). (2) Joey's topic, which is the base for my patch, looks stalled for more than 2 weeks: http://thread.gmane.org/gmane.comp.version-control.git/297994/focus=298006 I would be happy to address Junio's comments and post a reroll. However, I don't want to interfere with Joey's plans. How does the list usually address this kind of situation? @Joey (in case you are reading this): My patch changes your initial idea quite a bit. However, I believe it is an improvement that could be beneficial for git-annex, too. Would you prefer to work with me on the combination of our ideas (file clean/smudge + long running clean/smudge processes) or would you prefer to keep your interface? (3) Clean/smudge filter use Git's async command API to create a separate thread and start the filter process. In my implementation I start the filter process directly as this eases the pipe communication. I assume I am missing something obvious here, but what is the advantage of the async command API? Git needs to wait for the result of the filter anyways and therefore async is not really necessary, or? (4) I am not really experienced with process pipes. Is it OK to use the "strbuf_read_once" function to read from a pipe? If I use "strbuf_read" then the process waits indefinitely. Is there a way to tell Git to read X bytes from a pipe (I just found the size "hint")? (5) As mentioned above "convert.c" got a hashmap that maps a filter command to a process. At the end of the Git process I would like to close the pipes and stop the filter processes. What would be the best place in the code to do that? Thanks, Lars --- convert.c | 143 +++++++++++++++++++++++++++++++++++++++++++------- t/t0021-conversion.sh | 80 +++++++++++++++++++++------- 2 files changed, 185 insertions(+), 38 deletions(-) diff --git a/convert.c b/convert.c index e4db007..590b85a 100644 --- a/convert.c +++ b/convert.c @@ -373,9 +373,120 @@ struct filter_params { int fd; const char *cmd; const char *path; /* Path within the git repository */ - const char *fspath; /* Path to file on disk */ }; +static int cmd_process_map_init = 0; +static struct hashmap cmd_process_map; + +struct cmd2process { + struct hashmap_entry ent; /* must be the first member! */ + const char *cmd; + struct child_process process; +}; + +static int cmd2process_cmp(const struct cmd2process *e1, + const struct cmd2process *e2, + const void *unused) +{ + return strcmp(e1->cmd, e2->cmd); +} + +static struct cmd2process *find_entry(const char *cmd) +{ + struct cmd2process k; + hashmap_entry_init(&k, memhash(&cmd, sizeof(char *))); + k.cmd = cmd; + return hashmap_get(&cmd_process_map, &k, NULL); +} + +static int apply_file_filter(const char *path, const char *src, size_t len, + struct strbuf *dst, const char *cmd) +{ + int ret = 1; + struct strbuf nbuf = STRBUF_INIT; + struct cmd2process *entry = NULL; + struct child_process *process = NULL; + const char *argv[] = { NULL, NULL }; + + if (!cmd || !*cmd) + return 0; + + if (!dst) + return 1; + + if (!cmd_process_map_init) { + cmd_process_map_init = 1; + hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0); + } else { + entry = find_entry(cmd); + } + + if (entry) { + process = &entry->process; + } else { + entry = malloc(sizeof(struct cmd2process)); + hashmap_entry_init(entry, memhash(&cmd, sizeof(char *))); + entry->cmd = cmd; + hashmap_add(&cmd_process_map, entry); + process = &entry->process; + + child_process_init(process); + argv[0] = cmd; + process->argv = argv; + process->use_shell = 1; + process->in = -1; + process->out = -1; + + if (start_command(process)) { + error("cannot fork to run external filter %s", cmd); + return 0; + } + } + + fflush(NULL); + + // TODO: Is the signchain_push OK here? + sigchain_push(SIGPIPE, SIG_IGN); + write_str_in_full(process->in, path); + write_str_in_full(process->in, "\n"); + if (src && len > 0) { + // smudge filter processing + struct strbuf lenstr = STRBUF_INIT; + strbuf_reset(&lenstr); + strbuf_addf(&lenstr, "%zu", len); + write_str_in_full(process->in, lenstr.buf); + write_str_in_full(process->in, "\n"); + write_in_full(process->in, src, len); + strbuf_reset(&nbuf); + if (strbuf_read_once(&nbuf, process->out, 2) < 0 || + strcmp(nbuf.buf, "OK") != 0) { + error("read from external file filter %s failed", cmd); + ret = 0; + } + } else { + // clean filter processing + strbuf_reset(&nbuf); + // TODO: Should we read the expected size here first? + if (strbuf_read_once(&nbuf, process->out, 0) < 0) { + error("read from external file filter %s failed", cmd); + ret = 0; + } else { + strbuf_swap(dst, &nbuf); + } + // TODO: Should we ask for an OK from the filter here, too? + } + sigchain_pop(SIGPIPE); + + // TODO: We probably should close the pipes and finish all processes + // in the hashmap. What would be a good place to do this? + // close(process->in) + // close(process->out) + // finish_command(process); + + strbuf_release(&nbuf); + return ret; +} + static int filter_buffer_or_fd(int in, int out, void *data) { /* @@ -402,15 +513,6 @@ static int filter_buffer_or_fd(int in, int out, void *data) strbuf_expand(&cmd, params->cmd, strbuf_expand_dict_cb, &dict); strbuf_release(&path); - /* append fspath to the command if it's set, separated with a space */ - if (params->fspath) { - struct strbuf fspath = STRBUF_INIT; - sq_quote_buf(&fspath, params->fspath); - strbuf_addstr(&cmd, " "); - strbuf_addbuf(&cmd, &fspath); - strbuf_release(&fspath); - } - argv[0] = cmd.buf; child_process.argv = argv; @@ -449,9 +551,8 @@ static int filter_buffer_or_fd(int in, int out, void *data) return (write_err || status); } -static int apply_filter(const char *path, const char *fspath, - const char *src, size_t len, int fd, - struct strbuf *dst, const char *cmd) +static int apply_filter(const char *path, const char *src, size_t len, + int fd, struct strbuf *dst, const char *cmd) { /* * Create a pipeline to have the command filter the buffer's @@ -479,7 +580,6 @@ static int apply_filter(const char *path, const char *fspath, params.fd = fd; params.cmd = cmd; params.path = path; - params.fspath = fspath; fflush(NULL); if (start_async(&async)) @@ -610,7 +710,7 @@ static int count_ident(const char *cp, unsigned long size) } static int ident_to_git(const char *path, const char *src, size_t len, - struct strbuf *buf, int ident) + struct strbuf *buf, int ident) { char *dst, *dollar; @@ -860,7 +960,7 @@ int would_convert_to_git_filter_fd(const char *path) if (!ca.drv->required) return 0; - return apply_filter(path, NULL, NULL, 0, -1, NULL, ca.drv->clean); + return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean); } static int can_filter_file(const char *filefilter, const char *filefiltername, @@ -950,7 +1050,7 @@ int convert_to_git(const char *path, const char *src, size_t len, required = ca.drv->required; } - ret |= apply_filter(path, NULL, src, len, -1, dst, filter); + ret |= apply_filter(path, src, len, -1, dst, filter); if (!ret && required) die("%s: clean filter '%s' failed", path, ca.drv->name); @@ -976,7 +1076,7 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst, assert(ca.drv); assert(ca.drv->clean); - if (!apply_filter(path, NULL, NULL, 0, fd, dst, ca.drv->clean)) + if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean)) die("%s: clean filter '%s' failed", path, ca.drv->name); crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, @@ -995,7 +1095,7 @@ void convert_to_git_filter_from_file(const char *path, struct strbuf *dst, assert(ca.drv->clean); assert(ca.drv->clean_from_file); - if (!apply_filter(path, path, "", 0, -1, dst, ca.drv->clean_from_file)) + if (!apply_file_filter(path, "", 0, dst, ca.drv->clean_from_file)) die("%s: cleanFromFile filter '%s' failed", path, ca.drv->name); crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, @@ -1040,7 +1140,10 @@ static int convert_to_working_tree_internal(const char *path, } } - ret_filter = apply_filter(path, destpath, src, len, -1, dst, filter); + if (destpath) + ret_filter = apply_file_filter(destpath, src, len, dst, filter); + else + ret_filter = apply_filter(path, src, len, -1, dst, filter); if (!ret_filter && required) die("%s: %s filter %s failed", path, destpath ? "smudgeToFile" : "smudge", ca.drv->name); diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh index 2722013..1eafe67 100755 --- a/t/t0021-conversion.sh +++ b/t/t0021-conversion.sh @@ -14,19 +14,39 @@ chmod +x rot13.sh cat <<EOF >rot13-from-file.sh #!$SHELL_PATH -srcfile="\$1" -touch rot13-from-file.ran -cat "\$srcfile" | ./rot13.sh +while read LINE; do + srcfile="\$LINE" + echo "CLEAN \$srcfile" >> rot13-from-file.ran + cat "\$srcfile" | ./rot13.sh +done + EOF chmod +x rot13-from-file.sh -cat <<EOF >rot13-to-file.sh -#!$SHELL_PATH -destfile="\$1" -touch rot13-to-file.ran -./rot13.sh > "\$destfile" +# TODO: Is there a way to read X bytes from a pipe via shell? +# I implemented the filter in Python as workaround. If Python +# is an undesired dependency then I could reimplement it in Perl, too. +cat <<EOF >rot13-to-file.py +#!/usr/bin/env python +import sys +from subprocess import Popen, PIPE, STDOUT + +for data in iter(sys.stdin.readline, ''): + filename = data[:-1] + content_size = sys.stdin.readline()[:-1] + content = sys.stdin.read(int(content_size)) + + p = Popen(['./rot13.sh'], stdout=PIPE, stdin=PIPE, stderr=PIPE) + with open(filename, 'w') as f: + f.write(p.communicate(input=content)[0]) + + with open('rot13-to-file.ran', 'a') as f: + f.write('SMUDGE {} {}\n'.format(filename, content_size)) + + sys.stdout.write('OK') + sys.stdout.flush() EOF -chmod +x rot13-to-file.sh +chmod +x rot13-to-file.py cat <<EOF >delete-file-and-fail.sh #!$SHELL_PATH @@ -293,28 +313,52 @@ test_expect_success 'disable filter with empty override' ' ' test_expect_success 'cleanFromFile filter is used when adding a file' ' + { + echo a + echo bb + echo ccc + } >test2 && + + test_config filter.rot13.cleanFromFile ./rot13-from-file.sh && + test_config filter.rot13.required true && echo "*.t filter=rot13" >.gitattributes && cat test >fstest.t && - git add fstest.t && - test -e rot13-from-file.ran && + cat test2 >fstest2.t && + git add fstest.t fstest2.t && + + test_line_count = 2 rot13-from-file.ran && + grep "CLEAN fstest.t" rot13-from-file.ran && + grep "CLEAN fstest2.t" rot13-from-file.ran && rm -f rot13-from-file.ran && rm -f fstest.t && - git checkout -- fstest.t && - cmp test fstest.t + rm -f fstest2.t && + + git checkout . && + cmp test fstest.t && + cmp test2 fstest2.t ' test_expect_success 'smudgeToFile filter is used when checking out a file' ' - test_config filter.rot13.smudgeToFile ./rot13-to-file.sh && + test_config filter.rot13.smudgeToFile ./rot13-to-file.py && + test_config filter.rot13.required true && rm -f fstest.t && - git checkout -- fstest.t && + rm -f fstest2.t && + git checkout . && + + # TODO: Why is this necessary? + sleep 0.1 && + cmp test fstest.t && + cmp test2 fstest2.t && - test -e rot13-to-file.ran && + test_line_count = 2 rot13-to-file.ran && + grep "SMUDGE fstest.t 57" rot13-to-file.ran && + grep "SMUDGE fstest2.t 9" rot13-to-file.ran && rm -f rot13-to-file.ran ' @@ -335,7 +379,7 @@ test_expect_success 'recovery from failure of smudgeToFile filter that deletes t ' test_expect_success 'smudgeToFile filter is used in merge' ' - test_config filter.rot13.smudgeToFile ./rot13-to-file.sh && + test_config filter.rot13.smudgeToFile ./rot13-to-file.py && git commit -m "added fstest.t" fstest.t && git checkout -b old && @@ -350,7 +394,7 @@ test_expect_success 'smudgeToFile filter is used in merge' ' ' test_expect_success 'smudgeToFile filter is used by git am' ' - test_config filter.rot13.smudgeToFile ./rot13-to-file.sh && + test_config filter.rot13.smudgeToFile ./rot13-to-file.py && git format-patch HEAD^ --stdout > fstest.patch && git reset --hard HEAD^ && -- 2.5.1 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html