> On 26 Jun 2017, at 21:02, Junio C Hamano <gitster@xxxxxxxxx> wrote: > > Lars Schneider <larsxschneider@xxxxxxxxx> writes: > >> Some `clean` / `smudge` filters might require a significant amount of >> time to process a single blob (e.g. the Git LFS smudge filter might >> perform network requests). During this process the Git checkout >> operation is blocked and Git needs to wait until the filter is done to >> continue with the checkout. > > Good motivation. I'd say s/might/may/ to stress the fact that this > is addressing a real-world problem, i.e. some filters incur network > latencies. Agreed. >> Teach the filter process protocol (introduced in edcc858) to accept the > > When referring to an old commit, we recommend this format > > edcc8581 ("convert: add filter.<driver>.process option", 2016-10-16) > > (with or without dq around the title) which helps readers by telling > them how old the thing is and what it was about. Agreed. > ... > >> +Delay >> +^^^^^ >> + >> +If the filter supports the "delay" capability, then Git can send the >> +flag "can-delay" after the filter command and pathname. This flag >> +denotes that the filter can delay filtering the current blob (e.g. to >> +compensate network latencies) by responding with no content but with >> +the status "delayed" and a flush packet. >> +------------------------ >> +packet: git> command=smudge >> +packet: git> pathname=path/testfile.dat >> +packet: git> can-delay=1 >> +packet: git> 0000 >> +packet: git> CONTENT >> +packet: git> 0000 >> +packet: git< status=delayed >> +packet: git< 0000 >> +------------------------ >> + >> +If the filter supports the "delay" capability then it must support the >> +"list_available_blobs" command. If Git sends this command, then the >> +filter is expected to return a list of pathnames of blobs that are >> +available. The list must be terminated with a flush packet followed > > Is it correct to read the above "pathnames of blobs that are > availble" as "pathnames, among which Git earlier requested to be > filtered with "can-delay=1", for which the filtered result is > ready"? I do not mean to suggest this awkward wording to replace > yours, but I found yours a bit too fuzzy for first time readers. > > Perhaps my brain has rotten by hearing the "delayed/lazy transfer of > large blobs", but it went "Huh?" upon seeing "...are available". Maybe this? [...] If Git sends this command, then the filter is expected to return a list of pathnames representing blobs that have been delayed earlier and are now available. [...] >> +by a "success" status that is also terminated with a flush packet. If >> +no blobs for the delayed paths are available, yet, then the filter is >> +expected to block the response until at least one blob becomes >> +available. > > Ahh, this is better, at least you use "the delayed paths". > > What if the result never gets available (e.g. resulted in an error)? As soon as the filter responds with an empty list, Git stops asking. All blobs that Git has not received at this point are considered an error. Should I mention that explicitly? > Or is it considered "we _now_ know the result is an error" and such > a delayed path that failed to retrieve from LFS store "available" at > that point? No. "list_available_blobs" only returns blobs that are immediately available. Errors are not available. > It may be too late to raise to a series that went to v6, but this > smells more like "ready" not "available" to me. You mean you would call it "list_ready_blobs"? I am no native speaker but I feel "available" sounds better. I also contemplated "retrievable". I think I understand your reasoning, though. In the case of GitLFS all these blobs are "available". Only the ones that GitLFS has downloaded are ready, though. However, other filters might delay blobs for non-network related reasons and then "available" and "ready" would be the same, no? I would like to keep "available". > ... >> diff --git a/cache.h b/cache.h >> index ae4c45d379..2ec12c4477 100644 >> --- a/cache.h >> +++ b/cache.h >> @@ -1551,8 +1552,11 @@ struct checkout { >> }; >> #define CHECKOUT_INIT { NULL, "" } >> >> + >> #define TEMPORARY_FILENAME_LENGTH 25 > > You probably do not need that new blank line. Agreed! I have no idea why/how that got in. > ... >> diff --git a/convert.c b/convert.c >> index e55c034d86..6452ab546a 100644 >> --- a/convert.c >> +++ b/convert.c >> @@ -533,7 +534,8 @@ static int start_multi_file_filter_fn(struct subprocess_entry *subprocess) >> if (err) >> goto done; >> >> - err = packet_writel(process->in, "capability=clean", "capability=smudge", NULL); >> + err = packet_writel(process->in, >> + "capability=clean", "capability=smudge", "capability=delay", NULL); >> >> for (;;) { >> cap_buf = packet_read_line(process->out, NULL); >> @@ -549,6 +551,8 @@ static int start_multi_file_filter_fn(struct subprocess_entry *subprocess) >> entry->supported_capabilities |= CAP_CLEAN; >> } else if (!strcmp(cap_name, "smudge")) { >> entry->supported_capabilities |= CAP_SMUDGE; >> + } else if (!strcmp(cap_name, "delay")) { >> + entry->supported_capabilities |= CAP_DELAY; >> } else { >> warning( >> "external filter '%s' requested unsupported filter capability '%s'", > > The above makes me wonder if we want to introduce a table, e.g. > > static const struct { > const *name; > unsigned cap; > } known_caps[] = { > { "clean", CAP_CLEAN }, > { "smudge", CAP_SMUDGE }, > { "delay", CAP_DELAY }, > }; > > and drive everything (i.e. capability announcement in hunk +534,8 > and parsing of request in hunk +551,8) off of it. > > That would be overkill for two capabilities, but adding another to > make it three is when such a refactoring starts to become plausible. I agree with this way for many capabilities. I'll check if it makes sense for 3. > >> @@ -590,9 +594,11 @@ static void handle_filter_error(const struct strbuf *filter_status, >> >> static int apply_multi_file_filter(const char *path, const char *src, size_t len, >> int fd, struct strbuf *dst, const char *cmd, >> - const unsigned int wanted_capability) >> + const unsigned int wanted_capability, >> + struct delayed_checkout *dco) >> { >> int err; >> + int can_delay = 0; >> struct cmd2process *entry; >> struct child_process *process; >> struct strbuf nbuf = STRBUF_INIT; >> @@ -647,6 +653,14 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len >> if (err) >> goto done; >> >> + if ((entry->supported_capabilities & CAP_DELAY) && >> + dco && dco->state == CE_CAN_DELAY) { >> + can_delay = 1; >> + err = packet_write_fmt_gently(process->in, "can-delay=1\n"); >> + if (err) >> + goto done; >> + } >> + >> err = packet_flush_gently(process->in); >> if (err) >> goto done; >> @@ -662,14 +676,74 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len >> if (err) >> goto done; >> >> - err = strcmp(filter_status.buf, "success"); >> + if (can_delay && !strcmp(filter_status.buf, "delayed")) { >> + dco->state = CE_DELAYED; >> + string_list_insert(&dco->filters, cmd); >> + string_list_insert(&dco->paths, path); > > The dco->state CE_CAN_DELAY is global to all the paths involved in > this checkout, but filter_status.buf being "delayed" is per path > (i.e. the filter can be told "can-delay=1" but still respond with > success right away), and in such a case dco->state will not advance > from CE_CAN_DELAY to CE_DELAYED (which in turn means we will show > "can-delay=1" to the next path). On the other hand, once we are > responded by "delayed" after sending "can-delay=1", we will go into > CE_DELAYED state and will not say "can-delay=1" to any subsequent > path. I am not 100% happy with design of "dco->state". You are right, I use it as "global" flag to indicate if paths can be delayed. But I *also* use it as "flag" to indicate if a certain path has been delayed. Afterwards it is reset (see entry.c:write_entry). Peff stumbled over that as well [1]. I need to change that! Maybe with an additional flag "dco->delayed". This flag is reset before any "async_convert_to_working_tree" and evaluated after. [1] "Hmm. This "reset the state" bit at the end surprised me...." http://public-inbox.org/git/20170624141941.usy2pyhid3jrf3ku@xxxxxxxxxxxxxxxxxxxxx/ > This feels somewhat unsatisfactory. From the protocol exchange > description in the documentation part of this patch, I was sort of > expecting that Git can selectively say "You can delay response to > this path" and "I am willing to wait until you give the filtered > result" for each command/pathname pair. But this code structure > does not seem to allow that. > > I would understand it if CE_CAN_DELAY and CE_DELAYED are unified, > and the assignment to dco->state here is removed. You may be using > these two states in other parts of the code to optimize between the > case where *no* delayed path exists (even if Git is capable of > delaying) and *some* delayed path exist, but I think that can be > done by checking if dco->paths is empty, or something like that. > > Such a change will allow us later to add more logic (perhaps driven > by attributes) to the decision to send "can-delay=1" in hunk +653,14 > above, to express "this won't be delayed". > > Alternatively, the semantics of "the entire checkout exchange is > allowed to be delayed" can be kept but then the protocol looks too > confusing---it shouldn't have "can-delay=1" after a command/pathname > pair, as if it can be specified per path. Actually a delay "per path" is possible. I'll try to make this more clear with a little refactoring. > ... > >> +int async_query_available_blobs(const char *cmd, struct string_list *delayed_paths) >> +{ >> + int err; >> + char *line; >> + struct cmd2process *entry; >> + struct child_process *process; >> + struct strbuf filter_status = STRBUF_INIT; >> + >> + assert(subprocess_map_initialized); >> + entry = (struct cmd2process *)subprocess_find_entry(&subprocess_map, cmd); >> + if (!entry) { >> + error("external filter '%s' is not available anymore although " >> + "not all paths have been filtered", cmd); >> + return 0; >> + } >> + process = &entry->subprocess.process; >> + sigchain_push(SIGPIPE, SIG_IGN); >> + >> + err = packet_write_fmt_gently( >> + process->in, "command=list_available_blobs\n"); >> if (err) >> goto done; >> >> - err = read_packetized_to_strbuf(process->out, &nbuf) < 0; >> + err = packet_flush_gently(process->in); >> if (err) >> goto done; >> >> + while ((line = packet_read_line(process->out, NULL))) { >> + const char *path; >> + if (skip_prefix(line, "pathname=", &path)) >> + string_list_insert(delayed_paths, xstrdup(path)); > > I am assuming that the caller passes an empty string list to > delayed_paths variable (shouldn't it be called available_paths, or > ready_paths if we follow the earlier discussion above, by the way?), > and it will compare the resulting set with the set of paths that got > "delayed" response in the apply_multi_file_filter() function earlier. Correct. I also agree with the rename although it seems a bit odd to name the function "...available_blobs" and the variable "available_paths". The paths itself are not "available" anyways. They just represent the blobs that are available. Therefore it should be OK and it probably better than just "paths". > I am wondering whose responsibility it will be to deal with a path > "list-available" reports that are *not* asked by Git or Git got no > "delayed" response. The list subtraction done by the caller is > probably the logical place to do so. Correct. Git (the caller) will notice that not all "delayed" paths are listed by the filter and throw an error at the end. > >> diff --git a/entry.c b/entry.c >> index d6b263f78e..c6d5cc01dc 100644 >> --- a/entry.c >> +++ b/entry.c >> @@ -137,6 +137,85 @@ static int streaming_write_entry(const struct cache_entry *ce, char *path, >> ... >> +static int remove_available_paths(struct string_list_item *item, void *cb_data) >> +{ >> + struct string_list *available_paths = cb_data; >> + return !string_list_has_string(available_paths, item->string); >> +} >> + >> +int finish_delayed_checkout(struct checkout *state) >> +{ >> + int errs = 0; >> + struct string_list_item *filter, *path; >> + struct delayed_checkout *dco = state->delayed_checkout; >> + >> + if (!state->delayed_checkout) >> + return errs; >> + >> + dco->state = CE_RETRY; >> + while (dco->filters.nr > 0) { >> + for_each_string_list_item(filter, &dco->filters) { >> + struct string_list available_paths; >> + string_list_init(&available_paths, 0); > > STRING_LIST_INIT_NODUP? Of course! > >> + >> + if (!async_query_available_blobs(filter->string, &available_paths)) { >> + /* Filter reported an error */ >> + errs = 1; >> + filter->string = ""; >> + continue; >> + } >> + if (available_paths.nr <= 0) { >> + /* Filter responded with no entries. That means >> + the filter is done and we can remove the >> + filter from the list (see >> + "string_list_remove_empty_items" call below). >> + */ > > /* > * Our multi-line comments > * look like this. > */ Oops. Will fix. BTW: In [1] you mentioned that you run checkpatch. Would checkpatch catch this kind of error? If yes, is your version publicly available? [1] http://public-inbox.org/git/xmqq1ste2zwu.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx/ >> + filter->string = ""; >> + continue; >> + } >> + >> + /* In dco->paths we store a list of all delayed paths. >> + The filter just send us a list of available paths. >> + Remove them from the list. >> + */ >> + filter_string_list(&dco->paths, 0, >> + &remove_available_paths, &available_paths); > > We first remove from the outstanding request list (dco->paths) what > are now ready... > >> + for_each_string_list_item(path, &available_paths) { > > ...and go over those paths that are now ready. > >> + struct cache_entry* ce = index_file_exists( >> + state->istate, path->string, >> + strlen(path->string), 0); >> + assert(dco->state == CE_RETRY); >> + errs |= (ce ? checkout_entry(ce, state, NULL) : 1); >> + } > > But we never checked if the contents of this available_paths list is > a subset of the set of paths we originally asked the external > process to filter. Correct. > This will allow the process to overwrite any > random path that is not even involved in the checkout. No, not "any random path". Only paths that are part of the checkout. There are three cases: (1) available_path is a path that was delayed before (= happy case!) (2) available_path is a path that was not delayed before, but filtered (= no problem, as filtering is a idempotent operation) (3) available_path is a path that was neither delayed nor filtered before (= if the filter returns the blob as-is then this would be no problem. otherwise we would indeed have a screwed checkout) Case 3 might introduce a problem if the filter is buggy. Would you be OK with this check to catch case 3? dco_path_count = dco->paths.nr; filter_string_list(&dco->paths, 0, &remove_available_paths, &available_paths); if (dco_path_count - dco->paths.nr != available_paths.nr) { /* The filter responded with entries that have not * been delay earlier. Do not ask the filter * for available blobs, again, as the filter is * likely buggy. This will generate an error at * the end as some files are not filtered properly. */ filter->string = ""; error(_("The external filter '%s' responded with " "available blobs which have not been delayed " "earlier."), filter->string); continue; } >> + } >> + string_list_remove_empty_items(&dco->filters, 0); >> + } >> + string_list_clear(&dco->filters, 0); >> + >> + /* At this point we should not have any delayed paths anymore. */ >> + errs |= dco->paths.nr; >> + for_each_string_list_item(path, &dco->paths) { >> + warning(_("%s was not filtered properly."), path->string); >> + } >> + string_list_clear(&dco->paths, 0); > > And "list-available" that says "path X is ready" when we never asked > for X gets away free without detected as a bug, either. With the addition above it would! Thanks for the review :-) - Lars