> On 24 Jun 2017, at 16:19, Jeff King <peff@xxxxxxxx> wrote:
>
> On Thu, Jun 01, 2017 at 10:22:03AM +0200, Lars Schneider wrote:
>
>> Some `clean` / `smudge` filters might require a significant amount of
>> time to process a single blob (e.g. the Git LFS smudge filter might
>> perform network requests). During this process the Git checkout
>> operation is blocked and Git needs to wait until the filter is done to
>> continue with the checkout.
>>
>> Teach the filter process protocol (introduced in edcc858) to accept the
>> status "delayed" as response to a filter request. Upon this response Git
>> continues with the checkout operation. After the checkout operation Git
>> calls "finish_delayed_checkout" which queries the filter for remaining
>> blobs. If the filter is still working on the completion, then the filter
>> is expected to block. If the filter has completed all remaining blobs
>> then an empty response is expected.
>>
>> Git has a multiple code paths that checkout a blob. Support delayed
>> checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.
>
> It might be worth giving a reason in this last paragraph. I think the
> reason is "because it's more complicated for the caller, as they have to
> be OK with out-of-order processing and remembering to go back and handle
> the delayed cases".

Correct! However, my real reason was that these code paths process all
files of the tree, which is what makes the "out-of-order" processing
effective. How about this:

Git has multiple code paths that check out a blob. Support delayed
checkouts only in `clone` (in unpack-trees.c) and `checkout` operations
for now. The optimization is most effective in these code paths as all
files of the tree are processed.

> Speaking of which, _are_ we OK with out-of-order processing in things
> like checkout? Certainly we care about deleting items before checking
> out new ones (so getting rid of "foo/bar" to make room for "foo" and
> vice versa). I guess it's OK as long as the delayed items are always
> filters that check out new items.

Junio noticed that too but thinks we should be OK:

"[...] We checkout removals first to make room so that creation of a
path X can succeed if an existing path X/Y that used to want to see X
as a directory can succeed [...]"

http://public-inbox.org/git/xmqqvavotych.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx/

>> ...
>
>> +After Git received the pathnames, it will request the corresponding
>> +blobs again. These requests contain a pathname and an empty content
>> +section. The filter is expected to respond with the smudged content
>> +in the usual way as explained above.
>> +------------------------
>> +packet: git> command=smudge
>> +packet: git> pathname=path/testfile.dat
>> +packet: git> 0000
>> +packet: git> 0000 # empty content!
>> +packet: git< status=success
>> +packet: git< 0000
>> +packet: git< SMUDGED_CONTENT
>> +packet: git< 0000
>> +packet: git< 0000 # empty list, keep "status=success" unchanged!
>> +------------------------
>
> Out of curiosity, what should happen if we re-ask for a pathname that
> wasn't mentioned in list_available_blobs? I guess it would depend on the
> filter implementation, but probably most would just block (since we
> wouldn't have asked for can-delay). The other option is returning an
> error, I suppose, but blocking and filling the request sounds
> friendlier.

I agree that this is up to the filter. I would expect the blocking
strategy: the filter cannot delay the blob anymore because we have not
sent the "can-delay" flag with the request.
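Just to illustrate the difference: the initial request for such a blob
carries the "can-delay" flag and the blob content, and the filter may
answer "delayed". Roughly like this (abridged; the exact spelling of the
keys may differ slightly in the final version of the series):

    packet: git> command=smudge
    packet: git> pathname=path/testfile.dat
    packet: git> can-delay=1
    packet: git> 0000
    packet: git> CONTENT
    packet: git> 0000
    packet: git< status=delayed
    packet: git< 0000

The re-request shown above omits the "can-delay" key, so the filter has
no way to delay the blob a second time; blocking until it can deliver
the content is the only sensible reaction.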
> ...

>> diff --git a/cache.h b/cache.h
>> index ae4c45d379..523ead1d95 100644
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -1543,16 +1543,20 @@ extern int ident_cmp(const struct ident_split *, const struct ident_split *);
>> struct checkout {
>> 	struct index_state *istate;
>> 	const char *base_dir;
>> +	struct delayed_checkout *delayed_checkout;
>> 	int base_dir_len;
>
> Should base_dir_len and base_dir be kept together?
>
> I suspect you ordered it this way because...
>
>> 	unsigned force:1,
>> 		 quiet:1,
>> 		 not_new:1,
>> 		 refresh_cache:1;
>> };
>> -#define CHECKOUT_INIT { NULL, "" }
>> +#define CHECKOUT_INIT { NULL, "", NULL }
>> +
>
> ...you wanted to add the NULL to the initializer here. But it's not
> necessary. Once one element of a struct is initialized, all the
> remaining members are initialized according to the usual static rules
> (so NULL for pointers). We're already using that to zero out
> base_dir_len and the flags (the "" is necessary because we want a real
> empty string, not NULL).
>
> Since you want NULL for your new member, you can just leave the
> initializer as-is and it will do the right thing.

That was indeed my reasoning. I am no C expert and I wasn't sure about
the initialization. The "Once one element of a struct is initialized"
thing is news to me :-) Thanks! (A tiny demo of that rule is further
down in this mail.) I'll keep the base_dir* members together then!

>> @@ -533,7 +534,8 @@ static int start_multi_file_filter_fn(struct subprocess_entry *subprocess)
>> 	if (err)
>> 		goto done;
>>
>> -	err = packet_writel(process->in, "capability=clean", "capability=smudge", NULL);
>> +	err = packet_writel(process->in,
>> +		"capability=clean", "capability=smudge", "capability=delay", NULL);
>
> Why do we need to tell the filter we know about delay? Shouldn't it just
> need to tell us that it knows about delay, and then we choose whether to
> ask for can-delay for particular entries?

Because the protocol, as I defined it, requires the filter to answer
with a strict subset of this list [1]. I thought that this would make
the protocol more future proof/backward compatible, because the filter
is not allowed to answer with something that Git does not understand.

[1] https://github.com/git/git/blob/5402b1352f5181247405fbff1887008a0cb3b04a/Documentation/gitattributes.txt#L408-L411

>
>> @@ -647,6 +653,14 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len
>> 	if (err)
>> 		goto done;
>>
>> +	if (CAP_DELAY & entry->supported_capabilities &&
>> +		dco && dco->state == CE_CAN_DELAY) {
>
> In a complicated conditional with bit operations, we usually put the bit
> operation in its own parentheses so it's more obvious that it wasn't
> supposed to be "&&". Like:
>
>   if ((CAP_DELAY & entry->supported_capabilities) &&
>       dco && dco->state == CE_CAN_DELAY)

Agreed!

> The operator precedence is such that it works without them, so this is
> just a style question (I'd also usually put the flags field before the
> flag itself, but that's really getting into aesthetics).

You mean (entry & CAP_DELAY) instead of (CAP_DELAY & entry)?

> The dco check here is basically asking if can-delay is enabled for the
> whole checkout. Would we ever need to mark a specific entry as
> not-delayable? Probably not from the start, but I wondered if something
> might get marked as a failure and need to be retried without delay or
> similar. It's probably not worth going down that road until some caller
> needs it.

Agreed.
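Coming back to the initializer rule above: just to convince myself, here
is a tiny standalone demo (nothing to do with the patch; the struct and
member names are made up for illustration):

    #include <stdio.h>

    /* stand-in for "struct checkout", just for this demo */
    struct demo {
        struct index_state *istate;
        const char *base_dir;
        void *delayed_checkout;
        int base_dir_len;
        unsigned force:1;
    };

    int main(void)
    {
        struct demo d = { NULL, "" };  /* same shape as CHECKOUT_INIT */

        /* the members that are not mentioned come out as NULL/0 */
        printf("%p %d %u\n", d.delayed_checkout, d.base_dir_len,
               (unsigned)d.force);
        return 0;
    }

So CHECKOUT_INIT really can stay { NULL, "" } and the new pointer member
will still start out as NULL.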
>
>> @@ -662,14 +676,78 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len
>> 	if (err)
>> 		goto done;
>>
>> -	err = strcmp(filter_status.buf, "success");
>> +	if (can_delay && !strcmp(filter_status.buf, "delayed")) {
>> +		dco->state = CE_DELAYED;
>> +		string_list_insert(&dco->filters, cmd);
>> +		string_list_insert(&dco->paths, path);
>
> These string_list_insert()s have bad worst-case performance, because
> they keep the list sorted. So the worst case is reverse-sorted input,
> where each entry goes at the front of the existing array and we have to
> memmove() the whole array on each insert, and we do O(n^2) byte copies.
>
> The best case is when the input is already sorted. And I think in
> general that's how we'll iterate over paths to checkout. The cmds are
> more likely to be random, but we're also not likely to have more than a
> handful of them. So this is probably an OK data structure to use, at
> least for now.

Agreed! Up until now I have tested the delayed technique only with my
small test case. It will become really interesting when Git LFS
implements this feature. Then I can test it with our repos that have
more than 20k filtered files, and at that point I will be able to
identify the spots that require more optimization.

>> +	for (;;) {
>> +		const char* pre = "pathname=";
>> +		const int pre_len = strlen(pre);
>> +		line = packet_read_line(process->out, NULL);
>> +		if (!line)
>> +			break;
>> +		err = strlen(line) <= pre_len || strncmp(line, pre, pre_len);
>> +		if (err)
>> +			goto done;
>> +		string_list_insert(delayed_paths, xstrdup(line+pre_len));
>> +	}
>
> I think the prefix matching can be done more simply (and slightly more
> efficiently) with:
>
>   const char *path;
>   ...
>   err = !skip_prefix(line, "pathname=", &path);
>   if (err)
>           goto done;
>   string_list_insert(delayed_paths, xstrdup(path));
>
> I notice we loop here, though. Would we want to be resilient to seeing
> other keys besides pathname? Something like:
>
>   while ((line = packet_read_line(process->out, NULL))) {
>           const char *path;
>           if (skip_prefix(line, "pathname=", &path))
>                   string_list_insert(delayed_paths, xstrdup(path));
>           else
>                   ; /* ignore unknown keys */
>   }
>
> I can imagine some future where the filter passes back additional
> information about the delayed item. It would be nice if older versions
> of Git would just quietly ignore that if they don't know how to handle
> it.

I agree with all the above. skip_prefix is neat! Although it is used all
over the source code, I somehow never noticed it.

>> +int finish_delayed_checkout(struct checkout *state)
>> +{
>> +	int errs = 0;
>> +	struct string_list_item *filter, *path;
>> +	struct delayed_checkout *dco = state->delayed_checkout;
>> +
>> +	if (!state->delayed_checkout) {
>> +		return errs;
>> +	}
>
> Style: we'd usually omit the braces here.

Sure! I'll remove them. (This is the only rule of the Git style guide
that I don't agree with :-)

>
>> +		if (available_paths.nr <= 0) {
>> +			/* Filter responded with no entries. That means
>> +			   the filter is done and we can remove the
>> +			   filter from the list (see
>> +			   "string_list_remove_empty_items" call below).
>> +			*/
>
> This is true because we know the filter would block until it had at
> least one item to return.
>
> It does seem to put a lot of faith in the filter. What happens if we hit
> this conditional and we know that we still have entries in dco->paths to
> be handled?

We check this later, as you noticed below.

>> +		/* In dco->paths we store a list of all delayed paths.
>> +		   The filter just send us a list of available paths.
>> +		   Remove them from the list.
>> +		*/
>> +		filter_string_list(&dco->paths, 0,
>> +			&remove_available_paths, &available_paths);
>
> This is O(m log n), which is fine. Since both lists are sorted, we could
> technically do it in an O(m+n) merge, but I don't think we have a
> function to do that easily.
>
> Sorry to harp on run-times, but it's one place I think it would be easy
> to subtly screw this up, so I'm trying to pay close attention.

No worries at all! Thanks for thinking it through on that level. As I
wrote above, I plan to really profile it under "real life conditions" as
soon as we have full Git LFS support. (I put a rough sketch of such a
single-pass merge a bit further down in this mail.)

>> ...
>
>> +	/* At this point we should not have any delayed paths anymore. */
>> +	errs |= dco->paths.nr;
>
> And I guess this answers my question above about a filter with no
> entries. When we bail from the loop, we'll hit this point and complain.
>
> We probably should call string_list_clear(&dco->paths) here to avoid
> leaking the paths on error (possibly we should also print them? Or maybe
> they'd be warned about elsewhere).

Agreed, I will clear the list. I'll also print the items because, as far
as I can see, they are not printed elsewhere. Git just says:

    fatal: unable to checkout working tree
    warning: Clone succeeded, but checkout failed.
    You can inspect what was checked out with 'git status'
    and retry the checkout with 'git checkout -f HEAD'

How about this?

    errs |= dco->paths.nr;
    for_each_string_list_item(path, &dco->paths) {
        warning("%s was not processed properly.", path->string);
    }
    string_list_clear(&dco->paths, 0);

The output would be:

    warning: test-delay10.a was not processed properly.
    warning: test-delay10.b was not processed properly.
    warning: test-delay11.a was not processed properly.
    warning: test-delay20.a was not processed properly.
    fatal: unable to checkout working tree
    warning: Clone succeeded, but checkout failed.
    You can inspect what was checked out with 'git status'
    and retry the checkout with 'git checkout -f HEAD'

I contemplated the warning text for a while. "$FILE was not filtered
properly." would be technically more correct, but maybe it would confuse
the user? Plus, at this point I don't really know which filter should
have given me the files. That's why I can't print it in the warning
message.
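Regarding the O(m+n) merge over the two sorted lists mentioned above:
just to check my own understanding, something like the sketch below
should do it in a single pass. This is not part of the patch, the
function name is made up, and freeing of the dropped strings is ignored
for brevity:

    #include "cache.h"
    #include "string-list.h"

    /*
     * Drop every entry of "paths" that also appears in "available".
     * Both lists must be sorted; we walk them in lockstep instead of
     * doing one lookup per delayed path.
     */
    static void remove_sorted_subset(struct string_list *paths,
                                     const struct string_list *available)
    {
        unsigned int src, dst = 0, a = 0;

        for (src = 0; src < paths->nr; src++) {
            const char *p = paths->items[src].string;

            /* advance "available" until it is >= the current path */
            while (a < available->nr &&
                   strcmp(available->items[a].string, p) < 0)
                a++;

            if (a < available->nr &&
                !strcmp(available->items[a].string, p))
                continue; /* path was reported as available: drop it */

            /* keep this still-delayed path */
            paths->items[dst++] = paths->items[src];
        }
        paths->nr = dst;
    }

Whether that is worth it over filter_string_list() probably depends on
the numbers I get out of the Git LFS profiling mentioned above.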
>
>> @@ -179,11 +254,36 @@ static int write_entry(struct cache_entry *ce,
>> 	/*
>> 	 * Convert from git internal format to working tree format
>> 	 */
>> -	if (ce_mode_s_ifmt == S_IFREG &&
>> -	    convert_to_working_tree(ce->name, new, size, &buf)) {
>> -		free(new);
>> -		new = strbuf_detach(&buf, &newsize);
>> -		size = newsize;
>> +	if (ce_mode_s_ifmt == S_IFREG) {
>> +		struct delayed_checkout *dco = state->delayed_checkout;
>> +		if (dco && dco->state != CE_NO_DELAY) {
>> +			/* Do not send the blob in case of a retry. */
>> +			if (dco->state == CE_RETRY) {
>> +				new = NULL;
>> +				size = 0;
>> +			}
>> +			ret = async_convert_to_working_tree(
>> +				ce->name, new, size, &buf, dco);
>> +			if (ret && dco->state == CE_DELAYED) {
>> +				free(new);
>> +				/* Reset the state of the next blob */
>> +				dco->state = CE_CAN_DELAY;
>> +				goto finish;
>> +			}
>
> Hmm. This "reset the state" bit at the end surprised me. I guess it's
> not wrong, but it goes against the mental model I had formed above. ;)
>
> We really are using dco->state as a per-entry state flag. It just
> happens to be in a persistent shared struct. I don't think it's wrong,
> it was mostly just surprising. I don't know if it's worth trying to
> simplify, but I think you could do it by:
>
>   1. Passing back the "was delayed" state from async_convert... in the
>      return value or via a separate out-parameter.

In the beginning I had it implemented that way, but that meant I had to
pass two variables through the entire convert stack:

    async_convert_to_working_tree
    -> convert_to_working_tree_internal
    --> apply_filter
    ---> apply_multi_file_filter

>   2. Setting dco->state to CE_RETRY at the top of finish_delayed... so
>      that it's clear that it's about what phase of the conversation
>      we're in.

I could do that. However, I thought it would be safer to set the state
*before* every checkout operation in case convert.c messes with this
field (it should not do that in this phase).

> But I'm OK with it as-is, too.

I'll try 2.

>> diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
>> index 5e43faeec1..f0dc0aff4a 100644
>> --- a/t/t0021/rot13-filter.pl
>> +++ b/t/t0021/rot13-filter.pl
>
> I'll admit to skimming the perl changes here, but I didn't see anything
> obviously wrong.

Thanks a lot for the review,
Lars