Re: [PATCH v2] convert: add "status=delayed" to filter process protocol

On 27.02.2017 at 11:32, Lars Schneider wrote:
> 
>> On 27 Feb 2017, at 10:58, Jeff King <peff@xxxxxxxx> wrote:
>>
>> On Sun, Feb 26, 2017 at 07:48:16PM +0100, Lars Schneider wrote:
>>
>>> +If the request cannot be fulfilled within a reasonable amount of time
>>> +then the filter can respond with a "delayed" status and a flush packet.
>>> +Git will perform the same request at a later point in time, again. The
>>> +filter can delay a response multiple times for a single request.
>>> +------------------------
>>> +packet:          git< status=delayed
>>> +packet:          git< 0000
>>> +------------------------

Does this happen instead of the filter process sending the contents of
the file, or does it happen after some part of the contents (possibly
empty) has been sent, that is, in place of the final flush packet that
keeps "status=success" unchanged, or in place of "status=error" if there
was a problem processing the file?

>>> +
>>
>> So Git just asks for the same content again? I see two issues with that:
>>
>>  1. Does git have to feed the blob content again? That can be expensive
>>     to access or to keep around in memory.
>>
>>  2. What happens when the item isn't ready on the second request? I can
>>     think of a few options:
>>
>>       a. The filter immediately says "nope, still delayed". But then
>>          Git ends up busy-looping with "is this one ready yet?"
>>
>>       b. The filter blocks until the item is ready. But then if other
>> 	  items _are_ ready, Git cannot work on processing them. We lose
>> 	  parallelism.
>>
>>       c. You could do a hybrid: block until _some_ item is ready, and
>>          then issue "delayed" responses for everything that isn't
>> 	  ready. Then if you assume that Git is looping over and over
>> 	  through the set of objects, it will either block or pick up
>> 	  _something_ on each loop.
>>
>> 	  But it makes a quadratic number of requests in the worst case.
>> 	  E.g., imagine you have N items and the last one is available
>> 	  first, then the second-to-last, and so on. You'll ask N times,
>> 	  then N-1, then N-2, and so on.

The current solution is the 'busy loop' one that I wrote about
earlier[1][2]; see below.

> 
> I completely agree - I need to change that. However, the goal of the v2
> iteration was to get the "convert" interface in an acceptable state.
> That's what I intended to say in the patch comment section:
> 
>     "Please ignore all changes behind async_convert_to_working_tree() and 
>      async_filter_finish() for now as I plan to change the implementation 
>      as soon as the interface is in an acceptable state."

I think it is more important to start with a good abstraction and a
proposal for the protocol, rather than getting bogged down in
implementation details that may change as the idea for the protocol
extension evolves.

>>
>> I think it would be much more efficient to do something like:
>>
>>  [Git issues a request and gives it an opaque index id]
>>  git> command=smudge
>>  git> pathname=foo
>>  git> index=0
>>  git> 0000
>>  git> CONTENT
>>  git> 0000
>>
>>  [The data isn't ready yet, so the filter tells us so...]
>>  git< status=delayed
>>  git< 0000

So is this only a replacement for "status=success" plus contents, or for
"status=abort", that is, sent up front before any part of the file?

Or, as one can assume from the placement of the paragraph introducing
"status=delayed", is it about replacing the final flush for success, or
"status=error", after some part (possibly empty) of the file has been
sent, that is:

    [filter driver says that it can process contents]
    git< status=success
    git< 0000
    git< PARTIAL_SMUDGED_CONTENT (maybe empty)
    [there was some delay, for example one of shards is slow]
    git< 0000
    git< status=delayed
    git< 0000

>>
>>  [Git may make other requests, that are either served or delayed]
>>  git> command=smudge
>>  git> pathname=foo
>>  git> index=1
>>  git> 0000
>>  git< status=success
>>  git< 0000
>>  git< CONTENT
>>  git< 0000
>>
>>  [Now Git has processed all of the items, and each one either has its
>>   final status, or has been marked as delayed. So we ask for a delayed
>>   item]
>>  git> command=wait
>>  git> 0000

In my proposal[2] I called this "command=continue"... but at this
point it is bikeshedding.  I think "command=wait" (or "await" ;-))
might be better.

>>
>>  [Some time may pass if nothing is ready. But eventually we get...]
>>  git< status=success

Or

    git< status=resumed

If it would not be an undue burden on the filter driver process, we
might require it to say where it will continue (in bytes), e.g.

    git< from=16426

That should, of course, go below the index/pathname line.

>>  git< index=0

Or the filter driver could use the pathname as an index, that is:

    git< pathname=path/testfile.dat

>>  git< 0000
>>  git< CONTENT
>>  git< 0000
>>
>> From Git's side, the loop is something like:
>>
>>  while (delayed_items > 0) {
>> 	/* issue a wait, and get back the status/index pair */
>> 	status = send_wait(&index);
>> 	delayed_items--;

This looks like my 'event loop' proposal[1][2], see below.

>>
>> 	/*
>> 	 * use "index" to find the right item in our list of files;
>> 	 * the format can be opaque to the filter, so we could index
>> 	 * it however we like. But probably numeric indices in an array
>> 	 * are the simplest.
>> 	 */
>> 	assert(index > 0 && index < nr_items);
>> 	item[index].status = status;
>> 	if (status == SUCCESS)
>> 		read_content(&item[index]);
>>  }
>>
>> and the filter side just attaches the "index" string to whatever its
>> internal queue structure is, and feeds it back verbatim when processing
>> that item finishes.

I wrote about this for the v1 version of this patch series:

[1]: https://public-inbox.org/git/ec8078ef-8ff2-d26f-ef73-5ef612737eee@xxxxxxxxx/
[2]: https://public-inbox.org/git/17fa31a5-8689-2766-952b-704f433a5b3a@xxxxxxxxx/

> 
> That could work! I had something like that in mind:
> 
> I teach Git a new command "list_completed" or similar. The filter
> blocks this call until at least one item is ready for Git. 
> Then the filter responds with a list of paths that identify the
> "ready items". Then Git asks for these ready items just with the
> path and not with any content. Could that work? Wouldn't the path
> be "unique" to identify a blob per filter run?

Why, in the "drain" phase, is it still Git that needs to ask the filter
for contents, one file after another?  Wouldn't it be easier and simpler
for the filter to finish sending contents on its own, and then signal
that it has finished continuing?

To summarize my earlier emails, the current proposal looks to me like
a "busy loop" solution, that is[2]:

          while there are delayed paths:
                  for each delayed path:
                          request path from filter [a]
                          fetch the thing (supporting "delayed") [b]
                          if path done
                                  do the caller's thing to use buf
                                  remove it from delayed paths list


Footnotes:
----------
a) We don't send the Git-side contents of the blob again, do we?
   So we need some protocol extension / new understanding anyway,
   for example that we don't send contents when we request a path again.
b) If the path is not ready at all, the filter would send status=delayed
   with empty contents.  This means that we would immediately go to the
   next path, if there is one.

There are some cases where a busy loop is preferable, but I don't think
this is one of them.


The "event loop" solution, which I think is what Peff proposed, would
be something like this (from the protocol point of view, rather than
the Git code point of view)[1]:

        while there are delayed paths:
                get path that is ready from filter
                fetch the thing to buf (supporting "delayed")
                if path done
                        do the caller's thing to use buf 
                        (e.g. finish checkout path, eof convert, etc.)

We can either trust the filter process to tell us when it has finished
sending delayed paths, or keep a list of delayed paths on the Git side.


Also, one thing that needs to be solved, assuming the proposed extension
allows the filter to send partial data, delay, and continue later, is
that Git needs to keep this partial response in a buffer; this is
because of the order in which gitattributes are applied:

  Interaction between checkin/checkout attributes
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  In the check-in codepath, the worktree file is first converted with
  `filter` driver (if specified and corresponding driver defined), then
  the result is processed with `ident` (if specified), and then finally
  with `text` (again, if specified and applicable).

  In the check-out codepath, the blob content is first converted with
  `text`, and then `ident` and fed to `filter`.

Or maybe I misunderstood something; I am a bit confused by check-in
vs check-out there...


Note that with support for "command=wait" / "command=continue", Git
could interrupt sending contents and drain unfinished files if it
needs to... then continue sending the rest of the files.  Well,
assuming the protocol extension is interpreted to allow this.

Best regards,
-- 
Jakub Narębski
