Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

On 11.01.2017 at 11:20, Lars Schneider wrote:
> On 10 Jan 2017, at 23:11, Jakub Narębski <jnareb@xxxxxxxxx> wrote:
>> On 09.01.2017 at 00:42, Junio C Hamano wrote:
>>> larsxschneider@xxxxxxxxx writes:
>>>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>>>>
>>>> Some `clean` / `smudge` filters might require a significant amount of
>>>> time to process a single blob. During this process the Git checkout
>>>> operation is blocked and Git needs to wait until the filter is done to
>>>> continue with the checkout.
>>
>> Lars, what is the expected use case for this feature; that is, when
>> do you think this problem may happen?  Is it something that happened IRL?
> 
> Yes, this problem happens every day with filters that perform network
> requests (e.g. GitLFS). 

Do I understand it correctly that the expected performance improvement
from this feature is possible only if there is some amount of
parallelism and concurrency in the filter?  That is, the filter can be
sending one blob to Git while processing another, or it can be fetching
blobs in parallel.

This means that the filter process works as a kind of (de)multiplexer
for fetching and/or processing blob contents, I think.
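To illustrate the idea (a sketch only, not GitLFS code; `fake_fetch` and
the class name are hypothetical, and Python is used for brevity although
Git and its filters are not written in it), such a demultiplexing filter
could answer "delayed" immediately and let fetches overlap:

```python
# Sketch: a filter that replies "delayed" right away and fetches blob
# contents concurrently, so several slow (e.g. network) requests overlap
# instead of blocking the checkout one blob at a time.
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(path):
    # Stand-in for a slow network request (e.g. an LFS download).
    return b"contents of " + path.encode()

class DelayingFilter:
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.pending = {}  # path -> Future with the blob contents

    def smudge(self, path):
        # Kick off the fetch in the background and tell Git "delayed".
        self.pending[path] = self.pool.submit(fake_fetch, path)
        return "status=delayed"

    def ready_paths(self):
        # Paths whose background fetch has completed.
        return [p for p, f in self.pending.items() if f.done()]

f = DelayingFilter()
assert f.smudge("a.bin") == "status=delayed"
```

Without this kind of concurrency inside the filter, delaying a blob
buys nothing: the total time is the same, only reordered.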

> [...] In GitLFS we even implemented Git wrapper
> commands to address the problem: https://github.com/git-lfs/git-lfs/pull/988
> The ultimate goal of this patch is to be able to get rid of the wrapper 
> commands.

I'm sorry, I don't see how the wrapper helps here.

> 
>>>> Teach the filter process protocol (introduced in edcc858) to accept the
>>>> status "delayed" as response to a filter request. Upon this response Git
>>>> continues with the checkout operation and asks the filter to process the
>>>> blob again after all other blobs have been processed.
>>>
>>> Hmm, I would have expected that the basic flow would become
>>>
>>> 	for each paths to be processed:
>>> 		convert-to-worktree to buf
>>> 		if not delayed:
>>> 			do the caller's thing to use buf
>>> 		else:
>>> 			remember path
>>>
>>> 	for each delayed paths:
>>> 		ensure filter process finished processing for path
>>> 		fetch the thing to buf from the process
>>> 		do the caller's thing to use buf
>>
>> I would expect here to have a kind of event loop, namely
>>
>>        while there are delayed paths:
>>                get path that is ready from filter
>>                fetch the thing to buf (supporting "delayed")
>>                if path done
>>                        do the caller's thing to use buf 
>>                        (e.g. finish checkout path, eof convert, etc.)
>>
>> We can either trust the filter process to tell us when it has finished
>> sending delayed paths, or keep a list of delayed paths on the Git side.
> 
> I could implement "get path that is ready from filter" but wouldn't
> that complicate the filter protocol? I think we can use the protocol pretty 
> much as-is with the strategy outlined here:
> http://public-inbox.org/git/F533857D-9B51-44C1-8889-AA0542AD8250@xxxxxxxxx/

You are talking about the "busy-loop" solution, aren't you?  In the
same notation, it would look like this:

          while there are delayed paths:
                  for each delayed path:
                          request path from filter [1]
                          fetch the thing (supporting "delayed") [2]
                          if path done
                                  do the caller's thing to use buf
                                  remove it from delayed paths list


Footnotes:
----------
1) We wouldn't send the Git-side contents of the blob again, would we?
   So we need some protocol extension / new understanding anyway,
   for example that we don't send contents when we request a path again.
2) If the path is not ready at all, the filter would send status=delayed
   with empty contents.  This means that we would immediately move on to
   the next path, if there is one.

There are some cases where a busy loop is preferable, but I don't think
this is one of them.
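In Python-like form (a sketch; `request_from_filter` and the toy filter
below are hypothetical stand-ins, not real Git code), the busy loop
amounts to repeated polling of every delayed path:

```python
# Sketch of the busy-loop strategy: Git re-requests every delayed path
# over and over until each one comes back complete.
import time

def busy_loop(delayed, request_from_filter):
    results = {}
    while delayed:
        for path in list(delayed):
            status, buf = request_from_filter(path)  # see footnote 1: no contents re-sent
            if status == "delayed":
                continue          # footnote 2: empty answer, try the next path
            results[path] = buf   # path is done; the caller can use buf
            delayed.remove(path)
        time.sleep(0.01)          # without some wait, this spins the CPU
    return results

# Toy filter: each path becomes ready on its second poll.
polls = {}
def toy_filter(path):
    polls[path] = polls.get(path, 0) + 1
    if polls[path] < 2:
        return ("delayed", b"")
    return ("success", b"data:" + path.encode())

out = busy_loop({"a", "b"}, toy_filter)
assert out == {"a": b"data:a", "b": b"data:b"}
```

The weakness is visible in the `sleep`: Git either burns CPU polling or
adds latency waiting, whereas an event loop simply blocks until the
filter has something ready.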


The event-loop solution would require additional protocol extensions,
but I don't think they would complicate the protocol too much:

A. Git would need to signal the filter process that it has sent all
paths, and that the filter should now send delayed paths as they become
ready.  This could be done, for example, with "command=continue".

    packet:          git> command=continue
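As a reminder of the wire format these traces use, each packet is a
pkt-line: a 4-hex-digit length (counting the 4 header bytes themselves)
followed by the payload, with "0000" as the flush packet.  A minimal
encoder sketch:

```python
# Minimal pkt-line encoder, as used by the filter process protocol:
# the length prefix covers payload plus the 4 header bytes; "0000"
# (a flush packet) terminates a list of packets.
def pkt_line(payload: bytes) -> bytes:
    return b"%04x" % (len(payload) + 4) + payload

def flush_pkt() -> bytes:
    return b"0000"

assert pkt_line(b"command=continue\n") == b"0015command=continue\n"
```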


B. In the event-loop phase, the filter driver, while (de)multiplexing
the fetching or processing of data, would now need to initiate the
transfer instead of waiting for Git to ask.  It could look like this:

    packet:          git< status=resumed [3]
    packet:          git< pathname=file/to/be/resumed [4]
    packet:          git< 0000
    packet:          git< SMUDGED_CONTENT_CONTINUED
    packet:          git< 0000
    packet:          git< 0000  # empty list, means "status=success" [5]

Footnotes:
----------
3) It could be "status=success", "status=delayed", "command=resumed", etc.
4) In the future we could add the byte offset at which we resume, the
   size of the file, etc.
5) Of course, sending the remainder of the contents may be further delayed.
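Put together, Git's side of the event loop might look roughly like this
(a sketch with hypothetical helpers; packet parsing is omitted, and the
filter decides in which order paths come back):

```python
# Sketch of Git's event-loop phase after sending command=continue:
# block on the filter's pipe and handle whichever delayed path the
# filter resumes next.  read_resumed() is a hypothetical helper that
# yields (path, chunk, done) parsed from the packets above.
def event_loop(delayed, read_resumed):
    finished = {}
    while delayed:
        path, buf, done = read_resumed()           # filter picks the order
        finished[path] = finished.get(path, b"") + buf
        if done:                                   # empty list == "status=success"
            delayed.discard(path)                  # caller can now use the buf
        # else: per footnote 5, the remainder is further delayed

    return finished

# Toy filter stream: "b" arrives first, "a" arrives in two chunks.
stream = iter([("b", b"BB", True), ("a", b"A1", False), ("a", b"A2", True)])
out = event_loop({"a", "b"}, lambda: next(stream))
assert out == {"b": b"BB", "a": b"A1A2"}
```

Note that Git never polls here: it simply blocks until the filter has
something to say, which is the main advantage over the busy loop.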

-- 
Jakub Narębski


