> On 12. Apr 2017, at 06:37, Torsten Bögershausen <tboegi@xxxxxx> wrote:
>
> On 2017-04-11 21:50, Lars Schneider wrote:
>
> []
>>> packet: git> command=smudge
>>> packet: git> pathname=path/testfile.dat
>>> packet: git> delay-id=1
>>> packet: git> 0000
>>> packet: git> CONTENT
>>> packet: git> 0000
>>> packet: git< status=delayed # this means: Git, please feed more
>>> packet: git> 0000
>> Actually, this is how I implemented it first.
>>
>> However, I didn't like that because we associate a
>> pathname with a delay-id. If the filter does not
>> delay the item then we associate a different
>> pathname with the same delay-id in the next request.
>> Therefore I think it is better to present the delay-id
>> *only* to the filter if the item is actually delayed.
>>
>> I would be surprised if the extra round trip does impact
>> the performance in any meaningful way.
>>
>
> 2 spontanous remarks:
>
> - Git can simply use a counter which is incremented by each blob
>   that is send to the filter.
>   Regardless what the filter answers (delayed or not), simply increment a
>   counter. (or is this too simple and I miss something?)

I am not sure I understand what you mean. Do you want to say that the
filter just starts to delay items by itself, keyed on an increasing
counter (first item = 0, second item = 1, ...)? That could work, but I
would prefer a more explicitly defined solution. The implicit,
self-counting approach could easily get confused or mixed up, I think.

> - I was thinking that the filter code is written as either "never delay" or
>   "always delay".
>   "Never delay" is the existing code.
>   What is your idea, when should a filter respond with delayed ?
>   My thinking was "always", silently assuming the more than one core can be
>   used, so that blobs can be filtered in parallel.

In the case of Git LFS, I expect the filter to answer with "delay" if a
network call is required to fulfill the (smudge) filter request.

>> We could do this but I think this would only complicate
>> the protocol. I expect the filter to spool results to the
>> disk or something.

> Spooling things to disk was not part of my picture, to be honest.
> This means additional execution time when a SSD is used, the chips
> are more worn out...
> There may be situations, where this is OK for some users (but not for others)
> How can we prevent Git from (over-) flooding the filter?

I don't think this is Git's responsibility. If the filter can't handle
the data, then I expect the filter to *not* ask Git for a delay.

> The protocol response from the filter would be just "delayed", and the filter
> would block Git, right ?
> But, in any case, it would still be nice if Git would collect converted blobs
> from the filter, to free resource here.
> This is more like the "streaming model", but on a higher level:
> Send 4 blobs to the filter, collect the ready one, send the 5th blob to
> the filter, collect the ready one, send the 6th blob to the filter, collect
> ready one....

That sounds nice, but I don't think it is necessary in this (first)
iteration. I think it is acceptable for the filter to spool data to disk.

> (Back to the roots)
> Which criteria do you have in mind: When should a filter process the blob
> and return it immediately, and when would it respond "delayed" ?

See above: it's up to the filter. In the case of Git LFS: delay if a
network call is required.

Thanks,
Lars
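
P.S.: To illustrate the extra round trip mentioned above, a rough sketch
of the exchange I have in mind (key names and exact packets are not
final): Git sends the request without a delay-id, and only hands one out
once the filter has actually answered "delayed":

packet: git> command=smudge
packet: git> pathname=path/testfile.dat
packet: git> 0000
packet: git> CONTENT
packet: git> 0000
packet: git< status=delayed
packet: git< 0000
packet: git> delay-id=1
packet: git> 0000

If the filter does not delay the item, the delay-id packets are simply
never sent for it.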