Re: [PATCH v3 4/4] convert: add "status=delayed" to filter process protocol

Taylor Blau <ttaylorr@xxxxxxxxxx> · Wed, 12 Apr 2017 13:46:10 -0400

I think this is a great approach and one that I'd be happy to implement in LFS.
The additional capability isn't too complex, so I think other similar filters to
LFS shouldn't have a hard time implementing it either.

I left a few comments, mostly expressing approval to the documentation changes.
I'll leave the C review to someone more expert than me.

+1 from me on the protocol changes.

> +Delay
> +^^^^^
> +
> +If the filter supports the "delay" capability, then Git can send the
> +flag "delay-able" after the filter command and pathname.

Nit: I think either way is fine, but `can_delay` will save us 1 byte per each
new checkout entry.

> +"delay-id", a number that identifies the blob, and a flush packet. The
> +filter acknowledges this number with a "success" status and a flush
> +packet.

I mentioned this in another thread, but I'd prefer, if possible, that we use the
pathname as a unique identifier for referring back to a particular checkout
entry. I think adding an additional identifier adds unnecessary complication to
the protocol and introduces a forced mapping on the filter side from id to
path.

Both Git and the filter are going to have to keep these paths in memory
somewhere, be that in-process, or on disk. That being said, I can see potential
troubles with a large number of long paths that exceed the memory available to
Git or the filter when stored in a hashmap/set.

On Git's side, I think trading that for some CPU time might make sense. If Git
were to SHA1 each path and store that in a hashmap, it would consume more CPU
time, but less memory to store each path. Git and the filter could then exchange
path names, and Git would simply SHA1 the pathname each time it needed to refer
back to memory associated with that entry in a hashmap.

> +by a "success" status that is also terminated with a flush packet. If
> +no blobs for the delayed paths are available, yet, then the filter is
> +expected to block the response until at least one blob becomes
> +available. The filter can tell Git that it has no more delayed blobs
> +by sending an empty list.

I think the blocking nature of the `list_available_blobs` command is a great
synchronization mechanism for telling the filter when it can and can't send new
blob information to Git. I was tenatively thinking of suggesting to remove this
command and instead allow the filter to send readied blobs in sequence after all
unique checkout entries had been sent from Git to the filter. But I think this
allows approach allows us more flexibility, and isn't that much extra
complication or bytes across the pipe.

--
Thanks,
Taylor Blau