> > (And at this point, may I suggest to change "delay-id" into
> > "request-id=1" ?
>
> If there is no objection by another reviewer, then I am happy to change it.

I think "delay-id" may be more illustrative of what's occurring in this
request. That being said, my preference would be that we remove the
"delay-id"/"request-id" entirely from the protocol, and make Git responsible
for handling the path lookup with a hashmap.

Is the concern that a hashmap covering all entries in a large checkout would
be too large to keep in memory? If so, keeping an opaque ID as a part of the
protocol is something I would not object to.

> >> +packet: git> 0000
> >> +packet: git> 0000   # empty content!
> >> +packet: git< status=success
> >> +packet: git< 0000
> >> +packet: git< SMUDGED_CONTENT
> >> +packet: git< 0000
> >> +packet: git< 0000
> >
> > OK, good.
> >
> > The question is: what happens next?
> >
> > Two things, kind of in parallel, but we need to prioritize and serialize:
> > - Send the next blob
> > - Fetch ready blobs
> > - And of course: ask for more ready blobs.
> > (it looks as if Peff and Jakub had useful comments already,
> > so I can stop here?)
>
> I would like to keep the mechanism as follows:
>
> 1. send all blobs to the filter
> 2. fetch blobs until we are done
>
> @Taylor: Do you think that would be OK for LFS?

I think that this would be fine for LFS, and for filters of this kind in
general. For LFS in particular, my initial inclination would be to have the
protocol support writing blob data back to Git at any time during the
checkout process, not just after all blobs have been sent to the filter.

That being said, I don't think this holds up in practice. The blobs are too
big to fit in memory anyway, and will just end up getting written to LFS's
object cache in .git/lfs/objects. Since they're already in there, all we
would have to do is keep the list of `readyIds map[int]*os.File` in memory
(or even map int -> LFS OID, and open the file later), and then `io.Copy()`
from the open file back to Git.
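To make that concrete, here is a rough sketch (in Go) of the bookkeeping
described above. The package and function names (`lfsfilter`, `markReady`,
`writeBack`) are hypothetical, invented for illustration; this is not actual
git-lfs code:

    // Sketch only: illustrates the readyIds bookkeeping described above.
    package lfsfilter

    import (
    	"fmt"
    	"io"
    	"os"
    )

    // readyIds maps a delayed request ID to the spooled file in
    // .git/lfs/objects that holds its smudged contents.
    var readyIds = map[int]*os.File{}

    // markReady records that the blob for the given request ID has
    // finished downloading to the object cache at path.
    func markReady(id int, path string) error {
    	f, err := os.Open(path)
    	if err != nil {
    		return err
    	}
    	readyIds[id] = f
    	return nil
    }

    // writeBack streams the spooled contents for id back to Git (e.g.
    // over the filter-protocol pipe) and releases the file handle.
    func writeBack(id int, git io.Writer) error {
    	f, ok := readyIds[id]
    	if !ok {
    		return fmt.Errorf("no ready blob for id %d", id)
    	}
    	defer f.Close()
    	delete(readyIds, id)
    	_, err := io.Copy(git, f)
    	return err
    }

With something like that, the filter never has to hold blob contents in
memory; it only holds open file handles (or, with the int -> OID variant,
nothing but small IDs).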
This makes me think of adding another capability to the protocol, which
would just be exchanging paths on disk in `/tmp` or any other directory, so
that we wouldn't have to stream content over the pipe. Instead of responding
with:

    packet: git< status=success
    packet: git< 0000
    packet: git< SMUDGED_CONTENT
    packet: git< 0000
    packet: git< 0000

we could respond with:

    packet: git< status=success
    packet: git< 0000
    packet: git< /path/to/contents.dat   # <-
    packet: git< 0000
    packet: git< 0000

Git would then be responsible for opening that file on disk (the filter
would guarantee that to be possible), and then copying its contents into the
working tree. I think that's a topic for later discussion, though :-).

> > In general, Git should not have an unlimited number of blobs
> > outstanding, as memory constraints may apply.
> > There may be a config variable for the number of outstanding blobs
> > (similar to the window size in other protocols) and a variable for the
> > number of "sent bytes in outstanding blobs" (similar to the window
> > size (again!) in e.g. TCP).
> >
> > The number of outstanding blobs may be less important; it is more
> > important to monitor the number of bytes we keep in memory in some way.
> >
> > Something like "we set a limit of 500K of outstanding data"; once we
> > are above the limit, don't send any new blobs.
>
> I don't expect the filter to keep everything in memory. If there is no
> memory left, then I expect the filter to spool to disk. This keeps the
> protocol simple. If this turns out not to be sufficient, then we could
> improve that later, too.

Agree.

--
Thanks,
Taylor Blau