Re: [PATCH v2 0/5] Git filter protocol

Jakub Narębski <jnareb@xxxxxxxxx> · Fri, 29 Jul 2016 09:40:47 +0200

On 2016-07-28 at 15:29, Jeff King wrote:
> On Thu, Jul 28, 2016 at 09:16:18AM +0200, Lars Schneider wrote:
> 
>> But Peff ($gmane/299902), Duy, and Eric, seemed to prefer the pkt-line
>> solution (gmane is down - otherwise I would have given you the links).
> 
> FWIW, I think there are arguments for transmitting size + content
> (namely, that it is simpler); the downside is that it doesn't allow
> streaming.

And it requires the filter to know the size of its output up front.
As I wrote, that might be easy to compute from the size of the input
and data stored elsewhere, or it might require generating the whole
output first.

I don't know how parallel Git is, but if it is parallel enough and no
other limits apply (number of CPU cores, I/O bandwidth), then without
streaming the new filter protocol might be slower, unless startup
time dominates (MS Windows?):

Current parallel:

   |   startup   | processing 1 |
    |  startup    | processing 2  |
   | startup |  processing 3 |
     |  startup  |  processing 4  |

Protocol v2:

   |  startup  | processing 1 | processing 2 | processing 3 | processing 4 |
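
To put rough numbers on the picture above, here is a back-of-the-envelope
model in Python.  The function names and the figures in the note below are
made up for illustration, not measurements.  It assumes the filters really
do run concurrently in the "current" case, with no core or I/O limits.

```python
def parallel_wall_time(startup, processing):
    """One filter process per file, all started at once (idealised)."""
    return startup + max(processing)


def single_process_wall_time(startup, processing):
    """One long-running filter: pay startup once, then handle files in turn."""
    return startup + sum(processing)


def sequential_wall_time(startup, processing):
    """One filter process per file, run one after another."""
    return sum(startup + p for p in processing)
```

With a hypothetical 50 ms startup and four files at 200 ms each, this
gives 250 ms for the ideal parallel case, 850 ms for the single
long-running filter, and 1000 ms for one-at-a-time invocation.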

> 
> So I think there are two viable alternatives:
> 
>   1. Total size of data in ASCII decimal, newline, then that many bytes
>      of content.
> 
>   2. No size header, then a series of pkt-lines followed by a flush
>      packet.

    3. Optional size header[2][3], then a series of pkt-lines followed
       by a flush packet[4].

[2] Git should always provide the size, because it is easy to do and,
    I think, quite cheap (it is stored with the blob, stored in the
    index, or one stat() on the file away).  The filter can provide the
    size if it is easy to calculate, or an approximation of the size /
    size hint[5]; that helps to avoid reallocation.
[3] It is also a place where the filter can report error conditions that
    are known before it starts processing a file.
[4] On one hand, you need to catch cases where the real size is larger
    or smaller than the size sent up front; on the other hand, the end
    of the stream might be the place to send warnings and errors...
    unless we use the stderr of the process (but then there is a problem
    of deadlocking, I think).
[5] I suggest the size header be one of:

        <size as ascii decimal>
        "approx" SPC <size as ascii decimal>
        "unknown"
        "fail"

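A minimal sketch of alternative 3 in Python, to make the framing concrete.
The pkt-line layout (4 hex digits giving the length including those 4
bytes, "0000" as the flush packet) is the existing Git format.  Everything
else is my assumption: that the size header is a plain newline-terminated
line sent before the first pkt-line, and the names parse_size_header,
write_stream and read_stream, which are hypothetical.

```python
import io

FLUSH = b"0000"
MAX_PAYLOAD = 65516  # pkt-line limit: 65520 bytes including the 4-byte length


def parse_size_header(line):
    """Return (kind, size) for one header line from the grammar in [5]."""
    text = line.decode("ascii").rstrip("\n")
    if text in ("unknown", "fail"):
        return text, None
    if text.startswith("approx "):
        return "approx", int(text[len("approx "):])
    return "exact", int(text)


def write_stream(out, data, header):
    """Write the size header, then data as pkt-lines, then a flush packet."""
    out.write(header.encode("ascii") + b"\n")
    for i in range(0, len(data), MAX_PAYLOAD):
        chunk = data[i:i + MAX_PAYLOAD]
        out.write(b"%04x" % (len(chunk) + 4) + chunk)
    out.write(FLUSH)


def read_stream(inp):
    """Read header and pkt-lines up to the flush packet; verify an exact size."""
    kind, size = parse_size_header(inp.readline())
    if kind == "fail":
        raise RuntimeError("filter reported failure before sending content")
    # An exact or approximate size could be used here to preallocate.
    buf = bytearray()
    while True:
        length = int(inp.read(4).decode("ascii"), 16)
        if length == 0:  # flush packet: end of content
            break
        buf += inp.read(length - 4)
    if kind == "exact" and len(buf) != size:
        raise ValueError("got %d bytes, header promised %d" % (len(buf), size))
    return bytes(buf)
```

The size check at the end is the case from [4]: with an exact size the
receiver can detect a mismatch, while "approx" and "unknown" are only
hints and the flush packet alone marks the end.
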
> And you should choose between the two based on whether it's more
> important to allow streaming, or more important to make the filter
> implementations simple[1].
> 
> Any solution that is in between those (like sending a size header and
> then using pktlines anyway) is sacrificing simplicity but not getting
> the streaming benefits.
> 
> -Peff
> 
> [1] I haven't thought hard enough about it to have a real opinion. My
>     gut says to go with the streaming, just because we've had to
>     retrofit streaming in other areas when dealing with blobs, so I
>     think we'll end up there eventually. So choosing a simpler protocol
>     like (1) would probably mean eventually implementing a next-version
>     protocol that does (2), and having to support both.
> 
> PS Jakub asked for links, but gmane is down. Here are the relevant threads:
> 
>    http://public-inbox.org/git/20160720134916.GB19359@xxxxxxxxxxxxxxxxxxxxx
> 
>    http://public-inbox.org/git/20160722154900.19477-1-larsxschneider%40gmail.com/t/#u
> 
