Re: [PATCH 2/2] catfile.c: add --batch-command mode

Hi Junio, thanks for the review!

On 3 Feb 2022, at 14:57, Junio C Hamano wrote:

> "John Cai via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
>> Subject: Re: [PATCH 2/2] catfile.c: add --batch-command mode
>
> "cat-file: add --batch-command mode" perhaps.  The patch touching
> the file "catfile.c" (which does not even exist) is an irrelevant
> implementation detail to spend 2 extra bytes in "git shortlog"
> output.
>
>> From: John Cai <johncai86@xxxxxxxxx>
>>
>> Add new flag --batch-command that will accept commands and arguments
>> from stdin, similar to git-update-ref --stdin.
>>
>> At GitLab, we use a pair of long running cat-file processes when
>> accessing object content. One for iterating over object metadata with
>> --batch-check, and the other to grab object contents with --batch.
>>
>> However, if we had --batch-command, we wouldn't need to keep both
>> processes around, and instead just have one process where we can flip
>> between getting object info, and getting object contents. This means we
>> can get rid of roughly half of long lived git cat-file processes. This
>> can lead to huge savings since on a given server there could be hundreds
>> of git cat-file processes running.
>
> Hmph, why hundreds, not two you listed?
>
> Do you mean "we have two per repository, but by combining, we can do
> with just one per repository, halving the number of processes"?

Yes, exactly. I'll reword this in the next version to make it clearer.

>
>> git cat-file --batch-command
>>
>> would enter an interactive command mode whereby the user can enter in
>> commands and their arguments:
>>
>> <command> [arg1] [arg2] NL
>>
>> This patch adds the basic structure for accepting commands, which can
>> be extended in the future to add more commands.
>>
>> This patch also adds the following commands:
>>
>> contents <object> NL
>> info <object> NL
>>
>> The contents command takes an <object> argument and prints out the object
>> contents.
>>
>> The info command takes an <object> argument and prints out the object
>> metadata.
>>
>> In addition, we need a set of commands that enable a "read session".
>>
>> When a separate process (A) is connected to a git cat-file process (B)
>
> This description misleads readers into thinking as if we have just
> created a daemon process that is running, and an unrelated process
> can connect to it, which obviously poses a question about securing
> the connection.  It is my understanding that what this creates is
> just a consumer process (A) starts the cat-file process (B) locally
> on its behalf under process (A)'s privilege, and they talk over pipe
> without allowing any third-party to participate in the exchange, so
> we should avoid misleading users by saying "is connected to" here.

Yes, this understanding is correct. I'll fix the wording in the next version.

>
>> and is interactively writing to and reading from it in --buffer mode,
>> (A) needs to be able to know when the buffer is flushed to stdout.
>
> If A and B are talking over a pair of pipes, in order to avoid
> deadlocking, both ends need to be able to control whose turn it is
> to speak (and it is turn for the other side to listen).  A needs to
> be able to _control_ (not "know") when the buffer it uses to write
> to B gets flushed, in order to reliably say "I am done for now, it
> is your turn to speak" and be assured that it reaches B.  The story
> is the same for the other side.  When a request by A needs to be
> responded with multiple lines of output, B needs to be able to say
> "And that concludes my response, and I am ready to accept a new
> request from you" and make sure it reaches A.  "know when..." is
> probably a wrong phrase here.

Correct, "know" is not exactly right. _control_ would be the more accurate description.
>
>> Currently, from (A)'s perspective, the only way is to either 1. exit
>> (B)'s process or 2. send an invalid object to stdin. 1. is not ideal
>> from a performance perspective as it will require spawning a new cat-file
>> process each time, and 2. is hacky and not a good long term solution.
>
> Writing enumeration as bulletted or enumerated list would make it
> much easier to read, I would think.
>
>     From (A)'s perspective, the only way is to either
>
>     1. exit (B)'s process or
>     2. send an invalid object to stdin.
>
>     1. is not ideal from a performance perspective, as it will
>     require spawning a new cat-file process each time, and 2. is
>     hacky and not a good long term solution.
>
> I am not sure what you exactly mean by "exit" in the above.  Do you
> mean "kill" instead?

Yes

>
>> With the following commands, process (A) can begin a "session" and
>> send a list of object names over stdin. When "get contents" or "get info"
>> is issued, this list of object names will be fed into batch_one_object()
>> to retrieve either info or contents. Finally an fflush() will be called
>> to end the session.
>>
>> begin NL
>> get contents NL
>> get info NL
>>
>> These can be used in the following way:
>>
>> begin
>> <sha1>
>> <sha1>
>> <sha1>
>> <sha1>
>> get info
>>
>> begin
>> <sha1>
>> <sha1>
>> <sha1>
>> <sha1>
>> get contents
>>
>> With this mechanism, process (A) can be guaranteed to receive all of the
>> output even in --buffer mode.
>
> OK, so do these "get blah" serve both as command and an implicit
> "flush"?

Yes, that's the idea.
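
For what it's worth, here is a rough sketch (not part of the patch) of how a
consumer process (A) might drive this over a pair of pipes, assuming the
command names proposed here ("begin", "get info") and the existing --buffer
option; error handling is stripped down to the minimum:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        int to_b[2], from_b[2];   /* A -> B and B -> A pipes */
        pid_t pid;
        FILE *to, *from;
        char line[1024];

        if (pipe(to_b) < 0 || pipe(from_b) < 0)
            return 1;

        pid = fork();
        if (pid < 0)
            return 1;
        if (!pid) {
            /* child becomes the cat-file process (B) */
            dup2(to_b[0], 0);
            dup2(from_b[1], 1);
            close(to_b[1]);
            close(from_b[0]);
            execlp("git", "git", "cat-file",
                   "--batch-command", "--buffer", (char *)NULL);
            _exit(127);
        }

        /* parent is the consumer process (A) */
        close(to_b[0]);
        close(from_b[1]);
        to = fdopen(to_b[1], "w");
        from = fdopen(from_b[0], "r");

        /*
         * A's turn to speak: queue object names and end the turn with
         * "get info", which also tells B to flush its output.  fflush()
         * is how A controls when its own request actually reaches B.
         */
        fprintf(to, "begin\n");
        fprintf(to, "HEAD\n");
        fprintf(to, "get info\n");
        fflush(to);

        /* now it is B's turn to speak; read its reply */
        if (fgets(line, sizeof(line), from))
            fputs(line, stdout);

        fclose(to);               /* EOF on B's stdin lets it exit */
        fclose(from);
        waitpid(pid, NULL, 0);
        return 0;
    }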

>
> With an implicit "flush", do we really need "begin"?
>
> Also, from the point of view of extensibility, not saying what kind
> of operation is started when given "begin" is probably not a good
> idea.  "get info" and "get contents" may happen to be the only
> commands that are supported right now, and the parameters to them
> may happen to be just list of object names and nothing else, but
> what happens when a new "git frotz" command is added and its
> operation is extended with something other than object names and
> pathnames?  The way to parse these parameter lines for the "get"
> would be different for different commands, and if "cat-file" knows
> upfront what is to be done to these parameters, it can even start
> prefetching and precomputing to reduce latency observed by the
> client before the final "get info" command is given.
>
> So, from that point of view,
>
>     begin <cmd>
>     <parameter>
>     <parameter>
>     ...
>     end
>
> may be a better design, no?

Good point. Now I'm wondering if we can simplify this so that commands get
queued up, and a "get" executes them along with an implicit flush:

<cmd> <parameter>
<cmd> <parameter>
<cmd> <parameter>
get
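
On the cat-file side, the dispatch loop for that could look roughly like the
strawman below; the helper names are made up for illustration and are not
from this patch:

    #include <stdio.h>
    #include <string.h>

    struct queued_cmd {
        char cmd[16];
        char arg[256];
    };

    /* stand-ins for the real handlers (e.g. batch_one_object()) */
    static void do_info(const char *obj)     { printf("info: %s\n", obj); }
    static void do_contents(const char *obj) { printf("contents: %s\n", obj); }

    int main(void)
    {
        struct queued_cmd queue[1024];
        size_t nr = 0;
        char line[512];

        while (fgets(line, sizeof(line), stdin)) {
            line[strcspn(line, "\n")] = '\0';

            if (!strcmp(line, "get")) {
                /* execute everything queued so far ... */
                for (size_t i = 0; i < nr; i++) {
                    if (!strcmp(queue[i].cmd, "info"))
                        do_info(queue[i].arg);
                    else if (!strcmp(queue[i].cmd, "contents"))
                        do_contents(queue[i].arg);
                    else
                        fprintf(stderr, "unknown command '%s'\n",
                                queue[i].cmd);
                }
                nr = 0;
                /* ... and end the turn: flush so (A) sees all of it */
                fflush(stdout);
                continue;
            }

            /* otherwise queue "<cmd> <parameter>" for later */
            if (nr < 1024 &&
                sscanf(line, "%15s %255s", queue[nr].cmd, queue[nr].arg) == 2)
                nr++;
        }
        return 0;
    }

The nice property is that "get" both dispatches the queue and flushes, so
there is no separate "begin"/"end" pair to keep in sync.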


