Hi Junio, thanks for the review! On 3 Feb 2022, at 14:57, Junio C Hamano wrote: > "John Cai via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > >> Subject: Re: [PATCH 2/2] catfile.c: add --batch-command mode > > "cat-file: add --batch-command mode" perhaps. The patch touching > the file "catfile.c" (which does not even exist) is an irrelevant > implementation detail to spend 2 extra bytes in "git shortlog" > output. > >> From: John Cai <johncai86@xxxxxxxxx> >> >> Add new flag --batch-command that will accept commands and arguments >> from stdin, similar to git-update-ref --stdin. >> >> At GitLab, we use a pair of long running cat-file processes when >> accessing object content. One for iterating over object metadata with >> --batch-check, and the other to grab object contents with --batch. >> >> However, if we had --batch-command, we wouldnt need to keep both >> processes around, and instead just have one process where we can flip >> between getting object info, and getting object contents. This means we >> can get rid of roughly half of long lived git cat-file processes. This >> can lead to huge savings since on a given server there could be hundreds >> of git cat-file processes running. > > Hmph, why hundreds, not two you listed? > > Do you mean "we have two per repository, but by combining, we can do > with just one per repository, halving the number of processes"? Yes, exactly. I'll reword this in the next version to be more clear. > >> git cat-file --batch-command >> >> would enter an interactive command mode whereby the user can enter in >> commands and their arguments: >> >> <command> [arg1] [arg2] NL >> >> This patch adds the basic structure for add command which can be >> extended in the future to add more commands. >> >> This patch also adds the following commands: >> >> contents <object> NL >> info <object> NL >> >> The contents command takes an <object> argument and prints out the object >> contents. >> >> The info command takes a <object> argument and prints out the object >> metadata. >> >> In addition, we need a set of commands that enable a "read session". >> >> When a separate process (A) is connected to a git cat-file process (B) > > This description misleads readers into thinking as if we have just > created a daemon process that is running, and an unrelated process > can connect to it, which obviously poses a question about securing > the connection. It is my understanding that what this creates is > just a consumer process (A) starts the cat-file process (B) locally > on its behalf under process (A)'s privilege, and they talk over pipe > without allowing any third-party to participate in the exchange, so > we should avoid misleading users by saying "is connected to" here. Yes this understanding is correct. Will fix wording in next version > >> and is interactively writing to and reading from it in --buffer mode, >> (A) needs to be able to know when the buffer is flushed to stdout. > > If A and B are talking over a pair pipes, in order to avoid > deadlocking, both ends need to be able to control whose turn it is > to speak (and it is turn for the other side to listen). A needs to > be able to _control_ (not "know") when the buffer it uses to write > to B gets flushed, in order to reliably say "I am done for now, it > is your turn to speak" and be assured that it reaches B. The story > is the same for the other side. When a request by A needs to be > responded with multiple lines of output, B needs to be able to say > "And that concludes my response, and I am ready to accept a new > request from you" and make sure it reaches A. "know when..." is > probably a wrong phrase here. Correct, "know" is not exactly right. _control_ would be the more accurate description. > >> Currently, from (A)'s perspective, the only way is to either 1. exit >> (B)'s process or 2. send an invalid object to stdin. 1. is not ideal >> from a performance perspective as it will require spawning a new cat-file >> process each time, and 2. is hacky and not a good long term solution. > > Writing enumeration as bulletted or enumerated list would make it > much easier to read, I would think. > > From (A)'s perspective, the only way is to either > > 1. exit (B)'s process or > 2. send an invalid object to stdin. > > 1. is not ideal from a performance perspective, as it will > require spawning a new cat-file process each time, and 2. is > hacky and not a good long term solution. > > I am not sure what you exactly mean by "exit" in the above. Do you > mean "kill" instead? Yes > >> With the following commands, process (A) can begin a "session" and >> send a list of object names over stdin. When "get contents" or "get info" >> is issued, this list of object names will be fed into batch_one_object() >> to retrieve either info or contents. Finally an fflush() will be called >> to end the session. >> >> begin NL >> get contents NL >> get info NL >> >> These can be used in the following way: >> >> begin >> <sha1> >> <sha1> >> <sha1> >> <sha1> >> get info >> >> begin >> <sha1> >> <sha1> >> <sha1> >> <sha1> >> get contents >> >> With this mechanism, process (A) can be guaranteed to receive all of the >> output even in --buffer mode. > > OK, so do these "get blah" serve both as command and an implicit > "flush"? yes, that's the idea. > > With an implicit "flush", do we really need "begin"? > > Also, from the point of view of extensibility, not saying what kind > of operation is started when given "begin" is probably not a good > idea. "get info" and "get contents" may happen to be the only > commands that are supported right now, and the parameters to them > may happen to be just list of object names and nothing else, but > what happens when a new "git frotz" command is added and its > operation is extended with something other than object names and > pathnames? The way to parse these parameter lines for the "get" > would be different for different commands, and if "cat-file" knows > upfront what is to be done to these parameters, it can even start > prefetching and precomputing to reduce latency observed by the > client before the final "get info" command is given. > > So, from that point of view, > > begin <cmd> > <parameter> > <parameter> > ... > end > > may be a better design, no? Good point. Now I'm wondering if we can simplify where commands get queued up and a "get" will execute them along with an implicit flush. <cmd> <parameter> <cmd> <parameter> <cmd> <parameter> get