Re: [PATCH 1/4] notes: print note blob to stdout directly

Jeff King <peff@xxxxxxxx> · Mon, 19 Feb 2024 20:51:33 -0500

On Sat, Feb 17, 2024 at 01:45:39PM +0100, Maarten Bosmans wrote:

> > Of the two, I'd guess that the second one is a lot less work to
> > implement (on the Git side; on the reading side it's a little more
> > involved, but still should be a constant number of processes).
> 
> The second one is attractive for another reason than implementation
> simplicity. While the first one offers more flexibility, the second
> reuses the existing cat-file batch format, so the interface between
> git and scripts is familiar and consistent.

Agreed.

> > One variant of the second one is to use "git notes list". For example,
> > you can get all notes via cat-file like this right now:
> >
> >   git notes list |
> >   git cat-file --batch='%(objectname) %(objectsize) %(rest)'
> 
> So the cat-file batch output is suitable for blobs containing newline
> or NUL characters. But I struggle a bit with what would be an easy way
> of using this format in a shell script. Something with multiple read
> and read -N commands reading from the output, I guess.
> The git codebase has `extract_batch_output()` in t1006. This uses a
> separate perl invocation to parse the cat-file output, which confirms
> my suspicion there isn't a straight-forward way to do this in e.g.
> just a bash script.

Yes, you could perhaps do it with "read -N". But note that it is a
bash-ism, and other POSIX shells may not support it. That's one of the
reasons we do not use it in t1006.

Also, I suspect that even bash does not retain NUL bytes in shell
variables. E.g.:

  $ printf '\0foo\0bar\0' | cat -A
  ^@foo^@bar^@

  $ printf '\0foo\0bar\0' | (read -r -N 9 var; echo "var=$var" | cat -A)
  var=foobar$

So pure shell is probably a losing battle if you want to be binary
clean. You might be able to do something like using "read" to get the
header line, and then another tool like "head -c" to read those bytes.
But now you're back to one process per object. Plus depending on the
implementation of "head", it might or might not read more bytes from the
pipe as part of its stdio buffering.

So really, you are probably better off handling the output with a more
capable language (though again, there's not much Git can do here; the
complications are from handling binary data, and not Git's output format
choices).

> That was why my first steps were to accept that a launching a separate
> process per note in a bash loop is a pretty clear and well understood
> idiom in shell scripts and try to make the git part of that a bit more
> efficient.

Yes, I agree it's a simple idiom. And I certainly don't mind any
speedups we can get there, if they don't come with a cost (and your
patches looked pretty reasonable to me, though I didn't read them too
carefully). But I do think anytime you are invoking N+1 processes in a
shell script, it's kind of a losing battle for performance. :)

-Peff