Re: [PATCH] t5500.43: make the check a bit more robust

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Sat, 17 Oct 2020 05:31:18 +0200 (CEST)

Hi Peff,

On Tue, 13 Oct 2020, Jeff King wrote:

> On Tue, Oct 13, 2020 at 02:55:15PM -0400, Jeff King wrote:
>
> > But then in fbd76cd450 (sideband: reverse its dependency on pkt-line,
> > 2019-01-16), the function became demultiplex_sideband(). The loop went
> > away, and we pump only a single packet on each call. When we see
> > sideband 2, we do an early return. But on sideband 1, we continue to the
> > cleanup: label, which flushes the scratch buffer.
> >
> > I think we need to return from there without hitting that cleanup label,
> > like this:
>
> By the way, the reason this works is that the "scratch" buffer persists
> beyond individual calls to demultiplex_sideband(). So we can get away
> with letting it persist between the two.

The thing that threw me was that I somehow expected `recv_sideband()` to
return primary data as it is read, much like `read()` operates. And yes,
I also found the split version of the code (`recv_sideband()` contains the
`while` loop and lives in `pkt-line.c` while `demultiplex_sideband()`
contains the `scratch` handling and the `switch` between packet types and
it lives in `sideband.c`) was much harder to read than the original
version.

> However unlike the original recv_sideband(), which broke out of the loop
> on error and then flushed scratch, our post-fbd76cd450 does not do the
> same. It now looks like:
>
> int recv_sideband(const char *me, int in_stream, int out)
> {
>         char buf[LARGE_PACKET_MAX + 1];
>         int len;
>         struct strbuf scratch = STRBUF_INIT;
>         enum sideband_type sideband_type;
>
>         while (1) {
>                 len = packet_read(in_stream, NULL, NULL, buf, LARGE_PACKET_MAX,
>                                   0);
>                 if (!demultiplex_sideband(me, buf, len, 0, &scratch,
>                                           &sideband_type))
>                         continue;
>                 switch (sideband_type) {
>                 case SIDEBAND_PRIMARY:
>                         write_or_die(out, buf + 1, len - 1);
>                         break;
>                 default: /* errors: message already written */
>                         return sideband_type;
>                 }
>         }
> }
>
> I wondered if we could ever have a case where we broke out of the loop
> with data left over in "scratch" (or its buffer left allocated). I think
> the answer is no. We only break out of the loop if
> demultiplex_sideband() returned non-zero _and_ its not the primary
> sideband. So both before and after my suggested fix, we'd have hit the
> cleanup label in demultiplex_sideband(), which flushes and frees the
> buffer.

Right.

It took me quite a while to convince myself of that, too.

And since I am really worried that the complexity of the code makes it
easy to introduce a regression, I spent quite a bit of time to figure out
how to implement a good regression test for exactly this issue.

Even so, the regression test will not necessarily catch problems where the
`while` loop is abandoned without processing any unfinished sideband
message. I introduced a `BUG()` for that case, but it is quite a bit
unsatisfying that I could not come up with a better way to ensure that
this does not happen.

> I do have to say that the original loop before that commit was a lot
> easier to follow, though.
>
> Another weirdness I noticed is that it doesn't distinguish a flush
> packet (expected) from a zero-byte packet (an error). But neither did
> the original. I would have guessed the zero-byte packet was meant to
> trigger this second conditional:
>
>         if (len == 0) {
>                 *sideband_type = SIDEBAND_FLUSH;
>                 goto cleanup;
>         }
>         if (len < 1) {
>                 strbuf_addf(scratch,
>                             "%s%s: protocol error: no band designator",
>                             scratch->len ? "\n" : "", me);
>                 *sideband_type = SIDEBAND_PROTOCOL_ERROR;
>                 goto cleanup;
>         }
>
> but we'd hit the first conditional before then. We would still trigger
> the second if we saw EOF while reading the packet (because we set the
> length to -1 then), but then it's arguably the wrong error to be
> showing.
>
> So I think this could be improved a bit by using
> packet_read_with_status() in the recv_sideband() caller.

Possibly. But is it really a bug to send a zero-byte packet? It is
inefficient, sure. But I could see how it could potentially simplify code,
or serve e.g. as some sort of a "keep-alive" mechanism or whatever.

In other words, I am not sure that  we should treat this as an error, but
yes, we should probably not treat it as a flush (even if it is likely that
our current sideband users simply won't ever send empty primary packets).

Ciao,
Dscho