Re: Docs: possible incorrect diagram of Delta Copy instruction

Jeff King <peff@xxxxxxxx> · Fri, 11 Sep 2020 10:33:21 -0400

On Fri, Sep 11, 2020 at 02:30:20PM +0200, Bouke Versteegh wrote:

> While working on an implementation of git in dart, I've noticed a
> possible error in the documentation. I hope I'm using the correct
> channel to report this issue.

This is definitely the right place.

> On [pack format](https://git-scm.com/docs/pack-format#_instruction_to_copy_from_base_object)
> a diagram is shown, that explains the format of a Copy instruction
> inside a Deltified pack entry:
> 
> +----------+---------+---------+---------+---------+-------+-------+-------+
> | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
> +----------+---------+---------+---------+---------+-------+-------+-------+
> 
> The documentation specifies that diagrams follow the RFC950
> (https://www.ietf.org/rfc/rfc1950.txt) format.

I think you mean rfc1951 here, but yeah, it is clear in that document
that bytes are in MSB order. And also that each of those boxes is a
single byte.

> That means that the left bit is MSB, and the right bit is LSB, so the
> OpCode is MSB (1xxxxxxx), which is correct and matches other sources.

Right, agreed.

> It also would mean that offset 1-4 should be read from bit 7, 6, 5 and
> 4 (i.e. 0x40, 0x20, 0x10, 0x08)

I don't think this follows, though. "offset1" is a separate byte. We
have a sequence of bytes here, some of which may be present. Their
presence is determined by the bits "xxxxxxx", but the diagram does not
say anything about the order.

The paragraphs below do say:

  This is the instruction format to copy a byte range from the source
  object. It encodes the offset to copy from and the number of bytes to
  copy. Offset and size are in little-endian order.

So "offset1" is the lowest byte of the offset value. We still don't know
which bit in the instruction byte tells us whether it's there yet,
though. The next paragraph says:

  All offset and size bytes are optional. This is to reduce the
  instruction size when encoding small offsets or sizes. The first seven
  bits in the first octet determines which of the next seven octets is
  present. If bit zero is set, offset1 is present. If bit one is set
  offset2 is present and so on.

So bit zero of the "xxxxxxx" (the LSB) is offset1, and we have to check
that first. Which I think is what happens in:

>   PARSE_CP_PARAM(0x01, cp_off, 0);

That's reading the low bit of the command byte, and then reading the
_next_ byte from data, and shifting it not at all (because it's the low
byte of the offset). That macro looks like this:

  #define PARSE_CP_PARAM(bit, var, shift) do { \
                          if (cmd & (bit)) { \
                                  if (data >= top) \
                                          goto bad_length; \
                                  var |= ((unsigned) *data++ << (shift)); \
                          } } while (0)

> However, looking at the git source-code and other documentation (see
> [1] [2]), I see that offset [1-4] are read from the LOWEST 4 bits, and
> the SIZE  bits are stored right after MSB (opcode).

I think both the code and diagram are right. It's just that the order of
the follow-on bytes is not part of that diagram, but rather the
explanation below it.

> From this source code, I would conclude that the diagram should look
> like this instead:
> 
> +----------+---------+---------+---------+---------+-------+-------+-------+
> | 1xxxxxxx | size3 | size 2 | size1 | offset4 | offset3 | offset2 | offset1 |
> +----------+---------+---------+---------+---------+-------+-------+-------+

That would definitely be wrong. If there's an offset1 byte, it
immediately follows the command byte.

-Peff