Re: [PATCH 0/1] Sequence number out of range fix

"Gix, Brian" <brian.gix@xxxxxxxxx> · Wed, 15 Jan 2020 21:21:49 +0000

On Wed, 2020-01-15 at 22:04 +0100, michal.lowas-rzechonek@xxxxxxxxxxx wrote:
> Hi Brian,
> 
> On 01/15, Gix, Brian wrote:
> > > Good point. Note that his is also possible in the current
> > > implementation: since seq_num is applied to nonce with a 24bit mask,
> > > it's going to wrap around.
> > 
> > The full IV Index is in the nonce, and at 192 hours per IV index, will
> > be unique for something like 1.4 million years.
> 
> Yes it is, but that's not the point.
> 
> At the moment, net->seq_num is a 32 bit value that *can* exceed 24bit
> range, because mesh_net_next_seq_num() doesn't check ranges. So the
> raw value can reach 0x1000000 and above.
> 
> Now, this raw value is used in send_seg, passed to
> mesh_crypto_packet_build, which effectively applies a 24bit mask in line
> 640:

Yes, we should definitely be sanity checking this, and not sending if SeqNum out of range.

> 
> 	l_put_be32(seq, packet + 1);
> 	packet[1] = (ctl ? CTL : 0) | (ttl & TTL_MASK);
> 
> So this means that when:
> 
>  - the network is already in iv update (so that you can't increase the
>    iv_index, maybe you even started the procedure yourself because your
>    seq_num is above the threshold)
> 
>  - your sequence number is sufficiently large (because of the "repeated
>    crash" scenario described below)

I think if we are repeatedly crashing, and it is causing a runaway sequence number increase, that being
forbidden to send more packets is a natural consequence, and people should fix the code that was causing the
crash in the first place.

> 
> Then the actual value used in the nonce will be net->seq_num & 0xffffff,
> which is something you have *already* used before. All of that happens
> with the same IV index.
> 
> > The over commit is calculated based on the usage rate, and the daemon
> > would need to unexpectedly abort (not just ctrl-c or exit) for us to
> > use the over-commit value
> 
> Indeed, that's precisely what I'm talking about - repeated, unhandled
> process terminations. We're trying to come up with the patch simply
> because this situation has *already happened* on one of our instances.
>