Re: [PATCH] netlink: Return unsigned value for nla_len()

Kees Cook <keescook@xxxxxxxxxxxx> · Fri, 1 Dec 2023 20:39:44 -0800

On Fri, Dec 01, 2023 at 10:45:05AM -0800, Jakub Kicinski wrote:
> On Fri, 1 Dec 2023 10:17:02 -0800 Kees Cook wrote:
> > > > -static inline int nla_len(const struct nlattr *nla)
> > > > +static inline u16 nla_len(const struct nlattr *nla)
> > > >  {
> > > > -	return nla->nla_len - NLA_HDRLEN;
> > > > +	return nla->nla_len > NLA_HDRLEN ? nla->nla_len - NLA_HDRLEN : 0;
> > > >  }  
> > > 
> > > Note the the NLA_HDRLEN is the length of struct nlattr.
> > > I mean of the @nla object that gets passed in as argument here.
> > > So accepting that nla->nla_len may be < NLA_HDRLEN means
> > > that we are okay with dereferencing a truncated object...
> > > 
> > > We can consider making the return unsinged without the condition maybe?  
> > 
> > Yes, if we did it without the check, it'd do "less" damage on
> > wrap-around. (i.e. off by U16_MAX instead off by INT_MAX).
> > 
> > But I'd like to understand: what's the harm in adding the clamp? The
> > changes to the assembly are tiny:
> > https://godbolt.org/z/Ecvbzn1a1
> 
> Hm, I wonder if my explanation was unclear or you disagree..
> 
> This is the structure:
> 
> struct nlattr {
> 	__u16           nla_len; // attr len, incl. this header
> 	__u16           nla_type;
> };
> 
> and (removing no-op wrappers):
> 
> #define NLA_HDRLEN	sizeof(struct nlattr)
> 
> So going back to the code:
> 
> 	return nla->nla_len > NLA_HDRLEN ? nla->nla_len - NLA_HDRLEN...
> 
> We are reading nla->nla_len, which is the first 2 bytes of the structure.
> And then we check if the structure is... there?

I'm not debating whether it's there or not -- I'm saying the _contents_ of
"nlattr::nla_len", in the face of corruption or lack of initialization,
may be less than NLA_HDRLEN. (There's a lot of "but that's can't happen"
that _does_ happen in the kernel, so I'm extra paranoid.)

> If we don't trust that struct nlattr which gets passed here is at least
> NLA_HDRLEN (4B) then why do we think it's safe to read nla_len (the
> first 2B of it)?

Type confusion (usually due to Use-after-Free flaws) means that a memory
region is valid (i.e. good pointer), but that the contents might have
gotten changed through other means. (To see examples of this with
struct msg_msg, see: https://syst3mfailure.io/wall-of-perdition/)

(On a related note, why does nla_len start at 4 instead of 0? i.e. why
does it include the size of nlattr? That seems redundant based on the
same logic you're using here.)

> That's why I was pointing at nla_ok(). nla_ok() takes the size of the
> buffer / message as an arg, so that it can also check if looking at
> nla_len itself is not going to be an OOB access. 99% of netlink buffers
> we parse come from user space. So it's not like someone could have
> mis-initialized the nla_len in the kernel and being graceful is helpful.
> 
> The extra conditional is just a minor thing. The major thing is that
> unless I'm missing something the check makes me go 🤨️

My concern is that there are 562 callers of nla_len():

$ git grep '\bnla_len(\b' | wc -l
562

We have no way to be certain that all callers follow a successful
nla_ok() call.

Regardless, just moving from "int" to "u16" solves a bunch of value
range tracking pain that GCC appears to get upset about, so if you
really don't want the (tiny) sanity check, I can just send the u16
change.

-Kees

-- 
Kees Cook