Re: [PATCH v2 2/2] treewide: Add the __GFP_PACKED flag to several non-DMA kmalloc() allocations

Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> · Wed, 26 Oct 2022 14:59:04 +0200

On Wed, Oct 26, 2022 at 10:48:58AM +0100, Catalin Marinas wrote:
> On Wed, Oct 26, 2022 at 08:50:07AM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Oct 25, 2022 at 09:52:47PM +0100, Catalin Marinas wrote:
> > > diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> > > index 4d59d927ae3e..bff8901dc426 100644
> > > --- a/drivers/usb/core/message.c
> > > +++ b/drivers/usb/core/message.c
> > > @@ -525,7 +525,8 @@ int usb_sg_init(struct usb_sg_request *io, struct usb_device *dev,
> > >  	}
> > >  
> > >  	/* initialize all the urbs we'll use */
> > > -	io->urbs = kmalloc_array(io->entries, sizeof(*io->urbs), mem_flags);
> > > +	io->urbs = kmalloc_array(io->entries, sizeof(*io->urbs),
> > > +				 mem_flags | __GFP_PACKED);
> > 
> > Hey look, you just did what I was worried about :(
> > 
> > Oh wait, no, this is just the urb structure, not the actual data pointer
> > sent on the urb.
> 
> Yeah, I agree it gets tricky to review such patches. My hope is that
> driver writers won't start adding such flags everywhere, only where
> there's a significant number of allocations. Better flag naming would
> make this more obvious.

You need to give both driver writers, and reviewers, hints as to what
you are expecting to see happen here.  Where should, and should we not,
use this flag?  I can't really tell here as I don't understand the goal
by just looking at the flag itself.

If I see "packed", of course I want to use packed, why would a driver
writer NOT want it?  But that name does not give any hint on when it
would NOT be good to use, which is a big flaw in the naming.

> > Ick, this is going to get really hard to review over time.  I feel for
> > the need to want to start to pack things in smaller, but this is going
> > to be really really hard for maintainers to review submissions.
> > Especially if the only way problems show up is in random platforms where
> > the alignment causes a failure.
> > 
> > Anyway we can make all arches fail if we submit pointers that are not
> > aligned properly to the DMA engines?
> 
> As I mentioned in a prior reply, we could add a warning comparing the
> slab size with ARCH_DMA_MINALIGN (or cache_line_size(), though the
> latter could trigger on fully-coherent architectures).

It should trigger on all arches in order to get proper coverage.  That's
the only way we have fixed these types of bugs up over time.

> > > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > > index 63c7ebb0da89..e5ad1a3244fb 100644
> > > --- a/fs/binfmt_elf.c
> > > +++ b/fs/binfmt_elf.c
> > > @@ -883,7 +883,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > >  			goto out_free_ph;
> > >  
> > >  		retval = -ENOMEM;
> > > -		elf_interpreter = kmalloc(elf_ppnt->p_filesz, GFP_KERNEL);
> > > +		elf_interpreter = kmalloc(elf_ppnt->p_filesz,
> > > +					  GFP_KERNEL | __GFP_PACKED);
> > 
> > If this is going to now be sprinkled all over the tree, why not make it
> > a "real" flag, "GFP_PACKED"?  Or better yet, have it describe the
> > allocation better, "GFP_TINY" or
> > "GFP_I_KNOW_THIS_IS_TINY_WITH_NO_DMA_ISSUES"
> 
> Linus originally suggested GFP_NODMA. We could go with that (and a
> corresponding KMALLOC_NODMA_MINALIGN).

NODMA is good, it gives me a better idea of when it would be able to be
used (i.e. not in a driver.)

> > > --- a/lib/kasprintf.c
> > > +++ b/lib/kasprintf.c
> > > @@ -22,7 +22,7 @@ char *kvasprintf(gfp_t gfp, const char *fmt, va_list ap)
> > >  	first = vsnprintf(NULL, 0, fmt, aq);
> > >  	va_end(aq);
> > >  
> > > -	p = kmalloc_track_caller(first+1, gfp);
> > > +	p = kmalloc_track_caller(first+1, gfp | __GFP_PACKED);
> > 
> > How do we know this is going to be small?
> 
> We don't need to know it's small. If it's over 96 bytes on arm64, it
> goes in the kmalloc-128 cache or higher. It can even use the kmalloc-192
> cache that's not aligned to ARCH_DMA_MINALIGN (128). That's why I'd
> avoid GFP_TINY as this flag is not about size but rather alignment (e.g.
> 192 may not be DMA safe but it's larger than 128).
> 
> That said, I should try to identify sizes > 128 and <= 192 and pass such
> flag.

What if the flag is used for large sizes, what will happen?  In other
words, why would you ever NOT want to use this?  DMA is a big issue, but
then we should flip that around and explicitly mark the times we want
DMA, not not-want DMA as "not want" is by far the most common, right?

In other words, I'm leaning hard on the "mark allocations that need DMA
memory as needing DMA memory" option.  Yes it will be more work
initially, but I bet a lot if not most, of them can be caught with
coccinelle scripts.  And then enforced with sparse to ensure they don't
break in the future.

That would provide a huge hint to the allocator as to what is needed,
while this "packed" really doesn't express the intent here at all.  Then
if your allocator/arch has explicit requirements for DMA memory, it can
enforce them and not just hope and guess that people mark things as "not
dma".

> > > diff --git a/lib/kobject.c b/lib/kobject.c
> > > index a0b2dbfcfa23..2c4acb36925d 100644
> > > --- a/lib/kobject.c
> > > +++ b/lib/kobject.c
> > > @@ -144,7 +144,7 @@ char *kobject_get_path(struct kobject *kobj, gfp_t gfp_mask)
> > >  	len = get_kobj_path_length(kobj);
> > >  	if (len == 0)
> > >  		return NULL;
> > > -	path = kzalloc(len, gfp_mask);
> > > +	path = kzalloc(len, gfp_mask | __GFP_PACKED);
> > 
> > This might not be small, and it's going to be very very short-lived
> > (within a single function call), why does it need to be allocated this
> > way?
> 
> Regarding short-lived objects, you are right, they won't affect
> slabinfo. My ftrace-fu is not great, I only looked at the allocation
> hits and they keep adding up without counting how many are
> freed. So maybe we need tracing free() as well but not always easy to
> match against the allocation point and infer how many live objects there
> are.

This code path at boot time is yes, called a lot, but the path is
created/destroyed instantly.  There should not ever be any allocation
here that lasts more than one function call.  So you might want to look
at structures that remain in the slabs after booting, not just initial
boot allocation calls.  If you were to want to try to instrument this in
this way.  But I think that's a loosing game, let's do it right and mark
memory for DMA as such and keep track of it.

Bonus is that this will help tracking memory like this as potentially
"untrusted" if it comes from hardware, but that's a totally different
issue, and one that we don't need to worry about today, perhaps in a
year or so...

thanks,

greg k-h