Re: [PATCH v2] mm: disallow mapping that conflict for devm_memremap_pages()

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Wed, 18 Jul 2018 11:36:21 -0700

On Wed, Jul 18, 2018 at 02:27:31PM -0400, Jeff Moyer wrote:
> Hi, Dave,
> 
> Dave Jiang <dave.jiang@xxxxxxxxx> writes:
> 
> > When pmem namespaces created are smaller than section size, this can cause
> > issue during removal and gpf was observed:
> >
> > Add code to check whether we have mapping already in the same section and
> > prevent additional mapping from created if that is the case.
> >
> > Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx>
> > ---
> >
> > v2: Change dev_warn() to dev_WARN() to provide helpful backtrace. (Robert E)
> 
> OK, I can reproduce the issue.  What I don't like about your patch is
> that you can still get yourself into trouble.  Just create a namespace
> with a size that isn't aligned to 128MB, and then all further
> create-namespace operations will fail.  The only "fix" is to delete the
> odd-sized namespace and try again.  And that warning message doesn't
> really help the administrator to figure this out.
> 
> Why can't we simply round up to the next section automatically?  Either
> that, or have the kernel export a minimum namespace size of 128MB, and
> have ndctl enforce it?  I know we had some requests for 4MB namespaces,
> but it doesn't sound like those will be very useful if they're going to
> waste 124MB of space.
> 
> Or, we could try to fix this problem of having multiple namespace
> co-exist in the same memblock section.  That seems like the most obvious
> fix, but there must be a reason you didn't pursue it.
> 
> Dave, what do you think is the most viable option?

Just as a reminder, the desire for small pmem devices comes from cloud
usecases where you have teeny tiny layers, each of which might contain a
single package (eg a webserver or a database).  Because you're going to
run tens of thousands of instances, you don't want each machine to keep
a copy of the program text in pagecache; you want to have it in-memory
once and then DAX-map it in each guest.

While it's OK to waste a certain amount of each guest's physical memory,
when you have hundreds or thousands of these tiny layers, it adds up.