Re: steering allocations to particular parts of memory

Mel Gorman <mgorman@xxxxxxx> · Tue, 11 Sep 2012 10:34:08 +0100

On Fri, Sep 07, 2012 at 11:27:15AM -0700, Larry Bassel wrote:
> I am looking for a way to steer allocations (these may be
> by either userspace or the kernel) to or away from particular
> ranges of memory. The reason for this is that some parts of
> memory are different from others (i.e. some memory may be
> faster/slower). For instance there may be 500M of "fast"
> memory and 1500M of "slower" memory on a 2G platform.
> 

Hi Larry,

> At the memory mini-summit last week, it was mentioned
> that the Super-H architecture was using NUMA for this
> purpose, which was considered to be an very bad thing
> to do -- we have ported NUMA to ARM here (as an experiment)
> and agree that NUMA doesn't work well for solving this problem.
> 

Yes, I remember the discussion and regret it had to be cut short.

NUMA is almost always considered to be the first solution to this type
of problem but as you say it's considered to be a "very bad thing to do".
It's convenient in one sense because you get data structures that track all
the pages for you and create the management structures. It's bad because
page allocation uses these slow nodes when the fast nodes are full which
is a very poor placement policy. Similarly pages from the slow node are
reclaimed based on memory pressure. It comes down to luck whether the
optimal pages are in the slow node or not. You can try wedging your own
placement policy on the side but it won't be pretty.

> After the NUMA discussion, I spoke briefly to you and asked
> you what a good approach would be. You thought that something
> based on transcendent memory (which I am somewhat familiar
> with, having built something based upon it which can be used either
> as contiguous memory or as clean cache) might work, but
> you didn't supply any details.
> 

I was running out the door to catch a bus unfortunately. It was a somewhat
off-the-cuff remark that tmem might help you and what I was really
interested in what tmem used as a placement policy. All I was really sure
of was that a plain NUMA node is a bad idea. Unfortunately I have not
sat down to properly design a solution for this that would satisfy all
interested parties.  Hence take all this with a big grain of salt.

The reason why tmem (http://lwn.net/Articles/340080/) came to mind is that
it addresses a similar class of problem to yours. Very broadly speaking
it was described as memory of an "unknown and dynamically variable size,
is addressable only indirectly by the kernel, can be configured either
as persistent or as "ephemeral" (meaning it will be around for awhile,
but might disappear without warning), and is still fast enough to be
synchronously accessible"

This is not an exact fit obviously. The slow memory node (slowmem) is
fixed size and is directly accessible. The core idea might still be
useful to you though. I'm actually not familiar with tmem but it would
be worth investigating if you can use the same API to decide whether
pages should migrate to/from slowmem and when to simply discard pages
from slowmem.

A possibly variation would be to have cleancache and similar mechanisms
use slowmem as a backend.

A third variation is for people considering creating RAM-like devices
that are backed by some sort of fast storage. These would be interested
in an almost identical sort of API that you need.

Note that none of this actually stops you using a pgdat structure to
represent slowmem and to creating the struct pages for you. This could
be core helper code that allocates a pgdat structure and initialises all
the pages but does not create a kswapd thread, link it to zonelists etc.
The key Ideally there would be a placement policy API (maybe similar
to tmems) that can be shared with slowmem, cleancache, whatever you are
implementing and potentially tmem if it gets revived.

In my simple mind the final solution to cover most or all of these use
causes would look something like this ASCII scribble.

             movement trigger
        KSM? kswapd hook? faults?
                    |
               placement policy
               notification API
                    |
          |------------------|
          |                  |
        placement         placement
        policy            policy                       faulting, IO
          |                  |                              |
          |------------------|                              |
                   |                                        |
       API to move pages RAM<->backing,         get_user_pages like API
           discard pages                        page for userspace access
                   |                                        |
		   |----------------------------------------|
                   |
      Interface to make it look like RAM
      Create struct pages, partial pgdat,
       no kswapd, not linked to zonelist
                   |
   ------------------------------
   |               |            |
slowmem      block device     tmem

Hope this clarifies my position a little but people like Dan who have
focused on this problem in the past may have a much better idea.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>