Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 10 Jan 2019 13:29:27 -0800

On Thu, Jan 10, 2019 at 2:57 AM Mel Gorman <mgorman@xxxxxxx> wrote:
>
> On Mon, Jan 07, 2019 at 03:21:10PM -0800, Dan Williams wrote:
> > Randomization of the page allocator improves the average utilization of
> > a direct-mapped memory-side-cache. Memory side caching is a platform
> > capability that Linux has been previously exposed to in HPC
> > (high-performance computing) environments on specialty platforms. In
> > that instance it was a smaller pool of high-bandwidth-memory relative to
> > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > be found on general purpose server platforms where DRAM is a cache in
> > front of higher latency persistent memory [1].
> >
>
> So I glanced through the series and while I won't nak it, I'm not a
> major fan either so I won't ack it either.

Thanks for taking a look, some more comments / advocacy below...
because I'm not sure what Andrew will do with a "meh" response
compared to an ack.

> While there are merits to
> randomisation in terms of cache coloring, it may not be robust. IIRC, the
> main strength of randomisation vs being smart was "it's simple and usually
> doesn't fall apart completely". In particular I'd worry that compaction
> will undo all the randomisation work by moving related pages into the same
> direct-mapped lines. Furthermore, the runtime list management of "randomly
> place and head or tail of list" will have variable and non-deterministic
> outcomes and may also be undone by either high-order merging or compaction.

It's a fair point. To date we have not been able to measure the
average performance degrading over time (pages becoming more ordered)
but that said I think it would take more resources and time than I
have available for that trend to present. If it did present that would
only speak to a need to be more aggressive on the runtime
re-randomization. I think there's a case to be made to start simple
and only get more aggressive with evidence.

Note that higher order merging is not a current concern since the
implementation is already randomizing on MAX_ORDER sized pages. Since
memory side caches are so large there's no worry about a 4MB
randomization boundary.

However, for the (unproven) security use case where folks want to
experiment with randomizing on smaller granularity, they should be
wary of this (/me nudges Kees).

> As bad as it is, an ideal world would have a proper cache-coloring
> allocation algorithm but they previously failed as the runtime overhead
> exceeded the actual benefit, particularly as fully associative caches
> became more popular and there was no universal "one solution fits all". One
> hatchet job around it may be to have per-task free-lists that put free
> pages into buckets with the obvious caveat that those lists would need
> draining and secondary locking. A caveat of that is that there may need
> to be arch and/or driver hooks to detect how the colors are managed which
> could also turn into a mess.

We (Dave, I and others that took a look at this) started here, and the
"mess" looked daunting compared to randomization. Also a mess without
much more incremental benefit.

We also settled on a numa_emulation based approach for the cases where
an administrator knows they have a workload that can fit in the
cache... more on that below:

> The big plus of the series is that it's relatively simple and appears to
> be isolated enough that it only has an impact when the necessary hardware
> in place. It will deal with some cases but I'm not sure it'll survive
> long-term, particularly if HPC continues to report in the field that
> reboots are necessary to reshufffle the lists (taken from your linked
> documents). That workaround of running STREAM before a job starts and
> rebooting the machine if the performance SLAs are not met is horrid.

That workaround is horrid, and we have a separate solution for it
merged in commit cc9aec03e58f "x86/numa_emulation: Introduce uniform
split capability". When an administrator knows in advance that a
workload will fit in cache they can use this capability to run the
workload in a numa node that is guaranteed to not have cache conflicts
with itself.

Whereas randomization benefits the general cache-overcommit case. The
uniform numa split case addresses those niche users that can manually
time schedule jobs with different working set sizes... without needing
to reboot.