On Tue, Apr 21, 2020 at 2:59 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> [snip]
>
> We do control very aggressive batch jobs to the extent where they have
> negligible latency impact on interactive services running on the same
> hosts. All the tools to do that are upstream and/or public, but it's
> still pretty new stuff (memory.low, io.cost, cpu headroom control,
> freezer) and they need to be put together just right.
>
> We're working on a demo application that showcases how it all fits
> together and hope to be ready to publish it soon.
>

That would be awesome.

> [snip]
>
> > What do you mean by not interchangeable? If I keep the hot memory (or
> > workingset) of a job in DRAM and cold memory in swap and control the
> > rate of refaults by controlling the definition of cold memory then I
> > am using the DRAM and swap interchangeably and transparently to the
> > job (that is what we actually do).
>
> Right, that's a more precise definition than my randomly chosen "80%"
> number above. There are parts of a workload's memory access curve
> (where x is distinct data accessed and y is the access frequency) that
> don't need to stay in RAM permanently and can be fetched on demand
> from secondary storage without violating the workload's
> throughput/latency requirements. For that part, RAM, swap, disk can be
> interchangeable.
>
> I was specifically talking about the other half of that curve, and
> meant to imply that that's usually bigger than 20%. Usually ;-)
>
> I.e. we cannot say: workload x gets 10G of ram or swap, and it doesn't
> matter whether it gets it in ram or in swap. There is a line somewhere
> in between, and it'll vary with workload requirements, access patterns
> and IO speed. But no workload can actually run with 10G of swap and 0
> bytes worth of direct access memory, right?

Yes.

>
> Since you said before you're using combined memory+swap limits, I'm
> assuming that you configure the resource as interchangeable, but still
> have some form of determining where that cutoff line is between them -
> either by tuning proactive reclaim toward that line or having OOM kill
> policies when the line is crossed and latencies are violated?
>

Yes, more specifically tuning proactive reclaim towards that line. We
define that line in terms of an acceptable refault rate for the job.
The acceptable refault rate is measured through re-use and idle page
histograms (these histograms are collected through our internal
implementation of Page Idle Tracking). I am planning to upstream and
open-source these. (A toy sketch of such a refault-driven loop follows
the compression exchange below.)

>
> > I am also wondering if you guys explored the in-memory compression
> > based swap medium and if there are any reasons to not follow that
> > route.
>
> We played around with it, but I'm ambivalent about it.
>
> You need to identify that perfect "warm" middle section of the
> workingset curve that is 1) cold enough to not need permanent direct
> access memory, yet 2) warm enough to justify allocating RAM to it.
>
> A lot of our workloads have a distinguishable hot set and various
> amounts of fairly cold data during stable states, with not too much
> middle ground in between where compressed swap would really shine.
>
> Do you use compressed swap fairly universally, or more specifically
> for certain workloads?
>

Yes, we are using it fairly universally. There are a few exceptions,
like user-space net and storage drivers.
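To make the proactive reclaim answer above concrete, here is a toy
version of such a refault-driven feedback loop against the cgroup v2
interface. To be clear, this is only an illustration, not our actual
controller: the cgroup path, target rate, and step size are made up,
and the real thing uses the histograms mentioned earlier rather than a
single threshold.

  #!/usr/bin/env python3
  # Toy proactive reclaim: tighten memory.high while the cgroup's
  # refault rate stays under a target, back off when it overshoots.
  # Assumes cgroup v2; CGROUP/TARGET/STEP are illustrative values.
  import time

  CGROUP = "/sys/fs/cgroup/batch/job0"  # hypothetical cgroup
  TARGET = 50          # acceptable refaults/sec for this job (made up)
  STEP   = 16 << 20    # grow/shrink memory.high in 16M increments
  PERIOD = 5           # seconds between samples

  def read_refaults():
      # memory.stat reports workingset_refault (split into _anon/_file
      # on newer kernels); sum whichever variants are present.
      total = 0
      with open(f"{CGROUP}/memory.stat") as f:
          for line in f:
              key, val = line.split()
              if key.startswith("workingset_refault"):
                  total += int(val)
      return total

  def current_usage():
      with open(f"{CGROUP}/memory.current") as f:
          return int(f.read())

  def set_high(limit):
      # Lowering memory.high makes the kernel reclaim the cgroup down
      # toward the new value.
      with open(f"{CGROUP}/memory.high", "w") as f:
          f.write(str(limit))

  high = current_usage()
  prev = read_refaults()
  while True:
      time.sleep(PERIOD)
      cur = read_refaults()
      rate = (cur - prev) / PERIOD
      prev = cur
      if rate < TARGET:
          high -= STEP   # job is comfortable: probe for cold memory
      else:
          high += STEP   # refaulting too hard: give memory back
      high = max(high, 64 << 20)  # arbitrary floor
      set_high(high)
      print(f"refaults/s={rate:.0f} memory.high={high >> 20}M")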
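As for the compressed pool itself: whether a box uses zswap (a
compressed cache in front of the real swap device) or zram (a
compressed RAM block device used as swap) varies, but the upstream
zswap knobs are easy to inspect from userspace. A small sketch using
only the standard sysfs/debugfs paths:

  #!/usr/bin/env python3
  # Dump zswap parameters and pool statistics. The parameters live in
  # sysfs; the statistics live in debugfs and need root to read.
  import os

  for d in ("/sys/module/zswap/parameters", "/sys/kernel/debug/zswap"):
      if not os.path.isdir(d):
          continue
      print(f"-- {d}")
      for name in sorted(os.listdir(d)):
          try:
              with open(os.path.join(d, name)) as f:
                  print(f"  {name} = {f.read().strip()}")
          except OSError as e:
              print(f"  {name} = <unreadable: {e.strerror}>")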
> > Oh you mentioned DAX, that brings to mind a very interesting topic.
> > Are you guys exploring the idea of using PMEM as a cheap slow memory?
> > It is byte-addressable, so, regarding memcg accounting, will you treat
> > it as memory or as a separate resource like swap in v2? How does your
> > memory overcommit model work with such a type of memory?
>
> I think we (the kernel MM community, not we as in FB) are still some
> ways away from having dynamic/transparent data placement for pmem the
> same way we have for RAM. But I expect the kernel's high-level default
> strategy to be similar: order virtual memory (the data) by access
> frequency and distribute across physical memory/storage accordingly.
>
> (With pmem being divided into volatile space and filesystem space,
> where volatile space holds colder anon pages (and, if there is still a
> disk, disk cache), and the sizing decisions between them being similar
> to the ones we use for swap and filesystem today.)
>
> I expect cgroup policy to be separate, because to users the
> performance difference matters. We won't want greedy batch
> applications displacing latency sensitive ones from RAM into pmem,
> just like we don't want this displacement into secondary storage
> today. Other than that, there isn't too much difference to users,
> because paging is already transparent - an mmap()ed file looks the
> same whether it's backed by RAM, by disk or by pmem. The difference is
> access latencies and the aggregate throughput loss they add up to. So
> I could see pmem cgroup limits and protections (for the volatile space
> portion) the same way we have RAM limits and protections.
>
> But yeah, I think this is going a bit off topic ;-)

That's really interesting. Thanks for satisfying my curiosity.

thanks,
Shakeel
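P.S. On the "an mmap()ed file looks the same" point above: with pmem
exposed through a DAX filesystem (e.g. ext4 or xfs mounted with -o dax
on top of /dev/pmem0), the userspace side really is an ordinary
mmap(), and plain loads and stores go directly to the pmem media. A
minimal sketch (the mount point is hypothetical; real persistence
would also need MAP_SYNC plus explicit flushing, which this omits):

  #!/usr/bin/env python3
  # Map a file on a DAX mount and store through it. The code is
  # byte-for-byte what you'd write for a page-cache-backed file.
  import mmap, os

  PATH = "/mnt/pmem/demo"   # hypothetical DAX mount point
  SIZE = 1 << 20

  fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o600)
  os.ftruncate(fd, SIZE)
  buf = mmap.mmap(fd, SIZE, mmap.MAP_SHARED,
                  mmap.PROT_READ | mmap.PROT_WRITE)
  buf[:5] = b"hello"        # a plain store; on a DAX mapping it hits pmem
  print(bytes(buf[:5]))
  buf.close()
  os.close(fd)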