On Tue, Apr 21, 2020 at 2:59 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> [snip]
>
> We do control very aggressive batch jobs to the extent where they have
> negligible latency impact on interactive services running on the same
> hosts. All the tools to do that are upstream and/or public, but it's
> still pretty new stuff (memory.low, io.cost, cpu headroom control,
> freezer) and they need to be put together just right.
>
> We're working on a demo application that showcases how it all fits
> together and hope to be ready to publish it soon.
>

That would be awesome.

> [snip]
>
> > What do you mean by not interchangeable? If I keep the hot memory (or
> > workingset) of a job in DRAM and cold memory in swap and control the
> > rate of refaults by controlling the definition of cold memory then I
> > am using the DRAM and swap interchangeably and transparently to the
> > job (that is what we actually do).
>
> Right, that's a more precise definition than my randomly chosen "80%"
> number above. There are parts of a workload's memory access curve
> (where x is distinct data accessed and y is the access frequency) that
> don't need to stay in RAM permanently and can be fetched on demand
> from secondary storage without violating the workload's
> throughput/latency requirements. For that part, RAM, swap, disk can be
> interchangeable.
>
> I was specifically talking about the other half of that curve, and
> meant to imply that that's usually bigger than 20%. Usually ;-)
>
> I.e. we cannot say: workload x gets 10G of ram or swap, and it doesn't
> matter whether it gets it in ram or in swap. There is a line somewhere
> in between, and it'll vary with workload requirements, access patterns
> and IO speed. But no workload can actually run with 10G of swap and 0
> bytes worth of direct access memory, right?

Yes.

>
> Since you said before you're using combined memory+swap limits, I'm
> assuming that you configure the resource as interchangeable, but still
> have some form of determining where that cutoff line is between them -
> either by tuning proactive reclaim toward that line or having OOM kill
> policies when the line is crossed and latencies are violated?
>

Yes, more specifically tuning proactive reclaim towards that line. We
define that line in terms of an acceptable refault rate for the job.
The acceptable refault rate is measured through re-use and idle page
histograms (these histograms are collected through our internal
implementation of Page Idle Tracking). I am planning to upstream and
open-source these. (A toy sketch of such a refault-driven loop follows
the compression exchange below.)

>
> > I am also wondering if you guys explored the in-memory compression
> > based swap medium and if there are any reasons to not follow that
> > route.
>
> We played around with it, but I'm ambivalent about it.
>
> You need to identify that perfect "warm" middle section of the
> workingset curve that is 1) cold enough to not need permanent direct
> access memory, yet 2) warm enough to justify allocating RAM to it.
>
> A lot of our workloads have a distinguishable hot set and various
> amounts of fairly cold data during stable states, with not too much
> middle ground in between where compressed swap would really shine.
>
> Do you use compressed swap fairly universally, or more specifically
> for certain workloads?
>

Yes, we are using it fairly universally. There are a few exceptions,
like user-space net and storage drivers.
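To make the proactive reclaim answer above concrete, here is a toy
version of such a refault-driven feedback loop against the cgroup v2
interface. To be clear, this is only an illustration, not our actual
controller: the cgroup path, target rate, and step size are made up,
and the real thing uses the histograms mentioned earlier rather than a
single threshold.

  #!/usr/bin/env python3
  # Toy proactive reclaim: tighten memory.high while the cgroup's
  # refault rate stays under a target, back off when it overshoots.
  # Assumes cgroup v2; CGROUP/TARGET/STEP are illustrative values.
  import time

  CGROUP = "/sys/fs/cgroup/batch/job0"  # hypothetical cgroup
  TARGET = 50          # acceptable refaults/sec for this job (made up)
  STEP   = 16 << 20    # grow/shrink memory.high in 16M increments
  PERIOD = 5           # seconds between samples

  def read_refaults():
      # memory.stat reports workingset_refault (split into _anon/_file
      # on newer kernels); sum whichever variants are present.
      total = 0
      with open(f"{CGROUP}/memory.stat") as f:
          for line in f:
              key, val = line.split()
              if key.startswith("workingset_refault"):
                  total += int(val)
      return total

  def current_usage():
      with open(f"{CGROUP}/memory.current") as f:
          return int(f.read())

  def set_high(limit):
      # Lowering memory.high makes the kernel reclaim the cgroup down
      # toward the new value.
      with open(f"{CGROUP}/memory.high", "w") as f:
          f.write(str(limit))

  high = current_usage()
  prev = read_refaults()
  while True:
      time.sleep(PERIOD)
      cur = read_refaults()
      rate = (cur - prev) / PERIOD
      prev = cur
      if rate < TARGET:
          high -= STEP   # job is comfortable: probe for cold memory
      else:
          high += STEP   # refaulting too hard: give memory back
      high = max(high, 64 << 20)  # arbitrary floor
      set_high(high)
      print(f"refaults/s={rate:.0f} memory.high={high >> 20}M")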
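As for the compressed pool itself: whether a box uses zswap (a
compressed cache in front of the real swap device) or zram (a
compressed RAM block device used as swap) varies, but the upstream
zswap knobs are easy to inspect from userspace. A small sketch using
only the standard sysfs/debugfs paths:

  #!/usr/bin/env python3
  # Dump zswap parameters and pool statistics. The parameters live in
  # sysfs; the statistics live in debugfs and need root to read.
  import os

  for d in ("/sys/module/zswap/parameters", "/sys/kernel/debug/zswap"):
      if not os.path.isdir(d):
          continue
      print(f"-- {d}")
      for name in sorted(os.listdir(d)):
          try:
              with open(os.path.join(d, name)) as f:
                  print(f"  {name} = {f.read().strip()}")
          except OSError as e:
              print(f"  {name} = <unreadable: {e.strerror}>")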
> > Oh you mentioned DAX, that brings to mind a very interesting topic.
> > Are you guys exploring the idea of using PMEM as a cheap slow memory?
> > It is byte-addressable, so, regarding memcg accounting, will you treat
> > it as memory or as a separate resource like swap in v2? How does your
> > memory overcommit model work with such a type of memory?
>
> I think we (the kernel MM community, not we as in FB) are still some
> ways away from having dynamic/transparent data placement for pmem the
> same way we have for RAM. But I expect the kernel's high-level default
> strategy to be similar: order virtual memory (the data) by access
> frequency and distribute across physical memory/storage accordingly.
>
> (With pmem being divided into volatile space and filesystem space,
> where volatile space holds colder anon pages (and, if there is still a
> disk, disk cache), and the sizing decisions between them being similar
> to the ones we use for swap and filesystem today.)
>
> I expect cgroup policy to be separate, because to users the
> performance difference matters. We won't want greedy batch
> applications displacing latency sensitive ones from RAM into pmem,
> just like we don't want this displacement into secondary storage
> today. Other than that, there isn't too much difference to users,
> because paging is already transparent - an mmap()ed file looks the
> same whether it's backed by RAM, by disk or by pmem. The difference is
> access latencies and the aggregate throughput loss they add up to. So
> I could see pmem cgroup limits and protections (for the volatile space
> portion) the same way we have RAM limits and protections.
>
> But yeah, I think this is going a bit off topic ;-)

That's really interesting. Thanks for satisfying my curiosity.

thanks,
Shakeel
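P.S. On the "an mmap()ed file looks the same" point above: with pmem
exposed through a DAX filesystem (e.g. ext4 or xfs mounted with -o dax
on top of /dev/pmem0), the userspace side really is an ordinary
mmap(), and plain loads and stores go directly to the pmem media. A
minimal sketch (the mount point is hypothetical; real persistence
would also need MAP_SYNC plus explicit flushing, which this omits):

  #!/usr/bin/env python3
  # Map a file on a DAX mount and store through it. The code is
  # byte-for-byte what you'd write for a page-cache-backed file.
  import mmap, os

  PATH = "/mnt/pmem/demo"   # hypothetical DAX mount point
  SIZE = 1 << 20

  fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o600)
  os.ftruncate(fd, SIZE)
  buf = mmap.mmap(fd, SIZE, mmap.MAP_SHARED,
                  mmap.PROT_READ | mmap.PROT_WRITE)
  buf[:5] = b"hello"        # a plain store; on a DAX mapping it hits pmem
  print(bytes(buf[:5]))
  buf.close()
  os.close(fd)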