On Mon, 12 Apr 2021 12:20:22 -0700 Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:

> On Fri, Apr 9, 2021 at 4:26 PM Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
> >
> > On 4/8/21 4:52 AM, Michal Hocko wrote:
> >
> > >> The top tier memory used is reported in
> > >>
> > >> memory.toptier_usage_in_bytes
> > >>
> > >> The amount of top tier memory usable by each cgroup without
> > >> triggering page reclaim is controlled by the
> > >>
> > >> memory.toptier_soft_limit_in_bytes
> >
> > Michal,
> >
> > Thanks for your comments.  I would like to take a step back and
> > look at the eventual goal we envision: a mechanism to partition the
> > tiered memory between the cgroups.
> >
> > A typical use case may be a system with two sets of tasks.
> > One set of tasks is very latency sensitive and we desire instantaneous
> > response from them.  Another set of tasks will be running batch jobs
> > where latency and performance are not critical.  In this case,
> > we want to carve out enough top tier memory such that the working set
> > of the latency sensitive tasks can fit entirely in the top tier memory.
> > The rest of the top tier memory can be assigned to the background tasks.
> >
> > To achieve such cgroup based tiered memory management, we probably want
> > something like the following.
> >
> > For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> > where tier t_0 sits at the top and demotes to the lower tiers.
> > We envision for this top tier memory t_0 the following knobs and counters
> > in the cgroup memory controller:
> >
> > memory_t0.current   Current usage of tier 0 memory by the cgroup.
> >
> > memory_t0.min       If tier 0 memory used by the cgroup falls below this low
> >                     boundary, the memory will not be subjected to demotion
> >                     to lower tiers to free up memory at tier 0.
> >
> > memory_t0.low       Above this boundary, the tier 0 memory will be subjected
> >                     to demotion.  The demotion pressure will be proportional
> >                     to the overage.
> >
> > memory_t0.high      If tier 0 memory used by the cgroup exceeds this high
> >                     boundary, allocation of tier 0 memory by the cgroup will
> >                     be throttled.  The tier 0 memory used by this cgroup
> >                     will also be subjected to heavy demotion.
> >
> > memory_t0.max       This will be a hard usage limit of tier 0 memory on the cgroup.
> >
> > If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
> > This closely follows the design of the general memory controller interface.
> >
> > Does such an interface look sane and acceptable to everyone?
>
> I have a couple of questions.  Let's suppose we have a two socket
> system: Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
> 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
> Based on the tier definition of this patch series, tier_0: {node_0,
> node_1} and tier_1: {node_2, node_3}.
>
> My questions are:
>
> 1) Can we assume that the cost of access within a tier will always be
> less than the cost of access across tiers? (node_0 <-> node_1 vs
> node_0 <-> node_2)

No in large systems, even if we can make this assumption in 2-socket ones.

> 2) If yes to (1), is that assumption future proof?  Will future
> systems with DRAM over CXL support have the same characteristics?
>
> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3)  For jobs running on node_0, node_3
> might be third tier, and similarly for jobs running on node_1, node_2
> might be third tier.
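For (1) and (3), the SLIT distances reported by firmware already give a
rough (and sometimes inaccurate) answer on any particular box, and they
are easy to eyeball from userspace.  A minimal sketch, assuming the
hypothetical 4-node layout above and using libnuma's numa_distance()
(build with gcc -o slit slit.c -lnuma); illustration only, not part of
the series:

/*
 * Dump the SLIT distance matrix so the "within a tier is cheaper than
 * across tiers" and "tier_0 -> tier_1 cost is uniform" assumptions can
 * be checked on a given machine.
 */
#include <stdio.h>
#include <numa.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not available\n");
		return 1;
	}

	int max = numa_max_node();

	for (int i = 0; i <= max; i++) {
		for (int j = 0; j <= max; j++)
			printf("%5d", numa_distance(i, j));
		printf("\n");
	}

	/*
	 * With the node numbering above: distance(0,1) > distance(0,2)
	 * would break assumption (1), and distance(0,2) != distance(0,3)
	 * would mean the tier_0 -> tier_1 cost is not uniform.
	 */
	return 0;
}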
> The reason I am asking these questions is that statically
> partitioning memory nodes into tiers will inherently add platform
> specific assumptions to the user API.

Absolutely agree.

> Assumptions like:
> 1) Access within a tier is always cheaper than across tiers.
> 2) Access from tier_i to tier_i+1 has uniform cost.
>
> The reason I am more inclined towards having NUMA-centric control is
> that we don't have to make these assumptions.  Though the usability
> will be more difficult.  Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.

Sounds good, will look out for that.

Jonathan
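A purely illustrative aside on the interface Tim sketches above: driving
the proposed per-tier knobs for the latency-sensitive vs. batch split
would presumably look something like the snippet below.  None of these
files exist in any kernel today, and the cgroup path and the 16G/32G
values are made up for the example:

#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	if (fputs(val, f) == EOF) {
		perror(path);
		fclose(f);
		return -1;
	}
	return fclose(f);
}

int main(void)
{
	/* Hypothetical cgroup v2 directory for the latency-sensitive jobs. */
	const char *cg = "/sys/fs/cgroup/latency-sensitive";
	char path[256];

	/* Keep ~16G of tier 0 (DRAM) safe from demotion for this group... */
	snprintf(path, sizeof(path), "%s/memory_t0.min", cg);
	write_knob(path, "17179869184");

	/* ...and cap its tier 0 usage at 32G so batch jobs keep some DRAM. */
	snprintf(path, sizeof(path), "%s/memory_t0.max", cg);
	write_knob(path, "34359738368");

	return 0;
}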