On Wed, Mar 27, 2019 at 7:09 PM Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> wrote:
> On 3/27/19 1:09 PM, Michal Hocko wrote:
> > On Wed 27-03-19 11:59:28, Yang Shi wrote:
> >>
> >> On 3/27/19 10:34 AM, Dan Williams wrote:
> >>> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >>>> On Tue 26-03-19 19:58:56, Yang Shi wrote:
> > [...]
> >>>>> It is still NUMA, users still can see all the NUMA nodes.
> >>>> No, the Linux NUMA implementation makes all NUMA nodes available by
> >>>> default and provides an API to opt in for finer tuning. What you are
> >>>> suggesting goes against that semantic, and I am asking why. How is a
> >>>> pmem NUMA node any different from any other distant node in principle?
> >>> Agree. It's just another NUMA node and shouldn't be special cased.
> >>> Userspace policy can choose to avoid it, but typical node distance
> >>> preference should otherwise let the kernel fall back to it as
> >>> additional memory pressure relief for "near" memory.
> >> In the ideal case, yes, I agree. However, in the real world performance
> >> is a concern. It is well known that PMEM (not considering NVDIMM-F or
> >> HBM) has higher latency and lower bandwidth than DRAM. We observed much
> >> higher latency on PMEM than on DRAM with multiple threads.
> > One rule of thumb is: do not design user-visible interfaces based on
> > contemporary technology and its up/down sides. That will almost always
> > fire back.
>
> Thanks. It does make sense to me.
>
> > Btw. you keep arguing about performance without any numbers. Can you
> > present something specific?
>
> Yes, I do have some numbers. We ran a simple sequential read/write
> memory latency test with an in-house test program, bound to PMEM and to
> DRAM respectively. With 20 threads the results are:
>
>          Threads    w/lat      r/lat
> PMEM     20         537.15     68.06
> DRAM     20          14.19      6.47
>
> And a sysbench test with the command:
>
> sysbench --time=600 memory --memory-block-size=8G
>     --memory-total-size=1024T --memory-scope=global
>     --memory-oper=read --memory-access-mode=rnd
>     --rand-type=gaussian --rand-pareto-h=0.1 --threads=1 run
>
> The result is:
>
>          lat/ms
> PMEM     103766.09
> DRAM      31946.30
>
> >> In a real production environment we don't know what kind of
> >> applications will end up on PMEM (DRAM may be full and allocations may
> >> fall back to PMEM) and then see unexpected performance degradation. I
> >> understand a mempolicy can be used to avoid it, but there might be
> >> hundreds or thousands of applications running on the machine; it does
> >> not sound feasible to me to have every single application set a
> >> mempolicy to avoid it.
> > We have the cpuset cgroup controller to help here.
> >
> >> So, I think we still need a default allocation node mask. The default
> >> value may include all nodes or just DRAM nodes, but users should be
> >> able to override it globally, not only on a per-process basis.
> >>
> >> Due to the performance disparity, our current use cases treat PMEM as
> >> second-tier memory, for demoting cold pages or for binding applications
> >> that are not sensitive to memory access latency (this is the reason for
> >> inventing a new mempolicy), even though it is a NUMA node.
> > If the performance sucks that badly then do not use the pmem as NUMA,
> > really. There are certainly other ways to export the pmem storage. Use
> > it as a fast swap storage. Or try to work on a swap caching mechanism
> > that still allows much faster access than a slow swap storage. But do
> > not try to abuse the NUMA interface while breaking some of its
> > long-established semantics.
>
> Yes, we are looking into using it as fast swap storage too, and perhaps
> other use cases.
>
> Anyway, since nobody thinks it makes sense to restrict the default
> allocation nodes, and it does sound over-engineered, I'm going to drop
> it.
>
> One question: when doing demotion and promotion we need to define a
> path, for example DRAM <-> PMEM (assuming two memory tiers). When
> determining which nodes are "DRAM" nodes, does it make sense to assume
> that nodes with both CPUs and memory are DRAM nodes, since PMEM nodes
> are typically cpuless?
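For what it's worth, with the existing nodemask API that heuristic would
look roughly like the sketch below (an illustration only, not code from
any posted series; the reply that follows explains why the underlying
assumption is fragile):

	#include <linux/nodemask.h>

	/*
	 * Sketch of the proposed heuristic: treat any node that has both
	 * memory and CPUs as "DRAM", everything else as a candidate
	 * demotion target.
	 */
	static nodemask_t guess_dram_nodes(void)
	{
		nodemask_t dram_nodes = NODE_MASK_NONE;
		int nid;

		for_each_node_state(nid, N_MEMORY)	/* nodes with memory... */
			if (node_state(nid, N_CPU))	/* ...that also have CPUs */
				node_set(nid, dram_nodes);

		return dram_nodes;
	}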
For ACPI platforms the HMAT is effectively going to enforce "cpu-less"
nodes for any memory range that has differentiated performance from the
conventional memory pool, or differentiated performance for a specific
initiator. So "cpu-less == PMEM" is not a robust assumption.

The plan is to use the HMAT to populate the default fallback order, but
allow for an override if the HMAT information is missing or incorrect.
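On the earlier point that "userspace policy can choose to avoid it": the
existing mempolicy API already lets a task restrict its allocations to
specific nodes, with no new interface needed. A minimal sketch (node 0
here is only a placeholder for whatever DRAM node the platform actually
reports):

	#include <numaif.h>	/* set_mempolicy(), MPOL_BIND; link with -lnuma */
	#include <stdio.h>

	int main(void)
	{
		/* Bitmask of allowed nodes: node 0 only (placeholder). */
		unsigned long nodemask = 1UL << 0;

		if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)))
			perror("set_mempolicy");

		/* From here on, this task's allocations come from node 0 only. */
		return 0;
	}

The open question in this thread is what the default behavior should be
when no such policy is set.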