Zi Yan <ziy@xxxxxxxxxx> writes:

> On 19 Jun 2021, at 4:18, Huang, Ying wrote:
>
>> Zi Yan <ziy@xxxxxxxxxx> writes:
>>
>>> On 18 Jun 2021, at 2:15, Huang Ying wrote:

[snip]

>>>> +/*
>>>> + * When memory fills up on a node, memory contents can be
>>>> + * automatically migrated to another node instead of
>>>> + * discarded at reclaim.
>>>> + *
>>>> + * Establish a "migration path" which will start at nodes
>>>> + * with CPUs and will follow the priorities used to build the
>>>> + * page allocator zonelists.
>>>> + *
>>>> + * The difference here is that cycles must be avoided.  If
>>>> + * node0 migrates to node1, then neither node1, nor anything
>>>> + * node1 migrates to can migrate to node0.
>>>> + *
>>>> + * This function can run simultaneously with readers of
>>>> + * node_demotion[].  However, it can not run simultaneously
>>>> + * with itself.  Exclusion is provided by memory hotplug events
>>>> + * being single-threaded.
>>>> + */
>>>> +static void __set_migration_target_nodes(void)
>>>> +{
>>>> +	nodemask_t next_pass	= NODE_MASK_NONE;
>>>> +	nodemask_t this_pass	= NODE_MASK_NONE;
>>>> +	nodemask_t used_targets = NODE_MASK_NONE;
>>>> +	int node;
>>>> +
>>>> +	/*
>>>> +	 * Avoid any oddities like cycles that could occur
>>>> +	 * from changes in the topology.  This will leave
>>>> +	 * a momentary gap when migration is disabled.
>>>> +	 */
>>>> +	disable_all_migrate_targets();
>>>> +
>>>> +	/*
>>>> +	 * Ensure that the "disable" is visible across the system.
>>>> +	 * Readers will see either a combination of before+disable
>>>> +	 * state or disable+after.  They will never see before and
>>>> +	 * after state together.
>>>> +	 *
>>>> +	 * The before+after state together might have cycles and
>>>> +	 * could cause readers to do things like loop until this
>>>> +	 * function finishes.  This ensures they can only see a
>>>> +	 * single "bad" read and would, for instance, only loop
>>>> +	 * once.
>>>> +	 */
>>>> +	smp_wmb();
>>>> +
>>>> +	/*
>>>> +	 * Allocations go close to CPUs, first.  Assume that
>>>> +	 * the migration path starts at the nodes with CPUs.
>>>> +	 */
>>>> +	next_pass = node_states[N_CPU];
>>>
>>> Is there a plan of allowing user to change where the migration
>>> path starts? Or maybe one step further providing an interface
>>> to allow user to specify the demotion path. Something like
>>> /sys/devices/system/node/node*/node_demotion.
>>
>> I don't think that's necessary at least for now.  Do you know any real
>> world use case for this?
>
> In our P9+volta system, GPU memory is exposed as a NUMA node.
> For the GPU workloads with data size greater than GPU memory size,
> it will be very helpful to allow pages in GPU memory to be
> migrated/demoted to CPU memory. With your current assumption,
> GPU memory -> CPU memory demotion seems not possible, right?
> This should also apply to any system with a device memory exposed
> as a NUMA node and workloads running on the device and using CPU
> memory as a lower tier memory than the device memory.

Thanks a lot for your use case!  It appears that the demotion path
specified by users is one possible way to satisfy your requirement.  And
I think it's possible to enable that on top of this patchset.  But we
still have no specific plan to work on that at least for now.

Best Regards,
Huang, Ying