On 6/15/18 3:23 AM, Mel Gorman wrote:
> On Thu, Jun 14, 2018 at 02:47:39PM -0600, Jens Axboe wrote:
>>>>> Will numactl ... modprobe brd ... solve this problem?
>>>>
>>>> It won't, pages are allocated as needed.
>>>>
>>>
>>> Then how about a numactl ... dd /dev/ram ... after the modprobe.
>>
>> Yes of course, or you could do that for every application that ends
>> up in the path of doing IO to it. The point of the option is to
>> just make it explicit, and not have to either NUMA pin each task,
>> or prefill all possible pages.
>>
>
> It's certainly possible from userspace using dd and numactl setting the
> desired memory policy. mmtests has the following snippet when setting
> up a benchmark using brd to deal with both NUMA artifacts and variable
> performance due to first faults early in the lifetime of a benchmark.
>
> modprobe brd rd_size=$((TESTDISK_RD_SIZE/1024))
> if [ "$TESTDISK_RD_PREALLOC" == "yes" ]; then
> 	if [ "$TESTDISK_RD_PREALLOC_NODE" != "" ]; then
> 		tmp_prealloc_cmd="numactl -N $TESTDISK_RD_PREALLOC_NODE"
> 	else
> 		tmp_prealloc_cmd="numactl -i all"
> 	fi
> 	$tmp_prealloc_cmd dd if=/dev/zero of=/dev/ram0 bs=1M &>/dev/null
> fi
>
> (Haven't actually validated this in a long time, but it worked at some point.)

You'd want to make this oflag=direct as well (this goes for Adam, too),
or you could have pages being written that are NOT issued by dd.

> First option allocates just from one node, the other interleaves between
> everything. Any combination of nodes or policies can be used and this was
> very simple, but it's what was needed at the time. The question is how
> far do you want to go with supporting policies within the module?

Not far, imho :-)

> One option would be to keep this very simple like the patch suggests so users
> get the hint that it's even worth considering and then point at a document
> on how to do more complex policies from userspace at device creation time.
> Another is simply to document the hazard that the locality of memory is
> controlled by the memory policy of the first task that touches it.

I like the simple option, especially since (as Christoph pointed out)
if we fail allocating from the given node, we'll just go elsewhere.

-- 
Jens Axboe
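[Editorial sketch, not part of the thread: the mmtests fragment quoted above with Jens's oflag=direct suggestion folded in, so every written page is submitted by dd itself rather than left dirty for later writeback. The `prealloc_brd` function name and the `DRY_RUN` switch are invented here for illustration; the numactl/dd invocation follows the quoted snippet.]

```shell
# Sketch: preallocate a brd device's pages under an explicit NUMA policy.
# An empty node argument interleaves across all nodes, as in mmtests.
prealloc_brd() {
	dev=$1    # e.g. /dev/ram0
	node=$2   # NUMA node to allocate from; empty means interleave
	if [ -n "$node" ]; then
		cmd="numactl -N $node"
	else
		cmd="numactl -i all"
	fi
	if [ -n "$DRY_RUN" ]; then
		# Print the command instead of running it (no root or numactl needed).
		echo "$cmd dd if=/dev/zero of=$dev bs=1M oflag=direct"
	else
		# dd stops with ENOSPC once the device is full; that is expected.
		$cmd dd if=/dev/zero of="$dev" bs=1M oflag=direct 2>/dev/null
	fi
}

# Usage:
#   prealloc_brd /dev/ram0 0     # fill from node 0 only
#   prealloc_brd /dev/ram0 ""    # interleave across all nodes
```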