On Mon, 8 Apr 2019, Mel Gorman wrote:

> On Sat, Apr 06, 2019 at 11:20:35AM -0400, Mikulas Patocka wrote:
> > Hi
> >
> > The patch 1c30844d2dfe272d58c8fc000960b835d13aa2ac ("mm: reclaim small
> > amounts of memory when an external fragmentation event occurs") breaks
> > memory management on parisc.
> >
> > I have a parisc machine with 7GiB RAM, the chipset maps the physical
> > memory to three zones:
> >  0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB
> >  1) Start 0x0000000100000000 End 0x00000001bfdfffff Size 3070 MB
> >  2) Start 0x0000004040000000 End 0x00000040ffffffff Size 3072 MB
> > (but it is not NUMA)
> >
> > With the patch 1c30844d2, the kernel will incorrectly reclaim the first
> > zone when it fills up, ignoring the fact that there are two completely
> > free zones. Basically, it limits the cache size to 1GiB.
> >
> > For example, if I run:
> > # dd if=/dev/sda of=/dev/null bs=1M count=2048
> >
> > - with the proper kernel, there should be "Buffers - 2GiB" when this
> > command finishes. With the patch 1c30844d2, buffers will consume just
> > 1GiB or slightly more, because the kernel was incorrectly reclaiming
> > them.
>
> I could argue that the feature is behaving as expected for separate
> pgdats but that's neither here nor there. The bug is real but I have a
> few questions.
>
> First, if pa-risc is !NUMA then why are separate local ranges
> represented as separate nodes? Is it because of DISCONTIGMEM or something
> else?

DISCONTIGMEM is before my time, so I'm not familiar with it, and I'm not
an expert in this area; I don't know.

> I consider it "essentially dead" but the arch init code seems to setup
> pgdats for each physical contiguous range so it's a possibility. The most
> likely explanation is pa-risc does not have hardware with addressing
> limitations smaller than the CPU's physical address limits and it's
> possible to have more ranges than available zones but clarification would
> be nice.
> By rights, SPARSEMEM would be supported on pa-risc but that
> would be a time-consuming and somewhat futile exercise. Regardless of the
> explanation, as pa-risc does not appear to support transparent hugepages,
> an option is to special case watermark_boost_factor to be 0 on DISCONTIGMEM
> as that commit was primarily about THP with secondary concerns around
> SLUB. This is probably the most straight-forward solution but it'd need
> a comment obviously. I do not know what the distro configurations for
> pa-risc set as I'm not a user of gentoo or debian.

I use Debian Sid, but I compile my own kernel. I uploaded the kernel
.config here:
http://people.redhat.com/~mpatocka/testcases/parisc-config.txt

> Second, if you set the sysctl vm.watermark_boost_factor=0, does the
> problem go away? If so, an option would be to set this sysctl to 0 by
> default on distros that support pa-risc. Would that be suitable?

I have tried it and the problem almost goes away. With
vm.watermark_boost_factor=0, if I read 2GiB of data from the disk, the
buffer cache will contain about 1.8GiB. So there is still some superfluous
page reclaim, but it is smaller.

BTW, I'm interested - on real NUMA machines, is reclaiming the file cache
really a better option than allocating the file cache from a non-local
node?

> Finally, I'm sure this has been asked before but why is pa-risc alive?
> It appears a new CPU has not been manufactured since 2005. Even Alpha
> I can understand being semi-alive since it's an interesting case for
> weakly-ordered memory models. pa-risc appears to be supported and active
> for debian at least so someone cares. It's not the only feature like this
> that is bizarrely alive but it is curious -- 32 bit NUMA support on x86,
> I'm looking at you, your machines are all dead since the early 2000's
> AFAIK and anyone else using NUMA on 32-bit x86 needs their head examined.

I use it to test programs for portability to RISC.
If one could choose between buying an expensive POWER system or a cheap
pa-risc system, pa-risc may be a better choice. The last pa-risc model has
four cores at 1.1GHz, so it is not completely unusable.

Mikulas

> --
> Mel Gorman
> SUSE Labs
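
PS: If distros do end up defaulting this off for pa-risc, I'd guess it
would just be a drop-in sysctl fragment, something like this (the file
name is made up, of course; whatever the distro's sysctl.d convention is):

```
# /etc/sysctl.d/99-parisc-reclaim.conf
# Disable watermark boosting (added by commit 1c30844d2); on pa-risc
# DISCONTIGMEM layouts it causes excessive reclaim of the lowest zone.
vm.watermark_boost_factor = 0
```

That, or the runtime equivalent "sysctl -w vm.watermark_boost_factor=0",
which is what I used for the test above.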