On 4/22/2021 1:56 AM, David Hildenbrand wrote:
> On 22.04.21 09:49, Michal Hocko wrote:
>> Cc David and Oscar who are familiar with this code as well.
>>
>> On Wed 21-04-21 11:36:01, Florian Fainelli wrote:
>>> Hi all,
>>>
>>> I have been trying for the past few days to identify the source of a
>>> performance regression that we are seeing with the 5.4 kernel but not
>>> with the 4.9 kernel on ARM64. Testing something newer like 5.10 is a
>>> bit challenging at the moment but will happen eventually.
>>>
>>> What we are seeing is a ~3x increase in the time needed for
>>> alloc_contig_range() to allocate 1GB in blocks of 2MB pages. The system
>>> is idle at the time and there are no other contenders for memory other
>>> than the user-space programs already started (DHCP client, shell, etc.).
>
> Hi,
>
> If you can easily reproduce it, it might be worth just trying to
> bisect; that could be faster than manually poking around in the code.
>
> Also, it would be worth having a look at the state of upstream Linux.
> Upstream Linux developers tend not to care about minor performance
> regressions on oldish kernels.

This is a big pain point here and I cannot agree more, but until we
bridge that gap, this is not exactly easy for me to do, unfortunately,
and neither is bisection :/

>
> There has been work on improving exactly the situation you are
> describing -- a "fail fast" / "no retry" mode for alloc_contig_range().
> Maybe it tackles exactly this issue.
>
> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@xxxxxxxxxx
>
> Minchan is already on cc.

This patch does not appear to be helping; in fact, I had locally
applied this patch from way back when:

https://lkml.org/lkml/2014/5/28/113

which would effectively do this unconditionally. Let me see if I can
showcase this problem in an x86 virtual machine operating in similar
conditions to ours.

>
> (next time, please cc linux-mm on core-mm questions; maybe you tried,
> but ended up with linux-mmc :) )

Yes, that was the intent, thanks for correcting that.

>
>>>
>>> I have tried playing with the compact_control structure settings but
>>> have not found anything that would bring us back to the performance of
>>> 4.9. More often than not, we see test_pages_isolated() returning a
>>> non-zero error code, which would explain the slowdown, since we have
>>> some logic that retries the allocation if alloc_contig_range() returns
>>> -EBUSY. If I remove the retry logic however, we don't get -EBUSY and we
>>> get the results below:
>>>
>>> 4.9 shows this:
>>>
>>> [ 457.537634] allocating: size: 1024MB avg: 59172 (us), max: 137306
>>> (us), min: 44859 (us), total: 591723 (us), pages: 512, per-page: 115
>>> (us)
>>> [ 457.550222] freeing: size: 1024MB avg: 67397 (us), max: 151408 (us),
>>> min: 52630 (us), total: 673974 (us), pages: 512, per-page: 131 (us)
>>>
>>> 5.4 shows this:
>>>
>>> [ 222.388758] allocating: size: 1024MB avg: 156739 (us), max: 157254
>>> (us), min: 155915 (us), total: 1567394 (us), pages: 512, per-page:
>>> 306 (us)
>>> [ 222.401601] freeing: size: 1024MB avg: 209899 (us), max: 210085 (us),
>>> min: 209749 (us), total: 2098999 (us), pages: 512, per-page: 409 (us)
>>>
>>> This regression is not seen when MIGRATE_CMA is specified instead of
>>> MIGRATE_MOVABLE.
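
For context, the retry logic mentioned above boils down to something
like the following (a simplified sketch against the v5.4 API; the
helper name and the retry bound are made up for illustration, only
alloc_contig_range() and its v5.4 signature are real):

#include <linux/mm.h>
#include <linux/gfp.h>	/* alloc_contig_range(), needs CONFIG_CONTIG_ALLOC on v5.4 */
#include <linux/sizes.h>

#define BLOCK_PAGES	(SZ_2M >> PAGE_SHIFT)	/* 512 pages per 2MB block */
#define MAX_TRIES	5			/* arbitrary bound for this sketch */

/* Allocate one 2MB block of movable pages, retrying while isolation fails. */
static int try_alloc_2mb(unsigned long pfn)
{
	int tries = 0;
	int ret;

	do {
		/* v5.4 signature: start pfn, end pfn, migratetype, gfp mask */
		ret = alloc_contig_range(pfn, pfn + BLOCK_PAGES,
					 MIGRATE_MOVABLE, GFP_KERNEL);
		/* test_pages_isolated() failures surface here as -EBUSY */
	} while (ret == -EBUSY && ++tries < MAX_TRIES);

	return ret;
}

The per-page figures in the logs are simply the per-run average divided
by the 512 blocks (e.g. 59172 / 512 ≈ 115 us on 4.9, 156739 / 512 ≈ 306
us on 5.4).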
>>>
>>> A few characteristics that you should probably be aware of:
>>>
>>> - There is 4GB of memory populated, with the memory being mapped into
>>> the CPU's address space starting at 0x4000_0000 (1GB); PAGE_SIZE is
>>> 4KB
>>>
>>> - there is a ZONE_DMA32 that starts at 0x4000_0000 and ends at
>>> 0xE480_0000; from there on we have a ZONE_MOVABLE which is comprised
>>> of 0xE480_0000 - 0xFDC0_0000 and another range spanning
>>> 0x1_0000_0000 - 0x1_4000_0000
>>>
>>> Attached is the kernel configuration.
>>>
>>> Thanks!
>>> --
>>> Florian

--
Florian
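
For reference, the zone spans quoted above work out as follows (plain
address arithmetic on the reported ranges; the macro names below are
illustrative, not kernel symbols):

#include <linux/sizes.h>

/* ZONE_DMA32: 0xE480_0000 - 0x4000_0000 = 0xA480_0000 bytes ~= 2632 MB */
#define DMA32_BYTES	(0xE4800000UL - 0x40000000UL)

/* ZONE_MOVABLE: (0xFDC0_0000 - 0xE480_0000) = 0x1940_0000 ~=  404 MB
 *             + (0x1_4000_0000 - 0x1_0000_0000) = 0x4000_0000 = 1024 MB
 * for a total of ~1428 MB, meaning the 1GB contiguous allocation has
 * to claim roughly 70% of ZONE_MOVABLE. */
#define MOVABLE_BYTES	((0xFDC00000UL - 0xE4800000UL) + SZ_1G)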