Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> writes:

> On Tue, Jul 18, 2023 at 08:55:16AM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> writes:
>>
>> > On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote:
>> >> Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> writes:
>> >>
>> >> > Batch should have a much lower maximum than high because it's a deferred cost
>> >> > that gets assigned to an arbitrary task. The worst case is where a process
>> >> > that is a light user of the allocator incurs the full cost of a refill/drain.
>> >> >
>> >> > Again, intuitively this may be a PID control problem for the "Mix" case,
>> >> > to estimate the size of high required to minimise drains/allocs, as each
>> >> > drain/alloc is potentially a lock contention. The catch-all for corner
>> >> > cases would be to decay high from vmstat context based on pcp->expires. The
>> >> > decay would prevent "high" being pinned at an artificially high value
>> >> > without any zone lock contention for prolonged periods of time, and also
>> >> > mitigate the worst case due to state being per-cpu. The downside is that "high"
>> >> > would also oscillate for a continuous steady allocation pattern, as the PID
>> >> > control might pick an ideal value suitable for a long period of time with
>> >> > the "decay" disrupting that ideal value.
>> >>
>> >> Maybe we can track the minimal value of pcp->count. If it has been small
>> >> enough recently, we can avoid decaying pcp->high, because the pages in
>> >> the PCP are being used for allocations rather than sitting idle.
>> >
>> > Implement that as a separate patch. I suspect this type of heuristic will be
>> > very benchmark-specific and the complexity may not be worth it in the
>> > general case.
>>
>> OK.
>>
>> >> Another question is as follows.
>> >>
>> >> For example, on CPU A, a large number of pages are freed, and we
>> >> maximize batch and high. So, a large number of pages are put in the PCP.
>> >> Then, the possible situations may be:
>> >>
>> >> a) a large number of pages are allocated on CPU A after some time
>> >> b) a large number of pages are allocated on another CPU B
>> >>
>> >> For a), we want the pages to be kept in the PCP of CPU A as long as possible.
>> >> For b), we want the pages to be kept in the PCP of CPU A for as short a
>> >> time as possible. I think we need to balance between them. What is a
>> >> reasonable time to keep pages in the PCP without many allocations?
>> >>
>> >
>> > This would be a case where you're relying on vmstat to drain the PCP after
>> > a period of time as it is a corner case.
>>
>> Yes. The remaining question is: how long should "a period of time" be?
>
> Match the time used for draining "remote" pages from the PCP lists. The
> choice is arbitrary and no matter what value is chosen, it'll be possible
> to build an adverse workload.

OK.

>> If it's long, the pages in the PCP can be used for allocation after some
>> time. If it's short, the pages can be put back in the buddy list, so they
>> can be used by other workloads if needed.
>>
>
> Assume that the main reason to expire pages and put them back on the buddy
> list is to avoid premature allocation failures due to pages pinned on the
> PCP. Once pages are going back onto the buddy list and the expiry is hit,
> it might as well be assumed that the pages are cache-cold. Some bad corner
> cases should be mitigated by disabling the adaptive sizing when reclaim is
> active.

Yes. This can be mitigated, but page allocation performance may be hurt.

> The big remaining corner case to watch out for is where the sum
> of the boosted pcp->high values exceeds the low watermark. If that should
> ever happen then potentially a premature OOM happens because the watermarks
> are fine so no reclaim is active but no pages are available.
> It may even
> be the case that the sum of pcp->high should not exceed *min*, as that
> corner case means that processes may prematurely enter direct reclaim
> (not as bad as OOM, but still bad).

Sorry, I don't understand this. When pages are moved from the buddy list
to the PCP, the zone NR_FREE_PAGES counter is decreased in rmqueue_bulk().
That is, pages in the PCP are counted as used instead of free. And, in
zone_watermark_ok*() and zone_watermark_fast(), zone NR_FREE_PAGES is
used to check against the watermark. So, if my understanding is correct,
even if the number of pages in the PCP is larger than the low/min
watermark, we can still trigger reclaim. Is my understanding correct?

>> Anyway, I will do some experiments on that.
>>
>> > You cannot reasonably detect the pattern on two separate per-cpu lists
>> > without either inspecting remote CPU state or maintaining global
>> > state. Either would incur cache miss penalties that probably cost more
>> > than the heuristic saves.
>>
>> Yes. Totally agree.

Best Regards,
Huang, Ying