On Mon, Feb 10, 2020 at 6:18 PM Gavin Shan <gshan@xxxxxxxxxx> wrote:
>
> Hi Roman,
>
> On 2/11/20 12:31 PM, Roman Gushchin wrote:
> > On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
> >> On 2/11/20 3:17 AM, Roman Gushchin wrote:
> >>> On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
> >>>> commit 68600f623d69 ("mm: don't miss the last page because of round-off
> >>>> error") makes the scan size round up to @denominator regardless of the
> >>>> memory cgroup's state, online or offline. This affects the overall
> >>>> reclaiming behavior: the corresponding LRU list is eligible for reclaiming
> >>>> only when its size logically right shifted by @sc->priority is bigger than
> >>>> zero in the former formula (non-roundup one).
> >>>
> >>> Not sure I fully understand, but wasn't it so before 68600f623d69 too?
> >>>
> >>
> >> It's correct that "(non-roundup one)" is a typo and should have been dropped.
> >> It will be corrected in v2 if needed.
> >
> > Thanks!
> >
> >>
> >>>> For example, the inactive anonymous LRU list should have at least 0x4000
> >>>> pages to be eligible for reclaiming when we have 60/12 for
> >>>> swappiness/priority and without taking the scan/rotation ratio into
> >>>> account. After the roundup is applied, the inactive anonymous LRU list
> >>>> becomes eligible for reclaiming when its size is bigger than or equal to
> >>>> 0x1000 under the same conditions.
> >>>>
> >>>>    (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
> >>>>    (((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
> >>>>
> >>>> aarch64 has a 512MB huge page size when the base page size is 64KB. The
> >>>> memory cgroup that has a huge page is always eligible for reclaiming in
> >>>> that case. The reclaiming is likely to stop after the huge page is
> >>>> reclaimed, meaning the subsequent @sc->priority and memory cgroups will be
> >>>> skipped. It changes the overall reclaiming behavior. This fixes the issue
> >>>> by applying the roundup to offlined memory cgroups only, to give more
> >>>> preference to reclaiming memory from offlined memory cgroups. That sounds
> >>>> reasonable as their memory is likely to be useless.
> >>>
> >>> So is the problem that relatively small memory cgroups are getting reclaimed
> >>> on default prio, whereas before they were skipped?
> >>>
> >>
> >> Yes, you're correct. There are two dimensions for global reclaim: priority
> >> (sc->priority) and memory cgroup. The scan/reclaim is carried out by iterating
> >> over these two dimensions until enough pages have been reclaimed. If the
> >> roundup applied to the current memory cgroup happens to reclaim enough memory,
> >> the subsequent priorities and memory cgroups will be skipped.
> >>
> >>>>
> >>>> The issue was found by starting up 8 VMs on an Ampere Mustang machine,
> >>>> which has 8 CPUs and 16 GB memory. Each VM is given 2 vCPUs and 2GB of
> >>>> memory. 784MB swap space is consumed after these 8 VMs are completely up.
> >>>> Note that KSM is disabled while THP is enabled in the testing. With this
> >>>> applied, the consumed swap space decreased to 60MB.
> >>>>
> >>>>            total    used    free   shared  buff/cache   available
> >>>>    Mem:    16196   10065    2049       16        4081        3749
> >>>>    Swap:    8175     784    7391
> >>>>            total    used    free   shared  buff/cache   available
> >>>>    Mem:    16196   11324    3656       24        1215        2936
> >>>>    Swap:    8175      60    8115
> >>>
> >>> Does it lead to any performance regressions? Or is it only about increased
> >>> swap usage?
> >>>
> >>
> >> Apart from swap usage, it also caused a performance downgrade in my case.
With > >> your patch (68600f623d69) included, it took 264 seconds to bring up 8 VMs. > >> However, 236 seconds are used to do same thing with my patch applied on top > >> of yours. There is 10% performance downgrade. It's the reason why I had a > >> stable tag. > > > > I see... > > > > I will put these data into the commit log of v2, which will be posted shortly. > > >> > >>>> > >>>> Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error") > >>>> Cc: <stable@xxxxxxxxxxxxxxx> # v4.20+ > >>>> Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx> > >>>> --- > >>>> mm/vmscan.c | 9 ++++++--- > >>>> 1 file changed, 6 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c > >>>> index c05eb9efec07..876370565455 100644 > >>>> --- a/mm/vmscan.c > >>>> +++ b/mm/vmscan.c > >>>> @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, > >>>> /* > >>>> * Scan types proportional to swappiness and > >>>> * their relative recent reclaim efficiency. > >>>> - * Make sure we don't miss the last page > >>>> - * because of a round-off error. > >>>> + * Make sure we don't miss the last page on > >>>> + * the offlined memory cgroups because of a > >>>> + * round-off error. > >>>> */ > >>>> - scan = DIV64_U64_ROUND_UP(scan * fraction[file], > >>>> + scan = mem_cgroup_online(memcg) ? > >>>> + div64_u64(scan * fraction[file], denominator) : > >>>> + DIV64_U64_ROUND_UP(scan * fraction[file], > >>>> denominator); > >>> > >>> It looks a bit strange to round up for offline and basically down for > >>> everything else. So maybe it's better to return to something like > >>> the very first version of the patch: > >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kernel_msg2883146.html&d=DwIC-g&c=5VD0RTtNlTh3ycd41b3MUw&r=jJYgtDM7QT-W-Fz_d29HYQ&m=urGWFxpEgETD4pryLqIYaKdVUk1Munj_zLpJthvrreM&s=k2RDZGNcvb_Sia2tZwcMPZ79Mad5dw1oT8JdIy0rkGY&e= ? > >>> For memcg reclaim reasons we do care only about an edge case with few pages. > >>> > >>> But overall it's not obvious to me, why rounding up is worse than rounding down. > >>> Maybe we should average down but accumulate the reminder? > >>> Creating an implicit bias for small memory cgroups sounds groundless. > >>> > >> > >> I don't think v1 path works for me either. The logic in v1 isn't too much > >> different from commit 68600f623d69. v1 has selective roundup, but current > >> code is having a forced roundup. With 68600f623d69 reverted and your v1 > >> patch applied, it took 273 seconds to bring up 8 VMs and 1752MB swap is used. > >> It looks more worse than 68600f623d69. > >> > >> Yeah, it's not reasonable to have a bias on all memory cgroups regardless > >> their states. I do think it's still right to give bias to offlined memory > >> cgroups. > > > > I don't think so, it really depends on the workload. Imagine systemd restarting > > a service due to some update or with other arguments. Almost entire pagecache > > is relevant and can be reused by a new cgroup. > > > > Indeed, it depends on the workload. This patch is to revert 68600f623d69 for online > memory cgroups, but keep the logic for offlined memory cgroup to avoid breaking your > case. > > There is something which might be unrelated to discuss here: the pagecache could be backed > by a low-speed (HDD) or high-speed (SSD) media. So the cost to fetch them from disk to memory > isn't equal, meaning we need some kind of bias during reclaiming. It seems something missed > from current implementation. 

Yes, the refault cost was not taken into account. I recall Johannes posted a
patch series a couple of years ago that weighted swap by the refault cost;
please see: https://lwn.net/Articles/690079/.

>
> >> So the point is that we need to take care of the memory cgroup's state and
> >> apply the bias to offlined ones only. The offlined memory cgroup is going
> >> to die or is already dead. It's unlikely for its memory to be used by
> >> someone, but still possible. So it's reasonable to squeeze the used memory
> >> of an offlined memory cgroup hard if possible.
> >
> > Anyway, I think your version is good to mitigate the regression.
> > So, please feel free to add
> > Acked-by: Roman Gushchin <guro@xxxxxx>
> >
>
> Thanks, Roman! It will be included in v2.
>
> > But I think we need something more clever long-term: e.g. accumulate
> > the leftover from the division and add it to the next calculation.
> >
> > If you can test such an approach on your workload, that would be nice.
> >
>
> Yeah, we need something smart in the long run. Let's see if I can sort/test
> it out and then come back to you.
>
> Thanks,
> Gavin
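On the long-term "accumulate the leftover" idea: just to illustrate what was
suggested, below is a rough, untested userspace sketch (the scan_state struct
and scan_portion() helper are made-up names for illustration, not existing
kernel code). The remainder from each pass is carried into the next one, so a
small LRU list is neither rounded up on every pass nor starved forever. The
numbers reuse the 60/201 swappiness/denominator example from the commit log.

    /* Standalone illustration, not kernel code. */
    #include <stdint.h>
    #include <stdio.h>

    struct scan_state {
    	uint64_t leftover;	/* remainder carried over from previous passes */
    };

    static uint64_t scan_portion(struct scan_state *st, uint64_t scan,
    			     uint64_t fraction, uint64_t denominator)
    {
    	uint64_t product = scan * fraction + st->leftover;

    	st->leftover = product % denominator;	/* keep the residue for the next pass */
    	return product / denominator;		/* round down for this pass */
    }

    int main(void)
    {
    	struct scan_state st = { 0 };
    	uint64_t total = 0;
    	int i;

    	/*
    	 * A tiny LRU (scan = 1 page) with swappiness 60 and denominator
    	 * 60 + 140 + 1 = 201: rounding down would scan nothing forever,
    	 * rounding up would scan a page on every pass. Carrying the
    	 * remainder scans a page roughly every three or four passes.
    	 */
    	for (i = 0; i < 16; i++)
    		total += scan_portion(&st, 1, 60, 201);

    	printf("pages scanned over 16 passes: %llu\n",
    	       (unsigned long long)total);
    	return 0;
    }

In the kernel, the leftover would presumably have to live per-lruvec and per
scan type, which is where the real complexity of such an approach lies.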