On Mon, Feb 21, 2022 at 1:14 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Tue, Feb 15, 2022 at 04:53:56PM -0500, Johannes Weiner wrote:
> > Hi Yu,
> >
> > On Tue, Feb 15, 2022 at 02:43:05AM -0700, Yu Zhao wrote:
> > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > > +{
> > > > > +	unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > > +
> > > > > +	VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > > +
> > > > > +	/* see the comment on MIN_NR_GENS */
> > > > > +	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > > +}
> > > >
> > > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > > set is split into two generations for the sole purpose of the
> > > > second-chance policy for fresh faults, right?
> > >
> > > To be precise, the active/inactive notion on top of generations is
> > > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > > Otherwise, this function wouldn't be needed.
> >
> > Ah! would you mind adding this as a comment to the function?
>
> Will do.
>
> > But AFAICS there is the lru_gen_del_folio() callsite that maps it to
> > the PG_active flag - which in turn gets used by add_folio() to place
> > the thing back on the max_seq generation. So I suppose there is a
> > secondary purpose of the function for remembering the page's rough age
> > for non-reclaim isolation.
> >
> Yes, e.g., migration.
>
> > It would be good to capture that as well in a comment on the function.
>
> Will do.
>
> > > > > +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> > > > > +				       int zone, long delta)
> > > > > +{
> > > > > +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > > > +
> > > > > +	lockdep_assert_held(&lruvec->lru_lock);
> > > > > +	WARN_ON_ONCE(delta != (int)delta);
> > > > > +
> > > > > +	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> > > > > +	__mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> > > > > +}
> > > >
> > > > This is a duplicate of update_lru_size(), please use that instead.
> > > >
> > > > Yeah technically you don't need the mem_cgroup_update_lru_size() but
> > > > that's not worth sweating over, better to keep it simple.
> > >
> > > I agree we don't need the mem_cgroup_update_lru_size() -- let me spell
> > > out why:
> > > this function is not needed here because it updates the counters used
> > > only by the active/inactive lru code, i.e., get_scan_count().
> > >
> > > However, we can't reuse update_lru_size() because MGLRU can trip the
> > > WARN_ONCE() in mem_cgroup_update_lru_size().
> > >
> > > Unlike lru_zone_size[], lrugen->nr_pages[] is eventually consistent.
> > > To move a page to a different generation, the gen counter in page->flags
> > > is updated first, which doesn't require the LRU lock. The second step,
> > > i.e., the update of lrugen->nr_pages[], requires the LRU lock, and it
> > > usually isn't done immediately due to batching. Meanwhile, if this page
> > > is, for example, isolated, nr_pages[] becomes temporarily unbalanced.
> > > And this trips the WARN_ONCE().
> >
> > Good insight.
> >
> > But in that case, I'd still think it's better to use update_lru_size()
> > and gate the memcg update on lrugen-enabled, with a short comment
> > saying that lrugen has its own per-cgroup counts already. It's just a
> > bit too error prone to duplicate the stat updates.
> >
> > Even better would be:
> >
> > static __always_inline
> > void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
> > {
> > 	enum lru_list lru = folio_lru_list(folio);
> >
> > 	update_lru_size(lruvec, lru, folio_zonenum(folio),
> > 			folio_nr_pages(folio));
> > 	if (lrugen_enabled(lruvec))
> > 		lrugen_add_folio(lruvec, folio);
> > 	else
> > 		list_add(&folio->lru, &lruvec->lists[lru]);
> > }
> >
> > But it does mean you'd have to handle unevictable pages. I'm reviewing
> > from the position that mglru is going to supplant the existing reclaim
> > algorithm in the long term, though, so being more comprehensive and
> > eliminating special cases where possible is all-positive, IMO.
> >
> > Up to you. I'd only insist on reusing update_lru_size() at least.
>
> Will do.
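[Illustration only, not from the posted series: the "reuse
update_lru_size() and gate the memcg update" option described above
might look roughly like the sketch below. It assumes update_lru_size()
keeps approximately its current shape and borrows the
lruvec_on_lrugen() helper name suggested further down;
lrugen->nr_pages[] is the eventually consistent per-memcg bookkeeping
explained above.]

static __always_inline void update_lru_size(struct lruvec *lruvec,
					    enum lru_list lru, enum zone_type zid,
					    long nr_pages)
{
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
	__mod_zone_page_state(&pgdat->node_zones[zid],
			      NR_ZONE_LRU_BASE + lru, nr_pages);
#ifdef CONFIG_MEMCG
	/*
	 * lrugen keeps its own per-memcg counts in lrugen->nr_pages[],
	 * which are only eventually consistent, so skip the strict
	 * accounting that would otherwise trip the WARN_ONCE() in
	 * mem_cgroup_update_lru_size().
	 */
	if (!lruvec_on_lrugen(lruvec))
		mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
#endif
}

With that, lruvec_add_folio()/lruvec_del_folio() could call
update_lru_size() unconditionally, as in the sketch quoted above.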
>
> > > > > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > > > +{
> > > > > +	int gen;
> > > > > +	unsigned long old_flags, new_flags;
> > > > > +	int type = folio_is_file_lru(folio);
> > > > > +	int zone = folio_zonenum(folio);
> > > > > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > > > +
> > > > > +	if (folio_test_unevictable(folio) || !lrugen->enabled)
> > > > > +		return false;
> > > >
> > > > These two checks should be in the callsite and the function should
> > > > return void. Otherwise you can't understand the callsite without
> > > > drilling down into lrugen code, even if lrugen is disabled.
> > >
> > > I agree it's a bit of nuisance this way. The alternative is we'd need
> > > ifdef or another helper at the call sites because lrugen->enabled is
> > > specific to lrugen.
> >
> > Coming from memcg, my experience has been that when you have a compile
> > time-optional MM extension like this, you'll sooner or later need a
> > config-independent helper to gate callbacks in generic code. So I
> > think it's a good idea to add one now.
> >
> > One of these?
> >
> > lruvec_on_lrugen()
>
> SGTM.
>
> Personally I'd reuse lru_gen_enabled(), by passing NULL/lruvec. But
> my guess is you wouldn't like it.
>
> > lruvec_using_lrugen()
> > lruvec_lrugen_enabled()
> >
> > lruvec_has_generations() :-)
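[Illustration only: a config-independent helper of the kind being
discussed might look like the sketch below, taking the first name
suggested above and assuming the series' config symbol is
CONFIG_LRU_GEN and the enabled field is the lrugen->enabled quoted
earlier.]

#ifdef CONFIG_LRU_GEN
static inline bool lruvec_on_lrugen(struct lruvec *lruvec)
{
	return lruvec->lrugen.enabled;
}
#else
static inline bool lruvec_on_lrugen(struct lruvec *lruvec)
{
	return false;
}
#endif

Generic callsites could then do "if (lruvec_on_lrugen(lruvec))
lrugen_add_folio(...)" without any ifdefs, which is the gating
described above.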
>
> > > > On that note, I think #1 is reintroducing a problem we have fixed
> > > > before, which is trashing the workingset with a flood of use-once
> > > > mmapped pages. It's the classic scenario where LFU beats LRU.
> > > >
> > > > Mapped streaming IO isn't very common, but it does happen. See these
> > > > commits:
> > > >
> > > > dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
> > > > 31c0569c3b0b6cc8a867ac6665ca081553f7984c
> > > > 645747462435d84c6c6a64269ed49cc3015f753d
> > > >
> > > > From the changelog:
> > > >
> > > >     The used-once mapped file page detection patchset.
> > > >
> > > >     It is meant to help workloads with large amounts of shortly used file
> > > >     mappings, like rtorrent hashing a file or git when dealing with loose
> > > >     objects (git gc on a bigger site?).
> > > >
> > > >     Right now, the VM activates referenced mapped file pages on first
> > > >     encounter on the inactive list and it takes a full memory cycle to
> > > >     reclaim them again. When those pages dominate memory, the system
> > > >     no longer has a meaningful notion of 'working set' and is required
> > > >     to give up the active list to make reclaim progress. Obviously,
> > > >     this results in rather bad scanning latencies and the wrong pages
> > > >     being reclaimed.
> > > >
> > > >     This patch makes the VM be more careful about activating mapped file
> > > >     pages in the first place. The minimum granted lifetime without
> > > >     another memory access becomes an inactive list cycle instead of the
> > > >     full memory cycle, which is more natural given the mentioned loads.
> > > >
> > > > Translating this to multigen, it seems fresh faults should really
> > > > start on the second oldest rather than on the youngest generation, to
> > > > get a second chance but without jeopardizing the workingset if they
> > > > don't take it.
> > >
> > > This is a good point, and I had worked on a similar idea but failed
> > > to measure its benefits. In addition to placing mmapped file pages in
> > > older generations, I also tried placing refaulted anon pages in older
> > > generations. My conclusion was that the initial LRU positions of NFU
> > > pages are not a bottleneck for workloads I've tested. The efficiency
> > > of testing/clearing the accessed bit is.
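[Illustration only, not what the posted series does: the "second
chance" placement described above could be expressed as picking the
target generation for a fresh fault, roughly as sketched below.
min_seq/max_seq and lru_gen_from_seq() follow the series' naming; the
fresh_fault plumbing is hypothetical.]

static int lru_gen_target_gen(struct lruvec *lruvec, int type, bool fresh_fault)
{
	struct lru_gen_struct *lrugen = &lruvec->lrugen;
	unsigned long seq = lrugen->max_seq;

	/*
	 * Give a fresh fault the second-oldest generation: it then has
	 * one aging cycle to get referenced again before it becomes
	 * eligible for eviction, instead of starting out ahead of the
	 * entire established workingset.
	 */
	if (fresh_fault)
		seq = min(lrugen->min_seq[type] + 1, lrugen->max_seq);

	return lru_gen_from_seq(seq);
}

Whether such placement actually helps is exactly the tradeoff debated
below.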
> >
> > The concern isn't the scan overhead, but jankiness from the workingset
> > being flooded out by streaming IO.
>
> Yes, MGLRU uses a different approach to solve this problem, and for
> its approach, the scan overhead is the concern.
>
> MGLRU detects (defines) the working set by scanning the entire memory
> for each generation, and it counters the flooding by accelerating the
> creation of generations. IOW, all mapped pages have an equal chance to
> get scanned, no matter which generation they are in. This is a design
> difference compared with the active/inactive LRU, which tries to scan
> the active/inactive lists less/more frequently.
>
> > The concrete usecase at the time was a torrent client hashing a
> > downloaded file and thereby kicking out the desktop environment, which
> > caused jankiness. The hashing didn't benefit from caching - the file
> > wouldn't have fit into RAM anyway - so this was pointless to boot.
> >
> > Essentially, the tradeoff is this:
> >
> > 1) If you treat new pages as hot, you accelerate workingset
> >    transitions, but on the flipside you risk unnecessary refaults in
> >    running applications when those new pages are one-off.
> >
> > 2) If you take new pages with a grain of salt, you protect existing
> >    applications better from one-off floods, but risk refaults in NEW
> >    applications while they're trying to start up.
>
> Agreed.
>
> > There are two arguments for why 2) is preferable:
> >
> > 1) Users are tolerant of cache misses when applications first launch,
> >    much less so after they've been running for hours.
>
> Our CUJs (Critical User Journeys) respectfully disagree :)
>
> They are built on the observation that once users have moved onto
> another tab/app, they are more likely to stay with the new tab/app
> rather than go back to the old ones. Speaking for myself, this is
> generally the case.
>
> > 2) Workingset transitions (and associated jankiness) are bounded by
> >    the amount of RAM you need to repopulate. But streaming IO is
> >    bounded by storage, and datasets are routinely several times the
> >    amount of RAM. Uncacheable sets in excess of RAM can produce an
> >    infinite stream of "new" references; not protecting the workingset
> >    from that means longer or even sustained jankiness.
>
> I'd argue the opposite -- we shouldn't risk refaulting fresh hot pages
> just to accommodate this concrete yet minor use case, especially
> considering torrent has been given the means (MADV_SEQUENTIAL) to help
> itself.
>
> I appreciate all your points here. The bottom line is we agree this is
> a trade off. For what we disagree about, we could both be right -- it
> comes down to what workloads we care about *more*.
>
> To move forward, I propose we look at it from a non-technical POV:
> would we want to offer users an alternative trade off so that they can
> have greater flexibility?
>
> > > And some applications are smart enough to leverage MADV_SEQUENTIAL.
> > > In this case, MGLRU does place mmapped file pages in the oldest
> > > generation.
> >
> > Yes, it makes sense to optimize when MADV_SEQUENTIAL is requested. But
> > that hint isn't reliably there, so it matters that we don't do poorly
> > when it's missing.
>
> Agreed.
>
> > > I have an oversimplified script that uses memcached to mimic a
> > > non-streaming workload and fio a (mmapped) streaming workload:
> >
> > Looking at the parameters and observed behavior, let me say up front
> > that this looks like a useful benchmark, but doesn't capture the
> > scenario I was talking about above.
> >
> > For one, the presence of swapping in both kernels suggests that the
> > "streaming IO" component actually has repeat access that could benefit
> > from caching. Second, I would expect memcache is accessing its memory
> > frequently and consistently, and so could withstand workingset
> > challenges from streaming IO better than, say, a desktop environment.
>
> The fio workload is a real streaming workload, but the memcached
> workload might have been too large to be a typical desktop workload.
>
> More below.
>
> > More on that below.
> >
> > > 1. With MADV_SEQUENTIAL, the non-streaming workload is about 5 times
> > >    faster when using MGLRU. Somehow the baseline (rc3) swapped a lot.
> > >    (It shouldn't, and I haven't figured out why.)
> >
> > Baseline swaps when there are cache refaults. This is regardless of
> > the hint: you may say you're accessing these pages sequentially, but
> > the refaults say you're reusing them, with a frequency that suggests
> > they might be cacheable. So it tries to cache them.
> >
> > I'd be curious if that results in fio being faster, or whether it's
> > all just pointless thrashing. Can you share the fio results too?
>
> More below.
>
> > We could patch baseline to prioritize MADV_SEQUENTIAL more, but...
> >
> > > 2. Without MADV_SEQUENTIAL, the non-streaming workload is about 1
> > >    time faster when using MGLRU. Both MGLRU and the baseline swapped
> > >    a lot.
> >
> > ...in practice I think this scenario will matter to a lot more users.
>
> I strongly feel we should prioritize what's advertised on a man page
> over an unspecified (performance) behavior.
>
> > I would again be interested in the fio results.
> >
> > >            MADV_SEQUENTIAL   non-streaming ops/sec (memcached)
> > > rc3        yes               292k
> > > rc3        no                203k
> > > rc3+v7     yes               1967k
> > > rc3+v7     no                436k

Appending a few notes on the baseline results:

1. Apparently FADV_DONTNEED rejects mmapped pages -- I found no reasons
   from the man page or the original commit why it should. I propose we
   remove the page_mapped() check in lru_deactivate_file_fn(). Adding
   Minchan to see what he thinks about this.

2. For MADV_SEQUENTIAL, I made workingset_refault() check
   (VM_SEQ_READ | VM_RAND_READ) and return if they are set, and the
   performance got a lot better (x3).

3. In page_referenced_one(), I think we should also exclude VM_RAND_READ
   in addition to VM_SEQ_READ.
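[Illustration only: note 3 above would roughly amount to widening the
existing VM_SEQ_READ check in page_referenced_one(); the surrounding
code below is paraphrased, not an exact quote of mm/rmap.c.]

	if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
		/*
		 * Don't treat references through sequentially or randomly
		 * read mappings as workingset accesses; those hints
		 * already say the pages are unlikely to be reused.
		 */
		if (likely(!(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))))
			referenced++;
	}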
> > > cat mmap.sh
> > > modprobe brd rd_nr=2 rd_size=56623104
> > >
> > > mkswap /dev/ram0
> > > swapon /dev/ram0
> > >
> > > mkfs.ext4 /dev/ram1
> > > mount -t ext4 /dev/ram1 /mnt
> > >
> > > memtier_benchmark -S /var/run/memcached/memcached.sock -P memcache_binary \
> > >   -n allkeys --key-minimum=1 --key-maximum=50000000 --key-pattern=P:P -c 1 \
> > >   -t 36 --ratio 1:0 --pipeline 8 -d 2000
> > >
> > > # streaming workload: --fadvise_hint=0 disables MADV_SEQUENTIAL
> > > fio -name=mglru --numjobs=12 --directory=/mnt --size=4224m --buffered=1 \
> > >   --ioengine=mmap --iodepth=128 --iodepth_batch_submit=32 \
> > >   --iodepth_batch_complete=32 --rw=read --time_based --ramp_time=10m \
> > >   --runtime=180m --group_reporting &
> >
> > As per above, I think this would be closer to a cacheable workingset
> > than a streaming IO pattern. It depends on total RAM of course, but
> > size=4G and time_based should loop around pretty quickly.
>
> The file size here shouldn't matter since fio is smart enough to
> invalidate page cache before it rewinds (for sequential access):
>
> https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-invalidate
> https://github.com/axboe/fio/blob/master/filesetup.c#L602
>
> I think the problem might have been the memory size for memcached was
> too large (100GB) to be all hot (limited by memory bandwidth).
>
> > Would you mind rerunning with files larger than RAM, to avoid repeat
> > accesses (or at least only repeat with large distances)?
>
> Retested with the same free memory (120GB) for 40GB memcached and 200GB
> fio.
>
>          MADV_SEQUENTIAL   FADV_DONTNEED   memcached   fio
> rc4      no                yes             4716k       232k
> rc4+v7   no                yes             4307k       265k
> delta                                      -9%         +14%
>
> MGLRU lost with memcached but won with fio for the same reason: it
> doesn't have any heuristics to detect the streaming characteristic
> (and therefore lost with memcached) but relies on faster scanning
> (and therefore won with fio) to keep the working set in memory.
>
> The baseline didn't swap this time (MGLRU did slightly), but it lost
> with fio because it had to walk the rmap for each page in the entire
> 200GB VMA, at least once, even for this streaming workload.
>
> This reflects the design difference I mentioned earlier.
>
> cat test.sh
> modprobe brd rd_nr=1 rd_size=268435456
>
> mkfs.ext4 /dev/ram0
> mount -t ext4 /dev/ram0 /mnt
>
> fallocate -l 40g /mnt/swapfile
> mkswap /mnt/swapfile
> swapon /mnt/swapfile
>
> fio -name=mglru --numjobs=1 --directory=/mnt --size=204800m \
>   --buffered=1 --ioengine=mmap --fadvise_hint=0 --iodepth=128 \
>   --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
>   --rw=read --time_based --ramp_time=10m --runtime=180m \
>   --group_reporting &
> pid=$!
>
> sleep 600
>
> # load objects
> memtier_benchmark -S /var/run/memcached/memcached.sock \
>   -P memcache_binary -n allkeys --key-minimum=1 \
>   --key-maximum=20000000 --key-pattern=P:P -c 1 -t 36 \
>   --ratio 1:0 --pipeline 8 -d 2000
> # read objects
> memtier_benchmark -S /var/run/memcached/memcached.sock \
>   -P memcache_binary -n allkeys --key-minimum=1 \
>   --key-maximum=20000000 --key-pattern=R:R -c 1 -t 36 \
>   --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
>
> kill -INT $pid
>
> > Depending on how hot memcache runs, it may or may not be able to hold
> > onto its workingset.
>
> Agreed.
>
> > Testing interactivity is notoriously hard, but
> > using a smaller, intermittent workload is probably more representative
> > of overall responsiveness. Let fio ramp until memory is full, then do
> > perf stat -r 10 /bin/sh -c 'git shortlog v5.0.. >/dev/null; sleep 1'
>
> I'll also check with the downstream maintainers to see if they have
> heard any complaints about streaming workloads negatively impacting
> user experience.
>
> > I'll try to reproduce this again too. Back then, that workload gave me
> > a very janky desktop experience, and the patch very obvious relief.
>
> SGTM.
>
> > > > > @@ -113,6 +298,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
> > > > >  {
> > > > >  	enum lru_list lru = folio_lru_list(folio);
> > > > >
> > > > > +	if (lru_gen_add_folio(lruvec, folio, true))
> > > > > +		return;
> > > > > +
> > > >
> > > > bool parameters are notoriously hard to follow in the callsite. Can
> > > > you please add lru_gen_add_folio_tail() instead and have them use a
> > > > common helper?
> > >
> > > I'm not sure -- there are several places like this one. My question is
> > > whether we want to do it throughout this patchset. We'd end up with
> > > many helpers and duplicate code. E.g., in this file alone, we have two
> > > functions taking bool parameters:
> > >   lru_gen_add_folio(..., bool reclaiming)
> > >   lru_gen_del_folio(..., bool reclaiming)
> > >
> > > I can't say they are very readable; at least they are very compact
> > > right now. My concern is that we might lose the latter without having
> > > enough of the former.
> > >
> > > Perhaps this is something that we could revisit after you've finished
> > > reviewing the entire patchset?
> >
> > Sure, fair enough.
> >
> > > > > +void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
> > > >
> > > > "state" is what we usually init :) How about lrugen_init_lruvec()?
> > >
> > > Same story as "file", lol -- this used to be lru_gen_init_lruvec():
> > > https://lore.kernel.org/linux-mm/20210413065633.2782273-9-yuzhao@xxxxxxxxxx/
> > >
> > > Naming is hard. Hopefully we can finalize it this time.
> >
> > Was that internal feedback? The revisions show this function went
> > through several names, but I don't see reviews requesting those. If
> > they weren't public I'm gonna pretend they didn't happen ;-)
>
> Indeed. I lost track.
>
> > > > You can drop the memcg parameter and use lruvec_memcg().
> > >
> > > lruvec_memcg() isn't available yet when pgdat_init_internals() calls
> > > this function because mem_cgroup_disabled() is initialized afterward.
> >
> > Good catch. That'll container_of() into garbage. However, we have to
> > assume that somebody's going to try that simplification again, so we
> > should set up the code now to prevent issues.
> >
> > cgroup_disable parsing is self-contained, so we can pull it ahead in
> > the init sequence. How about this?
> >
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 9d05c3ca2d5e..b544d768edc8 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -6464,9 +6464,9 @@ static int __init cgroup_disable(char *str)
> >  			break;
> >  		}
> >  	}
> > -	return 1;
> > +	return 0;
> >  }
> > -__setup("cgroup_disable=", cgroup_disable);
> > +early_param("cgroup_disable", cgroup_disable);
>
> I think early_param() is still after pgdat_init_internals(), no?
>
> Thanks!