On 02/11/2024 12:59, Barry Song wrote: > On Sat, Nov 2, 2024 at 8:32 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote: >> >> >> >> On 02/11/2024 10:12, Barry Song wrote: >>> From: Barry Song <v-songbaohua@xxxxxxxx> >>> >>> When the proportion of folios from the zero map is small, missing their >>> accounting may not significantly impact profiling. However, it’s easy >>> to construct a scenario where this becomes an issue—for example, >>> allocating 1 GB of memory, writing zeros from userspace, followed by >>> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out >>> and swap-in counts seem to vanish into a black hole, potentially >>> causing semantic ambiguity. >>> >>> We have two ways to address this: >>> >>> 1. Add a separate counter specifically for the zero map. >>> 2. Continue using the current accounting, treating the zero map like >>> a normal backend. (This aligns with the current behavior of zRAM >>> when supporting same-page fills at the device level.) >>> >>> This patch adopts option 1 as pswpin/pswpout counters are that they >>> only apply to IO done directly to the backend device (as noted by >>> Nhat Pham). >>> >>> We can find these counters from /proc/vmstat (counters for the whole >>> system) and memcg's memory.stat (counters for the interested memcg). >>> >>> For example: >>> >>> $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat >>> swpin_zero 1648 >>> swpout_zero 33536 >>> >>> $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat >>> swpin_zero 3905 >>> swpout_zero 3985 >>> >>> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") >> I don't think its a hotfix (or even a fix). It was discussed in the initial >> series to add these as a follow up and Joshua was going to do this soon. >> Its not fixing any bug in the initial series. > > I would prefer that all kernel versions with zeromap include this > counter; otherwise, > it could be confusing to determine where swap-in and swap-out have occurred, > as shown by the small program below: > > p =malloc(1g); > write p to zero > madvise_pageout > read p; > > Previously, there was 1GB of swap-in and swap-out activity reported, but > now nothing is shown. > > I don't mean to suggest that there's a bug in the zeromap code; rather, > having this counter would help clear up any confusion. > > I didn't realize Joshua was handling it. Is he still planning to? If > so, I can leave it > with Joshua if that was the plan :-) > Please do continue with this patch, I think he was going to look at the swapped_zero version that we discussed earlier anyways. Will let Joshua comment on it. >> >>> Cc: Usama Arif <usamaarif642@xxxxxxxxx> >>> Cc: Chengming Zhou <chengming.zhou@xxxxxxxxx> >>> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx> >>> Cc: Nhat Pham <nphamcs@xxxxxxxxx> >>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx> >>> Cc: David Hildenbrand <david@xxxxxxxxxx> >>> Cc: Hugh Dickins <hughd@xxxxxxxxxx> >>> Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx> >>> Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx> >>> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx> >>> Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> >>> Cc: Chris Li <chrisl@xxxxxxxxxx> >>> Cc: "Huang, Ying" <ying.huang@xxxxxxxxx> >>> Cc: Kairui Song <kasong@xxxxxxxxxxx> >>> Cc: Ryan Roberts <ryan.roberts@xxxxxxx> >>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx> >>> --- >>> -v2: >>> * add separate counters rather than using pswpin/out; thanks >>> for the comments from Usama, David, Yosry and Nhat; >>> * Usama also suggested a new counter like swapped_zero, I >>> prefer that one be separated as an enhancement patch not >>> a hotfix. will probably handle it later on. >>> >> I dont think either of them would be a hotfix. > > As mentioned above, this isn't about fixing a bug; it's simply to ensure > that swap-related metrics don't disappear. > >> >>> Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++ >>> include/linux/vm_event_item.h | 2 ++ >>> mm/memcontrol.c | 4 ++++ >>> mm/page_io.c | 16 ++++++++++++++++ >>> mm/vmstat.c | 2 ++ >>> 5 files changed, 34 insertions(+) >>> >>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst >>> index db3799f1483e..984eb3c9d05b 100644 >>> --- a/Documentation/admin-guide/cgroup-v2.rst >>> +++ b/Documentation/admin-guide/cgroup-v2.rst >>> @@ -1599,6 +1599,16 @@ The following nested keys are defined. >>> pglazyfreed (npn) >>> Amount of reclaimed lazyfree pages >>> >>> + swpin_zero >>> + Number of pages moved into memory with zero content, meaning no >>> + copy exists in the backend swapfile, allowing swap-in to avoid >>> + I/O read overhead. >>> + >>> + swpout_zero >>> + Number of pages moved out of memory with zero content, meaning no >>> + copy is needed in the backend swapfile, allowing swap-out to avoid >>> + I/O write overhead. >>> + >> >> Maybe zero-filled pages might be a better term in both. > > Do you mean dropping "with zero content" and replacing it by > Number of zero-filled pages moved out of memory ? I'm fine > with the change. Yes, mainly because if you do swapout of memory that was memset 0 its still content, just zero-filled. Thanks, Usama > >> >>> zswpin >>> Number of pages moved in to memory from zswap. >>> >>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h >>> index aed952d04132..f70d0958095c 100644 >>> --- a/include/linux/vm_event_item.h >>> +++ b/include/linux/vm_event_item.h >>> @@ -134,6 +134,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, >>> #ifdef CONFIG_SWAP >>> SWAP_RA, >>> SWAP_RA_HIT, >>> + SWPIN_ZERO, >>> + SWPOUT_ZERO, >>> #ifdef CONFIG_KSM >>> KSM_SWPIN_COPY, >>> #endif >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >>> index 5e44d6e7591e..7b3503d12aaf 100644 >>> --- a/mm/memcontrol.c >>> +++ b/mm/memcontrol.c >>> @@ -441,6 +441,10 @@ static const unsigned int memcg_vm_event_stat[] = { >>> PGDEACTIVATE, >>> PGLAZYFREE, >>> PGLAZYFREED, >>> +#ifdef CONFIG_SWAP >>> + SWPIN_ZERO, >>> + SWPOUT_ZERO, >>> +#endif >>> #ifdef CONFIG_ZSWAP >>> ZSWPIN, >>> ZSWPOUT, >>> diff --git a/mm/page_io.c b/mm/page_io.c >>> index 5d9b6e6cf96c..4b4ea8e49cf6 100644 >>> --- a/mm/page_io.c >>> +++ b/mm/page_io.c >>> @@ -204,7 +204,9 @@ static bool is_folio_zero_filled(struct folio *folio) >>> >>> static void swap_zeromap_folio_set(struct folio *folio) >>> { >>> + struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); >>> struct swap_info_struct *sis = swp_swap_info(folio->swap); >>> + int nr_pages = folio_nr_pages(folio); >>> swp_entry_t entry; >>> unsigned int i; >>> >>> @@ -212,6 +214,12 @@ static void swap_zeromap_folio_set(struct folio *folio) >>> entry = page_swap_entry(folio_page(folio, i)); >>> set_bit(swp_offset(entry), sis->zeromap); >>> } >>> + >>> + count_vm_events(SWPOUT_ZERO, nr_pages); >>> + if (objcg) { >>> + count_objcg_events(objcg, SWPOUT_ZERO, nr_pages); >>> + obj_cgroup_put(objcg); >>> + } >>> } >>> >>> static void swap_zeromap_folio_clear(struct folio *folio) >>> @@ -507,6 +515,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret) >>> static bool swap_read_folio_zeromap(struct folio *folio) >>> { >>> int nr_pages = folio_nr_pages(folio); >>> + struct obj_cgroup *objcg; >>> bool is_zeromap; >>> >>> /* >>> @@ -521,6 +530,13 @@ static bool swap_read_folio_zeromap(struct folio *folio) >>> if (!is_zeromap) >>> return false; >>> >>> + objcg = get_obj_cgroup_from_folio(folio); >>> + count_vm_events(SWPIN_ZERO, nr_pages); >>> + if (objcg) { >>> + count_objcg_events(objcg, SWPIN_ZERO, nr_pages); >>> + obj_cgroup_put(objcg); >>> + } >>> + >>> folio_zero_range(folio, 0, folio_size(folio)); >>> folio_mark_uptodate(folio); >>> return true; >>> diff --git a/mm/vmstat.c b/mm/vmstat.c >>> index 22a294556b58..c8ef7352f9ed 100644 >>> --- a/mm/vmstat.c >>> +++ b/mm/vmstat.c >>> @@ -1418,6 +1418,8 @@ const char * const vmstat_text[] = { >>> #ifdef CONFIG_SWAP >>> "swap_ra", >>> "swap_ra_hit", >>> + "swpin_zero", >>> + "swpout_zero", >>> #ifdef CONFIG_KSM >>> "ksm_swpin_copy", >>> #endif >> > > Thanks > Barry