On Sun, Dec 15, 2024 at 12:07 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> On Tue, Dec 10, 2024 at 1:29 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > From: Kairui Song <kasong@xxxxxxxxxxx>
> >
> > commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> > replaced the cmpxchg/xchg with a global irq spinlock because some archs
> > doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well.
> >
> > And as commented in swap_cgroup.c, this lock is not needed for map
> > synchronization.
> >
> > Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement
> > it to get rid of this lock. Introduced two helpers for doing so and they
> > can be easily dropped if a generic 2 byte xchg is support.
> >
> > Testing using 64G brd and build with build kernel with make -j96 in 1.5G
> > memory cgroup using 4k folios showed below improvement (10 test run):
> >
> > Before this series:
> > Sys time: 10809.46 (stdev 80.831491)
> > Real time: 171.41 (stdev 1.239894)
> >
> > After this commit:
> > Sys time: 9621.26 (stdev 34.620000), -10.42%
> > Real time: 160.00 (stdev 0.497814), -6.57%
> >
> > With 64k folios and 2G memcg:
> > Before this series:
> > Sys time: 8231.99 (stdev 30.030994)
> > Real time: 143.57 (stdev 0.577394)
> >
> > After this commit:
> > Sys time: 7403.47 (stdev 6.270000), -10.06%
> > Real time: 135.18 (stdev 0.605000), -5.84%
> >
> > Sequential swapout of 8G 64k zero folios with madvise (24 test run):
> > Before this series:
> > 5461409.12 us (stdev 183957.827084)
> >
> > After this commit:
> > 5420447.26 us (stdev 196419.240317)
> >
> > Sequential swapin of 8G 4k zero folios (24 test run):
> > Before this series:
> > 19736958.916667 us (stdev 189027.246676)
> >
> > After this commit:
> > 19662182.629630 us (stdev 172717.640614)
> >
> > Performance is better or at least not worse for all tests above.
> >
> > Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> > ---
> >  mm/swap_cgroup.c | 73 +++++++++++++++++++++++++++++-------------------
> >  1 file changed, 45 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> > index 1770b076f6b7..a0a8547dc85d 100644
> > --- a/mm/swap_cgroup.c
> > +++ b/mm/swap_cgroup.c
> > @@ -7,19 +7,20 @@
> >
> >  static DEFINE_MUTEX(swap_cgroup_mutex);
> >
> > +/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
>
> Might not be two short if the atomic_t is more than 4 bytes. The
> assumption here is that short is 2 bytes and atomic_t is 4 bytes. It
> is hard to conclude that is the case for all architecture.
>
> > +#define ID_PER_SC (sizeof(atomic_t) / sizeof(unsigned short))
>
> You should use "sizeof(struct swap_cgroup) / sizeof(unsigned short)",
> or get rid of struct swap_cgroup and directly use atomic_t.
>
> > +#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
> > +#define ID_MASK (BIT(ID_SHIFT) - 1)
> >  struct swap_cgroup {
> > -        unsigned short id;
> > +        atomic_t ids;
>
> You use struct swap_cgroup and atomic_t which assumes no padding added
> to the struct. You might want to build an assert on sizeof(atomic_t)
> == sizeof(struct swap_cgroup).

Good idea, asserting that struct swap_cgroup is exactly an atomic_t ensures no
unexpected behaviour; I have folded this into the sketch further down.

> >  };
> >
> >  struct swap_cgroup_ctrl {
> >          struct swap_cgroup *map;
> > -        spinlock_t lock;
> >  };
> >
> >  static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> >
> > -#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
> > -
> >  /*
> >   * SwapCgroup implements "lookup" and "exchange" operations.
> >   * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
> > @@ -30,19 +31,32 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> >   * SwapCache(and its swp_entry) is under lock.
> >   * - When called via swap_free(), there is no user of this entry and no race.
> >   * Then, we don't need lock around "exchange".
> > - *
> > - * TODO: we can push these buffers out to HIGHMEM.
> >   */
> > -static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> > -                                        struct swap_cgroup_ctrl **ctrlp)
> > +static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
> > +                                              pgoff_t offset)
> >  {
> > -        pgoff_t offset = swp_offset(ent);
> > -        struct swap_cgroup_ctrl *ctrl;
> > +        unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
>
> Might not want to assume the ID_PER_SC is two. If some architecture
> atomic_t is 64 bits then that code will break.

Good idea. atomic_t is by definition an int, and I'm not sure any strange arch
would change its size, but more robust code is always better. I can change this
to (offset % ID_PER_SC) instead; the generated machine code should still be the
same on most archs.

> > +        unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
>
> Here assume sizeof(unsigned int) == sizeof(atomic_t). Again,some
> strange architecture might break it. Better use unsigned version of
> aotmic_t;
>
> >
> > -        ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > -        if (ctrlp)
> > -                *ctrlp = ctrl;
> > -        return &ctrl->map[offset];
> > +        return (old_ids & (ID_MASK << shift)) >> shift;
>
> Can be simplified as (old_ids >> shift) & ID_MASK. You might want to
> double check that.
>
> > +}
> > +
> > +static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
> > +                                            pgoff_t offset,
> > +                                            unsigned short new_id)
> > +{
> > +        unsigned short old_id;
> > +        unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
>
> Same here, it assumes ID_PER_SC is 2.
>
> > +        struct swap_cgroup *sc = &map[offset / ID_PER_SC];
> > +        unsigned int new_ids, old_ids = atomic_read(&sc->ids);
>
> Again it assumes sizeof(unsigned int) == sizeof(atomic_t).

I think this should be OK? The documentation says "atomic_t, atomic_long_t and
atomic64_t use int, long and s64 respectively". I could hide this behind some
wrapper, but I think it's unnecessary.

> > +
> > +        do {
> > +                old_id = (old_ids & (ID_MASK << shift)) >> shift;
>
> Can be simplify:
> old_id = (old_ids >> shift) & ID_MASK;
>
> > +                new_ids = (old_ids & ~(ID_MASK << shift));
> > +                new_ids |= ((unsigned int)new_id) << shift;
>
> new_ids |= (atomic_t) new_id << shift;

atomic_t doesn't work with bit operations, since the operand must be an
arithmetic type, so I think this part has to stay as it is.
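
To make this concrete, below is a rough, untested sketch of how I currently
plan to rework the two helpers for v2, folding in the points above: ID_PER_SC
derived from struct swap_cgroup, an (offset % ID_PER_SC) based shift so nothing
assumes ID_PER_SC == 2, the simplified (old_ids >> shift) & ID_MASK form, and
compile time size assertions. Whether the assertions end up as static_assert()
here or as BUILD_BUG_ON() in swap_cgroup_swapon() is still open; this is only
meant to show the direction, not the final patch:

/* Pack ID_PER_SC cgroup ids (unsigned short) in one swap_cgroup (atomic_t) */
#define ID_PER_SC        (sizeof(struct swap_cgroup) / sizeof(unsigned short))
#define ID_SHIFT         (BITS_PER_TYPE(unsigned short))
#define ID_MASK          (BIT(ID_SHIFT) - 1)
struct swap_cgroup {
        atomic_t ids;
};

/* No padding allowed, and the ids must pack evenly into the atomic_t */
static_assert(sizeof(struct swap_cgroup) == sizeof(atomic_t));
static_assert(!(sizeof(atomic_t) % sizeof(unsigned short)));

static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
                                              pgoff_t offset)
{
        unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;
        unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);

        return (old_ids >> shift) & ID_MASK;
}

static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
                                            pgoff_t offset,
                                            unsigned short new_id)
{
        unsigned short old_id;
        struct swap_cgroup *sc = &map[offset / ID_PER_SC];
        unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;
        unsigned int new_ids, old_ids = atomic_read(&sc->ids);

        do {
                /* Extract the current id, then splice in the new one */
                old_id = (old_ids >> shift) & ID_MASK;
                new_ids = old_ids & ~(ID_MASK << shift);
                new_ids |= (unsigned int)new_id << shift;
        } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));

        return old_id;
}

Since ID_PER_SC is a compile time constant, the modulo and division should
still fold into the same shift/mask instructions as the current (offset & 1)
form on archs where atomic_t is 4 bytes.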
>
> > +        } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
> > +
> > +        return old_id;
> >  }
> >
> >  /**
> > @@ -58,21 +72,19 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> >                                    unsigned int nr_ents)
> >  {
> >          struct swap_cgroup_ctrl *ctrl;
> > -        struct swap_cgroup *sc;
> > -        unsigned short old;
> > -        unsigned long flags;
> >          pgoff_t offset = swp_offset(ent);
> >          pgoff_t end = offset + nr_ents;
> > +        unsigned short old, iter;
> > +        struct swap_cgroup *map;
> >
> > -        sc = lookup_swap_cgroup(ent, &ctrl);
> > +        ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > +        map = ctrl->map;
> >
> > -        spin_lock_irqsave(&ctrl->lock, flags);
> > -        old = sc->id;
> > -        for (; offset < end; offset++, sc++) {
> > -                VM_BUG_ON(sc->id != old);
> > -                sc->id = id;
> > -        }
> > -        spin_unlock_irqrestore(&ctrl->lock, flags);
>
> The above will always assign nr_ents of swap entry atomically.
>
> > +        old = __swap_cgroup_id_lookup(map, offset);
> > +        do {
> > +                iter = __swap_cgroup_id_xchg(map, offset, id);
> > +                VM_BUG_ON(iter != old);
> > +        } while (++offset != end);
>
> Here it is possible that some of the nr_ents can be changed while the
> offset is still in the loop. Might want to examine if the caller can
> trigger that or not. We want to make sure it is safe to do so, when
> removing the spin lock, the nr_ents might not update to the same value
> if two callers race it.

Right, the problem is explained by the comment at the beginning of this file;
I can update it with more details: entries are always charged / uncharged as
part of one folio, or uncharged with no folio involved (no more users, and
other folios can't reuse these entries until the freeing is done). Since a
folio is always charged or uncharged as a whole, charge / uncharge of one folio
never happen concurrently, and the xchg here implies a full barrier, the set of
entries will also always be updated as a whole.

So yes, this function does imply that the caller must always pass in swap
entries belonging to one single folio, or entries that have no users. It's
quite fragile indeed; I can make the caller pass in the folio as an argument to
clarify this, with some more WARN or BUG_ON checks, see the sketch at the end.

> >
> >          return old;
> >  }
> > @@ -85,9 +97,13 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> >   */
> >  unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
> >  {
> > +        struct swap_cgroup_ctrl *ctrl;
> > +
> >          if (mem_cgroup_disabled())
> >                  return 0;
> > -        return lookup_swap_cgroup(ent, NULL)->id;
> > +
> > +        ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > +        return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
> >  }
> >
> >  int swap_cgroup_swapon(int type, unsigned long max_pages)
> > @@ -98,14 +114,15 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
> >          if (mem_cgroup_disabled())
> >                  return 0;
> >
> > -        map = vcalloc(max_pages, sizeof(struct swap_cgroup));
> > +        BUILD_BUG_ON(!ID_PER_SC);
>
> It is simpler just to assert on: sizeof(atomic_t) >= sizeof(unsigned short).
> I think that is what it does here.
>
> You might also want to assert: !(sizeof(atomic_t) % sizeof(unsigned short))

Good idea.

>
> Chris
>
> > +        map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
> > +                      sizeof(struct swap_cgroup));
> >          if (!map)
> >                  goto nomem;
> >
> >          ctrl = &swap_cgroup_ctrl[type];
> >          mutex_lock(&swap_cgroup_mutex);
> >          ctrl->map = map;
> > -        spin_lock_init(&ctrl->lock);
> >          mutex_unlock(&swap_cgroup_mutex);
> >
> >          return 0;
> > --
> > 2.47.1
> >
>
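
As for making the single-folio requirement explicit, this is roughly what I
mean by passing the folio in. Again a completely untested sketch: the
signature, the folio lock check and the WARN placement are placeholders just to
illustrate the idea, and the "uncharge with no folio" path would keep using a
separate helper (or pass a NULL folio and skip the checks):

unsigned short swap_cgroup_record(struct folio *folio, unsigned short id,
                                  swp_entry_t ent)
{
        struct swap_cgroup_ctrl *ctrl = &swap_cgroup_ctrl[swp_type(ent)];
        struct swap_cgroup *map = ctrl->map;
        unsigned int nr_ents = folio_nr_pages(folio);
        pgoff_t offset = swp_offset(ent);
        pgoff_t end = offset + nr_ents;
        unsigned short old, iter;

        /*
         * Entries of one folio are charged / uncharged as a whole, so the
         * whole range must currently hold the same id. The lock check below
         * assumes the charge / uncharge paths run with the folio locked,
         * which still needs to be double checked against all callers.
         */
        VM_WARN_ON_ONCE(!folio_test_locked(folio));

        old = __swap_cgroup_id_lookup(map, offset);
        do {
                iter = __swap_cgroup_id_xchg(map, offset, id);
                VM_BUG_ON(iter != old);
        } while (++offset != end);

        return old;
}

That way the comment at the top of the file can point at the folio argument
instead of describing an implicit contract between the callers.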