On Sat, Dec 14, 2024 at 11:48 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Sun, Dec 15, 2024 at 12:07 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > On Tue, Dec 10, 2024 at 1:29 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > >
> > > From: Kairui Song <kasong@xxxxxxxxxxx>
> > >
> > > commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> > > replaced the cmpxchg/xchg with a global irq spinlock because some archs
> > > doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well.
> > >
> > > And as commented in swap_cgroup.c, this lock is not needed for map
> > > synchronization.
> > >
> > > Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement
> > > it to get rid of this lock. Introduced two helpers for doing so and they
> > > can be easily dropped if a generic 2 byte xchg is support.
> > >
> > > Testing using 64G brd and build with build kernel with make -j96 in 1.5G
> > > memory cgroup using 4k folios showed below improvement (10 test run):
> > >
> > > Before this series:
> > > Sys time: 10809.46 (stdev 80.831491)
> > > Real time: 171.41 (stdev 1.239894)
> > >
> > > After this commit:
> > > Sys time: 9621.26 (stdev 34.620000), -10.42%
> > > Real time: 160.00 (stdev 0.497814), -6.57%
> > >
> > > With 64k folios and 2G memcg:
> > > Before this series:
> > > Sys time: 8231.99 (stdev 30.030994)
> > > Real time: 143.57 (stdev 0.577394)
> > >
> > > After this commit:
> > > Sys time: 7403.47 (stdev 6.270000), -10.06%
> > > Real time: 135.18 (stdev 0.605000), -5.84%
> > >
> > > Sequential swapout of 8G 64k zero folios with madvise (24 test run):
> > > Before this series:
> > > 5461409.12 us (stdev 183957.827084)
> > >
> > > After this commit:
> > > 5420447.26 us (stdev 196419.240317)
> > >
> > > Sequential swapin of 8G 4k zero folios (24 test run):
> > > Before this series:
> > > 19736958.916667 us (stdev 189027.246676)
> > >
> > > After this commit:
> > > 19662182.629630 us (stdev 172717.640614)
> > >
> > > Performance is better or at least not worse for all tests above.
> > >
> > > Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> > > ---
> > >  mm/swap_cgroup.c | 73 +++++++++++++++++++++++++++++-------------------
> > >  1 file changed, 45 insertions(+), 28 deletions(-)
> > >
> > > diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> > > index 1770b076f6b7..a0a8547dc85d 100644
> > > --- a/mm/swap_cgroup.c
> > > +++ b/mm/swap_cgroup.c
> > > @@ -7,19 +7,20 @@
> > >
> > >  static DEFINE_MUTEX(swap_cgroup_mutex);
> > >
> > > +/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
> >
> > Might not be two shorts if the atomic_t is more than 4 bytes. The
> > assumption here is that short is 2 bytes and atomic_t is 4 bytes. It
> > is hard to conclude that is the case for all architectures.
> >
> > > +#define ID_PER_SC (sizeof(atomic_t) / sizeof(unsigned short))
> >
> > You should use "sizeof(struct swap_cgroup) / sizeof(unsigned short)",
> > or get rid of struct swap_cgroup and directly use atomic_t.
> >
> > > +#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
> > > +#define ID_MASK (BIT(ID_SHIFT) - 1)
> > >  struct swap_cgroup {
> > > -       unsigned short          id;
> > > +       atomic_t                ids;
> >
> > You use struct swap_cgroup and atomic_t which assumes no padding added
> > to the struct. You might want to build an assert on sizeof(atomic_t)
> > == sizeof(struct swap_cgroup).
>
> Good idea, asserting struct swap_cgroup is an atomic_t ensures no
> unexpected behaviour.
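For example, a one-line compile-time check next to the struct definition
would be enough (just a sketch, the exact form and wording are up to you):

/* needs <linux/build_bug.h>; catches any unexpected padding or size change */
static_assert(sizeof(struct swap_cgroup) == sizeof(atomic_t),
              "struct swap_cgroup must stay the size of an atomic_t");

A BUILD_BUG_ON() inside swap_cgroup_swapon() would work just as well if
you prefer keeping the check inside a function.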
>
> > >  };
> > >
> > >  struct swap_cgroup_ctrl {
> > >         struct swap_cgroup *map;
> > > -       spinlock_t      lock;
> > >  };
> > >
> > >  static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> > >
> > > -#define SC_PER_PAGE    (PAGE_SIZE/sizeof(struct swap_cgroup))
> > > -
> > >  /*
> > >   * SwapCgroup implements "lookup" and "exchange" operations.
> > >   * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
> > > @@ -30,19 +31,32 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> > >   *   SwapCache(and its swp_entry) is under lock.
> > >   * - When called via swap_free(), there is no user of this entry and no race.
> > >   * Then, we don't need lock around "exchange".
> > > - *
> > > - * TODO: we can push these buffers out to HIGHMEM.
> > >   */
> > > -static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> > > -                                       struct swap_cgroup_ctrl **ctrlp)
> > > +static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
> > > +                                             pgoff_t offset)
> > >  {
> > > -       pgoff_t offset = swp_offset(ent);
> > > -       struct swap_cgroup_ctrl *ctrl;
> > > +       unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
> >
> > Might not want to assume the ID_PER_SC is two. If some architecture's
> > atomic_t is 64 bits then that code will break.
>
> Good idea, atomic_t is by definition an int, not sure if any
> strange archs will change the size though, more robust code is always
> better.
>
> Can change this to (offset % ID_PER_SC) instead, the generated machine
> code should be still the same for most archs.
>
> > > +       unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
> >
> > Here it assumes sizeof(unsigned int) == sizeof(atomic_t). Again, some
> > strange architecture might break it. Better use the unsigned version of
> > atomic_t.
> >
> > >
> > > -       ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > > -       if (ctrlp)
> > > -               *ctrlp = ctrl;
> > > -       return &ctrl->map[offset];
> > > +       return (old_ids & (ID_MASK << shift)) >> shift;
> >
> > Can be simplified as (old_ids >> shift) & ID_MASK. You might want to
> > double check that.
> >
> > > +}
> > > +
> > > +static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
> > > +                                           pgoff_t offset,
> > > +                                           unsigned short new_id)
> > > +{
> > > +       unsigned short old_id;
> > > +       unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
> >
> > Same here, it assumes ID_PER_SC is 2.
> >
> > > +       struct swap_cgroup *sc = &map[offset / ID_PER_SC];
> > > +       unsigned int new_ids, old_ids = atomic_read(&sc->ids);
> >
> > Again it assumes sizeof(unsigned int) == sizeof(atomic_t).
>
> I think this should be OK? The documentation says "atomic_t, atomic_long_t
> and atomic64_t use int, long and s64 respectively".
>
> Could change this with some wrapper but I think it's unnecessary.

Ack.

> > > +
> > > +       do {
> > > +               old_id = (old_ids & (ID_MASK << shift)) >> shift;
> >
> > Can be simplified:
> >                 old_id = (old_ids >> shift) & ID_MASK;
> >
> > > +               new_ids = (old_ids & ~(ID_MASK << shift));
> > > +               new_ids |= ((unsigned int)new_id) << shift;
> >
> > new_ids |= (atomic_t) new_id << shift;
>
> atomic_t doesn't work with bit operations, it must be an arithmetic
> type, so here I think it has to stay like this.

Ack

Chris
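P.S. For completeness, with the simplifications above folded in (and assuming
the loop is closed with atomic_try_cmpxchg(), which isn't visible in the
quoted hunk), the 2-byte xchg emulation would look roughly like this
(only a sketch, not the final patch):

static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
                                            pgoff_t offset,
                                            unsigned short new_id)
{
        unsigned short old_id;
        struct swap_cgroup *sc = &map[offset / ID_PER_SC];
        /* generalized shift, per the (offset % ID_PER_SC) suggestion */
        unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;
        unsigned int new_ids, old_ids = atomic_read(&sc->ids);

        do {
                old_id = (old_ids >> shift) & ID_MASK;     /* extract current id */
                new_ids = old_ids & ~(ID_MASK << shift);   /* clear that slot */
                new_ids |= (unsigned int)new_id << shift;  /* install the new id */
        } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));

        return old_id;
}

The same (offset % ID_PER_SC) * ID_SHIFT expression would of course have to
be used in __swap_cgroup_id_lookup() as well, so both helpers agree on which
half of the word a given offset maps to.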