> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, December 22, 2020 2:06 PM
> To: 'Vitaly Wool' <vitaly.wool@xxxxxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>; Minchan Kim <minchan@xxxxxxxxxx>;
> Mike Galbraith <efault@xxxxxx>; LKML <linux-kernel@xxxxxxxxxxxxxxx>;
> linux-mm <linux-mm@xxxxxxxxx>; Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>;
> NitinGupta <ngupta@xxxxxxxxxx>; Sergey Senozhatsky <sergey.senozhatsky.work@xxxxxxxxx>;
> Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
>
> > -----Original Message-----
> > From: Vitaly Wool [mailto:vitaly.wool@xxxxxxxxxxxx]
> > Sent: Tuesday, December 22, 2020 2:00 PM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song)
> > <song.bao.hua@xxxxxxxxxxxxx> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Song Bao Hua (Barry Song)
> > > > Sent: Tuesday, December 22, 2020 11:38 AM
> > > > To: 'Vitaly Wool' <vitaly.wool@xxxxxxxxxxxx>
> > > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> > > >
> > > > > -----Original Message-----
> > > > > From: Vitaly Wool [mailto:vitaly.wool@xxxxxxxxxxxx]
> > > > > Sent: Tuesday, December 22, 2020 11:12 AM
> > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > >
> > > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> > > > > <song.bao.hua@xxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Shakeel Butt [mailto:shakeelb@xxxxxxxxxx]
> > > > > > > Sent: Tuesday, December 22, 2020 10:03 AM
> > > > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > > > >
> > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > > > > > <song.bao.hua@xxxxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Shakeel Butt [mailto:shakeelb@xxxxxxxxxx]
> > > > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > > > > > > To: Vitaly Wool <vitaly.wool@xxxxxxxxxxxx>
> > > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > > > > > >
> > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <vitaly.wool@xxxxxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > > > > > > zsmalloc takes a bit spinlock in its _map() callback and releases it
> > > > > > > > > > > > only in unmap(), which is unsafe and leads to zswap complaining
> > > > > > > > > > > > about scheduling in atomic context.
> > > > > > > > > > > >
> > > > > > > > > > > > To fix that and to improve the RT properties of zsmalloc, remove that
> > > > > > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > > > > > >
> > > > > > > > > > > I don't want to use such open code for the lock.
> > > > > > > > > > >
> > > > > > > > > > > I see from Mike's patch that the recent zswap change introduced the
> > > > > > > > > > > lockdep splat bug, and you want to improve zsmalloc to fix the zswap
> > > > > > > > > > > bug and introduce this patch to allow enabling preemption.
> > > > > > > > > >
> > > > > > > > > > This understanding is upside down. The code in zswap you are referring
> > > > > > > > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > > > > > > > nothing wrong in taking a mutex.
> > > > > > > > >
> > > > > > > > > Is this suboptimal for all or just the hardware accelerators? Sorry, I
> > > > > > > > > am not very familiar with the crypto API. If I select lzo or lz4 as a
> > > > > > > > > zswap compressor, will the [de]compression be async or sync?
> > > > > > > >
> > > > > > > > Right now, in the crypto subsystem, new drivers are required to be
> > > > > > > > written against the async APIs. The old sync API can't work in new
> > > > > > > > accelerator drivers as it is not supported there at all.
> > > > > > > >
> > > > > > > > Old drivers used to be sync, but they've got async wrappers to support
> > > > > > > > the async APIs, e.g.:
> > > > > > > > crypto: acomp - add support for lz4 via scomp
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > > > > > >
> > > > > > > > crypto: acomp - add support for lzo via scomp
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > > > > > >
> > > > > > > > So they support the async APIs, but they still work in sync mode as
> > > > > > > > those old drivers don't sleep.
> > > > > > >
> > > > > > > Good to know that those are sync because I want them to be sync.
> > > > > > > Please note that zswap is a cache in front of a real swap and the load
> > > > > > > operation is latency sensitive as it comes in the page fault path and
> > > > > > > directly impacts the applications. I doubt that decompressing a 4k page
> > > > > > > synchronously on a cpu will be costlier than decompressing the same
> > > > > > > page asynchronously on a hardware accelerator.
> > > > > >
> > > > > > If you read the old paper:
> > > > > > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
> > > > > >
> > > > > >   Because the hardware accelerator speeds up compression, looking at the
> > > > > >   zswap metrics we observed that there were more store and load requests
> > > > > >   in a given amount of time, which filled up the zswap pool faster than a
> > > > > >   software compression run. Because of this behavior, we set the
> > > > > >   max_pool_percent parameter to 30 for the hardware compression runs -
> > > > > >   this means that zswap can use up to 30% of the 10GB of total memory.
> > > > > >
> > > > > > So by using hardware accelerators, we get a chance to speed up
> > > > > > compression while decreasing cpu utilization.
> > > > > >
> > > > > > BTW, if it is not easy to change zsmalloc, one quick workaround we might
> > > > > > do in zswap is adding the below after applying Mike's original patch:
> > > > > >
> > > > > >         if (in_atomic()) /* for zsmalloc */
> > > > > >                 while (!try_wait_for_completion(&req->done));
> > > > > >         else /* for zbud, z3fold */
> > > > > >                 crypto_wait_req(....);
> > > > >
> > > > > I don't think I'm going to ack this, sorry.
> > > >
> > > > Fair enough.
> > > > And I am also thinking whether we can move zpool_unmap_handle() to
> > > > right after zpool_map_handle() as below:
> > > >
> > > >         dlen = PAGE_SIZE;
> > > >         src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > > >         if (zpool_evictable(entry->pool->zpool))
> > > >                 src += sizeof(struct zswap_header);
> > > > +       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > >
> > > >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > >         mutex_lock(acomp_ctx->mutex);
> > > >         sg_init_one(&input, src, entry->length);
> > > >         sg_init_table(&output, 1);
> > > >         sg_set_page(&output, page, PAGE_SIZE, 0);
> > > >         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > > >         ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > > >         mutex_unlock(acomp_ctx->mutex);
> > > >
> > > > -       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > >
> > > > src is always low memory and we only need its virtual address to get
> > > > the page of src in sg_init_one(); we don't actually read it with the
> > > > CPU anywhere.
> > >
> > > The below code might be better:
> > >
> > >         dlen = PAGE_SIZE;
> > >         src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > >         if (zpool_evictable(entry->pool->zpool))
> > >                 src += sizeof(struct zswap_header);
> > >
> > >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > >
> > > +       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > >
> > >         mutex_lock(acomp_ctx->mutex);
> > >         sg_init_one(&input, src, entry->length);
> > >         sg_init_table(&output, 1);
> > >         sg_set_page(&output, page, PAGE_SIZE, 0);
> > >         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > >         ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > >         mutex_unlock(acomp_ctx->mutex);
> > >
> > > -       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> >
> > I don't see how this is going to work since we can't guarantee src
> > will be a valid pointer after the zpool_unmap_handle() call, can we?
> > Could you please elaborate?
>
> A valid pointer is for the CPU to read and write. Here, the CPU doesn't
> read or write it; we only need to get the page struct from the address.
>
> void sg_init_one(struct scatterlist *sg, const void *buf, unsigned int buflen)
> {
>         sg_init_table(sg, 1);
>         sg_set_buf(sg, buf, buflen);
> }
>
> static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
>                               unsigned int buflen)
> {
> #ifdef CONFIG_DEBUG_SG
>         BUG_ON(!virt_addr_valid(buf));
> #endif
>         sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
> }
>
> sg_init_one() always takes an address which has a linear mapping to a
> physical address, so once we have the value of src, we can get its page
> struct.
>
> src has a linear mapping to its physical address; it doesn't require the
> page table walk that vmalloc_to_page() needs.
>
> The req only requires the page to initialize the sg table. I think if we
> are going to use cpu-based (de)compression, the crypto driver will kmap
> it again.

Probably I made another bug here: for zsmalloc, it is possible to get
highmem for the zpool since its malloc_support_movable = true.

        if (zpool_malloc_support_movable(entry->pool->zpool))
                gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
        ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);

On a 64-bit system there is never highmem. On a 32-bit system we may
trigger this bug.
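To make the 32-bit case concrete, here is a rough illustration (not a
proposed change, just spelling out the assumption sg_init_one() relies
on):

        /*
         * Illustration only: with CONFIG_HIGHMEM the compressed object
         * may sit in a highmem page, so the address returned by
         * zpool_map_handle() can be a kmap-style mapping that is not
         * part of the kernel linear map.
         */
        src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);

        /*
         * For such an address virt_addr_valid(src) is false and
         * virt_to_page(src) resolves to the wrong page, which is exactly
         * what the BUG_ON() in sg_set_buf() guards against -- so
         * sg_init_one(&input, src, entry->length) is only safe while src
         * is lowmem.
         */
        WARN_ON_ONCE(!virt_addr_valid(src));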
So actually zswap should have used kmap_to_page(), which supports both
linear and non-linear mappings; sg_init_one() only supports linear
mappings.

But it doesn't change the fact: once the req is initialized with the page
struct, we can unmap src. If we are going to use a HW accelerator, it
will be DMA; if we are going to use CPU decompression, the crypto driver
will kmap() the page again.

> >
> > ~Vitaly
>
> Thanks
> Barry
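P.S. Just to make the kmap_to_page() idea concrete, a rough, untested
sketch of how the sg entry for src could be built without assuming a
linear-map address (names follow the zswap load path quoted above;
illustration only, not a tested patch):

        struct scatterlist input, output;
        struct page *src_page;

        /*
         * kmap_to_page() resolves both linear-map addresses and kmap'ed
         * highmem addresses to the backing struct page, so the sg entry
         * can be built without assuming src is lowmem.
         */
        src_page = kmap_to_page(src);
        sg_init_table(&input, 1);
        sg_set_page(&input, src_page, entry->length, offset_in_page(src));

        sg_init_table(&output, 1);
        sg_set_page(&output, page, PAGE_SIZE, 0);
        acomp_request_set_params(acomp_ctx->req, &input, &output,
                                 entry->length, dlen);

That would keep the "we only need the struct page" argument valid for
both the lowmem case and the 32-bit highmem case.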