> -----Original Message-----
> From: Sebastian Andrzej Siewior [mailto:bigeasy@xxxxxxxxxxxxx]
> Sent: Tuesday, September 29, 2020 4:25 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: akpm@xxxxxxxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; Luis Claudio R. Goncalves
> <lgoncalv@xxxxxxxxxx>; Mahipal Challa <mahipalreddy2006@xxxxxxxxx>;
> Seth Jennings <sjenning@xxxxxxxxxx>; Dan Streetman <ddstreet@xxxxxxxx>;
> Vitaly Wool <vitaly.wool@xxxxxxxxxxxx>; Wangzhou (B)
> <wangzhou1@xxxxxxxxxxxxx>; fanghao (A) <fanghao11@xxxxxxxxxx>; Colin
> Ian King <colin.king@xxxxxxxxxxxxx>
> Subject: Re: [PATCH v6] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
>
> On 2020-08-19 00:31:00 [+1200], Barry Song wrote:
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index fbb782924ccc..00b5f14a7332 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,9 +129,17 @@ module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
> >  * data structures
> >  **********************************/
> >
> > +struct crypto_acomp_ctx {
> > +	struct crypto_acomp *acomp;
> > +	struct acomp_req *req;
> > +	struct crypto_wait wait;
> > +	u8 *dstmem;
> > +	struct mutex *mutex;
> > +};
> > +
> >  struct zswap_pool {
> >  	struct zpool *zpool;
> > -	struct crypto_comp * __percpu *tfm;
> > +	struct crypto_acomp_ctx __percpu *acomp_ctx;
> >  	struct kref kref;
> >  	struct list_head list;
> >  	struct work_struct release_work;
> > @@ -388,23 +398,43 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> >  * per-cpu code
> >  **********************************/
> >  static DEFINE_PER_CPU(u8 *, zswap_dstmem);
> > +/*
> > + * If users dynamically change the zpool type and compressor at runtime,
> > + * i.e. zswap is running, zswap can have more than one zpool on one cpu,
> > + * but they are sharing dstmem.
> > + * So we need this mutex to be per-cpu.
> > + */
> > +static DEFINE_PER_CPU(struct mutex *, zswap_mutex);
>
> There is no need to make it struct mutex *. You could make it struct
> mutex. The following makes the relationship here more obvious (even
> without a comment):
>
> |struct zswap_mem_pool {
> |	void *dst_mem;
> |	struct mutex lock;
> |};
> |
> |static DEFINE_PER_CPU(struct zswap_mem_pool, zswap_mem_pool);
>
> Please access only this; don't add a pointer in another struct to
> dst_mem or lock, which may confuse people.

Well, it seems sensible.

> This resource is per-CPU. Do you really utilize them all? On 2, 8, 16,
> 64, 128 core system? More later…

It might be a good place to investigate if we do deep performance tuning
of zswap afterwards, but it is obviously out of the scope of this patch.
In my understanding, with many cores a bottleneck could appear earlier
in the memory-management code path, before frontswap and zswap are ever
reached, since memory and the LRU are managed globally and lock
contention there could limit the throughput that frontswap sees.

> > @@ -916,14 +974,21 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
> >
> >  	case ZSWAP_SWAPCACHE_NEW: /* page is locked */
> >  		/* decompress */
> > +		acomp_ctx = this_cpu_ptr(entry->pool->acomp_ctx);
>
> My feeling is that this triggers a warning with CONFIG_DEBUG_PREEMPT. I
> don't see how it could be avoided. If you are not preemptible here then
> you must not sleep later.

I guess you are talking about some other APIs like get_cpu_ptr().
While APIs like get_cpu_var() and get_cpu_ptr() prohibit preemption and
disallow sleeping, this_cpu_ptr() does not:

#define get_cpu_var(var)						\
(*({									\
	preempt_disable();						\
	this_cpu_ptr(&var);						\
}))

#define put_cpu_var(var)						\
do {									\
	(void)&(var);							\
	preempt_enable();						\
} while (0)

#define get_cpu_ptr(var)						\
({									\
	preempt_disable();						\
	this_cpu_ptr(var);						\
})

#define put_cpu_ptr(var)						\
do {									\
	(void)(var);							\
	preempt_enable();						\
} while (0)

> > +
> >  		dlen = PAGE_SIZE;
> >  		src = (u8 *)zhdr + sizeof(struct zswap_header);
> > -		dst = kmap_atomic(page);
> > -		tfm = *get_cpu_ptr(entry->pool->tfm);
> > -		ret = crypto_comp_decompress(tfm, src, entry->length,
> > -				dst, &dlen);
> > -		put_cpu_ptr(entry->pool->tfm);
> > -		kunmap_atomic(dst);
> > +		dst = kmap(page);
> > +
> > +		mutex_lock(acomp_ctx->mutex);
> > +		sg_init_one(&input, src, entry->length);
> > +		sg_init_one(&output, dst, dlen);
> > +		acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > +		ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > +		dlen = acomp_ctx->req->dlen;
> > +		mutex_unlock(acomp_ctx->mutex);
>
> Say a request comes in on CPU1. After the mutex_lock() you get migrated
> to CPU2. You do crypto_wait_req() on CPU2. In that time another request
> gets in on CPU1. It blocks on the very same mutex. No data corruption,
> but it could have used another buffer instead of blocking and waiting
> for the previous one to finish its work.
> So it might make sense to have a pool of pages which are shared
> globally in the system, instead of strict per-CPU allocations that
> remain fully migrate-able while they are in use.

It might be worth investigating whether this can help improve the
performance of zswap; it could go on a todo list for optimizing the
utilization of the pool. On the other hand, that change obviously does
not belong in this patch.
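To make my point concrete, here is a rough sketch contrasting the two
access patterns; the surrounding code and variable names just mirror the
hunks quoted above, and this is only an illustration, not code from the
patch:

```c
/* Old pattern: get_cpu_ptr() implies preempt_disable(), so nothing in
 * the section may sleep. That was fine for crypto_comp_decompress(),
 * which completes synchronously.
 */
tfm = *get_cpu_ptr(entry->pool->tfm);	/* preemption off from here */
ret = crypto_comp_decompress(tfm, src, entry->length, dst, &dlen);
put_cpu_ptr(entry->pool->tfm);		/* preemption back on */

/* New pattern: this_cpu_ptr() leaves preemption enabled; the per-CPU
 * context is serialized by a mutex instead, so the asynchronous
 * request below is allowed to sleep while waiting for completion.
 */
acomp_ctx = this_cpu_ptr(entry->pool->acomp_ctx);
mutex_lock(acomp_ctx->mutex);		/* may sleep: preemption is on */
ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
		      &acomp_ctx->wait);	/* may sleep */
mutex_unlock(acomp_ctx->mutex);
```

That is why the patch pairs this_cpu_ptr() with a sleepable mutex rather
than keeping the get_cpu_ptr()/put_cpu_ptr() pair around the
decompression.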
> While at it: For dst you could use sg_set_page(). This would avoid the
> kmap().

True, though it needs a call to sg_init_table(). If an sg list holding a
single page is a common enough case, it might be worth adding an API
similar to sg_init_one() whose parameter is a page rather than a buffer.
For this case, I agree it is better to remove the redundant
kmap()/kunmap().

> > +		kunmap(page);
> >  		BUG_ON(ret);
> >  		BUG_ON(dlen != PAGE_SIZE);
> >
> Sebastian

Thanks
Barry
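P.S. For reference, the sg_set_page() variant suggested above would
presumably look roughly like this, against the decompress hunk quoted
earlier (a sketch under that assumption, not tested code):

```c
struct scatterlist input, output;

sg_init_one(&input, src, entry->length);

/* Point the output sg entry at the struct page directly; no kmap() of
 * dst is needed because the sg list carries the page itself.
 */
sg_init_table(&output, 1);
sg_set_page(&output, page, PAGE_SIZE, 0);

acomp_request_set_params(acomp_ctx->req, &input, &output,
			 entry->length, dlen);
```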