RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration

"Song Bao Hua (Barry Song)" <song.bao.hua@xxxxxxxxxxxxx> · Wed, 8 Jul 2020 21:45:47 +0000

> -----Original Message-----
> From: linux-crypto-owner@xxxxxxxxxxxxxxx
> [mailto:linux-crypto-owner@xxxxxxxxxxxxxxx] On Behalf Of Sebastian Andrzej
> Siewior
> Sent: Thursday, July 9, 2020 3:00 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: akpm@xxxxxxxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; Linuxarm <linuxarm@xxxxxxxxxx>; Luis Claudio
> R . Goncalves <lgoncalv@xxxxxxxxxx>; Mahipal Challa
> <mahipalreddy2006@xxxxxxxxx>; Seth Jennings <sjenning@xxxxxxxxxx>;
> Dan Streetman <ddstreet@xxxxxxxx>; Vitaly Wool
> <vitaly.wool@xxxxxxxxxxxx>; Wangzhou (B) <wangzhou1@xxxxxxxxxxxxx>;
> Colin Ian King <colin.king@xxxxxxxxxxxxx>
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-08 00:52:10 [+1200], Barry Song wrote:
> …
> > @@ -127,9 +129,17 @@ module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
> >  * data structures
> >  **********************************/
> >
> > +struct crypto_acomp_ctx {
> > +	struct crypto_acomp *acomp;
> > +	struct acomp_req *req;
> > +	struct crypto_wait wait;
> > +	u8 *dstmem;
> > +	struct mutex mutex;
> > +};
> …
> > @@ -561,8 +614,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> >  	pr_debug("using %s zpool\n", zpool_get_type(pool->zpool));
> >
> >  	strlcpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> > -	pool->tfm = alloc_percpu(struct crypto_comp *);
> > -	if (!pool->tfm) {
> > +
> > +	pool->acomp_ctx = alloc_percpu(struct crypto_acomp_ctx *);
> 
> Can't you allocate the whole structure instead of just a pointer to it? The
> structure looks just like a bunch of pointers anyway. Less time for
> pointer chasing means more time for fun.
> 

Should be possible.
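
To make the difference concrete, here is a minimal userspace sketch of the two layouts. It is only a model, not the patch: struct ctx_model, NR_CPUS_MODEL and all the helper names are made up for illustration, and a plain array indexed by CPU number stands in for the per-CPU area. In the kernel the second layout would presumably come from alloc_percpu(struct crypto_acomp_ctx), with this_cpu_ptr() returning the structure itself rather than a pointer to be dereferenced.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS_MODEL 4	/* hypothetical CPU count for this model */

/* Hypothetical stand-in for struct crypto_acomp_ctx. */
struct ctx_model {
	void *acomp;
	void *req;
	unsigned char *dstmem;
};

/* v4 layout: a per-CPU array of pointers, each context allocated separately. */
struct ctx_model **alloc_ptr_layout(void)
{
	struct ctx_model **tbl = calloc(NR_CPUS_MODEL, sizeof(*tbl));

	if (!tbl)
		return NULL;
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		tbl[cpu] = calloc(1, sizeof(**tbl));
		if (!tbl[cpu]) {
			/* Unwind the contexts allocated so far. */
			while (cpu--)
				free(tbl[cpu]);
			free(tbl);
			return NULL;
		}
	}
	return tbl;
}

void free_ptr_layout(struct ctx_model **tbl)
{
	if (!tbl)
		return;
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		free(tbl[cpu]);
	free(tbl);
}

/* Suggested layout: the contexts themselves form the per-CPU array. */
struct ctx_model *alloc_struct_layout(void)
{
	return calloc(NR_CPUS_MODEL, sizeof(struct ctx_model));
}

/* The slot must be loaded first to learn where the context lives. */
struct ctx_model *ptr_layout_ctx(struct ctx_model **tbl, int cpu)
{
	return tbl[cpu];
}

/* Address arithmetic only; no extra pointer is loaded. */
struct ctx_model *struct_layout_ctx(struct ctx_model *tbl, int cpu)
{
	return &tbl[cpu];
}
```

The second layout needs one allocation instead of NR_CPUS_MODEL + 1 and has no per-CPU unwind path on failure, which is the "less pointer chasing" point in a nutshell.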

> > @@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> >  	}
> >
> >  	/* compress */
> > -	dst = get_cpu_var(zswap_dstmem);
> > -	tfm = *get_cpu_ptr(entry->pool->tfm);
> > -	src = kmap_atomic(page);
> > -	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> > -	kunmap_atomic(src);
> > -	put_cpu_ptr(entry->pool->tfm);
> > +	acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
> > +
> > +	mutex_lock(&acomp_ctx->mutex);
> > +
> > +	src = kmap(page);
> > +	dst = acomp_ctx->dstmem;
> 
> that mutex is per-CPU, per-context. The dstmem pointer is per-CPU. So if
> I read this right, you can get preempted after crypto_wait_req() and
> another context in this CPU writes its data to the same dstmem and then…
> 

That is not what happens. Any other thread that picks up this CPU's context is
blocked by the mutex, so two threads can never write to the same dstmem at the
same time. The mutex and dstmem belong to the same per-CPU context, and a thread
keeps using the context it looked up, even if it is migrated afterwards.
1. If thread1 runs on cpu1 and holds cpu1's mutex, another thread that wants to
   use cpu1's context is blocked until thread1 releases the mutex.
2. If thread1 takes cpu1's mutex on cpu1 and is then migrated to cpu2 (which
   should be very rare):
	a. another thread running on cpu1 is still blocked on cpu1's mutex;
	b. another thread running on cpu2 is not blocked, but it writes to
	   cpu2's dstmem, not cpu1's.
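
The cases above can be modelled in userspace with pthreads. This is only a sketch under assumptions: struct ctx_model, the slot count and run_dstmem_model() are hypothetical names, a fixed array of slots stands in for the per-CPU contexts, and sched_yield() stands in for being preempted or migrated while holding the mutex. The property it checks is the one that matters here: the buffer a thread fills under a context's mutex is never modified by anyone else until it unlocks.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>

#define NR_SLOTS   2	/* hypothetical stand-in for the number of CPUs */
#define NR_THREADS 8
#define NR_ROUNDS  2000
#define BUF_SIZE   64

/* Userspace model of a per-CPU context: the mutex guards its own dstmem. */
struct ctx_model {
	pthread_mutex_t mutex;
	unsigned char dstmem[BUF_SIZE];
};

static struct ctx_model slots[NR_SLOTS];

struct worker {
	pthread_t tid;
	int id;
	int bad;	/* bytes found overwritten by another thread */
};

static void *worker_fn(void *arg)
{
	struct worker *w = arg;
	unsigned char tag = (unsigned char)(w->id + 1);

	for (int i = 0; i < NR_ROUNDS; i++) {
		/* Look the context up once and keep using that pointer. */
		struct ctx_model *ctx = &slots[(w->id + i) % NR_SLOTS];

		pthread_mutex_lock(&ctx->mutex);
		memset(ctx->dstmem, tag, BUF_SIZE);
		sched_yield();	/* pretend we got preempted or migrated here */
		for (int j = 0; j < BUF_SIZE; j++)
			if (ctx->dstmem[j] != tag)
				w->bad++;
		pthread_mutex_unlock(&ctx->mutex);
	}
	return NULL;
}

/* Returns the number of corrupted bytes seen by all workers. */
int run_dstmem_model(void)
{
	struct worker workers[NR_THREADS];
	int bad = 0;

	for (int s = 0; s < NR_SLOTS; s++)
		pthread_mutex_init(&slots[s].mutex, NULL);
	for (int t = 0; t < NR_THREADS; t++) {
		workers[t].id = t;
		workers[t].bad = 0;
		pthread_create(&workers[t].tid, NULL, worker_fn, &workers[t]);
	}
	for (int t = 0; t < NR_THREADS; t++) {
		pthread_join(workers[t].tid, NULL);
		bad += workers[t].bad;
	}
	return bad;
}
```

Calling run_dstmem_model() from a small main() should report no corrupted bytes. Threads that land on the same slot serialize on that slot's mutex, and threads on different slots touch different buffers.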

> > +	sg_init_one(&input, src, PAGE_SIZE);
> > +	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> > +	sg_init_one(&output, dst, PAGE_SIZE * 2);
> > +	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> > +	/*
> > +	 * it maybe looks a little bit silly that we send an asynchronous request,
> > +	 * then wait for its completion synchronously. This makes the process look
> > +	 * synchronous in fact.
> > +	 * Theoretically, acomp supports users send multiple acomp requests in one
> > +	 * acomp instance, then get those requests done simultaneously. but in this
> > +	 * case, frontswap actually does store and load page by page, there is no
> > +	 * existing method to send the second page before the first page is done
> > +	 * in one thread doing frontswap.
> > +	 * but in different threads running on different cpu, we have different
> > +	 * acomp instance, so multiple threads can do (de)compression in parallel.
> > +	 */
> > +	ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > +	dlen = acomp_ctx->req->dlen;
> > +	kunmap(page);
> > +
> >  	if (ret) {
> >  		ret = -EINVAL;
> >  		goto put_dstmem;
> 
> This looks like it uses the same synchronous mechanism around an
> asynchronous interface. It works as a PoC.
> 
> As far as I remember the crypto async interface, the incoming skbs were
> fed to the async interface and returned to the caller so the NIC could
> continue to allocate new RX skbs and move on. Only if the queue of requests
> was getting too long the code started to throttle. Eventually the async
> crypto code completed the decryption operation in a different context
> and fed the decrypted packet(s) into the stack.
> 
> From a quick view, you would have to return -EINPROGRESS here and have
> at the caller side something like that:
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index e8726f3e3820b..9d1baa46ec3ed 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -252,12 +252,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>                 unlock_page(page);
>                 goto out;
>         }
> -       if (frontswap_store(page) == 0) {
> +       ret = frontswap_store(page);
> +       if (ret == 0) {
>                 set_page_writeback(page);
>                 unlock_page(page);
>                 end_page_writeback(page);
>                 goto out;
>         }
> +       if (ret == -EINPROGRESS)
> +               goto out;
>         ret = __swap_writepage(page, wbc, end_swap_bio_write);
>  out:
>         return ret;
> 
> so that eventually callers like write_cache_pages() could feed all pages
> into *writepage and then wait for that bulk to finish.
> 
> Having it this way would also reshape the memory allocation you have.
> You have now per-context a per-CPU crypto request and everything. With
> a 64 or 128 core I'm not sure you will use all those resources.
> With a truly async interface you would be forced to have a resource pool
> or so which you would use and then only allow a certain amount of
> parallel requests.

I agree we can optimize swap, frontswap, and zswap by making frontswap async to
improve performance. But this needs very careful thought and benchmarking; we
need benchmarks to prove that those changes actually improve performance.
I am very interested in working out a patchset for that. But as a first step,
we need to build a base so that everything else can move ahead. Right now, zswap
simply can't work with the new compression drivers. Once we have a base, we can
try many interesting things, including making frontswap more efficient.

> 
> Sebastian

Thanks
Barry