Re: [PATCH v3] lightnvm: physical block device (pblk) target

Bart Van Assche <Bart.VanAssche@xxxxxxxxxxx> · Sat, 8 Apr 2017 20:56:10 +0000

On 04/07/17 11:50, Javier González wrote:
>  Documentation/lightnvm/pblk.txt  |   21 +
>  drivers/lightnvm/Kconfig         |   19 +
>  drivers/lightnvm/Makefile        |    5 +
>  drivers/lightnvm/pblk-cache.c    |  112 +++
>  drivers/lightnvm/pblk-core.c     | 1641 ++++++++++++++++++++++++++++++++++++++
>  drivers/lightnvm/pblk-gc.c       |  542 +++++++++++++
>  drivers/lightnvm/pblk-init.c     |  942 ++++++++++++++++++++++
>  drivers/lightnvm/pblk-map.c      |  135 ++++
>  drivers/lightnvm/pblk-rb.c       |  859 ++++++++++++++++++++
>  drivers/lightnvm/pblk-read.c     |  513 ++++++++++++
>  drivers/lightnvm/pblk-recovery.c | 1007 +++++++++++++++++++++++
>  drivers/lightnvm/pblk-rl.c       |  182 +++++
>  drivers/lightnvm/pblk-sysfs.c    |  500 ++++++++++++
>  drivers/lightnvm/pblk-write.c    |  408 ++++++++++
>  drivers/lightnvm/pblk.h          | 1127 ++++++++++++++++++++++++++
>  include/linux/lightnvm.h         |   57 +-
>  pblk-sysfs.c                     |  581 ++++++++++++++

This patch introduces two slightly different versions of pblk-sysfs.c - 
one at the top level and one in drivers/lightnvm. Please remove the file
at the top level.

> +config NVM_PBLK_L2P_OPT
> +	bool "PBLK with optimized L2P table for devices up to 8TB"
> +	depends on NVM_PBLK
> +	---help---
> +	Physical device addresses are typical 64-bit integers. Since we store
> +	the logical to physical (L2P) table on the host, this takes 1/500 of
> +	host memory (e.g., 2GB per 1TB of storage media). On drives under 8TB,
> +	it is possible to reduce this to 1/1000 (e.g., 1GB per 1TB). This
> +	option allows to do this optimization on the host L2P table.

Why is NVM_PBLK_L2P_OPT a compile-time option instead of a run-time 
option? Since this define does not affect the definition of the ppa_addr 
I don't see why this has to be a compile-time option. For e.g. Linux 
distributors the only choice would be to disable NVM_PBLK_L2P_OPT. I 
think it would be unfortunate if no Linux distribution ever would be 
able to benefit from this optimization.

> +#ifdef CONFIG_NVM_DEBUG
> +	atomic_add(nr_entries, &pblk->inflight_writes);
> +	atomic_add(nr_entries, &pblk->req_writes);
> +#endif

Has it been considered to use the "static key" feature such that 
consistency checks can be enabled at run-time instead of having to 
rebuild the kernel to enable CONFIG_NVM_DEBUG?

> +#ifdef CONFIG_NVM_DEBUG
> +	BUG_ON(nr_rec_entries != valid_entries);
> +	atomic_add(valid_entries, &pblk->inflight_writes);
> +	atomic_add(valid_entries, &pblk->recov_gc_writes);
> +#endif

Are you aware that Linus is strongly opposed against using BUG_ON()?

> +#ifdef CONFIG_NVM_DEBUG
> +	lockdep_assert_held(&l_mg->free_lock);
> +#endif

Why is lockdep_assert_held() surrounded with #ifdef CONFIG_NVM_DEBUG / 
#endif? Are you aware that lockdep_assert_held() do not generate any 
code with CONFIG_PROVE_LOCKING=n?

> +static const struct block_device_operations pblk_fops = {
> +	.owner		= THIS_MODULE,
> +};

Is this data structure useful? If so, where is pblk_fops used?

> +static void pblk_l2p_free(struct pblk *pblk)
> +{
> +	vfree(pblk->trans_map);
> +}
> +
> +static int pblk_l2p_init(struct pblk *pblk)
> +{
> +	sector_t i;
> +
> +	pblk->trans_map = vmalloc(sizeof(pblk_addr) * pblk->rl.nr_secs);
> +	if (!pblk->trans_map)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < pblk->rl.nr_secs; i++)
> +		pblk_ppa_set_empty(&pblk->trans_map[i]);
> +
> +	return 0;
> +}

Has it been considered to add support for keeping a subset of the L2P 
translation table in memory instead of keeping it in memory in its entirety?

> +	sprintf(cache_name, "pblk_line_m_%s", pblk->disk->disk_name);

Please use snprintf() or kasprintf() instead of printf(). That makes it 
easier for humans to verify that no buffer overflow is triggered.

> +/* physical block device target */
> +static struct nvm_tgt_type tt_pblk = {
> +	.name		= "pblk",
> +	.version	= {1, 0, 0},

Are version numbers useful inside the kernel tree?

> +void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
> +		 unsigned long *lun_bitmap, unsigned int valid_secs,
> +		 unsigned int off)
> +{
> +	struct pblk_sec_meta *meta_list = rqd->meta_list;
> +	unsigned int map_secs;
> +	int min = pblk->min_write_pgs;
> +	int i;
> +
> +	for (i = off; i < rqd->nr_ppas; i += min) {
> +		map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
> +		pblk_map_page_data(pblk, sentry + i, &rqd->ppa_list[i],
> +					lun_bitmap, &meta_list[i], map_secs);
> +	}
> +}

Has it been considered to implement the above code such that no modulo 
(%) computation is needed, which is a relatively expensive operation? I 
think for this loop that can be done easily.

> +static DECLARE_RWSEM(pblk_rb_lock);
> +
> +void pblk_rb_data_free(struct pblk_rb *rb)
> +{
> +	struct pblk_rb_pages *p, *t;
> +
> +	down_write(&pblk_rb_lock);
> +	list_for_each_entry_safe(p, t, &rb->pages, list) {
> +		free_pages((unsigned long)page_address(p->pages), p->order);
> +		list_del(&p->list);
> +		kfree(p);
> +	}
> +	up_write(&pblk_rb_lock);
> +}

Global locks like pblk_rb_lock are a performance bottleneck on 
multi-socket systems. Why is that lock global?

Bart.