Re: [PATCH 3/5] lightnvm: Flexible DMA pool entry size

> On 9 Oct 2018, at 22.23, Matias Bjørling <mb@xxxxxxxxxxx> wrote:
> 
> On 10/09/2018 03:10 PM, Javier Gonzalez wrote:
>>> On 9 Oct 2018, at 21.56, Matias Bjørling <mb@xxxxxxxxxxx> wrote:
>>> 
>>> On 10/09/2018 02:10 PM, Hans Holmberg wrote:
>>>> On Tue, Oct 9, 2018 at 12:03 PM Igor Konopko <igor.j.konopko@xxxxxxxxx> wrote:
>>>>> On 09.10.2018 11:16, Hans Holmberg wrote:
>>>>>> On Fri, Oct 5, 2018 at 3:38 PM Igor Konopko <igor.j.konopko@xxxxxxxxx> wrote:
>>>>>>> Currently the whole of lightnvm and pblk uses a single DMA pool,
>>>>>>> whose entry size is always equal to PAGE_SIZE. The PPA list
>>>>>>> always needs 8b*64, which leaves only 56b*64 of space for the
>>>>>>> OOB meta. Since the NVMe OOB meta can be bigger, e.g. 128b,
>>>>>>> this solution is not robust.
>>>>>>> 
>>>>>>> This patch adds the possibility to support OOB meta above
>>>>>>> 56b by creating a separate DMA pool for PBLK, with an entry
>>>>>>> size big enough to store both the PPA list and such
>>>>>>> OOB metadata.
>>>>>>> 
>>>>>>> Signed-off-by: Igor Konopko <igor.j.konopko@xxxxxxxxx>
>>>>>>> ---
>>>>>>>   drivers/lightnvm/core.c          | 33 +++++++++++++++++++++++---------
>>>>>>>   drivers/lightnvm/pblk-core.c     | 19 +++++++++---------
>>>>>>>   drivers/lightnvm/pblk-init.c     | 11 +++++++++++
>>>>>>>   drivers/lightnvm/pblk-read.c     |  3 ++-
>>>>>>>   drivers/lightnvm/pblk-recovery.c |  9 +++++----
>>>>>>>   drivers/lightnvm/pblk.h          | 11 ++++++++++-
>>>>>>>   drivers/nvme/host/lightnvm.c     |  6 ++++--
>>>>>>>   include/linux/lightnvm.h         |  8 +++++---
>>>>>>>   8 files changed, 71 insertions(+), 29 deletions(-)
>>>>>>> 
>>>>>>> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
>>>>>>> index efb976a863d2..48db7a096257 100644
>>>>>>> --- a/drivers/lightnvm/core.c
>>>>>>> +++ b/drivers/lightnvm/core.c
>>>>>>> @@ -641,20 +641,33 @@ void nvm_unregister_tgt_type(struct nvm_tgt_type *tt)
>>>>>>>   }
>>>>>>>   EXPORT_SYMBOL(nvm_unregister_tgt_type);
>>>>>>> 
>>>>>>> -void *nvm_dev_dma_alloc(struct nvm_dev *dev, gfp_t mem_flags,
>>>>>>> -                                                       dma_addr_t *dma_handler)
>>>>>>> +void *nvm_dev_dma_alloc(struct nvm_dev *dev, void *pool,
>>>>>>> +                               gfp_t mem_flags, dma_addr_t *dma_handler)
>>>>>>>   {
>>>>>>> -       return dev->ops->dev_dma_alloc(dev, dev->dma_pool, mem_flags,
>>>>>>> -                                                               dma_handler);
>>>>>>> +       return dev->ops->dev_dma_alloc(dev, pool ?: dev->dma_pool,
>>>>>>> +                                               mem_flags, dma_handler);
>>>>>>>   }
>>>>>>>   EXPORT_SYMBOL(nvm_dev_dma_alloc);
>>>>>>> 
>>>>>>> -void nvm_dev_dma_free(struct nvm_dev *dev, void *addr, dma_addr_t dma_handler)
>>>>>>> +void nvm_dev_dma_free(struct nvm_dev *dev, void *pool,
>>>>>>> +                               void *addr, dma_addr_t dma_handler)
>>>>>>>   {
>>>>>>> -       dev->ops->dev_dma_free(dev->dma_pool, addr, dma_handler);
>>>>>>> +       dev->ops->dev_dma_free(pool ?: dev->dma_pool, addr, dma_handler);
>>>>>>>   }
>>>>>>>   EXPORT_SYMBOL(nvm_dev_dma_free);
>>>>>>> 
>>>>>>> +void *nvm_dev_dma_create(struct nvm_dev *dev, int size, char *name)
>>>>>>> +{
>>>>>>> +       return dev->ops->create_dma_pool(dev, name, size);
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL(nvm_dev_dma_create);
>>>>>>> +
>>>>>>> +void nvm_dev_dma_destroy(struct nvm_dev *dev, void *pool)
>>>>>>> +{
>>>>>>> +       dev->ops->destroy_dma_pool(pool);
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL(nvm_dev_dma_destroy);
>>>>>>> +
>>>>>>>   static struct nvm_dev *nvm_find_nvm_dev(const char *name)
>>>>>>>   {
>>>>>>>          struct nvm_dev *dev;
>>>>>>> @@ -682,7 +695,8 @@ static int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd,
>>>>>>>          }
>>>>>>> 
>>>>>>>          rqd->nr_ppas = nr_ppas;
>>>>>>> -       rqd->ppa_list = nvm_dev_dma_alloc(dev, GFP_KERNEL, &rqd->dma_ppa_list);
>>>>>>> +       rqd->ppa_list = nvm_dev_dma_alloc(dev, NULL, GFP_KERNEL,
>>>>>>> +                                               &rqd->dma_ppa_list);
>>>>>>>          if (!rqd->ppa_list) {
>>>>>>>                  pr_err("nvm: failed to allocate dma memory\n");
>>>>>>>                  return -ENOMEM;
>>>>>>> @@ -708,7 +722,8 @@ static void nvm_free_rqd_ppalist(struct nvm_tgt_dev *tgt_dev,
>>>>>>>          if (!rqd->ppa_list)
>>>>>>>                  return;
>>>>>>> 
>>>>>>> -       nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
>>>>>>> +       nvm_dev_dma_free(tgt_dev->parent, NULL, rqd->ppa_list,
>>>>>>> +                               rqd->dma_ppa_list);
>>>>>>>   }
>>>>>>> 
>>>>>>>   static int nvm_set_flags(struct nvm_geo *geo, struct nvm_rq *rqd)
>>>>>>> @@ -1145,7 +1160,7 @@ int nvm_register(struct nvm_dev *dev)
>>>>>>>          if (!dev->q || !dev->ops)
>>>>>>>                  return -EINVAL;
>>>>>>> 
>>>>>>> -       dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist");
>>>>>>> +       dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist", PAGE_SIZE);
>>>>>>>          if (!dev->dma_pool) {
>>>>>>>                  pr_err("nvm: could not create dma pool\n");
>>>>>>>                  return -ENOMEM;
>>>>>> 
>>>>>> Why hack the nvm_dev_ interfaces when you are not using the dev pool anyway?
>>>>>> Wouldn't it be more straightforward to use dma_pool_* instead?
>>>>> 
>>>>> In order to call dma_pool_create() I need the NVMe device structure,
>>>>> which to my understanding is not public, so I decided to reuse the
>>>>> plumbing that is already available in the nvm_dev_* interfaces.
>>>> Hmm, yes, I see now.
>>>>> If there is some easy way to call dma_pool_create() from the pblk
>>>>> module and I'm missing it, let me know. I can rewrite this part if
>>>>> there is a better way to do it.
>>>> Create and destroy need to go through dev->ops, but once you have
>>>> allocated the pool, there is no need to go through dev->ops
>>>> for allocate/free as far as I can see.
>>>> Matias: can't we just replace .dev_dma_alloc / .dev_dma_free with
>>>> calls to dma_pool_alloc/free?
>>> 
>>> Better to improve the initial dma allocation. I don't want targets to play around with their own DMA pools.
>>> 
>>> Something like the below (only compile tested :)):
>> Good idea. One comment below, though.
>>> diff --git i/drivers/lightnvm/core.c w/drivers/lightnvm/core.c
>>> index efb976a863d2..f87696aa6b57 100644
>>> --- i/drivers/lightnvm/core.c
>>> +++ w/drivers/lightnvm/core.c
>>> @@ -1141,11 +1141,17 @@ EXPORT_SYMBOL(nvm_alloc_dev);
>>> int nvm_register(struct nvm_dev *dev)
>>> {
>>>        int ret;
>>> +       int meta_sz;
>>> 
>>>        if (!dev->q || !dev->ops)
>>>                return -EINVAL;
>>> 
>>> -       dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist");
>>> +       /* Allow room for both metadata and ppa list in the dma pool */
>>> +       meta_sz = (dev->geo.sos * NVM_MAX_VLBA) +
>>> +                                       (sizeof(uint64_t) * NVM_MAX_VLBA);
>>> +       meta_sz = PAGE_ALIGN(meta_sz);
>> Can we just send the required min meta_sz and then let the dma
>> allocation do the alignment? The driver knows if it can fit the
>> allocation without wasting unnecessary space.
> 
> The alignment must be a power of two, and from what I can see, the
> size will be rounded up internally in the dma_pool_create function. I
> don't have a strong opinion either way.
> 
>> Also, by sharing the ppa_list allocation and the metadata on the same
>> pool, we risk having unnecessarily large DMA buffers for the ppa_list.
>> Better to have 2 pools, each with its own size. In fact, the only reason
>> we made the original allocation PAGE_SIZE was to fit the ppa_list and
>> metadata in the same allocation; now that we are relaxing this, we can
>> have tighter allocations for the ppa_list as there is a hard limit of 64
>> lbas (512B).
> 
> Let's keep it together for now. I want to avoid having if/elses in
> the target code that have to decide which pool to use.

Why the if/else? The lba list is allocated from the ppa_list pool and the
metadata from the meta_pool. From the target's perspective these are
different pools for different purposes, as opposed to the small pool
for 256B PRPs in the nvme driver, which requires a size check.

Am I missing something?
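
To make it concrete, this is roughly what I have in mind on the target
side. Just a sketch, not even compile tested: the ppa_pool/meta_pool
fields and pblk_create_dma_pools() are made-up names, and it assumes the
nvm_dev_dma_create()/nvm_dev_dma_destroy() helpers from Igor's patch:

        static int pblk_create_dma_pools(struct pblk *pblk)
        {
                struct nvm_tgt_dev *dev = pblk->dev;
                struct nvm_geo *geo = &dev->geo;

                /* ppa_list is hard-limited to 64 lbas -> 512B entries */
                pblk->ppa_pool = nvm_dev_dma_create(dev->parent,
                                sizeof(u64) * NVM_MAX_VLBA, "ppalist");
                if (!pblk->ppa_pool)
                        return -ENOMEM;

                /* OOB metadata pool sized from what the device reports */
                pblk->meta_pool = nvm_dev_dma_create(dev->parent,
                                geo->sos * NVM_MAX_VLBA, "oobmeta");
                if (!pblk->meta_pool) {
                        nvm_dev_dma_destroy(dev->parent, pblk->ppa_pool);
                        return -ENOMEM;
                }

                return 0;
        }

Each allocation then simply uses its own pool, with no size check anywhere:

                rqd->ppa_list = nvm_dev_dma_alloc(dev->parent, pblk->ppa_pool,
                                        GFP_KERNEL, &rqd->dma_ppa_list);
                rqd->meta_list = nvm_dev_dma_alloc(dev->parent, pblk->meta_pool,
                                        GFP_KERNEL, &rqd->dma_meta_list);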

> Also, most devices will fit within the PAGE_SIZE allocation.

We hear requirements for larger OOB areas from different places, so I'm
not sure this is true anymore.

Javier


