Re: [PATCH v2 2/2] libnvdimm: Add a poison list and export badblocks

On 1/7/2016 5:49 PM, Dan Williams wrote:
> On Thu, Jan 7, 2016 at 1:18 PM, Linda Knippers <linda.knippers@xxxxxxx> wrote:
>>
>> On 12/24/2015 9:21 PM, Vishal Verma wrote:
>>> During region creation, perform Address Range Scrubs (ARS) for the SPA
>>> (System Physical Address) ranges to retrieve known poison locations from
>>> firmware. Add a new data structure 'nd_poison' which is used as a list
>>> in nvdimm_bus to store these poison locations.
>>>
>>> When creating a pmem namespace, if there is any known poison associated
>>> with its physical address space, convert the poison ranges to bad sectors
>>> that are exposed using the badblocks interface.
>>>
>>> Signed-off-by: Vishal Verma <vishal.l.verma@xxxxxxxxx>
>>> ---
>>>  drivers/acpi/nfit.c       | 203 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  drivers/nvdimm/core.c     | 187 ++++++++++++++++++++++++++++++++++++++++++
>>>  drivers/nvdimm/nd-core.h  |   3 +
>>>  drivers/nvdimm/nd.h       |   6 ++
>>>  drivers/nvdimm/pmem.c     |   6 ++
>>>  include/linux/libnvdimm.h |   1 +
>>>  6 files changed, 406 insertions(+)
>>>
>>> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
>>
>> Why are the ARS functions in acpi/nfit.c rather than in the core?
>> I realize that the interfaces are/will be defined in the ACPI spec but
>> it's not really part of the NFIT.
>>
>> Why isn't this part of the nvdimm core, where the calls are made?
> 
> True.  acpi_nfit_find_poison() could easily become
> nvdimm_bus_find_poison() and move to the core.  However, we only need
> to take that step when/if another nvdimm bus type shows up that isn't
> nfit-defined.  For now, only nfit-defined nvdimms have a concept of
> ARS.

Why not go ahead and move it now?

I agree that ARS is related to NFIT, but practically everything else
in the core is as well.  It doesn't seem to belong in nfit.c.  I
didn't even notice that's where it was since the subject line says
"libnvdimm".

> 
>>
>>> index aa45d48..ad6d8c6 100644
>>> --- a/drivers/acpi/nfit.c
>>> +++ b/drivers/acpi/nfit.c
>>> @@ -15,6 +15,7 @@
>>>  #include <linux/module.h>
>>>  #include <linux/mutex.h>
>>>  #include <linux/ndctl.h>
>>> +#include <linux/delay.h>
>>>  #include <linux/list.h>
>>>  #include <linux/acpi.h>
>>>  #include <linux/sort.h>
>>> @@ -1473,6 +1474,201 @@ static void acpi_nfit_blk_region_disable(struct nvdimm_bus *nvdimm_bus,
>>>       /* devm will free nfit_blk */
>>>  }
>>>
>>> +static int ars_get_cap(struct nvdimm_bus_descriptor *nd_desc,
>>> +             struct nd_cmd_ars_cap *cmd, u64 addr, u64 length)
>>> +{
>>> +     cmd->address = addr;
>>> +     cmd->length = length;
>>> +
>>> +     return nd_desc->ndctl(nd_desc, NULL, ND_CMD_ARS_CAP, cmd,
>>> +                     sizeof(*cmd));
>>> +}
>>> +
>>> +static int ars_do_start(struct nvdimm_bus_descriptor *nd_desc,
>>> +             struct nd_cmd_ars_start *cmd, u64 addr, u64 length)
>>> +{
>>> +     int rc;
>>> +
>>> +     cmd->address = addr;
>>> +     cmd->length = length;
>>> +     cmd->type = ND_ARS_PERSISTENT;
>>> +
>>> +     while (1) {
>>> +             rc = nd_desc->ndctl(nd_desc, NULL, ND_CMD_ARS_START, cmd,
>>> +                             sizeof(*cmd));
>>> +             if (rc)
>>> +                     return rc;
>>> +             switch (cmd->status) {
>>> +             case 0:
>>> +                     return 0;
>>> +             case 1:
>>> +                     /* ARS unsupported, but we should never get here */
>>> +                     return 0;
>>> +             case 2:
>>> +                     return -EINVAL;
>>> +             case 3:
>>
>> Are these values in flux in the ACPI spec?  Would #defines help?
> 
> Yes that would be more readable...
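
Something like this would already help readability (the names and the
small translation helper below are just my sketch, not from the patch or
any spec; the values mirror what the patch already checks):

```c
#include <errno.h>

/* Hypothetical symbolic names for the ARS Start status codes tested in
 * ars_do_start(); the numeric values match the patch, the names don't
 * come from anywhere official. */
#define ND_ARS_START_SUCCESS		0	/* scrub started */
#define ND_ARS_START_UNSUPPORTED	1	/* function unsupported */
#define ND_ARS_START_INVALID_PARAM	2	/* invalid input parameters */
#define ND_ARS_START_IN_PROGRESS	3	/* an ARS is already running */

/* The switch in ars_do_start() could then become a small translation
 * helper: 0 = done, negative = error, 1 = caller should sleep and retry. */
static int ars_start_status_to_err(unsigned int status)
{
	switch (status) {
	case ND_ARS_START_SUCCESS:
	case ND_ARS_START_UNSUPPORTED:
		return 0;
	case ND_ARS_START_INVALID_PARAM:
		return -EINVAL;
	case ND_ARS_START_IN_PROGRESS:
		return 1;	/* retry after a delay */
	default:
		return -ENXIO;
	}
}
```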
> 
>>> +                     /* ARS is in progress */
>>> +                     msleep(1000);
>>> +                     break;
>>
>> Should this be considered an error (perhaps EAGAIN?) if an ARS
>> is in progress since you're supposed to check the status before
>> attempting to start one?  Do we really want to keep retrying
>> it, potentially forever, here?
> 
> We indeed need a timeout here.  I'll take a look at this and the above
> as Vishal is out for the rest of the week.
> 
> There are also plans for 4.6 to make ARS polling asynchronous to the
> rest of the nvdimm system, but that was deemed too risky /
> complicated to squeeze into 4.5.

Ok, makes sense.
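
For the timeout, even something simple would do.  A sketch (the budget
constant and helper name are made up, and the polling is modeled with a
plain counter here instead of the real ND_CMD_ARS_STATUS call):

```c
#include <errno.h>

/* Sketch of a bounded version of the 1-second polling loop: give up
 * after a fixed budget instead of spinning forever.  ARS_POLL_BUDGET
 * is an illustrative number, not from the patch. */
#define ARS_POLL_BUDGET	60	/* give up after ~60 polls (~60s) */

/* in_progress_polls stands in for "how many more status queries will
 * report ARS in progress" -- real code would issue the DSM instead */
static int ars_poll_until_done(int *in_progress_polls)
{
	int tries;

	for (tries = 0; tries < ARS_POLL_BUDGET; tries++) {
		if (*in_progress_polls == 0)
			return 0;	/* scrub complete */
		(*in_progress_polls)--;
		/* a real version would msleep(1000) here */
	}
	return -EAGAIN;			/* budget exhausted */
}
```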

> 
>>
>>> +             default:
>>> +                     return -ENXIO;
>>> +             }
>>> +     }
>>> +}
>>> +
>>> +static int ars_get_status(struct nvdimm_bus_descriptor *nd_desc,
>>> +             struct nd_cmd_ars_status *cmd)
>>> +{
>>> +     int rc;
>>> +
>>> +     while (1) {
>>> +             rc = nd_desc->ndctl(nd_desc, NULL, ND_CMD_ARS_STATUS, cmd,
>>> +                     sizeof(*cmd));
>>> +             if (rc || cmd->status & 0xffff)
>>> +                     return -ENXIO;
>>> +
>>> +             /* Check extended status (Upper two bytes) */
>>> +             switch (cmd->status >> 16) {
>>> +             case 0:
>>> +                     return 0;
>>> +             case 1:
>>> +                     /* ARS is in progress */
>>> +                     msleep(1000);
>>> +                     break;
>>
>> Why isn't ARS in progress a legitimate status to return,
>> especially if you're supposed to check if one is in progress
>> before starting one?
>>
>> You could have a different call to wait for completion if
>> that's what you want to do, but it wouldn't be called
>> ars_get_status().
> 
> Right, this is something to address when we make ARS fully asynchronous.
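
If it helps, here is roughly what I had in mind: a non-blocking decode
of the status word, with the sleeping loop moved into a separately named
wait helper.  The helper name and the -EBUSY convention are just my
sketch; the decoded cases match what ars_get_status() checks today:

```c
#include <errno.h>

/* Decode an ARS status word without sleeping.  Low two bytes are the
 * command status, upper two bytes the extended status, as in the patch. */
static int ars_decode_ext_status(unsigned int status)
{
	if (status & 0xffff)
		return -ENXIO;		/* command-level failure */

	switch (status >> 16) {
	case 0:				/* ARS complete */
	case 2:				/* no ARS performed this boot */
		return 0;
	case 1:
		return -EBUSY;		/* in progress: caller decides
					 * whether to wait or bail */
	default:
		return -ENXIO;
	}
}
```

An ars_wait_for_completion() could then loop with msleep() only while
this returns -EBUSY, and ars_get_status() would stay non-blocking.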
> 
>>
>>> +             case 2:
>>> +                     /* No ARS performed for the current boot */
>>> +                     return 0;
>>> +             default:
>>> +                     return -ENXIO;
>>> +             }
>>> +     }
>>> +}
>>> +
>>> +static int ars_status_process_records(struct nvdimm_bus *nvdimm_bus,
>>> +             struct nd_cmd_ars_status *ars_status, u64 start)
>>> +{
>>> +     int rc;
>>> +     u32 i;
>>> +
>>> +     /*
>>> +      * The address field returned by ars_status should be either
>>> +      * less than or equal to the address we last started ARS for.
>>> +      * The (start, length) returned by ars_status should also have
>>> +      * non-zero overlap with the range we started ARS for.
>>> +      * If this is not the case, bail.
>>> +      */
>>> +     if (ars_status->address > start ||
>>> +                     (ars_status->address + ars_status->length < start))
>>> +             return -ENXIO;
>>> +
>>> +     for (i = 0; i < ars_status->num_records; i++) {
>>> +             rc = nvdimm_bus_add_poison(nvdimm_bus,
>>> +                             ars_status->records[i].err_address,
>>> +                             ars_status->records[i].length);
>>> +             if (rc)
>>> +                     return rc;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int acpi_nfit_find_poison(struct acpi_nfit_desc *acpi_desc,
>>> +             struct nd_region_desc *ndr_desc)
>>> +{
>>> +     struct nvdimm_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
>>> +     struct nvdimm_bus *nvdimm_bus = acpi_desc->nvdimm_bus;
>>> +     struct nd_cmd_ars_status *ars_status = NULL;
>>> +     struct nd_cmd_ars_start *ars_start = NULL;
>>> +     struct nd_cmd_ars_cap *ars_cap = NULL;
>>> +     u64 start, len, cur, remaining;
>>> +     int rc;
>>> +
>>> +     ars_cap = kzalloc(sizeof(*ars_cap), GFP_KERNEL);
>>> +     if (!ars_cap)
>>> +             return -ENOMEM;
>>> +
>>> +     start = ndr_desc->res->start;
>>> +     len = ndr_desc->res->end - ndr_desc->res->start + 1;
>>> +
>>> +     rc = ars_get_cap(nd_desc, ars_cap, start, len);
>>> +     if (rc)
>>> +             goto out;
>>> +
>>> +     /*
>>> +      * If ARS is unsupported, or if the 'Persistent Memory Scrub' flag in
>>> +      * extended status is not set, skip this but continue initialization
>>> +      */
>>> +     if ((ars_cap->status & 0xffff) ||
>>> +             !(ars_cap->status >> 16 & ND_ARS_PERSISTENT)) {
>>> +             dev_warn(acpi_desc->dev,
>>> +                     "ARS unsupported (status: 0x%x), won't create an error list\n",
>>> +                     ars_cap->status);
>>> +             goto out;
>>> +     }
>>> +
>>> +     /*
>>> +      * Check if a full-range ARS has been run. If so, use those results
>>> +      * without having to start a new ARS.
>>> +      */
>>> +     ars_status = kzalloc(ars_cap->max_ars_out + sizeof(*ars_status),
>>> +                     GFP_KERNEL);
>>> +     if (!ars_status) {
>>> +             rc = -ENOMEM;
>>> +             goto out;
>>> +     }
>>> +
>>> +     rc = ars_get_status(nd_desc, ars_status);
>>> +     if (rc)
>>> +             goto out;
>>> +
>>> +     if (ars_status->address <= start &&
>>> +             (ars_status->address + ars_status->length >= start + len)) {
>>> +             rc = ars_status_process_records(nvdimm_bus, ars_status, start);
>>> +             goto out;
>>> +     }
>>> +
>>> +     /*
>>> +      * ARS_STATUS can overflow if the number of poison entries found is
>>> +      * greater than the maximum buffer size (ars_cap->max_ars_out)
>>> +      * To detect overflow, check if the length field of ars_status
>>> +      * is less than the length we supplied. If so, process the
>>> +      * error entries we got, adjust the start point, and start again
>>> +      */
>>> +     ars_start = kzalloc(sizeof(*ars_start), GFP_KERNEL);
>>> +     if (!ars_start) {
>>> +             rc = -ENOMEM;
>>> +             goto out;
>>> +     }
>>> +
>>> +     cur = start;
>>> +     remaining = len;
>>> +     do {
>>> +             u64 done, end;
>>> +
>>> +             rc = ars_do_start(nd_desc, ars_start, cur, remaining);
>>
>> I think you should get status, start, wait for completion, then process
>> records, at least that's how I read the example DSM spec and other
>> specs.  It says "You must first issue a Query ARS Status command...".
> 
> We do the lead-in ars_status check above...

Oh right, I missed that.

The way the locking works, the registration of regions is serialized, so
we can never have overlapping calls to this function, right?  A comment
about how this is synchronized might be helpful, since the lock is many
functions away.

-- ljk

> 
>>
>>> +             if (rc)
>>> +                     goto out;
>>> +
>>> +             rc = ars_get_status(nd_desc, ars_status);
>>> +             if (rc)
>>> +                     goto out;
>>> +
>>> +             rc = ars_status_process_records(nvdimm_bus, ars_status, cur);
>>> +             if (rc)
>>> +                     goto out;
>>> +
>>> +             end = min(cur + remaining,
>>> +                     ars_status->address + ars_status->length);
>>> +             done = end - cur;
>>> +             cur += done;
>>> +             remaining -= done;
>>> +     } while (remaining);
>>> +
>>> + out:
>>> +     kfree(ars_cap);
>>> +     kfree(ars_start);
>>> +     kfree(ars_status);
>>> +     return rc;
>>> +}
>>> +
>>
>> I haven't looked at the rest of this, will try to later.
>>
> 
> Thanks Linda!
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



