Re: [PATCH 05/13] lightnvm: pblk: Count all read errors in stats

Hans Holmberg <hans@xxxxxxxxxxxxx> · Mon, 4 Mar 2019 13:48:27 +0100

On Mon, Mar 4, 2019 at 1:42 PM Igor Konopko <igor.j.konopko@xxxxxxxxx> wrote:
>
>
>
> On 04.03.2019 12:45, Javier González wrote:
> >
> >> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@xxxxxxxxxxxxx> wrote:
> >>
> >> On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@xxxxxxxxxxx> wrote:
> >>>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>> Igor: Have you seen this happening in real life?
> >>>>
> >>>> I think it would be better to count all expected errors and put them
> >>>> in the right bucket (without spamming dmesg). If we need a new bucket
> >>>> for e.g. vendor-specific errors, let's do that instead.
>
> Generally I'm seeing different types of errors (which are typically, as
> Javier mentioned, controller errors) in cases such as hot drive removal, etc.
>
> We can skip that patch, since these are kind of corner cases. I can also
> create a new type of pblk stats, something like "controller errors", which
> would collect all the other unexpected errors in one place instead of
> mixing them with real read/write errors as I did.

Yes, that makes sense.

Thanks,
Hans
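
For illustration, the bucket idea discussed above could look roughly like
the sketch below. This is plain userspace C, not pblk code: the status
values, the bucket names and every function name here are hypothetical and
only meant to show the shape of "expected errors go to their own counter,
everything else goes to a controller-errors counter, and nothing is printed
to the log".

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical status codes for illustration; NOT the real device values. */
enum io_status {
	IO_OK           = 0x00,
	IO_READ_FAIL    = 0x01,	/* uncorrectable read error */
	IO_READ_EMPTY   = 0x02,	/* read of an unwritten page */
	IO_READ_HIGHECC = 0x03,	/* data returned, but ECC is near its limit */
};

/* One counter per expected error type, plus one catch-all bucket. */
enum err_bucket {
	BUCKET_READ_FAILED,
	BUCKET_READ_EMPTY,
	BUCKET_READ_HIGH_ECC,
	BUCKET_CONTROLLER,	/* anything unexpected, e.g. after hot removal */
	BUCKET_COUNT,
};

struct err_stats {
	atomic_long count[BUCKET_COUNT];
};

/* Map a completion status to a bucket; returns -1 for success. */
static int classify_read_status(uint32_t status)
{
	switch (status) {
	case IO_OK:
		return -1;
	case IO_READ_FAIL:
		return BUCKET_READ_FAILED;
	case IO_READ_EMPTY:
		return BUCKET_READ_EMPTY;
	case IO_READ_HIGHECC:
		return BUCKET_READ_HIGH_ECC;
	default:
		return BUCKET_CONTROLLER;
	}
}

/* Count the status in its bucket. Deliberately no log print here. */
static void account_read_status(struct err_stats *s, uint32_t status)
{
	int b = classify_read_status(status);

	if (b >= 0)
		atomic_fetch_add_explicit(&s->count[b], 1,
					  memory_order_relaxed);
}

/* Read back one counter, e.g. for a stats file or a debug dump. */
static long err_stats_read(struct err_stats *s, enum err_bucket b)
{
	return atomic_load_explicit(&s->count[b], memory_order_relaxed);
}
```

With this split, an unknown status only bumps the catch-all counter, so the
real read error counters stay meaningful and dmesg stays quiet.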

>
> >>>>
> >>>> Someone wiser than me told me that every error print in the log is a
> >>>> potential customer call.
> >>>>
> >>>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this
> >>>> information. Why can't we let the drives expose this info and remove
> >>>> this from pblk? What's blocking that?
> >>>
> >>> Until now the spec. We added some new log information in Denali exactly
> >>> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
> >>> have it here, at least for debugging.
> >>
> >> Why add it to the spec? Why not use whatever everyone else is using?
> >>
> >> https://en.wikipedia.org/wiki/S.M.A.R.T. :
> >> "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often
> >> written as SMART) is a monitoring system included in computer hard
> >> disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its
> >> primary function is to detect and report various indicators of drive
> >> reliability with the intent of anticipating imminent hardware
> >> failures."
> >> Sounds like what we want here.
> >
> > I know what smart is… You need to define the fields. Maybe you want to
> > read Denali again - the extensions are coupled with smart.
> >
> >> For debugging, a trace point or something (e.g. BPF) would be a better
> >> solution that would not impact hot-path performance.
> >
> > Cool. Look forward to the patches ;)
> >
> > Javier
> >