Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

Hi,

We too kept seeing this until a few months ago in a cluster with ~400 HDDs, while all the drives' SMART statistics were consistently fine. Since we use erasure coding, each PG involves up to 10 HDDs.

It took us a while to realize that we shouldn't expect scrub errors on healthy drives, but eventually we decided to track it down and found documentation suggesting to run

 rados list-inconsistent-obj <PG> --format=json-pretty

... before you repair the PG. If you dig into that (long) output, you are likely to find a "read_error" for a specific OSD. From then on, we made a note of which HDD saw each error.
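
As a minimal sketch (the exact JSON layout can differ a bit between releases, and jq is just one way to narrow it down), something like this prints only the shards that reported errors, together with their OSD id:

 rados list-inconsistent-obj <PG> --format=json-pretty | \
   jq '.inconsistents[].shards[] | select((.errors | length) > 0) | {osd, errors}'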

This helped us identify two HDDs that had multiple read errors within a few weeks, even though their SMART data was still perfectly fine. That *might* just have been bad luck, but we have enough drives that it didn't matter, so we replaced them, and since then I've only had a single drive report an error.
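
To map an offending OSD back to a host and physical drive, we use something along these lines (osd id 123 is just a placeholder; the "ceph device" subcommand needs a reasonably recent release):

 # Which host and device back this OSD?
 ceph osd metadata 123 | jq -r '.hostname, .devices'
 # Or, on recent releases:
 ceph device ls-by-daemon osd.123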

One conclusion (in our case) is that such a drive would likely have failed sooner or later, even though it hadn't yet crossed any threshold that SMART worries about. The alternative is a drive that simply has more frequent read errors but is still technically within the allowed variation. Assuming you have configured your cluster with reasonable redundancy, you shouldn't run any risk of data loss, but we figured it was worth replacing a few outlier drives to sleep better.
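
If you want to double-check such a drive, keep in mind that the overall SMART health can still say PASSED while individual attributes already hint at trouble. Something like this (the device name is just an example) pulls out the counters we tend to look at:

 # Replace /dev/sdX with the device backing the suspect OSD
 smartctl -a /dev/sdX | grep -Ei 'reallocated|pending|uncorrect'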

Cheers,

Erik

--
Erik Lindahl <erik.lindahl@xxxxxxxxx>
On 9 Jan 2023 at 23:06 +0100, David Orman <ormandj@xxxxxxxxxxxx>, wrote:
> "dmesg" on all the linux hosts and look for signs of failing drives. Look at smart data, your HBAs/disk controllers, OOB management logs, and so forth. If you're seeing scrub errors, it's probably a bad disk backing an OSD or OSDs.
>
> Is there a common OSD in the PGs you've run the repairs on?
>
> On Mon, Jan 9, 2023, at 03:37, Kuhring, Mathias wrote:
> > Hey all,
> >
> > I'd like to pick up on this topic, since we also see regular scrub
> > errors recently.
> > Roughly one per week for around six weeks now.
> > It's always a different PG and the repair command always helps after a
> > while.
> > But the regular recurrence seems a bit unsettling.
> > How do we best troubleshoot this?
> >
> > We are currently on ceph version 17.2.1
> > (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
> >
> > Best Wishes,
> > Mathias
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



