Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

Hi,

I'd look for deep-scrubs on that OSD; they are logged, and maybe their timestamps match your observations.
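
For example, something along these lines might help match them up (the osd
id 12 and the log path are placeholders for your environment):

  # last scrub/deep-scrub stamps for every PG mapped to the suspect OSD
  ceph pg ls-by-osd 12

  # deep-scrub start/finish messages in the cluster log on a mon host
  # (with cephadm the log usually lives under /var/log/ceph/<fsid>/)
  grep deep-scrub /var/log/ceph/ceph.log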

Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:

Thanks for this!

The drive doesn't show increased utilization on average, but it does
sporadically get more I/O than other drives, usually in short bursts. I am
now trying to find a way to trace this to a specific PG, pool and
object(s) -- not sure if that is possible.
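
If it helps, the OSD's admin socket keeps a short history of its slowest
recent ops, including the pgid and object name for each one; a rough sketch
(osd.12 is a placeholder, run on the host that carries that OSD):

  # slowest recent ops on the OSD, with pgid, object name and duration
  ceph daemon osd.12 dump_historic_ops_by_duration

  # map the pool id from the pgid (e.g. 5.1a -> pool 5) back to a pool name
  ceph osd pool ls detail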

/Z

On Fri, 7 Oct 2022, 12:17 Dan van der Ster, <dvanders@xxxxxxxxx> wrote:

Hi Zakhar,

I can back up what Konstantin has reported -- we occasionally have
HDDs performing very slowly even though all SMART tests come back
clean. Besides ceph osd perf showing high latency, you could also see
high %util with iostat.
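
For reference, this is roughly what that looks like (sdX is a placeholder
for the device backing the OSD):

  ceph osd perf      # commit/apply latency in ms for every OSD
  iostat -x sdX 5    # %util and await for the backing device, 5s samples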

We normally replace those HDDs -- usually by draining and zeroing
them, then putting them back into prod (e.g. in a different cluster or
some other service). I don't have statistics on how often those sick
drives come back to full performance or not -- if they do, that could
indicate it was a poor physical connection, vibration, ..., for example.
But I do recall some drives coming back repeatedly as "sick" but not
dead, with clean SMART tests.
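
In case it's useful, a rough sketch of the drain (osd.12 and /dev/sdX are
placeholders; wait for the cluster to return to active+clean between steps):

  ceph osd out 12                     # stop placing new data on it
  ceph osd safe-to-destroy osd.12     # re-run until it reports the OSD can be removed
  # stop the OSD daemon, then:
  ceph osd purge 12 --yes-i-really-mean-it
  ceph-volume lvm zap /dev/sdX --destroy   # wipe the device before reuse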

If you have time, you can dig deeper with increased bluestore debug
levels. In our environment this happens often enough that we simply
drain, replace, and move on.
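
In case anyone wants to try that, something like this should raise (and
later revert) the logging for a single OSD (osd.12 is a placeholder; 20/20
is very verbose):

  ceph config set osd.12 debug_bluestore 10/10
  ceph config set osd.12 debug_bdev 10/10
  # ...reproduce or wait for a latency spike, then revert:
  ceph config rm osd.12 debug_bluestore
  ceph config rm osd.12 debug_bdev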

Cheers, dan




On Fri, Oct 7, 2022 at 9:41 AM Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
>
> Unfortunately, that isn't the case: the drive is perfectly healthy and,
> according to all measurements I did on the host itself, it isn't any
> different from any other drive on that host size-, health- or
> performance-wise.
>
> The only difference I noticed is that this drive sporadically does more I/O
> than other drives for a split second, probably due to specific PGs placed
> on its OSD, but the average I/O pattern is very similar to other drives and
> OSDs, so it's somewhat unclear why the specific OSD is consistently showing
> much higher latency. It would be good to figure out what exactly is causing
> these I/O spikes, but I'm not yet sure how to do that.
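
One thing that might catch a burst in the act (osd.12 is a placeholder,
run on the host that carries the OSD):

  ceph daemon osd.12 dump_ops_in_flight   # client/recovery ops currently in progress
  ceph daemon osd.12 perf dump            # internal counters, incl. bluestore latencies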
>
> /Z
>
> On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
>
> > Hi,
> >
> > When you see that one of 100 drives' perf is unusually different, it may mean
> > 'this drive is not like the others' and the drive should be replaced
> >
> >
> > k
> >
> > Sent from my iPhone
> >
> > > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >
> > > Anyone, please?
> >
> >



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



