Thanks for this! The drive doesn't show increased utilization on average,
but it does sporadically get more I/O than other drives, usually in short
bursts. I am now trying to find a way to trace this to a specific PG, pool
and object(s) – not sure if that is possible.

/Z

On Fri, 7 Oct 2022, 12:17 Dan van der Ster, <dvanders@xxxxxxxxx> wrote:

> Hi Zakhar,
>
> I can back up what Konstantin has reported -- we occasionally have
> HDDs performing very slowly even though all SMART tests come back
> clean. Besides "ceph osd perf" showing a high latency, you could see
> high %util with iostat.
>
> We normally replace those HDDs -- usually by draining and zeroing
> them, then putting them back in prod (e.g. in a different cluster or
> some other service). I don't have statistics on how often those sick
> drives come back to full performance or not -- that could indicate it
> was a poor physical connection, vibrations, etc. But I do recall some
> drives came back repeatedly as "sick" but not dead, with clean SMART
> tests.
>
> If you have time you can dig deeper with increased bluestore debug
> levels. In our environment this happens often enough that we simply
> drain, replace, and move on.
>
> Cheers, Dan
>
> On Fri, Oct 7, 2022 at 9:41 AM Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >
> > Unfortunately, that isn't the case: the drive is perfectly healthy and,
> > according to all measurements I did on the host itself, it isn't any
> > different from any other drive on that host size-, health- or
> > performance-wise.
> >
> > The only difference I noticed is that this drive sporadically does
> > more I/O than other drives for a split second, probably due to
> > specific PGs placed on its OSD, but the average I/O pattern is very
> > similar to other drives and OSDs, so it's somewhat unclear why this
> > specific OSD is consistently showing much higher latency. It would be
> > good to figure out what exactly is causing these I/O spikes, but I'm
> > not yet sure how to do that.
> >
> > /Z
> >
> > On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > When you see that one of 100 drives' perf is unusually different,
> > > it may mean 'this drive is not like the others' and it should be
> > > replaced.
> > >
> > > k
> > >
> > > Sent from my iPhone
> > >
> > > > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > > >
> > > > Anyone, please?
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
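Tracing an OSD's I/O bursts back to a PG, pool and object is possible with Ceph's own CLI. A minimal sketch, assuming the slow OSD is osd.12 (a placeholder id, not from the thread): the commands that need a live cluster are shown as comments, and a canned sample of `ceph pg ls-by-osd`-style output stands in for real data so the pool tally at the end can be demonstrated offline.

```shell
# On a live cluster (substitute the slow OSD's id for the placeholder osd.12):
#   ceph pg ls-by-osd osd.12                 # PGs (and thus pools) mapped to the OSD
#   ceph daemon osd.12 dump_historic_ops     # slowest recent ops, incl. object and PG
#   ceph daemon osd.12 dump_ops_in_flight    # ops currently in flight; run during a burst
#   ceph osd map <pool> <object>             # confirm which PG/OSDs an object maps to
#   ceph tell osd.12 config set debug_bluestore 20   # the deeper-debug route Dan mentions
#
# A PG id has the form "<pool-id>.<pg-seq>", so the pool is the part before
# the dot. The sample below stands in for "ceph pg ls-by-osd" output; awk
# tallies how many of the OSD's PGs belong to each pool.
sample_pgs='2.1f
2.3a
7.04
2.11
7.19'
pool_counts=$(printf '%s\n' "$sample_pgs" \
  | awk -F. '{c[$1]++} END {for (p in c) print "pool " p ": " c[p] " PGs"}' \
  | sort)
echo "$pool_counts"
```

During the next burst, `dump_historic_ops` is usually the quickest way to see the actual object names involved; the PG-per-pool tally only narrows things down to a pool whose PGs are concentrated on the suspect OSD.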