On 4/12/22 09:27, Dan van der Ster wrote:
Hi Stefan, Thanks for the report. A 9-hour fsck is the longest I've heard of yet -- and on NVMe, that's quite surprising!
I believe Mark Schouten had to wait 3 days (!) before the fsck finished, although this might have been before the optimizations in this area were made.
Which firmware are you running on those Samsungs? For a different reason, Mark and we have been comparing the performance of that drive between his lab and our data centre. We have no obvious perf issues running EDA5702Q; Mark has some issue with the Quincy RC running FW EDA53W0Q. I'm not sure if it's related, but worth checking...
We have mainly EDA5402Q running, and we ran EDA5202Q before that without issues. One recently replaced OSD came with EDA5702Q.
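For reference, a quick way to read the firmware revision off an NVMe drive is smartctl or nvme-cli (the device name below is just a placeholder for whatever the OSD host exposes):

    # print controller identity info, including the Firmware Version line
    smartctl -i /dev/nvme0n1
    # or list all NVMe devices on the host with their FW Rev column
    nvme list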
In any case, I'm also surprised you decided to drain the boxes before the fsck. Wouldn't 9 hours of down OSDs, with noout set, be less invasive?
Yes, less invasive, but riskier. Note that even an "online" fsck does not mean that the OSDs are ONLINE: they aren't. So if a disk in some other failure domain decides to die, it has an availability impact (min_size=2). Besides that, we believe that the slow ops we sometimes see have their origin in the past (consolidating all CephFS metadata on 3 NVMe nodes and then back to all nodes again), so by re-provisioning the OSDs we hope to get rid of those as well.
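For completeness, the noout-based approach would look roughly like this on a host with systemd-managed OSDs (the OSD id 42 and the data path are placeholders; the exact service name depends on how the OSDs were deployed):

    # prevent the cluster from marking the down OSD out and rebalancing
    ceph osd set noout
    # stop the OSD daemon so its store can be checked offline
    systemctl stop ceph-osd@42
    # run the BlueStore consistency check (a deep fsck also reads object data and verifies checksums)
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-42
    # bring the OSD back and clear the flag
    systemctl start ceph-osd@42
    ceph osd unset noout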
Gr. Stefan