Re: OSD crashes during upgrade mimic->octopus

On 10/6/22 16:12, Frank Schilder wrote:
Hi Igor and Stefan.

Not sure why you're talking about a replicated(!) 4(2) pool.

It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. It seems to affect all sorts of pools.

I have to take this one back. It is indeed an EC pool, which is also on these SSD OSDs, that is affected. The meta-data pool was active all the time until we lost the 3rd host. So, the reported bug is confirmed to affect EC pools.

If not, does every OSD that died unconditionally mean its underlying disk is no longer available?

Fortunately not. After losing disks on the 3rd host, we had to start taking somewhat more desperate measures. We set the file system offline to stop client IO and started rebooting hosts in reverse order of failing. This brought back the OSDs on the still unconverted hosts. We rebooted the converted host with the original OSD failures last. Unfortunately, here it seems we lost a drive for good. It looks like the OSDs crashed while the conversion was in progress, or something like that. They don't boot up and I need to look into that in more detail.
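
For the record, taking the file system offline from the server side was just the standard command, roughly (<fsname> being a placeholder for the actual file system name):

# ceph fs set <fsname> down true    # stop client IO; 'down false' brings it back up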

Maybe don't do the online conversion, but opt for an offline one? That way you can inspect whether it works or not. Time-wise it hardly matters (online conversions used to be much slower, but that is no longer the case). If an already upgraded OSD reboots (because it crashed, for example), it will immediately do the conversion. It might be better to have a bit more control over it and do it manually.
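
Roughly, the manual route per OSD would be something like the following (assuming the default data path /var/lib/ceph/osd/ceph-$ID; the bluestore_fsck_quick_fix_on_mount option and using 'repair' for the offline omap conversion is what I recall from the Octopus tooling, so please verify against your version first):

# ceph config set osd bluestore_fsck_quick_fix_on_mount false    # don't convert automatically on OSD start
# systemctl stop ceph-osd@$ID
# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$ID   # offline fsck/repair, including the omap conversion
# systemctl start ceph-osd@$ID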

We recently observed that restarted OSDs might take some time to do their standard RocksDB compactions. We therefore set the "noup" flag to give them time to do that housekeeping, and only unset it after the compaction finishes. It prevented a lot of slow ops we would otherwise have had. It might help in this case as well.
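
In practice that amounts to something like:

# ceph osd set noup                 # restarted OSDs stay 'down' while they compact
# systemctl restart ceph-osd@$ID    # on the OSD host, per OSD
  (wait until the compaction has finished in the OSD log)
# ceph osd unset noup               # let them come up and peer again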


We are currently trying to encourage the fs clients to reconnect to the file system. Unfortunately, on many of them we get

# ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle

Is there a server-side way to encourage the FS clients to reconnect to the cluster? What is a clean way to get them back onto the file system? I tried remounts without success.

Not that I know of. You probably need to reboot those hosts.
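
For completeness, a forced remount attempt on a client typically looks like the following (assuming a kernel client mount at /shares/nfs/ait_pnora01 that is listed in fstab); if the handle is still stale afterwards, a reboot is the cleaner option:

# umount -f /shares/nfs/ait_pnora01    # or 'umount -l' for a lazy unmount
# mount /shares/nfs/ait_pnora01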


Before executing the next conversion, I will compact the RocksDB on all SSD OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number of objects per PG, which is potentially the main reason for our observations.

Yup, I'm pretty much certain that's the reason. Nowadays one of our default maintenance routines before upgrades / conversions is to do an offline compaction of all OSDs.
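
Per OSD that boils down to something like the following, run while the OSD is stopped ($ID again being a placeholder):

# systemctl stop ceph-osd@$ID
# ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$ID compact
# systemctl start ceph-osd@$ID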

I hope it helps.


Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


