On Mon, Sep 20, 2021 at 10:38:37AM +0200, Stefan Kooman wrote:
> On 9/16/21 13:42, Davíð Steinn Geirsson wrote:
> >
> > The 4 affected drives are of 3 different types from 2 different vendors:
> > ST16000NM001G-2KK103
> > ST12000VN0007-2GS116
> > WD60EFRX-68MYMN1
> >
> > They are all connected through an LSI2308 SAS controller in IT mode.
> > Other drives that did not fail are also connected to the same controller.
> >
> > There are no expanders in this particular machine, only a direct-attach
> > SAS backplane.
>
> Does the SAS controller run the latest firmware?

As far as I can tell, yes. Avago's website does not seem to list these
anymore, but they are running firmware version 20, which is the latest I
can find references to in a web search.

This machine has been chugging along like this for years (it was a
single-node ZFS NFS server before) and I've never had any such issues
before.

> I'm not sure what your failure domain is, but I would certainly want to
> try to reproduce this issue.

I'd be interested to hear any ideas you have about that. The failure
domain is host[1], but this is a 3-node cluster so there isn't much room
for taking a machine down for longer periods. Taking OSDs down is no
problem.

The two other machines in the cluster have very similar hardware and
software, so I am concerned about seeing the same there on reboot.
Backfilling these 16TB spinners takes a long time and is still running;
I'm not going to reboot either of the other nodes until that is finished.

> Gr. Stefan

Regards,
Davíð

[1] Mostly. Failure domain is host for every pool using the default CRUSH
rules. There is also an EC pool with m=5 k=7, with a custom CRUSH rule to
pick 3 hosts and 4 OSDs from each of the hosts.
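
To illustrate, a rule along those lines looks roughly like the following
in a decompiled CRUSH map. This is only a sketch of the shape of such a
rule, not the exact one from our crushmap; the rule name, id, and root
bucket are placeholders:

    rule ec_3host_4osd {
            id 2
            type erasure
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            # start from the default root and pick 3 distinct hosts
            step take default
            step choose indep 3 type host
            # then pick 4 OSDs within each of those hosts
            step choose indep 4 type osd
            step emit
    }

With k=7 m=5 the pool needs 12 chunks in total, which matches the
3 hosts x 4 OSDs the rule emits.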