On Mon, Sep 20, 2021 at 10:38:37AM +0200, Stefan Kooman wrote:
> On 9/16/21 13:42, Davíð Steinn Geirsson wrote:
> >
> > The 4 affected drives are of 3 different types from 2 different vendors:
> > ST16000NM001G-2KK103
> > ST12000VN0007-2GS116
> > WD60EFRX-68MYMN1
> >
> > They are all connected through an LSI2308 SAS controller in IT mode.
> > Other drives that did not fail are also connected to the same controller.
> >
> > There are no expanders in this particular machine, only a direct-attach
> > SAS backplane.
>
> Does the SAS controller run the latest firmware?

As far as I can tell, yes. Avago's website does not seem to list these
anymore, but they are running firmware version 20, which is the latest I
can find references to in a web search.

This machine has been chugging along like this for years (it was a
single-node ZFS NFS server before) and I've never had any such issues
before.

> I'm not sure what your failure domain is, but I would certainly want to
> try to reproduce this issue.

I'd be interested to hear any ideas you have about that. The failure
domain is host[1], but this is a 3-node cluster so there isn't much room
for taking a machine down for longer periods. Taking OSDs down is no
problem.

The two other machines in the cluster have very similar hardware and
software, so I am concerned about seeing the same there on reboot.
Backfilling these 16TB spinners takes a long time and is still running;
I'm not going to reboot either of the other nodes until that is finished.

> Gr. Stefan

Regards,
Davíð

[1] Mostly. Failure domain is host for every pool using the default CRUSH
rules. There is also an EC pool with m=5 k=7, with a custom CRUSH rule to
pick 3 hosts and 4 OSDs from each of the hosts.
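
To illustrate, a rule along those lines looks roughly like the following
in a decompiled CRUSH map. This is only a sketch of the shape of such a
rule, not the exact one from our crushmap; the rule name, id, and root
bucket are placeholders:

    rule ec_3host_4osd {
            id 2
            type erasure
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            # start from the default root and pick 3 distinct hosts
            step take default
            step choose indep 3 type host
            # then pick 4 OSDs within each of those hosts
            step choose indep 4 type osd
            step emit
    }

With k=7 m=5 the pool needs 12 chunks in total, which matches the
3 hosts x 4 OSDs the rule emits.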