Recently our Ceph cluster has become very unstable: even replacing a single failed disk can trigger a chain reaction that causes large numbers of OSDs to be wrongly marked down. I am not sure whether this is because we have nearly 300 PGs on each SAS OSD and slightly more than 300 PGs on each SSD OSD (a rough calculation of how I arrive at those numbers is at the end of this mail). From the logs, it always starts with osd_op_tp timing out, then "no reply from osd", and then a large wave of OSDs being wrongly marked down.

Our environment:

1. 45 machines, each with 16 SAS OSDs and 8 SSD OSDs; every journal is a file in the OSD data directory.
2. The cluster is used for RBD only.
3. 300+ compute nodes hosting VMs.
4. Each OSD node currently has around a hundred thousand threads and around fifty thousand established network connections.
5. The servers are Dell R730xd, and Dell says there are no hardware error logs.

Has anyone else faced the same kind of instability, or is anyone running with 300+ PGs per OSD?
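For reference, this is a minimal Python sketch of how I estimate the PGs per OSD. The pool names, pg_num values and replication sizes below are placeholders, not our real pool layout; the real numbers come from `ceph osd pool ls detail`.

    # Rough estimate of PG replicas per OSD: each PG is stored "size" times,
    # so an OSD carries on average (pg_num * size) / num_osds replicas
    # for every pool whose CRUSH rule targets that device class.

    # Hypothetical pools -- substitute the real pg_num / size values.
    sas_pools = [
        {"name": "volumes", "pg_num": 65536, "size": 3},
    ]
    ssd_pools = [
        {"name": "volumes-ssd", "pg_num": 32768, "size": 3},
    ]

    def pgs_per_osd(pools, num_osds):
        """Average number of PG replicas landing on each OSD."""
        replicas = sum(p["pg_num"] * p["size"] for p in pools)
        return replicas / num_osds

    print("SAS:", pgs_per_osd(sas_pools, num_osds=45 * 16))  # 45 hosts * 16 SAS OSDs
    print("SSD:", pgs_per_osd(ssd_pools, num_osds=45 * 8))   # 45 hosts * 8 SSD OSDs

With our actual pools this works out to roughly 300 PG replicas per SAS OSD and a bit more than 300 per SSD OSD, which is what made me suspect the PG count in the first place.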