On 01-12-2023 08:45, VÔ VI wrote:
Hi community, my cluster runs with 10 nodes and 2 nodes went down; since then the log sometimes shows slow ops. What is the root cause? My OSDs are HDDs, with a 500GB SSD per OSD for block.db and WAL. Health check update: 13 slow ops, oldest one blocked for 167 sec, osd.10 has slow ops (SLOW_OPS)
Most likely you have a crush rule that spreads objects over hosts as the failure domain. With size=3, min_size=2 (the default for replicated pools) you can end up in a situation where the two offline nodes both hold replicas of the same PGs; those PGs no longer fulfil the min_size=2 requirement, become inactive, and slow ops occur.
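You can check whether that is what is happening by looking at the pool settings and at inactive PGs, something along these lines (the pool name "mypool" is just a placeholder for your own pool):

    ceph health detail                   # shows which PGs/OSDs are behind the SLOW_OPS warning
    ceph osd pool get mypool size        # replica count, default 3
    ceph osd pool get mypool min_size    # replicas required to keep serving I/O, default 2
    ceph pg dump_stuck inactive          # PGs that currently cannot serve I/O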
When host is your failure domain, you should not take down more than one host at the same time. If the hosts are somehow organized (different racks, datacenters) you could create a higher-level bucket and put your hosts in it, then create a crush rule that uses that bucket type as the failure domain and have your pools use that rule, see the sketch below.
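A rough sketch of how that could look (rack, host and rule names are placeholders, adapt to your own layout; moving hosts between buckets will trigger data movement):

    # create a rack bucket and move a host into it; repeat per rack/host
    ceph osd crush add-bucket rack1 rack
    ceph osd crush move rack1 root=default
    ceph osd crush move node1 rack=rack1

    # new replicated rule with rack as failure domain, then point the pool at it
    ceph osd crush rule create-replicated rep_rack default rack
    ceph osd pool set mypool crush_rule rep_rack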
Gr. Stefan