Re: About ceph osd slow ops


 



On 01-12-2023 08:45, VÔ VI wrote:
Hi community,

My cluster is running with 10 nodes and 2 nodes went down; sometimes the log
shows slow ops. What is the root cause?
My OSDs are HDDs, with block.db and WAL on a 500 GB SSD per OSD.

Health check update: 13 slow ops, oldest one blocked for 167 sec, osd.10
has slow ops (SLOW_OPS)

Most likely you have a CRUSH rule that spreads objects over hosts as the failure domain. With size=3, min_size=2 (the default for replicated pools), losing two nodes can leave some PGs with fewer than min_size=2 replicas available; those PGs become inactive and slow ops show up.
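
You can check this yourself; roughly something like the following (pool name "mypool" is just a placeholder, substitute your own):

  ceph osd pool get mypool size
  ceph osd pool get mypool min_size
  ceph osd pool get mypool crush_rule
  ceph osd crush rule dump        # check the failure-domain type the rule uses
  ceph pg dump_stuck inactive     # PGs that cannot serve I/O right now

If the inactive PGs map to OSDs on the two hosts that are down, that confirms the picture above.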

When host is your failure domain, you should not take down more than one host at the same time. If the hosts are physically organized somehow (different racks or datacenters), you can create a higher-level bucket for each group, move your hosts into them, create a CRUSH rule that uses that bucket type as the failure domain, and have your pools use that rule; a sketch follows below.
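
Roughly along these lines (bucket, host and rule names are just examples, untested, adapt to your layout):

  ceph osd crush add-bucket rack1 rack
  ceph osd crush add-bucket rack2 rack
  ceph osd crush move rack1 root=default
  ceph osd crush move rack2 root=default
  ceph osd crush move node1 rack=rack1
  ceph osd crush move node2 rack=rack2
  ceph osd crush rule create-replicated rack_rule default rack
  ceph osd pool set mypool crush_rule rack_rule

Note that moving hosts into new buckets and switching the rule will trigger data rebalancing. With rack as the failure domain (and at least three racks for size=3), you can then take down the hosts of one rack at a time without dropping below min_size.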

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



