On 01-12-2023 08:45, VÔ VI wrote:
Hi community, my cluster runs with 10 nodes and 2 nodes went down; since then the log sometimes shows slow ops. What is the root cause? My OSDs are HDDs, with a 500GB SSD per OSD for block.db and WAL. Health check update: 13 slow ops, oldest one blocked for 167 sec, osd.10 has slow ops (SLOW_OPS)
Most likely you have a crush rule that spreads objects over hosts as the failure domain. With size=3, min_size=2 (the default for replicated pools) you can end up in a situation where the two offline nodes both hold replicas of the same PGs; those PGs no longer fulfil the min_size=2 requirement, become inactive, and slow ops occur.
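You can check whether that is what is happening by looking at the pool settings and at inactive PGs, something along these lines (the pool name "mypool" is just a placeholder for your own pool):

    ceph health detail                   # shows which PGs/OSDs are behind the SLOW_OPS warning
    ceph osd pool get mypool size        # replica count, default 3
    ceph osd pool get mypool min_size    # replicas required to keep serving I/O, default 2
    ceph pg dump_stuck inactive          # PGs that currently cannot serve I/O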
When host is your failure domain, you should not take down more than one host at the same time. If the hosts are somehow organized (different racks, datacenters) you could create a higher-level bucket and put your hosts in it, then create a crush rule that uses that bucket type as the failure domain and have your pools use that rule, see the sketch below.
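A rough sketch of how that could look (rack, host and rule names are placeholders, adapt to your own layout; moving hosts between buckets will trigger data movement):

    # create a rack bucket and move a host into it; repeat per rack/host
    ceph osd crush add-bucket rack1 rack
    ceph osd crush move rack1 root=default
    ceph osd crush move node1 rack=rack1

    # new replicated rule with rack as failure domain, then point the pool at it
    ceph osd crush rule create-replicated rep_rack default rack
    ceph osd pool set mypool crush_rule rep_rack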
Gr. Stefan