Hi Stefan,

I am running replica x3 with host as the failure domain, and the pool's min_size set to 1. My cluster serves real-time S3 traffic that cannot be stopped or blocked, so I accept that data may be lost as long as IO stays available. I want the cluster to keep running even with two nodes unavailable.

Two nodes went down at the same time; after they came back up, client IO and recovery ran concurrently, and some disks warned about slow ops. What is the problem? Maybe my disks are overloaded, but disk utilization is only 60-80%. (Sketches of the settings involved and of the suggested changes are after the quoted reply.)

Thanks Stefan

On Fri, Dec 1, 2023 at 16:40 Stefan Kooman <stefan@xxxxxx> wrote:

> On 01-12-2023 08:45, VÔ VI wrote:
> > Hi community,
> >
> > My cluster runs with 10 nodes, and when 2 nodes go down the log
> > sometimes shows slow ops. What is the root cause?
> > My OSDs are HDDs, with a 500 GB SSD for block.db and WAL per OSD.
> >
> > Health check update: 13 slow ops, oldest one blocked for 167 sec, osd.10
> > has slow ops (SLOW_OPS)
>
> Most likely you have a CRUSH rule that spreads objects over hosts as the
> failure domain. With size=3, min_size=2 (the default for replicated
> pools), you might end up in a situation where PGs that had replicas on
> both of the offline nodes no longer fulfill the min_size=2 requirement;
> those PGs will hence be inactive, and slow ops will occur.
>
> When host is your failure domain, you should not reboot more than one
> host at a time. If the hosts are somehow organized (different racks,
> datacenters), you could create a higher-level bucket, put your hosts
> there, create a CRUSH rule that uses that bucket type as the failure
> domain, and have your pools use that rule.
>
> Gr. Stefan
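
For reference, the configuration described above would look roughly like this; "mypool" is a hypothetical pool name:

    # 3 replicas, but keep serving IO even when only one replica
    # remains (that last copy is the data-loss risk described above).
    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 1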
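
If the slow ops only appear while recovery and client IO run together, throttling recovery/backfill is a common mitigation. A minimal sketch, assuming a recent release with the ceph config interface; the values are conservative starting points, not tuned recommendations:

    # Limit per-OSD backfill and recovery concurrency.
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    # Inspect what osd.10 is actually blocked on (run this on the
    # host that carries osd.10).
    ceph daemon osd.10 dump_ops_in_flight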
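
And a sketch of the higher-level bucket Stefan describes, assuming the hosts can be grouped into racks (the rack and host names are hypothetical):

    # Create a rack bucket and move a host into it; repeat per rack/host.
    ceph osd crush add-bucket rack1 rack
    ceph osd crush move rack1 root=default
    ceph osd crush move host1 rack=rack1
    # Rule that places each replica in a distinct rack, then switch the pool.
    ceph osd crush rule create-replicated replicated_by_rack default rack
    ceph osd pool set mypool crush_rule replicated_by_rack

With three or more racks and size=3, each replica lands in a distinct rack, so losing several hosts inside one rack costs each PG at most one replica.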