On Thu, 9 May 2019 at 17:46, Feng Zhang <prod.feng@xxxxxxxxx> wrote:
Thanks, guys.
I forgot about the IOPS. Since I have 100 disks, the total is
IOPS = 100 x 100 = 10K. With 4+2 erasure coding, when one disk fails,
each rebuilt object needs to read the 5 surviving chunks and write 1,
so the whole 100 disks can do about 10K/6 ~ 1.7K rebuild operations per second.
For the 100 x 6TB disks, with the object size set to 4MB,
that is 6TB/4MB = 1.5 million objects. Not considering disk throughput
or CPUs, a full rebuild takes:
1.5M/1.7K ~ 900 seconds?
I think you will _never_ see a full cluster all helping out at 100% to fix such an issue,
so while your math probably describes the absolute best case correctly, reality will
be somewhere below that.
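
For reference, a quick Python sketch of that best-case arithmetic; the per-disk
IOPS, the object size and the "6 IOs per rebuilt object" are the assumptions from
your mail, not measured values:

# Back-of-envelope best-case rebuild estimate (numbers from the mail above;
# per-disk IOPS and object size are assumptions, not measurements).
DISKS = 100                 # disks/OSDs in the cluster
IOPS_PER_DISK = 100         # assumed random IOPS per spinning disk
DISK_SIZE_TB = 6            # capacity of the failed disk
OBJECT_SIZE_MB = 4          # assumed object size
IOS_PER_REBUILT_OBJECT = 6  # 4+2 EC: read 5 surviving chunks + write 1

total_iops = DISKS * IOPS_PER_DISK                      # 10_000
rebuilds_per_sec = total_iops / IOS_PER_REBUILT_OBJECT  # ~1_667
objects_on_failed_disk = DISK_SIZE_TB * 1_000_000 / OBJECT_SIZE_MB  # ~1.5M

best_case_seconds = objects_on_failed_disk / rebuilds_per_sec
print(f"best case: ~{best_case_seconds:.0f} s (~{best_case_seconds / 60:.0f} min)")
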
Still, it is quite possible to cause this situation yourself and measure it under
exactly your own circumstances, since everyone's setup is slightly different.
Replacing broken drives is normal for any large storage system, and
Ceph will prioritize client traffic over repairs most of the time, so that will add to the
total calendar time the recovery takes, but keep your users happy while it runs.
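
As a rough illustration only (the recovery shares below are made-up numbers,
not Ceph defaults), here is how that calendar time stretches when only part of
the cluster's IOPS is spent on recovery while client traffic keeps priority:

best_case_seconds = 900  # from the sketch above

# Hypothetical fractions of total IOPS devoted to recovery.
for recovery_share in (1.0, 0.5, 0.25, 0.10):
    seconds = best_case_seconds / recovery_share
    print(f"{recovery_share:>4.0%} of IOPS for recovery -> ~{seconds / 60:.0f} min")
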
May the most significant bit of your life be positive.