Re: Deep-scrub much slower than HDD speed

> On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only
> numbers already divided by replication factor), you need 55 days
> to scrub it once.
> That's 8x larger than the default scrub factor [...] Also, even
> if I set the default scrub interval to 8x larger, my disks
> will still be thrashing seeks 100% of the time, affecting the
> cluster's throughput and latency performance.

Indeed! Every Ceph instance I have seen (not many) and almost every HPC storage system I have seen has this problem, and that's because they were never set up with enough IOPS to support the maintenance load, never mind the maintenance load plus the user load (and as a rule not even the user load alone).
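
To make the arithmetic quoted above concrete, here is a back-of-the-envelope sketch (treating the quoted 8 MB/s as the aggregate deep-scrub rate, which is an assumption, but one that reproduces the 55-day figure):

# Rough deep-scrub cycle estimate; all figures are illustrative assumptions.
used_tb = 38.0           # data to scrub, already divided by the replication factor
scrub_rate_mb_s = 8.0    # aggregate deep-scrub throughput actually achieved

seconds = used_tb * 1e12 / (scrub_rate_mb_s * 1e6)
print(f"{seconds / 86400:.0f} days per full deep-scrub cycle")   # ~55 days

At that rate one scrub cycle barely finishes before the next is due, which is the seek thrashing described above.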

There is a simple reason why this happens: when a large Ceph (etc.) storage instance is initially set up, it is nearly empty, so it appears to perform well even if it was built from inexpensive but slow/large HDDs. Then it fills up and becomes heavily congested, but by then whoever set it up has already changed jobs or been promoted on the strength of the initial success (or they invent excuses).

The figure of merit that matters is IOPS per used TB, and it has to be kept large enough to support concurrent maintenance (scrubbing, backfilling, rebalancing, backup) and user workloads. That is *expensive*, so in my experience very few storage instance buyers aim for it.
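
As a rough illustration of that figure of merit (the drive characteristics below are generic assumptions, not measurements from any particular product):

# IOPS per used TB for a single nearline HDD; illustrative assumptions only.
hdd_capacity_tb = 18.0
hdd_random_iops = 150.0      # ballpark for a 7200 rpm HDD
fill_fraction = 0.85         # how full the OSD is allowed to get

iops_per_used_tb = hdd_random_iops / (hdd_capacity_tb * fill_fraction)
print(f"{iops_per_used_tb:.1f} IOPS per used TB")   # ~10, before any maintenance load

That single-digit budget has to cover scrubbing, backfilling, rebalancing, backup *and* user I/O at the same time.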

The CERN IT people discovered long ago that quotes for storage workers always used very slow/large HDDs that performed very poorly when the specifications were given as mere capacity, so they switched to requiring a different metric: 18 MB/s of *interleaved* read and write transfer rate per TB of capacity, that is, at least two parallel access streams per TB.

https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"

BTW I am not sure that a floor of 18 MB/s of interleaved read and write per TB is high enough to support simultaneous maintenance and user loads for most Ceph instances, especially in HPC.
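
A quick check of typical large HDDs against that floor (the interleaved throughput figures below are assumptions; real mixed read/write rates sit far below the sequential spec because of seeking):

# Compare assumed interleaved throughput against the 18 MB/s-per-TB floor.
floor_mb_s_per_tb = 18.0
for capacity_tb, interleaved_mb_s in [(10.0, 80.0), (18.0, 90.0)]:
    per_tb = interleaved_mb_s / capacity_tb
    verdict = "meets" if per_tb >= floor_mb_s_per_tb else "falls well short of"
    print(f"{capacity_tb:.0f} TB HDD: {per_tb:.1f} MB/s per TB, "
          f"{verdict} the {floor_mb_s_per_tb:.0f} MB/s/TB floor")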

I have seen HPC storage systems "designed" around 10 TB and even 18 TB HDDs, and the best that can be said about those HDDs is that they should be considered "tapes" with some random-access ability.