> On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only
> numbers already divided by replication factor), you need 55 days to
> scrub it once. That's 8x larger than the default scrub interval [...]
> Also, even if I set the default scrub interval to 8x larger, my disks
> will still be thrashing on seeks 100% of the time, affecting the
> cluster's throughput and latency performance.
Indeed! Every Ceph instance I have seen (not many) and almost every HPC
storage system I have seen have this problem, and that's because they
were never set up with enough IOPS to support the maintenance load,
never mind the maintenance load plus the user load (and as a rule not
even the user load alone).
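For what it's worth, the quoted 55-day figure is easy to check with a
back-of-envelope calculation. A minimal sketch (Python), assuming the
8 MB/s is the *aggregate* scrub rate across the cluster rather than a
per-disk rate:

# Back-of-envelope scrub time; the figures are the quoted ones,
# interpreted as aggregate cluster-wide numbers.
def days_to_scrub(capacity_tb, scrub_rate_mb_s):
    """Days needed to read every byte once at the given aggregate rate."""
    capacity_mb = capacity_tb * 1_000_000            # 1 TB = 10^6 MB (decimal)
    return capacity_mb / scrub_rate_mb_s / 86_400    # 86,400 s per day

print(days_to_scrub(38.0, 8.0))    # ~55 days, matching the quoted figure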
There is a simple reason why this happens: when a large Ceph (etc.)
storage instance is initially set up, it is nearly empty, so it appears
to perform well even if it was built with inexpensive but slow/large
HDDs. By the time it fills up and becomes heavily congested, whoever
set it up has already changed jobs or been promoted on the strength of
their initial success (or they invent excuses).
The figure of merit that matters is IOPS-per-used-TB, and it has to be
large enough to support concurrent maintenance (scrubbing, backfilling,
rebalancing, backup) and user workloads. That is *expensive*, so in my
experience very few storage instance buyers aim for it.
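To make that concrete, a minimal sketch of the figure of merit; the disk
count, per-HDD random IOPS, and fill level below are purely illustrative
assumptions, not measurements:

# Rough IOPS-per-used-TB for an HDD-based instance.
def iops_per_used_tb(num_disks, iops_per_disk, used_tb):
    """Aggregate small-random-IOPS divided by the capacity actually in use."""
    return (num_disks * iops_per_disk) / used_tb

# E.g. 100 x 10 TB HDDs at ~120 random IOPS each, 70% full:
print(iops_per_used_tb(100, 120.0, 0.7 * 100 * 10))   # ~17 IOPS per used TB

Whether ~17 IOPS per used TB is enough depends on the workload, but it
leaves very little headroom for users once scrubbing and backfilling are
running at the same time.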
The CERN IT people discovered long ago that quotes for storage servers
always used very slow/large HDDs that performed very poorly if the specs
were given as mere capacity, so they switched to requiring a different
metric: an 18 MB/s transfer rate of *interleaved* read and write per TB
of capacity, that is, at least two parallel access streams per TB.
https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"
BTW I am not sure that a floor of 18 MB/s of interleaved read and write
per TB is high enough to support simultaneous maintenance and user loads
for most Ceph instances, especially in HPC.
I have seen HPC storage systems "designed" around 10 TB and even 18 TB
HDDs, and the best that can be said about those HDDs is that they should
be considered "tapes" with some random access ability.