> On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only
> numbers already divided by replication factor), you need 55 days to
> scrub it once. That's 8x larger than the default scrub interval [...]
> Also, even if I set the default scrub interval to 8x larger, my disks
> will still be thrashing on seeks 100% of the time, affecting the
> cluster's throughput and latency performance.
Indeed! Every Ceph instance I have seen (not many) and almost every HPC
storage system I have seen have this problem, and that's because they
were never set up with enough IOPS to support the maintenance load,
never mind the maintenance load plus the user load (and as a rule not
even the user load alone).
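For what it's worth, the quoted 55-day figure is easy to check with a
back-of-envelope calculation. A minimal sketch (Python), assuming the
8 MB/s is the *aggregate* scrub rate across the cluster rather than a
per-disk rate:

# Back-of-envelope scrub time; the figures are the quoted ones,
# interpreted as aggregate cluster-wide numbers.
def days_to_scrub(capacity_tb, scrub_rate_mb_s):
    """Days needed to read every byte once at the given aggregate rate."""
    capacity_mb = capacity_tb * 1_000_000            # 1 TB = 10^6 MB (decimal)
    return capacity_mb / scrub_rate_mb_s / 86_400    # 86,400 s per day

print(days_to_scrub(38.0, 8.0))    # ~55 days, matching the quoted figure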
There is a simple reason why this happens: when a large Ceph (etc.)
storage instance is initially set up, it is nearly empty, so it appears
to perform well even if it was built with inexpensive but slow/large
HDDs. By the time it fills up and becomes heavily congested, whoever
set it up has already changed jobs or been promoted on the strength of
their initial success (or they invent excuses).
The figure of merit that matters is IOPS-per-used-TB, and it has to be
large enough to support concurrent maintenance (scrubbing, backfilling,
rebalancing, backup) and user workloads. That is *expensive*, so in my
experience very few storage instance buyers aim for it.
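To make that concrete, a minimal sketch of the figure of merit; the disk
count, per-HDD random IOPS, and fill level below are purely illustrative
assumptions, not measurements:

# Rough IOPS-per-used-TB for an HDD-based instance.
def iops_per_used_tb(num_disks, iops_per_disk, used_tb):
    """Aggregate small-random-IOPS divided by the capacity actually in use."""
    return (num_disks * iops_per_disk) / used_tb

# E.g. 100 x 10 TB HDDs at ~120 random IOPS each, 70% full:
print(iops_per_used_tb(100, 120.0, 0.7 * 100 * 10))   # ~17 IOPS per used TB

Whether ~17 IOPS per used TB is enough depends on the workload, but it
leaves very little headroom for users once scrubbing and backfilling are
running at the same time.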
The CERN IT people discovered long ago that quotes for storage servers
always used very slow/large HDDs that performed very poorly if the specs
were given as mere capacity, so they switched to requiring a different
metric: an 18 MB/s transfer rate of *interleaved* read and write per TB
of capacity, that is, at least two parallel access streams per TB.
https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"
BTW I am not sure that a floor of 18 MB/s of interleaved read and write
per TB is high enough to support simultaneous maintenance and user loads
for most Ceph instances, especially in HPC.
I have seen HPC storage systems "designed" around 10 TB and even 18 TB
HDDs, and the best that can be said about those HDDs is that they should
be considered "tapes" with some random access ability.