Re: Deep-scrub much slower than HDD speed

Hi, I asked a similar question about increasing scrub throughput some time ago and couldn't get a fully satisfying answer: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/NHOHZLVQ3CKM7P7XJWGVXZUXY24ZE7RK

My observation is that far fewer (deep) scrubs are scheduled than could be executed. Some people have written scripts to do scrub scheduling in a more efficient way (by last-scrub timestamp), but I don't want to go down this route (yet). Unfortunately, the thread above does not contain the full conversation; I think it forked into a second one with the same or a similar title.

About performance calculations, along the lines of

> they were never setup to have enough IOPS to support the maintenance load,
> never mind the maintenance load plus the user load

> initially setup, it is nearly empty, so it appears
> to perform well even if it was setup with inexpensive but slow/large
> HDDs, then it becomes fuller and therefore heavily congested

There is a bit more to it. HDDs have the unfortunate property that read/write speed is not independent of which sector is being accessed. An empty drive serves IO from the beginning of the disk, where everything is fast. As drives fill up, they start using slower and slower regions. This performance degradation comes on top of the effects of longer seek paths and fragmentation.

Here I'm talking only about enterprise data-centre drives with proper sustained performance profiles, not cheap stuff that falls apart once you get serious.

Unfortunately, Ceph adds the lack of tail-merging support on top of that, which makes small objects extra expensive.

Still, Ceph was written for HDDs and actually performs well if the IO calculations are done properly. Take, for example, 8TB vs. 18TB drives. 8TB drives start out at about 150MB/s bandwidth on the fast part of the platter and slow down to 80-100MB/s towards the end. 18TB drives are not just 8TB drives with denser packing; they actually have more platters. That means they start out at 250MB/s and drop to something like 100-130MB/s towards the end. That's more than double the capacity, but not more than double the throughput. IOP/s are roughly the same, so IOP/s per TB go down a lot with capacity.
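As a back-of-the-envelope sketch of that comparison (throughput figures are the approximate ones from above; the ~150 random IOP/s per drive is an assumption on my part):

# Rough comparison of throughput and IOP/s per TB for the drive
# examples above (all numbers approximate).
drives = {
    # name: (capacity TB, outer-zone MB/s, inner-zone MB/s, random IOP/s)
    "8TB":  (8,  150,  90, 150),
    "18TB": (18, 250, 115, 150),
}

for name, (tb, fast, slow, iops) in drives.items():
    avg_bw = (fast + slow) / 2  # crude average over the whole platter
    print(f"{name}: ~{avg_bw:.0f} MB/s average, "
          f"{avg_bw / tb:.1f} MB/s per TB, {iops / tb:.0f} IOP/s per TB")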

When is this fine and when is it problematic? It's fine if you have large objects that are never modified. Then Ceph will usually reach sequential read/write performance and scrubbing will be done within a week (at less than 10% disk utilisation, which is good). The other extreme is many small objects, in which case your observed performance/throughput can be terrible and scrubbing might never end.

To be able to make reasonable estimates, you need to know the real-life object size distribution and whether full-object writes are effectively sequential (meaning you have large bluestore allocation sizes in general; look at the bluestore performance counters, which indicate how many large and how many small writes you have).
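As an illustration, a minimal sketch of pulling those counters from one local OSD; the exact counter names (bluestore_write_big / bluestore_write_small here) may vary between Ceph releases, so treat them as an assumption and check your own perf dump output:

import json
import subprocess

# Read the BlueStore perf counters of one local OSD (run on the OSD host).
# Counter names are assumed; verify against `ceph daemon osd.N perf dump`.
osd_id = 0  # example OSD id
dump = json.loads(subprocess.check_output(
    ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"]))
bs = dump.get("bluestore", {})
big = bs.get("bluestore_write_big", 0)
small = bs.get("bluestore_write_small", 0)
if big + small:
    print(f"osd.{osd_id}: {100 * big / (big + small):.1f}% big writes, "
          f"{100 * small / (big + small):.1f}% small writes")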

We have a fairly mixed size distribution with, unfortunately, quite a high percentage of small objects on our CephFS. We do have 18T drives, which are about 30% utilised. Scrubbing still finishes in less than 2 weeks, even with the outliers due to the "not ideal" scrub scheduling (thread above). I'm willing to accept up to 4 weeks of tail time, which will probably allow 50-60% utilisation before things drop below acceptable.
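For reference, a crude per-OSD estimate of the deep-scrub duration; the 10MB/s sustained deep-scrub rate below is an illustrative assumption, not a measurement:

# Crude deep-scrub duration estimate for one OSD.
capacity_tb = 18        # raw drive capacity
utilisation = 0.30      # fraction of the drive holding data
scrub_rate  = 10e6      # bytes/s of deep-scrub reads sustained next to
                        # the user load (assumption, measure your own)

days = capacity_tb * 1e12 * utilisation / scrub_rate / 86400
print(f"~{days:.1f} days to deep-scrub one {capacity_tb}TB OSD "
      f"at {utilisation:.0%} full")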

In essence, compared with the usual 8T drives, the 18T average-performance drives are something like 10T drives with pretty good performance. You just have to let go of 100% capacity utilisation. The limit is whichever comes first, capacity or IOP/s saturation. Once the admin workload can no longer complete in time, that's it: the disks are full and one needs to expand.

We have about 900 HDDs in our cluster and I maintain this large number mostly for performance reasons. I don't think I will ever see more than 50% utilisation before we change deployment or add drives.

Looking at our data in more detail, most of it is ice cold. Therefore, in the long run we plan to go for tiered OSDs (bcache/dm-cache) with enough total SSD capacity to hold about twice all hot data. Then, maybe, we can fill the big drives a bit more.

I was looking into large-capacity SSDs and, I'm afraid, once you get to the >=18TB SSD range they either have bad, often worse-than-spinner, performance, or are massively expensive. By performance I mean bandwidth here. Large SSDs can have a sustained bandwidth of only 30MB/s. They will still do about 500-1000 IOP/s per TB, but large file transfers or backfill will become a pain.
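To put that 30MB/s into perspective, a quick back-of-the-envelope for replacing/backfilling such a drive:

# How long a full-drive backfill takes at the quoted sustained bandwidth.
capacity_tb    = 18
bandwidth_mb_s = 30   # sustained sequential bandwidth of the cheap large SSD

days = capacity_tb * 1e12 / (bandwidth_mb_s * 1e6) / 86400
print(f"Backfilling {capacity_tb}TB at {bandwidth_mb_s}MB/s takes ~{days:.1f} days")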

I looked at models with reasonable bandwidth and asked whether I could get a price. The answer was that one such disk costs more than one of our entire standard storage servers. Clearly not in our league. A better solution is to combine the best of both worlds with more intelligent software that can differentiate between hot and cold data and maybe adapt to workloads.

> the best that can be said about those HDDs is that they should
> be considered "tapes" with some random access ability

Which is good if that is all you need. But true, a lot of people forget that using an 8+3 EC profile on a pool divides the aggregated IOP/s budget by 11. After that, divide by 2 and you have a number to tell your users/boss. They are either happy or give you more money.
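In numbers, a minimal sketch of that rule of thumb, using our cluster size as an example (the per-HDD IOP/s figure is an assumption):

# Rule-of-thumb client IOP/s budget for an 8+3 EC pool on HDDs.
num_hdds     = 900
iops_per_hdd = 150          # assumed random IOP/s per spinner
ec_k, ec_m   = 8, 3         # every client IO touches k+m = 11 OSDs

raw_iops      = num_hdds * iops_per_hdd
client_iops   = raw_iops / (ec_k + ec_m)   # divide by 11
promised_iops = client_iops / 2            # keep half for the admin workload
print(f"raw: {raw_iops}, after EC: {client_iops:.0f}, "
      f"number to tell users/boss: {promised_iops:.0f}")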

Our users also think in terms of price/TB only. I simply incorporate performance into the calculation and come up with a price per *usable* TB. Raw capacity includes admin overhead (which includes IOP/s), which can easily be 50% in total, plus the replication overhead. Just let go of 100% capacity utilisation and you will have a well-working cluster. I let go of everything above 50% utilisation; that's when I start requesting new material, and it works really well. Still much cheaper than an all-flash installation at higher utilisation.
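A sketch of that price-per-usable-TB calculation (all figures are example assumptions):

# Price per *usable* TB instead of per raw TB (example figures only).
price_per_drive = 350       # $ per 18TB HDD (assumption)
raw_tb          = 18
ec_overhead     = 11 / 8    # raw-to-usable factor of an 8+3 EC profile
max_utilisation = 0.50      # capacity headroom kept for the admin/IOP/s load

usable_tb = raw_tb / ec_overhead * max_utilisation
print(f"${price_per_drive / raw_tb:.2f} per raw TB vs. "
      f"${price_per_drive / usable_tb:.2f} per usable TB")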

To the all-flash enthusiasts: yes, we have all-flash pools and I do enjoy their performance. Still, the price. There are people who say platters are outdated and SSDs are competitive. Well, my google-fu is maybe not good enough, so here we go: if you show me where I can get SSDs with the specs below, I will go all-flash. Until then, sorry, cost economy is still a thing.

Specs A:

- capacity: 18TB+
- sustained 1M block-size sequential read/write (iodepth=1): 15MB/s per TB
- sustained 4K random 50/50 read-write (iodepth=1): 100 IOP/s
- data written per day for 5 years: 1TB (yes, this *is* very low yet sufficient)
- interface: SATA/SAS, 2.5" or 3.5"
- price: <=350$ (for 18TB)

Specs B:

- capacity: 18TB+
- sustained 1M block-size sequential read/write (iodepth=1): 25MB/s per TB
- sustained 4K random 50/50 read-write (iodepth=1): 1000 IOP/s
- data written per day for 5 years: 1TB (yes, this *is* very low yet sufficient)
- interface: SATA/SAS, 2.5" or 3.5"
- price: <=700$ (for 18TB)

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Peter Grandi <pg@xxxxxxxxxxxxxxxxxxxx>
Sent: Thursday, April 27, 2023 11:55 AM
To: list fs Ceph
Subject:  Re: Deep-scrub much slower than HDD speed

 > On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only
 > numbers already divided by replication factor), you need 55 days
 > to scrub it once.
 > That's 8x larger than the default scrub factor [...] Also, even
 > if I set the default scrub interval to 8x larger, it my disks
 > will still be thrashing seeks 100% of the time, affecting the
 > cluster's  throughput and latency performance.

Indeed! Every Ceph instance I have seen (not many) and almost every HPC
storage system I have seen have this problem, and that's because they
were never setup to have enough IOPS to support the maintenance load,
never mind the maintenance load plus the user load (and as a rule not
even the user load).

There is a simple reason why this happens: when a large Ceph (etc.)
storage instance is initially setup, it is nearly empty, so it appears
to perform well even if it was setup with inexpensive but slow/large
HDDs, then it becomes fuller and therefore heavily congested but whoever
set it up has already changed jobs or been promoted because of their
initial success (or they invent excuses).

A figure-of-merit that matters is IOPS-per-used-TB, and making it large
enough to support concurrent maintenance (scrubbing, backfilling,
rebalancing, backup) and user workloads. That is *expensive*, so in my
experience very few storage instance buyers aim for that.

The CERN IT people discovered long ago that quotes for storage workers
always used very slow/large HDDs that performed very poorly if the specs
were given as mere capacity, so they switched to requiring a different
metric, 18MB/s transfer rate of *interleaved* read and write per TB of
capacity, that is at least two parallel access streams per TB.

https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"

BTW I am not sure that a floor of 18MB/s of interleaved read and write
per TB is high enough to support simultaneous maintenance and user loads
for most Ceph instances, especially in HPC.

I have seen HPC storage systems "designed" around 10TB and even 18TB
HDDs, and the best that can be said about those HDDs is that they should
be considered "tapes" with some random access ability.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx