Re: Deep-scrub much slower than HDD speed

> Indeed! Every Ceph instance I have seen (not many) and almost every HPC storage system I have seen have this problem, and that's because they were never set up to have enough IOPS to support the maintenance load, never mind the maintenance load plus the user load (and as a rule not even the user load).

Yep, this is one of the false economies of spinners.  The SNIA TCO calculator includes a performance factor for just this reason.

> There is a simple reason why this happens: when a large Ceph (etc.) storage instance is initially set up, it is nearly empty, so it appears to perform well even if it was set up with inexpensive but slow/large HDDs; then it becomes fuller and therefore heavily congested

Data fragments over time with organic growth, and the drive spends a larger fraction of time seeking.  I’ve predicted then seen this even on a cluster whose hardware had been blessed by a certain professional services company (*ahem*).

> but whoever set it up has already changed jobs or been promoted because of their initial success (or they invent excuses).

`mkfs.xfs -n size=65536` will haunt my nightmares until the end of my days.  As well as an inadequate LFF HDD architecture I was not permitted to fix, *including the mons*.  But I digress.

> A figure-of-merit that matters is IOPS-per-used-TB, and making it large enough to support concurrent maintenance (scrubbing, backfilling, rebalancing, backup) and user workloads. That is *expensive*, so in my experience very few storage instance buyers aim for that.

^^^ This.  Moreover, it’s all too common to try to band-aid this with expensive, fussy RoC HBAs with cache RAM and BBU/supercap.  The money spent on those, and spent on jumping through their hoops, can easily debulk the HDD-SSD CapEx gap.  Plus if your solution doesn’t do the job it needs to do, it is no bargain at any price.

This correlates with IOPS/$, a metric in which HDDs are abysmal.
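For anyone who wants to put numbers on it, here is a back-of-envelope sketch in Python; the per-drive figures are assumed ballpark values for illustration, not vendor specs:

    # Rough IOPS-per-TB and IOPS-per-dollar comparison.
    # All figures below are assumed ballpark values, not measurements.
    drives = {
        # name: (capacity_tb, random_iops, price_usd)
        "18TB 7.2k SATA HDD": (18.0, 200, 350),
        "7.68TB SATA SSD": (7.68, 50_000, 700),
        "7.68TB NVMe SSD": (7.68, 400_000, 900),
    }

    for name, (tb, iops, usd) in drives.items():
        print(f"{name:20s}  {iops / tb:9.0f} IOPS/TB  {iops / usd:7.1f} IOPS/$")

Even with generous assumptions for the spinner, the per-TB gap comes out around three orders of magnitude, which is why no amount of HBA cache really closes it.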

> The CERN IT people discovered long ago that quotes for storage workers always used very slow/large HDDs that performed very poorly if the specs were given as mere capacity, so they switched to requiring a different metric, 18MB/s transfer rate of *interleaved* read and write per TB of capacity, that is at least two parallel access streams per TB.

At least one major SSD manufacturer attends specifically to reads under write pressure.
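To put that floor in perspective, a quick sketch of what it implies per drive; the ~100MB/s sustained figure for a busy, seek-bound spinner is my assumption, not a datasheet number:

    # What CERN's floor of 18 MB/s of *interleaved* read+write per TB
    # implies for a single drive of a given capacity.
    FLOOR_MB_S_PER_TB = 18
    ASSUMED_HDD_MB_S = 100  # assumed sustained rate of one busy spinner

    for capacity_tb in (4, 10, 18):
        required = FLOOR_MB_S_PER_TB * capacity_tb
        verdict = "meets it" if required <= ASSUMED_HDD_MB_S else "cannot keep up"
        print(f"{capacity_tb:2d} TB HDD: floor requires {required:3d} MB/s; {verdict}")

So even by that arguably-too-low floor, a 10TB or 18TB spinner is underwater before user load even enters the picture.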

> https://www.sabi.co.uk/blog/13-two.html?131227#131227
> "The issue with disk drives with multi-TB capacities"
> 
> BTW I am not sure that a floor of 18MB/s of interleaved read and write per TB is high enough to support simultaneous maintenance and user loads for most Ceph instances, especially in HPC.
> 
> I have seen HPC storage systems "designed" around 10TB and even 18TB HDDs, and the best that can be said about those HDDs is that they should be considered "tapes" with some random access ability.

Yes!  This harks back to DECtape (https://www.vt100.net/timeline/1964.html), which was literally this; people even used it as a filesystem.  Some years ago I had Brian Kernighan sign one: "Wow I haven't seen one of these in YEARS!"

— aad

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



