> [... number of PGs per OSD ...]
> So it is mainly related to PG size.

Indeed, and secondarily to the number of objects: many objects per
PG mean lower metadata overhead, but bigger PGs mean higher admin
workload latency.

>> Note: HDDs larger than 1TB are not really suitable for
>> significant parallel user workloads and most admin workloads:
>> http://www.sabi.co.uk/blog/17-one.html?170610#170610
>> "How challenging is a goal of 18MB/s per TB of storage, and latency"

> I'm afraid nobody will build a 100PB cluster with 1T drives.

Well, if they want it to be able to do a non-negligible user
workload plus the necessary admin workload, too bad for them.

My post above does not depend at all on the filesystem, but simply
on the physical parameter of IOPS-per-TB of HDDs (a "figure of
merit" that is widely underestimated or ignored), and on having
enough IOPS-per-TB to sustain both the user and the admin workload.

A couple of legacy Ceph instances I saw in the past had 8TB and
18TB drives, and as they got full the instances basically congealed
(latencies in the several-seconds or even dozens-of-seconds range)
even under modest user workloads; expensive admin workloads like
scrubbing (never mind deep scrubbing) fell behind by a year or two,
and rebalancing was nearly impossible. Again, not because of Ceph.

> That's just absurd. So, the sharp increase of per-device
> capacity has to be taken into account.

Indeed, by taking into account that HDDs of 4TB and above behave as
slow-random-access tapes.

> Specifically as the same development is happening with SSDs.

But that is completely different: SSDs have *much* higher IOPS,
even SATA ones, so even large SSDs have enormously better
IOPS-per-TB.

> I would like to point out that there are scale-out storage
> systems that have adapted their architecture for this scenario
> and use large HDDs very well.

That is *physically impossible*, as they just do not have enough
IOPS-per-TB for many "live" workloads. The illusion that they might
work well happens in one of two cases:

* Either because they have not filled up yet, or because they have
  filled up but only a minuscule subset of the data is in active
  use, the IOPS-per-*active*-TB of the user workload is still good
  enough. The problem with that is that for many admin workloads a
  lot of the data, or even *all* of the data, becomes active.

* If the *active* data is mostly read-only and gets cached on an
  SSD tier of sufficient size, and the admin workload does not
  matter.

> For comparison, our University operates an all-HDD Qumulo
> cluster that handles the administrative and user/student
> storage for the entire University and has about a factor 10
> higher aggregated sustained IOP/s performance compared with a
> similarly sized ceph cluster (FS performance).

I have some idea of how Qumulo does things and that is very
unlikely: Ceph is not fundamentally inferior to their design.
Perhaps the workload's anisotropy matches particularly well that of
that specific Qumulo instance:

https://www.sabi.co.uk/blog/16-one.html?160322#160322
"Examples of anisotropy in Isilon and Qumulo"

> doing something "good enough" with large drives is very well
> possible.

It is possible to use large drives for "Glacier" style storage with
an SSD front tier. But "Glacier" style storage is not trivial to
design or manage:

https://storagemojo.com/2014/04/25/amazons-glacier-secret-bdxl/
https://storagemojo.com/2014/04/30/glacier-redux/
https://blog.dshr.org/2014/09/more-on-facebooks-cold-storage.html
https://www.theregister.com/2015/05/07/facebook_maid_gets_cold/
https://blog.dshr.org/2021/05/storage-update.html
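To put a number on the IOPS-per-TB "figure of merit" mentioned
above, here is a rough back-of-the-envelope sketch in Python; the
~150 random IOPS per spindle and the tens of thousands of IOPS for
a SATA SSD are ballpark assumptions of mine, not measurements of
any particular product:

  # Back-of-the-envelope IOPS-per-TB for a few drive types.
  # The per-device IOPS figures are ballpark assumptions: a 7200rpm
  # HDD delivers roughly the same ~150 random IOPS whatever its
  # capacity, while even a SATA SSD delivers tens of thousands.
  drives = [
      # (label, capacity in TB, assumed random IOPS per device)
      ("1TB 7200rpm HDD",   1,    150),
      ("4TB 7200rpm HDD",   4,    150),
      ("8TB 7200rpm HDD",   8,    150),
      ("18TB 7200rpm HDD", 18,    150),
      ("4TB SATA SSD",      4, 50_000),
  ]
  for label, tb, iops in drives:
      # Per-device HDD IOPS barely change with capacity, so the
      # figure of merit collapses as drives get bigger.
      print(f"{label:17s} {iops / tb:8.1f} IOPS-per-TB")

Per-device HDD IOPS are set by seek and rotational latency and
barely change with capacity, so the figure of merit falls roughly
linearly with drive size; that is the entire argument in one
number.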
> - You also don't know why the recommendation today is 100-200
> per OSD fixed except that it was suitable for 1TB drives.
> - You also can't answer what will happen if one goes for
> 100-200 PGs per TB, meaning 1600-3200 PGs per 16TB drive.
> So my main question, the last one, is still looking for an answer.

The famous joke goes: a guy goes to the doctor and says "Doctor,
when I stab my hand hard with a fork it really hurts a lot, how can
you fix that?" and the doctor says "Do not do that". If you do not
like that answer, keep looking :-).

Put another way: if the combined user+admin workload requires N*100
IOPS-per-TB, using N IOPS-per-TB storage media is not going to give
a happy experience. But lots of people know better.
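To illustrate the arithmetic behind that last point for a 16TB OSD,
a small sketch; the weekly deep-scrub target, the ~200MB/s
best-case sequential rate and the ~150 random IOPS are ballpark
assumptions of mine, not properties of any specific drive or Ceph
release:

  # Rough arithmetic for the admin workload on one nearly-full,
  # large OSD; all figures are ballpark assumptions for a 7200rpm HDD.
  capacity_tb     = 16       # nearly-full 16TB OSD
  scrub_days      = 7        # target: deep-scrub everything weekly
  hdd_seq_mb_s    = 200      # best-case sequential throughput
  hdd_random_iops = 150      # per-device random IOPS

  # At 100-200 PGs per TB, as asked above:
  print(f"PGs on this OSD: {100 * capacity_tb}-{200 * capacity_tb}")

  # Sustained rate needed just to deep-scrub the whole drive in the window.
  needed_mb_s = capacity_tb * 1e12 / (scrub_days * 24 * 3600) / 1e6
  print(f"deep scrub alone needs ~{needed_mb_s:.0f} MB/s sustained, "
        f"~{100 * needed_mb_s / hdd_seq_mb_s:.0f}% of best-case "
        f"sequential bandwidth, before any user IO")

  # And the IOPS-per-TB available for user plus remaining admin work.
  print(f"IOPS-per-TB of the device: ~{hdd_random_iops / capacity_tb:.1f}")

And deep scrub over thousands of small PGs on a fragmented,
nearly-full drive is far from a single sequential pass, so in
practice much of that cost turns into random IOPS competing with
the user workload, which is exactly the IOPS-per-TB squeeze
described above.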