Re: Load on drives of different sizes in ceph


Sure - you can play with the weights or crush weights to make all drives fill evenly relative to their respective capacities.  The obvious consequence is that a drive with twice the capacity will hold about twice as much data.  A perhaps less glaring consequence is that, by default, the double-size drives will also get twice the I/O requests (iops).  With spinning drives the number of iops a drive can do is fairly independent of its size (not completely, but it is certainly not proportional to capacity).  So when larger drives are added this way, they end up determining the overall performance of the cluster, while the smaller drives sit less I/O loaded.
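
To make the arithmetic concrete, here is a back-of-the-envelope sketch in plain Python (the drive counts and sizes are hypothetical): with capacity-proportional crush weights, each drive's share of the data - and, assuming uniform access, its share of the iops - scales with its size, while its iops capability does not.

    # Back-of-the-envelope: with capacity-proportional crush weights, each
    # drive's share of the data (and of the iops, if access is uniform)
    # scales with its size.  Counts and sizes below are hypothetical.
    drives = {"8TB": (100, 8.0), "12TB": (30, 12.0)}   # name -> (count, TB)

    total_tb = sum(count * size for count, size in drives.values())
    smallest = min(size for _, size in drives.values())

    for name, (count, size) in drives.items():
        data_share = size / total_tb          # fraction of all data on one such drive
        print(f"{name}: {data_share:.3%} of the data per drive, "
              f"{size / smallest:.1f}x the iops of an {smallest:.0f}TB drive")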

There are a few possible choices that I've gathered, none of them ideal:

1. Live with it.  Eventually when there are lots of larger drives, they
   will dominate the performance.
2. Use only part of the drive.  This keeps performance - but wastes space.
3. As suggested by Anthony below - lowering the primary affinity can
   address the read iops balance - but it doesn't do much for writes.
4. Create two OSDs on the larger drive.  This would allow part of the
   larger drive to host data that is accessed less frequently (from a
   different pool) to equalize the iops.  Unfortunately this is less
   than ideal since the two OSDs don't coordinate to schedule I/O for a
   single spinning drive.  Would it be possible to have a single OSD
   per drive, and still host, say, 2/3 of one pool and 1/3 of another?
   This would require the OSD to be in two different crush hierarchies
   with different weights - I don't know if that is possible.
5. Put the larger drives into a different ceph pool.  This way I/O can
   be controlled by distributing data across pools.  Our problem is that
   there isn't a way to do this transparently with cephfs (there isn't
   a 100% correct way of moving files between pools).  We already do
   data moves across pools with some safeguards, but without help from
   the MDS this can't be done 100% correctly - it would require holding
   MDS caps while the data is moved, which isn't possible via the POSIX
   API.  A sketch of such a move is below, after this list.
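
For reference, the data moves in option 5 are roughly the shape of the sketch below: write the data into a fresh file whose layout points at the target pool, then rename it over the original.  Path and error handling are stripped down, and this is exactly the part that isn't 100% safe - nothing stops another client from writing to the original while it is being copied, because we can't hold MDS caps across the operation from userspace.

    import os
    import shutil

    def move_file_to_pool(path, pool):
        """Rewrite a CephFS file so its data lands in a different data pool.
        NOT race-free: writes to `path` during the copy can be lost."""
        tmp = path + ".poolmove"                  # temporary file, same directory
        fd = os.open(tmp, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        try:
            with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
                # The data pool can only be changed while the file is still
                # empty, so set the layout xattr before copying any bytes.
                os.setxattr(dst.fileno(), "ceph.file.layout.pool", pool.encode())
                shutil.copyfileobj(src, dst)
                dst.flush()
                os.fsync(dst.fileno())
            shutil.copystat(path, tmp)            # mode/mtime; ownership not handled
            os.rename(tmp, path)                  # atomically replace the original name
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise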


Am I missing any choices/points?

Andras


On 4/6/20 5:20 PM, ceph@xxxxxxxxxx wrote:
Isn't there a way to deal with this kind of setup by playing with the "weight" of an OSD? I don't mean the "crush weight".

I am in a situation where I have to think about adding a server with 24 x 2TB disks - my other OSD nodes have 12 x 4TB. That is 48TB per node in both cases.



On 31 March 2020 18:10:24 CEST, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
You can adjust the primary affinity down on the larger drives so
they'll get less read load.  In one test I've seen this result in a
10-15% increase in read throughput, but it depends on your situation.

Optimal settings would require calculations that make my head hurt,
maybe someone has a tool but I haven’t seen it.
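
Not a tool, but the rough heuristic I'd start from is below (plain Python; the drive mix is made up): scale the primary affinity so that capacity x affinity comes out about equal across drive sizes, which should roughly even out how often each drive ends up primary and hence its read load.  It ignores how primary selection actually retries, so treat the numbers as a starting point to verify against real read traffic.

    # Rough heuristic, not exact math: lower primary affinity on the bigger
    # drives so that (capacity) x (affinity) is about equal for every drive,
    # roughly equalizing how often each one serves reads as primary.
    # Drive sizes are hypothetical; <id> is a placeholder for real OSD ids.
    drive_sizes_tb = {"8TB": 8.0, "12TB": 12.0}

    smallest = min(drive_sizes_tb.values())
    for name, size in drive_sizes_tb.items():
        affinity = round(smallest / size, 2)      # 8TB -> 1.0, 12TB -> 0.67
        print(f"{name} drives: ceph osd primary-affinity osd.<id> {affinity}")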

You might want to ensure that the drives are spread evenly across your
(unspecified) failure domains so that the extra capacity isn’t wasted,
again depending on your topology.


On Mar 31, 2020, at 8:49 AM, Eneko Lacunza <elacunza@xxxxxxxxx> wrote:
Hi Andras,

On 31/3/20 at 16:42, Andras Pataki wrote:
I'm looking for some advice on what to do about drives of different
sizes in the same cluster.

We have so far kept the drive sizes consistent on our main ceph
cluster (using 8TB drives).  We're getting some new hardware with
larger, 12TB drives next, and I'm pondering how best to configure
them.  If I simply add them, they will hold 1.5x the data (which is
less of a problem), but they will also get 1.5x the iops - so I presume
they will slow the whole cluster down as a result (these drives will be
busy while the rest will be less so).  I'm wondering how people
generally handle this.

I'm more concerned about these larger drives being busier than the
rest - so I'd like to be able to put, say, a third of a drive's worth
of less frequently accessed data on them in addition to the usual data,
to use the extra capacity without increasing the load on them.  Is
there an easy way to accomplish this?  One possibility is to run two
OSDs on the drive (in two crush hierarchies), which isn't ideal.  Can I
run just one OSD and put it into two crush roots, or something similar?
You should adjust the weight of the new 12TB disk OSDs to match the
weight of the current 8TB OSDs.
That will make the new disks look the same as the old disks to Ceph :-)
But you'll lose the extra 4TB of space until you remove the 8TB disks
from the cluster...
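
Concretely (I mean the crush weight here, and the OSD ids below are placeholders), something like the following sketch, using the usual convention of crush weight = capacity in TiB:

    import subprocess

    # Pin the new 12TB OSDs to the crush weight of an 8TB drive so they take
    # no more data than the existing OSDs.  OSD ids below are placeholders;
    # by convention the crush weight is the capacity in TiB.
    EIGHT_TB_IN_TIB = 8 * 10**12 / 2**40          # ~7.28

    new_12tb_osd_ids = [120, 121, 122]            # hypothetical ids of the new OSDs
    for osd_id in new_12tb_osd_ids:
        subprocess.run(
            ["ceph", "osd", "crush", "reweight", f"osd.{osd_id}",
             f"{EIGHT_TB_IN_TIB:.2f}"],
            check=True,
        )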
Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



