Hi,
Thanks to your answers I now understand this part of Ceph better. I made the change to the crushmap that Maxime suggested, and now the results are what I expected from the beginning:
# ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
0 7.27100 1.00000 7445G 1830G 5614G 24.59 0.98 238
3 7.27100 1.00000 7445G 1700G 5744G 22.84 0.91 229
4 7.27100 1.00000 7445G 1731G 5713G 23.26 0.93 233
1 1.81299 1.00000 1856G 661G 1195G 35.63 1.43 87
5 1.81299 1.00000 1856G 544G 1311G 29.34 1.17 73
6 1.81299 1.00000 1856G 519G 1337G 27.98 1.12 71
2 2.72198 1.00000 2787G 766G 2021G 27.50 1.10 116
7 2.72198 1.00000 2787G 651G 2136G 23.36 0.93 103
8 2.72198 1.00000 2787G 661G 2126G 23.72 0.95 98
TOTAL 36267G 9067G 27200G 25.00
MIN/MAX VAR: 0.91/1.43 STDDEV: 4.20
#
I understand that the Ceph default of "type host" is safer than "type osd", but as I said before, this cluster is for testing purposes only.
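For reference, this is roughly what the relevant rule looks like in the decompiled map after running the sed; I am assuming the default jewel rule name (replicated_ruleset) here, which may differ on other clusters. Only the chooseleaf line changed from "type host" to "type osd":

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            # was: step chooseleaf firstn 0 type host
            step chooseleaf firstn 0 type osd
            step emit
    }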
Thanks for all your answers :)
2017-06-06 9:20 GMT+02:00 Maxime Guyot <maxime@xxxxxxxxxxx>:
Hi Félix,

Changing the failure domain to OSD is probably the easiest option if this is a test cluster. I think the commands would go like:

- ceph osd getcrushmap -o map.bin
- crushtool -d map.bin -o map.txt
- sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 0 type osd/' map.txt
- crushtool -c map.txt -o map.bin
- ceph osd setcrushmap -i map.bin

Moving HDDs around so that each server holds ~8TB would be a good option if this is a capacity-focused use case. It will allow you to reboot 1 server at a time without radosgw downtime. You would target 26/3 = 8.66TB per node, so:

- node1: 1x8TB
- node2: 1x8TB + 1x2TB
- node3: 2x6TB + 1x2TB

If you are more concerned about performance, then set the weights to 1 on all HDDs and forget about the wasted capacity.

Cheers,
Maxime

On Tue, 6 Jun 2017 at 00:44 Christian Wuerdig <christian.wuerdig@xxxxxxxxx> wrote:

Yet another option is to change the failure domain to OSD instead of host (this avoids having to move disks around and will probably meet your initial expectations).
It means your cluster will become unavailable when you lose a host until you fix it, though. OTOH you probably don't have too much leeway anyway with just 3 hosts, so it might be an acceptable trade-off. It also means you can just add new OSDs to the servers wherever they fit.

On Tue, Jun 6, 2017 at 1:51 AM, David Turner <drakonstein@xxxxxxxxx> wrote:

If you want to resolve your issue without purchasing another node, you should move one disk of each size into each server. This process will be quite painful, as you'll need to actually move the disks in the crush map to be under a different host and then all of your data will move around, but then the cluster will be able to use the weights to distribute the data between the 2TB, 3TB, and 8TB drives much more evenly.

On Mon, Jun 5, 2017 at 9:21 AM Loic Dachary <loic@xxxxxxxxxxx> wrote:
On 06/05/2017 02:48 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 5 Jun 2017 13:54:02 +0200 Félix Barbeira wrote:
>
>> Hi,
>>
>> We have a small cluster for radosgw use only. It has three nodes, witch 3
> ^^^^^ ^^^^^
>> osds each. Each node has different disk sizes:
>>
>
> There's your answer, staring you right in the face.
>
> Your default replication size is 3, your default failure domain is host.
>
> Ceph can not distribute data according to the weight, since it needs to be
> on a different node (one replica per node) to comply with the replica size.
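A quick way to double check both of these settings is something like the following (the pool name is taken from your pool listing below, and the exact output will of course differ on your cluster):

    ceph osd pool get default.rgw.buckets.data size    # replica count of the data pool
    ceph osd crush rule dump                            # look for the "chooseleaf ... type host" step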
Another way to look at it is to imagine a situation where 10TB worth of data
is stored on node01, which has 3x8TB = 24TB. Since you asked for 3 replicas, this
data must be replicated to node02, but... there is only 3x2TB = 6TB available there.
So the maximum you can store is 6TB, and the remaining disk space on node01 and node03
will never be used.
python-crush analyze will display a message about that situation and show which buckets
are overweighted.
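For example, something along these lines should do it; the exact flags and accepted crushmap formats depend on the python-crush version installed, and I am assuming the default rule name here:

    ceph osd getcrushmap -o map.bin
    crush analyze --crushmap map.bin --rule replicated_ruleset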
Cheers
>
> If your cluster had 4 or more nodes, you'd see what you expected.
> And most likely wouldn't be happy about the performance with your 8TB HDDs
> seeing 4 times more I/Os than the 2TB ones and thus becoming the
> bottleneck of your cluster.
>
> Christian
>
>> node01 : 3x8TB
>> node02 : 3x2TB
>> node03 : 3x3TB
>>
>> I thought that the weight determines the amount of data that every osd receives.
>> In this case, for example, the node with the 8TB disks should receive more
>> than the rest, right? Instead, all of them receive the same amount of data, and the
>> smallest disks (2TB) reach 100% before the bigger ones. Am I doing
>> something wrong?
>>
>> The cluster is jewel LTS 10.2.7.
>>
>> # ceph osd df
>> ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
>> 0 7.27060 1.00000 7445G 1012G 6432G 13.60 0.57 133
>> 3 7.27060 1.00000 7445G 1081G 6363G 14.52 0.61 163
>> 4 7.27060 1.00000 7445G 787G 6657G 10.58 0.44 120
>> 1 1.81310 1.00000 1856G 1047G 809G 56.41 2.37 143
>> 5 1.81310 1.00000 1856G 956G 899G 51.53 2.16 143
>> 6 1.81310 1.00000 1856G 877G 979G 47.24 1.98 130
>> 2 2.72229 1.00000 2787G 1010G 1776G 36.25 1.52 140
>> 7 2.72229 1.00000 2787G 831G 1955G 29.83 1.25 130
>> 8 2.72229 1.00000 2787G 1038G 1748G 37.27 1.56 146
>> TOTAL 36267G 8643G 27624G 23.83
>> MIN/MAX VAR: 0.44/2.37 STDDEV: 18.60
>> #
>>
>> # ceph osd tree
>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 35.41795 root default
>> -2 21.81180 host node01
>> 0 7.27060 osd.0 up 1.00000 1.00000
>> 3 7.27060 osd.3 up 1.00000 1.00000
>> 4 7.27060 osd.4 up 1.00000 1.00000
>> -3 5.43929 host node02
>> 1 1.81310 osd.1 up 1.00000 1.00000
>> 5 1.81310 osd.5 up 1.00000 1.00000
>> 6 1.81310 osd.6 up 1.00000 1.00000
>> -4 8.16687 host node03
>> 2 2.72229 osd.2 up 1.00000 1.00000
>> 7 2.72229 osd.7 up 1.00000 1.00000
>> 8 2.72229 osd.8 up 1.00000 1.00000
>> #
>>
>> # ceph -s
>> cluster 49ba9695-7199-4c21-9199-ac321e60065e
>> health HEALTH_OK
>> monmap e1: 3 mons at
>> {ceph-mon01=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon02=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon03=[x:x:x:x:x:x:x:x]:6789/0}
>> election epoch 48, quorum 0,1,2 ceph-mon01,ceph-mon03,ceph-mon02
>> osdmap e265: 9 osds: 9 up, 9 in
>> flags sortbitwise,require_jewel_osds
>> pgmap v95701: 416 pgs, 11 pools, 2879 GB data, 729 kobjects
>> 8643 GB used, 27624 GB / 36267 GB avail
>> 416 active+clean
>> #
>>
>> # ceph osd pool ls
>> .rgw.root
>> default.rgw.control
>> default.rgw.data.root
>> default.rgw.gc
>> default.rgw.log
>> default.rgw.users.uid
>> default.rgw.users.keys
>> default.rgw.buckets.index
>> default.rgw.buckets.non-ec
>> default.rgw.buckets.data
>> default.rgw.users.email
>> #
>>
>> # ceph df
>> GLOBAL:
>> SIZE AVAIL RAW USED %RAW USED
>> 36267G 27624G 8643G 23.83
>> POOLS:
>> NAME                       ID USED  %USED MAX AVAIL OBJECTS
>> .rgw.root                  1  1588  0     5269G     4
>> default.rgw.control        2  0     0     5269G     8
>> default.rgw.data.root      3  8761  0     5269G     28
>> default.rgw.gc             4  0     0     5269G     32
>> default.rgw.log            5  0     0     5269G     127
>> default.rgw.users.uid      6  4887  0     5269G     28
>> default.rgw.users.keys     7  144   0     5269G     16
>> default.rgw.buckets.index  9  0     0     5269G     14
>> default.rgw.buckets.non-ec 10 0     0     5269G     3
>> default.rgw.buckets.data   11 2879G 35.34 5269G     746848
>> default.rgw.users.email    12 13    0     5269G     1
>> #
>>
>
>
--
Loïc Dachary, Artisan Logiciel Libre
Félix Barbeira.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com