Hi,
Thanks to your answers I now understand this part of Ceph better. I made the change to the crushmap that Maxime suggested, and now the results are what I expected from the beginning:
# ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
0 7.27100 1.00000 7445G 1830G 5614G 24.59 0.98 238
3 7.27100 1.00000 7445G 1700G 5744G 22.84 0.91 229
4 7.27100 1.00000 7445G 1731G 5713G 23.26 0.93 233
1 1.81299 1.00000 1856G 661G 1195G 35.63 1.43 87
5 1.81299 1.00000 1856G 544G 1311G 29.34 1.17 73
6 1.81299 1.00000 1856G 519G 1337G 27.98 1.12 71
2 2.72198 1.00000 2787G 766G 2021G 27.50 1.10 116
7 2.72198 1.00000 2787G 651G 2136G 23.36 0.93 103
8 2.72198 1.00000 2787G 661G 2126G 23.72 0.95 98
TOTAL 36267G 9067G 27200G 25.00
MIN/MAX VAR: 0.91/1.43 STDDEV: 4.20
#
I understand that the Ceph default of "type host" is safer than "type osd", but as I said before, this cluster is for testing purposes only.
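For reference, this is roughly what the relevant rule looks like in the decompiled map after running the sed; I am assuming the default jewel rule name (replicated_ruleset) here, which may differ on other clusters. Only the chooseleaf line changed from "type host" to "type osd":

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            # was: step chooseleaf firstn 0 type host
            step chooseleaf firstn 0 type osd
            step emit
    }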
Thanks for all your answers :)
2017-06-06 9:20 GMT+02:00 Maxime Guyot <maxime@xxxxxxxxxxx>:
Hi Félix,

Changing the failure domain to OSD is probably the easiest option if this is a test cluster. I think the commands would go like:

- ceph osd getcrushmap -o map.bin
- crushtool -d map.bin -o map.txt
- sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 0 type osd/' map.txt
- crushtool -c map.txt -o map.bin
- ceph osd setcrushmap -i map.bin

Moving HDDs around so that each server holds ~8TB would be a good option if this is a capacity-focused use case. It will allow you to reboot 1 server at a time without radosgw downtime. You would target 26/3 = 8.66TB per node, so:

- node1: 1x8TB
- node2: 1x8TB + 1x2TB
- node3: 2x6TB + 1x2TB

If you are more concerned about performance, then set the weights to 1 on all HDDs and forget about the wasted capacity.

Cheers,
Maxime

On Tue, 6 Jun 2017 at 00:44 Christian Wuerdig <christian.wuerdig@xxxxxxxxx> wrote:

Yet another option is to change the failure domain to OSD instead of host (this avoids having to move disks around and will probably meet your initial expectations).
It means your cluster will become unavailable when you lose a host until you fix it, though. OTOH you probably don't have too much leeway anyway with just 3 hosts, so it might be an acceptable trade-off. It also means you can just add new OSDs to the servers wherever they fit.

On Tue, Jun 6, 2017 at 1:51 AM, David Turner <drakonstein@xxxxxxxxx> wrote:

If you want to resolve your issue without purchasing another node, you should move one disk of each size into each server. This process will be quite painful, as you'll need to actually move the disks in the crush map to be under a different host and then all of your data will move around, but then the cluster will be able to use the weights to distribute the data between the 2TB, 3TB, and 8TB drives much more evenly.

On Mon, Jun 5, 2017 at 9:21 AM Loic Dachary <loic@xxxxxxxxxxx> wrote:
On 06/05/2017 02:48 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 5 Jun 2017 13:54:02 +0200 Félix Barbeira wrote:
>
>> Hi,
>>
>> We have a small cluster for radosgw use only. It has three nodes, witch 3
> ^^^^^ ^^^^^
>> osds each. Each node has different disk sizes:
>>
>
> There's your answer, staring you right in the face.
>
> Your default replication size is 3, your default failure domain is host.
>
> Ceph can not distribute data according to the weight, since it needs to be
> on a different node (one replica per node) to comply with the replica size.
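A quick way to double check both of these settings is something like the following (the pool name is taken from your pool listing below, and the exact output will of course differ on your cluster):

    ceph osd pool get default.rgw.buckets.data size    # replica count of the data pool
    ceph osd crush rule dump                            # look for the "chooseleaf ... type host" step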
Another way to look at it is to imagine a situation where 10TB worth of data
is stored on node01, which has 3x8TB = 24TB. Since you asked for 3 replicas, this
data must be replicated to node02, but... there is only 3x2TB = 6TB available there.
So the maximum you can store is 6TB, and the remaining disk space on node01 and node03
will never be used.
python-crush analyze will display a message about that situation and show which buckets
are overweighted.
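For example, something along these lines should do it; the exact flags and accepted crushmap formats depend on the python-crush version installed, and I am assuming the default rule name here:

    ceph osd getcrushmap -o map.bin
    crush analyze --crushmap map.bin --rule replicated_ruleset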
Cheers
>
> If your cluster had 4 or more nodes, you'd see what you expected.
> And most likely wouldn't be happy about the performance with your 8TB HDDs
> seeing 4 times more I/Os than the 2TB ones and thus becoming the
> bottleneck of your cluster.
>
> Christian
>
>> node01 : 3x8TB
>> node02 : 3x2TB
>> node03 : 3x3TB
>>
>> I thought that the weight determines the amount of data that every osd receives.
>> In this case, for example, the node with the 8TB disks should receive more
>> than the rest, right? Instead, all of them receive the same amount of data, and the
>> smallest disks (2TB) reach 100% before the bigger ones. Am I doing
>> something wrong?
>>
>> The cluster is jewel LTS 10.2.7.
>>
>> # ceph osd df
>> ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
>> 0 7.27060 1.00000 7445G 1012G 6432G 13.60 0.57 133
>> 3 7.27060 1.00000 7445G 1081G 6363G 14.52 0.61 163
>> 4 7.27060 1.00000 7445G 787G 6657G 10.58 0.44 120
>> 1 1.81310 1.00000 1856G 1047G 809G 56.41 2.37 143
>> 5 1.81310 1.00000 1856G 956G 899G 51.53 2.16 143
>> 6 1.81310 1.00000 1856G 877G 979G 47.24 1.98 130
>> 2 2.72229 1.00000 2787G 1010G 1776G 36.25 1.52 140
>> 7 2.72229 1.00000 2787G 831G 1955G 29.83 1.25 130
>> 8 2.72229 1.00000 2787G 1038G 1748G 37.27 1.56 146
>> TOTAL 36267G 8643G 27624G 23.83
>> MIN/MAX VAR: 0.44/2.37 STDDEV: 18.60
>> #
>>
>> # ceph osd tree
>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 35.41795 root default
>> -2 21.81180 host node01
>> 0 7.27060 osd.0 up 1.00000 1.00000
>> 3 7.27060 osd.3 up 1.00000 1.00000
>> 4 7.27060 osd.4 up 1.00000 1.00000
>> -3 5.43929 host node02
>> 1 1.81310 osd.1 up 1.00000 1.00000
>> 5 1.81310 osd.5 up 1.00000 1.00000
>> 6 1.81310 osd.6 up 1.00000 1.00000
>> -4 8.16687 host node03
>> 2 2.72229 osd.2 up 1.00000 1.00000
>> 7 2.72229 osd.7 up 1.00000 1.00000
>> 8 2.72229 osd.8 up 1.00000 1.00000
>> #
>>
>> # ceph -s
>> cluster 49ba9695-7199-4c21-9199-ac321e60065e
>> health HEALTH_OK
>> monmap e1: 3 mons at
>> {ceph-mon01=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon02=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon03=[x:x:x:x:x:x:x:x]:6789/0}
>> election epoch 48, quorum 0,1,2 ceph-mon01,ceph-mon03,ceph-mon02
>> osdmap e265: 9 osds: 9 up, 9 in
>> flags sortbitwise,require_jewel_osds
>> pgmap v95701: 416 pgs, 11 pools, 2879 GB data, 729 kobjects
>> 8643 GB used, 27624 GB / 36267 GB avail
>> 416 active+clean
>> #
>>
>> # ceph osd pool ls
>> .rgw.root
>> default.rgw.control
>> default.rgw.data.root
>> default.rgw.gc
>> default.rgw.log
>> default.rgw.users.uid
>> default.rgw.users.keys
>> default.rgw.buckets.index
>> default.rgw.buckets.non-ec
>> default.rgw.buckets.data
>> default.rgw.users.email
>> #
>>
>> # ceph df
>> GLOBAL:
>> SIZE AVAIL RAW USED %RAW USED
>> 36267G 27624G 8643G 23.83
>> POOLS:
>> NAME                       ID USED  %USED MAX AVAIL OBJECTS
>> .rgw.root                  1  1588  0     5269G     4
>> default.rgw.control        2  0     0     5269G     8
>> default.rgw.data.root      3  8761  0     5269G     28
>> default.rgw.gc             4  0     0     5269G     32
>> default.rgw.log            5  0     0     5269G     127
>> default.rgw.users.uid      6  4887  0     5269G     28
>> default.rgw.users.keys     7  144   0     5269G     16
>> default.rgw.buckets.index  9  0     0     5269G     14
>> default.rgw.buckets.non-ec 10 0     0     5269G     3
>> default.rgw.buckets.data   11 2879G 35.34 5269G     746848
>> default.rgw.users.email    12 13    0     5269G     1
>> #
>>
>
>
--
Loïc Dachary, Artisan Logiciel Libre
Félix Barbeira.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com