Re: read performance, separate client CRUSH maps or limit osd read access from each client

All 3 clients, each in a different building, pick the same primary (the OSD with the highest primary-affinity), and everything is slow on at least two of them.
Instead of just reading from their local OSD, they read mostly from that primary.

What I need is something like primary-affinity per client connection.

ID  CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
 -1       0.08189 root default                         
 -3       0.02730     host vm1                         
  0   hdd 0.02730         osd.0     up  1.00000 1.00000
-10       0.02730     host vm2                         
  1   hdd 0.02730         osd.1     up  1.00000 0.50000
 -5       0.02730     host vm3                         
  2   hdd 0.02730         osd.2     up  1.00000 0.50000
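
One way to confirm which OSD the clients are reading from is to look at the acting set reported by `ceph pg map`: for a replicated pool the primary is normally the first entry. The sketch below only parses a line in the format quoted further down in this thread; `pg_primary` is a hypothetical helper name, not a ceph command.

```shell
# Hypothetical helper: print the primary OSD id from one line of
# `ceph pg map` output, for example:
#   osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]
# It takes the first id inside the "acting [...]" list.
pg_primary() {
    sed -n 's/.*acting \[\([0-9][0-9]*\).*/\1/p'
}

# Sample line copied from the thread; on a live cluster you would pipe
# the output of `ceph pg map <pgid>` into pg_primary instead.
echo 'osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]' | pg_primary
```

Running this over a handful of PGs from each client's pool shows whether the primaries really sit in one building.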

v

On Tue, Nov 13, 2018 at 4:25 PM Jean-Charles Lopez <jelopez@xxxxxxxxxx> wrote:
Hi Vlad,

No need for a specific CRUSH map configuration. I’d suggest you use the primary-affinity setting on the OSDs so that only the OSDs that are close to your read point are selected as primary.

See https://ceph.com/geen-categorie/ceph-primary-affinity/ for information

Just set the primary affinity of all the OSDs in building 2 to 0.

Only the OSDs in building 1 should then be used as primary OSDs.
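
A minimal sketch of what that looks like on the command line, assuming osd.1 and osd.2 are the OSDs in building 2 (those ids are an assumption for illustration; substitute your own). The `RUN` variable and the `demote_osds` function are hypothetical conveniences: by default the commands are only printed, not executed. The setting is cluster-wide, so it affects every client, not just one.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Run with RUN set to an empty string (RUN= sh script.sh) to execute them.
RUN="${RUN-echo}"

# Set primary affinity to 0 for each listed OSD id, so that OSD is not
# chosen as primary while another replica is available.
demote_osds() {
    for id in "$@"; do
        $RUN ceph osd primary-affinity "osd.$id" 0
    done
}

# Assumed for illustration: osd.1 and osd.2 live in building 2.
demote_osds 1 2
```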

BR
JC

On Nov 13, 2018, at 12:19, Vlad Kopylov <vladkopy@xxxxxxxxx> wrote:

Or is it possible to mount one OSD directly for read file access?

v

On Sun, Nov 11, 2018 at 1:47 PM Vlad Kopylov <vladkopy@xxxxxxxxx> wrote:
Maybe it is possible if done via an NFS gateway export?
Do the gateway settings allow selecting which OSD to read from?

v

On Sun, Nov 11, 2018 at 1:01 AM Martin Verges <martin.verges@xxxxxxxx> wrote:
Hello Vlad,

If you want to read from the same data, then it is not possible (as far as I know).

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

On Sat, Nov 10, 2018 at 3:47 AM Vlad Kopylov <vladkopy@xxxxxxxxx> wrote:
Maybe I missed something, but the FS explicitly selects the pools that hold its files and metadata, as I did below.
So if I create new pools, the data in them will be different. If I apply the rule dc1_primary to the cfs_data pool, and a client from dc3 connects to fs t01, it will start using dc1 hosts.


ceph osd pool create cfs_data 100
ceph osd pool create cfs_meta 100
ceph fs new t01 cfs_data cfs_meta
sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o name=admin,secretfile=/home/mciadmin/admin.secret

rule dc1_primary {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take dc1
        step chooseleaf firstn 1 type host
        step emit
        step take dc2
        step chooseleaf firstn -2 type host
        step emit
        step take dc3
        step chooseleaf firstn -2 type host
        step emit
}
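
Before applying a rule like this to a pool, its mappings can be checked offline with `crushtool --test`. The sketch below assumes the edited map has already been compiled to a file named `crush.new` (a made-up name), uses rule id 1 from the rule above, and assumes 3 replicas. `RUN` and `test_rule` are hypothetical; by default the command is only printed.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Run with RUN set to an empty string (RUN= sh script.sh) to execute it.
RUN="${RUN-echo}"

# Hypothetical helper: show which OSDs a rule maps to, without touching
# the cluster.
#   $1 = compiled CRUSH map file
#   $2 = rule id
#   $3 = replica count
test_rule() {
    $RUN crushtool -i "$1" --test --rule "$2" --num-rep "$3" --show-mappings
}

# Assumed file name and replica count; rule id 1 is dc1_primary above.
test_rule crush.new 1 3
```

Each mapping lists the OSDs in the order the rule chose them, so the first entry should be an OSD under dc1 if the rule behaves as intended.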

On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov <vladkopy@xxxxxxxxx> wrote:
Just to confirm - it will still populate 3 copies in each datacenter?
I thought this map was to select where to write to; I guess it does write replication on the back end.

I thought pools are completely separate and clients would not see each other's data?

Thank you Martin!




On Fri, Nov 9, 2018 at 2:10 PM Martin Verges <martin.verges@xxxxxxxx> wrote:
Hello Vlad,

You can create rules like these:

rule dc1_primary_dc2_secondary {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take dc1
        step chooseleaf firstn 1 type host
        step emit
        step take dc2
        step chooseleaf firstn 1 type host
        step emit
        step take dc3
        step chooseleaf firstn -2 type host
        step emit
}

rule dc2_primary_dc1_secondary {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take dc2
        step chooseleaf firstn 1 type host
        step emit
        step take dc1
        step chooseleaf firstn 1 type host
        step emit
        step take dc3
        step chooseleaf firstn -2 type host
        step emit
}

After you have added such CRUSH rules, you can configure the pools:

~ $ ceph osd pool set <pool_for_dc1> crush_rule dc1_primary_dc2_secondary
~ $ ceph osd pool set <pool_for_dc2> crush_rule dc2_primary_dc1_secondary
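
For getting the rules into the cluster in the first place, the usual cycle is to export the CRUSH map, decompile it, edit the text, recompile and inject it. A minimal sketch follows; the file names `crush.bin`, `crush.txt` and `crush.new` are assumptions, and `RUN`, `crush_export` and `crush_import` are hypothetical helpers. By default the commands are only printed.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Run with RUN set to an empty string to execute them; in that case call
# the two functions separately and edit crush.txt in between.
RUN="${RUN-echo}"

# Step 1: export the compiled CRUSH map and decompile it to text.
crush_export() {
    $RUN ceph osd getcrushmap -o crush.bin
    $RUN crushtool -d crush.bin -o crush.txt
}

# Step 2 (after editing crush.txt to add the rules): recompile the text
# map and inject it into the cluster.
crush_import() {
    $RUN crushtool -c crush.txt -o crush.new
    $RUN ceph osd setcrushmap -i crush.new
}

crush_export
crush_import
```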

Now you place your workload from dc1 in the dc1 pool, and the workload
from dc2 in the dc2 pool. You could also use HDDs with an SSD journal (if
your workload isn't that write-intensive) and save some money in dc3,
as your client would always read from an SSD and write to hybrid storage.

Btw. all this could be done with a few simple clicks through our web
frontend. Even if you want to export it via CephFS / NFS / .. it is
possible to set it on a per folder level. Feel free to take a look at
https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it could
be.

--
Martin Verges


2018-11-09 17:35 GMT+01:00 Vlad Kopylov <vladkopy@xxxxxxxxx>:
> Please disregard the pg status; one of the test VMs was down for some time
> and it is healing.
> The only question is how to make it read from the proper datacenter.
>
> If you have an example.
>
> Thanks
>
>
> On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov <vladkopy@xxxxxxxxx> wrote:
>>
>> Martin, thank you for the tip.
>> Googling ceph crush rule examples doesn't give much on rules, just static
>> placement of buckets.
>> This all seems to be for placing data, not for giving a client in a
>> specific datacenter the proper read osd.
>>
>> Maybe something is wrong with the placement groups?
>>
>> I added datacenters dc1, dc2 and dc3.
>> The current replicated_rule is:
>>
>> rule replicated_rule {
>>         id 0
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take default
>>         step chooseleaf firstn 0 type host
>>         step emit
>> }
>>
>> # buckets
>> host ceph1 {
>>         id -3 # do not change unnecessarily
>>         id -2 class ssd # do not change unnecessarily
>>         # weight 1.000
>>         alg straw2
>>         hash 0 # rjenkins1
>>         item osd.0 weight 1.000
>> }
>> datacenter dc1 {
>>         id -9 # do not change unnecessarily
>>         id -4 class ssd # do not change unnecessarily
>>         # weight 1.000
>>         alg straw2
>>         hash 0 # rjenkins1
>>         item ceph1 weight 1.000
>> }
>> host ceph2 {
>>         id -5 # do not change unnecessarily
>>         id -6 class ssd # do not change unnecessarily
>>         # weight 1.000
>>         alg straw2
>>         hash 0 # rjenkins1
>>         item osd.1 weight 1.000
>> }
>> datacenter dc2 {
>>         id -10 # do not change unnecessarily
>>         id -8 class ssd # do not change unnecessarily
>>         # weight 1.000
>>         alg straw2
>>         hash 0 # rjenkins1
>>         item ceph2 weight 1.000
>> }
>> host ceph3 {
>>         id -7 # do not change unnecessarily
>>         id -12 class ssd # do not change unnecessarily
>>         # weight 1.000
>>         alg straw2
>>         hash 0 # rjenkins1
>>         item osd.2 weight 1.000
>> }
>> datacenter dc3 {
>>         id -11 # do not change unnecessarily
>>         id -13 class ssd # do not change unnecessarily
>>         # weight 1.000
>>         alg straw2
>>         hash 0 # rjenkins1
>>         item ceph3 weight 1.000
>> }
>> root default {
>>         id -1 # do not change unnecessarily
>>         id -14 class ssd # do not change unnecessarily
>>         # weight 3.000
>>         alg straw2
>>         hash 0 # rjenkins1
>>         item dc1 weight 1.000
>>         item dc2 weight 1.000
>>         item dc3 weight 1.000
>> }
>>
>>
>> #ceph pg dump
>> dumped all
>> version 29433
>> stamp 2018-11-09 11:23:44.510872
>> last_osdmap_epoch 0
>> last_pg_scan 0
>> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES    LOG DISK_LOG STATE                      STATE_STAMP                VERSION REPORTED UP      UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN
>> 1.5f    0       0                  0        0         0       0        0   0        active+clean               2018-11-09 04:35:32.320607 0'0     544:1317 [0,2,1] 0          [0,2,1] 0              0'0        2018-11-09 04:35:32.320561 0'0             2018-11-04 11:55:54.756115 0
>> 2.5c    143     0                  143      0         0       19490267 461 461      active+undersized+degraded 2018-11-08 19:02:03.873218 508'461 544:2100 [2,1]   2          [2,1]   2              290'380    2018-11-07 18:58:43.043719 64'120          2018-11-05 14:21:49.256324 0
>> .....
>> sum 15239 0 2053 2659 0 2157615019 58286 58286
>> OSD_STAT USED    AVAIL  TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM
>> 2        3.7 GiB 28 GiB 32 GiB    [0,1]    200             73
>> 1        3.7 GiB 28 GiB 32 GiB    [0,2]    200             58
>> 0        3.7 GiB 28 GiB 32 GiB    [1,2]    173             69
>> sum       11 GiB 85 GiB 96 GiB
>>
>> #ceph pg map 2.5c
>> osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]
>>
>> #ceph pg map 1.5f
>> osdmap e547 pg 1.5f (1.5f) -> up [0,2,1] acting [0,2,1]
>>
>>
>> On Fri, Nov 9, 2018 at 2:21 AM Martin Verges <martin.verges@xxxxxxxx>
>> wrote:
>>>
>>> Hello Vlad,
>>>
>>> Ceph clients connect to the primary OSD of each PG. If you create a
>>> crush rule for building1 and one for building2 that takes an OSD from
>>> the same building as the first one, your reads to the pool will always
>>> be in the same building (if the cluster is healthy) and only write
>>> requests get replicated to the other building.
>>>
>>> --
>>> Martin Verges
>>>
>>>
>>> 2018-11-09 4:54 GMT+01:00 Vlad Kopylov <vladkopy@xxxxxxxxx>:
>>> > I am trying to test replicated ceph with servers in different
>>> > buildings, and I have a read problem.
>>> > Reads from one building go to an osd in another building and vice versa,
>>> > making reads slower than writes! Reads become as slow as the slowest node.
>>> >
>>> > Is there a way to
>>> > - disable parallel read (so it reads only from the same osd node where
>>> >   the mon is);
>>> > - or give each client a read restriction per osd;
>>> > - or maybe strictly specify the read osd on mount;
>>> > - or have a node read delay cap (for example, if the node timeout is
>>> >   larger than 2 ms, do not use that node for reads, as other replicas
>>> >   are available);
>>> > - or place clients on the CRUSH map, so it understands that an osd in
>>> >   the same data-center as the client has preference, and pulls data
>>> >   from it/them?
>>> >
>>> > Mounting with the kernel client, latest Mimic.
>>> >
>>> > Thank you!
>>> >
>>> > Vlad
>>> >
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > ceph-users@xxxxxxxxxxxxxx
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >