Re: PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

Oh, you mean monitor quorum is enforced? I never really considered that. However, I think I found another solution:

I created a second tree called "ldc" and, under it, made three "logical datacenters" (still waiting for a better name). I grouped the servers so that each logical datacenter contains three servers: one NVMe/SSD server and two HDD servers, each picked from a different physical datacenter. I could then rewrite my hybrid rule to simply select one logical datacenter and then three hostgroups from it. I also made a new bucket type called "hostgroup" that the physical hosts sit under, so it is easy to add more servers in the future (just add them to the correct hostgroup).
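In case anyone wants to do something similar, this is roughly the usual decompile/edit/recompile workflow for making this kind of change (file names are just placeholders):

# grab and decompile the current crushmap
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt: add the "hostgroup" type, the hg* buckets,
# the ldc* datacenters, the "ldc" root and the hybrid rule

# recompile and inject the new map
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new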

It should work; I will test it fully this coming week.
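For the offline part of the testing, something like this with crushtool should do as a sanity check (rule id 1 is the hybrid rule in the map below; the PG range is just an example):

# map 1024 sample PGs with 3 replicas through rule 1 (hybrid)
crushtool -i crushmap.new --test --rule 1 --num-rep 3 --min-x 0 --max-x 1023 --show-mappings

# any inputs that could not get 3 OSDs show up here
crushtool -i crushmap.new --test --rule 1 --num-rep 3 --show-bad-mappings

# rough distribution of the sample PGs over the OSDs
crushtool -i crushmap.new --test --rule 1 --num-rep 3 --show-utilization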

The complete crushmap is below. The buckets and rules for the other two, more conventional, pools are the same as before; the interesting part starts about halfway down (the hostgroup buckets, the ldc tree and the hybrid rule).

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class hdd
device 34 osd.34 class hdd
device 35 osd.35 class hdd

# types
type 0 osd
type 1 host
type 2 hostgroup
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host storage11 {
        id -5           # do not change unnecessarily
        id -6 class nvme                # do not change unnecessarily
        id -10 class hdd                # do not change unnecessarily
        # weight 2.913
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 0.729
        item osd.3 weight 0.728
        item osd.6 weight 0.728
        item osd.9 weight 0.728
}
host storage21 {
        id -13          # do not change unnecessarily
        id -14 class nvme               # do not change unnecessarily
        id -15 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item osd.12 weight 5.458
        item osd.13 weight 5.458
        item osd.14 weight 5.458
        item osd.15 weight 5.458
        item osd.16 weight 5.458
        item osd.17 weight 5.458
        item osd.18 weight 5.458
        item osd.19 weight 5.458
        item osd.20 weight 5.458
        item osd.21 weight 5.458
        item osd.22 weight 5.458
        item osd.23 weight 5.458
}
datacenter HORN79 {
        id -19          # do not change unnecessarily
        id -26 class nvme               # do not change unnecessarily
        id -27 class hdd                # do not change unnecessarily
        # weight 68.406
        alg straw2
        hash 0  # rjenkins1
        item storage11 weight 2.911
        item storage21 weight 65.495
}
host storage13 {
        id -7           # do not change unnecessarily
        id -8 class nvme                # do not change unnecessarily
        id -11 class hdd                # do not change unnecessarily
        # weight 2.912
        alg straw2
        hash 0  # rjenkins1
        item osd.2 weight 0.728
        item osd.5 weight 0.728
        item osd.8 weight 0.728
        item osd.11 weight 0.728
}
host storage23 {
        id -16          # do not change unnecessarily
        id -17 class nvme               # do not change unnecessarily
        id -18 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item osd.24 weight 5.458
        item osd.25 weight 5.458
        item osd.26 weight 5.458
        item osd.27 weight 5.458
        item osd.28 weight 5.458
        item osd.29 weight 5.458
        item osd.30 weight 5.458
        item osd.31 weight 5.458
        item osd.32 weight 5.458
        item osd.33 weight 5.458
        item osd.34 weight 5.458
        item osd.35 weight 5.458
}
datacenter WAR {
        id -20          # do not change unnecessarily
        id -24 class nvme               # do not change unnecessarily
        id -25 class hdd                # do not change unnecessarily
        # weight 68.406
        alg straw2
        hash 0  # rjenkins1
        item storage13 weight 2.911
        item storage23 weight 65.495
}
host storage12 {
        id -3           # do not change unnecessarily
        id -4 class nvme                # do not change unnecessarily
        id -9 class hdd         # do not change unnecessarily
        # weight 2.912
        alg straw2
        hash 0  # rjenkins1
        item osd.1 weight 0.728
        item osd.4 weight 0.728
        item osd.7 weight 0.728
        item osd.10 weight 0.728
}
datacenter TEG4 {
        id -21          # do not change unnecessarily
        id -22 class nvme               # do not change unnecessarily
        id -23 class hdd                # do not change unnecessarily
        # weight 2.911
        alg straw2
        hash 0  # rjenkins1
        item storage12 weight 2.911
}
root default {
        id -1           # do not change unnecessarily
        id -2 class nvme                # do not change unnecessarily
        id -12 class hdd                # do not change unnecessarily
        # weight 139.721
        alg straw2
        hash 0  # rjenkins1
        item HORN79 weight 68.405
        item WAR weight 68.405
        item TEG4 weight 2.911
}
hostgroup hg1-1 {
        id -30          # do not change unnecessarily
        id -28 class nvme               # do not change unnecessarily
        id -54 class hdd                # do not change unnecessarily
        # weight 2.913
        alg straw2
        hash 0  # rjenkins1
        item storage11 weight 2.913
}
hostgroup hg1-2 {
        id -31          # do not change unnecessarily
        id -29 class nvme               # do not change unnecessarily
        id -55 class hdd                # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}
hostgroup hg1-3 {
        id -32          # do not change unnecessarily
        id -43 class nvme               # do not change unnecessarily
        id -56 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item storage23 weight 65.496
}
hostgroup hg2-1 {
        id -33          # do not change unnecessarily
        id -45 class nvme               # do not change unnecessarily
        id -58 class hdd                # do not change unnecessarily
        # weight 2.912
        alg straw2
        hash 0  # rjenkins1
        item storage12 weight 2.912
}
hostgroup hg2-2 {
        id -34          # do not change unnecessarily
        id -46 class nvme               # do not change unnecessarily
        id -59 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item storage21 weight 65.496
}
hostgroup hg2-3 {
        id -35          # do not change unnecessarily
        id -47 class nvme               # do not change unnecessarily
        id -60 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item storage23 weight 65.496
}
hostgroup hg3-1 {
        id -36          # do not change unnecessarily
        id -49 class nvme               # do not change unnecessarily
        id -62 class hdd                # do not change unnecessarily
        # weight 2.912
        alg straw2
        hash 0  # rjenkins1
        item storage13 weight 2.912
}
hostgroup hg3-2 {
        id -37          # do not change unnecessarily
        id -50 class nvme               # do not change unnecessarily
        id -63 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item storage21 weight 65.496
}
hostgroup hg3-3 {
        id -38          # do not change unnecessarily
        id -51 class nvme               # do not change unnecessarily
        id -64 class hdd                # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}
datacenter ldc1 {
        id -39          # do not change unnecessarily
        id -44 class nvme               # do not change unnecessarily
        id -57 class hdd                # do not change unnecessarily
        # weight 68.409
        alg straw2
        hash 0  # rjenkins1
        item hg1-1 weight 2.913
        item hg1-2 weight 0.000
        item hg1-3 weight 65.496
}
datacenter ldc2 {
        id -40          # do not change unnecessarily
        id -48 class nvme               # do not change unnecessarily
        id -61 class hdd                # do not change unnecessarily
        # weight 133.904
        alg straw2
        hash 0  # rjenkins1
        item hg2-1 weight 2.912
        item hg2-2 weight 65.496
        item hg2-3 weight 65.496
}
datacenter ldc3 {
        id -41          # do not change unnecessarily
        id -52 class nvme               # do not change unnecessarily
        id -65 class hdd                # do not change unnecessarily
        # weight 68.408
        alg straw2
        hash 0  # rjenkins1
        item hg3-1 weight 2.912
        item hg3-2 weight 65.496
        item hg3-3 weight 0.000
}
root ldc {
        id -42          # do not change unnecessarily
        id -53 class nvme               # do not change unnecessarily
        id -66 class hdd                # do not change unnecessarily
        # weight 270.721
        alg straw2
        hash 0  # rjenkins1
        item ldc1 weight 68.409
        item ldc2 weight 133.904
        item ldc3 weight 68.408
}

# rules
rule hybrid {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take ldc
        step choose firstn 1 type datacenter
        step chooseleaf firstn 0 type hostgroup
        step emit
}
rule hdd {
        id 2
        type replicated
        min_size 1
        max_size 3
        step take default class hdd
        step chooseleaf firstn 0 type datacenter
        step emit
}
rule nvme {
        id 3
        type replicated
        min_size 1
        max_size 3
        step take default class nvme
        step chooseleaf firstn 0 type datacenter
        step emit
}

# end crush map
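If anyone wants to try the same thing, a pool pointed at the hybrid rule would be created something like this (pool name and PG counts are just placeholders):

# create a replicated pool that uses the "hybrid" rule
ceph osd pool create hybridpool 256 256 replicated hybrid
ceph osd pool set hybridpool size 3
ceph osd pool set hybridpool min_size 2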





On 10/8/2017 3:22 PM, David Turner wrote:

That's correct. It doesn't matter how many copies of the data you have in each datacenter. The mons control the maps and you should be good as long as you have 1 mon per DC. You should test this to see how the recovery goes, but there shouldn't be a problem.


On Sat, Oct 7, 2017, 6:10 PM Дробышевский, Владимир <vlad@xxxxxxxxxx> wrote:
2017-10-08 2:02 GMT+05:00 Peter Linder <peter.linder@xxxxxxxxxxxxxx>:

Then, I believe, the next best configuration would be to set size for this pool to 4.  It would choose an NVMe as the primary OSD, and then choose an HDD from each DC for the secondary copies.  This will guarantee that a copy of the data goes into each DC and you will have 2 copies in other DCs away from the primary NVMe copy.  It wastes a copy of all of the data in the pool, but that's on the much cheaper HDD storage and can probably be considered acceptable losses for the sake of having the primary OSD on NVMe drives.
I have considered this, and it should of course work while everything is healthy, so to speak, but what if one datacenter is isolated while running? We would be left with two running copies on each side for all PGs, with no way of knowing what gets written where. In the end, data would be destroyed due to the split brain. Even being able to enforce quorum on the side where the SSD is would mean a single point of failure.
If you have one mon per DC, all operations in the isolated DC will be frozen, so I believe you would not lose data.
 



On Sat, Oct 7, 2017 at 3:36 PM Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote:
On 10/7/2017 8:08 PM, David Turner wrote:

Just to make sure you understand: reads will happen on the primary OSD for the PG, not on the nearest OSD, meaning that reads will go between the datacenters. Also, each write will not ack until all 3 copies are written, adding that latency to both writes and reads.



Yes, I understand this. It is actually fine; the datacenters have been selected so that they are about 10-20 km apart, which yields around a 0.1-0.2 ms round-trip time (the speed of light being too slow). Network latency shouldn't be a problem, and it is all dedicated 40G TRILL network for the moment.

I just want to be able to select 1 SSD and 2 HDDs, all spread out across different datacenters. I can do that, but one of the HDDs ends up in the same datacenter as the SSD, probably because I'm using the "take" command twice (which resets the bucket selection, so the second pass doesn't know what was already chosen?).



On Sat, Oct 7, 2017, 1:48 PM Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote:
On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
Hello!

2017-10-07 19:12 GMT+05:00 Peter Linder <peter.linder@xxxxxxxxxxxxxx>:

The idea is to select an nvme osd, and then select the rest from hdd osds in different datacenters (see crush map below for hierarchy).

It's a little aside from the question, but why do you want to mix SSDs and HDDs in the same pool? Do you have a read-intensive workload and plan to use primary-affinity to get all reads from the NVMe drives?
 

Yes, this is pretty much the idea: getting NVMe read performance while still maintaining triple redundancy at a reasonable cost.


--
Regards,
Vladimir






--
Regards,
Vladimir


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
