Re: PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

I was able to get this working with the CRUSH map in my last post! Together with the primary affinity change on the slow HDDs I now have the intended behavior. Very happy; performance is excellent.
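For reference, the primary affinity part is just the usual per-OSD setting; the OSD ids below are made-up examples, and on some releases the mons may first need mon_osd_allow_primary_affinity = true before the command is accepted:

    # keep the HDD OSDs from ever being chosen as primary,
    # so reads are served from the NVMe OSDs
    ceph osd primary-affinity osd.10 0
    ceph osd primary-affinity osd.11 0
    # the NVMe OSDs keep the default affinity of 1 and take the primary role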

One thing was a little weird though: I had to manually change the weight of each hostgroup so that they were in the same ballpark. If they were too far apart, Ceph couldn't properly allocate 3 buckets for each PG, and some PGs ended up in the "remapped" or "degraded" state.

When I changed the weights to similar values, the problem went away (the CRUSH rule selects 3 out of 3 hostgroups anyway, so weight shouldn't even be a consideration there).

Perhaps that is a bug?
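For what it's worth, the failed mappings are easy to reproduce offline with crushtool (the rule id and replica count below are placeholders for whatever the hybrid rule uses):

    # export the current CRUSH map and test the rule without touching the cluster
    ceph osd getcrushmap -o crushmap.bin
    # simulate placements; --show-bad-mappings lists every input that did not
    # get the full 3 OSDs, i.e. what showed up here as remapped/degraded PGs
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings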

/Peter

On 10/8/2017 3:22 PM, David Turner wrote:

That's correct. It doesn't matter how many copies of the data you have in each datacenter. The mons control the maps and you should be good as long as you have 1 mon per DC. You should test this to see how the recovery goes, but there shouldn't be a problem.


On Sat, Oct 7, 2017, 6:10 PM Дробышевский, Владимир <vlad@xxxxxxxxxx> wrote:
2017-10-08 2:02 GMT+05:00 Peter Linder <peter.linder@xxxxxxxxxxxxxx>:

Then, I believe, the next best configuration would be to set size for this pool to 4.  It would choose an NVMe as the primary OSD, and then choose an HDD from each DC for the secondary copies.  This will guarantee that a copy of the data goes into each DC and you will have 2 copies in other DCs away from the primary NVMe copy.  It wastes a copy of all of the data in the pool, but that's on the much cheaper HDD storage and can probably be considered acceptable losses for the sake of having the primary OSD on NVMe drives.
I have considered this, and it should of course work under normal conditions, so to speak. But what if one datacenter is isolated while running? We would be left with 2 running copies on each side for all PGs, with no way of knowing what gets written where. In the end, data would be destroyed due to the split brain. Even being able to enforce quorum on the side where the SSD is would mean a single point of failure.
If you have one mon per DC, all operations in the isolated DC will be frozen, so I believe you would not lose data.
 



On Sat, Oct 7, 2017 at 3:36 PM Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote:
On 10/7/2017 8:08 PM, David Turner wrote:

Just to make sure you understand: reads will happen on the primary OSD for the PG, not on the nearest OSD, meaning that reads will go between the datacenters. Also, each write will not ack until all 3 copies are written, adding that latency to both writes and reads.



Yes, I understand this. It is actually fine; the datacenters have been selected so that they are about 10-20 km apart. This yields around 0.1-0.2 ms of round-trip time due to the speed of light being too low. Nevertheless, network latency shouldn't be a problem, and it's all dedicated 40G TRILL network for the moment.
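(Rough sanity check on those numbers: light in fibre travels at roughly 200,000 km/s, i.e. about 200 km per ms, so a 10-20 km path gives 2 x 10-20 / 200 ≈ 0.1-0.2 ms of round-trip propagation delay before any switching overhead.)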

I just want to be able to select 1 SSD and 2 HDDs, all spread out. I can do that, but one of the HDDs ends up in the same datacenter as the SSD, probably because I'm using the "take" command twice (does it reset the already-selected buckets?).
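To make it concrete, the rule is structured roughly like this (bucket names and rule id are placeholders rather than my actual map):

    rule hybrid_nvme_hdd {
        id 1                # "ruleset 1" on older releases
        type replicated
        min_size 3
        max_size 3
        # pass 1: one leaf from the NVMe hierarchy, in some datacenter
        step take nvme
        step chooseleaf firstn 1 type datacenter
        step emit
        # pass 2: the remaining num_rep - 1 leaves from the HDD hierarchy.
        # Each take ... emit pass is evaluated independently, so this pass
        # does not know which datacenter pass 1 already used, which is why
        # one of the HDD copies can land in the same DC as the NVMe copy.
        step take hdd
        step chooseleaf firstn -1 type datacenter
        step emit
    }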



On Sat, Oct 7, 2017, 1:48 PM Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote:
On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
Hello!

2017-10-07 19:12 GMT+05:00 Peter Linder <peter.linder@xxxxxxxxxxxxxx>:

The idea is to select an nvme osd, and
then select the rest from hdd osds in different datacenters (see crush
map below for hierarchy). 

It's a bit of an aside from the question, but why do you want to mix SSDs and HDDs in the same pool? Do you have a read-intensive workload and plan to use primary affinity to get all reads from the NVMe drives?
 

Yes, this is pretty much the idea: getting the read performance of NVMe while still maintaining triple redundancy at a reasonable cost.


--
Regards,
Vladimir






--
Regards,
Vladimir


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
