On 10/7/2017 10:41 PM, David Turner wrote:
Disclaimer: I have never attempted this configuration, especially with Luminous. I doubt many have, but it's a curious configuration that I'd love to help see if it is possible.
Very generous of you :). (With that said, I suppose we are prepared
to pay for help to have this figured out. It makes me a little
headachy and there is budget space :)).
There is one logical problem with your configuration (which you have most likely considered). If you want all of your PGs to be primary on NVMes across the 3 DCs, then 1/3 of your available storage (that you plan to use for this pool) needs to come from NVMes. Otherwise they will fill up long before the HDDs and your cluster will be "full" while your HDDs are near empty. I clarify "that you plan to use for this pool" because if you plan to put other data on just the HDDs, deliberately using up that extra space, then it is part of the plan that your NVMes don't total 1/3 of your storage.
We were going to use the leftover HDD space for nearline archives, intermediate backups, etc.
Second, I'm noticing that if a PG has a primary OSD in any datacenter other than TEG4, then it only has 1 other datacenter available to hold its 2 HDD copies. If the rules were working properly, I would expect the PG to be stuck undersized rather than choosing an OSD from a datacenter it shouldn't be able to use. Potentially, you could test setting the size to 2 for this pool (while you're missing the third HDD node) to see if any PGs still end up on an HDD and NVMe in the same DC. I think you will likely find that PGs can still place 2 copies in the same DC with your current configuration.
I know. This server does not exist yet. It should be finished this coming week (the hardware is busy and its current task needs migrating first). And yes, that does not make testing this out any easier. It was an oversight not to have it finished.
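In the meantime I can probably dry-run the rule offline with crushtool, without touching the live pool, and see whether any mapping puts two copies in one datacenter. Something like this, I think (the rule id and the 0-1023 input range are just placeholders, and --num-rep 2 mirrors your size=2 test):

    # dump the cluster's compiled crush map and test the rule offline
    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 1 --num-rep 2 \
        --show-mappings --min-x 0 --max-x 1023
    # each output line lists the OSDs chosen for one input; cross-checking
    # those ids against 'ceph osd tree' should show whether two copies
    # ever share a datacenter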
Then, I believe, the next best configuration would be to set the size for this pool to 4. It would choose an NVMe as the primary OSD, and then choose an HDD from each DC for the secondary copies. This guarantees that a copy of the data goes into each DC and that you have 2 copies in DCs other than the one holding the primary NVMe copy. It wastes a copy of all of the data in the pool, but that's on the much cheaper HDD storage and can probably be considered an acceptable loss for the sake of having the primary OSD on NVMe drives.
I have considered this, and it should of course work while everything is up, so to speak, but what if 1 datacenter is isolated while running? We would be left with 2 running copies on each side for all PGs, with no way of knowing what gets written where. In the end, data would be destroyed due to the split brain. Even being able to enforce quorum on the side where the SSD is would mean a single point of failure.
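For reference, I believe the rule for that size-4 approach would look roughly like this (the rule name and id are made up, and it assumes the nvme and hdd device classes are assigned):

    rule nvme_primary_hdd_everywhere {
        id 10
        type replicated
        min_size 1
        max_size 10
        # first copy (the primary) on any NVMe
        step take default class nvme
        step chooseleaf firstn 1 type host
        step emit
        # remaining copies on HDDs, one per datacenter
        step take default class hdd
        step chooseleaf firstn -1 type datacenter
        step emit
    }

With the pool size set to 4, the second pass would pick 3 HDD OSDs in 3 distinct datacenters, so one HDD copy lands next to the NVMe copy.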
I was thinking that instead I could define a crushmap where I make logical datacenters that contain the SSDs, as they are spread out, together with the HDDs I explicitly want to mirror each SSD set to, and then make a crush rule that enforces 3 copies of the data on 3 hosts within the "datacenter" selected for each PG. I don't really know how to write such a "depth first" rule though, but I will try tomorrow.
I was considering making 3 rules mapping SSDs to HDDs and then 3 pools, but that would leave me manually balancing the load. And if one node went down, some RBDs would completely lose their SSD read capability instead of just 1/3 of it... perhaps acceptable, but not optimal :)
/Peter
On 10/7/2017 8:08 PM, David Turner wrote:
Just to make sure you understand: reads will happen on the primary OSD for the PG, not on the nearest OSD, meaning that reads will go between the datacenters. Also, each write will not ack until all 3 writes happen, adding that latency to both writes and reads.
Yes, I understand this.
It is actually fine; the datacenters have been selected so that they are about 10-20 km apart. This yields around a 0.1-0.2 ms round-trip time due to the speed of light being too low. Nevertheless, network latency shouldn't be a problem, and it's all a dedicated 40G TRILL network for the moment.
I just want to be able to select 1 SSD and 2 HDDs, all spread out. I can do that, but one of the HDDs ends up in the same datacenter as the SSD, probably because I'm using the "take" command 2 times (does it reset the selected buckets?).
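For reference, the kind of two-take rule I mean looks roughly like this (names and ids here are just for illustration):

    rule ssd_primary_two_hdd {
        id 12
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        # the second take/emit pass starts from scratch: it has no memory of
        # which datacenter the SSD copy above landed in, so one of the two
        # HDD copies can end up in that same datacenter
        step take default class hdd
        step chooseleaf firstn -1 type datacenter
        step emit
    }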