Re: Changing replica size of a running pool


 



On 05/05/17 21:32, Alejandro Comisario wrote:
Thanks David!
Anyone? More thoughts?

On Wed, May 3, 2017 at 3:38 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
Those are both things that people have done, and both work.  Neither is optimal, but both options work fine.  The best option is definitely to just get a third node now, as you aren't going to get any additional usable space out of it later.  Your usable space in a 2-node size 2 cluster and a 3-node size 3 cluster is identical.
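To make that concrete (assuming, purely for illustration, identical nodes with 10 TB of raw capacity each):

    2 nodes x 10 TB raw = 20 TB raw; size 2 -> 20 / 2 = 10 TB usable
    3 nodes x 10 TB raw = 30 TB raw; size 3 -> 30 / 3 = 10 TB usable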

If getting a third node is not possible, I would recommend a size 2, min_size 2 configuration.  You will block writes if either of your nodes or any copy of your data is down, but you will not get into the inconsistent state that can happen with min_size 1 (and you can always set the min_size of a pool to 1 on the fly to perform maintenance).  If you go with the option of using a failure domain of OSD instead of host with size 3, then a single node going down will block writes to your cluster.  The only thing you gain from that is having 3 physical copies of the data until you get a third node, at the cost of a lot of backfilling when you change the crush rule.
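As a rough sketch of what that looks like on a running pool (the pool name "mypool" below is just a placeholder), size and min_size can be changed at any time with:

    ceph osd pool set mypool size 2
    ceph osd pool set mypool min_size 2
    # only temporarily, for maintenance; set it back to 2 afterwards
    ceph osd pool set mypool min_size 1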

A more complex option, which I think would be a better solution than your 2 options, would be to create 2 hosts in your crush map for each physical host and split the OSDs of each physical host evenly between them.  That way you can have 2 copies of the data on a given node, but never all 3; you keep your 3 copies of the data with a guarantee that not all 3 are on the same host.  Assuming min_size 2, you will still block writes if you restart either node.
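A minimal sketch of that CRUSH split, assuming hypothetical bucket names node1-a/node1-b and an example OSD weight of 1.0 (use your real OSD weights, and expect backfill as OSDs are moved):

    ceph osd crush add-bucket node1-a host
    ceph osd crush add-bucket node1-b host
    ceph osd crush move node1-a root=default
    ceph osd crush move node1-b root=default
    # split node1's OSDs between the two new host buckets, e.g.:
    ceph osd crush set osd.0 1.0 root=default host=node1-a
    ceph osd crush set osd.1 1.0 root=default host=node1-b

Repeat for the second physical node; the existing replicated ruleset that chooses leaves of type host will then spread the 3 copies across the 4 host buckets.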

Smart idea.
Or if you have the space, size 4 min_size 2, and then you can still lose a node. You might think that takes more space, but in a way it doesn't, if you count the free space you have to reserve for recovery. If one node of the size 3 setup dies, the other has to recover until it holds 2 copies of everything, and at that point it uses the same space as the size 4 pool. If the size 4 pool loses a node, it won't be able to recover further... it stays at 2 copies, which is what the size 3 pool would have been after recovery. So it's as if it's pre-recovered. But you will probably get a bit more write latency in this setup.
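If you go that way it's the same pool settings as above (again with a placeholder pool name), and it's worth checking the available capacity first:

    ceph df
    ceph osd pool set mypool size 4
    ceph osd pool set mypool min_size 2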

If modifying the hosts in your crush map doesn't sound daunting, then I would recommend going that route... For most people that is more complex than they'd like to get, in which case I would say size 2, min_size 2 is the way to go until you get a third node.  #my2cents

On Wed, May 3, 2017 at 12:41 PM Maximiliano Venesio <massimo@xxxxxxxxxxx> wrote:
Guys hi.

I have a Jewel cluster composed of two storage servers, which are configured in
the crush map as different buckets to store data.

I have to configure two new pools on this cluster, with the certainty
that I'll have to add more servers in the short term.

Taking into account that the recommended replication size for every
pool is 3, I'm considering two possible scenarios.

1) Set the replica size to 2 now, and in the future change the replica
size to 3 on the running pool.
Is that possible? Could I run into serious issues with the rebalance of the
PGs when changing the pool size on the fly? (See the command sketch after scenario 2.)

2) Set the replica size to 3 and change the ruleset to replicate by
OSD instead of host now, and in the future change that rule back to
replicating by host on the running pool.
Is that possible? Could I run into serious issues with the rebalance of the
PGs when changing the ruleset on a running pool?
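For reference, both scenarios come down to single commands on a running pool (the pool and rule names below are placeholders; Jewel still calls the pool setting crush_ruleset, and either change will trigger a large backfill):

    # scenario 1: raise the replica count later
    ceph osd pool set mypool size 3
    # scenario 2: a rule with OSD failure domain now, switch back to a host-based rule later
    ceph osd crush rule create-simple replicate-by-osd default osd
    ceph osd pool set mypool crush_ruleset <rule id>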

Which do you think is the best option?


Thanks in advance.


Maximiliano Venesio
Chief Cloud Architect | NUBELIU
E-mail: massimo@nubeliu.com
Cell: +54 9 11 3770 1853
www.nubeliu.com





--
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@xxxxxxxxxxx
Cell: +54 9 11 3770 1857
www.nubeliu.com




-- 

--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney@xxxxxxxxxxxxxxxxxxxx
Internet: http://www.brockmann-consult.de
--------------------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
