Re: will crush rule be used during object relocation in OSD failure ?

On 23/11/18 18:00, ST Wong (ITSC) wrote:

Hi all,


We have 8 OSD hosts, 4 in room 1 and 4 in room 2.

A pool with size = 3 is created using the following CRUSH rule, to cater for room failure.



rule multiroom {
        id 0
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}
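
For reference, a minimal sketch of how the rule could be injected and the pool created (pool name, PG counts and file names are placeholders, not our actual values):

ceph osd getcrushmap -o crushmap.bin          # dump the current CRUSH map
crushtool -d crushmap.bin -o crushmap.txt     # decompile to text
# ... append the multiroom rule above to crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new     # recompile
ceph osd setcrushmap -i crushmap.new          # inject the new map

# create a replicated pool that uses the rule (example name / PG counts)
ceph osd pool create mypool 128 128 replicated multiroom
ceph osd pool set mypool size 3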



We're expecting:

1. For each object there are always 2 replicas in one room and 1 replica in the other room, making size = 3. But we can't control which room holds 1 replica and which holds 2.

2. If an OSD host fails, Ceph will pick remaining OSDs for the affected PGs to replace the replicas that were on the failed host. Selection follows the pool's CRUSH rule, so the failure domain is preserved and we won't end up with all replicas in the same room.

3. If the entire room holding 1 replica fails, the pool will stay degraded but no replica relocation will take place.

4. If the entire room holding 2 replicas fails, Ceph will use OSDs in the surviving room to bring objects back to 2 replicas. The pool will not be writeable until all objects have 2 copies again (unless we make the pool size = 4?). Once recovery completes, the pool will remain degraded until the failed room recovers.


Is our understanding correct?  Thanks a lot.
Will do some simulation later to verify.
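
For the simulation, one option is to test the compiled map offline with crushtool; a small sketch (the file name and rule id are assumptions based on the rule above):

# show which OSDs each sample input maps to under rule 0 with 3 replicas
crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-mappings

# show how evenly the rule would spread data across OSDs
crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-utilization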

Regards,
/stwong



I think this is correct. To re-phrase 2): all PGs on the failed host will be redistributed across several other hosts within the same room.

Since some PGs will have 2 replicas in one room whereas other PGs will have 2 replicas in the other room, I tend to dislike such a setup: it is not symmetric, so some PGs will suffer more than others in case of a room failure. More importantly, as you stated in 4, your cluster will be down while these unfortunate PGs recover (statistically that is half your data). In such a case I would prefer a size=4, min_size=2 setup.
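
If you go that route, the rule above already allows 4 replicas (max_size 4), so switching should just be a matter of something like the following (the pool name is a placeholder):

ceph osd pool set mypool size 4       # 2 replicas per room
ceph osd pool set mypool min_size 2   # stay writeable if one room is down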

/Maged
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
