Re: Unexpected issues with simulated 'rack' outage

On 06/24/15 14:44, Romero Junior wrote:

Hi,

 

We are setting up a test environment using Ceph as the main storage solution for our QEMU-KVM virtualization platform, and everything works fine except for the following:

 

When I simulate a failure by powering off the switches on one of our three racks, my virtual machines get into a weird state; this illustration might help you fully understand what is going on: http://i.imgur.com/clBApzK.jpg

 

The PGs are distributed based on racks; we are not using the default CRUSH rules.


What does ceph -s report while you are in this state?
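
(If you can, paste the output of the commands below, captured while the rack is down; the exact commands are only a suggestion, and the output will of course depend on your cluster.)

    ceph -s                # overall cluster status, PG states, monitor quorum
    ceph health detail     # which PGs are degraded/peering/undersized, and why
    ceph osd tree          # which OSDs are reported down, grouped by rack/host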

16000 PGs might be a problem: when one of your racks goes down, if your crushmap rules distribute PGs across racks with size = 2, approximately 2/3 of your PGs should end up in a degraded state. This means ~10666 PGs will have to copy data to get back to an active+clean state, and your 2 other racks will be really busy in the meantime. You can probably tune the recovery process to limit the interference with your normal VM I/O, as sketched below.
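
For example, a minimal sketch of such tuning (the values are illustrative, not a recommendation for your hardware): throttle backfill and recovery either at runtime with injectargs or persistently in the [osd] section of ceph.conf.

    # Throttle recovery/backfill on all OSDs at runtime (example values)
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # Or the equivalent in ceph.conf
    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1
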
You didn't say where the monitors are placed (and there are only 2 in your illustration, which means either of them becoming unreachable will bring down your cluster: 2 monitors cannot keep a quorum once one is lost).
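
If you add a third monitor and place one per rack, losing a whole rack still leaves a quorum of 2 out of 3. A minimal ceph.conf sketch, assuming hypothetical hostnames and addresses:

    [global]
    # one monitor per rack (hostnames and IPs are placeholders)
    mon initial members = mon-rack1, mon-rack2, mon-rack3
    mon host = 10.0.1.10, 10.0.2.10, 10.0.3.10
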

That said, I'm not sure that having a failure domain at the rack level when you only have 3 racks is a good idea. What you end up with when a switch fails is a reconfiguration of two thirds of your cluster, which is not desirable in any case. If possible, either distribute the hardware across more racks (with 4 racks only 1/2 of your data would be affected, with 5 racks only 2/5, ...) or make the switches redundant (each OSD server connected to 2 switches, ...).
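
For reference, a rack-level replicated CRUSH rule usually looks like the sketch below (the rule name and numbers are illustrative, not taken from your crushmap); changing "type rack" to "type host" would move the failure domain down to the host level:

    rule replicated_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }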

Note that with 33 servers per rack, 3 OSDs per server and 3 racks you have approximately 300 disks. With so many disks, size=2 is probably too low to bring the probability of losing data down to a negligible level (even if the relevant failure case is 2 disks amongst ~100 rather than 300). With only ~20 disks we already came close to 2 simultaneous failures once (admittedly it was a combination of hardware and human error in the early days of our cluster), and we currently have one failed disk and another showing signs of hardware problems (erratic performance) within a span of a few weeks.
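
If you do decide to go to size=3, the change is per pool; a sketch, with "rbd" standing in for your pool name:

    ceph osd pool set rbd size 3       # keep 3 copies of each object
    ceph osd pool set rbd min_size 2   # still serve I/O with only 2 copies available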

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
