Re: Unexpected issues with simulated 'rack' outage

I would make sure that your CRUSH rules are designed for such a failure. We currently have two racks and can survive the loss of one rack without blocking I/O. Here is what we do:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
}
All pools are size=4 and min_size=2
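
In case it helps, this is roughly how such a rule gets applied; a sketch only, the pool name "rbd" and ruleset number 0 are just examples, adjust names and numbers to your setup:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt      # edit the rule in crushmap.txt
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool set rbd crush_ruleset 0          # point the pool at the rule
ceph osd pool set rbd size 4
ceph osd pool set rbd min_size 2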

This puts exactly two copies in each rack, so a rack loss takes out at most half of the copies of any object. We also set "mon_osd_down_out_subtree_limit = host" so that Ceph won't automatically mark a whole rack out (not that marking a rack out would accomplish much in our current two-rack configuration anyway).
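
For reference, that is just a ceph.conf option on the monitors; a sketch (the section placement is up to you, [global] works as well):

[mon]
        mon osd down out subtree limit = host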

Our network failure domain (dual Ethernet switches) spans two racks, so the next failure domain above the rack is what we call a PUD, i.e. a group of two racks. The 3-4 rack configuration is similar to the rule above, with the choose step changed from rack to pud (a sketch of that variant follows below).
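
Roughly (only a sketch; "pud" is a custom bucket type we add to our CRUSH map between rack and root, it is not one of the stock types):

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type pud
        step chooseleaf firstn 2 type host
        step emit
}

Once we get to our 5th rack of storage, our config changes to: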

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type pud
        step emit
}
All pools are size=3 and min_size=2

In this configuration, only one copy is kept per PUD and we can lose two racks in a PUD without blocking I/O in our cluster.
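
If you want to verify a rule before trusting it with data, crushtool can simulate the mappings offline; a sketch (the rule id and replica count here are just examples):

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings | head
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings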

Under the default CRUSH rules, it is possible to end up with two copies of an object in the same rack. What does `ceph osd crush rule dump` show?


----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Jun 24, 2015 at 7:44 AM, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
On 06/24/15 14:44, Romero Junior wrote:

Hi,

 

We are setting up a test environment using Ceph as the main storage solution for our QEMU-KVM virtualization platform, and everything works fine except for the following:

 

When I simulate a failure by powering off the switches on one of our three racks, my virtual machines get into a weird state. This illustration might help you understand what is going on: http://i.imgur.com/clBApzK.jpg

 

The PGs are distributed across racks; we are not using the default CRUSH rules.


What does `ceph -s` report while you are in this state?

16000 PGs might be a problem: if your CRUSH rules distribute PGs across racks with size = 2, then when a rack goes down approximately 2/3 of your PGs will be degraded. That means ~10666 PGs will have to copy data to get back to an active+clean state, and your two remaining racks will be very busy while they do. You can probably tune the recovery process to limit its interference with your normal VM I/O.
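
For example, something along these lines can be injected at runtime (a sketch only; the values are illustrative, not recommendations):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
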
You didn't say where the monitors are placed (and there are only 2 in your illustration, which means losing either one will bring down your cluster, since the monitors need a majority to keep quorum).

That said, I'm not sure that putting the failure domain at the rack level is a good idea when you only have 3 racks. What you end up with when a switch fails is a reshuffle of two thirds of your cluster, which is not desirable in any case. If possible, either spread the hardware over more racks (with 4 racks only 1/2 of your data is affected, with 5 racks only 2/5, ...) or make the switches redundant (each OSD server connected to 2 switches, ...).

Note that with 33 servers per rack, 3 OSDs per server and 3 racks, you have approximately 300 disks. With that many disks, size=2 is probably too low to keep the probability of losing data negligible (even if the failure case is 2 disks amongst 100 rather than 300). With only ~20 disks we already came close to two simultaneous failures once (admittedly it was a combination of hardware and human error in the early days of our cluster). We currently have one failed disk and another showing signs of hardware problems (erratic performance), all within a span of a few weeks.

Best regards,

Lionel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

