Re: Unexpected issues with simulated 'rack' outage

I would make sure that your CRUSH rules are designed for such a failure. We currently have two racks and can survive the loss of one rack without blocking I/O. Here is what we do:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
}
All pools are size=4 and min_size=2
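
In case it helps, this is roughly how such a rule gets applied; a sketch only, the pool name "rbd" and ruleset number 0 are just examples, adjust names and numbers to your setup:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt      # edit the rule in crushmap.txt
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool set rbd crush_ruleset 0          # point the pool at the rule
ceph osd pool set rbd size 4
ceph osd pool set rbd min_size 2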

This puts exactly two copies in each rack, so a rack loss takes out at most half of the copies of any object. We also set "mon_osd_down_out_subtree_limit = host" so that Ceph won't automatically mark a whole rack out (not that marking a rack out would accomplish much in our current two-rack configuration anyway).
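
For reference, that is just a ceph.conf option on the monitors; a sketch (the section placement is up to you, [global] works as well):

[mon]
        mon osd down out subtree limit = host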

Our network failure domain (dual Ethernet switches) spans two racks, so the next failure domain above the rack is what we call a PUD, i.e. a group of two racks. The 3-4 rack configuration is similar to the rule above, with the choose step changed from rack to pud (a sketch of that variant follows below).
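
Roughly (only a sketch; "pud" is a custom bucket type we add to our CRUSH map between rack and root, it is not one of the stock types):

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type pud
        step chooseleaf firstn 2 type host
        step emit
}

Once we get to our 5th rack of storage, our config changes to: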

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type pud
        step emit
}
All pools are size=3 and min_size=2

In this configuration, only one copy is kept per PUD and we can lose two racks in a PUD without blocking I/O in our cluster.
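
If you want to verify a rule before trusting it with data, crushtool can simulate the mappings offline; a sketch (the rule id and replica count here are just examples):

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings | head
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings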

Under the default CRUSH rules, it is possible to end up with two copies of an object in the same rack. What does `ceph osd crush rule dump` show?


----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Jun 24, 2015 at 7:44 AM, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
On 06/24/15 14:44, Romero Junior wrote:

Hi,

 

We are setting up a test environment using Ceph as the main storage solution for our QEMU-KVM virtualization platform, and everything works fine except for the following:

 

When I simulate a failure by powering off the switches on one of our three racks, my virtual machines get into a weird state. This illustration might help you understand what is going on: http://i.imgur.com/clBApzK.jpg

 

The PGs are distributed across racks; we are not using the default CRUSH rules.


What does `ceph -s` report while you are in this state?

16000 PGs might be a problem: if your CRUSH rules distribute PGs across racks with size = 2, then when a rack goes down approximately 2/3 of your PGs will be degraded. That means ~10666 PGs will have to copy data to get back to an active+clean state, and your two remaining racks will be very busy while they do. You can probably tune the recovery process to limit its interference with your normal VM I/O.
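
For example, something along these lines can be injected at runtime (a sketch only; the values are illustrative, not recommendations):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
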
You didn't say where the monitors are placed (and there are only 2 in your illustration, which means losing either one will bring down your cluster, since the monitors need a majority to keep quorum).

That said, I'm not sure that putting the failure domain at the rack level is a good idea when you only have 3 racks. What you end up with when a switch fails is a reshuffle of two thirds of your cluster, which is not desirable in any case. If possible, either spread the hardware over more racks (with 4 racks only 1/2 of your data is affected, with 5 racks only 2/5, ...) or make the switches redundant (each OSD server connected to 2 switches, ...).

Note that with 33 servers per rack, 3 OSDs per server and 3 racks, you have approximately 300 disks. With that many disks, size=2 is probably too low to keep the probability of losing data negligible (even if the failure case is 2 disks amongst 100 rather than 300). With only ~20 disks we already came close to two simultaneous failures once (admittedly it was a combination of hardware and human error in the early days of our cluster). We currently have one failed disk and another showing signs of hardware problems (erratic performance), all within a span of a few weeks.

Best regards,

Lionel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

