Re: Unexpected issues with simulated 'rack' outage

On 06/24/15 14:44, Romero Junior wrote:

Hi,

 

We are setting up a test environment using Ceph as the main storage solution for our QEMU-KVM virtualization platform, and everything works fine except for the following:

 

When I simulate a failure by powering off the switches on one of our three racks, my virtual machines get into a weird state; this illustration might help you fully understand what is going on: http://i.imgur.com/clBApzK.jpg

 

The PGs are distributed based on racks; we are not using the default CRUSH rules.


What does ceph -s report while you are in this state?
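
(If you can, paste the output of the commands below, captured while the rack is down; the exact commands are only a suggestion, and the output will of course depend on your cluster.)

    ceph -s                # overall cluster status, PG states, monitor quorum
    ceph health detail     # which PGs are degraded/peering/undersized, and why
    ceph osd tree          # which OSDs are reported down, grouped by rack/host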

16000 PGs might be a problem: when one of your racks goes down, if your crushmap rules distribute PGs across racks with size = 2, approximately 2/3 of your PGs should end up in a degraded state. This means ~10666 PGs will have to copy data to get back to an active+clean state, and your 2 other racks will be really busy in the meantime. You can probably tune the recovery process to limit the interference with your normal VM I/O, as sketched below.
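
For example, a minimal sketch of such tuning (the values are illustrative, not a recommendation for your hardware): throttle backfill and recovery either at runtime with injectargs or persistently in the [osd] section of ceph.conf.

    # Throttle recovery/backfill on all OSDs at runtime (example values)
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # Or the equivalent in ceph.conf
    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1
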
You didn't say where the monitors are placed (and there are only 2 in your illustration, which means either of them becoming unreachable will bring down your cluster: 2 monitors cannot keep a quorum once one is lost).
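
If you add a third monitor and place one per rack, losing a whole rack still leaves a quorum of 2 out of 3. A minimal ceph.conf sketch, assuming hypothetical hostnames and addresses:

    [global]
    # one monitor per rack (hostnames and IPs are placeholders)
    mon initial members = mon-rack1, mon-rack2, mon-rack3
    mon host = 10.0.1.10, 10.0.2.10, 10.0.3.10
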

That said, I'm not sure that having a failure domain at the rack level when you only have 3 racks is a good idea. What you end up with when a switch fails is a reconfiguration of two thirds of your cluster, which is not desirable in any case. If possible, either distribute the hardware across more racks (with 4 racks only 1/2 of your data would be affected, with 5 racks only 2/5, ...) or make the switches redundant (each OSD server connected to 2 switches, ...).
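
For reference, a rack-level replicated CRUSH rule usually looks like the sketch below (the rule name and numbers are illustrative, not taken from your crushmap); changing "type rack" to "type host" would move the failure domain down to the host level:

    rule replicated_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }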

Note that with 33 servers per rack, 3 OSDs per server and 3 racks you have approximately 300 disks. With so many disks, size=2 is probably too low to bring the probability of losing data down to a negligible level (even if the relevant failure case is 2 disks amongst ~100 rather than 300). With only ~20 disks we already came close to 2 simultaneous failures once (admittedly it was a combination of hardware and human error in the early days of our cluster), and we currently have one failed disk and another showing signs of hardware problems (erratic performance) within a span of a few weeks.
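
If you do decide to go to size=3, the change is per pool; a sketch, with "rbd" standing in for your pool name:

    ceph osd pool set rbd size 3       # keep 3 copies of each object
    ceph osd pool set rbd min_size 2   # still serve I/O with only 2 copies available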

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
