Re: Unexpected issues with simulated 'rack' outage

Saverio Proto <zioproto@xxxxxxxxx> · Wed, 24 Jun 2015 15:21:40 +0200

You don't have to wait, but the recovery process will be very heavy and will have an impact on performance. The impact could be catastrophic, as you are experiencing.

After removing 1 rack, the CRUSH algorithm will run again on the available resources and will map the PGs to the available OSDs. You lost 33% of the OSDs, so it will be a big change.

This means that you will not only have to re-create the copies that lived on the OSDs that are now out of your cluster, but you will also have to move around a lot of objects that are now misplaced.

It would also be nice to see your crushmap, because you are not using the default. A conceptual bug in the crushmap could leave the cluster in a degraded state forever. For example, if you wrote a crushmap that places copies only on different racks, and you want 3 copies with only 2 racks available, that is a possible conceptual bug.
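
To make the 3-copies-on-2-racks example concrete, here is a minimal sketch. It is not how CRUSH is implemented; it only assumes a rule that places at most one replica per rack. The function name and the min_size=2 in the first call are assumptions for illustration; the second call uses the size 2 / min_size 1 values from the pool listing quoted further down.

```python
def achievable_state(size, min_size, racks_up):
    """Best state a PG can reach if the rule allows at most one replica per rack.

    Illustrative model only; real CRUSH placement has more moving parts.
    """
    replicas = min(size, racks_up)  # one replica per available rack, at most
    if replicas < min_size:
        return "inactive"           # too few replicas to serve I/O
    if replicas < size:
        return "degraded"           # serves I/O but can never reach full size
    return "clean"


# 3 copies wanted, 2 racks left (min_size=2 is an assumption)
print(achievable_state(size=3, min_size=2, racks_up=2))  # degraded
# Pool from the original post: size 2, min_size 1, 2 racks left
print(achievable_state(size=2, min_size=1, racks_up=2))  # clean
```

Under this simplified model, the pool described in the original post should still be able to return to a clean state with two racks, so a permanently degraded cluster would point at the crushmap rather than at the replica count.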

Saverio

2015-06-24 15:11 GMT+02:00 Romero Junior <r.junior@xxxxxxxxxxxxxxxxxxx>:

If I have a replica of each object on the other racks, why should I have to wait for any recovery time? The failure should not impact my virtual machines.

From: Saverio Proto [mailto:zioproto@xxxxxxxxx]

Sent: Wednesday, 24 June 2015 14:54

To: Romero Junior

Cc: ceph-users@xxxxxxxxxxxxxx

Subject: Re:  Unexpected issues with simulated 'rack' outage

Hello Romero,

I am still a beginner with Ceph, but as far as I understand, Ceph is not designed to lose 33% of the cluster at once and recover rapidly. By losing 1 rack out of 3, you are losing 33% of the cluster. It will take a very long time to recover before you reach HEALTH_OK status.

Can you check with ceph -w how long it takes for Ceph to converge to a healthy cluster after you switch off the switch in Rack-A?
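
If watching ceph -w by hand is inconvenient, the measurement can also be scripted. The following is a minimal sketch under stated assumptions: it polls the `ceph health` command and relies on its output starting with HEALTH_OK, HEALTH_WARN or HEALTH_ERR. The function names, the 5-second poll interval and the one-hour timeout are arbitrary illustrative choices.

```python
import subprocess
import time


def ceph_health():
    """Return the first word of `ceph health` output, e.g. 'HEALTH_OK'."""
    words = subprocess.check_output(["ceph", "health"]).decode().split()
    return words[0] if words else ""


def time_to_health_ok(get_status=ceph_health, poll_interval=5.0,
                      timeout=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the status is HEALTH_OK.

    Returns the elapsed seconds, or None if the timeout is reached first.
    get_status, sleep and clock are injectable so the loop can be tested
    without a running cluster.
    """
    start = clock()
    while True:
        if get_status() == "HEALTH_OK":
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(poll_interval)
```

Calling time_to_health_ok() on a node with admin access right after powering off the switch would return the number of seconds until the cluster reports HEALTH_OK again, or None if it never does within the timeout.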

Saverio

2015-06-24 14:44 GMT+02:00 Romero Junior <r.junior@xxxxxxxxxxxxxxxxxxx>:

Hi,

We are setting up a test environment using Ceph as the main storage solution for our QEMU-KVM virtualization platform, and everything works fine except for the following:

When I simulate a failure by powering off the switches on one of our three racks, my virtual machines get into a weird state. The illustration might help you fully understand what is going on:
http://i.imgur.com/clBApzK.jpg

The PGs are distributed based on racks; we are not using the default CRUSH rules.

The number of PGs is the following:

root@srv003:~# ceph osd pool ls detail
pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags hashpspool stripe_width 0

QEMU talks directly to Ceph through librbd; the disk is configured as follows:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='0d32bxxxyyyzzz47073a965'/>
      </auth>
      <source protocol='rbd' name='libvirt-pool/ceph-vm-automated'>
        <host name='10.XX.YY.1' port='6789'/>
        <host name='10.XX.YY.2' port='6789'/>
        <host name='10.XX.YY.2' port='6789'/>
      </source>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk25'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

As mentioned, it's not a real read-only state: I can "touch" files and even log in on the affected virtual machines (by the way, all are affected). However, a simple 'dd' (count=10 bs=1MB conv=fdatasync) hangs forever. If a 3 GB file download starts (via wget/curl), it usually crashes after the first few hundred megabytes, and it resumes as soon as I power on the “failed” rack. Everything goes back to normal as soon as the rack is powered on again.

For reference, each rack contains 33 nodes, and each node contains 3 OSDs (1.5 TB each).
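
To give a sense of scale, the figures above can be turned into a rough back-of-the-envelope calculation. This is only a sketch: it assumes all OSDs have equal weight and that each PG keeps its 2 replicas in 2 distinct racks chosen uniformly, which the actual crushmap (not shown in this thread) may or may not do.

```python
racks = 3
nodes_per_rack = 33
osds_per_node = 3
osd_size_tb = 1.5
pg_num = 16000   # from `ceph osd pool ls detail` above
size = 2         # replicas per PG, from the same output

osds_per_rack = nodes_per_rack * osds_per_node    # 99
osds_total = racks * osds_per_rack                # 297
osds_left = osds_total - osds_per_rack            # 198 after one rack fails
raw_tb_lost = osds_per_rack * osd_size_tb         # 148.5 TB of raw capacity

# Average number of PG replicas each OSD has to carry
copies_per_osd_before = pg_num * size / osds_total   # about 107.7
copies_per_osd_after = pg_num * size / osds_left     # about 161.6

# Assumption: replicas sit in 2 distinct racks chosen uniformly, so a given
# rack holds one replica of size/racks of all PGs.
pgs_with_replica_in_failed_rack = pg_num * size / racks   # about 10667

print(f"OSDs: {osds_total} total, {osds_left} left")
print(f"PG replicas per OSD: {copies_per_osd_before:.1f} -> "
      f"{copies_per_osd_after:.1f}")
print(f"PGs losing a replica: about {pgs_with_replica_in_failed_rack:.0f}")
```

Under these assumptions, roughly two thirds of the PGs lose one of their two replicas when a rack goes down, and the surviving OSDs each carry about 50% more PG replicas once recovery completes.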

On the virtual machine, after recovering the rack, I can see the following messages in /var/log/kern.log:

[163800.444146] INFO: task jbd2/vda1-8:135 blocked for more than 120 seconds.
[163800.444260]       Not tainted 3.13.0-55-generic #94-Ubuntu
[163800.444295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[163800.444346] jbd2/vda1-8     D ffff88007fd13180     0   135      2 0x00000000
[163800.444354]  ffff880036d3bbd8 0000000000000046 ffff880036a4b000 ffff880036d3bfd8
[163800.444386]  0000000000013180 0000000000013180 ffff880036a4b000 ffff88007fd13a18
[163800.444390]  ffff88007ffc69d0 0000000000000002 ffffffff811efa80 ffff880036d3bc50
[163800.444396] Call Trace:
[163800.444420]  [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50
[163800.444426]  [<ffffffff817279bd>] io_schedule+0x9d/0x140
[163800.444432]  [<ffffffff811efa8e>] sleep_on_buffer+0xe/0x20
[163800.444437]  [<ffffffff81727e42>] __wait_on_bit+0x62/0x90
[163800.444442]  [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50
[163800.444447]  [<ffffffff81727ee7>] out_of_line_wait_on_bit+0x77/0x90
[163800.444455]  [<ffffffff810ab300>] ? autoremove_wake_function+0x40/0x40
[163800.444461]  [<ffffffff811f0dba>] __wait_on_buffer+0x2a/0x30
[163800.444470]  [<ffffffff8128be4d>] jbd2_journal_commit_transaction+0x185d/0x1ab0
[163800.444477]  [<ffffffff8107562f>] ? try_to_del_timer_sync+0x4f/0x70
[163800.444484]  [<ffffffff8129017d>] kjournald2+0xbd/0x250
[163800.444490]  [<ffffffff810ab2c0>] ? prepare_to_wait_event+0x100/0x100
[163800.444496]  [<ffffffff812900c0>] ? commit_timeout+0x10/0x10
[163800.444502]  [<ffffffff8108b702>] kthread+0xd2/0xf0
[163800.444507]  [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0
[163800.444513]  [<ffffffff81733ca8>] ret_from_fork+0x58/0x90
[163800.444517]  [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0

A few theories for this behavior were mentioned on #Ceph (OFTC):

[14:09] <Be-El> RomeroJnr: i think the problem is the fact that you write to parts of the rbd that have not been accessed before
[14:09] <Be-El> RomeroJnr: ceph does thin provisioning; each rbd is striped into chunks of 4 mb. each stripe is put into one pgs
[14:10] <Be-El> RomeroJnr: if you access formerly unaccessed parts of the rbd, a new stripe is created. and this probably fails if one of the racks is down
[14:10] <Be-El> RomeroJnr: but that's just a theory...maybe some developer can comment on this later
[14:21] <Be-El> smerz: creating an object in a pg might be different than writing to an object
[14:21] <Be-El> smerz: with one rack down ceph cannot satisfy the pg requirements in RomeroJnr's case
[14:22] <smerz> i can only agree with you. that i would expect other behaviour
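
The chunking described in the IRC log can be illustrated with a small calculation. This sketch does not confirm or refute the theory; it only shows which 4 MiB chunks a write would touch, assuming the image uses a 4 MiB object size. The function name and the starting offset of 0 are hypothetical: a real 'dd' inside the guest writes wherever the filesystem allocates blocks.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB per chunk (assumed object size)


def objects_touched(offset, length, object_size=OBJECT_SIZE):
    """Return the range of chunk indices covered by a write of `length`
    bytes starting at byte `offset` of the image."""
    if length <= 0:
        return range(0)
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)


# dd count=10 bs=1MB writes 10 * 1,000,000 bytes
dd_bytes = 10 * 1000 * 1000
print(list(objects_touched(0, dd_bytes)))  # [0, 1, 2]
```

Even a 10 MB write therefore spans several chunks, each of which may map to a different PG, so a write of that size has several chances of landing on a PG that is affected by the outage.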

The question is: is this behavior indeed expected?

Kind regards,

Romero Junior

Hosting Engineer

LeaseWeb Global Services B.V.

T: +31 20 316 0230
M: +31 6 2115 9310
E: r.junior@xxxxxxxxxxxxxxxxxxx
W: www.leaseweb.com

Luttenbergweg 8,
1101 EC Amsterdam,
Netherlands

LeaseWeb is the brand name under which the various independent LeaseWeb companies operate. Each company is a separate and distinct entity that provides services in a particular geographic area. LeaseWeb Global Services B.V. does not provide third-party services. Please see www.leaseweb.com/en/legal for more information.

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
