Unexpected issues with simulated 'rack' outage


 



Hi,

 

We are setting up a test environment that uses Ceph as the main storage backend for our QEMU-KVM virtualization platform. Everything works fine except for the following:

 

When I simulate a failure by powering off the switches in one of our three racks, my virtual machines get into a weird state. This illustration may help explain what is going on: http://i.imgur.com/clBApzK.jpg

 

The PGs are distributed across racks; we are not using the default CRUSH rules.

 

The pool, including its PG count, is configured as follows:

 

root@srv003:~# ceph osd pool ls detail
pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags hashpspool stripe_width 0

 

QEMU talks directly to Ceph through librbd. The disk is configured as follows:

 

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='0d32bxxxyyyzzz47073a965'/>
      </auth>
      <source protocol='rbd' name='libvirt-pool/ceph-vm-automated'>
        <host name='10.XX.YY.1' port='6789'/>
        <host name='10.XX.YY.2' port='6789'/>
        <host name='10.XX.YY.2' port='6789'/>
      </source>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk25'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

 

 

To be clear, this is not a real read-only state: I can "touch" files and even log in to the affected virtual machines (all of them are affected, by the way). However, a simple 'dd' (count=10 bs=1MB conv=fdatasync) hangs forever. If I start a 3 GB file download (via wget/curl), it usually stalls after the first few hundred megabytes and resumes as soon as I power the "failed" rack back on. Everything else also returns to normal at that point.
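For anyone who wants to reproduce the hanging write without dd, here is a minimal Python sketch of roughly the same test: write ten blocks of 1,000,000 bytes and then force them to stable storage. The function name `write_and_sync` and the temporary file location are my own choices for illustration; to exercise the RBD-backed disk, the path has to be on a filesystem that lives on vda.

```python
import os
import tempfile
import time

BLOCK_SIZE = 1_000_000  # dd's "1MB" suffix means 10^6 bytes
COUNT = 10              # same as count=10 in the dd call


def write_and_sync(path, count=COUNT, block_size=BLOCK_SIZE):
    """Write `count` zero-filled blocks, then flush them to stable storage.

    Roughly mirrors `dd count=10 bs=1MB conv=fdatasync`. Returns the elapsed
    time in seconds. If the underlying block device stalls, the sync call
    simply does not return.
    """
    # os.fdatasync is not available on every platform; fall back to os.fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    block = bytes(block_size)
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()           # push Python's buffer to the kernel
        sync(f.fileno())    # ask the kernel to push the data to the device
    return time.monotonic() - start


if __name__ == "__main__":
    # Assumption: the default temp directory is on the RBD-backed disk.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        target = tmp.name
    try:
        elapsed = write_and_sync(target)
        print(f"wrote {COUNT * BLOCK_SIZE} bytes in {elapsed:.2f}s")
    finally:
        os.remove(target)
```

With the rack powered off I would expect this to block in the sync call, the same way dd does, but I have only observed the hang with dd itself.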

 

For reference, each rack contains 33 nodes, and each node contains 3 OSDs (1.5 TB each).
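To put those numbers in perspective, here is a small back-of-the-envelope calculation in Python. The inputs come from the pool details and the rack layout above; the last figure assumes that every PG keeps its two copies in two different racks, spread evenly, which is what I intend with the rack-based rules but have not verified PG by PG.

```python
RACKS = 3
NODES_PER_RACK = 33
OSDS_PER_NODE = 3
OSD_SIZE_TB = 1.5
PG_NUM = 16000       # pg_num of 'libvirt-pool'
REPLICA_SIZE = 2     # "replicated size 2"

total_osds = RACKS * NODES_PER_RACK * OSDS_PER_NODE
raw_capacity_tb = total_osds * OSD_SIZE_TB
pg_copies_per_osd = PG_NUM * REPLICA_SIZE / total_osds

# Assumption: two copies per PG, in two distinct racks, evenly distributed.
# Then any single rack holds one copy of 2 out of every 3 PGs.
fraction_of_pgs_in_one_rack = REPLICA_SIZE / RACKS

print(f"OSDs in total:            {total_osds}")
print(f"raw capacity (TB):        {raw_capacity_tb}")
print(f"PG copies per OSD:        {pg_copies_per_osd:.1f}")
print(f"PGs with a copy per rack: {fraction_of_pgs_in_one_rack:.0%}")
```

Under that assumption, powering off one rack takes away one copy of about two thirds of the PGs, and with min_size 1 each of those PGs should still have one copy left.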

 

On the virtual machine, after the rack recovers, I see the following messages in /var/log/kern.log:

 

[163800.444146] INFO: task jbd2/vda1-8:135 blocked for more than 120 seconds.
[163800.444260]       Not tainted 3.13.0-55-generic #94-Ubuntu
[163800.444295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[163800.444346] jbd2/vda1-8     D ffff88007fd13180     0   135      2 0x00000000
[163800.444354]  ffff880036d3bbd8 0000000000000046 ffff880036a4b000 ffff880036d3bfd8
[163800.444386]  0000000000013180 0000000000013180 ffff880036a4b000 ffff88007fd13a18
[163800.444390]  ffff88007ffc69d0 0000000000000002 ffffffff811efa80 ffff880036d3bc50
[163800.444396] Call Trace:
[163800.444420]  [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50
[163800.444426]  [<ffffffff817279bd>] io_schedule+0x9d/0x140
[163800.444432]  [<ffffffff811efa8e>] sleep_on_buffer+0xe/0x20
[163800.444437]  [<ffffffff81727e42>] __wait_on_bit+0x62/0x90
[163800.444442]  [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50
[163800.444447]  [<ffffffff81727ee7>] out_of_line_wait_on_bit+0x77/0x90
[163800.444455]  [<ffffffff810ab300>] ? autoremove_wake_function+0x40/0x40
[163800.444461]  [<ffffffff811f0dba>] __wait_on_buffer+0x2a/0x30
[163800.444470]  [<ffffffff8128be4d>] jbd2_journal_commit_transaction+0x185d/0x1ab0
[163800.444477]  [<ffffffff8107562f>] ? try_to_del_timer_sync+0x4f/0x70
[163800.444484]  [<ffffffff8129017d>] kjournald2+0xbd/0x250
[163800.444490]  [<ffffffff810ab2c0>] ? prepare_to_wait_event+0x100/0x100
[163800.444496]  [<ffffffff812900c0>] ? commit_timeout+0x10/0x10
[163800.444502]  [<ffffffff8108b702>] kthread+0xd2/0xf0
[163800.444507]  [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0
[163800.444513]  [<ffffffff81733ca8>] ret_from_fork+0x58/0x90
[163800.444517]  [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0

 

A few theories for this behavior were mentioned on #Ceph (OFTC):

 

[14:09] <Be-El> RomeroJnr: i think the problem is the fact that you write to parts of the rbd that have not been accessed before
[14:09] <Be-El> RomeroJnr: ceph does thin provisioning; each rbd is striped into chunks of 4 mb. each stripe is put into one pgs
[14:10] <Be-El> RomeroJnr: if you access formerly unaccessed parts of the rbd, a new stripe is created. and this probably fails if one of the racks is down
[14:10] <Be-El> RomeroJnr: but that's just a theory...maybe some developer can comment on this later
[14:21] <Be-El> smerz: creating an object in a pg might be different than writing to an object
[14:21] <Be-El> smerz: with one rack down ceph cannot satisfy the pg requirements in RomeroJnr's case
[14:22] <smerz> i can only agree with you. that i would expect other behaviour
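To make Be-El's theory concrete, here is a rough Python sketch of how a write to the image maps onto 4 MiB chunks and which of those chunks would have to be created because nothing was ever written there. The helper names `objects_touched` and `new_objects` are made up for this illustration, and the set of already existing chunks is invented. Whether creating an object really behaves differently from overwriting one while a rack is down is exactly the unconfirmed part.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB chunks, as described on IRC


def objects_touched(offset, length, object_size=OBJECT_SIZE):
    """Return the range of chunk indexes covered by a write of `length`
    bytes starting at byte `offset` of the image."""
    if length <= 0:
        return range(0)
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)


def new_objects(offset, length, existing, object_size=OBJECT_SIZE):
    """Return the chunk indexes a write would have to create, given the set
    of indexes that already exist (thin provisioning: untouched regions of
    the image have no chunk yet)."""
    return [i for i in objects_touched(offset, length, object_size)
            if i not in existing]


if __name__ == "__main__":
    existing = {0, 1}              # hypothetical: only the first 8 MiB were ever written
    offset = 6 * 1024 * 1024       # start the write 6 MiB into the image
    length = 10 * 1_000_000        # same amount of data as the dd test
    print(list(objects_touched(offset, length)))
    print(new_objects(offset, length, existing))
```

In this made-up case the write covers chunks 1 to 3, of which 2 and 3 would be new. If the theory holds, only writes that need new chunks should hang, which would fit the observation that "touch" works while larger writes do not.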

 

The question is: is this behavior indeed expected?

Kind regards,

Romero Junior
Hosting Engineer
LeaseWeb Global Services B.V.

T: +31 20 316 0230
M: +31 6 2115 9310
E: r.junior@xxxxxxxxxxxxxxxxxxx
W: www.leaseweb.com

Luttenbergweg 8 1101 EC Amsterdam Netherlands




