Re: Unexpected issues with simulated 'rack' outage

The recovery process is only triggered after 20 minutes; these tests are done before that.

I don’t see any traffic or increased load on any of the remaining nodes.
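
For completeness, the 20-minute window comes from mon_osd_down_out_interval = 1200 in the ceph.conf below. A quick way to double-check the value the monitors are actually running with (just a sketch, assuming the admin socket is reachable on the mon host):

root@mon001:~# ceph daemon mon.mon001 config get mon_osd_down_out_interval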

 

Here is my ceph.conf:

 

[global]

fsid = XXXX

mon_initial_members = mon001, mon002, mon003

mon_host = 10.XXX,10.YYYY,10.ZZZZ

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

filestore_xattr_use_omap = true

public_network = 10.XXXX/24

cluster_network = 172.XXX/24

max_open_files = 131072

osd_pool_default_pg_num = 128

osd_pool_default_pgp_num = 128

osd_pool_default_size = 2

osd_pool_default_min_size = 1

osd_pool_default_crush_rule = 0

mon_osd_down_out_interval = 1200

mon_osd_min_down_reporters = 4

debug_lockdep = 0/0

debug_context = 0/0

debug_crush = 0/0

debug_buffer = 0/0

debug_timer = 0/0

debug_filer = 0/0

debug_objecter = 0/0

debug_rados = 0/0

debug_rbd = 0/0

debug_journaler = 0/0

debug_objectcacher = 0/0

debug_client = 0/0

debug_osd = 0/0

debug_optracker = 0/0

debug_objclass = 0/0

debug_filestore = 0/0

debug_journal = 0/0

debug_ms = 0/0

debug_monc = 0/0

debug_tp = 0/0

debug_auth = 0/0

debug_finisher = 0/0

debug_heartbeatmap = 0/0

debug_perfcounter = 0/0

debug_asok = 0/0

debug_throttle = 0/0

debug_mon = 0/0

debug_paxos = 0/0

debug_rgw = 0/0

 

[osd]

osd_mkfs_type = xfs

osd_mkfs_options_xfs = -f -i size=2048

osd_mount_options_xfs = noatime,largeio,inode64,swalloc

osd_journal_size = 4096

osd_mon_heartbeat_interval = 30

filestore_merge_threshold = 40

filestore_split_multiple = 8

osd_op_threads = 8

filestore_op_threads = 8

filestore_max_sync_interval = 5

osd_max_scrubs = 1

osd_recovery_max_active = 5

osd_max_backfills = 2

osd_recovery_op_priority = 2

osd_recovery_max_chunk = 1048576

osd_recovery_threads = 1

osd_objectstore = filestore

osd_crush_update_on_start = true
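
For what it's worth, the recovery and backfill tunables above can also be changed at runtime without restarting the OSDs. A rough sketch using the standard injectargs mechanism (the values here are only an illustration, not a recommendation):

root@srv003:~# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'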

 

root@srv003:~# ceph osd pool ls detail

pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags hashpspool stripe_width 0

 

 

 

From: Saverio Proto [mailto:zioproto@xxxxxxxxx]
Sent: Wednesday, 24 June 2015 15:22
To: Romero Junior
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: [ceph-users] Unexpected issues with simulated 'rack' outage

 

You don't have to wait, but the recovery process will be very heavy and it will have an impact on performance. The impact can be catastrophic, as you are experiencing.

After removing one rack, the CRUSH algorithm will run again on the available resources and map the PGs to the remaining OSDs. You lost 33% of your OSDs, so it will be a big change.

This means that you will not only have to re-create the copies that were on the OSDs that are out of your cluster, but also move around a lot of objects that are now misplaced.
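
For example, while the rack is down you can watch how much data is degraded or misplaced with the standard status commands (nothing special here, just the usual CLI):

ceph -s
ceph health detail
ceph pg dump_stuck unclean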

It would also be nice to see your crushmap, because you are not using the default one. A conceptual bug in the crushmap could leave the cluster in a degraded state forever. For example, a rule that places copies only on different racks, combined with a pool that wants 3 copies while only 2 racks are available, is such a conceptual bug.
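
If you want to share the crushmap, it can be extracted and decompiled like this (assuming crushtool is installed; the file names are arbitrary):

ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt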

Saverio

 



 

2015-06-24 15:11 GMT+02:00 Romero Junior <r.junior@xxxxxxxxxxxxxxxxxxx>:

If I have a replica of each object on the other racks why should I have to wait for any recovery time? The failure should not impact my virtual machines.

 

From: Saverio Proto [mailto:zioproto@xxxxxxxxx]
Sent: Wednesday, 24 June 2015 14:54
To: Romero Junior
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: [ceph-users] Unexpected issues with simulated 'rack' outage

 

Hello Romero,

I am still a beginner with Ceph, but as far as I understand, Ceph is not designed to lose 33% of the cluster at once and recover rapidly. As I understand it, you are losing 33% of the cluster by losing 1 rack out of 3. It will take a very long time to recover before you have HEALTH_OK status again.

Can you check with ceph -w how long it takes for Ceph to converge to a healthy cluster after you switch off the switch in Rack-A?
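
Something like the following is enough to get a rough timeline, since ceph -w timestamps every cluster log entry (the log file name is arbitrary):

ceph -w | tee rack-a-outage.log
ceph status     # run periodically in a second terminal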

 

Saverio

 

2015-06-24 14:44 GMT+02:00 Romero Junior <r.junior@xxxxxxxxxxxxxxxxxxx>:

Hi,

 

We are setting up a test environment using Ceph as the main storage solution for our QEMU-KVM virtualization platform, and everything works fine except for the following:

 

When I simulate a failure by powering off the switches in one of our three racks, my virtual machines get into a weird state; this illustration might help you understand what is going on: http://i.imgur.com/clBApzK.jpg

 

The PGs are distributed based on racks; we are not using the default CRUSH rules.
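
For reference, a rack-based replicated rule usually looks roughly like this in the decompiled crushmap (this is only an illustration of the general shape, not our exact rule):

rule replicated_rack {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}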

 

The number of PGs is the following:

 

root@srv003:~# ceph osd pool ls detail

pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags hashpspool stripe_width 0

 

QEMU talks directly to Ceph through librbd; the disk is configured as follows:

 

    <disk type='network' device='disk'>

      <driver name='qemu' type='raw' cache='writeback'/>

      <auth username='libvirt'>

        <secret type='ceph' uuid='0d32bxxxyyyzzz47073a965'/>

      </auth>

      <source protocol='rbd' name='libvirt-pool/ceph-vm-automated'>

        <host name='10.XX.YY.1' port='6789'/>

        <host name='10.XX.YY.2' port='6789'/>

        <host name='10.XX.YY.2' port='6789'/>

      </source>

      <target dev='vda' bus='virtio'/>

      <alias name='virtio-disk25'/>

      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>

    </disk>
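
From the hypervisor, the same credentials can be used to check whether the image is still reachable while the rack is down (a sketch; it assumes the libvirt cephx keyring is available to the rbd CLI on that host):

rbd --id libvirt -p libvirt-pool info ceph-vm-automated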

 

 

As mentioned, it's not a real read-only state: I can "touch" files and even log in on the affected virtual machines (by the way, all of them are affected). However, a simple 'dd' (count=10 bs=1MB conv=fdatasync) hangs forever, and a 3 GB file download (via wget/curl) usually stalls after the first few hundred megabytes and only resumes once I power the "failed" rack back on. Everything goes back to normal as soon as the rack is powered on again.
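
For reference, the write test that hangs is essentially the following (the input source and output path are only for illustration; the bs/count/fdatasync options are the ones described above):

dd if=/dev/zero of=/root/ddtest bs=1MB count=10 conv=fdatasync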

 

For reference, each rack contains 33 nodes, and each node contains 3 OSDs (1.5 TB each).
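
To put that in numbers: 3 racks × 33 nodes × 3 OSDs = 297 OSDs in total, so powering off one rack takes out 99 OSDs, exactly one third of the cluster.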

 

On the virtual machine, after recovering the rack, I can see the following messages on /var/log/kern.log:

 

[163800.444146] INFO: task jbd2/vda1-8:135 blocked for more than 120 seconds.

[163800.444260]       Not tainted 3.13.0-55-generic #94-Ubuntu

[163800.444295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[163800.444346] jbd2/vda1-8     D ffff88007fd13180     0   135      2 0x00000000

[163800.444354]  ffff880036d3bbd8 0000000000000046 ffff880036a4b000 ffff880036d3bfd8

[163800.444386]  0000000000013180 0000000000013180 ffff880036a4b000 ffff88007fd13a18

[163800.444390]  ffff88007ffc69d0 0000000000000002 ffffffff811efa80 ffff880036d3bc50

[163800.444396] Call Trace:

[163800.444420]  [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50

[163800.444426]  [<ffffffff817279bd>] io_schedule+0x9d/0x140

[163800.444432]  [<ffffffff811efa8e>] sleep_on_buffer+0xe/0x20

[163800.444437]  [<ffffffff81727e42>] __wait_on_bit+0x62/0x90

[163800.444442]  [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50

[163800.444447]  [<ffffffff81727ee7>] out_of_line_wait_on_bit+0x77/0x90

[163800.444455]  [<ffffffff810ab300>] ? autoremove_wake_function+0x40/0x40

[163800.444461]  [<ffffffff811f0dba>] __wait_on_buffer+0x2a/0x30

[163800.444470]  [<ffffffff8128be4d>] jbd2_journal_commit_transaction+0x185d/0x1ab0

[163800.444477]  [<ffffffff8107562f>] ? try_to_del_timer_sync+0x4f/0x70

[163800.444484]  [<ffffffff8129017d>] kjournald2+0xbd/0x250

[163800.444490]  [<ffffffff810ab2c0>] ? prepare_to_wait_event+0x100/0x100

[163800.444496]  [<ffffffff812900c0>] ? commit_timeout+0x10/0x10

[163800.444502]  [<ffffffff8108b702>] kthread+0xd2/0xf0

[163800.444507]  [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0

[163800.444513]  [<ffffffff81733ca8>] ret_from_fork+0x58/0x90

[163800.444517]  [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0

 

A few theories about this behavior were mentioned on #Ceph (OFTC):

 

[14:09] <Be-El> RomeroJnr: i think the problem is the fact that you write to parts of the rbd that have not been accessed before

[14:09] <Be-El> RomeroJnr: ceph does thin provisioning; each rbd is striped into chunks of 4 mb. each stripe is put into one pgs

[14:10] <Be-El> RomeroJnr: if you access formerly unaccessed parts of the rbd, a new stripe is created. and this probably fails if one of the racks is down

[14:10] <Be-El> RomeroJnr: but that's just a theory...maybe some developer can comment on this later

[14:21] <Be-El> smerz: creating an object in a pg might be different than writing to an object

[14:21] <Be-El> smerz: with one rack down ceph cannot satisfy the pg requirements in RomeroJnr's case

[14:22] <smerz> i can only agree with you. that i would expect other behaviour
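
If the striping theory is right, it should be possible to check where an individual 4 MB object of the image maps while the rack is down (a sketch; the rbd_data prefix below is a placeholder, the real block_name_prefix is shown by rbd info):

rbd -p libvirt-pool info ceph-vm-automated
ceph osd map libvirt-pool rbd_data.XXXX.0000000000000000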

 

The question is: is this behavior indeed expected?

Kind regards,

Romero Junior
Hosting Engineer
LeaseWeb Global Services B.V.

T: +31 20 316 0230
M: +31 6 2115 9310
E: r.junior@xxxxxxxxxxxxxxxxxxx
W: www.leaseweb.com

 

Luttenbergweg 8, 

1101 EC Amsterdam, 

Netherlands

 

LeaseWeb is the brand name under which the various independent LeaseWeb companies operate. Each company is a separate and distinct entity that provides services in a particular geographic area. LeaseWeb Global Services B.V. does not provide third-party services. Please see www.leaseweb.com/en/legal for more information.

 

 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
