Hi Ilya,

We routinely hit the RBD client crash when OSD nodes go away in our setups. We have not been able to capture a good stack trace for this one because of console capture issues on our side, and the traces do not end up in syslog after the crash either; we will get you the traces soon. Most of the time this happens when all the OSD nodes go away at once. Could this already have been fixed by one of the following commits?

    libceph: change from BUG to WARN for __remove_osd() asserts
    Ilya Dryomov (idryomov), authored on Nov 5, commit cc9f1f5

    libceph: clear r_req_lru_item in __unregister_linger_request()
    Ilya Dryomov (idryomov), authored on Nov 5, commit ba9d114

    libceph: unlink from o_linger_requests when clearing r_osd
    Ilya Dryomov (idryomov), authored on Nov 4, commit a390de0
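For our own reference, here is a rough sketch of what the first of those commits appears to do. The function and field names follow net/ceph/osd_client.c, but this is reconstructed from the commit subject, not copied from the upstream diff, so the exact body may differ:

    /*
     * Sketch only, based on the subject of cc9f1f5 ("libceph: change from
     * BUG to WARN for __remove_osd() asserts"), not the verbatim diff.
     *
     * With BUG_ON(), a request still linked to the OSD being removed
     * panics the client; WARN_ON() logs a backtrace and lets the client
     * keep running, which matters when OSD nodes bounce.
     */
    static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
    {
            dout("%s %p osd%d\n", __func__, osd, osd->o_osd);

            WARN_ON(!list_empty(&osd->o_requests));         /* was BUG_ON() */
            WARN_ON(!list_empty(&osd->o_linger_requests));  /* was BUG_ON() */

            rb_erase(&osd->o_node, &osdc->osds);
            list_del_init(&osd->o_osd_lru);
            ceph_con_close(&osd->o_con);
            put_osd(osd);
    }

If that reading is correct, it would at least turn the hard crash we see into a warning plus a client that stays up.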
Also, we have encountered a few other issues, listed below.

(1) Soft lockup issue

Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup - CPU#2 stuck for 22s! [java:29169] --- (vdbench process)
.
.
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.514121] Code: ff ff 48 89 df e8 e3 f1 ff ff 48 8b 7d a8 e8 7a 8c 0e e1 48 8b 7d b0 e8 41 d8 a7 e0 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7 c6 d0 76 64 a0 48 c7
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.663443] RIP [<ffffffffa063340e>] osd_reset+0x22e/0x2c0 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.712105] RSP <ffff880a22b8bd80>

(2) Soft lockup when OSDs are flapping

Dec 18 18:25:10 rack3-client-2 kernel: [157126.089489] BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:0:45012]
.
.
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098648] Call Trace:
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098653] [<ffffffffa030d963>] kick_requests+0x1e3/0x440 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098657] [<ffffffffa030df98>] ceph_osdc_handle_map+0x2a8/0x620 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098662] [<ffffffffa030e55b>] dispatch+0x24b/0xb20 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098665] [<ffffffffa0301c08>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098669] [<ffffffffa030552f>] con_work+0x164f/0x2b60 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098672] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098674] [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098676] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098679] [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098681] [<ffffffff8109df6d>] ? vtime_common_task_switch+0x3d/0x40
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098684] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098686] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098688] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098690] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098692] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098695] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098697] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0

(3) BUG_ON(!list_empty(&req->r_req_lru_item));

Dec 4 17:14:33 rack6-ramp-4 kernel: [320359.828209] kernel BUG at /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
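The BUG at osd_client.c:892 looks like a lingering request that is still linked through r_req_lru_item when it is torn down, which is what the second commit above (ba9d114) seems to address. A rough sketch of the idea follows; the placement of the added unlink is our assumption, not the upstream diff:

    /*
     * Sketch of the idea behind ba9d114 ("libceph: clear r_req_lru_item in
     * __unregister_linger_request()"); the exact upstream change may place
     * the unlink differently.
     *
     * kick_requests() can park a linger request on the notarget list via
     * r_req_lru_item.  If the request is later unregistered without being
     * taken off that list, BUG_ON(!list_empty(&req->r_req_lru_item)) fires,
     * matching the osd_client.c:892 crash above.
     */
    static void __unregister_linger_request(struct ceph_osd_client *osdc,
                                            struct ceph_osd_request *req)
    {
            list_del_init(&req->r_linger_item);

            if (req->r_osd) {
                    list_del_init(&req->r_linger_osd_item);
                    /* existing "move idle osd to the lru" handling elided */
                    req->r_osd = NULL;
            }

            /* assumed addition: drop the request from the notarget/lru list
             * before it is put, so the assert cannot trip later */
            list_del_init(&req->r_req_lru_item);

            ceph_osdc_put_request(req);
    }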
(4) img_request is NULL

Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] Assertion failure in rbd_img_obj_callback() at line 2127:
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] rbd_assert(img_request != NULL);
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.257322] [<ffffffffa01a5897>] rbd_obj_request_complete+0x27/0x70 [rbd]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.268450] [<ffffffffa01a8d4f>] rbd_osd_req_callback+0xdf/0x4e0 [rbd]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.279182] [<ffffffffa039e262>] dispatch+0x4a2/0x900 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.289159] [<ffffffffa039494b>] try_read+0x4ab/0x10d0 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.299236] [<ffffffffa0396362>] ? try_write+0xa42/0xe30 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.309777] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.318627] [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.332347] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.341095] [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.351061] [<ffffffffa0396809>] con_work+0xb9/0x640 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.361003] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.370752] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.379816] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.389173] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.396898] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.407506] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.416181] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0

This is similar to http://tracker.ceph.com/issues/8378.

We saw that the rhel7a ceph-client branch has many of the latest fixes and is largely compatible with 3.13 kernels. For validation we took the rhel7a branch and, with minor modifications, got it to compile against the 3.13.0 headers. With this we did not hit any of the issues above except issue (2). We understand that this is not the right approach for Ubuntu; it would be great if we could get these fixes into the Ubuntu 14.04 kernels as well.

Regards,
Chaitanya

-----Original Message-----
From: Somnath Roy
Sent: Tuesday, January 06, 2015 2:38 AM
To: Ilya Dryomov
Cc: Chaitanya Huilgol; ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

It's happening both when idle and under load. I don't have the trace right now but will get you one soon.

Thanks & Regards
Somnath

-----Original Message-----
From: Ilya Dryomov [mailto:ilya.dryomov@xxxxxxxxxxx]
Sent: Monday, January 05, 2015 12:34 PM
To: Somnath Roy
Cc: Chaitanya Huilgol; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Mon, Jan 5, 2015 at 11:01 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> Ilya,
> Here are the steps:
>
> 1. You have a cluster (3 nodes) and the replication factor is 3.
>
> 2. Map a krbd image to a client.
>
> 3. Reboot, or stop the ceph services, on one or more nodes.
>
> 4. The client with the krbd image mapped crashes.

Is it idle or under load?  Do you have a trace of the crash?

Thanks,

                Ilya