RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

Hi Ilya,

We routinely hit the RBD client crash when OSD nodes go away in our setups.
We have not been able to capture a good stack trace for this one because of console-capture issues on our side, and the traces do not make it into syslog after the crash either. We will get them to you soon.
Most of the time this happens when all the OSD nodes go away at once. Could this have been fixed by one of the following commits?

cc9f1f5  libceph: change from BUG to WARN for __remove_osd() asserts  (Ilya Dryomov, Nov 5)
ba9d114  libceph: clear r_req_lru_item in __unregister_linger_request()  (Ilya Dryomov, Nov 5)
a390de0  libceph: unlink from o_linger_requests when clearing r_osd  (Ilya Dryomov, Nov 4)
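
For reference, the first of these (cc9f1f5) describes turning the __remove_osd() asserts from BUG into WARN. Below is a rough kernel-style sketch of that general pattern only; the function body and the bail-out behaviour are illustrative and not the actual libceph diff:

/* Illustrative sketch of the BUG -> WARN pattern: instead of
 * panicking the whole client when an internal invariant is
 * violated, warn and bail out of the teardown so the machine
 * stays up.  Requires <linux/bug.h> and <linux/list.h>; the
 * function name is hypothetical.
 */
static void remove_osd_sketch(struct ceph_osd *osd)
{
        /* old style: BUG_ON(!list_empty(&osd->o_requests)); */
        if (WARN_ON(!list_empty(&osd->o_requests)))
                return;         /* leave the OSD registered, just complain */
        if (WARN_ON(!list_empty(&osd->o_linger_requests)))
                return;

        /* ... continue with the actual removal ... */
}

Operationally we would much prefer a loud warning of this kind to a full client crash when OSDs flap.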

We have also encountered a few other issues, listed below:

(1) Soft Lockup issue
Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup - CPU#2 stuck for 22s! [java:29169] --- (vdbench process)
.
.
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.514121] Code: ff ff 48 89 df e8 e3 f1 ff ff 48 8b 7d a8 e8 7a 8c 0e e1 48 8b 7d b0 e8 41 d8 a7 e0 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7 c6 d0 76 64 a0 48 c7
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.663443] RIP [<ffffffffa063340e>] osd_reset+0x22e/0x2c0 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.712105] RSP <ffff880a22b8bd80>

(2) Soft lockup when OSDs are flapping

Dec 18 18:25:10 rack3-client-2 kernel: [157126.089489] BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:0:45012]
.
.
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098648] Call Trace:
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098653] [<ffffffffa030d963>] kick_requests+0x1e3/0x440 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098657] [<ffffffffa030df98>] ceph_osdc_handle_map+0x2a8/0x620 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098662] [<ffffffffa030e55b>] dispatch+0x24b/0xb20 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098665] [<ffffffffa0301c08>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098669] [<ffffffffa030552f>] con_work+0x164f/0x2b60 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098672] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098674] [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098676] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098679] [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098681] [<ffffffff8109df6d>] ? vtime_common_task_switch+0x3d/0x40
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098684] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098686] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098688] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098690] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098692] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098695] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098697] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
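
Our working theory for (1) and (2), based on the con_work -> ceph_osdc_handle_map -> kick_requests path in the traces, is that the worker walks every in-flight request in a single pass when a new osdmap arrives, and with enough outstanding requests that walk exceeds the soft-lockup watchdog threshold (the 22s/23s reported above). A minimal sketch of that pattern follows; all names are hypothetical and this is not the libceph code:

/* Hypothetical sketch of a worker that walks all in-flight
 * requests in one pass.  If the list is long and there is no
 * scheduling point inside the loop, the soft-lockup watchdog
 * fires on that CPU.  Requires <linux/list.h> and <linux/sched.h>.
 */
struct sketch_req {
        struct list_head r_item;
        /* ... */
};

static void kick_all_requests_sketch(struct list_head *requests)
{
        struct sketch_req *req;

        list_for_each_entry(req, requests, r_item) {
                /* recompute mapping, resend, etc. for each request */

                /* a mitigation would need a cond_resched() here
                 * (only legal if no spinlock is held) or batching,
                 * so the CPU is yielded periodically */
        }
}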

(3)  BUG_ON(!list_empty(&req->r_req_lru_item));

Dec 4 17:14:33 rack6-ramp-4 kernel: [320359.828209] kernel BUG at /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
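
For (3), the BUG_ON at osd_client.c:892 is the r_req_lru_item assert quoted above, and the description of ba9d114 (clear r_req_lru_item in __unregister_linger_request()) suggests the linger request is left on an LRU list when it is unregistered, so a later re-registration finds the list node still linked. A rough sketch of that kind of cleanup, not the actual patch:

/* Hypothetical sketch of the cleanup ba9d114 describes: take the
 * request off the LRU list when the linger request is torn down,
 * so re-registering it later does not trip
 * BUG_ON(!list_empty(&req->r_req_lru_item)).
 */
static void unregister_linger_request_sketch(struct ceph_osd_request *req)
{
        list_del_init(&req->r_linger_item);
        list_del_init(&req->r_req_lru_item);    /* the step that was missing */
        /* ... drop the reference, clear the r_osd linkage, etc. ... */
}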

(4) img_request NULL assertion in rbd_img_obj_callback()
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] Assertion failure in rbd_img_obj_callback() at line 2127:
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]     rbd_assert(img_request != NULL);

Dec 12 08:07:50 rack1-ram-6 kernel: [251597.257322]  [<ffffffffa01a5897>] rbd_obj_request_complete+0x27/0x70 [rbd]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.268450]  [<ffffffffa01a8d4f>] rbd_osd_req_callback+0xdf/0x4e0 [rbd]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.279182]  [<ffffffffa039e262>] dispatch+0x4a2/0x900 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.289159]  [<ffffffffa039494b>] try_read+0x4ab/0x10d0 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.299236]  [<ffffffffa0396362>] ? try_write+0xa42/0xe30 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.309777]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.318627]  [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.332347]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.341095]  [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.351061]  [<ffffffffa0396809>] con_work+0xb9/0x640 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.361003]  [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.370752]  [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.379816]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.389173]  [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.396898]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.407506]  [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.416181]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
This is similar to: http://tracker.ceph.com/issues/8378

We saw that the rhel7a ceph-client branch has many of the latest fixes and is reasonably compatible with 3.13 kernels. For validation, we took the rhel7a branch and, with minor modifications, got it to compile against the 3.13.0 headers. With this build we did not hit any of the issues above except issue (2).
We understand this is not the right approach for Ubuntu, so it would be great if these fixes could land in the Ubuntu 14.04 kernels as well.

Regards,
Chaitanya

-----Original Message-----
From: Somnath Roy
Sent: Tuesday, January 06, 2015 2:38 AM
To: Ilya Dryomov
Cc: Chaitanya Huilgol; ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

It's happening both when the cluster is idle and under load.
I don't have the trace right now but will get you one soon.

Thanks & Regards
Somnath

-----Original Message-----
From: Ilya Dryomov [mailto:ilya.dryomov@xxxxxxxxxxx]
Sent: Monday, January 05, 2015 12:34 PM
To: Somnath Roy
Cc: Chaitanya Huilgol; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Mon, Jan 5, 2015 at 11:01 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> Ilya,
> Here are the steps:
>
> 1. You have a cluster (3 nodes) with replication set to 3.
>
> 2. Map a krbd image on a client.
>
> 3. Reboot or stop the ceph services on one or more nodes.
>
> 4. The client with the krbd image mapped crashes.

Is it idle or under load?

Do you have a trace of the crash?

Thanks,

                Ilya




