Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Tue, Jan 6, 2015 at 3:31 PM, Chaitanya Huilgol
<Chaitanya.Huilgol@xxxxxxxxxxx> wrote:
> Hi Ilya,
>
> The RBD client crash when OSD nodes go away is routinely hit in our setups.
> We have not been able to get a good stack trace for this one due to our console capture issues, and the traces don't end up in the syslogs after the crash either. We will get you the traces soon.
> Most of the time this happens when all the OSD nodes go away at once. Could this have been fixed by one of the following commits?
>
> Ilya Dryomov
> libceph: change from BUG to WARN for __remove_osd() asserts …
> idryomov authored on Nov 5
> cc9f1f5
> Ilya Dryomov
> libceph: clear r_req_lru_item in __unregister_linger_request() …
> idryomov authored on Nov 5
> ba9d114
> Ilya Dryomov
> libceph: unlink from o_linger_requests when clearing r_osd …
> idryomov authored on Nov 4
> a390de0

Yes, but probably others as well.

>
> Also, we have encountered a few other issues, listed below.
>
> (1) Soft Lockup issue
> Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup - CPU#2 stuck for 22s! [java:29169] --- (vdbench process)
> .
> .
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.514121] Code: ff ff 48 89 df e8 e3 f1 ff ff 48 8b 7d a8 e8 7a 8c 0e e1 48 8b 7d b0 e8 41 d8 a7 e0 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7 c6 d0 76 64 a0 48 c7
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.663443] RIP [<ffffffffa063340e>] osd_reset+0x22e/0x2c0 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.712105] RSP <ffff880a22b8bd80>
>
> (2) Soft lockup when OSDs are flapping
>
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.089489] BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:0:45012]
> .
> .
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098648] Call Trace:
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098653] [<ffffffffa030d963>] kick_requests+0x1e3/0x440 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098657] [<ffffffffa030df98>] ceph_osdc_handle_map+0x2a8/0x620 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098662] [<ffffffffa030e55b>] dispatch+0x24b/0xb20 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098665] [<ffffffffa0301c08>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098669] [<ffffffffa030552f>] con_work+0x164f/0x2b60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098672] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098674] [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098676] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098679] [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098681] [<ffffffff8109df6d>] ? vtime_common_task_switch+0x3d/0x40
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098684] [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098686] [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098688] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098690] [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098692] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098695] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098697] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
>
> (3)  BUG_ON(!list_empty(&req->r_req_lru_item));
>
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320359.828209] kernel BUG at /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
>
> (4) img_request null
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] Assertion failure in rbd_img_obj_callback() at line 2127:
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]     rbd_assert(img_request != NULL);
>
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.257322]  [<ffffffffa01a5897>] rbd_obj_request_complete+0x27/0x70 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.268450]  [<ffffffffa01a8d4f>] rbd_osd_req_callback+0xdf/0x4e0 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.279182]  [<ffffffffa039e262>] dispatch+0x4a2/0x900 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.289159]  [<ffffffffa039494b>] try_read+0x4ab/0x10d0 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.299236]  [<ffffffffa0396362>] ? try_write+0xa42/0xe30 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.309777]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.318627]  [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.332347]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.341095]  [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.351061]  [<ffffffffa0396809>] con_work+0xb9/0x640 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.361003]  [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.370752]  [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.379816]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.389173]  [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.396898]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.407506]  [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.416181]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> This is similar to: http://tracker.ceph.com/issues/8378
>
> We saw that the rhel7-a branch has many of the latest fixes and is somewhat compatible with 3.13 kernels.
> For validation, we took the rhel7-a ceph-client branch and, with minor modifications, got it to compile against the 3.13.0 headers. With this we did not hit any issues (expect issue-2).

What do you mean by "expect issue-2"?

(3) and (4) should be fixed in rhel7-a.  Can't say anything about (1)
and (2) - please report back if you see any soft lockup splats on
rhel7-a.

> We understand that this is not the right approach for Ubuntu; it would be great if we could get the fixes into the Ubuntu 14.04 kernels as well.

It may not be the right approach, but in many ways it's better than
a set of selected backports.  While working on another report I found
a couple of easy-to-backport patches that are missing from the Ubuntu
3.13 series and will forward them to the stable team, but, for those
who can build their own kernels at least, branches like rhel7-a are
the best option.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


