On Tue, Mar 28, 2017 at 3:30 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> On Tue 28-03-17 15:23:58, Ilya Dryomov wrote:
>> On Tue, Mar 28, 2017 at 2:43 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> > On Tue 28-03-17 14:30:45, Greg KH wrote:
>> >> 4.4-stable review patch.  If anyone has any objections, please let me know.
>> >
>> > I haven't seen the original patch but the changelog makes me worried.
>> > How exactly is this a problem?  Where do we lock up?  Does rbd/libceph
>> > take any xfs locks?
>>
>> No, it doesn't.  This is just another instance of "using GFP_KERNEL on
>> the writeback path may lead to a deadlock" with nothing extra to it.
>>
>> XFS is writing out data, the libceph messenger worker tries to open
>> a socket and recurses back into XFS because the sockfs inode is
>> allocated with GFP_KERNEL.  The message carrying some of that data
>> never goes out and eventually we get a deadlock.
>>
>> I've only included the offending stack trace.  I guess I should have
>> stressed that the ceph-msgr workqueue is used for reclaim.
>
> Could you be more specific about the lockup scenario?  I still do not
> get how this would lead to a deadlock.

This is a set of stack traces from http://tracker.ceph.com/issues/19309
(linked in the changelog):

Workqueue: ceph-msgr con_work [libceph]
 ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
 ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
Call Trace:
 [<ffffffff816dd629>] schedule+0x29/0x70
 [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
 [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
 [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
 [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
 [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
 [<ffffffff81086335>] flush_work+0x165/0x250
 [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
 [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
 [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
 [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
 [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
 [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
 [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
 [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
 [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
 [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
 [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
 [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
 [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
 [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
 [<ffffffff811c0c18>] super_cache_scan+0x178/0x180
 [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
 [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
 [<ffffffff8115af70>] shrink_slab+0x100/0x140
 [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
 [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
 [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
 [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
 [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
 [<ffffffff811a0ac5>] new_slab+0x2c5/0x390
 [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
 [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
 [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
 [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
 [<ffffffff811d8566>] alloc_inode+0x26/0xa0
 [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
 [<ffffffff815b933e>] sock_alloc+0x1e/0x80
 [<ffffffff815ba855>] __sock_create+0x95/0x220
 [<ffffffff815baa04>] sock_create_kern+0x24/0x30
 [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
 [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
 [<ffffffff81084c19>] process_one_work+0x159/0x4f0
 [<ffffffff8108561b>] worker_thread+0x11b/0x530
 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
 [<ffffffff8108b6f9>] kthread+0xc9/0xe0
 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90

We are writing out data on ceph_connection X:

ceph_con_workfn
  mutex_lock(&con->mutex)  # ceph_connection::mutex
  try_write
    ceph_tcp_connect
      sock_create_kern
        GFP_KERNEL allocation
          allocator recurses into XFS, more I/O is issued
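For completeness, the allocation at the bottom of that trace is the
sockfs inode.  A simplified sketch of sock_alloc_inode() from
net/socket.c (details vary across kernel versions, error handling and
wait queue setup elided):

    static struct inode *sock_alloc_inode(struct super_block *sb)
    {
            struct socket_alloc *ei;

            /* GFP_KERNEL includes __GFP_FS, so direct reclaim is
             * allowed to enter super_cache_scan() and recurse into
             * XFS, exactly as in the trace above. */
            ei = kmem_cache_alloc(sock_inode_cachep, GFP_KERNEL);
            if (!ei)
                    return NULL;
            /* ... struct socket initialization ... */
            return &ei->vfs_inode;
    }

The gfp flags are hardcoded there, so the caller can't lower them from
the outside -- presumably why the fix has to be on the libceph side.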
Workqueue: rbd rbd_request_workfn [rbd]
 ffff880047a83b38 0000000000000046 ffff881025350c00 ffff8800383fa9e0
 0000000000012b00 0000000000000000 ffff880047a83fd8 0000000000012b00
 ffff88014b638860 ffff8800383fa9e0 ffff880047a83b38 ffff8810878dc1b8
Call Trace:
 [<ffffffff816dd629>] schedule+0x29/0x70
 [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
 [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
 [<ffffffffa048ad66>] ? ceph_str_hash+0x26/0x80 [libceph]
 [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
 [<ffffffffa04784fd>] ceph_con_send+0x4d/0x130 [libceph]
 [<ffffffffa047d3f0>] __send_queued+0x120/0x150 [libceph]
 [<ffffffffa047fe7b>] __ceph_osdc_start_request+0x5b/0xd0 [libceph]
 [<ffffffffa047ff41>] ceph_osdc_start_request+0x51/0x80 [libceph]
 [<ffffffffa04a8050>] rbd_obj_request_submit.isra.27+0x10/0x20 [rbd]
 [<ffffffffa04aa6de>] rbd_img_obj_request_submit+0x23e/0x500 [rbd]
 [<ffffffffa04aa9ec>] rbd_img_request_submit+0x4c/0x60 [rbd]
 [<ffffffffa04ab3d5>] rbd_request_workfn+0x305/0x410 [rbd]
 [<ffffffff81084c19>] process_one_work+0x159/0x4f0
 [<ffffffff8108561b>] worker_thread+0x11b/0x530
 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
 [<ffffffff8108b6f9>] kthread+0xc9/0xe0
 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90

Here is that I/O.  We grab ceph_osd_client::request_mutex, but
ceph_connection::mutex is being held by the worker that recursed into
XFS:

rbd_queue_workfn
  ceph_osdc_start_request
    mutex_lock(&osdc->request_mutex)
    ceph_con_send
      mutex_lock(&con->mutex)  # deadlock
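As for why GFP_KERNEL specifically is the culprit, the allocation
context rules are encoded in the gfp flags themselves (simplified from
include/linux/gfp.h in 4.4-era kernels):

    #define GFP_NOIO    (__GFP_RECLAIM)                        /* no I/O, no FS recursion */
    #define GFP_NOFS    (__GFP_RECLAIM | __GFP_IO)             /* I/O ok, no FS recursion */
    #define GFP_KERNEL  (__GFP_RECLAIM | __GFP_IO | __GFP_FS)  /* may recurse into the FS */

Only GFP_KERNEL carries __GFP_FS, which is what lets direct reclaim
call into super_cache_scan() and XFS inode reclaim in the first trace.
The remaining two stuck workers: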
Workqueue: ceph-msgr con_work [libceph]
 ffff88014a89fc08 0000000000000046 ffff88014a89fc18 ffff88013a2d90c0
 0000000000012b00 0000000000000000 ffff88014a89ffd8 0000000000012b00
 ffff880015a210c0 ffff88013a2d90c0 0000000000000000 ffff882028a84798
Call Trace:
 [<ffffffff816dd629>] schedule+0x29/0x70
 [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
 [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
 [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
 [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
 [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
 [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
 [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
 [<ffffffff81084c19>] process_one_work+0x159/0x4f0
 [<ffffffff8108561b>] worker_thread+0x11b/0x530
 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
 [<ffffffff8108b6f9>] kthread+0xc9/0xe0
 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90

Workqueue: ceph-msgr con_work [libceph]
 ffff88014c10fc08 0000000000000046 ffff88013a2d9988 ffff88013a2d9920
 0000000000012b00 0000000000000000 ffff88014c10ffd8 0000000000012b00
 ffffffff81c1b4a0 ffff88013a2d9920 0000000000000000 ffff882028a84798
Call Trace:
 [<ffffffff816dd629>] schedule+0x29/0x70
 [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
 [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
 [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
 [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
 [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
 [<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
 [<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
 [<ffffffff81084c19>] process_one_work+0x159/0x4f0
 [<ffffffff8108561b>] worker_thread+0x11b/0x530
 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
 [<ffffffff8108b6f9>] kthread+0xc9/0xe0
 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90

These two are replies on ceph_connections Y and Z, which need
ceph_osd_client::request_mutex to figure out which requests can be
completed:

alloc_msg
  get_reply
    mutex_lock(&osdc->request_mutex)

Eventually everything else blocks on ceph_osd_client::request_mutex,
since it's used for both submitting requests and handling replies.

This really is a straightforward "using GFP_KERNEL on the writeback
path isn't allowed" case.  I'm not sure what made you worried here.

Thanks,

                Ilya
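P.S.  For the record, a minimal sketch of the scoped-NOIO approach --
my assumption of roughly what the patch does, the actual code may
differ ("net" and "family" stand in for the connection's namespace and
address family):

    struct socket *sock;
    unsigned int noio_flag;
    int ret;

    /*
     * PF_MEMALLOC_NOIO makes every allocation in this window behave
     * as GFP_NOIO, including the sockfs inode allocated deep inside
     * sock_create_kern(), so reclaim can't issue new I/O or recurse
     * into the filesystem.
     */
    noio_flag = memalloc_noio_save();
    ret = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP, &sock);
    memalloc_noio_restore(noio_flag);

Passing GFP_NOIO down explicitly isn't an option here because the
offending allocation is buried in core networking code; the scoped
process flag is the only way to reach it.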