XFS on RBD deadlock

Hi all,

We have encountered a deadlock with XFS on RBD on one of our development
compile servers.
The system has multiple XFS filesystem instances mounted on both local
storage and RBD. Under load the system comes under sustained high memory
pressure.
The deadlock was only seen on the XFS-on-RBD instance; the other filesystem
instances on local storage remained unaffected.

We were able to collect some backtraces; a trimmed version with the (in our
opinion) interesting bits, which seems to illustrate the deadlock pretty
well, is included below. The backtraces are from a 4.8 kernel, but the
current code looks mostly unchanged.

1. One thread causes memory pressure, and the VM tries to free memory
through the XFS superblock shrinker, which needs to force out the log
before it can get rid of some of the cached inodes.

cc1             D ffff92243fad8180     0  6772   6770 0x00000080
ffff9224d107b200 ffff922438de2f40 ffff922e8304fed8 ffff9224d107b200
ffff922ea7554000 ffff923034fb0618 0000000000000000 ffff9224d107b200
ffff9230368e5400 ffff92303788b000 ffffffff951eb4e1 0000003e00095bc0
Call Trace:
[<ffffffff951eb4e1>] ? schedule+0x31/0x80
[<ffffffffc0ab0570>] ? _xfs_log_force_lsn+0x1b0/0x340 [xfs]
[<ffffffff94ca5790>] ? wake_up_q+0x60/0x60
[<ffffffffc0a9f7ff>] ? __xfs_iunpin_wait+0x9f/0x160 [xfs]
[<ffffffffc0ab0730>] ? xfs_log_force_lsn+0x30/0xb0 [xfs]
[<ffffffffc0a97041>] ? xfs_reclaim_inode+0x131/0x370 [xfs]
[<ffffffffc0a9f7ff>] ? __xfs_iunpin_wait+0x9f/0x160 [xfs]
[<ffffffff94cbcf80>] ? autoremove_wake_function+0x40/0x40
[<ffffffffc0a97041>] ? xfs_reclaim_inode+0x131/0x370 [xfs]
[<ffffffffc0a97442>] ? xfs_reclaim_inodes_ag+0x1c2/0x2d0 [xfs]
[<ffffffff94cb197c>] ? enqueue_task_fair+0x5c/0x920
[<ffffffff94c35895>] ? sched_clock+0x5/0x10
[<ffffffff94ca47e0>] ? check_preempt_curr+0x50/0x90
[<ffffffff94ca4834>] ? ttwu_do_wakeup+0x14/0xe0
[<ffffffff94ca53c3>] ? try_to_wake_up+0x53/0x3a0
[<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
[<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
[<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
[<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
[<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
[<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
[<ffffffff94dd918a>] ? alloc_pages_vma+0xba/0x280
[<ffffffff94db0f8b>] ? wp_page_copy+0x45b/0x6c0
[<ffffffff94db3e12>] ? alloc_set_pte+0x2e2/0x5f0
[<ffffffff94db2169>] ? do_wp_page+0x4a9/0x7e0
[<ffffffff94db4bd2>] ? handle_mm_fault+0x872/0x1250
[<ffffffff94c65a53>] ? __do_page_fault+0x1e3/0x500
[<ffffffff951f0cd8>] ? page_fault+0x28/0x30
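
For illustration, here is a heavily simplified sketch of this path (our
reading of the 4.8 sources, not a literal excerpt; the *_sketch helpers are
made up): super_cache_scan() calls the filesystem's ->free_cached_objects()
hook, which for XFS ends up in xfs_reclaim_inodes_nr(), and reclaiming a
pinned inode forces the log and sleeps until that write - which goes to the
RBD device here - has completed.

static long xfs_free_cached_objects_sketch(struct super_block *sb,
					   struct shrink_control *sc)
{
	/* super_cache_scan() -> sb->s_op->free_cached_objects() */
	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
}

static void xfs_reclaim_pinned_inode_sketch(struct xfs_inode *ip)
{
	/*
	 * A pinned inode cannot be freed until its log items are on stable
	 * storage, so reclaim forces the log up to the inode's LSN and
	 * sleeps until that log write has completed.
	 */
	xfs_iunpin_wait(ip);	/* -> xfs_log_force_lsn() -> schedule() */

	/* ... flush the inode and free it ... */
}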

2. An RBD worker is trying to handle the XFS block request, but the crypto
code involved tries to allocate memory with GFP_KERNEL in
__crypto_alloc_tfm and thereby recurses into the XFS superblock shrinker.

kworker/9:3     D ffff92303f318180     0 20732      2 0x00000080
Workqueue: ceph-msgr ceph_con_workfn [libceph]
 ffff923035dd4480 ffff923038f8a0c0 0000000000000001 000000009eb27318
 ffff92269eb28000 ffff92269eb27338 ffff923036b145ac ffff923035dd4480
 00000000ffffffff ffff923036b145b0 ffffffff951eb4e1 ffff923036b145a8
Call Trace:
 [<ffffffff951eb4e1>] ? schedule+0x31/0x80
 [<ffffffff951eb77a>] ? schedule_preempt_disabled+0xa/0x10
 [<ffffffff951ed1f4>] ? __mutex_lock_slowpath+0xb4/0x130
 [<ffffffff951ed28b>] ? mutex_lock+0x1b/0x30
 [<ffffffffc0a974b3>] ? xfs_reclaim_inodes_ag+0x233/0x2d0 [xfs]
 [<ffffffff94d92ba5>] ? move_active_pages_to_lru+0x125/0x270
 [<ffffffff94f2b985>] ? radix_tree_gang_lookup_tag+0xc5/0x1c0
 [<ffffffff94dad0f3>] ? __list_lru_walk_one.isra.3+0x33/0x120
 [<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
 [<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
 [<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
 [<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
 [<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
 [<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
 [<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
 [<ffffffff94ddf42d>] ? cache_grow_begin+0x9d/0x560
 [<ffffffff94ddfb88>] ? fallback_alloc+0x148/0x1c0
 [<ffffffff94ed84e7>] ? __crypto_alloc_tfm+0x37/0x130
 [<ffffffff94de09db>] ? __kmalloc+0x1eb/0x580
 [<ffffffffc09fe2db>] ? crush_choose_firstn+0x3eb/0x470 [libceph]
 [<ffffffff94ed84e7>] ? __crypto_alloc_tfm+0x37/0x130
 [<ffffffff94ed9c19>] ? crypto_spawn_tfm+0x39/0x60
 [<ffffffffc08b30a3>] ? crypto_cbc_init_tfm+0x23/0x40 [cbc]
 [<ffffffff94ed857c>] ? __crypto_alloc_tfm+0xcc/0x130
 [<ffffffff94edcc23>] ? crypto_skcipher_init_tfm+0x113/0x180
 [<ffffffff94ed7cc3>] ? crypto_create_tfm+0x43/0xb0
 [<ffffffff94ed83b0>] ? crypto_larval_lookup+0x150/0x150
 [<ffffffff94ed7da2>] ? crypto_alloc_tfm+0x72/0x120
 [<ffffffffc0a01dd7>] ? ceph_aes_encrypt2+0x67/0x400 [libceph]
 [<ffffffffc09fd264>] ? ceph_pg_to_up_acting_osds+0x84/0x5b0 [libceph]
 [<ffffffff950d40a0>] ? release_sock+0x40/0x90
 [<ffffffff95139f94>] ? tcp_recvmsg+0x4b4/0xae0
 [<ffffffffc0a02714>] ? ceph_encrypt2+0x54/0xc0 [libceph]
 [<ffffffffc0a02b4d>] ? ceph_x_encrypt+0x5d/0x90 [libceph]
 [<ffffffffc0a02bdf>] ? calcu_signature+0x5f/0x90 [libceph]
 [<ffffffffc0a02ef5>] ? ceph_x_sign_message+0x35/0x50 [libceph]
 [<ffffffffc09e948c>] ? prepare_write_message_footer+0x5c/0xa0 [libceph]
 [<ffffffffc09ecd18>] ? ceph_con_workfn+0x2258/0x2dd0 [libceph]
 [<ffffffffc09e9903>] ? queue_con_delay+0x33/0xd0 [libceph]
 [<ffffffffc09f68ed>] ? __submit_request+0x20d/0x2f0 [libceph]
 [<ffffffffc09f6ef8>] ? ceph_osdc_start_request+0x28/0x30 [libceph]
 [<ffffffffc0b52603>] ? rbd_queue_workfn+0x2f3/0x350 [rbd]
 [<ffffffff94c94ec0>] ? process_one_work+0x160/0x410
 [<ffffffff94c951bd>] ? worker_thread+0x4d/0x480
 [<ffffffff94c95170>] ? process_one_work+0x410/0x410
 [<ffffffff94c9af8d>] ? kthread+0xcd/0xf0
 [<ffffffff951efb2f>] ? ret_from_fork+0x1f/0x40
 [<ffffffff94c9aec0>] ? kthread_create_on_node+0x190/0x190
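
One way to break this recursion (just a sketch of the idea, not a tested
patch; handle_osd_request() is a made-up stand-in for the actual send path)
would be to run the allocating parts of the messenger under
memalloc_noio_save()/memalloc_noio_restore(), which implicitly turns
GFP_KERNEL allocations in that scope into GFP_NOIO so they cannot re-enter
filesystem or block-I/O reclaim:

#include <linux/sched.h>	/* memalloc_noio_save()/restore() on 4.8 */
#include <linux/workqueue.h>

static void ceph_con_workfn_noio_sketch(struct work_struct *work)
{
	unsigned int noio_flags;

	/*
	 * Every allocation below, including the GFP_KERNEL one buried in
	 * __crypto_alloc_tfm(), is now restricted to GFP_NOIO and therefore
	 * cannot recurse into the XFS shrinker that is waiting for exactly
	 * this I/O to finish.
	 */
	noio_flags = memalloc_noio_save();

	handle_osd_request(work);	/* encrypt, sign and send the message */

	memalloc_noio_restore(noio_flags);
}

Another option might be to set up the cipher once when the key is imported,
instead of allocating a new tfm for every message.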

3. This prevents the XFS shrinker from making any progress and also blocks
other tasks that are trying to free memory the same way.

xz              D ffff92303f358180     0  5932   5928 0x00000084
 ffff921a56201180 ffff923038f8ae00 ffff92303788b2c8 0000000000000001
 ffff921e90234000 ffff921e90233820 ffff923036b14eac ffff921a56201180
 00000000ffffffff ffff923036b14eb0 ffffffff951eb4e1 ffff923036b14ea8
Call Trace:
 [<ffffffff951eb4e1>] ? schedule+0x31/0x80
 [<ffffffff951eb77a>] ? schedule_preempt_disabled+0xa/0x10
 [<ffffffff951ed1f4>] ? __mutex_lock_slowpath+0xb4/0x130
 [<ffffffff951ed28b>] ? mutex_lock+0x1b/0x30
 [<ffffffffc0a974b3>] ? xfs_reclaim_inodes_ag+0x233/0x2d0 [xfs]
 [<ffffffff94f2b985>] ? radix_tree_gang_lookup_tag+0xc5/0x1c0
 [<ffffffff94dad0f3>] ? __list_lru_walk_one.isra.3+0x33/0x120
 [<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
 [<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
 [<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
 [<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
 [<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
 [<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
 [<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
 [<ffffffff94dd73b1>] ? alloc_pages_current+0x91/0x140
 [<ffffffff94e0ab98>] ? pipe_write+0x208/0x3f0
 [<ffffffff94e01b08>] ? new_sync_write+0xd8/0x130
 [<ffffffff94e02293>] ? vfs_write+0xb3/0x1a0
 [<ffffffff94e03672>] ? SyS_write+0x52/0xc0
 [<ffffffff94c03b8a>] ? do_syscall_64+0x7a/0xd0
 [<ffffffff951ef9a5>] ? entry_SYSCALL64_slow_path+0x25/0x25
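
The reason everything piles up behind the first reclaimer seems to be the
per-AG serialization in xfs_reclaim_inodes_ag(): thread 1 presumably holds
the per-AG reclaim mutex while it waits for the log write to RBD, so every
other task entering the shrinker for this superblock blocks in mutex_lock()
on the same mutex. A heavily simplified model (again our reading of the 4.8
code, not a literal excerpt):

/* Simplified model of the per-AG serialization seen in traces 2 and 3. */
static void xfs_reclaim_inodes_ag_sketch(struct xfs_perag *pag)
{
	/*
	 * Thread 1 already holds pag->pag_ici_reclaim_lock while
	 * xfs_reclaim_inode() forces the log and waits for the RBD write,
	 * so later reclaimers block right here (__mutex_lock_slowpath in
	 * traces 2 and 3).
	 */
	mutex_lock(&pag->pag_ici_reclaim_lock);

	/* ... walk this AG's in-core inodes and reclaim them, possibly
	 *     waiting on log or metadata I/O ... */

	mutex_unlock(&pag->pag_ici_reclaim_lock);
}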


This might be the same issue as http://tracker.ceph.com/issues/15891, but
since the backtraces there are from an older kernel version it's hard to
tell.

Regards,
Lucas
