Deadlock during memory reclaim path involving sysfs and MD-Raid layers

Bruno Faccini <bfaccini62@xxxxxxxxx> · Thu, 10 May 2018 00:33:55 +0200

Hello,
We recently hit a deadlock with the kernel versions used in RHEL 7.x
distros when running Lustre over MD-Raid devices.
The whole story can be found at https://jira.hpdd.intel.com/browse/LU-10709.
Based on my analysis of live systems and forced crash dumps, the
deadlock scenario can be described as follows.

A user-land thread accesses a tunable in sysfs and, with sysfs_mutex
held, triggers a memory allocation when a new inode is allocated via
alloc_inode().
Since the inode allocation is done with GFP_KERNEL, the registered
memory shrinkers are allowed to run. They can then start new
filesystem operations, or block because a concurrent thread is
already doing so and holds the relevant protection lock.
Either way, whoever starts a filesystem operation eventually reaches
the MD-Raid layer and its device-specific service threads. If an
automatic recovery or a manual check is already in progress for the
MD device concerned, these threads need to report in-progress/completion
status through sysfs_notify(), and they block there because sysfs_mutex
is already held.
Hence the deadlock: no further operation can be started on the
device concerned.

I have been able to get rid of this problem by applying the following
patch to the 3.10.x kernel series shipped with RHEL 7.x distros:
===================================================================================================

bfaccini-mac02:crash-master bfaccini$ cat ~/Documents/JIRAs/LU-10709/sysfs_alloc_inode_GFP_NOFS.patch
As part of the LU-10709 problem/deadlock analysis, it has been
found that user-land processes intensively using sysfs
can cause a deadlock if memory reclaim is triggered while
doing so and, as part of it, FS-specific shrinkers are run
that directly/indirectly involve layers (like MD/Raid)
also relying on sysfs.
To fix this, sysfs inode allocation must no longer use
the generic/GFP_KERNEL way but must be done as GFP_NOFS
to prevent any FS operations from interfering during possible
reclaim.

Signed-off-by: Bruno Faccini <bruno.faccini@xxxxxxxxx>

--- orig/fs/inode.c 2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/inode.c 2018-03-14 09:24:48.533380200 +0000
@@ -73,7 +73,7 @@ struct inodes_stat_t inodes_stat;
 static DEFINE_PER_CPU(unsigned int, nr_inodes);
 static DEFINE_PER_CPU(unsigned int, nr_unused);

-static struct kmem_cache *inode_cachep __read_mostly;
+struct kmem_cache *inode_cachep __read_mostly;

 static int get_nr_inodes(void)
 {
--- orig/fs/sysfs/sysfs.h 2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/sysfs/sysfs.h 2018-03-14 09:24:48.534380233 +0000
@@ -211,6 +211,8 @@ static inline void __sysfs_put(struct sy
  */
 struct inode *sysfs_get_inode(struct super_block *sb, struct sysfs_dirent *sd);
 void sysfs_evict_inode(struct inode *inode);
+extern struct kmem_cache *inode_cachep;
+struct inode *sysfs_alloc_inode(struct super_block *sb);
 int sysfs_sd_setattr(struct sysfs_dirent *sd, struct iattr *iattr);
 int sysfs_permission(struct inode *inode, int mask);
 int sysfs_setattr(struct dentry *dentry, struct iattr *iattr);
--- orig/fs/sysfs/mount.c 2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/sysfs/mount.c 2018-03-14 09:24:48.534380233 +0000
@@ -31,6 +31,7 @@ static const struct super_operations sys
  .statfs = simple_statfs,
  .drop_inode = generic_delete_inode,
  .evict_inode = sysfs_evict_inode,
+ .alloc_inode = sysfs_alloc_inode,
 };

 struct sysfs_dirent sysfs_root = {
--- orig/fs/sysfs/inode.c 2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/sysfs/inode.c 2018-03-14 09:24:48.534380233 +0000
@@ -314,6 +314,17 @@ void sysfs_evict_inode(struct inode *ino
  sysfs_put(sd);
 }

+/*
+ * As a new inode allocation occurs with sysfs_mutex held and memory reclaim
+ * can be triggered doing so, this needs to happen with FS operations disabled
+ * to avoid any deadlock between shrinkers and FS/device layers doing
+ * extensive use of sysfs (like MD/Raid) as part of their operations.
+ */
+struct inode *sysfs_alloc_inode(struct super_block *sb)
+{
+ return kmem_cache_alloc(inode_cachep, GFP_NOFS);
+}
+
 int sysfs_hash_and_remove(struct sysfs_dirent *dir_sd, const void *ns, const char *name)
 {
  struct sysfs_addrm_cxt acxt;
===================================================================================================
This patch forces new sysfs inode allocations to be done with GFP_NOFS
instead of GFP_KERNEL as before.

After browsing the source code of recent 3.x/4.x kernels, I believe the
problem is still there, but now in kernfs rather than sysfs: sysfs now
uses kernfs methods internally, and the same potential deadlock seems
to exist around kernfs_mutex.

Is this problem already known to MD-Raid maintainers and frequent
users, and do my analysis and patch look good?

I have already submitted BZ #199589 to bugzilla.kernel.org for this
issue (see https://bugzilla.kernel.org/show_bug.cgi?id=199589).

Thanks in advance for any answer/help on this. Best regards,
Bruno.