Hello,

We have recently triggered a deadlock with the kernel versions used in RHEL 7.x distros when running Lustre over MD-Raid devices. The whole story can be found at https://jira.hpdd.intel.com/browse/LU-10709.

From my analysis of live and forced crash dumps, the deadlock scenario can be described as follows.

A user-land thread wants to access some tunable in sysfs and, while holding sysfs_mutex, triggers a memory allocation when allocating a new inode via alloc_inode(). Since the inode allocation is done under GFP_KERNEL, the registered memory shrinkers are allowed to run and thus to start new filesystem operations, or to block while doing so because a concurrent thread is already doing the same and owns some protecting lock.

Either way, whoever starts a filesystem operation ends up involving the MD-Raid layer and its device-specific service threads. These can then block on the already-held sysfs_mutex if an automatic recovery or a manual check has been started for the MD device concerned and the in-progress/completion status has to be reported through sysfs_notify(). Hence the deadlock, and no further operation can be started on that device.

I have been able to get rid of this problem by adding the following patch to the 3.10.x kernel series shipped with RHEL 7.x distros:

===================================================================================================
bfaccini-mac02:crash-master bfaccini$ cat ~/Documents/JIRAs/LU-10709/sysfs_alloc_inode_GFP_NOFS.patch
As part of LU-10709 problem/deadlock analysis, it has been found that user-land
processes intensively using sysfs can cause a deadlock if memory reclaim is
triggered while doing so and, as part of it, FS-specific shrinkers are run that
directly or indirectly involve layers (like MD/Raid) also relying on sysfs.

To fix this, sysfs inode allocation must no longer use the generic GFP_KERNEL
path but be done with GFP_NOFS, to prevent any FS operation from interfering
during a possible reclaim.

Signed-off-by: Bruno Faccini <bruno.faccini@xxxxxxxxx>

--- orig/fs/inode.c	2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/inode.c	2018-03-14 09:24:48.533380200 +0000
@@ -73,7 +73,7 @@ struct inodes_stat_t inodes_stat;
 
 static DEFINE_PER_CPU(unsigned int, nr_inodes);
 static DEFINE_PER_CPU(unsigned int, nr_unused);
-static struct kmem_cache *inode_cachep __read_mostly;
+struct kmem_cache *inode_cachep __read_mostly;
 
 static int get_nr_inodes(void)
 {
--- orig/fs/sysfs/sysfs.h	2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/sysfs/sysfs.h	2018-03-14 09:24:48.534380233 +0000
@@ -211,6 +211,8 @@ static inline void __sysfs_put(struct sy
  */
 struct inode *sysfs_get_inode(struct super_block *sb, struct sysfs_dirent *sd);
 void sysfs_evict_inode(struct inode *inode);
+extern struct kmem_cache *inode_cachep;
+struct inode *sysfs_alloc_inode(struct super_block *sb);
 int sysfs_sd_setattr(struct sysfs_dirent *sd, struct iattr *iattr);
 int sysfs_permission(struct inode *inode, int mask);
 int sysfs_setattr(struct dentry *dentry, struct iattr *iattr);
--- orig/fs/sysfs/mount.c	2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/sysfs/mount.c	2018-03-14 09:24:48.534380233 +0000
@@ -31,6 +31,7 @@ static const struct super_operations sys
 	.statfs = simple_statfs,
 	.drop_inode = generic_delete_inode,
 	.evict_inode = sysfs_evict_inode,
+	.alloc_inode = sysfs_alloc_inode,
 };
 
 struct sysfs_dirent sysfs_root = {
--- orig/fs/sysfs/inode.c	2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/sysfs/inode.c	2018-03-14 09:24:48.534380233 +0000
@@ -314,6 +314,17 @@ void sysfs_evict_inode(struct inode *ino
 	sysfs_put(sd);
 }
 
+/*
+ * As a new inode allocation occurs with sysfs_mutex held and memory reclaim
+ * can be triggered doing so, this needs to happen with FS operations disabled
+ * to avoid any deadlock between shrinkers and FS/device layers doing
+ * extensive use of sysfs (like MD/Raid) as part of their operations.
+ */
+struct inode *sysfs_alloc_inode(struct super_block *sb)
+{
+	return kmem_cache_alloc(inode_cachep, GFP_NOFS);
+}
+
 int sysfs_hash_and_remove(struct sysfs_dirent *dir_sd, const void *ns, const char *name)
 {
 	struct sysfs_addrm_cxt acxt;
===================================================================================================

which forces new sysfs inode allocation to be done under GFP_NOFS instead of GFP_KERNEL as before.

After browsing the source code of recent 3.x/4.x kernels, I believe the problem is still there, but now in kernfs instead of sysfs: the latter uses the former's methods internally, and the same potential deadlock seems to exist around kernfs_mutex.

Is this problem already known to MD-Raid maintainers and frequent users, and do my analysis and patch look good?

I have already submitted BZ #199589 to bugzilla.kernel.org for this issue (see https://bugzilla.kernel.org/show_bug.cgi?id=199589).

Thanks in advance for any answer/help on this,
Best regards,
Bruno.