On Thu 19-01-17 09:39:56, Michal Hocko wrote: > On Tue 17-01-17 18:29:25, Jan Kara wrote: > > On Tue 17-01-17 17:16:19, Michal Hocko wrote: > > > > > But before going to play with that I am really wondering whether we need > > > > > all this with no journal at all. AFAIU what Jack told me it is the > > > > > journal lock(s) which is the biggest problem from the reclaim recursion > > > > > point of view. What would cause a deadlock in no journal mode? > > > > > > > > We still have the original problem for why we need GFP_NOFS even in > > > > ext2. If we are in a writeback path, and we need to allocate memory, > > > > we don't want to recurse back into the file system's writeback path. > > > > > > But we do not enter the writeback path from the direct reclaim. Or do > > > you mean something other than pageout()'s mapping->a_ops->writepage? > > > There is only try_to_release_page where we get back to the filesystems > > > but I do not see any NOFS protection in ext4_releasepage. > > > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage() > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0), > > shrinkers. That's it. So the recursion possibilities are rather more limited > > than they used to be several years ago and we likely do not need as much > > GFP_NOFS protection as we used to. > > Thanks for making my remark more clear Jack! I would just want to add > that I was playing with the patch below (it is basically > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the > debugging patch which means they are called from within transaction) and > it didn't hit the lockdep when running xfstests both with or without the > enabled journal. > > So am I still missing something or the nojournal mode is safe and the > current series is OK wrt. ext*? I'm convinced the current series is OK, only real life will tell us whether we missed something or not ;) > The following patch in its current form is WIP and needs a proper review > before I post it. So jbd2 changes look confusing (although technically correct) to me - we *always* should run in NOFS context in those place so having GFP_KERNEL there looks like it is unnecessarily hiding what is going on. So in those places I'd prefer to keep GFP_NOFS or somehow else make it very clear these allocations are expected to be GFP_NOFS (and assert that). Otherwise the changes look good to me. Honza > --- > fs/ext4/inode.c | 4 ++-- > fs/ext4/mballoc.c | 14 +++++++------- > fs/ext4/xattr.c | 2 +- > fs/jbd2/journal.c | 4 ++-- > fs/jbd2/revoke.c | 2 +- > fs/jbd2/transaction.c | 2 +- > 6 files changed, 14 insertions(+), 14 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index b7d141c3b810..841cb8c4cb5e 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page, > return __ext4_journalled_writepage(page, len); > > ext4_io_submit_init(&io_submit, wbc); > - io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS); > + io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL); > if (!io_submit.io_end) { > redirty_page_for_writepage(wbc, page); > unlock_page(page); > @@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t *handle, > int err = 0; > > page = find_or_create_page(mapping, from >> PAGE_SHIFT, > - mapping_gfp_constraint(mapping, ~__GFP_FS)); > + mapping_gfp_mask(mapping)); > if (!page) > return -ENOMEM; > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index d9fd184b049e..67b97cd6e3d6 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group, > static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group, > struct ext4_buddy *e4b) > { > - return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS); > + return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL); > } > > static void ext4_mb_unload_buddy(struct ext4_buddy *e4b) > @@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac, > > /* We only do this if the grp has never been initialized */ > if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { > - int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS); > + int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL); > if (ret) > return ret; > } > @@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac) > BUG_ON(ac->ac_status != AC_STATUS_FOUND); > BUG_ON(!S_ISREG(ac->ac_inode->i_mode)); > > - pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS); > + pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL); > if (pa == NULL) > return -ENOMEM; > > @@ -3694,7 +3694,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac) > BUG_ON(!S_ISREG(ac->ac_inode->i_mode)); > > BUG_ON(ext4_pspace_cachep == NULL); > - pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS); > + pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL); > if (pa == NULL) > return -ENOMEM; > > @@ -4479,7 +4479,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, > } > } > > - ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS); > + ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_KERNEL); > if (!ac) { > ar->len = 0; > *errp = -ENOMEM; > @@ -4813,7 +4813,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode, > > /* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */ > err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b, > - GFP_NOFS|__GFP_NOFAIL); > + GFP_KERNEL|__GFP_NOFAIL); > if (err) > goto error_return; > > @@ -4832,7 +4832,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode, > * to fail. > */ > new_entry = kmem_cache_alloc(ext4_free_data_cachep, > - GFP_NOFS|__GFP_NOFAIL); > + GFP_KERNEL|__GFP_NOFAIL); > new_entry->efd_start_cluster = bit; > new_entry->efd_group = block_group; > new_entry->efd_count = count_clusters; > diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c > index 172317462238..f68e8c87f9f2 100644 > --- a/fs/ext4/xattr.c > +++ b/fs/ext4/xattr.c > @@ -1650,7 +1650,7 @@ ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh) > EXT4_XATTR_REFCOUNT_MAX; > int error; > > - error = mb_cache_entry_create(ext4_mb_cache, GFP_NOFS, hash, > + error = mb_cache_entry_create(ext4_mb_cache, GFP_KERNEL, hash, > bh->b_blocknr, reusable); > if (error) { > if (error == -EBUSY) > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c > index 3a449150f834..bd29daa975a5 100644 > --- a/fs/jbd2/journal.c > +++ b/fs/jbd2/journal.c > @@ -379,7 +379,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction, > */ > J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in)); > > - new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL); > + new_bh = alloc_buffer_head(GFP_KERNEL|__GFP_NOFAIL); > > /* keep subsequent assertions sane */ > atomic_set(&new_bh->b_count, 1); > @@ -2375,7 +2375,7 @@ static struct journal_head *journal_alloc_journal_head(void) > #ifdef CONFIG_JBD2_DEBUG > atomic_inc(&nr_journal_heads); > #endif > - ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_NOFS); > + ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_KERNEL); > if (!ret) { > jbd_debug(1, "out of memory for journal_head\n"); > pr_notice_ratelimited("ENOMEM in %s, retrying.\n", __func__); > diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c > index cfc38b552118..c9c347468c5b 100644 > --- a/fs/jbd2/revoke.c > +++ b/fs/jbd2/revoke.c > @@ -141,7 +141,7 @@ static int insert_revoke_hash(journal_t *journal, unsigned long long blocknr, > { > struct list_head *hash_list; > struct jbd2_revoke_record_s *record; > - gfp_t gfp_mask = GFP_NOFS; > + gfp_t gfp_mask = GFP_KERNEL; > > if (journal_oom_retry) > gfp_mask |= __GFP_NOFAIL; > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c > index 35a5d3d76182..a7e50eb330a8 100644 > --- a/fs/jbd2/transaction.c > +++ b/fs/jbd2/transaction.c > @@ -974,7 +974,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh, > JBUFFER_TRACE(jh, "allocate memory for buffer"); > jbd_unlock_bh_state(bh); > frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size, > - GFP_NOFS | __GFP_NOFAIL); > + GFP_KERNEL | __GFP_NOFAIL); > goto repeat; > } > jh->b_frozen_data = frozen_buffer; > -- > 2.11.0 > > -- > Michal Hocko > SUSE Labs -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html