Re: [syzbot] [ext4?] BUG: sleeping function called from invalid context in ext4_update_super

"Fabio M. De Francesco" <fmdefrancesco@xxxxxxxxx> · Sun, 11 Jun 2023 11:38:07 +0200

On domenica 11 giugno 2023 05:20:32 CEST Theodore Ts'o wrote:
> (Dropping linux-fsdevel and linux-kernel from the cc list.)
> 
> On Sat, Jun 10, 2023 at 10:41:18PM +0200, Fabio M. De Francesco wrote:
> > Well, I'm a new to filesystems. However, I'd like to test a change in
> > ext4_handle_error().
> > 
> > Currently I see that errors are handled according to the next snippet of
> > code
> > 
> > from the above-mentioned function (please note that we are in atomic 
context):
> > 	if (continue_fs && journal)
> > 	
> > 		schedule_work(&EXT4_SB(sb)->s_error_work);
> > 	
> > 	else
> > 	
> > 		ext4_commit_super(sb);
> > 
> > If evaluates false, we directly call ext4_commit_super(), forgetting that,
> > AFAICS we are in atomic context.
> > 
> > As I said I have only little experience with filesystems, so my question 
is:
> > despite the overhead, can we delete the check and do the following?
> > 
> > [ Unconditionally call schedule_work(&EXT4_SB(sb)->s_error_work) ]
> 
> That doesn't work, for the simple reason that it's possible that file
> system might be configured to immediately panic on an error.  (See
> later in the ext4_handle_error() function after the check for
> test_opt(sb, ERRORS_PANIC).  If that happens, the workqueue will never
> have a chance to run.  In that case, we have to call
> ext4_commit_super().
> 
> The real answer here is that ext4_error() must never be called from an
> atomic context, and a recent commit 5354b2af3406 ("ext4: allow
> ext4_get_group_info() to fail") added a call to ext4_error() which is
> problematic since some callers of the ext4_get_group_info() function
> may be holding a spinlock.  And so the best solution is to just simply
> to drop the call to ext4_error(), since it's not strictly necessary.

Ted,

I just responded with another message, so I'm waiting to know from you whether 
or not you are okay with me submitting a patch with your "Suggested by:" tag.

In the meantime, I have had time to think of a different solution that allows 
the workqueue the chance to run even if the file system is configured to 
immediately panic on error (sorry, no code - I'm in a hurry)...

This can let you leave that call to ext4_error() that commit 5354b2af3406 
("ext4: allow ext4_get_group_info()") had introduced (if it works - please 
keep on reading).

1) Init a global variable ("glob") and set it to 0.
2) Modify the code of the error handling workqueue to set "glob" to 1, soon 
before the task is done.
3) Change the fragment that panics the system to call mdelay() in a loop  (it 
doesn't sleep - correct?) for an adequate amount of time and periodically 
check READ_ONCE(global) == 1. If true, break and then panic, otherwise 
reiterate the loop.

Could it be useful?

> If there is an antagonist process which is actively corrupting the
> superblock, some other code path will report the fact that the file
> system is corrupted soon enough.
> 
> 						- Ted
> 
> P.S.  There is an exception to what I've described above, and that's
> special ext4_grp_locked_error() which is used in fs/ext4/mballoc.c.
> But that's a special case which requires very careful handling, In
> general, you simply must not be in atomic context when you want to
> report a problem.

Haven't yet had time to look at this. So last answer is the only one I have at 
the moment.

Thanks for your time,

Fabio