Re: Failure with generic/388 test

Jan Kara <jack@xxxxxxx> · Wed, 24 Jan 2018 12:11:32 +0100

On Fri 19-01-18 12:40:41, Theodore Ts'o wrote:
> On Thu, Jan 18, 2018 at 06:39:46PM +0100, Jan Kara wrote:
> > 
> > I've been seeing failures with generic/388 test (with 4.12-based distro
> > kernel). Now after debugging it for some time it seems it is a problem in
> > EXT4_GOING_FLAGS_NOLOGFLUSH implementation. What seems to be happening is
> > that e.g. directory is being created, we set EXT4_FLAGS_SHUTDOWN in the
> > middle and so the creation fails. We do iput() which should delete the
> > unused inode but since the filesystem is marked as EXT4_FLAGS_SHUTDOWN,
> > those changes don't hit the disk. *But* the transaction which has allocated
> > the inode still manages to commit before we abort the journal (as there is
> > a window provided by msleep() in EXT4_GOING_FLAGS_NOLOGFLUSH
> > implementation). So after journal recovery, we have unattached inode and
> > e2fsck complains.
> 
> Thanks for looking at the problem!  It's been on my todo list to try
> to find and fix.
> 
> Right, I see the problem, and issue is that we shouldn't be trying to
> abort the handle after we set EXT4_FLAGS_SHUTDOWN.  That flag should
> prevent new handles from being started, but we should allow already
> running handles to complete, so that we correctly handle the
> EXT4_GOING_FLAGS_LOGFLUSH case.
> 
> > So why is journal abort happening after EXT4_FLAGS_SHUTDOWN being set and
> > why is that window even prolonged by msleep? That would deserve a comment
> > if nothing else BTW...
> 
> I should drop the msleep().  That was there because when we first
> started, there was a bug which has since been fixed by 8e39c6c3d5cc:
> "ext4: fix a race in the ext4 shutdown path" (in the ext4 git tree;
> due for the next merge window).
> 
> The msleep() sleep bug significantly reduced the crash caused by the
> race.  This was non-ideal, but it was better than the alternative,
> which was when the iSCSI server went down, it would hang the system
> badly enough that node's cluster daemon (think Kubernetes daemon)
> would go non-responsive for long enough that a watchdog would hammer
> down the entire system.  We knew we had a race, but the msleep reduced
> the incidence to the point where it rarely happened in production
> workloads, and it was better than the alternative (which was
> guaranteed server death).

Ouch, someone is running EXT4_IOC_SHUTDOWN in production? I always thought
it is just a testing thing...

> Anyway, that bug has since been fixed and with this other problem
> which you've pointed out hopefully we will have fixed all/most of our
> shutdown issues.

Well, just removing msleep() does not completely fix the race, just makes
it unlikely. I believe in EXT4_GOING_FLAGS_NOLOGFLUSH case we should first
do jbd2_journal_abort() and only after that set EXT4_FLAGS_SHUTDOWN. That
will fix the race completely. Are you aware of anything that depends on the
"flag first, journal later" ordering in the shutdown path?

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR