Re: gfs2 iomap dealock, IOMAP_F_UNBALANCED

Andreas Gruenbacher <agruenba@xxxxxxxxxx> · Fri, 29 Mar 2019 23:13:00 +0100

On Thu, 28 Mar 2019 at 17:51, Christoph Hellwig <hch@xxxxxx> wrote:
> On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> > Hi Christoph,
> >
> > we need your help fixing a gfs2 deadlock involving iomap.  What's going
> > on is the following:
> >
> > * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
> >   lock and keeps it until gfs2_iomap_end.  It currently always does that
> >   even though there is no point other than for journaled data writes.
> >
> > * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
> >   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
> >   log flush lock again and deadlock.
>
> What is the exact call chain?

It's laid out here:
https://www.redhat.com/archives/cluster-devel/2019-March/msg00000.html

> balance_dirty_pages_ratelimited these days doesn't start I/O, but just
> wakes up the flusher threads.  Or do we have a issue where it is blocking
> on those threads?

Yes, the writer is holding sd_log_flush_lock at the point where it
ends up kicking the flusher thread and waiting for writeback to
happen. The flusher thread calls gfs2_write_inode, and that tries to
grab sd_log_flush_lock again.

> Also why do you need to flush the log for background writeback in
> ->write_inode?

If we stop doing that in the (wbc->sync_mode == WB_SYNC_NONE) case,
then inodes will remain dirty until the journal is flushed for some
other reason (or a write_inode with WB_SYNC_ALL). That doesn't seem
right. We could perhaps trigger a background journal flush in the
WB_SYNC_NONE case, but that would remove the back pressure on
balance_dirty_pages. Not sure this is a good idea, either.

> balance_dirty_pages_ratelimited is per definition not a data integrity
> writeback, so there shouldn't be a good reason to flush the log
> (which I assume the log flush log is for).
>
> If we look gfs2_write_inode, this seems to be the code:
>
>         bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
>
>         if (flush_all)
>                 gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
>                                GFS2_LOG_HEAD_FLUSH_NORMAL |
>                                GFS2_LFC_WRITE_INODE);
>
> But what is the requirement to do this in writeback context?  Can't
> we move it out into another context instead?

Indeed, this isn't for data integrity in this case but because the
dirty limit is exceeded. What other context would you suggest to move
this to?

(The iomap flag I've proposed would save us from getting into this
situation in the first place.)

Thanks,
Andreas