Re: [PATCH v2 05/13] xfs: ratelimit unmount time per-buffer I/O error message

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 24 Apr 2020 07:14:37 +1000

On Thu, Apr 23, 2020 at 10:29:58AM -0400, Brian Foster wrote:
> On Thu, Apr 23, 2020 at 02:46:04PM +1000, Dave Chinner wrote:
> > On Wed, Apr 22, 2020 at 01:54:21PM -0400, Brian Foster wrote:
> > > At unmount time, XFS emits a warning for every in-core buffer that
> > > might have undergone a write error. In practice this behavior is
> > > probably reasonable given that the filesystem is likely short lived
> > > once I/O errors begin to occur consistently. Under certain test or
> > > otherwise expected error conditions, this can spam the logs and slow
> > > down the unmount.
> > > 
> > > We already have a ratelimit state defined for buffers failing
> > > writeback. Fold this state into the buftarg and reuse it for the
> > > unmount time errors.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > 
> > Looks fine, but I suspect we both missed something here:
> > xfs_buf_ioerror_alert() was made a ratelimited printk in the last
> > cycle:
> > 
> > void
> > xfs_buf_ioerror_alert(
> >         struct xfs_buf          *bp,
> >         xfs_failaddr_t          func)
> > {
> >         xfs_alert_ratelimited(bp->b_mount,
> > "metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
> >                         func, (uint64_t)XFS_BUF_ADDR(bp), bp->b_length,
> >                         -bp->b_error);
> > }
> > 
> 
> Yeah, I hadn't noticed that.
> 
> > Hence I think all these buffer error alerts can be brought under the
> > same rate limiting variable. Something like this in xfs_message.c:
> > 
> 
> One thing to note is that xfs_alert_ratelimited() ultimately uses
> the DEFAULT_RATELIMIT_INTERVAL of 5s. The ratelimit we're generalizing
> here uses 30s (both use a burst of 10). That seems reasonable enough to
> me for I/O errors so I'm good with the changes below.
> 
> FWIW, that also means we could just call xfs_buf_alert_ratelimited()
> from xfs_buf_item_push() if we're also Ok with using an "alert" instead
> of a "warn." I'm not immediately aware of a reason to use one over the
> other (xfs_wait_buftarg() already uses alert) so I'll try that unless I
> hear an objection.

Sounds fine to me.
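
For a rough feel of what the two intervals mean for log volume, here
is a userspace sketch of an interval+burst limiter. It is a simplified
model, not the kernel's ___ratelimit() code; rl_state, rl_allow() and
rl_count() are made-up names, and time is passed in explicitly (in
seconds) rather than read from jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a ratelimit state. */
struct rl_state {
	uint64_t	interval;	/* window length, in seconds */
	unsigned int	burst;		/* messages allowed per window */
	uint64_t	begin;		/* start time of the current window */
	unsigned int	printed;	/* messages emitted in this window */
	bool		started;	/* has the first window been opened? */
};

/* Return true if a message arriving at time 'now' may be emitted. */
static bool
rl_allow(
	struct rl_state	*rl,
	uint64_t	now)
{
	assert(rl->burst > 0);

	/* Open a fresh window on first use or once the interval expires. */
	if (!rl->started || now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
		rl->started = true;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	return false;	/* over the burst for this window: suppress */
}

/*
 * Count how many messages get through when 'per_sec' messages arrive
 * every second for 'seconds' seconds, starting at t = 0.
 */
static unsigned int
rl_count(
	uint64_t	interval,
	unsigned int	burst,
	unsigned int	seconds,
	unsigned int	per_sec)
{
	struct rl_state	rl = { .interval = interval, .burst = burst };
	unsigned int	emitted = 0;
	uint64_t	t;
	unsigned int	i;

	for (t = 0; t < seconds; t++)
		for (i = 0; i < per_sec; i++)
			if (rl_allow(&rl, t))
				emitted++;
	return emitted;
}
```

In this model, a minute of sustained errors at 100 per second lets
through 120 messages with a 5s interval and 20 with a 30s interval
(burst of 10 in both cases), i.e. rl_count(5, 10, 60, 100) vs
rl_count(30, 10, 60, 100).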

> The xfs_wait_buftarg() ratelimit presumably remains
> open coded because it's two separate calls and we probably don't want
> them to individually count against the limit.

That's why I suggested dropping the second "run xfs_repair" message
and triggering a shutdown after the wait loop. That way we don't
issue "run xfs_repair" for every single failed buffer (largely
noise!), and we get a single non-rate-limited "run xfs_repair"
message once we've processed all the failed writes.
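
Roughly, the shape I have in mind looks like the userspace sketch
below. This is only a model of the control flow, not a patch:
fake_buf, drain_result and drain_buffers() are made-up names, a plain
per-call cap stands in for the real time-based ratelimit, and the
message strings are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a buffer with a recorded write error. */
struct fake_buf {
	unsigned long long	daddr;	/* disk address of the buffer */
	int			error;	/* 0, or a negative errno */
};

/* What the drain loop did, so callers can check the behaviour. */
struct drain_result {
	unsigned int	failed;		/* buffers that had write errors */
	unsigned int	alerts;		/* per-buffer alerts actually printed */
	unsigned int	repair_msgs;	/* "run xfs_repair" messages printed */
	bool		shutdown;	/* would we have shut down? */
};

static struct drain_result
drain_buffers(
	const struct fake_buf	*bufs,
	size_t			n,
	unsigned int		burst)
{
	struct drain_result	res = { 0 };
	size_t			i;

	assert(bufs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!bufs[i].error)
			continue;
		res.failed++;

		/* Per-buffer alert only, capped (stand-in for the ratelimit). */
		if (res.alerts < burst) {
			fprintf(stderr,
				"buffer at daddr 0x%llx had write error %d\n",
				bufs[i].daddr, -bufs[i].error);
			res.alerts++;
		}
	}

	/* One unconditional message and a shutdown, after the loop. */
	if (res.failed) {
		fprintf(stderr, "please run xfs_repair\n");
		res.repair_msgs++;
		res.shutdown = true;
	}
	return res;
}
```

With 20 failed buffers and a cap of 10, that prints 10 per-buffer
alerts and exactly one "run xfs_repair" line, and nothing at all if no
buffer failed.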

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx