Re: xfsaild in D state seems to be blocking all other i/o sporadically

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 19 Apr 2017 08:37:40 -0400

On Wed, Apr 19, 2017 at 02:12:02PM +0200, Carlos Maiolino wrote:
> Hi,
> 
> On Wed, Apr 19, 2017 at 12:58:05PM +0200, Michael Weissenbacher wrote:
> > Hi List!
> > I have a storage server which primarily runs around 15-20 parallel
> > rsyncs, nothing special. Sometimes (3-4 times a day) I notice that all
> > I/O on the file system suddenly comes to a halt and the only process
> > that continues to do any I/O (according to iotop) is the process
> > xfsaild/md127. When this happens, xfsaild only does reads (according to
> > iotop) and is consistently in D state (according to top).
> > Unfortunately this can sometimes stay like this for 5-15 minutes. During
> > this time even a simple "ls" or "touch" would block and be stuck in D
> > state. All other running processes accessing the fs are of course also
> > stuck in D state. It is an XFS V5 filesystem.
> > Then again, as suddenly as it began, everything goes back to normal and
> > I/O continues. The problem is accompanied by several "process blocked
> > for xxx seconds" messages in dmesg and also some dropped connections due
> > to network timeouts.
> > 
> > I've tried several things to remedy the problem, including:
> >   - changing I/O schedulers (tried noop, deadline and cfq). Deadline
> > seems to be best (the block goes away in less time compared with the
> > others).
> >   - removing all mount options (defaults + usrquota, grpquota)
> >   - upgrading to the latest 4.11.0-rc kernel (before that I was on 4.9.x)
> > 
> > None of the above seems to have made a significant difference to the
> > problem.
> > 
> > xfs_info output of the fs in question:
> > meta-data=/dev/md127             isize=512    agcount=33, agsize=268435440 blks
> >          =                       sectsz=4096  attr=2, projid32bit=1
> >          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=8789917696, imaxpct=10
> >          =                       sunit=16     swidth=96 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal               bsize=4096   blocks=521728, version=2
> >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
> 
> This is really not enough to tell what might be happening. It looks like
> slow storage while xfsaild is flushing the log, but we need more
> information to give a better idea of what is going on. Please look at:
> 
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> 
> Especially: storage layout (RAID arrays, LVM, thin provisioning, etc.) and
> the dmesg output with the traces from the hung tasks.
> 
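
The traces themselves come from the kernel: the hung task detector logs
them to dmesg, and "echo w > /proc/sysrq-trigger" dumps all blocked tasks
on demand if sysrq is enabled. To see which processes are stuck while a
stall is in progress, something like the following rough sketch could be
used. It is an illustration only; parse_stat and blocked_tasks are
hypothetical helper names, and it looks at processes, not at individual
threads under /proc/<pid>/task.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks():
    """List (pid, comm) for every process currently in uninterruptible sleep (D)."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            # The process exited in the meantime or the line was malformed.
            continue
        if state == "D":
            found.append((pid, comm))
    return found
```

Printing the result of blocked_tasks() during a stall should show
xfsaild/md127 together with the blocked rsync processes, which gives the
pids whose traces to look for in dmesg.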

Information about memory usage might be particularly interesting here
as well, e.g. /proc/meminfo and /proc/slabinfo.
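
Since the stalls are sporadic, it may help to capture both files
periodically so that a snapshot from inside a stall is available. A
minimal sketch, assuming Python is available on the server; parse_meminfo,
snapshot and the "memsnap" output directory are names made up for this
example, and /proc/slabinfo is usually readable only by root:

```python
import time
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Values are in kB wherever the kernel prints a kB unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def snapshot(out_dir="memsnap"):
    """Save timestamped copies of /proc/meminfo and /proc/slabinfo."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for name in ("meminfo", "slabinfo"):
        try:
            data = Path("/proc", name).read_text()
        except OSError as exc:
            # Record the failure instead of aborting (e.g. not running as root).
            data = f"unreadable: {exc}\n"
        (out / f"{name}-{stamp}.txt").write_text(data)
```

Calling snapshot() every few seconds from a loop or a cron job leaves a
series of files to compare before, during and after a stall;
parse_meminfo() can then be used to pull out individual fields such as
Dirty or Slab from a saved copy.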

Brian

> Cheers.
> 
> > Storage Subsystem: Dell Perc H730P Controller 2GB NVCACHE, 12 6TB Disks,
> > RAID-10, latest Firmware Updates
> > 
> > I would be happy to dig out more information if needed. How can I find
> > out if the RAID controller itself gets stuck? Nothing bad shows up in
> > the hardware and SCSI controller logs.
> > 
> 
> -- 
> Carlos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html