Hello.

> > > <snip>
> > > > >
> > > > > Other directories on the same filesystem seem fine as do other XFS
> > > > > filesystems on the same system.
> > > >
> > > > The fact you mention other directories seems to work, and the first stack trace
> > > > you posted, it sounds like you've been keeping a single AG too busy to almost
> > > > make it unusable. But, you didn't provide enough information we can really make
> > > > any progress here, and to be honest I'm more inclined to point the finger to
> > > > your MD device.
> > >
> > > Let's see if we can pinpoint something :-)
> > >
> > > > Can you describe your MD device? RAID array? What kind? How many disks?
> > >
> > > RAID6 8 disks.
> > >
> > > > What's your filesystem configuration? (xfs_info <mount point>)
> > >
> > > meta-data=/dev/md126             isize=512    agcount=32, agsize=43954432 blks
> > >          =                       sectsz=4096  attr=2, projid32bit=1
> > >          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> > >          =                       reflink=0
> > > data     =                       bsize=4096   blocks=1406538240, imaxpct=5
> > >          =                       sunit=128    swidth=768 blks
> > > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > > log      =internal               bsize=4096   blocks=521728, version=2
> > >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> >                                                 ^^^^^^ This should have been
> > configured to 8 blocks, not 1
> >
> > > Yes there's more. See a slightly elided dmesg from a longer run on
> > > https://sucs.org/~sits/test/kern-20191024.log.gz .
> >
> > At a first quick look, it looks like you are having lots of IO contention in the
> > log, and this is slowing down the rest of the filesystem. What caught my
>
> Should it become so slow that a task freezes entirely and never
> finishes? Once the problem hits it's not like anything makes any more
> progress on those directories nor was there very much generating dirty
> data.
>

I am not sure how long you waited before you assumed it 'never finishes' :), but...

I've seen systems starve to death due to the large amounts of RMW being
generated, mainly when disks are slow or problematic. Since you are using SSDs
and not spindles, I wonder whether the SSDs' write cycles might play a role
here, but that's just a question I'm asking myself; I am not sure whether it
can make things even worse.

In the file you sent above, you basically have:

An xfs-sync worker thread, which forced a log flush and is now sleeping,
waiting for the flush to complete:

Oct 24 16:27:11 <host> kernel: [115151.674164] Call Trace:
Oct 24 16:27:11 <host> kernel: [115151.674170]  __schedule+0x24e/0x880
Oct 24 16:27:11 <host> kernel: [115151.674175]  schedule+0x2c/0x80
Oct 24 16:27:11 <host> kernel: [115151.674178]  schedule_timeout+0x1cf/0x350
Oct 24 16:27:11 <host> kernel: [115151.674184]  ? sched_clock+0x9/0x10
Oct 24 16:27:11 <host> kernel: [115151.674187]  ? sched_clock+0x9/0x10
Oct 24 16:27:11 <host> kernel: [115151.674191]  ? sched_clock_cpu+0x11/0xb0
Oct 24 16:27:11 <host> kernel: [115151.674196]  wait_for_completion+0xba/0x140
Oct 24 16:27:11 <host> kernel: [115151.674199]  ? wake_up_q+0x80/0x80
Oct 24 16:27:11 <host> kernel: [115151.674203]  __flush_work+0x15b/0x210
Oct 24 16:27:11 <host> kernel: [115151.674206]  ? worker_detach_from_pool+0xa0/0xa0
Oct 24 16:27:11 <host> kernel: [115151.674210]  flush_work+0x10/0x20
Oct 24 16:27:11 <host> kernel: [115151.674250]  xlog_cil_force_lsn+0x7b/0x210 [xfs]
Oct 24 16:27:11 <host> kernel: [115151.674253]  ? __switch_to_asm+0x41/0x70
Oct 24 16:27:11 <host> kernel: [115151.674256]  ? __switch_to_asm+0x35/0x70
Oct 24 16:27:11 <host> kernel: [115151.674259]  ? __switch_to_asm+0x41/0x70
Oct 24 16:27:11 <host> kernel: [115151.674262]  ? __switch_to_asm+0x35/0x70
Oct 24 16:27:11 <host> kernel: [115151.674264]  ? __switch_to_asm+0x41/0x70
Oct 24 16:27:11 <host> kernel: [115151.674267]  ? __switch_to_asm+0x35/0x70
Oct 24 16:27:11 <host> kernel: [115151.674306]  _xfs_log_force+0x8f/0x2a0 [xfs]

which wakes up, and waits for, the xlog_cil_push work running on the following
thread:

Oct 24 16:29:12 <host> kernel: [115272.503324] kworker/52:2    D    0 56479      2 0x80000080
Oct 24 16:29:12 <host> kernel: [115272.503381] Workqueue: xfs-cil/md126 xlog_cil_push_work [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503383] Call Trace:
Oct 24 16:29:12 <host> kernel: [115272.503389]  __schedule+0x24e/0x880
Oct 24 16:29:12 <host> kernel: [115272.503394]  schedule+0x2c/0x80
Oct 24 16:29:12 <host> kernel: [115272.503444]  xlog_state_get_iclog_space+0x105/0x2d0 [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503449]  ? wake_up_q+0x80/0x80
Oct 24 16:29:12 <host> kernel: [115272.503492]  xlog_write+0x163/0x6e0 [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503536]  xlog_cil_push+0x2a7/0x410 [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503577]  xlog_cil_push_work+0x15/0x20 [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503581]  process_one_work+0x1de/0x420
Oct 24 16:29:12 <host> kernel: [115272.503584]  worker_thread+0x32/0x410
Oct 24 16:29:12 <host> kernel: [115272.503587]  kthread+0x121/0x140
Oct 24 16:29:12 <host> kernel: [115272.503590]  ? process_one_work+0x420/0x420
Oct 24 16:29:12 <host> kernel: [115272.503593]  ? kthread_create_worker_on_cpu+0x70/0x70
Oct 24 16:29:12 <host> kernel: [115272.503596]  ret_from_fork+0x35/0x40

xlog_state_get_iclog_space() is waiting at:

        xlog_wait(&log->l_flush_wait, &log->l_icloglock);

which should be awakened by xlog_state_do_callback(), which is called from
xlog_state_done_syncing(), which, in short, IIRC, comes from the bio end_io
callbacks. I'm saying this from a quick look at the current code, though.
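To make that sleep/wakeup dependency a bit more concrete, here is a tiny
userspace sketch of the same pattern. It is plain pthreads, nothing to do with
the actual kernel waitqueue/spinlock primitives, and all the names are only
illustrative: one thread blocks the way xlog_state_get_iclog_space() blocks on
l_flush_wait, and only a simulated IO-completion callback can wake it. If that
completion never arrives, the waiter sleeps forever, which is roughly the
picture the stuck kworker above gives:

/*
 * Userspace sketch only: a pthread mutex/condvar pair standing in for the
 * kernel's icloglock/l_flush_wait.  Names are illustrative, not kernel code.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t icloglock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_wait = PTHREAD_COND_INITIALIZER;
static bool iclog_space_available;

/* Stands in for xlog_state_get_iclog_space() sleeping in xlog_wait(). */
static void *cil_push_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&icloglock);
        while (!iclog_space_available)          /* no free iclog yet */
                pthread_cond_wait(&flush_wait, &icloglock);
        pthread_mutex_unlock(&icloglock);
        printf("cil push: got iclog space, can write to the log now\n");
        return NULL;
}

/* Stands in for the bio end_io -> xlog_state_done_syncing() ->
 * xlog_state_do_callback() path that wakes l_flush_wait. */
static void log_io_completion(void)
{
        pthread_mutex_lock(&icloglock);
        iclog_space_available = true;
        pthread_cond_broadcast(&flush_wait);    /* the "wake_up_all()" */
        pthread_mutex_unlock(&icloglock);
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, cil_push_thread, NULL);
        sleep(1);               /* pretend the log buffer IO takes a while */
        log_io_completion();    /* if this never runs, the waiter hangs forever */
        pthread_join(t, NULL);
        return 0;
}

In other words, everything downstream is pinned on that one IO completion
actually arriving.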
So, I think it ends up with your system waiting for journal IO completion in
order to make progress (or maybe your storage stack missed an IO and XFS is
stuck waiting for a completion that will never come; maybe it's deadlocked
somewhere I am not seeing).

I may very well be wrong though; it's been a while since I last worked on this
part of the code, so I'll wait for somebody else to give a different POV rather
than risk pointing you in the wrong direction.

I'll try to keep looking at it when I get some extra time, but in the end you
will probably need to update your kernel to a recent version and try to
reproduce the problem again; chasing bugs on old kernels is not really in the
scope of this list. Still, I'll give it some more thought when I get some extra
time (if nobody sees what's wrong before :)

> If this were to happen again though what extra information would be
> helpful (I'm guessing things like /proc/meminfo output)?
>
> > attention at first was the wrongly configured log striping for the filesystem and
> > I wonder if this isn't responsible for the amount of IO contention you are
> > having in the log. This might well be generating lots of RMW cycles while
> > writing to the log, generating the IO contention and slowing down the rest of the
> > filesystem. I'll try to take a more careful look later on.
>
> My understanding is that the md "chunk size" is 64k so basically
> you're saying the sectsz should have been manually set to be as big as
> possible at mkfs time? I never realised this never happened by default
> (I see the sunit seems to be correct given the block size of 4096 but
> I'm unsure about swidth)...

Your RAID chunk is 512k according to the information you provided previously:

md126 : active raid6 sdj[6] sdg[3] sdi[2] sdh[7] sdc[4] sdd[0] sda[5] sdb[1]
      5626152960 blocks level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 0/7 pages [0KB], 65536KB chunk

And I believe you used default mkfs options when you created the filesystem, so
xfs properly identified the geometry and automatically configured it for you:

        sunit=128 swidth=768 blks
        (128 * 4096) = 512KiB

And the swidth matches exactly the number of data disks I'd expect from your
array, i.e. 6 data disks (+ 2 parity):

        (768 / 128) = 6

What I meant is that the log sunit is weird in your configuration; see the
'sunit' in the log section above. The XFS log stripe unit can range from 32KiB
to 256KiB, and mkfs configures it to match the data sunit UNLESS the data sunit
is bigger than 256KiB, in which case mkfs sets it to 32KiB by default. So in
your case, using the default mkfs configuration, I was expecting to see a log
sunit of 8 blocks:

        (8 * 4096) = 32KiB
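If you want to redo that arithmetic yourself, here is a trivial sketch; the
constants are just copied from your xfs_info output above, so it only restates
the numbers, nothing authoritative:

/* Redo the geometry arithmetic from the xfs_info output above. */
#include <stdio.h>

int main(void)
{
        const unsigned int blocksize = 4096;    /* bsize */
        const unsigned int sunit     = 128;     /* data sunit, in blocks */
        const unsigned int swidth    = 768;     /* data swidth, in blocks */

        printf("data stripe unit  : %u KiB\n", sunit * blocksize / 1024);  /* 512 */
        printf("data disks        : %u\n", swidth / sunit);                /* 6 */
        printf("expected log sunit: %u blocks (%u KiB)\n",
               8u, 8 * blocksize / 1024);                                  /* 32 */
        return 0;
}

And if you ever recreate the filesystem, I believe mkfs.xfs lets you set the
log stripe unit explicitly with something like '-l su=32k', but double-check
the man page of your xfsprogs version.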
Maybe your xfsprogs version had a bug, or maybe somebody set it manually; I
can't really say.

--
Carlos