Re: [PATCH] iomap: Address soft lockup in iomap_finish_ioend()

On Thu, 2022-01-06 at 09:48 +1100, Dave Chinner wrote:
> On Wed, Jan 05, 2022 at 08:45:05PM +0000, Trond Myklebust wrote:
> > On Tue, 2022-01-04 at 21:09 -0500, Trond Myklebust wrote:
> > > On Tue, 2022-01-04 at 12:22 +1100, Dave Chinner wrote:
> > > > On Tue, Jan 04, 2022 at 12:04:23AM +0000, Trond Myklebust wrote:
> > > > > We have different reproducers. The common feature appears to be
> > > > > the need for a decently fast box with fairly large memory (128GB
> > > > > in one case, 400GB in the other). It has been reproduced with
> > > > > HDs, SSDs and NVMe systems.
> > > > > 
> > > > > On the 128GB box, we had it set up with 10+ disks in a JBOD
> > > > > configuration and were running the AJA system tests.
> > > > > 
> > > > > On the 400GB box, we were just serially creating large (> 6GB)
> > > > > files using fio, and that was occasionally triggering the issue.
> > > > > However, doing an strace of that workload to disk reproduced the
> > > > > problem faster :-).
> > > > 
> > > > Ok, that matches up with the "lots of logically sequential dirty
> > > > data on a single inode in cache" vector that is required to create
> > > > really long bio chains on individual ioends.
> > > > 
> > > > Can you try the patch below and see if it addresses the issue?
> > > > 
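The patch itself is not reproduced in this archived message. Purely to
illustrate the completion-side idea under discussion (don't walk an
arbitrarily long bio chain without ever yielding the CPU), a mitigation
could look roughly like the sketch below. This is not the actual patch:
the structure, helper and batch size are hypothetical placeholders, and
only cond_resched() is assumed to be the real kernel primitive.

/*
 * Illustrative sketch only -- NOT the patch from this thread.
 *
 * Idea: when finishing an ioend whose bio chain covers a very large
 * range, yield the CPU periodically so the workqueue worker running
 * the completion does not trip the soft lockup watchdog.
 */
#include <linux/bio.h>
#include <linux/sched.h>

#define MY_RESCHED_BATCH	256	/* arbitrary: yield every N bios */

/* Minimal stand-in for the real ioend structure (hypothetical). */
struct my_ioend {
	struct bio	*io_bio_chain;	/* head of the chained bios */
};

/* End writeback on the pages covered by one bio (placeholder). */
void my_finish_one_bio(struct my_ioend *ioend, struct bio *bio);

static void my_finish_ioend_chain(struct my_ioend *ioend)
{
	struct bio *bio, *next;
	unsigned int done = 0;

	for (bio = ioend->io_bio_chain; bio; bio = next) {
		/* Chained bios are linked through ->bi_private here. */
		next = bio->bi_private;
		my_finish_one_bio(ioend, bio);
		bio_put(bio);

		/* Don't hog this CPU when the chain is very long. */
		if (++done % MY_RESCHED_BATCH == 0)
			cond_resched();
	}
}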
> > > 
> > > That patch does seem to fix the soft lockups.
> > > 
> > 
> > Oops... Strike that, apparently our tests just hit the following
> > when running on AWS with that patch.
> 
> OK, so there are also large contiguous physical extents being
> allocated in some cases here.
> 
> > So it was harder to hit, but we still did eventually.
> 
> Yup, that's what I wanted to know - it indicates that both the
> filesystem completion processing and the iomap page processing play
> a role in the CPU usage. More complex patch for you to try below...
> 
> Cheers,
> 
> Dave.
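Neither of the patches referred to above survives in this archived
message. For context only, the complementary submission-side idea that
the thread is circling around -- cap how much a single ioend is allowed
to grow, so that no one completion has to process an enormous bio chain
and page run -- might look roughly like the hypothetical sketch below.
The names and the cap value are made up for illustration and are not
taken from Dave's patches.

/*
 * Illustrative sketch only -- not the "more complex patch" referred
 * to above.
 *
 * Idea: refuse to merge further pages into the current ioend once it
 * already covers a large number of pages, forcing a new ioend (and
 * therefore a shorter completion run) to be started.
 */
#include <linux/blkdev.h>	/* SECTOR_SHIFT */
#include <linux/mm.h>		/* PAGE_SIZE */
#include <linux/types.h>

#define MY_IOEND_MAX_PAGES	4096	/* arbitrary cap for illustration */

/* Minimal stand-in for the real ioend bookkeeping (hypothetical). */
struct my_ioend {
	sector_t	io_sector;	/* start sector on disk */
	loff_t		io_offset;	/* start offset in the file */
	size_t		io_size;	/* bytes covered so far */
};

/*
 * Can the page at file offset @pos / disk sector @sector be appended
 * to @ioend, or must writeback start a new ioend?
 */
static bool my_can_add_to_ioend(struct my_ioend *ioend, loff_t pos,
				sector_t sector)
{
	/* Must stay logically and physically contiguous, as before. */
	if (pos != ioend->io_offset + ioend->io_size)
		return false;
	if (sector != ioend->io_sector + (ioend->io_size >> SECTOR_SHIFT))
		return false;

	/*
	 * New: cap the ioend so that a single completion never has to
	 * walk an unbounded bio chain on one CPU.
	 */
	if (ioend->io_size >= (size_t)MY_IOEND_MAX_PAGES * PAGE_SIZE)
		return false;

	return true;
}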

Hi Dave,

This patch got further than the previous one. However, it too failed on
the same AWS setup after we started creating larger (in this case 52GB)
files. The previous patch failed at 15GB.

NR_06-18:00:17 pm-46088DSX1 /mnt/data-portal/data $ ls -lh
total 59G
-rw-r----- 1 root root  52G Jan  6 18:20 100g
-rw-r----- 1 root root 9.8G Jan  6 17:38 10g
-rw-r----- 1 root root   29 Jan  6 17:36 file
NR_06-18:20:10 pm-46088DSX1 /mnt/data-portal/data $
Message from syslogd@pm-46088DSX1 at Jan  6 18:22:44 ...
 kernel:[ 5548.082987] watchdog: BUG: soft lockup - CPU#10 stuck for 24s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:23:44 ...
 kernel:[ 5608.082895] watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:27:08 ...
 kernel:[ 5812.082587] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:27:36 ...
 kernel:[ 5840.082533] watchdog: BUG: soft lockup - CPU#10 stuck for 21s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:28:08 ...
 kernel:[ 5872.082455] watchdog: BUG: soft lockup - CPU#10 stuck for 21s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:28:40 ...
 kernel:[ 5904.082400] watchdog: BUG: soft lockup - CPU#10 stuck for 21s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:29:16 ...
 kernel:[ 5940.082243] watchdog: BUG: soft lockup - CPU#10 stuck for 21s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:29:44 ...
 kernel:[ 5968.082249] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:30:24 ...
 kernel:[ 6008.082204] watchdog: BUG: soft lockup - CPU#10 stuck for 21s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:31:08 ...
 kernel:[ 6052.082194] watchdog: BUG: soft lockup - CPU#10 stuck for 24s! [kworker/10:0:18995]
Message from syslogd@pm-46088DSX1 at Jan  6 18:31:48 ...
 kernel:[ 6092.082010] watchdog: BUG: soft lockup - CPU#10 stuck for 21s! [kworker/10:0:18995]

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx