On Mon 13-09-10 12:41:28, Dave Chinner wrote:
> ping?
Pong ;) I finally had a look at this. Thanks for reporting it.

> > I just had an umount take a very long time, burning a CPU the
> > entire time. It wasn't the unmount thread, either; it was the bdi
> > flusher thread for the filesystem being unmounted. It was spinning
> > with this perf top trace:
> >
> >    553144.00 76.9% writeback_inodes_wb    [kernel.kallsyms]
> >    106434.00 14.8% __ticket_spin_lock     [kernel.kallsyms]
> >     25646.00  3.6% __ticket_spin_unlock   [kernel.kallsyms]
> >     10512.00  1.5% _raw_spin_lock         [kernel.kallsyms]
> >      9606.00  1.3% put_super              [kernel.kallsyms]
> >      7920.00  1.1% __put_super            [kernel.kallsyms]
> >      5592.00  0.8% down_read_trylock      [kernel.kallsyms]
> >        46.00  0.0% kfree                  [kernel.kallsyms]
> >        22.00  0.0% __do_softirq           [kernel.kallsyms]
> >        19.00  0.0% wb_writeback           [kernel.kallsyms]
> >        16.00  0.0% wb_do_writeback        [kernel.kallsyms]
> >         8.00  0.0% queue_io               [kernel.kallsyms]
> >         6.00  0.0% run_timer_softirq      [kernel.kallsyms]
> >         6.00  0.0% local_bh_enable_ip     [kernel.kallsyms]
> >
> > This went on for ~7m25s (according to the pmchart trace I had on
> > screen) before something broke the livelock by writing the inodes
> > to disk (maybe the xfssyncd), and the unmount then completed a
> > couple of seconds later.
> >
> > From the above profile, I'm assuming that writeback_inodes_wb() was
> > seeing pin_sb_for_writeback(sb) fail and moving dirty inodes from
> > b_io to b_more_io, then being called again, splicing the inodes on
> > b_more_io back to b_io, then failing pin_sb_for_writeback() again
> > for each inode and moving them back to b_more_io....
> >
> > This is on 2.6.36-rc1 + the radix tree fixes for writeback.

Indeed, your analysis looks correct. The trouble is the following:

    Flusher thread                      Umount
    - starts processing background
      writeback
                                        - gets s_umount for writing
                                        - queues syncing work for the
                                          flusher
                                        - waits until the flusher
                                          thread gets to it
    - loops infinitely, trying to get
      s_umount for reading

In principle a classic ABBA deadlock (a self-contained userspace
analogue is sketched in the PS below). Actually, there are more
complicated (and harder to hit) cases like:

    Flusher thread              Sync                    Remount
    - processes background
      writeback
                                - gets s_umount for
                                  reading
                                - queues syncing work
                                - waits for the syncing
                                  work
                                                        - tries to get
                                                          s_umount for
                                                          writing and
                                                          blocks
    - now loops infinitely since it
      cannot get s_umount for reading
      anymore (a queued writer makes
      down_read_trylock() fail)

The question is how to resolve this properly. Cases like the second one
above show that it is not enough to just somehow work around writeback
during umount. Also, it is not only background writeback that can get
deadlocked like this, but generally anything submitted via
__bdi_start_writeback() (as these kinds of writeback do not have a
superblock specified).

I think the best resolution of this problem would be to change the work
that is submitted via bdi_start_writeback() (i.e., the work without a
superblock = the work which needs to do the locking) to a "target based
scheme" like Christoph already wanted some time ago (a toy model of
what I mean is in the PPS below). I actually have a patch doing this
for background writeback, so I will just modify it to apply to a wider
range of writeback as well. Or Christoph, do you already have some
patches in this direction?

								Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
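
PS: To make the first livelock concrete, here is a self-contained
userspace analogue (an illustration only, not kernel code; the pthread
rwlock stands in for s_umount and the comments map the pieces back to
the kernel names). The "flusher" spins on a read-trylock the way
writeback_inodes_wb() spins on pin_sb_for_writeback(), while "umount"
holds the write lock and waits for its queued work item. Build with
gcc -pthread; the program burns a CPU forever and never prints
"umount done".

/* deadlock-demo.c: userspace analogue of the flusher/umount deadlock.
 * Illustration only -- the names mirror the kernel ones but nothing
 * here is actual kernel code.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t s_umount = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_done_cond = PTHREAD_COND_INITIALIZER;
static bool sync_work_done;	/* the work item umount waits for */

static void *flusher_thread(void *arg)
{
	(void)arg;
	/*
	 * Background writeback: like pin_sb_for_writeback(), try to
	 * take s_umount for reading. Umount already holds it for
	 * writing, so the trylock fails for every inode, the inode is
	 * "requeued", and we retry forever -- matching the profile
	 * above (writeback_inodes_wb + down_read_trylock burning CPU).
	 */
	while (pthread_rwlock_tryrdlock(&s_umount) != 0)
		;	/* requeue_io(inode); splice b_more_io back; retry */
	pthread_rwlock_unlock(&s_umount);

	/* Only after background writeback finishes would the queued
	 * sync work run -- and that is what would wake umount up. */
	pthread_mutex_lock(&work_lock);
	sync_work_done = true;
	pthread_cond_signal(&work_done_cond);
	pthread_mutex_unlock(&work_lock);
	return NULL;
}

int main(void)
{
	pthread_t flusher;

	/* Umount: take s_umount for writing... */
	pthread_rwlock_wrlock(&s_umount);

	/* ...while the flusher is doing background writeback... */
	pthread_create(&flusher, NULL, flusher_thread, NULL);

	/* ...then queue a sync work item and wait for the flusher to
	 * get to it. It never will: classic ABBA. */
	pthread_mutex_lock(&work_lock);
	while (!sync_work_done)
		pthread_cond_wait(&work_done_cond, &work_lock);
	pthread_mutex_unlock(&work_lock);

	pthread_rwlock_unlock(&s_umount);
	printf("umount done\n");	/* never reached */
	pthread_join(flusher, NULL);
	return 0;
}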
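
PPS: And a toy model of what I mean by the "target based scheme"
(hypothetical illustration, not the actual patch; all names here are
made up). The work item carries a page target instead of an implicit
superblock, and the loop gives up as soon as a pass makes no progress,
so a superblock that cannot be pinned (e.g. one in the middle of
umount) can no longer livelock the flusher:

#include <stdbool.h>
#include <stdio.h>

struct toy_inode {
	bool pinnable;		/* would down_read_trylock(s_umount) succeed? */
	long dirty_pages;
};

/* One pass over b_io; returns pages written (0 = no progress made). */
static long writeback_pass(struct toy_inode *inodes, int n, long quota)
{
	long written = 0;

	for (int i = 0; i < n && written < quota; i++) {
		if (!inodes[i].pinnable)
			continue;	/* skip it instead of retrying forever */
		written += inodes[i].dirty_pages;
		inodes[i].dirty_pages = 0;
	}
	return written;
}

int main(void)
{
	struct toy_inode inodes[] = {
		{ .pinnable = false, .dirty_pages = 100 }, /* sb being unmounted */
		{ .pinnable = true,  .dirty_pages = 30 },
		{ .pinnable = true,  .dirty_pages = 50 },
	};
	long target = 1000, written = 0;

	while (written < target) {
		long progress = writeback_pass(inodes, 3, target - written);

		if (progress == 0)
			break;	/* target unmet but no progress: stop, don't spin */
		written += progress;
	}
	printf("wrote %ld pages, stopped cleanly\n", written);
	return 0;
}

Running it prints "wrote 80 pages, stopped cleanly": the unpinnable
superblock's pages are simply left behind instead of being retried
forever.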