Re: [2.6.36-rc3] Workqueues, XFS, dependencies and deadlocks

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 8 Sep 2010 20:05:03 +1000

On Wed, Sep 08, 2010 at 10:51:28AM +0200, Tejun Heo wrote:
> Hello,
> 
> On 09/08/2010 10:22 AM, Dave Chinner wrote:
> > Ok, it looks as if the WQ_HIGHPRI is all that was required to avoid
> > the log IO completion starvation livelocks. I haven't yet pulled
> > the tree below, but I've now created about a billion inodes without
> > seeing any evidence of the livelock occurring.
> > 
> > Hence it looks like I've been seeing two livelocks - one caused by
> > the VM that Mel's patches fix, and one caused by the workqueue
> > changeover that is fixed by the WQ_HIGHPRI change.
> > 
> > Thanks for your insights, Tejun - I'll push the workqueue change
> > through the XFS tree to Linus.
> 
> Great, BTW, I have several questions regarding wq usage in xfs.
> 
> * Do you think @max_active > 1 could be useful for xfs?  If most works
>   queued on the wq are gonna contend for the same (blocking) set of
>   resources, it would just make more threads sleeping on those
>   resources but otherwise it would help reducing execution latency a
>   lot.

It may indeed help, but I can't really say much more than that right
now. I need a deeper understanding of the impact of increasing
max_active (I have a basic understanding now) before I could say for
certain.

> * xfs_mru_cache is a singlethread workqueue.  Do you specifically need
>   singlethreadedness (strict ordering of works) or is it just to avoid
>   creating dedicated per-cpu workers?  If the latter, there's no need
>   to use singlethread one anymore.

Didn't need per-cpu workers, so could probably drop it now.

> * Are all four workqueues in xfs used during memory allocation?  With
>   the new implementation, the reasons to have dedicated wqs are,

The xfsdatad, xfslogd and xfsconvertd are all in the memory reclaim
path. That is, they need to be able to run and make progress when
memory is low because if the IO does not complete, pages under IO
will never complete the transition from dirty to clean. Hence they
are not in the direct memory allocation path, but they are
definitely an important part of the memory reclaim path that
operates in low memory conditions.

>   - Forward progress guarantee in the memory allocation path.  Each
>     workqueue w/ WQ_RESCUER has _one_ rescuer thread reserved for
>     execution of works on the specific wq, which will be used under
>     memory pressure to make forward progress.

That, to me, says they all need a rescuer thread because they all
need to be able to make forward progress in OOM conditions.

>   - A wq is a flush domain.  You can flush works on it as a group.

We do that for the above workqueues as well to ensure
correct sync(1), freeze and unmount behaviour (see
xfs_flush_buftarg()).

>   - A wq is also an attribute domain.  If certain work items need to be
>     handled differently (highpri, cpu intensive, execution ordering,
>     etc...), they can be queued to a wq w/ those attributes specified.

And we already know that xfslogd_workqueue needs the WQ_HIGHPRI
flag....
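
As a rough illustration of what "attributes specified on the wq" looks
like, here is a minimal sketch assuming the alloc_workqueue(name, flags,
max_active) interface from the 2.6.36 workqueue rework. The names
example_logd_wq and example_init_workqueues are hypothetical, and the
max_active value of 1 is an assumption; this is not the actual XFS
change.

```c
#include <linux/errno.h>
#include <linux/workqueue.h>

/* Hypothetical workqueue pointer, for illustration only. */
static struct workqueue_struct *example_logd_wq;

static int example_init_workqueues(void)
{
	/*
	 * WQ_RESCUER: reserve a rescuer thread so queued work can still
	 *             make forward progress under memory pressure.
	 * WQ_HIGHPRI: treat work on this wq as high priority so it is
	 *             not starved behind other queued work.
	 * max_active: 1 is an assumed value, not a recommendation.
	 */
	example_logd_wq = alloc_workqueue("example_logd",
					  WQ_RESCUER | WQ_HIGHPRI, 1);
	if (!example_logd_wq)
		return -ENOMEM;

	return 0;
}
```

Because the flags live on the wq rather than on individual work items,
anything queued to it with queue_work() picks up the same treatment.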

>   Maybe some of those workqueues can drop WQ_RESCUER or be merged or just
>   use the system workqueue?

Maybe the mru wq can use the system wq, but I'm really opposed to
merging XFS wqs with system work queues simply from a debugging POV.
I've lost count of the number of times I've walked the IO completion
queues with a debugger or crash dump analyser to try to work out if
missing IO that wedged the filesystem got stuck on the completion
queue. If I want to be able to say "the IO was lost by a lower
layer", then I have to be able to confirm it is not stuck in a
completion queue. That's much harder if I don't know what the work
container objects on the queue are....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx