On Thu, 2011-04-14 at 20:49 -0500, Dave Chinner wrote:
> On Thu, Apr 14, 2011 at 02:29:34PM -0500, Alex Elder wrote:
> > On Tue, 2011-04-12 at 23:52 +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
> > > We currently have significant issues with the amount of stack that
> > > allocation in XFS uses, especially in the writeback path. We can
> > > easily consume 4k of stack between mapping the page, manipulating
> > > the bmap btree and allocating blocks from the free list. Not to
> > > mention btree block readahead and other functionality that issues IO
> > > in the allocation path.
> > >
> > > As a result, we can no longer fit allocation in the writeback path
> > > in the stack space provided on x86_64. To alleviate this problem,
> > > introduce an allocation workqueue and move all allocations to a
> > > separate context. This can be easily added as an interposing layer
> > > into xfs_alloc_vextent(), which takes a single argument structure
> > > and does not return until the allocation is complete or has failed.
> > >
> > > To do this, add a work structure and a completion to the allocation
> > > args structure. This allows xfs_alloc_vextent to queue the args onto
> > > the workqueue and wait for it to be completed by the worker. This
> > > can be done completely transparently to the caller.
> > >
> > > The worker function needs to ensure that it sets and clears the
> > > PF_MEMALLOC flag appropriately as it is being run in an active
> > > transaction context. Work can also be queued in a memory reclaim
> > > context, so a rescuer is needed for the workqueue.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > Interesting.
> >
> > I guess I don't see anything inherently wrong with this.
> > At first glance it seems like workqueue abuse. But it's
> > better than rolling our own daemon to do the same thing.
> > (There's nothing to stop you from doing this generically
> > either...)
> >
> > Will it shift some accounting of CPU time from user to
> > system?
>
> Some, yes. For delayed allocation in the background flush or sync
> path, there is no difference because that isn't accounted to the
> user anyway. The only cases it will make a difference are for
> foreground writeback (e.g. fsync or write throttling), and for inode
> and directory allocation.

Linus made a comment relevant to this yesterday, but in reference
to RCU making use of kernel threads:
  Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

    So using anonymous kernel threads is actually a real downside. It
    makes it much less obvious what is going on. We saw that exact same
    thing with the generic worker thread conversions: things that used
    to have clear performance issues ("oh, the iwl-phy0 thread is using
    3% of CPU time because it is polling for IO, and I can see that in
    'top'") turned into much-harder-to-see issues ("oh, kwork0 is using
    3% CPU time according to 'top' - I have no idea why").

Maybe if there were a way to reuse the generic worker infrastructure
but have it associated with a kernel module or something. Then again
that is sort of contrary to the purpose of the generic
centrally-managed pool.

Maybe if generic work could somehow get tagged with its real owner
or origin it would be better, but I really have no idea how one
would go about that.

In any case, I think we need to be cautious about moving in this
direction.

					-Alex

> > The code looks OK to me. The idea of it makes me pause
> > a little, though I don't object.
>
> It's definitely a different way of thinking, but I'm seriously
> starting to consider pushing more operations into asynchronous
> background tasks that the caller can wait on if they so desire.
> e.g. EOF truncation during xfs_inactive and xfs_release(), flushing
> a range of a file, inode cluster freeing, CIL flushes, etc.
>
> We've got multiple cores in most machines these days (even the low
> end is going multi-core), context switch overhead is pretty low and
> we've got concurrency managed work queues so we don't need thread
> pools - we should take advantage of what they provide....
>
> Cheers,
>
> Dave.
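For reference, the interposing layer described in the commit message above
boils down to a small pattern: attach a work item and a completion to the
allocation arguments, run the allocation from a worker with PF_MEMALLOC set,
and have the synchronous entry point queue the work and wait. The sketch
below illustrates only that pattern, not the actual patch; the wrapper
struct, the __xfs_alloc_vextent() name and the flag handling are assumed
placeholders.

#include <linux/workqueue.h>
#include <linux/completion.h>
#include <linux/sched.h>

struct xfs_alloc_arg;                       /* real definition: fs/xfs/xfs_alloc.h */

/* created elsewhere with WQ_MEM_RECLAIM so the queue has a rescuer thread */
extern struct workqueue_struct *xfs_alloc_wq;

/*
 * The patch adds the work/completion/result fields to the allocation args
 * structure itself; a wrapper is used here only to keep the sketch
 * self-contained.
 */
struct xfs_alloc_ctx {
	struct work_struct	work;
	struct completion	done;
	struct xfs_alloc_arg	*args;
	int			result;
};

/* stand-in name for the original allocation entry point */
int __xfs_alloc_vextent(struct xfs_alloc_arg *args);

static void
xfs_alloc_vextent_worker(struct work_struct *work)
{
	struct xfs_alloc_ctx	*ctx = container_of(work, struct xfs_alloc_ctx, work);
	unsigned int		was_memalloc = current->flags & PF_MEMALLOC;

	/*
	 * The allocation runs on behalf of a task holding an active
	 * transaction, so mark this worker PF_MEMALLOC for the duration
	 * and restore the previous state afterwards.
	 */
	current->flags |= PF_MEMALLOC;
	ctx->result = __xfs_alloc_vextent(ctx->args);
	if (!was_memalloc)
		current->flags &= ~PF_MEMALLOC;

	/* ctx lives on the caller's stack; don't touch it after this */
	complete(&ctx->done);
}

/*
 * Same synchronous interface the callers already use: queue the allocation
 * to the workqueue and wait, so only the stack the allocation runs on
 * changes.
 */
int
xfs_alloc_vextent(struct xfs_alloc_arg *args)
{
	struct xfs_alloc_ctx	ctx = { .args = args };

	INIT_WORK_ONSTACK(&ctx.work, xfs_alloc_vextent_worker);
	init_completion(&ctx.done);

	queue_work(xfs_alloc_wq, &ctx.work);
	wait_for_completion(&ctx.done);
	return ctx.result;
}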
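Dave's closing point about concurrency-managed workqueues is that creating
one is essentially all the "thread pool" management a subsystem needs, and
the commit message notes a rescuer is required because work can be queued
from memory reclaim context. A rough sketch of that setup follows, again
with assumed function names: WQ_MEM_RECLAIM is the flag that attaches the
rescuer thread, and a max_active of 0 asks for the default concurrency limit.

#include <linux/workqueue.h>
#include <linux/errno.h>
#include <linux/init.h>

struct workqueue_struct *xfs_alloc_wq;

int __init xfs_alloc_wq_init(void)
{
	/*
	 * The rescuer guarantees forward progress when new worker threads
	 * cannot be created under memory pressure, which matters because
	 * allocations can be queued from reclaim context.
	 */
	xfs_alloc_wq = alloc_workqueue("xfsalloc", WQ_MEM_RECLAIM, 0);
	if (!xfs_alloc_wq)
		return -ENOMEM;
	return 0;
}

void xfs_alloc_wq_destroy(void)
{
	destroy_workqueue(xfs_alloc_wq);
}

With the queue created at init time, the synchronous wrapper in the previous
sketch can queue work from any context, including reclaim, without relying
on new worker threads being spawnable at that moment.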