Re: [RFC PATCH 0/3] block: Fix fsync slowness with CFQ cgroups

Vivek Goyal <vgoyal@xxxxxxxxxx> · Tue, 28 Jun 2011 09:35:58 -0400

On Tue, Jun 28, 2011 at 12:47:38PM +1000, Dave Chinner wrote:
> 
> Vivek, I'm not sure this is a general solution. If we hand journal
> IO off to a workqueue, then we've got no idea what the "dependent
> task" is.
> 
> I bring this up as I have a current patchset that moves all the XFS
> journal IO out of process context into a workqueue to solve
> process-visible operation latency (e.g. 1000 mkdir syscalls run at
> 1ms each, the 1001st triggers a journal checkpoint and takes 500ms)
> and background checkpoint submission races.  This effectively means
> that XFS will trigger the same bad CFQ behaviour on fsync, but have
> no means of avoiding it because we don't have a specific task to
> yield to.
> 
> And FWIW, we're going to be using workqueues more and more in XFS
> for asynchronous processing of operations. I'm looking to use WQs
> for speculative readahead of inodes, all our delayed metadata
> writeback, log IO submission, free space allocation requests,
> background inode allocation, background inode freeing, background
> EOF truncation, etc to process as much work asynchronously outside
> syscall context as possible (let's use all those CPU cores we
> have!).
> 
> All of these things will push potentially dependent IO operations
> outside of the bounds of the process actually doing the operation,
> so some general solution to the "dependent IO in an undefined thread
> context" problem really needs to be solved sooner rather than
> later...
> 
> As it is, I don't have any good ideas of how to solve this, but I
> thought it is worth bringing to your attention while you are trying
> to solve a similar issue.

Dave,

Couple of thoughts.

- We can introduce another block layer call where dependencies are set up
  from worker thread context. So when the process schedules the work, it can
  save the task information somewhere, and when the worker thread actually
  calls the specified function, that function can set up the dependency
  between the worker thread and the submitting task.

  The original process can probably tear down the dependency connection
  when IO is done. I am assuming that the IO submitting process is waiting
  for all IO to finish.

  In the current framework one can specify multiple processes being
  dependent on one thread but not vice versa. I think we should be able to
  handle that by maintaining a linked list of dependent queues instead
  of a single pointer. So if a process submits a bunch of jobs with the help
  of a bunch of worker threads on multiple cpus, I think that case is
  manageable with some extension to the current patches.

- Or we can try something more exotic: when we schedule work, one should
  be able to tell which cgroup the worker should run in. When the worker
  actually runs, it can migrate itself to the destination cgroup and
  submit IO. This does not take care of cases like the journalling thread,
  where multiple processes are dependent on a single kernel thread. In that
  case the dependent queue solution above should work well.
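
To make the first idea a bit more concrete, here is a rough userspace
model of it, not kernel code and not taken from the patches. Every name
in it (io_queue, dep_link, io_work, worker_run, dep_add, dep_clear) is
made up for illustration. It only shows the shape: the work item
remembers the submitter, the worker links its own queue into the
submitter's list of dependencies when it runs, and the submitter drops
the whole list once its IO is done.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-task IO queue (not the real cfq_queue). */
struct io_queue {
	const char *name;
	struct dep_link *deps;		/* queues this queue is waiting on */
};

/* One entry in the list that replaces the single dependency pointer. */
struct dep_link {
	struct io_queue *target;
	struct dep_link *next;
};

/* Hypothetical work item: the submitter is recorded at schedule time. */
struct io_work {
	struct io_queue *submitter;
};

/* Record that 'waiter' depends on IO issued from 'target'. */
static int dep_add(struct io_queue *waiter, struct io_queue *target)
{
	struct dep_link *l = malloc(sizeof(*l));

	if (!l)
		return -1;
	l->target = target;
	l->next = waiter->deps;
	waiter->deps = l;
	return 0;
}

/* Number of queues 'waiter' currently depends on. */
static int dep_count(const struct io_queue *waiter)
{
	int n = 0;

	for (const struct dep_link *l = waiter->deps; l; l = l->next)
		n++;
	return n;
}

/* Tear down all dependencies, e.g. once the submitter's IO has finished. */
static void dep_clear(struct io_queue *waiter)
{
	while (waiter->deps) {
		struct dep_link *l = waiter->deps;

		waiter->deps = l->next;
		free(l);
	}
}

/* Runs in worker context: link the worker's queue to the saved submitter. */
static int worker_run(struct io_work *w, struct io_queue *worker_q)
{
	assert(w->submitter);
	return dep_add(w->submitter, worker_q);
}
```

With two workers picking up work from the same submitter, the submitter
ends up with two entries in its list, which is the case a single pointer
cannot express.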

So I think the above API can be extended to handle the case of work queues
as well, or we could look into migrating the worker into a user-specified
cgroup if that turns out to be a better solution.
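
For the cgroup migration variant, a similarly rough userspace model
could look like the following. Again, all names (worker, cg_work,
run_in_cgroup, submit_io) are hypothetical, the cgroup is just a string
here, and moving the worker back to its previous cgroup afterwards is an
assumption of this sketch rather than something discussed above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical worker thread; 'cgroup' models its current IO cgroup. */
struct worker {
	const char *cgroup;
};

/* Hypothetical work item tagged with the cgroup it should run in. */
struct cg_work {
	const char *target_cgroup;
	void (*fn)(struct worker *w, struct cg_work *work);
	const char *charged_to;		/* filled in when IO is submitted */
};

/* Model of IO submission: IO is charged to the worker's current cgroup. */
static void submit_io(struct worker *w, struct cg_work *work)
{
	work->charged_to = w->cgroup;
}

/* Migrate the worker to the requested cgroup, run the work, move back. */
static void run_in_cgroup(struct worker *w, struct cg_work *work)
{
	const char *saved = w->cgroup;

	assert(work->target_cgroup && work->fn);
	w->cgroup = work->target_cgroup;
	work->fn(w, work);
	w->cgroup = saved;	/* assumption: restore after the work runs */
}
```

The point is only that the IO gets charged to the cgroup named at
schedule time and not to whatever cgroup the worker happened to be in.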

Thanks
Vivek