Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> On Thu, Mar 31, 2011 at 7:16 AM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> >> > and also try to select inodes intelligently (cgroup aware manner).
> >>
> >> Such selection algorithms would need to be able to handle hundreds
> >> of thousands of newly dirtied inodes per second so sorting and
> >> selecting them efficiently will be a major issue...
> >
> > There was a proposal for the memory cgroup to maintain a per-memory-cgroup,
> > per-bdi structure which would keep a list of the inodes that need writeback
> > from that cgroup.
> 
> FYI, I have patches which implement this per memcg per bdi dirty inode
> list.  I want to debug a few issues before posting an RFC series.  But
> it is getting close.

That's all well and good, but we're still trying to work out how to
scale this list in a sane fashion. We just broke it out under its
own global lock, so it's going to change again soon so that the
list+lock is not a contention point on large machines. Just breaking
it into a list per cgroup doesn't solve this problem - it just adds
another container around the same lists.
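
To be concrete about what's being proposed, something like the sketch
below is roughly what a per-memcg, per-bdi structure would have to look
like - the names are purely illustrative, not taken from any posted
patches - and note that it needs its own lock per instance or we just
recreate the same global contention point:

	/* Illustrative only - not from any posted patch series. */
	struct memcg_bdi_writeback {
		struct mem_cgroup	*memcg;		/* owning memory cgroup */
		struct backing_dev_info	*bdi;		/* device the inodes are dirty against */
		spinlock_t		lock;		/* protects dirty_inodes */
		struct list_head	dirty_inodes;	/* inodes dirtied by this memcg on this bdi */
		struct list_head	bdi_node;	/* linked into the bdi's list of these */
	};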

Also, you have the problem that some filesystems don't use the bdi
dirty inode list for all the dirty inodes in the filesystem - XFS has
recently changed to only track VFS-dirtied inodes in that list, instead
using its own active item list to track all logged modifications.
IIUC, btrfs and ext3/4 do something similar as well. My current plans
are to modify the dirty inode code to allow filesystems to tell
the VFS "don't track this dirty inode - I'm doing it myself" so that
we can reduce the VFS dirty inode list to only those inodes with
dirty pages....
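
As a rough illustration of that opt-out - the flag name below is made
up, nothing like it exists yet - it could be as simple as a per-inode
state bit that __mark_inode_dirty() checks before putting the inode on
the bdi dirty list:

	/* Hypothetical flag - just to make the opt-out concrete. */
	#define I_FS_OWNS_DIRTY		(1 << 12)	/* fs tracks dirty metadata itself */

	static inline bool inode_needs_vfs_dirty_tracking(struct inode *inode)
	{
		/*
		 * Filesystems that log all metadata changes through their
		 * own infrastructure (e.g. the XFS active item list) would
		 * set I_FS_OWNS_DIRTY when dirtying the inode; the VFS then
		 * only keeps inodes with dirty pagecache on the bdi
		 * writeback lists.
		 */
		return !(inode->i_state & I_FS_OWNS_DIRTY);
	}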

> > So any cgroup looking for writeback will queue this structure up on the
> > bdi, and the flusher threads can walk through this list to figure out
> > which memory cgroups, and which inodes within each memory cgroup, need
> > to be written back.
> 
> The way these memcg-writeback patches are currently implemented is
> that when a memcg is over its background dirty limits, it will queue
> the memcg on a global over_bg_limit list and wake up the bdi flusher.

No global lists and locks, please. That's one of the big problems
with the current foreground IO based throttling - it _hammers_ the
global inode writeback list locks such that on an 8p machine we can
waste 2-3 entire CPUs just contending on them when all 8 CPUs are
trying to throttle and write back at the same time.....

> There
> is no context (memcg or otherwise) given to the bdi flusher.  After
> the bdi flusher checks system-wide background limits, it uses the
> over_bg_limit list to find (and rotate) an over-limit memcg.  Given
> that memcg, the per-memcg per-bdi dirty inode list is walked to
> find inode pages to write back.  Once the memcg's dirty memory usage
> drops below the memcg-thresh, the memcg is removed from the global
> over_bg_limit list.

If you want controlled hand-off of writeback, you need to pass the
memcg that triggered the throttling directly to the bdi. You already
know which bdi and which memcg need the writeback. Yes, this
needs concurrency at the BDI flush level to handle, but see my
previous email in this thread for that....
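
To sketch what that direct hand-off might look like - the structure and
function below are made up for illustration and don't correspond to
anything posted - the throttling path would queue a (bdi, memcg) pair
on that bdi's own work list rather than publishing the memcg globally:

	/* Illustrative only - names do not match any posted patches. */
	struct memcg_writeback_work {
		struct backing_dev_info	*bdi;		/* device to flush */
		struct mem_cgroup	*memcg;		/* cgroup that hit its bg threshold */
		long			nr_pages;	/* how much to try to clean */
		struct list_head	list;		/* queued on the bdi's work list */
	};

	/*
	 * Called from the dirty throttling path: instead of putting the
	 * memcg on a global over_bg_limit list, hand the (bdi, memcg)
	 * pair straight to that bdi's flusher so it knows exactly whose
	 * per-memcg dirty inode list to walk.
	 */
	void bdi_queue_memcg_writeback(struct backing_dev_info *bdi,
				       struct mem_cgroup *memcg, long nr_pages);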

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx