On Tue 31-12-13 15:34:40, Chris Mason wrote:
> On Tue, 2013-12-31 at 22:22 +0800, Tao Ma wrote:
> > Hi Chris,
> > On 12/31/2013 09:19 PM, Chris Mason wrote:
> > > So I'd like to throttle the rate at which dirty pages are created,
> > > preferably based on the rates currently calculated in the BDI of how
> > > quickly the device is doing IO. This way we can limit dirty creation to
> > > a percentage of the disk capacity during the current workload
> > > (regardless of random vs buffered).
> > Fengguang had already done some work on this, but it seems that the
> > community doesn't have a consensus on where this control file should go.
> > You can look at this link: https://lkml.org/lkml/2011/4/4/205
>
> I had forgotten Wu's patches here, it's very close to the starting point
> I was hoping for.

I specifically don't like those patches because throttling the pagecache
dirty rate is IMHO a rather poor interface. What people want to do is to
limit IO from a container. That means reads & writes, buffered & direct IO.
So the dirty rate is just one of several things contributing to the total
IO rate. When both direct IO and buffered IO are happening in the container
they influence each other, so a dirty rate of 50 MB/s may be fine when
nothing else is going on in the container but may be far too much for the
system if heavy direct IO reads are happening as well. So you really need
to tune the limit on the dirty rate depending on how fast writeback can
happen (which is what the current IO-less throttling does), not based on
some hard throughput number like 50 MB/s (which is what Fengguang's patches
did if I remember right).

What could work a tad bit better (and that seems to be something you are
proposing) is to have a weight for each memcg, and each memcg would be
allowed to dirty at a rate proportional to its weight * writeback
throughput (a minimal sketch of that computation follows below). But this
still has a couple of problems:

1) This doesn't take into account the local situation in a memcg - for a
   memcg full of dirty pages you want to throttle dirtying much more than
   for a memcg which has no dirty pages.

2) The flusher thread (or workqueue these days) doesn't know anything
   about memcgs. So it can happily flush a memcg which is relatively OK
   for a rather long time while some other memcg is full of dirty pages
   and struggling to make any progress.

3) This will be somewhat unfair since the total IO allowed to happen from
   a container will depend on whether you are doing only reads (or DIO),
   only writes, or both reads & writes.

In an ideal world you could compute the writeback throughput for each memcg
(and writeback from a memcg would be accounted in a proper blkcg - we would
need a unified memcg & blkcg hierarchy for that), take into account the
number of dirty pages in each memcg, and compute the dirty rate according
to these two numbers. But whether this can work in practice heavily depends
on the memcg size and on how smooth / fair the writeback from different
memcgs can be, so that we don't have excessive stalls and throughput
estimation errors...

								Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
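
For illustration only, here is a minimal userspace sketch of the
"weight * writeback throughput" idea discussed above. It is not kernel
code and the names (memcg_sample, dirty_rate_limit, the weights and the
bandwidth figure) are invented for the example; it also deliberately shows
the weakness from point 1) - the per-memcg dirty page count plays no role.

    /*
     * Hypothetical sketch: derive a per-memcg dirty rate limit from an
     * administrator-assigned weight and the BDI's estimated writeback
     * bandwidth.  Each memcg gets a share of the bandwidth proportional
     * to weight / sum-of-weights.
     */
    #include <stdio.h>

    struct memcg_sample {
            const char *name;
            unsigned int weight;    /* administrator-assigned weight */
    };

    static unsigned long dirty_rate_limit(const struct memcg_sample *cg,
                                          unsigned long wb_bandwidth_kbps,
                                          unsigned long total_weight)
    {
            /* This memcg's proportional share of writeback bandwidth. */
            return wb_bandwidth_kbps * cg->weight / total_weight;
    }

    int main(void)
    {
            /* Assume the BDI currently estimates ~100 MB/s of writeback. */
            unsigned long wb_bandwidth_kbps = 100 * 1024;
            struct memcg_sample cgs[] = {
                    { "batch",       100 },
                    { "interactive", 300 },
            };
            unsigned long total_weight = 0;
            unsigned int i;

            for (i = 0; i < 2; i++)
                    total_weight += cgs[i].weight;

            for (i = 0; i < 2; i++)
                    printf("%s: allowed dirty rate %lu KB/s\n", cgs[i].name,
                           dirty_rate_limit(&cgs[i], wb_bandwidth_kbps,
                                            total_weight));
            return 0;
    }

With the numbers above the "interactive" group would be allowed to dirty at
roughly three times the rate of "batch"; since the limit is recomputed from
the estimated writeback bandwidth rather than a hard MB/s number, it scales
with what the device can actually sustain under the current workload.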