Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

On Fri, Apr 22, 2011 at 11:25:31AM -0400, Vivek Goyal wrote:
> It is and we have modified CFQ a lot to tackle that but still... 
> 
> Just do a "dd if=/dev/zero of=/zerofile bs=1M count=4K" on your root
> disk and then try to launch firefox and browse a few websites, and see if
> you are happy with firefox's responsiveness. It took me more than a
> minute to launch firefox and be able to type in and load the first website.
> 
> But I agree that READ latencies in presence of WRITES can be a problem
> independent of IO controller.

Reading this gives me some déjà vu; this is literally a decade-old problem.
It's so old that when I first worked on it, the elevator had no notion of
latency and could starve any I/O (read or write) sitting at the end of the
disk indefinitely, as long as I/O closer to the beginning kept coming in ;).

We're orders of magnitude better these days, but one thing I didn't see
mentioned is that, as I recall, a lot of it had to do with the DMA command
size: for writes it grows to the maximum allowed by the sg table, while
reads (especially metadata and small files, where readahead is less
effective) won't grow to the maximum. Even when a read does grow to the
maximum, the readahead may not be useful (userland will seek again instead
of reading into the readahead), and even when no synchronous metadata reads
are involved, another physical readahead gets submitted after satisfying
only a small userland read.

So even if you have a totally unfair IO scheduler that always places the
next read request at the top of the queue (ignoring any fairness
requirement), the small synchronous read DMA still sits at the top of the
queue waiting for the large write DMA already in flight to complete.
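
To put rough numbers on it (made-up round figures, purely a
back-of-the-envelope sketch assuming ~100MB/s of sequential platter
throughput):

    512k write DMA in flight -> ~5ms   before the read can be issued
     64k write DMA in flight -> ~0.6ms before the read can be issued
      4k write DMA in flight -> ~40us  before the read can be issued

Even a perfectly read-favouring queue leaves the read waiting a time
directly proportional to the size of the single write DMA already on
the wire.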

The time I got dd if=/dev/zero behaving best was when I broke its
throughput by massively reducing the DMA size (by mistake or
intentionally, frankly I don't remember). SATA needs DMAs of ~64k to run
at peak speed, and I expect that if you reduce it to 4k it'll behave a
lot better than the current 256k. Some very old SCSI device I had
performed best at 512k DMAs (much faster than at 64k). The max sector
size is still 512k today, probably 256k (or only 128k) for SATA, but
likely above 64k (as it saves CPU even if, as far as the platter is
concerned, throughput can be maxed out at ~64k DMAs).
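
For reference, those limits can be checked from userland; something like
this should show them (assuming a reasonably recent kernel that exposes
the usual queue attributes in sysfs, with sda as the disk under test):

    cat /sys/block/sda/queue/max_hw_sectors_kb  # hardware DMA limit
    cat /sys/block/sda/queue/max_sectors_kb     # current limit used when merging
    cat /sys/block/sda/queue/read_ahead_kb      # readahead window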

> Also it is only CFQ which gives READS so much preference over WRITES.
> deadline and noop, which we typically use on faster storage, do not. There
> we might take a bigger hit on READ latencies depending on what the storage
> is and how affected it is by a burst of WRITES.
> 
> I guess it boils down to better system control and better predictability.

I tend to think that to get even better read latency and predictability,
the IO scheduler could dynamically and temporarily reduce the max sector
size of write DMAs (and also make sure readahead is reduced to the same
dynamically reduced sector size, or the number of read DMAs issued for
each userland read would suffer).

Maybe with tagged queuing things are better and the DMA size doesn't
make a difference anymore, I don't know. Surely Jens knows this best
and can tell me if I'm wrong.

Anyway it should be really easy to test: just a two-liner in scsi_lib
reducing the max sector size, plus a matching reduction of the max
readahead, should let you see how fast firefox starts with CFQ while
dd if=/dev/zero is running, and whether there's any difference at all.
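
Something along these lines should approximate it from userland, without
patching scsi_lib at all (sda and the 64k values are just examples, and
I'm assuming the queue sysfs knobs are writable on the setup under test):

    echo 64 > /sys/block/sda/queue/max_sectors_kb  # cap the write DMA size
    echo 64 > /sys/block/sda/queue/read_ahead_kb   # shrink readahead to match
    dd if=/dev/zero of=/zerofile bs=1M count=4K &
    time firefox    # compare cold-cache startup against the default limits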

I've seen a huge amount of work on CFQ, but the max merge size still
remains at the top and doesn't decrease dynamically, and I doubt you can
make writeback truly unnoticeable to reads without such a change, no
matter how the IO scheduler is otherwise implemented.

I'm unsure if this will ever be really viable in a single-user environment
(often absolute throughput matters more, and that is clearly higher, at
least for the writeback, by keeping the max sector size fixed at the
maximum). But if cgroups want to make a dd if=/dev/zero of=zero bs=10M
oflag=direct in one group unnoticeable to the other cgroups that are
reading, it's worth researching whether this is still an actual issue with
today's hardware. I guess SSDs won't change it much, as it's a DMA duration
issue, not a seek issue; in fact it may be way more noticeable on SSDs,
where seeks are less costly and the duration effect is more visible.
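
For completeness, the cgroup side of that experiment would look roughly
like this with the blkio throttling interface (mount point, device
numbers and the 10MB/s limit are all just illustrative):

    mount -t cgroup -o blkio none /cgroup/blkio   # if not already mounted
    mkdir /cgroup/blkio/writer
    # throttle the writer group to 10MB/s on /dev/sda (major:minor 8:0)
    echo "8:0 10485760" > /cgroup/blkio/writer/blkio.throttle.write_bps_device
    echo $$ > /cgroup/blkio/writer/tasks          # move this shell into the group
    dd if=/dev/zero of=zero bs=10M oflag=direct   # readers run in another group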

> So throttling is happening at two layers. One throttling is in
> balance_dirty_pages(), which is actually not dependent on user-supplied
> parameters. It is more dependent on what the page cache share of
> this cgroup is and what effective IO rate this cgroup is getting.
> The real IO throttling is happening at the device level, which is dependent
> on parameters supplied by the user and which in turn should indirectly decide
> how tasks are throttled in balance_dirty_pages().

This sounds like a fine design to me.

Thanks,
Andrea