On Fri, Apr 22, 2011 at 11:25:31AM -0400, Vivek Goyal wrote:
> It is and we have modified CFQ a lot to tackle that but still...
>
> Just do a "dd if=/dev/zero of=/zerofile bs=1M count=4K" on your root
> disk and then try to launch firefox and browse a few websites and see
> if you are happy with the responsiveness of firefox. It took me more
> than a minute to launch firefox and be able to type in and load the
> first website.
>
> But I agree that READ latencies in presence of WRITES can be a problem
> independent of IO controller.

Reading this gives me some deja vu: this is literally a decade-old
problem, so old that when I first worked on it the elevator had no
notion of latency and could potentially starve forever any I/O (read or
write) destined for the end of the disk, as long as I/O for earlier
sectors kept coming in ;).

We're orders of magnitude better these days, but one thing I didn't see
mentioned is that, as far as I remember, a lot of it had to do with the
DMA command size: for writes it can grow to the maximum allowed by the
sg table, while reads (especially metadata and small files, where
readahead is less effective) won't grow to the maximum. Even when a
read does grow to the maximum, the readahead may not be useful
(userland will seek again instead of reading into the readahead data),
and even when no synchronous metadata reads are involved, another
physical readahead will be submitted after only a small part of the
previous one has satisfied userland reads.

So even with a totally unfair I/O scheduler that always places the next
read request at the top of the queue (ignoring any fairness
requirement), the small synchronous read DMA will still sit at the top
of the queue waiting for the large write DMA to complete.

The time the dd if=/dev/zero case worked best for me is when I broke
the throughput by massively reducing the DMA size (by mistake or
intentionally, frankly I don't remember). SATA requires ~64k DMAs to
run at peak speed, and I expect that reducing it to 4k will behave a
lot better than the current 256k. Some very old SCSI device I had
performed best at 512k DMA (much faster than at 64k). The max sector
size is still 512k today, probably 256k (or only 128k) for SATA, but
likely above 64k (it saves CPU even if, as far as the platter is
concerned, throughput can be maxed out with ~64k DMAs).

> Also it is only CFQ which gives READS so much preference over WRITES.
> deadline and noop, which we typically use on faster storage, do not.
> There we might take a bigger hit on READ latencies depending on what
> the storage is and how affected it is by a burst of WRITES.
>
> I guess it boils down to better system control and better
> predictability.

I tend to think that to get even better read latency and
predictability, the I/O scheduler could dynamically and temporarily
reduce the max sector size of the write DMA (and also make sure
readahead is reduced to the same dynamically reduced sector size, or it
would be detrimental to the number of read DMAs issued for each
userland read). Maybe with tagged queuing things are better and the DMA
size doesn't make a difference anymore, I don't know. Surely Jens knows
this best and can tell me if I'm wrong. Anyway it should be really easy
to test: just a two-liner in scsi_lib reducing the max sector size,
plus a reduced max readahead, should be enough to see how fast firefox
starts with cfq while dd if=/dev/zero is running, and whether it makes
any difference at all.
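
For what it's worth, the experiment can also be approximated from
userspace without touching scsi_lib, through the standard queue sysfs
knobs; a rough sketch (assuming sda is the root disk, values picked
arbitrarily):

  # save the current limits so they can be restored afterwards
  cat /sys/block/sda/queue/max_sectors_kb /sys/block/sda/queue/read_ahead_kb

  # shrink both the largest request the block layer will issue and the
  # readahead window to 64k (or 4k for the extreme case mentioned above)
  echo 64 > /sys/block/sda/queue/max_sectors_kb
  echo 64 > /sys/block/sda/queue/read_ahead_kb

  # redo the test: flood the disk with writeback and see how long an
  # interactive cold-cache read load takes on top of it
  dd if=/dev/zero of=/zerofile bs=1M count=4K &
  time firefox

It's not exactly the same as reducing the max sector size in scsi_lib
(max_sectors_kb only clamps how far requests get merged and can't be
raised above max_hw_sectors_kb), but it should be close enough to show
whether the firefox startup time moves at all.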
I've seen huge work going into cfq, but the max merging still stays at
the top and doesn't decrease dynamically, and without such a change I
doubt you can make writeback truly unnoticeable to reads, no matter how
the I/O scheduler is otherwise implemented. I'm unsure whether this
will ever be really viable in a single-user environment (often absolute
throughput is more important, and that is clearly higher - at least for
writeback - when the max sector size stays fixed at the maximum), but
if cgroups want to make a dd if=/dev/zero of=zero bs=10M oflag=direct
in one group unnoticeable to the other cgroups that are reading, it's
worth researching whether this is still an actual issue with today's
hardware. I guess SSDs won't change it much, as it's a DMA-duration
issue, not a seek issue; in fact it may be even more noticeable on
SSDs, as seeks become less costly and leave the duration effect more
visible.

> So throttling is happening at two layers. One throttling is in
> balance_dirty_pages() which is actually not dependent on user inputted
> parameters. It is more dependent on what's the page cache share of
> this cgroup and what's the effective IO rate this cgroup is getting.
> The real IO throttling is happening at device level which is dependent
> on parameters inputted by user and which in-turn indirectly should
> decide how tasks are throttled in balance_dirty_pages().

This sounds like a fine design to me.

Thanks,
Andrea
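
P.S. Just to map the quoted design to something concrete: the
device-level throttling is the part configured per cgroup and per
device. A sketch of what that looks like, assuming the blkio controller
is mounted at /sys/fs/cgroup/blkio and the disk is major:minor 8:0:

  # cap the writer group to 10MB/s of writes on 8:0
  mkdir /sys/fs/cgroup/blkio/writer
  echo "8:0 10485760" > \
      /sys/fs/cgroup/blkio/writer/blkio.throttle.write_bps_device

  # move the current shell into the group and run the direct-IO flood
  echo $$ > /sys/fs/cgroup/blkio/writer/tasks
  dd if=/dev/zero of=zero bs=10M oflag=direct &

balance_dirty_pages() then handles the buffered-writeback side on its
own, without any user-visible parameter, as described in the quote
above.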