2012/8/20, Fengguang Wu <fengguang.wu@xxxxxxxxx>:
> On Mon, Aug 20, 2012 at 09:48:42AM +0900, Namjae Jeon wrote:
>> 2012/8/19, Fengguang Wu <fengguang.wu@xxxxxxxxx>:
>> > On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
>> >> From: Namjae Jeon <namjae.jeon@xxxxxxxxxxx>
>> >>
>> >> This patch is based on a suggestion by Wu Fengguang:
>> >> https://lkml.org/lkml/2011/8/19/19
>> >>
>> >> The kernel has a mechanism to do writeback as per the dirty_ratio and
>> >> dirty_background_ratio tunables. It also maintains a per-task dirty
>> >> rate limit to keep dirty pages balanced at any given instant, using
>> >> bdi bandwidth estimation.
>> >>
>> >> The kernel also has max_ratio/min_ratio tunables to specify the
>> >> percentage of the write cache that controls per-bdi dirty limits and
>> >> task throttling.
>> >>
>> >> However, there might be a use case where the user wants a writeback
>> >> tuning parameter to flush dirty data at a desired/tuned time interval.
>> >>
>> >> dirty_background_time provides an interface where the user can tune
>> >> the background writeback start time using
>> >> /sys/block/sda/bdi/dirty_background_time
>> >>
>> >> dirty_background_time is used along with the average bdi write
>> >> bandwidth estimation to decide when to start background writeback.
>> >
>> > Here lies my major concern about dirty_background_time: the write
>> > bandwidth estimation is an _estimation_ and will surely become wildly
>> > wrong in some cases. So the dirty_background_time implementation based
>> > on it will not always work to the user's expectations.
>> >
>> > One important case is, some users (e.g. Dave Chinner) explicitly take
>> > advantage of the existing behavior to quickly create & delete a big
>> > 1GB temp file without worrying about triggering unnecessary IO.
>> >
>> Hi, Wu.
>> Okay, I have a question.
>>
>> If we make dirty_writeback_interval per-bdi to tune a short interval,
>> instead of background_time, we can get a similar performance
>> improvement.
>> /sys/block/<device>/bdi/dirty_writeback_interval
>> /sys/block/<device>/bdi/dirty_expire_interval
>>
>> NFS write performance improvement is just one use case.
>>
>> If we can set the interval/time per bdi, other use cases will follow.
>
> Per-bdi interval/time tunables, if there comes such a need, will in
> essence be for data caching and safety. If we turn them into a
> requirement for better performance, users will potentially be
> stretched on choosing the "right" value that balances data caching,
> safety and performance. Hmm, not a comfortable prospect.

Hi Wu.
First, thanks for the shared information.

I change the writeback interval on the NFS server only. I think this
does not affect data cache/page (caching) behaviour on the NFS client.
The NFS client will start sending write requests as per the default
NFS/writeback logic, so there is no change in NFS client data caching
behaviour. Also, on the NFS server it does not change system-wide
caching behaviour; it only modifies the caching/writeback behaviour of
a particular "bdi" on the NFS server, so that the NFS client can see
better WRITE speed.

I will share several performance test results, as per Dave's suggestion.

>
>> > The numbers are impressive! FYI, I tried another NFS-specific
>> > approach to avoid big NFS COMMITs, which achieved similar
>> > performance gains:
>> >
>> > nfs: writeback pages wait queue
>> > https://lkml.org/lkml/2011/10/20/235

This patch looks like a client-side optimization to me (I need to check
more). Do we also need the server-side optimization, as per Bruce's
opinion?

Thanks.

>>
>> Thanks.
>
> The NFS write queue, on the other hand, is directly aimed at
> improving NFS performance, latency and responsiveness.
>
> In comparison to the per-bdi interval/time, it is more a guarantee of
> smoother NFS writes. As the tests in the original email show, at the
> cost of a few more commits, it gains much better write throughput and
> latency.
>
> The NFS write queue is even a requirement, if we want to get
> reasonably good responsiveness. Without it, the 20% dirty limit may
> well be filled by NFS writeback/unstable pages. This is very bad for
> responsiveness. Let me quote the contents of two old emails (with
> small fixes):
>
> : PG_writeback pages have been the biggest source of
> : latency issues in the various parts of the system.
> :
> : It's not uncommon for me to see filesystems sleep on PG_writeback
> : pages during heavy writeback, within some lock or transaction, which
> : in turn stalls many tasks that try to do IO or merely dirty some page
> : in memory. Random writes are especially susceptible to such stalls.
> : The stable page feature also vastly increases the chances of stalls
> : by locking the writeback pages.
>
> : When there are N seconds worth of writeback pages, it may
> : take N/2 seconds on average for wait_on_page_writeback() to finish.
> : So the total time cost of running into a random writeback page and
> : waiting on it is O(N^2):
>
> :     E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it)
>
> : That means we can hardly keep more than 1 second's worth of writeback
> : pages w/o worrying about long waits on PG_writeback in various parts
> : of the kernel.
>
> : Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> : the case of direct reclaim, it means blocking random tasks that are
> : allocating memory in the system.
> :
> : PG_writeback pages are much worse than PG_dirty pages in that they
> : are not movable. This makes a big difference for high-order page
> : allocations. To make room for a 2MB huge page, vmscan has the option
> : to migrate PG_dirty pages, but for PG_writeback it has no better
> : choice than to wait for IO completion.
> :
> : The difficulty of THP allocation goes up *exponentially* with the
> : number of PG_writeback pages. Assume PG_writeback pages are randomly
> : distributed in the physical memory space.
Then we have the formula
> :
> :     P(reclaimable for THP) = P(non-PG_writeback)^512
> :
> : That's the probability for a contiguous range of 512 pages to be free
> : of PG_writeback, so that it is immediately reclaimable for use by a
> : transparent huge page. This ruby one-liner shows the concrete numbers:
> :
> :     irb> 1.upto(10) { |i| j = i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512 }
> :
> : P(hit PG_writeback)     P(reclaimable for THP)
> : 0.001                   0.599
> : 0.002                   0.359
> : 0.003                   0.215
> : 0.004                   0.128
> : 0.005                   0.077
> : 0.006                   0.046
> : 0.007                   0.027
> : 0.008                   0.016
> : 0.009                   0.010
> : 0.010                   0.006
> :
> : The numbers show that when the PG_writeback pages go up from 0.1% to
> : 1% of system memory, the THP reclaim success ratio drops quickly from
> : 60% to 0.6%. It indicates that in order to use THP without constantly
> : running into stalls, the reasonable PG_writeback ratio is <= 0.1%.
> : Going beyond that threshold, it quickly becomes intolerable.
> :
> : That makes a limit of 256MB writeback pages for a mem=256GB system.
> : Looking at the real vmstat:nr_writeback numbers in dd write tests:
> :
> : JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback    217009
> : JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback   198335
> : JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback     306026
> : JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback    315099
> : JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback  1216058
> : JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback  895335
> :
> : Oops, btrfs has 4GB of writeback pages -- which asks for some bug
> : fixing. Even ext4's 800MB still looks way too high, but that's ~1s
> : worth of data per queue (or 130ms worth of data for the high
> : performance Intel SSD, which is perhaps in danger of queue underruns?).
> : So this system would require 512GB of memory to comfortably run KVM
> : instances with THP support.
>
> The main concern with the NFS write wait queue, however, was that it
> might hurt performance for long fat network pipes with large
> bandwidth-delay products. If the pipe size can be properly estimated,
> we'll be able to set an adequate queue size and remove the last
> obstacle to that patch.
>
> Thanks,
> Fengguang