On Tue, Jun 11, 2013 at 06:53:15AM +1000, Dave Chinner wrote:
> On Mon, Jun 10, 2013 at 01:45:59PM -0500, Shawn Bohrer wrote:
> > On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote:
> > > On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote:
> >
> > So to summarize, it appears that most of the time was spent with
> > kworker/4:0 blocking in xfs_buf_lock(), and kworker/2:1H, which is
> > woken up by a softirq, is the one that calls xfs_buf_unlock().
> > Assuming I'm not missing some important intermediate steps, does
> > this provide any more information about what resource I'm actually
> > waiting for?  Does this point to any changes that happened after
> > 3.4?  Are there any tips that could help minimize these contentions?
>
> The only difference between this and 3.4 is the allocation workqueue
> thread.  That, however, won't be introducing second-long delays.
> What you are seeing here is simply the latency of waiting for
> background metadata IO to complete during an allocation which has
> the ilock held....

Again, thank you for your analysis, Dave.  I've taken a step back to
look at the big picture, and that allowed me to identify what _has_
changed between 3.4 and 3.10.  What changed is the behavior of
vm.dirty_expire_centisecs.  Honestly, the previous behavior never made
any sense to me, and I'm not entirely sure the current behavior does
either.

In the workload I've been debugging we append data to many small files
using mmap.  The writes are small and the total data rate is very low,
so for most files it may take several minutes to fill a page.
Low-latency writes are important, but as you know stalls are always
possible.  One way to reduce the probability of a stall is to reduce
the frequency of writeback, and adjusting vm.dirty_expire_centisecs
and/or vm.dirty_writeback_centisecs should allow us to do that.

On kernels 3.4 and older we chose to increase
vm.dirty_expire_centisecs to 30000, since we can comfortably lose 5
minutes of data in the event of a system failure, and we believed this
would produce a fairly consistent, low data rate: every
vm.dirty_writeback_centisecs (5 s) interval, writeback would flush all
dirty pages that were at least vm.dirty_expire_centisecs (5 min) old.
On old kernels that isn't exactly what happened.  Instead, every 5
minutes there would be a burst of writeback and a slow trickle at all
other times.  This also reduced the total amount of data written back,
since the same dirty page wasn't being rewritten every 30 seconds, and
it virtually eliminated the stalls we saw, so the setting was left
alone.

On 3.10, vm.dirty_expire_centisecs=30000 no longer does the same
thing.  Honestly, I'm not sure what it does, but the result is a
fairly consistent, high data rate being written back to disk.  The
fact that it is consistent might lead me to believe that it writes
back all pages that are vm.dirty_expire_centisecs old every
vm.dirty_writeback_centisecs, but the data rate is far too high for
that to be true.  It appears that I can effectively get the same old
behavior by setting vm.dirty_writeback_centisecs=30000.

--
Shawn
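
In case it helps anyone reproduce this, below is a minimal C sketch of
the append-via-mmap pattern described above.  The file name, record
size, 4 KiB page size, and the absence of an explicit msync() are my
own assumptions for illustration, not details taken from our actual
writer; the point is only that each small append dirties a mapped tail
page that then waits for periodic background writeback.

/*
 * Illustrative sketch (not our production code): append one small
 * record to a file through a writable mmap() of its tail page and
 * leave the dirty page for background writeback to clean.  File name,
 * record size, and the 4 KiB page assumption are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_LEN 4096   /* assumed page size */
#define REC_SIZE 64     /* small record: many minutes to fill a page */

int main(void)
{
    int fd = open("datafile.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t len = lseek(fd, 0, SEEK_END);         /* logical end of data */
    off_t map_off = len - (len % PAGE_LEN);     /* page-aligned tail   */
    size_t page_off = (size_t)(len - map_off);  /* offset within page  */

    if (page_off + REC_SIZE > PAGE_LEN) {
        /* Record would spill into the next page; a real writer would
         * remap, but the sketch stops here to stay short. */
        close(fd);
        return 0;
    }

    /* The file must cover the mapped range before we store through it,
     * so round its size up to the end of the tail page.  (A real
     * writer tracks the logical data length separately.) */
    if (ftruncate(fd, map_off + PAGE_LEN) < 0) {
        perror("ftruncate");
        return 1;
    }

    char *tail = mmap(NULL, PAGE_LEN, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, map_off);
    if (tail == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Append one record by storing into the mapping.  The page is now
     * dirty and stays dirty until periodic writeback decides it has
     * expired (dirty_expire_centisecs / dirty_writeback_centisecs). */
    memset(tail + page_off, 'x', REC_SIZE);

    munmap(tail, PAGE_LEN);
    close(fd);
    return 0;
}

With vm.dirty_writeback_centisecs=30000 the flusher should only visit
dirty pages produced this way every 5 minutes, which matches the old
burst-then-trickle behavior described above.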