On Mon, Jul 26, 2010 at 05:09:09PM -0500, Shawn Bohrer wrote:
> On Fri, Jul 23, 2010 at 09:28:46AM +1000, Dave Chinner wrote:
> > On Thu, Jul 22, 2010 at 09:47:06AM -0500, Shawn Bohrer wrote:
> > > Hello,
> > >
> > > We have an application that receives data over the network, and then
> > > writes data out to a shared memory mapped file on xfs.  Previously
> > > everything worked great when we were running a CentOS 5.3 kernel
> > > (2.6.18-128.7.1.el5), but when we upgraded to the newer CentOS 5.4 or
> > > 5.5 kernels we started to see occasional spikes (up to 1 second) in
> > > the time it takes to write the data to the memory mapped file.  I
> > > noticed that between the CentOS 5.3 and 5.4 kernels the upstream xfs
> > > changes were backported.  Testing on 2.6.32.16 shows the same
> > > occasional latency spikes.
> > >
> > > After doing some debugging on 2.6.32.16 I have found the cause of the
> > > latency spikes.  The spikes occur every time pdflush runs to flush the
> > > dirty pages.  Note our app only writes data at about 1.5 MB/s and we
> > > are well below both the dirty_background_ratio and the dirty_ratio so
> > > these flushes are only of pages that are dirty_expire_centisecs old.
> > > Capturing one of the spikes with ftrace shows:
> > >
> > > ...
> > > flush-8:0-601 [011] 1291592.127590: funcgraph_entry:           |  xfs_ilock() {
> > > flush-8:0-601 [011] 1291592.127590: funcgraph_entry:           |    down_write() {
> > > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: 0.213 us  |      _cond_resched();
> > > flush-8:0-601 [011] 1291592.127591: funcgraph_exit:  0.657 us  |    }
> > > flush-8:0-601 [011] 1291592.127591: funcgraph_exit:  1.174 us  |  }
> > > flush-8:0-601 [011] 1291592.127764: sched_switch: task flush-8:0:601 [120] (D) ==> swapper:0 [120]
> > > ...
> >
> > what is the lead-in context to this hunk? i.e. how does xfs_ilock()
> > get called? That will tell me if it is blocking on the iolock or the
> > ilock...
>
> I don't have the lead-in context here.  I'm using the function_graph
> tracer and with stalls up to a second in length it easily overflowed
> the ftrace ring buffer.  Once I found my process was blocking on the
> xfs_ilock(), I limited my traces to include xfs_ilock, xfs_iunlock,
> handle_mm_fault, and the scheduler events.  If you are still
> interested in seeing some more detail here I can try to collect some
> more captures.

I'm not sure if the tracepoint buffer is the same as the ftrace buffer,
but if it is you can increase the size by doing this sort of thing:

# echo 32768 > /sys/kernel/debug/tracing/buffer_size_kb

I often do that to capture long XFS traces during testing....
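
If it is the same buffer, a rough recipe for catching a whole stall
without overflowing it would be something like the following. This is
untested here, assumes debugfs is mounted at /sys/kernel/debug, and the
filter list and output file name are just examples - substitute
whatever functions you actually care about:

# cd /sys/kernel/debug/tracing
# echo 32768 > buffer_size_kb
# echo xfs_ilock xfs_iunlock handle_mm_fault > set_ftrace_filter
# echo function_graph > current_tracer
# echo 1 > events/sched/sched_switch/enable
  .... wait for a stall to occur, then ....
# cat trace > /tmp/stall-trace.txt

Keeping set_ftrace_filter narrow like that cuts the per-event volume
right down, so the larger buffer has a much better chance of holding
the entire stall.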
[snip]

> > Hmmm - perhaps the issue is that the transaction commit has not been
> > able to get an iclog buffer to write the transaction into and so is
> > sleeping waiting for an iclog IO to complete. That would be woken by
> > the xfslogd, and then the transaction would write into the
> > now-available iclog and unlock the inode.
> >
> > So, I'd say that in this case your application is waiting for log IO
> > completion to occur. Nothing you can do to avoid that, but you can
> > minimise the impact - increase the log buffer size via the mount
> > option logbsize (i.e. -o logbsize=262144) to reduce the frequency of
> > iclog buffer IOs and make it less likely the system will stall
> > having run out of iclog space.
> >
> > The other possibility is that the log IO is being slowed by barriers
> > being issued with the log writes.  If you've got non-volatile caches
> > on your storage, you can turn off barriers and that should help
> > reduce log IO latency significantly.
>
> I've tried mounting with "noatime,nobarrier,logbufs=8,logbsize=262144"
> but I don't notice much difference and still see the stalls.

Ok. One other possibility came to mind - the transaction commit is
blocking while locking the superblock buffer to apply free block count
changes. Do you have the "lazy-count" option set (xfs_info <mtpt> will
tell you)?  If it isn't set, then transaction commit latency can be
quite high if the sb buffer is being written to disk when the
transaction tries to lock it.  If it is not set, you can use xfs_admin
to set that bit (mkfs has defaulted to lazy-count=1 for some time now).

If this doesn't help, then we need to do more work to determine what
is happening...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
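
PS: checking and flipping lazy-count would look roughly like this -
untested, <mtpt> and <device> are placeholders, the change needs the
filesystem unmounted, and it's worth checking the xfs_admin man page
for the exact option on your xfsprogs version:

# xfs_info <mtpt> | grep lazy-count
# umount <mtpt>
# xfs_admin -c 1 <device>
# mount <mtpt>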