Hey Dave,

Thanks for the suggestions.

On Fri, Jul 23, 2010 at 09:28:46AM +1000, Dave Chinner wrote:
> On Thu, Jul 22, 2010 at 09:47:06AM -0500, Shawn Bohrer wrote:
> > Hello,
> >
> > We have an application that receives data over the network, and then
> > writes data out to a shared memory mapped file on xfs. Previously
> > everything worked great when we were running a CentOS 5.3 kernel
> > (2.6.18-128.7.1.el5), but when we upgraded to the newer CentOS 5.4 or
> > 5.5 kernels we started to see occasional spikes (up to 1 second) in
> > the time it takes to write the data to the memory mapped file. I
> > noticed that between the CentOS 5.3 and 5.4 kernels the upstream xfs
> > changes were backported. Testing on 2.6.32.16 shows the same
> > occasional latency spikes.
> >
> > After doing some debugging on 2.6.32.16 I have found the cause of the
> > latency spikes. The spikes occur every time pdflush runs to flush the
> > dirty pages. Note our app only writes data at about 1.5 MB/s, and we
> > are well below both the dirty_background_ratio and the dirty_ratio, so
> > these flushes are only of pages that are dirty_expire_centisecs old.
> > Capturing one of the spikes with ftrace shows:
> >
> > ...
> > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: | xfs_ilock() {
> > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: | down_write() {
> > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: 0.213 us | _cond_resched();
> > flush-8:0-601 [011] 1291592.127591: funcgraph_exit: 0.657 us | }
> > flush-8:0-601 [011] 1291592.127591: funcgraph_exit: 1.174 us | }
> > flush-8:0-601 [011] 1291592.127764: sched_switch: task flush-8:0:601 [120] (D) ==> swapper:0 [120]
> > ...
>
> What is the lead-in context to this hunk? i.e. how does xfs_ilock()
> get called? That will tell me if it is blocking on the iolock or the
> ilock...

I don't have the lead-in context here. I'm using the function_graph
tracer, and with stalls up to a second in length it easily overflowed
the ftrace ring buffer. Once I found that my process was blocking on
xfs_ilock(), I limited my traces to include xfs_ilock, xfs_iunlock,
handle_mm_fault, and the scheduler events. If you are still interested
in seeing more detail here, I can try to collect some more captures.
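
In case it's useful, a setup along these lines will reproduce that kind of
capture (the paths assume debugfs is mounted at /sys/kernel/debug; the
buffer size and output file name are arbitrary choices, not what matters):

  cd /sys/kernel/debug/tracing

  # only graph the functions of interest, to keep the trace volume down
  echo function_graph > current_tracer
  echo xfs_ilock > set_graph_function
  echo xfs_iunlock >> set_graph_function
  echo handle_mm_fault >> set_graph_function

  # scheduler events show who blocks and who wakes whom
  echo 1 > events/sched/sched_switch/enable
  echo 1 > events/sched/sched_wakeup/enable

  # a larger per-cpu buffer makes it less likely to lose events
  # during a long stall (size in KB, picked arbitrarily here)
  echo 65536 > buffer_size_kb

  echo 1 > tracing_on
  # ... reproduce a stall, then grab the buffer:
  cat trace > /tmp/stall-trace.txt

The "write_delay" annotations in the traces below are the sort of marker an
application can emit itself by writing short strings to the trace_marker
file in the same directory.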
> > Here you can see flush-8:0 acquires the xfs_ilock(), then I believe it
> > blocks waiting for the I/O to complete. Next you can see my
> > file_writer process generates a page fault while writing new data to
> > the memory mapped file. The page fault tries to acquire the
> > xfs_ilock() and then blocks because it is held by the flush-8:0
> > process.
> >
> > ...
> > file_writer-10549 [001] 1291593.100416: print: 00000000 write_delay start
> > file_writer-10549 [001] 1291593.100419: funcgraph_entry: | handle_mm_fault() {
> > file_writer-10549 [001] 1291593.100420: funcgraph_entry: | __do_fault() {
> > file_writer-10549 [001] 1291593.100420: funcgraph_entry: | filemap_fault() {
> > file_writer-10549 [001] 1291593.100421: funcgraph_entry: 0.775 us | find_get_page();
> > file_writer-10549 [001] 1291593.100422: funcgraph_entry: 0.191 us | _cond_resched();
> > file_writer-10549 [001] 1291593.100422: funcgraph_exit: 1.980 us | }
> > file_writer-10549 [001] 1291593.100422: funcgraph_entry: | unlock_page() {
> > file_writer-10549 [001] 1291593.100423: funcgraph_entry: 0.154 us | page_waitqueue();
> > file_writer-10549 [001] 1291593.100423: funcgraph_entry: 0.313 us | __wake_up_bit();
> > file_writer-10549 [001] 1291593.100423: funcgraph_exit: 1.080 us | }
> > file_writer-10549 [001] 1291593.100424: funcgraph_entry: | xfs_vm_page_mkwrite() {
> > file_writer-10549 [001] 1291593.100424: funcgraph_entry: | block_page_mkwrite() {
> > file_writer-10549 [001] 1291593.100424: funcgraph_entry: 0.240 us | _cond_resched();
> > file_writer-10549 [001] 1291593.100425: funcgraph_entry: | block_prepare_write() {
> > file_writer-10549 [001] 1291593.100425: funcgraph_entry: | __block_prepare_write() {
> > file_writer-10549 [001] 1291593.100426: funcgraph_entry: | create_empty_buffers() {
> > file_writer-10549 [001] 1291593.100426: funcgraph_entry: | alloc_page_buffers() {
> > file_writer-10549 [001] 1291593.100426: funcgraph_entry: | alloc_buffer_head() {
> > file_writer-10549 [001] 1291593.100426: funcgraph_entry: | kmem_cache_alloc() {
> > file_writer-10549 [001] 1291593.100427: funcgraph_entry: 0.491 us | _cond_resched();
> > file_writer-10549 [001] 1291593.100427: funcgraph_exit: 1.043 us | }
> > file_writer-10549 [001] 1291593.100428: funcgraph_entry: 0.185 us | recalc_bh_state();
> > file_writer-10549 [001] 1291593.100428: funcgraph_exit: 2.003 us | }
> > file_writer-10549 [001] 1291593.100428: funcgraph_entry: 0.137 us | set_bh_page();
> > file_writer-10549 [001] 1291593.100428: funcgraph_exit: 2.701 us | }
> > file_writer-10549 [001] 1291593.100429: funcgraph_entry: 0.271 us | _spin_lock();
> > file_writer-10549 [001] 1291593.100429: funcgraph_exit: 3.751 us | }
> > file_writer-10549 [001] 1291593.100429: funcgraph_entry: | xfs_get_blocks() {
> > file_writer-10549 [001] 1291593.100430: funcgraph_entry: | __xfs_get_blocks() {
> > file_writer-10549 [001] 1291593.100430: funcgraph_entry: | xfs_iomap() {
> > file_writer-10549 [001] 1291593.100430: funcgraph_entry: | xfs_ilock() {
>
> So this is trying to lock the ilock exclusively to map the range or
> do delayed allocation reservations.
>
> > Finally xfslogd/1 wakes up the flush-8:0 process, which releases the
> > xfs_ilock(), waking up my file_writer process and allowing it to
> > continue.
> >
> > ...
> > xfslogd/1-213 [001] 1291593.158606: sched_wakeup: task flush-8:0:601 [120] success=1 [011]
> > <idle>-0 [011] 1291593.158609: sched_switch: task swapper:0 [120] (R) ==> flush-8:0:601 [120]
>
> xfslogd? Inodes do not get locked across log writes; they are
> unlocked in the transaction commit.
>
> Hmmm - perhaps the issue is that the transaction commit has not been
> able to get an iclog buffer to write the transaction into, and so is
> sleeping waiting for an iclog IO to complete. That would be woken by
> the xfslogd, and then the transaction would write into the
> now-available iclog and unlock the inode.
>
> So, I'd say that in this case your application is waiting for log IO
> completion to occur. Nothing you can do to avoid that, but you can
> minimise the impact - increase the log buffer size via the mount
> option logbsize (i.e. -o logbsize=262144) to reduce the frequency of
> iclog buffer IOs and make it less likely the system will stall
> having run out of iclog space.
>
> The other possibility is that the log IO is being slowed by barriers
> being issued with the log writes. If you've got non-volatile caches
> on your storage, you can turn off barriers and that should help
> reduce log IO latency significantly.

I've tried mounting with "noatime,nobarrier,logbufs=8,logbsize=262144",
but I don't notice much difference and still see the stalls.
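
For reference, applying and checking those options amounted to roughly the
following (the device and mount point here are placeholders, not the real
ones):

  # remount with larger log buffers and barriers disabled
  # (/dev/sdb1 and /data stand in for the real device and mount point)
  umount /data
  mount -t xfs -o noatime,nobarrier,logbufs=8,logbsize=262144 /dev/sdb1 /data

  # confirm the options actually took effect
  grep /data /proc/mounts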

> Another possibility is to switch to the deadline or no-op IO
> scheduler to avoid log IO being delayed by (the ever changing)
> CFQ IO scheduling idiosyncrasies.

I also tried the noop IO scheduler, but didn't see any improvement
there either.

Thanks,
Shawn