On Fri, May 16, 2014 at 09:55:14AM +1000, Dave Chinner wrote:
> On Thu, May 15, 2014 at 05:41:53PM +0200, Jan Kara wrote:
> > Hello,
> >
> > so I was recently thinking about how the writeback code shuffles inodes
> > between lists and also how redirty_tail() clobbers the dirtied_when
> > timestamp (which broke my sync(2) optimization). This patch series came
> > out of that. Patch 1 is a clear win and just needs an independent review
> > that I didn't forget about something. Patch 3 changes writeback list
> > handling - IMHO it makes the logic somewhat more straightforward as we
> > don't have to bother shuffling inodes between lists and we also don't
> > need to clobber the dirtied_when timestamp. But opinions may differ...
> >
> > The patches passed an xfstests run and I did some basic writeback tests
> > using tiobench and some artificial sync livelock tests to verify nothing
> > regressed. So I'd be happy if people could have a look.
>
> Performance regresses significantly.
>
> The test is on a 16p/16GB VM with a sparse 100TB XFS filesystem backed
> by a pair of SSDs in RAID0:
>
> ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 \
>         -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 \
>         -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 \
>         -d /mnt/scratch/6 -d /mnt/scratch/7
>
> That creates 10 million 4k files with 16 threads and 10000 files per
> directory. No sync/fsync is done, so it's a pure background writeback
> workload.
>
> For 0-400,000 files it runs in memory, at 400-800k files background
> writeback is occurring, and at > 800k files foreground throttling is
> occurring.
>
> The file create rates and write IOPS/bandwidth are:
>
>                        vanilla                 patched
> load point       files   iops   bw        files   iops   bw
> < bg thres        120k      0   0          110k      0   0
> < fg thres        120k    37k   210MB/s     60k    20k   130MB/s
> sustained          36k    37k   210MB/s     25k    28k   150MB/s
>
> The average create rate is 40k (vanilla) vs 28k (patched). Wall times:
>
>            vanilla       patched
> real       4m27.475s     6m29.364s
> user       1m7.072s      1m3.590s
> sys        10m0.836s     22m34.362s
>
> The new code burns more than twice the system CPU whilst going
> significantly slower.

OK, so it looks like we've got a case of lock contention:

    48.37%  [kernel]  [k] _raw_spin_lock
            - _raw_spin_lock
               - 96.32% __mark_inode_dirty
                  - 99.99% __set_page_dirty
                       mark_buffer_dirty
                       __block_commit_write.isra.25
                       block_write_end
                       generic_write_end
                       xfs_vm_write_end
                       generic_perform_write
                       xfs_file_buffered_aio_write
                       xfs_file_aio_write
                       do_sync_write
                       vfs_write
                       sys_write
                       tracesys
                       __GI___libc_write
                  + 1.80% xfs_log_commit_cil

Which is almost certainly the bdi->wb.list_lock.

  -  6.70%  [kernel]  [k] tag_for_io
        tag_for_io
        wb_writeback
        bdi_writeback_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

But that's still a lot of CPU in tag_for_io(). Ah, that's the cause:
tag_for_io() holds the bdi->wb.list_lock for the entire list walk while
tagging inodes. That explains why performance doesn't start to drop
until writeback begins. Ouch:

writeback_tag_for_io: bdi 253:32: older=4295532768 age=56 tagged=14092 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532808 age=56 tagged=13556 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532848 age=56 tagged=12387 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532888 age=60 tagged=11414 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532929 age=60 tagged=11128 reason=background
writeback_tag_for_io: bdi 253:32: older=4295532970 age=60 tagged=10457 reason=background

We're holding the lock while we walk 10,000+ inodes at a time and the
other 15 CPUs are spinning on that lock.
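To make the contention pattern concrete, here's a rough sketch of what a
walk-and-tag under a single list_lock looks like. This is an illustration
only, not the actual code from the patch series; the structure names and
the tag flag below are made up.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/jiffies.h>

/* Illustrative sketch only -- not the patch's actual tag_for_io(). */
struct sketch_wb {
	spinlock_t	 list_lock;	/* stands in for bdi->wb.list_lock */
	struct list_head b_dirty;	/* inodes ordered by dirtied_when */
};

struct sketch_inode {			/* stand-in for struct inode */
	struct list_head i_io_list;
	unsigned long	 dirtied_when;
	unsigned long	 i_state;
};

#define I_TAGGED_FOR_IO	(1 << 0)	/* hypothetical "selected for IO" flag */

static void tag_for_io_sketch(struct sketch_wb *wb,
			      unsigned long older_than_this)
{
	struct sketch_inode *inode;

	spin_lock(&wb->list_lock);	/* held across the entire walk */
	list_for_each_entry(inode, &wb->b_dirty, i_io_list) {
		if (time_after(inode->dirtied_when, older_than_this))
			break;		/* list is time ordered, rest is newer */
		inode->i_state |= I_TAGGED_FOR_IO;
	}
	spin_unlock(&wb->list_lock);
	/*
	 * With 10,000+ inodes tagged per call, every other CPU dirtying
	 * pages via __mark_inode_dirty() spins on list_lock until the
	 * walk completes -- which matches the profile above.
	 */
}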
List walking and tagging like this is never going to scale.

I think we do need multiple lists, but not like we currently have. Right
now the problem is that inodes with different dirtied_when values sit on
the same ordered list, and we need to efficiently find the first inode
with a specific dirtied_when value within that list to determine which
inodes need to be written back.

Seems to me that this problem has been solved elsewhere in the kernel,
e.g. for tracking timers that expire on a given jiffy (tvec, tvec_base,
timer_list). Perhaps we should be looking to move to a set of time-based
lists for efficiently tracking which inodes should be written back in a
given background writeback interval, rather than trying to keep
everything on the one list. A rough sketch of the idea follows below.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
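For illustration, here is a rough sketch of what such time-based lists
could look like, loosely modelled on the timer wheel. All names, the
bucket count and the per-second granularity are made up for the sketch;
this is the shape of the idea, not a proposed implementation.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/jiffies.h>

#define WB_TIME_BUCKETS	64

struct wb_time_buckets {
	spinlock_t	 lock;
	unsigned long	 base_time;	/* dirtied_when of bucket 0 */
	struct list_head bucket[WB_TIME_BUCKETS];
};

/* Newly dirtied inodes go straight into the bucket for their dirtied_when. */
static struct list_head *wb_bucket_for(struct wb_time_buckets *b,
				       unsigned long dirtied_when)
{
	unsigned long idx = 0;

	if (time_after(dirtied_when, b->base_time))
		idx = (dirtied_when - b->base_time) / HZ;
	if (idx >= WB_TIME_BUCKETS)
		idx = WB_TIME_BUCKETS - 1;	/* clamp far-future dirtying */
	return &b->bucket[idx];
}

/*
 * Expiry splices whole buckets older than the cutoff onto a private work
 * list -- O(buckets) under the lock, independent of how many inodes are
 * dirty, instead of an O(inodes) walk-and-tag.
 */
static void wb_expire_sketch(struct wb_time_buckets *b,
			     unsigned long older_than_this,
			     struct list_head *work_list)
{
	unsigned long i, nr;

	if (time_before(older_than_this, b->base_time))
		return;				/* nothing old enough yet */

	nr = (older_than_this - b->base_time) / HZ;
	if (nr > WB_TIME_BUCKETS)
		nr = WB_TIME_BUCKETS;

	spin_lock(&b->lock);
	for (i = 0; i < nr; i++)
		list_splice_tail_init(&b->bucket[i], work_list);
	spin_unlock(&b->lock);
}

A real implementation would also have to advance base_time and handle
redirtying, but the point is that the work done under the lock scales
with the number of buckets rather than the number of dirty inodes.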