Re: High CPU Utilization When Copying to Ext4

Sean McCauliff <Sean.D.McCauliff@xxxxxxxx> · Fri, 8 Jul 2011 10:08:31 -0700

I tried running perf on the copy program on a subset of the sparse
files.  It seems like ext4 is the source of the high CPU utilization:
perf attributes roughly 77% of cycles to ext4_mb_good_group, called
from ext4_mb_regular_allocator.  At this point the high CPU utilization
is very annoying, but I can live with it.  If you know of something
simple I could do to alleviate this problem I would be most
appreciative.  At the end of this email is a consolidation of
information about this problem.

Events: 6M cycles
-76.80%     java  [kernel.kallsyms]      [k] ext4_mb_good_group
  - ext4_mb_good_group
      - 99.24% ext4_mb_regular_allocator
           ext4_mb_new_blocks
           ext4_ext_map_blocks
           ext4_map_blocks
         - mpage_da_map_and_submit
            - 96.25% write_cache_pages_da
                 ext4_da_writepages
                 do_writepages
                 writeback_single_inode
                 writeback_sb_inodes
                 writeback_inodes_wb
                 balance_dirty_pages_ratelimited_nr
                 generic_file_buffered_write
                 __generic_file_aio_write
                 generic_file_aio_write
                 ext4_file_write
                 do_sync_write
                 vfs_write
                 sys_write
                 system_call_fastpath
               - 0x338480df7d
                    100.00% writeBytes
            + 3.75% ext4_da_writepages
      + 0.76% ext4_mb_new_blocks
+4.07%     java  [kernel.kallsyms]      [k] do_raw_spin_lock
+2.19%     java  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
+1.53%     java  [kernel.kallsyms]      [k] ext4_get_group_info
+1.07%     java  [kernel.kallsyms]      [k] ext4_mb_regular_allocator
+1.07%     java  [kernel.kallsyms]      [k] compaction_alloc
+0.85%     java  [kernel.kallsyms]      [k] read_hpet
+0.40%     java  [kernel.kallsyms]      [k] copy_user_generic_string
+0.32%     java  [kernel.kallsyms]      [k] __bitmap_empty
+0.31%     java  [kernel.kallsyms]      [k] ktime_get

Specifics:

The copy program is written in Java with some C code that calls the
fiemap ioctl.  It uses this to preserve the sparseness of the
destination files, and it seems to be much faster than the contiguous
zero detection that tar or cp do in order to identify the holes in the
files.  The copy program is using 64 threads.
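
For reference, here is a minimal sketch of the hole-preserving copy
idea.  It is not the actual program: it is single threaded, it is
written in Python rather than Java/C, and instead of the fiemap ioctl it
uses lseek() with SEEK_DATA/SEEK_HOLE, which needs a newer kernel than
the 2.6.38 one described below.  The function name copy_sparse and the
chunk size are made up for illustration.

```python
import errno
import os


def copy_sparse(src_path, dst_path, chunk=1 << 20):
    """Copy src to dst, writing only data regions so holes stay holes."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size = os.fstat(src).st_size
        offset = 0
        while offset < size:
            try:
                # Start of the next data region at or after offset.
                data_start = os.lseek(src, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:
                    break  # nothing but a hole up to end of file
                raise
            # End of that data region (start of the next hole, or EOF).
            data_end = os.lseek(src, data_start, os.SEEK_HOLE)
            pos = data_start
            while pos < data_end:
                buf = os.pread(src, min(chunk, data_end - pos), pos)
                if not buf:
                    break  # source shrank underneath us
                done = 0
                while done < len(buf):
                    done += os.pwrite(dst, buf[done:], pos + done)
                pos += len(buf)
            offset = data_end
        # Extend the destination so a trailing hole is preserved too.
        os.ftruncate(dst, size)
    finally:
        os.close(src)
        os.close(dst)
```

On a file system that does not report holes, SEEK_DATA/SEEK_HOLE treat
the whole file as one data region, so the sketch degrades to a plain
copy.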

During the copy, system CPU is over 90%; iowait is generally only 1 or 2%.
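
These figures were read off the usual tools.  Should anyone want to
sample the same numbers from a script, a rough sketch is below, assuming
the standard field order of the aggregate "cpu" line in /proc/stat
(user, nice, system, idle, iowait, irq, softirq, steal).  The helper
names cpu_shares and read_cpu_line are made up for illustration.

```python
def cpu_shares(before, after):
    """Return (system_pct, iowait_pct) between two aggregate 'cpu' lines."""
    def fields(line):
        parts = line.split()
        if parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        # First eight counters only; guest time is already part of user.
        return [int(x) for x in parts[1:9]]

    b, a = fields(before), fields(after)
    delta = [y - x for x, y in zip(b, a)]
    total = sum(delta)
    if total <= 0:
        raise ValueError("no ticks elapsed between the two samples")
    # Index 2 is system time, index 4 is iowait.
    return 100.0 * delta[2] / total, 100.0 * delta[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()
```

Call read_cpu_line() twice a few seconds apart and pass both lines to
cpu_shares() to get percentages over that interval.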

Source file system is 8T ext3, destination file system is 16T ext4.
Files are sparse; the non-sparse size is 17M.  They have a few hundred
extents on average, as reported by filefrag.  The destination files
generated by the copy program have fewer extents but are otherwise
identical.  I assume this is due to smarter allocation by ext4.
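
To compare the apparent size of a sparse file with the space it
actually occupies, something like the following sketch works on Linux,
where st_blocks is counted in 512-byte units.  The function name
sparse_sizes is made up for illustration; this is not part of the copy
program.

```python
import os


def sparse_sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # st_size is the logical length; st_blocks counts 512-byte units
    # actually allocated on Linux, so holes do not contribute.
    return st.st_size, st.st_blocks * 512
```

For a sparse file the second number is normally smaller than the first.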

The source file system is built on top of LVM, which is built on top of
four multipath devices that load balance across a pair of QLogic FC
HBAs.  The destination file system is built on top of a single
multipath device that load balances across the same pair of HBAs (no
LVM).

The SAN is a 3PAR with 240 SATA drives.  Each LUN exported to the server
is in a RAID1+0 configuration striped over all the drives.  The server
is directly connected, without an FC switch.

Fedora 15.
Linux xxxx.arc.nasa.gov 2.6.38.8-32.fc15.x86_64 #1 SMP Mon Jun 13 
19:49:05 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

The server has 8 cores and 64G of memory.

Nothing else is running or consuming substantial resources on this
server.  top shows that the java, flush and kworker processes are
consuming CPU.

Thanks!
Sean

On 06/29/2011 07:33 PM, Ted Ts'o wrote:
> On Wed, Jun 29, 2011 at 05:01:45PM -0700, Sean McCauliff wrote:
>> Sorry, I didn't mean to bother you.  I did try and email ext3-users
>> so as to not take up any developer time with my question.
>
> Yeah, but it's not likely anyone on that list would be able to help
> you.  Neither ext3 nor ext4 is expected to take a huge amount of CPU
> under normal conditions when doing this type of copying, where you
> will likely be disk bound.
>
> Well, you're not using fallocate() (at least you haven't disclosed it
> to date), and writing into fallocated space is the only thing that
> would be using a workqueue at all (which is what the kworker threads
> are using).
>
> So I very much doubt it has anything to do with ext4.  The Fibre
> Channel drivers do use workqueues a fair amount, so yes, it would be
> useful to know that you are using a Fibre Channel SAN.  At this point
> I'd suggest that you use oprofile or perf to see where the CPU is
> being consumed.  Perf is probably better since it will allow you to
> see the call chains.
>
> 						- Ted
