Hi Zheng, this is output of tune2fs: tune2fs 1.41.12 (17-May-2010) Filesystem volume name: <none> Last mounted on: / Filesystem UUID: 358707f1-8997-4e96-8110-d9f929073e68 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options: journal_data_writeback user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 23281664 Block count: 93110272 Reserved block count: 4655513 Free blocks: 44878839 Free inodes: 22074662 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 1001 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8192 Inode blocks per group: 512 Flex block group size: 16 Filesystem created: Fri Oct 7 05:01:07 2011 Last mount time: Mon Sep 9 18:49:39 2013 Last write time: Wed Aug 21 10:48:03 2013 Mount count: 23 Maximum mount count: -1 Last checked: Fri Oct 7 05:01:07 2011 Check interval: 0 (<none>) Lifetime writes: 1795 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode: 8 First orphan inode: 4207998 Default directory hash: half_md4 Directory Hash Seed: 4de319c2-fc2d-4771-b4de-b63acd610ba0 Journal backup: inode blocks And this is mount options: UUID=358707f1-8997-4e96-8110-d9f929073e68 / ext4 defaults 1 1 Yes, I forgot to mention that in the deferred write test, turning off deferred allocation or using preallocation, the problem is largely gone. I mentioned largely gone because it still happens but much less frequently. Thanks. --Cuong On Thu, Sep 12, 2013 at 12:46 AM, Zheng Liu <gnehzuil.liu@xxxxxxxxx> wrote: > Hello Cuong, > > Could you please tell us the kernel version? Meanwhile it would be > better if you could paste the result of the following commands: > * sudo tune2fs -l ${DEV} > * cat /proc/mounts | grep ${DEV} > > I want to know which feature is enabled in your file system. From your > description, I guess delayed allocation is enabled. So I suggest that > you can try to disable it. You can disable it using the following > command: > * sudo mount -t ext4 -o remount,nodelalloc ${DEV} ${MNT} > > Regards, > - Zheng > > On Wed, Sep 11, 2013 at 09:17:26PM -0700, Cuong Tran wrote: >> We have seen GC stalls that are NOT due to memory usage of applications. >> >> GC log reports the CPU user and system time of GC threads, which are >> almost 0, and stop-the-world time, which can be multiple seconds. This >> indicates GC threads are waiting for IO but GC threads should be >> CPU-bound in user mode. >> >> We could reproduce the problems using a simple Java program that just >> appends to a log file via log4j. If the test just runs by itself, it >> does not incur any GC stalls. However, if we run a script that enters >> a loop to create multiple large file via falloc() and then deletes >> them, then GC stall of 1+ seconds can happen fairly predictably. >> >> We can also reproduce the problem by periodically switch the log and >> gzip the older log. IO device, a single disk drive, is overloaded by >> FS flush when this happens. >> >> Our guess is GC has to acquiesce its threads and if one of the threads >> is stuck in the kernel (say in non-interruptible mode). Then GC has to >> wait until this thread unblocks. In the mean time, it already stops >> the world. >> >> Another test that shows similar problem is doing deferred writes to >> append a file. Latency of deferred writes is very fast but once a >> while, it can last more than 1 second. >> >> We would really appreciate if you could shed some light on possible >> causes? (Threads blocked because of journal check point, delayed >> allocation can't proceed?). We could alleviate the problem by >> configuring expire_centisecs and writeback_centisecs to flush more >> frequently, and thus even-out the workload to the disk drive. But we >> would like to know if there is a methodology to model the rate of >> flush vs. rate of changes and IO throughput of the drive (SAS, 15K >> RPM). >> >> Many thanks. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in >> the body of a message to majordomo@xxxxxxxxxxxxxxx >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html