This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "XFS development tree".

The branch, for-linus has been updated
  14c26c6 xfs: add trace points for log forces
  3ba3160 xfs: fix memory reclaim deadlock on agi buffer
  ea562ed xfs: fix delalloc quota accounting on failure
  1307bbd xfs: protect xfs_sync_worker with s_umount semaphore
  3fe3e6b xfs: introduce SEEK_DATA/SEEK_HOLE support
  e700a06 xfs: make xfs_extent_busy_trim not static
  611c994 xfs: make XBF_MAPPED the default behaviour
  d4f3512 xfs: flush outstanding buffers on log mount failure
  12bcb3f xfs: Properly exclude IO type flags from buffer flags
  ad1e95c xfs: clean up xfs_bit.h includes
  2af51f3 xfs: move xfs_do_force_shutdown() and kill xfs_rw.c
  2a0ec1d xfs: move xfs_get_extsz_hint() and kill xfs_rw.h
  fd50092 xfs: move xfs_fsb_to_db to xfs_bmap.h
  4ecbfe6 xfs: clean up busy extent naming
  efc27b5 xfs: move busy extent handling to it's own file
  60a3460 xfs: move xfsagino_t to xfs_types.h
  bc4010e xfs: use iolock on XFS_IOC_ALLOCSP calls
  aa5c158 xfs: kill XBF_DONTBLOCK
  7ca790a xfs: kill xfs_read_buf()
  a8acad7 xfs: kill XBF_LOCK
  795cac7 xfs: kill xfs_buf_btoc
  aa0e883 xfs: use blocks for storing the desired IO size
  4e94b71 xfs: use blocks for counting length of buffers
  de1cbee xfs: kill b_file_offset
  e70b73f xfs: clean up buffer get/read call API
  bf813cd xfs: use kmem_zone_zalloc for buffers
  ead360c xfs: fix incorrect b_offset initialisation
  0e95f19 xfs: check for buffer errors before waiting
  fe2429b xfs: fix buffer lookup race on allocation failure
  aff3a9e xfs: Use preallocation for inodes with extsz hints
  3ed9116 xfs: limit specualtive delalloc to maxioffset
  58e2077 xfs: don't assert on delalloc regions beyond EOF
  81158e0 xfs: prevent needless mount warning causing test failures
  d3bc815 xfs: punch new delalloc blocks out of failed writes inside EOF.
  6ffc4db xfs: page type check in writeback only checks last buffer
  4c2d542 xfs: Do background CIL flushes via a workqueue
  04913fd xfs: pass shutdown method into xfs_trans_ail_delete_bulk
  a856917 xfs: remove some obsolete comments in xfs_trans_ail.c
  43ff212 xfs: on-stack delayed write buffer lists
  960c60a xfs: do not add buffers to the delwri queue until pushed
  fe7257f xfs: do not write the buffer from xfs_qm_dqflush
  4c46819 xfs: do not write the buffer from xfs_iflush
  8a48088 xfs: don't flush inodes from background inode reclaim
  211e4d4 xfs: implement freezing by emptying the AIL
  1c30462 xfs: allow assigning the tail lsn with the AIL lock held
  32ce90a xfs: remove log item from AIL in xfs_iflush after a shutdown
  dea9609 xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown
  7582df5 xfs: using GFP_NOFS for blkdev_issue_flush
  01c84d2 xfs: punch all delalloc blocks beyond EOF on write failure.
  507630b xfs: use shared ilock mode for direct IO writes by default
  193aec1 xfs: push the ilock into xfs_zero_eof
  f38996f xfs: reduce ilock hold times in xfs_setattr_size
  467f789 xfs: reduce ilock hold times in xfs_file_aio_write_checks
  b4d05e3 xfs: avoid taking the ilock unnessecarily in xfs_qm_dqattach
  8a00ebe xfs: Ensure inode reclaim can run during quotacheck
  da5bf95 xfs: don't fill statvfs with project quota for a directory if it was not enabled.
  0195c00 Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system
  f21ce8f Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
  9ffc93f Remove all #inclusions of asm/system.h
  49d99a2 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
  48fde70 switch open-coded instances of d_make_root() to new helper
  8de5277 vfs: check i_nlink limits in vfs_{mkdir,rename_dir,link}
  c922bbc xfs: make inode quota check more general
  20f12d8 xfs: change available ranges of softlimit and hardlimit in quota check
  0529348 XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended
  04da0c8 xfs: use a normal shrinker for the dquot freelist
  from 5a5881cdeec2c019b5c9a307800218ee029f7f61 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit 14c26c6a05de138a4fd9a0c05ff8e7435a618324
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Apr 24 16:33:31 2012 +1000

    xfs: add trace points for log forces

    To enable easy tracing of the location of log forces and the
    frequency of them via perf, add a pair of trace points to the log
    force functions.
    This will help debug where excessive log forces are being issued
    from by simple perf commands like:

        # ~/perf/perf top -e xfs:xfs_log_force -G -U

    Which gives this sort of output:

    Events: 141 xfs:xfs_log_force
    -  100.00%  [kernel]  [k] xfs_log_force
       - xfs_log_force
            87.04% xfsaild
               kthread
               kernel_thread_helper
          - 12.87% xfs_buf_lock
               _xfs_buf_find
               xfs_buf_get
               xfs_trans_get_buf
               xfs_da_do_buf
               xfs_da_get_buf
               xfs_dir2_data_init
               xfs_dir2_leaf_addname
               xfs_dir_createname
               xfs_create
               xfs_vn_mknod
               xfs_vn_create
               vfs_create
               do_last.isra.41
               path_openat
               do_filp_open
               do_sys_open
               sys_open
               system_call_fastpath

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 3ba316037470bbf98c8a16c2179c02794fb8862e
Author: Peter Watkins <treestem@xxxxxxxxx>
Date:   Mon May 7 16:11:37 2012 -0400

    xfs: fix memory reclaim deadlock on agi buffer

    Note xfs_iget can be called while holding a locked agi buffer. If
    it goes into memory reclaim then inode teardown may try to lock the
    same buffer. Prevent the deadlock by calling radix_tree_preload
    with GFP_NOFS.

    Signed-off-by: Peter Watkins <treestem@xxxxxxxxx>
    Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit ea562ed6e7df5acd9392d993882c39e855099165
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue May 8 20:48:53 2012 +1000

    xfs: fix delalloc quota accounting on failure

    xfstest 270 was causing quota reservations way beyond what was sane
    (ten to hundreds of TB) for a 4GB filesystem. There's a sign
    problem in the error handling path of xfs_bmapi_reserve_delalloc()
    because xfs_trans_unreserve_quota_nblks() simply negates the value
    passed - which doesn't work for an unsigned variable. This causes
    reservations of close to 2^32 blocks instead of removing a
    reservation of a handful of blocks.

    Fix the same problem in the other xfs_trans_unreserve_quota_nblks()
    callers where unsigned integer variables are used, too.
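
Illustration (not part of the commit): the sign problem described above can be modelled in a few lines of Python. This is only a sketch, assuming the block count lives in a 32-bit unsigned variable and the negated value is then widened to a signed 64-bit parameter. The function names are hypothetical and the real kernel types may differ.

```python
import ctypes


def negate_as_u32(nblks: int) -> int:
    """Negate while the count is still a 32-bit unsigned value (assumed width)."""
    wrapped = ctypes.c_uint32(-nblks).value  # unsigned negation wraps modulo 2**32
    return ctypes.c_int64(wrapped).value     # widening keeps the huge positive value


def negate_as_s64(nblks: int) -> int:
    """Widen to a signed 64-bit value first, then negate."""
    widened = ctypes.c_int64(nblks).value
    return -widened


if __name__ == "__main__":
    # "Unreserving" 5 blocks: the first form turns into a reservation of
    # almost 2**32 blocks, the second removes 5 blocks as intended.
    print(negate_as_u32(5))
    print(negate_as_s64(5))
```

In this model the first call yields 2**32 - 5, which matches the "close to 2^32 blocks" symptom in the message, while the second yields -5.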

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Eric Sandeen <sandeen@xxxxxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 1307bbd2af67283131728637e9489002adb26f10
Author: Ben Myers <bpm@xxxxxxx>
Date:   Tue May 15 14:26:55 2012 -0500

    xfs: protect xfs_sync_worker with s_umount semaphore

    xfs_sync_worker checks the MS_ACTIVE flag in s_flags to avoid doing
    work during mount and unmount. This flag can be cleared by unmount
    after the xfs_sync_worker checks it but before the work is
    completed. This has caused crashes in the completion handler for
    the dummy transaction committed by xfs_sync_worker:

    PID: 27544  TASK: ffff88013544e040  CPU: 3  COMMAND: "kworker/3:0"
     #0 [ffff88016fdff930] machine_kexec at ffffffff810244e9
     #1 [ffff88016fdff9a0] crash_kexec at ffffffff8108d053
     #2 [ffff88016fdffa70] oops_end at ffffffff813ad1b8
     #3 [ffff88016fdffaa0] no_context at ffffffff8102bd48
     #4 [ffff88016fdffaf0] __bad_area_nosemaphore at ffffffff8102c04d
     #5 [ffff88016fdffb40] bad_area_nosemaphore at ffffffff8102c12e
     #6 [ffff88016fdffb50] do_page_fault at ffffffff813afaee
     #7 [ffff88016fdffc60] page_fault at ffffffff813ac635
        [exception RIP: xlog_get_lowest_lsn+0x30]
        RIP: ffffffffa04a9910  RSP: ffff88016fdffd10  RFLAGS: 00010246
        RAX: ffffc90014e48000  RBX: ffff88014d879980  RCX: ffff88014d879980
        RDX: ffff8802214ee4c0  RSI: 0000000000000000  RDI: 0000000000000000
        RBP: ffff88016fdffd10   R8: ffff88014d879a80   R9: 0000000000000000
        R10: 0000000000000001  R11: 0000000000000000  R12: ffff8802214ee400
        R13: ffff88014d879980  R14: 0000000000000000  R15: ffff88022fd96605
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
     #8 [ffff88016fdffd18] xlog_state_do_callback at ffffffffa04aa186 [xfs]
     #9 [ffff88016fdffd98] xlog_state_done_syncing at ffffffffa04aa568 [xfs]

    Protect xfs_sync_worker by using the s_umount semaphore at the read
    level to provide exclusion with unmount while work is progressing.

    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 3fe3e6b18216c1247497dfd51c35484338856e1b
Author: Jeff Liu <jeff.liu@xxxxxxxxxx>
Date:   Thu May 10 21:29:17 2012 +0800

    xfs: introduce SEEK_DATA/SEEK_HOLE support

    This patch adds lseek(2) SEEK_DATA/SEEK_HOLE functionality to xfs.

    Signed-off-by: Jie Liu <jeff.liu@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit e700a06c71dbbc0879a5d15881cca7b772282484
Author: Ben Myers <bpm@xxxxxxx>
Date:   Thu May 10 13:55:33 2012 -0500

    xfs: make xfs_extent_busy_trim not static

    Commit e459df5, 'xfs: move busy extent handling to it's own file'
    moved some code from xfs_alloc.c into xfs_extent_busy.c for
    convenience in userspace code merges. One of the functions moved is
    xfs_extent_busy_trim (formerly xfs_alloc_busy_trim) which is
    defined STATIC. Unfortunately this function is still used in
    xfs_alloc.c, and this results in an undefined symbol in xfs.ko.

    Make xfs_extent_busy_trim not static and add its prototype to
    xfs_extent_busy.h.

    Signed-off-by: Ben Myers <bpm@xxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>

commit 611c99468c7aa1a5c2bb6d46e7b5d8e53eecfefd
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:59:07 2012 +1000

    xfs: make XBF_MAPPED the default behaviour

    Rather than specifying XBF_MAPPED for almost all buffers, introduce
    XBF_UNMAPPED for the couple of users that use unmapped buffers.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit d4f3512b0891658b6b4d5fc99567242b3fc2d6b7
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:59:06 2012 +1000

    xfs: flush outstanding buffers on log mount failure

    When we fail to mount the log in xfs_mountfs(), we tear down all
    the infrastructure we have already allocated.
    However, the process of mounting the log may have progressed to the
    point of reading, caching and modifying buffers in memory. Hence
    before we can free all the infrastructure, we have to flush and
    remove all the buffers from memory.

    Problem first reported by Eric Sandeen, later a different
    incarnation was reported by Ben Myers.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 12bcb3f7d4371f74bd25372e98e0d2da978e82b2
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:59:05 2012 +1000

    xfs: Properly exclude IO type flags from buffer flags

    Recent event tracing during a debugging session showed that flags
    that define the IO type for a buffer are leaking into the flags on
    the buffer incorrectly. Fix the flag exclusion mask in
    xfs_buf_alloc() to avoid problems that may be caused by such
    leakage.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit ad1e95c54eb3980ab2b4683fba29ad0ef954ec51
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:59:04 2012 +1000

    xfs: clean up xfs_bit.h includes

    With the removal of xfs_rw.h and other changes over time, xfs_bit.h
    is being included in many files that don't actually need it. Clean
    up the includes as necessary.

    Also move the only-used-once xfs_ialloc_find_free() static inline
    function out of a header file that is widely included to reduce the
    number of needless dependencies on xfs_bit.h.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 2af51f3a4ef93945d20ff27ab28c5c68b5a21efc
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:59:03 2012 +1000

    xfs: move xfs_do_force_shutdown() and kill xfs_rw.c

    xfs_do_force_shutdown now is the only thing in xfs_rw.c. There is
    no need to keep it in its own file anymore, so move it to
    xfs_fsops.c next to xfs_fs_goingdown() and kill xfs_rw.c.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 2a0ec1d9ed7f3aa7974fccfbb612fadda2e10bad
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:59:02 2012 +1000

    xfs: move xfs_get_extsz_hint() and kill xfs_rw.h

    The only thing left in xfs_rw.h is a function prototype for an
    inode function. Move that to xfs_inode.h, and kill xfs_rw.h. Also
    move the function implementing the prototype from xfs_rw.c to
    xfs_inode.c so we only have one function left in xfs_rw.c.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit fd50092c08068b5bc5d170bc17894db584aaf7b2
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:59:01 2012 +1000

    xfs: move xfs_fsb_to_db to xfs_bmap.h

    This is the only remaining useful function in xfs_rw.h, so move it
    to a header file responsible for block mapping functions that the
    callers already include. Soon we can get rid of xfs_rw.h.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 4ecbfe637cbcc0f093d1f295ef483f4e31e3987b
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Sun Apr 29 10:41:10 2012 +0000

    xfs: clean up busy extent naming

    Now that the busy extent tracking has been moved out of the
    allocation files, clean up the namespace it uses to
    "xfs_extent_busy" rather than a mix of "xfs_busy" and
    "xfs_alloc_busy".

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit efc27b52594e322d4c94e379489fa3690bf74739
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Sun Apr 29 10:39:43 2012 +0000

    xfs: move busy extent handling to it's own file

    To make it easier to handle userspace code merges, move all the
    busy extent handling out of the allocation code and into its own
    file. The userspace code does not need the busy extent code, so
    this simplifies the merging of the kernel code into the userspace
    xfsprogs library.

    Because the busy extent code has been almost completely rewritten
    over the past couple of years, also update the copyright on this
    new file to include the authors that made all those changes.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 60a34607b26b60d6b5c5c928ede7fc84b0f06b85
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:58 2012 +1000

    xfs: move xfsagino_t to xfs_types.h

    Untangle the header file includes a bit by moving the definition of
    xfs_agino_t to xfs_types.h. This removes the dependency that
    xfs_ag.h has on xfs_inum.h, meaning we don't need to include
    xfs_inum.h everywhere we include xfs_ag.h.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit bc4010ecb8f4d4316e1a63a879a2715e49d113ad
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:57 2012 +1000

    xfs: use iolock on XFS_IOC_ALLOCSP calls

    fsstress has a particularly effective way of stopping debug XFS
    kernels. We keep seeing assert failures due to finding delayed
    allocation extents where there should be none. This shows up when
    extracting extent maps while we are holding all the locks we should
    be to prevent races, so it really makes no sense to see these
    errors.

    After checking that fsstress does not use mmap, it occurred to me
    that fsstress uses something that no sane application uses - the
    XFS_IOC_ALLOCSP ioctl interfaces for preallocation. These
    interfaces do allocation of blocks beyond EOF without using
    preallocation, and then call setattr to extend and zero the
    allocated blocks.

    The problem here is this is a buffered write, and hence the
    allocation is a delayed allocation. Unlike the buffered IO path,
    the allocation and zeroing are not serialised using the IOLOCK.
    Hence the ALLOCSP operation can race with operations holding the
    iolock to prevent buffered IO operations from occurring.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit aa5c158ec97bd4014f47a2bc0150fb6b20e6c48b
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:56 2012 +1000

    xfs: kill XBF_DONTBLOCK

    Just about all callers of xfs_buf_read() and xfs_buf_get() use
    XBF_DONTBLOCK. This is used to make memory allocation use GFP_NOFS
    rather than GFP_KERNEL to avoid recursion through memory reclaim
    back into the filesystem.

    All the blocking get calls in growfs occur inside a transaction,
    even though they are not part of the transaction, so all allocation
    will be GFP_NOFS due to the task flag PF_TRANS being set. The
    blocking read calls occur during log recovery, so they will
    probably be unaffected by converting to GFP_NOFS allocations.

    Hence make XBF_DONTBLOCK behaviour always occur for buffers and
    kill the flag.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 7ca790a507a9288ebedab90a8e40b9afa8e4e949
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:55 2012 +1000

    xfs: kill xfs_read_buf()

    xfs_read_buf() is effectively the same as xfs_trans_read_buf() when
    called outside a transaction context. The error handling is
    slightly different in that xfs_read_buf stales the errored buffer
    it gets back, but there is probably good reason for
    xfs_trans_read_buf() for doing this.

    Hence update xfs_trans_read_buf() to the same error handling as
    xfs_read_buf(), and convert all the callers of xfs_read_buf() to
    use the former function. We can then remove xfs_read_buf().

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit a8acad70731e7d0585f25f33f8a009176f001f70
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:54 2012 +1000

    xfs: kill XBF_LOCK

    Buffers are always returned locked from the lookup routines. Hence
    we don't need to tell the lookup routines to return locked buffers,
    or to try and lock them. Remove XBF_LOCK from all the callers and
    from internal buffer cache usage.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 795cac72e902496adac399389f9affe5d1ab821a
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:53 2012 +1000

    xfs: kill xfs_buf_btoc

    xfs_buf_btoc and friends are simple macros that do basic block to
    page index conversion and vice versa. These aren't widely used, and
    we use open coded masking and shifting everywhere else. Hence
    remove the macros and open code the work they do.

    Also, use of PAGE_CACHE_{SIZE|SHIFT|MASK} for these macros is now
    incorrect - we are using pages directly and not the page cache, so
    use PAGE_{SIZE|MASK|SHIFT} instead.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit aa0e8833b05cbd9d34d6a1ddaf23a74a58d76a03
Author: Dave Chinner <david@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:52 2012 +1000

    xfs: use blocks for storing the desired IO size

    Now that we pass block counts everywhere, and index buffers by
    block number and length in units of blocks, convert the desired IO
    size into block counts rather than bytes. Convert the code to use
    block counts, and those that need byte counts get converted at the
    time of use.

    Rename the b_desired_count variable to something closer to its
    purpose - b_io_length - as it is only used to specify the length of
    an IO for a subset of the buffer. The only time this is used is for
    log IO - both writing iclogs and during log recovery. In all other
    cases, the b_io_length matches b_length, and hence a lot of code
    confuses the two. e.g. the buf item code uses the io count
    exclusively when it should be using the buffer length. Fix these
    appropriately as they are found.

    Also, remove the XFS_BUF_{SET_}COUNT() macros that are just
    wrappers around the desired IO length.
    They only serve to make the code shouty loud, don't actually add
    any real value, and are often used incorrectly.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 4e94b71b7068b4bd9c615301197e09dbf0c3b770
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:51 2012 +1000

    xfs: use blocks for counting length of buffers

    Now that we pass block counts everywhere, and index buffers by
    block number, track the length of the buffer in units of blocks
    rather than bytes. Convert the code to use block counts, and those
    that need byte counts get converted at the time of use.

    Also, remove the XFS_BUF_{SET_}SIZE() macros that are just wrappers
    around the buffer length. They only serve to make the code shouty
    loud and don't actually add any real value.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit de1cbee46269a3b707eb99b37f33afdd4cfaaea4
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:50 2012 +1000

    xfs: kill b_file_offset

    Seeing as we pass block numbers around everywhere in the buffer
    cache now, it makes no sense to index everything by byte offset.
    Replace all the byte offset indexing with block number based
    indexing, and replace all uses of the byte offset with direct
    conversion from the block index.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit e70b73f84f474cc594a39bd8ff083974e6d69aea
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:49 2012 +1000

    xfs: clean up buffer get/read call API

    The xfs_buf_get/read API is not consistent in the units it uses,
    and does not use appropriate or consistent units/types for the
    variables.

    Convert the API to use disk addresses and block counts for all
    buffer get and read calls. Use consistent naming for all the
    functions and their declarations, and convert the internal
    functions to use disk addresses and block counts to avoid need to
    convert them from one type to another and back again.

    Fix all the callers to use disk addresses and block counts. In many
    cases, this removes an additional conversion from the function call
    as the callers already have a block count.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit bf813cdddfb3a5bc88e1612e8f62a12367871213
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:48 2012 +1000

    xfs: use kmem_zone_zalloc for buffers

    To replace the alloc/memset pair.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit ead360c50d33772f45943792893a58865adf3638
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:47 2012 +1000

    xfs: fix incorrect b_offset initialisation

    Because we no longer use the page cache for buffering, there is no
    direct block number to page offset relationship anymore.
    xfs_buf_get_pages is still setting up b_offset as if there was some
    relationship, and that is leading to incorrectly setting up
    *uncached* buffers that don't overwrite b_offset once they've had
    pages allocated.

    For cached buffers, the first block of the buffer is always at
    offset zero into the allocated memory. This is true for sub-page
    sized buffers, as well as for multiple-page buffers.

    For uncached buffers, b_offset is only non-zero when we are
    associating specific memory to the buffers, and that is set
    correctly by the code setting up the buffer.

    Hence remove the setting of b_offset in xfs_buf_get_pages, because
    it is now always the wrong thing to do.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 0e95f19ad983e72a9cb93a67b3290b58f0467b36
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:46 2012 +1000

    xfs: check for buffer errors before waiting

    If we call xfs_buf_iowait() on a buffer that failed dispatch due to
    an IO error, it will wait forever for an IO that does not exist.
    This is handled in xfs_buf_read, but there is other code that calls
    xfs_buf_iowait directly that doesn't.

    Rather than make the call sites have to handle checking for
    dispatch errors and then checking for completion errors, make
    xfs_buf_iowait() check for dispatch errors on the buffer before
    waiting. This means we handle both dispatch and completion errors
    with one set of error handling at the caller sites.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit fe2429b0966a7ec42b5fe3bf96f0f10de0a3b536
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:45 2012 +1000

    xfs: fix buffer lookup race on allocation failure

    When memory allocation fails to add the page array or the pages to
    a buffer during xfs_buf_get(), the buffer is left in the cache in a
    partially initialised state. There is enough state left for the
    next lookup on that buffer to find the buffer, and for the buffer
    to then be used without finishing the initialisation. As a result,
    when an attempt to do IO on the buffer occurs, it fails with EIO
    because there are no pages attached to the buffer.

    We cannot remove the buffer from the cache immediately and free it,
    because there may already be a racing lookup that is blocked on the
    buffer lock. Hence the moment we unlock the buffer to then free it,
    the other user is woken and we have a use-after-free situation.

    To avoid this race condition altogether, allocate the pages for the
    buffer before we insert it into the cache. This then means that we
    don't have an allocation failure case to deal with after the buffer
    is already present in the cache, and hence avoid the problem
    altogether. In most cases we won't have racing inserts for the same
    buffer, and so won't increase the memory pressure allocation before
    insertion may entail.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit aff3a9edb7080f69f07fe76a8bd089b3dfa4cb5d
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:44 2012 +1000

    xfs: Use preallocation for inodes with extsz hints

    xfstest 229 exposes a problem with buffered IO, delayed allocation
    and extent size hints. That is when we do delayed allocation during
    buffered IO, we reserve space for the extent size hint alignment
    and allocate the physical space to align the extent, but we do not
    zero the regions of the extent that aren't written by the write(2)
    syscall. The result is that we expose stale data in unwritten
    regions of the extent size hints.

    There are two ways to fix this. The first is to detect that we are
    doing unaligned writes, check if there is already a mapping or data
    over the extent size hint range, and if not zero the page cache
    first before then doing the real write. This can be very expensive
    for large extent size hints, especially if the subsequent writes
    fill the entire extent size before the data is written to disk.

    The second, and simpler way, is simply to turn off delayed
    allocation when the extent size hint is set and use preallocation
    instead. This results in unwritten extents being laid down on disk
    and so only the written portions will be converted. This matches
    the behaviour for direct IO, and will also work for the real time
    device.
    The disadvantage of this approach is that for small extent size
    hints we can get file fragmentation, but in general extent size
    hints are fairly large (e.g. stripe width sized) so this isn't a
    big deal.

    Implement the second approach as it is simple and effective.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 3ed9116e8a3e9c0870b2076340b3da9b8f900f3b
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Sun Apr 29 22:43:19 2012 +1000

    xfs: limit specualtive delalloc to maxioffset

    Speculative delayed allocation beyond EOF near the maximum
    supported file offset can result in creating delalloc extents
    beyond mp->m_maxioffset (8EB). These can never be trimmed during
    xfs_free_eof_blocks() because they are beyond mp->m_maxioffset, and
    that results in assert failures in xfs_fs_destroy_inode() due to
    delalloc blocks still being present. xfstests 071 exposes this
    problem.

    Limit speculative delalloc to mp->m_maxioffset to avoid this
    problem.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 58e20770646932fe9b758c94e8c278ea9ec93878
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Sun Apr 29 21:16:17 2012 +1000

    xfs: don't assert on delalloc regions beyond EOF

    When we are doing speculative delayed allocation beyond EOF,
    conversion of the region allocated beyond EOF is dependent on the
    largest free space extent available. If the largest free extent is
    smaller than the delalloc range, then after allocation we leave
    a delalloc extent that starts beyond EOF. This extent cannot *ever*
    be converted by flushing data, and so will remain there until
    either the EOF moves into the extent or it is truncated away.

    Hence if xfs_getbmap() runs on such an inode and is asked to return
    extents beyond EOF, it will assert fail on this extent even though
    there is nothing xfs_getbmap() can do to convert it to a real
    extent.
    Hence we should simply report these delalloc extents rather than
    assert that there should be none.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 81158e0cecdf53b1f6d88a514c6c20e0ee18ec7b
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Apr 27 19:45:22 2012 +1000

    xfs: prevent needless mount warning causing test failures

    Often mounting small filesystems with small logs will emit a
    warning such as:

    XFS (vdb): Invalid block length (0x2000) for buffer

    during log recovery. This causes tests to randomly fail because
    this output causes the clean filesystem checks on test completion
    to think the filesystem is inconsistent.

    The cause of the error is simply that log recovery is asking for a
    buffer size that is larger than the log when zeroing the tail. This
    is because the buffer size is rounded up, and if the right head and
    tail conditions exist then the buffer size can be larger than the
    log. Limit the variable size xlog_get_bp() callers to requesting
    buffers smaller than the log.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit d3bc815afb549eecb3679a4b2f0df216e34df998
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Apr 27 19:45:21 2012 +1000

    xfs: punch new delalloc blocks out of failed writes inside EOF.

    When a partial write inside EOF fails, it can leave delayed
    allocation blocks lying around because they don't get punched back
    out. This leads to assert failures like:

    XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847

    when evicting inodes from the cache. This can be trivially
    triggered by xfstests 083, which takes between 5 and 15 executions
    on a 512 byte block size filesystem to trip over this. Debugging
    shows a failed write due to ENOSPC calling xfs_vm_write_failed such
    as:

    [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae

    and no action is taken on it.
This leaves behind a delayed allocation extent that has no page covering it and no data in it:

  [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
  [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
  [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
  [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
                                      ^^^^^^^^^^^^^^^
  [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
  [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa

So the delayed allocation extent is one block long at offset 0x16c00. Tracing shows that a bigger write:

  xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags

allocates the block, and then fails with ENOSPC trying to allocate the last block on the page, leading to a failed write with stale delalloc blocks on it.

Because we've had an ENOSPC when trying to allocate 0x16e00, it means that we are never going to call ->write_end on the page and so the allocated new buffer will not get marked dirty or have the buffer_new state cleared. In other words, what the above write is supposed to end up with is this mapping for the page:

  +------+------+------+------+------+------+------+------+
    UMA    UMA    UMA    UMA    UMA    UMA    UND    FAIL

where:

  U = uptodate
  M = mapped
  N = new
  A = allocated
  D = delalloc
  FAIL = block we ENOSPC'd on.

and the key point being the buffer_new() state for the newly allocated delayed allocation block. Except it doesn't - we're not marking buffers new correctly.

That buffer_new() problem goes back to the xfs_iomap removal days, where xfs_iomap() used to return a "new" status for any map with newly allocated blocks, so that __xfs_get_blocks() could call set_buffer_new() on it. We still have the "new" variable and the check for it in the set_buffer_new() logic - except we never set it now!

Hence that newly allocated delalloc block doesn't have the new flag set on it, so when the write fails we cannot tell which blocks we are supposed to punch out. Why do we need the buffer_new flag?
Well, that's because we can have this case:

  +------+------+------+------+------+------+------+------+
    UMD    UMD    UMD    UMD    UMD    UMD    UND    FAIL

where all the UMD buffers contain valid data from a previously successful write() system call. We only want to punch the UND buffer because that's the only one that we added in this write and it was only this write that failed.

That implies that even the old buffer_new() logic was wrong - because it would result in all those UMD buffers on the page having set_buffer_new() called on them even though they aren't new. Hence we should only be calling set_buffer_new() for delalloc buffers that were allocated (i.e. were a hole before xfs_iomap_write_delay() was called).

So, fix this set_buffer_new logic according to how we need it to work for handling failed writes correctly. Also, restore the new buffer logic handling for blocks allocated via xfs_iomap_write_direct(), because it should still set the buffer_new flag appropriately for newly allocated blocks, too.

So, now that we have buffer_new() being set appropriately in __xfs_get_blocks(), we can detect the exact delalloc ranges that we allocated in a failed write, and hence can now do a walk of the buffers on a page to find them.

Except, it's not that easy. When block_write_begin() fails, it unlocks and releases the page that we just had an error on, so we can't use that page to handle errors anymore. We have to get access to the page while it is still locked to walk the buffers. Hence we have to open code block_write_begin() in xfs_vm_write_begin() to be able to insert xfs_vm_write_failed() in the right place.

With that, we can pass the page and write range to xfs_vm_write_failed() and walk the buffers on the page, looking for delalloc buffers that are either new or beyond EOF and punch them out. Handling buffers beyond EOF ensures we still handle the existing case that xfs_vm_write_failed() handles.
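
The buffer walk described above can be sketched in isolation. This is only a simplified userspace model, not the kernel code: struct model_buffer and model_write_failed() are hypothetical names, the flags stand in for buffer_delay() and buffer_new(), and "punching" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer head; not the kernel's struct. */
struct model_buffer {
    bool delalloc;  /* block is a delayed allocation */
    bool is_new;    /* block was allocated by the failing write */
    bool punched;   /* set by the walk when the block is punched out */
};

/*
 * Walk the buffers of one page and punch the delalloc blocks that a
 * failed write over [from, to) left behind. All offsets are in bytes.
 * Returns the number of blocks punched.
 */
static size_t model_write_failed(struct model_buffer *bufs, size_t nbufs,
                                 size_t page_off, size_t block_size,
                                 size_t from, size_t to, size_t isize)
{
    size_t punched = 0;

    assert(block_size > 0 && from <= to);
    for (size_t i = 0; i < nbufs; i++) {
        size_t start = page_off + i * block_size;
        size_t end = start + block_size;

        if (end <= from)        /* buffer lies before the failed range */
            continue;
        if (start >= to)        /* buffer lies after the failed range */
            break;
        if (!bufs[i].delalloc)  /* only delalloc blocks need punching */
            continue;
        /* Older delalloc data inside EOF is still valid: keep it. */
        if (!bufs[i].is_new && start < isize)
            continue;
        bufs[i].punched = true;
        punched++;
    }
    return punched;
}
```

For the UMD/UND/FAIL page above (512 byte blocks, page at 0x16000, file size 0x1c85ae), this model punches only the seventh block at 0x16c00 and leaves the six older delalloc blocks alone.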
Of special note is the truncate_pagecache() handling - that only should be done for pages outside EOF - pages within EOF can still contain valid, dirty data so we must not punch them out of the cache.

That just leaves the xfs_vm_write_end() failure handling. The only failure case here is that we didn't copy the entire range, and generic_write_end() handles that by zeroing the region of the page that wasn't copied, so we don't have to punch out blocks within the file because they are guaranteed to contain zeros. Hence we only have to handle the existing "beyond EOF" case and don't need access to the buffers on the page. Hence it remains largely unchanged.

Note that xfs_getbmap() can still trip over delalloc blocks beyond EOF that are left there by speculative delayed allocation. Hence this bug fix does not solve all known issues with bmap vs delalloc, but it does fix all the known accidental occurrences of the problem.

Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 6ffc4db5de61d36e969a26bc94509c59246c81f8
Author: Dave Chinner <david@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:43 2012 +1000

xfs: page type check in writeback only checks last buffer

xfs_is_delayed_page() checks to see if a page has buffers matching the given IO type passed in. It does so by walking the buffer heads on the page and checking if the state flags match the IO type.

However, the "acceptable" variable that is calculated is overwritten every time a new buffer is checked. Hence if the first buffer on the page is of the right type, this state is lost if the second buffer is not of the correct type. This means that xfs_aops_discard_page() may not discard delalloc regions when it is supposed to, and xfs_convert_page() may not cluster IO as efficiently as possible.

This problem only occurs on filesystems with a block size smaller than page size.
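
The difference between overwriting and accumulating the "acceptable" state can be shown with a small standalone model. The enum and both function names below are hypothetical and the buffer state is reduced to a single type per block; it is a sketch of the failure mode, not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block IO types, one per buffer on a page. */
enum model_io_type { MODEL_IO_OVERWRITE, MODEL_IO_DELALLOC, MODEL_IO_UNWRITTEN };

/* Buggy shape: every buffer overwrites the result, so only the last counts. */
static bool model_check_last_only(const enum model_io_type *bufs, size_t n,
                                  enum model_io_type want)
{
    bool acceptable = false;

    assert(bufs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acceptable = (bufs[i] == want);   /* earlier matches are lost */
    return acceptable;
}

/* Fixed shape: matches are accumulated across all buffers on the page. */
static bool model_check_any(const enum model_io_type *bufs, size_t n,
                            enum model_io_type want)
{
    size_t acceptable = 0;

    assert(bufs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acceptable += (bufs[i] == want);
    return acceptable != 0;
}
```

With a two-block page whose first block is delalloc and whose second is an overwrite, the first function reports no delalloc buffer while the second reports one. With a single block per page the two cannot differ, which matches the note that only block size smaller than page size is affected.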
Also, rename xfs_is_delayed_page() to xfs_check_page_type() to better describe what it is doing - it is not delalloc specific anymore.

The problem was first noticed by Peter Watkins.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 4c2d542f2e786537db33b613d5199dc6d69a96da
Author: Dave Chinner <david@xxxxxxxxxxxxx>
Date:   Mon Apr 23 17:54:32 2012 +1000

xfs: Do background CIL flushes via a workqueue

Doing background CIL flushes adds significant latency to whatever async transaction triggers it. To avoid blocking async transactions on things like waiting for log buffer IO to complete, move the CIL push off into a workqueue. By moving the push work into a workqueue, we remove all the latency that the commit adds from the foreground transaction commit path. This also means that single threaded workloads won't do the CIL push processing, leaving them more CPU to do more async transactions.

To do this, we need to keep track of the sequence number we have pushed work for. This avoids having many transaction commits attempting to schedule work for the same sequence, and ensures that we only ever have one push (background or forced) in progress at a time. It also means that we don't need to take the CIL lock in write mode to check for potential background push races, which reduces lock contention.

To avoid potential issues with "smart" IO schedulers, don't use the workqueue for log force triggered flushes. Instead, do them directly so that the log IO is done directly by the process issuing the log force and so doesn't get stuck on IO elevator queue idling incorrectly delaying the log IO from the workqueue.
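
The sequence tracking described above can be illustrated with a minimal single-threaded model. The struct and function names are hypothetical, and the real code would need a lock or atomics around the check and update; the sketch only shows how remembering the pushed sequence keeps repeated commits from scheduling the same work twice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the push bookkeeping; not the kernel's CIL struct. */
struct model_cil {
    unsigned long current_seq;  /* sequence currently accumulating items */
    unsigned long push_seq;     /* highest sequence a push was scheduled for */
};

/*
 * Returns true if the caller should queue push work for seq, false if a
 * push for that sequence (or a later one) has already been scheduled.
 */
static bool model_request_push(struct model_cil *cil, unsigned long seq)
{
    assert(seq <= cil->current_seq);
    if (seq <= cil->push_seq)
        return false;
    cil->push_seq = seq;
    return true;
}
```

Calling model_request_push() twice for the same sequence returns true only the first time, so many committers racing to trigger a background push collapse into a single queued work item in this model.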
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 04913fdd91f342e537005ef1233f98068b925a7f
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Apr 23 15:58:41 2012 +1000

xfs: pass shutdown method into xfs_trans_ail_delete_bulk

xfs_trans_ail_delete_bulk() can be called from different contexts so if the item is not in the AIL we need a different shutdown for each context. Pass in the shutdown method needed so the correct action can be taken.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit a8569171ba26344a4c0308fc0da8f41795408ebc
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:40 2012 +1000

xfs: remove some obsolete comments in xfs_trans_ail.c

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 43ff2122e6492bcc88b065c433453dce88223b30
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:39 2012 +1000

xfs: on-stack delayed write buffer lists

Queue delwri buffers on a local on-stack list instead of a per-buftarg one, and write back the buffers per-process instead of by waking up xfsbufd.

This is now easily doable given that we have very few places left that write delwri buffers:

 - log recovery: Only done at mount time, and already forcing out the buffers synchronously using xfs_flush_buftarg
 - quotacheck: Same story.
 - dquot reclaim: Writes out dirty dquots on the LRU under memory pressure. We might want to look into doing more of this via xfsaild, but it's already more optimal than the synchronous inode reclaim that writes each buffer synchronously.
 - xfsaild: This is the main beneficiary of the change.
   By keeping a local list of buffers to write we reduce latency of writing out buffers, and more importantly we can remove all the delwri list promotions which were hitting the buffer cache hard under sustained metadata loads.

The implementation is very straightforward - xfs_buf_delwri_queue now gets a new list_head pointer that it adds the delwri buffers to, and all callers need to eventually submit the list using xfs_buf_delwri_submit or xfs_buf_delwri_submit_nowait. Buffers that already are on a delwri list are skipped in xfs_buf_delwri_queue, assuming they already are on another delwri list. The biggest change to pass down the buffer list was done to the AIL pushing. Now that we operate on buffers, the trylock, push and pushbuf log item methods are merged into a single push routine, which tries to lock the item, and if possible add the buffer that needs writeback to the buffer list. This leads to much simpler code than the previous split but requires the individual IOP_PUSH instances to unlock and reacquire the AIL around calls to blocking routines.

Given that xfsailds now also handle writing out buffers, the conditions for log forcing and the sleep times needed some small changes. The most important one is that we consider an AIL busy as long as we still have buffers to push, and the other one is that we do increment the pushed LSN for buffers that are under flushing at this moment, but still count them towards the stuck items for restart purposes. Without this we could hammer on stuck items without ever forcing the log and not make progress under heavy random delete workloads on fast flash storage devices.

[ Dave Chinner:
  - rebase on previous patches.
  - improved comments for XBF_DELWRI_Q handling
  - fix XBF_ASYNC handling in queue submission (test 106 failure)
  - rename delwri submit function buffer list parameters for clarity
  - xfs_efd_item_push() should return XFS_ITEM_PINNED ]

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 960c60af8b9481595e68875e79b2602e73169c29
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:38 2012 +1000

xfs: do not add buffers to the delwri queue until pushed

Instead of adding buffers to the delwri list as soon as they are logged, even if they can't be written until committed because they are pinned, defer adding them to the delwri list until xfsaild pushes them. This makes the code more similar to other log items and prepares for writing buffers directly from xfsaild.

The complication here is that we need to fail buffers that were added but not logged yet in xfs_buf_item_unpin, borrowing code from xfs_bioerror.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit fe7257fd4b8ae9a3e354d9edb61890973e373ef0
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:37 2012 +1000

xfs: do not write the buffer from xfs_qm_dqflush

Instead of writing the buffer directly from inside xfs_qm_dqflush return it to the caller and let the caller decide what to do with the buffer. Also remove the pincount check in xfs_qm_dqflush that all non-blocking callers already implement and the now unused flags parameter and the XFS_DQ_IS_DIRTY check that all callers already perform.

[ Dave Chinner: fixed build error caused by missing '{'.
]

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 4c46819a8097a75d3b378c5e56d2bcf47bb7408d
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:36 2012 +1000

xfs: do not write the buffer from xfs_iflush

Instead of writing the buffer directly from inside xfs_iflush return it to the caller and let the caller decide what to do with the buffer. Also remove the pincount check in xfs_iflush that all non-blocking callers already implement and the now unused flags parameter.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 8a48088f6439249019b5e17f6391e710656879d9
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:35 2012 +1000

xfs: don't flush inodes from background inode reclaim

We already flush dirty inodes through the AIL regularly, so there is no reason to have a second thread compete with it and disturb the I/O pattern. We still do write inodes when doing a synchronous reclaim from the shrinker or during unmount for now.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 211e4d434bd737be38aabad0247ce3da9964370e
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:34 2012 +1000

xfs: implement freezing by emptying the AIL

Now that we write back all metadata either synchronously or through the AIL we can simply implement metadata freezing in terms of emptying the AIL.

The implementation for this is fairly simple and straightforward: A new routine is added that asks the xfsaild to push the AIL to the end and waits for it to complete and send a wakeup.
The routine will then loop if the AIL is not actually empty, and continue to do so until the AIL is completely empty.

We keep an inode reclaim pass in the freeze process to avoid having memory pressure have to reclaim inodes that require dirtying the filesystem to be reclaimed after the freeze has completed.

This means we can also treat unmount in the exact same way as freeze.

As an upside we can now remove the radix tree based inode writeback and xfs_unmountfs_writesb.

[ Dave Chinner:
  - Cleaned up commit message.
  - Added inode reclaim passes back into freeze.
  - Cleaned up wakeup mechanism to avoid the use of a new sleep counter variable. ]

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 1c30462542bac8abffb4823638b6b1659c1cfcf5
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:33 2012 +1000

xfs: allow assigning the tail lsn with the AIL lock held

Provide a variant of xlog_assign_tail_lsn that has the AIL lock already held. By doing so we do an additional atomic_read + atomic_set under the lock, which comes down to two instructions.

Switch xfs_trans_ail_update_bulk and xfs_trans_ail_delete_bulk to the new version to reduce the number of lock roundtrips, and prepare for a new addition that would require a third lock roundtrip in xfs_trans_ail_delete_bulk. This addition is also the reason for slightly rearranging the conditionals and relying on xfs_log_space_wake for checking that the filesystem has been shut down internally.
Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 32ce90a4b79155a155de2b284d8b69023e5e8fea
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:32 2012 +1000

xfs: remove log item from AIL in xfs_iflush after a shutdown

If a filesystem has been forced shutdown we are never going to write inodes to disk, which means the inode items will stay in the AIL until we free the inode. Currently that is not a problem, but a pending change requires us to empty the AIL before shutting down the filesystem. In that case leaving the inode in the AIL is lethal. Make sure to remove the log item from the AIL to allow emptying the AIL on shutdown filesystems.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit dea9609527a55b65638a6323894269334dfe6ec5
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Apr 23 15:58:31 2012 +1000

xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown

If a filesystem has been forced shutdown we are never going to write dquots to disk, which means the dquot items will stay in the AIL forever. Currently that is not a problem, but a pending change requires us to empty the AIL before shutting down the filesystem, in which case this behaviour is lethal. Make sure to remove the log item from the AIL to allow emptying the AIL on shutdown filesystems.
Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 7582df516c93046b8d2111a780c69de77f9882fb
Author: Shaohua Li <shli@xxxxxxxxxx>
Date:   Tue Apr 24 21:23:46 2012 +0800

xfs: using GFP_NOFS for blkdev_issue_flush

Issuing a block device flush request in transaction context using GFP_KERNEL directly can cause deadlocks due to memory reclaim recursion. Use GFP_NOFS to avoid recursion from reclaim context.

Signed-off-by: Shaohua Li <shli@xxxxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 01c84d2dc1311fb76ea217dadfd5b3a5f3cab563
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Thu Apr 26 09:23:09 2012 +1000

xfs: punch all delalloc blocks beyond EOF on write failure.

I've been seeing regular ASSERT failures in xfstests when running fsstress based tests over the past month. xfs_getbmap() has been failing this test:

  XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) || (map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c, line: 5650

where it is encountering a delayed allocation extent after writing all the dirty data to disk and then walking the extent map atomically by holding the XFS_IOLOCK_SHARED to prevent new delayed allocation extents from being created.

Test 083 on a 512 byte block size filesystem was used to reproduce the problem, because it only had a 5s run time and would usually fail every 3-4 runs. This test is exercising ENOSPC behaviour by running fsstress on a nearly full filesystem.
The following trace extract shows the final few events on the inode that tripped the assert:

  xfs_ilock: flags ILOCK_EXCL caller xfs_setfilesize
  xfs_setfilesize: isize 0x180000 disize 0x12d400 offset 0x17e200 count 7680

file size updated to 0x180000 by IO completion

  xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay
  xfs_iext_insert: state idx 3 offset 3072 block 4503599627239432 count 1 flag 0 caller xfs_bmap_add_extent_hole_delay
  xfs_get_blocks_alloc: size 0x180000 offset 0x180000 count 512 type startoff 0xc00 startblock -1 blockcount 0x1
  xfs_ilock: flags ILOCK_EXCL caller __xfs_get_blocks

delalloc write, adding a single block at offset 0x180000

  xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180200 count 512

ENOSPC trying to allocate a delalloc block at offset 0x180200

  xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay
  xfs_get_blocks_alloc: size 0x180000 offset 0x180200 count 512 type startoff 0xc00 startblock -1 blockcount 0x2

And succeeding on retry after flushing dirty inodes.

  xfs_ilock: flags ILOCK_EXCL caller __xfs_get_blocks
  xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180400 count 512

ENOSPC trying to allocate a delalloc block at offset 0x180400

  xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay
  xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180400 count 512

And failing the retry, giving a real ENOSPC error.

  xfs_ilock: flags ILOCK_EXCL caller xfs_vm_write_failed
                                     ^^^^^^^^^^^^^^^^^^^

The smoking gun - the write being failed and cleaning up delalloc blocks beyond EOF allocated by the failed write.

  xfs_getattr:
  xfs_ilock: flags IOLOCK_SHARED caller xfs_getbmap
  xfs_ilock: flags ILOCK_SHARED caller xfs_ilock_map_shared

And that's where we died almost immediately afterwards. xfs_bmapi_read() found a delalloc extent beyond the current in-memory file size.
Some debug I added to xfs_getbmap() showed the state just before the assert failure:

  ino 0x80e48: off 0xc00, fsb 0xffffffffffffffff, len 0x1, size 0x180000
  start_fsb 0x106, end_fsb 0x638 ino flags 0x2 nex 0xd bmvcnt 0x555, len 0x3c58a6f23c0bf1, start 0xc00
  ext 0: off 0x1fc, fsb 0x24782, len 0x254
  ext 1: off 0x450, fsb 0x40851, len 0x30
  ext 2: off 0x480, fsb 0xd99, len 0x1b8
  ext 3: off 0x92f, fsb 0x4099a, len 0x3b
  ext 4: off 0x96d, fsb 0x41844, len 0x98
  ext 5: off 0xbf1, fsb 0x408ab, len 0xf

which shows that we found a single delalloc block beyond EOF (first line of output) when we were returning the map for a length somewhere around 10^16 bytes long (second line), and the on-disk extents showed they didn't go past EOF (last lines).

Further debug added to xfs_vm_write_failed() showed this happened when punching out delalloc blocks beyond the end of the file after the failed write:

  [ 132.606693] ino 0x80e48: vwf to 0x181000, sze 0x180000
  [ 132.609573] start_fsb 0xc01, end_fsb 0xc08

It punched the range 0xc01 -> 0xc08, but the range we really need to punch is 0xc00 -> 0xc07 (8 blocks from 0xc00) as this testing was run on a 512 byte block size filesystem (8 blocks per page). That is, the punch should start from 0xc00. So end_fsb is correct, but start_fsb is wrong as we punch from start_fsb for (end_fsb - start_fsb) blocks. Hence we are not punching the delalloc block beyond EOF in this case.

The fix is simple - it's a silly off-by-one mistake in calculating the range. It's especially silly because the macro used to calculate the start_fsb already takes into account the case where the inode size is an exact multiple of the filesystem block size...
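
The arithmetic behind the off-by-one can be checked with a small standalone sketch. It assumes the extra block came from adding one to a block number that was already rounded up; the helper and struct names are hypothetical, and model_b_to_fsb() only models a bytes-to-blocks round-up, not the kernel macro itself.

```c
#include <assert.h>
#include <stdint.h>

/* Round a byte offset up to a whole number of filesystem blocks. */
static uint64_t model_b_to_fsb(uint64_t bytes, uint32_t block_size)
{
    assert(block_size > 0);
    return (bytes + block_size - 1) / block_size;
}

/* A punch request: start block and number of blocks. */
struct model_punch_range {
    uint64_t start_fsb;
    uint64_t count;
};

/* Assumed buggy form: skips the first block past EOF. */
static struct model_punch_range model_punch_buggy(uint64_t isize, uint64_t to,
                                                  uint32_t block_size)
{
    uint64_t start = model_b_to_fsb(isize, block_size) + 1;
    uint64_t end = model_b_to_fsb(to, block_size);
    struct model_punch_range r = { start, end - start };

    return r;
}

/* Corrected form: the round-up already lands on the first block past EOF. */
static struct model_punch_range model_punch_fixed(uint64_t isize, uint64_t to,
                                                  uint32_t block_size)
{
    uint64_t start = model_b_to_fsb(isize, block_size);
    uint64_t end = model_b_to_fsb(to, block_size);
    struct model_punch_range r = { start, end - start };

    return r;
}
```

With the values from the debug output (file size 0x180000, write end 0x181000, 512 byte blocks), the buggy form starts at 0xc01 and covers 7 blocks, while the fixed form starts at 0xc00 and covers the 8 blocks 0xc00 -> 0xc07.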
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Eric Sandeen <sandeen@xxxxxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 507630b29f13a3d8689895618b12015308402e22
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Mar 27 10:34:50 2012 -0400

xfs: use shared ilock mode for direct IO writes by default

For the direct IO write path, we only really need the ilock to be taken in exclusive mode during IO submission if we need to do extent allocation, instead of all the time.

Change the block mapping code to take the ilock in shared mode for the initial block mapping, and only retake it exclusively when we actually have to perform extent allocations. We were already dropping the ilock for the transaction allocation, so this doesn't introduce new race windows.

Based on an earlier patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 193aec10504e4c24521449c46317282141fb36e8
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Tue Mar 27 10:34:49 2012 -0400

xfs: push the ilock into xfs_zero_eof

Instead of calling xfs_zero_eof with the ilock held only take it internally for the minimal required critical section around xfs_bmapi_read. This also requires changing the calling convention for xfs_zero_last_block slightly. The actual zeroing operation is still serialized by the iolock, which must be taken exclusively over the call to xfs_zero_eof. We could in fact use a shared lock for the xfs_bmapi_read calls as long as the extent list has been read in, but given that we already hold the iolock exclusively there is little reason to micro optimize this further.
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit f38996f5768713fb60e1d2de66c097367d54bb6a
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Tue Mar 27 10:34:48 2012 -0400

xfs: reduce ilock hold times in xfs_setattr_size

We do not need the ilock for most checks done in the beginning of xfs_setattr_size. Replace the long critical section before starting the transaction with a smaller one around xfs_zero_eof and an optional one inside xfs_qm_dqattach that isn't entered unless using quotas. While this isn't a big optimization for xfs_setattr_size itself, it will allow pushing the ilock into xfs_zero_eof itself later.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>

commit 467f78992a0743e0e71729e4faa20b67b0f25289
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Tue Mar 27 10:34:47 2012 -0400

xfs: reduce ilock hold times in xfs_file_aio_write_checks

We do not need the ilock for generic_write_checks and the i_size_read, which are protected by i_mutex and/or iolock, so reduce the ilock critical section to just the call to xfs_zero_eof.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit b4d05e3019692fc5a8c573fbce60de2d48c5b7a1
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Tue Mar 27 10:34:46 2012 -0400

xfs: avoid taking the ilock unnessecarily in xfs_qm_dqattach

Check if we actually need to attach a dquot before taking the ilock in xfs_qm_dqattach. This avoids superfluous lock roundtrips for the common cases of quota support compiled in but not activated on a filesystem and an inode that already has the dquots attached.
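
The check-before-lock idea in the last commit can be sketched as a tiny single-threaded model. Everything here is hypothetical: the struct, the helper names, and the lock_trips counter that stands in for actually taking and dropping the ilock. The sketch also collapses the separate dquot types into one flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inode model; lock_trips counts simulated ilock round trips. */
struct model_inode {
    bool quota_on;         /* quotas are active on this filesystem */
    bool dquot_attached;   /* a dquot is already attached to this inode */
    unsigned lock_trips;
};

/* Cheap unlocked test: is there any attach work to do at all? */
static bool model_need_dqattach(const struct model_inode *ip)
{
    return ip->quota_on && !ip->dquot_attached;
}

static void model_dqattach(struct model_inode *ip)
{
    assert(ip != NULL);
    if (!model_need_dqattach(ip))
        return;                 /* common case: no lock round trip */

    ip->lock_trips++;           /* stands in for taking the ilock */
    /* Recheck under the lock, since the state may have changed. */
    if (!ip->dquot_attached)
        ip->dquot_attached = true;
    /* the lock would be dropped here */
}
```

In this model an inode on a filesystem without active quotas never increments lock_trips, and an inode with quotas enabled does so once, on the first attach only.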
Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
Signed-off-by: Ben Myers <bpm@xxxxxxx>

-----------------------------------------------------------------------

Summary of changes:
 fs/xfs/Makefile            |   2 +-
 fs/xfs/xfs_ag.h            |  18 --
 fs/xfs/xfs_alloc.c         | 585 +----------------------------------------
 fs/xfs/xfs_alloc.h         |  28 --
 fs/xfs/xfs_alloc_btree.c   |   9 +-
 fs/xfs/xfs_aops.c          | 218 +++++++++++-----
 fs/xfs/xfs_attr.c          |  25 +-
 fs/xfs/xfs_attr_leaf.c     |   3 +-
 fs/xfs/xfs_bmap.c          |  32 ++-
 fs/xfs/xfs_bmap.h          |   3 +
 fs/xfs/xfs_bmap_btree.c    |   1 -
 fs/xfs/xfs_btree.c         |   1 -
 fs/xfs/xfs_buf.c           | 593 ++++++++++++++++++-------------------------
 fs/xfs/xfs_buf.h           |  97 +++----
 fs/xfs/xfs_buf_item.c      | 123 +++------
 fs/xfs/xfs_da_btree.c      |  17 +-
 fs/xfs/xfs_dfrag.c         |   2 -
 fs/xfs/xfs_dir2.c          |   1 -
 fs/xfs/xfs_dir2_block.c    |   1 -
 fs/xfs/xfs_dir2_data.c     |   1 -
 fs/xfs/xfs_dir2_leaf.c     |   1 -
 fs/xfs/xfs_dir2_node.c     |   1 -
 fs/xfs/xfs_dir2_sf.c       |   1 -
 fs/xfs/xfs_discard.c       |   6 +-
 fs/xfs/xfs_dquot.c         |  91 ++-----
 fs/xfs/xfs_dquot.h         |   3 +-
 fs/xfs/xfs_dquot_item.c    | 162 ++++--------
 fs/xfs/xfs_error.c         |   1 -
 fs/xfs/xfs_export.c        |   1 -
 fs/xfs/xfs_extent_busy.c   | 603 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_extent_busy.h   |  69 +++++
 fs/xfs/xfs_extfree_item.c  |  59 ++---
 fs/xfs/xfs_file.c          | 327 +++++++++++++++---------
 fs/xfs/xfs_fsops.c         |  82 +++++-
 fs/xfs/xfs_ialloc.c        |  10 +-
 fs/xfs/xfs_ialloc.h        |   9 -
 fs/xfs/xfs_ialloc_btree.c  |   1 -
 fs/xfs/xfs_iget.c          |  24 +-
 fs/xfs/xfs_inode.c         | 132 ++++------
 fs/xfs/xfs_inode.h         |   5 +-
 fs/xfs/xfs_inode_item.c    | 176 ++++---------
 fs/xfs/xfs_inode_item.h    |   2 +-
 fs/xfs/xfs_inum.h          |   5 -
 fs/xfs/xfs_ioctl.c         |   2 -
 fs/xfs/xfs_ioctl32.c       |   2 -
 fs/xfs/xfs_iomap.c         |  59 +++--
 fs/xfs/xfs_iops.c          |  15 +-
 fs/xfs/xfs_itable.c        |   1 -
 fs/xfs/xfs_log.c           |  49 ++--
 fs/xfs/xfs_log.h           |   1 +
 fs/xfs/xfs_log_cil.c       | 253 +++++++++++--------
 fs/xfs/xfs_log_priv.h      |   2 +
 fs/xfs/xfs_log_recover.c   | 103 ++++----
 fs/xfs/xfs_message.c       |   1 -
 fs/xfs/xfs_mount.c         |  77 ++----
 fs/xfs/xfs_mount.h         |   2 +-
 fs/xfs/xfs_qm.c            | 196 +++++++-------
 fs/xfs/xfs_qm_bhv.c        |   2 -
 fs/xfs/xfs_qm_syscalls.c   |   1 -
 fs/xfs/xfs_quotaops.c      |   1 -
 fs/xfs/xfs_rename.c        |  12 -
 fs/xfs/xfs_rtalloc.c       |  10 +-
 fs/xfs/xfs_rw.c            | 156 ------------
 fs/xfs/xfs_rw.h            |  47 ----
 fs/xfs/xfs_super.c         |  54 ++--
 fs/xfs/xfs_sync.c          | 281 +++++++--------------
 fs/xfs/xfs_trace.c         |   2 -
 fs/xfs/xfs_trace.h         |  53 ++--
 fs/xfs/xfs_trans.c         |   7 +-
 fs/xfs/xfs_trans.h         |  18 +-
 fs/xfs/xfs_trans_ail.c     | 207 +++++++--------
 fs/xfs/xfs_trans_buf.c     | 126 +++------
 fs/xfs/xfs_trans_dquot.c   |   2 -
 fs/xfs/xfs_trans_extfree.c |   1 -
 fs/xfs/xfs_trans_inode.c   |   2 -
 fs/xfs/xfs_trans_priv.h    |  12 +-
 fs/xfs/xfs_types.h         |   5 +
 fs/xfs/xfs_utils.c         |   4 -
 fs/xfs/xfs_vnodeops.c      |  47 ++--
 79 files changed, 2460 insertions(+), 2884 deletions(-)
 create mode 100644 fs/xfs/xfs_extent_busy.c
 create mode 100644 fs/xfs/xfs_extent_busy.h
 delete mode 100644 fs/xfs/xfs_rw.c
 delete mode 100644 fs/xfs/xfs_rw.h

hooks/post-receive
--
XFS development tree

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs