Hi,

This is my sixth rework of this patch. The problem with the previous version is that the underlying file system's block_map function may set the buffer_head's b_state to 0, which fiemap could misinterpret as meaning the hole size has been returned in b_size. This version implements a suggestion from Steve Whitehouse: it improves the design by (unfortunately) using two flags: one flag to tell block_map to return the hole size if possible, and another flag to tell fiemap that the hole size has been returned. This way there is no possibility of a misunderstanding.

The problem:

If you do a fiemap operation on a very large sparse file, it can take an extremely long time (we're talking days here) because function __generic_block_fiemap does a block-by-block search when it encounters a hole.

The solution:

Allow the underlying file system to return the hole size so that function __generic_block_fiemap can quickly skip the hole.

I have a companion patch to GFS2 that takes advantage of the new flags to speed up fiemap on sparse GFS2 files. Other file systems can do the same as they see fit. For GFS2, the time it takes to skip a 1PB hole in a sparse file goes from days to milliseconds.

Patch description:

This patch changes function __generic_block_fiemap so that it sets a new buffer_want_holesize bit. The new bit signals to the underlying file system to return a hole size from its block_map function (if possible) in the event that a hole is encountered at the requested block. If the block_map function encounters a hole and sets buffer_got_holesize, fiemap takes the returned b_size to be the size of the hole, in bytes. It then skips the hole and moves on to the next block. This may be repeated several times in a row, especially for large holes, due to possible limitations of the fs-specific block_map function, but it is still much faster than trying each block individually when large holes are encountered. If the block_map function does not set buffer_got_holesize, the request for the hole size has been ignored, and fiemap falls back to today's method of doing a block-by-block search for the next valid block.
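For illustration only (this is not part of the patch, and not how GFS2 does it), here is a rough sketch of how a file system's block_map function might honor the new flags. The helpers my_fs_lookup_block() and my_fs_next_allocated_block() are hypothetical stand-ins for fs-specific mapping code; the real logic lives in each file system, as in the companion GFS2 patch:

#include <linux/fs.h>
#include <linux/buffer_head.h>

/*
 * Hypothetical fs-specific helpers, declared only to make the sketch
 * self-contained.  A real file system would use its own metadata
 * lookups here.
 */
extern sector_t my_fs_lookup_block(struct inode *inode, sector_t lblock);
extern sector_t my_fs_next_allocated_block(struct inode *inode,
					   sector_t lblock);

static int my_fs_get_block(struct inode *inode, sector_t lblock,
			   struct buffer_head *bh_map, int create)
{
	sector_t dblock = my_fs_lookup_block(inode, lblock);

	if (dblock) {
		/* Block is allocated: map it as usual. */
		map_bh(bh_map, inode->i_sb, dblock);
		return 0;
	}

	/*
	 * Hole.  If fiemap asked for the hole size, report it in b_size
	 * (in bytes) and set buffer_got_holesize so the caller knows
	 * b_size is a hole size rather than a mapping length.
	 */
	if (buffer_want_holesize(bh_map)) {
		sector_t next = my_fs_next_allocated_block(inode, lblock);

		bh_map->b_size = (next - lblock) << inode->i_blkbits;
		set_buffer_got_holesize(bh_map);
	}

	/* Buffer left unmapped: fiemap treats this as a hole. */
	return 0;
}

If the file system cannot cheaply determine where the hole ends, it simply leaves buffer_got_holesize clear, and fiemap falls back to the existing block-by-block search.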
Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpeterso@xxxxxxxxxx>
---
 fs/ioctl.c                  | 7 ++++++-
 include/linux/buffer_head.h | 4 ++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 8ac3fad..9c07a40 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -291,13 +291,18 @@ int __generic_block_fiemap(struct inode *inode,
 		memset(&map_bh, 0, sizeof(struct buffer_head));
 		map_bh.b_size = len;
 
+		set_buffer_want_holesize(&map_bh);
 		ret = get_block(inode, start_blk, &map_bh, 0);
 		if (ret)
 			break;
 
 		/* HOLE */
 		if (!buffer_mapped(&map_bh)) {
-			start_blk++;
+			if (buffer_got_holesize(&map_bh))
+				start_blk += logical_to_blk(inode,
+							    map_bh.b_size);
+			else
+				start_blk++;
 
 			/*
 			 * We want to handle the case where there is an
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 324329c..fc8e47c 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -37,6 +37,8 @@ enum bh_state_bits {
 	BH_Meta,	/* Buffer contains metadata */
 	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
 	BH_Defer_Completion, /* Defer AIO completion to workqueue */
+	BH_Want_Holesize, /* Return hole size if possible */
+	BH_Got_Holesize, /* Hole size has been returned */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -128,6 +130,8 @@ BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Unwritten, unwritten)
 BUFFER_FNS(Meta, meta)
+BUFFER_FNS(Want_Holesize, want_holesize)
+BUFFER_FNS(Got_Holesize, got_holesize)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
-- 
1.9.3