This is kinda messy, just posting for review / comment.  I'll tidy this
up and submit a good patch once we decide what should be in it.

My use-case is primarily sparsifying unwritten pre-allocated regions in
files created by bittorrent clients.  I also want it to not make a mess
if run on every file in a big tree.

* Special case at EOF: punch past EOF to the end of the block.
  Otherwise you end up with the last block still allocated, since
  allocation can extend beyond EOF.

  Linux (3.13) doesn't detect the special case of EOF in the middle of
  the last pre-allocated block, so it leaves a single allocated block at
  the end of a file.  FIEMAP would be needed to detect unwritten extents
  beyond EOF, I believe.  It would maybe be nice if fallocate could
  detect and interact with that situation.

* Set a minimum size for holes of 1MB, or half the file size if smaller.
  The special case for small files allows turning all-zero files into a
  hole.

  A cmdline option to override the min size would be good for other
  use-cases.  Maybe --dig-holes=4k, having -d take an optional argument?
  Otherwise a new option is needed, since --length selects the range to
  dig in.

* More useful logging of what's happening, only printing stuff about
  unchanged files if verbose > 2 (-vvv on the cmdline).

  Logging is very much a WIP.  The current state of what's printed is
  left over from sorting out what happens with the last block of a file.
  IDK if we want to concern users with the special-casing at EOF.

    nice ionice -c3 find ... -xdev -type f -size +2M -exec fallocate-local -v -d {} \;

  generated a lot of useless lines that hid the lines showing anything
  actually getting done.  This fixes that, by logging only when all-zero
  blocks are detected.

* Punch out unwritten pre-allocated space.  SEEK_DATA doesn't
  distinguish between holes and pre-allocated extents, so for now just
  brute-force scan.  Having fallocate unable to reverse its own action
  is kinda silly. :P

  TODO: detect it more efficiently.  Maybe keep using SEEK_DATA, but
  then always do FALLOC_FL_PUNCH_HOLE, since it shouldn't be harmful to
  call on a hole.

  The is_nul() loop is pretty fast, but the mem copy from pread() really
  hurts.  mmap would be faster, since I think Linux has some tricks for
  mmaping every page of holes / unwritten regions / /dev/zero to a
  single physical page (copy-on-write all-zero).

More --dig-holes functionality could end up taking as much code as all
the rest of fallocate.  It would also need somewhat different
command-line parsing (e.g. supporting multiple file arguments, and a
knob or two to control the digging).  We would also want to document the
caveats of FIEMAP / FIBMAP, like e2fsprogs' filefrag(8).

I'd also like to have a --show-layout option with output like
xfs_bmap -vpl, or filefrag -e, but without the clutter of locations on
the FS's underlying blockdev.  It would make sense to be able to query
the results of preallocating, hole punching, etc. using the same tool
used to do them.  Also, filefrag doesn't show where there are holes,
unless you do the math to see if there are gaps between the extents.
xfs_bmap does, but only works on XFS (which isn't the default FS for
most distros).

A dedicated tool for hole-punching could also incorporate options for
recursive operation over directories (although that's not really needed,
because find -exec {} + works great).  Summarizing results across all
the files it operated on would be good.  I was about to start hacking
something up when I googled to see if someone else had, and saw that the
latest fallocate got that feature.

Thoughts on whether fallocate should stay simple, or if it's fine to
have different behaviour and cmdline handling for different modes?

There's no reason fallocate couldn't loop over args for the other modes,
but -c, -p and -z especially are dangerous if the user accidentally
leaves an unintended file on the command line while editing a previous
ls command into a fallocate, or makes a copy/paste error.
---
 sys-utils/fallocate.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/sys-utils/fallocate.c b/sys-utils/fallocate.c
index 9af3bb8ce1492defda57cc17764197790bb34c8e..9d9e9617d8eb12e33af41f842ba9d5a24aab7cef 100644
--- a/sys-utils/fallocate.c
+++ b/sys-utils/fallocate.c
@@ -195,6 +195,7 @@ static void dig_holes(int fd, off_t off, off_t len)
 		err(EXIT_FAILURE, _("stat failed %s"), filename);
 
 	bufsz = st.st_blksize;
+	const off_t min_holesz = min((off_t)1024*1024, st.st_size / 2);	// TODO: check --length?
 
 	if (lseek(fd, off, SEEK_SET) < 0)
 		err(EXIT_FAILURE, _("seek on %s failed"), filename);
@@ -218,18 +219,25 @@ static void dig_holes(int fd, off_t off, off_t len)
 
 		if (is_nul(buf, rsz)) {
 			if (!hole_sz) {				/* new hole detected */
+#if 0  /* FIXME: preallocated areas look like holes to SEEK_DATA */
 				int rc = skip_hole(fd, &off);
 				if (rc == 0)
 					continue;	/* hole skipped */
 				else if (rc == 1)
 					break;		/* end of file */
+#endif
 				hole_start = off;
 			}
 			hole_sz += rsz;
-		 } else if (hole_sz) {
-			xfallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
-				   hole_start, hole_sz);
-			ct += hole_sz;
+		} else if (hole_sz) {
+			if (hole_sz < min_holesz) {
+				if (verbose)
+					fprintf(stdout, "not holepunching only %jd kiB in %s\n", (intmax_t)(hole_sz / 1024), filename);
+			} else {
+				xfallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
+					   hole_start, hole_sz);
+				ct += hole_sz;
+			}
 			hole_sz = hole_start = 0;
 		}
 
@@ -247,6 +255,25 @@ static void dig_holes(int fd, off_t off, off_t len)
 	}
 
 	if (hole_sz) {
+		if (verbose) {
+			if (hole_sz < min_holesz)  // even a small hole at end of file should be fine
+				fprintf(stdout, "allowing small hole (%jd kiB) at end of file\n", (intmax_t)(hole_sz / 1024));
+			else
+				fprintf(stdout, "hole %jd B (%jd kiB) at end of file\n", (intmax_t)hole_sz, (intmax_t)(hole_sz / 1024));
+		}
+		/* XFS and EXT4 (or maybe Linux in general) require us to
+		   punch all the way to the end of the block containing the end of the file.
+		   A punch that goes only to EOF will be treated as a partial-block punch,
+		   resulting in a block of allocated and zeroed space.  Tested on 3.13.0 (Ubuntu)
+		 */
+		const off_t remainder = hole_sz % bufsz;  // bufsz = st.st_blksize
+		if (end == 0 && remainder)
+			hole_sz += bufsz - remainder;
+		//hole_sz += bufsz - 1;		// Don't do anything stupid if used on a weird
+		//hole_sz &= ~(bufsz - 1);	// system where st_blksize isn't a power of 2
+
+		if (verbose > 1)
+			fprintf(stdout, "hole %jd B (%jd kiB) after rounding at end of file\n", (intmax_t)hole_sz, (intmax_t)(hole_sz / 1024));
 		xfallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
 				hole_start, hole_sz);
 		ct += hole_sz;
@@ -254,7 +281,7 @@ static void dig_holes(int fd, off_t off, off_t len)
 
 	free(buf);
 
-	if (verbose) {
+	if ((ct > 0 && verbose) || verbose > 2) {
 		char *str = size_to_human_string(SIZE_SUFFIX_3LETTER | SIZE_SUFFIX_SPACE, ct);
 		fprintf(stdout, _("%s: %s (%ju bytes) converted to sparse holes.\n"),
 				filename, str, ct);
-- 
2.2.1

Demonstration of how the current version leaves a 4k block allocated at
the end of the file if the file size isn't a multiple of st_blksize.  It
also shows that you have to be careful with FIEMAP on files that are
being written.  Although if we don't care where on disk the region ended
up, then we might NOT need to fdatasync the file.  Note how filefrag
prints delalloc right after I dd some zeros.  xfs_bmap has the sync
thing as a side effect, since it uses an XFS ioctl instead of FIEMAP.

peter@tesla:/var/tmp/peter/tmp$ dd if=/dev/zero of=ztest4k bs=512 count=32
32+0 records in
32+0 records out
16384 bytes (16 kB) copied, 0.00027062 s, 60.5 MB/s
peter@tesla:/var/tmp/peter/tmp$ filefrag -e ztest4k
Filesystem type is: 58465342
File size of ztest4k is 16384 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:          0..         3:      4:             unknown,delalloc,eof
ztest4k: 1 extent found
peter@tesla:/var/tmp/peter/tmp$ xfs_bmap -vpl ztest4k
ztest4k:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET           TOTAL FLAGS
   0: [0..31]:         7685000..7685031   0 (7685000..7685031)     32 00000
peter@tesla:/var/tmp/peter/tmp$ filefrag -e ztest4k
Filesystem type is: 58465342
File size of ztest4k is 16384 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:     960625..    960628:      4:             eof
ztest4k: 1 extent found
peter@tesla:/var/tmp/peter/tmp$ fallocate-2.25.2 -vvd ztest4k
ztest4k: 16 KiB (16384 bytes) converted to sparse holes.
peter@tesla:/var/tmp/peter/tmp$ xfs_bmap -vpl ztest4k
ztest4k: no extents

peter@tesla:/var/tmp/peter/tmp$ dd if=/dev/zero of=ztest.odd bs=512 count=33
33+0 records in
33+0 records out
16896 bytes (17 kB) copied, 0.000512697 s, 33.0 MB/s
peter@tesla:/var/tmp/peter/tmp$ xfs_bmap -vpl ztest.odd
ztest.odd:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET           TOTAL FLAGS
   0: [0..39]:         7685000..7685039   0 (7685000..7685039)     40 00000
peter@tesla:/var/tmp/peter/tmp$ fallocate-2.25.2 -vvd ztest.odd
ztest.odd: 16.5 KiB (16896 bytes) converted to sparse holes.
peter@tesla:/var/tmp/peter/tmp$ xfs_bmap -vpl ztest.odd
ztest.odd:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET           TOTAL FLAGS
   0: [0..31]:         hole                                        32
   1: [32..39]:        7685032..7685039   0 (7685032..7685039)      8 00000
peter@tesla:/var/tmp/peter/tmp$ filefrag -e ztest.odd
Filesystem type is: 58465342
File size of ztest.odd is 16896 (5 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        4..       4:     960629..    960629:      1:             eof
ztest.odd: 1 extent found
peter@tesla:/var/tmp/peter/tmp$ fallocate-local -vvd ztest.odd
hole 16896 B (16 kiB) at end of file
hole 20480 B (20 kiB) after rounding at end of file
ztest.odd: 20 KiB (20480 bytes) converted to sparse holes.
peter@tesla:/var/tmp/peter/tmp$ xfs_bmap -vpl ztest.odd
ztest.odd: no extents

For those of you who are curious but don't have an XFS mount available,
this is what strace -f xfs_bmap -vpl ztest.odd shows:

[pid  8159] open("ztest.odd", O_RDONLY) = 3
[pid  8159] fstatfs(3, {f_type=0x58465342, f_bsize=4096, f_blocks=13100800, f_bfree=4007762, f_bavail=4007762, f_files=52428800, f_ffree=52041931, f_fsid={2067, 0}, f_namelen=255, f_frsize=4096}) = 0
[pid  8159] ioctl(3, 0xffffffff8070587c, 0x7fffceea1d90) = 0
[pid  8159] fstatfs(3, {f_type=0x58465342, f_bsize=4096, f_blocks=13100800, f_bfree=4007762, f_bavail=4007762, f_files=52428800, f_ffree=52041931, f_fsid={2067, 0}, f_namelen=255, f_frsize=4096}) = 0

Also: to check fragmentation on XFS (usable online, while the FS is
mounted, but requires root because other xfs_db commands can do stuff):

$ sudo xfs_db -c frag -r /dev/sdc1
actual 49988, ideal 47367, fragmentation factor 5.24%

xfs_fsr can defrag XFS.

I sometimes use this:

alias prealloc-mv='rsync --remove-source-files --preallocate -a'

It's not a good idea to always preallocate for everything, esp. small
files, because on XFS, preallocating also implies allocating the start
of the file on a RAID stripe boundary, so you fragment your free space
if you use it for small files.  This is historical behaviour from when
writing uncompressed video to one-frame-per-file in realtime required a
RAID array.  This overloading of the semantics of posix_fallocate on XFS
is one of the major reasons GNU cp and mv don't preallocate, even though
they know exactly how big the destination file will eventually be.

Maybe if XFS kept its aligning behaviour for preallocation done with
XFS-specific ioctls, but dropped it for fallocate, that would allow
everyone to use it when they start writing a file that will hit a known
size and isn't expected to grow.  I gather that would be a lot of work,
though, since XFS is really designed to allocate during writeback, to
put stuff written at the same time near each other.
Anyway, just thought I'd throw in this background info on prealloc, to
save everyone the trouble of thinking of having cp(1) prealloc, and then
finding out why it doesn't.  Prealloc is still extremely useful for
files that get written slowly, or especially not in linear order (e.g.
torrent downloads).