On Tue, 25 Feb 2014, Dave Chinner wrote: > On Tue, Feb 25, 2014 at 02:16:01PM +1100, Stephen Rothwell wrote: > > On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > > > Namjae Jeon (10): > > > > fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate > > > > xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate > > > > > > I've pushed these to the following branch: > > > > > > git://oss.sgi.com/xfs/xfs.git xfs-collapse-range > > > > > > And so they'll be in tomorrow's linux-next tree. > > > > > > > ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate > > > > > > I've left this one alone for the ext4 guys to sort out. > > > > So presumably that xfs tree branch is now completely stable and so Ted > > could just merge that branch into the ext4 tree as well and put the ext4 > > part on top of that in his tree. > > Well, for some definition of stable. Right now it's just a topic > branch that is merged into the for-next branch, so in theory it is > still just a set of pending changes in a branch in a repo that has > been pushed to linux-next for testing. > > That said, I don't see that branch changing unless we find bugs in > the code or a problem with the API needs fixing, at which point I > would add more commits to it and rebase the for-next branch that you > are pulling into the linux-next tree. > > Realistically, I'm waiting for Lukas to repost his other pending > fallocate changes (the zero range changes) so I can pull the VFS and > XFS bits of that into the XFS tree and I can test them together > before I'll call the xfs-collapse-range stable and ready to be > merged into some other tree... Thank you, Namjae and Dave, for driving this; and thank you, Ted and Matthew, for raising appropriate mmap concerns (2013-7-31 and 2014-2-2). I was aware of this work in progress, but only now found time to look. I've not studied the implementation, knowing too little of ext4 and xfs; but it sounds like the approach you've taken, writing out dirties and truncating all pagecache from the critical offset onwards, is the sensible approach for now - lame, and leaves me wondering whether an offline tool wouldn't be more appropriate; but a safe place to start if we suppose it will be extended to handle pagecache better in future. Of course I'm interested in the possibility of extending it to tmpfs; which may not be a worthwhile exercise in itself, except that it would force us to face and solve any pagecache/radixtree issues, if possible, thereby enhancing the support for disk-based filesystems. I doubt we should look into that before Jan Kara's range locking mods arrive, or are rejected. As I understand it, you're going ahead with this, knowing that there can be awkward races with concurrent faults - more likely to cause trinity fuzzer reports, than interfere with daily usage (trinity seems to be good at faulting into holes being punched). That's probably the right pragmatic decision; but I'm a little worried that it's justfied by saying we already have such races in hole-punch. Collapse is significantly more challenging than either hole-punch or truncation: the shifting of pages from one offset to another is new, and might present nastier bugs. Emphasis on "might": I expect it's impossible, given your current approach, but something to be on guard against is unmap_mapping_range() failing to find and unmap a pte, because the page is mapped at the "wrong" place in the vma, resulting in BUG_ON(page_mapped(page)) in __delete_from_page_cache(). One thing that is slightly wrong in what you have right now, is your use of truncate_pagecache_range(): you'll need to add an "even_cows" arg to that (or make it a wrapper to a __truncate_pagecache_range() taking additional "even_cows" arg). That arg governs what is done with anonymous COW'ed pages in a MAP_PRIVATE mmap of the file. Truncation is required by spec to unmap them; hole punching was up to us to spec, did not unmap them originally (because there was no preliminary call to unmap_mapping_range(), so it happened to rely on the inefficient fallback within truncate_inode_page()), and that seemed fine because, why discard a user's data unnecessarily? But your case is different: collapse is much closer to truncation, and if you do not unmap the private COW'ed pages, then pages left behind beyond the EOF will break the spec that requires SIGBUS when touching there, and pages within EOF will be confusingly derived from file data now belonging to another offset or none (move these pages within the user address space? no, I don't think anon_vmas would allow that, and there may be no right place to move them). It's clear that the right and easy thing to do is just to unmap them (all of them, from critical offset to EOF), in the rare case of there being any such pages. Whether this detail needs to be mentioned in the man page (I don't like throwing away a user's data without warning) I'm not sure, Michael can judge. FALLOC_FL_COLLAPSE_RANGE: I'm a little sad at the name COLLAPSE, but probably seven months too late to object. It surprises me that you're doing all this work to deflate a part of the file, without the obvious complementary work to inflate it - presumably all those advertisers whose ads you're cutting out, will come back to us soon to ask for inflation, so that they have somewhere to reinsert them ;) But you have the good precedent of "truncate" being used to extend files, so I suppose "collapse" can one day be enhanced to inflate a file when given negative len. Or perhaps, like truncate, that would lead to too much "if (bigger) do_one_thing() else something_else()", and a separate FALLOC_FL_ would prove better. Certainly there's no requirement that you should implement this, I was just a little surprised that you had not. I should mention that when "we" implemented this thirty years ago, we had a strong conviction that the system call should be idempotent: that is, the len argument should indicate the final i_size, not the amount being removed from it. Now, I don't remember the grounds for that conviction: maybe it was just an idealistic preference for how to design a good system call. I can certainly see that defining it that way round would surprise many app programmers. Just mentioning this in case anyone on these lists sees a practical advantage to doing it that way instead. I see you've included xfstests and xfs_io updates, nice. Did you realize that util-linux has a /usr/bin/fallocate? I hope someone will update that too. Hugh _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs