On Wed, Sep 24, 2008 at 11:31:46AM +1000, Neil Brown wrote:
> On Tuesday September 23, npiggin@xxxxxxx wrote:
> > Thanks for the review so far, and sorry for not getting back to you
> > earlier. Just starting to get back into the swing of things.
>
> :-) The conference season does that to you.
>
> > Definitely that is the plan, which Christoph also suggested a year
> > ago. It will make all that much nicer. Coherency should not be a
> > problem because for block devices we have the big-hammer coherency
> > of flushing on close, and I suspect online fs tools would use ioctls
> > or syscalls into the filesystem rather than raw bdev access.
>
> You might suspect that... but you might be surprised.
>
>   tune2fs -c 30 /dev/sda1
>
> changes the max-mount-count (number of mounts before we force a fsck)
> to 30, and does it simply by writing to the block device. No ioctls.
> This is expected to work when the filesystem is mounted. I'm fairly
> sure that works exactly because ext3 holds on to the bufferhead for
> the superblock in the block device, so a write by tune2fs updates the
> memory that ext3 stores the superblock in. So when ext3 journals and
> then writes that superblock, it still has the updated max-mount-count.

Oh. Hmm, that sucks -- the block may not have been dirty before the
write, so it may require allocations to write it out (not that
filesystem people care about that, but it is wrong, and one day in the
distant future we won't do it any more in Linux).

Anyway, I don't think there is a lot that can be done really...

> > > As I think more on this, I'm wondering if I've missed something
> > > (entirely possible). If each block does indeed hold a refcount on
> > > the page, then freeing idle pages which happen to have blocks
> > > attached is not ever going to happen. This would be ok if you
> > > dropped the fsblocks as soon as they weren't in use, but your
> > > commentary seems to suggest that is an option, not a certainty.
> > > Confused.
> > Yeah, it's just that we already expect PagePrivate to elevate the
> > refcount everywhere... actually if it didn't, I think the VM might
> > not reclaim those pages ;) Buffer heads do the same thing.
>                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> I cannot see that. I don't think that any code expects PagePrivate to
> elevate the refcount. I cannot see that buffer heads get a reference
> on the page for each buffer_head? There is only one get_page (a
> find_get_page) in buffer.c, and it is followed shortly by a
> page_cache_release.
>
> ????

See attach_page_buffers and __clear_page_buffers. Actually neither
buffer heads nor fsblocks elevate the refcount for each buffer/fsblock
attached to the page; they just elevate it once for N items.

Actually it is not so trivial to get rid of this, I think, because we
must handle the case of "orphaned" pagecache pages: low level page
refcounting definitely does not even look at PagePrivate, and it's
really only a few things like reclaim which treat that flag as a pin
on the page.

> > > 7/ Those kmallocs in fsblock_init are a worry. Surely it isn't
> > >    hard to come up with a scheme for static allocation.
> > >    And I think you should remove "if (i < 30)" and just leave it
> > >    as a simple 'else'.
> >
> > Oh? I don't see such a problem except I should panic if they fail I
> > suppose (rather than oops strangely).
>
> I've been programmed to raise a warning whenever I see a malloc where
> the result isn't checked for NULL. :-(

Done, fixed it (with a panic! :))

> > > 12/ In 'invalidate_block' you have commented out lock_block, and
> > >     put in a comment "why lock?".
> > >
> > >     The reason to lock is to wait for pending writeout to
> > >     complete. If the writeout was triggered by the VFS/VM, then
> > >     the page will have PageWriteback set and we will already have
> > >     waited for that to clear.
> > >     However if the writeout was triggered internally by the
> > >     filesystem (such as for data ordering guarantees) then
> > >     PageWriteback will not have been set, but the block (not the
> > >     page) will be locked.
> > >
> > >     At least that is how I understand buffer_heads to work.
> > >     However for fsblock you set PageWriteback whenever you write
> > >     out any block on the page. As noted in point 9 above, I'm not
> > >     convinced that is the correct thing to do.
> >
> > Yeah, fsblock keeps dirty and writeout coherent, and uses only the
> > page lock (rather than the page and buffer lock that buffer heads
> > use) for fsblock internal locking. I have not found any cases where
> > the locking is needed, but I was doing it in case minix/xfs/ext2/etc
> > required it for some reason.
>
> I'm not sure that is a good design decision.
> It means that the only way to concurrently write out two blocks that
> are in the same page is to write the whole page.

I should be able to add more blocks to a page under writeout, actually,
thanks to the fsblock locking.

> I notice that you don't even provide an interface for writing a block
> asynchronously. If the filesystem has a random collection of blocks
> that it wants to write out, it has to write each one synchronously,
> or maybe write out the whole page for each block, which might not be
> what is wanted.

Right. That could be added, but I've just kind of been exporting things
as it turns out they're required. That's a pretty obvious one, though.

> > But if I can have a clean break, I'll just require filesystems to
> > do their own locking rather than use the lock here ;) In fsblock
> > I'm never going to add code with comments that look like
> > /* I think reiserfs does this */, /* Does this still happen? */
> > etc ;)
>
> Yeah, those comments are pretty ugly.
>
> What I think you want is a well defined public interface that
> provides the sort of functionality that filesystems want, and which
> takes care of all the tricky details.
> Locking is certainly a tricky detail, so I'm not sure that leaving
> it entirely to the filesystem is a good idea.

Well, the locking is taken care of if they stay within the page lock
(in that respect it is simpler than buffer heads). But if they then go
out and do things outside that page lock, and it turns out they need
protection from invalidation as well, then they'd have to do their own
locking... I just don't want to add a lot of expensive operations even
to something as rare as invalidate, "just in case".

> And async block writes are something that I expect some filesystems
> would want (I have one that does).

OK, that's a good point. I'll look at adding that.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html