On Thu 07-08-14 07:13:05, Dave Chinner wrote:
> On Wed, Aug 06, 2014 at 10:46:39AM +0200, Jan Kara wrote:
> > On Wed 06-08-14 09:44:16, Dave Chinner wrote:
> > > On Fri, Aug 01, 2014 at 12:00:53AM +0200, Jan Kara wrote:
> > > > Switch inode dirty tracking lists to be per superblock instead of
> > > > per bdi. This is a major step towards filesystems being able to do
> > > > their own dirty tracking and selection of inodes for writeback if
> > > > they desire so (e.g. because they journal or COW data and need to
> > > > writeback inodes & pages in a specific order unknown to generic
> > > > writeback code).
> > > >
> > > > Per superblock dirty lists also make selecting inodes for writeback
> > > > somewhat simpler because we don't have to search for inodes from a
> > > > particular superblock for some kinds of writeback (OTOH we pay for
> > > > this by having to iterate through superblocks for the all-bdi type
> > > > of writeback) and this simplification will allow for an easier
> > > > switch to a better scaling data structure for dirty inodes.
....
> > > I'm not sure I really like this code very much - it seems to be
> > > much more complex than it needs to be because writeback is still
> > > managed on a per-bdi basis and the sb iteration is pretty clunky.
> > > If we are moving to per-sb inode tracking, we should also move all
> > > the writeback management to per-sb as well.
> > >
> > > IMO, there's no good reason for keeping flusher threads per-bdi and
> > > then having to iterate per-sb just to do background/periodic
> > > writeback, and then have special cases for sb-specific writeback
> > > that avoid the bdi per-sb looping. i.e. per-sb flush work executed
> > > by a bdi flusher thread makes a lot more sense than per-bdi
> > > flush work that iterates superblocks.
> > >
> > > So for the moment, I think this patch makes things worse rather than
> > > better. I'd much prefer to see a single series that moves from per-bdi
> > > tracking/writeback to per-sb tracking/writeback than to split the
> > > tracking/writeback changes and then have to support a weird,
> > > temporary, intermediate code base like this...
> > So when writing this series I was thinking about both possibilities -
> > i.e., keeping per-bdi threads and changing to per-sb threads. In the
> > end I've decided to start with keeping per-bdi threads and seeing how
> > things work out.
> >
> > Regarding per-sb threads I have two unresolved questions:
> >
> > 1) How to handle block device inodes - when a filesystem is mounted on
> > the block device, it would be natural to write back such an inode
> > together with the other inodes of that filesystem. When no filesystem
> > is mounted on the block device, we have no superblock to attach such
> > an inode to, so we would have to have some flusher for the virtual
> > blkdev superblock that deals with the specifics of the block device
> > superblock.
> You mean for the blockdev_superblock? That's probably a good idea;
> it would isolate the special case blockdev inode writeback in its
> own little function. i.e. this is crying out for a superblock
> specific writeback method (just like tux3 is asking for) that simply
> does:
>
>	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
>		if (work->sync_mode == WB_SYNC_ALL)
>			__sync_blockdev(I_BDEV(inode), 1);
>		else
>			__sync_blockdev(I_BDEV(inode), 0);
>	}
Sadly, it's not that easy - you have to pass in struct writeback_control
so that writing obeys nr_to_write - otherwise you can livelock doing
writeback, and preemption of background works won't work either. So you
have to do something along the lines of what writeback_inodes() does...
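I.e. something like this (a rough sketch only - the helper name is made
up, and the locking and inode pinning needed to safely walk s_inodes are
omitted):

	static long writeback_blockdev_inodes(struct wb_writeback_work *work)
	{
		struct inode *inode;
		long chunk = 1024;	/* arbitrary chunk size for illustration */
		long wrote = 0;

		list_for_each_entry(inode, &blockdev_superblock->s_inodes,
				    i_sb_list) {
			struct writeback_control wbc = {
				.sync_mode	= work->sync_mode,
				.nr_to_write	= chunk,
			};

			do_writepages(inode->i_mapping, &wbc);
			wrote += chunk - wbc.nr_to_write;
			/* bail out so that queued works can preempt us */
			if (wrote >= work->nr_pages)
				break;
		}
		return wrote;
	}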
Also I don't really see how to get rid of special-casing block devices.
Normal superblocks are attached to a bdi - the bdi flusher iterates all
superblocks on that bdi to do background writeback of the bdi. However,
the block device superblock is special - inodes of that superblock
belong to different bdis, so you cannot just attach that superblock to
any particular bdi. That's why I created a special "writeback queue" for
each bdi, embedded in struct backing_dev_info, and use that for tracking
the writeback state of block device inodes belonging to the bdi.

> For data integrity operations, the higher layers call
> sync_blockdev() appropriately so we don't actually have to care
> about blockdev inodes during per-sb writeback at all.
Agreed.

> With that, we get rid of another longstanding, ugly wart in the
> writeback code.
>
> > 2) How to deal with multiple superblocks per device - and here I'm
> > convinced we should not regress writeback performance for the case
> > where a disk is partitioned into several partitions. And I think this
> > will require some kind of synchronization between per-sb threads on
> > the same device.
> Right, that's why I suggested this:
>
> > > i.e. per-sb flush work executed
> > > by a bdi flusher thread makes a lot more sense than per-bdi
> > > flush work that iterates superblocks.
Aha, sorry, I misunderstood this sentence.

> What I mean is that every work that is executed on a BDI has
> work->sb set. Functions like wb_check_background_flush() and
> wb_check_old_data_flush() no longer take just the bdi, they also
> take the sb they are going to operate on so it can be stuffed into
> the work structure that is passed to bdi_writeback(). At this point,
> bdi_writeback() only needs to operate on a single sb.
>
> We still have all the same "periodic/background writeback is aborted
> if new work is queued" logic so that data integrity operations
> preempt background work, so there is no difference in behaviour
> there. i.e. the bdi still manages the interactions between all the
> superblocks, just from a slightly different perspective: we
> explicitly schedule writeback by superblock rather than by bdi.
OK, this is an interesting idea. However, suppose you have two active
filesystems on a bdi. Now you have to somehow switch between these two
filesystems - I don't think you can switch to the second filesystem only
after you have cleaned the first one, because that may starve the second
filesystem of background writeback for a long time, and in the extreme
case it may never happen. But I suppose if we set nr_to_write sensibly
we could avoid such problems... I'll try this and see how it looks in
the code.
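E.g. the background flush could round-robin the superblocks on the bdi
in bounded chunks so that no sb is starved - something like this (again
only a sketch of the idea; for_each_sb_on_bdi() is a made-up helper for
iterating the superblocks attached to a bdi, and the chunk size is
arbitrary):

	static long wb_check_background_flush(struct bdi_writeback *wb)
	{
		struct super_block *sb;
		long wrote = 0;

		while (over_bground_thresh(wb->bdi)) {
			long chunk_wrote = 0;

			for_each_sb_on_bdi(sb, wb->bdi) {
				struct wb_writeback_work work = {
					.sb		= sb,
					/* bounded chunk so no sb starves */
					.nr_pages	= 1024,
					.sync_mode	= WB_SYNC_NONE,
					.for_background	= 1,
					.range_cyclic	= 1,
				};

				chunk_wrote += wb_writeback(wb, &work);
			}
			/* nothing left to clean - avoid spinning */
			if (!chunk_wrote)
				break;
			wrote += chunk_wrote;
		}
		return wrote;
	}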
> IOWs, I'm not suggesting per-sb flusher threads, just scheduling of
> the work to the existing bdi flusher thread slightly differently.
>
> I suspect that the above two things will also make the sb-specific
> writeback callout much more obvious and, in general, make the
> writeback logic easier to read as there's no extraneous juggling of
> special cases needed. The sb callout will handle the blockdev inode
> writeback case, and the change of work scheduling gets rid of the
> need to iterate the sbs on the bdi....

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR