On Wed, Aug 06, 2014 at 10:46:39AM +0200, Jan Kara wrote:
> On Wed 06-08-14 09:44:16, Dave Chinner wrote:
> > On Fri, Aug 01, 2014 at 12:00:53AM +0200, Jan Kara wrote:
> > > Switch inode dirty tracking lists to be per superblock instead of per
> > > bdi. This is a major step towards filesystems being able to do their
> > > own dirty tracking and selection of inodes for writeback if they desire
> > > so (e.g. because they journal or COW data and need to writeback inodes
> > > & pages in a specific order unknown to generic writeback code).
> > >
> > > Per superblock dirty lists also make selecting inodes for writeback
> > > somewhat simpler because we don't have to search for inodes from a
> > > particular superblock for some kinds of writeback (OTOH we pay for this
> > > by having to iterate through superblocks for all-bdi type of writeback)
> > > and this simplification will allow for an easier switch to a better
> > > scaling data structure for dirty inodes.
....
> > I'm not sure I really like this code very much - it seems to be
> > much more complex than it needs to be because writeback is still
> > managed on a per-bdi basis and the sb iteration is pretty clunky.
> > If we are moving to per-sb inode tracking, we should also move all
> > the writeback management to per-sb as well.
> >
> > IMO, there's no good reason for keeping flusher threads per-bdi and
> > then having to iterate per-sb just to do background/periodic
> > writeback, and then have special cases for sb specific writeback
> > that avoids the bdi per-sb looping. i.e. per-sb flush work executed
> > by a bdi flusher thread makes a lot more sense than per-bdi
> > flush work that iterates superblocks.
> >
> > So for the moment, I think this patch makes things worse rather than
> > better.
> > I'd much prefer to see a single series that moves from per-bdi
> > tracking/writeback to per-sb tracking/writeback than to split the
> > tracking/writeback changes and then have to support a weird,
> > temporary, intermediate code base like this...
>   So when writing this series I was thinking about both possibilities -
> i.e., keeping per-bdi threads and changing to per-sb threads. In the end
> I've decided to start with keeping per-bdi threads and seeing how things
> work out.
>   Regarding per-sb threads I have two unresolved questions:
>
> 1) How to handle block device inodes - when filesystem is mounted on the
> block device, it would be natural to writeback such inode together with
> other filesystem's inodes. When filesystem isn't mounted on the block
> device we have no superblock to attach such inode to so we would have to
> have some flusher for the virtual blkdev superblock that will deal with
> specifics of block device superblock.

You mean for the blockdev_superblock? That's probably a good idea; it
would isolate the special case blockdev inode writeback in its own
little function. i.e. this is crying out for a superblock specific
writeback method (just like tux3 is asking for) that simply does:

	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
		if (work->sync_mode == WB_SYNC_ALL)
			__sync_blockdev(I_BDEV(inode), 1);
		else
			__sync_blockdev(I_BDEV(inode), 0);
	}

For data integrity operations, the higher layers call sync_blockdev()
appropriately so we don't actually have to care about blockdev inodes
during per-sb writeback at all. With that, we get rid of another
longstanding, ugly wart in the writeback code.

> 2) How to deal with multiple superblocks per device - and here I'm
> convinced we should not regress writeback performance of the case where
> disk is partitioned into several partitions. And I think this will require
> some kind of synchronization between per-sb threads on the same device.

Right, that's why I suggested this:

> > i.e.
> > per-sb flush work executed
> > by a bdi flusher thread makes a lot more sense than per-bdi
> > flush work that iterates superblocks.

What I mean is that every work that is executed on a BDI has work->sb
set. Functions like wb_check_background_flush() and
wb_check_old_data_flush() no longer take just the bdi, they also take
the sb they are going to operate on so it can be stuffed into the work
structure that is passed to bdi_writeback(). At this point,
bdi_writeback() only needs to operate on a single sb.

We still have all the same "periodic/background writeback is aborted
if new work is queued" logic so that data integrity operations preempt
background work, so there is no difference in behaviour there. i.e.
the bdi still manages the interactions between all the superblocks,
just from a slightly different perspective: we explicitly schedule
writeback by superblock rather than by bdi.

IOWs, I'm not suggesting per-sb flusher threads, just scheduling of
the work to the existing bdi flusher thread slightly differently.

I suspect that the above two things will also make the sb-specific
writeback callout much more obvious and, in general, make the
writeback logic easier to read as there's no extraneous juggling of
special cases needed. The sb callout will handle the blockdev inode
writeback case, and the change of work scheduling gets rid of the need
to iterate the sbs on the bdi....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx