Hi, This is the first of four core changes we would like for Tux3. We start with a hard one and suggest a simple solution. The first patch in this series adds a new super operation to write back multiple inodes in a single call. The second patch applies to our linux-tux3 repository at git.kernel.org to demonstrate how this interface is used, and removes about 450 lines of workaround code. Traditionally, core kernel tells each mounted filesystems which dirty pages of which inodes should be flushed to disk, but unfortunately, is blissfully ignorant of filesystem-specific ordering constraints. This scheme was really invented for Ext2 and has not evolved much recently. Tux3, with its strong ordering and optimized delta commit, cannot tolerate random flushing and therefore takes responsibility for flush ordering itself. On the other hand, Tux3 has no good way to know when is the right time to flush, but core is good at that. This proposed API harmonizes those two competencies so that Tux3 and core each take care of what they are good at, and not more. The API extension consists of a new writeback hook and two helpers to be uses within the hook. The hook sits about halfway down the fs-writeback stack, just after core has determined that some dirty inodes should be flushed to disk and just before it starts thinking about which inodes those should be. At that point, core calls Tux3 instead of continuing on down the usual do_writepages path. Tux3 responds by staging a new delta commit, using the new helpers to tell core which inodes were flushed versus being left dirty in cache. This is pretty much the same behavior as the traditional writeout path, but less convoluted, probably more efficient, and certainly easier to analyze. The new writeback hook looks like: progress = sb->s_op->writeback(sb, &writeback_control, &nr_pages); This should be self-explanatory: nr_pages and progress have the semantics of existing usage in fs-writeback; writeback_control is ignored by Tux3, but that is only because Tux3 always flushes everything and does not require hints for now. We can safely assume that &wbc or equivalent is wanted here. An obvious wart is the overlap between "progress" and "nr_pages", but fs-writeback thinks that way, so it would not make much sense to improve one without improving the other. Tux3 necessarily keeps its own dirty inode list, which is an area of overlap with fs-writeback. In a perfect world, there would be just one dirty inode list per superblock, on which both fs-writeback and Tux3 would operate. That would be a deeper core change than seems appropriate right now. Potential races are a big issue with this API, which is no surprise. The fs-writeback scheme requires keeping several kinds of object in sync: tux3 dirty inode lists, fs-writeback dirty inode lists and inode dirty state. The new helpers inode_writeback_done(inode) and inode_writeback_touch(inode) take care of that while hiding internal details of the fs-writeback implementation. Tux3 calls inode_writeback_done when it has flushed an inode and marked it clean, or calls inode_writeback_touch if it intends to retain a dirty inode in cache. These have simple implementations. The former just removes a clean inode from any fs-writeback list. The latter updates the inode's dirty timestamp so that fs-writeback does not keep trying flush it. Both these things could be done more efficiently by re-engineering fs-writeback, but we prefer to work with the existing scheme for now. Hirofumi's masterful hack nicely avoided racy removal of inodes from the writeback list by taking an internal fs-writeback lock inside filesystem code. The new helper requires dropping i_lock inside filesystem code and retaking it in the helper, so inode redirty can race with writeback list removal. This does not seem to present a problem because filesystem code is able to enforce strict alternation of cleaning and calling the helper. As an offsetting advantage, writeback lock contention is reduced. Compared to Hirofumi's hack, the cost of this interface is one additional spinlock per inode_writeback_done, which is insignificant compared to the convoluted code path that is avoided. Regards, Daniel --- From: Daniel Phillips <daniel@xxxxxxxx> Subject: [PATCH] Add a super operation for writeback Add a "writeback" super operation to be called in the form: progress = sb->s_op->writeback(sb, &wbc, &pages); The filesystem is expected to flush some inodes to disk and return progress of at least 1, or if no inodes are flushed, return progress of zero. The filesystem should try to flush at least the number of pages specified in *pages, or if that is not possible, return approximately the number of pages not flushed into *pages. Within the ->writeback callback, the filesystem should call inode_writeback_done(inode) for each inode flushed (and therefore set clean) or inode_writeback_touch(inode) for any inode that will be retained dirty in cache. Signed-off-by: Daniel Phillips <daniel@xxxxxxxx> Signed-off-by: OGAWA Hirofumi <hirofumi@xxxxxxxxxxxxxxxxxx> --- fs/fs-writeback.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++--- include/linux/fs.h | 4 +++ 2 files changed, 60 insertions(+), 3 deletions(-) diff -puN fs/fs-writeback.c~core-writeback fs/fs-writeback.c --- linux-tux3/fs/fs-writeback.c~core-writeback 2014-05-31 06:43:19.416031712 +0900 +++ linux-tux3-hirofumi/fs/fs-writeback.c 2014-05-31 06:44:46.087904373 +0900 @@ -192,6 +192,35 @@ void inode_wb_list_del(struct inode *ino } /* + * Remove inode from writeback list if clean. + */ +void inode_writeback_done(struct inode *inode) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + + spin_lock(&bdi->wb.list_lock); + spin_lock(&inode->i_lock); + if (!(inode->i_state & I_DIRTY)) + list_del_init(&inode->i_wb_list); + spin_unlock(&inode->i_lock); + spin_unlock(&bdi->wb.list_lock); +} +EXPORT_SYMBOL_GPL(inode_writeback_done); + +/* + * Add inode to writeback dirty list with current time. + */ +void inode_writeback_touch(struct inode *inode) +{ + struct backing_dev_info *bdi = inode->i_sb->s_bdi; + spin_lock(&bdi->wb.list_lock); + inode->dirtied_when = jiffies; + list_move(&inode->i_wb_list, &bdi->wb.b_dirty); + spin_unlock(&bdi->wb.list_lock); +} +EXPORT_SYMBOL_GPL(inode_writeback_touch); + +/* * Redirty an inode: set its when-it-was dirtied timestamp and move it to the * furthest end of its superblock's dirty-inode list. * @@ -593,9 +622,9 @@ static long writeback_chunk_size(struct * * Return the number of pages and/or inodes written. */ -static long writeback_sb_inodes(struct super_block *sb, - struct bdi_writeback *wb, - struct wb_writeback_work *work) +static long __writeback_sb_inodes(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) { struct writeback_control wbc = { .sync_mode = work->sync_mode, @@ -710,6 +739,30 @@ static long writeback_sb_inodes(struct s return wrote; } +static long writeback_sb_inodes(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) +{ + if (sb->s_op->writeback) { + struct writeback_control wbc = { + .sync_mode = work->sync_mode, + .tagged_writepages = work->tagged_writepages, + .for_kupdate = work->for_kupdate, + .for_background = work->for_background, + .for_sync = work->for_sync, + .range_cyclic = work->range_cyclic, + }; + long ret; + + spin_unlock(&wb->list_lock); + ret = sb->s_op->writeback(sb, &wbc, &work->nr_pages); + spin_lock(&wb->list_lock); + return ret; + } + + return __writeback_sb_inodes(sb, wb, work); +} + static long __writeback_inodes_wb(struct bdi_writeback *wb, struct wb_writeback_work *work) { diff -puN include/linux/fs.h~core-writeback include/linux/fs.h --- linux-tux3/include/linux/fs.h~core-writeback 2014-05-31 06:43:19.436031682 +0900 +++ linux-tux3-hirofumi/include/linux/fs.h 2014-05-31 06:48:41.619558001 +0900 @@ -1536,6 +1536,8 @@ struct super_operations { int (*drop_inode) (struct inode *); void (*evict_inode) (struct inode *); void (*put_super) (struct super_block *); + long (*writeback)(struct super_block *super, + struct writeback_control *wbc, long *nr_pages); int (*sync_fs)(struct super_block *sb, int wait); int (*freeze_fs) (struct super_block *); int (*unfreeze_fs) (struct super_block *); @@ -1737,6 +1739,8 @@ static inline void file_accessed(struct touch_atime(&file->f_path); } +void inode_writeback_done(struct inode *inode); +void inode_writeback_touch(struct inode *inode); int sync_inode(struct inode *inode, struct writeback_control *wbc); int sync_inode_metadata(struct inode *inode, int wait); -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html