Dave Chinner <david@xxxxxxxxxxxxx> writes:

>> Otherwise, vfs can't know the data is whether after sync point or before
>> sync point, and have to wait or not. FS is using the behavior like
>> data=journal has tracking of those already, and can reuse it.
>
> The VFS writeback code already differentiates between data written
> during a sync operation and that dirtied after a sync operation.
> Perhaps you should look at the tagging for WB_SYNC_ALL writeback
> that write_cache_pages does....
>
> But, anyway, we don't have to do that on the waiting side of things.
> All we need to do is add the inode to a "under IO" list on the bdi
> when the mapping is initially tagged with pages under writeback,
> and remove it from that list during IO completion when the mapping
> is no longer tagged as having pages under writeback.
>
> wait_sb_inodes() just needs to walk that list and wait on each inode
> to complete IO. It doesn't require *any awareness* of the underlying
> filesystem implementation or how the IO is actually issued - if
> there's IO in progress at the time wait_sb_inodes() is called, then
> it waits for it.
>
>> > Fix the root cause of the problem - the sub-optimal VFS code.
>> > Hacking around it specifically for out-of-tree code is not the way
>> > things get done around here...
>>
>> I'm thinking the root cause is vfs can't have knowledge of FS internal,
>> e.g. FS is handling data transactional way, or not.
>
> If the filesystem has transactional data/metadata that the VFS is
> not tracking, then that is what the ->sync_fs call is for. i.e. so
> the filesystem can then do what ever extra writeback/waiting it
> needs to do that the VFS is unaware of.
>
> We already cater for what Tux3 needs in the VFS - all you've done is
> found an inefficient algorithm that needs fixing.

write_cache_pages() is a library function that each filesystem calls
from its own code, so it cannot be assumed to be under VFS control.
Also, write_cache_pages() does not do the right thing for data=journal,
because it handles each inode separately rather than all inodes at once.
New dirty data can therefore be inserted while the marking is still in
progress.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@xxxxxxxxxxxxxxxxxx>
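
For reference, here is a minimal user-space sketch of the "under IO" list
idea quoted above. It is only a model, not kernel code: every name in it
(model_inode, model_bdi, and the model_* functions) is hypothetical, all
locking is left out, and "waiting" is simulated by completing the
outstanding pages directly. It only shows the bookkeeping: an inode joins
the list when its first page goes under writeback, and leaves when its
last page completes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_inode {
	int pages_under_writeback;        /* stands in for the mapping's writeback tag */
	struct model_inode *prev, *next;  /* links on the per-bdi "under IO" list */
};

struct model_bdi {
	struct model_inode head;          /* sentinel of a circular list */
};

static void model_bdi_init(struct model_bdi *bdi)
{
	bdi->head.pages_under_writeback = 0;
	bdi->head.next = bdi->head.prev = &bdi->head;
}

/* A page of this inode starts writeback. */
static void model_start_page_writeback(struct model_bdi *bdi,
				       struct model_inode *inode)
{
	if (inode->pages_under_writeback++ == 0) {
		/* First page under writeback: add the inode at the list tail. */
		inode->next = &bdi->head;
		inode->prev = bdi->head.prev;
		bdi->head.prev->next = inode;
		bdi->head.prev = inode;
	}
}

/* IO completion for one page of this inode. */
static void model_end_page_writeback(struct model_inode *inode)
{
	assert(inode->pages_under_writeback > 0);
	if (--inode->pages_under_writeback == 0) {
		/* Last page completed: unlink the inode from the list. */
		inode->prev->next = inode->next;
		inode->next->prev = inode->prev;
		inode->next = inode->prev = NULL;
	}
}

/* Number of inodes currently on the "under IO" list. */
static size_t model_count_under_io(const struct model_bdi *bdi)
{
	size_t n = 0;
	const struct model_inode *i;

	for (i = bdi->head.next; i != &bdi->head; i = i->next)
		n++;
	return n;
}

/*
 * Model of the waiting side: visit every inode on the list until the
 * list is empty. Real code would block until IO completes; here the
 * completion is simulated by draining the inode's page count.
 */
static size_t model_wait_sb_inodes(struct model_bdi *bdi)
{
	size_t waited = 0;

	while (bdi->head.next != &bdi->head) {
		struct model_inode *inode = bdi->head.next;

		while (inode->pages_under_writeback > 0)
			model_end_page_writeback(inode);
		waited++;
	}
	return waited;
}
```

Note that the model only tracks IO that is in flight. It says nothing
about when pages were dirtied or tagged, which is the separate point
about marking made in the reply above.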