On Mon, Jan 4, 2016 at 7:51 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote: >> Historically we have waited for filesystem specific heuristics to >> attempt to guess when a block device is gone. Sometimes this works, but >> in other cases the system can hang waiting for the fs to trigger its >> shutdown protocol. >> >> The initial motivation for this investigation was to prevent DAX >> mappings (direct mmap access to persistent memory) from leaking past the >> lifetime of the hosting block device. However, Dave points out that >> these shutdown operations are needed in other scenarios. Quoting Dave: >> >> For example, if we detect a free space corruption during allocation, >> it is not safe to trust *any active mapping* because we can't trust >> that we having handed out the same block to multiple owners. Hence >> on such a filesystem shutdown, we have to prevent any new DAX >> mapping from occurring and invalidate all existing mappings as we >> cannot allow userspace to modify any data or metadata until we've >> resolved the corruption situation. >> >> The current block device shutdown sequence of del_gendisk + >> blk_cleanup_queue is problematic. We want to tell the fs after >> blk_cleanup_queue that there is no possibility of recovery, but by that >> time we have deleted partitions and lost the ability to find all the >> super-blocks on a block device. >> >> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone() > > I don't see anything that introduces a ->quiesce() method. > Right, we only have generic_quiesce_super() as no fs had a need to do anything but the generic actions. >> notifications to all the filesystems hosted on the disk. Where >> ->quiesce() are 'shutdown' operations while the bdev may still be alive, > > So you are saying "quiesce == invalidation", which is in conflict > with what we typically think quiesce means. i.e. "Quiesce" is what > we do when doing orderly writeback of all outstanding dirty objects > in a filesystem - what we do during freeze, remount-ro, and unmount. > e.g. See the functions xfs_log_quiesce() and xfs_attr_quiesce() Fair enough, I should have called this one invalidate_super. > >> and ->bdi_gone() is a set of actions to take after the backing device >> is known to be permanently dead. > > In which case, bdi_gone == shutdown. > True, and following this logic I think the existing generic_shutdown_super() should be renamed generic_kill_super() to match the fs actions, but see below... > Operation methods should be named after what they do, not what their > calling context is. i.e. these are "invalidate" and "shutdown" > superblock operations, not "quiesce" and "bdi_gone". I was running out of colors to paint the bike shed given generic_shutdown_super() was already in use. Also, since generic_shutdown_super() has had its current meaning since forever I didn't think it was worth the risk to change the meaning of such a long standing symbol. Other ideas, generic_stop_super? >> generic_quiesce_super and generic_bdi_gone, are the default operations >> when a filesystem does not implement ->quiesce(), ->bdi_gone(). They >> invalidate inodes and unmap DAX-inodes respectively. For now only >> ->bdi_gone() has an associated super operation as xfs will implement >> ->bdi_gone() in a later patch. > > I don't quite understand what the point of factoring > __invalidate_device() like this is - it's not used by anyone, so it > seems completely unnecessary to me. > You're right, without a need for an fs to intercept the 'invalidation' event there's no need to do this refactor. > And really, that points out that there are multiple changes in this > patch set that should be done separately. The rework of > del_gendisk() into del_gendisk_start/del_gendisk_end should be the > first patch. The del_gendisk/blk_cleanup_queue to > blk_cleanup_queue() combining should be the second part, as it > builds on the factoring of del_gendisk(). Then, if we really need to > keep it, the factoring to introduce generic_quiesce_super. And > finally, the patch that allows the shutdown callout can be > introduced. Ok, will split. > > [snip] >> >> +static void generic_bdi_gone(struct super_block *sb) >> +{ >> + struct inode *inode, *_inode = NULL; >> + >> + spin_lock(&sb->s_inode_list_lock); >> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { >> + spin_lock(&inode->i_lock); >> + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) >> + || !IS_DAX(inode)) { >> + spin_unlock(&inode->i_lock); >> + continue; >> + } >> + __iget(inode); >> + spin_unlock(&inode->i_lock); >> + spin_unlock(&sb->s_inode_list_lock); >> + >> + unmap_mapping_range(inode->i_mapping, 0, 0, 1); >> + iput(_inode); >> + _inode = inode; >> + cond_resched(); >> + >> + spin_lock(&sb->s_inode_list_lock); >> + } >> + spin_unlock(&sb->s_inode_list_lock); >> + iput(_inode); >> +} > > This belongs in fs/inode.c, right next to invalidate_inodes(), and > with a name that describes what it does, not the context in which > it is called. e.g. unmap_dax_inodes(). > Sounds good. >> +void shutdown_partition(struct gendisk *disk, int partno) >> +{ >> + struct block_device *bdev = bdget_disk(disk, partno); >> + struct super_block *sb = get_super(bdev); >> + >> + if (!bdev) >> + return; > > Null pointer deref. Certainly a landmine waiting for someone to > tread on. > Nope, get_super() checks for a NULL argument. >> + if (!sb) { >> + bdput(bdev); >> + return; >> + } >> + >> + if (sb->s_op->bdi_gone) >> + sb->s_op->bdi_gone(sb); >> + else >> + generic_bdi_gone(sb); > > if (sb->s_op->shutdown) > sb->s_op->shutdown(sb); > else > unmap_dax_inodes(sb); > > name things correctly, and the code documents itself. How about 'stop' or 'halt' instead of 'shutdown' to preserve the historical meaning of generic_shutdown_super? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html