On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@xxxxxxx> wrote: >> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown >> From: Dan Williams <dan.j.williams@xxxxxxxxx> >> >> Currently dax mappings leak past / survive block_device shutdown. While >> page cache pages are permitted to be read/written after the block_device >> is torn down this is not acceptable in the dax case as all media access >> must end when the device is disabled. The pfn backing a dax mapping is >> permitted to be invalidated after bdev shutdown and this is indeed the >> case with brd. >> >> When a dax capable block_device driver calls del_gendisk() in its >> shutdown path del_gendisk() needs to ensure that all DAX pfns are >> unmapped. This is different than the pagecache backed case where the >> disk is protected by the queue being torn down which ends I/O to the >> device. Since dax bypasses the page cache we need to unconditionally >> unmap the inode. > ... >> +static void unmap_list(struct list_head *head) >> +{ >> + struct inode *inode, *_i; >> + >> + list_for_each_entry_safe(inode, _i, head, i_lru) { >> + list_del_init(&inode->i_lru); >> + unmap_mapping_range(&inode->i_data, 0, 0, 1); >> + iput(inode); >> + cond_resched(); >> + } >> +} >> + >> /** >> * evict_inodes - evict all evictable inodes for a superblock >> * @sb: superblock to operate on >> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) >> int busy = 0; >> struct inode *inode, *next; >> LIST_HEAD(dispose); >> + LIST_HEAD(unmap); >> >> spin_lock(&sb->s_inode_list_lock); >> list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { >> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) >> busy = 1; >> continue; >> } >> + if (IS_DAX(inode) && atomic_read(&inode->i_count)) { >> + /* >> + * dax mappings can't live past this invalidation event >> + * as there is no page cache present to allow the data >> + * to remain accessible. >> + */ >> + __iget(inode); >> + inode_lru_list_del(inode); >> + spin_unlock(&inode->i_lock); >> + list_add(&inode->i_lru, &unmap); >> + busy = 1; >> + continue; >> + } > > I'm somewhat concerned about the abuse of i_lru list here. The inodes have > i_count > 0 and so the code generally assumes such inodes should be removed > from LRU list if they are there. Now I didn't find a place where this could > really hit you but it is fragile. And when that happens, you have a > corruption of your unmap list (since you access it without any locks) and > also you will not unmap your inode. "i_lru" list abuse was my main concern with this patch, I'll look into a different way. > Also are you sure that your unmapping cannot race with other process > mapping the pfns again? You're right, there is indeed a gap between the unmap and when get_blocks() will start returning errors in the fault path. I think I need to defer the unmapping until after blk_cleanup_queue() where we know that no new I/o to the device is possible. > BTW what happens when you access a DAX pfn that got removed? SIGBUS. I don't see a way to be any kinder to the application. After the ->remove() method for the block_device is complete we can't be sure that the pfn is valid or even present in the system (think brd, or VM hot provisioning). -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html