On Mon, Apr 15, 2019 at 10:11:39AM -0700, Mike Kravetz wrote: > On 4/15/19 2:15 AM, Michal Hocko wrote: > > On Mon 15-04-19 06:16:15, Naoya Horiguchi wrote: > >> On Fri, Apr 12, 2019 at 04:40:01PM -0700, Mike Kravetz wrote: > >>> On 4/11/19 9:02 PM, Yufen Yu wrote: > >>>> Commit 58b6e5e8f1ad ("hugetlbfs: fix memory leak for resv_map") > >>> ... > >>>> However, for inode mode that is 'S_ISBLK', hugetlbfs_evict_inode() may > >>>> free or modify i_mapping->private_data that is owned by bdev inode, > >>>> which is not expected! > >>> ... > >>>> We fix the problem by moving resv_map to hugetlbfs_inode_info. It may > >>>> be more reasonable. > >>> > >>> Your patches force me to consider these potential issues. Thank you! > >>> > >>> The root of all these problems (including the original leak) is that the > >>> open of a block special inode will result in bd_acquire() overwriting the > >>> value of inode->i_mapping. Since hugetlbfs inodes normally contain a > >>> resv_map at inode->i_mapping->private_data, a memory leak occurs if we do > >>> not free the initially allocated resv_map. In addition, when the > >>> inode is evicted/destroyed inode->i_mapping may point to an address space > >>> not associated with the hugetlbfs inode. If code assumes inode->i_mapping > >>> points to hugetlbfs inode address space at evict time, there may be bad > >>> data references or worse. > >> > >> Let me ask a kind of elementary question: is there any good reason/purpose > >> to create and use block special files on hugetlbfs? I never heard about > >> such usecases. > > I am not aware of this as a common use case. Yufen Yu may be able to provide > more details about how the issue was discovered. My guess is that it was > discovered via code inspection. > > >> I guess that the conflict of the usage of ->i_mapping is > >> discovered recently and that's because block special files on hugetlbfs are > >> just not considered until recently or well defined. So I think that we might > >> be better to begin with defining it first. > > Unless I am mistaken, this is just like creating a device special file > in any other filesystem. Correct? hugetlbfs is just some place for the > inode/file to reside. What happens when you open/ioctl/close/etc the file > is really dependent on the vfs layer and underlying driver. > OK. Generally speaking, "special files just work even on hugetlbfs" sounds fine for me if it properly works. > > A absolutely agree. Hugetlbfs is overly complicated even without that. > > So if this is merely "we have tried it and it has blown up" kinda thing > > then just refuse the create blockdev files or document it as undefined. > > You need a root to do so anyway. > > Can we just refuse to create device special files in hugetlbfs? Do we need > to worry about breaking any potential users? I honestly do not know if anyone > does this today. However, if they did I believe things would "just work". > The only known issue is leaking a resv_map structure when the inode is > destroyed. I doubt anyone would notice that leak today. Thanks for explanation, so that's unclear now. > > Let me do a little more research. I think this can all be cleaned up by > making hugetlbfs always operate on the address space embedded in the inode. > If nothing else, a change or explanation should be added as to why most code > operates on inode->mapping and one place operates on &inode->i_data. Sounds nice, thank you. (Just for sharing point, not intending to block the fix ...) My remaining concern is that this problem might not be hugetlbfs specific, because what triggers the issue seems to be the usage of inode->i_mapping. bd_acquire() are callable from any filesystem, so I'm wondering whether we have something to generally prevent this kind of issue? Thanks, Naoya Horiguchi