On 4/7/21 8:26 PM, Miaohe Lin wrote:
> On 2021/4/8 11:24, Miaohe Lin wrote:
>> On 2021/4/8 4:53, Mike Kravetz wrote:
>>> On 4/7/21 12:24 AM, Miaohe Lin wrote:
>>>> Hi:
>>>> On 2021/4/7 10:49, Mike Kravetz wrote:
>>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>>> The resv_map could be NULL since this routine can be called in the evict
>>>>>> inode path for all hugetlbfs inodes. So we could have chg = 0 and this
>>>>>> would result in a negative value when chg - freed. This is unexpected for
>>>>>> hugepage_subpool_put_pages() and hugetlb_acct_memory().
>>>>>
>>>>> I am not sure if this is possible.
>>>>>
>>>>> It is true that resv_map could be NULL. However, I believe resv_map
>>>>> can only be NULL for inodes that are not regular or link inodes. This
>>>>> is the inode creation code in hugetlbfs_get_inode().
>>>>>
>>>>>         /*
>>>>>          * Reserve maps are only needed for inodes that can have associated
>>>>>          * page allocations.
>>>>>          */
>>>>>         if (S_ISREG(mode) || S_ISLNK(mode)) {
>>>>>                 resv_map = resv_map_alloc();
>>>>>                 if (!resv_map)
>>>>>                         return NULL;
>>>>>         }
>>>>>
>>>>
>>>> Agree.
>>>>
>>>>> If resv_map is NULL, then no hugetlb pages can be allocated/associated
>>>>> with the file. As a result, remove_inode_hugepages will never find any
>>>>> huge pages associated with the inode and the passed value 'freed' will
>>>>> always be zero.
>>>>
>>>> But I am confused now. AFAICS, remove_inode_hugepages() searches the
>>>> address_space of the inode to remove the hugepages, but it does not care
>>>> whether the inode has an associated resv_map. How does it prevent hugetlb
>>>> pages from being allocated/associated with the file if resv_map is NULL?
>>>> Could you please explain this more?
>>>>
>>>
>>> Recall that there are only two ways to get huge pages associated with
>>> a hugetlbfs file: fallocate and mmap/write fault. Directly writing to
>>> hugetlbfs files is not supported.
>>>
>>> If you take a closer look at hugetlbfs_get_inode, it has that code to
>>> allocate the resv map mentioned above as well as the following:
>>>
>>>         switch (mode & S_IFMT) {
>>>         default:
>>>                 init_special_inode(inode, mode, dev);
>>>                 break;
>>>         case S_IFREG:
>>>                 inode->i_op = &hugetlbfs_inode_operations;
>>>                 inode->i_fop = &hugetlbfs_file_operations;
>>>                 break;
>>>         case S_IFDIR:
>>>                 inode->i_op = &hugetlbfs_dir_inode_operations;
>>>                 inode->i_fop = &simple_dir_operations;
>>>
>>>                 /* directory inodes start off with i_nlink == 2 (for "." entry) */
>>>                 inc_nlink(inode);
>>>                 break;
>>>         case S_IFLNK:
>>>                 inode->i_op = &page_symlink_inode_operations;
>>>                 inode_nohighmem(inode);
>>>                 break;
>>>         }
>>>
>>> Notice that only S_IFREG inodes will have i_fop == &hugetlbfs_file_operations.
>>> hugetlbfs_file_operations contains the hugetlbfs-specific mmap and fallocate
>>> routines. Hence, only files with S_IFREG inodes can potentially have
>>> associated huge pages. S_IFLNK inodes can as well via file linking.
>>>
>>> If an inode is not S_ISREG(mode) || S_ISLNK(mode), then it will not have
>>> a resv_map. In addition, it will not have hugetlbfs_file_operations and
>>> cannot have associated huge pages.
>>>
>>
>> Many thanks for the detailed and patient explanation! :) I think I have
>> got the idea!
>>
>>> I looked at this closely when adding commits
>>> 58b6e5e8f1ad hugetlbfs: fix memory leak for resv_map
>>> f27a5136f70a hugetlbfs: always use address space in inode for resv_map pointer
>>>
>>> I may not be remembering all of the details correctly. Commit f27a5136f70a
>>> added the comment that resv_map could be NULL to hugetlb_unreserve_pages.
>>>
>>
>> Since we must have freed == 0 when chg == 0, should we make this assumption
>> explicit with something like below?
>>
>>         WARN_ON(chg < freed);
>>
>
> Or just a comment to avoid confusion?
>

Yes, add a comment to hugetlb_unreserve_pages saying that !resv_map implies
freed == 0.

It would also be helpful to check for (chg - freed) == 0 and skip the calls
to hugepage_subpool_put_pages() and hugetlb_acct_memory(). Both of those
routines may perform an unnecessary lock/unlock cycle in this case. A simple

        if (chg == freed)
                return 0;

before the call to hugepage_subpool_put_pages would work.
-- 
Mike Kravetz
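[Editor's note: for reference, below is a condensed, illustrative sketch of how
the suggested comment and early return could sit in hugetlb_unreserve_pages().
Only the "!resv_map implies freed == 0" comment and the "if (chg == freed)
return 0;" check come from this thread; the surrounding function body is
paraphrased from mainline code of that period and is not the patch that was
actually merged.]

        long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
                                     long freed)
        {
                struct hstate *h = hstate_inode(inode);
                struct resv_map *resv_map = inode_resv_map(inode);
                struct hugepage_subpool *spool = subpool_inode(inode);
                long chg = 0;
                long gbl_reserve;

                /*
                 * resv_map can be NULL because this routine is also called
                 * from the evict inode path for inodes that never had a
                 * reserve map.  Such inodes can never have huge pages
                 * allocated for them, so freed == 0 and chg stays 0: the
                 * (chg - freed) computation below cannot go negative.
                 */
                if (resv_map) {
                        chg = region_del(resv_map, start, end);
                        /* Simplified: error handling of region_del omitted. */
                        if (chg < 0)
                                return chg;
                }

                spin_lock(&inode->i_lock);
                inode->i_blocks -= (blocks_per_huge_page(h) * freed);
                spin_unlock(&inode->i_lock);

                /*
                 * Nothing to hand back to the subpool or the global pool,
                 * so skip the lock/unlock cycles in the two calls below.
                 */
                if (chg == freed)
                        return 0;

                gbl_reserve = hugepage_subpool_put_pages(spool, chg - freed);
                hugetlb_acct_memory(h, -gbl_reserve);

                return 0;
        }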