On (23/06/08 18:35), Andy Whitcroft didst pronounce: > When a hugetlb mapping with a reservation is split, a new VMA is cloned > from the original. This new VMA is a direct copy of the original > including the reservation count. When this pair of VMAs are unmapped > we will incorrect double account the unused reservation and the overall > reservation count will be incorrect, in extreme cases it will wrap. > > The problem occurs when we split an existing VMA say to unmap a page in > the middle. split_vma() will create a new VMA copying all fields from > the original. As we are storing our reservation count in vm_private_data > this is also copies, endowing the new VMA with a duplicate of the original > VMA's reservation. Neither of the new VMAs can exhaust these reservations > as they are too small, but when we unmap and close these VMAs we will > incorrect credit the remainder twice and resv_huge_pages will become > out of sync. This can lead to allocation failures on mappings with > reservations and even to resv_huge_pages wrapping which prevents all > subsequent hugepage allocations. > > The simple fix would be to correctly apportion the remaining reservation > count when the split is made. However the only hook we have vm_ops->open > only has the new VMA we do not know the identity of the preceeding VMA. > Also even if we did have that VMA to hand we do not know how much of the > reservation was consumed each side of the split. > > This patch therefore takes a different tack. We know that the whole of any > private mapping (which has a reservation) has a reservation over its whole > size. Any present pages represent consumed reservation. Therefore if > we track the instantiated pages we can calculate the remaining reservation. > > This patch reuses the existing regions code to track the regions for which > we have consumed reservation (ie. the instantiated pages), as each page > is faulted in we record the consumption of reservation for the new page. > When we need to return unused reservations at unmap time we simply count > the consumed reservation region subtracting that from the whole of the map. > During a VMA split the newly opened VMA will point to the same region map, > as this map is offset oriented it remains valid for both of the split VMAs. > This map is referenced counted so that it is removed when all VMAs which > are part of the mmap are gone. > > Thanks to Adam Litke and Mel Gorman for their review feedback. > > Signed-off-by: Andy Whitcroft <apw@xxxxxxxxxxxx> Nice explanation. Testing on i386 with qemu, this patch allows some small tests to pass without corruption of the rsvd counters. libhugetlbfs tests also passed. I do not see anything new to complain about in the code. Thanks. Acked-by: Mel Gorman <mel@xxxxxxxxx> > --- > mm/hugetlb.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++++--------- > 1 files changed, 144 insertions(+), 27 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index d701e39..7ba6d4d 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -49,6 +49,16 @@ static DEFINE_SPINLOCK(hugetlb_lock); > /* > * Region tracking -- allows tracking of reservations and instantiated pages > * across the pages in a mapping. > + * > + * The region data structures are protected by a combination of the mmap_sem > + * and the hugetlb_instantion_mutex. To access or modify a region the caller > + * must either hold the mmap_sem for write, or the mmap_sem for read and > + * the hugetlb_instantiation mutex: > + * > + * down_write(&mm->mmap_sem); > + * or > + * down_read(&mm->mmap_sem); > + * mutex_lock(&hugetlb_instantiation_mutex); > */ > struct file_region { > struct list_head link; > @@ -171,6 +181,30 @@ static long region_truncate(struct list_head *head, long end) > return chg; > } > > +static long region_count(struct list_head *head, long f, long t) > +{ > + struct file_region *rg; > + long chg = 0; > + > + /* Locate each segment we overlap with, and count that overlap. */ > + list_for_each_entry(rg, head, link) { > + int seg_from; > + int seg_to; > + > + if (rg->to <= f) > + continue; > + if (rg->from >= t) > + break; > + > + seg_from = max(rg->from, f); > + seg_to = min(rg->to, t); > + > + chg += seg_to - seg_from; > + } > + > + return chg; > +} > + > /* > * Convert the address within this vma to the page offset within > * the mapping, in base page units. > @@ -193,9 +227,15 @@ static pgoff_t vma_pagecache_offset(struct hstate *h, > (vma->vm_pgoff >> huge_page_order(h)); > } > > -#define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1)) > -#define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2)) > +/* > + * Flags for MAP_PRIVATE reservations. These are stored in the bottom > + * bits of the reservation map pointer, which are always clear due to > + * alignment. > + */ > +#define HPAGE_RESV_OWNER (1UL << 0) > +#define HPAGE_RESV_UNMAPPED (1UL << 1) > #define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED) > + > /* > * These helpers are used to track how many pages are reserved for > * faults in a MAP_PRIVATE mapping. Only the process that called mmap() > @@ -205,6 +245,15 @@ static pgoff_t vma_pagecache_offset(struct hstate *h, > * the reserve counters are updated with the hugetlb_lock held. It is safe > * to reset the VMA at fork() time as it is not in use yet and there is no > * chance of the global counters getting corrupted as a result of the values. > + * > + * The private mapping reservation is represented in a subtly different > + * manner to a shared mapping. A shared mapping has a region map associated > + * with the underlying file, this region map represents the backing file > + * pages which have ever had a reservation assigned which this persists even > + * after the page is instantiated. A private mapping has a region map > + * associated with the original mmap which is attached to all VMAs which > + * reference it, this region map represents those offsets which have consumed > + * reservation ie. where pages have been instantiated. > */ > static unsigned long get_vma_private_data(struct vm_area_struct *vma) > { > @@ -217,22 +266,48 @@ static void set_vma_private_data(struct vm_area_struct *vma, > vma->vm_private_data = (void *)value; > } > > -static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma) > +struct resv_map { > + struct kref refs; > + struct list_head regions; > +}; > + > +struct resv_map *resv_map_alloc(void) > +{ > + struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL); > + if (!resv_map) > + return NULL; > + > + kref_init(&resv_map->refs); > + INIT_LIST_HEAD(&resv_map->regions); > + > + return resv_map; > +} > + > +void resv_map_release(struct kref *ref) > +{ > + struct resv_map *resv_map = container_of(ref, struct resv_map, refs); > + > + /* Clear out any active regions before we release the map. */ > + region_truncate(&resv_map->regions, 0); > + kfree(resv_map); > +} > + > +static struct resv_map *vma_resv_map(struct vm_area_struct *vma) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > if (!(vma->vm_flags & VM_SHARED)) > - return get_vma_private_data(vma) & ~HPAGE_RESV_MASK; > + return (struct resv_map *)(get_vma_private_data(vma) & > + ~HPAGE_RESV_MASK); > return 0; > } > > -static void set_vma_resv_huge_pages(struct vm_area_struct *vma, > - unsigned long reserve) > +static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > VM_BUG_ON(vma->vm_flags & VM_SHARED); > > - set_vma_private_data(vma, > - (get_vma_private_data(vma) & HPAGE_RESV_MASK) | reserve); > + set_vma_private_data(vma, (get_vma_private_data(vma) & > + HPAGE_RESV_MASK) | (unsigned long)map); > } > > static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) > @@ -260,19 +335,12 @@ static void decrement_hugepage_resv_vma(struct hstate *h, > if (vma->vm_flags & VM_SHARED) { > /* Shared mappings always use reserves */ > h->resv_huge_pages--; > - } else { > + } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { > /* > * Only the process that called mmap() has reserves for > * private mappings. > */ > - if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { > - unsigned long flags, reserve; > - h->resv_huge_pages--; > - flags = (unsigned long)vma->vm_private_data & > - HPAGE_RESV_MASK; > - reserve = (unsigned long)vma->vm_private_data - 1; > - vma->vm_private_data = (void *)(reserve | flags); > - } > + h->resv_huge_pages--; > } > } > > @@ -289,7 +357,7 @@ static int vma_has_private_reserves(struct vm_area_struct *vma) > { > if (vma->vm_flags & VM_SHARED) > return 0; > - if (!vma_resv_huge_pages(vma)) > + if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) > return 0; > return 1; > } > @@ -794,12 +862,19 @@ static int vma_needs_reservation(struct hstate *h, > return region_chg(&inode->i_mapping->private_list, > idx, idx + 1); > > - } else { > - if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) > - return 1; > - } > + } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { > + return 1; > > - return 0; > + } else { > + int err; > + pgoff_t idx = vma_pagecache_offset(h, vma, addr); > + struct resv_map *reservations = vma_resv_map(vma); > + > + err = region_chg(&reservations->regions, idx, idx + 1); > + if (err < 0) > + return err; > + return 0; > + } > } > static void vma_commit_reservation(struct hstate *h, > struct vm_area_struct *vma, unsigned long addr) > @@ -810,6 +885,13 @@ static void vma_commit_reservation(struct hstate *h, > if (vma->vm_flags & VM_SHARED) { > pgoff_t idx = vma_pagecache_offset(h, vma, addr); > region_add(&inode->i_mapping->private_list, idx, idx + 1); > + > + } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { > + pgoff_t idx = vma_pagecache_offset(h, vma, addr); > + struct resv_map *reservations = vma_resv_map(vma); > + > + /* Mark this page used in the map. */ > + region_add(&reservations->regions, idx, idx + 1); > } > } > > @@ -1456,13 +1538,42 @@ out: > return ret; > } > > +static void hugetlb_vm_op_open(struct vm_area_struct *vma) > +{ > + struct resv_map *reservations = vma_resv_map(vma); > + > + /* > + * This new VMA should share its siblings reservation map if present. > + * The VMA will only ever have a valid reservation map pointer where > + * it is being copied for another still existing VMA. As that VMA > + * has a reference to the reservation map it cannot dissappear until > + * after this open call completes. It is therefore safe to take a > + * new reference here without additional locking. > + */ > + if (reservations) > + kref_get(&reservations->refs); > +} > + > static void hugetlb_vm_op_close(struct vm_area_struct *vma) > { > struct hstate *h = hstate_vma(vma); > - unsigned long reserve = vma_resv_huge_pages(vma); > + struct resv_map *reservations = vma_resv_map(vma); > + unsigned long reserve; > + unsigned long start; > + unsigned long end; > > - if (reserve) > - hugetlb_acct_memory(h, -reserve); > + if (reservations) { > + start = vma_pagecache_offset(h, vma, vma->vm_start); > + end = vma_pagecache_offset(h, vma, vma->vm_end); > + > + reserve = (end - start) - > + region_count(&reservations->regions, start, end); > + > + kref_put(&reservations->refs, resv_map_release); > + > + if (reserve) > + hugetlb_acct_memory(h, -reserve); > + } > } > > /* > @@ -1479,6 +1590,7 @@ static int hugetlb_vm_op_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > > struct vm_operations_struct hugetlb_vm_ops = { > .fault = hugetlb_vm_op_fault, > + .open = hugetlb_vm_op_open, > .close = hugetlb_vm_op_close, > }; > > @@ -2037,8 +2149,13 @@ int hugetlb_reserve_pages(struct inode *inode, > if (!vma || vma->vm_flags & VM_SHARED) > chg = region_chg(&inode->i_mapping->private_list, from, to); > else { > + struct resv_map *resv_map = resv_map_alloc(); > + if (!resv_map) > + return -ENOMEM; > + > chg = to - from; > - set_vma_resv_huge_pages(vma, chg); > + > + set_vma_resv_map(vma, resv_map); > set_vma_resv_flags(vma, HPAGE_RESV_OWNER); > } > > -- > 1.5.6.205.g7ca3a > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line "unsubscribe kernel-testers" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html