On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry <almasrymina@xxxxxxxxxx> wrote: > > On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote: > > > > On 9/19/19 3:24 PM, Mina Almasry wrote: > > > Patch series implements hugetlb_cgroup reservation usage and limits, which > > > track hugetlb reservations rather than hugetlb memory faulted in. Details of > > > the approach is 1/7. > > > > Thanks for your continued efforts Mina. > > > > One thing that has bothered me with this approach from the beginning is that > > hugetlb reservations are related to, but somewhat distinct from hugetlb > > allocations. The original (existing) huegtlb cgroup implementation does not > > take reservations into account. This is an issue you are trying to address > > by adding a cgroup support for hugetlb reservations. However, this new > > reservation cgroup ignores hugetlb allocations at fault time. > > > > I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation > > of hugetlb pages. Both the existing cgroup code and the reservation approach > > have what I think are some serious flaws. Consider a system with 100 hugetlb > > pages available. A sysadmin, has two groups A and B and wants to limit hugetlb > > usage to 50 pages each. > > > > With the existing implementation, a task in group A could create a mmap of > > 100 pages in size and reserve all 100 pages. Since the pages are 'reserved', > > nobody in group B can allocate ANY huge pages. This is true even though > > no pages have been allocated in A (or B). > > > > With the reservation implementation, a task in group A could use MAP_NORESERVE > > and allocate all 100 pages without taking any reservations. > > > > As mentioned in your documentation, it would be possible to use both the > > existing (allocation) and new reservation cgroups together. Perhaps if both > > are setup for the 50/50 split things would work a little better. > > > > However, instead of creating a new reservation crgoup how about adding > > reservation support to the existing allocation cgroup support. One could > > even argue that a reservation is an allocation as it sets aside huge pages > > that can only be used for a specific purpose. Here is something that > > may work. > > > > Starting with the existing allocation cgroup. > > - When hugetlb pages are reserved, the cgroup of the task making the > > reservations is charged. Tracking for the charged cgroup is done in the > > reservation map in the same way proposed by this patch set. > > - At page fault time, > > - If a reservation already exists for that specific area do not charge the > > faulting task. No tracking in page, just the reservation map. > > - If no reservation exists, charge the group of the faulting task. Tracking > > of this information is in the page itself as implemented today. > > - When the hugetlb object is removed, compare the reservation map with any > > allocated pages. If cgroup tracking information exists in page, uncharge > > that group. Otherwise, unharge the group (if any) in the reservation map. > > > > Sorry for the late response here. I've been prototyping the > suggestions from this conversation: > > 1. Supporting cgroup-v2 on the current controller seems trivial. > Basically just specifying the dfl files seems to do it, and my tests > on top of cgroup-v2 don't see any problems so far at least. In light > of this I'm not sure it's best to create a new controller per say. > Seems like it would duplicate a lot of code with the current > controller, so I've tentatively just stuck to the plan in my current > patchset, a new counter on the existing controller. > > 2. I've been working on transitioning the new counter to the behavior > Mike specified in the email I'm responding to. So far I have a flow > that works for shared mappings but not private mappings: > > - On reservation, charge the new counter and store the info in the > resv_map. The counter gets uncharged when the resv_map entry gets > removed (works fine). > - On alloc_huge_page(), check if there is a reservation for the page > being allocated. If not, charge the new counter and store the > information in resv_map. The counter still gets uncharged when the > resv_map entry gets removed. > > The above works for all shared mappings and reserved private mappings, > but I' having trouble supporting private NORESERVE mappings. Charging > can work the same as for shared mappings: charge the new counter on > reservation and on allocations that do not have a reservation. But the > question still comes up: where to store the counter to uncharge this > page? I thought of a couple of things that don't seem to work: > > 1. I thought of putting the counter in resv_map->reservation_counter, > so that it gets uncharged on vm_op_close. But, private NORESERVE > mappings don't even have a resv_map allocated for them. > > 2. I thought of detecting on free_huge_page that the page being freed > belonged to a private NORESERVE mapping, and uncharging the > hugetlb_cgroup in the page itself, but free_huge_page only gets a > struct page* and I can't seem to find a way to detect that that the > page comes from a private NORESERVE mapping from only the struct > page*. > > Mike, note your suggestion above to check if the page hugetlb_cgroup > is null doesn't work if we want to keep the current counter working > the same: the page will always have a hugetlb_cgroup that points that > contains the old counter. Any ideas how to apply this new counter > behavior to a private NORESERVE mappings? Is there maybe a flag I can > set on the pages at allocation time that I can read on free time to > know whether to uncharge the hugetlb_cgroup or not? Reading the code and asking around a bit, it seems the pointer to the hugetlb_cgroup is in page[2].private. Is it reasonable to use page[3].private to store the hugetlb_cgroup to uncharge for the new counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that would solve my problem. When allocating a private NORESERVE page, set page[3].private to the hugetlb_cgroup to uncharge, then on free_huge_page, check page[3].private, if it is non-NULL, uncharge the new counter on it.