CC'ing linux-mm On 8/27/21 8:11 AM, Guillaume Morin wrote: > Hi, > > After upgrading to 5.10 from 5.4 (though I believe that these issues are > still present in 5.14), we noticed some refcount count warning > pertaining a corrupted reference count of the hugetlb css, e.g > > [ 704.259734] percpu ref (css_release) <= 0 (-1) after switching to atomic > [ 704.259755] WARNING: CPU: 23 PID: 130 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x127/0x130 > [ 704.259911] CPU: 23 PID: 130 Comm: ksoftirqd/23 Kdump: loaded Tainted: G O 5.10.60 #1 > [ 704.259916] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x127/0x130 > [ 704.259920] Code: eb b1 80 3d 37 4f 0a 01 00 0f 85 5d ff ff ff 49 8b 55 e0 48 c7 c7 38 57 0d 94 c6 05 1f 4f 0a 01 01 49 8b 75 e8 e8 a9 e5 c1 ff <0f> 0b e9 3b ff ff ff 66 90 55 48 89 e5 41 56 49 89 f6 41 55 49 89 > [ 704.259922] RSP: 0000:ffffb19b4684bdd0 EFLAGS: 00010282 > [ 704.259924] RAX: 0000000000000000 RBX: 7ffffffffffffffe RCX: 0000000000000027 > [ 704.259926] RDX: 0000000000000000 RSI: ffff9a81ffb58b40 RDI: ffff9a81ffb58b48 > [ 704.259927] RBP: ffffb19b4684bde8 R08: 0000000000000003 R09: 0000000000000001 > [ 704.259929] R10: 0000000000000003 R11: ffffb19b4684bb70 R12: 0000370946a03b50 > [ 704.259931] R13: ffff9a72c9ceb860 R14: 0000000000000000 R15: ffff9a72c42f4000 > [ 704.259933] FS: 0000000000000000(0000) GS:ffff9a81ffb40000(0000) knlGS:0000000000000000 > [ 704.259935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 704.259936] CR2: 0000000001416318 CR3: 000000011e1ac003 CR4: 00000000003706e0 > [ 704.259938] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 704.259939] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 704.259941] Call Trace: > [ 704.259950] rcu_core+0x30f/0x530 > [ 704.259955] rcu_core_si+0xe/0x10 > [ 704.259959] __do_softirq+0x103/0x2a2 > [ 704.259964] ? sort_range+0x30/0x30 > [ 704.259968] run_ksoftirqd+0x2b/0x40 > [ 704.259971] smpboot_thread_fn+0x11a/0x170 > [ 704.259975] kthread+0x10a/0x140 > [ 704.259978] ? kthread_create_worker_on_cpu+0x70/0x70 > [ 704.259983] ret_from_fork+0x22/0x30 > > The box would soon crash due to some GPF or NULL pointer deference > either in cgroups_destroy or in the kill_css path. We confirmed the > issue was specific to the hugetlb css by manually disabling its use and > verifying that the box then stayed up and happy. Thanks for reporting this Guillaume! There have been other hugetlb cgroup fixes since 5.10. I do not believe they are related to the underflow issue you have seen. Just FYI. I do hope Mina is taking a look as well. His recent/ongoing mremap work is also touching these areas of the code. > I believe there might be 2 distinct bugs leading to this. I am not > familiar with the cgroup code so I might be off base here. I did my best > to track this and understand the logic. Any feedback will be welcome. > > I have not provided patches because if I am correct, they're fairly > trivial and since I am unfamiliar with this code, I am afraid they could > not be that helpful. But I could provide them if anybody is interested. > > 1. Since e9fe92ae0cd2, hugetlb_vm_op_close() decreases the refcount of > the css (if present) through the hugetlb_cgroup_uncharge_counter() call > if a resv map is set on the vma and the owner flag is present (i.e > private mapping). In the most basic case, the corresponding refcount > increase is done in hugetlb_reserve_pages(). > However when sibling vmas are opened, hugetlb_vm_op_open() is called, > the resv map reference count is increased (if vma_resv_map(vma) is not > NULL for private mappings), but not for a potential resv->css (i.e if > resv->css != NULL). > When these siblings vmas are closed, the refcount will still be > decreased once per such vma, leading to an underflow and premature > release (potentially use after free) of the hugetlb css. The fix would > be to call css_get() if resv->css != NULL in hugetlb_vm_op_open() That does appear to be a real issue. In the 'common' fork case, a vma is duplicated but reset_vma_resv_huge_pages() is called to clear the pointer to the reserve map for private mappings before calling hugetlb_vm_op_open. However, when a vma is split both resulting vmas would be 'owners' of private mapping reserves without incrementing the refcount which would lead to the underflow you describe. > 2. After 08cf9faf75580, __free_huge_page() decrements the css refcount > for _each_ page unconditionally by calling > hugetlb_cgroup_uncharge_page_rsvd(). > But a per-page reference count is only taken *per page* outside the > reserve case in alloc_huge_page() (i.e > hugetlb_cgroup_charge_cgroup_rsvd() is called only if deferred_reserve > is true). In the reserve case, there is only one css reference linked > to the resv map (taken in hugetlb_reserve_pages()). > This also leads to an underflow of the counter. A similar scheme to > HPageRestoreReserve can be used to track which pages were allocated in > the deferred_reserve case and call hugetlb_cgroup_uncharge_page_rsvd() > only for these during freeing. I am not sure about the above analysis. It is true that hugetlb_cgroup_uncharge_page_rsvd is called unconditionally in free_huge_page. However, IIUC hugetlb_cgroup_uncharge_page_rsvd will only decrement the css refcount if there is a non-NULL hugetlb_cgroup pointer in the page. And, the pointer in the page would only be set in the 'deferred_reserve' path of alloc_huge_page. Unless I am missing something, they seem to balance. In any case I will look closer at the code with an eye on underflows. Thanks for the report and help! -- Mike Kravetz