On 15.10.20 09:56, David Hildenbrand wrote:
> On 14.10.20 20:31, Mike Kravetz wrote:
>> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>>> On 14.10.20 19:56, Mina Almasry wrote:
>>>> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 14.10.20 17:22, David Hildenbrand wrote:
>>>>>> Hi everybody,
>>>>>>
>>>>>> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
>>>>>> with hugetlbfs and reported that this results in [1]
>>>>>>
>>>>>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
>>>>>>
>>>>>> 2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)
>>>>>>
>>>>>>
>>>>>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>>>>>> to discard pages that are reported as free by a VM. The reporting
>>>>>> granularity is the pageblock size. So when the guest reports
>>>>>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>>>>>
>>>>>> I was also able to reproduce (also with virtio-mem, which similarly
>>>>>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>>>>>> (and on v5.7.X from F32).
>>>>>>
>>>>>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>>>>>> is broken with cgroups. I did *not* try without cgroups yet.
>>>>>>
>>>>>> Any ideas?
>>>>
>>>> Hi David,
>>>>
>>>> I may be able to dig in and take a look. How do I reproduce this
>>>> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
>>>> hugetlb region?
>>>>
>>>
>>> Hi Mina,
>>>
>>> thanks for having a look. I started poking around myself but,
>>> being new to cgroup code, I even failed to understand why that code gets
>>> triggered when the hugetlb controller isn't even enabled.
>>>
>>> I assume you at least have to make sure that there is
>>> a page populated (MAP_POPULATE, or read/write it). But I am not
>>> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
>>> sufficient, or if it will require a sequence of
>>> populate+discard(punch) (or multi-threading).
>>
>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>> with (and without) the hugetlb controller enabled and did not see this issue.
>>
>> May need to reproduce via QEMU as below.
>
> Not sure if relevant, but QEMU should be using
> memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.
>
> Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant amount of the
> memfd's memory (e.g., > 90%).
>

I just tried to reproduce by doing random accesses + random
fallocate(FALLOC_FL_PUNCH_HOLE) within a file - without success. So it
could be that

1. KVM is involved in messing this up
2. Multi-threading is involved

However, I am also able to reproduce with only a single VCPU (there is
still the QEMU main thread, but it limits the chance for races).
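For reference, the basic populate + punch sequence on a hugetlb memfd
looks roughly like this (a simplified, untested sketch, not the actual
QEMU code; file name and size are made up, and it needs at least one
free 2MB huge page):

#define _GNU_SOURCE
#include <linux/memfd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
        /* Hugetlbfs-backed file, as QEMU obtains it. */
        int fd = memfd_create("hugetlb-repro", MFD_HUGETLB | MFD_HUGE_2MB);
        char *p;

        if (fd < 0 || ftruncate(fd, HPAGE_SIZE))
                return 1;

        p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Populate the huge page ... */
        memset(p, 0, HPAGE_SIZE);

        /* ... and discard it again via hole punching. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
                      HPAGE_SIZE))
                return 1;

        munmap(p, HPAGE_SIZE);
        close(fd);
        return 0;
}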
Even KVM spits fire after a while, which could be a side effect of
allocations failing:

error: kvm run failed Bad address
RAX=0000000000000000 RBX=ffff8c12c9c217c0 RCX=ffff8c12fb1b8fc0 RDX=0000000000000007
RSI=ffff8c12c9c217c0 RDI=ffff8c12c9c217c8 RBP=000000000000000d RSP=ffffb3964040fa68
R8 =0000000000000008 R9 =ffff8c12c9c20000 R10=ffff8c12fffd5000 R11=00000000000303c0
R12=ffff8c12c9c217c0 R13=0000000000000008 R14=0000000000000001 R15=fffff31d44270800
RIP=ffffffffaf33ba0f RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 00000000 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 00000000 00000000
FS =0000 00007f8fabc87040 00000000 00000000
GS =0000 ffff8c12fbc00000 00000000 00000000
LDT=0000 fffffe0000000000 00000000 00000000
TR =0040 fffffe0000003000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe0000001000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=0000560e10895398 CR3=00000001073b2000 CR4=00350ef0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=0f 0b eb e2 90 0f 1f 44 00 00 53 48 89 fb 31 c0 48 8d 7f 08 <48> c7 47 f8 00 00 00 00 48 89 d9 48 c7 c2 44 d3 52

-- 
Thanks,

David / dhildenb