Hello Mike,

I don't usually work on the kernel, so please excuse any inaccuracies. I'm contacting you off-list because, if what I'm facing is confirmed, it might be considered a security issue (DoS). I'll leave that to your judgement.

I'm seeing an issue related to hugetlb_cgroup.

I'm running:
- kubernetes 1.19 + containerd/docker
- kernel 5.9.0-36.fc34.x86_64
- kernel params: systemd.unified_cgroup_hierarchy=0 default_hugepagesz=1G hugepagesz=1G hugepages=10

I'm still trying to isolate aspects of my setup; currently my reproducer is:

1 - Start a simple pod that uses the recently added HugePages medium feature [1] (pod yaml attached).
2 - Start a DPDK app. It doesn't need to run successfully (as in transfer packets) nor interact with real hardware. It seems that just initializing the EAL layer (which handles hugepage reservation and locking) is enough to trigger the issue (a rough userspace sketch of what I mean is appended at the end of this mail).
3 - Delete the pod (or let it "Complete").

This results in what seems to be a kworker thread endlessly looping on a spin_lock.

top shows:

 1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45 kworker/28:7+cgroup_destroy

'perf top -g' reports:

-   63.28%     0.01%  [kernel]  [k] worker_thread
   - 49.97% worker_thread
      - 52.64% process_one_work
         - 62.08% css_killed_work_fn
            - hugetlb_cgroup_css_offline
                 41.52% _raw_spin_lock
               - 2.82% _cond_resched
                    rcu_all_qs
                 2.66% PageHuge
               - 0.57% schedule
                  - 0.57% __schedule

Under certain circumstances (which I'm still trying to understand) this makes the kernel quite unresponsive, requiring a hard reboot.

I've isolated the issue in a VM and was about to start bisecting it (the issue does not happen on kernel-5.6.6-300.fc32).

Do you have any clue or pointer as to how to further troubleshoot this issue?

Thanks,
--
Adrián Moreno

[1] https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/
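
P.S. In case it's useful, below is a rough, minimal userspace approximation of what I believe the EAL init does with hugepages: map a 1G hugetlb page, fault it in, and mlock it. This is not DPDK code (EAL, as far as I understand, normally goes through files on a hugetlbfs mount rather than an anonymous mapping); the 1G page size and the flags are assumptions based on my kernel params, and it is only meant to illustrate step 2 of the reproducer.

/* hugepage_probe.c - rough approximation of the hugepage reservation +
 * locking done during DPDK EAL init. Not DPDK code; the 1G page size is
 * an assumption from my setup (default_hugepagesz=1G).
 * Build: gcc -o hugepage_probe hugepage_probe.c
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)	/* log2(1G) encoded in the mmap flags */
#endif

#define SZ_1G (1UL << 30)

int main(void)
{
	/* Anonymous hugetlb mapping; MAP_HUGE_1GB selects the 1G hstate. */
	void *addr = mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
			  -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch the page so it is actually faulted in (and charged to the
	 * process' hugetlb cgroup), then lock it as EAL would. */
	memset(addr, 0, SZ_1G);
	if (mlock(addr, SZ_1G))
		perror("mlock");

	/* Keep the mapping around so the pod can be deleted meanwhile. */
	getchar();

	munmap(addr, SZ_1G);
	return 0;
}

My (possibly wrong) reading of the perf output is that the charge ends up against the pod's hugetlb cgroup, and hugetlb_cgroup_css_offline is then stuck trying to move those charges away when the cgroup is destroyed.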
Attachment: test.yaml (application/yaml)