Let me take a look .... oh, there it is from 2009
https://marc.info/?t=125006512600002&r=1&w=2
and then we had it in 2018
https://lkml.org/lkml/2018/5/20/262
Thanks for digging these two out, otherwise I would need to do it myself so
people know the history better.
Sure, I stumbled over this myself recently when wondering about what
fadump is.
The issue I have with this: it's just plain wrong once you seriously take
memory hotplug into account, as we see it used quite heavily in VMs. You don't
know what you'll need when building a kernel. Just pass it via the cmdline.
Hmm, kdump may have no issue with memory hotplug as far as crashkernel
reservation is concerned. The system RAM size does not correlate directly
with the crashkernel size; that's why the default value in this patch is
"Not correlated directly" ...
"1G-64G:128M,64G-1T:256M,1T-:512M"
Am I still asleep and dreaming? :)
not linearly related to system RAM size. The proportion of crashkernel
size to total RAM size is what we take into account. Usually a 160M
crashkernel is enough on most systems. If the system RAM size is
larger, extra memory can be added just in case, without much impact
on the system.
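For reference, the way a range string like "1G-64G:128M,64G-1T:256M,1T-:512M"
selects a reservation (the kernel's extended crashkernel=range:size[,...]
cmdline syntax) can be sketched roughly like this; a minimal Python
illustration, with function names that are mine, not from the patch:

```python
# Sketch of the crashkernel=<range1>:<size1>[,<range2>:<size2>,...] matching
# rule: pick the size of the first range with start <= system RAM < end.
# Helper names are hypothetical; only the syntax mirrors the kernel docs.

SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

def parse_size(s):
    """Parse a size like '64G' or '128M' into bytes."""
    if s and s[-1] in SUFFIXES:
        return int(s[:-1]) * SUFFIXES[s[-1]]
    return int(s)

def crashkernel_for(spec, ram_bytes):
    """Return the reservation (bytes) the spec selects for this RAM size,
    or 0 if no range matches (e.g. RAM below the first range's start)."""
    for entry in spec.split(","):
        rng, size = entry.split(":")
        start, _, end = rng.partition("-")
        lo = parse_size(start)
        hi = parse_size(end) if end else float("inf")  # open-ended '1T-'
        if lo <= ram_bytes < hi:
            return parse_size(size)
    return 0

spec = "1G-64G:128M,64G-1T:256M,1T-:512M"
```

So an 8G guest gets 128M reserved, while a 2T host gets 512M, which is the
"tiny extra part for very large RAM" rule described above.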
So, all the rules we have are essentially broken because they rely
completely on the system RAM during boot.
From our investigation, PCIe devices and the CPU count impact the
crashkernel size. There are always PCI devices whose drivers require tens
of KB of memory, even MB. E.g., in the patch below, my colleague Coiby found
that the i40e network card costs as much as 1.5G of memory to initialize its
ring buffers on ppc, and 85M on x86_64.
[PATCH v1 0/3] Reducing memory usage of i40e for kdump
http://lists.infradead.org/pipermail/kexec/2021-March/022117.html
Even though not all PCI devices need surprisingly large amounts of memory
like the i40e, a system with hundreds of PCI devices can also consume more
memory than expected. This kind of system is usually a high-end server,
where a specific crashkernel value needs to be set manually.
So system RAM size is the least important factor influencing crashkernel
Aehm, not with fadump, no?
consumption. Take my X1 laptop: even if I extended the RAM to 100TB, a 160M
crashkernel would still be enough. We just add a tiny extra amount to the
crashkernel if the total RAM is very large; that's the rule for
crashkernel=auto. As for VMs, given their very few devices (virtio
disk, NAT NIC, etc.), no matter how much memory is deployed and hot
added/removed, the crashkernel size won't be influenced very much. That's my
personal understanding of it.
That's an interesting observation. But you're telling me that we end up
wasting memory for the crashkernel because "crashkernel=auto", which is
supposed to do something magically good automatically, does something very
suboptimal? Oh my ... this is broken.
Long story short: crashkernel=auto is pure ugliness.
Why can't we compute a crashkernel value in user space when
installing/activating kdump, and require a reboot for kdump to become
active as long as that crashkernel setting is not properly in effect?
Just have a look at the system properties (is_qemu(), #PCI, ...) and
propose a value for "crashkernel=". Check that at least that value is
in effect when activating kdump; otherwise don't enable kdump and fail.
Yes, it can be difficult with some newer/older kernels having somewhat
different demands, but things shouldn't change drastically, and a distro
can always update its advice along with the kernel, no?
You could even have a kernel interface that gives you the current
crashkernel size (maybe already there) vs. the recommended crashkernel
size. Make kdump or *whoever* activate that in the cmdline and let kdump
check if both values are satisfied when booting up.
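The check described here could be sketched roughly as below; a Python sketch
under assumptions: /sys/kernel/kexec_crash_size is taken as the existing
kernel interface exposing the current reservation, while the recommendation
policy, its numbers, and all function names are purely hypothetical
placeholders for whatever the distro tooling decides:

```python
# Sketch of a userspace "is the crashkernel reservation big enough" check.
# Assumptions: /sys/kernel/kexec_crash_size holds the currently reserved
# size in bytes; recommended_crashkernel() is a made-up policy hook.

from pathlib import Path

def current_crashkernel():
    """Bytes currently reserved for the crash kernel (0 if none/unavailable)."""
    try:
        return int(Path("/sys/kernel/kexec_crash_size").read_text())
    except (OSError, ValueError):
        return 0

def recommended_crashkernel(num_pci_devices, is_vm):
    """Hypothetical policy: a base reservation plus a per-device margin."""
    base = 160 << 20          # 160M covers most systems, per the thread
    per_dev = 1 << 20         # assumed 1M headroom per PCI device
    return base + (0 if is_vm else num_pci_devices * per_dev)

def kdump_ok(num_pci_devices, is_vm):
    """Refuse to enable kdump until the reservation meets the recommendation."""
    return current_crashkernel() >= recommended_crashkernel(num_pci_devices,
                                                            is_vm)
```

If kdump_ok() fails, the tooling would update the cmdline with the proposed
crashkernel= value and require a reboot, exactly as described above.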
Also: this approach here doesn't make any sense when you want to do
something dependent on other cmdline parameters. Take "fadump=on" vs
"fadump=off" as an example. You just cannot handle that properly as
proposed in this patch. To me, the approach in this patch makes the least
sense, TBH.
--
Thanks,
David / dhildenb