Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

Michal Hocko <mhocko@xxxxxxxx> · Fri, 8 Dec 2023 11:04:27 +0100

On Fri 08-12-23 09:55:39, Baoquan He wrote:
> On 12/07/23 at 12:52pm, Michal Hocko wrote:
> > On Thu 07-12-23 12:13:14, Philipp Rudo wrote:
[...]
> > > Thing is that users don't only want to reduce the memory usage but also
> > > the downtime of kdump. In the end I'm afraid that "simply waiting" will
> > > make things unnecessarily more complex without really solving any issue.
> > 
> > I am not sure I see the added complexity. Something as simple as
> > __crash_kexec:
> > 	if (crashk_cma_cnt) 
> > 		mdelay(TIMEOUT)
> > 
> > should do the trick.
> 
> I would say please don't do this. kdump jumping is a very quick
> behaviour after corruption, usually in several seconds. I can't see any
> meaningful stuff with the delay of one minute or several minutes.

Well, I've been told that DMA should complete within seconds after the
controller is programmed (if it took much longer, then short-term pinning
would not really be appropriate, because it would block memory movability
for way too long and therefore result in failures). This is something we
can tune for.
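
To make the "tune for" part concrete, here is a minimal user-space sketch
of the decision behind the pseudocode quoted above. crashk_cma_cnt comes
from that pseudocode; the helper name and the 1000 ms value are made up
for illustration and are not a proposal for the actual timeout. In the
kernel the result would simply be handed to mdelay().

```c
/* Hypothetical upper bound on the wait, in milliseconds (assumption). */
#define CRASH_CMA_DMA_TIMEOUT_MS 1000U

/*
 * How long to wait before jumping into the kdump kernel.
 * Without CMA-backed crash memory there is nothing to wait for,
 * so the existing fast path is unchanged.
 */
static unsigned int crash_cma_delay_ms(unsigned int crashk_cma_cnt)
{
	return crashk_cma_cnt ? CRASH_CMA_DMA_TIMEOUT_MS : 0U;
}
```

The point is that only configurations which opted into CMA-backed
reservations would pay the delay.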

But if that sounds like a completely wrong approach then I think an
alternative would be to live with potential in-flight DMAs and just avoid
using that memory in the kdump kernel before the DMA controllers (PCI
bus) are reinitialized by the kdump kernel. That should happen early in
the boot process IIRC, and the CMA backed memory could be added after
that moment. We already have the means to defer memory initialization,
so an extension shouldn't be hard to do. It will be a slightly more involved
patch touching core MM, which we have tried to avoid so far. Does that
sound like something acceptable?
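
Roughly, the rule I have in mind looks like the following user-space
sketch. All names here (the boot phases, struct crash_range and the
helpers) are hypothetical and only model the idea; this is not how the
existing deferred memory initialization is implemented.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical boot phases of the kdump kernel, for illustration only. */
enum boot_phase {
	PHASE_EARLY,		/* devices still as the crashed kernel left them */
	PHASE_DMA_REINIT_DONE,	/* DMA controllers (PCI bus) reinitialized */
};

/* Hypothetical descriptor of a memory range handed to the kdump kernel. */
struct crash_range {
	unsigned long start;
	unsigned long size;
	bool cma_backed;	/* may still be a target of in-flight DMA */
};

/* A CMA-backed range is only usable once the DMA controllers were reset. */
static bool range_usable(const struct crash_range *r, enum boot_phase phase)
{
	return !r->cma_backed || phase >= PHASE_DMA_REINIT_DONE;
}

/* Sum the bytes the kdump kernel may use in the given phase. */
static unsigned long usable_bytes(const struct crash_range *ranges, size_t n,
				  enum boot_phase phase)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++)
		if (range_usable(&ranges[i], phase))
			total += ranges[i].size;
	return total;
}
```

So the kdump kernel would boot from the regular reservation only and
pick up the CMA backed part later.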

[...]

> > The thing we should keep in mind is that the memory sitting aside is not
> > used in majority of time. Crashes (luckily/hopefully) do not happen very
> > often. And I can really see why people are reluctant to waste it. Every
> > MB of memory has an operational price tag on it. And let's just be
> > really honest, a simple reboot without a crash dump is very likely
> > a cheaper option than wasting a productive memory as long as the issue
> > happens very seldom.
> 
> All the time, I have never heard people don't want to "waste" the
> memory. E.g, for more than 90% of system on x86, 256M is enough. The
> rare exceptions will be noted once recognized and documented in product
> release.
> 
> And ,cma is not silver bullet, see this oom issue caused by i40e and its
> fix , your crashkernel=1G,cma won't help either.
> 
> [v1,0/3] Reducing memory usage of i40e for kdump
> https://patchwork.ozlabs.org/project/intel-wired-lan/cover/20210304025543.334912-1-coxu@xxxxxxxxxx/
> 
> ======Abstracted from above cover letter=========================
> After reducing the allocation of tx/rx/arg/asq ring buffers to the
> minimum, the memory consumption is significantly reduced,
>     - x86_64: 85.1MB to 1.2MB 
>     - POWER9: 15368.5MB to 20.8MB
> ==================================================================

Nice to see memory consumption reduction fixes. But honestly, this
should happen regardless of kdump. CMA backed kdump is not meant to
work around excessive kernel memory consumers. It seems I am failing to
get the message through :( but I do not know how else to express that the
pressure to reduce wasted memory is real. It is not important
whether 256MB is enough for everybody. Even that would grow to a
non-trivial cost in data centers with many machines.

> And say more about it. This is not the first time of attempt to make use
> of ,cma area for crashkernel=. In redhat, at least 5 people have tried
> to add this, finally we gave up after long discussion and investigation.
> This year, one kernel developer in our team raised this again with a
> very long mail after his own analysis, we told him the discussion and
> trying we have done in the past.

This is really hard to comment on without any references to those
discussions. From this particular email thread I have the impression that
you guys focus much more on provable correctness than on feasibility. If
we applied the same approach universally then many other features
couldn't have been merged. E.g. kexec, for the reasons you have mentioned
in the email thread.

Anyway, thanks for pointing out the regular DMA via gup case, which we were
obviously not aware of. I personally have considered this to be a
marginal problem compared to RDMA, which is unpredictable wrt timing.
But we believe that this could be worked around. Now it would be really
valuable if we knew somebody had _tried_ that and it turned out not to
work because of XYZ reasons. That would be a solid base to
re-evaluate and think of different approaches.

Look, this will be your call as maintainers in the end. If you have
decided then fair enough. We might end up trying this feature downstream
and maybe come back in the future with experience which we currently
do not have. But it seems we are not alone in seeing the existing state as
insufficient (http://lkml.kernel.org/r/20230719224821.GC3528218@xxxxxxxxxx).

Thanks!
-- 
Michal Hocko
SUSE Labs
