On Wednesday, April 04/18/18, 2018 at 19:58:01 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy <rahul.lakkireddy@xxxxxxxxxxx> writes:
>
> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> >> Hi Rahul,
> >> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> >> > On production servers running variety of workloads over time, kernel
> >> > panic can happen sporadically after days or even months. It is
> >> > important to collect as much debug logs as possible to root cause
> >> > and fix the problem, that may not be easy to reproduce. Snapshot of
> >> > underlying hardware/firmware state (like register dump, firmware
> >> > logs, adapter memory, etc.), at the time of kernel panic will be very
> >> > helpful while debugging the culprit device driver.
> >> >
> >> > This series of patches add new generic framework that enable device
> >> > drivers to collect device specific snapshot of the hardware/firmware
> >> > state of the underlying device in the crash recovery kernel. In crash
> >> > recovery kernel, the collected logs are added as elf notes to
> >> > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >> >
> >> > The sequence of actions done by device drivers to append their device
> >> > specific hardware/firmware logs to /proc/vmcore are as follows:
> >> >
> >> > 1. During probe (before hardware is initialized), device drivers
> >> >    register to the vmcore module (via vmcore_add_device_dump()), with
> >> >    callback function, along with buffer size and log name needed for
> >> >    firmware/hardware log collection.
> >>
> >> I assumed the elf notes info should be prepared while kexec_[file_]load
> >> phase. But I did not read the old comment, not sure if it has been discussed
> >> or not.
> >>
> >
> > We must not collect dumps in crashing kernel. Adding more things in
> > crash dump path risks not collecting vmcore at all. Eric had
> > discussed this in more detail at:
> >
> > https://lkml.org/lkml/2018/3/24/319
> >
> > We are safe to collect dumps in the second kernel. Each device dump
> > will be exported as an elf note in /proc/vmcore.
>
> It just occurred to me there is one variation that is worth
> considering.
>
> Is the area you are looking at dumping part of a huge mmio area?
> I think someone said 2GB?
>
> If that is the case it could be worth it to simply add the needed
> addresses to the range of memory we need to dump, and simply having a
> elf note saying that is what happened.
>

We are _not_ dumping mmio area. However, one part of the dump
collection involves reading 2 GB on-chip memory via PIO access,
which is compressed and stored.

Thanks,
Rahul