Rahul Lakkireddy <rahul.lakkireddy@xxxxxxxxxxx> writes:

> On Thursday, April 04/19/18, 2018 at 20:23:37 +0530, Eric W. Biederman wrote:
>> Rahul Lakkireddy <rahul.lakkireddy@xxxxxxxxxxx> writes:
>>
>> > On Thursday, April 04/19/18, 2018 at 07:10:30 +0530, Dave Young wrote:
>> >> On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
>> >> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
>> >> > > Hi Rahul,
>> >> > > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
>> >> > > > On production servers running variety of workloads over time, kernel
>> >> > > > panic can happen sporadically after days or even months. It is
>> >> > > > important to collect as much debug logs as possible to root cause
>> >> > > > and fix the problem, that may not be easy to reproduce. Snapshot of
>> >> > > > underlying hardware/firmware state (like register dump, firmware
>> >> > > > logs, adapter memory, etc.), at the time of kernel panic will be very
>> >> > > > helpful while debugging the culprit device driver.
>> >> > > >
>> >> > > > This series of patches add new generic framework that enable device
>> >> > > > drivers to collect device specific snapshot of the hardware/firmware
>> >> > > > state of the underlying device in the crash recovery kernel. In crash
>> >> > > > recovery kernel, the collected logs are added as elf notes to
>> >> > > > /proc/vmcore, which is copied by user space scripts for post-analysis.
>> >> > > >
>> >> > > > The sequence of actions done by device drivers to append their device
>> >> > > > specific hardware/firmware logs to /proc/vmcore are as follows:
>> >> > > >
>> >> > > > 1. During probe (before hardware is initialized), device drivers
>> >> > > > register to the vmcore module (via vmcore_add_device_dump()), with
>> >> > > > callback function, along with buffer size and log name needed for
>> >> > > > firmware/hardware log collection.
>> >> > >
>> >> > > I assumed the elf notes info should be prepared while kexec_[file_]load
>> >> > > phase. But I did not read the old comment, not sure if it has been discussed
>> >> > > or not.
>> >> > >
>> >> >
>> >> > We must not collect dumps in crashing kernel. Adding more things in
>> >> > crash dump path risks not collecting vmcore at all. Eric had
>> >> > discussed this in more detail at:
>> >> >
>> >> > https://lkml.org/lkml/2018/3/24/319
>> >> >
>> >> > We are safe to collect dumps in the second kernel. Each device dump
>> >> > will be exported as an elf note in /proc/vmcore.
>> >>
>> >> I understand that we should avoid adding anything in crash path. And I also
>> >> agree to collect device dump in second kernel. I just assumed device
>> >> dump use some memory area to store the debug info and the memory
>> >> is persistent so that this can be done in 2 steps, first register the
>> >> address in elf header in kexec_load, then collect the dump in 2nd
>> >> kernel. But it seems the driver is doing some other logic to collect
>> >> the info instead of just that simple like I thought.
>> >>
>> >
>> > It seems simpler, but I'm concerned with waste of memory area, if
>> > there are no device dumps being collected in second kernel. In
>> > approach proposed in these series, we dynamically allocate memory
>> > for the device dumps from second kernel's available memory.
>>
>> Don't count that kernel having more than about 128MiB.
>>
>
> If large dump is expected, Administrator can increase the memory
> allocated to the second kernel (using crashkernel boot param), to
> ensure device dumps get collected.

Except 128MiB is already a huge amount to reserve. I typically have
run crash dumps with 16MiB of memory and thought it was overkill.

Looking below, 32MiB seems a bit high but it is small enough that it is
still doable. I am baffled at how 2GiB can be guaranteed to fit in
32MiB (sparse register space?) but if it works reliably.

>> For that reason if for no other it would be nice if it was possible to
>> have the driver to not initialize the device and just stand there
>> handing out the data a piece at a time as it is read from /proc/vmcore.
>>
>
> Since cxgb4 is a network driver, it can be used to transfer the dumps
> over the network. So we must ensure the dumps get collected and
> stored, before device gets initialized to transfer dumps over
> the network.

Good point. For some reason I was thinking it was an infiniband and
not a 10Gb ethernet device.

>> The 2GiB number I read earlier concerns me for working in a limited
>> environment.
>>
>
> All dumps, including the 2GB on-chip memory dump, is compressed by
> the cxgb4 driver as they are collected. The overall compressed dump
> comes out at max 32 MB.
>
>> It might even make sense to separate this into a completely separate
>> module (depended upon the main driver if it makes sense to share
>> the functionality) so that people performing crash dumps would not
>> hesitate to include the code in their initramfs images.
>>
>> I can see splitting a device up into a portion only to be used in case
>> of a crash dump and a normal portion like we do for main memory but I
>> doubt that makes sense in practice.
>>
>
> This is not required, especially in case of network drivers, which
> must collect underlying device dump and initialize the device to
> transfer dumps over the network.

I have a practical concern. What happens if the previous kernel left
the device in such a bad state that the driver can not successfully
initialize it? Does failure to initialize cxgb4 after a crash now mean
that you can not capture the crash dump to see the crazy state the
device was in?

Typically the initramfs for a crash dump does not include unnecessary
drivers so that hardware in states the drivers can't handle won't
prevent taking a crash dump.

I understand that if you are taking a dump over your 10Gb ethernet
the issue is a moot point.
But if you are writing your dump to disk, or writing it over a
management gigabit ethernet, then it is still an issue.

Is there a decoupling so that a totally b0rked device can't prevent
taking its own dump?

Eric