On Mon, Apr 22, 2019 at 4:56 PM Kirtika Ruchandani <kirtika@xxxxxxxxxx> wrote: > > On Mon, Apr 22, 2019 at 11:07 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote: > > > > On Wed 17-04-19 10:35:16, Luca Coelho wrote: > > > From: Shahar S Matityahu <shahar.s.matityahu@xxxxxxxxx> > > > > > > The driver uses msix causes-register to handle both msix and non msix > > > interrupts when performing sync nmi. On devices that do not support > > > msix this register is unmapped and accessing it causes a kernel panic. > > > > > > Solve this by differentiating the two cases and accessing the proper > > > causes-register in each case. > > Are you sure reading CSR_INT from trans.c without explicitly getting irq_lock.c > like rx.c does, is thread-safe? I don't claim to understand this fully, but this > smells wrong from past experience with this driver. I'll see if I can cook up > a test case with a race condition here. Sorry for the typos. I meant "writing (not reading) to CSR_INT in iwl_trans_pcie_sync_nmi without explicitly getting trans_pcie->irq_lock like iwl_pcie_irq_handler does". I spent some time playing around this, and while I don't have a black-and-white test-case to show this patch has side-effects (at the very least), I have some notes: 1. Repeatedly sending fw_nmi is a good test-case for the problem this patch is trying to solve. i.e. I had $ cd /sys/kernel/debug/iwlwifi/${PCI_ID}/iwlmvm/ $ while true; do echo 1 > fw_nmi ; done # this is likely too harsh and needs a sleep in b/w With the current ToT driver in wireless-drivers-next, this ramoops-es super quickly after hitting the problematic MSI-X read. 2. With this patch applied, I hit an ADVANCED_SYSASSERT 0x0 and it took ~800ms and dozens of kernel warnings before the driver "recovered". I've sent the full dmesg to Luca off-list. It didn't get any better with adding spin_lock(&trans_pcie->irq_lock) / spin_unlock(&trans_pcie->irq_lock) around the problematic CSR_INT write in iwl_trans_sync_nmi. So I still don't have any concrete reasoning or proof, except a dmesg after a test that doesn't look right. Apologies if this discussion is moot - FWIW, the thing that caused a code smell in the first place was that historically, nothing in trans.c had mucked with writing to CSR_INT, except when enabling interrupts again with a 0xFFFFFFFF at _initialization time_ (and we've seen race conditions there in the past). You can check this with `git grep -p -w CSR_INT trans.c`. > > > > > > Reported-by: Michal Hocko <mhocko@xxxxxxxxxx> > > > Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@xxxxxxxxx> > > > Signed-off-by: Luca Coelho <luciano.coelho@xxxxxxxxx> > > > > $ dmesg | grep "Error sending SCAN_CFG_CMD:" > > [49786.288548] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms. > > [53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms. > > > > without the oops and with the iwlwifi internal dump IIUC which is the > > previous behavior. > > [53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms. > > [53457.166882] iwlwifi 0000:01:00.0: Current CMD queue read_ptr 224 write_ptr 225 > > [53457.414973] iwlwifi 0000:01:00.0: HW error, resetting before reading > > [53457.421339] iwlwifi 0000:01:00.0: Start IWL Error Log Dump: > > [53457.421345] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1269232956 > > [53457.421347] iwlwifi 0000:01:00.0: Loaded firmware version: 36.9f0a2d68.0 > > [53457.421350] iwlwifi 0000:01:00.0: 0x45E91306 | ADVANCED_SYSASSERT > > [53457.421352] iwlwifi 0000:01:00.0: 0x2F58D384 | trm_hw_status0 > > [53457.421353] iwlwifi 0000:01:00.0: 0x7F1A8CFD | trm_hw_status1 > > [53457.421355] iwlwifi 0000:01:00.0: 0x07E787FD | branchlink2 > > [53457.421357] iwlwifi 0000:01:00.0: 0xE9E54368 | interruptlink1 > > [53457.421359] iwlwifi 0000:01:00.0: 0x470D9BBF | interruptlink2 > > [53457.421361] iwlwifi 0000:01:00.0: 0xAF040E7E | data1 > > [53457.421362] iwlwifi 0000:01:00.0: 0xE7FBCA48 | data2 > > [53457.421364] iwlwifi 0000:01:00.0: 0x4E4A8288 | data3 > > [53457.421366] iwlwifi 0000:01:00.0: 0x861DEA98 | beacon time > > [53457.421368] iwlwifi 0000:01:00.0: 0xE8F23466 | tsf low > > [53457.421369] iwlwifi 0000:01:00.0: 0xD7B19307 | tsf hi > > [53457.421371] iwlwifi 0000:01:00.0: 0xE58934E3 | time gp1 > > [53457.421373] iwlwifi 0000:01:00.0: 0xB013FEBE | time gp2 > > [53457.421375] iwlwifi 0000:01:00.0: 0x962DCC75 | uCode revision type > > [53457.421376] iwlwifi 0000:01:00.0: 0xFF8FB30F | uCode version major > > [53457.421378] iwlwifi 0000:01:00.0: 0x0DD08E17 | uCode version minor > > [53457.421380] iwlwifi 0000:01:00.0: 0x87FD70DE | hw version > > [53457.421382] iwlwifi 0000:01:00.0: 0x853F6851 | board version > > [53457.421384] iwlwifi 0000:01:00.0: 0x08D7F330 | hcmd > > [53457.421385] iwlwifi 0000:01:00.0: 0x6B7E5FEE | isr0 > > [53457.421387] iwlwifi 0000:01:00.0: 0x2B1E7CD4 | isr1 > > [53457.421389] iwlwifi 0000:01:00.0: 0x3F133B16 | isr2 > > [53457.421391] iwlwifi 0000:01:00.0: 0x5D480C5A | isr3 > > [53457.421392] iwlwifi 0000:01:00.0: 0x34E93EBA | isr4 > > [53457.421394] iwlwifi 0000:01:00.0: 0x42AD8E83 | last cmd Id > > [53457.421396] iwlwifi 0000:01:00.0: 0x1F5BBCFF | wait_event > > [53457.421398] iwlwifi 0000:01:00.0: 0x6808B2C1 | l2p_control > > [53457.421399] iwlwifi 0000:01:00.0: 0x0D5B1F33 | l2p_duration > > [53457.421401] iwlwifi 0000:01:00.0: 0xF4C94535 | l2p_mhvalid > > [53457.421403] iwlwifi 0000:01:00.0: 0x3DCE6EBB | l2p_addr_match > > [53457.421405] iwlwifi 0000:01:00.0: 0xFDDC41FE | lmpm_pmg_sel > > [53457.421406] iwlwifi 0000:01:00.0: 0xB53A17F5 | timestamp > > [53457.421408] iwlwifi 0000:01:00.0: 0x5A6A4113 | flow_handler > > [53457.421474] iwlwifi 0000:01:00.0: Start IWL Error Log Dump: > > [53457.421477] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1182976748 > > [53457.421478] iwlwifi 0000:01:00.0: 0x62D2BDB3 | ADVANCED_SYSASSERT > > [53457.421480] iwlwifi 0000:01:00.0: 0x4D9E5019 | umac branchlink1 > > [53457.421482] iwlwifi 0000:01:00.0: 0x8CB69F6E | umac branchlink2 > > [53457.421484] iwlwifi 0000:01:00.0: 0x9868662D | umac interruptlink1 > > [53457.421486] iwlwifi 0000:01:00.0: 0x9800F8F7 | umac interruptlink2 > > [53457.421488] iwlwifi 0000:01:00.0: 0xC71449B8 | umac data1 > > [53457.421489] iwlwifi 0000:01:00.0: 0xAB0AB17F | umac data2 > > [53457.421491] iwlwifi 0000:01:00.0: 0x6C6F9753 | umac data3 > > [53457.421493] iwlwifi 0000:01:00.0: 0xFC49D724 | umac major > > [53457.421495] iwlwifi 0000:01:00.0: 0xA61CC627 | umac minor > > [53457.421496] iwlwifi 0000:01:00.0: 0x45BAA0B8 | frame pointer > > [53457.421498] iwlwifi 0000:01:00.0: 0x319D112B | stack pointer > > [53457.421500] iwlwifi 0000:01:00.0: 0xEFD9E2E9 | last host cmd > > [53457.421502] iwlwifi 0000:01:00.0: 0x82640FF7 | isr status reg > > [53457.421506] ieee80211 phy0: Hardware restart was requested > > [53457.941685] iwlwifi 0000:01:00.0: Queue 0 is inactive on fifo 0 and stuck for 2500 ms. SW [224, 225] HW [0, 0] FH TRB=0x02b759ca1 > > > > Feel free to add > > Tested-by: Michal Hocko <mhocko@xxxxxxxx> > > > > Thanks for your quick patch and sorry it took so long from my side. > > > > -- > > Michal Hocko > > SUSE Labs