On 2021/12/10 19:35, John Garry wrote: > On 09/12/2021 23:55, Damien Le Moal wrote: >>> Earlier today it was the mount command which was hanging. From debugging >>> that, I found that the very first SSP command when mounting is sent the >>> HW successfully but no completion interrupt ever occurs there - I really >>> don't know why. Other SSP commands complete successfully before this and >>> after (TMFs in the error handling), including ones which have sgls. >>> >>> sda is a SAS drive, but I think SATA has the same issue - I was just >>> looking at sda. >>> >>> One thing I noticed in the driver is that it uses mb() in between >>> writing to the DMA memory and initiating the HW - I don't think mb is >>> strong enough. However I don't think that is my issue - it wouldn't fail >>> reliably if it was. >> Weird. > > Yeah, quite strange. > > I will also note that these earlier logs are also red flags, which I > have not investigated: > > [87.288239] sas: target proto 0x0 at 500e004aaaaaaa1f:0x10 not handled > [87.294793] sas: ex 500e004aaaaaaa1f phy16 failed to discover > >> I do not have an arm host to test. Could it be that the card FW is >> crashing ? > > But the later TMFs seem to succeed, so I doubt it's crashing. I did > wonder if it's going into some low-power/idle mode and just not > responding, but not sure on that. > >> Can you recover from the above ? > > It never really recovers and is always caught up in some error handling. > >> Or do you have to power cycle for >> the HDD to be accessible again ? > > Power cycle is necessary to recover as we can't remove the driver when > it is in error handling > >> >> Other possibility may be an IRQ controller issue with the platform ? >> > > Highly unlikely. I did wonder if the interrupts are properly allocated > and requested, and they look ok from /proc/interrupts > > I also tried limiting the CPUs we bring up to a single CPU and so that > we only use a single MSIx and a single HW queue, and now get this crash: > > [7.775168] loop: module loaded > [7.783226] pm80xx 0000:04:00.0: Adding to iommu group 0 > [7.795787] pm80xx 0000:04:00.0: pm80xx: driver version 0.1.40 > [7.806789] pm80xx 0000:04:00.0: enabling device (0140 -> 0142) > [7.818910] :: pm8001_pci_alloc 530:Setting link rate to default value > [8.866618] scsi host0: pm80xx > [8.879056] pm80xx0:: process_oq 4169:Firmware Fatal error! Regval:0xc0f I have an old-ish Adaptec card that throws something similar at me if I connect a host-managed SMR drive to the HBA. After that message, nothing works at all so in my case I suspect that the FW gets into a really bad state/crashes. > [8.885842] pm80xx0:: print_scratchpad_registers > 4130:MSGU_SCRATCH_PAD_0: 0x40002000 > [8.893661] pm80xx0:: print_scratchpad_registers > 4132:MSGU_SCRATCH_PAD_1:0xc0f > [8.900958] pm80xx0:: print_scratchpad_registers > 4134:MSGU_SCRATCH_PAD_2: 0x0 > [8.908169] pm80xx0:: print_scratchpad_registers > 4136:MSGU_SCRATCH_PAD_3: 0x30000000 > [8.915986] pm80xx0:: print_scratchpad_registers > 4138:MSGU_HOST_SCRATCH_PAD_0: 0x0 > [8.923630] pm80xx0:: print_scratchpad_registers > 4140:MSGU_HOST_SCRATCH_PAD_1: 0x0 > [8.931274] pm80xx0:: print_scratchpad_registers > 4142:MSGU_HOST_SCRATCH_PAD_2: 0x0 > [8.938917] pm80xx0:: print_scratchpad_registers > 4144:MSGU_HOST_SCRATCH_PAD_3: 0x0 > [8.946561] pm80xx0:: print_scratchpad_registers > 4146:MSGU_HOST_SCRATCH_PAD_4: 0x0 > [8.954205] pm80xx0:: print_scratchpad_registers > 4148:MSGU_HOST_SCRATCH_PAD_5: 0x0 > [8.961849] pm80xx0:: print_scratchpad_registers > 4150:MSGU_RSVD_SCRATCH_PAD_0: 0x0 > [8.969493] pm80xx0:: print_scratchpad_registers > 4152:MSGU_RSVD_SCRATCH_PAD_1: 0x0 > [8.977143] Unable to handle kernel NULL pointer dereference at virtual > address 0000000000000018 Is this with or without your phys_to_virt() dma/iommu fix patch ? I do remember seeing lots of crashes/hangs with that old-ish Adaptec HBA on x86 host. I can try again to see if the errors are similar. I definitely hit the iommu problem with that card as I had to boot with iommu=off to, well, be able to boot :) Next time I go to the lab, I will plug this card again to test with your iommu patch. > [8.994782] Mem abort info: > [8.997565] ESR = 0x96000004 > [9.006782] EC = 0x25: DABT (current EL), IL = 32 bits > [9.018781] SET = 0, FnV = 0 > [9.021824] EA = 0, S1PTW = 0 > [9.030797] FSC = 0x04: level 0 translation fault > [9.038794] Data abort info: > [9.041662] ISV = 0, ISS = 0x00000004 > [9.050781] CM = 0, WnR = 0 > [9.053737] [0000000000000018] user address but active_mm is swapper > [9.070782] Internal error: Oops: 96000004 [#1] PREEMPT SMP > [9.076343] Modules linked in: > [9.079387] CPU: 0 PID: 20 Comm: kworker/0:2 Not tainted > 5.16.0-rc4-00002-ge23d68774178-dirty #328 > [9.088333] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - > V1.16.01 03/15/2019 > [9.096844] Workqueue: pm80xx pm8001_work_fn > [9.101108] pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [9.108057] pc : pm8001_work_fn+0x298/0x690 > [9.112229] lr : process_one_work+0x1d0/0x354 > [9.116574] sp : ffff800012d23d50 > [9.119876] x29: ffff800012d23d50 x28: 0000000000000000 x27: 0000000000000000 > [9.127000] x26: ffff8000117e8bc0 x25: ffff8000113aaeb0 x24: ffff00209d4b0000 > [9.134124] x23: ffff0020ad23b280 x22: ffff00209d4b8000 x21: 0000000000000001 > [9.141249] x20: 0000000000000000 x19: 0000000000000038 x18: 0000000000000000 > [9.148373] x17: 4441505f48435441 x16: 5243535f44565352 x15: 000000052ff6b548 > [9.155496] x14: 0000000000000018 x13: 0000000000000018 x12: 0000000000000000 > [9.162620] x11: 0000000000000014 x10: 00000000000009a0 x9 : ffff002086ef6074 > [9.169743] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff002086ef6074 > [9.176867] x5 : 0000787830386d70 x4 : ffff00209d5e0000 x3 : 0000000000000000 > [9.183990] x2 : ffff00209d5e0038 x1 : ffff800010a20120 x0 : 0000000000000051 > [9.191114] Call trace: > [9.193547] pm8001_work_fn+0x298/0x690 > [9.197372] process_one_work+0x1d0/0x354 > [9.201369] worker_thread+0x13c/0x470 > [9.205105] kthread+0x17c/0x190 > [9.208321] ret_from_fork+0x10/0x20 > [9.211886] Code: 17fffff1 310006bf 54fffde0 f9400c54 (f9400e80) > [9.217968] ---[ end trace de649a9be2843866 ]--- > [9.339812] pm80xx0:: process_oq 4169:Firmware Fatal error! Regval:0xc0f > [9.346602] pm80xx0:: print_scratchpad_registers > 4130:MSGU_SCRATCH_PAD_0: 0x40002000 > [9.354420] pm80xx0:: print_scratchpad_registers > 4132:MSGU_SCRATCH_PAD_1:0xc0f > [9.361717] pm80xx0:: print_scratchpad_registers > 4134:MSGU_SCRATCH_PAD_2: 0x0 > [9.368927] pm80xx0:: print_scratchpad_registers > 4136:MSGU_SCRATCH_PAD_3: 0x30000000 > [9.376744] pm80xx0:: print_scratchpad_registers > 4138:MSGU_HOST_SCRATCH_PAD_0: 0x0 > [9.384388] pm80xx0:: print_scratchpad_registers > 4140:MSGU_HOST_SCRATCH_PAD_1: 0x0 > [9.392032] pm80xx0:: print_scratchpad_registers > 4142:MSGU_HOST_SCRATCH_PAD_2: 0x0 > [9.399676] pm80xx0:: print_scratchpad_registers > 4144:MSGU_HOST_SCRATCH_PAD_3: 0x0 > [9.407319] pm80xx0:: print_scratchpad_registers > 4146:MSGU_HOST_SCRATCH_PAD_4: 0x0 > [9.414963] pm80xx0:: print_scratchpad_registers > 4148:MSGU_HOST_SCRATCH_PAD_5: 0x0 > [9.422607] pm80xx0:: print_scratchpad_registers > 4150:MSGU_RSVD_SCRATCH_PAD_0: 0x0 > [9.430251] pm80xx0:: print_scratchpad_registers > 4152:MSGU_RSVD_SCRATCH_PAD_1: 0x0 > [ 10.028906] Freeing initrd memory: 413456K > > ... > > Thanks, > John -- Damien Le Moal Western Digital Research