RE: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

Don Brace <don.brace@xxxxxxxxxxxxx> · Tue, 16 Jan 2018 15:22:18 +0000

> -----Original Message-----
> From: Laurence Oberman [mailto:loberman@xxxxxxxxxx]
> Sent: Tuesday, January 16, 2018 7:29 AM
> To: Thomas Gleixner <tglx@xxxxxxxxxxxxx>; Ming Lei <ming.lei@xxxxxxxxxx>
> Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>; Jens Axboe <axboe@xxxxxx>;
> linux-block@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Mike Snitzer
> <snitzer@xxxxxxxxxx>; Don Brace <don.brace@xxxxxxxxxxxxx>
> Subject: Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined
> to irq vector
> 
> > > It is because of irq_create_affinity_masks().
> >
> > That still does not answer the question. If the interrupt for a queue
> > is
> > assigned to an offline CPU, then the queue should not be used and
> > never
> > raise an interrupt. That's how managed interrupts have been designed.
> >
> > Thanks,
> >
> >       tglx
> >
> >
> >
> >
> 
> I captured a full boot log for this issue for Microsemi, I will send it
> to Don Brace.
> I enabled all the HPSA debug and here is snippet
> 
> 
> ..
> ..
> ..
>   246.751135] INFO: task systemd-udevd:413 blocked for more than 120
> seconds.
> [  246.788008]       Tainted: G          I      4.15.0-rc4.noming+ #1
> [  246.822380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  246.865594] systemd-udevd   D    0   413    411 0x80000004
> [  246.895519] Call Trace:
> [  246.909713]  ? __schedule+0x340/0xc20
> [  246.930236]  schedule+0x32/0x80
> [  246.947905]  schedule_timeout+0x23d/0x450
> [  246.970047]  ? find_held_lock+0x2d/0x90
> [  246.991774]  ? wait_for_completion_io+0x108/0x170
> [  247.018172]  io_schedule_timeout+0x19/0x40
> [  247.041208]  wait_for_completion_io+0x110/0x170
> [  247.067326]  ? wake_up_q+0x70/0x70
> [  247.086801]  hpsa_scsi_do_simple_cmd+0xc6/0x100 [hpsa]
> [  247.114315]  hpsa_scsi_do_simple_cmd_with_retry+0xb7/0x1c0 [hpsa]
> [  247.146629]  hpsa_scsi_do_inquiry+0x73/0xd0 [hpsa]
> [  247.174118]  hpsa_init_one+0x12cb/0x1a59 [hpsa]

This trace comes from internally generated discovery commands. No SCSI devices have
been presented to the SML yet.

At this point we should be running on only one CPU. These commands are meant to use
reply queue 0 which are tied to CPU 0. It's interesting that the patch helps.

However, I was wondering if you could inspect the iLo IML logs and send the
AHS logs for inspection.

Thanks,
Don Brace
ESC - Smart Storage
Microsemi Corporation

> [  247.199851]  ? __pm_runtime_resume+0x55/0x70
> [  247.224527]  local_pci_probe+0x3f/0xa0
> [  247.246034]  pci_device_probe+0x146/0x1b0
> [  247.268413]  driver_probe_device+0x2b3/0x4a0
> [  247.291868]  __driver_attach+0xda/0xe0
> [  247.313370]  ? driver_probe_device+0x4a0/0x4a0
> [  247.338399]  bus_for_each_dev+0x6a/0xb0
> [  247.359912]  bus_add_driver+0x41/0x260
> [  247.380244]  driver_register+0x5b/0xd0
> [  247.400811]  ? 0xffffffffc016b000
> [  247.418819]  hpsa_init+0x38/0x1000 [hpsa]
> [  247.440763]  ? 0xffffffffc016b000
> [  247.459451]  do_one_initcall+0x4d/0x19c
> [  247.480539]  ? do_init_module+0x22/0x220
> [  247.502575]  ? rcu_read_lock_sched_held+0x64/0x70
> [  247.529549]  ? kmem_cache_alloc_trace+0x1f7/0x260
> [  247.556204]  ? do_init_module+0x22/0x220
> [  247.578633]  do_init_module+0x5a/0x220
> [  247.600322]  load_module+0x21e8/0x2a50
> [  247.621648]  ? __symbol_put+0x60/0x60
> [  247.642796]  SYSC_finit_module+0x94/0xe0
> [  247.665336]  entry_SYSCALL_64_fastpath+0x1f/0x96
> [  247.691751] RIP: 0033:0x7fc63d6527f9
> [  247.712308] RSP: 002b:00007ffdf1659ba8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000139
> [  247.755272] RAX: ffffffffffffffda RBX: 0000556b524c5f70 RCX:
> 00007fc63d6527f9
> [  247.795779] RDX: 0000000000000000 RSI: 00007fc63df6f099 RDI:
> 0000000000000008
> [  247.836413] RBP: 00007fc63df6f099 R08: 0000000000000000 R09:
> 0000556b524be760
> [  247.876395] R10: 0000000000000008 R11: 0000000000000246 R12:
> 0000000000000000
> [  247.917597] R13: 0000556b524c5f10 R14: 0000000000020000 R15:
> 0000000000000000
> [  247.957272]
> [  247.957272] Showing all locks held in the system:
> [  247.992019] 1 lock held by khungtaskd/118:
> [  248.015019]  #0:  (tasklist_lock){.+.+}, at: [<000000004ef3538d>]
> debug_show_all_locks+0x39/0x1b0
> [  248.064600] 2 locks held by systemd-udevd/413:
> [  248.090031]  #0:  (&dev->mutex){....}, at: [<000000002a395ec8>]
> __driver_attach+0x4a/0xe0
> [  248.136620]  #1:  (&dev->mutex){....}, at: [<00000000d9def23c>]
> __driver_attach+0x58/0xe0
> [  248.183245]
> [  248.191675] =============================================
> [  248.191675]
> [  314.825134] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  315.368421] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  315.894373] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  316.418385] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  316.944461] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  317.466708] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  317.994380] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starti