On 2017-05-17 17:01:53 [+0200], To Chad Dupuis wrote:
> On 2017-05-12 11:55:52 [-0400], Chad Dupuis wrote:
> > Ok, I believe I've found the issue here. The machine that the test has
> > performed on had many more possible CPUs than active CPUs. We calculate
> > which CPU to the work time on in bnx2fc_process_new_cqes() like this:
> >
> > unsigned int cpu = wqe % num_possible_cpus();
> >
> > Since not all CPUs are active, we were trying to schedule work on
> > non-active CPUs which meant that the upper layers were never notified of
> > the completion. With this change:
> >
> > diff --git a/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > b/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > index c2288d6..6f08e43 100644
> > --- a/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > +++ b/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > @@ -1042,7 +1042,12 @@ static int bnx2fc_process_new_cqes(struct
> > bnx2fc_rport *tgt)
> >                         /* Pending work request completion */
> >                         struct bnx2fc_work *work = NULL;
> >                         struct bnx2fc_percpu_s *fps = NULL;
> > -                       unsigned int cpu = wqe % num_possible_cpus();
> > +                       unsigned int cpu = wqe % num_active_cpus();
> > +
> > +                       /* Sanity check cpu to make sure it's online */
> > +                       if (!cpu_active(cpu))
> > +                               /* Default to CPU 0 */
> > +                               cpu = 0;
> >
> >                         work = bnx2fc_alloc_work(tgt, wqe);
> >                         if (work) {
> >
> > The issue is fixed.
> >
> > Sebastian, can you add this change to your patch set?
>
> Are you sure that you can reliably reproduce the issue and fix it with the
> patch above? Because this patch:

Oh. Okay. Now it clicked. It can fix the issue, but it is still possible
that CPU0 goes down between your check for it and schedule_work_on()
returning. Let me think of something…

Sebastian
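
To make the CPU selection discussed above easier to follow, here is a small
userspace model of it. This is only a sketch, not kernel code: the bitmask,
NR_POSSIBLE and all function names are made up for illustration, and a set
bit simply stands for an "active" CPU. It compares the original modulo over
all possible CPUs, the patched modulo with the fallback to CPU 0, and one
possible alternative that picks the n-th active CPU. None of these variants
deals with a CPU going down after it has been picked.

```c
#include <assert.h>

/* Hypothetical model: bit i set in the mask means CPU i is "active". */
#define NR_POSSIBLE 8u

static unsigned int count_active(unsigned int mask)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < NR_POSSIBLE; i++)
		if (mask & (1u << i))
			n++;
	return n;
}

static int is_active(unsigned int mask, unsigned int cpu)
{
	return cpu < NR_POSSIBLE && (mask & (1u << cpu)) != 0;
}

/* Original mapping: modulo over all possible CPUs, active or not. */
static unsigned int pick_possible(unsigned int wqe)
{
	return wqe % NR_POSSIBLE;
}

/* Patched mapping: modulo over the active count, fall back to CPU 0. */
static unsigned int pick_patched(unsigned int mask, unsigned int wqe)
{
	unsigned int n = count_active(mask);
	unsigned int cpu = n ? wqe % n : 0;

	if (!is_active(mask, cpu))
		cpu = 0;
	return cpu;
}

/* Alternative sketch: walk the mask and return the n-th active CPU. */
static unsigned int pick_nth_active(unsigned int mask, unsigned int wqe)
{
	unsigned int n = count_active(mask);
	unsigned int nth;

	assert(n > 0);	/* the model assumes at least one active CPU */
	nth = wqe % n;
	for (unsigned int i = 0; i < NR_POSSIBLE; i++) {
		if (!is_active(mask, i))
			continue;
		if (nth == 0)
			return i;
		nth--;
	}
	return 0;	/* not reached while n > 0 */
}
```

With the mask 0x0b (CPUs 0, 1 and 3 active), pick_possible(2) returns the
inactive CPU 2. pick_patched() maps wqe 2 to CPU 0 because 2 % 3 lands on
the inactive CPU 2, so in this model CPU 3 never receives work.
pick_nth_active() maps wqe 2 to CPU 3 and spreads work over all three
active CPUs.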