On 02/26/2018 11:37 PM, Bill Kuzeja wrote: > Because of the shifting around of code in qla2x00_probe_one recently, > failures during adapter initialization can lead to problems, i.e. NULL > pointer crashes and doubly freed data structures which cause eventual > panics. > > To address these problems, I tested all early probe exit points (and > debugged the resulting panics). The end result is a simple looking patch > set of changes, with minimum reshuffling of code. In my testing with this > patch, I have not encountered any NULL pointer crashes or double frees. > > Fixes: d64d6c5671db ("scsi: qla2xxx: Fix NULL pointer crash due to probe failure") > Signed-off-by: Bill Kuzeja <william.kuzeja@xxxxxxxxxxx> > > --- > > Some of these issues are due to: > > commit d64d6c5671db scsi: qla2xxx: Fix NULL pointer crash due to probe failure > > where some frees were moved around, as well as the error exit from > a qla2x00_request_irqs failure. > > This was a fix for: > > commit d74595278f4a scsi: qla2xxx: Add multiple queue pair functionality. > > which caused problems of its own. > > To reproduce these issues, I run a test where I break the card early in init, > (and also through kprobe fault injection). This way, I've been able to hit > four different types of crashes. > > Here's the leadup to the first issue: > > [ 1044.133225] mpt3sas_cm4: diag reset: FAILED > [ 1072.694857] qla2xxx [0000:b7:00.0]-d04c:12: MBX Command timeout for cmd 2, > _status 0xffffffff hccr 0xffffffff > [ 1072.734020] qla2xxx [0000:b7:00.0]-00cf:12: Setup chip ****FAILED****. > [ 1072.748506] qla2xxx [0000:b7:00.0]-00d6:12: Failed to initialize adapter - Adapter flags 42. > > and we crash.... > > [ 1072.787691] kernel BUG at mm/slub.c:3601! > > Here's the stack trace from the dump: > > PID: 20037 TASK: ffffa1447ab24f10 CPU: 0 COMMAND: "kworker/0:14" > #0 [ffffa144c6ba7850] die at ffffffff8aa2e92b > #1 [ffffa144c6ba7880] do_trap at ffffffff8b10e1f0 > #2 [ffffa144c6ba78d0] do_invalid_op at ffffffff8aa2b284 > #3 [ffffa144c6ba7980] invalid_op at ffffffff8b11b4ee > [exception RIP: kfree+313] > RIP: ffffffff8abf3f79 RSP: ffffa144c6ba7a30 RFLAGS: 00010246 > RAX: 001fffff00000000 RBX: ffffa13cdbe24000 RCX: 000000010040003c > RDX: 001fffff00000000 RSI: fffffd615cf8a6c0 RDI: ffffa13cdbe24000 > RBP: ffffa144c6ba7a48 R8: ffffa143be29b1c0 R9: 000000010040003c > R10: 00000000be29bc01 R11: fffffd61416f8900 R12: ffffa14c763bf000 > R13: ffffffffc08df36d R14: ffffa13cdbe20740 R15: ffffa13cdbe20000 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 > #4 [ffffa144c6ba7a50] qla2x00_probe_one at ffffffffc08df36d [qla2xxx] > #5 [ffffa144c6ba7dd0] local_pci_probe at ffffffff8ad92e45 > #6 [ffffa144c6ba7e08] work_for_cpu_fn at ffffffff8aaae4c4 > #7 [ffffa144c6ba7e20] process_one_work at ffffffff8aab170a > #8 [ffffa144c6ba7e68] worker_thread at ffffffff8aab2528 > #9 [ffffa144c6ba7ec8] kthread at ffffffff8aab96af > > Here's we're trying to re-free ha->nvram in qla2x00_mem_free. When the > failure in ha->isp_ops->initialize_adapter occurs, we jump to the > probe_failed label. Then we end up calling qla2x00_mem_free twice because > qla2x00_free_device calls qla2x00_mem_free as well. This is problem #1. > > Problem #2: If we fail even earlier - in qla2x00_mem_alloc - we jump to > probe_hw_failed which now tries to free memory that wasn't allocated > (note before commit d64d6c5671db, we only called qla2x00_clear_drv_active - > this was correct). My server crashed on qla2x00_free_req_que. > > Next, notice that the only exit point that calls probe_init_failed is > when qla2x00_alloc_queues fails. But looking inside qla2x00_alloc_queues, > we only do the following if this routine is successful: > > ha->rsp_q_map[0] = rsp; > ha->req_q_map[0] = req; > set_bit(0, ha->rsp_qid_map); > set_bit(0, ha->req_qid_map); > > Otherwise, we free up ha->base_qpair, ha->rsp_q_map, ha->req_q_map > and exit with an error. Then we jump to probe_init_failed, which does > two things that are bad. > > First, there's (problem #3): > > ha->req_q_map[0] = NULL; > ha->rsp_q_map[0] = NULL; > > In the failure case of qla2x00_alloc_queues, ha->req_q_map and > ha->rsp_q_map are set to NULL. Can't dereference a NULL pointer and > that is what we crash on here. > > If the system didn't panic on this, it would have on problem #4. > There are extra calls to qla2x00_free_req_que and qla2x00_free_rsp_que > in probe_init_failed path because these calls are now duplicated in > probe_hw_failed because of commit d64d6c5671db. > > Now for the changes... > > Problem #1 is easily solved. When we enter probe_failed, have a new > variable that indicates this is the case so probe_hw_failed can check > it and not call the same free routines again (qla2x00_free_device calls > qla2x00_mem_free, qla2x00_free_req_que, and qla2x00_free_rsp_que). > > For problem 2, an extra label (probe_hw_failed_no_mem) will be created > at the end of probe_hw_failed, so we don't try to free unallocated memory. > > Problems 3 and 4 are also easy to remedy: if qla2x00_alloc_queues fails, > we should call probe_failed instead. And since nobody jumps to > probe_init_failed anymore, just eliminate that label. The calls to > qla2x00_free_req_que are done below in probe_hw_failed anyways, to which > we fall through. > > I've tested out these changes thoroughly with a test that fails > initialization at various times. I've also used kprobes to inject errors > to force us into various error paths. The exit points are very solid now, > no NULL pointer crashes, no double frees. > --- > drivers/scsi/qla2xxx/qla_os.c | 26 ++++++++++++-------------- > 1 file changed, 12 insertions(+), 14 deletions(-) > > diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c > index afcb5567..3faa7a3 100644 > --- a/drivers/scsi/qla2xxx/qla_os.c > +++ b/drivers/scsi/qla2xxx/qla_os.c > @@ -2739,6 +2739,7 @@ static void qla2x00_iocb_work_fn(struct work_struct *work) > uint16_t req_length = 0, rsp_length = 0; > struct req_que *req = NULL; > struct rsp_que *rsp = NULL; > + int probe_failed = 0; > int i; > > bars = pci_select_bars(pdev, IORESOURCE_MEM | IORESOURCE_IO); > @@ -3024,7 +3025,7 @@ static void qla2x00_iocb_work_fn(struct work_struct *work) > ql_log_pci(ql_log_fatal, pdev, 0x0031, > "Failed to allocate memory for adapter, aborting.\n"); > > - goto probe_hw_failed; > + goto probe_hw_failed_no_mem; > } > > req->max_q_depth = MAX_Q_DEPTH; > @@ -3102,7 +3103,7 @@ static void qla2x00_iocb_work_fn(struct work_struct *work) > ql_log(ql_log_fatal, base_vha, 0x003d, > "Failed to allocate memory for queue pointers..." > "aborting.\n"); > - goto probe_init_failed; > + goto probe_failed; > } > > if (ha->mqenable && shost_use_blk_mq(host)) { > @@ -3387,16 +3388,8 @@ static void qla2x00_iocb_work_fn(struct work_struct *work) > > return 0; > > -probe_init_failed: > - qla2x00_free_req_que(ha, req); > - ha->req_q_map[0] = NULL; > - clear_bit(0, ha->req_qid_map); > - qla2x00_free_rsp_que(ha, rsp); > - ha->rsp_q_map[0] = NULL; > - clear_bit(0, ha->rsp_qid_map); > - ha->max_req_queues = ha->max_rsp_queues = 0; > - > probe_failed: > + probe_failed = 1; > if (base_vha->timer_active) > qla2x00_stop_timer(base_vha); > base_vha->flags.online = 0; > @@ -3412,9 +3405,14 @@ static void qla2x00_iocb_work_fn(struct work_struct *work) > scsi_host_put(base_vha->host); > > probe_hw_failed: > - qla2x00_mem_free(ha); > - qla2x00_free_req_que(ha, req); > - qla2x00_free_rsp_que(ha, rsp); > + /* These already done in qla2x00_free_device, don't double free */ > + if (!probe_failed) { > + qla2x00_mem_free(ha); > + qla2x00_free_req_que(ha, req); > + qla2x00_free_rsp_que(ha, rsp); > + } > + > +probe_hw_failed_no_mem: > qla2x00_clear_drv_active(ha); > > iospace_config_failed: > Hmm. Isn't is rather the case that the labels and gotos are mixed up? We have this order of labels: probe_init_failed probe_failed probe_hw_failed iospace_config_failed disable_device but this potential calling order: disable_device iospace_config_failed probe_hw_failed probe_hw_failed probe_init_failed probe_failed probe_failed probe_failed So it looks to me as if the 'probe_init_failed' and 'probe_failed' labels should be exchanged... But yeah, we need to fix this up. I've run into this issue myself recently :-( Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@xxxxxxx +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)