On Wed, Jan 11 2012 at 3:36am -0500, Jun'ichi Nomura <j-nomura@xxxxxxxxxxxxx> wrote: > Hi Hannes, > > On 01/10/12 20:18, Hannes Reinecke wrote: > > I'm trying to hunt down a mysterious crash: > > > > Unable to handle kernel pointer dereference at virtual kernel > > address 000003c001762000 > > Oops: 0011 [#1] SMP > > Modules linked in: dm_round_robin sg sd_mod crc_t10dif ipv6 loop > > dm_multipath scsi_dh dm_mod qeth_l3 ipv6_lib zfcp scsi_transport_fc > > scsi_tgt scsi_mod dasd_eckd_mod dasd_mod qeth qdio ccwgroup ext3 > > mbcache jbd > > Supported: Yes > > CPU: 0 Not tainted 3.0.13-0.5-default #1 > > Process kworker/0:1 (pid: 51750, task: 0000000008450138, ksp: > > 0000000007323620) > > Krnl PSW : 0704200180000000 000003c001953750 (dm_done+0x44/0x188 > > [dm_mod]) > > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3 > > Krnl GPRS: 000000002adb5e80 000003c001762100 0000000037de9e80 > > 0000000000000000 > > 0000000000000400 000003c00195e178 0000000000000380 > > 0000000000000000 > > 0000000000000100 0000000000000001 0000000037de9e80 > > 0000000037de9e68 > > 000003c00194f000 000003c00195e168 000000007ef97d30 > > 000000007ef97cd8 > > Krnl Code: 000003c001953740: e3b021580004 lg %r11,344(%r2) > > 000003c001953746: e310b0080004 lg %r1,8(%r11) > > 000003c00195374c: bf4fb180 icm %r4,15,384(%r11) > > >000003c001953750: e32010080004 lg %r2,8(%r1) > > 000003c001953756: e38020500004 lg %r8,80(%r2) > > 000003c00195375c: a7740062 brc 7,3c001953820 > > 000003c001953760: b9020099 ltgr %r9,%r9 > > 000003c001953764: a7840049 brc 8,3c0019537f6 > > > > r11 is struct dm_rq_target_io *tio = clone->end_io_data; > > r1 is tio->ti (ie struct dm_target), which is invalid. > > r2 is tio->ti->type, likewise. > > > > Apparently the table got replaced between map_io() and dm_done(), > > causing this invalid pointer. > > While we do hold a reference on the mapped_device in map_request(), > > we only take a _single_ reference to the table in dm_request_fn(), > > which is dropped again at the end. > > > And as the table holds the pointer to the targets, they'll be > > invalidated upon table swapping, causing dm_done() accessing an > > invalid pointer. > > The last paragraph is not correct. > If any requests are in-flight, dm does not swap table. > > I.e., in dm_suspend(), for request-based dm, we do: > 1) stop request_queue processing > <from this point, no new request becomes in-flight> > 2) wait for completion of in-flight I/Os > <from this point, no requests are in-flight> > and only after that, we can swap tables. > > Existence of in-flight I/O is checked by "pending" counter of md. > > The counter is increased in dm_request_fn() > and decreased in rq_completed(), which is called either > when the original request is requeued or completed. > I.e. after dm_done() processing. > > > > I can't really believe that is the case, so please do correct me if > > the above analysis is wrong. > > Just guessing... > Such a mysterious crash could occur if there are bugs like: > - somebody start the queue while dm stopped it in dm_suspend() > - somebody submit/complete/requeue request with > wrong function and corrupt pending counter > - lower-level driver completes a request twice > > If you can recreate the crash, try attached debug patch. > It should raise warnings when cases like above happen > and might help hunting down the problem. There was also that earlier thread about a crash in dm_done(), where dereferencing clone->end_io_data caused the crash: https://lkml.org/lkml/2011/10/31/97 Heiko said that Hannes' patch: http://www.redhat.com/archives/dm-devel/2011-November/msg00176.html ... actually helped: https://lkml.org/lkml/2011/11/29/168 But we never did get to the bottom of _why_, and Jun'ichi pointed out a NULL pointer check is missing: http://www.redhat.com/archives/dm-devel/2011-December/msg00022.html Anyway, Hannes, if you can reproduce it'd also be interesting to see if your patch (updated with NULL pointer check?) makes any difference. Mike -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel