Re: [Bug 12945] New: SCSI Generic (sg): BUG: sleeping function called from invalid context

Jens Axboe <jens.axboe@xxxxxxxxxx> · Thu, 26 Mar 2009 19:43:02 +0100

On Thu, Mar 26 2009, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Thu, 26 Mar 2009 12:27:53 GMT bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=12945
> > 
> >            Summary: SCSI Generic (sg): BUG: sleeping function called from
> >                     invalid context
> >            Product: SCSI Drivers
> >            Version: 2.5
> >     Kernel Version: 2.6.28.9
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Other
> >         AssignedTo: scsi_drivers-other@xxxxxxxxxxxxxxxxxxxx
> >         ReportedBy: txtoxtox285@xxxxxxxxxxxxxx
> >         Regression: No
> > 
> > 
> > Created an attachment (id=20685)
> >  --> (http://bugzilla.kernel.org/attachment.cgi?id=20685)
> > Stack trace on program kill (2.6.28.9)
> > 
> > I am experimenting with CD audio extraction. I use the SCSI Generic driver for
> > this.
> > 
> > My test program uses read() and write() (instead of ioctl) to send requests to
> > the driver and receive responses. I use SG_FLAG_DIRECT_IO.
> > 
> > When I kill my program (because I don't want to wait until it has ripped the
> > entire CD), I am often rewarded with messages like "BUG: sleeping function
> > called from invalid context at linux-2.6.28.9/include/linux/pagemap.h:347". I
> > have attached a typical stack trace.
> > 
> > Another case when I hit this BUG is when I set a time out and the CD drive
> > doesn't respond fast enough. A stack trace is attached.
> 
> > [34215.786870] BUG: sleeping function called from invalid context at /mnt/var-pub/src/linux-2.6.28.9/include/linux/pagemap.h:347
> > [34215.786880] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper
> > [34215.786886] Pid: 0, comm: swapper Not tainted 2.6.28.9 #1
> > [34215.786890] Call Trace:
> > [34215.786894]  <IRQ>  [<ffffffff8026c4cc>] set_page_dirty_lock+0x1a/0x45
> > [34215.786911]  [<ffffffff802ae17d>] bio_unmap_user+0x1e/0x4a
> > [34215.786920]  [<ffffffff802e876b>] __blk_rq_unmap_user+0x14/0x20
> > [34215.786928]  [<ffffffff80210852>] pit_next_event+0x2e/0x49
> > [34215.786934]  [<ffffffff802e8795>] blk_rq_unmap_user+0x1e/0x4b
> > [34215.786965]  [<ffffffffa0163475>] sg_finish_rem_req+0x6d/0x88 [sg]
> > [34215.786979]  [<ffffffffa0164ef3>] sg_rq_end_io+0x131/0x205 [sg]
> > [34215.786986]  [<ffffffff802e5c1f>] end_that_request_last+0x58/0x194
> > [34215.786992]  [<ffffffff802e5e00>] blk_end_io+0x48/0x7d
> > [34215.787019]  [<ffffffffa0026bef>] scsi_next_command+0x219/0x283 [scsi_mod]
> > [34215.787039]  [<ffffffffa00279b1>] scsi_io_completion+0x181/0x53b [scsi_mod]
> > [34215.787047]  [<ffffffff802e9737>] blk_done_softirq+0x5f/0x6d
> > [34215.787054]  [<ffffffff80230787>] __do_softirq+0x5e/0xf8
> > [34215.787061]  [<ffffffff8020ca8c>] call_softirq+0x1c/0x28
> > [34215.787067]  [<ffffffff8020d6bc>] do_softirq+0x2c/0x68
> > [34215.787073]  [<ffffffff80230696>] irq_exit+0x36/0x82
> > [34215.787079]  [<ffffffff8020d79e>] do_IRQ+0xa6/0xb8
> > [34215.787085]  [<ffffffff8020c256>] ret_from_intr+0x0/0xa
> > [34215.787088]  <EOI>  [<ffffffff8034f648>] menu_reflect+0x0/0x6d
> > [34215.787112]  [<ffffffffa0147d51>] acpi_idle_enter_simple+0x170/0x1d6 [processor]
> > [34215.787127]  [<ffffffffa0147d47>] acpi_idle_enter_simple+0x166/0x1d6 [processor]
> > [34215.787134]  [<ffffffff8034eb32>] cpuidle_idle_call+0x73/0xb1
> > [34215.787140]  [<ffffffff8020ac2a>] cpu_idle+0x3c/0x73
> 
> Argh.  sg_finish_rem_req() is called from interrupt context.  But
> blk_rq_unmap_user() can run
> __bio_unmap_user()->set_page_dirty_lock()->lock_page(), which can call
> schedule().  If it does call schedule(), the machine will crash.
> 
> afaict, blk_rq_unmap_user() has always been a can-sleep function, and
> this is a regression caused by
> 
> commit 6e5a30cba5e7c03b2cd564e968f1dd667a0f7c42

Yep, it is. The problem is the usage of:

        blk_execute_rq_nowait(sdp->device->request_queue, sdp->disk,
                              srp->rq, 1, sg_rq_end_io);

and then doing the sg_finish_rem_req() -> blk_rq_unmap_user() from the
end_io path, whereas other users do a sync request and then unmap from
the same context. Hmm. Perhaps we can add some request flag to specify
doing the completion from user context; then other users could be
converted to the _nowait() approach as well, which would give us some
unification/cleanup there.
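
To make the idea concrete, here is a rough user-space sketch of the
difference. It is not kernel code and not the eventual patch: every
name in it (fake_req, fake_unmap_user, end_io_direct, end_io_deferred,
complete_from_irq, drain_deferred) is made up for illustration, and the
"IRQ context" is just a flag. It only shows the shape of the fix: the
end_io callback queues the request, and the may-sleep unmap runs later
from process context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; not a kernel structure. */
struct fake_req {
	bool mapped;              /* user pages still mapped */
	struct fake_req *next;    /* link in the deferred list */
};

bool in_irq;                      /* simulates in_atomic() */
int bad_sleeps;                   /* may-sleep calls made in "IRQ" */
struct fake_req *deferred;        /* awaiting process-context completion */

/* Stands in for blk_rq_unmap_user(): may sleep. */
void fake_unmap_user(struct fake_req *rq)
{
	if (in_irq)
		bad_sleeps++;     /* "sleeping function called from invalid context" */
	rq->mapped = false;
}

/* Problem case: unmap directly from the end_io callback. */
void end_io_direct(struct fake_req *rq)
{
	fake_unmap_user(rq);
}

/* Sketched fix: the end_io callback only queues the request. */
void end_io_deferred(struct fake_req *rq)
{
	rq->next = deferred;
	deferred = rq;
}

/* Simulates the softirq completion path invoking an end_io callback. */
void complete_from_irq(struct fake_req *rq, void (*end_io)(struct fake_req *))
{
	in_irq = true;
	end_io(rq);
	in_irq = false;
}

/* Runs later in process context, where sleeping is allowed.
 * Returns the number of requests completed. */
int drain_deferred(void)
{
	int n = 0;

	assert(!in_irq);          /* must never be called from "IRQ" */
	while (deferred) {
		struct fake_req *rq = deferred;

		deferred = rq->next;
		fake_unmap_user(rq);
		n++;
	}
	return n;
}
```

Completing a request via complete_from_irq() with end_io_direct bumps
bad_sleeps, which is the case in the trace above. With end_io_deferred
the request stays mapped until drain_deferred() runs, and bad_sleeps is
left alone.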

I'll cook up a patch.

-- 
Jens Axboe
