I'm trying hard to crash the sg v4 driver that I'm working on, and I have seen failures, but my test hardware is not really server quality and I haven't got a copy of the crashes. As Murphy would have it, after I connected another machine to monitor the test machine's serial port there have been no crashes ... just partial lockups, about once every 24 hours. Those lockups are caused by this line:

        /*
         * NOTE
         *
         * With scsi-mq enabled, there are a fixed number of preallocated
         * requests equal in number to shost->can_queue. If all of the
         * preallocated requests are already in use, then blk_get_request()
         * will sleep until an active command completes, freeing up a request.
         * Although waiting in an asynchronous interface is less than ideal, we
         * do not want to use BLK_MQ_REQ_NOWAIT here because userspace might
         * not expect an EWOULDBLOCK from this condition.
         */
        rq = blk_get_request(q, (r0w ? REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN), 0);

I include the NOTE because I didn't write it (and I think it's wrong, in the sense that the call should return EWOULDBLOCK promptly rather than _never_ return when the sg driver is being used in async/non-blocking mode).

So execution goes into that call (from about 16 threads) and fails to return from any of them. The machine is in that state at the moment. The test is using the sg v4 driver (but that piece of code is the same in the production v3 driver) against mainly the scsi_debug driver. To make it a bit more realistic, a Seagate SAS SSD is added to the mix (only READing), which brings mpt3sas into the picture. The kernel is vanilla 5.0.8.

Any thoughts about what this may be or how to debug it?

[The sg driver holds no locks when it calls blk_get_request(), so 'cat /proc/scsi/sg/debug' still works, showing me where the state machines are "parked". The rootfs is on a SATA disk. I can log onto that machine, but in previous cases shutdown locked up.]

Doug Gilbert
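
P.S. For what it's worth, here is a sketch (untested) of the behaviour I think that call site should have when the fd is in non-blocking/async mode. 'non_block' below is a placeholder for whatever per-fd O_NONBLOCK test the caller already makes, not an existing variable:

        rq = blk_get_request(q, (r0w ? REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN),
                             non_block ? BLK_MQ_REQ_NOWAIT : 0);
        if (IS_ERR(rq))
                return PTR_ERR(rq);     /* -EWOULDBLOCK when no tag free */

With BLK_MQ_REQ_NOWAIT set, blk_get_request() returns ERR_PTR(-EWOULDBLOCK) when all the preallocated requests are in use, instead of sleeping, so userspace gets a prompt error rather than a parked thread.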