Re: Bypass block layer and Fill SCSI lower layer driver queue

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Wed, 18 Sep 2013 14:00:44 -0700

On Wed, 2013-09-18 at 01:41 -0500, Alireza Haghdoost wrote:
> Hi
> 
> I am working on a high-throughput, low-latency application that
> cannot tolerate block layer overhead, so I want to send IO requests
> directly to the Fibre Channel lower-layer SCSI driver. I used to work
> with libaio, but I am now looking for a way to bypass the block layer
> and send SCSI commands from the application layer directly to the SCSI
> driver using the /dev/sgX device and the ioctl() system call.
>
> I have noticed that sending IO requests through the sg device, even
> with the nonblocking and direct IO flags, is quite slow and does not
> fill up the lower-layer SCSI driver's TCQ queue, i.e. the IO depth
> (/sys/block/sdX/in_flight) is always ZERO. As a result, the application
> throughput is even lower than when sending IO requests through the
> block layer with libaio and the io_submit() system call. In both cases
> I used only one IO context (or fd) and a single thread.
> 
> I have noticed that some well-known benchmarking tools like fio do
> not support IO depth for sg devices either. Therefore, I was
> wondering whether it is feasible to bypass the block layer and achieve
> higher throughput and lower latency (for sending IO requests only).
> 
> 
> Any comment on my issue is highly appreciated.
> 
> 

FYI, you've got things backward as to where the real overhead is being
introduced.

The block layer / aio overhead is minimal compared to the overhead
introduced by the existing scsi_request_fn() logic and the extreme lock
contention between request_queue->queue_lock and scsi_host->host_lock,
each of which is acquired and released multiple times per struct
scsi_cmnd dispatch.

This lock contention, along with memory allocations, currently limits
per struct scsi_device performance for small block random IO to ~250K
IOPS, vs. ~1M IOPS with raw block drivers that provide their own
make_request() function.
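
To put those two figures in per-command terms, here is a rough
back-of-the-envelope sketch in Python. It assumes the rates can be read
as an average time budget per dispatched command, which ignores
parallelism across CPUs and devices, so treat the result as an
illustration of scale rather than a measurement. The helper name
per_command_budget_us is made up for this example.

```python
# Average per-command time budget implied by the IOPS figures above.
# The two IOPS values come from the message; the rest is plain arithmetic
# and assumes the rate can be read as one command after another.

def per_command_budget_us(iops: float) -> float:
    """Average time per command, in microseconds, at a given IOPS rate."""
    return 1_000_000 / iops

scsi_iops = 250_000          # ~250K IOPS per struct scsi_device
raw_block_iops = 1_000_000   # ~1M IOPS with a raw block driver

scsi_budget = per_command_budget_us(scsi_iops)
raw_budget = per_command_budget_us(raw_block_iops)
extra_per_command = scsi_budget - raw_budget

print(f"SCSI path:      {scsi_budget:.1f} us per command")
print(f"raw block path: {raw_budget:.1f} us per command")
print(f"difference:     {extra_per_command:.1f} us per command")
```

Under that simplification, the SCSI path spends about 4 us per command
against about 1 us for a raw block driver, so roughly 3 us per command
would be attributable to the dispatch path rather than to the block
layer in front of it.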

FYI, there is an early alpha scsi-mq prototype that bypasses the
scsi_request_fn() junk altogether, and that is able to reach small block
IOPS and latency comparable to raw block drivers.

Only a handful of LLDs have been converted to run with full scsi-mq
pre-allocation thus far, and the code is considered early, early
alpha.

It's the only real option for SCSI to get anywhere near raw block driver
performance and latency, but it is still quite a way from being merged
into mainline.

--nab
