Hello,

Grant Grundler wrote:
> On Tue, Dec 9, 2008 at 5:54 PM, Tejun Heo <htejun@xxxxxxxxx> wrote:
>> (cc'ing Jens)
> ...
>> Is the command issue rate really the bottleneck?
>
> Not directly. It's the lack of CPU left over at high transaction
> rates (> 10000 IOPS per disk). So yes, the system does bottleneck
> on CPU utilization.
>
>> It seems a bit unlikely unless you're issuing lots of really small
>> IOs, but then again those new SSDs are pretty fast.
>
> That's the whole point of SSDs (lots of small, random IO).

But on many workloads, filesystems manage to colocate data that
belongs together, and with a little help from readahead and the block
layer we manage to dish out decently sized requests.  It would be
great to serve 4k requests as fast as we can, but whether that should
be (or rather how much it should be) the focal point of optimization
is a slightly different problem.

> The second desirable attribute SSDs have is consistent response time
> for reads. HDs vary from microseconds to hundreds of milliseconds;
> there is a very long tail in the read latency distribution.
>
>>> (OK, I haven't measured the overhead of the *SCSI* layer, I've
>>> measured the overhead of the *libata* layer. I think the point
>>> here is that you can't measure the difference at a macro level
>>> unless you're sending a lot of commands.)
>>
>> How did you measure it?
>
> Willy presented how he measured the SCSI stack at LSF2008. ISTR he
> was advised to use oprofile in his test application, so there is
> probably an updated version of these slides:
> http://iou.parisc-linux.org/lsf2008/IO-latency-Kristen-Carlson-Accardi.pdf

Ah... okay, so it was measured against a RAM-backed low level driver.

>> The issue path isn't thick at all, although the command allocation
>> logic there is a bit brain damaged and should use block layer tag
>> management. All it does is: allocate a qc, translate the SCSI
>> command to an ATA command and write it into the qc, map DMA and
>> build the DMA table, and pass it over to the low level issue
>> function. The only extra step there is the translation part, and I
>> don't think that can take a full microsecond on modern processors.
>
> Maybe you are counting instructions and not cycles? Every cache miss
> is 200-300 cycles (say 100ns). When running multiple threads, we
> will miss on nearly every spinlock acquisition and probably on
> several data accesses. 1 microsecond isn't a lot when counting this
> way.

Yeah, ata uses its own locking, and the qc allocation does an atomic
bitop for each bit for no good reason, which can hurt at very high
IOPS with the NCQ tags filled up.

If serving 4k requests as fast as possible is the goal, I'm not
really sure the current SCSI or ATA commands are the best suited
ones.  Both SCSI and ATA are focused on rotating media with seek
latency and thus have SG on the host bus side in most cases but never
on the device side.  If getting the maximum random scattered access
throughput is a must, the best way would be adding SG r/w commands to
ATA and adapting our storage stack accordingly.

Thanks.

--
tejun
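
To put Grant's cache-miss arithmetic in concrete numbers: on a 3 GHz
core, 1 microsecond is roughly 3000 cycles.  At ~250 cycles per miss,
ten misses per command (a few contended spinlock acquisitions plus a
handful of cold data structures) already cost ~2500 cycles, i.e. over
80% of that budget, before any useful work retires.  And at 10000
IOPS per disk, every microsecond of per-command overhead burns 1% of
a CPU per disk.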
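
The per-bit allocation pattern being criticized looks roughly like
the sketch below (simplified from the era's ata_qc_new(), not
verbatim kernel source; ap->qc_allocated is the port's in-flight tag
bitmap):

static struct ata_queued_cmd *qc_alloc_sketch(struct ata_port *ap)
{
	unsigned int tag;

	/*
	 * One locked test_and_set_bit() per candidate tag: with most
	 * of the 32 NCQ tags in flight, a single allocation can issue
	 * dozens of atomic RMW ops, each a potential cache miss.
	 */
	for (tag = 0; tag < ATA_MAX_QUEUE - 1; tag++)
		if (!test_and_set_bit(tag, &ap->qc_allocated))
			return __ata_qc_from_tag(ap, tag);

	return NULL;	/* all tags in flight */
}

Block layer tagging can instead locate a candidate with a plain
find_first_zero_bit() and perform the locked operation only on the
chosen bit, which is presumably what "should use block layer tag
management" is pointing at.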
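
Finally, to make the closing proposal concrete: ATA has no
device-side scatter/gather command today, so the layout below is
purely invented for illustration.  The idea is one command carrying a
list of (LBA, sector count) extents, so a single issue/completion
pair can service many small random accesses:

#include <linux/types.h>

/*
 * Hypothetical payload for a device-side SG read/write ATA command.
 * Every name and field here is made up for illustration; nothing
 * like it exists in the ATA command set.
 */
struct ata_sg_extent {
	__le64	lba;	/* starting LBA of this extent */
	__le16	nsect;	/* sectors to transfer at that LBA */
	__le16	rsvd;
};

struct ata_sg_rw_payload {
	__le16	nr_extents;		/* entries in ext[] below */
	__le16	flags;			/* e.g. bit 0 = write */
	struct ata_sg_extent	ext[];	/* the extent list */
};

The host side would need matching surgery: today's block layer only
merges contiguous sectors into a request, so something above it would
have to batch unrelated small requests into one such command.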