Hello,

Grant Grundler wrote:
>>> Maybe you are counting instructions and not cycles? Every cache miss
>>> is 200-300 cycles (say 100ns). When running multiple threads, we will
>>> miss on nearly every spinlock acquisition and probably on several data
>>> accesses. 1 microsecond isn't a lot when counting this way.
>> Yeah, ata uses its own locking and the qc allocation does atomic
>> bitops for each bit for no good reason which can hurt for very hi-ops
>> with NCQ tags filled up. If serving 4k requests as fast as possible
>> is the goal, I'm not really sure the current SCSI or ATA commands are
>> the best suited ones. Both SCSI and ATA are focused on rotating media
>> with seek latency
>
> I think existing File Systems and block IO schedulers (except NOOP) are
> tuned for rotating media and access patterns that benefit this media the most.

Actually, the whole stack is optimized toward IO devices with seek
latency, from the hardware to our drivers and the whole block layer
itself.

>> and thus have SG on the host bus side in most cases
>> but never on the device side.
>
> SG == scatter-gather? I'm not sure why that is specific to rotating media.
> Or is this referring to "SCSI-generic" pass through?

I was talking about scatter-gather. All the IO commands are about one
contiguous extent of data on the device, and the whole stack from the
bio down is built that way. The overhead of libata is minute compared to
the whole thing, which includes emitting a single command and receiving
a completion for each 4k transfer.

> In any case, only traversing one fewer layers (SCSI or libata) in
> block code path would help serve 4k requests more efficiently.

Yes, no doubt.

>> If getting the maximum random scattered
>> access throughput is a must, the best way would be adding a SG r/w
>> commands to ATA and adapt our storage stack accordingly.
>
> I don't think everyone wants to throw out the entire stack.
> But adding a passthrough for ATA and connecting that to FUSE might
> be a performant alternative.

Don't know how FUSE would come into play, but if the device can receive
a list of IOs to perform in a single command and reply accordingly, the
block layer (possibly the bio interface too?) can be modified to merge
random IOs into a single request. Things would then be really fast, and
whether we grab one more spinlock or not at the bottom of the stack
wouldn't really matter.

Thanks.

--
tejun
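
To make the "list of IOs in a single command" idea concrete, here is a
minimal Python sketch. It is only a model of the bookkeeping, not kernel
code: the Extent type, both functions, and the limit of 32 extents per
command are hypothetical, and no such SG r/w command exists in ATA
today. It assumes the queued extents do not overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Extent:
    """One contiguous run of sectors on the device (hypothetical type)."""
    lba: int    # starting sector
    nsect: int  # length in sectors (8 sectors of 512 bytes = 4k)


def per_extent_commands(extents: List[Extent]) -> List[List[Extent]]:
    """Current model: every command carries exactly one contiguous extent."""
    return [[e] for e in extents]


def sg_commands(extents: List[Extent],
                max_extents_per_cmd: int = 32) -> List[List[Extent]]:
    """Hypothetical device-side SG: each command carries a list of extents.

    The per-command limit is an arbitrary assumption for illustration.
    """
    merged: List[Extent] = []
    for e in sorted(extents, key=lambda x: x.lba):
        if merged and merged[-1].lba + merged[-1].nsect == e.lba:
            # Adjacent on the device: coalesce into one longer extent.
            last = merged[-1]
            merged[-1] = Extent(last.lba, last.nsect + e.nsect)
        else:
            merged.append(e)
    # Pack the remaining non-adjacent extents into as few commands as possible.
    return [merged[i:i + max_extents_per_cmd]
            for i in range(0, len(merged), max_extents_per_cmd)]
```

For four queued 4k requests at LBAs 800, 16, 24 and 4000,
per_extent_commands() yields four commands (four issue/completion round
trips), while sg_commands() yields one command carrying three extents,
because the requests at 16 and 24 are adjacent and coalesce. The saving
in this model comes from the number of commands, which is why one extra
spinlock at the bottom of the stack would matter little by comparison.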