On Wed, 2009-04-15 at 23:31 -0700, Grant Grundler wrote:
> On Tue, Apr 14, 2009 at 6:39 PM, Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
> > James Bottomley wrote:

[...]

> > Compared to any kind of hardware/controller interactions I wouldn't
> > say it's likely to be a significant bottleneck at all. In oprofile
> > runs I've done with heavy ATA activity, the top time consumers are
> > the interrupt handlers,
>
> interrupts just introduce completion reporting latency. Interrupt
> mitigation techniques/smarter controllers can reduce this.

For sure smarter controllers; I'm less convinced on interrupt mitigation
techniques, primarily because smart controllers tend to do batch
completion anyway.

The reason storage shouldn't really need interrupt mitigation (like NAPI
for networks) is that we shouldn't ever really be surprised by an
interrupt, unlike networks.  We should always know what payloads will be
coming at us from the device (and have buffers ready and waiting).  Most
SCSI controllers (including the SAS ones) do all of this today: you can
send out I/O and on some of them get under two interrupts per gigabyte
of data.

I fully agree that some of the less smart SATA controllers have a lot of
catching up to do in this space, but that isn't necessarily a driver
issue; you can't polish a turd, as the saying goes ...

> > command issue paths,
>
> Hrm? Is this in the device driver?
>
> > code that actually is poking IO registers.
>
> Stupid controller design. NICs have been able to run without
> MMIO *Reads* in the performance path for more than 5 years now.
> New "Enterprise" SAS/SATA controllers are better but I'm not
> at liberty to discuss those. (sorry)

That's both a protocol and a controller issue for SATA: some of the ATA
transfer modes (like PIO) have a lot higher overhead (and, unfortunately,
less offload in the controller) in the protocol stack and can be
unavoidable on certain transactions.

> > The libata-scsi code hasn't even shown up on the radar in my
> > experience.
>
> It won't for normal disk IOPS rates (<1000 IOPS per disk).
> Run it at 20K or 50K IOPS and see again.
> NICs are pushing a lot more than that.

So again, we get to this terminology problem: NICs tend to have a fixed
packet size (the network MTU), so IOPS make a lot of sense for them.  In
most storage transactions we don't really have MTU limitations, so we
try to rightsize the outgoing transactions to maximize bandwidth, and
IOPS don't tell the full story (after all, if I see all my packets are
128k, I can artificially reduce the merge limit to 64k and double my
IOPS).

IOPS are starting to come up because SSDs are saying they prefer many
smaller transactions to an accumulated larger one.  I'm still not
entirely convinced that trying to rightsize is wrong here: most of the
FS data is getting more contiguous, so even for SSDs we can merge
without a lot of work.

A simple back-of-the-envelope calculation can give the rightsizing: if
you want an SSD to max out with its 31 allowed tags saturating a 3G SATA
link, then you're talking 10M per tag per second.  If we assume a 4k
sector size, that's 2500 IOPS per tag (there's no real point doing less
than 4k, because that has us splitting the page cache).  Or, to put it
another way, over 75k IOPS for a single SSD doesn't make sense ... the
interesting question is whether it would make more sense to align on,
say, 16k IO and so expect to max out at 20k IOPS.
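
To make that arithmetic explicit, here's a quick standalone C sketch of
the same back-of-the-envelope numbers (this is illustrative only, and
assumes ~300MB/s of usable payload bandwidth on a 3Gbps SATA link after
8b/10b encoding, plus the 31-tag NCQ limit):

	#include <stdio.h>

	int main(void)
	{
		/* assumption: ~300MB/s usable on a 3Gbps SATA link */
		unsigned long link_bw = 300UL << 20;
		unsigned int tags = 31;			/* NCQ queue depth */
		/* ~10M per tag per second */
		unsigned long per_tag_bw = link_bw / tags;
		unsigned int io_sizes[] = { 4 * 1024, 16 * 1024 };
		unsigned int i;

		for (i = 0; i < sizeof(io_sizes) / sizeof(io_sizes[0]); i++) {
			unsigned long iops_per_tag = per_tag_bw / io_sizes[i];

			printf("%2uk io: ~%lu IOPS per tag, ~%lu IOPS total\n",
			       io_sizes[i] / 1024, iops_per_tag,
			       iops_per_tag * tags);
		}
		return 0;
	}

That comes out at roughly 2500 IOPS per tag and a bit over 75k IOPS
total for 4k IO, versus roughly 20k IOPS total for 16k IO, which is
where the figures above come from.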
James