On 2011-03-01 14:11, Arnd Bergmann wrote:
> On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>>
>> On 2011-02-25 07:21, Arnd Bergmann wrote:
>>> On Friday 25 February 2011, Andrei Warkentin wrote:
>>>> Yup. I understand :-). That's the strategy I'm going to follow. For
>>>> page_size-alignment/splitting I'm looking at the block layer now. Is
>>>> that the right approach or should I still submit a (cleaned up) patch
>>>> to mmc/card/block.c for that performance improvement.
>>>
>>> I guess it should live in block/cfq-iosched in the long run, but I don't
>>> know how easy it is to implement it there for test purposes.
>>
>> I don't think I saw the original patch(es) for this?
>
> Nobody has posted one yet, only discussions. Andrei made a patch for the
> MMC block driver to split requests in some cases, but I think the
> concept has changed enough that it's probably not useful to look at
> that patch.
>
> I think what needs to be done here is to split requests in these cases:
>
> * Small requests should be split on flash page boundaries, where a page
>   is typically 8 to 32 KB. Sending one hardware request that spans two
>   partial pages can be slower than sending two requests with the same
>   data, but on page boundaries.
>
> * If a hardware transfer is limited to a few sectors, these should be
>   aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
>   maximum transfers, a request that spans from sector 7 to 62 should be
>   split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
>   This reduces the number of page read-modify-write cycles that the drive
>   does.
>
> * No request should ever span multiple erase blocks. Most flash drives today
>   have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
>   treat the erase block boundary like a seek on a hard drive. The I/O
>   scheduler should try to send all sector writes of an erase block in sequence,
>   but after that it can choose any other erase block to write to next.
>
> I think if we get this logic, we can deal well with all cheap flash drives.
> The two parameters we need are the page size and the erase block size,
> which the kernel can sometimes guess, but should also be tunable in
> sysfs for devices that don't tell us or lie to the kernel about them.
>
> I'm not sure if we want to do this for all nonrotational media, or
> add another flag to enable these optimizations. On proper SSDs that have
> an intelligent controller and enough RAM, they probably would not help
> all that much, or even make it slightly slower due to a higher number
> of separate write requests.

Thanks for the recap. One way to handle this would be to have a dm
target that ensures that requests are never built up to violate any of
the above items. Doing splitting is a little silly, when you can prevent
it from happening in the first place. Alternatively, a queue
->merge_bvec_fn() with a settings table could provide the same.

As this is of limited scope, I would prefer having this done via a
plugin of some sort (like a dm target).

--
Jens Axboe
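
For illustration only, the rules from the recap can be tried out in user
space. The following is a minimal Python sketch, not kernel code and not
anyone's posted patch. The names split_request() and may_merge() and their
parameters are hypothetical, and the sketch assumes that the maximum
transfer size and the erase block size are both multiples of the page size,
with all values given in sectors. may_merge() only illustrates the "never
build such a request in the first place" idea for the erase block rule; it
does not reflect the actual ->merge_bvec_fn() interface.

```python
from typing import List, Tuple


def split_request(first: int, last: int, page: int, max_xfer: int,
                  erase: int) -> List[Tuple[int, int]]:
    """Split the inclusive sector range [first, last] into transfers.

    page, max_xfer and erase are sizes in sectors (hypothetical
    parameters; assumed to be positive, with max_xfer and erase
    being multiples of page).
    """
    if page <= 0 or max_xfer <= 0 or erase <= 0:
        raise ValueError("sizes must be positive")

    chunks = []
    cur = first
    while cur <= last:
        off = cur % page
        if off:
            # Unaligned head: stop at the last sector of the current page.
            end = cur + (page - off) - 1
        else:
            # Aligned start: take a full maximum-size transfer.
            end = cur + max_xfer - 1
        # Never exceed the hardware transfer limit.
        end = min(end, cur + max_xfer - 1)
        # Never cross the end of the erase block that contains cur.
        erase_end = (cur // erase + 1) * erase - 1
        end = min(end, erase_end, last)
        chunks.append((cur, end))
        cur = end + 1
    return chunks


def may_merge(first: int, last: int, erase: int) -> bool:
    """Return True if a merged request covering [first, last] would
    stay inside a single erase block."""
    return first // erase == last // erase


if __name__ == "__main__":
    # Example from the recap: 16 sector page, 32 sector maximum transfer.
    # 8192 sectors is a 4MB erase block assuming 512 byte sectors.
    print(split_request(7, 62, page=16, max_xfer=32, erase=8192))
    # prints [(7, 15), (16, 47), (48, 62)]
```

With the numbers from the recap this reproduces the 7-15, 16-47, 48-62
split. Other page-aligned splits are possible; the sketch simply cuts the
unaligned head at the first page boundary and then issues full-size
transfers.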