On Sat, 4 Apr 2009, Theodore Tso wrote: > > It may be that it's easier from an less-lines-of-the-kernels-to-change > point of view to add a WRITE_SYNC_PLUGGED which doesn't do the unplug, > and keep WRITE_SNYC as having the implicit unplug. Or maybe it would > be better to completely separate the "send a write which is marked as > SYNC" from the "implicit unplug" in the API. Well, the block layer internally already separates the two, it's just WRITE_SYNC that ties them together. So a trivial patch like the appended would make WRITE_SYNC just mean "this is synchronous IO" without the unplugging, and then WRITE_UNPLUG would mark it both as synchronous _and_ unplug it. (Or you could do the WRITE_SYNC_PLUGGED model too, but I think WRITE_SYNC and WRITE_UNPLUG actually would be better names, since they talk about the effects more directly). The question is more of a "who really wants the unplugging". Presumably the direct-io code-path really does (on the assumption that if you are doing direct-io, the caller had better done all the coalescing it wants). Non-rotational media would tend to want the unplugging, but the block layer already does that (ie in __make_request() we already do: if (unplug || blk_queue_nonrot(q)) __generic_unplug_device(q); so non-rotational devices get unplugged whether the request was a unplugging request or not). Of course, different IO schedulers react differently to that whole "sync vs unplug" thing. I think CFQ is the only one that actually cares about the "sync" bit (using different queues for sync vs async). The other schedulers only care about the plugging. So the patch below really doesn't make much sense as-is, because as things are right now, the scheduler behaviors are so different for the unplug-vs-sync thing that no sane user can ever know whether they should use WRITE_SYNC (== higher priority queueing for CFQ, no-op for others) or WRITE_UNPLUG (unplug on all, and additionally higher priority for CFQ). So I'm not serious about this patch, because as things are right now, I don't think it's sensible, but I do think that this just generally shows the confusion of what we should do. When is plugging right, when isn't it? Also, one of the issues seems to literally be that the higher-level request handling doesn't care AT ALL about the priority. Allocating the request itself does keep reads and writes separated, but if the request is a SYNCIO request, and non-sync writes have filled up th write requests, we'll have to wait for the background writes to free up requests. That is quite possibly the longest wait we have in the system. See get_request(): const int rw = rw_flags & 0x01; (That should _really_ be RW_MASK instead of 0x01, I suspect) and how the request allocation itself is done. I think this is broken. Linus --- include/linux/fs.h | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index a09e17c..a144377 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -95,7 +95,8 @@ struct inodes_stat_t { #define SWRITE 3 /* for ll_rw_block() - wait for buffer lock */ #define READ_SYNC (READ | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG)) #define READ_META (READ | (1 << BIO_RW_META)) -#define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG)) +#define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNCIO)) +#define WRITE_UNPLUG (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG)) #define SWRITE_SYNC (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG)) #define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER)) #define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD) -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html