Jens Axboe <axboe@xxxxxxxxx> writes:

> On 2012-12-12 20:41, Jeff Moyer wrote:
>> Jeff Moyer <jmoyer@xxxxxxxxxx> writes:
>>
>>>> I agree. This isn't about scheduling, we haven't even reached that part
>>>> yet. Back when we split the queues into read vs write, this problem
>>>> obviously wasn't there. Now we have sync writes and reads, both eating
>>>> from the same pool. The io scheduler can impact this a bit by forcing
>>>> reads to must allocate (Jan, which io scheduler are you using?). CFQ
>>>> does this when it's expecting a request from this process queue.
>>>>
>>>> Back in the day, we used to have one list. To avoid a similar problem,
>>>> we reserved the top of the list for reads. With the batching, it's a bit
>>>> more complicated. If we make the request allocation (just that, not the
>>>> scheduling) be read vs write instead of sync vs async, then we have the
>>>> same issue for sync vs buffered writes.
>>>>
>>>> How about something like the below? Due to the nature of sync reads, we
>>>> should allow a much longer timeout. The batch is really tailored towards
>>>> writes at the moment. Also shrink the batch count, 32 is pretty large...
>>>
>>> Does batching even make sense for dependent reads? I don't think it
>>> does.
>>
>> Having just read the batching code in detail, I'd like to amend this
>> misguided comment. Batching logic kicks in when you happen to be lucky
>> enough to use up the last request. As such, I'd be surprised if the
>> patch you posted helped. Jens, don't you think the writer is way more
>> likely to become the batcher? I do agree with shrinking the batch count
>> to 16, whether or not the rest of the patch goes in.
>>
>>> Assuming you disagree, then you'll have to justify that fixed
>>> time value of 2 seconds. The amount of time between dependent reads
>>> will vary depending on other I/O sent to the device, the properties of
>>> the device, the I/O scheduler, and so on. If you do stick 2 seconds in
>>> there, please comment it. Maybe it's time we started keeping track of
>>> worst-case Q->C time? That could be used to tell worst-case latency,
>>> and adjust magic timeouts like this one.
>>>
>>> I'm still thinking about how we might solve this in a cleaner way.
>>
>> The way things stand today, you can do a complete end run around the I/O
>> scheduler by queueing up enough I/O. To address that, I think we need
>> to move to a request list per io_context, as Jan had suggested. That
>> way, we can keep the logic about who gets to submit I/O when in one
>> place.
>>
>> Jens, what do you think?
>
> I think that is pretty extreme. We have way too much accounting around
> this already, and I'd rather just limit the batching than make
> per-ioc request lists too.

I'm not sure I understand your comment about accounting. I don't think
it would add overhead to move to per-ioc request lists. Note that, if
we did move to per-ioc request lists, we could yank out the blk cgroup
implementation of same.

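To put the batching point above in concrete terms, the logic in
block/blk-core.c is roughly the following -- paraphrased from memory
rather than copied, so treat the exact shape and field names as a
sketch, not the literal code:

/*
 * A task becomes a "batcher" only when its allocation takes the last
 * free request and the queue gets marked full; from then on it may
 * keep allocating a small burst while everyone else sleeps.
 */
static bool ioc_batching(struct request_queue *q, struct io_context *ioc)
{
        if (!ioc)
                return false;

        /*
         * Allow at least one allocation even if the batch window has
         * expired, otherwise we could lose wakeups.
         */
        return ioc->nr_batch_requests == q->nr_batch_requests ||
               (ioc->nr_batch_requests > 0 &&
                time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME));
}

/* called from get_request() only when this allocation fills the pool */
static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
{
        if (!ioc || ioc_batching(q, ioc))
                return;

        ioc->nr_batch_requests = q->nr_batch_requests; /* BLK_BATCH_REQ, 32 today */
        ioc->last_waited = jiffies;
}

So whichever task happens to grab the last free request gets the batch
quota, and with a heavy sync writer in the mix that task is almost
always going to be the writer, not the reader issuing dependent I/O.
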
> I agree the batch addition isn't super useful for the reads. It really
> is mostly a writer thing, and the timing reflects that.
>
> The problem is really that the WRITE_SYNC is (for Jan's case) behaving
> like buffered writes, so it eats up a queue of requests very easily. On
> the allocation side, the assumption is that WRITE_SYNC behaves like
> dependent reads. Similar to a dd with oflag=direct, not like a flood of
> requests. For dependent sync writes, our current behaviour is fine, we
> treat them like reads. For commits of WRITE_SYNC, they should be treated
> like async WRITE instead.

What are you suggesting? It sounds as though you might be suggesting
that WRITE_SYNCs are allocated from the async request list, but treated
as sync requests in the I/O scheduler. Oh, but only for this case of
streaming write syncs. How did you want to detect that? In the caller?
Tracking information in the ioc? Clear as mud. ;-)

-Jeff

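P.S. To make the question concrete for anyone reading along without the
source handy: today the allocation-side decision is simply sync vs async,

static inline bool rw_is_sync(unsigned int rw_flags)
{
        return !(rw_flags & REQ_WRITE) || (rw_flags & REQ_SYNC);
}

so reads and WRITE_SYNC draw from the same rl->count[BLK_RW_SYNC] pool.
If I'm reading the suggestion right, the change would look something
like the sketch below -- entirely hypothetical and untested, and
REQ_STREAMING is a made-up flag standing in for however we would detect
a flood of sync writes:

/*
 * Hypothetical: a "streaming" sync write is counted against the async
 * request pool for allocation purposes, while the elevator still sees
 * REQ_SYNC and schedules it as sync I/O. REQ_STREAMING does not exist;
 * it only marks where the detection would have to happen.
 */
static inline bool rw_alloc_is_sync(unsigned int rw_flags)
{
        if ((rw_flags & REQ_WRITE) && (rw_flags & REQ_STREAMING))
                return false;           /* allocate from the async pool */

        return rw_is_sync(rw_flags);    /* unchanged for everything else */
}

Which still leaves the question above of who sets that flag: the caller,
or something tracked in the ioc.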