On Mon, Mar 02, 2015 at 05:58:23PM +0100, Michal Hocko wrote:
> On Mon 02-03-15 11:39:13, Theodore Ts'o wrote:
> > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> > > The idea is sound. But I am pretty sure we will find many corner
> > > cases. E.g. what if the mere reservation attempt causes the system
> > > to go OOM and trigger the OOM killer?
> > 
> > Doctor, doctor, it hurts when I do that....
> > 
> > So don't trigger the OOM killer. We can let the caller decide whether
> > the reservation request should block or return ENOMEM, but the whole
> > point of the reservation request idea is that this happens *before*
> > we've taken any mutexes, so blocking won't prevent forward progress.
> 
> Maybe I wasn't clear. I wasn't concerned about the context which
> is doing the reservation. I was more concerned about all the other
> allocation requests which might fail now (because they do not have
> access to the reserves). So you think that we should simply disable OOM
> killer while there is any reservation active? Wouldn't that be even more
> fragile when something goes terribly wrong?

That's a silly strawman. Why wouldn't you simply block them until the
reserves are released when the transaction completes and the unused
memory goes back to the free pool?

Let me try another tack. My qualifications are as a distributed control
system engineer, not a computer scientist. I see everything as a system
of interconnected feedback loops: an operating system is nothing but a
set of very complex, tightly interconnected control systems.

Precedent? IO-less dirty throttling - that came about after I'd been
advocating a control theory based algorithm for several years to solve
the breakdown problems of dirty page throttling. We look at the code
Fengguang Wu wrote as one of the major success stories of Linux - the
writeback code just works and nobody ever has to tune it anymore.

I see the problem of direct memory reclaim as being very similar to the
problems the old IO based write throttling had: it has unbounded
concurrency, severe unfairness and breaks down badly when heavily
loaded. As a control system, it has the same terrible design as the
IO-based write throttling had.

There are many other similarities, too. Allocation can only take place
at the rate at which reclaim occurs, and we only have a limited budget
of allocatable pages. This is the same as the dirty page throttling -
dirtying pages is limited to the rate at which we can clean pages, and
there is a limited budget of dirty pages in the system.

Reclaiming pages is also done most efficiently by a single thread per
zone where lots of internal context can be kept (kswapd). This is
similar to how optimal writeback of dirty pages requires a single
thread with internal context per block device.

Waiting for free pages to arrive can be done by an ordered queuing
system, and we can account for the number of pages each allocation
requires in the queueing system and hence only need to wake the number
of waiters that will consume the memory just freed. Just like we do
with the dirty page throttling queue.

As such, the same solutions could be applied. As the allocation demand
exceeds the supply of free pages, we throttle allocation by sleeping on
an ordered queue and only waking waiters at the rate at which kswapd
reclaim can free pages. It's trivial to account accurately, and the
feedback loop is relatively simple, too. We can also easily maintain a
reserve of free pages this way, usable only by allocations marked with
special flags.
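To make the bookkeeping concrete, here is a rough userspace sketch of
that accounting - purely illustrative, single zone, FIFO waiters, and
every name and number is made up rather than taken from the current mm
code:

/*
 * Userspace model (not kernel code) of an ordered allocation queue
 * plus a reserve pool. All identifiers here are hypothetical.
 */
#include <stdio.h>
#include <stdbool.h>

#define QUEUE_MAX	16

struct alloc_request {
	int	nr_pages;	/* pages this waiter needs */
	bool	may_use_reserve;/* e.g. holds a transaction reservation */
};

static long free_pages = 4;	/* pages available right now */
static long reserve_pages;	/* pages set aside for reservations */
static long reserve_target = 8;	/* dynamic reserve threshold */

/* FIFO of throttled allocation requests, oldest first. */
static struct alloc_request queue[QUEUE_MAX];
static int queue_head, queue_tail;

static bool try_alloc(struct alloc_request *req)
{
	/* Reserved allocations may dip into the reserve pool first. */
	if (req->may_use_reserve && reserve_pages >= req->nr_pages) {
		reserve_pages -= req->nr_pages;
		return true;
	}
	if (free_pages >= req->nr_pages) {
		free_pages -= req->nr_pages;
		return true;
	}
	return false;
}

/* Allocate, or join the back of the ordered queue (i.e. "sleep"). */
static void alloc_pages(struct alloc_request req)
{
	if (try_alloc(&req)) {
		printf("allocated %d page(s)\n", req.nr_pages);
		return;
	}
	queue[queue_tail++ % QUEUE_MAX] = req;
	printf("throttled: waiting for %d page(s)\n", req.nr_pages);
}

/*
 * kswapd-style reclaim: top up the reserve first, then wake only as
 * many waiters as the newly freed pages can actually satisfy.
 */
static void reclaim_pages(long nr_freed)
{
	long want = reserve_target - reserve_pages;

	if (want > 0) {
		long fill = nr_freed < want ? nr_freed : want;
		reserve_pages += fill;
		nr_freed -= fill;
	}
	free_pages += nr_freed;

	while (queue_head != queue_tail &&
	       try_alloc(&queue[queue_head % QUEUE_MAX])) {
		printf("woke waiter needing %d page(s)\n",
		       queue[queue_head % QUEUE_MAX].nr_pages);
		queue_head++;
	}
}

int main(void)
{
	alloc_pages((struct alloc_request){ .nr_pages = 2 });	/* succeeds */
	alloc_pages((struct alloc_request){ .nr_pages = 4 });	/* throttled */
	alloc_pages((struct alloc_request){ .nr_pages = 1,
					    .may_use_reserve = true });
	reclaim_pages(16);	/* fills reserve, then wakes the 4-page waiter */
	return 0;
}

The point of the wakeup loop is that it stops as soon as the request at
the head of the queue can't be satisfied, so we never wake more waiters
than the freed memory can service.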
The reserve threshold can be dynamic, and tasks that request it to
change can be blocked until the reserve has been built up to meet the
caller's requirements. Allocations that are allowed to dip into the
reserve may do so rather than being added to the queue that waits for
reclaim. Reclaim would always fill the reserve back up to its limits
first, and tasks that have reservations can release them gradually as
they mark them as consumed by the reservation context (e.g. when a
filesystem joins an object to a transaction and modifies it), thereby
reducing the reserve that task has available as it progresses.

So, there's yet another possible solution to the allocation reservation
problem, and one that solves several other problems that are being
described as making reservation pools difficult or even impossible to
implement.

Seriously, I'm not expecting this problem to be solved tomorrow; what I
want is reliable, deterministic memory allocation behaviour from the mm
subsystem. I want people to be thinking about how to achieve that
rather than limiting their solutions by what we have now and can hack
into the current code, because otherwise we'll never end up with a
reliable memory allocation reservation system....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx