On Sun, Mar 01, 2015 at 11:15:06AM -0500, Johannes Weiner wrote:
> We had these lockups in cgroups with just a handful of threads, which
> all got stuck in the allocator and there was nobody left to volunteer
> unreclaimable memory.  When this was being addressed, we knew that the
> same can theoretically happen on the system-level but weren't aware of
> any reports.

Well now, here we are.  I think the "few threads in a small cgroup"
problem is a little different, because in those cases very often the
global system has enough memory, and there is always the possibility
that we might relax the memory cgroup guarantees a little in order to
allow forward progress.

In fact, arguably this *is* the right thing to do, because we have
situations where (a) the VFS takes the directory mutex, (b) the
directory blocks have been pushed out of memory, and so (c) a system
call running in a container with a small amount of memory and/or a
small amount of disk bandwidth allowed via its proportional I/O
settings ends up taking a very long time for the directory blocks to
be read into memory.  If a high-priority process, such as a cluster
management daemon, also tries to read the same directory, it can end
up stalled for long enough that the software watchdog takes the
entire machine out of the cluster.

The hard problem here is that the lock is taken by the VFS, *before*
it calls into the file-system-specific layer, and so the VFS has no
idea (a) how much memory or disk bandwidth it needs, and (b) whether
it needs any memory or disk bandwidth in the first place in order to
service a directory lookup operation (most of the time, it doesn't).
So there may be situations where, in the restricted cgroup, it would
be useful for the file system to be able to say, "you know, we're
holding onto a lock, and given that the disk controller is going to
force this low-priority cgroup to wait over a minute for the I/O to
even be queued out to the disk, maybe we should make an exception and
bust the disk controller cgroup cap".

(There is a related problem where a cgroup with a low disk bandwidth
quota is slowing down writeback, and we are desperately short on
global memory, and where relaxing the disk bandwidth limit via some
kind of priority inheritance scheme would prevent "innocent"
high-priority cgroups from having some of their processes get
OOM-killed.  I suppose one could claim that the high-priority cgroups
tend to belong to the sysadmin, who set the stupid disk bandwidth
caps in the first place, so there is a certain justice in having the
high-priority processes get OOM-killed, but still, it would be nice
if we could do the right thing automatically.)

But in any case, some of these workarounds, where we relax a
particularly tightly constrained cgroup limit, are obviously not
going to help when the entire system is low on memory.

> It really depends on what the goal here is.  You don't have to be
> perfectly accurate, but if you can give us a worst-case estimate we
> can actually guarantee forward progress and eliminate these lockups
> entirely, like in the block layer.  Sure, there will be bugs and the
> estimates won't be right from the start, but we can converge towards
> the right answer.  If the allocations which are allowed to dip into
> the reserves - the current nofail sites? - can be annotated with a gfp
> flag, we can easily verify the estimates by serving those sites
> exclusively from the private reserve pool and emit warnings when that
> runs dry.  We wouldn't even have to stress the system for that.
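To make the verification idea above concrete, here is a toy,
userspace-only sketch of "serve annotated sites from a private reserve
pool and warn when it runs dry".  The flag name, the pool size, and the
helpers are all invented for illustration; they are not existing
kernel interfaces.

/* Toy model of a private reserve pool for "must not fail" allocation
 * sites.  GFP_X_RESERVE and reserve_pool are made-up names. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define GFP_X_RESERVE 0x1       /* hypothetical "may dip into reserves" flag */

static size_t reserve_pool = 1 << 20;   /* worst-case estimate: 1 MiB */
static bool reserve_warned;

static void *toy_alloc(size_t size, unsigned int flags)
{
        if (flags & GFP_X_RESERVE) {
                if (size > reserve_pool) {
                        if (!reserve_warned) {
                                fprintf(stderr,
                                        "reserve pool ran dry: estimate too low\n");
                                reserve_warned = true;
                        }
                } else {
                        /* Annotated sites are charged against the pool,
                         * so a bad estimate shows up as a warning long
                         * before we have to stress the whole system. */
                        reserve_pool -= size;
                }
        }
        return malloc(size);
}

int main(void)
{
        void *p = toy_alloc(4096, GFP_X_RESERVE);  /* an annotated nofail-style site */

        free(p);
        printf("reserve left: %zu bytes\n", reserve_pool);
        return 0;
}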
> But there are legitimate concerns that this might never work.  For
> example, the requirements could be so unpredictable, or assessing them
> with reasonable accuracy could be so expensive, that the margin of
> error would make the worst case estimate too big to be useful.  Big
> enough that the reserves would harm well-behaved systems.  And if
> useful worst-case estimates are unattainable, I don't think we need to
> bother with reserves.  We can just stick with looping and OOM killing,
> that works most of the time, too.

I'm not sure that you want to reserve for the worst case.  What might
work is if subsystems (probably primarily file systems) give you
estimates for the usual case and the worst case, and you reserve for
something in between these two bounds.  In practice there will be a
huge number of file system operations taking place in your typical
super-busy system, and if you reserve for the worst case, it probably
will be too much.  We need to make sure there is enough memory
available for some forward progress, and if we need to stall a few
operations with some sleeping loops, so be it.  So the "heads up"
amounts don't have to be strict reservations in the sense that the
memory will be available instantly without any sleeping or looping.

I would also suggest that "reservations" be tied to a task struct and
not to some magic __GFP_* flag, since it's not just allocations done
by the file system, but also by the block device drivers, and if
certain write operations fail, the results will be catastrophic --
the block device can't tell the difference between an I/O operation
which must succeed (or else we declare the file system as needing
manual recovery and potentially reboot the entire system) and an I/O
operation where a failure could be handled by reflecting ENOMEM back
up to userspace.  The difference is a property of the call stack, so
the simplest way of handling this is to store the reservation in the
task struct, and let the reservation get automatically returned to
the system when a particular process makes the transition from kernel
space to user space.

The bottom line is that I agree that looping and OOM-killing works
most of the time, and so I'm happy with something that makes life a
little bit better and a little bit more predictable for the VM, if
that makes the system behave a bit more smoothly under high memory
pressure.  But at the same time, we don't want to make things too
complicated.  That may mean that we don't try to achieve perfection,
or that we simply don't worry about the global memory pressure
situation and instead think about other solutions to handle the
"small number of threads in a container" case: OOM-kill a bit less
frequently, force the container to loop/sleep for a bit, and then
allow a random foreground kernel thread in the container to "borrow"
a small amount of memory to hopefully let it make forward progress,
especially if it is holding locks or is in the process of exiting,
etc.

						- Ted
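P.S.  Here is a toy, userspace-only sketch of the task-struct
reservation idea above.  The struct, the field names, and the helpers
are all hypothetical -- this is only a model of the bookkeeping, not a
proposed kernel implementation.  The point is that the reservation
travels with the task, so allocations done on the task's behalf by the
block layer are covered without needing a new GFP flag, and whatever
is left over is handed back on the kernel-to-user transition.

/* Toy model of per-task memory reservations.  struct task and every
 * helper below are hypothetical, for illustration only. */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

static size_t global_reserve = 1 << 20;     /* system-wide reserve pool */

struct task {
        const char *comm;
        size_t reserved;        /* bytes set aside for this task's syscall */
};

/* The file system asks for a "heads up" reservation up front, before
 * it takes any locks; if the pool is short, the caller sleeps/loops
 * and retries instead of deadlocking later. */
static int task_reserve(struct task *t, size_t bytes)
{
        if (bytes > global_reserve)
                return -1;
        global_reserve -= bytes;
        t->reserved += bytes;
        return 0;
}

/* Any allocation made on this task's behalf -- by the file system or
 * by a block device driver -- is charged against the reservation. */
static void *task_alloc(struct task *t, size_t size)
{
        if (size <= t->reserved)
                t->reserved -= size;
        return malloc(size);
}

/* On the kernel-to-user transition, whatever is left goes back. */
static void task_return_to_user(struct task *t)
{
        global_reserve += t->reserved;
        t->reserved = 0;
}

int main(void)
{
        struct task t = { .comm = "ext4-writer", .reserved = 0 };

        if (task_reserve(&t, 64 * 1024) == 0) {
                void *buf = task_alloc(&t, 4096);  /* e.g. a driver bounce buffer */

                free(buf);
                task_return_to_user(&t);
        }
        printf("global reserve now: %zu bytes\n", global_reserve);
        return 0;
}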