[ cc fsdevel list - watch out for side effect of 9879de7373fc (mm: page_alloc: embed OOM killing naturally into allocation slowpath) which was merged between 3.19-rc6 and 3.19-rc7 , started from http://marc.info/?l=linux-mm&m=142348457310066&w=2 ] Replying in this post picked up from several posts in this thread. Michal Hocko wrote: > Besides that __GFP_WAIT callers should be prepared for the allocation > failure and should better cope with it. So no, I really hate something > like the above. Those who do not want to retry with invoking the OOM killer are using __GFP_WAIT + __GFP_NORETRY allocations. Those who want to retry with invoking the OOM killer are using __GFP_WAIT allocations. Those who must retry forever with invoking the OOM killer, no matter how many processes the OOM killer kills, are using __GFP_WAIT + __GFP_NOFAIL allocations. However, since use of __GFP_NOFAIL is prohibited, I think many of __GFP_WAIT users are expecting that the allocation fails only when "the OOM killer set TIF_MEMDIE flag to the caller but the caller failed to allocate from memory reserves". Also, the implementation before 9879de7373fc (mm: page_alloc: embed OOM killing naturally into allocation slowpath) effectively supported __GFP_WAIT users with such expectation. Michal Hocko wrote: > Because they cannot perform any IO/FS transactions and that would lead > to a premature OOM conditions way too easily. OOM killer is a _last > resort_ reclaim opportunity not something that would happen just because > you happen to be not able to flush dirty pages. But you should not have applied such change without making necessary changes to GFP_NOFS / GFP_NOIO users with such expectation and testing at linux-next.git . Applying such change after 3.19-rc6 is a sucker punch. Michal Hocko wrote: > Well, you are beating your machine to death so you can hardly get any > time guarantee. It would be nice to have a better feedback mechanism to > know when to back off and fail the allocation attempt which might be > blocking OOM victim to pass away. This is extremely tricky because we > shouldn't be too eager to fail just because of a sudden memory pressure. Michal Hocko wrote: > > I wish only somebody like kswapd repeats the loop on behalf of all > > threads waiting at memory allocation slowpath... > > This is the case when the kswapd is _able_ to cope with the memory > pressure. It looks wasteful for me that so many threads (greater than number of available CPUs) are sleeping at cond_resched() in shrink_slab() when checking SysRq-t. Imagine 1000 threads sleeping at cond_resched() in shrink_slab() on a machine with only 1 CPU. Each thread gets a chance to try calling reclaim function only when all other threads gave that thread a chance at cond_resched(). Such situation is almost mutually preventing from making progress. I wish the following mechanism. Prepare a kernel thread (for avoiding being OOM-killed) and let __GFP_WAIT and __GFP_WAIT + __GFP_NOFAIL users to wake up the kernel thread when they failed to allocate from free list. The kernel thread calls shrink_slab() etc. (and also out_of_memory() as needed) and wakes them sleeping at wait_for_event() up. Failing to allocate from free list is a rare case. Therefore, context switches for asking somebody else for reclaiming memory would be an acceptable overhead. If such mechanism are implemented, 1000 threads except the somebody can save CPU time by sleeping. Avoiding "almost mutually preventing from making progress" situation will drastically shorten the time guarantee even if I beat my machine to death. Such mechanism might be similar to Dave Chinner's Make the OOM killer only be invoked by kswapd or some other independent kernel thread so that it is independent of the allocation context that needs to invoke it, and have the invoker wait to be told what to do. suggestion. Dave Chinner wrote: > Filesystems do demand paging of metadata within transactions, which > means we are guaranteed to be holding locks when doing memory > allocation. Indeed, this is what the GFP_NOFS allocation context is > supposed to convey - we currently *hold locks* and so reclaim needs > to be careful about recursion. I'll also argue that it means the OOM > killer cannot kill the process attempting memory allocation for the > same reason. I agree with Dave Chinner about this. I tested on ext4 filesystem, one is stock Linux 3.19 and the other is Linux 3.19 with - /* The OOM killer does not compensate for light reclaim */ - if (!(gfp_mask & __GFP_FS)) - goto out; applied. Running a Java-like stressing program (which is multi threaded and likely be chosen by the OOM killer due to huge memory usage) shown below with ext4 filesystem set to remount read-only upon filesystem error. # mount -o remount,errors=remount-ro / ---------- Testing program start ---------- #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <sched.h> static int file_writer(void *unused) { char buffer[128] = { }; int fd; snprintf(buffer, sizeof(buffer) - 1, "/tmp/file.%u", getpid()); fd = open(buffer, O_WRONLY | O_CREAT, 0600); unlink(buffer); while (write(fd, buffer, 1) == 1 && fsync(fd) == 0); return 0; } static void memory_consumer(void) { const int fd = open("/dev/zero", O_RDONLY); unsigned long size; char *buf = NULL; for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) { char *cp = realloc(buf, size); if (!cp) { size >>= 1; break; } buf = cp; } read(fd, buf, size); /* Will cause OOM due to overcommit */ } int main(int argc, char *argv[]) { int i; for (i = 0; i < 100; i++) { char *cp = malloc(4 * 1024); if (!cp || clone(file_writer, cp + 4 * 1024, CLONE_SIGHAND | CLONE_VM, NULL) == -1) break; } memory_consumer(); while (1) pause(); return 0; } ---------- Testing program end ---------- The former showed that the ext4 filesystem is remounted read-only due to filesystem errors with 50%+ reproducibility. ---------- [ 72.440013] do_get_write_access: OOM for frozen_buffer [ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access [ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory (...snipped....) [ 72.495559] do_get_write_access: OOM for frozen_buffer [ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access [ 72.496839] do_get_write_access: OOM for frozen_buffer [ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access [ 72.505766] Aborting journal on device sda1-8. [ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only [ 72.505853] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) [ 72.507995] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) [ 72.508773] EXT4-fs (sda1): Remounting filesystem read-only [ 72.508775] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) [ 72.547799] do_get_write_access: OOM for frozen_buffer [ 72.706692] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) [ 73.035416] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) [ 73.291732] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) [ 73.422171] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) [ 73.511862] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) [ 73.589174] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) [ 73.665302] EXT4-fs warning (device sda1): ext4_evict_inode:260: couldn't mark inode dirty (err -12) ---------- On the other hand, the latter showed that the ext4 filesystem was never remounted read-only because filesystem errors did not occur, though several TIF_MEMDIE stalls which the timeout patch would handle were observed as with the former. As this is ext4 filesystem, this would use GFP_NOFS. But does using GFP_NOFS + __GFP_NOFAIL at ext4 filesystem solve the problem? I don't think so. The underlying block layer which ext4 filesystem calls would use GFP_NOIO. And memory allocation failures at block layer will result in I/O error which is observed by users as filesystem error. Does passing __GFP_NOFAIL down to block layer solve the problem? I don't think so. There is no means to teach block layer that filesystem layer is doing critical operations where failure results in serious problems. Then, does using GFP_NOIO + __GFP_NOFAIL at block layer solves the problem? I don't think so. It is nothing but bypassing /* The OOM killer does not compensate for light reclaim */ if (!(gfp_mask & __GFP_FS)) goto out; check by passing __GFP_NOFAIL flag. Michal Hocko wrote: > Failing __GFP_WAIT allocation is perfectly fine IMO. Why do you think > this is a problem? Killing a user space process or taking filesystem error actions (e.g. remount-ro or kernel panic), which choice is less painful for users? I believe that !(gfp_mask & __GFP_FS) check is a bug and should be removed. Rather, shouldn't allocations without __GFP_FS get more chance to succeed than allocations with __GFP_FS? If I were the author, I might have added below check instead. /* This is not a critical allocation. Don't invoke the OOM killer. */ if (gfp_mask & __GFP_FS) goto out; Falling into retry loop with same watermark might prevent rescuer threads from doing memory allocation which is needed for making free memory. Maybe we should use lower watermark for GFP_NOIO and below, middle watermark for GFP_NOFS, high watermark for GFP_KERNEL and above. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html