Re: OOM killer changes

On 15.08.2016 20:12, Joonsoo Kim wrote:
On Mon, Aug 15, 2016 at 11:16:36AM +0200, Vlastimil Babka wrote:
On 08/15/2016 06:48 AM, Ralf-Peter Rohbeck wrote:
On 02.08.2016 12:25, Ralf-Peter Rohbeck wrote:
Took me a little longer than expected due to work. The failure wouldn't
happen for a while and so I started a couple of scripts and let them
run. When I checked today the server didn't respond on the network and
sure enough it had killed everything. This is with 4.7.0 with the config
based on Debian 4.7-rc7.

trace_pipe got a little big (5GB) so I uploaded the logs to
https://filebin.net/box0wycfouvhl6sr/OOM_4.7.0.tar.bz2 . before_btrfs is
before the btrfs filesystems were mounted.
I did run a btrfs balance because it creates IO load and I needed to
balance anyway. Maybe that's what caused it?
pgmigrate_success        46738962
pgmigrate_fail          135649772
compact_migrate_scanned 309726659
compact_free_scanned   9715615169
compact_isolated        229689596
compact_stall                4777
compact_fail                 3068
compact_success              1709
compact_daemon_wake        207834
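
For a rough sense of scale, the counters above can be turned into ratios. The following is a minimal Python sketch that uses only the numbers quoted in this message; the dictionary and the `ratio` helper are illustrative names, not part of any kernel tooling.

```python
# Counters copied from the /proc/vmstat excerpt quoted above.
counters = {
    "pgmigrate_success": 46738962,
    "pgmigrate_fail": 135649772,
    "compact_migrate_scanned": 309726659,
    "compact_free_scanned": 9715615169,
    "compact_isolated": 229689596,
    "compact_stall": 4777,
    "compact_fail": 3068,
    "compact_success": 1709,
    "compact_daemon_wake": 207834,
}


def ratio(part, whole):
    """Return part/whole, or 0.0 when the denominator is zero."""
    return part / whole if whole else 0.0


# Share of page migration attempts that failed.
migrate_total = counters["pgmigrate_success"] + counters["pgmigrate_fail"]
migrate_fail_ratio = ratio(counters["pgmigrate_fail"], migrate_total)

# Share of direct compaction stalls that ended in success.
compact_success_ratio = ratio(counters["compact_success"],
                              counters["compact_stall"])

# Pages scanned by the free scanner per page scanned by the migration scanner.
free_per_migrate_scanned = ratio(counters["compact_free_scanned"],
                                 counters["compact_migrate_scanned"])

print(f"migration failure ratio:      {migrate_fail_ratio:.1%}")
print(f"compaction success ratio:     {compact_success_ratio:.1%}")
print(f"free/migrate scanned ratio:   {free_per_migrate_scanned:.1f}")
```

With these numbers, roughly three out of four migration attempts failed, about a third of the compaction stalls succeeded, and the free scanner looked at about 31 pages for every page the migration scanner looked at.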

The migration failures are quite enormous. A very quick analysis of the
trace seems to confirm that these are mostly "real", as opposed to the
result of failing to isolate free pages for migration targets, although
the free scanner spent a lot of time:
I don't think that the main reason for the OOM is 'real' migration failure.
If that were the case, compaction would find the next migratable pages and
eventually some of the pages would be migrated successfully.

pagetypeinfo shows that there are too many unmovable pageblocks.
The free page scanner doesn't scan those pageblocks, so there is a large
possibility that it cannot find free pages even if the system has many
free pages. I think that this is the root cause of the problem.
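
The share of unmovable pageblocks can be read from the "Number of blocks type" table in /proc/pagetypeinfo. The following is a minimal Python sketch, assuming the table layout used by kernels of this era (a header line with the migrate type names, then one "Node N, zone NAME" line per zone). The `SAMPLE` text is made-up illustrative data, not taken from the attached logs, and the function names are hypothetical.

```python
# Made-up sample in the assumed layout of the "Number of blocks type" table.
SAMPLE = """\
Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA            1            7            0            0            0            0
Node 0, zone    DMA32          600          300          116            0            0            0
Node 0, zone   Normal         5000         1500          668            0            0            0
"""


def parse_block_counts(text):
    """Map (node, zone) to {migrate type: pageblock count}."""
    result = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if line.startswith("Number of blocks type"):
            # Column names follow the four words of the table title.
            columns = fields[4:]
            continue
        # Rows look like: Node 0, zone Normal <count> <count> ...
        if columns and fields[:1] == ["Node"] and len(fields) == 4 + len(columns):
            node = fields[1].rstrip(",")
            zone = fields[3]
            result[(node, zone)] = dict(zip(columns, map(int, fields[4:])))
    return result


def unmovable_share(counts):
    """Fraction of a zone's pageblocks that are marked Unmovable."""
    total = sum(counts.values())
    return counts.get("Unmovable", 0) / total if total else 0.0


blocks = parse_block_counts(SAMPLE)
for (node, zone), counts in blocks.items():
    print(f"node {node} zone {zone}: {unmovable_share(counts):.1%} unmovable")
```

On a live system the same functions could be fed the contents of /proc/pagetypeinfo, which normally needs root to read.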

It would be good to check whether the following workaround helps with the problem.

Thanks.

------------>8-----------
diff --git a/mm/compaction.c b/mm/compaction.c
index 9affb29..965eddd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1082,10 +1082,6 @@ static void isolate_freepages(struct compact_control *cc)
                 if (!page)
                         continue;
-               /* Check the block is suitable for migration */
-               if (!suitable_migration_target(page))
-                       continue;
-
                 /* If isolation recently failed, do not retry */
                 if (!isolation_suitable(cc, page))
                         continue;
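
The effect of dropping that check can be illustrated with a toy model. The following Python sketch is not kernel code: it assumes the check only admits Movable and CMA pageblocks as migration targets and ignores its other condition about very large free pages. The `Pageblock` class, the function names, and the block list are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pageblock:
    migratetype: str   # e.g. "Unmovable", "Movable", "Reclaimable", "CMA"
    free_pages: int    # free pages inside this pageblock


def suitable_target(block):
    """Simplified stand-in for the removed check (assumption, see lead-in)."""
    return block.migratetype in ("Movable", "CMA")


def reachable_free_pages(blocks, check_suitability=True):
    """Free pages the free scanner could isolate in this toy model."""
    return sum(
        b.free_pages
        for b in blocks
        if not check_suitability or suitable_target(b)
    )


# Invented layout: most free memory sits in non-movable pageblocks.
blocks = [
    Pageblock("Unmovable", 200),
    Pageblock("Unmovable", 150),
    Pageblock("Movable", 10),
    Pageblock("Reclaimable", 90),
    Pageblock("Movable", 5),
]

with_check = reachable_free_pages(blocks, check_suitability=True)
without_check = reachable_free_pages(blocks, check_suitability=False)
print(f"free pages reachable with the check:    {with_check}")
print(f"free pages reachable without the check: {without_check}")
```

In this model the scanner reaches only 15 of the 455 free pages while the check is in place, which matches the concern that free pages exist but are not usable as migration targets.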

That seemed to help a little (subjectively), but the OOM killer still killed a kernel build. The logs are attached.

Thanks,
Ralf-Peter



Attachment: OOM_4.7.0_p3.tar.bz2

