OOM Killer cannot be invoked in a diskless environment to relieve severe memory pressure

I have been running into an interesting problem with the Out Of Memory Killer in a diskless environment running a 2.6.34.7-56 Fedora x86 based kernel.

It seems in a diskless environment the OOM Killer is not being invoked when the system is under severe memory pressure.  As a result the system hard hangs.

To investigate further, I built a Fedora 13 live DVD with SystemTap and kernel debuginfo so that I could probe the functions on the OOM code path while testing in a diskless environment and confirm whether the OOM killer was invoked.

I also logged some memory statistics. If anyone is interested I have attached the debug files and logs to kernel.org bug:

-----------------------------
Testing: _with_ a disk
-----------------------------

I ran the C program from 
with additions that print mallinfo and /proc/meminfo statistics.

The objective of mem-pressure.c is to allocate huge blocks and fill them with 1s until the OOM killer is invoked and kills off the offending process to rescue the system from memory pressure.
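For readers without the tarball, the allocate-and-fill idea can be sketched roughly as follows. This is a minimal sketch and not the actual mem-pressure.c: the names apply_pressure() and release_pressure() are hypothetical, and unlike the real test it stops at a caller-supplied block limit so it can be run safely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not taken from mem-pressure.c.
 * Allocates up to max_blocks blocks of block_size bytes and fills each one
 * with the byte value 1, so the kernel has to back the pages with real
 * memory instead of leaving them untouched.
 * Pointers are stored in blocks[] so the memory stays in use.
 * Returns the number of blocks that were allocated and filled. */
size_t apply_pressure(void **blocks, size_t max_blocks, size_t block_size)
{
    size_t n = 0;

    assert(blocks != NULL && block_size > 0);
    while (n < max_blocks) {
        void *p = malloc(block_size);
        if (p == NULL)
            break;                    /* allocator refused; stop here */
        memset(p, 1, block_size);     /* touch every byte of the block */
        blocks[n++] = p;
    }
    return n;
}

/* Frees the first count blocks recorded by apply_pressure(). */
void release_pressure(void **blocks, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        free(blocks[i]);
        blocks[i] = NULL;
    }
}
```

A real pressure test would call this with a very large max_blocks and never release the blocks; the limit is there only to keep the sketch bounded.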

The test program successfully created memory pressure on the Fedora 13 system and caused the OOM killer to kill off offending processes according to their calculated "badness" score.  In my test runs, if mem-pressure wasn't killed directly, then gnome, metacity, or the shell was killed, which in turn killed mem-pressure.

The SystemTap script, oom.stp, shows the system going through the code sequence leading up to the OOM killer:
|> __alloc_pages_may_oom()
        |> out_of_memory()
                |> oom_kill_process()
                        |> oom_kill_task()
                                |> __oom_kill_task()

[Note: I removed probing __alloc_pages_slowpath() from oom.stp disk-environment testing to make the oom-stp.out log a little simpler to read.]

===Analysis===
No surprises here.  With an attached disk the OOM killer was invoked and relieved the memory pressure, returning the system to a ~functioning state.  The oom.stp output shows the system following the code path that invokes the OOM killer.

[Note: in some of my test runs other services were killed before mem-pressure, leaving a somewhat crippled system, but the main point is that the system did _not_ hang.]

-Test data-
See the oom-test-data.tar.gz: OOM_testing/disk for test results using a disk environment.  OOM_testing/test/code has the scripts and the test program. I compiled the C program as mem-pressure and logged its output to mem-pressure.out. In this sequence of testing I did not adjust any oom_adj values.  I also collected various memory statistics every second using the meminfo.sh script and logged the output to meminfo.out. meminfo.out shows the system's memory resources being used, and mem-pressure.out shows the consumption of memory resources by the mem-pressure program. The output of the SystemTap script, oom.stp, was logged to oom-stp.out.
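For anyone who wants to pull a single field out of /proc/meminfo from C rather than from a shell script, a lookup might look roughly like this. It is only a sketch: meminfo_lookup() and meminfo_from_file() are hypothetical names, and this is not how meminfo.sh or the modified mem-pressure.c actually gather their numbers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: finds "key:" at the start of a line in text that is
 * formatted like /proc/meminfo (e.g. "MemFree:   250 kB") and stores the
 * number that follows in *out.  Returns 0 on success, -1 if not found. */
int meminfo_lookup(const char *text, const char *key, long *out)
{
    const char *line = text;
    size_t klen;

    assert(text != NULL && key != NULL && out != NULL);
    klen = strlen(key);
    while (line != NULL && *line != '\0') {
        /* The key must be followed directly by ':' to avoid prefix matches. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return sscanf(line + klen + 1, "%ld", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');    /* advance to the next line, if any */
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Reads the file at path (normally "/proc/meminfo") and looks up key. */
int meminfo_from_file(const char *path, const char *key, long *out)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_lookup(buf, key, out);
}
```

Calling meminfo_from_file("/proc/meminfo", "MemFree", &kb) once per second would give a log comparable in spirit to meminfo.out, though with only the one field.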

-----------------------------
Testing: _NO_ disk (disk-less environment)
-----------------------------

To ensure a disk-less environment, I booted the Fedora 13 live DVD with the SystemTap and kernel debuginfo rpms added.

-Additions to the test program and scripts-
I ran the same mem-pressure program as in the testing with a disk, but I added code to set the oom_adj value of mem-pressure to 15 so that, if the OOM killer were invoked, it would kill the mem-pressure program first.
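Setting oom_adj from inside a process amounts to writing the value to its /proc entry. The following is a rough sketch of that step, not the code from the modified mem-pressure.c: set_oom_adj() is a hypothetical name, and the path is passed in as a parameter so the function can be tried against an ordinary file. On 2.6.34 the valid range is -17 to 15; newer kernels deprecate oom_adj in favor of oom_score_adj.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes value to the oom_adj file at path.
 * For the calling process the path would be "/proc/self/oom_adj".
 * Values outside -17..15 are rejected without touching the file.
 * Returns 0 on success, -1 on failure. */
int set_oom_adj(const char *path, int value)
{
    FILE *f;
    int ok;

    assert(path != NULL);
    if (value < -17 || value > 15)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)               /* the write may only fail at close */
        ok = 0;
    return ok ? 0 : -1;
}
```

With this helper, set_oom_adj("/proc/self/oom_adj", 15) early in main() would mark the process as the preferred OOM victim.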

I also added a probe on __alloc_pages_slowpath() to the oom.stp SystemTap script to show more of the memory allocation code path leading up to the OOM killer.

The test program mem-pressure still successfully creates memory pressure, and as in previous testing the system hard hangs when system memory resources have been exhausted.

===Analysis===
The system hard hangs after memory resources have been exhausted by mem-pressure.  Probing the code path:
__alloc_pages_slowpath()
        |> __alloc_pages_may_oom()
                |> out_of_memory()
                        |> oom_kill_process()
                                |> oom_kill_task()
                                        |> __oom_kill_task()

I see the system entering __alloc_pages_slowpath(), but I never see it enter out_of_memory(), and thus __oom_kill_task() is never invoked to kill the mem-pressure process. Consequently the system is unable to relieve the memory pressure, and it hard hangs.

-Test Data-
See the oom-test-data.tar.gz:  OOM_testing/NO_disk for test results using a disk-less environment.  OOM_testing/test/code has the scripts, test program, and kickstart files.

The same test data was collected as in the disk environment.  I also added the kickstart files I used to make the Fedora 13 SystemTap Debug live DVD.  If you're interested, the instructions to create a Fedora Live CD are at:
With the kickstarts provided one would run:
        livecd-creator \
                --config=/path/to/kickstarts/fedora-livecd-desktop.ks \
                --fslabel=Fedora-LiveCD-Debug --cache=/var/cache/live

(Note: I logged output to a mounted USB 2.0 flash drive.)

=============

In summary, the test results show that the OOM killer is not invoked in the disk-less environment to relieve memory pressure, which causes the system to hard hang. This may very well be a by-product of running a diskless environment and overcommitting memory resources.
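Since overcommit is one suspect, the overcommit policy in effect during a run is a useful thing to log. The sketch below reads it from C; it is an illustration only, the function names are hypothetical, and this value was not part of the data collected above. The policy lives in /proc/sys/vm/overcommit_memory, where 0 is heuristic overcommit, 1 is always overcommit, and 2 is no overcommit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads the integer stored in the file at path
 * (normally "/proc/sys/vm/overcommit_memory") into *mode.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_overcommit_mode(const char *path, int *mode)
{
    FILE *f;
    int ok;

    assert(path != NULL && mode != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = fscanf(f, "%d", mode) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Maps the numeric mode to a short label for log output. */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "heuristic";
    case 1:  return "always";
    case 2:  return "never";
    default: return "unknown";
    }
}
```

Printing overcommit_mode_name() at the start of a test run would make it easier to compare results across machines with different settings.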

Any insights into this issue are welcomed.
-Thanks

--
Antonio Rosales
IBM Linux Technology Center

