On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote: > > On Mon 26-08-19 12:36:28, Edward Chron wrote: > [...] > > Extensibility using OOM debug options > > ------------------------------------- > > What is needed is an extensible system to optionally configure > > debug options as needed and to then dynamically enable and disable > > them. Also for options that produce multiple lines of entry based > > output, to configure which entries to print based on how much > > memory they use (or optionally all the entries). > > With a patch this large and adding a lot of new stuff we need a more > detailed usecases described I believe. I guess it would make sense to explain motivation for each OOM Debug option I've sent separately. I see there comments on the patches I will try and add more information there. An overview would be that we've been collecting information on OOM's over the last 12 years or so. These are from switches, other embedded devices, servers both large and small. We ask for feedback on what information was helpful or could be helpful. We try and add it to make root causing issues easier. These OOM debug options are some of the options we've created. I didn't port all of them to 5.3 but these are representative. Our latest is kernel is a bit behind 5.3. > > > [...] > > > Use of debugfs to allow dynamic controls > > ---------------------------------------- > > By providing a debugfs interface that allows options to be configured, > > enabled and where appropriate to set a minimum size for selecting > > entries to print, the output produced when an OOM event occurs can be > > dynamically adjusted to produce as little or as much detail as needed > > for a given system. > > Who is going to consume this information and why would that consumer be > unreasonable to demand further maintenance of that information in future > releases? In other words debugfs is not considered a stableAPI which is > OK here but the side effect of any change to these files results in user > visible behavior and we consider that more or less a stable as long as > there are consumers. > > > OOM debug options can be added to the base code as needed. > > > > Currently we have the following OOM debug options defined: > > > > * System State Summary > > -------------------- > > One line of output that includes: > > - Uptime (days, hour, minutes, seconds) > > We do have timestamps in the log so why is this needed? Here is how an OOM report looks when we get it to look at: Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=1000 Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not tainted 5.3.0-rc6+ #33 Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd. IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018 This shows the date and time, not time of the last boot. The /var/log/messages output is what we often have to look at not raw dmesgs. > > > > - Number CPUs > > - Machine Type > > - Node name > > - Domain name > > why are these needed? That is a static information that doesn't really > influence the OOM situation. Sorry if a few of the items overlap what OOM prints. We've been printing a lot of this information since 2.6.38 and OOM reporting has been updated. We're updating our 4.19 system to have the latest OOM Report format. This was the 5.0 patch Reorg the OOM report in the dump header. Also back porting Shakeel's 5.3 patch to refactor dump tasks for memcg OOMs. We're testing those back ports right now in fact. We can probably get rid of some of the information we have but I haven't had a chance yet. Hopefully can do it as part of sending some code upstream. > > > > - Kernel Release > > - Kernel Version > > part of the oom report > > > > > Example output when configured and enabled: > > > > Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019 > > > > * Tasks Summary > > ------------- > > One line of output that includes: > > - Number of Threads > > - Number of processes > > - Forks since boot > > - Processes that are runnable > > - Processes that are in iowait > > We do have sysrq+t for this kind of information. Why do we need to > duplicate it? Unfortunately, we can't login into every customer system or even system of our own and do a sysrq+t after each OOM. You could scan for OOMs and have a script do it, but doing a sysrq+t after an OOM event, you'll get different results. I'd rather have the runnable and iowait counts during the OOM event not after. Computers are so darn fast, free up some memory and things can look a lot different. We've seen crond fork and hang and gradually create thousands of processes and sorts of other unintended fork bombs. On some systems we can't print all of the process information as we've discussed. So we print a summary of how many there are total and if you use the select process print option you can print all the processes that use more than 1% for example. That may be a dozen or two versus hundreds or thousands. That may make printing some user processes, the largest memory users feasible. > > > > Example output when configured and enabled: > > > > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0 > > > > * ARP Table and/or Neighbour Discovery Table Summary > > -------------------------------------------------- > > One line of output each for ARP and ND that includes: > > - Table name > > - Table size (max # entries) > > - Key Length > > - Entry Size > > - Number of Entries > > - Last Flush (in seconds) > > - hash grows > > - entry allocations > > - entry destroys > > - Number lookups > > - Number of lookup hits > > - Resolution failures > > - Garbage Collection Forced Runs > > - Table Full > > - Proxy Queue Length > > > > Example output when configured and enabled (for both): > > > > ... kernel: neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: 360 entries: 9 lastFlush: 1721s hGrows: 1 allocs: 9 destroys: 0 lookups: 204 hits: 199 resFailed: 38 gcRuns/Forced: 111 / 0 tblFull: 0 proxyQlen: 0 > > > > ... kernel: neighbour: Table: nd_tbl size: 128 keyLen: 16 entrySize: 368 entries: 6 lastFlush: 1720s hGrows: 0 allocs: 7 destroys: 1 lookups: 0 hits: 0 resFailed: 0 gcRuns/Forced: 110 / 0 tblFull: 0 proxyQlen: 0 > > Again, why is this needed particularly for the OOM event? I do > understand this might be useful system health diagnostic information but > how does this contribute to the OOM? > It is example of some system table information we print. Other adjustable table information may be useful as well. These table sizes are often adjustable and collecting stats on usage helps determine if settings are appropriate. The value during OOM events is very useful as usage varies. We also collect the same stats like this from user code periodically and can compare these. > > > * Add Select Slabs Print > > ---------------------- > > Allow select slab entries (based on a minimum size) to be printed. > > Minimum size is specified as a percentage of the total RAM memory > > in tenths of a percent, consistent with existing OOM process scoring. > > Valid values are specified from 0 to 1000 where 0 prints all slab > > entries (all slabs that have at least one slab object in use) up > > to 1000 which would require a slab to use 100% of memory which can't > > happen so in that case only summary information is printed. > > > > The first line of output is the standard Linux output header for > > OOM printed Slab entries. This header looks like this: > > > > Aug 6 09:37:21 egc103 yourserver: Unreclaimable slab info: > > > > The output is existing slab entry memory usage limited such that only > > entries equal to or larger than the minimum size are printed. > > Empty slabs (no slab entries in slabs in use) are never printed. > > > > Additional output consists of summary information that is printed > > at the end of the output. This summary information includes: > > - # entries examined > > - # entries selected and printed > > - minimum entry size for selection > > - Slabs total size (kB) > > - Slabs reclaimable size (kB) > > - Slabs unreclaimable size (kB) > > > > Example Summary output when configured and enabled: > > > > Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123 printed: 83 minsize: 0kB > > > > Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB Unreclaim: 100580kB > > I am all for practical improvements for slab reporting. It is not really > trivial to find a good balance though. Printing all the caches simply > doesn't scale. So I would start by improving the current state rather > than adding more configurability. Yes, there is a challenge here and with the information you choose to report when an OOM event occurs. Paraphrasing, one size may not fit all. To address this we tried to make it easy to add options and to allow them to enabled / disabled. We'd rather rate limit based on memory usage than have the kernel print rate limit arbitrarily. We had to make some choices on how to do this. That said we view the OOM report as debugging information. So if you change the format as long as we get the information we feel is relevant, we're happy. Since we print release and version information we can adjust our scripts to handle format changes. It's work but not really that big a deal. If you remove information that was useful that is a bit more painful, but not the end of the world. > > > > > > * Add Select Vmalloc allocations Print > > ------------------------------------ > > Allow select vmalloc entries (based on a minimum size) to be printed. > > Minimum size is specified as a percentage of the total RAM memory > > in tenths of a percent, consistent with existing OOM process scoring. > > Valid values are specified from 0 to 1000 where 0 prints all vmalloc > > entries (all vmalloc allocations that have at least one page in use) up > > to 1000 which would require a vmalloc to use 100% of memory which can't > > happen so in that case only summary information is printed. > > > > The first line of output is a new Vmalloc output header for > > OOM printed Vmalloc entries. This header looks like this: > > > > Aug 19 19:27:01 yourserver kernel: Vmalloc Info: > > > > The output is vmalloc entry information output limited such that only > > entries equal to or larger than the minimum size are printed. > > Unused vmallocs (no pages assigned to the vmalloc) are never printed. > > The vmalloc entry information includes: > > - Size (in bytes) > > - pages (Number pages in use) > > - Caller Information to identify the request > > > > A sample vmalloc entry output looks like this: > > > > Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640 caller=__do_sys_swapon+0x78e/0x113 > > > > Additional output consists of summary information that is printed > > at the end of the output. This summary information includes: > > - Number of Vmalloc entries examined > > - Number of Vmalloc entries printed > > - minimum entry size for selection > > > > A sample Vmalloc Summary output looks like this: > > > > Aug 19 19:27:01 coronado kernel: Summary: Vmalloc entries examined: 1070 printed: 989 minsize: 0kB > > This is a lot of information. I wouldn't be surprised if this alone > could easily overflow the ringbuffer. Besides that, it is rarely useful > for the OOM situation debugging. The overall size of the vmalloc area > is certainly interesting but I am not sure we have a handy counter to > cope with constrained OOM contexts. > We've had cases where just displaying very large allocations explained why an OOM event occurred. We size this so we rarely get much output here, an entry or two at most. Again it is optional so if you don't care don't enable it. > > > * Add Select Process Entries Print > > -------------------------------- > > Allow select process entries (based on a minimum size) to be printed. > > Minimum size is specified as a percentage totalpages (RAM + swap) > > in tenths of a percent, consistent with existing OOM process scoring. > > Note: user process memory can be swapped out when swap space present > > so that is why swap space and ram memory comprise the totalpages > > used to calculate the percentage of memory a process is using. > > Valid values are specified from 0 to 1000 where 0 prints all user > > processes (that have valid mm sections and aren't exiting) up to > > 1000 which would require a user process to use 100% of memory which > > can't happen so in that case only summary information is printed. > > > > The first line of output is the standard Linux output headers for > > OOM printed User Processes. This header looks like this: > > > > Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in pages): > > Aug 19 19:27:01 yourserver kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name > > > > The output is existing per user process data limited such that only > > entries equal to or larger than the minimum size are printed. > > > > Jul 21 20:07:48 yourserver kernel: [ 579] 0 579 7942 1010 90112 0 -1000 systemd-udevd > > > > Additional output consists of summary information that is printed > > at the end of the output. This summary information includes: > > > > Aug 19 19:27:01 yourserver kernel: Summary: OOM Tasks considered:277 printed:143 minimum size:0kB totalpages:32791608kB > > This sounds like a good idea to limit the eligible process list size but > I am concerned that it might get misleading easily when there are many > small processes contributing to the OOM in the end. > > > * Add Enhanced Process Print Information > > -------------------------------------- > > Add OOM Debug code that prints additional detailed information about > > users processes that were considered for OOM killing for any print > > selected processes. The information is displayed for each user process > > that OOM prints in the output. > > > > This supplemental per user process information is very helpful for > > determing how process memory is used to allow OOM event root cause > > identifcation that might not otherwise be possible. > > > > Output information for enhanced user process entrys printed includes: > > - pid > > - parent pid > > - ruid > > - euid > > - tgid > > - Process State (S) > > - utime in seconds > > - stime in seconds > > - oom_score_adjust > > - task comm value (name of process) > > - Vmem KiB > > - MaxRss KiB > > - CurRss KiB > > - Pte KiB > > - Swap KiB > > - Sock KiB > > - Lib KiB > > - Text KiB > > - Heap KiB > > - Stack KiB > > - File KiB > > - Shmem KiB > > - Read Pages > > - Fault Pages > > - Lock KiB > > - Pinned KiB > > I can see some of these being interesting but I would rather pick up > those and add to the regular oom output rather than go over configuring > them. > Would be glad to add these to standard OOM output. One issue is there are extra bytes of output with more detail. So when constrained, to justify this we said we'd rather have lots of detail on the top 50 or so many consuming processes versus course information on all user processes. The task information will provide us counts of processes and measures of process creation that are very useful. > > > Configuring Patches: > > ------------------- > > OOM Debug and any options you want to use must first be configured so > > the code is included in your kernel. This requires selecting kernel > > config file options. You will find config options to select under: > > > > Kernel hacking ---> Memory Debugging ---> > > > > [*] Debug OOM > > [*] Debug OOM System State > > [*] Debug OOM System Tasks Summary > > [*] Debug OOM ARP Table > > [*] Debug OOM ND Table > > [*] Debug OOM Select Slabs Print > > [*] Debug OOM Slabs Select Always Print Enable > > [*] Debug OOM Enhanced Slab Print > > [*] Debug OOM Select Vmallocs Print > > [*] Debug OOM Select Process Print > > [*] Debug OOM Enhanced Process Print > > I really dislike these though. We already have zillions of debugging > options and the config space is enormous. Different combinations of them > make any compile testing a challenge and a lot of cpu cycles eaten. > Besides that, who is going to configure those in without using them > directly? Distributions are not going to enable without having all > options being disabled by default for example. > Oh I agree, I dislike configuration options, there are so many and when you upgrade you're like what now. That said, I understand their value when range from small embedded devices up to super computers and zillions of devices. I would be pleased to just have one configuration option or better yet just have the code be part of the standard system. So getting rid of any or all of that would be a pleasure. Quite honestly, we may argue for certain items but in general we're quite flexible. > > 12 files changed, 1339 insertions(+), 11 deletions(-) > > This must have a been a lot of work and I really appreciate that. > > On the other hand it is a lot of code to maintain (note that you are > usually introspecting deep internals of subsystems so changes would > have to be carefully considered here as well) without a very strong > demand. > > Sure it is a nice to have thing in some cases. I can imagine that some > of that information would have helped me when debugging some weird OOM > reports but I strongly suspect I would likely not have all necessary > pieces enabled because those were not reproducible. Having everything > on is just not usable due to amount of data. printk is not free and > we have seen cases where a lot of output just turned the machine into > unsuable state. If you have a reproducible OOMs then you can trigger > a panic and have the full state of the system to examine. So I am not > really convinced all this is going to be used to justify the maintenance > overhead. I can speak to many OOM events we have had to triage and root cause over the past 7+ years that I've been involved with. It is quite true that there is no single OOM report format that will allow every problem to be completely root caused. The OOM report cannot provide all the information a full dump provides. That said, the OOM report can give you an excellent start on where to look when you otherwise aren't sure where to look. With luck everything you need is in the OOM report and you root cause right there. I can give you all sort of examples of this. They're all anecdotal but I would expect that admin and support people in data centers see much of the same sorts of issues. Would welcome input from others too. Different environments certainly can vary. On the issue of reproducible OOMs verus non-reproducible, that is important to consider: First many OOMs we look at happen in the data center and they are not easily reproducible. The analogy I use we spend a lot of time having to drive with our tail lights. That is we do a postmortem with limited information after the fact. Why? We don't have the time or luxury to turn on panic on OOM and let the system reboot. In fact we very often have neither the time it takes to dump the system or the storage space to hold a full system dump, a shame as it is the best scenario for sure. If a switch locks up for a few seconds the routing protocols can time out and that can start a reconfiguration chain reaction in your data center that will not be well received. If we could take a full system dump every time we need to capture the state of the system you wouldn't need an OOM report. In fact where else in the Kernel does the Kernel produce a report? OOM events are an odd beast that for some systems are just an annoyance and on other systems can be quite painful. If you're lucky you can ignore the fact that OOM killed one of your tabs in chrome browser. Your not so lucky if a key process gets OOM killed causing a cascade of issues. The more pain you feel the more motivated you become to try and avoid future events. We're not touching situations where OOM events occur in clusters or periodically due to a persisting issue or lots of other OOM dramas that occur from time to time. For people who are unlucky and have to care about OOM events, you often can't reproduce these and you want to capture as much information as is reasonable so you can work what the cause was with the hope that you can prevent future events. How much information is reasonable and what information you want to record may vary. > All that being said, I do not think this is something we want to merge > without a really _strong_ usecase to back it. > I will supply any information that I can. Let me know specifics on what you need. I guess I can try an explain a justification for each option I sent and we can have a dialog as needed. That is at least a starting point. I was hoping that posting this code and starting a discussion might draw in both experts and others with an interest in the information that is produced for an OOM event. Our experience is that some additional information and the ability to adjust what is produced is valuable. We don't add new options all the time but making it easy to do so is helpful. It would be nice if everything was standard output but even optional configurable information is better than none. We can continue to mod our kernel but if others would benefit, we're happy to contribute to the best of our abilities. We're flexible enough to make any recommended improvements as well. Also, our implementation though we've been using it for some years, and it continues to evolve, is a reference implementation. Since the output is debugging information and we identify what system release and version produces the output with each event, we can adjust our scripts to deal without output changes as the system evolves. This is expected as systems and Linux continue to evolve and improve. We'd be happy to work with you and your colleagues to contribute any improvements that you can accept to help to improve the OOM Report output. Thank-you again for your time and consideration! Edward Chron Arista Networks > > Thanks! > -- > Michal Hocko > SUSE Labs