Have you looked at memory cgroups and using them to set limits on VMs?
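(For anyone who hasn't used them, capping a qemu process with the
cgroup-v1 memory controller looks roughly like the sketch below - the
mount point, group name and PID are all made up - but as you'll see,
guest memory usage wasn't the problem here.)

# Sketch only: put one qemu-kvm process under a cgroup-v1 memory cap.
# The mount point, group name and PID are examples, not real values;
# needs root, and assumes the memory controller is already mounted.
import os

CGROUP = "/sys/fs/cgroup/memory/kvm-guest1"   # assumed mount point + group name
QEMU_PID = 12345                              # example PID of the guest's qemu-kvm

os.makedirs(CGROUP, exist_ok=True)

# Hard RAM cap for everything in this group (8 GiB here, just an example).
with open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w") as f:
    f.write(str(8 * 1024**3))

# Move the guest's qemu process into the group so the cap applies to it.
with open(os.path.join(CGROUP, "tasks"), "w") as f:
    f.write(str(QEMU_PID))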
The problem was *NOT* that my VMs exhausted all memory. I know that is
what "normally" triggers the oom-killer, but you have to understand that
mine was a very different scenario, hence I wanted to bring it to
people's attention. I had about 10GB of *FREE* HIGHMEM and 34GB of *FREE*
SWAP when the oom-killer was activated - yep, didn't make sense to me
either. If you want to study the logs, see:
https://bugzilla.kernel.org/show_bug.cgi?id=15058
Looks like the problem was LOWMEM exhaustion triggering the oom-killer -
which is dumb, because it was cache that was exhausting LOWMEM, and
killing userland processes isn't a great way to deal with that.
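For anyone who wants to keep an eye on this, the low zone shows up as
LowTotal/LowFree in /proc/meminfo on a 32bit (HIGHMEM) kernel. A rough
sketch of the sort of check I mean - the 10% threshold is just an
arbitrary number, not anything the kernel uses:

# Rough sketch: watch the 32bit kernel's low zone via /proc/meminfo.
# LowTotal/LowFree only exist on HIGHMEM kernels; the 10% warning
# threshold is an arbitrary example, not a kernel limit.
def meminfo():
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0])   # values are in kB
    return fields

m = meminfo()
if "LowTotal" in m and m["LowFree"] < m["LowTotal"] * 0.10:
    print("LOWMEM getting tight: %d kB free of %d kB"
          % (m["LowFree"], m["LowTotal"]))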
My VMs generally allocate all their resources at start-up and that's it.
Committed_AS: 14345016 kB
I tried "vm.overcommit_memory=2" and that didn't help. On a 48GB system
the oom-killer should NEVER be invoked with that kind of memory profile -
it's a quirk of running a 32bit system with *so* much memory, and the way
pre-2.6.33 kernels handled LOWMEM.
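For reference, the numbers mode 2 works from are both in /proc/meminfo:
Committed_AS against CommitLimit (by default swap plus overcommit_ratio%
of RAM). A quick sketch that only reads and prints the headroom, nothing
else:

# Sketch: with vm.overcommit_memory=2 the kernel refuses new allocations
# once Committed_AS would pass CommitLimit (default: swap + 50% of RAM,
# per vm.overcommit_ratio).  This only reads /proc/meminfo.
values = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, rest = line.split(":", 1)
        values[key] = int(rest.split()[0])       # kB

print("CommitLimit : %10d kB" % values["CommitLimit"])
print("Committed_AS: %10d kB" % values["Committed_AS"])
print("headroom    : %10d kB" % (values["CommitLimit"] - values["Committed_AS"]))

The commit accounting is system-wide and knows nothing about the low
zone, which is presumably why mode 2 made no difference here.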
We've now moved all VM guests onto one server in preparation for a
re-install of the other with a 64bit host O/S.
Tests with 2.6.33.3 (+ latest qemu) appear to show this issue is fixed
in the newer kernel (I can see it has much improved LOWMEM management),
but we've only been running it for a few days, and it can take 3 to 4
weeks to trigger.
FYI: We run about 100 VM guests on seven VM hosts in five data centres -
mostly production, some development. We've been using KVM in production
for a while now - starting at about KVM-82 on 2.6.28 - and our oldest
live systems now are two on KVM-84 on 2.6.28.4, which are rock solid
(one gets more punishment than it deserves) - but they only have 16GB,
so they aren't seeing LOWMEM exhaustion because their memory map is *so*
much smaller.
James