On 08.01.2013 16:46, Dave Allan wrote:
> On Tue, Jan 08, 2013 at 04:42:00PM +0100, Michal Privoznik wrote:
>> On 08.01.2013 16:24, Daniel P. Berrange wrote:
>>> On Tue, Jan 08, 2013 at 10:37:19AM +0100, Michal Privoznik wrote:
>>>> Currently, if there's no hard memory limit defined for a domain,
>>>> libvirt tries to calculate one, based on the domain definition and
>>>> a magic equation, and sets it upon domain startup. The rationale
>>>> was that if there's a memory leak or exploit in qemu, we should
>>>> prevent the host system from thrashing. However, the equation was
>>>> too tight, as it didn't reflect what the kernel counts into the
>>>> memory used by a process. Since many hosts do have swap, nobody
>>>> noticed anything: when the hard memory limit is reached, the
>>>> process can continue allocating memory in swap. However, if there
>>>> is no swap on the host, the process gets killed by the OOM killer.
>>>> In our case, that process is qemu.
>>>>
>>>> To prevent this, we need to relax the hard RSS limit. Moreover, we
>>>> should reflect more precisely the way the kernel accounts memory
>>>> for a process. That is, even kernel caches are counted within the
>>>> memory used by a process (within cgroups at least). Hence the magic
>>>> equation has to be changed:
>>>>
>>>>   limit = 1.5 * (domain memory + total video memory)
>>>>           + (32MB of cache per disk) + 200MB
>>>> ---
>>>>
>>>> There is a bit more that should be taken into account, e.g. shared
>>>> pages, where accounting is even more complicated:
>>>>
>>>>   "Shared pages are accounted on the basis of the first touch
>>>>    approach. The cgroup that first touches a page is accounted for
>>>>    the page." [1]
>>>>
>>>> I don't think we even want to try to reflect this in our code.
>>>> That's why the coefficient of domain memory has been lifted from
>>>> 1.02 to 1.5, in the hope that it will just be enough.
>>>>
>>>> 1: http://www.kernel.org/doc/Documentation/cgroups/memory.txt
>>>>
>>>>  src/qemu/qemu_cgroup.c | 15 +++++++++------
>>>>  1 file changed, 9 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
>>>> index 7faf025..16a9d7c 100644
>>>> --- a/src/qemu/qemu_cgroup.c
>>>> +++ b/src/qemu/qemu_cgroup.c
>>>> @@ -343,15 +343,18 @@ int qemuSetupCgroup(virQEMUDriverPtr driver,
>>>>          unsigned long long hard_limit = vm->def->mem.hard_limit;
>>>>
>>>>          if (!hard_limit) {
>>>> -            /* If there is no hard_limit set, set a reasonable
>>>> -             * one to avoid system trashing caused by exploited qemu.
>>>> -             * As 'reasonable limit' has been chosen:
>>>> -             *     (1 + k) * (domain memory + total video memory) + F
>>>> -             * where k = 0.02 and F = 200MB. */
>>>> +            /* If there is no hard_limit set, set a reasonable one to avoid
>>>> +             * system trashing caused by exploited qemu. As 'reasonable limit'
>>>> +             * has been chosen:
>>>> +             *     (1 + k) * (domain memory + total video memory) + (32MB for
>>>> +             *     cache per each disk) + F
>>>> +             * where k = 0.5 and F = 200MB. The cache for disks is important as
>>>> +             * kernel cache on the host side counts into the RSS limit. */
>>>>              hard_limit = vm->def->mem.max_balloon;
>>>>              for (i = 0; i < vm->def->nvideos; i++)
>>>>                  hard_limit += vm->def->videos[i]->vram;
>>>> -            hard_limit = hard_limit * 1.02 + 204800;
>>>> +            hard_limit = hard_limit * 1.5 + 204800;
>>>> +            hard_limit += vm->def->ndisks * 32768;
>>>>          }
>>>>
>>>>          rc = virCgroupSetMemoryHardLimit(cgroup, hard_limit);
>>>
>>> ACK,
>>>
>>> can't say I'm a fan of our heuristics but I don't see a better way
>>> yet. Let's see how this new limit copes.
>>>
>>> Daniel
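To see what the new magic works out to, here's a minimal, standalone
sketch with made-up numbers; the real code uses the domain definition
(vm->def), as in the patch above. All values are in KiB, which is what
libvirt uses here: 204800 KiB = 200MB and 32768 KiB = 32MB.

#include <stdio.h>

int main(void)
{
    /* Hypothetical domain: 4 GiB of RAM, one 16 MiB video card,
     * and two disks. */
    unsigned long long balloon = 4ULL * 1024 * 1024;
    unsigned long long vram = 16 * 1024;
    unsigned long long ndisks = 2;

    /* limit = 1.5 * (domain memory + total video memory)
     *         + 32MB per disk + 200MB */
    unsigned long long hard_limit = balloon + vram;
    hard_limit = hard_limit * 1.5 + 204800;
    hard_limit += ndisks * 32768;

    printf("hard limit: %llu KiB (~%.2f GiB)\n",
           hard_limit, hard_limit / (1024.0 * 1024.0));
    return 0;
}

For this domain the limit comes out at 6586368 KiB, roughly 6.3 GiB,
which is considerably more headroom than the old 1.02 coefficient
would have allowed.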
>>
>> Yeah, it's sort of magic. Pushed now. Thanks.
>
> How does one turn off the limits?
>
> Dave

Either disable the memory cgroup (e.g. by unmounting it), or set your
own limit in the domain XML (libvirt won't even try to calculate a new
one then).
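For the latter, this is roughly what the override looks like in the
domain XML (the 6 GiB value is just an example):

  <memtune>
    <!-- an explicit hard limit, in KiB; libvirt takes it as-is and
         skips the heuristic above -->
    <hard_limit unit='KiB'>6291456</hard_limit>
  </memtune>

Michal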