Re: Memory fragmentation and kvm_alloc_stage2_pgd

Jungseok Lee <jungseoklee85@xxxxxxxxx> · Wed, 13 Aug 2014 23:12:59 +0900

On Aug 12, 2014, at 12:36 AM, Jungseok Lee wrote:
> On Aug 12, 2014, at 12:07 AM, Christoffer Dall wrote:
>> On Mon, Aug 11, 2014 at 11:23:04PM +0900, Jungseok Lee wrote:
>>> On Aug 11, 2014, at 8:35 PM, Christoffer Dall wrote:
>>>> On Mon, Aug 11, 2014 at 12:24:35PM +0100, Richard W.M. Jones wrote:
>>>>> On Mon, Aug 11, 2014 at 01:20:46PM +0200, Christoffer Dall wrote:
>>>>>> On Sun, Aug 10, 2014 at 02:24:04PM +0100, Richard W.M. Jones wrote:
>>>>>>> kvm_alloc_stage2_pgd has to do an order 9 allocation, i.e. 512
>>>>>>> contiguous pages, I think.
>>>>>>> 
>>>>>>> This often leads to problems running qemu when memory is relatively
>>>>>>> low, e.g. if you have one VM running, a healthy number of host
>>>>>>> applications, and perhaps "just" 4GB free, and then you decide to run
>>>>>>> the libguestfs test suite.
>>>>>>> 
>>>>>>> Any suggestions how to deal with this?
>>>>>>> 
>>>>>> I'm not familiar with the libguestfs test suite, but are you saying you
>>>>>> have 4GB of free physical memory and when you start your first VM then
>>>>>> you get this error?  That sounds unlikely to me.
>>>>> 
>>>>> No, it runs hundreds of appliances (not all at the same time).  Some
>>>>> fail.
>>>>> 
>>>>> It seems to be a memory fragmentation issue, rather than the absolute
>>>>> free memory.
>>>>> 
>>>> Ok, that's what I thought.  You can probably hack around it by reducing
>>>> S2_PGD_ORDER to whatever is accessible by the VMs you wish to run (as
>>>> Jungseok also points out), but I'm afraid an upstream solution is
>>>> probably not ready before the next merge window opens, at least.
>>> 
>>> In the case of ARM64 KVM, S2_PGD_ORDER can be reduced in the following way.
>>> 
>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>> @@ -62,7 +62,28 @@
>>>   * Align KVM with the kernel's view of physical memory. Should be
>>>   * 40bit IPA, with PGD being 8kB aligned in the 4KB page configuration.
>>>   */
>>> -#define KVM_PHYS_SHIFT PHYS_MASK_SHIFT
>>> +static inline int kvm_get_pa_range(void)
>>> +{
>>> +       int pa_range = read_cpuid(ID_AA64MMFR0_EL1) & 0xf;
>>> +
>>> +       switch (pa_range) {
>>> +       case 0:
>>> +               return 32;
>>> +       case 1:
>>> +               return 36;
>>> +       case 2:
>>> +               return 40;
>>> +       case 3:
>>> +               return 42;
>>> +       case 4:
>>> +               return 44;
>>> +       case 5:
>>> +               return 48;
>>> +       default:
>>> +               return -EINVAL;
>>> +       }
>>> +}
>>> +#define KVM_PHYS_SHIFT kvm_get_pa_range()
>>>  #define KVM_PHYS_SIZE  (1UL << KVM_PHYS_SHIFT)
>>>  #define KVM_PHYS_MASK  (KVM_PHYS_SIZE - 1UL)
>>> 
>>> The code limits the guest's address space to at most the host's
>>> physical address space. For example, if the host runs on Cortex-A57,
>>> the IPA size is set to 44 bits, not 48.
>>> 
>>> If this approach looks reasonable, I will post it once 3.17-rc1 is out.
>>> If not, please ignore it or use it as a hack.
>>> 
>> 
>> Did you check what happens when handling a stage-2 translation fault due
>> to the input address being larger than the address space specified by
>> the T0SZ field?
> 
> I will check it carefully.
> 
>> My feeling is that this should only be included in a proper rework of
>> the supported guest physical address sizes.
> 
> I agree. I would just like to figure out the right approach.
> Thanks for the comment!

As Christoffer points out, the T0SZ field needs to be considered together
with this change.

- Jungseok Lee
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm