On Thu, Oct 01, 2020 at 08:43:58AM -0400, boris.ostrovsky@xxxxxxxxxx wrote:
>
> >>>>>>> Also, wrt KASLR stuff, that issue is still seen sometimes but I haven't had
> >>>>>>> bandwidth to dive deep into the issue and fix it.
> >>>> So what's the plan there? You first mentioned this issue early this year
> >>>> and judged by your response it is not clear whether you will ever spend
> >>>> time looking at it.
> >>>>
> >>> I do want to fix it and did do some debugging earlier this year just haven't
> >>> gotten back to it. Also, wanted to understand if the issue is a blocker to this
> >>> series?
> >>
> >> Integrating code with known bugs is less than ideal.
> >>
> > So for this series to be accepted, KASLR needs to be fixed along with other
> > comments of course?
>
>
> Yes, please.
>
>
> >>> I had some theories when debugging around this like if the random base address picked by kaslr for the
> >>> resuming kernel mismatches the suspended kernel and just jogging my memory, I didn't find that as the case.
> >>> Another hunch was if physical address of registered vcpu info at boot is different from what suspended kernel
> >>> has and that can cause CPU's to get stuck when coming online.
> >>
> >> I'd think if this were the case you'd have 100% failure rate. And we are
> >> also re-registering vcpu info on xen restore and I am not aware of any
> >> failures due to KASLR.
> >>
> > What I meant there wrt VCPU info was that VCPU info is not unregistered during hibernation,
> > so Xen still remembers the old physical addresses for the VCPU information, created by the
> > booting kernel. But since the hibernation kernel may have different physical
> > addresses for VCPU info and if mismatch happens, it may cause issues with resume.
> > During hibernation, the VCPU info register hypercall is not invoked again.
>
>
> I still don't think that's the cause but it's certainly worth having a look.
>

Hi Boris,

Apologies for picking this up after last year.

I did a deep dive on the above statement, and that is indeed what is
happening. I did some debugging around KASLR and hibernation using reboot
mode. My debug prints show that whenever the vcpu_info* address assigned to
a secondary vcpu in xen_vcpu_setup() at boot differs from the one in the
image, resume gets stuck for that vcpu in bringup_cpu(). That means we have
different addresses for &per_cpu(xen_vcpu_info, cpu) at boot and after
control jumps into the image.

I failed to get any prints after it got stuck in bringup_cpu(), and I do not
have an option to send a sysrq signal to the guest or to get a kdump.

This change is not observed in every hibernate-resume cycle. I am not sure
whether this is a bug or expected behavior. I am also contemplating the idea
that it may be a bug in Xen code that is triggered only when KASLR is
enabled, but I do not have substantial data to prove that.

Is it a coincidence that this always happens for the 1st vcpu?

Moreover, since the hypervisor is not aware that the guest is hibernated, and
it looks like a regular shutdown to dom0 during reboot mode, would
re-registering vcpu_info for secondary vcpus even be plausible? I could
definitely use some advice to debug this further.

Some printk's from my debugging:

At Boot:

xen_vcpu_setup: xen_have_vcpu_info_placement=1 cpu=1, vcpup=0xffff9e548fa560e0, info.mfn=3996246 info.offset=224,

Image Loads:
It ends up in the condition:

 xen_vcpu_setup()
 {
 ...
 if (xen_hvm_domain()) {
        if (per_cpu(xen_vcpu, cpu) == &per_cpu(xen_vcpu_info, cpu))
                return 0;
 }
 ...
 }

xen_vcpu_setup: checking mfn on resume cpu=1, info.mfn=3934806 info.offset=224, &per_cpu(xen_vcpu_info, cpu)=0xffff9d7240a560e0

This is tested on a c4.2xlarge [8vcpu 15GB mem] instance with a 5.10 kernel
running in the guest.

Thanks,
Anchal.
>
> -boris
>
>
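
For reference, the two debug prints above can be compared with a small standalone C sketch. It is not kernel code: the struct and helper names are hypothetical, and it assumes 4 KiB pages as on x86-64. It only checks that each printed info.offset matches the low 12 bits of the printed virtual address, and flags the case where the frame registered at boot differs from the one the resumed image would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86-64. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Hypothetical record of one line of the debug output. */
struct vcpu_info_reg {
    uint64_t vaddr;   /* printed virtual address of the per-cpu vcpu_info */
    uint64_t mfn;     /* printed as info.mfn */
    uint32_t offset;  /* printed as info.offset */
};

/* Byte offset of an address within its page. */
static inline uint32_t offset_in_page(uint64_t vaddr)
{
    return (uint32_t)(vaddr & (SKETCH_PAGE_SIZE - 1));
}

/* True when the printed offset agrees with the printed virtual address. */
static inline bool offset_consistent(const struct vcpu_info_reg *r)
{
    return offset_in_page(r->vaddr) == r->offset;
}

/*
 * True when the (mfn, offset) pair recorded at boot differs from the pair
 * the resumed image would register, i.e. the situation suspected above.
 */
static inline bool registration_stale(const struct vcpu_info_reg *boot,
                                      const struct vcpu_info_reg *image)
{
    assert(offset_consistent(boot) && offset_consistent(image));
    return boot->mfn != image->mfn || boot->offset != image->offset;
}
```

Fed the values from the prints (vaddr 0xffff9e548fa560e0, mfn 3996246 at boot; vaddr 0xffff9d7240a560e0, mfn 3934806 in the image; offset 224 in both), the offsets agree with the addresses and only the frame number differs, which is consistent with the per-cpu area landing on a different page while keeping the same layout.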