Christoffer Dall <cdall@xxxxxxxxxx> writes:

> On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
>> Hi guys,
>>
>> On 27/03/17 12:20, Punit Agrawal wrote:
>> > Christoffer Dall <cdall@xxxxxxxxxx> writes:
>> >> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
>> >>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications
>> >>> for broken memory can call memory_failure() in mm/memory-failure.c to
>> >>> deliver SIGBUS to any user space process using the page, and notify all
>> >>> the in-kernel users.
>> >>>
>> >>> If the page corresponded with guest memory, KVM will unmap this page
>> >>> from its stage2 page tables. The user space process that allocated
>> >>> this memory may have never touched this page, in which case it may not
>> >>> be mapped, meaning SIGBUS won't be delivered.
>> >>>
>> >>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
>> >>> comes to process the stage2 fault.
>> >>>
>> >>> Do as x86 does, and deliver the SIGBUS when we discover
>> >>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
>> >>> as this matches the user space mapping size.
>> >>
>> >>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> >>> index 962616fd4ddd..9d1aa294e88f 100644
>> >>> --- a/arch/arm/kvm/mmu.c
>> >>> +++ b/arch/arm/kvm/mmu.c
>> >>> @@ -20,8 +20,10 @@
>> >>>  #include <linux/kvm_host.h>
>> >>>  #include <linux/io.h>
>> >>>  #include <linux/hugetlb.h>
>> >>> +#include <linux/sched/signal.h>
>> >>>  #include <trace/events/kvm.h>
>> >>>  #include <asm/pgalloc.h>
>> >>> +#include <asm/siginfo.h>
>> >>>  #include <asm/cacheflush.h>
>> >>>  #include <asm/kvm_arm.h>
>> >>>  #include <asm/kvm_mmu.h>
>> >>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>> >>>  	__coherent_cache_guest_page(vcpu, pfn, size);
>> >>>  }
>> >>>
>> >>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
>> >>> +{
>> >>> +	siginfo_t info;
>> >>> +
>> >>> +	info.si_signo = SIGBUS;
>> >>> +	info.si_errno = 0;
>> >>> +	info.si_code = BUS_MCEERR_AR;
>> >>> +	info.si_addr = (void __user *)address;
>> >>> +
>> >>> +	if (hugetlb)
>> >>> +		info.si_addr_lsb = PMD_SHIFT;
>> >>> +	else
>> >>> +		info.si_addr_lsb = PAGE_SHIFT;
>> >>> +
>> >>> +	send_sig_info(SIGBUS, &info, current);
>> >>> +}
>> >>> +
>> >>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> >>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>> >>>  			  unsigned long fault_status)
>> >>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> >>>  	smp_rmb();
>> >>>
>> >>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> >>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> >>> +		kvm_send_hwpoison_signal(hva, hugetlb);
>> >>
>> >> The way this is called means that we'll only notify userspace of a huge
>> >> mapping if userspace is mapping hugetlbfs, and not because the stage2
>> >> mapping may or may not have used transparent huge pages when the error
>> >> was discovered. Is this the desired semantics?
>>
>> No,
>>
>> > I think so.
>> >
>> > AFAIUI, transparent hugepages are split before being poisoned while all
>> > the underlying pages of a hugepage are poisoned together, i.e., no
>> > splitting.
>>
>> In which case I need to look into this some more!
>>
>> My thinking was we should report the size that was knocked out of the
>> stage2 to avoid the guest repeatedly faulting until it has touched every
>> guest-page-size in the stage2 hole.
>
> By signaling something at the fault path, I think it's going to be very
> hard to backtrack what the stage 2 page tables looked like when faults
> started happening, because I think these are completely decoupled events
> (the mmu notifier and the later fault).
>
>> Reading the code in kvm/mmu.c it looked like the mapping sizes would
>> always be the same as those used by userspace.
>
> I think the mapping sizes should be the same between userspace and KVM,
> but the mapping size of a particular page (and associated pages) may
> vary over time.

Stage 1 and stage 2 support different hugepage sizes. A larger stage 1
page maps to multiple stage 2 page table entries. For stage 1, we
support PUD_SIZE, CONT_PMD_SIZE, PMD_SIZE and CONT_PTE_SIZE, while only
PMD_SIZE is supported for stage 2.

>
>> If the page was split before KVM could have taken this fault I assumed
>> it would fault on the page-size mapping and hugetlb would be false.
>
> I think you could have a huge page, which gets unmapped as a result of
> it getting split (perhaps because there was a failure on one page), and
> later, as you fault, you can discover a range which can be a hugetlbfs
> or transparent huge page.
>
> The question I don't know the answer to is how Linux behaves if a page
> is marked with hwpoison. In that case, if Linux never supports THP and
> always marks an entire huge page in a hugetlbfs with the poison, then I
> think we're mostly good here. If not, we should make sure we align with
> whatever the rest of the kernel does.

AFAICT, a hugetlbfs page is poisoned as a whole while a THP is split
before poisoning. Quoting the comment near the top of memory_failure()
in mm/memory-failure.c:

	/*
	 * Currently errors on hugetlbfs pages are measured in hugepage units,
	 * so nr_pages should be 1 << compound_order. OTOH when errors are on
	 * transparent hugepages, they are supposed to be split and error
	 * measurement is done in normal page units. So nr_pages should be one
	 * in this case.
	 */

>
>> (which is already wrong for another reason, looks like I grabbed the
>> variable before transparent_hugepage_adjust() has had a go at it.)
>
> Yes, which is why I asked if you only care about hugetlbfs.

Based on the comment above, we should never get a poisoned page that is
part of a transparent hugepage.

>>
>> >> Also notice that the hva is not necessarily aligned to the beginning
>> >> of the huge page, so can we be giving userspace wrong information by
>> >> pointing in the middle of a huge page and telling it there was an
>> >> address error in the size of the PMD?
>> >
>> > I could be reading it wrong but I think we are fine here - the address
>> > (hva) is the location that faulted. And the lsb indicates the least
>> > significant bit of the faulting address (see man sigaction(2)). The
>> > receiver of the signal is expected to use the address and lsb to work
>> > out the extent of corruption.
>>
>> kill_proc() in mm/memory-failure.c does this too, but the address is
>> set by page_address_in_vma() in add_to_kill() of the same file. (I'll
>> chat with Punit off list.)
>>
>> > Though I missed a subtlety while reviewing the patch before. The
>> > reported lsb should be for the userspace hugepage mapping (i.e., hva)
>> > and not for the stage 2.
>>
>> I thought these were always supposed to be the same, and using hugetlb
>> was a bug because I didn't look closely enough at what
>> is_vm_hugetlb_page() does.

See above.
>>
>> > In light of this, I'd like to retract my Reviewed-by tag for this
>> > version of the patch as I believe we'll need to change the lsb
>> > reporting.
>>
>> Sure, let's work out what this should be doing. I'm beginning to
>> suspect x86's 'always page size' was correct to begin with!
>
> I had a sense of that too, but it would be good to understand how you
> mark an individual page within a hugetlbfs huge page with hwpoison...

I don't think it is possible to mark an individual page in a hugetlbfs
page - it's all or nothing.

AFAICT, the SIGBUS report is for user mappings and doesn't have to care
whether the stage 2 mapping is backed by a hugetlb page or a THP. And
the lsb determination should take the stage 1 hugepage size into
account - something along the lines of the snippet from the previous
email (a sketch along these lines appears after this message).

Hope that makes sense.

Punit

>
> Thanks,
> -Christoffer
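
For illustration, a minimal sketch of the direction described above:
derive si_addr_lsb from the stage 1 (userspace) mapping rather than
from the stage 2 mapping size. This is not the actual snippet referred
to in the thread; passing the vma down from user_mem_abort() is an
assumed piece of plumbing, and the whole thing is a sketch rather than
a final patch:

	/*
	 * Hypothetical sketch: report the lsb of the userspace (stage 1)
	 * mapping. For a hugetlbfs vma, huge_page_shift(hstate_vma(vma))
	 * gives the shift of the hugepage size backing the vma. Everything
	 * else falls back to PAGE_SHIFT, since THP is split before an
	 * individual page is poisoned (see the memory_failure() comment
	 * quoted above).
	 */
	static void kvm_send_hwpoison_signal(unsigned long address,
					     struct vm_area_struct *vma)
	{
		siginfo_t info;

		info.si_signo = SIGBUS;
		info.si_errno = 0;
		info.si_code  = BUS_MCEERR_AR;
		info.si_addr  = (void __user *)address;

		if (is_vm_hugetlb_page(vma))
			info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
		else
			info.si_addr_lsb = PAGE_SHIFT;

		send_sig_info(SIGBUS, &info, current);
	}

mmu.c already includes <linux/hugetlb.h> (see the diff above), which
provides is_vm_hugetlb_page(), hstate_vma() and huge_page_shift().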
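
On the receiving side, a hypothetical userspace handler shows how a VMM
might consume the address/lsb pair per the sigaction(2) contract the
thread refers to; the handler name and the stderr reporting are
illustrative only:

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	/*
	 * si_addr_lsb is the least significant bit of the reported address,
	 * so the poisoned range spans 1 << si_addr_lsb bytes starting at
	 * si_addr rounded down to that boundary.
	 */
	static void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
	{
		if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
			uintptr_t len   = (uintptr_t)1 << si->si_addr_lsb;
			uintptr_t start = (uintptr_t)si->si_addr & ~(len - 1);

			/* fprintf() is not async-signal-safe; acceptable for
			 * a sketch that exits immediately afterwards. */
			fprintf(stderr, "hwpoison: [%#lx, %#lx)\n",
				(unsigned long)start,
				(unsigned long)(start + len));
		}
		_exit(EXIT_FAILURE);
	}

	int main(void)
	{
		struct sigaction sa = { .sa_flags = SA_SIGINFO };

		sa.sa_sigaction = sigbus_handler;
		sigemptyset(&sa.sa_mask);
		sigaction(SIGBUS, &sa, NULL);

		/* ... mmap() guest memory and run the VM as usual ... */
		return 0;
	}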