Christoffer Dall <cdall@xxxxxxxxxx> writes:

> On Mon, Mar 27, 2017 at 02:31:44PM +0100, Punit Agrawal wrote:
>> Christoffer Dall <cdall@xxxxxxxxxx> writes:
>> 
>> > On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
>> >> Hi guys,
>> >> 
>> >> On 27/03/17 12:20, Punit Agrawal wrote:
>> >> > Christoffer Dall <cdall@xxxxxxxxxx> writes:
>> >> >> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
>> >> >>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
>> >> >>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>> >> >>> SIGBUS to any user space process using the page, and notify all the
>> >> >>> in-kernel users.
>> >> >>> 
>> >> >>> If the page corresponded with guest memory, KVM will unmap this page
>> >> >>> from its stage2 page tables. The user space process that allocated
>> >> >>> this memory may have never touched this page, in which case it may not
>> >> >>> be mapped, meaning SIGBUS won't be delivered.
>> >> >>> 
>> >> >>> When this happens, KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
>> >> >>> comes to process the stage2 fault.
>> >> >>> 
>> >> >>> Do as x86 does, and deliver the SIGBUS when we discover
>> >> >>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
>> >> >>> as this matches the user space mapping size.
>> >> >> 
>> >> >>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> >> >>> index 962616fd4ddd..9d1aa294e88f 100644
>> >> >>> --- a/arch/arm/kvm/mmu.c
>> >> >>> +++ b/arch/arm/kvm/mmu.c
>> >> >>> @@ -20,8 +20,10 @@
>> >> >>>  #include <linux/kvm_host.h>
>> >> >>>  #include <linux/io.h>
>> >> >>>  #include <linux/hugetlb.h>
>> >> >>> +#include <linux/sched/signal.h>
>> >> >>>  #include <trace/events/kvm.h>
>> >> >>>  #include <asm/pgalloc.h>
>> >> >>> +#include <asm/siginfo.h>
>> >> >>>  #include <asm/cacheflush.h>
>> >> >>>  #include <asm/kvm_arm.h>
>> >> >>>  #include <asm/kvm_mmu.h>
>> >> >>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>> >> >>>  	__coherent_cache_guest_page(vcpu, pfn, size);
>> >> >>>  }
>> >> >>>  
>> >> >>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
>> >> >>> +{
>> >> >>> +	siginfo_t info;
>> >> >>> +
>> >> >>> +	info.si_signo = SIGBUS;
>> >> >>> +	info.si_errno = 0;
>> >> >>> +	info.si_code = BUS_MCEERR_AR;
>> >> >>> +	info.si_addr = (void __user *)address;
>> >> >>> +
>> >> >>> +	if (hugetlb)
>> >> >>> +		info.si_addr_lsb = PMD_SHIFT;
>> >> >>> +	else
>> >> >>> +		info.si_addr_lsb = PAGE_SHIFT;
>> >> >>> +
>> >> >>> +	send_sig_info(SIGBUS, &info, current);
>> >> >>> +}
>> >> >>> +
>> >> >>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> >> >>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>> >> >>>  			  unsigned long fault_status)
>> >> >>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> >> >>>  	smp_rmb();
>> >> >>>  
>> >> >>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> >> >>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> >> >>> +		kvm_send_hwpoison_signal(hva, hugetlb);
>> >> >> 
>> >> >> The way this is called means that we'll only notify userspace of a huge
>> >> >> mapping if userspace is mapping hugetlbfs, and not because the stage2
>> >> >> mapping may or may not have used transparent huge pages when the error
>> >> >> was discovered. Is this the desired semantics?
>> >> 
>> >> No,
>> >> 
>> >> > I think so.
>> >> > 
>> >> > AFAIUI, transparent hugepages are split before being poisoned, while all
>> >> > the underlying pages of a hugepage are poisoned together, i.e., no
>> >> > splitting.
>> >> 
>> >> In which case I need to look into this some more!
>> >> 
>> >> My thinking was we should report the size that was knocked out of the stage2 to
>> >> avoid the guest repeatedly faulting until it has touched every guest-page-size
>> >> in the stage2 hole.
>> > 
>> > By signaling something at the fault path, I think it's going to be very
>> > hard to backtrack what the stage 2 page tables looked like when faults
>> > started happening, because I think these are completely decoupled events
>> > (the mmu notifier and the later fault).
>> > 
>> >> 
>> >> Reading the code in kvm/mmu.c, it looked like the mapping sizes would always
>> >> be the same as those used by userspace.
>> > 
>> > I think the mapping sizes should be the same between userspace and KVM,
>> > but the mapping size of a particular page (and associated pages) may
>> > vary over time.
>> 
>> Stage 1 and Stage 2 support different hugepage sizes. A larger Stage 1
>> page maps to multiple Stage 2 page table entries. For Stage 1, we
>> support PUD_SIZE, CONT_PMD_SIZE, PMD_SIZE and CONT_PTE_SIZE, while only
>> PMD_SIZE is supported for Stage 2.
>> 
>> >> 
>> >> If the page was split before KVM could have taken this fault I assumed it would
>> >> fault on the page-size mapping and hugetlb would be false.
>> > 
>> > I think you could have a huge page, which gets unmapped as a result of
>> > it getting split (perhaps because there was a failure on one page), and
>> > later, as you fault, you can discover a range which can be backed by
>> > hugetlbfs or transparent huge pages.
>> > 
>> > The question I don't know the answer to is how Linux behaves if a page
>> > is marked with hwpoison: if Linux never supports poisoning a THP as a
>> > whole and always marks an entire huge page in a hugetlbfs with the
>> > poison, then I think we're mostly good here. If not, we should make
>> > sure we align with whatever the rest of the kernel does.
>> 
>> AFAICT, a hugetlbfs page is poisoned as a whole while a thp is split
>> before poisoning. Quoting the comment near the top of memory_failure()
>> in mm/memory-failure.c:
>> 
>> 	/*
>> 	 * Currently errors on hugetlbfs pages are measured in hugepage units,
>> 	 * so nr_pages should be 1 << compound_order.  OTOH when errors are on
>> 	 * transparent hugepages, they are supposed to be split and error
>> 	 * measurement is done in normal page units.  So nr_pages should be one
>> 	 * in this case.
>> 	 */
>> 
>> >> (which is already
>> >> wrong for another reason, looks like I grabbed the variable before
>> >> transparent_hugepage_adjust() has had a go at it.)
>> > 
>> > yes, which is why I asked if you only care about hugetlbfs.
>> 
>> Based on the comment above, we should never get a poisoned page that is
>> part of a transparent hugepage.
>> 
>> >> >> Also notice that the hva is not necessarily aligned to the beginning of
>> >> >> the huge page, so can we be giving userspace wrong information by
>> >> >> pointing in the middle of a huge page and telling it there was an
>> >> >> address error in the size of the PMD?
>> >> > 
>> >> > I could be reading it wrong but I think we are fine here - the address
>> >> > (hva) is the location that faulted. And the lsb indicates the least
>> >> > significant bit of the faulting address (see man sigaction(2)). The
>> >> > receiver of the signal is expected to use the address and lsb to work
>> >> > out the extent of corruption.
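To make the expected consumer behaviour concrete, here is a minimal
userspace sketch of such a receiver. It is illustrative only: the
handler name and bail-out policy are made up, it assumes a glibc recent
enough to expose si_addr_lsb and BUS_MCEERR_* (hence _GNU_SOURCE), and
fprintf() is not async-signal-safe - only the si_addr/si_addr_lsb usage
is the point here:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical handler: works out the extent of the corruption from
 * si_addr and si_addr_lsb, as described above. */
static void hwpoison_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
		size_t len = (size_t)1 << info->si_addr_lsb;
		void *start = (void *)((unsigned long)info->si_addr & ~(len - 1));

		/* [start, start + len) is the region the kernel reports as
		 * broken. A VMM could discard and repopulate it, or forward
		 * the error to the guest; this sketch just reports and exits.
		 * (fprintf() is not async-signal-safe; fine for a demo.) */
		fprintf(stderr, "hwpoison at %p, %zu bytes lost\n", start, len);
	}
	_exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = hwpoison_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... map guest memory and run the VM ... */
	return 0;
}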
>> >> 
>> >> kill_proc() in mm/memory-failure.c does this too, but the address is set by
>> >> page_address_in_vma() in add_to_kill() of the same file. (I'll chat with Punit
>> >> off list.)
>> >> 
>> >> > Though I missed a subtlety while reviewing the patch before. The
>> >> > reported lsb should be for the userspace hugepage mapping (i.e., hva)
>> >> > and not for the stage 2.
>> >> 
>> >> I thought these were always supposed to be the same, and using hugetlb was a bug
>> >> because I didn't look closely enough at what is_vm_hugetlb_page() does.
>> 
>> See above.
>> 
>> >> > In light of this, I'd like to retract my Reviewed-by tag for this
>> >> > version of the patch as I believe we'll need to change the lsb
>> >> > reporting.
>> >> 
>> >> Sure, let's work out what this should be doing. I'm beginning to suspect x86's
>> >> 'always page size' was correct to begin with!
>> > 
>> > I had a sense of that too, but it would be good to understand how you
>> > mark an individual page within a hugetlbfs huge page with hwpoison...
>> 
>> I don't think it is possible to mark an individual page in a hugetlbfs
>> page - it's all or nothing.
>> 
>> AFAICT, the SIGBUS report is for user mappings and doesn't have to care
>> whether it's a Stage 2 hugetlb page or thp. And the lsb determination
>> should take the Stage 1 hugepage size into account - something along
>> the lines of the snippet from the previous email.
> 
> I think the lsb should indicate the size of the memory region known to
> be broken by the kernel - however that whole mechanism works.

Agreed. To reiterate and confirm that we are on the same page (this time
using the correct terminology):

* The kernel poisons pages in PAGE_SIZEd or hugepage-sized units,
  depending on where the poisoned address maps. If the affected location
  maps to a transparent hugepage, the thp is split and the PAGE_SIZEd
  unit corresponding to the location is poisoned.

* When sending a SIGBUS on encountering a poisoned pfn, the lsb should
  be:
  - PAGE_SHIFT, if the hva does not map to a hugepage
  - huge_page_shift(hstate_vma(vma)), if the hva belongs to a hugepage
  (see the sketch below)
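In code, the determination could look something like the below - an
untested sketch against the patch above, assuming user_mem_abort() is
changed to pass down the vma it already looks up under mmap_sem
(instead of the hugetlb flag), and that hugetlb support is built in so
is_vm_hugetlb_page()/hstate_vma()/huge_page_shift() are available:

static void kvm_send_hwpoison_signal(unsigned long address,
				     struct vm_area_struct *vma)
{
	siginfo_t info;

	info.si_signo = SIGBUS;
	info.si_errno = 0;
	info.si_code = BUS_MCEERR_AR;
	info.si_addr = (void __user *)address;

	/*
	 * Report the Stage 1 (userspace) mapping size: hugetlbfs pages
	 * are poisoned whole, so the entire huge page is unusable, while
	 * anything else (including a former thp, which is split before
	 * poisoning) is lost in PAGE_SIZE units.
	 */
	if (is_vm_hugetlb_page(vma))
		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
	else
		info.si_addr_lsb = PAGE_SHIFT;

	send_sig_info(SIGBUS, &info, current);
}

The caller would then do kvm_send_hwpoison_signal(hva, vma), which keeps
the reported region tied to what the kernel actually poisoned,
independent of how Stage 2 happened to be mapped.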
Hopefully that makes sense.

Punit

> 
> -Christoffer

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm