On Wed, Mar 13, 2024, Christian König wrote:
> On 13.03.24 at 15:48, Sean Christopherson wrote:
> > On Wed, Mar 13, 2024, Christian König wrote:
> > > On 13.03.24 at 14:34, Sean Christopherson wrote:
> > > > What Christoph is objecting to is that, in this series, KVM is explicitly adding
> > > > support for mapping non-compound (huge)pages into KVM guests. David is arguing
> > > > that Christoph's objection to _KVM_ adding support is unfair, because the real
> > > > problem is that the kernel already maps such pages into host userspace. I.e. if
> > > > the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> > > > and propagate to KVM's stage-2 page tables.
> > > And I have to agree with Christoph that this doesn't make much sense. KVM
> > > should *never* map (huge) pages from VMAs marked with VM_PFNMAP into KVM
> > > guests in the first place.
> > >
> > > What it should do instead is to mirror the PFN from the host page tables
> > > into the guest page tables.
> > That's exactly what this series does. Christoph is objecting to KVM playing nice
> > with non-compound hugepages, as he feels that such mappings should not exist
> > *anywhere*.
>
> Well Christoph is right those mappings shouldn't exists and they also don't
> exists.
>
> What happens here is that a driver has allocated some contiguous memory to
> do DMA with. And then some page table is pointing to a PFN inside that
> memory because userspace needs to provide parameters for the DMA transfer.
>
> This is *not* a mapping of a non-compound hugepage, it's simply a PTE
> pointing to some PFN.

Yes, I know. And David knows. By "such mappings" I did not mean "huge PMD
mappings that point at non-compound pages", I meant "any mapping in the host
userspace VMAs and page tables that points at memory that is backed by a
larger-than-order-0, non-compound allocation".

And even then, the whole larger-than-order-0 mapping is not something we on the
KVM side care about, at all.
The _only_ new thing KVM is trying to do in this series is to allow
mapping non-refcounted struct page memory into KVM guests. Those details were
brought up purely because they provide context on how/why such non-refcounted
pages exist.

> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
> memory the driver has allocate.
>
> > I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
> > we should instead fix the TTM allocations. And David pointed out that that was
> > tried and got NAK'd.
>
> Well as far as I can see Christoph rejects the complexity coming with the
> approach of sometimes grabbing the reference and sometimes not.

Unless I've wildly misread multiple threads, that is not Christoph's objection.
From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@xxxxxxxxxxxxx):

  On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
  >
  > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
  > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
  > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
  > > non-reference-counted pages"). I don't think it makes any sense whatsoever to
  > > remove that code and assume every driver in existence will do the right thing.
  >
  > Agreed.
  > >
  > > With the cleanups done, playing nice with non-refcounted paged instead of outright
  > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
  > > maintenance cost.
  >
  > I tend to strongly disagree with that, though. We can't just let these
  > non-refcounted pages spread everywhere and instead need to fix their
  > usage.

> And I have to agree that this is extremely odd.

Yes, it's odd and not ideal. But with nested virtualization, KVM _must_ "map"
pfns directly into the guest via fields in the control structures that are
consumed by hardware. I.e.
pfns are exposed to the guest in an "out-of-band" structure that is NOT part
of the stage-2 page tables. And wiring those up to the MMU notifiers is
extremely difficult for a variety of reasons[*].

Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
to grab a reference to the struct page while the out-of-band mapping exists,
i.e. to pin the page to prevent use-after-free. And KVM's historical ABI is to
support any refcounted page for these out-of-band mappings, regardless of
whether the page was obtained by gup() or follow_pte().

Thus, to support non-refcounted VM_PFNMAP pages without breaking existing
userspace, KVM resorts to conditionally grabbing references and disallowing
non-refcounted pages from being inserted into the out-of-band mappings.

But again, I don't think these details are relevant to Christoph's objection.

[*] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@xxxxxxxxxx
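
To make the "conditionally grab a reference" policy described above concrete,
here is a minimal userspace sketch. It is NOT code from the series: struct
model_page, model_lookup_pfn(), model_oob_map() and model_oob_unmap() are
hypothetical names, the refcount is a plain int rather than the kernel's
struct page machinery, and the -EINVAL return value is an assumption chosen
only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; not the kernel's definition. */
struct model_page {
	bool refcounted;	/* false models a non-refcounted VM_PFNMAP page */
	int refcount;		/* only meaningful when refcounted is true */
};

/*
 * Model of a pfn lookup: grab a reference only when the backing page is
 * refcounted.  Returns true if a reference was taken, in which case the
 * caller is responsible for dropping it later.
 */
static bool model_lookup_pfn(struct model_page *page)
{
	if (!page->refcounted)
		return false;

	page->refcount++;
	return true;
}

/*
 * Model of an out-of-band mapping, i.e. a pfn handed to hardware outside of
 * the stage-2 page tables.  The page must stay pinned while the mapping
 * exists, so a page that cannot be pinned is rejected outright.
 */
static int model_oob_map(struct model_page *page)
{
	if (!model_lookup_pfn(page))
		return -EINVAL;	/* assumed error code, for illustration only */

	return 0;
}

/* Drop the pin taken by a successful model_oob_map(). */
static void model_oob_unmap(struct model_page *page)
{
	page->refcount--;
}
```

In this model a non-refcounted page can still be looked up (and so mirrored
into page tables that are covered by MMU notifiers), but it is never accepted
for an out-of-band mapping, because there is nothing to pin.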