On Thu, Oct 05, 2023, Michael Roth wrote: > On Thu, Sep 21, 2023 at 02:34:46PM -0700, Sean Christopherson wrote: > > On Thu, Sep 21, 2023, Sean Christopherson wrote: > > > > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c > > > > index a819367434e9..01fb4ca861d0 100644 > > > > --- a/virt/kvm/guest_mem.c > > > > +++ b/virt/kvm/guest_mem.c > > > > @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start, > > > > static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len) > > > > { > > > > struct list_head *gmem_list = &inode->i_mapping->private_list; > > > > + struct address_space *mapping = inode->i_mapping; > > > > pgoff_t start = offset >> PAGE_SHIFT; > > > > pgoff_t end = (offset + len) >> PAGE_SHIFT; > > > > struct kvm_gmem *gmem; > > > > > > > > + /* > > > > + * punch hole may result in zeroing partial area. As pages can be > > > > + * encrypted, prohibit zeroing partial area. > > > > + */ > > > > + if (offset & ~PAGE_MASK || len & ~PAGE_MASK) > > > > + return -EINVAL; > > > > > > This should be unnecessary, kvm_gmem_fallocate() does > > > > > > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > > > return -EINVAL; > > > > > > before invoking kvm_gmem_punch_hole(). If that's not working, i.e. your test > > > fails, then that code needs to be fixed. I'll run your test to double-check, > > > but AFAICT this is unnecesary. > > > > I confirmed that the testcase passes without the extra checks. Just to close the > > loop, what prompted adding more checks to kvm_gmem_punch_hole()? > > I don't know if it's the same issue that Isaku ran into, but for SNP we > hit a similar issue with the truncate_inode_pages_range(lstart, lend) call. > > The issue in that case was a bit more subtle: > > - userspace does a hole-punch on a 4K range of its gmem FD, which happens > to be backed by a 2MB folio. > - truncate_inode_pages_range() gets called for that 4K range > - truncate_inode_pages_range() does special handling on the folios at the > start/end of the range in case they are partial and passes these to > truncate_inode_partial_folio(folio, lstart, lend). In this case, there's > just the 1 backing folio. But it *still* gets the special treatment, and > so gets passed to truncate_inode_partial_folio(). > - truncate_inode_partial_folio() will then zero that 4K range, even though > it is page-aligned, based on the following rationale in the comments: > > /* > * We may be zeroing pages we're about to discard, but it avoids > * doing a complex calculation here, and then doing the zeroing > * anyway if the page split fails. > */ > folio_zero_range(folio, offset, length); > > - after that, .invalidate_folio callback is issued, then the folio is split, > and the caller (truncate_inode_pages_range()) does another pass through > the whole range and can free the now-split folio then .free_folio callbacks > are issued. > > Because of that, we can't rely on .invalidate_folio/.free_folio to handle > putting the page back into a normal host-accessible state, because the > zero'ing will happen beforehand. Argh, and that causes an RMP violation #PF. FWIW, I don't *think* zeroing would be problematic for TDX. The page would get poisoned, but KVM would re-zero the memory with MOVDIR64B and flush the cache. > That's why we ended up needing to do this for SNP patches to make sure > arch-specific invalidation callbacks are issued before the truncation occurs: > > https://github.com/mdroth/linux/commit/4ebcc04b84dd691fc6daccb9b7438402520b0704#diff-77306411fdaeb7f322a1ca756dead9feb75363aa6117b703ac118576153ddb37R233 > > I'd planned to post those as a separate RFC to discuss, but when I came across > this it seemed like it might be relevant to what the TDX folks might ran into > here. > > If not for the zero'ing logic mentioned above, for SNP at least, the > .free_folio() ends up working pretty nicely for both truncation and fput(), > and even plays nicely with live update use-case where the destination gmem > instance shares the inode->i_mapping, since iput() won't trigger the > truncate_inode_pages_final() until the last reference goes away so we don't > have to do anything special in kvm_gmem_release() to determine when we > should/shouldn't issue the arch-invalidations to clean up things like the > RMP table. > > It seems like the above zero'ing logic could be reworked to only zero non-page > aligned ranges (as the comments above truncate_inode_pages_range() claim > should be the case), which would avoid the issue for the gmem use-case. But I > wonder if some explicit "dont-zero-these-pages" flag might be more robust. > > Or maybe there's some other way we should be going about this? Skipping the write seems like the obvious solution. An address_space flag, e.g. AS_INACCESSIBLE, would be the easiest thing to implement. Or maybe even make it AS_DONT_ZERO_ON_TRUNCATE_DAMMIT (mostly joking). Or a hook in address_space_operations to zero the folio, which conceptually is better in many ways, but feels like overkill.