Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.

Hi Paolo,

Thank you for the detailed feedback. I will send an updated version of the patch soon. A few comments below:

On Monday, November 21, 2016 03:42:23 PM Paolo Bonzini wrote:
> > @@ -576,20 +617,34 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> >  	 */
> >  	if (spte_is_bit_changed(old_spte, new_spte,
> >                                  shadow_accessed_mask | shadow_dirty_mask))
> > -		ret = true;
> > +		flush = true;
> >  
> > -	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
> > +	writable_cleared = is_writable_pte(old_spte) &&
> > +			   !is_writable_pte(new_spte);
> > +	acc_track_enabled = !is_access_track_spte(old_spte) &&
> > +			    is_access_track_spte(new_spte);
> > +
> > +	if (writable_cleared || acc_track_enabled)
> > +		flush = true;
> > +
> > +	if (shadow_accessed_mask ?
> > +	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
> > +	    acc_track_enabled)
> 
> Please introduce a new function spte_is_access_tracking_enabled(u64 
> old_spte, u64 new_spte) and use it here:
> 
> 	if (shadow_accessed_mask ?
> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
> 	    spte_is_access_tracking_enabled(old_spte, new_spte)) {
> 		flush |= !shadow_accessed_mask;
> 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> 	}
> 

I think we can just set flush = true in the then block instead of 
flush |= !shadow_accessed_mask.

And while we are at it, is there any reason to flush the TLB when setting the
A or D bit in a PTE? If not, we can remove this earlier block, since the
clearing case is now handled by the separate if blocks for accessed and dirty:

	if (spte_is_bit_changed(old_spte, new_spte,
                           shadow_accessed_mask | shadow_dirty_mask))
		flush = true;

Also, instead of spte_is_access_tracking_enabled(), I’ve added is_accessed_spte
as you suggested later and used that here as well.
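
For reference, a self-contained sketch of the shape is_accessed_spte takes,
using stand-in bit positions rather than the real KVM mask definitions
(shadow_accessed_mask and the access-track marker are configured at runtime
and depend on the host CPU, so the constants below are purely illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in values; the real masks are set up at runtime by KVM. */
static uint64_t shadow_accessed_mask = 1ull << 5;   /* 0 on EPT without A/D bits */
static const uint64_t acc_track_value = 1ull << 62; /* hypothetical marker bit */

static bool is_access_track_spte(uint64_t spte)
{
	return (spte & acc_track_value) != 0;
}

/*
 * With hardware A bits, "accessed" means the A bit is set. Without them,
 * any SPTE that is not marked for access tracking counts as accessed.
 */
static bool is_accessed_spte(uint64_t spte)
{
	return shadow_accessed_mask ? (spte & shadow_accessed_mask) != 0
				    : !is_access_track_spte(spte);
}
```

The same helper then serves both the A-bit and the access-tracking case at
every call site, which is what lets the ternary expressions collapse.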

> > -	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
> > +
> > +	if (shadow_dirty_mask ?
> > +	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
> > +	    writable_cleared)
> 
> writable_cleared can be inline here and written as
> 
> 	spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)
> 
> so
> 
> 	if (shadow_dirty_mask ?
> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
> 	    spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)) {
> 		flush |= !shadow_dirty_mask;
> 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> 	}
> >  		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> >  
> > -	return ret;
> > +	return flush;
> >  }
> 
> But please anticipate the changes to this function, except for 
> introducing is_access_track_spte of course, to a separate function.

Sorry, I didn’t quite understand what you meant by the last sentence, but I
have restructured the function like this:

	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
		flush = true;
		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
	}

	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
		flush = true;
		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
	}
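
As a sanity check on this structure, here is a toy model (plain uint64_t bit
flags standing in for the real SPTE layout, booleans standing in for the
kvm_set_pfn_* calls) showing that clearing A/D requests a flush while merely
setting those bits does not:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in bit positions, not the real SPTE layout. */
#define SPTE_A (1ull << 5) /* accessed */
#define SPTE_D (1ull << 6) /* dirty */

static bool is_accessed_spte(uint64_t spte) { return (spte & SPTE_A) != 0; }
static bool is_dirty_spte(uint64_t spte)    { return (spte & SPTE_D) != 0; }

/* Returns whether the caller needs to flush remote TLBs. */
static bool spte_ad_flush(uint64_t old_spte, uint64_t new_spte,
			  bool *mark_accessed, bool *mark_dirty)
{
	bool flush = false;

	*mark_accessed = *mark_dirty = false;

	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
		flush = true;
		*mark_accessed = true;	/* kvm_set_pfn_accessed() in the patch */
	}
	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
		flush = true;
		*mark_dirty = true;	/* kvm_set_pfn_dirty() in the patch */
	}
	return flush;
}
```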

> >  static int mmu_spte_clear_track_bits(u64 *sptep)
> >  {
> > @@ -604,6 +659,13 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
> >  	if (!is_shadow_present_pte(old_spte))
> >  		return 0;
> >  
> > +	/*
> > +	 * For access tracking SPTEs, the pfn was already marked accessed/dirty
> > +	 * when the SPTE was marked for access tracking, so nothing to do here.
> > +	 */
> > +	if (is_access_track_spte(old_spte))
> > +		return 1;
> 
> This should go after the "WARN_ON", since that's a valuable check.  In 
> addition, I think it's a good idea to keep similar idioms between 
> mmu_spte_update and mmu_spte_clear_track_bits, like this:
> 
> 	if (shadow_accessed_mask
> 	    ? old_spte & shadow_accessed_mask
> 	    : !is_access_track_spte(old_spte))
> 		kvm_set_pfn_accessed(pfn);
>         if (shadow_dirty_mask
> 	    ? old_spte & shadow_dirty_mask
> 	    : old_spte & PT_WRITABLE_MASK)
>                 kvm_set_pfn_dirty(pfn);
> 
> 	return 1;
> 
> or (you pick)
> 
> 	if (shadow_accessed_mask
> 	    ? !(old_spte & shadow_accessed_mask)
> 	    : is_access_track_spte(old_spte))
> 		return 1;
> 
> 	kvm_set_pfn_accessed(pfn);
>         if (shadow_dirty_mask
> 	    ? old_spte & shadow_dirty_mask
> 	    : old_spte & PT_WRITABLE_MASK)
>                 kvm_set_pfn_dirty(pfn);
> 
> 	return 1;

I’ve replaced the checks with if (is_accessed_spte(...)) / if (is_dirty_spte(...)).

> > +	/*
> > +	 * Any PTE marked for access tracking should also be marked for dirty
> > +	 * tracking (by being non-writable)
> > +	 */
> > +	spte &= ~PT_WRITABLE_MASK;
> 
> This should be implicit in the definition of 
> shadow_acc_track_mask/value, so it's not necessary (or it can be a 
> WARN_ONCE).

It can't be handled by shadow_acc_track_mask/value, but I have changed
shadow_acc_track_saved_bits_mask to save only the R/X bits, which achieves the
same result.

> 
> > +	spte &= ~(shadow_acc_track_saved_bits_mask <<
> > +		  shadow_acc_track_saved_bits_shift);
> 
> Should there be a WARN_ON if these bits are not zero?

No, these bits can be non-zero, left over from a previous invocation of
mark_spte_for_access_track; they are not cleared when the PTE is restored to
its normal state. (Though we could clear them on restore and then have a
WARN_ONCE here.)
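
To make the lifetime of the saved bits concrete, here is a toy round-trip
(hypothetical bit positions, shift, and helper names; the real code uses the
runtime-configured shadow_acc_track_* values). Note that after restoring, the
saved-bits region still holds the stale copy, which is why non-zero bits there
are expected:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low permission bits, saved copies shifted high. */
#define PT_PRESENT_MASK  (1ull << 0)
#define PT_WRITABLE_MASK (1ull << 1)
#define SHADOW_X_MASK    (1ull << 2)

static const uint64_t saved_bits_mask  = PT_PRESENT_MASK | SHADOW_X_MASK; /* R/X only */
static const unsigned saved_bits_shift = 52;
static const uint64_t acc_track_value  = 1ull << 62; /* hypothetical marker */

static uint64_t mark_spte_for_access_track(uint64_t spte)
{
	/* Drop any stale saved copy, then stash the current R/X bits. */
	spte &= ~(saved_bits_mask << saved_bits_shift);
	spte |= (spte & saved_bits_mask) << saved_bits_shift;
	/* Saving only R/X means W is simply dropped: dirty tracking for free. */
	spte &= ~(saved_bits_mask | PT_WRITABLE_MASK);
	return spte | acc_track_value;
}

static uint64_t restore_acc_track_spte(uint64_t spte)
{
	uint64_t saved = (spte >> saved_bits_shift) & saved_bits_mask;

	/* The saved-bits region is deliberately left populated. */
	return (spte & ~acc_track_value) | saved;
}
```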

> > @@ -2919,10 +3063,17 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >  	 *
> >  	 * Compare with set_spte where instead shadow_dirty_mask is set.
> >  	 */
> > -	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
> > +	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
> >  		return false;
> >  
> > -	kvm_vcpu_mark_page_dirty(vcpu, gfn);
> > +	if (remove_write_prot) {
> 
> Stupid question ahead, why not call kvm_vcpu_mark_page_accessed in this 
> function?

We could, but I kept that call on the same paths as before for consistency with
the other cases. Plus, even though it is a cheap call, why add it to the fast
page fault path if it is not necessary?

> > @@ -2953,36 +3104,43 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> >  
> >  		/*
> > -		 * Check if it is a spurious fault caused by TLB lazily flushed.
> > +		 * Check whether the memory access that caused the fault would
> > +		 * still cause it if it were to be performed right now. If not,
> > +		 * then this is a spurious fault caused by TLB lazily flushed,
> > +		 * or some other CPU has already fixed the PTE after the
> > +		 * current CPU took the fault.
> > +		 *
> > +		 * If Write-Only mappings ever become supported, then the
> > +		 * condition below would need to be changed appropriately.
> >  		 *
> >  		 * Need not check the access of upper level table entries since
> >  		 * they are always ACC_ALL.
> >  		 */
> > -		if (is_writable_pte(spte)) {
> > -			ret = true;
> > +		if (((spte & PT_PRESENT_MASK) && !remove_write_prot) ||
> > +		    valid_exec_access) {
> > +			fault_handled = true;
> >  			break;
> >  		}
> 
> Let's separate the three conditions (R/W/X):
> 
> 		if ((error_code & PFERR_FETCH_MASK) {
> 			if ((spte & (shadow_x_mask|shadow_nx_mask))
> 			    == shadow_x_mask) {
> 				fault_handled = true;
> 				break;
> 			}
> 		}
> 		if (error_code & PFERR_WRITE_MASK) {
> 			if (is_writable_pte(spte)) {
> 				fault_handled = true;
> 				break;
> 			}
> 			remove_write_prot =
> 				spte_can_locklessly_be_made_writable(spte);
> 		}
> 		if (!(error_code & PFERR_PRESENT_MASK)) {
> 			if (!is_access_track_spte(spte)) {
> 				fault_handled = true;
> 				break;
> 			}
> 			remove_acc_track = true;
> 		}

I think the third block is incorrect; for example, it will set fault_handled =
true even for a completely zero PTE. I have replaced it with the following:

		if ((error_code & PFERR_USER_MASK) &&
		    (spte & PT_PRESENT_MASK)) {
			fault_handled = true;
			break;
		}

		remove_acc_track = is_access_track_spte(spte);
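
The zero-PTE problem can be seen in a toy comparison of the two checks
(stand-in constants; the PFERR_* values follow the x86 page-fault error code
layout, and the acc-track marker bit is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFERR_PRESENT_MASK (1u << 0)
#define PFERR_USER_MASK    (1u << 2)

#define PT_PRESENT_MASK    (1ull << 0)
static const uint64_t acc_track_value = 1ull << 62; /* hypothetical marker */

static bool is_access_track_spte(uint64_t spte)
{
	return (spte & acc_track_value) != 0;
}

/* The quoted third block: any non-acc-track SPTE is treated as handled. */
static bool third_block_handled(uint32_t error_code, uint64_t spte)
{
	return !(error_code & PFERR_PRESENT_MASK) && !is_access_track_spte(spte);
}

/* The replacement: the SPTE must actually be present to satisfy the access. */
static bool replacement_handled(uint32_t error_code, uint64_t spte)
{
	return (error_code & PFERR_USER_MASK) && (spte & PT_PRESENT_MASK);
}
```

A not-present fault on a completely zero SPTE is claimed as handled by the
first check but correctly rejected by the second.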

> >  
> > +/* This is only supposed to be used for non-EPT mappings */
> 
> It's only used for non-EPT mappings, why is it only *supposed* to be 
> used for non-EPT mappings?  It seems to me that it would work.
> 
> >  static bool need_remote_flush(u64 old, u64 new)
> >  {

It would work, but it would return true in at least one (probably the only)
case where it doesn’t need to, e.g. going from an acc-track PTE to a zeroed
PTE. Though maybe that is not a big deal. Of course, we could also just update
it to handle acc-track.



Thanks,
Junaid
