Re: [PATCH v2 2/6] dt-bindings: riscv: Add Zawrs ISA extension description

Charlie Jenkins <charlie@xxxxxxxxxxxx> · Tue, 23 Apr 2024 14:00:53 -0400

On Tue, Apr 23, 2024 at 10:46:01AM +0200, Andrew Jones wrote:
> On Mon, Apr 22, 2024 at 06:36:45PM -0400, Charlie Jenkins wrote:
> > On Sun, Apr 21, 2024 at 12:20:03PM +0200, Andrew Jones wrote:
> > > On Fri, Apr 19, 2024 at 12:40:01PM -0400, Charlie Jenkins wrote:
> ...
> > > > What would be the purpose of a vendor implementing WRS.NTO (and putting
> > > > it in the DT) that never terminates? The spec says "Then a subsequent
> > > > WRS.NTO instruction would cause the hart to temporarily stall execution
> > > > in a low- power state until a store occurs to the reservation set or an
> > > > interrupt is observed." Why is this wording for WRS.NTO not sufficient
> > > > to assume that an implementation of this instruction would eventually
> > > > terminate?
> > > >
> > > 
> > > We can invoke smp_cond_load_relaxed(addr, VAL || anything_we_want()). This
> > > means we may not expect VAL ever to be written, which rules out "until a
> > > store occurs". As for "an interrupt is observed", we don't know which one
> > > to expect to arrive within a "reasonable" amount of time. We need to know
> > > which one(s), since, while wrs.nto will terminate even when interrupts are
> > > globally disabled, we still need to have the interrupt(s) we expect to be
> > > locally enabled. And, the interrupts should arrive in a "reasonable"
> > > amount of time since we want to poll anything_we_want() at a "reasonable"
> > > frequency.
> > > 
> > > So, we need firmware to promise to enable exceptions if there aren't any
> > > such interrupts. Or, we could require hardware descriptions to identify
> > > which interrupt(s) would be good to have enabled before calling wrs.nto.
> > > Maybe there's already some way to describe something like that?
> > > 
> > > Thanks,
> > > drew
> > 
> > Ahh okay I am caught up now. So the wording we are looking at in the
> > spec is:
> > 
> > "When executing in VS or VU mode, if the VTW bit is set in hstatus, the
> > TW bit in mstatus is clear, and the WRS.NTO does not complete within an
> > implementation-specific bounded time limit, the WRS.NTO instruction will
> > cause a virtual instruction exception."
> 
> That's what the hypervisor should promise to do when there's no other
> guarantee of wrs.nto terminating (but the hypervisor likely wants to
> anyway since it wants guests to trap on wrs.nto in order to potentially
> schedule the lock holding VCPU). The firmware of the host should likewise
> promise to set mstatus.TW when there's no guarantee of wrs.nto
> terminating, but that's likely _not_ something it normally would want to
> do, so hopefully there will always be implementation-specific "other
> reasons" which guarantee termination.
> 
> > 
> > With the concern being that it is possible for "implementation-specific
> > bounded time limit" to be infinite/never times out,
> 
> The implementation-defined short timeout cannot be infinite, but it only
> applies to wrs.sto. While using wrs.sto would relieve the concern, it
> cannot be configured to raise exceptions, which means it's not useful in
> guests. If we want to use wrs.sto in hosts and wrs.nto in guests then we
> need a paravirt channel which allows an "enlightened" guest to determine
> that it is a guest and that the hypervisor has configured wrs.nto to
> trap, which then indicates it's a good idea to patch wrs.sto to wrs.nto.
> But, adding paravirt stuff should be avoided whenever possible since it
> adds complexity we'd rather not maintain.
> 

That still wouldn't solve this issue, because the wrs.nto guest may still
never wakeup in the implementation-specific way?

> > and the kernel
> > enters a WRS where the reservation set is not required to be invalidated
> > for the condition we are waiting on to become true.
> > 
> > An option here would be to enforce in the spec that this time limit is
> > finite. If the original intention of the spec was to have it be finite,
> > then this would not be an issue. If the intention was to allow no time
> > limit, then this would probably have to be a new extension.
> 
> wrs.nto has been specified to never timeout.
> wrs.sto has been specified to never raise exceptions.
> 
> If we had an instruction which would timeout when mstatus.TW/hstatus.VTW
> is clear and raise exceptions when set, then that's the one we'd choose.
> 

Yes, this does seem like the ideal situtation.

> > 
> > We are also able to change the kernel to not allow these conditions that
> > would break this interpretation of WRS. I found three instances in the
> > kernel that contain a condition that is not dependent on the wrs
> > reservation.
> > 
> > 1.
> > # queued_spin_lock_slowpath() in kernel/locking/qspinlock.c
> > val = atomic_cond_read_relaxed(&lock->val,
> > 			       (VAL != _Q_PENDING_VAL) || !cnt--);
> > 
> > The first condition will only become true if lock->val changes which
> > should invalidate the reservation. !cnt-- on the otherhand is a counter
> > of the number of loops that happen under-the-hood in
> > atomic_cond_read_relaxed. This seems like an abuse of the function and
> > could be factored out into a new bounded-iteration cond_read macro.
> > 
> > 2.
> > # osq_lock() in kernel/locking/osq_lock.c
> > if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> > 			  vcpu_is_preempted(node_cpu(node->prev))))
> > 
> > VAL is the first condition and won't be a problem here since changes to
> > it will cause the reservation to become invalid. arm64 has hard-coded
> > vcpu_is_preempted to be false for the same exact reason that riscv would
> > want to (the wait wouldn't be woken up). There is a comment that
> > points this out in arch/arm64/include/asm/spinlock.h. riscv currently
> > uses the default implementation which returns false.
> 
> The operative word is 'currently'. I plan to eventually get riscv's
> vcpu_is_preempted() working since we already laid the groundwork by
> adding a preempted flag to the SBI STA struct.
> 
> > 
> > need_resched() should not be a problem since this condition only changes
> > when the hart recieves an IPI, so as long as the hart is able to receive
> > an IPI while in WRS it will be fine.
> > 
> > 3.
> > # __arm_smmu_cmdq_poll_until_msi() in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > smp_cond_load_relaxed(cmd, !VAL || (ret = queue_poll(&qp)));
> > 
> > arm driver, not relevant.
> > 
> > 
> > 
> > The only case that would cause a problem in the current implementation
> > of the kernel would be queued_spin_lock_slowpath() with the cnt check.
> > We are able to either change this definition, change the spec, or leave
> > it up to the vendor who would be hit by this issue to change it.
> 
> We could attempt to document restrictions on the condition given to
> smp_cond_load_relaxed() and then change the callers to honor those
> restrictions, but that doesn't sound too easy. How will we remove
> vcpu_is_preempted() on x86?

The solution here seems like it would be to not use wrs for riscv in
this case.  We really would need an alternate extension that allows wrs
in a guest to be guaranteed to trap into the host at some point. 

> 
> We should probably start the process of a new extension which has the
> hybrid wrs.sto/nto instruction, but I'm reluctant to, since, in practice,
> the current wrs.nto is probably fine. I don't really expect there to be
> implementations that never terminate, even though I'd rather we document
> that it's _required_ wrs.nto terminates, or that exceptions be raised.
> Maybe I'm attempting to document it in the wrong place though. Maybe this
> is more of a Documentation/arch/riscv/boot.rst type of thing.
> 

wrs.nto is most likely sufficient. A new extension will take a long
time. We could go ahead with wrs.nto, propose the extension, and when
the extension is ready switch over to it. In the meantime have this
behavior documented.

> Thanks,
> drew

- Charlie