On Fri, Feb 02, 2024, Sean Christopherson wrote: > On Fri, Feb 02, 2024, Shaoqin Huang wrote: > > --- > > v2->v3: > > - Rebase to v6.8-rc2. > > - Use TEST_ASSERT(). > > Patch says otherwise. > > > @@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg) > > return; > > } > > > > + sem_getvalue(&sem_vcpu_stop, &sem_val); > > + assert(sem_val == 0); > > + sem_getvalue(&sem_vcpu_cont, &sem_val); > > + assert(sem_val == 0); > > + > > /* > > * We reserve page table for 2 times of extra dirty mem which > > * will definitely cover the original (1G+) test range. Here > > @@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg) > > sync_global_to_guest(vm, iteration); > > } > > > > + /* > > + * > > + * Before we set the host_quit, let the vcpu has time to run, to make > > + * sure we consume the sem_vcpu_stop and the vcpu consume the > > + * sem_vcpu_cont, to keep the semaphore balance. > > + */ > > + usleep(p->interval * 1000); > > Sorry, I wasn't as explicit as I should have been. When I said I don't have a > better solution, I did not mean to imply that I am ok with busy waiting as a > hack-a-around. > > Against my better judgment, I spent half an hour slogging through this test to > figure out what's going wrong. IIUC, the problem is essentially that the test > instructs the vCPU worker to continue _after_ the last iteration, and _then_ sets > host_quit, which results in the vCPU running one extra (unvalidated) iteration. > > For the other modes, which stop if and only if vcpu_sync_stop_requested is set, > the extra iteration is a non-issue. But because the dirty ring variant stops > after every exit (to purge the ring), it hangs without an extra "continue". > > So rather than blindly fire off an extra sem_vcpu_cont that may or may not be > consumed, I believe the test can simply set host_quit _before_ the final "continue" > so that the vCPU worker doesn't run an extra iteration. > > I ran the below with 1000 loops of "for (i = 0; i < LOG_MODE_NUM; i++)" and so > no issues. I didn't see the assert you hit, but without the fix, I did see this > fire within a few loops (less than 5 I think)l > > assert(host_log_mode == LOG_MODE_DIRTY_RING || > atomic_read(&vcpu_sync_stop_requested) == false); > > I'll post this as two patches: one to fix the bug, and a second to have the > LOG_MODE_DIRTY_RING variant clear vcpu_sync_stop_requested so that the above > assert() can be modified as below. Scratch patch 2, I'm pretty sure the vCPU worker can race with the main task and clear vcpu_sync_stop_requested just before the main task sets it, which would result in a false positive. I didn't see any failures, but I'm 99% certain the race exists. I suspect there are some other warts in the test that would complicate attempts to clean things up, so for now I'l just post the fix for the imbalance bug.