On Mon, Dec 13 2021, Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: > We do specify that a migration driver has discretion in using the error > state for failed transitions, so there are options to simplify error > handling. > > If we look at bit flips, we have: > > Initial state > | Resuming > | | Saving > | | | Running > | | | | Next states with multiple bit flips > > a) 0, 0, 0 (d) > b) 0, 0, 1 (c) (e) > c) 0, 1, 0 (b) (e) > d) 0, 1, 1 (a) (e) > e) 1, 0, 0 (b) (c) (d) > f) 1, 0, 1 UNSUPPORTED > g) 1, 1, 0 ERROR > h) 1, 1, 1 INVALID > > We specify that we cannot pass through any invalid states during > transition, so if I consider each bit to be a discrete operation and > map all these multi-bit changes to a sequence of single bit flips, the > only requirements are: > > 1) RESUMING must be cleared before setting SAVING or RUNNING > 2) SAVING or RUNNING must be cleared before setting RESUMING > > I think the basis of your priority scheme comes from that. Ordering > of the remaining items is more subtle though, for instance > 0 -> SAVING | RUNNING can be broken down as: > > 0 -> SAVING > SAVING -> SAVING | RUNNING > > vs > > 0 -> RUNNING > RUNNING -> SAVING | RUNNING > > I'd give preference to enabling logging before running and I believe > that holds for transition (e) -> (d) as well. Agreed. > > In the reverse case, SAVING | RUNNING -> 0 > > SAVING | RUNNING -> RUNNING > RUNNING -> 0 > > vs > > SAVING | RUNNING -> SAVING > SAVING -> 0 > > This one is more arbitrary. I tend to favor clearing RUNNING to stop > the device first, largely because it creates nice symmetry in the > resulting algorithm and follows the general principle that you > discovered as well, converge towards zero by addressing bit clearing > before setting. That also makes sense to me. > So a valid algorithm would include: > > int set_device_state(u32 old, u32 new, bool is_unwind) > { > if (old.RUNNING && !new.RUNNING) { > curr.RUNNING = 0; > if (ERROR) goto unwind; > } > if (old.SAVING && !new.SAVING) { > curr.SAVING = 0; > if (ERROR) goto unwind; > } > if (old.RESUMING && !new.RESUMING) { > curr.RESUMING = 0; > if (ERROR) goto unwind; > } > if (!old.RESUMING && new.RESUMING) { > curr.RESUMING = 1; > if (ERROR) goto unwind; > } > if (!old.SAVING && new.SAVING) { > curr.saving = 1; > if (ERROR) goto unwind; > } > if (!old.RUNNING && new.RUNNING) { > curr.RUNNING = 1; > if (ERROR) goto unwind; > } > > return 0; > > unwind: > if (!is_unwind) { > ret = set_device_state(curr, old, true); > if (ret) { > curr.raw = ERROR; > return ret; > } > } > > return -EIO; > } > I've stared at this and scribbled a bit on paper and I think this would work. > And I think that also addresses the claim that we're doomed to untested > and complicated error code handling, we unwind by simply swapping the > args to our set state function and enter the ERROR state should that > recursive call fail. Nod. As we clear RUNNING as the first thing and would only set RUNNING again as the last step, any transition to ERROR should be save. > > I think it would be reasonable for documentation to present similar > pseudo code as a recommendation, but ultimately the migration driver > needs to come up with something that fits all the requirements. Yes, something like this should go into Documentation/.