On 6/29/20 10:56 AM, Cornelia Huck wrote: > On Wed, 17 Jun 2020 07:24:17 -0400 > Eric Farman <farman@xxxxxxxxxxxxx> wrote: > >> On 6/16/20 3:50 PM, Eric Farman wrote: >>> Let's continue our discussion of the handling of vfio-ccw interrupts. >>> >>> The initial fix [1] relied upon the interrupt path's examination of the >>> FSM state, and freeing all resources if it were CP_PENDING. But the >>> interface used by HALT/CLEAR SUBCHANNEL doesn't affect the FSM state. >>> Consider this sequence: >>> >>> CPU 1 CPU 2 >>> CLEAR (state=IDLE/no change) >>> START [2] >>> INTERRUPT (set state=IDLE) >>> INTERRUPT (set state=IDLE) >>> >>> This translates to a couple of possible scenarios: >>> >>> A) The START gets a cc2 because of the outstanding CLEAR, -EBUSY is >>> returned, resources are freed, and state remains IDLE >>> B) The START gets a cc0 because the CLEAR has already presented an >>> interrupt, and state is set to CP_PENDING >>> >>> If the START gets a cc0 before the CLEAR INTERRUPT (stacked onto a >>> workqueue by the IRQ context) gets a chance to run, then the INTERRUPT >>> will release the channel program memory prematurely. If the two >>> operations run concurrently, then the FSM state set to CP_PROCESSING >>> will prevent the cp_free() from being invoked. But the io_mutex >>> boundary on that path will pause itself until the START completes, >>> and then allow the FSM to be reset to IDLE without considering the >>> outstanding START. Neither scenario would be considered good. >>> >>> Having said all of that, in v2 Conny suggested [3] the following: >>> >>>> - Detach the cp from the subchannel (or better, remove the 1:1 >>>> relationship). By that I mean building the cp as a separately >>>> allocated structure (maybe embedding a kref, but that might not be >>>> needed), and appending it to a list after SSCH with cc=0. Discard it >>>> if cc!=0. >>>> - Remove the CP_PENDING state. The state is either IDLE after any >>>> successful SSCH/HSCH/CSCH, or a new state in that case. But no >>>> special state for SSCH. >>>> - A successful CSCH removes the first queued request, if any. >>>> - A final interrupt removes the first queued request, if any. >>> >>> What I have implemented here is basically this, with a few changes: >>> >>> - I don't queue cp's. Since there should only be one START in process >>> at a time, and HALT/CLEAR doesn't build a cp, I didn't see a pressing >>> need to introduce that complexity. >>> - Furthermore, while I initially made a separately allocated cp, adding >>> an alloc for a cp on each I/O AND moving the guest_cp alloc from the >>> probe path to the I/O path seems excessive. So I implemented a >>> "started" flag to the cp, set after a cc0 from the START, and examine >>> that on the interrupt path to determine whether cp_free() is needed. >> >> FYI... After a day or two of running, I sprung a kernel debug oops for >> list corruption in ccwchain_free(). I'm going to blame this piece, since >> it was the last thing I changed and I hadn't come across any such damage >> since v2. So either "started" is a bad idea, or a broken one. Or both. :) > > Have you come to any conclusion wrt 'started'? Not wanting to generate > stress, just asking :) > I've talked myself out of it, and gone back to your original proposal of a separately allocated cp. (Still no queuing.) Too early to pass judgement though. Yesterday, when running with a cp_free() call after a CSCH, I was getting all sorts of errors very early on, so at the moment I've pulled that back out again. If it looks good in this form, I'll put that as a separate patch and write up some doc for a discussion on that point.
Attachment:
signature.asc
Description: OpenPGP digital signature