On Sat, Jan 23, 2016 at 12:12:02AM +0000, Al Viro wrote: > On Fri, Jan 22, 2016 at 03:30:02PM -0500, Mike Marshall wrote: > > The userspace daemon (client-core) that reads/writes to the device > > restarts automatically if it stops for some reason... I believe active > > ops are marked "purged" when this happens, and when client-core > > restarts "purged" ops are retried (once)... see the comment > > in waitqueue.c "if the operation was purged in the meantime..." > > > > I've tried to rattle Walt and Becky's chains to see if they > > can describe it better... > > What I mean is the following sequence: > > Syscall: puts op into request list, sleeps in wait_for_matching_downcall() > Daemon: exits, markes purged, wakes Syscall up > Daemon gets restarted > Daemon calls read(), finds op still on the list > Syscall: finally gets the timeslice, removes op from the list, decides to > resubmit > > This is very hard to hit - normally by the time we get around to read() > from restarted daemon the waiter had already been woken up and > already removed the purged op from the list. So in practice you probably > had never hit that case. However, it is theoretically possible. > > What I propose to do is to have purged requests that are still in the lists > to be skipped by orangefs_devreq_read() and orangefs_devreq_remove_op(). > IOW, pretend that the race had been won by whatever had been waiting on > that request and got woken up when it had been purged. > > Note that by the time it gets resubmitted, it already has the 'purged' flag > removed - set_op_state_waiting(op) is done when we are inserting into > request list and it leaves no trace of OP_VFS_STATE_PURGED. So I'm not > talking about the resubmitted stuff; just the one that had been in queue > since before the daemon restart and hadn't been removed from there yet. OK, aforementioned locking/refcounting scheme implemented (and completely untested). See #orangefs-untested. Rules: * refcounting is real refcounting - objects are created with refcount 1, get_op() increments refcount, op_release() decrements and frees when zero. * daemon interaction (read/write_iter) is refcount-neutral - it grabs a reference when picking request from list/hash and always drops it in the end. Submitters are always releasing the reference acquired when allocating request. * when submitter decides to give up upon request (for any reason - timeout, signal, daemon disconnect) it marks it with new bit - OP_VFS_STATE_GIVEN_UP. Once that is done, nobody else is allowed to touch its ->list. * request is inserted into hash only when we'd succeeded copying to daemon's memory (and only if request hadn't been given up while we'd been copying it). If copying fails, we only have to refile to the request list (and only if it hadn't been given up). * when copying a response from daemon, request is only marked serviced if it hadn't been given up while we were parsing the response. Malformed responses, failed copying of response, etc. are treated as well-formed error response would be. Error value is the one we'll be returning to daemon - might or might not be right from the error recovery standpoint. * hash lock nests *outside* of op->lock now and all work with the hash is protected by it. * daemon interaction skips the requests that had been around since before the control device had been opened, acting as if the (already woken up) submitters had already gotten around to removing those from the list/hash. That's the normal outcome of such race anyway, and it simplifies the analysis. Again, it's completely untested; might oops, break stuff, etc. I _think_ it makes sense from the lifetime rules and locking POV, but... I still would like to understand what's going on in wait_for_cancellation_request() - it *probably* shouldn't matter for the correctness of that massage, but there might be dragons. I don't understand in which situation wrt pending signals it is supposed to be called and the comments in there contradict the actual code. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html