Gregory Haskins wrote:
> Hi Esben,
> Thank you for the review.  Comments inline.
>
> Esben Nielsen wrote:
>> Disclaimer: I am no longer actively involved and I must admit I might
>> have lost track of much of what has been going on since I contributed
>> to the PI system 2 years ago.  But I allow myself to comment anyway.
>>
>> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <ghaskins@xxxxxxxxxx> wrote:
>>
>>> The kernel currently addresses priority-inversion through
>>> priority-inheritance.  However, all of the priority-inheritance logic
>>> is integrated into the Real-Time Mutex infrastructure.  This causes a
>>> few problems:
>>>
>>>  1) This tightly coupled relationship makes it difficult to extend to
>>>     other areas of the kernel (for instance, pi-aware wait-queues may
>>>     be desirable).
>>>  2) Enhancing the rtmutex infrastructure becomes challenging because
>>>     there is no separation between the locking code and the pi-code.
>>>
>>> This patch aims to rectify these shortcomings by designing a
>>> stand-alone pi framework which can then be used to replace the
>>> rtmutex-specific version.  The goal of this framework is to provide
>>> similar functionality to the existing subsystem, but with sole focus
>>> on PI and the relationships between objects that can boost priority,
>>> and the objects that get boosted.
>>>
>> This is really a good idea.  When I had time (2 years ago) to actively
>> work on these problems I also came to the conclusion that PI should be
>> more general than just the rtmutex.  Preemptive RCU was the example
>> which drove it.
>>
>> But I do disagree that general objects should get boosted: the end
>> targets are always tasks.  The objects might be boosted as intermediate
>> steps, but priority in the end only applies to tasks.
>>
> Actually I fully agree with you here.  It's probably just poor wording
> on my part, but this is exactly what happens.  We may "boost" arbitrary
> objects on the way to boosting a task...but the intermediate objects are
> just there to help find our way to the proper tasks.  Ultimately
> everything ends up at the scheduler eventually ;)
>
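> To illustrate what I mean: the only thing that ever really consumes a
> priority is a task, and it does so through the same small "sink"
> interface as every intermediate object in a chain.  Here is a tiny
> stand-alone sketch of that idea (plain C with made-up names; it is
> deliberately not the code from the patch):
>
>     /*
>      * Illustrative sketch only.  A task is just the leaf implementation
>      * of the sink interface, where a boosted priority finally lands.
>      */
>     #include <stdio.h>
>
>     struct pi_sink;
>
>     struct pi_sink_ops {
>             /* lower numeric value == higher priority, as in the kernel */
>             void (*boost)(struct pi_sink *sink, int prio);
>             void (*deboost)(struct pi_sink *sink, int prio);
>     };
>
>     struct pi_sink {
>             const struct pi_sink_ops *ops;
>     };
>
>     /* the "task": the end of every chain */
>     struct task {
>             struct pi_sink sink;    /* embedded sink interface */
>             int normal_prio;        /* priority the task asked for */
>             int prio;               /* effective (possibly boosted) priority */
>     };
>
>     static void task_boost(struct pi_sink *sink, int prio)
>     {
>             struct task *t = (struct task *)sink;   /* container_of() in real code */
>
>             if (prio < t->prio) {
>                     t->prio = prio;
>                     /* a real implementation would reschedule the task here */
>             }
>     }
>
>     static void task_deboost(struct pi_sink *sink, int prio)
>     {
>             struct task *t = (struct task *)sink;
>
>             /* fall back to the task's own priority once the boost is gone */
>             t->prio = t->normal_prio;
>     }
>
>     static const struct pi_sink_ops task_sink_ops = {
>             .boost   = task_boost,
>             .deboost = task_deboost,
>     };
>
>     int main(void)
>     {
>             struct task t = { { &task_sink_ops }, 50, 50 };
>
>             t.sink.ops->boost(&t.sink, 10);
>             printf("effective prio: %d\n", t.prio); /* -> 10 */
>             t.sink.ops->deboost(&t.sink, 10);
>             printf("effective prio: %d\n", t.prio); /* -> 50 */
>             return 0;
>     }
>
> A mutex, rwlock or wait-queue would implement the same two callbacks,
> but its only job would be to forward the priority toward whichever
> task(s) it ultimately resolves to.
>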
>> I also have a few comments to the actual design:
>>
>>> ....
>>> +
>>> +Multiple sinks per Node:
>>> +
>>> +We allow multiple sinks to be associated with a node.  This is a slight
>>> +departure from the previous implementation, which had the notion of only
>>> +a single sink (i.e. "task->pi_blocked_on").  The reason why we added the
>>> +ability to attach more than one sink was not to change the default
>>> +chaining model (i.e. multiple boost targets), but rather to add a
>>> +flexible notification mechanism that is peripheral to the chain; these
>>> +are informally called "leaf sinks".
>>> +
>>> +Leaf-sinks are boostable objects that do not perpetuate a chain per se.
>>> +Rather, they act as endpoints of a priority boosting.  Ultimately, every
>>> +chain ends with a leaf-sink, which presumably will act on the new
>>> +priority information.  However, there may be any number of leaf-sinks
>>> +along a chain as well.  Each one will act on its localized priority in
>>> +its own implementation-specific way.  For instance, a task_struct
>>> +pi-leaf may change the priority of the task and reschedule it if
>>> +necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
>>>
>> This is bad from an RT point of view: you have a hard time determining
>> the number of sinks per node.  An rw-lock could have an arbitrary
>> number of readers (it is supposed to, really).  Therefore you have no
>> chance of knowing how long the boost/deboost operation will take.  And
>> you also don't know for how long the boosted tasks stay boosted.  If
>> there can be an arbitrary number of such tasks you can no longer be
>> deterministic.
>>
> While you may have a valid concern about what rwlocks can do to
> determinism, note that we already have PI-enabled rwlocks before my
> patch, so I am not increasing nor decreasing determinism in this
> regard.  That being said, Steven Rostedt (author of the pi-rwlocks,
> CC'd) has facilities to manage this (such as limiting the number of
> readers to num_online_cpus) which this design would retain.  Long story
> short, I do not believe I have made anything worse here, so this is a
> different discussion if you are still concerned.
>
>>> ...
>>> +
>>> +#define MAX_PI_DEPENDENCIES 5
>>>
>> WHAT???  There is a finite lock depth defined.  I know we did that
>> originally, but it wasn't hardcoded (as far as I remember) and it was
>> certainly not as low as 5.
>>
> Note that this is simply in reference to how many direct sinks you can
> link to a node, not how long the resulting chain can grow.  The chain
> depth is actually completely unconstrained by the design.  I chose "5"
> here because typically we need 1 sink for the next link in the chain,
> and 1 sink for local notifications.  The other 3 are there for head-room
> (we often hit 3-4 as we transition between nodes: add one node -> delete
> another, etc.).
>
> To clarify what I meant here: if you think of a normal linked-list node
> having a single "next" pointer, this implementation is like each node
> having up to 5 "next" pointers.  However, typically only 1-2 are used,
> and all but one will usually point to a "leaf" node, meaning it does not
> form a chain but terminates processing locally.  Typically there will be
> only one link to something that forms a chain with other nodes.  I did
> this because I realized the pattern (boost/deboost/update) was similar
> whether the node was a leaf or a chain-link, so I unified both behind
> the single pi_sink interface.  That being understood, note that as with
> any linked list, the nodes can still have an arbitrary chaining depth
> (and I will fix this to be iterative instead of recursive, as previously
> mentioned).
>
> You are not the first to comment about this, however, so it makes me
> realize it is not very clear ;)  I will comment the code better.
>
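> In the meantime, a rough picture of what I mean (hand-written,
> illustrative C, not the actual patch code): the constant only bounds how
> many sinks hang directly off one node, and the chain stays unbounded
> because one of those slots is simply the next node:
>
>     /*
>      * Sketch only: MAX_PI_DEPENDENCIES limits per-node fan-out,
>      * not how many nodes a chain may contain.
>      */
>     #define MAX_PI_DEPENDENCIES 5
>
>     struct pi_sink {
>             void (*boost)(struct pi_sink *sink, int prio);
>     };
>
>     struct pi_node {
>             struct pi_sink sink;    /* lets this node be a sink of another node */
>             struct pi_sink *sinks[MAX_PI_DEPENDENCIES];    /* bounded fan-out */
>             int nr_sinks;
>             int prio;               /* lower value == higher priority */
>     };
>
>     /* push this node's priority into everything attached to it: at most
>      * MAX_PI_DEPENDENCIES iterations, regardless of chain length        */
>     void node_update(struct pi_node *node)
>     {
>             int i;
>
>             for (i = 0; i < node->nr_sinks; i++)
>                     node->sinks[i]->boost(node->sinks[i], node->prio);
>     }
>
>     /* installed as a node's own sink.boost: this is how node->node links
>      * form a chain of arbitrary length                                   */
>     void node_sink_boost(struct pi_sink *sink, int prio)
>     {
>             struct pi_node *node = (struct pi_node *)sink;
>
>             if (prio < node->prio) {
>                     node->prio = prio;
>                     node_update(node);
>             }
>     }
>
> Most of the slots would point at leaf sinks (a task, a local
> notification), and typically only one points at another node.  The
> propagation shown recursively above is exactly the part that will become
> iterative in the next version.
>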
>> Remember: PI is used by the user-space futexes as well!
>>
> Yes, and on a slight tangent from your point, this incidentally is
> actually a problem in the design such that I need to respin at least a
> v5.  My current design uses recursion against the sink->update()
> methods, which Peter Zijlstra pointed out would blow up with large
> userspace chains.  My next version will forgo the recursion in favor of
> an iterative method more reminiscent of the original design.
>
>>> ....
>>> +/*
>>> + * _pi_node_update - update the chain
>>> + *
>>> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
>>> + * that need to propagate up the chain.  This is a step-wise process where we
>>> + * have to be careful about locking and preemption.  By trying MAX_PI_DEPs
>>> + * times, we guarantee that this update routine is an effective barrier...
>>> + * all modifications made prior to the call to this barrier will have
>>> + * completed.
>>> + *
>>> + * Deadlock avoidance: This node may participate in a chain of nodes which
>>> + * form a graph of arbitrary structure.  While the graph should technically
>>> + * never close on itself barring any bugs, we still want to protect against
>>> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
>>> + * from detecting this potential).  To do this, we employ a dual-locking
>>> + * scheme where we can carefully control the order.  That is: node->lock
>>> + * protects most of the node's internal state, but it will never be held
>>> + * across a chain update.  sinkref->lock, on the other hand, can be held
>>> + * across a boost/deboost, and also guarantees proper execution order.
>>> + * Also note that no locks are held across a sink->update.
>>> + */
>>> +static int
>>> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
>>> +{
>>> +        struct pi_node *node = node_of(sink);
>>> +        struct pi_sinkref *sinkref;
>>> +        unsigned long iflags;
>>> +        int count = 0;
>>> +        int i;
>>> +        int pprio;
>>> +        struct updater updaters[MAX_PI_DEPENDENCIES];
>>> +
>>> +        spin_lock_irqsave(&node->lock, iflags);
>>> +
>>> +        pprio = node->prio;
>>> +
>>> +        if (!plist_head_empty(&node->srcs))
>>> +                node->prio = plist_first(&node->srcs)->prio;
>>> +        else
>>> +                node->prio = MAX_PRIO;
>>> +
>>> +        list_for_each_entry(sinkref, &node->sinks, list) {
>>> +                /*
>>> +                 * If the priority is changing, or if this is a
>>> +                 * BOOST/DEBOOST, we consider this sink "stale"
>>> +                 */
>>> +                if (pprio != node->prio
>>> +                    || sinkref->state != pi_state_boosted) {
>>> +                        struct updater *iter = &updaters[count++];
>>>
>> What prevents count from overrun?
>>
> The node->sinks list will never have more than MAX_PI_DEPs in it, by
> design.
>
>>> +
>>> +                        BUG_ON(!atomic_read(&sinkref->sink->refs));
>>> +                        _pi_sink_get(sinkref);
>>> +
>>> +                        iter->update = 1;
>>> +                        iter->sinkref = sinkref;
>>> +                        iter->sink = sinkref->sink;
>>> +                }
>>> +        }
>>> +
>>> +        spin_unlock(&node->lock);
>>> +
>>> +        for (i = 0; i < count; ++i) {
>>> +                struct updater *iter = &updaters[i];
>>> +                unsigned int lflags = PI_FLAG_DEFER_UPDATE;
>>> +                struct pi_sink *sink;
>>> +
>>> +                sinkref = iter->sinkref;
>>> +                sink = iter->sink;
>>> +
>>> +                spin_lock(&sinkref->lock);
>>> +
>>> +                switch (sinkref->state) {
>>> +                case pi_state_boost:
>>> +                        sinkref->state = pi_state_boosted;
>>> +                        /* Fall through */
>>> +                case pi_state_boosted:
>>> +                        sink->ops->boost(sink, &sinkref->src, lflags);
>>> +                        break;
>>> +                case pi_state_deboost:
>>> +                        sink->ops->deboost(sink, &sinkref->src, lflags);
>>> +                        sinkref->state = pi_state_free;
>>> +
>>> +                        /*
>>> +                         * drop the ref that we took when the sinkref
>>> +                         * was allocated.  We still hold a ref from
>>> +                         * above.
>>> +                         */
>>> +                        _pi_sink_put_all(node, sinkref);
>>> +                        break;
>>> +                case pi_state_free:
>>> +                        iter->update = 0;
>>> +                        break;
>>> +                default:
>>> +                        panic("illegal sinkref type: %d", sinkref->state);
>>> +                }
>>> +
>>> +                spin_unlock(&sinkref->lock);
>>> +
>>> +                /*
>>> +                 * We will drop the sinkref reference while still holding the
>>> +                 * preempt/irqs off so that the memory is returned synchronously
>>> +                 * to the system.
>>> +                 */
>>> +                _pi_sink_put_local(node, sinkref);
>>> +        }
>>> +
>>> +        local_irq_restore(iflags);
>>>
>> Yack!  You keep interrupts off while doing the chain.
>>
> Actually, not quite.  The first pass (with interrupts off) simply sets
> the new priority value at each local element (limited to 5, typically
> 1-2).  Short and sweet.  It's the "update" that happens next (with
> interrupts/preemption enabled) that updates the chain.
>
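> Schematically, the shape of the algorithm is something like this
> (a stand-alone sketch with stub irq helpers, not the real
> _pi_node_update(); the locking, refcounting and state machine are all
> omitted):
>
>     #define MAX_PI_DEPENDENCIES 5
>
>     struct pi_sink {
>             void (*boost)(struct pi_sink *sink, int prio);
>     };
>
>     struct pi_node {
>             struct pi_sink *sinks[MAX_PI_DEPENDENCIES];
>             int nr_sinks;   /* never exceeds MAX_PI_DEPENDENCIES */
>             int prio;
>     };
>
>     /* stand-ins for local_irq_save()/local_irq_restore() in this sketch */
>     static void irq_disable(void) { }
>     static void irq_enable(void)  { }
>
>     void node_update(struct pi_node *node, int new_prio)
>     {
>             struct pi_sink *pending[MAX_PI_DEPENDENCIES];
>             int i, count = 0;
>
>             /*
>              * Phase 1: interrupts off, but only long enough to latch the
>              * new priority and snapshot which sinks need to hear about it.
>              * The loop is bounded by the per-node fan-out, not by the
>              * length of the chain.
>              */
>             irq_disable();
>             if (new_prio != node->prio) {
>                     node->prio = new_prio;
>                     for (i = 0; i < node->nr_sinks; i++)
>                             pending[count++] = node->sinks[i];
>             }
>             irq_enable();
>
>             /*
>              * Phase 2: interrupts (and preemption) enabled while the sinks
>              * are actually boosted.  A sink that is itself the next node
>              * repeats this two-phase step, so walking a long chain never
>              * keeps interrupts off for more than one bounded phase.
>              */
>             for (i = 0; i < count; i++)
>                     pending[i]->boost(pending[i], node->prio);
>     }
>
> The interrupts-off window covers only the bounded per-node snapshot; the
> boost()/deboost() callbacks, and therefore the walk from node to node,
> run with interrupts and preemption enabled.
>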
>> I think my main contribution to the PI system 2 years ago was to do
>> this preemptively.  I.e. there were points in the loop where interrupts
>> and preemption were turned on.
>>
> I agree this is important, but I think you will see with further review
> that this is in fact what I do too.
>
>> Remember: it goes into user space again.  An evil user could craft an
>> application with a very long lock depth and keep higher-priority
>> real-time tasks from running for an arbitrarily long time (if no limit
>> on the lock depth is set, which is bad because it will be too low in
>> some cases).
>>
>> But as I said, I have had no time to watch what has actually been going
>> on in the kernel for the last 2 years, roughly.  The said defects might
>> have crept in via other contributors already :-(
>>
>> Esben
>>
> Esben,
> Your review and insight are very much appreciated.  I will be sure to
> address the concerns mentioned above and CC you on the next release.
>
> Thanks again,
> -Greg