From: Dave <dave.willmer@xxxxxxxxx> Signed-off-by: Dave <dave.willmer@xxxxxxxxx> --- advsync/advsync.tex | 10 ++--- advsync/memorybarriers.tex | 26 ++++++------ appendix/primitives/primitives.tex | 4 +- appendix/questions/after.tex | 2 +- appendix/questions/concurrentparallel.tex | 2 +- appendix/whymb/whymemorybarriers.tex | 6 +-- datastruct/datastruct.tex | 70 +++++++++++++++---------------- debugging/debugging.tex | 32 +++++++------- defer/rcuapi.tex | 2 +- defer/rcuintro.tex | 2 +- defer/toyrcu.tex | 20 ++++----- defer/whichtochoose.tex | 2 +- formal/dyntickrcu.tex | 24 +++++------ formal/spinhint.tex | 10 ++--- future/future.tex | 2 +- future/htm.tex | 30 ++++++------- future/tm.tex | 24 +++++------ locking/locking.tex | 32 +++++++------- rt/rt.tex | 46 ++++++++++---------- together/applyrcu.tex | 26 ++++++------ together/count.tex | 4 +- together/hash.tex | 8 ++-- 22 files changed, 192 insertions(+), 192 deletions(-) diff --git a/advsync/advsync.tex b/advsync/advsync.tex index dd567b3..1839985 100644 --- a/advsync/advsync.tex +++ b/advsync/advsync.tex @@ -56,7 +56,7 @@ basis of real-time programming: \item Real-time forward-progress guarantees usually have some definite time associated with them, for example, ``scheduling latency must be less than 100 microseconds.'' - In contrast, NBS only + In contrast, NBS only guarantees that progress will be made in finite time, with no definite bound. \item Real-time forward-progress guarantees are sometimes @@ -71,7 +71,7 @@ basis of real-time programming: a certain fraction of its time idle, or when I/O rates are below some specified maximum. In contrast, NBS's forward-progress - guarantees are usually unconditional.\footnote{ + guarantees are usually unconditional.\footnote{ As we will see below, some recent NBS work relaxes this guarantee.} \item Real-time forward-progress guarantees usually apply only @@ -111,7 +111,7 @@ as follows: \end{enumerate} NBS classes~1 and~2 were first formulated in the early 1990s, -class~3 was first fomrulated in the early 2000s, +class~3 was first formulated in the early 2000s, and class~4 was first formulated in 2013. The final two classes have seen informal use for a great many decades, but were reformulated in 2013. @@ -167,12 +167,12 @@ the Linux kernel. 5 struct cds_wfcq_node *new_tail) 6 { 7 struct cds_wfcq_node *old_tail; - 8 + 8 9 old_tail = uatomic_xchg(&tail->p, new_tail); 10 CMM_STORE_SHARED(old_tail->next, new_head); 11 return old_tail != &head->node; 12 } -13 +13 14 static inline bool 15 _cds_wfcq_enqueue(struct cds_wfcq_head *head, 16 struct cds_wfcq_tail *tail, diff --git a/advsync/memorybarriers.tex b/advsync/memorybarriers.tex index 2d16542..6639fec 100644 --- a/advsync/memorybarriers.tex +++ b/advsync/memorybarriers.tex @@ -82,7 +82,7 @@ Many people do indeed expect their computers to keep track of things, but many also insist that they keep track of things quickly. One difficulty that modern computer-system vendors face is that the main memory cannot keep up with the CPU -- modern CPUs can execute -hundreds of instructions in time required to fetch a single variable +hundreds of instructions in the time required to fetch a single variable from memory. CPUs therefore sport increasingly large caches, as shown in Figure~\ref{fig:advsync:Modern Computer System Cache Structure}. @@ -248,12 +248,12 @@ Figure~\ref{fig:advsync:Software Logic Analyzer}. This code fragment is executed in parallel by several CPUs. 
Line~1 sets a shared variable to the current CPU's ID, line~2 initializes several variables from a \co{gettb()} function that -delivers the value of fine-grained hardware ``timebase'' counter that is +delivers the value of a fine-grained hardware ``timebase'' counter that is synchronized among all CPUs (not available from all CPU architectures, unfortunately!), and the loop from lines~3-8 records the length of time that the variable retains the value that this CPU assigned to it. Of course, one of the CPUs will ``win'', and would thus never exit -the loop if not for the check on lines~7-8. +the loop if not for the check on lines~6-8. \QuickQuiz{} What assumption is the code fragment @@ -441,7 +441,7 @@ This does not work for the non-atomic stores described earlier because the non-atomic stores do not return any indication of the earlier value, hence the possibility of ambiguity. -Please note well that this section applies \emph{only} when all +Please note that this section applies \emph{only} when all CPUs' accesses are to one single variable. In this single-variable case, cache coherence guarantees the global ordering, at least assuming that some of the more aggressive @@ -790,7 +790,7 @@ these combinations in order to fully understand how this works. But suppose that in combination~1 from Table~\ref{tab:advsync:Memory-Barrier Combinations}, CPU~1's load from A returns the value that CPU~2 stored - to A. Then we know that CPU~1's load from B returned + to A. Then we know that CPU~1's load from B returned either the same value as CPU~2's load from A or some later value. \QuickQuiz{} @@ -830,7 +830,7 @@ The following properties must then hold true: On any given run, however, all CPUs and threads must have a consistent view of the order of critical sections for a given exclusive lock.} -\item Suppose a given variable has not yet been stored to in a +\item Suppose a given variable has not yet been stored in a critical section that is currently executing. Then any load from a given variable performed in that critical section must see the last store to that variable from the last previous @@ -868,7 +868,7 @@ assert(b == 2); If the CPU is not required to see all of its loads and stores in order, then the {\tt b=1+a} might well see an old version of the variable ``a''. - + This is why it is so very important that each CPU or thread see all of its own loads and stores in program order. } \QuickQuizEnd @@ -903,7 +903,7 @@ spin_unlock(&mylock); that they were first, they would all see {\tt p==NULL}, and they would all allocate memory. All but one of those allocations would be leaked. - + This is why it is so very important that all the critical sections for a given exclusive lock appear to execute in some well-defined order. @@ -911,7 +911,7 @@ spin_unlock(&mylock); Suppose that the third property did not hold. Then the counter shown in the following code might well count backwards. -This third property is crucial, as it cannot be strictly with +This third property is crucial, as it cannot be strictly true with pairwise memory barriers. \vspace{5pt} @@ -938,7 +938,7 @@ spin_unlock(&mylock); to see the most recent store to this variable, it might well see the original value of zero, and therefore set the counter to one, which would be going backwards. - + This is why it is so very important that loads from a given variable in a given critical section see the last store from the last prior critical section to @@ -976,7 +976,7 @@ laid out in Section~\ref{sec:advsync:What Can You Trust?}. 
4 while (atomic_read(&lck->a) != 0) 5 continue; 6 } - 7 + 7 8 void spin_unlock(spinlock_t lck) 9 { 10 smp_mb(); @@ -1833,7 +1833,7 @@ versa: \subsubsection{Examples of Memory Barrier Pairings} \label{sec:advsync:Examples of Memory Barrier Pairings} -Firstly, write barriers act as a partial orderings on store operations. +Firstly, write barriers act as partial orderings on store operations. Consider the following sequence of events: \vspace{5pt} @@ -1866,7 +1866,7 @@ Figure~\ref{fig:advsync:Write Barrier Ordering Semantics}. \ContributedBy{Figure}{fig:advsync:Write Barrier Ordering Semantics}{David Howells} \end{figure*} -Secondly, data dependency barriers act as a partial orderings on data-dependent +Secondly, data dependency barriers act as partial orderings on data-dependent loads. Consider the following sequence of events with initial values {\tt \{B = 7, X = 9, Y = 8, C = \&Y\}}: diff --git a/appendix/primitives/primitives.tex b/appendix/primitives/primitives.tex index a3cd271..d7fb471 100644 --- a/appendix/primitives/primitives.tex +++ b/appendix/primitives/primitives.tex @@ -172,7 +172,7 @@ The \co{wait_all_threads()} primitive waits for completion of all currently running threads. It is the caller's responsibility to synchronize with thread creation and deletion if required. -However, this primitive is normally used to clean up and the end of +However, this primitive is normally used to clean up at the end of a run, so such synchronization is normally not needed. \subsection{Example Usage} @@ -347,7 +347,7 @@ Figure~\ref{fig:intro:Per-Thread-Variable API} shows the per-thread-variable API. This API provides the per-thread equivalent of global variables. Although this API is, strictly speaking, not necessary, it can -greatly simply coding. +greatly simplify coding. \begin{figure}[htbp] { \scriptsize diff --git a/appendix/questions/after.tex b/appendix/questions/after.tex index 89a8c17..7ff2662 100644 --- a/appendix/questions/after.tex +++ b/appendix/questions/after.tex @@ -155,7 +155,7 @@ Why is time going backwards? The number in parentheses is the difference in microseconds, with a large number exceeding 10 microseconds, and one exceeding even 100 microseconds! -Please note that this CPU can potentially execute about more than 100,000 +Please note that this CPU can potentially execute more than 100,000 instructions in that time. One possible reason is given by the following sequence of events: diff --git a/appendix/questions/concurrentparallel.tex b/appendix/questions/concurrentparallel.tex index 5d0c8ea..ce22280 100644 --- a/appendix/questions/concurrentparallel.tex +++ b/appendix/questions/concurrentparallel.tex @@ -52,7 +52,7 @@ there are important situations where efficiency, performance, and scalability concerns sharply limit the level of competence that the scheduler can reasonably offer. One important example is when the scheduler is implemented in -hardware, as it often is SIMD units or GPGPUs. +hardware, as it often is in SIMD units or GPGPUs. Another example is a workload where the units of work are quite short, so that even a software-based scheduler must make hard choices between subtlety on the one hand and efficiency on the other. 
diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex index 00263f0..719de12 100644 --- a/appendix/whymb/whymemorybarriers.tex +++ b/appendix/whymb/whymemorybarriers.tex @@ -1139,7 +1139,7 @@ Therefore, we can add a memory barrier to function \co{bar} as follows: 4 smp_mb(); 5 b = 1; 6 } - 7 + 7 8 void bar(void) 9 { 10 if (b == 0) @@ -1565,7 +1565,7 @@ The benefit of this extremely weak memory model is that Alpha can use simpler cache hardware, which in turn permitted higher clock frequency in Alpha's heyday. -The last column indicates whether a given CPU has a incoherent +The last column indicates whether a given CPU has an incoherent instruction cache and pipeline. Such CPUs require special instructions be executed for self-modifying code. @@ -2371,7 +2371,7 @@ future such problems: can result in corrupting the data input! \item External busses that fail to transmit cache-coherence data. - + This is an even more painful variant of the above problem, but causes groups of devices---and even memory itself---to fail to respect cache coherence. diff --git a/datastruct/datastruct.tex b/datastruct/datastruct.tex index c20c242..25cd5c4 100644 --- a/datastruct/datastruct.tex +++ b/datastruct/datastruct.tex @@ -69,7 +69,7 @@ Those interested in Schr\"odinger's animals can query them, however, Schr\"odinger has noted extremely high rates of queries for his cat, so much so that he suspects that his mice might be using the database to check up on their nemesis. -This means that Sch\"odinger's application must be able to support a +This means that Schr\"odinger's application must be able to support a high rate of queries to a single data element. Please keep this application in mind as various data structures are presented. @@ -140,12 +140,12 @@ offers excellent scalability. 2 struct cds_list_head hte_next; 3 unsigned long hte_hash; 4 }; - 5 + 5 6 struct ht_bucket { 7 struct cds_list_head htb_head; 8 spinlock_t htb_lock; 9 }; -10 +10 11 struct hashtab { 12 unsigned long ht_nbuckets; 13 struct ht_bucket ht_bkt[0]; @@ -174,7 +174,7 @@ The \co{hashtab} structure (lines~11-14 in Figure~\ref{fig:datastruct:Hash-Table Data Structures}) contains four \co{ht_bucket} structures (lines~6-9 in Figure~\ref{fig:datastruct:Hash-Table Data Structures}), -with the \co{->bt_nbuckets} field controlling the number of buckets. +with the \co{->ht_nbuckets} field controlling the number of buckets. Each such bucket contains a list header \co{->htb_head} and a lock \co{->htb_lock}. The list headers chain \co{ht_elem} structures @@ -196,13 +196,13 @@ has bucket~0 with two elements and bucket~2 with one. \begin{verbatim} 1 #define HASH2BKT(htp, h) \ 2 (&(htp)->ht_bkt[h % (htp)->ht_nbuckets]) - 3 + 3 4 static void hashtab_lock(struct hashtab *htp, 5 unsigned long hash) 6 { 7 spin_lock(&HASH2BKT(htp, hash)->htb_lock); 8 } - 9 + 9 10 static void hashtab_unlock(struct hashtab *htp, 11 unsigned long hash) 12 { @@ -235,7 +235,7 @@ corresponding to the specified hash value. 7 { 8 struct ht_bucket *htb; 9 struct ht_elem *htep; -10 +10 11 htb = HASH2BKT(htp, hash); 12 cds_list_for_each_entry(htep, 13 &htb->htb_head, @@ -297,7 +297,7 @@ If no element matches, line~20 returns \co{NULL}. 
7 cds_list_add(&htep->hte_next, 8 &HASH2BKT(htp, hash)->htb_head); 9 } -10 +10 11 void hashtab_del(struct ht_elem *htep) 12 { 13 cds_list_del_init(&htep->hte_next); @@ -331,7 +331,7 @@ or modifying this same bucket, for example, by invoking 3 { 4 struct hashtab *htp; 5 int i; - 6 + 6 7 htp = malloc(sizeof(*htp) + 8 nbuckets * 9 sizeof(struct ht_bucket)); @@ -344,7 +344,7 @@ or modifying this same bucket, for example, by invoking 16 } 17 return htp; 18 } -19 +19 20 void hashtab_free(struct hashtab *htp) 21 { 22 free(htp); @@ -517,7 +517,7 @@ read-mostly cases where updates are rare, but could happen at any time. \label{sec:datastruct:Read-Mostly Data Structures} Although partitioned data structures can offer excellent scalability, -NUMA effects can result in severe degradations of both performance and +NUMA effects can result in severe degradations of both performance and scalability. In addition, the need for readers to exclude writers can degrade performance in @@ -541,7 +541,7 @@ section~\cite{McKenney:2013:SDS:2483852.2483867}. 3 { 4 rcu_read_lock(); 5 } - 6 + 6 7 static void hashtab_unlock_lookup(struct hashtab *htp, 8 unsigned long hash) 9 { @@ -579,7 +579,7 @@ Figure~\ref{fig:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Contro 7 { 8 struct ht_bucket *htb; 9 struct ht_elem *htep; -10 +10 11 htb = HASH2BKT(htp, hash); 12 cds_list_for_each_entry_rcu(htep, 13 &htb->htb_head, @@ -646,7 +646,7 @@ RCU read-side critical section, for example, the caller must invoke 7 cds_list_add_rcu(&htep->hte_next, 8 &HASH2BKT(htp, hash)->htb_head); 9 } -10 +10 11 void hashtab_del(struct ht_elem *htep) 12 { 13 cds_list_del_rcu(&htep->hte_next); @@ -722,7 +722,7 @@ thread than is hazard pointers. This situation changes above 32 CPUs. Because RCU is using more than half of each core's resources from a -single hardware thread, RCU gains relatively litte benefit from the +single hardware thread, RCU gains relatively little benefit from the second hardware thread in each core. The slope of hazard pointers's trace also decreases at 32 CPUs, but less dramatically, @@ -809,7 +809,7 @@ Of course, all three of these implementations fare much better than does global locking. Of course, it is quite possible that the differences in lookup performance -is affected by the differences in update rates. +are affected by the differences in update rates. One way to check this is to artificially throttle the update rates of per-bucket locking and hazard pointers to match that of RCU. Doing so does not significantly improve the lookup performace of @@ -1014,12 +1014,12 @@ which is the subject of the next section. 3 struct cds_list_head hte_next[2]; 4 unsigned long hte_hash; 5 }; - 6 + 6 7 struct ht_bucket { 8 struct cds_list_head htb_head; 9 spinlock_t htb_lock; 10 }; -11 +11 12 struct ht { 13 long ht_nbuckets; 14 long ht_resize_cur; @@ -1034,7 +1034,7 @@ which is the subject of the next section. 23 void *(*ht_getkey)(struct ht_elem *htep); 24 struct ht_bucket ht_bkt[0]; 25 }; -26 +26 27 struct hashtab { 28 struct ht *ht_cur; 29 spinlock_t ht_lock; @@ -1122,13 +1122,13 @@ the old table. 6 key) % htp->ht_nbuckets; 7 return &htp->ht_bkt[*b]; 8 } - 9 + 9 10 static struct ht_bucket * 11 ht_get_bucket(struct ht **htp, void *key, 12 long *b, int *i) 13 { 14 struct ht_bucket *htbp; -15 +15 16 htbp = ht_get_bucket_single(*htp, key, b); 17 if (*b <= (*htp)->ht_resize_cur) { 18 *htp = (*htp)->ht_new; @@ -1207,7 +1207,7 @@ with a resize operation. 
5 struct ht *htp; 6 struct ht_bucket *htbp; 7 struct ht_bucket *htbp_new; - 8 + 8 9 rcu_read_lock(); 10 htp = rcu_dereference(htp_master->ht_cur); 11 htbp = ht_get_bucket_single(htp, key, &b); @@ -1219,14 +1219,14 @@ with a resize operation. 17 spin_lock(&htbp_new->htb_lock); 18 spin_unlock(&htbp->htb_lock); 19 } -20 +20 21 void hashtab_unlock_mod(struct hashtab *htp_master, 22 void *key) 23 { 24 long b; 25 struct ht *htp; 26 struct ht_bucket *htbp; -27 +27 28 htp = rcu_dereference(htp_master->ht_cur); 29 htbp = ht_get_bucket(&htp, key, &b, NULL); 30 spin_unlock(&htbp->htb_lock); @@ -1273,7 +1273,7 @@ section. The code in Figures~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection} and~\ref{fig:datastruct:Resizable Hash-Table Update-Side Concurrency Control} - compute the hash and execute the bucket-selection logic twice for + computes the hash and executes the bucket-selection logic twice for updates! Why this blatant inefficiency? \QuickQuizAnswer{ @@ -1288,7 +1288,7 @@ The \co{hashtab_unlock_mod()} function releases the lock acquired by \co{hashtab_lock_mod()}. Line~28 picks up the current hash table, and then line~29 invokes \co{ht_get_bucket()} in order to gain a reference to the bucket that -corresponds to the key---and of course this bucket might well in a +corresponds to the key---and of course this bucket might well be in a new hash table. Line~30 releases the bucket's lock and finally line~31 exits the RCU read-side critical section. @@ -1296,7 +1296,7 @@ RCU read-side critical section. \QuickQuiz{} Suppose that one thread is inserting an element into the new hash table during a resize operation. - What prevents this insertion to be lost due to a subsequent + What prevents this insertion being lost due to a subsequent resize operation completing before the insertion does? \QuickQuizAnswer{ The second resize operation will not be able to move beyond @@ -1308,7 +1308,7 @@ RCU read-side critical section. RCU read-side critical section. As we will see when we examine the \co{hashtab_resize()} function, this means that the first resize operation will - use + use \co{synchronize_rcu()} to wait for the insertion's read-side critical section to complete. } \QuickQuizEnd @@ -1325,7 +1325,7 @@ RCU read-side critical section. 7 struct ht *htp; 8 struct ht_elem *htep; 9 struct ht_bucket *htbp; -10 +10 11 htp = rcu_dereference(htp_master->ht_cur); 12 htbp = ht_get_bucket(&htp, key, &b, &i); 13 cds_list_for_each_entry_rcu(htep, @@ -1337,7 +1337,7 @@ RCU read-side critical section. 19 } 20 return NULL; 21 } -22 +22 23 void 24 hashtab_add(struct hashtab *htp_master, 25 struct ht_elem *htep) @@ -1346,14 +1346,14 @@ RCU read-side critical section. 28 int i; 29 struct ht *htp; 30 struct ht_bucket *htbp; -31 +31 32 htp = rcu_dereference(htp_master->ht_cur); 33 htbp = ht_get_bucket(&htp, htp->ht_getkey(htep), 34 &b, &i); 35 cds_list_add_rcu(&htep->hte_next[i], 36 &htbp->htb_head); 37 } -38 +38 39 void 40 hashtab_del(struct hashtab *htp_master, 41 struct ht_elem *htep) @@ -1362,7 +1362,7 @@ RCU read-side critical section. 44 int i; 45 struct ht *htp; 46 struct ht_bucket *htbp; -47 +47 48 htp = rcu_dereference(htp_master->ht_cur); 49 htbp = ht_get_bucket(&htp, htp->ht_getkey(htep), 50 &b, &i); @@ -1474,7 +1474,7 @@ a concurrent resize operation. 13 struct ht_bucket *htbp_new; 14 unsigned long hash; 15 long b; -16 +16 17 if (!spin_trylock(&htp_master->ht_lock)) 18 return -EBUSY; 19 htp = htp_master->ht_cur; @@ -1890,7 +1890,7 @@ cases it does give up some performance. 
The following sections touch on specialization, memory conservation, and hardware considerations. -Please do not mistakes these short sections for a definitive treatise +Please do not mistake these short sections for a definitive treatise on this subject. Whole books have been written on optimizing to a specific CPU, let alone to the set of CPU families in common use today. diff --git a/debugging/debugging.tex b/debugging/debugging.tex index 6fde3d1..0210d4f 100644 --- a/debugging/debugging.tex +++ b/debugging/debugging.tex @@ -44,7 +44,7 @@ Section~\ref{sec:debugging:Probability and Heisenbugs} gives an overview of the use of probability for validating parallel software. Because performance and scalability are first-class requirements for parallel programming, -Section~\ref{sec:debugging:Performance Estimation} which covers these +Section~\ref{sec:debugging:Performance Estimation} covers these topics. Finally, Section~\ref{sec:debugging:Summary} @@ -188,7 +188,7 @@ The next section examines this conundrum. \label{sec:debugging:Required Mindset} When carrying out any validation effort, you should keep the following -defintions in mind: +definitions in mind: \begin{enumerate} \item The only bug-free programs are trivial programs. @@ -498,7 +498,7 @@ be prepared to develop and run your own test suite. Test development is an underappreciated and very valuable skill, so be sure to take full advantage of any existing test suites available to you. -Important as test development is, we will leave further discussion of +Important as test development is, we will leave further discussion of it to books dedicated to that topic. The following sections therefore discuss locating bugs in your code given that you already have a good test suite. @@ -643,11 +643,11 @@ into C compilers. There are nevertheless lint-like tools under development and in use to this day. -The sparse static analyzer~\cite{JonathanCorbet2004sparse} +The sparse static analyzer~\cite{JonathanCorbet2004sparse} looks for higher-level issues in the Linux kernel, including: \begin{enumerate} -\item Misuse of pointers to use-space structures. +\item Misuse of pointers to user-space structures. \item Assignments from too-long constants. \item Empty \co{switch} statements. \item Mismatched lock acquisition and release primitives. @@ -929,7 +929,7 @@ Now the question is just how much testing is required in order to be certain that you actually fixed the bug, as opposed to just reducing the probability of it occurring on the one hand, having fixed only one of several -related bugs on the other and, or made some ineffectual unrelated +related bugs on the other hand, or made some ineffectual unrelated change on yet a third hand. In short, what is the answer to the eternal question posed by Figure~\ref{fig:cpu:Passed-the-stress-test}? @@ -1074,7 +1074,7 @@ to run the test to cause the probability of failure to rise above 99\%?'' After all, if we were to run the test enough times that the probability of seeing at least one failure becomes 99\%, if there are no failures, there is only 1\% probability of this being due to dumb luck. -And if we plug $f=0.1$ into +And if we plug $f=0.1$ into Equation~\ref{eq:debugging:Binomial Failure Rate} and vary $n$, we find that 43 runs gives us a 98.92\% chance of at least one test failing given the original 10\% per-test failure rate, @@ -1565,7 +1565,7 @@ the first place. This is a bit of a dark art, but there are a number of things you can do to find them. 
-On approach is to recognize that race conditions often end up corrupting +One approach is to recognize that race conditions often end up corrupting some of the data involved in the race. It is therefore good practice to double-check the synchronization of any corrupted data. @@ -1833,7 +1833,7 @@ much a bug as is incorrectness. \QuickQuiz{} That is ridiculous!!! After all, isn't getting the correct answer later than one would like - \emph{has} better than getting an incorrect answer??? + better than getting an incorrect answer??? \QuickQuizAnswer{ This question fails to consider the option of choosing not to compute the answer at all, and in doing so, also fails to consider @@ -2014,7 +2014,7 @@ to creep in, including: measurement period. \item Some types of interference, for example, random memory errors, are so rare that they can be dealt with by running a number - of sets of interations of the test. + of sets of iterations of the test. If the level of interference was statistically significant, any performance outliers could be rejected statistically. \item Any iteration of the test might be interfered with by other @@ -2166,7 +2166,7 @@ describes statistics-based rejection. Many systems, including Linux, provide means for determining after the fact whether some forms of interference have occurred. -For example, if your test encountered process-based interference, a +For example, if your test encountered process-based interference, a context switch must have occurred during the test. On Linux-based systems, this context switch will be visible in \co{/proc/<PID>/sched} in the \co{nr_switches} field. @@ -2178,13 +2178,13 @@ Similarly, interrupt-based interference can be detected via the \begin{verbatim} 1 #include <sys/time.h> 2 #include <sys/resource.h> - 3 + 3 4 /* Return 0 if test results should be rejected. */ 5 int runtest(void) 6 { 7 struct rusage ru1; 8 struct rusage ru2; - 9 + 9 10 if (getrusage(RUSAGE_SELF, &ru1) != 0) { 11 perror("getrusage"); 12 abort(); @@ -2248,7 +2248,7 @@ beginning of the sorted list, and use these to estimate a typical inter-element delta, which in turn may be multiplied by the number of elements in the list to obtain an upper bound on permissible values. The algorithm then repeatedly considers the next element of the list. -If it is falls below the upper bound, and if the distance between +If it falls below the upper bound, and if the distance between the next element and the previous element is not too much greater than the average inter-element distance for the portion of the list accepted thus far, then the next element is accepted and the process repeats. @@ -2278,7 +2278,7 @@ Otherwise, the remainder of the list is rejected. 19 esac 20 shift 21 done - 22 + 22 23 awk -v divisor=$divisor -v relerr=$relerr \ 24 -v trendbreak=$trendbreak '{ 25 for (i = 2; i <= NF; i++) @@ -2343,7 +2343,7 @@ This script takes three optional arguments as follows: This defaults to 0.01, which is equivalent to 1\%. \item \co{--trendbreak}: Ratio of inter-element spacing constituting a break in the trend of the data. - Fr example, if the average spacing in the data accepted so far + For example, if the average spacing in the data accepted so far is 1.5, then if the trend-break ratio is 2.0, then if the next data value differs from the last one by more than 3.0, this constitutes a break in the trend. 
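For readers following along, the statistical-rejection rule that the debugging.tex hunks above describe reduces to a few lines of C. The sketch below is illustrative only: the function name accept_data and its exact bounds arithmetic are invented here under stated assumptions (sorted input, divisor >= 1), and are not taken from the book's awk script.

#include <stddef.h>

/*
 * Illustrative sketch, not code from the book: accept the leading
 * elements of the sorted array x[0..n-1], rejecting the remainder at
 * the first element that either exceeds an upper bound derived from
 * the first 1/divisor of the data or whose gap from its predecessor
 * exceeds trendbreak times the average gap accepted so far.
 * Returns the number of accepted elements.  Assumes divisor >= 1.
 */
static size_t accept_data(const double *x, size_t n,
                          int divisor, double trendbreak)
{
	size_t nfirst;
	double delta;
	double upperbound;
	size_t i;

	if (n < 2)
		return n;
	nfirst = n / divisor;
	if (nfirst < 2)
		nfirst = 2;		/* Need two points to estimate a gap. */
	delta = (x[nfirst - 1] - x[0]) / (nfirst - 1);	/* Typical gap. */
	upperbound = x[0] + delta * n;
	for (i = nfirst; i < n; i++) {
		double avggap = (x[i - 1] - x[0]) / (i - 1);

		if (x[i] > upperbound ||
		    (avggap > 0.0 && x[i] - x[i - 1] > trendbreak * avggap))
			break;		/* Reject this and all later elements. */
	}
	return i;
}

The awk script quoted in the hunk above carries out the same accept-or-reject scan, element by element, over each input line.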
diff --git a/defer/rcuapi.tex b/defer/rcuapi.tex index eed973c..aa354ab 100644 --- a/defer/rcuapi.tex +++ b/defer/rcuapi.tex @@ -481,7 +481,7 @@ The Linux kernel currently has a surprising number of RCU APIs and implementations. There is some hope of reducing this number, evidenced by the fact that a given build of the Linux kernel currently has at most -three implementations behind four APIs (given that RCU Classic +four implementations behind three APIs (given that RCU Classic and Realtime RCU share the same API). However, careful inspection and analysis will be required, just as would be required in order to eliminate one of the many locking APIs. diff --git a/defer/rcuintro.tex b/defer/rcuintro.tex index 0b4a51c..03d6e12 100644 --- a/defer/rcuintro.tex +++ b/defer/rcuintro.tex @@ -160,7 +160,7 @@ Figure~\ref{fig:defer:Waiting for Pre-Existing Readers}, with time advancing from the top of the figure to the bottom. Although production-quality implementations of this approach can be -quite complex, a toy implementatoin is exceedingly simple: +quite complex, a toy implementation is exceedingly simple: \vspace{5pt} \begin{minipage}[t]{\columnwidth} diff --git a/defer/toyrcu.tex b/defer/toyrcu.tex index c9ba9e9..03d8bc4 100644 --- a/defer/toyrcu.tex +++ b/defer/toyrcu.tex @@ -1142,7 +1142,7 @@ As with the implementation described in Section~\ref{defer:Simple Counter-Based RCU}, the read-side primitives scale extremely well, incurring roughly 115~nanoseconds of overhead regardless of the number of CPUs. -The \co{synchronize_rcu()} primitives is still expensive, +The \co{synchronize_rcu()} primitive is still expensive, ranging from about one microsecond up to about 16~microseconds. This is nevertheless much cheaper than the roughly 200~microseconds incurred by the implementation in @@ -1200,18 +1200,18 @@ thread-local accesses to one, as is done in the next section. 4 ACCESS_ONCE(rcu_gp_ctr) + 1; 5 smp_mb(); 6 } - 7 + 7 8 static void rcu_read_unlock(void) 9 { 10 smp_mb(); 11 __get_thread_var(rcu_reader_gp) = 12 ACCESS_ONCE(rcu_gp_ctr); 13 } -14 +14 15 void synchronize_rcu(void) 16 { 17 int t; -18 +18 19 smp_mb(); 20 spin_lock(&rcu_gp_lock); 21 ACCESS_ONCE(rcu_gp_ctr) += 2; @@ -1386,7 +1386,7 @@ variables. 2 { 3 long tmp; 4 long *rrgp; - 5 + 5 6 rrgp = &__get_thread_var(rcu_reader_gp); 7 tmp = *rrgp; 8 if ((tmp & RCU_GP_CTR_NEST_MASK) == 0) @@ -1395,19 +1395,19 @@ variables. 11 *rrgp = tmp; 12 smp_mb(); 13 } -14 +14 15 static void rcu_read_unlock(void) 16 { 17 long tmp; -18 +18 19 smp_mb(); 20 __get_thread_var(rcu_reader_gp)--; 21 } -22 +22 23 void synchronize_rcu(void) 24 { 25 int t; -26 +26 27 smp_mb(); 28 spin_lock(&rcu_gp_lock); 29 ACCESS_ONCE(rcu_gp_ctr) += @@ -1905,7 +1905,7 @@ not only of RCU itself, but also of the requirements of enclosing software environments and applications. Those wishing an even deeper understanding are invited to read descriptions of production-quality RCU -implmentations~\cite{MathieuDesnoyers2012URCU,PaulEMcKenney2007PreemptibleRCU,PaulEMcKenney2008HierarchicalRCU,PaulEMcKenney2009BloatwatchRCU}. +implementations~\cite{MathieuDesnoyers2012URCU,PaulEMcKenney2007PreemptibleRCU,PaulEMcKenney2008HierarchicalRCU,PaulEMcKenney2009BloatwatchRCU}. The preceding sections listed some desirable properties of the various RCU primitives. diff --git a/defer/whichtochoose.tex b/defer/whichtochoose.tex index c7aaf5d..11cb4e2 100644 --- a/defer/whichtochoose.tex +++ b/defer/whichtochoose.tex @@ -93,7 +93,7 @@ traversed. 
Hazard pointers incur the overhead of a memory barrier for each data
element traversed, and sequence locks incur the overhead of a pair of
memory barriers for each attempt to execute the critical section.
-The overhead of RCU implemntations vary from nothing to that of a pair of
+The overhead of RCU implementations varies from nothing to that of a pair of
memory barriers for each read-side critical section, thus providing RCU
with the best performance, particularly for read-side critical sections
that traverse many data elements.
diff --git a/formal/dyntickrcu.tex b/formal/dyntickrcu.tex
index 56b93ef..aa22985 100644
--- a/formal/dyntickrcu.tex
+++ b/formal/dyntickrcu.tex
@@ -365,7 +365,7 @@ The \co{rcu_try_flip_waitack_state()} state invokes
 Lines~7 and 8 pick up current and snapshot versions of
 \co{dynticks_progress_counter}, respectively.
-The memory barrier on line~ensures that the counter checks
+The memory barrier on line~9 ensures that the counter checks
 in the later \co{rcu_try_flip_waitzero_state} follow the fetches of
 these counters.
 Lines~10 and 11 return zero (meaning no communication with the
@@ -1222,7 +1222,7 @@ incrementing the control variable.
 Line~10 tells \co{dyntick_nohz()} that an interrupt handler is running,
 and line~45 tells \co{dyntick_nohz()} that this handler has completed.
-Line~49 is used for liveness verification, much as is the corresponding
+Line~49 is used for liveness verification, as is the corresponding
 line of \co{dyntick_nohz()}.
 
 \QuickQuiz{}
@@ -1319,7 +1319,7 @@ results in a correct verification with roughly half a million states,
 passing without errors.
 
 However, this version of the model does not handle nested interrupts.
-This topic is taken up in the nest section.
+This topic is taken up in the next section.
 
 \subsubsection{Validating Nested Interrupt Handlers}
 \label{sec:formal:Validating Nested Interrupt Handlers}
@@ -1434,7 +1434,7 @@ This model (\url{dyntickRCU-irq-ssl.spin})
 results in a correct verification with a bit more than half a million
 states, passing without errors.
 
 However, this version of the model does not handle NMIs,
-which are taken up in the nest section.
+which are taken up in the next section.
 
 \subsubsection{Validating NMI Handlers}
 \label{sec:formal:Validating NMI Handlers}
@@ -1999,7 +1999,7 @@ the opposite \co{dynticks} polarity.
 1 void rcu_nmi_enter(void)
 2 {
 3 struct rcu_dynticks *rdtp;
- 4 
+ 4
 5 rdtp = &__get_cpu_var(rcu_dynticks);
 6 if (rdtp->dynticks & 0x1)
 7 return;
@@ -2007,11 +2007,11 @@ the opposite \co{dynticks} polarity.
 9 WARN_ON(!(rdtp->dynticks_nmi & 0x1));
10 smp_mb();
11 }
-12 
+12
13 void rcu_nmi_exit(void)
14 {
15 struct rcu_dynticks *rdtp;
-16 
+16
17 rdtp = &__get_cpu_var(rcu_dynticks);
18 if (rdtp->dynticks & 0x1)
19 return;
@@ -2026,7 +2026,7 @@ the opposite \co{dynticks} polarity.
 \end{figure}
 
 Figure~\ref{fig:formal:NMIs From Dynticks-Idle Mode}
-show the \co{rcu_nmi_enter()} and \co{rcu_nmi_exit()} functions,
+shows the \co{rcu_nmi_enter()} and \co{rcu_nmi_exit()} functions,
 which inform RCU of NMI entry and exit, respectively, from dynticks-idle
 mode.
 However, if the NMI arrives during an irq handler, then RCU will already
@@ -2101,7 +2101,7 @@ the fact that no RCU read-side critical sections may appear in
 dynticks-idle mode.
 Lines~23-25 check to see if the prior irq handlers enqueued any
 RCU callbacks, forcing this CPU out of dynticks-idle mode via
-an reschedule IPI if so.
+a reschedule IPI if so.
\subsubsection{Checking For Dynticks Quiescent States} \label{sec:formal:Checking For Dynticks Quiescent States} @@ -2115,7 +2115,7 @@ an reschedule IPI if so. 4 int ret; 5 int snap; 6 int snap_nmi; - 7 + 7 8 snap = rdp->dynticks->dynticks; 9 snap_nmi = rdp->dynticks->dynticks_nmi; 10 smp_mb(); @@ -2145,7 +2145,7 @@ Figures~\ref{fig:formal:Entering and Exiting Dynticks-Idle Mode}, \ref{fig:formal:Interrupts From Dynticks-Idle Mode}. Lines~11 and 12 record the snapshots for later calls to \co{rcu_implicit_dynticks_qs}, -and lines~13 and~14 checks to see if the CPU is in dynticks-idle mode with +and lines~13 and~14 check to see if the CPU is in dynticks-idle mode with neither irqs nor NMIs in progress (in other words, both snapshots have even values), hence in an extended quiescent state. If so, lines~15 and 16 count this event, and line~17 returns @@ -2161,7 +2161,7 @@ true if the CPU was in a quiescent state. 5 long curr_nmi; 6 long snap; 7 long snap_nmi; - 8 + 8 9 curr = rdp->dynticks->dynticks; 10 snap = rdp->dynticks_snap; 11 curr_nmi = rdp->dynticks->dynticks_nmi; diff --git a/formal/spinhint.tex b/formal/spinhint.tex index f08fda3..f5cbc2a 100644 --- a/formal/spinhint.tex +++ b/formal/spinhint.tex @@ -130,7 +130,7 @@ Lines 19-28 actually do the initialization, while lines 29-39 perform the assertion. Both are atomic blocks in order to avoid unnecessarily increasing the state space: because they are not part of the algorithm proper, -we loose no verification coverage by making them atomic. +we lose no verification coverage by making them atomic. The do-od construct on lines 21-27 implements a Promela loop, which can be thought of as a C {\tt for (;;)} loop containing a @@ -389,7 +389,7 @@ Given a source file \url{qrcu.spin}, one can use the following commands: liveness, fairness, or forward-progress checks, you may need to compile without \co{-DSAFETY}. If you leave off \co{-DSAFETY} when you could have used it, the program will let you know. - + The optimizations produced by \co{-DSAFETY} greatly speed things up, so you should use it when you can. An example situation where you cannot use \co{-DSAFETY} is @@ -461,7 +461,7 @@ C++, or Java. progress and terminate the loop. In Promela, loop counters must be avoided like the plague because they cause the state space to explode. On the other hand, there is no penalty for - infinite loops in Promela as long as the none of the variables + infinite loops in Promela as long as none of the variables monotonically increase or decrease -- Promela will figure out how many passes through the loop really matter, and automatically prune execution beyond that point. @@ -855,7 +855,7 @@ qrcu_read_unlock(&my_qrcu_struct, idx); but, like SRCU's \co{synchronize_srcu()}, QRCU's \co{synchronize_qrcu()} need wait only for those read-side critical sections that are using the same \co{qrcu_struct}. - + For example, \co{synchronize_qrcu(&your_qrcu_struct)} would \emph{not} need to wait on the earlier QRCU read-side critical section. @@ -895,7 +895,7 @@ two \co{#define} statements, giving us not one but two ways to create combinatorial explosion. The \co{idx} variable controls which of the two elements of the \co{ctr} array will be used by readers, and the \co{readerprogress} variable -allows to assertion to determine when all the readers are finished +allows an assertion to determine when all the readers are finished (since a QRCU update cannot be permitted to complete until all pre-existing readers have completed their QRCU read-side critical sections). 
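For context, the \co{idx} and two-element \co{ctr} array called out in the spinhint.tex hunk above model QRCU's counter-flip scheme. Below is a rough C sketch of the QRCU read side using Linux-kernel atomic primitives; it reflects the commonly described QRCU design rather than any code in this patch, and memory barriers and access annotations are omitted for brevity.

#include <linux/atomic.h>

/* Rough sketch of the QRCU design that the Promela model verifies. */
struct qrcu_struct {
	int completed;		/* Low bit selects the current ctr[] element. */
	atomic_t ctr[2];	/* Readers counted on each side. */
};

int qrcu_read_lock(struct qrcu_struct *qp)
{
	for (;;) {
		int idx = qp->completed & 0x1;

		if (atomic_inc_not_zero(&qp->ctr[idx]))
			return idx;	/* Grace periods must now wait on us. */
		/* That side just drained; retry against the other side. */
	}
}

void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
{
	atomic_dec(&qp->ctr[idx]);
}

The update side flips \co{->completed} and waits for the previously current counter to drain to zero, which corresponds to the grace-period guarantee that the model's assertion on \co{readerprogress} verifies.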
diff --git a/future/future.tex b/future/future.tex index e431fa7..65eb681 100644 --- a/future/future.tex +++ b/future/future.tex @@ -54,7 +54,7 @@ I list but three: Note that Haskell's \emph{monads} were invented to deal with single-threaded global state, and that multi-threaded access to global state requires additional violence to the functional model. -\item Multithreaded procedural languages often use synchonization +\item Multithreaded procedural languages often use synchronization primitives such as locks, atomic operations, and transactions, which inflict added violence upon the functional model. \item Procedural languages can \emph{alias} function arguments, diff --git a/future/htm.tex b/future/htm.tex index 09b5dee..3de32c7 100644 --- a/future/htm.tex +++ b/future/htm.tex @@ -95,9 +95,9 @@ HTM's advantage is greatest in cases where a lock data structure is placed in a separate cache line, in which case, converting a given critical section to an HTM transaction can reduce that critical section's overhead by a full cache miss. -This savings can be quite significant for the common case of short +These savings can be quite significant for the common case of short critical sections, at least for those situations where the elided lock -does not share a cache line with a oft-written variable protected by +does not share a cache line with an oft-written variable protected by that lock. \QuickQuiz{} @@ -265,7 +265,7 @@ For example, suppose that transactions~A and~B are defined as follows: \vspace{5pt} \begin{minipage}[t]{\columnwidth} \begin{verbatim} -Trasaction A Transaction B +Transaction A Transaction B x = 1; y = 2; y = 3; x = 4; @@ -452,7 +452,7 @@ short-duration transactions could be guaranteed to eventually succeed. This would permit a transaction to be unconditionally retried, in the same way that compare-and-swap (CAS) and load-linked/store-conditional (LL/SC) operations are unconditionally retried in code that uses these -instructions to implement atomic operation. +instructions to implement atomic operations. Unfortunately, most currently available HTM implementation refuse to make any @@ -654,16 +654,16 @@ semantics of locking, but loses locking's time-based messaging semantics. \scriptsize \begin{verbatim} 1 int my_status = -1; /* Thread local. */ - 2 + 2 3 while (continue_working()) { 4 enqueue_any_new_work(); 5 wp = dequeue_work(); 6 do_work(wp); 7 my_timestamp = clock_gettime(...); 8 } - 9 + 9 10 acquire_lock(&departing_thread_lock); - 11 + 11 12 /* 13 * Disentangle from application, might 14 * acquire other locks, can take much longer @@ -672,7 +672,7 @@ semantics of locking, but loses locking's time-based messaging semantics. 17 */ 18 my_status = get_return_status(); 19 release_lock(&departing_thread_lock); - 20 + 20 21 /* thread awaits repurposing. */ \end{verbatim} \end{minipage} @@ -738,7 +738,7 @@ inversion.'' 1 void boostee(void) 2 { 3 int i = 0; - 4 + 4 5 acquire_lock(&boost_lock[i]); 6 for (;;) { 7 acquire_lock(&boost_lock[!i]); @@ -747,11 +747,11 @@ inversion.'' 10 do_something(); 11 } 12 } - 13 + 13 14 void booster(void) 15 { 16 int i = 0; - 17 + 17 18 for (;;) { 19 usleep(1000); /* sleep 1 ms. */ 20 acquire_lock(&boost_lock[i]); @@ -886,7 +886,7 @@ Table~\ref{tab:future:Comparison of Locking and HTM} results in the updated comparison between augmented locking and HTM shown in Table~\ref{tab:future:Comparison of Locking (Augmented by RCU or Hazard Pointers) and HTM}. 
-A summary of the differnces between the two tables is as follows: +A summary of the differences between the two tables is as follows: \begin{enumerate} \item Use of non-blocking read-side mechanisms alleviates deadlock issues. @@ -914,7 +914,7 @@ page~\pageref{fig:defer:RCU Areas of Applicability}, that is no reason not to start moving in that direction. HTM seems best suited to update-heavy workloads involving relatively -small changes to disparate portions of a relatively large in-memory +small changes to disparate portions of relatively large in-memory data structures running on large multiprocessors, as this meets the size restrictions of current HTM implementations while minimizing the probability of conflicts and attendant aborts and @@ -943,7 +943,7 @@ Nevertheless, it is quite possible that a steady stream of RCU or hazard-pointer readers might starve updaters due to a corresponding steady stream of conflicts. This vulnerability could be eliminated (perhaps at significant -hardware cost and complexity) by giving extra-tranactional +hardware cost and complexity) by giving extra-transactional reads the pre-transaction copy of the memory location being loaded. The fact that HTM transactions must have fallbacks might in some cases @@ -988,7 +988,7 @@ interrupt frequency, and scheduler implementation. Cache size and associativity was discussed in Section~\ref{sec:future:Transaction-Size Limitations}, along with some research intended to work around current limitations. -However, we HTM forward-progress guarantees would +However, HTM forward-progress guarantees would come with size limits, large though these limits might one day be. So why don't current HTM implementations provide forward-progress guarantees for small transactions, for example, limited to the diff --git a/future/tm.tex b/future/tm.tex index 3e42e74..f3142f3 100644 --- a/future/tm.tex +++ b/future/tm.tex @@ -196,7 +196,7 @@ Here are some options available to TM: multiple nested transactions). Alternatively, enlist the compiler to enforce RPC-free transactions. - This approach does works, but will require TM to + This approach does work, but will require TM to interact with other synchronization primitives. \item Permit only one special irrevocable transaction~\cite{SpearMichaelScott2008InevitableSTM} @@ -333,7 +333,7 @@ How could a similar persistent functionality be provided for TM? Unfortunately, this does not handle network communication, nor does it handle I/O to devices that do not provide snapshot capabilities, for example, memory sticks. -\item Build a time machine. +\item Build a time machine. \end{enumerate} Of course, the fact that it is called transactional \emph{memory} @@ -416,7 +416,7 @@ What might TM do about thread spawning within a transaction? participating in the transaction? The answers to these questions are reasonably straightforward in the case of locking. - The answers for TM are left as an exercise for the reader. + The answers for TM are left as an exercise for the reader. \end{enumerate} Given that parallel execution of transactions is commonplace in the @@ -484,10 +484,10 @@ from within a transaction? This approach has some advantages over aborting the transaction at runtime, but again requires non-TM synchronization primitives for use in conjunction with \co{exec()}. 
-\item Treat the transaction in a manner similar to non-persistent +\item Treat the transaction in a manner similar to non-persistent Locking primitives, so that the transaction survives if exec() fails, and silently commits if the \co{exec()} succeeds. - The case were some of the variables affected by the transaction + The case where some of the variables affected by the transaction reside in \co{mmap()}ed memory (and thus could survive a successful \co{exec()} system call) is left as an exercise for the reader. \item Abort the transaction (and the \co{exec()} system call) if the @@ -542,7 +542,7 @@ Options for part (a), the actual loading of the code, include the following: already present, and the transaction can thus be expected to proceed normally. \item Disallow dynamic linking and loading of functions from within - transactions. + transactions. \end{enumerate} Options for part (b), the inability to detect TM-unfriendly operations @@ -572,7 +572,7 @@ in a not-yet-loaded function, possibilities include the following: That said, the standardization effort is already in progress~\cite{Ali-Reza-Adl-Tabatabai2009CppTM}. \item As above, disallow dynamic linking and loading of functions from - within transactions. + within transactions. \end{enumerate} I/O operations are of course a known weakness of TM, and dynamic linking @@ -702,7 +702,7 @@ quite well, at least as long as the usual well-known software-engineering techniques are employed to avoid deadlock. It is not unusual to acquire locks from within RCU read-side critical sections, which eases deadlock concerns because RCU read-side primitives -cannot participated in lock-based deadlock cycles. +cannot participate in lock-based deadlock cycles. But happens when you attempt to acquire a lock from within a transaction? In theory, the answer is trivial: simply manipulate the data structure @@ -738,10 +738,10 @@ the occasional transaction. done by the TxLinux~\cite{ChistopherJRossbach2007a} group. This approach seems sound, but leaves the locking design constraints (such as the need to avoid deadlock) firmly in place. -\item Strive to reduce the overhead imposed on locking primitives. +\item Strive to reduce the overhead imposed on locking primitives. \end{enumerate} -The fact that there could possibly a problem interfacing TM and locking +The fact that there could possibly be a problem interfacing TM and locking came as a surprise to many, which underscores the need to try out new mechanisms and primitives in real-world production software. Fortunately, the advent of open source means that a huge quantity of @@ -755,7 +755,7 @@ other locks, which just works, at least as long as the usual well-known software-engineering techniques are employed to avoid deadlock. Read-acquiring reader-writer locks from within RCU read-side critical sections also works, and doing so eases deadlock concerns because RCU -read-side primitives cannot participated in lock-based deadlock cycles. +read-side primitives cannot participate in lock-based deadlock cycles. But what happens when you attempt to read-acquire a reader-writer lock from within a transaction? @@ -899,7 +899,7 @@ Some possibilities are as follows: to be atomic, the ordering of the accesses within the transaction is not supposed to matter. \item Prohibit use of TM in RCU updates. - This is guaranteed to work, but seems a bit restrictive. + This is guaranteed to work, but seems a bit restrictive. 
\end{enumerate} It seems likely that additional approaches will be uncovered, especially diff --git a/locking/locking.tex b/locking/locking.tex index 9feeaf0..8f22308 100644 --- a/locking/locking.tex +++ b/locking/locking.tex @@ -365,18 +365,18 @@ the third containing lock~D. 2 spinlock_t s; 3 struct list_head h; 4 }; - 5 + 5 6 struct list_head *list_start(struct locked_list *lp) 7 { 8 spin_lock(&lp->s); 9 return list_next(lp, &lp->h); 10 } - 11 + 11 12 struct list_head *list_next(struct locked_list *lp, 13 struct list_head *np) 14 { 15 struct list_head *ret; - 16 + 16 17 ret = np->next; 18 if (ret == &lp->h) { 19 spin_unlock(&lp->s); @@ -413,12 +413,12 @@ been reached. 2 struct list_head n; 3 int a; 4 }; - 5 + 5 6 void list_print(struct locked_list *lp) 7 { 8 struct list_head *np; 9 struct list_ints *ip; - 10 + 10 11 np = list_start(lp); 12 while (np != NULL) { 13 ip = list_entry(np, struct list_ints, n); @@ -653,7 +653,7 @@ Once all needed locks have been acquired, the transaction enters the second phase, where locks are released, but not acquired. This locking approach allows databases to provide serializability guarantees for their transactions, in other words, to guarantee -that all of values see and produced by the transactions are consistent +that all values seen and produced by the transactions are consistent with some global ordering of all the transactions. Many such systems rely on the ability to abort transactions, although this can be simplified by avoiding making any changes to shared data @@ -844,7 +844,7 @@ quite useful in many settings. 11 spin_unlock(&lock2); 12 spin_unlock(&lock1); 13 } - 14 + 14 15 void thread2(void) 16 { 17 retry: @@ -939,7 +939,7 @@ a group of threads starve, rather than just one of them.\footnote{ 14 spin_unlock(&lock2); 15 spin_unlock(&lock1); 16 } - 17 + 17 18 void thread2(void) 19 { 20 unsigned int wait = 1; @@ -982,7 +982,7 @@ Figure~\ref{fig:locking:Conditional Locking and Exponential Backoff}. required to execute the critical section, which will normally be in the microsecond or millisecond range. \item The code does not check for overflow. - On the other hand, this bug is nullified + On the other hand, this bug is nullified by the previous bug: 32 bits worth of seconds is more than 50 years. \end{enumerate} @@ -1450,7 +1450,7 @@ However, RAII locking also has a dark side. RAII makes it quite difficult to encapsulate lock acquisition and release, for example, in iterators. In many iterator implementations, you would like to acquire the lock in -the interator's ``start'' function and release it in the iterator's ``stop'' +the iterator's ``start'' function and release it in the iterator's ``stop'' function. RAII locking instead requires that the lock acquisition and release take place in the same level of scoping, making such encapsulation difficult or @@ -1506,7 +1506,7 @@ the root \co{rcu_node} structure's \co{->fqslock} as been acquired. 3 int ret; 4 struct rcu_node *rnp = rnp_leaf; 5 struct rcu_node *rnp_old = NULL; - 6 + 6 7 for (; rnp != NULL; rnp = rnp->parent) { 8 ret = (ACCESS_ONCE(gp_flags)) || 9 !raw_spin_trylock(&rnp->fqslock); @@ -1615,7 +1615,7 @@ environments. \begin{verbatim} 1 typedef int xchglock_t; 2 #define DEFINE_XCHG_LOCK(n) xchglock_t n = 0 - 3 + 3 4 void xchg_lock(xchglock_t *xp) 5 { 6 while (xchg(xp, 1) == 1) { @@ -1623,7 +1623,7 @@ environments. 
8 continue; 9 } 10 } - 11 + 11 12 void xchg_unlock(xchglock_t *xp) 13 { 14 (void)xchg(xp, 0); @@ -1949,7 +1949,7 @@ synchronization design on locking, such software also almost always makes use of other synchronization mechanisms, including special counting algorithms (Chapter~\ref{chp:Counting}), data ownership (Chapter~\ref{chp:Data Ownership}), -reference counting (Section~\ref{sec:defer:Reference Counting}), +reference counting (Section~\ref{sec:defer:Reference Counting}), sequence locking (Section~\ref{sec:defer:Sequence Locks}), and read-copy update (Section~\ref{sec:defer:Read-Copy Update (RCU)}). In addition, practitioners use tools for deadlock @@ -2106,7 +2106,7 @@ by deciding where the locks should be acquired and released. In particular, this strategy allows the lock acquisition and release functions to block signals as needed without the library code needing to -be concerned with of which signals need to be blocked by which locks. +be concerned with which signals need to be blocked by which locks. The separation of concerns used by this strategy can be quite effective, but in some cases the strategies laid out in the following sections can work better. @@ -2212,7 +2212,7 @@ but not as easy as constructing a parallel application. With the advent of readily available low-cost multicore systems, a common task is parallelizing an existing library that was designed with only single-threaded use in mind. -This all-to-common disregard for parallelism can result in a library +This all-too-common disregard for parallelism can result in a library API that is severely flawed from a parallel-programming viewpoint. Candidate flaws include: diff --git a/rt/rt.tex b/rt/rt.tex index 538a34a..f2f7cf8 100644 --- a/rt/rt.tex +++ b/rt/rt.tex @@ -37,7 +37,7 @@ said to be a soft real-time application: ``My application computes million-point fourier transforms in half a picosecond.'' ``No way!!! -The clock cycle on this system is more the \emph{three hundred} picoseconds!'' +The clock cycle on this system is more than \emph{three hundred} picoseconds!'' ``Ah, but it is a \emph{soft} real-time application!'' If the term ``soft real time'' is to be of any use whatesoever, some limits are clearly required. @@ -218,8 +218,8 @@ that portion of the outside world that is to be monitored or controlled. stop operating. } \QuickQuizEnd -A number of systems intended to operate in environments with impressive -levels shock and vibration, for example, engine control systems. +A number of systems are intended to operate in environments with impressive +levels of shock and vibration, for example, engine control systems. More strenuous requirements may be found when we move away from continuous vibrations to intermittent shocks. For example, during my undergraduate studies, I encountered an old Athena @@ -240,7 +240,7 @@ the effects of low-energy electromagnetic radiation, error-correction coding can reduce the effects of high-energy radiation, various potting and sealing techniques can reduce the effect of air quality, and any number of heating and cooling systems can counter the effects of temperature. -In extreme cases, triple modulo redundancy can reduce the probability that +In extreme cases, triple modular redundancy can reduce the probability that a fault in one part of the system will result in incorrect behavior from the overall system. However, all of these methods have one thing in common: Although they @@ -292,7 +292,7 @@ real-time systems. 
} \QuickQuizEnd Of course, maintaining sufficiently low utilization requires great -discipline throughout the design and implmeentation. +discipline throughout the design and implementation. There is nothing quite like a little feature creep to destroy deadlines. \subsubsection{Application Constraints} @@ -368,7 +368,7 @@ cylinder contained in the log to the blade, (4)~Continuously vary the knife's position so as to peel the log into veneer, (5)~Remove the remaining core of the log that is too small to peel, and (6)~Wait for the next log. -Each of these five phases of operation might well have its own set of +Each of these six phases of operation might well have its own set of deadlines and environmental constraints, for example, one would expect phase~4's deadlines to be much more severe than those of phase 6, milliseconds instead of seconds. @@ -437,7 +437,7 @@ large advances are required. systems~\cite{JadeAlglave2011ppcmem,Alglave:2013:SVW:2450268.2450306}. } \QuickQuizEnd -In addition to latency requirement for the real-time portions of the +In addition to latency requirements for the real-time portions of the application, there will likely be performance and scalability requirements for the non-real-time portions of the application. These additional requirements reflect the fact that ultimate real-time @@ -456,7 +456,7 @@ the sound-bite-based approach to real-time computing. \label{sec:rt:Who Needs Real-Time Computing?} It is possible to argue that all computing is in fact real-time computing. -For one moderately extreme example, when purchase a birthday gift online, +For one moderately extreme example, when you purchase a birthday gift online, you would like the gift to arrive before the recipient's birthday. And in fact even turn-of-the-millenium web services observed sub-second response constraints~\cite{KristofferBohmann2001a}, and requirements have @@ -694,13 +694,13 @@ utility thread running on Linux, there are invariably rough edges. In addition, the RTOS must interface to both the hardware and to the Linux kernel, thus requiring significant maintenance with changes in both hardware and kernel. -Furthermore, each such RTOSes often has its own system-call interface +Furthermore, each such RTOS often has its own system-call interface and set of system libraries, which can balkanize both ecosystems and developers. In fact, these problems seem to be what drove the combination of RTOSes with Linux, as this approach allowed access to the full real-time capabilities of the RTOS, while allowing the application's non-real-time -code full access Linux's rich and vibrant open-source ecosystem. +code full access to Linux's rich and vibrant open-source ecosystem. \begin{figure*}[p] \begin{center} @@ -746,7 +746,7 @@ This of course greatly improves real-time response latency, but preemption is still disabled within RCU read-side critical sections, spinlock critical sections, -interrupt handlers, +interrupt handlers, interrupt-disabled code regions, and preempt-disabled code regions, as indicated by the red boxes in the left-most diagram in the middle row of the figure. @@ -791,7 +791,7 @@ If configured properly, a non-trivial undertaking, \co{CONFIG_NO_HZ_FULL} offers real-time threads levels of performance nearly rivaling that of bare-metal systems. 
-Some has of course been much debate over which of these approaches +There has of course been much debate over which of these approaches is best for real-time systems, and this debate has been going on for quite some time~\cite{JonCorbet2004RealTimeLinuxPart1,JonCorbet2004RealTimeLinuxPart2}. @@ -917,7 +917,7 @@ that timeouts cannot be set for finer than one-millisecond granularities. On the other hand, Figure~\ref{fig:rt:Timer Wheel at 100kHz} shows timer processing taking place every ten microseconds, which -provides acceptably find timer granularity for most (but not all!) +provides acceptably fine timer granularity for most (but not all!) workloads, but which processes timers so frequently that the system might well not have time to do anything else. @@ -952,7 +952,7 @@ is good and sufficient. Another key observation is that error-handling timeouts are normally cancelled very early, often before they can be cascaded. A final observation is that systems commonly have many more error-handling -timeouts than they do timer events, so that an $O(\log n)$ +timeouts than they do timer events, so that an $O(\log n)$ data structure should provide acceptable performance for timer events. In short, the Linux kernel's -rt patchset uses timer wheels for @@ -973,7 +973,7 @@ namely long-running interrupt handlers, as shown in Figure~\ref{fig:rt:Non-Threaded Interrupt Handler}. These latencies can be especially problematic for devices that can -deliver large number of events with a single interrupt, which means +deliver a large number of events with a single interrupt, which means that the interrupt handler will run for an extended period of time processing all of these events. Worse yet are devices that can deliver new events to a still-running @@ -1089,9 +1089,9 @@ priority-inversion conundrum: \item Only allow one read-acquisition of a given reader-writer lock at a time. (This is the approach traditionally taken by the Linux kernel's -rt patchset.) -\item Only allow $N$ read-acquisitions of a given reader-writer lcok +\item Only allow $N$ read-acquisitions of a given reader-writer lock at a time, where $N$ is the number of CPUs. -\item Only allow $N$ read-acquisitions of a given reader-writer lcok +\item Only allow $N$ read-acquisitions of a given reader-writer lock at a time, where $N$ is a number specified somehow by the developer. There is a good chance that the Linux kernel's -rt patchset @@ -1154,11 +1154,11 @@ excessive real-time latencies. 3 current->rcu_read_lock_nesting++; 4 barrier(); 5 } - 6 + 6 7 void __rcu_read_unlock(void) 8 { 9 struct task_struct *t = current; -10 +10 11 if (t->rcu_read_lock_nesting != 1) { 12 --t->rcu_read_lock_nesting; 13 } else { @@ -1255,8 +1255,8 @@ boosting of large numbers of readers. \paragraph{Preemptible spinlocks} are an important part of the -rt patchset due to the long-duration spinlock-based critical sections in the Linux kernel. 
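The __rcu_read_unlock() listing a few hunks up is truncated at its else
clause.  From memory, kernels of that era continue roughly as follows --
treat this as a sketch rather than the verbatim source:

    void __rcu_read_unlock(void)
    {
            struct task_struct *t = current;

            if (t->rcu_read_lock_nesting != 1) {
                    --t->rcu_read_lock_nesting; /* Nested: just count down. */
            } else {
                    barrier();
                    t->rcu_read_lock_nesting = INT_MIN; /* Flag outermost exit. */
                    barrier();
                    if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
                            rcu_read_unlock_special(t); /* Deboost, report QS. */
                    barrier();
                    t->rcu_read_lock_nesting = 0;
            }
    }

The INT_MIN marker lets interrupt handlers detect that the outermost
rcu_read_unlock() is mid-exit, which matters for the priority-boosting
discussion that this section leads into.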
-This functionality has not yet reached mainline: Although they is a conceptually
-simple substitution of sleeplocks for spinlocks, they has proven relatively
+This functionality has not yet reached mainline: Although they are a conceptually
+simple substitution of sleeplocks for spinlocks, they have proven relatively
 controversial.\footnote{
 	In addition, development of the -rt patchset has slowed
 	in recent years, perhaps because the real-time functionality that is already
@@ -1342,7 +1342,7 @@ Specific per-kthread advice may be found in the Linux kernel source
 A third source of OS jitter in the Linux kernel for CPU-bound threads
 running at real-time priority is the scheduler itself.
 This is an intentional debugging feature, designed to ensure that
-important non-realtime work is allotted at least 50 milliseoncds
+important non-realtime work is allotted at least 50 milliseconds
 out of each second, even if there is an infinite-loop bug in
 your real-time application.
 However, when you are running a polling-loop-style real-time application,
@@ -1430,7 +1430,7 @@ housekeeping CPUs to handle the housekeeping load imposed by the
 rest of the system, which requires careful benchmarking and tuning.
 
 Of course, there is no free lunch, and \co{NO_HZ_FULL} is no exception.
-As noted earlier, 
+As noted earlier,
 \co{NO_HZ_FULL} makes kernel/user transitions more expensive due
 to the need for delta process accounting and the need to inform kernel
 subsystems (such as RCU) of the transitions.
diff --git a/together/applyrcu.tex b/together/applyrcu.tex
index 55062e3..81d2cab 100644
--- a/together/applyrcu.tex
+++ b/together/applyrcu.tex
@@ -115,22 +115,22 @@ held constant, ensuring that \co{read_count()} sees consistent data.
  2 unsigned long total;
  3 unsigned long *counterp[NR_THREADS];
  4 };
- 5 
+ 5
  6 long __thread counter = 0;
  7 struct countarray *countarrayp = NULL;
  8 DEFINE_SPINLOCK(final_mutex);
- 9 
+ 9
 10 void inc_count(void)
 11 {
 12 counter++;
 13 }
- 14 
+ 14
 15 long read_count(void)
 16 {
 17 struct countarray *cap;
 18 unsigned long sum;
 19 int t;
- 20 
+ 20
 21 rcu_read_lock();
 22 cap = rcu_dereference(countarrayp);
 23 sum = cap->total;
@@ -140,7 +140,7 @@ held constant, ensuring that \co{read_count()} sees consistent data.
 27 rcu_read_unlock();
 28 return sum;
 29 }
- 30 
+ 30
 31 void count_init(void)
 32 {
 33 countarrayp = malloc(sizeof(*countarrayp));
@@ -150,22 +150,22 @@ held constant, ensuring that \co{read_count()} sees consistent data.
 37 }
 38 memset(countarrayp, '\0', sizeof(*countarrayp));
 39 }
- 40 
+ 40
 41 void count_register_thread(void)
 42 {
 43 int idx = smp_thread_id();
- 44 
+ 44
 45 spin_lock(&final_mutex);
 46 countarrayp->counterp[idx] = &counter;
 47 spin_unlock(&final_mutex);
 48 }
- 49 
+ 49
 50 void count_unregister_thread(int nthreadsexpected)
 51 {
 52 struct countarray *cap;
 53 struct countarray *capold;
 54 int idx = smp_thread_id();
- 55 
+ 55
 56 cap = malloc(sizeof(*countarrayp));
 57 if (cap == NULL) {
 58 fprintf(stderr, "Out of memory\n");
@@ -330,7 +330,7 @@ lock.
 
 This section shows how RCU may be used to avoid this overhead.
 The code for performing an I/O is quite similar to the original, with
-an RCU read-side critical section be substituted for the reader-writer
+an RCU read-side critical section being substituted for the reader-writer
 lock read-side critical section in the original:
 
 \vspace{5pt}
@@ -432,7 +432,7 @@ use of explicit memory barriers.
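For reference, the RCU-based I/O fastpath that the applyrcu.tex hunk above
alludes to looks roughly like this.  Reconstructed from memory, with
removing, add_count(), sub_count(), cancel_io(), and do_io() being the
example's assumed primitives:

    rcu_read_lock();
    if (removing) {
            rcu_read_unlock();
            cancel_io();        /* Device is going away: fail the I/O. */
    } else {
            add_count(1);       /* Pin the device against removal. */
            rcu_read_unlock();
            do_io();
            sub_count(1);       /* I/O complete: release the pin. */
    }

The point of the substitution is that the read-side critical section need
only cover the removal check and the counter increment, after which the
counter (rather than a lock) holds the device in place for the I/O itself.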
2 int length; 3 char a[0]; 4 }; - 5 + 5 6 struct foo { 7 struct foo_a *fa; 8 }; @@ -521,7 +521,7 @@ This copying might incur unacceptably large overhead. 3 double meas_2; 4 double meas_3; 5 }; - 6 + 6 7 struct animal { 8 char name[40]; 9 double age; @@ -559,7 +559,7 @@ can be freed. 3 double meas_2; 4 double meas_3; 5 }; - 6 + 6 7 struct animal { 8 char name[40]; 9 double age; diff --git a/together/count.tex b/together/count.tex index 009239c..892133d 100644 --- a/together/count.tex +++ b/together/count.tex @@ -8,7 +8,7 @@ This section outlines possible solutions to some counter conundrums. \subsection{Counting Updates} \label{sec:together:Counting Updates} -Suppose that Sch\"odinger (see +Suppose that Schr\"odinger (see Section~\ref{sec:datastruct:Motivating Application}) wants to count the number of updates for each animal, and that these updates are synchronized using a per-data-element lock. @@ -23,7 +23,7 @@ protection of that element's lock! \subsection{Counting Lookups} \label{sec:together:Counting Lookups} -Suppose that Sch\"odinger also wants to count the number of lookups for +Suppose that Schr\"odinger also wants to count the number of lookups for each animal, where lookups are protected by RCU. How can this counting best be done? diff --git a/together/hash.tex b/together/hash.tex index 9c7f1b7..60acdb1 100644 --- a/together/hash.tex +++ b/together/hash.tex @@ -14,7 +14,7 @@ This situation is analogous to that in Section~\ref{sec:together:Correlated Fields}: We have a hash table where we need correlated views of two or more of the elements. -These elements are updated together, and we do not want so see an old +These elements are updated together, and we do not want to see an old version of the first element along with new versions of the other elements. For example, Schr\"odinger decided to add his extended family to his @@ -24,7 +24,7 @@ happen instantaneously, he is also a traditionalist. As such, he absolutely does not want his database ever to show that the bride is now married, but the groom is not, and vice versa. In other words, Schr\"odinger wants to be able to carry out a -wedlock-consistent tranversal of his database. +wedlock-consistent traversal of his database. One approach is to use sequence locks (see Section~\ref{sec:defer:Sequence Locks}), @@ -40,7 +40,7 @@ This approach works quite well when the number of correlated elements is small, the time to read these elements is short, and the update rate is low. Otherwise, updates might happen so quickly that readers might never complete. -Although Schr\"odinger does not expect that even is least-sane relatives +Although Schr\"odinger does not expect that even his least-sane relatives will marry and divorce quickly enough for this to be a problem, he does realize that this problem could well arise in other situations. One way to avoid this reader-starvation problem is to have the readers @@ -74,7 +74,7 @@ interested reader. Suppose that a statistical scan of all elements in a hash table is required. For example, Schr\"odinger might wish to compute the average -length-to-weight ratio over all of his animals.\footnote{ +length-to-weight ratio over all of his animals.\footnote{ Why would such a quantity be useful? Beats me! But group statistics in general are often useful.} -- 1.9.3 (Apple Git-50) -- To unsubscribe from this list: send the line "unsubscribe perfbook" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html
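For anyone wanting the statistical scan from the final hash.tex hunk in
code form, here is a sketch.  It assumes userspace RCU's rculist macros,
and the hashtab/animal field names (ht_nbuckets, ht_bkt, a_node, length,
weight) are illustrative assumptions, not the book's verbatim structures:

    /* Walk every bucket under RCU protection, accumulating the
     * length-to-weight ratio.  Each entry is a snapshot and may be
     * slightly stale, which is fine for group statistics. */
    double avg_length_to_weight(struct hashtab *htp)
    {
            double sum = 0.0;
            long n = 0;
            int i;
            struct animal *ap;

            for (i = 0; i < htp->ht_nbuckets; i++) {
                    rcu_read_lock();
                    cds_list_for_each_entry_rcu(ap, &htp->ht_bkt[i], a_node)
                            if (ap->weight != 0.0) { /* Skip degenerate entries. */
                                    sum += ap->length / ap->weight;
                                    n++;
                            }
                    rcu_read_unlock();
            }
            return n ? sum / n : 0.0;
    }

Dropping and reacquiring the read lock between buckets keeps any single
RCU read-side critical section short, at the cost of an even fuzzier
snapshot -- usually acceptable for statistics of this kind.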