On Mon, Aug 30, 2021 at 11:24:24PM +0900, Akira Yokosawa wrote: > Add index markers to terms listed below: > > o Amdahl's Law > o cbmc (acronym) > o efficiency > - energy > o false sharing > o fragmentation > o grace period > o hazard pointer > o hot spot > o invalidation > o latency > - cache-invalidation > - grace-period > - message > - scheduling > o linearizable > o memory consistency > - process (process consistency) > - sequential (sequential consistency) > - weak (weak consistency) > o Nidhugg > o NUCA (acronym) > o NUMA (acronym) > o overhead > o performance > o race condition > o scalability > > Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx> > --- > Hi Paul, > > "Take one" was back in April: > > de25a117a9c6 ("index: Add index and acronym tags, take one") > > "Take three" should follow in a month or so. Nice! Queued and pushed, thank you! The comments on the various forms of the IX command in perfbook-lt.tex are very helpful, by the way, so thank you for that as well. Thanx, Paul > Thanks, Akira > -- > SMPdesign/SMPdesign.tex | 8 ++++---- > SMPdesign/beyond.tex | 4 ++-- > SMPdesign/criteria.tex | 4 ++-- > SMPdesign/partexercises.tex | 4 ++-- > advsync/advsync.tex | 4 ++-- > advsync/rt.tex | 6 +++--- > appendix/questions/concurrentparallel.tex | 4 ++-- > appendix/toyrcu/toyrcu.tex | 6 +++--- > appendix/whymb/whymemorybarriers.tex | 8 ++++---- > count/count.tex | 2 +- > cpu/overheads.tex | 10 +++++----- > cpu/overview.tex | 4 ++-- > datastruct/datastruct.tex | 16 ++++++++-------- > debugging/debugging.tex | 10 +++++----- > defer/hazptr.tex | 2 +- > defer/rcuapi.tex | 6 +++--- > defer/rcufundamental.tex | 2 +- > defer/rcuintro.tex | 4 ++-- > defer/rcurelated.tex | 8 ++++---- > defer/whichtochoose.tex | 4 ++-- > formal/axiomatic.tex | 2 +- > formal/dyntickrcu.tex | 2 +- > formal/formal.tex | 2 +- > formal/sat.tex | 4 ++-- > formal/spinhint.tex | 12 ++++++------ > formal/stateless.tex | 2 +- > future/cpu.tex | 6 +++--- > future/formalregress.tex | 5 +++-- > future/htm.tex | 4 ++-- > future/tm.tex | 4 ++-- > glossary.tex | 4 ++-- > glsdict.tex | 1 + > intro/intro.tex | 6 +++--- > locking/locking-existence.tex | 2 +- > locking/locking.tex | 6 +++--- > memorder/memorder.tex | 11 ++++++----- > together/applyrcu.tex | 2 +- > together/hazptr.tex | 2 +- > together/refcnt.tex | 2 +- > together/seqlock.tex | 3 ++- > 40 files changed, 101 insertions(+), 97 deletions(-) > > diff --git a/SMPdesign/SMPdesign.tex b/SMPdesign/SMPdesign.tex > index 6d53309f..3d58582d 100644 > --- a/SMPdesign/SMPdesign.tex > +++ b/SMPdesign/SMPdesign.tex > @@ -451,7 +451,7 @@ bucket from being freed during the time that its lock was being acquired. > reference counter, or a global array of reference counters. > These have strengths and limitations similar to those > called out above for locks. > - \item Use \emph{hazard pointers}~\cite{MagedMichael04a}, which > + \item Use \emph{\IXpl{hazard pointer}}~\cite{MagedMichael04a}, which > can be thought of as an inside-out reference count. > Hazard-pointer-based algorithms maintain a per-thread list of > pointers, so that the appearance of a given pointer on > @@ -525,7 +525,7 @@ Data ownership might seem arcane, but it is used very frequently: > If there is significant sharing, communication between the threads > or CPUs can result in significant complexity and overhead.
> Furthermore, if the most-heavily used data happens to be that owned > -by a single CPU, that CPU will be a ``hot spot'', sometimes with > +by a single CPU, that CPU will be a ``\IX{hot spot}'', sometimes with > results resembling that shown in \cref{fig:SMPdesign:Data and Skew}. > However, in situations where no sharing is required, data ownership > achieves ideal performance, and with code that can be as simple > @@ -566,7 +566,7 @@ a mathematical synchronization\-/efficiency viewpoint. > Readers who are uninspired by mathematics might choose to skip > this section. > > -The approach is to use a crude queueing model for the efficiency of > +The approach is to use a crude queueing model for the \IX{efficiency} of > synchronization mechanism that operate on a single shared global > variable, based on an M/M/1 queue. > M/M/1 queuing models are based on an exponentially distributed > @@ -1336,7 +1336,7 @@ First, real-world allocators are required to handle a wide range > of allocation sizes, as opposed to the single size shown in this > toy example. > One popular way to do this is to offer a fixed set of sizes, spaced > -so as to balance external and internal fragmentation, such as in > +so as to balance external and internal \IX{fragmentation}, such as in > the late-1980s BSD memory allocator~\cite{McKusick88}. > Doing this would mean that the ``globalmem'' variable would need > to be replicated on a per-size basis, and that the associated > diff --git a/SMPdesign/beyond.tex b/SMPdesign/beyond.tex > index f533ba0d..f628ff55 100644 > --- a/SMPdesign/beyond.tex > +++ b/SMPdesign/beyond.tex > @@ -526,13 +526,13 @@ Compiling all three algorithms with -O3 gives results similar to > (albeit faster than) those shown in > \cref{fig:SMPdesign:CDF of SEQ/PWQ and SEQ/PART Solution-Time Ratios}, > except that PWQ provides almost no speedup compared > -to SEQ, in keeping with Amdahl's Law~\cite{GeneAmdahl1967AmdahlsLaw}. > +to SEQ, in keeping with \IXr{Amdahl's Law}~\cite{GeneAmdahl1967AmdahlsLaw}. > However, if the goal is to double performance compared to unoptimized > SEQ, as opposed to achieving optimality, compiler > optimizations are quite attractive. > > Cache alignment and padding often improves performance by reducing > -false sharing. > +\IX{false sharing}. > However, for these maze-solution algorithms, aligning and padding the > maze-cell array \emph{degrades} performance by up to 42\,\% for 1000x1000 mazes. > Cache locality is more important than avoiding > diff --git a/SMPdesign/criteria.tex b/SMPdesign/criteria.tex > index a3f9bc66..f73fc8aa 100644 > --- a/SMPdesign/criteria.tex > +++ b/SMPdesign/criteria.tex > @@ -58,7 +58,7 @@ contention, overhead, read-to-write ratio, and complexity: > program would not need any synchronization primitives. > Therefore, any time consumed by these primitives > (including communication cache misses as well as > - message latency, locking primitives, atomic instructions, > + \IXh{message}{latency}, locking primitives, atomic instructions, > and memory barriers) > is overhead that does not contribute directly to the useful > work that the program is intended to accomplish. > @@ -129,7 +129,7 @@ parallel program. > \begin{enumerate} > \item The less time a program spends in exclusive-lock critical sections, > the greater the potential speedup. 
> - This is a consequence of Amdahl's Law~\cite{GeneAmdahl1967AmdahlsLaw} > + This is a consequence of \IXr{Amdahl's Law}~\cite{GeneAmdahl1967AmdahlsLaw} > because only one CPU may execute within a given > exclusive-lock critical section at a given time. > > diff --git a/SMPdesign/partexercises.tex b/SMPdesign/partexercises.tex > index cd609505..6f9622c1 100644 > --- a/SMPdesign/partexercises.tex > +++ b/SMPdesign/partexercises.tex > @@ -454,7 +454,7 @@ the left-hand index on \clnref{lidx}, the right-hand lock on \clnref{rlock} > the right-hand index on \clnref{ridx}, and, finally, the hashed array > of simple lock-based double-ended queues on \clnref{bkt}. > A high-performance implementation would of course use padding or special > -alignment directives to avoid false sharing. > +alignment directives to avoid \IX{false sharing}. > \end{fcvref} > > \begin{listing} > @@ -780,7 +780,7 @@ In contrast, the double-ended > queue implementations are examples of ``vertical parallelism'' or > ``pipelining'', given that data moves from one thread to another. > The tighter coordination required for pipelining in turn requires > -larger units of work to obtain a given level of efficiency. > +larger units of work to obtain a given level of \IX{efficiency}. > > \QuickQuizSeries{% > \QuickQuizB{ > diff --git a/advsync/advsync.tex b/advsync/advsync.tex > index f8750250..abbf353a 100644 > --- a/advsync/advsync.tex > +++ b/advsync/advsync.tex > @@ -81,7 +81,7 @@ them, please re-read > {\emph{Robert A. Heinlein}} > > The term \IXacrfst{nbs}~\cite{MauriceHerlihy90a} > -describes seven classes of linearizable algorithms with differing > +describes seven classes of \IX{linearizable} algorithms with differing > \emph{\IXpl{forward-progress guarantee}}~\cite{DanAlitarh2013PracticalProgress}, > which are as follows: > > @@ -364,7 +364,7 @@ largely orthogonal to those that form the basis of real-time programming: > \begin{enumerate} > \item Real-time forward-progress guarantees usually have some > definite time associated with them, for example, > - ``scheduling latency must be less than 100 microseconds.'' > + ``\IXh{scheduling}{latency} must be less than 100 microseconds.'' > In contrast, the most popular forms of NBS only guarantees > that progress will be made in finite time, with no definite > bound. > diff --git a/advsync/rt.tex b/advsync/rt.tex > index 24ba2a66..b82153d9 100644 > --- a/advsync/rt.tex > +++ b/advsync/rt.tex > @@ -404,7 +404,7 @@ In any case, careful choices of hardware, drivers, and software > configuration would be required to support phase~4's more severe > requirements. > > -A key advantage of this phase-by-phase approach is that the latency > +A key advantage of this phase-by-phase approach is that the \IX{latency} > budgets can be broken down, so that the application's various components > can be developed independently, each with its own latency budget. > Of course, as with any other kind of budget, there will likely be the > @@ -1275,7 +1275,7 @@ A preemptible RCU implementation was therefore added to the Linux kernel. > This implementation avoids the need to individually track the state of > each and every task in the kernel by keeping lists of tasks that have > been preempted within their current RCU read-side critical sections. 
> -A grace period is permitted to end: > +A \IX{grace period} is permitted to end: > \begin{enumerate*}[(1)] > \item Once all CPUs have completed any RCU read-side critical sections > that were in effect before the start of the current grace period and > @@ -1895,7 +1895,7 @@ temperature, humidity, and barometric pressure. > The real-time response constraints on this program are so severe that > it is not permissible to spin or block, thus ruling out locking, > nor is it permissible to use a retry loop, thus ruling out sequence locks > -and hazard pointers. > +and \IXpl{hazard pointer}. > Fortunately, the temperature and pressure are normally controlled, > so that a default hard-coded set of data is usually sufficient. > > diff --git a/appendix/questions/concurrentparallel.tex b/appendix/questions/concurrentparallel.tex > index b8bf8116..0ffc1119 100644 > --- a/appendix/questions/concurrentparallel.tex > +++ b/appendix/questions/concurrentparallel.tex > @@ -50,8 +50,8 @@ One could argue that we should simply demand a reasonable level of > competence from the scheduler, so that we could simply ignore any > distinctions between parallelism and concurrency. > Although this is often a good strategy, > -there are important situations where efficiency, > -performance, and scalability concerns sharply limit the level > +there are important situations where \IX{efficiency}, > +\IX{performance}, and \IX{scalability} concerns sharply limit the level > of competence that the scheduler can reasonably offer. > One important example is when the scheduler is implemented in > hardware, as it often is in SIMD units or GPGPUs. > diff --git a/appendix/toyrcu/toyrcu.tex b/appendix/toyrcu/toyrcu.tex > index b84755e3..0824fdd3 100644 > --- a/appendix/toyrcu/toyrcu.tex > +++ b/appendix/toyrcu/toyrcu.tex > @@ -69,7 +69,7 @@ to participate in \IXpl{deadlock cycle}. > Furthermore, in absence of recursive locks, > RCU read-side critical sections cannot be nested, and, finally, > although concurrent RCU updates could in principle be satisfied by > -a common grace period, this implementation serializes grace periods, > +a common \IX{grace period}, this implementation serializes grace periods, > preventing grace-period sharing. > > \QuickQuizSeries{% > @@ -435,7 +435,7 @@ New readers that arrive after the complement operation will increment > RCU read-side critical sections. > This means that the value of \co{rcu_refcnt[0]} will no longer be incremented, > and thus will be monotonically decreasing.\footnote{ > - There is a race condition that this ``monotonically decreasing'' > + There is a \IX{race condition} that this ``monotonically decreasing'' > statement ignores. > This race condition will be dealt with by the code for > \co{synchronize_rcu()}. > @@ -1036,7 +1036,7 @@ all are either even (\clnref{even}) or are greater than the global > \co{rcu_gp_ctr} (\clnrefrange{gt1}{gt2}). > \Clnref{poll} blocks for a short period of time to wait for a > pre-existing RCU read-side critical section, but this can be replaced with > -a spin-loop if grace-period latency is of the essence. > +a spin-loop if \IXh{grace-period}{latency} is of the essence. > Finally, the memory barrier at \clnref{mb3} ensures that any subsequent > destruction will not be reordered into the preceding loop. 
> \end{fcvref} > diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex > index 25d2c8f5..b49d5fe8 100644 > --- a/appendix/whymb/whymemorybarriers.tex > +++ b/appendix/whymb/whymemorybarriers.tex > @@ -205,7 +205,7 @@ What happens when it does a write? > Because it is important that all CPUs agree on the value of a given > data item, before a given CPU writes to that data item, it must first > cause it to be removed, or ``invalidated'', from other CPUs' caches. > -Once this invalidation has completed, the CPU may safely modify the > +Once this \IX{invalidation} has completed, the CPU may safely modify the > data item. > If the data item was present in this CPU's cache, but was read-only, > this process is termed a ``\IXalth{write miss}{write}{cache miss}''. > @@ -955,7 +955,7 @@ involve lots of complex steps in silicon. > Unfortunately, each store buffer must be relatively small, which means > that a CPU executing a modest sequence of stores can fill its store > buffer (for example, if all of them result in cache misses). > -At that point, the CPU must once again wait for invalidations to complete > +At that point, the CPU must once again wait for \IXpl{invalidation} to complete > in order to drain its store buffer before it can continue executing. > This same situation can arise immediately after a memory barrier, when > \emph{all} subsequent store instructions must wait for invalidations to > @@ -1020,7 +1020,7 @@ as discussed in the next section. > > Let us suppose that CPUs queue invalidation requests, but respond to > them immediately. > -This approach minimizes the cache-invalidation latency seen by CPUs > +This approach minimizes the \IXh{cache-invalidation}{latency} seen by CPUs > doing stores, but can defeat memory barriers, as seen in the following > example. > > @@ -1355,7 +1355,7 @@ constraints~\cite{PaulMcKenney2005i,PaulMcKenney2005j}: > please keep this scenario in mind! > }\QuickQuizEnd > > -Imagine a large non-uniform cache architecture (NUCA) system that, > +Imagine a large \IXacrf{nuca} system that, > in order to provide fair allocation > of interconnect bandwidth to CPUs in a given node, provided per-CPU > queues in each node's interconnect interface, as shown in > diff --git a/count/count.tex b/count/count.tex > index 14ae9a36..a9a36b27 100644 > --- a/count/count.tex > +++ b/count/count.tex > @@ -795,7 +795,7 @@ value of the counter and exiting threads. > Finally, variable-length data structures such as linked > lists can be used, as is done in the userspace RCU > library~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyers2012URCU}. > - This last approach can also reduce false sharing in some cases. > + This last approach can also reduce \IX{false sharing} in some cases. > }\QuickQuizEnd > > \begin{fcvref}[ln:count:count_end:whole:inc] > diff --git a/cpu/overheads.tex b/cpu/overheads.tex > index caedd9e9..84360256 100644 > --- a/cpu/overheads.tex > +++ b/cpu/overheads.tex > @@ -9,7 +9,7 @@ > low-level software in ignorance of the underlying hardware.} > {\emph{Unknown}} > > -This section presents actual overheads of the obstacles to performance > +This section presents actual \IXpl{overhead} of the obstacles to performance > listed out in the previous section. > However, it is first necessary to get a rough view of hardware system > architecture, which is the subject of the next section. 
> @@ -472,7 +472,7 @@ InfiniBand or any number of proprietary interconnects, has a latency > of roughly five microseconds for an end-to-end round trip, during which > time more than \emph{ten thousand} instructions might have been executed. > Standards-based communications networks often require some sort of > -protocol processing, which further increases the latency. > +protocol processing, which further increases the \IX{latency}. > Of course, geographic distance also increases latency, with the > speed-of-light through optical fiber latency around the world coming to > roughly 195 \emph{milliseconds}, or more than 400 million clock > @@ -528,7 +528,7 @@ accessing memory sequentially. > For example, given a 64-byte cacheline and software accessing 64-bit > variables, the first access will still be slow due to speed-of-light > delays (if nothing else), but the remaining seven can be quite fast. > -However, this optimization has a dark side, namely false sharing, > +However, this optimization has a dark side, namely \IX{false sharing}, > which happens when different variables in the same cacheline are > being updated by different CPUs, resulting in a high cache-miss rate. > Software can use the alignment directives available in many compilers > @@ -571,8 +571,8 @@ computing needs more than a bit of rework! > A fifth hardware optimization is large caches, allowing individual > CPUs to operate on larger datasets without incurring expensive cache > misses. > -Although large caches can degrade energy efficiency and cache-miss > -latency, the ever-growing cache sizes on production microprocessors > +Although large caches can degrade \IXh{energy}{efficiency} and \IXh{cache-miss} > +{latency}, the ever-growing cache sizes on production microprocessors > attests to the power of this optimization. > > A final hardware optimization is read-mostly replication, in which > diff --git a/cpu/overview.tex b/cpu/overview.tex > index c68b0756..4ee639b8 100644 > --- a/cpu/overview.tex > +++ b/cpu/overview.tex > @@ -141,7 +141,7 @@ from memory than it did to execute an instruction. > More recently, microprocessors might execute hundreds or even thousands > of instructions in the time required to access memory. > This disparity is due to the fact that \IXr{Moore's Law} has increased CPU > -performance at a much greater rate than it has decreased memory latency, > +performance at a much greater rate than it has decreased memory \IX{latency}, > in part due to the rate at which memory sizes have grown. > For example, a typical 1970s minicomputer might have 4\,KB (yes, kilobytes, > not megabytes, let alone gigabytes or terabytes) of main memory, with > @@ -277,7 +277,7 @@ memory barriers almost always reduce performance, as depicted in > \cref{fig:cpu:CPU Meets a Memory Barrier}. > > As with atomic operations, CPU designers have been working hard to > -reduce memory-barrier overhead, and have made substantial progress. > +reduce \IXh{memory-barrier}{overhead}, and have made substantial progress. > > \subsection{Cache Misses} > \label{sec:cpu:Cache Misses} > diff --git a/datastruct/datastruct.tex b/datastruct/datastruct.tex > index d703a741..317bec42 100644 > --- a/datastruct/datastruct.tex > +++ b/datastruct/datastruct.tex > @@ -452,7 +452,7 @@ read-mostly cases where updates are rare, but could happen at any time. 
> \epigraph{Adapt the remedy to the disease.}{\emph{Chinese proverb}} > > Although partitioned data structures can offer excellent scalability, > -NUMA effects can result in severe degradations of both performance and > +\IXacr{numa} effects can result in severe degradations of both performance and > scalability. > In addition, > the need for read-side synchronization can degrade performance in > @@ -460,7 +460,7 @@ read-mostly situations. > However, we can achieve both performance and scalability by using > RCU, which was introduced in > \cref{sec:defer:Read-Copy Update (RCU)}. > -Similar results can be achieved using hazard pointers > +Similar results can be achieved using \IXpl{hazard pointer} > (\path{hazptr.c})~\cite{MagedMichael04a}, which will be included in > the performance results shown in this > section~\cite{McKenney:2013:SDS:2483852.2483867}. > @@ -557,7 +557,7 @@ forward pointer intact, so that \co{hashtab_lookup()} can traverse > to the newly deleted element's successor. > > Of course, after invoking \co{hashtab_del()}, the caller must wait for > -an RCU grace period (e.g., by invoking \co{synchronize_rcu()}) before > +an RCU \IX{grace period} (e.g., by invoking \co{synchronize_rcu()}) before > freeing or otherwise reusing the memory for the newly deleted element. > > \subsection{RCU-Protected Hash Table Validation} > @@ -679,7 +679,7 @@ second hardware thread in each core. > The slope of the hazard-pointers trace also decreases at 224 CPUs, but > less dramatically, > because the second hardware thread is able to fill in the time > -that the first hardware thread is stalled due to memory-barrier latency. > +that the first hardware thread is stalled due to \IXh{memory-barrier}{latency}. > As we will see in later sections, this second-hardware-thread > advantage depends on the workload. > > @@ -757,7 +757,7 @@ to about half again faster than that of either QSBR or RCU\@. > on \cpageref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets} > shows that varying hash-table size has almost > no effect? > - Might the problem instead be something like false sharing? > + Might the problem instead be something like \IX{false sharing}? > }\QuickQuizAnswer{ > Excellent question! > > @@ -889,7 +889,7 @@ throughput. > Of course, a great many applications have good load-spreading > properties, and for these applications sharding works > quite well. > -However, sharding does not handle ``hot spots'' very well, with > +However, sharding does not handle ``\IXpl{hot spot}'' very well, with > the hot spot exemplified by Schr\"odinger's cat being but one case > in point. > > @@ -2185,7 +2185,7 @@ Furthermore, even if you were content to run only on Linux, such a > self-modifying installation poses validation challenges. > For example, systems with 32-byte cachelines might work well, but > performance might suffer on systems with 64-byte cachelines due > -to false sharing. > +to \IX{false sharing}. > > Fortunately, there are some rules of thumb that work reasonably well in > practice, which were gathered into a 1995 > @@ -2241,7 +2241,7 @@ An additional set of rules of thumb deal with locks: > that they protect. > This approach means that the cache miss that brings the > lock to the current CPU also brings its data. > -\item Protect read-mostly data with hazard pointers, RCU, or, for > +\item Protect read-mostly data with \IXpl{hazard pointer}, RCU, or, for > long-duration critical sections, reader-writer locks. 
> \end{enumerate} > > diff --git a/debugging/debugging.tex b/debugging/debugging.tex > index 10b3f801..0e205617 100644 > --- a/debugging/debugging.tex > +++ b/debugging/debugging.tex > @@ -699,7 +699,7 @@ scripts to analyze the voluminous output. > But beware---scripts will only notice what you tell them to. > My rcutorture scripts are a case in point: > Early versions of those scripts were quite satisfied with a test run > -in which RCU grace periods stalled indefinitely. > +in which RCU \IXpl{grace period} stalled indefinitely. > This of course resulted in the scripts being modified to detect RCU > grace-period stalls, but this does not change the fact that the scripts > will only detect problems that I make them detect. > @@ -1660,7 +1660,7 @@ of a bug appearing, which > is why extremely lightweight tracing and assertion mechanism are > so critically important. > > -The term ``heisenbug'' was inspired by the \pplsur{Weiner}{Heisenberg} > +The term ``\IX{heisenbug}'' was inspired by the \pplsur{Weiner}{Heisenberg} > \IX{Uncertainty Principle} from quantum physics, which states that > it is impossible to > exactly quantify a given particle's position and velocity at any given > @@ -1707,7 +1707,7 @@ Adding \co{printf()} statements will likely greatly reduce or even > eliminate the lost counts. > However, converting the load-add-store sequence to a load-add-delay-store > sequence will greatly increase the incidence of lost counts (try it!). > -Once you spot a bug involving a race condition, it is frequently possible > +Once you spot a bug involving a \IX{race condition}, it is frequently possible > to create an anti-heisenbug by adding delay in this manner. > > Of course, this begs the question of how to find the race condition in > @@ -2147,7 +2147,7 @@ Unfortunately, it is often impractical for the following reasons: > Creating a benchmark that approximates > the application can help overcome these obstacles. > A carefully constructed benchmark can help promote performance, > -scalability, energy efficiency, and much else besides. > +scalability, \IXh{energy}{efficiency}, and much else besides. > However, be careful to avoid investing too much into the benchmarking > effort. > It is after all important to invest at least a little into the > @@ -2291,7 +2291,7 @@ The remainder of this section looks at ways of resolving this conflict. > Benchmarks sensitive to memory bandwidth (such as those involving > matrix arithmetic) should spread the running threads across the > available cores and sockets to maximize memory parallelism. > - They should also spread the data across NUMA nodes, memory > + They should also spread the data across \IXplr{NUMA node}, memory > controllers, and DRAM chips to the extent possible. > In contrast, benchmarks sensitive to memory latency (including > most poorly scaling applications) should instead maximize > diff --git a/defer/hazptr.tex b/defer/hazptr.tex > index 08e133dc..824a36ec 100644 > --- a/defer/hazptr.tex > +++ b/defer/hazptr.tex > @@ -13,7 +13,7 @@ inside out, that is, rather than incrementing an integer stored in the > data element, instead store a pointer to that data element in > per-CPU (or per-thread) lists. 
> Each element of these lists is called a > -\emph{hazard pointer}~\cite{MagedMichael04a}.\footnote{ > +\emph{\IX{hazard pointer}}~\cite{MagedMichael04a}.\footnote{ > Also independently invented by others~\cite{HerlihyLM02}.} > The value of a given data element's ``virtual reference counter'' can > then be obtained by counting the number of hazard pointers referencing > diff --git a/defer/rcuapi.tex b/defer/rcuapi.tex > index 2d645280..95dd0e02 100644 > --- a/defer/rcuapi.tex > +++ b/defer/rcuapi.tex > @@ -191,9 +191,9 @@ The corresponding synchronous update-side primitives, > \co{synchronize_rcu()} and \co{synchronize_rcu_expedited()}, along with > their synonym \co{synchronize_net()}, wait for any type of currently > executing RCU read-side critical sections to complete. > -The length of this wait is known as a ``grace period'', and > -\co{synchronize_rcu_expedited()} is designed to reduce grace-period > -latency at the expense of increased CPU overhead and IPIs. > +The length of this wait is known as a ``\IX{grace period}'', and > +\co{synchronize_rcu_expedited()} is designed to reduce \IXh{grace-period} > +{latency} at the expense of increased CPU overhead and IPIs. > The asynchronous update-side primitive, \co{call_rcu()}, > invokes a specified function with a specified argument after a > subsequent grace period. > diff --git a/defer/rcufundamental.tex b/defer/rcufundamental.tex > index 4ee37d87..32e78c3c 100644 > --- a/defer/rcufundamental.tex > +++ b/defer/rcufundamental.tex > @@ -538,7 +538,7 @@ microphones, headsets, cameras, mice, printers, and much else besides. > Furthermore, the large number of Linux-kernel RCU API uses shown in > \cref{fig:defer:RCU Usage in the Linux Kernel}, > combined with the Linux kernel's heavy use of reference counting > -and with increasing use of hazard pointers in other projects, demonstrates > +and with increasing use of \IXpl{hazard pointer} in other projects, demonstrates > that tolerance for such inconsistencies is more common than one might > imagine. > > diff --git a/defer/rcuintro.tex b/defer/rcuintro.tex > index 519f1627..d87c18c1 100644 > --- a/defer/rcuintro.tex > +++ b/defer/rcuintro.tex > @@ -207,7 +207,7 @@ code, then the corresponding CPU or thread is within a quiescent state, > in turn signaling the completion of any reader that might have access > to the newly removed data element. > Once all CPU's or thread's PCs have been observed to be outside of any > -reader, the grace period has completed. > +reader, the \IX{grace period} has completed. > Please note that this approach poses some serious challenges, including > memory ordering, functions that are \emph{sometimes} invoked from readers, > and ever-exciting code-motion optimizations. > @@ -592,7 +592,7 @@ of an object, respectively. > These mechanisms distribute the work among read and > update paths in such a way as to make read paths extremely fast, using > replication and weakening optimizations in a manner similar to > -hazard pointers, but without the need for read-side retries. > +\IXpl{hazard pointer}, but without the need for read-side retries. > In some cases, including \co{CONFIG_PREEMPT=n} Linux kernels, > RCU's read-side primitives have zero overhead. > > diff --git a/defer/rcurelated.tex b/defer/rcurelated.tex > index 710bc603..28f013d6 100644 > --- a/defer/rcurelated.tex > +++ b/defer/rcurelated.tex > @@ -106,7 +106,7 @@ the overhead of grace periods while holding locks. 
> approach~\cite{MayaArbel2014RCUtree} to an RCU-protected search tree that, > like Masstree, allows concurrent updates. > This paper includes a proof of correctness, including proof that all > -operations on this tree are linearizable. > +operations on this tree are \IX{linearizable}. > Unfortunately, this implementation achieves linearizability by incurring > the full latency of grace-period waits while holding locks, which degrades > scalability of update-only workloads. > @@ -191,7 +191,7 @@ grace periods. > period, or using something like \co{call_rcu()} > instead of waiting for a grace period? > }\QuickQuizAnswer{ > - The authors wished to support linearizable tree > + The authors wished to support \IX{linearizable} tree > operations, so that concurrent additions to, deletions > from, and searches of the tree would appear to execute > in some globally agreed-upon order. > @@ -242,7 +242,7 @@ HTM and RCU to search trees~\cite{Siakavaras2017CombiningHA,DimitriosSiakavaras2 > graphs~\cite{ChristinaGiannoula2018HTM-RCU-graphcoloring}, > and > \ppl{SeongJae}{Park} et al.~have used HTM and RCU to optimize high-contention > -locking on NUMA systems. > +locking on \IXacr{numa} systems. > > \ppl{Alex}{Kogan} et al.~applied RCU to the construction of range locking > for scalable address spaces~\cite{AlexKogan2020RCUrangelocks}. > @@ -273,7 +273,7 @@ RCU~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyers2012URCU}. > \ppl{Lihao}{Liang} (University of Oxford), \pplmdl{Paul E.}{McKenney} (IBM), > \ppl{Daniel}{Kroening}, and \ppl{Tom}{Melham} > (both also Oxford)~\cite{LihaoLiang2016VerifyTreeRCU} > -used the \IX{C bounded model checker (CBMC)}~\cite{EdmundClarke2004CBMC} > +used the \IXacrf{cbmc}~\cite{EdmundClarke2004CBMC} > to produce a mechanical proof of correctness of a significant portion > of Linux-kernel Tree RCU\@. > \ppl{Lance}{Roy}~\cite{LanceRoy2017CBMC-SRCU} used CBMC to produce a similar > diff --git a/defer/whichtochoose.tex b/defer/whichtochoose.tex > index e1b5233a..e3ce4540 100644 > --- a/defer/whichtochoose.tex > +++ b/defer/whichtochoose.tex > @@ -21,7 +21,7 @@ and where elements might be added to and removed from the structure > at any location and at any time. > \Cref{sec:defer:Which to Choose? (Production Use)} > then points out a few publicly visible production uses of > -hazard pointers, sequence locking, and RCU\@. > +\IXpl{hazard pointer}, sequence locking, and RCU\@. > This discussion should help you to make an informed choice between > these techniques. > > @@ -378,7 +378,7 @@ of RCU and reference counters. > RCU is used for short-lived references, which means that RCU read-side > critical sections can be short. > These short RCU read-side critical sections in turn mean that the corresponding > -RCU grace periods can also be short, which limits the memory footprint. > +RCU \IXpl{grace period} can also be short, which limits the memory footprint. > For the few data elements that need longer-lived references, reference > counting is used. 
> This means that the complexity of reference-acquisition failure only > diff --git a/formal/axiomatic.tex b/formal/axiomatic.tex > index 175af54e..bf5b8263 100644 > --- a/formal/axiomatic.tex > +++ b/formal/axiomatic.tex > @@ -285,7 +285,7 @@ referencing \co{y}, which in turn is initialized to the value > \co{P0()} on \clnrefrange{P0start}{P0end} > removes element \co{y} from the list by replacing it with element \co{z} > (\clnref{assignnewtail}), > -waits for a grace period (\clnref{sync}), > +waits for a \IX{grace period} (\clnref{sync}), > and finally zeroes \co{y} to emulate \co{free()} (\clnref{free}). > \co{P1()} on \clnrefrange{P1start}{P1end} > executes within an RCU read-side critical section > diff --git a/formal/dyntickrcu.tex b/formal/dyntickrcu.tex > index 5113fd09..bee5b4da 100644 > --- a/formal/dyntickrcu.tex > +++ b/formal/dyntickrcu.tex > @@ -23,7 +23,7 @@ RCU implementations disable preemption across RCU read-side > critical sections, resulting in excessive real-time latencies. > > However, one disadvantage of the older \rt\ implementation > -was that each grace period > +was that each \IX{grace period} > requires work to be done on each CPU, even if that CPU is in a low-power > ``dynticks-idle'' state, > and thus incapable of executing RCU read-side critical sections. > diff --git a/formal/formal.tex b/formal/formal.tex > index ac897b9e..86828d4a 100644 > --- a/formal/formal.tex > +++ b/formal/formal.tex > @@ -10,7 +10,7 @@ > \OriginallyPublished{Chapter}{chp:Formal Verification}{Formal Verification}{Linux Weekly News}{PaulEMcKenney2007QRCUspin,PaulEMcKenney2008dynticksRCU,PaulEMcKenney2011ppcmem} > > Parallel algorithms can be hard to write, and even harder to debug. > -Testing, though essential, is insufficient, as fatal race conditions > +Testing, though essential, is insufficient, as fatal \IXpl{race condition} > can have extremely low probabilities of occurrence. > Proofs of correctness can be valuable, but in the end are just as > prone to human error as is the original algorithm. > diff --git a/formal/sat.tex b/formal/sat.tex > index cc85e4fc..c17d5c71 100644 > --- a/formal/sat.tex > +++ b/formal/sat.tex > @@ -25,14 +25,14 @@ available~\cite{Kroening:2008:DPA:1391237}. > \begin{figure} > \centering > \resizebox{2in}{!}{\includegraphics{formal/cbmc}} > -\caption{CBMC Processing Flow} > +\caption{\IXacr{cbmc} Processing Flow} > \label{fig:formal:CBMC Processing Flow} > \end{figure} > > In addition, front-end programs for SAT solvers can automatically translate > C code into logic expressions, taking assertions into account and generating > assertions for error conditions such as array-bounds errors. > -One example is the C bounded model checker, or \co{cbmc}, which is > +One example is the \IXacrl{cbmc}, or \co{cbmc}, which is > available as part of many Linux distributions. > This tool is quite easy to use, with \co{cbmc test.c} sufficing to > validate \path{test.c}, resulting in the processing flow shown in > diff --git a/formal/spinhint.tex b/formal/spinhint.tex > index c1fe2d40..52b19d33 100644 > --- a/formal/spinhint.tex > +++ b/formal/spinhint.tex > @@ -29,10 +29,10 @@ applies Promela to early versions of RCU's dyntick-idle implementation. > \subsection{Promela and Spin} > \label{sec:formal:Promela and Spin} > > -Promela is a language designed to help verify protocols, but which > +\IX{Promela} is a language designed to help verify protocols, but which > can also be used to verify small parallel algorithms. 
> You recode your algorithm and correctness constraints in the C-like > -language Promela, and then use Spin to translate it into a C program > +language Promela, and then use \IX{Spin} to translate it into a C program > that you can compile and run. > The resulting program carries out a full state-space search of your > algorithm, either verifying or finding counter-examples for > @@ -66,7 +66,7 @@ more complex uses. > > \begin{fcvref}[ln:formal:promela:increment:whole] > \Cref{lst:formal:Promela Code for Non-Atomic Increment} > -demonstrates the textbook race condition > +demonstrates the textbook \IX{race condition} > resulting from non-atomic increment. > \Clnref{nprocs} defines the number of processes to run (we will vary this > to see the effect on state space), \clnref{count} defines the counter, > @@ -1204,10 +1204,10 @@ clearly untrustworthy. > specialized in some way. > For example, Promela does not handle realistic memory models > (though they can be programmed into > - Promela~\cite{Desnoyers:2013:MSM:2506164.2506174}), > - CBMC~\cite{EdmundClarke2004CBMC} does not detect probabilistic > + \IX{Promela}~\cite{Desnoyers:2013:MSM:2506164.2506174}), > + \IXacr{cbmc}~\cite{EdmundClarke2004CBMC} does not detect probabilistic > hangs and deadlocks, and > - Nidhugg~\cite{CarlLeonardsson2014Nidhugg} does not detect > + \IX{Nidhugg}~\cite{CarlLeonardsson2014Nidhugg} does not detect > bugs involving data nondeterminism. > But this means that these tools cannot be trusted to find > bugs that they are not designed to locate. > diff --git a/formal/stateless.tex b/formal/stateless.tex > index ef393a60..06c09c18 100644 > --- a/formal/stateless.tex > +++ b/formal/stateless.tex > @@ -23,7 +23,7 @@ of whether less-exact approaches might find bugs in larger programs. > \end{figure} > > Although the jury is still out on this question, stateless model > -checkers such as Nidhugg~\cite{CarlLeonardsson2014Nidhugg} have in > +checkers such as \IX{Nidhugg}~\cite{CarlLeonardsson2014Nidhugg} have in > some cases handled larger programs~\cite{SMC-TreeRCU}, and with > similar ease of use, as illustrated by > \cref{fig:formal:Nidhugg Processing Flow}. > diff --git a/future/cpu.tex b/future/cpu.tex > index 27b0d796..dd4b959f 100644 > --- a/future/cpu.tex > +++ b/future/cpu.tex > @@ -113,7 +113,7 @@ Also from 2004~\cite{PaulEdwardMcKenneyPhD}: > multithreaded CPUs are now standard for many desktop and laptop > computer systems. > The most aggressively multithreaded CPUs share all levels of > - cache hierarchy, thereby eliminating CPU-to-CPU memory latency, > + cache hierarchy, thereby eliminating CPU-to-CPU \IXh{memory}{latency}, > in turn greatly reducing the performance penalty for traditional > synchronization mechanisms. > However, a multithreaded CPU would still incur overhead due to > @@ -220,8 +220,8 @@ And one more quote from 2004~\cite{PaulEdwardMcKenneyPhD}: > increasing memory-latency ratios > put RCU at an increasing disadvantage compared to other synchronization > mechanisms. > - Since Linux has been observed with over 1,600 callbacks per grace > - period under heavy load~\cite{Sarma04c}, > + Since Linux has been observed with over 1,600 callbacks per \IX{grace > + period} under heavy load~\cite{Sarma04c}, > it seems safe to say that Linux falls into the former category. > \end{quote} > > diff --git a/future/formalregress.tex b/future/formalregress.tex > index 789e1636..d5babe82 100644 > --- a/future/formalregress.tex > +++ b/future/formalregress.tex > @@ -70,7 +70,8 @@ mechanism---by hand. 
> As with Promela and \co{spin}, both PPCMEM and \co{herd} are > extremely useful, but they are not well-suited for regression suites. > > -In contrast, \co{cbmc} and Nidhugg can input C programs of reasonable > +In contrast, \IXaltacr{\co{cbmc}}{cbmc} and \IX{Nidhugg} can input > +C programs of reasonable > (though still quite limited) size, and if their capabilities continue > to grow, could well become excellent additions to regression suites. > The Coverity static-analysis tool also inputs C programs, and of very > @@ -138,7 +139,7 @@ It is critically important that formal-verification tools correctly > model their environment. > One all-too-common omission is the memory model, where a great > many formal-verification tools, including Promela/spin, are > -restricted to sequential consistency. > +restricted to \IXalth{sequential consistency}{sequential}{memory consistency}. > The QRCU experience related in > \cref{sec:formal:Is QRCU Really Correct?} > is an important cautionary tale. > diff --git a/future/htm.tex b/future/htm.tex > index c42ba49e..bf78d5ea 100644 > --- a/future/htm.tex > +++ b/future/htm.tex > @@ -117,7 +117,7 @@ that lock. > If the lock is in the same cacheline as some of the variables > that it is protecting, then writes to those variables by one CPU > will invalidate that cache line for all the other CPUs. > - These invalidations will > + These \IXpl{invalidation} will > generate large numbers of conflicts and retries, perhaps even > degrading performance and scalability compared to locking. > }\QuickQuizEnd > @@ -864,7 +864,7 @@ In addition, this is not the whole story. > Locking is not normally used by itself, but is instead typically > augmented by other synchronization mechanisms, > including reference counting, atomic operations, non-blocking data structures, > -hazard pointers~\cite{MagedMichael04a,HerlihyLM02}, > +\IXpl{hazard pointer}~\cite{MagedMichael04a,HerlihyLM02}, > and RCU~\cite{McKenney98,McKenney01a,ThomasEHart2007a,PaulEMcKenney2012ELCbattery}. > The next section looks at how such augmentation changes the equation. > > diff --git a/future/tm.tex b/future/tm.tex > index a71fc140..ce071bc8 100644 > --- a/future/tm.tex > +++ b/future/tm.tex > @@ -86,7 +86,7 @@ storage. > \label{sec:future:I/O Operations} > > One can execute I/O operations within a lock-based critical section, > -while holding a hazard pointer, within a sequence-locking read-side > +while holding a \IX{hazard pointer}, within a sequence-locking read-side > critical section, and from within a userspace-RCU read-side critical > section, and even all at the same time, if need be. > What happens when you attempt to execute an I/O operation from within > @@ -896,7 +896,7 @@ HTM and RCU to search trees~\cite{Siakavaras2017CombiningHA,DimitriosSiakavaras2 > graphs~\cite{ChristinaGiannoula2018HTM-RCU-graphcoloring}, > and > \ppl{SeongJae}{Park} et al.~have used HTM and RCU to optimize high-contention > -locking on NUMA systems~\cite{SeongJaePark2020HTMRCUlock}. > +locking on \IXacr{numa} systems~\cite{SeongJaePark2020HTMRCUlock}. > > It is important to note that RCU permits readers and updaters to run > concurrently, further permitting RCU readers to access data that is in > diff --git a/glossary.tex b/glossary.tex > index ee1471bd..c6266b15 100644 > --- a/glossary.tex > +++ b/glossary.tex > @@ -402,7 +402,7 @@ > A source-code memory access that simply mentions the name of > the object being accessed. > (See ``Marked access''.) 
> -\item[\IX{Process Consistency}:] > +\item[\IXalth{Process Consistency}{process}{memory consistency}:] > A memory-consistency model in which each CPU's stores appear to > occur in program order, but in which different CPUs might see > accesses from more than one CPU as occurring in different orders. > @@ -477,7 +477,7 @@ > \item[\IXh{Sequence}{Lock}:] > A reader-writer synchronization mechanism in which readers > retry their operations if a writer was present. > -\item[\IX{Sequential Consistency}:] > +\item[\IXalth{Sequential Consistency}{sequential}{memory consistency}:] > A memory-consistency model where all memory references appear to occur > in an order consistent with > a single global order, and where each CPU's memory references > diff --git a/glsdict.tex b/glsdict.tex > index 434ab9f5..c861e09b 100644 > --- a/glsdict.tex > +++ b/glsdict.tex > @@ -209,6 +209,7 @@ > \newcommand{\IXacrmfstpl}[1]{\glsuseri{#1}\acrmfstpl{#1}} % put index via acronym dictionary (first form, plural) > \newcommand{\IXAcrmfst}[1]{\glsuseri{#1}\Acrmfst{#1}} % put index via acronym dictionary (first form, upper case) > \newcommand{\IXAcrmfstpl}[1]{\glsuseri{#1}\Acrmfstpl{#1}} % put index via acronym dictionary (first form, upper case, plural) > +\newcommand{\IXaltacr}[2]{\glsuseri{#2}\hlindex{#1}} % put index of acronym #2 to alternative string (#1) > > \newacronym{cas}{CAS}{compare and swap} > \newabbreviation{cas:m}{CAS}{compare-and-swap} > diff --git a/intro/intro.tex b/intro/intro.tex > index 93d78dd9..3fb567cf 100644 > --- a/intro/intro.tex > +++ b/intro/intro.tex > @@ -74,7 +74,7 @@ categories, including: > \item The paucity of publicly accessible parallel code. > \item The lack of a widely understood engineering discipline of > parallel programming. > -\item The high overhead of communication relative to that of processing, > +\item The high \IX{overhead} of communication relative to that of processing, > even in tightly coupled shared-memory computers. > \end{enumerate} > > @@ -337,7 +337,7 @@ and you will probably get done much more quickly. > }\QuickQuizEnd > > Note that ``performance'' is interpreted broadly here, > -including for example scalability (performance per CPU) and efficiency > +including for example \IX{scalability} (performance per CPU) and \IX{efficiency} > (performance per watt). > > \begin{figure} > @@ -1080,7 +1080,7 @@ parallelism, so much so that it is > usually wise to begin parallelization by partitioning your write-intensive > resources and replicating frequently accessed read-mostly resources. > The resource in question is most frequently data, which might be > -partitioned over computer systems, mass-storage devices, NUMA nodes, > +partitioned over computer systems, mass-storage devices, \IXplr{NUMA node}, > CPU cores (or dies or hardware threads), pages, cache lines, instances > of synchronization primitives, or critical sections of code. > For example, partitioning over locking primitives is termed > diff --git a/locking/locking-existence.tex b/locking/locking-existence.tex > index feb18464..967c09d2 100644 > --- a/locking/locking-existence.tex > +++ b/locking/locking-existence.tex > @@ -95,7 +95,7 @@ One way of providing scalability that improves as the size of the > data structure increases is to place a lock in each element of the > structure. 
> Unfortunately, putting the lock that is to protect a data element > -in the data element itself is subject to subtle race conditions, > +in the data element itself is subject to subtle \IXpl{race condition}, > as shown in > \cref{lst:locking:Per-Element Locking Without Existence Guarantees}. > > diff --git a/locking/locking.tex b/locking/locking.tex > index 7ae9a115..18d52e36 100644 > --- a/locking/locking.tex > +++ b/locking/locking.tex > @@ -1063,7 +1063,7 @@ these problems by maintaining low \IX{lock contention}. > \IX{Unfairness} can be thought of as a less-severe form of starvation, > where a subset of threads contending for a given lock are granted > the lion's share of the acquisitions. > -This can happen on machines with shared caches or NUMA characteristics, > +This can happen on machines with shared caches or \IXacr{numa} characteristics, > for example, as shown in > \cref{fig:locking:System Architecture and Lock Unfairness}. > If CPU~0 releases a lock that all the other CPUs are attempting > @@ -1129,7 +1129,7 @@ be accessed by the lock holder without interference from other threads. > as the lock itself, then attempts by other CPUs to acquire > the lock will result in expensive cache misses on the part > of the CPU holding the lock. > - This is a special case of false sharing, which can also occur > + This is a special case of \IX{false sharing}, which can also occur > if a pair of variables protected by different locks happen > to share a cache line. > In contrast, if the lock is in a different cache line than > @@ -1933,7 +1933,7 @@ starvation~\cite{McKenney02e,radovic03hierarchical,radovic02efficient,BenJackson > Many of these can be thought of as analogous to the elevator algorithms > traditionally used in scheduling disk I/O. > > -Unfortunately, the same scheduling logic that improves the efficiency > +Unfortunately, the same scheduling logic that improves the \IX{efficiency} > of queued locks at high contention also increases their overhead at > low contention. > Beng-Hong Lim and Anant Agarwal therefore combined a simple test-and-set > diff --git a/memorder/memorder.tex b/memorder/memorder.tex > index 1d810b51..debfbcc3 100644 > --- a/memorder/memorder.tex > +++ b/memorder/memorder.tex > @@ -788,7 +788,7 @@ down significantly on your users' systems. > Or on your future system. > > \paragraph{Ordering operations are not magic.} > -When your program is failing due to some race condition, it is often > +When your program is failing due to some \IX{race condition}, it is often > tempting to toss in a few memory-ordering operations in an attempt > to barrier your bugs out of existence. > A far better reaction is to use higher-level primitives in a carefully > @@ -967,7 +967,7 @@ CPU~4 believes that the value is~``4'' for almost 500\,ns. > CPUs~2 and~3 are a pair of hardware threads on the same > core, sharing the same cache hierarchy, and therefore have > very low communications latencies. > - This is a NUMA, or, more accurately, a NUCA effect. > + This is a \IXacr{numa}, or, more accurately, a \IXacr{nuca} effect. > > This leads to the question of why CPUs~2 and~3 ever disagree > at all. 
> @@ -3299,7 +3299,7 @@ the fundamental property of RCU grace periods is this straightforward > two-part guarantee: > \begin{enumerate*}[(1)] > \item If any part of a given RCU read-side critical section precedes > -the beginning of a given grace period, then the entirety of that > +the beginning of a given \IX{grace period}, then the entirety of that > critical section precedes the end of that grace period. > \item If any part of a given RCU read-side critical section follows > the end of a given grace period, then the entirety of that > @@ -4022,7 +4022,7 @@ architecture-specific code or synchronization primitives. > Besides, they say that a little knowledge is a very dangerous thing. > Just imagine the damage you could do with a lot of knowledge! > For those who wish to understand more about individual CPUs' > -memory consistency models, the next sections describe those of a few > +\IX{memory consistency} models, the next sections describe those of a few > popular and prominent CPUs. > Although there is no substitute for actually reading a given CPU's > documentation, these sections do give a good overview. > @@ -4381,7 +4381,8 @@ defined its memory ordering with an executable formal model~\cite{ARMv8A:2017}. > \subsection{Itanium} > \label{sec:memorder:Itanium} > > -Itanium offers a weak consistency model, so that in absence of explicit > +Itanium offers a \IXalth{weak consistency}{weak}{memory consistency} > +model, so that in absence of explicit > memory-barrier instructions or dependencies, Itanium is within its rights > to arbitrarily reorder memory references~\cite{IntelItanium02v2}. > Itanium has a memory-fence instruction named {\tt mf}, but also has > diff --git a/together/applyrcu.tex b/together/applyrcu.tex > index 975a10a6..a5a1c0b3 100644 > --- a/together/applyrcu.tex > +++ b/together/applyrcu.tex > @@ -309,7 +309,7 @@ Suppose we have an RCU-protected variable-length array, as shown in > \cref{lst:together:RCU-Protected Variable-Length Array}. > The length of the array \co{->a[]} can change dynamically, and at any > given time, its length is given by the field \co{->length}. > -Of course, this introduces the following race condition: > +Of course, this introduces the following \IX{race condition}: > > \begin{listing} > \begin{VerbatimL}[tabsize=8] > diff --git a/together/hazptr.tex b/together/hazptr.tex > index 2749fdb1..e8a6255a 100644 > --- a/together/hazptr.tex > +++ b/together/hazptr.tex > @@ -19,7 +19,7 @@ Suppose a reference count is becoming a performance or scalability > bottleneck. > What can you do? > > -One approach is to instead use hazard pointers. > +One approach is to instead use \IXpl{hazard pointer}. > > There are some differences, perhaps most notably that with > hazard pointers it is extremely expensive to determine when > diff --git a/together/refcnt.tex b/together/refcnt.tex > index b30acf5b..958fa383 100644 > --- a/together/refcnt.tex > +++ b/together/refcnt.tex > @@ -576,7 +576,7 @@ result was zero, \clnref{call} invokes the \co{call_rcu()} primitives > in order to free up the file structure (via the \co{file_free_rcu()} > function specified in \co{call_rcu()}'s second argument), but only after > all currently-executing RCU read-side critical sections complete, that > -is, after an RCU grace period has elapsed. > +is, after an RCU \IX{grace period} has elapsed. 
> > Once the grace period completes, the \co{file_free_rcu()} function > obtains a pointer to the file structure on \clnref{obtain}, and frees it > diff --git a/together/seqlock.tex b/together/seqlock.tex > index 4e95d65b..85b029ea 100644 > --- a/together/seqlock.tex > +++ b/together/seqlock.tex > @@ -132,7 +132,8 @@ notably in the dcache subsystem~\cite{NeilBrown2015RCUwalk}. > Note that it is likely that similar schemes also work with hazard > pointers. > > -This approach provides sequential consistency to successful readers, > +This approach provides \IXalth{sequential consistency}{sequential} > +{memory consistency} to successful readers, > each of which will either see the effects of a given update or not, > with any partial updates resulting in a read-side retry. > Sequential consistency is an extremely strong guarantee, incurring equally > -- > 2.17.1 >
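P.S. Since the various forms of the IX command came up above, here is a rough cheat sheet for the forms this patch uses. It is pieced together from the comments in perfbook-lt.tex and glsdict.tex rather than from a careful reading of the macro bodies, so please treat it as a sketch; the arguments shown are simply ones that appear in the hunks above:

    \IX{hot spot}                % Print the term as-is and add an index entry.
    \IXpl{hazard pointer}        % Print the plural form, index under the singular.
    \IXh{grace-period}{latency}  % Two-part term, apparently indexed as a
                                 % subentry, e.g., "latency, grace-period".
    \IXr{Amdahl's Law}           % Proper noun, indexed with capitalization intact.
    \IXalth{sequential consistency}{sequential}{memory consistency}
                                 % Print argument #1, but index it under
                                 % "memory consistency, sequential".
    \IXacr{numa}                 % Acronym from the dictionary: prints "NUMA".
    \IXacrf{nuca}                % First-use form: prints the full expansion,
                                 % "non-uniform cache architecture (NUCA)".
    \IXacrl{cbmc}                % Long form: prints "C bounded model checker".
    \IXaltacr{\co{cbmc}}{cbmc}   % New in this patch: print alternative string #1
                                 % while indexing via acronym entry #2.

For example, the future/formalregress.tex hunk above uses the new \IXaltacr macro to turn "\co{cbmc} and Nidhugg" into "\IXaltacr{\co{cbmc}}{cbmc} and \IX{Nidhugg}", which keeps cbmc's typewriter-font rendering while still generating the acronym's index entry.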