Also fix indents by white spaces in Quick Quizzes. Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx> --- count/count.tex | 586 ++++++++++++++++++++++++------------------------ 1 file changed, 293 insertions(+), 293 deletions(-) diff --git a/count/count.tex b/count/count.tex index b69515a1..bdb3fdbf 100644 --- a/count/count.tex +++ b/count/count.tex @@ -29,7 +29,7 @@ counting. Because the straightforward counting algorithms, for example, atomic operations on a shared counter, either are slow and scale badly, or are inaccurate, as will be seen in - Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}. + \cref{sec:count:Why Isn't Concurrent Counting Trivial?}. }\EQuickQuizEnd \EQuickQuiz{ @@ -56,7 +56,7 @@ counting. For example, a 1\,\% error might be just fine when the count is on the order of a million or so, but might be absolutely unacceptable once the count reaches a trillion. - See Section~\ref{sec:count:Statistical Counters}. + See \cref{sec:count:Statistical Counters}. }\EQuickQuizEnd \QuickQuizLabel{\QcountQstatcnt} @@ -78,7 +78,7 @@ counting. \emph{except} that it must distinguish approximately between values below the limit and values greater than or equal to the limit. - See Section~\ref{sec:count:Approximate Limit Counters}. + See \cref{sec:count:Approximate Limit Counters}. }\EQuickQuizEnd \QuickQuizLabel{\QcountQapproxcnt} @@ -105,7 +105,7 @@ counting. between values between the limit and zero on the one hand, and values that either are less than or equal to zero or are greater than or equal to the limit on the other hand. - See Section~\ref{sec:count:Exact Limit Counters}. + See \cref{sec:count:Exact Limit Counters}. }\EQuickQuizEnd \QuickQuizLabel{\QcountQexactcnt} @@ -134,7 +134,7 @@ counting. to keep the value at zero until it has taken some action to prevent subsequent threads from gaining access to the device being removed. - See Section~\ref{sec:count:Applying Exact Limit Counters}. + See \cref{sec:count:Applying Exact Limit Counters}. 
}\EQuickQuizEnd \QuickQuizLabel{\QcountQIOcnt} @@ -161,10 +161,10 @@ are more advanced. Let's start with something simple, for example, the straightforward use of arithmetic shown in -Listing~\ref{lst:count:Just Count!} (\path{count_nonatomic.c}). +\cref{lst:count:Just Count!} (\path{count_nonatomic.c}). \begin{fcvref}[ln:count:count_nonatomic:inc-read] -Here, we have a counter on line~\lnref{counter}, we increment it on -line~\lnref{inc}, and we read out its value on line~\lnref{read}. +Here, we have a counter on \clnref{counter}, we increment it on +\clnref{inc}, and we read out its value on \clnref{read}. What could be simpler? \end{fcvref} @@ -174,7 +174,7 @@ What could be simpler? Why all that extra typing??? }\QuickQuizAnswer{ See \cref{sec:toolsoftrade:Shared-Variable Shenanigans} - on page~\pageref{sec:toolsoftrade:Shared-Variable Shenanigans} + on \cpageref{sec:toolsoftrade:Shared-Variable Shenanigans} for more information on how the compiler can cause trouble, as well as how \co{READ_ONCE()} and \co{WRITE_ONCE()} can avoid this trouble. @@ -201,9 +201,9 @@ Although approximation does have a large place in computing, loss of \QuickQuizSeries{% \QuickQuizB{ But can't a smart compiler prove that - line~\ref{ln:count:count_nonatomic:inc-read:inc} + \clnrefr{ln:count:count_nonatomic:inc-read:inc} of - Listing~\ref{lst:count:Just Count!} + \cref{lst:count:Just Count!} is equivalent to the \co{++} operator and produce an x86 add-to-memory instruction? And won't the CPU cache cause this to be atomic? @@ -263,11 +263,11 @@ Although approximation does have a large place in computing, loss of The straightforward way to count accurately is to use \IX{atomic} operations, as shown in -Listing~\ref{lst:count:Just Count Atomically!} (\path{count_atomic.c}). +\cref{lst:count:Just Count Atomically!} (\path{count_atomic.c}). 
\begin{fcvref}[ln:count:count_atomic:inc-read] -Line~\lnref{counter} defines an atomic variable, -line~\lnref{inc} atomically increments it, and -line~\lnref{read} reads it out. +\Clnref{counter} defines an atomic variable, +\clnref{inc} atomically increments it, and +\clnref{read} reads it out. \end{fcvref} Because this is atomic, it keeps perfect count. However, it is slower: on my six-core x86 laptop, it is more than @@ -285,10 +285,10 @@ when only a single thread is incrementing.\footnote{ scalability~\cite{Andrews91textbook,Arcangeli03,10.5555/3241639.3241645,DavidUngar2011unsync}.} This poor performance should not be a surprise, given the discussion in -Chapter~\ref{chp:Hardware and its Habits}, +\cref{chp:Hardware and its Habits}, nor should it be a surprise that the performance of atomic increment gets slower as the number of CPUs and threads increase, as shown in -Figure~\ref{fig:count:Atomic Increment Scalability on x86}. +\cref{fig:count:Atomic Increment Scalability on x86}. In this figure, the horizontal dashed line resting on the x~axis is the ideal performance that would be achieved by a perfectly scalable algorithm: with such an algorithm, a given @@ -354,15 +354,15 @@ additional CPUs. \end{figure} For another perspective on global atomic increment, consider -Figure~\ref{fig:count:Data Flow For Global Atomic Increment}. +\cref{fig:count:Data Flow For Global Atomic Increment}. In order for each CPU to get a chance to increment a given global variable, the \IX{cache line} containing that variable must circulate among all the CPUs, as shown by the red arrows. Such circulation will take significant time, resulting in the poor performance seen in -Figure~\ref{fig:count:Atomic Increment Scalability on x86}, +\cref{fig:count:Atomic Increment Scalability on x86}, which might be thought of as shown in -Figure~\ref{fig:count:Waiting to Count}. +\cref{fig:count:Waiting to Count}. 
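[Editor's note, outside the patch: the contrast drawn above between the lossy non-atomic counter and the correct-but-slow atomic one can be sketched as follows. This is an illustration only; it uses C11 atomics rather than the book's own wrappers, and the function names are mine, not those of count_atomic.c.]

```c
#include <stdatomic.h>

unsigned long counter_nonatomic;
atomic_ulong counter_atomic;

void inc_count_nonatomic(void)
{
	/* Plain load-add-store: concurrent callers can lose counts. */
	counter_nonatomic++;
}

void inc_count_atomic(void)
{
	/* Atomic read-modify-write: never loses counts, but forces the
	 * cache line to circulate among all updating CPUs, which is the
	 * source of the poor scalability discussed above. */
	atomic_fetch_add(&counter_atomic, 1);
}

unsigned long read_count_atomic(void)
{
	return atomic_load(&counter_atomic);
}
```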
The following sections discuss high-performance counting, which avoids the delays inherent in such circulation. @@ -389,7 +389,7 @@ avoids the delays inherent in such circulation. \end{enumerate} But what if neither of the first two conditions holds? Then you should think carefully about the algorithms discussed - in Section~\ref{sec:count:Statistical Counters}, which achieve + in \cref{sec:count:Statistical Counters}, which achieve near-ideal performance on commodity hardware. \begin{figure} @@ -410,13 +410,13 @@ avoids the delays inherent in such circulation. particular atomic increment. This results in instruction latency that varies as $\O{\log N}$, where $N$ is the number of CPUs, as shown in - Figure~\ref{fig:count:Data Flow For Global Combining-Tree Atomic Increment}. + \cref{fig:count:Data Flow For Global Combining-Tree Atomic Increment}. And CPUs with this sort of hardware optimization started to appear in 2011. This is a great improvement over the $\O{N}$ performance of current hardware shown in - Figure~\ref{fig:count:Data Flow For Global Atomic Increment}, + \cref{fig:count:Data Flow For Global Atomic Increment}, and it is possible that hardware latencies might decrease further if innovations such as three-dimensional fabrication prove practical. @@ -441,14 +441,14 @@ posed in \QuickQuizRef{\QcountQstatcnt}. Statistical counting is typically handled by providing a counter per thread (or CPU, when running in the kernel), so that each thread updates its own counter, as was foreshadowed in -Section~\ref{sec:toolsoftrade:Per-CPU Variables} -on page~\pageref{sec:toolsoftrade:Per-CPU Variables}. +\cref{sec:toolsoftrade:Per-CPU Variables} +on \cpageref{sec:toolsoftrade:Per-CPU Variables}. The aggregate value of the counters is read out by simply summing up all of the threads' counters, relying on the commutative and associative properties of addition. 
This is an example of the Data Ownership pattern that will be introduced in -Section~\ref{sec:SMPdesign:Data Ownership} -on page~\pageref{sec:SMPdesign:Data Ownership}. +\cref{sec:SMPdesign:Data Ownership} +on \cpageref{sec:SMPdesign:Data Ownership}. \QuickQuiz{ But doesn't the fact that C's ``integers'' are limited in size @@ -488,7 +488,7 @@ thread (presumably cache aligned and padded to avoid false sharing). implementation that permits an arbitrary number of threads, for example, using \GCC's \co{__thread} facility, as shown in - Section~\ref{sec:count:Per-Thread-Variable-Based Implementation}. + \cref{sec:count:Per-Thread-Variable-Based Implementation}. }\QuickQuizEnd \begin{listing} @@ -498,10 +498,10 @@ thread (presumably cache aligned and padded to avoid false sharing). \end{listing} Such an array can be wrapped into per-thread primitives, as shown in -Listing~\ref{lst:count:Array-Based Per-Thread Statistical Counters} +\cref{lst:count:Array-Based Per-Thread Statistical Counters} (\path{count_stat.c}). \begin{fcvref}[ln:count:count_stat:inc-read] -Line~\lnref{define} defines an array containing a set of per-thread counters of +\Clnref{define} defines an array containing a set of per-thread counters of type \co{unsigned long} named, creatively enough, \co{counter}. \Clnrefrange{inc:b}{inc:e} @@ -552,7 +552,7 @@ The use of \co{READ_ONCE()} prevents this optimization and others besides. \QuickQuizSeries{% \QuickQuizB{ How does the per-thread \co{counter} variable in - Listing~\ref{lst:count:Array-Based Per-Thread Statistical Counters} + \cref{lst:count:Array-Based Per-Thread Statistical Counters} get initialized? }\QuickQuizAnswerB{ The C standard specifies that the initial value of @@ -566,7 +566,7 @@ The use of \co{READ_ONCE()} prevents this optimization and others besides. 
% \QuickQuizE{ How is the code in - Listing~\ref{lst:count:Array-Based Per-Thread Statistical Counters} + \cref{lst:count:Array-Based Per-Thread Statistical Counters} supposed to permit more than one counter? }\QuickQuizAnswerE{ Indeed, this toy example does not support more than one counter. Modifying it so that it can provide multiple counters is left as an exercise to the reader. }\QuickQuizEndE } @@ -585,7 +585,7 @@ The use of \co{READ_ONCE()} prevents this optimization and others besides. This approach scales linearly with increasing number of updater threads invoking \co{inc_count()}. As is shown by the green arrows on each CPU in -Figure~\ref{fig:count:Data Flow For Per-Thread Increment}, +\cref{fig:count:Data Flow For Per-Thread Increment}, the reason for this is that each CPU can make rapid progress incrementing its thread's variable, without any expensive cross-system communication. As such, this section solves the network-packet counting problem presented @@ -596,7 +596,7 @@ at the beginning of this chapter. and during that time, the counter could well be changing. This means that the value returned by \co{read_count()} in - Listing~\ref{lst:count:Array-Based Per-Thread Statistical Counters} + \cref{lst:count:Array-Based Per-Thread Statistical Counters} will not necessarily be exact. Assume that the counter is being incremented at rate $r$ counts per unit time, and that \co{read_count()}'s @@ -659,7 +659,7 @@ at the beginning of this chapter. Of course, it is sometimes unacceptable for the counter to continue incrementing during the read operation. - Section~\ref{sec:count:Applying Exact Limit Counters} + \Cref{sec:count:Applying Exact Limit Counters} discusses a way to handle this situation. Thus far, we have been considering a counter that is only @@ -693,7 +693,7 @@ at the beginning of this chapter. Therefore, the long-term movement of the counter is given by $\left( 1-2f \right) r$. 
Plugging this into - Equation~\ref{eq:count:CounterErrorAverage} yields: + \cref{eq:count:CounterErrorAverage} yields: \begin{equation} \frac{\left( 1 - 2 f \right) r \Delta}{2} @@ -720,7 +720,7 @@ This is the topic of the next section. \GCC\ provides an \co{__thread} storage class that provides per-thread storage. This can be used as shown in -Listing~\ref{lst:count:Per-Thread Statistical Counters} (\path{count_end.c}) +\cref{lst:count:Per-Thread Statistical Counters} (\path{count_end.c}) to implement a statistical counter that not only scales well and avoids arbitrary thread-number limits, but that also incurs little or no performance @@ -756,7 +756,7 @@ value of the counter and exiting threads. When a user-level thread exits, its per-thread variables all disappear, which complicates the problem of per-thread-variable access, particularly before the advent of user-level RCU - (see Section~\ref{sec:defer:Read-Copy Update (RCU)}). + (see \cref{sec:defer:Read-Copy Update (RCU)}). In contrast, in the Linux kernel, when a CPU goes offline, that CPU's per-CPU variables remain mapped and accessible. @@ -801,20 +801,20 @@ be seen on \clnrefrange{b}{e}. \begin{fcvref}[ln:count:count_end:whole:read] The \co{read_count()} function used by readers is a bit more complex. -Line~\lnref{acquire} acquires a lock to exclude exiting threads, and -line~\lnref{release} releases it. -Line~\lnref{sum:init} initializes the sum to the count accumulated by those threads that +\Clnref{acquire} acquires a lock to exclude exiting threads, and +\clnref{release} releases it. +\Clnref{sum:init} initializes the sum to the count accumulated by those threads that have already exited, and \clnrefrange{loop:b}{loop:e} sum the counts being accumulated by threads currently running. -Finally, line~\lnref{return} returns the sum. +Finally, \clnref{return} returns the sum. 
\end{fcvref} \QuickQuizSeries{% \QuickQuizB{ \begin{fcvref}[ln:count:count_end:whole:read] - Doesn't the check for \co{NULL} on line~\lnref{check} of - Listing~\ref{lst:count:Per-Thread Statistical Counters} + Doesn't the check for \co{NULL} on \clnref{check} of + \cref{lst:count:Per-Thread Statistical Counters} add extra branch mispredictions? Why not have a variable set permanently to zero, and point unused counter-pointers to that variable rather than setting @@ -831,7 +831,7 @@ Finally, line~\lnref{return} returns the sum. \QuickQuizE{ Why on earth do we need something as heavyweight as a \emph{lock} guarding the summation in the function \co{read_count()} in - Listing~\ref{lst:count:Per-Thread Statistical Counters}? + \cref{lst:count:Per-Thread Statistical Counters}? }\QuickQuizAnswerE{ Remember, when a thread exits, its per-thread variables disappear. Therefore, if we attempt to access a given thread's per-thread @@ -841,7 +841,7 @@ Finally, line~\lnref{return} returns the sum. scenario. Of course, we could instead read-acquire a reader-writer lock, - but Chapter~\ref{chp:Deferred Processing} will introduce even + but \cref{chp:Deferred Processing} will introduce even lighter-weight mechanisms for implementing the required coordination. Another approach would be to use an array instead of a per-thread @@ -866,7 +866,7 @@ array to point to its per-thread \co{counter} variable. \QuickQuiz{ Why on earth do we need to acquire the lock in \co{count_register_thread()} in - Listing~\ref{lst:count:Per-Thread Statistical Counters}? + \cref{lst:count:Per-Thread Statistical Counters}? It is a single properly aligned machine-word store to a location that no other thread is modifying, so it should be atomic anyway, right? @@ -884,13 +884,13 @@ array to point to its per-thread \co{counter} variable. function, which must be called prior to exit by each thread that previously called \co{count_register_thread()}. 
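[Editor's note, outside the patch: the register/unregister protocol being walked through here can be condensed into the following sketch. It is a hypothetical simplification of count_end.c: the fixed NR_THREADS array and the explicit idx parameter are illustrative shortcuts, and the READ_ONCE()/WRITE_ONCE() accesses of the real code are omitted for brevity.]

```c
#include <pthread.h>
#include <stddef.h>

#define NR_THREADS 8

__thread unsigned long counter;		/* this thread's count */
unsigned long *counterp[NR_THREADS];	/* pointers to live counters */
unsigned long finalcount;		/* counts from exited threads */
pthread_mutex_t final_mutex = PTHREAD_MUTEX_INITIALIZER;

void inc_count(void)
{
	counter++;	/* fastpath: no locks, no atomics */
}

unsigned long read_count(void)
{
	unsigned long sum;
	int t;

	pthread_mutex_lock(&final_mutex);	/* exclude exiting threads */
	sum = finalcount;
	for (t = 0; t < NR_THREADS; t++)
		if (counterp[t] != NULL)	/* skip exited threads */
			sum += *counterp[t];
	pthread_mutex_unlock(&final_mutex);
	return sum;
}

void count_register_thread(int idx)
{
	pthread_mutex_lock(&final_mutex);
	counterp[idx] = &counter;		/* publish this thread's counter */
	pthread_mutex_unlock(&final_mutex);
}

void count_unregister_thread(int idx)
{
	pthread_mutex_lock(&final_mutex);
	finalcount += counter;			/* fold count into the global total */
	counterp[idx] = NULL;			/* and retire the pointer */
	pthread_mutex_unlock(&final_mutex);
}
```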
-Line~\lnref{acquire} acquires the lock, and -line~\lnref{release} releases it, thus excluding any +\Clnref{acquire} acquires the lock, and +\clnref{release} releases it, thus excluding any calls to \co{read_count()} as well as other calls to \co{count_unregister_thread()}. -Line~\lnref{add} adds this thread's \co{counter} to the global +\Clnref{add} adds this thread's \co{counter} to the global \co{finalcount}, -and then line~\lnref{NULL} \co{NULL}s out its \co{counterp[]} array entry. +and then \clnref{NULL} \co{NULL}s out its \co{counterp[]} array entry. A subsequent call to \co{read_count()} will see the exiting thread's count in the global \co{finalcount}, and will skip the exiting thread when sequencing through the \co{counterp[]} array, thus obtaining @@ -924,13 +924,13 @@ variables vanish when that thread exits. One workaround is to ensure that each thread continues to exist until all threads are finished, as shown in - Listing~\ref{lst:count:Per-Thread Statistical Counters With Lockless Summation} + \cref{lst:count:Per-Thread Statistical Counters With Lockless Summation} (\path{count_tstat.c}). Analysis of this code is left as an exercise to the reader, however, please note that it requires tweaks in the \path{counttorture.h} counter-evaluation scheme. (Hint: See \co{#ifndef KEEP_GCC_THREAD_LOCAL}.) - Chapter~\ref{chp:Deferred Processing} will introduce + \Cref{chp:Deferred Processing} will introduce synchronization mechanisms that handle this situation in a much more graceful manner. }\QuickQuizEnd @@ -974,17 +974,17 @@ eventually consistent. \begin{fcvref}[ln:count:count_stat_eventual:whole] The implementation is shown in -Listing~\ref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} +\cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} (\path{count_stat_eventual.c}). 
\Clnrefrange{per_thr_cnt}{glb_cnt} show the per-thread variable and the global variable that -track the counter's value, and line~\lnref{stopflag} shows \co{stopflag} +track the counter's value, and \clnref{stopflag} shows \co{stopflag} which is used to coordinate termination (for the case where we want to terminate the program with an accurate counter value). The \co{inc_count()} function shown on \clnrefrange{inc:b}{inc:e} is similar to its counterpart in -Listing~\ref{lst:count:Array-Based Per-Thread Statistical Counters}. +\cref{lst:count:Array-Based Per-Thread Statistical Counters}. The \co{read_count()} function shown on \clnrefrange{read:b}{read:e} simply returns the value of the \co{global_count} variable. @@ -1014,7 +1014,7 @@ comes at the cost of the additional thread running \co{eventual()}. \QuickQuizSeries{% \QuickQuizB{ Why doesn't \co{inc_count()} in - Listing~\ref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} + \cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} need to use atomic instructions? After all, we now have multiple threads accessing the per-thread counters! @@ -1027,7 +1027,7 @@ comes at the cost of the additional thread running \co{eventual()}. counter updates from becoming visible to \co{eventual()}.\footnote{ A simple definition of \co{READ_ONCE()} is shown in - Listing~\ref{lst:toolsoftrade:Compiler Barrier Primitive (for GCC)}.} + \cref{lst:toolsoftrade:Compiler Barrier Primitive (for GCC)}.} An older version of this algorithm did in fact use atomic instructions, kudos to Ersoy Bayramoglu for pointing out that @@ -1052,7 +1052,7 @@ comes at the cost of the additional thread running \co{eventual()}. % \QuickQuizM{ Won't the single global thread in the function \co{eventual()} of - Listing~\ref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} + \cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} be just as severe a bottleneck as a global lock would be? 
}\QuickQuizAnswerM{ In this case, no. @@ -1063,7 +1063,7 @@ comes at the cost of the additional thread running \co{eventual()}. % \QuickQuizM{ Won't the estimate returned by \co{read_count()} in - Listing~\ref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} + \cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} become increasingly inaccurate as the number of threads rises? }\QuickQuizAnswerM{ @@ -1077,11 +1077,11 @@ comes at the cost of the additional thread running \co{eventual()}. % \QuickQuizM{ Given that in the eventually\-/consistent algorithm shown in - Listing~\ref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} + \cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters} both reads and updates have extremely low overhead and are extremely scalable, why would anyone bother with the implementation described in - Section~\ref{sec:count:Array-Based Implementation}, + \cref{sec:count:Array-Based Implementation}, given its costly read-side code? }\QuickQuizAnswerM{ The thread executing \co{eventual()} consumes CPU time. @@ -1113,7 +1113,7 @@ comes at the cost of the additional thread running \co{eventual()}. % \QuickQuizE{ What is the accuracy of the estimate returned by \co{read_count()} in - Listing~\ref{lst:count:Array-Based Per-Thread Eventually Consistent Counters}? + \cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters}? }\QuickQuizAnswerE{ A straightforward way to evaluate this estimate is to use the analysis derived in \QuickQuizARef{\StatisticalCounterAccuracy}, @@ -1215,7 +1215,7 @@ of structures in use exceeds a limit, in this case, 10,000. Suppose further that these structures are short-lived, that this limit is rarely exceeded, and that this limit is approximate in that it is OK to exceed it sometimes by some bounded amount -(see Section~\ref{sec:count:Exact Limit Counters} +(see \cref{sec:count:Exact Limit Counters} if you instead need the limit to be exact). 
\subsection{Design} @@ -1242,10 +1242,10 @@ or other means of communicating between threads.\footnote{ In short, for many important workloads, we cannot fully partition the counter. Given that partitioning the counters was what brought the excellent update-side performance for the three schemes discussed in -Section~\ref{sec:count:Statistical Counters}, this might be grounds +\cref{sec:count:Statistical Counters}, this might be grounds for some pessimism. However, the eventually consistent algorithm presented in -Section~\ref{sec:count:Eventually Consistent Implementation} +\cref{sec:count:Eventually Consistent Implementation} provides an interesting hint. Recall that this algorithm kept two sets of books, a per-thread \co{counter} variable for updaters and a \co{global_count} @@ -1311,30 +1311,30 @@ instructions and no interactions between threads, but where occasional use is also made of a more conservatively designed (and higher overhead) global algorithm. This design pattern is covered in more detail in -Section~\ref{sec:SMPdesign:Parallel Fastpath}. +\cref{sec:SMPdesign:Parallel Fastpath}. \subsection{Simple Limit Counter Implementation} \label{sec:count:Simple Limit Counter Implementation} \begin{fcvref}[ln:count:count_lim:variable] -Listing~\ref{lst:count:Simple Limit Counter Variables} +\Cref{lst:count:Simple Limit Counter Variables} shows both the per-thread and global variables used by this implementation. The per-thread \co{counter} and \co{countermax} variables are the corresponding thread's local counter and the upper bound on that counter, respectively. The \co{globalcountmax} variable on -line~\lnref{globalcountmax} contains the upper +\clnref{globalcountmax} contains the upper bound for the aggregate counter, and the \co{globalcount} variable -on line~\lnref{globalcount} is the global counter. +on \clnref{globalcount} is the global counter. The sum of \co{globalcount} and each thread's \co{counter} gives the aggregate value of the overall counter. 
The \co{globalreserve} variable on -line~\lnref{globalreserve} is at least the sum of all of the +\clnref{globalreserve} is at least the sum of all of the per-thread \co{countermax} variables. \end{fcvref} The relationship among these variables is shown by -Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}: +\cref{fig:count:Simple Limit Counter Variable Relationships}: \begin{enumerate} \item The sum of \co{globalcount} and \co{globalreserve} must be less than or equal to \co{globalcountmax}. @@ -1378,7 +1378,7 @@ functions (\path{count_lim.c}). \cref{lst:count:Simple Limit Counter Add; Subtract; and Read} provide \co{add_count()} and \co{sub_count()} instead of the \co{inc_count()} and \co{dec_count()} interfaces shown in - Section~\ref{sec:count:Statistical Counters}? + \cref{sec:count:Statistical Counters}? }\QuickQuizAnswer{ Because structures come in different sizes. Of course, a limit counter corresponding to a specific size @@ -1390,10 +1390,10 @@ functions (\path{count_lim.c}). \Clnrefrange{b}{e} show \co{add_count()}, which adds the specified value \co{delta} to the counter. -Line~\lnref{checklocal} checks to see if there is room for +\Clnref{checklocal} checks to see if there is room for \co{delta} on this thread's \co{counter}, and, if so, -line~\lnref{add} adds it and line~\lnref{return:ls} returns success. +\clnref{add} adds it and \clnref{return:ls} returns success. This is the \co{add_count()} fastpath, and it does no atomic operations, references only per-thread variables, and should not incur any cache misses. \end{fcvref} @@ -1411,7 +1411,7 @@ references only per-thread variables, and should not incur any cache misses. \QuickQuiz{ What is with the strange form of the condition on - line~\ref{ln:count:count_lim:add_sub_read:add:checklocal} of + \clnrefr{ln:count:count_lim:add_sub_read:add:checklocal} of \cref{lst:count:Simple Limit Counter Add; Subtract; and Read}? 
Why not the more intuitive form of the fastpath shown in \cref{lst:count:Intuitive Fastpath}? @@ -1435,35 +1435,35 @@ references only per-thread variables, and should not incur any cache misses. \begin{fcvref}[ln:count:count_lim:add_sub_read:add] If the test on -line~\lnref{checklocal} fails, we must access global variables, and thus +\clnref{checklocal} fails, we must access global variables, and thus must acquire \co{gblcnt_mutex} on -line~\lnref{acquire}, which we release on line~\lnref{release:f} -in the failure case or on line~\lnref{release:s} in the success case. -Line~\lnref{globalize} invokes \co{globalize_count()}, shown in -Listing~\ref{lst:count:Simple Limit Counter Utility Functions}, +\clnref{acquire}, which we release on \clnref{release:f} +in the failure case or on \clnref{release:s} in the success case. +\Clnref{globalize} invokes \co{globalize_count()}, shown in +\cref{lst:count:Simple Limit Counter Utility Functions}, which clears the thread-local variables, adjusting the global variables as needed, thus simplifying global processing. (But don't take \emph{my} word for it, try coding it yourself!) -Lines~\lnref{checkglb:b} and~\lnref{checkglb:e} check to see +\Clnref{checkglb:b,checkglb:e} check to see if addition of \co{delta} can be accommodated, with the meaning of the expression preceding the less-than sign shown in -Figure~\ref{fig:count:Simple Limit Counter Variable Relationships} +\cref{fig:count:Simple Limit Counter Variable Relationships} as the difference in height of the two red (leftmost) bars. If the addition of \co{delta} cannot be accommodated, then -line~\lnref{release:f} (as noted earlier) releases \co{gblcnt_mutex} and -line~\lnref{return:gf} +\clnref{release:f} (as noted earlier) releases \co{gblcnt_mutex} and +\clnref{return:gf} returns indicating failure. Otherwise, we take the slowpath. 
-Line~\lnref{addglb} adds \co{delta} to \co{globalcount}, and then -line~\lnref{balance} invokes \co{balance_count()} (shown in -Listing~\ref{lst:count:Simple Limit Counter Utility Functions}) +\Clnref{addglb} adds \co{delta} to \co{globalcount}, and then +\clnref{balance} invokes \co{balance_count()} (shown in +\cref{lst:count:Simple Limit Counter Utility Functions}) in order to update both the global and the per-thread variables. This call to \co{balance_count()} will usually set this thread's \co{countermax} to re-enable the fastpath. -Line~\lnref{release:s} then releases +\Clnref{release:s} then releases \co{gblcnt_mutex} (again, as noted earlier), and, finally, -line~\lnref{return:gs} returns indicating success. +\clnref{return:gs} returns indicating success. \end{fcvref} \QuickQuiz{ @@ -1483,25 +1483,25 @@ line~\lnref{return:gs} returns indicating success. \Clnrefrange{b}{e} show \co{sub_count()}, which subtracts the specified \co{delta} from the counter. -Line~\lnref{checklocal} checks to see if the per-thread counter can accommodate -this subtraction, and, if so, line~\lnref{sub} does the subtraction and -line~\lnref{return:ls} returns success. +\Clnref{checklocal} checks to see if the per-thread counter can accommodate +this subtraction, and, if so, \clnref{sub} does the subtraction and +\clnref{return:ls} returns success. These lines form \co{sub_count()}'s fastpath, and, as with \co{add_count()}, this fastpath executes no costly operations. If the fastpath cannot accommodate subtraction of \co{delta}, execution proceeds to the slowpath on \clnrefrange{acquire}{return:gs}. -Because the slowpath must access global state, line~\lnref{acquire} -acquires \co{gblcnt_mutex}, which is released either by line~\lnref{release:f} -(in case of failure) or by line~\lnref{release:s} (in case of success). 
-Line~\lnref{globalize} invokes \co{globalize_count()}, shown in -Listing~\ref{lst:count:Simple Limit Counter Utility Functions}, +Because the slowpath must access global state, \clnref{acquire} +acquires \co{gblcnt_mutex}, which is released either by \clnref{release:f} +(in case of failure) or by \clnref{release:s} (in case of success). +\Clnref{globalize} invokes \co{globalize_count()}, shown in +\cref{lst:count:Simple Limit Counter Utility Functions}, which again clears the thread-local variables, adjusting the global variables as needed. -Line~\lnref{checkglb} checks to see if the counter can accommodate subtracting -\co{delta}, and, if not, line~\lnref{release:f} releases \co{gblcnt_mutex} -(as noted earlier) and line~\lnref{return:gf} returns failure. +\Clnref{checkglb} checks to see if the counter can accommodate subtracting +\co{delta}, and, if not, \clnref{release:f} releases \co{gblcnt_mutex} +(as noted earlier) and \clnref{return:gf} returns failure. \end{fcvref} \QuickQuizSeries{% @@ -1530,23 +1530,23 @@ Line~\lnref{checkglb} checks to see if the counter can accommodate subtracting }\QuickQuizAnswerE{ Indeed it will! In many cases, this will be a problem, as discussed in - Section~\ref{sec:count:Simple Limit Counter Discussion}, and + \cref{sec:count:Simple Limit Counter Discussion}, and in those cases the algorithms from - Section~\ref{sec:count:Exact Limit Counters} + \cref{sec:count:Exact Limit Counters} will likely be preferable. }\QuickQuizEndE } \begin{fcvref}[ln:count:count_lim:add_sub_read:sub] -If, on the other hand, line~\lnref{checkglb} finds that the counter \emph{can} +If, on the other hand, \clnref{checkglb} finds that the counter \emph{can} accommodate subtracting \co{delta}, we complete the slowpath. 
-Line~\lnref{subglb} does the subtraction and then -line~\lnref{balance} invokes \co{balance_count()} (shown in -Listing~\ref{lst:count:Simple Limit Counter Utility Functions}) +\Clnref{subglb} does the subtraction and then +\clnref{balance} invokes \co{balance_count()} (shown in +\cref{lst:count:Simple Limit Counter Utility Functions}) in order to update both global and per-thread variables (hopefully re-enabling the fastpath). -Then line~\lnref{release:s} releases \co{gblcnt_mutex}, and -line~\lnref{return:gs} returns success. +Then \clnref{release:s} releases \co{gblcnt_mutex}, and +\clnref{return:gs} returns success. \end{fcvref} \QuickQuiz{ @@ -1572,15 +1572,15 @@ line~\lnref{return:gs} returns success. \Clnrefrange{b}{e} show \co{read_count()}, which returns the aggregate value of the counter. -It acquires \co{gblcnt_mutex} on line~\lnref{acquire} -and releases it on line~\lnref{release}, +It acquires \co{gblcnt_mutex} on \clnref{acquire} +and releases it on \clnref{release}, excluding global operations from \co{add_count()} and \co{sub_count()}, and, as we will see, also excluding thread creation and exit. -Line~\lnref{initsum} initializes local variable \co{sum} to the value of +\Clnref{initsum} initializes local variable \co{sum} to the value of \co{globalcount}, and then the loop spanning \clnrefrange{loop:b}{loop:e} sums the per-thread \co{counter} variables. -Line~\lnref{return} then returns the sum. +\Clnref{return} then returns the sum. \end{fcvref} \begin{listing} @@ -1589,7 +1589,7 @@ Line~\lnref{return} then returns the sum. \label{lst:count:Simple Limit Counter Utility Functions} \end{listing} -Listing~\ref{lst:count:Simple Limit Counter Utility Functions} +\Cref{lst:count:Simple Limit Counter Utility Functions} shows a number of utility functions used by the \co{add_count()}, \co{sub_count()}, and \co{read_count()} primitives shown in \cref{lst:count:Simple Limit Counter Add; Subtract; and Read}. 
@@ -1601,12 +1601,12 @@ per-thread counters, adjusting the global variables appropriately. It is important to note that this function does not change the aggregate value of the counter, but instead changes how the counter's current value is represented. -Line~\lnref{add} adds the thread's \co{counter} variable to \co{globalcount}, -and line~\lnref{zero} zeroes \co{counter}. -Similarly, line~\lnref{sub} subtracts the per-thread \co{countermax} from -\co{globalreserve}, and line~\lnref{zeromax} zeroes \co{countermax}. +\Clnref{add} adds the thread's \co{counter} variable to \co{globalcount}, +and \clnref{zero} zeroes \co{counter}. +Similarly, \clnref{sub} subtracts the per-thread \co{countermax} from +\co{globalreserve}, and \clnref{zeromax} zeroes \co{countermax}. It is helpful to refer to -Figure~\ref{fig:count:Simple Limit Counter Variable Relationships} +\cref{fig:count:Simple Limit Counter Variable Relationships} when reading both this function and \co{balance_count()}, which is next. \end{fcvref} @@ -1620,7 +1620,7 @@ of the counter exceeding the \co{globalcountmax} limit. Changing the current thread's \co{countermax} variable of course requires corresponding adjustments to \co{counter}, \co{globalcount} and \co{globalreserve}, as can be seen by referring back to -Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}. +\cref{fig:count:Simple Limit Counter Variable Relationships}. By doing this, \co{balance_count()} maximizes use of \co{add_count()}'s and \co{sub_count()}'s low-overhead fastpaths. As with \co{globalize_count()}, \co{balance_count()} is not permitted @@ -1631,22 +1631,22 @@ that portion of \co{globalcountmax} that is not already covered by either \co{globalcount} or \co{globalreserve}, and assign the computed quantity to this thread's \co{countermax}. -Line~\lnref{adjreserve} makes the corresponding adjustment to \co{globalreserve}. 
-Line~\lnref{middle} sets this thread's \co{counter} to the middle of the range +\Clnref{adjreserve} makes the corresponding adjustment to \co{globalreserve}. +\Clnref{middle} sets this thread's \co{counter} to the middle of the range from zero to \co{countermax}. -Line~\lnref{check} checks to see whether \co{globalcount} can in fact accommodate +\Clnref{check} checks to see whether \co{globalcount} can in fact accommodate this value of \co{counter}, and, if not, -line~\lnref{adjcounter} decreases \co{counter} +\clnref{adjcounter} decreases \co{counter} accordingly. Finally, in either case, -line~\lnref{adjglobal} makes the corresponding adjustment to +\clnref{adjglobal} makes the corresponding adjustment to \co{globalcount}. \end{fcvref} \QuickQuiz{ \begin{fcvref}[ln:count:count_lim:utility:balance] Why set \co{counter} to \co{countermax / 2} in \clnref{middle} of - Listing~\ref{lst:count:Simple Limit Counter Utility Functions}? + \cref{lst:count:Simple Limit Counter Utility Functions}? Wouldn't it be simpler to just take \co{countermax} counts? \end{fcvref} }\QuickQuizAnswer{ @@ -1678,10 +1678,10 @@ line~\lnref{adjglobal} makes the corresponding adjustment to It is helpful to look at a schematic depicting how the relationship of the counters changes with the execution of first \co{globalize_count()} and then \co{balance_count}, as shown in -Figure~\ref{fig:count:Schematic of Globalization and Balancing}. +\cref{fig:count:Schematic of Globalization and Balancing}. Time advances from left to right, with the leftmost configuration roughly that of -Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}. +\cref{fig:count:Simple Limit Counter Variable Relationships}. The center configuration shows the relationship of these same counters after \co{globalize_count()} is executed by thread~0. 
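The `globalize_count()` operation described above is short enough to sketch directly. This is a sketch under stated assumptions, not the book's `count_lim.c`: the per-thread variables are modeled with `__thread` storage, and the caller is assumed to hold `gblcnt_mutex`.

```c
static unsigned long globalcount;
static unsigned long globalreserve;
static __thread unsigned long counter;     /* this thread's local count */
static __thread unsigned long countermax;  /* this thread's reservation */

/* Fold this thread's counter into globalcount and return its
 * reservation to the pool.  The aggregate value of the counter is
 * unchanged; only its representation moves.  Caller holds
 * gblcnt_mutex. */
static void globalize_count(void)
{
	globalcount += counter;      /* move local count to global */
	counter = 0;
	globalreserve -= countermax; /* give back this thread's reservation */
	countermax = 0;
}
```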
As can be seen from the figure, thread~0's \co{counter} (``c~0'' in @@ -1715,7 +1715,7 @@ Because thread~0's \co{counter} is less than its \co{countermax}, thread~0 can once again increment the counter locally. \QuickQuiz{ - In Figure~\ref{fig:count:Schematic of Globalization and Balancing}, + In \cref{fig:count:Schematic of Globalization and Balancing}, even though a quarter of the remaining count up to the limit is assigned to thread~0, only an eighth of the remaining count is consumed, as indicated by the uppermost dotted line connecting @@ -1751,11 +1751,11 @@ of \co{gblcnt_mutex}. Finally, \clnrefrange{b}{e} show \co{count_unregister_thread()}, which tears down state for a soon-to-be-exiting thread. -Line~\lnref{acquire} acquires \co{gblcnt_mutex} and -line~\lnref{release} releases it. -Line~\lnref{globalize} invokes \co{globalize_count()} +\Clnref{acquire} acquires \co{gblcnt_mutex} and +\clnref{release} releases it. +\Clnref{globalize} invokes \co{globalize_count()} to clear out this thread's -counter state, and line~\lnref{clear} clears this thread's entry in the +counter state, and \clnref{clear} clears this thread's entry in the \co{counterp[]} array. \end{fcvref} @@ -1807,11 +1807,11 @@ permissible value of the per-thread \co{countermax} variable. \begin{fcvref}[ln:count:count_lim_app:balance] Similarly, -Listing~\ref{lst:count:Approximate Limit Counter Balancing} +\cref{lst:count:Approximate Limit Counter Balancing} is identical to the \co{balance_count()} function in -Listing~\ref{lst:count:Simple Limit Counter Utility Functions}, +\cref{lst:count:Simple Limit Counter Utility Functions}, with the addition of -lines~\lnref{enforce:b} and~\lnref{enforce:e}, which enforce the +\clnref{enforce:b,enforce:e}, which enforce the \co{MAX_COUNTERMAX} limit on the per-thread \co{countermax} variable. 
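The `MAX_COUNTERMAX` clamp just described can be sketched as the following variant of `balance_count()`. This is not the book's `count_lim_app.c`: the concrete limits and the `num_online_threads()` helper are hypothetical stand-ins chosen to make the sketch self-contained, and the caller is assumed to hold `gblcnt_mutex`.

```c
#define MAX_COUNTERMAX 100      /* cap on any one thread's reservation */

static unsigned long globalcountmax = 10000;
static unsigned long globalcount;
static unsigned long globalreserve;
static __thread unsigned long counter;
static __thread unsigned long countermax;

static int num_online_threads(void)
{
	return 4;               /* hypothetical stand-in */
}

/* Give the current thread a share of the remaining headroom, but no
 * more than MAX_COUNTERMAX, then seed its counter at mid-range so
 * that both increments and decrements can use the fastpath. */
static void balance_count(void)
{
	countermax = globalcountmax - globalcount - globalreserve;
	countermax /= num_online_threads();
	if (countermax > MAX_COUNTERMAX)
		countermax = MAX_COUNTERMAX;    /* enforce the cap */
	globalreserve += countermax;
	counter = countermax / 2;               /* middle of the range */
	if (counter > globalcount)
		counter = globalcount;          /* don't overdraw the global count */
	globalcount -= counter;
}
```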
\end{fcvref} @@ -1901,11 +1901,11 @@ represent \co{counter} and the low-order 16 bits to represent \begin{fcvref}[ln:count:count_lim_atomic:var_access:var] The variables and access functions for a simple atomic limit counter are shown in -Listing~\ref{lst:count:Atomic Limit Counter Variables and Access Functions} +\cref{lst:count:Atomic Limit Counter Variables and Access Functions} (\path{count_lim_atomic.c}). The \co{counter} and \co{countermax} variables in earlier algorithms are combined into the single variable \co{counterandmax} shown on -line~\lnref{candmax}, with \co{counter} in the upper half and \co{countermax} in +\clnref{candmax}, with \co{counter} in the upper half and \co{countermax} in the lower half. This variable is of type \co{atomic_t}, which has an underlying representation of \co{int}. @@ -1913,17 +1913,17 @@ representation of \co{int}. \Clnrefrange{def:b}{def:e} show the definitions for \co{globalcountmax}, \co{globalcount}, \co{globalreserve}, \co{counterp}, and \co{gblcnt_mutex}, all of which take on roles similar to their counterparts in -Listing~\ref{lst:count:Approximate Limit Counter Variables}. -Line~\lnref{CM_BITS} defines \co{CM_BITS}, which gives the number of bits in each half -of \co{counterandmax}, and line~\lnref{MAX_CMAX} defines \co{MAX_COUNTERMAX}, which +\cref{lst:count:Approximate Limit Counter Variables}. +\Clnref{CM_BITS} defines \co{CM_BITS}, which gives the number of bits in each half +of \co{counterandmax}, and \clnref{MAX_CMAX} defines \co{MAX_COUNTERMAX}, which gives the maximum value that may be held in either half of \co{counterandmax}. \end{fcvref} \QuickQuiz{ In what way does - line~\ref{ln:count:count_lim_atomic:var_access:var:CM_BITS} of - Listing~\ref{lst:count:Atomic Limit Counter Variables and Access Functions} + \clnrefr{ln:count:count_lim_atomic:var_access:var:CM_BITS} of + \cref{lst:count:Atomic Limit Counter Variables and Access Functions} violate the C standard? 
}\QuickQuizAnswer{ It assumes eight bits per byte. @@ -1942,25 +1942,25 @@ when given the underlying \co{int} from the \co{atomic_t counterandmax} variable, splits it into its \co{counter} (\co{c}) and \co{countermax} (\co{cm}) components. -Line~\lnref{msh} isolates the most-significant half of this \co{int}, +\Clnref{msh} isolates the most-significant half of this \co{int}, placing the result as specified by argument \co{c}, -and line~\lnref{lsh} isolates the least-significant half of this \co{int}, +and \clnref{lsh} isolates the least-significant half of this \co{int}, placing the result as specified by argument \co{cm}. \end{fcvref} \begin{fcvref}[ln:count:count_lim_atomic:var_access:split] \Clnrefrange{b}{e} show the \co{split_counterandmax()} function, which picks up the underlying \co{int} from the specified variable -on line~\lnref{int}, stores it as specified by the \co{old} argument on -line~\lnref{old}, and then invokes \co{split_counterandmax_int()} to split -it on line~\lnref{split_int}. +on \clnref{int}, stores it as specified by the \co{old} argument on +\clnref{old}, and then invokes \co{split_counterandmax_int()} to split +it on \clnref{split_int}. \end{fcvref} \QuickQuiz{ Given that there is only one \co{counterandmax} variable, why bother passing in a pointer to it on - line~\ref{ln:count:count_lim_atomic:var_access:split:func} of - Listing~\ref{lst:count:Atomic Limit Counter Variables and Access Functions}? + \clnrefr{ln:count:count_lim_atomic:var_access:split:func} of + \cref{lst:count:Atomic Limit Counter Variables and Access Functions}? }\QuickQuizAnswer{ There is only one \co{counterandmax} variable \emph{per thread}. Later, we will see code that needs to pass other threads' @@ -1970,14 +1970,14 @@ it on line~\lnref{split_int}. \begin{fcvref}[ln:count:count_lim_atomic:var_access:merge] \Clnrefrange{b}{e} show the \co{merge_counterandmax()} function, which can be thought of as the inverse of \co{split_counterandmax()}. 
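The split and merge operations described above amount to simple shift-and-mask bit manipulation. The following sketch follows the text's naming but is not the book's `count_lim_atomic.c`; it assumes, as discussed in the Quick Quiz, a conventional `int` whose width in bits is `sizeof(int) * CHAR_BIT`.

```c
#include <limits.h>

#define CM_BITS (sizeof(int) * CHAR_BIT / 2)   /* bits in each half */
#define MAX_COUNTERMAX ((1 << CM_BITS) - 1)    /* max value of either half */

/* Split a packed int into counter (upper half) and countermax
 * (lower half). */
static void split_counterandmax_int(int cami, int *c, int *cm)
{
	*c = (cami >> CM_BITS) & MAX_COUNTERMAX;
	*cm = cami & MAX_COUNTERMAX;
}

/* Inverse operation: pack counter and countermax into a single int. */
static int merge_counterandmax(int c, int cm)
{
	return (c << CM_BITS) | cm;
}
```

Returning a plain `int` from `merge_counterandmax()` leaves the caller free to feed the result into a compare-and-swap rather than storing it blindly.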
-Line~\lnref{merge} merges the \co{counter} and \co{countermax} +\Clnref{merge} merges the \co{counter} and \co{countermax} values passed in \co{c} and \co{cm}, respectively, and returns the result. \end{fcvref} \QuickQuiz{ Why does \co{merge_counterandmax()} in - Listing~\ref{lst:count:Atomic Limit Counter Variables and Access Functions} + \cref{lst:count:Atomic Limit Counter Variables and Access Functions} return an \co{int} rather than storing directly into an \co{atomic_t}? }\QuickQuizAnswer{ @@ -1991,7 +1991,7 @@ the result. \label{lst:count:Atomic Limit Counter Add and Subtract} \end{listing} -Listing~\ref{lst:count:Atomic Limit Counter Add and Subtract} +\Cref{lst:count:Atomic Limit Counter Add and Subtract} shows the \co{add_count()} and \co{sub_count()} functions. \begin{fcvref}[ln:count:count_lim_atomic:add_sub:add] @@ -2003,34 +2003,34 @@ with the remainder of the function being the slowpath. the \co{atomic_cmpxchg()} primitive on \clnrefrange{atmcmpex}{loop:e} performing the actual CAS\@. -Line~\lnref{split} splits the current thread's \co{counterandmax} variable into its +\Clnref{split} splits the current thread's \co{counterandmax} variable into its \co{counter} (in \co{c}) and \co{countermax} (in \co{cm}) components, while placing the underlying \co{int} into \co{old}. -Line~\lnref{check} checks whether the amount \co{delta} can be accommodated +\Clnref{check} checks whether the amount \co{delta} can be accommodated locally (taking care to avoid integer overflow), and if not, -line~\lnref{goto} transfers to the slowpath. -Otherwise, line~\lnref{merge} combines an updated \co{counter} value with the +\clnref{goto} transfers to the slowpath. +Otherwise, \clnref{merge} combines an updated \co{counter} value with the original \co{countermax} value into \co{new}. 
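The CAS-based fastpath described above can be sketched as follows. This sketch substitutes GCC's `__atomic` builtins for the Linux-kernel-style `atomic_t`/`atomic_cmpxchg()` API used in `count_lim_atomic.c`, and repeats the split/merge helpers so that it stands alone; it is an illustration, not the book's implementation.

```c
#include <limits.h>

#define CM_BITS (sizeof(int) * CHAR_BIT / 2)
#define MAX_COUNTERMAX ((1 << CM_BITS) - 1)

static __thread int counterandmax;   /* counter in upper half, max in lower */

static void split_counterandmax_int(int cami, int *c, int *cm)
{
	*c = (cami >> CM_BITS) & MAX_COUNTERMAX;
	*cm = cami & MAX_COUNTERMAX;
}

static int merge_counterandmax(int c, int cm)
{
	return (c << CM_BITS) | cm;
}

/* Return 1 on fastpath success, 0 if the caller must fall back to
 * the gblcnt_mutex-protected slowpath. */
static int add_count_fast(unsigned long delta)
{
	int c, cm, old, new;

	do {
		old = __atomic_load_n(&counterandmax, __ATOMIC_RELAXED);
		split_counterandmax_int(old, &c, &cm);
		if (delta > (unsigned long)(cm - c))
			return 0;       /* no local headroom: slowpath */
		new = merge_counterandmax(c + delta, cm);
	} while (!__atomic_compare_exchange_n(&counterandmax, &old, new, 0,
					      __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
	return 1;
}
```

The retry loop is what allows a concurrent `flush_local_count()` to zero this thread's state out from under the fastpath without losing counts.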
The \co{atomic_cmpxchg()} primitive on \clnrefrange{atmcmpex}{loop:e} then atomically compares this thread's \co{counterandmax} variable to \co{old}, updating its value to \co{new} if the comparison succeeds. -If the comparison succeeds, line~\lnref{return:fs} returns success, otherwise, -execution continues in the loop at line~\lnref{fast:b}. +If the comparison succeeds, \clnref{return:fs} returns success, otherwise, +execution continues in the loop at \clnref{fast:b}. \end{fcvref} \QuickQuizSeries{% \QuickQuizB{ Yecch! Why the ugly \co{goto} on - line~\ref{ln:count:count_lim_atomic:add_sub:add:goto} of - Listing~\ref{lst:count:Atomic Limit Counter Add and Subtract}? + \clnrefr{ln:count:count_lim_atomic:add_sub:add:goto} of + \cref{lst:count:Atomic Limit Counter Add and Subtract}? Haven't you heard of the \co{break} statement??? }\QuickQuizAnswerB{ Replacing the \co{goto} with a \co{break} would require keeping a flag to determine whether or not - line~\ref{ln:count:count_lim_atomic:add_sub:add:return:fs} - should return, which + \clnrefr{ln:count:count_lim_atomic:add_sub:add:return:fs} + should return, which is not the sort of thing you want on a fastpath. If you really hate the \co{goto} that much, your best bet would be to pull the fastpath into a separate function that returned @@ -2040,57 +2040,57 @@ execution continues in the loop at line~\lnref{fast:b}. }\QuickQuizEndB % \QuickQuizE{ - \begin{fcvref}[ln:count:count_lim_atomic:add_sub:add] + \begin{fcvref}[ln:count:count_lim_atomic:add_sub:add] Why would the \co{atomic_cmpxchg()} primitive at - \clnrefrange{atmcmpex}{loop:e} of - Listing~\ref{lst:count:Atomic Limit Counter Add and Subtract} + \clnrefrange{atmcmpex}{loop:e} of + \cref{lst:count:Atomic Limit Counter Add and Subtract} ever fail? - After all, we picked up its old value on line~\lnref{split} and have not + After all, we picked up its old value on \clnref{split} and have not changed it! 
\end{fcvref} }\QuickQuizAnswerE{ \begin{fcvref}[ln:count:count_lim_atomic:add_sub:add] Later, we will see how the \co{flush_local_count()} function in - Listing~\ref{lst:count:Atomic Limit Counter Utility Functions 1} + \cref{lst:count:Atomic Limit Counter Utility Functions 1} might update this thread's \co{counterandmax} variable concurrently with the execution of the fastpath on - \clnrefrange{fast:b}{loop:e} of - Listing~\ref{lst:count:Atomic Limit Counter Add and Subtract}. + \clnrefrange{fast:b}{loop:e} of + \cref{lst:count:Atomic Limit Counter Add and Subtract}. \end{fcvref} }\QuickQuizEndE } \begin{fcvref}[ln:count:count_lim_atomic:add_sub:add] \Clnrefrange{slow:b}{return:ss} of -Listing~\ref{lst:count:Atomic Limit Counter Add and Subtract} +\cref{lst:count:Atomic Limit Counter Add and Subtract} show \co{add_count()}'s slowpath, which is protected by \co{gblcnt_mutex}, -which is acquired on line~\lnref{acquire} and released on -lines~\lnref{release:f} and~\lnref{release:s}. -Line~\lnref{globalize} invokes \co{globalize_count()}, +which is acquired on \clnref{acquire} and released on +\clnref{release:f,release:s}. +\Clnref{globalize} invokes \co{globalize_count()}, which moves this thread's state to the global counters. \Clnrefrange{checkglb:b}{checkglb:e} check whether the \co{delta} value can be accommodated by -the current global state, and, if not, line~\lnref{flush} invokes +the current global state, and, if not, \clnref{flush} invokes \co{flush_local_count()} to flush all threads' local state to the global counters, and then \clnrefrange{checkglb:nb}{checkglb:ne} recheck whether \co{delta} can be accommodated. If, after all that, the addition of \co{delta} still cannot be accommodated, -then line~\lnref{release:f} releases \co{gblcnt_mutex} (as noted earlier), and -then line~\lnref{return:sf} returns failure. +then \clnref{release:f} releases \co{gblcnt_mutex} (as noted earlier), and +then \clnref{return:sf} returns failure. 
-Otherwise, line~\lnref{addglb} adds \co{delta} to the global counter, -line~\lnref{balance} -spreads counts to the local state if appropriate, line~\lnref{release:s} releases +Otherwise, \clnref{addglb} adds \co{delta} to the global counter, +\clnref{balance} +spreads counts to the local state if appropriate, \clnref{release:s} releases \co{gblcnt_mutex} (again, as noted earlier), and finally, -line~\lnref{return:ss} +\clnref{return:ss} returns success. \end{fcvref} \begin{fcvref}[ln:count:count_lim_atomic:add_sub:sub] \Clnrefrange{b}{e} of -Listing~\ref{lst:count:Atomic Limit Counter Add and Subtract} +\cref{lst:count:Atomic Limit Counter Add and Subtract} show \co{sub_count()}, which is structured similarly to \co{add_count()}, having a fastpath on \clnrefrange{fast:b}{fast:e} and a slowpath on @@ -2106,15 +2106,15 @@ the reader. \end{listing} \begin{fcvref}[ln:count:count_lim_atomic:read] -Listing~\ref{lst:count:Atomic Limit Counter Read} shows \co{read_count()}. -Line~\lnref{acquire} acquires \co{gblcnt_mutex} and -line~\lnref{release} releases it. -Line~\lnref{initsum} initializes local variable \co{sum} to the value of +\Cref{lst:count:Atomic Limit Counter Read} shows \co{read_count()}. +\Clnref{acquire} acquires \co{gblcnt_mutex} and +\clnref{release} releases it. +\Clnref{initsum} initializes local variable \co{sum} to the value of \co{globalcount}, and the loop spanning \clnrefrange{loop:b}{loop:e} adds the per-thread counters to this sum, isolating each per-thread counter -using \co{split_counterandmax} on line~\lnref{split}. -Finally, line~\lnref{return} returns the sum. +using \co{split_counterandmax} on \clnref{split}. +Finally, \clnref{return} returns the sum. 
\end{fcvref} \begin{listing} @@ -2142,7 +2142,7 @@ The code for \co{globalize_count()} is shown on \clnrefrange{b}{e} of \cref{lst:count:Atomic Limit Counter Utility Functions 1}, and is similar to that of previous algorithms, with the addition of -line~\lnref{split}, which is now required to split out \co{counter} and +\clnref{split}, which is now required to split out \co{counter} and \co{countermax} from \co{counterandmax}. \end{fcvref} @@ -2150,23 +2150,23 @@ line~\lnref{split}, which is now required to split out \co{counter} and The code for \co{flush_local_count()}, which moves all threads' local counter state to the global counter, is shown on \clnrefrange{b}{e}. -Line~\lnref{checkrsv} checks to see if the value of +\Clnref{checkrsv} checks to see if the value of \co{globalreserve} permits -any per-thread counts, and, if not, line~\lnref{return:n} returns. -Otherwise, line~\lnref{initzero} initializes local variable \co{zero} to a combined +any per-thread counts, and, if not, \clnref{return:n} returns. +Otherwise, \clnref{initzero} initializes local variable \co{zero} to a combined zeroed \co{counter} and \co{countermax}. The loop spanning \clnrefrange{loop:b}{loop:e} sequences through each thread. -Line~\lnref{checkp} checks to see if the current thread has counter state, +\Clnref{checkp} checks to see if the current thread has counter state, and, if so, \clnrefrange{atmxchg}{glbrsv} move that state to the global counters. -Line~\lnref{atmxchg} atomically fetches the current thread's state +\Clnref{atmxchg} atomically fetches the current thread's state while replacing it with zero. -Line~\lnref{split} splits this state into its \co{counter} +\Clnref{split} splits this state into its \co{counter} (in local variable \co{c}) and \co{countermax} (in local variable \co{cm}) components. -Line~\lnref{glbcnt} adds this thread's \co{counter} to \co{globalcount}, while -line~\lnref{glbrsv} subtracts this thread's \co{countermax} from \co{globalreserve}. 
+\Clnref{glbcnt} adds this thread's \co{counter} to \co{globalcount}, while +\clnref{glbrsv} subtracts this thread's \co{countermax} from \co{globalreserve}. \end{fcvref} \QuickQuizSeries{% @@ -2174,8 +2174,8 @@ line~\lnref{glbrsv} subtracts this thread's \co{countermax} from \co{globalreser What stops a thread from simply refilling its \co{counterandmax} variable immediately after \co{flush_local_count()} on - line~\ref{ln:count:count_lim_atomic:utility1:flush:b} of - Listing~\ref{lst:count:Atomic Limit Counter Utility Functions 1} + \clnrefr{ln:count:count_lim_atomic:utility1:flush:b} of + \cref{lst:count:Atomic Limit Counter Utility Functions 1} empties it? }\QuickQuizAnswerB{ This other thread cannot refill its \co{counterandmax} @@ -2192,8 +2192,8 @@ line~\lnref{glbrsv} subtracts this thread's \co{countermax} from \co{globalreser \co{add_count()} or \co{sub_count()} from interfering with the \co{counterandmax} variable while \co{flush_local_count()} is accessing it on - line~\ref{ln:count:count_lim_atomic:utility1:flush:atmxchg} of - Listing~\ref{lst:count:Atomic Limit Counter Utility Functions 1}? + \clnrefr{ln:count:count_lim_atomic:utility1:flush:atmxchg} of + \cref{lst:count:Atomic Limit Counter Utility Functions 1}? }\QuickQuizAnswerE{ Nothing. Consider the following three cases: @@ -2220,23 +2220,23 @@ line~\lnref{glbrsv} subtracts this thread's \co{countermax} from \co{globalreser \begin{fcvref}[ln:count:count_lim_atomic:utility2] \Clnrefrange{balance:b}{balance:e} on -Listing~\ref{lst:count:Atomic Limit Counter Utility Functions 2} +\cref{lst:count:Atomic Limit Counter Utility Functions 2} show the code for \co{balance_count()}, which refills the calling thread's local \co{counterandmax} variable. This function is quite similar to that of the preceding algorithms, with changes required to handle the merged \co{counterandmax} variable. 
Detailed analysis of the code is left as an exercise for the reader, as it is with the \co{count_register_thread()} function starting on -line~\lnref{register:b} and the \co{count_unregister_thread()} function starting on -line~\lnref{unregister:b}. +\clnref{register:b} and the \co{count_unregister_thread()} function starting on +\clnref{unregister:b}. \end{fcvref} \QuickQuiz{ Given that the \co{atomic_set()} primitive does a simple store to the specified \co{atomic_t}, how can - line~\ref{ln:count:count_lim_atomic:utility2:balance:atmcset} of + \clnrefr{ln:count:count_lim_atomic:utility2:balance:atmcset} of \co{balance_count()} in - Listing~\ref{lst:count:Atomic Limit Counter Utility Functions 2} + \cref{lst:count:Atomic Limit Counter Utility Functions 2} work correctly in face of concurrent \co{flush_local_count()} updates to this variable? }\QuickQuizAnswer{ @@ -2281,7 +2281,7 @@ Even though per-thread state will now be manipulated only by the corresponding thread, there will still need to be synchronization with the signal handlers. This synchronization is provided by the state machine shown in -Figure~\ref{fig:count:Signal-Theft State Machine}. +\cref{fig:count:Signal-Theft State Machine}. \begin{figure} \centering @@ -2319,7 +2319,7 @@ The slowpath then sets that thread's \co{theft} state to IDLE\@. \QuickQuizSeries{% \QuickQuizB{ - In Figure~\ref{fig:count:Signal-Theft State Machine}, why is + In \cref{fig:count:Signal-Theft State Machine}, why is the REQ \co{theft} state colored red? }\QuickQuizAnswerB{ To indicate that only the fastpath is permitted to change the @@ -2329,7 +2329,7 @@ The slowpath then sets that thread's \co{theft} state to IDLE\@. }\QuickQuizEndB % \QuickQuizE{ - In Figure~\ref{fig:count:Signal-Theft State Machine}, what is + In \cref{fig:count:Signal-Theft State Machine}, what is the point of having separate REQ and ACK \co{theft} states? Why not simplify the state machine by collapsing them into a single REQACK state? 
@@ -2375,7 +2375,7 @@ The slowpath then sets that thread's \co{theft} state to IDLE\@. \label{sec:count:Signal-Theft Limit Counter Implementation} \begin{fcvref}[ln:count:count_lim_sig:data] -Listing~\ref{lst:count:Signal-Theft Limit Counter Data} +\Cref{lst:count:Signal-Theft Limit Counter Data} (\path{count_lim_sig.c}) shows the data structures used by the signal-theft based counter implementation. @@ -2384,7 +2384,7 @@ for the per-thread theft state machine described in the preceding section. \Clnrefrange{var:b}{var:e} are similar to earlier implementations, with the addition of -lines~\lnref{maxp} and~\lnref{theftp} to allow remote access to a +\clnref{maxp,theftp} to allow remote access to a thread's \co{countermax} and \co{theft} variables, respectively. \end{fcvref} @@ -2396,7 +2396,7 @@ and \co{theft} variables, respectively. \end{listing} \begin{fcvref}[ln:count:count_lim_sig:migration:globalize] -Listing~\ref{lst:count:Signal-Theft Limit Counter Value-Migration Functions} +\Cref{lst:count:Signal-Theft Limit Counter Value-Migration Functions} shows the functions responsible for migrating counts between per-thread variables and the global variables. \Clnrefrange{b}{e} show \co{globalize_count()}, @@ -2407,14 +2407,14 @@ implementations. \Clnrefrange{b}{e} show \co{flush_local_count_sig()}, which is the signal handler used in the theft process. -Lines~\lnref{check:REQ} and~\lnref{return:n} check to see if +\Clnref{check:REQ,return:n} check to see if the \co{theft} state is REQ, and, if not returns without change. -Line~\lnref{mb:1} executes a memory barrier to ensure that the sampling of the +\Clnref{mb:1} executes a memory barrier to ensure that the sampling of the theft variable happens before any change to that variable. 
-Line~\lnref{set:ACK} sets the \co{theft} state to ACK, and, if -line~\lnref{check:fast} sees that -this thread's fastpaths are not running, line~\lnref{set:READY} sets the \co{theft} +\Clnref{set:ACK} sets the \co{theft} state to ACK, and, if +\clnref{check:fast} sees that +this thread's fastpaths are not running, \clnref{set:READY} sets the \co{theft} state to READY\@. \end{fcvref} @@ -2432,8 +2432,8 @@ state to READY\@. \co{theft} per-thread variable? }\QuickQuizAnswer{ \begin{fcvref}[ln:count:count_lim_sig:migration:flush_sig] - The first one (on line~\lnref{check:REQ}) can be argued to be unnecessary. - The last two (lines~\lnref{set:ACK} and~\lnref{set:READY}) are important. + The first one (on \clnref{check:REQ}) can be argued to be unnecessary. + The last two (\clnref{set:ACK,set:READY}) are important. If these are removed, the compiler would be within its rights to rewrite \clnrefrange{set:ACK}{set:READY} as follows: \end{fcvref} @@ -2456,20 +2456,20 @@ slowpath to flush all threads' local counts. The loop spanning \clnrefrange{loop:b}{loop:e} advances the \co{theft} state for each thread that has local count, and also sends that thread a signal. -Line~\lnref{skip} skips any non-existent threads. -Otherwise, line~\lnref{checkmax} checks to see if the current thread holds any local -count, and, if not, line~\lnref{READY} sets the thread's \co{theft} state to READY -and line~\lnref{next} skips to the next thread. -Otherwise, line~\lnref{REQ} sets the thread's \co{theft} state to REQ and -line~\lnref{signal} sends the thread a signal. +\Clnref{skip} skips any non-existent threads. +Otherwise, \clnref{checkmax} checks to see if the current thread holds any local +count, and, if not, \clnref{READY} sets the thread's \co{theft} state to READY +and \clnref{next} skips to the next thread. +Otherwise, \clnref{REQ} sets the thread's \co{theft} state to REQ and +\clnref{signal} sends the thread a signal. 
\end{fcvref} \QuickQuizSeries{% \QuickQuizB{ - In Listing~\ref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}, + In \cref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}, why is it safe for - line~\ref{ln:count:count_lim_sig:migration:flush:checkmax} - to directly access the other thread's + \clnrefr{ln:count:count_lim_sig:migration:flush:checkmax} + to directly access the other thread's \co{countermax} variable? }\QuickQuizAnswerB{ Because the other thread is not permitted to change the value @@ -2482,16 +2482,16 @@ line~\lnref{signal} sends the thread a signal. }\QuickQuizEndB % \QuickQuizM{ - In Listing~\ref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}, + In \cref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}, why doesn't - line~\ref{ln:count:count_lim_sig:migration:flush:signal} - check for the current thread sending itself + \clnrefr{ln:count:count_lim_sig:migration:flush:signal} + check for the current thread sending itself a signal? }\QuickQuizAnswerM{ There is no need for an additional check. The caller of \co{flush_local_count()} has already invoked \co{globalize_count()}, so the check on - line~\ref{ln:count:count_lim_sig:migration:flush:checkmax} + \clnrefr{ln:count:count_lim_sig:migration:flush:checkmax} will have succeeded, skipping the later \co{pthread_kill()}. }\QuickQuizEndM % @@ -2516,18 +2516,18 @@ then steals that thread's count. and the loop spanning \clnrefrange{loop3:b}{loop3:e} waits until the current thread's \co{theft} state becomes READY\@. -Line~\lnref{block} blocks for a millisecond to avoid priority-inversion problems, -and if line~\lnref{check:REQ} determines that the thread's signal has not yet arrived, -line~\lnref{signal2} resends the signal. 
-Execution reaches line~\lnref{thiev:b} when the thread's \co{theft} state becomes +\Clnref{block} blocks for a millisecond to avoid priority-inversion problems, +and if \clnref{check:REQ} determines that the thread's signal has not yet arrived, +\clnref{signal2} resends the signal. +Execution reaches \clnref{thiev:b} when the thread's \co{theft} state becomes READY, so \clnrefrange{thiev:b}{thiev:e} do the thieving. -Line~\lnref{IDLE} then sets the thread's \co{theft} state back to IDLE\@. +\Clnref{IDLE} then sets the thread's \co{theft} state back to IDLE\@. \end{fcvref} \QuickQuiz{ - In Listing~\ref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}, - why does line~\ref{ln:count:count_lim_sig:migration:flush:signal2} - resend the signal? + In \cref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}, + why does \clnrefr{ln:count:count_lim_sig:migration:flush:signal2} + resend the signal? }\QuickQuizAnswer{ Because many operating systems over several decades have had the property of losing the occasional signal. @@ -2557,34 +2557,34 @@ earlier examples. \end{listing} \begin{fcvref}[ln:count:count_lim_sig:add] -Listing~\ref{lst:count:Signal-Theft Limit Counter Add Function} +\Cref{lst:count:Signal-Theft Limit Counter Add Function} shows the \co{add_count()} function. The fastpath spans \clnrefrange{fast:b}{return:fs}, and the slowpath \clnrefrange{acquire}{return:ss}. -Line~\lnref{fast:b} sets the per-thread \co{counting} variable to 1 so that +\Clnref{fast:b} sets the per-thread \co{counting} variable to 1 so that any subsequent signal handlers interrupting this thread will set the \co{theft} state to ACK rather than READY, allowing this fastpath to complete properly. -Line~\lnref{barrier:1} prevents the compiler from reordering any of the fastpath body +\Clnref{barrier:1} prevents the compiler from reordering any of the fastpath body to precede the setting of \co{counting}. 
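The signal handler's state transitions can be sketched as follows. This is only the transition logic, not the book's `count_lim_sig.c`: it is invoked directly rather than by `pthread_kill()`, `smp_mb()` is modeled by a full atomic fence, and `READ_ONCE()`/`WRITE_ONCE()` are modeled by relaxed atomic loads and stores.

```c
/* Theft states from the text's state machine. */
enum theft_state { THEFT_IDLE, THEFT_REQ, THEFT_ACK, THEFT_READY };

static __thread int theft = THEFT_IDLE;
static __thread int counting;   /* nonzero while this thread's fastpath runs */

/* Sketch of flush_local_count_sig(): acknowledge a theft request,
 * and hand over the count immediately if no fastpath is running. */
static void flush_local_count_sig(void)
{
	if (__atomic_load_n(&theft, __ATOMIC_RELAXED) != THEFT_REQ)
		return;                 /* no theft requested: do nothing */
	__atomic_thread_fence(__ATOMIC_SEQ_CST); /* models smp_mb() */
	__atomic_store_n(&theft, THEFT_ACK, __ATOMIC_RELAXED);
	if (!counting)                  /* fastpath not running? */
		__atomic_store_n(&theft, THEFT_READY, __ATOMIC_RELAXED);
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
}
```

When `counting` is set, the handler stops at ACK and leaves the REQ-to-READY transition to the interrupted fastpath, exactly as the state machine requires.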
-Lines~\lnref{check:b} and~\lnref{check:e} check to see +\Clnref{check:b,check:e} check to see if the per-thread data can accommodate the \co{add_count()} and if there is no ongoing theft in progress, -and if so line~\lnref{add:f} does the fastpath addition and -line~\lnref{fasttaken} notes that +and if so \clnref{add:f} does the fastpath addition and +\clnref{fasttaken} notes that the fastpath was taken. -In either case, line~\lnref{barrier:2} prevents the compiler from reordering the -fastpath body to follow line~\lnref{clearcnt}, which permits any subsequent signal +In either case, \clnref{barrier:2} prevents the compiler from reordering the +fastpath body to follow \clnref{clearcnt}, which permits any subsequent signal handlers to undertake theft. -Line~\lnref{barrier:3} again disables compiler reordering, and then -line~\lnref{check:ACK} +\Clnref{barrier:3} again disables compiler reordering, and then +\clnref{check:ACK} checks to see if the signal handler deferred the \co{theft} -state-change to READY, and, if so, line~\lnref{mb} executes a memory -barrier to ensure that any CPU that sees line~\lnref{READY} setting state to -READY also sees the effects of line~\lnref{add:f}. -If the fastpath addition at line~\lnref{add:f} was executed, then -line~\lnref{return:fs} returns +state-change to READY, and, if so, \clnref{mb} executes a memory +barrier to ensure that any CPU that sees \clnref{READY} setting state to +READY also sees the effects of \clnref{add:f}. +If the fastpath addition at \clnref{add:f} was executed, then +\clnref{return:fs} returns success. \end{fcvref} @@ -2595,17 +2595,17 @@ success. \end{listing} \begin{fcvref}[ln:count:count_lim_sig:add] -Otherwise, we fall through to the slowpath starting at line~\lnref{acquire}. +Otherwise, we fall through to the slowpath starting at \clnref{acquire}. The structure of the slowpath is similar to those of earlier examples, so its analysis is left as an exercise to the reader. 
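The interplay between `counting`, the compiler barriers, and the deferred ACK-to-READY transition can be sketched as the following fastpath-only variant of `add_count()`. The `gblcnt_mutex`-protected slowpath is omitted, `barrier()` and `smp_mb()` are modeled with GCC primitives, and the per-thread variables use `__thread` storage; this is an illustration, not the book's `count_lim_sig.c`.

```c
enum theft_state { THEFT_IDLE, THEFT_REQ, THEFT_ACK, THEFT_READY };

static __thread int theft = THEFT_IDLE;
static __thread int counting;
static __thread unsigned long counter;
static __thread unsigned long countermax;

#define barrier() __asm__ __volatile__("" ::: "memory")
#define smp_mb() __atomic_thread_fence(__ATOMIC_SEQ_CST)

/* Return 1 on fastpath success, 0 if the slowpath would be needed. */
static int add_count_fast(unsigned long delta)
{
	int fastpath = 0;

	counting = 1;           /* make the signal handler stop at ACK */
	barrier();
	if (countermax - counter >= delta &&
	    __atomic_load_n(&theft, __ATOMIC_RELAXED) <= THEFT_REQ) {
		counter += delta;
		fastpath = 1;
	}
	barrier();
	counting = 0;           /* theft may now proceed */
	barrier();
	if (__atomic_load_n(&theft, __ATOMIC_RELAXED) == THEFT_ACK) {
		smp_mb();       /* make our update visible before READY */
		__atomic_store_n(&theft, THEFT_READY, __ATOMIC_RELAXED);
	}
	return fastpath;
}
```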
\end{fcvref} Similarly, the structure of \co{sub_count()} on -Listing~\ref{lst:count:Signal-Theft Limit Counter Subtract Function} +\cref{lst:count:Signal-Theft Limit Counter Subtract Function} is the same as that of \co{add_count()}, so the analysis of \co{sub_count()} is also left as an exercise for the reader, as is the analysis of \co{read_count()} in -Listing~\ref{lst:count:Signal-Theft Limit Counter Read Function}. +\cref{lst:count:Signal-Theft Limit Counter Read Function}. \begin{listing} \input{CodeSamples/count/count_lim_sig@xxxxxxxxxxxxxxxxxx} @@ -2615,7 +2615,7 @@ Listing~\ref{lst:count:Signal-Theft Limit Counter Read Function}. \begin{fcvref}[ln:count:count_lim_sig:initialization:init] \Clnrefrange{b}{e} of -Listing~\ref{lst:count:Signal-Theft Limit Counter Initialization Functions} +\cref{lst:count:Signal-Theft Limit Counter Initialization Functions} show \co{count_init()}, which set up \co{flush_local_count_sig()} as the signal handler for \co{SIGUSR1}, enabling the \co{pthread_kill()} calls in \co{flush_local_count()} @@ -2647,7 +2647,7 @@ them both on the system that your application is to be deployed on. the read side to be fast? }\QuickQuizAnswer{ One approach is to use the techniques shown in - Section~\ref{sec:count:Eventually Consistent Implementation}, + \cref{sec:count:Eventually Consistent Implementation}, summarizing an approximation to the overall counter value in a single variable. Another approach would be to use multiple threads to carry @@ -2706,7 +2706,7 @@ counted at full speed. Although a biased counter can be quite helpful and useful, it is only a partial solution to the removable I/O device access-count problem called out on -page~\pageref{chp:Counting}. +\cpageref{chp:Counting}. When attempting to remove a device, we must not only know the precise number of current I/O accesses, we also need to prevent any future accesses from starting. 
@@ -2731,16 +2731,16 @@ if (removing) { \lnlbl[check]
 \end{fcvlabel}
 
 \begin{fcvref}[ln:count:inline:I/O]
-Line~\lnref{acq} read-acquires the lock, and either
-line~\lnref{rel1} or~\lnref{rel2} releases it.
-Line~\lnref{check} checks to see if the device is being removed, and, if so,
-line~\lnref{rel1} releases the lock and
-line~\lnref{cancel} cancels the I/O, or takes whatever
+\Clnref{acq} read-acquires the lock, and either
+\clnref{rel1} or~\clnref{rel2} releases it.
+\Clnref{check} checks to see if the device is being removed, and, if so,
+\clnref{rel1} releases the lock and
+\clnref{cancel} cancels the I/O, or takes whatever
 action is appropriate given that the device is to be removed.
-Otherwise, line~\lnref{inc} increments the access count,
-line~\lnref{rel2} releases the
-lock, line~\lnref{do} performs the I/O, and
-line~\lnref{dec} decrements the access count.
+Otherwise, \clnref{inc} increments the access count,
+\clnref{rel2} releases the
+lock, \clnref{do} performs the I/O, and
+\clnref{dec} decrements the access count.
 \end{fcvref}
 
 \QuickQuiz{
@@ -2770,11 +2770,11 @@ remove_device(); \lnlbl[remove]
 \end{fcvlabel}
 
 \begin{fcvref}[ln:count:inline:remove]
-Line~\lnref{acq} write-acquires the lock and
-line~\lnref{rel} releases it.
-Line~\lnref{note} notes that the device is being removed, and the loop spanning
+\Clnref{acq} write-acquires the lock and
+\clnref{rel} releases it.
+\Clnref{note} notes that the device is being removed, and the loop spanning
 \clnrefrange{loop:b}{loop:e} waits for any I/O operations to complete.
-Finally, line~\lnref{remove} does any additional processing needed to prepare for
+Finally, \clnref{remove} does any additional processing needed to prepare for
 device removal.
 \end{fcvref}
 
@@ -2825,12 +2825,12 @@ perform and scale extremely well in certain special cases.
 
 It is well worth reviewing the lessons from these counting algorithms.
To that end, -Section~\ref{sec:count:Parallel Counting Performance} +\cref{sec:count:Parallel Counting Performance} summarizes performance and scalability, -Section~\ref{sec:count:Parallel Counting Specializations} +\cref{sec:count:Parallel Counting Specializations} discusses the need for specialization, and finally, -Section~\ref{sec:count:Parallel Counting Lessons} +\cref{sec:count:Parallel Counting Lessons} enumerates lessons learned and calls attention to later chapters that will expand on these lessons. @@ -2895,7 +2895,7 @@ updates than the array-based implementation and suffers severe \IX{lock contention} when there are many parallel readers. This contention can be addressed using the deferred-processing techniques introduced in -Chapter~\ref{chp:Deferred Processing}, +\cref{chp:Deferred Processing}, as shown on the \path{count_end_rcu.c} row of \cref{tab:count:Statistical/Limit Counter Performance on x86}. Deferred processing also shines on the \path{count_stat_eventual.c} row, @@ -2929,7 +2929,7 @@ courtesy of eventual consistency. ``Use the right tool for the job.'' As can be seen from - Figure~\ref{fig:count:Atomic Increment Scalability on x86}, + \cref{fig:count:Atomic Increment Scalability on x86}, single-variable atomic increment need not apply for any job involving heavy use of parallel updates. In contrast, the algorithms shown in the top half of @@ -2940,7 +2940,7 @@ courtesy of eventual consistency. featuring a single atomically incremented variable that can be read out using a single load, similar to the approach used in - Section~\ref{sec:count:Eventually Consistent Implementation}. + \cref{sec:count:Eventually Consistent Implementation}. }\QuickQuizEndE } @@ -3061,7 +3061,7 @@ This sort of adaptation will become increasingly important as the number of CPUs on mainstream systems continues to increase. 
In short, as discussed in -Chapter~\ref{chp:Hardware and its Habits}, +\cref{chp:Hardware and its Habits}, the laws of physics constrain parallel software just as surely as they constrain mechanical artifacts such as bridges. These constraints force specialization, though in the case of software @@ -3125,13 +3125,13 @@ and partial parallelization in particular in The partially partitioned counting algorithms used locking to guard the global data, and locking is the subject of -Chapter~\ref{chp:Locking}. +\cref{chp:Locking}. In contrast, the partitioned data tended to be fully under the control of the corresponding thread, so that no synchronization whatsoever was required. This \emph{data ownership} will be introduced in -Section~\ref{sec:SMPdesign:Data Ownership} +\cref{sec:SMPdesign:Data Ownership} and discussed in more detail in -Chapter~\ref{chp:Data Ownership}. +\cref{chp:Data Ownership}. Because integer addition and subtraction are extremely cheap compared to typical synchronization operations, achieving reasonable @@ -3144,12 +3144,12 @@ the counting algorithms listed in \cref{tab:count:Statistical/Limit Counter Performance on x86}. Finally, the eventually consistent statistical counter discussed in -Section~\ref{sec:count:Eventually Consistent Implementation} +\cref{sec:count:Eventually Consistent Implementation} showed how deferring activity (in that case, updating the global counter) can provide substantial performance and scalability benefits. This approach allows common case code to use much cheaper synchronization operations than would otherwise be possible. -Chapter~\ref{chp:Deferred Processing} will examine a number of additional +\Cref{chp:Deferred Processing} will examine a number of additional ways that deferral can improve performance, scalability, and even real-time response. @@ -3160,11 +3160,11 @@ Summarizing the summary: \item Partial partitioning, that is, partitioning applied only to common code paths, works almost as well. 
\item Partial partitioning can be applied to code (as in - Section~\ref{sec:count:Statistical Counters}'s statistical + \cref{sec:count:Statistical Counters}'s statistical counters' partitioned updates and non-partitioned reads), but also across time (as in - Section~\ref{sec:count:Approximate Limit Counters}'s and - Section~\ref{sec:count:Exact Limit Counters}'s + \cref{sec:count:Approximate Limit Counters}'s and + \cref{sec:count:Exact Limit Counters}'s limit counters running fast when far from the limit, but slowly when close to the limit). \item Partitioning across time often batches updates locally @@ -3179,7 +3179,7 @@ Summarizing the summary: and scalability, as seen in the \path{count_end.c} row of \cref{tab:count:Statistical/Limit Counter Performance on x86}. \item Judicious use of delay promotes performance and scalability, as - seen in Section~\ref{sec:count:Eventually Consistent Implementation}. + seen in \cref{sec:count:Eventually Consistent Implementation}. \item Parallel performance and scalability is usually a balancing act: Beyond a certain point, optimizing some code paths will degrade others. @@ -3210,15 +3210,15 @@ synchronization operations, and (3)~\emph{weakening} synchronization operations where feasible. As a rough rule of thumb, you should apply these methods in this order, as was noted earlier in the discussion of -Figure~\ref{fig:intro:Ordering of Parallel-Programming Tasks} +\cref{fig:intro:Ordering of Parallel-Programming Tasks} on -page~\pageref{fig:intro:Ordering of Parallel-Programming Tasks}. +\cpageref{fig:intro:Ordering of Parallel-Programming Tasks}. The partitioning optimization applies to the ``Resource Partitioning and Replication'' bubble, the batching optimization to the ``Work Partitioning'' bubble, and the weakening optimization to the ``Parallel Access Control'' bubble, as shown in -Figure~\ref{fig:count:Optimization and the Four Parallel-Programming Tasks}. 
+\cref{fig:count:Optimization and the Four Parallel-Programming Tasks}. Of course, if you are using special-purpose hardware such as digital signal processors (DSPs), field-programmable gate arrays (FPGAs), or general-purpose graphical processing units (GPGPUs), you may need -- 2.17.1