On Mon, Oct 28, 2019 at 12:24:36AM +0900, Akira Yokosawa wrote: > >From 0e0d8d79f0416bcbe284d8d191bd76a74f352f14 Mon Sep 17 00:00:00 2001 > From: Akira Yokosawa <akiyks@xxxxxxxxx> > Date: Sun, 27 Oct 2019 22:07:59 +0900 > Subject: [RFC PATCH] count: Merge tables of statistical and limited counter performance > > To help comparison of performance of various counter algorithms, > merge the tables and align data of both statistical and limited > counters. > Also update the wording of references to the merged table. > > Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx> > --- > Hi Paul, > > > Would it make sense to rotate the "Exact?" heading 90 degrees and > > to put the "(ns)" underneath the "Updates"? Those two changes > > might make Figure 5.2 fit horizontally. > > Another idea is to merge those 2 tables and align performance results > vertically, which this RFC patch attempts to do. > > As the "Exact?" column is irrelevant to statistical counters, I made > those cells look blank. > > How does this look to you? Good improvement, queued and pushed, thank you! I couldn't resist trying rotating the "Exact?" heading, though there has to be a better way to do this. Thoughts? (Feel free to provide an alternative, and I will be happy to replace my attempt with it.) Another thing would be to make the "Algorithms" heading be something like "Algorithms: \co{count_*.c}", then trim the leading "count_" and trailing ".c" from each row's first column. Would that make sense? Thanx, Paul > Thanks, Akira > -- > count/count.tex | 96 +++++++++++++++++++------------------------------ > 1 file changed, 37 insertions(+), 59 deletions(-) > > diff --git a/count/count.tex b/count/count.tex > index 6430abd5..cff56d72 100644 > --- a/count/count.tex > +++ b/count/count.tex > @@ -2817,33 +2817,43 @@ will expand on these lessons. > \rowcolors{4}{}{lightgray} > \renewcommand*{\arraystretch}{1.1} > \small > -\centering\OneColumnHSpace{-.25in} > -\begin{tabular}{lrS[table-format=1.1]S[table-format=3.0]S[table-format=4.0] > +\centering\OneColumnHSpace{-.35in} > +\newcommand{\NA}{\cellcolor{white}} > +\begin{tabular}{lrcS[table-format=2.1]S[table-format=3.0]S[table-format=4.0] > S[table-format=6.0]S[table-format=6.0]} > \toprule > - & & & \multicolumn{4}{c}{Reads (ns)} \\ > - \cmidrule{4-7} > - Algorithm & Section & \multicolumn{1}{r}{Updates (ns)} & > + & & & \multicolumn{1}{c}{Updates} & \multicolumn{4}{c}{Reads (ns)} \\ > + \cmidrule{5-8} > + Algorithm & Section & Exact? & \multicolumn{1}{r}{(ns)} & > \multicolumn{1}{r}{1 CPU} & > \multicolumn{1}{r}{8 CPUs} & > \multicolumn{1}{r}{64 CPUs} & > \multicolumn{1}{r}{420 CPUs} \\ > \midrule > - \path{count_stat.c} & \ref{sec:count:Array-Based Implementation} & > + \path{count_stat.c} & \ref{sec:count:Array-Based Implementation} & \NA & > 6.3 & 294 & 303 & 315 & 612 \\ > - \path{count_stat_eventual.c} & \ref{sec:count:Eventually Consistent Implementation} & > + \path{count_stat_eventual.c} & \ref{sec:count:Eventually Consistent Implementation} & \NA & > 6.4 & 1 & 1 & 1 & 1 \\ > - \path{count_end.c} & \ref{sec:count:Per-Thread-Variable-Based Implementation} & > + \path{count_end.c} & \ref{sec:count:Per-Thread-Variable-Based Implementation} & \NA & > 2.9 & 301 & 6 309 & 147 594 & 239 683 \\ > - \path{count_end_rcu.c} & \ref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters} & > + \path{count_end_rcu.c} & \ref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters} & \NA & > 2.9 & 454 & 481 & 508 & 2 317 \\ > + \midrule > + \path{count_lim.c} & \ref{sec:count:Simple Limit Counter Implementation} & > + N & 3.2 & 435 & 6 678 & 156 175 & 239 422 \\ > + \path{count_lim_app.c} & \ref{sec:count:Approximate Limit Counter Implementation} & > + N & 2.4 & 485 & 7 041 & 173 108 & 239 682 \\ > + \path{count_lim_atomic.c} & \ref{sec:count:Atomic Limit Counter Implementation} & > + Y & 19.7 & 513 & 7 085 & 199 957 & 239 450 \\ > + \path{count_lim_sig.c} & \ref{sec:count:Signal-Theft Limit Counter Implementation} & > + Y & 4.7 & 519 & 6 805 & 120 000 & 238 811 \\ > \bottomrule > \end{tabular} > -\caption{Statistical Counter Performance on x86} > -\label{tab:count:Statistical Counter Performance on x86} > +\caption{Statistical/Limit Counter Performance on x86} > +\label{tab:count:Statistical/Limit Counter Performance on x86} > \end{table*} > > -Table~\ref{tab:count:Statistical Counter Performance on x86} > +The top half of \cref{tab:count:Statistical/Limit Counter Performance on x86} > shows the performance of the four parallel statistical counting > algorithms. > All four algorithms provide near-perfect linear scalability for updates. > @@ -2856,13 +2866,13 @@ This contention can be addressed using the deferred-processing > techniques introduced in > Chapter~\ref{chp:Deferred Processing}, > as shown on the \path{count_end_rcu.c} row of > -Table~\ref{tab:count:Statistical Counter Performance on x86}. > +\cref{tab:count:Statistical/Limit Counter Performance on x86}. > Deferred processing also shines on the \path{count_stat_eventual.c} row, > courtesy of eventual consistency. > > \QuickQuiz{} > On the \path{count_stat.c} row of > - Table~\ref{tab:count:Statistical Counter Performance on x86}, > + \cref{tab:count:Statistical/Limit Counter Performance on x86}, > we see that the read-side scales linearly with the number of > threads. > How is that possible given that the more threads there are, > @@ -2878,8 +2888,8 @@ courtesy of eventual consistency. > } \QuickQuizEnd > > \QuickQuiz{} > - Even on the last row of > - Table~\ref{tab:count:Statistical Counter Performance on x86}, > + Even on the fourth row of > + \cref{tab:count:Statistical/Limit Counter Performance on x86}, > the read-side performance of these statistical counter > implementations is pretty horrible. > So why bother with them? > @@ -2890,8 +2900,8 @@ courtesy of eventual consistency. > Figure~\ref{fig:count:Atomic Increment Scalability on x86}, > single-variable atomic increment need not apply for any job > involving heavy use of parallel updates. > - In contrast, the algorithms shown in > - Table~\ref{tab:count:Statistical Counter Performance on x86} > + In contrast, the algorithms shown in the top half of > + \cref{tab:count:Statistical/Limit Counter Performance on x86} > do an excellent job of handling update-heavy situations. > Of course, if you have a read-mostly situation, you should > use something else, for example, an eventually consistent design > @@ -2901,37 +2911,7 @@ courtesy of eventual consistency. > Section~\ref{sec:count:Eventually Consistent Implementation}. > } \QuickQuizEnd > > -\begin{table*} > -\rowcolors{4}{}{lightgray} > -\renewcommand*{\arraystretch}{1.1} > -\small > -\centering\OneColumnHSpace{-.4in} > -\begin{tabular}{lrcS[table-format=2.1]S[table-format=3.0]S[table-format=4.0] > - S[table-format=6.0]S[table-format=6.0]} > - \toprule > - & & & & \multicolumn{4}{c}{Reads (ns)} \\ > - \cmidrule{5-8} > - Algorithm & Section & Exact? & \multicolumn{1}{r}{Updates (ns)} & > - \multicolumn{1}{r}{1 CPU} & > - \multicolumn{1}{r}{8 CPUs} & > - \multicolumn{1}{r}{64 CPUs} & > - \multicolumn{1}{r}{420 CPUs} \\ > - \midrule > - \path{count_lim.c} & \ref{sec:count:Simple Limit Counter Implementation} & > - N & 3.2 & 435 & 6 678 & 156 175 & 239 422 \\ > - \path{count_lim_app.c} & \ref{sec:count:Approximate Limit Counter Implementation} & > - N & 2.4 & 485 & 7 041 & 173 108 & 239 682 \\ > - \path{count_lim_atomic.c} & \ref{sec:count:Atomic Limit Counter Implementation} & > - Y & 19.7 & 513 & 7 085 & 199 957 & 239 450 \\ > - \path{count_lim_sig.c} & \ref{sec:count:Signal-Theft Limit Counter Implementation} & > - Y & 4.7 & 519 & 6 805 & 120 000 & 238 811 \\ > - \bottomrule > -\end{tabular} > -\caption{Limit Counter Performance on x86} > -\label{tab:count:Limit Counter Performance on x86} > -\end{table*} > - > -Table~\ref{tab:count:Limit Counter Performance on x86} > +The bottom half of \cref{tab:count:Statistical/Limit Counter Performance on x86} > shows the performance of the parallel limit-counting algorithms. > Exact enforcement of the limits incurs a substantial performance > penalty, although on this x86 system that penalty can be reduced > @@ -2940,8 +2920,8 @@ All of these implementations suffer from read-side lock contention > in the face of concurrent readers. > > \QuickQuiz{} > - Given the performance data shown in > - Table~\ref{tab:count:Limit Counter Performance on x86}, > + Given the performance data shown in the bottom half of > + \cref{tab:count:Statistical/Limit Counter Performance on x86}, > we should always prefer signals over atomic operations, right? > \QuickQuizAnswer{ > That depends on the workload. > @@ -2962,8 +2942,8 @@ in the face of concurrent readers. > > \QuickQuiz{} > Can advanced techniques be applied to address the lock > - contention for readers seen in > - Table~\ref{tab:count:Limit Counter Performance on x86}? > + contention for readers seen in the bottom half of > + \cref{tab:count:Statistical/Limit Counter Performance on x86}? > \QuickQuizAnswer{ > One approach is to give up some update-side performance, as is > done with scalable non-zero indicators > @@ -3126,8 +3106,7 @@ operations, so that a great many of these cheap operations are handled > by a single synchronization operation. > Batching optimizations of one sort or another are used by each of > the counting algorithms listed in > -Tables~\ref{tab:count:Statistical Counter Performance on x86} > -and~\ref{tab:count:Limit Counter Performance on x86}. > +\cref{tab:count:Statistical/Limit Counter Performance on x86}. > > Finally, the eventually consistent statistical counter discussed in > Section~\ref{sec:count:Eventually Consistent Implementation} > @@ -3158,20 +3137,19 @@ Summarizing the summary: > thereby decreasing synchronization overhead, in turn > improving performance and scalability. > All the algorithms shown in > - Tables~\ref{tab:count:Statistical Counter Performance on x86} > - and~\ref{tab:count:Limit Counter Performance on x86} > + \cref{tab:count:Statistical/Limit Counter Performance on x86} > make heavy use of batching. > \item Read-only code paths should remain read-only: Spurious > synchronization writes to shared memory kill performance > and scalability, as seen in the \path{count_end.c} row of > - Table~\ref{tab:count:Statistical Counter Performance on x86}. > + \cref{tab:count:Statistical/Limit Counter Performance on x86}. > \item Judicious use of delay promotes performance and scalability, as > seen in Section~\ref{sec:count:Eventually Consistent Implementation}. > \item Parallel performance and scalability is usually a balancing act: > Beyond a certain point, optimizing some code paths will degrade > others. > The \path{count_stat.c} and \path{count_end_rcu.c} rows of > - Table~\ref{tab:count:Statistical Counter Performance on x86} > + \cref{tab:count:Statistical/Limit Counter Performance on x86} > illustrate this point. > \item Different levels of performance and scalability will affect > algorithm and data-structure design, as do a large number of > -- > 2.17.1 > >