[Question] Quick Quiz 5.17 and cache coherency

Yubin Ruan <ablacktshirt@xxxxxxxxx> · Wed, 22 Feb 2017 21:22:48 +0800

Hi,
I am reading Chapter 5 and am confused by the answer to Quick Quiz
5.17. Here is the array-based per-thread eventually consistent counter
scheme that the quiz refers to:

DEFINE_PER_THREAD(unsigned long, counter);
unsigned long global_count;
int stopflag;

void inc_count(void)
{
    ACCESS_ONCE(__get_thread_var(counter))++;
}

unsigned long read_count(void)
{
    return ACCESS_ONCE(global_count);
}

void *eventual(void *arg)
{
    int t;
    int sum;
    while (stopflag < 3) {
        sum = 0;
        for_each_thread(t)
            sum += ACCESS_ONCE(per_thread(counter, t));
        ACCESS_ONCE(global_count) = sum;
        poll(NULL, 0, 1);
        if (stopflag) {
            smp_mb();
            stopflag++;
        }
    }
    return NULL;
}

void count_init(void)
{
    thread_id_t tid;
    if (pthread_create(&tid, NULL, eventual, NULL)) {
        perror("count_init:pthread_create");
        exit(-1);
    }
}

void count_cleanup(void)
{
    stopflag = 1;
    while (stopflag < 3)
        poll(NULL, 0, 1);
    smp_mb();
}
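
For anyone who wants to try this outside the book's CodeSamples tree,
here is a rough self-contained approximation. It is only a sketch under
my own assumptions: a fixed-size array stands in for DEFINE_PER_THREAD(),
relaxed C11 atomic loads and stores stand in for ACCESS_ONCE(), and the
default sequentially consistent atomics stand in for smp_mb(). The names
NR_THREADS, NR_INCS, updater() and run_eventual_counter() are invented
for illustration and do not appear in the book.

```c
#include <pthread.h>
#include <poll.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_THREADS 4            /* assumed fixed number of updater threads */
#define NR_INCS    100000UL     /* assumed increments per updater */

static atomic_ulong counter[NR_THREADS];  /* stand-in for the per-thread counters */
static atomic_ulong global_count;
static atomic_int stopflag;

/* Only thread "me" writes counter[me], so a plain load/store pair is used. */
static void inc_count(int me)
{
    unsigned long v = atomic_load_explicit(&counter[me], memory_order_relaxed);

    atomic_store_explicit(&counter[me], v + 1, memory_order_relaxed);
}

static unsigned long read_count(void)
{
    return atomic_load_explicit(&global_count, memory_order_relaxed);
}

/* Periodically fold the per-thread counters into global_count. */
static void *eventual(void *arg)
{
    (void)arg;
    while (atomic_load(&stopflag) < 3) {
        unsigned long sum = 0;

        for (int t = 0; t < NR_THREADS; t++)
            sum += atomic_load_explicit(&counter[t], memory_order_relaxed);
        atomic_store_explicit(&global_count, sum, memory_order_relaxed);
        poll(NULL, 0, 1);
        if (atomic_load(&stopflag))
            atomic_fetch_add(&stopflag, 1);
    }
    return NULL;
}

static void *updater(void *arg)
{
    int me = (int)(intptr_t)arg;

    for (unsigned long i = 0; i < NR_INCS; i++)
        inc_count(me);
    return NULL;
}

/* Start everything, wait for the updaters, then shut down as count_cleanup() does. */
unsigned long run_eventual_counter(void)
{
    pthread_t ev, tid[NR_THREADS];

    pthread_create(&ev, NULL, eventual, NULL);
    for (int i = 0; i < NR_THREADS; i++)
        pthread_create(&tid[i], NULL, updater, (void *)(intptr_t)i);
    for (int i = 0; i < NR_THREADS; i++)
        pthread_join(tid[i], NULL);
    atomic_store(&stopflag, 1);
    while (atomic_load(&stopflag) < 3)
        poll(NULL, 0, 1);
    pthread_join(ev, NULL);
    return read_count();
}
```

Calling run_eventual_counter() once from main() should return
NR_THREADS * NR_INCS, because eventual() makes one more full pass over
the counters after it has seen stopflag set.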

I understand the code. In Quick Quiz 5.17, the question is:

    Why _doesn't_ the `inc_count()' in the code above need to use atomic
    instructions? After all, we now have multiple threads accessing the
    per-thread counters!

I think I know the answer to this question: because each thread
increments only its own per-thread variable, no atomic instructions are
needed. Atomic instructions are needed in cases like the following,
where every thread updates the same shared counter:

long counter = 0;
void inc_count(void)
{
    counter++;  /* needs an atomic instruction */
}

long read_count(void)
{
    return counter;
}
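
To make the shared-counter case concrete, here is a minimal sketch,
assuming C11 <stdatomic.h> and POSIX threads rather than the book's own
primitives. The names NTHREADS, NINCS, worker() and run_shared_counter()
are made up for illustration. Every thread increments the same variable,
so the increment is done with an atomic read-modify-write.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4          /* assumed number of incrementing threads */
#define NINCS    100000     /* assumed increments per thread */

static atomic_ulong shared_counter;   /* one counter shared by all threads */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NINCS; i++)
        /* Atomic increment: no update from another thread can be lost. */
        atomic_fetch_add_explicit(&shared_counter, 1UL, memory_order_relaxed);
    return NULL;
}

/* Run NTHREADS workers to completion and return the final count. */
unsigned long run_shared_counter(void)
{
    pthread_t tid[NTHREADS];

    atomic_store(&shared_counter, 0UL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&shared_counter);
}
```

With the atomic increment, run_shared_counter() returns exactly
NTHREADS * NINCS. If the increment were a plain counter++ on a
non-atomic variable, concurrent increments could overwrite each other
and the total could come up short.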

But, the answer provided in the book is:

<------------------- Answer Begin ---------------------->
    Because one of the two threads only reads, and because
the variable is aligned and machine-sized, non-atomic instructions
suffice. That said, the ACCESS_ONCE() macro is used to prevent
compiler optimizations that might otherwise prevent the counter
updates from becoming visible to eventual() [Cor12].

    An older version of this algorithm did in fact use atomic
instructions, kudos to Ersoy Bayramoglu for pointing
out that they are in fact unnecessary. That said, atomic
                                      ~~~~~~~~~~~~~~~~~
instructions would be needed in cases where the per-thread
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
counter variables were smaller than the global global_count.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
However, note that on a 32-bit system, the per-thread counter
variables might need to be limited to 32 bits in order to sum
them accurately, but with a 64-bit global_count variable to avoid
overflow. In this case, it is necessary to zero the per-thread
counter variables periodically in order to avoid overflow. It is
extremely important to note that this zeroing cannot be delayed too long
or overflow of the smaller per-thread variables will result. This
approach therefore imposes real-time requirements on the underlying
system, and in turn must be used with extreme care.

    In contrast, if all variables are the same size, overflow
of any variable is harmless because the eventual sum will
be modulo the word size.
<------------------------ End -------------------------->
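
The last paragraph of the answer (the modulo-word-size remark) can be
checked with a little arithmetic. This is only my own sketch, assuming
32-bit per-thread counters; wrap32(), sum_same_size() and
sum_wider_total() are hypothetical helpers, not code from the book.

```c
#include <stdint.h>

/* What a 32-bit per-thread counter holds after true_count increments. */
uint32_t wrap32(uint64_t true_count)
{
    return (uint32_t)true_count;
}

/* Same-size total: the result equals the true total modulo 2^32. */
uint32_t sum_same_size(uint32_t a, uint32_t b)
{
    return (uint32_t)(a + b);
}

/* Wider total: each earlier wrap of a 32-bit counter silently loses 2^32. */
uint64_t sum_wider_total(uint32_t a, uint32_t b)
{
    return (uint64_t)a + (uint64_t)b;
}
```

For example, suppose one thread has really counted 2^32 + 5 events and
another has counted 10. The 32-bit counters then hold 5 and 10. The
same-size sum is 15, which is the true total modulo 2^32, but the 64-bit
sum is also 15 and not 2^32 + 15, so the wider total is simply wrong
unless the small counters are zeroed before they can wrap.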

Although it is more complicated than I expected, I am fine with this
answer, except for the sentence marked with ~~~ underneath. Why is that?
Why do we need atomic instructions when the per-thread counter variables
are smaller than the global `global_count'? Also, the second sentence of
the question seems to hint at cache coherency, but I cannot see the
point :(

regards,
Yubin Ruan