On 02/24/2017 06:16 AM, Akira Yokosawa wrote: > On 2017/02/24 01:02:21 +0800, Yubin Ruan wrote: >> On 02/23/2017 10:33 PM, Akira Yokosawa wrote: >>> On 2017/02/23 22:44:41 +0900, Akira Yokosawa wrote: >>>> On 2017/02/22 21:22:48 +0800, Yubin Ruan wrote: >>>>> Hi, >>>>> I am reading chapter 5 and got confused by the answer to Quick Quiz >>>>> 5.17. So this is an Array-Based Per-thread Eventually Consistent Counter >>>>> scheme: >>>>> >>>>> 1 DEFINE_PER_THREAD(unsigned long, counter); >>>>> 2 unsigned long global_count; >>>>> 3 int stopflag; >>>>> 4 >>>>> 5 void inc_count(void) >>>>> 6 { >>>>> 7 ACCESS_ONCE(__get_thread_var(counter))++; >>>>> 8 } >>>>> 9 >>>>> 10 unsigned long read_count(void) >>>>> 11 { >>>>> 12 return ACCESS_ONCE(global_count); >>>>> 13 } >>>>> 14 >>>>> 15 void *eventual(void *arg) >>>>> 16 { >>>>> 17 int t; >>>>> 18 int sum; >>>>> 19 while (stopflag < 3) { >>>>> 20 sum = 0; >>>>> 21 for_each_thread(t) >>>>> 22 sum += ACCESS_ONCE(per_thread(counter, t)); >>>>> 23 ACCESS_ONCE(global_count) = sum; >>>>> 24 poll(NULL, 0, 1); >>>>> 25 if (stopflag) { >>>>> 26 smp_mb(); >>>>> 27 stopflag++; >>>>> 28 } >>>>> 29 } >>>>> 30 return NULL; >>>>> 31 } >>>>> 32 >>>>> 33 void count_init(void) >>>>> 34 { >>>>> 35 thread_id_t tid; >>>>> 36 if (pthread_create(&tid, NULL, eventual, NULL)) { >>>>> 37 perror("count_init:pthread_create"); >>>>> 38 exit(-1); >>>>> 39 } >>>>> 40 } >>>>> 41 >>>>> 42 void count_cleanup(void) >>>>> 43 { >>>>> 44 stopflag = 1; >>>>> 45 while (stopflag < 3) >>>>> 46 poll(NULL, 0, 1); >>>>> 47 smp_mb(); >>>>> 48 } >>>>> >>>>> >>>>> I understand the code. In Quick Quiz 5.17, the question is: >>>>> >>>>> Why _doesn't_ the `inc_count()' in the code above need to use atomic >>>>> instructions? After all, we now have multiple threads accessing the >>>>> per-thread counters! >>>>> >>>>> I think I know the answer to this question: now that you use per-thread >>>>> variable, you don't need atomic instructions. The scenarios where you >>>>> need atomic instructions are some places like this: >>>>> >>>>> 1 long counter = 0; >>>>> 2 void inc_count(void) >>>>> 3 { >>>>> 4 counter++; //need atomic instruction >>>>> 5 } >>>>> 6 >>>>> 7 long read_count(void) >>>>> 8 { >>>>> 9 return counter; >>>>> 10} >>>>> >>>>> >>>>> But, the answer provided in the book is: >>>>> >>>>> <------------------- Answer Begin ----------------------> >>>>> Because one of the two threads only reads, and because >>>>> the variable is aligned and machine-sized, non-atomic instructions >>>>> suffice. >>>> >>>> Well, I think this sentence explains everything necessary. >>>> Since the per-thread counters are also read-accessed from eventual() thread, >>>> the the increment of the count should look "atomic" from eventual(). >>>> If the variable is *not* machine-sized or unaligned, the increment >>>> access might need multiple read / write accesses and it would be not >>>> "atomic" from other thread's point of view. >>> >>> This was somewhat confusing. Let me try again. >>> >>> Since the per-thread counters are also read-accessed from eventual() thread, >>> the write access of the increment of the count should look "atomic" from eventual(). >>> If the variable is *not* machine-sized nor aligned, the write of the modified value >>> might need multiple read / write accesses and they would not be 'atomic" from >>> other threads' points of view. >>> >>> Incrementing a counter is a Read-Modify-Write access. In per-thread counter, >>> it is not necessary to make the whole RMW access atomic. Only the write of the >>> incremented value needs to look "atomic". >>> >>> By the way, an "unsigned long" variable is machine-sized and aligned on >>> most compiler-platform combination, but there might be exceptions I suppose. >>> >> >> Thanks for your information! >> >>>> >>>> As as side note, you can see the changes made in this Quick Quiz and Answer >>>> in the git history. Actually, the original version (dated 2011-01-01, >>>> git tag:v2011.01.02a) of the Quiz said: >>>> >>>> Why does inc_count() in >>>> Figure 5.? (whatever figure number at that time) >>>> need to use atomic instructions? >>>> >>>> This was the opposite of the current version. >>>> >>>>> That said, the ACCESS_ONCE() macro is used to prevent >>>>> compiler optimizations that might otherwise prevent the counter >>>>> updates from becoming visible to eventual() [Cor12]. >>>>> >>>>> An older version of this algorithm did in fact use atomic >>>>> instructions, kudos to Ersoy Bayramoglu for pointing >>>>> out that they are in fact unnecessary. That said, atomic >>>>> ~~~~~~~~~~~~~~~~~ >>>>> instructions would be needed in cases where the per-thread >>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>>>> counter variables were smaller than the global global_count. >>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>>> >>>> I suppose Paul added this sentence to explain the case where >>>> per-thread counter is shorter than machine-size. >> >> So, you mean, what he intended to say is something like: >> >> would be needed in cases where the per-thread counter >> variables where smaller than machine size >> ? > > Let's see what Paul want to say. (By the way, this message seems private and > I kept this reply private.) sorry I just forgot to `Cc'. > >> But, after all, there is only one thread doing the writing, which mean >> that we don't need any atomic instruction even the variable `counter' is >> not properly aligned. right? > > The point is that "counter" is per-thread and is in an array. > "counter" for another thread is placed next to it in an unaligned way. > > Suppose an architecture which can only do machine-sized, aligned memory accesses. > A write to an unaligned variable will need the following 4 memory accesses: > > 1. Read memory including lower part of the variable > 2. Read memory including higher part of the variable > (Replace the updated part on registers without affecting outside of > the variable) > 3. Write memory including lower part of the variable > 4. Write memory including higher part of the variable > > Can you imagine what happens if another thread also does a write to a neighboring > counter in a racy timing? > > Thanks, Akira > well I didn't expect that. thank you. regards, Yubin Ruan >> >> regards, >> Yubin Ruan >> >>>> >>>> I'm sure Paul will give some comment once he recalls the circumstances ;-) >>> >>> Now added explicit CC: Paul. >>> >>>> >>>> Thanks, Akira >>>> >>>>> However, note that on a 32-bit system, the per-thread counter >>>>> variables might need to be limited to 32 bits in order to sum >>>>> them accurately, but with a 64-bit global_count variable to avoid >>>>> overflow. In this case, it is necessary to zero the per-thread >>>>> counter variables periodically in order to avoid overflow. It is >>>>> extremely important to note that this zeroing cannot be delayed too long >>>>> or overflow of the smaller per-thread variables will result. This >>>>> approach therefore imposes real-time requirements on the underlying >>>>> system, and in turn must be used with extreme care. >>>>> >>>>> In contrast, if all variables are the same size, overflow >>>>> of any variable is harmless because the eventual sum will >>>>> be modulo the word size. >>>>> <------------------------ End --------------------------> >>>>> >>>>> Although more complicated than I think, I totally fine with this answer, >>>>> except the sentence with ~~~ under it. Why is that? Why do we need >>>>> atomic instructions when counter variables were smaller than the global >>>>> `global_count' ? Also, the second sentence of the question seems to hint >>>>> about cache coherency, but I cannot see the point :( >>>>> >>>>> regards, >>>>> Yubin Ruan >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe perfbook" in >>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> >> >> . >> -- To unsubscribe from this list: send the line "unsubscribe perfbook" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html