Re: [PATCH] Fix cThread related race conditions

On 18/02/2023 11:10, Marko Mäkelä wrote:
Wed, Feb 15, 2023 at 05:17:55PM +0100, Klaus Schmidinger wrote:
On 02.02.23 21:56, Patrick Lerda wrote:
...
diff --git a/thread.c b/thread.c
index 93eb8c0..21be7a4 100644
--- a/thread.c
+++ b/thread.c
@@ -312,13 +312,16 @@ bool cThread::Start(void)
               cCondWait::SleepMs(THREAD_STOP_SLEEP);
         }
      if (!active) {
-        active = running = true;
-        if (pthread_create(&childTid, NULL, (void *(*) (void *))&StartThread, (void *)this) == 0) {
-           pthread_detach(childTid); // auto-reap
+        pthread_t localTid;
+        running = true;
+        if (pthread_create(&localTid, NULL, (void *(*) (void *))&StartThread, (void *)this) == 0) {
+           pthread_detach(localTid); // auto-reap
+           childTid = localTid;
+           active = true;
[snip]
+     if ((err = pthread_kill(localTid, 0)) != 0) {
         if (err != ESRCH)
            LOG_ERROR;
-        childTid = 0;
-        active = running = false;
+	if (active && childTid == localTid)

localTid was initialized to childTid 4 lines earlier, so what's with the "childTid == localTid" check here? Isn't this always true?

cThread::childTid may be modified by another thread that is executing
cThread::Action() inside cThread::StartThread().

Thinking aloud: Do we need "bool active", or could "childTid!=0" take its role?

Using childTid == 0 as the indicator is unreliable; it is better to guarantee that childTid is valid whenever active == true.


I see this happens with "address sanitizer". Is there an actual, reproducible, real world problem that this patch fixes?

AddressSanitizer only changes some timing characteristics. It should
not have any direct impact on the possibility of race conditions.

I can agree with your questioning of ThreadSanitizer findings, but I
think that AddressSanitizer needs to be taken seriously.

For a while, in a multi-threaded piece of software that I maintain,
AddressSanitizer seemed to issue bogus errors. The reason was a race
condition around some additional instrumentation that declares memory
inside a custom allocator as "not allocated", via
ASAN_POISON_MEMORY_REGION() and ASAN_UNPOISON_MEMORY_REGION(). Ever
since the code was changed so that we no longer briefly poison
everything and then unpoison the safe bits, but instead only poison
the unsafe bits, -fsanitize=address has only reported real problems.

Are the "atomics" really necessary?

Before C++11, multiple threads were outside the scope of the
standard, in "implementation defined" territory, which is kind of
"not even wrong". Now that C++ (since the 2011 standard) covers
multi-threaded execution, data races are unambiguously "wrong",
that is, "undefined behaviour".

The way the patch uses std::atomic may be overkill. While
std::memory_order_seq_cst (the default) may make little difference on
ISAs that implement a strong memory model (SPARC, IA-32, AMD64), it
can be an unnecessary burden on ARM, POWER, and RISC-V RVWMO.

If we are only interested in a data field itself and not other memory
that it may be "protecting" in some other cache line,
std::memory_order_relaxed should suffice.

If you are running on a single-core or single-socket IA-32 or AMD64
CPU, all this should not make much difference. There could already be
sufficient "compiler barriers" around the code.

Proper use of memory barriers or atomics might fix some rare hangs or
crashes on startup or shutdown on multi-core ARM systems, such as the
Raspberry Pi.

Race conditions do not always lead to crashes; they could lead to
slowness or busy loops as well. Just an example: After I fixed a few
things in the rpihddevice plugin, I can reliably play back or
fast-forward recordings to the very end, with no premature
interruption or excessive delays.

I recommend the following to anyone who is working on multi-threaded
code, especially lock-free data structures and algorithms:

https://en.cppreference.com/w/cpp/atomic/memory_order
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

I find it easier to use a mutex or rw-lock to protect shared data
structures, and to document how the data is being protected. Atomic
memory access operations usually come into play when there are
performance bottlenecks that need to be fixed.

	Marko


I agree with you; another implementation could work as well.

Patrick.

_______________________________________________
vdr mailing list
vdr@xxxxxxxxxxx
https://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr
