On Tue, 2021-03-02 at 20:44 +0800, lixiaokeng wrote:
> 
> > Note that unlike all other threads, TUR threads are _detached_ threads.
> > multipathd tries to cancel them, but it has no way to verify that they
> > actually stopped. It may be just a normal observation that you can't
> > see the messages when a TUR thread terminates, in particular if the
> > program is exiting and might have already closed the stderr file
> > descriptor.
> > 
> > If you look at the crashed processes with gdb, the thread IDs should
> > give you some clue which stack belongs to which thread. The TUR threads
> > will have higher thread IDs than the others because they are started
> > later.
> 
> ??
> 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00007f3f669e071d in ?? ()
> [Current thread is 1 (Thread 0x7f3f65873700 (LWP 1645593))]
> (gdb) i thread
>   Id   Target Id                             Frame
> * 1    Thread 0x7f3f65873700 (LWP 1645593)  0x00007f3f669e071d in ?? ()
>   2    Thread 0x7f3f6611a000 (LWP 1645066)  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78
>   3    Thread 0x7f3f6609d700 (LWP 1645095)  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> (gdb) bt
> #0  0x00007f3f669e071d in ?? ()
> #1  0x0000000000000000 in ?? ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f3f6611a000 (LWP 1645066))]
> #0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78
> 78      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
> (gdb) bt
> #0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78
> #1  0x00007f3f669fb77d in _dl_unmap_segments (l=l@entry=0x557cb432ba10) at ./dl-unmap-segments.h:32
> ...
> #10 0x00007f3f669b44ed in cleanup_prio () at prio.c:66      //cleanup_checkers() is finished.
> #11 0x0000557cb26db794 in child (param=<optimized out>) at main.c:2932
> #12 0x0000557cb26d44d3 in main (argc=<optimized out>, argv=0x7ffc98d47948) at main.c:3150
> 
> 
> UNWIND
> 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  x86_64_fallback_frame_state (fs=0x7fa9b2f576d0, context=0x7fa9b2f57980) at ./md-unwind-support.h:58
> 58        if (*(unsigned char *)(pc+0) == 0x48
> [Current thread is 1 (Thread 0x7fa9b2f58700 (LWP 1285074))]
> (gdb) i thread
>   Id   Target Id                             Frame
> * 1    Thread 0x7fa9b2f58700 (LWP 1285074) (Exiting)  x86_64_fallback_frame_state (fs=0x7fa9b2f576d0, context=0x7fa9b2f57980) at ./md-unwind-support.h:58
>   2    Thread 0x7fa9b383e000 (LWP 1284366)  0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27
>   3    Thread 0x7fa9b37c1700 (LWP 1284374)  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>   4    Thread 0x7fa9b2f73700 (LWP 1285077)  0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
>   5    Thread 0x7fa9b2f61700 (LWP 1285076)  0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
>   6    Thread 0x7fa9b2f4f700 (LWP 1285079)  0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
>   7    Thread 0x7fa9b2fa9700 (LWP 1285080)  0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fa9b383e000 (LWP 1284366))]
> #0  0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27
> 27        return SYSCALL_CANCEL (close, fd);
> (gdb) bt
> #0  0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27
> #1  0x00005606f030f95b in cleanup_dmevent_waiter () at dmevents.c:111
> #2  0x00005606f03087a2 in child (param=<optimized out>) at main.c:2934
> #3  0x00005606f03014d3 in main (argc=<optimized out>, argv=0x7ffdb782ab38) at main.c:3150
> 
> The LWP of ?? and UNWIND is much larger than thread 2 (main).
> 
> I add print_func like:
> 
> @@ -228,6 +228,10 @@ static void copy_msg_to_tcc(void *ct_p, const char *msg)
>  	pthread_mutex_unlock(&ct->lock);
>  }
>  
> +static void lxk10 (void)
> +{
> +	condlog(2, "lxk exit tur_thread");
> +}
>  static void *tur_thread(void *ctx)
>  {
>  	struct tur_checker_context *ct = ctx;
> @@ -235,6 +239,8 @@ static void *tur_thread(void *ctx)
>  	char devt[32];
>  
>  	/* This thread can be canceled, so setup clean up */
> +	condlog(2, "lxk start tur_thread");
> +	pthread_cleanup_push(lxk10, NULL);
>  	tur_thread_cleanup_push(ct);
> 
> When there are four devices, core log:
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: exit (signal)
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sda: unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdf: unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sde: unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdd: unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdc: unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdb: unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 360014057a1353ec1bdd4dfcad19db6db: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdg: orphan path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdg that holds hwe of 360014057a1353ec1bdd4dfcad19db6db
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405faf8a6c2920840ed8ba73b9ee: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdj: orphan path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdj that holds hwe of 36001405faf8a6c2920840ed8ba73b9ee
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 3
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405044c0f50ba3c4e5b9b57e4de4: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdi: orphan path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdi that holds hwe of 36001405044c0f50ba3c4e5b9b57e4de4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 2
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405e0cbb950907b4a51af1a002ed: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdh: orphan path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdh that holds hwe of 36001405e0cbb950907b4a51af1a002ed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 1
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit ueventloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit uxlsnrloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit uevqloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit wait_dmevents
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit checkerloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio checker refcount 6
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur checker      //checker_put

So we do not see "unloading tur checker". As you said, that suggests
that the crash occurs between libcheck_free() and the thread exiting.

I suggest you put a message in tur.c:libcheck_free(), AFTER the call to
cleanup_context(), printing the values of "running" and "holders" (see
the sketch in the P.S. below). Anyway:

	holders = uatomic_sub_return(&ct->holders, 1);
	if (!holders)
		cleanup_context(ct);

Whatever mistakes we have made, only one actor can have seen
holders == 0 and called cleanup_context().

The stacks you have shown indicate that the instruction pointers were
broken. That suggests something similar to what was discussed in the ML
thread leading to 38ffd89 ("libmultipath: prevent DSO unloading with
astray checker threads").

Your logs show "tur checker refcount 1", so the next call to
checker_put() would have unloaded the DSO. Please try commenting out the
dlclose() call in free_checker_class() (also sketched below), and see if
it makes a difference.

Martin
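
P.S.: To make the libcheck_free() suggestion concrete, the debug print I
have in mind would look roughly like this (a rough sketch only -- the
exact code around the holders/cleanup_context() logic in your tree may
differ, and the message text is arbitrary):

	holders = uatomic_sub_return(&ct->holders, 1);
	if (!holders)
		cleanup_context(ct);
	/* debug only: report what this caller observed; this uses only
	 * the local ints, so it is safe even if cleanup_context() has
	 * just freed ct */
	condlog(2, "lxk libcheck_free: running=%d holders=%d",
		running, holders);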
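
And for the dlclose() experiment, something along these lines in
free_checker_class() (again only a sketch -- I am assuming here that the
dlopen() handle is kept in a member called "handle" and that condlog()
is usable there; adapt it to whatever libmultipath/checkers.c actually
does in your tree):

	if (c->handle) {
		/*
		 * Debug experiment: keep the checker DSO mapped, so that
		 * a straggling TUR thread cannot end up executing code
		 * that has already been unmapped.
		 *
		 * dlclose(c->handle);
		 */
		condlog(3, "debug: not unloading checker DSO");
	}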