> Note that unlike all other threads, TUR threads are _detached_ threads. > multipathd tries to cancel them, but it has no way to verify that they > actually stopped. It may be just a normal observation that you can't > see the messages when a TUR thread terminates, in particular if the > program is exiting and might have already closed the stderr file > descriptor. > > > If you look at the crashed processes with gdb, the thread IDs should > give you some clue which stack belongs to which thread. The TUR threads > will have higher thread IDs than the others because they are started > later. > ?? [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib64/libthread_db.so.1". Core was generated by `/sbin/multipathd -d -s'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007f3f669e071d in ?? () [Current thread is 1 (Thread 0x7f3f65873700 (LWP 1645593))] (gdb) i thread Id Target Id Frame * 1 Thread 0x7f3f65873700 (LWP 1645593) 0x00007f3f669e071d in ?? () 2 Thread 0x7f3f6611a000 (LWP 1645066) 0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78 3 Thread 0x7f3f6609d700 (LWP 1645095) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38 (gdb) bt #0 0x00007f3f669e071d in ?? () #1 0x0000000000000000 in ?? () (gdb) thread 2 [Switching to thread 2 (Thread 0x7f3f6611a000 (LWP 1645066))] #0 0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78 78 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS) (gdb) bt #0 0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-template.S:78 #1 0x00007f3f669fb77d in _dl_unmap_segments (l=l@entry=0x557cb432ba10) at ./dl-unmap-segments.h:32 ... #10 0x00007f3f669b44ed in cleanup_prio () at prio.c:66 //cleanup_checkers() is finished. #11 0x0000557cb26db794 in child (param=<optimized out>) at main.c:2932 #12 0x0000557cb26d44d3 in main (argc=<optimized out>, argv=0x7ffc98d47948) at main.c:3150 UNWIND [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib64/libthread_db.so.1". Core was generated by `/sbin/multipathd -d -s'. Program terminated with signal SIGSEGV, Segmentation fault. #0 x86_64_fallback_frame_state (fs=0x7fa9b2f576d0, context=0x7fa9b2f57980) at ./md-unwind-support.h:58 58 if (*(unsigned char *)(pc+0) == 0x48 [Current thread is 1 (Thread 0x7fa9b2f58700 (LWP 1285074))] (gdb) i thread Id Target Id Frame * 1 Thread 0x7fa9b2f58700 (LWP 1285074) (Exiting) x86_64_fallback_frame_state (fs=0x7fa9b2f576d0, context=0x7fa9b2f57980) at ./md-unwind-support.h:58 2 Thread 0x7fa9b383e000 (LWP 1284366) 0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27 3 Thread 0x7fa9b37c1700 (LWP 1284374) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38 4 Thread 0x7fa9b2f73700 (LWP 1285077) 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78 5 Thread 0x7fa9b2f61700 (LWP 1285076) 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78 6 Thread 0x7fa9b2f4f700 (LWP 1285079) 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78 7 Thread 0x7fa9b2fa9700 (LWP 1285080) 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-template.S:78 (gdb) thread 2 [Switching to thread 2 (Thread 0x7fa9b383e000 (LWP 1284366))] #0 0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27 27 return SYSCALL_CANCEL (close, fd); (gdb) bt #0 0x00007fa9b403e127 in __close (fd=5) at ../sysdeps/unix/sysv/linux/close.c:27 #1 0x00005606f030f95b in cleanup_dmevent_waiter () at dmevents.c:111 #2 0x00005606f03087a2 in child (param=<optimized out>) at main.c:2934 #3 0x00005606f03014d3 in main (argc=<optimized out>, argv=0x7ffdb782ab38) at main.c:3150 The LWP of ?? and UNWIND is much larger than thread 2(main). I add print_func like: @@ -228,6 +228,10 @@ static void copy_msg_to_tcc(void *ct_p, const char *msg) pthread_mutex_unlock(&ct->lock); } +static void lxk10 (void) +{ + condlog(2, "lxk exit tur_thread"); +} static void *tur_thread(void *ctx) { struct tur_checker_context *ct = ctx; @@ -235,6 +239,8 @@ static void *tur_thread(void *ctx) char devt[32]; /* This thread can be canceled, so setup clean up */ + condlog(2, "lxk start tur_thread"); + pthread_cleanup_push(lxk10, NULL); tur_thread_cleanup_push(ct); When there are four devices, core log: Mar 02 11:40:35 localhost.localdomain multipathd[85474]: exit (signal) Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sda: unusable path Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdf: unusable path Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sde: unusable path Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdd: unusable path Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdc: unusable path Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdb: unusable path Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start tur_thread Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit tur_thread Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 360014057a1353ec1bdd4dfcad19db6db: remove multipath map Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdg: orphan path, map flushed Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdg that holds hwe of 360014057a1353ec1bdd4dfcad19db6db Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 4 Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405faf8a6c2920840ed8ba73b9ee: remove multipath map Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdj: orphan path, map flushed Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdj that holds hwe of 36001405faf8a6c2920840ed8ba73b9ee Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 3 Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405044c0f50ba3c4e5b9b57e4de4: remove multipath map Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdi: orphan path, map flushed Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdi that holds hwe of 36001405044c0f50ba3c4e5b9b57e4de4 Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 2 Mar 02 11:40:35 localhost.localdomain multipathd[85474]: 36001405e0cbb950907b4a51af1a002ed: remove multipath map Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdh: orphan path, map flushed Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG: orphaning path sdh that holds hwe of 36001405e0cbb950907b4a51af1a002ed Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker refcount 1 Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit ueventloop Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit uxlsnrloop Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit uevqloop Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit wait_dmevents Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit checkerloop Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio checker refcount 6 Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur checker //checker_put Mar 02 11:40:36 localhost.localdomain systemd-coredump[85547]: Process 85474 (multipathd) of user 0 dumped core There are four "lxk start tur_thread" but three "lxk exit tur_thread". >> I will use >> int oldstate; >> pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate); >> ... >> pthread_setcancelstate(oldstate, NULL); >> pthread_testcancel(); >> to test it. > > Where exactly do you want to put that code? > I add this in BEGAIN and END of tur_thread. But it is not helpful. > IIUC you don't compile multipathd with -fexceptions, do you? You > haven't answered my previous question why you do that for systemd. I don't know why use -fexceptions before, but we have removed it and there is no udev_monitor_receive_device core. Regards, Lixiaokeng -- dm-devel mailing list dm-devel@xxxxxxxxxx https://listman.redhat.com/mailman/listinfo/dm-devel