Hi Michał,

> On Jan 17, 2020, at 5:34 AM, Michał Lowas-Rzechonek <michal.lowas-rzechonek@xxxxxxxxxxx> wrote:
>
> Hi Brian,
>
>> On 01/16, Brian Gix wrote:
>> Any packet that may be handled internally by the daemon must be sent in
>> it's own idle_oneshot context, to prevent multiple nodes from handling
>> and responding in the same context, eventually corrupting memory.
>>
>> This addresses the following crash:
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> 0 tcache_get (tc_idx=0) at malloc.c:2951
>> 2951 tcache->entries[tc_idx] = e->next;
>> (gdb) bt
>> 0 tcache_get (tc_idx=0) at malloc.c:2951
>> 1 __GI___libc_malloc (bytes=bytes@entry=16) at malloc.c:3058
>> 2 0x0000564cff9bc1de in l_malloc (size=size@entry=16) at ell/util.c:62
>> 3 0x0000564cff9bd46b in l_queue_push_tail (queue=0x564d000c9710, data=data@entry=0x564d000d0d60) at ell/queue.c:136
>> 4 0x0000564cff9beabd in idle_add (callback=callback@entry=0x564cff9be4e0 <oneshot_callback>, user_data=user_data@entry=0x564d000d4700,
>> flags=flags@entry=268435456, destroy=destroy@entry=0x564cff9be4c0 <idle_destroy>) at ell/main.c:292
>> 5 0x0000564cff9be5f7 in l_idle_oneshot (callback=callback@entry=0x564cff998bc0 <tx_worker>, user_data=user_data@entry=0x564d000d83f0,
>> destroy=destroy@entry=0x0) at ell/idle.c:144
>> 6 0x0000564cff998326 in send_tx (io=<optimized out>, info=0x7ffd035503f4, data=<optimized out>, len=<optimized out>)
>> at mesh/mesh-io-generic.c:637
>> 7 0x0000564cff99675a in send_network_beacon (key=0x564d000cfee0) at mesh/net-keys.c:355
>> 8 snb_timeout (timeout=0x564d000dd730, user_data=0x564d000cfee0) at mesh/net-keys.c:364
>> 9 0x0000564cff9bdca2 in timeout_callback (fd=<optimized out>, events=<optimized out>, user_data=0x564d000dd730) at ell/timeout.c:81
>> 10 timeout_callback (fd=<optimized out>, events=<optimized out>, user_data=0x564d000dd730) at ell/timeout.c:70
>> 11 0x0000564cff9bedcd in l_main_iterate (timeout=<optimized out>) at ell/main.c:473
>> 12 0x0000564cff9bee7c in l_main_run () at ell/main.c:520
>> 13 l_main_run () at ell/main.c:502
>> 14 0x0000564cff9bf08c in l_main_run_with_signal (callback=<optimized out>, user_data=0x0) at ell/main.c:642
>> 15 0x0000564cff994b64 in main (argc=<optimized out>, argv=0x7ffd03550668) at mesh/main.c:268
>
> Hm. I can't seem to wrap my head around this backtrace. Do you maybe
> have a reproduction path?

The backtrace doesn't show what went wrong very well, because the
underlying problem is heap corruption. The segfault only occurs during
a memory allocation some time later.

The mechanics of the problem are best shown by a local config client
requesting segmented composition data from a local config server. The
one request, all of the response segments, and the returned segment
ACKs are all handled on the same C call stack, since nothing goes over
the air (OTA). That stack gets *very* deep and steps off the end.

It does *not* happen during OTA operations, because each discrete
packet starts from a fresh C call stack from main().

Offloading the Send Packet Requests to l_idle_oneshot ensures that each
discrete looped-back packet also starts from a known low point on the C
call stack.

Does that make sense?
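
To illustrate the call-stack point, here is a minimal, self-contained
sketch. It is not the daemon's code and does not use ELL: defer(),
run_pending() and the handle_packet_*() functions are hypothetical
stand-ins, with defer() only loosely playing the role that
l_idle_oneshot plays in the patch. It counts how deep the handlers nest
when each looped-back packet is handled directly versus when it is
queued and run later from the main loop.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 64

typedef void (*deferred_fn)(int remaining);

struct deferred {
	deferred_fn fn;
	int remaining;
};

/* Toy FIFO of deferred calls (hypothetical, stands in for the idle queue) */
static struct deferred pending[MAX_PENDING];
static size_t head, tail;

/* Current and deepest observed handler nesting */
static int depth, max_depth;

/* Remember a call so the main loop can run it later */
static void defer(deferred_fn fn, int remaining)
{
	assert(tail - head < MAX_PENDING);
	pending[tail % MAX_PENDING] = (struct deferred){ fn, remaining };
	tail++;
}

/* Toy main loop: every queued call starts from this shallow frame */
static void run_pending(void)
{
	while (head != tail) {
		struct deferred d = pending[head % MAX_PENDING];

		head++;
		d.fn(d.remaining);
	}
}

/* Direct loopback: the next packet is handled inside this handler */
static void handle_packet_direct(int remaining)
{
	depth++;
	if (depth > max_depth)
		max_depth = depth;
	if (remaining > 0)
		handle_packet_direct(remaining - 1);
	depth--;
}

/* Deferred loopback: the next packet is queued, this handler returns */
static void handle_packet_deferred(int remaining)
{
	depth++;
	if (depth > max_depth)
		max_depth = depth;
	if (remaining > 0)
		defer(handle_packet_deferred, remaining - 1);
	depth--;
}

/* Deepest nesting when 'packets' looped-back packets are handled directly */
int max_depth_direct(int packets)
{
	depth = max_depth = 0;
	handle_packet_direct(packets - 1);
	return max_depth;
}

/* Deepest nesting when each packet is deferred to the main loop */
int max_depth_deferred(int packets)
{
	depth = max_depth = 0;
	head = tail = 0;
	defer(handle_packet_deferred, packets - 1);
	run_pending();
	return max_depth;
}
```

In this sketch the direct variant nests once per packet, while the
deferred variant never nests more than one handler deep, which is the
property the l_idle_oneshot change is after.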