On Fri, 3 Feb 2023 at 11:17, Chris Chilvers <chilversc@xxxxxxxxx> wrote:
>
> I have been having an issue where cachefilesd will randomly crash causing the
> cache to be withdrawn. The crash is intermittent and can sometimes happen
> within minutes, other times it can take hours, or never.
>
> Fortunately it has produced a crash dump so I've been able to analyse what
> happened.
>
> From the stack trace (and debug logging) the last operation it was running is
> the decant_cull_table. The code fails in the check block at the end of the
> function when it calls abort().
>
> (gdb) bt
> #0  __pthread_kill_implementation (no_tid=0, signo=6,
>     threadid=140614334650176) at ./nptl/pthread_kill.c:44
> #1  __pthread_kill_internal (signo=6, threadid=140614334650176) at
>     ./nptl/pthread_kill.c:78
> #2  __GI___pthread_kill (threadid=140614334650176,
>     signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> #3  0x00007fe353442476 in __GI_raise (sig=sig@entry=6) at
>     ../sysdeps/posix/raise.c:26
> #4  0x00007fe3534287f3 in __GI_abort () at ./stdlib/abort.c:79
> #5  0x0000556d6c9f0965 in decant_cull_table () at cachefilesd.c:1571
> #6  cachefilesd () at cachefilesd.c:780
> #7  0x0000556d6c9f140b in main (argc=<optimized out>,
>     argv=<optimized out>) at cachefilesd.c:581
>
> For reference the code at frame 5 from the decant_cull_table function is:
>
> check:
>         for (loop = 0; loop < nr_in_ready_table; loop++)
>                 if (((long)cullready[loop] & 0xf0000000) == 0x60000000)
>                         abort();
>
> Checking the cull table, the first object in the cull table appears to be
> valid.
>
> (gdb) p nr_in_ready_table
> $1 = 242
>
> (gdb) p cullready[0]
> $2 = (struct object *) 0x556d6d7382a0
>
> (gdb) p -pretty -- *cullready[0]
> $3 = {
>   parent = 0x556d6d7352b0,
>   children = 0x0,
>   next = 0x0,
>   prev = 0x0,
>   dir = 0x0,
>   ino = 13631753,
>   usage = 1,
>   empty = false,
>   new = false,
>   cullable = true,
>   type = OBJTYPE_DATA,
>   atime = 1675349423,
>   name = "E"
> }
>
> The inode number from the struct matches a file in the fscache.
>
> $ sudo find /var/cache/fscache -inum 13631753
> /var/cache/fscache/cache/Infs,3.0,2,,300000a,e5e9b1269df2b0d,,,d0,100000,100000,249f0,249f0,249f0,249f0,1/@00/E210w114Hg92Az0HAMYCClFMVmkMY050002w1qO200
>
> However, the memory address of the struct matches (fails) the check.
>
> (gdb) p (((long)cullready[0] & 0xf0000000) == 0x60000000)
> $4 = 1
>
>   0000 556d 6d73 82a0
> & 0000 0000 f000 0000
> = 0000 0000 6000 0000
>
> $ file /sbin/cachefilesd
> /sbin/cachefilesd: ELF 64-bit LSB pie executable, x86-64
>
> Looking at the code, I suspect that this magic 0x60000000 number is supposed
> to be some kind of sentinel value that's used as a bug check for errors such
> as use after free? This would make sense when the application was 32 bit, as
> address pattern 0110 in the highest nibble either cannot occur, or lies within
> the kernel address space. However, when compiled as 64 bit this assumption is
> no longer true and the bit pattern can appear in perfectly valid addresses.
>
> This would also explain the random nature of the crashes, as the cachefilesd
> is at the whims of the OS and calloc function.
>
> --
> Chris

Any thoughts on this issue?

I think the main question to be answered is whether the debug checks such as
"(0x6b000000 | __LINE__)" still have any value. If not, this can be simplified
by setting the pointer to null and updating the check to look for nulls.

If __LINE__ still has value then there are two questions to answer:

1. How to make this safe for 64-bit architectures?
2. Should __LINE__ only be included in debug builds, and null used normally?

--
Linux-cachefs mailing list
Linux-cachefs@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/linux-cachefs