I have been having an issue where cachefilesd will randomly crash causing the cache to be withdrawn. The crash is intermittent and can sometimes happen within minutes, other times it can take hours, or never. Fortunately it has produced a crash dump so I've been able to analyse what happened.

From the stack trace (and debug logging) the last operation it was running is the decant_cull_table. The code fails in the check block at the end of the function when it calls abort().

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140614334650176) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140614334650176) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140614334650176, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fe353442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fe3534287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000556d6c9f0965 in decant_cull_table () at cachefilesd.c:1571
#6  cachefilesd () at cachefilesd.c:780
#7  0x0000556d6c9f140b in main (argc=<optimized out>, argv=<optimized out>) at cachefilesd.c:581

For reference the code at frame 5 from the decant_cull_table function is:

check:
	for (loop = 0; loop < nr_in_ready_table; loop++)
		if (((long)cullready[loop] & 0xf0000000) == 0x60000000)
			abort();

Checking the cull table, the first object in the cull table appears to be valid.

(gdb) p nr_in_ready_table
$1 = 242
(gdb) p cullready[0]
$2 = (struct object *) 0x556d6d7382a0
(gdb) p -pretty -- *cullready[0]
$3 = {
  parent = 0x556d6d7352b0,
  children = 0x0,
  next = 0x0,
  prev = 0x0,
  dir = 0x0,
  ino = 13631753,
  usage = 1,
  empty = false,
  new = false,
  cullable = true,
  type = OBJTYPE_DATA,
  atime = 1675349423,
  name = "E"
}

The inode number from the struct matches a file in the fscache.

$ sudo find /var/cache/fscache -inum 13631753
/var/cache/fscache/cache/Infs,3.0,2,,300000a,e5e9b1269df2b0d,,,d0,100000,100000,249f0,249f0,249f0,249f0,1/@00/E210w114Hg92Az0HAMYCClFMVmkMY050002w1qO200

However, the memory address of the struct matches the pattern that the check tests for, so the check trips and abort() is called.

(gdb) p (((long)cullready[0] & 0xf0000000) == 0x60000000)
$4 = 1

  0000 556d 6d73 82a0
& 0000 0000 f000 0000
= 0000 0000 6000 0000

$ file /sbin/cachefilesd
/sbin/cachefilesd: ELF 64-bit LSB pie executable, x86-64

Looking at the code, I suspect that this magic 0x60000000 number is supposed to be some kind of sentinel value that's used as a bug check for errors such as use-after-free? This would make sense when the application was 32-bit, as the bit pattern 0110 in the highest nibble of an address either cannot occur, or lies within the kernel address space. However, when compiled as 64-bit this assumption no longer holds: the mask only covers bits 28-31, which are no longer the top bits of the pointer, so the pattern can appear in perfectly valid addresses.

This would also explain the random nature of the crashes, as cachefilesd is at the whim of the OS and the calloc function.

-- 
Chris

--
Linux-cachefs mailing list
Linux-cachefs@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/linux-cachefs