Re: decant_cull_table intermittently aborting cachefilesd

Chris Chilvers <chilversc@xxxxxxxxx> · Tue, 28 Mar 2023 18:22:49 +0100

On Fri, 3 Feb 2023 at 11:17, Chris Chilvers <chilversc@xxxxxxxxx> wrote:
>
> I have been having an issue where cachefilesd crashes at random, causing the
> cache to be withdrawn. The crash is intermittent: sometimes it happens within
> minutes, other times it takes hours, and sometimes it never happens.
>
> Fortunately it has produced a crash dump so I've been able to analyse what
> happened.
>
> From the stack trace (and debug logging), the last function it was running
> is decant_cull_table(). It fails in the check block at the end of the
> function, where it calls abort().
>
>     (gdb) bt
>     #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140614334650176) at ./nptl/pthread_kill.c:44
>     #1  __pthread_kill_internal (signo=6, threadid=140614334650176) at ./nptl/pthread_kill.c:78
>     #2  __GI___pthread_kill (threadid=140614334650176, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
>     #3  0x00007fe353442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
>     #4  0x00007fe3534287f3 in __GI_abort () at ./stdlib/abort.c:79
>     #5  0x0000556d6c9f0965 in decant_cull_table () at cachefilesd.c:1571
>     #6  cachefilesd () at cachefilesd.c:780
>     #7  0x0000556d6c9f140b in main (argc=<optimized out>, argv=<optimized out>) at cachefilesd.c:581
>
> For reference the code at frame 5 from the decant_cull_table function is:
>
>     check:
>         for (loop = 0; loop < nr_in_ready_table; loop++)
>             if (((long)cullready[loop] & 0xf0000000) == 0x60000000)
>                 abort();
>
> Checking the cull table, the first object in the cull table appears to be
> valid.
>
>     (gdb) p nr_in_ready_table
>     $1 = 242
>
>     (gdb) p cullready[0]
>     $2 = (struct object *) 0x556d6d7382a0
>
>     (gdb) p -pretty -- *cullready[0]
>     $3 = {
>         parent = 0x556d6d7352b0,
>         children = 0x0,
>         next = 0x0,
>         prev = 0x0,
>         dir = 0x0,
>         ino = 13631753,
>         usage = 1,
>         empty = false,
>         new = false,
>         cullable = true,
>         type = OBJTYPE_DATA,
>         atime = 1675349423,
>         name = "E"
>     }
>
> The inode number from the struct matches a file in the fscache.
>
>     $ sudo find /var/cache/fscache -inum 13631753
>     /var/cache/fscache/cache/Infs,3.0,2,,300000a,e5e9b1269df2b0d,,,d0,100000,100000,249f0,249f0,249f0,249f0,1/@00/E210w114Hg92Az0HAMYCClFMVmkMY050002w1qO200
>
> However, the address of the struct itself matches the pattern the check
> looks for, so the check fails.
>
>     (gdb) p (((long)cullready[0] & 0xf0000000) == 0x60000000)
>     $4 = 1
>
>       0000 556d 6d73 82a0
>     & 0000 0000 f000 0000
>     = 0000 0000 6000 0000
>
>     $ file /sbin/cachefilesd
>     /sbin/cachefilesd: ELF 64-bit LSB pie executable, x86-64
>
> Looking at the code, I suspect that this magic 0x60000000 number is supposed
> to be some kind of sentinel value that's used as a bug check for errors such
> as use after free? This would make sense when the application was 32-bit, as
> the bit pattern 0110 in the highest nibble of an address either cannot occur
> or lies within the kernel address space. However, when compiled as 64-bit,
> the check only covers bits 28-31 of the address, so this assumption no
> longer holds and the pattern can appear in perfectly valid addresses.
>
> This would also explain the random nature of the crashes, as cachefilesd is
> at the whim of the OS and the calloc function for the addresses it gets.
>
> --
> Chris
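
To make the false positive concrete, here is a minimal standalone sketch of
the same mask-and-compare, applied to a raw address value rather than a
pointer. looks_poisoned() is a name I made up for illustration; it is not in
cachefilesd.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same mask-and-compare as the check block in decant_cull_table(),
 * applied to a raw 64-bit address value instead of a pointer.
 * Hypothetical helper, not part of cachefilesd. */
static bool looks_poisoned(uint64_t addr)
{
    /* Only bits 28-31 are examined; the upper 32 bits are ignored. */
    return (addr & 0xf0000000) == 0x60000000;
}
```

looks_poisoned(0x556d6d7382a0) is true for the cullready[0] value from the
dump, just as it is for a value of the form (0x6b000000 | __LINE__), because
the low 32 bits of the heap address happen to be 0x6d7382a0.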

Any thoughts on this issue? I think the main question is whether the debug
markers such as "(0x6b000000 | __LINE__)" still have any value. If not, this
can be simplified by setting the pointer to NULL and updating the check to
look for NULLs.
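
Roughly what I have in mind for the NULL variant is sketched below. This is
not the actual cachefilesd code: struct object is reduced to a stub, and
take_ready_slot() and ready_table_is_sane() are hypothetical helpers invented
for illustration. It also assumes that a NULL inside the live range of the
ready table is what the check should treat as corruption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stub for illustration; the real struct object has more fields. */
struct object { int usage; };

/* Hypothetical helper: hand out the object held in a ready-table slot
 * and clear the slot so that a stale read is detectable. */
static struct object *take_ready_slot(struct object **table, size_t slot)
{
    struct object *obj = table[slot];
    table[slot] = NULL;     /* instead of (void *)(0x6b000000 | __LINE__) */
    return obj;
}

/* Hypothetical check: every slot in the live range must hold an object. */
static bool ready_table_is_sane(struct object *const *table, size_t nr_live)
{
    for (size_t loop = 0; loop < nr_live; loop++)
        if (table[loop] == NULL)
            return false;
    return true;
}
```

The check no longer depends on what addresses the allocator returns, at the
cost of losing the record of which line cleared the slot.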

If __LINE__ still has value then there are two questions to answer:

1. How can this be made safe for 64-bit architectures?
2. Should __LINE__ only be included in debug builds, with NULL used otherwise?
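
One possible answer to both questions, as a sketch only: instead of encoding
the line number in a magic address, point the poison value into a static
array that is never handed out as an object, so it cannot collide with a
calloc result on any architecture. Everything here is an assumption on my
part: the CULL_DEBUG build flag, the POISON_LINES bound, and the names
POISON_PTR(), is_poisoned() and poison_line() do not exist in cachefilesd.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_LINES 8192   /* assumed upper bound on __LINE__ in the file */

/* Poison pointers point into this array, which is never used as an
 * object, so they cannot alias anything returned by the allocator. */
static char poison_area[POISON_LINES];

#ifdef CULL_DEBUG           /* hypothetical build flag */
#define POISON_PTR() ((void *)&poison_area[__LINE__ % POISON_LINES])
#else
#define POISON_PTR() NULL
#endif

/* True if p is NULL or points into poison_area. */
static bool is_poisoned(const void *p)
{
    uintptr_t v = (uintptr_t)p;
    uintptr_t base = (uintptr_t)poison_area;

    return p == NULL || (v >= base && v - base < POISON_LINES);
}

/* Recover the recorded line number (modulo POISON_LINES), or 0 if p is
 * NULL or not a poison pointer. */
static unsigned poison_line(const void *p)
{
    if (p == NULL || !is_poisoned(p))
        return 0;
    return (unsigned)((uintptr_t)p - (uintptr_t)poison_area);
}
```

With something like this, a cleared slot would be set with
"cullready[loop] = POISON_PTR();" and the check block would call
is_poisoned(cullready[loop]). Non-debug builds just see NULL.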