I've been trying to track down a failure with e2fsck on a large filesystem. Running e2fsck-1.45.2 repeatedly and consistently reports an error in pass2:

    Internal error: couldn't find dir_info for <ino>

The inode number is consistent across multiple runs, but examining it with debugfs shows that it is a normal directory with no problems. In order to complete the e2fsck, we tried using "debugfs -w -R 'clri <ino>'" to erase the inode, but this just shifted the error onto a slightly later directory (which would itself then be reported consistently).

The filesystem is 6TB with 2.3B inodes, about 1.8B of which are in use. The inode numbers reporting problems are in the range of 1.2B. e2fsck is using about 45GB of RAM (with some swap on an NVMe SSD, so much faster than the HDD holding the filesystem image). It takes about 3-4h to hit this problem. I was able to run e2fsck under gdb, though conditional breakpoints slow it down significantly (10-100x), since the condition has to be evaluated for every inode; that makes them impractical here.

What I've discovered is quite interesting. ctx->dir_info->array is filled in properly for the bad inode, and the array is OK at the start of pass2 (2 is the root inode, and $66 holds the bad inode number):

    (gdb) p e2fsck_get_dir_info(ctx, 2)
    $171 = (struct dir_info *) 0x7ffabf2bd010
    (gdb) p *$171
    $172 = {ino = 2, dotdot = 0, parent = 2}
    (gdb) p $66
    $173 = 1222630490
    (gdb) p e2fsck_get_dir_info(ctx, $66)
    $174 = (struct dir_info *) 0x7ffb3d327f54
    (gdb) p *$174
    $175 = {ino = 1222630490, dotdot = 0, parent = 0}

The dir_info array itself is over 4GB in size (not sure if this is relevant or not), with 380M directories, but the bad inode is only about halfway through the array ($140 is the index of the problematic entry):

    (gdb) p $140
    $176 = 176197275
    (gdb) p ctx->dir_info->array[$140]
    $177 = {ino = 1222630490, dotdot = 0, parent = 0}
    (gdb) p *ctx->dir_info
    $178 = {count = 381539950, size = 381539965, array = 0x7ffabf2bd010, last_lookup = 0x0, tdb_fn = 0x0, tdb = 0x0}
    (gdb) p $178.count * sizeof(ctx->dir_info->array[0])
    $179 = 4578479400

I put hardware watchpoints on the ".parent" and ".dotdot" fields of this entry so that I could watch as e2fsck_dir_info_set_dotdot() updated it, but I didn't think to put one on the ".ino" field, since I assumed it would be static:

    (gdb) info break
    Num     Type           Disp Enb Address            What
    15      hw watchpoint  keep y                      -location $177->parent
    16      hw watchpoint  keep y                      -location $177->dotdot

The watchpoint triggered, and I saw that the entry was being changed by qsort_r(). At first I thought "OK, the dir_info array needs to be sorted, because a binary search is used on it", but in fact the array is *created* in order during the inode table scan and does not need to be sorted.
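Since the lookups depend on that ordering, here is a minimal sketch of the binary search involved (my own illustration, not the actual dir_info.c code; the struct layout is just what gdb printed above):

    #include <stddef.h>

    typedef unsigned int ext2_ino_t;	/* 32-bit inode number */

    struct dir_info {			/* layout as printed by gdb above */
    	ext2_ino_t ino;
    	ext2_ino_t dotdot;
    	ext2_ino_t parent;
    };

    /*
     * The array is appended to in increasing inode order during the
     * pass1 inode table scan, so it is sorted by .ino purely by
     * construction and never needs an explicit sort.  If something
     * later scribbles on an .ino field, the sort invariant breaks and
     * lookups near that entry start returning NULL.
     */
    static struct dir_info *lookup_dir_info(struct dir_info *array,
    					    size_t count, ext2_ino_t ino)
    {
    	size_t low = 0, high = count;

    	while (low < high) {
    		size_t mid = low + (high - low) / 2;

    		if (array[mid].ino == ino)
    			return &array[mid];
    		if (array[mid].ino < ino)
    			low = mid + 1;
    		else
    			high = mid;
    	}
    	return NULL;
    }

So a single overwritten .ino in the middle of the array is exactly the failure mode that produces "couldn't find dir_info" for an inode that debugfs says is perfectly healthy.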
As can be seen from the stack, it is *another* array being sorted that overwrites the dir_info entry:

    Hardware watchpoint 16: -location $177->dotdot

    Old value = 0
    New value = 1581603653
    0x0000003b2ea381f7 in _quicksort () from /lib64/libc.so.6
    (gdb) p $141
    $179 = {ino = 1222630490, dotdot = 0, parent = 0}
    (gdb) bt
    #0  0x0000003b2ea381e1 in _quicksort () from /lib64/libc.so.6
    #1  0x0000003b2ea38f82 in qsort_r () from /lib64/libc.so.6
    #2  0x000000000045447b in ext2fs_dblist_sort2 (dblist=0x6c9a90, sortfunc=0x428c3d <special_dir_block_cmp>) at dblist.c:189
    #3  0x00000000004285bc in e2fsck_pass2 (ctx=0x6c4000) at pass2.c:180
    #4  0x000000000041646b in e2fsck_run (ctx=0x6c4000) at e2fsck.c:253
    #5  0x0000000000415210 in main (argc=4, argv=0x7fffffffe088) at unix.c:2047
    (gdb) p ctx->dir_info->array[$140]
    $207 = {ino = 0, dotdot = 1581603653, parent = 0}
    (gdb) f 3
    #3  0x00000000004285bc in e2fsck_pass2 (ctx=0x6c4000) at pass2.c:180
    180             ext2fs_dblist_sort2(fs->dblist, special_dir_block_cmp);

AFAIK, ext2fs_dblist_sort2() sorts the list of directory *blocks* and should not be changing dir_info at all. Is this a bug in glibc's qsort()?

What I just noticed while writing this email is that the fs->dblist.list address is right in the middle of the dir_info array's address range:

    (gdb) p *fs->dblist
    $210 = {magic = 2133571340, fs = 0x6c4460, size = 763079922, count = 388821313, sorted = 0, list = 0x7ffad011e010}
    (gdb) p &ctx->dir_info->array[0]
    $211 = (struct dir_info *) 0x7ffabf2bd010
    (gdb) p &ctx->dir_info->array[$140]
    $212 = (struct dir_info *) 0x7ffb3d327f54

which would explain why sorting the dblist is messing with dir_info. I don't _think_ it is a problem with my build or swap, since those differ from the system where this was originally reproduced.
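To double-check that reading of the addresses, here is a throwaway sketch plugging in the values gdb printed above (constants copied from $178, $210, and $211; the final check is only a hypothesis about 32-bit size truncation, which the >4GB array size makes worth ruling out):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
    	/* Values copied from the gdb output above ($178, $210, $211). */
    	uint64_t di_start = 0x7ffabf2bd010ULL;	/* ctx->dir_info->array */
    	uint64_t di_bytes = 381539965ULL * 12;	/* size * sizeof(struct dir_info) */
    	uint64_t db_list  = 0x7ffad011e010ULL;	/* fs->dblist->list */

    	printf("dir_info claims [%#llx, %#llx)\n",
    	       (unsigned long long)di_start,
    	       (unsigned long long)(di_start + di_bytes));
    	printf("dblist->list %#llx is %s that range\n",
    	       (unsigned long long)db_list,
    	       (db_list >= di_start && db_list < di_start + di_bytes) ?
    		"inside" : "outside");

    	/* Hypothesis to rule out: if the byte count were ever
    	 * truncated to 32 bits, the buffer would really end here: */
    	printf("start + (uint32_t)bytes = %#llx\n",
    	       (unsigned long long)(di_start + (uint32_t)di_bytes));
    	return 0;
    }

This confirms that dblist->list lies inside the range dir_info believes it owns, and, possibly coincidentally, start plus the low 32 bits of the byte count lands only a few hundred bytes below dblist->list -- which is the sort of layout I'd expect if some allocation or resize path computed the size in a 32-bit type.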
Any thoughts on how to continue debugging this?

Cheers, Andreas