Highly reflinked and fragmented considered harmful?

Hi,

Is it to be expected that removing 29TB of highly reflinked and fragmented data could take days, the entire time blocking other tasks like "rm" and "df" on the same filesystem?

- is there any way to see progress and make an ETA estimate?
- would the 4700+ processes blocked on the same fs slow things down, or are they irrelevant?
- with a reboot/remount, does the log replay continue from where it left off, or start again?
- is there anything that might improve the situation in newer kernels?

Some details:

# uname -r
5.15.34-otn-00007-g6bff5dd37abb

# xfs_info /chroot
meta-data=/dev/mapper/vg00-chroot isize=512    agcount=282, agsize=244184192 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=68719475712, imaxpct=1
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The fs is sitting on lvm and the underlying block device is a ceph 8+3 erasure-coded rbd.

This fs was originally 30T and has been expanded to 256T:

  Limits to growth
  https://www.spinics.net/lists/linux-xfs/msg60451.html

And it's been the topic of a few other issues:

  Extreme fragmentation ho!
  https://www.spinics.net/lists/linux-xfs/msg47707.html

  Mysterious ENOSPC
  https://www.spinics.net/lists/linux-xfs/msg55446.html

The story...

I did an "rm -rf" of a directory containing a "du"-indicated 29TB spread over maybe 50 files. The data would have been highly reflinked and fragmented. A large part of the reflinking would be to files outside the dir in question, and I imagine maybe only 2-3TB of data would actually be freed by the "rm".
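
For context, "highly reflinked" here means lots of reflink copies sharing extents, so unlinking one copy frees relatively little; roughly this shape (illustrative only, not the actual workload):

# cp --reflink=always base.img copy1.img   # shares base.img's extents, no data copied
# rm copy1.img                             # unlink returns quickly; only blocks not still referenced elsewhere are freed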

The "rm" completed in less than a second, but an immediately following "df" on that fs didn't return, even 60+ hours later(!!).

In the meantime, other background processes were also attempting various "rm"s (small amounts of non-reflinked data) and "df"s on the same fs, which were also hanging - building up to a load average of 4700+(!!) over time. A flame graph showed the kernel mostly idle (75%), with xfsaild the largest consumer at 3.75% and the rest a wide variety of other uninteresting stuff. A "sysrq w" showed the tasks blocking in xfs_inodegc_flush, e.g.:

May 06 09:49:29 d5 kernel: task:df              state:D stack:    0 pid:32738 ppid:     1 flags:0x00000004
May 06 09:49:29 d5 kernel: Call Trace:
May 06 09:49:29 d5 kernel:  <TASK>
May 06 09:49:29 d5 kernel:  __schedule+0x241/0x740
May 06 09:49:29 d5 kernel:  schedule+0x3a/0xa0
May 06 09:49:29 d5 kernel:  schedule_timeout+0x271/0x310
May 06 09:49:29 d5 kernel:  ? find_held_lock+0x2d/0x90
May 06 09:49:29 d5 kernel:  ? sched_clock_cpu+0x9/0xa0
May 06 09:49:29 d5 kernel:  ? lock_release+0x214/0x350
May 06 09:49:29 d5 kernel:  wait_for_completion+0x7b/0xc0
May 06 09:49:29 d5 kernel:  __flush_work+0x217/0x350
May 06 09:49:29 d5 kernel:  ? flush_workqueue_prep_pwqs+0x120/0x120
May 06 09:49:29 d5 kernel:  ? wait_for_completion+0x1c/0xc0
May 06 09:49:29 d5 kernel:  xfs_inodegc_flush.part.24+0x62/0xc0 [xfs]
May 06 09:49:29 d5 kernel:  xfs_fs_statfs+0x37/0x1a0 [xfs]
May 06 09:49:29 d5 kernel:  statfs_by_dentry+0x3c/0x60
May 06 09:49:29 d5 kernel:  vfs_statfs+0x16/0xd0
May 06 09:49:29 d5 kernel:  user_statfs+0x44/0x80
May 06 09:49:29 d5 kernel:  __do_sys_statfs+0x10/0x30
May 06 09:49:29 d5 kernel:  do_syscall_64+0x34/0x80
May 06 09:49:29 d5 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 06 09:49:29 d5 kernel: RIP: 0033:0x7f2adb25ec07
May 06 09:49:29 d5 kernel: RSP: 002b:00007ffe074c81d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
May 06 09:49:29 d5 kernel: RAX: ffffffffffffffda RBX: 00005604e08f93c0 RCX: 00007f2adb25ec07
May 06 09:49:29 d5 kernel: RDX: 00007ffe074c8460 RSI: 00007ffe074c81e0 RDI: 00005604e08f93c0
May 06 09:49:29 d5 kernel: RBP: 00007ffe074c81e0 R08: 00005604e08f9400 R09: 0000000000000000
May 06 09:49:29 d5 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe074c8280
May 06 09:49:29 d5 kernel: R13: 0000000000000000 R14: 00005604e08f9400 R15: 00005604e08f93c0
May 06 09:49:29 d5 kernel:  </TASK>
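
(For anyone wanting the same view, the dump above is from sysrq-w; a rough count of how many blocked tasks are sitting in xfs_inodegc_flush is just:

# echo w > /proc/sysrq-trigger          # dump blocked (D-state) tasks to the kernel log
# dmesg | grep -c xfs_inodegc_flush     # count stack lines mentioning the flush, roughly one per blocked task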

Iostat showed consistently high %util and high latencies. E.g. combined load average and iostat output, 66 hours after the initial "df" started:

           load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s  arq-sz  aqu-sz    await   rwait   wwait %util
07:55:32 4772.3   0.0   0.0   0.2   0.8   0.0  98.8 rbd0    0.0    0.0    0.5   27.9     1.80   530.20   37.46   25.87   911.07   34.67   925.2  97.0
07:55:52 4772.1   0.0   0.0   0.2   0.0   0.0  99.7 rbd0    0.0    0.0    0.9   17.2     3.60   548.20   60.97    7.64   422.22   11.83   443.7  99.2
07:56:12 4772.1   0.0   0.0   0.3   1.9   0.0  97.6 rbd0    0.0    0.5    0.2   46.9     1.00   513.00   21.80   46.06   976.91  304.40   980.5  96.3
07:56:32 4773.1   0.0   0.0   0.3   1.7   0.0  97.8 rbd0    0.0    0.0    0.5   12.9     1.80   306.20   45.97    6.81   508.09    4.67   525.6  57.8
07:56:52 4773.8   0.0   0.0   0.3   1.4   0.0  98.0 rbd0    0.0    0.1    0.7   40.4     2.60   591.80   28.96   32.70   796.60  163.23   806.8  92.2
07:57:12 4774.4   0.1   0.0   0.3   1.8   0.0  97.7 rbd0    0.0    0.3    0.2   43.1     0.80   541.80   25.06   42.13   973.05   15.25   977.5  84.0
07:57:32 4774.8   0.2   0.0   0.7   2.3   0.0  96.5 rbd0    0.0    0.1    0.5   35.2     2.00   496.40   27.92   30.25   847.35    9.00   859.3  85.7
07:57:52 4775.4   0.0   0.0   0.4   1.6   0.0  97.7 rbd0    0.0    0.2    0.7   45.1     2.80   510.80   22.43   42.88   936.33   76.21   949.7  73.6

Apart from the iostat there was no other indication any progress was being made.
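
A possible way to watch for progress without issuing another blocking "df" - assuming the per-fs counters under /sys/fs/xfs keep ticking over during inodegc, which I haven't verified:

# ls /sys/fs/xfs/                            # one directory per mounted xfs block device, e.g. dm-N
# watch -d cat /sys/fs/xfs/dm-*/stats/stats  # -d highlights counters that change between refreshes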

Eventually I thought that the excessive number of blocked processes might be adding to the woes by causing thundering-herd problems and the like, and decided to reboot the box as the only way to remove those tasks, with the understanding that the journal would be replayed before the fs mount completed.

There were a few more reboots and mount attempts over the next day as I was messing around with things - turning off the background processes causing the ever-increasing load, taking lvm snapshots, upgrading to the latest stable kernel, etc.

I'm now wondering whether each mount continues the log replay from where the previous mount attempt left off, or starts processing the log again from the very beginning?

It's now been 40 hours since the most recent reboot/mount and the mount process is still running. The load average is a more reasonable <10, and the iostat is similar to above.

Is there any way to see where the log replay is up to, and from there make a judgement on how much longer it might take, e.g. something in xfs_db, or some eBPF wizardry?
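
The kind of thing I had in mind, assuming this kernel has the log-recovery tracepoints (the event names below are from memory - check them against the ls output first):

# ls /sys/kernel/debug/tracing/events/xfs/ | grep log_recover
# echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_log_recover_record/enable
# echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_log_recover_item_recover/enable
# cat /sys/kernel/debug/tracing/trace_pipe | head    # LSNs should presumably advance if recovery is moving through the log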

I tried using xfs_logprint but I'm not sure what it's telling me, and multiple versions taken 10 minutes apart don't show any differences that are obviously "progress" - the only differences are in q:0x[[:xdigit:]]+ in the TRANS entries, and in a:0x[[:xdigit:]]+ in the INO entries, e.g.:

t1:
TRANS: tid:0xcf6e0775  #items:50029  trans:0xcf6e0775  q:0x563417886560
...
INO: cnt:2 total:2 a:0x563419435920 len:56 a:0x5634194359c0 len:176
        INODE: #regs:2   ino:0x48000698b  flags:0x1   dsize:0
        CORE inode:
                magic:IN  mode:0x81a4  ver:3  format:3
                uid:1952  gid:1952  nlink:0 projid:0x00000000
                atime:1616462862  mtime:1651077084  ctime:1651634628
                flushiter:0
                size:0x0  nblks:0xdbcfc47  exsize:0  nextents:1676364  anextents:0
                forkoff:0  dmevmask:0x0  dmstate:0  flags:0x0  gen:4176048690
                flags2 0x2 cowextsize 0x0

t1+10s:
TRANS: tid:0xcf6e0775  #items:50029  trans:0xcf6e0775  q:0x55b5d8815560
...
INO: cnt:2 total:2 a:0x55b5da3c4920 len:56 a:0x55b5da3c49c0 len:176
        INODE: #regs:2   ino:0x48000698b  flags:0x1   dsize:0
        CORE inode:
                magic:IN  mode:0x81a4  ver:3  format:3
                uid:1952  gid:1952  nlink:0 projid:0x00000000
                atime:1616462862  mtime:1651077084  ctime:1651634628
                flushiter:0
                size:0x0  nblks:0xdbcfc47  exsize:0  nextents:1676364  anextents:0
                forkoff:0  dmevmask:0x0  dmstate:0  flags:0x0  gen:4176048690
                flags2 0x2 cowextsize 0x0
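
I take the q:/a: values to be addresses of xfs_logprint's own in-memory buffers rather than anything on disk, so they'll differ between runs regardless of progress; with them masked, the two captures come out identical:

# sed -E 's/ (q|a):0x[0-9a-f]+//g' t1.log > t1.norm
# sed -E 's/ (q|a):0x[0-9a-f]+//g' t2.log > t2.norm
# diff t1.norm t2.norm     # no output, i.e. the printed log content itself doesn't change between captures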


Cheers,

Chris


