Re: btrfs induced data loss (on xfs) - 5.19.0-38-generic

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Tue, 18 Apr 2023 16:00:53 +0800

On 2023/4/18 15:31, Kyle Sanderson wrote:
On Tue, Apr 18, 2023 at 12:19 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
On 2023/4/18 15:07, Kyle Sanderson wrote:
On Mon, Apr 17, 2023 at 1:42 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
Thus better to make sure qgroup is not enabled, with the following command:
If it shows error, then qgroup is really disabled and we can look
somewhere else.

All my disks:
ERROR: can't list qgroups: quotas not enabled

As long as there's nothing else running on the disk I seem to be able
to remove the files now. If I have a small workload (5k~ open
descriptors across multiple pids) running with them open it just seems
to fully lockup the entire box (100% utilization on the single disk,
with the same iowait symptom and system load increasing). I killed
everything and was able to clear around 4T. The background tasks are
running now to actually free the space, but the "frontend" unlink
action is done. Running multiple unlinks in parallel seems to trigger
this - does it take a full lock per pid or something?

Unlink in btrfs is expensive, unlike other filesystems.

As btrfs put every inode into the same btree (subvolume tree), and doing 
opeartions like unlink which may involves tons of file extents, all the 
unlink operations will race to hold lock on the same btree nodes/leaves.

Normally if there are parallel operations, it's recommended to put all 
those files into different subvolumes, which can greatly improve the 
performance for btrfs.

Unlike other filesystems, which mostly put one inode per-tree, btrfs is 
not that good at handling high load inside one subvolume.

Thanks,
Qu

Kyle.

On Tue, Apr 18, 2023 at 12:19 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:

On 2023/4/18 15:07, Kyle Sanderson wrote:
On Mon, Apr 17, 2023 at 1:42 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
And talking about the "100% utilization", did you mean a single CPU core
is fully utilized by some btrfs thread?

No, iostat reporting the disk is at "100%" utilization.

If so, it looks like qgroup is involved.
Did you just recently deleted one or more snapshots? If so, it's almost
certain qgroup is the cause of the 100% CPU utilization.

No, I used to run snapraid-btrfs (uses snapper) but both projects are
effectively orphaned. There's no snapshots left, but I used to run
zygo/bees also on these disks but that also has basic problems
(specifically mlock' on GBs of data when it's sitting completely idle
for hours at a time). bees was not running when this happened (or for
the duration of the uptime of the machine).

Those are not affecting qgroups directly.

In that case, you can disable quota for that btrfs (permanently), or use
newer kernel which has a sysfs interface:

     /sys/fs/btrfs/<UUID>/qgroups/drop_subtree_threshold

Thank you, I'll give it a go after the reboot here. I don't think I
have quotas enabled but I did look at it at one point (I remember
toggling on then off).

Just in case, if that path exists, it means qgroup is enabled, and then
qgroup would be the most possible cause.

IIRC a lot of tools like snapper would enable qgroup automatically.

Thus better to make sure qgroup is not enabled, with the following command:

   # btrfs qgroup show -pcre <mnt>

If it shows error, then qgroup is really disabled and we can look
somewhere else.

Thanks,
Qu

If you write some value like 3 into that file, btrfs would automatically
avoid such long CPU usage caused by large snapshot dropping.

But talking about the XFS corruption, I'm not sure how btrfs can lead to
problems in XFS...

All I/O seemed to be dead on the box.

I tried to remove the same data again and the task hung... this time I
was able to interrupt the removal and the box recovered.

[92798.210656] INFO: task sync:1282043 blocked for more than 120 seconds.
[92798.210683]       Not tainted 5.19.0-38-generic #39~22.04.1-Ubuntu
[92798.210707] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[92798.210735] task:sync            state:D stack:    0 pid:1282043
ppid:1281934 flags:0x00004002
[92798.210739] Call Trace:
[92798.210741]  <TASK>
[92798.210743]  __schedule+0x257/0x5d0
[92798.210747]  schedule+0x68/0x110
[92798.210750]  wait_current_trans+0xde/0x140 [btrfs]
[92798.210820]  ? destroy_sched_domains_rcu+0x40/0x40
[92798.210824]  start_transaction+0x34d/0x5f0 [btrfs]
[92798.210895]  btrfs_attach_transaction_barrier+0x23/0x70 [btrfs]
[92798.210966]  btrfs_sync_fs+0x4a/0x1c0 [btrfs]
[92798.211029]  ? vfs_fsync_range+0xa0/0xa0
[92798.211033]  sync_fs_one_sb+0x26/0x40
[92798.211037]  iterate_supers+0x9e/0x110
[92798.211041]  ksys_sync+0x62/0xb0
[92798.211045]  __do_sys_sync+0xe/0x20
[92798.211049]  do_syscall_64+0x59/0x90
[92798.211052]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[92798.211057] RIP: 0033:0x7fd08471babb
[92798.211059] RSP: 002b:00007ffd344545f8 EFLAGS: 00000246 ORIG_RAX:
00000000000000a2
[92798.211063] RAX: ffffffffffffffda RBX: 00007ffd344547d8 RCX: 00007fd08471babb
[92798.211065] RDX: 00007fd084821101 RSI: 00007ffd344547d8 RDI: 00007fd0847d9e29
[92798.211067] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[92798.211069] R10: 0000556ba8b21e40 R11: 0000000000000246 R12: 0000556ba8acebc0
[92798.211071] R13: 0000556ba8acc19f R14: 00007fd08481942c R15: 0000556ba8acc034
[92798.211075]  </TASK>

Kyle.

On Mon, Apr 17, 2023 at 1:42 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:

On 2023/4/17 13:20, Kyle Sanderson wrote:
The single btrfs disk was at 100% utilization and a wa of 50~, reading
back at around 2MB/s. df and similar would simply freeze. Leading up
to this I removed around 2T of data from a single btrfs disk. I
managed to get most of the services shutdown and disks unmounted, but
when the system came back up I had to use xfs_repair (for the first
time in a very long time) to boot into my system. I likely should have
just pulled the power...

I didn't see any obvious btrfs related work involved in the call trace.
Thus I believe it's not really hanging in a waiting status.

And talking about the "100% utilization", did you mean a single CPU core
is fully utilized by some btrfs thread?

If so, it looks like qgroup is involved.
Did you just recently deleted one or more snapshots? If so, it's almost
certain qgroup is the cause of the 100% CPU utilization.

In that case, you can disable quota for that btrfs (permanently), or use
newer kernel which has a sysfs interface:

     /sys/fs/btrfs/<UUID>/qgroups/drop_subtree_threshold

If you write some value like 3 into that file, btrfs would automatically
avoid such long CPU usage caused by large snapshot dropping.

But talking about the XFS corruption, I'm not sure how btrfs can lead to
problems in XFS...

Thanks,
Qu

[1147997.255020] INFO: task happywriter:3425205 blocked for more than
120 seconds.
[1147997.255088]       Not tainted 5.19.0-38-generic #39~22.04.1-Ubuntu
[1147997.255114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1147997.255144] task:happywriter state:D stack:    0 pid:3425205
ppid:557021 flags:0x00004000
[1147997.255151] Call Trace:
[1147997.255155]  <TASK>
[1147997.255160]  __schedule+0x257/0x5d0
[1147997.255169]  ? __schedule+0x25f/0x5d0
[1147997.255173]  schedule+0x68/0x110
[1147997.255176]  rwsem_down_write_slowpath+0x2ee/0x5a0
[1147997.255180]  ? check_heap_object+0x100/0x1e0
[1147997.255185]  down_write+0x4f/0x70
[1147997.255189]  do_unlinkat+0x12b/0x2d0
[1147997.255194]  __x64_sys_unlink+0x42/0x70
[1147997.255197]  ? syscall_exit_to_user_mode+0x2a/0x50
[1147997.255201]  do_syscall_64+0x59/0x90
[1147997.255204]  ? do_syscall_64+0x69/0x90
[1147997.255207]  ? sysvec_call_function_single+0x4e/0xb0
[1147997.255211]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[1147997.255216] RIP: 0033:0x1202a57
[1147997.255220] RSP: 002b:00007fe467ffd4c8 EFLAGS: 00000246 ORIG_RAX:
0000000000000057
[1147997.255224] RAX: ffffffffffffffda RBX: 00007fe3a4e94d28 RCX:
0000000001202a57
[1147997.255226] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
00007fe450054b60
[1147997.255228] RBP: 00007fe450054b60 R08: 0000000000000000 R09:
00000000000000e8
[1147997.255231] R10: 0000000000000001 R11: 0000000000000246 R12:
00007fe467ffd5b0
[1147997.255233] R13: 00007fe467ffd5f0 R14: 00007fe467ffd5e0 R15:
00007fe3a4e94d28
[1147997.255239]  </TASK>
[1148118.087966] INFO: task happywriter:3425205 blocked for more than
241 seconds.
[1148118.088022]       Not tainted 5.19.0-38-generic #39~22.04.1-Ubuntu
[1148118.088048] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1148118.088077] task:happywriter state:D stack:    0 pid:3425205
ppid:557021 flags:0x00004000
[1148118.088083] Call Trace:
[1148118.088087]  <TASK>
[1148118.088093]  __schedule+0x257/0x5d0
[1148118.088101]  ? __schedule+0x25f/0x5d0
[1148118.088105]  schedule+0x68/0x110
[1148118.088108]  rwsem_down_write_slowpath+0x2ee/0x5a0
[1148118.088113]  ? check_heap_object+0x100/0x1e0
[1148118.088118]  down_write+0x4f/0x70
[1148118.088121]  do_unlinkat+0x12b/0x2d0
[1148118.088126]  __x64_sys_unlink+0x42/0x70
[1148118.088129]  ? syscall_exit_to_user_mode+0x2a/0x50
[1148118.088133]  do_syscall_64+0x59/0x90
[1148118.088136]  ? do_syscall_64+0x69/0x90
[1148118.088139]  ? sysvec_call_function_single+0x4e/0xb0
[1148118.088142]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[1148118.088148] RIP: 0033:0x1202a57
[1148118.088151] RSP: 002b:00007fe467ffd4c8 EFLAGS: 00000246 ORIG_RAX:
0000000000000057
[1148118.088155] RAX: ffffffffffffffda RBX: 00007fe3a4e94d28 RCX:
0000000001202a57
[1148118.088158] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
00007fe450054b60
[1148118.088160] RBP: 00007fe450054b60 R08: 0000000000000000 R09:
00000000000000e8
[1148118.088162] R10: 0000000000000001 R11: 0000000000000246 R12:
00007fe467ffd5b0
[1148118.088164] R13: 00007fe467ffd5f0 R14: 00007fe467ffd5e0 R15:
00007fe3a4e94d28
[1148118.088170]  </TASK>
[1148238.912688] INFO: task kcompactd0:70 blocked for more than 120 seconds.
[1148238.912741]       Not tainted 5.19.0-38-generic #39~22.04.1-Ubuntu
[1148238.912767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1148238.912796] task:kcompactd0      state:D stack:    0 pid:   70
ppid:     2 flags:0x00004000
[1148238.912803] Call Trace:
[1148238.912806]  <TASK>
[1148238.912812]  __schedule+0x257/0x5d0
[1148238.912821]  schedule+0x68/0x110
[1148238.912824]  io_schedule+0x46/0x80
[1148238.912827]  folio_wait_bit_common+0x14c/0x3a0
[1148238.912833]  ? filemap_invalidate_unlock_two+0x50/0x50
[1148238.912838]  __folio_lock+0x17/0x30
[1148238.912842]  __unmap_and_move.constprop.0+0x39c/0x640
[1148238.912847]  ? free_unref_page+0xe3/0x190
[1148238.912852]  ? move_freelist_tail+0xe0/0xe0
[1148238.912855]  ? move_freelist_tail+0xe0/0xe0
[1148238.912858]  unmap_and_move+0x7d/0x4e0
[1148238.912862]  migrate_pages+0x3b8/0x770
[1148238.912867]  ? move_freelist_tail+0xe0/0xe0
[1148238.912870]  ? isolate_freepages+0x2f0/0x2f0
[1148238.912874]  compact_zone+0x2ca/0x620
[1148238.912878]  proactive_compact_node+0x8a/0xe0
[1148238.912883]  kcompactd+0x21c/0x4e0
[1148238.912886]  ? destroy_sched_domains_rcu+0x40/0x40
[1148238.912892]  ? kcompactd_do_work+0x240/0x240
[1148238.912896]  kthread+0xeb/0x120
[1148238.912900]  ? kthread_complete_and_exit+0x20/0x20
[1148238.912904]  ret_from_fork+0x1f/0x30
[1148238.912911]  </TASK>
[1148238.913163] INFO: task happywriter:3425205 blocked for more than
362 seconds.
[1148238.913195]       Not tainted 5.19.0-38-generic #39~22.04.1-Ubuntu
[1148238.913220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1148238.913250] task:happywriter state:D stack:    0 pid:3425205
ppid:557021 flags:0x00004000
[1148238.913254] Call Trace:
[1148238.913256]  <TASK>
[1148238.913259]  __schedule+0x257/0x5d0
[1148238.913264]  ? __schedule+0x25f/0x5d0
[1148238.913267]  schedule+0x68/0x110
[1148238.913270]  rwsem_down_write_slowpath+0x2ee/0x5a0
[1148238.913276]  ? check_heap_object+0x100/0x1e0
[1148238.913280]  down_write+0x4f/0x70
[1148238.913284]  do_unlinkat+0x12b/0x2d0
[1148238.913288]  __x64_sys_unlink+0x42/0x70
[1148238.913292]  ? syscall_exit_to_user_mode+0x2a/0x50
[1148238.913296]  do_syscall_64+0x59/0x90
[1148238.913299]  ? do_syscall_64+0x69/0x90
[1148238.913301]  ? sysvec_call_function_single+0x4e/0xb0
[1148238.913305]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[1148238.913310] RIP: 0033:0x1202a57
[1148238.913315] RSP: 002b:00007fe467ffd4c8 EFLAGS: 00000246 ORIG_RAX:
0000000000000057
[1148238.913319] RAX: ffffffffffffffda RBX: 00007fe3a4e94d28 RCX:
0000000001202a57
[1148238.913321] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
00007fe450054b60
[1148238.913323] RBP: 00007fe450054b60 R08: 0000000000000000 R09:
00000000000000e8
[1148238.913325] R10: 0000000000000001 R11: 0000000000000246 R12:
00007fe467ffd5b0
[1148238.913328] R13: 00007fe467ffd5f0 R14: 00007fe467ffd5e0 R15:
00007fe3a4e94d28
[1148238.913333]  </TASK>
[1148238.913429] INFO: task kworker/u16:20:3402199 blocked for more
than 120 seconds.
[1148238.913459]       Not tainted 5.19.0-38-generic #39~22.04.1-Ubuntu
[1148238.913496] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1148238.913527] task:kworker/u16:20  state:D stack:    0 pid:3402199
ppid:     2 flags:0x00004000
[1148238.913533] Workqueue: events_unbound io_ring_exit_work
[1148238.913539] Call Trace:
[1148238.913541]  <TASK>
[1148238.913544]  __schedule+0x257/0x5d0
[1148238.913548]  schedule+0x68/0x110
[1148238.913551]  schedule_timeout+0x122/0x160
[1148238.913556]  __wait_for_common+0x8f/0x190
[1148238.913559]  ? usleep_range_state+0xa0/0xa0
[1148238.913564]  wait_for_completion+0x24/0x40
[1148238.913567]  io_ring_exit_work+0x186/0x1e9
[1148238.913571]  ? io_uring_del_tctx_node+0xbf/0xbf
[1148238.913576]  process_one_work+0x21c/0x400
[1148238.913579]  worker_thread+0x50/0x3f0
[1148238.913583]  ? rescuer_thread+0x3a0/0x3a0
[1148238.913586]  kthread+0xeb/0x120
[1148238.913590]  ? kthread_complete_and_exit+0x20/0x20
[1148238.913594]  ret_from_fork+0x1f/0x30
[1148238.913600]  </TASK>
[1148238.913607] INFO: task stat:3434604 blocked for more than 120 seconds.
[1148238.913633]       Not tainted 5.19.0-38-generic #39~22.04.1-Ubuntu
[1148238.913658] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1148238.913687] task:stat            state:D stack:    0 pid:3434604
ppid:3396625 flags:0x00004004
[1148238.913691] Call Trace:
[1148238.913693]  <TASK>
[1148238.913695]  __schedule+0x257/0x5d0
[1148238.913699]  schedule+0x68/0x110
[1148238.913703]  request_wait_answer+0x13f/0x220
[1148238.913708]  ? destroy_sched_domains_rcu+0x40/0x40
[1148238.913713]  fuse_simple_request+0x1bb/0x370
[1148238.913717]  fuse_do_getattr+0xda/0x320
[1148238.913720]  ? try_to_unlazy+0x5b/0xd0
[1148238.913726]  fuse_update_get_attr+0xb3/0xf0
[1148238.913730]  fuse_getattr+0x87/0xd0
[1148238.913733]  vfs_getattr_nosec+0xba/0x100
[1148238.913737]  vfs_statx+0xa9/0x140
[1148238.913741]  vfs_fstatat+0x59/0x80
[1148238.913744]  __do_sys_newlstat+0x38/0x80
[1148238.913750]  __x64_sys_newlstat+0x16/0x20
[1148238.913753]  do_syscall_64+0x59/0x90
[1148238.913757]  ? handle_mm_fault+0xba/0x2a0
[1148238.913761]  ? exit_to_user_mode_prepare+0xaf/0xd0
[1148238.913765]  ? irqentry_exit_to_user_mode+0x9/0x20
[1148238.913769]  ? irqentry_exit+0x43/0x50
[1148238.913772]  ? exc_page_fault+0x92/0x1b0
[1148238.913776]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[1148238.913780] RIP: 0033:0x7fa76960d6ea
[1148238.913783] RSP: 002b:00007ffe6d011878 EFLAGS: 00000246 ORIG_RAX:
0000000000000006
[1148238.913787] RAX: ffffffffffffffda RBX: 0000000000000001 RCX:
00007fa76960d6ea
[1148238.913789] RDX: 00007ffe6d011890 RSI: 00007ffe6d011890 RDI:
00007ffe6d01290b
[1148238.913791] RBP: 00007ffe6d011a70 R08: 0000000000000001 R09:
0000000000000000
[1148238.913793] R10: fffffffffffff3ce R11: 0000000000000246 R12:
0000000000000000
[1148238.913795] R13: 000055feff0d2960 R14: 00007ffe6d011a68 R15:
00007ffe6d01290b
[1148238.913800]  </TASK>