[PROBLEM] Patch "btrfs: don't let new writes to trigger autodefrag on the same inode" now makes autodefrag really only to scan one inode once per autodefrag run. That patch works mostly fine, as the trace events show their intervals are almost 30s: (only showing the root 257 ino 329891 start 0) 486.810041: defrag_file_start: root=257 ino=329891 start=0 len=754728960 506.407089: defrag_file_start: root=257 ino=329891 start=0 len=754728960 536.463320: defrag_file_start: root=257 ino=329891 start=0 len=754728960 539.721309: defrag_file_start: root=257 ino=329891 start=0 len=754728960 569.741417: defrag_file_start: root=257 ino=329891 start=0 len=754728960 594.832649: defrag_file_start: root=257 ino=329891 start=0 len=754728960 624.258214: defrag_file_start: root=257 ino=329891 start=0 len=754728960 654.856264: defrag_file_start: root=257 ino=329891 start=0 len=754728960 684.943029: defrag_file_start: root=257 ino=329891 start=0 len=754728960 715.288662: defrag_file_start: root=257 ino=329891 start=0 len=754728960 But there are some outlawers, like 536s->539s, it's only 3s, not the 30s default commit interval. [CAUSE] There are several call sites which can wake up transaction kthread early, while transaction kthread itself can skip transaction if its timer doesn't reach commit interval, but cleaner is always woken up. This means each time transaction ktreahd gets woken up, we also trigger autodefrag, even transaction kthread chooses to skip its work. This is not a good behavior for files under heavy write load, as we waste extra IO/CPU while the defragged extents can easily get fragmented again. [FIX] In btrfs_run_defrag_inodes(), we check if our current time is larger than last run + commit_interval. If not, skip this run and wait for next opportunity. This patch along with patch "btrfs: don't let new writes to trigger autodefrag on the same inode" are mostly for backport to v5.16. This is just to reduce the unnecessary IO/CPU caused by autodefrag, the final solution would be allowing users to change autodefrag scan interval and target extent size. Cc: stable@xxxxxxxxxxxxxxx # 5.16+ Signed-off-by: Qu Wenruo <wqu@xxxxxxxx> --- fs/btrfs/ctree.h | 1 + fs/btrfs/file.c | 11 +++++++++++ 2 files changed, 12 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index a8a3de10cead..44116a47307e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -899,6 +899,7 @@ struct btrfs_fs_info { /* auto defrag inodes go here */ spinlock_t defrag_inodes_lock; + u64 defrag_last_run_ksec; struct rb_root defrag_inodes; atomic_t defrag_running; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index abba1871e86e..a852754f5601 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -312,9 +312,20 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info) { struct inode_defrag *defrag; struct rb_root defrag_inodes; + u64 ksec = ktime_get_seconds(); u64 first_ino = 0; u64 root_objectid = 0; + /* + * If cleaner get woken up early, skip this run to avoid frequent + * re-dirty, which is not really useful for heavy writes. + * + * TODO: Make autodefrag to happen in its own thread. + */ + if (ksec - fs_info->defrag_last_run_ksec < fs_info->commit_interval) + return 0; + fs_info->defrag_last_run_ksec = ksec; + atomic_inc(&fs_info->defrag_running); spin_lock(&fs_info->defrag_inodes_lock); /* -- 2.35.1