[PROBLEM] Patch "btrfs: don't let new writes to trigger autodefrag on the same inode" now makes autodefrag really only to scan one inode once per autodefrag run. That patch works mostly fine, as the trace events show their intervals are almost 30s: (only showing the root 257 ino 329891 start 0) 486.810041: defrag_file_start: root=257 ino=329891 start=0 len=754728960 506.407089: defrag_file_start: root=257 ino=329891 start=0 len=754728960 536.463320: defrag_file_start: root=257 ino=329891 start=0 len=754728960 539.721309: defrag_file_start: root=257 ino=329891 start=0 len=754728960 569.741417: defrag_file_start: root=257 ino=329891 start=0 len=754728960 594.832649: defrag_file_start: root=257 ino=329891 start=0 len=754728960 624.258214: defrag_file_start: root=257 ino=329891 start=0 len=754728960 654.856264: defrag_file_start: root=257 ino=329891 start=0 len=754728960 684.943029: defrag_file_start: root=257 ino=329891 start=0 len=754728960 715.288662: defrag_file_start: root=257 ino=329891 start=0 len=754728960 But there are some outlawers, like 536s->539s, it's only 3s, not the 30s default commit interval. [CAUSE] There are several call sites which can wake up transaction kthread early, while transaction kthread itself can skip transaction if its timer doesn't reach commit interval, but cleaner is always woken up. This means each time transaction ktreahd gets woken up, we also trigger autodefrag, even transaction kthread chooses to skip its work. This is not a good behavior for files under heavy write load, as we waste extra IO/CPU while the defragged extents can easily get fragmented again. [FIX] In btrfs_run_defrag_inodes(), we check if our current time is larger than last run + commit_interval. If not, skip this run and wait for next opportunity. This patch along with patch "btrfs: don't let new writes to trigger autodefrag on the same inode" are mostly for backport to v5.16. This is just to reduce the unnecessary IO/CPU caused by autodefrag, the final solution would be allowing users to change autodefrag scan interval and target extent size. Cc: stable@xxxxxxxxxxxxxxx # 5.16+ Signed-off-by: Qu Wenruo <wqu@xxxxxxxx> --- fs/btrfs/ctree.h | 1 + fs/btrfs/file.c | 11 +++++++++++ 2 files changed, 12 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index a8a3de10cead..44116a47307e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -899,6 +899,7 @@ struct btrfs_fs_info { /* auto defrag inodes go here */ spinlock_t defrag_inodes_lock; + u64 defrag_last_run_ksec; struct rb_root defrag_inodes; atomic_t defrag_running; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index abba1871e86e..a852754f5601 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -312,9 +312,20 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info) { struct inode_defrag *defrag; struct rb_root defrag_inodes; + u64 ksec = ktime_get_seconds(); u64 first_ino = 0; u64 root_objectid = 0; + /* + * If cleaner get woken up early, skip this run to avoid frequent + * re-dirty, which is not really useful for heavy writes. + * + * TODO: Make autodefrag to happen in its own thread. + */ + if (ksec - fs_info->defrag_last_run_ksec < fs_info->commit_interval) + return 0; + fs_info->defrag_last_run_ksec = ksec; + atomic_inc(&fs_info->defrag_running); spin_lock(&fs_info->defrag_inodes_lock); /* -- 2.35.1