On 2021/5/18 10:57 PM, Theodore Y. Ts'o wrote:
> On Tue, May 18, 2021 at 09:19:13AM +0800, Wang Jianchao wrote:
>>> That way we don't need to move all of this to a kworker context.
>>
>> The submit_bio also needs to be out of jbd2 commit kthread as it may be
>> blocked due to blk-wbt or not enough request tags. ;)
>
> Actually, there's a bigger deal that I hadn't realized, about why we
> are currently using submit_bio_wait().  We *must* wait until
> discard has completed before we call ext4_free_data_in_buddy(), which
> is what allows those blocks to be reused by the block allocator.
>
> If the discard happens after we reallocate the block, there is a good
> chance that we will end up corrupting a data or metadata block,
> leading to user data loss.

Yes.

>
> There's another corollary to this; if you use blk-wbt, and you are
> doing lots of deletes, and we move this all to a writeback thread,
> this *significantly* increases the chance that the user will see
> ENOSPC errors in the case where they are running with a very full
> (close to 100% used) file system.

In this patch we flush the kwork that does the discard. That is done
in ext4_should_retry_alloc().

>
> I'd argue that this is a *really* good reason why using mount -o
> discard is Just A Bad Idea if you are running with blk-wbt.  If
> discards are slow, using fstrim is a much better choice.  It's also
> the case that for most SSDs and workloads, doing frequent discards
> doesn't actually help that much.  The write endurance of the device is
> not compromised that much if you only run fstrim and discard unused
> blocks once a day, or even once a week --- I only recommend use of
> mount -o discard in cases where the discard operation is effectively
> free.  (e.g., in cases where the FTL is implemented on the Host OS, or
> you are running with super-fast flash which is PCIe or NVMe attached.)

We're running ext4 with discard on an nbd device whose backend is a
storage cluster. The discard helps return unused space to the storage
pool. Sometimes an application deletes a lot of data and the device is
flooded with discards; we then see the jbd2 commit kthread blocked for
a long time. Even with the discard moved out of jbd2, we still see that
the write IO of the jbd2 log can be blocked. blk-wbt helps to relieve
this. In the end the delay is shifted to the allocation path, but this
is better than blocking the page fault path, which holds mm->mmap_sem
for read.

Best regards
Jianchao

>
> Cheers,
>
> - Ted
>
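
For illustration only, the ordering rule discussed in this thread (a
discard must complete before its blocks become reusable, and pending
discard work is flushed before an allocation gives up with ENOSPC) can
be modelled in userspace roughly as below. This is a toy sketch, not
ext4 code: every name in it (toy_fs, toy_queue_discard,
toy_flush_discards, toy_alloc) is hypothetical, and the "discard" is
just a counter update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT ext4 code: all names here are hypothetical. */
#define TOY_MAX_PENDING 16

struct toy_fs {
	unsigned long free_blocks;              /* blocks the allocator may hand out */
	unsigned long pending[TOY_MAX_PENDING]; /* extent sizes still awaiting discard */
	size_t npending;                        /* number of queued extents */
	unsigned long discards_done;            /* blocks whose discard has completed */
};

/* A delete commits: queue the extent for discard instead of freeing it now. */
bool toy_queue_discard(struct toy_fs *fs, unsigned long nblocks)
{
	if (fs->npending == TOY_MAX_PENDING)
		return false;
	fs->pending[fs->npending++] = nblocks;
	return true;
}

/* Deferred worker: discard first, only then release blocks to the allocator. */
void toy_flush_discards(struct toy_fs *fs)
{
	assert(fs->npending <= TOY_MAX_PENDING);
	for (size_t i = 0; i < fs->npending; i++) {
		fs->discards_done += fs->pending[i]; /* stands in for a completed discard */
		fs->free_blocks += fs->pending[i];   /* stands in for freeing into the buddy */
	}
	fs->npending = 0;
}

/* Allocation: on apparent ENOSPC, flush pending discards before giving up. */
bool toy_alloc(struct toy_fs *fs, unsigned long nblocks)
{
	if (fs->free_blocks < nblocks && fs->npending > 0)
		toy_flush_discards(fs);
	if (fs->free_blocks < nblocks)
		return false; /* genuine ENOSPC */
	fs->free_blocks -= nblocks;
	return true;
}
```

In this model a nearly full filesystem with queued deletes does not
report ENOSPC early, because toy_alloc() drains the queue first. The
cost is that the discard latency is paid on the allocation path, which
is the trade-off described above.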