On Wed, May 19, 2021 at 09:27:56AM +0800, Wang Jianchao wrote: > > We're running ext4 with discard on a nbd device whose backend is storage > cluster. The discard can help to free the unused space to storage pool. > > And sometimes application delete a lot of data and discard is flooding. > Then we see the jbd2 commit kthread is blocked for a long time. Even > move the discard out of jbd2, we still see the write IO of jbd2 log > could be blocked. blk-wbt could help to relieve this. Finally the delay > is shift to allocation path. But this is better than blocking the page > fault path which holds the read mm->mmap_sem. I'm assuming that the problem is when the application deletes a lot of data, the discard flood is causing performance problems on your nbd server. Is that the high level problem that you are trying to solve? So if that's the case, I'd suggest a different approach. First, move kmem_cache_free(ext4_free_data_cachep, entry) out of ext4_free_data_in_buddy() to its caller, ext4_process_data. Then if discard is enabled, after calling ext4_free_data_in_buddy(), the ext4_free_data struct will be detached from rbtree rooted in ext4_group_info.bb_free_root, and then we can attach it to a new rbtree rooted in ext4_group_info.bb_discard_root. This allows the block to be reused as soon the commit is finished (allowing for potentially more efficient block allocations), but we can now keep track of which blocks would be useful for discarding and decouple that from when we release the blocks to be reused. We can now use the pre-existing fstrim kernel thread infrastructure to lock a block group, and we can now iterate over the rbtree, and take into account which blocks have since become allocated --- since if a block has been allocated, there's no need to send a discard for it. I think this will be more efficient, and will allow us to share more of the code for fstrim and the discard-at-runtime model used by "mount -o discard". We can also fine-tune how quickly we issue discards; it might be that if user has executed "rm -rf" it might actually better to wait until the deletes have completed, even if it takes several commit intervals, since it might allow us to combine discards if the blocks 100-199 and 400-500 are released in one commit, and blocks 200-399 are released two or three commits later. Something else I'd urge you to consider is whether it's possible to enhance the nbd protocol to add some kind of back-channel notification when the shared storage is getting low on space. In that case, when the nbd client code a request from the nbd server indicating, "please issue discards if possible", it could either trigger an upcall to userspace, which could then issue the fstrim ioctl, which in the case where "mount -o discard" is enabled, would accelerate when discards took place. We could then make the fstrim thread normally work on a much slower pace, but when there is a signal from the shared storage that space is needed, clients could accelerate when they issue discards to free up shared space. Cheers, - Ted P.S. One other potential thought; if we have established a new bb_discard_root rbtree, it *might* actually be beneficial to consider using that information in the block allocator. One of the best way to tell an SSD that block is no longer needed is to simply overwrite that block. If we do that, we don't need to send a discard to that block any more. Of course, we still want to keep blocks contiguous since even though seeks are free for SSD's, we want to keep large reads contiguous as much as possible, and we want to keep the extent tree as compact as possible. But if we have just released a 12k file, and we are writing a new 12k file, and don't really care *where* in the block group we are writing that file, reusing blocks that had just been freed might actually be a good strategy. That's not something you need to implement in this patch series, but it might be an interesting optimization.