On Fri, Aug 18, 2023 at 01:31:35AM +0000, Lu, Davina wrote:

>> Looks like this is a similar issue I saw before with an fio test (buffered
>> IO with 100 threads); it also showed the "ext4-rsv-conversion" workqueue
>> taking a lot of CPU and making journal updates get stuck.

> Given the stack traces, it is very much a different problem.

I see. I thought it might be the same since both are related to converting
unwritten extents to written extents. I didn't look into the details of that
stall, though.

>> There is a patch, can you see if this is the same issue? This is not the
>> final patch since Ted raised some issues with it. I will forward that
>> email to you in a separate thread. I didn't continue with this patch at
>> the time since we thought it might not be the real case in RDS.

> The patch which you've included is dangerous and can cause file system
> corruption.  See my reply at [1], and your corrected patch which addressed
> my concern at [2].  If folks want to try a patch, please use the one at
> [2], and not the one you quoted in this thread, since it's missing
> critically needed locking.
>
> [1] https://lore.kernel.org/r/YzTMZ26AfioIbl27@xxxxxxx
> [2] https://lore.kernel.org/r/53153bdf0cce4675b09bc2ee6483409f@xxxxxxxxxx
>
> The reason why we never pursued it is because (a) at one of our weekly
> ext4 video chats, I was informed by Oleg Kiselev that the performance
> issue was addressed in a different way, and (b) I'd want to reproduce the
> issue on a machine under my control so I could understand what was going
> on and so we could examine the dynamics of what was happening with and
> without the patch.  So I would have needed to know how many CPUs and what
> kind of storage device (HDD? SSD? md-raid? etc.) was in use, in addition
> to the fio recipe.

Thanks for pointing that out; I almost forgot I did this version 2.

How to replicate this issue:

The CPU is x86_64, 64 cores, 2.50 GHz; memory is 256 GB (it is a VM, though).
One NVMe device is attached (no LVM, DRBD, etc.) with 64000 IOPS and 16 GiB.
I can also replicate it with a 10000 IOPS, 1000 GiB NVMe volume.

Run the fio test:

1. Create the files first, with fio or dd; the fio command is:

   /usr/bin/fio --name=16kb_rand_write_only_2048_jobs --directory=/rdsdbdata \
       --rw=randwrite --ioengine=sync --buffered=1 --bs=16k --max-jobs=2048 \
       --numjobs=$1 --runtime=30 --thread --filesize=28800000 --fsync=1 \
       --group_reporting --create_only=1 > /dev/null

2. Drop the page cache:

   echo 1 | sudo tee /proc/sys/vm/drop_caches

3. Run the workload:

   fio --name=16kb_rand_write_only_2048_jobs --directory=/rdsdbdata \
       --rw=randwrite --ioengine=sync --buffered=1 --bs=16k --max-jobs=2048 \
       --numjobs=2048 --runtime=60 --time_based --thread --filesize=28800000 \
       --fsync=1 --group_reporting

You can see the IOPS drop from ~17K to:

Jobs: 2048 (f=2048): [w(2048)] [13.3% done] [0KB/1296KB/0KB /s] [0/81/0 iops] [eta 00m:52s]   <----- IOPS drops to less than 100

The filesystem is created and mounted with:

   mke2fs -m 1 -t ext4 -b 4096 -L /rdsdbdata /dev/nvme5n1 -J size=128
   mount -o rw,noatime,nodiratime,data=ordered /dev/nvme5n1 /rdsdbdata

Yes, Oleg is correct, there is another way to solve this: enlarge the journal
size from 128 MB to 2 GB. But it looks like this is not a typical issue for
the RDS workload, so we didn't pursue it much further.
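For what it's worth, here is a minimal sketch of how that larger-journal
setup could be created. These are my illustrative commands, not ones from
this thread; they reuse the /dev/nvme5n1 device and mke2fs options from the
recipe above, with size=2048 to match the 2 GB figure:

   # At mkfs time: create the filesystem with a 2 GB journal instead of 128 MB
   mke2fs -m 1 -t ext4 -b 4096 -L /rdsdbdata -J size=2048 /dev/nvme5n1

   # Or on an existing, unmounted (and clean) filesystem: remove the old
   # journal and re-create it at the larger size
   tune2fs -O ^has_journal /dev/nvme5n1
   tune2fs -O has_journal -J size=2048 /dev/nvme5n1

A larger journal simply gives add_transaction_credits() more room before
writers have to stall waiting for a commit, which matches the analysis below.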
What I can find is: the journal doesn't have enough space (it cannot buffer
much), so it has to wait for all the current transactions to complete, in
add_transaction_credits() below:

	if (needed > journal->j_max_transaction_buffers / 2) {
		jbd2_might_wait_for_commit(journal);
		wait_event(journal->j_wait_reserved,
			   atomic_read(&journal->j_reserved_credits) + rsv_blocks
			   <= journal->j_max_transaction_buffers / 2);

And the journal lock journal->j_state_lock shows as stuck for a long time.
I am not sure why "ext4-rsv-conversion" also plays a role here; it should be
triggered by ext4_writepages(). But what I can see is that when the journal
lock is stuck, every core's utilization is almost 100% and
ext4-rsv-conversion shows up at that time.

> Finally, I'm a bit nervous about setting the internal __WQ_ORDERED flag
> with max_active > 1.  What was that all about, anyway?

Yes, you are correct. I didn't use "__WQ_ORDERED" carefully; it is better not
to use it with max_active > 1. My purpose was to try to guarantee that the
work items are executed sequentially on each core.

Thanks
Davina Lu
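P.S. On the workqueue flags: alloc_ordered_workqueue() already implies
__WQ_ORDERED with max_active forced to 1, which is why setting __WQ_ORDERED
together with max_active > 1 is inconsistent. The sketch below is only my
illustration, not the patch; the "mainline-style" flags are from my memory of
fs/ext4/super.c and the rsv_wq_* helper names are made up for the example:

    #include <linux/cpumask.h>
    #include <linux/workqueue.h>

    /* Mainline-style: at most one conversion work item in flight. */
    static struct workqueue_struct *rsv_wq_single(void)
    {
            return alloc_workqueue("ext4-rsv-conversion",
                                   WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
    }

    /* Strict ordering: use the helper instead of open-coding __WQ_ORDERED. */
    static struct workqueue_struct *rsv_wq_ordered(void)
    {
            return alloc_ordered_workqueue("ext4-rsv-conversion",
                                           WQ_MEM_RECLAIM);
    }

    /* More concurrency: raise max_active, but drop any ordering flag. */
    static struct workqueue_struct *rsv_wq_concurrent(void)
    {
            return alloc_workqueue("ext4-rsv-conversion",
                                   WQ_MEM_RECLAIM | WQ_UNBOUND,
                                   num_online_cpus());
    }

If the intent was simply more concurrency for the conversion work, the last
form (higher max_active, no __WQ_ORDERED) is the straightforward route; an
ordered workqueue by definition runs only one work item at a time.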