Need help to understand and resolve a blktrace replay with trims issue. Sorry, this is a rather long note, but puzzling issue: The problem: If I replay a blktrace containing trims and writes, the trims occur at the expected timing and size, but all of the otherwise random writes are split into groups of 4KB sequential writes, and are intermittently delayed (for seconds) related to the trim operations. Fio debug io,blktrace, and write_iops_log indicates fio is replaying the operations correctly, but a blktrace of the blktrace replay indicates operations writes do not happen as recorded by fio. How to reproduce: Building fio from head (with proposed patch https://github.com/axboe/fio/pull/777 to fix assert error when blktrace replay contains trims) Note: I've tried this several times with various OS levels and fio revision and mapping to one cpu - failures seem consistent FIO command to produce the replay blktrace ---------------------------------------------------------------- Fio two jobs to perform 256MB Random Trims along with 256KB Random Writes at the same 64MB BW rate, while collecting blktrace. > blktrace -d /dev/nvme0n1 -o p2-trimwrites-gen > fio --output="p2-trimwrites-gen.out" --output-format=normal --direct=1 --runtime=30s --filename="/dev/nvme0n1" --ioengine=libaio --numa_cpu_nodes=0 --numa_mem_policy="bind:0" --overwrite=1 --unified_rw_reporting=0 --name="Writes" --bs=256KB --rate=64MB --rw=randwrite --write_iops_log="p2-trimwrites-gen-writes" -name="Trims" --bs=256MB --rate=64MB --rw=randtrim --write_iops_log="p2-trimwrites-gen-trim" > blkparse -i p2-trimwrites-gen -o /dev/null -d p2-trimwrites-gen.bin The actual write operations are restricted according to max_hw_sectors_kb to 127, so the writes are split into 3 sequential operations: 248+248+16 blocks as indicated in the block trace. > blkparse -i p2-trimwrites-gen -o p2-trimwrites-gen.blkparsed > grep ' Q ' p2-trimwrites-gen.blkparsed > p2-trimwrites-gen.blkparsedq # helps to see timing and sequence > egrep 'Q|C' p2-trimwrites-gen.blkparsed > p2-trimwrites-gen.blkparsedqc # include completions - nothing surprising Blktrace Replay -------------------------------------------- When the original blktrace is replayed while collecting another blktrace to check: > blktrace -d /dev/nvme0n1 -o p2-fio-twreplay & > fio --output="p2-fio-twreplay.out" --output-format=normal --direct=1 --filename="/dev/nvme0n1" --ioengine=libaio --numa_cpu_nodes=0 --numa_mem_policy="bind:0" --overwrite=1 --unified_rw_reporting=0 --log_offset=1 --name="TrimWrite" --write_iops_log="p2-fio-twreplay" --replay_redirect=/dev/nvme0n1 --read_iolog="p2-trimwrites-gen.bin" > blkparse -i p2-fio-twreplay -o p2-fio-twreplay.blkparsed > grep ' Q ' p2-fio-twreplay.blkparsed > p2-fio-twreplay.blkparsedq > egrep 'Q|C' p2-fio-twreplay.blkparsed > p2-fio-twreplay.blkparsedqc The fio io log indicates that fio is issuing operations with timing and sizes as indicated in original blktrace (1st two deallocates occur quickly while fio adjust to rate, then they occur 4 seconds apart, with intervening write operations). 2nd deallocate at 353 ms: 279, 1, 1, 126976, 447737622528 279, 1, 1, 8192, 447737749504 283, 1, 1, 126976, 352846086144 283, 1, 1, 126976, 352846213120 283, 1, 1, 8192, 352846340096 353, 1, 2, 268435456, 145492017152 <- 2nd deallocate 355, 1, 1, 126976, 644440915968 <- 355, 1, 1, 126976, 644441042944 <- Expected write operation split 355, 1, 1, 8192, 644441169920 <- 359, 1, 1, 126976, 289736491008 359, 1, 1, 126976, 289736617984 The fio result statistics also seem to indicate that all is well, with trim and write rates somewhat close to expectations: TrimWrite: (groupid=0, jobs=1): err= 0: pid=13167: Fri May 24 10:48:29 2019 write: IOPS=674, BW=56.2MiB/s (58.9MB/s)(1920MiB/34156msec) ... trim: IOPS=0, BW=67.5MiB/s (70.7MB/s)(2304MiB/34156msec) HOWEVER, the blktrace gathered during the replay indicates reality is different. The deallocates occur with the expected timing and sizes, but the writes are delayed and occur as 4KB sequential writes, rather than the fio recorded 124K+124k+8k writes 259,0 6 136 0.133395386 13167 Q D 398458880 + 524288 [fio] <- first two deallocates (they are not waiting on completions, but I don't now here) 259,0 6 140 0.273302506 13167 Q D 284164096 + 524288 [fio] 259,0 6 144 5.211904696 10263 Q W 1602560 + 8 [kworker/u97:0] <- writes held off by something (prior deallocate has long ago completed), and writes become 4KB sequentials 259,0 6 147 5.211914136 10263 Q W 1602568 + 8 [kworker/u97:0] 259,0 6 150 5.211916318 10263 Q W 1602576 + 8 [kworker/u97:0] 259,0 6 153 5.211918422 10263 Q W 1602584 + 8 [kworker/u97:0] 259,0 6 156 5.211920629 10263 Q W 1602592 + 8 [kworker/u97:0] ... 259,0 4 154632 5.910154830 10263 Q W 1563763184 + 8 [kworker/u97:0] 259,0 4 154636 5.910159723 10263 Q W 1563763192 + 8 [kworker/u97:0] 259,0 6 87359 8.623207082 13167 Q D 939524096 + 524288 [fio] <- latter deallocates also bundled (writes held up seemingly till all deallocates complete) 259,0 6 87363 12.865571718 13167 Q D 61341696 + 524288 [fio] 259,0 6 87367 17.108236841 13167 Q D 708313088 + 524288 [fio] 259,0 12 1 21.350936497 13167 Q D 965214208 + 524288 [fio] 259,0 12 5 25.593485851 13167 Q D 1087897600 + 524288 [fio] 259,0 12 9 29.836349325 13167 Q D 445120512 + 524288 [fio] 259,0 4 154640 30.955897677 10263 Q W 1563776000 + 8 [kworker/u97:0] <- writes finish up without delays, again only as 4KB sequential groups. 259,0 4 154643 30.955909105 10263 Q W 1563776008 + 8 [kworker/u97:0] 259,0 4 154646 30.955911370 10263 Q W 1563776016 + 8 [kworker/u97:0] An fio debug=blktrace,io didn't seem to show anything different than the io log. It seems likely there is some interaction between fio and the kernel or driver, though it appears to only occur replay. The write io problem DOES NOT occur if the trims are filtered out of the blktrace (--replay_skip=trim). Then the write operations rate and sizes occur as the expected 124K+124k+8k write groups Any ideas? I've also tried with centos 7.2, 7.3, 7.4 - which are kernels 3.10... Thanks Kris Davis