Write operations stalled and split into 4KB sequential during blktrace replay containing trims

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Need help to understand and resolve a blktrace replay with trims issue.  Sorry, this is a rather long note, but puzzling issue: 

The problem: 
If I replay a blktrace containing trims and writes, the trims occur at the expected timing and size, but all of the otherwise random writes are split into groups of 4KB sequential writes, and are intermittently delayed (for seconds) related to the trim operations.  Fio debug io,blktrace, and write_iops_log indicates fio is replaying the operations correctly, but a blktrace of the blktrace replay indicates operations writes do not happen as recorded by fio.  


How to reproduce: 
Building fio from head (with proposed patch https://github.com/axboe/fio/pull/777 to fix assert error when blktrace replay contains trims)
Note: I've tried this several times with various OS levels and fio revision and mapping to one cpu - failures seem consistent

FIO command to produce the replay blktrace
----------------------------------------------------------------
Fio two jobs to perform 256MB Random Trims along with 256KB Random Writes at the same 64MB BW rate, while collecting blktrace.  
> blktrace -d /dev/nvme0n1 -o p2-trimwrites-gen 
> fio --output="p2-trimwrites-gen.out" --output-format=normal --direct=1 --runtime=30s --filename="/dev/nvme0n1" --ioengine=libaio  --numa_cpu_nodes=0 --numa_mem_policy="bind:0" --overwrite=1 --unified_rw_reporting=0 --name="Writes" --bs=256KB  --rate=64MB --rw=randwrite --write_iops_log="p2-trimwrites-gen-writes" -name="Trims" --bs=256MB --rate=64MB --rw=randtrim  --write_iops_log="p2-trimwrites-gen-trim"
> blkparse -i p2-trimwrites-gen -o /dev/null -d p2-trimwrites-gen.bin

The actual write operations are restricted according to max_hw_sectors_kb to 127, so the writes are split into 3 sequential operations: 248+248+16 blocks as indicated in the block trace.
>   blkparse -i p2-trimwrites-gen -o p2-trimwrites-gen.blkparsed
>   grep ' Q ' p2-trimwrites-gen.blkparsed > p2-trimwrites-gen.blkparsedq        # helps to see timing and sequence
>   egrep 'Q|C' p2-trimwrites-gen.blkparsed > p2-trimwrites-gen.blkparsedqc  # include completions - nothing surprising 


Blktrace Replay
--------------------------------------------
When the original blktrace is replayed while collecting another blktrace to check:
> blktrace -d /dev/nvme0n1 -o p2-fio-twreplay &
> fio --output="p2-fio-twreplay.out" --output-format=normal --direct=1 --filename="/dev/nvme0n1" --ioengine=libaio --numa_cpu_nodes=0 --numa_mem_policy="bind:0" --overwrite=1 --unified_rw_reporting=0 --log_offset=1 --name="TrimWrite" --write_iops_log="p2-fio-twreplay"  --replay_redirect=/dev/nvme0n1 --read_iolog="p2-trimwrites-gen.bin"
> blkparse -i p2-fio-twreplay -o p2-fio-twreplay.blkparsed
> grep ' Q ' p2-fio-twreplay.blkparsed > p2-fio-twreplay.blkparsedq    
> egrep 'Q|C' p2-fio-twreplay.blkparsed > p2-fio-twreplay.blkparsedqc

The fio io log indicates that fio is issuing operations with timing and sizes as indicated in original blktrace  (1st two deallocates occur quickly while fio adjust to rate, then they occur 4 seconds apart, with intervening write operations). 
2nd deallocate at 353 ms: 
    279, 1, 1, 126976, 447737622528
    279, 1, 1, 8192, 447737749504
    283, 1, 1, 126976, 352846086144
    283, 1, 1, 126976, 352846213120
    283, 1, 1, 8192, 352846340096
    353, 1, 2, 268435456, 145492017152    <- 2nd deallocate
    355, 1, 1, 126976, 644440915968           <- 
    355, 1, 1, 126976, 644441042944           <- Expected write operation split
    355, 1, 1, 8192, 644441169920               <- 
    359, 1, 1, 126976, 289736491008
    359, 1, 1, 126976, 289736617984

The fio result statistics also seem to indicate that all is well, with trim and write rates somewhat close to expectations: 
TrimWrite: (groupid=0, jobs=1): err= 0: pid=13167: Fri May 24 10:48:29 2019
  write: IOPS=674, BW=56.2MiB/s (58.9MB/s)(1920MiB/34156msec)
...
  trim: IOPS=0, BW=67.5MiB/s (70.7MB/s)(2304MiB/34156msec)


HOWEVER, the blktrace gathered during the replay indicates reality is different.  
The deallocates occur with the expected timing and sizes, but the writes are delayed and occur as 4KB sequential writes, rather than the fio recorded 124K+124k+8k writes
    259,0    6      136     0.133395386 13167  Q   D 398458880 + 524288 [fio]                 <- first two deallocates (they are not waiting on completions, but I don't now here)
    259,0    6      140     0.273302506 13167  Q   D 284164096 + 524288 [fio]
    259,0    6      144     5.211904696 10263  Q   W 1602560 + 8 [kworker/u97:0]         <- writes held off by something (prior deallocate has long ago completed), and writes become 4KB sequentials
    259,0    6      147     5.211914136 10263  Q   W 1602568 + 8 [kworker/u97:0]
    259,0    6      150     5.211916318 10263  Q   W 1602576 + 8 [kworker/u97:0]
    259,0    6      153     5.211918422 10263  Q   W 1602584 + 8 [kworker/u97:0]
    259,0    6      156     5.211920629 10263  Q   W 1602592 + 8 [kworker/u97:0]
...
    259,0    4   154632     5.910154830 10263  Q   W 1563763184 + 8 [kworker/u97:0]
    259,0    4   154636     5.910159723 10263  Q   W 1563763192 + 8 [kworker/u97:0]
    259,0    6    87359     8.623207082 13167  Q   D 939524096 + 524288 [fio]            <- latter deallocates also bundled (writes held up seemingly till all deallocates complete) 
    259,0    6    87363    12.865571718 13167  Q   D 61341696 + 524288 [fio]
    259,0    6    87367    17.108236841 13167  Q   D 708313088 + 524288 [fio]
    259,0   12        1    21.350936497 13167  Q   D 965214208 + 524288 [fio]
    259,0   12        5    25.593485851 13167  Q   D 1087897600 + 524288 [fio]
    259,0   12        9    29.836349325 13167  Q   D 445120512 + 524288 [fio]
    259,0    4   154640    30.955897677 10263  Q   W 1563776000 + 8 [kworker/u97:0]   <- writes finish up without delays, again only as 4KB sequential groups.
    259,0    4   154643    30.955909105 10263  Q   W 1563776008 + 8 [kworker/u97:0]
    259,0    4   154646    30.955911370 10263  Q   W 1563776016 + 8 [kworker/u97:0]


An fio debug=blktrace,io didn't seem to show anything different than the io log.

It seems likely there is some interaction between fio and the kernel or driver, though it appears to only occur replay.

The write io problem DOES NOT occur if the trims are filtered out of the blktrace  (--replay_skip=trim).   Then the write operations rate and sizes occur as the expected 124K+124k+8k write groups

Any ideas?  I've also tried with centos 7.2, 7.3, 7.4 - which are kernels 3.10... 
Thanks

Kris Davis






[Index of Archives]     [Linux Kernel]     [Linux SCSI]     [Linux IDE]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux SCSI]

  Powered by Linux