Hi Jeff, On 5 December 2017 at 00:01, Jeff Furlong <jeff.furlong@xxxxxxx> wrote: > > In commit "Add support for doing total latency percentiles" (b599759ba565e7f2f573af364e6da4fe6d556a90), the total latency percentiles option was added. However, some workloads appear broken: > > # fio --name=test --ioengine=libaio --direct=1 --rw=randread --iodepth=1 --size=100% --bs=4k --filename=/dev/nvme0n1 --runtime=5s --write_lat_log=test --disable_lat=0 --disable_clat=1 --disable_slat=1 --lat_percentiles=1 > test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.2-66-g67bfe > Starting 1 process > Jobs: 1 (f=1): [r(1)][100.0%][r=351MiB/s,w=0KiB/s][r=89.9k,w=0 IOPS][eta 00m:00s] > test: (groupid=0, jobs=1): err= 0: pid=6879: Mon Dec 4 15:46:04 2017 > read: IOPS=89.7k, BW=351MiB/s (368MB/s)(1753MiB/5001msec) > lat (usec): min=9, max=2259, avg=10.67, stdev= 7.66 > lat percentiles (nsec): > | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], > | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0], > | 70.00th=[ 0], 80.00th=[ 0], 90.00th=[ 0], 95.00th=[ 0], > | 99.00th=[ 0], 99.50th=[ 0], 99.90th=[ 0], 99.95th=[ 0], > | 99.99th=[ 0] > > In the above, we try to enable only lat log, so disable clat and slat. The summary output shows lat, but the values are zero. I guess the code that collects those percentiles is always looking at whether clat is enabled... > # fio --name=test --ioengine=libaio --direct=1 --rw=randread --iodepth=1 --size=100% --bs=4k --filename=/dev/nvme0n1 --runtime=5s --write_lat_log=test --disable_lat=0 --disable_clat=0 --disable_slat=1 --lat_percentiles=1 > test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.2-66-g67bfe > Starting 1 process > Jobs: 1 (f=1): [r(1)][100.0%][r=356MiB/s,w=0KiB/s][r=91.1k,w=0 IOPS][eta 00m:00s] > test: (groupid=0, jobs=1): err= 0: pid=6956: Mon Dec 4 15:47:43 2017 > read: IOPS=87.8k, BW=343MiB/s (360MB/s)(1715MiB/5001msec) > clat (nsec): min=561, max=2258.3k, avg=9941.30, stdev=6660.37 > lat (usec): min=9, max=2259, avg=10.88, stdev= 7.52 > lat percentiles (nsec): > | 1.00th=[10304], 5.00th=[10304], 10.00th=[10304], 20.00th=[10304], > | 30.00th=[10432], 40.00th=[10432], 50.00th=[10432], 60.00th=[10432], > | 70.00th=[10432], 80.00th=[10944], 90.00th=[11072], 95.00th=[14784], > | 99.00th=[15296], 99.50th=[15424], 99.90th=[16064], 99.95th=[17024], > | 99.99th=[24704] > > In the above, we force clat on, and now the lat values are non-zero in the summary output. > > In the commit for stat.c, I see: > > if (ts->lat_percentiles) > add_clat_percentile_sample(ts, nsec, ddir); > > Does that seem correct? Yes it looks correct - it's just the function is a bit misnamed. fio stores all its latency percentile information in the same place since you can only have either completion percentiles or total percentiles but not both at the same time. After digging about for a while it looks like the problem stems from the wrong sample count being used in stat.c: 419 static void show_ddir_status(struct group_run_stats *rs, struct thread_stat *ts, 420 int ddir, struct buf_output *out) 421 { [...] 461 462 if (ts->clat_percentiles || ts->lat_percentiles) { 463 show_clat_percentiles(ts->io_u_plat[ddir], 464 ts->clat_stat[ddir].samples, 465 ts->percentile_list, 466 ts->percentile_precision, 467 ts->clat_percentiles, out); fio is always using the clat_stat samples count rather than choosing the count based on the type of percentile being requested. https://github.com/axboe/fio/pull/505 should hopefully address this (and disables percentile gathering if that type of latency has been disabled). -- Sitsofe | http://sucs.org/~sits/ -- To unsubscribe from this list: send the line "unsubscribe fio" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html