On Wed, Nov 17, 2021 at 03:52:13PM +0900, Damien Le Moal wrote:
> When iops, latency, or bw logging options are used, fio will by default
> log information for any I/O that completes. The initial number of I/O
> log entries is 1024, as defined by (DEF_LOG_ENTRIES). When all log
> entries are used, new log entries are dynamically allocated by
> get_new_log(). This dynamic log entry allocation can negatively impact
> time-related statistics such as the I/O tail latencies (e.g. 99.9
> percentile completion latency) as growing the logs causes a temporary
> I/O stall (IO quiesce), which disturbs the workload steady state. The
> effect of this is especially noticeable with workloads using IO
> priorities: the tail latencies of high priority I/Os increase if the
> IO log needs to be grown.
>
> For example, running the following fio command on a SATA disk
> supporting NCQ priority:
>
> fio --name=prio-randread --filename=/dev/sdg \
> --random_generator=tausworthe64 --ioscheduler=none \
> --write_lat_log=randread.log --log_prio=1 --rw=randread --bs=128k \
> --ioengine=libaio --iodepth=32 --direct=1 --cmdprio_class=1 ¥

Nit: s/¥/\

Republic credits? Republic credits are no good out here.
I need something more real.
(For the Star Wars fans out there..)

> --cmdprio_percentage=30 --runtime=900
>
> (128KB random read workload at QD=32 and 30% of commands issued with a
> high priority), with an inital number of log entries equal to the
> default of 1024, depending on the machine memory state, the completion
> latency statistics may show imprecise percentiles such as shown below.
>
> high prio (30.75%) clat percentiles (msec):
>      |  1.00th=[   14],  5.00th=[   17], 10.00th=[   20], 20.00th=[   23],
>      | 30.00th=[   27], 40.00th=[   32], 50.00th=[   40], 60.00th=[   53],
>      | 70.00th=[   71], 80.00th=[  104], 90.00th=[  169], 95.00th=[  243],
>      | 99.00th=[  514], 99.50th=[  676], 99.90th=[ 1485], 99.95th=[ 1502],
>      | 99.99th=[ 1552]
> low prio (69.25%) clat percentiles (msec):
>      |  1.00th=[   16],  5.00th=[   24], 10.00th=[   37], 20.00th=[   68],
>      | 30.00th=[  105], 40.00th=[  146], 50.00th=[  199], 60.00th=[  255],
>      | 70.00th=[  330], 80.00th=[  439], 90.00th=[  592], 95.00th=[  718],
>      | 99.00th=[  885], 99.50th=[  986], 99.90th=[ 1469], 99.95th=[ 1536],
>      | 99.99th=[ 1586]
>
> All completion latency percentiles above the 99.90th percentile are
> similar for the high and low priority commands, which is not consistent
> with the drive expected execution of prioritized read commands.
>
> To solve this issue and get more precise latency statistics, this patch
> introduces the new "log_entries" option to allow specifying a larger
> initial number of IO log entries to avoid run-time allocation.
> This option value defaults to DEF_LOG_ENTRIES and its maximum value is
> MAX_LOG_ENTRIES to be consistent with get_new_log() allocation. Also
> simplify get_new_log() by using calloc() instead of malloc, thus
> removing the need for the local variable new_size.
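(Purely as an illustration of the calloc() simplification described in
the paragraph above: the sketch below is NOT the actual fio code, and
struct sample_log / alloc_log() and their fields are made-up names. It
only shows why calloc(), which takes a count and an element size and
returns zeroed memory, removes the need to compute a separate byte
count such as new_size.)

#include <stdlib.h>

/* Hypothetical sample/log types, only to show the shape of the change. */
struct sample {
	unsigned long long val;
	unsigned long long time;
};

struct sample_log {
	struct sample *entries;
	unsigned long nr_entries;	/* capacity, e.g. from a log_entries-style option */
	unsigned long nr_used;
};

static int alloc_log(struct sample_log *log, unsigned long nr_entries)
{
	/* count and element size passed directly, memory comes back zeroed */
	log->entries = calloc(nr_entries, sizeof(*log->entries));
	if (!log->entries)
		return -1;
	log->nr_entries = nr_entries;
	log->nr_used = 0;
	return 0;
}

int main(void)
{
	struct sample_log log;

	/* a larger initial capacity, as such an option would allow */
	if (alloc_log(&log, 65536))
		return 1;
	free(log.entries);
	return 0;
}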
>
> Adding the "--log_entries=65536" option to the previous command line
> example, the completion latency results obtained are more stable:
>
> high prio (30.72%) clat percentiles (msec):
>      |  1.00th=[   15],  5.00th=[   17], 10.00th=[   19], 20.00th=[   22],
>      | 30.00th=[   24], 40.00th=[   27], 50.00th=[   32], 60.00th=[   36],
>      | 70.00th=[   46], 80.00th=[   57], 90.00th=[   81], 95.00th=[  105],
>      | 99.00th=[  161], 99.50th=[  188], 99.90th=[  271], 99.95th=[  275],
>      | 99.99th=[  363]
> low prio (69.28%) clat percentiles (msec):
>      |  1.00th=[   16],  5.00th=[   27], 10.00th=[   43], 20.00th=[   80],
>      | 30.00th=[  123], 40.00th=[  176], 50.00th=[  236], 60.00th=[  313],
>      | 70.00th=[  401], 80.00th=[  506], 90.00th=[  634], 95.00th=[  718],
>      | 99.00th=[  844], 99.50th=[  885], 99.90th=[  953], 99.95th=[  995],
>      | 99.99th=[ 1053]
>
> All completion percentiles clearly now show shorter latencies for high
> priority commands, as expected. The 99.99th percentile for low prirotiy

Nit: s/prirotiy/priority

> commands is also improved compared to the previous case as the
> measurements are not impacted by the log dynamic allocation.
>
> Suggested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
> Signed-off-by: Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx>

Side note:
Would it perhaps be wise, in addition to this patch, to also increase
the DEF_LOG_ENTRIES value slightly?

The default value is 1024, sizeof(struct io_sample_offset) == 40, and
regrow_logs() seems to regrow (at most) 6 different logs (slat, clat,
clat_hist_log, lat_log, bw_log, iops_log). The default space used for a
single log is therefore 40 * 1024 = 40 KB, so if someone has enabled all
6 different logs/log types, that is 40 KB * 6 = 240 KB per thread_data.

While it would be hard to figure out a new default, e.g. bumping it to
32768 would mean 40 * 32768 = 1.26 MB per log, or 7.5 MB per struct
thread_data if all 6 logs are enabled. Sure, someone can have thousands
of threads, but I doubt that you would have logging enabled for all
those threads. If you have logging enabled, the most common case is
probably to have slat, clat and lat logs, which means 3 different logs:
1.26 MB * 3 = 3.76 MB per struct thread_data. However, this dynamic
memory allocation will only be done if you have explicitly enabled any
of the log files in the first place.

TL;DR: I think that we can merge this patch AND increase the default
number of log entries to e.g. 32768. (Or, if we want to be quite
conservative, perhaps only bump it to something like 16384, which would
be less than 2 MB per struct thread_data if 3 different log files are
enabled.)

For this patch, except for two small nits in the commit message:

Reviewed-by: Niklas Cassel <niklas.cassel@xxxxxxx>
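P.S. A quick back-of-the-envelope check of the arithmetic in the side
note above (a standalone throwaway program, not fio code; the 40-byte
sample size is simply the sizeof(struct io_sample_offset) value quoted
there):

#include <stdio.h>

int main(void)
{
	const unsigned long sample_size = 40;	/* bytes per logged sample, as quoted above */
	const unsigned long entries[] = { 1024, 16384, 32768 };
	const unsigned long nr_logs[] = { 3, 6 };

	for (unsigned int i = 0; i < sizeof(entries) / sizeof(entries[0]); i++)
		for (unsigned int j = 0; j < sizeof(nr_logs) / sizeof(nr_logs[0]); j++)
			printf("%6lu entries x %lu logs = %8.1f KiB per thread_data\n",
			       entries[i], nr_logs[j],
			       sample_size * entries[i] * nr_logs[j] / 1024.0);
	return 0;
}

This prints 240 KiB for the current 1024-entry default with all 6 logs
enabled, and 7680 KiB (7.5 MiB) for 32768 entries with all 6 logs,
matching the figures in the side note.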