Hi,

Sitsofe Wheeler wrote on 02/29/2016 10:01 PM:
> Hi,
>
> On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>> Hello,
>>
>> I'm currently looking at one NVRAM device, and during fio tests noticed
>> that each fio thread consumes 30% of user space CPU. I'm using
>> ioengine=libaio, buffered=0, sync=0 and direct=1, so user space CPU
>> consumption should be virtually zero.
>>
>> That 30% user CPU consumption makes me suspect that this is overhead for
>> internal fio housekeeping, i.e., scientifically speaking, fio instrumental
>> measurement error (I hope I'm using the correct English terms).
>>
>> Can anybody comment on it and suggest how to decrease this user space CPU
>> consumption?
>>
>> Here is my full fio job:
>>
>> [global]
>> ioengine=libaio
>> buffered=0
>> sync=0
>> direct=1
>> randrepeat=1
>> softrandommap=1
>> rw=randread
>> bs=4k
>> filename=./nvram (it's a link to a block device)
>> exitall=1
>> thread=1
>> disable_lat=1
>> disable_slat=1
>> disable_clat=1
>> loops=10
>> iodepth=16
>
> You appear to be missing gtod_reduce
> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or
> gettimeofday cpu pinning. You also aren't using batching
> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ).

Thanks, I tried them, but they did not make any significant difference. The
biggest difference I got was from changing the CPU governor to "performance".
Now I see 20-25% user space CPU, measured by fio itself, which is consistent
with top. Note that I'm considering per-thread CPU consumption; to see it in
top you need to press '1' (one line per CPU).

I also tried to short-circuit the sync engine by calling fio_io_end() directly
from the top of fio_syncio_queue(), so no actual IO is done (a rough sketch of
the change is below). The results were interesting enough to publish here in
detail (the percentages are per job):

Jobs  IOPS(M)  %user  %sys
   1      4.3     78     22
   2      7.6     67     33
   3     10.5     65     35
   4      7.7     61     38
   5      4.8     78     22
   6      4.7     83     17
   7      4.8     84     15

The results were very consistent between runs. The CPU is an 8-core Intel Xeon
E5-2667 v3 @ 3.20GHz with 20M L3 cache and HT. Fio is the latest git.

Obviously, if fio had zero overhead, i.e. zero instrumental error, the IOPS
level in this test should skyrocket to hundreds of millions of IOPS, leaving
only a few percent of overhead on multi-million IOPS measurements. But we only
get 4.3M per thread and 10.5M overall, which is, apparently, the maximum fio is
capable of measuring in its current implementation, no matter how fast the
storage stack is (it simply doesn't have more CPU cycles to run). Also,
apparently, there is lock contention on something inside fio, which severely
limits multi-job performance.

Interestingly, gtod_cpu for the single-job case decreased IOPS to 3.8M with the
same user/sys split: 22/78. An explicit clocksource=cpu didn't make any
difference. Another observation: why is the sys CPU consumption so high if the
TSC clock is used? Apparently, it was never actually used, despite the explicit
clocksource=cpu.

I checked perf for the single-job case, which is the most interesting place to
start optimizing from, and it reported that 69% of the time was spent in
clock_thread_fn() and 3.2% in memset(). The latter also raises the question:
why is there a memset in a READ test? Apparently, this memset is on the hot IO
path.
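For reference, the short-circuit mentioned above was roughly the following
change to engines/sync.c. This is only a sketch; the exact layout of the
function and the fio_io_end() signature may differ in your tree:

/*
 * engines/sync.c -- rough sketch of the short-circuit, not a proposed patch.
 * Assumes a completion helper fio_io_end(td, io_u, ret) as in current git;
 * check your tree, its exact signature may differ.
 */
static int fio_syncio_queue(struct thread_data *td, struct io_u *io_u)
{
	/*
	 * Complete the io_u immediately, pretending the whole buffer was
	 * transferred, so no read()/write() syscall ever reaches the kernel.
	 */
	return fio_io_end(td, io_u, io_u->xfer_buflen);

	/* The original read()/write()/trim path below is never reached. */
}

With this in place and ioengine=sync, whatever IOPS fio reports is essentially
its own internal ceiling, since nothing ever hits the storage stack.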
The full job file was:

[global]
ioengine=sync
buffered=0
sync=0
direct=1
randrepeat=0
norandommap
softrandommap=1
random_generator=lfsr   /* does not really matter */
rw=randread
bs=4K                   /* does not matter, since it's short-circuited */
filename=./nvram        /* does not matter */
exitall=1
thread=1
gtod_reduce=1
loops=10
iodepth=8               /* does not matter */

[file1]
[file2]
...

The consumed user space CPU can roughly be considered the instrumental error.
Generally speaking, we have three components: the load generator, the
measurement infrastructure ("a gauge") and the load processor (the storage).
The storage is the object whose performance we are measuring by applying load
from the load generator and using the measurement infrastructure to get the
results. Since the storage stack is entirely in the kernel, what we see as user
space CPU consumption is the aggregated load generator and measurement
infrastructure CPU consumption, i.e. fio overhead, i.e. instrumental error.
(Obviously, this is true only when the CPU is the bottleneck, as you can see in
top with one line per CPU, which is pretty much always the case for high IOPS
tests.)

Thus, I'm afraid, it looks like fio, while being a really great tool, currently
has severe limitations for high IOPS measurements, because its internal load
generation and measurement overheads are too big. It's like having a
thermometer whose error ranges from 0 to infinity depending on the temperature
being measured. If the temperature is low enough, you get 100% accuracy, but if
it's too high, the thermometer might start measuring something internal instead
of what it is supposed to measure. To be fair, all thermometers behave like
this ;).

However, this analysis shows that fio's accuracy declines significantly
starting from a few hundred K IOPS: with libaio and my NVRAM card I see 22%
overhead at 612K IOPS (QD 8, single job). Adding more jobs increases IOPS up to
the card's limit, but the per-thread overhead remains about the same.

Just a friendly analysis in the hope of improving a great tool. Multi-million
IOPS storage is coming, so this is important. Or did I miss anything?

> You may want to look at what fio settings your flash vendor recommends
> for benchmarking purposes...

Those are what I started from. However, being a person with an experimental
physics background, I started from the very basics: calibrating my tools to
figure out the instrumental errors I have with them. My checks with fio led to
this thread.

Thanks,
Vlad