Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)

Adriano Silva <adriano_da_silva@xxxxxxxxxxxx> · Wed, 1 Jun 2022 19:27:37 +0000 (UTC)

Thank you,

I don't know whether my NVMe devices use 4K LBAs; I don't think so. They are all the same model from the same manufacturer. I know they accept 512-byte blocks, but their latency is very high when processing blocks of that size.

However, in every test I run on them with 4K blocks the result is much better, so I always use 4K blocks. In real life I don't expect to use blocks smaller than 4K.
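
The logical block size the kernel sees for a namespace can be read from sysfs rather than guessed. A minimal sketch, assuming the namespace is nvme0n1 as in the commands quoted below; the function name and its optional sysfs-root argument are illustrative only:

```shell
#!/bin/sh
# Print the logical block size (in bytes) the kernel reports for a block device.
# The second argument is a hypothetical override of the sysfs root, there only
# so the function can be exercised without real hardware.
logical_block_size() {
    dev="$1"
    sysfs_root="${2:-/sys/block}"
    cat "$sysfs_root/$dev/queue/logical_block_size"
}

# Example (not run here): logical_block_size nvme0n1
```

A result of 512 means the namespace is using a 512b format; 4096 means a 4K format. "nvme id-ns /dev/nvme0n1 -H" should also list the LBA formats the drive supports and mark the one in use.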

> You can remove the kernel interpretation using passthrough commands. Here's an
> example comparing with and without FUA assuming a 512b logical block format:
> 
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
> 
> if you have a 4k LBA format, use "--block-count=0".
> 
> And you may want to run each of the above several times to get an average since
> other factors can affect the reported latency.

I wrote a bash script that runs each of the two commands you suggested repeatedly for 10 seconds, to get a more reliable average. The results are:

root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
write back
root@pve-21:~# ./nvme_write.sh
Total: 10 seconds, 3027 tests. Latency (us) : min: 29  /  avr: 37   /  max: 98
root@pve-21:~# ./nvme_write.sh --force-unit-access
Total: 10 seconds, 2985 tests. Latency (us) : min: 29  /  avr: 37   /  max: 111
root@pve-21:~#
root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
Total: 10 seconds, 2556 tests. Latency (us) : min: 404  /  avr: 428   /  max: 492
root@pve-21:~# ./nvme_write.sh --block-count=0
Total: 10 seconds, 2521 tests. Latency (us) : min: 403  /  avr: 428   /  max: 496
root@pve-21:~#
root@pve-21:~#
root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
write through
root@pve-21:~# ./nvme_write.sh
Total: 10 seconds, 2988 tests. Latency (us) : min: 29  /  avr: 37   /  max: 114
root@pve-21:~# ./nvme_write.sh --force-unit-access
Total: 10 seconds, 2926 tests. Latency (us) : min: 29  /  avr: 36   /  max: 71
root@pve-21:~#
root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
Total: 10 seconds, 2456 tests. Latency (us) : min: 31  /  avr: 428   /  max: 496
root@pve-21:~# ./nvme_write.sh --block-count=0
Total: 10 seconds, 2627 tests. Latency (us) : min: 402  /  avr: 428   /  max: 509

As shown above, with almost 3k tests run in ten seconds for each command, I got even better results than I had with ioping. I also tried the commands individually, but I wrote the bash script so I could run many of them in a short period and average the results. The average is about 37us with --block-count=7, with or without FUA and in either write_cache mode. Very low!
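
The script itself was not posted. A minimal sketch of what such a timed wrapper might look like is given here; the function names, the DEV and DURATION variables, and the assumed shape of the latency line (a number followed by "us") are illustrative and may need adjusting for a given nvme-cli version:

```shell
#!/bin/sh
# Sketch of a timed wrapper around "nvme write --latency".
# ASSUMPTION: nvme-cli prints the latency on stdout, on a line containing
# "latency" with the value followed by the word "us".

DEV="${DEV:-/dev/nvme0n1}"   # device under test (assumed)
DURATION="${DURATION:-10}"   # seconds to keep issuing writes

# Pull the microsecond value out of nvme-cli output read on stdin.
extract_latency() {
    awk '/latency/ { for (i = 1; i < NF; i++) if ($(i + 1) == "us") print $i }'
}

# Read one latency per line on stdin; print count, min, average and max.
summarize() {
    awk '
        NR == 1 { min = $1; max = $1 }
        { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END { if (NR) printf "%d tests. Latency (us) : min: %d / avg: %d / max: %d\n", NR, min, sum / NR, max }
    '
}

# Issue writes for DURATION seconds. All flags other than the data size are
# passed through, e.g. --block-count=7 --force-unit-access.
# WARNING: this overwrites data at the start of the device.
run_test() {
    end=$(( $(date +%s) + DURATION ))
    while [ "$(date +%s)" -lt "$end" ]; do
        echo "" | nvme write "$DEV" --data-size=4k --latency "$@"
    done | extract_latency | summarize
}

# Example (needs root): run_test --block-count=7 --force-unit-access
```

Keeping the parsing and the statistics in separate functions makes it easy to check them against a few hand-written latency lines before pointing the loop at a real device.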

However, with the suggested --block-count=0 the latency is much higher in every case, around 428us.

So with the nvme command the latency is the same with or without --force-unit-access. The only difference comes from whether I use the block count meant for a 512b LBA format (--block-count=7) or the one meant for a 4K LBA format (--block-count=0).
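
The block count given to an NVMe read or write is zero-based, so the number of bytes a command covers depends on the LBA size. A small sketch of that arithmetic (the function name is made up for illustration):

```shell
#!/bin/sh
# Bytes covered by one NVMe read/write command.
# The NVMe "number of logical blocks" field is zero-based, so a block
# count of N addresses N + 1 logical blocks.
transfer_bytes() {
    block_count="$1"   # value given to --block-count
    lba_size="$2"      # logical block size in bytes (512 or 4096)
    echo $(( (block_count + 1) * lba_size ))
}

transfer_bytes 7 512    # 512b format, --block-count=7
transfer_bytes 0 512    # 512b format, --block-count=0
transfer_bytes 0 4096   # 4K format,   --block-count=0
```

The three calls print 4096, 512 and 4096. If the namespace here is formatted with 512-byte LBAs, the --block-count=0 runs would have been single 512-byte writes, which would fit the earlier observation that these drives are slow with 512-byte blocks.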

What do you think?

Thanks,

On Monday, 30 May 2022 10:45:37 BRT, Keith Busch <kbusch@xxxxxxxxxx> wrote:

On Sun, May 29, 2022 at 11:50:57AM +0000, Adriano Silva wrote:

> So why the slowness? Is it just the time spent in kernel code to set FUA and Flush Cache bits on writes that would cause all this latency increment (84us to 1.89ms) ?

I don't think the kernel's handling accounts for that great of a difference. I
think the difference is probably on the controller side.

The NVMe spec says that a Write command with FUA set:

"the controller shall write that data and metadata, if any, to non-volatile
media before indicating command completion."

So if the memory is non-volatile, it can complete the command without writing
to the backing media. It can also commit the data to the backing media if it
wants to before completing the command, but that's an implementation-specific
detail.
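
One way to see which case may apply to a given drive is the Volatile Write Cache (VWC) field of Identify Controller, where bit 0 indicates that a volatile write cache is present. A rough sketch that parses "nvme id-ctrl" output; the function name is hypothetical, and the "vwc : <value>" line format is an assumption about nvme-cli's plain-text output:

```shell
#!/bin/sh
# Read "nvme id-ctrl" output on stdin and succeed if bit 0 of the VWC
# field (volatile write cache present) is set.
# ASSUMPTION: the field is printed as "vwc : <value>", decimal or 0x-prefixed.
has_volatile_write_cache() {
    vwc=$(awk -F: '$1 ~ /^vwc[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2; exit }')
    [ -n "$vwc" ] && [ $(( $vwc & 1 )) -eq 1 ]
}

# Example (not run here):
#   nvme id-ctrl /dev/nvme0 | has_volatile_write_cache && echo "volatile cache present"
```

A drive that reports no volatile write cache has nothing extra to do for FUA, so identical latencies with and without the flag would not be surprising there.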

You can remove the kernel interpretation using passthrough commands. Here's an
example comparing with and without FUA assuming a 512b logical block format:

  # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
  # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency

If you have a 4k LBA format, use "--block-count=0".

And you may want to run each of the above several times to get an average since
other factors can affect the reported latency.