Hi people, thanks for answering.

This is an enterprise NVMe device with Power Loss Protection; it has a non-volatile cache.

Before purchasing these enterprise devices, I tested with consumer NVMe. Consumer device performance is acceptable only for hardware-cached writes; when fio forces direct, synchronous writes (--direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1), consumer performance is very low. So today I'm using enterprise NVMe with tantalum capacitors, which makes the cache non-volatile, and it performs much better when writing directly to the hardware. But the performance issue only occurs when the write is directed to the bcache device.

Here is the hardware information you asked for (Eric), plus some additional details that may help.

root@pve-20:/# blockdev --getss /dev/nvme0n1
512
root@pve-20:/# blockdev --report /dev/nvme0n1
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0    960197124096   /dev/nvme0n1
root@pve-20:/# blockdev --getioopt /dev/nvme0n1
512
root@pve-20:/# blockdev --getiomin /dev/nvme0n1
512
root@pve-20:/# blockdev --getpbsz /dev/nvme0n1
512
root@pve-20:/# blockdev --getmaxsect /dev/nvme0n1
256
root@pve-20:/# blockdev --getbsz /dev/nvme0n1
4096
root@pve-20:/# blockdev --getsz /dev/nvme0n1
1875385008
root@pve-20:/# blockdev --getra /dev/nvme0n1
256
root@pve-20:/# blockdev --getfra /dev/nvme0n1
256
root@pve-20:/# blockdev --getdiscardzeroes /dev/nvme0n1
0
root@pve-20:/# blockdev --getalignoff /dev/nvme0n1
0

root@pve-20:~# nvme id-ctrl -H /dev/nvme0n1 | grep -A1 vwc
vwc       : 0
  [0:0] : 0     Volatile Write Cache Not Present
root@pve-20:~#

root@pve-20:~# nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid       : 0x1c5c
ssvid     : 0x1c5c
sn        : EI6............................D2Q
mn        : HFS960GD0MEE-5410A
fr        : 40033A00
rab       : 1
ieee      : ace42e
cmic      : 0
mdts      : 5
cntlid    : 0
ver       : 10200
rtd3r     : 90f560
rtd3e     : ea60
oaes      : 0
ctratt    : 0
rrls      : 0
oacs      : 0x6
acl       : 3
aerl      : 3
frmw      : 0xf
lpa       : 0x2
elpe      : 254
npss      : 2
avscc     : 0x1
apsta     : 0
wctemp    : 353
cctemp    : 361
mtfa      : 0
hmpre     : 0
hmmin     : 0
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
edstt     : 2
dsto      : 0
fwug      : 0
kas       : 0
hctma     : 0
mntmt     : 0
mxtmt     : 0
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x14
fuses     : 0
fna       : 0x4
vwc       : 0
awun      : 255
awupf     : 0
nvscc     : 1
nwpc      : 0
acwu      : 0
sgls      : 0
mnan      : 0
subnqn    :
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
ctrattr   : 0
msdbd     : 0
ps    0 : mp:7.39W operational enlat:1 exlat:1 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:2.02W active_power:4.02W
ps    1 : mp:6.82W operational enlat:1 exlat:1 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:2.02W active_power:2.02W
ps    2 : mp:4.95W operational enlat:1 exlat:1 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:2.02W active_power:2.02W
root@pve-20:~#

root@pve-20:~# nvme id-ns /dev/nvme0n1
NVME Identify Namespace 1:
nsze    : 0x6fc81ab0
ncap    : 0x6fc81ab0
nuse    : 0x6fc81ab0
nsfeat  : 0
nlbaf   : 0
flbas   : 0x10
mc      : 0
dpc     : 0
dps     : 0
nmic    : 0
rescap  : 0
fpi     : 0
dlfeat  : 0
nawun   : 0
nawupf  : 0
nacwu   : 0
nabsn   : 0
nabo    : 0
nabspf  : 0
noiob   : 0
nvmcap  : 0
nsattr  : 0
nvmsetid: 0
anagrpid: 0
endgid  : 0
nguid   : 00000000000000000000000000000000
eui64   : ace42e610000189f
lbaf  0 : ms:0   lbads:9  rp:0 (in use)
root@pve-20:~#

If anyone needs any more information about the hardware, please ask.

An interesting thing to note is that when I test with fio using --bs=512, the direct hardware performance is horrible (~1 MB/s).
root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=512 --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
journal-test: (g=0): rw=randwrite, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1047KiB/s][w=2095 IOPS][eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=1715926: Mon May 23 14:05:28 2022
  write: IOPS=2087, BW=1044KiB/s (1069kB/s)(5220KiB/5001msec); 0 zone resets
    slat (nsec): min=3338, max=90998, avg=12760.92, stdev=3377.45
    clat (usec): min=32, max=945, avg=453.85, stdev=27.03
     lat (usec): min=46, max=953, avg=467.16, stdev=27.79
    clat percentiles (usec):
     |  1.00th=[  404],  5.00th=[  420], 10.00th=[  429], 20.00th=[  433],
     | 30.00th=[  437], 40.00th=[  453], 50.00th=[  465], 60.00th=[  465],
     | 70.00th=[  469], 80.00th=[  469], 90.00th=[  474], 95.00th=[  474],
     | 99.00th=[  494], 99.50th=[  502], 99.90th=[  848], 99.95th=[  889],
     | 99.99th=[  914]
   bw (  KiB/s): min= 1033, max= 1056, per=100.00%, avg=1044.22, stdev= 9.56, samples=9
   iops        : min= 2066, max= 2112, avg=2088.67, stdev=19.14, samples=9
  lat (usec)   : 50=0.03%, 100=0.01%, 500=99.38%, 750=0.44%, 1000=0.14%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=74, max=578, avg=279.19, stdev=45.25
    sync percentiles (nsec):
     |  1.00th=[  151],  5.00th=[  179], 10.00th=[  235], 20.00th=[  249],
     | 30.00th=[  255], 40.00th=[  278], 50.00th=[  294], 60.00th=[  298],
     | 70.00th=[  314], 80.00th=[  314], 90.00th=[  330], 95.00th=[  334],
     | 99.00th=[  346], 99.50th=[  350], 99.90th=[  374], 99.95th=[  386],
     | 99.99th=[  498]
  cpu          : usr=3.40%, sys=5.38%, ctx=10439, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10439,0,10438 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1044KiB/s (1069kB/s), 1044KiB/s-1044KiB/s (1069kB/s-1069kB/s), io=5220KiB (5345kB), run=5001-5001msec

Disk stats (read/write):
  nvme0n1: ios=58/10171, merge=0/0, ticks=10/4559, in_queue=0, util=97.64%

But with the same test run directly on the hardware using --bs=4K, the performance changes completely, for the better (~130 MB/s).
root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
journal-test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=125MiB/s][w=31.9k IOPS][eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=1725642: Mon May 23 14:13:50 2022
  write: IOPS=31.9k, BW=124MiB/s (131MB/s)(623MiB/5001msec); 0 zone resets
    slat (nsec): min=2942, max=87863, avg=3222.02, stdev=1233.34
    clat (nsec): min=865, max=1238.6k, avg=25283.31, stdev=24400.58
     lat (usec): min=24, max=1243, avg=28.63, stdev=24.45
    clat percentiles (usec):
     |  1.00th=[   23],  5.00th=[   23], 10.00th=[   23], 20.00th=[   23],
     | 30.00th=[   24], 40.00th=[   24], 50.00th=[   24], 60.00th=[   25],
     | 70.00th=[   26], 80.00th=[   26], 90.00th=[   26], 95.00th=[   29],
     | 99.00th=[   35], 99.50th=[   41], 99.90th=[  652], 99.95th=[  725],
     | 99.99th=[  766]
   bw (  KiB/s): min=125696, max=129008, per=99.98%, avg=127456.33, stdev=1087.63, samples=9
   iops        : min=31424, max=32252, avg=31864.00, stdev=271.99, samples=9
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=99.59%, 100=0.24%, 250=0.01%
  lat (usec)   : 500=0.02%, 750=0.10%, 1000=0.02%
  lat (msec)   : 2=0.01%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=43, max=435, avg=68.51, stdev=10.83
    sync percentiles (nsec):
     |  1.00th=[   59],  5.00th=[   60], 10.00th=[   61], 20.00th=[   63],
     | 30.00th=[   64], 40.00th=[   65], 50.00th=[   66], 60.00th=[   67],
     | 70.00th=[   70], 80.00th=[   73], 90.00th=[   77], 95.00th=[   80],
     | 99.00th=[  122], 99.50th=[  147], 99.90th=[  177], 99.95th=[  189],
     | 99.99th=[  251]
  cpu          : usr=10.72%, sys=19.54%, ctx=159367, majf=0, minf=11
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,159384,0,159383 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=124MiB/s (131MB/s), 124MiB/s-124MiB/s (131MB/s-131MB/s), io=623MiB (653MB), run=5001-5001msec

Disk stats (read/write):
  nvme0n1: ios=58/155935, merge=0/0, ticks=10/3823, in_queue=0, util=98.26%

Does anything justify this difference? Maybe that's why the performance improves when I create the bcache device with -w 4k. Not as much as I'd like, but it does get better.

I also noticed that when I use --bs=4K (or even larger blocks) together with --ioengine=libaio in the direct test on the hardware, performance improves a lot, even doubling in the case of 4K blocks. Without --ioengine=libaio, direct hardware is somewhere around 15k IOPS at 60.2 MB/s; with that library it goes to 32k IOPS and 130 MB/s. That's why I have standardized on --ioengine=libaio in these tests.

As for the bucket size, I read that it would be best to use the hardware device's erase block size. However, I have tried to obtain this information by reading the device and also from the manufacturer, without success. So I have no idea which bucket size would be best, but from my tests the default of 512KB seems adequate.

Responding to Coly: I also ran fio writing directly to the NVMe block device (/dev/nvme0n1), without going through any partition.
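For reference, the raw-device run was along these lines (a sketch: the runtime and job count here are assumptions; the other parameters match the tests above, only the target is the whole device instead of a partition):

# Same synchronous 4K random-write test as above, aimed at the whole NVMe
# block device rather than at a partition (runtime/jobs assumed).
fio --filename=/dev/nvme0n1 --direct=1 --fsync=1 --rw=randwrite --bs=4K \
    --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting \
    --name=raw-device-test --ioengine=libaio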
Performance is always slightly better when writing directly to the whole block device rather than to a partition, but the difference is minimal. The same difference also seems to show up on bcache, and it is likewise very small (insignificant). I have also noticed that increasing the number of jobs improves the bcache0 device's performance a lot, almost reaching the performance of the tests done directly on the hardware.

Eric, recompiling the kernel with the suggested change may not be such a simple task for me. I'm working with Proxmox 6.4, and I believe its kernel carries some adaptations; it is based on kernel 5.4, which is what the distribution supports. Following Coly's suggestion, I'll also try tests with kernel 5.15 to see if that solves it. Would that version be good enough? As I said above, since I'm using Proxmox I'm afraid to change the kernel version they provide.

Eric, to be clear, the hardware I'm using has only one processor socket. I'm testing with a second, identical computer (same motherboard, same processor, same NVMe), the only difference being that it has 12GB of RAM while the first has 48GB. It is an HP Z400 Workstation with an Intel Xeon X5680 six-core (12-thread) processor and DDR3-1333 10600E memory (an old computer). On the second computer I installed a newer version of the distribution, with a kernel based on 5.15, and I am now comparing the performance of the two computers in the lab.

On the second computer I got worse performance than on the first (practically half the performance with bcache), even though the tests done directly on the NVMe are identical. I went back to the same OS version as on the first computer, to keep exactly the same scenario on both machines and compare them with the same software configuration first, but nothing changed. Could the lower RAM be what makes the second machine slower?

I also noticed a difference in behavior between the two computers in dstat. While the first computer doesn't seem to touch the backing device at all, the second shows something slightly different: although it doesn't write data to the backing disk, it does register I/O activity on it. Strange, no?
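The dstat invocation was roughly the following (a sketch reconstructed from the columns shown below; the exact flag set is an assumption, and only the monitored device list differs between the two hosts):

# Per-device throughput (-d) and request counts (-r), plus network, load,
# CPU, interrupts/context switches, timestamps and async I/O counters.
dstat -d -r -n -l -c -y -t --aio -D sdb,nvme0n1,bcache0   # first computer
dstat -d -r -n -l -c -y -t --aio -D sdd,nvme0n1,bcache0   # second computer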
Let's look at the dstat of the first computer:

--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send|  1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6953B 7515B|0.13 0.26 0.26|  0   0  99   0   0| 399   634 |25-05 09:41:42|   0
   0  8192B:4096B 2328k:   0  1168k|   0  2.00 :1.00   586 :   0   587 |9150B 2724B|0.13 0.26 0.26|  2   2  96   0   0|1093  3267 |25-05 09:41:43|  1B
   0     0 :   0    58M:   0    29M|   0     0 :   0  14.8k:   0  14.7k|  14k 9282B|0.13 0.26 0.26|  1   3  94   2   0|  16k   67k|25-05 09:41:44|  1B
   0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|  10k 8992B|0.13 0.26 0.26|  1   3  93   2   0|  16k   69k|25-05 09:41:45|  1B
   0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|7281B 4651B|0.13 0.26 0.26|  1   3  92   4   0|  16k   67k|25-05 09:41:46|  1B
   0     0 :   0    59M:   0    30M|   0     0 :   0  15.2k:   0  15.1k|7849B 4729B|0.20 0.28 0.27|  1   4  94   2   0|  16k   69k|25-05 09:41:47|  1B
   0     0 :   0    57M:   0    28M|   0     0 :   0  14.4k:   0  14.4k|  11k 8584B|0.20 0.28 0.27|  1   3  94   2   0|  15k   65k|25-05 09:41:48|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4086B 7720B|0.20 0.28 0.27|  0   0 100   0   0| 274   332 |25-05 09:41:49|   0

Note that on this first computer the writes and I/Os to the backing device (sdb) stay at zero, while the NVMe device's I/Os track the bcache0 device's I/Os at ~14.8k.

Now let's look at dstat on the second computer:

--dsk/sdd---dsk/nvme0n1-dsk/bcache0 ---io/sdd----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send|  1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |9254B 3301B|0.15 0.19 0.11|  1   2  97   0   0| 360   318 |26-05 06:27:15|   0
   0  8192B:4096B   19M:   0  9600k|   0  2402 :1.00  4816 :   0  4801 |8826B 3619B|0.15 0.19 0.11|  0   1  98   0   0|8115    27k|26-05 06:27:16|  1B
   0     0 :   0    21M:   0    11M|   0  2737 :   0  5492 :   0  5474 |4051B 2552B|0.15 0.19 0.11|  0   2  97   1   0|9212    31k|26-05 06:27:17|  1B
   0     0 :   0    23M:   0    11M|   0  2890 :   0  5801 :   0  5781 |4816B 2492B|0.15 0.19 0.11|  1   2  96   2   0|9976    34k|26-05 06:27:18|  1B
   0     0 :   0    23M:   0    11M|   0  2935 :   0  5888 :   0  5870 |4450B 2552B|0.22 0.21 0.12|  0   2  96   2   0|9937    33k|26-05 06:27:19|  1B
   0     0 :   0    22M:   0    11M|   0  2777 :   0  5575 :   0  5553 |8644B 1614B|0.22 0.21 0.12|  0   2  98   0   0|9416    31k|26-05 06:27:20|  1B
   0     0 :   0  2096k:   0  1040k|   0   260 :   0   523 :   0   519 |  10k 8760B|0.22 0.21 0.12|  0   1  99   0   0|1246  3157 |26-05 06:27:21|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4083B 2990B|0.22 0.21 0.12|  0   0 100   0   0| 390   369 |26-05 06:27:22|   0

In this case, with exactly the same command, we get a very different result. Writes to the backing device (sdd) still do not happen (this is correct), but I/Os now occur on both the NVMe device and the backing device (I think this is wrong), and at a much lower rate: around 5.6k on the NVMe and 2.8k on the backing device. It gives the impression that, although nothing is written to sdd, some signal is sent to the backing device for every two I/O operations performed on the cache device, and that this is delaying the response. Could it be something like that?

It is important to point out that writeback mode is on, obviously, and that sequential_cutoff is set to zero, although I also tried the default and higher values with no change. I also tried changing congested_write_threshold_us and congested_read_threshold_us, again with no change in the results.
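For completeness, the settings referred to above were applied roughly like this (a sketch assuming the usual bcache sysfs paths; <cset-uuid> stands for the cache set UUID):

# Cache mode and sequential cutoff on the bcache device
echo writeback > /sys/block/bcache0/bcache/cache_mode
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

# Congestion thresholds on the cache set (0 disables congestion bypass)
echo 0 > /sys/fs/bcache/<cset-uuid>/congested_read_threshold_us
echo 0 > /sys/fs/bcache/<cset-uuid>/congested_write_threshold_us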
The only thing I noticed different between the configurations of the two computers was btree_cache_size: on the first it is much larger (7.7M), while on the second it is only 768K. But I don't know whether this parameter is configurable, or whether it could explain the difference.

Disabling Intel's Turbo Boost technology in the BIOS appears to have no effect.

We will continue our tests comparing the two computers, including testing the two kernel versions. If anyone else has ideas, thanks!

On Tuesday, May 17, 2022, 22:23:09 BRT, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:

On Tue, 10 May 2022, Adriano Silva wrote:
> I'm trying to set up a flash disk NVMe as a disk cache for two or three
> isolated (I will use 2TB disks, but in these tests I used a 1TB one)
> spinning disks that I have on a Linux 5.4.174 (Proxmox node).

Coly has been adding quite a few optimizations over the years. You might
try a new kernel and see if that helps. More below.

> I'm using a NVMe (960GB datacenter devices with tantalum capacitors) as
> a cache.
> [...]
>
> But when I do the same test on bcache writeback, the performance drops a
> lot. Of course, it's better than the performance of spinning disks, but
> much worse than when accessed directly from the NVMe device hardware.
>
> [...]
> As we can see, the same test done on the bcache0 device only got 1548
> IOPS and that yielded only 6.3 KB/s.

Well done on the benchmarking! I always thought our new NVMes performed
slower than expected but hadn't gotten around to investigating.

> I've noticed in several tests, varying the amount of jobs or increasing
> the size of the blocks, that the larger the size of the blocks, the more
> I approximate the performance of the physical device to the bcache
> device.

You said "blocks" but did you mean bucket size (make-bcache -b) or block
size (make-bcache -w)?

If larger buckets make it slower, then that actually surprises me: bigger
buckets means less metadata and better sequential writeback to the
spinning disks (though you hadn't yet hit writeback to spinning disks in
your stats). Maybe you already tried, but varying the bucket size might
help. Try graphing bucket size (powers of 2) against IOPS, maybe there is
a "sweet spot"?

Be aware that 4k blocks (so-called "4Kn") is unsafe for the cache device,
unless Coly has patched that. Make sure your `blockdev --getss` reports
512 for your NVMe!

Hi Coly,

Some time ago you ordered an SSD to test the 4k cache issue, has that
been fixed? I've kept an eye out for the patch but not sure if it was
released.

You have a really great test rig setup with NVMes for stress testing
bcache. Can you replicate Adriano's `ioping` numbers below?

> With ioping it is also possible to notice a limitation, as the latency
> of the bcache0 device is around 1.5ms, while in the case of the raw
> device (a partition of NVMe), the same test is only 82.1us.
>
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
>
> root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us

Wow, almost 20x higher latency, sounds convincing that something is
wrong.

A few things to try:

1. Try ioping without -Y. How does it compare?

2. Maybe this is an inter-socket latency issue. Is your server
   multi-socket? If so, then as a first pass you could set the kernel
   cmdline `isolcpus` for testing to limit all processes to a single
   socket where the NVMe is connected (see `lscpu`).

   Check `hwloc-ls` or your motherboard manual to see how the NVMe port
   is wired to your CPUs.

   If that helps then fine tune with `numactl -cN ioping` and
   /proc/irq/<n>/smp_affinity_list (and `grep nvme /proc/interrupts`) to
   make sure your NVMe's are locked to IRQs on the same socket.

3a. sysfs:

> # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

good.

> # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
> # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us

Also try these (I think bcache/cache is a symlink to /sys/fs/bcache/<cache set>):

echo 10000000 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us
echo 10000000 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us

Try tuning journal_delay_ms:
  /sys/fs/bcache/<cset-uuid>/journal_delay_ms
    Journal writes will delay for up to this many milliseconds, unless a
    cache flush happens sooner. Defaults to 100.

3b: Hacking bcache code:

I just noticed that journal_delay_ms says "unless a cache flush happens
sooner" but cache flushes can be re-ordered so flushing the journal when
REQ_OP_FLUSH comes through may not be useful, especially if there is a
high volume of flushes coming down the pipe because the flushes could
kill the NVMe's cache---and maybe the 1.5ms ping is actual flash latency.
It would flush data and journal.

Maybe there should be a cachedev_noflush sysfs option for those with some
kind of power-loss protection on their SSDs. It looks like this is
handled in request.c when these functions call bch_journal_meta():

	1053: static void cached_dev_nodata(struct closure *cl)
	1263: static void flash_dev_nodata(struct closure *cl)

Coly, can you comment about journal flush semantics with respect to
performance vs correctness and crash safety?

Adriano, as a test, you could change this line in search_alloc() in
request.c:

	- s->iop.flush_journal = op_is_flush(bio->bi_opf);
	+ s->iop.flush_journal = 0;

and see how performance changes. Someone correct me if I'm wrong, but I
don't think flush_journal=0 will affect correctness unless there is a
crash. If that /is/ the performance problem then it would narrow the
scope of this discussion.

4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can
   you set your CPU governor to run at full clock speed and then slowest
   clock speed to see if it is a CPU limit somewhere as we expect?

   You can do `grep MHz /proc/cpuinfo` to see the active rate to make
   sure the governor did its job.

   If it scales with CPU then something in bcache is working too hard.
   Maybe garbage collection?
   Other devs would need to chime in here to steer the troubleshooting if
   that is the case.

5. I'm not sure if garbage collection is the issue, but you might try
   Mingzhe's dynamic incremental gc patch:
   https://www.spinics.net/lists/linux-bcache/msg11185.html

6. Try dm-cache and see if its IO latency is similar to bcache: If it is
   about the same then that would indicate an issue in the block layer
   somewhere outside of bcache. If dm-cache is better, then that confirms
   a bcache issue.

> The cache was configured directly on one of the NVMe partitions (in this
> case, the first partition). I did several tests using fio and ioping,
> testing on a partition on the NVMe device, without partition and
> directly on the raw block, on a first partition, on the second, with or
> without configuring bcache. I did all this to remove any doubt as to the
> method. The results of tests performed directly on the hardware device,
> without going through bcache are always fast and similar.
>
> But tests in bcache are always slower. If you use writethrough, of
> course, it gets much worse, because the performance is equal to the raw
> spinning disk.
>
> Using writeback improves a lot, but still doesn't use the full speed of
> NVMe (honestly, much less than full speed).

Indeed, I hope this can be fixed! A 20x improvement in bcache would be
awesome.

> But I've also noticed that there is a limit on writing sequential data,
> which is a little more than half of the maximum write rate shown in
> direct tests by the NVMe device.

For sync, async, or both?

> Processing doesn't seem to be going up like the tests.

What do you mean "processing"?

-Eric