I think I have run into a bottleneck somewhere in/around RAID and the filesystem, and I am trying to determine whether it's a known issue or something I am doing wrong. Briefly: I have 4x NVME SSDs. When formatted as 4 separate ext4 filesystems, a sequential write benchmark does 8 GB/s in aggregate, which is about what I expect. But when the drives are striped together using mdadm raid0, sequential write speed drops below 5 GB/s. In both cases I am using fio with 8 writer threads writing to 8 separate files, libaio, iodepth 32, large buffers, no atime, and other sensible options (see pastebin at bottom for details).

When using the drives separately, top looks like this:

top - 10:49:52 up 38 min,  4 users,  load average: 5.49, 2.42, 2.45
Tasks: 363 total,  15 running, 221 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.0 us, 54.6 sy,  0.0 ni, 42.5 id,  0.8 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 23252352 total,   428864 free,  1056968 used, 21766520 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 21538676 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3204 root      20   0  745432  37492   1064 R  99.9  0.2   0:35.64 fio
 3208 root      20   0  745456  37508   1076 R  99.8  0.2   0:35.62 fio
 3209 root      20   0  745464  37492   1056 R  99.7  0.2   0:35.65 fio
 3206 root      20   0  745444  37496   1060 R  99.6  0.2   0:35.54 fio
 3210 root      20   0  745468  37504   1076 R  99.6  0.2   0:35.64 fio
 3207 root      20   0  745452  37500   1068 R  99.4  0.2   0:35.54 fio
 3203 root      20   0  745428  37500   1068 R  99.2  0.2   0:35.59 fio
 3205 root      20   0  745440  37500   1068 R  99.0  0.2   0:35.43 fio
  158 root      20   0       0      0      0 R  89.4  0.0   5:09.55 kswapd0
 3027 root      20   0       0      0      0 R  60.9  0.0   0:44.53 kworker/u488:3
 3110 root      20   0       0      0      0 R  56.0  0.0   0:33.54 kworker/u488:0
 3026 root      20   0       0      0      0 R  54.4  0.0   1:22.48 kworker/u488:2
  375 root      20   0       0      0      0 R  50.6  0.0   2:50.14 kworker/u488:18

Note the 4 kworker threads doing filesystem or i/o work. With raid0, top looks like this:

top - 11:17:37 up  1:06,  4 users,  load average: 9.13, 3.95, 2.64
Tasks: 355 total,   6 running, 225 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us, 25.4 sy,  0.0 ni, 52.5 id, 20.9 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 23252352 total,   399740 free,  1042156 used, 21810456 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 21553788 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3027 root      20   0       0      0      0 R  99.2  0.0   6:19.05 kworker/u488:3
  158 root      20   0       0      0      0 S  85.0  0.0  11:33.52 kswapd0
 3487 root      20   0  745472  37580   1108 R  54.7  0.2   1:10.13 fio
 3481 root      20   0  745448  37568   1108 R  45.0  0.2   1:07.82 fio
 3485 root      20   0  745464  37572   1112 D  44.5  0.2   1:08.99 fio
 3483 root      20   0  745456  37572   1112 D  44.0  0.2   1:10.01 fio
 3486 root      20   0  745468  37572   1112 D  43.9  0.2   1:09.24 fio
 3484 root      20   0  745460  37564   1104 D  43.7  0.2   1:10.95 fio
 3488 root      20   0  745476  37584   1112 R  43.7  0.2   1:07.64 fio
 3482 root      20   0  745452  37572   1112 D  42.3  0.2   1:06.93 fio

Instead of 4 kworkers, there is now only 1. In the raid case it very much looks like all i/o is bottlenecking in a single kernel worker thread. Is this a known issue/design choice? Mostly I'm interested in knowing where the bottleneck is, rather than "fixing" it.

System is running Ubuntu 18.04 LTS / Linux 4.15.0-20-generic x86_64 on a 10-core Intel Xeon Silver 4114 2.2 GHz with 6-channel DDR4 and 4x Samsung 960 PRO 1TB NVME SSDs, each on a dedicated x4 PCIe 3 link from the CPU (no PCIe switches / PCH lanes).
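For reference, the two configurations I'm comparing look roughly like the sketch below. The exact commands and options are in the pastebin linked at the bottom; device names, mount points, and file sizes here are only placeholders.

  # Per-drive case: one ext4 filesystem per NVMe device (placeholder device names)
  for d in nvme0n1 nvme1n1 nvme2n1 nvme3n1; do
      mkfs.ext4 /dev/$d
      mkdir -p /mnt/$d
      mount -o noatime /dev/$d /mnt/$d
  done

  # raid0 case: stripe the same four devices with mdadm, then one ext4 on top
  mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  mkfs.ext4 /dev/md0
  mount -o noatime /dev/md0 /mnt/md0

  # fio sequential-write load: 8 writer jobs, 8 separate files, libaio, iodepth 32
  # (for the per-drive case, the same job is pointed at the individual mounts instead)
  fio --name=seqwrite --rw=write --bs=1M --ioengine=libaio --iodepth=32 \
      --numjobs=8 --size=16g --directory=/mnt/md0 --group_reporting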
Full transcript of all mdadm, mkfs.ext4, fio, etc. commands and output here:
https://pastebin.com/AXuXhurD

thanks,
-Ryan