On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@xxxxxxxxxxxxx> wrote:

> Performance:
> A simple fio direct 4k random write test with an incrementing number
> of threads.
>
> [fuse]
> threads  wr_iops  wr_bw    wr_lat
> 1        33606    134424   26.53226
> 2        57056    228224   30.38476
> 4        88667    354668   40.12783
> 7        116561   466245   53.98572
> 8        129134   516539   55.6134
>
> [fuse-splice]
> threads  wr_iops  wr_bw    wr_lat
> 1        39670    158682   21.8399
> 2        51100    204400   34.63294
> 4        75220    300882   47.42344
> 7        97706    390825   63.04435
> 8        98034    392137   73.24263
>
> [xfs-dax]
> threads  wr_iops  wr_bw    wr_lat

Data missing.

> [Maxdata-1.5-zufs]
> threads  wr_iops  wr_bw      wr_lat
> 1        1041802  260,450    3.623
> 2        1983997  495,999    3.808
> 4        3829456  957,364    3.959
> 7        4501154  1,125,288  5.895330
> 8        4400698  1,100,174  6.922174

Just a heads up that I have achieved similar results with a prototype
using the unmodified fuse protocol.  This prototype was built with
ideas taken from zufs (percpu/lockless, mmapped dev, single syscall
per op).

I found a big scheduler scalability bottleneck caused by the update of
mm->cpu_bitmap at context switch.  This can be worked around by using
shared memory instead of shared page tables, which is a bit of a pain,
but it does prove the point.

I thought about fixing the cpu_bitmap cacheline pingpong, but didn't
really get anywhere.

Are you interested in comparing zufs with the scalable fuse prototype?
If so, I'll push the code into a public repo with some instructions.

Thanks,
Miklos
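
The "shared memory instead of shared page tables" workaround boils down
to keeping the two sides in separate processes, each with its own mm, so
context switches stop contending on one shared mm->cpu_bitmap, while
request/reply data still travels through a common mapping.  A minimal,
self-contained C sketch of that idea using memfd_create()/mmap()/fork()
(the buffer name and layout are made up for illustration; this is not
the prototype code referred to above):

/*
 * Two cooperating processes share one memfd-backed buffer instead of
 * sharing an address space.  Each has its own mm, so per-mm state such
 * as mm->cpu_bitmap is not bounced between CPUs on every context
 * switch, yet data written by one side is visible to the other.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define SHM_SIZE 4096

int main(void)
{
	int fd = memfd_create("proto-buf", 0);	/* anonymous, fd-backed memory */
	if (fd < 0 || ftruncate(fd, SHM_SIZE) < 0) {
		perror("memfd");
		return 1;
	}

	char *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pid_t pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		/* "server" side: separate mm, same physical pages */
		snprintf(buf, SHM_SIZE, "reply written by pid %d", getpid());
		return 0;
	}

	/* "client" side: wait, then read the reply through the shared mapping */
	waitpid(pid, NULL, 0);
	printf("parent %d read: %s\n", getpid(), buf);
	return 0;
}

The parent sees the child's write even though the two processes never
share page tables; a real transport would of course add a ring of
request slots and a wakeup mechanism on top of such a region.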