On 26/09/2019 10:11, Miklos Szeredi wrote:
> On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@xxxxxxxxxxxxx> wrote:
>
> Just a heads-up that I have achieved similar results with a prototype
> using the unmodified fuse protocol. This prototype was built with
> ideas taken from zufs (percpu/lockless, mmapped dev, single syscall
> per op).
>
> I found a big scheduler scalability bottleneck that is caused by the
> update of mm->cpu_bitmap at context switch. This can be worked around
> by using shared memory instead of shared page tables, which is a bit
> of a pain, but it does prove the point. I thought about fixing the
> cpu_bitmap cacheline pingpong, but didn't really get anywhere.

I'm not sure what scalability bottleneck you are seeing above. With
zufs I get very good scalability: almost flat up to the number of
CPUs, and/or up to the memory-bandwidth limit when I'm accessing pmem.

I do have a bad scalability bottleneck if I mmap pages, caused by the
call to zap_vma_ptes, which is why I invented the NIO way (inspired by
you). Once you send me the git URL I will have a look at the code and
see if I can find any differences.

That said, I do believe that a new scheduler object that completely
bypasses the scheduler and just relinquishes its time slice to the
switched-to thread would cut another 0.5us off the single-thread
latency. (The 5th patch talks about that.)

> Are you interested in comparing zufs with the scalable fuse
> prototype? If so, I'll push the code into a public repo with some
> instructions.
>
> Thanks,
> Miklos

Miklos, would you have some bandwidth to review my code? It would make
me very happy and calm. Your input is very valuable to me.

Thanks,
Boaz
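P.S. For anyone who wants to see the cpu_bitmap effect Miklos describes
without building a kernel, here is a minimal userspace sketch. It is an
illustration, not the actual kernel code: NTHREADS and ITERS are
arbitrary, and the two atomic ops only mimic the cpumask_set_cpu() /
cpumask_clear_cpu() pair that switch_mm() performs on mm_cpumask() at
every context switch. Because all the bits live in one shared word, the
cacheline ping-pongs between every CPU running a worker:

/* cpu_bitmap cacheline ping-pong sketch (illustrative only).
 * Build: gcc -O2 -pthread pingpong.c -o pingpong
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    (10 * 1000 * 1000L)

static unsigned long cpu_bitmap;        /* one word = one shared cacheline */

static void *worker(void *arg)
{
	unsigned long bit = 1UL << (long)arg;

	for (long i = 0; i < ITERS; i++) {
		/* roughly cpumask_set_cpu(cpu, mm_cpumask(next)) ... */
		__atomic_fetch_or(&cpu_bitmap, bit, __ATOMIC_RELAXED);
		/* ... then cpumask_clear_cpu(cpu, mm_cpumask(prev)) */
		__atomic_fetch_and(&cpu_bitmap, ~bit, __ATOMIC_RELAXED);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);

	printf("final bitmap: %#lx\n", cpu_bitmap);
	return 0;
}

Run it with the threads pinned to different cores, then compare against
a variant where each thread flips a bit in its own cacheline-padded
word; the difference in throughput is the ping-pong cost.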