On Tue, Apr 30, 2013 at 8:17 AM, Srinivas Eeda <srinivas.eeda@xxxxxxxxxx> wrote:
> Hi Miklos and all,
>
> I would like to submit following fixes for review which enhance FUSE
> scalability on NUMA systems. The changes add new mount option 'numa' and
> involve kernel and user library changes. I am currently forwarding kernel fixes
> only and will forward library changes once the kernel fixes seem ok to you :)
>
> In our internal tests, we noticed that FUSE was not scaling well when multiple
> users access same mount point. The contention is on fc->lock spinlock, but the
> real problem is not the spinlock itself but because of the latency involved
> in accessing single spinlock from multiple NUMA nodes.
>
> This fix groups various fields in fuse_conn and creates a set for each NUMA
> node to reduce contention. A spinlock is created for each NUMA node which will
> synchronize access to node local set. All processes will now access nodes local
> spinlock thus reducing latency. To get this behavior users(fuse library) or
> the file system implementers should pass 'numa' mount option. If 'numa' option
> is not specified during mount, FUSE will create single set of the grouped
> fields and behavior is similar to current. File systems that support NUMA
> option should listen on /dev/fuse from all NUMA nodes to serve
> incoming/outgoing requests. If File systems are using fuse library then the
> library will do that for them.

Why just NUMA? For example, see this discussion from a while back:

http://thread.gmane.org/gmane.comp.file-systems.fuse.devel/11832/

We should be improving scalability in small steps, each of which makes
sense on its own and improves the situation. Marking half the fuse_conn
structure per-cpu or per-node is too large a step and is probably not
even the best one.

For example, we have various counters protected by fc->lock that could
be converted to per-cpu counters. Similarly, we could have per-cpu lists
for requests, balancing requests between them only when necessary.
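
To illustrate the per-cpu counter idea, here is a minimal userspace
sketch in C11. It is an analogy only, not kernel code and not part of
any patch: the names sharded_counter, NSHARDS and CACHELINE are
hypothetical, and the caller is assumed to pass its own CPU or node
index as the shard. In the kernel, the existing percpu_counter helpers
would be the more likely fit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NSHARDS   4   /* hypothetical: one shard per CPU (or per node) */
#define CACHELINE 64  /* assumed cache-line size */

/* Each shard sits on its own cache line so writers do not false-share. */
struct shard {
    _Alignas(CACHELINE) atomic_long value;
};

struct sharded_counter {
    struct shard shards[NSHARDS];
};

static void sharded_counter_init(struct sharded_counter *c)
{
    for (size_t i = 0; i < NSHARDS; i++)
        atomic_init(&c->shards[i].value, 0);
}

/* Fast path: touch only the caller's shard; no shared lock is taken. */
static void sharded_counter_add(struct sharded_counter *c, size_t shard,
                                long delta)
{
    assert(shard < NSHARDS);
    atomic_fetch_add_explicit(&c->shards[shard].value, delta,
                              memory_order_relaxed);
}

/* Slow path: sum every shard. The result is only a snapshot if
 * writers are still running concurrently. */
static long sharded_counter_sum(struct sharded_counter *c)
{
    long total = 0;

    for (size_t i = 0; i < NSHARDS; i++)
        total += atomic_load_explicit(&c->shards[i].value,
                                      memory_order_relaxed);
    return total;
}
```

The point of the split is that the common operation (add) stays local,
while the rare operation (sum) pays the cross-CPU cost. Each counter
could be converted this way on its own, as a small separate step.
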
After that we could add some heuristics to discourage balancing between
NUMA nodes.

To sum up: improving scalability for fuse would be nice, but don't do it
just for NUMA and don't do it in one big step.

Thanks,
Miklos