Re: [fuse-devel] FUSE: fixes to improve scalability on NUMA systems

Miklos Szeredi <miklos@xxxxxxxxxx> · Tue, 30 Apr 2013 18:29:03 +0200

On Tue, Apr 30, 2013 at 8:17 AM, Srinivas Eeda <srinivas.eeda@xxxxxxxxxx> wrote:
> Hi Miklos and all,
>
> I would like to submit the following fixes for review, which enhance FUSE
> scalability on NUMA systems. The changes add a new mount option, 'numa', and
> involve kernel and user library changes. I am currently forwarding the kernel
> fixes only and will forward the library changes once the kernel fixes seem ok
> to you :)
>
> In our internal tests, we noticed that FUSE was not scaling well when multiple
> users access the same mount point. The contention is on the fc->lock spinlock,
> but the real problem is not the spinlock itself but the latency involved in
> accessing a single spinlock from multiple NUMA nodes.
>
> This fix groups various fields in fuse_conn and creates one set per NUMA
> node to reduce contention. A spinlock is created for each NUMA node to
> synchronize access to the node-local set. All processes will now take their
> node's local spinlock, thus reducing latency. To get this behavior, users (the
> fuse library) or the file system implementers should pass the 'numa' mount
> option. If the 'numa' option is not specified during mount, FUSE will create
> a single set of the grouped fields and behavior is similar to the current one.
> File systems that support the 'numa' option should listen on /dev/fuse from all
> NUMA nodes to serve incoming/outgoing requests. If file systems are using the
> fuse library, then the library will do that for them.

Why just NUMA? For example, see this discussion from a while back:

http://thread.gmane.org/gmane.comp.file-systems.fuse.devel/11832/

We should be improving scalability in small steps, each of which makes
sense and improves the situation. Marking half the fuse_conn
structure per-cpu or per-node is too large a change and is probably not
even the best step.

For example, we have various counters protected by fc->lock that could
be done with per-cpu counters. Similarly, we could have per-cpu lists
for requests, balancing requests only when necessary. After that we
could add some heuristics to discourage balancing between NUMA nodes.
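
To illustrate the per-cpu counter idea, here is a minimal userspace sketch,
not actual fuse code. The names (sharded_counter, counter_add, counter_sum,
NR_SLOTS) are made up for the example, and the 64-byte cache line is an
assumption. In the kernel, the existing percpu_counter helpers would be the
natural starting point rather than anything hand-rolled like this.

```c
#include <assert.h>

#define NR_SLOTS   4    /* hypothetical: one slot per CPU */
#define CACHE_LINE 64   /* assumed cache line size in bytes */

/*
 * Each slot is padded to CACHE_LINE bytes so that updates made on
 * different CPUs touch different cache lines instead of all hitting
 * one shared, lock-protected counter.
 */
struct counter_slot {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct sharded_counter {
    struct counter_slot slot[NR_SLOTS];
};

/*
 * Fast path: only the caller's own slot is modified. This sketch is
 * single-threaded; real per-cpu code would also have to prevent the
 * task from migrating to another CPU during the update.
 */
static void counter_add(struct sharded_counter *c, unsigned int cpu, long n)
{
    assert(cpu < NR_SLOTS);
    c->slot[cpu].value += n;
}

/* Slow path: walk every slot, only when the total is actually needed. */
static long counter_sum(const struct sharded_counter *c)
{
    long sum = 0;
    unsigned int i;

    for (i = 0; i < NR_SLOTS; i++)
        sum += c->slot[i].value;
    return sum;
}
```

The point of the split is that the frequent operation (add) stays local,
while the rare operation (sum) pays the cost of visiting every slot.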

To sum up: improving scalability for fuse would be nice, but don't
just do it for NUMA and don't do it in one big step.

Thanks,
Miklos