Re: [fuse-devel] FUSE: fixes to improve scalability on NUMA systems

Anand Avati <avati@xxxxxxxxxx> · Wed, 08 May 2013 02:11:05 -0700

On Wed May  1 02:53:14 2013, Miklos Szeredi wrote:
[CC-s added]

On Tue, Apr 30, 2013 at 8:28 PM, Srinivas Eeda <srinivas.eeda@xxxxxxxxxx> wrote:

The reason I targeted NUMA is that a NUMA machine is where I am seeing
significant performance issues. Even on a NUMA system, if I bind all user
threads to a particular NUMA node, there is no notable performance issue.
The test I ran was to start multiple (from 4 to 128) "dd if=/dev/zero
of=/dbfsfilesxx bs=1M count=4000" processes on a system which has 8 NUMA
nodes where each node has 20 cores, so 160 CPUs in total.

http://thread.gmane.org/gmane.comp.file-systems.fuse.devel/11832/

That was a good discussion. The problem discussed there is much more
fine-grained than mine. The fix I emailed proposes to bind requests to a
NUMA node, whereas the above discussion proposes to bind requests to a CPU.
Based on your agreement with Anand Avati, I think you prefer to bind
requests to a CPU.
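
To make the difference in granularity concrete, here is a minimal sketch of
the two ways a submitting CPU could be mapped to a split queue on the
machine described above (8 nodes, 20 cores each). The function names are
hypothetical, and the mapping assumes CPUs are numbered contiguously within
each node, which is an assumption and not something stated in this thread.
Real code would ask the system for the topology (for example libnuma's
numa_node_of_cpu()) instead of dividing.

```c
#include <assert.h>

/* Topology taken from the test machine described in the thread. */
enum { NODES = 8, CORES_PER_NODE = 20, CPUS = NODES * CORES_PER_NODE };

/* Per-NUMA-node split: 8 queues, shared by the 20 CPUs of a node.
 * ASSUMPTION: CPUs 0..19 are node 0, 20..39 are node 1, and so on. */
int queue_for_node(int cpu)
{
    assert(cpu >= 0 && cpu < CPUS);
    return cpu / CORES_PER_NODE;
}

/* Per-CPU split: 160 queues, one per CPU. */
int queue_for_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPUS);
    return cpu;
}
```

With the per-node mapping, CPUs 0 and 19 share queue 0; with the per-CPU
mapping, every CPU gets a queue of its own.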

Yes, that's one direction worth exploring.

http://article.gmane.org/gmane.comp.file-systems.fuse.devel/11909

The patch I proposed can easily be modified to do that. On my current
system, the patch splits each queue into 8 (on an 8-node NUMA machine).
With the change, each queue will be split into 160. Currently my libfuse
fix starts 8 threads and binds one to each NUMA node; with the change it
will have to start 160 and bind them to CPUs. If you prefer to see some
numbers, I can modify the patch and run some tests.
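
For readers unfamiliar with thread pinning, the following is a rough sketch
of starting one worker thread per CPU and binding each to its CPU. It is
not the libfuse fix discussed here; struct worker, worker_main() and
start_pinned_workers() are hypothetical names, and a real worker would loop
reading requests from /dev/fuse instead of returning immediately. It uses
the Linux/glibc affinity calls sched_getaffinity() and
pthread_attr_setaffinity_np(), so it is Linux-specific and should be built
with -pthread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define MAX_WORKERS 1024

/* Hypothetical per-worker state. */
struct worker {
    pthread_t tid;
    int cpu;        /* CPU this worker is supposed to be pinned to */
    int pinned_ok;  /* set to 1 by the worker if its mask is exactly {cpu} */
};

/* Placeholder body: only verifies the affinity mask it was started with. */
static void *worker_main(void *arg)
{
    struct worker *w = arg;
    cpu_set_t mine;

    if (sched_getaffinity(0, sizeof(mine), &mine) == 0)
        w->pinned_ok = CPU_COUNT(&mine) == 1 && CPU_ISSET(w->cpu, &mine);
    return NULL;
}

/* Start one worker per CPU this process is allowed to run on, each bound
 * to exactly one CPU. Returns the number of workers started, or -1. */
int start_pinned_workers(struct worker *workers, int max_workers)
{
    cpu_set_t allowed;
    int n = 0;

    assert(max_workers > 0);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE && n < max_workers; cpu++) {
        cpu_set_t one;
        pthread_attr_t attr;
        int err;

        if (!CPU_ISSET(cpu, &allowed))
            continue;

        CPU_ZERO(&one);
        CPU_SET(cpu, &one);

        /* Set the affinity in the attributes so the thread starts pinned. */
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(one), &one);

        workers[n].cpu = cpu;
        workers[n].pinned_ok = 0;
        err = pthread_create(&workers[n].tid, &attr, worker_main, &workers[n]);
        pthread_attr_destroy(&attr);
        if (err != 0)
            break;
        n++;
    }
    return n;
}

/* Wait for all started workers to finish. */
void join_workers(struct worker *workers, int n)
{
    for (int i = 0; i < n; i++)
        pthread_join(workers[i].tid, NULL);
}
```

Binding to a NUMA node instead of a CPU would use the same calls with all
CPUs of the node set in the mask, and one thread per node.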

Okay.  Though, as I said, I'd first like to see just some small part
changed, e.g. just per-CPU queues, with the background accounting left
alone.  Yeah, that will probably not improve async read performance as
much as you'd like, since per-CPU queues are fundamentally about
synchronous requests.

The chance of a process migrating to a different NUMA node is minimal, so I
didn't modify the fuse header to carry a queue id. In the worst case, where
the worker thread gets migrated to a different NUMA node, my fix will scan
all split queues until it finds the request. But if we split the queues per
CPU, there is a high chance that processes migrate to different CPUs. So I
think it would help to add a cpuid to the fuse in/out headers.
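
The lookup described above can be sketched in userspace as follows. This is
an illustration only, not the kernel patch: struct request, struct
split_queue and find_request() are hypothetical, and there is no locking,
whereas real code would need a lock per queue. The hint stands in for
either "the queue of the CPU or node I am running on now" or a queue id
carried in the header. When the hint is right, one queue is searched; when
the thread has migrated, the search falls back to the remaining queues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pending request, identified by a unique id. */
struct request {
    uint64_t unique;
    struct request *next;
};

/* One of the split queues (per NUMA node or per CPU). */
struct split_queue {
    struct request *head;
};

/* Push a request onto the front of one queue. */
void enqueue(struct split_queue *q, struct request *r)
{
    r->next = q->head;
    q->head = r;
}

/* Unlink and return the request with this id, or NULL if not in q. */
static struct request *dequeue_by_id(struct split_queue *q, uint64_t unique)
{
    for (struct request **pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->unique == unique) {
            struct request *r = *pp;
            *pp = r->next;
            r->next = NULL;
            return r;
        }
    }
    return NULL;
}

/* Look in the hinted queue first, then scan the others in order.
 * *queues_scanned (if non-NULL) reports how many queues were searched. */
struct request *find_request(struct split_queue *queues, int nqueues,
                             int hint, uint64_t unique, int *queues_scanned)
{
    struct request *r = NULL;
    int scanned = 0;

    assert(nqueues > 0 && hint >= 0 && hint < nqueues);
    for (int i = 0; i < nqueues && r == NULL; i++) {
        r = dequeue_by_id(&queues[(hint + i) % nqueues], unique);
        scanned++;
    }
    if (queues_scanned != NULL)
        *queues_scanned = scanned;
    return r;
}
```

With 8 queues a wrong hint costs at most 8 searches; with 160 per-CPU
queues, and migration between CPUs being far more likely, the same miss
costs up to 160, which is the argument for carrying the id in the header.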

Yes, but let's start simple.  Just do per-CPU queues and see what it
does in different workloads.  Obviously it will regress in some cases,
that's fine.  We can then see if the direction is good and the
regressions can be fixed or if it's a completely wrong approach.

There is certainly scope for improving CPU affinity in general (as
shown in the referenced thread). It would be sad to let it pass by and
have only NUMA affinity. As already mentioned, it shouldn't be too hard
to change your patch for per-CPU behavior.

What is the userspace strategy? To have one thread per CPU (or NUMA node),
pinned with affinity? Do you plan to address starvation (maybe not just
yet)?

Thanks!
Avati