Re: [patch 2/2 v3]raid5: create multiple threads to handle stripes

I'm CC'ing Joe Landman as he's already building systems of the caliber
that would benefit from this write threading and may need configurable
CPU scheduling.  Joe, I've not seen a post from you on linux-raid in a
while, so I don't know if you've been following this topic.  Shaohua
has created patch sets to eliminate, or dramatically mitigate, the
horrible single-threaded write performance of md/RAID 1, 10, 5, and 6
on SSD.  Throughput no longer hits a wall when a single core peaks, as
it does with the currently shipping kernel code.  Your thoughts?

On 3/28/2013 9:34 PM, Shaohua Li wrote:
...
> Frankly I don't like the cpuset way. It might just work, but it's just another
> API to control process affinity and has no essential difference from my
> approach (which sets process affinity directly). Generally we use cpusets
> instead of plain process affinity because of things like inherited affinity,
> and the raid5 threads don't involve those.

First I should again state I'm not a developer, but a sysadmin, and this
is the viewpoint from which I speak.

The essential difference I see is the user interface the sysadmin will
employ to tweak thread placement and behavior.  Hypothetically, say I
have a 64 socket Altix UV machine w/8 core CPUs, 512 cores in total.
Each node board has two sockets, i.e. two distinct NUMA nodes (64
total), but these share a NUMAlink hub interface chip connecting them
to the rest of the machine, and they share a PCIe mezzanine interface.

We obviously want to keep md/RAID housekeeping bandwidth (stripe cache,
RMW reads, etc.) isolated to the node where the storage is attached, so
it doesn't needlessly traverse the NUMAlink system interconnect, eating
its precious, limited, relatively high latency bandwidth.  We need to
keep that interconnect free for our parallel application, which is
eating 100% of the other 504 cores and saturating NUMAlink with MPI and
file IO traffic.
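
For what it's worth, checking which node an HBA actually hangs off of
is easy from sysfs.  The PCI address below is made up, purely for
illustration:

    # which NUMA node is this IB HBA's PCIe slot attached to?
    cat /sys/bus/pci/devices/0000:83:00.0/numa_node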

So let's say I have one NUMA node out of 64 dedicated to block device
IO.  It has a PCIe x8 v2 IB 4x QDR HBA (4GB/s) connecting it to a SAN
box with 18 SSDs (and 128 SAS rust spindles).  The SAN's RAID ASIC
can't keep up with SSD RAID5 IO rates while also doing RAID for the
rust, so we export the SSDs individually and build two 9-drive md/RAID5
arrays from them.  I've already created a cpuset on this NUMA node
strictly for storage-related processes, including but not limited to
the XFS utils, backup processes, snapshots, etc., so that the only
block IO traversing NUMAlink is user application data.  Now I add
another 18 SSDs to the SAN chassis, and another IB HBA to this node
board.
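
To make that concrete, the setup looks roughly like the sketch below,
using the legacy cpuset filesystem.  All device names, CPU numbers and
node numbers are invented for illustration:

    # two 9-drive md/RAID5 arrays from the individually exported SSDs
    mdadm --create /dev/md0 --level=5 --raid-devices=9 /dev/sd[b-j]
    mdadm --create /dev/md1 --level=5 --raid-devices=9 /dev/sd[k-s]

    # storage cpuset confined to the NUMA node the IB HBA hangs off of
    mkdir /dev/cpuset
    mount -t cpuset cpuset /dev/cpuset
    mkdir /dev/cpuset/storage
    echo 0-7 > /dev/cpuset/storage/cpus    # CPUs of the storage node
    echo 0   > /dev/cpuset/storage/mems    # its local memory node

    # storage-related userspace (backup job, xfs_fsr, etc.) goes in it;
    # the existing mdX_raid5 threads can usually be dropped in by hand
    echo $BACKUP_PID           > /dev/cpuset/storage/tasks
    echo $(pgrep -x md0_raid5) > /dev/cpuset/storage/tasks
    echo $(pgrep -x md1_raid5) > /dev/cpuset/storage/tasks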

Ideally, my md/RAID write threads would already be bound to this
cpuset, so all I should need to do is add this second NUMA node to the
cpuset and I'm done.  No need to monkey with additional md/RAID
specific interfaces.
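
In cpuset terms that's nothing more than widening the existing set,
something along these lines (numbers invented again):

    echo 0-15 > /dev/cpuset/storage/cpus   # both NUMA nodes on the board
    echo 0-1  > /dev/cpuset/storage/mems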

Now, that's the simple scenario.  On this particular machine's
architecture you have two NUMA nodes per physical node board, so
expanding storage hardware on the same node board is straightforward,
as above.  However, most Altix UV machines will have storage HBAs
plugged into many node boards.  If we create one cpuset and put all the
md/RAID write threads in it, then we get housekeeping RAID IO
traversing the NUMAlink interconnect.  So in that case we'd want to pin
the threads to the physical node board where the PCIe cards, and thus
the disks, are attached.

The 'easy' way to do this is simply to create multiple cpusets, one for
each storage node (see the sketch below).  But then you have the
downside of administration headaches, because you may need to pin your
FS utils, backups, etc. to a different storage cpuset depending on
which HBAs a given filesystem resides behind, and do this each and
every time, which is a nightmare with scheduled jobs.  Thus in this
case it's probably best to retain the single storage cpuset and simply
make sure the node boards share the same upstream switch hop, keeping
the traffic as local as possible.  The kernel scheduler might already
have some NUMA scheduling intelligence that works automagically even
within a cpuset to minimize this; I simply lack knowledge in this area.
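
For reference, the per-board layout I'd rather avoid would look roughly
like this (all numbers invented):

    mkdir /dev/cpuset/storage_board0 /dev/cpuset/storage_board6
    echo 0-15   > /dev/cpuset/storage_board0/cpus   # board 0's two nodes
    echo 0-1    > /dev/cpuset/storage_board0/mems
    echo 96-111 > /dev/cpuset/storage_board6/cpus   # board 6's two nodes
    echo 12-13  > /dev/cpuset/storage_board6/mems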

>> I still like the idea of an 'ioctl' which a process can call and will cause
>> it to start handling requests.
>> The process could bind itself to whatever cpu or cpuset it wanted to, then
>> could call the ioctl on the relevant md array, and pass in a bitmap of cpus
>> which indicate which requests it wants to be responsible for.  The current
>> kernel thread will then only handle requests that no-one else has put their
>> hand up for.  This leaves all the details of configuration in user-space
>> (where I think it belongs).
> 
> The 'ioctl' way is interesting. But there are some things we need to answer:
> 
> 1. How does the kernel know whether there will be a process to handle one
> cpu's requests before the 'ioctl' is called? I suppose you want 2 ioctls. One
> ioctl tells the kernel that the process handles requests from the cpus of a
> cpumask. The other ioctl does the request handling. The process must sleep in
> the ioctl to wait for requests.
> 
> 2. If the process is killed in the middle, how does the kernel know? Do you
> want to hook something into the task management code? For a normal process
> exit, we need another ioctl to tell the kernel the process is exiting.
> 
> The only difference between this way and mine is whether the request-handling
> task is in userspace or kernel space. In both cases you need to set affinity
> and use ioctl/sysfs to control which requests the process handles.

Being a non-dev I lack the requisite knowledge to comment on ioctls.
I'll simply reiterate that whatever you go with should make use of an
existing, familiar user interface where this same kind of scheduling is
already handled, which is cpusets.  The only difference being kernel vs
user space, which may turn out to be a problem; I dunno.

-- 
Stan
