Re: [patch 2/2 v3] raid5: create multiple threads to handle stripes

On Fri, Mar 29, 2013 at 04:36:14AM -0500, Stan Hoeppner wrote:
> I'm CC'ing Joe Landman as he's already building systems of the caliber
> that would benefit from this write threading and may need configurable
> CPU scheduling.  Joe, I've not seen a post from you on linux-raid in a
> while so I don't know if you've been following this topic.  Shaohua has
> created patch sets to eliminate, or dramatically mitigate, the horrible
> single-threaded write performance of md/RAID 1, 10, 5, and 6 on SSD.
> Throughput no longer hits a wall due to maxing out a single core, as with the
> currently shipping kernel code.  Your thoughts?
> 
> On 3/28/2013 9:34 PM, Shaohua Li wrote:
> ...
> > Frankly I don't like the cpuset way. It might just work, but it's just another
> > API to control process affinity and has no essential difference from my
> > approach (which directly sets process affinity). Generally the reason to use
> > cpusets instead of plain process affinity is something like inherited
> > affinity, which the raid5 threads don't involve.
> 
> First I should again state I'm not a developer, but a sysadmin, and this
> is the viewpoint from which I speak.
> 
> The essential difference I see is the user interface the sysadmin will
> employ to tweak thread placement/behavior.  Hypothetically, say I have a
> 64-socket Altix UV machine with 8-core CPUs, 512 cores total.  Each node
> board has two sockets, i.e. two distinct NUMA nodes (64 total), but these
> share a NUMAlink hub interface chip connection to the rest of the machine
> and share a PCIe mezzanine interface.
> 
> We obviously want to keep md/RAID housekeeping traffic (stripe cache,
> RMW reads, etc.) isolated to the node where the storage is attached, so it
> doesn't needlessly traverse the precious, limited, relatively high-latency
> NUMAlink system interconnect.  We need to keep that bandwidth free for
> our parallel application, which is eating 100% of the other 504 cores and
> saturating NUMAlink with MPI and file IO traffic.
> 
> So let's say I have one NUMA node out of 64 dedicated to block device IO.
>  It has a PCIe x8 v2 IB 4x QDR HBA (4GB/s) connection to a SAN box with
> 18 SSDs (and 128 SAS rust).  The SAN RAID ASIC can't keep up with SSD
> RAID5 IO rates while also doing RAID for the rust, so we export the
> SSDs individually and make two 9-drive md/RAID5 arrays.  I've already
> created a cpuset containing this NUMA node for strictly storage-related
> processes, including but not limited to XFS utils, backup processes,
> snapshots, etc., so that the only block IO traversing NUMAlink is user
> application data.  Now I add another 18 SSDs to the SAN chassis, and
> another IB HBA to this node board.
> 
> Ideally, my md/RAID write threads should already be bound to this
> cpuset.  So all I should need to do is add this 2nd node to the cpuset
> and I'm done.  No need to monkey with additional md/RAID specific
> interfaces.
> 
> Now, that's the simple scenario.  On this particular machine's
> architecture, you have two NUMA nodes per physical node board, so expanding
> storage hardware on the same node board should be straightforward, as above.
>  However, most Altix UV machines will have storage HBAs plugged into
> many node boards.  If we create one cpuset and put all the md/RAID write
> threads in it, then we get housekeeping RAID IO traversing the NUMAlink
> interconnect.  So in this case we'd want to pin the threads to the
> physical node board where the PCIe cards, and thus disks, are attached.
> 
> The 'easy' way to do this is to simply create multiple cpusets, one for
> each storage node.  But then you have the downside of administration
> headaches, because you may need to pin your FS utils, backups, etc. to a
> different storage cpuset depending on which HBAs the filesystem resides
> behind, and do this each and every time, which is a nightmare with scheduled
> jobs.  Thus in this case it's probably best to retain the single storage
> cpuset and simply make sure the node boards share the same upstream
> switch hop, keeping the traffic as local as possible.  The kernel
> scheduler might already have some NUMA scheduling intelligence here that
> works automagically even within a cpuset, to minimize this.  I simply
> lack knowledge in this area.
> 
> >> I still like the idea of an 'ioctl' which a process can call and will cause
> >> it to start handling requests.
> >> The process could bind itself to whatever cpu or cpuset it wanted to, then
> >> could call the ioctl on the relevant md array, and pass in a bitmap of cpus
> >> which indicates which requests it wants to be responsible for.  The current
> >> kernel thread will then only handle requests that no-one else has put their
> >> hand up for.  This leaves all the details of configuration in user-space
> >> (where I think it belongs).
> > 
> > The 'ioctl' way is interesting. But there are some things we need to answer:
> > 
> > 1. How does the kernel know whether there will be a process to handle a given
> > CPU's requests before an 'ioctl' is called? I suppose you want 2 ioctls. One
> > ioctl tells the kernel that the process handles requests from the CPUs of a
> > cpumask. The other ioctl does the request handling. The process must sleep in
> > the ioctl to wait for requests.
> > 
> > 2. If a process is killed in the middle, how does the kernel know? Do you want
> > to hook something into the task management code? For normal process exit, we
> > need another ioctl to tell the kernel the process is exiting.
> > 
> > The only difference between this way and my way is whether the request-handling
> > task is in userspace or kernel space. In both cases, you need to set affinity
> > and use an ioctl/sysfs to control the source of the requests the process handles.
> 
> Being a non-dev I lack the requisite knowledge to comment on ioctls.  I'll
> simply reiterate that whatever you go with should make use of an
> existing, familiar user interface where this same scheduling is already
> handled, which is cpusets.  The only difference being kernel vs. user
> space, which may turn out to be a problem, I dunno.

Hmm, there might be a misunderstanding here. In my approach:

#echo 3 > /sys/block/md0/md/auxthread_number
This creates several kernel threads to handle requests. You can use any
approach to set SMP affinity for the threads; you can use a cpuset to bind
the threads too.

#echo 1-3 > /sys/block/md0/md/auxth0/cpulist
This does not set the above threads' affinity. It sets which CPUs' requests
the thread should handle. Regardless of whether we use my approach, cpusets,
or an ioctl, we need a similar way to notify a worker thread which CPUs'
requests it should handle (unless we have a hook in the scheduler, so that we
get a notification whenever a thread's affinity changes).
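
Putting the two together, a hypothetical setup for a storage node owning CPUs
8-15 might be (assuming the per-thread directories continue as auxth0, auxth1,
... and using taskset for the affinity side):

#echo 2 > /sys/block/md0/md/auxthread_number
#taskset -pc 8-11 <pid of aux thread 0>
#taskset -pc 12-15 <pid of aux thread 1>
#echo 8-11 > /sys/block/md0/md/auxth0/cpulist
#echo 12-15 > /sys/block/md0/md/auxth1/cpulist

The taskset lines control where each thread runs; the cpulist lines control
which submitting CPUs' requests each thread picks up. The two settings are
independent.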

In summary, my approach doesn't prevent you from using cpusets. Did I miss something?

Thanks,
Shaohua