On Sat, Dec 7, 2013 at 3:12 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
On 12/6/2013 4:14 PM, Mike Dacre wrote:...
> On Fri, Dec 6, 2013 at 12:58 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
> UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science xfs
> defaults,inode64 1 0

Your RAID card has persistent write cache (BBWC) and we know it's
enabled from your tool output. By default XFS assumes BBWC is not
present, and uses write barriers to ensure order/consistency. Using
barriers on top of BBWC will be detrimental to write performance, for a
couple of reasons:
1. Prevents the controller from optimizing writeback patterns
2. A portion, or all of, the write cache is frequently flushed
Add 'nobarrier' to your mount options to avoid this problem. It should
speed up many, if not all, write operations considerably, which will in
turn decrease seek contention amongst jobs. Currently your write cache
isn't working nearly as well as it should, and in fact could be
operating horribly.
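For reference, with that change the server-side fstab entry above would look something like this (same UUID and options you posted, with nobarrier appended):

  UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science xfs defaults,inode64,nobarrier 1 0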
> On the slave nodes, I managed to reduce the demand on the disks by adding
> the actimeo=60 mount option. Prior to doing this I would sometimes see the
> disk being negatively affected by enormous numbers of getattr requests.
> Here is the fstab mount on the nodes:
>
> 192.168.2.1:/science /science nfs
> defaults,vers=3,nofail,actimeo=60,bg,hard,intr,rw 0 0

One minute attribute cache lifetime seems maybe a little high for a
compute cluster. But if you've had no ill effects and it squelched the
getattr flood this is good.
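If you want to keep an eye on it, the getattr rate can be watched from a client with nfsstat; something like the following (the interval is arbitrary) shows whether the counter is still climbing rapidly:

  # client-side NFS RPC counters, including getattr
  nfsstat -c
  # or watch the deltas every 5 seconds
  watch -d -n 5 nfsstat -c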
...
> Correct, I am not consciously aligning the XFS to the RAID geometry, I
> actually didn't know that was possible.

XFS alignment is not something to worry about in this case.
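For what it's worth, alignment is normally specified at mkfs time via the su/sw options; a purely hypothetical example for a 64KiB strip across 14 data spindles would be:

  # su = per-disk strip size, sw = number of data spindles (values hypothetical)
  mkfs.xfs -d su=64k,sw=14 /dev/sdX

but as said above it isn't worth worrying about here.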
...
>> So it's a small compute cluster using NFS over Infiniband for shared
>> file access to a low performance RAID6 array. The IO resource sharing
>> is automatic. But AFAIK there's no easy way to enforce IO quotas on
>> users or processes, if at all. You may simply not have sufficient IO to
>> go around. Let's ponder that.
>
> I have tried a few things to improve IO allocation. BetterLinux has a
> cgroup control suite that allows on-the-fly user-level IO adjustments;
> however, I found it to be quite cumbersome.

This isn't going to work well because a tiny IO stream can seek the
disks to death, such as a complex find command, ls -R, etc. A single
such command can generate thousands of seeks. Shaping/limiting
user IO won't affect this.
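For reference, per-group throttling with the kernel's blkio cgroup controller (cgroup v1, current at the time) looks roughly like this; the group name, device numbers and limit are made up:

  # create a group and cap reads from /dev/sda (major:minor 8:0) at 10MB/s
  mkdir /sys/fs/cgroup/blkio/lowprio
  echo "8:0 10485760" > /sys/fs/cgroup/blkio/lowprio/blkio.throttle.read_bps_device
  # move an offending process (PID is a placeholder) into the group
  echo 12345 > /sys/fs/cgroup/blkio/lowprio/tasks

Even with something like this in place, a single metadata-heavy command uses very little bandwidth yet still generates a flood of seeks, which is the point above.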
...
>> Looking at the math, you currently have approximately 14*150=2100
>> seeks/sec capability with 14x 7.2k RPM data spindles. That's less than
>> 100 seeks/sec per compute node, i.e. each node is getting about two
>> thirds of the performance of a single SATA disk from this array. This
>> simply isn't sufficient for servicing a 23 node cluster, unless all
>> workloads are compute bound, and none IO/seek bound. Given the
>> overload/crash that brought you to our attention, I'd say some of your
>> workloads are obviously IO/seek bound. I'd say you probably need
>> more/faster disks. Or you need to identify which jobs are IO/seek heavy
>> and schedule them so they're not running concurrently.
>
> Yes, this is a problem. We sadly lack the resources to do much better than
> this; we have recently been adding extra storage by just chaining together
> USB3 drives with RAID and LVM... which is cumbersome and slow, but cheaper.

USB disks are generally a recipe for disaster. There are plenty of horror
stories on both this list and linux-raid regarding USB connected drives,
enclosures, etc. I pray you don't run into those problems.
> My current solution is to be on the alert for high IO jobs, and to move
> them to a specific torque queue that limits the number of concurrent jobs.
> This works, but I have not found a way to do it automatically.
> Thankfully, with a 12 member lab, it is actually not terribly complex to
> handle, but I would definitely prefer a more comprehensive solution. I
> don't doubt that the huge IO and seek demands we put on these disks will
> cause more problems in the future.
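For anyone wanting to copy that approach, a throttled Torque queue along those lines can be set up with qmgr; the queue name and job limit below are placeholders:

  # an execution queue that runs at most 4 jobs at once
  qmgr -c "create queue highio"
  qmgr -c "set queue highio queue_type = Execution"
  qmgr -c "set queue highio max_running = 4"
  qmgr -c "set queue highio enabled = True"
  qmgr -c "set queue highio started = True"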
Your LSI 9260 controller supports using SSDs for read/write flash cache.
LSI charges $279 for it. It's called CacheCade Pro:
http://www.lsi.com/products/raid-controllers/pages/megaraid-cachecade-pro-software.aspx
Connect two good quality fast SSDs to the controller, such as:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820147192
Two SSDs, mirrored, to prevent cached writes from being lost if a single
SSD fails. You now have a ~90K IOPS, 128GB, 500MB/s low latency
read/write cache in front of your RAID6 array. This should go a long
way toward eliminating your bottlenecks. You can accomplish this for
~$550 assuming you have two backplane drive slots free for the SSDs. If
not, you add one of these for $279:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816117207
This is an Intel 24 port SAS expander, the same device as in your drive
backplane. SAS expanders can be daisy chained many deep. You can drop
it into a PCIe x4 or greater slot from which it only draws power--no
data pins are connected. Or if no slots are available you can mount it
to the side wall of your rack server chassis and power it via the 4 pin
Molex plug. This requires a drill, brass or plastic standoffs, and DIY
skills. I use this option as it provides a solid mount for un/plugging
the SAS cables, and being side mounted neither it nor the cables
interfere with airflow.
You'll plug the 9260-4i into one port of the Intel expander. You'll
need another SFF-8087 cable for this:
http://www.newegg.com/Product/Product.aspx?Item=N82E16812652015
You will plug your drive backplane cable into another of the 6 SFF-8087
ports on the Intel. Into a 3rd port you will plug an SFF-8087 breakout
cable to give you 4 individual drive connections. You will plug two of
these into your two SSDs.
http://www.newegg.com/Product/Product.aspx?Item=N82E16816116097
If you have no internal 2.5/3.5" drive brackets free for the SSDs and
you'd prefer not to drill (more) holes in the chassis to directly mount
them or a new cage for them, simply use some heavy duty Velcro squares,
2" is fine.
Worst case scenario you're looking at less than $1000 to cure your IO
bottlenecks, or at the very least mitigate them to a minor annoyance
instead of a show stopper. And if you free up some money for some
external JBODs and drives in the future, you can route 2 of the unused
SFF-8087 connectors of the Intel Expander out the back panel to attach
expander JBOD enclosures, using one of these and 2 more of the 8087
cables up above:
http://www.ebay.com/itm/8-Port-SAS-SATA-6G-Dual-SFF-8088-mini-SAS-to-SFF-8087-PCIe-Adapter-w-LP-Bracket-/390508767029
I'm sure someone makes a 3 port model but 10 minutes of searching didn't
turn one up. These panel adapters are application specific. Most are
made to be mounted in a disk enclosure where the HBA/RAID card is on the
outside of the chassis, on the other end of the 8088 cable. This two
port model is designed to be inside a server chassis, where the HBA
connects to the internal 8087 ports. Think Ethernet x-over cable.
The 9260-4i supports up to 128 drives. This Intel expander and a panel
connector allow you to get there with external JBODs. The only caveat
being that you're limited to "only" 4.8 GB/s to/from all the disks.
--
Stan
Hi Stan,
Thanks for the great advice; I think you are on to something there. I will look into doing this in the next week or so when I have more time. I added 'nobarrier' to my mount options.
Thanks again, I will let you know how it goes after I have upgraded.
Best,
Mike