Joe Landman put forth on 2/26/2011 6:56 PM:

> Local drives (as you suggested later on) will deliver 75-100 MB/s of
> bandwidth, and he'd need 2 for RAID1, as well as a RAID0 (e.g. RAID10)
> for local bandwidth (150+ MB/s). 4 drives per unit, 50 units. 200 drives.

Yes, this is pretty much exactly what I mentioned: ~5GB/s aggregate. But we've still not received an accurate, detailed description from Matt regarding his actual performance needs. He's not posted iostat numbers from his current filer, or any similar metrics.

> Any admin want to admin 200+ drives in 50 chassis? Admin 50 different
> file systems?

GPFS has single point administration for all storage in all nodes.

> Oh, and what is the impact if some of those nodes went away? Would they
> take down the file system? In the cloud of microdisk model Stan
> suggested, yes they would.

No, they would not. GPFS has multiple redundancy mechanisms and can sustain multiple node failures. I think you should read the GPFS introductory documentation:

http://www.ibm.com/common/ssi/fcgi-bin/ssialias?infotype=SA&subtype=WH&appname=STGE_XB_XB_USEN&htmlfid=XBW03010USEN&attachment=XBW03010USEN.PDF

> Which is why you might not want to give that advice serious
> consideration. Unless you built in replication. Now we are at 400
> disks in 50 chassis.

Your numbers are wrong, by a factor of 2. He should research GPFS and give it serious consideration. It may be exactly what he needs.

> Again, this design keeps getting worse.

Actually it's getting better, which you'll see after reading the docs.

> Now this is sad, very sad.
>
> Stan started out selling the Nexsan version of things (and why was he

For the record, I'm not selling anything. I don't have a $$ horse in this race. I'm simply trying to show Matt some good options. I don't work for any company selling anything. I'm just an SA, giving free advice to another SA with regard to his request for information.
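On the metrics point: even something as simple as extended iostat samples and server-side NFS statistics from the current filer, captured during the busy window, would be enough to characterize the real workload. A rough sketch of what I mean (interval, sample count, and file names are arbitrary):

```shell
# Extended per-device stats from the current filer, sampled every
# 60 seconds for an hour during the busy window
iostat -x 60 60 > filer-iostat.log

# Server-side NFS operation counts over the same period
nfsstat -s > filer-nfsstat.log
```

An hour of that output during peak load would tell us more than any amount of speculation in this thread.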
I just happen to know a lot more about high performance storage than the average SA. I recommend Nexsan products because I've used them, they work very well, and they are very competitive WRT price/performance/capacity.

> doing it on the MD RAID list I wonder?),

The OP asked for possible solutions to his need. That need may not necessarily be best met by mdraid, regardless of the fact that he asked on the Linux RAID list. LED identification of a failed drive is reason enough for me not to recommend mdraid in this solution, given that he'll only have 4 disks per chassis with an inbuilt hardware RAID chip. I'm guessing the fault LED is one of the reasons why you use a combination of PCIe RAID cards and mdraid in your JackRabbit and Delta-V systems instead of strictly mdraid. I'm not knocking it; that's the only way to do it properly on such systems. Likewise, please don't knock me for recommending the obviously better solution in this case. mdraid would have no materially positive impact here, but it would introduce maintenance problems.

> which would have run into the same costs Stan noted later. Now Stan is
> selling (actually mis-selling) GPFS (again, on an MD RAID list,
> seemingly having picked it off of a website), without having a clue as
> to the pricing, implementation, issues, etc.

I first learned of GPFS in 2001 when it was deployed on the 256 node IBM Netfinity dual P3 933 Myrinet cluster at the Maui High Performance Computing Center. GPFS was deployed on that cluster using what is currently called the Network Shared Disk (NSD) protocol, spanning the 512 local disks. GPFS has grown and matured significantly in the 10 years since. Today it is most commonly deployed with a dedicated file server node farm architecture, but it still works just as well using NSD. In the configuration I suggested, each node would be both an NSD client and an NSD server.
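For anyone unfamiliar with GPFS, standing up the NSD client+server configuration I'm describing looks very roughly like the following. This is a sketch only: node names, file names, and device names are hypothetical, the disk descriptor syntax varies across GPFS releases, and the real procedure is in IBM's GPFS Concepts, Planning, and Installation Guide.

```shell
# Create the cluster from a node file listing all 50 nodes
# (primary/secondary config servers are hypothetical names)
mmcrcluster -N nodes.list -p node01 -s node02

# Turn each node's local disks into NSDs, per a descriptor file
# naming each device, its server node, and its failure group
mmcrnsd -F disks.desc

# Create one filesystem over all NSDs, with data and metadata
# replication (-m/-r 2) so the loss of a node doesn't lose data,
# then mount it cluster-wide
mmcrfs /gpfs gpfs0 -F disks.desc -m 2 -M 2 -r 2 -R 2
mmmount gpfs0 -a
```

The replication settings are the answer to the "what if a node goes away" objection above: with two replicas in separate failure groups, the filesystem rides through node failures.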
GPFS is renowned for its reliability and performance in the world of HPC cluster computing due to its excellent 10+ year track record in the field. It is years ahead of any other cluster filesystem in capability, performance, manageability, and reliability.

> I did suggest using GlusterFS as it will help with a number of aspects,
> has an open source version. I did also suggest (since he seems to wish
> to build it himself) that he pursue a reasonable design to start with,

I don't believe his desire is to actually DIY the compute and/or storage nodes. If it is, for a production system of this size/caliber, *I* wouldn't DIY in this case, and I'm the king of DIY hardware. Actually, I'm TheHardwareFreak. ;) I guess you've missed the RHS of my email addy. :) I was given that nickname, flattering or not, about 15 years ago. Obviously it stuck. It's been my vanity domain for quite a few years.

> and avoid the filer based designs Stan suggested (two Nexsan's and some
> sort of filer head to handle them), or a SAN switch of some sort.

There's nothing wrong with a single filer, just because it's a single filer. I'm sure you've sold some singles. They can be very performant. I could build a single DIY 10 GbE filer today from white box parts using JBOD enclosures that could push highly parallel NFS client reads at ~4GB/s all day long, about double the performance of your JackRabbit 5U. It would take me some time to tune PCIe interrupt routing, TCP, NFS server threading, etc., but it can be done.
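That tuning pass would touch knobs roughly like these. The values below are illustrative assumptions only, not recommendations; the right numbers can only come from testing on the actual hardware, and the IRQ number shown is hypothetical:

```shell
# More NFS server threads to service many concurrent clients
# (the distro default is often a mere 8)
echo 128 > /proc/fs/nfsd/threads

# Larger socket buffer ceilings for 10 GbE links
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# Pin a NIC queue's interrupt to a specific core to keep network
# and storage interrupts off each other's CPUs -- IRQ 90 is a
# made-up example; check /proc/interrupts for the real numbers
echo 4 > /proc/irq/90/smp_affinity
```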
Basic parts list would be something like:

1 x SuperMicro H8DG6 w/dual 8 core 2GHz Optys, 8x4GB DDR3 ECC RDIMMs
3 x LSI MegaRAID SAS 9280-4i4e PCIe x8 512MB cache
1 x NIAGARA 32714L Quad Port Fiber 10 Gigabit Ethernet NIC
1 x SUPERMICRO CSE-825TQ-R700LPB Black 2U Rackmount 700W redundant PSU
3 x NORCO DS-24E External 4U 24 Bay 6G SAS w/LSI 4x6 SAS expander
74 x Seagate ST3300657SS 15K 300GB 6Gb/s SAS, 2 boot, 72 in JBOD chassis

Configure a 24 drive HW RAID6 on each LSI HBA, mdraid linear over them, and format the mdraid device with mkfs.xfs using "-d agcount=66".

With this setup the disks will saturate the 12 SAS host channels at 7.2GB/s aggregate with concurrent parallel streaming reads, as each RAID6 array (22 data spindles) will be able to push over 3GB/s with 15k drives. This excess of disk bandwidth, and the high random IOPS of the 15k drives, ensures that highly random read loads from many concurrent NFS clients will still hit in the 4GB/s range, again, after the system has been properly tuned.

> Neither design works well in his scenario, or for that matter, in the
> vast majority of HPC situations.

Why don't you ask Matt, as I have, for an actual, accurate description of his workload? What we've been given isn't an accurate description. If it were, his current production systems would be so overwhelmed he'd already be writing checks for new gear. I've seen no iostat or other metrics, which are standard fare when asking for this kind of advice.

> I did make a full disclosure of my interests up front, and people are
> free to take my words with a grain of salt. Insinuating based upon my
> disclosure? Sad.

It just seems to me you're too willing to oversell him. He apparently doesn't have that kind of budget anyway.
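The RAID and filesystem steps above would look something like the following. A sketch only, assuming the three hardware RAID6 LUNs show up as /dev/sdb, /dev/sdc, and /dev/sdd (device names are hypothetical):

```shell
# Concatenate the three HW RAID6 LUNs into one linear md device --
# linear, not stripe, so XFS allocation groups do the distribution
mdadm --create /dev/md0 --level=linear --raid-devices=3 \
    /dev/sdb /dev/sdc /dev/sdd

# 66 allocation groups = 22 AGs per LUN, matching the 22 data
# spindles in each RAID6, so concurrent allocations (and thus I/O)
# spread evenly across all three arrays
mkfs.xfs -d agcount=66 /dev/md0
```

The linear-concat-plus-agcount trick is what lets XFS drive all three HBAs in parallel under a many-client workload without the full-stripe-width write penalty of nesting RAID0 over RAID6.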
If we, you, me, anyone, really want to give Matt good advice, regardless of how much you might profit, or the mere satisfaction I might gain from seeing one of my suggestions implemented, why don't we both agree to get as much information as possible from Matt before making any more recommendations? I think we've both forgotten once or twice in this thread that it's not about us, but about Matt's requirement.

> See GlusterFS. Open source at zero cost. However, and this is a large
> however, this design, using local storage for a pooled "cloud" of disks,
> has some often problematic issues (resiliency, performance, hotspots). A
> truly hobby design would use this. Local disk is fine for scratch
> space, for a few other things. Managing the disk spread out among 50
> nodes? Yeah, its harder.

Gluster isn't designed as a high performance parallel filesystem. It was never meant to be one. There are guys on the dovecot list who have tried it as a maildir store, and it just falls over. It simply cannot handle random IO workloads, period. And yes, it is difficult to design a high performance parallel network filesystem. Very much so. IBM has a massive lead on the other cluster filesystems, as IBM started this work back in the mid/late 90s for their Power clusters.

> I'm gonna go out on a limb here and suggest Matt speak with HPC cluster
> and storage people. He can implement things ranging from effectively
> zero cost through things which can be quite expensive. If you are
> talking to Netapp about HPC storage, well, probably move onto a real HPC
> storage shop. His problem is squarely in the HPC arena.

I'm still not convinced of that. Simply stating "I have 50 compute nodes each w/one GbE port, so I need 6GB/s of bandwidth" isn't actual application workload data. From what Matt did describe of how the application behaves, simply time shifting the data access will likely solve all of his problems, cheaply. He might even be able to get by with his current filer.
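The arithmetic behind that 6GB/s figure is worth spelling out, because it's a wire-rate ceiling, not a measured demand (the per-port numbers here are the standard GbE figures, not anything Matt has reported):

```python
# Aggregate bandwidth IF all 50 GbE ports streamed flat-out at once.
nodes = 50
per_port_mb_s = 125           # GbE theoretical payload ceiling, MB/s
aggregate_gb_s = nodes * per_port_mb_s / 1000.0
print(aggregate_gb_s)         # 6.25 (GB/s) -- the quoted "6GB/s"

# Real TCP/NFS throughput per port is closer to ~117 MB/s, and, more
# importantly, nothing says all 50 nodes ever read simultaneously.
realistic_gb_s = nodes * 117 / 1000.0
print(realistic_gb_s)
```

Unless iostat from the current filer shows all 50 nodes actually demanding that concurrently, sizing storage to 6GB/s is sizing to a theoretical worst case.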
We simply need more information. I do anyway. I'd hope you would as well.

> However, I would strongly advise against designs such as a single
> centralized unit, or a cloud of micro disks. The first design is
> decidedly non-scalable, which is in part why the HPC community abandoned
> it years ago. The second design is very hard to manage and guarantee
> any sort of resiliency. You get all the benefits of a RAID0 in what
> Stan proposed.

A single system filer is scalable up to the point you run out of PCIe slots. The system I mentioned using the Nexsan array can scale 3x before running out of slots.

I think some folks at IBM would tend to vehemently disagree with your assertions here about GPFS. :) It's the only filesystem used on IBM's pSeries clusters and supercomputers. I'd wager that IBM has shipped more GPFS nodes into the HPC marketplace than Joe's company has shipped nodes, total, ever, into any market, or ever will, by a factor of at least 100. This isn't really a fair comparison, as IBM has shipped single GPFS supercomputers with more nodes than Joe's company will sell in its entire lifespan.

Case in point: ASCI Purple has 1640 GPFS client nodes and 134 GPFS server nodes. This machine ships GPFS traffic over the IBM HPS network at 4GB/s per node link, each node having two links for 8GB/s per client node--a tad faster than GbE. ;)

For this environment, and most HPC "centers", using a few fat GPFS storage servers with hundreds of terabytes of direct attached fibre channel storage makes more sense than deploying every compute node as both a GPFS client *and* server using local disk. In Matt's case it makes more sense to do the latter, i.e. the NSD configuration.

For the curious, here are the details of the $140 million ASCI Purple system, including the GPFS setup:

https://computing.llnl.gov/tutorials/purple/

> Start out talking with and working with experts, and its pretty likely
> you'll come out with a good solution. The inverse is also true.
If by experts you mean those working in the HPC field, not vendors, that's a great idea. Matt, fire off a short, polite email to Jack Dongarra and one to Bill Camp. Dr. Dongarra is the primary author of the Linpack benchmark, which, among other things, is used to rate the 500 fastest supercomputers in the world twice yearly. His name is probably the most well known in the field of supercomputing. Bill Camp designed the Red Storm supercomputer, which is now the architectural basis for Cray's large MPP supercomputers. He works for Sandia National Laboratories, one of the US nuclear weapons laboratories. If neither of these two men has an answer for you, nor can point you to folks who do, the answer simply doesn't exist.

Out of consideration I'm not going to post their email addresses. You can find them at the following locations. While you're at it, read the Red Storm document. It's very interesting.

http://www.netlib.org/utk/people/JackDongarra/
http://www.google.com/url?sa=t&source=web&cd=3&ved=0CCEQFjAC&url=http%3A%2F%2Fwww.lanl.gov%2Forgs%2Fhpc%2Fsalishan%2Fsalishan2003%2Fcamp.pdf&rct=j&q=bill%20camp%20asci%20red&ei=VxRqTdTuEYOClAf4xKH_AQ&usg=AFQjCNFl420n6HAwBkDs5AFBU2TKpsiHvA&cad=rja

I've not corresponded with Professor Dongarra for many years, but back then he always answered my emails rather promptly, within a day or two. The key is to keep it short and sweet, as I'd guess the man is pretty busy. I've never corresponded with Dr. Camp, but I'm sure he'd respond to you one way or another. My experience is that technical people enjoy talking tech shop, at least to a degree.

> MD RAID, which Stan dismissed as a "hobby RAID" at first can work well

That's a mischaracterization of the statement I made.

> for Matt. GlusterFS can help with the parallel file system atop this.
> Starting with a realistic design, an MD RAID based system (self built or
> otherwise) could easily provide everything Matt needs, at the data rates
> he needs it, using entirely open source technologies. And good designs.

I don't recall Matt saying he needed a solution based entirely on FOSS. If he did, I missed it. If he can accomplish his goals with all FOSS, that's always a plus in my book. However, I'm not averse to closed source when it's a better fit for a requirement.

> You really won't get good performance out of a bad design. The folks

That's brilliant insight. ;)

> doing HPC work who've responded have largely helped frame good design
> patterns. The folks who aren't sure what HPC really is, haven't.

The folks who use the term HPC as a catch-all, speaking as if there is one workload pattern or only one file access pattern which comprises HPC, as Joe continues to do, and who attempt to tell others they don't know what they're talking about when they most certainly do, should be viewed with some skepticism.

Just as in the business sector, there are many widely varied workloads in the HPC space. At opposite ends of the disk access spectrum, analysis applications tend to read a lot and write very little, while simulation applications tend to read very little and generate a tremendous amount of output. For each of these, some benefit greatly from highly parallel communication and disk throughput, and some don't. Some benefit from extreme parallelism, using message passing and Lustre file access over InfiniBand; some with lots of serialization don't. Some may benefit from OpenMP parallelism but only mild amounts of disk parallelism.

In summary, there are many shades of HPC. For maximum performance and ROI, just as in the business world or any other computing world, one needs to optimize one's compute and storage systems to meet one's particular workload. There isn't one size that fits all.
Thus, contrary to what Joe would have anyone here believe, NFS filers are a perfect fit for some HPC workloads. To say that any workload which works fine with an NFS filer isn't an HPC workload is simply rubbish. One need look no further than a little way back in this thread to see this: on the one hand, Joe says Matt's workload is absolutely an HPC workload; on the other, Matt currently runs that workload on an NFS filer, which by Joe's reasoning would make it not an HPC workload. Just a bit of self-contradiction there.

Instead of arguing about what is and is not HPC, and arguing that Matt's workload is "an HPC workload", I think, again, that nailing down his exact data access profile and making a recommendation based on that is what he needs. I'm betting he couldn't care less whether his workload is "an HPC workload" or not.

I'm starting to tire of this thread. Matt has plenty of conflicting information to sort out. I'll be glad to answer any questions he may have of me.

-- 
Stan