On Mon, May 02, 2011 at 11:47:48AM -0400, Paul Anderson wrote:
> Our genetic sequencing research group is growing our file storage
> from 1PB to 2PB.
.....
> We are deploying five Dell 810s, 192GiB RAM, 12 core, each with three
> LSI 9200-8E SAS controllers, and three SuperMicro 847 45 drive bay
> cabinets with enterprise grade 2TB drives.

So roughly 250TB raw capacity per box.

> We're running Ubuntu 10.04 LTS, and have tried either the stock
> kernel (2.6.32-30) or 2.6.35 from linux.org.

(OT: why do people install a desktop OS on their servers?)

> We organize the storage as one software (MD) RAID 0 composed of 7
> software RAID (MD) 6s, each with 18 drives, giving 204 TiB usable (9
> drives of the 135 are unused).

That's adventurous. I would seriously consider rethinking this -
hardware RAID-6 with controllers that have a significant amount of
BBWC is much more appropriate for this scale of storage. You get an
unclean shutdown (e.g. power loss) and MD is going to take _weeks_ to
resync those RAID-6 arrays. Background scrubbing is likely to never
cease, either....

Also, it would help to know how you spread out the disks in each
RAID-6 group between controllers, trays, etc., as that has important
performance and failure implications. e.g. I'm guessing that you are
taking 6 drives from each enclosure for each 18-drive RAID-6 group,
which would split the RAID-6 group across all three SAS controllers
and enclosures. That means if you lose a SAS controller or enclosure
you lose all RAID-6 groups at once, which is effectively catastrophic
from a recovery point of view. It also means that one slow controller
slows down everything, so load balancing is difficult.

Large stripes might look like a good idea, but when you get to this
scale, concatenation of high-throughput LUNs provides better
throughput because of less contention through the storage controllers
and enclosures.

> XFS is set up properly (as far as I know) with respect to stripe and
> chunk sizes.

Any details? You might be wrong ;)

> Allocation groups are 1TiB in size, which seems sane for the size of
> files we expect to work with.

Any filesystem over 16TB will use 1TB AGs.

> In isolated testing, I see around 5GiBytes/second raw (135 parallel
> dd reads), and with a benchmark test of 10 simultaneous 64GiByte dd
> commands, I can see just shy of 2 GiBytes/second reading, and around
> 1.4GiBytes/second writing through XFS. The benchmark is crude, but
> fairly representative of our expected use.

If you want insightful comments, then you'll need to provide intimate
details of the tests you ran and the results (e.g. command lines, raw
results, etc).

> md apparently does not support barriers, so we are badly exposed in
> that manner, I know. As a test, I disabled write cache on all drives,
> performance dropped by 30% or so, but since md is apparently the
> problem, barriers still didn't work.

Doesn't matter if you have BBWC on your hardware RAID controllers.
Seriously, if you want to sustain high throughput, you want a large
amount of BBWC in front of your disks....

> Nonetheless, what we need, but don't have, is stability.
>
> With 2.6.32-30, we get reliable kernel panics after 2 days of
> sustained rsync to the machine (around 150-250MiBytes/second for the
> entire time - the source machines are slow),

Stack traces from the crash?
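If nothing useful is making it into the logs before the box dies, a
minimal sketch of what I'd set up before the next rsync run - these
are the standard sysctl and sysrq knobs, but verify them on your
kernels, and I'm assuming your serial console ends up on ttyS0:

  # On the kernel command line, so panic output goes out the serial
  # port as well as the local console:
  #   console=ttyS0,115200 console=tty0

  # Push all kernel messages to the console at full verbosity so the
  # serial capture sees the whole oops:
  echo 9 > /proc/sys/kernel/printk

  # Turn an oops into a panic so you get one definitive trace instead
  # of a slowly wedging machine:
  echo 1 > /proc/sys/kernel/panic_on_oops

  # Make sure sysrq is enabled; then, when it hangs rather than
  # panics, you can dump the stacks of all blocked (D state) tasks:
  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger

That kind of output is what turns "it panics after 2 days" into
something debuggable.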
> and with 2.6.35, we get a bad resource contention problem fairly
> quickly - much less than 24 hours (in this instance, we start getting
> XFS kernel thread timeouts similar to what I've seen posted here
> recently, but it isn't clear whether it is only XFS or also ext3 boot
> drives that are starved for I/O - suspending or killing all I/O load
> doesn't solve the problem - only a reboot does).

Details of the timeout messages?

> Ideally, I'd firstly be able to find informed opinions about how I
> can improve this arrangement - we are mildly flexible on RAID
> controllers, very flexible on versions of Linux, etc, and can try
> other OS's as a last resort (but the leading contender here would be
> "something" running ZFS, and though I love ZFS, it really didn't seem
> to work well for our needs).
>
> Secondly, I welcome suggestions about which version of the linux
> kernel you'd prefer to hear bug reports about, as well as what kinds
> of output is most useful (we're getting all chassis set up with
> serial console so we can do kgdb and also full kernel panic output
> results).

If you want to stay on mainline kernels with best-effort community
support, I'd suggest that 2.6.38 or more recent kernels are the only
ones we're going to debug. If you want fixes, then running the
current -rc kernels is probably a good idea. It's unlikely you'll get
anyone backporting fixes for you to older kernels.

Alternatively, you can switch to something like RHEL (or SLES) where
XFS is fully supported (and in the RHEL case, pays my bills :). The
advantage of this is that once the bug is fixed in mainline, it will
get backported to the supported kernel you are running.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
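PS: on the "command lines" point above - what's useful is the exact
invocations, so we can see the I/O size and whether you're going
through the page cache or using direct I/O. A made-up sketch of the
level of detail I mean (the paths and dd flags are my guesses, not
your actual test):

  # 10 concurrent 64GiB sequential writes through XFS, 1MiB I/Os,
  # direct I/O:
  for i in $(seq 0 9); do
      dd if=/dev/zero of=/mnt/test/file.$i bs=1M count=65536 \
          oflag=direct &
  done
  wait

  # ... and the matching concurrent reads:
  for i in $(seq 0 9); do
      dd if=/mnt/test/file.$i of=/dev/null bs=1M iflag=direct &
  done
  wait

With that plus the raw numbers, people here can actually tell you
whether the result is sane for your stripe layout.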