Dave, thanks for your feedback - comments below - possibly of interest to others.

Several underlying assumptions strongly influence the choices I've made here:

- Sequential I/O is of paramount importance - all else is nearly
  insignificant (not entirely true, but a reasonable plan for the coming
  year or two).

- Highly I/O-intensive work can and should be done locally to avoid
  networking (NFS and 10GigE just add more delays - later, research could
  be done into saturating a 10GigE link in a variety of other ways, but
  that is of secondary concern to me today).

- Compute-intensive workloads will start looking more random, because we'll
  send those out to the grid, and large numbers of incoming requests make
  the I/O stream less predictable. Mind you, I envision eliminating NFS or
  any other network filesystem in favor of straight TCP/IP or even
  something like RoCE from Red Hat. With proper buffering, even serving
  data like this can, by and large, look sequential.

- The team here favors large filesystems because, from the user
  perspective, that is simply easier than having to juggle space among
  distinct partitions. The easy administrative solution of splitting 204TiB
  into, say, 7 mounted volumes imposes a big barrier to how work is
  organized, and wastes storage besides.

- I believe that typical working file sizes will exceed 100GiB within a
  year or two - for example, one project is generating 250 sequencing
  sample files, each 250GiB in size, which we need to pull, reprocess, and
  analyze. This is fallout from the very rapid drop in the cost of genome
  sequencing that is still underway.

On Mon, May 2, 2011 at 11:18 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, May 02, 2011 at 11:47:48AM -0400, Paul Anderson wrote:
>> Our genetic sequencing research group is growing our file storage from
>> 1PB to 2PB.
> .....
>> We are deploying five Dell 810s, 192GiB RAM, 12 core, each with three
>> LSI 9200-8E SAS controllers, and three SuperMicro 847 45 drive bay
>> cabinets with enterprise grade 2TB drives.
>
> So roughly 250TB raw capacity per box.
>
>> We're running Ubuntu 10.04 LTS, and have tried either the stock kernel
>> (2.6.32-30) or 2.6.35 from linux.org.
>
> (OT: why do people install a desktop OS on their servers?)

Our end users want many GUI-based apps running on the compute head nodes, ergo we wind up installing most of the desktop anyway, so it is just easier to install it and add whatever server-related packages we may need. I'm not fond of that situation myself.

>
>> We organize the storage as one
>> software (MD) RAID 0 composed of 7 software RAID (MD) 6s, each with 18
>> drives, giving 204 TiB usable (9 drives of the 135 are unused).
>
> That's adventurous. I would seriously consider rethinking this -
> hardware RAID-6 with controllers that have a significant amount of
> BBWC is much more appropriate for this scale of storage. You get an
> unclean shutdown (e.g. power loss) and MD is going to take _weeks_
> to resync those RAID6 arrays. Background scrubbing is likely to
> never cease, either....

18 hours from start - remember the resync proceeds at over 4GiBytes/sec (14.5 hours if it were exactly 4GiBytes/sec). The big problem with my setup is the lack of BBWC. The controllers are running in JBOD mode, and I can disable the per-drive write cache and still maintain decent performance across the array.
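For what it's worth, the knobs involved are nothing exotic - roughly the following (the device glob matches this box's drive naming and would need adjusting elsewhere, and the resync ceiling value is just an example):

  # Disable the on-drive write cache on every data disk.
  for dev in /dev/sd[b-z] /dev/sd[a-z][a-z] ; do
      hdparm -W0 $dev      # or: sdparm --clear=WCE $dev for SCSI devices
  done

  # Keep an eye on resync/rebuild progress across all the md arrays.
  watch -n 5 cat /proc/mdstat

  # Raise the per-array md resync ceiling (KB/s) if the default throttles it.
  echo 800000 > /proc/sys/dev/raid/speed_limit_max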
That said, there are few if any cases where we care about loss of in-flight data - we care a great deal about static data being corrupted or lost due to metadata corruption, so this is still probably an open issue (ideas welcome).

> Also, knowing how you spread out the disks in each RAID-6 group
> between controllers, trays, etc. would be useful, as that has important
> performance and failure implications.

You bet!

> e.g. I'm guessing that you are taking 6 drives from each enclosure
> for each 18-drive raid-6 group, which would split the RAID-6 group
> across all three SAS controllers and enclosures. That means if you
> lose a SAS controller or enclosure you lose all RAID-6 groups at
> once which is effectively catastrophic from a recovery point of view.
> It also means that one slow controller slows down everything so load
> balancing is difficult.

Each of the three enclosures has a pair of SAS expanders, and each LSI 9200-8E controller has two SAS cables, so I actually ordered the RAID-6 drive sets as subsets of three drives, each subset taken from a different controller card in round-robin fashion until a full set of 18 drives is built. A wrinkle is that the SAS expanders serve differing numbers of drives - 24 in front, 21 in the rear (the other 3 rear bays are taken by the power supplies) - so finding a good match of RAID size versus available channels, and splitting I/O across those channels, is a bit challenging.

> Large stripes might look like a good idea, but when you get to this
> scale concatenation of high throughput LUNs provides better
> throughput because of less contention through the storage
> controllers and enclosures.

I don't disagree, but what I need to do is run a scripted test varying stripe size, stripe unit, chunk size (md parameter), etc. - this gets cumbersome with 135 drives, as trying to strike a good balance across the available resources is tedious and not automatic. Basically, I found a combination (described immediately below) that works pretty well, and started working on problems other than performance. I have sufficient hardware to test other combinations, but time to run them is an issue for me (i.e. set them up precisely right, babysit them, wait for parity to build, then test - yes, I tested on various subsets of the full 126-drive array, but getting those configs right and then knowing you can extrapolate to the full-size set is confusing and hurts my poor little head).

>> XFS
>> is set up properly (as far as I know) with respect to stripe and chunk
>> sizes.
>
> Any details? You might be wrong ;)

Oh yes indeedy, I could be wrong!
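For concreteness, one of the 18-drive sets plus the filesystem geometry on top boils down to roughly the following (device names here are placeholders only - as described above, the real member lists interleave drives from all three controllers):

  # One 18-drive RAID-6 leg with a 64K chunk (placeholder member devices).
  mdadm --create /dev/md0 --metadata=1.1 --level=6 --raid-devices=18 \
        --chunk=64 /dev/sd[b-s]1

  # md1..md6 are built the same way, then the seven legs are striped together.
  mdadm --create /dev/md8 --metadata=1.1 --level=0 --raid-devices=7 \
        --chunk=1024 /dev/md[0-6]

  # XFS stripe geometry matched to one RAID-6 leg: 64K chunk x 16 data disks.
  mkfs.xfs -d su=64k,sw=16 /dev/md8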
Each of the 126 in-use drives shows something like this (mdadm --examine):

  /dev/sdbc1:
            Magic : a92b4efc
          Version : 1.1
      Feature Map : 0x0
       Array UUID : f3c44896:ecdcadca:153ee6d1:1770781f
             Name : louie:5  (local to host louie)
    Creation Time : Fri Apr 8 15:01:16 2011
       Raid Level : raid6
     Raid Devices : 18

   Avail Dev Size : 3907026856 (1863.02 GiB 2000.40 GB)
       Array Size : 62512429056 (29808.25 GiB 32006.36 GB)
    Used Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
      Data Offset : 264 sectors
     Super Offset : 0 sectors
            State : clean
      Device UUID : adbd8716:94ebf4a2:ea753ee0:418b7bd8

      Update Time : Tue May 3 11:18:45 2011
         Checksum : 44d36ef7 - correct
           Events : 187

       Chunk Size : 64K

There are 7 such RAID-6 arrays. Each of them is in turn a member of the outer RAID 0, so examining /dev/md0 shows the superblock the outer array wrote onto it:

  /dev/md0:
            Magic : a92b4efc
          Version : 1.1
      Feature Map : 0x0
       Array UUID : cbb4b32e:afc7126a:922e501d:9404011e
             Name : louie:8  (local to host louie)
    Creation Time : Fri Apr 8 15:02:20 2011
       Raid Level : raid0
     Raid Devices : 7

   Avail Dev Size : 62512429048 (29808.25 GiB 32006.36 GB)
    Used Dev Size : 0
      Data Offset : 8 sectors
     Super Offset : 0 sectors
            State : active
      Device UUID : 94bfd084:138f8ca5:2938df2e:1ef0b76d

      Update Time : Fri Apr 8 15:02:20 2011
         Checksum : d733d87a - correct
           Events : 0

       Chunk Size : 1024K

       Array Slot : 0 (0, 1, 2, 3, 4, 5, 6)
      Array State : Uuuuuuu

The seven RAID-6 devices are then striped together into the RAID 0, /dev/md8:

  /dev/md8:
          Version : 01.01
    Creation Time : Fri Apr 8 15:02:20 2011
       Raid Level : raid0
       Array Size : 218793494528 (208657.74 GiB 224044.54 GB)
     Raid Devices : 7
    Total Devices : 7
  Preferred Minor : 8
      Persistence : Superblock is persistent

      Update Time : Fri Apr 8 15:02:20 2011
            State : clean
   Active Devices : 7
  Working Devices : 7
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 1024K

             Name : louie:8  (local to host louie)
             UUID : cbb4b32e:afc7126a:922e501d:9404011e
           Events : 0

      Number   Major   Minor   RaidDevice State
         0       9        0        0      active sync   /dev/block/9:0
         1       9        1        1      active sync   /dev/block/9:1
         2       9        2        2      active sync   /dev/block/9:2
         3       9        3        3      active sync   /dev/block/9:3
         4       9        4        4      active sync   /dev/block/9:4
         5       9        5        5      active sync   /dev/block/9:5
         6       9        6        6      active sync   /dev/block/9:6

The xfs_info for the mounted volume is:

  meta-data=/dev/md8          isize=256    agcount=204, agsize=268435440 blks
           =                  sectsz=512   attr=2
  data     =                  bsize=4096   blocks=54698373632, imaxpct=1
           =                  sunit=16     swidth=256 blks
  naming   =version 2         bsize=4096   ascii-ci=0
  log      =internal          bsize=4096   blocks=521728, version=2
           =                  sectsz=512   sunit=16 blks, lazy-count=1
  realtime =none              extsz=4096   blocks=0, rtextents=0

The sunit/swidth parameters are chosen to exactly match the RAID-6 devices, not the RAID-0. Mount options are negligible at the moment, although I will be trying this fstab entry:

  UUID=0a675b55-d68a-41f2-8bb7-063e33123531 /exports xfs inode64,largeio,logbufs=8,noatime 0 2

All disk drives (almost a thousand here now) are Hitachi HUA72202 2TB enterprise drives. We did a failed experiment a while back with desktop drives... never again.

>
>> Allocation groups are 1TiB in size, which seems sane for the
>> size of files we expect to work with.
>
> Any filesystem over 16TB will use 1TB AGs.
>
>> In isolated testing, I see around 5GiBytes/second raw (135 parallel dd
>> reads), and with a benchmark test of 10 simultaneous 64GiByte dd
>> commands, I can see just shy of 2 GiBytes/second reading, and around
>> 1.4GiBytes/second writing through XFS. The benchmark is crude, but
>> fairly representative of our expected use.
>
> If you want insightful comments, then you'll need to provide
> intimate details of the tests you ran and the results (e.g. command
> lines, raw results, etc).
To test raw read rates, I do this:

  for i in /dev/sd[b-z] /dev/sd[a-z][a-z] ; do
      dd if=$i of=/dev/null bs=1024k &
  done

"killall dd" gets rid of them, and I use "dstat 1" to check what the kernel thinks is happening.

For the filesystem test (configured and mounted as described above in the mdadm and xfs_info output), I do this:

  for load in 0 1 2 3 4 5 6 7 8 9 ; do
      dd if=/dev/zero of=/exports/load_$load$step bs=1024k count=32768 &
  done

Later, to test reads of those same files, I do:

  for load in 0 1 2 3 4 5 6 7 8 9 ; do
      dd if=/exports/load_$load$step of=/dev/null bs=1024k &
  done

In both cases, I watch I/O rates after the buffers overflow - with 192GB of RAM, this takes a few seconds. For giggles, I've allowed the read commands to cache 20-100GB in RAM and then rerun the read test to see what a cached read rate looks like - interestingly, the aggregate dd-reported I/O rate in that case is around 5GiBytes/second, which suggests that is approaching an upper limit for this particular chassis.

I am fully aware that this is a simplified test. I'm also quite familiar with the workload, and know this is a reasonable facsimile of what we do. Better real-world benchmarking for us now consists of end-user jobs - day-long jobs on a single sequencing run using a bunch of home-grown software.

>
>> md apparently does not support barriers, so we are badly exposed in
>> that manner, I know. As a test, I disabled write cache on all drives,
>> performance dropped by 30% or so, but since md is apparently the
>> problem, barriers still didn't work.
>
> Doesn't matter if you have BBWC on your hardware RAID
> controllers. Seriously, if you want to sustain high throughput, you
> want a large amount of BBWC in front of your disks....

Here we get into performance expectations and goals - from my testing so far, I can reasonably say I'm happy with the performance of the software RAID with XFS running on top of it. What I need now is stability and robustness in the face of crashes. I'm still perfectly willing to buy good HW RAID cards, don't get me wrong, but their main benefit to me would be the battery-backed cache, not the performance. Keep in mind that it is hard to balance a HW RAID card across multiple SAS expanders - you can certainly get a -16e card of some sort, but then it does ALL of the I/O to those 4 expanders ALL of the time, and I'm not sure that is a win either. Cheaper cards, one per expander, might work, though with six x8 slots available, a HW RAID card with two external ports (-8e) would probably be best - run two expanders per card, as I do now.

>
>> Nonetheless, what we need, but don't have, is stability.
>>
>> With 2.6.32-30, we get reliable kernel panics after 2 days of
>> sustained rsync to the machine (around 150-250MiBytes/second for the
>> entire time - the source machines are slow),
>
> Stack traces from the crash?

Mostly a non-responsive console, and kgdb was not set up at the time - I am trying to get that set up now. Here's the one stack trace I wrote down from the console (again, from a 2.6.32-30 kernel):

  RSP 0018:ffff880dcce39e48 EFLAGS 287
   _spin_lock+0xe/0x20
   futex_wake+0x7d/0x130
   handle_mm_fault+0x1a8/0x3c0
   do_futex+0x68/0x1b0
   sys_futex+0x7b/0x170
   do_page_fault+0x158/0x3b0
   system_call_fastpath+0x16/0x1b

All other info was lost - other crashes result in a locked console that we've not been able to revive. The load on the system at the time of the crash was simply 3-4 rsyncs copying data via 'ssh -c arcfour' over to the XFS filesystem (basically loading up the test server with user data for further testing).
Sustained I/O rates were moderate - 200-400MiBytes/second. There was no swapping, no significant CPU load, and no user jobs. Obviously, this is an old kernel and of less interest, but it nonetheless answers your question.

>
>> and with 2.6.35, we get a
>> bad resource contention problem fairly quickly - much less than 24
>> hours (in this instance, we start getting XFS kernel thread timeouts
>> similar to what I've seen posted here recently, but it isn't clear
>> whether it is only XFS or also ext3 boot drives that are starved for
>> I/O - suspending or killing all I/O load doesn't solve the problem -
>> only a reboot does).
>
> Details of the timeout messages?

Here are some typical ones from yesterday, when I was trying to run the sync command on a relatively lightly loaded 2.6.35 machine (sustained 100MiByte/second copies onto the server in question):

[178602.197456] INFO: task sync:2787 blocked for more than 120 seconds.
[178602.203933] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[178602.211863] sync          D 0000000000000000     0  2787   2691 0x00000000
[178602.211867]  ffff880d2dc51cd8 0000000000000086 ffff880d2dc51cc8 0000000000015880
[178602.211870]  ffff880d2dc51fd8 0000000000015880 ffff880d2dc51fd8 ffff8817fb725d40
[178602.211872]  0000000000015880 0000000000015880 ffff880d2dc51fd8 0000000000015880
[178602.211875] Call Trace:
[178602.211887]  [<ffffffff81050641>] ? select_task_rq_fair+0x561/0x8e0
[178602.211893]  [<ffffffff8156436d>] schedule_timeout+0x22d/0x310
[178602.211896]  [<ffffffff8104af33>] ? enqueue_task_fair+0x43/0x90
[178602.211898]  [<ffffffff8104e609>] ? enqueue_task+0x79/0x90
[178602.211900]  [<ffffffff81563606>] wait_for_common+0xd6/0x180
[178602.211904]  [<ffffffff81053310>] ? default_wake_function+0x0/0x20
[178602.211910]  [<ffffffff81167570>] ? sync_one_sb+0x0/0x30
[178602.211912]  [<ffffffff8156378d>] wait_for_completion+0x1d/0x20
[178602.211915]  [<ffffffff81162b19>] sync_inodes_sb+0x89/0x180
[178602.211955]  [<ffffffffa032c0f1>] ? xfs_quiesce_data+0x71/0xc0 [xfs]
[178602.211958]  [<ffffffff81167570>] ? sync_one_sb+0x0/0x30
[178602.211960]  [<ffffffff81167558>] __sync_filesystem+0x88/0xa0
[178602.211962]  [<ffffffff81167590>] sync_one_sb+0x20/0x30
[178602.211966]  [<ffffffff81142afb>] iterate_supers+0x8b/0xd0
[178602.211968]  [<ffffffff811675e5>] sys_sync+0x45/0x70
[178602.211973]  [<ffffffff8100a072>] system_call_fastpath+0x16/0x1b

>
>> Ideally, I'd firstly be able to find informed opinions about how I can
>> improve this arrangement - we are mildly flexible on RAID controllers,
>> very flexible on versions of Linux, etc, and can try other OS's as a
>> last resort (but the leading contender here would be "something"
>> running ZFS, and though I love ZFS, it really didn't seem to work well
>> for our needs).
>>
>> Secondly, I welcome suggestions about which version of the linux
>> kernel you'd prefer to hear bug reports about, as well as what kinds
>> of output is most useful (we're getting all chassis set up with serial
>> console so we can do kgdb and also full kernel panic output results).
>
> If you want to stay on mainline kernels with best-effort community
> support, I'd suggest 2.6.38 or more recent kernels are the only ones
> we're going to debug. If you want fixes, then running the current -rc
> kernels is probably a good idea. It's unlikely you'll get anyone
> backporting fixes for you to older kernels.

I will be doing that today. We could backport fixes ourselves if it were crucial, but I'm not aware of any local reason why it would be.
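In the meantime, here is roughly what I plan to capture the next time a machine wedges, so the bug reports contain something useful (just the standard knobs, driven over the serial console - better suggestions welcome):

  # Make sure magic sysrq is enabled and the hung-task watchdog stays noisy.
  echo 1   > /proc/sys/kernel/sysrq
  echo 120 > /proc/sys/kernel/hung_task_timeout_secs

  # When a hang hits, dump blocked tasks (w) and, if needed, all tasks (t)
  # to the kernel log, then save it off.
  echo w > /proc/sysrq-trigger
  echo t > /proc/sysrq-trigger
  dmesg > /root/hang-$(date +%Y%m%d-%H%M%S).log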
>
> Alternatively, you can switch to something like RHEL (or SLES) where
> XFS is fully supported (and in the RHEL case, pays my bills :). The
> advantage of this is that once the bug is fixed in mainline, it will
> get backported to the supported kernel you are running.

We're buying a RHEL support license today - hooray! My rationale for doing that is that I'm not convinced I will be seeing just XFS issues in the kernel - the stack trace I reported is more generic than XFS...

Paul

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs