Fwd: Fwd: Sudden File System Corruption

---------- Forwarded message ----------
From: Mike Dacre <mike.dacre@xxxxxxxxx>
Date: Fri, Dec 6, 2013 at 2:14 PM
Subject: Re: Fwd: Sudden File System Corruption
To: stan@xxxxxxxxxxxxxxxxx





On Fri, Dec 6, 2013 at 12:58 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
On 12/5/2013 9:58 AM, Mike Dacre wrote:

> On Thu, Dec 5, 2013 at 12:10 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
>> On 12/4/2013 8:55 PM, Mike Dacre wrote:
>> ...
>
> Definitely RAID6
>
> 2.  Strip size?  (eg 512KB)
>>
> 64KB

Ok, so 64*14 = 896KB stripe.  This seems pretty sane for a 14 spindle
parity array and mixed workloads.

> 4.  BBU module?
>>
> Yes. iBBU, state optimal, 97% charged.
>
> 5.  Is write cache enabled?
>>
>> Yes: Cached IO and Write Back with BBU are enabled.

I should have pointed you to this earlier:
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

but we've got most of it already.  We don't have your fstab mount
options.  Please provide that.
 
UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science                xfs     defaults,inode64          1 0 

On the slave nodes, I managed to reduce the demand on the disks by adding the actimeo=60 mount option.  Prior to doing this I would sometimes see the disk being negatively affected by enormous numbers of getattr requests.  Here is the fstab mount on the nodes:

192.168.2.1:/science                      /science                nfs     defaults,vers=3,nofail,actimeo=60,bg,hard,intr,rw  0 0
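
For what it's worth, a quick way to confirm the getattr volume (a sketch; it assumes nfs-utils is installed, which it normally is on CentOS) is nfsstat:

  nfsstat -s    # per-operation counts as seen by the NFS server, including getattr
  nfsstat -c    # the same figures from a client's point of view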

...
> This is also attached as xfs_info.txt

You're not aligning XFS to the RAID geometry (unless you're overriding
in fstab).  Having no alignment is actually fine for small (<896KB) file
allocations, but less than optimal for large streaming allocation writes.
But it isn't a factor in the problems you reported.


Correct, I am not consciously aligning the XFS to the RAID geometry; I actually didn't know that was possible.
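
For anyone else following along, alignment can apparently be set when the filesystem is created, or overridden at mount time.  A minimal sketch for the geometry above (64KB strip, 14 data spindles), assuming the array shows up as /dev/sda (an assumption on my part):

  # at mkfs time, for a newly created filesystem only:
  mkfs.xfs -d su=64k,sw=14 /dev/sda1

  # or as a mount-time override on the existing filesystem (units are 512-byte sectors):
  mount -o sunit=128,swidth=1792 /dev/sda1 /science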
 
...
>> Good point.  These happened while trying to ls.  I am not sure why I can't
> find them in the log, they printed out to the console as 'Input/Output'
> errors, simply stating that the ls command failed.

We look for SCSI IO errors preceding an XFS error as a causal indicator.
 I didn't see that here.  You could have run into the bug Ben described
earlier.  I can't really speak to the console errors.

>> With delaylog enabled, which I believe it is in RHEL/CentOS 6, a single
>> big rm shouldn't kill the disks.  But with the combination of other
>> workloads it seems you may have been seeking the disks to death.
>>
> That is possible; workloads can get really high sometimes.  I am not sure
> how to control that without significantly impacting performance - I want a
> single user to be able to use 98% IO capacity sometimes... but other times
> I want the load to be split amongst many users.

You can't control the seeking at the disks.  You can only schedule
workloads together that don't compete for seeks.  And if you have one
metadata or random read/write heavy workload, with this SATA RAID6
array, it will need exclusive access for the duration of execution, or
the portion that does all the random IO.  Otherwise other workloads
running concurrently will crawl while competing for seek bandwidth.

> Also, each user can
> execute jobs simultaneously on 23 different computers, each accessing the
> same drive via NFS.  This is a great system most of the time, but sometimes
> the workloads on the drive get really high.

So it's a small compute cluster using NFS over Infiniband for shared
file access to a low performance RAID6 array.  The IO resource sharing
is automatic.  But AFAIK there's no easy way to enforce IO quotas on
users or processes, if at all.  You may simply not have sufficient IO to
go around.  Let's ponder that.

I have tried a few things to improve IO allocation.  BetterLinux has a cgroup control suite that allows on-the-fly, user-level IO adjustments; however, I found it quite cumbersome.

I considered an ugly hack in which I would run two NFS servers, one on the network to the login node and one on the network to the other nodes, so that I could use cgroups to limit IO by process, effectively guaranteeing a 5% IO capacity window to the login node even if the compute nodes were all going crazy.  I quickly concluded that I don't know enough about filesystems, NFS, or the Linux kernel to do this effectively: I would almost certainly just make an ugly mess that broke a lot of things while solving little.  I still think it is a good idea in principle, but it would need to be implemented by someone with far more experience than me, and it would probably be a major undertaking.
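
For illustration, this is roughly the per-cgroup throttle I had in mind.  A sketch only: it assumes the cgroup v1 blkio controller is mounted under /sys/fs/cgroup/blkio and that the array is device 8:0, both of which are assumptions.

  mkdir /sys/fs/cgroup/blkio/lowprio
  # cap reads and writes from this group at roughly 50 MB/s on device 8:0
  echo "8:0 52428800" > /sys/fs/cgroup/blkio/lowprio/blkio.throttle.read_bps_device
  echo "8:0 52428800" > /sys/fs/cgroup/blkio/lowprio/blkio.throttle.write_bps_device
  # move a process (hypothetical $PID) into the group
  echo $PID > /sys/fs/cgroup/blkio/lowprio/tasks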
 
Looking at the math, you currently have approximately 14*150=2100
seeks/sec capability with 14x 7.2k RPM data spindles.  That's less than
100 seeks/sec per compute node, i.e. each node is getting about 2/3rd of
the performance of a single SATA disk from this array.  This simply
isn't sufficient for servicing a 23 node cluster, unless all workloads
are compute bound, and none IO/seek bound.  Given the overload/crash
that brought you to our attention, I'd say some of your workloads are
obviously IO/seek bound.  I'd say you probably need more/faster disks.
Or you need to identify which jobs are IO/seek heavy and schedule them
so they're not running concurrently.

Yes, this is a problem.  We sadly lack the resources to do much better than this; we have recently been adding extra storage by chaining together USB3 drives with RAID and LVM... which is cumbersome and slow, but cheaper.

My current solution is to be on the alert for high IO jobs and move them to a specific Torque queue that limits the number of concurrent jobs.  This works, but I have not found a way to do it automatically.  Thankfully, with a 12-member lab, it is actually not terribly complex to handle, but I would definitely prefer a more comprehensive solution.  I don't doubt that the huge IO and seek demands we put on these disks will cause more problems in the future.
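
For reference, the queue limit itself is just a qmgr setting.  A minimal sketch, using a hypothetical queue name highio:

  qmgr -c "create queue highio queue_type = execution"
  qmgr -c "set queue highio max_running = 2"
  qmgr -c "set queue highio enabled = true"
  qmgr -c "set queue highio started = true"

The hard part is still identifying the high IO jobs and routing them there, which remains manual.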
 
...
>> http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
>>
>> "As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much
>> of the parallelization in XFS."
...
>> echo deadline > /sys/block/sda/queue/scheduler
>>
> Wow, this is huge, I can't believe I missed that.  I have switched it to
> noop now as we use write caching.  I have been trying to figure out for a
> while why I would keep getting timeouts when the NFS load was high.  If you
> have any other suggestions for how I can improve performance, I would
> greatly appreciate it.

This may not fix NFS timeouts entirely but it should help.  If the NFS
operations are seeking the disks to death you may still see timeouts.
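
One note: the echo above does not survive a reboot.  A sketch of two common ways to make it persistent, assuming CentOS 6 with grub and that sda is the array:

  # append to the kernel line in /boot/grub/grub.conf to set the default for all devices:
  elevator=noop

  # or re-apply it per device from /etc/rc.local:
  echo noop > /sys/block/sda/queue/scheduler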

>> This one simple command line may help pretty dramatically, immediately,
>> assuming your hardware array parameters aren't horribly wrong for your
>> workloads, and your XFS alignment correctly matches the hardware geometry.
>>
> Great, thanks.  Our workloads vary considerably as we are a biology
> research lab, sometimes we do lots of seeks, other times we are almost
> maxing out read or write speed with massively parallel processes all
> accessing the disk at the same time.

Do you use munin or something similar?  Sample output:
http://demo.munin-monitoring.org/munin-monitoring.org/demo.munin-monitoring.org/index.html#disk

Project page:
http://munin-monitoring.org/

I have been using Ganglia, but it doesn't have good NFS monitoring as far as I can tell.  I will check out Munin, thanks for the advice. 
 
It also has an NFS module and many others.  The storage oriented metrics
may be very helpful to you.  You would install munin-node on the NFS
server and all compute nodes, and munin on a collector/web server.  This
will allow you to cross reference client and server NFS loads.  You can
then cross reference the time in your PBS logs to see which users were
running which jobs when IO spikes occur on the NFS server.  You'll know
exactly which workloads, or combination thereof, are causing IO spikes.
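
A minimal setup sketch, assuming the EPEL packages and a hypothetical collector at 192.168.2.100:

  # on the NFS server and every compute node:
  yum install munin-node
  # then, in /etc/munin/munin-node.conf, let the collector poll:
  allow ^192\.168\.2\.100$

  # on the collector/web server:
  yum install munin
  # and add one section per node to /etc/munin/munin.conf, e.g.:
  [nfs-server]
      address 192.168.2.1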

--
Stan

-Mike 


_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
