Hi Stan,
On Thu, Dec 5, 2013 at 12:10 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
On 12/4/2013 8:55 PM, Mike Dacre wrote:
...
> I have a 16 2TB drive RAID6 array powered by an LSI 9240-4i. It has an XFS.
It's a 9260-4i, not a 9240, a huge difference. I went digging through
your dmesg output because I knew the 9240 doesn't support RAID6. A few
questions. What is the LSI RAID configuration?
You are right, sorry. 9260-4i
1. Level -- confirm RAID6
Definitely RAID6
2. Strip size? (eg 512KB)
64KB
3. Stripe size? (eg 7168KB, 14*256)
Not sure how to get this
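(If I have the arithmetic right, it can be derived from the numbers above rather than read off the controller: a 16-drive RAID6 leaves 14 data spindles, so
stripe width = strip size x data disks = 64KB x 14 = 896KB
This assumes the controller really is using the 64KB strip, which the Strip Size field in the attached drive summary seems to confirm.)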
4. BBU module?
Yes. iBBU, state optimal, 97% charged.
5. Is write cache enabled?
Yes: Cached IO and Write Back with BBU are enabled.
I have also attached an adapter summary (megaraid_adp_info.txt) and a virtual and physical drive summary (megaraid_drive_info.txt).
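(In case anyone wants to regenerate these, something like the following MegaCli invocations should produce equivalent output; the /opt/MegaRAID/MegaCli install path is an assumption on my part.)
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL    # adapter summary
/opt/MegaRAID/MegaCli/MegaCli64 -LDPDInfo -aALL      # virtual drive + physical drive summary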
What is the XFS geometry?
5. xfs_info /dev/sda
`xfs_info /dev/sda1`
meta-data ="" isize=256 agcount=26, agsize=268435455 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=6835404288, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
This is also attached as xfs_info.txt
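For what it's worth, the sunit=0/swidth=0 above means the filesystem carries no stripe alignment hints, so XFS does not know the RAID geometry. If the array really is 64KB strips across 14 data spindles, a rebuilt filesystem would presumably be aligned with something like the line below (a sketch only; it is for a future rebuild, not the live array):
# WARNING: mkfs.xfs recreates the filesystem and destroys the existing data
mkfs.xfs -d su=64k,sw=14 /dev/sda1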
A combination of these being wrong could very well be part of your
problems.
...
> IO errors when any requests were made. This happened while it was being
I didn't see any IO errors in your dmesg output. None.
Good point. These happened while trying to ls. I am not sure why I can't find them in the log; they were printed to the console as 'Input/Output' errors, simply stating that the ls command had failed.
> accessed by 5 different users, one was doing a very large rm operation (rm
> *sh on thousands of files in a directory). Also, about 30 minutes before
> we had connected the globus connect endpoint to allow easy file transfers
> to SDSC.
With delaylog enabled, which I believe it is in RHEL/CentOS 6, a single
big rm shouldn't kill the disks. But with the combination of other
workloads it seems you may have been seeking the disks to death.
That is possible; workloads can get really high sometimes. I am not sure how to control that without significantly impacting performance - I want a single user to be able to use 98% of the IO capacity sometimes, but at other times I want the load to be split amongst many users. Also, each user can execute jobs simultaneously on 23 different computers, each accessing the same drive via NFS. This is a great system most of the time, but sometimes the workloads on the drive get really high.
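One untested idea for capping a single heavy user would be cgroup blkio throttling, roughly along the lines below. This assumes the RHEL 6 kernel in use has the blkio throttle support; the device numbers, the 100MB/s limit, and the 'heavyuser' group name are only for illustration:
# RHEL/CentOS 6 normally mounts the blkio controller under /cgroup/blkio
mkdir -p /cgroup/blkio/heavyuser
# cap reads from /dev/sda (major:minor 8:0 here; check with ls -l /dev/sda) at ~100MB/s
echo "8:0 104857600" > /cgroup/blkio/heavyuser/blkio.throttle.read_bps_device
# move one of the user's processes into the group ($PID is a placeholder)
echo $PID > /cgroup/blkio/heavyuser/tasks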
...
> In the end, I successfully repaired the filesystem with `xfs_repair -L
> /dev/sda1`. However, I am nervous that some files may have been corrupted.
I'm sure your users will let you know. I'd definitely have a look in
the directory that was targeted by the big rm operation, which apparently
didn't finish when XFS shut down.
> Do any of you have any idea what could have caused this problem?
Yes. A few things. The first is this, and it's a big one:
Dec 4 18:15:28 fruster kernel: io scheduler noop registered
Dec 4 18:15:28 fruster kernel: io scheduler anticipatory registered
Dec 4 18:15:28 fruster kernel: io scheduler deadline registered
Dec 4 18:15:28 fruster kernel: io scheduler cfq registered (default)
http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
"As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much
of the parallelization in XFS."
*Never* use the CFQ elevator with XFS, and never with a high performance
storage system. In fact, IMHO, never use CFQ period. It was horrible
even before 3.2.12. It is certain that CFQ is playing a big part in
your 120s timeouts, though it may not be solely responsible for your IO
bottleneck. Switch to deadline or noop immediately, deadline if LSI
write cache is disabled, noop if it is enabled. Execute this manually
now, and add it to a startup script and verify it is being set at
startup, as it's not permanent:
echo deadline > /sys/block/sda/queue/scheduler
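A minimal way to make that stick on CentOS 6 is a couple of lines in /etc/rc.d/rc.local, which runs at boot (sda assumed; use noop instead of deadline if the controller write cache is enabled, as noted above):
# append to /etc/rc.d/rc.local so the elevator change survives reboots
echo deadline > /sys/block/sda/queue/scheduler
# verify: the active scheduler is shown in square brackets
cat /sys/block/sda/queue/scheduler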
Wow, this is huge - I can't believe I missed that. I have switched it to noop now, as we use write caching. I have been trying to figure out for a while why I kept getting timeouts when the NFS load was high. If you have any other suggestions for how I can improve performance, I would greatly appreciate it.
This one simple command line may help pretty dramatically, immediately,
assuming your hardware array parameters aren't horribly wrong for your
workloads, and your XFS alignment correctly matches the hardware geometry.
Great, thanks. Our workloads vary considerably as we are a biology research lab: sometimes we do lots of seeks, other times we are almost maxing out read or write speed with massively parallel processes all accessing the disk at the same time.
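If it would help to put numbers on those two extremes, a quick fio run for each pattern is one way to do it - the file path, size, and job counts below are just placeholders:
# small random reads, lots of seeking
fio --name=seeky --filename=/mnt/array/fio.test --size=4G --rw=randread --bs=4k --numjobs=16 --direct=1 --ioengine=libaio --runtime=60 --time_based --group_reporting
# large sequential reads, close to streaming speed
fio --name=stream --filename=/mnt/array/fio.test --size=4G --rw=read --bs=1M --numjobs=4 --direct=1 --ioengine=libaio --runtime=60 --time_based --group_reporting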
--
Stan
-Mike
Adapter #0
Versions: Product Name: LSI MegaRAID SAS 9260-4i | Serial No: SV14821972 | FW Package Build: 12.14.0-0167
Mfg. Data: Mfg. Date: 11/24/11 | Rework Date: 00/00/00 | Revision No: 61A | Battery FRU: N/A
Image Versions in Flash: FW Version: 2.130.393-2551 | BIOS Version: 3.28.00_4.14.05.00_0x05270000 | Preboot CLI Version: 04.04-020:#%00009 | WebBIOS Version: 6.0-52-e_48-Rel | NVDATA Version: 2.09.03-0045 | Boot Block Version: 2.02.00.00-0000 | BOOT Version: 09.250.01.219
Pending Images in Flash: None
PCI Info: Controller Id: 0000 | Vendor Id: 1000 | Device Id: 0079 | SubVendorId: 1000 | SubDeviceId: 9260 | Host Interface: PCIE | ChipRevision: B4 | Link Speed: 0 | Number of Frontend Port: 0 | Device Interface: PCIE | Number of Backend Port: 4 | Port Addresses: 0=500304800129497f, 1=0000000000000000, 2=0000000000000000, 3=0000000000000000
HW Configuration: SAS Address: 500605b004137820 | BBU: Present | Alarm: Present | NVRAM: Present | Serial Debugger: Present | Memory: Present | Flash: Present | Memory Size: 512MB | TPM: Absent | On board Expander: Absent | Upgrade Key: Absent | Temperature sensor for ROC: Absent | Temperature sensor for controller: Absent
Settings: Current Time: 7:21:54 12/5, 2013 | Predictive Fail Poll Interval: 300sec | Interrupt Throttle Active Count: 16 | Interrupt Throttle Completion: 50us | Rebuild Rate: 30% | PR Rate: 30% | BGI Rate: 30% | Check Consistency Rate: 30% | Reconstruction Rate: 30% | Cache Flush Interval: 4s | Max Drives to Spinup at One Time: 4 | Delay Among Spinup Groups: 2s | Physical Drive Coercion Mode: Disabled | Cluster Mode: Disabled | Alarm: Enabled | Auto Rebuild: Enabled | Battery Warning: Enabled | Ecc Bucket Size: 15 | Ecc Bucket Leak Rate: 1440 Minutes | Restore HotSpare on Insertion: Disabled | Expose Enclosure Devices: Enabled | Maintain PD Fail History: Enabled | Host Request Reordering: Enabled | Auto Detect BackPlane Enabled: SGPIO/i2c SEP | Load Balance Mode: Auto | Use FDE Only: No | Security Key Assigned: No | Security Key Failed: No | Security Key Not Backedup: No | Default LD PowerSave Policy: Controller Defined | Maximum number of direct attached drives to spin up in 1 min: 120 | Auto Enhanced Import: Yes | Any Offline VD Cache Preserved: No | Allow Boot with Preserved Cache: No | Disable Online Controller Reset: No | PFK in NVRAM: No | Use disk activity for locate: No | POST delay: 90 seconds | BIOS Error Handling: Stop On Errors | Current Boot Mode: Normal
Capabilities: RAID Level Supported: RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span | Supported Drives: SAS, SATA | Allowed Mixing: Mix in Enclosure Allowed; Mix of SAS/SATA of HDD type in VD Allowed; Mix of SAS/SATA of SSD type in VD Allowed; Mix of SSD/HDD in VD Allowed
Status: ECC Bucket Count: 0
Limitations: Max Arms Per VD: 32 | Max Spans Per VD: 8 | Max Arrays: 128 | Max Number of VDs: 64 | Max Parallel Commands: 1008 | Max SGE Count: 80 | Max Data Transfer Size: 8192 sectors | Max Strips PerIO: 42 | Max LD per array: 16 | Min Strip Size: 8 KB | Max Strip Size: 1.0 MB | Max Configurable CacheCade Size: 0 GB | Current Size of CacheCade: 0 GB | Current Size of FW Cache: 346 MB
Device Present: Virtual Drives: 1 | Degraded: 0 | Offline: 0 | Physical Devices: 18 | Disks: 16 | Critical Disks: 0 | Failed Disks: 0
Supported Adapter Operations: Rebuild Rate: Yes | CC Rate: Yes | BGI Rate: Yes | Reconstruct Rate: Yes | Patrol Read Rate: Yes | Alarm Control: Yes | Cluster Support: No | BBU: Yes | Spanning: Yes | Dedicated Hot Spare: Yes | Revertible Hot Spares: Yes | Foreign Config Import: Yes | Self Diagnostic: Yes | Allow Mixed Redundancy on Array: No | Global Hot Spares: Yes | Deny SCSI Passthrough: No | Deny SMP Passthrough: No | Deny STP Passthrough: No | Support Security: No | Snapshot Enabled: No | Support the OCE without adding drives: Yes | Support PFK: Yes | Support PI: No | Support Boot Time PFK Change: No | Disable Online PFK Change: No | PFK TrailTime Remaining: 0 days 0 hours | Support Shield State: No | Block SSD Write Disk Cache Change: No | Support Online FW Update: Yes
Supported VD Operations: Read Policy: Yes | Write Policy: Yes | IO Policy: Yes | Access Policy: Yes | Disk Cache Policy: Yes | Reconstruction: Yes | Deny Locate: No | Deny CC: No | Allow Ctrl Encryption: No | Enable LDBBM: Yes | Support Breakmirror: No | Power Savings: No
Supported PD Operations: Force Online: Yes | Force Offline: Yes | Force Rebuild: Yes | Deny Force Failed: No | Deny Force Good/Bad: No | Deny Missing Replace: No | Deny Clear: No | Deny Locate: No | Support Temperature: Yes | Disable Copyback: No | Enable JBOD: No | Enable Copyback on SMART: No | Enable Copyback to SSD on SMART Error: Yes | Enable SSD Patrol Read: No | PR Correct Unconfigured Areas: Yes | Enable Spin Down of UnConfigured Drives: Yes | Disable Spin Down of hot spares: No | Spin Down time: 30 | T10 Power State: No
Error Counters: Memory Correctable Errors: 0 | Memory Uncorrectable Errors: 0
Cluster Information: Cluster Permitted: No | Cluster Active: No
Default Settings: Phy Polarity: 0 | Phy PolaritySplit: 0 | Background Rate: 30 | Strip Size: 256kB | Flush Time: 4 seconds | Write Policy: WB | Read Policy: RA | Cache When BBU Bad: Disabled | Cached IO: No | SMART Mode: Mode 6 | Alarm Disable: Yes | Coercion Mode: None | ZCR Config: Unknown | Dirty LED Shows Drive Activity: No | BIOS Continue on Error: 3 | Spin Down Mode: None | Allowed Device Type: SAS/SATA Mix | Allow Mix in Enclosure: Yes | Allow HDD SAS/SATA Mix in VD: Yes | Allow SSD SAS/SATA Mix in VD: Yes | Allow HDD/SSD Mix in VD: Yes | Allow SATA in Cluster: No | Max Chained Enclosures: 16 | Disable Ctrl-R: Yes | Enable Web BIOS: Yes | Direct PD Mapping: No | BIOS Enumerate VDs: Yes | Restore Hot Spare on Insertion: No | Expose Enclosure Devices: Yes | Maintain PD Fail History: Yes | Disable Puncturing: No | Zero Based Enclosure Enumeration: No | PreBoot CLI Enabled: Yes | LED Show Drive Activity: Yes | Cluster Disable: Yes | SAS Disable: No | Auto Detect BackPlane Enable: SGPIO/i2c SEP | Use FDE Only: No | Enable Led Header: Yes | Delay during POST: 0 | EnableCrashDump: No | Disable Online Controller Reset: No | EnableLDBBM: Yes | Un-Certified Hard Disk Drives: Allow | Treat Single span R1E as R10: No | Max LD per array: 16 | Power Saving option: Don't Auto spin down Configured Drives (Max power savings option is not allowed for LDs; only T10 power conditions are to be used.) | Default spin down time in minutes: 30 | Enable JBOD: No | TTY Log In Flash: No | Auto Enhanced Import: Yes | BreakMirror RAID Support: Yes | Disable Join Mirror: No | Enable Shield State: No | Time taken to detect CME: 60s
Exit Code: 0x00
System: Operating System: Linux version 2.6.32-358.23.2.el6.x86_64 | Driver Version: 06.504.01.00-rh1 | CLI Version: 8.07.07
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-6, Secondary-0, RAID Level Qualifier-3
Size : 25.463 TB
Sector Size : 512
Is VD emulated : No
Parity Size : 3.637 TB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 16
Span Depth : 1
Default Cache Policy: WriteBack, ReadAhead, Cached, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Cached, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Enabled
Encryption Type : None
Bad Blocks Exist: No
Is VD Cached: No
Exit Code: 0x00
Hardware:
Controller: ProductName: LSI MegaRAID SAS 9260-4i (Bus 0, Dev 0) | SAS Address: 500605b004137820 | FW Package Version: 12.14.0-0167 | Status: Optimal
BBU: BBU Type: iBBU | Status: Healthy
Enclosure: Product Id: SAS2X28 | Type: SES | Status: OK
Enclosure: Product Id: SGPIO | Type: SGPIO | Status: OK
PD: all 16 drives are on Connector: Port 0 - 3 <Internal> <Encl Pos 1>, Slots 0-15, and each reports Vendor Id: ATA | Product Id: WDC WD2002FAEX-0 | State: Online | Disk Type: SATA, Hard Disk | Device Capacity: 1.818 TB | Power State: Active
Storage Virtual Drives: Virtual drive: Target Id 0, VD name | Size: 25.463 TB | State: Optimal | RAID Level: 6
Exit Code: 0x00
meta-data=/dev/sda1              isize=256    agcount=26, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=6835404288, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs