Re: Re[6]: RAID 6 crashes system when being accessed

Roger Heflin <rogerheflin@xxxxxxxxx> · Sun, 6 Jul 2014 20:56:13 -0500

You are not looking for anything in the list itself. While the commands
run, you are watching for the machine to crash and/or produce messages in
/var/log/messages.
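
A minimal sketch of one way to run the full read test so that the giant
file list goes to a log instead of the screen. The function name
read_check and the default log path /tmp/read_check.log are made up for
illustration, and it assumes GNU find and cksum. Keep the log on a
filesystem other than the array, and expect that the last few lines may
be lost if the box goes down hard.

```shell
#!/bin/sh
# read_check DIR [LOG]  (hypothetical helper, not part of any package)
# Walks DIR, prints the metadata of every regular file and then
# checksums it, which forces every block of file data to be read.
read_check() {
    dir=$1
    log=${2:-/tmp/read_check.log}   # assumed location, not on the array

    # Same command as in the thread; all output and errors go to the log.
    find "$dir" -type f -ls -exec cksum {} \; >"$log" 2>&1
    status=$?

    # Record how find finished so the result can be checked later.
    echo "find exit status: $status" >>"$log"
    return "$status"
}
```

Run "tail -f /var/log/messages" in a second terminal, then call
read_check /<dirname> in the first. A clean run ends with "find exit
status: 0" in the log. A crash part-way through, or kernel messages
appearing while it runs, means there is still damage to chase.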

On Sun, Jul 6, 2014 at 7:54 PM, Justin Stephenson
<justin@xxxxxxxxxxxxxxxxx> wrote:
> Thanks again, Roger. Your input was super helpful and also helped me
> understand a little more about the relationship between md and my file
> system.
>
> In the full tests you mentioned "find /<dir> -type f -ls" and
> "...exec cksum {} \;".
>
> What would I be looking for? I executed the first one and got a colossal
> list of files. The server stores a lot of media resources for my design
> practice and there are probably hundreds of thousands of files on there.
>
> Please let me know,
>
> J
> --------
> Justin Stephenson
> Creative Director/Motion Designer
> 416-900-6069
> http://justinstephenson.com
>
>
>
>
> ------ Original Message ------
> From: "Roger Heflin" <rogerheflin@xxxxxxxxx>
> To: "Justin Stephenson" <justin@xxxxxxxxxxxxxxxxx>
> Cc: "stan" <stan@xxxxxxxxxxxxxxxxx>; "Linux RAID"
> <linux-raid@xxxxxxxxxxxxxxx>
> Sent: 05/07/2014 4:42:04 PM
> Subject: Re: Re[4]: RAID 6 crashes system when being accessed
>
>> The MD volume itself would not be unstable. The filesystem's
>> directory and file structures could have been corrupted; fsck likely
>> did fix something that was not important enough to report. The crash
>> would happen when you hit the specific directory entry and/or file
>> data that was damaged. I have no idea how many times I have fixed this
>> sort of issue. It is pretty common after an unexpected crash: maybe 1
>> in 10-50 crashes will produce this sort of error, and the risk rises
>> if files were being created when it happens.
>>
>> If you want to do a full test, "find /<dirname> -type f -ls" will
>> walk every directory and list every file fairly quickly; it reads the
>> directory entries and inodes but not the file data. If you want to
>> check that all files and extents make sense, you can run the next
>> command, which reads every file and will take a long time depending on
>> how much data you have: "find /<dirname> -type f -ls -exec cksum {} \;"
>>
>> On Sat, Jul 5, 2014 at 2:22 PM, Justin Stephenson
>> <justin@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>>  Hello Roger,
>>>
>>>  Thank you for your email and for laying out some troubleshooting steps
>>>  for me. I will take these to heart and keep them on file for the future.
>>>
>>>  I can report that there was a screen of rapid scrolling text during the
>>>  crashes and some kind of memory contents dump that had a progress
>>>  indicator. From what I could see, there was some kind of kernel panic and
>>>  a message about ATA-9. Nothing in the /var/log/messages file as far as I
>>>  could see.
>>>
>>>  I had tried unmounting and running fsck before but not with your
>>>  specified -f -y flags.
>>>
>>>  Here are the steps I took based on your input.
>>>
>>>  - ran system overnight with md raid unmounted.
>>>  - fully completed resync
>>>  - performed fsck -f -y. It took approx 6 minutes (on a 12TB volume). No
>>>  errors reported in the printout.
>>>  - reboot
>>>  - locally initiated and completed a 22 GB copy from and to the md raid
>>>  and a local esata external drive.
>>>
>>>  ---
>>>
>>>  - from a workstation, opened SMB share to the MD raid
>>>  - workstation initiated copy to and from the CentOS box (and MD drive)
>>>  of the same 22 GB folder over SMB.
>>>  - opened vnc client to the centOS box from a workstation.
>>>
>>>  Up until the fsck -f -y any of these three operations would cause a
>>>  crash.
>>>
>>>
>>>  In summary, it would seem that the issue has been resolved by the fsck
>>>  -f -y. Up until running fsck -f -y, the system was completely
>>>  unpredictable when the MD drive was mounted - either during a sync or
>>>  after it was completed. I find this surprising, but perhaps I should not?
>>>
>>>  Based on Stan's email, I checked my UPS power settings, and I am certain
>>>  I was ending up with a hard powerdown when the battery ran out. I have
>>>  remedied this.
>>>
>>>  Could this have caused the MD volume to become unstable?
>>>
>>>  In any event, everything is up and running. I will report back with a
>>>  log entry if anything else appears.
>>>
>>>  Thanks again,
>>>
>>>  - Justin
>>>
>>>
>>>
>>>
>>>  ------ Original Message ------
>>>  From: "Roger Heflin" <rogerheflin@xxxxxxxxx>
>>>  To: "Justin Stephenson" <justin@xxxxxxxxxxxxxxxxx>
>>>  Cc: stan@xxxxxxxxxxxxxxxxx; "Linux RAID" <linux-raid@xxxxxxxxxxxxxxx>
>>>  Sent: 05/07/2014 12:17:45 AM
>>>  Subject: Re: Re[2]: RAID 6 crashes system when being accessed
>>>
>>>>  Some questions.
>>>>
>>>>  Do you get any messages on the screen when it crashes and/or is there
>>>>  anything in /var/log/messages from the crashes?
>>>>
>>>>  Is a sync running when it crashes? If so what kind of SATA
>>>>  controllers/setup are you using? I have had 2 previous setups that
>>>>  would run fairly stably so long as a sync was not running, but if a
>>>>  sync was running then the machine became unstable.
>>>>
>>>>  Did you umount it and run a "fsck -f -y" that took a while (at least
>>>>  30 seconds), or did you just umount it and run fsck, which finished
>>>>  quickly and indicated clean? Generally, if you umount it nicely, the fs
>>>>  thinks it is clean even when it is not because of some previous event,
>>>>  so a plain fsck skips the full check.
>>>>
>>>>  On Fri, Jul 4, 2014 at 8:08 PM, Justin Stephenson
>>>>  <justin@xxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>   Hi,
>>>>>
>>>>>   Thanks for your reply.
>>>>>
>>>>>   I should clarify that the crashes continue to be an issue in the
>>>>>   absence of any power outage so this issue is now independent of power.
>>>>>   I mentioned the UPS only with the thought that my problems may have
>>>>>   been caused by a sudden power-down.
>>>>>
>>>>>   Please let me know if there are any logs or status print outs I could
>>>>>   pull to help troubleshoot this.
>>>>>
>>>>>   Thanks Again,
>>>>>
>>>>>   - J
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   ------ Original Message ------
>>>>>   From: "Stan Hoeppner" <stan@xxxxxxxxxxxxxxxxx>
>>>>>   To: "Justin Stephenson" <justin@xxxxxxxxxxxxxxxxx>;
>>>>>   linux-raid@xxxxxxxxxxxxxxx
>>>>>   Sent: 04/07/2014 3:34:17 PM
>>>>>   Subject: Re: RAID 6 crashes system when being accessed
>>>>>
>>>>>>   On 7/4/2014 9:11 AM, Justin Stephenson wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    Hello,
>>>>>>>
>>>>>>>    I am experiencing some issues with my md raid. It is crashing my
>>>>>>>    system when accessed with any "verve". The reboot initiates a resync
>>>>>>>    of the raid. I have gone through the crash/reboot/resync cycle a
>>>>>>>    number of times now and the crash happens within minutes of mounting
>>>>>>>    the raid.
>>>>>>>
>>>>>>>    Here are some details:
>>>>>>>
>>>>>>>    - It is a raid 6 with 7 3TB devices.
>>>>>>>    - Formatted as EXT4
>>>>>>>    - mdadm v3.2.6 - 25th October 2012
>>>>>>>    - centos 6.5 kernel 2.6.32-431.3.1.el6.x86_64
>>>>>>>    - It has been running flawlessly for the previous 6 months.
>>>>>>>    - I have a cron script running that resyncs monthly.
>>>>>>>    - When the raid is unmounted, the system runs fine. (I have an
>>>>>>>    additional "dumb" hardware raid 1 for dailies attached to an ESATA
>>>>>>>    port. This runs perfectly.)
>>>>>>>    - I am in the process of re-syncing the raid 6 again right now.
>>>>>>>    - I have run an fsck on the raid volume after it was fully synced
>>>>>>>    and everything came up clean.
>>>>>>>
>>>>>>>    - there have been lots of power outages the last while with the
>>>>>>>    hot summer in Toronto. My UPS shuts the system down for me, though I
>>>>>>>    think I can correlate the issues with the power outages.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>   This sounds like the UPS is cutting power to the system before the
>>>>>>   shutdown sequence completes, before the array is stopped. This
>>>>>>   assumes you are already using apcupsd or similar. If you are, check
>>>>>>   the configuration to make sure the system has plenty of time to shut
>>>>>>   down after the UPS sends notification to the system. If you are not,
>>>>>>   then this will always happen as the UPS is simply cutting power when
>>>>>>   the battery gets low.
>>>>>>
>>>>>>   Note that if the UPS is undersized for this system and only yields a
>>>>>>   few minutes of on-battery time, it may simply not have enough juice to
>>>>>>   keep the machine up throughout the shutdown process.
>>>>>>
>>>>>>   In summary, either your shutdown software isn't configured properly,
>>>>>>   you are not using it, or the UPS is too small. This isn't an md problem.
>>>>>>
>>>>>>
>>>>>>   Cheers,
>>>>>>
>>>>>>   Stan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   --
>>>>>   To unsubscribe from this list: send the line "unsubscribe linux-raid"
>>>>>   in the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>   More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>