Re: Array 'freezes' for some time after large writes?

Mark Knecht <markknecht@xxxxxxxxx> · Tue, 30 Mar 2010 15:21:20 -0700

I just finished a long compile on my dad's i5-661/DH55HC machine which
uses this same WD drive and I didn't spot any sign of this happening
there. That's a very recent Intel chipset also and probably more or
less the same SATA controller.

I'm going to turn on the kernel's hung-task message in dmesg for a
while and see if anything pops up.
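
For reference, a rough sketch of how that could be checked from a
shell. It assumes the kernel was built with hung-task detection, so
that /proc/sys/kernel/hung_task_timeout_secs exists. The two function
names are made up for illustration only.

```shell
# Print the current hung-task timeout in seconds (0 means the message is off).
# Assumption: hung-task detection is compiled into the running kernel.
show_hung_timeout() {
    cat /proc/sys/kernel/hung_task_timeout_secs
}

# Filter kernel log text on stdin down to the hung-task reports.
# The pattern follows the message quoted further down in this thread.
hung_task_lines() {
    grep 'blocked for more than [0-9]* seconds'
}
```

Typical use would be "dmesg | hung_task_lines". If the timeout reads
0, writing a value such as 120 to the same /proc file as root should
turn the message back on.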

I can set up some additional partitions on my local drive to test
other file systems, but since you're on ext4 and I'm on ext3 it's
probably not the file system, unless the problem was carried forward in
the code over time.

I like the idea of using dd but I want to be careful about that sort
of thing. I've not used dd before, but if I could tell it to write a
gigabyte without messing up existing stuff then that could be helpful.
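
A non-destructive version of that could look roughly like the sketch
below. It assumes GNU dd (for bs=1M and conv=fsync) and writes to a new
scratch file on an already-mounted file system rather than to the
drive itself. The function name, the scratch file name and the
directory argument are all hypothetical.

```shell
# Write SIZE_MB mebibytes of zeros to a new scratch file inside DIR, print
# its size, then delete it. Only the scratch file is touched.
# dd_scratch_write and its arguments are illustrative names, not a real tool.
dd_scratch_write() {
    dir=$1
    size_mb=${2:-1024}              # default: 1024 MiB = 1 GiB
    scratch="$dir/dd-scratch.$$"
    # Refuse to run if the scratch name is somehow already taken.
    if [ -e "$scratch" ]; then
        echo "refusing: $scratch already exists" >&2
        return 1
    fi
    # conv=fsync (GNU dd) flushes the data to disk before dd exits, so the
    # run covers the real write and not just filling the page cache.
    dd if=/dev/zero of="$scratch" bs=1M count="$size_mb" conv=fsync || return 1
    ls -l "$scratch"
    rm -f "$scratch"
}
```

Called as "dd_scratch_write /some/mounted/dir 1024", it should write
one gigabyte and clean up after itself. The important part is that of=
names an ordinary file: pointing it at a device node such as /dev/sda
would overwrite whatever is on that disk.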

Back later,
Mark

On Tue, Mar 30, 2010 at 1:59 PM, Jim Duchek <jim.duchek@xxxxxxxxx> wrote:
> I'm using ext4 on everything, but it's hard to judge which ext3 bugs
> might affect ext4 as well.  I really don't have the ability to
> destructively test the array; I need all the data that's on it and I
> don't have enough spare space elsewhere to back it all up.  You might
> see if you can trigger it with dd, writing to the drive directly w/no
> filesystem?
>
> Jim
>
>
>
> On 30 March 2010 14:45, Mark Knecht <markknecht@xxxxxxxxx> wrote:
>> Hi,
>>   I am running the nvidia binary drivers. I'm not doing anything with
>> X at this point so I can just unload them, I think. I could even remove
>> the card I suppose.
>>
>>   I built a machine for my dad a couple of months ago that uses the
>> same 1TB WD drive that I am using now. I don't remember seeing
>> anything like this on his machine but I'm going to go check that.
>>
>>   One other similarity I suspect we have is ext3? There were problems
>> with ext3 priority inversion in earlier kernels. It's my understanding
>> that they thought they had that worked out, but possibly we're
>> triggering this somehow? Since I've got a lot of disk space I can set
>> up some other partitions, ext4, reiser4, etc., and try copying files
>> to trigger it. However it's difficult for me if it requires read/write
>> as I'm not set up to really use the machine yet. Is that something you
>> have room to try?
>>
>>   Also, we haven't discussed what drivers are loaded or kernel
>> config. Here's my current driver set:
>>
>> keeper ~ # lsmod
>> Module                  Size  Used by
>> ipv6                  207757  30
>> usbhid                 21529  0
>> nvidia              10611606  22
>> snd_hda_codec_realtek   239530  1
>> snd_hda_intel          17688  0
>> ehci_hcd               30854  0
>> snd_hda_codec          45755  2 snd_hda_codec_realtek,snd_hda_intel
>> snd_pcm                58104  2 snd_hda_intel,snd_hda_codec
>> snd_timer              15030  1 snd_pcm
>> snd                    37476  5 snd_hda_codec_realtek,snd_hda_intel,snd_hda_codec,snd_pcm,snd_timer
>> soundcore                800  1 snd
>> snd_page_alloc          5809  2 snd_hda_intel,snd_pcm
>> rtc_cmos                7678  0
>> rtc_core               11093  1 rtc_cmos
>> sg                     23029  0
>> uhci_hcd               18047  0
>> usbcore               115023  4 usbhid,ehci_hcd,uhci_hcd
>> agpgart                24341  1 nvidia
>> processor              23121  0
>> e1000e                111701  0
>> firewire_ohci          20022  0
>> rtc_lib                 1617  1 rtc_core
>> firewire_core          36109  1 firewire_ohci
>> thermal                11650  0
>> keeper ~ #
>>
>> - Mark
>>
>> On Tue, Mar 30, 2010 at 1:32 PM, Jim Duchek <jim.duchek@xxxxxxxxx> wrote:
>>> Hrm, I've never seen that kernel message.  I don't think any of my
>>> freezes have lasted for up to 120 seconds though (my drives are half
>>> as big -- might matter?)  It looks like we've both got WD drives --
>>> and we both have nvidia 9500gt's as well.  Are you running the nvidia
>>> binary drivers, or nouveau? (It seems like it wouldn't matter,
>>> especially as, at least on my system, they don't share an interrupt or
>>> anything, but I hate to ignore any hardware that we have in
>>> common.) I did move to 2.6.33 for some time, but that didn't change the
>>> behaviour.
>>>
>>> Jim
>>>
>>>
>>> On 30 March 2010 13:05, Mark Knecht <markknecht@xxxxxxxxx> wrote:
>>>> On Tue, Mar 30, 2010 at 10:47 AM, Jim Duchek <jim.duchek@xxxxxxxxx> wrote:
>>>> <SNIP>
>>>>>  You're having this happen even if the disk in question is not in an
>>>>> array?  If so perhaps it's an SATA issue and not a RAID one, and we
>>>>> should move this discussion accordingly.
>>>>
>>>> Yes, in my case the delays are so long - sometimes 2 or 3 minutes -
>>>> that when I tried to build the system using RAID1 I got this kernel
>>>> message in dmesg. It's just info - not a real failure - but because it's
>>>> talking about long delays I gave up on RAID and tried a standard
>>>> single drive build. Turns out that it has (I think...) nothing to do
>>>> with RAID at all. You'll note that there are instructions for turning
>>>> the message off but I've not tried them. I intend to do a parallel
>>>> RAID1 build on this machine and be able to test both RAID vs non-RAID.
>>>>
>>>> - Mark
>>>>
>>>> INFO: task kjournald:17466 blocked for more than 120 seconds.
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> kjournald     D ffff8800280bbe00     0 17466      2 0x00000000
>>>>  ffff8801adf9d890 0000000000000046 0000000000000000 0000000000000000
>>>>  ffff8801adcbde44 0000000000004000 000000000000fe00 000000000000c878
>>>>  0000000800000050 ffff88017a99aa40 ffff8801af90a150 ffff8801adf9db08
>>>> Call Trace:
>>>>  [<ffffffff812dd063>] ? md_make_request+0xb6/0xf1
>>>>  [<ffffffff8109c248>] ? sync_buffer+0x0/0x40
>>>>  [<ffffffff8137a4fc>] ? io_schedule+0x2d/0x3a
>>>>  [<ffffffff8109c283>] ? sync_buffer+0x3b/0x40
>>>>  [<ffffffff8137a879>] ? __wait_on_bit+0x41/0x70
>>>>  [<ffffffff8109c248>] ? sync_buffer+0x0/0x40
>>>>  [<ffffffff8137a913>] ? out_of_line_wait_on_bit+0x6b/0x77
>>>>  [<ffffffff810438b2>] ? wake_bit_function+0x0/0x23
>>>>  [<ffffffff8109c637>] ? sync_dirty_buffer+0x72/0xaa
>>>>  [<ffffffff81131b8e>] ? journal_commit_transaction+0xa74/0xde2
>>>>  [<ffffffff8103abcc>] ? lock_timer_base+0x26/0x4b
>>>>  [<ffffffff81043884>] ? autoremove_wake_function+0x0/0x2e
>>>>  [<ffffffff81134804>] ? kjournald+0xe3/0x206
>>>>  [<ffffffff81043884>] ? autoremove_wake_function+0x0/0x2e
>>>>  [<ffffffff81134721>] ? kjournald+0x0/0x206
>>>>  [<ffffffff81043591>] ? kthread+0x8b/0x93
>>>>  [<ffffffff8100bd3a>] ? child_rip+0xa/0x20
>>>>  [<ffffffff81043506>] ? kthread+0x0/0x93
>>>>  [<ffffffff8100bd30>] ? child_rip+0x0/0x20
>>>> livecd ~ #
>>>>
>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html