Re: Array 'freezes' for some time after large writes?

Jim Duchek <jim.duchek@xxxxxxxxx> · Tue, 30 Mar 2010 12:47:55 -0500

Well, it appears that I can reproduce it 100% of the time by
copying a large (>1 GB) video file and then immediately playing it.
It seems to hit the freeze on just about the same frame every time,
and playing it seems to be necessary (it doesn't freeze if I just do
the copy and go about my business).  Possibly an issue with disk buffers?
You're having this happen even if the disk in question is not in an
array?  If so, perhaps it's a SATA issue and not a RAID one, and we
should move this discussion accordingly.  I reproduced your steps and
I'm seeing pretty much the same thing, although not quite hitting 100%
wa (although I'm guessing it would if I shut everything else down --
I've got a full desktop running).
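
To timestamp the stalls rather than eyeball them, something like the
rough Python sketch below could be left running in a terminal.  It times
a directory listing of the array's mount point once a second (roughly
what 'ls /home' does) and reports any listing that takes longer than a
threshold.  The names probe_once and watch, the 2-second threshold, and
the 600-sample run are just my own placeholders, and a listing served
entirely from cache may not block at all, so treat it as a starting
point only.

```python
import os
import time


def probe_once(path):
    """Return the seconds taken to list `path` (roughly what `ls` does)."""
    start = time.monotonic()
    os.listdir(path)
    return time.monotonic() - start


def watch(path, samples, interval=1.0, threshold=2.0):
    """Probe `path` `samples` times.

    Returns a list of (wall-clock time, latency in seconds) for every
    probe whose latency reached `threshold`.
    """
    slow = []
    for _ in range(samples):
        latency = probe_once(path)
        if latency >= threshold:
            slow.append((time.strftime("%H:%M:%S"), latency))
        time.sleep(interval)
    return slow


if __name__ == "__main__":
    # /home is where the array is mounted here; adjust as needed.
    for stamp, latency in watch("/home", samples=600):
        print(f"{stamp} listing blocked for {latency:.1f}s")
```

The slow probes are only printed once the run finishes, so a shorter
sample count is more convenient when testing right after a big copy.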

I'm using a Biostar T41-A7 mobo, Intel Core 2 Quad Q8400 Yorkfield
2.66GHz 4MB L2, and 3 Western Digital Caviar Blue WD5000AAJS drives
and 1 WD5002ABYS.  I note that the 3 older drives report ATA-7
and the newer one reports ATA-8.  Any similarities?
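
For comparing block_dump captures between our two machines, here is a
small sketch that tallies the READ/WRITE lines by process and device.
It assumes the lines look like the ones you posted, e.g.
"flush-8:0(3365): WRITE block 33555792 on sda3"; the function name
summarize and the regular expression are mine, and it ignores the
"dirtied" lines.

```python
import re
import sys
from collections import Counter

# Matches lines such as "flush-8:0(3365): WRITE block 33555792 on sda3",
# with or without a leading dmesg timestamp.
LINE_RE = re.compile(
    r"(?P<proc>\S+)\((?P<pid>\d+)\): "
    r"(?P<op>READ|WRITE) block (?P<block>\d+) on (?P<dev>\S+)"
)


def summarize(lines):
    """Count block_dump entries per (process name, operation, device)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (match.group("proc"), match.group("op"), match.group("dev"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Reads dmesg output from standard input and prints the busiest
    # (process, operation, device) combinations first.
    for (proc, op, dev), n in summarize(sys.stdin).most_common():
        print(f"{n:6d} {op:5s} {dev:8s} {proc}")
```

Usage would be something like "dmesg | python tally.py", where tally.py
is just a hypothetical file name for the script above.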

Jim

On 30 March 2010 12:18, Mark Knecht <markknecht@xxxxxxxxx> wrote:
> On Tue, Mar 30, 2010 at 10:07 AM, Jim Duchek <jim.duchek@xxxxxxxxx> wrote:
>> Hi all.  Regularly after a large write to the disk (untarring a very
>> large file, etc), my RAID5 will 'freeze' for a period of time --
>> perhaps around a minute.  My system is completely responsive otherwise
>> during this time, with the exception of anything that is attempting to
>> read or write from the array -- it's as if any file descriptors simply
>> block.  Nothing disk/raid-related is written to the logs during this
>> time.  The array is mounted as /home -- so an awful lot of things
>> completely freeze during this time (web browser, any video that is
>> running, etc).  The disks don't seem to be actually accessed during
>> this time (I can't hear them, and the disk access light stays off),
>> and it's not as if it's just reading slowly -- it's not reading at
>> all.   Array performance is completely normal before and after the
>> freeze and simply non-existent during it.  The root disk (which is on
>> a separate disk entirely from the RAID) runs fine during this time, as
>> does everything else (network, video card, etc -- as long as it doesn't
>> touch the array) -- for example, a Terminal window open is still
>> responsive during the freeze, and 'ls /' would work fine, while 'ls
>> /home' would block until the 'freeze' is over.
>>
>> Some more detailed information on my setup attached.  It's pretty
>> vanilla.  Unfortunately this started around the time four things
>> happened -- a kernel upgrade to 2.6.32, upgrading my filesystems to
>> ext4, replacing a disk gone bad in the RAID, and a video card change.
>> I would assume one of these is the culprit, but you know what they say
>> about 'assume'.  I cannot reproduce the problem reliably, but it
>> happens a couple times a day.  My questions are these:
>>
>> 1. Is there any way to turn on more detailed logging for the RAID
>> system in the kernel?  The wiki or a google search makes no mention I
>> can find, and mdadm doesn't put anything out during this time.
>> 2. Possibly a problem with the SATA system?  My root drive is PATA --
>> my RAID disks are all SATA.
>> 3. Uh, any other ideas? :)
>>
>>
>> Thanks, all.
>>
>> Jim Duchek
>>
>
> I'm seeing a lot of this on a new Intel-based system. I've never run
> into it before.
>
> In my case I can see the delays while looking at top. They correspond
> to 100%wa, as shown here:
>
> top - 02:27:17 up 28 min,  2 users,  load average: 2.76, 1.95, 1.30
> Tasks: 125 total,   1 running, 124 sleeping,   0 stopped,   0 zombie
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu5  :  0.0%us,  0.3%sy,  0.0%ni,  0.0%id, 99.7%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   6107024k total,  1448676k used,  4658348k free,   187492k buffers
> Swap:  4200988k total,        0k used,  4200988k free,   915900k cached
>
> Like you, I find that nothing gets written to the logs when this is
> happening, and in my case it happens whether I'm using RAID1 or not.
>
> From the command line if I do the following and wait for one of these
> 100%wa events to occur
>
> echo "1" > /proc/sys/vm/block_dump
> ... wait a short while ...
> echo "0" > /proc/sys/vm/block_dump
>
> then grepping dmesg with this command
>
> dmesg | egrep "READ|WRITE|dirtied"
>
> shows the following:
>
>
> flush-8:0(3365): WRITE block 33555792 on sda3
> flush-8:0(3365): WRITE block 33555800 on sda3
> flush-8:0(3365): WRITE block 33701984 on sda3
> flush-8:0(3365): WRITE block 33720128 on sda3
> flush-8:0(3365): WRITE block 33721496 on sda3
> flush-8:0(3365): WRITE block 33816576 on sda3
>
> so something ugly is going on. I have no idea what causes these blocks
> but they are really messing me up.
>
> Sometimes these events last for minutes. I've not yet discovered if
> it's specific to my drives, my motherboard, the kernel or what.
>
> - Mark
>