Re: Fwd: high volume of disk-writes causes disk to 'disappear'

Leif Sawyer <ak.hepcat+scsi@xxxxxxxxx> · Tue, 1 Jun 2010 08:44:20 -0800

Well, good thing I left it running longer. The failure triggered
multiple times over the past week.

So I guess it's not fixed with the simple cleanup patches that had
been posted previously.

I really need some help with SCSI debugging to get valid log data out
of this, in order to isolate the issue.
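
A once-a-minute utilization logger like the one mentioned in the quoted
message below can be sketched roughly as follows. This is not the script
that produced those log lines (it was never posted); the function names are
made up for illustration, and the percentage is plain used/total rounded
down, which may differ by a point from what df reports.

```python
import shutil


def utilization_percent(used, total):
    """Integer percentage of space in use, rounded down."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used * 100 // total


def format_message(mount, percent):
    """Build a message in the same shape as the quoted log lines."""
    return "disk: %s at %d%% utilization" % (mount, percent)


def log_utilization(mount):
    """Measure one mount point and send the result to syslog (user.info)."""
    import syslog  # Unix-only module, imported here so the helpers work anywhere

    usage = shutil.disk_usage(mount)
    message = format_message(mount, utilization_percent(usage.used, usage.total))
    syslog.syslog(syslog.LOG_USER | syslog.LOG_INFO, message)
    return message


if __name__ == "__main__":
    # Assumption: run once a minute (for example from cron) against /data.
    log_utilization("/data")
```

The point of logging every minute is just to have a timestamped utilization
figure immediately before the hung-task messages start.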

-L

On Tue, May 25, 2010 at 2:58 PM, Leif Sawyer <ak.hepcat+scsi@xxxxxxxxx> wrote:
> On Wed, May 19, 2010 at 5:23 AM, Leif Sawyer <ak.hepcat+scsi@xxxxxxxxx> wrote:
>> Looks like the 75% mark might have been too high an estimate.
>> Whipped up a quick logger to show me when I was failing:
>>
>> <user.info<14>>May 18 17:08:01 websniff-6036a5 logger: disk: /data at 59% utilization
>> <user.info<14>>May 18 17:09:01 websniff-6036a5 logger: disk: /data at 59% utilization
>> <user.info<14>>May 18 17:10:01 websniff-6036a5 logger: disk: /data at 60% utilization
>> <user.info<14>>May 18 17:11:01 websniff-6036a5 logger: disk: /data at 60% utilization
>> [22563.204037] INFO: task flush-8:16:2430 blocked for more than 120 seconds.
>> [22563.224392] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [22563.248117] INFO: task dumpcap:4004 blocked for more than 120 seconds.
>> [22563.267662] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [22563.291359] INFO: task df:14714 blocked for more than 120 seconds.
>> [22563.309874] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [22563.333593] INFO: task websniff.cgi:14717 blocked for more than 120 seconds.
>> [22563.354690] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [22682.229526] end_request: I/O error, dev sdb, sector 169922743
>> [22682.247345] Buffer I/O error on device sdb1, logical block 21240335
>> [22682.266781] end_request: I/O error, dev sdb, sector 170131519
>> [22682.284445] Buffer I/O error on device sdb1, logical block 21266432
>> [....... repeats until......]
>> [22682.782577] sd 3:0:1:0: rejecting I/O to offline device
>> [22682.798907] sd 3:0:1:0: rejecting I/O to offline device
>>
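
Inline note on the errors quoted above: the two sector / logical block
pairs are consistent with sdb1 using 4096-byte blocks and starting at
sector 63 of sdb. That layout is inferred purely from the numbers in the
log, not confirmed from the partition table, and block_to_sector is just an
illustrative name. A quick Python check of the arithmetic:

```python
SECTOR_SIZE = 512      # bytes per sector, the unit used by the end_request lines
BLOCK_SIZE = 4096      # assumed block size on sdb1 (inferred, not confirmed)
PARTITION_START = 63   # assumed first sector of sdb1 (inferred, not confirmed)


def block_to_sector(block, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE,
                    partition_start=PARTITION_START):
    """Map a logical block on the partition to an absolute sector on the disk."""
    return partition_start + block * (block_size // sector_size)


if __name__ == "__main__":
    # The two logical blocks reported in the Buffer I/O error lines.
    for block in (21240335, 21266432):
        print(block, "->", block_to_sector(block))
```

Both blocks map back to the sectors reported by end_request (169922743 and
170131519), so the two kinds of error line appear to describe the same
failed requests rather than separate problems.
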
>> And from here on out, the device is no longer recognized by the system
>> until a reboot.
>>
>> I need some help with scsi debugging in order to provide more useful
>> information.
>>
>> I do have a 512MB logfile (text) with lots of SCSI "dump card state"
>> logs and such, though.
>>
>
>
> Okay, so on a whim, I applied some patches recently posted here that I
> thought might have an impact on my particular system (anything generic
> SCSI or Adaptec-related).
>
> My system has been up since yesterday with those patches applied, and my
> disk has been churning at 100% utilization (with between 600MB and 75MB
> free at any given time), with tshark continuously rolling over new capture
> files for over 6h, which it never did before.
>
> The following patches, none of them cosmetic or debug-related, were applied:
>
>     lct_data->tid assignment
>     io_dev->iop assignment
>     usg use after kfree
>     gdth  goto out_free_ccb_phys  instead of  out_free_coal_stat
>
>
> If there's interest, I'll back out the patches one at a time and see
> which one(s) bring back the most instability.

-- 
"It's pronounced Layf...you know, like Leif Garrett? Don't you watch
 'I Love the 70's'? What kind of retro lover are you, anyway?"