HDD problem, software bug, BIOS bug, or hardware?

Adko Branil <adkobranil@xxxxxxxxx> · Fri, 24 Aug 2012 17:54:08 -0700 (PDT)

My system hangs from time to time with a kernel panic, after anywhere from a few minutes to 8-9 hours of work. Before this started, it had worked fine for about 6 years, with no software or hardware changes during that period.
I have some photos of the screen after the panics; the first two are from the old Linux kernel 2.6.16.27:

http://picpaste.com/pics/img00005-73m0unO0.1345852235.jpg
http://picpaste.com/pics/P170812_12.01-MeZrs3zv.1345817375.jpg

(They can be enlarged by clicking on them.)

Then I installed slackware-current with its default kernel "huge.s", and the crashes continued:

http://picpaste.com/pics/P210812_15.34-3NSTEV8f.1345816730.jpg

Then I switched off swap:

http://picpaste.com/pics/P230812_15.06-hB12169n.1345812390.jpg

After that I managed to capture one message with netconsole (swap was off):

[13330.042569] BUG: unable to handle kernel paging request at 000060ff80001f1c
[13330.043554] IP: [<ffffffff810b17e0>] no_action+0x10/0x10
[13330.043554] PGD 0
[13330.043554] Oops: 0002 [#1] SMP
[13330.043554] CPU 1
[13330.043554] Modules linked in: ipv6 lp netconsole snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss fuse nouveau mxm_wmi wmi video ttm drm_kms_helper drm amd64_agp processor thermal_sys k8temp agpgart hwmon snd_via82xx snd_ac97_codec snd_mpu401_uart snd_rawmidi snd_seq_device snd_pcm snd_page_alloc snd_timer snd soundcore ac97_bus ppdev parport_pc i2c_algo_bit gameport evdev shpchp button i2c_viapro i2c_core loop skge parport [last unloaded: lp]
[13330.043554]
[13330.043554] Pid: 0, comm: swapper/1 Not tainted 3.2.27 #2 To Be Filled By O.E.M. To Be Filled By O.E.M./A8V Deluxe
[13330.043554] RIP: 0010:[<ffffffff810b17e0>]  [<ffffffff810b17e0>] no_action+0x10/0x10
[13330.043554] RSP: 0018:ffff88007fd03f10  EFLAGS: 00010086
[13330.043554] RAX: 000060ff80001f1c RBX: ffff88007aef2c00 RCX: 00000000fffffffa
[13330.043554] RDX: 00000000000000d0 RSI: ffff88007ae93f80 RDI: ffff88007aef2c00
[13330.043554] RBP: ffff88007fd03f38 R08: ffff88007aef2c00 R09: ffff88007cc00000
[13330.043554] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007aef2c8c
[13330.043554] R13: 0000000000000011 R14: 0000000000000000 R15: 0000000000000000
[13330.043554] FS:  00007f674b3e6740(0000) GS:ffff88007fd00000(0000) knlGS:00000000f7369700
[13330.043554] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[13330.043554] CR2: 000060ff80001f1c CR3: 000000006f115000 CR4: 00000000000006e0
[13330.043554] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13330.043554] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[13330.043554] Process swapper/1 (pid: 0, threadinfo ffff88007bd18000, task ffff88007d0ec4c0)
[13330.043554] Stack:
[13330.043554]  ffffffff810b1a10 ffff88007fd03f58 ffff88007aef2c00 0000000000000051
[13330.043554]  0000000000000011 ffff88007fd03f58 ffffffff810b4879 ffff88007fd03f58
[13330.043554]  0000000000000011 ffff88007fd03f78 ffffffff81003d12 ffff88007fd03f78
[13330.043554] Call Trace:
[13330.043554]  <IRQ>
[13330.043554]  [<ffffffff810b1a10>] ? handle_irq_event+0x40/0x70
[13330.043554]  [<ffffffff810b4879>] handle_fasteoi_irq+0x59/0x100
[13330.043554]  [<ffffffff81003d12>] handle_irq+0x22/0x40
[13330.043554]  [<ffffffff81b3158a>] do_IRQ+0x5a/0xe0
[13330.043554]  [<ffffffff81b2e82b>] common_interrupt+0x6b/0x6b
[13330.043554]  <EOI>
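
For anyone reading the trace above: the "Oops: 0002" value is the x86 page-fault error code. Below is a minimal Python sketch of how its low bits are conventionally interpreted. The function name decode_oops_code and the table PF_BITS are illustrative only and are not part of any kernel tool.

```python
# Conventional meaning of the low bits of the x86 page-fault error code,
# as printed after "Oops:" in the kernel log.
# Each entry: (bit mask, meaning if set, meaning if clear or None).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_hex):
    """Return a list of human-readable properties for an Oops error code."""
    code = int(code_hex, 16)
    result = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            result.append(if_set)
        elif if_clear is not None:
            result.append(if_clear)
    return result


if __name__ == "__main__":
    # "0002" is the value taken from the "Oops: 0002 [#1] SMP" line.
    print(decode_oops_code("0002"))
```

Read this way, code 0002 describes a kernel-mode write to an address with no page mapped, which agrees with the "unable to handle kernel paging request" line and the address in CR2.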

Here is a link to the dmesg output from before that last crash: http://pastebin.com/Af7bb34x

Finally, I noticed some scary messages in the syslog:

[31770.094556] REISERFS warning (device sda1): clm-6006 reiserfs_dirty_inode: writing inode 347717 on readonly FS
[31770.472848] REISERFS warning (device sda1): clm-6006 reiserfs_dirty_inode: writing inode 347740 on readonly FS
[31790.796117] REISERFS warning (device sda1): clm-6006 reiserfs_dirty_inode: writing inode 426162 on readonly FS

I ran reiserfsck immediately afterwards; no corruption was found.
I have never seen such messages before: I have syslogs covering the 17 days before that, and they contain nothing like this.

I ran some tests with smartmontools earlier, while still on the old kernel (2.6.16.27). The output of "smartctl -s on -a /dev/sda" is:

smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.5.2] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus
Device Model:     ST3200822AS
Serial Number:    4LJ221BB
Firmware Version: 3.01
User Capacity:    200,049,647,616 bytes [200 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sat Aug 25 03:09:01 2012 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 111) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   050   046   006    Pre-fail  Always       -       179699255
  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       123
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       6
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       81170784
  9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       53553
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       142
194 Temperature_Celsius     0x0022   037   054   000    Old_age   Always       -       37
195 Hardware_ECC_Recovered  0x001a   050   046   000    Old_age   Always       -       179699255
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   198   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 13784 hours (574 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 7a 7d 1d e0  Error: ICRC, ABRT at LBA = 0x001d7d7a = 1932666

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 7b 7c 1d e0 00      22:14:23.595  READ DMA EXT
  25 00 00 7b 7b 1d e0 00      22:14:23.593  READ DMA EXT
  25 00 00 7b 7a 1d e0 00      22:14:23.576  READ DMA EXT
  25 00 00 7b 79 1d e0 00      22:14:23.567  READ DMA EXT
  25 00 00 7b 78 1d e0 00      22:14:23.566  READ DMA EXT

Error 1 occurred at disk power-on lifetime: 13784 hours (574 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 fa 0e 01 e0  Error: ICRC, ABRT at LBA = 0x00010efa = 69370

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 fb 0d 01 e0 00      22:13:03.489  READ DMA EXT
  25 00 00 fb 0c 01 e0 00      22:13:03.487  READ DMA EXT
  25 00 00 fb 0b 01 e0 00      22:13:03.701  READ DMA EXT
  25 00 00 fb 09 01 e0 00      22:13:03.682  READ DMA EXT
  25 00 00 fb 07 01 e0 00      22:13:03.681  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     53153         -
# 2  Short offline       Completed without error       00%     53152         -
# 3  Short offline       Completed without error       00%     53152         -
# 4  Short offline       Completed without error       00%     53152         -
# 5  Short offline       Completed without error       00%     53152         -
# 6  Short offline       Completed without error       00%     53148         -
# 7  Short offline       Completed without error       00%     53148         -
# 8  Short offline       Completed without error       00%     53148         -
# 9  Extended offline    Aborted by host               80%     53148         -
#10  Short offline       Completed without error       00%     53147         -
#11  Short offline       Completed without error       00%     53147         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
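
The two entries in the SMART error log above can be cross-checked by hand. The following Python sketch decodes the ER register and rebuilds the LBA from the SN/CL/CH/DH registers. It assumes the 28-bit register layout (low nibble of DH as the top bits), which is an assumption on my part but does reproduce the LBAs smartctl printed. The names ER_FLAGS and decode_error_row are hypothetical helpers, not smartmontools functions.

```python
# Selected bits of the ATA error register (ER column in the log).
ER_FLAGS = [
    (0x80, "ICRC"),  # interface CRC error during a DMA transfer
    (0x40, "UNC"),   # uncorrectable data error
    (0x10, "IDNF"),  # requested sector ID not found
    (0x04, "ABRT"),  # command aborted
]


def decode_error_row(row):
    """Decode one 'ER ST SC SN CL CH DH' row of hex bytes.

    Returns (list of error flag names, LBA), assuming a 28-bit LBA layout.
    """
    er, st, sc, sn, cl, ch, dh = (int(byte, 16) for byte in row.split())
    flags = [name for mask, name in ER_FLAGS if er & mask]
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, lba


if __name__ == "__main__":
    # Register rows copied from "Error 2" and "Error 1" in the log.
    for row in ("84 51 00 7a 7d 1d e0", "84 51 00 fa 0e 01 e0"):
        flags, lba = decode_error_row(row)
        print(flags, hex(lba), lba)

    # How long ago the errors were logged, in drive power-on hours:
    # current Power_On_Hours minus the lifetime stamp of the errors.
    hours_since = 53553 - 13784
    print(hours_since, "hours, about", hours_since // 24, "days")
```

Both rows decode to ICRC plus ABRT, and the count of two appears to match the raw value of attribute 199 (UDMA_CRC_Error_Count). Both errors are stamped at 13784 power-on hours, while the drive now reports 53553, so they were logged roughly 39769 power-on hours before this report.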

Soon after that (see the timestamps of the messages) I managed to capture one complete panic message (at least I hope it is complete):

[32874.215014] BUG: unable to handle kernel NULL pointer dereference at 0000000000000086
[32874.215192] IP: [<ffffffff819f9440>] start_show+0x30/0x30
[32874.215192] PGD 7afe0067 PUD 7497e067 PMD 0 
[32874.215192] Oops: 0002 [#1] SMP 
[32874.215192] CPU 1 
[32874.215192] Modules linked in: netconsole ipt_REJECT xt_tcpudp iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables ipv6 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss fuse nouveau mxm_wmi wmi video ttm drm_kms_helper snd_via82xx snd_ac97_codec snd_mpu401_uart snd_rawmidi snd_seq_device snd_pcm snd_page_alloc drm snd_timer amd64_agp processor i2c_algo_bit snd shpchp k8temp agpgart thermal_sys i2c_viapro hwmon i2c_core skge soundcore ac97_bus gameport evdev ppdev button parport_pc parport loop [last unloaded: lp]
[32874.215192] 
[32874.215192] Pid: 0, comm: swapper/1 Not tainted 3.2.27 #2 To Be Filled By O.E.M. To Be Filled By O.E.M./A8V Deluxe
[32874.215192] RIP: 0010:[<ffffffff819f9440>]  [<ffffffff819f9440>] start_show+0x30/0x30
[32874.215192] RSP: 0018:ffff88007fd03eb0  EFLAGS: 00010006
[32874.215192] RAX: 0000000000000086 RBX: ffffffff820c2fc0 RCX: 0000000000000001
[32874.215192] RDX: 00001de61fe84bdb RSI: 0000000000000000 RDI: ffffffff820c2fc0
[32874.215192] RBP: ffff88007fd03ed8 R08: 0000000000000000 R09: 0000000000000001
[32874.215192] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000008069
[32874.215192] R13: 00000000484f99af R14: 0000000000ab2476 R15: 0000000000000000
[32874.215192] FS:  00007f61bddf4740(0000) GS:ffff88007fd00000(0000) knlGS:00000000f75fc6c0
[32874.215192] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[32874.215192] CR2: 0000000000000086 CR3: 00000000746e8000 CR4: 00000000000006e0
[32874.215192] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32874.215192] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[32874.215192] Process swapper/1 (pid: 0, threadinfo ffff88007bd18000, task ffff88007d0ec4c0)
[32874.215192] Stack:
[32874.215192]  ffffffff8107df04 ffff88007fd12680 0000000000000001 000000000000d300
[32874.215192]  0000000000000000 ffff88007fd03ef8 ffffffff8107ab80 ffff88007fd0d300
[32874.215192]  0000000000000001 ffff88007fd03f08 ffffffff8107abe9 ffff88007fd03f28
[32874.215192] Call Trace:
[32874.215192]  <IRQ> 
[32874.215192]  [<ffffffff8107df04>] ? ktime_get+0x64/0xe0
[32874.215192]  [<ffffffff8107ab80>] sched_clock_tick+0x40/0x90
[32874.215192]  [<ffffffff8107abe9>] sched_clock_idle_wakeup_event+0x19/0x20
[32874.215192]  [<ffffffff8108538e>] tick_nohz_stop_idle+0x3e/0x50
[32874.215192]  [<ffffffff81085b77>] tick_check_idle+0xb7/0xd0
[32874.215192]  [<ffffffff8105a749>] irq_enter+0x69/0x70
[32874.215192]  [<ffffffff81b31653>] smp_apic_timer_interrupt+0x43/0x99
[32874.215192]  [<ffffffff81b2f9cb>] apic_timer_interrupt+0x6b/0x70
[32874.215192]  <EOI> 
[32874.215192]  [<ffffffff8107aa58>] ? sched_clock_cpu+0xa8/0x120
[32874.215192]  [<ffffffff8100a89a>] ? default_idle+0x5a/0x180
[32874.215192]  [<ffffffff810009b6>] cpu_idle+0xf6/0x110
[32874.215192]  [<ffffffff81b146ea>] start_secondary+0x1cf/0x1d6
[32874.215192] Code: 66 66 66 90 48 8b 0f 48 c7 c2 0d 46 dc 81 48 89 f0 be 00 10 00 00 48 89 c7 31 c0 e8 5b 71 b9 ff 5d 48 98 c3 0f 1f 80 00 00 00 00 <55> 48 89 e5 66 66 66 66 90 8b 15 39 31 6f 00 ed 25 ff ff ff 00 
[32874.215192] RIP  [<ffffffff819f9440>] start_show+0x30/0x30
[32874.215192]  RSP <ffff88007fd03eb0>
[32874.215192] CR2: 0000000000000086
[32874.215192] [drm] nouveau 0000:01:00.0: Setting dpms mode 0 on vga encoder (output 0)
[32874.215192] ---[ end trace 90aad159d8ed7c1e ]---
[32874.215192] Kernel panic - not syncing: Fatal exception in interrupt
[32874.215192] Pid: 0, comm: swapper/1 Tainted: G      D      3.2.27 #2
[32874.215192] Call Trace:
[32874.215192]  <IRQ>  [<ffffffff81b1aeea>] panic+0x91/0x189
[32874.215192]  [<ffffffff81005491>] oops_end+0x91/0xa0
[32874.215192]  [<ffffffff81b1a85f>] no_context+0x1fa/0x225
[32874.215192]  [<ffffffff81b1aa3b>] __bad_area_nosemaphore+0x1b1/0x1d0
[32874.215192]  [<ffffffff81b1aa6d>] bad_area_nosemaphore+0x13/0x15
[32874.215192]  [<ffffffff81028794>] do_page_fault+0x2b4/0x480
[32874.215192]  [<ffffffff8104aa6c>] ? load_balance+0xac/0x780
[32874.215192]  [<ffffffff81a1b1e0>] ? skb_release_head_state+0x60/0x100
[32874.215192]  [<ffffffff81a1affe>] ? __kfree_skb+0x1e/0xa0
[32874.215192]  [<ffffffff81a1b0b1>] ? consume_skb+0x31/0x70
[32874.215192]  [<ffffffff81b2ea2f>] page_fault+0x1f/0x30
[32874.215192]  [<ffffffff819f9440>] ? start_show+0x30/0x30
[32874.215192]  [<ffffffff8107df04>] ? ktime_get+0x64/0xe0
[32874.215192]  [<ffffffff8107ab80>] sched_clock_tick+0x40/0x90
[32874.215192]  [<ffffffff8107abe9>] sched_clock_idle_wakeup_event+0x19/0x20
[32874.215192]  [<ffffffff8108538e>] tick_nohz_stop_idle+0x3e/0x50
[32874.215192]  [<ffffffff81085b77>] tick_check_idle+0xb7/0xd0
[32874.215192]  [<ffffffff8105a749>] irq_enter+0x69/0x70
[32874.215192]  [<ffffffff81b31653>] smp_apic_timer_interrupt+0x43/0x99
[32874.215192]  [<ffffffff81b2f9cb>] apic_timer_interrupt+0x6b/0x70
[32874.215192]  <EOI>  [<ffffffff8107aa58>] ? sched_clock_cpu+0xa8/0x120
[32874.215192]  [<ffffffff8100a89a>] ? default_idle+0x5a/0x180
[32874.215192]  [<ffffffff810009b6>] cpu_idle+0xf6/0x110
[32874.215192]  [<ffffffff81b146ea>] start_secondary+0x1cf/0x1d6
[32874.215192] panic occurred, switching back to text console.

Swap was off again.
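
One detail that is easy to miss in both oopses is the "symbol+offset/size" notation on the IP/RIP lines. Here is a small Python sketch that pulls those fields out of a log line and flags the case where the offset equals the symbol size. The regular expression and the function name parse_frame are my own illustration, written only against the line formats shown in this message.

```python
import re

# Matches "[<address>] [? ]symbol+0xOFFSET/0xSIZE" as printed in the traces.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\?\s+)?([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frame(line):
    """Extract address, symbol, offset and size from one oops line.

    Returns None if the line carries no symbolized address.
    """
    match = FRAME_RE.search(line)
    if match is None:
        return None
    addr, unreliable, symbol, offset, size = match.groups()
    offset, size = int(offset, 16), int(size, 16)
    return {
        "addr": int(addr, 16),
        "symbol": symbol,
        "offset": offset,
        "size": size,
        # A leading "?" marks an entry the unwinder is not sure about.
        "reliable": unreliable is None,
        # True when the address is not inside the named symbol's body.
        "at_end": offset >= size,
    }


if __name__ == "__main__":
    samples = [
        "[13330.043554] IP: [<ffffffff810b17e0>] no_action+0x10/0x10",
        "[32874.215192] IP: [<ffffffff819f9440>] start_show+0x30/0x30",
        "[13330.043554]  [<ffffffff810b4879>] handle_fasteoi_irq+0x59/0x100",
    ]
    for sample in samples:
        print(parse_frame(sample))
```

In both oopses the faulting address sits at offset == size, i.e. just past the end of the named function. My reading, which the logs themselves do not state, is that the CPU was at the first byte of whatever follows no_action and start_show respectively, so those two names may be misleading.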

After that I booted the machine with the newest kernel, 3.5.2; if it happens again I will try the "nosmp" option. Any ideas about what the cause might be, or how to track it down, are welcome.

Is this the right place to ask, or should I send it to kernel@xxxxxxxxxxxxxxx, or somewhere else?
Thanks in advance!

Adko.