My system hangs from time to time with a kernel panic, after anywhere from a few minutes to 8-9 hours of work. Before this started it had worked fine for about 6 years, with no software or hardware changes during that period.

I have some photos of the screen after a panic. The first two are with the old Linux kernel 2.6.16.27 (they enlarge on click):

http://picpaste.com/pics/img00005-73m0unO0.1345852235.jpg
http://picpaste.com/pics/P170812_12.01-MeZrs3zv.1345817375.jpg

Then I installed slackware-current with its default kernel "huge.s", and the crashes continued:

http://picpaste.com/pics/P210812_15.34-3NSTEV8f.1345816730.jpg

Then I switched off the swap:

http://picpaste.com/pics/P230812_15.06-hB12169n.1345812390.jpg

After that I managed to save one message with netconsole (swap is off):

[13330.042569] BUG: unable to handle kernel paging request at 000060ff80001f1c
[13330.043554] IP: [<ffffffff810b17e0>] no_action+0x10/0x10
[13330.043554] PGD 0
[13330.043554] Oops: 0002 [#1] SMP
[13330.043554] CPU 1
[13330.043554] Modules linked in: ipv6 lp netconsole snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss fuse nouveau mxm_wmi wmi video ttm drm_kms_helper drm amd64_agp processor thermal_sys k8temp agpgart hwmon snd_via82xx snd_ac97_codec snd_mpu401_uart snd_rawmidi snd_seq_device snd_pcm snd_page_alloc snd_timer snd soundcore ac97_bus ppdev parport_pc i2c_algo_bit gameport evdev shpchp button i2c_viapro i2c_core loop skge parport [last unloaded: lp]
[13330.043554]
[13330.043554] Pid: 0, comm: swapper/1 Not tainted 3.2.27 #2 To Be Filled By O.E.M. To Be Filled By O.E.M./A8V Deluxe
[13330.043554] RIP: 0010:[<ffffffff810b17e0>] [<ffffffff810b17e0>] no_action+0x10/0x10
[13330.043554] RSP: 0018:ffff88007fd03f10 EFLAGS: 00010086
[13330.043554] RAX: 000060ff80001f1c RBX: ffff88007aef2c00 RCX: 00000000fffffffa
[13330.043554] RDX: 00000000000000d0 RSI: ffff88007ae93f80 RDI: ffff88007aef2c00
[13330.043554] RBP: ffff88007fd03f38 R08: ffff88007aef2c00 R09: ffff88007cc00000
[13330.043554] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007aef2c8c
[13330.043554] R13: 0000000000000011 R14: 0000000000000000 R15: 0000000000000000
[13330.043554] FS: 00007f674b3e6740(0000) GS:ffff88007fd00000(0000) knlGS:00000000f7369700
[13330.043554] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[13330.043554] CR2: 000060ff80001f1c CR3: 000000006f115000 CR4: 00000000000006e0
[13330.043554] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13330.043554] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[13330.043554] Process swapper/1 (pid: 0, threadinfo ffff88007bd18000, task ffff88007d0ec4c0)
[13330.043554] Stack:
[13330.043554] ffffffff810b1a10 ffff88007fd03f58 ffff88007aef2c00 0000000000000051
[13330.043554] 0000000000000011 ffff88007fd03f58 ffffffff810b4879 ffff88007fd03f58
[13330.043554] 0000000000000011 ffff88007fd03f78 ffffffff81003d12 ffff88007fd03f78
[13330.043554] Call Trace:
[13330.043554] <IRQ>
[13330.043554] [<ffffffff810b1a10>] ? handle_irq_event+0x40/0x70
[13330.043554] [<ffffffff810b4879>] handle_fasteoi_irq+0x59/0x100
[13330.043554] [<ffffffff81003d12>] handle_irq+0x22/0x40
[13330.043554] [<ffffffff81b3158a>] do_IRQ+0x5a/0xe0
[13330.043554] [<ffffffff81b2e82b>] common_interrupt+0x6b/0x6b
[13330.043554] <EOI>

Here is a link to the dmesg output from before that last crash:

http://pastebin.com/Af7bb34x

And at the end I noticed some scary messages in the syslog:

[31770.094556] REISERFS warning (device sda1): clm-6006 reiserfs_dirty_inode: writing inode 347717 on readonly FS
[31770.472848] REISERFS warning (device sda1): clm-6006 reiserfs_dirty_inode: writing inode 347740 on readonly FS
[31790.796117] REISERFS warning (device sda1): clm-6006 reiserfs_dirty_inode: writing inode 426162 on readonly FS

after which I immediately ran reiserfsck; no corruption was found. I have never seen such messages before: I have syslogs for the 17 days before that, and none of them contain messages like this.

I ran some tests with smartmontools earlier, back on the old Linux (2.6.16.27). The result of "smartctl -s on -a /dev/sda" is:

smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.5.2] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus
Device Model:     ST3200822AS
Serial Number:    4LJ221BB
Firmware Version: 3.01
User Capacity:    200,049,647,616 bytes [200 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sat Aug 25 03:09:01 2012 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 111) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   050   046   006    Pre-fail  Always       -       179699255
  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       123
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       6
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       81170784
  9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       53553
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       142
194 Temperature_Celsius     0x0022   037   054   000    Old_age   Always       -       37
195 Hardware_ECC_Recovered  0x001a   050   046   000    Old_age   Always       -       179699255
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   198   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 13784 hours (574 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 7a 7d 1d e0  Error: ICRC, ABRT at LBA = 0x001d7d7a = 1932666

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 7b 7c 1d e0 00      22:14:23.595  READ DMA EXT
  25 00 00 7b 7b 1d e0 00      22:14:23.593  READ DMA EXT
  25 00 00 7b 7a 1d e0 00      22:14:23.576  READ DMA EXT
  25 00 00 7b 79 1d e0 00      22:14:23.567  READ DMA EXT
  25 00 00 7b 78 1d e0 00      22:14:23.566  READ DMA EXT

Error 1 occurred at disk power-on lifetime: 13784 hours (574 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 fa 0e 01 e0  Error: ICRC, ABRT at LBA = 0x00010efa = 69370

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 fb 0d 01 e0 00      22:13:03.489  READ DMA EXT
  25 00 00 fb 0c 01 e0 00      22:13:03.487  READ DMA EXT
  25 00 00 fb 0b 01 e0 00      22:13:03.701  READ DMA EXT
  25 00 00 fb 09 01 e0 00      22:13:03.682  READ DMA EXT
  25 00 00 fb 07 01 e0 00      22:13:03.681  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     53153         -
# 2  Short offline       Completed without error       00%     53152         -
# 3  Short offline       Completed without error       00%     53152         -
# 4  Short offline       Completed without error       00%     53152         -
# 5  Short offline       Completed without error       00%     53152         -
# 6  Short offline       Completed without error       00%     53148         -
# 7  Short offline       Completed without error       00%     53148         -
# 8  Short offline       Completed without error       00%     53148         -
# 9  Extended offline    Aborted by host               80%     53148         -
#10  Short offline       Completed without error       00%     53147         -
#11  Short offline       Completed without error       00%     53147         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
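
As a cross-check of the two logged ICRC errors, the LBA that smartctl prints can be recomputed from the register dump. The sketch below is only an illustration: the function names are made up for this example, and the 28-bit layout (low nibble of DH, then CH, CL, SN) is an assumption that happens to reproduce both values in the error log above.

```python
def lba_from_registers(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the SN, CL, CH and DH register bytes.

    Assumed layout: low 4 bits of DH are LBA bits 27..24, CH is bits
    23..16, CL is bits 15..8 and SN is bits 7..0.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def split_hours(hours):
    """Convert a power-on lifetime in hours to (days, remaining hours)."""
    return divmod(hours, 24)


# Register bytes copied from the "After command completion" rows above,
# in the order SN, CL, CH, DH.
for name, regs in (("Error 2", (0x7A, 0x7D, 0x1D, 0xE0)),
                   ("Error 1", (0xFA, 0x0E, 0x01, 0xE0))):
    lba = lba_from_registers(*regs)
    print(f"{name}: LBA = 0x{lba:08x} = {lba}")

days, hours = split_hours(13784)
print(f"13784 hours = {days} days + {hours} hours")
```

Both errors were logged at 13784 power-on hours, while Power_On_Hours in the attribute table reads 53553, so they are 39769 hours older than this report.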
Soon after that (you can see the time in the messages) I managed to capture what I hope is one whole panic message:

[32874.215014] BUG: unable to handle kernel NULL pointer dereference at 0000000000000086
[32874.215192] IP: [<ffffffff819f9440>] start_show+0x30/0x30
[32874.215192] PGD 7afe0067 PUD 7497e067 PMD 0
[32874.215192] Oops: 0002 [#1] SMP
[32874.215192] CPU 1
[32874.215192] Modules linked in: netconsole ipt_REJECT xt_tcpudp iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables ipv6 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss fuse nouveau mxm_wmi wmi video ttm drm_kms_helper snd_via82xx snd_ac97_codec snd_mpu401_uart snd_rawmidi snd_seq_device snd_pcm snd_page_alloc drm snd_timer amd64_agp processor i2c_algo_bit snd shpchp k8temp agpgart thermal_sys i2c_viapro hwmon i2c_core skge soundcore ac97_bus gameport evdev ppdev button parport_pc parport loop [last unloaded: lp]
[32874.215192]
[32874.215192] Pid: 0, comm: swapper/1 Not tainted 3.2.27 #2 To Be Filled By O.E.M. To Be Filled By O.E.M./A8V Deluxe
[32874.215192] RIP: 0010:[<ffffffff819f9440>] [<ffffffff819f9440>] start_show+0x30/0x30
[32874.215192] RSP: 0018:ffff88007fd03eb0 EFLAGS: 00010006
[32874.215192] RAX: 0000000000000086 RBX: ffffffff820c2fc0 RCX: 0000000000000001
[32874.215192] RDX: 00001de61fe84bdb RSI: 0000000000000000 RDI: ffffffff820c2fc0
[32874.215192] RBP: ffff88007fd03ed8 R08: 0000000000000000 R09: 0000000000000001
[32874.215192] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000008069
[32874.215192] R13: 00000000484f99af R14: 0000000000ab2476 R15: 0000000000000000
[32874.215192] FS: 00007f61bddf4740(0000) GS:ffff88007fd00000(0000) knlGS:00000000f75fc6c0
[32874.215192] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[32874.215192] CR2: 0000000000000086 CR3: 00000000746e8000 CR4: 00000000000006e0
[32874.215192] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32874.215192] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[32874.215192] Process swapper/1 (pid: 0, threadinfo ffff88007bd18000, task ffff88007d0ec4c0)
[32874.215192] Stack:
[32874.215192] ffffffff8107df04 ffff88007fd12680 0000000000000001 000000000000d300
[32874.215192] 0000000000000000 ffff88007fd03ef8 ffffffff8107ab80 ffff88007fd0d300
[32874.215192] 0000000000000001 ffff88007fd03f08 ffffffff8107abe9 ffff88007fd03f28
[32874.215192] Call Trace:
[32874.215192] <IRQ>
[32874.215192] [<ffffffff8107df04>] ? ktime_get+0x64/0xe0
[32874.215192] [<ffffffff8107ab80>] sched_clock_tick+0x40/0x90
[32874.215192] [<ffffffff8107abe9>] sched_clock_idle_wakeup_event+0x19/0x20
[32874.215192] [<ffffffff8108538e>] tick_nohz_stop_idle+0x3e/0x50
[32874.215192] [<ffffffff81085b77>] tick_check_idle+0xb7/0xd0
[32874.215192] [<ffffffff8105a749>] irq_enter+0x69/0x70
[32874.215192] [<ffffffff81b31653>] smp_apic_timer_interrupt+0x43/0x99
[32874.215192] [<ffffffff81b2f9cb>] apic_timer_interrupt+0x6b/0x70
[32874.215192] <EOI>
[32874.215192] [<ffffffff8107aa58>] ? sched_clock_cpu+0xa8/0x120
[32874.215192] [<ffffffff8100a89a>] ? default_idle+0x5a/0x180
[32874.215192] [<ffffffff810009b6>] cpu_idle+0xf6/0x110
[32874.215192] [<ffffffff81b146ea>] start_secondary+0x1cf/0x1d6
[32874.215192] Code: 66 66 66 90 48 8b 0f 48 c7 c2 0d 46 dc 81 48 89 f0 be 00 10 00 00 48 89 c7 31 c0 e8 5b 71 b9 ff 5d 48 98 c3 0f 1f 80 00 00 00 00 <55> 48 89 e5 66 66 66 66 90 8b 15 39 31 6f 00 ed 25 ff ff ff 00
[32874.215192] RIP [<ffffffff819f9440>] start_show+0x30/0x30
[32874.215192] RSP <ffff88007fd03eb0>
[32874.215192] CR2: 0000000000000086
[32874.215192] [drm] nouveau 0000:01:00.0: Setting dpms mode 0 on vga encoder (output 0)
[32874.215192] ---[ end trace 90aad159d8ed7c1e ]---
[32874.215192] Kernel panic - not syncing: Fatal exception in interrupt
[32874.215192] Pid: 0, comm: swapper/1 Tainted: G D 3.2.27 #2
[32874.215192] Call Trace:
[32874.215192] <IRQ> [<ffffffff81b1aeea>] panic+0x91/0x189
[32874.215192] [<ffffffff81005491>] oops_end+0x91/0xa0
[32874.215192] [<ffffffff81b1a85f>] no_context+0x1fa/0x225
[32874.215192] [<ffffffff81b1aa3b>] __bad_area_nosemaphore+0x1b1/0x1d0
[32874.215192] [<ffffffff81b1aa6d>] bad_area_nosemaphore+0x13/0x15
[32874.215192] [<ffffffff81028794>] do_page_fault+0x2b4/0x480
[32874.215192] [<ffffffff8104aa6c>] ? load_balance+0xac/0x780
[32874.215192] [<ffffffff81a1b1e0>] ? skb_release_head_state+0x60/0x100
[32874.215192] [<ffffffff81a1affe>] ? __kfree_skb+0x1e/0xa0
[32874.215192] [<ffffffff81a1b0b1>] ? consume_skb+0x31/0x70
[32874.215192] [<ffffffff81b2ea2f>] page_fault+0x1f/0x30
[32874.215192] [<ffffffff819f9440>] ? start_show+0x30/0x30
[32874.215192] [<ffffffff8107df04>] ? ktime_get+0x64/0xe0
[32874.215192] [<ffffffff8107ab80>] sched_clock_tick+0x40/0x90
[32874.215192] [<ffffffff8107abe9>] sched_clock_idle_wakeup_event+0x19/0x20
[32874.215192] [<ffffffff8108538e>] tick_nohz_stop_idle+0x3e/0x50
[32874.215192] [<ffffffff81085b77>] tick_check_idle+0xb7/0xd0
[32874.215192] [<ffffffff8105a749>] irq_enter+0x69/0x70
[32874.215192] [<ffffffff81b31653>] smp_apic_timer_interrupt+0x43/0x99
[32874.215192] [<ffffffff81b2f9cb>] apic_timer_interrupt+0x6b/0x70
[32874.215192] <EOI> [<ffffffff8107aa58>] ? sched_clock_cpu+0xa8/0x120
[32874.215192] [<ffffffff8100a89a>] ? default_idle+0x5a/0x180
[32874.215192] [<ffffffff810009b6>] cpu_idle+0xf6/0x110
[32874.215192] [<ffffffff81b146ea>] start_secondary+0x1cf/0x1d6
[32874.215192] panic occurred, switching back to text console

Swap was off again this time.

After that I booted the machine with the newest kernel, 3.5.2, and if the panic happens again I will try the "nosmp" option. Any ideas about what the cause could be, or how to catch it, are welcome.

Is this the right place to ask, or should I send it to kernel@xxxxxxxxxxxxxxx, or somewhere else?

Thanks in advance!
Adko.
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html