On Monday 04 of May 2009 00:43:01 Robert Hancock wrote: > Andrey Borzenkov wrote: > > Yeah, me knows - it does sound weird. > > > > I moved my system from to different notebook (it was pure rsync + > > mkinitrd + couple of adjustments in loaded modules). Older one was > > using pata_ali, new one is using AHCI (Intel ICH8). > > > > After booting it (using *the* *same* kernel, only different > > modules) for the first time I started to get random oopses, file > > system corruptions as well as scary HSM violations in logs. > > > > Ubuntu live CD had no issues (even with relatively high IO load > > during kernel compilation); also my son used this notebook for > > several months under ArchLinux without any issues either. > > > > I compiled minimal stripped down kernel which worked. I tried to > > compile more feature reach one - and immediately got HSM > > violations. This time I noticed that "bad" one had dm-multipath and > > "good" one not ... well, removing /etc/mutipath.conf (to block > > loading of it) - and lo and behold! I am running without any issues > > ... > > > > So what I got in logs was: > > > > [ 18.593217] device-mapper: uevent: version 1.0.3 > > [ 18.593873] device-mapper: ioctl: 4.14.0-ioctl (2008-04-23) > > initialised: dm-devel@xxxxxxxxxx > > [ 18.664812] device-mapper: multipath: version 1.0.5 loaded > > [ 19.070383] ata1.00: exception Emask 0x2 SAct 0x73 SErr > > 0x3000400 action 0x6 > > [ 19.070389] ata1.00: irq_stat 0x45000008 > > [ 19.070395] ata1: SError: { Proto TrStaTrns UnrecFIS } > > [ 19.070406] ata1.00: cmd 60/08:00:e2:e5:c3/00:00:01:00:00/40 tag > > 0 ncq 4096 in > > [ 19.070409] res 40/00:38:00:00:00/00:00:00:00:00/40 > > Emask 0x2 (HSM violation) > > [ 19.070414] ata1.00: status: { DRDY } > > [ 19.070424] ata1.00: cmd 60/40:08:12:60:40/00:00:01:00:00/40 tag > > 1 ncq 32768 in > > [ 19.070426] res 40/00:38:00:00:00/00:00:00:00:00/40 > > Emask 0x2 (HSM violation) > > [ 19.070431] ata1.00: status: { DRDY } > > [ 19.070442] ata1.00: cmd 60/18:20:42:c1:c4/00:00:01:00:00/40 tag > > 4 ncq 12288 in > > [ 19.070444] res 41/84:18:59:c1:c4/03:00:01:00:00/61 > > Emask 0x412 (ATA bus error) <F> > > [ 19.070453] ata1.00: status: { DRDY ERR } > > [ 19.070455] ata1.00: error: { ICRC ABRT } > > [ 19.070460] ata1.00: cmd 60/28:28:22:c3:c4/00:00:01:00:00/40 tag > > 5 ncq 20480 in > > [ 19.070461] res 40/00:38:00:00:00/00:00:00:00:00/40 > > Emask 0x2 (HSM violation) > > [ 19.070464] ata1.00: status: { DRDY } > > [ 19.070469] ata1.00: cmd 60/80:30:c2:c3:c4/02:00:01:00:00/40 tag > > 6 ncq 327680 in > > [ 19.070470] res 40/00:38:00:00:00/00:00:00:00:00/40 > > Emask 0x2 (HSM violation) > > [ 19.070473] ata1.00: status: { DRDY } > > [ 19.070479] ata1: hard resetting link > > [ 19.420110] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl > > 300) [ 19.485363] ata1.00: configured for UDMA/100 > > [ 19.485406] ata1: EH complete > > It seems really unlikely that dm-multipath could have this kind of > effect. I could hardly beieve it myself. > This really looks like a hardware problem of some type, as > the controller is seeing errors on the SATA link and the drive is > complaining about bad CRCs. Maybe whatever you were doing was > stressing something more than usual? > I was booting :) Nothing more. > > [ 19.586193] device-mapper: multipath round-robin: version 1.0.0 > > loaded > > [ 19.586634] device-mapper: table: 254:0: multipath: error > > getting device > > [ 19.586638] device-mapper: ioctl: error adding target to table > > [ 31.932088] REISERFS (device sda7): found reiserfs format "3.6" > > with standard journal > > [ 31.932111] REISERFS (device sda7): using ordered data mode > > [ 31.932289] REISERFS (device sda7): journal params: device sda7, > > size 8192, journal first block 18, max trans len 1024, max batch > > 900, max commit age 30, max trans age 30 > > [ 31.932559] REISERFS (device sda7): checking transaction log > > (sda7) [ 31.999527] REISERFS (device sda7): Using r5 hash to sort > > names [ 32.101614] loop: module loaded > > [ 32.581870] Adding 4192924k swap on /dev/sda5. Priority:-1 > > extents:1 across:4192924k > > [ 34.671257] iwl3945 0000:0c:00.0: firmware: requesting > > iwlwifi-3945-2.ucode > > [ 34.924834] iwl3945 0000:0c:00.0: loaded firmware version > > 15.28.2.8 [ 34.941614] BUG: unable to handle kernel paging > > request at 08048154 [ 34.949359] IP: [<c0284711>] > > kmem_cache_alloc+0xa8/0xc1 [ 34.949359] *pde = 7f4b8067 > > [ 34.949359] Oops: 0003 [#1] SMP > > [ 34.949359] last sysfs file: > > /sys/devices/virtual/dmi/id/board_serial [ 34.949359] Modules > > linked in: loop ext2 mbcache dm_mirror dm_region_hash dm_log > > dm_round_robin dm_multipath dm_mod > > snd_hda_codec_idt iwl3945 snd_hda_intel snd_hda_codec iwlcore > > snd_hwdep btusb uvcvideo bluetooth snd_pcm videodev sdhci_pci > > ohci1394 snd_page_alloc sdhci v4l1_compat iTCO_wdt ieee1394 > > lib80211 mmc_core dcdbas ricoh_mmc hid ata_piix ahci libata > > reiserfs > > [ 34.949359] > > [ 34.949359] Pid: 2151, comm: hald Tainted: G W (2.6.30- > > rc4-3avb #5) XPS M1330 > > [ 34.949359] EIP: 0060:[<c0284711>] EFLAGS: 00010246 CPU: 1 > > [ 34.949359] EIP is at kmem_cache_alloc+0xa8/0xc1 > > [ 34.949359] EAX: 00000000 EBX: 08048154 ECX: 00000025 EDX: > > 00000094 [ 34.949359] ESI: 00008050 EDI: 08048154 EBP: f5a26f2c > > ESP: f5a26f14 [ 34.949359] DS: 007b ES: 007b FS: 00d8 GS: 0033 > > SS: 0068 [ 34.949359] Process hald (pid: 2151, ti=f5a26000 > > task=f5aabe30 task.ti=f5a26000) > > [ 34.949359] Stack: > > [ 34.949359] f7001000 c02eb6d5 00000094 f6601260 f5a21d80 > > 00008050 f5a26f44 c02eb6d5 > > [ 34.949359] f5a21d90 f5a21d80 f71e4dec f71e4e08 f5a26f58 > > c02a9b43 f6f71eec f71e4dec > > [ 34.949359] f5a21d80 f5a26f7c c02a9c6c f6f72100 f5a21db4 > > ffffffea c02aa7d6 fffffff4 > > [ 34.949359] Call Trace: > > [ 34.949359] [<c02eb6d5>] ? idr_pre_get+0x27/0x60 > > [ 34.949359] [<c02eb6d5>] ? idr_pre_get+0x27/0x60 > > [ 34.949359] [<c02a9b43>] ? inotify_handle_get_wd+0x19/0x54 > > [ 34.949359] [<c02a9c6c>] ? inotify_add_watch+0x4d/0xe1 > > [ 34.949359] [<c02aa7d6>] ? sys_inotify_add_watch+0xbb/0x16c > > [ 34.949359] [<c02aa815>] ? sys_inotify_add_watch+0xfa/0x16c > > [ 34.949359] [<c02029d4>] ? sysenter_do_call+0x12/0x32 > > [ 34.949359] Code: 02 00 00 75 09 57 9d e8 60 ca fb ff eb 07 e8 > > d1 de fb ff 57 9d 66 85 f6 79 20 85 db 74 1c 8b 4d f0 89 df c1 e9 > > 02 31 c0 8b 55 f0 <f3> ab f6 c2 02 74 02 66 ab f6 c2 01 74 01 aa 8d > > 65 f4 89 d8 5b [ 34.949359] EIP: [<c0284711>] > > kmem_cache_alloc+0xa8/0xc1 SS:ESP 0068:f5a26f14 > > [ 34.949359] CR2: 0000000008048154 > > [ 35.262159] ---[ end trace a499c68e391b7f63 ]--- > > [ 35.272994] Registered led device: iwl-phy0::radio > > [ 35.282572] Registered led device: iwl-phy0::assoc > > [ 35.292096] Registered led device: iwl-phy0::RX > > [ 35.302859] Registered led device: iwl-phy0::TX > > [ 35.317633] BUG: unable to handle kernel paging request at > > 62696c2f [ 35.327346] IP: [<c02846c7>] kmem_cache_alloc+0x5e/0xc1 > > [ 35.327561] *pde = 00000000 > > [ 35.327561] Oops: 0000 [#2] SMP > > [ 35.327561] last sysfs file: > > /sys/devices/virtual/dmi/id/board_serial [ 35.327561] Modules > > linked in: loop ext2 mbcache dm_mirror dm_region_hash dm_log > > dm_round_robin dm_multipath dm_mod > > snd_hda_codec_idt iwl3945 snd_hda_intel snd_hda_codec iwlcore > > snd_hwdep btusb uvcvideo bluetooth snd_pcm videodev sdhci_pci > > ohci1394 snd_page_alloc sdhci v4l1_compat iTCO_wdt ieee1394 > > lib80211 mmc_core dcdbas ricoh_mmc hid ata_piix ahci libata > > reiserfs > > [ 35.327561] > > [ 35.327561] Pid: 2267, comm: console-kit-dae Tainted: G D W > > (2.6.30-rc4-3avb #5) XPS M1330 > > [ 35.327561] EIP: 0060:[<c02846c7>] EFLAGS: 00010002 CPU: 1 > > [ 35.327561] EIP is at kmem_cache_alloc+0x5e/0xc1 > > [ 35.327561] EAX: 00000000 EBX: 62696c2f ECX: f7001000 EDX: > > c22b34bc [ 35.327561] ESI: 00008050 EDI: 00000246 EBP: f5a5cf2c > > ESP: f5a5cf14 [ 35.327561] DS: 007b ES: 007b FS: 00d8 GS: 0033 > > SS: 0068 [ 35.327561] Process console-kit-dae (pid: 2267, > > ti=f5a5c000 task=f59fa550 task.ti=f5a5c000) > > [ 35.327561] Stack: > > [ 35.327561] f7001000 c02eb6d5 00000094 f67fe0c0 f67fe0c0 > > 00008050 f5a5cf44 c02eb6d5 > > [ 35.327561] f67fe0d0 f67fe0c0 f71e4cd4 f71e4cf0 f5a5cf58 > > c02a9b43 f6f7259c f71e4cd4 > > [ 35.327561] f67fe0c0 f5a5cf7c c02a9c6c f6f727b0 f67fe0f4 > > ffffffea c02aa7d6 fffffff4 > > [ 35.327561] Call Trace: > > [ 35.327561] [<c02eb6d5>] ? idr_pre_get+0x27/0x60 > > [ 35.327561] [<c02eb6d5>] ? idr_pre_get+0x27/0x60 > > [ 35.327561] [<c02a9b43>] ? inotify_handle_get_wd+0x19/0x54 > > [ 35.327561] [<c02a9c6c>] ? inotify_add_watch+0x4d/0xe1 > > [ 35.327561] [<c02aa7d6>] ? sys_inotify_add_watch+0xbb/0x16c > > [ 35.327561] [<c02aa815>] ? sys_inotify_add_watch+0xfa/0x16c > > [ 35.327561] [<c02029d4>] ? sysenter_do_call+0x12/0x32 > > [ 35.327561] Code: 62 f5 21 00 9c 5f fa e8 aa ca fb ff 8b 4d e8 > > 64 a1 88 d4 66 c0 8b 94 81 a4 00 00 00 8b 42 10 89 45 f0 8b 1a 85 > > db 74 0a 8b 42 0c <8b> 04 83 89 02 eb 15 52 83 c9 ff 89 f2 ff 75 ec > > 8b 45 e8 e8 4d [ 35.327561] EIP: [<c02846c7>] > > kmem_cache_alloc+0x5e/0xc1 SS:ESP 0068:f5a5cf14 > > [ 35.327561] CR2: 0000000062696c2f > > [ 35.327561] ---[ end trace a499c68e391b7f64 ]--- > > [ 35.693405] ADDRCONF(NETDEV_UP): wlan0: link is not ready > > > > The contents of /etc/mutipath.conf was: > > > > defaults { > > getuid_callout "/bin/echo -n 12345678" > > path_checker readsector0 > > } > > multipaths { > > multipath { > > wwid 12345678 > > features "1 queue_if_no_path" > > no_path_retry 5 > > failback immediate > > } > > } > > > > I was experimenting with multipath-over-loop and completely forgot > > about it later. > > > > I understand that it sounds more like "doctor, it hurts when I stab > > myself in the eye". Still, even with this absolutely weird config > > result is rather unexpected (what is worse, it is near to > > impossible to find out the reason; I did it by pure accident). Also > > I have been running with *exactly* the same config for almost a > > year without any ill effects at all. > > > > I am ready to offer debugging aid if required.
Attachment:
signature.asc
Description: This is a digitally signed message part.