Re: Status update on sparc32 genirq support

Marcel van Nies <morcles@xxxxxxxxx> · Tue, 8 Mar 2011 22:13:05 +0100

Hi,

As expected, esp_debug gives a lot of output.
Is there anything in particular to look out for?

Btw:
At this point:
> Give root password for maintenance
> (or type Control-D to continue):

I can log on, and reads from disk seem to go fine.

# mount -n -o remount,rw /
Then also writes to disk seem to go fine.

So, is this an ESP or EXT2 bug at all?

Marcel

On Tue, Mar 8, 2011 at 9:22 PM, Marcel van Nies <morcles@xxxxxxxxx> wrote:
> Hi,
>
> The good news:
> sparc-next-2.6 with commit 4d14a459857bd151ecbd14bcd37b4628da00792b reverted
> does NOT segfault. I did not apply the genirq patch yet.
>
> The bad news:
> Segfault gone, say hello to EXT2 read failure   :o(
>
> I'll rebuild this kernel with the esp_debug.patch Sam sent a couple of days ago.
>
>
> [    0.233333] esp: esp0, regs[fd00a000:fd009000] irq[36]
> [    0.236666] esp: esp0 is a FAS100A, 40 MHz (ccf=0), SCSI ID 7
> [    3.243333] scsi0 : esp
> [    3.483332] scsi 0:0:1:0: Direct-Access     FUJITSU  MAP3735N SUN72G  0401 PQ: 0 ANSI: 4
> [    3.486666] scsi target0:0:1: Beginning Domain Validation
> [    3.493332] scsi target0:0:1: FAST-10 SCSI 10.0 MB/s ST (100 ns, offset 15)
> [    3.499999] scsi target0:0:1: Domain Validation skipping write tests
> [    3.503332] scsi target0:0:1: Ending Domain Validation
> [    3.743332] scsi 0:0:3:0: Direct-Access     FUJITSU  MAP3735N SUN72G  0401 PQ: 0 ANSI: 4
> [    3.746666] scsi target0:0:3: Beginning Domain Validation
> [    3.753332] scsi target0:0:3: FAST-10 SCSI 10.0 MB/s ST (100 ns, offset 15)
> [    3.756666] scsi target0:0:3: Domain Validation skipping write tests
> [    3.759999] scsi target0:0:3: Ending Domain Validation
> [    4.469999] esp: esp1, regs[fd00c000:fd00b000] irq[53]
> [    4.473332] esp: esp1 is a FASHME, 40 MHz (ccf=0), SCSI ID 7
> [    7.479999] scsi1 : esp
> ...
> [   11.029998] sd 0:0:1:0: [sda] 143374738 512-byte logical blocks: (73.4 GB/68.3 GiB)
> [   11.033332] sd 0:0:3:0: [sdb] 143374738 512-byte logical blocks: (73.4 GB/68.3 GiB)
> [   11.036665] sd 0:0:1:0: [sda] Write Protect is off
> [   11.043332] sd 0:0:1:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
> [   11.046665] sd 0:0:3:0: [sdb] Write Protect is off
> [   11.053332] sd 0:0:3:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
> [   11.066665]  sda: sda1 sda2 sda3
> [   11.073332]  sdb: sdb1 sdb2 sdb3 sdb4 sdb5 sdb6 sdb7
> [   11.089998] sd 0:0:1:0: [sda] Attached SCSI disk
> [   11.093332] sd 0:0:3:0: [sdb] Attached SCSI disk
> [   11.106665] EXT3-fs: barriers not enabled
> [   11.113332] kjournald starting.  Commit interval 5 seconds
> [   11.116665] EXT3-fs (sdb4): mounted filesystem with ordered data mode
> [   11.119998] VFS: Mounted root (ext3 filesystem) readonly on device 8:20.
> [   11.123332] Freeing unused kernel memory: 108k freed
> INIT: version 2.86 booting
> [   12.673332] NET: Registered protocol family 1
>
> Gentoo Linux; http://www.gentoo.org/
>  Copyright 1999-2007 Gentoo Foundation; Distributed under the GPLv2
>
>  * Mounting proc at /proc ...                                             [ ok ]
>  * Mounting sysfs at /sys ...                                             [ ok ]
>  * Mounting /dev for udev ...                                             [ ok ]
> ...
> blahblah
> ...
>  * Checking root filesystem ...fsck.ext3: No such file or directory while trying to open /dev/sdb4
> /dev/sdb4:
> The superblock could not be read or does not describe a correct ext2
> filesystem.  If the device is valid and it really contains an ext2
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate superblock:
>    e2fsck -b 8193 <device>
>
>  * Filesystem couldn't be fixed :(
>         [ !! ]
> Give root password for maintenance
> (or type Control-D to continue):
>
>
> Marcel
>
>
> On Tue, Mar 8, 2011 at 12:17 PM, Marcel van Nies <morcles@xxxxxxxxx> wrote:
>> Hi,
>>
>> 2.6.33.7 with commit 4d14a459857bd151ecbd14bcd37b4628da00792b reverted
>> does not segfault.
>> I also tried sparc-next-2.6, but I messed up my tree somehow. I will
>> try again later.
>>
>> M
>>
>> On Tue, Mar 8, 2011 at 8:45 AM, Marcel van Nies <morcles@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>>> But first step is to get confirmation that reverting this commit
>>>> indeed fixes the bug
>>>
>>> I'll try that.
>>> M
>>>
>>> On Tue, Mar 8, 2011 at 8:37 AM, Marcel van Nies <morcles@xxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> It appears that two consecutive commits are causing problems on
>>>> hyperSPARC, I noticed that too late.
>>>>
>>>> Commit 4d14a459857bd151ecbd14bcd37b4628da00792b (the one I reported
>>>> earlier) only causes the system to hang, not panic:
>>>> [   11.266665] sd 0:0:1:0: [sda] Attached SCSI disk
>>>> [   11.279998] sd 0:0:3:0: [sdb] Attached SCSI disk
>>>> [   11.299998] kjournald starting.  Commit interval 5 seconds
>>>> [   11.303332] EXT3-fs: mounted filesystem with writeback data mode.
>>>> [   11.306665] VFS: Mounted root (ext3 filesystem) readonly on device 8:20.
>>>> [   11.309998] Freeing unused kernel memory: 100k freed
>>>> <system hangs here - stop-A does go back to prom>
>>>>
>>>> and
>>>> commit c658ad1b4e1520511da8323aa5e60d444cc303ed
>>>> Author: David S. Miller <davem@xxxxxxxxxxxxx>
>>>> Date:   Fri Dec 11 00:44:47 2009 -0800
>>>>
>>>>    sparc64: Add syscall tracepoint support.
>>>>
>>>>    Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
>>>>
>>>> actually makes the kernel panic:
>>>> [   11.336665] Freeing unused kernel memory: 100k freed
>>>> [   11.419998] Kernel panic - not syncing: Attempted to kill init!
>>>> [   11.423332] [f002f5b8 : do_group_exit+0x84/0xb4 ]
>>>>  [f0039490 : get_signal_to_deliver+0x338/0x35c ]
>>>>  [f00124cc : do_signal+0x30/0x8f0 ]
>>>>  [f0012da0 : do_notify_resume+0x14/0x38 ]
>>>>  [f000fca4 : signal_p+0x14/0x24 ]
>>>>  [f000edfc : srmmu_fault+0x58/0x68 ]
>>>> [   11.466665] Press Stop-A (L1-A) to return to the boot prom
>>>>
>>>>
>>>> Marcel
>>>>
>>>>
>>>> On Tue, Mar 8, 2011 at 8:08 AM, Sam Ravnborg <sam@xxxxxxxxxxxx> wrote:
>>>>> On Mon, Mar 07, 2011 at 11:01:20PM -0800, David Miller wrote:
>>>>>> From: Sam Ravnborg <sam@xxxxxxxxxxxx>
>>>>>> Date: Tue, 8 Mar 2011 07:00:39 +0100
>>>>>>
>>>>>> > Added davem...
>>>>>> > We see strange SEGV faults in userspace and fail to read from ext2..
>>>>>> > All on some (but not all) sparc32 boxes.
>>>>>>
>>>>>> I saw the original report.
>>>>>>
>>>>>> But reverting this commit is the wrong thing to do from what I can
>>>>>> tell.
>>>>>>
>>>>>> Either we have:
>>>>>>
>>>>>> 1) A compiler code gen bug.
>>>>>>
>>>>>> 2) Some piece of code which is sparc32 specific is invoking memset
>>>>>>    or memcpy in a way which makes assumptions which are in fact not
>>>>>>    valid
>>>>>>
>>>>>> 3) The code change is merely making cache offsets change, masking the
>>>>>>    true problem
>>>>>>
>>>>>> Especially in cases #2 and #3 we're just hiding a heisen-bug and
>>>>>> not fixing the real problem.
>>>>> Agree on this.
>>>>> But first step is to get confirmation that reverting this commit
>>>>> indeed fixes the bug. Then we can go hunting for 2), 3) or 1).
>>>>> I hope we will find that 2) is the culprit.
>>>>>
>>>>>        Sam
>>>>>
>>>>
>>>
>>
>