Re: [BUG] scsi: hpsa: how to destroy your files

scameron@xxxxxxxxxxxxxxxxxx · Thu, 1 Sep 2011 14:03:29 -0500

On Thu, Sep 01, 2011 at 01:01:38PM -0500, scameron@xxxxxxxxxxxxxxxxxx wrote:
> On Thu, Sep 01, 2011 at 07:40:15PM +0200, Eric Dumazet wrote:
> > On Thursday, 01 September 2011 at 11:07 -0500, scameron@xxxxxxxxxxxxxxxxxx
> > wrote:
> > > On Thu, Sep 01, 2011 at 05:24:02PM +0200, Eric Dumazet wrote:
> > > > Stephen,
> > > > 
> > > > Current linux-3.1-rc4+ is a total disaster on my BL460c G6
> > > 
> > > What kernel were you running successfully previously?
> > > 
> > > I saw similar on BL460cG7 on Friday with 3.1-rc4,
> > > but I'm not sure the problem is in the driver.  
> > > I installed rhel6.1, then put 3.1-rc4 on.  Turning off
> > > "Virtualization" in the kernel config seemed to help
> > > (allowed it to boot) and so I thought that must have
> > > been the source of the issue.  So, you might try that.
> > > 
> > > However, I rebooted that machine just now, and
> > > now I am getting the similar "hpsa 0000:0c:00.0: resetting device 0:0:0:0"
> > > message, so that's pretty weird.
> > > 
> > > The cmd_alloc failure, I didn't see, but I may have missed it
> > > (didn't have console directed to serial output.)
> > > 
> > > cmd_alloc failing is not generally expected, as we reserve enough
> > > commands that the upper layers should never exhaust them all (should
> > > honor hpsa's max request limit), so that's pretty weird that
> > > you're seeing that.
> > > 
> > > I am able to run 3.1-rc3 on rhel6 just fine on other systems (DL380g7,
> > > for example) and I don't think there are any hpsa changes between rc3
> > > and rc4.  (haven't tried rc4 on the dl380g7 yet).
> > > 
> > > So, I'm not sure what's going on with the BL460c yet, but I am
> > > aware of the problem and have already seen it.  I can't think of
> > > any driver changes lately which should be causing such
> > > changes in behavior.
> > > 
> > > -- steve
> > > 
> > > 
> > 
> > OK, I found the bad commit. I got lucky: I lost some files, but my
> > machine was able to complete the bisection. CCing the people involved.
> > 
> 
> Thanks.  I will run this information by the hardware guys here
> and see if they have any bright ideas.
> 
> Would be interesting to see if the "pcie_bus_safe" option 
> makes a difference.

FWIW, this option does not help (though it does change the
behavior).  I get hpsa complaining about bad tags returned
from the hardware, which is to say, this code from hpsa.c
fires:

	static inline int bad_tag(struct ctlr_info *h, u32 tag_index,
		u32 raw_tag)
	{
		if (unlikely(tag_index >= h->nr_cmds)) {
			dev_warn(&h->pdev->dev, "bad tag 0x%08x ignored.\n", raw_tag);
			return 1;
		}
		return 0;
	}
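
As a rough illustration of why that check matters (a userspace sketch,
not the driver code; struct pool_sketch, lookup_cmd_sketch() and the
pool size are made up for the example): the tag index is used to pick a
command out of a fixed-size pool, so an index at or past nr_cmds has to
be rejected before it is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CMDS_SKETCH 4	/* hypothetical pool size */

struct cmd_sketch {
	int in_use;
};

struct pool_sketch {
	uint32_t nr_cmds;
	struct cmd_sketch cmds[NR_CMDS_SKETCH];
};

/* Same range test as bad_tag(): nonzero means the tag must be ignored. */
static int bad_tag_sketch(const struct pool_sketch *p, uint32_t tag_index,
	uint32_t raw_tag)
{
	if (tag_index >= p->nr_cmds) {
		fprintf(stderr, "bad tag 0x%08x ignored.\n",
			(unsigned int)raw_tag);
		return 1;
	}
	return 0;
}

/* Return the command for a completed tag, or NULL if the tag is bogus. */
static struct cmd_sketch *lookup_cmd_sketch(struct pool_sketch *p,
	uint32_t tag_index, uint32_t raw_tag)
{
	assert(p != NULL);
	if (bad_tag_sketch(p, tag_index, raw_tag))
		return NULL;
	return &p->cmds[tag_index];
}
```

With a pool of 4 commands, indexes 0 through 3 resolve to a command and
anything larger is dropped with the warning instead of indexing past the
end of the pool.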

I had added "pcie_bus_safe" and "pci.pcie_bus_safe" to the kernel command
line.  (It was hard to tell how the option is supposed to be used,
as nothing in the Documentation directory mentions
pcie_bus_safe.)
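
For reference, here is my reading of the two policies described in the
commit message quoted below, as a toy userspace sketch.  This is not the
PCI core code: struct dev_sketch, set_mps_safe(), set_mps_perf() and the
flattened-tree layout are all made up for illustration.  "Safe" clamps
every device to the smallest payload size supported anywhere in the
tree; "perf" only requires that a device's MPS is no larger than its
parent's.

```c
#include <assert.h>

/*
 * Hypothetical flattened PCIe tree.  parent is the index of the upstream
 * device, or -1 for the root port.  Parents must appear before children.
 */
struct dev_sketch {
	int parent;
	int supported;	/* largest payload the device supports, in bytes */
	int mps;	/* payload size we end up configuring, in bytes */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* "safe": every device gets the smallest supported size in the tree. */
static void set_mps_safe(struct dev_sketch *d, int n)
{
	int i, smallest;

	assert(n > 0);
	smallest = d[0].supported;
	for (i = 1; i < n; i++)
		smallest = min_int(smallest, d[i].supported);
	for (i = 0; i < n; i++)
		d[i].mps = smallest;
}

/* "perf": a device is limited only by itself and by its parent's MPS. */
static void set_mps_perf(struct dev_sketch *d, int n)
{
	int i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		d[i].mps = d[i].supported;
		if (d[i].parent >= 0)
			d[i].mps = min_int(d[i].mps, d[d[i].parent].mps);
	}
}
```

Taking the example from the commit message (a 128 byte USB controller
and a 512 byte 10GE adapter behind the same 512 byte switch), the safe
policy leaves everything at 128 bytes, while the perf policy keeps the
switch and the 10GE adapter at 512 and only the USB controller at 128.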

-- steve

> 
> -- steve
> 
> > git bisect start
> > # bad: [9e79e3e9dd9672b37ac9412e9a926714306551fe] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
> > git bisect bad 9e79e3e9dd9672b37ac9412e9a926714306551fe
> > # good: [322a8b034003c0d46d39af85bf24fee27b902f48] Linux 3.1-rc1
> > git bisect good 322a8b034003c0d46d39af85bf24fee27b902f48
> > # bad: [0c3bef612881ee6216a36952ffaabfc35b83545c] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
> > git bisect bad 0c3bef612881ee6216a36952ffaabfc35b83545c
> > # good: [8c70aac04e01a08b7eca204312946206d1c1baac] Merge branch 'staging-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
> > git bisect good 8c70aac04e01a08b7eca204312946206d1c1baac
> > # good: [291b63c86aea8a571ddf913d41ab5156b8314dad] Merge branch 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6
> > git bisect good 291b63c86aea8a571ddf913d41ab5156b8314dad
> > # good: [aa462abe8aaf2198d6aef97da20c874ac694a39f] mm: fix __page_to_pfn for a const struct page argument
> > git bisect good aa462abe8aaf2198d6aef97da20c874ac694a39f
> > # good: [5c80c71b9a0ec518b4b58d2a61de01a04f4a4453] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
> > git bisect good 5c80c71b9a0ec518b4b58d2a61de01a04f4a4453
> > # good: [2c4ac99f983f1341b5962a16b5e8de6049bf10b5] Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
> > git bisect good 2c4ac99f983f1341b5962a16b5e8de6049bf10b5
> > # bad: [0a2daa1cf35004f5adbf4138555cc5669abf3a3e] PCI: make cardbus-bridge resources optional
> > git bisect bad 0a2daa1cf35004f5adbf4138555cc5669abf3a3e
> > # bad: [be768912a49b10b68e96fbd8fa3cab0adfbd3091] PCI: honor child buses add_size in hot plug configuration
> > git bisect bad be768912a49b10b68e96fbd8fa3cab0adfbd3091
> > # bad: [b03e7495a862b028294f59fc87286d6d78ee7fa1] PCI: Set PCI-E Max Payload Size on fabric
> > git bisect bad b03e7495a862b028294f59fc87286d6d78ee7fa1
> > commit b03e7495a862b028294f59fc87286d6d78ee7fa1
> > Author: Jon Mason <mason@xxxxxxxx>
> > Date:   Wed Jul 20 15:20:54 2011 -0500
> > 
> >     PCI: Set PCI-E Max Payload Size on fabric
> >     
> >     On a given PCI-E fabric, each device, bridge, and root port can have a
> >     different PCI-E maximum payload size.  There is a sizable performance
> >     boost for having the largest possible maximum payload size on each PCI-E
> >     device.  However, if improperly configured, fatal bus errors can occur.
> >     Thus, it is important to ensure that PCI-E payloads sent by a device
> >     are never larger than the MPS setting of all devices on the way to the
> >     destination.
> >     
> >     This can be achieved two ways:
> >     
> >     - A conservative approach is to use the smallest common denominator of
> >       the entire tree below a root complex for every device on that fabric.
> >     
> >     This means for example that having a 128 bytes MPS USB controller on one
> >     leg of a switch will dramatically reduce performances of a video card or
> >     10GE adapter on another leg of that same switch.
> >     
> >     It also means that any hierarchy supporting hotplug slots (including
> >     expresscard or thunderbolt I suppose, dbl check that) will have to be
> >     entirely clamped to 128 bytes since we cannot predict what will be
> >     plugged into those slots, and we cannot change the MPS on a "live"
> >     system.
> >     
> >     - A more optimal way is possible, if it falls within a couple of
> >       constraints:
> >     * The top-level host bridge will never generate packets larger than the
> >       smallest TLP (or if it can be controlled independently from its MPS at
> >       least)
> >     * The device will never generate packets larger than MPS (which can be
> >       configured via MRRS)
> >     * No support of direct PCI-E <-> PCI-E transfers between devices without
> >       some additional code to specifically deal with that case
> >     
> >     Then we can use an approach that basically ignores downstream requests
> >     and focuses exclusively on upstream requests. In that case, all we need
> >     to care about is that a device MPS is no larger than its parent MPS,
> >     which allows us to keep all switches/bridges to the max MPS supported by
> >     their parent and eventually the PHB.
> >     
> >     In this case, your USB controller would no longer "starve" your 10GE
> >     Ethernet and your hotplug slots won't affect your global MPS.
> >     Additionally, the hotplugged devices themselves can be configured to a
> >     larger MPS up to the value configured in the hotplug bridge.
> >     
> >     To choose between the two available options, two PCI kernel boot args
> >     have been added to the PCI calls.  "pcie_bus_safe" will provide the
> >     former behavior, while "pcie_bus_perf" will perform the latter behavior.
> >     By default, the latter behavior is used.
> >     
> >     NOTE: due to the location of the enablement, each arch will need to add
> >     calls to this function.  This patch only enables x86.
> >     
> >     This patch includes a number of changes recommended by Benjamin
> >     Herrenschmidt.
> >     
> >     Tested-by: Jordan_Hargrave@xxxxxxxx
> >     Signed-off-by: Jon Mason <mason@xxxxxxxx>
> >     Signed-off-by: Jesse Barnes <jbarnes@xxxxxxxxxxxxxxxx>
> > 
> > 
> > 
> > > > 
> > > > 
> > > > A few seconds after boot, I get "cmd_alloc returned NULL" messages
> > > > or "hpsa 0000:0c:00.0: resetting device 0:0:0:0"
> > > > 
> > > > Usually a lot of files are corrupted, an fsck is needed, and a full
> > > > distro reinstall as well.
> > > > 
> > > > I tested on two different machines, same result.
> > > > 
> > > > Relevant hardware information :
> > > > 
> > > > 	Manufacturer: HP
> > > > 	Product Name: ProLiant BL460c G6
> > > > 	Version: I24
> > > > 	Release Date: 05/05/2011
> > > > 	Intel(R) Xeon(R) CPU E5540 @ 2.53GHz  (two sockets)
> > > > 
> > > > 0c:00.0 RAID bus controller: Hewlett-Packard Company Smart Array G6
> > > > controllers (rev 01)
> > > > 	Subsystem: Hewlett-Packard Company Smart Array P410i
> > > > 	Flags: bus master, fast devsel, latency 0, IRQ 16
> > > > 	Memory at fbc00000 (64-bit, non-prefetchable) [size=4M]
> > > > 	Memory at fbbf0000 (64-bit, non-prefetchable) [size=4K]
> > > > 	I/O ports at 4000 [size=256]
> > > > 	[virtual] Expansion ROM at e7200000 [disabled] [size=512K]
> > > > 	Capabilities: [40] Power Management version 3
> > > > 	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> > > > 	Capabilities: [70] Express Endpoint, MSI 00
> > > > 	Capabilities: [ac] MSI-X: Enable+ Count=16 Masked-
> > > > 	Capabilities: [100] Advanced Error Reporting
> > > > 	Kernel driver in use: hpsa
> > > > 
> > > > # hpacucli ctrl all show config detail
> > > > 
> > > > Smart Array P410i in Slot 0 (Embedded)
> > > >    Bus Interface: PCI
> > > >    Slot: 0
> > > >    Serial Number: 5001438006F44240
> > > >    RAID 6 (ADG) Status: Disabled
> > > >    Controller Status: OK
> > > >    Chassis Slot: 
> > > >    Hardware Revision: Rev C
> > > >    Firmware Version: 2.50
> > > >    Rebuild Priority: Medium
> > > >    Expand Priority: Medium
> > > >    Surface Scan Delay: 15 secs
> > > >    Surface Scan Mode: Idle
> > > >    Wait for Cache Room: Disabled
> > > >    Surface Analysis Inconsistency Notification: Disabled
> > > >    Post Prompt Timeout: 0 secs
> > > >    Cache Board Present: False
> > > >    Drive Write Cache: Disabled
> > > >    SATA NCQ Supported: True
> > > > 
> > > >    Array: A
> > > >       Interface Type: SATA
> > > >       Unused Space: 0 MB
> > > >       Status: OK
> > > > 
> > > > 
> > > > 
> > > >       Logical Drive: 1
> > > >          Size: 232.9 GB
> > > >          Fault Tolerance: RAID 1
> > > >          Heads: 255
> > > >          Sectors Per Track: 32
> > > >          Cylinders: 59844
> > > >          Strip Size: 128 KB
> > > >          Status: OK
> > > >          Unique Identifier: 600508B1001030364634343234300F00
> > > >          Disk Name: /dev/cciss/c0d0
> > > >          Mount Points: / 9.3 GB, /home 216.0 GB
> > > >          OS Status: LOCKED
> > > >          Logical Drive Label: A0124E845001438006F442403033
> > > >          Mirror Group 0:
> > > >             physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 250 GB, OK)
> > > >          Mirror Group 1:
> > > >             physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 250 GB, OK)
> > > > 
> > > >       physicaldrive 1I:1:1
> > > >          Port: 1I
> > > >          Box: 1
> > > >          Bay: 1
> > > >          Status: OK
> > > >          Drive Type: Data Drive
> > > >          Interface Type: SATA
> > > >          Size: 250 GB
> > > >          Firmware Revision: HPG2    
> > > >          Serial Number: K648T9C27M8E        
> > > >          Model: ATA     GJ0250EAGSQ     
> > > >          SATA NCQ Capable: True
> > > >          SATA NCQ Enabled: True
> > > >          PHY Count: 1
> > > >          PHY Transfer Rate: 3.0GBPS
> > > > 
> > > >       physicaldrive 1I:1:2
> > > >          Port: 1I
> > > >          Box: 1
> > > >          Bay: 2
> > > >          Status: OK
> > > >          Drive Type: Data Drive
> > > >          Interface Type: SATA
> > > >          Size: 250 GB
> > > >          Firmware Revision: HPG2    
> > > >          Serial Number: K648T9C27M49        
> > > >          Model: ATA     GJ0250EAGSQ     
> > > >          SATA NCQ Capable: True
> > > >          SATA NCQ Enabled: True
> > > >          PHY Count: 1
> > > >          PHY Transfer Rate: 3.0GBPS
> > > > 
> > > > 
> > > > 
> > > > 64 bit kernel, 4GB of memory
> > > > 
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html