Re: xfs corruption

This functionality is common on RAID controllers in combination with HCL-certified drives.

This usually means you can't rely on it working unless you stick to the exact certified combination, which is impossible in practice.
For example, LSI controllers do this if you get the right SSDs, but the right SSD also needs the "right" firmware, which is usually very old and nearly impossible to find... or maybe only if you pay a 100% premium by buying the drives directly with server hardware. Even then, the replacement drives you get may be a different revision after some time, and you'll have to explain to the vendor that you bought that specific combination because of the HCL.
Good luck getting support from a server vendor when you keep old, buggy firmware on your drives just so the HBA works correctly :-)

The good news is that modern drives don't really need TRIM. It's better to concentrate on the higher layers, where it's much more useful for thin provisioning and oversubscribing of disk space; the drives themselves don't gain much.
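If you do want discard at the higher layers, a periodic fstrim is usually preferable to a continuous discard. A minimal sketch (the mount point is a placeholder; the root-only commands are shown commented out):

```shell
# Check whether block devices advertise discard support at all; non-zero
# DISC-GRAN / DISC-MAX columns mean TRIM/UNMAP is reported as available.
lsblk --discard

# Trim all free space on a mounted filesystem once (needs root); usually
# preferable to mounting with "-o discard", which issues TRIMs on every
# delete:
#   fstrim -v /mnt/data
# Many distributions ship a weekly timer for exactly this:
#   systemctl enable --now fstrim.timer
```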

There's one scenario where it is useful, and that's when you deliberately get a lower-grade drive (TBW/DWPD) and need it to survive for longer than the rated amount of written data. Underprovisioning is quite useful then, and you need to either TRIM or secure-erase the drive when you prepare it. But unless you're on a tight budget I'd say it's not worth it and you should just get the proper drive...

Jan



> On 07 Mar 2016, at 09:21, Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
> 
> 
> Unfortunately, you will have to follow up with the hardware RAID card vendors to see what commands their firmware handles.
> 
> Good luck!
> 
> Ric
> 
> 
> On 03/07/2016 01:37 PM, Ferhat Ozkasgarli wrote:
>> I am always forgetting this reply-all thing.
>> 
>> /RAID5 and RAID10 (or other raid levels) are a property of the block devices. XFS, ext4, etc can pass down those commands to the firmware on the card and it is up to the firmware to propagate the command on to the backend drives./
>> 
>> You mean I can get a hardware raid card that can pass discard and trim commands to the disks in a raid 10 array?
>> 
>> Can you please suggest such a raid card?
>> 
>> We are on the verge of deciding between hardware raid and software raid, because our OpenStack cluster uses all-SSD storage (local raid 10) and my manager wants to utilize hardware raid with SSD disks.
>> 
>> 
>> 
>> On Mon, Mar 7, 2016 at 10:04 AM, Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
>> 
>>    You are right that some cards might not send those commands on to the
>>    backend storage, but spinning disks don't usually implement either trim
>>    or discard (SSDs do, though).
>> 
>>    XFS, ext4, etc can pass down those commands to the firmware on the card
>>    and it is up to the firmware to propagate the command on to the backend
>>    drives.
>> 
>>    The file system itself tracks allocation internally, so you will
>>    benefit from being able to reuse those blocks after a trim command
>>    (even without a raid card of any kind).
>> 
>>    Regards,
>> 
>>    Ric
>> 
>> 
>>    On 03/07/2016 12:58 PM, Ferhat Ozkasgarli wrote:
>> 
>>        Ric, you mean a Raid 0 environment, right?
>> 
>>        If you use raid 5, raid 10, or some other more complex raid
>>        configuration, most of the physical disks' abilities (trim,
>>        discard, etc.) vanish.
>> 
>>        Only a handful of hardware raid cards are able to pass trim and
>>        discard commands to the physical disks, and only if the raid
>>        configuration is raid 0 or raid 1.
>> 
>>        On Mon, Mar 7, 2016 at 9:21 AM, Ric Wheeler
>>        <rwheeler@xxxxxxxxxx> wrote:
>> 
>> 
>> 
>>            It is perfectly reasonable and common to use hardware RAID cards in
>>            writeback mode under XFS (and under Ceph) if you configure them
>>        properly.
>> 
>>            The key thing is that with the writeback cache enabled, you
>>            need to make sure that the S-ATA drives' own write cache is
>>            disabled. Also make sure that your file system is mounted with
>>            "barrier" enabled.
>> 
>>            To check the backend write cache state on the drives, you
>>            often need to use RAID-card-specific tools to query and set
>>            it.
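That check can be sketched like this; the device name and the vendor CLIs below are assumptions, not exact syntax for any particular card, and the privileged commands are shown commented out:

```shell
# Directly attached S-ATA drive: query / disable the volatile write cache
# (needs root and real hardware, hence commented out):
#   hdparm -W  /dev/sda     # "write-caching = 1 (on)" means still enabled
#   hdparm -W0 /dev/sda     # disable it under a battery-backed controller cache

# Behind a RAID controller you need the vendor tool instead, for example
# (illustrative only):
#   storcli /c0 /vall show all      # Broadcom/LSI
#   ssacli ctrl all show config     # HPE

# Confirm the mount options actually in effect; "nobarrier" here would be
# a red flag on an old kernel (barriers are the default on modern ones):
grep -E 'xfs|ext4' /proc/mounts || true
```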
>> 
>>            Regards,
>> 
>>            Ric
>> 
>> 
>> 
>> 
>>            On 02/27/2016 07:20 AM, fangchen sun wrote:
>> 
>> 
>>                Thank you for your response!
>> 
>>                All my hosts have raid cards. Some raid cards are in
>>                pass-through mode, and the others are in write-back mode.
>>                I will set all raid cards to pass-through mode and observe
>>                for a period of time.
>> 
>> 
>>                Best Regards
>>                sunspot
>> 
>> 
>>                2016-02-25 20:07 GMT+08:00 Ferhat Ozkasgarli
>>                <ozkasgarli@xxxxxxxxx>:
>> 
>>                    This has happened to me before, but in a virtual
>>                    machine environment.
>> 
>>                    The VM was KVM and the storage was RBD. My problem
>>                    was a bad network cable.
>> 
>>                    You should check the following details:
>> 
>>                    1-) Do you use any kind of hardware raid configuration?
>>                    (Raid 0, 5 or 10)
>> 
>>                    Ceph does not work well on hardware raid systems. You
>>                    should use raid cards in HBA (non-raid) mode and let
>>                    the card pass the disks through.
>> 
>>                    2-) Check your network connections
>> 
>>                    It may seem an obvious thing to check, but believe me,
>>                    the network is one of the top-rated culprits in Ceph
>>                    environments.
>> 
>>                    3-) If you are using SSD disks, make sure you use a
>>                    non-raid configuration.
>> 
>> 
>> 
>>                    On Tue, Feb 23, 2016 at 10:55 PM, fangchen sun
>>                    <sunspot0105@xxxxxxxxx> wrote:
>> 
>>                        Dear all:
>> 
>>                        I have a ceph object storage cluster with 143 OSDs
>>                        and 7 radosgw nodes, and chose XFS as the
>>                        underlying file system.
>>                        I recently ran into a problem where an osd is
>>                        sometimes marked down when the function
>>                        "chain_setxattr()" returns -117 (EUCLEAN). I can
>>                        only unmount the disk and repair it with
>>                        "xfs_repair".
>> 
>>                        os: centos 6.5
>>                        kernel version: 2.6.32
>> 
>>                        the log from the dmesg command:
>>                        [41796028.532225] Pid: 1438740, comm: ceph-osd Not tainted 2.6.32-925.431.23.3.letv.el6.x86_64 #1
>>                        [41796028.532227] Call Trace:
>>                        [41796028.532255] [<ffffffffa01e1e5f>] ? xfs_error_report+0x3f/0x50 [xfs]
>>                        [41796028.532276] [<ffffffffa01d506a>] ? xfs_da_read_buf+0x2a/0x30 [xfs]
>>                        [41796028.532296] [<ffffffffa01e1ece>] ? xfs_corruption_error+0x5e/0x90 [xfs]
>>                        [41796028.532316] [<ffffffffa01d4f4c>] ? xfs_da_do_buf+0x6cc/0x770 [xfs]
>>                        [41796028.532335] [<ffffffffa01d506a>] ? xfs_da_read_buf+0x2a/0x30 [xfs]
>>                        [41796028.532359] [<ffffffffa0206fc7>] ? kmem_zone_alloc+0x77/0xf0 [xfs]
>>                        [41796028.532380] [<ffffffffa01d506a>] ? xfs_da_read_buf+0x2a/0x30 [xfs]
>>                        [41796028.532399] [<ffffffffa01bc481>] ? xfs_attr_leaf_addname+0x61/0x3d0 [xfs]
>>                        [41796028.532426] [<ffffffffa01bc481>] ? xfs_attr_leaf_addname+0x61/0x3d0 [xfs]
>>                        [41796028.532455] [<ffffffffa01ff187>] ? xfs_trans_add_item+0x57/0x70 [xfs]
>>                        [41796028.532476] [<ffffffffa01cc208>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
>>                        [41796028.532495] [<ffffffffa01bcbb4>] ? xfs_attr_set_int+0x3c4/0x510 [xfs]
>>                        [41796028.532517] [<ffffffffa01d4f5b>] ? xfs_da_do_buf+0x6db/0x770 [xfs]
>>                        [41796028.532536] [<ffffffffa01bcd81>] ? xfs_attr_set+0x81/0x90 [xfs]
>>                        [41796028.532560] [<ffffffffa0216cc3>] ? __xfs_xattr_set+0x43/0x60 [xfs]
>>                        [41796028.532584] [<ffffffffa0216d31>] ? xfs_xattr_user_set+0x11/0x20 [xfs]
>>                        [41796028.532592] [<ffffffff811aee92>] ? generic_setxattr+0xa2/0xb0
>>                        [41796028.532596] [<ffffffff811b134e>] ? __vfs_setxattr_noperm+0x4e/0x160
>>                        [41796028.532600] [<ffffffff81196b77>] ? inode_permission+0xa7/0x100
>>                        [41796028.532604] [<ffffffff811b151c>] ? vfs_setxattr+0xbc/0xc0
>>                        [41796028.532607] [<ffffffff811b15f0>] ? setxattr+0xd0/0x150
>>                        [41796028.532612] [<ffffffff8105af80>] ? __dequeue_entity+0x30/0x50
>>                        [41796028.532617] [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
>>                        [41796028.532621] [<ffffffff8118aec0>] ? __sb_start_write+0x80/0x120
>>                        [41796028.532626] [<ffffffff8152912e>] ? thread_return+0x4e/0x760
>>                        [41796028.532630] [<ffffffff811b171d>] ? sys_fsetxattr+0xad/0xd0
>>                        [41796028.532633] [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
>>                        [41796028.532636] XFS (sdi1): Corruption detected. Unmount and run xfs_repair
>> 
>>                        Any comments will be much appreciated!
>> 
>>                        Best Regards!
>>                        sunspot
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


