Re: OSD crashes regularly

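The disk is not OK. Look at the health status in the smartctl output you posted:
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]

The same output also shows 68 newly reassigned blocks, 18 elements in the
grown defect list, and 19 uncorrected read errors, which is consistent with
a failing drive. You should replace the disk.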
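
If it helps to check the other drives in the pool for the same condition, here is a minimal Python sketch, assuming the text output of `smartctl -a` for a SAS/SCSI disk has already been captured as a string. The function name `summarize_smart` and the rule "any health status other than OK means replace" are my own assumptions for illustration; they are not part of smartmontools or Ceph.

```python
import re


def summarize_smart(text):
    """Pull a few failure indicators out of `smartctl -a` text for a SAS/SCSI disk."""

    def find(pattern, cast=str):
        # Return the first captured group, converted with `cast`, or None if absent.
        match = re.search(pattern, text, re.MULTILINE)
        return cast(match.group(1).strip()) if match else None

    summary = {
        "health": find(r"^SMART Health Status:\s*(.+)$"),
        "reassigned_blocks": find(r"^Total new blocks reassigned\s*=\s*(\d+)", int),
        "grown_defects": find(r"^Elements in grown defect list:\s*(\d+)", int),
        "non_medium_errors": find(r"^Non-medium error count:\s*(\d+)", int),
    }
    # Assumed rule of thumb: a reported status other than "OK" means the drive
    # itself predicts a failure, so it is flagged for replacement.
    summary["replace"] = summary["health"] is not None and summary["health"] != "OK"
    return summary


# Lines taken from the smartctl output quoted in this thread.
sample = (
    "SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE "
    "FAILURE [asc=5d, ascq=10]\n"
    "Total new blocks reassigned = 68\n"
    "Elements in grown defect list: 18\n"
    "Non-medium error count:      244\n"
)
report = summarize_smart(sample)
print(report)
```

The sketch only reads text, so it can be fed from saved files or from whatever collects smartctl output on each host. A health line that is missing entirely is reported as `None` and not flagged, so such disks should be checked by hand.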

On Wed, May 20, 2020 at 5:11 PM Thomas <74cmonty@xxxxxxxxx> wrote:
>
> Hello,
>
> I have a pool of more than 300 OSDs, all on the same drive model (Seagate
> ST1800MM0129, size 1.64 TiB).
> Only one OSD crashes regularly, but I cannot identify the root cause.
>
> Based on the output of smartctl, the disk is OK.
>
> # smartctl -a -d megaraid,1 /dev/sda
> smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
> Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Vendor:               LENOVO-X
> Product:              ST1800MM0129
> Revision:             L2B6
> Compliance:           SPC-4
> User Capacity:        1,800,360,124,416 bytes [1.80 TB]
> Logical block size:   512 bytes
> Physical block size:  4096 bytes
> LU is fully provisioned
> Rotation Rate:        10500 rpm
> Form Factor:          2.5 inches
> Logical Unit id:      0x5000c500bb7822cf
> Serial number:        WBN0QHX80000E852944J
> Device type:          disk
> Transport protocol:   SAS (SPL-3)
> Local Time is:        Mon May 18 09:19:41 2020 CEST
> SMART support is:     Available - device has SMART capability.
> SMART support is:     Enabled
> Temperature Warning:  Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
>
> Grown defects during certification <not available>
> Total blocks reassigned during format <not available>
> Total new blocks reassigned = 68
> Power on minutes since format <not available>
> Current Drive Temperature:     33 C
> Drive Trip Temperature:        65 C
>
> Manufactured in week 31 of year 2018
> Specified cycle count over device lifetime:  10000
> Accumulated start-stop cycles:  21
> Specified load-unload count over device lifetime:  300000
> Accumulated load-unload cycles:  709
> Elements in grown defect list: 18
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:   3278853896        1         0  3278853897         32      83933.567          19
> write:         0        0         0         0          0      24093.894           0
> verify: 3080361880        0         0  3080361880          0      12630.494           0
>
> Non-medium error count:      244
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background short  Completed                   -    3761                 - [-   -    -]
> # 2  Background short  Completed                   -    3737                 - [-   -    -]
> # 3  Background short  Completed                   -    3713                 - [-   -    -]
> # 4  Background short  Completed                   -    3689                 - [-   -    -]
> # 5  Background short  Completed                   -    3665                 - [-   -    -]
> # 6  Background short  Completed                   -    3641                 - [-   -    -]
> # 7  Background short  Completed                   -    3617                 - [-   -    -]
> # 8  Background short  Completed                   -    3593                 - [-   -    -]
> # 9  Background long   Completed                   -    3569                 - [-   -    -]
> #10  Background short  Completed                   -    3545                 - [-   -    -]
> #11  Background short  Completed                   -    3521                 - [-   -    -]
> #12  Background short  Completed                   -    3497                 - [-   -    -]
> #13  Background short  Completed                   -    3473                 - [-   -    -]
> #14  Background short  Completed                   -    3449                 - [-   -    -]
> #15  Background short  Completed                   -    3425                 - [-   -    -]
> #16  Background short  Completed                   -    3401                 - [-   -    -]
> #17  Background short  Completed                   -    3377                 - [-   -    -]
> #18  Background short  Completed                   -    3353                 - [-   -    -]
> #19  Background short  Completed                   -    3329                 - [-   -    -]
> #20  Background short  Completed                   -    3305                 - [-   -    -]
>
>
> Long (extended) Self-test duration: 9459 seconds [157.7 minutes]
>
> I have attached the log of the affected OSD.
>
> THX
> Thomas
>
> I have uploaded 1 file belonging to this email:
> ceph-osd.92.log.1.gz <https://we.tl/t-7DzNCDP3iZ> (578 KB, via WeTransfer)
> Mozilla Thunderbird <https://www.thunderbird.net> makes it easy to
> share large files via email.
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx