OSD crashes regularly

Hello,

I have a pool of more than 300 OSDs, all on the same disk model (Seagate
ST1800MM0129, size 1.64 TiB).
Only one OSD crashes regularly, and I cannot identify the root cause.

Based on the output of smartctl the disk is ok.

# smartctl -a -d megaraid,1 /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               LENOVO-X
Product:              ST1800MM0129
Revision:             L2B6
Compliance:           SPC-4
User Capacity:        1,800,360,124,416 bytes [1.80 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        10500 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500bb7822cf
Serial number:        WBN0QHX80000E852944J
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon May 18 09:19:41 2020 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 68
Power on minutes since format <not available>
Current Drive Temperature:     33 C
Drive Trip Temperature:        65 C

Manufactured in week 31 of year 2018
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  21
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  709
Elements in grown defect list: 18

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3278853896        1         0  3278853897         32      83933.567          19
write:         0        0         0         0          0      24093.894           0
verify: 3080361880        0         0  3080361880          0      12630.494           0

Non-medium error count:      244

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    3761                 - [-   -    -]
# 2  Background short  Completed                   -    3737                 - [-   -    -]
# 3  Background short  Completed                   -    3713                 - [-   -    -]
# 4  Background short  Completed                   -    3689                 - [-   -    -]
# 5  Background short  Completed                   -    3665                 - [-   -    -]
# 6  Background short  Completed                   -    3641                 - [-   -    -]
# 7  Background short  Completed                   -    3617                 - [-   -    -]
# 8  Background short  Completed                   -    3593                 - [-   -    -]
# 9  Background long   Completed                   -    3569                 - [-   -    -]
#10  Background short  Completed                   -    3545                 - [-   -    -]
#11  Background short  Completed                   -    3521                 - [-   -    -]
#12  Background short  Completed                   -    3497                 - [-   -    -]
#13  Background short  Completed                   -    3473                 - [-   -    -]
#14  Background short  Completed                   -    3449                 - [-   -    -]
#15  Background short  Completed                   -    3425                 - [-   -    -]
#16  Background short  Completed                   -    3401                 - [-   -    -]
#17  Background short  Completed                   -    3377                 - [-   -    -]
#18  Background short  Completed                   -    3353                 - [-   -    -]
#19  Background short  Completed                   -    3329                 - [-   -    -]
#20  Background short  Completed                   -    3305                 - [-   -    -]

Long (extended) Self-test duration: 9459 seconds [157.7 minutes]
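
For comparing this disk against the other drives in the pool, the few fields that matter can be pulled out of the smartctl text with a short script. The following is only a minimal sketch, assuming SCSI/SAS-style `smartctl -a` output with each record on a single line (not wrapped by a mail client or pager). The function name `summarize_smart` and the dictionary keys are hypothetical, chosen for illustration; the sample values are copied from the output above.

```python
import re


def summarize_smart(text):
    """Extract a few health-related fields from SCSI-style `smartctl -a` text.

    Hypothetical helper: assumes one record per line, as smartctl prints it.
    """
    summary = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("SMART Health Status:"):
            status = line.split(":", 1)[1].strip()
            # Drop the trailing "[asc=..., ascq=...]" sense-code suffix, if present.
            summary["health"] = re.sub(r"\s*\[asc=.*\]$", "", status)
        elif line.startswith("Elements in grown defect list:"):
            summary["grown_defects"] = int(line.split(":", 1)[1])
        elif line.startswith("read:"):
            # The last column of the error counter log is "Total uncorrected errors".
            summary["read_uncorrected"] = int(line.split()[-1])
        elif line.startswith("Non-medium error count:"):
            summary["non_medium_errors"] = int(line.split(":", 1)[1])
    return summary


# Sample lines taken from the smartctl output in this post.
sample = """\
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
Elements in grown defect list: 18
read:   3278853896        1         0  3278853897         32      83933.567          19
Non-medium error count:      244
"""

summary = summarize_smart(sample)
print(summary)
```

Running the same function over the saved smartctl output of each disk would make it easy to see whether the health status, grown defect count, or uncorrected read errors of this drive differ from the rest of the pool.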

I have attached the log of the affected OSD.

THX
Thomas

I have uploaded 1 file belonging to this email:
ceph-osd.92.log.1.gz <https://we.tl/t-7DzNCDP3iZ> (578 KB), via WeTransfer


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



