Fwd: Can not activate some OSDs after upgrade (bad crc on label)

Hi Cephers,

Any idea about this?

Best regards,
Huseyin Cotuk
hcotuk@xxxxxxxxx

Begin forwarded message:

From: Huseyin Cotuk <hcotuk@xxxxxxxxx>
Subject: Can not activate some OSDs after upgrade (bad crc on label)
Date: 19 December 2023 at 14:09:20 GMT+3
To: ceph-users@xxxxxxx

Hello Cephers,

I have two identical Ceph clusters with 32 OSDs each, running radosgw with EC. Both were running Octopus on Ubuntu 20.04.

On one of these clusters, I upgraded the OS to Ubuntu 22.04 and Ceph to Quincy 17.2.6. This cluster completed the process without any issue and works as expected.

On the second cluster, I followed the same procedure and upgraded it as well. After the upgrade, 9 of the 32 OSDs cannot be activated. AFAIU, the labels of these OSDs cannot be read. The ceph-volume lvm activate {osd_id} {osd_fsid} command fails as below:

 stderr: failed to read label for /dev/ceph-block-13/block-13: (5) Input/output error
 stderr: 2023-12-19T11:46:25.310+0300 7f088cd7ea80 -1 bluestore(/dev/ceph-block-13/block-13) _read_bdev_label bad crc on label, expected 2340927273 != actual 2067505886
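
For reference, the exact invocation for osd.13 (with the osd id and osd fsid taken from the ceph-volume lvm list output further below) would be:

# ceph-volume lvm activate 13 c9ee3ef6-73d7-4029-9cd6-086cc95d2f27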

All ceph-bluestore-tool and ceph-objectstore-tool commands fail with the same message, so I cannot try repair, fsck, or migrate.

# ceph-bluestore-tool  repair --deep yes --path /var/lib/ceph/osd/ceph-13/
failed to load os-type: (2) No such file or directory
2023-12-19T13:57:06.551+0300 7f39b1635a80 -1 bluestore(/var/lib/ceph/osd/ceph-13/block) _read_bdev_label bad crc on label, expected 2340927273 != actual 2067505886

I also tried show-label with ceph-bluestore-tool, without success.

# ceph-bluestore-tool show-label --dev /dev/ceph-block-13/block-13
unable to read label for /dev/ceph-block-13/block-13: (5) Input/output error
2023-12-19T14:01:19.668+0300 7fdcdd111a80 -1 bluestore(/dev/ceph-block-13/block-13) _read_bdev_label bad crc on label, expected 2340927273 != actual 2067505886
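
Presumably the label lives in the first 4 KiB of the device (the region _read_bdev_label checks), so the raw bytes can also be dumped for manual inspection; on a healthy OSD the dump should start with the ASCII string "bluestore block device" followed by the osd fsid:

# dd if=/dev/ceph-block-13/block-13 bs=4096 count=1 2>/dev/null | hexdump -C | head -20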

I can get the information of all failed OSDs, including osd_fsid and block_uuid, via ceph-volume lvm list, as shown below.

====== osd.13 ======

  [block]       /dev/ceph-block-13/block-13

      block device              /dev/ceph-block-13/block-13
      block uuid                jFaTba-ln5r-muQd-7Ef9-3tWe-JwvO-qW9nqi
      cephx lockbox secret      
      cluster fsid              4e7e7d1c-22db-49c7-9f24-5a75cd3a3b9f
      cluster name              ceph
      crush device class        None
      encrypted                 0
      osd fsid                  c9ee3ef6-73d7-4029-9cd6-086cc95d2f27
      osd id                    13
      osdspec affinity          
      type                      block
      vdo                       0
      devices                   /dev/mapper/mpathb
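
As far as I know, ceph-volume lvm list reads this metadata from the LVM tags on the LV rather than from the bluestore label itself, which is presumably why it still works. The tags can also be checked directly:

# lvs -o lv_name,lv_tags ceph-block-13/block-13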

All VGs and LVs look healthy.

# lvdisplay ceph-block-13/block-13
  --- Logical volume ---
  LV Path                /dev/ceph-block-13/block-13
  LV Name                block-13
  VG Name                ceph-block-13
  LV UUID                jFaTba-ln5r-muQd-7Ef9-3tWe-JwvO-qW9nqi
  LV Write Access        read/write
  LV Creation host, time ank-backup01, 2023-11-29 10:41:53 +0300
  LV Status              available
  # open                 0
  LV Size                <7.28 TiB
  Current LE             1907721
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:33

This is a single-node cluster running only radosgw. The environment is as follows:

# ceph -v 
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

# lsb_release -a 
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

# ceph osd crush rule dump  
[
    {
        "rule_id": 0,
        "rule_name": "osd_replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -2,
                "item_name": "default~hdd"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "osd"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "default.rgw.buckets.data",
        "type": 3,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_indep",
                "num": 0,
                "type": "osd"
            },
            {
                "op": "emit"
            }
        ]
    }
]

ID  CLASS  WEIGHT     TYPE NAME              STATUS     REWEIGHT  PRI-AFF
-1         226.29962  root default                                       
-3         226.29962      host ank-backup01                              
 0    hdd    7.29999          osd.0                 up   1.00000  1.00000
 1    hdd    7.29999          osd.1                 up   1.00000  1.00000
 2    hdd    7.29999          osd.2                 up   1.00000  1.00000
 3    hdd    7.29999          osd.3                 up   1.00000  1.00000
 4    hdd    7.29999          osd.4                 up   1.00000  1.00000
 5    hdd    7.29999          osd.5                 up   1.00000  1.00000
 6    hdd    7.29999          osd.6                 up   1.00000  1.00000
 7    hdd    7.29999          osd.7                 up   1.00000  1.00000
 8    hdd    7.29999          osd.8                 up   1.00000  1.00000
 9    hdd    7.29999          osd.9                 up   1.00000  1.00000
10    hdd    7.29999          osd.10                up   1.00000  1.00000
11    hdd    7.29999          osd.11                up   1.00000  1.00000
12    hdd    7.29999          osd.12              down         0  1.00000
13    hdd    7.29999          osd.13              down         0  1.00000
14    hdd    7.29999          osd.14              down         0  1.00000
15    hdd    7.29999          osd.15              down         0  1.00000
16    hdd    7.29999          osd.16              down         0  1.00000
17    hdd    7.29999          osd.17              down         0  1.00000
18    hdd    7.29999          osd.18              down         0  1.00000
19    hdd    7.29999          osd.19              down         0  1.00000
20    hdd    7.29999          osd.20              down         0  1.00000
21    hdd    7.29999          osd.21                up   1.00000  1.00000
22    hdd    7.29999          osd.22                up   1.00000  1.00000
23    hdd    7.29999          osd.23                up   1.00000  1.00000
24    hdd    7.29999          osd.24                up   1.00000  1.00000
25    hdd    7.29999          osd.25                up   1.00000  1.00000
26    hdd    7.29999          osd.26                up   1.00000  1.00000
27    hdd    7.29999          osd.27                up   1.00000  1.00000
28    hdd    7.29999          osd.28                up   1.00000  1.00000
29    hdd    7.29999          osd.29                up   1.00000  1.00000
30    hdd    7.29999          osd.30                up   1.00000  1.00000
31    hdd    7.29999          osd.31                up   1.00000  1.00000

Does anybody have any idea why the labels of these OSDs cannot be read? Any help would be appreciated.

Best Regards,
Huseyin Cotuk
hcotuk@xxxxxxxxx





_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
