OSD unable to boot: takes all the memory and doesn't join the mon

Hi,
I was discussing this on the IRC channels; after several tries we found no solution. I will try to post all the information here.

I have a cluster ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
It was manually installed back in the very first versions and manually upgraded on each release. It was working fine until a power cut.
On reboot half of the OSDs were unable to boot. They eat all the memory on the host and never connect to the mon.

The OSD starts by replaying the journal as usual, and when the logging_to_monitors message appears it starts eating memory until the OOM killer kicks in.

I added a very big swap to see if it would get past this point. And it does get past it... but then it dies.

Well, it doesn't OOM any more, but it keeps going until it eventually crashes with a heartbeat error. I attach some logs...
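
In case it helps anyone reproduce this, the following is roughly how I watch where the memory goes while it grinds (just a sketch; it assumes the admin socket for osd.2 is available while it is loading):

  ceph daemon osd.2 dump_mempools        # per-mempool breakdown (osd_pglog, buffer_anon, ...)
  ceph daemon osd.2 perf dump | less     # general perf counters while it is coming up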

What we tried is to shut down everything: mon, mgr, mds, etc. Then start only the mons and the mgr and see if they connect and start correctly. That seems to work.

Then I start one problematic OSD, osd.2, with full logging enabled: http://pastie.org/p/1L0nidHL0JkA3BH1RDzLMy. It runs, but it doesn't connect to the mon.
We checked the network, jumbo frames and everything else that could affect it. I ran ceph-objectstore-tool to check and repair the PGs, and everything seems right. This osd.2 runs on an XFS disk with filestore... yes, I know I should migrate, but I haven't.
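
For reference, the checks I mean are roughly these (a sketch only; run with the OSD stopped, and the paths are the default filestore layout on my hosts):

  systemctl stop ceph-osd@2
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
      --journal-path /var/lib/ceph/osd/ceph-2/journal --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
      --journal-path /var/lib/ceph/osd/ceph-2/journal --op fsck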

The curious thing is that I have another OSD on that same host and it works: osd.4, which is bluestore. But I don't know what that has to do with it, since the problem is that osd.2 never connects, and its underlying filesystem works fine.

There are more strange things. Look at this:
root@red-compute:/home/gaguilar# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                STATUS  REWEIGHT  PRI-AFF
-1         8.74597  root default                                      
-2         3.81898      host blue-compute                            
 0    hdd  1.00000          osd.0              down         0  1.00000
 2    hdd  1.00000          osd.2              down         0  1.00000
 4    hdd  1.81898          osd.4                up         0  1.00000
-5         2.36800      host cadet-compute                            
 1    hdd  0.03000          osd.1              down   0.90002  1.00000
 5    hdd  0.03999          osd.5              down   0.90002  1.00000
 7    hdd  0.03000          osd.7              down   0.90002  1.00000
11    hdd  0.45000          osd.11               up   1.00000  1.00000
13    hdd  1.81799          osd.13             down   1.00000  1.00000
-6         0.45000      host cobalt-compute                          
12    hdd  0.45000          osd.12               up   1.00000  1.00000
-3         2.10899      host red-compute                              
 6    hdd  0.90900          osd.6              down         0  1.00000
 9    hdd  0.90900          osd.9              down   1.00000  1.00000
10    hdd  0.29099          osd.10             down   1.00000  1.00000

It reports 3 OSDs up: 4, 11 and 12. But only osd.4 is really up; it's the one I run my tests on on blue-compute, since it works!
The other two are actually down, and their status never gets updated. Why?
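
In case someone wants to see exactly what I mean, this is roughly how I check and poke at the stale entries (a sketch):

  ceph osd dump | grep -E 'osd\.(11|12) '   # what the mons think: up/down flags and addresses
  ceph osd down 11 12                       # mark them down by hand; a really-alive OSD would re-assert itself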

osd.4 tries to ping them because they seem to be up, but it fails. My ceph.conf doesn't have anything weird:

[global]
fsid = 9028f4da-0d77-462b-be9b-dbdf7fa57771
#mon_initial_members = blue-compute, red-compute, cadet-compute
mon_initial_members = red-compute, cadet-compute
mon_host = [v2:172.16.0.100:3300,v1:172.16.0.100:6789], 172.16.99.10, 172.16.0.119
#mon_host = 172.16.0.119, 172.16.0.100, 172.16.99.10
#mgr_host = 172.16.0.119
#mon_host = 172.16.0.100, 172.16.99.10
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128
osd_pool_default_size = 2  # Write an object 2 times (2 replicas).
osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state.
public_network = 172.16.0.0/16

#osd_recovery_max_active = 9
#osd_max_backfills = 3
#osd_recovery_op_priority = 3


mon_data_avail_warn = 10
[mon]
caps_mon = "allow *"

[osd]
#bluestore cache autotune=1
osd max write size = 512
osd client message size cap = 1024
osd op threads = 1
#osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
#filestore xattr use omap = true                     # default: false; store xattrs in omap (needed for ext4; XFS and btrfs can also use it)
#filestore min sync interval = 10                    # default: 0.1; minimum interval (seconds) between journal-to-data-disk syncs
#filestore max sync interval = 15                    # default: 5; maximum interval (seconds) between journal-to-data-disk syncs
#filestore queue max ops = 2500                      # default: 500; maximum number of operations queued for the data disk
#filestore queue max bytes = 1048576000              # default: 100 MB; maximum number of bytes queued for the data disk
#filestore queue committing max ops = 50000          # default: 500; maximum number of operations the data disk can commit at once
#filestore queue committing max bytes = 10485760000  # default: 100 MB; maximum number of bytes the data disk can commit at once
#filestore split multiple = 8                        # default: 2; controls how many files a subdirectory holds before it is split
#filestore merge threshold = 40                      # default: 10; minimum number of files in a subdirectory before it is merged back
#filestore fd cache size = 1024                      # default: 128; object file handle cache size
filestore op threads = 1                             # default: 2; concurrent filesystem operation threads

[osd]
osd max pg log entries = 50
osd min pg log entries = 50
osd_pg_log_dups_tracked = 50
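
By the way, since that mon_host line mixes the bracketed v2/v1 form with bare IPs, this is roughly how I double-check what a daemon actually ends up resolving (just a sketch; the daemon names are from my setup):

  ceph daemon osd.2 config get mon_host    # what the running daemon actually parsed
  ceph mon dump                            # the monmap the cluster itself is using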



I also have other hosts with some disks up and others that won't come up. The whole status of the cluster is garbage.
I'm truly desperate. This is what I tried:
  1. Starting the OSDs manually
  2. Running ceph-objectstore-tool fix and repair operations
  3. Applying the journal
  4. Removing SELinux
  5. Telnet to the mon from the OSD hosts (see the sketch below)
  6. Starting different OSDs... some work
  7. Adding a new mgr
  8. Downgrading the kernel
  9. Upgrading the kernel
  10. Trying different pacific point releases
  11. Running two different OSDs on the same machine to see if they communicate: osd.4 and osd.2 (osd.2 doesn't work)
  12. Running the OSDs as root
Nothing works.
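
The telnet check from item 5 was essentially this (a sketch; the same test goes against every mon address):

  nc -vz 172.16.0.100 3300     # msgr v2 port on the mon
  nc -vz 172.16.0.100 6789     # legacy msgr v1 port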


I see a lot of:
2021-12-30T19:17:12.022+0000 7f114647b640 20 osd.2 1492194 tick last_purged_snaps_scrub 2021-12-30T11:58:25.837843+0000 next 2021-12-31T11:58:25.837843+0000
2021-12-30T19:17:12.030+0000 7f1121b89640 10 --2- [v2:172.16.0.119:6808/11420,v1:172.16.0.119:6809/11420] >> [v2:172.16.0.100:3300/0,v1:172.16.0.100:6789/0] conn(0x5594e3af6400 0x5594e3d57400 secure :-1 s=READY pgs=12286 cs=0 l=1 rev1=1 rx=0x5595a2aa88d0 tx=0x5594ecdcb380).send_keepalive
2021-12-30T19:17:12.030+0000 7f114a6e3640 10 -- [v2:172.16.0.119:6808/11420,v1:172.16.0.119:6809/11420] >> [v2:172.16.0.100:3300/0,v1:172.16.0.100:6789/0] conn(0x5594e3af6400 msgr2=0x5594e3d57400 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).handle_write
2021-12-30T19:17:12.030+0000 7f114a6e3640 10 --2- [v2:172.16.0.119:6808/11420,v1:172.16.0.119:6809/11420] >> [v2:172.16.0.100:3300/0,v1:172.16.0.100:6789/0] conn(0x5594e3af6400 0x5594e3d57400 secure :-1 s=READY pgs=12286 cs=0 l=1 rev1=1 rx=0x5595a2aa88d0 tx=0x5594ecdcb380).write_event
2021-12-30T19:17:12.030+0000 7f114a6e3640 10 --2- [v2:172.16.0.119:6808/11420,v1:172.16.0.119:6809/11420] >> [v2:172.16.0.100:3300/0,v1:172.16.0.100:6789/0] conn(0x5594e3af6400 0x5594e3d57400 secure :-1 s=READY pgs=12286 cs=0 l=1 rev1=1 rx=0x5595a2aa88d0 tx=0x5594ecdcb380).write_event appending keepalive
2021-12-30T19:17:12.030+0000 7f114a6e3640 10 -- [v2:172.16.0.119:6808/11420,v1:172.16.0.119:6809/11420] >> [v2:172.16.0.100:3300/0,v1:172.16.0.100:6789/0] conn(0x5594e3af6400 msgr2=0x5594e3d57400 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 96 remaining bytes 0
2021-12-30T19:17:12.034+0000 7f114a6e3640 10 --2- [v2:172.16.0.119:6808/11420,v1:172.16.0.119:6809/11420] >> [v2:172.16.0.100:3300/0,v1:172.16.0.100:6789/0] conn(0x5594e3af6400 0x5594e3d57400 secure :-1 s=READY pgs=12286 cs=0 l=1 rev1=1 rx=0x5595a2aa88d0 tx=0x5594ecdcb380).handle_read_frame_dispatch tag=19

So it seems it is connected to the mon; why doesn't it get marked up?
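
What I still want to verify is where the boot handshake stalls; roughly something like this should show it (a sketch; the daemon names are from my setup):

  ceph daemon osd.2 status            # the OSD's internal state (preboot/booting/active) and map epochs
  ceph daemon mon.red-compute ops     # any stuck ops (e.g. an osd_boot message) on the mon side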

At the end of the log I can see a HeartBeat crash.
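
If the crash module captured it, it should also be visible with something like this (a sketch; the id is whatever "ceph crash ls" prints):

  ceph crash ls
  ceph crash info <crash-id>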

So, what can I try to recover the cluster? I've run out of ideas...




