Re: Ceph octopus version cluster not starting


 



Thanks Frank.

Figured out the issue was NTP: the nodes were not able to reach the NTP server,
which caused the NTP service to fail.

It looks like the Ceph systemd services have a dependency on the NTP service being up.
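
In case it helps anyone else, this is roughly how I checked it (a sketch only; my
nodes run chronyd and the mon id matches the short hostname, so adjust unit and
service names to your setup):

systemctl show ceph-mon@$(hostname -s).service -p After -p Wants | grep -i time-sync
systemctl status chronyd        # or ntpd / systemd-timesyncd
timedatectl                     # shows "System clock synchronized: yes" once NTP is reachable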

On Mon, Sep 16, 2024 at 4:12 PM Frank Schilder <frans@xxxxxx> wrote:

> I think this output is normal and I guess the MON is up? If so, I would
> start another mon in the same way on another host. If the monmap is correct
> with network etc. they should start talking to each other. If you have 3
> mons in the cluster, you should get quorum.
>
> On the host where the mon is running, you can also ask for the cluster
> status via the mon-admin socket. You should get a response that includes
> "out of quorum" or the like. Once you have the second mon up, you can start
> checking that they form quorum.
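>
> For example (assuming the default admin socket path and a mon id matching the
> short hostname; adjust to your naming):
>
> ceph daemon mon.$(hostname -s) mon_status
> ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok quorum_status
>
> The "state" field will show probing, electing, leader or peon depending on
> where the mon is.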
>
> If this works, then I would conclude that your cluster is probably OK on
> disk and the issue is somewhere with systemd.
>
> You shouldn't run too much manually. I usually use this to confirm that the
> daemon can start and that its data store on disk is healthy. After that, I start
> looking for what prevents startup. In your case it doesn't seem to be
> ceph daemons crashing, and that's what this check is mainly for. You could
> maybe try one MGR and then one OSD. If these come up and join the cluster,
> it's something outside Ceph.
>
> For your systemd debugging, add at least the option "-f" to the daemon's
> command lines to force traditional log files to be written.
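>
> For example, for a mon (a sketch; the names are placeholders):
>
> /usr/bin/ceph-mon -f --cluster ceph --id MON-NAME --setuser ceph --setgroup ceph
>
> -f keeps the daemon in the foreground but still writes to the usual log file
> under /var/log/ceph, unlike -d, which logs to stderr.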
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Amudhan P <amudhan83@xxxxxxxxx>
> Sent: Monday, September 16, 2024 12:18 PM
> To: Frank Schilder
> Cc: Eugen Block; ceph-users@xxxxxxx
> Subject: Re:  Re: Ceph octopus version cluster not starting
>
> Frank,
>
> With the manual command I was able to start the mon and can see logs in the log
> file, and I don't find any issue in the logs except the lines below.
> Should I stop the manual command and try to start the mon service from systemd,
> or follow the same approach on all mon nodes?
>
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: [db/version_set.cc:3757]
> Recovered from manifest file:/var/lib/ceph/mon/node/store.db/MANIFEST-4328236
> succeeded, manifest_file_number is 4328236, next_file_number is 4328238,
> last_sequence is 1782572963, log_number is 4328223, prev_log_number is 0,
> max_column_family is 0, min_log_number_to_keep is 0
>
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: [db/version_set.cc:3766]
> Column family [default] (ID 0), log number is 4328223
>
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1726481214623513, "job": 1, "event": "recovery_started",
> "log_files": [4328237]}
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: [db/db_impl_open.cc:583]
> Recovering log #4328237 mode 2
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: [db/version_set.cc:3036]
> Creating manifest 4328239
>
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1726481214625473, "job": 1, "event": "recovery_finished"}
> 2024-09-16T15:36:54.628+0530 7f5783d1e5c0  4 rocksdb: DB pointer 0x561bb7e90000
>
>
>
> On Mon, Sep 16, 2024 at 2:22 PM Frank Schilder <frans@xxxxxx> wrote:
> Hi. When I have issues like this, what sometimes helps is to start a
> daemon manually (not systemctl or anything like that). Make sure no
> ceph-mon is running on the host:
>
> ps -eo cmd | grep ceph-mon
>
> and start a ceph-mon manually with a command like this (make sure the
> binary is the correct version):
>
> /usr/bin/ceph-mon --cluster ceph --setuser ceph --setgroup ceph
> --foreground -i MON-NAME --mon-data /var/lib/ceph/mon/STORE --public-addr
> MON-IP
>
> Depending on your debug settings, this command outputs a bit on startup.
> If your settings in ceph.conf are 0/0, I think you can override this on the
> command line. It might be useful to set the option "-d" (debug mode with
> log to stderr) on the command line as well. With defaults it will talk at
> least about opening the store and then just wait, or complain that there
> are no peers. That is a good sign.
>
> If you got one MON running, start another one on another host and so on
> until you have enough up for quorum. Then you can start querying the MONs
> what their problem is.
>
> If none of this works, the output of the manual command, maybe with higher
> debug settings on the command line, should be helpful.
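>
> Something along these lines (the debug values are just a suggestion):
>
> /usr/bin/ceph-mon -d --debug_mon 10 --debug_ms 1 --cluster ceph -i MON-NAME
> --mon-data /var/lib/ceph/mon/STORE --public-addr MON-IP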
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Amudhan P <amudhan83@xxxxxxxxx>
> Sent: Monday, September 16, 2024 10:36 AM
> To: Eugen Block
> Cc: ceph-users@xxxxxxx
> Subject:  Re: Ceph octopus version cluster not starting
>
> No, I don't use cephadm, and I have enough space for log storage.
>
> When I try to start the mon service on any of the nodes, it just keeps waiting
> to complete, without any error message in stdout or in the log file.
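>
> (For context, "keeps waiting" means roughly this on my side; node1 is just a
> placeholder for the mon id:
>
> systemctl start ceph-mon@node1              # never returns
> journalctl -u ceph-mon@node1 -e             # nothing new
> tail -f /var/log/ceph/ceph-mon.node1.log    # silent
> )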
>
> On Mon, Sep 16, 2024 at 1:21 PM Eugen Block <eblock@xxxxxx> wrote:
>
> > Hi,
> >
> > I would focus on the MONs first. If they don't start, your cluster is
> > not usable. It doesn't look like you use cephadm, but please confirm.
> > Check if the nodes are running out of disk space, maybe that's why
> > they don't log anything and fail to start.
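> >
> > For example (default paths; adjust if yours differ):
> >
> > df -h / /var/lib/ceph /var/log/ceph
> > df -i /var/log/ceph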
> >
> >
> > Quoting Amudhan P <amudhan83@xxxxxxxxx>:
> >
> > > Hi,
> > >
> > > Recently I added one disk to the Ceph cluster using "ceph-volume lvm create
> > > --data /dev/sdX", but the new OSD didn't start. After a while, the OSD
> > > services on the other nodes also stopped, so I restarted all nodes in the
> > > cluster.
> > > Now, after the restart, the MON, MDS, MGR and OSD services are not starting.
> > > I couldn't find any new logs either; after the restart it is totally silent
> > > on all nodes.
> > > I could find some logs in the ceph-volume service.
> > >
> > >
> > > Error in the ceph-volume logs:
> > >
> > > [2024-09-15 23:38:15,080][ceph_volume.process][INFO  ] stderr Running
> > > command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-5
> > > --> Executable selinuxenabled not in PATH:
> > > /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> > > Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-5
> > > Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir
> > > --dev /dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e
> > > --path /var/lib/ceph/osd/ceph-5 --no-mon-config
> > >  stderr: failed to read label for
> > > /dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e:
> > > (2) No such file or directory
> > > 2024-09-15T23:38:15.059+0530 7fe7767c8100 -1
> > > bluestore(/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e)
> > > _read_bdev_label failed to open
> > > /dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e:
> > > (2) No such file or directory
> > > -->  RuntimeError: command returned non-zero exit status: 1
> > > [2024-09-15 23:38:15,084][ceph_volume.process][INFO  ] stderr Running
> > > command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
> > > --> Executable selinuxenabled not in PATH:
> > > /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> > > Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
> > > Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir
> > > --dev /dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988
> > > --path /var/lib/ceph/osd/ceph-2 --no-mon-config
> > >  stderr: failed to read label for
> > > /dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988:
> > > (2) No such file or directory
> > >
> > > But I can see that the path
> > > "/dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988"
> > > is valid, and listing the directory works.
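> > >
> > > (For reference, I'm verifying the path with something along these lines;
> > > the exact commands are only an example:
> > >
> > > ls -lL /dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988
> > > lvs -o lv_name,vg_name,lv_path,lv_active | grep osd-block
> > > )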
> > >
> > > Not sure how to proceed or where to start. Any idea or suggestion?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



