Re: Ceph octopus version cluster not starting

No, there wasn't any error message in systemd; it was just silent, even after
an hour.
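
In case it helps anyone who hits the same silent hang, these are the kinds of
checks that usually show where a unit is stuck. Only a rough sketch, and the
unit name assumes a plain package-based (non-cephadm) install:

systemctl status ceph-mon@$(hostname -s)    # current state plus the last journal lines
journalctl -b -u ceph-mon@$(hostname -s)    # everything the unit logged since boot
systemctl list-jobs                         # jobs still queued or waiting on a dependency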

On Mon, Sep 16, 2024 at 10:02 PM Frank Schilder <frans@xxxxxx> wrote:

> Hi Amudhan,
>
> great that you figured that out. Does systemd not output an error in that
> case? I would expect an error message. On our systems systemd is quite
> chatty when a unit fails.
>
> You probably still need to figure out why your new OSD took everything
> down over time. Maybe create a new case if this happens again.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Amudhan P <amudhan83@xxxxxxxxx>
> Sent: Monday, September 16, 2024 6:19 PM
> To: Frank Schilder
> Cc: Eugen Block; ceph-users@xxxxxxx
> Subject: Re:  Re: Ceph octopus version cluster not starting
>
> Thanks Frank.
>
> Figured out the issue: it was NTP. The nodes were not able to reach the NTP
> server, which caused the NTP service to fail.
>
> It looks like the Ceph systemd units have a dependency on the NTP service
> status.
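>
> A quick way to confirm this kind of ordering hang (a rough sketch; the unit
> name and the exact targets it orders after depend on the Ceph packages in
> use):
>
> systemctl cat ceph-mon@MON-NAME.service | grep -iE 'After|Wants'   # is time-sync.target listed?
> timedatectl status                                                 # is the clock actually synchronized?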
>
> On Mon, Sep 16, 2024 at 4:12 PM Frank Schilder <frans@xxxxxx> wrote:
> I think this output is normal and I guess the MON is up? If so, I would
> start another mon in the same way on another host. If the monmap is correct
> with network etc. they should start talking to each other. If you have 3
> mons in the cluster, you should get quorum.
>
> On the host where the mon is running, you can also ask for the cluster
> status via the mon-admin socket. You should get a response that includes
> "out of quorum" or the like. Once you have the second mon up, you can start
> checking that they form quorum.
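>
> For reference, the admin-socket query would look roughly like this (assuming
> the default socket location under /var/run/ceph; replace MON-NAME):
>
> ceph --admin-daemon /var/run/ceph/ceph-mon.MON-NAME.asok mon_status
>
> In the JSON output, "state" (probing/electing/leader/peon) and the quorum
> fields tell you how far the mon has come.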
>
> If this works, then I would conclude that your cluster is probably OK on
> disk and the issue is somewhere with systemd.
>
> You shouldn't run too much manually. I usually use this to confirm that the
> daemon can start and that its data store on disk is healthy. After that, I
> start looking for what prevents startup. In your case it doesn't seem to be
> Ceph daemons crashing, and that's what this check is mainly for. You could
> maybe try one MGR and then one OSD. If these come up and join the cluster,
> it's something outside Ceph.
>
> For your systemd debugging, add at least the option "-f" to the daemon's
> command lines to force traditional log files to be written.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Amudhan P <amudhan83@xxxxxxxxx>
> Sent: Monday, September 16, 2024 12:18 PM
> To: Frank Schilder
> Cc: Eugen Block; ceph-users@xxxxxxx
> Subject: Re:  Re: Ceph octopus version cluster not starting
>
> Frank,
>
> With the manual command I was able to start the mon and see logs in the log
> file, and I don't find any issue in the logs except the lines below.
> Should I stop the manual command and try to start the mon service from
> systemd, or follow the same approach on all mon nodes?
>
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:/var/lib/ceph/mon/node/store.db/MANIFEST-4328236 succeeded,manifest_file_number is 4328236, next_file_number is 4328238, last_sequence is 1782572963, log_number is 4328223,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0
>
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb:
> [db/version_set.cc:3766] Column family [default] (ID 0), log number is
> 4328223
>
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1726481214623513, "job": 1, "event": "recovery_started",
> "log_files": [4328237]}
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb:
> [db/db_impl_open.cc:583] Recovering log #4328237 mode 2
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb:
> [db/version_set.cc:3036] Creating manifest 4328239
>
> 2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1726481214625473, "job": 1, "event": "recovery_finished"}
> 2024-09-16T15:36:54.628+0530 7f5783d1e5c0  4 rocksdb: DB pointer
> 0x561bb7e90000
>
>
>
> On Mon, Sep 16, 2024 at 2:22 PM Frank Schilder <frans@xxxxxx> wrote:
> Hi. When I have issues like this, what sometimes helps is to start a
> daemon manually (not systemctl or anything like that). Make sure no
> ceph-mon is running on the host:
>
> ps -eo cmd | grep ceph-mon
>
> and start a ceph-mon manually with a command like this (make sure the
> binary is the correct version):
>
> /usr/bin/ceph-mon --cluster ceph --setuser ceph --setgroup ceph
> --foreground -i MON-NAME --mon-data /var/lib/ceph/mon/STORE --public-addr
> MON-IP
>
> Depending on your debug settings, this command does output a bit on
> startup. If your settings in ceph.conf are 0/0, I think you can override
> this on the command line. It might be useful to set the option "-d" (debug
> mode with "log to stderr") on the command line as well. With defaults it
> will talk at least about opening the store and then just wait or complain
> that there are no peers.
>
> This is a good sign.
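>
> If the defaults are too quiet, bumping verbosity on that manual command line
> should work; the levels below are only an example and can be raised further:
>
> /usr/bin/ceph-mon -d --cluster ceph --setuser ceph --setgroup ceph \
>     -i MON-NAME --mon-data /var/lib/ceph/mon/STORE --public-addr MON-IP \
>     --debug-mon 10 --debug-ms 1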
>
> If you got one MON running, start another one on another host and so on
> until you have enough up for quorum. Then you can start querying the MONs
> what their problem is.
>
> If none of this works, the output of the manual command maybe with higher
> debug settings on the command line should be helpful.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Amudhan P <amudhan83@xxxxxxxxx>
> Sent: Monday, September 16, 2024 10:36 AM
> To: Eugen Block
> Cc: ceph-users@xxxxxxx
> Subject:  Re: Ceph octopus version cluster not starting
>
> No, I don't use cephadm, and I have enough space for log storage.
>
> When I try to start the mon service on any of the nodes it just keeps waiting
> to complete, without any error message in stdout or in the log file.
>
> On Mon, Sep 16, 2024 at 1:21 PM Eugen Block <eblock@xxxxxx> wrote:
>
> > Hi,
> >
> > I would focus on the MONs first. If they don't start, your cluster is
> > not usable. It doesn't look like you use cephadm, but please confirm.
> > Check if the nodes are running out of disk space, maybe that's why
> > they don't log anything and fail to start.
> >
> >
> > Quoting Amudhan P <amudhan83@xxxxxxxxx>:
> >
> > > Hi,
> > >
> > > Recently I added one disk to the Ceph cluster using "ceph-volume lvm
> > > create --data /dev/sdX", but the new OSD didn't start. After a while the
> > > OSD services on the other nodes also stopped, so I restarted all nodes
> > > in the cluster. Now, after the restart, the MON, MDS, MGR and OSD
> > > services are not starting. I couldn't find any new logs either; after
> > > the restart it is totally silent on all nodes.
> > > I could find some logs from the ceph-volume service.
> > >
> > >
> > > Error in the ceph-volume logs:
> > > [2024-09-15 23:38:15,080][ceph_volume.process][INFO  ] stderr Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-5
> > > --> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> > > Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-5
> > > Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e --path /var/lib/ceph/osd/ceph-5 --no-mon-config
> > >  stderr: failed to read label for /dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e: (2) No such file or directory
> > > 2024-09-15T23:38:15.059+0530 7fe7767c8100 -1 bluestore(/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e) _read_bdev_label failed to open /dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e: (2) No such file or directory
> > > -->  RuntimeError: command returned non-zero exit status: 1
> > > [2024-09-15 23:38:15,084][ceph_volume.process][INFO  ] stderr Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
> > > --> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> > > Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
> > > Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988 --path /var/lib/ceph/osd/ceph-2 --no-mon-config
> > >  stderr: failed to read label for /dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988: (2) No such file or directory
> > >
> > > But I could find "
> > >
> >
> /dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988"
> > > the path valid and listing folder.
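> > >
> > > A few checks that might help confirm what LVM and the kernel actually
> > > see here (only a sketch; the device path is copied from the log above,
> > > and output details vary by release):
> > >
> > > lvs -o lv_name,vg_name,lv_path,lv_active
> > > ceph-volume lvm list
> > > ceph-bluestore-tool show-label --dev /dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988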
> > >
> > > Not sure how to proceed or where to start. Any ideas or suggestions?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



