Re: Problem with Ceph daemons

Adam King <adking@xxxxxxxxxx> · Mon, 21 Feb 2022 08:16:00 -0500

I'd say you probably don't need both services. It looks like they're
configured to listen on the same port (80, from the output) and are being
placed on the same hosts (c01-c06). That port conflict could be what is
causing the rgw daemons to go into an error state. Cephadm will try to
place two rgw daemons on each of these hosts to satisfy both rgw service
specs, but if both try to use the same port, whichever one gets placed
second could go into an error state for that reason.
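
To check for this kind of overlap, one can compare the ports and placement
hosts of each service. The sketch below does that in Python for the two rgw
services. The values are copied by hand from the `ceph orch ls` output quoted
further down; the SERVICES dictionary, the find_port_conflicts helper, and the
rgw.example entry on port 8080 are illustrative assumptions, not part of
cephadm.

```python
from itertools import combinations

HOSTS = {"c01", "c02", "c03", "c04", "c05", "c06"}

# Ports and placement hosts copied by hand from the quoted `ceph orch ls` output.
# The layout of this dictionary is an assumption made for illustration only.
SERVICES = {
    "rgw.obj0": {"ports": {80}, "hosts": set(HOSTS)},
    "rgw.obj01": {"ports": {80}, "hosts": set(HOSTS)},
    # Hypothetical service (not in the cluster) showing a non-conflicting port.
    "rgw.example": {"ports": {8080}, "hosts": set(HOSTS)},
}


def find_port_conflicts(services):
    """Return (service_a, service_b, shared_ports, shared_hosts) for each
    pair of services that share at least one port on at least one host."""
    conflicts = []
    for name_a, name_b in combinations(sorted(services), 2):
        shared_ports = services[name_a]["ports"] & services[name_b]["ports"]
        shared_hosts = services[name_a]["hosts"] & services[name_b]["hosts"]
        if shared_ports and shared_hosts:
            conflicts.append(
                (name_a, name_b, sorted(shared_ports), sorted(shared_hosts))
            )
    return conflicts


for name_a, name_b, ports, hosts in find_port_conflicts(SERVICES):
    print(f"{name_a} and {name_b} both want port(s) {ports} on {', '.join(hosts)}")
```

With the values above, only the rgw.obj0 / rgw.obj01 pair is reported, since
they share port 80 on all six hosts.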

 - Adam King
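
For reference, the quoted journal shows the unit exiting with status=98. On
common Linux architectures, 98 is the numeric value of EADDRINUSE ("Address
already in use"), which would be consistent with the port conflict described
above, assuming radosgw passes the errno through as its exit code (that is
not confirmed in this thread). Below is a minimal Python sketch of what
happens when a second listener binds a port that is already taken.
second_bind_errno is a hypothetical helper name, and it uses a kernel-chosen
loopback port rather than port 80 so that it runs without root.

```python
import errno
import socket


def second_bind_errno(host="127.0.0.1"):
    """Bind two TCP sockets to the same port and return the errno raised by
    the second bind, or None if the second bind unexpectedly succeeds."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the kernel for any free port, so no privileges are needed.
        first.bind((host, 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address and port as the first listener: expected to fail.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


code = second_bind_errno()
if code is None:
    print("second bind succeeded")
else:
    print(f"second bind failed with errno {code} ({errno.errorcode.get(code)})")
```

The second bind plays the role of whichever rgw daemon gets placed second on
a host.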

On Fri, Feb 18, 2022 at 1:38 PM Ron Gage <ron@xxxxxxxxxxx> wrote:

> All:
>
> I think I found the problem - hence...
>
> [root@c01 ceph]# ceph orch ls
> NAME                       PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
> alertmanager               ?:9093,9094      1/1  2m ago     9d   count:1
> crash                                       6/6  2m ago     9d   *
> grafana                    ?:3000           1/1  2m ago     9d   count:1
> mgr                                         2/2  2m ago     9d   count:2
> mon                                         5/5  2m ago     9d   count:5
> node-exporter              ?:9100           6/6  2m ago     9d   *
> osd                                           2  2m ago     -    <unmanaged>
> osd.all-available-devices                    16  2m ago     2d   *
> prometheus                 ?:9095           1/1  2m ago     9d   count:1
> rgw.obj0                   ?:80             1/6  2m ago     9d   c01;c02;c03;c04;c05;c06;count:6
> rgw.obj01                  ?:80             5/6  2m ago     5d   c01;c02;c03;c04;c05;c06
>
>
> To my untrained eye, it looks like rgw.obj0 is extra and unneeded.  Does
> anyone know a way to prove this out and if needed remove it?
>
> Thanks!
>
> Ron Gage
> Westland, MI
>
> -----Original Message-----
> From: Eugen Block <eblock@xxxxxx>
> Sent: Thursday, February 17, 2022 2:32 AM
> To: ceph-users@xxxxxxx
> Subject:  Re: Problem with Ceph daemons
>
> Can you retry after resetting the systemd unit? The message "Start request
> repeated too quickly." should be cleared first, then start it
> again:
>
> systemctl reset-failed ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> systemctl start ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
>
> Then check the logs again. If there's still nothing in the rgw log then
> you'll need to check the (active) mgr daemon logs for anything suspicious
> and also the syslog on that rgw host. Is the rest of the cluster healthy?
> Are rgw daemons colocated with other services?
>
>
> Zitat von Ron Gage <ron@xxxxxxxxxxx>:
>
> > Adam:
> >
> >
> >
> > Not really….
> >
> >
> >
> > -- Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has begun starting up.
> >
> > Feb 16 15:01:03 c01 podman[426007]:
> >
> > Feb 16 15:01:04 c01 bash[426007]:
> > 915d1e19fa0f213902c666371c8e825480e103f85172f3b15d1d5bf2427a87c9
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+0000 7f4f72ff6440  0 deferred set uid:gid to
> > 167:167 (ceph:ceph)
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+0000 7f4f72ff6440  0 ceph version 16.2.7
> > (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (st>
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+0000 7f4f72ff6440  0 framework: beast
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+0000 7f4f72ff6440  0 framework conf key:
> > port, val: 80
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+0000 7f4f72ff6440  1 radosgw_Main not setting
> > numa affinity
> >
> > Feb 16 15:01:04 c01 systemd[1]: Started Ceph rgw.obj0.c01.gpqshk for
> > 35194656-893e-11ec-85c8-005056870dae.
> >
> > -- Subject: Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has finished start-up
> >
> > -- Defined-By: systemd
> >
> > -- Support: https://access.redhat.com/support
> >
> > --
> >
> > -- Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has finished starting up.
> >
> > --
> >
> > -- The start-up result is done.
> >
> > Feb 16 15:01:04 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Main process exited, code=exited, status=98/n/a
> >
> > Feb 16 15:01:05 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Failed with result 'exit-code'.
> >
> > -- Subject: Unit failed
> >
> > -- Defined-By: systemd
> >
> > -- Support: https://access.redhat.com/support
> >
> > --
> >
> > -- The unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has entered the 'failed' state with result 'exit-code'.
> >
> > Feb 16 15:01:15 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Service RestartSec=10s expired, scheduling restart.
> >
> > Feb 16 15:01:15 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Scheduled restart job, restart counter is at 5.
> >
> > -- Subject: Automatic restarting of a unit has been scheduled
> >
> > -- Defined-By: systemd
> >
> > -- Support: https://access.redhat.com/support
> >
> > --
> >
> > -- Automatic restarting of the unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has been scheduled, as the result for
> >
> > -- the configured Restart= setting for the unit.
> >
> > Feb 16 15:01:15 c01 systemd[1]: Stopped Ceph rgw.obj0.c01.gpqshk for
> > 35194656-893e-11ec-85c8-005056870dae.
> >
> > -- Subject: Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has finished shutting down
> >
> > -- Defined-By: systemd
> >
> > -- Support: https://access.redhat.com/support
> >
> > --
> >
> > -- Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has finished shutting down.
> >
> > Feb 16 15:01:15 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Start request repeated too quickly.
> >
> > Feb 16 15:01:15 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Failed with result 'exit-code'.
> >
> > -- Subject: Unit failed
> >
> > -- Defined-By: systemd
> >
> > -- Support: https://access.redhat.com/support
> >
> > --
> >
> > -- The unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has entered the 'failed' state with result 'exit-code'.
> >
> > Feb 16 15:01:15 c01 systemd[1]: Failed to start Ceph
> > rgw.obj0.c01.gpqshk for 35194656-893e-11ec-85c8-005056870dae.
> >
> > -- Subject: Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has failed
> >
> > -- Defined-By: systemd
> >
> > -- Support: https://access.redhat.com/support
> >
> > --
> >
> > -- Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has failed.
> >
> > --
> >
> > -- The result is failed.
> >
> >
> >
> > Ron Gage
> >
> > Westland, MI
> >
> >
> >
> > From: Adam King <adking@xxxxxxxxxx>
> > Sent: Wednesday, February 16, 2022 4:18 PM
> > To: Ron Gage <ron@xxxxxxxxxxx>
> > Cc: ceph-users <ceph-users@xxxxxxx>
> > Subject: Re:  Problem with Ceph daemons
> >
> >
> >
> > Is there anything useful in the rgw daemon's logs? (e.g. journalctl
> > -xeu ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk)
> >
> >
> >
> >  - Adam King
> >
> >
> >
> > On Wed, Feb 16, 2022 at 3:58 PM Ron Gage <ron@xxxxxxxxxxx> wrote:
> >
> > Hi everyone!
> >
> >
> >
> > Looks like I am having some problems with some of my ceph RGW daemons
> > - they won't stay running.
> >
> >
> >
> > From 'cephadm ls'.
> >
> >
> >
> > {
> >
> >         "style": "cephadm:v1",
> >
> >         "name": "rgw.obj0.c01.gpqshk",
> >
> >         "fsid": "35194656-893e-11ec-85c8-005056870dae",
> >
> >         "systemd_unit": "ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk",
> >
> >         "enabled": true,
> >
> >         "state": "error",
> >
> >         "service_name": "rgw.obj0",
> >
> >         "ports": [
> >
> >             80
> >
> >         ],
> >
> >         "ip": null,
> >
> >         "deployed_by": [
> >
> >
> >             "quay.io/ceph/ceph@sha256:c3a89afac4f9c83c716af57e08863f7010318538c7e2cd911458800097f7d97d",
> >
> >
> >             "quay.io/ceph/ceph@sha256:a39107f8d3daab4d756eabd6ee1630d1bc7f31eaa76fff41a77fa32d0b903061"
> >
> >         ],
> >
> >         "rank": null,
> >
> >         "rank_generation": null,
> >
> >         "memory_request": null,
> >
> >         "memory_limit": null,
> >
> >         "container_id": null,
> >
> >         "container_image_name": "quay.io/ceph/ceph@sha256:a39107f8d3daab4d756eabd6ee1630d1bc7f31eaa76fff41a77fa32d0b903061",
> >
> >         "container_image_id": null,
> >
> >         "container_image_digests": null,
> >
> >         "version": null,
> >
> >         "started": null,
> >
> >         "created": "2022-02-09T01:00:53.411541Z",
> >
> >         "deployed": "2022-02-09T01:00:52.338515Z",
> >
> >         "configured": "2022-02-09T01:00:53.411541Z"
> >
> >     },
> >
> >
> >
> > That whole "state: error" bit is concerning to me - and it is
> > contributing to the cluster status of warning (showing 6 cephadm daemons
> > down).
> >
> >
> >
> > Can I get a hint or two on how to fix this?
> >
> >
> > Thanks!
> >
> >
> >
> > Ron Gage
> >
> > Westland, MI
> >
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
>
>
>
>