All:

I think I found the problem - hence...

[root@c01 ceph]# ceph orch ls
NAME                       PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager               ?:9093,9094      1/1  2m ago     9d   count:1
crash                                        6/6  2m ago     9d   *
grafana                    ?:3000           1/1  2m ago     9d   count:1
mgr                                          2/2  2m ago     9d   count:2
mon                                          5/5  2m ago     9d   count:5
node-exporter              ?:9100           6/6  2m ago     9d   *
osd                                            2  2m ago     -    <unmanaged>
osd.all-available-devices                     16  2m ago     2d   *
prometheus                 ?:9095           1/1  2m ago     9d   count:1
rgw.obj0                   ?:80             1/6  2m ago     9d   c01;c02;c03;c04;c05;c06;count:6
rgw.obj01                  ?:80             5/6  2m ago     5d   c01;c02;c03;c04;c05;c06

To my untrained eye, it looks like rgw.obj0 is extra and unneeded. Does anyone know a way to prove this out and, if needed, remove it?

Thanks!

Ron Gage
Westland, MI

-----Original Message-----
From: Eugen Block <eblock@xxxxxx>
Sent: Thursday, February 17, 2022 2:32 AM
To: ceph-users@xxxxxxx
Subject: Re: Problem with Ceph daemons

Can you retry after resetting the systemd unit? The message "Start request repeated too quickly." should be cleared first, then start it again:

systemctl reset-failed ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
systemctl start ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service

Then check the logs again. If there's still nothing in the rgw log, you'll need to check the (active) mgr daemon logs for anything suspicious, and also the syslog on that rgw host. Is the rest of the cluster healthy? Are the rgw daemons colocated with other services?

Zitat von Ron Gage <ron@xxxxxxxxxxx>:

> Adam:
>
> Not really…
>
> -- Unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has begun starting up.
> Feb 16 15:01:03 c01 podman[426007]:
> Feb 16 15:01:04 c01 bash[426007]: 915d1e19fa0f213902c666371c8e825480e103f85172f3b15d1d5bf2427a87c9
> Feb 16 15:01:04 c01 conmon[426038]: debug 2022-02-16T20:01:04.303+0000 7f4f72ff6440  0 deferred set uid:gid to 167:167 (ceph:ceph)
> Feb 16 15:01:04 c01 conmon[426038]: debug 2022-02-16T20:01:04.303+0000 7f4f72ff6440  0 ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (st>
> Feb 16 15:01:04 c01 conmon[426038]: debug 2022-02-16T20:01:04.303+0000 7f4f72ff6440  0 framework: beast
> Feb 16 15:01:04 c01 conmon[426038]: debug 2022-02-16T20:01:04.303+0000 7f4f72ff6440  0 framework conf key: port, val: 80
> Feb 16 15:01:04 c01 conmon[426038]: debug 2022-02-16T20:01:04.303+0000 7f4f72ff6440  1 radosgw_Main not setting numa affinity
> Feb 16 15:01:04 c01 systemd[1]: Started Ceph rgw.obj0.c01.gpqshk for 35194656-893e-11ec-85c8-005056870dae.
> -- Subject: Unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has finished start-up
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> --
> -- Unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has finished starting up.
> --
> -- The start-up result is done.
>
> Feb 16 15:01:04 c01 systemd[1]: ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service: Main process exited, code=exited, status=98/n/a
> Feb 16 15:01:05 c01 systemd[1]: ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service: Failed with result 'exit-code'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> --
> -- The unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has entered the 'failed' state with result 'exit-code'.
> Feb 16 15:01:15 c01 systemd[1]: ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service: Service RestartSec=10s expired, scheduling restart.
> Feb 16 15:01:15 c01 systemd[1]: ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service: Scheduled restart job, restart counter is at 5.
> -- Subject: Automatic restarting of a unit has been scheduled
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> --
> -- Automatic restarting of the unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has been scheduled, as the result for
> -- the configured Restart= setting for the unit.
>
> Feb 16 15:01:15 c01 systemd[1]: Stopped Ceph rgw.obj0.c01.gpqshk for 35194656-893e-11ec-85c8-005056870dae.
> -- Subject: Unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has finished shutting down
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> --
> -- Unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has finished shutting down.
>
> Feb 16 15:01:15 c01 systemd[1]: ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service: Start request repeated too quickly.
> Feb 16 15:01:15 c01 systemd[1]: ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service: Failed with result 'exit-code'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> --
> -- The unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has entered the 'failed' state with result 'exit-code'.
>
> Feb 16 15:01:15 c01 systemd[1]: Failed to start Ceph rgw.obj0.c01.gpqshk for 35194656-893e-11ec-85c8-005056870dae.
> -- Subject: Unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has failed
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> --
> -- Unit ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service has failed.
> --
> -- The result is failed.
>
> Ron Gage
> Westland, MI
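A side note on the status=98 exit in the log above: on Linux, errno 98 is EADDRINUSE, and for radosgw an exit status of 98 usually means the beast frontend could not bind its configured port. This is an interpretation, not something stated in the thread, but with both rgw.obj0 and rgw.obj01 placed on the same six hosts and both publishing port 80, a bind conflict on c01 would explain the immediate exit. A quick check on the affected host, assuming podman as in the logs above:

# See which process already owns port 80 on c01:
ss -tlnp | grep ':80 '

# List the ceph containers running on this host and look for another
# radosgw (e.g. one deployed for rgw.obj01) already bound to the port:
podman ps --format '{{.Names}}' | grep rgw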
> From: Adam King <adking@xxxxxxxxxx>
> Sent: Wednesday, February 16, 2022 4:18 PM
> To: Ron Gage <ron@xxxxxxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxx>
> Subject: Re: Problem with Ceph daemons
>
> Is there anything useful in the rgw daemon's logs? (e.g. journalctl -xeu ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk)
>
> - Adam King
>
> On Wed, Feb 16, 2022 at 3:58 PM Ron Gage <ron@xxxxxxxxxxx> wrote:
>
> Hi everyone!
>
> Looks like I am having some problems with some of my ceph RGW daemons - they won't stay running.
>
> From 'cephadm ls':
>
>     {
>         "style": "cephadm:v1",
>         "name": "rgw.obj0.c01.gpqshk",
>         "fsid": "35194656-893e-11ec-85c8-005056870dae",
>         "systemd_unit": "ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk",
>         "enabled": true,
>         "state": "error",
>         "service_name": "rgw.obj0",
>         "ports": [
>             80
>         ],
>         "ip": null,
>         "deployed_by": [
>             "quay.io/ceph/ceph@sha256:c3a89afac4f9c83c716af57e08863f7010318538c7e2cd911458800097f7d97d",
>             "quay.io/ceph/ceph@sha256:a39107f8d3daab4d756eabd6ee1630d1bc7f31eaa76fff41a77fa32d0b903061"
>         ],
>         "rank": null,
>         "rank_generation": null,
>         "memory_request": null,
>         "memory_limit": null,
>         "container_id": null,
>         "container_image_name": "quay.io/ceph/ceph@sha256:a39107f8d3daab4d756eabd6ee1630d1bc7f31eaa76fff41a77fa32d0b903061",
>         "container_image_id": null,
>         "container_image_digests": null,
>         "version": null,
>         "started": null,
>         "created": "2022-02-09T01:00:53.411541Z",
>         "deployed": "2022-02-09T01:00:52.338515Z",
>         "configured": "2022-02-09T01:00:53.411541Z"
>     },
>
> That whole "state": "error" bit is concerning to me - and it is contributing to the cluster status of warning (showing 6 cephadm daemons down).
>
> Can I get a hint or two on how to fix this?
>
> Thanks!
>
> Ron Gage
> Westland, MI
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
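On the question at the top of the thread - proving that rgw.obj0 is redundant and removing it - here is a minimal sketch using cephadm orchestrator commands. The service names are taken from the ceph orch ls output above; whether rgw.obj0 is really unneeded (for example, not serving a separate realm or zone) is an assumption that should be verified before removing anything:

# Export both rgw service specs and compare placement, port and any
# realm/zone settings; two specs publishing port 80 on the same hosts
# will collide wherever both try to run.
ceph orch ls rgw --export

# Show all rgw daemons, which service each belongs to, and its state:
ceph orch ps --daemon_type rgw

# If rgw.obj0 is confirmed to be the duplicate, removing its spec lets
# cephadm tear down the daemons it deployed:
ceph orch rm rgw.obj0

Once the duplicate spec is gone, the failed unit on c01 should no longer be redeployed, and the remaining rgw.obj01 daemons should be able to bind port 80 normally.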