Got logging enabled as per https://ceph.io/en/news/blog/2022/centralized_logging/. My embedded Grafana doesn't come up in the dashboard, but at least I have log files on my nodes. Interesting.

Two issues are plaguing my cluster:

1 - RGWs not manageable
2 - MDS_SLOW_METADATA_IO warning (impacting CephFS)

Issue 1:

I have 4x RGWs deployed. All are started and their processes are running. They all report similar log entries:

7fcc32b6a5c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
7fcc32b6a5c0 0 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable), process radosgw, pid 2
7fcc32b6a5c0 0 framework: beast
7fcc32b6a5c0 0 framework conf key: port, val: 80
7fcc32b6a5c0 1 radosgw_Main not setting numa affinity
7fcc32b6a5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
7fcc32b6a5c0 1 D3N datacache enabled: 0
7fcc0869a700 0 INFO: RGWReshardLock::lock found lock on reshard.0000000011 to be held by another RGW process; skipping for now
7fcc0bea1700 0 lifecycle: RGWLC::process() failed to acquire lock on lc.1, sleep 5, try again
7fcc0dea5700 0 lifecycle: RGWLC::process() failed to acquire lock on lc.3, sleep 5, try again
7fcc0dea5700 0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0bea1700 0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
(the lc.16 line keeps repeating)

Seems like a stale lock that was never cleaned up while the cluster was busy recovering and rebalancing.
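In case it helps, this is roughly how I'm planning to inspect that suspected stale lifecycle lock. Treat it as a sketch: I'm assuming the default zone's log pool (default.rgw.log) and its "lc" namespace, and the lock name "lc_process" is my guess, so please correct me if those don't match your setup.

# rados -p default.rgw.log --namespace lc ls

(to confirm the lc.N shard objects exist in the log pool)

# rados -p default.rgw.log --namespace lc lock info lc.16 lc_process

(should print the current locker and its cookie for the shard named in the RGW log)

# rados -p default.rgw.log --namespace lc lock break lc.16 lc_process <locker-id> --lock-cookie <cookie>

(only if the locker reported above turns out to be an RGW instance that no longer exists)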
Issue 2:

ceph health detail:

[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.fs01.ceph02mon03.rjcxat(mds.0): 8 slow metadata IOs are blocked > 30 secs, oldest blocked for 39485 secs

Log entries from the ceph02mon03 MDS host:

7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131271 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131272 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131273 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131274 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131275 from mon.4
7fe36c6b8700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 33.126589 secs
7fe36c6b8700 0 log_channel(cluster) log [WRN] : slow request 33.126588 seconds old, received at 2022-12-27T19:45:45.952225+0000: client_request(client.55009:99980 create #0x10000000bc2/vzdump-qemu-30003-2022_12_27-14_43_43.log 2022-12-27T19:45:45.948045+0000 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131276 from mon.4
7fe36c6b8700 0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 38.126737 secs
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131277 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131278 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131279 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131280 from mon.4

I suspect the file in the log above isn't the culprit. How can I get to the root cause of the MDS slowdowns?
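For reference, this is what I'm planning to run next to narrow down where the metadata IO is stuck. The daemon name is from my cluster, and I'm assuming these admin-socket commands are available on the MDS when entered via cephadm:

# cephadm enter --name mds.fs01.ceph02mon03.rjcxat

# ceph daemon mds.fs01.ceph02mon03.rjcxat dump_blocked_ops

# ceph daemon mds.fs01.ceph02mon03.rjcxat dump_ops_in_flight

# ceph daemon mds.fs01.ceph02mon03.rjcxat objecter_requests

(the last one should show the MDS's own outstanding OSD requests, which I'm hoping points at the OSDs holding up the metadata pool writes)

Is that a reasonable approach, or is there a better way to trace MDS_SLOW_METADATA_IO back to specific OSDs?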
On Tue, Dec 27, 2022 at 3:32 PM Pavin Joseph <me@xxxxxxxxxxxxxxx> wrote:

> Interesting, the logs show the crash module [0] itself has crashed.
> Something sent it a SIGINT or SIGTERM and the module didn't handle it
> correctly due to what seems like a bug in the code.
>
> I haven't experienced the crash module itself crashing yet (in Quincy)
> because nothing has sent a SIG[INT|TERM] to it yet.
>
> So I'd continue investigating why these signals were sent to the
> crash module.
>
> To stop the crash module from crashing, go to "/usr/bin/ceph-crash" and
> edit the handler function on line 82 like so:
>
> def handler(signum, frame):
>     print('*** Interrupted with signal %d ***' % signum)
>     signame = signal.Signals(signum).name
>     print(f'Signal handler called with signal {signame} ({signum})')
>     print(frame)
>     sys.exit(0)
>
> ---
>
> Once the crash module is working, perhaps you could run a "ceph crash ls".
>
> Regarding podman logs, perhaps try this [1].
>
> [0]: https://docs.ceph.com/en/quincy/mgr/crash/
> [1]: https://docs.podman.io/en/latest/markdown/podman-logs.1.html
>
> On 27-Dec-22 11:59 PM, Deep Dish wrote:
> > Hi Pavin,
> >
> > Thanks for the reply. I'm a bit at a loss, honestly, as this worked
> > perfectly without any issue up until the rebalance of the cluster.
> > Orchestrator is great. Aside from this (which I suspect is not
> > orchestrator related), I haven't had any issues.
> >
> > In terms of logs, I'm not sure where to start looking in this new
> > containerized environment as they pertain to individual ceph processes --
> > I assumed everything would be centrally collected within orch.
> >
> > Connecting into the podman container of an RGW, there are no logs in
> > /var/log/ceph aside from ceph-volume. My ceph.conf is minimal, with only
> > monitors defined. The only log I'm able to pull is as follows:
> >
> > # podman logs 35d4ac5445ca
> > INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
> > Traceback (most recent call last):
> >   File "/usr/bin/ceph-crash", line 113, in <module>
> >     main()
> >   File "/usr/bin/ceph-crash", line 109, in main
> >     time.sleep(args.delay * 60)
> > TypeError: handler() takes 1 positional argument but 2 were given
> > INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
> >
> > Looks like the RGW daemon is crashing. How do I get logs to persist? I
> > suspect I won't be able to use orchestrator to push down the config, and
> > would have to manipulate it within the container image itself.
> >
> > I also attempted to redeploy the RGW containers without success.
> >
> > On Tue, Dec 27, 2022 at 10:39 AM Pavin Joseph <me@xxxxxxxxxxxxxxx> wrote:
> >
> >> Here are the first things I'd check in your situation:
> >>
> >> 1. Logs
> >> 2. Is the RGW HTTP server running on its port?
> >> 3. Re-check config, including authentication.
> >>
> >> ceph orch is too new and didn't pass muster in our own internal testing.
> >> You're braver than most for using it in production.
> >>
> >> Pavin.
> >>
> >> On 27-Dec-22 8:48 PM, Deep Dish wrote:
> >>> Quick update:
> >>>
> >>> - I followed the documentation and ran the following:
> >>>
> >>> # ceph dashboard set-rgw-credentials
> >>> Error EINVAL: No RGW credentials found, please consult the documentation
> >>> on how to enable RGW for the dashboard.
> >>>
> >>> - I see dashboard credentials configured (all this was working fine
> >>> before):
> >>>
> >>> # ceph dashboard get-rgw-api-access-key
> >>> P?????????????????G (? commented out)
> >>>
> >>> Seems to me like my RGW config is non-existent / corrupted for some
> >>> reason. When trying to curl an RGW directly I get a "connection refused".
> >>>
> >>> On Tue, Dec 27, 2022 at 9:41 AM Deep Dish <deeepdish@xxxxxxxxx> wrote:
> >>>
> >>>> I built a net-new Quincy cluster (17.2.5) using ceph orch as follows:
> >>>>
> >>>> 2x mgrs
> >>>> 4x rgw
> >>>> 5x mon
> >>>> 4x rgw
> >>>> 5x mds
> >>>> 6x osd hosts w/ 10 drives each --> will be growing to 7 osd hosts in
> >>>> the coming days.
> >>>>
> >>>> I migrated all data from my legacy Nautilus cluster (via rbd-mirror,
> >>>> rclone for s3 buckets, etc.). All moved over successfully without issue.
> >>>>
> >>>> The cluster went through a series of rebalancing events (adding
> >>>> capacity, osd nodes, changing the fault domain for EC volumes).
> >>>>
> >>>> It's settled now; however, throughout the process all of my RGW nodes
> >>>> are no longer part of the cluster -- meaning ceph doesn't recognize /
> >>>> detect them, despite containers, networking, etc. all being set up
> >>>> correctly. This also means I'm unable to manage any RGW functions (via
> >>>> the dashboard or cli). As an example via cli (within the cephadm shell):
> >>>>
> >>>> # radosgw-admin pools list
> >>>> could not list placement set: (2) No such file or directory
> >>>>
> >>>> I have data in buckets, how can I get my RGWs to come back online?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx