Re: Urgent help! RGW Disappeared on Quincy

1. This is a guess, but check /var/lib/ceph and /var/run/ceph for any stale lock files.
2. This is more straightforward to fix: add a faster WAL/DB device or LV for each OSD, or create a fast storage pool just for the CephFS metadata. Also, experiment with the MDS cache size/trim settings [0]; rough commands are sketched below the link.

[0]: https://docs.ceph.com/en/latest/cephfs/cache-configuration/
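
A rough sketch of both, assuming cephadm defaults (paths and values are
examples only; double-check the option names against the doc above before
applying anything).

Look for leftover lock files on each host:

# find /var/lib/ceph /var/run/ceph -iname '*lock*' 2>/dev/null

Give the MDS more cache headroom and gentler trimming:

# ceph config set mds mds_cache_memory_limit 8589934592
# ceph config set mds mds_cache_trim_threshold 524288
# ceph config set mds mds_cache_trim_decay_rate 1.0

(8589934592 = 8 GiB; the default is 4 GiB. The trim values are only the
sort of knobs to experiment with, not recommendations.)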

On 28-Dec-22 7:23 AM, Deep Dish wrote:
Got logging enabled as per
https://ceph.io/en/news/blog/2022/centralized_logging/. My embedded
Grafana doesn't come up in the dashboard, but at least I now have log
files on my nodes. Interesting.
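
For reference, the toggles I applied from that post were roughly these
(going from memory, so double-check against the post itself):

# ceph config set global log_to_file true
# ceph config set global mon_cluster_log_to_file true
# ceph config set global log_to_stderr false
# ceph config set global mon_cluster_log_to_stderr false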

Two issues plaguing my cluster:

1 - RGWs not manageable
2 - MDS_SLOW_METADATA_IO warning (impact to cephfs)

Issue 1:

I have 4x RGWs deployed.   All started / processes running.  They all
report similar log entries:

7fcc32b6a5c0  0 deferred set uid:gid to 167:167 (ceph:ceph)
7fcc32b6a5c0  0 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable), process radosgw, pid 2
7fcc32b6a5c0  0 framework: beast
7fcc32b6a5c0  0 framework conf key: port, val: 80
7fcc32b6a5c0  1 radosgw_Main not setting numa affinity
7fcc32b6a5c0  1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
7fcc32b6a5c0  1 D3N datacache enabled: 0
7fcc0869a700  0 INFO: RGWReshardLock::lock found lock on reshard.0000000011 to be held by another RGW process; skipping for now
7fcc0bea1700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.1, sleep 5, try again
7fcc0dea5700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.3, sleep 5, try again
7fcc0dea5700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0dea5700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0bea1700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0dea5700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0bea1700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0dea5700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0bea1700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0dea5700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0bea1700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
(repeating)

Seems like a stale lock, not previously cleaned up when the cluster was
busy recovering and rebalancing.
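
I'm assuming I could at least inspect the lock directly with rados,
something like the following (default pool and namespace names assumed; I
haven't actually tried this yet):

# rados -p default.rgw.log --namespace lc ls
# rados -p default.rgw.log --namespace lc lock list lc.16

and, if it really is stale, break it with "rados ... lock break" using the
lock name, locker and cookie reported above.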

Issue 2:

ceph health detail:

[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs

     mds.fs01.ceph02mon03.rjcxat(mds.0): 8 slow metadata IOs are blocked > 30 secs, oldest blocked for 39485 secs

Log entries from ceph02mon03 MDS host:

  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131271 from mon.4
  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131272 from mon.4
  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131273 from mon.4
  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131274 from mon.4
  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131275 from mon.4
  7fe36c6b8700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 33.126589 secs
  7fe36c6b8700  0 log_channel(cluster) log [WRN] : slow request 33.126588 seconds old, received at 2022-12-27T19:45:45.952225+0000: client_request(client.55009:99980 create #0x10000000bc2/vzdump-qemu-30003-2022_12_27-14_43_43.log 2022-12-27T19:45:45.948045+0000 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131276 from mon.4
  7fe36c6b8700  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 38.126737 secs
  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131277 from mon.4
  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131278 from mon.4
  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131279 from mon.4
  7fe36debb700  1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131280 from mon.4


I suspect that the file in the log above isn't the culprit. How can I get
to the root cause of the MDS slowdowns?
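
For reference, I'm assuming something like the following would at least
show me the blocked ops and the MDS's own outstanding RADOS requests
(daemon name taken from my cluster; not sure this is the right approach):

# ceph tell mds.fs01.ceph02mon03.rjcxat dump_ops_in_flight
# ceph tell mds.fs01.ceph02mon03.rjcxat objecter_requests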


On Tue, Dec 27, 2022 at 3:32 PM Pavin Joseph <me@xxxxxxxxxxxxxxx> wrote:

Interesting, the logs show the crash module [0] itself has crashed.
Something sent it a SIGINT or SIGTERM and the module didn't handle it
correctly due to what seems like a bug in the code.

I haven't seen the crash module itself crash (in Quincy), because nothing
here has sent a SIG[INT|TERM] to it.

So I'd keep investigating why those signals were sent to the crash module
in the first place.

To stop the crash module from crashing, edit "/usr/bin/ceph-crash" and
change the handler function on line 82 like so:

# signal handlers are called with two arguments (signum, frame); the stock
# handler only accepts signum, which is what produces the "takes 1
# positional argument but 2 were given" traceback further down this thread
def handler(signum, frame):
    signame = signal.Signals(signum).name
    print(f'*** Interrupted with signal {signame} ({signum}) ***')
    print(frame)
    sys.exit(0)

---

Once the crash module is working again, perhaps you could run a "ceph crash ls" to see what has actually been crashing.

Regarding podman logs, perhaps try this [1].

[0]: https://docs.ceph.com/en/quincy/mgr/crash/
[1]: https://docs.podman.io/en/latest/markdown/podman-logs.1.html

On 27-Dec-22 11:59 PM, Deep Dish wrote:
Hi Pavin,

Thanks for the reply. I'm honestly a bit at a loss, as this worked
perfectly without any issue up until the rebalance of the cluster.
Orchestrator is great; aside from this (which I suspect is not
orchestrator related), I haven't had any issues.

In terms of logs, I'm not sure where to start looking in this new
containerized environment as they pertain to individual ceph processes --
I assumed everything would be centrally collected within orch.

Connecting into the podman container of an RGW, there are no logs in
/var/log/ceph aside from ceph-volume. My ceph.conf is minimal, with only
monitors defined. The only log I'm able to pull is as follows:

# podman logs 35d4ac5445ca

INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
Traceback (most recent call last):
  File "/usr/bin/ceph-crash", line 113, in <module>
    main()
  File "/usr/bin/ceph-crash", line 109, in main
    time.sleep(args.delay * 60)
TypeError: handler() takes 1 positional argument but 2 were given
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s



Looks like the RGW daemon is crashing. How do I get logs to persist? I
suspect I won't be able to use the orchestrator to push down the config,
and would have to manipulate it within the container image itself.

I also attempted to redeploy the RGW containers, without success.

On Tue, Dec 27, 2022 at 10:39 AM Pavin Joseph <me@xxxxxxxxxxxxxxx>
wrote:

Here are the first things I'd check in your situation:

1. Logs
2. Is the RGW HTTP server running on its port?
3. Re-check the config, including authentication (a few quick checks are sketched below).
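
For example (host names and ports assumed, adjust to your setup):

# ceph orch ps | grep rgw
# ss -ltnp | grep radosgw
# curl -v http://<rgw-host>:80

The first shows whether the rgw daemons are deployed and running, the
second whether beast is actually listening on port 80 on that host, and
the third whether it answers at all.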

ceph orch is too new and didn't pass muster in our own internal testing.
You're braver than most for using it in production.

Pavin.

On 27-Dec-22 8:48 PM, Deep Dish wrote:
Quick update:

- I followed documentation, and ran the following:

# ceph dashboard set-rgw-credentials

Error EINVAL: No RGW credentials found, please consult the documentation on how to enable RGW for the dashboard.



- I see dashboard credentials configured (all this was working fine
before):


# ceph dashboard get-rgw-api-access-key

P?????????????????G  (middle of the key masked with "?")



Seems to me like my RGW config is non-existent / corrupted for some
reason. When trying to curl an RGW directly I get a "connection refused".



On Tue, Dec 27, 2022 at 9:41 AM Deep Dish <deeepdish@xxxxxxxxx> wrote:

I built a net-new Quincy cluster (17.2.5) using ceph orch as follows:

2x mgrs
4x rgw
5x mon
4x rgw
5x mds
6x osd hosts w/ 10 drives each --> will be growing to 7 osd hosts in the coming days.

I migrated all data from my legacy nautilus cluster (via rbd-mirror,
rclone for s3 buckets, etc.).  All moved over successfully without
issue.

The cluster went through a series of rebalancing events (adding
capacity,
osd nodes, changing fault domain for EC volumes).

It's settled now; however, throughout the process all of my RGW nodes
dropped out of the cluster -- meaning ceph doesn't recognize / detect
them, despite containers, networking, etc. all being set up correctly.
This also means I'm unable to manage any RGW functions (via the dashboard
or cli). As an example via cli (within the Cephadm shell):

# radosgw-admin pools list

could not list placement set: (2) No such file or directory

I have data in these buckets -- how can I get my RGWs back online?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


