Got logging enabled as per https://ceph.io/en/news/blog/2022/centralized_logging/. My embedded Grafana doesn't come up in the dashboard, but at least I have log files on my nodes. Interesting.

Two issues are plaguing my cluster:

1 - RGWs not manageable
2 - MDS_SLOW_METADATA_IO warning (impacting CephFS)

Issue 1:

I have 4x RGWs deployed. All are started and their processes are running. They all report similar log entries:

7fcc32b6a5c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
7fcc32b6a5c0 0 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable), process radosgw, pid 2
7fcc32b6a5c0 0 framework: beast
7fcc32b6a5c0 0 framework conf key: port, val: 80
7fcc32b6a5c0 1 radosgw_Main not setting numa affinity
7fcc32b6a5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
7fcc32b6a5c0 1 D3N datacache enabled: 0
7fcc0869a700 0 INFO: RGWReshardLock::lock found lock on reshard.0000000011 to be held by another RGW process; skipping for now
7fcc0bea1700 0 lifecycle: RGWLC::process() failed to acquire lock on lc.1, sleep 5, try again
7fcc0dea5700 0 lifecycle: RGWLC::process() failed to acquire lock on lc.3, sleep 5, try again
7fcc0dea5700 0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
7fcc0bea1700 0 lifecycle: RGWLC::process() failed to acquire lock on lc.16, sleep 5, try again
(the lc.16 line keeps repeating)

Seems like a stale lock that was never cleaned up while the cluster was busy recovering and rebalancing.
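In case it helps, this is roughly how I'm planning to inspect that suspected stale lifecycle lock. Treat it as a sketch: I'm assuming the default zone's log pool (default.rgw.log) and its "lc" namespace, and the lock name "lc_process" is my guess, so please correct me if those don't match your setup.

# rados -p default.rgw.log --namespace lc ls

(to confirm the lc.N shard objects exist in the log pool)

# rados -p default.rgw.log --namespace lc lock info lc.16 lc_process

(should print the current locker and its cookie for the shard named in the RGW log)

# rados -p default.rgw.log --namespace lc lock break lc.16 lc_process <locker-id> --lock-cookie <cookie>

(only if the locker reported above turns out to be an RGW instance that no longer exists)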
Issue 2:

ceph health detail:

[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.fs01.ceph02mon03.rjcxat(mds.0): 8 slow metadata IOs are blocked > 30 secs, oldest blocked for 39485 secs

Log entries from the ceph02mon03 MDS host:

7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131271 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131272 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131273 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131274 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131275 from mon.4
7fe36c6b8700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 33.126589 secs
7fe36c6b8700 0 log_channel(cluster) log [WRN] : slow request 33.126588 seconds old, received at 2022-12-27T19:45:45.952225+0000: client_request(client.55009:99980 create #0x10000000bc2/vzdump-qemu-30003-2022_12_27-14_43_43.log 2022-12-27T19:45:45.948045+0000 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131276 from mon.4
7fe36c6b8700 0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 38.126737 secs
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131277 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131278 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131279 from mon.4
7fe36debb700 1 mds.fs01.ceph02mon03.rjcxat Updating MDS map to version 131280 from mon.4

I suspect the file in the log above isn't the culprit. How can I get to the root cause of the MDS slowdowns?
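For reference, this is what I'm planning to run next to narrow down where the metadata IO is stuck. The daemon name is from my cluster, and I'm assuming these admin-socket commands are available on the MDS when entered via cephadm:

# cephadm enter --name mds.fs01.ceph02mon03.rjcxat

# ceph daemon mds.fs01.ceph02mon03.rjcxat dump_blocked_ops

# ceph daemon mds.fs01.ceph02mon03.rjcxat dump_ops_in_flight

# ceph daemon mds.fs01.ceph02mon03.rjcxat objecter_requests

(the last one should show the MDS's own outstanding OSD requests, which I'm hoping points at the OSDs holding up the metadata pool writes)

Is that a reasonable approach, or is there a better way to trace MDS_SLOW_METADATA_IO back to specific OSDs?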
On Tue, Dec 27, 2022 at 3:32 PM Pavin Joseph <me@xxxxxxxxxxxxxxx> wrote:

> Interesting, the logs show the crash module [0] itself has crashed.
> Something sent it a SIGINT or SIGTERM and the module didn't handle it
> correctly due to what seems like a bug in the code.
>
> I haven't experienced the crash module itself crashing yet (in Quincy)
> because nothing has sent a SIG[INT|TERM] to it yet.
>
> So I'd continue investigating why these signals were sent to the
> crash module.
>
> To stop the crash module from crashing, go to "/usr/bin/ceph-crash" and
> edit the handler function on line 82 like so:
>
> def handler(signum, frame):
>     print('*** Interrupted with signal %d ***' % signum)
>     signame = signal.Signals(signum).name
>     print(f'Signal handler called with signal {signame} ({signum})')
>     print(frame)
>     sys.exit(0)
>
> ---
>
> Once the crash module is working, perhaps you could run a "ceph crash ls".
>
> Regarding podman logs, perhaps try this [1].
>
> [0]: https://docs.ceph.com/en/quincy/mgr/crash/
> [1]: https://docs.podman.io/en/latest/markdown/podman-logs.1.html
>
> On 27-Dec-22 11:59 PM, Deep Dish wrote:
> > Hi Pavin,
> >
> > Thanks for the reply. I'm a bit at a loss, honestly, as this worked
> > perfectly without any issue up until the rebalance of the cluster.
> > Orchestrator is great. Aside from this (which I suspect is not
> > orchestrator related), I haven't had any issues.
> >
> > In terms of logs, I'm not sure where to start looking in this new
> > containerized environment as they pertain to individual ceph processes --
> > I assumed everything would be centrally collected within orch.
> >
> > Connecting into the podman container of an RGW, there are no logs in
> > /var/log/ceph aside from ceph-volume. My ceph.conf is minimal, with only
> > monitors defined. The only log I'm able to pull is as follows:
> >
> > # podman logs 35d4ac5445ca
> > INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
> > Traceback (most recent call last):
> >   File "/usr/bin/ceph-crash", line 113, in <module>
> >     main()
> >   File "/usr/bin/ceph-crash", line 109, in main
> >     time.sleep(args.delay * 60)
> > TypeError: handler() takes 1 positional argument but 2 were given
> > INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
> >
> > Looks like the RGW daemon is crashing. How do I get logs to persist? I
> > suspect I won't be able to use orchestrator to push down the config, and
> > would have to manipulate it within the container image itself.
> >
> > I also attempted to redeploy the RGW containers without success.
> >
> > On Tue, Dec 27, 2022 at 10:39 AM Pavin Joseph <me@xxxxxxxxxxxxxxx> wrote:
> >
> >> Here are the first things I'd check in your situation:
> >>
> >> 1. Logs
> >> 2. Is the RGW HTTP server running on its port?
> >> 3. Re-check config, including authentication.
> >>
> >> ceph orch is too new and didn't pass muster in our own internal testing.
> >> You're braver than most for using it in production.
> >>
> >> Pavin.
> >>
> >> On 27-Dec-22 8:48 PM, Deep Dish wrote:
> >>> Quick update:
> >>>
> >>> - I followed the documentation and ran the following:
> >>>
> >>> # ceph dashboard set-rgw-credentials
> >>> Error EINVAL: No RGW credentials found, please consult the documentation
> >>> on how to enable RGW for the dashboard.
> >>>
> >>> - I see dashboard credentials configured (all this was working fine
> >>> before):
> >>>
> >>> # ceph dashboard get-rgw-api-access-key
> >>> P?????????????????G (? commented out)
> >>>
> >>> Seems to me like my RGW config is non-existent / corrupted for some
> >>> reason. When trying to curl an RGW directly I get a "connection refused".
> >>>
> >>> On Tue, Dec 27, 2022 at 9:41 AM Deep Dish <deeepdish@xxxxxxxxx> wrote:
> >>>
> >>>> I built a net-new Quincy cluster (17.2.5) using ceph orch as follows:
> >>>>
> >>>> 2x mgrs
> >>>> 4x rgw
> >>>> 5x mon
> >>>> 4x rgw
> >>>> 5x mds
> >>>> 6x osd hosts w/ 10 drives each --> will be growing to 7 osd hosts in
> >>>> the coming days.
> >>>>
> >>>> I migrated all data from my legacy Nautilus cluster (via rbd-mirror,
> >>>> rclone for s3 buckets, etc.). All moved over successfully without issue.
> >>>>
> >>>> The cluster went through a series of rebalancing events (adding
> >>>> capacity, osd nodes, changing the fault domain for EC volumes).
> >>>>
> >>>> It's settled now; however, throughout the process all of my RGW nodes
> >>>> are no longer part of the cluster -- meaning ceph doesn't recognize /
> >>>> detect them, despite containers, networking, etc. all being set up
> >>>> correctly. This also means I'm unable to manage any RGW functions (via
> >>>> the dashboard or cli). As an example via cli (within the cephadm shell):
> >>>>
> >>>> # radosgw-admin pools list
> >>>> could not list placement set: (2) No such file or directory
> >>>>
> >>>> I have data in buckets, how can I get my RGWs to come back online?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx