Okay, after writing this mail, I might have found what's wrong.
The message
monclient(hunting): handle_auth_bad_method server allowed_methods [2]
but i only support [2]
makes no sense on its own, but it pointed me in the right direction when
I had a pod that refused to start even after deleting it multiple times.
I noticed that it was always scheduled on the very same host, so something about
the host itself must have caused it. All radosgw daemons are
bootstrapped using a bootstrap key and the resulting auth key is
persisted to /var/lib/ceph/radosgw/ on the host. After deleting that
directory on all hosts and restarting the deployment, all pods came back
up again. So I guess something was wrong with the keys stored on some of
the host machines.
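For reference, the cleanup was essentially this (a sketch rather than the
exact commands; the host list, the "ceph" namespace and the "rgw"
deployment name are just stand-ins for my setup):

  # On every host that runs radosgw pods: remove the stale per-host keyrings
  for host in host1 host2 host3; do
      ssh "$host" 'sudo rm -rf /var/lib/ceph/radosgw/*'
  done

  # Restart the deployment so the containers re-bootstrap their auth keys
  # from the bootstrap key on the next start
  kubectl -n ceph rollout restart deployment/rgw

As far as I can tell, the ceph-daemon entrypoint recreates the keyring on
startup as long as the bootstrap key is still in place, so nothing else
needed to be restored by hand.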
Janek
On 31/05/2022 11:08, Janek Bevendorff wrote:
Hi,
This is an issue I've been having since at least Ceph 15.x and I
haven't found a way around it yet. I have a bunch of radosgw nodes in
a Kubernetes cluster (using the ceph/ceph-daemon Docker image) and
once every few container restarts, the daemon decides to crash at
startup for unknown reasons, resulting in a crash loop. When I delete
the entire pod and try again, it usually boots up fine (though not
always).
There is no obvious error message. When I set DEBUG=stayalive, all I
get is:
2022-05-31 08:51:39 /opt/ceph-container/bin/entrypoint.sh: STAYALIVE:
container will not die if a command fails.
2022-05-31 08:51:39 /opt/ceph-container/bin/entrypoint.sh: static:
does not generate config
2022-05-31 08:51:39 /opt/ceph-container/bin/entrypoint.sh: SUCCESS
exec: PID 51: spawning /usr/bin/radosgw --cluster ceph --setuser ceph
--setgroup ceph --default-log-to-stderr=true --err-to-stderr=true
--default-log-to-file=false --foreground -n client.rgw.XXX -k
/var/lib/ceph/radosgw/ceph-rgw.XXX/keyring
exec: Waiting 51 to quit
2022-05-31T08:51:39.355+0000 7f23c0fe9700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2]
failed to fetch mon config (--no-mon-config to skip)
teardown: managing teardown after SIGCHLD
teardown: Waiting PID 51 to terminate
teardown: Process 51 is terminated
/opt/ceph-container/bin/docker_exec.sh: line 14: warning:
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
/opt/ceph-container/bin/docker_exec.sh: line 6: warning:
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
An issue occured and you asked me to stay alive.
You can connect to me with: sudo docker exec -i -t /bin/bash
The current environment variables will be reloaded by this bash to be
in a similar context.
When debugging is over stop me with: pkill sleep
I'll sleep endlessly waiting for you darling, bye bye
/opt/ceph-container/bin/docker_exec.sh: line 6: warning:
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
The actual error seems to be "warning: run_pending_traps: bad value in
trap_list", but I have no idea how to fix that or why that even happens.
This is super annoying, because over time the number of live radosgw
containers shrinks as more and more pods end up stuck in a
CrashLoopBackOff state. I then have to manually delete all those pods
so that they get rescheduled, which works in roughly 3 out of 4 attempts.
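The manual cleanup boils down to something like this (a sketch; the
"ceph" namespace and the app=rgw label are just examples, use whatever
selects your rgw pods):

  # Delete every rgw pod stuck in CrashLoopBackOff so it gets rescheduled
  kubectl -n ceph get pods -l app=rgw --no-headers \
      | awk '$3 == "CrashLoopBackOff" {print $1}' \
      | xargs -r kubectl -n ceph delete pod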
The radosgw containers are running version 16.2.5 (the latest version
available for the container image); the rest of the cluster is on 16.2.9.
Any help would be greatly appreciated.
Janek
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx