Okay, after writing this mail, I might have found what's wrong.
The message
monclient(hunting): handle_auth_bad_method server allowed_methods [2]
but i only support [2]
makes no sense on its own, but it pointed me in the right direction when
I had a pod that refused to start even after deleting it multiple times.
I noticed that it was always scheduled on the very same host, so something about
the host itself must have caused it. All radosgw daemons are
bootstrapped using a bootstrap key and the resulting auth key is
persisted to /var/lib/ceph/radosgw/ on the host. After deleting that
directory on all hosts and restarting the deployment, all pods came back
up again. So I guess something was wrong with the keys stored on some of
the host machines.
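For reference, the cleanup was essentially this (a sketch rather than the
exact commands; the host list, the "ceph" namespace and the "rgw"
deployment name are just stand-ins for my setup):

  # On every host that runs radosgw pods: remove the stale per-host keyrings
  for host in host1 host2 host3; do
      ssh "$host" 'sudo rm -rf /var/lib/ceph/radosgw/*'
  done

  # Restart the deployment so the containers re-bootstrap their auth keys
  # from the bootstrap key on the next start
  kubectl -n ceph rollout restart deployment/rgw

As far as I can tell, the ceph-daemon entrypoint recreates the keyring on
startup as long as the bootstrap key is still in place, so nothing else
needed to be restored by hand.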
Janek
On 31/05/2022 11:08, Janek Bevendorff wrote:
Hi,
This is an issue I've been having since at least Ceph 15.x and I
haven't found a way around it yet. I have a bunch of radosgw nodes in
a Kubernetes cluster (using the ceph/ceph-daemon Docker image) and
once every few container restarts, the daemon decides to crash at
startup for unknown reasons, resulting in a crash loop. When I delete
the entire pod and try again, it usually boots up fine (though not
always).
There is no obvious error message. When I set DEBUG=stayalive, all I
get is:
2022-05-31 08:51:39 /opt/ceph-container/bin/entrypoint.sh: STAYALIVE:
container will not die if a command fails.
2022-05-31 08:51:39 /opt/ceph-container/bin/entrypoint.sh: static:
does not generate config
2022-05-31 08:51:39 /opt/ceph-container/bin/entrypoint.sh: SUCCESS
exec: PID 51: spawning /usr/bin/radosgw --cluster ceph --setuser ceph
--setgroup ceph --default-log-to-stderr=true --err-to-stderr=true
--default-log-to-file=false --foreground -n client.rgw.XXX -k
/var/lib/ceph/radosgw/ceph-rgw.XXX/keyring
exec: Waiting 51 to quit
2022-05-31T08:51:39.355+0000 7f23c0fe9700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2]
failed to fetch mon config (--no-mon-config to skip)
teardown: managing teardown after SIGCHLD
teardown: Waiting PID 51 to terminate
teardown: Process 51 is terminated
/opt/ceph-container/bin/docker_exec.sh: line 14: warning:
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
/opt/ceph-container/bin/docker_exec.sh: line 6: warning:
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
An issue occured and you asked me to stay alive.
You can connect to me with: sudo docker exec -i -t /bin/bash
The current environment variables will be reloaded by this bash to be
in a similar context.
When debugging is over stop me with: pkill sleep
I'll sleep endlessly waiting for you darling, bye bye
/opt/ceph-container/bin/docker_exec.sh: line 6: warning:
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
The actual error seems to be "warning: run_pending_traps: bad value in
trap_list", but I have no idea how to fix that or why that even happens.
This is super annoying, because over time the number of live radosgw
containers shrinks as more and more pods end up stuck in a
CrashLoopBackOff state. I then have to manually delete all those pods
so that they get rescheduled, which works in roughly 3 out of 4 attempts.
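The manual cleanup boils down to something like this (a sketch; the
"ceph" namespace and the app=rgw label are just examples, use whatever
selects your rgw pods):

  # Delete every rgw pod stuck in CrashLoopBackOff so it gets rescheduled
  kubectl -n ceph get pods -l app=rgw --no-headers \
      | awk '$3 == "CrashLoopBackOff" {print $1}' \
      | xargs -r kubectl -n ceph delete pod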
The radosgw containers are running version 16.2.5 (the latest version
available for the container image); the rest of the cluster is on 16.2.9.
Any help would be greatly appreciated.
Janek
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx