Containerized radosgw crashes randomly at startup

Hi,

This is an issue I've been having since at least Ceph 15.x, and I haven't found a way around it yet. I have a bunch of radosgw nodes in a Kubernetes cluster (using the ceph/ceph-daemon Docker image), and once every few container restarts, the daemon crashes at startup for unknown reasons, resulting in a crash loop. When I delete the entire pod and try again, it boots up fine most of the time (not always).

There is no obvious error message. When I set DEBUG=stayalive, all I get is:

2022-05-31 08:51:39  /opt/ceph-container/bin/entrypoint.sh: STAYALIVE: container will not die if a command fails.
2022-05-31 08:51:39  /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
2022-05-31 08:51:39  /opt/ceph-container/bin/entrypoint.sh: SUCCESS
exec: PID 51: spawning /usr/bin/radosgw --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false --foreground -n client.rgw.XXX -k /var/lib/ceph/radosgw/ceph-rgw.XXX/keyring
exec: Waiting 51 to quit
2022-05-31T08:51:39.355+0000 7f23c0fe9700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
failed to fetch mon config (--no-mon-config to skip)
teardown: managing teardown after SIGCHLD
teardown: Waiting PID 51 to terminate
teardown: Process 51 is terminated
/opt/ceph-container/bin/docker_exec.sh: line 14: warning: run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
/opt/ceph-container/bin/docker_exec.sh: line 6: warning: run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
An issue occured and you asked me to stay alive.
You can connect to me with: sudo docker exec -i -t  /bin/bash
The current environment variables will be reloaded by this bash to be in a similar context.
When debugging is over stop me with: pkill sleep
I'll sleep endlessly waiting for you darling, bye bye
/opt/ceph-container/bin/docker_exec.sh: line 6: warning: run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0


The actual error seems to be "warning: run_pending_traps: bad value in trap_list", but I have no idea how to fix that or why that even happens.

This is super annoying because the number of live radosgw containers shrinks over time, until at some point most pods are stuck in a CrashLoopBackOff state. I then have to manually delete all those pods so that they get rescheduled, which works in about 3 out of 4 attempts.
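
For illustration, the manual cleanup could be scripted roughly as follows. This is only a sketch of the workaround, not a fix for the underlying crash. It assumes kubectl is on the PATH and already configured for the cluster; the namespace and label selector arguments are placeholders, and the pod names in the usage note are made up. It relies on the standard pod status layout, where a crash-looping container reports state.waiting.reason as "CrashLoopBackOff".

```python
import json
import subprocess


def crashlooping_pods(pod_list):
    """Return names of pods that have a container waiting in CrashLoopBackOff.

    pod_list is the parsed JSON document from `kubectl get pods -o json`.
    """
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for container in statuses:
            # "waiting" is absent (or null) for running/terminated containers.
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(pod["metadata"]["name"])
                break  # one crash-looping container is enough for this pod
    return names


def delete_crashlooping(namespace, selector):
    """Delete crash-looping pods so that their controller reschedules them.

    namespace and selector are placeholders for whatever the deployment uses.
    """
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    for name in crashlooping_pods(json.loads(result.stdout)):
        subprocess.run(["kubectl", "delete", "pod", "-n", namespace, name], check=True)
```

Something like delete_crashlooping("ceph", "app=radosgw") could then be run periodically, with both arguments replaced by the real namespace and pod labels.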

The radosgw containers are running version 16.2.5 (the latest version available for the container image); the rest of the cluster is on 16.2.9.

Any help would be greatly appreciated.

Janek
