Re: Strange container restarts?

Tim Holloway <timh@xxxxxxxxxxxxx> · Wed, 13 Nov 2024 08:52:19 -0500

Things are a little more complex. Managed (container) resources are 
handled by systemd, which typically auto-restarts failed services. Ceph 
diverts the container logs (what you'd get from "docker logs 
container-id") into the systemd journal, so a journalctl check is 
advised, although crashing containers have a lamentable tendency not 
to log what took them down.
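
A rough sketch of that journalctl check, assuming a cephadm deployment 
where each daemon runs as a systemd unit named 
ceph-<fsid>@<daemon>.service. The helper name unit_for and the daemon 
name osd.3 are placeholders for illustration, not values from this 
thread.

```shell
# Build the systemd unit name that cephadm uses for a daemon.
# $1 = cluster fsid (as printed by "ceph fsid"), $2 = daemon name, e.g. osd.3
unit_for() {
    printf 'ceph-%s@%s.service\n' "$1" "$2"
}

# Example on a live host (osd.3 is a placeholder daemon name):
#   unit=$(unit_for "$(ceph fsid)" osd.3)
#   journalctl -u "$unit" --since "1 hour ago"
#   systemctl show -p NRestarts "$unit"   # how often systemd restarted it
```

If NRestarts stays at 0 while the log messages keep appearing, systemd 
is probably not restarting the daemon at all.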

Some subsystems within a container have their own loggers which can be 
configured. I think that includes, for example, Prometheus. In that 
case, it's important to ensure that the location they're set to log to 
is OUTSIDE the container, as otherwise they'll log to a file inside the 
container's own filesystem, which is destroyed when the container 
terminates, and thus the evidence will be lost.

This is, of course, aside from problems with the containers themselves. 
It's always prudent to ensure that there's plenty of spare RAM for the 
container to run in and that the root filesystem ("/") has enough free 
space to hold the generated images, which can potentially be quite large.
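
A minimal sketch of such a resource check. The helper df_use_pct is a 
hypothetical name; it only pulls the use percentage out of POSIX-format 
"df -P" output, and any threshold you compare it against is your own 
choice.

```shell
# Read "df -P" output on stdin and print the use percentage (without
# the % sign) of the first filesystem listed.
df_use_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example on a live host:
#   df -P / | df_use_pct     # how full the root filesystem is, in percent
#   free -m                  # spare RAM, in MiB
#   podman system df         # space used by images and containers
```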

    Tim

On 11/12/24 05:59, Eugen Block wrote:
I don't see OSD-related exec_died messages in Pacific, but on Quincy 
they are logged as well. I can trigger them simply by running 
'cephadm ls', so it's just the regular check, no need to worry about 
that. They are not triggered if you only run 'cephadm ls --no-detail', 
but one would have to look through the code to understand what exactly 
the full ls command queries. But as I wrote, this isn't an issue, just a 
regular check.
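
One way to reproduce that observation is to count the exec_died lines 
before and after running the command. This is only a sketch: 
count_exec_died is a hypothetical helper, /var/log/syslog is the path 
used in the original report, and the sleep is a guess at how long the 
messages take to arrive.

```shell
# Count the lines in log file $1 that mention exec_died.
count_exec_died() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'exec_died' "$1" || true
}

# Example on a live host:
#   before=$(count_exec_died /var/log/syslog)
#   cephadm ls > /dev/null          # compare with: cephadm ls --no-detail
#   sleep 5
#   after=$(count_exec_died /var/log/syslog)
#   echo "new exec_died lines: $((after - before))"
```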

Quoting Eugen Block <eblock@xxxxxx>:

Hi,

I haven't looked too deeply into it yet, but I think it's the regular 
cephadm check. The timestamps should match those in 
/var/log/ceph/cephadm.log, where you can see something like this:

cephadm ['--image', '{YOUR_REGISTRY}', 'ls']

It goes through your inventory and runs several 'gather-facts' 
commands and a couple more. I don't think you need to worry about this.
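
To compare the timestamps by eye, something like the following could 
be used. It is a sketch only: last_matches is a hypothetical helper, and 
the search string for cephadm.log is taken from the example line above.

```shell
# Print the last $3 lines (default 5) of file $2 that contain the
# fixed string $1.
last_matches() {
    grep -F -- "$1" "$2" | tail -n "${3:-5}"
}

# Example on a live host:
#   last_matches exec_died /var/log/syslog
#   last_matches "'ls']" /var/log/ceph/cephadm.log
```

If the two lists show the same times, the messages come from the 
regular check.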

Regards,
Eugen

Quoting Jan Marek <jmarek@xxxxxx>:

Hello,

we have a Ceph cluster consisting of 12 hosts, and every host has
12 NVMe "disks".

On most of these hosts (9 of 12) we see errors in the logs; see the
attached file.

We tried to investigate this problem and found the following:

1) On every affected host, only one OSD is involved. So it is probably
not a general problem in version 18.2.2, because then it would show up
on the other OSDs too, not just one per host?

2) Sometimes one of these OSDs crashes :-( It seems that the crashed
OSDs come from the set of OSDs that have this problem.

3) The Ceph cluster runs OK and "doesn't know" about any problem
with these OSDs. It seems that a new instance of the ceph-osd
daemon tries to start either podman or conmon itself. We checked
the PID files for conmon, but they seem to be OK?

4) We checked the 'ceph orch' command, but it does not try to
start these containers, because it knows that they exist and are
running ('ceph orch ps' lists these containers as running).

5) I tried pausing the orchestrator, but I still found these
entries in syslog... :-(

Please, is there any way to find out where the problem is and
stop this?

All of our Ceph hosts are provisioned by Ansible, so the
environment is the same everywhere.

On every machine we have podman version 4.3.1+ds1-8+deb12u1 and
conmon version 2.1.6+ds1-1. OS is Debian bookworm.

The attached logs were produced with:

grep exec_died /var/log/syslog

Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx