Norf, I missed half of the answers...
* the 8TB disks hold around 80-90 PGs each (the 16TB disks around 160-180)
* per PG we have around 40k objects

Overall that is about 170m objects in 1.2PiB of storage.

Am Di., 22. März 2022 um 09:29 Uhr schrieb Boris Behrens <bb@xxxxxxxxx>:

> Good morning K,
>
> the "freshly done" host, where it happened last got:
> * 21x 8TB TOSHIBA MG06ACA800E (Spinning)
> * No block.db devices (just removed the 2 cache SSDs by syncing the disks
>   out, wiping them and adding them back without block.db)
> * 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
> * 256GB ECC RAM
> * 2x 10GBit Network (802.3ad encap3+4 lacp-fast bonding)
>
> # free -g
>                total        used        free      shared  buff/cache   available
> Mem:             251          87           2           0         161         162
> Swap:             15           0          15
>
> We had this problem with one of the 21 OSDs, but I expect it to happen
> random some time in the future. Cluster got 212 OSDs and 2-3 of them get at
> least marked down once per day. Sometime they get marked down >3 times, so
> systemd hast to restart the OSD process.
>
> Cheers
> Boris
>
> Am Di., 22. März 2022 um 07:48 Uhr schrieb Konstantin Shalygin <k0ste@xxxxxxxx>:
>
>> Hi,
>>
>> What is actual hardware (CPU, spinners, NVMe, network)?
>> This is HDD with block.db on NVMe?
>> How much PG per osd?
>> How much obj per PG?
>>
>> k
>> Sent from my iPhone
>>
>> > On 20 Mar 2022, at 19:59, Boris Behrens <bb@xxxxxxxxx> wrote:
>> >
>> > So,
>> > I have tried to remove the OSDs, wipe the disks and sync them back in
>> > without block.db SSD. (Still in progress, 212 spinning disks take time to
>> > out and in again)
>> > And I just experienced them same behavior on one OSD on a host where all
>> > disks got synced in new. This disk was marked as in yesterday and is still
>> > backfilling.
>> >
>> > Right before the OSD get marked as down by other OSDs I observe a ton of
>> > these log entries:
>> > 2022-03-20T11:54:40.759+0000 7ff9eef5d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
>> > ...
>> > 2022-03-20T11:55:02.370+0000 7ff9ee75c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
>> > 2022-03-20T11:55:03.290+0000 7ff9d3c8a700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
>> > ..
>> > 2022-03-20T11:55:03.390+0000 7ff9df4a1700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.48 down, but it is still running
>> > 2022-03-20T11:55:03.390+0000 7ff9df4a1700 0 log_channel(cluster) log [DBG] : map e514383 wrongly marked me down at e514383
>> > 2022-03-20T11:55:03.390+0000 7ff9df4a1700 -1 osd.48 514383 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
>> >
>> > There are 21 OSDs without cache SSD in the host.
>> > All disks are attached to a single Broadcom / LSI SAS3008 SAS controller.
>> > 256GB ECC-RAM / 40 CPU cores.
>> >
>> > What else can I do to find the problem?
>> >
>> >> Am Di., 8. März 2022 um 12:25 Uhr schrieb Francois Legrand <fleg@xxxxxxxxxxxxxx>:
>> >>
>> >> Hi,
>> >>
>> >> The last 2 osd I recreated were on december 30 and february 8.
>> >>
>> >> I totally agree that ssd cache are a terrible spof. I think that's an
>> >> option if you use 1 ssd/nvme for 1 or 2 osd, but the cost is then very
>> >> high. Using 1 ssd for 10 osd increase the risk for almost no gain because
>> >> the ssd is 10 times faster but has 10 times more access !
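
(For reference, the numbers at the top of this mail can be pulled from any cluster with the standard CLI; `<data-pool>` below is just a placeholder for your RGW data pool, so adjust as needed:)

ceph osd df tree                        # PGS column = placement groups per OSD
ceph df detail                          # object count and stored bytes per pool
ceph pg ls-by-pool <data-pool> | head   # OBJECTS column = objects per PG

Objects per PG is roughly the pool's object count divided by its pg_num.
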
>> >> Indeed, we did some benches with nvme for the wal db (1 nvme for ~10 >> >> osds), and the gain was not tremendous, so we decided not use them ! >> >> F. >> >> >> >> >> >> Le 08/03/2022 à 11:57, Boris Behrens a écrit : >> >> >> >> Hi Francois, >> >> >> >> thanks for the reminder. We offline compacted all of the OSDs when we >> >> reinstalled the hosts with the new OS. >> >> But actually reinstalling them was never on my list. >> >> >> >> I could try that and in the same go I can remove all the cache SSDs >> (when >> >> one SSD share the cache for 10 OSDs this is a horrible SPOF) and reuse >> the >> >> SSDs as OSDs for the smaller pools in a RGW (like log and meta). >> >> >> >> How long ago did you recreate the earliest OSD? >> >> >> >> Cheers >> >> Boris >> >> >> >> Am Di., 8. März 2022 um 10:03 Uhr schrieb Francois Legrand < >> >> fleg@xxxxxxxxxxxxxx>: >> >> >> >>> Hi, >> >>> We also had this kind of problems after upgrading to octopus. Maybe >> you >> >>> can play with the hearthbeat grace time ( >> >>> >> https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/ >> >>> ) to tell osds to wait a little more before declaring another osd >> down ! >> >>> We also try to fix the problem by manually compact the down osd >> >>> (something like : systemctl stop ceph-osd@74; sleep 10; >> >>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact; >> >>> systemctl start ceph-osd@74). >> >>> This worked a few times, but some osd went down again, thus we simply >> >>> wait for the datas to be reconstructed elswhere and then reinstall the >> >>> dead osd : >> >>> ceph osd destroy 74 --yes-i-really-mean-it >> >>> ceph-volume lvm zap /dev/sde --destroy >> >>> ceph-volume lvm create --osd-id 74 --data /dev/sde >> >>> >> >>> This seems to fix the issue for us (up to now). >> >>> >> >>> F. >> >>> >> >>> Le 08/03/2022 à 09:35, Boris Behrens a écrit : >> >>>> Yes, this is something we know and we disabled it, because we ran >> into >> >>> the >> >>>> problem that PGs went unavailable when two or more OSDs went offline. >> >>>> >> >>>> I am searching for the reason WHY this happens. >> >>>> Currently we have set the service file to restart=always and removed >> the >> >>>> StartLimitBurst from the service file. >> >>>> >> >>>> We just don't understand why the OSDs don't answer the heathbeat. The >> >>> OSDs >> >>>> that are flapping are random in terms of Host, Disksize, having SSD >> >>>> block.db or not. >> >>>> Network connectivity issues is something that I would rule out, >> because >> >>> the >> >>>> cluster went from "nothing ever happens except IOPS" to "random OSDs >> are >> >>>> marked DOWN until they kill themself" with the update from nautilus >> to >> >>>> octopus. >> >>>> >> >>>> I am out of ideas and hoped this was a bug in 15.2.15, but after the >> >>> update >> >>>> things got worse (happen more often). >> >>>> We tried to: >> >>>> * disable swap >> >>>> * more swap >> >>>> * disable bluefs_buffered_io >> >>>> * disable write cache for all disks >> >>>> * disable scrubbing >> >>>> * reinstall with new OS (from centos7 to ubuntu 20.04) >> >>>> * disable cluster_network (so there is only one way to communicate) >> >>>> * increase txqueuelen on the network interfaces >> >>>> * everything together >> >>>> >> >>>> >> >>>> What we try next: add more SATA controllers, so there are not 24 >> disks >> >>>> attached to a single controller, but I doubt this will help. >> >>>> >> >>>> Cheers >> >>>> Boris >> >>>> >> >>>> >> >>>> >> >>>> Am Di., 8. 
März 2022 um 09:10 Uhr schrieb Dan van der Ster < >> >>>> dvanders@xxxxxxxxx>: >> >>>> >> >>>>> Here's the reason they exit: >> >>>>> >> >>>>> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 > >> >>>>> osd_max_markdown_count 5 in last 600.000000 seconds, shutting down >> >>>>> >> >>>>> If an osd flaps (marked down, then up) 6 times in 10 minutes, it >> >>>>> exits. (This is a safety measure). >> >>>>> >> >>>>> It's normally caused by a network issue -- other OSDs are telling >> the >> >>>>> mon that he is down, but then the OSD himself tells the mon that >> he's >> >>>>> up! >> >>>>> >> >>>>> Cheers, Dan >> >>>>> >> >>>>> On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens <bb@xxxxxxxxx> wrote: >> >>>>>> Hi, >> >>>>>> >> >>>>>> we've had the problem with OSDs marked as offline since we updated >> to >> >>>>>> octopus and hope the problem would be fixed with the latest patch. >> We >> >>>>> have >> >>>>>> this kind of problem only with octopus and there only with the big >> s3 >> >>>>>> cluster. >> >>>>>> * Hosts are all Ubuntu 20,04 and we've set the txqueuelen to 10k >> >>>>>> * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond) >> >>>>>> * We only use the frontend network. >> >>>>>> * All disks are spinning, some have block.db devices. >> >>>>>> * All disks are bluestore >> >>>>>> * configs are mostly defaults >> >>>>>> * we've set the OSDs to restart=always without a limit, because we >> had >> >>>>> the >> >>>>>> problem with unavailable PGs when two OSDs are marked as offline >> and >> >>> the >> >>>>>> share PGs. >> >>>>>> >> >>>>>> But since we installed the latest patch we are experiencing more >> OSD >> >>>>> downs >> >>>>>> and even crashes. >> >>>>>> I tried to remove as much duplicated lines as possible. >> >>>>>> >> >>>>>> Is the numa error a problem? >> >>>>>> Why do OSD daemons not respond to hearthbeats? I mean even when the >> >>> disk >> >>>>> is >> >>>>>> totally loaded with IO, the system itself should answer >> heathbeats, or >> >>>>> am I >> >>>>>> missing something? >> >>>>>> >> >>>>>> I really hope some of you could send me on the correct way to solve >> >>> this >> >>>>>> nasty problem. >> >>>>>> >> >>>>>> This is how the latest crash looks like >> >>>>>> Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+0000 >> >>>>>> 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify >> >>>>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> ... 
>> >>>>>> Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+0000 >> >>>>>> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify >> >>>>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) >> ** >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: in thread 7f5ef1501700 >> >>>>>> thread_name:tp_osd_tp >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: ceph version 15.2.16 >> >>>>>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable) >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 1: (()+0x143c0) >> >>> [0x7f5f0d4623c0] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2: (pthread_kill()+0x38) >> >>>>>> [0x7f5f0d45ef08] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 3: >> >>>>>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char >> >>> const*, >> >>>>>> unsigned long)+0x471) [0x55a699a01201] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 4: >> >>>>>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, >> unsigned >> >>>>>> long, unsigned long)+0x8e) [0x55a699a0199e] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 5: >> >>>>>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0) >> >>>>>> [0x55a699a224b0] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 6: >> >>>>>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) >> [0x55a699a252c4] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 7: (()+0x8609) >> >>> [0x7f5f0d456609] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 8: (clone()+0x43) >> >>>>> [0x7f5f0cfc0163] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+0000 >> >>>>>> 7f5ef1501700 -1 *** Caught signal (Aborted) ** >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: in thread 7f5ef1501700 >> >>>>>> thread_name:tp_osd_tp >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: ceph version 15.2.16 >> >>>>>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable) >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 1: (()+0x143c0) >> >>> [0x7f5f0d4623c0] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2: (pthread_kill()+0x38) >> >>>>>> [0x7f5f0d45ef08] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 3: >> >>>>>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char >> >>> const*, >> >>>>>> unsigned long)+0x471) [0x55a699a01201] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 4: >> >>>>>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, >> unsigned >> >>>>>> long, unsigned long)+0x8e) [0x55a699a0199e] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 5: >> >>>>>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0) >> >>>>>> [0x55a699a224b0] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 6: >> >>>>>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) >> [0x55a699a252c4] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 7: (()+0x8609) >> >>> [0x7f5f0d456609] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 8: (clone()+0x43) >> >>>>> [0x7f5f0cfc0163] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: NOTE: a copy of the >> >>> executable, >> >>>>> or >> >>>>>> `objdump -rdS <executable>` is needed to interpret this. 
>> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: -5246> >> >>>>> 2022-03-07T17:49:07.678+0000 >> >>>>>> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify >> >>>>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 0> >> >>>>> 2022-03-07T17:53:07.387+0000 >> >>>>>> 7f5ef1501700 -1 *** Caught signal (Aborted) ** >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: in thread 7f5ef1501700 >> >>>>>> thread_name:tp_osd_tp >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: ceph version 15.2.16 >> >>>>>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable) >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 1: (()+0x143c0) >> >>> [0x7f5f0d4623c0] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2: (pthread_kill()+0x38) >> >>>>>> [0x7f5f0d45ef08] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 3: >> >>>>>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char >> >>> const*, >> >>>>>> unsigned long)+0x471) [0x55a699a01201] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 4: >> >>>>>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, >> unsigned >> >>>>>> long, unsigned long)+0x8e) [0x55a699a0199e] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 5: >> >>>>>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0) >> >>>>>> [0x55a699a224b0] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 6: >> >>>>>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) >> [0x55a699a252c4] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 7: (()+0x8609) >> >>> [0x7f5f0d456609] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 8: (clone()+0x43) >> >>>>> [0x7f5f0cfc0163] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: NOTE: a copy of the >> >>> executable, >> >>>>> or >> >>>>>> `objdump -rdS <executable>` is needed to interpret this. 
>> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: -5246> >> >>>>> 2022-03-07T17:49:07.678+0000 >> >>>>>> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify >> >>>>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 0> >> >>>>> 2022-03-07T17:53:07.387+0000 >> >>>>>> 7f5ef1501700 -1 *** Caught signal (Aborted) ** >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: in thread 7f5ef1501700 >> >>>>>> thread_name:tp_osd_tp >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: ceph version 15.2.16 >> >>>>>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable) >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 1: (()+0x143c0) >> >>> [0x7f5f0d4623c0] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2: (pthread_kill()+0x38) >> >>>>>> [0x7f5f0d45ef08] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 3: >> >>>>>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char >> >>> const*, >> >>>>>> unsigned long)+0x471) [0x55a699a01201] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 4: >> >>>>>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, >> unsigned >> >>>>>> long, unsigned long)+0x8e) [0x55a699a0199e] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 5: >> >>>>>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0) >> >>>>>> [0x55a699a224b0] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 6: >> >>>>>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) >> [0x55a699a252c4] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 7: (()+0x8609) >> >>> [0x7f5f0d456609] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 8: (clone()+0x43) >> >>>>> [0x7f5f0cfc0163] >> >>>>>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: NOTE: a copy of the >> >>> executable, >> >>>>> or >> >>>>>> `objdump -rdS <executable>` is needed to interpret this. >> >>>>>> Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service: Main >> process >> >>>>>> exited, code=killed, status=6/ABRT >> >>>>>> Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service: Failed >> with >> >>>>> result >> >>>>>> 'signal'. >> >>>>>> Mar 07 17:53:19 s3db18 systemd[1]: ceph-osd@161.service: Scheduled >> >>>>> restart >> >>>>>> job, restart counter is at 1. >> >>>>>> Mar 07 17:53:19 s3db18 systemd[1]: Stopped Ceph object storage >> daemon >> >>>>>> osd.161. >> >>>>>> Mar 07 17:53:19 s3db18 systemd[1]: Starting Ceph object storage >> daemon >> >>>>>> osd.161... >> >>>>>> Mar 07 17:53:19 s3db18 systemd[1]: Started Ceph object storage >> daemon >> >>>>>> osd.161. >> >>>>>> Mar 07 17:53:20 s3db18 ceph-osd[4009440]: >> 2022-03-07T17:53:20.498+0000 >> >>>>>> 7f9617781d80 -1 Falling back to public interface >> >>>>>> Mar 07 17:53:33 s3db18 ceph-osd[4009440]: >> 2022-03-07T17:53:33.906+0000 >> >>>>>> 7f9617781d80 -1 osd.161 489778 log_to_monitors {default=true} >> >>>>>> Mar 07 17:53:34 s3db18 ceph-osd[4009440]: >> 2022-03-07T17:53:34.206+0000 >> >>>>>> 7f96106f2700 -1 osd.161 489778 set_numa_affinity unable to identify >> >>>>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> ... 
>> >>>>>> Mar 07 18:58:12 s3db18 ceph-osd[4009440]: >> 2022-03-07T18:58:12.717+0000 >> >>>>>> 7f96106f2700 -1 osd.161 489880 set_numa_affinity unable to identify >> >>>>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> >> >>>>>> And this is how an it looks like when OSDs get marked as out: >> >>>>>> Mar 03 19:29:04 s3db13 ceph-osd[5792]: 2022-03-03T19:29:04.857+0000 >> >>>>>> 7f16115e0700 -1 osd.97 485814 heartbeat_check: no reply from >> >>>>>> [XX:22::65]:6886 osd.124 since back 2022-03-03T19:28:41.250692+0000 >> >>> front >> >>>>>> 2022-03-03T19:28:41.250649+0000 (oldest deadline >> >>>>>> 2022-03-03T19:29:04.150352+0000) >> >>>>>> ...130 time... >> >>>>>> Mar 03 21:55:37 s3db13 ceph-osd[5792]: 2022-03-03T21:55:37.844+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486383 heartbeat_check: no reply from >> >>>>>> [XX:22::65]:6941 osd.124 since back 2022-03-03T21:55:12.514627+0000 >> >>> front >> >>>>>> 2022-03-03T21:55:12.514649+0000 (oldest deadline >> >>>>>> 2022-03-03T21:55:36.613469+0000) >> >>>>>> Mar 04 00:00:05 s3db13 ceph-osd[5792]: 2022-03-04T00:00:05.035+0000 >> >>>>>> 7f1613080700 -1 received signal: Hangup from killall -q -1 >> ceph-mon >> >>>>>> ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror (PID: >> >>> 1385079) >> >>>>>> UID: 0 >> >>>>>> Mar 04 00:00:05 s3db13 ceph-osd[5792]: 2022-03-04T00:00:05.047+0000 >> >>>>>> 7f1613080700 -1 received signal: Hangup from (PID: 1385080) UID: >> 0 >> >>>>>> Mar 04 00:06:00 s3db13 sudo[1389262]: ceph : TTY=unknown ; >> PWD=/ ; >> >>>>>> USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde >> >>>>>> Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session): >> session >> >>>>>> opened for user root by (uid=0) >> >>>>>> Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session): >> session >> >>>>>> closed for user root >> >>>>>> Mar 04 00:06:01 s3db13 sudo[1389287]: ceph : TTY=unknown ; >> PWD=/ ; >> >>>>>> USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json >> /dev/sde >> >>>>>> Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session): >> session >> >>>>>> opened for user root by (uid=0) >> >>>>>> Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session): >> session >> >>>>>> closed for user root >> >>>>>> Mar 05 00:00:10 s3db13 ceph-osd[5792]: 2022-03-05T00:00:10.213+0000 >> >>>>>> 7f1613080700 -1 received signal: Hangup from killall -q -1 >> ceph-mon >> >>>>>> ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror (PID: >> >>> 2406262) >> >>>>>> UID: 0 >> >>>>>> Mar 05 00:00:10 s3db13 ceph-osd[5792]: 2022-03-05T00:00:10.237+0000 >> >>>>>> 7f1613080700 -1 received signal: Hangup from (PID: 2406263) UID: >> 0 >> >>>>>> Mar 05 00:08:03 s3db13 sudo[2411721]: ceph : TTY=unknown ; >> PWD=/ ; >> >>>>>> USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde >> >>>>>> Mar 05 00:08:03 s3db13 sudo[2411721]: pam_unix(sudo:session): >> session >> >>>>>> opened for user root by (uid=0) >> >>>>>> Mar 05 00:08:04 s3db13 sudo[2411721]: pam_unix(sudo:session): >> session >> >>>>>> closed for user root >> >>>>>> Mar 05 00:08:04 s3db13 sudo[2411725]: ceph : TTY=unknown ; >> PWD=/ ; >> >>>>>> USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json >> /dev/sde >> >>>>>> Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session): >> session >> >>>>>> opened for user root by (uid=0) >> >>>>>> Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session): >> session >> >>>>>> closed for user root >> >>>>>> Mar 05 19:19:49 s3db13 ceph-osd[5792]: 2022-03-05T19:19:49.189+0000 >> >>>>>> 
7f160fddd700 -1 osd.97 486852 set_numa_affinity unable to identify >> >>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> Mar 05 19:21:18 s3db13 ceph-osd[5792]: 2022-03-05T19:21:18.377+0000 >> >>>>>> 7f160fddd700 -1 osd.97 486858 set_numa_affinity unable to identify >> >>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> Mar 05 19:21:45 s3db13 ceph-osd[5792]: 2022-03-05T19:21:45.304+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from >> >>>>>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:21:21.762744+0000 >> >>> front >> >>>>>> 2022-03-05T19:21:21.762723+0000 (oldest deadline >> >>>>>> 2022-03-05T19:21:45.261347+0000) >> >>>>>> Mar 05 19:21:46 s3db13 ceph-osd[5792]: 2022-03-05T19:21:46.260+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from >> >>>>>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:21:21.762744+0000 >> >>> front >> >>>>>> 2022-03-05T19:21:21.762723+0000 (oldest deadline >> >>>>>> 2022-03-05T19:21:45.261347+0000) >> >>>>>> Mar 05 19:21:47 s3db13 ceph-osd[5792]: 2022-03-05T19:21:47.252+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from >> >>>>>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:21:21.762744+0000 >> >>> front >> >>>>>> 2022-03-05T19:21:21.762723+0000 (oldest deadline >> >>>>>> 2022-03-05T19:21:45.261347+0000) >> >>>>>> Mar 05 19:22:59 s3db13 ceph-osd[5792]: 2022-03-05T19:22:59.636+0000 >> >>>>>> 7f160fddd700 -1 osd.97 486869 set_numa_affinity unable to identify >> >>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> Mar 05 19:23:33 s3db13 ceph-osd[5792]: 2022-03-05T19:23:33.439+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+read+known_if_redirected e486872) >> >>>>>> Mar 05 19:23:34 s3db13 ceph-osd[5792]: 2022-03-05T19:23:34.458+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+read+known_if_redirected e486872) >> >>>>>> Mar 05 19:23:35 s3db13 ceph-osd[5792]: 2022-03-05T19:23:35.434+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from >> >>>>>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:23:09.928097+0000 >> >>> front >> >>>>>> 2022-03-05T19:23:09.928150+0000 (oldest deadline >> >>>>>> 2022-03-05T19:23:35.227545+0000) >> >>>>>> ... 
>> >>>>>> Mar 05 19:23:48 s3db13 ceph-osd[5792]: 2022-03-05T19:23:48.386+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+read+known_if_redirected e486872) >> >>>>>> Mar 05 19:23:49 s3db13 ceph-osd[5792]: 2022-03-05T19:23:49.362+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from >> >>>>>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:23:09.928097+0000 >> >>> front >> >>>>>> 2022-03-05T19:23:09.928150+0000 (oldest deadline >> >>>>>> 2022-03-05T19:23:35.227545+0000) >> >>>>>> Mar 05 19:23:49 s3db13 ceph-osd[5792]: 2022-03-05T19:23:49.362+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+read+known_if_redirected e486872) >> >>>>>> Mar 05 19:23:50 s3db13 ceph-osd[5792]: 2022-03-05T19:23:50.358+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486873 get_health_metrics reporting 2 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+read+known_if_redirected e486872) >> >>>>>> Mar 05 19:23:51 s3db13 ceph-osd[5792]: 2022-03-05T19:23:51.330+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d >> >>> 4:b0b12ee9:::gc.22:head >> >>>>>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9 >> >>>>>> ondisk+retry+read+known_if_redirected e486872) >> >>>>>> Mar 05 19:23:52 s3db13 ceph-osd[5792]: 2022-03-05T19:23:52.326+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d >> >>> 4:b0b12ee9:::gc.22:head >> >>>>>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9 >> >>>>>> ondisk+retry+read+known_if_redirected e486872) >> >>>>>> Mar 05 19:23:53 s3db13 ceph-osd[5792]: 2022-03-05T19:23:53.338+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d >> >>> 4:b0b12ee9:::gc.22:head >> >>>>>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9 >> >>>>>> ondisk+retry+read+known_if_redirected e486872) >> >>>>>> Mar 05 19:25:02 s3db13 ceph-osd[5792]: 2022-03-05T19:25:02.342+0000 >> >>>>>> 7f160fddd700 -1 osd.97 486878 set_numa_affinity unable to identify >> >>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> Mar 05 19:25:33 s3db13 ceph-osd[5792]: 2022-03-05T19:25:33.569+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 2 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486879) >> >>>>>> ... 
>> >>>>>> Mar 05 19:25:44 s3db13 ceph-osd[5792]: 2022-03-05T19:25:44.476+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486879) >> >>>>>> Mar 05 19:25:45 s3db13 ceph-osd[5792]: 2022-03-05T19:25:45.456+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from >> >>>>>> [XX:22::60]:6834 osd.171 ever on either front or back, first ping >> sent >> >>>>>> 2022-03-05T19:25:25.281582+0000 (oldest deadline >> >>>>>> 2022-03-05T19:25:45.281582+0000) >> >>>>>> Mar 05 19:25:45 s3db13 ceph-osd[5792]: 2022-03-05T19:25:45.456+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486879) >> >>>>>> ... >> >>>>>> Mar 05 19:26:08 s3db13 ceph-osd[5792]: 2022-03-05T19:26:08.363+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486879) >> >>>>>> Mar 05 19:26:09 s3db13 ceph-osd[5792]: 2022-03-05T19:26:09.371+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from >> >>>>>> [XX:22::60]:6834 osd.171 ever on either front or back, first ping >> sent >> >>>>>> 2022-03-05T19:25:25.281582+0000 (oldest deadline >> >>>>>> 2022-03-05T19:25:45.281582+0000) >> >>>>>> Mar 05 19:26:09 s3db13 ceph-osd[5792]: 2022-03-05T19:26:09.375+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486879) >> >>>>>> Mar 05 19:26:10 s3db13 ceph-osd[5792]: 2022-03-05T19:26:10.383+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486881 get_health_metrics reporting 3 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d >> >>> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486879) >> >>>>>> Mar 05 19:26:11 s3db13 ceph-osd[5792]: 2022-03-05T19:26:11.407+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d >> >>> 4:b0b12ee9:::gc.22:head >> >>>>>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11 >> >>>>>> ondisk+retry+read+known_if_redirected e486879) >> >>>>>> Mar 05 19:26:12 s3db13 ceph-osd[5792]: 2022-03-05T19:26:12.399+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1 slow >> ops, >> >>>>>> oldest is osd_op(client.2304224848.0:3139913 4.d >> >>> 4:b0b12ee9:::gc.22:head >> >>>>>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11 >> >>>>>> ondisk+retry+read+known_if_redirected e486879) >> >>>>>> Mar 05 19:27:24 s3db13 ceph-osd[5792]: 2022-03-05T19:27:24.975+0000 >> >>>>>> 7f160fddd700 -1 osd.97 486887 set_numa_affinity unable to identify >> >>> public >> >>>>>> interface '' numa node: (2) No such file or directory >> >>>>>> Mar 05 19:27:58 s3db13 ceph-osd[5792]: 2022-03-05T19:27:58.114+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow >> ops, >> >>>>>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d >> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486889) >> >>>>>> ... 
>> >>>>>> Mar 05 19:28:08 s3db13 ceph-osd[5792]: 2022-03-05T19:28:08.137+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow >> ops, >> >>>>>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d >> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486889) >> >>>>>> Mar 05 19:28:09 s3db13 ceph-osd[5792]: 2022-03-05T19:28:09.125+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from >> >>>>>> [XX:22::60]:6834 osd.171 ever on either front or back, first ping >> sent >> >>>>>> 2022-03-05T19:27:48.548094+0000 (oldest deadline >> >>>>>> 2022-03-05T19:28:08.548094+0000) >> >>>>>> Mar 05 19:28:09 s3db13 ceph-osd[5792]: 2022-03-05T19:28:09.125+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow >> ops, >> >>>>>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d >> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486889) >> >>>>>> ... >> >>>>>> Mar 05 19:28:29 s3db13 ceph-osd[5792]: 2022-03-05T19:28:29.060+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow >> ops, >> >>>>>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d >> (undecoded) >> >>>>>> ondisk+retry+write+known_if_redirected e486889) >> >>>>>> Mar 05 19:28:30 s3db13 ceph-osd[5792]: 2022-03-05T19:28:30.040+0000 >> >>>>>> 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from >> >>>>>> [XX:22::60]:6834 osd.171 ever on either front or back, first ping >> sent >> >>>>>> 2022-03-05T19:27:48.548094+0000 (oldest deadline >> >>>>>> 2022-03-05T19:28:08.548094+0000) >> >>>>>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.696+0000 >> >>>>>> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 > >> >>>>>> osd_max_markdown_count 5 in last 600.000000 seconds, shutting down >> >>>>>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000 >> >>>>>> 7f1613080700 -1 received signal: Interrupt from Kernel ( Could be >> >>>>>> generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0 >> >>>>>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000 >> >>>>>> 7f1613080700 -1 osd.97 486896 *** Got signal Interrupt *** >> >>>>>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000 >> >>>>>> 7f1613080700 -1 osd.97 486896 *** Immediate shutdown >> >>>>>> (osd_fast_shutdown=true) *** >> >>>>>> Mar 05 19:29:44 s3db13 systemd[1]: ceph-osd@97.service: Succeeded. >> >>>>>> Mar 05 19:29:54 s3db13 systemd[1]: ceph-osd@97.service: Scheduled >> >>>>> restart >> >>>>>> job, restart counter is at 1. >> >>>>>> Mar 05 19:29:54 s3db13 systemd[1]: Stopped Ceph object storage >> daemon >> >>>>>> osd.97. >> >>>>>> Mar 05 19:29:54 s3db13 systemd[1]: Starting Ceph object storage >> daemon >> >>>>>> osd.97... >> >>>>>> Mar 05 19:29:54 s3db13 systemd[1]: Started Ceph object storage >> daemon >> >>>>>> osd.97. >> >>>>>> Mar 05 19:29:55 s3db13 ceph-osd[3236773]: >> 2022-03-05T19:29:55.116+0000 >> >>>>>> 7f5852f74d80 -1 Falling back to public interface >> >>>>>> Mar 05 19:30:34 s3db13 ceph-osd[3236773]: >> 2022-03-05T19:30:34.746+0000 >> >>>>>> 7f5852f74d80 -1 osd.97 486896 log_to_monitors {default=true} >> >>>>>> -- >> >>>>>> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal >> abweichend >> >>> im >> >>>>>> groüen Saal. 
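
(For the archives: "restart=always without a limit", as mentioned earlier in the thread, can be done with a systemd drop-in roughly like the one below; the file name is arbitrary and the values are an example, not a copy of our unit files. Raising osd_max_markdown_count works around the 6-markdowns-in-600s shutdown Dan explained, but neither setting explains the missed heartbeats, they only keep the OSD process running.)

# /etc/systemd/system/ceph-osd@.service.d/override.conf
[Unit]
StartLimitIntervalSec=0

[Service]
Restart=always

# afterwards, on the host / cluster:
systemctl daemon-reload
ceph config set osd osd_max_markdown_count 10   # default is 5 markdowns per 600s
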
--
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
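
P.S.: for completeness, the offline compaction Francois described earlier in the thread, wrapped in a small script. A sketch only: the OSD id is passed as an argument, the default /var/lib/ceph/osd/ceph-<id> path is assumed, and if I remember correctly recent releases can also compact online via `ceph tell osd.<id> compact`.

#!/bin/bash
# Stop one OSD, compact its RocksDB offline, then start it again.
set -euo pipefail
OSD="$1"                 # e.g.: ./compact-osd.sh 74 (script name is arbitrary)
systemctl stop "ceph-osd@${OSD}"
sleep 10                 # extra safety margin, as in Francois' one-liner
ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD}" compact
systemctl start "ceph-osd@${OSD}"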