Hello Marco,
first of all, hyperconverged setups with publicly accessible VMs can be affected by DDoS attacks or other harmful events that cause cascading errors in your infrastructure.
Are you sure your network worked correctly at the time?
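A quick way to check is to scan the cluster log on one of the mon nodes for messages that usually point at the network rather than the disks (monitor elections, missed heartbeats, slow requests) during the incident window. A minimal sketch, assuming the default /var/log/ceph/ceph.log location and the standard timestamp format; adjust the window and patterns as needed:

#!/usr/bin/env python3
# Scan the cluster log for network-related symptoms in a time window.
# Assumption: default cluster log at /var/log/ceph/ceph.log on a mon node,
# with lines starting "YYYY-MM-DD HH:MM:SS".
import re
from datetime import datetime

LOG = "/var/log/ceph/ceph.log"
START = datetime(2019, 5, 11, 12, 30)   # Saturday afternoon, per the report
END   = datetime(2019, 5, 12, 9, 0)     # until Sunday morning

# messages that usually indicate network trouble rather than disk trouble
PATTERNS = re.compile(
    r"calling new monitor election|heartbeat_check: no reply|"
    r"slow request|wrongly marked me down", re.I)

with open(LOG, errors="replace") as f:
    for line in f:
        try:
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue                     # skip lines without a leading timestamp
        if START <= ts <= END and PATTERNS.search(line):
            print(line.rstrip())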
--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx
On Mon, 13 May 2019 at 11:43, Marco Gaiarin <gaio@xxxxxxxxx> wrote:
[It is not really a 'mortem', but...]
On Saturday afternoon, my 3-node Proxmox Ceph cluster had a big
'slowdown', which started at 12:35:24 with an OOM condition on one of
the 3 storage nodes, followed by another OOM on a second node at
12:43:31.
After that, all the bad things happened: stuck requests, SCSI timeouts
on VMs, MONs flip-flopping on the RBD clients.
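For reference, the OOM events themselves can be pulled out of the kernel log with something like the sketch below; /var/log/kern.log is just the Debian/Proxmox default, adjust the path as needed.

#!/usr/bin/env python3
# List OOM-killer events from the kernel log: when they happened and
# which process got killed.  /var/log/kern.log is the Debian default;
# other setups may log kernel messages elsewhere.
import re

LOG = "/var/log/kern.log"
oom = re.compile(r"invoked oom-killer|Out of memory|Killed process", re.I)

with open(LOG, errors="replace") as f:
    for line in f:
        if oom.search(line):
            # syslog format: "May 11 12:35:24 hostname kernel: ..."
            print(line.rstrip())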
I run a 'ceph -s' every hour, so at 14:17:01 I had, on two nodes:
cluster 8794c124-c2ec-4e81-8631-742992159bd6
 health HEALTH_WARN
        26 requests are blocked > 32 sec
 monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
        election epoch 3930, quorum 0,1,2,3,4 blackpanther,capitanmarvel,4,2,3
 osdmap e15713: 12 osds: 12 up, 12 in
  pgmap v67358590: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
        6639 GB used, 11050 GB / 17689 GB avail
             768 active+clean
client io 266 kB/s wr, 25 op/s
and on the third:
cluster 8794c124-c2ec-4e81-8631-742992159bd6
 health HEALTH_WARN
        5 mons down, quorum
 monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
        election epoch 3931, quorum
 osdmap e15713: 12 osds: 12 up, 12 in
  pgmap v67358598: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
        6639 GB used, 11050 GB / 17689 GB avail
             767 active+clean
               1 active+clean+scrubbing
client io 617 kB/s wr, 70 op/s
At that hour, the site served by the cluster had just closed (i.e., no
users). The only task running, looking at the logs, seems to have been a
backup (Bacula), but it was just saving the catalog, i.e., a database
workload on a container, and it ended at 14:27.
All that continued, more or less, until Sunday morning, then everything
went back to normal.
There seem to have been no hardware failures on the nodes.
Backup tasks (all VM/LXC backups) on Saturday night completed with no
errors.
Can someone provide some hints on how to 'correlate' the various logs,
and so (try to) understand what happened?
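To be concrete about what I mean by 'correlate': I was thinking of something along the lines of the sketch below, which normalises the timestamps and merges the node logs onto a single timeline. The file list and formats are only guesses at what is relevant; maybe there is a better or more standard tool for this?

#!/usr/bin/env python3
# Merge several node logs onto a single timeline so that OOM events,
# OSD/mon messages and syslog entries can be read side by side.
# The file list and timestamp formats below are assumptions; extend them
# to match whatever is actually present on the nodes.
import re
from datetime import datetime

YEAR = 2019  # syslog-style lines carry no year

SOURCES = [
    # (path, regex capturing the timestamp, strptime format)
    ("/var/log/ceph/ceph.log",
     re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"), "%Y-%m-%d %H:%M:%S"),
    ("/var/log/syslog",
     re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})"), "%b %d %H:%M:%S"),
    ("/var/log/kern.log",
     re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})"), "%b %d %H:%M:%S"),
]

events = []
for path, ts_re, fmt in SOURCES:
    try:
        with open(path, errors="replace") as f:
            for line in f:
                m = ts_re.match(line)
                if not m:
                    continue
                ts = datetime.strptime(m.group(1), fmt)
                if ts.year == 1900:          # syslog format has no year
                    ts = ts.replace(year=YEAR)
                events.append((ts, path, line.rstrip()))
    except FileNotFoundError:
        pass  # not every log exists on every node

for ts, path, line in sorted(events, key=lambda e: e[0]):
    print(ts.isoformat(sep=" "), path, line, sep="  ")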
Thanks.
--
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
Associazione ``La Nostra Famiglia'' http://www.lanostrafamiglia.it/
Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN)
marco.gaiarin(at)lanostrafamiglia.it t +39-0434-842711 f +39-0434-842797
Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, category ONLUS or RICERCA SANITARIA)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com