[It is not really a 'mortem', but...]

Saturday afternoon my 3-node Proxmox Ceph cluster had a big 'slowdown'. It
started at 12:35:24 with an OOM condition on one of the 3 storage nodes,
followed by another OOM on a second node at 12:43:31.

After that, all the bad things happened: stuck requests, SCSI timeouts on
VMs, MONs flip-flopping on RBD clients.

I run a 'ceph -s' every hour, so at 14:17:01 I had, on two nodes:

    cluster 8794c124-c2ec-4e81-8631-742992159bd6
     health HEALTH_WARN
            26 requests are blocked > 32 sec
     monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
            election epoch 3930, quorum 0,1,2,3,4 blackpanther,capitanmarvel,4,2,3
     osdmap e15713: 12 osds: 12 up, 12 in
      pgmap v67358590: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
            6639 GB used, 11050 GB / 17689 GB avail
                 768 active+clean
      client io 266 kB/s wr, 25 op/s

and on the third:

    cluster 8794c124-c2ec-4e81-8631-742992159bd6
     health HEALTH_WARN
            5 mons down, quorum
     monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
            election epoch 3931, quorum
     osdmap e15713: 12 osds: 12 up, 12 in
      pgmap v67358598: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
            6639 GB used, 11050 GB / 17689 GB avail
                 767 active+clean
                   1 active+clean+scrubbing
      client io 617 kB/s wr, 70 op/s

At that hour, the site served by the cluster had just closed (i.e., no
users). Looking at the logs, the only task running seems to have been a
backup (bacula), but it was only saving the catalog, i.e. a database
workload on a container, and it ended at 14:27.

All of this continued, more or less, until Sunday morning; then everything
went back to normal. There seem to have been no hardware failures on the
nodes. The backup tasks (all VM/LXC backups) on Saturday night completed
with no errors.

Can someone provide some hints on how to 'correlate' the various logs, and
so (try to) understand what happened?

Thanks.

--
dott.
Marco Gaiarin
Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/