Post-mortem analysis?

[It is not really a 'mortem', but...]


Saturday afternoon, my 3-node Proxmox Ceph cluster had a big
'slowdown', which started at 12:35:24 with an OOM condition on one of
the 3 storage nodes, followed by an OOM on a second node at 12:43:31.

After that, all sorts of bad things happened: stuck requests, SCSI
timeouts on VMs, MONs flip-flopping on the RBD clients.

I run a 'ceph -s' every hour (just a cron job; a sketch follows the
output below), so at 14:17:01 I had on two of the nodes:

    cluster 8794c124-c2ec-4e81-8631-742992159bd6
     health HEALTH_WARN
            26 requests are blocked > 32 sec
     monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
            election epoch 3930, quorum 0,1,2,3,4 blackpanther,capitanmarvel,4,2,3
     osdmap e15713: 12 osds: 12 up, 12 in
      pgmap v67358590: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
            6639 GB used, 11050 GB / 17689 GB avail
                 768 active+clean
  client io 266 kB/s wr, 25 op/s

and on the third:
    cluster 8794c124-c2ec-4e81-8631-742992159bd6
     health HEALTH_WARN
            5 mons down, quorum
     monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
            election epoch 3931, quorum
     osdmap e15713: 12 osds: 12 up, 12 in
      pgmap v67358598: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
            6639 GB used, 11050 GB / 17689 GB avail
                 767 active+clean
                   1 active+clean+scrubbing
  client io 617 kB/s wr, 70 op/s
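
(For completeness, the hourly 'ceph -s' capture is nothing more than a
plain cron entry; a minimal sketch, with example schedule and log path:)

    # /etc/cron.d/ceph-status -- sketch only; schedule and path are examples
    # Append a timestamped 'ceph -s' to a log file every hour.
    # Note: '%' is special in crontab and must be escaped as '\%'.
    0 * * * *  root  (date '+\%F \%T'; ceph -s) >> /var/log/ceph-status.log 2>&1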


At that hour, the site served by the cluster had just closed (i.e. no
users were active). Looking at the logs, the only task still running
seems to have been a backup (Bacula), but it was only saving its
catalog, i.e. a database workload on a container, and it ended at
14:27.


All that continued, more or less, until Sunday morning, when
everything went back to normal.
There seem to have been no hardware failures on the nodes.

The backup tasks (all VM/LXC backups) on Saturday night completed with
no errors.


Can someone provide some hints on how to 'correlate' the various logs,
and so (try to) understand what happened?
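
The only naive approach I can think of is to merge everything into a
single timeline sorted by timestamp; a rough sketch, assuming the
standard log paths (ceph cluster log plus per-OSD logs; syslog lines
would first need their timestamps rewritten into the same format):

    # Tag every line with its source file, then sort the whole lot by the
    # leading 'YYYY-MM-DD HH:MM:SS' timestamp that ceph mon/OSD logs use
    # (a plain lexical sort is then chronological).  Paths are examples.
    for f in /var/log/ceph/ceph.log /var/log/ceph/ceph-osd.*.log; do
        sed "s|\$| [$f]|" "$f"
    done | sort > /tmp/ceph-timeline.txt

But I am not sure this is the right way to approach it.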


Thanks.

-- 
dott. Marco Gaiarin				        GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''          http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

		Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
      http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
	(tax code 00307430132, category ONLUS or RICERCA SANITARIA)