Hello Marco,
first of all, hyperconverged setups with publicly accessible VMs can be affected by DDoS attacks or other harmful events that cause cascading errors in your infrastructure.
Are you sure your network worked correctly at the time?
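A quick way to check is to scan the cluster log on one of the mon nodes for messages that usually point at the network rather than the disks (monitor elections, missed heartbeats, slow requests) during the incident window. A minimal sketch, assuming the default /var/log/ceph/ceph.log location and the standard timestamp format; adjust the window and patterns as needed:

#!/usr/bin/env python3
# Scan the cluster log for network-related symptoms in a time window.
# Assumption: default cluster log at /var/log/ceph/ceph.log on a mon node,
# with lines starting "YYYY-MM-DD HH:MM:SS".
import re
from datetime import datetime

LOG = "/var/log/ceph/ceph.log"
START = datetime(2019, 5, 11, 12, 30)   # Saturday afternoon, per the report
END   = datetime(2019, 5, 12, 9, 0)     # until Sunday morning

# messages that usually indicate network trouble rather than disk trouble
PATTERNS = re.compile(
    r"calling new monitor election|heartbeat_check: no reply|"
    r"slow request|wrongly marked me down", re.I)

with open(LOG, errors="replace") as f:
    for line in f:
        try:
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue                     # skip lines without a leading timestamp
        if START <= ts <= END and PATTERNS.search(line):
            print(line.rstrip())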
--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx
On Mon, 13 May 2019 at 11:43, Marco Gaiarin <gaio@xxxxxxxxx> wrote:
[It is not really a 'mortem', but...]
On Saturday afternoon, my 3-node Proxmox Ceph cluster had a big
'slowdown', which started at 12:35:24 with an OOM condition on one of
the 3 storage nodes, followed by another OOM on a second node at
12:43:31.
After that, all the bad things happened: stuck requests, SCSI timeouts
on VMs, MONs flip-flopping on the RBD clients.
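For reference, the OOM events themselves can be pulled out of the kernel log with something like the sketch below; /var/log/kern.log is just the Debian/Proxmox default, adjust the path as needed.

#!/usr/bin/env python3
# List OOM-killer events from the kernel log: when they happened and
# which process got killed.  /var/log/kern.log is the Debian default;
# other setups may log kernel messages elsewhere.
import re

LOG = "/var/log/kern.log"
oom = re.compile(r"invoked oom-killer|Out of memory|Killed process", re.I)

with open(LOG, errors="replace") as f:
    for line in f:
        if oom.search(line):
            # syslog format: "May 11 12:35:24 hostname kernel: ..."
            print(line.rstrip())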
I run a 'ceph -s' every hour, so at 14:17:01 I had, on two nodes:
cluster 8794c124-c2ec-4e81-8631-742992159bd6
 health HEALTH_WARN
        26 requests are blocked > 32 sec
 monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
        election epoch 3930, quorum 0,1,2,3,4 blackpanther,capitanmarvel,4,2,3
 osdmap e15713: 12 osds: 12 up, 12 in
  pgmap v67358590: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
        6639 GB used, 11050 GB / 17689 GB avail
             768 active+clean
client io 266 kB/s wr, 25 op/s
and on the third:
cluster 8794c124-c2ec-4e81-8631-742992159bd6
 health HEALTH_WARN
        5 mons down, quorum
 monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
        election epoch 3931, quorum
 osdmap e15713: 12 osds: 12 up, 12 in
  pgmap v67358598: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
        6639 GB used, 11050 GB / 17689 GB avail
             767 active+clean
               1 active+clean+scrubbing
client io 617 kB/s wr, 70 op/s
At that hour, the site served by the cluster had just closed (i.e., no
users). The only task running, looking at the logs, seems to have been a
backup (Bacula), but it was just saving the catalog, i.e., a database
workload on a container, and it ended at 14:27.
All that continued, more or less, until Sunday morning, then everything
went back to normal.
There seem to have been no hardware failures on the nodes.
Backup tasks (all VM/LXC backups) on Saturday night completed with no
errors.
Can someone provide some hints on how to 'correlate' the various logs,
and so (try to) understand what happened?
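To be concrete about what I mean by 'correlate': I was thinking of something along the lines of the sketch below, which normalises the timestamps and merges the node logs onto a single timeline. The file list and formats are only guesses at what is relevant; maybe there is a better or more standard tool for this?

#!/usr/bin/env python3
# Merge several node logs onto a single timeline so that OOM events,
# OSD/mon messages and syslog entries can be read side by side.
# The file list and timestamp formats below are assumptions; extend them
# to match whatever is actually present on the nodes.
import re
from datetime import datetime

YEAR = 2019  # syslog-style lines carry no year

SOURCES = [
    # (path, regex capturing the timestamp, strptime format)
    ("/var/log/ceph/ceph.log",
     re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"), "%Y-%m-%d %H:%M:%S"),
    ("/var/log/syslog",
     re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})"), "%b %d %H:%M:%S"),
    ("/var/log/kern.log",
     re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})"), "%b %d %H:%M:%S"),
]

events = []
for path, ts_re, fmt in SOURCES:
    try:
        with open(path, errors="replace") as f:
            for line in f:
                m = ts_re.match(line)
                if not m:
                    continue
                ts = datetime.strptime(m.group(1), fmt)
                if ts.year == 1900:          # syslog format has no year
                    ts = ts.replace(year=YEAR)
                events.append((ts, path, line.rstrip()))
    except FileNotFoundError:
        pass  # not every log exists on every node

for ts, path, line in sorted(events, key=lambda e: e[0]):
    print(ts.isoformat(sep=" "), path, line, sep="  ")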
Thanks.
--
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
Associazione ``La Nostra Famiglia'' http://www.lanostrafamiglia.it/
Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN)
marco.gaiarin(at)lanostrafamiglia.it t +39-0434-842711 f +39-0434-842797
Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, category ONLUS or RICERCA SANITARIA)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com