Now the problem is that Ceph has marked two disks out because scrubbing failed (I think it is not a disk fault, but a consequence of the mark-complete).
How can I:
- disable scrubbing
- put the two disks back in
In any case I will wait for the end of recovery, to be sure it really works again.
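If I understand the docs correctly, the commands would be something like
this (a sketch; 5 and 13 are only placeholders for the two OSDs that were
marked out, and the start command depends on the init system):

    # stop all (deep-)scrubbing until recovery is done
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # restart the daemons on their hosts if they are not running, e.g.
    # "systemctl start ceph-osd@5" or "service ceph start osd.5"

    # mark the OSDs back in
    ceph osd in 5
    ceph osd in 13

    # when recovery has finished, re-enable scrubbing
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub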
On Wed, 29 Jun 2016 at 11:16, Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
In fact I am worried because:
1) Ceph is under Proxmox, and Proxmox may decide to reboot a server if it is not responding
2) probably a server was rebooted while Ceph was reconstructing
3) even using max=3 does not help
Anyway, this is the "unofficial" procedure that I am using, much simpler than the blog post:
1) find the host where the PG is
2) stop Ceph on that host
3) ceph-objectstore-tool --pgid 1.98 --op mark-complete --data-path /var/lib/ceph/osd/ceph-9 --journal-path /var/lib/ceph/osd/ceph-9/journal
4) start Ceph
5) watch it finally start reconstructing
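Spelled out as commands, the steps look roughly like this (a sketch
assuming systemd; on sysvinit use "service ceph stop/start osd.9"
instead. Note that mark-complete forces Ceph to accept that copy of the
pg as authoritative, so any writes it is missing are silently lost):

    # 1) find which OSDs hold the pg
    ceph pg map 1.98

    # 2) stop the daemon on that OSD's host
    systemctl stop ceph-osd@9

    # 3) mark the pg complete in the offline object store
    ceph-objectstore-tool --pgid 1.98 --op mark-complete \
        --data-path /var/lib/ceph/osd/ceph-9 \
        --journal-path /var/lib/ceph/osd/ceph-9/journal

    # 4) start the daemon again and watch recovery
    systemctl start ceph-osd@9
    ceph -w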
On Wed, 29 Jun 2016 at 11:11, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> wrote:

Hi,

removing ONE disk while your replication is 2 is no problem.
You don't need to wait a single second to replace or remove it. It is not
used anyway and is out/down, so from Ceph's point of view it does not exist.
----------------
But as Christian told you already, what we see now fits a scenario
where you lost the OSD and either you did something, or something else
happened, but the data was not recovered again.
Either because another OSD was broken, or because you did something.
Maybe, because of the "too many PGs per OSD (307 > max 300)", Ceph never
recovered.
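As far as i know that setting only controls the health warning, but to
rule it out you can raise the threshold at runtime (400 is just an
example value):

    ceph --show-config | grep mon_pg_warn_max_per_osd
    ceph tell mon.* injectargs '--mon_pg_warn_max_per_osd 400'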
What I can see from http://pastebin.com/VZD7j2vN is that
OSDs 5, 13, 9, 0, 6, 2, 3, and maybe others, are the OSDs holding the
incomplete data.
That is 7 OSDs out of 10. So something happened to those OSDs or the data
on them, and that has nothing to do with a single disk failing.
Something else must have happened.
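To see exactly which pgs are incomplete and which OSDs are acting for
them, you can use (standard commands; please pastebin the output too):

    ceph health detail | grep incomplete
    ceph pg dump_stuck inactive
    # and one pg in depth, for example:
    ceph pg 1.98 query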
And as Christian already wrote: you will have to go back through your
logs to the point where things went down.
Because the failure of a single OSD, no matter what your replication size is,
cannot (normally) harm the consistency of 7 other OSDs, i.e. 70% of
your total cluster.
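Roughly like this (just a sketch; the log path and exact message wording
differ by version and distro):

    # on a monitor host, walk the cluster log back to the failure
    zgrep -hiE "osd.*(down|out|failed)" /var/log/ceph/ceph.log* | less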
--
Mit freundlichen Gruessen / Best regards
Oliver Dzombic
IP-Interactive
mailto:info@xxxxxxxxxxxxxxxxx
Address:
IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen
HRB 93402, Amtsgericht Hanau (district court)
Managing director: Oliver Dzombic
Tax no.: 35 236 3622 1
VAT ID: DE274086107
On 29.06.2016 at 10:56, Mario Giammarco wrote:
> Yes, I removed it from CRUSH because it was broken. I waited 24
> hours to see if Ceph would heal itself. Then I removed the
> disk completely (it was broken...) and I waited 24 hours again. Then I
> started getting worried.
> Are you saying that I should not remove a broken disk from the
> cluster? Were 24 hours not enough?
>
> On Wed, 29 Jun 2016 at 10:53, Zoltan Arnold Nagy
> <zoltan@xxxxxxxxxxxxxxxxxx> wrote:
>
>     Just losing one disk doesn't automagically delete it from CRUSH,
>     but in the output you had 10 disks listed, so there must be
>     something else going on - did you delete the disk from the crush map
>     as well?
>
>     Ceph waits by default 300 secs AFAIK before marking an OSD out, after
>     which it will start to recover.
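>     That interval is the mon_osd_down_out_interval option (300 seconds
>     by default); you can verify yours with:
>
>         ceph --show-config | grep mon_osd_down_out_interval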
>
>
>> On 29 Jun 2016, at 10:42, Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
>>
>> Thank you for your reply; let me add my experience:
>>
>> 1) the other time this happened to me I had a cluster with
>> min_size=2 and size=3, and the problem was the same. That time I
>> set min_size=1 to recover the pool, but it did not help. So I do
>> not understand the advantage of keeping three copies when
>> Ceph can decide to discard all three.
>> 2) I started with 11 HDDs. One hard disk failed. Ceph waited
>> forever for the hard disk to come back. But the hard disk was really
>> completely broken, so I followed the procedure to actually delete it
>> from the cluster (the exact commands are sketched after this list).
>> Anyway, Ceph did not recover.
>> 3) I have 307 PGs per OSD, more than the 300 limit, but only because I
>> had 11 HDDs and now have 10. I will add more HDDs after I repair the pool.
>> 4) I have reduced the monitors to 3
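>>
>> For reference, the removal procedure I followed was essentially the
>> standard one, sketched here with a placeholder ID (11) for the dead
>> disk; the crucial point is to wait for recovery to complete BEFORE
>> removing the OSD from CRUSH:
>>
>>     ceph osd out 11                # stop placing data on it
>>     # ...wait until recovery completes and health is OK...
>>     systemctl stop ceph-osd@11     # or the init system's equivalent
>>     ceph osd crush remove osd.11
>>     ceph auth del osd.11
>>     ceph osd rm 11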
>>
>>
>>
>> On Wed, 29 Jun 2016 at 10:25, Christian Balzer
>> <chibi@xxxxxxx> wrote:
>>
>>
>> Hello,
>>
>> On Wed, 29 Jun 2016 06:02:59 +0000 Mario Giammarco wrote:
>>
>> > pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>>                           ^
>> And that's the root cause of all your woes.
>> The default replication size is 3 for a reason, and while I do
>> run pools
>> with a replication of 2, they are either HDD RAIDs or extremely
>> trustworthy
>> and well-monitored SSDs.
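>>
>> Once your cluster is healthy again, raising the replication is one
>> command per pool; expect heavy data movement and 50% more space used,
>> so do it after recovery, not now:
>>
>>     ceph osd pool set rbd size 3
>>     ceph osd pool set rbd min_size 2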
>>
>> That said, something more than a single HDD failure must have
>> happened
>> here; you should check the logs and backtrace all the steps you
>> took after
>> that OSD failed.
>>
>> You said there were 11 HDDs and your first ceph -s output showed:
>> ---
>> osdmap e10182: 10 osds: 10 up, 10 in
>> ----
>> And your crush map states the same.
>>
>> So how and WHEN did you remove that OSD?
>> My suspicion would be it was removed before recovery was complete.
>>
>> Also, as I think was mentioned before, 7 mons are overkill; 3-5
>> would be a
>> saner number.
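>>
>> Dropping the surplus mons is one command each, done one at a time so
>> quorum is never lost (also stop the mon daemon on that host and remove
>> it from ceph.conf); with your monmap names, e.g.:
>>
>>     ceph mon remove 5
>>     ceph mon remove 6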
>>
>> Christian
>>
>> > rjenkins pg_num 512 pgp_num 512 last_change 9313 flags hashpspool
>> > stripe_width 0
>> > removed_snaps [1~3]
>> > pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> > rjenkins pg_num 512 pgp_num 512 last_change 9314 flags hashpspool
>> > stripe_width 0
>> > removed_snaps [1~3]
>> > pool 2 'rbd3' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> > rjenkins pg_num 512 pgp_num 512 last_change 10537 flags hashpspool
>> > stripe_width 0
>> > removed_snaps [1~3]
>> >
>> >
>> > ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
>> > 5 1.81000 1.00000 1857G 984G 872G 53.00 0.86
>> > 6 1.81000 1.00000 1857G 1202G 655G 64.73 1.05
>> > 2 1.81000 1.00000 1857G 1158G 698G 62.38 1.01
>> > 3 1.35999 1.00000 1391G 906G 485G 65.12 1.06
>> > 4 0.89999 1.00000 926G 702G 223G 75.88 1.23
>> > 7 1.81000 1.00000 1857G 1063G 793G 57.27 0.93
>> > 8 1.81000 1.00000 1857G 1011G 846G 54.44 0.88
>> > 9 0.89999 1.00000 926G 573G 352G 61.91 1.01
>> > 0 1.81000 1.00000 1857G 1227G 629G 66.10 1.07
>> > 13 0.45000 1.00000 460G 307G 153G 66.74 1.08
>> > TOTAL 14846G 9136G 5710G 61.54
>> > MIN/MAX VAR: 0.86/1.23 STDDEV: 6.47
>> >
>> >
>> >
>> > ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> >
>> > http://pastebin.com/SvGfcSHb
>> > http://pastebin.com/gYFatsNS
>> > http://pastebin.com/VZD7j2vN
>> >
>> > I do not understand why I/O on the ENTIRE cluster is blocked
>> > when only a few
>> > PGs are incomplete.
>> >
>> > Many thanks,
>> > Mario
>> >
>> >
>> > On Tue, 28 Jun 2016 at 19:34, Stefan Priebe - Profihost AG
>> > <s.priebe@xxxxxxxxxxxx> wrote:
>> >
>> > > And ceph health detail
>> > >
>> > > Stefan
>> > >
>> > > Excuse my typo sent from my mobile phone.
>> > >
>> > > On 28.06.2016 at 19:28, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> wrote:
>> > >
>> > > Hi Mario,
>> > >
>> > > please give some more details:
>> > >
>> > > Please provide the output of:
>> > >
>> > > ceph osd pool ls detail
>> > > ceph osd df
>> > > ceph --version
>> > >
>> > > ceph -w for 10 seconds ( use http://pastebin.com/ please )
>> > >
>> > > ceph osd crush dump ( also pastebin pls )
>> > >
>> > > --
>> > > Mit freundlichen Gruessen / Best regards
>> > >
>> > > Oliver Dzombic
>> > > IP-Interactive
>> > >
>> > > mailto:info@xxxxxxxxxxxxxxxxx
>> > >
>> > > Address:
>> > >
>> > > IP Interactive UG ( haftungsbeschraenkt )
>> > > Zum Sonnenberg 1-3
>> > > 63571 Gelnhausen
>> > >
>> > > HRB 93402, Amtsgericht Hanau (district court)
>> > > Managing director: Oliver Dzombic
>> > >
>> > > Tax no.: 35 236 3622 1
>> > > VAT ID: DE274086107
>> > >
>> > >
>> > > On 28.06.2016 at 18:59, Mario Giammarco wrote:
>> > >
>> > > Hello,
>> > >
>> > > this is the second time this has happened to me; I hope that
>> > > someone can explain what I can do.
>> > >
>> > > Proxmox Ceph cluster with 8 servers, 11 HDDs. min_size=1, size=2.
>> > >
>> > >
>> > > One HDD went down due to bad sectors.
>> > >
>> > > Ceph recovered, but it ended with:
>> > >
>> > >
>> > >     cluster f2a8dd7d-949a-4a29-acab-11d4900249f4
>> > >      health HEALTH_WARN
>> > >             3 pgs down
>> > >             19 pgs incomplete
>> > >             19 pgs stuck inactive
>> > >             19 pgs stuck unclean
>> > >             7 requests are blocked > 32 sec
>> > >      monmap e11: 7 mons at
>> > > {0=192.168.0.204:6789/0,1=192.168.0.201:6789/0,2=192.168.0.203:6789/0,3=192.168.0.205:6789/0,4=192.168.0.202:6789/0,5=192.168.0.206:6789/0,6=192.168.0.207:6789/0}
>> > >             election epoch 722, quorum 0,1,2,3,4,5,6 1,4,2,0,3,5,6
>> > >      osdmap e10182: 10 osds: 10 up, 10 in
>> > >       pgmap v3295880: 1024 pgs, 2 pools, 4563 GB data, 1143 kobjects
>> > >             9136 GB used, 5710 GB / 14846 GB avail
>> > >                 1005 active+clean
>> > >                   16 incomplete
>> > >                    3 down+incomplete
>> > >
>> > >
>> > > Unfortunately "7 requests blocked" means no virtual machine can boot,
>> > > because Ceph has stopped I/O.
>> > >
>> > >
>> > > I can accept to lose some data, but not ALL data!
>> > >
>> > > Can you help me please?
>> > >
>> > > Thanks,
>> > >
>> > > Mario
>> > >
>> > >
>>
>>
>> --
>> Christian Balzer Network/Systems Engineer
>> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>>
>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com