Thanks,
I can put the OSDs back in, but they do not stay in, and I am pretty sure they are not broken.
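(As a minimal sketch of what I will try next, assuming the OSDs are only being marked out again rather than their daemons crashing:

ceph osd set noout     # stop ceph from automatically marking down OSDs out
ceph osd in <id>       # bring the OSD back in

and then check the OSD log under /var/log/ceph/ to see why it left the cluster.)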
On Wed, 29 Jun 2016 at 12:07, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> wrote:
hi,
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd in <id>
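The flags should then show up in the ceph -s output; once the cluster is healthy again they can be reverted with:

ceph osd unset noscrub
ceph osd unset nodeep-scrub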
--
Mit freundlichen Gruessen / Best regards
Oliver Dzombic
IP-Interactive
mailto:info@xxxxxxxxxxxxxxxxx
Address:
IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen
HRB 93402, district court (Amtsgericht) Hanau
Managing director: Oliver Dzombic
Tax no.: 35 236 3622 1
VAT ID: DE274086107
On 29.06.2016 at 12:00, Mario Giammarco wrote:
> Now the problem is that ceph has put two disks out because scrubbing has
> failed (I think it is not a disk fault but a consequence of the mark-complete).
> How can I:
> - disable scrubbing
> - put the two disks back in
>
> I will wait for the end of recovery anyway, to be sure it really works again
>
> On Wed, 29 Jun 2016 at 11:16, Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
>
> In fact I am worried because:
>
> 1) ceph runs under proxmox, and proxmox may decide to reboot a server
> if it is not responding
> 2) probably a server was rebooted while ceph was reconstructing
> 3) even using max=3 does not help
>
> Anyway, this is the "unofficial" procedure that I am using, much
> simpler than the blog post:
>
> 1) find the host where the PG is (see the commands sketched below)
> 2) stop ceph on that host
> 3) ceph-objectstore-tool --pgid 1.98 --op mark-complete --data-path
> /var/lib/ceph/osd/ceph-9 --journal-path
> /var/lib/ceph/osd/ceph-9/journal
> 4) start ceph again
> 5) watch it finally reconstructing
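>
> For step 1, a minimal sketch of how the PG can be located, assuming the
> PG id (here 1.98) is taken from "ceph health detail":
>
> ceph health detail | grep incomplete   # list the incomplete PGs
> ceph pg map 1.98                       # show the up/acting OSDs for that PG
> ceph osd find 9                        # show which host carries osd.9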
>
> On Wed, 29 Jun 2016 at 11:11, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> removing ONE disk while your replication is 2 is no problem.
>
> You don't need to wait a single second to replace or remove it. It is
> not used anyway and is out/down, so from ceph's point of view it does
> not exist.
>
> ----------------
>
> But as Christian told you already, what we see now fits a scenario
> where you lost the OSD and either you did something, or something else
> happened, but the data was never recovered.
>
> Either because another OSD was broken, or because you did something.
>
> Maybe ceph never recovered because of the "too many PGs per OSD
> (307 > max 300)" warning.
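>
> (For reference, that figure is presumably just the PG arithmetic for this
> cluster: 3 pools x 512 PGs x replica size 2 spread over 10 OSDs gives
> 3 * 512 * 2 / 10 = 307.2 PG copies per OSD, slightly above the default
> warning threshold of 300, mon_pg_warn_max_per_osd.)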
>
> What I can see from http://pastebin.com/VZD7j2vN is that
>
> OSDs 5, 13, 9, 0, 6, 2 and 3, and maybe others, are the OSDs holding the
> incomplete data.
>
> That is 7 OSDs out of 10. So something happened to those OSDs or the
> data on them, and that had nothing to do with a single disk failing.
>
> Something else must have happened.
>
> And as Christian already wrote: you will have to go back through your
> logs until the point where things went down.
>
> Because the failure of a single OSD, no matter what your replication
> size is, can ( normally ) not harm the consistency of 7 other OSDs,
> i.e. 70% of your total cluster.
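>
> (One way to see what an incomplete PG is actually waiting for is to query
> it, for example:
>
> ceph pg 1.98 query
>
> the "recovery_state" section at the end usually names the peering state
> and the OSDs the PG is blocked on.)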
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
>
> On 29.06.2016 at 10:56, Mario Giammarco wrote:
> > Yes, I removed it from crush because it was broken. I waited 24
> > hours to see if ceph would heal itself. Then I removed the disk
> > completely (it was broken...) and I waited 24 hours again. Then I
> > started getting worried.
> > Are you saying that I should not remove a broken disk from the
> > cluster? Were 24 hours not enough?
> >
> > On Wed, 29 Jun 2016 at 10:53, Zoltan Arnold Nagy <zoltan@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > Just losing one disk doesn’t automagically delete it from CRUSH,
> > but in the output you had 10 disks listed, so there must be
> > something else going on - did you delete the disk from the crush map
> > as well?
> >
> > Ceph by default waits 300 secs AFAIK to mark an OSD out, after which
> > it will start to recover.
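> >
> > (That timeout is presumably mon_osd_down_out_interval, which defaults to
> > 300 seconds; it can be checked on a monitor node with, for example,
> > "ceph daemon mon.<id> config get mon_osd_down_out_interval".)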
> >
> >
> >> On 29 Jun 2016, at 10:42, Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
> >>
> >> I thank you for your reply, so I can add my experience:
> >>
> >> 1) The other time this happened to me I had a cluster with min_size=2
> >> and size=3 and the problem was the same. That time I set min_size=1
> >> to recover the pool (see the note after this list), but it did not
> >> help. So I do not understand the advantage of keeping three copies
> >> when ceph can decide to discard all three.
> >> 2) I started with 11 HDDs. One hard disk failed. Ceph waited forever
> >> for the hard disk to come back, but the disk was really completely
> >> broken, so I followed the procedure to actually delete it from the
> >> cluster. Anyway, ceph did not recover.
> >> 3) I have 307 PGs per OSD, more than the 300 limit, but that is
> >> because I had 11 HDDs and now only 10. I will add more HDDs after I
> >> repair the pool.
> >> 4) I have reduced the monitors to 3.
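> >>
> >> (For item 1: min_size is the usual pool-level setting, changed with e.g.
> >> "ceph osd pool set rbd min_size 1", using the rbd pool as an example.)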
> >>
> >>
> >>
> >> On Wed, 29 Jun 2016 at 10:25, Christian Balzer <chibi@xxxxxxx> wrote:
> >>
> >>
> >> Hello,
> >>
> >> On Wed, 29 Jun 2016 06:02:59 +0000 Mario Giammarco wrote:
> >>
> >> > pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> >> ^
> >> And that's the root cause of all your woes.
> >> The default replication size is 3 for a reason, and while I do run
> >> pools with replication of 2, they are either HDD RAIDs or extremely
> >> trustworthy and well-monitored SSDs.
> >>
> >> That said, something more than a single HDD failure must have
> >> happened here; you should check the logs and retrace all the steps
> >> you took after that OSD failed.
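> >>
> >> (A minimal sketch of what moving a pool back to 3 copies would look
> >> like, once the cluster is healthy again and assuming there is enough
> >> free capacity for the extra replicas:
> >>
> >> ceph osd pool set rbd size 3
> >> ceph osd pool set rbd min_size 2
> >>
> >> repeated for each pool.)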
> >>
> >> You said there were 11 HDDs and your first ceph -s output showed:
> >> ---
> >> osdmap e10182: 10 osds: 10 up, 10 in
> >> ---
> >> And your crush map states the same.
> >>
> >> So how and WHEN did you remove that OSD?
> >> My suspicion would be it was removed before recovery was complete.
> >>
> >> Also, as I think was mentioned before, 7 mons are overkill; 3-5 would
> >> be a saner number.
> >>
> >> Christian
> >>
> >> > rjenkins pg_num 512 pgp_num 512 last_change 9313 flags hashpspool stripe_width 0
> >> > removed_snaps [1~3]
> >> > pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> >> > rjenkins pg_num 512 pgp_num 512 last_change 9314 flags hashpspool stripe_width 0
> >> > removed_snaps [1~3]
> >> > pool 2 'rbd3' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> >> > rjenkins pg_num 512 pgp_num 512 last_change 10537 flags hashpspool stripe_width 0
> >> > removed_snaps [1~3]
> >> >
> >> >
> >> > ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
> >> > 5 1.81000 1.00000 1857G 984G 872G 53.00 0.86
> >> > 6 1.81000 1.00000 1857G 1202G 655G 64.73 1.05
> >> > 2 1.81000 1.00000 1857G 1158G 698G 62.38 1.01
> >> > 3 1.35999 1.00000 1391G 906G 485G 65.12 1.06
> >> > 4 0.89999 1.00000 926G 702G 223G 75.88 1.23
> >> > 7 1.81000 1.00000 1857G 1063G 793G 57.27 0.93
> >> > 8 1.81000 1.00000 1857G 1011G 846G 54.44 0.88
> >> > 9 0.89999 1.00000 926G 573G 352G 61.91 1.01
> >> > 0 1.81000 1.00000 1857G 1227G 629G 66.10 1.07
> >> > 13 0.45000 1.00000 460G 307G 153G 66.74 1.08
> >> > TOTAL 14846G 9136G 5710G 61.54
> >> > MIN/MAX VAR: 0.86/1.23 STDDEV: 6.47
> >> >
> >> >
> >> >
> >> > ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
> >> >
> >> > http://pastebin.com/SvGfcSHb
> >> > http://pastebin.com/gYFatsNS
> >> > http://pastebin.com/VZD7j2vN
> >> >
> >> > I do not understand why I/O on the ENTIRE cluster is blocked when
> >> > only a few PGs are incomplete.
> >> >
> >> > Many thanks,
> >> > Mario
> >> >
> >> >
> >> > On Tue, 28 Jun 2016 at 19:34, Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> wrote:
> >> >
> >> > > And ceph health detail
> >> > >
> >> > > Stefan
> >> > >
> >> > > Excuse my typo sent from my mobile phone.
> >> > >
> >> > > On 28.06.2016 at 19:28, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> wrote:
> >> > >
> >> > > Hi Mario,
> >> > >
> >> > > please give some more details:
> >> > >
> >> > > Please post the output of:
> >> > >
> >> > > ceph osd pool ls detail
> >> > > ceph osd df
> >> > > ceph --version
> >> > >
> >> > > ceph -w for 10 seconds ( use http://pastebin.com/ please )
> >> > >
> >> > > ceph osd crush dump ( also pastebin pls )
> >> > >
> >> > > --
> >> > > Mit freundlichen Gruessen / Best regards
> >> > >
> >> > > Oliver Dzombic
> >> > > IP-Interactive
> >> > >
> >> > >
> >> > > On 28.06.2016 at 18:59, Mario Giammarco wrote:
> >> > >
> >> > > Hello,
> >> > >
> >> > > this is the second time this has happened to me; I hope that
> >> > > someone can explain what I can do.
> >> > >
> >> > > Proxmox ceph cluster with 8 servers, 11 HDDs. min_size=1, size=2.
> >> > >
> >> > > One HDD goes down due to bad sectors.
> >> > >
> >> > > Ceph recovers but it ends with:
> >> > >
> >> > >
> >> > > cluster f2a8dd7d-949a-4a29-acab-11d4900249f4
> >> > >  health HEALTH_WARN
> >> > >         3 pgs down
> >> > >         19 pgs incomplete
> >> > >         19 pgs stuck inactive
> >> > >         19 pgs stuck unclean
> >> > >         7 requests are blocked > 32 sec
> >> > >  monmap e11: 7 mons at
> >> > >         {0=192.168.0.204:6789/0,1=192.168.0.201:6789/0,
> >> > >          2=192.168.0.203:6789/0,3=192.168.0.205:6789/0,
> >> > >          4=192.168.0.202:6789/0,5=192.168.0.206:6789/0,
> >> > >          6=192.168.0.207:6789/0}
> >> > >         election epoch 722, quorum 0,1,2,3,4,5,6 1,4,2,0,3,5,6
> >> > >  osdmap e10182: 10 osds: 10 up, 10 in
> >> > >  pgmap v3295880: 1024 pgs, 2 pools, 4563 GB data, 1143 kobjects
> >> > >         9136 GB used, 5710 GB / 14846 GB avail
> >> > >             1005 active+clean
> >> > >               16 incomplete
> >> > >                3 down+incomplete
> >> > >
> >> > >
> >> > > Unfortunately "7 requests blocked" means no virtual machine can
> >> > > boot, because ceph has stopped I/O.
> >> > >
> >> > >
> >> > > I can accept losing some data, but not ALL data!
> >> > >
> >> > > Can you help me please?
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Mario
> >> > >
> >> > >
> >> > >
> >>
> >>
> >> --
> >> Christian Balzer    Network/Systems Engineer
> >> chibi@xxxxxxx       Global OnLine Japan/Rakuten Communications
> >> http://www.gol.com/
> >>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com