Re: Another cluster completely hang

Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> · Wed, 29 Jun 2016 12:07:05 +0200



hi,

ceph osd set noscrub
ceph osd set nodeep-scrub

ceph osd in <id>
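
A minimal sketch of the sequence above, assuming the two OSDs that were marked out have ids 3 and 7 (hypothetical values; take the real ids from `ceph osd tree`). The `run` wrapper, the function names and the APPLY variable are illustrative and not part of Ceph. By default each command is only printed; it is executed only when APPLY=1 is set. The `ceph osd unset` calls re-enable scrubbing once recovery has finished.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set APPLY=1 in the environment to run them against the cluster.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Pause scrubbing cluster-wide, then mark the given OSD ids "in" again.
pause_scrub_and_mark_in() {
    run ceph osd set noscrub
    run ceph osd set nodeep-scrub
    for id in "$@"; do
        run ceph osd in "$id"
    done
}

# Re-enable scrubbing after recovery has completed.
resume_scrub() {
    run ceph osd unset noscrub
    run ceph osd unset nodeep-scrub
}

# Example with two hypothetical OSD ids:
pause_scrub_and_mark_in 3 7
```

Run `resume_scrub` once `ceph -s` shows the cluster healthy again, otherwise scrubbing stays disabled.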


-- 
Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 at Amtsgericht Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 29.06.2016 at 12:00, Mario Giammarco wrote:
> Now the problem is that ceph has marked two disks out because scrub
> failed (I think it is not a disk fault but a side effect of mark-complete).
> How can I:
> - disable scrub
> - put the two disks back in
> 
> In any case I will wait for recovery to finish, to be sure it really works again.
> 
> On Wed, 29 Jun 2016 at 11:16, Mario Giammarco
> <mgiammarco@xxxxxxxxx> wrote:
> 
>     In fact I am worried because:
> 
>     1) ceph runs under proxmox, and proxmox may decide to reboot a server
>     if it is not responding
>     2) a server was probably rebooted while ceph was rebuilding
>     3) even using max=3 does not help
> 
>     Anyway, this is the "unofficial" procedure that I am using, much
>     simpler than the blog post:
> 
>     1) find the host that holds the pg
>     2) stop ceph on that host
>     3) ceph-objectstore-tool --pgid 1.98 --op mark-complete --data-path
>     /var/lib/ceph/osd/ceph-9 --journal-path
>     /var/lib/ceph/osd/ceph-9/journal
>     4) start ceph
>     5) watch it finally rebuild
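
The five steps above can be sketched as a dry-run script. This is only an illustration under stated assumptions: the pg id 1.98 and OSD 9 are the example values from step 3, the `service ceph stop/start osd.N` form assumes a sysvinit-style install and stops only the one OSD rather than all of ceph on the host, and `run`, `mark_pg_complete` and APPLY are hypothetical names. Commands are printed unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Example values from the procedure above; override via the environment.
PGID="${PGID:-1.98}"
OSD_ID="${OSD_ID:-9}"
OSD_PATH="/var/lib/ceph/osd/ceph-${OSD_ID}"

mark_pg_complete() {
    # 1) find which OSDs the pg maps to, and the host of the chosen OSD
    run ceph pg map "$PGID"
    run ceph osd find "$OSD_ID"
    # 2) stop the OSD (init command is an assumption, see lead-in)
    run service ceph stop "osd.${OSD_ID}"
    # 3) mark the local copy of the pg as complete; the OSD must be stopped
    run ceph-objectstore-tool --pgid "$PGID" --op mark-complete \
        --data-path "$OSD_PATH" --journal-path "$OSD_PATH/journal"
    # 4) start the OSD again
    run service ceph start "osd.${OSD_ID}"
    # 5) check whether the pg leaves the incomplete state
    run ceph -s
}

mark_pg_complete
```

Steps 1 to 5 have to be repeated for every incomplete pg, each time on the host and OSD that holds it.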
> 
>     On Wed, 29 Jun 2016 at 11:11, Oliver Dzombic
>     <info@xxxxxxxxxxxxxxxxx> wrote:
> 
>         Hi,
> 
>         removing ONE disk while your replication is 2 is no problem.
> 
>         You don't need to wait a single second to replace or remove it. It is
>         not used anyway and is out/down, so from ceph's point of view it
>         does not exist.
> 
>         ----------------
> 
>         But as Christian told you already, what we see now fits a scenario
>         where you lost the OSD and either you did something, or something
>         else happened, but the data was not recovered again.
> 
>         Either because another OSD was broken, or because you did
>         something.
> 
>         Maybe ceph never recovered because of the "too many PGs per OSD
>         (307 > max 300)" warning.
> 
>         What I can see from http://pastebin.com/VZD7j2vN is that
> 
>         OSDs 5, 13, 9, 0, 6, 2, 3 and maybe others are the OSDs holding the
>         incomplete data.
> 
>         These are 7 OSDs out of 10. So something happened to those OSDs or
>         the data on them, and that had nothing to do with a single disk failing.
> 
>         Something else must have happened.
> 
>         And as Christian already wrote: you will have to go back through your
>         logs to the point where things started going down.
> 
>         Because the failure of a single OSD, no matter what your replication
>         size is, can ( normally ) not harm the consistency of 7 other OSDs,
>         which is 70% of your total cluster.
> 
>         On 29.06.2016 at 10:56, Mario Giammarco wrote:
>         > Yes, I removed it from crush because it was broken. I waited 24
>         > hours to see if ceph would heal itself. Then I removed the
>         > disk completely (it was broken...) and waited 24 hours again. Then I
>         > started getting worried.
>         > Are you telling me that I should not remove a broken disk from the
>         > cluster? Were 24 hours not enough?
>         >
>         > On Wed, 29 Jun 2016 at 10:53, Zoltan Arnold Nagy
>         > <zoltan@xxxxxxxxxxxxxxxxxx> wrote:
>         >
>         >     Just losing one disk doesn't automagically delete it from CRUSH,
>         >     but the output you posted listed 10 disks, so there must be
>         >     something else going on - did you delete the disk from the
>         >     crush map as well?
>         >
>         >     AFAIK Ceph waits 300 secs by default before marking a down OSD
>         >     out, after which it will start to recover.
>         >
>         >
>         >>     On 29 Jun 2016, at 10:42, Mario Giammarco
>         >>     <mgiammarco@xxxxxxxxx> wrote:
>         >>
>         >>     Thank you for your reply. Let me add my experience:
>         >>
>         >>     1) the other time this happened to me I had a cluster with
>         >>     min_size=2 and size=3 and the problem was the same. That time I
>         >>     set min_size=1 to recover the pool but it did not help. So I do
>         >>     not understand the advantage of keeping three copies when
>         >>     ceph can decide to discard all three.
>         >>     2) I started with 11 hdds. One hard disk failed. Ceph waited
>         >>     forever for the hard disk to come back. But the hard disk is really
>         >>     completely broken, so I followed the procedure to really
>         >>     delete it from the cluster. Anyway, ceph did not recover.
>         >>     3) I have 307 pgs per OSD, more than 300, but that is because I
>         >>     had 11 hdds and now have only 10. I will add more hdds after I
>         >>     repair the pool
>         >>     4) I have reduced the monitors to 3
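
The figure in point 3 can be checked with a little arithmetic. A minimal sketch, assuming the warning counts PG replicas (pg_num times pool size, summed over all pools) divided by the number of in OSDs; that formula is an assumption, though it reproduces the 307 reported in this thread. The pool values (three pools, pg_num 512, size 2) come from the `ceph osd pool ls detail` output quoted further down; `pgs_per_osd` is a hypothetical helper name.

```shell
#!/bin/sh
# pgs_per_osd POOLS PG_NUM SIZE OSDS
# Integer estimate of PG replicas per OSD for identically sized pools.
pgs_per_osd() {
    echo $(( $1 * $2 * $3 / $4 ))
}

pgs_per_osd 3 512 2 10   # prints 307 (10 OSDs, above the 300 limit)
pgs_per_osd 3 512 2 11   # prints 279 (11 OSDs, below the limit)
```

Under that assumption the warning is indeed explained by the drop from 11 to 10 OSDs, and it would clear again once an eleventh OSD is added.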
>         >>
>         >>
>         >>
>         >>     On Wed, 29 Jun 2016 at 10:25, Christian Balzer
>         >>     <chibi@xxxxxxx> wrote:
>         >>
>         >>
>         >>         Hello,
>         >>
>         >>         On Wed, 29 Jun 2016 06:02:59 +0000 Mario Giammarco wrote:
>         >>
>         >>         > pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>         >>                                        ^
>         >>         And that's the root cause of all your woes.
>         >>         The default replication size is 3 for a reason, and while I do
>         >>         run pools with a replication of 2, they are either on HDD RAIDs
>         >>         or on extremely trustworthy and well monitored SSDs.
>         >>
>         >>         That said, something more than a single HDD failure must have
>         >>         happened here. You should check the logs and retrace all the
>         >>         steps you took after that OSD failed.
>         >>
>         >>         You said there were 11 HDDs and your first ceph -s output showed:
>         >>         ---
>         >>              osdmap e10182: 10 osds: 10 up, 10 in
>         >>         ----
>         >>         And your crush map states the same.
>         >>
>         >>         So how and WHEN did you remove that OSD?
>         >>         My suspicion would be that it was removed before recovery was
>         >>         complete.
>         >>
>         >>         Also, as I think was mentioned before, 7 mons are overkill; 3-5
>         >>         would be a saner number.
>         >>
>         >>         Christian
>         >>
>         >>         Christian
>         >>
>         >>         > rjenkins pg_num 512 pgp_num 512 last_change 9313 flags hashpspool stripe_width 0
>         >>         >        removed_snaps [1~3]
>         >>         > pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 9314 flags hashpspool stripe_width 0
>         >>         >        removed_snaps [1~3]
>         >>         > pool 2 'rbd3' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 10537 flags hashpspool stripe_width 0
>         >>         >        removed_snaps [1~3]
>         >>         >
>         >>         >
>         >>         > ID WEIGHT  REWEIGHT   SIZE   USE AVAIL  %USE  VAR
>         >>         >  5 1.81000  1.00000  1857G  984G  872G 53.00 0.86
>         >>         >  6 1.81000  1.00000  1857G 1202G  655G 64.73 1.05
>         >>         >  2 1.81000  1.00000  1857G 1158G  698G 62.38 1.01
>         >>         >  3 1.35999  1.00000  1391G  906G  485G 65.12 1.06
>         >>         >  4 0.89999  1.00000   926G  702G  223G 75.88 1.23
>         >>         >  7 1.81000  1.00000  1857G 1063G  793G 57.27 0.93
>         >>         >  8 1.81000  1.00000  1857G 1011G  846G 54.44 0.88
>         >>         >  9 0.89999  1.00000   926G  573G  352G 61.91 1.01
>         >>         >  0 1.81000  1.00000  1857G 1227G  629G 66.10 1.07
>         >>         > 13 0.45000  1.00000   460G  307G  153G 66.74 1.08
>         >>         >               TOTAL 14846G 9136G 5710G 61.54
>         >>         > MIN/MAX VAR: 0.86/1.23  STDDEV: 6.47
>         >>         >
>         >>         >
>         >>         >
>         >>         > ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>         >>         >
>         >>         > http://pastebin.com/SvGfcSHb
>         >>         > http://pastebin.com/gYFatsNS
>         >>         > http://pastebin.com/VZD7j2vN
>         >>         >
>         >>         > I do not understand why I/O on the ENTIRE cluster is blocked
>         >>         > when only a few pgs are incomplete.
>         >>         >
>         >>         > Many thanks,
>         >>         > Mario
>         >>         >
>         >>         >
>         >>         > On Tue, 28 Jun 2016 at 19:34, Stefan Priebe - Profihost AG
>         >>         > <s.priebe@xxxxxxxxxxxx> wrote:
>         >>         >
>         >>         > > And ceph health detail
>         >>         > >
>         >>         > > Stefan
>         >>         > >
>         >>         > > Excuse my typo sent from my mobile phone.
>         >>         > >
>         >>         > > On 28.06.2016 at 19:28, Oliver Dzombic
>         >>         > > <info@xxxxxxxxxxxxxxxxx> wrote:
>         >>         > >
>         >>         > > Hi Mario,
>         >>         > >
>         >>         > > please give some more details:
>         >>         > >
>         >>         > > Please post the output of:
>         >>         > >
>         >>         > > ceph osd pool ls detail
>         >>         > > ceph osd df
>         >>         > > ceph --version
>         >>         > >
>         >>         > > ceph -w for 10 seconds ( use http://pastebin.com/ please )
>         >>         > >
>         >>         > > ceph osd crush dump ( also pastebin pls )
>         >>         > >
>         >>         > > On 28.06.2016 at 18:59, Mario Giammarco wrote:
>         >>         > >
>         >>         > > Hello,
>         >>         > >
>         >>         > > this is the second time this has happened to me; I hope that
>         >>         > > someone can explain what I can do.
>         >>         > >
>         >>         > > Proxmox ceph cluster with 8 servers, 11 hdd. Min_size=1, size=2.
>         >>         > >
>         >>         > >
>         >>         > > One hdd goes down due to bad sectors.
>         >>         > >
>         >>         > > Ceph recovers but it ends with:
>         >>         > >
>         >>         > >
>         >>         > > cluster f2a8dd7d-949a-4a29-acab-11d4900249f4
>         >>         > >     health HEALTH_WARN
>         >>         > >            3 pgs down
>         >>         > >            19 pgs incomplete
>         >>         > >            19 pgs stuck inactive
>         >>         > >            19 pgs stuck unclean
>         >>         > >            7 requests are blocked > 32 sec
>         >>         > >     monmap e11: 7 mons at {0=192.168.0.204:6789/0,1=192.168.0.201:6789/0,2=192.168.0.203:6789/0,3=192.168.0.205:6789/0,4=192.168.0.202:6789/0,5=192.168.0.206:6789/0,6=192.168.0.207:6789/0}
>         >>         > >            election epoch 722, quorum 0,1,2,3,4,5,6 1,4,2,0,3,5,6
>         >>         > >     osdmap e10182: 10 osds: 10 up, 10 in
>         >>         > >      pgmap v3295880: 1024 pgs, 2 pools, 4563 GB data, 1143 kobjects
>         >>         > >            9136 GB used, 5710 GB / 14846 GB avail
>         >>         > >                1005 active+clean
>         >>         > >                  16 incomplete
>         >>         > >                   3 down+incomplete
>         >>         > >
>         >>         > >
>         >>         > > Unfortunately "7 requests blocked" means no virtual machine
>         >>         > > can boot because ceph has stopped i/o.
>         >>         > >
>         >>         > > I can accept losing some data, but not ALL data!
>         >>         > >
>         >>         > > Can you help me please?
>         >>         > >
>         >>         > > Thanks,
>         >>         > >
>         >>         > > Mario
>         >>         > >
>         >>         > >
>         >>         > > _______________________________________________
>         >>         > > ceph-users mailing list
>         >>         > > ceph-users@xxxxxxxxxxxxxx
>         >>         > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>         >>         > >
>         >>         > >
>         >>
>         >>
>         >>         --
>         >>         Christian Balzer        Network/Systems Engineer
>         >>         chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>         >>         http://www.gol.com/
>         >>
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com