Re: Another cluster completely hang

Hi,

It does not.

But in your case, you have 10 OSDs, and 7 of them have incomplete PGs.

So since your Proxmox VPSs are not confined to single PGs but are spread
across many PGs, there is a good chance that at least some data of every
VPS sits on one of the defective PGs.
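To see that spread concretely, you can map a handful of an image's objects to
their PGs with `ceph osd map`. This is only a sketch: the pool name "rbd" and
image name "vm-100-disk-1" are placeholders, and the object-name prefix is
read from `rbd info`:

```shell
# Each RBD image is striped into 4 MB objects named <prefix>.<16-hex-digit
# index>, and each object hashes to its own PG, so one VM image touches many
# PGs. (Pool "rbd" and image "vm-100-disk-1" are hypothetical names.)
prefix=$(rbd info rbd/vm-100-disk-1 | awk '/block_name_prefix/ {print $2}')
for i in $(seq 0 9); do
    # Print which PG and which OSDs object $i of the image maps to:
    ceph osd map rbd "$(printf '%s.%016x' "$prefix" "$i")"
done
```

If any PG printed here is among the incomplete ones, I/O to that part of the
image will block.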

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 29.06.2016 at 13:09, Mario Giammarco wrote:
> Just one question: why, when Ceph has some incomplete PGs, does it
> refuse to do I/O on the good PGs?
> 
> 
> On Wed, 29 Jun 2016 at 12:55, Oliver Dzombic
> <info@xxxxxxxxxxxxxxxxx> wrote:
> 
>     Hi,
> 
>     again:
> 
>     You >must< check all your logs ( as annoying as that is, for sure ).
> 
>     That means on the Ceph nodes in /var/log/ceph/*.
> 
>     And go back to the time when things went downhill.
> 
>     There must be something else going on, beyond a normal OSD crash.
> 
>     And your manual pg repair / pg remove / mark-complete is, most probably,
>     just making your situation worse.
> 
>     So really, if you want a chance to find out what is going on, you
>     must check all the logs. Especially the OSD logs, especially the log
>     of the OSD you removed, and then the OSD logs of those PGs which are
>     incomplete/stuck/whatever_not_good.
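As a concrete sketch of that log sifting: the path and file naming follow a
default Ceph install, and the sample log lines below are made up purely for
illustration. On a real node you would point the grep at
/var/log/ceph/ceph-osd.*.log and narrow the date range to when the cluster
degraded.

```shell
# Filter an OSD log for error- and warning-level cluster messages.
# The sample file stands in for /var/log/ceph/ceph-osd.<id>.log.
cat > /tmp/ceph-osd.sample.log <<'EOF'
2016-06-25 02:11:03.123456 7f9c0 osd.9 1234 : cluster [INF] 1.98 scrub ok
2016-06-26 14:02:44.654321 7f9c0 osd.9 1240 : cluster [ERR] 1.98 shard 9: soid missing
2016-06-27 09:30:00.111111 7f9c0 osd.9 1251 : cluster [WRN] slow request 32.5 seconds old
EOF
# Keep only [ERR] and [WRN] lines; [INF] noise is dropped.
grep -E '\[(ERR|WRN)\]' /tmp/ceph-osd.sample.log
```

The same pattern, run across all OSD logs and sorted by timestamp, usually
pins down the moment things went wrong.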
> 
> 
>     On 29.06.2016 at 12:33, Mario Giammarco wrote:
>     > Thanks,
>     > I can put the OSDs in but they do not stay in, and I am pretty
>     > sure they are not broken.
>     >
>     > On Wed, 29 Jun 2016 at 12:07, Oliver Dzombic
>     > <info@xxxxxxxxxxxxxxxxx> wrote:
>     >
>     >     hi,
>     >
>     >     ceph osd set noscrub
>     >     ceph osd set nodeep-scrub
>     >
>     >     ceph osd in <id>
>     >
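To confirm the flags took effect, and to re-enable scrubbing once the cluster
is healthy again (standard flag names; this assumes nothing beyond the
commands quoted above):

```shell
# The cluster-wide flags show up in the osdmap summary:
ceph osd dump | grep flags        # should list noscrub,nodeep-scrub
# After recovery has finished, turn scrubbing back on:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```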
>     >
>     >
>     >
>     >     On 29.06.2016 at 12:00, Mario Giammarco wrote:
>     >     > Now the problem is that Ceph has put out two disks because
>     >     > scrubbing has failed (I think it is not a disk fault but due
>     >     > to the mark-complete). How can I:
>     >     > - disable scrub
>     >     > - put the two disks back in
>     >     >
>     >     > I will wait anyway for the end of recovery, to be sure it
>     >     > really works again.
>     >     >
>     >     > On Wed, 29 Jun 2016 at 11:16, Mario Giammarco
>     >     > <mgiammarco@xxxxxxxxx> wrote:
>     >     >
>     >     >     In fact I am worried because:
>     >     >
>     >     >     1) Ceph is under Proxmox, and Proxmox may decide to
>     >     >     reboot a server if it is not responding
>     >     >     2) probably a server was rebooted while Ceph was
>     >     >     reconstructing
>     >     >     3) even using max=3 does not help
>     >     >
>     >     >     Anyway, this is the "unofficial" procedure that I am
>     >     >     using, much simpler than the blog post:
>     >     >
>     >     >     1) find the host where the PG is
>     >     >     2) stop Ceph on that host
>     >     >     3) ceph-objectstore-tool --pgid 1.98 --op mark-complete
>     >     >     --data-path /var/lib/ceph/osd/ceph-9 --journal-path
>     >     >     /var/lib/ceph/osd/ceph-9/journal
>     >     >     4) start Ceph
>     >     >     5) finally watch it reconstructing
>     >     >
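The five-step procedure above, written out as a shell sketch. OSD 9 and PG
1.98 are the values from the message; the systemd unit names are an
assumption and may need `service`-style equivalents on this Proxmox version.
Note that mark-complete declares the PG's local contents authoritative and
can silently discard newer writes, so it really is a last resort:

```shell
# 2) stop the OSD that holds the incomplete PG (here osd.9)
systemctl stop ceph-osd@9
# 3) mark PG 1.98 complete in that OSD's local store (DANGEROUS: this
#    accepts whatever data the OSD currently has as the full history)
ceph-objectstore-tool --pgid 1.98 --op mark-complete \
    --data-path /var/lib/ceph/osd/ceph-9 \
    --journal-path /var/lib/ceph/osd/ceph-9/journal
# 4) restart the OSD
systemctl start ceph-osd@9
# 5) watch the PG peer and recover
ceph -w
```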
>     >     >     On Wed, 29 Jun 2016 at 11:11, Oliver Dzombic
>     >     >     <info@xxxxxxxxxxxxxxxxx> wrote:
>     >     >
>     >     >         Hi,
>     >     >
>     >     >         removing ONE disk while your replication is 2 is no
>     >     >         problem.
>     >     >
>     >     >         You don't need to wait a single second to replace or
>     >     >         remove it. It is anyway not used and out/down. So from
>     >     >         Ceph's point of view it does not exist.
>     >     >
>     >     >         ----------------
>     >     >
>     >     >         But as Christian told you already, what we see now
>     >     >         fits a scenario where you lost the OSD and either you
>     >     >         did something, or something else happened, but the
>     >     >         data were not recovered again.
>     >     >
>     >     >         Either because another OSD was broken, or because you
>     >     >         did something.
>     >     >
>     >     >         Maybe, because of the "too many PGs per OSD (307 >
>     >     >         max 300)", Ceph never recovered.
>     >     >
>     >     >         What I can see from http://pastebin.com/VZD7j2vN is
>     >     >         that
>     >     >
>     >     >         OSD 5,13,9,0,6,2,3, and maybe others, are the OSDs
>     >     >         holding the incomplete data.
>     >     >
>     >     >         That is 7 OSDs out of 10. So something happened to
>     >     >         those OSDs or the data on them. And that had nothing
>     >     >         to do with a single disk failing.
>     >     >
>     >     >         Something else must have happened.
>     >     >
>     >     >         And as Christian already wrote: you will have to go
>     >     >         through your logs back to the point where things went
>     >     >         down.
>     >     >
>     >     >         Because a failure of a single OSD, no matter what your
>     >     >         replication size is, can ( normally ) not harm the
>     >     >         consistency of 7 other OSDs, i.e. 70% of your total
>     >     >         cluster.
>     >     >
>     >     >
>     >     >
>     >     >         On 29.06.2016 at 10:56, Mario Giammarco wrote:
>     >     >         > Yes, I have removed it from CRUSH because it was
>     >     >         > broken. I waited 24 hours to see if Ceph would heal
>     >     >         > itself. Then I removed the disk completely (it was
>     >     >         > broken...) and I waited 24 hours again. Then I
>     >     >         > started getting worried.
>     >     >         > Are you saying that I should not remove a broken
>     >     >         > disk from the cluster? Were 24 hours not enough?
>     >     >         >
>     >     >         > On Wed, 29 Jun 2016 at 10:53, Zoltan Arnold Nagy
>     >     >         > <zoltan@xxxxxxxxxxxxxxxxxx> wrote:
>     >     >         >
>     >     >         >     Just losing one disk doesn't automagically
>     >     >         >     delete it from CRUSH, but in the output you had
>     >     >         >     10 disks listed, so there must be something else
>     >     >         >     going on - did you delete the disk from the
>     >     >         >     crush map as well?
>     >     >         >
>     >     >         >     Ceph waits by default 300 secs, AFAIK, to mark
>     >     >         >     an OSD out, after which it will start to
>     >     >         >     recover.
>     >     >         >
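That grace period is the monitor option `mon_osd_down_out_interval` (value in
seconds, default 300). A sketch for inspecting or raising it, e.g. during
planned maintenance; the mon id "0" is a placeholder for whatever monitor
runs locally:

```shell
# Ask a monitor for the current down->out interval (run on a mon host;
# the id after "mon." must match the local daemon):
ceph daemon mon.0 config get mon_osd_down_out_interval
# Raise it to 10 minutes at runtime, without restarting any daemons:
ceph tell mon.\* injectargs '--mon-osd-down-out-interval 600'
```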
>     >     >         >
>     >     >         >>     On 29 Jun 2016, at 10:42, Mario Giammarco
>     >     >         >>     <mgiammarco@xxxxxxxxx> wrote:
>     >     >         >>
>     >     >         >>     I thank you for your reply, so I can add my
>     >     >         >>     experience:
>     >     >         >>
>     >     >         >>     1) the other time this happened to me I had a
>     >     >         >>     cluster with min_size=2 and size=3 and the
>     >     >         >>     problem was the same. That time I set
>     >     >         >>     min_size=1 to recover the pool but it did not
>     >     >         >>     help. So I do not understand the advantage of
>     >     >         >>     keeping three copies when Ceph can decide to
>     >     >         >>     discard all three.
>     >     >         >>     2) I started with 11 HDDs. One hard disk
>     >     >         >>     failed. Ceph waited forever for the hard disk
>     >     >         >>     to come back. But the hard disk was really
>     >     >         >>     completely broken, so I followed the procedure
>     >     >         >>     to really delete it from the cluster. Anyway,
>     >     >         >>     Ceph did not recover.
>     >     >         >>     3) I have 307 PGs, more than 300, but that is
>     >     >         >>     because I had 11 HDDs and now only 10. I will
>     >     >         >>     add more HDDs after I repair the pool.
>     >     >         >>     4) I have reduced the monitors to 3
>     >     >         >>
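For reference, the usual sequence to fully delete a dead OSD, using the
pre-Luminous CLI that matches the 0.94.x cluster in this thread; osd.11 is a
hypothetical id, and ideally recovery is complete before the CRUSH removal:

```shell
ceph osd out 11                # remap its data; wait for recovery to finish
ceph osd crush remove osd.11   # remove it from the CRUSH map
ceph auth del osd.11           # drop its cephx key
ceph osd rm 11                 # remove it from the osdmap
```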
>     >     >         >>
>     >     >         >>
>     >     >         >>     On Wed, 29 Jun 2016 at 10:25, Christian Balzer
>     >     >         >>     <chibi@xxxxxxx> wrote:
>     >     >         >>
>     >     >         >>
>     >     >         >>         Hello,
>     >     >         >>
>     >     >         >>         On Wed, 29 Jun 2016 06:02:59 +0000 Mario
>     >     >         >>         Giammarco wrote:
>     >     >         >>
>     >     >         >>         > pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>     >     >         >>                                   ^
>     >     >         >>         And that's the root cause of all your woes.
>     >     >         >>         The default replication size is 3 for a
>     >     >         >>         reason, and while I do run pools with a
>     >     >         >>         replication of 2, they are either HDD RAIDs
>     >     >         >>         or extremely trustworthy and well-monitored
>     >     >         >>         SSDs.
>     >     >         >>
>     >     >         >>         That said, something more than a single HDD
>     >     >         >>         failure must have happened here; you should
>     >     >         >>         check the logs and backtrace all the steps
>     >     >         >>         you took after that OSD failed.
>     >     >         >>
>     >     >         >>         You said there were 11 HDDs, and your first
>     >     >         >>         ceph -s output showed:
>     >     >         >>         ---
>     >     >         >>              osdmap e10182: 10 osds: 10 up, 10 in
>     >     >         >>         ---
>     >     >         >>         And your crush map states the same.
>     >     >         >>
>     >     >         >>         So how and WHEN did you remove that OSD?
>     >     >         >>         My suspicion would be that it was removed
>     >     >         >>         before recovery was complete.
>     >     >         >>
>     >     >         >>         Also, as I think was mentioned before, 7
>     >     >         >>         mons are overkill; 3-5 would be a saner
>     >     >         >>         number.
>     >     >         >>
>     >     >         >>         Christian
>     >     >         >>
>     >     >         >>         > rjenkins pg_num 512 pgp_num 512 last_change 9313 flags hashpspool
>     >     >         >>         > stripe_width 0
>     >     >         >>         >        removed_snaps [1~3]
>     >     >         >>         > pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>     >     >         >>         > rjenkins pg_num 512 pgp_num 512 last_change 9314 flags hashpspool
>     >     >         >>         > stripe_width 0
>     >     >         >>         >        removed_snaps [1~3]
>     >     >         >>         > pool 2 'rbd3' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>     >     >         >>         > rjenkins pg_num 512 pgp_num 512 last_change 10537 flags hashpspool
>     >     >         >>         > stripe_width 0
>     >     >         >>         >        removed_snaps [1~3]
>     >     >         >>         >
>     >     >         >>         >
>     >     >         >>         > ID WEIGHT  REWEIGHT SIZE   USE   AVAIL %USE  VAR
>     >     >         >>         >  5 1.81000  1.00000  1857G  984G  872G 53.00 0.86
>     >     >         >>         >  6 1.81000  1.00000  1857G 1202G  655G 64.73 1.05
>     >     >         >>         >  2 1.81000  1.00000  1857G 1158G  698G 62.38 1.01
>     >     >         >>         >  3 1.35999  1.00000  1391G  906G  485G 65.12 1.06
>     >     >         >>         >  4 0.89999  1.00000   926G  702G  223G 75.88 1.23
>     >     >         >>         >  7 1.81000  1.00000  1857G 1063G  793G 57.27 0.93
>     >     >         >>         >  8 1.81000  1.00000  1857G 1011G  846G 54.44 0.88
>     >     >         >>         >  9 0.89999  1.00000   926G  573G  352G 61.91 1.01
>     >     >         >>         >  0 1.81000  1.00000  1857G 1227G  629G 66.10 1.07
>     >     >         >>         > 13 0.45000  1.00000   460G  307G  153G 66.74 1.08
>     >     >         >>         >              TOTAL 14846G 9136G 5710G 61.54
>     >     >         >>         > MIN/MAX VAR: 0.86/1.23  STDDEV: 6.47
>     >     >         >>         >
>     >     >         >>         >
>     >     >         >>         > ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>     >     >         >>         >
>     >     >         >>         > http://pastebin.com/SvGfcSHb
>     >     >         >>         > http://pastebin.com/gYFatsNS
>     >     >         >>         > http://pastebin.com/VZD7j2vN
>     >     >         >>         >
>     >     >         >>         > I do not understand why I/O on the
>     >     >         >>         > ENTIRE cluster is blocked when only a few
>     >     >         >>         > PGs are incomplete.
>     >     >         >>         >
>     >     >         >>         > Many thanks,
>     >     >         >>         > Mario
>     >     >         >>         >
>     >     >         >>         >
>     >     >         >>         > On Tue, 28 Jun 2016 at 19:34, Stefan
>     >     >         >>         > Priebe - Profihost AG
>     >     >         >>         > <s.priebe@xxxxxxxxxxxx> wrote:
>     >     >         >>         >
>     >     >         >>         > > And ceph health detail
>     >     >         >>         > >
>     >     >         >>         > > Stefan
>     >     >         >>         > >
>     >     >         >>         > > Excuse my typo sent from my mobile phone.
>     >     >         >>         > >
>     >     >         >>         > > On 28.06.2016 at 19:28, Oliver Dzombic
>     >     >         >>         > > <info@xxxxxxxxxxxxxxxxx> wrote:
>     >     >         >>         > >
>     >     >         >>         > > Hi Mario,
>     >     >         >>         > >
>     >     >         >>         > > please give some more details:
>     >     >         >>         > >
>     >     >         >>         > > Please the output of:
>     >     >         >>         > >
>     >     >         >>         > > ceph osd pool ls detail
>     >     >         >>         > > ceph osd df
>     >     >         >>         > > ceph --version
>     >     >         >>         > >
>     >     >         >>         > > ceph -w for 10 seconds ( use http://pastebin.com/ please )
>     >     >         >>         > >
>     >     >         >>         > > ceph osd crush dump ( also pastebin pls )
>     >     >         >>         > >
>     >     >         >>         > >
>     >     >         >>         > >
>     >     >         >>         > > On 28.06.2016 at 18:59, Mario Giammarco wrote:
>     >     >         >>         > >
>     >     >         >>         > > Hello,
>     >     >         >>         > >
>     >     >         >>         > > this is the second time this has
>     >     >         >>         > > happened to me; I hope that someone can
>     >     >         >>         > > explain what I can do.
>     >     >         >>         > >
>     >     >         >>         > > Proxmox Ceph cluster with 8 servers,
>     >     >         >>         > > 11 HDDs. Min_size=1, size=2.
>     >     >         >>         > >
>     >     >         >>         > > One HDD goes down due to bad sectors.
>     >     >         >>         > >
>     >     >         >>         > > Ceph recovers but it ends with:
>     >     >         >>         > >
>     >     >         >>         > > cluster f2a8dd7d-949a-4a29-acab-11d4900249f4
>     >     >         >>         > >     health HEALTH_WARN
>     >     >         >>         > >            3 pgs down
>     >     >         >>         > >            19 pgs incomplete
>     >     >         >>         > >            19 pgs stuck inactive
>     >     >         >>         > >            19 pgs stuck unclean
>     >     >         >>         > >            7 requests are blocked > 32 sec
>     >     >         >>         > >     monmap e11: 7 mons at
>     >     >         >>         > > {0=192.168.0.204:6789/0,1=192.168.0.201:6789/0,2=192.168.0.203:6789/0,3=192.168.0.205:6789/0,4=192.168.0.202:6789/0,5=192.168.0.206:6789/0,6=192.168.0.207:6789/0}
>     >     >         >>         > >            election epoch 722, quorum
>     >     >         >>         > > 0,1,2,3,4,5,6 1,4,2,0,3,5,6
>     >     >         >>         > >     osdmap e10182: 10 osds: 10 up, 10 in
>     >     >         >>         > >      pgmap v3295880: 1024 pgs, 2 pools, 4563 GB data, 1143 kobjects
>     >     >         >>         > >            9136 GB used, 5710 GB / 14846 GB avail
>     >     >         >>         > >                1005 active+clean
>     >     >         >>         > >                  16 incomplete
>     >     >         >>         > >                   3 down+incomplete
>     >     >         >>         > >
>     >     >         >>         > > Unfortunately "7 requests blocked"
>     >     >         >>         > > means no virtual machine can boot,
>     >     >         >>         > > because Ceph has stopped I/O.
>     >     >         >>         > >
>     >     >         >>         > > I can accept losing some data, but not
>     >     >         >>         > > ALL data!
>     >     >         >>         > >
>     >     >         >>         > > Can you help me please?
>     >     >         >>         > >
>     >     >         >>         > > Thanks,
>     >     >         >>         > >
>     >     >         >>         > > Mario
>     >     >         >>         > >
>     >     >         >>         > >
>     >     >         >>         > > _______________________________________________
>     >     >         >>         > > ceph-users mailing list
>     >     >         >>         > > ceph-users@xxxxxxxxxxxxxx
>     >     >         >>         > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >     >         >>
>     >     >         >>
>     >     >         >>         --
>     >     >         >>         Christian Balzer        Network/Systems
>     Engineer
>     >     >         >>         chibi@xxxxxxx <mailto:chibi@xxxxxxx>
>     <mailto:chibi@xxxxxxx <mailto:chibi@xxxxxxx>>
>     >     <mailto:chibi@xxxxxxx <mailto:chibi@xxxxxxx>
>     <mailto:chibi@xxxxxxx <mailto:chibi@xxxxxxx>>>
>     >     >         <mailto:chibi@xxxxxxx <mailto:chibi@xxxxxxx>
>     <mailto:chibi@xxxxxxx <mailto:chibi@xxxxxxx>>
>     >     <mailto:chibi@xxxxxxx <mailto:chibi@xxxxxxx>
>     <mailto:chibi@xxxxxxx <mailto:chibi@xxxxxxx>>>>           Global
>     >     >         OnLine
>     >     >         >>         Japan/Rakuten Communications
>     >     >         >>         http://www.gol.com/
>     >     >         >>
>     >     >         >>     _______________________________________________
>     >     >         >>     ceph-users mailing list
>     >     >         >>     ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>
>     >     <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>>
>     >     >         <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>
>     >     <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>>>
>     >     >         <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>
>     >     <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>>
>     >     >         <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>
>     >     <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>>>>
>     >     >         >>   
>      http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >     >         >
>     >     >         >
>     >     >         >
>     >     >         > _______________________________________________
>     >     >         > ceph-users mailing list
>     >     >         > ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>
>     >     <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>>
>     <mailto:ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>     >     <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>>>
>     >     >         > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >     >         >
>     >     >         _______________________________________________
>     >     >         ceph-users mailing list
>     >     >         ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>
>     >     <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>>
>     <mailto:ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>     >     <mailto:ceph-users@xxxxxxxxxxxxxx
>     <mailto:ceph-users@xxxxxxxxxxxxxx>>>
>     >     >         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >     >
>     >     _______________________________________________
>     >     ceph-users mailing list
>     >     ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>     <mailto:ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>>
>     >     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com