Ceph Recovery Assistance, pgs stuck peering

Howdy,

I was hoping someone could help me recover a couple of PGs which are causing problems in my cluster. If we aren't able to resolve this soon, we may have to destroy them and lose some data. Recovery has so far been unsuccessful. Data loss would probably cause some here to reconsider Ceph as something we'll stick with long term, so I'd love to recover it.

Ceph 9.2.1. I have 4 (well, 3 now) PGs which are incomplete and stuck peering after a disk failure.

pg 12.7a1 query: https://gist.github.com/benh57/ba4f96103e1f6b3b7a4d
pg 12.7b query: https://gist.github.com/benh57/8db0bfccc5992b9ca71a
pg 10.4f query:  https://gist.github.com/benh57/44bdd2a19ea667d920ab
ceph osd tree: https://gist.github.com/benh57/9fc46051a0f09b6948b7

- The bad OSD (osd-73) was on mtl-024. There were no 'unfound' objects when it went down; the pg was 'down + peering'. It was marked lost.
- After marking 73 lost, the new primary still wants to peer and flips between peering and incomplete.
- Noticed '73' still shows in the pg query output for the bad pgs. (Maybe I need to bring back an OSD with the same name?)
- Noticed that the new primary got set to an OSD (osd-77) which was on the same node as the OSD (osd-76) which had all the data. Figuring 77 couldn't peer properly because it was on the same node as the data, I set 77 out; 36 became primary and 76 became one of the replicas. No change. (Rough equivalents of the mark-lost / set-out commands are below.)
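
(For reference, those two steps were roughly the following -- from memory, so the exact invocations may be slightly off:)

    ceph osd lost 73 --yes-i-really-mean-it
    ceph osd out 77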

Startup logs of the primaries of the bad PGs (12.7a1, 10.4f), with 'debug osd = 20, debug filestore = 30, debug ms = 1' (large files):

osd 36 (12.7a1) startup log:  https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.36.log
osd 6 (10.4f) startup log: https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.6.log
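
(Those debug levels were set in ceph.conf under [osd] before restarting, roughly the snippet below; they could presumably also be injected at runtime with 'ceph tell osd.N injectargs', but that wouldn't capture startup.)

    [osd]
        debug osd = 20
        debug filestore = 30
        debug ms = 1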


Some other Notes: 

- Searching the OSDs for data in 12.7a1_head, I found that osd-76 has 12G but primary osd-36 has only 728M. Another OSD which is out (osd-100) also has a copy of the data. Even running a pg repair does not pick up the data from 76; the pg remains stuck peering. (Roughly how I searched is shown below.)
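  (The search was roughly the following on each node -- default FileStore paths, adjust if yours differ:)

    du -sh /var/lib/ceph/osd/ceph-*/current/12.7a1_head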

- One of the PGs was part of a pool which is no longer needed (the unused radosgw .rgw.control pool, with one 0 KB object in it). Per steps previously discussed here for a similar failure, I attempted these recovery steps on it, to see if they would work for the others:

-- The failed OSD's disk only mounts read-only, which causes ceph-objectstore-tool to fail to export, so I exported from a seemingly good copy on another OSD.
-- stopped all OSDs
-- exported the pg with ceph-objectstore-tool from an apparently good OSD (rough invocations for these objectstore-tool steps are below)
-- removed the pg from all OSDs which had it, using ceph-objectstore-tool
-- imported the pg into an out OSD, osd-100:
     Importing pgid 4.95
     Write 4/88aa5c95/notify.2/head
     Import successful
-- force-recreated the pg on the cluster:
     ceph pg force_create_pg 4.95
-- brought up all OSDs
-- the new pg 4.95 primary gets set to osd-99 + osd-64, 0 objects
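
(The ceph-objectstore-tool steps above were roughly the following -- from memory, and the OSD ids marked NN are placeholders rather than the exact ones used:)

    # on the stopped OSD holding the apparently good copy
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
        --journal-path /var/lib/ceph/osd/ceph-NN/journal \
        --op export --pgid 4.95 --file /tmp/pg4.95.export

    # on every stopped OSD that still had the pg
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
        --journal-path /var/lib/ceph/osd/ceph-NN/journal \
        --op remove --pgid 4.95

    # on the out OSD (osd-100)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 \
        --journal-path /var/lib/ceph/osd/ceph-100/journal \
        --op import --file /tmp/pg4.95.export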

However, the object doesn't sync to the pg from osd-100; instead osd-64 tells osd-100 to remove its copy:
2016-03-05 15:44:22.858147 7fc004168700 20 osd.100 68025 _dispatch 0x7fc020867660 osd pg remove(epoch 68025; pg4.95; ) v2
2016-03-05 15:44:22.858174 7fc004168700  7 osd.100 68025 handle_pg_remove from osd.64 on 1 pgs
2016-03-05 15:44:22.858176 7fc004168700 15 osd.100 68025 require_same_or_newer_map 68025 (i am 68025) 0x7fc020867660
2016-03-05 15:44:22.858188 7fc004168700  5 osd.100 68025 queue_pg_for_deletion: 4.95
2016-03-05 15:44:22.858228 7fc004168700 15 osd.100 68025 project_pg_history 4.95 from 68025 to 68025, start ec=76 les/c/f 62655/62611/0 66982/67983/66982

Not wanting this to happen to my needed data from the other PGs, I didn't try this procedure on them. After this procedure osd-100 does get listed in 'pg query' as 'might_have_unfound', but Ceph apparently decides not to use it and the active OSD sends a remove.

output of 'ceph pg 4.95 query' after these recovery steps: https://gist.github.com/benh57/fc9a847cd83f4d5e4dcf


Quite Possibly Related:

I am occasionally noticing inconsistencies in 'ceph osd tree': it seems my crush map thinks some OSDs are on the wrong hosts. I wonder if this is why peering is failing?
(example)
 -5   9.04999     host cld-mtl-006
 12   1.81000         osd.12               up  1.00000          1.00000
 13   1.81000         osd.13               up  1.00000          1.00000
 14   1.81000         osd.14               up  1.00000          1.00000
 94   1.81000         osd.94               up  1.00000          1.00000
 26   1.81000         osd.26               up  0.86775          1.00000

^^ This host only has 4 OSDs on it! osd.26 is actually running over on cld-mtl-004! Restarting 26 fixed the map.
osd.42 (out) was also in the wrong place in 'osd tree': the tree says it's on cld-mtl-013, but it's actually on cld-mtl-024.
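(I assume the misplacement could also be corrected without a restart by explicitly setting the OSD's position in the crush map, something like the line below -- the weight and host here are just illustrative:)

    ceph osd crush set osd.26 1.81 root=default host=cld-mtl-004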
- Fixing these issues caused a large rebalance, so 'ceph health detail' is a bit dirty right now, but you can see the stuck pgs:
ceph health detail: 

-  I wonder if these incorrect crush maps caused Ceph to put some data on the wrong OSDs, resulting in a peering failure later when the map repaired itself?
-  How does Ceph determine which node an OSD is on? That process may be periodically failing due to some issue (DNS?). (A config sketch related to this is below.)
-  Perhaps if I enabled an 'allow peering to the same host' setting, the cluster could repair itself? Then I could turn it off again.
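
(Regarding how the host is determined: my understanding is that the OSD updates its own crush location at startup, via the 'osd crush update on start' behaviour and the crush location hook, based on the hostname it sees. If that's right, one workaround might be to pin locations explicitly in ceph.conf, roughly like the snippet below -- I'm not sure this is the correct fix here:)

    [osd.26]
        osd crush location = root=default host=cld-mtl-004

or, to stop OSDs from moving themselves in the crush map at startup entirely:

    [osd]
        osd crush update on start = false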


Any assistance is appreciated!

-Ben
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
