After making that setting, the pg appeared to start peering, but then it changed the primary OSD to osd.100 and went incomplete again. Perhaps it did that because another OSD had more data? I presume I need to set that value on each OSD the pg hops to.
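In case it helps, this is roughly what I'm doing to chase it around (a sketch, not an exact transcript; if injectargs reports the option as unchangeable at runtime, I assume it has to go into ceph.conf under each [osd.N] section with a daemon restart instead):

    ceph pg map 12.7a1        # shows the current up/acting sets
    ceph pg 12.7a1 query      # recovery_state / probing_osds shows where peering is looking

    # set the flag on each OSD the pg lands on (36, 76, 100 are the ones from this thread)
    for id in 36 76 100; do
        ceph tell osd.$id injectargs '--osd_find_best_info_ignore_history_les=1'
    done
    ceph osd down 100         # current primary; marking it down forces re-peering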
-Ben
On Tue, Mar 8, 2016 at 10:39 AM, David Zafman <dzafman@xxxxxxxxxx> wrote:
Ben,
I haven't looked at everything in your message, but pg 12.7a1 has lost data because of writes that went only to osd.73. The way to recover this is to force recovery to ignore that fact and go with whatever data you have on the remaining OSDs.
I assume this was caused by having min_size 1, multiple nodes failing while clients continued to write, and then permanently losing osd.73.
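Once things are healthy again, I'd also consider raising min_size so a single surviving copy can't keep acknowledging writes. Something along these lines (pool names are yours to fill in):

    ceph osd pool get <pool> min_size
    ceph osd pool set <pool> min_size 2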
You should TEMPORARILY set the osd_find_best_info_ignore_history_les config variable to 1 on osd.36 and then mark it down (ceph osd down), so it will rejoin, re-peer and mark the pg active+clean. Don't forget to set osd_find_best_info_ignore_history_les back to 0 afterwards.
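Roughly like this (a sketch; if injectargs won't change it at runtime, put it in ceph.conf under [osd.36] and restart that daemon instead):

    ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les=1'
    ceph osd down 36                    # it rejoins and re-peers on its own
    ceph pg 12.7a1 query                # wait for the pg to go active+clean
    ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les=0'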
Later you should fix your crush map. See http://docs.ceph.com/docs/master/rados/operations/crush-map/
The wrong placements make you vulnerable to a single host failure taking out multiple copies of an object.
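For example, to move an OSD that the map has under the wrong host (weight and bucket names below are assumptions based on your tree output, not verified against your cluster):

    ceph osd tree                                                  # confirm current placement
    ceph osd crush set osd.26 1.81 root=default host=cld-mtl-004
    # or, equivalently:
    # ceph osd crush create-or-move osd.26 1.81 root=default host=cld-mtl-004

I believe OSDs place themselves under the bucket matching their hostname at startup when 'osd crush update on start = true' (the default), which is presumably why restarting osd.26 fixed the map for you.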
David
On 3/7/16 9:41 PM, Ben Hines wrote:
Howdy, I was hoping someone could help me recover a couple of pgs which are causing problems in my cluster. If we aren't able to resolve this soon, we may have to just destroy them and lose some data. Recovery has so far been unsuccessful. Data loss would probably cause some here to reconsider Ceph as something we'll stick with long term, so I'd love to recover it. Ceph 9.2.1.

I have 4 (well, 3 now) pgs which are incomplete + stuck peering after a disk failure:

pg 12.7a1 query: https://gist.github.com/benh57/ba4f96103e1f6b3b7a4d
pg 12.7b query: https://gist.github.com/benh57/8db0bfccc5992b9ca71a
pg 10.4f query: https://gist.github.com/benh57/44bdd2a19ea667d920ab
ceph osd tree: https://gist.github.com/benh57/9fc46051a0f09b6948b7

- The bad OSD (osd-73) was on mtl-024. There were no 'unfound' objects when it went down; the pg was 'down + peering'. It was marked lost.
- After marking 73 lost, the new primary still wants to peer and flips between peering and incomplete.
- Noticed '73' still shows in the pg query output for the bad pgs. (Maybe I need to bring back an osd with the same name?)
- Noticed that the new primary got set to an osd (osd-77) which was on the same node as the osd (osd-76) which had all the data. Figuring 77 couldn't peer with 36 because it was on the same node, I set 77 out; 36 became primary and 76 became one of the replicas. No change.

Startup logs of the primaries of the bad pgs (12.7a1, 10.4f) with 'debug osd = 20, debug filestore = 30, debug ms = 1' (large files):

osd 36 (12.7a1) startup log: https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.36.log
osd 6 (10.4f) startup log: https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.6.log

Some other notes:

- Searching for OSDs which had data in 12.7a1_head, I found that osd-76 has 12G, but primary osd-36 has 728M. Another OSD which is out (100) also has a copy of the data. Even running a pg repair does not pick up the data from 76; it remains stuck peering.
- One of the pgs was part of a pool which was no longer needed (the unused radosgw .rgw.control pool, with one 0kb object in it). Per previous steps discussed here for a similar failure, I attempted these recovery steps on it, to see if they would work for the others (a rough sketch of the export/import commands is at the end of this message):
-- The failed osd's disk only mounts read-only, which causes ceph-objectstore-tool to fail to export, so I exported it from a seemingly good copy on another osd.
-- Stopped all osds.
-- Exported the pg with ceph-objectstore-tool from an apparently good OSD.
-- Removed the pg from all osds which had it, using ceph-objectstore-tool.
-- Imported the pg into an out osd, osd-100:

   Importing pgid 4.95
   Write 4/88aa5c95/notify.2/head
   Import successful

-- Force recreated the pg on the cluster: ceph pg force_create_pg 4.95
-- Brought up all osds.
-- The new pg 4.95 primary gets set to osd-99 + osd-64, 0 objects.

However, the object doesn't sync to the pg from osd-100; instead osd.64 tells osd-100 to remove its copy:

2016-03-05 15:44:22.858147 7fc004168700 20 osd.100 68025 _dispatch 0x7fc020867660 osd pg remove(epoch 68025; pg4.95; ) v2
2016-03-05 15:44:22.858174 7fc004168700  7 osd.100 68025 handle_pg_remove from osd.64 on 1 pgs
2016-03-05 15:44:22.858176 7fc004168700 15 osd.100 68025 require_same_or_newer_map 68025 (i am 68025) 0x7fc020867660
2016-03-05 15:44:22.858188 7fc004168700  5 osd.100 68025 queue_pg_for_deletion: 4.95
2016-03-05 15:44:22.858228 7fc004168700 15 osd.100 68025 project_pg_history 4.95 from 68025 to 68025, start ec=76 les/c/f 62655/62611/0 66982/67983/66982

Not wanting this to happen to my needed data from the other pgs, I didn't try this procedure with those pgs. After this procedure osd-100 does get listed in 'pg query' as 'might_have_unfound', but ceph apparently decides not to use it and the active osd sends a remove.

Output of 'ceph pg 4.95 query' after these recovery steps: https://gist.github.com/benh57/fc9a847cd83f4d5e4dcf

Quite possibly related: I am occasionally noticing some incorrectness in 'ceph osd tree'. It seems my crush map thinks some osds are on the wrong hosts. I wonder if this is why peering is failing? For example:

 -5  9.04999     host cld-mtl-006
 12  1.81000         osd.12        up  1.00000          1.00000
 13  1.81000         osd.13        up  1.00000          1.00000
 14  1.81000         osd.14        up  1.00000          1.00000
 94  1.81000         osd.94        up  1.00000          1.00000
 26  1.81000         osd.26        up  0.86775          1.00000

^^ This host only has 4 osds on it! osd.26 is actually running over on cld-mtl-004! Restarting 26 fixed the map. osd.42 (out) was also in the wrong place in 'osd tree': tree says it's on cld-mtl-013, but it's actually on cld-mtl-024.

- Fixing these issues caused a large rebalance, so 'ceph health detail' is a bit dirty right now, but you can see the stuck pgs in its output.
- I wonder if these incorrect crush maps caused ceph to put some data on the wrong osds, resulting in a peering failure later when the map repaired itself?
- How does ceph determine what node an OSD is on? That process may be periodically failing due to some issue (dns?).
- Perhaps if I enable the 'allow peer to same host' setting, the cluster could repair? Then I could turn it off again.

Any assistance is appreciated!

-Ben
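For reference, the export/remove/import steps above were along these lines (a sketch from memory, not an exact transcript; OSD IDs "NN" are placeholders, paths assume the default filestore layout, and every ceph-objectstore-tool step was run with the relevant OSD stopped):

    # export the pg from the OSD that still had a good copy
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
        --journal-path /var/lib/ceph/osd/ceph-NN/journal \
        --pgid 4.95 --op export --file /tmp/pg4.95.export

    # remove the pg from each OSD that still held a copy
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
        --journal-path /var/lib/ceph/osd/ceph-NN/journal \
        --pgid 4.95 --op remove

    # import into the (stopped, out) target OSD, then recreate and restart
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 \
        --journal-path /var/lib/ceph/osd/ceph-100/journal \
        --op import --file /tmp/pg4.95.export
    ceph pg force_create_pg 4.95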
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com