1. Decide which copy you want to keep and export it with ceph-objectstore-tool.
2. Delete all copies on all OSDs with ceph-objectstore-tool (not by deleting the directory on the disk).
3. Use force_create_pg to recreate the pg empty.
4. Use ceph-objectstore-tool to import the exported pg copy.
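A minimal sketch of how those steps might look on a Hammer-era filestore OSD, assuming pg 3.367 from this thread, the default data/journal paths, an export file under /root, and osd.2/osd.28 standing in for whichever OSDs hold copies; flag spellings vary a little between ceph-objectstore-tool builds, so check --help on yours first:

    # keep OSDs from being marked out while you work
    ceph osd set noout

    # 1. export the copy you want to keep (the OSD must be stopped
    #    before every ceph-objectstore-tool invocation)
    systemctl stop ceph-osd@2        # or "stop ceph-osd id=2" on upstart
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --journal-path /var/lib/ceph/osd/ceph-2/journal \
        --pgid 3.367 --op export --file /root/pg3.367.export
    systemctl start ceph-osd@2

    # 2. remove every copy of the pg with the tool, one OSD at a time,
    #    again with that OSD stopped (newer builds may also want --force)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-28 \
        --journal-path /var/lib/ceph/osd/ceph-28/journal \
        --pgid 3.367 --op remove

    # 3. recreate the pg empty
    ceph pg force_create_pg 3.367

    # 4. import the exported copy into one of the acting OSDs, start it,
    #    and let peering/backfill take it from there
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --journal-path /var/lib/ceph/osd/ceph-2/journal \
        --op import --file /root/pg3.367.export

    ceph osd unset noout

The reason for using the tool rather than rm is that the tool also removes the pg's associated metadata (omap/leveldb entries), which deleting the directory on disk leaves behind.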
On Wed, Mar 15, 2017 at 12:00 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
> Hello,
>
> I have tried to recover the pg using the following steps:
>
> Preparation:
> 1. set noout
> 2. stop osd.2
> 3. use ceph-objectstore-tool to export from osd.2
> 4. start osd.2
> 5. repeat steps 2-4 on osd 35, 28, 63 (I've done these hoping to be able
> to use one of those exports to recover the PG)
>
> First attempt:
> 1. stop osd.2
> 2. remove the 3.367_head directory
> 3. start osd.2
>
> Here I was hoping that the cluster would recover the pg from the 2 other
> identical osds. It did NOT. So I have tried the following commands on the PG:
> ceph pg repair
> ceph pg scrub
> ceph pg deep-scrub
> ceph pg force_create_pg
> Nothing changed. My PG was still incomplete. So I tried to remove all the
> OSDs that were referenced in the pg query:
>
> 1. stop osd.2
> 2. delete the 3.367_head directory
> 3. start osd.2
> 4. repeat steps 1-3 for all the OSDs that were listed in the pg query
> 5. did an import from one of the exports -> I was able again to query the
> pg (that was impossible when all the 3.367_head dirs were deleted) and the
> stats were saying that the number of objects is 6 and the size is 21M (all
> correct values according to the files I was able to see before starting
> the procedure). But the PG is still incomplete.
>
> What else can I try?
>
> Thank you,
> Laszlo
>
> On 12.03.2017 13:06, Brad Hubbard wrote:
>> On Sun, Mar 12, 2017 at 7:51 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>> Hello,
>>>
>>> I have already done the export with ceph_objectstore_tool. I just have
>>> to decide which OSDs to keep.
>>> Can you tell me why the directory structure in the OSDs is different
>>> for the same PG when checking on different OSDs?
>>> For instance, in OSD 2 and 63 there are NO subdirectories in
>>> 3.367_head, while OSD 28, 35 contain
>>> ./DIR_7/DIR_6/DIR_B/
>>> ./DIR_7/DIR_6/DIR_3/
>>>
>>> When are these subdirectories created?
>>>
>>> The files are identical on all the OSDs, only the way these are stored
>>> is different. It would be enough if you could point me to some
>>> documentation that explains these, I'll read it. So far, searching for
>>> the architecture of an OSD, I could not find the gory details about
>>> these directories.
>>
>> https://github.com/ceph/ceph/blob/master/src/os/filestore/HashIndex.h
>>
>>> Kind regards,
>>> Laszlo
>>>
>>> On 12.03.2017 02:12, Brad Hubbard wrote:
>>>> On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>>>> Hello,
>>>>>
>>>>> Thank you for your answer.
>>>>>
>>>>> Indeed the min_size is 1:
>>>>>
>>>>> # ceph osd pool get volumes size
>>>>> size: 3
>>>>> # ceph osd pool get volumes min_size
>>>>> min_size: 1
>>>>> #
>>>>>
>>>>> I'm going to try to find the mentioned discussions on the mailing
>>>>> lists and read them. If you have a link at hand, it would be nice if
>>>>> you could send it to me.
>>>>
>>>> This thread is one example, there are lots more:
>>>>
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>>>>
>>>>> In the attached file you can see the contents of the directory
>>>>> containing PG data on the different OSDs (all that have appeared in
>>>>> the pg query). According to the md5sums the files are identical. What
>>>>> bothers me is the directory structure (you can see the ls -R in each
>>>>> dir that contains files).
>>>>
>>>> So I mixed up 63 and 68, my list should have read 2, 28, 35 and 63
>>>> since 68 is listed as empty in the pg query.
>>>>
>>>>> Where can I read about how/why those DIR_# subdirectories have
>>>>> appeared?
>>>>>
>>>>> Given that the files themselves are identical on the "current" OSDs
>>>>> belonging to the PG, and as osd.63 (currently not belonging to the
>>>>> PG) has the same files, is it safe to stop osd.2, remove the
>>>>> 3.367_head dir, and then restart the OSD? (all these with the noout
>>>>> flag set of course)
>>>>
>>>> *You* need to decide which is the "good" copy and then follow the
>>>> instructions in the links I provided to try and recover the pg. Back
>>>> those known copies on 2, 28, 35 and 63 up with the
>>>> ceph_objectstore_tool before proceeding. They may well be identical
>>>> but the peering process still needs to "see" the relevant logs and
>>>> currently something is stopping it doing so.
>>>>
>>>>> Kind regards,
>>>>> Laszlo
>>>>>
>>>>> On 11.03.2017 00:32, Brad Hubbard wrote:
>>>>>> So this is why it happened I guess.
>>>>>>
>>>>>> pool 3 'volumes' replicated size 3 min_size 1
>>>>>>
>>>>>> min_size = 1 is a recipe for disasters like this and there are
>>>>>> plenty of ML threads about not setting it below 2.
>>>>>>
>>>>>> The past intervals in the pg query show several intervals where a
>>>>>> single OSD may have gone rw.
>>>>>>
>>>>>> How important is this data?
>>>>>>
>>>>>> I would suggest checking which of these OSDs actually have the data
>>>>>> for this pg. From the pg query it looks like 2, 35 and 68 and
>>>>>> possibly 28 since it's the primary. Check all OSDs in the pg query
>>>>>> output. I would then back up all copies and work out which copy, if
>>>>>> any, you want to keep and then attempt something like the following.
>>>>>>
>>>>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg17820.html
>>>>>>
>>>>>> If you want to abandon the pg see
>>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
>>>>>> for a possible solution.
>>>>>>
>>>>>> http://ceph.com/community/incomplete-pgs-oh-my/ may also give some
>>>>>> ideas.
>>>>>>
>>>>>> On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>>>>>> The OSDs are all there.
>>>>>>>
>>>>>>> $ sudo ceph osd stat
>>>>>>>      osdmap e60609: 72 osds: 72 up, 72 in
>>>>>>>
>>>>>>> and I have attached the result of the ceph osd tree and ceph osd
>>>>>>> dump commands.
>>>>>>> I got some extra info about the network problem. A faulty network
>>>>>>> device has flooded the network, eating up all the bandwidth, so the
>>>>>>> OSDs were not able to properly communicate with each other. This
>>>>>>> has lasted for almost 1 day.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Laszlo
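Relating to the suggestion quoted above to first check which OSDs actually hold data for pg 3.367 and to compare the copies before choosing one, a rough sketch of that check; the default filestore paths and the /tmp output files are only examples. Because the DIR_N hashing layout can differ between OSDs even when the objects are identical, it compares content checksums rather than paths:

    # on each OSD host: does this OSD still have a head directory for pg 3.367?
    find /var/lib/ceph/osd/ceph-*/current -maxdepth 1 -type d -name '3.367_head'

    # per OSD: checksum the object files, ignoring the directory layout
    cd /var/lib/ceph/osd/ceph-2/current/3.367_head
    find . -type f -exec md5sum {} + | awk '{print $1}' | sort > /tmp/osd2-3.367.sums

    # then compare any two copies
    diff /tmp/osd2-3.367.sums /tmp/osd28-3.367.sums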
>>>>>>> On 10.03.2017 12:19, Brad Hubbard wrote:
>>>>>>>> To me it looks like someone may have done an "rm" on these OSDs but
>>>>>>>> not removed them from the crushmap. This does not happen
>>>>>>>> automatically.
>>>>>>>>
>>>>>>>> Do these OSDs show up in "ceph osd tree" and "ceph osd dump"? If so,
>>>>>>>> paste the output.
>>>>>>>>
>>>>>>>> Without knowing what exactly happened here it may be difficult to
>>>>>>>> work out how to proceed.
>>>>>>>>
>>>>>>>> In order to go clean, the primary needs to communicate with multiple
>>>>>>>> OSDs, some of which are marked DNE and seem to be uncontactable.
>>>>>>>>
>>>>>>>> This seems to be more than a network issue (unless the outage is
>>>>>>>> still happening).
>>>>>>>>
>>>>>>>> http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete
>>>>>>>>
>>>>>>>> On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I was informed that due to a networking issue the ceph cluster
>>>>>>>>> network was affected. There was a huge packet loss, and network
>>>>>>>>> interfaces were flipping. That's all I got.
>>>>>>>>> This outage has lasted a longer period of time, so I assume that
>>>>>>>>> some OSDs may have been considered dead and the data from them has
>>>>>>>>> been moved away to other PGs (this is what ceph is supposed to do
>>>>>>>>> if I'm correct). Probably that was the point when the listed PGs
>>>>>>>>> have appeared in the picture.
>>>>>>>>>
>>>>>>>>> From the query we can see this for one of those OSDs:
>>>>>>>>>
>>>>>>>>>     {
>>>>>>>>>         "peer": "14",
>>>>>>>>>         "pgid": "3.367",
>>>>>>>>>         "last_update": "0'0",
>>>>>>>>>         "last_complete": "0'0",
>>>>>>>>>         "log_tail": "0'0",
>>>>>>>>>         "last_user_version": 0,
>>>>>>>>>         "last_backfill": "MAX",
>>>>>>>>>         "purged_snaps": "[]",
>>>>>>>>>         "history": {
>>>>>>>>>             "epoch_created": 4,
>>>>>>>>>             "last_epoch_started": 54899,
>>>>>>>>>             "last_epoch_clean": 55143,
>>>>>>>>>             "last_epoch_split": 0,
>>>>>>>>>             "same_up_since": 60603,
>>>>>>>>>             "same_interval_since": 60603,
>>>>>>>>>             "same_primary_since": 60593,
>>>>>>>>>             "last_scrub": "2852'33528",
>>>>>>>>>             "last_scrub_stamp": "2017-02-26 02:36:55.210150",
>>>>>>>>>             "last_deep_scrub": "2852'16480",
>>>>>>>>>             "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
>>>>>>>>>             "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150"
>>>>>>>>>         },
>>>>>>>>>         "stats": {
>>>>>>>>>             "version": "0'0",
>>>>>>>>>             "reported_seq": "14",
>>>>>>>>>             "reported_epoch": "59779",
>>>>>>>>>             "state": "down+peering",
>>>>>>>>>             "last_fresh": "2017-02-27 16:30:16.230519",
>>>>>>>>>             "last_change": "2017-02-27 16:30:15.267995",
>>>>>>>>>             "last_active": "0.000000",
>>>>>>>>>             "last_peered": "0.000000",
>>>>>>>>>             "last_clean": "0.000000",
>>>>>>>>>             "last_became_active": "0.000000",
>>>>>>>>>             "last_became_peered": "0.000000",
>>>>>>>>>             "last_unstale": "2017-02-27 16:30:16.230519",
>>>>>>>>>             "last_undegraded": "2017-02-27 16:30:16.230519",
>>>>>>>>>             "last_fullsized": "2017-02-27 16:30:16.230519",
>>>>>>>>>             "mapping_epoch": 60601,
>>>>>>>>>             "log_start": "0'0",
>>>>>>>>>             "ondisk_log_start": "0'0",
>>>>>>>>>             "created": 4,
>>>>>>>>>             "last_epoch_clean": 55143,
>>>>>>>>>             "parent": "0.0",
>>>>>>>>>             "parent_split_bits": 0,
>>>>>>>>>             "last_scrub": "2852'33528",
>>>>>>>>>             "last_scrub_stamp": "2017-02-26 02:36:55.210150",
>>>>>>>>>             "last_deep_scrub": "2852'16480",
>>>>>>>>>             "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
>>>>>>>>>             "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150",
>>>>>>>>>             "log_size": 0,
>>>>>>>>>             "ondisk_log_size": 0,
>>>>>>>>>             "stats_invalid": "0",
>>>>>>>>>             "stat_sum": {
>>>>>>>>>                 "num_bytes": 0,
>>>>>>>>>                 "num_objects": 0,
>>>>>>>>>                 "num_object_clones": 0,
>>>>>>>>>                 "num_object_copies": 0,
>>>>>>>>>                 "num_objects_missing_on_primary": 0,
>>>>>>>>>                 "num_objects_degraded": 0,
>>>>>>>>>                 "num_objects_misplaced": 0,
>>>>>>>>>                 "num_objects_unfound": 0,
>>>>>>>>>                 "num_objects_dirty": 0,
>>>>>>>>>                 "num_whiteouts": 0,
>>>>>>>>>                 "num_read": 0,
>>>>>>>>>                 "num_read_kb": 0,
>>>>>>>>>                 "num_write": 0,
>>>>>>>>>                 "num_write_kb": 0,
>>>>>>>>>                 "num_scrub_errors": 0,
>>>>>>>>>                 "num_shallow_scrub_errors": 0,
>>>>>>>>>                 "num_deep_scrub_errors": 0,
>>>>>>>>>                 "num_objects_recovered": 0,
>>>>>>>>>                 "num_bytes_recovered": 0,
>>>>>>>>>                 "num_keys_recovered": 0,
>>>>>>>>>                 "num_objects_omap": 0,
>>>>>>>>>                 "num_objects_hit_set_archive": 0,
>>>>>>>>>                 "num_bytes_hit_set_archive": 0
>>>>>>>>>             },
>>>>>>>>>             "up": [
>>>>>>>>>                 28,
>>>>>>>>>                 35,
>>>>>>>>>                 2
>>>>>>>>>             ],
>>>>>>>>>             "acting": [
>>>>>>>>>                 28,
>>>>>>>>>                 35,
>>>>>>>>>                 2
>>>>>>>>>             ],
>>>>>>>>>             "blocked_by": [],
>>>>>>>>>             "up_primary": 28,
>>>>>>>>>             "acting_primary": 28
>>>>>>>>>         },
>>>>>>>>>         "empty": 1,
>>>>>>>>>         "dne": 0,
>>>>>>>>>         "incomplete": 0,
>>>>>>>>>         "last_epoch_started": 0,
>>>>>>>>>         "hit_set_history": {
>>>>>>>>>             "current_last_update": "0'0",
>>>>>>>>>             "current_last_stamp": "0.000000",
>>>>>>>>>             "current_info": {
>>>>>>>>>                 "begin": "0.000000",
>>>>>>>>>                 "end": "0.000000",
>>>>>>>>>                 "version": "0'0",
>>>>>>>>>                 "using_gmt": "1"
>>>>>>>>>             },
>>>>>>>>>             "history": []
>>>>>>>>>         }
>>>>>>>>>     },
>>>>>>>>>
>>>>>>>>> Where can I read more about the meaning of each parameter? Some of
>>>>>>>>> them have quite self-explanatory names, but not all (or probably we
>>>>>>>>> need a deeper knowledge to understand them).
>>>>>>>>> Isn't there any parameter that would say when that OSD was assigned
>>>>>>>>> to the given PG? Also, the stat_sum shows 0 for all its parameters.
>>>>>>>>> Why is it blocking then?
>>>>>>>>>
>>>>>>>>> Is there a way to tell the PG to forget about that OSD?
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Laszlo
>>>>>>>>>
>>>>>>>>> On 10.03.2017 03:05, Brad Hubbard wrote:
>>>>>>>>>> Can you explain more about what happened?
>>>>>>>>>>
>>>>>>>>>> The query shows progress is blocked by the following OSDs.
>>>>>>>>>>
>>>>>>>>>>     "blocked_by": [
>>>>>>>>>>         14,
>>>>>>>>>>         17,
>>>>>>>>>>         51,
>>>>>>>>>>         58,
>>>>>>>>>>         63,
>>>>>>>>>>         64,
>>>>>>>>>>         68,
>>>>>>>>>>         70
>>>>>>>>>>     ],
>>>>>>>>>>
>>>>>>>>>> Some of these OSDs are marked as "dne" (Does Not Exist).
>>>>>>>>>>
>>>>>>>>>>     "peer": "17",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>     "peer": "51",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>     "peer": "58",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>     "peer": "64",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>     "peer": "70",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>
>>>>>>>>>> Can we get a complete background here please?
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 9, 2017 at 10:53 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> After a major network outage our ceph cluster ended up with an
>>>>>>>>>>> inactive PG:
>>>>>>>>>>>
>>>>>>>>>>> # ceph health detail
>>>>>>>>>>> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck
>>>>>>>>>>> unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests
>>>>>>>>>>> pg 3.367 is stuck inactive for 912263.766607, current state
>>>>>>>>>>> incomplete, last acting [28,35,2]
>>>>>>>>>>> pg 3.367 is stuck unclean for 912263.766688, current state
>>>>>>>>>>> incomplete, last acting [28,35,2]
>>>>>>>>>>> pg 3.367 is incomplete, acting [28,35,2]
>>>>>>>>>>> 1 ops are blocked > 268435 sec
>>>>>>>>>>> 1 ops are blocked > 268435 sec on osd.28
>>>>>>>>>>> 1 osds have slow requests
>>>>>>>>>>>
>>>>>>>>>>> # ceph -s
>>>>>>>>>>>     cluster 6713d1b8-83da-11e6-aa79-525400d98c5a
>>>>>>>>>>>      health HEALTH_WARN
>>>>>>>>>>>             1 pgs incomplete
>>>>>>>>>>>             1 pgs stuck inactive
>>>>>>>>>>>             1 pgs stuck unclean
>>>>>>>>>>>             1 requests are blocked > 32 sec
>>>>>>>>>>>      monmap e3: 3 mons at
>>>>>>>>>>> {tv-dl360-1=10.12.193.73:6789/0,tv-dl360-2=10.12.193.74:6789/0,tv-dl360-3=10.12.193.75:6789/0}
>>>>>>>>>>>             election epoch 72, quorum 0,1,2 tv-dl360-1,tv-dl360-2,tv-dl360-3
>>>>>>>>>>>      osdmap e60609: 72 osds: 72 up, 72 in
>>>>>>>>>>>       pgmap v3670252: 4864 pgs, 11 pools, 134 GB data, 23778 objects
>>>>>>>>>>>             490 GB used, 130 TB / 130 TB avail
>>>>>>>>>>>                 4863 active+clean
>>>>>>>>>>>                    1 incomplete
>>>>>>>>>>>   client io 0 B/s rd, 38465 B/s wr, 2 op/s
>>>>>>>>>>>
>>>>>>>>>>> ceph pg repair doesn't change anything. What should I try to
>>>>>>>>>>> recover it?
>>>>>>>>>>> Attached is the result of ceph pg query on the problem PG.
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Laszlo
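For pulling the blocking peers out of the query without scrolling through the whole JSON, something like the following may help; the paths assume the Hammer-era layout quoted above (blocked_by under info.stats, a dne flag on each peer_info entry), so adjust them if your version lays the output out differently:

    # dump the query once and inspect it offline
    ceph pg 3.367 query > /tmp/pg3.367-query.json

    # OSDs the primary reports it is blocked by
    jq '.info.stats.blocked_by' /tmp/pg3.367-query.json

    # peers the query marks as dne ("Does Not Exist")
    jq '[.peer_info[] | select(.dne == 1) | .peer]' /tmp/pg3.367-query.json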
--
Cheers,
Brad

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com