I'd like to not have to null them if possible; there's nothing
outlandishly valuable on them, it's more the time to reprovision (users
have stuff on there, mainly testing, but I have a nasty feeling some
users won't have backed up their test instances).

When you say complicated and fragile, could you expand? Thanks again!

Joel

On Wed, Mar 11, 2015 at 1:21 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Ok, you lost all copies from an interval where the pgs went active.
> The recovery from this is going to be complicated and fragile. Are
> the pools valuable?
> -Sam
>
>
> On 03/11/2015 03:35 AM, joel.merrick@xxxxxxxxx wrote:
>>
>> For clarity too, I've tried to drop the min_size before as
>> suggested; it doesn't make a difference, unfortunately.
>>
>> On Wed, Mar 11, 2015 at 9:50 AM, joel.merrick@xxxxxxxxx
>> <joel.merrick@xxxxxxxxx> wrote:
>>>
>>> Sure thing; n.b. I increased the pg count to see if it would help.
>>> Alas not. :)
>>>
>>> Thanks again!
>>>
>>> health_detail
>>> https://gist.github.com/199bab6d3a9fe30fbcae
>>>
>>> osd_dump
>>> https://gist.github.com/499178c542fa08cc33bb
>>>
>>> osd_tree
>>> https://gist.github.com/02b62b2501cbd684f9b2
>>>
>>> Randomly selected queries:
>>> queries/0.19.query
>>> https://gist.github.com/f45fea7c85d6e665edf8
>>> queries/1.a1.query
>>> https://gist.github.com/dd68fbd5e862f94eb3be
>>> queries/7.100.query
>>> https://gist.github.com/d4fd1fb030c6f2b5e678
>>> queries/7.467.query
>>> https://gist.github.com/05dbcdc9ee089bd52d0c
>>>
>>> On Tue, Mar 10, 2015 at 2:49 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>
>>>> Yeah, get a ceph pg query on one of the stuck ones.
>>>> -Sam
>>>>
>>>> On Tue, 2015-03-10 at 14:41 +0000, joel.merrick@xxxxxxxxx wrote:
>>>>>
>>>>> Stuck unclean and stuck inactive. I can fire up a full query and
>>>>> health dump somewhere useful if you want (full pg query info on
>>>>> the ones listed in health detail, tree, osd dump, etc.). There
>>>>> were blocked_by operations that no longer exist after doing the
>>>>> OSD addition.
>>>>>
>>>>> Side note: I spent some time yesterday writing some bash to do
>>>>> this programmatically (it might be useful to others; I'll throw
>>>>> it on GitHub).
>>>>>
>>>>> On Tue, Mar 10, 2015 at 1:41 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> What do you mean by "unblocked" but still "stuck"?
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, 2015-03-09 at 22:54 +0000, joel.merrick@xxxxxxxxx wrote:
>>>>>>>
>>>>>>> On Mon, Mar 9, 2015 at 2:28 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> You'll probably have to recreate osds with the same ids (empty
>>>>>>>> ones), let them boot, stop them, and mark them lost. There is
>>>>>>>> a feature in the tracker to improve this behavior:
>>>>>>>> http://tracker.ceph.com/issues/10976
>>>>>>>> -Sam
>>>>>>>
>>>>>>> Thanks Sam, I've readded the OSDs; they became unblocked, but
>>>>>>> the same number of pgs are still stuck. I looked at them in
>>>>>>> some more detail and it seems they all have num_bytes='0'.
>>>>>>> Tried a repair too, for good measure. Still nothing, I'm
>>>>>>> afraid.
>>>>>>>
>>>>>>> Does this mean some underlying catastrophe has happened and
>>>>>>> they are never going to recover? Following on, would that cause
>>>>>>> data loss? There are no missing objects, and I'm hoping there's
>>>>>>> appropriate checksumming / replicas to balance that out, but
>>>>>>> now I'm not so sure.
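>>>>>>>
>>>>>>> (For reference, the per-pg check and the repair were along
>>>>>>> these lines, with 7.100 as one of the stuck pgs:
>>>>>>>
>>>>>>>     ceph pg 7.100 query | grep num_bytes
>>>>>>>     ceph pg repair 7.100
>>>>>>>
>>>>>>> repeated for each pg listed in health detail.)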
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> Joel
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> $ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'
>>
>>
>>
>
--
$ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'
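P.S. For anyone following along: the bash I mention below for pulling
pg queries programmatically is roughly the following (a sketch, not
battle-tested; it assumes the "pg <pgid> is stuck ..." lines that
ceph health detail prints, and it's what produced the queries/*.query
files linked above):

    #!/usr/bin/env bash
    # Capture a full 'ceph pg query' for every pg that
    # 'ceph health detail' reports as stuck, one file per pg.
    set -e
    mkdir -p queries
    ceph health detail \
        | awk '/^pg .* is stuck/ {print $2}' \
        | sort -u \
        | while read -r pg; do
              ceph pg "$pg" query > "queries/${pg}.query"
          done

I'll tidy it up before it goes on GitHub. Also for the archives,
re-adding the lost OSD ids per Sam's suggestion was roughly the
sequence below (sysvinit here, osd.12 as an example id; the usual
auth/crush registration steps are elided):

    ceph osd create                          # reuses the lowest free id, e.g. 12
    ceph-osd -i 12 --mkfs --mkkey            # empty data dir for the recreated id
    service ceph start osd.12                # let it boot and peer
    service ceph stop osd.12
    ceph osd lost 12 --yes-i-really-mean-it

And the earlier min_size drop was just "ceph osd pool set <pool>
min_size 1" on each affected pool.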