Thank you, Sage! The patch below allowed one of the two stuck OSDs to
start up. What's the procedure from here? Is there a way to make Ceph
consistent? OSD 4 is still refusing to start, though with a new error
message. Should I drop OSD 4 and rebuild? If so, what's the proper way
to do this? I've not started any VMs from the volumes yet, but I can
now run rbd ls -p compute (which is new), and ceph -s shows "health
HEALTH_WARN 245 pgs degraded; 245 pgs stuck unclean; recovery
23261/258270 objects degraded (9.006%); 1/10 in osds are down". How can
I ensure everything is consistent before upgrading the cluster with the
Hammer debs from the Mirantis repo?

Thank you very much for the assistance,
-Boris

Patch diff (slightly modified to compile):

--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -135,8 +135,22 @@ void PGPool::update(OSDMapRef map)
   name = map->get_pool_name(id);
   if (pi->get_snap_epoch() == map->get_epoch()) {
     pi->build_removed_snaps(newly_removed_snaps);
-    newly_removed_snaps.subtract(cached_removed_snaps);
-    cached_removed_snaps.union_of(newly_removed_snaps);
+    lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " removed_snaps " << newly_removed_snaps << " cached_removed " << cached_removed_snaps << dendl;
+    interval_set<snapid_t> intersection;
+    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
+    lgeneric_subdout(g_ceph_context, osd, 0)
+      << __func__ << " removed_snaps " << newly_removed_snaps
+      << " cached_removed " << cached_removed_snaps
+      << " intersection " << intersection
+      << dendl;
+    if (intersection == cached_removed_snaps) {
+      newly_removed_snaps.subtract(cached_removed_snaps);
+      cached_removed_snaps.union_of(newly_removed_snaps);
+    } else {
+      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
+      cached_removed_snaps = newly_removed_snaps;
+      newly_removed_snaps.clear();
+    }
     snapc = pi->get_snap_context();
   } else {
     newly_removed_snaps.clear();
@@ -1473,7 +1487,9 @@ void PG::activate(ObjectStore::Transaction& t,
   dout(20) << "activate - purged_snaps " << info.purged_snaps
            << " cached_removed_snaps " << pool.cached_removed_snaps << dendl;
   snap_trimq = pool.cached_removed_snaps;
-  snap_trimq.subtract(info.purged_snaps);
+  interval_set<snapid_t> intersection;
+  intersection.intersection_of(info.purged_snaps, snap_trimq);
+  snap_trimq.subtract(intersection);
   dout(10) << "activate - snap_trimq " << snap_trimq << dendl;
   if (!snap_trimq.empty() && is_clean())
     queue_snap_trim();
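To convince myself of why the unpatched code aborts, I put together a
tiny test against interval_set using the sets from my osd logs. This is
a sketch only - it assumes it is compiled from inside a Ceph source
tree (include/interval_set.h pulls in other Ceph headers), and the
build flags are guesswork on my part:

// interval_demo.cc - reproduce the abort seen in PG::activate() (sketch).
// Built from the ceph source root with approximately: g++ -Isrc interval_demo.cc
#include "include/interval_set.h"
#include <iostream>

int main()
{
  // From the osd log: "activate - purged_snaps [1~5,8~2,b~d,19~f]
  // cached_removed_snaps [1~5,8~2,b~d,19~b]".
  interval_set<uint64_t> purged, trimq;
  purged.insert(0x1, 0x5);  purged.insert(0x8, 0x2);
  purged.insert(0xb, 0xd);  purged.insert(0x19, 0xf);
  trimq.insert(0x1, 0x5);   trimq.insert(0x8, 0x2);
  trimq.insert(0xb, 0xd);   trimq.insert(0x19, 0xb);

  // Unpatched activate() does snap_trimq.subtract(info.purged_snaps).
  // subtract() requires its argument to be wholly contained in *this,
  // and [19~f] is not contained in [19~b], so this line asserts:
  //   trimq.subtract(purged);

  // The patched version only subtracts the common part:
  interval_set<uint64_t> intersection;
  intersection.intersection_of(purged, trimq);
  trimq.subtract(intersection);  // safe: intersection is contained
  std::cout << "snap_trimq now " << trimq << std::endl;
  return 0;
}

With these sets the intersection is the whole of snap_trimq, so the
patched code leaves it empty and nothing gets queued for trimming.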
On Tue, Jan 12, 2016 at 3:21 PM, Boris Lukashev
<blukashev@xxxxxxxxxxxxxxxx> wrote:
> Having added the following diff to the patch stack and rebuilt, I'm
> still seeing the two OSDs fail to come up.
> However, with Sage's help in IRC, I now have the following patch and
> changes to the librados hello world example to clean out bad snaps
> found via osd log 20:
>
> diff --git a/src/osd/PG.cc b/src/osd/PG.cc
> index d7174af..d78ee31 100644
> --- a/src/osd/PG.cc
> +++ b/src/osd/PG.cc
> @@ -137,7 +137,7 @@ void PGPool::update(OSDMapRef map)
>      pi->build_removed_snaps(newly_removed_snaps);
>      interval_set<snapid_t> intersection;
>      intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
> -    if (!(intersection == cached_removed_snaps)) {
> +    if (intersection == cached_removed_snaps) {
>        newly_removed_snaps.subtract(cached_removed_snaps);
>        cached_removed_snaps.union_of(newly_removed_snaps);
>      } else {
>
> In hello world, remove everything after io_ctx is initialized up to
> the out: section, and just before it add:
>
>     /*
>      * remove snapshots
>      */
>     {
>       io_ctx.selfmanaged_snap_remove(0x19+0xb);
>       io_ctx.selfmanaged_snap_remove(0x19+0xc);
>       io_ctx.selfmanaged_snap_remove(0x19+0xd);
>       io_ctx.selfmanaged_snap_remove(0x19+0xe);
>     }
>
> with the appropriate snap ids.
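(For anyone finding this thread later: written out as a standalone tool
instead of a hacked-up hello_world.cc, the above looks roughly like the
sketch below. This is my reconstruction, not a tested program - the
pool name "compute" and the snap ids are specific to my cluster, and
the build would be something like g++ snap_fix.cc -lrados -o snap_fix.)

// snap_fix.cc - re-register removal of snap ids the OSDMap lost (sketch).
#include <rados/librados.hpp>
#include <iostream>

int main()
{
  librados::Rados cluster;
  // Connect as client.admin using the local ceph.conf/keyring.
  int ret = cluster.init2("client.admin", "ceph", 0);
  if (ret < 0) { std::cerr << "init2: " << ret << std::endl; return 1; }
  ret = cluster.conf_read_file(NULL);  // default /etc/ceph/ceph.conf
  if (ret < 0) { std::cerr << "conf_read_file: " << ret << std::endl; return 1; }
  ret = cluster.connect();
  if (ret < 0) { std::cerr << "connect: " << ret << std::endl; return 1; }

  librados::IoCtx io_ctx;
  ret = cluster.ioctx_create("compute", io_ctx);  // my affected pool
  if (ret < 0) { std::cerr << "ioctx_create: " << ret << std::endl; return 1; }

  // The four ids missing from the OSDMap's removed_snaps on my
  // cluster: 0x19+0xb through 0x19+0xe (0x24..0x27).
  for (uint64_t snap = 0x19 + 0xb; snap <= 0x19 + 0xe; ++snap) {
    ret = io_ctx.selfmanaged_snap_remove(snap);
    std::cout << "selfmanaged_snap_remove(" << snap << ") = " << ret
              << std::endl;
  }

  io_ctx.close();
  cluster.shutdown();
  return 0;
}

Printing the return code of each call seemed safer than firing all four
blind, given I only get one shot at this.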
> So, with the OSDs still not starting, I'm curious as to what the next
> step is - should I keep trying to get the OSDs up, or should I try to
> remove the snaps with the librados bin and then try to bring them up?
> Since I'm manually deleting things from Ceph, I figure I only get one
> shot, so suggestions are very welcome :).
>
> Thanks as always,
> -Boris
>
> On Tue, Jan 12, 2016 at 1:03 PM, Boris Lukashev
> <blukashev@xxxxxxxxxxxxxxxx> wrote:
>> I've put some of the output from debug osd 20 at
>> http://pastebin.com/he5snqwF; it seems one of the last operations is
>> in fact "activate - purged_snaps [1~5,8~2,b~d,19~f]
>> cached_removed_snaps [1~5,8~2,b~d,19~b]", which seems to make sense
>> in the context of this mismatch...
>> There is an ungodly amount of output from level 20 - is there
>> anything specific you'd like me to grep for?
>>
>> On Tue, Jan 12, 2016 at 12:47 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> On Tue, 12 Jan 2016, Boris Lukashev wrote:
>>>> Should I try altering the patch with the ! removed and reloading
>>>> the OSDs? (It would read as 'if (intersection == cached_removed_snaps) {'.)
>>>> Just for my own education on the matter - if there is disagreement
>>>> between the contents of an OSD and the map, like in this case where
>>>> a pending request seems to be outstanding, is there no mediation
>>>> process between the on-disk data (OSD) and metadata (map) services?
>>>> With XFS being used underneath most of the time, that strikes me as
>>>> somewhat scary - it's not the most consistent of filesystems on the
>>>> best of days.
>>>>
>>>> With the mismatch between the OSD and map, but xfs_check coming
>>>> back clean, should I be worried about a corrupt cluster in the
>>>> event that I can somehow get it running?
>>>> I figure that with Firefly dying and Hammer available from
>>>> Mirantis, I should upgrade the cluster, but I would like to know
>>>> what the safest way forward is - I'd really prefer to keep using
>>>> Ceph, it's been educational and quite handy, but if I have to
>>>> rebuild the cluster it'll need to keep playing nice with the
>>>> Fuel-deployed OpenStack. If I can get access to the images stored
>>>> by Glance and Swift metadata, I'll gladly export and rebuild clean,
>>>> presuming I can figure out how. The RBD images are already saved
>>>> (manual export by tracking the rbd segment hashes from the metadata
>>>> files bearing volume-UUID designations matching what I saw in
>>>> Cinder, and dd-ing chunks into flat files for raw images). Worst
>>>> case, if the cluster won't come back up and give me access to the
>>>> data, what's the process for getting it to a "clean" state such
>>>> that I can upgrade to Hammer and reseed my glance, swift, and
>>>> volume data from backups/exports? Do I need to remove and re-add
>>>> OSDs, or is there some darker magic at play to ensure there are no
>>>> remnants of bad data/messages?
>>>
>>> I think the safe path is to
>>>
>>> (1) reproduce with debug osd = 20, so we can see what the
>>> removed_snaps is on the pg vs the one in the osdmap.
>>>
>>> (2) fix the ! in the patch and restart the osds
>>>
>>> (3) re-add the deleted snaps to the osdmap so that things are back
>>> in sync. This is possible through the librados API so it should be
>>> pretty simple to fix. But, let's look at what (1) shows first.
>>>
>>> sage
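(This is where the snap ids in the hello world hack above come from:
the log line shows purged_snaps [1~5,8~2,b~d,19~f] against
cached_removed_snaps [1~5,8~2,b~d,19~b], so the map lost exactly four
ids. A couple of lines with interval_set confirm the arithmetic - same
untested, inside-the-ceph-tree caveats as the sketch near the top of
this mail; only the last interval of each set differs, so I just use
that:)

// Which snapids does the OSDMap need to get back?
interval_set<uint64_t> purged, cached, missing;
purged.insert(0x19, 0xf);  // from purged_snaps [...,19~f]
cached.insert(0x19, 0xb);  // from cached_removed_snaps [...,19~b]
missing = purged;
missing.subtract(cached);  // safe: cached is wholly contained in purged
// missing is now [24~4]: snapids 0x24..0x27, i.e. 0x19+0xb through
// 0x19+0xe - exactly the selfmanaged_snap_remove() arguments above.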
>>>>
>>>> Thank you all
>>>> -Boris
>>>>
>>>> On Tue, Jan 12, 2016 at 8:24 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>> > On Tue, 12 Jan 2016, Mykola Golub wrote:
>>>> >> On Mon, Jan 11, 2016 at 09:00:18PM -0500, Boris Lukashev wrote:
>>>> >> > In case anyone is following the mailing list later on, we spoke in IRC
>>>> >> > and Sage provided a patch - http://fpaste.org/309609/52550203/
>>>> >>
>>>> >> > diff --git a/src/osd/PG.cc b/src/osd/PG.cc
>>>> >> > index dc18aec..f9ee23c 100644
>>>> >> > --- a/src/osd/PG.cc
>>>> >> > +++ b/src/osd/PG.cc
>>>> >> > @@ -135,8 +135,16 @@ void PGPool::update(OSDMapRef map)
>>>> >> >    name = map->get_pool_name(id);
>>>> >> >    if (pi->get_snap_epoch() == map->get_epoch()) {
>>>> >> >      pi->build_removed_snaps(newly_removed_snaps);
>>>> >> > -    newly_removed_snaps.subtract(cached_removed_snaps);
>>>> >> > -    cached_removed_snaps.union_of(newly_removed_snaps);
>>>> >> > +    interval_set<snapid_t> intersection;
>>>> >> > +    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
>>>> >> > +    if (!(intersection == cached_removed_snaps)) {
>>>> >> > +      newly_removed_snaps.subtract(cached_removed_snaps);
>>>> >>
>>>> >> Sage, won't it still violate the assert?
>>>> >> "intersection != cached_removed_snaps" means that cached_removed_snaps
>>>> >> contains snapshots missed in newly_removed_snaps, and we can't subtract?
>>>> >
>>>> > Oops, yeah, just remove the !.
>>>> >
>>>> > As you can see, the problem is that the OSDMap's removed snaps
>>>> > shrank somehow. If you crank up logging you can see what the
>>>> > competing sets are.
>>>> >
>>>> > An alternative fix/hack would be to modify the monitor to allow
>>>> > the snapids that were previously in the set to be added back into
>>>> > the OSDMap. That's arguably a better fix, although it's a bit more
>>>> > work. But, even then, something like the above will be needed
>>>> > since there are still OSDMaps in the history where the set is
>>>> > smaller.
>>>> >
>>>> > sage
>>>> >
>>>> >>
>>>> >> > +      cached_removed_snaps.union_of(newly_removed_snaps);
>>>> >> > +    } else {
>>>> >> > +      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
>>>> >> > +      cached_removed_snaps = newly_removed_snaps;
>>>> >> > +      newly_removed_snaps.clear();
>>>> >> > +    }
>>>> >> >      snapc = pi->get_snap_context();
>>>> >> >    } else {
>>>> >> >      newly_removed_snaps.clear();
>>>> >>
>>>> >> --
>>>> >> Mykola Golub
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html