Well, it's not supposed to do that if the backing storage is working properly. If, in your configuration, the filesystem/disk controller/disk combination is not respecting barriers (or can otherwise lose committed data in a power failure), a power failure could cause a node to go backwards in time -- that would explain it.

Without logs, I can't say any more. If you can reproduce, we'll want

    debug osd = 20
    debug filestore = 20
    debug ms = 1

on all of the osds involved in an affected PG (one way to apply these is sketched in the notes at the end of this message).

-Sam

On Fri, May 27, 2016 at 7:04 AM, SCHAER Frederic <frederic.schaer@xxxxxx> wrote:
> Hi,
>
> --
>
> First, let me start with the bonus…
>
> I migrated from hammer => jewel and followed the migration instructions… but the migration instructions are missing this:
>
> # chown -R ceph:ceph /var/log/ceph
>
> I just discovered this was the reason I found no logs anywhere about my current issue :/
>
> --
>
> This is maybe the 3rd time this has happened to me… This time I’d like to try to understand what is going on.
>
> So: ceph-10.2.0-0.el7.x86_64 + CentOS 7.2 here.
>
> Ceph health was happy, but any rbd operation was hanging; hence ceph was hung, and so were the test VMs running on it.
>
> I placed my VM in an EC pool, on top of which I overlaid an RBD pool with SSDs.
>
> The EC pool is defined as a 3+1 pool, with 5 hosts hosting the OSDs (and the failure domain is set to host).
>
> “ceph -w” wasn’t displaying new status lines as usual, but ceph health (detail) wasn’t reporting anything wrong.
>
> After looking at one node, I found that its ceph logs were empty, so I decided to restart the OSDs on it using: systemctl restart ceph-osd@*
>
> After I did that, ceph -w came back to life, but told me there was a dead MON, which I restarted too.
>
> I watched some kind of recovery happen, and after a few seconds/minutes, I now see:
>
> [root@ceph0 ~]# ceph health detail
> HEALTH_WARN 4 pgs degraded; 3 pgs recovering; 1 pgs recovery_wait; 4 pgs stuck unclean; recovery 57/373846 objects degraded (0.015%); recovery 57/110920 unfound (0.051%)
> pg 691.65 is stuck unclean for 310704.556119, current state active+recovery_wait+degraded, last acting [44,99,69,9]
> pg 691.1e5 is stuck unclean for 493631.370697, current state active+recovering+degraded, last acting [77,43,20,99]
> pg 691.12a is stuck unclean for 14521.475478, current state active+recovering+degraded, last acting [42,56,7,106]
> pg 691.165 is stuck unclean for 14521.474525, current state active+recovering+degraded, last acting [21,71,24,117]
> pg 691.165 is active+recovering+degraded, acting [21,71,24,117], 15 unfound
> pg 691.12a is active+recovering+degraded, acting [42,56,7,106], 1 unfound
> pg 691.1e5 is active+recovering+degraded, acting [77,43,20,99], 2 unfound
> pg 691.65 is active+recovery_wait+degraded, acting [44,99,69,9], 39 unfound
> recovery 57/373846 objects degraded (0.015%)
> recovery 57/110920 unfound (0.051%)
>
> Damn.
>
> Last time this happened, I was forced to declare the PGs lost in order to get back to a “healthy” ceph, because ceph does not want to revert PGs in EC pools. But one of the VMs then started hanging randomly on disk IOs…
>
> This same VM is now down, and I can’t remove its disk from rbd; it hangs at 99%. I could work around that by renaming the file and re-installing the VM on a new disk, but anyway, I’d like to understand + fix + make sure this does not happen again.
>
> We sometimes suffer power cuts here: if restarting daemons kills ceph data, I cannot imagine what would happen in the case of a power cut…
>
> Back to the unfound objects. I have no down OSD that is still in the cluster (only 1 is down, OSD.46, which I took down myself, but I had set its weight to 0 last week).
>
> I can query the PGs, but I don’t understand what I see in there.
>
> For instance:
>
> # ceph pg 691.65 query
> (…)
>     "num_objects_missing": 0,
>     "num_objects_degraded": 39,
>     "num_objects_misplaced": 0,
>     "num_objects_unfound": 39,
>     "num_objects_dirty": 138,
>
> And then for 2 peers I see:
>
>     "state": "active+undersized+degraded",   ## undersized ???
>     (…)
>     "num_objects_missing": 0,
>     "num_objects_degraded": 138,
>     "num_objects_misplaced": 138,
>     "num_objects_unfound": 0,
>     "num_objects_dirty": 138,
>     "blocked_by": [],
>     "up_primary": 44,
>     "acting_primary": 44
>
> If I look at the “missing” objects, I can see something on some OSDs:
>
> # ceph pg 691.165 list_missing
> (…)
>     {
>         "oid": {
>             "oid": "rbd_data.8de32431bd7b7.0000000000000ea7",
>             "key": "",
>             "snapid": -2,
>             "hash": 971513189,
>             "max": 0,
>             "pool": 691,
>             "namespace": ""
>         },
>         "need": "26521'22595",
>         "have": "25922'22575",
>         "locations": []
>     }
>
> All of the missing objects have this “need/have” discrepancy.
>
> I can see such objects in a “691.165” directory on secondary OSDs, but I do not see any 691.165 directory on the primary OSD (44)… ?
>
> For instance:
>
> [root@ceph0 ~]# ll /var/lib/ceph/osd/ceph-21/current/691.165s0_head/*8de32431bd7b7.0000000000000ea7*
> -rw-r--r-- 1 ceph ceph 1399392 May 15 13:18 /var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0000000000000ea7__head_39E81D65__2b3_5843_0
> -rw-r--r-- 1 ceph ceph 1399392 May 27 11:07 /var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0000000000000ea7__head_39E81D65__2b3_ffffffffffffffff_0
>
> Even so: assuming I had lost data on that OSD 44 (how??), I would expect ceph to be able to reconstruct the missing data/PG thanks to the erasure codes / the RBD replicas, but it looks like it is not willing to… ?
>
> I already know that telling ceph to forget about the lost PGs is not a good idea, as it will cause the VMs using them to hang afterwards… and I’d prefer to see ceph as a rock-solid solution that allows one to recover from such “usual” operations… ?
>
> If anyone has ideas, I’d be happy to hear them… should I kill osd.44 for good and recreate it?
>
> Thanks
>
> P.S.: I already tried:
>
> “ceph tell osd.44 injectargs --debug-osd 0/5 --debug-filestore 0/5”
>
> or
>
> “ceph tell osd.44 injectargs --debug-osd 20/20 --debug-filestore 20/20”
>
> P.S.: I tried this before I found the bonus at the start of this email…
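A quick way to sanity-check the barrier / write-cache situation Sam describes (a sketch only; the device name and grep pattern are examples, adjust them to your own OSD hosts):

    # Is the drive's volatile write cache enabled? If it is, it must be safe across
    # power loss (battery/flash-backed controller) or barriers/flushes must be
    # honored end to end, otherwise acknowledged writes can vanish on a power cut.
    hdparm -W /dev/sdX

    # Check the mount options of the OSD data filesystems; a "nobarrier" (or
    # "barrier=0") option here would allow exactly the kind of "node going
    # backwards in time" behaviour described above.
    mount | grep /var/lib/ceph/osd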
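For the debug levels Sam asks for, a minimal sketch of how they can be applied to the OSDs of an affected PG. The osd IDs below are taken from the acting set [44,99,69,9] of pg 691.65 shown above; adjust them to whichever PG you are chasing. Either inject them at runtime:

    # raise debug levels on every OSD in the acting set of the affected PG
    for id in 44 99 69 9; do
        ceph tell osd.$id injectargs '--debug-osd 20/20 --debug-filestore 20/20 --debug-ms 1/1'
    done

or set them persistently in ceph.conf on the OSD hosts and restart the daemons:

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

The resulting logs land under /var/log/ceph/ on each OSD host, which is also why the chown -R ceph:ceph /var/log/ceph mentioned at the top of the message matters after a hammer => jewel upgrade.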
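On the question of which PG shards and objects an OSD really holds on disk (e.g. the apparently missing 691.165 directory on osd.44): with the OSD stopped, ceph-objectstore-tool can list this straight from the filestore. A sketch only, assuming the default data/journal paths; note that for EC pools each OSD stores a single shard, named <pgid>s<shard> as in the 691.165s0_head directory shown above, and the example shard 691.65s0 is only a guess based on osd.44 being first in that PG's acting set:

    # stop the OSD before touching its store
    systemctl stop ceph-osd@44

    # which PG shards does this OSD actually hold?
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-44 \
        --journal-path /var/lib/ceph/osd/ceph-44/journal --op list-pgs

    # list the objects of one shard, using a shard name reported by list-pgs
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-44 \
        --journal-path /var/lib/ceph/osd/ceph-44/journal --pgid 691.65s0 --op list

    systemctl start ceph-osd@44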