Well, it's not supposed to do that if the backing storage is working properly. If, in your configuration, the filesystem/disk controller/disk combination is not respecting barriers (or can otherwise lose committed data in a power failure), a power failure could cause a node to go backwards in time -- that would explain it.

Without logs, I can't say any more. If you can reproduce, we'll want

    debug osd = 20
    debug filestore = 20
    debug ms = 1

on all of the osds involved in an affected PG (one way to apply these is sketched in the notes at the end of this message).

-Sam

On Fri, May 27, 2016 at 7:04 AM, SCHAER Frederic <frederic.schaer@xxxxxx> wrote:
> Hi,
>
> --
>
> First, let me start with the bonus…
>
> I migrated from hammer => jewel and followed the migration instructions… but the migration instructions are missing this:
>
> # chown -R ceph:ceph /var/log/ceph
>
> I just discovered this was the reason I found no logs anywhere about my current issue :/
>
> --
>
> This is maybe the 3rd time this has happened to me… This time I’d like to try to understand what is going on.
>
> So: ceph-10.2.0-0.el7.x86_64 + CentOS 7.2 here.
>
> Ceph health was happy, but any rbd operation was hanging; hence ceph was hung, and so were the test VMs running on it.
>
> I placed my VM in an EC pool, on top of which I overlaid an RBD pool with SSDs.
>
> The EC pool is defined as a 3+1 pool, with 5 hosts hosting the OSDs (and the failure domain is set to host).
>
> “ceph -w” wasn’t displaying new status lines as usual, but ceph health (detail) wasn’t reporting anything wrong.
>
> After looking at one node, I found that its ceph logs were empty, so I decided to restart the OSDs on it using: systemctl restart ceph-osd@*
>
> After I did that, ceph -w came back to life, but told me there was a dead MON, which I restarted too.
>
> I watched some kind of recovery happen, and after a few seconds/minutes, I now see:
>
> [root@ceph0 ~]# ceph health detail
> HEALTH_WARN 4 pgs degraded; 3 pgs recovering; 1 pgs recovery_wait; 4 pgs stuck unclean; recovery 57/373846 objects degraded (0.015%); recovery 57/110920 unfound (0.051%)
> pg 691.65 is stuck unclean for 310704.556119, current state active+recovery_wait+degraded, last acting [44,99,69,9]
> pg 691.1e5 is stuck unclean for 493631.370697, current state active+recovering+degraded, last acting [77,43,20,99]
> pg 691.12a is stuck unclean for 14521.475478, current state active+recovering+degraded, last acting [42,56,7,106]
> pg 691.165 is stuck unclean for 14521.474525, current state active+recovering+degraded, last acting [21,71,24,117]
> pg 691.165 is active+recovering+degraded, acting [21,71,24,117], 15 unfound
> pg 691.12a is active+recovering+degraded, acting [42,56,7,106], 1 unfound
> pg 691.1e5 is active+recovering+degraded, acting [77,43,20,99], 2 unfound
> pg 691.65 is active+recovery_wait+degraded, acting [44,99,69,9], 39 unfound
> recovery 57/373846 objects degraded (0.015%)
> recovery 57/110920 unfound (0.051%)
>
> Damn.
>
> Last time this happened, I was forced to declare the PGs lost in order to get back to a “healthy” ceph, because ceph does not want to revert PGs in EC pools. But one of the VMs then started hanging randomly on disk IOs…
>
> This same VM is now down, and I can’t remove its disk from rbd; it hangs at 99%. I could work around that by renaming the file and re-installing the VM on a new disk, but anyway, I’d like to understand + fix + make sure this does not happen again.
>
> We sometimes suffer power cuts here: if restarting daemons kills ceph data, I cannot imagine what would happen in the case of a power cut…
>
> Back to the unfound objects. I have no down OSD that is still in the cluster (only 1 is down, OSD.46, which I took down myself, but I had set its weight to 0 last week).
>
> I can query the PGs, but I don’t understand what I see in there.
>
> For instance:
>
> # ceph pg 691.65 query
> (…)
>     "num_objects_missing": 0,
>     "num_objects_degraded": 39,
>     "num_objects_misplaced": 0,
>     "num_objects_unfound": 39,
>     "num_objects_dirty": 138,
>
> And then for 2 peers I see:
>
>     "state": "active+undersized+degraded",   ## undersized ???
>     (…)
>     "num_objects_missing": 0,
>     "num_objects_degraded": 138,
>     "num_objects_misplaced": 138,
>     "num_objects_unfound": 0,
>     "num_objects_dirty": 138,
>     "blocked_by": [],
>     "up_primary": 44,
>     "acting_primary": 44
>
> If I look at the “missing” objects, I can see something on some OSDs:
>
> # ceph pg 691.165 list_missing
> (…)
>     {
>         "oid": {
>             "oid": "rbd_data.8de32431bd7b7.0000000000000ea7",
>             "key": "",
>             "snapid": -2,
>             "hash": 971513189,
>             "max": 0,
>             "pool": 691,
>             "namespace": ""
>         },
>         "need": "26521'22595",
>         "have": "25922'22575",
>         "locations": []
>     }
>
> All of the missing objects have this “need/have” discrepancy.
>
> I can see such objects in a “691.165” directory on secondary OSDs, but I do not see any 691.165 directory on the primary OSD (44)… ?
>
> For instance:
>
> [root@ceph0 ~]# ll /var/lib/ceph/osd/ceph-21/current/691.165s0_head/*8de32431bd7b7.0000000000000ea7*
> -rw-r--r-- 1 ceph ceph 1399392 May 15 13:18 /var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0000000000000ea7__head_39E81D65__2b3_5843_0
> -rw-r--r-- 1 ceph ceph 1399392 May 27 11:07 /var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0000000000000ea7__head_39E81D65__2b3_ffffffffffffffff_0
>
> Even so: assuming I had lost data on that OSD 44 (how??), I would expect ceph to be able to reconstruct the missing data/PG thanks to the erasure codes / the RBD replicas, but it looks like it is not willing to… ?
>
> I already know that telling ceph to forget about the lost PGs is not a good idea, as it will cause the VMs using them to hang afterwards… and I’d prefer to see ceph as a rock-solid solution that allows one to recover from such “usual” operations… ?
>
> If anyone has ideas, I’d be happy to hear them… should I kill osd.44 for good and recreate it?
>
> Thanks
>
> P.S.: I already tried:
>
> “ceph tell osd.44 injectargs --debug-osd 0/5 --debug-filestore 0/5”
>
> or
>
> “ceph tell osd.44 injectargs --debug-osd 20/20 --debug-filestore 20/20”
>
> P.S.: I tried this before I found the bonus at the start of this email…
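A quick way to sanity-check the barrier / write-cache situation Sam describes (a sketch only; the device name and grep pattern are examples, adjust them to your own OSD hosts):

    # Is the drive's volatile write cache enabled? If it is, it must be safe across
    # power loss (battery/flash-backed controller) or barriers/flushes must be
    # honored end to end, otherwise acknowledged writes can vanish on a power cut.
    hdparm -W /dev/sdX

    # Check the mount options of the OSD data filesystems; a "nobarrier" (or
    # "barrier=0") option here would allow exactly the kind of "node going
    # backwards in time" behaviour described above.
    mount | grep /var/lib/ceph/osd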
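For the debug levels Sam asks for, a minimal sketch of how they can be applied to the OSDs of an affected PG. The osd IDs below are taken from the acting set [44,99,69,9] of pg 691.65 shown above; adjust them to whichever PG you are chasing. Either inject them at runtime:

    # raise debug levels on every OSD in the acting set of the affected PG
    for id in 44 99 69 9; do
        ceph tell osd.$id injectargs '--debug-osd 20/20 --debug-filestore 20/20 --debug-ms 1/1'
    done

or set them persistently in ceph.conf on the OSD hosts and restart the daemons:

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

The resulting logs land under /var/log/ceph/ on each OSD host, which is also why the chown -R ceph:ceph /var/log/ceph mentioned at the top of the message matters after a hammer => jewel upgrade.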
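On the question of which PG shards and objects an OSD really holds on disk (e.g. the apparently missing 691.165 directory on osd.44): with the OSD stopped, ceph-objectstore-tool can list this straight from the filestore. A sketch only, assuming the default data/journal paths; note that for EC pools each OSD stores a single shard, named <pgid>s<shard> as in the 691.165s0_head directory shown above, and the example shard 691.65s0 is only a guess based on osd.44 being first in that PG's acting set:

    # stop the OSD before touching its store
    systemctl stop ceph-osd@44

    # which PG shards does this OSD actually hold?
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-44 \
        --journal-path /var/lib/ceph/osd/ceph-44/journal --op list-pgs

    # list the objects of one shard, using a shard name reported by list-pgs
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-44 \
        --journal-path /var/lib/ceph/osd/ceph-44/journal --pgid 691.65s0 --op list

    systemctl start ceph-osd@44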