Re: Problem with inconsistent PG

On Fri, 17 Feb 2012, Oliver Francke wrote:
> Well then,
> 
> found it via "ceph osd dump" and the pool id, thanks. The affected customer
> opened a ticket this morning because he could not boot his VM after a shutdown,
> so I had to run testdisk/fsck and tar the contents into a new image.
> 
> I hope there are no other "bad blocks" that are not visible as inconsistencies.
> 
> These faulty images were easy to detect because the boot block was affected; how
> big is the chance that there are more rb..-fragments corrupted within an image,
> in reference to what you mentioned below:
> 
> "...transactions to leak across checkpoint/snapshot boundaries."
> 
> Is there any way to detect that? I fear not, since it would perhaps only become
> visible when running a "fsck" inside the VM.

It is hard to say.  There is a small chance that it will trigger any time 
ceph-osd is restarted.  The bug is fixed in the next release (which should 
be out today), but of course upgrading involves shutting down :(.  
Alternatively, you can cherry-pick the fixes, 
1009d1a016f049e19ad729a0c00a354a3956caf7 and 
93d7ef96316f30d3d7caefe07a5a747ce883ca2d.  v0.42 includes some encoding 
changes that mean you can upgrade but you can't downgrade again.  (These 
encoding changes are being made so that in the future, you _can_ 
downgrade.)

Here's what I suggest:

 - don't restart any ceph-osds if you can help it
 - wait for v0.42 to come out, and wait until Monday at least
 - pause read/write traffic to the cluster with

 ceph osd pause

 - wait at least 30 seconds for the osds to do a commit without any load;
   this makes it extremely unlikely you'd trigger the bug.
 - upgrade to v0.42, or restart with a patched ceph-osd.
 - unpause io with

 ceph osd unpause
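
Put together, the window above can be sketched as a dry-run script (POSIX
shell; the `run` wrapper is illustrative and only prints each step, so you
can review the sequence before executing it for real):

```shell
# Dry-run sketch of the maintenance window above.
# Swap the body of `run` for eval "$@" to actually execute.
run() { echo "+ $*"; }        # print instead of execute
run ceph osd pause            # stop read/write traffic cluster-wide
run sleep 30                  # let every ceph-osd commit without load
# ... upgrade to v0.42 or restart the patched ceph-osd here ...
run ceph osd unpause          # resume I/O
```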

sage



> 
> Anyway, thanks for your help and best regards,
> 
> Oliver.
> 
> Am 16.02.2012 um 19:02 schrieb Sage Weil:
> 
> > On Thu, 16 Feb 2012, Oliver Francke wrote:
> >> Hi Sage,
> >> 
> >> thnx for the quick response,
> >> 
> >> Am 16.02.2012 um 18:17 schrieb Sage Weil:
> >> 
> >>> On Thu, 16 Feb 2012, Oliver Francke wrote:
> >>>> Hi Sage, *,
> >>>> 
> >>>> your tip with truncating from below did not solve the problem. Just to recap:
> >>>> 
> >>>> we had two inconsistencies, which we could break down to something like:
> >>>> 
> >>>> rb.0.0.000000000000__head_DA680EE2
> >>>> 
> >>>> according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3,
> >>>> for example, a simple "find" brings up a couple of them, so the pg number is relevant too -
> >>>> makes sense. We went into, let's say, "/data/osd3/current/84.2_head/" and did a hex dump of the file; it really looked
> >>>> like the "head", in the sense of traces of an installed grub loader, but with a corrupted partition table.
> >>>> On other such files one could do a "fdisk -l <file>" and at least a partition table could be
> >>>> found.
> >>>> Two days later we got a customer's big complaint about not being able to boot his VM anymore. The point now is:
> >>>> given such a file with its name and pg, how can we identify the real image it belongs to? There is another
> >>>> customer with a potential problem on the next reboot (the second inconsistency).
> >>>> 
> >>>> We also had some VMs in a big test phase with similar problems: grub dropping into the rescue prompt, invalid/corrupted
> >>>> partition tables, so all in the first "head file"?
> >>>> Would be cool to get some more info and shed some light on the structures (myself not really being a good code reader
> >>>> anymore ;) ).
> >>> 
> >>> 'head' in this case means the object hasn't been COWed (snapshotted and 
> >>> then overwritten), and 000000000000 means it's the first 4MB block of the 
> >>> rbd image/disk.
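> >>> 
> >>> For illustration, the object suffix for a given byte offset can be computed
> >>> mechanically (this assumes the default 4 MB, i.e. 2^22-byte, object size;
> >>> the offset and the rb.0.0 prefix here are just examples):

```shell
# Object suffix = offset / 4 MiB, zero-padded to 12 hex digits
# (assumes the default rbd object size of 2^22 bytes).
offset=$((5 * 1024 * 1024))                 # a byte offset inside the image
printf 'rb.0.0.%012x\n' $((offset >> 22))   # → rb.0.0.000000000001
```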
> >>> 
> >> 
> >> yes, true,
> >> 
> >>> Were you able to use the 'rbd info' from the previous email to identify which 
> >>> image it is?  Is that what you mean by 'identify the real file'?
> >>> 
> >> 
> >> that's the point: from the object I would like to identify the complete image location, i.e.:
> >> 
> >> <pool>/<image>
> >> 
> >> from there I'd know which customer's rbd disk image is affected.
> > 
> > For pool, look at the pgid, in this case '109.6'.  109 is the pool id.  
> > Look at the pool list from 'ceph osd dump' output to see which pool name 
> > that is.
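> > 
> > Mechanically, the pool id is just everything before the dot in the pgid,
> > so it can be split off in the shell, e.g.:

```shell
# Split the pool id off a pgid like '109.6' with parameter expansion.
pgid=109.6
pool_id=${pgid%%.*}        # everything before the first dot
echo "pool id: $pool_id"   # → pool id: 109
```

> > With the pool id in hand, match it against the pool lines in the
> > 'ceph osd dump' output to get the pool name.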
> > 
> > For the image, rb.0.0 is the image prefix.  Look at each rbd image in that 
> > pool, and check for the image whose prefix matches.  e.g.,
> > 
> > for img in `rbd -p poolname list`; do rbd info $img -p poolname | grep -q rb.0.0 && echo found $img; done
> > 
> > BTW, are you creating a pool per customer here?  You need to be a little 
> > bit careful about creating large numbers of pools; the system isn't really 
> > designed to be used that way.  You should use a pool if you have a 
> > distinct data placement requirement (e.g., put these objects on this set 
> > of ceph-osds).  But because of the way things work internally creating 
> > hundreds/thousands of them won't be very efficient.
> > 
> > sage
> > 
> > 
> >> 
> >> Thnx for your patience,
> >> 
> >> Oliver.
> >> 
> >>> I'm not sure I understand exactly what your question is.  I would have 
> >>> expected modifying the file with fdisk -l to work (if fdisk sees a valid 
> >>> partition table, it should be able to write it too).
> >>> 
> >>> sage
> >>> 
> >>> 
> >>>> 
> >>>> Thanks in advance and kind regards,
> >>>> 
> >>>> Oliver.
> >>>> 
> >>>> Am 13.02.2012 um 18:13 schrieb Sage Weil:
> >>>> 
> >>>>> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
> >>>>> 
> >>>>>>>> Hi Liste,
> >>>>>>>> 
> >>>>>>>> today i've got another problem.
> >>>>>>>> 
> >>>>>>>> ceph -w shows up with an inconsistent PG over night:
> >>>>>>>> 
> >>>>>>>> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
> >>>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> >>>>>>>> GB avail
> >>>>>>>> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
> >>>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
> >>>>>>>> GB avail
> >>>>>>>> 
> >>>>>>>> I've identified it with "ceph pg dump | grep inconsistent":
> >>>>>>>> ...
> >>>>>>>> 
> >>>>>>>> Now I've tried to repair it with: ceph pg repair 109.6
> >>>>>>>> 
> >>>>>>>> 2012-02-10 08:35:52.276325 mon<- [pg,repair,109.6]
> >>>>>>>> 2012-02-10 08:35:52.276776 mon.1 ->  'instructing pg 109.6 on osd.3 to
> >>>>>>>> repair' (0)
> >>>>>>>> 
> >>>>>>>> but i only get the following result:
> >>>>>>>> 
> >>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
> >>>>>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
> >>>>>>>> 1ef398ce/rb.0.0.0000000000bd/head size 2736128 != known size 3145728
> >>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
> >>>>>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
> >>>>>>>> objects
> >>>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
> >>>>>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
> >>>>>>>> 
> >>>>>>>> Can someone please explain me what to do in this case and how to recover
> >>>>>>>> the pg ?
> >>>>>>> 
> >>>>>>> So the "fix" is just to truncate the file to the expected size, 3145728,
> >>>>>>> by finding it in the current/ directory.  The name/path will be slightly
> >>>>>>> weird; look for 'rb.0.0.0000000000bd'.
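> >>>>>>> 
> >>>>>>> As a sketch on a throwaway file (the real object file lives under the
> >>>>>>> osd's current/<pgid>_head/ directory, and its path will differ):

```shell
# Demonstrate the fix on a scratch file: truncate back to the size
# scrub expects (3145728 bytes here).  On the osd you would run the
# same truncate against the object file found under current/.
f=$(mktemp)
truncate -s 3145728 "$f"   # same command, pointed at the object file
stat -c %s "$f"            # → 3145728
rm -f "$f"
```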
> >>>>>>> 
> >>>>>>> The data is still suspect, though.  Did the ceph-osd restart or crash
> >>>>>>> recently?  I would do that, repair (it should succeed), and then fsck the
> >>>>>>> file system in that rbd image.
> >>>>>>> 
> >>>>>>> We just fixed a bug that was causing transactions to leak across
> >>>>>>> checkpoint/snapshot boundaries.  That could be responsible for causing all
> >>>>>>> sorts of subtle corruptions, including this one.  It'll be included in
> >>>>>>> v0.42 (out next week).
> >>>>>>> 
> >>>>>>> sage
> >>>>>> 
> >>>>>> Hi Sage,
> >>>>>> 
> >>>>>> no ... the osd didn't crash. I had to do some hardware maintenance and pushed
> >>>>>> it
> >>>>>> out of the distribution with "ceph osd out 3". After a short while I used
> >>>>>> "/etc/init.d/ceph stop" on that osd.
> >>>>>> Then, after my work, I started ceph and pushed it back into the distribution with
> >>>>>> "ceph osd in 3".
> >>>>> 
> >>>>> For the bug I'm worried about, stopping the daemon and crashing are 
> >>>>> equivalent.  In both cases, a transaction may have been only partially 
> >>>>> included in the checkpoint.
> >>>>> 
> >>>>>> Could you please tell me if this is the right way to take an osd out for
> >>>>>> maintenance? Is there
> >>>>>> anything else I should do to keep data consistent?
> >>>>> 
> >>>>> You followed the right procedure.  There is (hopefully, was!) just a bug.
> >>>>> 
> >>>>> sage
> >>>>> 
> >>>>> 
> >>>>>> My structure is ->  3 MDS/MON servers on separate hardware nodes and 3 OSD nodes,
> >>>>>> each with a total capacity
> >>>>>> of 8 TB. Journaling is done on a separate SSD per node. The whole thing is a
> >>>>>> data store for a kvm virtualisation
> >>>>>> farm. The farm accesses the data directly via rbd.
> >>>>>> 
> >>>>>> Thank you
> >>>>>> 
> >>>>>> Jens
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>> 
> >> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

