On Thu, 16 Feb 2012, Oliver Francke wrote:
> Hi Sage, *,
> 
> your tip with truncating from below did not solve the problem. Just to recap:
> 
> we had two inconsistencies, which we could break down to something like:
> 
> rb.0.0.000000000000__head_DA680EE2
> 
> according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3,
> for example, a stupid "find ..." brings up a couple of them, so the pg number is relevant too -
> makes sense. We went into, let's say, "/data/osd3/current/84.2_head/" and did a hex dump of the file; it really looked
> like the "head", in the sense of showing the signature of an installed grub loader, but with a corrupted partition table.
> On other of these files one could run "fdisk -l <file>" and at least a partition table could be
> found.
> Two days later we got a big complaint from a customer who was no longer able to boot his VM. The point now is:
> given such a file with its name and pg, how can we identify the real file it is associated with? Because there is another
> customer with a potential problem at the next reboot (the second inconsistency).
> 
> We also had some VMs in a big test phase with similar problems: grub going into the rescue prompt, invalid/corrupted
> partition tables - so all in the first "head" file?
> Would be cool to get some more info and shed some light on the structures (myself not really being a good code-reader
> anymore ;) ).

'head' in this case means the object hasn't been COWed (snapshotted and then
overwritten), and 000000000000 means it's the first 4MB block of the rbd
image/disk.

Were you able to use the 'rbd info' command from the previous email to
identify which image it is?  Is that what you mean by 'identify the real
file'?  I'm not sure I understand exactly what your question is.

I would have expected modifying the file with fdisk to work (if fdisk -l sees
a valid partition table, fdisk should be able to write it too).

sage
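To make that mapping concrete, here is a rough shell sketch, assuming the
default 4 MB (order 22) objects, the format-1 "rb.*" naming, and that the
images live in the 'rbd' pool; the exact rbd flags may differ between
versions:

  # object name = <block_name_prefix>.<block index in hex>, one object per 4 MB block
  obj=rb.0.0.000000000000             # the name without the __head_... suffix
  prefix=${obj%.*}                    # -> rb.0.0
  idx=$(( 16#${obj##*.} ))            # hex block index as a decimal number (0 here)
  echo "block $idx, byte offset $(( idx * 4 * 1024 * 1024 )) inside the image"

  # find the image whose prefix matches (pool name 'rbd' is an assumption)
  for img in $(rbd ls -p rbd); do
      rbd info -p rbd "$img" | grep -q "$prefix\$" && echo "-> image: $img"
  done

For rb.0.0.000000000000 that is block 0, i.e. the first 4 MB of the image,
which is exactly where the partition table and the grub boot code live -
consistent with the symptoms described above.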
> 
> Thanks in advance and kind regards,
> 
> Oliver.
> 
> On 13.02.2012 at 18:13, Sage Weil wrote:
> 
> > On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
> > 
> >>>> Hi list,
> >>>> 
> >>>> today I've got another problem.
> >>>> 
> >>>> ceph -w shows up with an inconsistent PG overnight:
> >>>> 
> >>>> 2012-02-10 08:38:48.701775 pg v441251: 1982 pgs: 1981 active+clean, 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345 GB avail
> >>>> 2012-02-10 08:38:49.702789 pg v441252: 1982 pgs: 1981 active+clean, 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345 GB avail
> >>>> 
> >>>> I've identified it with "ceph pg dump - | grep inconsistent":
> >>>> 
> >>>> 109.6   141   0   0   0   463820288   111780   111780   active+clean+inconsistent   485'7115   480'7301   [3,4]   [3,4]   485'7061   2012-02-10 08:02:12.043986
> >>>> 
> >>>> Now I've tried to repair it with: ceph pg repair 109.6
> >>>> 
> >>>> 2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6]
> >>>> 2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3 to repair' (0)
> >>>> 
> >>>> but I only get the following result:
> >>>> 
> >>>> 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455420 osd.3 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid 1ef398ce/rb.0.0.0000000000bd/head size 2736128 != known size 3145728
> >>>> 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455426 osd.3 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent objects
> >>>> 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455799 osd.3 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
> >>>> 
> >>>> Can someone please explain to me what to do in this case and how to recover the pg?
> >>> 
> >>> So the "fix" is just to truncate the file to the expected size, 3145728,
> >>> by finding it in the current/ directory. The name/path will be slightly
> >>> weird; look for 'rb.0.0.0000000000bd'.
> >>> 
> >>> The data is still suspect, though. Did the ceph-osd restart or crash
> >>> recently? I would do that, repair (it should succeed), and then fsck the
> >>> file system in that rbd image.
> >>> 
> >>> We just fixed a bug that was causing transactions to leak across
> >>> checkpoint/snapshot boundaries. That could be responsible for causing all
> >>> sorts of subtle corruptions, including this one. It'll be included in
> >>> v0.42 (out next week).
> >>> 
> >>> sage
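For reference, the truncate fix Sage describes above amounts to something like
the following on the OSD holding the short copy (osd.4 in the scrub error).
This is only a sketch: the data path and PG directory are assumptions about
the local layout, and the on-disk file name carries an escaped __head_...
suffix, so match on the rbd block name:

  # run on the node carrying the inconsistent replica; /data/osd4 is illustrative
  cd /data/osd4/current/109.6_head
  find . -name '*rb.0.0.0000000000bd*'           # locate the oddly-named object file
  truncate -s 3145728 '<path printed by find>'   # pad it back out to the expected size
  ceph pg repair 109.6                           # repair should now succeed; then fsck inside the guest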
> >> 
> >> Hi Sage,
> >> 
> >> no ... the osd didn't crash. I had to do some hardware maintenance and pushed it
> >> out of the distribution with "ceph osd out 3". After a short while I used
> >> "/etc/init.d/ceph stop" on that osd.
> >> Then, after my work, I started ceph again and pushed it back into the distribution with
> >> "ceph osd in 3".
> > 
> > For the bug I'm worried about, stopping the daemon and crashing are
> > equivalent. In both cases, a transaction may have been only partially
> > included in the checkpoint.
> > 
> >> Could you please tell me if this is the right way to take an osd out for
> >> maintenance? Is there anything else I should do to keep the data consistent?
> > 
> > You followed the right procedure. There is (hopefully, was!) just a bug.
> > 
> > sage
> > 
> >> My structure is: 3 MDS/MON servers on separate hardware nodes and 3 OSD nodes,
> >> each with a total capacity of 8 TB. Journaling is done on a separate SSD per node.
> >> The whole thing is a data store for a kvm virtualisation farm.
> >> The farm is accessing the data directly via rbd.
> >> 
> >> Thank you
> >> 
> >> Jens
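For completeness, the OSD maintenance procedure discussed above, pulled
together as a rough sketch. The per-daemon init invocation and the idea of
waiting for the cluster to settle before stopping the daemon are assumptions
about a careful variant; the plain "/etc/init.d/ceph stop" Jens used is what
Sage confirmed as correct:

  # on the node that hosts osd.3
  ceph osd out 3                 # stop mapping new data to this osd
  ceph -w                        # optionally watch until pgs are active+clean again
  /etc/init.d/ceph stop osd.3    # stop the daemon (or plain "stop" for all local daemons)
  # ... hardware maintenance ...
  /etc/init.d/ceph start osd.3
  ceph osd in 3                  # bring it back into the data distribution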