Well then, found it via "ceph osd dump" and the pool id, thanks. The affected
customer opened a ticket this morning because he could not boot his VM after a
shutdown, so I had to do some testdisk/fsck and tar the contents into a new
image. I hope there are no other "bad blocks" that are not visible as
"inconsistencies". These faulty images were easy to detect because the boot
block was affected; how big is the chance that more rb.* objects within an
image are corrupted, in reference to what you mentioned below:
"...transactions to leak across checkpoint/snapshot boundaries."? Do we have a
chance to detect it? I fear not, because it will probably only become visible
when doing an "fsck" inside the VM?!

Anyway, thanks for your help and best regards,

Oliver.

On 16.02.2012, at 19:02, Sage Weil wrote:

> On Thu, 16 Feb 2012, Oliver Francke wrote:
>> Hi Sage,
>>
>> thanks for the quick response,
>>
>> On 16.02.2012, at 18:17, Sage Weil wrote:
>>
>>> On Thu, 16 Feb 2012, Oliver Francke wrote:
>>>> Hi Sage, *,
>>>>
>>>> your tip with truncating from below did not solve the problem. Just to
>>>> recap: we had two inconsistencies, which we could break down to
>>>> something like:
>>>>
>>>> rb.0.0.000000000000__head_DA680EE2
>>>>
>>>> according to the ceph dump below. Walking over to the node with the OSD
>>>> mounted on, for example, /data/osd3, a simple "find" brings up a couple
>>>> of them, so the pg number is relevant too (makes sense). We went into,
>>>> say, "/data/osd3/current/84.2_head/" and did a hex dump of the file; it
>>>> really looked like the "head", i.e. it showed the signs of an installed
>>>> grub loader, but with a corrupted partition table. On other such files
>>>> one could run "fdisk -l <file>" and at least a partition table could be
>>>> found.
>>>> Two days later we got a customer's big complaint about not being able
>>>> to boot his VM anymore. The point now is: given such a file with its
>>>> name and pg, how can we identify the image it belongs to? There is
>>>> another customer with a potential problem on the next reboot (the
>>>> second inconsistency).
>>>>
>>>> We also had some VMs in a big test phase with similar problems: grub
>>>> dropping into the rescue prompt, invalid/corrupted partition tables,
>>>> all in the first "head" file.
>>>> It would be cool to get some more info and shed some light on the
>>>> structures (myself not really being a good code-reader anymore ;) ).
>>>
>>> 'head' in this case means the object hasn't been COWed (snapshotted and
>>> then overwritten), and 000000000000 means it's the first 4MB block of
>>> the rbd image/disk.
>>
>> yes, true,
>>
>>> Were you able to use the 'rbd info' from the previous email to identify
>>> which image it is? Is that what you mean by 'identify the real file'?
>>
>> that's the point: from the object I would like to identify the complete
>> image location, a la:
>>
>> <pool>/<image>
>>
>> From there I'd know which customer's rbd disk image is affected.
>
> For the pool, look at the pgid, in this case '109.6'. 109 is the pool id.
> Look at the pool list in the 'ceph osd dump' output to see which pool name
> that is.
>
> For the image, rb.0.0 is the image prefix. Look at each rbd image in that
> pool, and check for the image whose prefix matches, e.g.:
>
>   for img in `rbd -p poolname list` ; do
>       rbd info $img -p poolname | grep -q rb.0.0 && echo found $img
>   done
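For the archives, the whole lookup can be scripted. A rough, untested sketch
(the pg id and the rb.0.0 prefix are from our case and need to be substituted;
it also assumes 'ceph osd dump' prints pool lines as "pool <id> '<name>' ...",
which is what our version does):

    pgid=109.6                # placeholder: the inconsistent pg
    prefix=rb.0.0             # placeholder: the object name prefix
    # the part of the pg id before the dot is the pool id
    poolid=${pgid%%.*}
    # map the pool id to the pool name via the osd dump
    poolname=$(ceph osd dump | \
        awk -v id="$poolid" '$1 == "pool" && $2 == id { print $3 }' | tr -d "'")
    # scan every image in that pool for the matching prefix
    for img in $(rbd -p "$poolname" list); do
        rbd info "$img" -p "$poolname" | grep -q "$prefix" && echo "found $img"
    done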
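As for my question about detecting further corrupted objects before a customer
reboots: the only thing I can think of is checking each file system offline.
A sketch, untested, reusing $poolname from above and assuming a raw image with
a single partition plus enough scratch space for a full export (the image name
is a placeholder):

    img=vm-disk-1                           # placeholder image name
    rbd export "$img" /tmp/"$img".raw -p "$poolname"
    # attach the exported image to a loop device and map its partitions
    loopdev=$(losetup -f --show /tmp/"$img".raw)
    kpartx -av "$loopdev"                   # creates /dev/mapper/<loopN>p1 etc.
    # read-only check: -n reports problems without touching the file system
    fsck -n /dev/mapper/$(basename "$loopdev")p1
    kpartx -d "$loopdev"
    losetup -d "$loopdev"

Of course a clean fsck only rules out metadata damage; corruption inside file
contents would still go unnoticed.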
> BTW, are you creating a pool per customer here? You need to be a little
> bit careful about creating large numbers of pools; the system isn't really
> designed to be used that way. You should use a pool if you have a
> distinct data placement requirement (e.g., put these objects on this set
> of ceph-osds). But because of the way things work internally, creating
> hundreds/thousands of them won't be very efficient.
>
> sage
>
>> Thanks for your patience,
>>
>> Oliver.
>>
>>> I'm not sure I understand exactly what your question is. I would have
>>> expected modifying the file with fdisk to work (if fdisk sees a valid
>>> partition table, it should be able to write it too).
>>>
>>> sage
>>>
>>>> Thanks in advance and kind regards,
>>>>
>>>> Oliver.
>>>>
>>>> On 13.02.2012, at 18:13, Sage Weil wrote:
>>>>
>>>>> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
>>>>>
>>>>>>>> Hi list,
>>>>>>>>
>>>>>>>> today I've got another problem.
>>>>>>>>
>>>>>>>> ceph -w shows up with an inconsistent PG overnight:
>>>>>>>>
>>>>>>>> 2012-02-10 08:38:48.701775 pg v441251: 1982 pgs: 1981 active+clean,
>>>>>>>> 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB /
>>>>>>>> 22345 GB avail
>>>>>>>> 2012-02-10 08:38:49.702789 pg v441252: 1982 pgs: 1981 active+clean,
>>>>>>>> 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB /
>>>>>>>> 22345 GB avail
>>>>>>>>
>>>>>>>> I've identified it with "ceph pg dump | grep inconsistent"
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Now I've tried to repair it with: ceph pg repair 109.6
>>>>>>>>
>>>>>>>> 2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6]
>>>>>>>> 2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3
>>>>>>>> to repair' (0)
>>>>>>>>
>>>>>>>> but I only get the following result:
>>>>>>>>
>>>>>>>> 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455420 osd.3
>>>>>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
>>>>>>>> 1ef398ce/rb.0.0.0000000000bd/head size 2736128 != known size 3145728
>>>>>>>> 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455426 osd.3
>>>>>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1
>>>>>>>> inconsistent objects
>>>>>>>> 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455799 osd.3
>>>>>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
>>>>>>>>
>>>>>>>> Can someone please explain to me what to do in this case and how to
>>>>>>>> recover the pg?
>>>>>>>
>>>>>>> So the "fix" is just to truncate the file to the expected size,
>>>>>>> 3145728, by finding it in the current/ directory. The name/path will
>>>>>>> be slightly weird; look for 'rb.0.0.0000000000bd'.
>>>>>>>
>>>>>>> The data is still suspect, though. Did the ceph-osd restart or crash
>>>>>>> recently? I would do that, repair (it should succeed), and then fsck
>>>>>>> the file system in that rbd image.
>>>>>>>
>>>>>>> We just fixed a bug that was causing transactions to leak across
>>>>>>> checkpoint/snapshot boundaries. That could be responsible for causing
>>>>>>> all sorts of subtle corruptions, including this one. It'll be
>>>>>>> included in v0.42 (out next week).
>>>>>>>
>>>>>>> sage
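For anyone hitting the same scrub error: the mechanical part of the fix above
looks roughly like this (untested sketch; the OSD mount point is from our
setup, the find assumes exactly one match, and it has to be done on the osd
whose copy has the wrong size, osd.4 in the log above):

    # locate the object's backing file under the OSD's current/ directory
    f=$(find /data/osd3/current -name 'rb.0.0.0000000000bd__head*')
    # pad the file with zeros up to the size the scrub expects
    truncate -s 3145728 "$f"
    # then re-run the repair
    ceph pg repair 109.6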
>>>>>> Hi Sage,
>>>>>>
>>>>>> no ... the osd didn't crash. I had to do some hardware maintenance and
>>>>>> took it out of the distribution with "ceph osd out 3". After a short
>>>>>> while I used "/etc/init.d/ceph stop" on that osd.
>>>>>> Then, after my work, I started ceph and put it back into the
>>>>>> distribution with "ceph osd in 3".
>>>>>
>>>>> For the bug I'm worried about, stopping the daemon and crashing are
>>>>> equivalent. In both cases, a transaction may have been only partially
>>>>> included in the checkpoint.
>>>>>
>>>>>> Could you please tell me if this is the right way to take an osd out
>>>>>> for maintenance? Is there anything else I should do to keep the data
>>>>>> consistent?
>>>>>
>>>>> You followed the right procedure. There is (hopefully, was!) just a
>>>>> bug.
>>>>>
>>>>> sage
>>>>>
>>>>>> My setup is: 3 MDS/MON servers on separate hardware nodes and 3 OSD
>>>>>> nodes, each with a total capacity of 8 TB. Journaling is done on a
>>>>>> separate SSD per node. The whole thing is a data store for a kvm
>>>>>> virtualisation farm, which accesses the data directly via rbd.
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> Jens
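P.S. Since the question comes up regularly: the maintenance sequence Jens
describes, and which Sage confirms above, boils down to the following (a
sketch; the osd id is from this thread, and waiting for the cluster to settle
between steps is our own precaution, not something prescribed here):

    ceph osd out 3          # take the osd out of the data distribution
    # wait until "ceph -w" / "ceph health" shows the cluster has settled
    /etc/init.d/ceph stop   # on the osd node: stop the local ceph daemons
    # <do the hardware maintenance>
    /etc/init.d/ceph start  # on the osd node: bring the daemons back up
    ceph osd in 3           # put the osd back into the distribution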