Well then, found it via "ceph osd dump" and the pool id, thanks. The affected
customer opened a ticket this morning because he could not boot his VM after a
shutdown, so I had to do some testdisk/fsck and tar the contents into a new
image. I hope there are no other "bad blocks" that are not visible as
"inconsistencies". These faulty images were easy to detect because the boot
block was affected; how big is the chance that more rb.* objects within an
image are corrupted, in reference to what you mentioned below:
"...transactions to leak across checkpoint/snapshot boundaries."? Do we have a
chance to detect it? I fear not, because it will probably only become visible
when doing an "fsck" inside the VM?!

Anyway, thanks for your help and best regards,

Oliver.

On 16.02.2012, at 19:02, Sage Weil wrote:

> On Thu, 16 Feb 2012, Oliver Francke wrote:
>> Hi Sage,
>>
>> thanks for the quick response,
>>
>> On 16.02.2012, at 18:17, Sage Weil wrote:
>>
>>> On Thu, 16 Feb 2012, Oliver Francke wrote:
>>>> Hi Sage, *,
>>>>
>>>> your tip with truncating from below did not solve the problem. Just to
>>>> recap: we had two inconsistencies, which we could break down to
>>>> something like:
>>>>
>>>> rb.0.0.000000000000__head_DA680EE2
>>>>
>>>> according to the ceph dump below. Walking over to the node with the OSD
>>>> mounted on, for example, /data/osd3, a simple "find" brings up a couple
>>>> of them, so the pg number is relevant too (makes sense). We went into,
>>>> say, "/data/osd3/current/84.2_head/" and did a hex dump of the file; it
>>>> really looked like the "head", i.e. it showed the signs of an installed
>>>> grub loader, but with a corrupted partition table. On other such files
>>>> one could run "fdisk -l <file>" and at least a partition table could be
>>>> found.
>>>> Two days later we got a customer's big complaint about not being able
>>>> to boot his VM anymore. The point now is: given such a file with its
>>>> name and pg, how can we identify the image it belongs to? There is
>>>> another customer with a potential problem on the next reboot (the
>>>> second inconsistency).
>>>>
>>>> We also had some VMs in a big test phase with similar problems: grub
>>>> dropping into the rescue prompt, invalid/corrupted partition tables,
>>>> all in the first "head" file.
>>>> It would be cool to get some more info and shed some light on the
>>>> structures (myself not really being a good code-reader anymore ;) ).
>>>
>>> 'head' in this case means the object hasn't been COWed (snapshotted and
>>> then overwritten), and 000000000000 means it's the first 4MB block of
>>> the rbd image/disk.
>>
>> yes, true,
>>
>>> Were you able to use the 'rbd info' from the previous email to identify
>>> which image it is? Is that what you mean by 'identify the real file'?
>>
>> that's the point: from the object I would like to identify the complete
>> image location, a la:
>>
>> <pool>/<image>
>>
>> From there I'd know which customer's rbd disk image is affected.
>
> For the pool, look at the pgid, in this case '109.6'. 109 is the pool id.
> Look at the pool list in the 'ceph osd dump' output to see which pool name
> that is.
>
> For the image, rb.0.0 is the image prefix. Look at each rbd image in that
> pool, and check for the image whose prefix matches, e.g.:
>
>   for img in `rbd -p poolname list` ; do
>       rbd info $img -p poolname | grep -q rb.0.0 && echo found $img
>   done
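For the archives, the whole lookup can be scripted. A rough, untested sketch
(the pg id and the rb.0.0 prefix are from our case and need to be substituted;
it also assumes 'ceph osd dump' prints pool lines as "pool <id> '<name>' ...",
which is what our version does):

    pgid=109.6                # placeholder: the inconsistent pg
    prefix=rb.0.0             # placeholder: the object name prefix
    # the part of the pg id before the dot is the pool id
    poolid=${pgid%%.*}
    # map the pool id to the pool name via the osd dump
    poolname=$(ceph osd dump | \
        awk -v id="$poolid" '$1 == "pool" && $2 == id { print $3 }' | tr -d "'")
    # scan every image in that pool for the matching prefix
    for img in $(rbd -p "$poolname" list); do
        rbd info "$img" -p "$poolname" | grep -q "$prefix" && echo "found $img"
    done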
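As for my question about detecting further corrupted objects before a customer
reboots: the only thing I can think of is checking each file system offline.
A sketch, untested, reusing $poolname from above and assuming a raw image with
a single partition plus enough scratch space for a full export (the image name
is a placeholder):

    img=vm-disk-1                           # placeholder image name
    rbd export "$img" /tmp/"$img".raw -p "$poolname"
    # attach the exported image to a loop device and map its partitions
    loopdev=$(losetup -f --show /tmp/"$img".raw)
    kpartx -av "$loopdev"                   # creates /dev/mapper/<loopN>p1 etc.
    # read-only check: -n reports problems without touching the file system
    fsck -n /dev/mapper/$(basename "$loopdev")p1
    kpartx -d "$loopdev"
    losetup -d "$loopdev"

Of course a clean fsck only rules out metadata damage; corruption inside file
contents would still go unnoticed.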
> BTW, are you creating a pool per customer here? You need to be a little
> bit careful about creating large numbers of pools; the system isn't really
> designed to be used that way. You should use a pool if you have a
> distinct data placement requirement (e.g., put these objects on this set
> of ceph-osds). But because of the way things work internally, creating
> hundreds/thousands of them won't be very efficient.
>
> sage
>
>> Thanks for your patience,
>>
>> Oliver.
>>
>>> I'm not sure I understand exactly what your question is. I would have
>>> expected modifying the file with fdisk to work (if fdisk sees a valid
>>> partition table, it should be able to write it too).
>>>
>>> sage
>>>
>>>> Thanks in advance and kind regards,
>>>>
>>>> Oliver.
>>>>
>>>> On 13.02.2012, at 18:13, Sage Weil wrote:
>>>>
>>>>> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
>>>>>
>>>>>>>> Hi list,
>>>>>>>>
>>>>>>>> today I've got another problem.
>>>>>>>>
>>>>>>>> ceph -w shows up with an inconsistent PG overnight:
>>>>>>>>
>>>>>>>> 2012-02-10 08:38:48.701775 pg v441251: 1982 pgs: 1981 active+clean,
>>>>>>>> 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB /
>>>>>>>> 22345 GB avail
>>>>>>>> 2012-02-10 08:38:49.702789 pg v441252: 1982 pgs: 1981 active+clean,
>>>>>>>> 1 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB /
>>>>>>>> 22345 GB avail
>>>>>>>>
>>>>>>>> I've identified it with "ceph pg dump | grep inconsistent"
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Now I've tried to repair it with: ceph pg repair 109.6
>>>>>>>>
>>>>>>>> 2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6]
>>>>>>>> 2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3
>>>>>>>> to repair' (0)
>>>>>>>>
>>>>>>>> but I only get the following result:
>>>>>>>>
>>>>>>>> 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455420 osd.3
>>>>>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
>>>>>>>> 1ef398ce/rb.0.0.0000000000bd/head size 2736128 != known size 3145728
>>>>>>>> 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455426 osd.3
>>>>>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1
>>>>>>>> inconsistent objects
>>>>>>>> 2012-02-10 08:36:18.447553 log 2012-02-10 08:36:08.455799 osd.3
>>>>>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
>>>>>>>>
>>>>>>>> Can someone please explain to me what to do in this case and how to
>>>>>>>> recover the pg?
>>>>>>>
>>>>>>> So the "fix" is just to truncate the file to the expected size,
>>>>>>> 3145728, by finding it in the current/ directory. The name/path will
>>>>>>> be slightly weird; look for 'rb.0.0.0000000000bd'.
>>>>>>>
>>>>>>> The data is still suspect, though. Did the ceph-osd restart or crash
>>>>>>> recently? I would do that, repair (it should succeed), and then fsck
>>>>>>> the file system in that rbd image.
>>>>>>>
>>>>>>> We just fixed a bug that was causing transactions to leak across
>>>>>>> checkpoint/snapshot boundaries. That could be responsible for causing
>>>>>>> all sorts of subtle corruptions, including this one. It'll be
>>>>>>> included in v0.42 (out next week).
>>>>>>>
>>>>>>> sage
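For anyone hitting the same scrub error: the mechanical part of the fix above
looks roughly like this (untested sketch; the OSD mount point is from our
setup, the find assumes exactly one match, and it has to be done on the osd
whose copy has the wrong size, osd.4 in the log above):

    # locate the object's backing file under the OSD's current/ directory
    f=$(find /data/osd3/current -name 'rb.0.0.0000000000bd__head*')
    # pad the file with zeros up to the size the scrub expects
    truncate -s 3145728 "$f"
    # then re-run the repair
    ceph pg repair 109.6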
>>>>>> Hi Sage,
>>>>>>
>>>>>> no ... the osd didn't crash. I had to do some hardware maintenance and
>>>>>> took it out of the distribution with "ceph osd out 3". After a short
>>>>>> while I used "/etc/init.d/ceph stop" on that osd.
>>>>>> Then, after my work, I started ceph and put it back into the
>>>>>> distribution with "ceph osd in 3".
>>>>>
>>>>> For the bug I'm worried about, stopping the daemon and crashing are
>>>>> equivalent. In both cases, a transaction may have been only partially
>>>>> included in the checkpoint.
>>>>>
>>>>>> Could you please tell me if this is the right way to take an osd out
>>>>>> for maintenance? Is there anything else I should do to keep the data
>>>>>> consistent?
>>>>>
>>>>> You followed the right procedure. There is (hopefully, was!) just a
>>>>> bug.
>>>>>
>>>>> sage
>>>>>
>>>>>> My setup is: 3 MDS/MON servers on separate hardware nodes and 3 OSD
>>>>>> nodes, each with a total capacity of 8 TB. Journaling is done on a
>>>>>> separate SSD per node. The whole thing is a data store for a kvm
>>>>>> virtualisation farm, which accesses the data directly via rbd.
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> Jens
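P.S. Since the question comes up regularly: the maintenance sequence Jens
describes, and which Sage confirms above, boils down to the following (a
sketch; the osd id is from this thread, and waiting for the cluster to settle
between steps is our own precaution, not something prescribed here):

    ceph osd out 3          # take the osd out of the data distribution
    # wait until "ceph -w" / "ceph health" shows the cluster has settled
    /etc/init.d/ceph stop   # on the osd node: stop the local ceph daemons
    # <do the hardware maintenance>
    /etc/init.d/ceph start  # on the osd node: bring the daemons back up
    ceph osd in 3           # put the osd back into the distribution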