Hi Sage,

thanks for the quick response.

On 16.02.2012 at 18:17, Sage Weil wrote:

> On Thu, 16 Feb 2012, Oliver Francke wrote:
>> Hi Sage, *,
>>
>> your tip with truncating from below did not solve the problem. Just to recap:
>>
>> we had two inconsistencies, which we could break down to something like:
>>
>> rb.0.0.000000000000__head_DA680EE2
>>
>> according to the ceph dump from below. Walking to the node with the OSD mounted on /data/osd3,
>> for example, a stupid "find ..." brings up a couple of them, so the pg number is relevant too -
>> makes sense - we went into, let's say, "/data/osd3/current/84.2_head/" and did a hex dump of the file. It really looked
>> like the "head", in the sense of showing traces of an installed grub loader, but with a corrupted partition table.
>> On others of these files one could do a "fdisk -l <file>" and at least a partition table could be
>> found.
>> Two days later we got a big complaint from a customer about not being able to boot his VM anymore. The point now is:
>> given such a file with its name and pg, how can we identify the real image it is associated with? Because there is another
>> customer with a potential problem at the next reboot (the second inconsistency).
>>
>> We also had some VMs in a big test phase with similar problems: grub going into the rescue prompt, invalid/corrupted
>> partition tables; so is all of this in the first "head" file?
>> It would be cool to get some more info and shed some light on the structures (myself not really being a good code reader
>> anymore ;) ).
>
> 'head' in this case means the object hasn't been COWed (snapshotted and
> then overwritten), and 000000000000 means it's the first 4MB block of the
> rbd image/disk.
>

yes, true,

> Were you able to use the 'rbd info' in the previous email to identify which
> image it is?  Is that what you mean by 'identify the real file'?
>

that's the point: from the object I would like to identify the complete image location, i.e. <pool>/<image>. From there I'd know which customer's rbd disk image is affected (see the command sketches below the quoted thread).

Thanks for your patience,

Oliver.

> I'm not sure I understand exactly what your question is.  I would have
> expected modifying the file with fdisk -l to work (if fdisk sees a valid
> partition table, it should be able to write it too).
>
> sage
>
>
>>
>> Thanks in advance and kind regards,
>>
>> Oliver.
>>
>> On 13.02.2012 at 18:13, Sage Weil wrote:
>>
>>> On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
>>>
>>>>>> Hi list,
>>>>>>
>>>>>> today I've got another problem.
>>>>>>
>>>>>> ceph -w showed up with an inconsistent PG overnight:
>>>>>>
>>>>>> 2012-02-10 08:38:48.701775    pg v441251: 1982 pgs: 1981 active+clean, 1
>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>>>> GB avail
>>>>>> 2012-02-10 08:38:49.702789    pg v441252: 1982 pgs: 1981 active+clean, 1
>>>>>> active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
>>>>>> GB avail
>>>>>>
>>>>>> I've identified it with "ceph pg dump - | grep inconsistent":
>>>>>>
>>>>>> 109.6   141     0       0       0       463820288       111780  111780
>>>>>> active+clean+inconsistent       485'7115        480'7301        [3,4]   [3,4]
>>>>>> 485'7061        2012-02-10 08:02:12.043986
>>>>>>
>>>>>> Now I've tried to repair it with: ceph pg repair 109.6
>>>>>>
>>>>>> 2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6]
>>>>>> 2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3 to
>>>>>> repair' (0)
>>>>>>
>>>>>> but I only get the following result:
>>>>>>
>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
>>>>>> 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
>>>>>> 1ef398ce/rb.0.0.0000000000bd/head size 2736128 != known size 3145728
>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
>>>>>> 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
>>>>>> objects
>>>>>> 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
>>>>>> 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
>>>>>>
>>>>>> Can someone please explain to me what to do in this case and how to recover
>>>>>> the pg?
>>>>>
>>>>> So the "fix" is just to truncate the file to the expected size, 3145728,
>>>>> by finding it in the current/ directory.  The name/path will be slightly
>>>>> weird; look for 'rb.0.0.0000000000bd'.
>>>>>
>>>>> The data is still suspect, though.  Did the ceph-osd restart or crash
>>>>> recently?  I would do that, repair (it should succeed), and then fsck the
>>>>> file system in that rbd image.
>>>>>
>>>>> We just fixed a bug that was causing transactions to leak across
>>>>> checkpoint/snapshot boundaries.  That could be responsible for causing all
>>>>> sorts of subtle corruptions, including this one.  It'll be included in
>>>>> v0.42 (out next week).
>>>>>
>>>>> sage
>>>>
>>>> Hi Sage,
>>>>
>>>> no ... the osd didn't crash. I had to do some hardware maintenance and pushed it
>>>> out of the distribution with "ceph osd out 3". After a short while I used
>>>> "/etc/init.d/ceph stop" on that osd.
>>>> Then, after my work, I started ceph again and pushed it back into the distribution with
>>>> "ceph osd in 3".
>>>
>>> For the bug I'm worried about, stopping the daemon and crashing are
>>> equivalent.  In both cases, a transaction may have been only partially
>>> included in the checkpoint.
>>>
>>>> Could you please tell me if this is the right way to take an osd out for
>>>> maintenance? Is there
>>>> anything else I should do to keep the data consistent?
>>>
>>> You followed the right procedure.  There is (hopefully, was!) just a bug.
>>>
>>> sage
>>>
>>>
>>>> My setup is 3 MDS/MON servers on separate hardware nodes and 3 OSD nodes,
>>>> each with a total capacity
>>>> of 8 TB. Journaling is done on a separate SSD per node. The whole thing is the
>>>> data store for a kvm virtualisation
>>>> farm.
>>>> The farm is accessing the data directly via rbd.
>>>>
>>>> Thank you
>>>>
>>>> Jens
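
A rough sketch of how one might map such an object name back to a <pool>/<image>, assuming 'rbd info' in the versions discussed here prints the image's block name prefix (the "rb.0.0" part) as a block_name_prefix line; the prefix below is just the one from this thread, everything else is generic:

# The object name rb.0.0.000000000000__head_DA680EE2 breaks down into the
# image's block name prefix (rb.0.0), the block index in hex (000000000000),
# and the object's hash (DA680EE2).  With the default 4 MB objects, block N
# covers bytes N*4MB .. (N+1)*4MB-1, so block 0 holds the guest's MBR and
# partition table - which is why a damaged block 0 leaves grub in the
# rescue prompt.

prefix="rb.0.0"    # taken from the damaged object's name

# Loop over all pools and images and compare the prefix against what
# 'rbd info' reports; a match names the affected customer disk.
for pool in $(rados lspools); do
    for img in $(rbd ls -p "$pool" 2>/dev/null); do
        if rbd info -p "$pool" "$img" 2>/dev/null | grep -q "block_name_prefix: ${prefix}\$"; then
            echo "prefix ${prefix} -> ${pool}/${img}"
        fi
    done
done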
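
And a sketch of the on-OSD inspection described further up, assuming the /data/osd3 mount point and pg 84.2 from this thread; the exact on-disk file name may be mangled slightly, hence the wildcard:

# Locate the on-disk copy of the object inside the pg's _head directory.
obj=$(find /data/osd3/current/84.2_head -name 'rb.0.0.000000000000*' | head -1)

# Block 0 is the start of the virtual disk, so a healthy copy should look
# like an MBR: boot code, a partition table at offset 446 and the 55 aa
# signature at offset 510.
hexdump -C "$obj" | head -40

# fdisk happily reads a partition table out of a plain file as well.
fdisk -l "$obj"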
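
Finally, the truncate fix Sage describes in the older thread, written out as commands. The path is an assumption (Jens' OSD data directories aren't named anywhere; I'm reusing our /data/osdN layout), and since the scrub error was reported against osd.4 that is presumably where the short copy lives - double-check before touching anything, and ideally do it while that osd is quiet or stopped:

# The scrub error was: .../rb.0.0.0000000000bd/head size 2736128 != known size 3145728

# Find the mangled on-disk file for the short object in pg 109.6.
obj=$(find /data/osd4/current/109.6_head -name 'rb.0.0.0000000000bd*' | head -1)
ls -l "$obj"                 # expect 2736128 here

# Pad the file back out to the expected size (the added bytes are zeroes).
truncate -s 3145728 "$obj"

# Re-run the repair; with the sizes matching it should now succeed.
ceph pg repair 109.6

# The zero-filled tail means the data is still suspect, so fsck the file
# system inside the affected rbd image afterwards, as Sage suggests.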