Re: pg inconsistent : found clone without head

Hello,

log [INF] : 3.136 repair ok, 0 fixed

Thank you Greg, I did it that way and it worked well.


Laurent


On 25/11/2013 19:10, Gregory Farnum wrote:
On Mon, Nov 25, 2013 at 8:10 AM, Laurent Barbe <laurent@xxxxxxxxxxx> wrote:
Hello,

Since yesterday, scrub has detected an inconsistent pg :( :

# ceph health detail    (ceph version 0.61.9)
HEALTH_ERR 1 pgs inconsistent; 9 scrub errors
pg 3.136 is active+clean+inconsistent, acting [9,1]
9 scrub errors

# ceph pg map 3.136
osdmap e4363 pg 3.136 (3.136) -> up [9,1] acting [9,1]

But when I try to repair it, the osd.9 daemon crashes:

# ceph pg repair 3.136
instructing pg 3.136 on osd.9 to repair

2013-11-25 10:04:09.758845 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
96ad1336/rb.0.32a6.238e1f29.000000034d6a/5ab//3
2013-11-25 10:04:09.759862 7fc2f0706700  0 log [ERR] : repair 3.136
96ad1336/rb.0.32a6.238e1f29.000000034d6a/5ab//3 found clone without head
2013-11-25 10:04:12.872908 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
e5822336/rb.0.32a6.238e1f29.000000036552/5b3//3
2013-11-25 10:04:12.873064 7fc2f0706700  0 log [ERR] : repair 3.136
e5822336/rb.0.32a6.238e1f29.000000036552/5b3//3 found clone without head
2013-11-25 10:04:14.497750 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
38372336/rb.0.32a6.238e1f29.000000011379/5bb//3
2013-11-25 10:04:14.497796 7fc2f0706700  0 log [ERR] : repair 3.136
38372336/rb.0.32a6.238e1f29.000000011379/5bb//3 found clone without head
2013-11-25 10:04:57.557894 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
109b8336/rb.0.32a6.238e1f29.00000003ad6b/5ab//3
2013-11-25 10:04:57.558052 7fc2f0706700  0 log [ERR] : repair 3.136
109b8336/rb.0.32a6.238e1f29.00000003ad6b/5ab//3 found clone without head
2013-11-25 10:17:45.835145 7fc2f0706700  0 log [ERR] : 3.136 repair stat
mismatch, got 8289/8292 objects, 1981/1984 clones, 26293444608/26294251520
bytes.
2013-11-25 10:17:45.835248 7fc2f0706700  0 log [ERR] : 3.136 repair 4
missing, 0 inconsistent objects
2013-11-25 10:17:45.835320 7fc2f0706700  0 log [ERR] : 3.136 repair 9
errors, 5 fixed
2013-11-25 10:17:45.839963 7fc2f0f07700 -1 osd/ReplicatedPG.cc: In function
'int ReplicatedPG::recover_primary(int)' thread 7fc2f0f07700 time 2013-11-25
10:17:45.836790
osd/ReplicatedPG.cc: 6643: FAILED assert(latest->is_update())


The objects (found clone without head) concern the rbd image below (which is
in use):

# rbd info datashare/share3
rbd image 'share3':
         size 1024 GB in 262144 objects
         order 22 (4096 KB objects)
         block_name_prefix: rb.0.32a6.238e1f29
         format: 1


Directory contents :
In OSD.9 (Primary) :
/var/lib/ceph/osd/ceph-9/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls -l
rb.0.32a6.238e1f29.000000034d6a*
-rw-r--r-- 1 root root 4194304 nov.   6 02:25
rb.0.32a6.238e1f29.000000034d6a__7ed_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   8 02:40
rb.0.32a6.238e1f29.000000034d6a__7f5_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   9 02:44
rb.0.32a6.238e1f29.000000034d6a__7fd_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  12 02:52
rb.0.32a6.238e1f29.000000034d6a__815_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  14 02:39
rb.0.32a6.238e1f29.000000034d6a__825_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  16 02:45
rb.0.32a6.238e1f29.000000034d6a__835_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  19 01:59
rb.0.32a6.238e1f29.000000034d6a__84d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  20 02:25
rb.0.32a6.238e1f29.000000034d6a__855_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  22 02:18
rb.0.32a6.238e1f29.000000034d6a__865_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24
rb.0.32a6.238e1f29.000000034d6a__86d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24
rb.0.32a6.238e1f29.000000034d6a__head_96AD1336__3

In OSD.1 (Replica) :
/var/lib/ceph/osd/ceph-1/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls -l
rb.0.32a6.238e1f29.000000034d6a*
-rw-r--r-- 1 root root 4194304 oct.  11 17:13
rb.0.32a6.238e1f29.000000034d6a__5ab_96AD1336__3   <--- ????
-rw-r--r-- 1 root root 4194304 nov.   6 02:25
rb.0.32a6.238e1f29.000000034d6a__7ed_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   8 02:40
rb.0.32a6.238e1f29.000000034d6a__7f5_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   9 02:44
rb.0.32a6.238e1f29.000000034d6a__7fd_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  12 02:52
rb.0.32a6.238e1f29.000000034d6a__815_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  14 02:39
rb.0.32a6.238e1f29.000000034d6a__825_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  16 02:45
rb.0.32a6.238e1f29.000000034d6a__835_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  19 01:59
rb.0.32a6.238e1f29.000000034d6a__84d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  20 02:25
rb.0.32a6.238e1f29.000000034d6a__855_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  22 02:18
rb.0.32a6.238e1f29.000000034d6a__865_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24
rb.0.32a6.238e1f29.000000034d6a__86d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24
rb.0.32a6.238e1f29.000000034d6a__head_96AD1336__3


The file rb.0.32a6.238e1f29.000000034d6a__5ab_96AD1336__3 is only present on
the replica, osd.1. It seems that this snapshot (5ab) no longer exists.

# ceph osd dump | grep snap
         removed_snaps [1~c,e~23]
         removed_snaps
[1~7,9~1,d~2,14~789,7a0~1,7a2~3,7a8~1,7aa~43,7f1~1,7f3~2,7f9~1,7fb~2,801~1,803~2,809~1,80b~2,811~1,813~2,819~1,81b~2,821~1,823~2,829~1,82b~2,831~1,833~2,839~1,83b~2,841~1,843~2,849~1,84b~2,851~1,853~2,859~1,85b~2,861~1,863~2,869~1,86b~2,871~1,873~2,879~1,87b~39,8ba~49]

# for i in `rbd snap ls datashare/share3 | cut -f3 -d ' '`; do printf '%x, ' $i; done
7ed, 7f5, 7fd, 805, 80d, 815, 81d, 825, 82d, 835, 83d, 845, 84d, 855, 85d,
865, 86d, 875, 8b4, 905
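
To double-check, the stray clone snap ids should also fall inside one of the
removed_snaps intervals above. A rough sketch of that check (it assumes the
intervals are hex "start~length" pairs, and it simply scans the removed_snaps
of every pool in the dump):

for id in 5ab 5b3 5bb; do
  for iv in $(ceph osd dump | grep removed_snaps | grep -oE '[0-9a-f]+~[0-9a-f]+'); do
    s=$((16#${iv%~*})); l=$((16#${iv#*~}))
    # a snap id counts as removed if it lies in [start, start+length)
    if [ $((16#$id)) -ge "$s" ] && [ $((16#$id)) -lt $((s+l)) ]; then
      echo "0x$id falls in removed interval $iv"
    fi
  done
done

For example, 0x5ab sits inside the 14~789 interval of pool 3, which would
confirm that this snapshot was indeed deleted.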


How can I be sure that this file is no longer needed?

If these files are no longer used, do you think I can remove them manually
on osd.1?

Like this?

$ ceph osd set noout
$ service ceph stop osd.1

$ cd /var/lib/ceph/osd/ceph-1/current/3.136_head
$ mv \
  ./DIR_6/DIR_3/DIR_3/DIR_1/rb.0.32a6.238e1f29.000000034d6a__5ab_96AD1336__3 \
  ./DIR_6/DIR_3/DIR_3/DIR_2/rb.0.32a6.238e1f29.000000036552__5b3_E5822336__3 \
  ./DIR_6/DIR_3/DIR_3/DIR_2/rb.0.32a6.238e1f29.000000011379__5bb_38372336__3 \
  ./DIR_6/DIR_3/DIR_3/DIR_8/rb.0.32a6.238e1f29.00000003ad6b__5ab_109B8336__3 \
  /root/temp_obj_backup/

$ service ceph start osd.1
$ ceph osd unset noout

$ ceph pg repair 3.136
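
And afterwards, I suppose a deep scrub would confirm that the PG comes back
clean:

$ ceph pg deep-scrub 3.136
$ ceph health detail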

I ran this by Sam to get more ideas of the cause, and he suggests
checking dmesg on the offending node (guessing that your local
filesystem is having some issues). Manually removing the offending
object from the replica and then running repair (to fix any lingering
stat mismatches) should deal with it, yes.
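For the dmesg check, something as simple as this should show any recent
filesystem or I/O complaints on the OSD host (adjust the grep pattern to
whatever your local filesystem logs):

$ dmesg | grep -iE 'xfs|i/o error|ext4' | tail -n 50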
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




