Hello,
Since yesterday, scrubbing has been reporting an inconsistent PG :( :
# ceph health detail (ceph version 0.61.9)
HEALTH_ERR 1 pgs inconsistent; 9 scrub errors
pg 3.136 is active+clean+inconsistent, acting [9,1]
9 scrub errors
# ceph pg map 3.136
osdmap e4363 pg 3.136 (3.136) -> up [9,1] acting [9,1]
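(I can also post the output of
# ceph pg 3.136 query
if that is useful; I have left it out to keep this mail short.)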
But when I try to repair it, the osd.9 daemon crashes:
# ceph pg repair 3.136
instructing pg 3.136 on osd.9 to repair
2013-11-25 10:04:09.758845 7fc2f0706700 0 log [ERR] : 3.136 osd.9 missing
96ad1336/rb.0.32a6.238e1f29.000000034d6a/5ab//3
2013-11-25 10:04:09.759862 7fc2f0706700 0 log [ERR] : repair 3.136
96ad1336/rb.0.32a6.238e1f29.000000034d6a/5ab//3 found clone without head
2013-11-25 10:04:12.872908 7fc2f0706700 0 log [ERR] : 3.136 osd.9 missing
e5822336/rb.0.32a6.238e1f29.000000036552/5b3//3
2013-11-25 10:04:12.873064 7fc2f0706700 0 log [ERR] : repair 3.136
e5822336/rb.0.32a6.238e1f29.000000036552/5b3//3 found clone without head
2013-11-25 10:04:14.497750 7fc2f0706700 0 log [ERR] : 3.136 osd.9 missing
38372336/rb.0.32a6.238e1f29.000000011379/5bb//3
2013-11-25 10:04:14.497796 7fc2f0706700 0 log [ERR] : repair 3.136
38372336/rb.0.32a6.238e1f29.000000011379/5bb//3 found clone without head
2013-11-25 10:04:57.557894 7fc2f0706700 0 log [ERR] : 3.136 osd.9 missing
109b8336/rb.0.32a6.238e1f29.00000003ad6b/5ab//3
2013-11-25 10:04:57.558052 7fc2f0706700 0 log [ERR] : repair 3.136
109b8336/rb.0.32a6.238e1f29.00000003ad6b/5ab//3 found clone without head
2013-11-25 10:17:45.835145 7fc2f0706700 0 log [ERR] : 3.136 repair stat
mismatch, got 8289/8292 objects, 1981/1984 clones, 26293444608/26294251520
bytes.
2013-11-25 10:17:45.835248 7fc2f0706700 0 log [ERR] : 3.136 repair 4
missing, 0 inconsistent objects
2013-11-25 10:17:45.835320 7fc2f0706700 0 log [ERR] : 3.136 repair 9
errors, 5 fixed
2013-11-25 10:17:45.839963 7fc2f0f07700 -1 osd/ReplicatedPG.cc: In function
'int ReplicatedPG::recover_primary(int)' thread 7fc2f0f07700 time 2013-11-25
10:17:45.836790
osd/ReplicatedPG.cc: 6643: FAILED assert(latest->is_update())
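After the assert, osd.9 is down, so I start the daemon again by hand
(sysvinit init scripts here, as in the procedure further down) and wait for
the cluster to settle before trying anything else:
# service ceph start osd.9
# ceph -s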
The objects reported as "found clone without head" all belong to the RBD
image below (which is in use):
# rbd info datashare/share3
rbd image 'share3':
size 1024 GB in 262144 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.32a6.238e1f29
format: 1
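The block_name_prefix above matches the object names in the scrub errors. As
a double check, a small loop over the pool prints any image using that prefix
(datashare is the pool the image lives in):
# for img in `rbd ls datashare`; do rbd info datashare/$img | grep -q rb.0.32a6.238e1f29 && echo $img; done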
Directory contents:
On osd.9 (primary):
/var/lib/ceph/osd/ceph-9/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls -l
rb.0.32a6.238e1f29.000000034d6a*
-rw-r--r-- 1 root root 4194304 nov. 6 02:25
rb.0.32a6.238e1f29.000000034d6a__7ed_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 8 02:40
rb.0.32a6.238e1f29.000000034d6a__7f5_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 9 02:44
rb.0.32a6.238e1f29.000000034d6a__7fd_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 12 02:52
rb.0.32a6.238e1f29.000000034d6a__815_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 14 02:39
rb.0.32a6.238e1f29.000000034d6a__825_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 16 02:45
rb.0.32a6.238e1f29.000000034d6a__835_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 19 01:59
rb.0.32a6.238e1f29.000000034d6a__84d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 20 02:25
rb.0.32a6.238e1f29.000000034d6a__855_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 22 02:18
rb.0.32a6.238e1f29.000000034d6a__865_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 23 02:24
rb.0.32a6.238e1f29.000000034d6a__86d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 23 02:24
rb.0.32a6.238e1f29.000000034d6a__head_96AD1336__3
On osd.1 (replica):
/var/lib/ceph/osd/ceph-1/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls -l
rb.0.32a6.238e1f29.000000034d6a*
-rw-r--r-- 1 root root 4194304 oct. 11 17:13
rb.0.32a6.238e1f29.000000034d6a__5ab_96AD1336__3 <--- ????
-rw-r--r-- 1 root root 4194304 nov. 6 02:25
rb.0.32a6.238e1f29.000000034d6a__7ed_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 8 02:40
rb.0.32a6.238e1f29.000000034d6a__7f5_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 9 02:44
rb.0.32a6.238e1f29.000000034d6a__7fd_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 12 02:52
rb.0.32a6.238e1f29.000000034d6a__815_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 14 02:39
rb.0.32a6.238e1f29.000000034d6a__825_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 16 02:45
rb.0.32a6.238e1f29.000000034d6a__835_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 19 01:59
rb.0.32a6.238e1f29.000000034d6a__84d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 20 02:25
rb.0.32a6.238e1f29.000000034d6a__855_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 22 02:18
rb.0.32a6.238e1f29.000000034d6a__865_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 23 02:24
rb.0.32a6.238e1f29.000000034d6a__86d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov. 23 02:24
rb.0.32a6.238e1f29.000000034d6a__head_96AD1336__3
The file rb.0.32a6.238e1f29.000000034d6a__5ab_96AD1336__3 is present only on
the replica, osd.1. It seems that this snapshot (5ab) no longer exists.
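To check whether there are other stray clones of this kind in the PG, the
files carrying one of the snap ids from the repair log above (5ab, 5b3, 5bb)
can be listed directly on osd.1, for example:
# cd /var/lib/ceph/osd/ceph-1/current/3.136_head
# find . -type f | egrep '__5(ab|b3|bb)_'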
# ceph osd dump | grep snap
removed_snaps [1~c,e~23]
removed_snaps
[1~7,9~1,d~2,14~789,7a0~1,7a2~3,7a8~1,7aa~43,7f1~1,7f3~2,7f9~1,7fb~2,801~1,803~2,809~1,80b~2,811~1,813~2,819~1,81b~2,821~1,823~2,829~1,82b~2,831~1,833~2,839~1,83b~2,841~1,843~2,849~1,84b~2,851~1,853~2,859~1,85b~2,861~1,863~2,869~1,86b~2,871~1,873~2,879~1,87b~39,8ba~49]
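If I read removed_snaps correctly (hex start~length intervals), the interval
14~789 covers snap ids 0x14 up to and including 0x79c, which would include
5ab, 5b3 and 5bb:
# printf '0x%x\n' $((0x14 + 0x789 - 1))
0x79c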
# for i in `rbd snap ls datashare/share3 | cut -f3 -d ' '`; do printf '%x, ' $i; done
7ed, 7f5, 7fd, 805, 80d, 815, 81d, 825, 82d, 835, 83d, 845, 84d, 855, 85d,
865, 86d, 875, 8b4, 905
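None of 5ab, 5b3 or 5bb appear in that list. The same check in decimal,
against the first column of rbd snap ls:
# printf '%d %d %d\n' 0x5ab 0x5b3 0x5bb
1451 1459 1467
# rbd snap ls datashare/share3 | awk '{print $1}' | egrep -w '1451|1459|1467'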
How can I be sure that these files are no longer needed?
If they really are unused, do you think I can remove them manually on osd.1?
Something like this:
$ ceph osd set noout
$ service ceph stop osd.1
$ cd /var/lib/ceph/osd/ceph-1/current/3.136_head
$ mkdir -p /root/temp_obj_backup
$ mv \
  ./DIR_6/DIR_3/DIR_3/DIR_1/rb.0.32a6.238e1f29.000000034d6a__5ab_96AD1336__3 \
  ./DIR_6/DIR_3/DIR_3/DIR_2/rb.0.32a6.238e1f29.000000036552__5b3_E5822336__3 \
  ./DIR_6/DIR_3/DIR_3/DIR_2/rb.0.32a6.238e1f29.000000011379__5bb_38372336__3 \
  ./DIR_6/DIR_3/DIR_3/DIR_8/rb.0.32a6.238e1f29.00000003ad6b__5ab_109B8336__3 \
  /root/temp_obj_backup/
$ service ceph start osd.1
$ ceph osd unset noout
$ ceph pg repair 3.136
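And afterwards, to confirm that the PG really comes back clean (assuming
"ceph pg deep-scrub" is available in 0.61.9, otherwise a plain "ceph pg
scrub 3.136"):
$ ceph pg deep-scrub 3.136
$ ceph health detail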