Scrub failing all the time, new inconsistencies keep appearing

Hello,

I've been using Ceph for a long time. A day ago I set the Jewel requirement for the OSDs and updated the CRUSH map.


Since then I have had all kinds of errors, maybe because disks are failing under the rebalance load, or maybe because of some other problem I'm not aware of.

I have some PGs in active+clean+inconsistent, from different volumes. When I try to repair them or run a scrub I get:

2017-09-14 15:24:32.139215  [ERR] 9.8b shard 2: soid 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head data_digest 0x903e1482 != data_digest 0x4d4e39be from auth oi 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head(3982'375882 osd.1.0:2494526 dirty|data_digest|omap_digest s 4194304 uv 375794 dd 4d4e39be od ffffffff)
2017-09-14 15:24:32.139220  [ERR] 9.8b shard 6: soid 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head data_digest 0x903e1482 != data_digest 0x4d4e39be from auth oi 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head(3982'375882 osd.1.0:2494526 dirty|data_digest|omap_digest s 4194304 uv 375794 dd 4d4e39be od ffffffff)
2017-09-14 15:24:32.139222  [ERR] 9.8b soid 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head: failed to pick suitable auth object
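I assume I can see exactly which copies disagree with something like this (not sure if it's the right way):

> rados list-inconsistent-obj 9.8b --format=json-pretty

which should show, for each object in the PG, the data_digest reported by every shard and the digest recorded in the object info.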

I removed one of the OSDs and added a bigger one to the cluster, but the old "authoritative" disk is still in the machine (I did remove it from the CRUSH map and so on, as the documentation says; the steps I used are sketched after the tree below). Mine is a small cluster, and I know that makes problems more critical, since there aren't enough replicas when something goes wrong:


ID WEIGHT  TYPE NAME                 UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.27299 root default                                               
-4 4.27299     rack rack-1                                            
-2 1.00000         host blue-compute                                  
 0 1.00000             osd.0              up  1.00000          1.00000
 2 1.00000             osd.2              up  1.00000          1.00000
-3 3.27299         host red-compute                                   
 4 1.00000             osd.4              up  1.00000          1.00000
 3 1.36380             osd.3              up  1.00000          1.00000
 6 0.90919             osd.6              up  1.00000          1.00000
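For reference, the removal followed roughly the standard sequence from the documentation; from memory, the commands were something like:

> ceph osd out 1
> systemctl stop ceph-osd@1      (on red-compute)
> ceph osd crush remove osd.1
> ceph auth del osd.1
> ceph osd rm 1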


The old osd.1 is still in the machine red-compute, but outside the cluster. My questions are:


First: with this kind of error, is there anything I can do to recover?

Second: if no authoritative copy can be found in the cluster (on osd.2 and osd.6), how can I fix it? Can I get it from the old osd.1, and if so, how?
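If it matters, I assume the old disk can still be read offline with ceph-objectstore-tool while osd.1 stays stopped, roughly like this (the data/journal paths are guesses for a default install):

> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --journal-path /var/lib/ceph/osd/ceph-1/journal --pgid 9.8b --op list rb.0.21dc.238e1f29.0000000125ae > obj.json
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --journal-path /var/lib/ceph/osd/ceph-1/journal "$(cat obj.json)" get-bytes > object.bin

But I don't know whether pushing that copy back into the current OSDs would be the right thing to do, or how.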

> ceph pg map 9.8b
  osdmap e7049 pg 9.8b (9.8b) -> up [6,2] acting [6,2]

> rados list-inconsistent-pg high_value
["9.8b"]
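I also wonder whether, if the data the current copies hold is actually good and only the digest recorded in the object info is stale, simply rewriting the object would refresh the digest so a repair can succeed, i.e. something like:

> rados -p high_value get rb.0.21dc.238e1f29.0000000125ae /tmp/obj.bin
> rados -p high_value put rb.0.21dc.238e1f29.0000000125ae /tmp/obj.bin
> ceph pg repair 9.8b

Is that safe here, or is there a better way?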

Any help on this would be appreciated.


Thank you in advance.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
