ceph hang on pg list_unfound

Don Waterloo <don.waterloo@xxxxxxxxx> · Wed, 18 May 2016 21:20:11 -0400

I am running 10.2.0-0ubuntu0.16.04.1. I've run into a problem with the cephfs metadata pool. Specifically, I have a pg with an 'unfound' object.

But I can't figure out which one, since when I run:
ceph pg 12.94 list_unfound

it hangs (as does ceph pg 12.94 query). I know it's in the cephfs metadata pool because I ran:
ceph pg ls-by-pool cephfs_metadata |egrep "pg_stat|12\\.94"

and it shows it there:
pg_stat objects mip     degr    misp    unf     bytes   log     disklog state   state_stamp     v       reported        up      up_primary      acting  acting_primary  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
12.94   231     1       1       0       1       90      3092    3092    active+recovering+degraded      2016-05-18 23:49:15.718772      8957'386130     9472:367098     [1,4]   1       [1,4]   1       8935'385144     2016-05-18 10:46:46.123526     8337'379527     2016-05-14 22:37:05.974367
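
To pick the affected pgs out of that listing without reading it by eye, a minimal sketch along these lines should work. The helper name list_unfound_pgs is made up for illustration and is not a ceph command; it only assumes the header row starts with pg_stat and has a column named unf, as in the output above.

```shell
# Print the ID of every pg whose "unf" (unfound) column is non-zero.
# Reads `ceph pg ls-by-pool <pool>` style output on stdin. The column is
# located by its header name, so its exact position is not assumed.
list_unfound_pgs() {
    awk '
        $1 == "pg_stat" {
            # Header row: remember which field holds the unfound count.
            for (i = 1; i <= NF; i++) if ($i == "unf") col = i
            next
        }
        # Data rows: print the pg id when the unfound count is above zero.
        col && $col + 0 > 0 { print $1 }
    '
}

# Usage (pool name taken from this post):
#   ceph pg ls-by-pool cephfs_metadata | list_unfound_pgs
```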

OK, so what is hanging, and how can I get it to unhang so I can run a 'mark_unfound_lost' on it?
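
While poking at this, it helps to keep the hanging commands from blocking the shell. This is only a sketch of a wrapper around timeout from GNU coreutils, which exits with status 124 when the limit is hit. The function name try_with_limit and the 30 second limit are arbitrary choices for illustration, and this does not fix the hang itself.

```shell
# Run a command with a time limit and report whether it hung.
# Relies on `timeout` from GNU coreutils, which returns 124 on expiry.
try_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Usage:
#   try_with_limit 30 ceph pg 12.94 query
#   try_with_limit 30 ceph pg 12.94 list_unfound
```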

pg 12.94 is on osd.0

ID WEIGHT  TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 5.48996 root default                                       
-2 0.89999     host nubo-1                                    
 0 0.89999         osd.0         up  1.00000          1.00000 
-3 0.89999     host nubo-2                                    
 1 0.89999         osd.1         up  1.00000          1.00000 
-4 0.89999     host nubo-3                                    
 2 0.89999         osd.2         up  1.00000          1.00000 
-5 0.92999     host nubo-19                                   
 3 0.92999         osd.3         up  1.00000          1.00000 
-6 0.92999     host nubo-20                                   
 4 0.92999         osd.4         up  1.00000          1.00000 
-7 0.92999     host nubo-21                                   
 5 0.92999         osd.5         up  1.00000          1.00000 

I cranked up the logging on osd.0. I see a lot of messages, but nothing interesting.
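
To cut the log volume down to the one pg, a filter like the following may help. It is a sketch that assumes pg-scoped OSD debug lines contain the literal text pg[<pgid>( and that the log lives at the usual default path; the helper name pg_log_lines is made up for illustration.

```shell
# Keep only log lines that mention a given pg.
# Assumption: pg-scoped OSD debug lines contain the literal "pg[<pgid>(".
# grep -F treats the pattern as a fixed string, so "[", "." and "(" are literal.
pg_log_lines() {
    pgid=$1
    grep -F "pg[${pgid}("
}

# Usage (adjust the path if your logs live elsewhere):
#   pg_log_lines 12.94 < /var/log/ceph/ceph-osd.0.log | tail -n 50
```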

I've double-checked that all nodes can ping each other. I've run 'xfs_repair' on the underlying xfs storage to check for issues (there were none).

Can anyone suggest how to clear this hang so I can try to repair this system?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com