Gregory Farnum <gfarnum@...> writes:

> Or maybe it's 0.9a, or maybe I just don't remember at all. I'm sure
> somebody recalls...

I'm still struggling with this. When copying some files from the Ceph file
system, the copy hangs forever. Here's some more data:

* Attempt to copy a file. "ceph --watch-warn" shows the reads stuck
"waiting for replay end" (more on chasing these at the end of this message):

2016-01-01 11:16:12.637932 osd.405 [WRN] slow request 480.160153 seconds old, received at 2016-01-01 11:08:12.477509: osd_op(client.46686461.1:11 10000006479.00000004 [read 2097152~2097152 [1@-1]] 0.ca710b7 read e367378) currently waiting for replay end

* Look for the client's entry in "ceph daemon mds.0 session ls". Here it is:

    {
        "id": 46686461,
        "num_leases": 0,
        "num_caps": 10332,
        "state": "open",
        "replay_requests": 0,
        "reconnecting": false,
        "inst": "client.46686461 192.168.1.180:0\/2512587758",
        "client_metadata": {
            "entity_id": "",
            "hostname": "node80.galileo",
            "kernel_version": "4.3.3-1.el6.elrepo.i686"
        }
    },

* Look for messages in /var/log/ceph/ceph.log referring to this client:

2016-01-01 11:16:12.637917 osd.405 192.168.1.23:6823/30938 142 : cluster [WRN] slow request 480.184693 seconds old, received at 2016-01-01 11:08:12.452970: osd_op(client.46686461.1:10 10000006479.00000004 [read 0~2097152 [1@-1]] 0.ca710b7 read e367378) currently waiting for replay end
2016-01-01 11:16:12.637932 osd.405 192.168.1.23:6823/30938 143 : cluster [WRN] slow request 480.160153 seconds old, received at 2016-01-01 11:08:12.477509: osd_op(client.46686461.1:11 10000006479.00000004 [read 2097152~2097152 [1@-1]] 0.ca710b7 read e367378) currently waiting for replay end
2016-01-01 11:23:11.298786 mds.0 192.168.1.31:6800/19945 64 : cluster [WRN] slow request 7683.077077 seconds old, received at 2016-01-01 09:15:08.221671: client_request(client.46686461:758 readdir #1000001913d 2016-01-01 09:15:08.222194) currently acquired locks
2016-01-01 11:24:12.728794 osd.405 192.168.1.23:6823/30938 145 : cluster [WRN] slow request 960.275521 seconds old, received at 2016-01-01 11:08:12.452970: osd_op(client.46686461.1:10 10000006479.00000004 [read 0~2097152 [1@-1]] 0.ca710b7 read e367378) currently waiting for replay end
2016-01-01 11:24:12.728814 osd.405 192.168.1.23:6823/30938 146 : cluster [WRN] slow request 960.250982 seconds old, received at 2016-01-01 11:08:12.477509: osd_op(client.46686461.1:11 10000006479.00000004 [read 2097152~2097152 [1@-1]] 0.ca710b7 read e367378) currently waiting for replay end

* These seem to refer to "0.ca710b7", which I'm guessing is either pg 0.ca,
0.ca7, 0.7b, 0.7b0, 0.b7 or 0.0b7. Look for these in "ceph health detail":

ceph health detail | egrep '0\.ca|0\.7b|0\.b7|0\.0b'
pg 0.7b2 is stuck inactive since forever, current state incomplete, last acting [307,206]
pg 0.7b2 is stuck unclean since forever, current state incomplete, last acting [307,206]
pg 0.7b2 is incomplete, acting [307,206]

OK, so no "7b" or "7b0", but is "7b2" close enough? (A way to stop guessing
is sketched at the end of this message.)

* Take a look at OSDs 307 and 206. Both are online and show no errors in
their logs. Why then the "stuck"? (See the "ceph pg query" idea at the end.)

* Look at the filesystem on the other OSDs for "7b" directories. Find this:

osd 102 (defunct, offline OSD disk, appears as "DNE" in "ceph osd tree"):
drwxr-xr-x 3 root root 4096 Dec 13 12:58 0.7b_head
drwxr-xr-x 2 root root    6 Dec 13 12:43 0.7b_TEMP

osd 103:
drwxr-xr-x 3 root root 4096 Dec 18 12:04 0.7b0_head

osd 110:
drwxr-xr-x 3 root root 4096 Dec 20 09:06 0.7b_head

osd 402:
drwxr-xr-x 3 root root 4096 Jul  1  2014 0.7b_head

All of these OSDs except 102 are up and healthy.

Where do I go from here?
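
For what it's worth, here is what I'm considering trying next. The commands
below are my best reading of the docs, not something I've run yet, so please
correct me if any of this is off base.

* Stop guessing which PG "0.ca710b7" belongs to and ask the cluster directly.
Assuming the CephFS data pool here is pool 0 and is still named "data" (my
assumption, not verified), the object named in the stuck reads should map
with:

    # print the PG and up/acting set for the object in the stuck reads
    ceph osd map data 10000006479.00000004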
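
* Ask pg 0.7b2 itself why it is incomplete even though OSDs 307 and 206 are
up. As I understand it, the "recovery_state" section of the query output
should say what peering is blocked on (possibly an OSD it still wants to
probe, such as the dead osd.102):

    # dump the peering / recovery state of the incomplete PG
    ceph pg 0.7b2 query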
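
* On the host carrying osd.405, look at what the two slow reads are actually
waiting on via the admin socket; as I understand it, this lists each
in-flight op and how far it has gotten:

    # run on the node hosting osd.405
    ceph daemon osd.405 dump_ops_in_flight

Does that sound like a sane direction, or am I off in the weeds?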
Bryan