The clients' IO held up fine, and I don't see any signs of them
blocking. The writes are done inside an aio_operate() rados call (a
hedged sketch of this call pattern appears after the thread). In the
client logs, too, I don't see any record of a failed write.

ceph -s
   health HEALTH_OK
   monmap e1: 1 mons at {a=10.25.36.11:6789/0}, election epoch 2, quorum 0 a
   osdmap e21: 15 osds: 15 up, 15 in
   pgmap v3648: 3072 pgs: 3072 active+clean; 8730 bytes data, 157 GB used, 2075 GB / 2233 GB avail
   mdsmap e4: 1/1/1 up {0=a=up:active}

ceph-osd --version
ceph version 80511c84c03618a3ca078258913e3f153c2ede2a (commit:80511c84c03618a3ca078258913e3f153c2ede2a)

On Mon, Oct 29, 2012 at 6:03 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
> Interesting. I don't think the request is stalled; I think we
> completed the request but leaked a reference to the request
> structure. Do you see IO from the clients stall? What is the output
> of ceph -s? What version are you running (ceph-osd --version)?
> -Sam
>
> On Mon, Oct 29, 2012 at 10:53 AM, Ian Pye <ianpye@xxxxxxxxx> wrote:
>> Guys,
>>
>> I'm running a three-node cluster (version 0.53), and after a while of
>> running under constant write load generated by two daemons, I am
>> seeing that one request is totally blocked:
>>
>> [WRN] 1 slow requests, 1 included below; oldest blocked for > 7550.891933 secs
>> 2012-10-29 10:33:54.689563 osd.0 [WRN] slow request 7550.891933
>> seconds old, received at 2012-10-29 08:28:03.797576:
>> osd_sub_op(client.4116.0:490 0.3e
>> e3aa943e//logger/pg/data/2012-10-29/BWBCK/1351524240/head//0 [] v
>> 13'37 snapset=0=[]:[] snapc=0=[]) v7 currently started
>>
>> ceph --admin-daemon /path/to/osd.1.asok dump_ops_in_flight gives:
>>
>> { "ops": [
>>       { "description": "osd_sub_op(client.4116.0:490 0.3e
>> e3aa943e\/\/logger\/pg\/data\/2012-10-29\/BWBCK\/1351524240\/head\/\/0
>> [] v 13'37 snapset=0=[]:[] snapc=0=[])",
>>         "received_at": "2012-10-29 08:28:03.797576",
>>         "age": "8348.393528",
>>         "duration": "0.045426",
>>         "flag_point": "started",
>>         "events": [
>>             { "time": "2012-10-29 08:28:03.805648",
>>               "event": "waiting_for_osdmap"},
>>             { "time": "2012-10-29 08:28:03.806203",
>>               "event": "reached_pg"},
>>             { "time": "2012-10-29 08:28:03.806222",
>>               "event": "started"},
>>             { "time": "2012-10-29 08:28:03.806299",
>>               "event": "commit_queued_for_journal_write"},
>>             { "time": "2012-10-29 08:28:03.807905",
>>               "event": "write_thread_in_journal_buffer"},
>>             { "time": "2012-10-29 08:28:03.808154",
>>               "event": "journaled_completion_queued"},
>>             { "time": "2012-10-29 08:28:03.809422",
>>               "event": "sub_op_commit"},
>>             { "time": "2012-10-29 08:28:03.843002",
>>               "event": "sub_op_applied"}]}]}
>>
>> Restarting the OSD kills this request. Is this a bug, and is there a
>> way to stop a request without the OSD restart?
>>
>> Thanks,
>>
>> Ian
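
For reference, a minimal sketch of the aio_operate() write pattern Ian
describes, using the librados C++ API. The pool name ("logger") and
object name are lifted from the stuck request's description purely for
illustration; this shows the general call pattern, not Ian's actual
daemon code. The point is where a failed write would surface on the
client: in the completion's return value.

// compile with: g++ aio_write.cpp -lrados  (assumes librados-dev)
#include <rados/librados.hpp>

int main() {
    librados::Rados cluster;
    if (cluster.init(NULL) < 0) return 1;     // default client id
    cluster.conf_read_file(NULL);             // default ceph.conf search path
    if (cluster.connect() < 0) return 1;

    librados::IoCtx ioctx;
    if (cluster.ioctx_create("logger", ioctx) < 0) return 1;

    librados::bufferlist bl;
    bl.append("payload");

    // Batch one or more ops into a compound write operation.
    librados::ObjectWriteOperation op;
    op.write_full(bl);

    // Submit asynchronously; the osd_sub_op messages in the thread
    // above are the replication traffic such a write fans out into.
    librados::AioCompletion *c = librados::Rados::aio_create_completion();
    ioctx.aio_operate("pg/data/2012-10-29/BWBCK/1351524240", c, &op);

    c->wait_for_safe();                 // journaled on all replicas
    int r = c->get_return_value();      // a failed write shows up here
    c->release();

    ioctx.close();
    cluster.shutdown();
    return r < 0 ? 1 : 0;
}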
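
Sam's leaked-reference hypothesis fits the dump above: the op's
duration is only 0.045426 s (the work finished through sub_op_applied),
yet its age keeps growing, as if the tracking entry were never removed.
Below is a toy model of that failure mode, assuming a registry that
unregisters an op when its last reference is dropped. The names
(Tracker, TrackedOp) only loosely echo the OSD's op-tracking code; this
is an illustration, not the actual Ceph implementation.

#include <iostream>
#include <list>
#include <memory>
#include <string>

struct Tracker;

struct TrackedOp {
    Tracker *tracker;
    std::string desc;
    TrackedOp(Tracker *t, const std::string &d);
    ~TrackedOp();           // unregisters; only runs when refcount hits 0
};

struct Tracker {
    std::list<TrackedOp*> in_flight;    // what dump_ops_in_flight reads
    void dump() const {
        for (const TrackedOp *op : in_flight)
            std::cout << "in flight: " << op->desc << "\n";
    }
};

TrackedOp::TrackedOp(Tracker *t, const std::string &d)
  : tracker(t), desc(d) {
    tracker->in_flight.push_back(this); // register on arrival
}

TrackedOp::~TrackedOp() {
    tracker->in_flight.remove(this);    // unregister on last ref drop
}

int main() {
    Tracker tracker;
    auto op = std::make_shared<TrackedOp>(&tracker,
        "osd_sub_op(client.4116.0:490 0.3e ...)");

    // The op completes (sub_op_commit, sub_op_applied fire), but some
    // code path stashed an extra reference and never drops it:
    new std::shared_ptr<TrackedOp>(op); // the leak: never deleted

    op.reset();     // the normal reference is released at completion...
    tracker.dump(); // ...yet the op is still listed, and keeps being
                    // reported as slow until the OSD restarts
}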