Delays while waiting_for_osdmap according to dump_historic_ops

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I recently found out about the "ceph --admin-daemon 
/var/run/ceph/ceph-osd.<id>.asok dump_historic_ops" command, and noticed 
something unexpected in the output on my cluster, after checking 
numerous output samples...

It looks to me like "normal" write ops on my cluster spend roughly:

<1ms between "received_at" and "waiting_for_osdmap"
<1ms between "waiting_for_osdmap" and "reached_pg"
<15ms between "reached_pg" and "commit_sent"
<15ms between "commit_sent" and "done"

For reference, this is a small (3-host) all-SSD cluster, with monitors 
co-located with OSDs. Each host has: 1 SSD for the OS, 1 SSD for the 
journal, and 1 SSD for the OSD + monitor data (I initially had the 
monitor data on the same drive as the OS, but encountered performance 
problems - which have since been allieviated by moving the monitor data 
to the same drives as the OSDs. Networking is infiniband (8 Gbps 
dedicated point-to-point link between each pair of hosts). I'm running 
v0.80.5. And the OSDs use XFS.

Anyway, as this command intentionally shows the worst few recent IOs, I 
only rarely see examples that match the above "norm". Rather, the 
typical outliers that it highlights are usually write IOs with 
~100-300ms latency, where the extra latency exists purely between the 
"received_at" and "reached_pg" timestamps, and mostly in the 
"waiting_for_osdmap" step. Also it looks like these slow IOs come in 
batches. Every write IO arriving within the same ~1 second period will 
suffer from these strangely slow initial two steps, with the additional 
latency being almost identical for each one within the same batch. After 
which things return to normal again in that those steps take <1ms. So 
compared to the above "norm", these look more like:

~50ms between "received_at" and "waiting_for_osdmap"
~150ms between "waiting_for_osdmap" and "reached_pg"
<15ms between "reached_pg" and "commit_sent"
<15ms between "commit_sent" and "done"

This seems unexpected to me. I don't see why those initial steps in the 
IO should ever take such a long time to complete. Where should I be 
looking next to track down the cause? I'm guessing that 
"waiting_for_osdmap" involves OSD<->Mon communication, and so perhaps 
indicates poor performance of the Mons. But for there to be any 
non-negligible delay between "received_at" and "waiting_for_osdmap" 
makes no sense to me at all.

Here are a couple of examples...

This is a rare example (from the perspective of the dump_historic_ops 
output) of what I believe to be a "more normal" write IO:

         { "description": "osd_op(client.10697.0:560068 
rbd_data.1a1a2ae8944a.0000000000000722 [] 2.1ecc1a2a ack+ondisk+write 
e595)",
           "received_at": "2014-09-07 23:55:58.771015",
           "age": "9.101564",
           "duration": "0.010586",
           "type_data": [
                 "commit sent; apply or cleanup",
                 { "client": "client.10697",
                   "tid": 560068},
                 [
                     { "time": "2014-09-07 23:55:58.771391",
                       "event": "waiting_for_osdmap"},
                     { "time": "2014-09-07 23:55:58.771895",
                       "event": "reached_pg"},
                     { "time": "2014-09-07 23:55:58.771978",
                       "event": "started"},
                     { "time": "2014-09-07 23:55:58.771994",
                       "event": "started"},
                     { "time": "2014-09-07 23:55:58.772041",
                       "event": "waiting for subops from 1,2"},
                     { "time": "2014-09-07 23:55:58.772210",
                       "event": "commit_queued_for_journal_write"},
                     { "time": "2014-09-07 23:55:58.775767",
                       "event": "write_thread_in_journal_buffer"},
                     { "time": "2014-09-07 23:55:58.777814",
                       "event": "journaled_completion_queued"},
                     { "time": "2014-09-07 23:55:58.777837",
                       "event": "op_commit"},
                     { "time": "2014-09-07 23:55:58.778504",
                       "event": "op_applied"},
                     { "time": "2014-09-07 23:55:58.779218",
                       "event": "sub_op_commit_rec"},
                     { "time": "2014-09-07 23:55:58.781480",
                       "event": "sub_op_commit_rec"},
                     { "time": "2014-09-07 23:55:58.781573",
                       "event": "commit_sent"},
                     { "time": "2014-09-07 23:55:58.781601",
                       "event": "done"}]]},

Whereas this is more typical of what I normally see in the output from 
"dump_historic_ops":

         { "description": "osd_op(client.10737.0:877375 
rbd_data.25212ae8944a.00000000000001c0 [] 2.5493ce40 ack+ondisk+write 
e595)",
           "received_at": "2014-09-07 23:57:16.444904",
           "age": "84.769560",
           "duration": "0.269918",
           "type_data": [
                 "commit sent; apply or cleanup",
                 { "client": "client.10737",
                   "tid": 877375},
                 [
                     { "time": "2014-09-07 23:57:16.520486",
                       "event": "waiting_for_osdmap"},
                     { "time": "2014-09-07 23:57:16.681982",
                       "event": "reached_pg"},
                     { "time": "2014-09-07 23:57:16.682030",
                       "event": "started"},
                     { "time": "2014-09-07 23:57:16.682049",
                       "event": "started"},
                     { "time": "2014-09-07 23:57:16.682119",
                       "event": "waiting for subops from 1,2"},
                     { "time": "2014-09-07 23:57:16.682292",
                       "event": "commit_queued_for_journal_write"},
                     { "time": "2014-09-07 23:57:16.682370",
                       "event": "write_thread_in_journal_buffer"},
                     { "time": "2014-09-07 23:57:16.690089",
                       "event": "journaled_completion_queued"},
                     { "time": "2014-09-07 23:57:16.691201",
                       "event": "op_commit"},
                     { "time": "2014-09-07 23:57:16.692730",
                       "event": "sub_op_commit_rec"},
                     { "time": "2014-09-07 23:57:16.700063",
                       "event": "sub_op_commit_rec"},
                     { "time": "2014-09-07 23:57:16.700105",
                       "event": "commit_sent"},
                     { "time": "2014-09-07 23:57:16.714792",
                       "event": "op_applied"},
                     { "time": "2014-09-07 23:57:16.714822",
                       "event": "done"}]]}]}

Alex


[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux