On Tue, 24 Sep 2013, Dan van der Ster wrote:
> Hi,
> Yes, the journal is a file on the same XFS partition as the data.
>
> I have some results from a rados bench to a single-copy pool on the JBOD disks.
>
> [root@p05151113777233 ~]# rados bench -p test 30 write -t 1 -b 5
> ...
> Total time run:         30.027296
> Total writes made:      612
> Write size:             5
> Bandwidth (MB/sec):     0.000
>
> Stddev Bandwidth:       1.92864e-05
> Max bandwidth (MB/sec): 0.000119209
> Min bandwidth (MB/sec): 0
> Average Latency:        0.0490627
> Stddev Latency:         0.0197453
> Max latency:            0.136101
> Min latency:            0.003991
>
> The osd log and dump of historic ops are here:
> http://box.cern.ch/owncloud/public.php?service=files&t=a53c229bab2179c9e1a03899a859129f
>
> The logs are quite verbose for me, but the journal do_write latencies
> certainly stand out:
>
> 2013-09-24 10:28:52.853011 7f03b78dd700 20 journal do_write latency 0.044686
> 2013-09-24 10:28:56.610670 7f03b78dd700 20 journal do_write latency 0.025085
> 2013-09-24 10:29:09.273017 7f03b78dd700 20 journal do_write latency 0.031082
> 2013-09-24 10:29:22.703697 7f03b78dd700 20 journal do_write latency 0.045090
>
> If you confirm that that really is the journal write+sync latency (and
> not including the final commit to disk in that time), then I'll find
> some new boxes where I can try out the direct IO to a partitioned
> journal.

Yes.. this is write(2) and then fdatasync(2) to the journal file only.
With a partition (well, raw block device) you will get direct-io and aio
(several in-flight requests at once), so these latencies should go down
quite a bit.  On a standalone disk with lots of writes you should be able
to see the rotational timing (either sub-ms latencies or ~6-9ms or
whatever).  For a partition, you're competing with the fs, so it will be
noisier than that.

sage
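For anyone following along, here is a minimal C sketch (not part of the
original mails) of the two journal write paths described above: a buffered
write(2) followed by fdatasync(2) against a journal file, versus an
O_DIRECT write straight to a raw partition. The file path, device name and
4 KB block size are made-up placeholders.

/*
 * Sketch only: contrasts a file-backed journal append (write + fdatasync)
 * with a block-device journal append (O_DIRECT). Paths are hypothetical.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096                    /* assumed journal block size */

void buffered_journal_write(const char *path, const void *buf, size_t len)
{
    /* File-backed journal: the write lands in the page cache, and the
     * fdatasync() is what forces it to stable storage. */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); exit(1); }
    if (write(fd, buf, len) != (ssize_t)len) { perror("write"); exit(1); }
    if (fdatasync(fd) < 0) { perror("fdatasync"); exit(1); }
    close(fd);
}

void direct_journal_write(const char *dev, const void *buf, size_t len)
{
    /* Block-device journal: O_DIRECT bypasses the page cache, so the
     * write() itself is issued to the device; the real journaler also
     * uses aio to keep several such writes in flight. */
    int fd = open(dev, O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); exit(1); }
    if (write(fd, buf, len) != (ssize_t)len) { perror("write"); exit(1); }
    close(fd);
}

int main(void)
{
    void *buf;
    /* O_DIRECT requires an aligned buffer and a block-multiple length. */
    if (posix_memalign(&buf, BLOCK, BLOCK)) { perror("posix_memalign"); exit(1); }
    memset(buf, 0xab, BLOCK);

    buffered_journal_write("/tmp/journal-test", buf, BLOCK);  /* hypothetical path */
    /* direct_journal_write("/dev/sdX1", buf, BLOCK); */      /* hypothetical device */

    free(buf);
    return 0;
}

In the first variant the flush cost shows up in the fdatasync(); in the
second the write() itself pays the device latency, and with aio several
such writes can be in flight at once, which is the reason given above for
expecting the latencies to drop with a partitioned journal.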
>
> Cheers, Dan
>
>
> On Mon, Sep 23, 2013 at 11:40 PM, Andreas Joachim Peters
> <Andreas.Joachim.Peters@xxxxxxx> wrote:
> > Here is the output for one write_full on the in-memory OSD (the others
> > look more or less the same, e.g. most time is waiting for
> > op_applied/op_commit):
> >
> > { "description": "osd_op(client.1764910.0:1 eos-root--0 [writefull 0~5] 15.fef59585 e3818)",
> >   "rmw_flags": 4,
> >   "received_at": "2013-09-23 23:33:16.536547",
> >   "age": "3.698191",
> >   "duration": "0.002225",
> >   "flag_point": "commit sent; apply or cleanup",
> >   "client_info": { "client": "client.1764910",
> >       "tid": 1},
> >   "events": [
> >     { "time": "2013-09-23 23:33:16.536706", "event": "waiting_for_osdmap"},
> >     { "time": "2013-09-23 23:33:16.536807", "event": "reached_pg"},
> >     { "time": "2013-09-23 23:33:16.536915", "event": "started"},
> >     { "time": "2013-09-23 23:33:16.536936", "event": "started"},
> >     { "time": "2013-09-23 23:33:16.537029", "event": "waiting for subops from [1056,1057]"},
> >     { "time": "2013-09-23 23:33:16.537110", "event": "commit_queued_for_journal_write"},
> >     { "time": "2013-09-23 23:33:16.537158", "event": "write_thread_in_journal_buffer"},
> >     { "time": "2013-09-23 23:33:16.537242", "event": "journaled_completion_queued"},
> >     { "time": "2013-09-23 23:33:16.537269", "event": "op_commit"},
> >     { "time": "2013-09-23 23:33:16.538547", "event": "sub_op_commit_rec"},
> >     { "time": "2013-09-23 23:33:16.538573", "event": "op_applied"},
> >     { "time": "2013-09-23 23:33:16.538715", "event": "sub_op_commit_rec"},
> >     { "time": "2013-09-23 23:33:16.538754", "event": "commit_sent"},
> >     { "time": "2013-09-23 23:33:16.538772", "event": "done"}]},
> >
> > We should probably look at the same output for the JBOD configuration ...
> >
> > Cheers Andreas.
> >
> > ________________________________________
> > From: Mark Nelson [mark.nelson@xxxxxxxxxxx]
> > Sent: 23 September 2013 18:03
> > To: Andreas Joachim Peters
> > Cc: Dan van der Ster; Sage Weil; ceph-devel@xxxxxxxxxxxxxxx
> > Subject: Re: Object Write Latency
> >
> > On 09/23/2013 10:38 AM, Andreas Joachim Peters wrote:
> >> We deployed 3 OSDs with an EXT4 using RapidDisk in-memory.
> >>
> >> The FS does 140k/s append+sync and the latency is now:
> >>
> >> ~1 ms for few-byte objects with a single replica
> >> ~2 ms for few-byte objects with three replicas (instead of 65-80 ms)
> >
> > Interesting!  If you look at the slowest operations in the ceph admin
> > socket now with dump_historic_ops, where are those operations spending
> > their time?
> >
> >>
> >> This probably gives the baseline of the best you can do with the current
> >> implementation.
> >>
> >> ==> the 80 ms are probably just a 'feature' of the hardware (JBOD
> >> disks/controller), and we might try to find some tuning parameters to
> >> improve the latency slightly.
> >
> > Hardware definitely plays a huge part in terms of Ceph performance.  You
> > can run Ceph on just about anything, but it's surprising how differently
> > two roughly similar systems can perform.
> >
> >>
> >> Could you just explain how the async API functions (is_complete, is_safe)
> >> map to the three states:
> >>
> >> 1) object is transferred from the client to all OSDs and is present in memory there
> >> 2) object is written to the OSD journal
> >> 3) object is committed from the OSD journal to the OSD filesystem
> >>
> >> Is it correct that the object is visible to clients only when 3) has happened?
> >
> > Yes, afaik.
> >
> >>
> >> Thanks for your help,
> >> Andreas.
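To make the is_complete / is_safe question concrete, here is a small
librados sketch (again not from the thread; the pool name "test", the
object name "latency-probe" and the default ceph.conf location are
assumptions). As I read the librados semantics of this era, is_complete
corresponds to the ack (the write is applied in memory on the OSDs,
roughly state 1 above), and is_safe to the commit of the write to the
journal on all replicas (roughly state 2); the later flush from journal to
the OSD filesystem (state 3) is not separately exposed to the client.

/*
 * Sketch only: one async write_full, then wait on the two completion
 * states that librados exposes. Pool/object names are hypothetical.
 */
#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t comp;
    const char data[] = "hello";

    /* Connect using the default client id and ceph.conf search path. */
    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "cannot reach the cluster\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "test", &io) < 0) {
        fprintf(stderr, "no such pool\n");
        rados_shutdown(cluster);
        return 1;
    }

    /* No callbacks registered; we just wait on the completion states. */
    rados_aio_create_completion(NULL, NULL, NULL, &comp);
    rados_aio_write_full(io, "latency-probe", comp, data, strlen(data));

    /* Ack: the write has been applied in memory on the OSDs. */
    rados_aio_wait_for_complete(comp);
    printf("is_complete (ack):   %d\n", rados_aio_is_complete(comp));

    /* Commit: the write has been journaled on all replicas. */
    rados_aio_wait_for_safe(comp);
    printf("is_safe   (commit):  %d\n", rados_aio_is_safe(comp));

    rados_aio_release(comp);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}

Wrapping the write and the wait_for_safe in a timer, once against the JBOD
pool and once against the in-memory OSDs, gives the client-side view of
the latencies being compared in this thread.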