Re: Guest sync write iops so poor.

O_DIRECT is _not_ a flag for synchronous blocking IO.
O_DIRECT only hints to the kernel that it need not cache/buffer the data; the kernel remains free to buffer it, and in practice it does.
Nor does the kernel flush O_DIRECT writes to disk; it merely makes a best effort to send them to the drives ASAP (where they can still sit in the drive's volatile cache).
Completion of an O_DIRECT request is no guarantee that the data is on disk at all.

In effect, you can issue parallel O_DIRECT requests and they will scale with queue depth, but neither their ordering nor their crash safety is guaranteed.
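You can see both behaviours with fio itself. A minimal sketch, assuming a scratch device at /dev/rbd0 (paths and sizes are assumptions, adjust to taste):

# O_DIRECT only: completions scale with queue depth, no durability guarantee
fio --name=direct --filename=/dev/rbd0 --rw=randwrite --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=64 --runtime=30 --time_based

# fsync after every write: each IO pays the full flush round trip
fio --name=durable --filename=/dev/rbd0 --rw=randwrite --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=1 --fsync=1 --runtime=30 --time_based

The first run reports the queue-depth-scaled figure; the second shows what a crash-safe write actually costs.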


btw "innodb_flush_log_at_trx_commit = 5" does not do what you think it does. Its only valid values are:
0 - flush the log only periodically; not crash consistent (most data should still be there somewhere, but it requires a lengthy manual recovery)
1 - flush after every transaction (not after every write, as you illustrated); ACID compliant
2 - write at every commit but flush only periodically; the database *should* be crash consistent, but you can lose the tail of transactions

no other value does anything:

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 2     |
+--------------------------------+-------+
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1     |
+--------------------------------+-------+
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 5;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 2     |
+--------------------------------+-------+
1 row in set (0.00 sec)

On Ceph, you either need to live with a maximum of ~200 (serializable) transactions/sec, settle for innodb_flush_log_at_trx_commit = 2 and accept losing the tail of transactions, or put the InnoDB log files on a separate device that will survive a crash (DRBD across several nodes, a physical SSD, ...).
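If you take the separate-device route, a minimal my.cnf sketch (the mount point is an assumption):

[mysqld]
# put the redo log on the low-latency device that survives a crash
innodb_log_group_home_dir = /mnt/ssd-log
# with a fast log device, full durability per commit becomes affordable
innodb_flush_log_at_trx_commit = 1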


Jan

On 26 Feb 2016, at 10:49, Huan Zhang <huan.zhang.jn@xxxxxxxxx> wrote:

fio on /dev/rbd0 with sync=1 has no problem.
I can't find any 'sync cache' code in the Linux rbd block driver or the radosgw API;
sync cache seems to be a concept only in librbd (for the rbd cache).
Just my concern.

2016-02-26 17:30 GMT+08:00 Huan Zhang <huan.zhang.jn@xxxxxxxxx>:
Hi Nick,
A DB's IO pattern depends on its config; take MySQL, for example.
With innodb_flush_log_at_trx_commit = 1, MySQL will sync after each transaction, like:
write
sync
write
sync
...

innodb_flush_log_at_trx_commit = 5,
write
write
write
write
write
sync

innodb_flush_log_at_trx_commit = 0,
write
write
...
one second later.
sync.


This may not be entirely accurate, but it's more or less the pattern.

We tested MySQL TPS with innodb_flush_log_at_trx_commit = 1 and got very poor performance, even though we can reach very high O_DIRECT randwrite IOPS with fio.




2016-02-26 16:59 GMT+08:00 Nick Fisk <nick@xxxxxxxxxx>:
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Huan Zhang
> Sent: 26 February 2016 06:50
> To: Jason Dillaman <dillaman@xxxxxxxxxx>
> Cc: josh durgin <josh.durgin@xxxxxxxxxxx>; Nick Fisk <nick@xxxxxxxxxx>;
> ceph-users <ceph-users@xxxxxxxx>
> Subject: Re: Guest sync write iops so poor.
>
> rbd engine with fsync=1 seems stuck.
> Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> 1244d:10h:39m:18s]
>
> But fio on /dev/rbd0 with sync=1 direct=1 ioengine=libaio iodepth=64 gets very
> high iops, ~35K, similar to plain direct writes.
>
> I'm confused by that result. IMHO, Ceph could just ignore the sync cache
> command since it always uses sync writes to the journal, right?

Even if the data is not sync'd to the data storage part of the OSD, it still has to be written to the journal, and this is where the performance limit lies.

The very nature of SDS means you will never achieve the same latency as with a local disk: even if the software side introduced no extra latency, network latency alone would severely limit your sync performance (1 ms of round-trip time caps a queue-depth-1 sync workload at ~1000 writes/sec, for example).

Do you know the IO pattern the DBs generate? I know you can switch most DBs to flush with O_DIRECT instead of sync; it might be that this helps in your case.
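For MySQL, the knob presumably meant here is innodb_flush_method; a one-line my.cnf sketch:

[mysqld]
# open InnoDB data files with O_DIRECT rather than buffered writes + fsync
innodb_flush_method = O_DIRECT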

Also check out the tech talk from last month about high performance databases on Ceph. The presenter gave the impression that, at least in their case, not every write was a sync IO, so your results could matter less than you think.

Also, please search the lists and past presentations on reducing write latency. There are a few things you can do, like disabling logging and setting some kernel parameters to stop the CPUs entering sleep states or reducing their frequency. One thing I witnessed: if the Ceph cluster is only running at low queue depths, and hence generating low CPU load, all the cores throttle down to their lowest speeds, which really hurts latency.
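A common way to pin the clocks (a sketch, assuming the standard cpufreq sysfs interface is present):

# force the performance governor on every core
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# optionally also limit deep C-states on the kernel command line:
#   intel_idle.max_cstate=1 processor.max_cstate=1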

>
> Why do we get such bad sync iops? How does Ceph handle it?
> I'd very much appreciate your reply!
>
> 2016-02-25 22:44 GMT+08:00 Jason Dillaman <dillaman@xxxxxxxxxx>:
> > 35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't
> > actually work. Or it's not touching the same object (but I wonder whether
> > write ordering is preserved at that rate?).
>
> The fio rbd engine does not support "sync=1"; however, it should support
> "fsync=1" to accomplish roughly the same effect.
>
> Jason
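(For reference, a hypothetical rbd-engine job using fsync=1; the pool, image and client names are assumptions:

fio --name=rbdsync --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
    --rw=randwrite --bs=4k --iodepth=1 --fsync=1 --runtime=30 --time_based
)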




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
