O_DIRECT is _not_ a flag for synchronous blocking IO. O_DIRECT only hints to the kernel that it need not cache/buffer the data. The kernel is still free to buffer and cache it, and it does buffer it. It also does _not_ flush O_DIRECT writes to disk; it only makes a best effort to send them to the drives ASAP (where they can sit in the drive cache). A completed O_DIRECT request is not guaranteed to be on disk at all.

In effect, you can issue parallel O_DIRECT requests and they will scale with queue depth, but ordering is not guaranteed and it is not crash safe.

btw "innodb_flush_log_at_trx_commit = 5" does not do what you think it does. Its only valid values are:

0 - flush only periodically, not crash consistent (most data should be there somewhere but it does require a lengthy manual recovery)
1 - flush after every transaction (not every write as you illustrated), ACID compliant
2 - flush periodically, database *should* be crash consistent but you can lose some transactions

No other value does anything:

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 2     |
+--------------------------------+-------+
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1     |
+--------------------------------+-------+
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 5;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 2     |
+--------------------------------+-------+
1 row in set (0.00 sec)

On Ceph, you have three options: live with a max of ~200 (serializable) transactions/sec; settle for innodb_flush_log_at_trx_commit = 2 and lose the tail of transactions; or put the InnoDB log files on a separate device (DRBD across several nodes, physical SSD...) which will survive a crash.

Jan
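
P.S. To make the O_DIRECT point concrete, here is a minimal sketch in Python of what a durable write has to look like: the O_DIRECT write on its own is not enough, and an explicit fsync has to follow it. The function name, the 4096-byte block size and the fallback for filesystems that refuse O_DIRECT are my own assumptions for illustration; this is not how InnoDB itself is implemented.

```python
import mmap
import os

BLOCK = 4096  # assumed block size; O_DIRECT wants aligned buffers and lengths


def durable_direct_write(path, payload):
    """Write one zero-padded block, then fsync so it is durable on return."""
    if len(payload) > BLOCK:
        raise ValueError("payload must fit into a single block")

    # An anonymous mmap is page-aligned, which satisfies O_DIRECT alignment.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(payload)  # the rest of the block stays zero-filled

    # O_DIRECT is Linux-specific; fall back to 0 where the flag is missing.
    direct = getattr(os, "O_DIRECT", 0)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | direct, 0o600)
    except OSError:
        # Some filesystems reject O_DIRECT at open time; retry without it.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

    try:
        written = os.write(fd, buf)
        # Completion of the write says nothing about durability.
        # Only the explicit flush below asks for the data to reach stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)
        buf.close()
    return written
```

Each call pays for one full flush round trip before it returns, which is the price of the "flush after every transaction" behaviour of value 1 described above. Dropping the os.fsync() call gives you the faster but not crash-safe behaviour.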
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com