Ric Wheeler <rwheeler <at> redhat.com> writes:

> On 10/21/2015 09:32 AM, Sage Weil wrote:
> > On Tue, 20 Oct 2015, Ric Wheeler wrote:
> >>> Now:
> >>>   1 io to write a new file
> >>>   1-2 ios to sync the fs journal (commit the inode, alloc change)
> >>>     (I see 2 journal IOs on XFS and only 1 on ext4...)
> >>>   1 io to commit the rocksdb journal (currently 3, but will drop to
> >>>     1 with xfs fix and my rocksdb change)
> >> I think that might be too pessimistic - the number of discrete IO's sent
> >> down to a spinning disk makes much less impact on performance than the
> >> number of fsync()'s, since the IO's all land in the write cache.  Some
> >> newer spinning drives have a non-volatile write cache, so even an
> >> fsync() might not end up doing the expensive data transfer to the
> >> platter.
> > True, but in XFS's case at least the file data and journal are not
> > colocated, so it's 2 seeks for the new file write+fdatasync and another
> > for the rocksdb journal commit.  Of course, with a deep queue we're
> > doing lots of these, so there'll be fewer journal commits on both
> > counts, but the lower bound on latency of a single write is still 3
> > seeks, and that bound is pretty critical when you also have network
> > round trips and replication (worst out of 2) on top.
>
> What are the performance goals we are looking for?
>
> Small, synchronous writes/second?
>
> File creates/second?
>
> I suspect that looking at things like seeks/write is probably looking at
> the wrong level of performance challenges.  Again, when you write to a
> modern drive, you write to its write cache and it decides internally
> when/how to destage to the platter.
>
> If you look at the performance of XFS with streaming workloads, it will
> tend to max out the bandwidth of the underlying storage.
>
> If we need IOP's/file writes, etc., we should be clear on what we are
> aiming at.
>
> >> It would be interesting to get the timings on the IO's you see to
> >> measure the actual impact.
> > I observed this with the journaling workload for rocksdb, but I assume
> > the journaling behavior is the same regardless of what is being
> > journaled.  For a 4KB append to a file + fdatasync, I saw ~30ms latency
> > for XFS, and blktrace showed an IO to the file and 2 IOs to the journal.
> > I believe the first one is the record for the inode update, and the
> > second is the journal 'commit' record (though I forget how I decided
> > that).  My guess is that XFS is being extremely careful about journal
> > integrity here and not writing the commit record until it knows that the
> > preceding records landed on stable storage.  For ext4, the latency was
> > ~20ms, and blktrace showed the IO to the file and then a single journal
> > IO.  When I made the rocksdb change to overwrite an existing, prewritten
> > file, the latency dropped to ~10ms on ext4, and blktrace showed a single
> > IO as expected.  (XFS still showed the 2 journal commit IOs, but Dave
> > just posted the fix for that on the XFS list today.)
>
> Normally, best practice is to use batching to avoid paying worst-case
> latency when you do a synchronous IO.  Write a batch of files or appends
> without fsync, then go back and fsync, and you will pay that latency once
> (not per file/op).
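To make that concrete, here is a minimal sketch of the batch-then-sync
pattern in plain C (illustrative only, not NewStore or rocksdb code; the
file names, batch size, and 4KB append size are made up):

/*
 * Sketch of the batching idea quoted above: issue the whole batch of
 * appends first, then go back and fdatasync them.  By the time the first
 * fdatasync() runs, every data write has already been issued, so the
 * filesystem journal commits can coalesce and (per the advice above) the
 * worst-case sync latency is paid roughly once per batch, not per append.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NFILES 16

int main(void)
{
    int fds[NFILES];
    char buf[4096];
    char name[64];

    memset(buf, 'x', sizeof(buf));

    /* Phase 1: write the whole batch, no syncing yet. */
    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "batch-%d.dat", i);
        fds[i] = open(name, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fds[i] < 0 ||
            write(fds[i], buf, sizeof(buf)) != (ssize_t)sizeof(buf))
            return 1;
    }

    /* Phase 2: now go back and sync everything that was written. */
    for (int i = 0; i < NFILES; i++) {
        if (fdatasync(fds[i]) != 0)
            return 1;
        close(fds[i]);
    }
    return 0;
}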
If filesystems supported ordered writes you wouldn't need to fsync at all.
Just spit out a stream of writes and declare that batch N must be written
before batch N+1.

(Note that this is not identical to "write barriers", which imposed the same
latencies as fsync by blocking all I/Os at a barrier boundary.  Ordered
writes may be freely interleaved with un-ordered writes, so normal I/O
traffic can proceed unhindered.  Their ordering is only enforced wrt other
ordered writes.)

A bit of a shame that Linux's SCSI drivers support Ordering attributes but
nothing above that layer makes use of them.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/