Re: O_DIRECT logic in CephFS, ceph-fuse / Performance

Hi Kasper,

In order to do what you want here, we need to make O_DIRECT-initiated 
requests on the client get a flag that tells the OSD to also bypass its 
cache.  That doesn't happen right now.

Assuming we do add that flag, we can either make the IO actually use 
O_DIRECT, issue an fadvise after the call, or do any number of other 
things, depending on what makes the most sense for that particular 
backend implementation.  For FileStore, it seems pretty likely that 
O_DIRECT is the right thing.  It is somewhat complicated by the FDCache, 
which avoids opening a new file descriptor for each IO, so it is 
non-trivial, but doable.
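
As a rough sketch of those two options (generic POSIX code, not actual 
FileStore code; the path, sizes and 4096 alignment are placeholders, and 
error handling is omitted):

#define _GNU_SOURCE            /* for O_DIRECT and sync_file_range() */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Option A: write the object file with O_DIRECT.  Buffer, offset and
 * length must be aligned (4096 assumed here). */
static void write_odirect(const char *path, off_t off, size_t len)
{
    void *buf;
    posix_memalign(&buf, 4096, len);
    memset(buf, 0, len);
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    pwrite(fd, buf, len, off);
    close(fd);
    free(buf);
}

/* Option B: normal buffered write, then force writeback and drop the
 * cached pages.  POSIX_FADV_DONTNEED only discards clean pages, so the
 * range has to be written back first. */
static void write_then_drop(const char *path, const void *src,
                            off_t off, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    pwrite(fd, src, len, off);
    sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE |
                                  SYNC_FILE_RANGE_WRITE |
                                  SYNC_FILE_RANGE_WAIT_AFTER);
    posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
    close(fd);
}

int main(void)
{
    static char data[4096];
    write_odirect("/tmp/obj", 0, sizeof(data));      /* placeholder path */
    write_then_drop("/tmp/obj", data, 0, sizeof(data));
    return 0;
}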

There's nothing preventing us from identifying now what these write 
hints might be (see the sketch after the list).  Other possibilities 
that have come up:

- the following write should be done O_DIRECT.  or perhaps more precisely, 
  the write should not be cached (e.g., because the client is caching it, 
  or doesn't expect to ever read it)
- the following write is on data that is expected to be immutable
- the following write is on data that is expected to have a short/long 
  lifetime.

etc.
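
To make that concrete, the hint could travel as a small bitmask on each 
write op.  The names below are just an illustration of the list above, 
not an existing Ceph interface:

/* Illustrative only -- not an existing Ceph interface; just a sketch of
 * what a per-write hint bitmask could carry. */
enum write_hint_flags {
    WRITE_HINT_NOCACHE     = 1 << 0,  /* don't cache on the OSD: the client
                                         caches it or will never read it back */
    WRITE_HINT_IMMUTABLE   = 1 << 1,  /* data is expected to be immutable */
    WRITE_HINT_SHORT_LIVED = 1 << 2,  /* data expected to be deleted or
                                         overwritten soon */
    WRITE_HINT_LONG_LIVED  = 1 << 3,  /* data expected to live a long time */
};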

sage



On Wed, 12 Mar 2014, Milosz Tanski wrote:

> Kasper,
> 
> I only know about the kernel cephfs client... but there are special
> code paths for O_DIRECT reads and writes.  Both bypass the page cache
> and send commands directly to the OSDs for the objects; in the write
> case the object is write-locked with the MDS.  So unlike NFS, this
> seems to do the right thing.
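> 
> As a minimal illustration of that client-side path (the path and the
> 4096 alignment are just placeholders; error handling omitted):
> 
> #define _GNU_SOURCE                        /* for O_DIRECT */
> #include <fcntl.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> int main(void)
> {
>     void *buf;
>     posix_memalign(&buf, 4096, 4096);      /* O_DIRECT wants aligned I/O */
> 
>     int fd = open("/mnt/cephfs/testfile", O_RDONLY | O_DIRECT);
>     ssize_t n = pread(fd, buf, 4096, 0);   /* bypasses the page cache */
> 
>     close(fd);
>     free(buf);
>     return n == 4096 ? 0 : 1;
> }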
> 
> I'm guessing that when you say XFS on rbd with O_DIRECT you mean the
> files are opened O_DIRECT on the filesystem.  That doesn't take into
> account the readahead the kernel does at the block device layer, which
> is independent of file-level readahead (it happens at a much lower
> layer).  You can find out what that is set to with the
> "blockdev --getra /dev/XXX" command.
> 
> Cheers,
> - Milosz
> 
> On Wed, Mar 12, 2014 at 4:27 PM, Kasper Dieter
> <dieter.kasper@xxxxxxxxxxxxxx> wrote:
> > The 'man 2 open' states
> > ---snip---
> > The behaviour of O_DIRECT with NFS will differ from local file systems.  (...)
> > The  NFS  protocol does not support passing the flag to the server,
> > so O_DIRECT I/O will bypass the page cache only on the client;
> > the server may still cache the I/O.
> > ---snip---
> >
> > Q1: How do CephFS and ceph-fuse handle the O_DIRECT flag?
> >         (Like NFS, Ceph is a network FS, too, with a client/server split.)
> >
> >
> > Some test cases with O_DIRECT & io_submit() on 4K I/Os (65536, 262144, 1048576 and 4194304 are the different object sizes):
> >
> > out.rand.fuse.ssd2-r2-1-1-1048576:  Max. throughput read         : 7.22768MB/s
> > out.rand.fuse.ssd2-r2-1-1-262144:  Max. throughput read         : 7.18318MB/s
> > out.rand.fuse.ssd2-r2-1-1-65536:  Max. throughput read         : 7.25543MB/s
> > out.sequ.fuse.ssd2-r2-1-1-1048576:  Max. throughput read         : 118.092MB/s
> > out.sequ.fuse.ssd2-r2-1-1-262144:  Max. throughput read         : 111.073MB/s
> > out.sequ.fuse.ssd2-r2-1-1-65536:  Max. throughput read         : 95.4332MB/s
> >
> > out.rand.cephfs.ssd2-r2-1-1-1048576:  Max. throughput read         : 11.2144MB/s
> > out.rand.cephfs.ssd2-r2-1-1-262144:  Max. throughput read         : 11.0371MB/s
> > out.rand.cephfs.ssd2-r2-1-1-65536:  Max. throughput read         : 11.017MB/s
> > out.sequ.cephfs.ssd2-r2-1-1-1048576:  Max. throughput read         : 11.2299MB/s
> > out.sequ.cephfs.ssd2-r2-1-1-262144:  Max. throughput read         : 10.9488MB/s
> > out.sequ.cephfs.ssd2-r2-1-1-65536:  Max. throughput read         : 10.5669MB/s
> >
> > out.rand.t3-ssd2-v2-1-1048576-20:  Max. throughput read         : 81.9598MB/s
> > out.rand.t3-ssd2-v2-1-262144-18:  Max. throughput read         : 140.45MB/s
> > out.rand.t3-ssd2-v2-1-4194304-22:  Max. throughput read         : 55.8478MB/s
> > out.rand.t3-ssd2-v2-1-65536-16:  Max. throughput read         : 158.441MB/s
> > out.sequ.t3-ssd2-v2-1-1048576-20:  Max. throughput read         : 74.3693MB/s
> > out.sequ.t3-ssd2-v2-1-262144-18:  Max. throughput read         : 140.444MB/s
> > out.sequ.t3-ssd2-v2-1-4194304-22:  Max. throughput read         : 42.7327MB/s
> > out.sequ.t3-ssd2-v2-1-65536-16:  Max. throughput read         : 165.434MB/s
> >
> > t3 = XFS on rbd.ko
> >
> > CephFS and ceph-fuse    seem to use no caching at all on random reads.
> > Ceph-fuse               seems to use some caching on sequential reads.
> > rbd.ko                  seems to use caching on all reads (because only XFS knows about O_DIRECT ;-))
> >
> >
> > Q2: How can the read-caching logic be enabled for ceph-fuse / CephFS ?
> >
> > BTW I'm aware of the "O_DIRECT (...) designed  by  a  deranged monkey" text in the open(2) man page ;-)
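> >
> > For reference, the reads above were issued roughly like this minimal
> > libaio sketch (not the exact test code; the path, queue depth and
> > sizes are placeholders, error handling omitted; link with -laio):
> >
> > #define _GNU_SOURCE                      /* for O_DIRECT */
> > #include <fcntl.h>
> > #include <libaio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > 
> > int main(void)
> > {
> >     void *buf;
> >     posix_memalign(&buf, 4096, 4096);    /* aligned buffer for O_DIRECT */
> >     int fd = open("/mnt/cephfs/testfile", O_RDONLY | O_DIRECT);
> > 
> >     io_context_t ctx = 0;
> >     io_setup(32, &ctx);                  /* AIO context, queue depth 32 */
> > 
> >     struct iocb cb, *cbs[1] = { &cb };
> >     io_prep_pread(&cb, fd, buf, 4096, 0);   /* one 4K read at offset 0 */
> >     io_submit(ctx, 1, cbs);
> > 
> >     struct io_event ev;
> >     io_getevents(ctx, 1, 1, &ev, NULL);  /* wait for completion */
> > 
> >     io_destroy(ctx);
> >     close(fd);
> >     free(buf);
> >     return 0;
> > }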
> >
> >
> > -Dieter
> 
> 
> 
> -- 
> Milosz Tanski
> CTO
> 10 East 53rd Street, 37th floor
> New York, NY 10022
> 
> p: 646-253-9055
> e: milosz@xxxxxxxxx