Re: Cache Tier Flush = immediate base tier journal sync?

On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:

> Nothing here particularly surprises me. I don't remember all the
> details of the filestore's rate limiting off the top of my head, but
> it goes to great lengths to try and avoid letting the journal get too
> far ahead of the backing store. Disabling the filestore flusher and
> increasing the sync intervals without also increasing the
> filestore_wbthrottle_* limits is not going to work well for you.
> -Greg
> 
While very true and what I recalled from earlier mails (the backing
store being kicked off early), I think having every last configuration
parameter documented in a way that doesn't reduce people to guesswork
would be very helpful.
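
Presumably what Greg means is something along these lines; the values
below are pure guesswork on my part (simply scaled well above what I
believe the defaults to be), precisely because the documentation leaves
one guessing:

[osd]
filestore max sync interval = 30
filestore min sync interval = 20
# Raise the writeback throttle so it doesn't start flushing long before
# the sync interval is reached. Values are illustrative guesses only.
filestore_wbthrottle_xfs_bytes_start_flusher = 419430400
filestore_wbthrottle_xfs_bytes_hard_limit = 4194304000
filestore_wbthrottle_xfs_ios_start_flusher = 5000
filestore_wbthrottle_xfs_ios_hard_limit = 50000
filestore_wbthrottle_xfs_inodes_start_flusher = 5000
filestore_wbthrottle_xfs_inodes_hard_limit = 50000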

For example "filestore_wbthrottle_xfs_inodes_start_flusher" which defaults
to 500. 
Assuming that this means to start flushing once 500 inodes have
accumulated, how would Ceph even know how many inodes are needed for the
data present?

Lastly, these parameters come in xfs and btrfs incarnations, but there
is no ext4 one.
Do the xfs parameters also apply to ext4?
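
In the meantime the live values can at least be pulled from a running
OSD via the admin socket, which also shows which incarnations exist.
Something like this, assuming osd.0 and the default socket path:

ceph daemon osd.0 config show | grep filestore_wbthrottle
# or, via the socket path directly:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep filestore_wbthrottle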

Christian

> On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Gregory Farnum
> >> Sent: 16 March 2015 17:33
> >> To: Nick Fisk
> >> Cc: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re:  Cache Tier Flush = immediate base tier
> >> journal sync?
> >>
> >> On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >> >
> >> > I'm not sure if it's something I'm doing wrong or just an oddity,
> >> > but when my cache tier flushes dirty blocks out to the base tier,
> >> > the writes seem to hit the OSDs straight away instead of coalescing
> >> > in the journals. Is this correct?
> >> >
> >> > For example, if I create an RBD on a standard 3-way replica pool
> >> > and run fio via librbd with 128k writes, I see the journals take
> >> > all the IOs until I hit my filestore_min_sync_interval, and then I
> >> > see it start writing to the underlying disks.
> >> >
> >> > Doing the same on a full cache tier (to force flushing), I
> >> > immediately see the base disks at a very high utilisation. The
> >> > journals also have some write IO at the same time. The only other
> >> > odd thing I can see via iostat is that, for most of the time whilst
> >> > I'm running fio, the underlying disks are doing very small write
> >> > IOs of around 16kb, with an occasional big burst of activity.
> >> >
> >> > I know erasure coding + cache tier is slower than plain replicated
> >> > pools, but even with various high queue depths I'm struggling to
> >> > get much above 100-150 iops, compared to a 3-way replica pool which
> >> > can easily achieve 1000-1500. The base tier is comprised of 40
> >> > disks. It seems quite a marked difference and I'm wondering if this
> >> > strange journal behaviour is the cause.
> >> >
> >> > Does anyone have any ideas?
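
(For reference, "fio via librbd" above means a run along these lines
with fio's rbd ioengine; the pool, image and client names are just
placeholders.)

fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test-image \
    --rw=write --bs=128k --iodepth=32 --time_based --runtime=60 \
    --name=cache-flush-test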
> >>
> >> If you're running a full cache pool, then on every operation touching
> >> an object which isn't in the cache pool it will try and evict an
> >> object. That's probably what you're seeing.
> >>
> >> Cache pools in general are only a wise idea if you have a very skewed
> >> distribution of data "hotness" and the entire hot zone can fit in
> >> cache at once.
> >> -Greg
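
(The usual way to keep a cache pool from running completely full is to
give it a target size and dirty/full ratios so flushing and eviction
start before it fills up; the pool name and 1TB target below are purely
illustrative.)

ceph osd pool set hot-cache target_max_bytes 1099511627776
ceph osd pool set hot-cache cache_target_dirty_ratio 0.4
ceph osd pool set hot-cache cache_target_full_ratio 0.8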
> >
> > Hi Greg,
> >
> > It's not the caching behaviour that I'm confused about, it's the
> > journal behaviour on the base disks during flushing. I've been doing
> > some more tests and have something reproducible which seems strange
> > to me.
> >
> > First off, 10MB of 4kb writes:
> > time ceph tell osd.1 bench 10000000 4096
> > { "bytes_written": 10000000,
> >   "blocksize": 4096,
> >   "bytes_per_sec": "16009426.000000"}
> >
> > real    0m0.760s
> > user    0m0.063s
> > sys     0m0.022s
> >
> > Now split this into 2x5mb writes:
> > time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
> > { "bytes_written": 5000000,
> >   "blocksize": 4096,
> >   "bytes_per_sec": "10580846.000000"}
> >
> > real    0m0.595s
> > user    0m0.065s
> > sys     0m0.018s
> > { "bytes_written": 5000000,
> >   "blocksize": 4096,
> >   "bytes_per_sec": "9944252.000000"}
> >
> > real    0m4.412s
> > user    0m0.053s
> > sys     0m0.071s
> >
> > The 2nd bench takes a lot longer, even though both should easily fit
> > in the 5GB journal. Looking at iostat, I think I can see that no
> > writes happen to the journal whilst the writes from the 1st bench are
> > being flushed. Is this the expected behaviour? I would have thought
> > that as long as there is space available in the journal it shouldn't
> > block on new writes. I also see in iostat that writes to the
> > underlying disk happen at a QD of 1 with 16kb IOs for a number of
> > seconds, with a large blip of activity just before the flush
> > finishes. Is this the correct behaviour? I would have thought that if
> > this "tell osd bench" is doing sequential IO, the journal should be
> > able to flush 5-10MB of data in a fraction of a second.
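
(One way to see whether it is the journal itself or the filestore
throttling that holds up new writes is to watch the OSD's journal and
throttle perf counters via the admin socket while the bench runs;
counter names vary between releases, so treat the grep as a sketch.)

while sleep 1; do
    ceph daemon osd.1 perf dump | python -m json.tool | grep -iE 'journal|throttle'
    echo ---
done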
> >
> > Ceph.conf
> > [osd]
> > filestore max sync interval = 30
> > filestore min sync interval = 20
> > filestore flusher = false
> > osd_journal_size = 5120
> > osd_crush_location_hook = /usr/local/bin/crush-location
> > osd_op_threads = 5
> > filestore_op_threads = 4
> >
> >
> > iostat during period where writes seem to be blocked (journal=sda
> > disk=sdd)
> >
> > Device:  rrqm/s   wrqm/s    r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00     0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdb        0.00     0.00   0.00    2.00    0.00     4.00     4.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdc        0.00     0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdd        0.00     0.00   0.00   76.00    0.00   760.00    20.00     0.99  13.11    0.00   13.11  13.05  99.20
> >
> > iostat during what I believe to be the actual flush
> >
> > Device:  rrqm/s   wrqm/s    r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00     0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdb        0.00     0.00   0.00    2.00    0.00     4.00     4.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdc        0.00     0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdd        0.00  1411.00   0.00  206.00    0.00  6560.00    63.69    70.14 324.14    0.00  324.14   4.85 100.00
> >
> >
> > Nick
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




