Re: Cache Tier Flush = immediate base tier journal sync?

I think this could be part of what I am seeing. I found this post from back in 2003:

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083

It seems to describe a workaround for the behaviour I am seeing. The constant small-block IO I was seeing looks like it was either the pg log and pg info updates or FS metadata. I have been going through the blktraces I captured today, and around 90% of the time I am just seeing 8KB writes and journal writes.
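Something like the rough sketch below can be used to bucket write sizes out of a blkparse dump. It assumes the default blkparse text columns (event type, RWBS flags, and the "+ N" sector count in fixed positions), so treat the field indices as an assumption and adjust them to match your trace rather than taking this as a finished tool:

#!/usr/bin/env python
# Rough sketch: histogram of write sizes from blkparse text output.
# Assumes the default blkparse line layout, e.g.:
#   8,0  3  11  0.009507758  697  C  W 223490 + 8 [ceph-osd]
# where field 5 is the event type ('C' = completion), field 6 the RWBS
# flags ('W...' = write) and field 9 the request size in 512-byte sectors.
import sys
from collections import Counter

sizes = Counter()

for line in sys.stdin:
    fields = line.split()
    # Need at least: dev cpu seq time pid event rwbs sector + nsectors
    if len(fields) < 10 or fields[5] != 'C' or 'W' not in fields[6]:
        continue
    if fields[8] != '+':
        continue
    try:
        nsectors = int(fields[9])
    except ValueError:
        continue
    sizes[nsectors * 512] += 1

total = sum(sizes.values())
if not total:
    sys.exit("no completed writes found in input")
for size, count in sorted(sizes.items()):
    print("%8d bytes: %6d writes (%.1f%%)" % (size, count, 100.0 * count / total))

Fed with the text output of blkparse on the trace, the 8KB bucket dominating is what you would expect if the pg log / metadata theory is right.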

I think the journal and filestore settings I have been adjusting have just been moving the "data" sync around the benchmark timeline and altering when the journal starts throttling. It seems that with small IOs the metadata overhead takes several times longer than the actual data write. This probably also explains why a full-SSD OSD is faster than HDD+SSD even for brief bursts of IO.
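To put rough numbers on that, here is a back-of-envelope sketch. The list of per-write extras (journal entry, pg log, pg info, FS metadata) comes from the reply quoted below; the individual sizes are assumptions for illustration only, not measured values:

# Back-of-envelope write amplification for one small client write on a
# FileStore OSD.  All sizes are assumed/illustrative, not measured.
CLIENT_WRITE = 4 * 1024          # one small client IO

# Assumed per-write overheads, each landing as its own small disk write:
overheads = {
    "journal entry (data + header)": 4 * 1024 + 4096,
    "pg log entry":                  4 * 1024,
    "pg info update":                4 * 1024,
    "FS metadata / inode update":    4 * 1024,
}

total = CLIENT_WRITE + sum(overheads.values())
print("%-32s: %6d bytes" % ("client data written", CLIENT_WRITE))
for name, size in overheads.items():
    print("%-32s: %6d bytes" % (name, size))
print("%-32s: %6d bytes (~%.1fx amplification)"
      % ("total written", total, float(total) / CLIENT_WRITE))

Even with generous assumptions the fixed overheads dwarf a 4KB data write, whereas for large sequential writes the same overheads become negligible, which fits both the blktrace results and the SSD vs HDD+SSD observation.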

In the thread I posted above, it seems that adding something like flashcache can go a long way towards hiding this problem, so that is something I might look into. It's a shame I didn't get BBWC with my OSD nodes, as that would likely have alleviated the problem with a lot less hassle.


> Ah, no, you're right. With the bench command it all goes into one object; it's
> just a separate transaction for each 64k write. But again, depending on the flusher
> and throttler settings in the OSD and the backing FS's configuration, it can be
> a lot of individual updates — in particular, every time there's a sync it has to
> update the inode.
> Certainly that'll be the case in the described configuration, with relatively low
> writeahead limits on the journal but high sync intervals — once you hit the
> limits, every write will get an immediate flush request.
> 
> But none of that should have much impact on your write amplification tests
> unless you're actually using "osd bench" to test it. You're more likely to be
> seeing the overhead of the pg log entry, pg info change, etc that's associated
> with each write.




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




