Adding flashcache for the data disk to cache Ceph metadata writes

Hi List,
	I have introduced flashcache (https://github.com/facebook/flashcache), aiming to reduce Ceph metadata IOs to the OSD's disk. Basically, for every data write, Ceph needs to write 3 things:
Pg log
Pg info
Actual data
	The first 2 writes are small, but on a non-btrfs filesystem they force the OSD disk to do 2 extra seeks, which is critical for a spindle disk's throughput, as mentioned in an earlier mail.
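	For reference, these writes can be observed per request by raising the filestore debug level on the OSD (a minimal config sketch; level 15 is the level that produced the log excerpt quoted further down in this thread):
		[osd]
			debug filestore = 15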

	I list the details of my experiment below; any input is highly appreciated.

	[Setup]
	2 hosts, 1 SSD with 1 SATA disk. The SSD is partitioned into 4 partitions: P1 as the OSD journal, P2 as flashcache for the SATA disk, and P3 as the XFS metadata log (a hypothetical partition layout sketch follows this list).
	1 client, with 1 RBD volume created and mounted.
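	[Example SSD partitioning]
		A purely hypothetical sketch of the SSD layout (sizes are made up; the 4th partition is left unused here):
			parted -s /dev/sda mklabel gpt
			parted -s /dev/sda mkpart journal 1MiB 10GiB     # P1: OSD journal
			parted -s /dev/sda mkpart fcache 10GiB 60GiB     # P2: flashcache for the SATA disk
			parted -s /dev/sda mkpart xfslog 60GiB 61GiB     # P3: XFS metadata log
			parted -s /dev/sda mkpart spare 61GiB 100%       # P4: unused in this test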
	[FlashCache setup]
		[Create cached device]
			flashcache_create -v -p back fsdc /dev/sda2 /dev/sdc
		[Create filesystem]
			mkfs.xfs -f -i size=2048 -d agcount=1 -l logdev=/dev/sda3,size=128m /dev/mapper/fsdc
		[Mount]
			mount -o logdev=/dev/sda3 -o logbsize=256k -o delaylog -o inode64 /dev/mapper/fsdc /data/osd.21/
		[Tuning]
			sysctl dev.flashcache.sda9+sdc.skip_seq_thresh_kb=32
			
			Since I am aiming to cache only Ceph metadata and the metadata writes are very small, I configured flashcache to skip all sequential writes larger than 32K. You could set this as low as 1K, because the metadata writes are all less than 1K; I set it to 32K just for a quick test.
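		[Check cache behaviour]
			To confirm that only the small metadata writes end up on the SSD, the cache counters and tunables can be inspected after a run (a sketch; dmsetup status prints whatever counters the flashcache target exports, and fsdc is the cache device created above):
			dmsetup status fsdc
			sysctl -a | grep flashcache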
	[Experiment]
		Doing dd from the client on top of the RBD Volume
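		A hypothetical example of such a run (the mount point, block size and count are made up, not the exact command used):
			dd if=/dev/zero of=/mnt/rbd/ddtest bs=4M count=2048 oflag=direct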
	[Result]
		Throughput boosted from 37MB/s to ~90MB/s. Since flashcache works at the DM level, it is transparent to Ceph.

	This was just a quick test; further tests (including sequential R/W and random R/W) are scheduled. I will get back to you when there is some progress.
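	For the scheduled tests, fio jobs along these lines could be used (a sketch only; the job parameters and target path are assumptions, not the actual plan):
		fio --name=seqwrite --rw=write --bs=4M --size=4G --direct=1 --ioengine=libaio --filename=/mnt/rbd/fiotest
		fio --name=randwrite --rw=randwrite --bs=4k --size=4G --direct=1 --ioengine=libaio --filename=/mnt/rbd/fiotest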
				
	Xiaoxi

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxx] 
Sent: January 16, 2013 5:43
To: Chen, Xiaoxi
Cc: Mark Nelson; Yan, Zheng 
Subject: RE: Separate metadata disk for OSD

On Tue, 15 Jan 2013, Chen, Xiaoxi wrote:
> Hi Sage,
> 	FlashCache works well for this scenario. I created a hybrid disk with 1 SSD partition (sharing the same SSD as, but a different partition from, the Ceph journal and the XFS journal) and 1 SATA disk, and configured FlashCache to ignore all sequential requests larger than 32K (it can be set to a smaller number).
> 	The results show performance comparable to the CephMeta-on-SSD solution.
> 	Since flashcache works in the DM layer, I suppose it's transparent to Ceph, right?

Right.  That's great to hear that it works well.  If you don't mind, it would be great if you could report the same thing to ceph-devel with a bit of detail about how you configured FlashCache so that others can do the same.

Thanks!
sage


> 	Xiaoxi
> 
> -----Original Message-----
> From: Sage Weil [mailto:sage@xxxxxxxxxxx]
> Sent: January 15, 2013 2:19
> To: Chen, Xiaoxi
> Cc: Mark Nelson; Yan, Zheng ; ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: Separate metadata disk for OSD
> 
> On Mon, 14 Jan 2013, Chen, Xiaoxi wrote:
> > Hi Sage,
> >     Thanks for your mail~
> > 	Do you have a timetable for when such an improvement might be ready? It's critical for non-btrfs filesystems.
> > 	I am thinking about introducing flashcache into my configuration to cache such metadata writes; since flashcache works under the filesystem, I suppose it will not break the assumptions inside Ceph. I will try it tomorrow and get back to you ~
> > 	Thanks again for the help!
> 
> I think avoiding the pginfo change may be pretty simple.  The log one I am a bit less concerned about (the appends from many rbd IOs will get aggregated into a small number of XFS IOs), and changing that around would be a bigger deal.
> 
> sage
> 
> 
> > 	Xiaoxi
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@xxxxxxxxxxx]
> > Sent: January 13, 2013 0:57
> > To: Chen, Xiaoxi
> > Cc: Mark Nelson; Yan, Zheng ; ceph-devel@xxxxxxxxxxxxxxx
> > Subject: RE: Separate metadata disk for OSD
> > 
> > On Sat, 12 Jan 2013, Chen, Xiaoxi wrote:
> > > Hi Zheng,
> > > 	I have put the XFS log on a separate disk; it does provide some performance gain, but not that significant.
> > > 	Ceph's metadata is somewhat separate (it is a set of files residing on the OSD's disk), therefore it is helped by neither the XFS journal log nor the OSD's journal. That's why I am trying to put Ceph's metadata (the /data/osd.x/meta folder) on a separate SSD disk.
> > > To Nelson,
> > > 	I did the experiment with just 1 client; with more clients the gain will not be as large.
> > > 	It looks to me that a single write from the client side becoming 3 writes to disk is a big overhead for an in-place-update filesystem such as XFS, since it introduces more seeks. An out-of-place-update filesystem will not suffer much from such a pattern; I didn't see this problem when using BTRFS as the backend filesystem. But for BTRFS, fragmentation is another performance killer: for a single RBD volume, after a lot of random writes the sequential read performance drops to 30% of that of a fresh volume. This makes BTRFS unusable in production.
> > > 	Separating the Ceph meta dir seems quite easy to me (I just mount a partition on /data/osd.X/meta); is that right? Is there any potential problem with it?
> > 
> > Unfortunately, yes.  The ceph journal and fs sync are carefully timed.  
> > The ceph-osd assumes that syncfs(2) on the $osd_data/current/commit_op_seq file will sync everything, but if meta/ is another fs that isn't true.  At the very least, the code needs to be modified to sync that as well.
> > 
> > That said, there is a lot of improvement that can be had here.  The three things we write are:
> > 
> >  the pg log
> >  the pg info, spread across the pg dir xattr and that pginfo file
> >  the actual io
> > 
> > The pg log could go in leveldb, which would translate those writes into a single sequential stream across the entire OSD.  And the PG info split between the xattr and the file is far from optimal: most of that data doesn't actually change on each write.  What little does is very small, and could be moved into the xattr, avoiding touching the file (which means an inode + data block write) at all.
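> > 
> > For reference, the current split can be seen by dumping the xattrs on a pg dir with a stock tool (the path below is just the example pg from the log further down):
> > 
> >   getfattr -d -m '.*' -e hex /data/osd.21/current/2.1a_head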
> > 
> > We need to look a bit more closely to see how difficult that will really be to implement, but I think it is promising!
> > 
> > sage
> > 
> > 
> > > 
> > > 	Xiaoxi
> > > 
> > > -----Original Message-----
> > > From: Mark Nelson [mailto:mark.nelson@xxxxxxxxxxx]
> > > Sent: January 12, 2013 21:36
> > > To: Yan, Zheng
> > > Cc: Chen, Xiaoxi; ceph-devel@xxxxxxxxxxxxxxx
> > > Subject: Re: Separate metadata disk for OSD
> > > 
> > > Hi Xiaoxi and Zheng,
> > > 
> > > We've played with both of these a bit internally, but not for a production deployment; mostly just for diagnosing performance problems.
> > > It's been a while since I last played with this, but I hadn't seen a whole lot of performance improvement at the time.  That may have been due to the hardware in use, or perhaps other parts of Ceph have improved to the point where this matters now!
> > > 
> > > On a side note, Btrfs also had a Google Summer of Code project to let you put metadata on an external device.  Originally I think that was supposed to make it into 3.7, but I am not sure if that happened.
> > > 
> > > Mark
> > > 
> > > On 01/12/2013 06:21 AM, Yan, Zheng wrote:
> > > > On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:
> > > >>
> > > >> Hi list,
> > > >>          For an RBD write request, Ceph needs to do 3 writes:
> > > >> 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d790
> > > >> 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write meta/516b801c/pglog_2.1a/0//-1 36015~147
> > > >> 2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
> > > >> 2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
> > > >> 2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
> > > >> 2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d708
> > > >> 2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 2.1a_head/8abf341a/rb.0.106e.6b8b4567.0000000002d3/head//2 3227648~524288
> > > >> 2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.0000000002d3__head_8ABF341A__2
> > > >>          When using XFS as the backend filesystem on top of a traditional SATA disk, this introduces a lot of seeks and therefore reduces bandwidth; a blktrace is available here (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate the issue (a single client running dd on top of a new RBD volume).
> > > >>          Then I tried to move /osd.X/current/meta to a separate disk, and the bandwidth improved (see the blktrace at http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
> > > >>          I haven't tested other access patterns yet, but it looks to me that moving such metadata to a separate disk (SSD, or SATA with btrfs) will benefit Ceph write performance; is that true? Will Ceph introduce this feature in the future? Is there any potential problem with such a hack?
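> > > >>          The hack itself is just mounting a partition over the meta dir, roughly like this (a sketch only; the device is hypothetical, and the existing meta/ contents would need to be copied over with the OSD stopped first):
> > > >>              mkfs.xfs -f /dev/sdd1
> > > >>              mount /dev/sdd1 /data/osd.21/current/meta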
> > > >>
> > > >
> > > > Did you try putting the XFS metadata log on a separate and fast device
> > > > (mkfs.xfs -l logdev=/dev/sdbx,size=10000b)? I think it will boost
> > > > performance too.
> > > >
> > > > Regards
> > > > Yan, Zheng
> > > >
> > > 


