I think filestore journal parallel only works with btrfs. Other filesystems are write-ahead.

If you write at 120MB/s, your 1GB journal is 50% full in about 4s. So you accumulate around 480MB every 4s; can your disks flush those 480MB sequentially in less than 4s? (Do a small benchmark of your disk on a local filesystem, without ceph.) If not, you can see spikes in your write stats from the journal.

Simple schema if the disks are not fast enough:

0-4s  ---- random write (first wave, 480MB) ---> journal
4-8s  ---- random write (second wave) ---> journal ---> flush of first wave (480MB) ---> disks
8-12s ---- random write (third wave) blocked ---> journal ---> write of second wave blocked ---> flush of first wave not yet finished (480MB) ---> disks

Good schema:

0-4s  ---- random write (first wave, 480MB) ---> journal
4-8s  ---- random write (second wave) ---> journal ---> flush of first wave (480MB) ---> disks
8-12s ---- random write (third wave) ---> journal ---> write of second wave (480MB) ---> disks

So, with a bigger journal, you have more data to write to the disks, and you can write more data sequentially in one flush. 4s seems very low; you want 20-30s between flushes.

How many disks (7.2K) do you have per OSD?

----- Original Message -----
From: "Stefan Priebe" <s.priebe@xxxxxxxxxxxx>
To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx, "Mark Nelson" <mark.nelson@xxxxxxxxxxx>
Sent: Sunday, 27 May 2012 20:57:23
Subject: Re: poor OSD performance using kernel 3.4

On 27.05.2012 13:33, Alexandre DERUMIER wrote:
>> how much time to flush from journal to disks?
>>> I don't know how to measure this.
> Do an iostat: you should see a period of write inactivity on the disk (data being written to the journal), then a period
> of write activity on the disk (data being flushed from journal to disk).

No, it always starts in parallel. The journal is set to 1GB. I've now moved the journal to disk - so I can use iostat.
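The local benchmark suggested above can be sketched with dd (the target path is an assumption; point it at a file on the disk backing the OSD):

```shell
# Sketch of the suggested local disk benchmark, no ceph involved.
# TARGET is an assumed path -- change it to a file on the OSD's data disk.
TARGET=${TARGET:-/tmp/ddtest.bin}
# One "wave" is ~480MB (120MB/s * 4s, i.e. 50% of a 1GB journal).
# conv=fdatasync makes dd wait for the data to reach the disk before
# reporting, so the MB/s shown includes the flush, not just the page cache.
dd if=/dev/zero of="$TARGET" bs=4M count=120 conv=fdatasync
rm -f "$TARGET"
```

If the rate dd reports is below ~120MB/s, the disk cannot absorb one journal wave per flush interval and writes will stall as in the "blocked" schema above.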
>>> As ceph starts to write to journal and
>>> disk in parallel
>
> This is strange; from the doc:
> http://ceph.com/wiki/OSD_journal
>
> the journal mode should be write-ahead with xfs,
> so writes go to the journal first and are then flushed to disk every 30s.

I'm not quite sure, as http://ceph.com/wiki/Ceph.conf#filestore_journal_writeahead says there are two options:

filestore journal writeahead
filestore journal parallel

but even

filestore journal writeahead = 1
filestore journal parallel = 0

results in a parallel start.

> Maybe your tmpfs is too small, and flushes occur at 50% of free space in the journal.
> If, for example, your flush occurs every 1 or 2 seconds, this can cause very slow writes.

1GB? My 1Gbit/s LAN test connection can't handle more than about 120MB/s, so there's room for at least 8-10s. ;-(

Stefan

--
Alexandre Derumier
Systems Engineer
Phone : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
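For reference, the journal settings discussed in this thread map onto ceph.conf entries along these lines (a sketch with illustrative values; `filestore max sync interval` is assumed here to be the knob controlling the 20-30s flush spacing mentioned above):

```ini
[osd]
    ; journal size in MB -- a larger journal means more data per sequential flush
    osd journal size = 4096
    ; force write-ahead journaling, the mode expected on xfs
    filestore journal writeahead = true
    filestore journal parallel = false
    ; allow up to 30s between journal flushes instead of the short default
    filestore max sync interval = 30
```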