I think filestore journal parallel only works with btrfs. Other filesystems are write-ahead.

If you write at 120MB/s, your 1GB journal is 50% full in about 4s. So you accumulate around 480MB every 4s; can your disks flush those 480MB sequentially in less than 4s? (Do a small benchmark of your disk on a local filesystem, without ceph.) If not, you can see spikes in your write stats from the journal.

Simple schema if the disks are not fast enough:

0-4s  ---- random write (first wave, 480MB) ---> journal
4-8s  ---- random write (second wave) ---> journal ---> flush of first wave (480MB) ---> disks
8-12s ---- random write (third wave) blocked ---> journal ---> write of second wave blocked ---> flush of first wave not yet finished (480MB) ---> disks

Good schema:

0-4s  ---- random write (first wave, 480MB) ---> journal
4-8s  ---- random write (second wave) ---> journal ---> flush of first wave (480MB) ---> disks
8-12s ---- random write (third wave) ---> journal ---> write of second wave (480MB) ---> disks

So, with a bigger journal, you have more data to write to the disks, and you can write more data sequentially in one flush. 4s seems very low; you want 20-30s between flushes.

How many disks (7.2K) do you have per OSD?

----- Original Message -----
From: "Stefan Priebe" <s.priebe@xxxxxxxxxxxx>
To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx, "Mark Nelson" <mark.nelson@xxxxxxxxxxx>
Sent: Sunday, 27 May 2012 20:57:23
Subject: Re: poor OSD performance using kernel 3.4

On 27.05.2012 13:33, Alexandre DERUMIER wrote:
>> how much time to flush from journal to disks?
>>> I don't know how to measure this.
> Do an iostat: you should see a period of write inactivity on the disk (data being written to the journal), then a period
> of write activity on the disk (data being flushed from journal to disk).

No, it always starts in parallel. The journal is set to 1GB. I've now moved the journal to disk - so I can use iostat.
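The local benchmark suggested above can be sketched with dd (the target path is an assumption; point it at a file on the disk backing the OSD):

```shell
# Sketch of the suggested local disk benchmark, no ceph involved.
# TARGET is an assumed path -- change it to a file on the OSD's data disk.
TARGET=${TARGET:-/tmp/ddtest.bin}
# One "wave" is ~480MB (120MB/s * 4s, i.e. 50% of a 1GB journal).
# conv=fdatasync makes dd wait for the data to reach the disk before
# reporting, so the MB/s shown includes the flush, not just the page cache.
dd if=/dev/zero of="$TARGET" bs=4M count=120 conv=fdatasync
rm -f "$TARGET"
```

If the rate dd reports is below ~120MB/s, the disk cannot absorb one journal wave per flush interval and writes will stall as in the "blocked" schema above.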
>>> As ceph starts to write to journal and
>>> disk in parallel
>
> This is strange; from the doc:
> http://ceph.com/wiki/OSD_journal
>
> the journal mode should be write-ahead with xfs,
> so writes go to the journal first and are then flushed to disk every 30s.

I'm not quite sure, as http://ceph.com/wiki/Ceph.conf#filestore_journal_writeahead says there are two options:

filestore journal writeahead
filestore journal parallel

but even

filestore journal writeahead = 1
filestore journal parallel = 0

results in a parallel start.

> Maybe your tmpfs is too small, and flushes occur at 50% of free space in the journal.
> If, for example, your flush occurs every 1 or 2 seconds, this can cause very slow writes.

1GB? My 1Gbit/s LAN test connection can't handle more than about 120MB/s, so there's room for at least 8-10s. ;-(

Stefan

--
Alexandre Derumier
Systems Engineer
Phone : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
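For reference, the journal settings discussed in this thread map onto ceph.conf entries along these lines (a sketch with illustrative values; `filestore max sync interval` is assumed here to be the knob controlling the 20-30s flush spacing mentioned above):

```ini
[osd]
    ; journal size in MB -- a larger journal means more data per sequential flush
    osd journal size = 4096
    ; force write-ahead journaling, the mode expected on xfs
    filestore journal writeahead = true
    filestore journal parallel = false
    ; allow up to 30s between journal flushes instead of the short default
    filestore max sync interval = 30
```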