Fwd: Bad performance of CephFS (first use)

On Fri, 09 May 2014 23:03:50 +0200 Michal Pazdera wrote:

> > On 9.5.2014 9:08, Christian Balzer wrote:
> > Is that really just one disk?
> 
> Yes, it's just one disk in all PCs. I know that the setup is bad, but I
> just want to get
> familiar with Ceph (and other parallel filesystems like Gluster or Lustre)
> and see what they can
> do and what they cannot.
>
Note that I have zero experience with CephFS; I just use the RBD part of
Ceph.
However, it is my understanding that CephFS isn't really ready for prime
time at this point, mostly because the MDS isn't HA.
 
> > You have the reason for the write performance half right.
> > Every write goes to the primary OSD of the PG for that object.
> > That is, the journal of that OSD, which in your configuration I
> > suspect is a file on the same XFS as the actual OSD data. Either way,
> > it would be on the same disk as you only have one.
> > So that write goes to the primary OSD journal, then gets replicated to
> > the journal of the secondary OSD, then it gets ACKed back to the client.
> > Meanwhile the journals will have to get written to the actual storage
> > eventually.
> 
> So the client PC writes to the journal of one OSD, and then that OSD
> replicates the data
> from its journal to the second OSD, also into its journal. Only after
> that is the data on each OSD
> copied from the journal into the actual OSD storage? Interesting; in that
> case the client should write
> around 100MB/s to one OSD, then stop and wait for that OSD to
> replicate the data onto the
> second OSD (also around 100MB/s), and then all is done. Afterwards the disk
> holding the journals and storage
> space should copy all the data onto itself.
> 
It will start sending data at 100MB/s (provided the network is otherwise
idle). But once the journal starts flushing to the filestore, your HDD is
effectively running at half speed, i.e. about 50MB/s going by your comment below.

See below for the parallelism, as in writing to all OSDs at more or less
the same time. This of course reduces your bandwidth for client writes and
replication speed as they eat into each other.

> So the journal is some kind of cache for OSDs?
> 
It works like most journals: it gives Ceph a faster way to persist writes
(even the on-disk version is just a sequential file) and may coalesce/merge
those writes before they go to the actual OSD storage filesystem.
And if the journal is on a dedicated device, preferably of course an SSD,
this works beautifully.
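For illustration, this is roughly what that looks like in ceph.conf (a
minimal sketch; the size and the /dev/sdb1 partition are placeholders,
adapt them to your hardware):

[osd]
    # journal size in MB; by default the journal is a file on the same
    # filesystem (and in your case the same disk) as the OSD data
    osd journal size = 1024

[osd.0]
    # point the journal at a dedicated partition, ideally on an SSD
    # (/dev/sdb1 is just a placeholder device name)
    osd journal = /dev/sdb1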

>  From the graphs I got, it seems that the client is sending data to both
> OSDs in parallel, into the journals.
You haven't told us what you're actually using to perform those writes,
but if it is a normal (non-direct) write or a set of parallel writes, your
client node will most certainly start writing to both OSDs, since the PGs
(placement groups, read about the principles of how Ceph works on its
homepage) for the objects (4MB-sized ones with RBD, no idea about CephFS)
will be more or less evenly distributed between those OSDs.

> Then each of the OSDs copies the data once more onto itself (not sure). But
> I don't know why the network
> traffic has these spikes. Is it because the client writes some chunk of
> data and then waits for something
> before the next chunk can be sent?
> 
Indeed, the next write can only happen after the previous one has been
acknowledged, which is after the data has been written to both journals.
Congestion on the disks and network will slow that down.
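As a toy model of a single in-flight write (all latencies are invented
numbers, just to show where the gaps in your network graph come from):

# One chunk is only acknowledged after it sits in BOTH journals, and the
# client won't send the next chunk before the ack. All latencies invented.

CHUNK_MB             = 4     # e.g. one 4MB object
NET_MS_PER_CHUNK     = 34    # ~4MB over a GbE link
JOURNAL_MS_PER_CHUNK = 40    # ~4MB into an HDD-backed journal under load

def chunk_latency_ms():
    client_to_primary    = NET_MS_PER_CHUNK + JOURNAL_MS_PER_CHUNK
    primary_to_secondary = NET_MS_PER_CHUNK + JOURNAL_MS_PER_CHUNK
    # While replication and the journal writes complete, the client link
    # sits idle; that is the spike-and-gap pattern in your graphs.
    return client_to_primary + primary_to_secondary

total_ms = 10 * chunk_latency_ms()
print("10 chunks (40MB) take ~%d ms -> ~%.0f MB/s"
      % (total_ms, 10 * CHUNK_MB * 1000.0 / total_ms))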

> > So each write happens basically twice; your single disk now only has an
> > effective speed of around 60-70MB/s (couldn't find any benchmarks for
> > your model, but most drives of this type have write speeds up to
> > 140MB/s).
> 
> They can write at around 100MB/s for sure.
>
Which becomes 50MB/s if the journal is on it. ^^
 
Christian

> > Now add to this the fact that the replication from the other OSD will
> > of course also impact things.
> > That network bandwidth for the replication has to come from
> > somewhere...
> >
> > Look at what the recommended configurations by Inktank are and at
> > previous threads in here to get an idea of what helps.
> >
> > Since I doubt you have the budget or parts for your test setup to add
> > more disks, SSDs for journals, HW cache controllers, additional
> > network cards and so forth I guess you will have to live with this
> > performance for now.
> >
> > Christian
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com        Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> 
> I will do that. Thank you very much for your reply!
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

