Re: Ceph Overall Usage

On Tue, Feb 12, 2013 at 6:03 AM, Jason Stover <jason.stover@xxxxxxxxx> wrote:
> Hi all,
>
>   [ Kind of randomness, but don't really know where else to throw this at
> since it really isn't a "dev" question... :/  Sorry in advance. ]
>
>   So, I'm testing setting up a small ceph cluster. I have 3 nodes in the
> following layout:
>
>     1. 1x MDS, 1x MON
>     2. 1x MON, 2x OSD (2x 250GB SATA Drive)
>     3. 1x MON, 2x OSD (2x 250GB SATA Drive)
>
>     4. 7x Clients (3.5.7 Kernel [OFED Requirement])
>
>     Ceph 0.56.1, using default crushmap during the basic testing since it
> seems pretty appropriate for what I'm wanting. OSD Journals (1GB) writing to
> a different disk than what's used for data storage.
>
>   Basic, simple usage is fine... But I want to use the Ceph cluster as
> shared storage for both cloud VM images and *shared* HPC compute node
> images for scratch space. Once I use Ceph in a basic HPC use case, it
> becomes impractical.  For example (with monitors on the IB network):
>
>   From each of the HPC compute nodes, I write out a 4GB file (dd
> if=/dev/zero of=/scratch/[blah] [...]). From the head node of the HPC
> cluster, I do a 'time { ls -lh /scratch; }' ...  All of the 'dd' commands
> return in around 4 seconds (~1GB/s write - ~6GB/s aggregate counting all
> nodes) ...

(Interjection:) The fast return is just local caching, fyi. Although
I'd expect the flushing out to take less than 4.5 minutes...

> On the head node it takes 4.5 minutes to get a response on the
> 'ls' command and get a directory listing.
>
>   This is a higher use case, as writing out the multiple smaller files
> (tested writing a count of 1 to 200 or 1000 files per node, size increasing
> per count) does give slightly better read performance ... but will still
> hang while doing a directory listing of the shared file system; it might
> wait 2 or 3 seconds depending on the sizes being written.
>
>   The shared file system is mounted on each of the 7 HPC nodes like (3.5.7
> clients):
>
>       # mount -t ceph cephm-ib0,ceph00-ib0,ceph01-ib0:/ /scratch -o
> name=admin,secretfile=~/.ceph.key
>
>   I'm assuming that this is hitting a disk cache on the writes and then
> blocking until everything is finally written to the disks ... but why are
> the MON/MDS blocking the reads of the file system?

Yeah. Ceph allows for a lot of caching of shared state between a single
client and the MDS. However, one of the data safety invariants it
enforces is that anything which has been observed by a second client
must be safe on disk. So the list command comes in to the MDS, the MDS
tells all the other clients to drop their buffer capabilities so it can
get a stable snapshot of the state to return, and those clients have to
flush all their dirty data to disk before the MDS returns anything to
the listing client.
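You can reproduce that interaction with nothing but the commands you're
already running (file names here are just placeholders):

  On a compute node, write without any explicit flush:

      # dd if=/dev/zero of=/scratch/node01.dat bs=1M count=4096

  On the head node, time the listing; it blocks until the writers have
  flushed their dirty data in response to the MDS revoking their caps:

      # time ls -lh /scratch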

You would see a lot less of this if, for instance, each client were
writing to its own directory in the shared hierarchy.
That said, there's probably some room for optimization in these
behaviors; you'll notice that we don't recommend CephFS for most
production uses at this time and this kind of tuning comes after basic
stability stuff. ;)
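
Concretely, something along these lines on each compute node (paths and
names are just examples, using the hostname to keep writers out of each
other's directories):

      # mkdir -p /scratch/$(hostname)
      # dd if=/dev/zero of=/scratch/$(hostname)/out.dat bs=1M count=4096

A listing then mostly has to pull state for the directory you're actually
looking at, rather than waiting on every client's in-flight files.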



Going back to your apparently slow writes (6GB in ~4.5 minutes): back up
and do some more basic benchmarking with the 'rados bench' command and
friends, and try some dd et al. runs that include the flush commands, so
you're timing what actually reaches the disks.
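
Roughly (the 'data' pool and the sizes here are just examples):

  Raw RADOS write throughput, bypassing CephFS and the MDS entirely:

      # rados -p data bench 30 write -t 16

  A dd run that includes the flush in its timing, so the number reflects
  what actually reaches the OSDs rather than the client page cache:

      # time dd if=/dev/zero of=/scratch/test.dat bs=1M count=4096 conv=fsync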
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

