[ Kind of random, and I don't really know where else to throw this since it isn't really a "dev" question... :/ Sorry in advance. ]
So, I'm testing setting up a small Ceph cluster. I have 3 cluster nodes, plus the HPC clients, in the following layout:
1. 1x MDS, 1x MON
2. 1x MON, 2x OSD (2x 250GB SATA Drive)
3. 1x MON, 2x OSD (2x 250GB SATA Drive)
4. 7x Clients (3.5.7 Kernel [OFED Requirement])
Ceph 0.56.1, using the default crushmap during this basic testing since it seems appropriate for what I want. The OSD journals (1GB) write to a different disk than the one used for data storage.
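For reference, the journal setup is nothing fancy, just the usual per-OSD stanzas in ceph.conf; roughly like this (showing only ceph00's two OSDs, and the device paths here are illustrative, not my real ones):

[osd]
    osd journal size = 1024                 # 1GB journal

[osd.0]
    host = ceph00
    osd data = /var/lib/ceph/osd/ceph-0     # first 250GB SATA drive
    osd journal = /dev/sdd1                 # partition on a separate disk

[osd.1]
    host = ceph00
    osd data = /var/lib/ceph/osd/ceph-1     # second 250GB SATA drive
    osd journal = /dev/sdd2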
Basic, simple usage is fine... but I want to use the Ceph cluster as shared storage for both cloud VM images and *shared* scratch space for the HPC compute nodes. As soon as I put Ceph into even a basic HPC use case, it becomes impractical. For example (with the monitors on the IB network):
From each of the HPC compute nodes I write out a 4GB file (dd if=/dev/zero of=/scratch/[blah] [...]). From the head node of the HPC cluster I run 'time { ls -lh /scratch; }'. All of the 'dd' commands return in around 4 seconds (~1GB/s write, ~6GB/s aggregate across all nodes)... but on the head node it takes 4.5 minutes for the 'ls' to come back with a directory listing.
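Spelled out, the test is roughly this (file name, block size, and count below are approximate, from memory):

# on each compute node, started at roughly the same time
dd if=/dev/zero of=/scratch/testfile.$(hostname) bs=1M count=4096

# on the head node, once the dd's have returned
time { ls -lh /scratch; }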
That's admittedly at the heavy end of our use, and writing out multiple smaller files (I tested counts from 1 up to 200 or 1000 files per node, with size increasing per count) does give slightly better read performance... but a directory listing of the shared file system will still hang, maybe 2 or 3 seconds depending on the sizes being written.
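The small-file variant is basically a per-node loop; the count and sizes below are just one example of the runs I did:

# one run of the small-file test: 200 files per node, file i is i MB
for i in $(seq 1 200); do
    dd if=/dev/zero of=/scratch/$(hostname).$i bs=1M count=$i
done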
The shared file system is mounted on each of the 7 HPC nodes like this (3.5.7 kernel clients):
# mount -t ceph cephm-ib0,ceph00-ib0,ceph01-ib0:/ /scratch -o name=admin,secretfile=~/.ceph.key
I'm assuming the writes are hitting a disk cache and then blocking until everything is finally written out to the disks... but why are the MON/MDS blocking reads of the file system? Shouldn't they allow reads for what they already know about (assuming the data has actually been written to the disks by that point)?
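One thing I haven't tried yet, to rule the client page cache in or out, is forcing a flush as part of the write, e.g.:

# same write, but dd doesn't report back until the data has been flushed out
dd if=/dev/zero of=/scratch/testfile.$(hostname) bs=1M count=4096 conv=fdatasync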
I would prefer creating a shared RBD image and using that... but the HPC images would become much more complex because of the clustered-FS requirement for a shared image... and adding that overhead, on top of Ceph already replicating the data, seems like overkill to me, yet the HPC nodes must be able to see what every other node has written. :/
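Concretely, what I mean is something along these lines (image name and size are made up, and the OCFS2/o2cb cluster configuration is left out), which is exactly the extra complexity I'd rather avoid:

# create one image, then map it on every HPC node
rbd create scratch --size 512000
rbd map scratch                  # appears as e.g. /dev/rbd0 (name may differ)

# format once with a clustered FS (OCFS2 just as an example), mount on each node
mkfs.ocfs2 /dev/rbd0
mount /dev/rbd0 /scratch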
Creating an RBD image, mounting it on the head node, and then NFS-exporting it also doesn't work: once I bring NFS into the equation, write performance drops to around 25MB/s.
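That attempt was basically the following (the export options and subnet here are illustrative):

# on the head node: map the image, put a local FS on it, export it over NFS
rbd map scratch                  # e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /scratch
echo "/scratch 10.10.0.0/24(rw,no_root_squash,sync)" >> /etc/exports
exportfs -ra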
So ... anyone have any pointers I can look at for this use case? Am I just doing something horribly brain dead for what I'm attempting?
Thanks much,
-Jason