Re: read-only mounts of RBD images on multiple nodes for parallel reads

Hi,

First off: I'm probably not the expert you are waiting for, but we use CephFS for HPC / HTC (storing data files) and run all of our jobs in containers (up to ~2000 in parallel). 
We also use RBD, but for our virtualization infrastructure. 

While I'm always one of the first to recommend CephFS / RBD, I personally think that another (open source) file system - CVMFS - may suit your container use case significantly better. 
We use that to store our container images (and software in several versions). The containers are rebuilt daily. 
CVMFS is read-only for the clients by design. An administrator commits changes on the "Stratum 0" server,
and the clients see the new changes shortly after the commit has happened. Everything is revisioned, and you can roll back in case something goes wrong. 
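
To make that workflow a bit more concrete, a minimal publishing sketch could look like the following (the repository name and paths are just placeholders, not our actual setup):

    # on the "Stratum 0" server: open a writable transaction
    cvmfs_server transaction containers.example.org

    # stage the new container image inside the repository mount
    cp -r /srv/build/myimage /cvmfs/containers.example.org/images/

    # publish a new revision; clients pick it up shortly afterwards
    cvmfs_server publish containers.example.org

    # and if something went wrong, return to the previous revision
    cvmfs_server rollback containers.example.org
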
Why did we choose CVMFS here? 
- No need to have an explicit write-lock when changing things. 
- Deduplication built-in. We build several new containers daily, and keep them for 30 days (for long-running jobs). 
  Deduplication spares us from needing many times more storage. 
  I still hope Ceph learns deduplication some day ;-). 
- Extreme caching. The file system works over HTTP, i.e. you can use standard caching proxies (squids), and every client has its own local disk cache. The deduplication
  also applies there, so only unique chunks need to be fetched (a minimal client configuration sketch follows right after this list). 
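
For illustration only, a client configuration that enables the local disk cache and the squid proxies might look roughly like this (proxy hosts, repository name and cache size are made-up placeholders, not our production values):

    # /etc/cvmfs/default.local on each client
    CVMFS_REPOSITORIES=containers.example.org
    CVMFS_HTTP_PROXY="http://squid1.example.org:3128|http://squid2.example.org:3128"
    CVMFS_CACHE_BASE=/var/lib/cvmfs     # local disk cache location
    CVMFS_QUOTA_LIMIT=20000             # soft cache limit in MB

    # quick sanity check that the repository mounts
    cvmfs_config probe containers.example.org
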
High availability is rather easy to get (not quite as easy as with Ceph, but you can have it by running one "Stratum 0" machine which does the writing,
at least two "Stratum 1" machines syncing everything, and, if you want more performance, at least two squid servers in front). 
It's a FUSE filesystem, but it performs unexpectedly well, especially for the small files that are typical of software and container images. 
The caching and deduplication heavily reduce traffic when you run many containers, especially when they all start concurrently. 
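
For completeness, adding such a "Stratum 1" replica is roughly a two-command affair; a sketch with the same placeholder names as above (not our real hosts or key paths) could be:

    # on each Stratum 1: register a replica of the Stratum 0 repository
    cvmfs_server add-replica -o root \
        http://stratum0.example.org/cvmfs/containers.example.org \
        /etc/cvmfs/keys/containers.example.org.pub

    # pull the current state (typically run periodically, e.g. via cron)
    cvmfs_server snapshot containers.example.org
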

That's just my 2 cents, and your mileage may vary (for example, this does not work well if the machines running the containers do not have any local storage to cache things). 
And maybe you do not run thousands of containers in parallel, in which case you would not gain as much from the deduplication as we do. 

If it does not fit your case, I think RBD is a good way to go, but sadly I cannot share any experience of how well / stably it works with many clients mapping the volume read-only in parallel. 
In our virtualization, there's always only one exclusive lock on a volume. 

Cheers,
	Oliver

Am 17.01.19 um 19:27 schrieb Void Star Nill:
> Hi,
> 
> We are trying to use Ceph in our products to address some of our use cases, and we think the Ceph block device is a good fit for us. One of the use cases is that we have a number of jobs running in containers that need Read-Only access to shared data. The data is written once and consumed multiple times. I have read through some of the similar discussions and the recommendations to use CephFS for these situations, but in our case the block device makes more sense, as it fits well with the other use cases and restrictions we have around this use case.
> 
> The following scenario seems to work as expected when we tried it on a test cluster, but we wanted to get an expert opinion to see if there would be any issues in production. The usage scenario is as follows:
> 
> - A block device is created with "--image-shared" options:
> 
>     rbd create mypool/foo --size 4G --image-shared
> 
> 
> - The image is mapped to a host, formatted with ext4 (or another filesystem), mounted to a directory in read/write mode, and data is written to it. Please note that the image will be mapped in exclusive write mode -- no other read/write mounts are allowed at this time.
> 
> - The volume is unmapped from the host and then mapped onto N other hosts, where it is mounted in read-only mode and the data is read simultaneously by the N readers (a command-level sketch of this flow follows).
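> 
> To make the steps concrete, a minimal sketch of that flow with the kernel rbd client could look like the following (pool, image, device and mount point names are placeholders, and this is an illustration rather than a tested recipe):
> 
>     # writer host: map exclusively, format, fill, then release
>     rbd map mypool/foo                  # returns e.g. /dev/rbd0
>     mkfs.ext4 /dev/rbd0
>     mount /dev/rbd0 /mnt/foo
>     # ... write the data ...
>     umount /mnt/foo
>     rbd unmap /dev/rbd0
> 
>     # each of the N reader hosts: map and mount strictly read-only
>     rbd map --read-only mypool/foo
>     mount -o ro,noload /dev/rbd0 /mnt/foo   # noload skips ext4 journal replay on the read-only device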
> 
> As mentioned above, this seems to work as expected, but we wanted to confirm that we won't run into any unexpected issues.
> 
> Appreciate any inputs on this.
> 
> Thanks,
> Shridhar
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
