On Tue, 16 Apr 2013, Gandalf Corvotempesta wrote:
> 2013/4/16 Mark Kampe <mark.kampe@xxxxxxxxxxx>:
> > The entire web is richly festooned with cache servers whose
> > sole raison d'etre is to solve precisely this problem. They
> > are so good at it that back-bone providers often find it more
> > cash-efficient to buy more cache servers than to lay more
> > fiber. Cache servers don't merely save disk I/O, they catch
> > these requests before they reach the server (or even the
> > backbone).
>
> Mine was just an example, there are many other cases where a frontend
> cache is not possible.
> I think that ceph should spread reads across the whole clusters by
> default (like a big RAID-1), to achieve bandwidth improvement.
>
> Glusters does this, and also MooseFS.
>
> What happens in case of a big file (for example, 100MB) with multiple
> chunks? Is ceph smart enough to read multiple chunks from multiple
> servers simultaneously or the whole file will be served by just an OSD
> ?

Yes. The readahead window grows to include a few objects to take
advantage of parallelism for reads.

The problem with reading from random/multiple replicas by default is
cache efficiency. If every reader picks a random replica, then there are
effectively N locations that may have an object cached in RAM (instead of
on disk), and the cache on each OSD will be about 1/Nth as effective.

The only time it makes sense to read from replicas is when you are CPU or
network limited; the rest of the time it is better to read from the
primary's cache than from a replica's disk. Unfortunately, at the librados
level the client doesn't generally know which case it is in.

The infrastructure is in place for the MDS (or librados user) to indicate
when reads from replicas are safe, but a bit more work is needed to make
the client code use that information. It's not a difficult improvement,
and OSD load could also be communicated back to clients on a per-OSD
session basis, but it's not implemented yet.
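To make the cache-efficiency argument above concrete, here is a rough toy
simulation in Python. It is only a sketch under stated assumptions, not
Ceph code: the LRUCache class, the round-robin replicas_for() placement
(a stand-in for CRUSH), the uniform read workload, and all parameter
values are hypothetical and chosen purely for illustration.

```python
from collections import OrderedDict
import random


class LRUCache:
    """Toy per-OSD RAM cache holding at most `capacity` objects."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, load the key and evict the oldest entry if full."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        self.items[key] = None
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)
        return False


def replicas_for(obj, num_osds, num_replicas):
    """Hypothetical placement (NOT CRUSH): primary is obj % num_osds, replicas follow it."""
    primary = obj % num_osds
    return [(primary + i) % num_osds for i in range(num_replicas)]


def hit_rate(read_from_random_replica, num_osds=6, num_replicas=3,
             num_objects=600, cache_slots=100, num_reads=20000, seed=1):
    """Fraction of reads served from an OSD's cache under the chosen read policy."""
    rng = random.Random(seed)
    caches = [LRUCache(cache_slots) for _ in range(num_osds)]
    hits = 0
    for _ in range(num_reads):
        obj = rng.randrange(num_objects)  # uniform reads over the working set
        osds = replicas_for(obj, num_osds, num_replicas)
        # Either always read from the primary, or pick any replica at random.
        target = rng.choice(osds) if read_from_random_replica else osds[0]
        hits += caches[target].access(obj)
    return hits / num_reads


if __name__ == "__main__":
    print("primary only  :", hit_rate(False))
    print("random replica:", hit_rate(True))
```

With these made-up numbers the working set just fits in the combined
caches when every object is read from its primary, so almost every read
after warm-up is a hit. With random replica selection each OSD sees about
N times as many distinct objects (N = 3 here) for the same cache size, and
the hit rate drops to roughly one third.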
sage