Re: ceph and efficient access of distributed resources

On Tue, 16 Apr 2013, Gandalf Corvotempesta wrote:
> 2013/4/16 Mark Kampe <mark.kampe@xxxxxxxxxxx>:
> > The entire web is richly festooned with cache servers whose
> > sole raison d'etre is to solve precisely this problem.  They
> > are so good at it that back-bone providers often find it more
> > cash-efficient to buy more cache servers than to lay more
> > fiber.  Cache servers don't merely save disk I/O, they catch
> > these requests before they reach the server (or even the
> > backbone).
> 
> Mine was just an example; there are many other cases where a frontend
> cache is not possible.
> I think that ceph should spread reads across the whole cluster by
> default (like a big RAID-1), to achieve a bandwidth improvement.
> 
> GlusterFS does this, and so does MooseFS.
> 
> What happens in the case of a big file (for example, 100MB) with
> multiple chunks? Is ceph smart enough to read multiple chunks from
> multiple servers simultaneously, or will the whole file be served by
> just one OSD?

Yes.  The readahead window grows to include a few objects to take 
advantage of parallelism for reads.
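
For concreteness, here is a small sketch (not Ceph code; the 4 MB object 
size is an assumption, matching a common CephFS/RBD default) of how a 
striped 100 MB file maps onto RADOS objects, and why a readahead window 
that spans several objects lets the client issue those reads in parallel 
to different primary OSDs:

# Hypothetical sketch: how a striped file maps onto RADOS objects, and
# which objects a growing readahead window would fetch in parallel.
# The 4 MB object size is an assumption, not read from any cluster.

OBJECT_SIZE = 4 * 1024 * 1024  # bytes per RADOS object (assumed)

def objects_for_range(file_offset, length, object_size=OBJECT_SIZE):
    """Return the object indices covering [file_offset, file_offset+length)."""
    first = file_offset // object_size
    last = (file_offset + length - 1) // object_size
    return list(range(first, last + 1))

# A 12 MB readahead window starting at offset 0 touches objects 0, 1, 2;
# each may live on a different primary OSD, so the three reads can be
# issued concurrently.  The whole 100 MB file spans objects 0..24.
print(objects_for_range(0, 12 * 1024 * 1024))   # [0, 1, 2]
print(objects_for_range(0, 100 * 1024 * 1024))  # [0, 1, ..., 24]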

The problem with reading from random/multiple replicas by default is cache 
efficiency.  If every reader picks a random replica, then there are 
effectively N locations that may have an object cached in RAM (instead of 
on disk), and the caches for each OSD will be about 1/Nth as effective.  
The only time it makes sense to read from replicas is when you are CPU or 
network limited; the rest of the time it is better to read from the 
primary's cache than a replica's disk.
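
To make that arithmetic concrete, a rough simulation (not Ceph code; the 
cache sizes, object counts, and access pattern are invented) of why random 
replica selection dilutes each OSD's cache by roughly a factor of N:

# Illustrative model only.  With primary-only reads, each OSD's cache
# covers just the objects it is primary for; with random-replica reads,
# every cache must cover the whole object population, so per-OSD hit
# rates drop by roughly the replica count N.
import random
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        hit = key in self.entries
        if hit:
            self.entries.move_to_end(key)
        else:
            self.entries[key] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return hit

def hit_rate(pick_osd, n_osds=3, cache_per_osd=100, n_objects=1000, n_reads=100_000):
    caches = [LRUCache(cache_per_osd) for _ in range(n_osds)]
    rng = random.Random(1)
    hits = 0
    for _ in range(n_reads):
        obj = rng.randrange(n_objects)                  # object being read
        hits += caches[pick_osd(obj, n_osds, rng)].access(obj)
    return hits / n_reads

primary_only   = hit_rate(lambda obj, n, rng: obj % n)           # fixed primary per object
random_replica = hit_rate(lambda obj, n, rng: rng.randrange(n))  # any of the N replicas
print(f"primary-only reads, per-OSD hit rate:   ~{primary_only:.2f}")   # ~0.30
print(f"random-replica reads, per-OSD hit rate: ~{random_replica:.2f}") # ~0.10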

Unfortunately at the librados level, the client doesn't generally know 
that.  The infrastructure is in place for the MDS (or librados user) to 
indicate when reads from replicas are safe, but a bit more work is needed 
to make the client code utilize that information.  It's not a difficult 
improvement, and OSD load could also be communicated back to clients on a 
per-OSD session basis, but it's not implemented yet.
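
As a purely hypothetical sketch (none of these names are Ceph APIs), this 
is the kind of policy those missing pieces would enable: default to the 
primary for cache locality, and only read from a replica when the upper 
layer has marked replica reads safe and the primary reports heavy load:

# Hypothetical client-side read-placement policy -- not Ceph's client
# code or API; all names and the load feedback mechanism are assumed.
from dataclasses import dataclass

@dataclass
class OsdSession:
    osd_id: int
    reported_load: float   # assumed per-OSD load feedback, normalized to [0, 1]

def choose_read_target(primary, replicas, replica_reads_safe, load_threshold=0.8):
    """Pick the OSD to send a read to."""
    if not replica_reads_safe or primary.reported_load < load_threshold:
        return primary                      # the primary's cache is the best bet
    return min([primary] + replicas, key=lambda s: s.reported_load)

# Example: the primary is saturated and replica reads were marked safe.
primary  = OsdSession(osd_id=3, reported_load=0.95)
replicas = [OsdSession(osd_id=7, reported_load=0.20),
            OsdSession(osd_id=12, reported_load=0.50)]
print(choose_read_target(primary, replicas, replica_reads_safe=True).osd_id)  # -> 7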

sage
