On Thu, 6 May 2010, Martin Fick wrote:
> I have a few more questions.
>
> -Can files stored in the OSD heal "incrementally"?
>
> Suppose there are 3 replicas for a large file and that
> a small byte range change occurs while replica 3 is
> down.  Will replica 3 heal efficiently when it
> returns?  Will only the small changed byte range
> be transferred?

Currently, no.  This is a big item on the TODO list, both for efficiency
here, and also to facilitate better memory and network IO when objects
are large (recovery currently loads, sends, and saves objects in their
entirety).

> -Also, can reads be spread out over replicas?
>
> This might be a nice optimization to reduce seek
> times under certain conditions, when there are no
> writers or the writer is the only reader (and thus
> is aware of all the writes even before they
> complete).  Under these conditions it seems like it
> would be possible to not enforce the "tail reading"
> order of replicas and thus additionally benefit
> from "read striping" across the replicas the way
> many RAID implementations do with RAID1.
>
> I thought that this might be particularly useful
> for RBD when it is used exclusively (say by mounting
> a local FS) since even with replicas, it seems like
> it could then relax the replica tail reading
> constraint.

The idea certainly has its appeal, and I played with it for a while a
few years back.  At that time I had a _really_ hard time manufacturing a
workload scenario where it actually made things faster and not slower.
In general, spreading out reads pollutes caches (e.g., spreading across
two replicas means caches are half as effective).

What I tried to do was use fast heartbeats between OSDs to share average
request queue lengths, so that the primary could 'shed' a read request
to a replica if its queue length/request latency was significantly
shorter.  I wasn't really able to make it work.  In the case of very hot
objects, the primary will already have them in cache, and the fastest
thing is to just serve them up immediately, unless the network port is
fully saturated.  For cold objects, shedding could help, but only if
there is a sufficient load disparity between replicas to compensate for
the overhead of shedding.  At the time I had trouble simulating either
situation.

Also, the client/osd interface has changed such that only clients
initiate connections, so the previous shed path (client -> osd1 -> osd2
-> client) won't work.

We're certainly open to any ideas in this area...

sage
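
To make the incremental-recovery point above concrete, here is a minimal
sketch (not Ceph code; DirtyMap, record_write, and extents_to_send are
hypothetical names) of the idea: while a replica is down, the primary
records which byte extents of each object were written, and on recovery
ships only those extents instead of the whole object.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Extent { uint64_t off, len; };

class DirtyMap {
  // object name -> merged, sorted list of dirty extents
  std::map<std::string, std::vector<Extent>> dirty;
public:
  void record_write(const std::string& oid, uint64_t off, uint64_t len) {
    auto& v = dirty[oid];
    v.push_back({off, len});
    std::sort(v.begin(), v.end(),
              [](const Extent& a, const Extent& b) { return a.off < b.off; });
    // coalesce overlapping/adjacent extents so recovery sends each
    // dirty byte exactly once
    std::vector<Extent> merged;
    for (const auto& e : v) {
      if (!merged.empty() &&
          e.off <= merged.back().off + merged.back().len) {
        uint64_t end = std::max(merged.back().off + merged.back().len,
                                e.off + e.len);
        merged.back().len = end - merged.back().off;
      } else {
        merged.push_back(e);
      }
    }
    v.swap(merged);
  }
  // what recovery would push, instead of the whole 0..size range
  const std::vector<Extent>& extents_to_send(const std::string& oid) {
    return dirty[oid];
  }
};

int main() {
  DirtyMap dm;
  dm.record_write("rbd_obj.0000", 4096, 512);   // small overwrite
  dm.record_write("rbd_obj.0000", 4352, 1024);  // overlaps the first
  for (const auto& e : dm.extents_to_send("rbd_obj.0000"))
    std::cout << "send [" << e.off << ", +" << e.len << ")\n";
  // prints one merged extent: send [4096, +1280)
}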
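
And a hypothetical sketch of the shed heuristic described in the read
spreading discussion (again not Ceph code; ShedPolicy and its members
are made-up names): the primary tracks per-replica queue lengths learned
from heartbeats, and sheds a read only when (a) the object is not
already hot in its own cache, and (b) some replica's queue is shorter by
more than a margin that pays for the extra hop.

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

struct ReplicaStat {
  int osd_id;
  size_t queue_len;   // average queue length from the last heartbeat
};

class ShedPolicy {
  std::unordered_set<std::string> hot_cache;  // objects resident in cache
  size_t shed_margin;  // extra queue depth a replica must beat us by
public:
  explicit ShedPolicy(size_t margin) : shed_margin(margin) {}
  void mark_hot(const std::string& oid) { hot_cache.insert(oid); }

  // Returns the replica osd id to shed to, or -1 to serve locally.
  int choose(const std::string& oid, size_t my_queue,
             const std::vector<ReplicaStat>& replicas) const {
    // Hot object: the primary's cache wins; serve it immediately.
    if (hot_cache.count(oid)) return -1;
    int best = -1;
    size_t best_q = my_queue;
    for (const auto& r : replicas) {
      if (r.queue_len + shed_margin < best_q) {
        best = r.osd_id;
        best_q = r.queue_len;
      }
    }
    return best;
  }
};

int main() {
  ShedPolicy policy(4);  // replica must be >4 requests shorter
  policy.mark_hot("hot_obj");
  std::vector<ReplicaStat> reps = {{2, 3}, {3, 12}};
  std::cout << policy.choose("hot_obj",  10, reps) << "\n";  // -1: cached
  std::cout << policy.choose("cold_obj", 10, reps) << "\n";  //  2: 3+4 < 10
  std::cout << policy.choose("cold_obj",  5, reps) << "\n";  // -1: margin unmet
}

This only captures the decision on the primary; as the message notes,
the old shed path (client -> osd1 -> osd2 -> client) no longer works now
that only clients initiate connections, so any real design would also
need a redirect or forwarding step back through the client.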