On Thu, 6 May 2010, Martin Fick wrote:
> I have a few more questions.
>
> -Can files stored in the OSD heal "incrementally"?
>
> Suppose there are 3 replicas for a large file and that
> a small byte range change occurs while replica 3 is
> down.  Will replica 3 heal efficiently when it
> returns?  Will only the small changed byte range
> be transferred?

Currently, no.  This is a big item on the TODO list, both for efficiency
here, and also to facilitate better memory and network IO when objects
are large (recovery currently loads, sends, and saves objects in their
entirety).

> -Also, can reads be spread out over replicas?
>
> This might be a nice optimization to reduce seek
> times under certain conditions, when there are no
> writers or the writer is the only reader (and thus
> is aware of all the writes even before they
> complete).  Under these conditions it seems like it
> would be possible to not enforce the "tail reading"
> order of replicas and thus additionally benefit
> from "read striping" across the replicas the way
> many RAID implementations do with RAID1.
>
> I thought that this might be particularly useful
> for RBD when it is used exclusively (say by mounting
> a local FS) since even with replicas, it seems like
> it could then relax the replica tail reading
> constraint.

The idea certainly has its appeal, and I played with it for a while a
few years back.  At that time I had a _really_ hard time manufacturing a
workload scenario where it actually made things faster and not slower.
In general, spreading out reads pollutes caches (e.g., spreading across
two replicas means caches are half as effective).

What I tried to do was use fast heartbeats between OSDs to share average
request queue lengths, so that the primary could 'shed' a read request
to a replica if its queue length/request latency was significantly
shorter.  I wasn't really able to make it work.  In the case of very hot
objects, the primary will already have them in cache, and the fastest
thing is to just serve them up immediately, unless the network port is
fully saturated.  For cold objects, shedding could help, but only if
there is a sufficient load disparity between replicas to compensate for
the overhead of shedding.  At the time I had trouble simulating either
situation.

Also, the client/osd interface has changed such that only clients
initiate connections, so the previous shed path (client -> osd1 -> osd2
-> client) won't work.

We're certainly open to any ideas in this area...

sage
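
To make the incremental-recovery point above concrete, here is a minimal
sketch (not Ceph code; DirtyMap, record_write, and extents_to_send are
hypothetical names) of the idea: while a replica is down, the primary
records which byte extents of each object were written, and on recovery
ships only those extents instead of the whole object.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Extent { uint64_t off, len; };

class DirtyMap {
  // object name -> merged, sorted list of dirty extents
  std::map<std::string, std::vector<Extent>> dirty;
public:
  void record_write(const std::string& oid, uint64_t off, uint64_t len) {
    auto& v = dirty[oid];
    v.push_back({off, len});
    std::sort(v.begin(), v.end(),
              [](const Extent& a, const Extent& b) { return a.off < b.off; });
    // coalesce overlapping/adjacent extents so recovery sends each
    // dirty byte exactly once
    std::vector<Extent> merged;
    for (const auto& e : v) {
      if (!merged.empty() &&
          e.off <= merged.back().off + merged.back().len) {
        uint64_t end = std::max(merged.back().off + merged.back().len,
                                e.off + e.len);
        merged.back().len = end - merged.back().off;
      } else {
        merged.push_back(e);
      }
    }
    v.swap(merged);
  }
  // what recovery would push, instead of the whole 0..size range
  const std::vector<Extent>& extents_to_send(const std::string& oid) {
    return dirty[oid];
  }
};

int main() {
  DirtyMap dm;
  dm.record_write("rbd_obj.0000", 4096, 512);   // small overwrite
  dm.record_write("rbd_obj.0000", 4352, 1024);  // overlaps the first
  for (const auto& e : dm.extents_to_send("rbd_obj.0000"))
    std::cout << "send [" << e.off << ", +" << e.len << ")\n";
  // prints one merged extent: send [4096, +1280)
}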
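
And a hypothetical sketch of the shed heuristic described in the read
spreading discussion (again not Ceph code; ShedPolicy and its members
are made-up names): the primary tracks per-replica queue lengths learned
from heartbeats, and sheds a read only when (a) the object is not
already hot in its own cache, and (b) some replica's queue is shorter by
more than a margin that pays for the extra hop.

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

struct ReplicaStat {
  int osd_id;
  size_t queue_len;   // average queue length from the last heartbeat
};

class ShedPolicy {
  std::unordered_set<std::string> hot_cache;  // objects resident in cache
  size_t shed_margin;  // extra queue depth a replica must beat us by
public:
  explicit ShedPolicy(size_t margin) : shed_margin(margin) {}
  void mark_hot(const std::string& oid) { hot_cache.insert(oid); }

  // Returns the replica osd id to shed to, or -1 to serve locally.
  int choose(const std::string& oid, size_t my_queue,
             const std::vector<ReplicaStat>& replicas) const {
    // Hot object: the primary's cache wins; serve it immediately.
    if (hot_cache.count(oid)) return -1;
    int best = -1;
    size_t best_q = my_queue;
    for (const auto& r : replicas) {
      if (r.queue_len + shed_margin < best_q) {
        best = r.osd_id;
        best_q = r.queue_len;
      }
    }
    return best;
  }
};

int main() {
  ShedPolicy policy(4);  // replica must be >4 requests shorter
  policy.mark_hot("hot_obj");
  std::vector<ReplicaStat> reps = {{2, 3}, {3, 12}};
  std::cout << policy.choose("hot_obj",  10, reps) << "\n";  // -1: cached
  std::cout << policy.choose("cold_obj", 10, reps) << "\n";  //  2: 3+4 < 10
  std::cout << policy.choose("cold_obj",  5, reps) << "\n";  // -1: margin unmet
}

This only captures the decision on the primary; as the message notes,
the old shed path (client -> osd1 -> osd2 -> client) no longer works now
that only clients initiate connections, so any real design would also
need a redirect or forwarding step back through the client.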