On Mon, Feb 26, 2018 at 11:06 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
On 26.02.2018 at 19:45, Gregory Farnum wrote:
> On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>
> On 26.02.2018 at 19:24, Gregory Farnum wrote:
> > I don’t actually know this option, but based on your results it’s clear that “fast read” is telling the OSD it should issue reads to all k+m OSDs storing data and then reconstruct the data from the first k to respond. Without the fast read it simply asks the regular k data OSDs to read their shards back directly and sends the reply back. This is a straight trade-off of more bandwidth for lower long-tail latencies.
> > -Greg
>
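The latency side of this trade-off can be sketched with a small simulation. This is a toy model, not Ceph code: it assumes each shard read has an independent, exponentially distributed latency, and the function names (`normal_read_latency`, `fast_read_latency`, `simulate`, `percentile`) are made up for illustration. A normal read waits for the slowest of the k data shards, while a fast read finishes as soon as any k of the k+m shards have arrived.

```python
import random


def normal_read_latency(latencies, k):
    # Normal read: wait for all k data shards (positions 0..k-1).
    return max(latencies[:k])


def fast_read_latency(latencies, k):
    # Fast read: all k+m shards are requested; the read completes
    # when the k-th fastest response has arrived.
    return sorted(latencies)[k - 1]


def simulate(k=4, m=2, trials=10000, seed=0):
    # Assumption: per-shard latencies are i.i.d. exponential with mean 1.
    rng = random.Random(seed)
    normal, fast = [], []
    for _ in range(trials):
        lat = [rng.expovariate(1.0) for _ in range(k + m)]
        normal.append(normal_read_latency(lat, k))
        fast.append(fast_read_latency(lat, k))
    return normal, fast


def percentile(values, p):
    # Simple nearest-rank percentile, p in [0, 1].
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(p * len(ordered)))]


if __name__ == "__main__":
    normal, fast = simulate()
    print("p99 normal read:", percentile(normal, 0.99))
    print("p99 fast read:  ", percentile(fast, 0.99))
```

In this model the fast read is never slower than the normal read for the same set of shard latencies, which is the long-tail benefit; the model deliberately ignores the extra network traffic and any decoding cost.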
> Many thanks, this certainly explains it!
> Apparently I misunderstood how "normal" read works - I thought that in any case, all shards would be requested, and the primary OSD would check EC is still fine.
>
>
> Nope, EC PGs can self-validate (they checksum everything) and so extra shards are requested only if one of the OSDs has an error.
>
>
> However, with the explanation that indeed only the actual "k" shards are read in the "normal" case, it's fully clear to me that "fast_read" will be slower for us,
> since we are limited by network bandwidth.
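The bandwidth cost follows directly from the shard counts described above. A minimal sketch, assuming equally sized shards and no read errors (the helper names are hypothetical):

```python
def shards_read(k, m, fast_read):
    # Normal read fetches the k data shards; fast read fetches all k+m.
    return k + m if fast_read else k


def read_amplification(k, m):
    # Ratio of shard traffic to the primary OSD: fast read vs. normal read.
    return shards_read(k, m, True) / shards_read(k, m, False)


if __name__ == "__main__":
    # For the k=4, m=2 pool discussed in this thread.
    print(read_amplification(4, 2))
```

For k=4, m=2 this gives a factor of 1.5, so on a cluster limited by network bandwidth the extra shard traffic can outweigh the latency gain.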
>
> On a side note, activating fast_read also appears to increase CPU load a bit, which is probably due to the EC calculations that need to be performed if the "wrong"
> shards arrive at the primary OSD first.
>
> I believe this also explains why an EC pool actually does remapping in a k=4 m=2 pool with failure domain host if one of 6 hosts goes down:
> Namely, to have the "k" shards available on the "up" OSDs. This answers an earlier question of mine.
>
>
> I don't quite understand what you're asking/saying here, but if an OSD gets marked out, all the PGs that used to rely on it will get another OSD unless you've instructed the cluster not to do so. The specifics of any given erasure code have nothing to do with it. :)
> -Greg
Ah, sorry, let me clarify.
The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.
So necessarily, there is one shard for each host. If one host goes down for a prolonged time,
there's no "logical" advantage of redistributing things - since whatever you do, with 5 hosts, all PGs will stay in degraded state anyway.
However, I noticed Ceph is remapping all PGs, and actively moving data. I presume now this is done for two reasons:
- The remapping is needed since the primary OSD might be the one which went down. But for remapping (I guess) there's no need to actually move data,
or is there?
- The data movement is done to have the "k" shards available.
If it's really the case that "all shards are equal", then data movement should not occur - or is this a bug / bad feature?
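The counting argument behind "all PGs will stay degraded" can be written down as a short sketch. It assumes one shard per host (failure domain host); the state labels are illustrative only and are not Ceph's actual PG state names, and settings such as the pool's min_size are ignored.

```python
def placeable_shards(k, m, hosts_up):
    # With failure domain "host", at most one shard fits on each host.
    return min(k + m, hosts_up)


def pg_state(k, m, hosts_up):
    placed = placeable_shards(k, m, hosts_up)
    if placed == k + m:
        return "full"         # every shard has a home
    if placed >= k:
        return "degraded"     # still readable, but redundancy is reduced
    return "unavailable"      # fewer than k shards can exist


if __name__ == "__main__":
    for hosts in (6, 5, 4, 3):
        print(hosts, "hosts up:", pg_state(4, 2, hosts))
```

With k=4, m=2 and only 5 hosts up, no rearrangement of the remaining shards can bring the count back to 6, which is why moving data around gains nothing in terms of redundancy.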
If you lose one OSD out of a host, Ceph is going to try and re-replicate the data onto the other OSDs in that host. Your PG size and the CRUSH rule instruct it that the PG needs 6 different OSDs, and those OSDs need to be placed on different hosts.
You're right that this gets very funny if your PG size is equal to the number of hosts. We generally discourage people from running configurations like that.
Or if you mean that you are losing a host, and the data is shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a result of EC pools' "indep" rather than "firstn" CRUSH rules?)
-Greg
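For readers unfamiliar with the "indep" versus "firstn" distinction mentioned in the parenthetical, here is a toy model of the difference in how a mapping reacts to one failed host. It is not the real CRUSH algorithm (there is no hashing, weighting, or retry logic, and no replacement host is chosen), and the function names are made up; it only illustrates that a "firstn"-style result closes the gap by shifting later entries, while an "indep"-style result leaves a hole so that surviving shards keep their positions. Whether this explains the data movement observed in this thread is not established here.

```python
def firstn_after_failure(mapping, failed):
    # firstn-style: survivors shift left to fill the gap.
    return [h for h in mapping if h != failed]


def indep_after_failure(mapping, failed):
    # indep-style: the failed position becomes a hole (None);
    # every other position is left untouched.
    return [None if h == failed else h for h in mapping]


def shifted_shards(before, after, failed):
    # Count surviving hosts whose position (shard index) changed.
    return sum(
        1 for h in before
        if h != failed and before.index(h) != after.index(h)
    )


if __name__ == "__main__":
    before = ["h1", "h2", "h3", "h4", "h5", "h6"]
    firstn = firstn_after_failure(before, "h3")
    indep = indep_after_failure(before, "h3")
    print("firstn:", firstn, "shifted:", shifted_shards(before, firstn, "h3"))
    print("indep: ", indep, "shifted:", shifted_shards(before, indep, "h3"))
```

Position matters for erasure-coded pools because each position corresponds to a specific shard, so a shift would mean a host suddenly being responsible for a different shard than the one it stores.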
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com