Re: fast_read in EC pools


 



On Mon, Feb 26, 2018 at 11:33 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
Am 26.02.2018 um 20:23 schrieb Gregory Farnum:
>
>
>     On Mon, Feb 26, 2018 at 11:06 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>
>     Am 26.02.2018 um 19:45 schrieb Gregory Farnum:
>     >     > On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>     >
>     >     Am 26.02.2018 um 19:24 schrieb Gregory Farnum:
>     >     > I don't actually know this option, but based on your results it's clear that "fast read" tells the OSD to issue reads to all k+m OSDs storing data and then reconstruct the data from the first k that respond. Without fast read it simply reads the regular k data shards and sends the reply back once they arrive. This is a straight trade-off of more bandwidth for lower long-tail latencies.
>     >     > -Greg
>     >
>     >     Many thanks, this certainly explains it!
>     >     Apparently I misunderstood how a "normal" read works - I thought that in any case all shards would be requested, and the primary OSD would check that the EC data is still consistent.
>     >
>     >
>     > Nope, EC PGs can self-validate (they checksum everything) and so extra shards are requested only if one of the OSDs has an error.
>     >  
>     >
>     >     However, with the explanation that indeed only the actual "k" shards are read in the "normal" case, it's fully clear to me that "fast_read" will be slower for us,
>     >     since we are limited by network bandwidth.
>     >
>     >     On a side note, activating fast_read also appears to increase CPU load a bit, which is probably due to the EC decoding that needs to be performed if the "wrong"
>     >     shards arrive at the primary OSD first.
>     >
>     >     I believe this also explains why an EC pool actually does remapping in a k=4 m=2 pool with failure domain host if one of 6 hosts goes down:
>     >     Namely, to have the "k" shards available on the "up" OSDs. This answers an earlier question of mine.
>     >
>     >
>     > I don't quite understand what you're asking/saying here, but if an OSD gets marked out all the PGs that used to rely on it will get another OSD unless you've instructed the cluster not to do so. The specifics of any given erasure code have nothing to do with it. :)
>     > -Greg
>
>     Ah, sorry, let me clarify.
>     The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.
>     So necessarily, there is one shard for each host. If one host goes down for a prolonged time,
>     there's no "logical" advantage to redistributing things - since whatever you do, with 5 hosts, all PGs will stay in a degraded state anyway.
>
>     However, I noticed Ceph is remapping all PGs, and actively moving data. I presume now this is done for two reasons:
>     - The remapping is needed since the primary OSD might be the one which went down. But for remapping (I guess) there's no need to actually move data,
>       or is there?
>     - The data movement is done to have the "k" shards available.
>     If it's really the case that "all shards are equal", then data movement should not occur - or is this a bug / bad feature?
>
>
> If you lose one OSD out of a host, Ceph is going to try to re-replicate the data onto the other OSDs in that host. Your PG size and the CRUSH rule instruct it that the PG needs 6 different OSDs, and those OSDs need to be placed on different hosts.
>
> You're right, that gets very funny if your PG size is equal to the number of hosts. We generally discourage people from running configurations like that.

Yes. k=4 with m=2 on 6 hosts (i.e. the ability to lose 2 hosts) would be our starting point - we may add more hosts later (not too soon, but it's not excluded that more will come in a year or so),
and migrating large EC pools to different settings still seems a bit messy.
We can't really afford to reduce the available storage significantly more in the current setup, and we would like to be able to lose one host (for example for an OS upgrade)
and then still lose a few disks in case they fail with bad timing.
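
For concreteness, a sketch of how a pool like this is set up and how fast_read is toggled (the profile and pool names below are just placeholders, and the PG count is only an example); with k=4 m=2, fast_read reads 6 shards instead of 4, i.e. roughly 1.5x the data per read:

  # EC profile: 4 data + 2 coding chunks, one chunk per host
  ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
  # create the EC pool using that profile
  ceph osd pool create ecpool 1024 1024 erasure ec42
  # check / toggle fast_read on the pool
  ceph osd pool get ecpool fast_read
  ceph osd pool set ecpool fast_read true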

>
> Or if you mean that you are losing a host, and the data is shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a result of EC pools' "indep" rather than "firstn" crush rules?)

They are indep, which I think is the default (no manual editing done). I thought the main goal of indep was exactly to reduce data movement.
Indeed, it's very funny that data is moved; it certainly does not help to increase redundancy ;-).
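
For reference, the EC rule as it appears in a decompiled crushmap looks roughly like this (illustrative only - the rule name, id and min/max_size here are guesses, not copied from our cluster); the "chooseleaf indep" step is what I'd expect to keep surviving shards in place:

  # ceph osd getcrushmap -o crushmap && crushtool -d crushmap -o crushmap.txt
  rule ecpool {
          id 1
          type erasure
          min_size 3
          max_size 6
          step set_chooseleaf_tries 5
          step set_choose_tries 100
          step take default
          step chooseleaf indep 0 type host
          step emit
  }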

Given that you're stuck in that state, you probably want to set the mon_osd_down_out_subtree_limit so that it doesn't mark out a whole host.
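
Something along these lines (not tested here, and I'm not sure injectargs applies this one live - it may report that a restart is needed):

  # ceph.conf on the monitor hosts:
  #   [mon]
  #   mon osd down out subtree limit = host
  # or try at runtime:
  ceph tell mon.* injectargs '--mon_osd_down_out_subtree_limit=host'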

Can you also share the output of "ceph osd crush dump"?
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
