Re: fast_read in EC pools

Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> · Mon, 26 Feb 2018 20:06:10 +0100

On 26.02.2018 at 19:45, Gregory Farnum wrote:
> On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> 
>     On 26.02.2018 at 19:24, Gregory Farnum wrote:
>     > I don’t actually know this option, but based on your results it’s clear that “fast read” is telling the OSD it should issue reads to all k+m OSDs storing data and then reconstruct the data from the first k to respond. Without fast read, it simply asks the regular k data nodes to read their shards back and sends the reply. This is a straight trade-off of more bandwidth for lower long-tail latencies.
>     > -Greg
> 
>     Many thanks, this certainly explains it!
>     Apparently I misunderstood how a "normal" read works - I thought that all shards would be requested in any case, and that the primary OSD would check that the EC data is still consistent.
> 
> 
> Nope, EC PGs can self-validate (they checksum everything) and so extra shards are requested only if one of the OSDs has an error.
>  
> 
>     However, with the explanation that indeed only the actual "k" shards are read in the "normal" case, it's fully clear to me that "fast_read" will be slower for us,
>     since we are limited by network bandwidth.
> 
>     As a side note, activating fast_read also appears to increase CPU load a bit, which is probably due to the EC calculations that need to be performed if the "wrong"
>     shards arrive at the primary OSD first.
> 
>     I believe this also explains why an EC pool actually does remapping in a k=4 m=2 pool with failure domain host if one of 6 hosts goes down:
>     Namely, to have the "k" shards available on the "up" OSDs. This answers an earlier question of mine.
> 
> 
> I don't quite understand what you're asking/saying here, but if an OSD gets marked out all the PGs that used to rely on it will get another OSD unless you've instructed the cluster not to do so. The specifics of any given erasure code have nothing to do with it. :)
> -Greg

Ah, sorry, let me clarify. 
The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts. 
So necessarily, each host holds exactly one shard of every PG. If one host goes down for a prolonged time,
there's no "logical" advantage in redistributing things - since whatever you do, with only 5 hosts for 6 shards, all PGs will stay in the degraded state anyway.
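
The arithmetic behind this can be written down as a minimal sketch in Python. It is not Ceph code: the function name and the state labels are made up for illustration, it assumes that each of the k+m shards has to sit on a distinct host (failure domain host), and it ignores the pool's min_size setting.

```python
def pg_state(k, m, hosts_up):
    """Classify a PG of an EC pool with failure domain "host".

    Hypothetical helper, not part of Ceph. Assumes every shard needs its
    own host, so at most min(k + m, hosts_up) shards can be placed.
    The pool's min_size setting is deliberately ignored here.
    """
    shards_total = k + m
    shards_placeable = min(shards_total, hosts_up)
    if shards_placeable < k:
        return "unreadable"  # fewer than k shards: data cannot be reconstructed
    if shards_placeable < shards_total:
        return "degraded"    # still reconstructable, but with reduced redundancy
    return "clean"


# k=4 m=2 pool, with a shrinking number of hosts that are up
for hosts_up in (6, 5, 4, 3):
    print(hosts_up, pg_state(4, 2, hosts_up))
```

With 5 hosts up, only 5 of the 6 shards can be placed no matter how they are redistributed, so in this simplified model every PG stays degraded until the sixth host returns.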

However, I noticed Ceph is remapping all PGs and actively moving data. I now presume this is done for two reasons:
- The remapping is needed since the primary OSD might be the one which went down. But for remapping (I guess) there's no need to actually move data,
  or is there?
- The data movement is done to have the "k" shards available on the "up" OSDs.
If it's really the case that "all shards are equal", then data movement should not occur - or is this a bug / bad feature?

Cheers,
	Oliver
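
For reference, the read fan-out described in this thread can be sketched in Python as follows. This is only an illustration under the assumptions stated by Greg (a normal read requests the k data shards, fast_read requests all k+m); the function name is hypothetical, and the model ignores that one shard may be local to the primary OSD. The traffic numbers are the incoming-traffic figures from the monitoring data quoted in this thread.

```python
def shards_read(k, m, fast_read):
    """Shards the primary OSD requests for one read (hypothetical model).

    Assumption taken from the thread: a normal read asks only for the
    k data shards; fast_read asks for all k+m shards and decodes from
    the first k that arrive.
    """
    return k + m if fast_read else k


k, m = 4, 2
normal = shards_read(k, m, fast_read=False)
fast = shards_read(k, m, fast_read=True)
expected_ratio = fast / normal  # shard requests with fast_read vs. without

# Incoming traffic to all 6 OSD hosts in GB/s: fast_read = 1 vs. fast_read = 0
observed_ratio = 3.0 / 2.3

print(f"modelled shard-read ratio:       {expected_ratio:.2f}")
print(f"observed incoming-traffic ratio: {observed_ratio:.2f}")
```

The model only gives the per-read fan-out. On a cluster that is limited by network bandwidth, the measured increase does not have to match it, but it moves in the same direction.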

>  
> 
> 
>     Many thanks for clearing this up!
> 
>     Cheers,
>             Oliver
> 
>     > On Mon, Feb 26, 2018 at 3:57 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>     >
>     >     Some additional information gathered from our monitoring:
>     >     It seems fast_read does indeed become active immediately, but I do not understand the effect.
>     >
>     >     With fast_read = 0, we see:
>     >     ~ 5.2 GB/s total outgoing traffic from all 6 OSD hosts
>     >     ~ 2.3 GB/s total incoming traffic to all 6 OSD hosts
>     >
>     >     With fast_read = 1, we see:
>     >     ~ 5.1 GB/s total outgoing traffic from all 6 OSD hosts
>     >     ~ 3   GB/s total incoming traffic to all 6 OSD hosts
>     >
>     >     I would have expected exactly the contrary to happen...
>     >
>     >     Cheers,
>     >             Oliver
>     >
>     >     On 26.02.2018 at 12:51, Oliver Freyermuth wrote:
>     >     > Dear Cephalopodians,
>     >     >
>     >     > in the few remaining days in which we can still play with parameters at will,
>     >     > we just tried to set:
>     >     > ceph osd pool set cephfs_data fast_read 1
>     >     > but did not notice any effect on sequential, large file read throughput on our k=4 m=2 EC pool.
>     >     >
>     >     > Should this become active immediately? Or do OSDs need a restart first?
>     >     > Is the option already deemed safe?
>     >     >
>     >     > Or is it just that we should not expect any change on throughput, since our system (for large sequential reads)
>     >     > is purely limited by the IPoIB throughput, and the shards are nevertheless requested by the primary OSD?
>     >     > So the gain would not be in throughput, but the reply to the client would be slightly faster (before all shards have arrived)?
>     >     > Then this option would be mainly of interest if the disk IO was congested (which does not happen for us as of yet)
>     >     > and not help so much if the system is limited by network bandwidth.
>     >     >
>     >     > Cheers,
>     >     >       Oliver
>     >     >
>     >     >
>     >     >
>     >     > _______________________________________________
>     >     > ceph-users mailing list
>     >     > ceph-users@xxxxxxxxxxxxxx
>     >     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >     >
>     >
>     >
> 
> 
