Re: why does 3 copies take so much more time than 2?

Hi Charles,


Going from 40s to 4.5m seems excessive to me at least.  Can you tell if the drives or OSDs are hitting their limits?  Tools like iostat, sar, or collectl might help.
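
For example, while the untar is running you could watch the data drives on each OSD host; sustained ~100% utilization or high await on the HDDs would point at the drives themselves (just an illustrative invocation, adjust to taste):

# extended per-device statistics, refreshed every 2 seconds
iostat -x 2
# or, with sar from the sysstat package, one sample per second for 60 seconds
sar -d 1 60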


Longer answer: There are a couple of potential issues.  One is that you are bound by the latency of writing the slowest copy of the data.  I.e., let's say that you have a 25% chance of a slow write each time you write a copy of the data.  Depending on the replication factor, that raises the chance that some replica write slows down the whole write (there's a one-liner reproducing these numbers after the list):


1x: 25%

2x: 100% - (100%-25%)^2 = 43.75%

3x: 100% - (100%-25%)^3 = 57.8%
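
More generally, with a per-copy slow-write probability p and replication factor n, the chance that at least one copy is slow is 1 - (1-p)^n.  A purely illustrative one-liner reproducing the numbers above:

awk 'BEGIN { p = 0.25; for (n = 1; n <= 3; n++) printf "%dx: %.2f%%\n", n, (1 - (1 - p)^n) * 100 }'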


That only tells part of the story though.  In the 2x and 3x cases, you are not just dealing with a potentially higher probability of hitting a high-latency event, you are also working the system harder at the same time.  There's more work for the drives, more metadata for RocksDB, more network traffic, and more work for the async msgr threads.  If you are using multiple active/active MDSes, the behavior of the dynamic subtree partitioning can be somewhat volatile as well.  The trick is likely going to be figuring out what is holding you back and whether it's a local phenomenon (a slow drive or node) or a global one.
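
To narrow that down, a few standard commands (nothing specific to your cluster) can help spot whether one OSD is the outlier or everything is uniformly busy:

# per-OSD commit/apply latency; one OSD standing out suggests a slow drive or node
ceph osd perf
# per-OSD utilization and PG counts
ceph osd df tree
# any slow-ops or other warnings the cluster is currently reporting
ceph health detail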


Mark


On 1/4/23 13:32, Charles Hedrick wrote:
I'm testing cephfs. I have 3 nodes, with 2 hard disks and one ssd on each. cephfs is set to put metadata on ssd and data on hdd.

With the two pools set to size = 3, untar'ing a 19 GB tar file containing 90K files takes 4.5 minutes.
With size = 2, it takes 40 sec.  (The tar file itself is stored on an in-memory file system.)

Is that expected?

This is the current version of ceph, deployed with cephadm. The only non-default setup is allocating metadata to ssd and data to hdd.

   data_devices:
     rotational: 1
   db_devices:
     rotational: 0

ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd pool set cephfs.main.data crush_rule replicated_hdd
ceph osd pool set cephfs.main.meta crush_rule replicated_ssd
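
The assignments can be double-checked afterwards with, for example:

ceph osd pool get cephfs.main.data crush_rule
ceph osd pool get cephfs.main.data size
ceph osd pool get cephfs.main.meta crush_rule
ceph osd pool get cephfs.main.meta size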





_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

