Hi Sage,

I wrote a little program to show where objects move when a new disk (disk 10 below) is added, and it looks like this:

            00     01     02     03     04     05     06     07     08     09     10
    00:      0     14     17     14     19     23     13     22     21     20   1800
    01:     12      0     11     13     19     19     15     10     16     17   1841
    02:     17     27      0     17     15     15     13     19     18     11   1813
    03:     14     17     15      0     23     11     20     15     23     17   1792
    04:     14     18     16     25      0     27     13      8     15     16   1771
    05:     19     16     22     25     13      0      9     19     21     21   1813
    06:     18     15     21     17     10     18      0     10     18     11   1873
    07:     13     17     22     13     16     17     14      0     25     12   1719
    08:     23     20     16     17     19     18     11     12      0     18   1830
    09:     14     20     15     17     12     16     17     11     13      0   1828
    10:      0      0      0      0      0      0      0      0      0      0      0
    before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0
    after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080

Each line shows how many objects moved from a given disk to the others after disk 10 was added. Most objects go to the new disk, and around 1% go to each of the other disks. The "before" and "after" lines show how many objects are mapped to each disk. All disks have the same weight, and the map uses two replicas and straw2. Does that look right?

Cheers

On 02/13/2017 03:21 PM, Sage Weil wrote:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> Hi,
>>
>> Dan van der Ster reached out to colleagues and friends, and Pedro
>> López-Adeva Fernández-Layos came up with a well-written analysis of the
>> problem and a tentative solution, which he described at:
>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>
>> Unless I'm reading the document incorrectly (very possible ;) it also
>> means that the probability of each disk needs to take into account the
>> weight of all disks. Which means that whenever a disk is added or
>> removed, or its weight is changed, this has an impact on the probability
>> of all disks in the cluster, and objects are likely to move everywhere.
>> Am I mistaken?
>
> Maybe (I haven't looked closely at the above yet).
> But for comparison, in the normal straw2 case, adding or removing a disk
> also changes the probabilities for everything else (e.g., removing one
> out of 10 identical disks changes the probability from 1/10 to 1/9). The
> key property that straw2 *is* able to handle is that as long as the
> relative probabilities between two unmodified disks do not change, then
> straw2 will avoid moving any objects between them (i.e., all data
> movement is to or from the disk that is reweighted).
>
> sage
>
>
>>
>> Cheers
>>
>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>> This is a longstanding bug,
>>>
>>> http://tracker.ceph.com/issues/15653
>>>
>>> that causes low-weighted devices to get more data than they should. Loic's
>>> recent activity resurrected discussion on the original PR
>>>
>>> https://github.com/ceph/ceph/pull/10218
>>>
>>> but since it's closed and almost nobody will see it I'm moving the
>>> discussion here.
>>>
>>> The main news is that I have a simple adjustment for the weights that
>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>> pretty simple, although as with most probabilities it tends to make my
>>> brain hurt.
>>>
>>> The idea is that, on the second round, the original weight for the small
>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>> given b) is P(a && b) / P(b),
>>>
>>> P(pick small | first pick not small)
>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>
>>> The last term is easy to calculate,
>>>
>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>
>>> and the && term is the distribution we're trying to produce. For example,
>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>> their second replica be the small OSD. So
>>>
>>> P(pick small && first pick not small) = small_weight / total_weight
>>>
>>> Putting those together,
>>>
>>> P(pick small | first pick not small)
>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>  = small_weight / (total_weight - small_weight)
>>>
>>> That is, on the second round, we should adjust the weights by the above so
>>> that we get the right distribution of second choices. It turns out it
>>> works to adjust *all* weights like this to get the conditional probability
>>> that they weren't already chosen.
>>>
>>> I have a branch that hacks this into straw2 and it appears to work
>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>> current code, you get
>>>
>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>   device 0: 19765965 [9899364,9866601]
>>>   device 1: 19768033 [9899444,9868589]
>>>   device 2: 19769938 [9901770,9868168]
>>>   device 3: 19766918 [9898851,9868067]
>>>   device 6:   929148 [400572,528576]
>>>
>>> which is very close for the first replica (primary), but way off for the
>>> second. With my hacky change,
>>>
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>   device 0: 19797315 [9899364,9897951]
>>>   device 1: 19799199 [9899444,9899755]
>>>   device 2: 19801016 [9901770,9899246]
>>>   device 3: 19797906 [9898851,9899055]
>>>   device 6:   804566 [400572,403994]
>>>
>>> which is quite close, but still skewing slightly high (by a bit less than
>>> 1%).
>>>
>>> Next steps:
>>>
>>>  1- generalize this for >2 replicas
>>>  2- figure out why it skews high
>>>  3- make this work for multi-level hierarchical descent
>>>
>>> sage
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre

-- 
Loïc Dachary, Artisan Logiciel Libre
#include "mapper.h"
#include "builder.h"
#include "crush.h"
#include "hash.h"

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUMBER_OF_OBJECTS 100000

/* Build a one-level map (root -> hosts_count hosts, one disk each, equal
 * weights) and record, for each object, the disks chosen for each replica. */
void map_with_crush(int replication_count, int hosts_count,
                    int object_map[][NUMBER_OF_OBJECTS])
{
  struct crush_map *m = crush_create();
  m->choose_local_tries = 0;
  m->choose_local_fallback_tries = 0;
  m->choose_total_tries = 50;
  m->chooseleaf_descend_once = 1;
  m->chooseleaf_vary_r = 1;
  m->chooseleaf_stable = 1;
  m->allowed_bucket_algs =
    (1 << CRUSH_BUCKET_UNIFORM) |
    (1 << CRUSH_BUCKET_LIST) |
    (1 << CRUSH_BUCKET_STRAW2);

  int root_type = 1;
  int host_type = 2;
  int bucketno = 0;
  int hosts[hosts_count];
  int weights[hosts_count];
  int disk = 0;
  for (int host = 0; host < hosts_count; host++) {
    struct crush_bucket *b;
    b = crush_make_bucket(m, CRUSH_BUCKET_STRAW2, CRUSH_HASH_DEFAULT,
                          host_type, 0, NULL, NULL);
    assert(b != NULL);
    assert(crush_bucket_add_item(m, b, disk, 0x10000) == 0);
    assert(crush_add_bucket(m, 0, b, &bucketno) == 0);
    hosts[host] = bucketno;
    weights[host] = 0x10000;
    disk++;
  }

  struct crush_bucket *root;
  int bucket_root;
  root = crush_make_bucket(m, CRUSH_BUCKET_STRAW2, CRUSH_HASH_DEFAULT,
                           root_type, hosts_count, hosts, weights);
  assert(root != NULL);
  assert(crush_add_bucket(m, 0, root, &bucket_root) == 0);
  assert(crush_reweight_bucket(m, root) == 0);

  struct crush_rule *r;
  int minsize = 1;
  int maxsize = 5;
  int number_of_steps = 3;
  r = crush_make_rule(number_of_steps, 0, 0, minsize, maxsize);
  assert(r != NULL);
  crush_rule_set_step(r, 0, CRUSH_RULE_TAKE, bucket_root, 0);
  crush_rule_set_step(r, 1, CRUSH_RULE_CHOOSELEAF_FIRSTN,
                      replication_count, host_type);
  crush_rule_set_step(r, 2, CRUSH_RULE_EMIT, 0, 0);
  int ruleno = crush_add_rule(m, r, -1);
  assert(ruleno >= 0);
  crush_finalize(m);

  {
    int result[replication_count];
    __u32 weights[hosts_count];
    for (int i = 0; i < hosts_count; i++)
      weights[i] = 0x10000;
    int cwin_size = crush_work_size(m, replication_count);
    char cwin[cwin_size];
    crush_init_workspace(m, cwin);
    for (int x = 0; x < NUMBER_OF_OBJECTS; x++) {
      memset(result, '\0', sizeof(int) * replication_count);
      assert(crush_do_rule(m, ruleno, x, result, replication_count,
                           weights, hosts_count, cwin) == replication_count);
      for (int i = 0; i < replication_count; i++)
        object_map[i][x] = result[i];
    }
  }
  crush_destroy(m);
}

/* Return 1 if the object maps to the same set of disks before and after,
 * regardless of replica order. */
int same_set(int object, int replication_count,
             int before[][NUMBER_OF_OBJECTS], int after[][NUMBER_OF_OBJECTS])
{
  for (int r = 0; r < replication_count; r++) {
    int found = 0;
    for (int s = 0; s < replication_count; s++)
      if (before[r][object] == after[s][object]) {
        found = 1;
        break;
      }
    if (!found)
      return 0;
  }
  return 1;
}

/* Map all objects with hosts_count hosts, then with one more host, and
 * print the per-disk movement matrix plus before/after counts. */
void with_crush(int replication_count, int hosts_count)
{
  int before[replication_count][NUMBER_OF_OBJECTS];
  map_with_crush(replication_count, hosts_count, &before[0]);
  int after[replication_count][NUMBER_OF_OBJECTS];
  map_with_crush(replication_count, hosts_count + 1, &after[0]);

  int movement[hosts_count + 1][hosts_count + 1];
  memset(movement, '\0', sizeof(movement));
  int count_before[hosts_count + 1];
  memset(count_before, '\0', sizeof(count_before));
  int count_after[hosts_count + 1];
  memset(count_after, '\0', sizeof(count_after));

  for (int object = 0; object < NUMBER_OF_OBJECTS; object++) {
    // if (same_set(object, replication_count, &before[0], &after[0]))
    //   continue;
    for (int replica = 0; replica < replication_count; replica++) {
      count_before[before[replica][object]]++;
      count_after[after[replica][object]]++;
      if (before[replica][object] == after[replica][object])
        continue;
      movement[before[replica][object]][after[replica][object]]++;
    }
  }

  printf("    ");
  for (int host = 0; host < hosts_count + 1; host++)
    printf("    %02d ", host);
  printf("\n");
  for (int from = 0; from < hosts_count + 1; from++) {
    printf("%02d: ", from);
    for (int to = 0; to < hosts_count + 1; to++)
      printf("%6d ", movement[from][to]);
    printf("\n");
  }
  printf("before: ");
  for (int host = 0; host < hosts_count + 1; host++)
    printf("%6d ", count_before[host]);
  printf("\n");
  printf("after:  ");
  for (int host = 0; host < hosts_count + 1; host++)
    printf("%6d ", count_after[host]);
  printf("\n");
}

int main(int argc, char *argv[])
{
  if (argc != 3) {
    fprintf(stderr, "usage: %s <replication_count> <hosts_count>\n", argv[0]);
    return 1;
  }
  int replication_count = atoi(argv[1]);
  int hosts_count = atoi(argv[2]);
  with_crush(replication_count, hosts_count);
  return 0;
}

/*
 * Local Variables:
 * compile-command: "gcc -g -o compare compare.c $(pkg-config --cflags --libs libcrush) && ./compare 2 10"
 * End:
 */