Oh! So that's why data imbalance occurs in Ceph. I had completely
misunderstood Ceph's placement algorithm until just now. Thank you very
much for the detailed explanation :)

Sincerely,

On Thu, Jan 25, 2024 at 9:32 PM, Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
>
> On Thu, Jan 25, 2024 at 11:57, Henry lol <pub.virtualization@xxxxxxxxx> wrote:
> >
> > That's reasonable enough.
> > Actually, I expected the client to hold just thousands of
> > "PG-to-OSD" mappings.
>
> Yes, but mapping a filename to a PG is done with a pseudorandom algorithm.
>
> > Nevertheless, the full mapping is so heavy that the client
> > calculates locations on demand, right?
>
> Yes, and the client has an algorithm that makes it possible to know
> that PG 1.a4 should be on OSDs 4, 93 and 44, but also that if 4 is
> missing, the next candidate would be 51, and if 93 isn't up either,
> then 66 would be the next logical OSD to contact for that copy, and
> so on. Since all parts (client, mons, OSDs) run the same code, when
> OSD 4 dies, 51 knows it needs to get a copy from either 93 or 44, and
> as soon as that copy is made, the PG will stop being active+degraded
> but might possibly be active+remapped, since it knows it wants to go
> back to OSD 4 if it comes back with the same size again.
>
> > If a client with an outdated map sends a request to the wrong OSD,
> > does the OSD handle it somehow, through redirection or something?
>
> I think it would get told it has the wrong osdmap.
>
> > Lastly, is anything besides the CRUSH map, such as storage usage,
> > considered when doing CRUSH? It seems the target OSD set isn't
> > deterministic given the map alone.
>
> It doesn't take OSD usage into consideration except at creation time
> or on OSD in/out/reweight events (or manual displacements with upmap
> and so forth). This is why "ceph df" will tell you a pool has X free
> space, where X is "the smallest free space among the OSDs this pool
> lies on, times the number of OSDs". Given the pseudorandom placement
> of objects onto PGs, there is nothing to prevent you from having the
> worst luck ever, with every object you create ending up on the OSD
> with the least free space.
>
> --
> May the most significant bit of your life be positive.
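
To make the fallback behaviour concrete, here is a minimal toy sketch
of the idea, not the real CRUSH algorithm: every party derives the same
ordered candidate list for a PG from nothing but the PG id and the
shared OSD list, so clients, mons and OSDs all agree on who holds each
copy and on who is next in line when a candidate is down. All names and
numbers below are invented for illustration.

import hashlib

def candidate_osds(pg_id, all_osds):
    # Deterministic pseudorandom permutation of the OSD ids, seeded by
    # the PG id: every machine running this code computes the same order.
    return sorted(
        all_osds,
        key=lambda osd: hashlib.sha256(f"{pg_id}:{osd}".encode()).digest(),
    )

def acting_set(pg_id, all_osds, up, size=3):
    # The first `size` candidates that are actually up serve the PG; a
    # down OSD is simply skipped, so everyone independently agrees on
    # the next logical OSD to contact for that copy.
    return [osd for osd in candidate_osds(pg_id, all_osds) if osd in up][:size]

osds = list(range(100))
healthy = set(osds)
full = acting_set("1.a4", osds, healthy)
print("all OSDs up:   ", full)
print("first one down:", acting_set("1.a4", osds, healthy - {full[0]}))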
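
And a back-of-the-envelope version of the "ceph df" free-space
reasoning quoted above. The OSD ids and free-space figures are
invented, and the real calculation also accounts for the pool's
replication factor, which is omitted here to mirror the quoted formula:

# Hypothetical free space (GiB) on the OSDs this pool lies on.
free_per_osd = {4: 800, 44: 950, 51: 620, 66: 900, 93: 700}

# Pool free space as described in the mail: bounded by the fullest OSD,
# since pseudorandom placement may keep landing objects on it.
pool_free = min(free_per_osd.values()) * len(free_per_osd)
print(f"pool free space ~ {pool_free} GiB")  # 620 * 5 = 3100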