Re: Questions about the CRUSH details

Oh! That's why data imbalance occurs in Ceph.
I totally misunderstood Ceph's placement algorithm until just now.
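
To convince myself, here is a minimal sketch of how pseudorandom placement alone produces imbalance. It is not CRUSH: `stable_hash`, `object_to_pg`, `pg_to_osd` and `objects_per_osd` are hypothetical names, SHA-256 stands in for Ceph's real hash, and each PG gets a single replica to keep the toy small.

```python
import hashlib
from collections import Counter


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (a stand-in, not Ceph's actual hash)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(name: str, pg_num: int) -> int:
    """Map an object name to a PG index pseudorandomly."""
    return stable_hash(name) % pg_num


def pg_to_osd(pg: int, osds: list[int]) -> int:
    """Pick one OSD for a PG by highest hash score (toy rule, not CRUSH)."""
    return max(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"))


def objects_per_osd(num_objects: int, pg_num: int, osds: list[int]) -> Counter:
    """Count how many objects land on each OSD. Usage is never consulted."""
    counts = Counter({osd: 0 for osd in osds})
    for i in range(num_objects):
        pg = object_to_pg(f"obj-{i}", pg_num)
        counts[pg_to_osd(pg, osds)] += 1
    return counts


if __name__ == "__main__":
    counts = objects_per_osd(num_objects=10_000, pg_num=32, osds=list(range(8)))
    for osd, n in sorted(counts.items()):
        print(f"osd.{osd}: {n} objects")
```

Nothing in the loop looks at how full an OSD already is, so with few PGs per OSD the per-OSD counts can differ noticeably even though every step is deterministic.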

Thank you very much for your detailed explanation :)


On Thu, Jan 25, 2024 at 9:32 PM, Janne Johansson <> wrote:
> On Thu, Jan 25, 2024 at 11:57, Henry lol <pub.virtualization@xxxxxxxxx> wrote:
> >
> > It's reasonable enough.
> > Actually, I expected the client to have just thousands of
> > "PG-to-OSDs" mappings.
> Yes, but mapping an object name to a PG is done with a pseudorandom algorithm.
> > Nevertheless, it’s so heavy that the client calculates location on
> > demand, right?
> Yes, and I guess the client has some kind of algorithm that makes it
> possible to know that PG 1.a4 should be on OSD 4, 93, 44 but also if 4
> is missing, the next candidate would be 51, if 93 isn't up either then
> 66 would be the next logical OSD to contact for that copy and so on.
> Since all parts (client, mons, OSDs) have the same code, when osd 4
> dies, 51 knows it needs to get a copy from either 93 or 44 and as soon
> as that copy is made, the PG will stop being active+degraded but might
> possibly be active+remapped, since it knows it wants to go back to OSD
> 4 if it comes back with the same size again.
> > if the client with the outdated map sends a request to the wrong OSD,
> > then does the OSD handle it somehow through redirection or something?
> I think it would get told it has the wrong osdmap.
> > Lastly, are other factors such as storage usage, and not only the
> > CRUSH map, considered when running CRUSH?
> > I ask because the target OSD set doesn't seem to be deterministic given the CRUSH map alone.
> It doesn't take OSD usage into consideration except at creation time
> or OSD in/out/reweighting (or manual displacements with upmap and so
> forth), so this is why "ceph df" will tell you a pool has X free
> space, where X is "smallest free space on the OSDs on which this pool
> lies, times the number of OSDs". Given the pseudorandom placement of
> objects to PGs, there is nothing to prevent you from having the worst
> luck ever and all the objects you create end up on the OSD with least
> free space.
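
The two ideas above (every party computing the same ordered list of candidate OSDs for a PG, and the free-space estimate being "smallest free space times the number of OSDs") can be sketched as follows. This is an illustration under assumptions, not Ceph code: it uses highest-score hashing in place of CRUSH, the names `score`, `ranked_candidates`, `acting_set` and `pool_free_estimate` are hypothetical, and the real "ceph df" calculation likely takes more factors into account than the simplified rule quoted here.

```python
import hashlib


def score(pg_id: str, osd: int) -> int:
    """Deterministic score for a (PG, OSD) pair; SHA-256 is a stand-in hash."""
    digest = hashlib.sha256(f"{pg_id}/{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def ranked_candidates(pg_id: str, osds: list[int]) -> list[int]:
    """Every OSD ordered by preference for this PG (same result everywhere)."""
    return sorted(osds, key=lambda osd: score(pg_id, osd), reverse=True)


def acting_set(pg_id: str, osds: list[int], up: set[int], size: int = 3) -> list[int]:
    """First `size` candidates that are currently up; down OSDs are skipped."""
    return [osd for osd in ranked_candidates(pg_id, osds) if osd in up][:size]


def pool_free_estimate(free_bytes_by_osd: dict[int, int]) -> int:
    """Simplified rule from the thread: smallest free space times OSD count."""
    return min(free_bytes_by_osd.values()) * len(free_bytes_by_osd)


if __name__ == "__main__":
    osds = list(range(10))
    healthy = acting_set("1.a4", osds, up=set(osds))
    failed = healthy[0]
    degraded = acting_set("1.a4", osds, up=set(osds) - {failed})
    print("healthy:", healthy)
    print(f"after osd.{failed} fails:", degraded)
    print("free estimate:", pool_free_estimate({0: 100, 1: 40, 2: 70}))
```

Because the ranking depends only on the PG id and the OSD list, a client, a mon and an OSD running this code would all agree on the next candidate when one OSD goes down, without exchanging a lookup table.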
