Re: Questions about the CRUSH details

Janne Johansson <icepic.dz@xxxxxxxxx> · Thu, 25 Jan 2024 13:32:06 +0100

Den tors 25 jan. 2024 kl 11:57 skrev Henry lol <pub.virtualization@xxxxxxxxx>:
>
> It's reasonable enough.
> actually, I expected the client to have just? thousands of
> "PG-to-OSDs" mappings.

Yes, but the object-name-to-PG mapping is computed with a pseudorandom
algorithm (a hash of the name), not looked up in a table.
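
To make the name-to-PG step concrete, here is a minimal sketch. It is not Ceph's code: zlib.crc32 and the plain modulo are stand-ins I picked for illustration (Ceph's real hash function and its handling of pg_num differ), and object_to_pg is a hypothetical helper name.

```python
import zlib

def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Map an object name to a PG id string such as '1.a4' (toy version)."""
    # Stand-in hash: CRC32 is used here only because it is deterministic
    # and in the standard library. Ceph uses its own hash function.
    h = zlib.crc32(object_name.encode("utf-8"))
    # Plain modulo for simplicity (an assumption of this sketch).
    pg = h % pg_num
    # PG ids are conventionally written as <pool>.<hex pg number>.
    return f"{pool_id}.{pg:x}"

# The same name always lands in the same PG, with no lookup table needed.
pg_a = object_to_pg(1, "myfile.bin", 256)
pg_b = object_to_pg(1, "myfile.bin", 256)
print(pg_a, pg_a == pg_b)
```

The point is only that the client needs the pool's PG count and the object name, nothing per-object.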

> Nevertheless, it’s so heavy that the client calculates location on
> demand, right?

Yes, and I guess the client has some kind of algorithm that lets it
know that PG 1.a4 should be on OSDs 4, 93 and 44, but also that if 4
is missing the next candidate would be 51, and if 93 isn't up either
then 66 would be the next logical OSD to contact for that copy, and so
on. Since all parts (client, mons, OSDs) run the same code, when OSD 4
dies, 51 knows it needs to get a copy from either 93 or 44. As soon as
that copy is made, the PG will stop being active+degraded, but it
might be active+remapped instead, since it knows it wants to go back
to OSD 4 if that comes back with the same size again.
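
A rough way to picture "everyone computes the same candidate list" is the sketch below. It uses rendezvous (highest-random-weight) hashing, which is NOT CRUSH: there are no weights, buckets or failure domains here. The names score and place are hypothetical, and the OSD numbers it prints will not match the ones in my example above.

```python
import hashlib

def score(pg_id: str, osd: int) -> int:
    """Deterministic pseudorandom score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(pg_id: str, osds: list[int], up: set[int], replicas: int = 3) -> list[int]:
    """Rank every OSD for this PG, then keep the first `replicas` that are up."""
    ranked = sorted(osds, key=lambda o: score(pg_id, o), reverse=True)
    return [o for o in ranked if o in up][:replicas]

all_osds = list(range(100))
healthy = place("1.a4", all_osds, set(all_osds))

# Take the first OSD of the set down; the survivors keep their copies
# and the next candidate in the ranking joins the set.
failed = healthy[0]
degraded = place("1.a4", all_osds, set(all_osds) - {failed})
print(healthy, "->", degraded)
```

Any party that knows the PG id and which OSDs are up gets the same answer without asking anyone, which is the property that matters for the discussion.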

> if the client with the outdated map sends a request to the wrong OSD,
> then does the OSD handle it somehow through redirection or something?

I think it would get told it has the wrong osdmap.
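
As a toy model of what I mean (hypothetical names throughout, not the real Ceph messages or client library): every request carries the osdmap epoch it was computed from, and the OSD compares that with its own.

```python
from dataclasses import dataclass

@dataclass
class Reply:
    ok: bool
    current_epoch: int

def osd_handle(request_epoch: int, osd_epoch: int) -> Reply:
    """Toy OSD: refuse requests computed from an older map, report the newer epoch."""
    if request_epoch < osd_epoch:
        return Reply(ok=False, current_epoch=osd_epoch)
    return Reply(ok=True, current_epoch=osd_epoch)

def client_send(client_epoch: int, osd_epoch: int) -> tuple[int, int]:
    """Return (attempts, final client epoch) for one logical request."""
    attempts = 0
    while True:
        attempts += 1
        reply = osd_handle(client_epoch, osd_epoch)
        if reply.ok:
            return attempts, client_epoch
        # Assumption of this sketch: the client catches up to the newer
        # map, recomputes the placement and retries.
        client_epoch = reply.current_epoch

print(client_send(10, 12))
```

In this model a stale client costs one extra round trip rather than a redirect to another OSD.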

> Lastly, not only CRUSH map but also other factors like storage usage
> are considered when doing CRUSH?
> because it seems that the target OSD set isn’t deterministic given only it.

It doesn't take OSD usage into consideration except at creation time
or on OSD in/out/reweighting (or manual displacements with upmap and
so forth), which is why "ceph df" will tell you a pool has X free
space, where X is "smallest free space on the OSDs on which this pool
lies, times the number of OSDs". Given the pseudorandom placement of
objects to PGs, there is nothing to prevent you from having the worst
luck ever, with all the objects you create ending up on the OSD with
the least free space.
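
The arithmetic behind that X, worked through with made-up numbers. This follows only the simplified rule as I stated it above; the figure "ceph df" actually prints takes more into account, and pool_free_estimate is a hypothetical helper.

```python
def pool_free_estimate(free_per_osd: dict[int, int]) -> int:
    """Pessimistic estimate: smallest free space times the number of OSDs."""
    return min(free_per_osd.values()) * len(free_per_osd)

# Hypothetical free space in GiB on the three OSDs a pool lives on.
free_gib = {4: 800, 93: 650, 44: 120}

estimate = pool_free_estimate(free_gib)  # 120 * 3
total = sum(free_gib.values())           # 800 + 650 + 120

print(f"estimate: {estimate} GiB, raw sum: {total} GiB")
```

The fullest OSD drags the estimate down to 360 GiB even though 1570 GiB is free in total, which is why one unbalanced OSD makes the whole pool look small.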

-- 
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx