Hey Alwin,
Thanks for your reply, answers inline.
> I'd assume (w/o the pool config) that the EC 2+1 is putting PGs inactive. Because for EC you need n-2 for redundancy and n-1 for availability.
Yeah, I guess this is likely related to the issue.
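For completeness, this is roughly how one can check the EC profile and
the pool's min_size (the pool and profile names below are just
placeholders, not our actual ones):

  # list all pools with size/min_size, pg_num and crush rule
  ceph osd pool ls detail
  # show k/m and the failure domain of the EC profile
  ceph osd erasure-code-profile get ecprofile-2-1
  # with k=2, m=1 the pool has size 3; if min_size is also 3 (= k+1),
  # a single OSD going down can already leave PGs inactive when the
  # missing shard cannot be recovered elsewhere
  ceph osd pool get ecpool min_size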
> The outputs got a bit mangled. Could you please provide them in a pastebin maybe?
Yes, of course. I had hoped that my mail would be displayed in a
monospace font by default :)
Here are the outputs:
`ceph -s` when we first noticed something was wrong:
https://pastebin.com/raw/FcUFB25D
`ceph -s` now:
https://pastebin.com/raw/rsLynw2V
`ceph osd df tree`:
https://pastebin.com/raw/kD9VEcLR
> Can you please post the crush rule and pool settings, to better understand the data distribution? And what do the logs show on one of the affected OSDs?
`ceph osd crush dump`:
https://pastebin.com/raw/DWEHcNaA
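In case it helps, I can also gather the per-rule dump, the pool
settings and the logs of one of the affected OSDs, along these lines
(the rule name and OSD id are placeholders; we run the OSDs under
systemd):

  # dump a single crush rule by name
  ceph osd crush rule dump ec-2-1-rule
  # pool settings incl. size, min_size, pg_num and crush_rule
  ceph osd pool ls detail
  # recent log of one affected OSD (systemd deployment assumed)
  journalctl -u ceph-osd@12 --since "2 hours ago"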
While trying to dig up a bit more information, I noticed that the mgr
web UI was down, which is why we failed the active mgr so that one of
the standbys would take over, without thinking much about it...
Lo and behold, this completely resolved the issue from one moment to
the next. `ceph -s` now returns 338 active+clean PGs, as expected and
desired...
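For the record, the failover itself was nothing more than something
like the following (the mgr name is a placeholder for the then-active
one):

  # show the currently active mgr and the standbys
  ceph mgr stat
  # fail the active mgr so that one of the standbys takes over
  ceph mgr fail mgr-host-a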
While we are naturally pretty happy that the problem resolved itself, it
would still be good to understand:
1. what caused this weird state, in which the `ceph -s` output did not
   match what was actually happening in the cluster,
2. how a mgr failover could change the `ceph -s` output, thereby fixing
   the above issue,
3. why `ceph osd df tree` reported a weird split state with only a few
   hosts contributing storage, and why it did not return any results
   when run on some other hosts (a sketch of how we could cross-check
   this is below).
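If it helps with (2) and (3), we can still run things against the
cluster; this is a rough sketch of what we would compare from several
hosts:

  # PG summary as aggregated by the active mgr
  ceph pg stat
  # which mgr is currently active
  ceph mgr dump | grep '"active_name"'
  # repeat from several hosts to see whether they still disagree
  ceph osd df tree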
Happy to hear your thoughts!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx