Re: A problem with the CEPH PG state getting stuck

Hi,

your CLI output is barely readable, although it's probably not that relevant here. Apparently it's an EC pool you're referring to? A "pg repair" tries to repair inconsistent objects, see [1] for more details. I don't really know how to explain "repeer" either, I'm not a dev, so maybe someone from the Ceph team can explain it better. But as I understand it, a temporary new mapping to just the primary OSD is created, and that triggers something like a refresh:

      // map to just primary; it will map back to what it wants
      pending_inc.new_pg_temp[pgid] = { primary };
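
If you're curious, you can watch that temporary mapping appear and go away again around a repeer. This is just a rough sketch, nothing cluster-specific assumed except the PG ID 27.7c from your own output:

      # current up/acting mapping of the PG
      ceph pg map 27.7c

      # ask the PG to re-peer
      ceph pg repeer 27.7c

      # the temporary mapping shows up as a pg_temp entry in the osdmap
      # (it may only be there for a moment)
      ceph osd dump | grep 'pg_temp 27.7c'

Once peering is done the pg_temp entry goes away again and the PG maps back to its normal acting set, which is what the code comment above means by "it will map back to what it wants".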

So content-wise this doesn't really affect your PG, I think, it's just a refresh. Why your PGs get stuck (is it always the same one? Is a specific OSD involved in all of the cases?) is difficult to answer knowing so little about your cluster. Is the overall cluster health status okay? Are your OSDs (slow or fast drives?) highly utilized?
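
If you want to narrow it down, something like this would be a good start (all standard commands, no assumptions about your setup beyond the PG ID from your output):

      # overall cluster state and warnings
      ceph -s
      ceph health detail

      # all PGs currently trimming or waiting to trim
      ceph pg ls snaptrim
      ceph pg ls snaptrim_wait

      # per-OSD fill level and latency
      ceph osd df
      ceph osd perf

      # detailed state of one stuck PG
      ceph pg 27.7c query

If the same OSD keeps showing up as primary of the stuck PGs, or its latency in "ceph osd perf" stands out, that would be a good place to keep digging.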

[1] https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#more-information-on-pg-repair

Quoting 苏察哈尔灿 <2644294460@xxxxxx>:

My Ceph cluster sometimes gets stuck in the active+clean+snaptrim state during regular snapshot deletion, and the corresponding PGs do not change for a long time. For example:


27.7c    14350  0  0  0  38073876582  0  0  2864  3000  active+clean+snaptrim  9h  43777'677286  43777:1018248  [52,55,20,36,29,91,9,63,14,2]p52   [52,55,20,36,29,91,9,63,14,2]p52   2024-07-21T13:57:12.602903+0000  2024-07-20T11:02:06.953985+0000  52  queued for scrub
27.c1    14055  0  0  0  37408096205  0  0  2754  3000  active+clean+snaptrim  9h  43777'676887  43777:952875   [0,21,39,62,26,58,41,86,66,2]p0    [0,21,39,62,26,58,41,86,66,2]p0    2024-07-22T02:26:43.470883+0000  2024-07-20T05:09:23.763918+0000  44  periodic scrub scheduled @ 2024-07-23T07:46:50.084316+0000
27.19b   14711  0  0  0  38926389125  0  0  2765  3000  active+clean+snaptrim  9h  43777'698478  43777:849574   [5,7,29,14,61,10,22,0,19,37]p5     [5,7,29,14,61,10,22,0,19,37]p5     2024-07-22T03:28:39.036189+0000  2024-07-17T14:00:43.082766+0000  40  periodic scrub scheduled @ 2024-07-23T12:30:57.488823+0000
27.1a3   14323  0  0  0  37918899397  0  0  2930  3000  active+clean+snaptrim  9h  43777'675713  43777:943324   [52,4,10,46,20,69,49,39,44,30]p52  [52,4,10,46,20,69,49,39,44,30]p52  2024-07-21T11:15:16.442506+0000  2024-07-20T04:10:45.391550+0000  50  queued for scrub


No change for 9 hours. Then, after I ran the command "ceph pg repeer 27.7c", the corresponding PG state returned to normal. I don't know what this "repeer" command is for, does it have any effect on the PG? What's the difference between "repeer" and "ceph pg repair"? And why do the PGs get stuck so often? Thanks for your help!


My Ceph version is 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
