Re: A problem with the CEPH PG state getting stuck

Hi,

your CLI output is barely readable, although it's probably not that relevant here. Apparently it's an EC pool you're referring to? A "pg repair" tries to repair inconsistent objects, see [1] for more details. I don't really know how to explain "repeer" either, I'm not a dev, so maybe someone from the Ceph team can explain it better. But as I understand it, a temporary new mapping to just the primary OSD is created, and that triggers something like a refresh:

      // map to just primary; it will map back to what it wants
      pending_inc.new_pg_temp[pgid] = { primary };
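
If you're curious, you can watch that temporary mapping appear and go away again around a repeer. This is just a rough sketch, nothing cluster-specific assumed except the PG ID 27.7c from your own output:

      # current up/acting mapping of the PG
      ceph pg map 27.7c

      # ask the PG to re-peer
      ceph pg repeer 27.7c

      # the temporary mapping shows up as a pg_temp entry in the osdmap
      # (it may only be there for a moment)
      ceph osd dump | grep 'pg_temp 27.7c'

Once peering is done the pg_temp entry goes away again and the PG maps back to its normal acting set, which is what the code comment above means by "it will map back to what it wants".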

So content-wise this doesn't really affect your PG, I think, it's just a refresh. Why your PGs get stuck (is it always the same one? Is a specific OSD involved in all of the cases?) is difficult to answer knowing so little about your cluster. Is the overall cluster health status okay? Are your OSDs (slow or fast drives?) highly utilized?
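
If you want to narrow it down, something like this would be a good start (all standard commands, no assumptions about your setup beyond the PG ID from your output):

      # overall cluster state and warnings
      ceph -s
      ceph health detail

      # all PGs currently trimming or waiting to trim
      ceph pg ls snaptrim
      ceph pg ls snaptrim_wait

      # per-OSD fill level and latency
      ceph osd df
      ceph osd perf

      # detailed state of one stuck PG
      ceph pg 27.7c query

If the same OSD keeps showing up as primary of the stuck PGs, or its latency in "ceph osd perf" stands out, that would be a good place to keep digging.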

[1] https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#more-information-on-pg-repair

Quoting 苏察哈尔灿 <2644294460@xxxxxx>:

My Ceph cluster sometimes gets stuck in the active+clean+snaptrim state during regular snapshot deletion, and the corresponding PGs do not change for a long time. For example:


27.7c    14350  0  0  0  38073876582  0  0  2864  3000  active+clean+snaptrim  9h  43777'677286  43777:1018248  [52,55,20,36,29,91,9,63,14,2]p52   [52,55,20,36,29,91,9,63,14,2]p52   2024-07-21T13:57:12.602903+0000  2024-07-20T11:02:06.953985+0000  52  queued for scrub
27.c1    14055  0  0  0  37408096205  0  0  2754  3000  active+clean+snaptrim  9h  43777'676887  43777:952875   [0,21,39,62,26,58,41,86,66,2]p0    [0,21,39,62,26,58,41,86,66,2]p0    2024-07-22T02:26:43.470883+0000  2024-07-20T05:09:23.763918+0000  44  periodic scrub scheduled @ 2024-07-23T07:46:50.084316+0000
27.19b   14711  0  0  0  38926389125  0  0  2765  3000  active+clean+snaptrim  9h  43777'698478  43777:849574   [5,7,29,14,61,10,22,0,19,37]p5     [5,7,29,14,61,10,22,0,19,37]p5     2024-07-22T03:28:39.036189+0000  2024-07-17T14:00:43.082766+0000  40  periodic scrub scheduled @ 2024-07-23T12:30:57.488823+0000
27.1a3   14323  0  0  0  37918899397  0  0  2930  3000  active+clean+snaptrim  9h  43777'675713  43777:943324   [52,4,10,46,20,69,49,39,44,30]p52  [52,4,10,46,20,69,49,39,44,30]p52  2024-07-21T11:15:16.442506+0000  2024-07-20T04:10:45.391550+0000  50  queued for scrub


No change for 9 hours. Then, after I ran the command "ceph pg repeer 27.7c", the corresponding PG state returned to normal. I don't know what this "repeer" command is for, does it have any effect on the PG? What's the difference between "repeer" and "ceph pg repair"? And why do the PGs get stuck so often? Thanks for your help!


My Ceph version is 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
