Thank you, Frank. All disks are HDDs. I would like to know if I can increase the number of PGs live in production without a negative impact on the cluster; if yes, which commands should I use?

Thank you very much for your prompt reply.

Michel

On Mon, Jan 29, 2024 at 10:59 AM Frank Schilder <frans@xxxxxx> wrote:

> Hi Michel,
>
> are your OSDs HDD or SSD? If they are HDD, it's possible that they can't handle the deep-scrub load with default settings. In that case, have a look at this post
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/
> for some basic tuning info and a script to check your scrub stamp distribution.
>
> You should also get rid of slow/failing disks. Look at smartctl output and throw out disks with remapped sectors, uncorrectable r/w errors or unusually many corrected read-write errors (assuming you have disks with ECC).
>
> A basic calculation for deep-scrub is as follows: the max number of PGs that can be scrubbed at the same time is A = #OSDs / replication factor (rounded down). Take B = the deep-scrub times from the OSD logs (grep for deep-scrub), in minutes. Your pool can deep-scrub at most A*24*(60/B) PGs per day. For reasonable operations you should not do more than 50% of that. With that you can calculate how many days it needs to deep-scrub all your PGs.
>
> The usual reason for slow deep-scrub progress is too few PGs. With replication factor 3 and 48 OSDs you have a PG budget of ca. 3200 (ca. 200/OSD) but use only 385. You should consider increasing the PG count for pools with lots of data. This should already relax the situation somewhat. Then do the calculation above and tune deep-scrub times per pool such that they match the disk performance.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Michel Niyoyita <micou12@xxxxxxxxx>
> Sent: Monday, January 29, 2024 7:42 AM
> To: E Taka
> Cc: ceph-users
> Subject: Re: 6 pgs not deep-scrubbed in time
>
> Now they are increasing. On Friday I tried to deep-scrub them manually and they completed successfully, but on Monday morning I found that they had increased to 37. Is it best to deep-scrub manually while we are using the cluster? If not, what is the best way to address this?
>
> Best Regards.
>
> Michel
>
> ceph -s
>   cluster:
>     id:     cb0caedc-eb5b-42d1-a34f-96facfda8c27
>     health: HEALTH_WARN
>             37 pgs not deep-scrubbed in time
>
>   services:
>     mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
>     mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
>     osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
>     rgw: 6 daemons active (6 hosts, 1 zones)
>
>   data:
>     pools:   10 pools, 385 pgs
>     objects: 6.00M objects, 23 TiB
>     usage:   151 TiB used, 282 TiB / 433 TiB avail
>     pgs:     381 active+clean
>              4   active+clean+scrubbing+deep
>
>   io:
>     client:   265 MiB/s rd, 786 MiB/s wr, 3.87k op/s rd, 699 op/s wr
>
> On Sun, Jan 28, 2024 at 6:14 PM E Taka <0etaka0@xxxxxxxxx> wrote:
>
> > OSD 22 is there more often than the others. Other operations may be blocked because a deep-scrub has not finished yet. I would remove OSD 22, just to be sure about this: ceph orch osd rm osd.22
> >
> > If this does not help, just add it again.
> >
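On the question at the top of the thread (which commands raise the PG count live): a minimal sketch, assuming the data-heavy pool behind the 6.x PGs is named "volumes" and that 1024 is a suitable target within Frank's ~3200-PG budget; both the pool name and the target are assumptions, and the real names and current pg_num come from "ceph osd pool ls detail".

    # pool name "volumes" and target 1024 are assumptions, not values from this thread
    ceph osd pool ls detail                            # show pool names and current pg_num/pgp_num
    ceph osd pool set volumes pg_autoscale_mode off    # optional: stop the autoscaler from undoing a manual change
    ceph osd pool set volumes pg_num 1024              # since Nautilus the split is applied gradually and pgp_num follows
    ceph config set osd osd_max_backfills 1            # 1 is already the Pacific default; raise only if the split is too slow
    ceph -s                                            # watch remapped/backfilling PGs until everything is active+clean again

Because the mons/mgr throttle the change so that only a small fraction of data is misplaced at a time, the split proceeds over hours to days rather than all at once, which is what makes it reasonable to run on a live cluster; doing it outside peak client load is still the cautious choice.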
> > On Fri, 26 Jan 2024 at 08:05, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
> >
> >> It seems that these are different OSDs, as shown here. How have you managed to sort this out?
> >>
> >> ceph pg dump | grep -F 6.78
> >> dumped all
> >> 6.78 44268 0 0 0 0 178679640118 0 0 10099 10099 active+clean 2024-01-26T03:51:26.781438+0200 107547'115445304 107547:225274427 [12,36,37] 12 [12,36,37] 12 106977'114532385 2024-01-24T08:37:53.597331+0200 101161'109078277 2024-01-11T16:07:54.875746+0200 0
> >> root@ceph-osd3:~# ceph pg dump | grep -F 6.60
> >> dumped all
> >> 6.60 44449 0 0 0 0 179484338742 716 36 10097 10097 active+clean 2024-01-26T03:50:44.579831+0200 107547'153238805 107547:287193139 [32,5,29] 32 [32,5,29] 32 107231'152689835 2024-01-25T02:34:01.849966+0200 102171'147920798 2024-01-13T19:44:26.922000+0200 0
> >> 6.3a 44807 0 0 0 0 180969005694 0 0 10093 10093 active+clean 2024-01-26T03:53:28.837685+0200 107547'114765984 107547:238170093 [22,13,11] 22 [22,13,11] 22 106945'113739877 2024-01-24T04:10:17.224982+0200 102863'109559444 2024-01-15T05:31:36.606478+0200 0
> >> root@ceph-osd3:~# ceph pg dump | grep -F 6.5c
> >> 6.5c 44277 0 0 0 0 178764978230 0 0 10051 10051 active+clean 2024-01-26T03:55:23.339584+0200 107547'126480090 107547:264432655 [22,37,30] 22 [22,37,30] 22 107205'125858697 2024-01-24T22:32:10.365869+0200 101941'120957992 2024-01-13T09:07:24.780936+0200 0
> >> dumped all
> >> root@ceph-osd3:~# ceph pg dump | grep -F 4.12
> >> dumped all
> >> 4.12 0 0 0 0 0 0 0 0 0 0 active+clean 2024-01-24T08:36:48.284388+0200 0'0 107546:152711 [22,19,7] 22 [22,19,7] 22 0'0 2024-01-24T08:36:48.284307+0200 0'0 2024-01-13T09:09:22.176240+0200 0
> >> root@ceph-osd3:~# ceph pg dump | grep -F 10.d
> >> dumped all
> >> 10.d 0 0 0 0 0 0 0 0 0 0 active+clean 2024-01-24T04:04:33.641541+0200 0'0 107546:142651 [14,28,1] 14 [14,28,1] 14 0'0 2024-01-24T04:04:33.641451+0200 0'0 2024-01-12T08:04:02.078062+0200 0
> >> root@ceph-osd3:~# ceph pg dump | grep -F 5.f
> >> dumped all
> >> 5.f 0 0 0 0 0 0 0 0 0 0 active+clean 2024-01-25T08:19:04.148941+0200 0'0 107546:161331 [11,24,35] 11 [11,24,35] 11 0'0 2024-01-25T08:19:04.148837+0200 0'0 2024-01-12T06:06:00.970665+0200 0
> >>
> >> On Fri, Jan 26, 2024 at 8:58 AM E Taka <0etaka0@xxxxxxxxx> wrote:
> >>
> >>> We had the same problem. It turned out that one disk was slowly dying. It was easy to identify with these commands (in your case):
> >>>
> >>> ceph pg dump | grep -F 6.78
> >>> ceph pg dump | grep -F 6.60
> >>> …
> >>>
> >>> These commands show the OSDs of a PG in square brackets. If the same number is always there, then you've found the OSD which is causing the slow scrubs.
> >>>
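E Taka's check can also be done in one pass. A rough sketch of the same idea, using the six PG IDs from the health warning below; "ceph pg map" prints both the up and the acting set, so every OSD is counted twice, which does not change the ranking:

    # count how often each OSD appears in the up/acting sets of the affected PGs
    for pg in 6.78 6.60 6.5c 4.12 10.d 5.f; do
        ceph pg map $pg                # prints "... up [a,b,c] acting [a,b,c]"
    done | grep -o '\[[0-9,]*\]' | tr -d '[]' | tr ',' '\n' | sort -n | uniq -c | sort -rn

An OSD that tops this list by a clear margin (osd.22 in the dumps above) is the first candidate for the smartctl inspection Frank describes and, if the disk looks bad, for replacement.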
> >>> On Fri, 26 Jan 2024 at 07:45, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
> >>>
> >>>> Hello team,
> >>>>
> >>>> I have a cluster in production composed of 3 OSD servers with 20 disks each, deployed using ceph-ansible on Ubuntu; the version is Pacific. These days it is in WARN state caused by PGs which are not deep-scrubbed in time. I tried to deep-scrub some PGs manually, but it seems that the cluster can become slow. I would like your assistance so that my cluster can get back to HEALTH_OK state as before, without any interruption of service. The cluster is used as OpenStack backend storage.
> >>>>
> >>>> Best Regards
> >>>>
> >>>> Michel
> >>>>
> >>>> ceph -s
> >>>>   cluster:
> >>>>     id:     cb0caedc-eb5b-42d1-a34f-96facfda8c27
> >>>>     health: HEALTH_WARN
> >>>>             6 pgs not deep-scrubbed in time
> >>>>
> >>>>   services:
> >>>>     mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
> >>>>     mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
> >>>>     osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
> >>>>     rgw: 6 daemons active (6 hosts, 1 zones)
> >>>>
> >>>>   data:
> >>>>     pools:   10 pools, 385 pgs
> >>>>     objects: 5.97M objects, 23 TiB
> >>>>     usage:   151 TiB used, 282 TiB / 433 TiB avail
> >>>>     pgs:     381 active+clean
> >>>>              4   active+clean+scrubbing+deep
> >>>>
> >>>>   io:
> >>>>     client:   59 MiB/s rd, 860 MiB/s wr, 155 op/s rd, 665 op/s wr
> >>>>
> >>>> root@ceph-osd3:~# ceph health detail
> >>>> HEALTH_WARN 6 pgs not deep-scrubbed in time
> >>>> [WRN] PG_NOT_DEEP_SCRUBBED: 6 pgs not deep-scrubbed in time
> >>>>     pg 6.78 not deep-scrubbed since 2024-01-11T16:07:54.875746+0200
> >>>>     pg 6.60 not deep-scrubbed since 2024-01-13T19:44:26.922000+0200
> >>>>     pg 6.5c not deep-scrubbed since 2024-01-13T09:07:24.780936+0200
> >>>>     pg 4.12 not deep-scrubbed since 2024-01-13T09:09:22.176240+0200
> >>>>     pg 10.d not deep-scrubbed since 2024-01-12T08:04:02.078062+0200
> >>>>     pg 5.f not deep-scrubbed since 2024-01-12T06:06:00.970665+0200
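To put Frank's deep-scrub budget from earlier in the thread into numbers for this cluster (48 OSDs, replication factor 3, 385 PGs): the per-PG deep-scrub time of 120 minutes below is only an assumed placeholder for illustration; the real value has to come from the deep-scrub lines in the OSD logs, as Frank describes.

    # B = minutes per PG deep-scrub, taken from the log timestamps between the
    # "deep-scrub starts" and "deep-scrub ok" lines (120 here is an assumption)
    A=$((48 / 3))                  # concurrent deep-scrubs the cluster can sustain: 16
    B=120                          # assumed minutes per PG deep-scrub on these HDDs
    echo $(( A * 24 * 60 / B ))    # theoretical ceiling: 192 PGs/day; plan for at most 50%, i.e. ~96/day
    # At ~96 PGs/day, 385 PGs take roughly 4 days per deep-scrub cycle. If the
    # measured B pushes the cycle past the 7-day default of osd_deep_scrub_interval,
    # either tune the scrub intervals as in Frank's linked post or raise the PG
    # count as he suggests, so that individual deep-scrubs finish sooner.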