Re: Scrub stuck and 'pg has invalid (post-split) stat'

Eugen Block <eblock@xxxxxx> · Thu, 22 Feb 2024 15:11:45 +0000

Hi,

Have you already tried to set the primary PG out and wait for the
backfill to finish?

Of course I meant the primary OSD for that PG, I hope that was clear. ;-)

We are thinking about the use of "ceph pg_mark_unfound_lost revert"

I'm not a developer, but how I read the code [2] is that such an  
action actually caused the invalid_state in the first place, the  
function void PrimaryLogPG::mark_all_unfound_lost handles two cases:

    case pg_log_entry_t::LOST_REVERT:
...
    case pg_log_entry_t::LOST_DELETE:

And after those cases updates the stats and marks them invalid:

  recovery_state.update_stats(
    [](auto &history, auto &stats) {
      stats.stats_invalid = true;
      return false;
    });

But according to the scrubbing code [3] it would update the invalid  
stats when finishing the scrub:

  if (info.stats.stats_invalid) {
    m_pl_pg->recovery_state.update_stats([=](auto& history, auto& stats) {
      stats.stats = m_scrub_cstat;
      stats.stats_invalid = false;
      return false;
    });

So from my understanding, the PG should be "scrubbable", I don't  
really understand why it doesn't. Did you already send the overall  
cluster status (ceph -s)? And maybe attach the entire query output to  
a file and attach it?

[2] https://github.com/ceph/ceph/blob/v16.2.13/src/osd/PrimaryLogPG.cc#L12407
[3] https://github.com/ceph/ceph/blob/v16.2.13/src/osd/PrimaryLogScrub.cc#L54

Zitat von Cedric <yipikai7@xxxxxxxxx>:

On Thu, Feb 22, 2024 at 12:37 PM Eugen Block <eblock@xxxxxx> wrote:
You haven't told yet if you changed the hit_set_count to 0.

Not yet, we will give it a try ASAP

Have you already tried to set the primary PG out and wait for the
backfill to finish?

No, we will try also

And another question, are all services running pacific already and on
the same version (ceph versions)?

Yes, all daemon runs 16.2.13

Zitat von Cedric <yipikai7@xxxxxxxxx>:

> Yes the osd_scrub_invalid_stats is set to true.
>
> We are thinking about the use of "ceph pg_mark_unfound_lost revert"
> action, but we wonder if there is a risk of data loss.
>
> On Thu, Feb 22, 2024 at 11:50 AM Eugen Block <eblock@xxxxxx> wrote:
>>
>> I found a config to force scrub invalid PGs, what is your current
>> setting on that?
>>
>> ceph config get osd osd_scrub_invalid_stats
>> true
>>
>> The config reference states:
>>
>> > Forces extra scrub to fix stats marked as invalid.
>>
>> But the default seems to be true, so I'd expect it's true in your case
>> as well?
>>
>> Zitat von Cedric <yipikai7@xxxxxxxxx>:
>>
>> > Thanks Eugen for the suggestion, yes we have tried, also repeering
>> > concerned PGs, still the same issue.
>> >
>> > Looking at the code it seems the split-mode message is triggered when
>> > the PG as ""stats_invalid": true,", here is the result of a query:
>> >
>> > "stats_invalid": true,
>> >                 "dirty_stats_invalid": false,
>> >                 "omap_stats_invalid": false,
>> >                 "hitset_stats_invalid": false,
>> >                 "hitset_bytes_stats_invalid": false,
>> >                 "pin_stats_invalid": false,
>> >                 "manifest_stats_invalid": false,
>> >
>> > I also provide again cluster informations that was lost in previous
>> > missed reply all. Don't hesitate to ask more if needed I would be
>> > glade to provide them.
>> >
>> > Cédric
>> >
>> >
>> > On Thu, Feb 22, 2024 at 11:04 AM Eugen Block <eblock@xxxxxx> wrote:
>> >>
>> >> Hm, I wonder if setting (and unsetting after a while) noscrub and
>> >> nodeep-scrub has any effect. Have you tried that?
>> >>
>> >> Zitat von Cedric <yipikai7@xxxxxxxxx>:
>> >>
>> >> > Update: we have run fsck and re-shard on all bluestore volume, seems
>> >> > sharding were not applied.
>> >> >
>> >> > Unfortunately scrubs and deep-scrubs are still stuck on PGs of the
>> >> > pool that is suffering the issue, but other PGs scrubs well.
>> >> >
>> >> > The next step will be to remove the cache tier as suggested, but its
>> >> > not available yet as PGs needs to be scrubbed in order for the cache
>> >> > tier can be activated.
>> >> >
>> >> > As we are struggling to make this cluster works again, any help
>> >> > would be greatly appreciated.
>> >> >
>> >> > Cédric
>> >> >
>> >> >> On 20 Feb 2024, at 20:22, Cedric <yipikai7@xxxxxxxxx> wrote:
>> >> >>
>> >> >> Thanks Eugen, sorry about the missed reply to all.
>> >> >>
>> >> >> The reason we still have the cache tier is because we were not able
>> >> >> to flush all dirty entry to remove it (as per the procedure), so
>> >> >> the cluster as been migrated from HDD/SSD to NVME a while ago but
>> >> >> tiering remains, unfortunately.
>> >> >>
>> >> >> So actually we are trying to understand the root cause
>> >> >>
>> >> >> On Tue, Feb 20, 2024 at 1:43 PM Eugen Block <eblock@xxxxxx> wrote:
>> >> >>>
>> >> >>> Please don't drop the list from your response.
>> >> >>>
>> >> >>> The first question coming to mind is, why do you have a  
cache-tier if
>> >> >>> all your pools are on nvme decices anyway? I don't see any
>> benefit here.
>> >> >>> Did you try the suggested workaround and disable the cache-tier?
>> >> >>>
>> >> >>> Zitat von Cedric <yipikai7@xxxxxxxxx>:
>> >> >>>
>> >> >>>> Thanks Eugen, see attached infos.
>> >> >>>>
>> >> >>>> Some more details:
>> >> >>>>
>> >> >>>> - commands that actually hangs: ceph balancer status ; rbd
>> -p vms ls ;
>> >> >>>> rados -p vms_cache cache-flush-evict-all
>> >> >>>> - all scrub running on vms_caches pgs are stall / start in a loop
>> >> >>>> without actually doing anything
>> >> >>>> - all io are 0 both from ceph status or iostat on nodes
>> >> >>>>
>> >> >>>> On Tue, Feb 20, 2024 at 10:00 AM Eugen Block  
<eblock@xxxxxx> wrote:
>> >> >>>>>
>> >> >>>>> Hi,
>> >> >>>>>
>> >> >>>>> some more details would be helpful, for example what's the
>> pool size
>> >> >>>>> of the cache pool? Did you issue a PG split before or during the
>> >> >>>>> upgrade? This thread [1] deals with the same problem,  
the described
>> >> >>>>> workaround was to set hit_set_count to 0 and disable the
>> cache layer
>> >> >>>>> until that is resolved. Afterwards you could enable the  
cache layer
>> >> >>>>> again. But keep in mind that the code for cache tier is entirely
>> >> >>>>> removed in Reef (IIRC).
>> >> >>>>>
>> >> >>>>> Regards,
>> >> >>>>> Eugen
>> >> >>>>>
>> >> >>>>> [1]
>> >> >>>>>
>> >>
>>  
https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-adding-a-cache-osd
>> >> >>>>>
>> >> >>>>> Zitat von Cedric <yipikai7@xxxxxxxxx>:
>> >> >>>>>
>> >> >>>>>> Hello,
>> >> >>>>>>
>> >> >>>>>> Following an upgrade from Nautilus (14.2.22) to Pacific
>> (16.2.13), we
>> >> >>>>>> encounter an issue with a cache pool becoming completely stuck,
>> >> >>>>>> relevant messages below:
>> >> >>>>>>
>> >> >>>>>> pg xx.x has invalid (post-split) stats; must scrub before
>> tier agent
>> >> >>>>>> can activate
>> >> >>>>>>
>> >> >>>>>> In OSD logs, scrubs are starting in a loop without
>> succeeding for all
>> >> >>>>>> pg of this pool.
>> >> >>>>>>
>> >> >>>>>> What we already tried without luck so far:
>> >> >>>>>>
>> >> >>>>>> - shutdown / restart OSD
>> >> >>>>>> - rebalance pg between OSD
>> >> >>>>>> - raise the memory on OSD
>> >> >>>>>> - repeer PG
>> >> >>>>>>
>> >> >>>>>> Any idea what is causing this? any help will be greatly
>> appreciated
>> >> >>>>>>
>> >> >>>>>> Thanks
>> >> >>>>>>
>> >> >>>>>> Cédric
>> >> >>>>>> _______________________________________________
>> >> >>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>> >> >>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> _______________________________________________
>> >> >>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>> >> >>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> >> >>>
>> >> >>>
>> >> >>>
>> >>
>> >>
>> >>
>>
>>
>>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx