Re: [External Email] Re: ceph-objectstore-tool core dump

I also saw a delay before the repair scrub started when I was dealing with
this issue.  I ultimately increased the number of simultaneous scrubs, but
I think you could also temporarily disable scrubs and then re-issue the 'pg
repair'.  (But I'm not one of the experts on this.)
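If it helps, here's a rough sketch of that second approach (23.1fa is the
PG from Michael's report; I'm going from memory here, so double-check
before running anything):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # give the in-flight scrubs a little time to stop and release their
  # reservations, then re-enable scrubbing and re-issue the repair:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub
  ceph pg repair 23.1fa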

My perception is that between EC pools, large HDDs, and the overall OSD
count, some tuning may be needed to ensure that scrubs can get scheduled:
a large HDD holds pieces of more PGs, and each PG in an EC pool is spread
across more disks than one in a replicated pool.  Thus, especially if the
number of OSDs is not large, there is an increased chance that more than
one scrub will want to read the same OSD at the same time.  That becomes a
scheduling nightmare if the number of simultaneous scrubs is low and
client traffic is given priority.
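
Raising the scrub limit looks something like this (the value 2 is only an
example; the default for osd_max_scrubs is 1):

  ceph config set osd osd_max_scrubs 2

or, injected directly into the running daemons:

  ceph tell 'osd.*' injectargs '--osd_max_scrubs 2'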

-Dave


--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
607-760-2328 (Cell)
607-777-4641 (Office)


On Sun, Oct 3, 2021 at 11:51 PM 胡 玮文 <huww98@xxxxxxxxxxx> wrote:

> > On Oct 4, 2021, at 04:18, Michael Thomas <wart@xxxxxxxxxxx> wrote:
> >
> > On 10/3/21 12:08, 胡 玮文 wrote:
> >>>> On Oct 4, 2021, at 00:53, Michael Thomas <wart@xxxxxxxxxxx> wrote:
> >>>
> >>> I recently started getting inconsistent PGs in my Octopus (15.2.14)
> ceph cluster.  I was able to determine that they are all coming from the
> same OSD: osd.143.  This host recently suffered from an unplanned power
> loss, so I'm not surprised that there may be some corruption.  This PG is
> part of an EC 8+2 pool.
> >>>
> >>> The OSD logs from the PG's primary OSD show this and similar errors
> from the PG's most recent deep scrub:
> >>>
> >>> 2021-10-03T03:25:25.969-0500 7f6e6801f700 -1 log_channel(cluster) log
> [ERR] : 23.1fa shard 143(1) soid 23:5f8c3d4e:::10000179969.00000168:head :
> candidate had a read error
> >>>
> >>> In attempting to fix it, I first ran 'ceph pg repair 23.1fa' on the
> PG. This accomplished nothing.  Next I ran a shallow fsck on the OSD:
> >> I expect this ‘ceph pg repair’ command should handle this kind of
> error. After issuing it, the pg should enter a state like
> “active+clean+scrubbing+deep+inconsistent+repair”. Then you wait for the
> repair to finish (this can take hours), and you should be able to recover
> from the inconsistent state. What do you mean by “This accomplished
> nothing”?
> >
> > The PG never entered the 'repair' state, nor did anything appear in the
> primary OSD logs about a request for repair.  After more than 24 hours, the
> PG remained listed as 'inconsistent'.
> >
> > --Mike
>
> I have encountered a similar situation. In my case, the PG being repaired
> could not get all of the scrub reservations it needed to enter the
> scrubbing state. Could you try “ceph tell osd.<primary OSD ID>
> dump_scrubs” and see whether 23.1fa is listed with forced == true? If so,
> this may also be your case. I think you could wait even longer, raise the
> “osd_max_scrubs” config, or set and then unset noscrub to interrupt the
> running scrubs.
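>
> For example (just a sketch; osd.121 here is a made-up primary OSD ID for
> 23.1fa, substitute the real one, which “ceph pg map 23.1fa” will show):
>
>   ceph tell osd.121 dump_scrubs | grep -B 2 -A 5 23.1fa
>
> and check whether the matching entry has "forced": true.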
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



