Re: 16.2.8 QE testing issues

On Mon, May 9, 2022 at 1:26 AM Satoru Takeuchi <satoru.takeuchi@xxxxxxxxx> wrote:
Hi,

On Fri, Apr 29, 2022 at 6:54 Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Thu, Apr 28, 2022 at 9:27 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >
> > On Thu, Apr 28, 2022 at 1:06 AM Yuri Weinstein <yweinste@xxxxxxxxxx> wrote:
> > >
> > > We are seeing issues during tests, visible in upgrade tests
> > > (http://pulpito.front.sepia.ceph.com/yuriw-2022-04-27_14:24:25-upgrade:octopus-x-pacific-distro-default-smithi/)
> > >
> > > Looks like trackers:
> > > https://tracker.ceph.com/issues/55444
> > > https://tracker.ceph.com/issues/55475
> > >
> > > Casey, Ilya pls advise if they have to be addressed before 16.2.8 release
> >
> > The RBD failures are rather puzzling but seem to be reproducible (at
> > least in teuthology, not locally so far).  I doubt it is something new
> > so probably not a blocker for -rc -- will investigate in parallel.
>
> Neha, this looks like a critical omap handling regression on the OSD
> side to me.  Definitely a blocker for 16.2.8.  I have reassigned the
> above RBD ticket to you.

I encountered data corruption in my cluster. When I upgraded from
v16.2.7 + some patches (PR#43581, 44413, 45502, 45654) to this version
plus the PR#45963 patch, an unfound object appeared.
After trying to fix this problem, the unfound object disappeared, but there
is now at least one inconsistent PG.
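
For reference, this is roughly how the unfound object showed up (the PG ID
below is only an example, not necessarily the actual one in my cluster):

    ceph health detail            # reported OBJECT_UNFOUND and the affected PG
    ceph pg 2.57 list_unfound     # lists the unfound object(s) in that PG
    ceph pg map 2.57              # shows which OSDs currently serve the PG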

Hi Satoru, 

It is possible that you ran into the bug that was fixed in https://github.com/ceph/ceph/pull/46096. It affects clusters with OSDs that have resharding off. If you have a way to upgrade to 16.2.8 (to be released very soon), that'd be great.

Thanks,
Neha
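
If it helps, I believe you can check whether an OSD's RocksDB has already
been resharded with ceph-bluestore-tool while the OSD is stopped (the OSD
path below is just an example; the exact output depends on the release):

    # prints the current RocksDB sharding definition of this OSD, if any
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 show-sharding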
 

Here is what I did after the upgrade mentioned above.

1. Stopped the OSD related to the unfound object. The unfound object then
   disappeared, but an inconsistent PG appeared.
2. To resolve the inconsistency, downgraded Ceph to the previous version
   (which does not contain PR#45963), but some OSDs started to crash.
3. Went back to the new version (which contains PR#45963), and additional
   inconsistent PGs appeared.
4. Some of them were fixed by following the document below (see the sketch
   after this list), but at least one PG is still inconsistent and there
   might be other problematic PGs.

   https://docs.ceph.com/en/latest/rados/operations/pg-repair/
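
For reference, the repair flow from that document was roughly the following
(2.57 is just one of the affected PGs; the PG IDs are examples):

    ceph health detail                     # lists the inconsistent PGs (PG_DAMAGED)
    rados list-inconsistent-obj 2.57 --format=json-pretty   # inspect what is inconsistent
    ceph pg repair 2.57                    # ask the primary OSD to repair the PG
    ceph health detail                     # re-check after the repair finishes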

Does anyone know whether it is possible to resolve this corruption, and if so, how?

I'll upgrade Ceph to v16.2.7 + PR#46096, but it's unclear whether this
upgrade will resolve my issue.

Additional information:

- Some OSDs were created in v16.2.z and the others in v15.2.z or older.
- `rados list-inconsistent-obj` reports that there are no inconsistent
  objects, as shown below, even though this PG is inconsistent.

rados list-inconsistent-obj 2.57 | jq .
{
  "epoch": 90596,
  "inconsistents": []
}
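
As far as I understand, `rados list-inconsistent-obj` only reflects the most
recent deep scrub of the PG, so it may be necessary to re-scrub before it
reports anything (a sketch, with the same example PG ID):

    ceph pg deep-scrub 2.57      # schedule a deep scrub of the PG
    # wait for the scrub to complete, then query again
    rados list-inconsistent-obj 2.57 --format=json-pretty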


Thanks,
Satoru



>
> Yuri, the natural suspect is https://github.com/ceph/ceph/pull/45963.
> I would suggest building a pacific branch without it and rerunning my
> test run:
>
>     https://pulpito.ceph.com/dis-2022-04-28_17:15:07-upgrade:octopus-x-pacific-distro-default-smithi/
>
> Thanks,
>
>                 Ilya

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
