Re: libceph: get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping

Markus Kienast <mark@xxxxxxxxxxxxx> · Sun, 16 May 2021 12:54:21 +0200

Hi Ilya,

Unfortunately, I cannot find any "missing primary copy of ..." error in the
logs of my 3 OSDs.
The NVMe disks are also brand new and there is not much traffic on them.

The only matches for the keyword "error" are the two messages in the osd.0
and osd.1 logs shown below.

BTW, the error in my case actually concerns osd1. The message I posted before
was copied from somebody else's bug report, which showed similar errors. Here
are my original error messages on LTSP boot:
[    10.331119] libceph: mon1 (1)10.101.0.27:6789 session established
[    10.331799] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
[    10.336866] libceph: mon0 (1)10.101.0.25:6789 session established
[    10.337598] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
*[    10.349380] libceph: get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping*
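
For anyone comparing such warnings across reports, here is a minimal Python sketch that pulls the numbers out of a get_reply line. It assumes the message format seen in this thread; the field names (osd, tid, data, prealloc, excess) are illustrative labels, not kernel identifiers. As far as can be told from the message text, "tid" is the id the kernel client assigns to the request, "data" is the size of the reply payload, and "preallocated" is the size of the buffer the client had prepared for it.

```python
import re

# Pattern for the kernel client's get_reply warning as quoted in this thread.
# The group names are labels chosen for this sketch.
GET_REPLY_RE = re.compile(
    r"libceph: get_reply osd(?P<osd>\d+) tid (?P<tid>\d+) "
    r"data (?P<data>\d+) > preallocated (?P<prealloc>\d+), skipping"
)


def parse_get_reply(line):
    """Return the fields of a get_reply warning, or None if the line does not match."""
    # Undo mail-client damage: drop '*' emphasis markers and collapse whitespace.
    cleaned = " ".join(line.replace("*", " ").split())
    match = GET_REPLY_RE.search(cleaned)
    if match is None:
        return None
    fields = {name: int(value) for name, value in match.groupdict().items()}
    # How many bytes the reply exceeded the prepared buffer by.
    fields["excess"] = fields["data"] - fields["prealloc"]
    return fields


boot = parse_get_reply(
    "[    10.349380] libceph: get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping"
)
earlier = parse_get_reply(
    "libceph: get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping"
)
print(boot)
print(earlier)
```

For the boot message above this gives an excess of 68 bytes over the 4096-byte buffer; for the message copied from the other report the excess is 3117056 bytes.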

elias@maas:~$ juju ssh ceph-osd/2 sudo zgrep -i error /var/log/ceph/ceph-osd.0.log
2021-05-16T08:52:56.872+0000 7f0b262c2d80  4 rocksdb:   Options.error_if_exists: 0
2021-05-16T08:52:59.872+0000 7f0b262c2d80  4 rocksdb:   Options.error_if_exists: 0
2021-05-16T08:53:00.884+0000 7f0b262c2d80  1 osd.0 8599 warning: got an error loading one or more classes: (1) Operation not permitted

elias@maas:~$ juju ssh ceph-osd/0 sudo zgrep -i error /var/log/ceph/ceph-osd.1.log
2021-05-16T08:49:52.971+0000 7fb6aa68ed80  4 rocksdb:   Options.error_if_exists: 0
2021-05-16T08:49:55.979+0000 7fb6aa68ed80  4 rocksdb:   Options.error_if_exists: 0
2021-05-16T08:49:56.828+0000 7fb6aa68ed80  1 osd.1 8589 warning: got an error loading one or more classes: (1) Operation not permitted
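
Most of the matches above are RocksDB option dumps, which only contain the word "error" as part of the option name. A small Python sketch of a filter that keeps only the remaining lines, assuming each log entry sits on a single line in the actual file (the line breaks in this mail come from the mail client); the function name real_error_lines is made up for this example:

```python
def real_error_lines(lines):
    """Yield lines that mention 'error' but are not RocksDB option dumps."""
    for line in lines:
        lower = line.lower()
        if "error" not in lower:
            continue
        # 'Options.error_if_exists' is a RocksDB setting printed at OSD start,
        # not an error report, so skip it.
        if "options.error_if_exists" in lower:
            continue
        yield line.rstrip("\n")


sample = [
    "2021-05-16T08:49:52.971+0000 7fb6aa68ed80  4 rocksdb:   Options.error_if_exists: 0",
    "2021-05-16T08:49:56.828+0000 7fb6aa68ed80  1 osd.1 8589 warning: got an "
    "error loading one or more classes: (1) Operation not permitted",
]
for line in real_error_lines(sample):
    print(line)
```

On the sample taken from the osd.1 log this leaves only the "error loading one or more classes" warning.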

How can I find out more about this bug? It keeps coming back every two
weeks, and I need to restart all OSDs to make it go away for another two
weeks. Can I check "tid 11 data 4164" somehow? I cannot find any documentation
on what a tid actually is or how I could perform a read test on it.

Another interesting detail is that the problem only seems to affect booting
from this RBD, not operation per se. The thin clients that have already
booted from this RBD continue working.

All systems run:
Ubuntu 20.04.2 LTS
Kernel 5.8.0-53-generic
ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus
(stable)

The cluster has been set up with Ubuntu MAAS/juju and consists of:
* 1 MAAS server
* with 1 virtual LXD juju controller
* 3 OSD servers with one 2 TB NVMe SSD each for Ceph and a 256 SATA SSD for
the operating system.
* each OSD server contains a virtualized LXD MON and an LXD FS server (set up
through juju, see juju yaml file attached).

How can I investigate this problem further, or might an upgrade to Pacific
be necessary?

My best regards,
Markus

On Wed, 28 Apr 2021 at 19:25, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:

> On Sun, Apr 25, 2021 at 11:42 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >
> > On Sun, Apr 25, 2021 at 12:37 AM Markus Kienast <mark@xxxxxxxxxxxxx> wrote:
> > >
> > > I am seeing these messages when booting from RBD and booting hangs
> > > there.
> > >
> > > libceph: get_reply osd2 tid 1459933 data 3248128 > preallocated
> > > 131072, skipping
> > >
> > > However, Ceph Health is OK, so I have no idea what is going on. I
> > > reboot my 3 node cluster and it works again for about two weeks.
> > >
> > > How can I find out more about this issue, how can I dig deeper? Also
> > > there has been at least one report about this issue before on this
> > > mailing list - " Strange Data Issue - Unexpected client
> > > hang on OSD I/O Error" - but no solution has been presented.
> > >
> > > This report was from 2018, so no idea if this is still an issue for
> > > Dyweni the original reporter. If you read this, I would be happy to
> > > hear how you solved the problem.
> >
> > Hi Markus,
> >
> > What versions of ceph and the kernel are in use?
> >
> > Are you also seeing I/O errors and "missing primary copy of ..., will
> > try copies on ..." messages in the OSD logs (in this case osd2)?
>
> For the sake of archives, the " Strange Data Issue
> - Unexpected client hang on OSD I/O Error" instance has been fixed
> in 12.2.12, 13.2.5 and 14.2.0:
>
> https://tracker.ceph.com/issues/37680
>
> I also tried to reply to that thread but it didn't go through because
> the old ceph-users@xxxxxxxxxxxxxx mailing list is decommissioned.
>
> Thanks,
>
>                 Ilya
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx