Re: 14.2.9 MDS Failing

On Fri, May 1, 2020 at 9:27 PM Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:

> The OpenFileTable objects are safe to delete while the MDS is offline
> anyway; the RADOS object names are mds*_openfiles*
>

I should clarify this a little bit: you shouldn't touch the CephFS internal
state or data structures unless you know *exactly* what you are doing.

However, deleting these objects is generally safe, and running a scrub
afterwards is a good idea anyway.
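
For illustration only (the pool name, MDS rank, and daemon name below are
placeholders; adjust them to your cluster), the objects can be listed and
removed with plain rados commands while all MDS daemons are stopped:

    # list the open file table objects for rank 0
    # (assumes the metadata pool is called "cephfs_metadata")
    rados -p cephfs_metadata ls | grep '^mds0_openfiles'

    # remove them; the MDS rebuilds the table on startup
    rados -p cephfs_metadata rm mds0_openfiles.0

    # then start the MDS again and kick off a forward scrub, e.g. via the
    # admin socket (check "ceph daemon mds.<name> help" for the exact
    # syntax on your release)
    ceph daemon mds.<name> scrub_path / recursive repair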

But only do this after reading up on the details or consulting an expert.
My assessment here is purely an educated guess based on the error message
and may be wrong or counter-productive. All of my mailing list advice is
just things I know off the top of my head, with no further research, so
take it with a grain of salt. Don't touch anything you don't understand if
you have important data in there.
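
On a separate note, regarding the ceph-volume "could not find osd.X with
osd_fsid ..." messages in the log further down: a generic way to
cross-check these (the OSD id and fsid below are taken from that log,
everything else is only a sketch) is to compare what ceph-volume reads
from the LVM tags with what systemd is trying to activate:

    # what ceph-volume knows about the local LVs, including the osd id
    # and osd fsid stored in the LVM tags
    ceph-volume lvm list

    # the raw LVM tags themselves
    lvs -o lv_name,vg_name,lv_tags | grep ceph.osd_fsid

    # if the tags look right, retriggering activation by hand often gives
    # a clearer error than the systemd retry loop
    ceph-volume lvm activate 13 dd49cd80-418e-4a8c-8ebf-a33d339663ff
    # or
    ceph-volume lvm activate --all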


Paul


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


> On Fri, May 1, 2020 at 9:04 PM Marco Pizzolo <marcopizzolo@xxxxxxxxx>
> wrote:
>
>> Also seeing errors such as this:
>>
>>
>> [2020-05-01 13:15:20,970][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:20,970][systemd][WARNING] failed activating OSD, retries
>> left: 11
>> [2020-05-01 13:15:20,974][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.13 with osd_fsid
>> dd49cd80-418e-4a8c-8ebf-a33d339663ff
>> [2020-05-01 13:15:20,989][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:20,989][systemd][WARNING] failed activating OSD, retries
>> left: 11
>> [2020-05-01 13:15:20,998][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.5 with osd_fsid
>> 4eaf2baa-60f2-4045-8964-6152608c742a
>> [2020-05-01 13:15:21,014][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:21,014][systemd][WARNING] failed activating OSD, retries
>> left: 11
>> [2020-05-01 13:15:21,019][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.9 with osd_fsid
>> 32f4a716-f26e-4579-a074-5d6452c22e34
>> [2020-05-01 13:15:21,035][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:21,035][systemd][WARNING] failed activating OSD, retries
>> left: 11
>> [2020-05-01 13:15:25,972][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 1-0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>> [2020-05-01 13:15:25,994][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 13-dd49cd80-418e-4a8c-8ebf-a33d339663ff
>> [2020-05-01 13:15:26,020][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 5-4eaf2baa-60f2-4045-8964-6152608c742a
>> [2020-05-01 13:15:26,040][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 9-32f4a716-f26e-4579-a074-5d6452c22e34
>> [2020-05-01 13:15:26,388][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.1 with osd_fsid
>> 0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>> [2020-05-01 13:15:26,389][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.13 with osd_fsid
>> dd49cd80-418e-4a8c-8ebf-a33d339663ff
>> [2020-05-01 13:15:26,391][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.5 with osd_fsid
>> 4eaf2baa-60f2-4045-8964-6152608c742a
>> [2020-05-01 13:15:26,402][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:26,403][systemd][WARNING] failed activating OSD, retries
>> left: 10
>> [2020-05-01 13:15:26,403][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:26,404][systemd][WARNING] failed activating OSD, retries
>> left: 10
>> [2020-05-01 13:15:26,404][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:26,405][systemd][WARNING] failed activating OSD, retries
>> left: 10
>> [2020-05-01 13:15:26,411][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.9 with osd_fsid
>> 32f4a716-f26e-4579-a074-5d6452c22e34
>> [2020-05-01 13:15:26,424][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:26,424][systemd][WARNING] failed activating OSD, retries
>> left: 10
>> [2020-05-01 13:15:31,408][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 1-0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>> [2020-05-01 13:15:31,408][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 5-4eaf2baa-60f2-4045-8964-6152608c742a
>> [2020-05-01 13:15:31,409][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 13-dd49cd80-418e-4a8c-8ebf-a33d339663ff
>> [2020-05-01 13:15:31,429][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 9-32f4a716-f26e-4579-a074-5d6452c22e34
>> [2020-05-01 13:15:31,743][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.5 with osd_fsid
>> 4eaf2baa-60f2-4045-8964-6152608c742a
>> [2020-05-01 13:15:31,750][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.13 with osd_fsid
>> dd49cd80-418e-4a8c-8ebf-a33d339663ff
>> [2020-05-01 13:15:31,752][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:31,752][systemd][WARNING] failed activating OSD, retries
>> left: 9
>> [2020-05-01 13:15:31,754][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.1 with osd_fsid
>> 0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>> [2020-05-01 13:15:31,761][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:31,762][systemd][WARNING] failed activating OSD, retries
>> left: 9
>> [2020-05-01 13:15:31,764][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:31,765][systemd][WARNING] failed activating OSD, retries
>> left: 9
>>
>> On Fri, May 1, 2020 at 2:23 PM Marco Pizzolo <marcopizzolo@xxxxxxxxx>
>> wrote:
>>
>> > Hi Ashley,
>> >
>> > Thanks for your response.  Nothing that I can think of would have
>> > happened.  We are using max_mds = 1.  We have 4 MDS daemons, so we
>> > used to have 3 standbys.  Within minutes they all crash.
>> >
>> > On Fri, May 1, 2020 at 2:21 PM Ashley Merrick <singapore@xxxxxxxxxxxxxx>
>> > wrote:
>> >
>> >> Quickly checking the code that calls that assert:
>> >>
>> >> if (version > omap_version) {
>> >>   omap_version = version;
>> >>   omap_num_objs = num_objs;
>> >>   omap_num_items.resize(omap_num_objs);
>> >>   journal_state = jstate;
>> >> } else if (version == omap_version) {
>> >>   ceph_assert(omap_num_objs == num_objs);
>> >>   if (jstate > journal_state)
>> >>     journal_state = jstate;
>> >> }
>> >> }
>> >>
>> >>
>> >> I'm not a dev, and I'm not sure if this will help, but it seems this
>> >> could mean that the MDS thinks it is behind on omaps, or too far ahead.
>> >>
>> >>
>> >> Has anything happened recently? Are you just running a single MDS?
>> >>
>> >>
>> >> Hopefully someone else will see this and shed some light on what could
>> >> be causing it.
>> >>
>> >>
>> >>
>> >> ---- On Sat, 02 May 2020 02:10:58 +0800 marcopizzolo@xxxxxxxxx wrote ----
>> >>
>> >>
>> >> Hello,
>> >>
>> >> Hoping you can help me.
>> >>
>> >> Ceph had been largely problem-free for us for the better part of a year.
>> >> We have a high file count in a single CephFS filesystem, and are seeing
>> >> this error in the logs:
>> >>
>> >>
>> >> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/mds/OpenFileTable.cc:
>> >> 777: FAILED ceph_assert(omap_num_objs == num_objs)
>> >>
>> >> The issue seemed to occur this morning, and restarting the MDS as well
>> >> as rebooting the servers doesn't correct the problem.
>> >>
>> >> I'm not really sure where to look next, as the MDS daemons keep crashing.
>> >>
>> >> I'd appreciate any help you can provide.
>> >>
>> >> Marco
>> >>
>> >
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



