Re: 14.2.9 MDS Failing

Thanks Everyone,

I was able to address the issue, at least temporarily.  The filesystem and
MDSes are staying online for the time being, and the PGs are being remapped.

What I'm not sure about is the best MDS tuning for our use case, nor am I
sure exactly what caused the OSDs to flap as they did, so I don't yet know
how to avoid a recurrence.

I do very much like Ceph though....

Best wishes,

Marco

On Fri, May 1, 2020 at 3:49 PM Marco Pizzolo <marcopizzolo@xxxxxxxxx> wrote:

> Understood Paul, thanks.
>
> In case this helps to shed any further light: digging through the logs, I'm
> also seeing this:
>
> 2020-05-01 10:06:55.984 7eff10cc3700  1 mds.prdceph01 Updating MDS map to
> version 1487236 from mon.2
> 2020-05-01 10:06:56.398 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> 17 slow requests, 1 included below; oldest blocked for > 254.203584 secs
> 2020-05-01 10:06:56.398 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> slow request 60.552277 seconds old, received at 2020-05-01 10:05:55.846466:
> client_request(client.2525280:277916371 mkdir #0x10014e76974/1f 2020-05-01
> 10:05:55.844490 caller_uid=1010, caller_gid=1015{}) currently submit entry:
> journal_and_reply
> 2020-05-01 10:06:57.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> 17 slow requests, 2 included below; oldest blocked for > 255.205489 secs
> 2020-05-01 10:06:57.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> slow request 60.564545 seconds old, received at 2020-05-01 10:05:56.836104:
> client_request(client.2525280:277921203 create
> #0x10014f12b86/9254b3f0-1d5a-4e88-8d41-d36f244bcb12.zip 2020-05-01
> 10:05:56.834494 caller_uid=1010, caller_gid=1015{}) currently submit entry:
> journal_and_reply
> 2020-05-01 10:06:57.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> slow request 60.550874 seconds old, received at 2020-05-01 10:05:56.849775:
> client_request(client.2525280:277921267 mkdir #0x10014e78bec/e0 2020-05-01
> 10:05:56.848494 caller_uid=1010, caller_gid=1015{}) currently submit entry:
> journal_and_reply
> 2020-05-01 10:06:58.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> 17 slow requests, 0 included below; oldest blocked for > 256.205519 secs
> 2020-05-01 10:07:15.250 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> had timed out after 15
> 2020-05-01 10:07:15.250 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> beacon heartbeat to monitors (last acked 3.99999s ago); MDS internal
> heartbeat is not healthy!
> 2020-05-01 10:07:15.750 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> had timed out after 15
> 2020-05-01 10:07:15.750 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> beacon heartbeat to monitors (last acked 4.49999s ago); MDS internal
> heartbeat is not healthy!
> 2020-05-01 10:07:16.250 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> had timed out after 15
> 2020-05-01 10:07:16.250 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> beacon heartbeat to monitors (last acked 4.99998s ago); MDS internal
> heartbeat is not healthy!
> 2020-05-01 10:07:16.750 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> had timed out after 15
> 2020-05-01 10:07:16.750 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> beacon heartbeat to monitors (last acked 5.49998s ago); MDS internal
> heartbeat is not healthy!
> 2020-05-01 10:07:17.250 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> had timed out after 15
> 2020-05-01 10:07:17.250 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> beacon heartbeat to monitors (last acked 5.99998s ago); MDS internal
> heartbeat is not healthy!
>
>
> THEN about 5 minutes later.......
>
>
>
> 2020-05-01 10:07:35.559 7eff10cc3700  1 mds.prdceph01  9: 'ceph'
> 2020-05-01 10:07:35.559 7eff10cc3700  1 mds.prdceph01 respawning with exe
> /usr/bin/ceph-mds
> 2020-05-01 10:07:35.559 7eff10cc3700  1 mds.prdceph01  exe_path
> /proc/self/exe
> 2020-05-01 10:07:50.785 7fbff66291c0  0 ceph version 14.2.9
> (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process
> ceph-mds, pid 9710
> 2020-05-01 10:07:50.787 7fbff66291c0  0 pidfile_write: ignore empty
> --pid-file
> 2020-05-01 10:07:50.817 7fbfe4408700  1 mds.prdceph01 Updating MDS map to
> version 1487238 from mon.2
> 2020-05-01 10:07:55.820 7fbfe4408700  1 mds.prdceph01 Updating MDS map to
> version 1487239 from mon.2
> 2020-05-01 10:07:55.820 7fbfe4408700  1 mds.prdceph01 Map has assigned me
> to become a standby
> 2020-05-01 10:11:07.369 7fbfe4408700  1 mds.prdceph01 Updating MDS map to
> version 1487282 from mon.2
> 2020-05-01 10:11:07.373 7fbfe4408700  1 mds.0.1487282 handle_mds_map i am
> now mds.0.1487282
> 2020-05-01 10:11:07.373 7fbfe4408700  1 mds.0.1487282 handle_mds_map state
> change up:boot --> up:replay
> 2020-05-01 10:11:07.373 7fbfe4408700  1 mds.0.1487282 replay_start
> 2020-05-01 10:11:07.374 7fbfe4408700  1 mds.0.1487282  recovery set is
> 2020-05-01 10:11:07.374 7fbfe4408700  1 mds.0.1487282  waiting for osdmap
> 2198096 (which blacklists prior instance)
> 2020-05-01 10:11:09.517 7fbfdd3fa700  0 mds.0.cache creating system inode
> with ino:0x100
> 2020-05-01 10:11:09.518 7fbfdd3fa700  0 mds.0.cache creating system inode
> with ino:0x1
> 2020-05-01 10:11:09.753 7fbfddbfb700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/mds/OpenFileTable.cc:
> In function 'void OpenFileTable::_load_finish(int, int, int, unsigned int,
> bool, bool, ceph::bufferlist&, std::map<std::basic_string<char>,
> ceph::buffer::v14_2_0::list>&)' thread 7fbfddbfb700 time 2020-05-01
> 10:11:09.752945
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/mds/OpenFileTable.cc:
> 777: FAILED ceph_assert(omap_num_objs == num_objs)
>
>  ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x14a) [0x7fbfed691875]
>  2: (()+0x253a3d) [0x7fbfed691a3d]
>  3: (OpenFileTable::_load_finish(int, int, int, unsigned int, bool, bool,
> ceph::buffer::v14_2_0::list&, std::map<std::string,
> ceph::buffer::v14_2_0::list, std::less<std::string>,
> std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> >
> >&)+0x1bb0) [0x563c26b1e8b0]
>  4: (C_IO_OFT_Load::finish(int)+0x3b) [0x563c26b2467b]
>  5: (MDSContext::complete(int)+0x74) [0x563c26afbce4]
>  6: (MDSIOContextBase::complete(int)+0x177) [0x563c26afbf47]
>  7: (Finisher::finisher_thread_entry()+0x16f) [0x7fbfed71a44f]
>  8: (()+0x7e65) [0x7fbfeb551e65]
>  9: (clone()+0x6d) [0x7fbfea1ff88d]
>
> 2020-05-01 10:11:09.755 7fbfddbfb700 -1 *** Caught signal (Aborted) **
>  in thread 7fbfddbfb700 thread_name:fn_anonymous
>
>  ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus
> (stable)
>  1: (()+0xf5f0) [0x7fbfeb5595f0]
>  2: (gsignal()+0x37) [0x7fbfea137337]
>  3: (abort()+0x148) [0x7fbfea138a28]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x199) [0x7fbfed6918c4]
>  5: (()+0x253a3d) [0x7fbfed691a3d]
>  6: (OpenFileTable::_load_finish(int, int, int, unsigned int, bool, bool,
> ceph::buffer::v14_2_0::list&, std::map<std::string,
> ceph::buffer::v14_2_0::list, std::less<std::string>,
> std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> >
> >&)+0x1bb0) [0x563c26b1e8b0]
>  7: (C_IO_OFT_Load::finish(int)+0x3b) [0x563c26b2467b]
>  8: (MDSContext::complete(int)+0x74) [0x563c26afbce4]
>  9: (MDSIOContextBase::complete(int)+0x177) [0x563c26afbf47]
>  10: (Finisher::finisher_thread_entry()+0x16f) [0x7fbfed71a44f]
>  11: (()+0x7e65) [0x7fbfeb551e65]
>  12: (clone()+0x6d) [0x7fbfea1ff88d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>
> I've taken the filesystem down to see if allowing autonomous healing of
> the undersized and degraded PGs (15 at this time) will help, but would
> greatly appreciate any assistance/guidance anyone can offer.
>
> Thanks in advance,
>
> Marco
>
> On Fri, May 1, 2020 at 3:36 PM Paul Emmerich <paul.emmerich@xxxxxxxx>
> wrote:
>
>>
>> On Fri, May 1, 2020 at 9:27 PM Paul Emmerich <paul.emmerich@xxxxxxxx>
>> wrote:
>>
>>> The OpenFileTable objects are safe to delete while the MDS is offline
>>> anyway; the RADOS object names are mds*_openfiles*
>>>
>>
>> I should clarify this a little bit: you shouldn't touch the CephFS
>> internal state or data structures unless you know *exactly* what you are
>> doing.
>>
>> However, it is pretty safe to delete these objects in general, and running
>> a scrub afterwards is a good idea anyway.
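>>
>> As a rough sketch only -- the metadata pool name below is a placeholder,
>> substitute whatever your CephFS metadata pool is called, and double-check
>> the scrub syntax against the docs for your release -- the general shape is:
>>
>>   # with the MDS daemons stopped, list the open file table objects
>>   rados -p cephfs_metadata ls | grep openfiles
>>   # remove the objects for the affected rank, e.g. rank 0:
>>   rados -p cephfs_metadata rm mds0_openfiles.0
>>   # once an MDS is active again, scrub the tree
>>   ceph tell mds.prdceph01 scrub start / recursive,repair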
>>
>> But only do this after reading up on lots of details or consulting an
>> expert.
>> My assessment here is purely an educated guess based on the error and can
>> be wrong or counter-productive. All of my mailing list advice is just
>> things that I know off the top of my head with no further research.
>> Take with a grain of salt. Don't touch stuff that you don't understand if
>> you have important data in there.
>>
>>
>> Paul
>>
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>>
>>>
>>>
>>>
>>> Paul
>>>
>>>
>>>
>>> On Fri, May 1, 2020 at 9:04 PM Marco Pizzolo <marcopizzolo@xxxxxxxxx>
>>> wrote:
>>>
>>>> Also seeing errors such as this:
>>>>
>>>>
>>>> [2020-05-01 13:15:20,970][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:20,970][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 11
>>>> [2020-05-01 13:15:20,974][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.13 with osd_fsid
>>>> dd49cd80-418e-4a8c-8ebf-a33d339663ff
>>>> [2020-05-01 13:15:20,989][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:20,989][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 11
>>>> [2020-05-01 13:15:20,998][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.5 with osd_fsid
>>>> 4eaf2baa-60f2-4045-8964-6152608c742a
>>>> [2020-05-01 13:15:21,014][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:21,014][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 11
>>>> [2020-05-01 13:15:21,019][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.9 with osd_fsid
>>>> 32f4a716-f26e-4579-a074-5d6452c22e34
>>>> [2020-05-01 13:15:21,035][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:21,035][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 11
>>>> [2020-05-01 13:15:25,972][ceph_volume.process][INFO  ] Running command:
>>>> /usr/sbin/ceph-volume lvm trigger 1-0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>>>> [2020-05-01 13:15:25,994][ceph_volume.process][INFO  ] Running command:
>>>> /usr/sbin/ceph-volume lvm trigger
>>>> 13-dd49cd80-418e-4a8c-8ebf-a33d339663ff
>>>> [2020-05-01 13:15:26,020][ceph_volume.process][INFO  ] Running command:
>>>> /usr/sbin/ceph-volume lvm trigger 5-4eaf2baa-60f2-4045-8964-6152608c742a
>>>> [2020-05-01 13:15:26,040][ceph_volume.process][INFO  ] Running command:
>>>> /usr/sbin/ceph-volume lvm trigger 9-32f4a716-f26e-4579-a074-5d6452c22e34
>>>> [2020-05-01 13:15:26,388][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.1 with osd_fsid
>>>> 0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>>>> [2020-05-01 13:15:26,389][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.13 with osd_fsid
>>>> dd49cd80-418e-4a8c-8ebf-a33d339663ff
>>>> [2020-05-01 13:15:26,391][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.5 with osd_fsid
>>>> 4eaf2baa-60f2-4045-8964-6152608c742a
>>>> [2020-05-01 13:15:26,402][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:26,403][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 10
>>>> [2020-05-01 13:15:26,403][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:26,404][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 10
>>>> [2020-05-01 13:15:26,404][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:26,405][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 10
>>>> [2020-05-01 13:15:26,411][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.9 with osd_fsid
>>>> 32f4a716-f26e-4579-a074-5d6452c22e34
>>>> [2020-05-01 13:15:26,424][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:26,424][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 10
>>>> [2020-05-01 13:15:31,408][ceph_volume.process][INFO  ] Running command:
>>>> /usr/sbin/ceph-volume lvm trigger 1-0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>>>> [2020-05-01 13:15:31,408][ceph_volume.process][INFO  ] Running command:
>>>> /usr/sbin/ceph-volume lvm trigger 5-4eaf2baa-60f2-4045-8964-6152608c742a
>>>> [2020-05-01 13:15:31,409][ceph_volume.process][INFO  ] Running command:
>>>> /usr/sbin/ceph-volume lvm trigger
>>>> 13-dd49cd80-418e-4a8c-8ebf-a33d339663ff
>>>> [2020-05-01 13:15:31,429][ceph_volume.process][INFO  ] Running command:
>>>> /usr/sbin/ceph-volume lvm trigger 9-32f4a716-f26e-4579-a074-5d6452c22e34
>>>> [2020-05-01 13:15:31,743][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.5 with osd_fsid
>>>> 4eaf2baa-60f2-4045-8964-6152608c742a
>>>> [2020-05-01 13:15:31,750][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.13 with osd_fsid
>>>> dd49cd80-418e-4a8c-8ebf-a33d339663ff
>>>> [2020-05-01 13:15:31,752][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:31,752][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 9
>>>> [2020-05-01 13:15:31,754][ceph_volume.process][INFO  ] stderr -->
>>>>  RuntimeError: could not find osd.1 with osd_fsid
>>>> 0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>>>> [2020-05-01 13:15:31,761][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:31,762][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 9
>>>> [2020-05-01 13:15:31,764][systemd][WARNING] command returned non-zero
>>>> exit
>>>> status: 1
>>>> [2020-05-01 13:15:31,765][systemd][WARNING] failed activating OSD,
>>>> retries
>>>> left: 9
>>>>
>>>> On Fri, May 1, 2020 at 2:23 PM Marco Pizzolo <marcopizzolo@xxxxxxxxx>
>>>> wrote:
>>>>
>>>> > Hi Ashley,
>>>> >
>>>> > Thanks for your response.  Nothing that I can think of has happened.  We
>>>> > are using max_mds = 1.  We have 4 MDS daemons, so we used to have 3
>>>> > standbys.  Within minutes they all crash.
>>>> >
>>>> > On Fri, May 1, 2020 at 2:21 PM Ashley Merrick <singapore@xxxxxxxxxxxxxx>
>>>> > wrote:
>>>> >
>>>> >> Quickly checking the code that calls that assert
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>   if (version > omap_version) {
>>>> >>     omap_version = version;
>>>> >>     omap_num_objs = num_objs;
>>>> >>     omap_num_items.resize(omap_num_objs);
>>>> >>     journal_state = jstate;
>>>> >>   } else if (version == omap_version) {
>>>> >>     ceph_assert(omap_num_objs == num_objs);
>>>> >>     if (jstate > journal_state)
>>>> >>       journal_state = jstate;
>>>> >>   }
>>>> >> }
>>>> >>
>>>> >>
>>>> >> I'm not a dev, so not sure if this will help, but it seems this could
>>>> >> mean the MDS thinks it's behind on the omaps / too far ahead.
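>>>> >>
>>>> >> Purely a guess on my part, but if you want to see what is actually on
>>>> >> disk (pool and object names below are assumptions -- adjust for your
>>>> >> metadata pool and affected rank):
>>>> >>
>>>> >>   # count the open file table objects that exist for rank 0
>>>> >>   rados -p cephfs_metadata ls | grep '^mds0_openfiles'
>>>> >>   # the omap header is where the table's version / object count is
>>>> >>   # recorded (output is binary-encoded)
>>>> >>   rados -p cephfs_metadata getomapheader mds0_openfiles.0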
>>>> >>
>>>> >>
>>>> >> Anything happened recently? Just running a single MDS?
>>>> >>
>>>> >>
>>>> >> Hopefully someone else may see this and shine some light on what
>>>> could be
>>>> >> causing it.
>>>> >>
>>>> >>
>>>> >>
>>>> >> ---- On Sat, 02 May 2020 02:10:58 +0800 marcopizzolo@xxxxxxxxx wrote ----
>>>> >>
>>>> >>
>>>> >> Hello,
>>>> >>
>>>> >> Hoping you can help me.
>>>> >>
>>>> >> Ceph had been largely problem-free for us for the better part of a year.
>>>> >> We have a high file count in a single CephFS filesystem, and are seeing
>>>> >> this error in the logs:
>>>> >>
>>>> >>
>>>> >>
>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/mds/OpenFileTable.cc:
>>>> >> 777: FAILED ceph_assert(omap_num_objs == num_objs)
>>>> >>
>>>> >> The issue seemed to occur this morning, and restarting the MDS as well
>>>> >> as rebooting the servers doesn't correct the problem.
>>>> >>
>>>> >> Not really sure where to look next as the MDS daemons crash.
>>>> >>
>>>> >> Appreciate any help you can provide
>>>> >>
>>>> >> Marco
>>>> >>
>>>> >
>>>>
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



