On Mon, Nov 25, 2024 at 3:33 PM <Alexey.Tsivinsky@xxxxxxxxxxxxxxxxxxxx> wrote:

> Thanks for your answer!
>
> Current status of our cluster:
>
>   cluster:
>     id:     c3d33e01-dfcd-4b39-8614-993370672504
>     health: HEALTH_WARN
>             1 failed cephadm daemon(s)
>             1 filesystem is degraded
>
>   services:
>     mon: 3 daemons, quorum cmon1,cmon2,cmon3 (age 15h)
>     mgr: cmon3.ixtbep(active, since 19h), standbys: cmon1.efktsr
>     mds: 2/2 daemons up
>     osd: 168 osds: 168 up (since 2d), 168 in (since 3w)
>
>   data:
>     volumes: 0/1 healthy, 1 recovering
>     pools:   4 pools, 4641 pgs
>     objects: 181.91M objects, 235 TiB
>     usage:   708 TiB used, 290 TiB / 997 TiB avail
>     pgs:     4630 active+clean
>              11   active+clean+scrubbing+deep

This doesn't reveal much. Can you share the MDS logs?

> We are trying to run cephfs-journal-tool --rank cephfs:0 journal inspect
> and the utility does nothing.

If the ranks are unavailable, it won't do anything. Do you see any log
statements like "Couldn't determine MDS rank."?

> We thought the MDS daemons were holding locks on their journals, so we
> stopped them. But the utility still does not work, and ceph -s says that
> one MDS is running, although we checked that we had stopped all the
> processes.
> It seems the journals are locked from somewhere else.
> What else can be done?

Do you want to restart the monitors?
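In the meantime, a few quick checks could show whether the blocklist has
actually propagated and which rank the journal tool needs. Something along
these lines (a rough sketch; <fs_name> is a placeholder for your filesystem's
name, as listed by `ceph fs ls`):

  ceph osd dump | grep ^epoch    # current osdmap epoch; compare it with the 123138 the MDS is waiting for
  ceph osd blocklist ls          # blocklist entries left by the prior MDS instances
  ceph fs status                 # ranks and the state each MDS is in
  cephfs-journal-tool --rank <fs_name>:0 journal inspect

If the current epoch is already past 123138 and the MDS still sits in replay,
the MDS logs would be the next thing to look at.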
> Best Regards!
>
> Alexey Tsivinsky
>
> e-mail: a.tsivinsky@xxxxxxxxxxxxxxxxxxxxx
>
> ________________________________
> From: Dhairya Parmar <dparmar@xxxxxxxxxx>
> Sent: 25 November 2024 12:19
> To: Цивинский Алексей Александрович
> Cc: Marc@xxxxxxxxxxxxxxxxx; ceph-users@xxxxxxx
> Subject: Re: Re: CephFS 16.2.10 problem
>
> Hi,
>
> The log you shared indicates that the MDS is waiting for the latest OSDMap
> epoch. The epoch number in that log line, 123138, is the epoch of the last
> failure. Any MDS entering the replay state needs at least this osdmap epoch
> to ensure the blocklist propagates. If the epoch is lower than this, it just
> goes back to waiting.
>
> I have limited knowledge about the OSDs, but you mentioned in your initial
> mail that you executed some OSD commands, and I'm not sure if the issue lies
> there. You can check and share the OSD logs, or maybe `ceph -s` could reveal
> some potential warnings.
>
> Dhairya Parmar
>
> Associate Software Engineer, CephFS
>
> IBM, Inc.
>
> On Mon, Nov 25, 2024 at 1:29 PM <Alexey.Tsivinsky@xxxxxxxxxxxxxxxxxxxx>
> wrote:
>
>> Good afternoon.
>>
>> We tried leaving only one MDS: we stopped the others, even deleted one,
>> and turned off the standby MDS requirement. Nothing helped; the MDS stayed
>> in the replay state.
>> Current situation: we now have two active MDS daemons in the replay state
>> and one in standby.
>> In the logs we see the message:
>> mds.0.660178 waiting for osdmap 123138 (which blocklists prior instance)
>> At the same time, there is no activity on either MDS.
>> Running the cephfs-journal-tool journal inspect utility produces no
>> results: it ran for 12 hours without any output, so we stopped it.
>>
>> Maybe the problem is this blocklisting? How do we remove it?
>>
>> Best regards!
>>
>> Alexey Tsivinsky
>> e-mail: a.tsivinsky@xxxxxxxxxxxxxxxxxxxxx
>> ________________________________________
>> From: Marc <Marc@xxxxxxxxxxxxxxxxx>
>> Sent: 25 November 2024 1:47
>> To: Цивинский Алексей Александрович; ceph-users@xxxxxxx
>> Subject: RE: CephFS 16.2.10 problem
>>
>> >
>> > The following problem occurred.
>> > There is a cluster running Ceph 16.2.10.
>> > The cluster was operating normally on Friday.
>> > Shut down the cluster:
>> > - Excluded all clients
>> > Executed the commands:
>> > ceph osd set noout
>> > ceph osd set nobackfill
>> > ceph osd set norecover
>> > ceph osd set norebalance
>> > ceph osd set nodown
>> > ceph osd set pause
>> > Turned off the cluster and performed server maintenance.
>> > Powered the cluster back on. It came back up, found all the nodes, and
>> > here the problem began. After all OSDs came up and all PGs became
>> > available, CephFS refused to start.
>> > Now the MDS daemons are in the replay state and do not progress to
>> > active.
>> > Previously one of them was in the replay (laggy) state, so we ran:
>> > ceph config set mds mds_wipe_sessions true
>> > After that the MDS switched to the replay state, a third MDS started in
>> > standby, and the MDS crashes with an error stopped occurring.
>> > But CephFS is still unavailable.
>> > What else can we do?
>> > The cluster is very large, almost 200 million files.
>>
>> I assume you tried to start just one MDS and waited for it to come up as
>> active (before starting the others)?
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
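A side note on the shutdown flags listed in the original report above: once
all OSDs are back up and in, they are normally cleared in reverse order. A
sketch, relevant only if `ceph osd dump | grep flags` shows them still set:

  ceph osd unset pause
  ceph osd unset nodown
  ceph osd unset norebalance
  ceph osd unset norecover
  ceph osd unset nobackfill
  ceph osd unset noout

And Marc's suggestion of running a single active MDS would look roughly like
this (<fs_name> again being a placeholder for the filesystem name):

  ceph fs set <fs_name> max_mds 1
  ceph fs set <fs_name> allow_standby_replay false
  ceph fs status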