Re: OSD crashes during upgrade mimic->octopus

Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, so the numbers I included are representative. I don't believe that we had one such extreme outlier. Maybe it ran full during conversion. Most of the data is OMAP after all.

I can't put the free dumps on pastebin; they are too large. I'm not sure if you can access ceph-post-files, so I will send you a tgz in a separate e-mail directly.
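
For illustration only, the question a free dump is meant to answer (how many contiguous 64K chunks the free extents can actually provide) can be sketched in Python as follows. The extent list is hypothetical sample data, not taken from any OSD in this thread; the function name is made up; and the assumption that a chunk must start on a 64K boundary is an assumption of this sketch, not something confirmed here.

```python
from typing import Iterable, Tuple

CHUNK = 0x10000  # 64 KiB, the bluefs_shared_alloc_size seen in the log


def usable_chunks(extents: Iterable[Tuple[int, int]], chunk: int = CHUNK) -> int:
    """Count chunk-aligned, chunk-sized pieces that fit inside the free extents.

    Each extent is an (offset, length) pair in bytes. Alignment to `chunk`
    is an assumption of this sketch.
    """
    total = 0
    for offset, length in extents:
        start = -(-offset // chunk) * chunk  # round the start up to a chunk boundary
        end = offset + length
        if end > start:
            total += (end - start) // chunk
    return total


# Hypothetical free extents as (offset, length) pairs in bytes.
sample = [(0x1000, 0x8000), (0x20000, 0x10000), (0x35000, 0x1F000)]

free_bytes = sum(length for _, length in sample)
print(f"free bytes: {free_bytes:#x}, usable 64K chunks: {usable_chunks(sample)}")
```

In this made-up sample the free space adds up to more than three chunks' worth of bytes, yet only two aligned 64K chunks fit, which is the kind of gap between "available" and "allocatable" that fragmentation would produce.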

> And once again - do other non-starting OSDs show the same ENOSPC error?
> Evidently I'm unable to make any generalization about the root cause due
> to lack of the info...

As I said before, I need more time to check this and give you the answer you actually want. The stupid answer is they don't, because the other 3 are taken down the moment 16 crashes and don't reach the same point. I need to take them out of the grouped management and start them by hand, which I can do tomorrow. I'm too tired now to play on our production system.

The free-dumps are on their separate way. I included one for OSD 17 as well (on the same disk).

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@xxxxxxx
Cc: Stefan Kooman
Subject: Re:  OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16, so please share that OSD's
utilization. Honestly, I trust the allocator's stats more, so if anything
is incorrect it is more likely the CLI stats. In any case, the free dump
should provide additional evidence.
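
As a worked example of the allocator figures in question, the hexadecimal fields from the allocate_bluefs_freespace log line for osd.16 (quoted further down in this thread) can be converted as in the Python sketch below. The constant and function names are made up for illustration; only the four values come from the log.

```python
# Values copied from the allocate_bluefs_freespace log line for osd.16.
MIN_SIZE = 0x110000      # space bluefs asked for
ALLOCATED = 0x30000      # what the allocator could actually hand out
ALLOC_UNIT = 0x10000     # bluefs_shared_alloc_size
AVAILABLE = 0x25134000   # total free space reported


def chunks(nbytes: int, unit: int = ALLOC_UNIT) -> int:
    """Number of allocation units needed to cover nbytes (rounded up)."""
    return -(-nbytes // unit)


def summarize() -> dict:
    """Convert the raw hex fields into chunk counts and decimal megabytes."""
    needed = chunks(MIN_SIZE)
    got = chunks(ALLOCATED)
    return {
        "unit_kib": ALLOC_UNIT // 1024,
        "needed_chunks": needed,
        "allocated_chunks": got,
        "missing_chunks": needed - got,
        "available_mb": AVAILABLE / 1e6,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

This gives 17 chunks of 64 KiB requested, 3 obtained, and about 622 MB (decimal) reported as free in total.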

And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...


W.r.t. fsck: you can try to run it. Since fsck opens the DB read-only,
there is some chance it will work.


Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:
> Hi Igor,
>
> I suspect there is something wrong with the data reported. These OSDs are only 50-60% used. For example:
>
> ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
>  29  ssd    0.09099  1.00000   93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
>  44  ssd    0.09099  1.00000   93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
>  58  ssd    0.09099  1.00000   93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
> 984  ssd    0.09099  1.00000   93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984
>
> Yes, these drives are small, but it should be possible to find 1M more. It sounds like some stats data/counters are incorrect/corrupted. Is it possible to run an fsck on a bluestore device to have it checked for that? Any idea how an incorrect utilisation might come about?
>
> I will look into starting these OSDs individually. This will be a bit of work as our deployment method is to start/stop all OSDs sharing the same disk simultaneously (OSDs are grouped by disk). If one fails, all others also go down. It's for simplifying disk management, and this debugging is a new use case we never needed before.
>
> Thanks for your help at this late hour!
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
> Sent: 07 October 2022 00:37:34
> To: Frank Schilder; ceph-users@xxxxxxx
> Cc: Stefan Kooman
> Subject: Re:  OSD crashes during upgrade mimic->octopus
>
> Hi Frank,
>
> the abort message "bluefs enospc" indicates a lack of free space for
> additional bluefs space allocations, which prevents the OSD from starting up.
>
> From the following log line one can see that bluefs needs ~1M more
> space while the total available is approx. 622M. The problem is that
> bluefs needs contiguous(!) 64K chunks, which apparently aren't
> available due to high disk fragmentation.
>
>       -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
> bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
> allocate on 0x110000 min_size 0x110000 > allocated total 0x30000
> bluefs_shared_alloc_size 0x10000 allocated 0x30000 available 0x 25134000
>
>
> To double-check the above root cause analysis it would be helpful to get
> the output of ceph-bluestore-tool's free-dump command. There is a small chance
> of a bug in the allocator which "misses" some long enough chunks, but given
> the disk space utilization (>90%) and the pretty small disk size this is
> unlikely IMO.
>
> So to work around the issue and bring the OSD up you should either expand
> the main device for the OSD or add a standalone DB volume.
>
>
> Curious whether other non-starting OSDs report the same error...
>
>
> Thanks,
>
> Igor
>
>
>
> On 10/7/2022 1:02 AM, Frank Schilder wrote:
>> Hi Igor,
>>
>> the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one crashing the show. I collected its startup log here: https://pastebin.com/25D3piS6 . The line sticking out is line 603:
>>
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc: 2931: ceph_abort_msg("bluefs enospc")
>>
>> This smells a lot like rocksdb corruption. Can I do something about that? I still need to convert most of our OSDs and I cannot afford to lose more. The rebuild simply takes too long in the current situation.
>>
>> Thanks for your help and best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>> Sent: 06 October 2022 17:03:53
>> To: Frank Schilder; ceph-users@xxxxxxx
>> Cc: Stefan Kooman
>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>
>> Sorry - no clue about CephFS related questions...
>>
>> But could you please share full OSD startup log for any one which is
>> unable to restart after host reboot?
>>
>>
>> On 10/6/2022 5:12 PM, Frank Schilder wrote:
>>> Hi Igor and Stefan.
>>>
>>>>> Not sure why you're talking about replicated(!) 4(2) pool.
>>>> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools.
>>> I have to take this one back. It is indeed an EC pool that is also on these SSD OSDs that is affected. The meta-data pool was all active all the time until we lost the 3rd host. So, the bug reported is confirmed to affect EC pools.
>>>
>>>> If not - does a dead OSD necessarily mean its underlying disk is
>>>> no longer available?
>>> Fortunately not. After losing disks on the 3rd host, we had to start taking somewhat more desperate measures. We set the file system off-line to stop client IO and started rebooting hosts in reverse order of failure. This brought back the OSDs on the still un-converted hosts. We rebooted the converted host where OSDs first failed last. Unfortunately, here it seems we lost a drive for good. It looks like the OSDs crashed while the conversion was going on or something. They don't boot up and I need to look into that in more detail.
>>>
>>> We are currently trying to encourage fs clients to reconnect to the file system. Unfortunately, on many we get
>>>
>>> # ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
>>> ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle
>>>
>>> Is there a server-side way to encourage the FS clients to reconnect to the cluster? What is a clean way to get them back onto the file system? I tried remounting without success.
>>>
>>> Before executing the next conversion, I will compact the rocksdb on all SSD OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number of objects per PG, which is potentially the main reason for our observations.
>>>
>>> Thanks for your help,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>> Sent: 06 October 2022 14:39
>>> To: Frank Schilder; ceph-users@xxxxxxx
>>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>>
>>> Are crashing OSDs still bound to two hosts?
>>>
>>> If not - does a dead OSD necessarily mean its underlying disk is
>>> no longer available?
>>>
>>>
>>> On 10/6/2022 3:35 PM, Frank Schilder wrote:
>>>> Hi Igor.
>>>>
>>>>> Not sure why you're talking about replicated(!) 4(2) pool.
>>>> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools.
>>>>
>>>> I just lost another disk, we have PGs down now. I really hope the stuck bstore_kv_sync thread does not lead to rocksdb corruption.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>> Sent: 06 October 2022 14:26
>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>>>
>>>> On 10/6/2022 2:55 PM, Frank Schilder wrote:
>>>>> Hi Igor,
>>>>>
>>>>> it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to make a bad situation worse for now and wait for recovery to finish. The inactive PGs are activating very slowly.
>>>> Got it.
>>>>
>>>>> By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even inactive here? This "feature" is new in octopus, I reported it about 2 months ago as a bug. Testing with mimic I cannot reproduce this problem: https://tracker.ceph.com/issues/56995
>>>> Not sure why you're talking about replicated(!) 4(2) pool. In the above
>>>> ticket I can see an EC 4+2 one (pool 4 'fs-data' erasure profile
>>>> ec-4-2...). That means 6 shards per object, and maybe this setup has
>>>> some issues with mapping to unique OSDs within a host (just 3 hosts are
>>>> available!) ... One can see that only pg 4.* are marked as inactive.
>>>> I'm not a big expert in this stuff, so mostly just speculating...
>>>>
>>>>
>>>> Do you have the same setup in the production cluster in question? If so
>>>> - then you lack 2 of 6 shards and IMO the cluster properly marks the
>>>> relevant PGs as inactive. The same would apply to 3x replicated PGs as
>>>> well though since two replicas are down..
>>>>
>>>>
>>>>> I found this in the syslog, maybe it helps:
>>>>>
>>>>> kernel: task:bstore_kv_sync  state:D stack:    0 pid:3646032 ppid:3645340 flags:0x00000000
>>>>> kernel: Call Trace:
>>>>> kernel: __schedule+0x2a2/0x7e0
>>>>> kernel: schedule+0x4e/0xb0
>>>>> kernel: io_schedule+0x16/0x40
>>>>> kernel: wait_on_page_bit_common+0x15c/0x3e0
>>>>> kernel: ? __page_cache_alloc+0xb0/0xb0
>>>>> kernel: wait_on_page_bit+0x3f/0x50
>>>>> kernel: wait_on_page_writeback+0x26/0x70
>>>>> kernel: __filemap_fdatawait_range+0x98/0x100
>>>>> kernel: ? __filemap_fdatawrite_range+0xd8/0x110
>>>>> kernel: file_fdatawait_range+0x1a/0x30
>>>>> kernel: sync_file_range+0xc2/0xf0
>>>>> kernel: ksys_sync_file_range+0x41/0x80
>>>>> kernel: __x64_sys_sync_file_range+0x1e/0x30
>>>>> kernel: do_syscall_64+0x3b/0x90
>>>>> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> kernel: RIP: 0033:0x7ffbb6f77ae7
>>>>> kernel: RSP: 002b:00007ffba478c3c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
>>>>> kernel: RAX: ffffffffffffffda RBX: 000000000000002d RCX: 00007ffbb6f77ae7
>>>>> kernel: RDX: 0000000000002000 RSI: 000000015f849000 RDI: 000000000000002d
>>>>> kernel: RBP: 000000015f849000 R08: 0000000000000000 R09: 0000000000002000
>>>>> kernel: R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000002000
>>>>> kernel: R13: 0000000000000007 R14: 0000000000000001 R15: 0000560a1ae20380
>>>>> kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
>>>>> kernel:      Tainted: G            E     5.14.13-1.el7.elrepo.x86_64 #1
>>>>> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>
>>>>> It is quite possible that this was the moment when these OSDs got stuck and were marked down. The time stamp is about right.
>>>> Right. This is the primary thread which submits transactions to the DB, and it
>>>> was stuck for >123 seconds. Given that the disk is completely unresponsive, I
>>>> presume something has happened at a lower level (controller or disk FW),
>>>> though. Maybe this was somehow caused by "fragmented" DB access and
>>>> compaction would heal this. On the other hand, compaction should already have
>>>> been applied after the omap upgrade, so I'm not sure another one would change
>>>> the state...
>>>>
>>>>
>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>>> Sent: 06 October 2022 13:45:17
>>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>>>>
>>>>> From your response to Stefan I gather that one of the two damaged hosts
>>>>> has all OSDs down and unable to start. Is that correct? If so you can
>>>>> reboot it with no problem and proceed with manual compaction [and other
>>>>> experiments] quite "safely" for the rest of the cluster.
>>>>>
>>>>>
>>>>> On 10/6/2022 2:35 PM, Frank Schilder wrote:
>>>>>> Hi Igor,
>>>>>>
>>>>>> I can't access these drives. They have an OSD- or LVM process hanging in D-state. Any attempt to do something with these gets stuck as well.
>>>>>>
>>>>>> I somehow need to wait for recovery to finish and protect the still running OSDs from crashing similarly badly.
>>>>>>
>>>>>> After we have full redundancy again and service is back, I can add the setting osd_compact_on_start=true and start rebooting servers. Right now I need to prevent the ship from sinking.
>>>>>>
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>>>> Sent: 06 October 2022 13:28:11
>>>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>>>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>>>>>
>>>>>> IIUC the OSDs that show "had timed out after 15" are failing to start
>>>>>> up. Is that correct or did I miss something? I meant trying compaction
>>>>>> for them...
>>>>>>
>>>>>>
>>>>>> On 10/6/2022 2:27 PM, Frank Schilder wrote:
>>>>>>> Hi Igor,
>>>>>>>
>>>>>>> thanks for your response.
>>>>>>>
>>>>>>>> And what's the target Octopus release?
>>>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>>>
>>>>>>> I'm afraid I don't have the luxury right now to take OSDs down or add extra load with an on-line compaction. I would really appreciate a way to make the OSDs more crash tolerant until I have full redundancy again. Is there a setting that increases the OPS timeout or is there a way to restrict the load to tolerable levels?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> =================
>>>>>>> Frank Schilder
>>>>>>> AIT Risø Campus
>>>>>>> Bygning 109, rum S14
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>>>>> Sent: 06 October 2022 13:15
>>>>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>>>>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>>>>>>
>>>>>>> Hi Frank,
>>>>>>>
>>>>>>> you might want to compact RocksDB by ceph-kvstore-tool for those OSDs
>>>>>>> which are showing
>>>>>>>
>>>>>>> "heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I have seen this error quite often after bulk data removal and the
>>>>>>> severe DB performance drop that follows.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Igor
--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx




