Hi Igor,

sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to defragment the OSD. It doesn't look like the fsck command does that. Is there any such tool?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 07 October 2022 01:53:20
To: Igor Fedotov; ceph-users@xxxxxxx
Subject: Re: OSD crashes during upgrade mimic->octopus

Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, so the numbers I included are representative. I don't believe that we had one such extreme outlier. Maybe it ran full during conversion; most of the data is OMAP, after all.

I can't dump the free-dumps into pastebin, they are too large. Not sure if you can access ceph-post-files. I will send you a tgz in a separate e-mail directly to you.

> And once again - do other non-starting OSDs show the same ENOSPC error?
> Evidently I'm unable to make any generalization about the root cause due
> to lack of the info...

As I said before, I need more time to check this and give you the answer you actually want. The trivial answer is that they don't, because the other 3 are taken down the moment osd.16 crashes and never reach the same point. I need to take them out of the grouped management and start them by hand, which I can do tomorrow. I'm too tired now to experiment on our production system.

The free-dumps are on their way separately. I included one for OSD 17 as well (on the same disk).

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@xxxxxxx
Cc: Stefan Kooman
Subject: Re: OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16, so please share that OSD's utilization... And honestly, I trust the allocator's stats more, so it's rather the CLI stats that are incorrect, if any. Anyway, the free dump should provide additional proof...

And once again - do other non-starting OSDs show the same ENOSPC error? Evidently I'm unable to make any generalization about the root cause due to lack of the info...

W.r.t. fsck - you can try to run it - since fsck opens the DB in read-only mode there are some chances it will work.

Thanks,
Igor

On 10/7/2022 1:59 AM, Frank Schilder wrote:
> Hi Igor,
>
> I suspect there is something wrong with the data reported. These OSDs are only 50-60% used. For example:
>
> ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
>  29  ssd    0.09099  1.00000   93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
>  44  ssd    0.09099  1.00000   93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
>  58  ssd    0.09099  1.00000   93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
> 984  ssd    0.09099  1.00000   93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984
>
> Yes, these drives are small, but it should be possible to find 1M more. It sounds like some stats data/counters are incorrect/corrupted. Is it possible to run an fsck on a bluestore device to have it checked for that? Any idea how an incorrect utilisation might come about?
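>
> Just so I get the mechanics right: I was thinking of something along these lines, with the OSD stopped first (osd.29 only as an example; commands as I understand them from the ceph-bluestore-tool docs, untested on my side):
>
>   ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-29
>   ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-29 > /tmp/osd.29-free.json
>
> Please tell me if that is not how the tool is meant to be used.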
>
> I will look into starting these OSDs individually. This will be a bit of work, as our deployment method is to start/stop all OSDs sharing the same disk simultaneously (OSDs are grouped by disk). If one fails, all others also go down. It's for simplifying disk management, and this kind of debugging is a new use case we never needed before.
>
> Thanks for your help at this late hour!
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
> Sent: 07 October 2022 00:37:34
> To: Frank Schilder; ceph-users@xxxxxxx
> Cc: Stefan Kooman
> Subject: Re: OSD crashes during upgrade mimic->octopus
>
> Hi Frank,
>
> the abort message "bluefs enospc" indicates a lack of free space for additional bluefs space allocations, which prevents the osd from starting up.
>
> From the following log line one can see that bluefs needs ~1M more space while the total available space is approx 622M. The problem is that bluefs needs contiguous(!) 64K chunks, which apparently aren't available due to high disk fragmentation:
>
> -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1 bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to allocate on 0x110000 min_size 0x110000 > allocated total 0x30000 bluefs_shared_alloc_size 0x10000 allocated 0x30000 available 0x 25134000
>
> To double check the above root cause analysis it would be helpful to get the output of ceph-bluestore-tool's free-dump command - there is a small chance of a bug in the allocator which "misses" some long enough chunks. But given the disk space utilization (>90%) and the pretty small disk size this is unlikely IMO.
>
> So to work around the issue and bring the OSD up you should either expand the main device for the OSD or add a standalone DB volume.
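>
> E.g. something like the following, with the OSD down (volume names are just placeholders - please check the ceph-bluestore-tool man page for your exact release):
>
>   # variant 1: attach a standalone DB volume on a spare LV/partition
>   ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-16 --dev-target /dev/<vg>/<new-db-lv>
>
>   # variant 2: grow the main LV, then let bluefs pick up the new space
>   lvextend -L +10G /dev/<vg>/<osd-16-main-lv>
>   ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-16
>
> For ceph-volume managed OSDs the new DB device may also need to be tagged so that activation picks it up.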
>
> Curious whether other non-starting OSDs report the same error...
>
> Thanks,
>
> Igor
>
> On 10/7/2022 1:02 AM, Frank Schilder wrote:
>> Hi Igor,
>>
>> the problematic disk holds OSDs 16, 17, 18 and 19. OSD 16 is the one that is crashing. I collected its startup log here: https://pastebin.com/25D3piS6 . The line sticking out is line 603:
>>
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc: 2931: ceph_abort_msg("bluefs enospc")
>>
>> This smells a lot like rocksdb corruption. Can I do something about that? I still need to convert most of our OSDs and I cannot afford to lose more. The rebuild simply takes too long in the current situation.
>>
>> Thanks for your help and best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>> Sent: 06 October 2022 17:03:53
>> To: Frank Schilder; ceph-users@xxxxxxx
>> Cc: Stefan Kooman
>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>
>> Sorry - no clue about the CephFS related questions...
>>
>> But could you please share the full OSD startup log for any one which is unable to restart after the host reboot?
>>
>> On 10/6/2022 5:12 PM, Frank Schilder wrote:
>>> Hi Igor and Stefan.
>>>
>>>>> Not sure why you're talking about replicated(!) 4(2) pool.
>>>> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools.
>>> I have to take this one back. It is indeed an EC pool that is also on these SSD OSDs that is affected. The meta-data pool was all active the whole time until we lost the 3rd host. So, the bug reported is confirmed to affect EC pools.
>>>
>>>> If not - does any dead OSD unconditionally mean its underlying disk is unavailable any more?
>>> Fortunately not. After losing disks on the 3rd host, we had to start taking somewhat more desperate measures. We set the file system offline to stop client IO and started rebooting hosts in reverse order of failing. This brought back the OSDs on the still unconverted hosts. We rebooted the converted host with the original OSD failures last. Unfortunately, here it seems we lost a drive for good. It looks like the OSDs crashed while the conversion was going on, or something like that. They don't boot up and I need to look into that in more detail.
>>>
>>> We are currently trying to encourage fs clients to reconnect to the file system. Unfortunately, on many we get
>>>
>>> # ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
>>> ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle
>>>
>>> Is there a server-side way to encourage the FS clients to reconnect to the cluster? What is a clean way to get them back onto the file system? I tried remounting without success.
>>>
>>> Before executing the next conversion, I will compact the rocksdb on all SSD OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number of objects per PG, which is potentially the main reason for our observations.
>>>
>>> Thanks for your help,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>> Sent: 06 October 2022 14:39
>>> To: Frank Schilder; ceph-users@xxxxxxx
>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>
>>> Are the crashing OSDs still bound to two hosts?
>>>
>>> If not - does any dead OSD unconditionally mean its underlying disk is unavailable any more?
>>>
>>> On 10/6/2022 3:35 PM, Frank Schilder wrote:
>>>> Hi Igor.
>>>>
>>>>> Not sure why you're talking about replicated(!) 4(2) pool.
>>>> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools.
>>>>
>>>> I just lost another disk, we have PGs down now. I really hope the stuck bstore_kv_sync thread does not lead to rocksdb corruption.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>> Sent: 06 October 2022 14:26
>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>>
>>>> On 10/6/2022 2:55 PM, Frank Schilder wrote:
>>>>> Hi Igor,
>>>>>
>>>>> it has the SSD OSDs down; the HDD OSDs are running just fine. I don't want to make a bad situation worse, so for now I will wait for recovery to finish. The inactive PGs are activating very slowly.
>>>> Got it.
>>>>
>>>>> By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even inactive here? This "feature" is new in octopus, I reported it about 2 months ago as a bug. Testing with mimic I cannot reproduce this problem: https://tracker.ceph.com/issues/56995
>>>> Not sure why you're talking about a replicated(!) 4(2) pool. In the above ticket I can see an EC 4+2 one (pool 4 'fs-data' erasure profile ec-4-2...), which means 6 shards per object, and maybe this setup has some issues with mapping to unique osds within a host (just 3 hosts are available!)... One can see that pg 4.* are marked as inactive only. Not a big expert in this stuff, so mostly just speculating...
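>>>>
>>>> You could cross-check the mapping with something like this (pool/profile names taken from the ticket, pg 4.0 just as an example from pool 4):
>>>>
>>>>   ceph osd pool get fs-data min_size
>>>>   ceph osd erasure-code-profile get ec-4-2
>>>>   ceph pg map 4.0     # shows the up/acting OSD set for one of the inactive PGs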
>>>>
>>>> Do you have the same setup in the production cluster in question? If so - then you lack 2 of 6 shards, and IMO the cluster properly marks the relevant PGs as inactive. The same would apply to 3x replicated PGs as well, though, since two replicas are down...
>>>>
>>>>> I found this in the syslog, maybe it helps:
>>>>>
>>>>> kernel: task:bstore_kv_sync state:D stack: 0 pid:3646032 ppid:3645340 flags:0x00000000
>>>>> kernel: Call Trace:
>>>>> kernel: __schedule+0x2a2/0x7e0
>>>>> kernel: schedule+0x4e/0xb0
>>>>> kernel: io_schedule+0x16/0x40
>>>>> kernel: wait_on_page_bit_common+0x15c/0x3e0
>>>>> kernel: ? __page_cache_alloc+0xb0/0xb0
>>>>> kernel: wait_on_page_bit+0x3f/0x50
>>>>> kernel: wait_on_page_writeback+0x26/0x70
>>>>> kernel: __filemap_fdatawait_range+0x98/0x100
>>>>> kernel: ? __filemap_fdatawrite_range+0xd8/0x110
>>>>> kernel: file_fdatawait_range+0x1a/0x30
>>>>> kernel: sync_file_range+0xc2/0xf0
>>>>> kernel: ksys_sync_file_range+0x41/0x80
>>>>> kernel: __x64_sys_sync_file_range+0x1e/0x30
>>>>> kernel: do_syscall_64+0x3b/0x90
>>>>> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> kernel: RIP: 0033:0x7ffbb6f77ae7
>>>>> kernel: RSP: 002b:00007ffba478c3c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
>>>>> kernel: RAX: ffffffffffffffda RBX: 000000000000002d RCX: 00007ffbb6f77ae7
>>>>> kernel: RDX: 0000000000002000 RSI: 000000015f849000 RDI: 000000000000002d
>>>>> kernel: RBP: 000000015f849000 R08: 0000000000000000 R09: 0000000000002000
>>>>> kernel: R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000002000
>>>>> kernel: R13: 0000000000000007 R14: 0000000000000001 R15: 0000560a1ae20380
>>>>> kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
>>>>> kernel: Tainted: G E 5.14.13-1.el7.elrepo.x86_64 #1
>>>>> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>
>>>>> It is quite possible that this was the moment when these OSDs got stuck and were marked down. The time stamp is about right.
>>>> Right. This is the primary thread which submits transactions to the DB, and it was stuck for >123 seconds. Given that the disk is completely unresponsive, I presume something has happened at a lower level (controller or disk FW) though... Maybe this was somehow caused by "fragmented" DB access, and compaction would heal it. On the other hand, the compaction had to be applied after the omap upgrade, so I'm not sure another one would change the state...
>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>>> Sent: 06 October 2022 13:45:17
>>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>>>
>>>>> From your response to Stefan I gather that one of the two damaged hosts has all OSDs down and unable to start. Is that correct? If so, you can reboot it with no problem and proceed with manual compaction [and other experiments] quite "safely" for the rest of the cluster.
>>>>>
>>>>> On 10/6/2022 2:35 PM, Frank Schilder wrote:
>>>>>> Hi Igor,
>>>>>>
>>>>>> I can't access these drives. They have an OSD or LVM process hanging in D-state. Any attempt to do something with these gets stuck as well.
>>>>>>
>>>>>> I somehow need to wait for recovery to finish and protect the still-running OSDs from crashing similarly badly.
>>>>>>
>>>>>> After we have full redundancy again and service is back, I can add the setting osd_compact_on_start=true and start rebooting servers. Right now I need to prevent the ship from sinking.
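>>>>>>
>>>>>> (If I understand the docs correctly, that should just be
>>>>>>
>>>>>>   ceph config set osd osd_compact_on_start true
>>>>>>
>>>>>> applied before the reboots - please correct me if that option doesn't do what I think it does on 15.2.17.)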
>>>>>>
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>>>> Sent: 06 October 2022 13:28:11
>>>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>>>>
>>>>>> IIUC the OSDs that expose "had timed out after 15" are failing to start up. Is that correct, or did I miss something? I meant trying compaction for them...
>>>>>>
>>>>>> On 10/6/2022 2:27 PM, Frank Schilder wrote:
>>>>>>> Hi Igor,
>>>>>>>
>>>>>>> thanks for your response.
>>>>>>>
>>>>>>>> And what's the target Octopus release?
>>>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>>>
>>>>>>> I'm afraid I don't have the luxury right now to take OSDs down or add extra load with an online compaction. I would really appreciate a way to make the OSDs more crash tolerant until I have full redundancy again. Is there a setting that increases the OPS timeout, or is there a way to restrict the load to tolerable levels?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> =================
>>>>>>> Frank Schilder
>>>>>>> AIT Risø Campus
>>>>>>> Bygning 109, rum S14
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>>>>> Sent: 06 October 2022 13:15
>>>>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>>>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>>>>>
>>>>>>> Hi Frank,
>>>>>>>
>>>>>>> you might want to compact RocksDB with ceph-kvstore-tool for those OSDs which are showing
>>>>>>>
>>>>>>> "heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15"
>>>>>>>
>>>>>>> I could see such an error pretty often after bulk data removal and the following severe DB performance drop.
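>>>>>>>
>>>>>>> Something along these lines, with the OSD stopped (osd path as usual for your deployment, ceph-NNN just a placeholder):
>>>>>>>
>>>>>>>   ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-NNN compact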
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Igor

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx