Re: OSD crashes during upgrade mimic->octopus

Igor Fedotov <igor.fedotov@xxxxxxxx> · Fri, 7 Oct 2022 02:44:09 +0300

Well, I've just realized that you're apparently unable to collect these 
high-level stats for the broken OSDs, aren't you?

But if that's the case you shouldn't make any assumptions about the 
utilization of the faulty OSDs based on the healthy ones - that's 
definitely a very doubtful approach ;)

On 10/7/2022 2:19 AM, Igor Fedotov wrote:
The log I inspected was for osd.16, so please share that OSD's 
utilization... And honestly, I trust the allocator's stats more, so if 
anything is incorrect it is more likely the CLI stats. Anyway, a free 
dump should provide additional evidence.

And once again - do the other non-starting OSDs show the same ENOSPC 
error? Evidently I'm unable to generalize about the root cause for 
lack of information...

W.r.t. fsck - you can try to run it. Since fsck opens the DB read-only, 
there is some chance it will work.

Thanks,

Igor

On 10/7/2022 1:59 AM, Frank Schilder wrote:
Hi Igor,

I suspect there is something wrong with the data reported. These OSDs 
are only 50-60% used. For example:

ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29  ssd    0.09099  1.00000   93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44  ssd    0.09099  1.00000   93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58  ssd    0.09099  1.00000   93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984  ssd    0.09099  1.00000   93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M 
more. It sounds like some stats data/counters are 
incorrect/corrupted. Is it possible to run an fsck on a bluestore 
device to have it checked for that? Any idea how an incorrect 
utilisation might come about?

I will look into starting these OSDs individually. This will be a bit 
of work as our deployment method is to start/stop all OSDs sharing 
the same disk simultaneously (OSDs are grouped by disk). If one fails, 
all others also go down. It's for simplifying disk management, and this 
kind of debugging is a new use case we never needed before.

Thanks for your help at this late hour!

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@xxxxxxx
Cc: Stefan Kooman
Subject: Re:  OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates a lack of free space for
additional bluefs space allocations, which prevents the OSD from starting.

From the following log line one can see that bluefs needs ~1M more
space while the total available is approx. 622M. The problem is that
bluefs needs contiguous(!) 64K chunks, though, which apparently aren't
available due to high disk fragmentation.

      -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
allocate on 0x110000 min_size 0x110000 > allocated total 0x30000
bluefs_shared_alloc_size 0x10000 allocated 0x30000 available 0x 25134000
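
The hex fields in this log line are easier to compare once converted to
decimal. A minimal Python sketch follows; the regular expression and the
function name decode_hex_fields are made up for illustration and are not
part of any Ceph tool.

```python
import re

# The log line quoted above, joined onto one line.
LOG_LINE = (
    "bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to "
    "allocate on 0x110000 min_size 0x110000 > allocated total 0x30000 "
    "bluefs_shared_alloc_size 0x10000 allocated 0x30000 available 0x 25134000"
)


def decode_hex_fields(line):
    """Return {field name: value in bytes} for every '<name> 0x<hex>' pair."""
    # Optional whitespace after '0x' covers 'available 0x 25134000'.
    pairs = re.findall(r"(\w+) 0x\s*([0-9a-fA-F]+)", line)
    return {name: int(value, 16) for name, value in pairs}


fields = decode_hex_fields(LOG_LINE)
chunk = fields["bluefs_shared_alloc_size"]            # 65536 bytes = 64K
print(fields["on"] // chunk)                          # chunks requested: 17
print(fields["allocated"] // chunk)                   # chunks obtained: 3
print(f"{fields['available'] / 1e6:.0f} MB free")     # 622 MB free
```

Read this way, the request was for 17 chunks of 64K (about 1.06 MiB), only
3 of which were obtained, although roughly 622 MB (593 MiB) were reported
as available.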

To double-check the above root cause analysis it would be helpful to get
the output of ceph-bluestore-tool's free-dump command - there is a small
chance of a bug in the allocator which "misses" some long enough chunks.
But given the disk space utilization (>90%) and the pretty small disk
size this is unlikely IMO.
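
To illustrate why the total amount of free space says little here: what
matters is how many 64K pieces fit inside the individual free extents. The
Python sketch below counts them for a list of (offset, length) pairs. It
assumes the chunks must also be 64K-aligned, and the extent list is
invented purely for illustration; it is not taken from osd.16, and the
output format of the free dump is not assumed.

```python
CHUNK = 0x10000  # 64K, the bluefs_shared_alloc_size from the log line


def usable_chunks(free_extents, chunk=CHUNK):
    """Count chunk-sized, chunk-aligned pieces that fit in the free extents."""
    total = 0
    for offset, length in free_extents:
        start = -(-offset // chunk) * chunk  # round the offset up to alignment
        end = offset + length
        if end > start:
            total += (end - start) // chunk
    return total


# Hypothetical, heavily fragmented device: 1000 free extents of 32K each.
fragmented = [(i * 0x20000, 0x8000) for i in range(1000)]
print(sum(length for _, length in fragmented))  # 32768000 bytes free in total
print(usable_chunks(fragmented))                # 0 usable 64K chunks
```

In this made-up layout about 32 MB are free, yet not a single 64K
allocation can succeed, which is the situation the ENOSPC abort suggests.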

So to work around the issue and bring the OSD up you should either expand
the main device for the OSD or add a standalone DB volume.

Curious whether other non-starting OSDs report the same error...

Thanks,

Igor

On 10/7/2022 1:02 AM, Frank Schilder wrote:
Hi Igor,

the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one 
crashing the show. I collected its startup log here: 
https://pastebin.com/25D3piS6 . The line sticking out is line 603:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc: 
2931: ceph_abort_msg("bluefs enospc")

This smells a lot like rocksdb corruption. Can I do something about 
that? I still need to convert most of our OSDs and I cannot afford 
to lose more. The rebuild simply takes too long in the current 
situation.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 17:03:53
To: Frank Schilder; ceph-users@xxxxxxx
Cc: Stefan Kooman
Subject: Re:  OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

But could you please share full OSD startup log for any one which is
unable to restart after host reboot?

On 10/6/2022 5:12 PM, Frank Schilder wrote:
Hi Igor and Stefan.

Not sure why you're talking about replicated(!) 4(2) pool.
It's because in the production cluster it's the 4(2) pool that has 
that problem. On the test cluster it was an EC pool. Seems to 
affect all sorts of pools.
I have to take this one back. It is indeed an EC pool, also located 
on these SSD OSDs, that is affected. The meta-data pool was fully 
active the whole time until we lost the 3rd host. So, the bug 
reported is confirmed to affect EC pools.

If not - does a dead OSD necessarily mean that its underlying 
disk is no longer available?
Fortunately not. After losing disks on the 3rd host, we had to 
start taking somewhat more desperate measures. We took the file 
system off-line to stop client IO and started rebooting hosts in 
reverse order of failure. This brought back the OSDs on the still 
un-converted hosts. We rebooted the converted host, where the OSDs 
originally failed, last. Unfortunately, here it seems we lost a 
drive for good. It looks like the OSDs crashed while the conversion 
was going on or something. They don't boot up and I need to look 
into that in more detail.

We are currently trying to encourage fs clients to reconnect to the 
file system. Unfortunately, on many we get

# ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle

Is there a server-side way to encourage the FS clients to 
reconnect to the cluster? What is a clean way to get them back onto 
the file system? I tried remounting without success.

Before executing the next conversion, I will compact the rocksdb on 
all SSD OSDs. The HDDs seem to be entirely unaffected. The SSDs 
have a very high number of objects per PG, which is potentially the 
main reason for our observations.

Thanks for your help,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 14:39
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  OSD crashes during upgrade mimic->octopus

Are the crashing OSDs still confined to two hosts?

If not - does a dead OSD necessarily mean that its underlying disk is
no longer available?

On 10/6/2022 3:35 PM, Frank Schilder wrote:
Hi Igor.

Not sure why you're talking about replicated(!) 4(2) pool.
It's because in the production cluster it's the 4(2) pool that has 
that problem. On the test cluster it was an EC pool. Seems to 
affect all sorts of pools.

I just lost another disk, we have PGs down now. I really hope the 
stuck bstore_kv_sync thread does not lead to rocksdb corruption.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 14:26
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  OSD crashes during upgrade mimic->octopus

On 10/6/2022 2:55 PM, Frank Schilder wrote:
Hi Igor,

it has the SSD OSDs down, the HDD OSDs are running just fine. I 
don't want to make a bad situation worse, so for now I will wait for 
recovery to finish. The inactive PGs are activating very slowly.
Got it.

By the way, there are 2 out of 4 OSDs up in the replicated 4(2) 
pool. Why are PGs even inactive here? This "feature" is new in 
octopus, I reported it about 2 months ago as a bug. Testing with 
mimic I cannot reproduce this problem: 
https://tracker.ceph.com/issues/56995
Not sure why you're talking about replicated(!) 4(2) pool. In the 
above ticket I can see an EC 4+2 one (pool 4 'fs-data' erasure profile
ec-4-2...). Which means 6 shards per object, and maybe this setup has
some issues with mapping to unique OSDs within a host (just 3 hosts 
are available!) ... One can see that only pg 4.* are marked as inactive.
I'm not a big expert in this stuff, so mostly just speculating....

Do you have the same setup in the production cluster in question? If 
so, then you lack 2 of 6 shards and IMO the cluster properly marks the
relevant PGs as inactive. The same would apply to 3x replicated PGs as
well, though, since two replicas are down.
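
The argument above boils down to comparing the number of surviving shards
or replicas with the pool's min_size. A minimal Python sketch of that
comparison follows, under loud assumptions: the thread never states the
min_size of the EC pool, so k+1 = 5 (the usual default for a 4+2 profile)
is assumed; "4(2)" is read as size 4 with min_size 2; and min_size 2 is
assumed for the 3x replicated case. Real PG activation also depends on
peering, so this is only the arithmetic, not Ceph's actual logic.

```python
def pg_can_be_active(copies_up, min_size):
    """A PG can only serve IO if at least min_size shards/replicas are up."""
    return copies_up >= min_size


# EC 4+2: 6 shards, 2 missing, assumed min_size = k + 1 = 5.
print(pg_can_be_active(6 - 2, min_size=5))  # False

# Replicated "4(2)": 2 of 4 replicas up, min_size 2 (as read from the notation).
print(pg_can_be_active(2, min_size=2))      # True

# 3x replicated with two replicas down, assumed min_size 2.
print(pg_can_be_active(3 - 2, min_size=2))  # False
```

Under these assumptions an EC 4+2 pool missing two shards is expected to
go inactive, while a replicated 4(2) pool with two replicas up would not
be, which is why it matters which kind of pool is actually affected.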

I found this in the syslog, maybe it helps:

kernel: task:bstore_kv_sync  state:D stack:    0 pid:3646032 ppid:3645340 flags:0x00000000
kernel: Call Trace:
kernel: __schedule+0x2a2/0x7e0
kernel: schedule+0x4e/0xb0
kernel: io_schedule+0x16/0x40
kernel: wait_on_page_bit_common+0x15c/0x3e0
kernel: ? __page_cache_alloc+0xb0/0xb0
kernel: wait_on_page_bit+0x3f/0x50
kernel: wait_on_page_writeback+0x26/0x70
kernel: __filemap_fdatawait_range+0x98/0x100
kernel: ? __filemap_fdatawrite_range+0xd8/0x110
kernel: file_fdatawait_range+0x1a/0x30
kernel: sync_file_range+0xc2/0xf0
kernel: ksys_sync_file_range+0x41/0x80
kernel: __x64_sys_sync_file_range+0x1e/0x30
kernel: do_syscall_64+0x3b/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7ffbb6f77ae7
kernel: RSP: 002b:00007ffba478c3c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
kernel: RAX: ffffffffffffffda RBX: 000000000000002d RCX: 00007ffbb6f77ae7
kernel: RDX: 0000000000002000 RSI: 000000015f849000 RDI: 000000000000002d
kernel: RBP: 000000015f849000 R08: 0000000000000000 R09: 0000000000002000
kernel: R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000002000
kernel: R13: 0000000000000007 R14: 0000000000000001 R15: 0000560a1ae20380
kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
kernel:      Tainted: G            E 5.14.13-1.el7.elrepo.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
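
Hung-task reports like the one above can be pulled out of a longer syslog
with a few lines of Python. This is only a sketch that assumes the kernel
message keeps the wording quoted here; the function name find_hung_tasks
is made up for illustration.

```python
import re

# \s+ before the number tolerates a line wrapped by the mail client.
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than\s+(\d+) seconds"
)


def find_hung_tasks(text):
    """Return (task name, pid, seconds) for every hung-task report in text."""
    return [
        (name, int(pid), int(seconds))
        for name, pid, seconds in HUNG_TASK.findall(text)
    ]


sample = (
    "kernel: INFO: task bstore_kv_sync:3646117 blocked for more than \n"
    "123 seconds.\n"
)
print(find_hung_tasks(sample))  # [('bstore_kv_sync', 3646117, 123)]
```

Comparing the timestamps of such reports with the time the OSDs were
marked down is one way to check whether the two events coincide.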

It is quite possible that this was the moment when these OSDs got 
stuck and were marked down. The time stamp is about right.
Right. This is the primary thread which submits transactions to the DB, 
and it was stuck for >123 seconds. Given that the disk is completely 
unresponsive, I presume something has happened at a lower level 
(controller or disk FW), though. Maybe this was somehow caused by 
"fragmented" DB access and compaction would heal this. On the other 
hand, a compaction had to be applied after the omap upgrade, so I'm not 
sure another one would change the state...

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 13:45:17
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  OSD crashes during upgrade mimic->octopus

From your response to Stefan I gather that one of the two damaged hosts
has all OSDs down and unable to start. Is that correct? If so, you can
reboot it with no problem and proceed with manual compaction [and other
experiments] quite "safely" for the rest of the cluster.

On 10/6/2022 2:35 PM, Frank Schilder wrote:
Hi Igor,

I can't access these drives. They have an OSD- or LVM process 
hanging in D-state. Any attempt to do something with these gets 
stuck as well.

I somehow need to wait for recovery to finish and protect the 
still running OSDs from crashing similarly badly.

After we have full redundancy again and service is back, I can 
add the setting osd_compact_on_start=true and start rebooting 
servers. Right now I need to prevent the ship from sinking.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 13:28:11
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  OSD crashes during upgrade mimic->octopus

IIUC the OSDs that show "had timed out after 15" are failing to start
up. Is that correct, or did I miss something? I meant trying compaction
for them...

On 10/6/2022 2:27 PM, Frank Schilder wrote:
Hi Igor,

thanks for your response.

And what's the target Octopus release?
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) 
octopus (stable)

I'm afraid I don't have the luxury right now to take OSDs down 
or add extra load with an on-line compaction. I would really 
appreciate a way to make the OSDs more crash tolerant until I 
have full redundancy again. Is there a setting that increases 
the OPS timeout or is there a way to restrict the load to 
tolerable levels?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 13:15
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  OSD crashes during upgrade mimic->octopus

Hi Frank,

you might want to compact RocksDB with ceph-kvstore-tool for those OSDs
which are showing

"heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had 
timed out after 15"

I have seen this error pretty often after bulk data removal and the 
severe DB performance drop that follows.
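
For reference, an offline compaction with ceph-kvstore-tool is run against
the data directory of a stopped OSD. The Python sketch below only builds
and prints the command line; it does not execute anything. The helper
name compact_command is made up, and the default path is an assumption
based on the /var/lib/ceph/osd/ceph-16 path seen elsewhere in this thread.

```python
import shlex


def compact_command(osd_id, osd_root="/var/lib/ceph/osd"):
    """Build the offline RocksDB compaction command for one (stopped) OSD."""
    # Assumed layout: <osd_root>/ceph-<id>; adjust for other deployments.
    path = f"{osd_root}/ceph-{osd_id}"
    return ["ceph-kvstore-tool", "bluestore-kv", path, "compact"]


# Print the command for review instead of running it.
print(shlex.join(compact_command(16)))
# ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-16 compact
```

Printing the command first leaves the decision of when to stop the OSD and
run the compaction to the operator, which matters while the cluster has
reduced redundancy.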

Thanks,
Igor
--
Igor Fedotov
Ceph Lead Developer
