Re: Stuck in upgrade process to reef

Hi Jan,

Regarding osd.0 - if this is the only occurrence, then I'd propose simply redeploying the OSD. This looks like some BlueStore metadata inconsistency which could have occurred long before the upgrade; most likely the upgrade just revealed it. And honestly, I can hardly imagine how to investigate it further at this point.
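
If the cluster is managed by cephadm (which the podman containers suggest), the redeployment would be roughly along these lines - treat it as a sketch, since the exact OSD id and device handling depend on your setup:

    ceph orch osd rm 0 --replace --zap
    ceph orch osd rm status    # watch the draining/removal progress

Once draining and removal finish, the device gets zapped and your existing OSD service spec should pick it up and recreate osd.0.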

Let's see how further upgrades go and come back to this question if more similar issues pop up.

Meanwhile I'd recommend running fsck on every OSD prior to the upgrade to get a clear understanding of whether its metadata is consistent or not.
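
With the OSD stopped, that's a ceph-bluestore-tool run; in a containerized deployment you would invoke it from inside the OSD's container, e.g. something like (adjust the OSD id for each daemon):

    ceph orch daemon stop osd.0
    cephadm shell --name osd.0 -- ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0

If it reports errors, sort them out (or plan a redeploy) before that OSD gets upgraded.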

This way, if it occurs once again, we can prove or disprove my statement above about the issue being unrelated to the upgrade.


Thanks,

Igor

On 17/01/2024 15:07, Jan Marek wrote:
Hi Igor,

many thanks for advice!

I've tried to start osd.1 and it started; now it's
resynchronizing data.

I will start daemons one-by-one.

What about osd.0, which has a problem with
BlueStore fsck? Is there a way to repair it?

Sincerely
Jan


On Tue, Jan 16, 2024 at 08:15:03 CET, Igor Fedotov wrote:
Hi Jan,

I've just fired an upstream ticket for your case, see
https://tracker.ceph.com/issues/64053 for more details.


You might want to tune (or preferably just remove) your custom
bluestore_cache_.*_ratio settings to fix the issue.
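
If those ratios live in the cluster configuration (your ceph config dump output will show the exact option names and section), removing them would look roughly like this - the option names below are the usual candidates, not necessarily the ones you actually have set:

    ceph config rm osd bluestore_cache_meta_ratio
    ceph config rm osd bluestore_cache_kv_ratio

followed by a restart of the affected OSDs.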

This is reproducible and fixable in my lab this way.

Hope this helps.


Thanks,

Igor


On 15/01/2024 12:54, Jan Marek wrote:
Hi Igor,

I've tried to start the ceph-osd daemon as you advised me, and I'm
sending the log osd.1.start.log.

About memory: according to 'top', the podman ceph daemon doesn't reach
2% of the whole server memory (64 GB)...

I have switched on memory autotuning...

For my ceph config dump, see the attached dump.txt.

Sincerely
Jan Marek

On Thu, Jan 11, 2024 at 04:02:02 CET, Igor Fedotov wrote:
Hi Jan,

unfortunately this wasn't very helpful. Moreover, the log looks a bit messy -
it looks like a mixture of output from multiple running instances or
something. I'm not an expert in containerized setups, though.

Could you please simplify things by running the ceph-osd process manually, like
you did for ceph-objectstore-tool, and force the log output to a file. The command
line should look something like the following:

ceph-osd -i 0 --log-to-file --log-file <some-file> --debug-bluestore 5/20
--debug-prioritycache 10

Please don't forget to run repair prior to that.
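
With the OSD down, the repair is again a ceph-bluestore-tool invocation, e.g. something along these lines (path adjusted to the OSD in question, run inside its container on a containerized setup):

    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0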


Also you haven't answered my questions about custom [memory] settings and
RAM usage during OSD startup. It would be nice to hear some feedback.


Thanks,

Igor

On 11/01/2024 16:47, Jan Marek wrote:
Hi Igor,

I've tried to start osd.1 with debug_prioritycache and
debug_bluestore set to 5/20; see the attached file...

Sincerely
Jan

On Wed, Jan 10, 2024 at 01:03:07 CET, Igor Fedotov wrote:
Hi Jan,

indeed this looks like some memory allocation problem - maybe the OSD's RAM
usage threshold was reached or something?

I'm curious whether you have any custom OSD settings or maybe memory caps on the
Ceph containers?
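
You can check both quickly, e.g. with something like:

    ceph config dump | grep -Ei 'osd_memory|bluestore_cache'    # cluster-side overrides
    podman stats --no-stream                                    # per-container memory limits and usage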

Could you please set debug_bluestore to 5/20 and debug_prioritycache to 10
and try to start the OSD once again. Please monitor the process's RAM usage while
it starts and share the resulting log.
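
For example (assuming the affected daemon is osd.1 and that you set the debug levels centrally before restarting it - adjust as needed):

    ceph config set osd.1 debug_bluestore 5/20
    ceph config set osd.1 debug_prioritycache 10
    ceph orch daemon restart osd.1

and then watch its memory with top or podman stats while it starts up.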


Thanks,

Igor

On 10/01/2024 11:20, Jan Marek wrote:
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



