There are a couple of examples in the docs [2], so in your case it
probably would be something rather simple like:
service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
You can apply that config to specific hosts or all of them; it really
depends on your actual setup. You can also dry-run the config before
applying it with the --dry-run flag:
ceph orch apply -i my-osd-specs.yaml --dry-run
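If you only want to target specific hosts, the placement section can
list them explicitly instead of using a host_pattern. Just a rough
sketch, the service_id and host names below are placeholders for your own:

service_type: osd
service_id: osd_spec_ssd_db
placement:
  hosts:
    - ceph-node-01
    - ceph-node-02
spec:
  data_devices:
    rotational: 1    # HDDs become the data devices
  db_devices:
    rotational: 0    # SSDs hold the DB/WAL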
I'd recommend creating a test cluster if possible so you have some
options to practice and get familiar with all that stuff.
Ideally I would use the commands to simply move the DB of my
existing orchestrator-deployed OSDs to the SSD, but when I tried
that command it broke my OSD and I had to delete it and leave the
cluster in a degraded state until it had recovered.
Do you still have the commands and the output somewhere, showing what
exactly went wrong? I haven't migrated DBs in quite some time, especially not
in a newer version like Reef. I assume you tried it with
bluefs-bdev-migrate?
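If it was something along these lines, that would help narrow it down.
This is just a sketch of what I mean, not a verified procedure; <id> and
the target LV are placeholders, and cephadm-managed OSDs may need
additional steps afterwards:

ceph orch daemon stop osd.<id>
cephadm shell --name osd.<id>
# inside the shell: attach a new DB device, then move the bluefs data onto it
ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-<id> --dev-target /dev/db-vg/db-lv
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-<id> \
  --devs-source /var/lib/ceph/osd/ceph-<id>/block --dev-target /var/lib/ceph/osd/ceph-<id>/block.db
exit
ceph orch daemon start osd.<id>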
I remember having seen these repeating scrub starts messages in this
list, but I can't seem to find the right thread. I can't recall if
there was a solution to that...
When was the last time you failed the mgr service? That still does
help sometimes...
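In case you haven't tried it recently, that's just:

ceph mgr fail

(on older releases you pass the active mgr's name, e.g. ceph mgr fail <name>).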
[2] https://docs.ceph.com/en/reef/cephadm/services/osd/#examples
Quoting Martin Conway <martin.conway@xxxxxxxxxx>:
First of all, I'd still recommend using the orchestrator to deploy
OSDs. Building OSDs manually and then adopting them is redundant.
Or do you have issues with the drivegroups?
I am having to do it this way because I couldn't find any doco on
how to specify a separate DB/WAL device when deploying OSDs using
the orchestrator. If there is such a command I agree it would be a
better choice.
Ideally I would use the commands to simply move the DB of my
existing orchestrator-deployed OSDs to the SSD, but when I tried
that command it broke my OSD and I had to delete it and leave the
cluster in a degraded state until it had recovered. I find it very
stressful when I get out of my depth with problems like that, so I
gave up on that idea and am doing the remove, redeploy, adopt method,
which is working, but VERY slow.
I don't have *the* solution, but you could try disabling the mclock
scheduler [1], which is the default since Quincy. Maybe that will speed
things up? There have been reports on the list about some unwanted or
at least unexpected behavior.
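Switching back to the wpq scheduler would look roughly like this
(the OSDs need a restart for it to take effect):

ceph config set osd osd_op_queue wpq
# then restart the OSDs, e.g. ceph orch daemon restart osd.<id> per daemon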
I did try this to speed up my rebalances, but it didn't seem
to make much difference. I haven't tried it to see what difference
it makes to scrubbing.
As for the "not (deep-)scrubbed in time" messages, there seems to be
progress (in your ceph status), but depending on the drive utilization you
could increase the number of scrubs per OSD (osd_max_scrubs).
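Bumping it cluster-wide and checking the value would be something like
this (2 is just an example value; keep an eye on client I/O):

ceph config set osd osd_max_scrubs 2
ceph config get osd osd_max_scrubs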
There are a lot of scrubs running; this morning, after my rebalance
finally completed, there were 22 scrubbing and 6 deep scrubbing (across 28
OSDs). This has fallen from the numbers that were running yesterday while
the rebalance was still happening (38/9).
I believe if I kick the cluster by taking a host into maintenance
and back, the numbers will jump up again. The trouble is I don't know
how to tell whether a scrub is actually achieving something, is stuck,
or is restarting over and over.
My current ceph pg dump is:
https://pastebin.com/AQhNKSBN
and if I run it again a few minutes later:
https://pastebin.com/yfREzJ4s
I see evidence of scrubs not working because some of my OSD logs
look like this:
2023-11-01T20:51:08.668+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 6.2d scrub starts
2023-11-01T20:51:11.658+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1ac scrub starts
2023-11-01T20:51:19.565+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.17 scrub starts
2023-11-01T20:51:20.516+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1d9 scrub starts
2023-11-01T20:51:22.463+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 5.9b scrub starts
2023-11-01T20:51:24.488+0000 7f3be0b27700 0 log_channel(cluster) log [DBG] : 5.65 scrub starts
2023-11-01T20:51:29.474+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 6.2d scrub starts
2023-11-01T20:51:31.484+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1ac scrub starts
2023-11-01T20:51:34.455+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.17 scrub starts
2023-11-01T20:51:39.444+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1d9 deep-scrub starts
2023-11-01T20:51:42.473+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 5.9b scrub starts
2023-11-01T20:51:44.510+0000 7f3be0b27700 0 log_channel(cluster) log [DBG] : 5.65 scrub starts
2023-11-01T20:51:46.491+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 6.2d scrub starts
2023-11-01T20:51:47.465+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1ac scrub starts
2023-11-01T20:51:49.443+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.17 scrub starts
2023-11-01T20:51:51.439+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1d9 scrub starts
2023-11-01T20:51:53.388+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 5.9b scrub starts
2023-11-01T20:52:00.345+0000 7f3be0b27700 0 log_channel(cluster) log [DBG] : 5.65 scrub starts
2023-11-01T20:52:02.438+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 6.2d scrub starts
2023-11-01T20:52:03.452+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1ac scrub starts
2023-11-01T20:52:10.421+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.17 scrub starts
2023-11-01T20:52:11.436+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1d9 deep-scrub starts
2023-11-01T20:52:12.465+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 5.9b deep-scrub starts
2023-11-01T20:52:13.470+0000 7f3be0b27700 0 log_channel(cluster) log [DBG] : 5.65 scrub starts
2023-11-01T20:52:14.468+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 6.2d deep-scrub starts
2023-11-01T20:52:17.512+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1ac scrub starts
2023-11-01T20:52:20.507+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.17 scrub starts
2023-11-01T20:52:22.428+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1d9 scrub starts
2023-11-01T20:52:23.438+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 5.9b scrub starts
2023-11-01T20:52:24.444+0000 7f3be0b27700 0 log_channel(cluster) log [DBG] : 5.65 scrub starts
2023-11-01T20:52:28.461+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 6.2d scrub starts
2023-11-01T20:52:45.551+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1ac scrub starts
2023-11-01T20:52:46.593+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.17 scrub starts
2023-11-01T20:52:48.595+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1d9 scrub starts
2023-11-01T20:52:52.488+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 5.9b deep-scrub starts
2023-11-01T20:52:55.504+0000 7f3be0b27700 0 log_channel(cluster) log [DBG] : 5.65 scrub starts
2023-11-01T20:52:58.519+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 6.2d scrub starts
2023-11-01T20:52:59.477+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1ac scrub starts
2023-11-01T20:53:01.505+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.17 scrub starts
2023-11-01T20:53:03.467+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1d9 scrub starts
2023-11-01T20:53:06.406+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 5.9b scrub starts
2023-11-01T20:53:10.446+0000 7f3be0b27700 0 log_channel(cluster) log [DBG] : 5.65 scrub starts
2023-11-01T20:53:13.470+0000 7f3be1328700 0 log_channel(cluster) log [DBG] : 6.2d scrub starts
2023-11-01T20:53:15.521+0000 7f3bdfb25700 0 log_channel(cluster) log [DBG] : 5.1ac scrub starts
There are no "scrub ok" type messages.
I notice these logs were from yesterday, and currently there is
nothing being logged for these OSDs. If I were to "kick the cluster",
these scrub logs would probably start showing up again.
When this is happening the PGs mentioned seem to stay in a "scrub
queued" state.
Any light you could shine my way would be appreciated.
Thanks,
Martin
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx