Re: ceph-users Digest, Vol 110, Issue 18

What is the next national communication device? 

Sent from my iPhone

> On Mar 8, 2022, at 3:15 AM, ceph-users-request@xxxxxxx wrote:
> 
> Send ceph-users mailing list submissions to
>    ceph-users@xxxxxxx
> 
> To subscribe or unsubscribe via email, send a message with subject or
> body 'help' to
>    ceph-users-request@xxxxxxx
> 
> You can reach the person managing the list at
>    ceph-users-owner@xxxxxxx
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of ceph-users digest..."
> 
> Today's Topics:
> 
>   1. Is cephadm stable or not in production? (norman.kern)
>   2. Re: Is cephadm stable or not in production? (Martin Verges)
>   3. Re: "Incomplete" pg's (Eugen Block)
>   4. Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)
>      (Dan van der Ster)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Tue, 8 Mar 2022 12:16:40 +0800
> From: "norman.kern" <norman.kern@xxxxxxx>
> Subject:  Is cephadm stable or not in production?
> To: ceph-users <ceph-users@xxxxxxx>
> Message-ID: <0424c4c2-2a5b-13b6-3795-5e78ca283b36@xxxxxxx>
> Content-Type: text/plain; charset=UTF-8; format=flowed
> 
> Dear Ceph folks,
> 
> Is anyone using cephadm in production (version: Pacific)? I have found several bugs in it,
> and I really have my doubts.
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Tue, 8 Mar 2022 07:26:00 +0100
> From: Martin Verges <martin.verges@xxxxxxxx>
> Subject:  Re: Is cephadm stable or not in production?
> To: "norman.kern" <norman.kern@xxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxx>
> Message-ID:
>    <CAOf0rEu0_i1-WPn-uR9AigeDvZn7qkvePvLaq2WZPiiyJy756g@xxxxxxxxxxxxxx>
> Content-Type: text/plain; charset="UTF-8"
> 
> Some say it is, some say it's not.
> Every time I try it, it's buggy as hell and I can destroy my test clusters
> with ease. That's why I still avoid it. But as you can see in my signature,
> I am biased ;).
> 
> --
> Martin Verges
> Managing director
> 
> Mobile: +49 174 9335695  | Chat: https://t.me/MartinVerges
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> 
> 
>> On Tue, 8 Mar 2022 at 05:18, norman.kern <norman.kern@xxxxxxx> wrote:
>> 
>> Dear Ceph folks,
>> 
>> Is anyone using cephadm in production (version: Pacific)? I have found several bugs
>> in it, and
>> I really have my doubts.
>> 
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> 
> 
> ------------------------------
> 
> Message: 3
> Date: Tue, 08 Mar 2022 06:47:45 +0000
> From: Eugen Block <eblock@xxxxxx>
> Subject:  Re: "Incomplete" pg's
> To: ceph-users@xxxxxxx
> Message-ID:
>    <20220308064745.Horde.4B2t2of688SQWtjvNrd4Nmz@xxxxxxxxxxxxxx>
> Content-Type: text/plain; charset=utf-8; format=flowed; DelSp=Yes
> 
> Hi,
> 
> IIUC, OSDs 3, 4 and 5 have been removed while some PGs still refer to
> them, correct? Have the OSDs been replaced with the same IDs? If not
> (so there are currently no OSDs with IDs 3, 4, 5 in your osd tree),
> marking them as lost [1] might resolve the stuck PG creation, although
> I doubt it will do much if there aren't any OSDs with those IDs
> anymore. I haven't had to mark an OSD lost myself yet, so I'm not
> sure of the consequences.
> There's a similar thread [2] where the situation was resolved not by
> marking the OSDs as lost but by setting
> 'osd_find_best_info_ignore_history_les', which I haven't used myself
> either. But maybe it's worth a shot?
> 
> 
> [1] https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/
> [2]  
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/G6MJF7PGCCW5JTC6R6UV2EXT54YGU3LG/
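> 
> If you do go down either route, a rough sketch (using the OSD IDs 3,4,5
> and the acting primary osd.12 from the query output below; this is only
> an illustration, not something I have tested) would be:
> 
>    ceph osd lost 3 --yes-i-really-mean-it   # repeat for 4 and 5, only
>                                             # if they really are gone
> 
> or, for the second approach, on the acting primary only:
> 
>    ceph config set osd.12 osd_find_best_info_ignore_history_les true
>    systemctl restart ceph-osd@12
>    # unset it again once the PG has peered:
>    ceph config set osd.12 osd_find_best_info_ignore_history_les false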
> 
> 
> Zitat von "Kyriazis, George" <george.kyriazis@xxxxxxxxx>:
> 
>> OK, I saw that there is now a “ceph osd force-create-pg” command.
>> Not sure if it is a replacement for “ceph pg force_create_pg” or if
>> it does something different.
>> 
>> I tried it, and it looked like it worked:
>> 
>> # ceph osd force-create-pg 1.353 --yes-i-really-mean-it
>> pg 1.353 now creating, ok
>> #
>> 
>> But the pg is still stuck in “incomplete” state.
>> 
>> Re-issuing the same command, I get:
>> 
>> # ceph osd force-create-pg 1.353 --yes-i-really-mean-it
>> pg 1.353 already creating
>> #
>> 
>> This means the request is queued up somewhere; however, the pg
>> in question is still stuck in the incomplete state:
>> 
>> # ceph pg ls | grep ^1\.353
>> 1.353        0         0          0        0             0            
>> 0           0     0                        incomplete    71m         
>>     0'0        54514:92        [4,6,22]p4        [4,6,22]p4   
>> 2022-02-28T15:47:37.794357-0600  2022-02-02T07:53:15.339511-0600
>> #
>> 
>> How do I find out if it is stuck, or just plain queued behind some  
>> other request?
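>> 
>> (I assume something like "ceph osd blocked-by" or "ceph pg 1.353 query"
>> would show what is blocking it, e.g.:
>> 
>> # ceph osd blocked-by
>> # ceph pg 1.353 query | grep -A5 blocked_by
>> 
>> but I'm not sure what to look for in the output.)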
>> 
>> Thank you!
>> 
>> George
>> 
>> On Mar 7, 2022, at 12:09 PM, Kyriazis, George
>> <george.kyriazis@xxxxxxxxx> wrote:
>> 
>> After some thought, I decided to try “ceph pg force_create_pg” on
>> the incomplete pgs, as suggested by some online sources.
>> 
>> However, I got:
>> 
>> # ceph pg force_create_pg 1.353
>> no valid command found; 10 closest matches:
>> pg stat
>> pg getmap
>> pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
>> pg dump_json [all|summary|sum|pools|osds|pgs...]
>> pg dump_pools_json
>> pg ls-by-pool <poolstr> [<states>...]
>> pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
>> pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
>> pg ls [<pool:int>] [<states>...]
>> pg dump_stuck [inactive|unclean|stale|undersized|degraded...]  
>> [<threshold:int>]
>> Error EINVAL: invalid command
>> #
>> 
>> ?
>> 
>> I am running pacific 16.2.7.
>> 
>> Thanks!
>> 
>> George
>> 
>> 
>> On Mar 4, 2022, at 7:51 AM, Kyriazis, George
>> <george.kyriazis@xxxxxxxxx> wrote:
>> 
>> Thanks Janne,
>> 
>> (Inline)
>> 
>> On Mar 4, 2022, at 1:04 AM, Janne Johansson
>> <icepic.dz@xxxxxxxxx> wrote:
>> 
>> Due to a mistake on my part, I accidentally destroyed more OSDs than
>> I needed to, and I ended up with 2 pgs in “incomplete” state.
>> 
>> Doing “ceph pg query” on one of the pgs that is incomplete, I get the
>> following (somewhere in the output):
>> 
>>         "up": [
>>             12,
>>             6,
>>             20
>>         ],
>>         "acting": [
>>             12,
>>             6,
>>             20
>>         ],
>>         "avail_no_missing": [],
>>         "object_location_counts": [],
>>         "blocked_by": [
>>             3,
>>             4,
>>             5
>>         ],
>>         "up_primary": 12,
>>         "acting_primary": 12,
>>         "purged_snaps": []
>> 
>> 
>> I am assuming this means that OSDs 3,4,5 were the original ones  
>> (that are now destroyed), but I don’t understand why the output  
>> shows 12, 6, 20 as active.
>> 
>> I can't help with the cephfs part since we don't use that, but I think
>> the above output means "since 3,4,5 are gone, 12,6 and 20 are now
>> designated as the replacement OSDs to hold the PG", but since 3,4,5
>> are gone, none of them can backfill into 12,6,20, so 12,6,20 are
>> waiting for this PG to appear "somewhere" so they can recover.
>> 
>> I thought that if that were the case, 3,4,5 would be listed as
>> “acting”, with 12,6,20 as “up”...
>> 
>> My concern about cephfs is that, since it is a layer above the ceph
>> base layer, maybe the corrective action needs to start at cephfs;
>> otherwise cephfs won’t be aware of any changes happening underneath.
>> 
>> Perhaps you can force pg creation, so that 12,6,20 gets an empty PG to
>> start the pool again, and then hope that the next rsync will fill in
>> any missing slots, but this part I am not so sure about since I don't
>> know what other data apart from file contents may exist in a cephfs
>> pool.
>> 
>> Is the worst-case (dropping the pool, recreating it and running a full
>> rsync again) a possible way out? If so, you can perhaps test and see
>> if you can bridge the gap of the missing PGs, but if resyncing is out,
>> then wait for suggestions from someone more qualified at cephfs stuff
>> than me. ;)
>> 
>> I’ll wait a bit more for some other people to suggest something.  At  
>> this point I don’t have anything with high confidence that it will  
>> work.
>> 
>> Thanks!
>> 
>> George
>> 
>> 
>> --
>> May the most significant bit of your life be positive.
>> 
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
> 
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Tue, 8 Mar 2022 09:09:35 +0100
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> Subject:  Re: octopus (15.2.16) OSDs crash or don't answer
>    heartbeats (and get marked as down)
> To: Boris Behrens <bb@xxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Message-ID:
>    <CABZ+qqmaR++yej6tj-EAdVzhw0fkUg-90JQyTHiofcX2LtZHNg@xxxxxxxxxxxxxx>
> Content-Type: text/plain; charset="UTF-8"
> 
> Here's the reason they exit:
> 
> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
> osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
> 
> If an osd flaps (marked down, then up) 6 times in 10 minutes, it
> exits. (This is a safety measure).
> 
> It's normally caused by a network issue -- other OSDs are telling the
> mon that he is down, but then the OSD himself tells the mon that he's
> up!
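> 
> If you need breathing room while you debug, the threshold is
> configurable -- a sketch, not a recommendation, since the real fix is
> the underlying network/heartbeat problem:
> 
>    ceph config set osd osd_max_markdown_count 10
>    # (the window itself is osd_max_markdown_period, default 600 seconds)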
> 
> Cheers, Dan
> 
>> On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens <bb@xxxxxxxxx> wrote:
>> 
>> Hi,
>> 
>> we've had the problem of OSDs being marked as offline since we updated to
>> octopus and hoped the problem would be fixed with the latest patch. We have
>> this kind of problem only with octopus, and there only with the big s3
>> cluster.
>> * Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
>> * Network interfaces are 20 Gbit (2x10 in an 802.3ad encap3+4 bond)
>> * We only use the frontend network.
>> * All disks are spinning, some have block.db devices.
>> * All disks are bluestore
>> * configs are mostly defaults
>> * we've set the OSDs to restart=always without a limit (roughly the systemd
>> drop-in sketched below), because we had the problem with unavailable PGs
>> when two OSDs are marked as offline and they share PGs.
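>> 
>> "restart=always without a limit" means roughly a drop-in like the
>> following (exact file contents may differ; treat this as a sketch):
>> 
>> # /etc/systemd/system/ceph-osd@.service.d/override.conf
>> [Unit]
>> StartLimitIntervalSec=0
>> [Service]
>> Restart=always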
>> 
>> But since we installed the latest patch we have been experiencing more OSD downs
>> and even crashes.
>> I tried to remove as many duplicated lines as possible.
>> 
>> Is the numa error a problem?
>> Why do OSD daemons not respond to heartbeats? I mean, even when the disk is
>> totally loaded with IO, the system itself should still answer heartbeats, or am I
>> missing something?
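>> 
>> If it helps, I can also share the output of, for example,
>> 
>> # ceph osd numa-status
>> # ceph config get osd osd_heartbeat_grace
>> 
>> but I'm just guessing at what might be relevant here.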
>> 
>> I really hope some of you can point me in the right direction to solve this
>> nasty problem.
>> 
>> This is what the latest crash looks like:
>> Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+0000
>> 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> ...
>> Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+0000
>> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
>> thread_name:tp_osd_tp
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
>> [0x7f5f0d45ef08]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
>> unsigned long)+0x471) [0x55a699a01201]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
>> long, unsigned long)+0x8e) [0x55a699a0199e]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
>> [0x55a699a224b0]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+0000
>> 7f5ef1501700 -1 *** Caught signal (Aborted) **
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
>> thread_name:tp_osd_tp
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
>> [0x7f5f0d45ef08]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
>> unsigned long)+0x471) [0x55a699a01201]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
>> long, unsigned long)+0x8e) [0x55a699a0199e]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
>> [0x55a699a224b0]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the executable, or
>> `objdump -rdS <executable>` is needed to interpret this.
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  -5246> 2022-03-07T17:49:07.678+0000
>> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:      0> 2022-03-07T17:53:07.387+0000
>> 7f5ef1501700 -1 *** Caught signal (Aborted) **
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
>> thread_name:tp_osd_tp
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
>> [0x7f5f0d45ef08]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
>> unsigned long)+0x471) [0x55a699a01201]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
>> long, unsigned long)+0x8e) [0x55a699a0199e]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
>> [0x55a699a224b0]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the executable, or
>> `objdump -rdS <executable>` is needed to interpret this.
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  -5246> 2022-03-07T17:49:07.678+0000
>> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:      0> 2022-03-07T17:53:07.387+0000
>> 7f5ef1501700 -1 *** Caught signal (Aborted) **
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
>> thread_name:tp_osd_tp
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
>> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
>> [0x7f5f0d45ef08]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
>> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
>> unsigned long)+0x471) [0x55a699a01201]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
>> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
>> long, unsigned long)+0x8e) [0x55a699a0199e]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
>> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
>> [0x55a699a224b0]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
>> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
>> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the executable, or
>> `objdump -rdS <executable>` is needed to interpret this.
>> Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service: Main process
>> exited, code=killed, status=6/ABRT
>> Mar 07 17:53:09 s3db18 systemd[1]: ceph-osd@161.service: Failed with result
>> 'signal'.
>> Mar 07 17:53:19 s3db18 systemd[1]: ceph-osd@161.service: Scheduled restart
>> job, restart counter is at 1.
>> Mar 07 17:53:19 s3db18 systemd[1]: Stopped Ceph object storage daemon
>> osd.161.
>> Mar 07 17:53:19 s3db18 systemd[1]: Starting Ceph object storage daemon
>> osd.161...
>> Mar 07 17:53:19 s3db18 systemd[1]: Started Ceph object storage daemon
>> osd.161.
>> Mar 07 17:53:20 s3db18 ceph-osd[4009440]: 2022-03-07T17:53:20.498+0000
>> 7f9617781d80 -1 Falling back to public interface
>> Mar 07 17:53:33 s3db18 ceph-osd[4009440]: 2022-03-07T17:53:33.906+0000
>> 7f9617781d80 -1 osd.161 489778 log_to_monitors {default=true}
>> Mar 07 17:53:34 s3db18 ceph-osd[4009440]: 2022-03-07T17:53:34.206+0000
>> 7f96106f2700 -1 osd.161 489778 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> ...
>> Mar 07 18:58:12 s3db18 ceph-osd[4009440]: 2022-03-07T18:58:12.717+0000
>> 7f96106f2700 -1 osd.161 489880 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> 
>> And this is what it looks like when OSDs get marked as out:
>> Mar 03 19:29:04 s3db13 ceph-osd[5792]: 2022-03-03T19:29:04.857+0000
>> 7f16115e0700 -1 osd.97 485814 heartbeat_check: no reply from
>> [XX:22::65]:6886 osd.124 since back 2022-03-03T19:28:41.250692+0000 front
>> 2022-03-03T19:28:41.250649+0000 (oldest deadline
>> 2022-03-03T19:29:04.150352+0000)
>> ...130 times...
>> Mar 03 21:55:37 s3db13 ceph-osd[5792]: 2022-03-03T21:55:37.844+0000
>> 7f16115e0700 -1 osd.97 486383 heartbeat_check: no reply from
>> [XX:22::65]:6941 osd.124 since back 2022-03-03T21:55:12.514627+0000 front
>> 2022-03-03T21:55:12.514649+0000 (oldest deadline
>> 2022-03-03T21:55:36.613469+0000)
>> Mar 04 00:00:05 s3db13 ceph-osd[5792]: 2022-03-04T00:00:05.035+0000
>> 7f1613080700 -1 received  signal: Hangup from killall -q -1 ceph-mon
>> ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror  (PID: 1385079)
>> UID: 0
>> Mar 04 00:00:05 s3db13 ceph-osd[5792]: 2022-03-04T00:00:05.047+0000
>> 7f1613080700 -1 received  signal: Hangup from  (PID: 1385080) UID: 0
>> Mar 04 00:06:00 s3db13 sudo[1389262]:     ceph : TTY=unknown ; PWD=/ ;
>> USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde
>> Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session): session
>> opened for user root by (uid=0)
>> Mar 04 00:06:00 s3db13 sudo[1389262]: pam_unix(sudo:session): session
>> closed for user root
>> Mar 04 00:06:01 s3db13 sudo[1389287]:     ceph : TTY=unknown ; PWD=/ ;
>> USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json /dev/sde
>> Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session): session
>> opened for user root by (uid=0)
>> Mar 04 00:06:01 s3db13 sudo[1389287]: pam_unix(sudo:session): session
>> closed for user root
>> Mar 05 00:00:10 s3db13 ceph-osd[5792]: 2022-03-05T00:00:10.213+0000
>> 7f1613080700 -1 received  signal: Hangup from killall -q -1 ceph-mon
>> ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror  (PID: 2406262)
>> UID: 0
>> Mar 05 00:00:10 s3db13 ceph-osd[5792]: 2022-03-05T00:00:10.237+0000
>> 7f1613080700 -1 received  signal: Hangup from  (PID: 2406263) UID: 0
>> Mar 05 00:08:03 s3db13 sudo[2411721]:     ceph : TTY=unknown ; PWD=/ ;
>> USER=root ; COMMAND=/usr/sbin/smartctl -a --json=o /dev/sde
>> Mar 05 00:08:03 s3db13 sudo[2411721]: pam_unix(sudo:session): session
>> opened for user root by (uid=0)
>> Mar 05 00:08:04 s3db13 sudo[2411721]: pam_unix(sudo:session): session
>> closed for user root
>> Mar 05 00:08:04 s3db13 sudo[2411725]:     ceph : TTY=unknown ; PWD=/ ;
>> USER=root ; COMMAND=/usr/sbin/nvme ata smart-log-add --json /dev/sde
>> Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session): session
>> opened for user root by (uid=0)
>> Mar 05 00:08:04 s3db13 sudo[2411725]: pam_unix(sudo:session): session
>> closed for user root
>> Mar 05 19:19:49 s3db13 ceph-osd[5792]: 2022-03-05T19:19:49.189+0000
>> 7f160fddd700 -1 osd.97 486852 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> Mar 05 19:21:18 s3db13 ceph-osd[5792]: 2022-03-05T19:21:18.377+0000
>> 7f160fddd700 -1 osd.97 486858 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> Mar 05 19:21:45 s3db13 ceph-osd[5792]: 2022-03-05T19:21:45.304+0000
>> 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from
>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:21:21.762744+0000 front
>> 2022-03-05T19:21:21.762723+0000 (oldest deadline
>> 2022-03-05T19:21:45.261347+0000)
>> Mar 05 19:21:46 s3db13 ceph-osd[5792]: 2022-03-05T19:21:46.260+0000
>> 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from
>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:21:21.762744+0000 front
>> 2022-03-05T19:21:21.762723+0000 (oldest deadline
>> 2022-03-05T19:21:45.261347+0000)
>> Mar 05 19:21:47 s3db13 ceph-osd[5792]: 2022-03-05T19:21:47.252+0000
>> 7f16115e0700 -1 osd.97 486863 heartbeat_check: no reply from
>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:21:21.762744+0000 front
>> 2022-03-05T19:21:21.762723+0000 (oldest deadline
>> 2022-03-05T19:21:45.261347+0000)
>> Mar 05 19:22:59 s3db13 ceph-osd[5792]: 2022-03-05T19:22:59.636+0000
>> 7f160fddd700 -1 osd.97 486869 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> Mar 05 19:23:33 s3db13 ceph-osd[5792]: 2022-03-05T19:23:33.439+0000
>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded)
>> ondisk+retry+read+known_if_redirected e486872)
>> Mar 05 19:23:34 s3db13 ceph-osd[5792]: 2022-03-05T19:23:34.458+0000
>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded)
>> ondisk+retry+read+known_if_redirected e486872)
>> Mar 05 19:23:35 s3db13 ceph-osd[5792]: 2022-03-05T19:23:35.434+0000
>> 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from
>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:23:09.928097+0000 front
>> 2022-03-05T19:23:09.928150+0000 (oldest deadline
>> 2022-03-05T19:23:35.227545+0000)
>> ...
>> Mar 05 19:23:48 s3db13 ceph-osd[5792]: 2022-03-05T19:23:48.386+0000
>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded)
>> ondisk+retry+read+known_if_redirected e486872)
>> Mar 05 19:23:49 s3db13 ceph-osd[5792]: 2022-03-05T19:23:49.362+0000
>> 7f16115e0700 -1 osd.97 486872 heartbeat_check: no reply from
>> [XX:22::60]:6834 osd.171 since back 2022-03-05T19:23:09.928097+0000 front
>> 2022-03-05T19:23:09.928150+0000 (oldest deadline
>> 2022-03-05T19:23:35.227545+0000)
>> Mar 05 19:23:49 s3db13 ceph-osd[5792]: 2022-03-05T19:23:49.362+0000
>> 7f16115e0700 -1 osd.97 486872 get_health_metrics reporting 2 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded)
>> ondisk+retry+read+known_if_redirected e486872)
>> Mar 05 19:23:50 s3db13 ceph-osd[5792]: 2022-03-05T19:23:50.358+0000
>> 7f16115e0700 -1 osd.97 486873 get_health_metrics reporting 2 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4.97748d0d (undecoded)
>> ondisk+retry+read+known_if_redirected e486872)
>> Mar 05 19:23:51 s3db13 ceph-osd[5792]: 2022-03-05T19:23:51.330+0000
>> 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head
>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9
>> ondisk+retry+read+known_if_redirected e486872)
>> Mar 05 19:23:52 s3db13 ceph-osd[5792]: 2022-03-05T19:23:52.326+0000
>> 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head
>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9
>> ondisk+retry+read+known_if_redirected e486872)
>> Mar 05 19:23:53 s3db13 ceph-osd[5792]: 2022-03-05T19:23:53.338+0000
>> 7f16115e0700 -1 osd.97 486874 get_health_metrics reporting 2 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head
>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=9
>> ondisk+retry+read+known_if_redirected e486872)
>> Mar 05 19:25:02 s3db13 ceph-osd[5792]: 2022-03-05T19:25:02.342+0000
>> 7f160fddd700 -1 osd.97 486878 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> Mar 05 19:25:33 s3db13 ceph-osd[5792]: 2022-03-05T19:25:33.569+0000
>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 2 slow ops,
>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486879)
>> ...
>> Mar 05 19:25:44 s3db13 ceph-osd[5792]: 2022-03-05T19:25:44.476+0000
>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow ops,
>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486879)
>> Mar 05 19:25:45 s3db13 ceph-osd[5792]: 2022-03-05T19:25:45.456+0000
>> 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from
>> [XX:22::60]:6834 osd.171 ever on either front or back, first ping sent
>> 2022-03-05T19:25:25.281582+0000 (oldest deadline
>> 2022-03-05T19:25:45.281582+0000)
>> Mar 05 19:25:45 s3db13 ceph-osd[5792]: 2022-03-05T19:25:45.456+0000
>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow ops,
>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486879)
>> ...
>> Mar 05 19:26:08 s3db13 ceph-osd[5792]: 2022-03-05T19:26:08.363+0000
>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow ops,
>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486879)
>> Mar 05 19:26:09 s3db13 ceph-osd[5792]: 2022-03-05T19:26:09.371+0000
>> 7f16115e0700 -1 osd.97 486880 heartbeat_check: no reply from
>> [XX:22::60]:6834 osd.171 ever on either front or back, first ping sent
>> 2022-03-05T19:25:25.281582+0000 (oldest deadline
>> 2022-03-05T19:25:45.281582+0000)
>> Mar 05 19:26:09 s3db13 ceph-osd[5792]: 2022-03-05T19:26:09.375+0000
>> 7f16115e0700 -1 osd.97 486880 get_health_metrics reporting 3 slow ops,
>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486879)
>> Mar 05 19:26:10 s3db13 ceph-osd[5792]: 2022-03-05T19:26:10.383+0000
>> 7f16115e0700 -1 osd.97 486881 get_health_metrics reporting 3 slow ops,
>> oldest is osd_op(client.2304224857.0:4271104 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486879)
>> Mar 05 19:26:11 s3db13 ceph-osd[5792]: 2022-03-05T19:26:11.407+0000
>> 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head
>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11
>> ondisk+retry+read+known_if_redirected e486879)
>> Mar 05 19:26:12 s3db13 ceph-osd[5792]: 2022-03-05T19:26:12.399+0000
>> 7f16115e0700 -1 osd.97 486882 get_health_metrics reporting 1 slow ops,
>> oldest is osd_op(client.2304224848.0:3139913 4.d 4:b0b12ee9:::gc.22:head
>> [call rgw_gc.rgw_gc_queue_list_entries in=46b] snapc 0=[] RETRY=11
>> ondisk+retry+read+known_if_redirected e486879)
>> Mar 05 19:27:24 s3db13 ceph-osd[5792]: 2022-03-05T19:27:24.975+0000
>> 7f160fddd700 -1 osd.97 486887 set_numa_affinity unable to identify public
>> interface '' numa node: (2) No such file or directory
>> Mar 05 19:27:58 s3db13 ceph-osd[5792]: 2022-03-05T19:27:58.114+0000
>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow ops,
>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486889)
>> ...
>> Mar 05 19:28:08 s3db13 ceph-osd[5792]: 2022-03-05T19:28:08.137+0000
>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow ops,
>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486889)
>> Mar 05 19:28:09 s3db13 ceph-osd[5792]: 2022-03-05T19:28:09.125+0000
>> 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from
>> [XX:22::60]:6834 osd.171 ever on either front or back, first ping sent
>> 2022-03-05T19:27:48.548094+0000 (oldest deadline
>> 2022-03-05T19:28:08.548094+0000)
>> Mar 05 19:28:09 s3db13 ceph-osd[5792]: 2022-03-05T19:28:09.125+0000
>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow ops,
>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486889)
>> ...
>> Mar 05 19:28:29 s3db13 ceph-osd[5792]: 2022-03-05T19:28:29.060+0000
>> 7f16115e0700 -1 osd.97 486890 get_health_metrics reporting 4 slow ops,
>> oldest is osd_op(client.2304235452.0:811825 4.d 4.97748d0d (undecoded)
>> ondisk+retry+write+known_if_redirected e486889)
>> Mar 05 19:28:30 s3db13 ceph-osd[5792]: 2022-03-05T19:28:30.040+0000
>> 7f16115e0700 -1 osd.97 486890 heartbeat_check: no reply from
>> [XX:22::60]:6834 osd.171 ever on either front or back, first ping sent
>> 2022-03-05T19:27:48.548094+0000 (oldest deadline
>> 2022-03-05T19:28:08.548094+0000)
>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.696+0000
>> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
>> osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000
>> 7f1613080700 -1 received  signal: Interrupt from Kernel ( Could be
>> generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000
>> 7f1613080700 -1 osd.97 486896 *** Got signal Interrupt ***
>> Mar 05 19:29:43 s3db13 ceph-osd[5792]: 2022-03-05T19:29:43.700+0000
>> 7f1613080700 -1 osd.97 486896 *** Immediate shutdown
>> (osd_fast_shutdown=true) ***
>> Mar 05 19:29:44 s3db13 systemd[1]: ceph-osd@97.service: Succeeded.
>> Mar 05 19:29:54 s3db13 systemd[1]: ceph-osd@97.service: Scheduled restart
>> job, restart counter is at 1.
>> Mar 05 19:29:54 s3db13 systemd[1]: Stopped Ceph object storage daemon
>> osd.97.
>> Mar 05 19:29:54 s3db13 systemd[1]: Starting Ceph object storage daemon
>> osd.97...
>> Mar 05 19:29:54 s3db13 systemd[1]: Started Ceph object storage daemon
>> osd.97.
>> Mar 05 19:29:55 s3db13 ceph-osd[3236773]: 2022-03-05T19:29:55.116+0000
>> 7f5852f74d80 -1 Falling back to public interface
>> Mar 05 19:30:34 s3db13 ceph-osd[3236773]: 2022-03-05T19:30:34.746+0000
>> 7f5852f74d80 -1 osd.97 486896 log_to_monitors {default=true}
>> --
>> This time, as an exception, the "UTF-8-Probleme" (UTF-8 problems) self-help group
>> meets in the large hall.
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
> ------------------------------
> 
> Subject: Digest Footer
> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
> 
> ------------------------------
> 
> End of ceph-users Digest, Vol 110, Issue 18
> *******************************************

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



