Re: RGW crashes when rgw_enable_ops_log is enabled

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Matt

Thanks for your answer.

Should I open a bug report then?

How would I be able to read more from it? Have multiple threads access it and read from it simultaneously?

Marc

On 1/25/24 20:25, Matt Benjamin wrote:
Hi Marc,

No, the only thing you need to do with the Unix socket is to keep reading from it.  So it probably is getting backlogged.  And while you could arrange things to make that less likely, you likely can't make it impossible, so there's a bug here.

Matt

On Thu, Jan 25, 2024 at 10:52 AM Marc Singer <marc@singer.services> wrote:

    Hi

    I am using a unix socket client to connect with it and read the data
    from it.
    Do I need to do anything like signal the socket that this data has
    been
    read? Or am I not reading fast enough and data is backing up?

    What I am also noticing that at some point (probably after something
    with the ops socket happens), the log level seems to increase for
    some
    reason? I did not find anything in the logs yet why this would be
    the case.

    *Normal:*

    2024-01-25T15:47:58.444+0000 7fe98a5c0b00  1 ====== starting new
    request
    req=0x7fe98712c720 =====
    2024-01-25T15:47:58.548+0000 7fe98b700b00  1 ====== req done
    req=0x7fe98712c720 op status=0 http_status=200
    latency=0.104001537s ======
    2024-01-25T15:47:58.548+0000 7fe98b700b00  1 beast: 0x7fe98712c720:
    redacted - redacted [25/Jan/2024:15:47:58.444 +0000] "PUT
    /redacted/redacted/chunks/27/27242/27242514_10_4194304 HTTP/1.1" 200
    4194304 - "redacted" - latency=0.104001537s

    *Close before crashing:
    *

       -509> 2024-01-25T14:54:31.588+0000 7f5186648b00  1 ====== starting
    new request req=0x7f517ffca720 =====
       -508> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s initializing for trans_id =
    tx0000023a42eb7515dcdc0-0065b27627-823feaa-central
       -507> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s getting op 1
       -506> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj verifying requester
       -505> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj normalizing buckets
    and tenants
       -504> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj init permissions
       -503> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj recalculating target
       -502> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj reading permissions
       -501> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj init op
       -500> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj verifying op mask
       -499> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj verifying op permissions
       -498> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
    2568229052387020224 0.000000000s s3:put_obj Searching permissions for
    identity=rgw::auth::SysReqApplier ->
    rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted,
    subuser=, perm_mask=15, is_admin=0) mask=50
       -497> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
    2568229052387020224 0.000000000s s3:put_obj Searching permissions for
    uid=redacted
       -496> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
    2568229052387020224 0.000000000s s3:put_obj Found permission: 15
       -495> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
    2568229052387020224 0.000000000s s3:put_obj Searching permissions for
    group=1 mask=50
       -494> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
    2568229052387020224 0.000000000s s3:put_obj Permissions for group
    not found
       -493> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
    2568229052387020224 0.000000000s s3:put_obj Searching permissions for
    group=2 mask=50
       -492> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
    2568229052387020224 0.000000000s s3:put_obj Permissions for group
    not found
       -491> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
    2568229052387020224 0.000000000s s3:put_obj -- Getting permissions
    done
    for identity=rgw::auth::SysReqApplier ->
    rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted,
    subuser=, perm_mask=15, is_admin=0), owner=redacted, perm=2
       -490> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj verifying op params
       -489> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj pre-executing
       -488> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj check rate limiting
       -487> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
    2568229052387020224 0.000000000s s3:put_obj executing
       -486> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req
    2568229052387020224 0.036000550s s3:put_obj NOTICE: call to
    do_aws4_auth_completion
       -485> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req
    2568229052387020224 0.036000550s s3:put_obj NOTICE: call to
    do_aws4_auth_completion
       -484> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req
    2568229052387020224 0.092001401s s3:put_obj completing
       -483> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req
    2568229052387020224 0.092001401s s3:put_obj op status=0
       -482> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req
    2568229052387020224 0.092001401s s3:put_obj http status=200

       -481> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  1 ====== req done
    req=0x7f517ffca720 op status=0 http_status=200
    latency=0.092001401s ======

    Thanks for your help.

    Marc Singer

    On 1/25/24 16:22, Matt Benjamin wrote:
    > Hi Marc,
    >
    > The ops log code is designed to discard data if the socket is
    > flow-controlled, iirc.  Maybe we just need to handle the signal.
    >
    > Of course, you should have something consuming data on the
    socket, but it's
    > still a problem if radosgw exits unexpectedly.
    >
    > Matt
    >
    > On Thu, Jan 25, 2024 at 10:08 AM Marc
    Singer<marc@singer.services>  wrote:
    >
    >> Hi Ceph Users
    >>
    >> I am encountering a problem with the RGW Admin Ops Socket.
    >>
    >> I am setting up the socket as follows:
    >>
    >> rgw_enable_ops_log = true
    >> rgw_ops_log_socket_path = /tmp/ops/rgw-ops.socket
    >> rgw_ops_log_data_backlog = 16Mi
    >>
    >> Seems like the socket fills up over time and it doesn't seem to get
    >> flushed, at some point the process runs out of file space.
    >>
    >> Do I need to configure something or send something for the
    socket to flush?
    >>
    >> See the log here:
    >>
    >> 0> 2024-01-25T13:10:13.908+0000 7f247b00eb00 -1 *** Caught
    signal (File
    >> size limit exceeded) **
    >>    in thread 7f247b00eb00 thread_name:ops_log_file
    >>
    >>    ceph version 18.2.0
    (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef
    >> (stable)
    >>    NOTE: a copy of the executable, or `objdump -rdS
    <executable>` is
    >> needed to interpret this.
    >>
    >> --- logging levels ---
    >>      0/ 5 none
    >>      0/ 1 lockdep
    >>      0/ 1 context
    >>      1/ 1 crush
    >>      1/ 5 mds
    >>      1/ 5 mds_balancer
    >>      1/ 5 mds_locker
    >>      1/ 5 mds_log
    >>      1/ 5 mds_log_expire
    >>      1/ 5 mds_migrator
    >>      0/ 1 buffer
    >>      0/ 1 timer
    >>      0/ 1 filer
    >>      0/ 1 striper
    >>      0/ 1 objecter
    >>      0/ 5 rados
    >>      0/ 5 rbd
    >>      0/ 5 rbd_mirror
    >>      0/ 5 rbd_replay
    >>      0/ 5 rbd_pwl
    >>      0/ 5 journaler
    >>      0/ 5 objectcacher
    >>      0/ 5 immutable_obj_cache
    >>      0/ 5 client
    >>      1/ 5 osd
    >>      0/ 5 optracker
    >>      0/ 5 objclass
    >>      1/ 3 filestore
    >>      1/ 3 journal
    >>      0/ 0 ms
    >>      1/ 5 mon
    >>      0/10 monc
    >>      1/ 5 paxos
    >>      0/ 5 tp
    >>      1/ 5 auth
    >>      1/ 5 crypto
    >>      1/ 1 finisher
    >>      1/ 1 reserver
    >>      1/ 5 heartbeatmap
    >>      1/ 5 perfcounter
    >>      1/ 5 rgw
    >>      1/ 5 rgw_sync
    >>      1/ 5 rgw_datacache
    >>      1/ 5 rgw_access
    >>      1/ 5 rgw_dbstore
    >>      1/ 5 rgw_flight
    >>      1/ 5 javaclient
    >>      1/ 5 asok
    >>      1/ 1 throttle
    >>      0/ 0 refs
    >>      1/ 5 compressor
    >>      1/ 5 bluestore
    >>      1/ 5 bluefs
    >>      1/ 3 bdev
    >>      1/ 5 kstore
    >>      4/ 5 rocksdb
    >>      4/ 5 leveldb
    >>      1/ 5 fuse
    >>      2/ 5 mgr
    >>      1/ 5 mgrc
    >>      1/ 5 dpdk
    >>      1/ 5 eventtrace
    >>      1/ 5 prioritycache
    >>      0/ 5 test
    >>      0/ 5 cephfs_mirror
    >>      0/ 5 cephsqlite
    >>      0/ 5 seastore
    >>      0/ 5 seastore_onode
    >>      0/ 5 seastore_odata
    >>      0/ 5 seastore_omap
    >>      0/ 5 seastore_tm
    >>      0/ 5 seastore_t
    >>      0/ 5 seastore_cleaner
    >>      0/ 5 seastore_epm
    >>      0/ 5 seastore_lba
    >>      0/ 5 seastore_fixedkv_tree
    >>      0/ 5 seastore_cache
    >>      0/ 5 seastore_journal
    >>      0/ 5 seastore_device
    >>      0/ 5 seastore_backref
    >>      0/ 5 alienstore
    >>      1/ 5 mclock
    >>      0/ 5 cyanstore
    >>      1/ 5 ceph_exporter
    >>      1/ 5 memstore
    >>     -2/-2 (syslog threshold)
    >>     99/99 (stderr threshold)
    >> --- pthread ID / name mapping for recent threads ---
    >>     7f2472a89b00 / safe_timer
    >>     7f2472cadb00 / radosgw
    >>     ...
    >>     log_file
    >>
    >>
    /var/lib/ceph/crash/2024-01-25T13:10:13.909546Z_01ee6e6a-e946-4006-9d32-e17ef2f9df74/log
    >> --- end dump of recent events ---
    >> reraise_fatal: default handler for signal 25 didn't terminate
    the process?
    >>
    >> Thank you for your help.
    >>
    >> Marc
    >> _______________________________________________
    >> ceph-users mailing list --ceph-users@xxxxxxx
    >> To unsubscribe send an email toceph-users-leave@xxxxxxx
    >>
    >
    _______________________________________________
    ceph-users mailing list -- ceph-users@xxxxxxx
    To unsubscribe send an email to ceph-users-leave@xxxxxxx



--

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux