Re: RGW crashes when rgw_enable_ops_log is enabled

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi

The issue is open: https://tracker.ceph.com/issues/64244

If you could take a look or let me know what are the next steps I would be super grateful.

In the meantime I will try to increase the read throughput.

Thanks

Marc

On 1/26/24 15:23, Matt Benjamin wrote:
Hi Marc,

1. if you can, yes, create a tracker issue on tracker.ceph.com?
2. you might be able to get more throughput with (some number) of
additional threads;  the first thing I would try is prioritization (nice)

regards,

Matt


On Fri, Jan 26, 2024 at 6:08 AM Marc Singer<marc@singer.services>  wrote:

Hi Matt

Thanks for your answer.

Should I open a bug report then?

How would I be able to read more from it? Have multiple threads access
it and read from it simultaneously?

Marc

On 1/25/24 20:25, Matt Benjamin wrote:
Hi Marc,

No, the only thing you need to do with the Unix socket is to keep
reading from it.  So it probably is getting backlogged.  And while you
could arrange things to make that less likely, you likely can't make
it impossible, so there's a bug here.

Matt

On Thu, Jan 25, 2024 at 10:52 AM Marc Singer<marc@singer.services>
wrote:
     Hi

     I am using a unix socket client to connect with it and read the data
     from it.
     Do I need to do anything like signal the socket that this data has
     been
     read? Or am I not reading fast enough and data is backing up?

     What I am also noticing that at some point (probably after something
     with the ops socket happens), the log level seems to increase for
     some
     reason? I did not find anything in the logs yet why this would be
     the case.

     *Normal:*

     2024-01-25T15:47:58.444+0000 7fe98a5c0b00  1 ====== starting new
     request
     req=0x7fe98712c720 =====
     2024-01-25T15:47:58.548+0000 7fe98b700b00  1 ====== req done
     req=0x7fe98712c720 op status=0 http_status=200
     latency=0.104001537s ======
     2024-01-25T15:47:58.548+0000 7fe98b700b00  1 beast: 0x7fe98712c720:
     redacted - redacted [25/Jan/2024:15:47:58.444 +0000] "PUT
     /redacted/redacted/chunks/27/27242/27242514_10_4194304 HTTP/1.1" 200
     4194304 - "redacted" - latency=0.104001537s

     *Close before crashing:
     *

        -509> 2024-01-25T14:54:31.588+0000 7f5186648b00  1 ====== starting
     new request req=0x7f517ffca720 =====
        -508> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000s initializing for trans_id =
     tx0000023a42eb7515dcdc0-0065b27627-823feaa-central
        -507> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000s getting op 1
        -506> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  verifying requester
        -505> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  normalizing buckets
     and tenants
        -504> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  init permissions
        -503> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  recalculating target
        -502> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  reading permissions
        -501> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  init op
        -500> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  verifying op mask
        -499> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  verifying op permissions
        -498> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
     2568229052387020224 0.000000000ss3:put_obj  Searching permissions for
     identity=rgw::auth::SysReqApplier  ->
     rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted,
     subuser=, perm_mask=15, is_admin=0) mask=50
        -497> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
     2568229052387020224 0.000000000ss3:put_obj  Searching permissions for
     uid=redacted
        -496> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
     2568229052387020224 0.000000000ss3:put_obj  Found permission: 15
        -495> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
     2568229052387020224 0.000000000ss3:put_obj  Searching permissions for
     group=1 mask=50
        -494> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
     2568229052387020224 0.000000000ss3:put_obj  Permissions for group
     not found
        -493> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
     2568229052387020224 0.000000000ss3:put_obj  Searching permissions for
     group=2 mask=50
        -492> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
     2568229052387020224 0.000000000ss3:put_obj  Permissions for group
     not found
        -491> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
     2568229052387020224 0.000000000ss3:put_obj  -- Getting permissions
     done
     for identity=rgw::auth::SysReqApplier  ->
     rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted,
     subuser=, perm_mask=15, is_admin=0), owner=redacted, perm=2
        -490> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  verifying op params
        -489> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  pre-executing
        -488> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  check rate limiting
        -487> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
     2568229052387020224 0.000000000ss3:put_obj  executing
        -486> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req
     2568229052387020224 0.036000550ss3:put_obj  NOTICE: call to
     do_aws4_auth_completion
        -485> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req
     2568229052387020224 0.036000550ss3:put_obj  NOTICE: call to
     do_aws4_auth_completion
        -484> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req
     2568229052387020224 0.092001401ss3:put_obj  completing
        -483> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req
     2568229052387020224 0.092001401ss3:put_obj  op status=0
        -482> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req
     2568229052387020224 0.092001401ss3:put_obj  http status=200

        -481> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  1 ====== req done
     req=0x7f517ffca720 op status=0 http_status=200
     latency=0.092001401s ======

     Thanks for your help.

     Marc Singer

     On 1/25/24 16:22, Matt Benjamin wrote:
     > Hi Marc,
     >
     > The ops log code is designed to discard data if the socket is
     > flow-controlled, iirc.  Maybe we just need to handle the signal.
     >
     > Of course, you should have something consuming data on the
     socket, but it's
     > still a problem if radosgw exits unexpectedly.
     >
     > Matt
     >
     > On Thu, Jan 25, 2024 at 10:08 AM Marc
     Singer<marc@singer.services>   wrote:
     >
     >> Hi Ceph Users
     >>
     >> I am encountering a problem with the RGW Admin Ops Socket.
     >>
     >> I am setting up the socket as follows:
     >>
     >> rgw_enable_ops_log = true
     >> rgw_ops_log_socket_path = /tmp/ops/rgw-ops.socket
     >> rgw_ops_log_data_backlog = 16Mi
     >>
     >> Seems like the socket fills up over time and it doesn't seem to
get
     >> flushed, at some point the process runs out of file space.
     >>
     >> Do I need to configure something or send something for the
     socket to flush?
     >>
     >> See the log here:
     >>
     >> 0> 2024-01-25T13:10:13.908+0000 7f247b00eb00 -1 *** Caught
     signal (File
     >> size limit exceeded) **
     >>    in thread 7f247b00eb00 thread_name:ops_log_file
     >>
     >>    ceph version 18.2.0
     (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef
     >> (stable)
     >>    NOTE: a copy of the executable, or `objdump -rdS
     <executable>` is
     >> needed to interpret this.
     >>
     >> --- logging levels ---
     >>      0/ 5 none
     >>      0/ 1 lockdep
     >>      0/ 1 context
     >>      1/ 1 crush
     >>      1/ 5 mds
     >>      1/ 5 mds_balancer
     >>      1/ 5 mds_locker
     >>      1/ 5 mds_log
     >>      1/ 5 mds_log_expire
     >>      1/ 5 mds_migrator
     >>      0/ 1 buffer
     >>      0/ 1 timer
     >>      0/ 1 filer
     >>      0/ 1 striper
     >>      0/ 1 objecter
     >>      0/ 5 rados
     >>      0/ 5 rbd
     >>      0/ 5 rbd_mirror
     >>      0/ 5 rbd_replay
     >>      0/ 5 rbd_pwl
     >>      0/ 5 journaler
     >>      0/ 5 objectcacher
     >>      0/ 5 immutable_obj_cache
     >>      0/ 5 client
     >>      1/ 5 osd
     >>      0/ 5 optracker
     >>      0/ 5 objclass
     >>      1/ 3 filestore
     >>      1/ 3 journal
     >>      0/ 0 ms
     >>      1/ 5 mon
     >>      0/10 monc
     >>      1/ 5 paxos
     >>      0/ 5 tp
     >>      1/ 5 auth
     >>      1/ 5 crypto
     >>      1/ 1 finisher
     >>      1/ 1 reserver
     >>      1/ 5 heartbeatmap
     >>      1/ 5 perfcounter
     >>      1/ 5 rgw
     >>      1/ 5 rgw_sync
     >>      1/ 5 rgw_datacache
     >>      1/ 5 rgw_access
     >>      1/ 5 rgw_dbstore
     >>      1/ 5 rgw_flight
     >>      1/ 5 javaclient
     >>      1/ 5 asok
     >>      1/ 1 throttle
     >>      0/ 0 refs
     >>      1/ 5 compressor
     >>      1/ 5 bluestore
     >>      1/ 5 bluefs
     >>      1/ 3 bdev
     >>      1/ 5 kstore
     >>      4/ 5 rocksdb
     >>      4/ 5 leveldb
     >>      1/ 5 fuse
     >>      2/ 5 mgr
     >>      1/ 5 mgrc
     >>      1/ 5 dpdk
     >>      1/ 5 eventtrace
     >>      1/ 5 prioritycache
     >>      0/ 5 test
     >>      0/ 5 cephfs_mirror
     >>      0/ 5 cephsqlite
     >>      0/ 5 seastore
     >>      0/ 5 seastore_onode
     >>      0/ 5 seastore_odata
     >>      0/ 5 seastore_omap
     >>      0/ 5 seastore_tm
     >>      0/ 5 seastore_t
     >>      0/ 5 seastore_cleaner
     >>      0/ 5 seastore_epm
     >>      0/ 5 seastore_lba
     >>      0/ 5 seastore_fixedkv_tree
     >>      0/ 5 seastore_cache
     >>      0/ 5 seastore_journal
     >>      0/ 5 seastore_device
     >>      0/ 5 seastore_backref
     >>      0/ 5 alienstore
     >>      1/ 5 mclock
     >>      0/ 5 cyanstore
     >>      1/ 5 ceph_exporter
     >>      1/ 5 memstore
     >>     -2/-2 (syslog threshold)
     >>     99/99 (stderr threshold)
     >> --- pthread ID / name mapping for recent threads ---
     >>     7f2472a89b00 / safe_timer
     >>     7f2472cadb00 / radosgw
     >>     ...
     >>     log_file
     >>
     >>

  /var/lib/ceph/crash/2024-01-25T13:10:13.909546Z_01ee6e6a-e946-4006-9d32-e17ef2f9df74/log
     >> --- end dump of recent events ---
     >> reraise_fatal: default handler for signal 25 didn't terminate
     the process?
     >>
     >> Thank you for your help.
     >>
     >> Marc
     >> _______________________________________________
     >> ceph-users mailing list --ceph-users@xxxxxxx
     >> To unsubscribe send an emailtoceph-users-leave@xxxxxxx
     >>
     >
     _______________________________________________
     ceph-users mailing list --ceph-users@xxxxxxx
     To unsubscribe send an email toceph-users-leave@xxxxxxx



--

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
_______________________________________________
ceph-users mailing list --ceph-users@xxxxxxx
To unsubscribe send an email toceph-users-leave@xxxxxxx


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux