Re: RGW crashes when rgw_enable_ops_log is enabled

Hi Marc,

1. If you can, yes, please create a tracker issue on tracker.ceph.com.
2. You might be able to get more throughput with some number of additional
reader threads; the first thing I would try, though, is prioritization (nice)
of the consumer. A rough sketch is below.
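
For illustration only, a minimal consumer sketch in Python (untested; it
assumes the socket at rgw_ops_log_socket_path is a stream socket you connect
to, which matches your description, and the names are mine):

    import queue
    import socket
    import threading

    SOCKET_PATH = "/tmp/ops/rgw-ops.socket"  # your rgw_ops_log_socket_path
    backlog = queue.Queue(maxsize=10000)

    def drain(sock):
        # Nothing but recv() here, so radosgw never sees the socket
        # flow-controlled; anything expensive happens in the workers.
        while True:
            data = sock.recv(65536)
            if not data:
                break
            try:
                backlog.put_nowait(data)
            except queue.Full:
                pass  # better to drop locally than to stall the recv loop

    def worker():
        while True:
            chunk = backlog.get()
            # hand the chunk to your log pipeline here; this stub discards it
            backlog.task_done()

    def main():
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(SOCKET_PATH)
        for _ in range(4):  # a few parser/shipper threads are cheap
            threading.Thread(target=worker, daemon=True).start()
        drain(sock)  # keep the drain loop on the main thread

    if __name__ == "__main__":
        main()

Run the consumer under a favourable nice value so the drain loop always gets
scheduled; that is all I meant by prioritization.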

regards,

Matt


On Fri, Jan 26, 2024 at 6:08 AM Marc Singer <marc@singer.services> wrote:

> Hi Matt
>
> Thanks for your answer.
>
> Should I open a bug report then?
>
> How would I be able to read from it faster? Should I have multiple threads
> access it and read from it simultaneously?
>
> Marc
>
> On 1/25/24 20:25, Matt Benjamin wrote:
> > Hi Marc,
> >
> > No, the only thing you need to do with the Unix socket is to keep
> > reading from it.  So it probably is getting backlogged.  And while you
> > could arrange things to make that less likely, you likely can't make
> > it impossible, so there's a bug here.
> >
> > Matt
> >
> >     On Thu, Jan 25, 2024 at 10:52 AM Marc Singer <marc@singer.services> wrote:
> >
> >     Hi
> >
> >     I am using a unix socket client to connect to it and read the data
> >     from it.
> >     Do I need to do anything like signal to the socket that this data has
> >     been read? Or am I not reading fast enough and data is backing up?
> >
> >     What I am also noticing is that at some point (probably after something
> >     with the ops socket happens), the log level seems to increase for some
> >     reason. I have not found anything in the logs yet that would explain why
> >     this is the case.
> >
> >     *Normal:*
> >
> >     2024-01-25T15:47:58.444+0000 7fe98a5c0b00  1 ====== starting new
> >     request
> >     req=0x7fe98712c720 =====
> >     2024-01-25T15:47:58.548+0000 7fe98b700b00  1 ====== req done
> >     req=0x7fe98712c720 op status=0 http_status=200
> >     latency=0.104001537s ======
> >     2024-01-25T15:47:58.548+0000 7fe98b700b00  1 beast: 0x7fe98712c720:
> >     redacted - redacted [25/Jan/2024:15:47:58.444 +0000] "PUT
> >     /redacted/redacted/chunks/27/27242/27242514_10_4194304 HTTP/1.1" 200
> >     4194304 - "redacted" - latency=0.104001537s
> >
> >     *Close before crashing:*
> >
> >        -509> 2024-01-25T14:54:31.588+0000 7f5186648b00  1 ====== starting
> >     new request req=0x7f517ffca720 =====
> >        -508> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s initializing for trans_id =
> >     tx0000023a42eb7515dcdc0-0065b27627-823feaa-central
> >        -507> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s getting op 1
> >        -506> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj verifying requester
> >        -505> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj normalizing buckets
> >     and tenants
> >        -504> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj init permissions
> >        -503> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj recalculating target
> >        -502> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj reading permissions
> >        -501> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj init op
> >        -500> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj verifying op mask
> >        -499> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj verifying op permissions
> >        -498> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
> >     2568229052387020224 0.000000000s s3:put_obj Searching permissions for
> >     identity=rgw::auth::SysReqApplier ->
> >     rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted,
> >     subuser=, perm_mask=15, is_admin=0) mask=50
> >        -497> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
> >     2568229052387020224 0.000000000s s3:put_obj Searching permissions for
> >     uid=redacted
> >        -496> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
> >     2568229052387020224 0.000000000s s3:put_obj Found permission: 15
> >        -495> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
> >     2568229052387020224 0.000000000s s3:put_obj Searching permissions for
> >     group=1 mask=50
> >        -494> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
> >     2568229052387020224 0.000000000s s3:put_obj Permissions for group
> >     not found
> >        -493> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
> >     2568229052387020224 0.000000000s s3:put_obj Searching permissions for
> >     group=2 mask=50
> >        -492> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
> >     2568229052387020224 0.000000000s s3:put_obj Permissions for group
> >     not found
> >        -491> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req
> >     2568229052387020224 0.000000000s s3:put_obj -- Getting permissions
> >     done
> >     for identity=rgw::auth::SysReqApplier ->
> >     rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted,
> >     subuser=, perm_mask=15, is_admin=0), owner=redacted, perm=2
> >        -490> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj verifying op params
> >        -489> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj pre-executing
> >        -488> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj check rate limiting
> >        -487> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req
> >     2568229052387020224 0.000000000s s3:put_obj executing
> >        -486> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req
> >     2568229052387020224 0.036000550s s3:put_obj NOTICE: call to
> >     do_aws4_auth_completion
> >        -485> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req
> >     2568229052387020224 0.036000550s s3:put_obj NOTICE: call to
> >     do_aws4_auth_completion
> >        -484> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req
> >     2568229052387020224 0.092001401s s3:put_obj completing
> >        -483> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req
> >     2568229052387020224 0.092001401s s3:put_obj op status=0
> >        -482> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req
> >     2568229052387020224 0.092001401s s3:put_obj http status=200
> >
> >        -481> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  1 ====== req done
> >     req=0x7f517ffca720 op status=0 http_status=200
> >     latency=0.092001401s ======
> >
> >     Thanks for your help.
> >
> >     Marc Singer
> >
> >     On 1/25/24 16:22, Matt Benjamin wrote:
> >     > Hi Marc,
> >     >
> >     > The ops log code is designed to discard data if the socket is
> >     > flow-controlled, iirc.  Maybe we just need to handle the signal.
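> >     >
> >     > For what it's worth, an illustration of what "handle the signal" buys
> >     > us: the crash below is signal 25, SIGXFSZ, which the kernel raises when
> >     > a write would exceed RLIMIT_FSIZE. If a process ignores or handles it,
> >     > the oversized write fails with EFBIG instead of killing the process.
> >     > A Python sketch of the mechanics (radosgw itself is C++; this is only
> >     > to show the behaviour):
> >     >
> >     >     import resource, signal
> >     >
> >     >     # Don't die on oversized writes; they fail with EFBIG instead.
> >     >     signal.signal(signal.SIGXFSZ, signal.SIG_IGN)
> >     >     # Tiny file-size cap just for the demo.
> >     >     resource.setrlimit(resource.RLIMIT_FSIZE, (1024, 1024))
> >     >
> >     >     f = open("/tmp/xfsz-demo", "wb", buffering=0)
> >     >     f.write(b"x" * 1024)   # fills the file up to the limit
> >     >     try:
> >     >         f.write(b"y")      # nothing more can be written
> >     >     except OSError as e:
> >     >         print("write failed instead of a fatal signal:", e)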
> >     >
> >     > Of course, you should have something consuming data on the socket,
> >     > but it's still a problem if radosgw exits unexpectedly.
> >     >
> >     > Matt
> >     >
> >     > On Thu, Jan 25, 2024 at 10:08 AM Marc Singer <marc@singer.services> wrote:
> >     >
> >     >> Hi Ceph Users
> >     >>
> >     >> I am encountering a problem with the RGW Admin Ops Socket.
> >     >>
> >     >> I am setting up the socket as follows:
> >     >>
> >     >> rgw_enable_ops_log = true
> >     >> rgw_ops_log_socket_path = /tmp/ops/rgw-ops.socket
> >     >> rgw_ops_log_data_backlog = 16Mi
> >     >>
> >     >> Seems like the socket fills up over time and doesn't get flushed;
> >     >> at some point the process runs out of file space.
> >     >>
> >     >> Do I need to configure something or send something for the socket
> >     >> to flush?
> >     >>
> >     >> See the log here:
> >     >>
> >     >> 0> 2024-01-25T13:10:13.908+0000 7f247b00eb00 -1 *** Caught signal
> >     >> (File size limit exceeded) **
> >     >>    in thread 7f247b00eb00 thread_name:ops_log_file
> >     >>
> >     >>    ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> >     >>    NOTE: a copy of the executable, or `objdump -rdS <executable>`
> >     >> is needed to interpret this.
> >     >>
> >     >> --- logging levels ---
> >     >>      0/ 5 none
> >     >>      0/ 1 lockdep
> >     >>      0/ 1 context
> >     >>      1/ 1 crush
> >     >>      1/ 5 mds
> >     >>      1/ 5 mds_balancer
> >     >>      1/ 5 mds_locker
> >     >>      1/ 5 mds_log
> >     >>      1/ 5 mds_log_expire
> >     >>      1/ 5 mds_migrator
> >     >>      0/ 1 buffer
> >     >>      0/ 1 timer
> >     >>      0/ 1 filer
> >     >>      0/ 1 striper
> >     >>      0/ 1 objecter
> >     >>      0/ 5 rados
> >     >>      0/ 5 rbd
> >     >>      0/ 5 rbd_mirror
> >     >>      0/ 5 rbd_replay
> >     >>      0/ 5 rbd_pwl
> >     >>      0/ 5 journaler
> >     >>      0/ 5 objectcacher
> >     >>      0/ 5 immutable_obj_cache
> >     >>      0/ 5 client
> >     >>      1/ 5 osd
> >     >>      0/ 5 optracker
> >     >>      0/ 5 objclass
> >     >>      1/ 3 filestore
> >     >>      1/ 3 journal
> >     >>      0/ 0 ms
> >     >>      1/ 5 mon
> >     >>      0/10 monc
> >     >>      1/ 5 paxos
> >     >>      0/ 5 tp
> >     >>      1/ 5 auth
> >     >>      1/ 5 crypto
> >     >>      1/ 1 finisher
> >     >>      1/ 1 reserver
> >     >>      1/ 5 heartbeatmap
> >     >>      1/ 5 perfcounter
> >     >>      1/ 5 rgw
> >     >>      1/ 5 rgw_sync
> >     >>      1/ 5 rgw_datacache
> >     >>      1/ 5 rgw_access
> >     >>      1/ 5 rgw_dbstore
> >     >>      1/ 5 rgw_flight
> >     >>      1/ 5 javaclient
> >     >>      1/ 5 asok
> >     >>      1/ 1 throttle
> >     >>      0/ 0 refs
> >     >>      1/ 5 compressor
> >     >>      1/ 5 bluestore
> >     >>      1/ 5 bluefs
> >     >>      1/ 3 bdev
> >     >>      1/ 5 kstore
> >     >>      4/ 5 rocksdb
> >     >>      4/ 5 leveldb
> >     >>      1/ 5 fuse
> >     >>      2/ 5 mgr
> >     >>      1/ 5 mgrc
> >     >>      1/ 5 dpdk
> >     >>      1/ 5 eventtrace
> >     >>      1/ 5 prioritycache
> >     >>      0/ 5 test
> >     >>      0/ 5 cephfs_mirror
> >     >>      0/ 5 cephsqlite
> >     >>      0/ 5 seastore
> >     >>      0/ 5 seastore_onode
> >     >>      0/ 5 seastore_odata
> >     >>      0/ 5 seastore_omap
> >     >>      0/ 5 seastore_tm
> >     >>      0/ 5 seastore_t
> >     >>      0/ 5 seastore_cleaner
> >     >>      0/ 5 seastore_epm
> >     >>      0/ 5 seastore_lba
> >     >>      0/ 5 seastore_fixedkv_tree
> >     >>      0/ 5 seastore_cache
> >     >>      0/ 5 seastore_journal
> >     >>      0/ 5 seastore_device
> >     >>      0/ 5 seastore_backref
> >     >>      0/ 5 alienstore
> >     >>      1/ 5 mclock
> >     >>      0/ 5 cyanstore
> >     >>      1/ 5 ceph_exporter
> >     >>      1/ 5 memstore
> >     >>     -2/-2 (syslog threshold)
> >     >>     99/99 (stderr threshold)
> >     >> --- pthread ID / name mapping for recent threads ---
> >     >>     7f2472a89b00 / safe_timer
> >     >>     7f2472cadb00 / radosgw
> >     >>     ...
> >     >>     log_file
> >     >> /var/lib/ceph/crash/2024-01-25T13:10:13.909546Z_01ee6e6a-e946-4006-9d32-e17ef2f9df74/log
> >     >> --- end dump of recent events ---
> >     >> reraise_fatal: default handler for signal 25 didn't terminate the
> >     >> process?
> >     >>
> >     >> Thank you for your help.
> >     >>
> >     >> Marc
> >     >> _______________________________________________
> >     >> ceph-users mailing list -- ceph-users@xxxxxxx
> >     >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >     >>
> >     >
> >     _______________________________________________
> >     ceph-users mailing list -- ceph-users@xxxxxxx
> >     To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> >
> >
> > --
> >
> > Matt Benjamin
> > Red Hat, Inc.
> > 315 West Huron Street, Suite 140A
> > Ann Arbor, Michigan 48103
> >
> > http://www.redhat.com/en/technologies/storage
> >
> > tel.  734-821-5101
> > fax.  734-769-8938
> > cel.  734-216-5309
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>


-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



