Re: slow requests are blocked

2. In the log levels, the 1/5 means 1 is the level written to the log file and 5 is the level temporarily kept in memory.  If you want to increase logging, you need to raise both numbers, e.g. to 20/20 or 10/10.  You can also just set a single value like 20 or 10 and Ceph will use it for both.  I personally set both numbers explicitly, as a reminder that the defaults are not the same when I set it back later.
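For reference, a minimal sketch of both ways to raise it (20/20 is just an example value; either a runtime injection or a config-file change works):

  # at runtime, without restarting the daemon
  ceph tell osd.15 injectargs '--debug-osd 20/20'

  # or persistently in ceph.conf under the [osd] section, applied on the next restart
  [osd]
      debug osd = 20/20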

3. You are not accessing something stored within the Ceph cluster, so it isn't using the admin cephx key that you have.  It is accessing the daemon socket in the OS, so you need the proper filesystem permissions to reach it.  The daemon sockets are located in /var/run/ceph/, which is a directory without any public permissions.  You could use `sudo -u ceph ceph daemon osd.15 dump_historic_ops` or just `sudo ceph daemon osd.15 dump_historic_ops`, as root can access the daemon socket as well.
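In practice that looks something like this (the socket path assumes the default $cluster-$name.asok naming, so adjust it if yours differs):

  ls -l /var/run/ceph/ceph-osd.15.asok                 # the socket is normally owned by ceph:ceph
  sudo ceph daemon osd.15 dump_historic_ops            # as root
  sudo -u ceph ceph daemon osd.15 dump_historic_ops    # or as the ceph user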

On Thu, May 10, 2018 at 7:14 AM Grigory Murashov <murashov@xxxxxxxxxxxxxx> wrote:
Hi JC!

First of all, thanks for your answer.

1. I have added the output of ceph health detail to Zabbix in case of a
warning, so every time I will see which OSD has the problem.

2. I have the default level for all logs. As I see here
http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
the default log level for the OSD is 1/5. Should I try

debug osd = 1/20, or would 1/10 be enough here?

3. Any thoughts on why I get Permission denied? All of my sockets are
also at the defaults.

[cephuser@osd3 ~]$ ceph daemon osd.15 dump_historic_ops
admin_socket: exception getting command descriptions: [Errno 13]
Permission denied

Thanks in advance.

Grigory Murashov
Voximplant

On 08.05.2018 17:31, Jean-Charles Lopez wrote:
> Hi Grigory,
>
> are these lines the only lines in your log file for OSD 15?
>
> Just for sanity, what log levels have you set in your config file, if any, that differ from the defaults? If you set all log levels to 0, as some people do, you may want to simply go back to the defaults by commenting out the debug_ lines in your config file. If you want to see something more detailed, you can indeed increase the log level to 5 or 10.
>
> What you can also do is to use the admin socket on the machine to see what operations are actually blocked: ceph daemon osd.15 dump_ops_in_flight and ceph daemon osd.15 dump_historic_ops.
>
> These two commands and their output will show you exactly which operations are blocked and will also point you to the other OSDs this OSD is working with to serve the IO. Maybe the culprit is actually one of the OSDs handling the subops, or it could be a network problem.
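> As a hedged example of narrowing that output down (assuming jq is installed; the field names are what recent releases emit and may differ in your version):
>
>   ceph daemon osd.15 dump_ops_in_flight | jq '.ops[] | {description, age}'
>   ceph daemon osd.15 dump_historic_ops | jq '.ops[] | {description, duration}'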
>
> Regards
> JC
>
>> On May 8, 2018, at 03:11, Grigory Murashov <murashov@xxxxxxxxxxxxxx> wrote:
>>
>> Hello Jean-Charles!
>>
>> I have finally caught the problem. It was at 13:02.
>>
>> [cephuser@storage-ru1-osd3 ~]$ ceph health detail
>> HEALTH_WARN 18 slow requests are blocked > 32 sec
>> REQUEST_SLOW 18 slow requests are blocked > 32 sec
>>      3 ops are blocked > 65.536 sec
>>      15 ops are blocked > 32.768 sec
>>      osd.15 has blocked requests > 65.536 sec
>> [cephuser@storage-ru1-osd3 ~]$
>>
>>
>> But surprise: there is no information in ceph-osd.15.log at that time.
>>
>>
>> 2018-05-08 12:54:26.105919 7f003f5f9700  4 rocksdb: (Original Log Time 2018/05/08-12:54:26.105843) EVENT_LOG_v1 {"time_micros": 1525773266105834, "job": 2793, "event": "trivial_move", "destination_level": 3, "files": 1, "total_files_size": 68316970}
>> 2018-05-08 12:54:26.105926 7f003f5f9700  4 rocksdb: (Original Log Time 2018/05/08-12:54:26.105854) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1537] [default] Moved #1 files to level-3 68316970 bytes OK: base level 1 max bytes base 268435456 files[0 4 45 403 722 0 0] max score 0.98
>>
>> 2018-05-08 13:07:29.711425 7f004f619700  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:684] reusing log 8051 from recycle list
>>
>> 2018-05-08 13:07:29.711497 7f004f619700  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:725] [default] New memtable created with log file: #8089. Immutable memtables: 0.
>>
>> 2018-05-08 13:07:29.726107 7f003fdfa700  4 rocksdb: (Original Log Time 2018/05/08-13:07:29.711524) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1158] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
>> 2018-05-08 13:07:29.726124 7f003fdfa700  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/flush_job.cc:264] [default] [JOB 2794] Flushing memtable with next log file: 8089
>>
>> Should I enable more detailed logging?
>>
>>
>> Grigory Murashov
>> Voximplant
>>
>> On 07.05.2018 18:59, Jean-Charles Lopez wrote:
>>> Hi,
>>>
>>> ceph health detail
>>>
>>> This will tell you which OSDs are experiencing the problem so you can then go and inspect the logs and use the admin socket to find out which requests are at the source.
>>>
>>> Regards
>>> JC
>>>
>>>> On May 7, 2018, at 03:52, Grigory Murashov <murashov@xxxxxxxxxxxxxx> wrote:
>>>>
>>>> Hello!
>>>>
>>>> I'm not very experienced in Ceph troubleshooting, which is why I'm asking for help.
>>>>
>>>> I have multiple warnings coming from Zabbix as a result of ceph -s
>>>>
>>>> REQUEST_SLOW: HEALTH_WARN : 21 slow requests are blocked > 32 sec
>>>>
>>>> I don't see any hardware problems at that time.
>>>>
>>>> I'm able to find the same strings in ceph.log and ceph-mon.log like
>>>>
>>>> 2018-05-07 12:37:57.375546 7f3037dae700  0 log_channel(cluster) log [WRN] : Health check failed: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>>>
>>>> Now it's important to find out the root cause of the problem.
>>>>
>>>> How to find out:
>>>>
>>>> 1. which OSDs are affected
>>>>
>>>> 2. which particular requests were slowed and blocked?
>>>>
>>>> I assume I need more detailed logging; how do I do that?
>>>>
>>>> Appreciate your help.
>>>>
>>>> --
>>>> Grigory Murashov
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

