Ceph issue too many open files.

Daznis <daznis@xxxxxxxxx> · Mon, 16 Jul 2018 13:45:06 +0300

Hi,

About two weeks ago something strange started happening with one of
the Ceph clusters I manage. It's running Ceph Jewel 10.2.10 with a
cache layer. Some OSDs started crashing with a "too many open files"
error. Looking into it, I found that the OSD process keeps a lot of
links in /proc/self/fd, and once the limit of 1 million is reached it
crashes. I tried increasing the limit to 2 million, but the same thing
happened. The problem seems to be that it is not clearing
/proc/self/fd, as there are about 900k inodes in use on the OSD drive.
Once the OSD is restarted and scrub starts, I get missing shard
errors:

2018-07-15 18:32:26.554348 7f604ebd1700 -1 log_channel(cluster) log [ERR] : 6.58 shard 51 missing 6:1a3a2565:::rbd_data.314da9e52da0f2.000000000000d570:head
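
For anyone wanting to watch the descriptor count climb before the
crash, here is a rough sketch that counts the entries under
/proc/<pid>/fd and reads the "Max open files" limit from
/proc/<pid>/limits. It assumes Linux; the function names are made up
for illustration and are not part of Ceph. Reading another process's
fd directory normally requires running as the same user or as root.

```python
import os


def count_open_fds(pid="self"):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def max_open_files(pid="self"):
    """Return the (soft, hard) 'Max open files' limits of a process.

    Each value is an int, or None when the kernel reports 'unlimited'.
    """
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Fields: Max open files <soft> <hard> files
                soft, hard = line.split()[3:5]
                return tuple(
                    int(value) if value.isdigit() else None
                    for value in (soft, hard)
                )
    raise RuntimeError("no 'Max open files' line in limits file")


if __name__ == "__main__":
    # "self" inspects this script; pass an OSD's pid to inspect that OSD.
    used = count_open_fds()
    soft, hard = max_open_files()
    print(f"open fds: {used}, soft limit: {soft}, hard limit: {hard}")
```

Running this periodically against the pid of an OSD would show
whether the count grows steadily towards the limit or jumps at a
particular moment, for example when scrub starts.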

OSD crash log:
    -4> 2018-07-15 17:40:25.566804 7f97143fe700  0 filestore(/var/lib/ceph/osd/ceph-44)  error (24) Too many open files not handled on operation 0x7f970e0274c0 (5142329351.0.0, or op 0, counting from 0)
    -3> 2018-07-15 17:40:25.566825 7f97143fe700  0 filestore(/var/lib/ceph/osd/ceph-44) unexpected error code
    -2> 2018-07-15 17:40:25.566829 7f97143fe700  0 filestore(/var/lib/ceph/osd/ceph-44)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "touch",
            "collection": "6.f0_head",
            "oid": "#-8:0f000000:::temp_6.f0_0_55255967_2688:head#"
        },
        {
            "op_num": 1,
            "op_name": "write",
            "collection": "6.f0_head",
            "oid": "#-8:0f000000:::temp_6.f0_0_55255967_2688:head#",
            "length": 65536,
            "offset": 0,
            "bufferlist length": 65536
        },
        {
            "op_num": 2,
            "op_name": "omap_setkeys",
            "collection": "6.f0_head",
            "oid": "#6:0f000000::::head#",
            "attr_lens": {
                "_info": 925
            }
        }
    ]
}

    -1> 2018-07-15 17:40:25.566886 7f97143fe700 -1 dump_open_fds unable to open /proc/self/fd
     0> 2018-07-15 17:40:25.569564 7f97143fe700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f97143fe700 time 2018-07-15 17:40:25.566888
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
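
The "error (24)" in the crash log is a plain OS errno value. As a
quick sanity check, this small sketch looks the number up with
Python's standard errno module; describe_errno is a hypothetical
helper name, and the mapping shown is the one used on Linux.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    name, message = describe_errno(24)  # value taken from the log above
    print(name, "-", message)
```

On Linux, 24 is EMFILE, the per-process open file limit, as opposed
to ENFILE (23), which is the system-wide file table limit.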

Any insight on how to fix this issue is appreciated.

Regards,
Darius