Hi,

we're trying to give ceph a try on our compute cluster. Initial stress
tests passed without problems, but over the weekend a couple of cosd
processes died. Now access to the ceph mount point blocks, and mounting
the ceph dir fails with

  mount: 192.168.1.141:6789,192.168.1.145:6789,192.168.1.150:6789:/: can't read superblock

Attempts to restart the cosd on the affected storage nodes fail with

  # /usr/local/bin/cosd -f -i 6 -c /etc/ceph/ceph.conf
  ** WARNING: Ceph is still under heavy development, and is only suitable for **
  **          testing and review.  Do not trust it with important data.       **
  starting osd6 at 0.0.0.0:6800/2685 osd_data /var/ceph/osd6 /var/ceph/osd6/journal
  terminate called after throwing an instance of 'std::bad_alloc'
    what():  std::bad_alloc
  Aborted

Stracing the cosd process shows that it calls mmap() with silly values
for the "fd" and the "length" parameters:

  mmap(NULL, 18446744073709436928, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)

I briefly looked at the source code and noticed that raw_mmap_pages() in
include/buffer.h seems to call mmap() with an unsigned int rather than
with a size_t as the second (length) parameter. Since

  18446744073709436928 = 0xfffffffffffe4000

this looks like an integer overflow. But maybe it is just uninitialized
garbage.

I've tried the v0.20.2 and the testing branch of the ceph git repo. Both
versions of cosd show the same behaviour.

Our ceph file system is 5.5T large; we have 7 cosds, 3 cmons and 3 cmds.
See the ceph.conf below for details.

Any idea how to get back the data? If you need further debugging info,
don't hesitate to ask.

Thanks
Andre

---

[global]
        ; enable secure authentication
        auth supported = cephx

        osd journal size = 100    ; measured in MB

; You need at least one monitor. You need at least three if you want to
; tolerate any node failures. Always create an odd number.
[mon]
        mon data = /var/ceph/mon$id

        ; some minimal logging (just message traffic) to aid debugging
        debug ms = 1

[mon0]
        host = node141
        mon addr = 192.168.1.141:6789
[mon1]
        host = node145
        mon addr = 192.168.1.145:6789
[mon2]
        host = node150
        mon addr = 192.168.1.150:6789

; You need at least one mds. Define two to get a standby.
[mds]
        ; where the mds keeps its secret encryption keys
        keyring = /var/ceph/keyring.$name
[mds0]
        host = node141
[mds1]
        host = node145
[mds2]
        host = node150

; osd
; You need at least one. Two if you want data to be replicated.
; Define as many as you like.
[osd]
        ; This is where the btrfs volume will be mounted.
        osd data = /var/ceph/osd$id

        ; Ideally, make this a separate disk or partition. A few GB
        ; is usually enough; more if you have fast disks. You can use
        ; a file under the osd data dir if need be
        ; (e.g. /data/osd$id/journal), but it will be slower than a
        ; separate disk or partition.
        osd journal = /var/ceph/osd$id/journal

[osd0]
        host = node141
[osd1]
        host = node145
[osd2]
        host = node150
[osd3]
        host = node146
[osd4]
        host = node147
[osd5]
        host = node149
[osd6]
        host = node142

--
The only person who always got his work done by Friday was Robinson Crusoe