Fwd: OSD fail on client writes

Hi, 

We have a Ceph Giant installation with a radosgw interface. There are 198 OSDs across seven OSD servers, and we're seeing OSD failures when users write files via the S3 interface. Failures are more likely when the file is larger than 1 GB and when it goes to a newly created bucket. We have seen failures on older buckets too, but those happen less frequently. I can reliably crash OSDs by writing a 3.6 GB file to a newly created bucket.

Three weeks ago we upgraded from Firefly to Giant for better performance. Under Firefly we could not break the system this way; we have had these issues only since the move to Giant. We've tested iptables rules, sysctl parameters, and different versions of s3cmd (along with different Python versions), and nothing indicates that any of these affect the failures.

Here is the client interaction: 

 $ ls -lh 420N.bam 
-rw-------. 1 jmcdonal tech 3.6G Feb 19 07:52 420N.bam

$ s3cmd put 420N.bam s3://jmtestbigfiles2/ 
420N.bam -> s3://jmtestbigfiles2/420N.bam  [part 1 of 4, 1024MB]
 1073741824 of 1073741824   100% in   22s    45.95 MB/s  done
420N.bam -> s3://jmtestbigfiles2/420N.bam  [part 2 of 4, 1024MB]
 1073741824 of 1073741824   100% in   23s    44.35 MB/s  done
420N.bam -> s3://jmtestbigfiles2/420N.bam  [part 3 of 4, 1024MB]
 1073741824 of 1073741824   100% in   21s    48.33 MB/s  done
420N.bam -> s3://jmtestbigfiles2/420N.bam  [part 4 of 4, 562MB]
 589993365 of 589993365   100% in   42s    13.28 MB/s  done
ERROR: syntax error: line 1, column 49
ERROR: 
Upload of '420N.bam' part 4 failed. Use
  /usr/bin/s3cmd abortmp s3://jmtestbigfiles2/420N.bam 2/A5m20_uvjRllfTNB4wplXZH0eYDjyen
to abort the upload, or
  /usr/bin/s3cmd --upload-id 2/A5m20_uvjRllfTNB4wplXZH0eYDjyen put ...
to continue the upload.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
  Please try reproducing the error using
  the latest s3cmd code from the git master
  branch found at:
  If the error persists, please report the
  following lines (removing any private
  info as necessary) to:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Invoked as: /usr/bin/s3cmd put 420N.bam s3://jmtestbigfiles2/
Problem: AttributeError: 'module' object has no attribute 'ParseError'
S3cmd:   1.5.0-rc1
python:   2.6.6 (r266:84292, Jan 22 2014, 09:42:36) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)]
environment LANG=en_US.UTF-8

Traceback (most recent call last):
  File "/usr/bin/s3cmd", line 2523, in <module>
    rc = main()
  File "/usr/bin/s3cmd", line 2441, in main
    rc = cmd_func(args)
  File "/usr/bin/s3cmd", line 380, in cmd_object_put
    response = s3.object_put(full_name, uri_final, extra_headers, extra_label = seq_label)
  File "/usr/lib/python2.6/site-packages/S3/S3.py", line 516, in object_put
    return self.send_file_multipart(file, headers, uri, size)
  File "/usr/lib/python2.6/site-packages/S3/S3.py", line 1037, in send_file_multipart
    upload.upload_all_parts()
  File "/usr/lib/python2.6/site-packages/S3/MultiPart.py", line 111, in upload_all_parts
    self.upload_part(seq, offset, current_chunk_size, labels, remote_status = remote_statuses.get(seq))
  File "/usr/lib/python2.6/site-packages/S3/MultiPart.py", line 165, in upload_part
    response = self.s3.send_file(request, self.file, labels, buffer, offset = offset, chunk_size = chunk_size)
  File "/usr/lib/python2.6/site-packages/S3/S3.py", line 1010, in send_file
    warning("Upload failed: %s (%s)" % (resource['uri'], S3Error(response)))
  File "/usr/lib/python2.6/site-packages/S3/Exceptions.py", line 51, in __init__
    except ET.ParseError:
AttributeError: 'module' object has no attribute 'ParseError'

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
  Please try reproducing the error using
  the latest s3cmd code from the git master
  branch found at:
  If the error persists, please report the
  above lines (removing any private
  info as necessary) to:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
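(As an aside, the s3cmd crash itself looks like a Python 2.6 compatibility bug rather than anything radosgw did: xml.etree.ElementTree only gained ParseError in Python 2.7, so the "except ET.ParseError:" in Exceptions.py blows up on our 2.6.6 when the gateway returns a non-XML error body. A guard along these lines would sidestep it; this is just a sketch, and parse_error_body is a made-up name, not s3cmd's:)

    import xml.etree.ElementTree as ET

    # ET.ParseError only exists on Python >= 2.7; referencing it on 2.6
    # raises the AttributeError seen in the traceback above.
    try:
        from xml.etree.ElementTree import ParseError as XmlParseError
    except ImportError:
        # Python 2.6: ElementTree parse failures surface as expat errors
        from xml.parsers.expat import ExpatError as XmlParseError

    def parse_error_body(body):
        # Parse an S3 error response, tolerating non-XML bodies on 2.6 and 2.7
        try:
            return ET.fromstring(body)
        except XmlParseError:
            return None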
-----------------------------------------------------------------------------------------
On the gateway, I see this object request:

"ops": [
        { "tid": 14235,
          "pg": "70.59a712df",
          "osd": 21,
          "object_id": "default.315696.1__shadow_420N.bam.2\/A5m20_uvjRllfTNB4wplXZH0eYDjyen.4_140",
          "object_locator": "@70",
          "target_object_id": "default.315696.1__shadow_420N.bam.2\/A5m20_uvjRllfTNB4wplXZH0eYDjyen.4_140",
          "target_object_locator": "@70",
          "paused": 0,
          "used_replica": 0,
          "precalc_pgid": 0,
          "last_sent": "2015-02-21 11:20:11.317593",
          "attempts": 7,
          "snapid": "head",
          "snap_context": "0=[]",
          "mtime": "2015-02-21 11:18:58.114452",
          "osd_ops": [
                "write 2621440~169365"]}],

-----------------------------------------------------------------------------------------

The transfer crashes three OSDs:

# ceph osd tree | grep down 
# id    weight  type name   up/down  reweight
  42    0.55    osd.42      down     1
  110   0.55    osd.110     down     1
  191   0.55    osd.191     down     1
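
(To collect logs I restart them individually; on a sysvinit host that's roughly:)

    # service ceph start osd.42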

----------------------------------------------------------------------------------

These OSDs fail again on restart, and I've turned on full debugging (level 20 for everything that can be debugged) for each of them. I've added the log files for the three failed OSDs to my Google Drive; there are three bzipped files there, one per OSD:
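
(By "full debugging" I mean ceph.conf settings along these lines on the affected OSDs; this is a representative subset of the subsystems:)

    [osd]
            debug osd = 20
            debug filestore = 20
            debug journal = 20
            debug ms = 20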


ceph-osd.42.log.bz2
ceph-osd.110.log.bz2
ceph-osd.191.log.bz2

These captures are from the restart attempts; I wasn't able to capture the original failure, since I can't predict which OSDs will fail and the logs are verbose enough to fill my log partition quickly.

--------------------------------------------------------------------------------------

The logs are rather large (~80 MB) and I'm not sure what I should be looking for to remedy the situation. Any assistance would be greatly appreciated; we really need to solve this issue.
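
(If it saves anyone time: the interesting lines are probably near the crash itself; I'd search with something like the following, though the patterns are guesses:)

    $ bzgrep -n -e 'FAILED assert' -e 'Caught signal' ceph-osd.42.log.bz2 | head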

I can collect additional debug logs if that would assist in the diagnosis.  

Thanks in advance, 
Jeff




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
