OL> Value of some fields below needs to be verified/checked: Yep, okay, I think I've got some sane sanity checks in place now. >> + CKPT_COPY(op, h->sock.err, sock->sk_err); >> + CKPT_COPY(op, h->sock.err_soft, sock->sk_err_soft); OL> I'm unsure if we need ensure that an error value is in range... OL> can it do harm to pass an out-of-range value back to the caller ? I don't think any of the kernel code would care, but it's probably best to make sure that it's in a reasonable range. OL> This is 'int' in struct sock, but assignment may make it negative OL> ? Fixed this and others. >> + CKPT_COPY(op, h->sock.rcvtimeo, sock->sk_rcvtimeo); >> + CKPT_COPY(op, h->sock.sndtimeo, sock->sk_sndtimeo); OL> These two too. Perhaps refactor and use sock_set_timeout() ? >From the look of sock_set_timeout() I'm not sure it's expected to be called from kernel space. Regardless, I think just setting it (and not corrupting it) should be fine. OL> These should be checked for sign-ness, limited in magnitude (e.g. OL> <= sysctl_wmem_max), checked against a minimum (see setting it in OL> net/core/sock.c), and finally perhaps correlated with a flag OL> against sk_userlocks ? Yep, fixed. >> + if ((sock->sk_state != TCP_LISTEN) && >> + (socket->ops->getname(socket, &h->raddr, &h->raddr_len, 1))) { >> + if (sock->sk_type != SOCK_DGRAM) { >> + ret = -EINVAL; >> + goto out; >> + } OL> I suspect this fails for !SOCK_DGRAM that isn't connected, OL> e.g. newly created. Okay, yeah. I changed the logic to only bail if it's a DGRAM socket and the state is TCP_ESTABLISHED, which I think is probably what we want. Since we bail on the transitional TCP states, I think this should be okay. >> + h->raddr_len = 0; OL> Hmmm.. do you need to ensure that this value is consistent with OL> the value of ->peer ? (at restart) I dunno, why would I? If I hook up the two peers to the same address structure on restart everything should be correct, no? As somewhat of a side question, is there a rule that says that getname() must return the exact same thing each time it is called? OL> I'd suggest a ckpt_write_err() here (and in other key places) to OL> explain to the user what caused the checkpoint to fail. Yep, I hadn't followed the introduction of that, but it's a very welcome addition. I've added calls in the appropriate places. OL> Any reason to write the buffers of a listening socket ? they only OL> contain not-yet-accepted sockest, which aren't supported anyway. Nope, fixed. OL> And while there is not support for not-yet-accepted sockets, maybe OL> add an explicit test (if LISTEN and has skb's) and fail with an OL> error, as well as call ckpt_write_err(). Isn't that just this (from the patch): + if ((sock->sk_state == TCP_LISTEN) && + !skb_queue_empty(&sock->sk_receive_queue)) { + ckpt_debug("listening socket has unaccepted peers"); + return -EBUSY; + } ? I've put a ckpt_write_err() there now... >> + /* FIXME: write out-of-order queue for TCP */ OL> I'd prefer failing if such data exists... Yep, indeed. I've removed the comment here and will augment the INET patch to check for that and fail appropriately. >> + memcpy(skb_put(*skb, len), (char *)(h + 1), len); OL> Can we avoid this memory copy ? Perhaps add _ckpt_read_hdr() and OL> _ckpt_read_payload() helpers ? Okay, yep. >> +static int sock_read_buffers(struct ckpt_ctx *ctx, >> + struct sock *sock, >> + struct sk_buff_head *queue) >> +{ >> + struct ckpt_hdr_socket_buffer *h; >> + int ret = 0; >> + int i; >> + >> + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFERS); >> + if (IS_ERR(h)) { >> + ret = PTR_ERR(h); >> + goto out; >> + } >> + >> + for (i = 0; i < h->skb_count; i++) { >> + struct sk_buff *skb = NULL; >> + >> + ret = sock_read_buffer(ctx, sock, &skb); >> + if (ret) >> + break; >> + >> + skb_queue_tail(queue, skb); >> + } OL> Make sure that total length is within limits, or this is an OL> opening for DoS. You mean the number of buffers? The sock_read_buffer() function will limit each skb to SKB_MAX_ALLOC. What is a sane maximum number of buffers? Maybe we need a sysctl for that? OL> I checked below, but you don't examine @len before calling this OL> function. So a DoS is possible by exhausting kernel memory with OL> these kmalloc()s. Yep, fixed. OL> Hmmm... a socket that is connected may also be bind(), so I'm not OL> sure the test below is always correct ? You mean because it could have been bind()'ed to a different address than it's peer before it called connect()? Wouldn't the connect() have changed the local address? OL> A thought: do we want to restore sharing of addresses at restart ? OL> For instance, all sockets accepted from a single listening socket OL> share their local addresses. For INET it seems important, but I don't think I've seen anything in the UNIX code that cares. However, I can move some of the parent-tracking stuff to this patch in order to hook up the shared addresses if you think it's important. OL> 1) Non-listening sockets may also need bind(). Especially OL> SOCK_DGRAM. Perhaps if the socket is not established and had a local address, we call bind? OL> 2) What happens if s1 is checkpointed before s2 in this scenario: OL> s1 = socket(); bind(s1, addr); unlink(addr); OL> s2 = socket(); bind(s1, addr); OL> I'd suggest that a socket that is did bind() and that OL> corresponding inode was later unlink()ed, will not be re-bind at OL> all. After all, it is unreachable in terms of connect() syscall. I'm not sure what you're saying here... OL> 3) If the sockets path is relative, you can't blindly rely on the OL> socket address reported by ->getname(), because the original OL> bind() occured _relative_ to the task's current-working-directory OL> at the time of the socket() syscall. OL> The cwd of the restarting task may differ. Moreover, the resulting OL> path: cwd + sockaddr may exceed 108 bytes. Lastly, getsockname() OL> reports the original, relative pathname anyways. OL> So if the address doesn't begin with a '/', you need to also save OL> the cwd and on restart, go there before caling bind(). But the app could have changed its cwd after binding the socket so the cwd of the checkpointed task isn't what we want, right? Maybe we could do a dentry lookup to find the actual path it's bound to and store that separately from the local and remote addresses? -- Dan Smith IBM Linux Technology Center email: danms@xxxxxxxxxx _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers