Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> · Sun, 18 Mar 2018 00:13:00 +0300

On 17.03.2018 17:15, Sowmini Varadhan wrote:
> 
> I spent a long time staring at both v1 and v2 of your patch.

Thanks for your time!

> I understand the overall goal, but I am afraid to say that these
> patches are complete hacks.

I'm not agree with you, see below the explanations.

> I was trying to understand why patchv1 blows with a null rtn in 
> rds_tcp_init_net, but v2 does not, and the analysis is ugly.  
> 
> I'm going to put down the analysis here, and others can
> decide if this sort of hack is a healthy solution for a scaling
> issue (IMHO  it is not- we should get the real fix for the
> scaling instead of using duck-tape-and-chewing-gum solutions)
> 
> What is happening in v1 is this:
> 
> 1. Wnen I do "modprobe rds_tcp" in init_net, we end up doing the
>    following in rds_tcp_init
>        register_pernet_device(&rds_tcp_dev_ops);
>        register_pernet_device(&rds_tcp_net_ops);
>    Where rds_tcp_dev_ops has 
>         .id = &rds_tcp_netid,
>         .size = sizeof(struct rds_tcp_net),
>    and rds_tcp_net_ops has 0 values for both of these.
> 
> 2. So now pernet_list has &rds_tcp_net_ops as the first member of the
>    pernet_list.
> 
> 3. Now I do "ip netns create blue". As part of setup_net(), we walk
>    the pernet_list and call the ->init of each member (through ops_init()). 
>    So we'd hit rds_tcp_net_ops first. Since the id/size are 0, we'd
>    skip the struct rds_tcp_net allocation, so rds_tcp_init_net would 
>    find a null return from net_generic() and bomb.
> 
> The way I view it (and ymmv) the hack here is to call both
> register_pernet_device and register_pernet_subsys: the kernel only
> guarantees that calling *one* of register_pernet_* will ensure
> that you can safely call net_generic() afterwards.
> 
> The v2 patch "works around" this by reordering the registration. 
> So this time, init_net will set up the rds_tcp_net_ops as the second 
> member, and the first memeber will be the pernet_operations struct 
> that has non-zero id and size.
> 
> But then the unregistration (necessarily) works in the opposite order
> you have to unregister_pernet_device first (so that interfaces are
> quiesced) and then unregister_pernet_subsys() so that sockets can
> be safely quiesced.
> 
> I dont think this type of hack makes the code cleaner, it just
> make things much harder to understand, and completely brittle
> for subsequent changes.

It's not a hack, it's just a way to fix the problem, like other
pernet_operations do. It's OK for pernet_operations to share
net_generic() id. The only thing you need is to request the id in
the pernet_operations, which go the first in pernet_list. There are
a lot of examples in kernel:

1)sunrpc_net_ops
  rpcsec_gss_net_ops

  these pernet_operations share sunrpc_net_id. It's requested in sunrpc_net_ops:

  static struct pernet_operations sunrpc_net_ops = {
        .init = sunrpc_init_net,
        .exit = sunrpc_exit_net,
        .id = &sunrpc_net_id,
        .size = sizeof(struct sunrpc_net),

  and it's also used by rpcsec_gss_net_ops:

  static struct pernet_operations rpcsec_gss_net_ops = {
        .init = rpcsec_gss_init_net,
        .exit = rpcsec_gss_exit_net,
  };

  rpcsec_gss_init_net()->gss_svc_init_net()->rsc_cache_create_net(), where:

  static int rsc_cache_create_net(struct net *net)
  {
        struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);
        ...                                      ^^^here^^^

  The only thing is sunrpc_net_ops must be registered before rpcsec_gss_net_ops,
  and rpc code guarantees that.

2)ipvs_core_ops
  ipvs_core_dev_ops
  ip_vs_ftp_ops
  ip_vs_lblc_ops
  ip_vs_lblcr_ops

  these pernet_operations (5!) share ip_vs_net_id, which is requested in ipvs_core_ops:

  static struct pernet_operations ipvs_core_ops = {
        .init = __ip_vs_init,
        .exit = __ip_vs_cleanup,
        .id   = &ip_vs_net_id,
        .size = sizeof(struct netns_ipvs),
  };

  static int __net_init __ip_vs_init(struct net *net)
  {
        struct netns_ipvs *ipvs;
        ...
        ipvs = net_generic(net, ip_vs_net_id);
        net->ipvs = ipvs;
        ...
  }

  static struct pernet_operations ipvs_core_dev_ops = {
        .exit = __ip_vs_dev_cleanup,
  };

  static void __net_exit __ip_vs_dev_cleanup(struct net *net)
  {
        struct netns_ipvs *ipvs = net_ipvs(net);
                                  ^^^requested in ipvs_core_ops
        ...
  }

Look at the above example. They solve the same problem, rds has.
They need to do some actions at pernet_device exit time. And there
is ipvs_core_dev_ops added for this, since ipvs_core_ops are called
in pernet_subsys time. See ip_vs_init():

static int __init ip_vs_init(void)
{
        ...
        ret = register_pernet_subsys(&ipvs_core_ops);   /* Alloc ip_vs struct */
        if (ret < 0)
                goto cleanup_conn;
        ...
        ret = register_pernet_device(&ipvs_core_dev_ops);

That's all. They use pernet_device exit, which may be called in parallel
with anything, and which doesn't use rtnl_lock().

There is no reasons, rds_tcp_net_ops uses its own way different to all other
pernet_operations, and uses exclusive rtnl_lock().

> To solve the scaling problem why not just have a well-defined 
> callback to modules when devices are quiesced, instead of 
> overloading the pernet_device registration in this obscure way?

We already have them. These callbacks are called pernet_operations exit methods.
These methods can execute in parallel and scale nice.

Kirill
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html