On 17.03.2018 17:15, Sowmini Varadhan wrote: > > I spent a long time staring at both v1 and v2 of your patch. Thanks for your time! > I understand the overall goal, but I am afraid to say that these > patches are complete hacks. I'm not agree with you, see below the explanations. > I was trying to understand why patchv1 blows with a null rtn in > rds_tcp_init_net, but v2 does not, and the analysis is ugly. > > I'm going to put down the analysis here, and others can > decide if this sort of hack is a healthy solution for a scaling > issue (IMHO it is not- we should get the real fix for the > scaling instead of using duck-tape-and-chewing-gum solutions) > > What is happening in v1 is this: > > 1. Wnen I do "modprobe rds_tcp" in init_net, we end up doing the > following in rds_tcp_init > register_pernet_device(&rds_tcp_dev_ops); > register_pernet_device(&rds_tcp_net_ops); > Where rds_tcp_dev_ops has > .id = &rds_tcp_netid, > .size = sizeof(struct rds_tcp_net), > and rds_tcp_net_ops has 0 values for both of these. > > 2. So now pernet_list has &rds_tcp_net_ops as the first member of the > pernet_list. > > 3. Now I do "ip netns create blue". As part of setup_net(), we walk > the pernet_list and call the ->init of each member (through ops_init()). > So we'd hit rds_tcp_net_ops first. Since the id/size are 0, we'd > skip the struct rds_tcp_net allocation, so rds_tcp_init_net would > find a null return from net_generic() and bomb. > > The way I view it (and ymmv) the hack here is to call both > register_pernet_device and register_pernet_subsys: the kernel only > guarantees that calling *one* of register_pernet_* will ensure > that you can safely call net_generic() afterwards. > > The v2 patch "works around" this by reordering the registration. > So this time, init_net will set up the rds_tcp_net_ops as the second > member, and the first memeber will be the pernet_operations struct > that has non-zero id and size. > > But then the unregistration (necessarily) works in the opposite order > you have to unregister_pernet_device first (so that interfaces are > quiesced) and then unregister_pernet_subsys() so that sockets can > be safely quiesced. > > I dont think this type of hack makes the code cleaner, it just > make things much harder to understand, and completely brittle > for subsequent changes. It's not a hack, it's just a way to fix the problem, like other pernet_operations do. It's OK for pernet_operations to share net_generic() id. The only thing you need is to request the id in the pernet_operations, which go the first in pernet_list. There are a lot of examples in kernel: 1)sunrpc_net_ops rpcsec_gss_net_ops these pernet_operations share sunrpc_net_id. It's requested in sunrpc_net_ops: static struct pernet_operations sunrpc_net_ops = { .init = sunrpc_init_net, .exit = sunrpc_exit_net, .id = &sunrpc_net_id, .size = sizeof(struct sunrpc_net), and it's also used by rpcsec_gss_net_ops: static struct pernet_operations rpcsec_gss_net_ops = { .init = rpcsec_gss_init_net, .exit = rpcsec_gss_exit_net, }; rpcsec_gss_init_net()->gss_svc_init_net()->rsc_cache_create_net(), where: static int rsc_cache_create_net(struct net *net) { struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); ... ^^^here^^^ The only thing is sunrpc_net_ops must be registered before rpcsec_gss_net_ops, and rpc code guarantees that. 2)ipvs_core_ops ipvs_core_dev_ops ip_vs_ftp_ops ip_vs_lblc_ops ip_vs_lblcr_ops these pernet_operations (5!) share ip_vs_net_id, which is requested in ipvs_core_ops: static struct pernet_operations ipvs_core_ops = { .init = __ip_vs_init, .exit = __ip_vs_cleanup, .id = &ip_vs_net_id, .size = sizeof(struct netns_ipvs), }; static int __net_init __ip_vs_init(struct net *net) { struct netns_ipvs *ipvs; ... ipvs = net_generic(net, ip_vs_net_id); net->ipvs = ipvs; ... } static struct pernet_operations ipvs_core_dev_ops = { .exit = __ip_vs_dev_cleanup, }; static void __net_exit __ip_vs_dev_cleanup(struct net *net) { struct netns_ipvs *ipvs = net_ipvs(net); ^^^requested in ipvs_core_ops ... } Look at the above example. They solve the same problem, rds has. They need to do some actions at pernet_device exit time. And there is ipvs_core_dev_ops added for this, since ipvs_core_ops are called in pernet_subsys time. See ip_vs_init(): static int __init ip_vs_init(void) { ... ret = register_pernet_subsys(&ipvs_core_ops); /* Alloc ip_vs struct */ if (ret < 0) goto cleanup_conn; ... ret = register_pernet_device(&ipvs_core_dev_ops); That's all. They use pernet_device exit, which may be called in parallel with anything, and which doesn't use rtnl_lock(). There is no reasons, rds_tcp_net_ops uses its own way different to all other pernet_operations, and uses exclusive rtnl_lock(). > To solve the scaling problem why not just have a well-defined > callback to modules when devices are quiesced, instead of > overloading the pernet_device registration in this obscure way? We already have them. These callbacks are called pernet_operations exit methods. These methods can execute in parallel and scale nice. Kirill -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html