1title: Linux Network Virtualization
Network Namespace
In order to provide the isolation, Linux has 6 namespaces to split the different resources, shown as follows:
Namespace | Description | |
---|---|---|
Mount Namespace | File system mount point | CLONE_NEWNET |
UTS Namespace | Hostname | CLONE_NETUTS |
IPC Namespace | POSIX process messaging queue | CLONE_NEWIPC |
PID Namespace | Process PID number namespace | CLONE_NEWPID |
Network Namespace | IP address/Port/Router/IPtables | CLONE_NEWNS |
User Namespace | User isolation | CLONE_NEWUSER |
For the process, if they want to use the resources of the Namespace, they should enter the namespace first. And the resources don’t share between different namespaces.
Network Namespace Overview
You can use ip netns
to manage the Network Namespace, or you can write C code to operate the Network Namespace through the system call. We can use clone()
(a extend for fork()
) to create a usual Namespace, and specify the parameter CLONE_NEWNET
to create a Network Namespace.
1ip netns add netns1
After you created the namespace, you can use ip netns show
or ip netns list
to check the result. At the same time, a file named netns1
is created at /var/run/netns/netns1
, this is the Mount Point. On the one hand, this file is for managing the namespace, on the other hand, even if no process running in the namespace, the namespace still exists.
Once the Network Namespace is created, you can use ip netns exec <namespace> <command>
to enter it to do some configuration.
For example,
1# run bash in network namespace <netns1>
2ip netns exec netns1 bash
3
4# get the network interface of the network namespace <netns1>
5ip net exec netns1 ip link list
After you create a Network Namespace, it contains a Loopback Interface at least.
If you want to delete the Network Namespace, try
1ip netns delete netns1
Noted, actually, the above command does not totally delete the Network Namespace, this only removed the Mount Point. If there is any process still running, the Network Namespace will still exist.
Configuration
When we talk about the network communication of the processes in the Network Namespace, the Virtual/Actual Network Device is required.
The new Network Namespace only contains a Loopback Interface, and the status of this interface is DOWN
. After setting the lo
up, you can ping 127.0.0.1
and get the response.
But, if we want to contact the outside, we need to create a couple of virtual ether network cards, it’s the veth pair
. Veth pair always presents in couples. It was like a two-way pipe, the datagram went into one side, and came out from another side.
Let’s create a veth pair
, and put one side into the netns1
.
1# create a veth pair veth0-veth1
2ip link add veth0 type veth peer name veth1
3# mv veth1 to netns1 network namespace
4ip link set veth1 netns netns1
After that, we created 2 veth interface. But these are in DOWN
status. Next, we set them up and bind an IP to the interface.
1ip netns exec netns1 ifconfig veth1 10.1.1.1/24 up
2ifconfig veth0 10.1.1.2/up
Now, you can ping 10.1.1.1
on the host.
The route table and IPtables are different between different Network Namespace.
When we enter the netns1
Network Namespace, the route table and IPtables are empty. So, when you are in the netns1
you can’t connect to the Internet. There are several ways to solve that:
-
Create a Linux bridge on the host, and bind one side of the veth pair to the bridge
-
Add NAT rule on the host and enable Linux IP forward.
1# enable or disable IP forwarding status 2sysctl -w net.ipv4.ip_forward=0 3# OR 4sysctl -w net.ipv4.ip_forward=1 5 6# check the IP forwarding status 7cat /proc/sys/net/ipv4/ip_forward 80
Note
Users can put the physical/virtual network device to any network namespace, and one device only can be put into one Network Namespace.
The process can enter the Network Namespace through the Linux System Call clone()/unshare()/setns()
. No-root process in a specific Network Namespace only can assess and config the local Network Namespace.
The root process can create a network device in the Network Namespace. And, root process can put the local Network Namespace’s device in another namespace.
1# move interface form netns1 to PID=1
2ip net exec netns1 ip link set veth1 netns 1
The above command moves the veth1 interface from netns1 to host default Network namespace.
For the root user of the network namespace, they can move the network device to any Network Namespace, even the host network namespace. So, there will be a potential risk. If users want to avoid the risk, they need to combine the PID Namespace and Mount Namespace to make the Network Namespace totally isolated.
How do combine the PID Namespace and Mount Namespace to make the Network Namespace totally isolated?
Network Namespace API
The API is related with Linux System Call: clone() unshare() setns()
and file in /proc
. This chapter will introduce the usage of the network namespace API through several examples.
clone() unshare() setns()
use the const variable CLONE_NEW*
to represent different namespace:
Namespace | Description | |
---|---|---|
Mount Namespace | File system mount point | CLONE_NEWNS |
UTS Namespace | Hostname | CLONE_NEWUTS |
IPC Namespace | POSIX process messaging queue | CLONE_NEWIPC |
PID Namespace | Process PID number namespace | CLONE_NEWPID |
Network Namespace | IP address/Port/Router/IPtables | CLONE_NEWNET |
User Namespace | User isolation | CLONE_NEWUSER |
Create namespace through clone()
1int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);
clone()
is an extension of fork()
, we can control the function through the flags
parameter. clone()
has more than 20 CLONE_*
flag to control the processes’ actions.
If set a flag CLONE_NEW
, the system will create a corresponding type namespace and a new process and put the process into the new Namespace. Through |
we can specify multiple CLONE_NEW
flags.
The parameters’ meaning, from left to right
- Function point
child_func
, specify a new function for the new process. When the function returned, the child process ended. This function returns an integer value to represent the exit code. - Point
child_stack
is put into the child process’s stack, another word, put user-mode stack point to child process’s esp register. The process that callsclone()
needs to allocate a new stack to the child process. - Int
flags
represent theCLONE_*
flag, which can be multiply connected with|
. args
is the user-defined parameter.
Finally, you should pay attention to the authorization and safety, most of the Namespace creates need the system capability, which not should be total root permission, but need CAP_SYS_ADMIN
to execute the essential System Call.
Linux 的特权是将root的权限分为各个小部分,使得一个进程只需要被授予刚刚好的权限来执行特定的任务。如果这些特权足够小,且选择的恰到好处,那么即使一个进程受损(比如缓冲区溢出),它所造成的危害也会受限于它所拥有的的特权。 例如,CAP_KILL 允许向任意的进程发送信号, 而CAP_SYS_TIME允许进程设置系统的时钟。
Keep namespace existing
Every process has its own /proc/PID/ns
, and every file in this path represents a type of namespace.
Before Linux Kernel v3.8, the file in this path is a hard link, and only has IPC, nets, and uts. After v3.8, every file is a special symbolic link file and these files provide a way to operate the namespace related to the process.
1ls -l /proc/$$/ns # $$ is the PID of bash
One of the symbolic link file’s usage is to show if two processes use the same namespace. If two processes are in the same namespace, the inode
number on the symbolic link file should be the same. (You can through stat()
to get the inode number in st_ino
)
1#define _GNU_SOURCE
2#define NULL 0x0
3#include <sys/types.h>
4#include <sys/wait.h>
5#include <sys/mount.h>
6#include <stdio.h>
7#include <sched.h>
8#include <signal.h>
9#include <unistd.h>
10#include <stdlib.h>
11
12#define STACK_SIZE (1024 * 1024)
13
14// sync primitive
15int checkpoint[2];
16
17static char child_stack[STACK_SIZE];
18char *const child_args[] = {
19 "/bin/bash",
20 NULL};
21
22int child_main(void *arg)
23{
24 char c;
25
26 // init sync primitive
27 printf(">> wait for signal");
28 close(checkpoint[1]);
29 printf(">> receive signal");
30
31 // setup hostname
32 printf("- [%5d] World !\n", getpid());
33 sethostname("In Namespace", 12);
34
35 // remount "/proc" to get accurate "top" && "ps" output
36 mount("proc", "/proc", "proc", 0, NULL);
37
38 // wait for network setup in parent
39 read(checkpoint[0], &c, 1);
40
41 // setup network
42 system("ip link set lo up");
43 system("ip link set veth1 up");
44 system("ip addr add 192.168.0.2/24 dev veth1");
45 execv(child_args[0], child_args);
46 printf("Oops \n");
47 return 1;
48}
49
50int main()
51{
52 // init sync primitive
53 pipe(checkpoint);
54
55 printf("- [%5d] Hello ?\n", getpid());
56
57 int child_pid = clone(child_main, child_stack + STACK_SIZE,
58 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | SIGCHLD, NULL);
59
60 // further init; create a veth pair
61 char *cmd;
62 asprintf(&cmd, "ip link set veth1 netns %d", child_pid);
63 system("ip link add veth0 type veth peer name veth1");
64 system(cmd);
65 system("ip link set veth0 up");
66 system("ip addr add 192.168.0.1/24 dev veth0");
67 free(cmd);
68
69 // single "done"
70 close(checkpoint[1]);
71 printf(">>> signal done");
72
73 waitpid(child_pid, NULL, 0);
74 return 0;
75}
NC command
use for network test
C primitive
Veth Pair
The principle of Veth Pair is that put data into one side of the veth pair, and get out from another side.
The kernel code if veth pair
Relationship between container and veth pair
The typical model of the container network is veth pair + bridge. The interface in the container eth0
is a peer of the host veth interface. So, how to find which is the peer interface?
Method 1:
First, in the target container find that,
1cat /sys/class/net/eth0/iflink
And then check all the files on the host /sys/class/net/
,check the ifindex
value which is the same as the iflink
. The interface with the same value is another side of the veth pair.
Method 2
Linux Bridge
What is Linux Bridge
Create and Manage the Linux Bridge
1ip link add name br0 type bridge
2ip link set br0 up
Besides the ip
command, we can use the brctl
tool in the bridge-utils
package to create a bridge.
1brctl addr br0