Differences between BIO and NIO, and how they work

  When I first studied NIO I only learned its basic usage and knew that Selector, Channel and Buffer are its three important components; I never looked into why it is called NIO or what its advantages actually are. This post records the details.

1. Basic concepts

  Kernel mode: runs kernel code. In kernel mode, code has full control over the hardware: it can execute any CPU instruction and access any memory address. Kernel mode serves the lowest-level, most trusted functions of the operating system. Any crash in kernel mode is catastrophic and halts the whole machine.

  User mode: runs user programs. In user mode, code has no direct control over the hardware and can only access its own user-space addresses; it reaches hardware and memory by calling system APIs. Under this protection, even if a program crashes the system can recover. Most of the programs on your machine run in user mode.

  When a program needs to invoke kernel code, a mode switch happens: the CPU first saves the user thread's context, switches to kernel mode to run the kernel routine, and finally switches back to the application using the saved context.

File descriptor (fd): a file descriptor is a term from computer science for an abstract handle that refers to a file. Formally it is a non-negative integer; in practice it is an index into the table of open files that the kernel maintains for each process. When a program opens an existing file or creates a new one, the kernel returns a file descriptor to the process. Low-level programming often revolves around file descriptors, although the concept is mostly specific to UNIX-like systems such as Linux.
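
To make the fd concept concrete, here is a small Java sketch (my own illustration, not part of the original tests; it assumes Linux, where /proc/self/fd is available). It opens a regular file and a listening socket and then prints the process's file descriptor table:

import java.io.FileInputStream;
import java.net.ServerSocket;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FdDemo {
    public static void main(String[] args) throws Exception {
        // The file and the socket are opened only so that two extra fds exist while we list them.
        try (FileInputStream file = new FileInputStream("/etc/hosts");
             ServerSocket server = new ServerSocket(0);
             DirectoryStream<Path> fds = Files.newDirectoryStream(Paths.get("/proc/self/fd"))) {
            for (Path fd : fds) {
                // fd.getFileName() is the small integer descriptor; the symlink target shows what it refers to
                System.out.println(fd.getFileName() + " -> " + Files.readSymbolicLink(fd));
            }
        }
    }
}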

2. BIO test

BIO test code:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SocketServer {

    private static final ExecutorService executorService = Executors.newFixedThreadPool(5);

    public static void main(String[] args) throws Exception {
        ServerSocket serverSocket = new ServerSocket(8088);
        System.out.println("serverSocket 8088 start");
        while (true) {
            Socket socket = serverSocket.accept();
            System.out.println("socket.getInetAddress(): " + socket.getInetAddress());
            executorService.execute(new MyThread(socket));
        }
    }

    static class MyThread extends Thread {

        private Socket socket;

        public MyThread(Socket socket) {
            this.socket = socket;
        }

        @Override
        public void run() {
            try {
                InputStream inputStream = socket.getInputStream();
                BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
                String s;
                // readLine() blocks until a full line arrives and returns null when the client closes the connection
                while ((s = bufferedReader.readLine()) != null) {
                    System.out.println(Thread.currentThread().getId() + " 收到的消息	" + s);
                }
            } catch (Exception exception) {
                // ignore
            } finally {
                try {
                    socket.close();
                } catch (Exception e) {
                    // ignore
                }
            }
        }
    }
}
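
For reference, a minimal Java client could be used instead of nc to drive this server (a sketch of my own, not part of the original test; the post itself uses nc below):

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SocketClient {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 8088)) {
            OutputStream out = socket.getOutputStream();
            out.write("test\n".getBytes(StandardCharsets.UTF_8)); // readLine() on the server needs the trailing \n
            out.flush();
            Thread.sleep(1000); // keep the connection open briefly so the server thread can read
        }
    }
}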

Compile with JDK 6, then run it under strace to watch the system calls it makes:

[root@localhost jdk6]# ./jdk1.6.0_06/bin/javac SocketServer.java 
[root@localhost jdk6]# strace -ff -o out ./jdk1.6.0_06/bin/java SocketServer
serverSocket 8088 start

strace is a Linux user-space tracer that can be used for diagnostics, debugging and teaching. Here we use it to watch the interaction between a user-space process and the kernel: system calls, signal delivery, process state changes and so on.

Several out files are generated, as shown below (one file per thread; a JVM starts a number of daemon threads by default, e.g. for GC or for handling jmap and similar commands):

[root@localhost jdk6]# ll
total 64092
drwxr-xr-x. 9   10  143      204 Jul 23  2008 jdk1.6.0_06
-rw-r--r--. 1 root root 64885867 Jul 20 03:33 jdk-6u6-p-linux-x64.tar.gz
-rw-r--r--. 1 root root    21049 Jul 20 07:04 out.10685
-rw-r--r--. 1 root root   139145 Jul 20 07:04 out.10686
-rw-r--r--. 1 root root    21470 Jul 20 07:06 out.10687
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10688
-rw-r--r--. 1 root root      906 Jul 20 07:04 out.10689
-rw-r--r--. 1 root root      985 Jul 20 07:04 out.10690
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10691
-rw-r--r--. 1 root root      906 Jul 20 07:04 out.10692
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10693
-rw-r--r--. 1 root root   388112 Jul 20 07:06 out.10694
-rw-r--r--. 1 root root     1433 Jul 20 07:04 SocketServer.class
-rw-r--r--. 1 root root     1626 Jul 20 07:03 SocketServer.java
-rw-r--r--. 1 root root     1297 Jul 20 07:04 SocketServer$MyThread.class
[root@localhost jdk6]# ll | grep out | wc -l
10

1. Find the out file that contains the socket calls

[root@localhost jdk6]# grep socket out.*
out.10686:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
out.10686:connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
out.10686:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
out.10686:connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
out.10686:socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
out.10686:getsockname(0, 0x7f64c9083350, [28])    = -1 ENOTSOCK (Socket operation on non-socket)
out.10686:socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
out.10686:socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 5
out.10686:socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4

We can see that out.10686 contains the socket setup calls, so let's analyze that file.

(1) Jump to the end of the file:

 We can see the accept call is blocked and has not returned.

(2) Trace back further and look at how the socket is created, bound and put into listening state.

 The main steps of starting a SocketServer are:

socket => 4 (file descriptor)
bind(4, 8088)
listen(4)

accept(4,    <- blocked here, no return

The accept man page is shown below (it accepts one connection on a socket; when a connection arrives it returns a non-negative integer, the new fd):

[root@localhost jdk6]# man 2 accept
ACCEPT(2)                                                    Linux Programmer's Manual                                                    ACCEPT(2)

NAME
       accept, accept4 - accept a connection on a socket

SYNOPSIS
       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <sys/socket.h>

       int accept4(int sockfd, struct sockaddr *addr,
                   socklen_t *addrlen, int flags);

DESCRIPTION
       The accept() system call is used with connection-based socket types (SOCK_STREAM, SOCK_SEQPACKET).  It extracts the first connection request
       on the queue of pending connections for the listening socket, sockfd, creates a new connected socket, and  returns  a  new  file  descriptor
       referring to that socket.  The newly created socket is not in the listening state.  The original socket sockfd is unaffected by this call.

       The  argument  sockfd  is a socket that has been created with socket(2), bound to a local address with bind(2), and is listening for connec‐
       tions after a listen(2)
RETURN VALUE
       On success, these system calls return a nonnegative integer that is a descriptor for the accepted socket.  On error,  -1  is  returned,  and
       errno is set appropriately.

2. Use nc to simulate a client connection

[root@localhost jdk6]# nc localhost 8088

(1) On the server side one more out file appears

[root@localhost jdk6]# ll
total 72712
drwxr-xr-x. 9   10  143      204 Jul 23  2008 jdk1.6.0_06
-rw-r--r--. 1 root root 64885867 Jul 20 03:33 jdk-6u6-p-linux-x64.tar.gz
-rw-r--r--. 1 root root    21049 Jul 20 07:04 out.10685
-rw-r--r--. 1 root root   141155 Jul 20 07:32 out.10686
-rw-r--r--. 1 root root   369445 Jul 20 07:33 out.10687
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10688
-rw-r--r--. 1 root root      906 Jul 20 07:04 out.10689
-rw-r--r--. 1 root root      985 Jul 20 07:04 out.10690
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10691
-rw-r--r--. 1 root root      906 Jul 20 07:04 out.10692
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10693
-rw-r--r--. 1 root root  7103157 Jul 20 07:33 out.10694
-rw-r--r--. 1 root root     1266 Jul 20 07:32 out.10866
-rw-r--r--. 1 root root     1433 Jul 20 07:04 SocketServer.class
-rw-r--r--. 1 root root     1626 Jul 20 07:03 SocketServer.java
-rw-r--r--. 1 root root     1297 Jul 20 07:04 SocketServer$MyThread.class
[root@localhost jdk6]# ll | grep out | wc -l
11

(2) Look at the accept call in out.10686 again (after the connection is accepted it returns file descriptor 6; that fd is then used by recvfrom to read data)

 We can also see a clone call: that is the worker thread that handles this connection being created, i.e. at the kernel level threads are created with clone. Linux has no dedicated thread structure in the kernel; a "thread" is essentially a process created by clone that shares its address space with the parent, so threads are simulated by processes sharing one address space.

The clone man page is shown below (it is similar to fork, which creates a child process):

man 2 clone

CLONE(2)                                                     Linux Programmer's Manual                                                     CLONE(2)

NAME
       clone, __clone2 - create a child process

SYNOPSIS
       /* Prototype for the glibc wrapper function */

       #include <sched.h>

       int clone(int (*fn)(void *), void *child_stack,
                 int flags, void *arg, ...
                 /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

       /* Prototype for the raw system call */

       long clone(unsigned long flags, void *child_stack,
                 void *ptid, void *ctid,
                 struct pt_regs *regs);

   Feature Test Macro Requirements for glibc wrapper function (see feature_test_macros(7)):

       clone():
           Since glibc 2.14:
               _GNU_SOURCE
           Before glibc 2.14:
               _BSD_SOURCE || _SVID_SOURCE
                   /* _GNU_SOURCE also suffices */

DESCRIPTION
       clone() creates a new process, in a manner similar to fork(2).

(3) Look at out.10866:

 The new thread is blocked in recvfrom. The recvfrom man page:

man 2 recvfrom

RECV(2)                                                      Linux Programmer's Manual                                                      RECV(2)

NAME
       recv, recvfrom, recvmsg - receive a message from a socket

SYNOPSIS
       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t recv(int sockfd, void *buf, size_t len, int flags);

       ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags,
                        struct sockaddr *src_addr, socklen_t *addrlen);

       ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);

DESCRIPTION
       The recvfrom() and recvmsg() calls are used to receive messages from a socket, and may be used to receive data on a socket whether or not it
       is connection-oriented.

       If src_addr is not NULL, and the underlying protocol provides the source address, this source address is filled in.  When src_addr is  NULL,
       nothing  is  filled  in; in this case, addrlen is not used, and should also be NULL.  The argument addrlen is a value-result argument, which
       the caller should initialize before the call to the size of the buffer associated with src_addr, and modified  on  return  to  indicate  the
       actual size of the source address.  The returned address is truncated if the buffer provided is too small; in this case, addrlen will return
       a value greater than was supplied to the call.

       The recv() call is normally used only on a connected socket (see connect(2)) and is identical to recvfrom() with a NULL src_addr argument.
...

RETURN VALUE
       These  calls return the number of bytes received, or -1 if an error occurred.  In the event of an error, errno is set to indicate the error.
       The return value will be 0 when the peer has performed an orderly shutdown.

  So recvfrom reads data from the socket connection and blocks until data arrives.

(4) Send a message from the connected nc client

[root@localhost jdk6]# nc localhost 8088
test

1》The main console prints:

[root@localhost jdk6]# strace -ff -o out ./jdk1.6.0_06/bin/java SocketServer
serverSocket 8088 start
socket.getInetAddress(): /0:0:0:0:0:0:0:1
9 收到的消息    test

2》Look at what out.10866 records

   After the message is received, the thread calls recvfrom again and blocks.

Summary of the problems with BIO:

1. One thread per connection, which costs thread memory and CPU scheduling overhead.

2. The root cause is blocking: the accept and recvfrom kernel calls block. The fix is the NONBLOCKING option that the kernel provides.

3. The socket man page shows a SOCK_NONBLOCK flag for setting non-blocking mode (when nothing is available the call returns -1, which surfaces as null in Java):

SOCK_NONBLOCK   Set the O_NONBLOCK file status flag on the new open file description.  Using this flag saves  extra  calls  to  fcntl(2)  to
                       achieve the same result.

3. NIO test

In Java, NIO stands for "New I/O"; at the operating-system level the corresponding notion is non-blocking I/O. The tests below are based on JDK 8.

The code:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.LinkedList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class NIOSocket {


    private static final ExecutorService executorService = Executors.newFixedThreadPool(5);

    public static void main(String[] args) throws Exception {
        LinkedList<SocketChannel> clients = new LinkedList<>();
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        serverSocketChannel.bind(new InetSocketAddress(8088));
        serverSocketChannel.configureBlocking(false); // corresponds to NONBLOCKING at the OS level

        while (true) {
            Thread.sleep(500);
            /**
             *  accept invokes the kernel's accept call.
             *  With BIO it blocks until a client connects and then returns that client's fd.
             *  With NONBLOCKING it always returns immediately; if there is no connection the return value is -1.
             **/
            SocketChannel client = serverSocketChannel.accept(); // does not block: the OS returns -1, which Java maps to null
            if (client != null) {
                client.configureBlocking(false);
                int port = client.socket().getPort();
                System.out.println("client.socket().getPort(): " + port);
                clients.add(client);
            }

            ByteBuffer buffer = ByteBuffer.allocate(4096); // allocate a buffer, either on-heap or off-heap (direct memory)
            // iterate over all clients and try to read from each one
            for (SocketChannel c : clients) {
                int read = c.read(buffer); // returns >0, 0 or -1; never blocks
                if (read > 0) {
                    buffer.flip();
                    byte[] bytes = new byte[buffer.limit()];
                    buffer.get(bytes);
                    String string = new String(bytes);
                    System.out.println("client.socket().getPort(): " + c.socket().getPort() + " 收到的消息: " + string);
                    buffer.clear();
                }
            }
        }
    }
}

0. The NIO startup sequence looks like this:

socket => 4 (file descriptor)
bind(4, 8088)
listen(4)
4.nonblocking (fd 4 is switched to non-blocking mode via a system call)

accept(4, xxx) => -1 or 6

  Because the fd is non-blocking, the kernel call does not block: it returns a file descriptor when a connection is pending, or -1 when there is none, which the application sees as null or -1.

1. Compile with JDK 8

2. Trace it with strace

[root@localhost jdk8]# strace -ff -o out ./jdk1.8.0_291/bin/java NIOSocket

3. Look at the generated out files

[root@localhost jdk8]# ll 
total 143724
drwxr-xr-x. 8 10143 10143       273 Apr  7 15:14 jdk1.8.0_291
-rw-r--r--. 1 root  root  144616467 Jul 20 03:42 jdk-8u291-linux-i586.tar.gz
-rw-r--r--. 1 root  root       2358 Jul 20 08:18 NIOSocket.class
-rw-r--r--. 1 root  root       2286 Jul 20 08:18 NIOSocket.java
-rw-r--r--. 1 root  root      12822 Jul 20 08:20 out.11117
-rw-r--r--. 1 root  root    1489453 Jul 20 08:20 out.11118
-rw-r--r--. 1 root  root      10315 Jul 20 08:20 out.11119
-rw-r--r--. 1 root  root       1445 Jul 20 08:20 out.11120
-rw-r--r--. 1 root  root       1424 Jul 20 08:20 out.11121
-rw-r--r--. 1 root  root        884 Jul 20 08:20 out.11122
-rw-r--r--. 1 root  root      11113 Jul 20 08:20 out.11123
-rw-r--r--. 1 root  root        884 Jul 20 08:20 out.11124
-rw-r--r--. 1 root  root     269113 Jul 20 08:20 out.11125
[root@localhost jdk8]# ll | grep out | wc -l
9

Every server socket must go through the socket, bind, listen, accept sequence above; let's check each step.

(1) socket

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4

(2) bind and listen

bind(4, {sa_family=AF_INET6, sin6_port=htons(8088), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(4, 50)

(3) Look at accept (it runs in non-blocking mode and returns -1 while no connection is pending)

 4. Establish a connection with nc

nc localhost 8088

5. Look at out.11118

There is now one accept call whose return value is not -1:

accept(4, {sa_family=AF_INET6, sin6_port=htons(59238), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 5

 6. Send the message HELLO over the established connection

7. The main thread prints the message

[root@localhost jdk8]# strace -ff -o out ./jdk1.8.0_291/bin/java NIOSocket
client.socket().getPort(): 59238
client.socket().getPort(): 59238 收到的消息: HELLO

8.  Look at what out.11118 reads: the read system call is used, and it returns the message that was sent.

Pros and cons of this NIO model:

Pros: it avoids the problems of one-thread-per-connection.

Cons: suppose there are 10,000 connections and only one of them sends data; every loop iteration still issues 10,000 read system calls to the kernel, 9,999 of which are pointless and waste time and resources (the user-space loop crosses into kernel space on every call, so the cost is dominated by system calls).

Solution: the kernel evolved further and introduced multiplexers: select, poll, epoll.

 Note: the code above sets non-blocking mode; the default is blocking. If we drop the configureBlocking(false) calls, the result looks like this:

1. Code:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.LinkedList;

public class NIOSocket {

    public static void main(String[] args) throws Exception {
        LinkedList<SocketChannel> clients = new LinkedList<>();
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        serverSocketChannel.bind(new InetSocketAddress(8088));
       //  serverSocketChannel.configureBlocking(false); // corresponds to NONBLOCKING at the OS level (now commented out)

        while (true) {
            Thread.sleep(500);
            /**
             *  accept invokes the kernel's accept call.
             *  With BIO it blocks until a client connects and then returns that client's fd.
             *  With NONBLOCKING it always returns immediately; if there is no connection the return value is -1.
             **/
            SocketChannel client = serverSocketChannel.accept(); // blocks now, because the channel stays in blocking mode
            if (client != null) {
          //      client.configureBlocking(false);
                int port = client.socket().getPort();
                System.out.println("client.socket().getPort(): " + port);
                clients.add(client);
            }

            ByteBuffer buffer = ByteBuffer.allocate(4096); // allocate a buffer, either on-heap or off-heap (direct memory)
            // iterate over all clients and try to read from each one
            for (SocketChannel c : clients) {
                int read = c.read(buffer); // blocks as well, since the accepted channel is left in blocking mode
                if (read > 0) {
                    buffer.flip();
                    byte[] bytes = new byte[buffer.limit()];
                    buffer.get(bytes);
                    String string = new String(bytes);
                    System.out.println("client.socket().getPort(): " + c.socket().getPort() + " 收到的消息: " + string);
                    buffer.clear();
                }
            }
        }
    }
}

2. Check the blocking behaviour with strace:

(1) Where accept blocks

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0

bind(4, {sa_family=AF_INET6, sin6_port=htons(8088), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0

listen(4, 50) 

。。。

accept(4, 

(2) After connecting with nc, read blocks in the same way

4. Introducing a multiplexer

The NIO model above has a drawback: suppose there are 10,000 connections and only one of them sends data; every loop iteration still issues 10,000 read system calls, 9,999 of which are pointless and waste time and resources (the user-space loop crosses into kernel space on every call, so the cost is dominated by system calls).

Solution: the kernel evolved further and introduced multiplexers: select, poll, epoll.

socket => 4 (file descriptor)
bind(4, 8088)
listen(4)
4.nonblocking (fd 4 is switched to non-blocking mode via a system call)

while(true) {
    select(fd)  // one system call returns the ready fds; select's limit is 1024 fds
    read(fd)    // read the data from the fds that are ready
}

As long as the application itself performs the read, the model is synchronous I/O, whether it is BIO, NIO or multiplexing: the multiplexer only reports the state of the fds, it does not hand over the data. The user program still has to call into the kernel to copy the data from kernel space into its own memory. (Windows IOCP is different: kernel threads copy the data into user space, which makes it asynchronous I/O.)

The select / poll multiplexers

Advantage: a single system call passes the whole set of fds to the kernel, and the kernel does the traversal; this reduces the number of system calls.

Drawbacks:

1. The fd set is passed to the kernel again on every call. Fix: let the kernel keep its own copy of the fds.

2. On every select/poll call the kernel still scans the full fd set. Fix: use what computer architecture gives us, interrupts plus callbacks, to be notified instead of scanning.

3. select supports too few file descriptors: 1024 by default.

  This is why epoll was created.

5. Understanding epoll

Advantages:

1. No limit on the number of fds (poll already removed this limit).

2. The bitmap array is replaced by a new structure that can store multiple event types.

3. fds do not have to be copied in repeatedly: add them when needed, remove them when done.

4. Event-driven notification avoids polling for readable/writable events.

The epoll man page on Linux:

man epoll

NAME
       epoll - I/O event notification facility

SYNOPSIS
       #include <sys/epoll.h>

DESCRIPTION
       The  epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them.  The epoll
       API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers  of  watched  file  descriptors.
       The following system calls are provided to create and manage an epoll instance:

       *  epoll_create(2)  creates  an  epoll instance and returns a file descriptor referring to that instance.  (The more recent epoll_create1(2)
          extends the functionality of epoll_create(2).)

       *  Interest in particular file descriptors is then registered via epoll_ctl(2).  The set of file  descriptors  currently  registered  on  an
          epoll instance is sometimes called an epoll set.

       *  epoll_wait(2) waits for I/O events, blocking the calling thread if no events are currently available.

 So epoll itself consists of three system calls: epoll_create, epoll_ctl and epoll_wait.

epoll provides three functions: epoll_create creates an epoll handle, i.e. an epoll instance, and initializes its data structures; epoll_ctl registers the event types to monitor; epoll_wait waits for events to occur.

For the first drawback of select/poll, the answer lies in epoll_ctl. Each time a new event is registered on the epoll handle (EPOLL_CTL_ADD), the fd is copied into the kernel once, instead of being copied again on every epoll_wait. epoll guarantees that each fd is copied only once over its lifetime.

For the second drawback, epoll does not add the current process to every fd's device wait queue on every call the way select/poll do. It hooks the current process once during epoll_ctl (this one time is unavoidable) and registers a callback for each fd; when a device becomes ready and wakes up the waiters on its queue, the callback adds the ready fd to a ready list. epoll_wait then only has to check whether this ready list has anything in it (sleeping briefly and re-checking via schedule_timeout(), similar to step 7 of the select implementation).

For the third drawback, epoll has no fixed limit: the upper bound on fds is the maximum number of open files, which is usually far larger than 2048. On a machine with 1 GB of memory it is roughly 100,000; the exact number can be read with cat /proc/sys/fs/file-max and depends mostly on system memory.

The corresponding section-2 system call man pages:

(1) epoll_create: creates an epoll instance and initializes its internal data structures

EPOLL_CREATE(2)                                              Linux Programmer's Manual                                              EPOLL_CREATE(2)

NAME
       epoll_create, epoll_create1 - open an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_create(int size);
       int epoll_create1(int flags);

DESCRIPTION
       epoll_create()  creates  an  epoll(7)  instance.   Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see NOTES
       below.

       epoll_create() returns a file descriptor referring to the new epoll instance.  This file descriptor is used for all the subsequent calls  to
       the  epoll interface.  When no longer required, the file descriptor returned by epoll_create() should be closed by using close(2).  When all
       file descriptors referring to an epoll instance have been closed, the kernel destroys the instance and releases the associated resources for
       reuse.

   epoll_create1()
       If  flags  is  0,  then, other than the fact that the obsolete size argument is dropped, epoll_create1() is the same as epoll_create().  The
       following value can be included in flags to obtain different behavior:

       EPOLL_CLOEXEC
              Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor.  See the description of the O_CLOEXEC flag in open(2) for reasons
              why this may be useful.

RETURN VALUE
       On success, these system calls return a nonnegative file descriptor.  On error, -1 is returned, and errno is set to indicate the error.

 (2) epoll_ctl: adds/removes an fd to/from the epfd returned by epoll_create

EPOLL_CTL(2)                                                 Linux Programmer's Manual                                                 EPOLL_CTL(2)

NAME
       epoll_ctl - control interface for an epoll descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

DESCRIPTION
       This  system call performs control operations on the epoll(7) instance referred to by the file descriptor epfd.  It requests that the opera‐
       tion op be performed for the target file descriptor, fd.

       Valid values for the op argument are :

       EPOLL_CTL_ADD
              Register the target file descriptor fd on the epoll instance referred to by the file descriptor epfd and associate  the  event  event
              with the internal file linked to fd.

       EPOLL_CTL_MOD
              Change the event event associated with the target file descriptor fd.

       EPOLL_CTL_DEL
              Remove  (deregister) the target file descriptor fd from the epoll instance referred to by epfd.  The event is ignored and can be NULL
              (but see BUGS below).

       The event argument describes the object linked to the file descriptor fd.  The struct epoll_event is defined as :

           typedef union epoll_data {
               void        *ptr;
               int          fd;
               uint32_t     u32;
               uint64_t     u64;
           } epoll_data_t;

           struct epoll_event {
               uint32_t     events;      /* Epoll events */
               epoll_data_t data;        /* User data variable */
           };

       The events member is a bit set composed using the following available event types:

       EPOLLIN
              The associated file is available for read(2) operations.

       EPOLLOUT
              The associated file is available for write(2) operations.

       EPOLLRDHUP (since Linux 2.6.17)
              Stream socket peer closed connection, or shut down writing half of connection.  (This flag is especially useful  for  writing  simple
              code to detect peer shutdown when using Edge Triggered monitoring.)

       EPOLLPRI
              There is urgent data available for read(2) operations.

       EPOLLERR
              Error  condition  happened  on the associated file descriptor.  epoll_wait(2) will always wait for this event; it is not necessary to
              set it in events.

       EPOLLHUP
              Hang up happened on the associated file descriptor.  epoll_wait(2) will always wait for this event; it is not necessary to set it  in
              events.

       EPOLLET
              Sets  the  Edge  Triggered  behavior  for  the  associated  file descriptor.  The default behavior for epoll is Level Triggered.  See
              epoll(7) for more detailed information about Edge and Level Triggered event distribution architectures.

       EPOLLONESHOT (since Linux 2.6.2)
              Sets the one-shot behavior for the associated file descriptor.  This means that after an event is pulled out with  epoll_wait(2)  the
              associated  file  descriptor  is internally disabled and no other events will be reported by the epoll interface.  The user must call
              epoll_ctl() with EPOLL_CTL_MOD to rearm the file descriptor with a new event mask.

RETURN VALUE
       When successful, epoll_ctl() returns zero.  When an error occurs, epoll_ctl() returns -1 and errno is set appropriately.

(3) epoll_wait: blocks waiting for readable/writable events from the kernel; epfd is the value returned by epoll_create, and events is a pointer to an array of epoll_event structures into which the kernel stores the events that need handling

EPOLL_WAIT(2)                                                Linux Programmer's Manual                                                EPOLL_WAIT(2)

NAME
       epoll_wait, epoll_pwait - wait for an I/O event on an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_wait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout);
       int epoll_pwait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout,
                      const sigset_t *sigmask);

DESCRIPTION
       The  epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd.  The memory area pointed to
       by events will contain the events that will be available for the caller.  Up to maxevents are returned by epoll_wait().  The maxevents argu‐
       ment must be greater than zero.

       The  timeout  argument  specifies the minimum number of milliseconds that epoll_wait() will block.  (This interval will be rounded up to the
       system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.)  Specifying a timeout
       of  -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero cause epoll_wait() to return immediately, even if
       no events are available.

       The struct epoll_event is defined as :

           typedef union epoll_data {
               void    *ptr;
               int      fd;
               uint32_t u32;
               uint64_t u64;
           } epoll_data_t;

           struct epoll_event {
               uint32_t     events;    /* Epoll events */
               epoll_data_t data;      /* User data variable */
           };

       The data of each returned structure will contain the same data the user set with an  epoll_ctl(2)  (EPOLL_CTL_ADD,EPOLL_CTL_MOD)  while  the
       events member will contain the returned event bit field.

   epoll_pwait()
       The  relationship between epoll_wait() and epoll_pwait() is analogous to the relationship between select(2) and pselect(2): like pselect(2),
       epoll_pwait() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.

       The following epoll_pwait() call:

           ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask);

       is equivalent to atomically executing the following calls:

           sigset_t origmask;

           sigprocmask(SIG_SETMASK, &sigmask, &origmask);
           ready = epoll_wait(epfd, &events, maxevents, timeout);
           sigprocmask(SIG_SETMASK, &origmask, NULL);

       The sigmask argument may be specified as NULL, in which case epoll_pwait() is equivalent to epoll_wait().

RETURN VALUE
       When successful, epoll_wait() returns the number of file descriptors ready for the requested I/O, or zero if no file descriptor became ready
       during the requested timeout milliseconds.  When an error occurs, epoll_wait() returns -1 and errno is set appropriately.

  Above you can see the event structure that epoll defines.

 1.  The official epoll demo (from the man page)

       #define MAX_EVENTS 10
           struct epoll_event ev, events[MAX_EVENTS];
           int listen_sock, conn_sock, nfds, epollfd;

           /* Set up listening socket, 'listen_sock' (socket(),
              bind(), listen()) */

           epollfd = epoll_create(10);
           if (epollfd == -1) {
               perror("epoll_create");
               exit(EXIT_FAILURE);
           }

           ev.events = EPOLLIN;
           ev.data.fd = listen_sock;
           if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
               perror("epoll_ctl: listen_sock");
               exit(EXIT_FAILURE);
           }

           for (;;) {
               nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
               if (nfds == -1) {
                   perror("epoll_pwait");
                   exit(EXIT_FAILURE);
               }

               for (n = 0; n < nfds; ++n) {
                   if (events[n].data.fd == listen_sock) {
                       conn_sock = accept(listen_sock,
                                       (struct sockaddr *) &local, &addrlen);
                       if (conn_sock == -1) {
                           perror("accept");
                           exit(EXIT_FAILURE);
                       }
                       setnonblocking(conn_sock);
                       ev.events = EPOLLIN | EPOLLET;
                       ev.data.fd = conn_sock;
                       if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                                   &ev) == -1) {
                           perror("epoll_ctl: conn_sock");
                           exit(EXIT_FAILURE);
                       }
                   } else {
                       do_use_fd(events[n].data.fd);
                   }
               }
           }

2. Event trigger modes

"The epoll event distribution interface is able to behave both as edge-triggered (ET) and as level-triggered (LT)." epoll can deliver events in two modes: edge-triggered and level-triggered.

1. Level-triggered (LT): the default mode

As long as the read buffer associated with the fd is non-empty, i.e. there is data to read, a readable event keeps being reported.

As long as the write buffer associated with the fd is not full, i.e. there is room to write, a writable event keeps being reported.

2. Edge-triggered (ET):

A readable event is reported only when the fd's kernel read buffer changes from empty to non-empty.

A writable event is reported only when the fd's kernel write buffer changes from full to not-full.

The difference between the two:

Level trigger keeps firing the readable event as long as the read buffer has data, while edge trigger notifies only once, at the moment the buffer goes from empty to non-empty. For example:

1. The read buffer starts out empty

2. 2 KB of data is written into the read buffer

3. Both LT and ET report a readable event

4. The handler reads 1 KB, leaving 1 KB in the buffer

5. LT fires the readable event again, ET does not

So with edge triggering you must drain the buffer in one go: keep reading until read returns EAGAIN (EAGAIN means the buffer is now empty). For that reason the fd must be set to non-blocking when edge triggering is used.
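
In Java NIO terms, "read until EAGAIN" corresponds to reading a non-blocking SocketChannel until read() returns 0 (nothing left for now) or -1 (peer closed). A minimal sketch of such a drain loop, assuming a non-blocking channel (the class and method names here are just for illustration):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class DrainReader {

    // Reads everything currently available from a non-blocking channel.
    // Returns the total number of bytes consumed.
    public static int drain(SocketChannel channel, ByteBuffer buffer) throws IOException {
        int total = 0;
        while (true) {
            buffer.clear();
            int n = channel.read(buffer);
            if (n > 0) {
                total += n;
                buffer.flip();
                // ... hand the buffer over to the application here ...
            } else if (n == 0) {
                return total;    // would block: the kernel buffer is empty, wait for the next readable event
            } else {
                channel.close(); // n == -1: the peer performed an orderly shutdown
                return total;
            }
        }
    }
}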

A common interview question: with Linux epoll in LT (level-triggered) mode, a writable socket keeps firing writable events. How do you handle that?

The straightforward approach:

  When you need to write to the socket, add it to epoll and wait for the writable event. When the writable event arrives, call write or send; once all the data has been written, remove the socket from epoll. This requires adding and removing the fd over and over.

The improved approach:

  When you need to write, call send directly. Only when send returns EAGAIN do you add the socket to epoll and wait for the writable event before sending the rest; once everything has been sent, remove it from epoll again. The improved approach assumes the socket is writable most of the time and only asks epoll for help when it is not.
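
Translated into Java NIO terms, the improved approach looks roughly like the sketch below (my own illustration; WriteHelper, send and onWritable are made-up names): write directly first, register OP_WRITE only when the kernel send buffer is full, and deregister it once the pending data has been flushed.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

public class WriteHelper {

    // Try to write immediately; only watch OP_WRITE when the socket buffer is full.
    public static void send(SelectionKey key, ByteBuffer data) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        channel.write(data);
        if (data.hasRemaining()) {
            // kernel send buffer is full: ask the selector to tell us when it is writable again
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
        }
    }

    // Call this when the selector reports the channel writable (key.isWritable()).
    public static void onWritable(SelectionKey key, ByteBuffer data) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        channel.write(data);
        if (!data.hasRemaining()) {
            // everything sent: stop watching OP_WRITE so the selector does not keep firing
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
        }
    }
}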

3. The epoll model

  It can be understood roughly as the following flow:

(1) Call epoll_create to create an epoll instance (initializing the related data structures); it returns an fd for the instance.

(2) Call epoll_ctl to register events against that fd, i.e. add an fd plus the events to watch into kernel space (maintained in a red-black tree).

When an event occurs, the kernel moves the corresponding event structure onto a separate ready list.

(3) Call epoll_wait to fetch events from the ready list (each event carries the fd, the event type, and so on).

 4. Test

The code:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.Set;

public class NIOSocket {

    public static void main(String[] args) throws Exception {
        // Create a ServerSocketChannel -> ServerSocket
        // A ServerSocketChannel in Java NIO is a channel that can listen for incoming TCP connections, just like a ServerSocket in classic IO. The class lives in java.nio.channels.
        // A ServerSocketChannel is opened by calling ServerSocketChannel.open():
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        serverSocketChannel.socket().bind(new InetSocketAddress(8088));
        serverSocketChannel.configureBlocking(false);

        // Obtain a Selector object (e.g. sun.nio.ch.WindowsSelectorImpl on Windows, an epoll-based implementation on Linux)
        Selector selector = Selector.open();

        // Register serverSocketChannel with the selector, interested in OP_ACCEPT.
        // SelectionKey defines 4 kinds of events:
        // SelectionKey.OP_ACCEPT  —— accept event: the server has detected an incoming client connection and can accept it
        // SelectionKey.OP_CONNECT —— connect event: the connection between client and server has been established
        // SelectionKey.OP_READ    —— read-ready event: the channel has data available, a read can be performed
        // SelectionKey.OP_WRITE   —— write-ready event: the channel can now be written to
        serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);

        System.out.println("注册后的selectionkey 数量=" + selector.keys().size()); // 1

        // loop, waiting for client connections
        while (true) {
            // wait up to 1 second; if no event occurs, select returns 0
            if (selector.select(1000) == 0) { // no event occurred
//                System.out.println("waited 1 second, no connection");
                continue;
            }

            // If select returned >0, fetch the set of ready SelectionKeys:
            // 1. a return value >0 means events of interest have occurred
            // 2. selector.selectedKeys() returns the set of keys for those events,
            //    and from each key we can get back to its channel
            Set<SelectionKey> selectionKeys = selector.selectedKeys();
            System.out.println("selectionKeys 数量 = " + selectionKeys.size());

            // hasNext(): returns false once the last element has been reached
            // next(): moves the iterator forward and returns the next element
            // remove(): removes from the underlying set the last element returned by the iterator
            // iterate over the Set<SelectionKey> with an iterator
            Iterator<SelectionKey> keyIterator = selectionKeys.iterator();

            while (keyIterator.hasNext()) {
                // get the SelectionKey
                SelectionKey key = keyIterator.next();
                // dispatch on the event that occurred on the key's channel
                if (key.isAcceptable()) { // OP_ACCEPT: a new client connection is pending
                    // accept it, which produces a SocketChannel for this client
                    SocketChannel socketChannel = serverSocketChannel.accept();
                    System.out.println("客户端连接成功 生成了一个 socketChannel " + socketChannel.hashCode());
                    // set the SocketChannel to non-blocking
                    socketChannel.configureBlocking(false);
                    // register the socketChannel with the selector for OP_READ and attach a Buffer to it
                    socketChannel.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(1024));

                    System.out.println("客户端连接后 ,注册的selectionkey 数量=" + selector.keys().size()); //2,3,4..
                }

                if (key.isReadable()) {  // OP_READ occurred
                    // get the channel back from the key
                    SocketChannel channel = (SocketChannel) key.channel();
                    // get the buffer attached to this channel
                    ByteBuffer buffer = (ByteBuffer) key.attachment();
                    int count = channel.read(buffer);
                    if (count == -1) {
                        // the client closed the connection: cancel the key and close the channel,
                        // otherwise the selector would keep reporting it as readable
                        key.cancel();
                        channel.close();
                    } else {
                        System.out.println("from 客户端: " + new String(buffer.array(), 0, buffer.position()));
                    }
                }

                // manually remove the current selectionKey from the set to avoid processing it again
                keyIterator.remove();
            }
        }
    }
}

(1) Start the program under strace

[root@localhost jdk8]# strace -ff -o out ./jdk1.8.0_291/bin/java NIOSocket

(2) Look at the generated out files

[root@localhost jdk8]# ll
total 143780
-rw-r--r--. 1 root  root       1033 Jul 20 23:11 Client.class
-rw-r--r--. 1 root  root        206 Jul 20 23:10 Client.java
drwxr-xr-x. 8 10143 10143       273 Apr  7 15:14 jdk1.8.0_291
-rw-r--r--. 1 root  root  144616467 Jul 20 03:42 jdk-8u291-linux-i586.tar.gz
-rw-r--r--. 1 root  root       2705 Jul 21 05:54 NIOSocket.class
-rw-r--r--. 1 root  root       5004 Jul 21 05:44 NIOSocket.java
-rw-r--r--. 1 root  root      13093 Jul 21 05:54 out.29779
-rw-r--r--. 1 root  root    2305003 Jul 21 05:54 out.29780
-rw-r--r--. 1 root  root      12951 Jul 21 05:54 out.29781
-rw-r--r--. 1 root  root       2101 Jul 21 05:54 out.29782
-rw-r--r--. 1 root  root       1784 Jul 21 05:54 out.29783
-rw-r--r--. 1 root  root       5016 Jul 21 05:54 out.29784
-rw-r--r--. 1 root  root      99615 Jul 21 05:54 out.29785
-rw-r--r--. 1 root  root        914 Jul 21 05:54 out.29786
-rw-r--r--. 1 root  root     119854 Jul 21 05:54 out.29787
-rw-r--r--. 1 root  root       7308 Jul 21 05:54 out.29789

(3) Connect to port 8088 with nc and send the message "hello"

[root@localhost jdk8]# nc localhost 8088
hello

(4) The important lines from out.29780

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
。。。
bind(4, {sa_family=AF_INET6, sin6_port=htons(8088), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(4, 50) 
。。。
epoll_create(256)                       = 7
。。。
epoll_ctl(7, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=5, u64=17757820874070687749}}) = 0
。。。
epoll_ctl(7, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=4, u64=17757820874070687748}}) = 0
gettimeofday({tv_sec=1626861254, tv_usec=513203}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861255, tv_usec=513652}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861256, tv_usec=515602}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861257, tv_usec=518045}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861258, tv_usec=520289}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861259, tv_usec=521552}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0

。。。

accept(4, {sa_family=AF_INET6, sin6_port=htons(59252), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 9
。。。
epoll_ctl(7, EPOLL_CTL_ADD, 9, {EPOLLIN, {u32=9, u64=17757980303256715273}}) = 0
gettimeofday({tv_sec=1626861260, tv_usec=952780}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
。。。
epoll_wait(7, [{EPOLLIN, {u32=9, u64=17757980303256715273}}], 4096, 1000) = 1
write(1, "selectionKeys 346225260351207217 = 1", 24) = 24
write(1, "
", 1)                       = 1
。。。
read(9, "hello
", 1024)                = 6

The overall flow is:

1》create the socket

2》bind the port

3》listen on the port

4》epoll_create(256) = 7 creates the epoll instance

5》register events (the first one is a JDK-internal fd; the second registers the serverSocketChannel's fd with the epfd)

epoll_ctl(7, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=5, u64=17757820874070687749}}) = 0

epoll_ctl(7, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=4, u64=17757820874070687748}}) = 0

6》epoll_wait fetches events

7》a connection event arrives

8》accept returns 9, the fd of the new client socket

9》fd 9 is registered with the epfd for read events

10》epoll_wait returns one event; it is a readable event and its fd is 9

11》read(9, ... reads the data

  This confirms the flow described above: epoll_create -> epoll_ctl -> epoll_wait

5. Test 2

A smaller example to watch the same sequence:

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class NIOSocket {

    public static void main(String[] args) throws Exception {
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        serverSocketChannel.socket().bind(new InetSocketAddress(8088));
        serverSocketChannel.configureBlocking(false);
        System.out.println("serverSocketChannel init 8088");

        Selector selector = Selector.open();
        serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);
        System.out.println("Selector.open() = 8088");

        int select = selector.select(1000);
        System.out.println("select: " + select);
    }
}

strace shows the following (the socket, bind and listen calls are skipped here):

。。。
epoll_create(256)                       = 8
。。。
epoll_ctl(8, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=6, u64=17757820874070687750}}) = 0
。。。
epoll_ctl(8, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=4, u64=17757820874070687748}}) = 0
gettimeofday({tv_sec=1626858133, tv_usec=975699}, NULL) = 0
epoll_wait(8, [], 4096, 1000)           = 0

We can see:

(1) epoll_create creates an epoll instance and returns an fd

(2) epoll_ctl registers an event plus fd with the epfd just returned

(3) epoll_wait fetches the event list of the epfd

6. Test 3

Selector selector = Selector.open();

For the single line of code above, the kernel calls it triggers are:

epoll_create(256)                       = 6
。。。
epoll_ctl(6, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=4, u64=17762324473698058244}}) = 0

Addendum: the differences between select, poll and epoll

(1) select ==> time complexity O(n)

  select only tells you that some I/O event happened, not on which streams (it could be one, several, or all of them), so you have to scan all the streams indiscriminately to find the ones that can be read or written and then operate on them. That gives select O(n) scanning cost, and the more streams it handles the longer the scan takes. Its fd set is also limited to 1024 descriptors.

(2) poll ==> time complexity O(n)

  poll is essentially the same as select: it copies the user-supplied array into kernel space and then queries the state of the device behind each fd. It has no limit on the number of connections, though, because it stores the fds in a list rather than a fixed-size bitmap.

(3) epoll ==> time complexity O(1)

  epoll can be read as "event poll". Unlike busy polling or indiscriminate scanning, epoll tells us which stream had which I/O event. It is genuinely event-driven (each event is associated with an fd), so every operation we perform on those streams is meaningful, and the per-event cost drops to O(1).

  select, poll and epoll are all I/O multiplexing mechanisms: one mechanism monitors many descriptors and notifies the program as soon as a descriptor becomes ready (readable or writable). They are all synchronous I/O, however, because once a descriptor is ready the application itself still has to perform the read or write, and that read/write blocks. Asynchronous I/O is different: the implementation takes care of copying the data from kernel space into user space for you.

  Both epoll and select provide multiplexed I/O and both are supported by current Linux kernels, but epoll is Linux-specific whereas select is specified by POSIX and implemented on essentially every operating system.

  When a Java program uses a Selector, the underlying multiplexer may differ per operating system; on my CentOS 7 machine it is epoll.
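
A quick way to check which multiplexer implementation the JDK picked on a given machine (a small sketch of my own; the class names printed are JDK-internal and may differ between JDK versions):

import java.nio.channels.Selector;
import java.nio.channels.spi.SelectorProvider;

public class WhichSelector {
    public static void main(String[] args) throws Exception {
        // e.g. sun.nio.ch.EPollSelectorProvider on Linux, a WindowsSelectorProvider on Windows
        System.out.println(SelectorProvider.provider());
        try (Selector selector = Selector.open()) {
            System.out.println(selector.getClass().getName()); // e.g. sun.nio.ch.EPollSelectorImpl on Linux
        }
    }
}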

Addendum: man shows the available manual sections; if a page is missing, run yum install -y man-pages to install the full set of man pages.

man 2 <cmd> shows the section-2 (system call) page.

[root@localhost jdk8]# man man

       1   Executable programs or shell commands
       2   System calls (functions provided by the kernel)
       3   Library calls (functions within program libraries)
       4   Special files (usually found in /dev)
       5   File formats and conventions eg /etc/passwd
       6   Games
       7   Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7)
       8   System administration commands (usually only for root)
       9   Kernel routines [Non standard]

Addendum: the C10K problem

  Early servers were built on the process/thread-per-connection model: every new TCP connection required its own process or thread. With 10,000 concurrent connections (C10K) you would need 10,000 of them, which a single machine clearly cannot sustain. How to push past this single-machine limit is a question every high-performance network program has to face, and these limitations are collectively known as the C10K problem.

  Because Linux is the most widely used operating system at internet companies, epoll has become shorthand for "C10K killer", high concurrency, high performance, and asynchronous non-blocking I/O. FreeBSD offers kqueue, Linux offers epoll, Windows offers IOCP, and Solaris offers /dev/poll; all of these OS facilities exist to solve C10K. The programming model around epoll is asynchronous non-blocking callbacks, also known as Reactor, event-driven programming, or an event loop (EventLoop). Nginx, libevent and node.js are all products of the epoll era.

Addendum: observing how redis uses multiplexing

1. Download and install redis

2. Start redis under strace

[root@localhost test]# strace -ff -o redisout ../redis-5.0.4/src/redis-server 
34127:C 21 Jul 2021 21:57:26.281 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
34127:C 21 Jul 2021 21:57:26.281 # Redis version=5.0.4, bits=64, commit=00000000, modified=0, pid=34127, just started
34127:C 21 Jul 2021 21:57:26.282 # Warning: no config file specified, using the default config. In order to specify a config file use ../redis-5.0.4/src/redis-server /path/to/redis.conf
34127:M 21 Jul 2021 21:57:26.284 * Increased maximum number of open files to 10032 (it was originally set to 1024).
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 5.0.4 (00000000/0) 64 bit
  .-`` .-```.  ```/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 34127
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

34127:M 21 Jul 2021 21:57:26.294 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
34127:M 21 Jul 2021 21:57:26.294 # Server initialized
34127:M 21 Jul 2021 21:57:26.294 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
34127:M 21 Jul 2021 21:57:26.296 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
34127:M 21 Jul 2021 21:57:26.296 * Ready to accept connections

3. Look at the out files

[root@localhost test]# ll
total 48
-rw-r--r--. 1 root root 34219 Jul 21 21:57 redisout.34127
-rw-r--r--. 1 root root   134 Jul 21 21:57 redisout.34128
-rw-r--r--. 1 root root   134 Jul 21 21:57 redisout.34129
-rw-r--r--. 1 root root   134 Jul 21 21:57 redisout.34130

4. We know a server must call socket, bind and listen on startup, so search for bind

[root@localhost test]# grep bind ./*
./redisout.34127:bind(6, {sa_family=AF_INET6, sin6_port=htons(6379), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
./redisout.34127:bind(7, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

5. Look at redisout.34127

...
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(7, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(7, 511)                          = 0
...

epoll_create(1024)                      = 5
...
epoll_ctl(5, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=6, u64=6}}) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=3, u64=3}}) = 0
...
epoll_wait(5, [], 10128, 0)             = 0
open("/proc/34127/stat", O_RDONLY)      = 8
read(8, "34127 (redis-server) R 34125 341"..., 4096) = 341
close(8)                                = 0
read(3, 0x7ffd0d4c055f, 1)              = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5, [], 10128, 100)           = 0
open("/proc/34127/stat", O_RDONLY)      = 8
read(8, "34127 (redis-server) R 34125 341"..., 4096) = 341
close(8)                                = 0
read(3, 0x7ffd0d4c055f, 1)              = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5, [], 10128, 100)           = 0
...

6. Start a client and set a key

[root@localhost test]# ../redis-5.0.4/src/redis-cli 
127.0.0.1:6379> set testkey testvalue
OK

7. Look at the 34127 file again

。。。
accept(7, {sa_family=AF_INET, sin_port=htons(48084), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 8
。。。
epoll_ctl(5, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=8}}) = 0
。。。
epoll_wait(5, [{EPOLLIN, {u32=8, u64=8}}], 10128, 6) = 1
read(8, "*1
$7
COMMAND
", 16384) = 17
。。。
read(8, "*3
$3
set
$7
testkey
$9
te"..., 16384) = 41
read(3, 0x7ffd0d4c055f, 1)              = -1 EAGAIN (Resource temporarily unavailable)
write(8, "+OK
", 5)                  = 5
。。。

We can see that the data read from the client and the data written back to it follow the redis protocol (RESP) for sending requests and parsing responses.
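
As a small illustration of that protocol, the sketch below hand-writes the same kind of RESP frame that redis-cli sends for GET testkey and prints the raw reply. It is my own example, assuming a local redis listening on 127.0.0.1:6379:

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RespGetDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("127.0.0.1", 6379)) {
            OutputStream out = socket.getOutputStream();
            // RESP array of 2 bulk strings: GET testkey
            out.write("*2\r\n$3\r\nGET\r\n$7\r\ntestkey\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            byte[] buf = new byte[1024];
            int n = socket.getInputStream().read(buf);  // e.g. "$9\r\ntestvalue\r\n"
            System.out.println(new String(buf, 0, n, StandardCharsets.US_ASCII));
        }
    }
}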

8. Send a get request from the client

127.0.0.1:6379> get testkey
"testvalue"

9. Look at the out file

。。。
epoll_wait(5, [{EPOLLIN, {u32=8, u64=8}}], 10128, 100) = 1
read(8, "*2
$3
get
$7
testkey
", 16384) = 26
read(3, 0x7ffd0d4c055f, 1)              = -1 EAGAIN (Resource temporarily unavailable)
write(8, "$9
testvalue
", 15)     = 15
。。。

10. The out files also show that redis started 4 threads (one out file per thread); you can confirm this with top as well

(1) Find the PID

[root@localhost test]# netstat -nltp | grep 6379
tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      34127/../redis-5.0. 
tcp6       0      0 :::6379                 :::*                    LISTEN      34127/../redis-5.0. 

(2) Look at the thread info

[root@localhost test]# top -Hp 34127

  "redis is single-threaded" means that accepting requests, looking up data and writing responses, i.e. the core operations, all run in one thread; the other threads handle things like AOF and deleting expired keys.

For the redis request/response protocol, see https://www.cnblogs.com/qlqwjy/p/8560052.html

 Addendum: observing nginx's single-threaded multiplexing (the epoll sequence)

1. Start nginx under strace

[root@localhost sbin]# strace -ff -o out ./nginx

2. Look at the generated out files

[root@localhost sbin]# ll
total 3796
-rwxr-xr-x. 1 root root 3851552 Jul 22 01:02 nginx
-rw-r--r--. 1 root root   20027 Jul 22 03:56 out.47227
-rw-r--r--. 1 root root    1100 Jul 22 03:56 out.47228
-rw-r--r--. 1 root root    5512 Jul 22 03:56 out.47229

  Three out files are generated.

3. Check the related processes with ps

[root@localhost sbin]# ps -ef | grep nginx | grep -v 'grep'
root      47225  38323  0 03:56 pts/1    00:00:00 strace -ff -o out ./nginx
root      47228      1  0 03:56 ?        00:00:00 nginx: master process ./nginx
nobody    47229  47228  0 03:56 ?        00:00:00 nginx: worker process

  There is one master process and one worker process. The master is responsible for restarts, config syntax checks and similar tasks; the worker handles the actual requests.

4. Look at the out files

(1) The master process file, out.47228

set_robust_list(0x7fec5129da20, 24)     = 0
setsid()                                = 47228
umask(000)                              = 022
open("/dev/null", O_RDWR)               = 7
dup2(7, 0)                              = 0
dup2(7, 1)                              = 1
close(7)                                = 0
open("/usr/local/nginx/logs/nginx.pid", O_RDWR|O_CREAT|O_TRUNC, 0644) = 7
pwrite64(7, "47228\n", 6, 0)            = 6
close(7)                                = 0
dup2(5, 2)                              = 2
close(3)                                = 0
rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 ALRM TERM CHLD WINCH IO], NULL, 8) = 0
socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 7]) = 0
ioctl(3, FIONBIO, [1])                  = 0
ioctl(7, FIONBIO, [1])                  = 0
ioctl(3, FIOASYNC, [1])                 = 0
fcntl(3, F_SETOWN, 47228)               = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
fcntl(7, F_SETFD, FD_CLOEXEC)           = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fec5129da10) = 47229
rt_sigsuspend([], 8

  The master issues no epoll-related calls; it is mainly responsible for handling signals, hot reload/hot deployment, and watching the worker's status. At the end you can see it create the worker child process 47229 via clone.

(2) The worker process file, out.47229

。。。
epoll_create(512)                       = 8
eventfd2(0, 0)                          = 9
epoll_ctl(8, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLET, {u32=7088384, u64=7088384}}) = 0
socketpair(AF_UNIX, SOCK_STREAM, 0, [10, 11]) = 0
epoll_ctl(8, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLRDHUP|EPOLLET, {u32=7088384, u64=7088384}}) = 0
close(11)                               = 0
epoll_wait(8, [{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=7088384, u64=7088384}}], 1, 5000) = 1
close(10)                               = 0
mmap(NULL, 225280, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fec51266000
brk(NULL)                               = 0x20ba000
brk(0x20f1000)                          = 0x20f1000
epoll_ctl(8, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLRDHUP, {u32=1361469456, u64=140652950478864}}) = 0
close(3)                                = 0
epoll_ctl(8, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLRDHUP, {u32=1361469672, u64=140652950479080}}) = 0
epoll_wait(8, 
。。。

(3) Test access with curl

[root@localhost test3]# curl http://localhost:80
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

(4) Look at out.47229 again

epoll_ctl(8, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLRDHUP, {u32=1361469672, u64=140652950479080}}) = 0
epoll_wait(8, [{EPOLLIN, {u32=1361469456, u64=140652950478864}}], 512, -1) = 1
accept4(6, {sa_family=AF_INET, sin_port=htons(40704), sin_addr=inet_addr("127.0.0.1")}, [112->16], SOCK_NONBLOCK) = 3
epoll_ctl(8, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLRDHUP|EPOLLET, {u32=1361469888, u64=140652950479296}}) = 0
epoll_wait(8, [{EPOLLIN, {u32=1361469888, u64=140652950479296}}], 512, 60000) = 1
recvfrom(3, "GET / HTTP/1.1\r\nUser-Agent: curl"..., 1024, 0, NULL, NULL) = 73
stat("/usr/local/nginx/html/index.html", {st_mode=S_IFREG|0644, st_size=612, ...}) = 0
open("/usr/local/nginx/html/index.html", O_RDONLY|O_NONBLOCK) = 10
fstat(10, {st_mode=S_IFREG|0644, st_size=612, ...}) = 0
writev(3, [{iov_base="HTTP/1.1 200 OK\r\nServer: nginx/1"..., iov_len=238}], 1) = 238
sendfile(3, 10, [0] => [612], 612)      = 612
write(4, "127.0.0.1 - - [22/Jul/2021:04:29"..., 86) = 86
close(10)                               = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
epoll_wait(8, [{EPOLLIN|EPOLLRDHUP, {u32=1361469888, u64=140652950479296}}], 512, 65000) = 1
recvfrom(3, "", 1024, 0, NULL, NULL)    = 0
close(3)                                = 0
epoll_wait(8, 

(5) Filter again for the socket-related and epoll-related calls

[root@localhost sbin]# grep socket ./*
Binary file ./nginx matches
./out.47227:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
./out.47227:connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
./out.47227:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
./out.47227:connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
./out.47227:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
./out.47227:connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
./out.47227:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
./out.47227:connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
./out.47227:socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
./out.47228:socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 7]) = 0
./out.47229:socketpair(AF_UNIX, SOCK_STREAM, 0, [10, 11]) = 0
[root@localhost sbin]# grep bind ./*
Binary file ./nginx matches
./out.47227:bind(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
[root@localhost sbin]# grep listen ./*
Binary file ./nginx matches
./out.47227:listen(6, 511)                          = 0
./out.47227:listen(6, 511)                          = 0
[root@localhost sbin]# grep epoll_create ./*
Binary file ./nginx matches
./out.47227:epoll_create(100)                       = 5
./out.47229:epoll_create(512)                       = 8
[root@localhost sbin]# grep epoll_ctl ./*
Binary file ./nginx matches
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLET, {u32=7088384, u64=7088384}}) = 0
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLRDHUP|EPOLLET, {u32=7088384, u64=7088384}}) = 0
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLRDHUP, {u32=1361469456, u64=140652950478864}}) = 0
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLRDHUP, {u32=1361469672, u64=140652950479080}}) = 0
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLRDHUP|EPOLLET, {u32=1361469888, u64=140652950479296}}) = 0
[root@localhost sbin]# grep epoll_wait ./*
Binary file ./nginx matches
./out.47229:epoll_wait(8, [{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=7088384, u64=7088384}}], 1, 5000) = 1
./out.47229:epoll_wait(8, [{EPOLLIN, {u32=1361469456, u64=140652950478864}}], 512, -1) = 1
./out.47229:epoll_wait(8, [{EPOLLIN, {u32=1361469888, u64=140652950479296}}], 512, 60000) = 1
./out.47229:epoll_wait(8, [{EPOLLIN|EPOLLRDHUP, {u32=1361469888, u64=140652950479296}}], 512, 65000) = 1
./out.47229:epoll_wait(8, 
Original article: https://www.cnblogs.com/qlqwjy/p/15023277.html