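(Repost) Notes on Linux file descriptors

Preface

I'm a little embarrassed to admit it: I have worked with Linux for a long time and know the idea that "everything in Unix is a file", yet I never properly understood how a file descriptor is actually structured. Then I came across an article on Zhihu about the evolution of the Linux file descriptor, which gave me a much better sense of why it looks the way it does. (I originally wanted to find English material, but that article explains the history quite clearly.)

This post is mostly notes on the current file descriptor structures, extended with other material I came across. I'm no expert on the Linux kernel, so if anything is wrong, corrections are welcome.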

file descriptor

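A file descriptor (fd) is basically a layer of interface through which we operate on files and other input/output interfaces (such as pipes and sockets).

Every process (a running program) has file descriptors associated with it. File descriptors are usually small integers, and through them a process accesses the files or devices it has opened. How many descriptors a process may have depends on system configuration. When a program starts, it normally has three file descriptors already open: 0 (standard input), 1 (standard output), and 2 (standard error).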
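As a small sketch of how descriptor numbers are handed out: open() returns the lowest-numbered descriptor that is not currently open in the process, so with 0, 1 and 2 taken, the first file a program opens typically gets 3. The helper name first_free_fd is made up for illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open /dev/null, note which descriptor number the
 * kernel handed out, then release it again. Returns -1 on failure.
 * <unistd.h> names the standard descriptors STDIN_FILENO (0),
 * STDOUT_FILENO (1) and STDERR_FILENO (2). */
int first_free_fd(void) {
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0)
        return -1;
    close(fd);          /* the number becomes available for reuse */
    return fd;
}
```

Calling first_free_fd() twice in a row from a single-threaded program should give the same number both times, because the descriptor is closed before the second open() looks for the lowest free slot.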

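What exactly is a file descriptor?

The basic building block of all I/O in Unix is a sequence of bytes. Most programs work with an even simpler abstraction: a stream of bytes, or an I/O stream.

Processes refer to I/O streams through descriptors (also called file descriptors). Pipes, files, FIFOs, POSIX IPC objects (message queues, semaphores, shared memory), and event queues are all typical examples of I/O streams referenced by a descriptor.

Creation and release of file descriptors

Descriptors are either created explicitly by system calls (such as open, pipe, socket) or inherited from the parent process.

A descriptor is released in the following cases: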

---the process exits

---by calling the close system call

---implicitly after an exec when the descriptor is marked as close on exec.
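To make creation and release concrete, here is a minimal sketch that creates descriptors with open, socket and pipe, and then checks that each number stops being valid after close. The function names fd_is_open and create_and_release are hypothetical, and using fcntl(F_GETFD) as a validity probe is just one convenient way to test a descriptor.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd currently names an open descriptor in this process. */
int fd_is_open(int fd) {
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Create descriptors three different ways, then release them with close().
 * Returns 0 if every step behaved as described, -1 otherwise. */
int create_and_release(void) {
    int fds[4] = { -1, -1, -1, -1 };
    int ok = 1;

    fds[0] = open("/dev/null", O_WRONLY);       /* file or device          */
    fds[1] = socket(AF_UNIX, SOCK_STREAM, 0);   /* socket                  */
    if (pipe(&fds[2]) < 0)                      /* pipe: read + write end  */
        fds[2] = fds[3] = -1;

    for (int i = 0; i < 4; i++)
        ok = ok && fds[i] >= 0 && fd_is_open(fds[i]);

    for (int i = 0; i < 4; i++) {
        if (fds[i] >= 0) {
            close(fds[i]);
            ok = ok && !fd_is_open(fds[i]);     /* EBADF once closed       */
        }
    }
    return ok ? 0 : -1;
}
```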

Close-on-exec

When a process forks, all of its descriptors are "duplicated" in the child process. Descriptors marked close-on-exec stay open in the child after the fork; they are closed only when the child successfully execs a new program, so they are not available to that program.
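A minimal sketch of the flag itself, assuming the usual fcntl interface: FD_CLOEXEC is set per descriptor with F_SETFD, and fork on its own does not close anything. The helper names set_cloexec and open_in_child_after_fork are made up for this example, and the child deliberately never calls exec so that only the fork behaviour is observed.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set the close-on-exec flag on fd. Returns 0 on success, -1 on error. */
int set_cloexec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Returns 1 if a forked child (which has not called exec) still sees fd
 * as open, 0 if it does not, -1 on error. */
int open_in_child_after_fork(int fd) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)                   /* child: only inspect, never exec */
        _exit(fcntl(fd, F_GETFD) != -1 ? 0 : 1);

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

When the descriptor comes from open, passing O_CLOEXEC in the open flags sets the same flag atomically, which avoids a window in which another thread could fork and exec before F_SETFD runs.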

Data transfer happens when a read or write system call is made on a descriptor.

Chapter 7. I/O System Overview, from the book Design and Implementation of the FreeBSD Operating System. Page 315

File Entry

Every descriptor points to a data structure in the kernel called the file entry. The file entry maintains a file offset, in bytes from the beginning of the underlying object. An open system call creates a new file entry.

Fork/Dup and File Entries

A fork system call results in descriptors being shared by the parent and child with share by reference semantics. Both the parent and the child are using the same descriptor and reference the same offset in the file entry. The same semantics apply to a dup/dup2 system call used to duplicate a file descriptor.

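#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    /* One open() creates one file entry; fork() makes the child share it. */
    int fd = open("abc.txt", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    fork();
    /* Parent and child each write 3 bytes through the shared offset. */
    write(fd, "xyz", 3);
    /* Each process prints the shared offset: 3 or 6, depending on timing. */
    printf("%ld\n", (long)lseek(fd, 0, SEEK_CUR));
    close(fd);
    return 0;
}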

More interesting is what the close-on-exec flag does if the descriptors are only being shared. My guess is that setting the flag removes the descriptor from the child's descriptor table when it execs, so the parent can continue using the descriptor but the child can no longer use it once it has exec-ed (that is, the descriptor is closed as soon as the child execs).

Offset-per-descriptor

As multiple descriptors can reference the same file entry, the file entry data structure maintains one file offset that all of those descriptors share. Read and write operations begin at this offset, and the offset is updated after every data transfer. The offset determines the position in the underlying object where the next read or write will happen. When a process terminates, the kernel reclaims all descriptors in use by the process. If the process in question was the last to reference the file entry, the kernel then deallocates that file entry.
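The way the offset tracks each transfer can be observed from user space with lseek(fd, 0, SEEK_CUR), which reports the current offset without moving it. The sketch below assumes a writable /tmp and uses a hypothetical function name, offset_follows_io.

```c
#include <stdlib.h>
#include <unistd.h>

/* Writes 5 bytes, rewinds, reads 2 bytes, and checks the offset after each
 * step. Returns 0 if the offsets were 5, 0 and 2 in turn, -1 otherwise. */
int offset_follows_io(void) {
    char path[] = "/tmp/fd_offset_XXXXXX";
    char buf[2];
    int ok;
    int fd = mkstemp(path);             /* opens the temp file read/write */
    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives until fd is closed  */

    ok = write(fd, "hello", 5) == 5
      && lseek(fd, 0, SEEK_CUR) == 5    /* write advanced the offset      */
      && lseek(fd, 0, SEEK_SET) == 0    /* explicit reposition            */
      && read(fd, buf, 2) == 2
      && lseek(fd, 0, SEEK_CUR) == 2;   /* read advanced it again         */

    close(fd);                          /* last reference: entry is freed */
    return ok ? 0 : -1;
}
```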

Anatomy of a File Entry

Each file entry contains:

— the type
— an array of function pointers. This array of function pointers translates generic operations on file descriptors to file-type specific implementations.

Disambiguating this a bit further, all file descriptors expose a common generic API that indicates operations (such as read, write, changing the descriptor mode, truncating the descriptor, ioctl operations, polling and so forth) that may be performed on the descriptor.

The actual implementation of these operations varies by file type, and different file types have their own custom implementations. Reads on sockets aren't quite the same as reads on pipes, even if the higher-level API exposed is the same. The open call is not part of this list, since its implementation varies greatly between file types. However, once the file entry has been created with an open call, the rest of the operations may be called using a generic API.
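The dispatch idea can be sketched in user space with a table of function pointers. Everything here (toy_file, toy_fileops, toy_read and the two fake read routines) is a made-up model for illustration, not kernel code; in Linux the real table is struct file_operations and has many more members.

```c
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Hypothetical per-type operations table. */
struct toy_fileops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const char *type;                 /* "pipe", "socket", ...          */
    const struct toy_fileops *ops;    /* type-specific implementations  */
};

/* Copy at most len bytes of src into buf; returns the number copied. */
static long copy_out(const char *src, char *buf, size_t len) {
    size_t n = strlen(src);
    if (n > len)
        n = len;
    memcpy(buf, src, n);
    return (long)n;
}

static long pipe_read(struct toy_file *f, char *buf, size_t len) {
    (void)f;
    return copy_out("from pipe", buf, len);
}

static long socket_read(struct toy_file *f, char *buf, size_t len) {
    (void)f;
    return copy_out("from socket", buf, len);
}

const struct toy_fileops pipe_ops = { pipe_read };
const struct toy_fileops socket_ops = { socket_read };

/* Generic entry point: the caller never needs to know the file type. */
long toy_read(struct toy_file *f, char *buf, size_t len) {
    return f->ops->read(f, buf, len);
}
```

A caller builds a struct toy_file with either &pipe_ops or &socket_ops and always calls toy_read; which routine runs is decided by the table stored in the file entry, which is the same shape of indirection the text describes.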

Most networking is done using sockets. A socket is referenced by a descriptor and acts as an endpoint for communication. Two processes can each create a socket and establish a reliable byte stream by connecting those two endpoints. Once the connection has been established, the descriptors can be read from or written to with the same calls used for files, although a socket has no seekable file offset. The kernel can redirect the output of one process to the input of another on another machine. The same read and write system calls are used for byte-stream type connections, but different system calls handle addressed messages like network datagrams.
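As a rough illustration that stream sockets accept the ordinary read and write calls, the sketch below uses socketpair to get two already connected AF_UNIX endpoints inside one process (a stand-in for two processes connecting over a network). It also checks that lseek is refused with ESPIPE, since a socket is not seekable. The function name socket_roundtrip is hypothetical.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sends "ping" across a connected socket pair using plain write()/read().
 * Returns 0 if the bytes arrived and lseek() was refused, -1 otherwise. */
int socket_roundtrip(void) {
    int sv[2];
    char buf[4];
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    ok = write(sv[0], "ping", 4) == 4
      && read(sv[1], buf, sizeof buf) == 4
      && memcmp(buf, "ping", 4) == 0;

    /* A socket is a stream, not a seekable file. */
    errno = 0;
    ok = ok && lseek(sv[1], 0, SEEK_CUR) == (off_t)-1 && errno == ESPIPE;

    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```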

From the book the Linux Programming Interface — page 95

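Basic structures inside the kernel (explaining the figure above)

Each process contains a table of file descriptors.

A file descriptor is really just a reference to an entry in the system-wide open file table; POSIX calls such an entry an open file description.

The inode_ptr in that open file table entry in turn points to an entry in the system-wide i-node table.

The relationship between file descriptors and files is not one-to-one, as the figure above shows.

The figure comes from the slides lusp_fileio_slides.pdf. I also highly recommend the author's book, The Linux Programming Interface; it is well worth owning.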

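The corresponding data structures in the source code

1. The process's task_struct has a files_struct member (reached through a pointer), and a file descriptor is looked up through that files_struct. Its members used to sit directly inside task_struct; they were later split out and accessed through a pointer, mainly because once Linux supported threads, with task_struct as the unit of a thread, the pointer lets threads share resources such as files_struct.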

struct task_struct {
    ...
    struct files_struct *files;
    ...
};

2. Inside files_struct you find the per-process fdtable (file descriptor table). It uses the impressive RCU technique, which targets read-mostly workloads, to make access to the fdtable faster. (struct fdtable in include/linux/fdtable.h)

struct files_struct {
  /*
   * read mostly part
   */
	atomic_t count;
	bool resize_in_progress;
	wait_queue_head_t resize_wait;

	struct fdtable __rcu *fdt;
	struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
	spinlock_t file_lock ____cacheline_aligned_in_smp;
	unsigned int next_fd;
	unsigned long close_on_exec_init[1];
	unsigned long open_fds_init[1];
	unsigned long full_fds_bits_init[1];
	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

struct fdtable {
	unsigned int max_fds;
	struct file __rcu **fd;      /* current fd array */
	unsigned long *close_on_exec;
	unsigned long *open_fds;
	unsigned long *full_fds_bits;
	struct rcu_head rcu;
};

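3. The open file table, whose entries are also called open file descriptions, is a system-wide table (https://github.com/torvalds/linux/blob/master/include/linux/fs.h#L921). This struct defines some fairly important data such as file_offset and file_status and, most importantly, the inode_ptr that points to the corresponding inode (f_inode in the excerpt below).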

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;
	...
};

4. Each open file table entry in turn points into the system-wide inode table (https://github.com/torvalds/linux/blob/master/include/linux/fs.h#L615); the inode's i_mode field records which type of file it represents.

struct inode {
	umode_t			i_mode;
	unsigned short  i_opflags;
	kuid_t			i_uid;
	kgid_t			i_gid;
	unsigned int	i_flags;
	...
};

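Some common fd operations

1. Within one process, dup() or dup2() is the usual way to duplicate a file descriptor; the two fds then point to the same open file entry (and therefore the same file).

2. Across processes, fork() gives each process its own file descriptor, and both point to the same open file entry.

3. When different processes open the same file, each uses its own file descriptor pointing to a different open file entry, but those entries end up pointing to the same inode.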
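Cases 1 and 3 can be checked from user space in a single process. The sketch below is only an illustration under a few assumptions: /tmp is writable, the function name compare_dup_and_open is made up, and the second open() in the same process stands in for "another process opening the same file". The fork case behaves like dup with respect to the shared offset.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 if dup() shares the offset while a second open() does not,
 * and both opens resolve to the same inode; -1 otherwise. */
int compare_dup_and_open(void) {
    char path[] = "/tmp/fd_share_XXXXXX";
    struct stat a, b;
    int ok = 0;
    int fd = mkstemp(path);                 /* first open file entry       */
    if (fd < 0)
        return -1;
    int copy = dup(fd);                     /* same entry as fd            */
    int other = open(path, O_RDONLY);       /* a new, independent entry    */

    if (copy >= 0 && other >= 0) {
        ok = write(fd, "abc", 3) == 3
          && lseek(copy, 0, SEEK_CUR) == 3      /* dup sees the new offset */
          && lseek(other, 0, SEEK_CUR) == 0     /* separate open does not  */
          && fstat(fd, &a) == 0 && fstat(other, &b) == 0
          && a.st_dev == b.st_dev && a.st_ino == b.st_ino;  /* same inode  */
    }
    if (copy >= 0)
        close(copy);
    if (other >= 0)
        close(other);
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```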

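Other lessons learned

Before I understood fds I made quite a few mistakes in my programs. For example, in one socket program the main process forked a child process but did not set the close-on-exec flag, so the fds opened by the main process were carried over to the child. Even after the main process closed the socket, the copy held by the child kept it from being released. The load balancer in front therefore reported that the number of backend connections was not going down, and because of rate limiting it started rejecting outside connections, even though the backend was actually idle. That was a landmine planted by not knowing how fds behave. Now that I understand fds, I plan to write more notes on how things like epoll and SCM_RIGHTS work. Understanding fds really matters for writing programs!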
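The effect in this story can be reproduced in miniature. The sketch below is an assumption-laden stand-in, not the original program: a socketpair replaces the real client connection, a child that simply sits on the inherited descriptor replaces the exec-ed child, and the names peer_connected and leaked_fd_keeps_connection are made up. It shows that the peer sees no end-of-file while any copy of the descriptor is still open.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the peer still looks connected (no EOF yet), 0 on EOF. */
static int peer_connected(int fd) {
    char c;
    ssize_t n = recv(fd, &c, 1, MSG_DONTWAIT | MSG_PEEK);
    return n > 0 || (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
}

/* Returns 0 if closing the parent's copy alone did NOT end the connection
 * but the connection did end once the child holding the copy exited. */
int leaked_fd_keeps_connection(void) {
    int sv[2], ctl[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(ctl) < 0) {
        close(sv[0]); close(sv[1]);
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]); close(sv[1]); close(ctl[0]); close(ctl[1]);
        return -1;
    }
    if (pid == 0) {                     /* child: inherits sv[0], holds it */
        char c;
        close(sv[1]);
        close(ctl[1]);
        if (read(ctl[0], &c, 1) < 0)    /* blocks until parent closes ctl[1] */
            _exit(1);
        _exit(0);
    }
    close(ctl[0]);
    close(sv[0]);                               /* parent closes the socket */
    int still_open = peer_connected(sv[1]);     /* child's copy keeps it up */

    close(ctl[1]);                              /* let the child exit       */
    waitpid(pid, NULL, 0);
    int open_after_exit = peer_connected(sv[1]);

    close(sv[1]);
    return (still_open && !open_after_exit) ? 0 : -1;
}
```

Creating the socket with the close-on-exec flag set (for example SOCK_CLOEXEC on Linux, or FD_CLOEXEC via fcntl) keeps it out of any program the child later execs, and having the child close descriptors it does not need covers the case where it never execs.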

Reference:

https://man7.org/training/download/lusp_fileio_slides.pdf

The Linux Programming Interface https://man7.org/tlpi/index.html

Reference links:

Linux 文件描述符简介(file descriptor)_请叫我AXin-CSDN博客 https://blog.csdn.net/Artprog/article/details/60601253

Linux 的 file descriptor 筆記 - Kakashi's Blog
https://kkc.github.io/2020/08/22/file-descriptor/