Device Drivers in User Space: A Case for Network Device Driver

The previous post gave an initial introduction to Linux user-space device drivers; this post walks through a typical case study. Again, if you are interested in developing Linux user-space device drivers, please read on; otherwise feel free to skip this post.

Device Drivers in User Space: A Case for Network Device Driver

Hemant Agrawal and Ravi Malhotra, Member, IACSIT

Abstract -- Traditionally, device drivers, especially network drivers, have
been implemented and used inside the Linux kernel for various reasons. In a
recent trend, however, many network stack vendors are moving towards
user-space based drivers. The open-source GPL licensing of the kernel is one
of the strong reasons for such a move. In the absence of generic guidelines,
there are various options for implementing device drivers in user space,
each with its own advantages and disadvantages. In this paper, we cover
multiple issues with user-space device drivers and give more insight into
network device driver implementation in user space.

Index Terms -- Network drivers, user space, zero copy.

I. INTRODUCTION

Traditionally, device drivers have been implemented in kernel space. In
recent times, however, there has been a shift towards running data path
applications in the user-space context. Linux user space offers applications
several advantages: more robust and flexible process management, a
standardized system call interface, simpler resource management, and the
availability of a large number of libraries for XML, regular-expression
parsing and so on. It also makes applications easier to debug by providing
memory isolation and independent restart. At the same time, while
kernel-space code must conform to GPL guidelines, user-space applications
are not bound by such restrictions.

User-space data path processing comes with its own overheads. Since the
network drivers run in kernel context and use kernel-space memory for packet
storage, there is an overhead of copying the packet data from user-space to
kernel-space memory and vice versa. Also, user/kernel-mode transitions
usually impose a considerable performance overhead, thereby violating the
low-latency and high-throughput requirements of data path applications.

In the rest of this paper, we shall explore an alternative approach to
reducing these overheads for user-space data path applications.

II. MAPPING MEMORY TO USER-SPACE

As an alternative to the traditional I/O model, the Linux kernel provides a
user-space application with the means to directly map memory available to
the kernel into a user-space address range. In the context of device
drivers, this can give user-space applications direct access to the device
memory, which includes register configuration and I/O descriptors. All
accesses by the application to the assigned address range end up directly
accessing the device memory.

There are several Linux system calls which allow this kind of memory
mapping, the simplest being the mmap() call. The mmap() call allows the user
application to map a physical device address range one page at a time, or a
contiguous range of physical memory in multiples of the page size.

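As a rough illustration of the basic mechanism (an editor's sketch, not
taken from the paper), the C fragment below maps one page of a device's
physical address range into the process through /dev/mem. The base address
is a placeholder, and such access typically requires root privileges and a
kernel that does not restrict /dev/mem.

    /* Sketch: mapping one page of device registers into user space with
     * mmap(). DEV_PHYS_BASE is a placeholder physical address. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define DEV_PHYS_BASE 0xffe24000UL   /* placeholder, platform specific */

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *regs = mmap(NULL, page, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, DEV_PHYS_BASE);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        /* Loads and stores through 'regs' now reach the device directly. */
        printf("first register word: 0x%08x\n", regs[0]);

        munmap((void *)regs, page);
        close(fd);
        return 0;
    }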

Other Linux system calls for mapping memory include splice()/vmsplice(),
which allow an arbitrary kernel buffer to be read or written from user
space, while tee() allows a copy between two kernel-space buffers without
access from user space [1].

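As a hedged example of these calls (again an editor's sketch, not from the
paper), the fragment below hands a user buffer to a pipe with vmsplice() and
then moves it on to a file with splice(), keeping the data movement inside
the kernel; error handling is minimal.

    /* Sketch: vmsplice()/splice() usage. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096] = "hello, zero copy\n";
        int p[2];
        if (pipe(p) < 0) { perror("pipe"); return 1; }

        /* Let the kernel reference the user pages backing buf. */
        struct iovec iov = { .iov_base = buf, .iov_len = strlen(buf) };
        if (vmsplice(p[1], &iov, 1, 0) < 0) { perror("vmsplice"); return 1; }

        /* Move the data from the pipe to a file, still in kernel space. */
        int out = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0 || splice(p[0], NULL, out, NULL, iov.iov_len, 0) < 0)
            perror("splice");

        close(out); close(p[0]); close(p[1]);
        return 0;
    }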

The task of mapping physical memory to user-space memory is typically done
using the Translation Look-aside Buffer (TLB). The number of TLB entries in
a given processor is typically limited, and as such they are used as a cache
by the Linux kernel. The size of the memory region mapped by each entry is
typically restricted to the minimum page size supported by the processor,
which is 4 KB.

Linux maps the kernel memory using a small set of TLB entries which are
fixed at initialization time. For user-space applications, however, the
number of TLB entries is limited and each TLB miss can result in a
performance hit. To avoid such penalties, Linux provides the concept of a
Huge-TLB, which allows user-space applications to map pages larger than the
default minimum page size of 4 KB. This mapping can be used not only for
application data but for the text segment as well.

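To make this concrete, here is a minimal sketch (editor's illustration) of
reserving a 2 MB huge-page-backed buffer with mmap(). It assumes huge pages
have been configured on the system (for example via
/proc/sys/vm/nr_hugepages) and that the default huge page size is 2 MB.

    /* Sketch: allocating a buffer backed by a huge page so that a single
     * TLB entry covers a large mapping. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define BUF_SIZE (2UL * 1024 * 1024)   /* assumed 2 MB huge page size */

    int main(void)
    {
        void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

        /* Packet buffers can be carved out of this one large mapping. */
        munmap(buf, BUF_SIZE);
        return 0;
    }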

Several efficient mechanisms have been developed in Linux to support
zero-copy transfers between user space and kernel space based on memory
mapping and other techniques [2]-[4]. These can be used by data path
applications while continuing to leverage the existing kernel-space network
driver implementation. However, they still consume precious CPU cycles, and
the per-packet processing cost remains moderately high. Having direct access
to the hardware from user space eliminates the need for any mechanism to
transfer packets back and forth between user space and kernel space, and
thus reduces the per-packet processing cost to a minimum.

III. UIO DRIVERS

Linux provides a standard UIO framework [4] for developing user-space based
device drivers. The UIO framework defines a small kernel-space component
which performs two key tasks:

o Indicate device memory regions to user space.
o Register for device interrupts and provide interrupt
  indication to user space.

The kernel-space UIO component then exposes the device to user space via a
device node such as /dev/uioX, along with a set of sysfs attributes
describing its memory regions. The user-space component searches for these
entries, reads the device address ranges and maps them to user-space memory.
The user-space component can then perform all device management tasks,
including I/O from the device. For interrupts, however, it needs to perform
a blocking read() on the device entry, which results in the kernel component
putting the user-space application to sleep and waking it up once an
interrupt is received.

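The user-space side of this interrupt handling can be sketched as follows
(an editor's illustration; /dev/uio0 is an assumed device node, and whether
interrupts must be re-enabled by writing to the file descriptor depends on
the particular UIO driver):

    /* Sketch: blocking on UIO interrupts from user space. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR);
        if (fd < 0) { perror("open /dev/uio0"); return 1; }

        for (;;) {
            uint32_t irq_count;

            /* Sleeps until the kernel UIO component sees an interrupt;
             * the value read back is the cumulative interrupt count. */
            if (read(fd, &irq_count, sizeof(irq_count)) != sizeof(irq_count))
                break;

            printf("interrupt #%u\n", irq_count);

            /* Service the device here, then (if the driver implements
             * irqcontrol) re-enable interrupts by writing to the fd. */
            uint32_t enable = 1;
            (void)write(fd, &enable, sizeof(enable));
        }

        close(fd);
        return 0;
    }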

IV. USER SPACE NETWORK DRIVERS

The memory required by a network device driver can be of three types:
o Configuration space: this refers to the common configuration registers
  of the device.
o I/O descriptor space: this refers to the descriptors used by the device
  to locate and access the I/O data.
o I/O data space: this refers to the actual I/O data accessed from the
  device.

Taking the case of a typical Ethernet device, the above refer to the common
device configuration (including MAC configuration), the buffer-descriptor
rings, and the packet data buffers.

In the case of kernel-space network drivers, all three regions are mapped to
kernel space, and any access to them from user space is typically abstracted
out via either ioctl() calls or read()/write() calls, from where a copy of
the data is provided to the user-space application.

User-space network drivers, on the other hand, map all three regions
directly to user-space memory. This allows the user-space application to
directly drive the buffer descriptor rings from user space. Data buffers can
be managed and accessed directly by the application without the overhead of
a copy.

Taking the specific example of a user-space network driver implementation
for the eTSEC Ethernet controller on a Freescale QorIQ P1020 platform, the
configuration space is a single region of 4 KB, aligned to a page boundary.
This contains all the device-specific registers, including controller
settings, MAC settings, interrupts etc. Besides this, the MDIO region also
needs to be mapped to allow configuration of the Ethernet PHY devices. The
eTSEC provides up to 8 individual buffer descriptor rings, each of which is
mapped onto a separate memory region, to allow for simultaneous access by
multiple applications. The data buffers referenced by the descriptor rings
are allocated from a single contiguous memory block, which is allocated and
mapped to user space at initialization time.

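A rough sketch of how such a layout might be mapped from user space through
UIO follows (editor's illustration; the device node, map indices and sizes
are assumptions, and on a real system they are read from the sysfs
attributes exported for each map region). With UIO, the mmap() offset
N * page_size selects memory map N:

    /* Sketch: mapping two UIO regions, registers and one BD ring. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        int fd = open("/dev/uio0", O_RDWR);        /* assumed device node */
        if (fd < 0) { perror("open"); return 1; }

        /* Map 0: 4 KB of controller/MAC registers. */
        volatile uint32_t *regs = mmap(NULL, page, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0 * page);

        /* Map 1: one buffer descriptor ring (size assumed to be a page). */
        void *bd_ring = mmap(NULL, page, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 1 * page);

        if (regs == MAP_FAILED || bd_ring == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* The application can now program registers and drive the ring
         * directly, without a kernel driver in the data path. */
        close(fd);
        return 0;
    }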

V. CONSTRAINTS OF USER SPACE DRIVERS

Direct access to network devices brings its own set of complications for
user-space applications, which were hidden by several layers of kernel stack
and system calls.

o Sharing a single network device across multiple applications.
o Blocking access to network data.
o Lack of network stack services like TCP/IP.
o Memory management for packet buffers.
o Resource management across application restarts.
o Lack of a standardized driver interface for applications.

Figure 1: Kernel space network driver

Figure 2: User space network driver

A. Sharing Devices

Unlike the Linux socket layer, which allows multiple applications to open
sockets (TCP, UDP or raw IP), user-space network drivers allow only a single
application to access the data from an interface. However, most network
interfaces nowadays provide multiple buffer descriptor rings in both the
receive and transmit directions. Further, these interfaces also provide some
kind of hardware classification mechanism to divert incoming traffic to
these multiple rings. Such a mechanism can be used to map individual buffer
descriptor rings to different applications. This still limits the number of
applications on a single interface to the number of rings supported by the
hardware device. An alternative is to develop a dispatcher framework over
the user-space driver, which deals with multiple applications.

B. Blocking Access to Data

Unlike traditional socket-based access, which allows user-space applications
to block until data is available on the socket, or to do a select()/poll()
to wait on multiple inputs, the user-space application has to constantly
poll the buffer descriptor ring for an indication of incoming data. This can
be addressed by using a blocking read() call on the UIO device entry, which
allows the user-space application to block on receive interrupts from the
Ethernet device. This also gives the application the freedom to choose when
it wants to be notified of interrupts: instead of being interrupted for each
packet, it can implement a polling mechanism that consumes a certain number
of buffer descriptor entries before returning to other processing tasks.
When all buffer descriptor entries are consumed, the application can again
perform a read() to block until further data arrives.

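A hedged sketch of that receive pattern is shown below (editor's
illustration; the descriptor layout, flags and ring size are hypothetical
and would be replaced by the controller's real descriptor format):

    /* Sketch: interrupt-plus-poll processing of a buffer descriptor ring. */
    #include <stdint.h>
    #include <unistd.h>

    #define RING_SIZE 256
    #define BUDGET     64          /* descriptors handled per wakeup */
    #define BD_EMPTY  0x8000       /* hypothetical "owned by hardware" flag */

    struct bd {                    /* hypothetical descriptor layout */
        volatile uint16_t flags;
        volatile uint16_t length;
        volatile uint32_t buf_addr;
    };

    void rx_loop(int uio_fd, struct bd *ring)
    {
        unsigned int head = 0;

        for (;;) {
            uint32_t irq_count;

            /* Block until the device signals that packets have arrived. */
            if (read(uio_fd, &irq_count, sizeof(irq_count)) != sizeof(irq_count))
                return;

            /* Consume at most BUDGET descriptors before giving control
             * back (and eventually blocking on read() again). */
            for (int n = 0; n < BUDGET; n++) {
                struct bd *d = &ring[head % RING_SIZE];
                if (d->flags & BD_EMPTY)       /* still owned by hardware */
                    break;
                /* ...process the packet of d->length bytes here... */
                d->flags |= BD_EMPTY;          /* hand it back to the device */
                head++;
            }
        }
    }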

C. Lack of Network Stack Services

The Linux network stack and socket interface also abstract basic networking
services away from applications, such as route lookup, ARP etc. In the
absence of such services, the application has to either run its own
equivalent of a network stack or maintain a local copy of the routing and
neighbor databases held in the kernel.

D. Memory Management for Buffers

The user-space application also needs to deal with the buffers provided to
the network device for storing and retrieving data. Besides allocating and
freeing these buffers, it also needs to translate user-space virtual
addresses to physical addresses before handing them to the device. Doing
this translation for each buffer at runtime can be very costly. Also, since
the number of TLB entries in the processor may be limited, performance may
suffer. The alternative is to use Huge-TLB to allocate a single large chunk
of memory, and carve the data buffers out of this memory chunk.

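One common way to perform that translation from user space is to consult
/proc/self/pagemap, which exposes one 64-bit entry per virtual page (PFN in
bits 0-54, a "page present" flag in bit 63). The sketch below is an editor's
illustration and assumes the page is present and the process has permission
to read its own pagemap:

    /* Sketch: user-space virtual-to-physical translation via pagemap. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    uint64_t virt_to_phys(void *vaddr)
    {
        long page = sysconf(_SC_PAGESIZE);
        uint64_t entry = 0;
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0)
            return 0;

        off_t offset = ((uintptr_t)vaddr / page) * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry) ||
            !(entry & (1ULL << 63))) {        /* page not present */
            close(fd);
            return 0;
        }
        close(fd);

        uint64_t pfn = entry & ((1ULL << 55) - 1);
        return pfn * page + ((uintptr_t)vaddr % page);
    }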

E. Application Restart

The application is responsible for allocating and managing device resources
and the current state of the device. If the application crashes or is
restarted without being given a chance to perform cleanup, the device may be
left in an inconsistent state. One way to resolve this is to use the
kernel-space UIO component to keep track of the application process state
and, on restart, to reset the device and release any memory mappings created
by the application.

F. Standardized User Interface

The current generation of user-space network drivers provides a set of
low-level APIs which are often very specific to the device implementation,
rather than conforming to standard system call APIs like open()/close(),
read()/write() or send()/receive(). This implies that the application needs
to be ported for each specific network device it uses.

VI. CONCLUSION

While the UIO framework gives user-space applications the freedom of direct
access to network devices, it brings its own share of limitations in terms
of sharing across applications, resource management and memory management.
The current generation of user-space network drivers works well in the
constrained use case of a single application tightly coupled to a network
device. However, further work on such drivers must address some of these
limitations.

REFERENCES

[1] M. Welsh et al., "Memory Management for User-Level Network Interfaces,"
    IEEE Micro, pp. 77-82, Mar.-Apr. 1998.
[2] D. Stancevic, "Zero Copy I: User-Mode Perspective," Linux Journal,
    p. 105, Jan. 2003.
[3] N. M. Thadani et al., "An Efficient Zero-Copy I/O Framework for UNIX,"
    Sun Microsystems Inc., May 1995.
[4] H. Koch, The Userspace I/O HOWTO. [Online]. Available:
    http://www.kernel.org/doc/htmldocs/uio-howto.html
Source: https://www.cnblogs.com/idorax/p/6945888.html