DPDK Initialization

I. DPDK source version: DPDK 19.02    http://core.dpdk.org/download/

II. DPDK initialization steps

  1. Initialize the EAL environment: rte_eal_init().

  2. Parse arguments. DPDK's own arguments are already parsed during EAL initialization, so what remains to parse here are the application's arguments, for example with getopt_long().

  3. Initialize the memory pools and other resources. Note that this has to happen before the ports are initialized, to prepare for receiving packets.

  4. Initialize the ports.

  5. Launch the threads on all cores: rte_eal_mp_remote_launch(). (A minimal skeleton combining these five steps is sketched below.)
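
An illustrative outline, not DPDK source: parse_app_args, port_init, and lcore_main are hypothetical application helpers, and the mbuf pool parameters are arbitrary.

    #include <stdlib.h>

    #include <rte_debug.h>
    #include <rte_eal.h>
    #include <rte_launch.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    /* Hypothetical per-lcore worker; the packet processing loop would live here. */
    static int
    lcore_main(void *arg)
    {
        (void)arg;
        return 0;
    }

    int
    main(int argc, char **argv)
    {
        /* 1. Initialize the EAL; it consumes the EAL arguments and returns
         * how many it used. */
        int ret = rte_eal_init(argc, argv);
        if (ret < 0)
            rte_exit(EXIT_FAILURE, "EAL init failed\n");
        argc -= ret;
        argv += ret;

        /* 2. Parse the application's own arguments, e.g. with getopt_long():
         * parse_app_args(argc, argv); (hypothetical helper) */

        /* 3. Create the mbuf pool before the ports are initialized. */
        struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL",
            8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        if (pool == NULL)
            rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");

        /* 4. Initialize the ports (rte_eth_dev_configure and friends):
         * port_init(pool); (hypothetical helper) */

        /* 5. Launch the worker on every lcore, master included. */
        rte_eal_mp_remote_launch(lcore_main, NULL, CALL_MASTER);
        rte_eal_mp_wait_lcore();
        return 0;
    }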

III. A detailed walkthrough of the initialization process:

3.1 EAL initialization

1) What the EAL provides:

• Intel® DPDK loading and launching
• Support for multi-process and multi-thread execution types
• Core affinity/assignment procedures
• System memory allocation/de-allocation
• Atomic/lock operations
• Time reference
• PCI bus access
• Trace and debug functions
• CPU feature identification
• Interrupt handling
• Alarm operations

ref: http://doc.dpdk.org/guides/prog_guide/   (see the DPDK Programmer's Guide for details)

2) The initialization routine, from the source file eal.c:

/* Launch threads, called at application init(). */
int
rte_eal_init(int argc, char **argv)
{
    int i, fctret, ret;
    pthread_t thread_id;
    static rte_atomic32_t run_once = RTE_ATOMIC32_INIT(0);
    const char *p;
    static char logid[PATH_MAX];
    char cpuset[RTE_CPU_AFFINITY_STR_LEN];
    char thread_name[RTE_MAX_THREAD_NAME_LEN];

    /* checks if the machine is adequate */
    // Check whether the CPU's feature flags are supported:
    // DPDK relies on advanced CPU instructions to speed up computation.
    if (!rte_cpu_is_supported()) {
        rte_eal_init_alert("unsupported cpu type.");
        rte_errno = ENOTSUP;
        return -1;
    }

    // Test-and-set the static local variable run_once to make sure
    // this function executes only once.
    if (!rte_atomic32_test_and_set(&run_once)) {
        rte_eal_init_alert("already called initialization.");
        rte_errno = EALREADY;
        return -1;
    }

    p = strrchr(argv[0], '/');
    strlcpy(logid, p ? p + 1 : argv[0], sizeof(logid));
    thread_id = pthread_self();

    // Initialize struct internal_config to its defaults.
    eal_reset_internal_config(&internal_config);

    /* set log level as early as possible */
    // Parse the command line early, handling only "--log-level";
    // the result is stored in internal_config.log_level.
    eal_log_level_parse(argc, argv);

    // Detect the CPUs/lcores present in the system.
    if (rte_eal_cpu_init() < 0) {
        rte_eal_init_alert("Cannot detect lcores.");
        rte_errno = ENOTSUP;
        return -1;
    }

    /*
    EAL initialization parameters:
        -c COREMASK: hexadecimal mask of the CPU cores to use. Note that core
            numbering differs between platforms and should be verified first.
        -n NUM: number of memory channels per processor socket.
        -b domain:bus:devid.func: blacklist a PCI device so the EAL will not
            use it (several -b options may be given).
        --socket-mem: allocate hugepage memory on specific sockets.
        -m MB: amount of hugepage memory to allocate, irrespective of the
            processor socket. Using --socket-mem instead is recommended.
        -r NUM: number of memory ranks.
        -v: display the version number on startup.
        --huge-dir: mount point of the hugepage filesystem.
        --file-prefix: prefix for the hugepage file names.
        --proc-type: process type (primary, secondary, auto).
        --xen-dom0: run without hugepages in Xen Domain0.
        --vmware-tsc-map: use the VMware TSC map instead of the native RDTSC.
        --base-virtaddr: base virtual address to map memory at.
        --vfio-intr: interrupt type to be used by VFIO (has no effect if VFIO
            is not used).
        -c is required; all the others are optional.
    */
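    /*
     * A hypothetical example command line (an illustration, not part of
     * eal.c):
     *   ./app -c 0x3 -n 4 --socket-mem 512 --huge-dir /mnt/huge -- -p 0x1
     * Everything before the "--" separator is consumed by the EAL here in
     * eal_parse_args(); the arguments after it are left for the application
     * to parse itself (e.g. with getopt_long()).
     */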
    fctret = eal_parse_args(argc, argv);
    if (fctret < 0) {
        rte_eal_init_alert("Invalid 'command line' arguments.");
        rte_errno = EINVAL;
        rte_atomic32_clear(&run_once);
        return -1;
    }

    // Load the driver plugins (shared objects) requested on the command
    // line, as recorded in internal_config.
    if (eal_plugins_init() < 0) {
        rte_eal_init_alert("Cannot init plugins");
        rte_errno = EINVAL;
        rte_atomic32_clear(&run_once);
        return -1;
    }

    if (eal_option_device_parse()) {
        rte_errno = ENODEV;
        rte_atomic32_clear(&run_once);
        return -1;
    }

    /*
    Primary process case (RTE_PROC_PRIMARY):
        rte_eal_config_create
            eal_runtime_config_path: get the runtime config file path,
                e.g. "/var/run/.rte_config"
            open the file, lock it, and mmap it into memory
            copy the early configuration structure (the global variable
                early_mem_config) into that memory; rte_config.mem_config
                points at it
            the mapped address is saved in
                rte_config.mem_config->mem_cfg_addr so that secondary
                processes can later map it at the same address
    Secondary process case (RTE_PROC_SECONDARY):
        rte_eal_config_attach
            eal_runtime_config_path
            open the file and mmap it into memory
            rte_config.mem_config points at the mapping
        rte_eal_mcfg_wait_complete
            keep waiting while the magic member of struct rte_mem_config
            has not yet been set to RTE_MAGIC
            (the primary process writes RTE_MAGIC into it once it is ready)
        rte_eal_config_reattach
            read the primary process's mapping address from the earlier mmap
            (i.e. rte_config.mem_config->mem_cfg_addr)
            munmap the previous mapping
            mmap again at the primary's address; if the resulting address
            differs from the requested one, exit with an error
            point rte_config.mem_config at the new mapping
    */
    rte_config_init();

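    /*
     * Illustration (not part of eal.c): a primary and a secondary process
     * end up sharing the same runtime config file when started as, e.g.:
     *   ./app --proc-type=primary --file-prefix=demo ...
     *   ./app --proc-type=secondary --file-prefix=demo ...
     * The shared mapping is what lets the secondary see the primary's
     * memory configuration.
     */
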
    /*
    Initialize the global interrupt source head.
    Create a pipe.
    Create a thread that waits for and handles interrupts; its entry
    function is eal_intr_thread_main:
        thread main loop:
            epoll_create: create an epoll file descriptor
            epoll_ctl: add the read end of the pipe created above to the
                epoll wait list
            walk the list of struct rte_intr_source entries headed by the
                global interrupt source head:
                skip any struct rte_intr_source with no callback attached
                add every uio device file descriptor to the epoll wait list
            eal_intr_handle_interrupts:
                epoll_wait: wait for an I/O event on the epoll file
                    descriptor
                eal_intr_process_interrupts:
                    iterate over all I/O events that occurred
                    if the read end of the pipe is readable, read from it
                        and return
                    walk the struct rte_intr_source list to find the entry
                        matching the current I/O event
                    determine how many bytes to read based on the interrupt
                        handle type (uio/alarm/...)
                    read from the file descriptor
                    if the read succeeds, run every callback attached to
                        this struct rte_intr_source
                if eal_intr_process_interrupts returns a negative value,
                    this round of interrupt handling ends and the function
                    returns
            close the epoll file descriptor
        if the thread is created successfully, rte_thread_setname names it
        "eal-intr-thread"
            pthread_setname_np
    (The rest of this walkthrough block describes the per-lcore thread
    creation loop that appears further below.)
    Loop (browse all running lcores except the master lcore):
        create the pipes used for communication between the master thread
            and its children
        set the child thread state to WAIT
        create the child thread, whose entry function is eal_thread_loop:
            derive the current thread's lcore_id from its thread ID
            get the file descriptor the child reads from (m2s) on the
                master-to-slave pipe
            get the file descriptor the child writes to (s2m) on the
                slave-to-master pipe
            eal_thread_set_affinity: set the child thread's CPU affinity
            eal_thread_dump_affinity
            thread main loop:
                wait to read a command from the master thread
                set the thread state to RUNNING
                send an ack to the master thread
                read the lcore_function_t function pointer and its argument
                    from the struct lcore_config of this lcore
                call the function and store its return value
                set the thread state to FINISHED
    If each thread is created successfully, rte_thread_setname names it
    "lcore-slave-xx".
    */
    if (rte_eal_intr_init() < 0) {
        rte_eal_init_alert("Cannot init interrupt-handling thread");
        return -1;
    }
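
    /*
     * Illustration (not part of eal.c): the "eal-intr-thread" created above
     * is what ends up invoking the callbacks that drivers or applications
     * attach with rte_intr_callback_register(), e.g.:
     *   rte_intr_callback_register(intr_handle, my_handler, my_arg);
     * where my_handler and my_arg are hypothetical names.
     */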

    /* Put mp channel init before bus scan so that we can init the vdev
     * bus through mp channel in the secondary process before the bus scan.
     */
     /*
        The multi-process case is a bit more involved: besides inter-thread
        communication, the primary process must also be able to talk to the
        secondary processes.
        That channel is set up by the function below (mp stands for
        multiple process).
        Internally it creates a dedicated thread to receive messages from
        the other processes.
     */
    if (rte_mp_channel_init() < 0) {
        rte_eal_init_alert("failed to init mp channel");
        if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
            rte_errno = EFAULT;
            return -1;
        }
    }

    /* register multi-process action callbacks for hotplug */
    // Register an mp action for hotplug requests.
    if (rte_mp_dev_hotplug_init() < 0) {
        rte_eal_init_alert("failed to register mp callback for hotplug");
        return -1;
    }
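
    /*
     * Illustration (not part of eal.c): other modules attach their own
     * handlers to this channel with rte_mp_action_register(), e.g.:
     *   static int my_mp_handler(const struct rte_mp_msg *msg,
     *                            const void *peer);
     *   rte_mp_action_register("my_action", my_mp_handler);
     * where "my_action" and my_mp_handler are hypothetical names.
     */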

    /*
        rte_bus_scan is the main bus-scan entry point; internally it calls
        every bus->scan callback. Its purpose is to scan for the devices
        registered on each bus.
        The default device path for the PCI bus is /sys/bus/pci/devices.
        Unlike the kernel's scan, DPDK merely reads back the sysfs entries
        the kernel created when it scanned PCI, i.e. it reuses the PCI
        information the kernel has already discovered.
        In the Linux device model, each bus type has device and driver
        directories, and the entries under the device directory point to
        the actual device files:
            /sys/bus/pci/devices/
    */
    if (rte_bus_scan()) {
        rte_eal_init_alert("Cannot scan the buses for devices");
        rte_errno = ENODEV;
        rte_atomic32_clear(&run_once);
        return -1;
    }

    /* if no EAL option "--iova-mode=<pa|va>", use bus IOVA scheme */
    if (internal_config.iova_mode == RTE_IOVA_DC) {
        /* autodetect the IOVA mapping mode (default is RTE_IOVA_PA) */
        // Get the global configuration struct rte_config, which initially
        // points to the global variable early_mem_config.
        rte_eal_get_configuration()->iova_mode =
            rte_bus_get_iommu_class();

        /* Workaround for KNI which requires physical address to work */
        if (rte_eal_get_configuration()->iova_mode == RTE_IOVA_VA &&
                rte_eal_check_module("rte_kni") == 1) {
            rte_eal_get_configuration()->iova_mode = RTE_IOVA_PA;
            RTE_LOG(WARNING, EAL,
                "Some devices want IOVA as VA but PA will be used because.. "
                "KNI module inserted\n");
        }
    } else {
        rte_eal_get_configuration()->iova_mode =
            internal_config.iova_mode;
    }
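
    /*
     * Illustration (not part of eal.c): the autodetection above can be
     * overridden on the command line with the EAL option mentioned in the
     * comment, e.g. ./app --iova-mode=pa ...
     */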

    if (internal_config.no_hugetlbfs == 0) {
        /* rte_config isn't initialized yet */
        ret = internal_config.process_type == RTE_PROC_PRIMARY ?
                eal_hugepage_info_init() :
                eal_hugepage_info_read();
        if (ret < 0) {
            rte_eal_init_alert("Cannot get hugepage information.");
            rte_errno = EACCES;
            rte_atomic32_clear(&run_once);
            return -1;
        }
    }

    if (internal_config.memory == 0 && internal_config.force_sockets == 0) {
        if (internal_config.no_hugetlbfs)
            internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
    }

    if (internal_config.vmware_tsc_map == 1) {
#ifdef RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT
        rte_cycles_vmware_tsc_map = 1;
        RTE_LOG (DEBUG, EAL, "Using VMWARE TSC MAP, "
                "you must have monitor_control.pseudo_perfctr = TRUE\n");
#else
        RTE_LOG (WARNING, EAL, "Ignoring --vmware-tsc-map because "
                "RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT is not set\n");
#endif
    }

    rte_srand(rte_rdtsc());

    /*
        Call fopencookie to define a custom stream for writing log messages.
        Call openlog to open the log.
        rte_eal_common_log_init:
            STAILQ_INIT: initialize the singly-linked tail queue whose head
                is log_history
            rte_mempool_create
            if creating the mempool fails, call rte_mempool_lookup:
                get the struct rte_mempool_list head linking all mempools
                walk every node on that list
                compare the name of the struct rte_mempool pointed to by
                    the data field of each struct rte_tailq_entry with the
                    given name
            return the pointer to the struct rte_mempool found, or NULL
    */
    if (rte_eal_log_init(logid, internal_config.syslog_facility) < 0) {
        rte_eal_init_alert("Cannot init logging.");
        rte_errno = ENOMEM;
        rte_atomic32_clear(&run_once);
        return -1;
    }

#ifdef VFIO_PRESENT
    if (rte_eal_vfio_setup() < 0) {
        rte_eal_init_alert("Cannot init VFIO");
        rte_errno = EAGAIN;
        rte_atomic32_clear(&run_once);
        return -1;
    }
#endif
    /* in secondary processes, memory init may allocate additional fbarrays
     * not present in primary processes, so to avoid any potential issues,
     * initialize memzones first.
     */
     /*
     rte_memzone acts as the bookkeeper for the other resources in DPDK's
     memory management. By default, DPDK initialization creates
     RTE_MAX_MEMZONE rte_memzone entries, each of which can record the
     memory location of one rte_ring or rte_mempool.
     Each rte_ring or rte_mempool holds a pointer back to its associated
     rte_memzone.
     A memzone is the basic unit of memory allocation: whenever mempool or
     malloc_heap needs memory, it performs an rte_memzone_reserve.
     rte_memzone_reserve carves a block of memory out of a memseg.
     */
    if (rte_eal_memzone_init() < 0) {
        rte_eal_init_alert("Cannot init memzone");
        rte_errno = ENODEV;
        return -1;
    }
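
    /*
     * Illustration (not part of eal.c): after initialization an application
     * can reserve a named memzone itself, e.g.:
     *   const struct rte_memzone *mz =
     *       rte_memzone_reserve("my_zone", 4096, rte_socket_id(), 0);
     * where "my_zone" is a hypothetical name; mz->addr then points to the
     * reserved block.
     */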

    /*
    1. Obtain the physical address of every reserved hugepage and sort the
       pages by physical address.
    2. Group the hugepages into memsegs according to physical address,
       virtual address, socket_id, and so on.
    3. Share all memseg information among all DPDK processes.
    */
    if (rte_eal_memory_init() < 0) {
        rte_eal_init_alert("Cannot init memory");
        rte_errno = ENOMEM;
        return -1;
    }

    /* the directories are locked during eal_hugepage_info_init */
    // Unlock the hugepage directories (locked earlier by
    // eal_hugepage_info_init).
    eal_hugedirs_unlock();

    /*
    1. This function manages the contiguous memsegs as heaps (the heap
       abstraction over the memory).
    2. It registers the mp requests via register_mp_requests.
    3. rte_memseg_contig_walk walks the contiguous mem segs in the memseg
       lists and hands them to heap management via malloc_add_seg.
    4. The heap bookkeeping itself is implemented in malloc_heap_add_memory.
    */
    if (rte_eal_malloc_heap_init() < 0) {
        rte_eal_init_alert("Cannot init malloc heap");
        rte_errno = ENODEV;
        return -1;
    }
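
    /*
     * Illustration (not part of eal.c): these heaps back the rte_malloc
     * family, e.g.:
     *   void *buf = rte_malloc(NULL, 1024, RTE_CACHE_LINE_SIZE);
     *   rte_free(buf);
     * which allocates 1024 cache-line-aligned bytes from hugepage memory.
     */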

    if (rte_eal_tailqs_init() < 0) {
        rte_eal_init_alert("Cannot init tail queues for objects");
        rte_errno = EFAULT;
        return -1;
    }

    // Fill in the global struct rte_intr_handle and call timerfd_create
    // to create the timer object.
    if (rte_eal_alarm_init() < 0) {
        rte_eal_init_alert("Cannot init interrupt-handling thread");
        /* rte_eal_alarm_init sets rte_errno on failure. */
        return -1;
    }

    /*
        Set the global variable eal_timer_source to EAL_TIMER_TSC
        (TSC/HPET).
        set_tsc_freq: set the TSC frequency (the number of timer ticks per
            second).
        Parse /proc/cpuinfo and check whether "constant_tsc" and
            "nonstop_tsc" appear among the "flags".
    */
    if (rte_eal_timer_init() < 0) {
        rte_eal_init_alert("Cannot init HPET or TSC timers");
        rte_errno = ENOTSUP;
        return -1;
    }

    /*
        Get the NUMA socket of the master lcore.
        rte_eal_get_physmem_layout: get the address of the struct
            rte_memseg array.
        Walk the struct rte_memseg array and check that a matching entry
            exists (belonging to this NUMA socket, with a length greater
            than 0).
    */
    eal_check_mem_on_local_socket();

    /*
        Set the lcore_id of the master thread.
        eal_thread_set_affinity
            rte_sys_gettid: get the thread's tid
            set the thread's CPU affinity and record its NUMA socket, etc.
    */
    eal_thread_init_master(rte_config.master_lcore);

    // Dump the CPU affinity of the current thread.
    ret = eal_thread_dump_affinity(cpuset, sizeof(cpuset));

    RTE_LOG(DEBUG, EAL, "Master lcore %u is ready (tid=%zx;cpuset=[%s%s])\n",
        rte_config.master_lcore, (uintptr_t)thread_id, cpuset,
        ret == 0 ? "" : "...");

    RTE_LCORE_FOREACH_SLAVE(i) {

        /*
         * create communication pipes between master thread
         * and children
         */
        if (pipe(lcore_config[i].pipe_master2slave) < 0)
            rte_panic("Cannot create pipe\n");
        if (pipe(lcore_config[i].pipe_slave2master) < 0)
            rte_panic("Cannot create pipe\n");

        lcore_config[i].state = WAIT;

        /* create a thread for each lcore */
        ret = pthread_create(&lcore_config[i].thread_id, NULL,
                     eal_thread_loop, NULL);
        if (ret != 0)
            rte_panic("Cannot create thread\n");

        /* Set thread_name for aid in debugging. */
        snprintf(thread_name, sizeof(thread_name),
            "lcore-slave-%d", i);
        ret = rte_thread_setname(lcore_config[i].thread_id,
                        thread_name);
        if (ret != 0)
            RTE_LOG(DEBUG, EAL,
                "Cannot set name for lcore thread\n");
    }

    /*
     * Launch a dummy function on all slave lcores, so that master lcore
     * knows they are all ready when this function returns.
     */
     /* Tell all the slave threads to run a dummy function. */
     /*
        Check that every slave thread/lcore is in the WAIT state.
            rte_eal_remote_launch: send the run command to each slave
                thread/lcore:
                get the file descriptor the master writes to (m2s) on the
                    master-to-slave pipe
                get the file descriptor the master reads from (s2m) on the
                    slave-to-master pipe
                store the lcore_function_t function pointer and its argument
                    in the struct lcore_config of the target lcore
                send the command to the slave thread
                wait to read the ack sent back by the slave thread
            if the last argument is CALL_MASTER (lcore handler executed by
                master core), the master thread runs the function as well.
    */
    rte_eal_mp_remote_launch(sync_func, NULL, SKIP_MASTER);
    rte_eal_mp_wait_lcore();
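
    /*
     * Illustration (not part of eal.c): applications launch their own
     * workers through the same mechanism once rte_eal_init() returns, e.g.:
     *   static int my_worker(void *arg); (hypothetical function)
     *   rte_eal_mp_remote_launch(my_worker, NULL, CALL_MASTER);
     *   rte_eal_mp_wait_lcore();
     */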

    /* initialize services so vdevs register service during bus_probe. */
    ret = rte_service_init();
    if (ret) {
        rte_eal_init_alert("rte_service_init() failed");
        rte_errno = ENOEXEC;
        return -1;
    }

    /* Probe all the buses and devices/drivers on them */
    if (rte_bus_probe()) {
        rte_eal_init_alert("Cannot probe devices");
        rte_errno = ENOTSUP;
        return -1;
    }

#ifdef VFIO_PRESENT
    /* Register mp action after probe() so that we got enough info */
    if (rte_vfio_is_enabled("vfio") && vfio_mp_sync_setup() < 0)
        return -1;
#endif

    /* initialize default service/lcore mappings and start running. Ignore
     * -ENOTSUP, as it indicates no service coremask passed to EAL.
     */
    ret = rte_service_start_with_defaults();
    if (ret < 0 && ret != -ENOTSUP) {
        rte_errno = ENOEXEC;
        return -1;
    }

    /*
     * Clean up unused files in runtime directory. We do this at the end of
     * init and not at the beginning because we want to clean stuff up
     * whether we are primary or secondary process, but we cannot remove
     * primary process' files because secondary should be able to run even
     * if primary process is dead.
     *
     * In no_shconf mode, no runtime directory is created in the first
     * place, so no cleanup needed.
     */
    if (!internal_config.no_shconf && eal_clean_runtime_dir() < 0) {
        rte_eal_init_alert("Cannot clear runtime directory\n");
        return -1;
    }

    /*
        In the primary process, write RTE_MAGIC into the magic member of
        the global memory configuration (struct rte_mem_config) to announce
        that the primary's EAL initialization is complete.
    */
    rte_eal_mcfg_complete();

    /* Call each registered callback, if enabled */
    rte_option_init();

    return fctret;
}
Source: https://www.cnblogs.com/mysky007/p/11044542.html