Learning WALT for CPU Load Balancing [Repost]

Reposted from: https://blog.csdn.net/xiaoqiaoq0/article/details/107135747/

Preface

This article continues the notes on CPU scheduling, focusing on WALT. It covers the following topics:

  1. What is WALT?
  2. How does WALT compute its statistics?
  3. How is the data WALT computes used?

1. What is WALT?

WALT is short for Window Assisted Load Tracking:
- Literally, it tracks CPU load with the window as the auxiliary unit;
- In essence it is a computation method that expresses the CPU's current loading as numbers, which are then used for task scheduling, migration, load balancing and similar features;

1.1 Why do we need WALT?

When a technology evolves, and especially when a new computation method is introduced, it is usually because the old approach no longer fits current needs, or because the new one makes life easier;

1.1.1 What are the shortcomings of PELT?

When PELT was introduced, Linux was mainly used on servers, where the emphasis was on performance and power consumption was not yet a primary concern. With the rise of mobile devices, power consumption and responsiveness became the factors users perceive directly, and thus the main considerations of current development:

  1. On today's mobile devices, UI-related scenarios must respond as quickly as possible, otherwise the user clearly perceives jank;
  2. Power consumption is likewise unavoidable: a phone that needs charging all the time will not sell well;
  3. Whether a task counts as heavy should follow the user scenario; for instance, the same class of task matters more or less depending on what is being displayed, so its weight needs to change dynamically with the situation;

Scheduling based on PELT (a decay-based computation) reflects continuous trends well, but is not friendly to fast, bursty changes:

  1. It responds slowly to rapid rises and drops: because of the decaying computation, a real increase or decrease in load only shows up in the numbers after some number of periods, so the response lags;
  2. Because of the decay mechanism, when a task sleeps for a while its computed load shrinks; if that task is something like a network transfer that periodically needs CPU capacity and frequency, PELT cannot respond quickly (the computation reflects trends and averages rather than the instantaneous state);

1.2 How WALT addresses this

Given the reasons above, what is needed is a computation method that builds on PELT (keeping its benefits) but better fits today's requirements:

  1. Report its data in a more timely way;
  2. Have the data reflect the current state directly;
  3. Not increase the amount of computation required;

1.2.1 WALT's approach

Here is a summary of what WALT achieves (and needs to achieve):

  1. Keep tracking every task entity;
  2. On top of the previous usage (load) statistics, record each task's demand, which is later used for prediction;
  3. The total load of each CPU's runqueue is still the sum over all its tasks;
  4. The core difference is in the computation: instead of decaying, time is divided into windows, so the sampled data reflects actual changes quickly (in contrast to PELT's trend-following). Some excerpts from the upstream documentation (a small worked example of the demand policy follows this list):
    1. A task's demand is the maximum of its contribution to the most recently completed window and its average demand over the past N windows.
    2. WALT "forgets" blocked time entirely: only runnable and running time are accounted, which gives a more accurate picture of a task's actual cost and lets demand be used for prediction;
    3. CPU busy time - The sum of execution times of all tasks in the most recently completed window;
    4. WALT "forgets" cpu utilization as soon as tasks are taken off of the runqueue;
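
A small worked example of the demand policy (assuming the defaults discussed later: 20ms windows, a history of 5 windows, and the WINDOW_STATS_MAX_RECENT_AVG policy): if a task's scaled busy time over the last 5 windows was {12, 6, 9, 15, 8} ms, with 12ms being the most recently completed window, then the history average is (12+6+9+15+8)/5 = 10ms and demand = max(10, 12) = 12ms. A task that just turned busy is therefore reflected by its latest window immediately, instead of being dragged down by the older, quieter windows.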

1.2.2 Where the data is applied

  1. Gathering the load of each CPU and task before a task is placed;
  2. Task migration;
  3. Placement across big and little cores;
  4. EAS placement;

1.3 Which versions include it

  1. Said to be introduced after linux 4.8.2 (however, browsing the code on bootlin, even the latest 5.8 has no corresponding file, i.e. it has not landed in mainline);
  2. Introduced in android kernels after 4.4 (android kernel 4.9 does contain this code);

2. How to enable WALT in the kernel

The android kernel source already contains this feature, but depending on the vendor it may not be enabled:

  1. Enable it via menuconfig:
    1. menuconfig ==> General setup ==> CPU/Task time and stats accounting ==> support window based load tracking
    2. (figure: kernel config screenshot)
  2. Or modify the defconfig directly:
    1. Add CONFIG_SCHED_WALT=y to kernel/arch/arm64/configs/defconfig
  3. Build the image and verify the change took effect:
    demo:/sys/kernel/tracing # zcat /proc/config.gz | grep WALT

    CONFIG_SCHED_WALT=y
    CONFIG_HID_WALTOP=y

  4. Test
    So far the WALT statistics can only be seen in ftrace; there is no real application here yet to confirm an actual improvement or other numbers (the upstream material does contain some figures, but those are not local measurements);

3. How WALT computes

This section uses both the principle and the code to explain the computation WALT uses:

  1. How are windows divided?
  2. How are tasks classified, and how is each class handled?
  3. How is the WALT data updated?
  4. How is the updated data consumed by the scheduler and EAS?

3.1 Window division

First, how is the auxiliary unit, the window, divided?
Simply put, starting from boot, time is cut into fixed-length periods; each task's load is accounted per period and the results are rolled up into the runqueue;

What else needs to be decided?

  1. How long should one window be? This is tuned differently per project; the current kernel default is 20ms (see the constants sketched below);
  2. How many windows of load history should be kept? Also project-dependent; the current kernel default is 5 windows;
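
For reference, these defaults appear roughly as follows in kernel/sched/walt.c of an android 4.9 tree (an illustrative excerpt, not guaranteed verbatim; vendor kernels may tune the values):

/* window size: 20ms, expressed in nanoseconds */
__read_mostly unsigned int walt_ravg_window = 20000000;

/* number of windows of history kept per task */
static __read_mostly unsigned int walt_ravg_hist_size = 5;

/* default policy: max of the most recent window and the history average */
static __read_mostly unsigned int walt_window_stats_policy =
		WINDOW_STATS_MAX_RECENT_AVG;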

So for a given task and the window grid, the following situations are possible:
(figure: task segments laid over windows)
ps: ms = mark_start (start of the task segment), ws = window_start (start of the current window), wc = wallclock (current system time)

  1. The task segment starts in this window and is still in it when the statistics are taken, i.e. the task stays within one window;
  2. The task segment starts in the previous window and the statistics are taken in the current one, i.e. the task spans two windows;
  3. The task segment starts in some earlier window and the statistics are taken in the current one, i.e. the task spans several complete windows;
    (figure: the three window-crossing cases)
    These are the only three ways a task maps onto windows, and all the computations below are based on this division;

3.2 Task classification

As one might expect, tasks of different kinds or in different states are accounted with different formulas. WALT distinguishes the following task events:
(figure: task event classification)
The figure lists the functions called for each task event;

3.2.1 Deciding whether to update demand

When updating demand, the task event is first checked to decide whether an update is needed at all:
(figure: which events contribute to demand)
Corresponding function:

static int account_busy_for_task_demand(struct task_struct *p, int event)
{
	/* No need to bother updating task demand for exiting tasks
	 * or the idle task. */
	 //exiting tasks and the idle task are not accounted
	if (exiting_task(p) || is_idle_task(p))
		return 0;

	/* When a task is waking up it is completing a segment of non-busy
	 * time. Likewise, if wait time is not treated as busy time, then
	 * when a task begins to run or is migrated, it is not running and
	 * is completing a segment of non-busy time. */
	// walt_account_wait_time defaults to 1, so only TASK_WAKE returns 0 here
	if (event == TASK_WAKE || (!walt_account_wait_time &&
			 (event == PICK_NEXT_TASK || event == TASK_MIGRATE)))
		return 0;

	return 1;
}
3.2.2 Deciding whether to update CPU busy time

When updating CPU busy time, the task event is likewise checked first:
(figure: which events contribute to busy time)
Corresponding function:

static int account_busy_for_cpu_time(struct rq *rq, struct task_struct *p,
				     u64 irqtime, int event)
{
//is p the idle task or a normal task?
	if (is_idle_task(p)) {
		/* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
		// the scheduler picked the idle task as the next task
		if (event == PICK_NEXT_TASK)
			return 0;
	
		/* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
		// if there is irq time or the cpu is waiting on IO, the idle task does contribute busy time;
		return irqtime || cpu_is_waiting_on_io(rq);
	}

	//a wakeup does not contribute busy time;
	if (event == TASK_WAKE)
		return 0;

	//for a non-idle task, the following event types are accounted
	if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
					 event == TASK_UPDATE)
		return 1;

	/* Only TASK_MIGRATE && PICK_NEXT_TASK left */
	//defaults to 0
	return walt_freq_account_wait_time;
}

3.3 How is the data updated? (call flow)

The previous two subsections covered how a task's time is accounted onto windows and which events are accounted for which statistic. Now let's look at the core call flow, starting with a picture:
(figure: WALT call-flow overview, exported from xmind)
Since the exported diagram may not zoom well, the flow is described here explicitly:

  1. The entry function walt_update_task_ravg
  2. The demand update function
  3. The cpu busy time update function

3.3.1 The entry function

walt_update_task_ravg
Corresponding function:

/* Reflect task activity on its demand and cpu's busy time statistics */
void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
		 int event, u64 wallclock, u64 irqtime)
{
	//bail out if WALT is disabled or the window has not been initialized
	if (walt_disabled || !rq->window_start)
		return;
	lockdep_assert_held(&rq->lock);
	//update window_start and cum_window_demand
	update_window_start(rq, wallclock);

	if (!p->ravg.mark_start)
		goto done;
	//update the statistics: demand and busy time
	update_task_demand(p, rq, event, wallclock);
	update_cpu_busy_time(p, rq, event, wallclock, irqtime);

done:
	// trace
	trace_walt_update_task_ravg(p, rq, event, wallclock, irqtime);
	// update mark_start
	p->ravg.mark_start = wallclock;
}

The function does three things:

  1. Update the current window start time, preparing for the data updates that follow;
  2. Update the task's demand; note that the corresponding rq fields are updated as well;
  3. Update the task's contribution to cpu busy time;

This function is the main entry point of the WALT computation. It is called from many places, listed on the leftmost side of the figure above; in short, interrupts, wakeups, migrations and scheduling all refresh the load statistics. They are not detailed one by one here; a representative call site is sketched after the following list:

  1. task awakened
  2. task starts executing
  3. task stops executing
  4. task exits
  5. window rollover
  6. interrupt
  7. scheduler_tick
  8. task migration
  9. freq change
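
As a representative example, the tick path calls the entry function for the currently running task roughly like this (a sketch adapted from the scheduler_tick() path of an android 4.9 kernel; not verbatim, details differ between trees):

void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);

	raw_spin_lock(&rq->lock);
	update_rq_clock(rq);
	/* account rq->curr up to "now" before the regular tick work */
	walt_update_task_ravg(rq->curr, rq, TASK_UPDATE,
			walt_ktime_clock(), 0);
	/* ... the usual tick processing (curr->sched_class->task_tick, etc.) ... */
	raw_spin_unlock(&rq->lock);
}

The other call sites follow the same pattern: sample a wallclock, pick the event type that describes what just happened to the task, and let walt_update_task_ravg() split the elapsed time into windows.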

3.3.2 Updating window start

Before any accounting is done, window_start is refreshed so that the rq's window origin is accurate:
(figure: update_window_start flow)
Corresponding function:

static void
update_window_start(struct rq *rq, u64 wallclock)
{
	s64 delta;
	int nr_windows;
	//time elapsed since the current window started
	delta = wallclock - rq->window_start;
	/* If the MPM global timer is cleared, set delta as 0 to avoid kernel BUG happening */
	if (delta < 0) {
		delta = 0;
		/*
		 * WARN_ONCE(1,
		 * "WALT wallclock appears to have gone backwards or reset
");
		 */
	}

	if (delta < walt_ravg_window) // less than one full window has elapsed, nothing to do;
		return;

	nr_windows = div64_u64(delta, walt_ravg_window);//number of complete windows elapsed
	rq->window_start += (u64)nr_windows * (u64)walt_ravg_window;//advance window_start by that many windows

	rq->cum_window_demand = rq->cumulative_runnable_avg;//the new window's cumulative demand restarts from cumulative_runnable_avg
}
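
A quick numeric check of the logic above, assuming the 20ms default window: if rq->window_start = 100ms and wallclock = 147ms, then delta = 47ms, nr_windows = 2 and window_start advances by 2 * 20ms to 140ms, leaving 7ms inside the new current window. If wallclock were 115ms, delta = 15ms < 20ms and the function returns without moving window_start.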

3.3.3 Updating demand

3.3.3.1 Main demand logic:

(figure: update_task_demand flow)
Corresponding function:

/*
 * Account cpu demand of task and/or update task's cpu demand history
 *
 * ms = p->ravg.mark_start;
 * wc = wallclock
 * ws = rq->window_start
 *
 * Three possibilities:
 *
 *	a) Task event is contained within one window.
 *		window_start < mark_start < wallclock
 *
 *		ws   ms  wc
 *		|    |   |
 *		V    V   V
 *		|---------------|
 *
 *	In this case, p->ravg.sum is updated *iff* event is appropriate
 *	(ex: event == PUT_PREV_TASK)
 *
 *	b) Task event spans two windows.
 *		mark_start < window_start < wallclock
 *
 *		ms   ws   wc
 *		|    |    |
 *		V    V    V
 *		-----|-------------------
 *
 *	In this case, p->ravg.sum is updated with (ws - ms) *iff* event
 *	is appropriate, then a new window sample is recorded followed
 *	by p->ravg.sum being set to (wc - ws) *iff* event is appropriate.
 *
 *	c) Task event spans more than two windows.
 *
 *		ms ws_tmp			   ws  wc
 *		|  |				   |   |
 *		V  V				   V   V
 *		---|-------|-------|-------|-------|------
 *		   |				   |
 *		   |<------ nr_full_windows ------>|
 *
 *	In this case, p->ravg.sum is updated with (ws_tmp - ms) first *iff*
 *	event is appropriate, window sample of p->ravg.sum is recorded,
 *	'nr_full_window' samples of window_size is also recorded *iff*
 *	event is appropriate and finally p->ravg.sum is set to (wc - ws)
 *	*iff* event is appropriate.
 *
 * IMPORTANT : Leave p->ravg.mark_start unchanged, as update_cpu_busy_time()
 * depends on it!
 */
static void update_task_demand(struct task_struct *p, struct rq *rq,
		 int event, u64 wallclock)
{
	u64 mark_start = p->ravg.mark_start;//mark_start comes from the task
	u64 delta, window_start = rq->window_start;//window_start comes from the rq
	int new_window, nr_full_windows;
	u32 window_size = walt_ravg_window;

	//first check ms against ws: did the task's segment start before the current window, i.e. has a new window started?
	new_window = mark_start < window_start;
	if (!account_busy_for_task_demand(p, event)) {
		if (new_window)
			/* If the time accounted isn't being accounted as
			 * busy time, and a new window started, only the
			 * previous window need be closed out with the
			 * pre-existing demand. Multiple windows may have
			 * elapsed, but since empty windows are dropped,
			 * it is not necessary to account those. */
			update_history(rq, p, p->ravg.sum, 1, event);
		return;
	}

	// if ms > ws this is case a: account wc - ms, the time actually spent inside this window;
	if (!new_window) {
		/* The simple case - busy time contained within the existing
		 * window. */
		add_to_task_demand(rq, p, wallclock - mark_start);
		return;
	}

	//the segment spans more than one window
	/* Busy time spans at least two windows. Temporarily rewind
	 * window_start to first window boundary after mark_start. */
	//time from ms to ws, which may contain several full windows
	delta = window_start - mark_start;
	nr_full_windows = div64_u64(delta, window_size);
	window_start -= (u64)nr_full_windows * (u64)window_size;
	//window_start has now been rewound to ws_tmp:
	
	/* Process (window_start - mark_start) first */
	//first account the partial window at the beginning (ms up to the first window boundary)
	add_to_task_demand(rq, p, window_start - mark_start);

	/* Push new sample(s) into task's demand history */
	
	//update the history
	update_history(rq, p, p->ravg.sum, 1, event);
	if (nr_full_windows)
		update_history(rq, p, scale_exec_time(window_size, rq),
				   nr_full_windows, event);

	/* Roll window_start back to current to process any remainder
	 * in current window. */
	// roll window_start forward to its real value again
	window_start += (u64)nr_full_windows * (u64)window_size;

	/* Process (wallclock - window_start) next */
	//finally account the partial current window; overall the flow is similar to PELT's accounting, with the history handling added;
	mark_start = window_start;
	add_to_task_demand(rq, p, wallclock - mark_start);
}		

//accumulating demand:
static void add_to_task_demand(struct rq *rq, struct task_struct *p,
		u64 delta)
{
	//the raw runtime is first scaled into CPU-capacity terms, roughly current capacity / 1024 (see the sketch after this function);
	delta = scale_exec_time(delta, rq);
	p->ravg.sum += delta;
	//clamp sum so it never exceeds the window size;
	if (unlikely(p->ravg.sum > walt_ravg_window))
		p->ravg.sum = walt_ravg_window;
}
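
scale_exec_time() itself is not listed in this post. Conceptually it normalizes the raw wallclock delta by the CPU's current frequency/capacity so that demand is comparable across CPUs and operating points. A minimal sketch of the idea (not the actual kernel implementation; cpu_capacity_curr_of() is a hypothetical helper standing in for however the current capacity is obtained):

/* Illustrative only: scale raw runtime by current capacity / 1024. The real
 * scale_exec_time() also folds in frequency and cpu efficiency factors. */
static u64 scale_exec_time_sketch(u64 delta, struct rq *rq)
{
	u64 capcurr = cpu_capacity_curr_of(cpu_of(rq)); /* hypothetical helper */

	return (delta * capcurr) >> SCHED_CAPACITY_SHIFT; /* i.e. / 1024 */
}
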
3.3.3.2 The update_history logic:

A summary of update_history:

  1. The function is called when a task enters a new window;
  2. It updates the task's demand based on the last few windows;
  3. It also updates the rq's usage based on the newly computed demand;
    (figure: update_history flow)
    Corresponding function:
/*
 * Called when new window is starting for a task, to record cpu usage over
 * recently concluded window(s). Normally 'samples' should be 1. It can be > 1
 * when, say, a real-time task runs without preemption for several windows at a
 * stretch.
 */
 
static void update_history(struct rq *rq, struct task_struct *p,
			 u32 runtime, int samples, int event)
{
	u32 *hist = &p->ravg.sum_history[0];//pointer to the task's window history array
	int ridx, widx;
	u32 max = 0, avg, demand;
	u64 sum = 0;

	/* Ignore windows where task had no activity */
	if (!runtime || is_idle_task(p) || exiting_task(p) || !samples)
			goto done;

	/* Push new 'runtime' value onto stack */
	widx = walt_ravg_hist_size - 1;// index of the oldest (last) history slot
	ridx = widx - samples;//shift source index: the 'samples' oldest entries fall off the end

//the two loops below shift the history, insert the new window sample(s), and track sum and max along the way;
	for (; ridx >= 0; --widx, --ridx) {
		hist[widx] = hist[ridx];
		sum += hist[widx];
		if (hist[widx] > max)
			max = hist[widx];
	}

	for (widx = 0; widx < samples && widx < walt_ravg_hist_size; widx++) {
		hist[widx] = runtime;
		sum += hist[widx];
		if (hist[widx] > max)
			max = hist[widx];
	}
// reset the task's running sum for the new window;
	p->ravg.sum = 0;

//demand is derived from the history windows according to the policy; the default is WINDOW_STATS_MAX_RECENT_AVG, which picks the larger of the history average and the most recent window;
	if (walt_window_stats_policy == WINDOW_STATS_RECENT) {
		demand = runtime;
	} else if (walt_window_stats_policy == WINDOW_STATS_MAX) {
		demand = max;
	} else {
		avg = div64_u64(sum, walt_ravg_hist_size);
		if (walt_window_stats_policy == WINDOW_STATS_AVG)
			demand = avg;
		else
			demand = max(avg, runtime);
	}

	/*
	 * A throttled deadline sched class task gets dequeued without
	 * changing p->on_rq. Since the dequeue decrements hmp stats
	 * avoid decrementing it here again.
	 *
	 * When window is rolled over, the cumulative window demand
	 * is reset to the cumulative runnable average (contribution from
	 * the tasks on the runqueue). If the current task is dequeued
	 * already, it's demand is not included in the cumulative runnable
	 * average. So add the task demand separately to cumulative window
	 * demand.
	 */
//fix up the runnable accounting, unless this is a throttled deadline task
	if (!task_has_dl_policy(p) || !p->dl.dl_throttled) {
		if (task_on_rq_queued(p))//queued on the runqueue but not currently running
			fixup_cumulative_runnable_avg(rq, p, demand);//apply the delta between the new demand and the task's recorded demand to cumulative_runnable_avg
		else if (rq->curr == p)//this task is the one currently running
			fixup_cum_window_demand(rq, demand);//add the demand to the rq's cum_window_demand
	}
//finally store the newly computed demand in the task;
	p->ravg.demand = demand;

done:
	trace_walt_update_history(rq, p, runtime, samples, event);
	return;
}

//update cumulative_runnable_avg;
static void
fixup_cumulative_runnable_avg(struct rq *rq,
			      struct task_struct *p, u64 new_task_load)
{
//delta between the new demand and the demand recorded in p (may be negative)
	s64 task_load_delta = (s64)new_task_load - task_load(p);
//apply the delta to the rq
	rq->cumulative_runnable_avg += task_load_delta;
	if ((s64)rq->cumulative_runnable_avg < 0)
		panic("cra less than zero: tld: %lld, task_load(p) = %u
",
			task_load_delta, task_load(p));
//
	fixup_cum_window_demand(rq, task_load_delta);
}

//update cum_window_demand by directly accumulating the passed-in delta
static inline void fixup_cum_window_demand(struct rq *rq, s64 delta)
{
	rq->cum_window_demand += delta;
	if (unlikely((s64)rq->cum_window_demand < 0))
		rq->cum_window_demand = 0;
}

//so the fields actually updated here are cum_window_demand and cumulative_runnable_avg
//the same two fields are also updated by the functions below: one adds, the other subtracts,
void
walt_inc_cumulative_runnable_avg(struct rq *rq,
				 struct task_struct *p)
{
	rq->cumulative_runnable_avg += p->ravg.demand;

	/*
	 * Add a task's contribution to the cumulative window demand when
	 *
	 * (1) task is enqueued with on_rq = 1 i.e migration,
	 *     prio/cgroup/class change.
	 * (2) task is waking for the first time in this window.
	 */
	if (p->on_rq || (p->last_sleep_ts < rq->window_start))
		fixup_cum_window_demand(rq, p->ravg.demand);
}

void
walt_dec_cumulative_runnable_avg(struct rq *rq,
				 struct task_struct *p)
{
	rq->cumulative_runnable_avg -= p->ravg.demand;
	BUG_ON((s64)rq->cumulative_runnable_avg < 0);

	/*
	 * on_rq will be 1 for sleeping tasks. So check if the task
	 * is migrating or dequeuing in RUNNING state to change the
	 * prio/cgroup/class.
	 */
	if (task_on_rq_migrating(p) || p->state == TASK_RUNNING)
		fixup_cum_window_demand(rq, -(s64)p->ravg.demand);
}

//searching the code for callers of these two functions:
//they are called from the fair, dl, rt and stop_task sched classes: inc on enqueue, dec on dequeue;
//this accounting is done before rq->nr_running is updated;

Notes about these functions have been added as comments in the code; questions are welcome;

3.3.3.3 Summary of the demand update

The demand update mainly does the following:

  1. The computation covers tasks contained in a single window as well as tasks spanning two or more windows, i.e. exactly the window division described earlier (a worked example follows this list);
  2. Note that window_start and mark_start are local variables here; the values stored in the task are not updated, because the busy time computation afterwards still needs them;
  3. What the demand update actually changes is ravg.sum (and ravg.demand via update_history) in the task, plus cumulative_runnable_avg and cum_window_demand in the rq;
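
To make the multi-window case concrete, a small example with the 20ms default window: suppose mark_start = 5ms, window_start = 60ms and wallclock = 73ms. Then delta = 60 - 5 = 55ms and nr_full_windows = 2, so window_start is temporarily rewound to 60 - 2*20 = 20ms. update_task_demand() first adds 20 - 5 = 15ms to p->ravg.sum, closes that window into the history, records two more history samples of a full (scaled) 20ms window, rolls window_start forward to 60ms again, and finally adds the remaining 73 - 60 = 13ms to the new window's sum (all values scaled by scale_exec_time() in practice).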

3.3.4 Updating cpu busy time

The flow chart of this function is even larger, mainly because it computes different cases separately; the window division is the same as described earlier, but the concrete values accounted differ slightly:
(figure: update_cpu_busy_time flow)
Corresponding function:

/*
 * Account cpu activity in its busy time counters (rq->curr/prev_runnable_sum)
 */
static void update_cpu_busy_time(struct task_struct *p, struct rq *rq,
		 int event, u64 wallclock, u64 irqtime)
{
	int new_window, nr_full_windows = 0;
	int p_is_curr_task = (p == rq->curr);
	u64 mark_start = p->ravg.mark_start;    //ms
	u64 window_start = rq->window_start;    //ws
	u32 window_size = walt_ravg_window;    //window size 
	u64 delta;

	//gather the initial values
	new_window = mark_start < window_start;// is task period in a new window?
	if (new_window) {
		// update nr_full_windows
		nr_full_windows = div64_u64((window_start - mark_start),
						window_size);
		if (p->ravg.active_windows < USHRT_MAX)
			p->ravg.active_windows++;
	}

	/* Handle per-task window rollover. We don't care about the idle
	 * task or exiting tasks. */
	if (new_window && !is_idle_task(p) && !exiting_task(p)) {
		u32 curr_window = 0;

		if (!nr_full_windows)
			curr_window = p->ravg.curr_window;
		//update prev
		p->ravg.prev_window = curr_window;
		p->ravg.curr_window = 0;
	}

	// based on the event and irqtime, return early if this event contributes no busy time;
	if (!account_busy_for_cpu_time(rq, p, irqtime, event)) {
		/* account_busy_for_cpu_time() = 0, so no update to the
		 * task's current window needs to be made. This could be
		 * for example
		 *
		 *   - a wakeup event on a task within the current
		 *     window (!new_window below, no action required),
		 *   - switching to a new task from idle (PICK_NEXT_TASK)
		 *     in a new window where irqtime is 0 and we aren't
		 *     waiting on IO */

		if (!new_window)
			return;

		/* A new window has started. The RQ demand must be rolled
		 * over if p is the current task. */
		if (p_is_curr_task) {
			u64 prev_sum = 0;

			/* p is either idle task or an exiting task */
			if (!nr_full_windows) {
				prev_sum = rq->curr_runnable_sum;
			}

			rq->prev_runnable_sum = prev_sum;
			rq->curr_runnable_sum = 0;
		}

		return;
	}

	//the task's segment started inside the current window: judge the type (this is the core case), then compute and account the delta
	if (!new_window) {
		/* account_busy_for_cpu_time() = 1 so busy time needs
		 * to be accounted to the current window. No rollover
		 * since we didn't start a new window. An example of this is
		 * when a task starts execution and then sleeps within the
		 * same window. */
		//condition: not irq time, or not the idle task, or the cpu is waiting on IO
		if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq))
			delta = wallclock - mark_start;
		else
			delta = irqtime;
		//scale the time and add it to the current window
		delta = scale_exec_time(delta, rq);
		rq->curr_runnable_sum += delta;
		if (!is_idle_task(p) && !exiting_task(p))
			p->ravg.curr_window += delta;

		return;
	}

	// busy time needs accounting in the current window, but p is not the current task (e.g. a migrating task), so the window is not rolled over;
	if (!p_is_curr_task) {
		/* account_busy_for_cpu_time() = 1 so busy time needs
		 * to be accounted to the current window. A new window
		 * has also started, but p is not the current task, so the
		 * window is not rolled over - just split up and account
		 * as necessary into curr and prev. The window is only
		 * rolled over when a new window is processed for the current
		 * task.
		 *
		 * Irqtime can't be accounted by a task that isn't the
		 * currently running task. */
		//split the accounting into two parts: prev window and curr window
		if (!nr_full_windows) {
			/* A full window hasn't elapsed, account partial
			 * contribution to previous completed window. */
			delta = scale_exec_time(window_start - mark_start, rq);
			if (!exiting_task(p))
				p->ravg.prev_window += delta;
		} else {
			/* Since at least one full window has elapsed,
			 * the contribution to the previous window is the
			 * full window (window_size). */
			delta = scale_exec_time(window_size, rq);
			if (!exiting_task(p))
				p->ravg.prev_window = delta;
		}
		rq->prev_runnable_sum += delta;

		/* Account piece of busy time in the current window. */
		delta = scale_exec_time(wallclock - window_start, rq);
		rq->curr_runnable_sum += delta;
		if (!exiting_task(p))
			p->ravg.curr_window = delta;

		return;
	}

	//p is the current task and a new window has started: roll the windows over
	if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) {
		/* account_busy_for_cpu_time() = 1 so busy time needs
		 * to be accounted to the current window. A new window
		 * has started and p is the current task so rollover is
		 * needed. If any of these three above conditions are true
		 * then this busy time can't be accounted as irqtime.
		 *
		 * Busy time for the idle task or exiting tasks need not
		 * be accounted.
		 *
		 * An example of this would be a task that starts execution
		 * and then sleeps once a new window has begun. */

		if (!nr_full_windows) {
			/* A full window hasn't elapsed, account partial
			 * contribution to previous completed window. */
			delta = scale_exec_time(window_start - mark_start, rq);
			if (!is_idle_task(p) && !exiting_task(p))
				p->ravg.prev_window += delta;

			delta += rq->curr_runnable_sum;
		} else {
			/* Since at least one full window has elapsed,
			 * the contribution to the previous window is the
			 * full window (window_size). */
			delta = scale_exec_time(window_size, rq);
			if (!is_idle_task(p) && !exiting_task(p))
				p->ravg.prev_window = delta;

		}
		/*
		 * Rollover for normal runnable sum is done here by overwriting
		 * the values in prev_runnable_sum and curr_runnable_sum.
		 * Rollover for new task runnable sum has completed by previous
		 * if-else statement.
		 */
		rq->prev_runnable_sum = delta;

		/* Account piece of busy time in the current window. */
		delta = scale_exec_time(wallclock - window_start, rq);
		rq->curr_runnable_sum = delta;
		if (!is_idle_task(p) && !exiting_task(p))
			p->ravg.curr_window = delta;

		return;
	}

	//irq time accounted against the idle task
	if (irqtime) {
		/* account_busy_for_cpu_time() = 1 so busy time needs
		 * to be accounted to the current window. A new window
		 * has started and p is the current task so rollover is
		 * needed. The current task must be the idle task because
		 * irqtime is not accounted for any other task.
		 *
		 * Irqtime will be accounted each time we process IRQ activity
		 * after a period of idleness, so we know the IRQ busy time
		 * started at wallclock - irqtime. */

		BUG_ON(!is_idle_task(p));
		mark_start = wallclock - irqtime;

		/* Roll window over. If IRQ busy time was just in the current
		 * window then that is all that need be accounted. */
		rq->prev_runnable_sum = rq->curr_runnable_sum;
		if (mark_start > window_start) {
			rq->curr_runnable_sum = scale_exec_time(irqtime, rq);
			return;
		}

		/* The IRQ busy time spanned multiple windows. Process the
		 * busy time preceding the current window start first. */
		delta = window_start - mark_start;
		if (delta > window_size)
			delta = window_size;
		delta = scale_exec_time(delta, rq);
		rq->prev_runnable_sum += delta;

		/* Process the remaining IRQ busy time in the current window. */
		delta = wallclock - window_start;
		rq->curr_runnable_sum = scale_exec_time(delta, rq);

		return;
	}

	BUG();
}	

The details are commented inside the function; a brief summary (a sketch of how these counters may be consumed follows this list):

  1. Busy time is computed differently depending on the task/event type;
  2. The core approach is the same in every branch; only the concrete values differ;
  3. The data updated is:
    prev_window and curr_window in the task;
    prev_runnable_sum and curr_runnable_sum in the rq;
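
How these counters are consumed by the scheduler and by cpufreq is left for a follow-up (see the appendix), but as a flavour: when WALT is enabled, cpu-utilization helpers typically derive utilization from the WALT accounting instead of the PELT averages, roughly along the following lines (a sketch based on android 4.9-era kernels, not verbatim; whether cumulative_runnable_avg or prev_runnable_sum is used differs between trees):

/* Illustrative sketch: turn the rq's WALT accounting into a 0..capacity
 * utilization value by normalizing against the window length. */
static unsigned long cpu_util_walt_sketch(int cpu)
{
	u64 util = cpu_rq(cpu)->cumulative_runnable_avg;

	util <<= SCHED_CAPACITY_SHIFT;	/* scale to the 0..1024 range */
	do_div(util, walt_ravg_window);	/* normalize by the window length */

	return min_t(unsigned long, util, capacity_orig_of(cpu));
}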

3.3.5 irq load accounting

3.3.5.1 The three irq-related fields:

cur_irqload: the irq time accumulated on this rq in the current period, i.e. irq execution time;
avg_irqload: the rq's average irq load; it is tied to the interrupt frequency, decays step by step, and accumulates;
irqload_ts: the time of the last WALT irqload update, used to judge how frequently interrupts arrive;

3.3.5.2 Call flow

All three fields are set to 0 in sched_init; as noted earlier, this path is invoked from interrupt handling. In detail:

void walt_account_irqtime(int cpu, struct task_struct *curr,
				 u64 delta, u64 wallclock)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags, nr_windows;
	u64 cur_jiffies_ts;

	raw_spin_lock_irqsave(&rq->lock, flags);

	/*
	 * cputime (wallclock) uses sched_clock so use the same here for
	 * consistency.
	 */
	 //correct for the time that elapsed between sampling wallclock and reaching this point;
	 //note how delta was computed by the caller: sched_clock_cpu - irq_start_time,
	 //i.e. delta is the irq execution time;
	delta += sched_clock() - wallclock;
	cur_jiffies_ts = get_jiffies_64();

	//if the current task is the idle task, run the WALT update, using the current time as wallclock and delta as the irq execution time
	if (is_idle_task(curr))
		walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(),
				 delta);

	//time between two irq accounting passes; nr_windows here is a number of jiffies (ticks)
	nr_windows = cur_jiffies_ts - rq->irqload_ts;

	//this reflects how frequently interrupts fire on this CPU, using 10 ticks as the threshold; with HZ=250 one tick is 4ms
	if (nr_windows) {
		if (nr_windows < 10) {//fewer than 10 ticks have passed: decay avg_irqload to 3/4 of its previous value
			/* Decay CPU's irqload by 3/4 for each window. */
			rq->avg_irqload *= (3 * nr_windows);
			rq->avg_irqload = div64_u64(rq->avg_irqload,
						    4 * nr_windows);
		} else {//10 or more ticks have passed: the old avg_irqload is discarded, i.e. reset to 0;
			rq->avg_irqload = 0;
		}
		//fold the current-period irqload into the average
		rq->avg_irqload += rq->cur_irqload;
		rq->cur_irqload = 0;
	}

	rq->cur_irqload += delta;
//irqload_ts is set to the current time; searching the code, irqload_ts is only touched in these two places, so it records the time of the last irq accounting
	rq->irqload_ts = cur_jiffies_ts;
	raw_spin_unlock_irqrestore(&rq->lock, flags);
}

account_irq_enter_time/account_irq_exit_time ==> irqtime_account_irq ==> walt_account_irqtime
The flow is fairly simple:

  1. Statistics are gathered on both irq entry and irq exit;
  2. The value accounted is the irq execution time;
  3. How the rq's average accumulates depends on how frequently interrupts arrive; a small numeric example follows;
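
A quick numeric illustration of the decay (taking HZ = 250, so one tick is 4ms): suppose avg_irqload = 8000ns and cur_irqload = 2000ns when the next interrupt is accounted 3 ticks later. Since nr_windows = 3 < 10, avg_irqload first decays to 8000 * 3/4 = 6000ns, then cur_irqload is folded in, giving avg_irqload = 8000ns, and cur_irqload restarts from the new interrupt's execution time. Had 10 or more ticks elapsed, avg_irqload would have been reset to 0 before cur_irqload was folded in.
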
3.3.5.3 The first consumer of irqload

Judging a cpu's irq load; straight to the code:

#define WALT_HIGH_IRQ_TIMEOUT 3

u64 walt_irqload(int cpu) {
	struct rq *rq = cpu_rq(cpu);
	s64 delta;
	delta = get_jiffies_64() - rq->irqload_ts;

        /*
	 * Current context can be preempted by irq and rq->irqload_ts can be
	 * updated by irq context so that delta can be negative.
	 * But this is okay and we can safely return as this means there
	 * was recent irq occurrence.
	 */
//this check handles the case where irq context updates irqload_ts concurrently and delta changes (or goes negative); why the threshold is exactly 3 is still an open question
        if (delta < WALT_HIGH_IRQ_TIMEOUT)
		return rq->avg_irqload;
        else
		return 0;
}

//this function is used in find_best_target, i.e. when looking for a target CPU during migration, to judge its irq load;
int walt_cpu_high_irqload(int cpu) {
	return walt_irqload(cpu) >= sysctl_sched_walt_cpu_high_irqload;//this threshold defaults to 10ms
}

3.4 Key structures

  1. rq //statistics fields added to the runqueue
  2. task_struct //fields added to task_struct
  3. ravg //the structure holding the per-task WALT accounting

3.4.1 rq

(figure: rq WALT fields)
The corresponding definition:

struct rq {
	
	...
	
#ifdef CONFIG_SCHED_WALT
	u64 cumulative_runnable_avg;
	u64 window_start;
	u64 curr_runnable_sum;
	u64 prev_runnable_sum;
	u64 nt_curr_runnable_sum;
	u64 nt_prev_runnable_sum;
	u64 cur_irqload;
	u64 avg_irqload;
	u64 irqload_ts;
	u64 cum_window_demand;
#endif /* CONFIG_SCHED_WALT */

	...

};

3.4.2 task_struct

(figure: task_struct WALT fields)

struct task_struct {

	...
	
#ifdef CONFIG_SCHED_WALT
	struct ravg ravg;
	/*
	 * 'init_load_pct' represents the initial task load assigned to children
	 * of this task
	 */
	u32 init_load_pct;
	u64 last_sleep_ts;
#endif

	...
}
/* ravg represents frequency scaled cpu-demand of tasks */
struct ravg {
	/*
	 * 'mark_start' marks the beginning of an event (task waking up, task
	 * starting to execute, task being preempted) within a window
	 *
	 * 'sum' represents how runnable a task has been within current
	 * window. It incorporates both running time and wait time and is
	 * frequency scaled.
	 *
	 * 'sum_history' keeps track of history of 'sum' seen over previous
	 * RAVG_HIST_SIZE windows. Windows where task was entirely sleeping are
	 * ignored.
	 *
	 * 'demand' represents maximum sum seen over previous
	 * sysctl_sched_ravg_hist_size windows. 'demand' could drive frequency
	 * demand for tasks.
	 *
	 * 'curr_window' represents task's contribution to cpu busy time
	 * statistics (rq->curr_runnable_sum) in current window
	 *
	 * 'prev_window' represents task's contribution to cpu busy time
	 * statistics (rq->prev_runnable_sum) in previous window
	 */
	u64 mark_start; //  marks the beginning of an event (task waking up, task starting to execute, task being preempted) within a window
	u32 sum, demand; // sum: how runnable the task has been within the current window; demand: the task's cpu demand derived from the recent windows per the window stats policy
	u32 sum_history[RAVG_HIST_SIZE_MAX]; // history of 'sum' over the previous windows
	u32 curr_window, prev_window;
	u16 active_windows;
};
#endif

4. Appendix

4.1 How Linux scheduling has evolved

  1. Runqueues split by priority with active/expired arrays, for faster scheduling;
  2. CFS introduces the notion of virtual time, converting priority into different amounts of physical time;
  3. CFS + PELT: more reasonable task placement and migration;
  4. CFS + WALT: faster response, better suited to devices such as phones, striking a better balance between performance and power;

4.2 To be added

  1. update_history code [done]
  2. irq call flow [done]
  3. How the updated data is consumed ==> plan to trace the top-app flow; hopefully a first pass can be done tomorrow
[Author] Zhang Binghua
Copyright of this article is shared by the author and cnblogs. Reposting is welcome, but this statement must be kept and a link to the original given prominently on the page; otherwise the right to pursue legal liability is reserved.
Original address: https://www.cnblogs.com/sky-heaven/p/13901074.html