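Timers and Time Management

jiffies

The global variable jiffies holds the total number of ticks that have occurred since the system booted. On ARM the default HZ is 100, i.e. one tick every 10 ms. Conceptually the kernel initializes jiffies to 0 at boot (real kernels start from the offset INITIAL_JIFFIES so that wraparound bugs show up shortly after boot), and the timer interrupt handler increments it on every tick. The number of increments per second is HZ (defined in <asm/param.h>), so, counted from an initial value of 0, the system uptime in seconds is jiffies/HZ.

jiffies is defined in <linux/jiffies.h>: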

extern u64 __jiffy_data jiffies_64;
extern unsigned long volatile __jiffy_data jiffies;

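jiffies must be stored in an unsigned long; holding it in any other type is incorrect. It therefore occupies 64 bits on 64-bit machines and 32 bits on 32-bit machines. On a 32-bit machine jiffies is simply the low 32 bits of jiffies_64. The kernel also provides get_jiffies_64() for reading the full jiffies_64, although this is rarely needed. On 64-bit machines jiffies and jiffies_64 refer to the same thing.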
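The relationship between the two counters on a 32-bit machine can be illustrated in user space. This is only a sketch: sketch_low32() is a hypothetical helper, and fixed-width types stand in for the kernel's u64 and a 32-bit unsigned long.

```c
#include <stdint.h>

/*
 * What 32-bit code sees as "jiffies": the low 32 bits of the 64-bit
 * counter. The upper half is simply dropped by the conversion.
 */
uint32_t sketch_low32(uint64_t jiffies_64_value)
{
	return (uint32_t)jiffies_64_value;
}
```

For example, a 64-bit count of 0x100000005 appears as 5 through the 32-bit view, which is why the 32-bit jiffies wraps long before jiffies_64 does.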

In practice jiffies is mostly used for relative times, which are computed from HZ. For example:

unsigned long time_stamp = jiffies ;	// now
unsigned long next_tick = jiffies + 1 ; // one tick from now
unsigned long later	= jiffies + 5 * HZ ;// five seconds from now
unsigned long fraction = jiffies + HZ / 10 ;// 100 ms from now
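The arithmetic behind these expressions can be sketched as two small conversion helpers. They are user-space illustrations only: SKETCH_HZ and both function names are hypothetical, and the tick rate of 100 is an assumption taken from the ARM default mentioned earlier.

```c
/* Assumed tick rate for this sketch; the real HZ is a kernel build option. */
#define SKETCH_HZ 100UL

/* Whole seconds represented by a tick count (uptime = jiffies / HZ). */
unsigned long sketch_ticks_to_secs(unsigned long ticks)
{
	return ticks / SKETCH_HZ;
}

/*
 * Ticks needed to cover at least ms milliseconds, rounding up.
 * Very large ms values would overflow the multiplication; a sketch
 * like this ignores that case.
 */
unsigned long sketch_ms_to_ticks(unsigned long ms)
{
	return (ms * SKETCH_HZ + 999UL) / 1000UL;
}
```

With these assumptions, 100 ms corresponds to 10 ticks, matching HZ / 10 above, while a 5 ms request is rounded up to one full tick because nothing shorter than a tick can be expressed in jiffies.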

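jiffies wraparound

Wraparound happens when the counter overflows. Consider:

unsigned long timeout = jiffies + HZ/2;		// time out in half a second
/*
*	do some work ...
*/
if(timeout > jiffies)
{
	/* no timeout, carry on normally */
}
else
{
	/* timed out, which was not expected: an error occurred */
}

This looks like a reasonable way to check whether the work timed out, but consider one exception. Suppose jiffies is fewer than HZ/2 ticks away from overflowing: timeout then overflows and becomes a very small value. If the work finishes quickly, before jiffies itself has wrapped, the test timeout > jiffies is false even though no timeout occurred, so the code takes the error path and misbehaves unpredictably.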
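The failure can be reproduced with concrete numbers in user space. In this sketch a uint32_t stands in for a 32-bit unsigned long, SKETCH_HZ is an assumed tick rate, and both function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_HZ 100u	/* assumed tick rate */

/* Deadline half a second ahead; unsigned arithmetic wraps modulo 2^32. */
uint32_t sketch_deadline(uint32_t now)
{
	return now + SKETCH_HZ / 2;
}

/* The naive test from the text: "not timed out" while timeout > now. */
int sketch_naive_not_timed_out(uint32_t timeout, uint32_t now)
{
	return timeout > now;
}
```

If the counter is 10 ticks short of its maximum, the deadline wraps to 39. One tick later the naive test compares 39 against a value close to 2^32 and reports a timeout that never happened.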

To handle this, the kernel provides four macros for comparing tick counts that deal with wraparound correctly. They are defined in <linux/jiffies.h>:

/*
 *	These inlines deal with timer wrapping correctly. You are 
 *	strongly encouraged to use them
 *	1. Because people otherwise forget
 *	2. Because if the timer wrap changes in future you won't have to
 *	   alter your driver code.
 *
 * time_after(a,b) returns true if the time a is after time b.
 *
 * Do this with "<0" and ">=0" to only test the sign of the result. A
 * good compiler would generate better code (and a really good compiler
 * wouldn't care). Gcc is currently neither.
 */
#define time_after(a,b)		\
	(typecheck(unsigned long, a) && \
	 typecheck(unsigned long, b) && \
	 ((long)(b) - (long)(a) < 0))
#define time_before(a,b)	time_after(b,a)

#define time_after_eq(a,b)	\
	(typecheck(unsigned long, a) && \
	 typecheck(unsigned long, b) && \
	 ((long)(a) - (long)(b) >= 0))
#define time_before_eq(a,b)	time_after_eq(b,a)

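time_after(a,b) returns true if time a is after time b and false otherwise; time_before(a,b) is simply time_after with its arguments swapped.

time_after_eq(a,b) differs from time_after only in that equality also counts as true; time_before_eq mirrors it in the same way. The principle is identical.

The implementation is simple: both values are cast to long and only the sign of their difference is tested. The sign stays correct across a wrap as long as the two values are less than half the counter range apart.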
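The signed-difference idea can be tried out in user space. The following is a sketch rather than the kernel macro: sketch_time_after() is a hypothetical name, uint32_t again stands in for a 32-bit unsigned long, and the subtraction is done on unsigned values before the cast, which relies on the usual two's complement conversion.

```c
#include <stdint.h>

/*
 * True if tick a is after tick b. Correct as long as the two values
 * are less than 2^31 ticks apart, even if the counter wrapped between
 * them.
 */
int sketch_time_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}
```

A deadline of 39 that was computed just before the counter wrapped is correctly reported as being after a current value of 4294967286, which is exactly the case a plain > comparison gets wrong.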

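Exporting HZ to user space

Changing HZ in the kernel can badly mislead user-space programs that assume a particular value. User space may still believe "I am running at 100 HZ, so a hundred claps make one second", unaware that the kernel now runs at HZ = 1000, where a hundred claps last only 0.1 second!

The kernel therefore converts its own tick counts into the tick counts that user space expects: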

clock_t jiffies_to_clock_t(long x)
{
#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
# if HZ < USER_HZ
	return x * (USER_HZ / HZ);
# else
	return x / (HZ / USER_HZ);
# endif
#else
	return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
#endif
}

This is nothing more than a proportional scaling!
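The simple branch of that conversion can be tried with concrete rates. This sketch assumes a kernel rate of 1000 and a user-visible rate of 100, covers only the case where one rate divides the other evenly, and uses hypothetical names throughout.

```c
#define SKETCH_HZ      1000L	/* assumed kernel tick rate */
#define SKETCH_USER_HZ 100L	/* assumed rate reported to user space */

/* Scale kernel ticks to user-visible ticks by the ratio of the two rates. */
long sketch_ticks_to_user(long x)
{
#if SKETCH_HZ < SKETCH_USER_HZ
	return x * (SKETCH_USER_HZ / SKETCH_HZ);
#else
	return x / (SKETCH_HZ / SKETCH_USER_HZ);
#endif
}
```

Under these assumptions 1000 kernel ticks, one second, are reported as 100 user ticks, so a program that believes in 100 ticks per second still measures one second.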

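Hardware clocks and timers

The architecture provides two hardware devices for timekeeping: the real-time clock and the system timer.

Real-time clock (RTC)

The real-time clock is a device that stores the system time persistently: even when the system is powered off it keeps counting, powered by a small battery on the motherboard. On the PC architecture the RTC and the CMOS are integrated, and a single battery keeps both the RTC running and the BIOS settings preserved.

At boot the kernel reads the RTC to initialize the wall time, which is stored in the xtime variable.

System timer

The system timer plays the most important role in kernel timekeeping. Its implementation differs between architectures, but the underlying idea is the same: provide a mechanism that raises an interrupt periodically.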

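The timer interrupt handler

The handler can be divided into two parts: an architecture-dependent part and an architecture-independent part.

The architecture-dependent routine is registered with the kernel as the interrupt handler of the system timer, so that it runs whenever the timer interrupt fires. What exactly it does depends on the architecture, but most handlers perform at least the following work:

Obtain the xtime_lock lock, which protects access to jiffies and the wall time
Acknowledge or reset the system timer as required
Periodically save the updated wall time to the real-time clock
Call the architecture-independent timer routine: tick_periodic()

The interrupt service routine does most of its work by calling the architecture-independent tick_periodic(), which performs the following: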

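Increment jiffies_64 by one (this is always safe because the lock was obtained earlier)
Update resource usage statistics, such as the system and user time consumed by the current process
Run any dynamic timers that have expired
Execute scheduler_tick()
Update the wall time, which is stored in xtime
Calculate the load average

Each of these steps is carried out by a separate function, so tick_periodic() itself looks quite simple: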

static void tick_periodic(int cpu)
{
	if (tick_do_timer_cpu == cpu) {
		write_seqlock(&xtime_lock);

		/* Keep track of the next tick event */
		tick_next_period = ktime_add(tick_next_period, tick_period);

		do_timer(1);
		write_sequnlock(&xtime_lock);
	}

	update_process_times(user_mode(get_irq_regs()));
	profile_tick(CPU_PROFILING);
}

Much of the important work is done in do_timer() and update_process_times():

void do_timer(unsigned long ticks)
{
	jiffies_64 += ticks;
	update_wall_time();
	calc_global_load();
}

do_timer() increments jiffies_64, updates the wall time, and computes the system load average statistics.

void update_process_times(int user_tick)
{
	struct task_struct *p = current;
	int cpu = smp_processor_id();

	/* Note: this timer irq context must be accounted for as well. */
	account_process_tick(p, user_tick);
	run_local_timers();
	rcu_check_callbacks(cpu, user_tick);
	printk_tick();
	perf_event_do_pending();
	scheduler_tick();
	run_posix_cpu_timers(p);
}

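update_process_times() updates the various tick counts that have been consumed; user_tick tells it whether the tick was spent in user space or in kernel space.

account_process_tick() does the actual accounting of process time. It simply charges the whole previous tick to the process that is running when the interrupt arrives. Because of this granularity, the process may not have held the processor for the entire tick, but the error is currently unavoidable; a higher HZ reduces it.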

/*
 * Account a single tick of cpu time.
 * @p: the process that the cpu time gets accounted to
 * @user_tick: indicates if the tick is a user or a system tick
 */
void account_process_tick(struct task_struct *p, int user_tick)
{
	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
	struct rq *rq = this_rq();

	if (user_tick)
		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
		account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
				    one_jiffy_scaled);
	else
		account_idle_time(cputime_one_jiffy);
}
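The effect of charging a whole tick at a time can be modelled in a few lines of user-space C. This is only a sketch of the idea: the structure, its fields and the function are hypothetical, and the idle test is reduced to a single flag.

```c
/* Per-category tick counters; a hypothetical stand-in for the kernel's accounting. */
struct sketch_cpu_usage {
	unsigned long user_ticks;
	unsigned long system_ticks;
	unsigned long idle_ticks;
};

/*
 * Charge one whole tick based only on what the CPU was doing at the
 * moment the timer interrupt arrived. Whatever else ran earlier in
 * the same tick goes unnoticed.
 */
void sketch_account_tick(struct sketch_cpu_usage *u, int user_tick, int is_idle)
{
	if (user_tick)
		u->user_ticks++;
	else if (!is_idle)
		u->system_ticks++;
	else
		u->idle_ticks++;
}
```

Because each call adds exactly one tick to one counter, the totals are only as precise as the tick length, which is the granularity error described above.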

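Next, run_local_timers() raises a softirq that will run all expired timers.

Finally, scheduler_tick() decrements the timeslice of the currently running process and sets the need_resched flag if necessary.

Once all of this has completed, tick_periodic() returns to the architecture-dependent interrupt handler, which carries on with any remaining work, releases the xtime_lock lock, and finally returns. (In the version of tick_periodic() listed above, xtime_lock is in fact taken and released inside tick_periodic() itself.)

All of this happens every 1/HZ of a second, i.e. the timer interrupt handler runs HZ times per second.

The time of day

This is the wall time, defined in kernel/time/timekeeping.c: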


/*
 * The current time
 * wall_to_monotonic is what we need to add to xtime (or xtime corrected
 * for sub jiffie times) to get to monotonic time.  Monotonic is pegged
 * at zero at system boot time, so wall_to_monotonic will be negative,
 * however, we will ALWAYS keep the tv_nsec part positive so we can use
 * the usual normalization.
 *
 * wall_to_monotonic is moved after resume from suspend for the monotonic
 * time not to jump. We need to add total_sleep_time to wall_to_monotonic
 * to get the real boot based time offset.
 *
 * - wall_to_monotonic is no longer the boot time, getboottime must be
 * used instead.
 */
struct timespec xtime __attribute__ ((aligned (16)));
struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
static struct timespec total_sleep_time;

struct timespec {
        time_t  tv_sec;         /* seconds */
        long    tv_nsec;        /* nanoseconds */
};

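xtime.tv_sec stores the time, in seconds, that has elapsed since January 1, 1970 (UTC). That date is called the epoch, and the wall time of most Unix systems is relative to it.

xtime.tv_nsec records the number of nanoseconds that have elapsed in the current second.

Reading or writing xtime requires the xtime_lock lock, which is not an ordinary spinlock but a seqlock (sequence lock).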
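The way a seqlock lets readers retry instead of blocking can be sketched in user space with C11 atomics. This is a simplified model and not the kernel's implementation: every name is hypothetical, the data fields are made atomic so that the sketch stays well defined, and writers are assumed to be serialized by some other means.

```c
#include <stdatomic.h>

/* Hypothetical user-space model of a seqlock-protected wall time. */
struct sketch_wall_time {
	atomic_uint seq;	/* odd while a write is in progress */
	atomic_long sec;
	atomic_long nsec;
};

/* Writer side; callers must already serialize writers among themselves. */
void sketch_wall_time_write(struct sketch_wall_time *t, long sec, long nsec)
{
	atomic_fetch_add(&t->seq, 1);	/* becomes odd: write begins */
	atomic_store(&t->sec, sec);
	atomic_store(&t->nsec, nsec);
	atomic_fetch_add(&t->seq, 1);	/* becomes even: write complete */
}

/* Reader side: retry until a snapshot is taken with no write in between. */
void sketch_wall_time_read(struct sketch_wall_time *t, long *sec, long *nsec)
{
	unsigned int start;

	do {
		start = atomic_load(&t->seq);
		*sec = atomic_load(&t->sec);
		*nsec = atomic_load(&t->nsec);
	} while ((start & 1u) || atomic_load(&t->seq) != start);
}
```

Readers never block the writer; they only repeat the read if the sequence number was odd or changed underneath them. That suits data such as the wall time, which is read far more often than it is written.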

The user-space interface for obtaining the wall time is gettimeofday(). The corresponding system call in the kernel is sys_gettimeofday(), defined in kernel/time.c:

SYSCALL_DEFINE2(gettimeofday, struct timeval __user *, tv,
		struct timezone __user *, tz)
{
	if (likely(tv != NULL)) {
		struct timeval ktv;
		do_gettimeofday(&ktv);
		if (copy_to_user(tv, &ktv, sizeof(ktv)))
			return -EFAULT;
	}
	if (unlikely(tz != NULL)) {
		if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
			return -EFAULT;
	}
	return 0;
}

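It calls do_gettimeofday(), which in turn calls an architecture-dependent function that reads xtime in a retry loop; the result is then copied back to the user.

The kernel generally uses xtime far less often than user-space programs do, with one notable exception: filesystem code uses xtime when it stores timestamps (creation, access, modification and so on).

Timers

Timers, sometimes called kernel timers or dynamic timers, are the basis for managing the flow of time in the kernel.

Timers are mostly used to defer some work until a specified time has passed: initialize the timer, set an expiration time and the function to run on expiration, then activate it. A timer is deactivated automatically once it has fired instead of running periodically, which is one reason it is called a dynamic timer.

There is also no limit on the number of timers, and they are used throughout the kernel.

Using timers

A timer is represented by struct timer_list, defined in <linux/timer.h>: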


struct timer_list {
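	struct list_head entry;			// entry in the linked list of timers
	unsigned long expires;			// expiration value, in jiffies

	void (*function)(unsigned long);// timer handler function
	unsigned long data;				// unsigned long argument passed to the handler

	struct tvec_base *base;			// internal field, do not touch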
#ifdef CONFIG_TIMER_STATS
	void *start_site;
	char start_comm[16];
	int start_pid;
#endif
#ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
#endif
};

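The kernel provides a family of timer interfaces that simplify timer management.

A timer must first be created and initialized:

struct timer_list my_timer;
init_timer(&my_timer);
/* fill in the remaining fields of the structure */
my_timer.expires = jiffies + delay;		// an absolute tick count
my_timer.data = 0;						// argument passed to the handler
my_timer.function = my_function;		// handler function

The data argument lets a single handler function serve multiple timers.

Finally, the timer must be activated with add_timer():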

void add_timer(struct timer_list *timer)
{
	BUG_ON(timer_pending(timer));
	mod_timer(timer, timer->expires);
}

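As the code shows, add_timer() simply calls mod_timer() to set the timer's expiration, so mod_timer() is also what you call to change the expiration of a timer that is already active. (mod_timer() activates the timer as a side effect, so it works on an inactive timer as well, though there is little point: the fields of an inactive timer can simply be written directly.)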
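The one-shot behaviour and the role of the data argument can be modelled in user space. The following is a much simplified sketch, not the kernel's implementation: all names are hypothetical, the timers live in a plain array, and there is no locking.

```c
#include <stddef.h>

/* Hypothetical, much simplified model of a one-shot timer. */
struct sketch_timer {
	unsigned long expires;			/* absolute tick at which to fire */
	void (*function)(unsigned long);	/* handler */
	unsigned long data;			/* argument passed to the handler */
	int pending;				/* non-zero while the timer is active */
};

/* Activate the timer, or move an already active one to a new expiry. */
void sketch_mod_timer(struct sketch_timer *t, unsigned long expires)
{
	t->expires = expires;
	t->pending = 1;
}

/* Fire every pending timer whose expiry has been reached, then deactivate it. */
void sketch_run_timers(struct sketch_timer *timers, size_t n, unsigned long now)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct sketch_timer *t = &timers[i];

		if (t->pending && (long)(now - t->expires) >= 0) {
			t->pending = 0;		/* one-shot: never re-armed automatically */
			t->function(t->data);
		}
	}
}

/* One handler shared by several timers; data tells them apart. */
unsigned long sketch_fired_mask;

void sketch_handler(unsigned long data)
{
	sketch_fired_mask |= 1UL << data;
}
```

Two timers registered with sketch_handler and data values 0 and 1 set different bits in sketch_fired_mask when they fire, and neither fires a second time unless sketch_mod_timer() is called on it again.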

To stop a timer before it expires, call:

int del_timer(struct timer_list *timer)
{
	struct tvec_base *base;
	unsigned long flags;
	int ret = 0;

	timer_stats_timer_clear_start_info(timer);
	if (timer_pending(timer)) {
		base = lock_timer_base(timer, &flags);
		if (timer_pending(timer)) {
			detach_timer(timer, 1);
			if (timer->expires == base->next_timer &&
			    !tbase_get_deferrable(timer->base))
				base->next_timer = base->timer_jiffies;
			ret = 1;
		}
		spin_unlock_irqrestore(&base->lock, flags);
	}

	return ret;
}

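Note: there is no need to call this function on a timer that has already expired, because expired timers are deactivated automatically.

On multiprocessor systems, prefer del_timer_sync() for deleting a timer, because the timer's handler may be running on another processor at the moment of deletion; del_timer_sync() waits for such a handler to finish.

Timer implementation

The kernel runs timers after the timer interrupt has occurred; they execute as a softirq, in bottom-half context. The interrupt path eventually calls: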

/*
 * Called by the local, per-CPU timer interrupt on SMP.
 */
void run_local_timers(void)
{
	hrtimer_run_queues();
	raise_softirq(TIMER_SOFTIRQ);
	softlockup_tick();
}

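run_timer_softirq() handles the TIMER_SOFTIRQ softirq and runs all expired timers on the current processor.

The kernel also partitions timers into groups according to their expiration time, so that expired timers can be found and run efficiently.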

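Delaying execution

Busy waiting

The simplest (and least desirable) way to delay is:

while(time_before(jiffies,timeout));

Like a spinlock, this burns processor time while it spins, so it is almost never used.

A better form is: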

while(time_before(jiffies,timeout))
{
	cond_resched();
}

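This gives the kernel a chance to schedule other work while waiting. cond_resched() only reschedules when need_resched is set, that is, when a more important process needs to run (hence the "cond" in the name). Because it may invoke the scheduler, this kind of delay can only be used in process context.

Short delays

Sometimes the required delay is shorter than one tick, in which case ticks cannot be used. The kernel provides three functions for delays measured in milliseconds, nanoseconds and microseconds: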


#ifndef mdelay
#define mdelay(n) (\
	(__builtin_constant_p(n) && (n)<=MAX_UDELAY_MS) ? udelay((n)*1000) : \
	({unsigned long __ms=(n); while (__ms--) udelay(1000);}))
#endif


#ifndef ndelay
static inline void ndelay(unsigned long x)
{
	udelay(DIV_ROUND_UP(x, 1000));
}
#define ndelay(x) ndelay(x)
#endif


#define udelay(n)							\
	(__builtin_constant_p(n) ?					\
	  ((n) > (MAX_UDELAY_MS * 1000) ? __bad_udelay() :		\
			__const_udelay((n) * ((2199023U*HZ)>>11))) :	\
	  __udelay(n))


All of these are implemented as busy loops and are fairly precise.
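The generic ndelay() listed above rounds a nanosecond count up to whole microseconds before handing it to udelay(). That rounding can be illustrated with a small user-space helper; the function name is hypothetical and the expression is the usual round-up division.

```c
/* Microseconds needed to cover ns nanoseconds, rounding up (never too short). */
unsigned long sketch_ns_to_us_round_up(unsigned long ns)
{
	return (ns + 999UL) / 1000UL;
}
```

Rounding up means a request for 1 ns still waits a full microsecond through this path. The delay is therefore never shorter than requested, although it can be considerably longer.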

schedule_timeout()

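This is the better approach: the task that needs the delay sleeps until the specified time has elapsed and then runs again. The trade-off is that the length of the delay is not guaranteed to be precise. The sleeping done by semaphores, for example, is implemented with it.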

/**
 * schedule_timeout - sleep until timeout
 * @timeout: timeout value in jiffies
 *
 * Make the current task sleep until @timeout jiffies have
 * elapsed. The routine will return immediately unless
 * the current task state has been set (see set_current_state()).
 *
 * You can set the task state as follows -
 *
 * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
 * pass before the routine returns. The routine will return 0
 *
 * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
 * delivered to the current task. In this case the remaining time
 * in jiffies will be returned, or 0 if the timer expired in time
 *
 * The current task state is guaranteed to be TASK_RUNNING when this
 * routine returns.
 *
 * Specifying a @timeout value of %MAX_SCHEDULE_TIMEOUT will schedule
 * the CPU away without a bound on the timeout. In this case the return
 * value will be %MAX_SCHEDULE_TIMEOUT.
 *
 * In all cases the return value is guaranteed to be non-negative.
 */
signed long __sched schedule_timeout(signed long timeout)
{
	struct timer_list timer;
	unsigned long expire;

	switch (timeout)
	{
	case MAX_SCHEDULE_TIMEOUT:
		/*
		 * These two special cases are useful to be comfortable
		 * in the caller. Nothing more. We could take
		 * MAX_SCHEDULE_TIMEOUT from one of the negative value
		 * but I' d like to return a valid offset (>=0) to allow
		 * the caller to do everything it want with the retval.
		 */
		schedule();
		goto out;
	default:
		/*
		 * Another bit of PARANOID. Note that the retval will be
		 * 0 since no piece of kernel is supposed to do a check
		 * for a negative retval of schedule_timeout() (since it
		 * should never happens anyway). You just have the printk()
		 * that will tell you if something is gone wrong and where.
		 */
		if (timeout < 0) {
			printk(KERN_ERR "schedule_timeout: wrong timeout "
				"value %lx\n", timeout);
			dump_stack();
			current->state = TASK_RUNNING;
			goto out;
		}
	}

	expire = timeout + jiffies;

	setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
	__mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
	schedule();
	del_singleshot_timer_sync(&timer);

	/* Remove the timer from the object tracker */
	destroy_timer_on_stack(&timer);

	timeout = expire - jiffies;

 out:
	return timeout < 0 ? 0 : timeout;
}

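Code that calls it must be able to be scheduled away: it must run in process context and must not hold a lock. As the comment above notes, the caller must also set the task state first (see set_current_state()), otherwise the routine returns immediately.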
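The return value computed at the end of schedule_timeout() is the number of ticks that were still left when the task woke up, clamped so that it is never negative. A user-space sketch of just that calculation, with a hypothetical function name and the current tick passed in as a parameter:

```c
/* Ticks still remaining until expire, or 0 once it has been reached. */
long sketch_remaining(unsigned long expire, unsigned long now)
{
	long left = (long)(expire - now);

	return left < 0 ? 0 : left;
}
```

A task that is woken early, for example by a signal while in TASK_INTERRUPTIBLE, gets a positive value back and can use it to decide whether to go back to sleep for the remainder.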