Brennan's Guide to Inline Assembly (translation)

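Note: a word up front. This is a translated article, and my English is limited. Inline assembly is essential knowledge for studying operating systems, and I look this material up often. I came across this introduction to inline assembly while doing the MIT JOS labs. It covers the commonly used inline assembly syntax (AT&T format) and ends with a list of frequently used inline assembly snippets. It is well worth reading, though experts can skip it. Corrections to translation errors or awkward wording are very welcome.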

PS: The original article is at http://www.delorie.com/djgpp/doc/brennan/brennan_att_inline_djgpp.html

PS: Everything marked "Note:" is my own addition.

Brennan's Guide to Inline Assembly

by Brennan "Bas" Underwood

Document version 1.1.2.2

Ok. This is meant to be an introduction to inline assembly under DJGPP. DJGPP is based on GCC, so it uses the AT&T/UNIX syntax and has a somewhat unique method of inline assembly. I spent many hours figuring some of this stuff out and told Info that I hate it, many times.

Hopefully if you already know Intel syntax, the examples will be helpful to you. I've put variable names, register names and other literals in bold type.

The Syntax

So, DJGPP uses the AT&T assembly syntax. What does that mean to you?

Register naming:

Register names are prefixed with "%".

AT&T:  %eax
Intel: eax

Source/Destination Ordering:

In AT&T syntax (which is the UNIX standard, BTW) the source is always on the left, and the destination is always on the right. So let's load ebx with the value in eax:

AT&T:  movl %eax, %ebx
Intel: mov ebx, eax

Constant value/immediate value format:

You must prefix all constant/immediate values with "$".
Let's load eax with the address of the "C" variable booga, which is static.

AT&T:  movl $_booga, %eax
Intel: mov eax, _booga

Now let's load ebx with 0xd00d:

AT&T:  movl $0xd00d, %ebx
Intel: mov ebx, 0d00dh

Operator size specification:

You must suffix the instruction with one of b, w, or l to specify the width of the destination register as a byte, word or longword. If you omit this, GAS (GNU assembler) will attempt to guess. You don't want GAS to guess, and guess wrong! Don't forget it.

AT&T:  movw %ax, %bx
Intel: mov bx, ax
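
The size suffix can be tried out from C. The following is a minimal sketch, not part of the original guide: copy16 is a hypothetical helper name, and it uses GCC's extended asm form, which the guide covers further down. The #else branch is only there so the file still builds on non-x86 targets.

```c
#include <stdint.h>

/* Hypothetical helper: copy a 16-bit value from one register to another. */
static uint16_t copy16(uint16_t in)
{
    uint16_t out;
#if defined(__i386__) || defined(__x86_64__)
    /* The w suffix states the operand width. GCC substitutes 16-bit
       register names here because both operands are 16-bit C objects. */
    __asm__ ("movw %1, %0" : "=r" (out) : "r" (in));
#else
    out = in;   /* plain C fallback for non-x86 targets */
#endif
    return out;
}
```

Calling copy16(0xBEEF) should hand back the same 16-bit value; the exact registers are whatever the compiler picks.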

The equivalent forms for Intel are byte ptr, word ptr, and dword ptr, but that is for when you are...

Referencing memory:

DJGPP uses 386-protected mode, so you can forget all that real-mode addressing junk, including the restrictions on which register has what default segment, which registers can be base or index pointers. Now, we just get 6 general purpose registers. (7 if you use ebp, but be sure to restore it yourself or compile with -fomit-frame-pointer.)

Here is the canonical format for 32-bit addressing:

AT&T:  immed32(basepointer,indexpointer,indexscale)
Intel: [basepointer + indexpointer*indexscale + immed32]

You could think of the formula to calculate the address as:

immed32 + basepointer + indexpointer * indexscale

You don't have to use all those fields, but you do have to have at least 1 of immed32, basepointer and you MUST add the size suffix to the operator!
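
One way to check the address formula is to let the CPU evaluate it with lea, which computes an effective address without touching memory. This is only a sketch under assumptions: effective_address is a hypothetical helper, the constants 8 and 4 are arbitrary choices for immed32 and indexscale, and the extended asm form used here is explained later in the guide.

```c
#include <stdint.h>

/* Hypothetical helper: evaluate 8(base,index,4) and return the result. */
static uint32_t effective_address(uint32_t base, uint32_t index)
{
    uint32_t ea;
#if defined(__i386__) || defined(__x86_64__)
    /* immed32 = 8, basepointer = %1, indexpointer = %2, indexscale = 4 */
    __asm__ ("leal 8(%1,%2,4), %0" : "=r" (ea) : "r" (base), "r" (index));
#else
    ea = 8 + base + index * 4;   /* same formula in plain C */
#endif
    return ea;
}
```

With base 100 and index 3 the formula gives 8 + 100 + 3*4 = 120.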

Let's see some simple forms of memory addressing:

Addressing a particular C variable:

AT&T:  _booga
Intel: [_booga]

Note: the underscore ("_") is how you get at static (global) C variables from assembler. This only works with global variables. Otherwise, you can use extended asm to have variables preloaded into registers for you. I address that farther down.

Addressing what a register points to:

AT&T:  (%eax)
Intel: [eax]

Addressing a variable offset by a value in a register:

AT&T: _variable(%eax)
Intel: [eax + _variable]

Addressing a value in an array of integers (scaling up by 4):

AT&T:  _array(,%eax,4)
Intel: [eax*4 + _array]

You can also do offsets with the immediate value:

C code: *(p+1) where p is a char *

AT&T:  1(%eax) where eax has the value of p
Intel: [eax + 1]

You can do some simple math on the immediate value:

AT&T: _struct_pointer+8

I assume you can do that with Intel format as well.

Addressing a particular char in an array of 8-character records:

eax holds the number of the record desired. ebx has the wanted char's offset within the record.

AT&T:  _array(%ebx,%eax,8)
Intel: [ebx + eax*8 + _array]

Whew. Hopefully that covers all the addressing you'll need to do. As a note, you can put esp into the address, but only as the base register.
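
The base-plus-scaled-index form can be exercised on a real array. A minimal sketch, assuming a hypothetical helper load_int and passing the pointer and index through extended asm operands (covered later in the guide) rather than naming a global like _array directly:

```c
#include <stddef.h>

/* Hypothetical helper: load arr[i] using base + index*4 addressing. */
static int load_int(const int *arr, size_t i)
{
    int value;
#if defined(__i386__) || defined(__x86_64__)
    /* (%1,%2,4): base pointer in %1, index in %2, scale 4 for 4-byte ints.
       The "memory" clobber tells GCC the asm accesses memory that is not
       listed as an operand. */
    __asm__ ("movl (%1,%2,4), %0"
             : "=r" (value)
             : "r" (arr), "r" (i)
             : "memory");
#else
    value = arr[i];   /* plain C fallback for non-x86 targets */
#endif
    return value;
}
```

The index is a size_t so that it has the same width as the pointer, which x86 addressing requires of the base and index registers.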

Basic inline assembly

The format for basic inline assembly is very simple, and much like Borland's method.

asm ("statements");

Pretty simple, no? So

asm ("nop");

will do nothing of course, and

asm ("cli");

will stop interrupts, with

asm ("sti");

of course enabling them. You can use __asm__ instead of asm if the keyword asm conflicts with something in your program.

When it comes to simple stuff like this, basic inline assembly is fine. You can even push your registers onto the stack, use them, and put them back.

asm ("pushl %eax\n\t"
     "movl $0, %eax\n\t"
     "popl %eax");

(The \n's and \t's are there so the .s file that GCC generates and hands to GAS comes out right when you've got multiple statements per asm.)

It's really meant for issuing instructions for which there is no equivalent in C and don't touch the registers.
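
The multi-statement form relies on two ordinary C rules: adjacent string literals are concatenated, and \n and \t are the usual newline and tab escapes. The sketch below shows only that string mechanics, without executing any assembly; asm_template and count_newlines are hypothetical names used for illustration.

```c
#include <stddef.h>

/* The same three statements as a plain C string. The compiler joins the
   adjacent literals into one string before the assembler ever sees it. */
static const char asm_template[] =
    "pushl %eax\n\t"
    "movl $0, %eax\n\t"
    "popl %eax";

/* Hypothetical helper: count the newline separators in a template. */
static size_t count_newlines(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (*s == '\n')
            ++n;
    return n;
}
```

Three statements are separated by two newlines, so each instruction lands on its own line of the generated .s file, and the tab indents it.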

But if you do touch the registers, and don't fix things at the end of your asm statement, like so:

asm ("movl %eax, %ebx");
asm ("xorl %ebx, %edx");
asm ("movl $0, _booga");

then your program will probably blow things to hell. This is because GCC hasn't been told that your asm statement clobbered ebx and edx and booga, which it might have been keeping in a register, and might plan on using later. For that, you need:

Extended inline assembly

The basic format of the inline assembly stays much the same, but now gets Watcom-like extensions to allow input arguments and output arguments.

Here is the basic format:

asm ( "statements" : output_registers : input_registers : clobbered_registers);

Let's just jump straight to a nifty example, which I'll then explain:

asm ("cld\n\t"
     "rep\n\t"
     "stosl"
     : /* no output registers */
     : "c" (count), "a" (fill_value), "D" (dest)
     : "%ecx", "%edi" );

The above stores the value in fill_value count times to the pointer dest.

Let's look at this bit by bit.

asm ("cld\n\t"

We are clearing the direction bit of the flags register. You never know what this is going to be left at, and it costs you all of 1 or 2 cycles.

     "rep\n\t"
     "stosl"

Notice that GAS requires the rep prefix to occupy a line of its own. Notice also that stos has the l suffix to make it move longwords.

     : /* no output registers */

Well, there aren't any in this function.

     : "c" (count), "a" (fill_value), "D" (dest)

Here we load ecx with count, eax with fill_value, and edi with dest. Why make GCC do it instead of doing it ourselves? Because GCC, in its register allocating, might be able to arrange for, say, fill_value to already be in eax. If this is in a loop, it might be able to preserve eax thru the loop, and save a movl once per loop.

     : "%ecx", "%edi" );

And here's where we specify to GCC, "you can no longer count on the values you loaded into ecx or edi to be valid." This doesn't mean they will be reloaded for certain. This is the clobberlist.
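
A caveat for readers on a current toolchain: the GCC manual now states that clobbers may not overlap input or output operands, so recent compilers may reject a clobber list like the one above. The following is a sketch of the same fill under that rule, not the guide's own code. fill32 is a hypothetical helper, and the registers that rep stosl modifies are declared as read-write operands ("+D", "+c") instead of clobbers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store value into dest[0] .. dest[count-1]. */
static void fill32(uint32_t *dest, uint32_t value, size_t count)
{
#if defined(__i386__) || defined(__x86_64__)
    /* "+D" and "+c" say that edi and ecx are both read and modified,
       which replaces listing them as clobbers. "memory" covers the
       stores through dest, and "cc" covers the change to the flags. */
    __asm__ __volatile__ ("cld\n\t"
                          "rep\n\t"
                          "stosl"
                          : "+D" (dest), "+c" (count)
                          : "a" (value)
                          : "memory", "cc");
#else
    while (count--)          /* plain C fallback for non-x86 targets */
        *dest++ = value;
#endif
}
```

The count is a size_t so that the whole count register is initialized on 64-bit targets as well.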

Seem funky? Well, it really helps when optimizing, when GCC can know exactly what you're doing with the registers before and after. It folds your assembly code into the code it generates (whose rules for generation look remarkably like the above) and then optimizes. It's even smart enough to know that if you tell it to put (x+1) in a register, then if you don't clobber it, and later C code refers to (x+1), and it was able to keep that register free, it will reuse the computation. Whew.

Here's the list of register loading codes that you'll be likely to use:

a        eax
b        ebx
c        ecx
d        edx
S        esi
D        edi
I        constant value (0 to 31)
q,r      dynamically allocated register (see below)
g        eax, ebx, ecx, edx or variable in memory
A        eax and edx combined into a 64-bit integer (use long longs)

Note that you can't directly refer to the byte registers (ah, al, etc.) or the word registers (ax, bx, etc.) when you're loading this way. Once you've got it in there, though, you can specify ax or whatever all you like.

The codes have to be in quotes, and the expressions to load in have to be in parentheses.

When you do the clobber list, you specify the registers as above with the %. If you write to a variable, you must include "memory" as one of The Clobbered. This is in case you wrote to a variable that GCC thought it had in a register. This is the same as clobbering all registers. While I've never run into a problem with it, you might also want to add "cc" as a clobber if you change the condition codes (the bits in the flags register the jnz, je, etc. operators look at.)

Now, that's all fine and good for loading specific registers. But what if you specify, say, ebx, and ecx, and GCC can't arrange for the values to be in those registers without having to stash the previous values? It's possible to let GCC pick the register(s). You do this:

asm ("leal (%1,%1,4), %0"
     : "=r" (x)
     : "0" (x) );

The above example multiplies x by 5 really quickly (1 cycle on the Pentium). Now, we could have specified, say eax. But unless we really need a specific register (like when using rep movsl or rep stosl, which are hardcoded to use ecx, edi, and esi), why not let GCC pick an available one? So when GCC generates the output code for GAS, %0 will be replaced by the register it picked.

Note: lea computes the effective address of its left operand and stores it in the register on the right, without accessing memory. Here that works out to %0 = %1 + %1*4, which multiplies x by 5.
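
Wrapped in a function, the multiply-by-5 statement can be compiled and checked as is. A minimal sketch, assuming a hypothetical helper named times5_fn (chosen so it does not collide with the times5 macro defined near the end of the guide):

```c
/* Hypothetical helper: multiply x by 5 with a single lea. */
static int times5_fn(int x)
{
#if defined(__i386__) || defined(__x86_64__)
    /* "=r" lets GCC pick the output register; "0" makes the input use
       that same register, so %0 and %1 name one register. */
    __asm__ ("leal (%1,%1,4), %0"
             : "=r" (x)
             : "0" (x));
#else
    x *= 5;   /* plain C fallback for non-x86 targets */
#endif
    return x;
}
```

No clobber list is needed, because the only register touched is the one GCC allocated for the operands.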

And where did "q" and "r" come from? Well, "q" causes GCC to allocate from eax, ebx, ecx, and edx. "r" lets GCC also consider esi and edi. So make sure, if you use "r" that it would be possible to use esi or edi in that instruction. If not, use "q".

Now, you might wonder, how to determine how the %n tokens get allocated to the arguments. It's a straightforward first-come-first-served, left-to-right thing, mapping to the "q"'s and "r"'s. But if you want to reuse a register allocated with a "q" or "r", you use "0", "1", "2"... etc.
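
The left-to-right numbering is easiest to see with three operands that play different roles. In this sketch (add_scaled is a hypothetical helper) the output is listed first and becomes %0, and the two inputs become %1 and %2 in the order they are written:

```c
/* Hypothetical helper: return a + 2*b using one lea. */
static int add_scaled(int a, int b)
{
    int result;
#if defined(__i386__) || defined(__x86_64__)
    /* %0 = result, %1 = a, %2 = b, numbered in order of appearance. */
    __asm__ ("leal (%1,%2,2), %0"
             : "=r" (result)
             : "r" (a), "r" (b));
#else
    result = a + 2 * b;   /* plain C fallback for non-x86 targets */
#endif
    return result;
}
```

Because the two inputs are used asymmetrically, swapping them in the operand list changes the result, which makes a numbering mistake easy to spot.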

You don't need to put a GCC-allocated register on the clobberlist as GCC knows that you're messing with it.

Now for output registers.

asm ("leal (%1,%1,4), %0"
     : "=r" (x_times_5)
     : "r" (x) );

Note the use of = to specify an output register. You just have to do it that way. If you want 1 variable to stay in 1 register for both in and out, you have to respecify the register allocated to it on the way in with the "0" type codes as mentioned above.

asm ("leal (%0,%0,4), %0"
     : "=r" (x)
     : "0" (x) );

Note: here the "0" code makes the input use the same register as %0.

This also works, by the way:

asm ("leal (%%ebx,%%ebx,4), %%ebx"
     : "=b" (x)
     : "b" (x) );

2 things here:

Note that we don't have to put ebx on the clobberlist, GCC knows it goes into x. Therefore, since it can know the value of ebx, it isn't considered clobbered.

Notice that in extended asm, you must prefix registers with %% instead of just %. Why, you ask? Because as GCC parses along for %0's and %1's and so on, it would interpret %edx as a %e parameter, see that that's non-existent, and ignore it. Then it would bitch about finding a symbol named dx, which isn't valid because it's not prefixed with % and it's not the one you meant anyway.
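
The %% rule can be checked with a variant that names its register explicitly. This sketch is not the guide's code: times3_eax is a hypothetical helper, and it uses eax with the "a" code rather than ebx, purely as an illustrative choice.

```c
/* Hypothetical helper: multiply x by 3 in an explicitly named register. */
static int times3_eax(int x)
{
    int out;
#if defined(__i386__) || defined(__x86_64__)
    /* %%eax reaches the assembler as %eax. The "a" codes tell GCC that
       eax carries x in and out, so eax is not on the clobber list. */
    __asm__ ("leal (%%eax,%%eax,2), %%eax"
             : "=a" (out)
             : "a" (x));
#else
    out = x * 3;   /* plain C fallback for non-x86 targets */
#endif
    return out;
}
```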

Important note: If your assembly statement must execute where you put it, (i.e. must not be moved out of a loop as an optimization), put the keyword volatile after asm and before the ()'s. To be ultra-careful, use __asm__ __volatile__ (...whatever...);

However, I would like to point out that if your assembly's only purpose is to calculate the output registers, with no other side effects, you should leave off the volatile keyword so your statement will be processed into GCC's common subexpression elimination optimization.

Some useful examples

#define disable() __asm__ __volatile__ ("cli");

#define enable() __asm__ __volatile__ ("sti");

Of course, libc has these defined too.

#define times3(arg1, arg2) \
__asm__ ( \
  "leal (%0,%0,2),%0" \
  : "=r" (arg2) \
  : "0" (arg1) );

#define times5(arg1, arg2) \
__asm__ ( \
  "leal (%0,%0,4),%0" \
  : "=r" (arg2) \
  : "0" (arg1) );

#define times9(arg1, arg2) \
__asm__ ( \
  "leal (%0,%0,8),%0" \
  : "=r" (arg2) \
  : "0" (arg1) );

These multiply arg1 by 3, 5, or 9 and put the result in arg2. You should be ok to do times5(x,x); as well.

#define rep_movsl(src, dest, numwords) \
__asm__ __volatile__ ( \
  "cld\n\t" \
  "rep\n\t" \
  "movsl" \
  : : "S" (src), "D" (dest), "c" (numwords) \
  : "%ecx", "%esi", "%edi" )

Helpful Hint: If you say memcpy() with a constant length parameter, GCC will inline it to a rep movsl like above. But if you need a variable length version that inlines and you're always moving dwords, there ya go.

#define rep_stosl(value, dest, numwords) \
__asm__ __volatile__ ( \
  "cld\n\t" \
  "rep\n\t" \
  "stosl" \
  : : "a" (value), "D" (dest), "c" (numwords) \
  : "%ecx", "%edi" )

Same as above but for memset(), which doesn't get inlined no matter what (for now.)

#define RDTSC(llptr) ({ \
__asm__ __volatile__ ( \
        ".byte 0x0f; .byte 0x31" \
        : "=A" (llptr) \
        : : "eax", "edx"); })

Reads the TimeStampCounter on the Pentium and puts the 64 bit result into llptr.

Note: on multi-core machines the rdtscp instruction may be more reliable, although it takes a few more cycles. Something like this:

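#include <stdint.h>

static __inline__ uint64_t perf_counter(void)
{
  uint32_t lo, hi;
  // Read the time stamp counter. rdtscp waits for earlier instructions to
  // finish before reading, which is cheaper than pairing rdtsc with CPUID.
  // It also writes IA32_TSC_AUX into ecx, so ecx is listed as clobbered.
  __asm__ __volatile__ (
      "rdtscp" : "=a"(lo), "=d"(hi) : : "ecx"
      );
  return ((uint64_t)lo) | (((uint64_t)hi) << 32);
}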

The End

"The End"?! Yah, I guess so.

If you're wondering, I personally am a big fan of AT&T/UNIX syntax now. (It might have helped that I cut my teeth on SPARC assembly. Of course, that machine actually had a decent number of general registers.) It might seem weird to you at first, but it's really more logical than Intel format, and has no ambiguities.

If I still haven't answered a question of yours, look in the Info pages for more information, particularly on the input/output registers. You can do some funky stuff like use "A" to allocate two registers at once for 64-bit math or "m" for static memory locations, and a bunch more that aren't really used as much as "q" and "r".

Alternately, mail me, and I'll see what I can do. (If you find any errors in the above, please, e-mail me and tell me about it! It's frustrating enough to learn without buggy docs!) Or heck, mail me to say "boogabooga."

Note: I honestly have no idea what the author means by that "boogabooga" sentence.

It's the least you can do.