Character types in C and C++

How many built-in character types are there in C++? The answer may surprise you.

The language described in the original 1978 edition of The C Programming Language (aka "the White Book") by Kernighan and Ritchie didn't have the keyword signed, meaning that there were only two character types: char and unsigned char.

This was analogous to the situation for int (which is always signed) and unsigned int, except that C compilers were allowed to make char an unsigned type. Many did, typically because of platform conventions or better optimization opportunities for unsigned integer arithmetic. Granting compilers such latitude has been an important factor in making C a highly portable, yet efficient, language.

Leaving the signedness of char up to the implementation had its drawbacks, though. On a platform where the plain character type was unsigned, you'd have one less built-in type for small integers; the smallest signed type was short, which could very well be larger than a char. As can be expected, lots of programs were written that relied on char being signed or unsigned.

In the ANSI C Draft Standard, the keyword signed was added, introducing a signed char type for all platforms. The new keyword solved the problem of not being able to use signed char portably, but at this point the standard committee could not mandate plain char to be signed. It would break a lot of code and upset vendors as well as users.

The compromise was to make signed char a type distinct from the two existing character types, while requiring char to have the same representation and values as either signed char or unsigned char. In other words, a char must look exactly like a signed char or unsigned char to the hardware; which one is implementation-defined. C++ later adopted this compromise for compatibility with C, so both languages now have three distinct char types.

You have probably seen the wide character type wchar_t even if you haven't used it (there are certain caveats to wchar_t, but that's a topic for another time). The _t suffix is a common convention indicating a typedef name, and that's the way the C standard defines wchar_t (in the header stddef.h). Since typedef doesn't create types, only new names for other types, wchar_t is not a distinct type in C.

In contrast, the C++ standard defines wchar_t as a built-in type with "the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type" (C++98 §3.9.1). This makes wchar_t a distinct type with the same representation as another type, in a way quite similar to char and subtly different from the wide character type in C.

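Note: wchar_t is a built-in type in C++, and it has the same size, signedness, and alignment requirements as one of the other integral types; that integral type is the underlying type of wchar_t.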

Like the char type in C and C++, it is implementation-defined whether wchar_t is a signed or unsigned type. Does this mean there are three distinct types of wide characters as well? No, signedness can't be forced by using unsigned wchar_t or signed wchar_t; there are no such types and compilers should flag the code as erroneous.

There were no legacy reasons for introducing signed and unsigned variants of the wide character type, and it doesn't make sense to use wchar_t for storing integers anyway: since it has the same representation as a built-in integral type, you can simply use that type instead. There are thus four distinct character types in Standard C++ (as of C++98):

char
signed char
unsigned char
wchar_t

Is this useful information or merely pedantic trivia? Knowing the distinct character types is important when you overload functions and specialize templates in C++, but even in C it can be relevant due to the way conversions work:

int main(void)
{
    char *a = "Hello, World!";
    unsigned char *b = a; /* distinct types! */
    signed   char *c = a; /* distinct types! */

    return 0;
}

C compilers are supposed to warn about the above code, but in practice many do not. gcc will inform you that pointer targets in assignment differ in signedness if you use the -pedantic flag, but the default is to silently accept such conversions. g++ correctly rejects the same program:

error: invalid conversion from `char*' to `unsigned char*'
error: invalid conversion from `char*' to `signed char*'

Casts should be used in both languages when converting between pointers to the different char types. In C++ you can't get away with being sloppy; omitting the cast is illegal.

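Note: Character values of type unsigned char have a range from 0 to 0xFF hexadecimal. A signed char has range 0x80 to 0x7F. These ranges translate to 0 to 255 decimal, and -128 to +127 decimal, respectively. The /J compiler option (Microsoft Visual C++) changes the default for plain char from signed to unsigned. When the types are used to hold characters there is no difference between them; the difference appears when they are used as integers. In most cases char, signed char, and unsigned char data behave identically, but when you assign a single-byte value to a wider integer type, you will see that they differ in sign extension. Another difference shows up when a number between 128 and 255 is assigned to a signed char variable: the compiler must first convert the value, and it may also issue a warning.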

Note: wchar_t is a data type in ANSI/ISO C, ANSI/ISO C++, and some other programming languages that is intended to represent wide characters.

The Unicode standard 4.0 says that

"ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension."

and that

"The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."

Under Win32, wchar_t is 16 bits wide and represents a UTF-16 code unit. On Unix-like systems wchar_t is commonly 32 bits wide and represents a UTF-32 code unit.