【IRA/GSM/UCS2】Differences between the IRA, GSM, and UCS2 character sets

3GPP 27.007

5.5 Select TE character set +CSCS

Table 6: +CSCS parameter command syntax

Command	Possible response(s)
+CSCS=[<chset>]
+CSCS?	+CSCS: <chset>
+CSCS=?	+CSCS: (list of supported <chset>s)

Description

Set command informs TA which character set <chset> is used by the TE. TA is then able to convert character strings correctly between TE and MT character sets.

When TA‑TE interface is set to 8‑bit operation and used TE alphabet is 7‑bit, the highest bit shall be set to zero.

NOTE: It is manufacturer specific how the internal alphabet of MT is converted to/from the TE alphabet.

Read command shows current setting and test command displays conversion schemes implemented in the TA.
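As a rough illustration of the command forms in Table 6, the sketch below builds the set command and parses a test-command response. The helper names (build_cscs_set, parse_cscs_list) and the sample response line are hypothetical; real modems differ in which character sets they list and in how they format the response.

```python
import re

def build_cscs_set(chset: str) -> bytes:
    """Build the set command, e.g. AT+CSCS="UCS2", terminated by CR."""
    return f'AT+CSCS="{chset}"\r'.encode("ascii")

def parse_cscs_list(line: str) -> list:
    """Extract the quoted character set names from a +CSCS response line."""
    if not line.startswith("+CSCS:"):
        raise ValueError("not a +CSCS response")
    return re.findall(r'"([^"]+)"', line)

# Hypothetical response to the test command AT+CSCS=?
sample = '+CSCS: ("GSM","IRA","UCS2")'

print(build_cscs_set("UCS2"))
print(parse_cscs_list(sample))
```

The same parser also handles a read-command response such as +CSCS: "GSM", returning a one-element list.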

Defined values

<chset>: character set as a string type (conversion schemes not listed here can be defined by manufacturers)

"GSM" GSM 7 bit default alphabet (3GPP TS 23.038 [25]); this setting causes easily software flow control (XON/XOFF) problems.

"HEX" Character strings consist only of hexadecimal numbers from 00 to FF; e.g. "032FE6" equals three 8-bit characters with decimal values 3, 47 and 230; no conversions to the original MT character set shall be done.

If MT is using GSM 7 bit default alphabet, its characters shall be padded with 8th bit (zero) before converting them to hexadecimal numbers (i.e. no SMS‑style packing of 7‑bit alphabet).
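The "HEX" rules above can be sketched in a few lines of Python. The function names are hypothetical, and the septet values used for "hello" assume that lowercase Latin letters have the same codes in the GSM 7 bit default alphabet as in ASCII.

```python
def hex_to_values(hex_string: str) -> list:
    """Split a "HEX" string into its 8-bit character values."""
    return list(bytes.fromhex(hex_string))

def septets_to_hex(septets) -> str:
    """Write each 7-bit code as one octet with the 8th bit zero (no packing)."""
    for value in septets:
        if not 0 <= value <= 0x7F:
            raise ValueError("not a 7-bit value: %r" % value)
    return bytes(septets).hex().upper()

# The example from the text: three 8-bit characters
print(hex_to_values("032FE6"))  # [3, 47, 230]

# "hello" as GSM 7-bit codes, one octet per character
print(septets_to_hex([0x68, 0x65, 0x6C, 0x6C, 0x6F]))  # 68656C6C6F
```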

"IRA" International reference alphabet (see ITU‑T Recommendation T.50 [13]).

"PCCPxxx" PC character set Code Page xxx

"PCDN" PC Danish/Norwegian character set

"UCS2" 16-bit universal multiple-octet coded character set (see ISO/IEC10646 [32]); UCS2 character strings are converted to hexadecimal numbers from 0000 to FFFF; e.g. "004100620063" equals three 16-bit characters with decimal values 65, 98 and 99.

"UTF-8" Octet (8-bit) lossless encoding of UCS characters (see RFC 3629 [69]); UTF-8 encodes each UCS character as a variable number of octets, where the number of octets depends on the integer value assigned to the UCS character. The input format shall be a stream of octets. It shall not be converted to hexadecimal numbers as in "HEX" or "UCS2". This character set requires an 8-bit TA – TE interface.

"8859-n" ISO 8859 Latin n (1‑6) character set

"8859-C" ISO 8859 Latin/Cyrillic character set

"8859-A" ISO 8859 Latin/Arabic character set

"8859-G" ISO 8859 Latin/Greek character set

"8859-H" ISO 8859 Latin/Hebrew character set

Implementation

Mandatory when a command using the setting of this command is implemented.

======================================================================================

IRA

http://mercury.webster.edu/aleshunas/COSC%205130/Q-IRA.pdf

A familiar example of data is text or character strings. While textual data are most convenient
for human beings, they cannot, in character form, be easily stored or transmitted by data
processing and communications systems. Such systems are designed for binary data. Thus a
number of codes have been devised by which characters are represented by a sequence of bits.
Perhaps the earliest common example of this is the Morse code. Today, the most commonly used
text code is the International Reference Alphabet (IRA). Each character in this code is
represented by a unique 7-bit binary code; thus, 128 different characters can be represented.
Table Q.1 lists all of the code values. In the table, the bits of each character are labeled from b7,
which is the most significant bit, to b1, the least significant bit. Characters are of two types:
printable and control (Table Q.2). Printable characters are the alphabetic, numeric, and special
characters that can be printed on paper or displayed on a screen. For example, the bit
representation of the character "K" is b7b6b5b4b3b2b1 = 1001011. Some of the control characters
have to do with controlling the printing or displaying of characters; an example is carriage return.
Other control characters are concerned with communications procedures.

IRA-encoded characters are almost always stored and transmitted using 8 bits per
character. The eighth bit is a parity bit used for error detection. The parity bit is the most
significant bit and is therefore labeled b8. This bit is set such that the total number of binary 1s in
each octet is always odd (odd parity) or always even (even parity). Thus a transmission error that
changes a single bit, or any odd number of bits, can be detected.
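The parity scheme described above can be sketched as follows, using the character "K" from the text. The function names are illustrative only.

```python
def add_parity(code7: int, odd: bool = False) -> int:
    """Return the 8-bit octet: parity bit b8 on top of the 7-bit code."""
    if not 0 <= code7 <= 0x7F:
        raise ValueError("expected a 7-bit code")
    ones = bin(code7).count("1")
    # Even parity: b8 makes the total count of 1s even.
    # Odd parity: b8 makes the total count of 1s odd.
    parity_bit = (ones % 2) ^ (1 if odd else 0)
    return (parity_bit << 7) | code7

def parity_ok(octet: int, odd: bool = False) -> bool:
    """Check that the count of 1s in the octet matches the chosen parity."""
    return bin(octet).count("1") % 2 == (1 if odd else 0)

k = ord("K")
print(format(k, "07b"))                         # 1001011
print(format(add_parity(k), "08b"))             # 01001011 (even parity)
print(format(add_parity(k, odd=True), "08b"))   # 11001011 (odd parity)

# Flipping any single bit breaks the parity and is detected
corrupted = add_parity(k) ^ 0b00000100
print(parity_ok(corrupted))                     # False
```

Flipping two bits leaves the parity unchanged, which is why only an odd number of bit errors can be detected this way.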

GSM

https://en.wikipedia.org/wiki/GSM_03.38

GSM 7-bit default alphabet and extension table of 3GPP TS 23.038 / GSM 03.38

The standard encoding for GSM messages is the 7-bit default alphabet as defined in the 23.038 recommendation.

Seven-bit characters must be encoded into octets following one of three packing modes:

CBS: using this encoding, it is possible to send up to 93 characters (packed in up to 82 octets) in one Cell Broadcast Service message.
SMS: using this encoding, it is possible to send up to 160 characters (packed in up to 140 octets) in one SMS message in the GSM network.
USSD: using this encoding, it is possible to send up to 182 characters (packed in up to 160 octets) in one Unstructured Supplementary Service Data message.
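The octet counts in the list above follow from packing 7 bits per character into 8-bit octets. Below is a minimal sketch of SMS-style packing only; the CBS and USSD modes have their own padding rules that are not modelled here. The function names are hypothetical, and the septets for "hello" assume that lowercase Latin letters have the same codes in the GSM 7-bit default alphabet as in ASCII.

```python
def packed_octets(n_chars: int) -> int:
    """Number of octets needed to hold n_chars 7-bit characters."""
    return (n_chars * 7 + 7) // 8

def pack_septets(septets) -> bytes:
    """Pack 7-bit values into octets, least significant bits first."""
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    for s in septets:
        acc |= (s & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # final partial octet, zero-filled
    return bytes(out)

print(packed_octets(93), packed_octets(160), packed_octets(182))  # 82 140 160
print(pack_septets([0x68, 0x65, 0x6C, 0x6C, 0x6F]).hex().upper())  # E8329BFD06
```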

GSM 8-bit data encoding

8-bit data encoding mode treats the information as raw data. According to the standard, the alphabet for this encoding is user-specific.

UCS-2 Encoding

This encoding allows use of a greater range of characters and languages. UCS-2 can represent the most commonly used Latin and eastern characters at the cost of a greater space expense. Actually, some cell phones (e.g. iPhones) use UTF-16 instead of UCS-2 to display emoticons in short messages.

A single SMS GSM message using this encoding can have at most 70 characters (140 octets).
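The sketch below reproduces the "004100620063" example from the +CSCS "UCS2" definition and the 70-character / 140-octet limit. It uses Python's utf-16-be codec as a stand-in for UCS-2, which is an assumption: the two agree for characters in the Basic Multilingual Plane, while characters outside it (such as most emoji) become two 16-bit units under UTF-16. The function names are hypothetical.

```python
def ucs2_hex_encode(text: str) -> str:
    """Encode text as big-endian 16-bit units written as hex digits."""
    return text.encode("utf-16-be").hex().upper()

def ucs2_hex_decode(hex_string: str) -> str:
    """Decode a hex string of big-endian 16-bit units back to text."""
    raw = bytes.fromhex(hex_string)
    if len(raw) % 2:
        raise ValueError("expected a whole number of 16-bit units")
    return raw.decode("utf-16-be")

def code_units(text: str) -> int:
    """Count the 16-bit units the text occupies."""
    return len(text.encode("utf-16-be")) // 2

print(ucs2_hex_decode("004100620063"))   # Abc
print(ucs2_hex_encode("Abc"))            # 004100620063
print(code_units("á" * 70) * 2)          # 140 octets for 70 characters
print(code_units("😀"))                  # 2 (a UTF-16 surrogate pair)
```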

Note that on many GSM cell phones, there's no specific preselection of the UCS-2 encoding. The default is to use the 7-bit encoding described above, until one enters a character that is not present in the GSM 7-bit table (for example the lowercase 'a' with acute: 'á'). In that case, the whole message gets reencoded using the UCS-2 encoding, and the maximum length of the message sent in only 1 SMS is immediately reduced to 70 characters, instead of 160. On smartphones the message encoding depends on the SMS application used and its setting as well as on the length of the message. Some smartphones even send longer messages as a multimedia message (MMS).
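The fallback behaviour described above can be sketched as a simple check against the GSM 7-bit table. This is an illustration only, not how any particular phone implements it: the function name is hypothetical, the extension table is reduced to its printable characters (each assumed to cost two septets because of the escape code), and utf-16-be is used as a stand-in for UCS-2.

```python
# Basic GSM 7-bit default alphabet (escape code omitted)
GSM_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Printable characters of the extension table (two septets each)
GSM_EXT = set("^{}\\[~]|€")

def choose_encoding(text: str):
    """Return (encoding, units used, single-message limit) for the text."""
    septets = 0
    for ch in text:
        if ch in GSM_BASIC:
            septets += 1
        elif ch in GSM_EXT:
            septets += 2
        else:
            # One character outside the table forces the whole message to UCS-2
            return ("UCS2", len(text.encode("utf-16-be")) // 2, 70)
    return ("GSM7", septets, 160)

print(choose_encoding("hello"))   # ('GSM7', 5, 160)
print(choose_encoding("holà"))    # ('GSM7', 4, 160)
print(choose_encoding("holá"))    # ('UCS2', 4, 70)
```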

To avoid unexpected costs for senders that have a subscription for a limited pack of sent SMS, smartphones should display the number of characters used and the maximum number of characters in the composed SMS. When a message exceeds this maximum, it will be sent as multiple successive SMS containing parts of the message (each one containing a sequence number, which also uses a few leading characters in each part); these parts will be reassembled later by the recipient.
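A rough way to estimate how many SMS a message needs is sketched below. The per-part capacities of 153 (7-bit) and 67 (UCS-2) characters are not given in the text above; they are an assumption based on the commonly used 6-octet concatenation header, which takes its space from each part's 140 octets. The function name is hypothetical.

```python
# (single-message limit, per-part limit when concatenated)
# The per-part values assume a 6-octet concatenation header.
LIMITS = {"GSM7": (160, 153), "UCS2": (70, 67)}

def sms_parts(n_units: int, encoding: str) -> int:
    """Estimate how many SMS are needed for n_units characters."""
    single, per_part = LIMITS[encoding]
    if n_units <= single:
        return 1
    return -(-n_units // per_part)  # ceiling division

print(sms_parts(160, "GSM7"))  # 1
print(sms_parts(161, "GSM7"))  # 2
print(sms_parts(71, "UCS2"))   # 2
print(sms_parts(307, "GSM7"))  # 3
```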

Some GSM smartphones will alert the user about the number of SMS messages needed to send the message, when it requires more than one.