Character Sets

One Byte Character Sets

The GSM character set, the most widespread in messaging technology, is defined in [GSM 03.38] and is sometimes loosely referred to as IA5. It is always supported by both the infrastructure (SMSCs) and mobile handsets.

The default set consists of 128 code points (Latin, accented and Greek characters, symbols and a few control codes), and the extended set adds a few more characters. Each character is defined by a 7-bit value, and the values are packed one after another with no padding between them, so 160 characters fit in 140 bytes of data (160 x 7 = 1120 bits = 140 x 8).
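
The packing arithmetic can be illustrated with a minimal Python sketch. It assumes the usual GSM 03.38 bit order, where each septet is appended starting at the least significant free bit. The function names pack_septets and unpack_septets are hypothetical and are not part of the library described here.

```python
def pack_septets(septets):
    """Pack 7-bit values into bytes, filling each byte from its low bits up."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for s in septets:
        if not 0 <= s <= 0x7F:
            raise ValueError("septet out of range: %r" % (s,))
        acc |= s << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)   # final partial byte, zero-padded at the top
    return bytes(out)


def unpack_septets(data, count):
    """Reverse of pack_septets; count is the number of septets to recover."""
    acc = 0
    nbits = 0
    out = []
    for b in data:
        acc |= b << nbits
        nbits += 8
        while nbits >= 7 and len(out) < count:
            out.append(acc & 0x7F)
            acc >>= 7
            nbits -= 7
    return out


hello = [0x68, 0x65, 0x6C, 0x6C, 0x6F]      # "hello" as GSM values
print(pack_septets(hello).hex())             # e8329bfd06
print(len(pack_septets([0x61] * 160)))       # 140
```

The septet count has to be carried separately from the packed bytes, because a trailing run of zero bits cannot otherwise be told apart from a final "@" (value 00).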

GSM Character Set

Character & Name GSM (hex) UCS-2 (hex) CIMD2 (ASCII)
@ Commercial At 00 0040 _Oa
£ Pound Sign 01 00a3 _L-
$ Dollar Sign 02 0024
¥ Yen Sign 03 00a5 _Y-
è Latin Small Letter E With Grave 04 00e8 _e`
é Latin Small Letter E With Acute 05 00e9 _e'
ù Latin Small Letter U With Grave 06 00f9 _u`
ì Latin Small Letter I With Grave 07 00ec _i`
ò Latin Small Letter O With Grave 08 00f2 _o`
Ç Latin Capital Letter C With Cedilla 09 00c7 _C,
(lf) Line Feed 0a 000a
Ø Latin Capital Letter O With Stroke 0b 00d8 _O/
ø Latin Small Letter O With Stroke 0c 00f8 _o/
(cr) Carriage Return 0d 000d
Å Latin Capital Letter A With Ring Above 0e 00c5 _A*
å Latin Small Letter A With Ring Above 0f 00e5 _a*
Δ Greek Capital Letter Delta 10 0394 _gd
_ Low Line (Underscore) 11 005f _--
Φ Greek Capital Letter Phi 12 03a6 _gf
Γ Greek Capital Letter Gamma 13 0393 _gg
Λ Greek Capital Letter Lambda 14 039b _gl
Ω Greek Capital Letter Omega 15 03a9 _go
Π Greek Capital Letter Pi 16 03a0 _gp
Ψ Greek Capital Letter Psi 17 03a8 _gi
Σ Greek Capital Letter Sigma 18 03a3 _gs
Θ Greek Capital Letter Theta 19 0398 _gt
Ξ Greek Capital Letter Xi 1a 039e _gx
(esc) Escape 1b 001b _XX
Æ Latin Capital Letter Ae 1c 00c6 _AE
æ Latin Small Letter Ae 1d 00e6 _ae
ß Latin Small Letter Sharp S 1e 00df _ss
É Latin Capital Letter E With Acute 1f 00c9 _E'
(sp) Space 20 0020
! Exclamation Mark 21 0021
" Quotation Mark 22 0022 _qq
# Number Sign 23 0023
¤ Currency Sign 24 00a4 _ox
% Percent Sign 25 0025
& Ampersand 26 0026
' Apostrophe 27 0027
( Left Parenthesis 28 0028
) Right Parenthesis 29 0029
* Asterisk 2a 002a
+ Plus Sign 2b 002b
, Comma 2c 002c
- Hyphen-Minus 2d 002d
. Full Stop 2e 002e
/ Solidus 2f 002f
0 Digit 0 30 0030
1 Digit 1 31 0031
2 Digit 2 32 0032
3 Digit 3 33 0033
4 Digit 4 34 0034
5 Digit 5 35 0035
6 Digit 6 36 0036
7 Digit 7 37 0037
8 Digit 8 38 0038
9 Digit 9 39 0039
: Colon 3a 003a
; Semicolon 3b 003b
< Less-Than Sign 3c 003c
= Equals Sign 3d 003d
> Greater-Than Sign 3e 003e
? Question Mark 3f 003f
¡ Inverted Exclamation Mark 40 00a1 _!!
A Latin Capital Letter A 41 0041
B Latin Capital Letter B 42 0042
C Latin Capital Letter C 43 0043
D Latin Capital Letter D 44 0044
E Latin Capital Letter E 45 0045
F Latin Capital Letter F 46 0046
G Latin Capital Letter G 47 0047
H Latin Capital Letter H 48 0048
I Latin Capital Letter I 49 0049
J Latin Capital Letter J 4a 004a
K Latin Capital Letter K 4b 004b
L Latin Capital Letter L 4c 004c
M Latin Capital Letter M 4d 004d
N Latin Capital Letter N 4e 004e
O Latin Capital Letter O 4f 004f
P Latin Capital Letter P 50 0050
Q Latin Capital Letter Q 51 0051
R Latin Capital Letter R 52 0052
S Latin Capital Letter S 53 0053
T Latin Capital Letter T 54 0054
U Latin Capital Letter U 55 0055
V Latin Capital Letter V 56 0056
W Latin Capital Letter W 57 0057
X Latin Capital Letter X 58 0058
Y Latin Capital Letter Y 59 0059
Z Latin Capital Letter Z 5a 005a
Ä Latin Capital Letter A With Diaeresis 5b 00c4 _A"
Ö Latin Capital Letter O With Diaeresis 5c 00d6 _O"
Ñ Latin Capital Letter N With Tilde 5d 00d1 _N~
Ü Latin Capital Letter U With Diaeresis 5e 00dc _U"
§ Section Sign 5f 00a7 _so
¿ Inverted Question Mark 60 00bf _??
a Latin Small Letter A 61 0061
b Latin Small Letter B 62 0062
c Latin Small Letter C 63 0063
d Latin Small Letter D 64 0064
e Latin Small Letter E 65 0065
f Latin Small Letter F 66 0066
g Latin Small Letter G 67 0067
h Latin Small Letter H 68 0068
i Latin Small Letter I 69 0069
j Latin Small Letter J 6a 006a
k Latin Small Letter K 6b 006b
l Latin Small Letter L 6c 006c
m Latin Small Letter M 6d 006d
n Latin Small Letter N 6e 006e
o Latin Small Letter O 6f 006f
p Latin Small Letter P 70 0070
q Latin Small Letter Q 71 0071
r Latin Small Letter R 72 0072
s Latin Small Letter S 73 0073
t Latin Small Letter T 74 0074
u Latin Small Letter U 75 0075
v Latin Small Letter V 76 0076
w Latin Small Letter W 77 0077
x Latin Small Letter X 78 0078
y Latin Small Letter Y 79 0079
z Latin Small Letter Z 7a 007a
ä Latin Small Letter A With Diaeresis 7b 00e4 _a"
ö Latin Small Letter O With Diaeresis 7c 00f6 _o"
ñ Latin Small Letter N With Tilde 7d 00f1 _n~
ü Latin Small Letter U With Diaeresis 7e 00fc _u"
à Latin Small Letter A With Grave 7f 00e0 _a`

Extended GSM Character Set

Character & Name GSM (hex) UCS-2 (hex) CIMD2 (ASCII)
^ Modifier Letter Circumflex Accent 1b14 02c6 _XX_gl
{ Left Curly Bracket 1b28 007b _XX(
| Vertical Bar 1b40 007c _XX_!!
} Right Curly Bracket 1b29 007d _XX)
\ Reverse Solidus 1b2f 005c _XX/
[ Left Square Bracket 1b3c 005b _XX<
~ Tilde 1b3d 007e _XX=
] Right Square Bracket 1b3e 005d _XX>
€ Euro Sign 1b65 20ac _XXe

Encoding of Extended Characters

Extended characters (also referred to as "escape characters") are encoded by prefixing them with the escape character 1b (hex). For example, Circumflex Accent ("^") is encoded as 1b14 (hex), Vertical Bar ("|") as 1b40 (hex), and so on.

It is worth remembering that each extended character in the string consumes two characters of the message space, so a message that contains extended characters holds fewer than 160 characters.
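
The effect on message length can be sketched in Python as follows. The alphabet string follows the GSM and UCS-2 hex columns of the tables in this section. GSM_DEFAULT, GSM_EXTENDED and gsm_encode are illustrative names only; they are assumptions made for this example and not the library's API.

```python
# Default alphabet in GSM code order (index = GSM value), four rows of 32.
GSM_DEFAULT = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
assert len(GSM_DEFAULT) == 128

# Extended characters: the value that follows the 1b escape.
GSM_EXTENDED = {
    "^": 0x14, "{": 0x28, "|": 0x40, "}": 0x29, "\\": 0x2F,
    "[": 0x3C, "~": 0x3D, "]": 0x3E, "€": 0x65,
}

ESC = 0x1B
GSM_INDEX = {ch: i for i, ch in enumerate(GSM_DEFAULT)}


def gsm_encode(text):
    """Return the list of 7-bit GSM values for text (unpacked)."""
    septets = []
    for ch in text:
        if ch in GSM_EXTENDED:
            septets.append(ESC)                 # escape prefix
            septets.append(GSM_EXTENDED[ch])    # extended code
        elif ch in GSM_INDEX:
            septets.append(GSM_INDEX[ch])
        else:
            raise ValueError("not in the GSM character set: %r" % (ch,))
    return septets


print([hex(v) for v in gsm_encode("^")])   # ['0x1b', '0x14']
print(len(gsm_encode("a{b}")))             # 6: two plain plus two escaped characters
print(len(gsm_encode("€" * 80)))           # 160: 80 euro signs fill one message
```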

Unicode Encoding

All SMSCs and all modern handsets should support two-byte Unicode (big-endian UCS-2) encoding. A Unicode message holds at most 70 characters (140 bytes at two bytes per character), so longer messages have to be concatenated. Although Unicode passes through the mobile network unchanged, support for particular characters may be limited in some markets (e.g. handsets sold in Europe may not support Far East characters).
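
As a rough illustration of the 70-character limit, the following Python sketch encodes text as big-endian UCS-2 and checks whether it fits in one message. It relies on the fact that UTF-16-BE and UCS-2 produce the same bytes for code points up to U+FFFF. The names encode_ucs2 and fits_in_one_message are hypothetical and do not come from the library.

```python
MAX_PAYLOAD_BYTES = 140   # user data available in a single message


def encode_ucs2(text):
    """Encode text as big-endian UCS-2 (two bytes per character)."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Outside the Basic Multilingual Plane: not representable in UCS-2.
            raise ValueError("character outside UCS-2 range: %r" % (ch,))
    return text.encode("utf-16-be")


def fits_in_one_message(text):
    """True if the UCS-2 form of text fits in a single 140-byte payload."""
    return len(encode_ucs2(text)) <= MAX_PAYLOAD_BYTES


print(encode_ucs2("€").hex())            # 20ac
print(fits_in_one_message("x" * 70))     # True
print(fits_in_one_message("x" * 71))     # False
```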

Library Encoding

Our library accepts Wide Char (Unicode) strings on input and applies all the necessary conversions to produce the proper encoding. Similarly, text received from the mobile network is converted into Wide Char (Unicode) strings on output.
