Unicode


Martin McBride, 2016-11-17
Tags utf8 ascii unicode font html python
Categories text character

ASCII is quite limited. It only allows for the English alphabet, numerals and a few punctuation symbols, yet only about 5% of the world's population speak English as a first language. Schemes such as Extended ASCII provided limited support for other languages and alphabets, but with the increasing importance of computers in society, and in particular the growth of the web, a better solution was needed.

Unicode

Unicode is an ambitious solution. It attempts to define a unique value for every single character used by every single language there is! Unicode works by defining a unique number, called a code point, for each character. Code points were originally 16 bit (2 byte) quantities, allowing values between 0 and 65535. The range was later extended, and code points can now go up to 1114111 (0x10FFFF).
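
As a small illustration in Python 3, the built-in functions ord and chr convert between a character and its code point. This is only a sketch, and the characters used are arbitrary examples:

```python
# ord() gives the code point of a character, as an ordinary integer.
for ch in ['A', 'π', '€']:
    print(ch, ord(ch), hex(ord(ch)))

# chr() goes the other way, from a code point to a character.
pi = chr(960)
print(pi)
```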

{{% orange-note %}} Unicode supports over 100 different scripts (alphabets) including some historic ones such as Egyptian Hieroglyphs and Ancient Greek.

The total number of languages supported is much higher than this, because some scripts are used by many different languages. For example, the Latin alphabet is used for most European languages (including English) as well as many other languages around the world.

Unicode includes over 100,000 characters. In addition to letters, this includes many symbols, such as mathematical, scientific or musical symbols. {{% /orange-note %}}

Adding Unicode to a document

How would you add a Unicode character to a document? There are several ways:

  • Copy and paste it from somewhere else. There are websites with tables of characters which you can find by searching "unicode pi" or whatever you are looking for, and then copy the character.
  • Use Insert Symbol if your word processor has that function.
  • Use a Character Map program.
  • Use a keyboard shortcut, if one exists for your computer.

Fonts

Simply adding a Unicode character to your document will not automatically mean that it will display correctly. You must also choose a font which has the character available.

For example if you are using an English version of Windows and you want to enter some Chinese characters into a document, you will probably find that most of the available fonts don't include Chinese characters. There are thousands of Chinese characters, and most western fonts don't include them because:

  • They take a lot of effort to design
  • They take up a lot of disk space
  • Most users don't really need them

A modern computer will most likely have some fonts which support a wider range of Unicode characters. A word processor or browser might automatically select the correct font if it finds characters in a document or website which the default font cannot display.

Adding Unicode to an HTML page

You can easily add Unicode characters to an HTML page. You will need to know the numerical value of the character, in decimal or hex. You can then enter it as either:

  • The string &#ddd; where ddd is the decimal value of the Unicode character, or
  • The string &#xhhh; where hhh is the hexadecimal value of the Unicode character.

For example, the character π (Greek letter Pi, code 960, or 3C0 in hex) can be entered as &#960; or &#x3c0;.
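
If you would rather not work out the two forms by hand, a short Python sketch can build them for you. The function name html_refs is made up for this example. The standard library function html.unescape is used only to check that both forms decode back to the original character:

```python
import html

def html_refs(ch):
    """Return the decimal and hex HTML character references for ch."""
    code = ord(ch)
    return '&#{};'.format(code), '&#x{:x};'.format(code)

dec_ref, hex_ref = html_refs('π')
print(dec_ref, hex_ref)  # &#960; &#x3c0;

# Both references decode back to the same character.
print(html.unescape(dec_ref) == html.unescape(hex_ref) == 'π')
```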

Adding Unicode to your code

You might be writing a program which needs to print out Unicode text. In Python 3, there are several ways to do this (other languages will have equivalent methods):

print('Perimeter = 2πr')
print('Perimeter = 2\u03c0r')
print('Perimeter = 2\N{GREEK SMALL LETTER PI}r')

In the first case, we have pasted the π character into the string. Most program editors, e.g. IDLE, will allow this. In the second case we use \uxxxx where xxxx is a 4 digit hex code. In the final case we use \N{name} where name is the Unicode name of the character, which you can find on any website that lists Unicode characters.
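
All three spellings produce exactly the same string, which you can check directly. The sketch below also uses the standard unicodedata module to look up a character's name, which is one way to find the name for \N{name} without leaving Python. The variable names are just for illustration:

```python
import unicodedata

pasted = 'Perimeter = 2πr'
escaped = 'Perimeter = 2\u03c0r'
named = 'Perimeter = 2\N{GREEK SMALL LETTER PI}r'

# All three literals describe the same string.
print(pasted == escaped == named)

# Find the name of a character, and a character from its name.
pi_name = unicodedata.name('π')
print(pi_name)
print(unicodedata.lookup(pi_name))
```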

Unicode structure

Unicode characters were originally each given a value between 0 and 65535 (0x0000 to 0xFFFF in hexadecimal). This range is divided into blocks, where each block contains characters for a particular alphabet or set of symbols.

  • Values 0 to 127 form the Basic Latin block. These characters exactly match the ASCII character set.
  • Values 128 to 255 form the Latin-1 Supplement block. These characters match ISO 8859-1 (Latin-1), one of the most widely used Extended ASCII character sets.
  • These are followed by further blocks for other alphabets: Greek, Cyrillic, Hebrew, Arabic and many more.
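
The first two points in the list can be checked from Python. This is a minimal sketch which assumes that "Extended ASCII" means the ISO 8859-1 encoding, which Python calls 'latin-1'. Other Extended ASCII variants, such as the old DOS code pages, would not match:

```python
# Every ASCII byte decodes to the code point with the same value.
ascii_bytes = bytes(range(128))
ascii_match = ascii_bytes.decode('ascii') == ''.join(chr(i) for i in range(128))
print(ascii_match)

# The same holds for all 256 byte values in Latin-1 (ISO 8859-1).
latin1_bytes = bytes(range(256))
latin1_match = latin1_bytes.decode('latin-1') == ''.join(chr(i) for i in range(256))
print(latin1_match)
```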

As Unicode was being developed, it became apparent that 65536 code points were not enough to cover every character in use. Unicode was extended to allow extra "planes" with even more characters. The extra planes mostly contain ancient or obscure characters which are rarely used, although they also include emoji and additional Chinese characters.
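
To make the idea of planes concrete, here is a minimal Python sketch. Each plane holds 65536 code points, so the plane number is the code point divided by 65536, ignoring the remainder. The helper name plane is made up for this example. Note that \uxxxx only takes 4 hex digits, so characters outside the first plane need the 8 digit form \Uxxxxxxxx:

```python
def plane(ch):
    """Return the number of the plane containing the character ch."""
    return ord(ch) // 0x10000

print(plane('π'))  # 0, the original 16 bit range

clef = '\U0001D11E'  # MUSICAL SYMBOL G CLEF, which needs the 8 digit escape
print(hex(ord(clef)), plane(clef))  # 0x1d11e 1
```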

Copyright (c) Axlesoft Ltd 2021