Specifying a Character Set

Computers operate using numbers. There therefore needs to be a way for a computer to represent letters (and other "characters") as numbers, so that they can be stored in computer memory, manipulated by a computer, or sent via communication links. Early communications systems used a range of methods. Some used a compact representation with a variable number of bits per character (e.g., Morse code). Others used a fixed number of bits to represent each character (e.g., Baudot codes, ITA-2, and ITA-5).

With the introduction of 8-bit computers, a standard format was specified, known as "ASCII" (American Standard Code for Information Interchange). ASCII was initially developed for tasks such as sending documents to printers, and many of the control commands (those with the lowest values) make sense in this context.

Because ASCII is a 7-bit code, it can represent 128 different characters. When an ASCII character is stored in a byte, the most significant bit is always zero (forming an 8-bit ASCII value). Unique codes are assigned to each letter (a-z and A-Z), the numeric digits (0-9), punctuation marks (,./?), etc. Some characters are not printing characters (see the first two columns of the following table). These characters represent control functions, such as ejecting a sheet of paper from a printer (FF, Form Feed), ringing a bell (BEL), or marking the start of a message header (SOH). There is also DEL, which was used to erase characters on paper tape.

Sometimes the most significant bit is used to indicate that the byte is not an ASCII character but a graphics symbol; however, this use is not defined by ASCII.
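
The role of the most significant bit can be sketched in C. This is a minimal illustration, not part of any standard library: the function names is_ascii_byte() and low_seven_bits() are hypothetical and chosen only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when the most significant bit of the byte is 0,
   i.e. the byte holds a 7-bit ASCII code (0x00 to 0x7F). */
bool is_ascii_byte(uint8_t byte)
{
    return (byte & 0x80u) == 0;
}

/* Hypothetical helper: clear the most significant bit, keeping the low 7 bits. */
uint8_t low_seven_bits(uint8_t byte)
{
    return (uint8_t)(byte & 0x7Fu);
}
```

For example, is_ascii_byte(0x41) is true (the letter A), while is_ascii_byte(0xC1) is false because the top bit is set.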

Note: There is also a difference between the 8-bit binary representation of the number zero (00000000) and the corresponding ASCII digit '0' (00110000).
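
The difference between a number and its ASCII digit can be sketched in C as follows. The helper names are hypothetical, and both functions assume the input is already in range (a value from 0 to 9, or a code from 0x30 to 0x39).

```c
/* Hypothetical helper: numeric value 0..9 -> ASCII digit code 0x30..0x39. */
char digit_to_ascii(int value)
{
    return (char)(0x30 + value);   /* 0x30 is the ASCII code for '0' */
}

/* Hypothetical helper: ASCII digit code 0x30..0x39 -> numeric value 0..9. */
int ascii_to_digit(char code)
{
    return code - 0x30;
}
```

Here digit_to_ascii(0) gives 0x30 (binary 00110000), which is not the same byte as the number zero (00000000).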

An ASCII Table

Each ASCII character can be identified according to its position within the ASCII Table.


An ASCII table which may easily be copied.


 MS 3 bits	   0	    1 	  2	   3	   4	   5	   6	   7
LS 4 bits\
	0         NUL	  DLE	 SP	  0	   @	   P	   `	   p
	1         SOH	  DC1	 !	   1	   A	   Q	   a	   q
	2         STX	  DC2	 "	   2	   B	   R	   b	   r
	3         ETX	  DC3	 #	   3	   C	   S	   c	   s 
	4         EOT	  DC4	 $	   4	   D	   T	   d	   t
	5         ENQ	  NAK	 %	   5	   E	   U	   e	   u
	6         ACK	  SYN	 &	   6	   F	   V	   f	   v
	7         BEL	  ETB	 '	   7	   G	   W	   g	   w
	8          BS	  CAN	 (	   8	   H	   X	   h	   x
	9          HT	   EM	  )	   9	   I	   Y	   i	   y
	A          LF	  SUB	 *	   :	   J	   Z	   j	   z
	B          VT	  ESC	 +	   ;	   K	   [	   k	   {
	C          FF	   FS	  ,	   <	   L	   \	   l	   |
	D          CR	   GS	  -	   =	   M	   ]	   m	   }
	E          SO	   RS	  .	   >	   N	   ^	   n	   ~
	F          SI	   US	  /	   ?	   O	   _	   o	   DEL


Notes:

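  1. To look up a hexadecimal code in the table, use the most significant hex digit (the most significant 3 bits) to select the column, and the least significant hex digit (the least significant 4 bits) to select the row; e.g. 0x33 means 0110011 (stored in a byte as 00110011), which is column 3, row 3: the code for the character 3.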
  2. Some simple rules: the decimal digits 0 - 9 are represented by the codes 0x30 - 0x39. The upper case letters run from 0x41 to 0x5A; the corresponding lower case letters run from 0x61 to 0x7A. The two codes for a letter are identical except for one bit (e.g. C is 0x43 and c is 0x63; in binary C is 1000011 and c is 1100011; the only difference is bit 5, counting the least significant bit as bit 0).
  3. Many of the codes are not printing characters at all; these are the codes 0x00 to 0x1F, and 0x7F, which are represented by groups of letters (e.g. NUL, DEL). Some are frequently used in text; for example LF (line feed), which is 0x0A and causes a printer or display to move down one line, and CR (carriage return), which is 0x0D and causes a printer or display to move to the left hand side of the line (and, on some systems, also down one line). There is also SP (space) which is 0x20; since this corresponds to an actual blank in the text it might be regarded as printing. NUL (null) has a value of zero and causes a printer or display to ignore the character. Other characters were once used to give information about messages, for example STX (start of text, 0x02) and ETX (end of text, 0x03).
  4. Computers often have a need to store groups of characters (forming words or sentences). A group of characters is usually called a "string". In high level languages such as 'C', the end of a "string" is indicated by using a NUL character (0x00). Since this character is never actually displayed, it is safe to assume that the character will never be one of the characters in a string.
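
Several of these notes can be sketched in C. The sketch follows the table's own headings (columns selected by the most significant 3 bits, rows by the least significant 4 bits). All function names are hypothetical and used only for illustration; the standard C library offers similar facilities such as toupper() and iscntrl().

```c
#include <stdbool.h>
#include <stdint.h>

/* Column of the table: the most significant 3 bits of the 7-bit code. */
unsigned ascii_column(uint8_t code) { return (code >> 4) & 0x07u; }

/* Row of the table: the least significant 4 bits. */
unsigned ascii_row(uint8_t code) { return code & 0x0Fu; }

/* Control (non-printing) codes: 0x00 to 0x1F, and DEL (0x7F). */
bool is_control(uint8_t code) { return code <= 0x1F || code == 0x7F; }

/* Upper and lower case letters differ only in bit 5 (value 0x20). */
uint8_t to_lower_ascii(uint8_t code)
{
    return (code >= 0x41 && code <= 0x5A) ? (uint8_t)(code | 0x20u) : code;
}

uint8_t to_upper_ascii(uint8_t code)
{
    return (code >= 0x61 && code <= 0x7A) ? (uint8_t)(code & ~0x20u) : code;
}
```

For example, 0x33 gives column 3 and row 3 (the character 3), and to_lower_ascii(0x43) gives 0x63, turning C into c by setting bit 5.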

Representing Numbers

Numbers can be represented by a binary value. However, it is also necessary to know the format used. This could be as simple as encoding a value from 0 to 255 as a single byte, or using a group of bytes to represent a signed or unsigned number. It could also be more complex, where the encoding uses a scientific format, or some other representation.
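
The need to know the format can be illustrated with a single byte read in two ways. This is a minimal sketch: the function names are hypothetical, and the signed reading assumes two's complement, which is the usual choice but is an assumption not stated in the text above.

```c
#include <stdint.h>

/* Read the byte as an unsigned number: 0 to 255. */
int byte_as_unsigned(uint8_t byte)
{
    return byte;
}

/* Read the same byte as a signed number, assuming two's complement:
   -128 to 127. */
int byte_as_signed(uint8_t byte)
{
    return (byte < 128) ? byte : byte - 256;
}
```

The same bit pattern 11111111 (0xFF) is 255 under the first reading and -1 under the second; nothing in the byte itself says which is intended.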

As an example of the use of ASCII, consider the problem of printing the result of a numerical calculation on a terminal, or sending it over a communications line.

Suppose the number to be printed is (in binary) 01101100;

The first step is to convert this into decimal;

The answer is 108;

The digits 1, 0 and 8 have the numeric values (in Hex) 0x01, 0x00 and 0x08;

Each of these digits must be converted to an ASCII character code;

The corresponding codes representing the numeric digits are (in Hex) 0x31, 0x30 and 0x38;

As a set of bytes, in binary these are: 00110001, 00110000 and 00111000.

These three values are sent as a sequence of bytes.

The receiver needs to recognise the character set and can then print these codes as a sequence of characters: "108".
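
The steps above can be sketched in C. The function name byte_to_ascii_decimal() is hypothetical; it is one straightforward way to carry out the conversion, not the only one (the standard library's sprintf() does a similar job).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the decimal digits of value as ASCII codes,
   most significant digit first, followed by a NUL. Returns the digit count.
   out must have room for 4 bytes (up to 3 digits plus the NUL). */
size_t byte_to_ascii_decimal(uint8_t value, char out[4])
{
    char reversed[3];
    size_t count = 0;

    /* Peel off decimal digits, least significant first. */
    do {
        reversed[count++] = (char)(0x30 + value % 10);  /* digit -> ASCII */
        value /= 10;
    } while (value != 0);

    /* Copy them out in printing order. */
    for (size_t i = 0; i < count; i++) {
        out[i] = reversed[count - 1 - i];
    }
    out[count] = 0x00;
    return count;
}
```

Calling byte_to_ascii_decimal(0x6C, buffer) with the binary value 01101100 fills the buffer with the bytes 0x31, 0x30, 0x38 and returns 3.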

Unicode

The ASCII Character Set is often used, but is based on the English alphabet - which is a limitation when the need is to communicate in a different alphabet, or to communicate graphics, emoji, etc.

Unicode is defined by the Unicode Consortium, which was incorporated in 1991. It is now commonly used to represent characters. Unlike ASCII, Unicode provides a way to support speakers of many languages. Each character is assigned a value (a code point), which is used to represent the character in computer memory, in storage systems and in specifications. There are 1,114,112 (17 ⨉ 2**16) code points; as of Unicode 16.0 (2024), about 155,000 have been assigned to characters.

For example, an upper case “A” is represented by decimal 65 (0x41), which is expressed in Unicode as U+0041; a Black Heart is represented by decimal 128420, expressed as U+1F5A4.
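
The U+ notation is simply the code point written in hexadecimal with at least four digits. A minimal C sketch, using a hypothetical helper named format_code_point():

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: formats a code point in U+ notation
   (at least four hexadecimal digits). Returns snprintf's result. */
int format_code_point(uint32_t code_point, char *buffer, size_t size)
{
    return snprintf(buffer, size, "U+%04lX", (unsigned long)code_point);
}
```

With a 16-byte buffer, format_code_point(65, buffer, 16) produces the text U+0041, and format_code_point(128420, buffer, 16) produces U+1F5A4.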

Text Strings

If text needs to be represented, it is usually stored as a string. This is a series of ASCII or Unicode characters, each one of which is stored as a value. Using ASCII, each character is stored in one byte. Formatting characters such as space, carriage return and line feed may be included in the string.

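Some method is needed for indicating the length of the string, or where the end of the string is. There are two main methods:

  1. Store the length of the string (the number of characters) along with the characters, typically just before the first character.
  2. Mark the end of the string with a special character that cannot appear within it, usually NUL (0x00).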

The programmer must, of course, know the convention that is being used. There is nothing to distinguish bits that represent a number from bits that represent characters; the receiver has to know how the bits are supposed to be interpreted before anything can be done with them.

The second way is most commonly used in C programs (see also the "pig" and "dog" examples). Note that a more sophisticated method of storing text (say, with a word processing program), where details such as the font or the size of characters also need to be stored, requires other information as well; but the text itself can still be stored as ASCII (or Unicode) characters.
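
Two common conventions for recording where a string ends can be sketched in C: a count stored ahead of the characters, and a NUL terminator. The layout of the counted string (a single length byte first) is an assumption made for this example, and the function names are hypothetical.

```c
#include <stddef.h>

/* Convention 1 (assumed layout): the first byte holds the number of
   characters, and the characters follow. */
size_t counted_string_length(const unsigned char *counted)
{
    return counted[0];
}

/* Convention 2: the characters are followed by a NUL (0x00) byte,
   as in C. This does the same job as the standard strlen(). */
size_t nul_terminated_length(const char *text)
{
    size_t length = 0;
    while (text[length] != 0x00) {
        length++;
    }
    return length;
}
```

The word "pig" would be stored as the bytes 0x03, 0x70, 0x69, 0x67 under the first convention and as 0x70, 0x69, 0x67, 0x00 under the second. Both take four bytes, but a reader must know which convention applies.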

The input to a computer program is usually a set of strings. A high level language like C has many functions that can handle strings like this (e.g. strcat(), strcpy(), strlen()). A compiler also handles strings when it runs: the program it reads in is presented as a series of characters. Some microprocessors and computers have special instructions to handle strings of characters efficiently.
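
As an illustration of the C string functions, the sketch below joins two strings using the standard strlen(), strcpy() and strcat(). The wrapper join_strings() is a hypothetical helper written for this example; it checks that the destination is large enough, which the standard functions do not do themselves.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: joins two NUL-terminated strings into dst.
   Returns 0 on success, or -1 if dst (of dst_size bytes) is too small. */
int join_strings(char *dst, size_t dst_size, const char *first, const char *second)
{
    /* strlen() counts characters up to, but not including, the NUL. */
    size_t needed = strlen(first) + strlen(second) + 1;  /* +1 for the NUL */
    if (needed > dst_size) {
        return -1;
    }
    strcpy(dst, first);    /* copy first, including its NUL */
    strcat(dst, second);   /* append second, overwriting that NUL */
    return 0;
}
```

Joining "pig" and "dog" needs seven bytes: six characters plus the terminating NUL.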

When Unicode is used, more than one byte may be needed to represent a single character.
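
One widely used way of storing code points as bytes is UTF-8, which uses between one and four bytes per character. The text above does not name a particular encoding, so the choice of UTF-8 here is an assumption made for illustration, and utf8_encode() is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if the value cannot be encoded
   (above U+10FFFF, or in the surrogate range U+D800 to U+DFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* same single byte as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {     /* surrogates are not characters */
        return 0;
    }
    if (cp <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Under this encoding U+0041 is stored as the single byte 0x41, exactly as in ASCII, while U+1F5A4 needs the four bytes 0xF0, 0x9F, 0x96, 0xA4.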


See also:

"pig" example

ASCII Chart in PDF

Serial Communications

Parity


Prof. Gorry Fairhurst, School of Engineering, University of Aberdeen, Scotland. (2025)