# Characters, Strings and ASCII

#### ASCII

Computers operate using numbers. There therefore needs to be a way to convert letters (and other "characters") to and from numbers so that they can be stored inside the computer and manipulated by it. A set of codes known as "ASCII" (American Standard Code for Information Interchange) is used. These codes were initially developed for tasks such as sending documents to printers, and many of the commands make sense in this context.

Each letter is assigned a value according to its position within the ASCII table. Every letter, number, punctuation mark, etc. (known as a character) is given a unique code. Note there is a difference between the 8-bit binary representation of the number zero (00000000) and the corresponding ASCII '0' (00110000) printing character.
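The difference between a numeric value and the ASCII code of the matching digit character can be sketched in a few lines of C. The function names `digit_to_ascii` and `ascii_to_digit` are illustrative (they are not standard library functions), and the sketch assumes the characters are encoded in ASCII.

```c
#include <assert.h>

/* Convert a numeric value 0..9 to the ASCII code of the matching digit
   character, e.g. 0 (binary 00000000) becomes '0' (binary 00110000). */
unsigned char digit_to_ascii(unsigned char value)
{
    assert(value <= 9);                      /* single decimal digits only */
    return (unsigned char)(value + 0x30);    /* '0' is 0x30 in ASCII */
}

/* Convert the ASCII code of a digit character back to its numeric value. */
unsigned char ascii_to_digit(unsigned char code)
{
    assert(code >= 0x30 && code <= 0x39);    /* must be '0'..'9' */
    return (unsigned char)(code - 0x30);
}
```

Here `digit_to_ascii(0)` returns 0x30, the code of the printing character '0', while the number zero itself remains 0x00.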

#### ASCII Table

The following ASCII table may easily be copied.

```
MS 3 bits   0    1    2    3    4    5    6    7
LS 4 bits
0           NUL  DLE  SP   0    @    P    `    p
1           SOH  DC1  !    1    A    Q    a    q
2           STX  DC2  "    2    B    R    b    r
3           ETX  DC3  #    3    C    S    c    s
4           EOT  DC4  $    4    D    T    d    t
5           ENQ  NAK  %    5    E    U    e    u
6           ACK  SYN  &    6    F    V    f    v
7           BEL  ETB  '    7    G    W    g    w
8           BS   CAN  (    8    H    X    h    x
9           HT   EM   )    9    I    Y    i    y
A           LF   SUB  *    :    J    Z    j    z
B           VT   ESC  +    ;    K    [    k    {
C           FF   FS   ,    <    L    \    l    |
D           CR   GS   -    =    M    ]    m    }
E           SO   RS   .    >    N    ^    n    ~
F           SI   US   /    ?    O    _    o    DEL
```

Notes:

1. ASCII is a 7-bit code, representing 128 different characters. When an ASCII character is stored in a byte, the most significant bit is normally zero. Sometimes the extra bit is used to indicate that the byte is not an ASCII character but a graphics symbol; however, this is not defined by ASCII.
2. To look up a hexadecimal code in the table, take the most significant hex digit (the column) followed by the least significant hex digit (the row); e.g. 0x33 means 00110011, which is the code for the character 3. Likewise 0x41 (column 4, row 1) is the code for the character A.
3. Some simple rules: the decimal digits 0 - 9 are represented by the codes 0x30 - 0x39. The upper case letters run from 0x41 to 0x5A; the corresponding lower case letters run from 0x61 to 0x7A; the two codes are identical except for one bit (e.g. C is 0x43 and c is 0x63; in binary C is 1000011 and c is 1100011; the only difference is bit 5).
4. Many of the codes are not printing characters at all; these are the codes 0x00 to 0x1F, and 0x7F, which are represented by groups of letters (e.g. NUL, DEL). Some are frequently used in text; for example LF (line feed) which is 0x0A (which causes a printer or display to move down one line), and CR (carriage return) which is 0x0D (which causes a printer or display to return to the left hand side, and often to move down one line as well). There is also SP (space) which is 0x20; since this corresponds to an actual blank in the text it might be regarded as printing. NUL (null) has a value of zero and causes a printer or display to ignore the character. Other characters were once used to give information about messages, for example STX (start of text, 0x02) and ETX (end of text, 0x03).
5. Computers often need to store groups of characters (forming words or sentences). A group of characters is usually called a "string". In high-level languages such as C, the end of a string is indicated by a NUL character (0x00). Since this character is never actually displayed, it is safe to assume that it will never be one of the characters in a string.
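
The rules in notes 3 and 4 can be sketched in C as follows. This is only an illustration: the names `ASCII_CASE_BIT`, `is_ascii_letter`, `to_upper_ascii`, `to_lower_ascii` and `is_ascii_control` are made up for this example, and the standard `<ctype.h>` functions would normally be used instead.

```c
/* Bit 5 (value 0x20) is the only bit that differs between 'C' and 'c'. */
#define ASCII_CASE_BIT 0x20u

/* Return 1 if code is an ASCII letter (0x41..0x5A or 0x61..0x7A), else 0. */
int is_ascii_letter(unsigned char code)
{
    return (code >= 0x41 && code <= 0x5A) || (code >= 0x61 && code <= 0x7A);
}

/* Clear bit 5 of a letter to get its upper case form; leave other codes alone. */
unsigned char to_upper_ascii(unsigned char code)
{
    return is_ascii_letter(code) ? (unsigned char)(code & ~ASCII_CASE_BIT) : code;
}

/* Set bit 5 of a letter to get its lower case form; leave other codes alone. */
unsigned char to_lower_ascii(unsigned char code)
{
    return is_ascii_letter(code) ? (unsigned char)(code | ASCII_CASE_BIT) : code;
}

/* Return 1 for the non-printing control codes: 0x00..0x1F and DEL (0x7F). */
int is_ascii_control(unsigned char code)
{
    return code <= 0x1F || code == 0x7F;
}
```

For example, `to_lower_ascii(0x43)` gives 0x63, turning C into c by setting bit 5 and changing nothing else.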

#### Conversion of Numbers

As an example of the use of ASCII, consider the problem of printing on a screen or printer the result of a numerical calculation.

Suppose the number to be printed is (in binary) 01101100.

The first step is to convert this into decimal, giving 108.

Each decimal digit may be represented by the BCD codes 0001, 0000 and 1000.

(The hex values are of course 0x1, 0x0 and 0x8.)

Each of these digits needs to be converted to its own ASCII character code.

They are (in hex) 0x31, 0x30 and 0x38.

In binary, these are 0110001, 0110000 and 0111000.

These are the codes which are sent to the printer.

The printer will have been preprogrammed to recognise and print these codes as "108".
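
The conversion steps above can be sketched in C. This is one possible approach rather than a definitive routine; the function name `byte_to_decimal_ascii` and the choice of a 4-byte output buffer are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Write the decimal digits of value into out as ASCII codes, most
   significant digit first, followed by a terminating NUL (0x00).
   out must have room for 4 bytes (up to "255" plus the NUL).
   Returns the number of digit characters written. */
size_t byte_to_decimal_ascii(unsigned char value, char out[4])
{
    unsigned char digits[3];
    size_t count = 0;
    size_t i;

    assert(out != NULL);

    /* Peel off the decimal digits, least significant first. */
    do {
        digits[count++] = (unsigned char)(value % 10);
        value = (unsigned char)(value / 10);
    } while (value != 0);

    /* Reverse the order and add 0x30 to turn each digit into its ASCII code. */
    for (i = 0; i < count; i++) {
        out[i] = (char)(digits[count - 1 - i] + 0x30);
    }
    out[count] = '\0';
    return count;
}
```

Called with the value 0x6C (binary 01101100), the function fills the buffer with 0x31, 0x30, 0x38 and a terminating 0x00, and returns 3.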

### Text Strings

If text is being stored in a computer, it is usually stored as a string (a series of ASCII characters, each one of which is stored as one byte). The formatting characters such as space, carriage return and line feed may be included in the string.

Some method is needed for indicating the length of the string, or where the end of the string is. There are two main methods:

• The first byte of the string is not a character, but the binary number equal to the number of characters in the string. Suppose, for example, we wished to store the string "Hello World!". Including the space between the words, this has 12 characters. It would then be stored (writing the binary in hex) as
0C 48 65 6C 6C 6F 20 57 6F 72 6C 64 21
where 0C is the hexadecimal number representing the value 12 in decimal.
• The last byte of the string is followed by a standard byte to mean "end of string"; the null value 00 is usual. In this second method the string would be stored as
48 65 6C 6C 6F 20 57 6F 72 6C 64 21 00
(this example therefore uses 13 bytes to store 12 characters).
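
Both storage methods can be sketched in C as below. The function names `store_length_prefixed` and `store_nul_terminated` are hypothetical, and the caller is assumed to supply a destination buffer that is large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Method 1: store a length byte followed by the characters.
   dest must have room for strlen(text) + 1 bytes, and text must be at
   most 255 characters so that its length fits in one byte.
   Returns the total number of bytes written. */
size_t store_length_prefixed(unsigned char *dest, const char *text)
{
    size_t len = strlen(text);

    assert(len <= 255);
    dest[0] = (unsigned char)len;    /* e.g. 0x0C for 12 characters */
    memcpy(dest + 1, text, len);
    return len + 1;
}

/* Method 2: store the characters followed by a NUL (0x00) terminator.
   dest must have room for strlen(text) + 1 bytes.
   Returns the total number of bytes written. */
size_t store_nul_terminated(unsigned char *dest, const char *text)
{
    size_t len = strlen(text);

    memcpy(dest, text, len);
    dest[len] = 0x00;                /* the "end of string" marker */
    return len + 1;
}
```

For a 12-character text, both functions write 13 bytes: the first places the count 0x0C at the start, the second places 0x00 at the end.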

The programmer must of course know the convention being used. There is nothing to distinguish bits which mean numbers from bits which mean letters and characters; you have to know what the bits are supposed to mean before you can do anything with them.

The second method is the one most commonly used in C programs (see also the "pig" example). Note that if you are using a more sophisticated method of storing text (say with a word processing program), where you want to store details such as the font or the size of the characters, you need other information as well; but the text itself will still usually be stored as ASCII characters.

The actual input to a computer program is usually a set of strings. A high-level language like C has many functions which can handle strings like this (e.g. strcat(), strcpy(), strlen()), and a compiler itself typically relies on similar string handling to read in the program, which is presented as a series of ASCII characters. Some microprocessors and computer chips have special instructions to handle strings of characters efficiently.
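
As a brief illustration of such string functions, the sketch below joins two NUL-terminated strings using the standard `<string.h>` functions `strcpy()`, `strcat()` and `strlen()`. The wrapper `join_strings` and its `dest_size` parameter are assumptions made for this example, not part of the C library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Join two NUL-terminated strings into dest and return the length of the
   result (not counting the terminating NUL). dest_size is the total number
   of bytes available in dest, including room for the NUL. */
size_t join_strings(char *dest, size_t dest_size,
                    const char *first, const char *second)
{
    /* strcpy() and strcat() do not check for overflow, so check first. */
    assert(strlen(first) + strlen(second) + 1 <= dest_size);

    strcpy(dest, first);     /* copy first, including its NUL */
    strcat(dest, second);    /* find that NUL and append second there */
    return strlen(dest);     /* count the characters up to the NUL */
}
```

All three library functions rely on the NUL terminator to find the end of each string, which is why the convention in use must be known in advance.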