Characters and Strings, ASCII

ASCII

Computers operate on numbers, so there needs to be a way for a computer to convert letters (and other "characters") to and from numbers. A set of codes known as "ASCII" (American Standard Code for Information Interchange) is used. These were initially developed for tasks such as sending documents to printers, and many of the control codes make sense in this context.

Note there is a difference between the 8-bit binary representation of the number zero (00000000) and the corresponding ASCII '0' (00110000) printing character.
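The distinction can be sketched in C as follows. This is a minimal illustration, not a standard API: the helper names digit_to_ascii and ascii_to_digit are made up for this example, and the code assumes the character set is ASCII, so that '0' has the value 0x30.

```c
/* Convert a numeric value 0..9 into its ASCII digit character.
   Hypothetical helper; relies on the digit codes 0x30..0x39 being consecutive. */
char digit_to_ascii(int value)
{
    return (char)('0' + value);   /* '0' is 0x30 (00110000) in ASCII */
}

/* Convert an ASCII digit character '0'..'9' back into the number 0..9. */
int ascii_to_digit(char c)
{
    return c - '0';               /* subtract 0x30 to recover the value */
}
```

For example, digit_to_ascii(0) returns the byte 00110000, whereas the number zero itself is the byte 00000000.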

ASCII Table

An ASCII table which may easily be copied.

MS 3 bits		0	1	2	3	4	5	6	7
LS 4 bits
	0		NUL	DLE	SP	0	@	P	`	p
	1		SOH	DC1	!	1	A	Q	a	q
	2		STX	DC2	"	2	B	R	b	r
	3		ETX	DC3	#	3	C	S	c	s
	4		EOT	DC4	$	4	D	T	d	t
	5		ENQ	NAK	%	5	E	U	e	u
	6		ACK	SYN	&	6	F	V	f	v
	7		BEL	ETB	'	7	G	W	g	w
	8		BS	CAN	(	8	H	X	h	x
	9		HT	EM	)	9	I	Y	i	y
	A		LF	SUB	*	:	J	Z	j	z
	B		VT	ESC	+	;	K	[	k	{
	C		FF	FS	,	<	L	\	l	|
	D		CR	GS	-	=	M	]	m	}
	E		SO	RS	.	>	N	^	n	~
	F		SI	US	/	?	O	_	o	DEL

Notes:

ASCII is a 7-bit code, representing 128 different characters. When an ASCII character is stored in a byte, the most significant bit is normally zero. Sometimes the extra bit is used to indicate that the byte is not an ASCII character but, for example, a graphics symbol; this use is not defined by ASCII.
To look up a hexadecimal code in the table, take the most significant digit (column) followed by the least significant digit (row); e.g. 0x33 means 00110011, so column 3, row 3, which is the code for the character 3.
Some simple rules (all codes in hex): the decimal digits 0 - 9 are represented by the codes 30 - 39. The upper case letters run from 41 to 5A; the corresponding lower case letters run from 61 to 7A. The two codes for a letter are identical except for one bit (e.g. C is 43 and c is 63; in binary C is 1000011 and c is 1100011; the only difference is bit 5).
Many of the codes are not printing characters at all; these are the codes 00 to 1F, and 7F, which are represented in the table by groups of letters (e.g. NUL, DEL). Some are frequently used in text; for example LF (line feed), which is 0x0A and causes a printer/display to move down one line, and CR (carriage return), which is 0x0D and returns a printer/display to the left hand side (often moving down one line as well). There is also SP (space), which is 0x20; since this corresponds to an actual blank in the text it might be regarded as printing. NUL (null) has a value of zero and causes a printer/display to ignore the character. Other characters were once used to give information about messages, for example STX (start of text, 0x02) and ETX (end of text, 0x03).
Computers often need to store groups of characters (forming words or sentences). A group of characters is usually called a "string". In high level languages such as C, the end of a string is indicated by a NUL character (0x00). Since this character is never actually displayed, it is safe to assume that it will never be one of the characters in a string.
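Several of these notes can be expressed directly in C. The sketch below is illustrative only: the function names and the CASE_BIT constant are invented for this example, and ASCII encoding is assumed throughout. It shows how a code splits into a table column (most significant 3 bits) and row (least significant 4 bits), how bit 5 switches between upper and lower case, and how the terminating NUL is used to find the end of a string.

```c
#include <stddef.h>

#define CASE_BIT 0x20   /* bit 5: the only difference between 'C' (0x43) and 'c' (0x63) */

/* Column of the ASCII table: the most significant 3 bits of the 7-bit code. */
int ascii_column(char c)
{
    return (c >> 4) & 0x07;
}

/* Row of the ASCII table: the least significant 4 bits of the code. */
int ascii_row(char c)
{
    return c & 0x0F;
}

/* Clear bit 5 of a lower case letter; any other code is returned unchanged. */
char ascii_to_upper(char c)
{
    if (c >= 'a' && c <= 'z')
        return (char)(c & ~CASE_BIT);
    return c;
}

/* Set bit 5 of an upper case letter; any other code is returned unchanged. */
char ascii_to_lower(char c)
{
    if (c >= 'A' && c <= 'Z')
        return (char)(c | CASE_BIT);
    return c;
}

/* Count the characters in a C string by scanning for the terminating NUL (0x00). */
size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

For instance, the character 3 (0x33) gives column 3 and row 3, and ascii_to_lower('C') gives 0x63. In real programs the standard library functions toupper, tolower and strlen would normally be used instead; the versions here only make the bit patterns visible.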

Conversion of numbers

As an example of the use of ASCII, consider the problem of printing the result of a numerical calculation on a screen or printer.

Suppose the number to be printed is (in binary) 01101100. The first step is to convert this into decimal, giving 108; this would be represented in the computer by the BCD codes 0001, 0000 and 1000 (whose hex values are of course 1, 0 and 8).

These need to be converted to their ASCII codes; they are (in Hex) 31, 30 and 38: or in binary, 0110001, 0110000 and 0111000.

These are the codes which are sent to the printer, which will have been preprogrammed to recognise and print these codes as 108.
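The conversion described above can be sketched in C. This is one possible approach under stated assumptions, not the only method: the function name byte_to_ascii is hypothetical, the input is limited to a single byte (0 to 255), and the digits are obtained by repeated division by ten rather than through an explicit BCD stage.

```c
#include <stddef.h>

/* Write the decimal ASCII digits of value into out, followed by a NUL.
   out must have room for at least 4 bytes (up to "255" plus the NUL).
   Returns the number of digit characters written. */
size_t byte_to_ascii(unsigned char value, char *out)
{
    char digits[3];
    size_t n = 0;

    /* Peel off the decimal digits, least significant first. */
    do {
        digits[n++] = (char)('0' + value % 10);   /* digit 0..9 plus 0x30 */
        value /= 10;
    } while (value != 0);

    /* Copy them out in reverse so the most significant digit comes first. */
    for (size_t i = 0; i < n; i++)
        out[i] = digits[n - 1 - i];
    out[n] = '\0';

    return n;
}
```

Called with the value 01101100 (0x6C), this fills the buffer with the bytes 31, 30, 38 and a terminating 00, which are the codes a printer would receive for 108.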

Text Strings

If text is being stored in a computer, it is usually stored as a string (a series of ASCII characters, each one of which is stored as one byte). The formatting characters such as space, carriage return and line feed may be included in the string.

Some method is needed for indicating the length of the string, or where the end of the string is. There are two main methods:

Always use the first byte of the string to hold not a character, but the binary number equal to the number of characters in the string. Suppose, for example, we wished to store the string "Hello World!". Including the space between the words, this has 12 characters. It would then be stored (writing the binary in hex) as
0C 48 65 6C 6C 6F 20 57 6F 72 6C 64 21
where 0C is the binary number representing 12.
Always make the last byte of the string a standard byte to mean "end of string"; the null value 00 is usual. In this second method the string would be stored as
48 65 6C 6C 6F 20 57 6F 72 6C 64 21 00
(this example therefore uses 13 bytes to store 12 characters).
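Both storage methods can be sketched in C as below. The function names store_length_prefixed and store_nul_terminated are invented for this illustration, and the one-byte length field is an assumption that limits the first method to 255 characters. With the text "Hello World!" (capital W, code 57, as in the byte listings above) each function writes 13 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Method 1: the first byte holds the character count, then the characters follow.
   Returns the number of bytes written, or 0 if the text is longer than
   255 characters or does not fit in out. */
size_t store_length_prefixed(const char *text, unsigned char *out, size_t out_size)
{
    size_t len = strlen(text);

    if (len > 255 || len + 1 > out_size)
        return 0;
    out[0] = (unsigned char)len;      /* e.g. 0x0C for 12 characters */
    memcpy(out + 1, text, len);
    return len + 1;
}

/* Method 2: the characters are followed by a NUL byte meaning "end of string".
   Returns the number of bytes written, or 0 if the text does not fit in out. */
size_t store_nul_terminated(const char *text, unsigned char *out, size_t out_size)
{
    size_t len = strlen(text);

    if (len + 1 > out_size)
        return 0;
    memcpy(out, text, len);
    out[len] = 0x00;                  /* terminating NUL */
    return len + 1;
}
```

The first method makes the length available immediately but caps it at what the length field can hold; the second allows any length but requires scanning for the NUL to find the end.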

The programmer must of course know which convention is being used. There is nothing to distinguish bits which mean numbers from bits which mean letters and other characters; you have to know what the bits are supposed to mean before you can do anything with them.

The second way is the one most commonly used in C programs (see also the "pig" example). Note that if you are using a more sophisticated method of storing text (say with a word processing program), where you want to store details such as the font or the size of the characters, you need other information as well; but the text itself will still usually be stored as ASCII characters.

The actual input to a computer program is usually a set of strings. A high level language like C has many functions which can handle strings like this; indeed, when a C compiler runs, it uses the same kind of functions to read in the program source, which is presented as a series of ASCII characters. Some microprocessors and computer chips have special instructions to handle strings of characters efficiently.