Section 5.9 Character Data and Operators

Another primitive data type in Java is the character type, char. A character in Java is represented by a 16-bit unsigned integer. This means that a total of \(2^{16}\) or 65536 different Unicode characters can be represented, corresponding to the integer values 0 to 65535.
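As a quick check of these limits, the following minimal sketch reads the size and range of char from constants in the standard java.lang.Character class. The class name CharRange and its method names are made up for this illustration, and the (int) casts it uses are explained in Subsection 5.9.1.

```java
// CharRange is a hypothetical class name used only for this illustration.
class CharRange {
    // Number of bits used to represent a char.
    static int bits() {
        return Character.SIZE;
    }

    // Smallest char value, viewed as an integer.
    static int min() {
        return (int) Character.MIN_VALUE;
    }

    // Largest char value, viewed as an integer.
    static int max() {
        return (int) Character.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(bits());  // 16
        System.out.println(min());   // 0
        System.out.println(max());   // 65535
    }
}
```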

The Unicode character set is an international standard that has been developed to enable computer languages to represent characters in a wide variety of languages, not just English. (See http://www.unicode.org/ for detailed information.)

It is customary in programming languages to use unsigned integers to represent characters. This means that all the digits (\(0, \dots,9\)), alphabetic letters (\(a,\dots,z, A,\dots, Z\)), punctuation symbols (such as . ; , " ' ! _ -), and nonprinting control characters (LINE_FEED, ESCAPE, CARRIAGE_RETURN, \(\dots\)) that make up the computer's character set are represented in the computer's memory by integers.

A more traditional set of characters is the ASCII (American Standard Code for Information Interchange) character set. ASCII is based on a 7-bit code and, therefore, defines \(2^7\) or 128 different characters, corresponding to the integer values 0 to 127. In order to make Unicode backward compatible with ASCII systems, the first 128 Unicode characters are identical to the ASCII characters. Thus, in both the ASCII and Unicode encodings, the printable characters have the integer values shown in Figure 5.9.1.

Code   32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Char   SP !  "  #  $  %  &  '  (  )   *  +  ,  -  .  /
Code   48 49 50 51 52 53 54 55 56 57
Char   0  1  2  3  4  5  6  7  8  9
Code   58 59 60 61 62 63 64
Char   :  ;  <  =  >  ?  @
Code   65 66 67 68 69 70 71 72 73 74 75 76 77
Char   A  B  C  D  E  F  G  H  I  J  K  L  M
Code   78 79 80 81 82 83 84 85 86 87 88 89 90
Char   N  O  P  Q  R  S  T  U  V  W  X  Y  Z
Code   91 92 93 94 95 96
Char   [  \  ]  ^  _  `
Code   97 98 99 100 101 102 103 104 105 106 107 108 109
Char   a  b  c  d   e   f   g   h   i   j   k   l   m
Code   110 111 112 113 114 115 116 117 118 119 120 121 122
Char   n   o   p   q   r   s   t   u   v   w   x   y   z
Code   123 124 125 126
Char   {   |   }   ~
Figure 5.9.1. ASCII codes for selected characters.
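The rows of Figure 5.9.1 can be regenerated by looping over a range of integer codes and converting each one to a character. The sketch below is one way to do that; the class name AsciiTable and the helper chars are invented for illustration, and the (char) cast it relies on is discussed in Subsection 5.9.1.

```java
// AsciiTable and chars() are hypothetical names used only for this illustration.
class AsciiTable {
    // Returns the characters whose codes run from first to last, inclusive.
    static String chars(int first, int last) {
        StringBuilder sb = new StringBuilder();
        for (int code = first; code <= last; code++) {
            sb.append((char) code);  // convert the integer code to its character
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(chars(48, 57));   // 0123456789
        System.out.println(chars(65, 90));   // ABCDEFGHIJKLMNOPQRSTUVWXYZ
        System.out.println(chars(97, 122));  // abcdefghijklmnopqrstuvwxyz
    }
}
```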

Subsection 5.9.1 Character to Integer Conversions

Is 'A' a character or an integer? The fact that character data are stored as integers in the computer's memory can cause some confusion about whether a given piece of data is a character or an integer. In other words, when is a character, for example 'A', treated as the integer (65) instead of as the character 'A'? The rule in Java is that a character literal ('a' or 'A' or '0' or '?') is always treated as a character, unless we explicitly tell Java to treat it as an integer. So, if we display a literal's value,

System.out.println('a');

the letter 'a' will be displayed. Similarly, if we assign 'a' to a char variable and then display the variable's value,

char ch = 'a';
System.out.println(ch);         // Displays a

the letter 'a' will be shown. If, on the other hand, we wish to output a character's integer value, we must use an explicit cast operator as follows:

System.out.println((int)'a');   // Displays 97

A cast operation, such as (int), converts one type of data ('a') into another (97). This is known as a type conversion. Similarly, if we wish to store a character's integer value in a variable, we can cast the char into an int as follows:

int k = (int)'a';       // Converts 'a' to 97
System.out.println(k);  // Displays 97

As these examples show, a cast is a type conversion operator. Java allows a wide variety of both explicit and implicit type conversions. Certain conversions (for example, promotions, in which, say, a float is promoted to a double) take place automatically when methods are invoked, when assignment statements are executed, when expressions are evaluated, and so on.

Type conversion in Java is governed by several rules and exceptions. In some cases Java performs a conversion implicitly, without requiring a cast. For example, in the following assignment a char is converted to an int even though no explicit cast operator is used:

char ch = 'a';
int k;
k = ch; // Implicitly converts the char 'a' into the int 97

Java permits this conversion because no information will be lost. A char is represented in 16 bits whereas an int is represented in 32 bits. This is like trying to put a small object into a large box. Space will be left over, but the object will fit inside without being damaged. Similarly, storing a 16-bit char in a 32-bit int leaves the extra 16 bits filled with zeros. This widening primitive conversion changes one primitive type (char) into a wider one (int), where a type's width is the number of bits used in its representation.
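The widening conversion can be seen in a small runnable sketch. The class name WideningDemo and the helper toInt are hypothetical; the point is only that a char is accepted wherever an int is expected, and that its value is preserved.

```java
// WideningDemo and toInt() are hypothetical names used only for this illustration.
class WideningDemo {
    // The return statement widens ch to int implicitly; no cast is needed.
    static int toInt(char ch) {
        return ch;
    }

    public static void main(String[] args) {
        char ch = 'a';
        int k = ch;                      // implicit widening in an assignment
        System.out.println(k);           // 97
        System.out.println(toInt('A'));  // 65

        // The largest char value stays positive because char is unsigned.
        char largest = Character.MAX_VALUE;
        System.out.println(toInt(largest));  // 65535
    }
}
```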

On the other hand, trying to assign an int value to a char variable leads to a compile-time error:

char ch;
int k = 97;
ch = k;    // Compile-time error: can't assign int to char

Trying to assign a 32-bit int to 16-bit char is like trying to cram a big object into an undersized box. The object won't fit unless we shrink it in some way. Java will allow us to assign an int value to a char variable, but only if we perform an explicit cast on it:

ch = (char)k; // Explicit cast of int k into char ch

The (char) cast operation performs a careful "shrinking" of the int by discarding its high-order 16 bits and keeping the low-order 16 bits. This can be done without loss of information provided that k's value is in the range 0 to 65535, that is, in the range of values that fit into a char variable. This narrowing primitive conversion changes a wider type (32-bit int) to a narrower type (16-bit char). Because of the potential here for information loss, it is up to the programmer to determine that the cast can be performed safely.
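To see what happens when the value does and does not fit, consider the following sketch. The class name NarrowingDemo and the helper toChar are invented for illustration. Values outside the range 0 to 65535 are not rejected; only their low-order 16 bits survive the cast.

```java
// NarrowingDemo and toChar() are hypothetical names used only for this illustration.
class NarrowingDemo {
    // Explicit narrowing: keeps only the low-order 16 bits of k.
    static char toChar(int k) {
        return (char) k;
    }

    public static void main(String[] args) {
        System.out.println(toChar(65));          // A  (65 fits, nothing is lost)
        System.out.println(toChar(65536 + 65));  // A  (the bit above the low 16 is discarded)
        System.out.println((int) toChar(-1));    // 65535  (all 16 remaining bits are 1)
    }
}
```

The second and third calls show why the programmer, not the compiler, is responsible for checking that the cast is safe.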

The cast operator can be used with any primitive type. It applies to the variable or expression that immediately follows it. Thus, parentheses must be used to cast the expression m + n into a char:

char ch = (char)(m + n);

The following statement would cause a compile-time error because the cast operator would only be applied to m:

char ch = (char)m + n; // Error: right side is an int

In the expression on the right-hand side, the character produced by (char)m will be promoted to an int because it is part of an integer operation whose result will still be an int. Therefore, it cannot be assigned to a char without an explicit cast.
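The difference between the two forms can be checked with a short sketch. The class name CastScope and both method names are hypothetical. The second method compiles only because its return type is int, which matches the type of the expression (char)m + n.

```java
// CastScope and its methods are hypothetical names used only for this illustration.
class CastScope {
    // The parentheses make the cast apply to the whole sum.
    static char sumAsChar(int m, int n) {
        return (char) (m + n);
    }

    // The cast applies only to m; the addition promotes the result back to int.
    static int castFirstOnly(int m, int n) {
        return (char) m + n;
    }

    public static void main(String[] args) {
        System.out.println(sumAsChar(60, 5));      // A
        System.out.println(castFirstOnly(60, 5));  // 65

        // The same pattern gives the character that follows another one.
        char next = (char) ('a' + 1);
        System.out.println(next);                  // b
    }
}
```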

Self-Study Exercises

1.

Suppose that m and n are integer variables of type int and that ch1 and ch2 are character variables of type char. Determine in each of the cases that follow whether the assignment statements are valid. If not, modify the statement to make it valid.

m = n;        
m = ch1;       
ch2 = n;       
ch1 = ch2;     
ch1 = m - n;

Subsection 5.9.2 Lexical Ordering

The order in which the characters are arranged, their lexical order, is an important feature of the character set. It especially comes into play for such tasks as arranging strings in alphabetical order.

Although the actual integer values assigned to the individual characters by the ASCII and Unicode encodings seem somewhat arbitrary, the characters are, in fact, arranged in a particular order. For example, note that the sequences of digits, '0' ... '9', and letters, 'a' ... 'z' and 'A' ... 'Z', are represented by sequences of consecutive integers (Figure 5.9.1).

This makes it possible to represent the lexical order of the characters in terms of the less than relationship among integers. The fact that 'a' comes before 'f' in alphabetical order is represented by the fact that 97 (the integer code for 'a') is less than 102 (the integer code for 'f'). Similarly, the digit '5' comes before the digit '9' in lexical order because 53 (the integer code for '5') is less than 57 (the integer code for '9').

This ordering relationship extends throughout the character set. Thus, it is also the case that 'A' comes before 'a' in the lexical ordering because 65 (the integer code for 'A') is less than 97 (the integer code for 'a'). Similarly, the character '[' comes before '}' because its integer code (91) is less than 125, the integer code for '}'.
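One way to observe the lexical order directly is to sort an array of characters, since the standard Arrays.sort method orders char values by their integer codes. In this sketch the class name LexicalOrder and the helper sortChars are made up for illustration.

```java
import java.util.Arrays;

// LexicalOrder and sortChars() are hypothetical names used only for this illustration.
class LexicalOrder {
    // Returns the characters of s rearranged in order of their integer codes.
    static String sortChars(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Codes: 'b'=98, 'A'=65, 'a'=97, '}'=125, 'Z'=90, '['=91, '0'=48
        System.out.println(sortChars("bAa}Z[0"));  // 0AZ[ab}
    }
}
```

Digits come first, then uppercase letters, then lowercase letters, with the punctuation characters falling wherever their codes place them.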

Subsection 5.9.3 Relational Operators

Given the lexical ordering of the char type, the following relational operators can be defined: <, >, <=, >=, ==, !=. Given any two characters, ch1 and ch2, the expression ch1 < ch2 is true if and only if the integer value of ch1 is less than the integer value of ch2. In this case we say that ch1 precedes ch2 in lexical order. Similarly, the expression ch1 > ch2 is true if and only if the integer value of ch1 is greater than the integer value of ch2. In this case we say that ch1 follows ch2. And so on for the other relational operators. This means that we can perform comparison operations on any two character operands (Table 5.9.3).

Table 5.9.3. Relational operations on characters.
Operation           Operator  Java        True Expression
Precedes            <         ch1 < ch2   'a' < 'b'
Follows             >         ch1 > ch2   'c' > 'a'
Precedes or equals  <=        ch1 <= ch2  'a' <= 'a'
Follows or equals   >=        ch2 >= ch1  'a' >= 'a'
Equal to            ==        ch1 == ch2  'a' == 'a'
Not equal to        !=        ch1 != ch2  'a' != 'b'
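Because the digits and letters occupy consecutive codes, the relational operators can be combined into simple range tests. The following is a minimal sketch with invented names (CharTests, isDigit, isLower, isUpper). It covers only the ASCII ranges shown in Figure 5.9.1; the standard java.lang.Character class provides methods of the same kind that also handle the rest of Unicode.

```java
// CharTests and its methods are hypothetical names used only for this illustration.
class CharTests {
    // True if ch is one of '0' ... '9'.
    static boolean isDigit(char ch) {
        return ch >= '0' && ch <= '9';
    }

    // True if ch is one of 'a' ... 'z'.
    static boolean isLower(char ch) {
        return ch >= 'a' && ch <= 'z';
    }

    // True if ch is one of 'A' ... 'Z'.
    static boolean isUpper(char ch) {
        return ch >= 'A' && ch <= 'Z';
    }

    public static void main(String[] args) {
        System.out.println('a' < 'b');     // true
        System.out.println('A' < 'a');     // true
        System.out.println(isDigit('7'));  // true
        System.out.println(isLower('Q'));  // false
        System.out.println(isUpper('Q'));  // true
    }
}
```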