Lab 13 - Working with Character Data and C-style Strings

CREATE ANSWER SHEET for Closed Lab 13
Objectives:

Learn how characters are stored and manipulated in C++ programs
Introduce several standard character-oriented functions
Introduce C-style character strings (aka C strings)
Introduce and discuss some commonly used C string functions

Sections:

The char Data Type for Storing Characters
Character Expressions and Comparisons
Character Functions
Character Input
Other Character Sets
C-style Character Strings (aka C strings)
Input of C Strings
Output of C Strings
Comparison of C Strings
Copying (assigning) a C String
C String Length

The char Data Type for Storing Characters
The char data type can hold a single character value. Thus, we are able to declare character constants and variables, as well as input and output character data, as follows:

    const char ASTERISK = '*';
    char ch;
    char letter;
    letter = 'a';
    cin >> ch;
    cout << letter;

The char data type sets aside one byte to store a character value. Recall that one byte equals 8 bits, where each bit contains either a zero or a one. We will number the bits of a byte as shown in this diagram:

b₇  b₆  b₅  b₄  b₃  b₂  b₁  b₀

The leftmost bit is numbered b₇. Going to the right, the next bit is numbered b₆. Continuing in descending order, we number the bits b₅, b₄, b₃, b₂, b₁, until we reach the rightmost bit that is numbered b₀.

What values do the 8 bits have when they represent various characters? For example, what pattern of zeroes and ones is used to represent, say, the letter A? A common encoding scheme for character data is ASCII, which stands for American Standard Code for Information Interchange. This standard describes what bit patterns represent various symbols, digits, letters, and so forth. By and large, all the computer systems you use encode characters using ASCII. (We will discuss why this is a bit of a simplification later in this lab.)

Before we examine an ASCII code chart, we must realize that any pattern of zeroes and ones in a byte is also a binary number and thus has a numeric value! For example, the bit pattern of all zeroes

0 0 0 0 0 0 0 0

has a (decimal) value of zero, whereas the following bit patterns represent the values one, two, three and four, respectively:

0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0
0 0 0 0 0 0 1 1
0 0 0 0 0 1 0 0

and the following bit pattern represents the (decimal) value 127:

0 1 1 1 1 1 1 1

Don't worry if you aren't familiar with binary numbers at this time! The lesson to be learned from the examples above is that each bit pattern has a decimal value. Thus instead of dealing with the bit pattern for an encoded character, we can instead use the pattern's decimal value to identify the pattern.

In the ASCII code, bit b₇ is always a zero. Since this leaves us with 7 bits (b₆ through b₀) for our code, this means there are 128 possible patterns. (2⁷ = 128.) An ASCII code chart will thus have 128 entries; because we start numbering from zero, the entries will be numbered 0 through 127.

Please click on the ASCII Code Chart link to open a new window that will display the chart.

Examine the entries in the ASCII Code Chart. Notice that the 128 entries are divided into four sections of 32 characters. The first section, in green, consists of the control characters. Many of these characters (for example, the horizontal tab, HT) cannot be displayed directly on a screen. The next section, in light brown, consists of various symbols, punctuation marks, and numerals. The third section, in blue, consists mainly of uppercase letters of the alphabet. The fourth and final section, in light blue, consists mainly of the lowercase letters of the alphabet. Each entry has a (very small red) number identifying it. For example, the uppercase letter A (in the blue third section) has the number 65. This number is the value of the bit pattern that represents the letter A. Here is the exciting part: in C++ the character 'A' is stored as the number 65, so the two are treated as the same value! Thus, in an IF statement that begins if ('A'==65), the boolean expression (aka predicate) would evaluate to true.

Note in particular the very first ASCII character: the NULL character (abbreviated NUL). The bit pattern for the NULL character is all zeroes and thus the NULL character has a value of zero. Note also that the blank character is called space (abbreviated SP) in the ASCII code and has a value of 32.

Exercise 1:
Place the answer to the following questions on the answer sheet:

What is the ASCII value of the character 'G'?
What character corresponds to the ASCII value 109?

Character Expressions and Comparisons
We have previously used the relational operators ==, <, <=, >, >=, and != to compare, for example, integer quantities. However the relational operators also work with character data. This is because, in C++, character data is really just a form of integer data. If we examine the ASCII chart, we will see that the digit characters '0' through '9' have code values of 48 through 57. Thus '0'<'1' evaluates to true because 48 is less than 49. Similarly the relations '2'<'3' through '8'<'9' will also evaluate to true.

In ASCII, the uppercase characters 'A' to 'Z' have the codes 65 through 90. Thus the boolean expression 'A'<'B' will evaluate to true. In ASCII, the lowercase letters have the codes 97 through 122. Thus the boolean expression 'a'<'b' will evaluate to true. Because of the ASCII encoding scheme, an uppercase letter is less than a lowercase letter. Thus the boolean expression 'A'<'a' will evaluate to true. Note also that digits (numerals) are less than letters. Thus the boolean expression '9'<'A' will evaluate to true.

Because char values are just a form of integer data, it is possible to do arithmetic with them. For example, 'C'-'A' evaluates to 2, because 67-65 equals two. Can you predict what the following would print?

    char ch;
    ch = 'a' + 10;
    cout << ch;

It would print the character k, because 'k' is the tenth character after 'a'. Or, put another way, because 97 (the ASCII value of 'a') plus 10 equals 107, which is the ASCII value of 'k'.

Usually the C++ compiler can determine from context if we are using a char item as an ASCII character or as a numeric quantity. Sometimes, however, we need to be explicit in a program that we want to use an ASCII character as a numeric quantity. In those instances, we can use the C++ type cast operator int. It is used much like a function. Thus int('0') is the value 48, int('A') is the value 65, and int('a') is the value 97. Similarly, the C++ type cast operator char will produce a character whose code is the given integer. Hence char(97) is 'a', char(48) is '0', etc. Thus the following code would first print the letter A on one line, the number 65 on the second line, the number 37 on the third line, and the symbol % (percent sign) on the fourth line.

    int  ac = 37;
    char ch = 'A';
    cout << ch       << endl;   // prints A
    cout << int(ch)  << endl;   // prints 65
    cout << ac       << endl;   // prints 37
    cout << char(ac) << endl;   // prints %

Exercise 2:
Evaluate the following C++ expressions using the ASCII character set.

int('a') - int('A')
char(int('L'))
int('8') - int('3')
'8' > 'X'
'G' != 'g'
'K' < '\t'
'B' <= 75
char( 'j'+7 )
int( 'j'+7 )
char( 'P'+ ('D'-'A') )

Character Functions
Several character functions are provided in C++ to aid in the manipulation of character data. To use these library functions the header file, cctype, must be included in a C++ program. A few of these are listed below:

Function	Purpose
`tolower(ch)`	If `ch` is uppercase, the function returns the corresponding lowercase letter. Otherwise, `ch` is returned.
`toupper(ch)`	If `ch` is lowercase, the function returns the corresponding uppercase letter. Otherwise, `ch` is returned.
`isalpha(ch)`	Returns true if `ch` is an upper or lower case letter. Otherwise false is returned.
`isdigit(ch)`	Returns true if `ch` is a digit. Otherwise false is returned.
`islower(ch)`	Returns true if `ch` is a lowercase letter. Otherwise false is returned.
`isupper(ch)`	Returns true if `ch` is an uppercase letter. Otherwise false is returned.
`isspace(ch)`	Returns true if `ch` is a space (SP), newline (LF), formfeed (FF), carriage return (CR), tab (HT), or vertical tab (VT). Otherwise false is returned.

Exercise 3:
Although the cctype functions above are very useful and commonly used, they are not hard to code. On the answer sheet, write code to implement your version of the isdigit() function described above. The function prototype for your function is bool isdigit(char ch);

Character Input
We saw in Closed Lab 5 that the extraction (input) operator >> can be used to read character values. When reading character values using the extraction operator, any leading whitespace (blanks, tabs, newlines, etc.) is skipped until the first non-whitespace character is found. For example, if the input is

    bbbbjk    (b is a blank)

then the code cin >> ch; causes, by default, the blanks to be skipped and the character j is read into ch.

We can ask C++ not to skip whitespace on input by setting a format state flag. The following C++ statement would cause the flag skipws to be turned off (it defaults to on, which means that whitespace is skipped over):

    cin.unsetf(ios::skipws);

After the flag is turned off, if the input is

    bbbbjk    (b is a blank)

then the code cin >> ch; will cause ch to contain a blank since whitespace is no longer skipped.

Turning a format state flag on and off is a hassle. C++ provides an easier mechanism that lets us input one character at a time with whitespace treated the same way as any other character. The get() function reads the very next character in the input stream without skipping any whitespace. Using get() instead of the extraction operator >> in the above example, if the input is again

    bbbbjk    (b is a blank)

then the code cin.get(ch); will read a blank into the variable ch.

Exercise 4:
Copy the source file $CLA/cla13a.cpp to your closed lab directory. The program currently reads data, character by character, from standard input (cin) and counts the number of newline characters. As it reads, it up-cases (i.e., capitalizes) lowercase letters and prints the revised stream of characters to standard output (cout). Make the following two changes to the program:

Change the program so that it changes the case of all letters and prints this to standard output. That is, instead of only up-casing lowercase letters, it also lower-cases uppercase letters. Do this by replacing the ChangeToUpperCase function with a ChangeCase function that takes the value of its single argument and returns the corresponding upper/lower-case letter, as applicable. Replace the call to ChangeToUpperCase in main() with a call to ChangeCase.
Have the program also count the number of tab characters in the input stream and print this number out on a line after the count of newline characters. (Two lines of code, currently commented out, anticipate this modification; go ahead and use this code.)

Also change student name on the top comment (first line of code) to your name and change the CSCI 2170 section number to your section number. Use as your executable file name "recase". You are to submit the source program listing of the modified code, compilation results, and a sample run of the program using as input the original source file ($CLA/cla13a.cpp). Something like the following UNIX commands will let you create what is required:

            $ script lab13ex4.log
            $ pr -n -t -e4 cla13a.cpp
            $ c++ cla13a.cpp -o recase
            $ recase < $CLA/cla13a.cpp
            $ exit

(Be sure to properly exit the script session!)

Other Character Sets

Note: this section is optional reading.
If you are pressed for time, you may skip to the next section (C-style Character Strings).

If you are French and your name is Thévenod, or you are German and your name is Günter, or you are Italian and your name is Pezzè (note the special characters in each name), clearly ASCII is an inadequate encoding scheme: it simply does not have the characters or symbols necessary to write many non-English words and names. Extra characters and symbols can be grafted atop the ASCII code by relaxing the assumption that bit b₇ always be zero. A discussion of other character sets is beyond the scope of this closed lab. However, if you are interested in the subject, you are encouraged to follow this link: Character sets & Encodings.

But what if you need to write in Chinese or Japanese? Languages such as these use thousands of distinct characters. Unicode is a code that can accommodate even such enormous character sets. Unicode began as a 16-bit encoding scheme with 65536 possible characters (2¹⁶ = 65536). It has since been extended beyond 16 bits, and its code values now range from 0 through 0x10FFFF (1,114,112 possible values). The use of Unicode is now widespread. (For example, strings in the languages Python 3 and Java hold Unicode characters.) If you are interested in the subject, you are encouraged to visit http://www.unicode.org for more information.

C-style Character Strings (aka C strings)
Recall that the language C++ was based on the earlier language C. Unlike C++, the C language does not have the string data type. Instead, in C, strings of characters are stored in one-dimensional arrays of type char, one character per array element. Because C++ maintains backward compatibility with C, when we have placed a string of characters in double quotes, say in an output instruction, we have used C strings. For example:

    cout << "Enter the test score:";

    cout << "x=";

Thus a list of characters contained in double quotes is a C string, also sometimes called a C-style character string. Because C strings are stored as an array, the C string "x=" is stored as

x = \0

0 1 2

A C string is terminated by the special character '\0' called the NULL character. The NULL character acts as a sentinel and marks the end of a C-style character string.

Note the difference between a C string and a character. For example, the C string "x" is stored as an array

x \0

0 1

whereas the character 'x' is stored as a single character

x

Notice:
The string is enclosed in double quotes
while the character is enclosed in single quotes (or apostrophes).

Character arrays are used to store C strings. For example:

    char lastName[10] = "Jackson";

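will cause the array lastName to contain:

J a c k s o n \0 \0 \0

0 1 2 3 4 5 6 7 8 9

(The unused elements at positions 8 and 9 are also set to '\0' by this kind of initialization.)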

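If instead we try to initialize lastName with a string bigger than can be stored in the array, for example:

    char lastName[10] = "Washington";

then a standard C++ compiler will reject the declaration, because "Washington" needs 11 elements: 10 letters plus the NULL character. No such check is made, however, when an oversized string is placed into an array while the program is running (for example, by reading input or by copying a string). In that case the following unfortunate situation will occur:

W a s h i n g t o n \0

0 1 2 3 4 5 6 7 8 9 (past the end of the array)

Notice that the NULL character is placed at the end of the C string. However, not enough space was allocated for the array, so the NULL character lands in memory that does not belong to the array, and unpredictable results are likely to occur. Perhaps other valuable data may be overwritten. (This error is sometimes called a buffer overflow error.)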

Exercise 5:
On the answer sheet, place the answers to the following questions.

Show a declaration of a C string array variable large enough to contain your first (personal) name. Be sure to include room for the NULL terminating character. The array variable should be initialized to contain your first name.
Show an internal representation of your first name.

For example, to show the internal representation of "x=" on the answer sheet, one would type:

x = \0

0 1 2

Input of C Strings
When reading a C string variable, we could use the get() routine described previously. For example:

    char line[80];
    char ch;
    int charCount = 0;

    // Read a line of text into the character
    // array "line".  If the line is too long,
    // only read the first 79 characters

    cin.get(ch);
    while ( ch!='\n' && charCount<79 )
    {
        line[charCount] = ch;
        charCount++;
        cin.get(ch);
    }
    // Place the NULL terminating character in the last position
    line[charCount] = '\0';

Alternately, the extraction (input) operator >> can be used to read in a C string. As when reading numeric and character data, beginning whitespace characters (blanks, newline, tab, etc.) are ignored; next, all characters up to the first whitespace character are read and placed in the variable. When the first whitespace character is encountered, reading terminates and the NULL character is placed in the string. For example, if we have:

    char name[20];
    cout << "Please enter the name: ";
    cin >> name;

and our input is: Mouse, Mickey

then internally, name will contain:

M o u s e , \0

0 1 2 3 4 5 6

The same result will occur no matter how many blanks, tabs, etc. preceded Mouse. As you can see, the first whitespace terminates the string. We must remember to use whitespace as a separator of data items. For example, if we read

    cin >> name >> test;

where test is an int variable and our input is James95

then the name will contain:

J a m e s 9 5 \0

0 1 2 3 4 5 6 7

and the variable test will not have any value read into it.

Output of C Strings
A C string can be printed one character at a time, for example using the put() function. Alternately, the C string can be printed all at once using cout. Assume name is a character string initialized as follows:

    char name[15] = "Washington";

The output

    cout << endl << name;

will cause the string "Washington" to be printed in the leftmost 10 positions of a new line. Assuming the header file, iomanip, has been added to a program, the instruction:

    cout << setw(15) << name;

will cause "Washington" to be printed right justified in a field of width 15 with 5 preceding blanks. To left justify the name, we can use

    cout << left << setw(15) << name;

This will cause "Washington" to be printed left-justified with 5 blanks after the name.

To resume right justification, we can either insert the right manipulator (cout << right;) or turn off left justification. The following statement does the latter:

    cout.unsetf(ios::left);

Exercise 6:
Copy the source file $CLA/cla13b.cpp to your closed lab directory. This program has a C string variable name and an integer variable ID. The program inputs values into the two variables and prints the values. After you have examined the program, test the program with your last name and the last four digits of your social security number. Next, modify the program so that it prints the name and ID as name : ID. Also add I/O manipulators so that the name is left justified in a field of 20 characters, the ID is right justified in a field of 15 characters, and a ':' appears between them. Also change the student name in the top comment (first line of code) to your name and change the CSCI 2170 section number to your section number. You are to submit the source program listing of the modified code, compilation results, and a sample run of the program using as input your name. Something like the following UNIX commands will let you create what is required:

            $ script lab13ex6.log
            $ pr -n -t -e4 cla13b.cpp
            $ c++ cla13b.cpp
            $ a.out
            ...the data you enter...
            $ exit

(Be sure to properly exit the script session!)

C String manipulating functions
C++ provides an extensive collection of functions that allow the programmer to manipulate C strings.
To use these functions, the cstring header file must be included in your program.

Comparison of C Strings
The function strcmp() has been provided to compare two C strings, say str1 and str2. If str1 < str2 (based on a character by character comparison), a value less than 0 is returned. If str1 and str2 are the same, a value of 0 is returned. If str1 > str2, a value greater than 0 is returned. The function makes character comparisons of the elements in str1 and str2 starting with the 0th character. It stops when it finds characters that are not equal or when it reaches the end of one of the strings.

Examples:

    strcmp("A","B")         returns < 0 (negative)
    strcmp("James","Jami")  returns < 0 (negative) because 'e' < 'i'
    strcmp("135", "24")     returns < 0 (negative) because '1' < '2'
    strcmp("ABCD","ABC")    returns > 0 (positive)
    strcmp("ABC","ABCD")    returns < 0 (negative)
    strcmp("89", "89")      returns 0   (zero)

Exercise 7:
For each of the following C string comparisons, tell whether strcmp() returns 0, a value less than 0 (a negative value), or a value greater than 0 (a positive value)?

strcmp("158", "435")
strcmp("abc", "ABC")
strcmp("Jim", "Jimmy")
strcmp("Brenda", "Brenda")

Copying (assigning) a C String
C string assignment (or string copy) is achieved by the function strcpy(). This function has two arguments, a destination string and a source string. The function copies the source string into the destination string. No check is made to determine whether or not the destination string has enough space! All characters from the source up to and including the '\0' are copied.

Example:

    char name[15];
    strcpy(name, "Mr. Mouse");

will cause the variable name to contain (SP marks the space character):

M r . SP M o u s e \0

0 1 2 3 4 5 6 7 8 9

If the source string is shorter than the destination string, the remaining characters in the destination string remain unchanged. If the source string is longer than the destination string, storage locations following those allocated to the destination string will be overwritten. This may cause surprising (and hard to debug) results.

C String Length
C++ provides a function, called strlen(), that returns the number of characters in a C string (up to but not including the NULL character).

For example, strlen("Mr. Mouse") returns the value 9.

Exercise 8:
Show the internal representation of str1 and str2 (in the same fashion as you did in Exercise 5) and the value of the variable length after each of the following statements, given the declarations:

    char str1[5];
    char str2[5]="mom";
    int length;

strcpy(str1, "Joe");
length = strlen(str1);

strcpy(str1, "Joseph");
length = strlen(str1);

(The second item is a bit tricky!)

Exercise 9:
Suppose the C++ function strlen() didn't already exist. Write code to implement your version of the strlen() function. That is, write a function to determine the length of a C string, str, supplied as an argument. Call your function MyStrLen(). The function prototype for your function is int MyStrLen(char str[]);

Write a main program to test your function. The main program should contain a sentinel-controlled loop that

reads a C string,
finds the length of the C string using your MyStrLen() function,
and prints the length of the C string.

The loop should terminate when the sentinel string END is read. (That is, the three input characters "END".) Remember to use the function strcmp() to test for the terminating condition. Use as your source file name "cla13c.cpp". As the first line in the source file, have a comment (much like the programs $CLA/cla13a.cpp and $CLA/cla13b.cpp) containing your name and CSCI 2170 section number.

Use as your executable file name "length". You are to submit the source program listing of the source code, compilation results, and a sample run of the program using input of your choosing. Something like the following UNIX commands will let you create what is required:

            $ script lab13ex9.log
            $ pr -n -t -e4 cla13c.cpp
            $ c++ cla13c.cpp -o length
            $ length
            ...the data you enter...
            $ exit

(Be sure to properly exit the script session!)

Exercise 10:
There are some 20+ C string functions. To see what they are, look at the UNIX manual page for the string functions by typing
$ man string

On the answer sheet, give the names of five additional C string functions not described above.

Exercise 11:

Submit the log files you have created in Lab 13 by typing

$ handin lab13log lab13ex4.log lab13ex6.log lab13ex9.log

Exercise 12:

From the PC you are working on, you must also submit the answer sheet (AnswerSheet13.pdf) using the following directions:

Save your Lab13 Final Answer Sheet (click icon on very bottom of the answer sheet window)
Create the file AnswerSheet13.pdf using your Browser's 'Save as PDF' feature
(1. ctrl+p [Windows] or command+p [Mac] 2. Select 'Save as PDF')
Use your C-number/password to log into the Gus Homework Repository System:
https://www.cs.mtsu.edu/cgi-bin/gus/gus.py
Submit your AnswerSheet13.pdf file to the Gus Homework Repository System

Select the assignment from the dropdown menu: lab13ans
Select the Submit radio button
Click the button: Perform Action
(On the subsequent screen) Click the button: Choose File to locate your AnswerSheet13.pdf
Click the Upload button to upload the chosen file.

Congratulations! You have finished Lab 13.

Once you are done, you will need to log off ranger. Enter $ exit to exit the (sakura) terminal window. Depending on how you logged in to ranger you will need to enter $ exit one or more times to get completely logged off the system.