Elementary Data Structures ~ Strings

We use the term string to refer to a variable-length array of characters, defined by a starting point and by a string-termination character marking the end. Strings are valuable as low-level data structures, for two basic reasons.First, many computing applications involve processing textual data, which can be represented directly with strings. Second, many computer systems provide direct and efficient access to bytes of memory, which correspond directly to characters in strings. That is, in a great many situations, the string abstraction matches needs of the application to the capabilities of the machine.

The abstract notion of a sequence of characters ending with a string-termination character could be implemented in many ways. For example, we could use a linked list, although that choice would exact a cost of one pointer per character.

The difference between a string and an array of characters revolves around length. Both represent contiguous areas of memory, but the length of an array is set at the time that the array is created, whereas the length of a string may change during the execution of a program. This difference has interesting implications, which we shall explore shortly.

We need to reserve memory for a string, either at compile time, by declaring a fixed-length array of characters, or at execution time, by calling malloc. Once the array is allocated, we can fill it with characters, starting at the beginning, and ending with the string-termination character. Without a string-termination character, a string is no more or no less than an array of characters; with the string-termination character, we can work at a higher level of abstraction, and consider only the portion of the array from the beginning to the string-termination character to contain meaningful information. In C, the termination character is the one with value 0, also known as ‘\0’.

For example, to find the length of a string, we count the number of characters between the beginning and the string-termination character. Table 1 gives simple operations that we commonly perform on strings. They all involve processing the strings by scanning through them from beginning to end. Many of these functions are available as library functions declared in <string.h>, although many programmers use slightly modified versions in inline code for simple applications. Robust functions implementing the same operations would have extra code to check for error conditions. We include the code here not just to highlight its simplicity, but also to expose its performance characteristics plainly.


Table 1    Elementary string-processing operations

This table gives implementations of basic string-processing operations, using two different C language primitives. The pointer approach leads to more compact code, but the indexed-array approach is a more natural way to express the algorithms and leads to code that is easier to understand. The pointer version of the concatenate operation is the same as the indexed array version, and the pointer version of prefixed compare is obtained from the normal compare in the same way as for the indexed array version and is omitted. The implementations all take time proportional to string lengths.


One of the most important operations that we perform on strings is the compare operation, which tells us which of two strings would appear first in the dictionary. For purposes of discussion, we assume an idealized dictionary (since the actual rules for strings that contain punctuation, uppercase and lowercase letters, numbers, and so forth are rather complex), and compare strings character-by-character, from beginning to end. This ordering is called lexicographic order. We also use the compare function to tell whether strings are equal — by convention, the compare function returns a negative number if the first argument string appears before the second in the dictionary, return 0 if they are equal, and return 1 if the first appears after the second in lexicographic order. It is critical to take note that doing equality testing is not the same as determining whether two string pointers are equal — if two string pointers are equal, then so are the referenced strings (they are the same string), but we also could have different string pointers that point to equal strings (identical sequences of characters). Numerous applications involve storing information as strings, then processing or accessing that information by comparing the strings, so the compare operation is a particularly critical one.

Program 1 is an implementation of a simple string-processing task, which prints out the places where a short pattern string appears within a long text string. Several sophisticated algorithms have been developed for this task, but this simple one illustrates several of the conventions that we use when processing string in C.


Program 1    String search

This program discovers all occurrences of a word from the command line in a (presumably much larger) text string. We declare the text string as a fixed-size character array (we could also use malloc, as previous program) and read it from standard input, using getchar(). Memory for the word from the command line-argument is allocated by the system before this program is invoked, and we find the string pointer in argv[1]. For each starting position i in a, we try matching the substring starting at that position with p, testing for equality character by character. Whenever we reach the end of p successfully, we print out the starting position i of the occurrence of the word in the text.


String processing provides a convincing example of the need to be knowledgeable about the performance of library functions. The problem is that a library function might take more time than we expect, intuitively. For example, determining the length of a string takes time proportional to the length of the string. For example, after a quick look at the library, we might implement the pattern match in Program 1 as follows:


Program 2    String search in library

This program implements the pattern match in Program 1 in a standard string library way.


Unfortunately, this code fragment takes time proportional to at least the square of the length of a, no matter what code is in the body of the loop, because it goes all the way through a to determine its length each time through the loop. This cost is considerable, even prohibitive: Running this program to check whether this book (which has more than 1 million characters) contains a certain word would require trillions of instructions. Problem such as this one are difficult to detect because the problem might work fine when we are debugging if for all small strings, but then slow down or even never finish when it goes into production. Moreover, we can avoid such problems only if we know about them!!!

This kind of error is called a performance bug, because the code can be verified to be correct, but it does not perform as efficiently as we (implicitly) expect. Before we can even begin the study of efficient algorithms, we must be certain to have eliminated performance bugs of this type. Although standard libraries have many virtues, we must be wary of the dangers of using them for simple functions of this kind.

One of the essential concepts that we return to time and again in this book is that different implementations of the same abstract notion can lead to widely different performance characteristics. For example, if we keep track of the length of the string, we can support a function that can return the length of a string in constant time, but for which other operations run more slowly. One implementation might be appropriate for one application; another implementation might be appropriate for another application.

Library functions, all too often, cannot guarantee to provide the best performance for all applications. Even if (as in the case of strlen) the performance of a library function is well documented, we have no assurance that some future implementation might not involve performance changes that will have adverse effects on our programs. This issue is critical in the design of algorithms and data structures, and thus is one that we must always bear in mind.

Strings are actually pointers to chars. In some cases, this realization can lead to compact code for string-processing functions. For example, to copy one string to another, we could writeinstead of or the third option given in Table 1. These two ways of referring to string are equivalent, but may lead to code with different performance properties on different machines. We generally use the array version for clarity and the pointer version for economy, reserving detailed study of which is best for particular pieces of frequently executed code in particular applications.

Memory allocation for strings is more difficult than for linked lists because strings vary in size. Indeed, a fully general mechanism to reserve space for strings is neither more nor less than the system-provided malloc and free functions. As mentioned earlier, various algorithms have been developed for this problem, whose performance characteristics are system and machine dependent. Often, memory allocation is a less severe problem when we are working with strings than it might first appear, because we work with pointers to the string, rather that with the characters themselves. Indeed, we do not normally assume in C code that all strings sit in individually allocated chunks of memory. We tend to assume that each string sits in memory of indeterminate allocation, just big enough for the string and its termination character. We must be very careful to ensure adequate allocation when we are performing operations that build or lengthen strings.

发表评论

电子邮件地址不会被公开。 必填项已用*标注