Armed with the tools outlined in Principles of Algorithm Analysis I, we now consider the analysis of sequential search and binary search, two basic algorithms for determining whether or not any of a sequence of objects appears among a set of previously stored objects.
1. Examples of Algorithm Analysis
Our purpose is to illustrate the manner in which we will compare algorithms, rather than to describe the particular algorithms in detail. The simple versions of the algorithms that we consider here not only expose many aspects of the algorithm design and analysis problem, but also have many direct applications.
For example, we might imagine a credit-card company that has N credit risks or stolen credit cards, and that wants to check whether any of M given transactions involves any of the N bad numbers. To be concrete, we might think of N as being large (say, on the order of 1000 to 1000000) and M as being huge (say, on the order of 1000000 to 1000000000) for this application. The goal of the analysis is to be able to estimate the running times of the algorithms when the values of the parameters fall within these ranges.
Program 1 implements a straightforward solution to the search problem. It is packaged as a C function that operates on an array, for compatibility with other code that we will examine for the same problem in later parts, but it is not necessary to understand the details of the packaging to understand the algorithm: We store all the objects in an array; then, for each transaction, we look through the array sequentially, from beginning to end, checking each object to see whether it is the one that we seek. To analyze the algorithm, we note immediately that the running time depends on whether or not the object sought is in the array. We can determine that a search is unsuccessful only by examining each of the N objects, but a search could end successfully at the first, the second, or any one of the objects.
Program 1 Sequential search
This function checks whether the number v is among a previously stored set of numbers in a[l], a[l+1], ..., a[r], by comparing against each number sequentially, starting at the beginning. If we reach the end without finding the number sought, then we return the value -1. Otherwise, we return the index of the array position containing the number.
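The listing for Program 1 is not reproduced here; the following is a minimal C sketch consistent with this description (the function name seqsearch is our own choice, not necessarily the one used in the original program):

    /* Sequential search: return the index at which v appears in a[l..r],
       or -1 if it is not there. */
    int seqsearch(int a[], int v, int l, int r)
    {
        int i;
        for (i = l; i <= r; i++)
            if (v == a[i]) return i;
        return -1;
    }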
Therefore, the running time depends on the data. If all the searches are for the number that happens to be in the first position in the array, then the algorithm will be fast; if they are for the number that happens to be in the last position in the array, it will be slow. We discuss in a later section the distinction between being able to guarantee performance and being able to predict performance. In this case, the best guarantee that we can provide is that no more than N numbers will be examined.
To make a prediction, however, we need to make an assumption about the data. In this case, we might choose to assume that all the numbers are randomly chosen. This assumption implies, for example, that each number in the table is equally likely to be the object of a search. On reflection, we realize that it is that property of the search that is critical, because with randomly chosen numbers we would be unlikely to have a successful search at all. For some applications, the number of transactions that involve a successful search might be high; for other applications, it might be low. To avoid confusing the model with properties of the application, we separate the two cases (successful and unsuccessful) and analyze them independently. This example illustrates that a critical part of an effective analysis is the development of a reasonable model for the application at hand. Our analytic results will depend on the proportion of searches that are successful; indeed, the analysis will give us information that we might need if we are to choose different algorithms for different applications based on this parameter.
Property 1 Sequential search examines N numbers for each unsuccessful search and about N / 2 numbers for each successful search on the average.
If each number in the table is equally likely to be the object of a search, then

    (1 + 2 + ... + N) / N = (N + 1) / 2

is the average cost of a search. Property 1 implies that the running time of Program 1 is proportional to N, subject to the implicit assumption that the average cost of comparing two numbers is constant. Thus, for example, we can expect that, if we double the number of objects, we double the amount of time required for a search.
We can speed up sequential search for unsuccessful search by putting the numbers in the table in order. A number of the algorithms that we will consider get that task done in time proportional to N log N, which is insignificant by comparison to the search costs when M is huge. In an ordered table, we can terminate the search immediately on reaching a number that is larger than the one that we seek. This change reduces the cost of sequential search to about N/2 numbers examined for unsuccessful search, the same as for successful search.
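A sketch of that modification, under the same assumed interface as the sequential-search sketch above and with the table a[l..r] in increasing order:

    /* Sequential search in an ordered table: give up as soon as we reach a
       number larger than the one sought.  Returns the index of v, or -1. */
    int seqsearch_ordered(int a[], int v, int l, int r)
    {
        int i;
        for (i = l; i <= r && a[i] <= v; i++)
            if (a[i] == v) return i;
        return -1;
    }

An unsuccessful search now stops, on the average, about halfway through the table rather than always scanning to the end, which is the source of the improvement claimed in Property 2.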
Property 2 Sequential search in an ordered table examines N numbers for each search in the worst case and about N/2 numbers for each search on the average.
We still need to specify a model for unsuccessful search. This result follows from assuming that the search is equally likely to terminate at any one of the N+1 intervals defined by the N numbers in the table, which leads immediately to the expression

    (1 + 2 + ... + N + N) / (N + 1).

The cost of an unsuccessful search ending before or after the Nth entry in the table is the same: N.
Another way to state the result of Property 2 is to say that the running time of sequential search is proportional to MN for M transactions, on the average and in the worst case. If we double either the number of transactions or the number of objects in the table, we can expect the running time to double; if we double both, we can expect the running time to go up by a factor of 4. The result also tells us that the method is not suitable for huge tables. If it takes c microseconds to examine a single number, then, for M = 1000000000 and N = 1000000, the running time for all the transactions would be at least (c/2)1000000000 seconds, or about 16c years, which is prohibitive.
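Spelling out the arithmetic behind that last sentence, with M = 10^9, N = 10^6, and 10^9 seconds taken to be about 31.7 years:

    (MN/2) examinations × c microseconds = (10^9 × 10^6 / 2) × c × 10^-6 seconds
                                         = (c/2) × 10^9 seconds
                                         ≈ (c/2) × 31.7 years
                                         ≈ 16c years.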
Program 2 is a classical solution to the search problem that is much more efficient than sequential search. It is based on the idea that, if the numbers in the table are in order, we can eliminate half of them from consideration by comparing the one that we seek with the one at the middle position in the table. If it is equal, we have a successful search. If it is less, we apply the same method to the left half of the table. If it is greater, we apply the same method to the right half of the table. Figure 1 is an example of the operation of this method on a sample set of numbers.
Program 2 Binary search
This program has the same functionality as Program 1, but it is much more efficient.
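The listing for Program 2 is likewise not reproduced here; the following is a minimal iterative C sketch of binary search in an ordered table a[l..r], with the same assumed interface and naming conventions as before:

    /* Binary search in an ordered table a[l..r]: return the index of v, or -1. */
    int binsearch(int a[], int v, int l, int r)
    {
        while (r >= l) {
            int m = (l + r) / 2;         /* middle position of the current subarray */
            if (v == a[m]) return m;     /* successful search */
            if (v < a[m]) r = m - 1;     /* continue in the left half */
            else          l = m + 1;     /* continue in the right half */
        }
        return -1;                       /* subarray empty: unsuccessful search */
    }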
Figure 1 To see whether or not 5025 is in the table of numbers in the left column, we first compare it with 6504; that leads us to consider the first half of the array. Then we compare against 4548 (the middle of the first half); that leads us to the second half of the first half. We continue, always working on a subarray that would contain the number being sought, if it is in the table. Eventually, we get a subarray with just one element, which is not equal to 5025, so 5025 is not in the table.
Property 3 Binary search never examines more than ⌊lg N⌋ + 1 numbers.
The proof of this property illustrates the use of recurrence relations in the analysis of algorithms. If we let T_N represent the number of comparisons required for binary search in the worst case, then the way in which the algorithm reduces search in a table of size N to search in a table half the size immediately implies that

    T_N ≤ T_{⌊N/2⌋} + 1, for N ≥ 2, with T_1 = 1.
To search in a table of size N, we examine the middle number, then search in a table of size no larger than ⌊N/2⌋. The actual cost could be less than this value because the comparison might cause us to terminate a successful search, or because the table to be searched might be of size ⌊N/2⌋ – 1 (if N is even). As we did in the solution of Formula 2 in Principles of Algorithm Analysis I, we can prove immediately that T_N ≤ n + 1 when N = 2^n,
and then verify the general result by induction (the chain of inequalities for N = 2^n is written out below). Property 3 allows us to solve a huge search problem with up to 1 million numbers with at most 20 comparisons per transaction, and that is likely to be less than the time it takes to read or write the number on many computers. The search problem is so important that several methods have been developed that are even faster than this one, as we shall see later.
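For N = 2^n, that chain amounts to unwinding the recurrence one level at a time:

    T_{2^n} ≤ T_{2^(n-1)} + 1 ≤ T_{2^(n-2)} + 2 ≤ ... ≤ T_1 + n = n + 1 = ⌊lg N⌋ + 1.

One way to carry out the induction for general N is to note that ⌊lg ⌊N/2⌋⌋ = ⌊lg N⌋ − 1 for N ≥ 2, so the inductive hypothesis T_{⌊N/2⌋} ≤ ⌊lg ⌊N/2⌋⌋ + 1 gives T_N ≤ ⌊lg N⌋ + 1 directly.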
Note that we express Properties 1 through 3 in terms of the operations that we perform most often on the data. As we noted in the commentary following Property 1, we expect that each operation should take a constant amount of time, and we can conclude that the running time of binary search is proportional to lg N as compared to N for sequential search. As we double N, the running time of binary search hardly changes, but the running time of sequential search doubles. As N grows, the gap between the two methods becomes a chasm.
We can verify the analytic evidence of Properties 1 through 3 by implementing and testing the algorithms. For example, Table 1 shows running times for binary search and sequential search for M searches in a table of size N (including, for binary search, the cost of sorting the table) for various values of M and N. We will not consider in detail here the implementation of the program used to run these experiments, because it is similar to programs that we consider in full detail later in this series, and because we consider the use of library and external functions and other details of putting together programs from constituent pieces, including the sort function, later in this series. For the moment, we simply stress that doing empirical testing is an integral part of evaluating the efficiency of an algorithm (a sketch of one such test driver appears after Table 1).
Table 1 Empirical study of sequential and binary search
These relative timings validate our analytic results that sequential search takes time proportional to MN and binary search takes time proportional to M lg N for M searches in a table of N objects. When we increase N by a factor of 2, the time for sequential search increases by a factor of 2 as well, but the time for binary search hardly changes. Sequential search is infeasible for huge M as N increases, but binary search is fast even for huge tables.
Key: S sequential search (Program 1); B binary search (Program 2)
Table 1 validates our observation that the functional growth of the running time allows us to predict performance for huge cases on the basis of empirical studies for small cases. The combination of mathematical analysis and empirical studies provides persuasive evidence that binary search is the preferred algorithm, by far.
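As an illustration of how such timings might be collected (a sketch of our own, not the program that produced Table 1), the following C driver times M random searches with each method, using the seqsearch and binsearch sketches above; rand() and clock() are crude instruments, but they are adequate for observing the growth rates:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int seqsearch(int a[], int v, int l, int r);   /* sketches given earlier */
    int binsearch(int a[], int v, int l, int r);

    /* comparison function for qsort */
    static int cmp(const void *p, const void *q)
    {
        int x = *(const int *)p, y = *(const int *)q;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        int N = 100000, M = 10000, i;
        int *a = malloc(N * sizeof(int));
        clock_t t;

        if (a == NULL) return 1;
        for (i = 0; i < N; i++) a[i] = rand();   /* table of (pseudo)random numbers */
        qsort(a, N, sizeof(int), cmp);           /* binary search requires an ordered table */

        t = clock();
        for (i = 0; i < M; i++) seqsearch(a, rand(), 0, N - 1);
        printf("sequential search: %.2f seconds\n",
               (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();
        for (i = 0; i < M; i++) binsearch(a, rand(), 0, N - 1);
        printf("binary search:     %.2f seconds\n",
               (double)(clock() - t) / CLOCKS_PER_SEC);

        free(a);
        return 0;
    }

Doubling N in such a driver should roughly double the sequential-search time while leaving the binary-search time essentially unchanged, which is the pattern that Table 1 reports.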
This example is a prototype of our general approach to comparing algorithms. We use mathematical analysis of the frequency with which algorithms perform critical abstract operations, then use those results to deduce the functional form of the running time, which allows us to verify and extend empirical studies. As we develop algorithmic solutions to computational problems that are more and more refined, and as we develop mathematical analyses to learn their performance characteristics that are more and more refined, we call on mathematical studies from the literature, so as to keep our attention on the algorithms themselves in this series. We cannot do thorough mathematical and empirical studies of every algorithm that we encounter, but we strive to identify essential performance characteristics, knowing that, in principle, we can develop a scientific basis for making informed choices among algorithms in critical applications.
2. Guarantees, Predictions, and Limitations
The running time of most algorithms depends on their input data. Typically, our goal in the analysis of algorithms is somehow to eliminate that dependence: We want to be able to say something about the performance of our programs that depends on the input data, to as little an extent as possible, because we generally do not know what the input data will be each time the program is invoked.
Studying the worst-case performance of algorithms is attractive because it allows us to make guarantees about the running time of programs.
There are several difficulties with worst-case analysis, however. For a given algorithm, there might be a significant gap between the time required for it to solve a worst-case instance of the input and the time required for it to solve the data that it might encounter in practice.
For example, quick-union requires time proportional to N in the worst case, but only log N for typical data. More important, we cannot always prove that there is an input for which the running time of an algorithm achieves a certain bound; we can prove only that it is guaranteed to be lower than the bound. Moreover, for some problems, algorithms with good worst-case performance are significantly more complicated than are other algorithms.
Studying the average-case performance of algorithms is attractive because it allows us to make predictions about the running time of programs. In the simplest situation, we can characterize precisely the inputs to the algorithm.
There are also several difficulties with average-case analysis, however. First, the input model may not accurately characterize the inputs encountered in practice, or there may be no natural input model at all. Second, the analysis might require deep mathematical reasoning. Third, knowing the average value of the running time might not be sufficient: we may need to know the standard deviation or other facts about the distribution of the running time, which may be even more difficult to derive.
The field of computational complexity is the branch of analysis of algorithms that helps us to understand the fundamental limitations that we can expect to encounter when designing algorithms. The overall goal is to determine the worst-case running time of the best algorithm to solve a given problem, to within a constant factor. This function is called the complexity of the problem.
When we can prove that the worst-case running time of an algorithm to solve a certain problem is O(f(N)), we say that f(N) is an upper bound on the complexity of the problem. In other words, the running time of the best algorithm to solve a problem is no higher than the running time of any particular algorithm to solve the problem.
We constantly strive to improve our algorithms, but we eventually reach a point where no change seems to improve the running time. For every given problem, we are interested in knowing when to stop trying to find improved algorithms, so we seek lower bounds on the complexity.
Not all algorithms are worthy of such intense scrutiny; indeed, during the design process, it is preferable to work with approximate performance indicators to guide the design without extraneous detail. As the design becomes more refined, so must the analysis, and more sophisticated mathematical tools need to be applied. Often, the design process leads to detailed complexity studies that lead to theoretical algorithms that are rather far from any particular application.
It is a common mistake to assume that rough analyses from complexity studies will translate immediately into efficient practical algorithms; such assumptions can lead to unpleasant surprises. On the other hand, computational complexity is a powerful tool that tells us when we have reached performance limits in our design work and that can suggest departures in design in pursuit of closing the gap between upper and lower bounds.
We take the view that algorithm design, careful implementation, mathematical analysis, theoretical studies, and empirical analysis all contribute in important ways to the development of elegant and efficient programs.
We want to gain information about the properties of our programs using any tools at our disposal, then to modify or develop new programs on the basis of that information. We will not be able to do exhaustive testing and analysis of every algorithm that we run in every programming environment on every machine, but we can use careful implementations of algorithms that we know to be efficient, then refine and compare them when peak performance is necessary. Throughout this series, when appropriate, we shall consider the most important methods in sufficient detail to appreciate why they perform well.