![]() |
|
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Published: Wednesday, July 2, 2008 By Scott Mitchell
Introduction Jeff's blog entry caught my attention because of a past article I wrote here on 4Guys, Randomly Reordering an Array. That article, written back in 2000, showed various techniques for reordering an array in classic ASP using VBScript. And, wouldn't you know it, the first technique uses the exact same naïve algorithm Jeff warns about in his blog post! Oops! (I've since updated the old article.) After learning of the problems with the naïve algorithm I decided to spend some time exploring approaches that produce valid distributions. This article shows why the naïve algorithm produces less than ideal results and includes code for two alternative solutions. Read on to learn more!
Dissecting the Naïve Algorithm
If you run this algorithm a handful of times it appears to correctly do a random in-place reordering of the elements in Take a three item array with element values 0, 1, and 2. These three elements can be combined into six permutations:
With the naïve algorithm, each item in the array is randomly switched with another item in the array (including, perhaps, itself), meaning that for an array of n elements there are nn possible outcomes. For a three element array, then, there are 33 = 3*3*3 = 27 possible outcomes. But there are only six ways a three element array can be arranged, and six doesn't divide 27. Consequently, some outcomes in the naïve algorithm must be overrepresented and others underrepresented. This can best be illustrated by decomposing the algorithm to show the actual results.
The following (very tall) figure illustrates this breakdown for a three item array that stores the values 0, 1, and 2, and that starts with the values ordered
The various outcomes are not evenly represented. The arrangements
This uneven weighting can be illustrated by reordering the array thousands of times and tallying up the resulting re-orderings. The download available at
the end of this article includes a demo that does this tallying for a user-specified array size and number of re-orderings. The demo also includes an option
to view the results in Microsoft Excel, which you can use to quickly chart the distribution. The following graph shows the distribution of a three item
array randomly ordered 60,000 times. Because there are six possible outcomes, each outcome should occur roughly 10,000 times. But as the following graph
shows, the permutations
Fixing the Naïve Algorithm In 1938 statisticians Ronald Fisher and Frank Yates devised the Fisher-Yates Shuffle algorithm, which could be used to randomly order a set of items with a correct distribution of results. It being 1938, their algorithm was actually expressed in terms of steps to be performed by a human. In his seminal book The Art of Computer Programming, Donald Knuth provided the algorithm in pseudo-code. Today the algorithm is commonly called the Knuth shuffle, or the Knuth-Fisher-Yates shuffle. The way the Knuth-Fisher-Yates works is by enumerating n - 1 positions, and only considering a swap with positions not yet enumerated. This is typically implemented in code by looping from the right-most array position to the second to left position. Each enumerated element is randomly swapped with an element position to its left (including, possibly, its own location). The code is deceptively similar to the naïve algorithm:
There are two subtle but important changes in the code. First, the algorithm loops from The following diagram illustrates how this algorithm unfolds for a three item array. As you can see, the algorithm has six possible outcomes, which match up to the six possible permutations. Consequently, no permutation is over- or under-weighted.
The following chart compares the tally of arrangements using the Knuth-Fisher-Yates versus the naïve algorithm when performing 60,000 re-orderings. The red bars represent the tallies for Knuth-Fisher-Yates. As you can see, they're tight around 10,000 outcomes, which is expected. The naïve algorithm, over-represents three permutations.
Ordering the Contents of an Array by a Random Value This same concept can be applied to randomly ordering arrays. In short, if you can associate some random value with each element in an array then you can sort by that value to get a random ordering. In his blog entry titled Shuffling, Jeff Atwood proposes a simple LINQ-based approach:
As with our previous example, Upon first glance there are two main differences between the OrderBy GUID algorithm and Knuth-Fisher-Yates. The most apparent one is that Knuth-Fisher-Yates does in place shuffling whereas the OrderBy GUID returns a new object that represents the random order. In other words, when using Knuth-Fisher-Yates you are affecting the actual order of elements in the array. OrderBy GUID, on the other hand, does not disturb the elements in the array.
The second difference between the two has to do with the asymptotic performance. Knuth-Fisher-Yates examines each element in the array once, giving it
an asymptotic running time of O(n). The OrderBy GUID approach must sort the results; the best sorting algorithms have a running time of
O(n log n). This difference in running time will is not profound until you get to substantially large values of n, so if you are
sorting a deck of 52 cards, for instance, the asymptotic running time difference is insignificant. (The demo available for download at the end of this article
lists the time taken to perform the tallies for each algorithm; you'll see that in most cases the OrderBy GUID takes 50-75% longer than the Knuth-Fisher-Yates
implementation. For small values of n, the bulk of this time difference is likely accounted for by the fact that the OrderBy GUID approach must
create a new The more subtle yet important question is, Does the OrderBy GUID algorithm produce a valid, even distribution among the permutation? I (incorrectly) assumed that the naïve algorithm worked as desired when writing Randomly Reordering an Array, and didn't learn of my mistake until years later. To convince myself that the OrderBy GUID approach works correctly, I tallied the permutations for five item arrays shuffled 120,000 times and compared the OrderBy GUID distribution against the Knuth-Fisher-Yates algorithm. As you can see, both are tight around 1,000, which is how many times each of the 120 permutations should occur.
Compare this to the distributions for the naïve algorithm when dealing with a five element array shuffled 120,000 times. The following chart shows the distribution discrepancy between the OrderBy GUID and the naïve algorithm. The X-axis represents the various permutations and the Y-axis the number of times each permutation occurred in the 120,000 shuffles. The black triangles show the distribution for the OrderBy GUID algorithm; they are tight around 1,000. The red squares are the distributions for the naïve algorithm. As you can see, they're all over the place. Certain permutations are wildly overrepresented while others are wildly underrepresented.
Granted, this emperical analysis does not "prove" the correctness of the OrderBy GUID approach, but it does give me more confidence in its correctness. To more thoroughly convince yourself of the randomness of the OrderBy GUID algorithm you can use the Chi-square test.
Conclusion Happy Programming!
Further Reading
Attachments
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
![]() |
![]() |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|