A Million Amazonian Monkeys

One of my favorite Simpsons episodes has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: ‘It was the best of times, it was the blurst of times.’ The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare. And that is what I did (tried to do). I created millions of monkeys on Amazon and put them at virtual typewriters (aka the Infinite Monkey Theorem). An old New York Times article gives a general account of computers as authors.

I have been learning Hadoop lately and wanted a project to try some things out. I have also been wanting to try out Amazon’s Web Services. This project brought the two pieces together by using Hadoop’s MapReduce with Amazon’s Elastic MapReduce. The best part of all is how cheap it is to try things like this: a small instance costs $0.10 per instance-hour and a High-CPU Medium instance costs $0.20.

Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys. The Map Monkeys create random data in ASCII between a and z, using Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys. Once the monkeys’ output is mapped, it is passed to the reducer, which runs the characters through a Bloom Filter membership test. If the monkey output passes the membership test, it is checked against the Shakespearean works using a string comparison. If that passes, a genius monkey has written 10 characters of Shakespeare.
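
To make the flow concrete, here is a minimal sketch of the two-stage check (not the actual project code): it uses java.util.Random as a stand-in for Sean Luke’s Mersenne Twister and Guava’s BloomFilter as a stand-in for whichever Bloom filter implementation is wired into the real job.

```java
import java.util.Random;
import java.util.Set;

import com.google.common.hash.BloomFilter;

public class MonkeySketch {
    private static final int GROUP_LENGTH = 10;

    // One "map monkey" burst: a random 10-character string of letters a-z.
    static String typeRandomGroup(Random rng) {
        StringBuilder sb = new StringBuilder(GROUP_LENGTH);
        for (int i = 0; i < GROUP_LENGTH; i++) {
            sb.append((char) ('a' + rng.nextInt(26)));
        }
        return sb.toString();
    }

    // Reducer-side check: cheap probabilistic test first, exact comparison only on a hit.
    static boolean isShakespeare(String candidate,
                                 BloomFilter<CharSequence> bloom,
                                 Set<String> shakespeareGroups) {
        if (!bloom.mightContain(candidate)) {
            return false;                              // definitely not in the source text
        }
        return shakespeareGroups.contains(candidate);  // rules out Bloom filter false positives
    }
}
```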

The source material is all of Shakespeare’s works as taken from Project Gutenberg. My monkeys leave it to their editors to format things well: everything is stripped down to a-z, and all whitespace and newlines are taken out of the source material. Otherwise, things are left intact (a minimal sketch of this stripping step follows the table). I had to limit the monkeys’ output to keep the number of combinations manageable, given the exponential nature of the problem. The table below shows the exponential growth and computational difficulty of the problem.

# Characters | Combinations
10 | 141,167,095,653,376
9 | 5,429,503,678,976
8 | 208,827,064,576
7 | 8,031,810,176
6 | 308,915,776
5 | 11,881,376
4 | 456,976
3 | 17,576
2 | 676
1 | 26
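
Back to the source material for a moment: a minimal sketch of the stripping step might look like this (the file names are placeholders, not the ones from the actual project).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripShakespeare {
    public static void main(String[] args) throws Exception {
        // Read the Project Gutenberg text (placeholder file name), lowercase it,
        // and keep only the letters a-z; whitespace, newlines, and punctuation all go.
        byte[] bytes = Files.readAllBytes(Paths.get("shakespeare.txt"));
        String raw = new String(bytes, StandardCharsets.UTF_8);
        String stripped = raw.toLowerCase().replaceAll("[^a-z]", "");
        Files.write(Paths.get("shakespeare-stripped.txt"), stripped.getBytes(StandardCharsets.UTF_8));
    }
}
```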

There are several benefits to using Hadoop (MapReduce) and Amazon. One is that other works, or every book in Project Gutenberg, could be searched. The beauty of a Bloom Filter is that it allows a membership test without loading the entire text into memory, and the test is very quick compared with comparing the candidate against every position in the source. Another benefit is that this scales very well on Amazon: adding another computer to the cluster increases the output in a linear fashion. However, my pocketbook doesn’t scale as well; maybe Amazon will give me some free monkeys. With only a few minor changes, the Monkey MapReduce program could be changed to generate 24 random characters, check against every written work by man, and scale to 100+ computers.
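
For illustration, here is how the filter could be populated from every overlapping 10-character window of the stripped text, again assuming Guava’s BloomFilter; the 1% false-positive rate is just an illustrative choice.

```java
import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BuildFilter {
    static BloomFilter<CharSequence> buildFilter(String strippedText, int groupLength) {
        // Number of overlapping groups: 3,696,339 for the trimmed-down Shakespeare text.
        int groups = strippedText.length() - groupLength + 1;
        BloomFilter<CharSequence> bloom = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                groups,
                0.01);                     // illustrative 1% false-positive rate
        for (int i = 0; i < groups; i++) {
            // Every 10-character window of the source goes into the filter.
            bloom.put(strippedText.substring(i, i + groupLength));
        }
        return bloom;
    }
}
```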

Now for some numbers. There are 26^10 (141,167,095,653,376) combinations of 10 characters using the 26 letters of the English alphabet. In my trimmed-down Shakespeare, there are 3,696,339 non-distinct 10-character groups. As of today, I have run 20,164,000,000 maps, or 10-character checks, using Elastic MapReduce. This comes out to 0.014% of the possible combinations. For the curious, I am running 2 instances of “High-CPU Medium”. A day’s worth of monkeys costs $19.20 and does 6,980,000,000 checks.
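
For anyone who wants to sanity-check the arithmetic, here is the back-of-the-envelope calculation with the values above:

```java
import java.math.BigInteger;

public class MonkeyMath {
    public static void main(String[] args) {
        BigInteger combinations = BigInteger.valueOf(26).pow(10);  // 26^10 possible 10-letter groups
        BigInteger checked = new BigInteger("20164000000");        // maps run so far
        double coveragePercent = checked.doubleValue() / combinations.doubleValue() * 100.0;
        System.out.println(combinations);                // 141167095653376
        System.out.printf("%.4f%%%n", coveragePercent);  // roughly 0.0143%
    }
}
```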

Even though this project ranks up there as the eighth wonder of the world and is as important as curing cancer, I don’t have a bunch of money to put towards it. You can get in touch with me via the contact form to help out and carry on this monumental work. We need to act now before the endangered Amazonian Hadoop Monkeys go extinct. I have an even better idea for creating Shakespeare with Amazonian Monkeys and a very cool way to visualize it.
