Thursday, 2nd December, 2010
The Von Neumann bottleneck is a very well-known issue in computer science: the processor and the memory are separate, and the bandwidth between the two is very small compared to the speed of either. As a result, the processor sits idle while data is brought in from memory, or, worse yet, a shared piece of memory is locked while another processor uses it for an indeterminate amount of time.
In theory, one way to overcome this issue would be to give all of the processors full access to all of the memory at all times; that is, the processor is the memory and the memory is the processor. In practice this is difficult, because processors are generally designed to operate on a single word at a time (yes, yes, multicore, but that only gets us so far), which in effect enforces the Von Neumann bottleneck.
A solution to this is to keep the memory, by and large, unchanged, but make the memory the processor.
First, we must take a brief diversion into non-standard computation. Cellular automata like Conway's Game of Life have been around for a long time, and the Game of Life is known to be Turing complete.
In Conway's Game of Life, there is a board of squares or cells. Each cell is either alive or dead based on rules about its eight nearest neighbours: too many live neighbours and it dies, too few and it dies; enough live cells next to a dead cell and it comes to life. These very simple rules can lead to very complex behaviour, so complex that, as mentioned before, one can simulate a Turing machine inside a Game of Life.
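The rules above are short enough to sketch directly in Haskell. This is just a minimal sketch of my own (the set-of-live-cells representation and the names are my choices, not anything standard):

import qualified Data.Set as Set
import Data.Set (Set)

type Cell = (Int, Int)

-- The eight nearest neighbours of a cell.
neighbours :: Cell -> [Cell]
neighbours (x, y) = [ (x + dx, y + dy)
                    | dx <- [-1, 0, 1], dy <- [-1, 0, 1]
                    , (dx, dy) /= (0, 0) ]

-- One generation: a live cell survives with 2 or 3 live
-- neighbours; a dead cell comes to life with exactly 3.
lifeStep :: Set Cell -> Set Cell
lifeStep live = Set.filter alive candidates
  where
    candidates = live `Set.union`
                 Set.fromList (concatMap neighbours (Set.toList live))
    alive c
      | Set.member c live = n == 2 || n == 3
      | otherwise         = n == 3
      where n = length (filter (`Set.member` live) (neighbours c))

For example, the three-cell "blinker" flips between a horizontal and a vertical bar every generation.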
But there exist other games of life. There is a class of cellular automata known as "Elementary Cellular Automata", or ECA for short. These were studied extensively by Stephen Wolfram, and one of them was later shown to be Turing complete.
A simple ECA state is just a string of 1s and 0s: 10101101010010111. Each cell lives or dies based on a Boolean function of itself and its two nearest neighbours. It so happens that Rule 110 gives rise to a Turing-complete ECA.
Rule 110 can be expressed succinctly as (y ∧ ¬(x ∧ z)) ∨ (¬y ∧ z), where x, y and z are the left neighbour, the cell itself, and the right neighbour.
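The name "Rule 110" comes from reading off the output bits for the eight neighbourhoods 111 down to 000, which gives 01101110, i.e. 110 in binary. A quick sanity check that the expression above really is Rule 110:

-- Rule 110: new cell value from left (x), centre (y), right (z).
rule110 :: Bool -> Bool -> Bool -> Bool
rule110 x y z = (y && not (x && z)) || (not y && z)

-- Outputs for neighbourhoods 111, 110, 101, ..., 000.
-- The bit pattern should be 01101110, i.e. 110 in binary.
truthTable :: [Int]
truthTable = [ if rule110 x y z then 1 else 0
             | x <- [True, False], y <- [True, False], z <- [True, False] ]

Evaluating truthTable gives [0,1,1,0,1,1,1,0], as expected.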
As it happened, I built a simple evaluator for these kind of things in Haskell:
-- An Elementary Cellular Automaton implemented in Haskell

-- Auxiliaries for converting lists of Int to Bool and vice versa
intBool :: [Int] -> [Bool]
intBool = map (== 1)

boolInt :: [Bool] -> [Int]
boolInt = map boolInt'
  where boolInt' True  = 1
        boolInt' False = 0

-- The rule definition for this machine.
-- This is ECA Rule 110, it's Turing complete!
rule :: [Bool] -> Bool
rule [x, y, z] = (y && not (x && z)) || (not y && z)
rule [x, y]    = rule [x, y, False]      -- It should not have to use these rules.
rule [x]       = rule [x, False, False]  -- Ditto.

runMachine :: [Int] -> [[Int]]
runMachine input = input : runMachine output
  where
    -- Perform a single step in the machine.
    runStep :: [[Bool]] -> [Int]
    runStep xs = boolInt (map rule xs)

    -- Break the state up into a number of overlapping chunks.
    breakState :: Int -> [a] -> [[a]]
    breakState _ []  = []
    breakState n lis = take n lis : breakState n (drop 1 lis)

    -- Convert to Boolean values
    boolState = intBool input

    -- We need the list to be infinite to make the end nice to
    -- deal with
    currentState = boolState ++ currentState

    -- Break the state up and only get the number of chunks
    -- necessary to keep the machine's length constant.
    brokenState :: [[Bool]]
    brokenState = take (length input) (breakState 3 currentState)

    -- Calculate the next output state
    output = runStep brokenState

-- Run a fixed machine that prints each state
main :: IO ()
main = mapM_ print (runMachine [1,0,0,1,1,1,0,1,0,1,0,0,1,0,1,1,0,0,1,1,1,1,1,0])
Obviously, the machine is not very fast, and I have not yet made it compute anything.
A brief run down of how this works:
First, the input 1s and 0s are converted to a list of Bools (True and False). The list is then made to repeat infinitely so that the wrap-around at the end is easy to deal with. These Bools are then broken up into a list of overlapping chunks, all of the same size: [0,1,1,0] -> [[0,1,1], [1,1,0], [1,0,0], [0,0,1], [0,1,1], ...]. We then take only as many of these as we need (in the example above, only 4). We apply the rule to each chunk to get a list of Bools the same length as the initial list, and use that as the input for the next generation.
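The wrap-around chunking can be seen in isolation with a small standalone function (a sketch of mine that mirrors what breakState does inside runMachine, using cycle for the infinite repetition):

-- Overlapping windows of size n over a list, wrapping at the end
-- by treating the list as repeating infinitely.
windows :: Int -> [a] -> [[a]]
windows n xs = take (length xs) (go (cycle xs))
  where go ys = take n ys : go (drop 1 ys)

So windows 3 [0,1,1,0] gives [[0,1,1],[1,1,0],[1,0,0],[0,0,1]], one chunk per cell of the original state.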
Now, to implement this in hardware, we take our rule on each set of three adjacent cells and convert it to logic gates. It turns out that this can be shrunk down to 5 NAND gates, so we have 5 extra gates per bit of memory; 64 kilobits requires 320,000 gates. Loop the output of each set of gates back onto the memory, that is, use the output to set the bit it is related to, and you have an ECA in hardware.
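Here is one possible five-gate decomposition (my own derivation, consistent with the count above; there may be others), checked against the Boolean expression on all eight inputs:

nand :: Bool -> Bool -> Bool
nand a b = not (a && b)

-- One possible five-NAND realisation of Rule 110.
rule110Nand :: Bool -> Bool -> Bool -> Bool
rule110Nand x y z = nand g2 g4
  where
    g1 = nand x z    -- not (x and z)
    g2 = nand y g1   -- not (y and not (x and z))
    g3 = nand y y    -- not y
    g4 = nand g3 z   -- not (not y and z)
    -- nand g2 g4 = (y and not (x and z)) or (not y and z)

-- Check it agrees with the Boolean expression on all eight inputs.
matchesRule110 :: Bool
matchesRule110 = and
  [ rule110Nand x y z == ((y && not (x && z)) || (not y && z))
  | x <- [False, True], y <- [False, True], z <- [False, True] ]

matchesRule110 evaluates to True, so the gate-level circuit and the Boolean expression agree.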
If we give it the right starting state we can perform computation on it. We can also perform a number of concurrent computations on it.
For example, if we have n computations and a very large ECA, we can do something like this:
c1c1c1c1c1 00000000 … 0000000 c2c2c2c2c2c2c2c2 00000000 … 000000 cncncncncn
If we design our ci (i.e. each individual computation) well, we can prevent it from growing too large and interfering with the other computations going on. Since in Rule 110 the neighbourhood 000 always maps to 0, no computations will spontaneously appear in the empty regions.
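Assembling such a starting state is straightforward. A hypothetical helper (the name, and the idea that a fixed gap size suffices, are mine; the gap actually needed would depend on how far each ci can spread):

import Data.List (intercalate)

-- Lay out independent computations c1 .. cn on one tape,
-- separated by gaps of dead cells.
layout :: Int -> [[Int]] -> [Int]
layout gap = intercalate (replicate gap 0)

For example, layout 3 [[1,1],[1,0,1]] gives [1,1,0,0,0,1,0,1]: two computations separated by three dead cells.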
Clearly, this will be a very slow method for traditional computation, but if we reorganise our workload as a series of disparate computations, we can hopefully have them run somewhat faster than on a "standard" machine; and given the simplicity of the hardware, it should be possible to clock it extremely fast.
Hopefully, this will be of use, and I might stop by the hardware labs to try and make a couple of bits of an ECA computer.
We also need a compiler that can transform high-level languages (like Haskell, Python, etc.) into raw "ECA code". This will likely prove quite difficult.