Search Microcontrollers

Tuesday, October 1, 2013

The geekiest place in the world

...was open for visits.

I am not sure a "geek-ness" ranking exists for places, it probably does, but I have no factual data about that.
Still I am pretty sure that most would agree that if this is not the place with the highest score in the world, it actually gets a pretty damn good placement in the ranking.

I am talking about the CERN facilities between Geneva (Switzerland) and France.

From time to time they need to shutdown the machines and perform some maintenance, if you are lucky enough to be around, you get a guided tour of the Physics Disneyland.
I live nearby, so it was an easy one for me.

Turns out that this time they had a two days open to the public, with chances to visit the underground tunnel of the LHC.

Cern personnel (scientists, technicians etc) was acting as guide and showing us around - thanks guys, you are awesome -

While I enjoy physics, I am not good enough to provide in depth explanations, still I thought it was cool to post a few pictures here.

The Large Hadron Collider  (LHC) is "the big one"at the moment, accelerating particles in a 27Km ring placed between 150 and 180 mt below the surface, across Switzerland and France.

A friend of mine and myself started our visit there, at point 4

After getting an helmet and being assigned to a guide together with a small group of visitors, we descended in the underground with an elevator, first thing we met was one of the many rooms with control electronics.

The LHC is a Synchrotron accelerator, meaning it bends the beam so that it runs in a long circle, comes back and gets another "kick" of acceleration.  
Electric fields (in cavities) are used in few points to accelerate the particles and magnetic fields are used to bend the beam.
Additional magnets (quadripoles) are used to focus the beam which would naturally tend to disperse (being composed by charged particles of the same type).

All this requires a lot of energy, in fact the machine consumes about the same energy as the private consumption in the whole Geneva city.

The "compact" dipole magnets are about 15 meters long and produce a 8.4 Tesla magnetic field using a 11.000A current.
The only way to get all that current in a relatively small solenoid, is to use superconductors which are cooled down at 1.9K with Helium.

Machines and equipment are transported to/from the surface using vertical tunnels like this one :

Before getting to the actual LHC ring, you pass by rooms full of machines, mainly for the cryogenic (like the one in the picture) and electric control

Finally the "Ring": 
Once the beams are formed (two beams rotating in the opposite direction, running in separate vacuum pipes, crossed for collisions in 4 points in the ring), they are accelerated in cavities

Electrical energy here is converted from normal conducting to superconducting to feed the magnets that will need to focus and guide the beam (sorry the picture is a bit out of focus)

The huge black cables on the top part bring down the normal electrical energy from the surface, a tiny (superconducting) cable of the section visible in the bottom part (it is there just for demo) transports the same energy to the magnets.

This is the part that caused the incident back in 2008, which resulted in a 4 month unplanned stop of the machine.
Apparently the connectors between those two pipes did not perform according to the specs and started to heat up, slowly transferring heat to the magnets and forcing their conductors to exit the normal superconducting temperature range.
Don't ask me why, but this eventually converted part of the beam energy into kinetic energy, which displaced the magnets (we are talking about "little toys" with the weight of few tons).
In this stage the magnets are used to focus the beam, which is particularly difficult when the beam is "slow", in fact when it reaches relativistic speed, (some "funny" effect of relativity) particles reduce the "hate" they have between them, so it is easier to contain them.

Finally the blue dipoles gradually bend the beam (you can see it is bent on the left) in the tunnel, till the next station which is 3.7Km down this tunnel.
I searched for the Gelmini Tunnel, but it was nowhere to be found.

All this is really cool, I admit I can grasp some of the basic concepts, but that's  pretty much it,  still every single cubic meter there is packed with technology like you would expect in Geek's Heaven.

However, why stop at today's  technology when you can have  a sneak-peek of the technology of tomorrow?
Let's  go see 2030

The Compact Linear Collider (CLIC) does not actually exist now, there is a small prototype used to prove the concept (it worked!).

CLIC is supposed to be a 48Km Linear  collider (one end should arrive pretty close to my house), being linear it does not need dipoles to bend the beam and it is not supposed to use superconducting magnets, even to focus the beam.

But the most interesting feature is how the energy for the acceleration is provided.
In a synchrotron the beam comes back to a given point several times (11K times a second in the LHC) so it can be accelerated in few points.
In a linear machine, you need to keep accelerating the beam all along the path in order to achieve the needed speed in one single run.
Transporting the needed energy all along the 48Km path would be really challenging.
CLIC uses a "drive" beam to provide energy to accelerate the test beam.

The concept is that a "wide" beam (lots of charges) is started together with the test "small" beam then all along the path energy is extracted from the drive beam (by slowing it down) and transferred to the test beam via a 12GHz pulse.

This is the CLIC working prototype,  in the picture above the beams are created.
Magnets based on standard technology solenoids are the used to focus the beams

The researcher who was guiding us explained that there are some studies about these solenoids, obviously not being supercondicting they are not really efficient and pose a few issues, interesting enough they are also exploring the possibility to use huge neodymium permanent magnets and powerful mechanical actuators to modulate the magnetic field with them.

Before entering the acceleration module, part of the test beam is extracted and it's energy (speed) is measured, finally dumping it in a concrete block.
This sets a baseline to verify the acceleration achieved after the acceleration module.

In the image above the acceleration module.
In the front part there is the test beam passing through the cavities, in the back side the drive beam also going through cavities (to slow it down in this case) and the energy is routed in radio frequency via those copper pipes visible in the right part.
A precise piston there allows the regulation of the "accelerating" wave to ensure it is in phase with the test beam.

Finally the beam is measured and dumped in another concrete block (this is just a prototype containing one acceleration module, the real project is supposed to have about 20.000 modules, guess the concrete block at the end will have to be a bit bigger and sturdy :) ).

In the picture above an accelerating element, you will notice the hole for the beam is really small and this is one of the characteristics of CLIC.
Having an extremely compact beam has several advantages, but also poses some big challenges.
Just imagine a pipe 48 Km long, under the ground, where those holes are perfectly aligned all along the path.

For this reason the whole structure is mounted on a gazzilion of extremely precise actuators that manage to dynamically ensure the alignment.

One of the modules (picture above) undergoing mechanical tests in another lab.

And this basically concludes my report of the visit at CERN, I may have made a few mistakes (hopefully not too many) in my description, I cannot claim to be an expert on the subject, I just get excited seeing technology stuff that actually works and since you are reading this blog, you are probably like myself.

Feel free to send comments, corrections and to call me ignorant if you like (I know I am, so I will not take that as an offense :) ).

Thursday, June 6, 2013

Puzzle - The MCU way (Stellaris)

I have been playing with a puzzle that I described here, with he purpose of attempting a parallel implementation.

As a proof of concept for the basic solution, I proposed a java algorithm.

However, since the beginning my aim was to implement the solution on a cpu with no OS, I am targeting the ARM Cortex Sitara, but I thought the LM4F was already a good starting point.

The basic idea is to use low cost / low power cpus in some kind of  network enabling them to cooperate together.
Would it make sense to consider an MCU, such as the Stellaris?
Indeed its 50MHz frequency does not keep up with i5 /i7 CPUs at 3+GHz, but in some cases it might be enough.

Admittedly my algorithm is not really optimized, but for this experiment it is probably ok, in the end my idea is to verify the possibility of having a network of smart sensors, able to integrate AND PROCESS data , without the help of an external dedicated host.

So far, it's  just for fun, curiosity.

Today I simply copy&pasted my java algorithm into CCS, converted it into C and redirected the System.out.println() to UARTprintf(), but most of the code recompiled without intervention.

You can download the full source code here (requires StellarisWare installed)

Obviously, the result is the same, as you can see from the console of the java program running on my PC and the RealTerm window that captures the serial output of the Launchpad.
Did not measure performances yet, however it is absolutely clear that the PC is WAY faster in getting the solution, even if java is not nealry as efficient as ARM C.
Comparing a multi core 2.6GHZ 64 Bit CPU with a single core 50MHz 32 bit one is not exactly what we would call a fair match.

Still, the Launchpad managed to get the solution.
It is indeed a non conventional usage of an MCU, I did not use any peripheral (excluding the UART and the onboard LED, for debug purposes), I simply used it as a CPU.

Since now I have the basic algorithm working on the Stellaris, I can start thinking of allowing multiple parallel CPUs working at the same time.
Stay tuned!

Wednesday, June 5, 2013

Stellaris Launchpad - Coding

In the previous post I illustrated a basic example code using the Stellaris Launchpad and CCS.
While it was indeed an interesting view, it might have surprised some of you since the code was somewhat similar to the MSP430 style more than the more advanced high level programming of the C2000.

Fact is that you can go both ways.

The Stellaris has plenty of RAM, computational power etc, so you might actually embark in bigger, more complex projects.
For those a high level approach is often beneficial.

I would like now to illustrate how the previous problem (blinking the colored led) could be tackled using high level programming style.

#include "inc/hw_memmap.h"
#include "inc/hw_types.h"
#include "driverlib/gpio.h"
#include "driverlib/rom.h"
#include "driverlib/sysctl.h"

int main(void)
    volatile unsigned long ulLoop;

     //set system clock

  // Enable the GPIO port that is used for the on-board LED.

    // Enable the GPIO pins for the LEDS

    unsigned int color = 2;
        // Turn on the color LED.
        GPIOPinWrite(GPIO_PORTF_BASE, color, color);

        // Delay for a bit.
        SysCtlDelay(SysCtlClockGet() / 10 / 3);

        // Turn off the color LED.
        GPIOPinWrite(GPIO_PORTF_BASE, color, 0);
        color = color<<1;
        if (color>8) color =2;
// Delay for a bit.
        SysCtlDelay(SysCtlClockGet() / 10 / 3);

First thing to notice is that to use those high level libraries, you need to include the appropriate header files, which may vary depending on which peripherals you are planning to use. 
Once that is done, you can forget (well, almost) about all those specific registers and interact a bit more high level with your hardware.
The beauty of this is that the code becomes more readable, still you can chose the style that better suits yours, what I would not recommend is to mix the two approaches.
If you like the high level coding, then you'd  better stick to the functions as black-boxes, but as usual, your mileage may vary.

Happy (high level or low level) coding!

Note : there is one instruction which is not directly related to the LEDs, although it does affect the blinking frequency :

                      SYSCTL_XTAL_16MHZ | SYSCTL_OSC_MAIN);

This actually sets the System Clock, like you probably guessed.
As many other MCUs the Stellaris can source many different internal / external oscillators (SYSCTL_XTAL_16MHZ | SYSCTL_OSC_MAIN), however these feed the onboard PLL system, which you are most likely going to use (SYSCTL_USE_PLL).
The PLL runs at 400MHz (you can find that in the LM4F120HQ datasheet), which is hardware divided by 2, giving a maximum frequency of 200MHz.
Dividing this 200MHz by 4 (SYSCTL_SYSDIV_4) we obtain 50MHz, which is the frequency our clock will run by issuing that instruction.

Note 2 : Curious about the funny formula expressed in the  SysCtlDelay(SysCtlClockGet() / 10 / 3); instruction to specify the delay?
Let's  just start to say that with that value the delay will be 1/10th of second, but where those numbers come from?
The SysCtlClockGet() function returns the number of cpu cycles per seconds, meaning if we are running at 50MHz it will be 50.000.000.
Now, the SysCtlDelay function delays 3 cycles per each loop it does internally and performs the number of loops you specify as a parameter.
This means that if you specify SysCtlDelay(SysCtlClockGet() / 3) the delay will be exactly 1 second, regardless of the CPU frequency you selected.

Tuesday, June 4, 2013

Stellaris Cortex M4F Launchpad

Some time ago I ordered from TI a couple of Stellaris launchpads, but unfortunately I did not have much time to play with this new toy, lately I dug it out and gave it a go...

First of all, the Stellaris launchpad is powered by a Cortex M4F 120H5QR MCU, which is quite "a beast", running @ 80MHz, floating point capabilities and internal 256KB Flash Memory.

It is packed with plenty of peripherals, ranging from 8 UARTS, 4SPI, 4 I2C, CAN, ADC12 (1MSPS!!), a flexible clocking system, timers, DMA (32Ch!) etc.

There is quite a lot to explore, as you can see.

The launchpad comes in the usual format (and fancy red color) and also has the typical headers, both male and females on the opposite side of the pcb.

Other connections are two MICRO USB (yup, they went for the micro option with these ones, just to make you mess around with cables I guess :) ), one of which you will use to access the board via the FTDI / JTAG interface.

I am using Code Composer Studio and the free software library (containing some examples) StellarisWare, from TI.

Loading a project and debugging it is as easy as it can get, just remember to switch the power select towards the micro usb connector, that will power the board from it, connect the usb to your pc and fire up CCS.
TI Resource Explores shows StellarisWare installed (did not work on my previous installation, had issues with CCSV5.2, 5.4 seems to do its job neatly), browse to find the blinky example (I always like to blink leds as the first project) and... it works.

Apparently the on board multi color LED is connected to the GPIO port F, normally these leds use 3 colors, so it is easy to guess that on that port there will be 3 bits controlling the 3 color components.
Instead of checking the manual I decided to experiment with the bits until I found the color components.

But , let's see the example code (from TI) :

    volatile unsigned long ulLoop;
    // Enable the GPIO port that is used for the on-board LED.
    // Do a dummy read to insert a few cycles after enabling the peripheral.
    ulLoop = SYSCTL_RCGC2_R;
    unsigned int color = 0x08; // 2 is red, 4 blue, 8 green 
    // Enable the GPIO pin for the LED (PF3).  Set the direction as output, and
    // enable the GPIO pin for digital function.
    GPIO_PORTF_DIR_R = color;
    GPIO_PORTF_DEN_R = color;
        // Turn on the LED.
        GPIO_PORTF_DATA_R |= color;

        // Delay for a bit.
        for(ulLoop = 0; ulLoop < 200000; ulLoop++)
        // Turn off the LED.
        GPIO_PORTF_DATA_R &= ~(color);
// Delay for a bit.
        for(ulLoop = 0; ulLoop < 200000; ulLoop++)

The original program had a constant 0x08 instead of the variable color, but I thought this way it was a bit easier to experiment with the colors.
Turns out that the bit 1 (2^1) corresponds to the red component, bit 2 (2^2) to the blue and 2^3 to the green one.
You can combine them i.e. using 0x0E (2+4+8) to get a white(ish) color.

Besides that it appears a typical gpio management :
This enables the GPIO port F
    GPIO_PORTF_DIR_R = color;
    GPIO_PORTF_DEN_R = color;
Direction of bits on port F is set to output for the bit that correspond to the desired color
    GPIO_PORTF_DATA_R |= color;
    GPIO_PORTF_DATA_R &= ~(color);
Or turns the bits on and AND+NOT turns them off, as usual.

Ok, I know, not much, right?
Still, it's a fancy colored led blinking and most of all you can use it to check your environment is properly set up, it worked for me :)

Promise I will find some more interesting experiments soon.

Friday, May 31, 2013

Reasons behind the proposed puzzle and its solution

Hello, I am referring to the previous posts about the puzzle I proposed, along with some possible algorithmic solution.

I also drafted a proposal on how to split the problem across multiple concurrent processes.

At this point maybe I should write a few words on why we would want to do this and when it could become useful.
There are different paradigms that allow to split a problem into multiple sub problems with the purpose of leveraging on multiple available processors.

The Seti@home project  efficiently does exactly this.

In my view we can classify these kind of problems in two main categories :
1) Problems that can be "hashed" into subsets (or "buckets") of data
2) Problems that cannot be immediately hashed since the state f(n+1) depends on f(n), therefore I could not ask a process to compute f(n+1) unless it already computed f(n)

Let's  take some simple examples :

The equation that allows us to represent a line can be expressed as  y = a +bx
This means that we can calculate directly y(x) for any value of x (being x a real number), so such problem falls easily into my category 1)
However, when we split the workload on different processors, we normally consider finite increments (this is common in real life problems), so that the variable x is a Natural Number (an integer)

Some other problems, which are common when we run simulations, require an iterative approach.
By that I mean that in order to calculate a function (or better a series) y(x) we need to apply a set of rules that take into account previous values of the serie itself.
A typical, well known example, could be the Fibonacci number series.
These fall in my category 2) and may be important in computing simulated scenarios.

Still there is a distinction to be made within category 2 :
Series like the Fibonacci one have one possible values at each i step, such as f(i) is defined and has a single solution.
When exploring simulated scenarios, normally f(i) presents several possible solutions, as visible in the description of the proposed puzzle.
While this is indeed making the problem harder to tackle, it also opens (eventually) up to parallel processing.
In fact, since f(i) has multiple values, then f(i+1) is also going to have multiple values (in general more than f(i) unless f is converging) meaning that each value f(i) can be considered the start of a new problem and then individually assigned to a separate processor.

What could be a practical example of a problem benefiting from this approach?

I am thinking about models used to simulate a pandemic spread, or the evolution of a meteorological event. Any model that aims to test all the possible evolution of something which is not directly predictable using a mathematical function.

I don't  think I invented this approach, I am sure there is a lot of great work done already, I am just curious to try it myself

Saturday, May 25, 2013

Towel Day 2013

Any Vogon spaceship passing by lately?

... seriously, we are waiting here!

Friday, May 24, 2013

Puzzle, moving the solution to parallel

The puzzle problem illustrated here and for which I proposed a simple algorithm here, can probably be assigned to multiple CPUs at the same time in a way that the solution is found in parallel.

In my original post I already explained that the solution search process can be "initialized" using the "step" array which encodes the  list of "moves" executed in the first "n" steps.

This means that the problem can be visualized as follows :

Each level of this tree represents one step.
So, for step 1 (after root) there are 8 possible positions to be tested, each one of which has 8 possible destinations in step 2 etc.

"Slicing"the problem means, at any level, to restrict the search from move n>=1 to move m <=8.
Then slices of the problem may be distributed to different CPUs to be computed at the same time.

If we try to implement this concept in the example algorithm, we need to account for three things :

1) The initial part of the solution, consisting of a step array from 1 to i that locates the beginning of the search for the given process
2) The initial move to be tested at step i+1
3) The final move to be tested at step i+1

Technically this could be implemented by making step a 2 dimensional array initially set as :

{{0,8},{0.8},{0,8}....}  being 0 the last move attempted (none) and 8 the last to be attempted for that particular step.

When initializing the solution for a giving CPU, by transmitting the step array, we could use the last element in the array as the boundary of the solution to be searched.

Let's imagine a process P1 started to attempt a solution for the (5x5) puzzle and after 3 steps, it decides to call for reinforcements, asking a second cpu to take over part of the problem.

In our example the current situation is represented by the image below

this also means that the step array contains the following data :

Where the Last Move is the move that allowed us to move from A1 to C2 (Step 1), then from C2 to E1 (Step 2) etc according to the numbering visible below.

We know that the Max Move is 8 because the current process (P1) is asked to investigate the FULL solution (so, from 1 to 8 for every step).
At this point P1 wants to ask a new process P2 to address part of the problem starting from step 4.
Let's imagine that P1 decides to split the work from step 4 onward into two sets of moves S1= [1..4] and S2=[5..8].
It will run itself S1 and ask P2 to take care of S2.

P1 needs to prepare the initial step array for P2, basically telling it which part of the problem has already been taken care of.
The first 3 steps will need to be "locked", so the Max Move will be set equal to Last Move, plus the additional new step 4 will have Last Move = 4 (this si the last move defined in S1) and Max Move = 8

-> Step(P2) = {{3,3},{2,2},{5,5},{4,8}}

At the same time P1 needs to set a constraining Max Move for step 4

-> Step(P1) = {{3,8},{2,8},{5,8},{0,4}}

The example used only two different processes, but the concept stands for any number of processes that could be engaged at any step.

Thursday, May 9, 2013

Puzzle algorithm example

For those that have seen my previous post and are eventually interested to see a simple example of the described algorithm, here you can find a bare-bone implementation in Java.
You can configure the size of the square by setting the values of sizeX and sizeY.
Be warned that solutions for squares up to 7x7 come out pretty quickly on an i5 PC, from 8x8 on the magnitude of the problem requires LONG times of execution before the first solution is found.

This is an example of the solution found for a 5x5 problem :

As it is designed now, the algorithm does not implement any kind of optimization and it stops when it finds the first solution (but it can be easily modified to find ALL the solutions) or when it runs out of options.

The reason why I am proposing this java implementation is because it is easily readable and implements a basic execution logic which allows me to easily experiment trying to add some parallelism.

The final solution is not intended to be in Java

public class LatinPuzzle {

    private int stepCounter;  
    private int step[];
    private int cell[][];
    static final int moves[][] = 
    static final int sizeX = 6;
    static final int sizeY = 6;
    private int pSize= sizeX * sizeY;
// debugging/control info
    private long needUpdate =0;
    private int low = 100;
    private int high=0;

    void init()
    step = new int[pSize];
    cell = new int[pSize][2];
    stepCounter = 0;
    private boolean isValid(int x,int y)
    if ((x<1)||(y<1)||(x>sizeX)||(y>sizeY)) 
      return false;
    int scan =0;
    while (scan<stepCounter)
      if ((cell[scan][0]==x)&&(cell[scan][1]==y))
       return false;
    return true;
    private void move(int n)
    if (n>=8) return;
    int x = cell[stepCounter][0]+moves[n][0];
    int y = cell[stepCounter][1]+moves[n][1];
    if (isValid(x,y))
    cell[stepCounter][0] = x;
    cell[stepCounter][1] = y;  
    private boolean backOne()
    if (stepCounter<1) return false; else return true;
    public void tryPuzzle()
    boolean running = true;
    while ((stepCounter<(pSize-1))&&(running))
    if (step[stepCounter]>=8) running = backOne();
     // debugging/control start
    if (stepCounter>high) high = stepCounter; 
    if (stepCounter<low) low = stepCounter; 
    if (needUpdate++ % 10000000 == 0) 
    System.out.println(""+needUpdate+","+stepCounter+" "+low+"-"+high);
    low = pSize;
    high = 0;
     // debugging/control end

     // final output
    if (stepCounter>=(pSize-1))
 System.out.println(""+needUpdate+", Success !! "+stepCounter+".");
          for (int i=0;i<stepCounter;i++)
          for (int i=0;i<=stepCounter;i++)
             System.out.print(""+cell[i][0]+","+cell[i][1]+" ");
    } else      
      System.out.println(""+needUpdate+",failed "+stepCounter+".");
public static void main(String[] args)
LatinPuzzle lp = new LatinPuzzle();

Wednesday, May 8, 2013

BeagleBone Black - Unboxing

I just received my Beaglebone Black.
It looks cool, it's black...

Besides the color it looks like the previous (white) version, but you can immediately spot the presence of an HDMI connector.
This is probably a nice addition, being able, out of the box, to connect to an HDMI monitor, not sure I will use it since up to now I used my bone as a remote terminal only.
But still I think that keyboard + mouse + monitor might eventually get you out of the problem in the case you mess up with your network configuration.

The box it comes in is (I got mine from Adafruit, because I think LadyAda is really cool) a neat pretty soft box which also contains a MINI usb cable.

Checking the side of the board, you can see the micro HDMI connector (close to the SD slot, below the pcb) with the standard usb and mini SD.
No SD is included because the black has an internal flash that can store the SO, the SD connector is still available so you can use another SD for storage.
I think this is a neat feature because, besides lowering the total cost (no need for the SD card) it also allows you to store your stuff separately from the OS which in the end reduces the chances you mess up the OS filesystem (that happened to me, was logging data on the SD of the bone, power went off... and the file system was broken, had to recreate the SD.

The other side remains pretty much the same as the previous version : 5V connector, a mini usb and ethernet.

The main differences with the white bone are :
1) Price is about half (45$ vs 89$)
2) Added HDMI
3) Added internal flash 2Gb (with performance boost x8 vs x4 bits)
4) Memory was 256MB DDR2, now  512MB DDR3 (!!)
5) CPU (AM3359) clock was 720MHz, now 1GHz ( thanks to the  Sitara XAM3359AZCZ100)

Also, the white bone CPU was scaling down to 500MHz when powered via USB instead of the 5V connector, apparently (from the specs) this is not happening for the black, which should stay at 1000MHz.

I will start soon playing with this new toy, which apparently comes configured to start X immediately (not the way I like it, but it is normally easy to set it up).

One thing that really triggers my curiosity is that it is supposed to support StarterWare, a set of free TI libraries (kudos to TI, you manufactured my first computer -TI99/4A- 20+ years ago and still surprise me every month. Count me in your fan club :) ) supposed to give you API level access to the hardware without the need of an OS.

Will keep you posted

Tuesday, May 7, 2013

Latin, Puzzles and Parallel Computing

First of all I have to apologize for being absent for so long, I was actually busy with some new interests, which obviously are now generating the urge to post about them.

No, this is not about Latin, sorry for those that arrived here with some wicked Google search about some dead language.

In a galaxy far, far away and long time ago, I attended a "scientific high school" (Liceo Scientifico), that happened in Milano, Italy.
When you attend that kind of school, you also have to study Latin... yeah, I know...

As you can easily imagine, me and most of my friends did not like the subject as we believed that it had nothing to do with science, which actually was our main interest.
Latin for us was good to come out of situations where you did not have a real answer, in the end it did not make any sense, but it sounded pretty damn cool.
Our teacher said it helped us to exercise our brain and would have provided us a good forma mentis, I honestly thought that if she had to use Latin to bullshit us, she was actually proving our point.
Still, we had no other choice than suck it up for five long years.
We also had to study religion and the teacher tried to convince us that exorcism is a true thing, but I was too busy playing fetch across the rainbow with my unicorn to pay attention to that.

The point is that in order to survive the immense boredom generated by those studies, we started playing puzzles during the lectures.
Our preferred one was this one :

The rules are quite simple : in a 10x10 square, starting top left, you move from cell to cell using the chess horse move and you record the progressive number of your move.
You cannot go outside the square and cannot enter twice in the same cell.
Winning condition is to fill in all the 100 numbers.

It is NOT easy, but eventually one of us solved the puzzle after about 2 or 3 years of Latin.
If you think this is a boring activity, you probably never studied Latin irregular verbs.
Unfortunately, the guy who solved the puzzle, failed Latin at the same time... but hey, it was TOTALLY worth it.

Still, after a while -maybe 1 month- it became "not so interesting" to me and I thought it would have been more interesting to write an algorithm to solve it.
The first prototype was written in basic on a Sinclair QL, it never worked tho because I did not know the QL at all,  and did not  own one, I just did a few tests on the computer of a schoolmate brother.
I was 15 at that time and my (self taught) programming skills were still unfortunately limited to TI basic and some Z80 assembly.

Nonetheless that exercise helped me to formulate an algorithmic solution based on the following ideas (trying to remember it) :

1) "Moves" can be encoded in a way that, starting from a position X, the next position can be  one of the 8 shown below

This means that we can create a function move(n) [with n between 1 and 8] that returns a 2 element array with dX and dY being the increments on the X and Y axis,

2) The new coordinates are calculated and an isValid(x,y) function returns true if the the move lands in a valid location.
Non valid locations are those outside the square (X<1 , X>10 , Y<1, Y>10) or those already occupied by previous steps

3) An array step[100] of integers will record at each step the move that brought me there and a second array cell[100][2] will store the X,Y coordinates of the positions  

4) a stepCounter keeps track of current step

5) I verify if I am in a winning position (stepCounter == 100)

Logic : 

When I start I immediately try the move(1), if isValid returns false, I try the move 2, until I reach the move 8.
In order to check if a cell is taken, I simply scan the cell[] array (there are some possible optimizations here, but the concept stands).
When I reach move 8 and still cannot place the move, I need to decrement stepCounter and try the next available move step[--stepCounter]

This is clearly a brute force algorithm, not particularly smart, but the beauty is that it can be easily generalized to solve different puzzles, I successfully used it to solve "find the exit of the maze" puzzles too.

Unfortunately the first working version of it came some time later (still before my friend did solve the puzzle manually tho :) ) running on a Ast Bravo 286 , which was my first MS-DOS PC running a 16 bit processor at an incredible speed of 10MHz.
A better version was coded in Turbo Pascal 3, later on improved with TP5.5 and finally refined with BP7 including some parts in assembly.

 I also implemented part of it in Scheme Lisp, using a functional approach with recursive calls instead of the imperative style of the first versions, however I came out with the impression that these kind of problems should be solved with low level languages since code execution speed matters a lot.

By the time I had the BP7 version of the code, I had a second computer, a 486 DX 50MHz which proved to be way faster than my old 286, but the idea was : can I have both two computers work at the same time, on the SAME solution?

I did not have a network connecting the two PCs, just a serial cable, and fantasized on how to have them working in parallel exchanging information to balance dynamically the load.
I implemented instead a simpler approach : the BP7 version was able to save it's state to disk and eventually resume the execution.
If you think, saving the state is pretty easy : you just need the step[]  array, anything else can be reconstructed.
This also meant that feeding a pre-filled step array to the program would make it take care of a PORTION of the executions.
I know it was not much, but it was a first step (my first step) into parallelism.
After all Latin was useful for me as I would have never faced this challenge without the boredom that "Rosa, Rosae, Rosae..." can generate, plus, as a bonus, now I can say basic profanities in the same language that the Pope speaks!

Recently I have been fiddling around parallel computing (for data mining and similar stuff), played a bit with map-reduce and some basic implementations in Python.

Map reduce is a great solution for some problems, provided you can pre- partition the data in a way that a master node sends packages of data to mapper (and finally reducer) client processes.
The issue I see is that you cannot (I believe) use it for a generic calculation where the step n depends on the steps [0..n-1], situation that can be easily represented by the puzzle I illustrated.

Now, I have two questions in my mind :
1) Do we have real problems to be solved that can be eventually attacked with an approach  similar to a generalized version of my puzzle algorithm?
2) Which kind of balancing mechanism can be implemented to dynamically manage executions in parallel?

I have hard time figuring out point 1 (damn, I should have payed more attention to those Latin lectures, I am sure they would have been useful here!!), but I have some ideas for point 2.

The idea is :
what about experimenting with real hardware devices without an OS on them, extremely flexible in a way that we could eventually implement a low level communication protocol between them... and extremely cheap?
Are you thinking what I am thinking?
Cheap microcontrollers scaling up to ARM Sitara microprocessors (MSP430G2 up to BeagleBone Black with a Sitara @1GHz: 45$ )   
For the actual balancing mechanism, I think it should be possible to work with a topology with no master and having peer clients exchanging directly information (maybe using SPI), I have some basic ideas, but I need to run some simulations first.

OK, this post is too long now, so I will follow up maybe next week in another post, meanwhile feel free to send your ideas, suggestions...

... and I will refrain from closing with a Latin quote.

[update : Example algorithm (in Java) here]