I’ve been lurking over at Reddit’s /r/dailyprogrammer for some time now, watching for challenges that would be suited to an FPGA. This past week I found one that was perfect:

In graph theory, a graph’s radius is the minimum eccentricity of any vertex for a given graph. More simply: it is the minimum distance between all possible pairs of vertices in a graph.

As an example, the Petersen graph has a radius of 2 because any vertex is connected to any other vertex within 2 edges.

On the other hand, the Butterfly graph has a radius of 1 since its middle vertex can connect to any other vertex within 1 edge, which is the smallest eccentricity of all vertices in this set. Any other vertex has an eccentricity of 2.

Given an Adjacency matrix find the radius of the graph. The graph is not directed, so the matrix will always be reflected about the main diagonal. Print the radius of the graph as an integer.

My solution is written in Verilog and targets an Altera Cyclone IV FPGA development board. The final answer is shown on the board’s LCD display. The algorithm I describe here takes advantage of an FPGA’s parallelism and results in a O(n) solution.

The design instantiates a network of nodes that represent vertices in the graph. The nodes themselves determine the radius of the graph through a constant process of telling their neighbors to whom they’re connected and reporting their distance to every other node in the graph. Due to the distributed nature of the design the solution is found very quickly. For a graph with N nodes the solution is found in N clock cycles (or less).

The original post gave this example adjacency matrix to solve:

`0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0`

This matrix describes a graph of 10 vertices. Each row corresponds to a node (or vertex) in the graph. Each column of a given row tells us who that node’s neighbors are. For example, the first row corresponds to node 0 and tells us that its neighboring nodes are 1, 4 and 5.

If you followed the Wikipedia links in the problem statement above and started doodling you’ll discover that this adjacency matrix is describing a Petersen graph.

Very pretty, Petersen.

Before we tackle the Petersen graph lets work through a smaller example first to get a feel for the algorithm. Below is a graph with 5 vertices shown both as an adjacency matrix and, well, as a graph.

The basic idea is that each node tells its neighbors how far away it thinks it is from other nodes in the network. I call this the node’s distance vector. Each node then takes their neighbors’ distance vectors and constructs a distance matrix. The distance matrix describes how many hops it would take to get to any other node in the network through its neighbors. As the distance matrix is updated, each node find the shortest path to every other node in the graph and updates its distance vector. Through this process the nodes learn their distances to every other node.

At start up, the adjacency matrix is loaded into the network. Each node stores its own adjacency vector to keep track of who its immediate neighbors are. Using its adjacency vector each node starts to construct its distance matrix :

The distance from a node to itself is 0. The distance to an adjacent node is 1. If a node isn’t aware of a path from itself to another node then the distance is infinite. This is shown as a grey box in the pictures. I use -1 in the actual design (all 1’s binary).

Based on the distance matrix each node then broadcasts a distance vector which is the shortest distance from itself to each other node in the matrix.

Each node uses the distance vectors from its immediate neighbors to update its own distance matrix. This is done by adding 1 to each distance reported in the vector. The idea is that if my neighbor says some other node is x hops away and I’m 1 hop away from her then I must be x+1 hops away from that other node.

Here is what the distance matrix for each node looks like in the 2nd clock cycle:

Take a look at the second row of Node 1’s Distance Matrix (colored in bright green). That row is Node 1’s row. All of Node 1’s neighbors fill in their Node 1 row with this information but add 1 to the distances (colored in dark green). Node’s that aren’t Node 1’s neighbors keep that row blank (e.g. Node 4).

Now that we have some more information in the distance matrices, the distance vectors become more informative:

Each column in a node’s distance vector describes the shortest distance to every other node. This was found by comparing the values in each column of the distance matrix and choosing the smallest one.

In the 3rd and final clock cycle the distance matrices look like this:

Here are the final distance vectors:

We know we are done because every entry in every distance vector has a value (there are no blanks). So, to find the radius of this graph we just find the longest reported distance: 3.

Part of the challenge of implementing this algorithm in an FPGA was figuring out how to store and access the distance vectors and matrices. We’d like each node to be able to take a 2D array as input but Verilog won’t let us do that. We’ve put people on the moon, my phone is a hand held super computer by 1984’s standards (when Verilog was created) and Verilog still can’t do 2D ports.

Here’s what I finally came up with:

The distance vectors are packed together as a 1D input and then unpacked into a 2D array inside of the module. The remote matrix (dist_matrix_rmt) is what’s advertised by neighboring nodes. The local matrix (dist_matrix_lcl) is the data structure local to this node.

Below is the portion of Node that actually updates the local distance matrix and is the heart of the algorithm. Keep in mind that there is no actual “looping” going on here - it’s all happening in parallel!

The information in the local distance matrix is then used to update the node’s output distance vector. This process below just “loops” through the rows of every column finding the smallest value. Again, these nested for-loops just describe a digital circuit that determine every value of the vector in parallel. It all happens in a single clock cycle.

Finally, all of these nodes are instantiated with a generate loop. Notice that each node’s distance vector is just packed into the signal that becomes the distance matrix inputs to all of the other nodes.

Here’s what the board looks like showing the final answer to the Petersen graph:

A few weeks ago I came across this problem:

*Find a number consisting of 9 digits in which each of the digits from 1 to 9 appears only once. This number must also satisfy these divisibility requirements:*

*The number should be divisible by 9.**If the rightmost digit is removed, the remaining number should be divisible by 8.**If the rightmost digit of the new number is removed, the remaining number should be divisible by 7.**And so on, until there’s only one digit (which will necessarily be divisible by 1).*

This problem has been around for a few years and many people have solved it using a variety of different programming languages. I decided to tackle it using Verilog targeting an Altera Cyclone IV FPGA development board. My design uses the board’s 50MHz clock and finds the solution in approximately 2.2ms.

There are three main components to the design: a module to calculate all of the unique combinations of the digits, a module to perform the division on the resulting combinations, and a LCD interface module to display the answer once found.

I started with pen and paper writing out all of the unique combinations for small sets of symbols to get a feel for the underlying algorithm.

It became clear that the process is recursive: to find all the unique combinations of the elements in a set you pull out each element in turn and find the unique combinations of the remaining elements. I try to show this process in the graph below.

Traversing every path from trunk to leaf yields all of the combinations. Notice that each node with n elements has n branches. This means that a set with n unique elements has n! unique combinations. Therefore, for this problem with nine digits there are 9! = 362,880 combinations to check (worst case).

At first I was afraid this would be one of those problems that just doesn’t solve cleanly without using recursion. Real recursion can’t be implemented with synthesizable Verilog (and it just doesn’t have that warm fuzzy feeling of a bunch of logic gates working together in parallel). Fortunately, I was able to unwrap the recursive process into something reasonable using for loops. Here’s some C-like pseudo-code for finding combinations of a set with 4 characters:

set = "abcd" for (i=0; i<4; i++) {

stage1 = set

stage1[0] = set[i]

stage1[i] = set[0] for (j=1; j<4; j++) {

stage2 = stage1

stage2[1] = stage1[j]

stage2[j] = stage1[1] for (k=2; k<4; k++) { stage3 = stage2

stage3[2] = stage2[k]

stage3[k] = stage2[2]

}

}

}

I implemented this by cascading a series of counters with a module to handle digit swapping. For a number with n unique digits we need n-1 stages. Note that it’s n-1 and not n stages because the nth stage is the trivial case of finding all the combinations of a set with just one element. Here’s a schematic showing how it looks for a set with 3 digits.

The counter to implement the for loop functionality was straight forward. Each counter increments by 1 when the downstream counter has maxed out. Once a counter hits the max value it rolls over back to its initial value.

always@(posedge rst_i, posedge clk_i) begin if (rst_i) begin cntr_o <= CNTR_INIT; end else if (incr_i) begin if (cntr_o < CNTR_LIMIT) begin cntr_o <= cntr_o + 1; end else begin cntr_o <= CNTR_INIT; end end end assign done_o = ena_i & (cntr_o == CNTR_LIMIT);

Note that for the counter’s initial value and limit (maximum) are constants so I made them parameters instead of inputs to the module.

Below is the part of DigitSwap that does the swapping:

always@(*) begin digits_o = digits_i; digits_o[DIGIT(position1_i)] = digits_i[DIGIT(position2_i)]; digits_o[DIGIT(position2_i)] = digits_i[DIGIT(position1_i)]; end

Notice that this is a combinatorial process with blocking assignments (rather than synchronous non-blocking) so that every stage swaps its digits in the same clock cycle. This results in a unique combination every clock. Also, I defined a macro called `DIGIT (which makes indexing the digits in a bit array more friendly) but I’ve removed the leading ` in these code snippets because of a quirk in the syntax highlighting.

Finally, I used a generate loop to cascade the modules together:

genvar c; generate for (c=1; c<NUM_OF_DIGITS; c=c+1) begin : GEN_COMBINATIONS assign incr[c] = done[c+1]; Counter #( .CNTR_INIT ( c ), .CNTR_LIMIT ( NUM_OF_DIGITS ) ) Counter( .rst_i ( rst_i ), .clk_i ( clk_i ), .ena_i ( incr[c] ), .cntr_o ( cntr[c] ), .done_o ( done[c] ) ); DigitSwap #( .NUM_OF_DIGITS ( NUM_OF_DIGITS ), .DIGIT_WIDTH ( DIGIT_WIDTH ) ) DigitSwap( .clk_i ( clk_i ), .digits_i ( digits[c-1] ), .position1_i ( cntr[c] ), .position2_i ( c ), .digits_o ( digits[c] ) ); end endgenerate assign done[NUM_OF_DIGITS] = comb_ena_i; assign comb_vld_o = 1; assign comb_digits_o = digits[NUM_OF_DIGITS-1];

Now that all of the combinations are generated the next step is to check them. However, up until this point the digits were just packed into an array of bits (like BCD). These digits must first be converted into an actual binary number (for example,’h01020304 => ‘d1234). I’m pretty happy with the conversion method I came up with. I was amazed that it met timing and that I didn’t have to pipeline it:

always@(posedge clk_i) begin binary_o = 0; for (d=1; d<=NUM_OF_DIGITS; d=d+1) begin binary_o = binary_o + digits_i[DIGIT(d)]*(10 ** (NUM_OF_DIGITS-d)); end done_o <= ena_i; end

Now it’s just divide and check.

All eight dividers operate in parallel. When all of the dividers report a zero remainder then the solution has been found. I decided against being clever and didn’t use any tricks to do the division. Each division is done with a divider module generated by the Altera tools. They were instantiated using a generate loop (similar to the one shown earlier).

Once I had my answer I just had to display it on the LCD. This part was fairly anti-climatic and kind of a pain. At least now I have a LCD interface module that I can use in the future.

Here’s what it looks like:

Someone on Reddit was asking about implementing this function in synthesizable Verilog:

\[\Large{\frac{x}{3}e^{(1-\frac{x}{3})}}\]

I came up with this (meta) solution that uses behavioral Verilog to generate a synthesizable Verilog case statement which implements the desired function as a look-up table:

module SomebodysHomework; parameter BITS_OF_PRECISION = 8; parameter SCALING_FACTOR = 1.0; real x = 0.0; integer result = -1; function integer JustSomeMath; input real x; parameter real e = 2.7183; real y; begin y = x/3.0 * e ** (1.0 - x/3.0); JustSomeMath = integer(y * (1<<BITS_OF_PRECISION)); end endfunction initial begin $display ("case (x)"); while (result != 0) begin #1; x = x + 1; result = JustSomeMath(x/SCALING_FACTOR); $display(" %0d : answer = %0d;", x, result); end $display(" default: answer = 0;"); $display("endcase"); end endmodule

Here’s part of the case statement that gets generated by the code above.

```
case (x)
1 : answer = 166;
2 : answer = 238;
3 : answer = 256;
...
27 : answer = 1;
28 : answer = 1;
29 : answer = 0;
default: answer = 0;
endcase
```

Here are some pretty pictures of the results with different values for BITS_OF_PRECISION and SCALING_FACTOR.