# More Efficient Topological Sort

**Abstract.** Topological sort of an acyclic graph has many applications, such as job scheduling and network analysis. Due to its importance, it has been tackled on many models. Dekel et al. [3] proposed an algorithm for solving the problem in O(log^{2}N) time on the hypercube or shuffle-exchange networks with O(N^{3}) processors. Chaudhuri [2] gave an O(logN) algorithm using O(N^{3}) processors on a CRCW PRAM model. On the LARPBS (Linear Arrays with a Reconfigurable Pipelined Bus System) model, Li et al. [5] showed that the problem for a weighted directed graph with N vertices can be solved in O(logN) time by using N^{3} processors. In this paper, a more efficient topological sort algorithm is proposed on the same LARPBS model. We show that the problem can be solved in O(logN) time by using N^{3}/logN processors. We show that the algorithm has better time and processor complexities than the best algorithm on the hypercube, and has the same time complexity but better processor complexity than the best algorithm on the CRCW PRAM model.

**Keywords:** analysis of algorithms, graph problem,
massive parallelism, optical bus, time complexity

## 1. Introduction

Many applications can be represented as an acyclic graph. The topological sort of an acyclic graph is to find a linear ordering for the vertices of the graph, such that if ⟨i, j⟩ is an edge of the graph, then i appears before j in the ordering [4]. If the graph is not acyclic, then no linear ordering is possible. A topological sort of a directed graph can be viewed as an ordering of its vertices along a horizontal line, so that all directed edges go from left to right. Topological sort has many applications in job scheduling, network analysis, and other areas [1, 4].
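To make the definition concrete, the following sketch computes such an ordering sequentially with Kahn's algorithm (a standard method, not the parallel algorithm developed in this paper); the function name and example graph are illustrative only.

```python
from collections import deque

def topological_sort(n, edges):
    """Kahn's algorithm: return a linear ordering of vertices 0..n-1 such
    that for every edge (i, j), i appears before j; return None if the
    graph contains a directed cycle (no linear ordering is possible)."""
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for i, j in edges:
        succ[i].append(j)
        indeg[j] += 1
    queue = deque(v for v in range(n) if indeg[v] == 0)
    order = []
    while queue:
        v = queue.popleft()          # v has no remaining predecessors
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order if len(order) == n else None

# Every edge goes "left to right" in the returned ordering.
edges = [(0, 2), (1, 2), (2, 3)]
order = topological_sort(4, edges)
pos = {v: k for k, v in enumerate(order)}
assert all(pos[i] < pos[j] for i, j in edges)
```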

The topological sort problem has been tackled on many
models. For example,

Dekel et al. proposed an algorithm for solving the problem in O(log^{2}N) time on
the

hypercube or shuffle-exchange networks with O(N^{3}) processors [3]. Chaudhuri also

gave an O(logN) algorithm using O(N^{3}) processors on a CRCW PRAM model [2].

In this paper, we study the problem on a new computational model called Linear

Array with a Reconfigurable Pipelined Bus System (LARPBS) [10], which will be

briefly discussed below. Since an LARPBS can simulate a primitive operation on

a hypercube with the same number of processors in O(1) time, the algorithm on

the hypercube reported in [3] can be implemented on an LARPBS with the same

time and processor complexities. The question is: can we do a better job through

direct implementation on the LARPBS model? Recently, an algorithm using optical

buses was proposed in [7]. The topological sort algorithm is based on the matrix
of

all-pairs shortest paths. The problem with this method is that the time complexity

of the algorithm is related to the magnitude of weights associated with a graph.

For weights of unbounded magnitude, the time complexities of the algorithms

are normally larger. Li et al. [7] showed that the problem for a weighted directed

graph with N vertices can be solved in O(logN) time by using N^{3} processors, if

the weights are of bounded magnitude and precision. For weights of unbounded

magnitude and precision, the problem can be solved in O(logN) time by using
N^{3+ε} processors (where ε is any small positive constant); in O(logN log logN)

time by using N^{3}/ log logN processors; or in O(logN) time with high probability

by using N^{3} processors. In this paper, we implement a topological sort algorithm

on the LARPBS based on the approach for computing the transitive closure of a

graph. Because the transitive closure of a graph is calculated based on Boolean values,

more efficient algorithms can be achieved compared with calculations based on

integer values. Our algorithm solves the same problem in O(logN) time by using

only N^{3}/ logN processors, and is better than the one in [7] on the same model.

Our result also shows that not only better time bounds are obtained, but also

fewer processors are used for the same problem on the LARPBS model, than the

algorithm on the hypercube model described in [3]. Our algorithm has the same

time bound, but uses fewer processors than the algorithm on the CRCW PRAM

model [2].

It should be noted that Ma et al. [8] claimed that the
topological sort problem

can be solved in O(1) time based on the transitive closure algorithm on a 2D

reconfigurable mesh with N^{3} processors described in [15]. However, their
transitive

closure algorithm works only for undirected graphs, and topological sort requires
a

solution for directed graphs. Hence, there is a flaw in the algorithm described
in

[8]. Since the transitive closure problem on undirected graphs is much easier than on directed graphs, it is not clear whether the topological sort problem can be solved

in O(1) time on a 2D reconfigurable mesh. For 1D arrays, such as the LARPBS

model, the problem seems to be more difficult to solve, since we cannot embed a

graph into the architectures for a fast solution.

## 2. Reconfigurable pipelined optical buses

Recent advances in optical technologies have made optical
interconnection networks

feasible. A pipelined optical bus system uses optical waveguides instead of

electrical signals to transfer messages among electronic processors. In addition
to

the high propagation speed of light, there are two important properties of
optical

pulse transmission on an optical bus, namely, unidirectional propagation and

predictable propagation delay. These advantages of using
waveguides enable synchronized

concurrent
accesses of an optical bus in a pipelined fashion [9].

Figure 1 shows a linear array in which electronic
processors are connected with

an optical bus. Each processor is connected to the bus with two directional
couplers,

one for transmitting on the upper segment, and the other for receiving from

the lower segment of the bus [9–11]. The optical bus contains three identical
waveguides,

one for carrying messages (the message waveguide), and the other two for

carrying address information (the reference waveguide and the select waveguide).
For simplicity, the message waveguide, which resembles the reference waveguide, has been omitted; as shown in Figure 1, only the reference waveguide and the select waveguide are displayed. Messages are organized as fixed-length

message frames. Note that optical signals propagate unidirectionally from left

to right on the upper segment, and from right to left on the lower segment. This

bus system is also referred to as the folded-bus connection in [10].

The three waveguides are used for addressing and simple
computations, such as

binary value summation [10]. It has been shown that such pipelined optical bus systems can support a massive volume of communications simultaneously, and are particularly appropriate for applications that involve intensive regular or irregular communication and data movement operations, such as permutation, one-to-one

communication, broadcasting, multicasting, multiple multicasting, extraction,
and

compression [6, 10]. It has been shown that by using the coincident pulse
addressing

technique, all these primitive operations take O(1) bus cycles, where the bus
cycle

length is the end-to-end message transmission time over a bus [6, 10].

**Remark.** To avoid controversy, let us emphasize that in this paper, by “O(f(p)) time” we mean O(f(p)) bus cycles for communication plus O(f(p)) time for local computation. This assumption has been used by many other authors in the literature
literature

[6, 9, 11–14] for different models with optical buses.

In addition to supporting fast communications, an optical
bus itself can be used

as a computing device for global aggregation. It was proven in [6, 10] that, by using N processors, the summation of N integers or reals with bounded magnitude and precision, the prefix sums of N binary values, and the logical-OR and logical-AND of N Boolean values can all be calculated in a constant number of bus cycles.

An LARPBS consists of N processors
connected by a
pipelined

optical bus. In the LARPBS, we insert an optical switch on each section of the
transmitting

bus and receiving bus. Thus, each processor has 6 more local switches, three

on its three receiving segments, and three on its three transmitting segments,
besides

its switch for conditional delay. The switches on the receiving and transmitting
segments

between processors i and i + 1 are called RSR(i) and RST(i), respectively,

and are local to processor i, as shown in Figure 2. Here, RSR(i), 0 ≤ i < N, are 2 × 1 optical switches, and RST(i), 0 ≤ i < N, are 1 × 2 optical switches. In the following discussion, these switches will be called reconfigurable switches, due to their

function. When all switches are set to straight, the bus system operates as a
regular

pipelined bus system. When RSR(i) and RST(i) are set to cross, the whole bus system

is split into two separate systems, one consisting of processors 0, 1, ..., i, and the other consisting of i + 1, i + 2, ..., N − 1. Thus, in general an LARPBS can also be partitioned into k ≥ 2 independent subarrays LARPBS_{1}, LARPBS_{2}, ..., LARPBS_{k}, such that LARPBS_{j} contains the consecutive processors p_{j−1}, p_{j−1} + 1, ..., p_{j} − 1, where 0 = p_{0} < p_{1} < ··· < p_{k} = N are the chosen split points. The subarrays can operate as regular linear arrays

with pipelined optical bus systems, and all subarrays can be used independently
for

different computations without interference (see [10] for an elaborated
exposition).

Figure 3 shows the LARPBS model with 6 processors. The array is split into two

subarrays, with the first one having four processors,
and the second one having two

processors.

As pointed out in [10], many intensive regular or irregular
communication

and data movement operations, such as permutation, one-to-one communication,

broadcasting, multicasting, multiple multicasting, extraction, and compression
can

be carried out in O(1) bus cycles on an LARPBS. The above primitive communication,

data movement, and aggregation operations provide an algorithmic view

on parallel computing using optical buses, and also allow us to develop, specify,

and analyze parallel algorithms by ignoring optical and engineering details. These

powerful primitives that support massive parallel communications, plus the reconfigurability of optical buses, make the LARPBS computing model very attractive in

solving problems that are both computation and communication intensive. See [6,
9,

11–14] for more algorithms on the LARPBS model and some related models using

optical buses.

## 3. Topological sort on the LARPBS model

In this section, a new topological sort algorithm is
described on the LARPBS model.

The previous primitive operations are used as building blocks in our
implementation.

Before we describe the algorithm, we need to present some
terms and notation.

The connectivity matrix of a graph G with N vertices is an N × N matrix C, whose

elements are defined as follows: c_{jk} = 1 if there is a path from vertex j to vertex k (j ≠ k), or vertex j is the same as vertex k (i.e., j = k); otherwise, c_{jk} = 0, for 0 ≤ j, k ≤ N − 1.

The matrix C is also known as the transitive closure of the graph G. Li [5] has shown that the product of two N × N Boolean matrices can be calculated in constant time on an LARPBS with O(N^{3}/logN) processors. The algorithm is a parallelization of

the Four Russians’ algorithm. Further, the time complexity of the algorithm is measured by bit-level operations. By using the Boolean matrix multiplication algorithm

O(logN) times, Li has also established the following result [5]:

Lemma 1 [5] The transitive closure of a graph with N
vertices can be computed in

O(logN) bus cycles on an LARPBS with N^{3}/ logN processors.
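As a sequential analogue of Lemma 1, the closure can be computed by O(logN) squarings of the Boolean matrix A + I; the sketch below assumes a 0/1 adjacency matrix stored as lists of lists, and replaces the parallelized Four Russians' multiplication with a naive Boolean product (function names are illustrative).

```python
def bool_mat_mult(X, Y):
    """Boolean matrix product: (XY)[j][k] = OR over m of X[j][m] AND Y[m][k]."""
    n = len(X)
    return [[any(X[j][m] and Y[m][k] for m in range(n)) for k in range(n)]
            for j in range(n)]

def transitive_closure(A):
    """Reflexive transitive closure: square (A + I) ceil(log2 N) times,
    so that paths of every length up to N-1 are accounted for."""
    n = len(A)
    C = [[A[j][k] or j == k for k in range(n)] for j in range(n)]  # A + I
    for _ in range(max(1, (n - 1).bit_length())):  # ceil(log2 n) squarings
        C = bool_mat_mult(C, C)
    return C

# Path 0 -> 1 -> 2: the closure records 0 -> 2 as well, but not 2 -> 0.
A = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
C = transitive_closure(A)
assert C[0][2] and not C[2][0]
```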

Let G = (V,E) be a connected directed graph with N vertices.
For any two

vertices j and i, we say that j is i-reachable, and that i is a predecessor of j, if there is a path from i to j in G. G = (V, E) is called a dense graph if |E| = O(N^{2}). For vertex i, define two sets IN(i) = {x | ⟨x, i⟩ ∈ E} and OUT(i) = {x | ⟨i, x⟩ ∈ E}. Then, In-degree(i) = |IN(i)| and Out-degree(i) = |OUT(i)|.

Let AC(v) = {x | x is v-reachable in G} and Num(v) = |AC(v)|. The following lemma is due to [8]:

**Lemma 2** Assume that G is a directed acyclic graph,
and v, and w are two vertices

of the graph. If v is a predecessor of w, then Num(v) > Num(w).

The correctness of this lemma can be explained intuitively.
Because v is a predecessor

of w, there is a directed path of length at least one from v to w. Clearly, for

any vertex u, if u is w-reachable, u is also v-reachable; moreover, v itself is v-reachable but cannot be w-reachable (otherwise G would contain a cycle). Hence, we have Num(v) > Num(w).
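Lemma 2 can be checked on a small example. The sketch below computes Num(v) by sequential search rather than through the closure matrix; the helper name is illustrative.

```python
def reachable_counts(n, edges):
    """Num(v) = |AC(v)|: the number of vertices reachable from v,
    counting v itself (the closure is reflexive)."""
    succ = [[] for _ in range(n)]
    for i, j in edges:
        succ[i].append(j)
    counts = []
    for v in range(n):
        seen = {v}
        stack = [v]
        while stack:               # depth-first search from v
            u = stack.pop()
            for w in succ[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        counts.append(len(seen))
    return counts

# DAG: 0 -> 1 -> 3 and 0 -> 2; vertex 0 precedes all others.
num = reachable_counts(4, [(0, 1), (1, 3), (0, 2)])
# Lemma 2: a predecessor has a strictly larger Num value.
assert num[0] > num[1] > num[3] and num[0] > num[2]
```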

In [10], a quicksort algorithm using an average of O(logN)
bus cycles on an

LARPBS with N processors is described. The quicksort algorithm is suitable for

general keys. If we restrict the values to integers, a radix sorting algorithm
can be

used. We show here that an LARPBS with N processors can sort N integers with

k bits in O(k) steps. The basic idea is to repeat the split operation (a generalized compression operation) k times, where iteration l uses the l-th bit as the mark for the active set, 0 ≤ l < k [10]. After k iterations, all N integers are in sorted
sorted

order. Since vertices in a graph use integer representation, radix sort is
suitable for

most graph problems as shown below. We repeat the lemma here.

**Lemma 3** A radix sort of N integers with logN bits can be performed in O(logN)

bus cycles.
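A sequential analogue of Lemma 3: each pass below performs a stable split on one bit, mirroring the LARPBS split (generalized compression) operation, and after k passes the integers are sorted (the function name is illustrative).

```python
def radix_sort(values, k):
    """LSD radix sort of non-negative integers with k bits.
    Pass l stably partitions the list by bit l: elements with a 0 bit
    keep their relative order in front of elements with a 1 bit."""
    for l in range(k):
        zeros = [v for v in values if not (v >> l) & 1]
        ones = [v for v in values if (v >> l) & 1]
        values = zeros + ones   # stable split on bit l
    return values

assert radix_sort([5, 3, 7, 0, 2, 6], 3) == [0, 2, 3, 5, 6, 7]
```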

Assume that a matrix is stored in a row-major fashion in an
LARPBS. That is,

element i j is stored in processor k, where k = N * i + j. The transpose of a

matrix involves storing the matrix in column-major fashion. This can be easily
done

by exchanging the two indices and calculating the new location for each element.

Then, each processor just sends the element to the destination processor. This
can

be performed in O(1) bus cycles.

**Lemma 4** A transposition of a matrix can be carried
out in O(1) bus cycles assuming

that each processor stores one element of the matrix.
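The index remapping behind Lemma 4 can be written out directly; the sketch assumes the row-major layout k = N·i + j described above, and the function name is illustrative.

```python
def transpose_destinations(n):
    """For an n x n matrix stored one element per processor in row-major
    order, processor k holds element (i, j) with i = k // n, j = k % n;
    after the transpose that element lives at processor n*j + i."""
    return [n * (k % n) + k // n for k in range(n * n)]

# n = 2: processors 0..3 hold (0,0),(0,1),(1,0),(1,1); the two
# off-diagonal elements swap places, the diagonal stays put.
assert transpose_destinations(2) == [0, 2, 1, 3]
```

Since the destination is a permutation of the processor indices, a single one-to-one communication step suffices, which is why the operation takes O(1) bus cycles.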

Our algorithm is based on the above results. The input of
our algorithm is the

adjacency matrix A of a directed graph G = (V, E), where |V| = N.

The output of our algorithm has two possibilities. If there is a directed cycle
in G,

the algorithm indicates a failure and exits immediately. Otherwise, the result
for

topological sort is stored in the array Topo such that if v is a predecessor of w and Topo(i) = v and Topo(j) = w, then i < j. That is, Topo(i) points to node

v which has a rank i in the topological order.

Our algorithm uses an LARPBS with N^{3}/ logN processors.
Assume that the

adjacency matrix A is stored in the first N^{2} processors in row-major order. The

algorithm consists of the following 7 steps:

**Proposed Algorithm**

• Step 1: A = A + I, where I is the N × N unit Boolean
matrix. This can be

done in O(1) time, since each processor holding a diagonal element sets its local

element in A to one.

• Step 2: Calculate the transitive closure A* of A via logN iterations of Boolean matrix

multiplication. This step can be carried out in O(logN) bus cycles on an LARPBS

with N^{3}/ logN processors.

• Step 3: If there exists a pair of vertices i ≠ j such that A*(i, j) ∧ A*(j, i) = 1, then output “None” and exit; otherwise, continue to the next step. This step can be carried
carried

out easily by a matrix transposition operation and can be done in O(1) bus cycles

based on Lemma 4.

• Step 4: Calculate the vector Num, where Num(v) = Σ_{j=0}^{N−1} A*(v, j), for 0 ≤ v ≤ N − 1.

Basically, Num(v) contains the row sum of A*. We can
reconfigure the subarray

with the first N^{2} processors into N subsystems, where each subsystem contains

an LARPBS with N processors.

Then perform a summation in each subsystem in parallel
using the bus split

method, and store the sum in the first processor of each subsystem. Move the N

sums to the first N processors in the LARPBS. Hence, this step requires at most logN bus cycles.

• Step 5: Construct a 2-dimensional array B (0...N −1,1...2)
such that B(v,1) =

Num(v) , and B(v,2) = v for 0 ≤ v ≤ N − 1. The array B is stored in the first N

processors of the LARPBS. This step takes a constant time.

• Step 6: Sort the array B based on the following rule: for any i, j with 0 ≤ i, j ≤ N − 1, if B(i, 1) > B(j, 1), then row i precedes row j; that is, B is sorted into non-increasing order of its first field. We can use the

radix sorting algorithm on the LARPBS model [10], to sort the array B. Since the

value of the array B elements has at most O(logN) bits, this step can be done

in O(logN) bus cycles.

• Step 7: Copy the results into array Topo; i.e., Topo(t) = B(t, 2). Here, only local

operations are involved. This is done in parallel and hence takes O(1) time.
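The seven steps can be traced sequentially as follows (a sketch assuming a 0/1 adjacency matrix; the O(1) and O(logN) parallel primitives become ordinary loops, and the function name is illustrative):

```python
def topo_sort_via_closure(A):
    """Sequential trace of the 7-step algorithm: form A + I, compute the
    transitive closure A* by repeated Boolean squaring, detect cycles,
    compute Num(v) as the row sums of A*, and sort vertices into
    non-increasing order of Num(v)."""
    n = len(A)
    # Step 1: A = A + I (set the diagonal).
    C = [[A[j][k] or j == k for k in range(n)] for j in range(n)]
    # Step 2: transitive closure via ceil(log2 n) Boolean squarings.
    for _ in range(max(1, (n - 1).bit_length())):
        C = [[any(C[j][m] and C[m][k] for m in range(n)) for k in range(n)]
             for j in range(n)]
    # Step 3: a directed cycle exists iff A*(i,j) = A*(j,i) = 1 for i != j.
    for i in range(n):
        for j in range(i + 1, n):
            if C[i][j] and C[j][i]:
                return None   # "None": G is not acyclic
    # Step 4: Num(v) is the row sum of A*.
    num = [sum(row) for row in C]
    # Steps 5-7: sort vertices by decreasing Num(v) into Topo.
    return sorted(range(n), key=lambda v: -num[v])

# 0 -> 1 -> 2 gives Num = (3, 2, 1), hence Topo = [0, 1, 2].
A = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
assert topo_sort_via_closure(A) == [0, 1, 2]
# A 2-cycle is rejected in Step 3.
assert topo_sort_via_closure([[0, 1], [1, 0]]) is None
```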

The final results are in the Topo variables of the first N
processors of the

LARPBS. The correctness of this algorithm can be intuitively explained as
follows.

First, if there is a directed cycle via vertices i and j, by the definition of
A*, it

must be true that A*(i,j) = 1 and A*(j,i) = 1. Hence, the algorithm should exit in

Step 3. Otherwise, the vertices of G are arranged in array Topo according to the decreasing order of Num(v). In other words, for any two positions i and j, if Topo(i) = v, Topo(j) = w, and Num(v) > Num(w), then i < j. Based

on Lemma 2, the vertices of G are listed in topological order.

According to the above analysis, the total time used in
the algorithm is O(logN)

bus cycles. Hence, we have the following theorem:

**Theorem 1** The topological sort problem can be
solved in O(logN) bus cycles on an

LARPBS with N^{3}/ logN processors.

## 4. Conclusions

In this paper, a new and improved topological sort algorithm
has been presented

on the LARPBS model. The algorithm has better performance compared with its counterparts on the hypercube [3] and on the theoretical CRCW PRAM model [2].

It also improves an algorithm on the same model reported in [7]. The standard

sequential topological sort algorithm has a time complexity of O(N + E) for a directed graph G with N vertices and E edges [2]. In the worst case, this is O(N^{2}) time. Hence, our algorithm

is still not optimal in terms of time-processor product. Further reduction in

processor complexity in the algorithm would be an interesting research topic.

**Acknowledgments**

Research was supported in part by the Ministry of
Education, Science, Sports and

Culture of Japan under Grant-in-Aid for Scientific Research (C), No.1168041, and

by the Japan Society for the Promotion of Science in the form of a fellowship, and by the National Science Foundation (NSF) under Grant CCR-9503882.
