Election Algorithms

Chapter 7 of Introduction to Distributed Algorithms by Gerard Tel. Presented by Denys Duchier.

Definition

Assumptions

The algorithms presented here always elect the initiator with smallest id.


Election with tree algorithm

Assumption of the tree algorithm: all leaves are initiators ==> wake-up phase by flooding.

Flooding

Tree Algorithm: when process p has received <tok,vi> from all but one neighbor, it computes q=min(p,min(vi)) and sends <tok,q> to the remaining neighbor, then waits for its reply <tok,r>, computes s=min(q,r) and sends <tok,s> back to all its other neighbors. If p=s, it has been elected.
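The wait-for-all-but-one logic can be sketched as a single-threaded simulation; `run_tree_election` and its dict-of-sets input format are our own hypothetical names, not from the book:

```python
# Sketch of the tree election algorithm, assuming all leaves are
# initiators; neighbors maps each process id to its set of tree neighbors.
from collections import deque

def run_tree_election(neighbors):
    received = {p: {} for p in neighbors}   # tokens received, per sender
    sent = {p: set() for p in neighbors}    # channels already used
    winner = {}                             # each p learns the global min
    queue = deque()                         # in-flight (sender, dest, value)

    def maybe_send(p):
        missing = neighbors[p] - received[p].keys()
        if len(missing) == 1 and not sent[p]:
            # all-but-one neighbor reported: forward the minimum
            q = next(iter(missing))
            queue.append((p, q, min([p, *received[p].values()])))
            sent[p].add(q)
        elif not missing:
            # received on every channel: decide, inform remaining neighbors
            s = min([p, *received[p].values()])
            winner[p] = s                   # p is elected iff s == p
            for q in neighbors[p] - sent[p]:
                queue.append((p, q, s))
                sent[p].add(q)

    for p in neighbors:
        maybe_send(p)                       # leaves fire immediately
    while queue:
        sender, p, v = queue.popleft()
        received[p][sender] = v
        maybe_send(p)
    return winner
```

Each edge carries exactly one token in each direction, so a path a-b-c with ids 5, 3, 7 uses 4 msgs and every process ends up knowing the minimum 3.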

Complexity


Election with phase algorithm

Assumptions

Each process sends exactly D msgs to each of its out-neighbors.

Only when i msgs have been received on every in-channel is the (i+1)-th msg sent on all out-channels.

Complexity: O(D.|E|) msgs, O(D) time.
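The round structure can be sketched synchronously as follows, assuming a strongly connected directed network given as out-neighbor lists, with D at least the diameter; `phase_election` and `out_nbrs` are our hypothetical names:

```python
# Round-synchronous sketch of the phase algorithm: D rounds, exactly one
# msg per out-channel per round, carrying the smallest id seen so far.
def phase_election(out_nbrs, D):
    procs = list(out_nbrs)
    best = {p: p for p in procs}            # smallest id seen so far
    for _ in range(D):
        inbox = {p: [] for p in procs}
        for p in procs:
            for q in out_nbrs[p]:
                inbox[q].append(best[p])    # one msg per out-channel
        for p in procs:
            best[p] = min([best[p], *inbox[p]])
    # after D rounds the global minimum has reached every process
    return {p: best[p] == p for p in procs}
```

On a directed 3-ring with diameter 2, process 1 is the only one for which the flag is True.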


Unidirectional Ring Networks


Lelann

Each initiator computes a list of the ids of all initiators then decides.

non-initiator
forwards all msgs
initiator

Complexity


Chang-Roberts

Same as Lelann, but removes losing tokens: initiator p removes token q if q>p.

Complexity: the average case is better (O(N Log N)), but the worst case is still O(N^2).

Proof of worst case: let the ids be arranged in increasing order around the ring and let each p be an initiator. Tokens disappear at process 0. It takes N-i hops for token i to reach process 0 ==> # of msgs = Sum_{i=0}^{N-1} (N-i) = N(N+1)/2.
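The worst-case count can be checked with a small hop-counting sketch, assuming every process is an initiator and the smallest id wins; `chang_roberts_msgs` is our hypothetical name, and `ring[i]` sends to `ring[i+1]`:

```python
# Count token hops for Chang-Roberts on a unidirectional ring.
def chang_roberts_msgs(ring):
    n, hops = len(ring), 0
    for i, tok in enumerate(ring):          # token launched by initiator i
        j, hops = (i + 1) % n, hops + 1     # the initial send
        while ring[j] > tok:                # larger ids forward the token
            j, hops = (j + 1) % n, hops + 1
        # ring[j] < tok: token purged; ring[j] == tok: back home, elected
    return hops
```

With ids in increasing order the count is exactly N(N+1)/2, matching the proof above.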

Proof of average case

General Idea: assume all processes are initiators, and compute the average number of token passings over all possible arrangements of ids on the N-ring.

There are (N-1)! arrangements.

Let s be the smallest id and p_i the id occurring i steps before s. Compute the # of times <tok,p_i> is passed in all arrangements, then sum over i.

<tok,s> is passed N times.
<tok,p_i> is passed at most i times.

Definition: A_{i,k} = # of arrangements where <tok,p_i> is passed exactly k times.

<tok,p_i> is passed i times if p_i is the smallest of p_i ... p_1, which happens exactly 1/i of the time ==> A_{i,i} = (N-1)!/i.

<tok,p_i> is passed at least k (k < i) times if p_i is followed by k-1 processes with id > p_i, i.e. 1/k of the time. It is passed exactly k times if at least k but not k+1 or more ==> A_{i,k} = (1/k - 1/(k+1))(N-1)! = (N-1)!/(k(k+1))

<tok,p_i> is passed Sum_{k=1}^{i} k A_{i,k} = (N-1)! Sum_{j=1}^{i} 1/j = (N-1)! H_i times over all arrangements.

H_i is called a harmonic number.

Sum over i, adding the N(N-1)! passes of <tok,s>: N(N-1)! + Sum_{i=1}^{N-1} H_i (N-1)! = N H_N (N-1)!. Average = N H_N =~ 0.69 N Log N.

The correspondence between harmonic numbers and the Log function can be readily derived from the lower and upper Riemann approximations of the integral of 1/x (idea of the proof: courtesy of Joachim Niehren).
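The Riemann-sum argument can be written out as:

```latex
% Lower and upper Riemann sums of \int \frac{dx}{x} bracket H_N:
\ln(N+1) \;=\; \int_1^{N+1}\frac{dx}{x}
\;\le\; H_N \;=\; \sum_{j=1}^{N}\frac{1}{j}
\;\le\; 1+\int_1^{N}\frac{dx}{x} \;=\; 1+\ln N
```

so H_N =~ ln N = (ln 2) Log N =~ 0.69 Log N, which is where the constant 0.69 comes from.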


Peterson / Dolev-Klawe-Rodeh

Achieves O(N Log N) in the worst case.

General idea: at each round, a surviving id compares itself with its neighbors. If it is not the smallest, it becomes inactive.

Complexity: at most Log N rounds; at each round, information must be propagated at most N hops ==> O(N Log N).

On a unidirectional ring: we cannot compare to both neighbors... we need a trick!

Trick: an active process p is currently the home of active id p_right. It receives p_mid, the 1st active id to its left, and then p_left, the 2nd active id to its left. Now p becomes the home of p_mid, which can be compared to its two active neighbors p_left and p_right.

The algorithm proceeds as follows:

active process

inactive process

Complexity: floor(Log N) + 1 rounds, each with 2 sends per process ==> 2N(floor(Log N)+1) msgs.
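The survive-if-locally-smallest idea can be sketched round by round; this models the bidirectional comparison directly rather than the unidirectional trick above, and `peterson_rounds` is our hypothetical name (ids assumed distinct):

```python
# Round-based sketch of Peterson / Dolev-Klawe-Rodeh: an active id
# survives a round only if it is smaller than both active neighbors.
def peterson_rounds(ids):
    active = list(ids)                      # surviving ids, in ring order
    rounds = 0
    while len(active) > 1:
        n = len(active)
        active = [v for i, v in enumerate(active)
                  if v < active[(i - 1) % n] and v < active[(i + 1) % n]]
        rounds += 1
    return active[0], rounds                # elected id, # of rounds
```

Between any two survivors there is a larger id that dies, so at least half of the active ids drop out each round, giving the floor(Log N)+1 bound.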


Lower-bound result on the complexity of election on unidirectional rings

Assumptions

Technique: compute a lower bound on avg_A(I) = the average # of msgs used by algorithm A over all rings labeled with ids from I.

Notations and Definitions

s = s_1...s_n, a sequence of distinct ids

D = {(s_1...s_k) | i =/= j ==> s_i =/= s_j}, the set of all such sequences.

CS(s) = set of cyclic shifts of s.

s-ring = ring labeled with the ids of sequence s. If t in CS(s), then the s-ring is also a t-ring.

For purposes of analysis, all msgs are augmented with a trace, which is a sequence of ids. If m is a msg sent by p before it has received any msg, trace(m) = p. If m is sent by p after it has received a msg with trace s_1...s_k, then trace(m) = s_1...s_k p.

Note: p only receives msgs with monotonically increasing traces. The trace of the last msg received by p represents the set of processes on which the current state of p depends.

A subset E of D is exhaustive if:

M(s,E) = # of fragments of the s-ring in E
M_k(s,E) = # of fragments of length k of the s-ring in E

E_A = the set of sequences s such that A sends an s-msg on the s-ring.

Claim (1): if both t and u contain s as a substring and A sends an s-msg on the t-ring, then it also sends an s-msg on the u-ring.

Proof: the state of the receiving process depends only on the trace, and all these processes start out in identical conditions on both rings.

Claim: E_A is exhaustive.

Proof:

Claim: on the s-ring, A sends at least M(s,E_A) msgs.

Proof: consider t in E_A which is also a fragment of s; since a t-msg is sent on the t-ring, it is also sent on the s-ring ==> the # of msgs when A runs on the s-ring is at least M(s,E_A).

Let I be a finite set of ids (|I|=N).

# of msgs over all combinations >= Sum_{s in Per(I)} M(s,E_A)

where Per(I) is the set of permutations of I.

avg_A(I) >= (1/N!) Sum_{s in Per(I)} M(s,E_A)
= (1/N!) Sum_{s in Per(I)} Sum_{k=1}^{N} M_k(s,E_A)
= (1/N!) Sum_{k=1}^{N} Sum_{s in Per(I)} M_k(s,E_A)

There are N fragments of length k in an s-ring ==> N*N! over all configurations ==> N!(N/k) if we count only one per class of cyclic shifts. Since E_A cyclically covers D, at least one member of each class is in E_A.

avg_A(I) >= (1/N!) Sum_{k=1}^{N} N!(N/k) = N H_N =~ 0.69 N Log N


Extinction Construction

Purpose: to obtain a decentralized leader election algorithm given an arbitrary centralized wave algorithm.

General Description:

Construction Ex(A): every process maintains caw (currently active wave).

Claim: a unique leader is elected by Ex(A).

Proof:


Gallager-Humblet-Spira

Leader election is closely related to computing a spanning tree with a decentralized algorithm.

Let C_E be the msg complexity of leader election, and C_T the msg complexity of computing a spanning tree. Given a spanning tree, we can elect a leader with the tree algorithm in O(N) msgs: C_E =< C_T + O(N). Given a leader, we can compute a spanning tree with the (centralized) echo algorithm in 2|E| msgs: C_T =< C_E + 2|E|.

Assumptions:

Lemmas:

Global Description:

Notations and Mechanisms

Lemmas:

Summary of Combining Strategy: fragment F=(FN,L), with eF its lowest-weight outgoing edge.

Rule A:
if eF leads to F'=(FN',L') with L<L', then F combines into F' and the new values FN',L' are sent to all processes in F.

Rule B:
if eF leads to F'=(FN',L') with L=L' and eF=eF', then both combine into a new fragment (w(eF),L+1) and these values are sent to all processes in both F and F'.

Rule C:
otherwise F must wait until Rule A or B applies.
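The three rules above can be summarized as a decision function; the (name, level) tuple representation and `combine_rule` are our own sketch, not from the book:

```python
# Decide which combining rule (A, B, or C) applies when eF, the lowest
# outgoing edge of fragment F, leads into fragment F_prime.
def combine_rule(F, F_prime, eF, eF_prime):
    _, L = F
    _, L_prime = F_prime
    if L < L_prime:
        return "A"   # F is absorbed; F' keeps its (name, level)
    if L == L_prime and eF == eF_prime:
        return "B"   # merge into a new fragment at level L+1, named w(eF)
    return "C"       # F must wait until Rule A or B applies
```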

Sometimes the processing of a msg must be deferred until a local condition is satisfied ==> the msg is stored, and later retrieved and treated as if it had just been received. In Oz, it turns out this can be done more simply using suspensions and FD vars.

Algorithm: each process has a state in {sleep,find,found}, and maintains a status in {basic,branch,reject} for each of its channels.

A status, initially basic, is changed to branch (resp. reject) when it is determined that the edge is in (resp. not in) the MST.

Nodes in a fragment cooperate to find the lowest weight out-going edge, then send a <connect,L> through it.

The initiator (in fact all nodes, by executing their initial actions) determines its lowest-weight channel and sends <connect,0> through it.

In order to follow the detailed reactive description of the algorithm below, you will probably need to keep the book next to you.

p receives <connect,L> from q
It is very important at this point to know that the algorithm ensures the invariant L=<levelp.

if L<levelp then cause the q-fragment to join the p-fragment by flooding it with <initiate,levelp,namep,statep>

else wait until pq becomes smallest, at which point:

then (by symmetry) both are flooded with <initiate,L+1,w(pq),find>

On each side of the core edge the flooding constructs a tree rooted at the core nodes.

Additionally, if state=find all nodes join the search with procedure Test.

procedure Test
if p has unused (i.e. basic) edges, pick the smallest one and send <test,levelp,namep> through it to determine whether it is outgoing; else send <report,infinity> back to the father.

p receives <test,L,F> from q
if L>levelp then wait: this is where the invariant mentioned earlier is maintained. Why wait? Because there could be a flooding <initiate,L,F,S> in progress ==> p and q could be in the same fragment, but not know it yet.

else if F=namep then it is an internal edge: send back <reject>, else send back <accept> (the presentation in the book is complicated by an optimization)
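The branching in the two paragraphs above might be sketched as follows; the dict representation of p and the `deferred` list standing in for the store-and-retrieve mechanism are our own simplifications:

```python
# Handle <test,L,F> at process p: defer when the sender's level is
# higher, reject internal edges, accept otherwise.
def on_test(p, L, F, deferred):
    """p is a dict with keys 'level' and 'name' (our representation)."""
    if L > p["level"]:
        deferred.append(("test", L, F))   # wait: preserves L =< levelp
        return None
    if F == p["name"]:
        return "reject"                   # same fragment: internal edge
    return "accept"                       # edge may be outgoing
```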

p receives <accept> from q
update the current best with w(pq). When all children have reported, send a report of the current best to the parent.

p receives <reject> from q
update the status of pq to reject, then Test the next lowest.

p receives <report,w> from q
if q=/=father then update best and maybe report to parent

else p is core node and just received report from the other core node:

procedure Changeroot
forward <changeroot> msg along the path to the best edge, then send <connect,levelp> through it.

Correctness: only lowest weight out-going edges are co-opted.

Termination: finitely many edges + each phase uses tree algorithm.

Completeness: the algorithm terminates with the MST because it terminates and, since there are no deadlocks, it runs to completion.

The only possible problem with the above is if the algorithm terminates before having fully computed the MST. This can only happen as a result of deadlock when the algorithm must wait at certain points.

Claim: No deadlock can occur because if F1 waits on F2 a well-founded precedence relation F1<F2 holds between them.

Proof: F1=(FN1,L1) F2=(FN2,L2). When F2 receives <connect,L1> from F1, processing is delayed, i.e. F1 waits on F2, when either:

The above induces a well-founded precedence relation. QED.

Complexity: each edge is rejected at most once which takes 2 msgs ==> 2|E|.

At any level, a node receives at most 1 initiate and 1 accept msg, and sends at most 1 report, 1 changeroot or connect, and 1 test not leading to rejection; there are at most Log N levels ==> 5N Log N msgs.

total # msgs bounded by 2|E|+ 5N Log N


Korach-Kutten-Moran

Purpose: construction of election algorithm given a traversal algorithm (a traversal algorithm is a centralized algorithm with only one token in motion).

Difficulty: we want a decentralized algorithm. Peter Van Roy asked: why not let all waves run to completion? They will all make the same decision, won't they? My answer/guess was that this would require O(N^2) memory (O(N) at each node for N possible waves), whereas the algorithm presented here requires only constant space. Is this convincing?

General Idea: when 2 traversals intersect, one should replace the other.

Practical Issues: how to inform one another? how to make this choice consistently? how to avoid mass suicide?

options 2 and 3 are both about two fronts of the same level meeting, because that is the only time when we can effectively kill both and replace them with a single new one (of higher level).

General Description of Algorithm

Algorithm: A token <level,id> can be annexing or chasing.

annex <q,l> arrives at p <catp,levp>

chase <q,l> arrives at p <catp,levp>

duchier@dfki.uni-sb.de