Election Algorithms

Chapter 7 of Introduction to Distributed Algorithms by Gerard Tel. Presented by Denys Duchier.

Definition

Assumptions

The algorithms presented here always elect the initiator with smallest id.


Election with tree algorithm

Assumption of the tree algorithm: all leaves are initiators ==> wake-up phase by flooding.

Flooding

Tree Algorithm: when process p has received <tok,vi> from all but one neighbor, it computes q=min(p,min(vi)) and sends <tok,q> to the remaining neighbor, then waits for its reply <tok,r>, computes s=min(q,r) and sends <tok,s> back to all its other neighbors. If p=s, it has been elected.
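The wait-for-all-but-one logic can be sketched as a single-threaded simulation; `run_tree_election` and its dict-of-sets input format are our own hypothetical names, not from the book:

```python
# Sketch of the tree election algorithm, assuming all leaves are
# initiators; neighbors maps each process id to its set of tree neighbors.
from collections import deque

def run_tree_election(neighbors):
    received = {p: {} for p in neighbors}   # tokens received, per sender
    sent = {p: set() for p in neighbors}    # channels already used
    winner = {}                             # each p learns the global min
    queue = deque()                         # in-flight (sender, dest, value)

    def maybe_send(p):
        missing = neighbors[p] - received[p].keys()
        if len(missing) == 1 and not sent[p]:
            # all-but-one neighbor reported: forward the minimum
            q = next(iter(missing))
            queue.append((p, q, min([p, *received[p].values()])))
            sent[p].add(q)
        elif not missing:
            # received on every channel: decide, inform remaining neighbors
            s = min([p, *received[p].values()])
            winner[p] = s                   # p is elected iff s == p
            for q in neighbors[p] - sent[p]:
                queue.append((p, q, s))
                sent[p].add(q)

    for p in neighbors:
        maybe_send(p)                       # leaves fire immediately
    while queue:
        sender, p, v = queue.popleft()
        received[p][sender] = v
        maybe_send(p)
    return winner
```

Each edge carries exactly one token in each direction, so a path a-b-c with ids 5, 3, 7 uses 4 msgs and every process ends up knowing the minimum 3.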

Complexity


Election with phase algorithm

Assumptions

Each process sends exactly D msgs to each of its out-neighbors.

Only when i msgs have been received on every in-channel is the (i+1)-th msg sent on all out-channels.

Complexity: O(D.|E|) msgs, O(D) time.
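The round structure can be sketched synchronously as follows, assuming a strongly connected directed network given as out-neighbor lists, with D at least the diameter; `phase_election` and `out_nbrs` are our hypothetical names:

```python
# Round-synchronous sketch of the phase algorithm: D rounds, exactly one
# msg per out-channel per round, carrying the smallest id seen so far.
def phase_election(out_nbrs, D):
    procs = list(out_nbrs)
    best = {p: p for p in procs}            # smallest id seen so far
    for _ in range(D):
        inbox = {p: [] for p in procs}
        for p in procs:
            for q in out_nbrs[p]:
                inbox[q].append(best[p])    # one msg per out-channel
        for p in procs:
            best[p] = min([best[p], *inbox[p]])
    # after D rounds the global minimum has reached every process
    return {p: best[p] == p for p in procs}
```

On a directed 3-ring with diameter 2, process 1 is the only one for which the flag is True.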


Unidirectional Ring Networks


Lelann

Each initiator computes a list of the ids of all initiators then decides.

non-initiator
forwards all msgs
initiator

Complexity


Chang-Roberts

Same as Lelann, but removes losing tokens: initiator p removes token q if q>p.

Complexity: the average case is better (O(N Log N)), but the worst case is still O(N^2).

Proof of worst case: let the ids be arranged in increasing order around the ring and let each p be an initiator. Tokens disappear at process 0. It takes N-i hops for token i to reach process 0 ==> # of msgs = Sum_{i=0}^{N-1} (N-i) = N(N+1)/2.
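The worst-case count can be checked with a small hop-counting sketch, assuming every process is an initiator and the smallest id wins; `chang_roberts_msgs` is our hypothetical name, and `ring[i]` sends to `ring[i+1]`:

```python
# Count token hops for Chang-Roberts on a unidirectional ring.
def chang_roberts_msgs(ring):
    n, hops = len(ring), 0
    for i, tok in enumerate(ring):          # token launched by initiator i
        j, hops = (i + 1) % n, hops + 1     # the initial send
        while ring[j] > tok:                # larger ids forward the token
            j, hops = (j + 1) % n, hops + 1
        # ring[j] < tok: token purged; ring[j] == tok: back home, elected
    return hops
```

With ids in increasing order the count is exactly N(N+1)/2, matching the proof above.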

Proof of average case

General Idea: assume all processes are initiators, and compute the average number of token passings over all possible arrangements of ids on the N-ring.

There are (N-1)! arrangements.

Let s be the smallest id and p_i the id occurring i steps before s. Compute the # of times <tok,p_i> is passed in all arrangements, then sum over i.

<tok,s> is passed N times.
<tok,p_i> is passed at most i times.

Definition: A_{i,k} = # of arrangements where <tok,p_i> is passed exactly k times.

<tok,p_i> is passed i times if p_i is the smallest of p_i ... p_1, which happens exactly 1/i of the time ==> A_{i,i} = (N-1)!/i.

<tok,p_i> is passed at least k (k < i) times if p_i is followed by k-1 processes with id > p_i, i.e. 1/k of the time. It is passed exactly k times if at least k but not k+1 or more ==> A_{i,k} = (1/k - 1/(k+1))(N-1)! = (N-1)!/(k(k+1))

<tok,p_i> is passed Sum_{k=1}^{i} k A_{i,k} = (N-1)! Sum_{j=1}^{i} 1/j = (N-1)! H_i times over all arrangements.

H_i is called a harmonic number.

Sum over i, adding the N(N-1)! passes of <tok,s>: N(N-1)! + Sum_{i=1}^{N-1} H_i (N-1)! = N H_N (N-1)!. Average = N H_N =~ 0.69 N Log N.

The correspondence between harmonic numbers and the Log function can be readily derived from the lower and upper Riemann approximations of the integral of 1/x (idea of the proof: courtesy of Joachim Niehren).
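The Riemann-sum argument can be written out as:

```latex
% Lower and upper Riemann sums of \int \frac{dx}{x} bracket H_N:
\ln(N+1) \;=\; \int_1^{N+1}\frac{dx}{x}
\;\le\; H_N \;=\; \sum_{j=1}^{N}\frac{1}{j}
\;\le\; 1+\int_1^{N}\frac{dx}{x} \;=\; 1+\ln N
```

so H_N =~ ln N = (ln 2) Log N =~ 0.69 Log N, which is where the constant 0.69 comes from.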


Peterson / Dolev-Klawe-Rodeh

Achieves O(N Log N) in the worst case.

General idea: at each round, a surviving id compares itself with its neighbors. If it is not the smallest, it becomes inactive.

Complexity: at most Log N rounds; at each round, information must be propagated at most N hops ==> O(N Log N).

On a unidirectional ring: we cannot compare to both neighbors... we need a trick!

Trick: an active process p is currently the home of active id p_right. It receives p_mid, the 1st active id to its left, and then p_left, the 2nd active id to its left. Now p becomes the home of p_mid, which can be compared to its two active neighbors p_left and p_right.

The algorithm proceeds as follows:

active process

inactive process

Complexity: floor(Log N) + 1 rounds, each with 2 sends per process ==> 2N(floor(Log N)+1) msgs.
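The survive-if-locally-smallest idea can be sketched round by round; this models the bidirectional comparison directly rather than the unidirectional trick above, and `peterson_rounds` is our hypothetical name (ids assumed distinct):

```python
# Round-based sketch of Peterson / Dolev-Klawe-Rodeh: an active id
# survives a round only if it is smaller than both active neighbors.
def peterson_rounds(ids):
    active = list(ids)                      # surviving ids, in ring order
    rounds = 0
    while len(active) > 1:
        n = len(active)
        active = [v for i, v in enumerate(active)
                  if v < active[(i - 1) % n] and v < active[(i + 1) % n]]
        rounds += 1
    return active[0], rounds                # elected id, # of rounds
```

Between any two survivors there is a larger id that dies, so at least half of the active ids drop out each round, giving the floor(Log N)+1 bound.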


Lower-bound result on the complexity of election on unidirectional rings

Assumptions

Technique: compute a lower bound on avg_A(I) = the average # of msgs used by algorithm A over all rings labeled with ids from I.

Notations and Definitions

s = s_1...s_n, a sequence of distinct ids

D = {(s_1...s_k) | i =/= j ==> s_i =/= s_j}, the set of all such sequences.

CS(s) = set of cyclic shifts of s.

s-ring = ring labeled with the ids of sequence s. If t in CS(s), then the s-ring is also a t-ring.

For purposes of analysis, all msgs are augmented with a trace, which is a sequence of ids. If m is a msg sent by p before it has received any msg, trace(m) = p. If m is sent by p after it has received a msg with trace s_1...s_k, then trace(m) = s_1...s_k p.

Note: p only receives msgs with monotonically increasing traces. The trace of the last msg received by p represents the set of processes on which the current state of p depends.

A subset E of D is exhaustive if:

M(s,E) = # of fragments of the s-ring in E
M_k(s,E) = # of fragments of length k of the s-ring in E

E_A = the set of sequences s such that A sends an s-msg on the s-ring.

Claim (1): if both t and u contain s as a substring and A sends an s-msg on the t-ring, then it also sends an s-msg on the u-ring.

Proof: the state of the receiving process depends only on the trace, and all these processes start out in identical conditions on both rings.

Claim: E_A is exhaustive.

Proof:

Claim: on the s-ring, A sends at least M(s,E_A) msgs.

Proof: consider t in E_A which is also a fragment of s; since a t-msg is sent on the t-ring, it is also sent on the s-ring ==> the # of msgs when A runs on the s-ring is at least M(s,E_A).

Let I be a finite set of ids (|I|=N).

# of msgs over all combinations >= Sum_{s in Per(I)} M(s,E_A)

where Per(I) is the set of permutations of I.

avg_A(I) >= (1/N!) Sum_{s in Per(I)} M(s,E_A)
= (1/N!) Sum_{s in Per(I)} Sum_{k=1}^{N} M_k(s,E_A)
= (1/N!) Sum_{k=1}^{N} Sum_{s in Per(I)} M_k(s,E_A)

There are N fragments of length k in an s-ring ==> N*N! over all configurations ==> N!(N/k) if we count only one per class of cyclic shifts. Since E_A cyclically covers D, at least one member of each class is in E_A.

avg_A(I) >= (1/N!) Sum_{k=1}^{N} N!(N/k) = N H_N =~ 0.69 N Log N


Extinction Construction

Purpose: to obtain a decentralized leader election algorithm given an arbitrary centralized wave algorithm.

General Description:

Construction Ex(A): every process maintains caw (currently active wave).

Claim: a unique leader is elected by Ex(A).

Proof:


Gallager-Humblet-Spira

Leader election is closely related to computing a spanning tree with a decentralized algorithm.

Let C_E be the msg complexity of leader election, and C_T the msg complexity of computing a spanning tree. Given a spanning tree, we can elect a leader with the tree algorithm in O(N) msgs: C_E =< C_T + O(N). Given a leader, we can compute a spanning tree with the (centralized) echo algorithm in 2|E| msgs: C_T =< C_E + 2|E|.

Assumptions:

Lemmas:

Global Description:

Notations and Mechanisms

Lemmas:

Summary of Combining Strategy: fragment F=(FN,L), with eF its lowest-weight outgoing edge.

Rule A:
if eF leads to F'=(FN',L') with L<L', then F combines into F' and the new values FN',L' are sent to all processes in F.

Rule B:
if eF leads to F'=(FN',L') with L=L' and eF=eF', then both combine into a new fragment (w(eF),L+1) and these values are sent to all processes in both F and F'.

Rule C:
otherwise F must wait until Rule A or B applies.
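The three rules above can be summarized as a decision function; the (name, level) tuple representation and `combine_rule` are our own sketch, not from the book:

```python
# Decide which combining rule (A, B, or C) applies when eF, the lowest
# outgoing edge of fragment F, leads into fragment F_prime.
def combine_rule(F, F_prime, eF, eF_prime):
    _, L = F
    _, L_prime = F_prime
    if L < L_prime:
        return "A"   # F is absorbed; F' keeps its (name, level)
    if L == L_prime and eF == eF_prime:
        return "B"   # merge into a new fragment at level L+1, named w(eF)
    return "C"       # F must wait until Rule A or B applies
```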

Sometimes the processing of a msg must be deferred until a local condition is satisfied ==> the msg is stored, and later retrieved and treated as if it had just been received. In Oz, it turns out this can be done more simply using suspensions and FD vars.

Algorithm: each process has a state in {sleep,find,found}, and maintains a status in {basic,branch,reject} for each of its channels.

A status, initially basic, is changed to branch (resp. reject) when it is determined that the edge is in (resp. not in) the MST.

Nodes in a fragment cooperate to find the lowest weight out-going edge, then send a <connect,L> through it.

The initiator (in fact all nodes, by executing their initial actions) determines its lowest-weight channel and sends <connect,0> through it.

In order to follow the detailed reactive description of the algorithm below, you will probably need to keep the book next to you.

p receives <connect,L> from q
It is very important at this point to know that the algorithm ensures the invariant L=<levelp.

if L<levelp then cause the q-fragment to join the p-fragment by flooding it with <initiate,levelp,namep,statep>

else wait until pq becomes smallest, at which point:

then (by symmetry) both are flooded with <initiate,L+1,w(pq),find>

On each side of the core edge the flooding constructs a tree rooted at the core nodes.

Additionally, if state=find all nodes join the search with procedure Test.

procedure Test
if p has unused (i.e. basic) edges, pick the smallest one and send <test,levelp,namep> through it to determine whether it is outgoing; else send <report,infinity> back to the father.

p receives <test,L,F> from q
if L>levelp then wait: this is where the invariant mentioned earlier is maintained. Why wait? Because there could be a flooding <initiate,L,F,S> in progress ==> p and q could be in the same fragment, but not know it yet.

else if F=namep then it is an internal edge: send back <reject>, else send back <accept> (the presentation in the book is complicated by an optimization)
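The branching in the two paragraphs above might be sketched as follows; the dict representation of p and the `deferred` list standing in for the store-and-retrieve mechanism are our own simplifications:

```python
# Handle <test,L,F> at process p: defer when the sender's level is
# higher, reject internal edges, accept otherwise.
def on_test(p, L, F, deferred):
    """p is a dict with keys 'level' and 'name' (our representation)."""
    if L > p["level"]:
        deferred.append(("test", L, F))   # wait: preserves L =< levelp
        return None
    if F == p["name"]:
        return "reject"                   # same fragment: internal edge
    return "accept"                       # edge may be outgoing
```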

p receives <accept> from q
update the current best with w(pq). When all children have reported, send a report of the current best to the parent.

p receives <reject> from q
update the status of pq to reject, then Test the next lowest.

p receives <report,w> from q
if q=/=father then update best and maybe report to parent

else p is core node and just received report from the other core node:

procedure Changeroot
forward <changeroot> msg along the path to the best edge, then send <connect,levelp> through it.

Correctness: only lowest weight out-going edges are co-opted.

Termination: finitely many edges + each phase uses tree algorithm.

Completeness: the algorithm terminates with the MST because it terminates and, since there are no deadlocks, it runs to completion.

The only possible problem with the above is if the algorithm terminates before having fully computed the MST. This can only happen as a result of deadlock when the algorithm must wait at certain points.

Claim: No deadlock can occur because if F1 waits on F2 a well-founded precedence relation F1<F2 holds between them.

Proof: F1=(FN1,L1) F2=(FN2,L2). When F2 receives <connect,L1> from F1, processing is delayed, i.e. F1 waits on F2, when either:

The above induces a well-founded precedence relation. QED.

Complexity: each edge is rejected at most once which takes 2 msgs ==> 2|E|.

At any level, a node receives at most 1 initiate and 1 accept msg, and sends at most 1 report, 1 changeroot or connect, and 1 test not leading to rejection; there are at most Log N levels ==> 5N Log N msgs.

total # msgs bounded by 2|E|+ 5N Log N


Korach-Kutten-Moran

Purpose: construction of election algorithm given a traversal algorithm (a traversal algorithm is a centralized algorithm with only one token in motion).

Difficulty: we want a decentralized algorithm. Peter Van Roy asked: why not let all waves run to completion? They will all make the same decision, won't they? My answer/guess was that this would require O(N^2) memory (O(N) at each node for N possible waves), whereas the algorithm presented here requires only constant space. Is this convincing?

General Idea: when 2 traversals intersect, one should replace the other.

Practical Issues: how to inform one another? how to make this choice consistently? how to avoid mass suicide?

options 2 and 3 are both about two fronts of the same level meeting, because that is the only time when we can effectively kill both and replace them with a single new one (of higher level).

General Description of Algorithm

Algorithm: A token <level,id> can be annexing or chasing.

annex <q,l> arrives at p <catp,levp>

chase <q,l> arrives at p <catp,levp>

duchier@dfki.uni-sb.de