Authors: Songze Li, Seyed Mohammadreza Mousavi Kalan, Qian Yu, Mahdi Soltanolkotabi, A. Salman Avestimehr
ArXiv: 1805.09934
Abstract URL: http://arxiv.org/abs/1805.09934v1
We consider the problem of training a least-squares regression model on a
large dataset using gradient descent. The computation is carried out on a
distributed system consisting of a master node and multiple worker nodes. Such
distributed systems are significantly slowed down due to the presence of
slow-running machines (stragglers) as well as various communication
bottlenecks. We propose "polynomially coded regression" (PCR), a scheme that
substantially reduces the effect of stragglers and lessens the communication
burden in such systems. The key idea of PCR is to encode the partial data
stored at each worker, such that the computations at the workers can be viewed
as evaluating a polynomial at distinct points. This allows the master to
compute the final gradient by interpolating this polynomial. PCR significantly
reduces the recovery threshold, defined as the number of workers the master has
to wait for prior to computing the gradient. In particular, PCR requires a
recovery threshold that scales inversely proportionally with the amount of
computation/storage available at each worker. In comparison, state-of-the-art
straggler-mitigation schemes require a much higher recovery threshold that only
decreases linearly in the per-worker computation/storage load. We prove that
PCR's recovery threshold is near-minimal and within a factor of two of the best
possible scheme. Our experiments over Amazon EC2 demonstrate that compared with
state-of-the-art schemes, PCR improves the run-time by 1.50x–2.36x with
naturally occurring stragglers, and by as much as 2.58x–4.29x with artificial
stragglers.
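
To make the encoding/interpolation idea concrete, below is a minimal Python sketch of a polynomial-coding scheme in the spirit of the abstract. The partition count k, the number of workers n, the evaluation points, and the mirrored-exponent encoder are illustrative assumptions, not the paper's exact construction. For simplicity the sketch recovers the (X^T X)w part of the least-squares gradient X^T(Xw - y); the X^T y term does not depend on w and could be computed once up front.

```python
import numpy as np

# Illustrative sketch of polynomially coded regression (assumed parameters,
# not the paper's exact construction). Goal: recover (X^T X) w from coded
# worker computations while tolerating stragglers via interpolation.

rng = np.random.default_rng(0)
m, d = 12, 4      # dataset rows and features (assumed)
k, n = 3, 8       # number of data partitions, number of workers (assumed)

X = rng.standard_normal((m, d))
w = rng.standard_normal(d)
blocks = np.split(X, k)          # row partitions X_0, ..., X_{k-1}
alphas = np.arange(1.0, n + 1)   # a distinct evaluation point per worker

def encode(alpha):
    """Two coded copies per worker. The mirrored exponents z^j and z^{k-1-j}
    make the z^{k-1} coefficient of Xt(z)^T Xh(z) equal sum_j X_j^T X_j,
    which is exactly X^T X for a row partition."""
    Xt = sum(Bj * alpha ** j for j, Bj in enumerate(blocks))
    Xh = sum(Bj * alpha ** (k - 1 - j) for j, Bj in enumerate(blocks))
    return Xt, Xh

# Each worker i evaluates f(alpha_i) = Xt_i^T (Xh_i w). As a function of
# alpha this is a vector polynomial of degree 2(k-1), so any 2k-1 worker
# results determine it completely: the recovery threshold.
results = {}
for a in alphas:
    Xt, Xh = encode(a)
    results[a] = Xt.T @ (Xh @ w)

# Suppose only the first 2k-1 workers respond (the rest are stragglers).
threshold = 2 * k - 1
pts = np.array(list(results)[:threshold])
vals = np.stack([results[a] for a in pts])

# Interpolate: solve the Vandermonde system for all coefficients, then read
# off the coefficient of z^{k-1}, which equals (X^T X) w.
V = np.vander(pts, threshold, increasing=True)
coeffs = np.linalg.solve(V, vals)
recovered = coeffs[k - 1]

assert np.allclose(recovered, X.T @ (X @ w))
print(f"recovered (X^T X) w from {threshold} of {n} workers")
```

In this toy setup each worker stores two coded blocks (a 2/k fraction of the data) and the master waits for only 2k-1 of the n responses; storing more coded blocks per worker would lower k and hence the threshold, illustrating the inverse scaling between per-worker storage and recovery threshold that the abstract highlights.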