Perturbed Iterate Analysis
for Asynchronous Stochastic Optimization
Abstract
We introduce and analyze stochastic optimization methods where the input to each update is perturbed by bounded noise.
We show that this framework forms the basis of a unified approach to analyze asynchronous implementations of stochastic optimization algorithms,
by viewing them as serial methods operating on noisy inputs.
Using our perturbed iterate framework, we provide new analyses of the Hogwild! algorithm and asynchronous stochastic coordinate descent that are simpler than earlier analyses, remove many of the assumptions of previous models, and in some cases yield improved upper bounds on the convergence rates.
We proceed to apply our framework to develop and analyze KroMagnon: a novel, parallel, sparse stochastic variance-reduced gradient (SVRG) algorithm.
We demonstrate experimentally on a 16-core machine that the sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.
Keywords: stochastic optimization, asynchronous algorithms, parallel machine learning.
1 Introduction
Asynchronous parallel stochastic optimization algorithms have recently gained significant traction in algorithmic machine learning. A large body of recent work has demonstrated that near-linear speedups are achievable, in theory and practice, on many common machine learning tasks [1, 2, 3, 4, 5, 6, 7, 8]. Moreover, when these lock-free algorithms are applied to nonconvex optimization, significant speedups are still achieved with no loss of statistical accuracy. This behavior has been demonstrated in practice in state-of-the-art deep learning systems such as Google's Downpour SGD [9] and Microsoft's Project Adam [10].
Although asynchronous stochastic algorithms are simple to implement and enjoy excellent performance in practice, they are challenging to analyze theoretically. The current analyses require lengthy derivations and several assumptions that may not reflect realistic system behavior. Moreover, due to the difficult nature of the proofs, the algorithms analyzed are often simplified versions of those actually run in practice.
In this paper, we propose a general framework for deriving convergence rates for parallel, lock-free, asynchronous first-order stochastic algorithms. We interpret the algorithmic effects of asynchrony as perturbing the stochastic iterates with bounded noise. This interpretation allows us to show how a variety of asynchronous first-order algorithms can be analyzed as their serial counterparts operating on noisy inputs. The advantage of our framework is that it yields elementary convergence proofs, can remove or relax simplifying assumptions adopted in prior art, and can yield improved bounds when compared to earlier work.
We demonstrate the general applicability of our framework by providing new convergence analyses for Hogwild!, i.e., the asynchronous stochastic gradient method (SGM), for asynchronous stochastic coordinate descent (ASCD), and for KroMagnon: a novel asynchronous sparse version of the stochastic variance-reduced gradient (SVRG) method [11]. In particular, we provide a modified version of SVRG that allows for sparse updates, we show that this method can be parallelized in the asynchronous model, and we provide convergence guarantees using our framework. Experimentally, the asynchronous, parallel sparse SVRG achieves nearly-linear speedups on a machine with 16 cores and is sometimes four orders of magnitude faster than the standard (dense) SVRG method.
1.1 Related work
The algorithmic tapestry of parallel stochastic optimization is rich and diverse, extending back at least to the late 1960s [12]. Much of the contemporary work in this space builds upon the foundational work of Bertsekas, Tsitsiklis, et al. [13, 14]; the shared-memory access model that we use in this work is very similar to the partially asynchronous model introduced in those manuscripts. Recent advances in parallel and distributed computing technologies have generated renewed interest in the theoretical understanding and practical implementation of parallel stochastic algorithms [15, 16, 17, 18, 19, 20].
The power of lock-free, asynchronous stochastic optimization on shared-memory multicore systems was first demonstrated in the work of [1]. The authors introduce Hogwild!, a completely lock-free and asynchronous parallel stochastic gradient method (SGM) that exhibits nearly linear speedups for a variety of machine learning tasks. Inspired by Hogwild!, several authors developed lock-free and asynchronous algorithms that move beyond SGM, such as the work of Liu et al. on parallel stochastic coordinate descent [5, 21]. Additional work in first-order optimization and beyond [6, 7, 22, 8, 23], extending to parallel iterative linear solvers [24, 25], has further shown that linear speedups are possible in the asynchronous shared-memory model.
2 Perturbed Stochastic Gradients
Preliminaries and Notation
We study parallel asynchronous iterative algorithms that minimize convex functions $f(x)$ with $x \in \mathbb{R}^d$. The computational model is the same as that of Niu et al. [1]: a number of cores have access to the same shared memory, and each of them can read and update components of $x$ in the shared memory. The algorithms that we consider are asynchronous and lock-free: cores do not coordinate their reads or writes, and while a core is reading or writing, other cores can update the shared variables in $x$.
We focus our analysis on functions $f$ that are smooth and strongly convex. A function $f$ is $L$-smooth if it is differentiable and has $L$-Lipschitz gradients:
$$\|\nabla f(x) - \nabla f(y)\| \le L\, \|x - y\| \quad \text{for all } x, y,$$
where $\|\cdot\|$ denotes the Euclidean norm. Strong convexity with parameter $m > 0$ imposes a curvature condition on $f$:
$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{m}{2}\, \|y - x\|^2 \quad \text{for all } x, y.$$
Strong convexity implies that $f$ has a unique minimizer $x^*$ and satisfies
$$\langle x - x^*,\, \nabla f(x)\rangle \ge m\, \|x - x^*\|^2 \quad \text{for all } x.$$
In the following, we use $i$, $j$, and $k$ to denote iteration counters, while reserving $u$ and $v$ to denote coordinate indices. We use $O(1)$ to denote absolute constants.
Perturbed Iterates
A popular way to minimize convex functions is via firstorder stochastic algorithms. These algorithms can be described using the following general iterative expression:
(2.1) $x_{j+1} = x_j - \gamma\, g(x_j, \xi_j),$
where $\xi_j$ is a random variable independent of $x_j$, and $g(x_j, \xi_j)$ is an unbiased estimator of the true gradient of $f$ at $x_j$: $\mathbb{E}_{\xi_j}\, g(x_j, \xi_j) = \nabla f(x_j)$. The success of first-order stochastic techniques partly lies in their computational efficiency: the small computational cost of using noisy gradient estimates trumps the gains of using true gradients.
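As a concrete illustration of iteration (2.1), the following sketch runs serial SGM on a toy least-squares objective; the data, step size, and iteration count are illustrative choices, not part of the paper's setup.

```python
import numpy as np

# Serial stochastic gradient iteration (2.1) on a toy least-squares
# objective f(x) = (1/n) * sum_i (a_i^T x - b_i)^2. All problem data
# below are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                      # noiseless targets: minimizer is x_star

def g(x, i):
    """Unbiased gradient estimate: E_i[g(x, i)] = grad f(x)."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

gamma, T = 0.01, 5000
x = np.zeros(d)
for _ in range(T):
    i = rng.integers(n)             # xi_j: sampled independently of x_j
    x = x - gamma * g(x, i)         # x_{j+1} = x_j - gamma * g(x_j, xi_j)
```

Since the targets here are noiseless, the stochastic gradients vanish at the minimizer and the iterates converge to it despite the constant step size.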
A major advantage of the iterative formula in (2.1) is that, in combination with the strong convexity and smoothness inequalities, one can easily track algorithmic progress and establish convergence rates to the optimal solution. Unfortunately, the progress of asynchronous parallel algorithms cannot be precisely described or analyzed using the above iterative framework. Processors do not read actual iterates $x_j$ from memory, as there is no global clock that synchronizes reads or writes, and different cores write and read “stale” variables.
In the subsequent sections, we show that the following simple perturbed variant of Eq. (2.1) can capture the algorithmic progress of asynchronous stochastic algorithms. Consider the following iteration
(2.2) $x_{j+1} = x_j - \gamma\, g(x_j + n_j,\, \xi_j),$
where $n_j$ is a stochastic error term. For simplicity let $\hat x_j = x_j + n_j$. Then,
(2.3)
$$\begin{aligned} \|x_{j+1} - x^*\|^2 &= \|x_j - x^*\|^2 - 2\gamma\, \langle x_j - x^*,\, g(\hat x_j, \xi_j)\rangle + \gamma^2\, \|g(\hat x_j, \xi_j)\|^2 \\ &= \|x_j - x^*\|^2 - 2\gamma\, \langle \hat x_j - x^*,\, g(\hat x_j, \xi_j)\rangle + \gamma^2\, \|g(\hat x_j, \xi_j)\|^2 + 2\gamma\, \langle \hat x_j - x_j,\, g(\hat x_j, \xi_j)\rangle, \end{aligned}$$
where in the last equation we added and subtracted the term $2\gamma\, \langle \hat x_j,\, g(\hat x_j, \xi_j)\rangle$.
We assume that $\hat x_j$ and $\xi_j$ are independent. However, in contrast to recursion (2.1), we no longer require $x_j$ to be independent of $\xi_j$. The importance of the above independence assumption will become clear in the next section.
We now take the expectation of both sides in (2.3). Since $\hat x_j$ and $\xi_j$ are independent, we use iterated expectations to obtain $\mathbb{E}\,\langle \hat x_j - x^*,\, g(\hat x_j, \xi_j)\rangle = \mathbb{E}\,\langle \hat x_j - x^*,\, \nabla f(\hat x_j)\rangle$. Moreover, since $f$ is $m$-strongly convex, we know that
(2.4) $\langle \hat x_j - x^*,\, \nabla f(\hat x_j)\rangle \ge m\, \|\hat x_j - x^*\|^2 \ge \frac{m}{2}\, \|x_j - x^*\|^2 - m\, \|\hat x_j - x_j\|^2,$
where the second inequality is a simple consequence of the triangle inequality. Now, let $a_j = \mathbb{E}\,\|x_j - x^*\|^2$ and substitute (2.4) back into Eq. (2.3) to get
(2.5) $a_{j+1} \le (1 - \gamma m)\, a_j + \gamma^2\, \mathbb{E}\,\|g(\hat x_j, \xi_j)\|^2 + 2\gamma m\, \mathbb{E}\,\|\hat x_j - x_j\|^2 + 2\gamma\, \mathbb{E}\,\langle \hat x_j - x_j,\, g(\hat x_j, \xi_j)\rangle.$
The recursive equation (2.5) is key to our analysis. We show that, given bounds on the three expectation terms above, we can obtain convergence rates through elementary algebraic manipulations. Observe that there are three “error” terms in (2.5): $\mathbb{E}\,\|g(\hat x_j, \xi_j)\|^2$ captures the stochastic gradient decay with each iteration, $\mathbb{E}\,\|\hat x_j - x_j\|^2$ captures the mismatch between the true iterate and its noisy estimate, and $\mathbb{E}\,\langle \hat x_j - x_j,\, g(\hat x_j, \xi_j)\rangle$ measures the size of the projection of that mismatch on the gradient at each step. The key contribution of our work is to show that 1) this iteration can capture the algorithmic progress of asynchronous algorithms, and 2) the error terms can be bounded to obtain an $O(1/\epsilon)$ rate for Hogwild!, and linear rates of convergence for asynchronous SCD and asynchronous sparse SVRG.
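A quick numerical sanity check of the perturbed view: the sketch below runs gradient steps whose input is corrupted by bounded noise, as in (2.2), on a simple strongly convex quadratic. The objective, noise level, and step size are illustrative assumptions.

```python
import numpy as np

# Perturbed iterates x_{j+1} = x_j - gamma * grad_f(x_j + n_j) on the
# strongly convex quadratic f(x) = (m/2) * ||x||^2 (so x* = 0). The
# noise n_j is bounded, mimicking the mismatch term in (2.5).
rng = np.random.default_rng(1)
d, m = 10, 1.0
grad_f = lambda z: m * z

gamma, T, noise = 0.1, 300, 1e-3
x = np.ones(d)
a0 = float(np.linalg.norm(x) ** 2)
for _ in range(T):
    n_j = noise * rng.uniform(-1.0, 1.0, d)   # bounded perturbation
    x_hat = x + n_j                           # noisy input \hat{x}_j
    x = x - gamma * grad_f(x_hat)

aT = float(np.linalg.norm(x) ** 2)
# a_j contracts geometrically down to a noise floor governed by the
# perturbation magnitude, as recursion (2.5) predicts.
```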
3 Analyzing Hogwild!
In this section, we provide a simple analysis of Hogwild!, the asynchronous implementation of SGM. We focus on functions that are decomposable into $n$ terms:
(3.1) $f(x) = \frac{1}{n} \sum_{i=1}^{n} f_{e_i}(x),$
where $x \in \mathbb{R}^d$, and each term $f_{e_i}$ depends only on the coordinates of $x$ indexed by the subset $e_i$ of $\{1, \dots, d\}$. For simplicity we assume that the terms of $f$ are differentiable; our results can be readily extended to non-differentiable $f_{e_i}$s.
We refer to the sets $e_i$ as hyperedges and denote the set of hyperedges by $E$. We sometimes refer to the $f_{e_i}$s as the terms of $f$. As shown in Fig. 1, the hyperedges induce a bipartite graph between the terms of $f$ and the $d$ variables, and a conflict graph between the terms. Let $\bar\Delta$ be the average degree in the conflict graph, that is, the average number of terms that are in conflict with a single term. We assume that $\bar\Delta \ge 1$, since otherwise we could decompose the problem into smaller independent subproblems. As we will see, under our perturbed iterate analysis framework the convergence rate of asynchronous algorithms depends on $\bar\Delta$.
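The average conflict degree can be computed directly from the hyperedge structure. The sketch below builds the conflict graph of a small, made-up hyperedge set; the hyperedges are hypothetical examples, not data from the paper.

```python
from itertools import combinations

# Terms of a decomposable f as in (3.1) are hyperedges (subsets of
# coordinates); two terms conflict when their hyperedges intersect.
hyperedges = [{0, 1}, {1, 2}, {3}, {4, 5}, {5, 0}]
n = len(hyperedges)

# Degree of each term in the conflict graph: how many other terms
# share at least one coordinate with it.
degree = [0] * n
for i, j in combinations(range(n), 2):
    if hyperedges[i] & hyperedges[j]:
        degree[i] += 1
        degree[j] += 1

avg_conflict_degree = sum(degree) / n   # the quantity the rate depends on
```

For this toy instance the conflict graph has three edges, giving an average degree of 1.2.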
Hogwild! (Alg. 1) is a method to parallelize SGM in the asynchronous setting [1]. It is deployed on multiple cores that have access to shared memory, where the optimization variable $x$ and the data points that define the terms of $f$ are stored. During its execution each core samples uniformly at random a hyperedge $s$ from $E$. It reads the coordinates $x_s$ of the shared vector, evaluates $\nabla f_s$ at the point read, and finally adds $-\gamma\, \nabla f_s$ to the shared variable.
During the execution of Hogwild! cores do not synchronize or follow an order between reads or writes. Moreover, they access (i.e., read or write) a set of coordinates in $x$ without the use of any locking mechanisms that would ensure a conflict-free execution. This implies that the reads/writes of distinct cores can intertwine in arbitrary ways; e.g., while a core updates a subset of variables, and before it completes its task, other cores can read/write the same subset of variables.
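A minimal sketch of this execution model, under the simplifying assumption that every hyperedge is a single coordinate of a separable quadratic, so there are no cross-coordinate conflicts; this is far simpler than a real Hogwild! implementation and only illustrates the lock-free read/update pattern.

```python
import threading
import numpy as np

# Lock-free Hogwild!-style updates: several threads repeatedly sample a
# term of f(x) = (1/d) * sum_u (x_u - b_u)^2 (each hyperedge is a single
# coordinate) and write gradient steps to shared memory without locks.
# The objective, step size, and thread count are illustrative choices.
d = 20
b = np.arange(d, dtype=float)       # minimizer is x* = b
x = np.zeros(d)                     # shared optimization variable

def worker(steps, seed, gamma=0.1):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        u = int(rng.integers(d))    # sample a hyperedge (a coordinate)
        grad_u = 2.0 * (x[u] - b[u])
        x[u] -= gamma * grad_u      # unsynchronized write to shared memory

threads = [threading.Thread(target=worker, args=(4000, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even with interleaved reads and writes, every applied update contracts its coordinate toward the minimizer, so the shared vector ends up close to the optimum.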
In [1], the authors analyzed a variant of Hogwild! in which several simplifying assumptions were made. Specifically, in [1]: 1) only a single coordinate per sampled hyperedge is updated (i.e., the for loop in Hogwild! is replaced with a single coordinate update); 2) consistent reads are assumed, i.e., it is assumed that while a core is reading the shared variable, no writes from other cores occur; 3) an implicit assumption is made on the uniformity of the processing times of the cores (explained in the following) that does not generically hold in practice. These simplifications alleviate some of the challenges of analyzing Hogwild! and allowed the authors to provide a convergence result. As we show in the present paper, however, these simplifications are not necessary for obtaining a convergence analysis. Our perturbed iterate framework can be used in an elementary way to analyze the original version of Hogwild!, yielding improved bounds compared to earlier analyses.
3.1 Ordering the samples
A subtle but important point in the analysis of Hogwild! is the need to define an order for the sampled hyperedges. A key point of difference of our work is that we order the samples based on the order in which they were sampled, not the order in which cores complete the processing of the samples.
Definition 1.
We denote by $s_i$ the $i$-th sampled hyperedge in a run of Alg. 1.
That is, $s_i$ denotes the sample obtained when the sampling step of Alg. 1 is executed for the $i$-th time. This is different from the original work of [1], in which the samples were ordered according to the completion time of each thread. The issue with such an ordering is that the distribution of the samples, conditioned on the ordering, is not always uniform; for example, hyperedges of small cardinality are more likely to be “early” samples. A uniform distribution is needed for the theoretical analysis of stochastic gradient methods, a point that is disregarded in [1]. Our ordering according to sampling time resolves this issue by guaranteeing uniformity among samples in a trivial way.
3.2 Defining read iterates and clarifying independence assumptions
Since the shared memory variable can change inconsistently during reads and writes, we also have to be careful about the notion of iterates in Hogwild!.
Definition 2.
We denote by $x_i$ the contents of the shared memory before the $i$-th execution of the sampling step. Moreover, we denote by $\hat x_i$ the vector that, in the coordinates of $s_i$, contains exactly what the core that sampled $s_i$ read. We then define $[\hat x_i]_v = [x_i]_v$ for all coordinates $v \notin s_i$. Note that we do not assume consistent reads, i.e., the contents of the shared memory can potentially change while a core is reading.
At this point we would like to briefly discuss an independence assumption made by all prior work. In the following paragraph, we explain why this assumption is not always true in practice. In Appendix A, we show how to lift this independence assumption, but for ease of exposition we adopt it in our main text.
Assumption 1.
The vector $\hat x_i$ is independent of the sampled hyperedge $s_i$.
The above independence assumption is important when establishing the convergence rate of the algorithm, and has been made explicitly or implicitly in prior work [1, 6, 5, 21]. Specifically, when proving convergence rates for these algorithms, we need to show via iterated expectations that $\mathbb{E}\,\langle \hat x_i - x^*,\, g(\hat x_i, s_i)\rangle = \mathbb{E}\,\langle \hat x_i - x^*,\, \nabla f(\hat x_i)\rangle$, which follows from the independence of $\hat x_i$ and $s_i$. However, observe that although $x_i$ is independent of $s_i$ by construction, this is not the case for the vector $\hat x_i$ read by the core that sampled $s_i$. For example, consider the scenario of two consecutively sampled hyperedges in Alg. 1 that overlap on a subset of coordinates. Then, say, one core is reading the coordinates of the shared variable indexed by its hyperedge, while the second core is updating a subset of these coordinates. In this case, the values read by the first core depend on the support of the sampled hyperedge.
One way to rigorously enforce the independence of $\hat x_i$ and $s_i$ is to require the processors to read the entire shared variable $x$ before sampling a new hyperedge. However, this might not be reasonable in practice, as the dimension $d$ of $x$ tends to be considerably larger than the sparsity of the hyperedges. As mentioned earlier, in Appendix A we show how to overcome the issue of dependence and thereby remove Assumption 1; however, this results in a slightly more cumbersome analysis. To ease readability, we adopt Assumption 1 in our main text.
3.3 The perturbed iterates view of asynchrony
In this work, we assume that all writes are atomic, in the sense that they will be successfully recorded in the shared memory at some point. Atomicity is a reasonable assumption in practice, as it can be strictly enforced through compareandswap operations [1].
Assumption 2.
Every write in line 6 of Alg. 1 will complete successfully.
This assumption implies that all writes will appear in the shared memory by the end of the execution, in the form of coordinate-wise updates. Due to commutativity, the order in which these updates are recorded in the shared memory is irrelevant. Hence, after processing a total of $j$ hyperedges the shared memory contains:
(3.2) $x_j = x_0 - \gamma \sum_{i=0}^{j-1} g(\hat x_i, s_i),$
where $x_0$ is the initial guess and $x_j$ is defined as the vector that contains all gradient updates up to sample $j$.
Remark 1.
Throughout this section we denote $g(\hat x_i, s_i) = \nabla f_{s_i}(\hat x_i)$, which we assume to be bounded: $\mathbb{E}\,\|g(\hat x_i, s_i)\|^2 \le M^2$. Such a uniform bound on the norm of the stochastic gradient is true when operating on a bounded ball; this can in turn be enforced by a simple, coordinate-wise thresholding operator. We can refine our analysis by avoiding the uniform bound on $\mathbb{E}\,\|g(\hat x_i, s_i)\|^2$, through a simple application of the cocoercivity lemma as it was used in [26]; in this case, our derivations would only require a uniform bound on $\mathbb{E}\,\|g(x^*, s_i)\|^2$. Our subsequent derivations can be adapted to the above; however, to keep them elementary, we will use the uniform bound on $\mathbb{E}\,\|g(\hat x_i, s_i)\|^2$.
Remark 2.
Observe that although a core is only reading the subset of variables that are indexed by its sampled hyperedge, in (3.2) we use the entire vector $\hat x_i$ as the input to the sampled gradient. We can do this since $\nabla f_{s_i}(\hat x_i)$ does not depend on the coordinates of $\hat x_i$ outside the support of the hyperedge $s_i$.
Using the above definitions, we define the perturbed iterates of Hogwild! as
(3.3) $x_{j+1} = x_j - \gamma\, g(\hat x_j, s_j),$
for $0 \le j \le T - 1$, where $s_j$ is the $j$-th uniformly sampled hyperedge and $T$ is the total number of processed samples. Observe that all but the first and last of these iterates are “fake”: there might not be an actual time during the execution when they exist in the shared memory. However, $x_0$ is what is stored in memory before the execution starts, and $x_T$ is exactly what is stored in shared memory at the end of the execution.
We observe that the iterates in (3.3) place Hogwild! in the perturbed gradient framework introduced in §2. We are only left to bound the three error terms $\mathbb{E}\,\|g(\hat x_j, s_j)\|^2$, $\mathbb{E}\,\|\hat x_j - x_j\|^2$, and $\mathbb{E}\,\langle \hat x_j - x_j,\, g(\hat x_j, s_j)\rangle$. Before we proceed, we note that for the technical soundness of our theorems, we also have to define a random variable that captures the system randomness (i.e., random delays between reads and writes, gradient computation time, etc.). Although we do not explicitly use this variable, its distribution is implicitly required to compute the expectations in the convergence analysis. This is because the random samples do not fully determine the output of Alg. 1; however, the samples $s_0, \dots, s_{T-1}$ together with the system randomness completely determine the time of all reads and writes. We continue with the final assumption needed by our analysis, one that is also needed by prior art.
Assumption 3 (Bounded overlaps).
Two hyperedges $s_i$ and $s_j$ overlap in time if they are processed concurrently at some point during the execution of Hogwild!. The time during which a hyperedge is being processed begins when the sampling function is called and ends after the last coordinate of the corresponding gradient update is written to the shared memory. We assume that there exists a number $\tau$ such that the maximum number of sampled hyperedges that can overlap in time with a particular sampled hyperedge cannot be more than $\tau$.
The usefulness of the above assumption is that it essentially abstracts away all system details relative to delays, processing overlaps, and the number of cores into a single parameter $\tau$. Intuitively, $\tau$ can be perceived as a proxy for the number of cores, i.e., we would expect that no more than roughly $\tau$ sampled hyperedges overlap in time with a single hyperedge, assuming that the processing times across the samples are approximately similar. Observe that if $\tau$ is small, then we expect the distance between $x_j$ and the noisy iterate $\hat x_j$ to be small. In our perturbed iterate framework, if we set $\tau = 0$, then we obtain the classical iterative formula of serial SGM.
To quantify the distance between $\hat x_j$ (i.e., the iterate read by the core that sampled $s_j$) and $x_j$ (i.e., the “fake” iterate used to establish convergence rates), we observe that any difference between them is caused solely by hyperedges that overlap with $s_j$ in time. To see this, let $s_i$ be an “earlier” sample, i.e., $i < j$, that does not overlap with $s_j$ in time. This implies that the processing of $s_i$ finishes before the processing of $s_j$ starts. Hence, the full contribution of $g(\hat x_i, s_i)$ will be recorded in both $\hat x_j$ and $x_j$ (for the latter this holds by definition). Similarly, if $i > j$ and $s_i$ does not overlap with $s_j$ in time, then neither $\hat x_j$ nor $x_j$ (for the latter, again by definition) contain any of the coordinate updates involved in the gradient update $g(\hat x_i, s_i)$. Assumption 3 ensures that if $i < j - \tau$ or $i > j + \tau$, the sample $s_i$ does not overlap in time with $s_j$.
By the above discussion, and due to Assumption 3, there exist diagonal matrices $\mathbf{S}_i^j$ with diagonal entries in $\{-1, 0, 1\}$ such that
(3.4) $\hat x_j - x_j = \gamma \sum_{\substack{i = j - \tau \\ i \neq j}}^{j + \tau} \mathbf{S}_i^j\, g(\hat x_i, s_i).$
These diagonal matrices account for any possible pattern of (potentially) partial updates that can occur while the hyperedge $s_j$ is being processed. We would like to note that the above notation bears resemblance to the coordinate-update mismatch formulation of asynchronous coordinate-based algorithms, as in [21, 27, 28].
We now turn to the convergence proof, emphasizing its elementary nature within the perturbed iterate analysis framework. We begin by bounding the error terms $\mathbb{E}\,\|\hat x_j - x_j\|^2$ and $\mathbb{E}\,\langle \hat x_j - x_j,\, g(\hat x_j, s_j)\rangle$ (the term $\mathbb{E}\,\|g(\hat x_j, s_j)\|^2$ is already assumed to be at most $M^2$).
Lemma 3.
Hogwild! satisfies the recursion in (2.5) with $\mathbb{E}\,\|g(\hat x_j, s_j)\|^2 \le M^2$,
$$\mathbb{E}\,\|\hat x_j - x_j\|^2 \le O(1)\, \gamma^2 M^2 \Big(\tau + \tau^2\, \frac{\bar\Delta}{n}\Big), \quad \text{and} \quad \mathbb{E}\,\langle \hat x_j - x_j,\, g(\hat x_j, s_j)\rangle \le O(1)\, \gamma M^2\, \tau\, \frac{\bar\Delta}{n},$$
where $\bar\Delta$ is the average degree of the conflict graph between the hyperedges.
Proof.
The norm of the mismatch can be bounded in the following way:
$$\mathbb{E}\,\|\hat x_j - x_j\|^2 \le \gamma^2\, \mathbb{E}\Big(\sum_{\substack{i = j - \tau \\ i \neq j}}^{j + \tau} \|g(\hat x_i, s_i)\|\, \mathbf{1}\{s_i \cap s_j \neq \emptyset\}\Big)^2,$$
since the $\mathbf{S}_i^j$ are diagonal sign matrices and since the steps are supported on the samples $s_{j-\tau}, \dots, s_{j+\tau}$, which can contribute to the mismatch only when they intersect $s_j$. We use the upper bound $\|g(\hat x_i, s_i)\|\, \|g(\hat x_k, s_k)\| \le \frac{1}{2}\big(\|g(\hat x_i, s_i)\|^2 + \|g(\hat x_k, s_k)\|^2\big)$ on each cross term to obtain
$$\mathbb{E}\,\|\hat x_j - x_j\|^2 \le O(1)\, \gamma^2 M^2 \Big(\tau + \tau^2\, \frac{\bar\Delta}{n}\Big).$$
The last step follows because two sampled hyperedges (sampled with replacement) intersect with probability at most $\bar\Delta/n$. We can bound $\mathbb{E}\,\langle \hat x_j - x_j,\, g(\hat x_j, s_j)\rangle$ in a similar way:
$$\mathbb{E}\,\langle \hat x_j - x_j,\, g(\hat x_j, s_j)\rangle \le \gamma \sum_{\substack{i = j - \tau \\ i \neq j}}^{j + \tau} \mathbb{E}\Big[\|g(\hat x_i, s_i)\|\, \|g(\hat x_j, s_j)\|\, \mathbf{1}\{s_i \cap s_j \neq \emptyset\}\Big] \le O(1)\, \gamma M^2\, \tau\, \frac{\bar\Delta}{n}.$$
∎
Plugging the bounds of Lemma 3 into our recursive formula, we see that Hogwild! satisfies the recursion
(3.5) $a_{j+1} \le (1 - \gamma m)\, a_j + O(1)\, \gamma^2 M^2 \Big(1 + \frac{\tau \bar\Delta}{n} + \gamma m \tau + \gamma m \tau^2 \frac{\bar\Delta}{n}\Big).$
On the other hand, serial SGM satisfies the recursion $a_{j+1} \le (1 - \gamma m)\, a_j + \gamma^2 M^2$. If the step size is set to $\gamma = \frac{\epsilon m}{2 M^2}$, it attains target accuracy $\epsilon$ in $O\big(\frac{M^2}{\epsilon m^2} \log\frac{a_0}{\epsilon}\big)$ iterations. Hence, when the $\frac{\tau \bar\Delta}{n} + \gamma m \tau + \gamma m \tau^2 \frac{\bar\Delta}{n}$ term of (3.5) is orderwise constant, Hogwild! satisfies the same recursion (up to constants) as serial SGM. This directly implies the main result of this section.
Theorem 4.
If the number of samples that overlap in time with a single sample during the execution of Hogwild! is bounded as
$$\tau = O\Big(\min\Big\{\frac{n}{\bar\Delta},\ \frac{M^2}{\epsilon\, m^2}\Big\}\Big),$$
then Hogwild!, with step size $\gamma = \frac{\epsilon m}{2 M^2}$, reaches an accuracy of $\mathbb{E}\,\|x_T - x^*\|^2 \le \epsilon$ after
$$T = O\Big(\frac{M^2}{\epsilon\, m^2} \log\frac{a_0}{\epsilon}\Big)$$
iterations.
Since the iteration bound in the theorem is (up to a constant) the same as that of serial SGM, our result implies a linear speedup. We would like to note that an improved rate of $O(1/\epsilon)$, without the logarithmic factor, can be obtained by appropriately diminishing stepsizes per epoch (see, e.g., [1, 29]). Furthermore, observe that although the bound on $\tau$ might seem restrictive, it is, up to a logarithmic factor, proportional to the total number of iterations required by Hogwild! (or even serial SGM) to reach $\epsilon$ accuracy. Assuming that the average degree of the conflict graph is constant, and that we perform a constant number of passes over the data, i.e., $T = O(n)$, then $\tau$ can be as large as $\widetilde{O}(n)$, i.e., nearly linear in the number of function terms.^{1}
^{1} $\widetilde{O}(\cdot)$ hides logarithmic terms.
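The step-size arithmetic behind this comparison is easy to verify numerically. The sketch below iterates the serial recursion $a_{j+1} \le (1 - \gamma m)\, a_j + \gamma^2 M^2$ with the stated step size; the constants are illustrative.

```python
import math

# Serial SGM recursion a_{j+1} = (1 - gamma*m) * a_j + gamma^2 * M^2
# with gamma = eps*m / (2 M^2): the fixed point is eps/2, and accuracy
# eps is reached in O(M^2 / (eps m^2) * log(a0/eps)) iterations.
m, M, eps, a0 = 1.0, 2.0, 0.1, 10.0
gamma = eps * m / (2 * M ** 2)

a, T = a0, 0
while a > eps:
    a = (1 - gamma * m) * a + gamma ** 2 * M ** 2
    T += 1

# Predicted order of the iteration count (up to constants):
T_bound = (2 * M ** 2) / (eps * m ** 2) * math.log(2 * a0 / eps)
```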
3.4 Comparison with the original Hogwild! analysis of [1]
Let us summarize the key points of improvement compared to the original Hogwild! analysis:

Our analysis is elementary and compact, and follows simply by bounding the $\mathbb{E}\,\|g(\hat x_j, s_j)\|^2$, $\mathbb{E}\,\|\hat x_j - x_j\|^2$, and $\mathbb{E}\,\langle \hat x_j - x_j,\, g(\hat x_j, s_j)\rangle$ terms, after introducing the perturbed gradient framework of § 2.

We do not assume consistent reads: while a core is reading from the shared memory, other cores are allowed to read or write.

In [1] the authors analyze a simplified version of Hogwild! where for each sampled hyperedge only a single randomly selected coordinate is updated. Here we analyze the “full-update” version of Hogwild!.

We order the samples by the order in which they were sampled, not by completion time. This allows us to rigorously prove our convergence bounds without assuming anything about the distribution of the processing time of each hyperedge. This is unlike [1], where there is an implicit assumption of uniformity with respect to processing times.

The previous work of [1] establishes a nearly-linear speedup for Hogwild! if $\tau$ is bounded as $\tau = O\big(\frac{n}{\Delta_R \Delta_L}\big)$, where $\Delta_R$ is the maximum right degree of the term-variables bipartite graph shown in Fig. 1, and $\Delta_L$ is the maximum left degree of the same graph. Observe that $\Delta_C \le \Delta_R \Delta_L$, where $\Delta_C$ is the maximum degree of the conflict graph. Here, we obtain a linear speedup for $\tau$ up to $O(n/\bar\Delta)$, where $\bar\Delta$ is only the average degree of the conflict graph in Fig. 1. Our bound on the delays can be orders of magnitude better than that of [1].
4 Asynchronous Stochastic Coordinate Descent
In this section, we use the perturbed gradient framework to analyze the convergence of asynchronous parallel stochastic coordinate descent (ASCD). This algorithm has been previously analyzed in [5, 21]. We show that the algorithm admits an elementary treatment in our perturbed iterate framework, under the same assumptions made for Hogwild!.
ASCD, shown in Alg. 2, is a linearly convergent algorithm for minimizing strongly convex functions $f$. At each iteration a core samples one of the $d$ coordinates, computes a full gradient update for that coordinate, and proceeds with updating a single element of the shared memory variable $x$. The challenge in analyzing ASCD, compared to Hogwild!, is that, in order to show linear convergence, we need to show that the error due to the asynchrony between cores decays fast when the iterates arrive close to the optimal solution. The perturbed iterate framework can handle this type of noise analysis in a straightforward manner, using simple recursive bounds.
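The effect of asynchrony on coordinate descent can be emulated serially by letting each coordinate update be computed at a vector read up to `tau` iterations in the past. The sketch below does this for an illustrative diagonal quadratic; the delay model is a simplification of the shared-memory setting, not the paper's actual system.

```python
import numpy as np

# ASCD-style staleness, simulated serially: each single-coordinate
# update is computed at a vector that is up to tau iterations old.
rng = np.random.default_rng(2)
d, tau = 10, 5
diag = np.linspace(1.0, 4.0, d)      # f(x) = 0.5 * sum_u diag_u * x_u^2
grad = lambda z: diag * z            # minimizer is x* = 0

gamma, T = 0.1, 2000
x = np.ones(d)
history = [x.copy()]                 # past iterates, to emulate stale reads
for _ in range(T):
    delay = int(rng.integers(tau + 1))           # staleness of this read
    x_hat = history[max(0, len(history) - 1 - delay)]
    u = int(rng.integers(d))                     # sampled coordinate
    x[u] -= gamma * grad(x_hat)[u]               # single-coordinate update
    history.append(x.copy())
```

Despite every update being computed from stale information, each coordinate still contracts, which is exactly the behavior the recursive bounds in this section quantify.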
We define $x_j$ and $\hat x_j$ as in the previous section, but now the samples $s_j$ are coordinates sampled uniformly at random from $\{1, \dots, d\}$. After $j$ samples have been processed completely, the following vector is contained in shared memory:
$$x_j = x_0 - \gamma \sum_{i=0}^{j-1} e_{s_i}\, [\nabla f(\hat x_i)]_{s_i},$$
where $x_0$ is the initial guess, $e_{s_i}$ is the standard basis vector with a one at position $s_i$, and $[\nabla f(\hat x_i)]_{s_i}$ denotes the $s_i$-th coordinate of the gradient of $f$ computed at $\hat x_i$. Similar to Hogwild! in the previous section, ASCD satisfies the following iterative formula:
$$x_{j+1} = x_j - \gamma\, e_{s_j}\, [\nabla f(\hat x_j)]_{s_j}.$$
Notice that $\mathbb{E}_{s_j}\big[e_{s_j}\, [\nabla f(\hat x_j)]_{s_j}\big] = \frac{1}{d}\, \nabla f(\hat x_j)$, and thus, similarly to Hogwild!, ASCD's iterates satisfy the recursion of Eq. (2.5):
$$a_{j+1} \le \Big(1 - \frac{\gamma m}{d}\Big) a_j + \gamma^2\, \mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2 + \frac{2\gamma m}{d}\, \mathbb{E}\,\|\hat x_j - x_j\|^2 + 2\gamma\, \mathbb{E}\,\langle \hat x_j - x_j,\, e_{s_j} [\nabla f(\hat x_j)]_{s_j}\rangle.$$
Before stating the main result of this section, let us introduce some further notation. Let us define the largest distance between the optimal vector and the vectors read by the cores during the execution of the algorithm: $\hat a = \max_{0 \le j \le T} \mathbb{E}\,\|\hat x_j - x^*\|^2$, a value which should be thought of as proportional to $a_0 = \|x_0 - x^*\|^2$. Furthermore, by a simple application of the Lipschitz assumption on $\nabla f$, we have a uniform bound on the norm of each computed gradient: $\mathbb{E}\,\|\nabla f(\hat x_j)\|^2 \le L^2\, \hat a$. Here we assume that the optimization takes place in an $\ell_\infty$ ball, so that $\hat a$ is finite. This simply means that the iterates will never have infinitely large coordinate values. This assumption is made in previous work explicitly or implicitly, and in practice it can be implemented easily, since the projection on an $\ell_\infty$ ball can be done component-wise. Finally, let us define the condition number of $f$ as $\kappa = L/m$, where $L$ is the Lipschitz constant and $m$ the strong convexity parameter.
Theorem 5.
Let the maximum number of coordinates that can be concurrently processed while a core is processing a single coordinate be at most
$$\tau = O\Big(\min\Big\{\frac{\sqrt{T}}{\log T},\ \sqrt{d}\Big\}\Big).$$
Then, ASCD with stepsize $\gamma = \frac{m}{2L^2}$ achieves an error of $\mathbb{E}\,\|x_T - x^*\|^2 \le \epsilon$ after
$$T = O\Big(d\, \kappa^2 \log\frac{a_0}{\epsilon}\Big)$$
iterations.
Using the recursive inequality (2.5), serial SCD with the same stepsize as in the above theorem can be shown to achieve the same accuracy as ASCD in the same number of steps. Hence, as long as the proxy $\tau$ for the number of cores is bounded as $\tau = O(\sqrt{d})$, our theorem implies a linear speedup with respect to this simple convergence bound. We would like to note, however, that the coordinate descent literature sometimes uses more refined properties of the function to be optimized that can lead to potentially better convergence bounds, especially in terms of function value accuracy, i.e., $f(x_T) - f(x^*)$ (see, e.g., [5, 21, 30]).
We would further like to remark that, between the two bounds on $\tau$, the second one, i.e., $\tau = O(\sqrt{d})$, is the more restrictive, as the first one is proportional, up to log factors, to the square root of the number of iterations, which is usually the larger quantity. We explain in our subsequent derivation how this loose bound can be improved, but leave the tightness of the bound as an open question for future work.
4.1 Proof of Theorem 5
The analysis here is slightly more involved compared to Hogwild!. The main technical bottleneck is to relate the decay of $\mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2$ with that of $a_j$, and then to exploit the sparsity of the updates for bounding $\mathbb{E}\,\langle \hat x_j - x_j,\, e_{s_j} [\nabla f(\hat x_j)]_{s_j}\rangle$.
We start with a simple upper bound on the norm of the gradient updates. From the Lipschitz assumption on $\nabla f$, we have
$$\mathbb{E}\,\|\nabla f(\hat x_j)\|^2 = \mathbb{E}\,\|\nabla f(\hat x_j) - \nabla f(x^*)\|^2 \le L^2\, \mathbb{E}\,\|\hat x_j - x^*\|^2 \le 2L^2\, \mathbb{E}\,\|x_i - \hat x_j\|^2 + 2L^2\, \mathbb{E}\,\|x_i - x^*\|^2,$$
where the last inequality is due to Jensen's inequality. This yields the following result.
Lemma 6.
For any $i$ and $j$ we have
$$\mathbb{E}\,\|\nabla f(\hat x_j)\|^2 \le 2L^2\, \mathbb{E}\,\|x_i - \hat x_j\|^2 + 2L^2\, a_i.$$
Let $T$ be the total number of ASCD iterations, and let us define the set
$$A_j = \{\, i : |i - j| \le \tau \,\},$$
which has cardinality at most $2\tau + 1$ and contains all indices around $j$ within $\tau$ steps, as sketched in Fig. 2.
Due to Assumption 3, and similar to [21], there exist variables $\sigma_k^{i,j} \in \{-1, 0, 1\}$ such that, for any index $i$ in the set $A_j$, we have
(4.1) $x_i - \hat x_j = \gamma \sum_{k \in A_j} \sigma_k^{i,j}\, e_{s_k}\, [\nabla f(\hat x_k)]_{s_k}.$
The above equation implies that the difference between a “fake” iterate at time $i$ and the value that was read at time $j$ can be expressed as a linear combination of any coordinate updates that occurred during the time interval defined by $A_j$.
From Eq. (4.1) we see that $\|x_i - \hat x_j\|$, for any $i \in A_j$, can be upper bounded in terms of the magnitude of the coordinate updates that occur in $A_j$. Since these updates are coordinates of the true gradient, we can use their norm to bound the size of $\|x_i - \hat x_j\|$. This will be useful towards bounding $\mathbb{E}\,\|\hat x_j - x_j\|^2$. Moreover, Lemma 6 shows that the magnitude of the gradient steps can be upper bounded in terms of the size of the mismatches. This will in turn be useful in bounding $\mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2$. The above observations are fundamental to our approach. The following lemma makes the above ideas explicit.
Lemma 7.
For any $j$ and any $i \in A_j$, we have
(4.2) $\mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2 \le \frac{2L^2}{d}\, \big(\mathbb{E}\,\|x_i - \hat x_j\|^2 + a_i\big),$
(4.3) $\mathbb{E}\,\|x_i - \hat x_j\|^2 \le (2\tau + 1)\, \gamma^2 \sum_{k \in A_j} \mathbb{E}\,[\nabla f(\hat x_k)]_{s_k}^2.$
Proof.
The first inequality is a consequence of Lemma 6, since $\mathbb{E}_{s_j}\,[\nabla f(\hat x_j)]_{s_j}^2 = \frac{1}{d}\, \|\nabla f(\hat x_j)\|^2$. For the second, as mentioned previously, we have $\sigma_k^{i,j} = 0$ when $k \notin A_j$. Hence,
$$\mathbb{E}\,\|x_i - \hat x_j\|^2 \le \gamma^2\, \mathbb{E}\Big(\sum_{k \in A_j} \big|[\nabla f(\hat x_k)]_{s_k}\big|\Big)^2 \le (2\tau + 1)\, \gamma^2 \sum_{k \in A_j} \mathbb{E}\,[\nabla f(\hat x_k)]_{s_k}^2,$$
where the first inequality follows due to Jensen's inequality, and the last inequality uses the bound $|A_j| \le 2\tau + 1$. ∎
Remark 3.
Let us now define for simplicity $G = \max_{0 \le j \le T} \mathbb{E}\,\|\nabla f(\hat x_j)\|^2$. Observe that all gradient norms can be bounded as
$$\mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2 \le \frac{G}{d},$$
a property that we will use in our bounds. Observe that $G \le L^2\, \hat a$ and $a_i \le 2\, \hat a + 2\, \mathbb{E}\,\|\hat x_i - x_i\|^2$. To obtain bounds for our first two error terms, $\mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2$ and $\mathbb{E}\,\|\hat x_j - x_j\|^2$, we will expand the recursive relations that are implied by Lemma 7. As shown in § B.1 of the Appendix, we obtain the following bounds.
Lemma 8.
Let $c$ be a sufficiently large absolute constant and set $\gamma \le \frac{\sqrt{d}}{c\, \tau L}$, for any $j$ and any $i \in A_j$. Then,
$$\mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2 \le O(1)\, \frac{L^2}{d}\, \hat a \quad \text{and} \quad \mathbb{E}\,\|x_i - \hat x_j\|^2 \le O(1)\, \gamma^2 \tau^2\, \frac{L^2}{d}\, \hat a.$$
The Cauchy-Schwarz inequality implies the bound $\mathbb{E}\,\langle \hat x_j - x_j,\, e_{s_j} [\nabla f(\hat x_j)]_{s_j}\rangle \le \big(\mathbb{E}\,\|\hat x_j - x_j\|^2\big)^{1/2} \big(\mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2\big)^{1/2}$. Unfortunately, this approach yields a result that can only guarantee convergence up to a factor of $\sqrt{d}$ slower than serial SCD. This happens because upper bounding the inner product by the product of norms disregards the extreme sparsity of $e_{s_j} [\nabla f(\hat x_j)]_{s_j}$. The next lemma uses a slightly more involved argument to bound $\mathbb{E}\,\langle \hat x_j - x_j,\, e_{s_j} [\nabla f(\hat x_j)]_{s_j}\rangle$ exploiting the sparsity of the gradient update. The proof can be found in Appendix B.1.
Lemma 9.
Let $\gamma$ be as in Lemma 8 and $\tau \le \sqrt{d}$. Then,
$$\mathbb{E}\,\langle \hat x_j - x_j,\, e_{s_j} [\nabla f(\hat x_j)]_{s_j}\rangle \le O(1)\, \gamma\, \tau\, \frac{L^2}{d^{3/2}}\, \hat a.$$
4.1.1 Putting it all together
We can now plug the upper bounds on $\mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2$, $\mathbb{E}\,\|\hat x_j - x_j\|^2$, and $\mathbb{E}\,\langle \hat x_j - x_j,\, e_{s_j} [\nabla f(\hat x_j)]_{s_j}\rangle$ into our perturbed iterate recursive formula
$$a_{j+1} \le \Big(1 - \frac{\gamma m}{d}\Big) a_j + \gamma^2\, \mathbb{E}\,[\nabla f(\hat x_j)]_{s_j}^2 + \frac{2\gamma m}{d}\, \mathbb{E}\,\|\hat x_j - x_j\|^2 + 2\gamma\, \mathbb{E}\,\langle \hat x_j - x_j,\, e_{s_j} [\nabla f(\hat x_j)]_{s_j}\rangle$$
to find that ASCD satisfies
$$a_{j+1} \le \Big(1 - \frac{\gamma m}{d}\Big) a_j + O(1)\, \gamma^2\, \frac{L^2}{d} \Big(1 + \frac{\gamma m \tau^2}{d} + \frac{\tau}{\sqrt{d}}\Big)\, \hat a.$$
Observe that in the serial case of SCD the last two error terms are zero and $\hat x_j = x_j$. By applying the Lipschitz assumption on $\nabla f$, we get $\mathbb{E}\,[\nabla f(x_j)]_{s_j}^2 = \frac{1}{d}\, \mathbb{E}\,\|\nabla f(x_j)\|^2 \le \frac{L^2}{d}\, a_j$, and obtain the simple recursive formula
(4.4) $a_{j+1} \le \Big(1 - \frac{\gamma m}{d} + \frac{\gamma^2 L^2}{d}\Big)\, a_j.$
To guarantee that ASCD follows the same recursion, i.e., that it has the same convergence rate as the one implied by Eq. (4.4), we require that
$$O(1)\, \gamma^2\, \frac{L^2}{d} \Big(\frac{\gamma m \tau^2}{d} + \frac{\tau}{\sqrt{d}}\Big) \le c\, \gamma^2\, \frac{L^2}{d},$$
where $c$ is a constant. Solving for $\tau$ we get
$$\tau \le c'\, \sqrt{d},$$
where $c'$ is some absolute constant. For $\tau = O(\sqrt{d})$, the $\frac{\gamma m \tau^2}{d}$ term in the recursive bound becomes $O(1)$,
where we used the inequality $m \le L$. Hence, ASCD satisfies