1 Introduction
Consider the problem of estimating an $n\times n$ covariance matrix $\Sigma$ (or its inverse $\Sigma^{-1}$) of an $n$-variate probability distribution from $N$ independently and identically distributed samples drawn from that distribution. In applications spanning computer vision, natural language processing, and economics [24, 25, 10], the matrix $\Sigma^{-1}$ is often sparse, meaning that most of its elements are zero. For Gaussian distributions, the statistical interpretation of sparsity in $\Sigma^{-1}$ is that most of the variables are pairwise conditionally independent [27, 45, 14, 5]. Imposing sparsity upon $\Sigma^{-1}$ regularizes the associated estimation problem and can greatly reduce the number of samples required. This is particularly important in high-dimensional settings where $n$ is large, often significantly larger than the number of samples $N$. One popular approach regularizes the associated maximum likelihood estimation (MLE) problem by a sparsity-promoting term, as in
(1) $\hat{X} = \arg\min_{X\succ 0}\ \mathrm{tr}(CX) - \log\det X + \lambda\textstyle\sum_{i,j}|X_{ij}|.$
Here, $C$ is the sample covariance matrix computed about the sample mean, and $\hat{X}$ is the resulting estimator for $\Sigma^{-1}$. This approach, commonly known as the graphical lasso [14], is known to enjoy a number of statistical guarantees [37, 34], some of which are direct extensions of earlier work on the classical lasso [31, 29, 42, 22]. A variation on this theme is to impose the penalty only on the off-diagonal elements of $X$, or to place different weights $\lambda_{ij}$ on the elements of the matrix, as in the classical weighted lasso.
While the regularized problem (1) is technically convex, it is commonly considered intractable for large-scale datasets. The decision variable is an $n\times n$ matrix, so simply fitting all $O(n^2)$ variables into memory is already a significant issue. General-purpose algorithms have either prohibitively high complexity or slow convergence. In practice, (1) is solved using problem-specific algorithms. The state-of-the-art include GLASSO [14], QUIC [20], and its "big-data" extension BIG-QUIC [21]. These algorithms have per-iteration time and memory costs ranging from linear to cubic in $n$, but the number of iterations needed to converge to an accurate solution can be very large.
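To make the optimization target concrete, the quantity minimized in (1) can be evaluated in a few lines. The following is a minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def glasso_objective(X, C, lam):
    """Graphical lasso objective: tr(CX) - log det X + lam * sum_ij |X_ij|.

    X must be symmetric positive definite; C is the sample covariance."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return np.trace(C @ X) - logdet + lam * np.abs(X).sum()

# Sanity check: at X = I with C = I and lam = 0, the objective is tr(I) = n.
n = 3
val = glasso_objective(np.eye(n), np.eye(n), 0.0)  # -> 3.0
```

Using `slogdet` rather than `det` avoids overflow for large, well-conditioned matrices, which matters at the problem sizes discussed here.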
1.1 Graphical lasso, soft-thresholding, and MDMC
The high practical cost of graphical lasso has inspired a number of heuristics, which enjoy fewer guarantees but are significantly cheaper to use. Indeed, heuristics are often the only viable option once $n$ exceeds the order of a few tens of thousands. One simple idea is to threshold the sample covariance matrix $C$: to examine all of its elements and keep only the ones whose magnitudes exceed some threshold. Thresholding can be fast—even for very large-scale datasets—because it is embarrassingly parallel; its quadratic total work can be spread over thousands or millions of parallel processors, in a GPU or distributed on cloud servers. When the number of samples is small, i.e. $N\ll n$, thresholding can also be performed using $O(Nn)$ memory, by working directly with the centered matrix-of-samples $A$ satisfying $C = \frac{1}{N}A^TA$.
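As a concrete sketch, the elementwise soft-thresholding of a sample covariance matrix can be written as follows (a minimal NumPy sketch; the function name is ours, and the diagonal is left untouched, as in the construction used later in the paper):

```python
import numpy as np

def soft_threshold_cov(C, lam):
    """Soft-threshold the off-diagonal entries of a covariance matrix.

    Shrinks each off-diagonal entry toward zero by lam, zeroing it when
    |C_ij| <= lam; the diagonal is kept intact."""
    out = np.sign(C) * np.maximum(np.abs(C) - lam, 0.0)
    np.fill_diagonal(out, np.diag(C))
    return out

# Tiny example: entries with magnitude <= lam vanish, others shrink by lam.
C = np.array([[2.0,  0.3, -0.05],
              [0.3,  1.5,  0.6],
              [-0.05, 0.6, 1.0]])
Ct = soft_threshold_cov(C, 0.1)   # Ct[0,2] becomes 0, Ct[0,1] becomes 0.2
```

Because the operation is elementwise, it parallelizes trivially across blocks of $C$, which is the basis of the low-memory routine described in Section 4.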
In a recent line of work [26, 39, 12, 13], the simple heuristic of thresholding was shown to enjoy some surprising guarantees. In particular, [39, 12] proved that when the lasso penalty is imposed over only the off-diagonal elements of $X$, then—under some assumptions—the sparsity pattern of the associated graphical lasso estimator can be recovered by performing a soft-thresholding operation on $C$, as in
(2) $[C_\lambda]_{ij} = \begin{cases} C_{ij} & i=j,\\ \mathrm{sign}(C_{ij})\max\{|C_{ij}|-\lambda,\,0\} & i\neq j,\end{cases}$
and recovering the sparsity pattern of the soft-thresholded matrix, as in

(3) $\mathcal{G} = \{(i,j) : i\neq j,\ [C_\lambda]_{ij}\neq 0\}.$
The associated graph (also denoted $\mathcal{G}$ when there is no ambiguity) is obtained by viewing each off-diagonal nonzero element $(i,j)$ as an edge between the $i$-th and $j$-th vertices in an undirected graph on $n$ nodes. Moreover, they showed that the estimator $\hat{X}$ can be recovered by solving a version of (1) in which the sparsity pattern is explicitly imposed, as in
(4) $\hat{X} = \arg\min_{X\succ 0}\ \mathrm{tr}(C_\lambda X) - \log\det X$
subject to $X_{ij} = 0$ for all $(i,j)\notin\mathcal{G}$ with $i\neq j$.
Recovering the exact value of $\hat{X}$ (and not just its sparsity pattern) is important because it provides a shrinkage MLE when the true MLE is ill-defined; for Gaussian fields, its nonzero values encode the partial correlations between variables. Problem (4) is named the maximum determinant matrix completion (MDMC) in the literature, for reasons explained below. The problem has a recursive closed-form solution whenever the graph of $C_\lambda$ is acyclic (i.e. a tree or forest) [12], or more generally, if it is chordal [13]. It is worth emphasizing that the closed-form solution is extremely fast to evaluate: a chordal example in [13] with 13,659 variables took just a few seconds to solve on a laptop computer.
The assumptions needed for graphical lasso to be equivalent to thresholding are hard to check but relatively mild. Indeed, [12] proves that they are automatically satisfied whenever $\lambda$ is sufficiently large relative to the sample covariance matrix. Their numerical study found "sufficiently large" to be a fairly loose criterion in practice, particularly in view of the fact that large values of $\lambda$ are needed to induce a sufficiently sparse estimate of $\Sigma^{-1}$, e.g. one with only a small multiple of $n$ nonzero elements.
However, the requirement for the graph $\mathcal{G}$ to be chordal is very strong. Aside from trivial chordal graphs like trees and cliques, thresholding will produce a chordal graph with probability zero. When $\mathcal{G}$ is nonchordal, no closed-form solution exists, and one must resort to an iterative algorithm. The state-of-the-art for nonchordal MDMC is to embed the nonchordal graph within a chordal graph, and to solve the resulting problem as a semidefinite program using an interior-point method [8, 2].
1.2 Main results
The purpose of this paper is twofold. First, we derive an extension of the guarantees of [26, 39, 12, 13] to a slightly more general version of the problem that we call restricted graphical lasso (RGL):
(5) $\hat{X} = \arg\min_{X\succ 0}\ \mathrm{tr}(CX) - \log\det X + \textstyle\sum_{i\neq j}\lambda_{ij}|X_{ij}|$
subject to $X_{ij} = 0$ for all $(i,j)\notin\mathcal{H}$.
In other words, RGL is (1) penalized by a weighted lasso penalty on the off-diagonals, with an a priori sparsity pattern $\mathcal{H}$ imposed as an additional constraint. We use the sparsity pattern $\mathcal{H}$ to incorporate prior information on the structure of the graphical model. For example, if the sample covariance is collected over a graph, such as a communication system or a social network, then faraway variables can be assumed to be pairwise conditionally independent [33, 19, 7]. Including these neighborhood relationships in $\mathcal{H}$ can regularize the statistical problem, as well as reduce the numerical cost of a solution.
In Section 2, we describe a procedure to transform RGL (5) into MDMC (4), in the same style as prior results by [12, 13] for graphical lasso. More specifically, we soft-threshold the sample covariance matrix and then project the result onto the sparsity pattern $\mathcal{H}$. We give conditions for the resulting sparsity pattern to be equivalent to the one obtained by solving (5). Furthermore, we prove that the resulting estimator can be recovered by solving the same MDMC problem (4) with the input matrix appropriately modified.
The second purpose is to describe an efficient algorithm to solve MDMC when the graph $\mathcal{G}$ is nonchordal, based on the chordal embedding approach of [8, 2, 4]. We embed $\mathcal{G}$ within a chordal graph $\tilde{\mathcal{G}}$, resulting in a convex optimization problem over $\mathbb{S}^n_{\tilde{\mathcal{G}}}$, the space of real symmetric matrices with sparsity pattern $\tilde{\mathcal{G}}$. This way, the sparsity constraint is implicitly imposed, meaning that we simply ignore the nonzero elements not in $\tilde{\mathcal{G}}$. Next, we solve an optimization problem over $\mathbb{S}^n_{\tilde{\mathcal{G}}}$ using a custom Newton-CG method (the MATLAB source code for our solver can be found at http://alum.mit.edu/www/ryz). The main idea is to use an inner conjugate gradients (CG) loop to solve the Newton subproblem of an outer Newton's method. The actual algorithm has a number of features designed to exploit problem structure, including the sparse chordal property of $\tilde{\mathcal{G}}$, duality, and the ability of CG and Newton to converge superlinearly; these are outlined in Section 3.
Assuming that the chordal embedding is sparse with $O(n)$ nonzero elements, we prove in Section 3.4 that our algorithm converges to an $\epsilon$-accurate solution of MDMC (4) in

(6) $O(n\log(1/\epsilon))$ time and $O(n)$ memory.
Most importantly, the algorithm is highly efficient in practice. In Section 4, we present computational results on a suite of test cases, considering both synthetic and real-life graphs. Using our approach, we solve sparse inverse covariance estimation problems containing as many as 200,000 variables in less than an hour on a laptop computer.
1.3 Related Work
Graphical lasso with prior information. A number of approaches are available in the literature to introduce prior information into graphical lasso. The weighted version of graphical lasso mentioned before is an example, though RGL will generally be more efficient to solve due to the reduction in the number of variables. [11] introduced a class of graphical lasso problems in which the true graphical model is assumed to have Laplacian structure; this structure commonly appears in signal and image processing [28]. For the a priori graph-based correlation structure described above, [17] introduced a pathway graphical lasso method similar to RGL.
Algorithms for graphical lasso. Algorithms for graphical lasso are usually based on some mixture of Newton [32], proximal Newton [21, 20], iterative thresholding [35], and (block) coordinate descent [14, 40]. All of these suffer fundamentally from the need to keep track of and act on all $O(n^2)$ elements in the matrix decision variable. Even if the final solution matrix is sparse, it is still possible for the algorithm to traverse a "dense region" in which the iterate must be fully dense. Thresholding heuristics have been proposed to address this issue, but these may adversely affect the outer algorithm and prevent convergence. It is generally impossible to guarantee a figure lower than $O(n^2)$ time per iteration, even if the solution contains only $O(n)$ nonzeros. Most of the algorithms mentioned above actually have worst-case per-iteration costs of $O(n^3)$.
Graphical lasso via thresholding. The elementary estimator for graphical models (EEGM) [43] is another thresholding-based low-complexity method that is able to recover the actual graphical lasso estimator. Both EEGM and our algorithm have a similar level of performance in practice, because both algorithms are bottlenecked by the initial thresholding step, which is a quadratic-time operation.
Algorithms for MDMC. Our algorithm is inspired by a line of results [8, 2, 4, 23] for minimizing the log-det penalty on chordal sparsity patterns, culminating in the CVXOPT package [3]. These are Newton algorithms that solve the Newton subproblem by explicitly forming and factoring the fully-dense Newton matrix; the associated per-iteration time and memory costs grow polynomially with $m$, the number of edges added to $\mathcal{G}$ to yield the chordal $\tilde{\mathcal{G}}$. In practice, $m$ is usually a factor of 0.1 to 20 times the number of edges in $\mathcal{G}$, so these algorithms are effectively cubic time and memory. Our algorithm instead solves the Newton subproblem iteratively using CG. We prove that CG requires just linear time to compute the Newton direction to machine precision (see Section 3.4). In practice, CG converges much faster than its worst-case bound, because it is able to exploit eigenvalue clustering to achieve superlinear convergence.
Notations
Let $\mathbb{R}^n$ be the set of $n$-dimensional real vectors, and $\mathbb{S}^n$ be the set of $n\times n$ real symmetric matrices. (We denote vectors using lowercase, matrices using uppercase, and index the $(i,j)$-th element of $X$ as $X_{ij}$.) We endow $\mathbb{S}^n$ with the usual matrix inner product and Euclidean (i.e. Frobenius) norm. Let $\mathbb{S}^n_+$ and $\mathbb{S}^n_{++}$ be the associated sets of positive semidefinite and positive definite matrices. We will frequently write $X\succeq 0$ to mean $X\in\mathbb{S}^n_+$ and write $X\succ 0$ to mean $X\in\mathbb{S}^n_{++}$. Given a sparsity pattern $\mathcal{G}$, we define $\mathbb{S}^n_{\mathcal{G}}$ as the set of real symmetric matrices with this sparsity pattern.

2 Restricted graphical lasso, soft-thresholding, and MDMC
Let $\mathrm{P}_{\mathcal{H}}(\cdot)$ denote the projection operator from $\mathbb{S}^n$ onto $\mathbb{S}^n_{\mathcal{H}}$, i.e. the operator that sets $X_{ij}=0$ for all $(i,j)\notin\mathcal{H}$. Let $C_\lambda$ be the sample covariance matrix with its elements individually soft-thresholded by the weights $\lambda_{ij}$, as in

(7) $[C_\lambda]_{ij} = \begin{cases} C_{ij} & i=j,\\ \mathrm{sign}(C_{ij})\max\{|C_{ij}|-\lambda_{ij},\,0\} & i\neq j.\end{cases}$
In this section, we state the conditions for $\mathrm{P}_{\mathcal{H}}(C_\lambda)$—the projection of the soft-thresholded matrix in (7) onto $\mathbb{S}^n_{\mathcal{H}}$—to have the same sparsity pattern as the RGL estimator in (5). Furthermore, the estimator can be explicitly recovered by solving the MDMC problem (4) with the input matrix and sparsity pattern modified accordingly. For brevity, all proofs and remarks are omitted; they can be found in the appendix.
Before we state the exact conditions, we begin by adopting some definitions and notation from the literature.
Definition 1.
[12] Given a matrix $A$, define $\mathcal{G}(A)$ as its sparsity pattern. Then $A$ is called inverse-consistent if there exists a matrix $B$ such that

(8a) $B_{ij} = 0$ for every $(i,j)\in\mathcal{G}(A),$
(8b) $A + B \succ 0,$
(8c) $(A+B)^{-1} \in \mathbb{S}^n_{\mathcal{G}(A)}.$

The matrix $B$ is called an inverse-consistent complement of $A$ and is denoted by $A^{(c)}$. Furthermore, $A$ is called sign-consistent if for every $(i,j)\in\mathcal{G}(A)$, the $(i,j)$-th elements of $A$ and $(A+A^{(c)})^{-1}$ have opposite signs.
Moreover, we take the usual matrix max-norm to exclude the diagonal, as in $\|A\|_{\max} = \max_{i\neq j}|A_{ij}|$, and adopt the function $\beta(\mathcal{G},\alpha)$, defined with respect to a sparsity pattern $\mathcal{G}$ and a scalar $\alpha$ as the largest max-norm of the inverse-consistent complement $M^{(c)}$, taken over all inverse-consistent matrices $M$ with unit diagonal, sparsity pattern contained in $\mathcal{G}$, and $\|M\|_{\max}\le\alpha$.
We are now ready to state the conditions for softthresholding to be equivalent to RGL.
Theorem 2.
Proof.
See Appendix A. ∎
Theorem 2 leads to the following corollary, which asserts that the optimal solution of RGL can be obtained by maximum determinant matrix completion: computing the matrix with the largest determinant that "fills in" the zero elements of the thresholded and projected matrix.
Corollary 3.
Proof.
3 Proposed Algorithm
This section describes an efficient algorithm to solve MDMC (4) when the sparsity pattern $\mathcal{G}$ is nonchordal. If we assume that the input matrix is sparse, and that sparse Cholesky factorization can be performed in $O(n)$ time, then our algorithm is guaranteed to compute an $\epsilon$-accurate solution in $O(n\log(1/\epsilon))$ time and $O(n)$ memory.
The algorithm is fundamentally a Newton-CG method, i.e. Newton's method in which the Newton search directions are computed using conjugate gradients (CG). It is developed from four key insights:
1. Chordal embedding is easy via sparse matrix heuristics. State-of-the-art algorithms for (4) begin by computing a chordal embedding $\tilde{\mathcal{G}}$ for $\mathcal{G}$. The optimal chordal embedding with the fewest nonzeros is NP-hard to compute, but a good-enough embedding with a modest number of nonzeros is sufficient for our purposes. Computing a good $\tilde{\mathcal{G}}$ with few nonzeros is exactly the same problem as finding a sparse Cholesky factorization with low fill-in. Using heuristics developed for numerical linear algebra, we are able to find sparse chordal embeddings for graphs containing millions of edges and hundreds of thousands of nodes in seconds.
2. Optimize directly on the sparse matrix cone. Using log-det barriers for sparse matrix cones [8, 2, 4, 41], we can optimize directly in the space $\mathbb{S}^n_{\tilde{\mathcal{G}}}$, while ignoring all matrix elements outside of $\tilde{\mathcal{G}}$. If $\tilde{\mathcal{G}}$ is sparse, then only a small number of decision variables must be explicitly optimized. Moreover, each function evaluation, gradient evaluation, and matrix-vector product with the Hessian can be performed in linear time, using the numerical recipes in [4].
3. The dual is easier to solve than the primal. The primal problem starts with a feasible point and seeks to achieve first-order optimality. The dual problem starts with an infeasible optimal point satisfying first-order optimality, and seeks to make it feasible. Feasibility is easier to achieve than optimality, so the dual problem is easier to solve than the primal.
4. Conjugate gradients converges in a bounded number of iterations. Under the same conditions that allow Theorem 2 to work, our main result (Theorem 6) bounds the condition number of the Newton subproblem by a constant independent of the problem dimension $n$ and the accuracy of the current iterate. It is therefore cheaper to solve this subproblem to machine precision using CG than it is to solve it directly using Cholesky factorization [8, 2, 4]. Moreover, CG is an optimal Krylov subspace method, and as such, it is often able to exploit clustering in the eigenvalues to converge superlinearly. Finally, computing the Newton direction to high accuracy allows the outer Newton method to converge quadratically.
The remainder of this section describes each consideration in further detail. We state the algorithm explicitly in Section 3.5.
3.1 Efficient chordal embedding
Following [8], we begin by reformulating (4) into a sparse chordal matrix program

(12) minimize $\mathrm{tr}(C_\lambda X) - \log\det X$ over $X\in\mathbb{S}^n_{\tilde{\mathcal{G}}}$, $X\succ 0$,
subject to $X_{ij} = 0$ for all $(i,j)\in\tilde{\mathcal{G}}\setminus\mathcal{G}$,

in which $\tilde{\mathcal{G}}\supseteq\mathcal{G}$ is a chordal embedding for $\mathcal{G}$: a sparsity pattern whose graph contains no induced cycles of length greater than three. This can be implemented using standard algorithms for large-and-sparse linear equations, due to the following result.
Proposition 4.
Let $A$ be a positive definite matrix with sparsity pattern $\mathcal{G}$. Compute its unique lower-triangular Cholesky factor $L$ satisfying $A = LL^T$. Ignoring perfect numerical cancellation, the sparsity pattern of $L + L^T$ is a chordal embedding $\tilde{\mathcal{G}}\supseteq\mathcal{G}$.
Note that $\tilde{\mathcal{G}}$ can be determined directly from $\mathcal{G}$ using a symbolic Cholesky algorithm, which simulates the steps of Gaussian elimination using Boolean logic. Moreover, we can substantially reduce the number of edges added to $\mathcal{G}$ by reordering the columns and rows of $A$ using a fill-reducing ordering.
Corollary 5.
Let $P$ be a permutation matrix. For the same $A$ in Proposition 4, compute the unique lower-triangular Cholesky factor $L$ satisfying $PAP^T = LL^T$. Ignoring perfect numerical cancellation, the sparsity pattern of $P^T(L + L^T)P$ is a chordal embedding $\tilde{\mathcal{G}}\supseteq\mathcal{G}$.
The problem of finding the best choice of $P$ is known as the fill-minimizing problem, and is NP-complete [44]. However, good orderings are easily found using heuristics developed for numerical linear algebra, like minimum degree ordering [15] and nested dissection [16, 1]. In fact, [16] proved that nested dissection is at most a constant factor from optimal for bounded-degree graphs, and notes that "we do not know a class of graphs for which [nested dissection is suboptimal] by more than a constant factor."
If $\mathcal{G}$ admits sparse chordal embeddings, then a good-enough $\tilde{\mathcal{G}}$ will usually be found using minimum degree or nested dissection. In MATLAB, the minimum degree ordering and symbolic factorization steps can be performed in two lines of code; see the snippet in Figure 1.
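In the spirit of the two-line MATLAB snippet, the construction in Proposition 4 can be sketched in Python on a small dense example (our illustrative sketch; a real implementation would use a fill-reducing ordering and a symbolic factorization on sparse data structures):

```python
import numpy as np

def chordal_embedding(pattern):
    """Chordal embedding of a symmetric sparsity pattern via Cholesky fill-in.

    Builds a positive definite matrix with the given pattern, factors it as
    L @ L.T, and returns the pattern of L + L.T (cf. Proposition 4)."""
    n = pattern.shape[0]
    A = pattern.astype(float) + n * np.eye(n)   # diagonally dominant, hence PD
    L = np.linalg.cholesky(A)
    return (np.abs(L) + np.abs(L.T)) > 1e-12

# A 4-cycle 1-2-3-4-1 is nonchordal; eliminating one node adds a chord.
P = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]], dtype=bool)
Pt = chordal_embedding(P)   # Pt gains the chord (1, 3)
```

The returned pattern always contains the original one, and its graph is chordal by construction, since it is the fill pattern of a Gaussian elimination.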
3.2 Logarithmic barriers for sparse matrix cones
Define the cone of sparse positive semidefinite matrices and the cone of sparse matrices with positive semidefinite completions as

$\mathcal{K} = \{X\in\mathbb{S}^n_{\tilde{\mathcal{G}}} : X\succeq 0\}, \qquad \mathcal{K}_* = \{\mathrm{P}_{\tilde{\mathcal{G}}}(Y) : Y\succeq 0\}.$

Then (12) can be posed as the primal-dual pair:

(13) minimize $\mathrm{tr}(C_\lambda X) - \log\det X$ over $X$ in the interior of $\mathcal{K}$, subject to $X_{ij}=0$ for all $(i,j)\in\tilde{\mathcal{G}}\setminus\mathcal{G}$;
(14) maximize the corresponding dual barrier objective over the free variables $y$, one for each index pair in $\tilde{\mathcal{G}}\setminus\mathcal{G}$;

in which the primal and dual objectives are built from the "log-det" barrier functions on $\mathcal{K}$ and $\mathcal{K}_*$ introduced by [8, 2, 4]. The linear map $\mathcal{A}(y)$ converts the list of variables $y$ into the corresponding matrix in $\mathbb{S}^n_{\tilde{\mathcal{G}}}$. The gradients of the barrier functions are simply the projections of their usual values onto $\mathbb{S}^n_{\tilde{\mathcal{G}}}$; for example, the gradient of $-\log\det X$ is $-\mathrm{P}_{\tilde{\mathcal{G}}}(X^{-1})$. Given any $S$ in the interior of $\mathcal{K}_*$, let $W$ be the unique matrix in $\mathcal{K}$ satisfying $\mathrm{P}_{\tilde{\mathcal{G}}}(W^{-1}) = S$; the gradient of the dual barrier at $S$ is then obtained from $W$. Assuming that $\tilde{\mathcal{G}}$ is sparse and chordal, all of these operations can be efficiently evaluated in linear time and memory, using the numerical recipes described in [4].
3.3 Solving the dual problem
Our algorithm actually solves the dual problem (14), which can be rewritten as an unconstrained optimization problem

(15) $\hat{y} = \arg\min_{y}\ f(y),$

where $f$ is a smooth convex function built from the dual log-det barrier. After the solution $\hat{y}$ is found, we can recover the optimal estimator for the primal problem from the gradient of the dual barrier at $\hat{y}$. The dual problem (14) is easier to solve than the primal (13) because the origin $y = 0$ often lies very close to the solution $\hat{y}$. To see this, note that $y = 0$ produces a candidate estimator that solves the chordal matrix completion problem posed over $\tilde{\mathcal{G}}$, which is a relaxation of the nonchordal problem posed over $\mathcal{G}$. As observed by several previous authors [8], this relaxation is a high-quality guess, and is often "almost feasible" for the original nonchordal problem posed over $\mathcal{G}$. Some simple algebra shows that the gradient $\nabla f(0)$ evaluated at the origin is small whenever this near-feasibility holds, so the origin is then close to optimal. Starting from this point, we can expect Newton's method to rapidly converge at a quadratic rate.
3.4 CG converges in a bounded number of iterations
The most computationally expensive part of Newton's method is the solution of the Newton direction $\Delta y$ via the system of equations

(16) $\nabla^2 f(y)\,\Delta y = -\nabla f(y).$

The Hessian matrix $\nabla^2 f(y)$ is fully dense, but matrix-vector products with it take linear time using the algorithms in Section 3.2. This insight motivates solving (16) using an iterative Krylov subspace method like conjugate gradients (CG), which is a matrix-free method that requires a single matrix-vector product with the Hessian at each iteration [6]. Starting from the origin, the method converges to a search direction of accuracy $\epsilon$ in at most

(17) $O\big(\sqrt{\kappa}\,\log(1/\epsilon)\big)$ iterations,

where $\kappa$ is the condition number of the Hessian matrix [18, 38]. In many important convex optimization problems, the condition number grows without bound as the outer Newton iterates approach a small neighborhood of the true solution. As a consequence, Newton-CG methods typically require a rapidly growing number of CG iterations as they converge.
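The dependence of CG on the condition number is easy to see empirically. The following self-contained sketch (ours, not the paper's solver) runs plain CG on two diagonal systems and counts iterations:

```python
import numpy as np

def cg_iterations(H, b, tol=1e-8, maxiter=10000):
    """Plain conjugate gradients for H x = b; returns (x, iteration count)."""
    x = np.zeros_like(b)
    r = b.copy()                  # residual b - H @ x at x = 0
    p = r.copy()
    rs = r @ r
    bnorm = np.linalg.norm(b)
    for k in range(1, maxiter + 1):
        Hp = H @ p
        alpha = rs / (p @ Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * bnorm:
            return x, k
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, maxiter

rng = np.random.default_rng(0)
n = 200
b = rng.standard_normal(n)
H_good = np.diag(np.linspace(1.0, 4.0, n))    # condition number 4
H_bad = np.diag(np.linspace(1.0, 1e4, n))     # condition number 1e4
x_good, k_good = cg_iterations(H_good, b)
x_bad, k_bad = cg_iterations(H_bad, b)
# k_good is far smaller than k_bad, consistent with the sqrt(kappa) bound.
```

The well-conditioned system converges in a handful of iterations regardless of $n$, which is exactly the regime that Theorem 6 guarantees for the MDMC Newton subproblem.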
It is therefore surprising that we are able to bound $\kappa$ globally for the MDMC problem. Below, we state our main result, which says that the condition number depends polynomially on the problem data and the quality of the initial point, but is independent of the problem dimension $n$ and the accuracy of the current iterate.
Theorem 6.
At any $y$ satisfying a descent condition and a proximity condition (made precise in Appendix B), the condition number $\kappa$ of the Hessian matrix $\nabla^2 f(y)$ admits a bound (18) that depends polynomially on the following quantities, but not on $n$ or the accuracy of $y$:

- the suboptimality of the initial point,
- the vectorized version of the problem data,
- the initial primal-dual pair,
- the solution primal-dual pair.
Proof.
See Appendix B. ∎
Remark 7.
Newton's method is a descent method, so its iterates trivially satisfy the descent condition. Technically, the proximity condition can be guaranteed by enclosing Newton's method within an outer auxiliary path-following loop; see Section 4.3.5 of [30]. In practice, naive Newton's method will usually satisfy the condition on its own; see our numerical experiments in Section 4.
Applying Theorem 6 to (17) shows that CG solves each Newton subproblem to machine accuracy in a bounded number of iterations. Multiplying this figure by the number of Newton steps needed to converge yields a global iteration bound, and multiplying by the cost of each CG iteration proves the claimed time complexity in (6). In practice, CG typically converges much faster than this worst-case bound, due to its ability to exploit the clustering of eigenvalues in the Hessian; see [18, 38]. Moreover, accurate Newton directions are only needed to guarantee quadratic convergence close to the solution. During the initial Newton steps, we may loosen the error tolerance for CG for a significant speedup. Inexact Newton steps can be used to obtain a speedup of a factor of 2-3.
3.5 The full algorithm
To summarize, we begin by computing a chordal embedding $\tilde{\mathcal{G}}$ for the sparsity pattern $\mathcal{G}$ of the input matrix, using the code snippet in Figure 1. We use the embedding to reformulate (4) as (12), and solve the unconstrained problem (15) using Newton's method, starting at the origin $y = 0$. The function value, gradient, and Hessian matrix-vector products are all evaluated using the numerical recipes described by [4].
At each Newton step, we compute the Newton search direction using conjugate gradients. A loose tolerance is used when the Newton decrement is large, and a tight tolerance is used when the decrement is small, implying that the iterate is close to the true solution.
Once a Newton direction with a sufficiently large Newton decrement is computed, we use a backtracking line search to determine the step size $t$. In other words, we select the first instance of the sequence $1, \beta, \beta^2, \beta^3, \ldots$ that satisfies the Armijo-Goldstein condition

$f(y + t\,\Delta y) \le f(y) + \alpha\, t\, \nabla f(y)^T \Delta y,$

in which $\alpha$ and $\beta$ are the line search parameters fixed in our implementation. We complete the step and repeat the process until convergence.
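The line search can be sketched generically as follows (our sketch; the specific parameter values used in the paper's implementation are not reproduced here):

```python
import numpy as np

def backtracking(f, grad_f, y, dy, alpha=0.01, beta=0.5):
    """Armijo-Goldstein backtracking: return the first step size t in
    {1, beta, beta^2, ...} with f(y + t*dy) <= f(y) + alpha*t*<grad f(y), dy>."""
    t = 1.0
    fy = f(y)
    slope = grad_f(y) @ dy    # negative for a descent direction
    while f(y + t * dy) > fy + alpha * t * slope:
        t *= beta
    return t

# On a simple quadratic, the Newton direction dy = -y passes the full step.
f = lambda y: 0.5 * (y @ y)
grad = lambda y: y
y0 = np.array([3.0, -4.0])
t = backtracking(f, grad, y0, -grad(y0))  # -> 1.0
```

Accepting the full step $t = 1$ near the solution is what preserves the quadratic convergence of the outer Newton iteration.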
We terminate the outer Newton's method once the Newton decrement falls below a threshold. This implies either that the solution has been reached, or that CG is not converging to a good enough search direction to make significant progress. The associated estimator for $\Sigma^{-1}$ is then recovered from the final dual iterate, as described in Section 3.3.
4 Numerical Results
Finally, we benchmark our algorithm (the MATLAB source code for our solver can be found at http://alum.mit.edu/www/ryz) against QUIC [20], commonly considered the fastest solver for graphical lasso or RGL (QUIC was taken from http://bigdata.ices.utexas.edu/software/1035/). Another widely-used algorithm is GLASSO [14], but we found it to be significantly slower than QUIC. We consider two case studies. The first case study numerically verifies the claimed complexity of our MDMC algorithm on problems with a nearly-banded structure. The second case study performs the full threshold-MDMC procedure for graphical lasso and RGL, on graphs collected from real-life applications.
All experiments are performed on a laptop computer with an Intel Core i7 quad-core 2.50 GHz CPU and 16 GB RAM. The reported results are based on a serial implementation in MATLAB R2017b. Our Newton decrement threshold and QUIC's convergence threshold are set to the same small value.
We implemented the soft-thresholding step (7) as a serial routine that uses $O(Nn)$ memory, by taking the matrix-of-samples $A$ satisfying $C = \frac{1}{N}A^TA$ as input. The routine implicitly partitions $C$ into small square submatrices and iterates over the submatrices one at a time. For each submatrix, it explicitly forms the submatrix, thresholds it using dense linear algebra, and then stores the result as a sparse matrix.
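A serial sketch of this blocked routine is below (ours; the block size and names are illustrative). It forms one dense block of $C = \frac{1}{N}A^TA$ at a time, thresholds it, and stores only the sparse result:

```python
import numpy as np
from scipy import sparse

def threshold_from_samples(A, lam, block=256):
    """Soft-threshold C = (1/N) A^T A block by block, never forming C densely.

    A is the N-by-n centered matrix-of-samples; off-diagonals are shrunk by
    lam, diagonals are kept; returns a sparse CSR matrix."""
    N, n = A.shape
    rows = []
    for i0 in range(0, n, block):
        row = []
        for j0 in range(0, n, block):
            Cij = A[:, i0:i0+block].T @ A[:, j0:j0+block] / N
            Tij = np.sign(Cij) * np.maximum(np.abs(Cij) - lam, 0.0)
            if i0 == j0:  # restore the un-thresholded diagonal of C
                np.fill_diagonal(Tij, np.diag(Cij))
            row.append(sparse.csr_matrix(Tij))
        rows.append(row)
    return sparse.bmat(rows, format="csr")

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 30))
A -= A.mean(axis=0)            # center the samples
T = threshold_from_samples(A, 0.1, block=7)
```

Only the diagonal blocks need the diagonal restored, since the diagonal of $C$ lies entirely within them; each block computation is independent, which is what makes the routine embarrassingly parallel.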
4.1 Case Study 1: Banded Patterns
The first case study aims to verify the claimed complexity of our algorithm for MDMC. Here, we skip the proposed thresholding step, and focus solely on the MDMC problem (4). Each sparsity pattern is that of a corrupted banded matrix with fixed bandwidth. The off-diagonal nonzero elements of the input matrix are selected uniformly at random and then corrupted to zero with probability 0.3; the diagonal elements are fixed to a common positive value. Our numerical experiments fix the bandwidth and vary the number of variables $n$ from 1,000 to 200,000. A time limit of 2 hours is set for both algorithms. Figure 2a compares the running time of both algorithms. A log-log regression results in an empirical time complexity of roughly $O(n^{1.1})$ for our algorithm, and roughly quadratic for QUIC. The extra 0.1 in the exponent is most likely an artifact of our MATLAB implementation. In either case, QUIC's quadratic complexity limits it to much smaller problems, whereas our algorithm solves an instance with $n = 200{,}000$ in less than 33 minutes. The resulting solutions are extremely accurate, with negligibly small optimality and feasibility gaps.
4.2 Case Study 2: Real-Life Graphs
Table 1: Comparison of our Newton-CG solver against QUIC on real-life graphs (GL = graphical lasso, RGL = restricted graphical lasso). "sec", "gap", and "feas" give the Newton-CG running time, optimality gap, and feasibility gap; "sec" and "diff. gap" give the QUIC running time and relative objective difference. QUIC entries are blank where it did not finish within the time limit.

    #   file name           type   n        edges     edges/n | Newton-CG: sec   gap       feas    | QUIC: sec   diff. gap | speedup
    1   freeFlyingRobot7    GL     3918     20196     5.15    | 28.9             5.7e-17   2.3e-7  | 31.0        3.9e-4    | 1.07
    1   freeFlyingRobot7    RGL    3918     20196     5.15    | 12.1             6.5e-17   2.9e-8  | 38.7        3.8e-5    | 3.20
    2   freeFlyingRobot14   GL     5985     27185     4.56    | 23.5             5.4e-17   1.1e-7  | 78.3        3.8e-4    | 3.33
    2   freeFlyingRobot14   RGL    5985     27185     4.56    | 19.0             6.0e-17   1.7e-8  | 97.0        3.8e-5    | 5.11
    3   cryg10000           GL     10000    170113    17.0    | 17.3             5.9e-17   5.2e-9  | 360.3       1.5e-3    | 20.83
    3   cryg10000           RGL    10000    170113    17.0    | 18.5             6.3e-17   1.0e-7  | 364.1       1.9e-5    | 19.68
    4   epb1                GL     14734    264832    18.0    | 81.6             5.6e-17   4.3e-8  | 723.5       5.1e-4    | 8.86
    4   epb1                RGL    14734    264832    18.0    | 44.2             6.2e-17   3.3e-8  | 1076.4      4.2e-4    | 24.35
    5   bloweya             GL     30004    10001     0.33    | 295.8            5.6e-17   9.4e-9  |             |
    5   bloweya             RGL    30004    10001     0.33    | 75.0             5.5e-17   3.6e-9  |             |
    6   juba40k             GL     40337    18123     0.44    | 373.3            5.6e-17   2.6e-9  |             |
    6   juba40k             RGL    40337    18123     0.44    | 341.1            5.9e-17   2.7e-7  |             |
    7   bayer01             GL     57735    671293    11.6    | 2181.3           5.7e-17   5.2e-9  |             |
    7   bayer01             RGL    57735    671293    11.6    | 589.1            6.4e-17   1.0e-7  |             |
    8   hcircuit            GL     105676   58906     0.55    | 2732.6           5.8e-17   9.0e-9  |             |
    8   hcircuit            RGL    105676   58906     0.55    | 1454.9           6.3e-17   7.3e-8  |             |
    9   co2010              RGL    201062   1022633   5.08    | 4012.5           6.3e-17   4.6e-8  |             |
The second case study aims to benchmark the full thresholding-MDMC procedure for sparse inverse covariance estimation on real-life graphs. The actual graphs (i.e. the sparsity patterns) are chosen from the SuiteSparse Matrix Collection [9]—a publicly available dataset of large-and-sparse matrices collected from real-world applications. Our chosen graphs vary in size from $n = 3{,}918$ to $n = 201{,}062$, and are taken from applications in chemical processes, materials science, graph problems, optimal control and model reduction, thermal processes, and circuit simulations.
For each sparsity pattern, we design a corresponding ground-truth inverse covariance matrix $\Sigma^{-1}$ as follows. For each off-diagonal nonzero position, we select its value uniformly at random, and then corrupt it to zero with probability 0.3. Then, we set each diagonal element so that the matrix is positive definite. Using this $\Sigma^{-1}$, we generate $N$ samples i.i.d. from the corresponding zero-mean Gaussian distribution. This results in a sample covariance matrix $C$.
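This recipe can be sketched as follows (our sketch; the sampling bounds and the diagonal rule are illustrative stand-ins for the constants used in the experiments):

```python
import numpy as np

def make_problem(pattern, corrupt_p=0.3, N=1000, seed=0):
    """Ground-truth sparse inverse covariance on `pattern`, plus sample covariance.

    Off-diagonal entries are drawn uniformly and corrupted to zero with
    probability corrupt_p; the diagonal is set by diagonal dominance so that
    Theta is positive definite."""
    rng = np.random.default_rng(seed)
    n = pattern.shape[0]
    U = np.triu(pattern, 1) * rng.uniform(-1.0, 1.0, size=(n, n))
    U *= rng.random((n, n)) > corrupt_p           # corrupt ~30% to zero
    Theta = U + U.T
    Theta += np.eye(n) * (np.abs(Theta).sum(axis=1).max() + 1.0)
    Sigma = np.linalg.inv(Theta)
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / N
    return Theta, C

# Small example: a 5-node chain pattern (tridiagonal sparsity).
P = np.abs(np.subtract.outer(np.arange(5), np.arange(5))) <= 1
Theta, C = make_problem(P)
```

Diagonal dominance is a simple sufficient condition for positive definiteness; any rule guaranteeing $\Theta \succ 0$ would serve the same purpose here.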
We solve graphical lasso and RGL with the $C$ described above using our proposed soft-thresholding-MDMC algorithm and QUIC, in order to estimate $\Sigma^{-1}$. In the case of RGL, we assume that the uncorrupted graph is known a priori, while noting that 30% of its elements have been corrupted to zero. Our goal here is to discover the locations of these corrupted elements. In all of our simulations, the threshold is set so that the number of nonzero elements in the estimator is roughly the same as in the ground truth. We limit both algorithms to 3 hours of CPU time.
Figure 2b compares the CPU time of both algorithms for this case study; the specific details are provided in Table 1. A log-log regression shows markedly lower empirical complexity exponents for graphical lasso and RGL using our algorithm than for the same problems using QUIC. The exponents for our algorithm are influenced by the initial soft-thresholding step, which is quadratic-time on a serial computer, but remain low because the overall procedure is dominated by the solution of the MDMC problem. Both algorithms solve the smaller graphs within the allotted time limit, though our algorithm is 11 times faster on average. Only our algorithm is able to solve the largest estimation problem, with $n$ over 200,000, in a little more than an hour.
To check whether thresholding-MDMC really does solve graphical lasso and RGL, we substitute the two sets of estimators back into their original problems (1) and (5). The corresponding objective values have a negligibly small relative difference, suggesting that both sets of estimators are about equally optimal. This observation verifies our claims in Theorem 2 and Corollary 3: thresholding-MDMC does indeed solve graphical lasso and RGL.
5 Conclusions
Graphical lasso is a widely-used approach for estimating a covariance matrix with a sparse inverse from limited samples. In this paper, we consider a slightly more general formulation called restricted graphical lasso (RGL), which additionally enforces a prior sparsity pattern on the estimator. We describe an efficient approach that substantially reduces the cost of solving RGL: 1) soft-thresholding the sample covariance matrix and projecting it onto the prior pattern, to recover the estimator's sparsity pattern; and 2) solving a maximum determinant matrix completion (MDMC) problem, to recover the estimator's numerical values. The first step is quadratic time and memory but embarrassingly parallelizable. If the resulting sparsity pattern admits a sparse chordal embedding, then under mild technical assumptions, the second step can be performed in linear time and memory using the Newton-CG algorithm described in this paper. The algorithm is tested on both synthetic and real-life data, solving instances with as many as 200,000 variables to 7-9 digits of accuracy within an hour on a standard laptop computer.
References
 [1] Ajit Agrawal, Philip Klein, and R Ravi. Cutting down on fill using nested dissection: Provably good elimination orderings. In Graph Theory and Sparse Matrix Computation, pages 31–55. Springer, 1993.
 [2] Martin S Andersen, Joachim Dahl, and Lieven Vandenberghe. Implementation of nonsymmetric interiorpoint methods for linear optimization over sparse matrix cones. Mathematical Programming Computation, 2(3):167–201, 2010.
 [3] Martin S Andersen, Joachim Dahl, and Lieven Vandenberghe. CVXOPT: A Python package for convex optimization. Available at cvxopt.org, 2013.
 [4] Martin S Andersen, Joachim Dahl, and Lieven Vandenberghe. Logarithmic barriers for sparse matrix cones. Optimization Methods and Software, 28(3):396–423, 2013.
 [5] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine learning research, 9:485–516, 2008.
 [6] Richard Barrett, Michael W Berry, Tony F Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der Vorst. Templates for the solution of linear systems: building blocks for iterative methods, volume 43. SIAM, 1994.
 [7] D. Croft, G. O’Kelly, G. Wu, R. Haw, M. Gillespie, L. Matthews, M. Caudy, P. Garapati, G. Gopinath, B. Jassal, and S. Jupe. Reactome: a database of reactions, pathways and biological processes. Nucleic acids research, 39:691–697, 2010.
 [8] Joachim Dahl, Lieven Vandenberghe, and Vwani Roychowdhury. Covariance selection for nonchordal graphs via chordal embedding. Optimization Methods & Software, 23(4):501–520, 2008.
 [9] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.
 [10] Steven N. Durlauf. Nonergodic economic growth. The Review of Economic Studies, 60(2):349–366, 1993.
 [11] Hilmi E. Egilmez, Eduardo Pavez, and Antonio Ortega. Graph learning from data under laplacian and structural constraints. IEEE Journal of Selected Topics in Signal Processing, 11(6):825–841, 2017.
 [12] Salar Fattahi and Somayeh Sojoudi. Graphical lasso and thresholding: Equivalence and closed-form solutions. https://arxiv.org/abs/1708.09479, 2017.
 [13] Salar Fattahi, Richard Y Zhang, and Somayeh Sojoudi. Sparse inverse covariance estimation for chordal structures. https://arxiv.org/abs/1711.09131, 2018.
 [14] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
 [15] Alan George and Joseph WH Liu. The evolution of the minimum degree ordering algorithm. SIAM Review, 31(1):1–19, 1989.
 [16] John Russell Gilbert. Some nested dissection order is nearly optimal. Information Processing Letters, 26(6):325–328, 1988.
 [17] Maxim Grechkin, Maryam Fazel, Daniela M. Witten, and SuIn Lee. Pathway graphical lasso. AAAI, pages 2617–2623, 2015.
 [18] Anne Greenbaum. Iterative methods for solving linear systems, volume 17. SIAM, 1997.
 [19] Jean Honorio, Dimitris Samaras, Nikos Paragios, Rita Goldstein, and Luis E. Ortiz. Sparse and locally constant Gaussian graphical models. Advances in Neural Information Processing Systems, pages 745–753, 2009.
 [20] C. J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. QUIC: Quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15(1):2911–2947, 2014.
 [21] Cho-Jui Hsieh, Mátyás A Sustik, Inderjit S Dhillon, Pradeep K Ravikumar, and Russell Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems, pages 3165–3173, 2013.
 [22] Junzhou Huang and Tong Zhang. The benefit of group sparsity. The Annals of Statistics, 38(4):1978–2004, 2010.
 [23] Jinchao Li, Martin S Andersen, and Lieven Vandenberghe. Inexact proximal newton methods for selfconcordant functions. Mathematical Methods of Operations Research, 85(1):19–41, 2017.
 [24] Stan Z. Li. Markov random field models in computer vision. European conference on computer vision, pages 351–370, 1994.
 [25] Christopher D. Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT press, 1999.
 [26] Rahul Mazumder and Trevor Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 13:781–794, 2012.
 [27] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462, 2006.
 [28] Peyman Milanfar. A tour of modern image filtering: New insights and methods, both practical and theoretical. IEEE Signal Processing Magazine, 30(1):106–128, 2013.
 [29] Sahand Negahban and Martin J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of regularization. Proceedings of the 21st International Conference on Neural Information Processing Systems, pages 1161–1168, 2008.
 [30] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
 [31] Guillaume Obozinski, Martin J. Wainwright, and Michael I. Jordan. Union support recovery in high-dimensional multivariate regression. In 46th Annual Allerton Conference on Communication, Control, and Computing, pages 21–26, 2008.
 [32] Figen Oztoprak, Jorge Nocedal, Steven Rennie, and Peder A Olsen. Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 755–763, 2012.

 [33] Dongjoo Park and Laurence R. Rilett. Forecasting freeway link travel times with a multilayer feedforward neural network. Computer Aided Civil and Infrastructure Engineering, 14(5):357–367, 1999.
 [34] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
 [35] Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, and Arian Maleki. Iterative thresholding algorithm for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 1574–1582, 2012.
 [36] Donald J Rose. Triangulated graphs and the elimination process. Journal of Mathematical Analysis and Applications, 32(3):597–609, 1970.
 [37] Adam J. Rothman, Peter J. Bickel, Elizaveta Levina, and Ji Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
 [38] Yousef Saad. Iterative methods for sparse linear systems, volume 82. Siam, 2003.
 [39] Somayeh Sojoudi. Equivalence of graphical lasso and thresholding for sparse graphs. Journal of Machine Learning Research, 17(115):1–21, 2016.
 [40] Eran Treister and Javier S Turek. A block-coordinate descent approach for large-scale sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 927–935, 2014.
 [41] Lieven Vandenberghe, Martin S Andersen, et al. Chordal graphs and semidefinite optimization. Foundations and Trends® in Optimization, 1(4):241–433, 2015.
 [42] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
 [43] Eunho Yang, Aurélie C. Lozano, and Pradeep K. Ravikumar. Elementary estimators for graphical models. Advances in neural information processing systems, pages 2159–2167, 2014.
 [44] Mihalis Yannakakis. Computing the minimum fillin is NPcomplete. SIAM Journal on Algebraic Discrete Methods, 2(1):77–79, 1981.
 [45] Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, pages 19–35, 2007.
Appendix A Restricted Graphical Lasso and MDMC
Our aim is to elucidate the connection between the RGL and MDMC problems under the assumption that the regularization coefficients are chosen to be large, i.e., when a sparse solution to RGL is sought. Recall that RGL is formulated as follows:
\begin{aligned}
\underset{X}{\text{minimize}}\quad & -\log\det X + \langle C, X\rangle + \sum_{i\neq j}\lambda_{ij}\,|X_{ij}| &\quad& \text{(19a)}\\
\text{s.t.}\quad & X_{ij} = 0 \quad \forall\,(i,j)\notin\mathcal{G}, &&\text{(19b)}\\
& X \succ 0, &&\text{(19c)}
\end{aligned}

where $C$ is the sample covariance matrix, $\lambda_{ij}\ge 0$ are the regularization coefficients, and $\mathcal{G}$ is the prior sparsity pattern.
Now, consider the following modified soft-thresholded sample covariance matrix $C^{\lambda}$:

C^{\lambda}_{ij} = \begin{cases} C_{ij} & i = j,\\ \operatorname{sign}(C_{ij})\,\max\!\big(|C_{ij}|-\lambda_{ij},\,0\big) & i \neq j,\ (i,j)\in\mathcal{G},\\ 0 & \text{otherwise.} \end{cases} \quad \text{(20)}
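As an illustration of the first step of the algorithm, the sketch below (illustrative Python written by the editor, not the paper's code; it assumes the standard elementwise soft-threshold rule with the diagonal left untouched and entries outside the prior pattern set to zero) computes the modified soft-thresholded sample covariance matrix.

```python
# Illustrative sketch: soft-threshold a sample covariance matrix C elementwise
# and project onto a prior sparsity pattern G. Diagonal entries are kept,
# entries inside G are shrunk toward zero by lam, entries outside G become 0.

def soft_threshold_project(C, lam, G):
    """Return the modified soft-thresholded matrix.

    C   -- n x n sample covariance (list of lists)
    lam -- n x n matrix of regularization coefficients
    G   -- set of (i, j) index pairs allowed by the prior pattern
    """
    n = len(C)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                out[i][j] = C[i][j]                       # diagonal kept as-is
            elif (i, j) in G or (j, i) in G:
                mag = max(abs(C[i][j]) - lam[i][j], 0.0)  # shrink toward zero
                out[i][j] = mag if C[i][j] >= 0 else -mag
            # else: entry outside the prior pattern stays zero
    return out

C = [[1.0, 0.6, 0.2],
     [0.6, 1.0, 0.5],
     [0.2, 0.5, 1.0]]
lam = [[0.0, 0.3, 0.3],
       [0.3, 0.0, 0.3],
       [0.3, 0.3, 0.0]]
G = {(0, 1), (1, 2)}   # prior pattern: entry (0, 2) is forbidden

C_lam = soft_threshold_project(C, lam, G)
# e.g. C_lam[0][1] shrinks from 0.6 to about 0.3; C_lam[0][2] is 0 (outside G)
```

Each entry is processed independently of the others, which is what makes this step embarrassingly parallelizable.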
Definition 8 ([12]).
For a given symmetric matrix $M$, let $\operatorname{supp}(M)$ denote the minimal set such that $M_{ij} = 0$ for every $(i,j)\notin\operatorname{supp}(M)$. $M$ is called inverse-consistent if there exists a matrix $N$ with zero diagonal such that

\begin{aligned}
& M + N \succ 0, &\quad& \text{(21a)}\\
& N_{ij} = 0 \quad \forall\,(i,j)\in\operatorname{supp}(M), &&\text{(21b)}\\
& \big((M+N)^{-1}\big)_{ij} = 0 \quad \forall\,(i,j)\notin\operatorname{supp}(M). &&\text{(21c)}
\end{aligned}

The matrix $N$ is called an inverse-consistent complement of $M$ and is denoted by $M^{(c)}$. Furthermore, $M$ is called sign-consistent if for every $(i,j)\in\operatorname{supp}(M)$, the $(i,j)$ entries of $M$ and $M^{-1}$ have opposite signs.
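To make Definition 8 concrete, here is a small numerical check on an assumed example (a unit-diagonal $3\times 3$ matrix with tridiagonal, hence acyclic, support; the numbers are the editor's, not the paper's): placing the product of the two neighboring entries at the single position outside the support produces a completion whose inverse vanishes at that position, so it is an inverse-consistent complement.

```python
# Numerical check of inverse-consistency (Definition 8) on an assumed 3x3
# example: M is unit-diagonal with support {(0,1), (1,2)}. The candidate
# complement N places a*b at the single entry (0,2) outside the support;
# we verify that (M + N)^{-1} vanishes there.

def inv3(A):
    """Inverse of a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = A
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[adj[r][s] / det for s in range(3)] for r in range(3)]

a, b = 0.3, 0.4
M = [[1.0, a, 0.0],
     [a, 1.0, b],
     [0.0, b, 1.0]]
N = [[0.0, 0.0, a * b],   # zero diagonal; only the entry outside supp(M)
     [0.0, 0.0, 0.0],
     [a * b, 0.0, 0.0]]

S = [[M[r][s] + N[r][s] for s in range(3)] for r in range(3)]
K = inv3(S)
# K[0][2] is (numerically) zero: N is an inverse-consistent complement of M.
```

Note also that $K_{01} < 0$ while $M_{01} = a > 0$, illustrating sign-consistency on this example.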
Definition 9 ([12]).
Given a sparsity pattern $\mathcal{S}$ and a scalar $\alpha \ge 0$, define $\beta(\mathcal{S},\alpha)$ as the maximum of $\|M^{(c)}\|_{\max}$ over all inverse-consistent positive-definite matrices $M$ with the diagonal entries all equal to 1 such that $\operatorname{supp}(M)\subseteq\mathcal{S}$ and $\|M\|_{\max}\le\alpha$.
Without loss of generality, we make the following mild assumption.
Assumption 1.
$C_{ii} = 1$ for every $i$, and $|C_{ij}| < 1$ for every $i \neq j$.
We are now ready to restate Theorem 2.
First, note that the diagonal elements of the soft-thresholded matrix $C^{\lambda}$ in (20) are equal to 1 and its off-diagonal elements lie between $-1$ and $1$. A sparse solution for RGL requires large regularization coefficients, which produce numerous zero elements in $C^{\lambda}$ and force the magnitudes of the nonzero elements to be small. This means that, in most instances, $C^{\lambda}$ is positive definite or even diagonally dominant. Certifying Condition (ii) is hard in general. However, [12] shows that this condition is automatically implied by Condition (i) when the sparsity pattern of $C^{\lambda}$ induces an acyclic structure. More generally, [39] shows that a matrix is sign-consistent if its inverse is close to its first-order Taylor expansion; this holds in practice because the magnitudes of the off-diagonal elements of $C^{\lambda}$ are small. Furthermore, [13] proves that this condition is necessary for the equivalence between the sparsity patterns of thresholding and GL when the regularization matrix is large enough. Finally, [12] shows that the left-hand side of (9) is upper bounded by a quantity that depends only on the sparsity pattern of $C^{\lambda}$. This implies that, when the off-diagonal elements of $C^{\lambda}$ are small, or equivalently when the regularization matrix is large, Condition (iii) is automatically satisfied.
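The diagonal-dominance claim above is easy to check numerically. The sketch below (with numbers assumed by the editor, not taken from the paper) soft-thresholds the off-diagonal entries of a unit-diagonal matrix at a large regularization level and verifies diagonal dominance, which implies positive definiteness by Gershgorin's circle theorem.

```python
# Sketch: with a large regularization level lam, soft-thresholding shrinks the
# off-diagonal entries of a unit-diagonal matrix C so much that the result is
# diagonally dominant, hence positive definite by Gershgorin's theorem.

def soft(x, lam):
    mag = max(abs(x) - lam, 0.0)
    return mag if x >= 0 else -mag

C = [[1.0, 0.7, -0.6, 0.5],
     [0.7, 1.0, 0.4, -0.8],
     [-0.6, 0.4, 1.0, 0.3],
     [0.5, -0.8, 0.3, 1.0]]
lam = 0.5
n = len(C)

C_lam = [[C[i][j] if i == j else soft(C[i][j], lam) for j in range(n)]
         for i in range(n)]

# Diagonal dominance: each diagonal entry exceeds the sum of the absolute
# values of the off-diagonal entries in its row.
dominant = all(C_lam[i][i] > sum(abs(C_lam[i][j]) for j in range(n) if j != i)
               for i in range(n))
# Here dominant evaluates to True: every off-diagonal entry shrank below 0.31.
```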
a.1 Proofs
In this section, we present the technical proofs of our theorems. To prove Theorem 2, we need a number of lemmas. First, consider the RGL problem (19). The first lemma offers optimality (KKT) conditions for the unique solution of this problem.
Lemma 10.
$X^{\mathrm{opt}}$ is the optimal solution of the RGL problem (19) if and only if it satisfies the following conditions for every $(i,j)\in\mathcal{G}$:

\begin{aligned}
& \big((X^{\mathrm{opt}})^{-1}\big)_{ij} = C_{ij} & \text{if } i = j, &\quad \text{(22a)}\\
& \big((X^{\mathrm{opt}})^{-1}\big)_{ij} = C_{ij} + \lambda_{ij}\operatorname{sign}\!\big(X^{\mathrm{opt}}_{ij}\big) & \text{if } X^{\mathrm{opt}}_{ij} \neq 0,\ i \neq j, &\quad \text{(22b)}\\
& \big|\big((X^{\mathrm{opt}})^{-1}\big)_{ij} - C_{ij}\big| \le \lambda_{ij} & \text{if } X^{\mathrm{opt}}_{ij} = 0,\ i \neq j, &\quad \text{(22c)}
\end{aligned}

where $\big((X^{\mathrm{opt}})^{-1}\big)_{ij}$ denotes the $(i,j)$ entry of $(X^{\mathrm{opt}})^{-1}$.
Proof.
The proof is straightforward and omitted for brevity. ∎
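As a sanity check of Lemma 10, the sketch below verifies stationarity conditions of the form assumed above, $(X^{-1})_{ii} = C_{ii}$ on the diagonal and $(X^{-1})_{ij} = C_{ij} + \lambda_{ij}\operatorname{sign}(X_{ij})$ on the support, on a $2\times 2$ instance where the solution is known in closed form to be the inverse of the soft-thresholded matrix [12]. The numbers are the editor's, not the paper's.

```python
# Sketch verifying the assumed KKT conditions of Lemma 10 on a 2x2 instance
# where the solution has the closed form X = (C_lam)^{-1}, with C_lam the
# soft-thresholded sample covariance [12]. We check the stationarity rules
# (X^{-1})_{ii} = C_{ii} and (X^{-1})_{ij} = C_{ij} + lam * sign(X_{ij}).

c, lam = 0.6, 0.2            # off-diagonal covariance and penalty, c > lam
C = [[1.0, c], [c, 1.0]]

t = c - lam                  # soft-thresholded off-diagonal entry
det = 1.0 - t * t
X = [[1.0 / det, -t / det],  # candidate solution: inverse of [[1, t], [t, 1]]
     [-t / det, 1.0 / det]]

# Invert X back; algebraically X^{-1} is just [[1, t], [t, 1]].
detX = X[0][0] * X[1][1] - X[0][1] * X[1][0]
Xinv = [[X[1][1] / detX, -X[0][1] / detX],
        [-X[1][0] / detX, X[0][0] / detX]]

sign01 = 1.0 if X[0][1] > 0 else -1.0   # sign(X_{01}); here X_{01} < 0

diag_ok = abs(Xinv[0][0] - C[0][0]) < 1e-12                 # (22a)
offdiag_ok = abs(Xinv[0][1] - (C[0][1] + lam * sign01)) < 1e-12   # (22b)
# Both checks pass: the closed-form candidate satisfies the KKT conditions.
```

Note the sign pattern: the off-diagonal of $X$ is negative while $C_{01} > 0$, so the penalty term $\lambda\operatorname{sign}(X_{01})$ subtracts exactly the soft-threshold amount, consistent with (20).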
Now, consider the following optimization:
(23) 
where
(24) 
Let $\tilde{X}$ denote the optimal solution of (23). Furthermore, define $D$ as a diagonal matrix with entries $D_{ii}$ for every $i$. The following lemma relates $\tilde{X}$ to the solution of RGL.
Lemma 11.
We have .
Proof.
To prove this lemma, we define an intermediate optimization problem. Consider