Stochastic $p$th root approximation of a stochastic matrix: A Riemannian optimization approach¶
Fabio Durastante, Beatrice Meini
addpath("../Matlabcodes/approximate-embedded-markov/ConstrainedOptimization/")
addpath("../Matlabcodes/approximate-embedded-markov/Utilities/")
addpath("../Matlabcodes/approximate-embedded-markov/Manifold/")
addpath("../Matlabcodes/approximate-embedded-markov/Matrices/")
here = pwd;
cd ../Matlabcodes/approximate-embedded-markov/manopt/
importmanopt;
cd(here);
Manopt was added to Matlab's path. Path not saved: please re-call importmanopt next time.
The problem we want to solve¶
Given a stochastic matrix $P$, i.e., $$P \mathbf{1} = \mathbf{1}, \qquad P \geq 0,$$
we want to find a stochastic matrix $X$ such that $$ X^p = P, \qquad p \in \mathbb{N},\;\; p \geq 2. $$
In general, such a matrix does not exist; consider, e.g., the matrix $$ P = \begin{bmatrix} 0.5000 & 0.3750 & 0.1250 \\ 0.7500 & 0.1250 & 0.1250 \\ 0.0833 & 0.0417 & 0.8750 \end{bmatrix} $$ whose eigenvalues are $\lambda(P) = \{-0.2500,0.7500,1\}$.
Counterexamples¶
P = [0.5000 0.3750 0.1250
0.7500 0.1250 0.1250
0.0833 0.0417 0.8750];
ev = eig(P)
ev = 3x1 double
   -0.2500
    1.0000
    0.7500
sqrtm(P)
ans =
   0.6220 + 0.1667i   0.3110 - 0.1667i   0.0670 + 0.0000i
   0.6220 - 0.3333i   0.3110 + 0.3333i   0.0670 - 0.0000i
   0.0446 + 0.0000i   0.0224 - 0.0000i   0.9330 + 0.0000i
There is no stochastic square root: since the eigenvalues of $P$ are distinct, every square root is a primary function, and the negative eigenvalue $-0.25$ produces a nonreal eigenvalue that cannot be paired with its conjugate, so no real (hence no stochastic) square root exists. And the pathologies do not end here$\ldots$
Non-primary (stochastic) roots¶
$$ P(a) = \frac{1}{3} \begin{bmatrix} 1-2a & 1+a & 1+a \\ 1+a & 1-2a & 1+a \\ 1+a & 1+a & 1-2a \\ \end{bmatrix}, \quad 0 < a \leq \frac{1}{3}, \; \lambda{(P(a))} = \{1,-a,-a\}.$$ This matrix has only one stochastic square root, which is not a primary function: its eigenvalues are $\{1, i\sqrt{a}, -i\sqrt{a}\}$, so the repeated eigenvalue $-a$ of $P(a)$ is mapped to two distinct values.
P = @(a) [1-2*a,1+a,1+a;1+a,1-2*a,1+a;1+a,1+a,1-2*a]/3;
sqrtm(P(1/6))
ans =
   0.3333 + 0.2722i   0.3333 - 0.1361i   0.3333 - 0.1361i
   0.3333 - 0.1361i   0.3333 + 0.2722i   0.3333 - 0.1361i
   0.3333 - 0.1361i   0.3333 - 0.1361i   0.3333 + 0.2722i
$$ X(a) = \frac{1}{3} \begin{bmatrix} 1 & 1+\sqrt{3a} & 1-\sqrt{3a} \\ 1-\sqrt{3a} & 1 & 1 + \sqrt{3a} \\ 1+\sqrt{3a} & 1-\sqrt{3a} & 1 \\ \end{bmatrix} $$
X = @(a) [1,1+sqrt(3*a),1-sqrt(3*a);1-sqrt(3*a),1,1+sqrt(3*a);1+sqrt(3*a),1-sqrt(3*a),1]/3;
norm(P(1/6)-X(1/6)*X(1/6))
ans = 1.0834e-16
Pathologies, pathologies, and more pathologies$\ldots$¶
- A stochastic matrix may have no $p$th root for any $p$,
- A stochastic matrix may have $p$th roots but no stochastic $p$th root,
- A stochastic matrix may have a stochastic principal $p$th root as well as a stochastic nonprimary $p$th root,
- A stochastic matrix may have a stochastic principal $p$th root but no other stochastic $p$th root,
- The principal $p$th root of a stochastic matrix with distinct, real, positive eigenvalues is not necessarily stochastic,
- A (row) diagonally dominant stochastic matrix may not have a stochastic principal $p$th root,
- A stochastic matrix whose principal $p$th root is not stochastic may still have a primary stochastic $p$th root,
- A stochastic matrix with distinct eigenvalues may have a stochastic principal $p$th root and a different stochastic primary $p$th root.
See Higham, N. J., & Lin, L. (2011). On $p$th roots of stochastic matrices. Linear Algebra and its Applications, 435(3), 448-463.
So let's approximate it!¶
Let us call $$ \mathbb{S}_n^0 = \{ S \in \mathbb{R}^{n \times n} \,:\; S \mathbf{1} = \mathbf{1}, \; S \geq 0 \}, $$ and decide to compute one of the following:
- $ X = \displaystyle \arg\min_{ X \in \mathbb{S}_n^0 } \frac{1}{2}\| X^p - A\|_F^2$,
- $ \displaystyle X = \arg\min_{ X \in \mathbb{S}_n^0 } \frac{1}{2}\| X - A^{\frac{1}{p}} \|_F^2$,
- $ \mathbf{h} = \displaystyle \arg\min_{ \mathbf{h} \in \Omega } \frac{1}{2}\| X(\mathbf{h}) - A^{\frac{1}{p}} \|_F^2$, where $\Omega = \{ \mathbf{h} \in \mathbb{R}^n \,:\,\mathbf{1}^T \mathbf{h} = 1,\; B \mathbf{h} \geq 0, \; B = [\operatorname{vec}(I)|\operatorname{vec}(A)|\cdots|\operatorname{vec}(A^{n-1})] \}$, and the approximation is the matrix polynomial $$ X = X(\mathbf{h}) = \sum_{i=0}^{n-1} h_i A^i; $$ a sketch of this parametrization is given right after the list.
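A minimal sketch of the third (matrix-polynomial) parametrization, assuming the principal root $A^{1/p}$ is computed with MATLAB's fractional matrix power and the linearly constrained least-squares problem in $\mathbf{h}$ is handled by lsqlin; the function name and details are ours, not the repository's.
function X = polyroot_sketch(A, p)
%%POLYROOT_SKETCH approximate pth root as a polynomial in A (illustrative only)
n = size(A,1);
R = real(A^(1/p));                   % principal p-th root (real part, to shed round-off)
B = zeros(n^2, n);                   % B = [vec(I) | vec(A) | ... | vec(A^{n-1})]
Ak = eye(n);
for k = 1:n
    B(:,k) = Ak(:);
    Ak = Ak*A;
end
% min_h 0.5*|| B*h - vec(A^{1/p}) ||^2   s.t.   1'*h = 1,  B*h >= 0
h = lsqlin(B, R(:), -B, zeros(n^2,1), ones(1,n), 1);
X = reshape(B*h, n, n);              % X(h) = sum_i h_i A^i
end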
A constrained optimization problem¶
$$ \text{Given } p \in \mathbb{N} \text{ find:} \qquad \begin{array}{ll} \displaystyle \min_{ X \in \mathbb{R}^{n \times n} } & \displaystyle F(X) := \frac{1}{2}\| X^p - P \|_F^2, \\ \text{s.t. } & X \mathbf{1} = \mathbf{1},\\ & X \geq 0.\end{array} $$
This can be done with the fmincon routine in MATLAB; we can achieve this either by exploiting the explicit computation of the gradient and the Hessian, or by using automatic finite-difference approximations:
$$ \nabla F(X) = \sum_{j=1}^{p} (X^T)^{j-1} (X^p - P) (X^T)^{p-j} $$
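This expression follows from the product rule for the derivative of $X \mapsto X^p$ and the cyclic property of the trace: for any direction $E \in \mathbb{R}^{n\times n}$, $$ \mathrm{D}F(X)[E] = \Big\langle X^p - P, \sum_{j=1}^{p} X^{j-1} E\, X^{p-j} \Big\rangle_F = \sum_{j=1}^{p} \operatorname{Trace}\!\left( X^{p-j}\, (X^p-P)^T X^{j-1} E \right) = \langle \nabla F(X), E \rangle_F . $$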
The Hessian of $F$ is an $n^2 \times n^2$ matrix $H$, namely the matrix representation of the Fréchet derivative $L_{\nabla F}$ of $\nabla F$, i.e., for any $E \in \mathbb{R}^{n \times n}$:
$$ \operatorname{vec}( L_{\nabla F}(X,E)) = H \operatorname{vec}(E) .$$
A constrained optimization problem¶
The calculation is a bit cumbersome, but it can be written in closed form as:
$$\begin{split} L_{\nabla F}(X,E) = & \; \sum_{j=1}^{p} \left( (X^T)^{j-1} (X^p - P) \sum_{l=1}^{p-j} (X^T)^{p-j-l} E^T (X^T)^{l-1} \right. \\ & + (X^T)^{j-1} \sum_{k=1}^p X^{p-k} E X^{k-1} (X^T)^{p-j} \\ & \left. + \sum_{i=1}^{j-1} (X^T)^{j-1-i} E^T (X^T)^{i-1} (X^p - P) (X^T)^{p-j} \right) \\ \end{split}$$
Then it is a matter of putting together a couple of MATLAB functions that evaluate these quantities, so that they can be passed to fmincon.
MATLAB Routines (objective)¶
function [Y,g] = objective(X,A,p)
%%OBJECTIVE is the objective function to minimize
n = sqrt(length(X));
Xm = reshape(X,n,n);
Xp = mpower(Xm,p);
Y = 0.5*norm(Xp(:)-A(:),"fro")^2;
if nargout > 1 % Gradient is required
F = Xp - A;
g = zeros(n,n);
for j=1:p
g = g + mpower(Xm',j-1)*F*mpower(Xm',p-j);
end
end
end
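As a quick sanity check (not part of the presented code), the analytic gradient can be compared against a central finite-difference approximation on a small random instance; this sketch assumes the objective function above is saved as objective.m on the path.
n = 4; p = 3; rng(0);
A = rand(n,n); A = A./sum(A,2);             % a random row-stochastic target
x = rand(n^2,1);                            % a random (vectorized) point
[~,g] = objective(x,A,p);                   % analytic gradient (returned as an n-by-n matrix)
gfd = zeros(n^2,1); h = 1e-6;
for k = 1:n^2
ek = zeros(n^2,1); ek(k) = h;
gfd(k) = (objective(x+ek,A,p) - objective(x-ek,A,p))/(2*h);
end
disp(norm(g(:) - gfd)/norm(gfd))            % should be tiny (roughly 1e-8 or smaller)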
MATLAB Routines (Hessian)¶
function y = hessian(x,lambda,v,A,p)
%HESSIAN Hessian-vector product; uses the matrix formulation and a number of conversions between matrix and vector arguments
n = size(A,1); % Size of the problem
X = reshape(x,n,n); % Current point
E = reshape(v,n,n); % Evaluation matrix
F = mpower(X,p)-A; Y = zeros(n,n); % Hessian (free objective)
for j=1:p
S = zeros(n,n);
for l=1:p-j
S = S+mpower(X',p-j-l)*E'*mpower(X',l-1);
end
Y = Y + mpower(X',j-1)*F*S;
S = zeros(n,n); % Second term
for k=1:p
S = S + mpower(X,p-k)*E*mpower(X,k-1)*mpower(X',p-j);
end
Y = Y + mpower(X',j-1)*S;
for i=1:j-1
Y = Y + mpower(X',j-1-i)*E'*mpower(X',i-1)*F*mpower(X',p-j);
end
end
% Multipliers of the nonlinear constraints: always empty in our case, since the
% stochasticity constraints are linear.
l1 = lambda.ineqnonlin; l2 = lambda.eqnonlin;
y = Y(:); % Final conversion back to a vector
end
Putting it all together¶
We can collect all this code in a MATLAB function:
function [X,output,history] = approximatepower(A,p,A0,tol,maxIterations,varargin)
%%APPROXIMATEPOWER solves the constrained optimization problem
% X = arg min || X^p - A ||_F
% requiring X to be row-stochastic, i.e., X >= 0 and X 1 = 1.
% The optional argument can be either 'FD' to use finite difference
% approximations of Hessian and Gradients, or 'ANALYTICAL' to use the one computed analytically.
end
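Purely as an illustration (this is not the repository implementation, which also tracks the convergence history and offers the 'FD' option), a minimal sketch of how the two routines above could be wired into fmincon might look as follows; the linear constraints encode X*1 = 1 and X >= 0, and the options used are standard fmincon options.
function X = approximatepower_sketch(A,p,X0,tol,maxIterations)
%%APPROXIMATEPOWER_SKETCH illustrative wiring of the routines above into fmincon
n = size(A,1);
Aeq = kron(ones(1,n), eye(n));      % row sums: (X*1)_i = 1 for every i
beq = ones(n,1);
lb  = zeros(n^2,1);                 % X >= 0 entrywise
opts = optimoptions('fmincon', ...
    'Algorithm','interior-point', ...
    'SpecifyObjectiveGradient',true, ...
    'HessianMultiplyFcn',@(x,lambda,v) hessian(x,lambda,v,A,p), ...
    'SubproblemAlgorithm','cg', ...  % needed when a Hessian multiply function is supplied
    'OptimalityTolerance',tol, 'MaxIterations',maxIterations);
x = fmincon(@(x) objgrad(x,A,p), X0(:), [],[], Aeq, beq, lb, [], [], opts);
X = reshape(x,n,n);
end
function [f,g] = objgrad(x,A,p)
[f,G] = objective(x,A,p);           % objective returns the gradient as an n-by-n matrix
g = G(:);                           % fmincon expects it as a vector
end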
Note: This and all the other functions we are going to discuss can be found in the GitHub repository approximate-embedded-markov:
git clone --recurse-submodules git@github.com:Cirdans-Home/approximate-embedded-markov.git
An example¶
Let's try everything on an example.
X0 = rand(3,3);
X0 = X0./sum(X0,2);
[Xapprox,output,history] = approximatepower(P(1/6),2,X0,1e-6,500,'analytical');
disp(Xapprox)
disp(P(1/6))
disp(norm(Xapprox*Xapprox-P(1/6),"fro"))
    0.3337    0.5657    0.1005
    0.0943    0.3363    0.5695
    0.5720    0.0980    0.3300

    0.2222    0.3889    0.3889
    0.3889    0.2222    0.3889
    0.3889    0.3889    0.2222

   2.8422e-07
[v,l] = eigs(Xapprox',1,'largestabs');
disp(v.'./sum(v))
0.3333 0.3333 0.3333
Another example¶
We consider the matrix Pajek/GD96_c from the SuiteSparse matrix collection as the adjacency matrix $A$ of an undirected graph normalized by the inverse of the sum of the row entries.
load("GD96_c.mat")
A = Problem.A;
G = graph(A,'omitselfloops');
n = size(A,1);
D = spdiags(G.degree,0,n,n);
P = D\A;
X0 = rand(n,n); % Optimization step
X0 = X0./sum(X0,2);
[Xapprox,output,history] = approximatepower(P,2,X0,1e-6,1000,'analytical');
disp(norm(Xapprox*Xapprox - P,"fro"))
2.0209
Visualize the steady state¶
For a Markov chain, the quantity that interests us most, and that we would like to preserve, is the steady-state vector $\boldsymbol{\pi}$: the distribution reached by taking steps of length one or steps of length one half should remain unchanged!
[pitrue,l] = eigs(P',1,'largestabs'); pitrue = pitrue.'./sum(pitrue);
[pi,l] = eigs(Xapprox',1,'largestabs'); pi = pi.'./sum(pi);
figure("Position",[947 761 871 323]); semilogy(1:n,pitrue,'o',1:n,pi,'x')
legend('Steady state','Approximate root steady state')
Preserving the steady state¶
Let $\boldsymbol{\pi} \in \mathbb{R}^{n}$ be a positive vector such that $\boldsymbol{\pi}^T \mathbf{1} = 1$, and define the set $$ \mathbb{S}_n^{\boldsymbol{\pi}} = \{ S \in \mathbb{R}^{n \times n} \,:\; S \mathbf{1} = \mathbf{1}, \; \boldsymbol{\pi}^T S = \boldsymbol{\pi}^T, \; S > 0 \}, $$ i.e., $\mathbb{S}_n^{\boldsymbol{\pi}}$ is the set of $n\times n$ positive stochastic matrices, having the same stationary distribution $\boldsymbol{\pi}$. After proving that $\mathbb{S}_n^{\boldsymbol{\pi}}$ is a manifold our optimization problem is rewritten as: $$ \text{Given }A \in \mathbb{S}_n^0 \text{ and } \boldsymbol{\pi} \text{ s.t. } \boldsymbol{\pi}^T A = \boldsymbol{\pi}^T \text{ find } X = \displaystyle \arg\min_{ X \in \mathbb{S}_n^{\boldsymbol{\pi}}} \frac{1}{2} \| X^p - A\|_F^2. $$
- The approach is inspired by: Douik, A.; Hassibi, B. Manifold optimization over the set of doubly stochastic matrices: a second-order geometry. IEEE Trans. Signal Process.67(2019), no.22, 5761–5774.
The Riemannian structure¶
Definition: Let $\mathcal{E}$ be a linear space of dimension $d$. A non-empty subset $\mathcal{M}$ of $\mathcal{E}$ is a smooth embedded submanifold of $\mathcal{E}$ of dimension $n$ if either
$n = d$ and $\mathcal{M}$ is open in $\mathcal{E}$;
$n = d-k$ for some $k \geq 1$ and, for each $x \in \mathcal{M}$, there exists a neighborhood $U$ of $x$ in $\mathcal{E}$ and a smooth function $h : U \rightarrow \mathbb{R}^k$ such that
- If $y \in U$, then $h(y)= 0$ if and only if $y \in \mathcal{M}$; and
- $\operatorname{rank} \mathrm{D}h(x) = k$, for $\mathrm{D}h(x)$ the differential of $h$ at $x$;
Such function $h$ is called a local defining function for $\mathcal{M}$ at $x$.
Direct inspection: $\mathbb{S}_n^{\boldsymbol{\pi}}$ is an embedded manifold of $\mathbb{R}^{n \times n}$ of dimension $(n-1)^2$, since it is indeed generated by $2n-1$ linearly independent equations.
The tangent space¶
Definition: A tangent vector $\xi_x$ to a manifold $\mathcal{M}$ at a point $x$ is a mapping from the set $\mathfrak{F}_x(\mathcal{M})$ of smooth real-valued functions defined on a neighborhood of $x$ to $\mathbb{R}$ such that there exists a curve $\gamma$ on $\mathcal{M}$ realizing the tangent vector $\xi_x$, i.e., such that $\gamma(0)=x$, and $$ \xi_x f = \dot{\gamma}(0)f \triangleq \left.\frac{\mathrm{d}(f(\gamma(t)))}{\mathrm{d}t}\right\rvert_{t=0}, \; \forall \, f \in \mathfrak{F}_x(\mathcal{M}); $$ The tangent space $\mathcal{T}_x \mathcal{M}$ at $x\in\mathcal M$ is then the set of all tangent vectors to $\mathcal{M}$ at a point $x$. The tangent bundle is the manifold $\mathcal{T} \mathcal{M}$ that assembles all the tangent vectors, i.e., the disjoint union $\mathcal{T} \mathcal{M} = {\bigsqcup_{x \in \mathcal{M}}} \mathcal{T}_x \mathcal{M}$.
Lemma The tangent space to $\mathbb{S}_n^{\boldsymbol{\pi}}$ at $S\in\mathbb{S}_n^{\boldsymbol{\pi}}$ is given by $$ \mathcal{T}_S \mathbb{S}_n^{\boldsymbol{\pi}} = \{ \xi_S \in \mathbb{R}^{n \times n}\,:\; \xi_S \mathbf{1} = \mathbf{0}, \; \boldsymbol{\pi}^T \xi_S = \mathbf{0} \}. $$
The Fisher metric¶
$\mathbb{S}_n^{\boldsymbol{\pi}}$ can be turned into a Riemannian manifold by endowing its tangent space at every point with a positive-definite inner product. We choose the Fisher information metric: $$ \begin{split} g_S(\xi_S,\eta_S) = &\; \langle \xi_S, \eta_S \rangle_S \\ = &\; \sum_{i,j=1}^{n} \frac{ (\xi_S)_{i,j} (\eta_S)_{i,j} }{S_{i,j}} \\ = &\; \operatorname{Trace}( (\xi_S \oslash S) \eta_S^T ), \;\forall\, \xi_S,\eta_S \in \mathcal{T}_S \mathbb{S}_n^{\boldsymbol{\pi}}. \end{split} $$
- Note: this is why we require strict positivity, $S > 0$, of the entries of the stochastic matrices: we divide by $S_{i,j}$;
- Also: we require the chain to be irreducible, i.e., $\boldsymbol{\pi} > 0$.
Implementing Manifolds in MANOPT¶
To use the optimization algorithm implemented in MANOPT on $\mathbb{S}_n^{\boldsymbol{\pi}}$ we need to create a new manifold object. This is done in MANOPT by defining a function
that outputs a manifold, i.e., a structure array
whose fields are the various functions needed by the optimization algorithms.
function M = multinomialfixedstochasticfactory(pi,optionsolve)
n = length(pi);
e = ones(n, 1);
M.name = @() sprintf(['%dx%d row-stochastic matrices with positive ' ...
'entries and fixed left-stochastic eigenvector'], n, n);
M.dim = @() (n-1)^2;
% Fisher metric
M.inner = @iproduct;
function ip = iproduct(X, eta, zeta)
ip = sum((eta(:).*zeta(:))./X(:));
end
M.norm = @(X, eta) sqrt(M.inner(X, eta, eta));
end
Orthogonal projection¶
To use any optimization strategy we need an expression for the projection operator on the tangent space $$ \Pi_S \,:\,\mathbb{R}^{n \times n} \rightarrow \mathcal{T}_S \mathbb{S}_n^{\boldsymbol{\pi}}. $$ For a $Z \in \mathbb{R}^{n \times n}$ and an $S \in \mathbb{S}_n^{\boldsymbol{\pi}}$, we can express the orthogonal projection by using the decomposition of any ambient vector into $$ Z = \Pi_S(Z) + \Pi_S^\perp(Z). $$
Lemma: The orthogonal complement of the tangent space $\mathcal{T}_S \mathbb{S}_n^{\boldsymbol{\pi}}$ has the expression $$ \mathcal{T}_S^\perp \mathbb{S}_n^{\boldsymbol{\pi}} = \{ \xi_S^\perp \in \mathbb{R}^{n\times n} \,:\, \xi_S^\perp = (\boldsymbol{\alpha} \mathbf{1}^T + \boldsymbol{\pi} \boldsymbol{\beta}^T) \odot S \}, $$ for some vectors $\boldsymbol{\alpha},\boldsymbol{\beta} \in \mathbb{R}^{n}$.
Orthogonal projection (computation)¶
Proposition. The orthogonal projection $\Pi_S\,:\,\mathbb{R}^{n\times n} \rightarrow \mathcal{T}_S \mathbb{S}_n^{\boldsymbol{\pi}}$ of a matrix $Z$-with respect to the scalar product induced by Fisher's metric-has the following expression: $$ \Pi_S(Z) = Z - (\boldsymbol{\alpha} \mathbf{1}^T + \boldsymbol{\pi}\boldsymbol{\beta}^T)\odot S, $$ where the vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are a solution to the following consistent linear system $$ \begin{bmatrix} Z\mathbf{1} \\ Z^T \boldsymbol{\pi} \end{bmatrix}= \begin{bmatrix} I & D_{\boldsymbol{\pi}} S \\ S^T D_{\boldsymbol{\pi}} & \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi}) \end{bmatrix} \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix}, \quad D_{\boldsymbol{\pi}} = \operatorname{diag}(\boldsymbol{\pi}). $$
- we must therefore add the computation of the orthogonal projection to our
MATLAB
function
M.proj = @projection;
function etaproj = projection(X, eta) % Projection of the vector eta in the ambient space onto the tangent space
b = [sum(eta, 2) ; eta'*pi];
[alpha, beta] = mylinearsolve(X, b);
etaproj = eta - (alpha*e' + pi*beta').*X;
end
- we will then return to the solution of the linear system.
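As a hedged sanity check of the projection formula (independent of the mylinearsolve routine discussed below), one can verify on a small random example that the projected matrix satisfies the two tangent-space conditions; here the consistent singular system is solved directly with pinv, and all variable names are ours.
n = 5; rng(1);
S = rand(n,n) + 0.1; S = S./sum(S,2);             % a random positive row-stochastic matrix
[V,D] = eig(S'); [~,k] = max(abs(diag(D)));
piv = real(V(:,k)); piv = piv/sum(piv);           % its stationary distribution
Z = randn(n,n);                                   % an arbitrary ambient matrix
Dpi = diag(piv); e = ones(n,1);
Acal = [eye(n), Dpi*S; S'*Dpi, diag(S'*Dpi*piv)];
ab = pinv(Acal)*[Z*e; Z'*piv];                    % consistent singular system, minimum-norm solution
alpha = ab(1:n); beta = ab(n+1:end);
PZ = Z - (alpha*e' + piv*beta').*S;
disp([norm(PZ*e), norm(piv'*PZ)])                 % both should be ~ 0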
Riemannian gradient¶
Let now $f : \mathbb{S}_n^{\boldsymbol{\pi}} \rightarrow \mathbb{R}$ be a smooth real function defined on the manifold, and denote by $\operatorname{Grad} f(S)$ its euclidean gradient with respect to the euclidean metric.
Definition. The Riemannian gradient of $f$ at $x$, denoted by $\operatorname{grad}f(x)$, of a manifold $\mathcal{M}$ is defined as the unique vector in $\mathcal{T}_x\mathcal{M}$ that satisfies: $$ \langle \operatorname{grad}f(x), \xi_x \rangle_x = \mathrm{D} f(x) [\xi_x],\ \forall \ \xi_x \in \mathcal{T}_x\mathcal{M}. $$ Where we have denoted by $\mathrm{D} f(x)[\xi]$ the directional derivative of $f$ given by: $$ \mathrm{D} f(x)[\xi]=\lim_{t\to 0}\frac{f(x+t\xi)-f(x)}{t}. $$
Proposition. The Riemannian gradient $\operatorname{grad}f(S)$ is expressed in terms of the Euclidean gradient $\operatorname{Grad}f(S)$ as: $$ \operatorname{grad}f(S) = \Pi_S(\operatorname{Grad}f(S) \odot S).$$
Riemannian gradient (computation)¶
We need to add a routine for this to the MANOPT implementation.
% Conversion of Euclidean to Riemannian gradient
M.egrad2rgrad = @egrad2rgrad;
function rgrad = egrad2rgrad(X, egrad) % projection of the euclidean gradient
mu = (X.*egrad);
b = [sum(mu, 2) ; mu'*pi];
[alpha, beta] = mylinearsolve(X, b);
rgrad = mu - (alpha*e' + pi*beta').*X;
end
- Remark: we still have to solve a linear system with the same structure, and we use the same routine as for the orthogonal projection.
Riemannian Hessian - Levi-Civita connection¶
Definition An affine connection $\nabla \,:\,\mathcal{T}\mathcal{M} \times \mathcal{T}\mathcal{M} \to \mathcal{T}\mathcal{M}$ is a map that associates to each $(\eta,\xi)$ in the tangent bundle the tangent vector $\nabla_\eta \xi$ satisfying for all $a,b \in \mathbb{R}$, and smooth $f,g: \mathcal{M} \longrightarrow \mathbb{R}$:
$\nabla_{f\eta+g\chi}\xi = f\,(\nabla_\eta \xi)+ g\,(\nabla_\chi \xi)$,
$\nabla_{\eta}(a\xi+b\varphi) = a\nabla_{\eta}\xi+b\nabla_{\eta}\varphi$,
$\nabla_{\eta}(f\xi) = \eta(f)\, \xi + f(\nabla_\eta \xi)$,
wherein a vector field $\eta$ acts on the function $f$ by derivation, that is $\eta(f)=\mathrm{D}(f)[\eta].$ We call Levi-Civita connection the unique torsion-free affine connection that preserves the Riemannian metric, i.e., the affine connection such that
$\nabla_\eta \xi - \nabla_\xi \eta = [\eta,\xi] $ $\forall\,\eta,\xi \in \mathcal{T}\mathcal{M}$,
$\chi \langle \eta,\xi \rangle = \langle \nabla_\chi \eta,\xi \rangle + \langle\eta , \nabla_\chi \xi \rangle$, $\forall\,\eta ,\xi ,\chi \in \mathcal{T} \mathcal{M}$,
where we are denoting with $[\cdot,\cdot]$ the Lie bracket $$ [\xi,\eta]g = \xi(\eta (g)) - \eta(\xi (g)). $$
Riemannian Hessian - the Hessian¶
Definition The Riemannian Hessian of $f$ at $x$, denoted by {$\operatorname{{hess}}f(x)$}, of a manifold $\mathcal{M}$ is a mapping from $\mathcal{T}_x\mathcal{M}$ into itself defined by: $$ \operatorname{{hess}}f(x)[\xi_x] = \nabla_{\xi_x} \operatorname{grad}f(x), \ \forall \ \xi_x \in \mathcal{T}_x\mathcal{M}, $$ where $\operatorname{grad}f(x)$ is the Riemannian gradient and $\nabla$ is the Levi-Civita connection on $\mathcal{M}$.
Proposition (Koszul formula). The Levi-Civita connection on the Euclidean space $\mathbb{R}^{n \times n}$ endowed with the Fisher information metric is given by $$ \nabla_{\eta_S} \xi_S = \mathrm{D}(\xi_S)[\eta_S] - \cfrac{1}{2} (\eta_S \odot \xi_S) \oslash S.$$
- Koszul formula is Theorem 5.3.1 in Absil, P.-A.; Mahony, R.; Sepulchre, R. Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ, 2008. xvi+224 pp. ISBN:978-0-691-13298-3
Riemannian Hessian - the Hessian¶
Theorem. The Riemannian Hessian $\mathrm{hess} f(S)[\xi_S]$ can be obtained from the Euclidean gradient $\operatorname{Grad} f(S)$ and the Euclidean Hessian $\operatorname{Hess} f(S)$ by using the identity $$ \mathrm{hess} f(S)[\xi_S] = \Pi_S\left(\mathrm{D}(\mathrm{grad} f(S))[\xi_S]\right)-\frac12\, \Pi_S\left(\left( \Pi_S(\mathrm{Grad} f(S)\odot S) \odot \xi_S\right) \oslash S\right), $$ where $$ \mathrm{D}(\mathrm{grad} f(S))[\xi_S] = \dot{\gamma}[\xi_S]- (\dot{\boldsymbol{\alpha}}[\xi_S] \mathbf{1}^T+\boldsymbol{\pi}\dot{\boldsymbol{\beta}}^T[\xi_S])\odot S - (\boldsymbol{\alpha} \mathbf{1}^T+\boldsymbol{\pi}\boldsymbol{\beta}^T)\odot \xi_S, $$ and $$\begin{split} \gamma = & \; \mathrm{Grad} f(S)\odot S,\\ \dot{\gamma}[\xi_S] = & \; \mathrm{Hess}\; f(S)[\xi_S]\odot S + \mathrm{Grad}\; f(S)\odot \xi_S,\\ \mathcal{A} = &\; \begin{bmatrix} I & D_{\boldsymbol{\pi}} S \\ S^T D_{\boldsymbol{\pi}} & \mathrm{diag}( S^TD_{\boldsymbol{\pi}}\boldsymbol{\pi}) \end{bmatrix}, \\ \boldsymbol{\alpha}, \boldsymbol{\beta} \,\text{s.t.}\,&\; \mathcal{A} \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix} = \begin{bmatrix} \gamma \mathbf{1} \\ \gamma^T \boldsymbol{\pi} \end{bmatrix},\\ \dot{\boldsymbol{\alpha}}[\xi_S], \dot{\boldsymbol{\beta}}[\xi_S] \,\text{s.t.}\,&\; \mathcal{A} \begin{bmatrix} \dot{\boldsymbol{\alpha}}[\xi_S] \\ \dot{\boldsymbol{\beta}}[\xi_S] \end{bmatrix} = \begin{bmatrix} \dot{\gamma}[\xi_S] \mathbf{1} \\ \dot{\gamma}^T[\xi_S] \boldsymbol{\pi} \end{bmatrix} - \begin{bmatrix} 0 & D_{\boldsymbol{\pi}} \xi_S \\ \xi_S^T D_{\boldsymbol{\pi}} & \mathrm{diag}(\xi_S^TD_{\boldsymbol{\pi}}\boldsymbol{\pi}) \end{bmatrix} \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix}. \end{split} $$
Riemannian Hessian (computation)¶
The corresponding routine for the MANOPT implementation is given by
M.ehess2rhess = @ehess2rhess;
function rhess = ehess2rhess(X, egrad, ehess, eta)
% Computing the directional derivative of the Riemannian
% gradient
gamma = egrad.*X;
gammadot = ehess.*X + egrad.*eta;
bdot = [ gammadot*e ; gammadot.'*pi];
b = [gamma*e ; gamma.'*pi];
[alpha, beta] = mylinearsolve(X, b);
S1 = [ zeros(size(eta)) , diag(pi)*eta; eta'*diag(pi) , diag(eta'*diag(pi)*pi) ];
[alphadot, betadot] = mylinearsolve(X, bdot - S1*[alpha; beta]); % %- [eta*beta; eta'*alpha]
S = (alpha*e' + pi*beta');
deltadot = gammadot - (alphadot*e' + pi*betadot').*X- S.*eta; % rgraddot
% Computing Riemannian gradient
delta = gamma - S.*X; % rgrad
% Riemannian Hessian in the ambient space
nabla = deltadot - 0.5*(delta.*eta)./X;
% Riemannian Hessian on the tangent space
rhess = projection(X, nabla);
end
Retractions - Sinkhorn-Knopp algorithm¶
To complete the construction of the Riemannian optimization algorithm, we also need to define the retraction from the tangent bundle to the manifold. To obtain such a map, we apply a suitable modification of the generalized Sinkhorn-Knopp algorithm, which is based on the following theorem:
Theorem (Sinkhorn generalization). Let $A \in \mathbb{R}^{n \times n}$ be a nonnegative matrix. Then for any vectors $\mathbf{r},\mathbf{c} \in \mathbb{R}^{n}$ with nonnegative entries there exist diagonal matrices $D_1$ and $D_2$ such that $$ D_1 A D_2 \mathbf{1} = \mathbf{r}, \qquad D_2 A^T D_1 \mathbf{1} = \mathbf{c}, $$ if and only if there exists a matrix $B$ such that $B\mathbf{1} = \mathbf{r}$ and $B^T\mathbf{1} = \mathbf{c}$, having the same nonzero pattern as $A$. Furthermore, if the matrix $A$ is positive, then $D_1$ and $D_2$ are unique up to a constant factor.
- Rothblum, U.G., Schneider, H. Scalings of matrices which have prespecified row sums and column sums via optimization. Linear Algebra Appl.114/115(1989), 737–764.
Retractions - Definition¶
Proposition. Let $A \in \mathbb{R}^{n \times n}$ be a matrix with positive entries. Then there exist diagonal matrices $D_1$ and $D_2$ such that $$ D_1 A D_2 \mathbf{1} = \mathbf{1}, \qquad \boldsymbol{\pi}^T D_1 A D_2 = \boldsymbol{\pi}^T. $$ Moreover, $D_1$ and $D_2$ are diagonal matrices such that $D_1\widehat A D_2 \mathbf{1}=\boldsymbol{\pi}$ and $\mathbf{1}^T D_1\widehat A D_2 =\boldsymbol{\pi}^T$, where $\widehat A=\mathrm{diag}(\boldsymbol{\pi})A$.
Definition. A retraction on a manifold $\mathcal{M}$ is a smooth map $R$ from the tangent bundle {$\mathcal{T} \mathcal{M} = \bigsqcup_{x \in \mathcal{M}} \mathcal{T}_x\mathcal{M}$, i.e., the disjoint union of the tangent spaces,} onto $\mathcal{M}$. For all $x \in \mathcal{M}$ the restriction of $R$ to $\mathcal{T}_x \mathcal{M}$, denoted by $R_x$, satisfies the following properties:
- $R_x(0) = x$ (centering);
- The curve $\gamma_{\xi_x}(\tau) = R_x(\tau \xi_x)$ satisfies $$ \left.\frac{d \gamma_{\xi_x}(\tau)}{d\tau}\right|_{\tau = 0} = \xi_x, \quad\,\forall\, \xi_x \in \mathcal{T}_x \mathcal{M}. \qquad \text{(local rigidity)} $$.
Retractions - Auxiliary Result¶
Proposition. Let $\mathcal{M}$ be an embedded manifold of the Euclidean space $\mathcal{E}$ and let $\mathcal{N}$ be an abstract manifold such that $\dim(\mathcal{M}) + \dim(\mathcal{N}) = \dim(\mathcal{E})$. Assume that there is a diffeomorphism \begin{align*} \phi: \mathcal{M} \times \mathcal{N} &\longrightarrow \mathcal{E}^* \\ (A,B) &\longmapsto \phi(A,B) \end{align*} where $\mathcal{E}^*$ is an open subset of $\mathcal{E}$, with a neutral element $I \in \mathcal{N}$ satisfying $$ \phi(A,I) = A, \ \forall \ A \in \mathcal{M}. $$ Under the above assumption, the mapping \begin{align*} R_x: \mathcal{T}_x\mathcal{M} &\longrightarrow \mathcal{M} \\ \xi_x &\longmapsto R_x(\xi_x) = \pi_1(\phi^{-1}(x+\xi_x)), \end{align*} where $\pi_1: \mathcal{M} \times \mathcal{N} \longrightarrow \mathcal{M}: (A,B) \longmapsto A$ is the projection onto the first component, defines a retraction on the manifold $\mathcal{M}$ for all $x \in \mathcal{M}$ and $\xi_x$ in the neighbourhood of $0_x$.
Retractions - Construction¶
Theorem. The map $R: \mathcal{T}\mathbb{S}_n^{\boldsymbol{\pi}} \longrightarrow \mathbb{S}_n^{\boldsymbol{\pi}} $ whose restriction $R_{S}$ to $ \mathcal{T}_S\mathbb{S}_n^{\boldsymbol{\pi}} $ is given by: $$ R_{S}(\xi_S) = S + \xi_S, $$ is a well-defined retraction on $\mathbb{S}_n^{\boldsymbol{\pi}}$ whenever $\xi_S$ is in a neighborhood of $\mathbf{0}_{S}$, more explicitly, whenever $S > - \xi_S$ entry-wise.
To avoid the deterioration observed for the analogous retraction on $\mathbb{S}_n^{\mathbf{1}}$ when elements of small modulus appear during the iterations of the Riemannian optimization algorithms, a modification can be used that combines the entry-wise exponential of a matrix with the Sinkhorn-Knopp algorithm.
- Douik, A.; Hassibi, B. Manifold optimization over the set of doubly stochastic matrices: a second-order geometry. IEEE Trans. Signal Process.67(2019), no.22, 5761–5774.
We adapt this proposal here to the manifold $\mathbb{S}_n^{\boldsymbol{\pi}}$.
Retractions - The implemented one¶
Theorem. The map $\hat{R}: \mathcal{T}\mathbb{S}_n^{\boldsymbol{\pi}} \longrightarrow \mathbb{S}_n^{\boldsymbol{\pi}} $ whose restriction $\hat{R}_{S}$ to $ \mathcal{T}_S\mathbb{S}_n^{\boldsymbol{\pi}} $ is given by: \begin{align*} \hat{R}_{S}(\xi_S) = \mathcal{S}\left( S \odot \exp(\xi_S \oslash S )\right), \end{align*} is a first-order retraction on $\mathbb{S}_n^{\boldsymbol{\pi}}$, where $\mathcal{S}\left( \cdot \right)$ represents an application of the modified Sinkhorn-Knopp’s algorithm, and $\exp(\cdot)$ the entry-wise exponential.
M.retr = @retraction;
function Y = retraction(X, eta, t)
if nargin < 3
t = 1.0;
end
Y = X.*exp(t*(eta./X));
Y = modifiedsinkhorn(Y,pi);
Y = max(Y, eps);
end
Modified Sinkhorn-Knopp implementation¶
function [B,u,v] = modifiedsinkhorn(A,pi,maxit,checkperiod)
N = size(A,1);
if nargin < 3 || isempty(maxit), maxit = 1000; end             % default values assumed here;
if nargin < 4 || isempty(checkperiod), checkperiod = 100; end  % the repository code may differ
tol = eps(N);
Ahat = diag(pi)*A;
iter = 0; % Iteration counter
u = ones(N,1); v = ones(N,1); e = v; % Initialize the scaling vectors u and v
while iter < maxit
iter = iter + 1;
u_prev = u; % keep the previous iterates for safekeeping
v_prev = v;
row = Ahat*v;
u = pi./(row); % update u and v
v = pi./(Ahat'*u);
if mod(iter, checkperiod) == 0 % Check if converged
gap = abs(u'*row - 1);
if isnan(gap)
break;
end
if gap <= tol
if norm(diag(u)*A*diag(v)*e-e,"inf") < tol && ...
norm(pi'*diag(u)*A*diag(v)-pi',"inf") < tol
break;
end
end
end
Modified Sinkhorn-Knopp implementation (continued)¶
if any(isinf(u)) || any(isnan(u)) || any(isinf(v)) || any(isnan(v))
warning('DoublyStochasticProjection:NanInfEncountered', ...
'Nan or Inf occured at iter %d. \n', iter);
u = u_prev;
v = v_prev;
break;
end
end
% The matrix we want is built from A as in the Theorem
B = diag(u)*A*diag(v);
end
- In order not to clutter the implementation of the manifold with too many sub-functions, this routine lives in a separate file, modifiedsinkhorn.m.
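A quick hedged check of the routine (it assumes modifiedsinkhorn.m is on the path and passes maxit and checkperiod explicitly): starting from an arbitrary positive matrix, the returned matrix should be row-stochastic with the prescribed stationary vector, which is exactly the property the retraction above relies on.
n = 6; rng(2);
piv = rand(n,1); piv = piv/sum(piv);        % a positive probability vector
Y = rand(n,n) + 0.1;                        % an arbitrary positive matrix
B = modifiedsinkhorn(Y, piv, 1000, 10);
disp(norm(B*ones(n,1) - ones(n,1), "inf"))  % ~ 0: rows sum to one
disp(norm(piv'*B - piv', "inf"))            % ~ 0: piv is the stationary vector of B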
Computational issues, a.k.a. how to solve the linear systems¶
We have to solve several compatible singular linear systems of the form $$ \mathcal{A} \begin{bmatrix} \mathbf{x}\\ \mathbf{y} \end{bmatrix} = \begin{bmatrix} \mathbf{b}\\ \mathbf{c} \end{bmatrix}, ~~~ \mathcal{A} = \begin{bmatrix} I & D_{\boldsymbol{\pi}} S \\ S^T D_{\boldsymbol{\pi}} & \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi}) \end{bmatrix}. $$ To this purpose, we want to use an iterative method of Krylov type, to avoid assembling the $2 \times 2$ block matrix explicitly.
Proposition Given $S \in \mathbb{S}_n^{\boldsymbol{\pi}}$, the $2 \times 2$ block matrix $\mathcal{A}$ is such that
- $\mathcal{A}$ is similar to a singular M-matrix,
- $\lambda(\mathcal{A}) \in \{ 0 \} \cup \left[\frac{\left(\delta^*+1 -\sqrt{\delta^* (\delta^* +4 r^*-2)+1}\right)}{2}, \max\{ 1+\| \boldsymbol{\pi} \|_\infty, 2\| \boldsymbol{\pi} \|_\infty\}\right]$, for $$ r^* = \min_{j=1,\ldots,n} \max_{i = 1,\ldots, n} s_{i,j}, \text{ and } \delta^* = \min_{ \substack{i=1,\ldots,n \\ i \neq k}} \left( S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi} \right)_i, $$ where $k$ is the column attaining the minimum in the definition of $r^*$, i.e., $\displaystyle r^* = \max_{i = 1,\ldots, n} s_{i,k}$; moreover, if $r^*+\delta^*< 1$, then $$\lambda(\mathcal{A}) \in \{ 0 \} \cup \left[ \delta^* \left( 1 - \frac{r^*}{1-\delta^*}\right), \max\{ 1+\| \boldsymbol{\pi} \|_\infty, 2\| \boldsymbol{\pi} \|_\infty\}\right].$$
Possible approaches¶
- Conjugate Gradient (CG) directly on the system,
- LSQR directly on the system
- Solve via CG the system for the Schur complement with respect to the $(1,1)$-block, i.e., we solve instead $$ [ \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi}) - S^T D_{\boldsymbol{\pi}^2} S ]\mathbf{y} = \mathbf{c} - S^T D_{\boldsymbol{\pi}} \mathbf{b}, \quad \mathbf{x} = \mathbf{b} - D_{\boldsymbol{\pi}} S \mathbf{y}. $$ The matrix in the above system is a symmetric irreducible singular M-matrix; therefore it is positive semidefinite, with a simple eigenvalue equal to 0.
In both formulations, a basis of the null space is known, and we can consider the de-singularized version obtained through a rank-1 update of the matrix; this consists of solving the updated linear system $$ \left( \begin{bmatrix} I & D_{\boldsymbol{\pi}} S \\ S^T D_{\boldsymbol{\pi}} & \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi}) \end{bmatrix} + \frac{1}{\boldsymbol{\pi}^T\boldsymbol{\pi} + n} \begin{bmatrix} \boldsymbol{\pi}\\-\mathbf{1} \end{bmatrix} [\boldsymbol{\pi}^T,-\mathbf{1}^T]\right) \begin{bmatrix} \hat{\mathbf{x}}\\ \hat{\mathbf{y}} \end{bmatrix} = \begin{bmatrix} \mathbf{b}\\ \mathbf{c} \end{bmatrix}, $$ or, $$ \left[ \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi}) - S^T D_{\boldsymbol{\pi}^2} S + \sigma\frac{1}{n} \mathbf{1}\mathbf{1}^T \right]\hat{\mathbf{y}} = \mathbf{c} - S^T D_{\boldsymbol{\pi}} \mathbf{b}, \quad \hat{\mathbf{x}} = \mathbf{b} - D_{\boldsymbol{\pi}} S \hat{\mathbf{y}}, \quad \sigma > 0. $$
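To illustrate the Schur-complement route on a small dense example (a hedged sketch; the variable names are ours and sigma is chosen arbitrarily), one can solve the de-singularized reduced system, reconstruct x, and verify that the pair solves the original block system.
n = 5; rng(4);
S = rand(n,n) + 0.1; S = S./sum(S,2);
[V,D] = eig(S'); [~,k] = max(abs(diag(D)));
piv = real(V(:,k)); piv = piv/sum(piv);
Dpi = diag(piv); e = ones(n,1);
Z = randn(n,n); b = Z*e; c = Z'*piv;              % a consistent right-hand side [b; c]
sigma = 0.005;
Mschur = diag(S'*Dpi*piv) - S'*(Dpi^2)*S + sigma*(e*e')/n;   % de-singularized Schur complement
y = Mschur \ (c - S'*Dpi*b);
x = b - Dpi*S*y;
res = [eye(n), Dpi*S; S'*Dpi, diag(S'*Dpi*piv)]*[x; y] - [b; c];
disp(norm(res))                                   % ~ 0: [x; y] solves the original block system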
Possible approaches - Preconditioners¶
To precondition the CG algorithm we start from an empirical observation, since the target matrix $X$ has to be stochastic we expect, as the dimension $n$ of the problem grows, to encounter many small entries. This suggests using a diagonally compensated modified incomplete Cholesky directly on the system matrix. To avoid the presence of nonpositive pivots, the diagonal compensation term can be used to be $\min_{i=1,\ldots,n}\pi_i$. Another strategy consists of first scaling the system \begin{equation*} \begin{split} \left[ I - \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi})^{-\frac{1}{2}} S^T D_{\boldsymbol{\pi}^2} S \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi})^{-\frac{1}{2}} \right] \tilde{\mathbf{y}} = \\ \qquad \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi})^{-\frac{1}{2}} \left( \mathbf{c} - S^T D_{\boldsymbol{\pi}} \mathbf{b}\right), \end{split} \end{equation*} and then recover $$ \mathbf{y} = \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi})^{-\frac{1}{2}} \tilde{\mathbf{y}}, \quad \mathbf{x} = \mathbf{b} - D_{\boldsymbol{\pi}} S \mathbf{y}. $$ By calling $\tilde{S} = \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi})^{-\frac{1}{2}} S^T D_{\boldsymbol{\pi}^2} S \mathrm{diag}(S^T D_{\boldsymbol{\pi}} \boldsymbol{\pi})^{-\frac{1}{2}}$, and observing that $\rho(\tilde{S})$ $=1$, we can use a truncated Neumann series as preconditioner, i.e., $$ P_{k,\tau} = I + \sum_{j = 1}^k \hat{S}^j, \qquad (\hat{S})_{p,q} = \begin{cases} (\tilde{S})_{p,q}, & |\tilde{S}_{i,j}| \geq \tau,\\ 0, & \text{otherwise}. \end{cases} $$
Implementation of mylinearsolve¶
All these ideas can be implemented in the mylinearsolve routine for our MANOPT manifold:
function [alpha, beta] = mylinearsolve(X, b)
switch upper(optionsolve.formulation)
case "BLOCK"
% 2 x 2 block formulation
switch upper(optionsolve.method)
case "CG"
[zeta, flag, res, iter,resvec] = ...
pcg(@(x) mycompute(x,true), b, 1e-6, 2*n,...
[],[],b);
case "PCG"
[zeta, flag, res, iter,resvec] = ...
pcg(@(x) mycompute(x,true), b, 1e-6, 2*n,...
@(x) myprec(x,true),[],b);
case "LSQR"
[zeta, flag, res, iter,resvec] = ...
lsqr(@mycompute, b, 1e-6, 2*n);
case "DIRECT"
S1 = [eye(size(X)) diag(pi)*X;
X.'*diag(pi) diag(X.'*(diag(pi)*pi))];
S1 = S1 + (1/(pi'*pi + n))*[pi;-e]*[pi',e'];
zeta = S1\b;
flag = 0;
iter = 0;
res = norm(S1*zeta-b,2)/norm(b,2);
otherwise
error("Manopt:unknown_linear_solver")
end
Implementation of mylinearsolve (continued)¶
case "SCHUR"
bschur = b(n+1:end,1) - X.'*diag(pi)*b(1:n,1);
% We solve here the system in its reduced formulation
switch upper(optionsolve.method)
case "CG"
[zetaschur, flag, res, iter,resvec] = ...
pcg(@(x) mycompute_schur(x,true), bschur, ...
1e-6, 2*n, [], [], bschur);
case "PCG"
L = ichol(sparse(diag(X.'*(diag(pi)*pi)) - X.'*(pi.^2.*(X))), ...
struct('type','ict', ...
'droptol',optionsolve.threshold, ...
'michol','on', ...
'diagcomp',min(pi)));
[zetaschur, flag, res, iter,resvec] = ...
pcg(@(x) mycompute_schur(x,true), bschur, ...
1e-6, 2*n, L, L', bschur);
case "PCG2"
Donehalf = spdiags(1./sqrt((X.'*(diag(pi)*pi))),0,n,n);
P = (Donehalf*X')*spdiags(pi.^2,0,n,n)*(X*Donehalf);
P(abs(P) < optionsolve.threshold) = 0;
P = sparse(P);
PrecNeu = @(x) neumannseries(P,x,optionsolve.kappa);
[zetaschur, flag, res, iter,resvec] = ...
pcg(@(x) mycompute_schur2(x,Donehalf,true), Donehalf*bschur, ...
1e-6, 2*n, PrecNeu, [], Donehalf*bschur);
zetaschur = diag(1./sqrt((X.'*(diag(pi)*pi))))*zetaschur;
Implementation of mylinearsolve (continued)¶
case "LSQR"
[zetaschur, flag, res, iter,resvec] = ...
lsqr(@mycompute_schur, bschur, ...
1e-6, 2*n);
case "DIRECT"
S1schur = diag(X.'*(diag(pi)*pi)) - X.'*diag(pi.^2)*X + 0.005*ones(n,n)/n;
zetaschur = S1schur\bschur;
flag = 0;
iter = 0;
res = norm(S1schur*zetaschur-bschur,2)/norm(bschur,2);
otherwise
error("Manopt:unknown_linear_solver")
end
% reconstruct the whole vector
zeta = [b(1:n,1) - pi.*(X*zetaschur);...
zetaschur];
otherwise
error("Manopt:unknown_linear_formulation")
end
Implementation of mylinearsolve (continued)¶
Together with the different matrix-vector product routines.
function Ax = mycompute(x,flag)
xtop = x(1:n,1);
xbottom = x(n+1:end,1);
Axtop = xtop + diag(pi)*X*xbottom;
Axbottom = X'*diag(pi)*xtop + diag(pi'*diag(pi)*X)*xbottom;
Ax = [Axtop; Axbottom];
end
function Ax = mycompute_schur(x,flag)
Ax = (X.'*(diag(pi)*pi)).*x - X.'*(pi.^2.*(X*x));
end
function Ax = mycompute_schur2(x,D,flag)
y = D*x;
Ax = x - D*(X.'*(pi.^2.*(X*y)));
end
Implementation of mylinearsolve (continued)¶
And the ones to apply the preconditioners.
function Ax = myprec(x,flag)
xtop = x(1:n,1);
xbottom = x(n+1:end,1);
Axtop = xtop;
Axbottom = diag(X'*diag(pi)*pi)\xbottom;
Ax = [Axtop; Axbottom];
end
function y = neumannseries(P,x,k)
% Apply the truncated Neumann-series preconditioner y = (I + P + ... + P^k)*x
y = x;
t = x;
for j = 1:k
t = P*t; % accumulate the next power of P applied to x
y = y + t;
end
end
alpha = zeta(1:n, 1);
beta = zeta(n+1:end, 1);
end
Back to our sparse matrix example¶
We go back to a sparse matrix example like the one that started this whole business and see if we can preserve the stationary vector.
clear; clc; close all;
load('Sandi_authors.mat');
A = Problem.A;
D = diag(sum(A,2));
A = full(D\A);
n = size(A,1);
e = ones(n,1);
p = 2; % What root do we want?
tol = 1e-6; % Optimization parameters
maxIterations = 1000;
manifold = multinomialfactory(n,n); % These are column stochastic!
X0 = manifold.rand();
clear manifold
We have loaded into memory and row-normalized a sparse matrix example like the one we have already seen previously.
Let us solve it by constrained optimization¶
%% Constrained Optimization
[X1,output1,history1] = approximatepower(A,p,X0,tol,maxIterations,"ANALYTICAL");
figure(1)
semilogy(1:length(history1.targetFunctionValues),history1.targetFunctionValues, ...
'b-','LineWidth',2)
legend('Constrained Optimization')
And now we solve it on the manifold¶
manifold = multinomialfactory(n,n); % These are column stochastic!
problem.M = manifold;
problem.cost = @(x) 0.5*cnormsqfro(mpower(x,p).'-A);
problem = manoptAD(problem);
options.verbose = 0;
options.tolgradnorm = 1e-5;
options.maxiter = 100;
options.verbosity = 0;
[X2, xcost2, info2] = trustregions(problem,X0,options);
- we built the manifold,
- defined the objective function,
- used automatic differentiation to compute the Euclidean gradient and Hessian.
Convergence history¶
figure(1)
semilogy(1:length(history1.targetFunctionValues),history1.targetFunctionValues, ...
'b-',...
[info2.iter],[info2.cost],'b--','LineWidth',2);
legend('Constrained Optimization','Riemannian Trust-Region')
clear manifold problem options
And then we solve it on the new manifold¶
[pi,~] = eigs(A.',1,'largestabs');
pi = pi./sum(pi);
optionsolve = initoptions();
optionsolve.method = "direct";
optionsolve.verbose = false;
M = multinomialfixedstochasticfactory(pi,optionsolve);
X0 = M.rand(); % Initial guess
problem.M = M;
problem.cost = @(x) 0.5*cnormsqfro(mpower(x,p)-A);
problem = manoptAD(problem);
options.tolgradnorm = 1e-5;
options.verbosity = 1;
options.maxiter = 100;
[X3, xcost3, info3, options] = trustregions(problem,X0,options);
Max iteration count reached; options.maxiter = 100. Total time is 3.443367 [s] (excludes statsfun)
Convergence history¶
figure(1)
semilogy(1:length(history1.targetFunctionValues),history1.targetFunctionValues, ...
'b-',...
[info2.iter],[info2.cost],'b--', ...
[info3.iter],[info3.cost],'r-','LineWidth',2);
legend('Constrained Optimization','Riemannian Trust-Region','RTS preserving steady state')
Looking at the steady state¶
[pi3,~] = eigs(X3.',1,'largestabs');
pi3 = pi3/sum(pi3);
figure(2)
semilogy(1:length(pi),pi,'o',1:length(pi3),pi3,'x')
The reducible case¶
We consider here an edge case for our approach: the chain is reducible, with two communicating classes; in other words, the stationary distribution is $\boldsymbol{\pi} = [0,\ldots,0,1]^T$.
- Application: R. A. Jarrow, D. Lando, and S. M. Turnbull, A Markov Model for the Term Structure of Credit Risk Spreads, The Review of Financial Studies, 10 (1997), pp. 481–523.
The Markov chain models the dynamics of the different credit rating states, i.e., evaluations of the relative ability of an entity or obligation to meet its financial commitments, and describes the probability of transitioning between these states over time.
An example¶
| data(i,j) | AAA | AA | A | BBB | BB | B | CCC | D |
|---|---|---|---|---|---|---|---|---|
| AAA | 0.891 | 0.0963 | 0.0078 | 0.0019 | 0.003 | 0 | 0 | 0 |
| AA | 0.0086 | 0.901 | 0.0747 | 0.0099 | 0.0029 | 0.0029 | 0 | 0 |
| A | 0.0009 | 0.0291 | 0.8894 | 0.0649 | 0.0101 | 0.0045 | 0 | 0.0009 |
| BBB | 0.0006 | 0.0043 | 0.0656 | 0.8427 | 0.0644 | 0.016 | 0.0018 | 0.0045 |
| BB | 0.0004 | 0.0022 | 0.0079 | 0.0719 | 0.7764 | 0.1043 | 0.0127 | 0.0241 |
| B | 0 | 0.0019 | 0.0031 | 0.0066 | 0.0517 | 0.8246 | 0.0435 | 0.0685 |
| CCC | 0 | 0 | 0.0116 | 0.0116 | 0.0203 | 0.0754 | 0.6493 | 0.2319 |
| D | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
These were obtained from a statistical model fitted to annual data; the point of computing a root is that we would like information on the transitions over shorter time horizons.
Idea: let's "PageRank"¶
In a similar guise to the Page Rank problem, we apply a perturbation to the data matrix to make it irreducible, as $$ \tilde{A} = (1-\gamma) A + \gamma(\mathbf{1}\mathbf{1}^T)/n, \qquad 0 < \gamma \ll 1, $$ which admits stationary distribution $\tilde{\boldsymbol{\pi}}>0$ with respect to which we can construct the manifold $\mathbb{S}_n^{\tilde{\boldsymbol{\pi}}}$. The approximate root can then be computed by solving for \begin{equation*} \min_{X \in \mathbb{S}_n^{\tilde{\boldsymbol{\pi}}}} \frac{1}{2}\|X^2 - A \|_F^2. \end{equation*} For $\gamma = 10^{-4}$, we obtain $\tilde{\boldsymbol{\pi}} \approxeq [0.0002, 0.0007, 0.0012, 0.0009, 0.0005, 0.0006, 0.0001, 0.9957]^T$.
Code¶
clear all;
data = [0.8910 0.0963 0.0078 0.0019 0.0030 0.0000 0.0000 0.0000
0.0086 0.9010 0.0747 0.0099 0.0029 0.0029 0.0000 0.0000
0.0009 0.0291 0.8894 0.0649 0.0101 0.0045 0.0000 0.0009
0.0006 0.0043 0.0656 0.8427 0.0644 0.0160 0.0018 0.0045
0.0004 0.0022 0.0079 0.0719 0.7764 0.1043 0.0127 0.0241
0.0000 0.0019 0.0031 0.0066 0.0517 0.8246 0.0435 0.0685
0.0000 0.0000 0.0116 0.0116 0.0203 0.0754 0.6493 0.2319
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000];
A = diag(sum(data,2))\data;
gam = 1e-4;
E = ones(size(A))./size(A,1);
B = gam*E + (1-gam)*A;
pi = pvgth(B); % Computation of the stationary vector:
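Here pvgth is a utility from the repository; it presumably computes the stationary vector with the subtraction-free GTH (Grassmann-Taksar-Heyman) procedure, well suited to nearly reducible chains. A minimal sketch of such a routine (the actual pvgth may differ in its interface and details) is:
function piv = gth_sketch(P)
% GTH_SKETCH stationary vector of an irreducible stochastic matrix, GTH-style
n = size(P,1);
for k = n:-1:2
s = sum(P(k,1:k-1));                       % = 1 - P(k,k), computed without subtraction
P(1:k-1,k) = P(1:k-1,k)/s;
P(1:k-1,1:k-1) = P(1:k-1,1:k-1) + P(1:k-1,k)*P(k,1:k-1);
end
piv = zeros(n,1); piv(1) = 1;
for k = 2:n
piv(k) = P(1:k-1,k).'*piv(1:k-1);          % back-substitution
end
piv = piv/sum(piv);
end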
Solve it on the manifold¶
% Build the manifold
optionsolve = initoptions();
optionsolve.correction = true;
optionsolve.verbose = false;
M = multinomialfixedstochasticfactory(pi,optionsolve);
X0 = diag(sum(triu(ones(size(A))),2))\triu(ones(size(A))); % Initial guess
X0 = gam*E + (1-gam)*X0;
X0 = modifiedsinkhorn(X0,pi,100);
problem.M = M;
problem.cost = @(x) 0.5*cnormsqfro(mpower(x,2)-A);
problem = manoptAD(problem);
options.tolcost = 1e-3;
options.verbosity = 0;
options.Delta_bar = 22;
[X1, xcost, info, options] = rlbfgs(problem,X0,options);
Constrained optimization and visualization¶
rA = approximatepower(A,2,X0,1e-6,1000);
figure("Position",[3103 842 1113 365])
subplot(1,2,1)
heatmap(X1)
title("Riemannian L-BFGS")
subplot(1,2,2)
heatmap(rA);
title("Constrained Optimization (Trustregion)")