pydigree.stats.mixedmodel package¶
Submodules¶
pydigree.stats.mixedmodel.likelihood module¶
Functions for computing likelihoods of linear mixed models
-
class
pydigree.stats.mixedmodel.likelihood.ML(mm, starts=None, info='fisher')¶ Bases:
pydigree.stats.mixedmodel.likelihood.MixedModelLikelihood-
ai_element(V_i, V_j)¶ An element of the average information matrix (observed+expected)/2
Parameters: - V_i – variance-covariance matrix of the first component
- V_i – matrix
- V_j – variance-covariance matrix of the second component
- V_j – matrix
Returns: derivative
Return type: float
-
expectation_maximization()¶ Performs a round of Expectation-Maximization ML
-
fisher_element(V_i, V_j)¶ An element of the expected information matrix (Fisher matrix)
Parameters: - V_i – variance-covariance matrix of the first component
- V_i – matrix
- V_j – variance-covariance matrix of the second component
- V_j – matrix
Returns: derivative
Return type: float
-
gradient_element(V_i)¶ Derivative of the full loglikelihood with regard to a variance component
Parameters: - V_i – variance-covariance matrix of the component
- V_i – matrix
Returns: derivative
Return type: float
-
hessian_element(V_i, V_j)¶ Second derivative of the full loglikelihood with regard to two variance components
Parameters: - V_i – variance-covariance matrix of the first component
- V_i – matrix
- V_j – variance-covariance matrix of the second component
- V_j – matrix
Returns: derivative
Return type: float
-
loglikelihood()¶ Full loglikelihood of the model
-
observed_element(V_i, V_j)¶ An element of the observed information matrix (-1 * hessian)
Parameters: - V_i – variance-covariance matrix of the first component
- V_i – matrix
- V_j – variance-covariance matrix of the second component
- V_j – matrix
Returns: derivative
Return type: float
-
-
class
pydigree.stats.mixedmodel.likelihood.MixedModelLikelihood(mm, starts=None, info='fisher')¶ Bases:
objectA class describing the state of a mixed model likelihood function being maximized
-
gradient()¶ The gradient of the likelihood function with regard to each variance component
Returns: gradient Return type: np.array
-
info_matrix(kind=None)¶ The matrix of second deriviatives for each variance component
Parameters: kind (string) – the kind of information matrix required Return type: 2d matrix
-
set_info(info)¶ Change which information matrix is used in fitting
-
set_parameters(params)¶ set the parameters
-
-
class
pydigree.stats.mixedmodel.likelihood.REML(mm, starts=None, info='fisher')¶ Bases:
pydigree.stats.mixedmodel.likelihood.MixedModelLikelihoodA likelihood function for mixed effects models that does not consider variance due to fixed effects, which may bias the optimal variance component estimates
-
ai_element(PV_i, PV_j)¶ Element of the average information matrix
-
average_information_matrix()¶
-
expectation_maximization()¶ Performs a round of Expectation-Maximization REML
-
fisher_element(PV_i, PV_j)¶ Element of the expected information matrix
-
fisher_information_matrix()¶
-
gradient()¶ Return the gradient for restricted loglikelihood
-
hessian_element(PV_i, PV_j)¶ Second derivative of REML function w.r.t two variance components
-
info_matrix(kind=None)¶ The matrix of second deriviatives for each variance component
Parameters: kind (string) – the kind of information matrix required Return type: 2d matrix
-
loglikelihood()¶ Returns the restricted loglikelihood for mixed model variance component estimation.
References:
Harville. ‘Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems’ Journal of the American Statistical Association. (1977) (72):258
-
observed_element(PV_i, PV_j)¶ Element of the observed information matrix (i.e. -hessian)
-
-
pydigree.stats.mixedmodel.likelihood.full_loglikelihood(y, V, X, beta, Vinv=None)¶ Returns the full loglikelihood of a mixed model
Ref: SAS documentation for PROC MIXED
-
pydigree.stats.mixedmodel.likelihood.inv(mat)¶ Inverts a dense matrix or makes a sparse matrix dense and inverts that
Parameters: mat – matrix Returns: matrix inverse Return type: dense matrix
-
pydigree.stats.mixedmodel.likelihood.logdet(mat)¶ Returns the (positive) log determinant of a matrix.
Parameters: mat (matrix) – matrix to calculate Returns: log-determinant of mat Return type: float
-
pydigree.stats.mixedmodel.likelihood.makeP(X, Vinv)¶ Makes the P matrix commonly found in mixed model estimation
-
pydigree.stats.mixedmodel.likelihood.makeVinv(V)¶
pydigree.stats.mixedmodel.maximization module¶
-
class
pydigree.stats.mixedmodel.maximization.MLEResult(parameters, restricted_loglikelihood, method, jacobian=None, hessian=None, full_loglikelihood=None)¶ Bases:
objectAn object representing the result of maximization of a mixed model
-
pydigree.stats.mixedmodel.maximization.expectation_maximization(mm, likelihood, starts=None, maxiter=10000, tol=0.0001, return_after=1e+30, verbose=False)¶ Maximizes a linear mixed model by Expectation-Maximization
Formulas for EM-REML are given in Lynch & Walsh, Ch 27, Example 5 (pg. 799)
Unlike the Newton-type algorithms, EM only makes use of the first derivative of the loglikelihood function. The presence of second derivatives means that Newton-type maximization will converge very quickly, since it works on a better approximation of the likelihood surface.
EM tends to converge VERY slowly, because the changes at every step are so small. For example, a model that took 3 iterations/0m3.927s to converge with AI-REML took 52 iterations/0m32.803s with EM-REML. Individual EM iterations are relatively fast because you don’t have to compute the Hessian (or an approximation of it). But since you have to invert the variance-covariance matrix of observations each iteration regardless, time adds up quickly.
However, EM has the nice property that it monotonically converges to a maximum, and avoids the parameter esimtate out-of-range problems occasionally found with Newton-type methods and have to be remedied as a special case.
Parameters: - mm (MixedModel) – a model to be maximized
- likelihood – A likelihood object
- starts (numpy array) – starting values
- maxiter (integer) – The maximum number of iterations of scoring before raising an error
- tol (float) – The minimum amount of change in the proportion of variance by any of the variance components to continue iterating.
- verbose (bool) – Print likelihood, variance component, and relative variance estimates at each iteration. Useful for debugging or watching the direction the optimizer is taking.
- return_after – return estimates after n iterations, regardless of whether or not the model converged.
Returns: variance components at the MLE
Return type:
-
pydigree.stats.mixedmodel.maximization.grid_search(mm, likelihood, nevals=10, oob=False)¶ Grid searches a likelihood function over a range of variance component values
Parameters: - mm (MixedModel) – A model to be maximized
- likelihood – a MixedModelLikelihood object (i.e. ML, or REML)
- nevals (integer) – number of evaluations over each component
- oob (bool) – Evaluate out of bounds arguments (e.g. sum(vcs) > var(y))
Returns: variance components at the best node in the grid
Return type:
-
pydigree.stats.mixedmodel.maximization.newtonlike_maximization(mm, likelihood, maxiter=250, tol=0.0001, scoring=10, verbose=False)¶ Updates variance components for a linear mixed model by an iterative scheme to find the restricted maximum likelihood estimates of the variance components in the model.
- Models are fit by iterating the following equation:
- theta_(i+1) = theta_i - inverse(J(theta_i)) * S(theta_i)
- Where:
- theta_i = A vector of the estimated variance components at
- iteration i
S(theta): The score function (gradient) of the REML loglikelihood J(theta): The information matrix (the matrix of second derivatives
of the REML loglikelihood with regard to each of the variance compents in theta)
In all optimization schemes in this function S(theta) is the gradient of the REML loglikelihood function evaluated at the current estimates of theta.
The information matrix J(theta) is a q x q square matrix for the q predictors in the model. There are a few information matrices available for use, specified by the argument method. ‘Newton-Raphson’ uses the observed information matrix (the negative Hessian). This is the most complicated to calculate, is very sensitive to starting values, and can occasionally have numerical issues. The method value ‘Fisher scoring’ uses the Fisher Information Matrix (expected value of negative hessian), which is simpler to calculate. This is a common way of fitting mixed model and is the default method for this optimizer. The last (‘Average Information’) uses the average of the Fisher Information matrix and Observed Information matrix. This is a common approach in the animal breeding literature. Averaging the two results in the elimination of a time consuming trace term, making this the fastest method in terms of time per iteration, though it may require a few more iterations than Fisher scoring or Newton-Raphson.
When the change in the proportion of variance explained by each variance component after iteration falls below tol for every variance component, iteration stops and the estimated variance components are returned. Setting the tolerance based on the proportion has the effect of standardizing tolerances over any amount of variance.
Occasionally, Newton-type algorithms will push the estimated values of the variance components to invalid levels (i.e. below zero, above the total variance of the outcome variable). Outside the valid range, the loglikelihood surface becomes ill-conditioned and the optimizer may not return back to valid parameter estimates. This is especially true for Fisher scoring and AIREML when the true value of a variance component is close to the border of valid estimates. The information matrices used by Fisher Scoring and AIREML, being approximations to the Hessian, can put the iteration estimates of the variance components outside the valid space. The parameter constrained enforces validity of variance component estimates in two ways: likelihoods must be monotonically increasing, and variance component estimates must be in the valid range. If the change in estimates violates either of these, a line search between that change and changes in the same direction but exponentially decreasing in magnitude is performed until a valid set of estimates is met.
The scoring argument allows you to run a few iterations of Fisher scoring or AI-REML to get close to the maximum, then switch over to Newton-Raphson to end quicker.
Parameters: - mm (MixedModel) – model to be maximized
- likelihood (Likelihood) – a Likelihood object
- maxiter (integer) – The maximum number of iterations of scoring before raising an error
- tol (float) – The minimum amount of change in the proportion of variance by any of the variance components to continue iterating.
- scoring (integer) – Number of iterations of Fisher Scoring or AI-REML before switching to Newton-Raphson. If already using Newton-Raphson, this argument has no effect.
- verbose (bool) – Print likelihood, variance component, and relative variance estimates at each iteration. Useful for debugging or watching the direction the optimizer is taking.
Returns: An MLE object of the variance components at the MLE
-
pydigree.stats.mixedmodel.maximization.scoring_iteration(info_mat, gradient)¶ Performs an iteration for a Newton-type maximization algorithm
Parameters: - info_mat (2d numpy array) – A matrix of second derivatives (or approximations of it) at the current parameter estimates
- gradient (1d numpy array) – A vector containing the gradient at the current parameter estimates
Returns: change in parameters for the current iteration.
Return type: numpy array
Raises: LinAlgError if the information matrix is singular
pydigree.stats.mixedmodel.mixedmodel module¶
A linear mixed effects model
-
class
pydigree.stats.mixedmodel.mixedmodel.MixedModel(pedigrees, outcome=None, fixed_effects=None, random_effects=None, covariance_matrices=None, only=None)¶ Bases:
object- Fits linear models in the form of y = X * b + sum(Z_i * u_i) + e, where:
- y is the vector of outcomes X is a design matrix of fixed effects b is the vector of coefficients correstponding to those fixed effects Z_i is an incidence matrix corresponding to random effect i u_i is a vector of values corresponding to random effect i e is a vector of errors
-
P¶ Projection matrix
-
R¶ Covariance matrix of the residual variance
-
add_fixed_effects(effect)¶ Adds a fixed effect to the model
-
add_genetic_effect(kind='additive')¶ Adds a genetic effect to the model as a random effect
Parameters: kind ('additive' or 'dominance') – type of effect to add
-
add_random_effect(effect)¶ Adds a random effect to the model
-
bic¶ Calculates the Bayesian Information Criterion (BIC) for the model
Return type: float
-
blup(idx)¶ Get the BLUPs for a random effect
Parameters: idx (int) – index of effect Return type: np.array
-
clear_model()¶ Clears all parameters from the model
-
copy()¶ Returns a deep copy of the model
Returns: A copy of the model object Return type: MixedModel
-
covariance_matrices¶ The covariance matrices associated with each random effect
Return type: list of matrices
-
df¶ The number of observations minus the number of fixed effects, minus the number of non-residual random effects
Return type: integer
-
fit_model()¶ Builds X, Z, Y, and R for the model
Returns: void
-
loglikelihood(restricted=False, vcs=None, vmat=None)¶ Returns the loglikelihood of the model with the current model parameters
Returns: loglikelihood Return type: float
-
maximize(method='Average Information', restricted=False, starts=None, verbose=False)¶ Finds the optimal values for variance components in the model using provided optimization methods.
Parameters: - restricted (bool) – Uses REML estimation
- starts (iterable of numerics) – starting values for the variance components
- method (string) – maximization method
- verbose (bool:) – output maximization progress
-
maximized¶ Has the model been maximized?
Return type: bool
-
nobs()¶ Returns: the number of fully observed individuals in the model Return type: integer
-
observations()¶ Returns a list of the fully observed individuals in the model Fully observed individuals have observations for each fixed effect and and observation for the outcome variable.
Returns: the fully observed individuals Return type: list of Individuals
-
residual_variance()¶ Returns the variance in y not accounted for by random effects
Return type: float
-
set_outcome(outcome)¶ Sets the outcome for the mixed model
-
set_variance_components(variance_components)¶ Manually set variance components for each random effect in the model. Useful if you know a priori, say a heritability, and just want to predict breeding values for the trait.
Parameters: variance_components (iterable of numerics) – variances associated with each random effect
-
summary()¶ Prints a summary of the current model
Return type: void
-
variance_components¶ The current variances associated with each random effect
Return type: list of floats
-
class
pydigree.stats.mixedmodel.mixedmodel.RandomEffect(individuals, label, variance=None, incidence_matrix=None, covariance_matrix=None, levels=None)¶ Bases:
objectA random effect in a mixed model
-
G¶ Convenience property for returning the covariance_matrix
-
V_i¶
-
Z¶ Convenience property for returning the incidence matrix
-
covariance_matrix¶
-
incidence_matrix¶
-
label¶
-
levels¶
-
nlevels¶ The number of levels of the random effect
Return type: int
-
sigma¶ Convenience property for returning the variance of the component
Return type: float
-
variance_component¶
-
-
pydigree.stats.mixedmodel.mixedmodel.inv(M)¶ Invert a matrix. If sparse, convert to dense first
-
pydigree.stats.mixedmodel.mixedmodel.is_genetic_effect(effect)¶ Is this effect a genetic effect?
Return type: bool
-
pydigree.stats.mixedmodel.mixedmodel.make_incidence_matrix(individuals, effect_name)¶
pydigree.stats.mixedmodel.mixin module¶
A mixin class for building mixed models
-
class
pydigree.stats.mixedmodel.mixin.MixedModelMixin¶ Bases:
objectMixin class for populations to implement mixed model functions
-
incidence_matrix(variable=None, inds=None, onlylevels=None)¶ Generates an incidence matrix for random effect in a mixed model based on variable. If no variable is given, the individual is used (i.e. an identity matrix).
Parameters: - variable – phenotype to form the matrix for
- inds – only use these individuals
- onlylevels – only use these levels of the random effect
-