Pseudomarker - Tutorial
Genevieve Monsees and Saunak Sen
Introduction
This is a step-by-step tutorial for performing QTL data analysis using Pseudomarker tools written in Matlab. As an example we will use the data on salt-induced hypertension in mice collected by Dr. Bev Paigen at the Jackson Laboratory. The backcross experiment between the C57BL/6J strain and A/J strain had 250 mice whose blood pressure was measured.
First steps
Once your data is clean, in the proper format, and stored under a common directory (see Pseudomarker data format for details), you are ready to begin your analysis. You will also have to make sure you know which directory the Pseudomarker programs were saved in.
Invoke Matlab by double-clicking the icon (in Windows on on the Mac) or by typing into a command shell (if you are in UNIX).
The first step in setting up your analysis script is to inform your
Matlab command window which directories you will be accessing by adding
a path to those directories. In our case, we need to add a path to the
directory in which the Pseudomarker programs are stored (for example
c:\pseudomarker). We can do this using the addpath
command:
addpath c:\pseudomarker;
You need to set the path only once each time you invoke Matlab. (Note: Placing a %
at the beginning of a line will comment the line out.)
In order to access our data files we will change directory to where the data is located by the following command
cd c:\qtl\data\hypertensionassuming that the data is in the directory
c:\qtl\data\hypertension
in the pseudomarker format.
To find out what version of Pseudomarker code you are using, type into the command window:
pseudomarkerReading data
Reading the data into the command window
Now that Matlab knows where to look for the programs, we want to read the data into the workspace. To do this we use the readdata
command. The function readdata
will output two variables, which we are calling y and mdata. While y is a matrix containing the information in the file pheno.dat
,
mdata is a 1x20 array of structures containing the information on the
marker data organized by chromosome. See the section on the data
structures for more information. The final code using the above
examples would read:
Placing a semicolon at the end of the command line prevents Matlab from displaying the output in the command window. Once we have created y and qtldata, we can identify our phenotype and covariate (if we have one).
Identify the Phenotype and Covariate
Now that all the measurements made on each animal are in the matrix y, it is time to separate out the phenotype we want to study and covariates. In the hypertension data, there are no covariates. To identify our phenotype, we state what we would like it to be called (bp) and then specify that it is all rows (noted as : ) within a column of y (for example column 1).
bp = y(:,1);If a covariate is used, it is identified in the same manner. First we state what we would like it to be called, say covar and then specify that it is all rows ( : ) within a column of y (for example column 2), then the syntax would look like:
% do not use the following step for hypertension data% covar = y(:,2);
If a covariate is not being used and this is the case with the hypertension data, we inform Matlab by creating an empty covariate matrix covar.
covar=[];Matlab is now fully aware of all of our original data.
Basic plots
As they say: Garbage in, garbage out. For the data analysis to be meaningful we have to make sure that data integrity has not been compromised, whether there are any artifacts in the data, and if any transformations are needed. This is best done with the help of some pictures. The most basic of them are scatterplots and histograms. In Matlab this can be accomplished by giving the command
plotmatrix( y )We can also use a butterfly plot of the blood pressure phenotype by giving the command
figure; butterfly( {bp}, 1, 'Blood pressure' );Another diagnostic plot is to plot the raw genotypes. We prefer to plot the genotype after sorting the mice according to the phenotype. This way we can see patterns. figure; plotgeno( mdata, 'genolabel', {'B6','Het','Missing'} )
Now we can take our first steps towards analysis.
Imputation step
The goal of the imputation step is to create multiple sets of fake
marker data that are compatible with the observed marker genotype data.
These will be stored in an array of structures similar to the observed
marker data structure. For exploratory scanning, we use about 16 sets
of fake marker genotypes that are separated from each other by about
10cM. The function called impute
does this for us.
Assuming we want to store the imputed genotypes in the array of
structures, fake here is what you have to enter:
Type
help impute
for more options and help about the impute
function.
If you wish to impute just the first four chromosomes and the seventh, you would type instead
fake = impute( mdata, 0.1, 16, [ 1:4 7 ] );This means, "Impute 16 sets of fake marker data compatible with observed genotypes in mdata on chromosomes 1 through 4 and 7. The spacing between the pseudomarkers should be 0.1 Morgans."
Exploratory scans
On to the analysis!
Now that we have created a fake, we can begin to calculate some of the basic quantities. All three functions which we are about to introduce will require bp, covar, mdata, fake, and obsdata as input, and will output structure arrays. For details about the structures see the section on data structures.
Mainscan
The function mainscan
will run a one dimensional genome scan and store the result in a structure array. We will name this array of structures lod1.
Plotting a Mainscan
To visualize the output of mainscan
we can use the command plotmainscan
. Optionally we can indicate the chromosomes we wish to plot, and the type of plot we desire.
The chromosomes we wish to plot must be enclosed in square brackets ([]) and separated by colons (for example [1:3:4]). If we wish to plot all chromosomes, this item can be omitted from the command line. The default is to plot all chromosomes.
There are three options for the type of plot: 'lod'
, 'lik'
, 'F'
and 'prop'
. The default is 'lod'
which plots the LOD score. The option 'prop'
will plot the proportion of variance explained, 'loglik'
uses the natural base for the logarithm, and 'F'
uses F statistics. If we wish to plot the LOD score, this item can be
omitted from the command line. If we choose to plot the LOD score for
chromosomes 1, 3, and 4 our code would look like:
However, if we chose to plot the proportion of variance for all chromosomes, we just have to enter:
plotmainscan(lod1,'prop');Running a Pairscan
The function pairscan
will run a pairwise
(two-dimensional) genome scan for both additive and interaction
effects. The syntax is almost identical to that of mainscan
.
If we call the resulting structure array lod2 we will enter:
Plotting a Pairscan
The function plotpairscan
produces a plot of the pairwise effects. It has similar options as its one-dimensional counterpart plotmainscan
. By default, the command
will plot the results of a two-dimensional genome scan using a LOD score ('lod'
)
scale. This plot is more complicated than the result of the
one-dimensional scan. The plot has two halves. The lower right-hand
triangle shows the LOD score for a two-locus model including interactions,
while the upper left-hand triangle shows the LOD score for the
interaction term alone. For visual parity, the upper tiriangle is
inflated by a factor of three in a backcross experiment. For intercross
experiments the inflation factor is two.
We would look for peaks in the lower right half plot of the pairscan
results to identify major pairs of loci. Blips in the upper left half of the plot indicate evidence for interactions.
If we chose to plot the proportion of variance explained for chromosomes 1, 3, and 4, we can use the subscripting mechanism of Matlab to write:
plotpairscan(lod2([1 3 4],[1 3 4]),'prop');Note that the chromosomes we want have to be listed twice because we have to subscript the appropriate rows and columns of the pairwise scan data structure.
The procedure for adding a title and axis labels is the same as in mainscan
, although they have different meaning in this case.
Model selection
Now that we have a rough idea of the results from the plots of mainscan
and pairscan
, we have to find out which loci are worth a closer look. Statistically, this is a hard problem. Here is how we do it.
Permutation tests
First we run permutation tests on the results of mainscan
and pairscan
to find out which loci (or pairs of loci) stand out. The following
commands will run permutation tests for the one-dimensional and
two-dimensional scans:
maxlod2 = permutest2(bp,[],[],fake,100);
In the above commands 1000 permutation tests were run for the mainscan
and 100 were used for the pairscan
.
Because the permutation tests may take a while to run, we recommend
using a small of permutations to find out how long the program will
take for a larger number of permutations. We generally use 1000
permutations for mainscan
and 100 permutations for pairscan
. The permutest2
takes a long time to run, so be careful what number of permutations you give it to do.
Setting the cutoff
After the permutations have finished, we have to extract the thresholds to use for subsequent use.
% gives 90% 95% and 99% cutoffs for mainscanthresh1 = thresh(maxlod1)
% gives 90% 95% and 99% cutoffs for pairscan
thresh2 = thresh(maxlod2)
For the one-dimensional scans, the thresholds are the upper percentiles of the maxima of the LOD score obtained by permuting the phenotype column. For two-dimensional scans, the permutation tests give two percentiles. The first column gives the percentiles of the maximum LOD score comparing a two-QTL model with interaction term to an null model. The second column gives the percentiles of the maximum of the LOD score testing for the interaction term for every two-QTL model fitted.
These cutoff values are in the LOD scale. To convert them to the propvar
scale we have to use a formula or use the function lod2propvar
. For example to convert the one-dimensional cutoff to the proportion of variance scale, use
n = length( ~isnan( bp ) );
lod2propvar( cut1, n );
Using a flagging routine
After seeting the cutoffs, we can use the flagging routines that will identify the important loci. Suppose we choose the 95% cutoffs.
% get 95% mainscan cutoffcut1 = thresh1(2);
% get 95% pairscan cutoff for full model vs null model
cut2 = thresh2(2,1);
The function,
reportscan
flags loci that are important based on one-dimensional and
two-dimensional scans. To flag single loci based on one-dimensional
scans use
reportscan(lod1,cut1);To flag pairs of loci based on the results of one- and two-dimensional scans use reportscan(lod1, lod2, [ cut2 0.001 0.005 ] )
This function looks at all pairs of loci. A locus pair is deemend interesting if the LOD score at the locus pair comparing the full model to the null model exceeds the two-dimensional thresholds based on permutation tests as set by
cut2
.
For all interesting locus pairs, we make secondary tests for the
interaction term and additive effects at levels 0.001 and 0.005
respectively.
The reportscan
function will give a set of pairs of
loci that are deemed worthwhile. We can then see what the allelic
effects of those loci are and then fit a multiple-QTL model with all
the loci in them.
Plotting allelic effects
First, we select the pseudomarkers closest to the loci flagged. selectfake = subsetgeno( fake, 'chrid', [ 1 4 6 15 ], ...'mpos', [ 70 30 70 20 ] );
Next we plot the allelic effects of selected loci using the function
alleleplot
.
figure;subplot(2,2,1)
alleleplot( bp, selectfake(1).igeno, 'Chr1QTL', {'B6','Het'} )
subplot(2,2,2)
alleleplot( bp, selectfake(2).igeno, 'Chr4QTL', {'B6','Het'} )
subplot(2,2,3)
alleleplot( bp, selectfake(4).igeno, 'Chr15QTL', {'B6','Het'} )
subplot(2,2,4)
alleleplot( bp, selectfake(3).igeno, selectfake(4).igeno, ...
{'Chr6QTL','Chr15QTL'}, {'B6','Het'} )
Note that all the QTL on chromosomes 1, 4 and 15 have strong main effects. The interaction between loci on chromosomes 6 and 15 can be seen since the locus on chromosome 15 has an effect only when the locus on chromosome 6 is heterozygous.
Type III analysis
We can also do a "Type III" analysis using the general "scan" function. First we will fit the largest model with main effects on chromosomes 1, 4, 6 and 15, and a 6x15 interactions. Then we will drop each of these and see how much the LOD score drops as a result. These are accomplished using the generalscan
function.
% Fit full model:[lod0,bf0] = scan( bp, [], [], selectfake, [ 1 4 6 15; 1 1 1 1 ], ...
struct( 'twoint', [ 3 4 ] ) );
% Drop QTL on chr1:
[lod1,bf1] = scan( bp, [], [], selectfake, [ 4 6 15; 1 1 1 ], ...
struct( 'twoint', [ 2 3 ] ) );
% Drop QTL on chr4
[lod4,bf4] = scan( bp, [], [], selectfake, [ 1 6 15; 1 1 1 ], ...
struct( 'twoint', [ 2 3 ] ) );
% Drop 6x15 interaction
[lod6x15,bf6x15] = scan( bp, [], [], selectfake, [ 1 4 6 15; 1 1 1 1 ] );
% Drop drop QTLs on chr6 and chr15
[lod6n15,bf6n15] = scan( bp, [], [], selectfake, [ 1 4; 1 1 ] );
Here are the LOD scores for the comparisons: % Compare models to the largest model:
comparison_lods = [ lod0-lod1 lod0-lod4 lod0-lod6x15 lod0-lod6n15 ]
Now we calculate the nominal p-values corresponding to these comparisons: % Calculate nominal p-values associated with those terms
comparison_pvalues = [ 1-chi2cdf(2*log(10)*(lod0-lod1),1)...
1-chi2cdf(2*log(10)*(lod0-lod4),1)...
1-chi2cdf(2*log(10)*(lod0-lod6x15),1)...
1-chi2cdf(2*log(10)*(lod0-lod6n15),1) ]
We can see that all terms in the model do contribute and we keep them. So our selected model has QTL on chromosomes 1, 4, 6, and 15. The last two loci interact.
Fine mapping
Once we have narrowed down the list of chromosomes that harbor QTLs, how many of them on each chromosome, and whether they interact with each other, it is time to estimate the genetic model and the location of the QTLs. The first step in this process would be to run scanning programs at higher resolutions for the chromosomes of interest. For example, we could place the pseudomarkers spaced 2cM apart and use 128 fake datasets instead of 16 which is used for a 10cM pseudomarker map.
bigfake = impute( mdata, 0.02 );biglod1 = mainscan( bp, [], [], bigfake );
biglod2 = pairscan( bp, [], [], bigfake );
The imputation step and the scanning steps will take much longer than those for the exploratory scans. Either increase the resolution and the number of imputations gradually, or be patient!
Single QTL on a chromosome
When the evidence indicates that there is only one QTL on a
chromosome, we can make a plot of the posterior distribution of the QTL
location. This is done by using the plotmainlocalize
function.
Two QTL on a chromosome
If there is reason to believe that there are two QTL on a single
chromosome, then we can make a plot of the joint posterior distribution
of two QTL on the chromosome using the function plotpairlocalize
. We can use
This will draw the contours of confidence regions for the two QTL on chromosome 1 at levels 50%, 95% and 99% levels. We find these settings useful; feel free to experiment with other choices of levels for the confidence regions.
Estimating effects
Single QTL on a chromosome
To estimate the effects of the QTL on each chromosome and their standard errors, one can use the estimate
family of functions. These give the approximate the posterior means and
variances of the estimated effects. For example if there is one QTL on
chromosome 4, then to estimate its effect we would use:
obsdata = find( ~isnan(bp) );
pheno = bp(obsdata);
igeno = fake(4).igeno(obsdata,:,:);
[mm,vv] = oneestimate( pheno, [], igeno, 'bc' );
In this case mm and vv are the posterior means and variances of the estimated effects of the QTL on chromosome 4.
Two QTL on a chromosome
If there are more than one QTL on a chromosome, then we will have to use different estimate
functions. For esample, if there are two additive QTL on chromosome 4 we will use
igeno = fake(4).igeno(obsdata,:,:);
[mm,vv] = twoestimate( pheno, [], igeno, 'bc', 'a+b' );
If they were interacting, we would use [mm,vv] = twoestimate( pheno, [], igeno, 'bc', 'a*b' );
The last argument is the model argument which is either
'a+b'
or 'a*b'
depending on whether we are estimating additive effects or the full model including interactions.
Two QTL on different chromosomes
If there are more than two QTL on different chromosomes, then we use twoestimate2
. For example, if there are two additive QTL on chromosome 1 and 4 we will use
igeno1 = fake(1).igeno(obsdata,:,:);
igeno4 = fake(4).igeno(obsdata,:,:);
[mm,vv] = twoestimate2( pheno, [], igeno1, igeno4, 'bc', 'a+b' );
If they were interacting, we would use [mm,vv] = twoestimate2( pheno, [], igeno1, igeno4, 'bc', 'a*b' );
More than two QTL
For this we have to use the functions threeestimate
, threeestimate2
, and threeestimate3
functions. The syntax for these functions is similar to the twoestimate
functions. Please see the help
for these functions for details. We are trying to develop general functions to deal with all these cases together.
Data structures
In this section we will describe the data structures underlying the Pseudomarker programs.
readdata
This function outputs two arguments. The first one is the phenotype matrix and the second one is the marker data structure. This is an array, each element of the array corresponding to a chromosome. The fields are the following:
chrid:
chromosome numbergeno:
matrix of genotypes; rows correspond to inidviduals and columns correspond to markersmnames:
list of marker namesmpos:
marker positions in Morganschromlen:
length of the chromosome under consideration; at this point we round up from the righmost marker to the nearest 20cM
impute
The output of this function is also a structure array with each element of the structure corresponding to a chromosome. The fields are:chrid:
chromosome numnerigeno:
matrix of imputed genotypes; rows correspond to individuals, columns correspond to pseudomarkers and pages correspond to the imputation numnermpos:
pseudomarker positions in Morganscross:
cross type;'bc'
(backcross) or'f2'
(F2 intercross)chromlen:
chromosome length
mainscan
Each element of the output structure array corresponds to a chromosome. Here are the fields:n:
number of individualscross:
cross typechrid:
chromosome numberbf:
un-normalized bayes factorlod:
vector of LOD scores; this is the log of the posterior density at the corresponding pseudomarker locationsmpos:
pseudomarker positionschromlen:
chromosome length in Morgans
pairscan
This is probably the most complicated structure of them all. This may change a bit in future versions of the code. Be warned...The program outputs a two-dimentional array of structures. Each corresponds to a pair of chromosomes which may or may not be the same.
n:
number of individualscross:
cross typechrid1:
row chromosome numberchrid2:
col chromosome numberbfadd:
Bayes factor for additive model (if row=col)bfint:
Bayes factor for full model (if row=col)bf:
Bayes factor for full model (if row>col) else for additive modelbaselod:
not implemented now; intended for covariateslod:
log of the posterior density for QTL locations at pseudomarker pair locationschromlen1:
row chromosome lengthchromlen2:
col chromosome lengthmpos1:
row pseudomarker positionsmpos2:
col pseudomarker positions
You made it! You have now done some basic QTL Analysis and know some
of the ins and outs of Matlab coding. Do not forget to use the help
command if you ever have any questions about a function. If you find errors in this document please drop us a note!