Detection of positive Darwinian selection

Part 0) PAML (Phylogenetic Analysis by Maximum Likelihood)

In this practical, you will use PAML to compute and validate different hypothesis about the evolutionary history of two different genes.
PAML is a very powerful, but also very complex/complicated tool.
It can be found at link:
But it is installed in your computer with command-line "codeml.exe".
You will see the three different categories of codon models: "Branch specific model",  "Site specific model" and "Branch-site model"

The examples are from the following papers:
Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution.
Yang Z.
Mol Biol Evol. 1998 May;15(5):568-73.

Codon-substitution models for heterogeneous selection pressure at amino acid sites.
Ziheng Yang, Rasmus Nielsen, Nick Goldman, and Anne-Mette Krabbe Pedersen
Genetics. 2000 May;155(1):431-49.

CodeML needs three inputs: The manual is available here.

dS = Synonymous mutation substitution rate
dN = Non-synonymous mutation substitution rate
omega = dN/dS
kappa = ts/tv (transition/transversion rate)

dN/dS < 1  => purifying selection
dN/dS = 1 =>  neutral evolution
dN/dS > 1 => positive Darwinian selection

Part 1) Estimation of different dN/dS in Lysosyme of primates



Alignment: lysozyme.nuc
Tree: lysozyme.tree
Controle file lysozyme_M0.ctl

Download the data in one folder, and have a look at these files.

Part 1.1) Estimation of global dN/dS

You will use the model M0 (one ratio). It's the most basic model of evolution.
It computes all dN and dS values among branch assuming an identical dN/dS ratio among all branches.
It's explained in pages 34-35 in the manual.

There are two options in the file lysozyme_M0.ctl that specify which model to use:
Just launch the command as follow: "codeml.exe lysozyme_M0.ctl"

Question 1: look at the outfile "lysozyme_M0.mlc" and found different variables computed:

Part 1.2) Estimation of dN/dS specific to the Hominoids lineage

First, copy and paste the controle "lysozyme_M0.ctl" and rename it to "lysozyme_Branch.ctl"
And also copy and paste the tree file, and rename into "lysozyme_tagged.tree"

We want to compute the specific dN/dS for the Hominoids, so we need to specifie which branch in the tree we need.
To do that, we will add a "#1" at the end of Hominoid group in the tree.
Open your tree file "lysozyme_tagged.tree" in NJPplot.
You can see that the Hominoid group is composed by four taxa: Human, Chimp_bonobo_gorilla, OrangUtan and Gibbon.
write "#1" just after the first ")" after Gibbon and save your file.
Re-open in NJplot and display the bootstrap. Your Hominoid branch should have a #1.

Modify the controle file "lysozyme_Branch.ctl" to use the Two-ratios model. (option "model = 2" and "NSSites = 0")
Don't forget to specify the tagged tree and another name for the "outfile"

And launch it.

It will compute a global omega for all branches, except for the Hominoid branch.

Question: Is the dN/dS in Hominoid different from the rest of the tree? Is it under purifying, neutral or positive selection?

To be sure of the answer, you need to perform a LRT of this model against the null model 0, previously computed.

Part 1.3) Is it positive selection or just relaxation of selection pressure?

First, copy and paste the controle "lysozyme_Branch.ctl" and rename it to "lysozyme_Branch_neutral.ctl"

We saw in the previous part that the dN/dS of Hominoid evolved under a different regime.
But how can we know if it's really positive selection, or just relaxation of selection pressure?
To do that, we will contrast the previous result against a model of neutrality.

In the file "lysozyme_Branch_neutral.ctl", set the options "fix_omega = 1" and  "omega = 1"  and re-launch the model (don't forget to modify the name of the outfile).

Get the loglikelihood values and number of parameters for previous results and compute the LRT as follow:

p-value = chi-square(2*delta lnL, d.f.)
2*delta lnL = twice the difference (logLikelihood from positive model vs logLikelihood from neutral model)
d.f. = degree of freedom = difference in the number of parameters.

You can use Excel to compute the LRT. (function LOI.KHIDEUX).

Part 2) Estimation of Sites evolving under positive selection in HIV-1 env.

In this part, we want to detect sites evolving under positive selection.
They are often found in genes involved in arms race (immunity defence or viral attack) or in sex genes (ie sperm lysin in abalon).

We will analyse a small fragment (V3 part) of the env gene in HIV-1. This gene produces the glycoprotein gp120 protein, which is exposed at the surface of the virus.


Alignment (phylip format): HIVenvSweden.nuc
Alignment (fasta format): HIVenvSweden.fasta
Tree: HIVenvSweden.tree
Controle file: HIVenvSweden.ctl

Part 2.1) Estimation de selection positive en model Site specific: nef gene HIV-2

The different sites models are explained page 35-38 in the manual.

We will construct a LRT using two different models:
The LRT compares the M2a against the M1a (d.f. = 2).

We can launch all models in one way by setting the option "model=0" and NSSites = "0 1 2".
Change this in the controle file and launch it.

Have a look at the ".mlc" file. There are all information about the three models computed.
The Bayes Empirical Bayes (BEB) gives a list of sites with a good confidence.
Have a look at your alignment file (fasta format) in Jalview.
(You need to translate (Calculate->Translate cDNA) to get the position of codon site.)

Question 2.1: Based on the LRT, could you said that there are sites under positive selection?

Question 2.2: Which percentage of site evolving under positive selection and what is the average dN/dS ?

Question 2.3:  Are the site predicted by BEB (with >90% of confidence) very conserved or not ?

Part 2.2) Detection and visualisation of selected sites with Selecton webserver => Server Selecton

Go to the webserver:

This server can also compute detect positive selection with Site-specific modela.
We will be able to see the positive selected sites on the structure.

You can provide the different parameter to the server:

- DNA file:  HIVenvSweden.fasta.
- Query sequence name: U68496
- Your email address (very useful as the program is rather slow)

- Protein structure: you can perform a BLAST in the PDB to find the best match. (the best match is PDB ID: 2B4C with chain identifier G)
- Tree phylogeny: HIVenvSweden.tree
- Evolutionary model: use a model of positive selection (M8 is nice).

It will take long time, about one hour.

Part 3) Estimation of sites who have evolved under positive in a specific lineage (Hominoid in lysozyme).

DATA:  You will use the same data as in the lysozyme example.

You will use the Branch-site model A. It's mix between the two previous model (Branch-specific and site-specific).
It's explained in pages 38-39 in the manual.

Branch-site model A classifies the different codon sites into four different categories:
Site class 0:   Codon sites evolving under purifying selection in all branches. (0<dN/dS<1)
Site class 1:   Codon sites evolving under neutral evolution in all branches. (dN/dS=1)
Site class 2a:   Codon sites evolving under positive selection in the selected branch (dN/dS>1), and under purifying selection in the rest of the tree (0<dN/dS <1)
Site class 2b:   Codon sites evolving under positive selection in the selected branch (dN/dS>1), and under neutral evolution in the rest of the tree (dN/dS =1)

Part 3.1) Estimation of positive selection using the Branch-site specific model A.

The Branch-site model A can be used with the option "model = 2" and "NSSites = 2".
Do the change in the controle file (don't forget to specify another outfile)

Part 3.2) Estimation of selection positive using the Branch-site specific, under a test of neutrality.

The null model for Branch-site model A is the same model as before, but with omega fixed to 1.

Set the options "fix_omega = 1" and  "omega = 1"  and re-launch the model (don't forget to modify the name of the outfile).

You can use the MSA in FASTA format (lysozyme.fasta) to view the positively selected sites (from BEB) on Jalview.  
Tips: load the tree file in Jalview to group sequences.

Question 3.1: Based on the LRT, are sites under positive selection?

Question 3.2: Which percentage of site evolving under positive selection and what is the average dN/dS ?

Question 3.3: What can you say about the site predicted by BEB (with >90% of confidence)?