AlphaFold2 is artificial-intelligence-based software developed by DeepMind Technologies, a subsidiary of Alphabet (the parent company of Google), that predicts protein structures using a deep learning system. The first demonstration of the program, at CASP13, showed great success at accurately predicting structures for the targets rated most difficult, for which no template structure existed. In a second public demonstration, at CASP14 in 2020, the AlphaFold2 team achieved a level of accuracy much higher than that of any other participating group.

These results were a giant step forward in the field of ab initio protein structure prediction. In July 2021, DeepMind published a paper in Nature describing AlphaFold, released the source code of the software on GitHub, and opened a searchable database containing the predicted structures for the complete proteomes of about 20 species (over 365 000 protein structures).

Link to the AlphaFold2 database:

https://alphafold.ebi.ac.uk/

Four options on how to run AlphaFold2 are described below:

  1. Locally
  2. SBGrid
  3. Google Colab
  4. NMRbox


Using AlphaFold2 locally

AlphaFold2 has been installed locally on the bioinformatics infrastructure. As of May 16, 2022, version 2.1.0 is available and includes the multimer module.

To run AlphaFold2 locally, connect to the jacquere host (either via SSH or physically). Once connected, open a terminal window and change to the folder where you want the data to be saved. There, create a text file containing your amino acid sequence of interest in FASTA format:

touch NSs.fasta
echo ">RVFV NSs
MDYFPVISVDLQSGRRVVSVEYFRGDGPPRIPYSMVGPCCVFLMHHRPSHEVRLRFSDFY
NVGEFPYRVGLGDFASNVAPPPAKPFQRLIDLIGHMTLSDFTRFPNLKEAISWPLGEPSL
AFFDLSSTRVHRNDDIRRDQIATLAMRSCKITNDLEDSFVGLHRMIATEAILRGIDLCLL
PGFDLMYEVAHVQCVRLLQAAKEDISNAVVPNSALIVLMEESLMLRSSLPSMMGRNNWIP
VIPPIPDVEMESEEESDDDGFVEVD" > NSs.fasta
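
Since a run takes hours, it may be worth checking the FASTA file before launching it. The following is a minimal sketch, not part of AlphaFold: the function name check_fasta is hypothetical, and it assumes the sequence is written with the upper-case one-letter codes of the 20 standard amino acids.

```bash
#!/bin/bash
# check_fasta FILE: succeed if FILE looks like a protein FASTA file.
# Hypothetical helper; assumes upper-case one-letter codes for the
# 20 standard amino acids only.
check_fasta() {
    local file="$1"
    # The first line must be a description line starting with ">"
    if ! head -n 1 "$file" | grep -q '^>'; then
        echo "missing description line" >&2
        return 1
    fi
    # All other lines may only contain standard amino acid letters
    if grep -v '^>' "$file" | grep -q '[^ACDEFGHIKLMNPQRSTVWY]'; then
        echo "unexpected character in sequence" >&2
        return 1
    fi
    return 0
}

# Example:
# check_fasta NSs.fasta && echo "NSs.fasta looks fine"
```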

Then run AlphaFold2 with the wrapper script run.sh, whose content is given further down. You may want to adjust the parameters to your specific case. Make the file executable and run it:

chmod +x run.sh
./run.sh NSs.fasta

As a reference, predicting 5 models for RVFV NSs (the example sequence above) took about 2 h 50 min on jacquere. Once the run is done, go to the results folder to analyze your results. A description of the various files produced by the program is provided on the GitHub page of AlphaFold.

Below is the content of the script.

#!/bin/bash
#
# AlphaFold2, version 2.1.0
# wrapper script to run AlphaFold2 for a monomer
#
# usage:
# ./run.sh sequence.fasta
#
# adjust parameters as required

DATA_DIR="/home/sbio/afdb/"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"

python3 /usr/local/alphafold/run_alphafold.py \
--data_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--max_template_date=$MAX_TEMPLATE_DATE \
--db_preset=$DB_PRESET \
--bfd_database_path=$BFD_DB \
--uniclust30_database_path=$UNICLUST30_DB \
--uniref90_database_path=$UNIREF90_DB \
--mgnify_database_path=$MGNIFY_DB \
--template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
--pdb70_database_path=$PDB70_DB \
--obsolete_pdbs_path=$OBSOLETE_PATH \
--use_gpu_relax="true" \
--fasta_paths=${1}
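
A mistyped database path typically only surfaces once the run has started, so it can help to check the paths first. The sketch below is an illustration, not part of the wrapper above: the function name check_db_paths is hypothetical. Several of the arguments (for example the BFD, Uniclust30 and PDB70 paths) are file-name prefixes rather than files, so a path is counted as present when at least one file or directory name starts with it.

```bash
#!/bin/bash
# check_db_paths PATH...: report database paths that do not exist.
# Hypothetical helper. A path counts as present when it exists or when
# at least one file name starts with it (prefix-style databases).
check_db_paths() {
    local p
    local missing=0
    for p in "$@"; do
        # ls fails when the glob matches nothing
        if ls -d "${p}"* > /dev/null 2>&1; then
            continue
        fi
        echo "missing: $p" >&2
        missing=1
    done
    return "$missing"
}

# Example, using the variables defined in the wrapper:
# check_db_paths "$BFD_DB" "$UNICLUST30_DB" "$UNIREF90_DB" "$MGNIFY_DB" \
#     "$TEMPLATE_MMCIF_DIR" "$PDB70_DB" "$OBSOLETE_PATH" || exit 1
```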

Using the multimer option

AlphaFold2 can now be run in multimer mode (AlphaFold-Multimer), which predicts the structure of a protein complex (more information can be found in the original publication describing the process). To do this locally, use a modified wrapper script in which the multimer preset is set (--model_preset=multimer). An example can be downloaded here. The content is the following:

#!/bin/bash
# AlphaFold2, version 2.1.0
# wrapper script to run AlphaFold2 for a multimer
#
# usage:
# ./run_alphafold_template_multimer.sh multiple_sequences.fasta
#
# adjust parameters as required

DATA_DIR="/home/sbio/afdb"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIPROT_DB="${DATA_DIR}/uniprot/uniprot.fasta"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB_SEQRES_DB="${DATA_DIR}/pdb_seqres/pdb_seqres.txt"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"

python3 /usr/local/alphafold/run_alphafold.py \
--data_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--max_template_date=$MAX_TEMPLATE_DATE \
--db_preset=$DB_PRESET \
--bfd_database_path=$BFD_DB \
--uniclust30_database_path=$UNICLUST30_DB \
--uniprot_database_path=$UNIPROT_DB \
--uniref90_database_path=$UNIREF90_DB \
--mgnify_database_path=$MGNIFY_DB \
--template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
--pdb_seqres_database_path=$PDB_SEQRES_DB \
--obsolete_pdbs_path=$OBSOLETE_PATH \
--use_gpu_relax="true" \
--model_preset=multimer \
--fasta_paths=${1}

Beware that this will likely take a lot of time (expect run times over 48 hours) depending on the complexity of your submission. To give an idea, a recent calculation for a homohexamer of 250 amino acids per monomer took about 65 hours.
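
Runs of this length usually outlast an SSH session, so one option is to detach the job from the terminal. The following is a minimal sketch using nohup; the function name run_detached and the file names in the example are hypothetical.

```bash
#!/bin/bash
# run_detached LOGFILE COMMAND [ARGS...]: start COMMAND in the background,
# immune to the hangup signal sent when the terminal or SSH session closes.
# Output and errors go to LOGFILE; the PID of the job is printed.
run_detached() {
    local log="$1"
    shift
    nohup "$@" > "$log" 2>&1 < /dev/null &
    echo "$!"
}

# Example (hypothetical file names):
# run_detached alphafold.log ./run_alphafold_template_multimer.sh complex.fasta
# tail -f alphafold.log    # follow the progress, Ctrl-C to stop watching
```

Terminal multiplexers such as screen or tmux, where installed, serve the same purpose.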

It is also important to format the FASTA file properly. All chains go into a single FASTA file, with one entry per chain. For a homomer, repeat the same sequence once per copy (here, a homotrimer):

>sequence_1
<SEQUENCE>
>sequence_2
<SEQUENCE>
>sequence_3
<SEQUENCE>

And for a heteromer (here, two copies of sequence A and three copies of sequence B):

>sequence_1
<SEQUENCE A>
>sequence_2
<SEQUENCE A>
>sequence_3
<SEQUENCE B>
>sequence_4
<SEQUENCE B>
>sequence_5
<SEQUENCE B>
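
For a homomer, the repeated entries can be generated from a single-sequence FASTA file. This is a minimal sketch under that assumption; the function name make_homomer_fasta and the file names in the example are hypothetical.

```bash
#!/bin/bash
# make_homomer_fasta FILE COPIES: print COPIES entries containing the
# sequence found in FILE (a FASTA file with a single sequence), each
# with its own description line (>sequence_1, >sequence_2, ...).
make_homomer_fasta() {
    local file="$1"
    local copies="$2"
    local seq
    local i=1
    # Drop the description line and join the sequence lines into one
    seq=$(grep -v '^>' "$file" | tr -d '\n')
    while [ "$i" -le "$copies" ]; do
        printf '>sequence_%d\n%s\n' "$i" "$seq"
        i=$((i + 1))
    done
}

# Example: build the input for a homohexamer
# make_homomer_fasta monomer.fasta 6 > hexamer.fasta
```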

Visualizing the results with PyMOL

Once the calculation is done, your results are in a subdirectory created by the program. The description of each file can be found here. You can then open the ranked_*.pdb files (the relaxed predicted structures) in your preferred molecular visualization software.

The per-residue confidence score (pLDDT, on a scale of 0 to 100) is stored in the B-factor field of the PDB file, and the values can be used to colour the structure with the spectrum b command in PyMOL. A higher score means higher confidence. A description can be found here. The PyMOLWiki website describes the spectrum command and its options, including the available colour palettes.
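
Because pLDDT sits in the B-factor field, a quick overall figure can also be obtained from the command line without opening a viewer. The sketch below averages the B-factor over the C-alpha atoms of a PDB file; the function name mean_plddt is hypothetical, and it relies on the fixed-width PDB layout (atom name in columns 13 to 16, B-factor in columns 61 to 66).

```bash
#!/bin/bash
# mean_plddt FILE.pdb: print the mean B-factor (pLDDT in AlphaFold models)
# over the C-alpha atoms, one value per residue.
# PDB is fixed-width: atom name in columns 13-16, B-factor in columns 61-66.
mean_plddt() {
    awk '
        /^ATOM/ && substr($0, 13, 4) == " CA " {
            sum += substr($0, 61, 6)
            n++
        }
        END {
            if (n == 0) exit 1      # no C-alpha atom found
            printf "%.2f\n", sum / n
        }
    ' "$1"
}

# Example: compare the five ranked models
# for f in ranked_*.pdb; do echo "$f $(mean_plddt "$f")"; done
```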

Using AlphaFold2 via SBGrid

SBGrid is a software infrastructure that includes a library of over 400 structural biology applications, including AlphaFold. The SBGrid software collection is restricted to SBGrid member laboratories.

The version currently installed on our server is AlphaFold2 2.1.1. More details are given on the SBGrid Wiki page for AlphaFold.

A username and password are required to access the infrastructure. They can be requested by email to Normand Cyr or Ryan Richter.

You need to log into the host named mazuelo. This computer has a GPU card powerful enough for the calculations required by AlphaFold, which will not work on the other computers of the network.

Once logged in, open a terminal window and, at the command prompt, type:

sbgrid

This will activate the SBGrid environment and allow you to access the various applications. You should see the following welcome message:

********************************************************************************
                  Software Support by SBGrid (www.sbgrid.org)
********************************************************************************
 Your use of the applications contained in the /programs  directory constitutes
 acceptance of  the terms of the SBGrid License Agreement included  in the file
 /programs/share/LICENSE.  The applications  distributed by SBGrid are licensed
 exclusively to member laboratories of the SBGrid Consortium.
              Run sbgrid-accept-license to remove the above message.  
********************************************************************************
 SBGrid was developed with support from its members, Harvard Medical School,    
 HHMI, and NSF. If use of SBGrid compiled software was an important element     
 in your publication, please include the following reference in your work:      
                                                                                      
 Software used in the project was installed and configured by SBGrid.                   
 cite: eLife 2013;2:e01456, Collaboration gets the most out of software.                
********************************************************************************
 SBGrid installation last updated: 2022-01-16
 Please submit bug reports and help requests to:       <bugs@sbgrid.org>  or
                                                       <http://sbgrid.org/bugs>
            For additional information visit https://sbgrid.org/wiki
********************************************************************************

If you do not activate the SBGrid environment, you will get the following error message when starting an application:

SBGrid shell environment is not initialized! Please source
/programs/sbgrid.shrc or /programs/sbgrid.cshrc to use the
software.
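
In a script, the same check can be made before calling AlphaFold. The sketch below is an illustration only: the function name ensure_tool is hypothetical, and the example assumes a bash script and the /programs/sbgrid.shrc file named in the message above.

```bash
#!/bin/bash
# ensure_tool TOOL INIT_FILE: succeed if TOOL is found on the PATH.
# When it is not, source INIT_FILE (if readable) and look again.
ensure_tool() {
    local tool="$1"
    local init_file="$2"
    if command -v "$tool" > /dev/null 2>&1; then
        return 0
    fi
    if [ -r "$init_file" ]; then
        . "$init_file"
    fi
    command -v "$tool" > /dev/null 2>&1
}

# Example at the top of a wrapper script:
# ensure_tool run_alphafold.py /programs/sbgrid.shrc || exit 1
```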

In order to run AlphaFold2, you will need at least two text files:

  1. A modified version of the AlphaFold2 wrapper provided by SBGrid that can be downloaded here.
  2. Your protein sequence in FASTA format (a complete description of the FASTA format can be found here). The description line is mandatory. An example file can be downloaded here.

Example FASTA file:

> Rift Valley Fever Virus NSs protein
MDYFPVISVDLQSGRRVVSVEYFRGDGPPRIPYSMVGPCCVFLMHHRPSHEVRLRFSDFY
NVGEFPYRVGLGDFASNVAPPPAKPFQRLIDLIGHMTLSDFTRFPNLKEAISWPLGEPSL
AFFDLSSTRVHRNDDIRRDQIATLAMRSCKITNDLEDSFVGLHRMIATEAILRGIDLCLL
PGFDLMYEVAHVQCVRLLQAAKEDISNAVVPNSALIVLMEESLMLRSSLPSMMGRNNWIP
VIPPIPDVEMESEEESDDDGFVEVD

Content of the run_alphafold_template.sh wrapper:

#!/bin/bash
# AlphaFold2, version 2.1.1
# wrapper script to run AlphaFold2 for a monomer
#
# usage:
# ./run_alphafold_template.sh sequence.fasta
#
# adjust parameters as required

DATA_DIR="/home/sbio/afdb/"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"

run_alphafold.py \
--data_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--max_template_date=$MAX_TEMPLATE_DATE \
--db_preset=$DB_PRESET \
--bfd_database_path=$BFD_DB \
--uniclust30_database_path=$UNICLUST30_DB \
--uniref90_database_path=$UNIREF90_DB \
--mgnify_database_path=$MGNIFY_DB \
--template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
--pdb70_database_path=$PDB70_DB \
--obsolete_pdbs_path=$OBSOLETE_PATH \
--fasta_paths=${1}

The wrapper reads the location of the FASTA file from its first argument (--fasta_paths=${1}), so the script does not need to be edited for each sequence. Make the wrapper script executable:

chmod +x run_alphafold_template.sh

Then, to run AlphaFold, enter the following command in the terminal prompt:

./run_alphafold_template.sh [FASTA file location]

As a reference, predicting 5 models for RVFV NSs (the example sequence above) took about 60 minutes on mazuelo. Once the run is done, go to the results folder to analyze your results. A description of the various files produced by the program is provided on the GitHub page of AlphaFold.

Using pTM models

In order to use pTM models, you need to set the ALPHAFOLD_PTM environment variable before running AlphaFold:

export ALPHAFOLD_PTM=true

Using the multimer option

It is now possible to run predictions with multimers. A wrapper script can be downloaded here. Content of the run_alphafold_template_multimer.sh wrapper:

#!/bin/bash
# AlphaFold2, version 2.1.1
# wrapper script to run AlphaFold2 for a multimer
#
# usage:
# ./run_alphafold_template_multimer.sh multiple_sequences.fasta
#
# adjust parameters as required

DATA_DIR="/home/sbio/afdb/"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"
PDB_SEQRES_DATABASE_PATH="${DATA_DIR}/pdb_seqres/pdb_seqres.txt"
UNIPROT_DATABASE_PATH="${DATA_DIR}/uniprot/uniprot.fasta"

run_alphafold.py \
--data_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--max_template_date=$MAX_TEMPLATE_DATE \
--db_preset=$DB_PRESET \
--bfd_database_path=$BFD_DB \
--uniclust30_database_path=$UNICLUST30_DB \
--uniref90_database_path=$UNIREF90_DB \
--mgnify_database_path=$MGNIFY_DB \
--template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
--pdb_seqres_database_path=$PDB_SEQRES_DATABASE_PATH \
--uniprot_database_path=$UNIPROT_DATABASE_PATH \
--obsolete_pdbs_path=$OBSOLETE_PATH \
--model_preset=multimer \
--fasta_paths=${1}

All protein sequences have to be in the same FASTA file.
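
If each chain is stored in its own file, the files can simply be concatenated. The sketch below uses awk 1 rather than cat so that a file lacking a final newline does not fuse its last sequence line with the next description line; the function name combine_fasta and the file names in the example are hypothetical.

```bash
#!/bin/bash
# combine_fasta FILE...: print the given FASTA files one after another.
# "awk 1" prints every input line followed by a newline, including the
# last line of a file that does not end with one.
combine_fasta() {
    awk 1 "$@"
}

# Example (hypothetical file names):
# combine_fasta chainA.fasta chainB.fasta > complex.fasta
```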

Using AlphaFold2 via Google Colab

DeepMind, in collaboration with Google, provides a Colab notebook to run AlphaFold2 (currently version 2.1.0). The notebook can be opened at the address below, and it contains instructions on how to proceed:

https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb

As stated in the Colab notebook:

"In comparison to AlphaFold v2.0, this Colab notebook uses no templates (homologous structures) and a selected portion of the BFD database. We have validated these changes on several thousand recent PDB structures. While accuracy will be near-identical to the full AlphaFold system on many targets, a small fraction have a large drop in accuracy due to the smaller MSA and lack of templates. For best reliability, we recommend instead using the full open source AlphaFold, or the AlphaFold Protein Structure Database."

A faster approach to the multiple sequence alignment step of AlphaFold2, based on MMseqs2, has been developed by the group of Martin Steinegger (reference to the preprint paper). It is implemented in the ColabFold notebook, a Google Colab notebook similar to the one above. A prediction should take about 10 minutes to run (vs. 60+ minutes with the DeepMind implementation above).

https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb

Using AlphaFold2 in NMRbox

"NMRbox is a resource for biomolecular NMR (Nuclear Magnetic Resonance) software. It provides tools for finding the software you need, documentation and tutorials for getting the most out of the software, and cloud-based virtual machines for executing the software."

After registering, it is possible to log into a virtual machine and run AlphaFold2 predictions (version 2.1.1). At the moment, only the T4-equipped VMs are available for this task: argon.nmrbox.org, oxygen.nmrbox.org, rubidium.nmrbox.org, and zirconium.nmrbox.org. Once logged in:

  • Open a terminal and go to the desired output directory
  • Create a FASTA file containing the protein sequence of interest
  • Run AlphaFold2 using the AlphaFold2 FASTA_file.fasta command. Other options are available; see AlphaFold2 -h.