Introduction

AlphaFold2 is artificial-intelligence-based software developed by DeepMind Technologies, a subsidiary of Alphabet (the parent company of Google), which predicts protein structures using a deep learning system. The initial demonstration of the program at CASP13 showed great success at accurately predicting structures for the targets rated most difficult, for which no template existed. In a second public demonstration, during CASP14 in 2020, the AlphaFold2 team achieved a level of accuracy much higher than any other participating group.

Such results have proven to be a giant step forward in the field of ab initio protein structure prediction. In July 2021, DeepMind published a paper in Nature describing AlphaFold. It also released the source code of the software on GitHub and offered a searchable database containing the predicted structures of the complete proteomes of about 20 species (over 365,000 protein structures).

Link to the AlphaFold2 database:

https://alphafold.ebi.ac.uk/

Four options on how to run AlphaFold2 are described here:

  1. Locally
  2. SBGrid
  3. Google Colab
  4. NMRbox

Using AlphaFold2 locally

AlphaFold2 has been installed locally on the bioinformatics infrastructure. As of March 4, 2022, version 2.1.0 is available and includes the multimer module.

To run AlphaFold2 locally, connect to the jacquere host. Once connected, open a terminal window and change to the folder where you want the data to be saved. From there, create a text file containing your amino acid sequence of interest in FASTA format.

Code block (bash)
touch NSs.fasta
echo ">RVFV NSs
MDYFPVISVDLQSGRRVVSVEYFRGDGPPRIPYSMVGPCCVFLMHHRPSHEVRLRFSDFY
NVGEFPYRVGLGDFASNVAPPPAKPFQRLIDLIGHMTLSDFTRFPNLKEAISWPLGEPSL
AFFDLSSTRVHRNDDIRRDQIATLAMRSCKITNDLEDSFVGLHRMIATEAILRGIDLCLL
PGFDLMYEVAHVQCVRLLQAAKEDISNAVVPNSALIVLMEESLMLRSSLPSMMGRNNWIP
VIPPIPDVEMESEEESDDDGFVEVD" > NSs.fasta
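
Before starting a run that takes about an hour, it may help to confirm that the file really is in FASTA format. The sketch below is only an illustration: the function name check_fasta is hypothetical and not part of AlphaFold2. It verifies that the first line is a description line starting with '>' and reports the number of sequences and residues.

```shell
# Hypothetical helper (not part of AlphaFold2): basic sanity check of a FASTA file.
check_fasta() {
    fasta="$1"
    # The first line must be a description line starting with '>'.
    if ! head -n 1 "$fasta" | grep -q '^>'; then
        echo "error: $fasta does not start with a '>' description line" >&2
        return 1
    fi
    # One description line per sequence.
    n_seq=$(grep -c '^>' "$fasta")
    # Residues: drop description lines, strip all whitespace, count characters.
    n_res=$(grep -v '^>' "$fasta" | tr -d '[:space:]' | wc -c | tr -d ' ')
    echo "sequences=${n_seq} residues=${n_res}"
}
```

It could be called as check_fasta NSs.fasta right after creating the file.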

Then use the wrapper script run.sh (its content is given below) to run AlphaFold2. You may want to adjust the parameters to your specific case. Make the file executable and run it.

Code block (bash)
chmod +x run.sh
./run.sh NSs.fasta

As a reference, predicting 5 models for RVFV NSs (the example sequence above) took about 60 minutes on jacquere. Once the run is done, you can go to the results folder and analyze your results. A description of the various files produced by the program is provided on the GitHub page of AlphaFold. A wiki page is also dedicated to the visualization of the results using PyMOL.
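
To locate the results, note that AlphaFold writes them to a subfolder of the output directory named after the FASTA file (without its extension), and that the final models are stored as ranked_0.pdb, ranked_1.pdb, and so on, with ranked_0.pdb being the highest-confidence one. The sketch below lists them; the function name list_ranked_models is hypothetical, and the output directory is assumed to default to the current folder, as in the wrapper script.

```shell
# Hypothetical helper (not part of AlphaFold2): list the ranked models of a finished run.
list_ranked_models() {
    fasta="$1"
    # Output directory; assumed to be the current folder unless given.
    output_dir="${2:-$(pwd)}"
    # The results subfolder is named after the FASTA file, minus its extension.
    name=$(basename "$fasta")
    name="${name%.*}"
    result_dir="${output_dir}/${name}"
    if [ ! -d "$result_dir" ]; then
        echo "no results directory: $result_dir" >&2
        return 1
    fi
    # ranked_0.pdb is the model with the highest confidence.
    ls "$result_dir"/ranked_*.pdb
}
```

For the example above, list_ranked_models NSs.fasta would look inside the NSs subfolder of the current directory.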

Below is the content of the script.

Code block (bash)
#!/bin/bash
#
# AlphaFold2, version 2.1.0
# wrapper script to run AlphaFold2 for a monomer
#
# usage:
# ./run_alphafold_template.sh sequence.fasta
#
# adjust parameters as required

DATA_DIR="/home/sbio/afdb/"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"

python3 /usr/local/alphafold/run_alphafold.py \
--data_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--max_template_date=$MAX_TEMPLATE_DATE \
--db_preset=$DB_PRESET \
--bfd_database_path=$BFD_DB \
--uniclust30_database_path=$UNICLUST30_DB \
--uniref90_database_path=$UNIREF90_DB \
--mgnify_database_path=$MGNIFY_DB \
--template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
--pdb70_database_path=$PDB70_DB \
--obsolete_pdbs_path=$OBSOLETE_PATH \
--use_gpu_relax="true" \
--fasta_paths=${1}
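
Before launching a long run, it can be worth confirming that the database locations set in the script exist. The sketch below is one possible pre-flight check; the function name check_db_paths is hypothetical and not part of AlphaFold2. Because the BFD, Uniclust30 and PDB70 entries are file-name prefixes rather than single files, the check accepts any path that matches at least one file or directory when used as a prefix.

```shell
# Hypothetical helper (not part of AlphaFold2): report database paths that do not exist.
check_db_paths() {
    missing=0
    for p in "$@"; do
        # Accept an existing file or directory, or a prefix matching at least
        # one entry (HH-suite databases are given as a prefix such as .../pdb70).
        # Note: as a consequence, a truncated file name also passes the check.
        if ls -d "${p}"* > /dev/null 2>&1; then
            continue
        fi
        echo "missing: $p" >&2
        missing=$((missing + 1))
    done
    # Succeed only if every path was found.
    [ "$missing" -eq 0 ]
}
```

For example, calling check_db_paths "$BFD_DB" "$UNICLUST30_DB" "$UNIREF90_DB" "$MGNIFY_DB" "$TEMPLATE_MMCIF_DIR" "$PDB70_DB" "$OBSOLETE_PATH" || exit 1 just before the python3 line would stop the script early if a path is wrong.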

Using AlphaFold2 via SBGrid

Warning

SBGrid is a software infrastructure that includes a library of over 400 structural biology applications, including AlphaFold. The SBGrid software collection is restricted to SBGrid member laboratories.

...

Code block (bash)
SBGrid shell environment is not initialized! Please source
/programs/sbgrid.shrc or /programs/sbgrid.cshrc to use the
software.

To run AlphaFold2, you will need at least two text files:

  1. A modified version of the AlphaFold2 wrapper provided by SBGrid that can be downloaded here.
  2. Your protein sequence in FASTA format (a complete description of the FASTA format can be found here). The description line is mandatory. An example file can be downloaded here.

...

Code block (bash)
# AlphaFold2, version 2.1.1
# wrapper script to run AlphaFold2 for a monomer
#
# usage:
# ./run_alphafold_template.sh sequence.fasta
#
# adjust parameters as required

DATA_DIR="/home/sbio/afdb/"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"

run_alphafold.py \
--data_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--max_template_date=$MAX_TEMPLATE_DATE \
--db_preset=$DB_PRESET \
--bfd_database_path=$BFD_DB \
--uniclust30_database_path=$UNICLUST30_DB \
--uniref90_database_path=$UNIREF90_DB \
--mgnify_database_path=$MGNIFY_DB \
--template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
--pdb70_database_path=$PDB70_DB \
--obsolete_pdbs_path=$OBSOLETE_PATH \
--fasta_paths=${1}

...

As a reference, predicting 5 models for RVFV NSs (the example sequence above) took about 60 minutes on mazuelo. Once the run is done, you can go to the results folder and analyze your results. A description of the various files produced by the program is provided on the GitHub page of AlphaFold.

Using pTM models

In order to use pTM models, you need to set the ALPHAFOLD_PTM environment variable before running AlphaFold:

Code block (bash)
export ALPHAFOLD_PTM=true

Using the multimer option

It is now possible to run predictions with multimers. A wrapper script can be downloaded here. Content of the run_alphafold_template_multimer.sh wrapper:

Code block (bash)
# AlphaFold2, version 2.1.1
# wrapper script to run AlphaFold2 for a multimer
#
# usage:
# ./run_alphafold_template_multimer.sh multiple_sequences.fasta
#
# adjust parameters as required

DATA_DIR="/home/sbio/afdb/"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"
PDB_SEQRES_DATABASE_PATH="${DATA_DIR}/pdb_seqres/pdb_seqres.txt"
UNIPROT_DATABASE_PATH="${DATA_DIR}/uniprot/uniprot.fasta"

run_alphafold.py \
--data_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--max_template_date=$MAX_TEMPLATE_DATE \
--db_preset=$DB_PRESET \
--bfd_database_path=$BFD_DB \
--uniclust30_database_path=$UNICLUST30_DB \
--uniref90_database_path=$UNIREF90_DB \
--mgnify_database_path=$MGNIFY_DB \
--template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
--pdb_seqres_database_path=$PDB_SEQRES_DATABASE_PATH \
--uniprot_database_path=$UNIPROT_DATABASE_PATH \
--obsolete_pdbs_path=$OBSOLETE_PATH \
--model_preset=multimer \
--fasta_paths=${1}

All protein sequences have to be in the same FASTA file.
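
If each chain is currently stored in its own FASTA file, the files can simply be concatenated into one. The sketch below shows one way to do this; the function name make_multimer_fasta is hypothetical and not part of AlphaFold2 or SBGrid. It also adds a missing final newline so that each description line starts on its own line, and prints the number of sequences written.

```shell
# Hypothetical helper (not part of AlphaFold2): merge single-sequence FASTA
# files into one multi-sequence FASTA file for a multimer prediction.
make_multimer_fasta() {
    out="$1"
    shift
    # Start from an empty output file.
    : > "$out"
    for f in "$@"; do
        cat "$f" >> "$out"
        # If the input does not end with a newline, add one so that the next
        # '>' description line starts on its own line.
        if [ -n "$(tail -c 1 "$f")" ]; then
            echo >> "$out"
        fi
    done
    # Print the number of sequences in the merged file.
    grep -c '^>' "$out"
}
```

For instance, make_multimer_fasta multiple_sequences.fasta chainA.fasta chainB.fasta (file names chosen for illustration) would produce a two-sequence file. For a homo-oligomer, the same sequence is repeated once per copy.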

Using AlphaFold2 via Google Colab

DeepMind, in collaboration with Google, has provided users with a Colab notebook to run AlphaFold2 (currently version 2.1.0). The notebook can be opened at the address below, and instructions on how to proceed are provided in the notebook:

...

Warning

As stated in the Colab notebook:

"In comparison to AlphaFold v2.0, this Colab notebook uses no templates (homologous structures) and a selected portion of the BFD database. We have validated these changes on several thousand recent PDB structures. While accuracy will be near-identical to the full AlphaFold system on many targets, a small fraction have a large drop in accuracy due to the smaller MSA and lack of templates. For best reliability, we recommend instead using the full open source AlphaFold, or the AlphaFold Protein Structure Database."

There is now a faster approach to multiple sequence alignment within AlphaFold2, developed by the group of Martin Steinegger (reference to the preprint paper). It has been implemented in a Google Colab notebook similar to the one above. The prediction should take about 10 minutes to run (vs. 60+ minutes with the DeepMind implementation above).

https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb

Using AlphaFold2 in NMRbox

"NMRbox is a resource for biomolecular NMR (Nuclear Magnetic Resonance) software. It provides tools for finding the software you need, documentation and tutorials for getting the most out of the software, and cloud-based virtual machines for executing the software."

After registering, it is possible to log into a virtual machine and run AlphaFold2 predictions (version 2.1.1). At the moment, only T4-equipped VMs are available for this task: argon.nmrbox.org, oxygen.nmrbox.org, rubidium.nmrbox.org, and zirconium.nmrbox.org. Once logged in:

  • Open a terminal and go to the desired output directory
  • Create a FASTA file containing the protein sequence of interest
  • Run AlphaFold2 using the alphafold FASTA_file.fasta command. Other options are available; see alphafold -h.