Introduction
AlphaFold2 is artificial intelligence software developed by DeepMind Technologies, a subsidiary of Alphabet (the parent company of Google), that predicts protein structures using a deep learning system. The first version of the program, demonstrated at CASP13 in 2018, was notably successful at predicting structures for the targets rated most difficult, for which no template structure existed. In a second public demonstration, at CASP14 in 2020, the AlphaFold2 team achieved a level of accuracy much higher than any other participating group.
These results were a major step forward in the field of ab initio protein structure prediction. In July 2021, DeepMind published a paper in Nature describing AlphaFold. It also released the source code of the software on GitHub and offered a searchable database containing the predicted structures of the complete proteomes of about 20 species (over 365 000 protein structures).
Link to the AlphaFold2 database:
Four options on how to run AlphaFold2 are described below:
- Locally
- SBGrid
- Google Colab
- NMRbox
Using AlphaFold2 locally
AlphaFold2 has been installed locally on the bioinformatics infrastructure. As of March 4, 2022, version 2.1.0 is accessible and includes access to the multimer module.
In order to run AlphaFold2 locally, you need to connect to the jacquere host. Once connected, open a terminal window and browse to the folder where you want the data to be saved. From there, create a text file in FASTA format and paste your amino acid sequence of interest into it.
touch NSs.fasta
echo ">RVFV NSs
MDYFPVISVDLQSGRRVVSVEYFRGDGPPRIPYSMVGPCCVFLMHHRPSHEVRLRFSDFY
NVGEFPYRVGLGDFASNVAPPPAKPFQRLIDLIGHMTLSDFTRFPNLKEAISWPLGEPSL
AFFDLSSTRVHRNDDIRRDQIATLAMRSCKITNDLEDSFVGLHRMIATEAILRGIDLCLL
PGFDLMYEVAHVQCVRLLQAAKEDISNAVVPNSALIVLMEESLMLRSSLPSMMGRNNWIP
VIPPIPDVEMESEEESDDDGFVEVD" > NSs.fasta
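Because a run takes hours, it can be worth checking the FASTA file before launching it. The following is a minimal sketch of such a check, assuming a plain protein FASTA file; check_fasta is a hypothetical helper written for this page and is not part of AlphaFold.

```shell
# check_fasta FILE
# Prints the number of sequences (header lines) found in FILE.
# Fails with a non-zero status if FILE is unreadable, has no header line,
# or has a sequence line containing anything other than letters.
# Hypothetical helper for illustration; it is not part of AlphaFold.
check_fasta() {
    fasta_file=$1
    if [ ! -r "$fasta_file" ]; then
        echo "cannot read $fasta_file" >&2
        return 1
    fi
    # grep -c prints 0 and returns non-zero when nothing matches
    n_headers=$(grep -c '^>' "$fasta_file" || true)
    if [ "$n_headers" -eq 0 ]; then
        echo "no FASTA header line (starting with '>') in $fasta_file" >&2
        return 1
    fi
    # Look at sequence lines only; digits, spaces or carriage returns are flagged
    if grep -v '^>' "$fasta_file" | grep -q '[^A-Za-z]'; then
        echo "unexpected characters in the sequence lines of $fasta_file" >&2
        return 1
    fi
    echo "$n_headers"
}
```

With the file created above, check_fasta NSs.fasta should report a single sequence.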
Then use the following wrapper script in order to run AlphaFold2. You may want to adjust the parameters to your specific case. Make the file executable and run it.
chmod +x run.sh
./run.sh NSs.fasta
As a reference, predicting 5 models for RVFV NSs (the sequence given as an example above) took about 2 h 50 min on jacquere. Once the run is done, you can go to the results folder and analyze your results. A description of the various files produced by the program is provided on the GitHub page of AlphaFold. A wiki page is also dedicated to the visualization of the results using PyMOL.
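AlphaFold stores the per-residue confidence score (pLDDT) in the B-factor field of the PDB files it writes, so a first look at a model does not require a viewer. The sketch below assumes a standard fixed-column PDB file such as ranked_0.pdb and averages that field over the C-alpha atoms; mean_plddt is a hypothetical helper name, not something shipped with AlphaFold.

```shell
# mean_plddt PDB_FILE
# Prints the mean pLDDT over the C-alpha atoms of a PDB file, reading the
# value from the B-factor field (columns 61-66 of ATOM records).
# Hypothetical helper for illustration; it is not part of AlphaFold.
mean_plddt() {
    awk '
        # Keep ATOM records whose atom name (columns 13-16) is " CA "
        substr($0, 1, 4) == "ATOM" && substr($0, 13, 4) == " CA " {
            sum += substr($0, 61, 6)
            n++
        }
        END {
            if (n == 0) {
                print "no C-alpha atoms found" > "/dev/stderr"
                exit 1
            }
            printf "%.2f\n", sum / n
        }
    ' "$1"
}
```

For the example above, the call would be something like mean_plddt NSs/ranked_0.pdb, assuming the results were written to a folder named after the FASTA file.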
Below is the content of the script.
#!/bin/bash
#
# Alphafold 2, version 2.1.0
# wrapper script to run Alphafold2 for a monomer
#
# usage:
#   ./run.sh sequence.fasta
#
# adjust parameters as required
DATA_DIR="/home/sbio/afdb/"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"
python3 /usr/local/alphafold/run_alphafold.py \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --max_template_date=$MAX_TEMPLATE_DATE \
  --db_preset=$DB_PRESET \
  --bfd_database_path=$BFD_DB \
  --uniclust30_database_path=$UNICLUST30_DB \
  --uniref90_database_path=$UNIREF90_DB \
  --mgnify_database_path=$MGNIFY_DB \
  --template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
  --pdb70_database_path=$PDB70_DB \
  --obsolete_pdbs_path=$OBSOLETE_PATH \
  --use_gpu_relax="true" \
  --fasta_paths=${1}
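If you adjust the paths in the script, a mistyped database location may only surface after the run has started. A possible safeguard, sketched here under the assumption that every database value is either a file, a directory, or a prefix shared by several files (as BFD_DB, UNICLUST30_DB and PDB70_DB appear to be), is to test the paths first. check_db_prefix is a hypothetical helper and is not part of AlphaFold.

```shell
# check_db_prefix PREFIX...
# Succeeds only if, for every argument, at least one file or directory
# whose name starts with that prefix exists. Missing ones are listed on stderr.
# Hypothetical helper for illustration; it is not part of AlphaFold.
check_db_prefix() {
    missing=0
    for prefix in "$@"; do
        found=0
        # The glob also matches the prefix itself when it is a file or directory
        for candidate in "$prefix"*; do
            if [ -e "$candidate" ]; then
                found=1
                break
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "missing: $prefix" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

In the wrapper, the function could be defined after the variable assignments and called just before the python3 line, for instance as check_db_prefix "$BFD_DB" "$UNICLUST30_DB" "$UNIREF90_DB" "$MGNIFY_DB" "$TEMPLATE_MMCIF_DIR" "$PDB70_DB" "$OBSOLETE_PATH" || exit 1.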
Using AlphaFold2 via SBGrid
SBGrid is a software infrastructure that includes a library of over 400 structural biology applications, including AlphaFold. The SBGrid software collection is restricted to SBGrid member laboratories.
The current version installed on our server is AlphaFold 2.1.1. More details are given on the SBGrid Wiki page for AlphaFold.
A username and password are required to access the infrastructure. They can be requested by email to Normand Cyr or Ryan Richter.
You need to log into the host named mazuelo. This computer has a GPU card powerful enough for the calculations required by AlphaFold. It will not work on the other computers of the network.
Once logged in, open a terminal window and, at the command prompt, type
sbgrid
This will activate the SBGrid environment and allow you to access the various applications. You should see the following welcome message
********************************************************************************
Software Support by SBGrid (www.sbgrid.org)
********************************************************************************
Your use of the applications contained in the /programs directory constitutes
acceptance of the terms of the SBGrid License Agreement included in the file
/programs/share/LICENSE. The applications distributed by SBGrid are licensed
exclusively to member laboratories of the SBGrid Consortium.
Run sbgrid-accept-license to remove the above message.
********************************************************************************
SBGrid was developed with support from its members, Harvard Medical School,
HHMI, and NSF. If use of SBGrid compiled software was an important element
in your publication, please include the following reference in your work:
Software used in the project was installed and configured by SBGrid.
cite: eLife 2013;2:e01456, Collaboration gets the most out of software.
********************************************************************************
SBGrid installation last updated: 2022-01-16
Please submit bug reports and help requests to:
<bugs@sbgrid.org> or <http://sbgrid.org/bugs>
For additional information visit https://sbgrid.org/wiki
********************************************************************************
If you do not activate the SBGrid environment, you will get the following error message
SBGrid shell environment is not initialized! Please source /programs/sbgrid.shrc or /programs/sbgrid.cshrc to use the software.
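A wrapper script can also test for this situation itself before starting. The sketch below assumes that activating the SBGrid environment makes run_alphafold.py available on the PATH, which the wrapper further down suggests since it calls run_alphafold.py without a directory; require_cmd is a hypothetical helper name.

```shell
# require_cmd NAME
# Succeeds if NAME resolves to a command in the current shell;
# otherwise prints a hint on stderr and fails.
# Hypothetical helper for illustration; it is not provided by SBGrid.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$1 not found; is the SBGrid environment activated?" >&2
    return 1
}
```

Placed at the top of a wrapper, require_cmd run_alphafold.py || exit 1 would stop the script with a readable message if the command cannot be found.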
In order to run AlphaFold2, you will need at least two text files:
- A modified version of the AlphaFold2 wrapper provided by SBGrid that can be downloaded here.
- Your protein sequence in FASTA format (a complete description of the FASTA format can be found here). The description line is mandatory. An example file can be downloaded here.
Example FASTA file:
> Rift Valley Fever Virus NSs protein
MDYFPVISVDLQSGRRVVSVEYFRGDGPPRIPYSMVGPCCVFLMHHRPSHEVRLRFSDFY
NVGEFPYRVGLGDFASNVAPPPAKPFQRLIDLIGHMTLSDFTRFPNLKEAISWPLGEPSL
AFFDLSSTRVHRNDDIRRDQIATLAMRSCKITNDLEDSFVGLHRMIATEAILRGIDLCLL
PGFDLMYEVAHVQCVRLLQAAKEDISNAVVPNSALIVLMEESLMLRSSLPSMMGRNNWIP
VIPPIPDVEMESEEESDDDGFVEVD
Content of the run_alphafold_template.sh wrapper:
#!/bin/bash
#
# AlphaFold2, version 2.1.1
# wrapper script to run AlphaFold2 for a monomer
#
# usage:
#   ./run_alphafold_template.sh sequence.fasta
#
# adjust parameters as required
DATA_DIR="/home/sbio/afdb/"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"
run_alphafold.py \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --max_template_date=$MAX_TEMPLATE_DATE \
  --db_preset=$DB_PRESET \
  --bfd_database_path=$BFD_DB \
  --uniclust30_database_path=$UNICLUST30_DB \
  --uniref90_database_path=$UNIREF90_DB \
  --mgnify_database_path=$MGNIFY_DB \
  --template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
  --pdb70_database_path=$PDB70_DB \
  --obsolete_pdbs_path=$OBSOLETE_PATH \
  --fasta_paths=${1}
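The wrapper passes its first argument to --fasta_paths (--fasta_paths=${1}), so the FASTA file is normally given on the command line. If you prefer, replace the --fasta_paths= value in the run_alphafold_template.sh wrapper with the path to your FASTA file. In either case, make the wrapper script executable.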
chmod +x run_alphafold_template.sh
Then, to run AlphaFold, enter the following command in the terminal prompt:
./run_alphafold_template.sh [FASTA file location]
As a reference, predicting 5 models for RVFV NSs (the sequence given as an example above) took about 60 minutes on mazuelo. Once the run is done, you can go to the results folder and analyze your results. A description of the various files produced by the program is provided on the GitHub page of AlphaFold.
Using pTM models
In order to use pTM models, you need to set the ALPHAFOLD_PTM environment variable before running AlphaFold:
export ALPHAFOLD_PTM=true
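Note that export keeps the variable set for the rest of the terminal session, so later runs in the same window will also use the pTM models. If that is not wanted, one option is to set the variable for a single command only. The sketch below uses the standard env utility; with_ptm is a hypothetical helper name.

```shell
# with_ptm COMMAND [ARGS...]
# Runs COMMAND with ALPHAFOLD_PTM=true in its environment only,
# leaving the calling shell unchanged.
# Hypothetical helper for illustration.
with_ptm() {
    env ALPHAFOLD_PTM=true "$@"
}
```

A run would then be started with something like with_ptm ./run_alphafold_template.sh NSs.fasta, and a following plain ./run_alphafold_template.sh call would not see the variable.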
Using the multimer option
It is now possible to run predictions with multimers. A wrapper script can be downloaded here. Content of the run_alphafold_template_multimer.sh wrapper:
#!/bin/bash
#
# AlphaFold2, version 2.1.1
# wrapper script to run AlphaFold2 for a multimer
#
# usage:
#   ./run_alphafold_template_multimer.sh multiple_sequences.fasta
#
# adjust parameters as required
DATA_DIR="/home/sbio/afdb/"
OUTPUT_DIR=$(pwd)
MAX_TEMPLATE_DATE=$(date -I)
DB_PRESET="full_dbs"
BFD_DB="${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
UNICLUST30_DB="${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
UNIREF90_DB="${DATA_DIR}/uniref90/uniref90.fasta"
MGNIFY_DB="${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa"
TEMPLATE_MMCIF_DIR="${DATA_DIR}/pdb_mmcif/mmcif_files"
PDB70_DB="${DATA_DIR}/pdb70/pdb70"
OBSOLETE_PATH="${DATA_DIR}/pdb_mmcif/obsolete.dat"
PDB_SEQRES_DATABASE_PATH="${DATA_DIR}/pdb_seqres/pdb_seqres.txt"
UNIPROT_DATABASE_PATH="${DATA_DIR}/uniprot/uniprot.fasta"
run_alphafold.py \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --max_template_date=$MAX_TEMPLATE_DATE \
  --db_preset=$DB_PRESET \
  --bfd_database_path=$BFD_DB \
  --uniclust30_database_path=$UNICLUST30_DB \
  --uniref90_database_path=$UNIREF90_DB \
  --mgnify_database_path=$MGNIFY_DB \
  --template_mmcif_dir=$TEMPLATE_MMCIF_DIR \
  --pdb_seqres_database_path=$PDB_SEQRES_DATABASE_PATH \
  --uniprot_database_path=$UNIPROT_DATABASE_PATH \
  --obsolete_pdbs_path=$OBSOLETE_PATH \
  --model_preset=multimer \
  --fasta_paths=${1}
All protein sequences have to be in the same FASTA file.
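If each chain is already stored in its own FASTA file, the files can be merged into one. The following is a minimal sketch; combine_fasta and the file names in the usage note are hypothetical. It uses awk rather than cat so that a file missing its final newline does not end up with its last sequence line glued to the next header.

```shell
# combine_fasta OUTPUT INPUT...
# Concatenates the INPUT FASTA files into OUTPUT, one record per line,
# and prints the number of sequences (header lines) in the result.
# Hypothetical helper for illustration; it is not part of AlphaFold.
combine_fasta() {
    output=$1
    shift
    # awk prints every input line followed by a newline, including the
    # last line of a file that lacks one
    awk '{ print }' "$@" > "$output" || return 1
    grep -c '^>' "$output"
}
```

For example, combine_fasta complex.fasta chainA.fasta chainB.fasta should print 2, and complex.fasta can then be passed to the multimer wrapper.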
Using AlphaFold2 via Google Colab
DeepMind, in collaboration with Google, has provided users with a Colab notebook to run AlphaFold2 (currently version 2.1.0). The notebook can be opened at the address below; instructions on how to proceed are provided in the notebook:
https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb
As stated in the Colab notebook:
"In comparison to AlphaFold2 v2.0, this Colab notebook uses no templates (homologous structures) and a selected portion of the BFD database. We have validated these changes on several thousand recent PDB structures. While accuracy will be near-identical to the full AlphaFold2 system on many targets, a small fraction have a large drop in accuracy due to the smaller MSA and lack of templates. For best reliability, we recommend instead using the full open source AlphaFold, or the AlphaFold2 Protein Structure Database."
There is now a faster approach to the multiple sequence alignment step of AlphaFold2, developed by the group of Martin Steinegger (reference to the preprint paper). It has been implemented in a Google Colab notebook similar to the one above. The prediction should take about 10 minutes to run (vs. 60+ minutes with the DeepMind implementation above).
https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb
Using AlphaFold2 in NMRbox
"NMRbox is a resource for biomolecular NMR (Nuclear Magnetic Resonance) software. It provides tools for finding the software you need, documentation and tutorials for getting the most out of the software, and cloud-based virtual machines for executing the software."
After registering, it is possible to log into a virtual machine and run AlphaFold2 predictions (version 2.1.1). At the moment, only T4-equipped VMs are available for this task: argon.nmrbox.org, oxygen.nmrbox.org, rubidium.nmrbox.org, and zirconium.nmrbox.org. Once logged in:
- Open a terminal and go to the desired output directory
- Create a FASTA file containing the protein sequence of interest
- Run AlphaFold2 using the AlphaFold2 FASTA_file.fasta command. Other options are available; see AlphaFold2 -h.