Repository: google-deepmind/alphafold
Branch: main
Commit: 42719e135a62
Files: 118
Total size: 2.3 MB
Directory structure:
gitextract_p0ek25vk/
├── .dockerignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── afdb/
│ └── README.md
├── alphafold/
│ ├── __init__.py
│ ├── common/
│ │ ├── __init__.py
│ │ ├── confidence.py
│ │ ├── confidence_test.py
│ │ ├── mmcif_metadata.py
│ │ ├── protein.py
│ │ ├── protein_test.py
│ │ ├── residue_constants.py
│ │ ├── residue_constants_test.py
│ │ └── testdata/
│ │   ├── 2rbg.pdb
│ │   ├── 5nmu.pdb
│ │   └── glucagon.pdb
│ ├── data/
│ │ ├── __init__.py
│ │ ├── feature_processing.py
│ │ ├── mmcif_parsing.py
│ │ ├── msa_identifiers.py
│ │ ├── msa_pairing.py
│ │ ├── parsers.py
│ │ ├── pipeline.py
│ │ ├── pipeline_multimer.py
│ │ ├── templates.py
│ │ └── tools/
│ │   ├── __init__.py
│ │   ├── hhblits.py
│ │   ├── hhsearch.py
│ │   ├── hmmbuild.py
│ │   ├── hmmsearch.py
│ │   ├── jackhmmer.py
│ │   ├── kalign.py
│ │   └── utils.py
│ ├── model/
│ │ ├── __init__.py
│ │ ├── all_atom.py
│ │ ├── all_atom_multimer.py
│ │ ├── all_atom_test.py
│ │ ├── base_config.py
│ │ ├── base_config_test.py
│ │ ├── common_modules.py
│ │ ├── config.py
│ │ ├── config_test.py
│ │ ├── data.py
│ │ ├── features.py
│ │ ├── folding.py
│ │ ├── folding_multimer.py
│ │ ├── geometry/
│ │ │ ├── __init__.py
│ │ │ ├── rigid_matrix_vector.py
│ │ │ ├── rotation_matrix.py
│ │ │ ├── struct_of_array.py
│ │ │ ├── test_utils.py
│ │ │ ├── utils.py
│ │ │ └── vector.py
│ │ ├── layer_stack.py
│ │ ├── layer_stack_test.py
│ │ ├── lddt.py
│ │ ├── lddt_test.py
│ │ ├── mapping.py
│ │ ├── model.py
│ │ ├── modules.py
│ │ ├── modules_multimer.py
│ │ ├── prng.py
│ │ ├── prng_test.py
│ │ ├── quat_affine.py
│ │ ├── quat_affine_test.py
│ │ ├── r3.py
│ │ ├── tf/
│ │ │ ├── __init__.py
│ │ │ ├── data_transforms.py
│ │ │ ├── input_pipeline.py
│ │ │ ├── protein_features.py
│ │ │ ├── protein_features_test.py
│ │ │ ├── proteins_dataset.py
│ │ │ ├── shape_helpers.py
│ │ │ ├── shape_helpers_test.py
│ │ │ ├── shape_placeholders.py
│ │ │ └── utils.py
│ │ └── utils.py
│ ├── notebooks/
│ │ ├── __init__.py
│ │ ├── notebook_utils.py
│ │ └── notebook_utils_test.py
│ ├── relax/
│ │ ├── __init__.py
│ │ ├── amber_minimize.py
│ │ ├── amber_minimize_test.py
│ │ ├── cleanup.py
│ │ ├── cleanup_test.py
│ │ ├── relax.py
│ │ ├── relax_test.py
│ │ ├── testdata/
│ │ │ ├── model_output.pdb
│ │ │ ├── multiple_disulfides_target.pdb
│ │ │ ├── with_violations.pdb
│ │ │ └── with_violations_casp14.pdb
│ │ ├── utils.py
│ │ └── utils_test.py
│ └── version.py
├── conftest.py
├── docker/
│ ├── Dockerfile
│ ├── requirements.txt
│ └── run_docker.py
├── docs/
│ └── technical_note_v2.3.0.md
├── notebooks/
│ └── AlphaFold.ipynb
├── pyproject.toml
├── requirements.txt
├── run_alphafold.py
├── run_alphafold_test.py
├── scripts/
│ ├── download_all_data.sh
│ ├── download_alphafold_params.sh
│ ├── download_bfd.sh
│ ├── download_mgnify.sh
│ ├── download_pdb70.sh
│ ├── download_pdb_mmcif.sh
│ ├── download_pdb_seqres.sh
│ ├── download_small_bfd.sh
│ ├── download_uniprot.sh
│ ├── download_uniref30.sh
│ └── download_uniref90.sh
└── server/
  ├── README.md
  └── example.json
================================================
FILE CONTENTS
================================================
================================================
FILE: .dockerignore
================================================
.dockerignore
docker/Dockerfile
================================================
FILE: CONTRIBUTING.md
================================================
# How to Contribute
We welcome small patches related to bug fixes and documentation, but we do not
plan to make any major changes to this repository.
## Contributor License Agreement
Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution,
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com/> to see
your current agreements on file or to sign a new one.
You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.
## Code reviews
All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: README.md
================================================

# AlphaFold
This package provides an implementation of the inference pipeline of AlphaFold
v2. For simplicity, we refer to this model as AlphaFold throughout the rest of
this document.
We also provide:
1. An implementation of AlphaFold-Multimer. This represents a work in progress
and AlphaFold-Multimer isn't expected to be as stable as our monomer
AlphaFold system. [Read the guide](#updating-existing-installation) for how
to upgrade and update code.
2. The [technical note](docs/technical_note_v2.3.0.md) containing the models
and inference procedure for an updated AlphaFold v2.3.0.
3. A [CASP15 baseline](docs/casp15_predictions.zip) set of predictions along
with documentation of any manual interventions performed.
Any publication that discloses findings arising from using this source code or
the model parameters should [cite](#citing-this-work) the
[AlphaFold paper](https://doi.org/10.1038/s41586-021-03819-2) and, if
applicable, the
[AlphaFold-Multimer paper](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1).
Please also refer to the
[Supplementary Information](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-021-03819-2/MediaObjects/41586_2021_3819_MOESM1_ESM.pdf)
for a detailed description of the method.
You can use a slightly simplified version of AlphaFold with
community-supported versions (see below).
If you have any questions, please contact the AlphaFold team at
[alphafold@deepmind.com](mailto:alphafold@deepmind.com).

## Installation and running your first prediction
You will need a machine running Linux; AlphaFold does not support other
operating systems. Full installation requires up to 3 TB of disk space to keep
genetic databases (SSD storage is recommended) and a modern NVIDIA GPU (GPUs
with more memory can predict larger protein structures).
Please follow these steps:
1. Install [Docker](https://www.docker.com/).
* Install
[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
for GPU support.
* Set up running
[Docker as a non-root user](https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user).
1. Clone this repository and `cd` into it.
```bash
git clone https://github.com/deepmind/alphafold.git
cd ./alphafold
```
1. Download genetic databases and model parameters:
* Install `aria2c`. On most Linux distributions it is available via the
package manager as the `aria2` package (on Debian-based distributions
this can be installed by running `sudo apt install aria2`).
`rsync` is also required and can be installed the same way.
* Please use the script `scripts/download_all_data.sh` to download and set
up full databases. This may take substantial time (download size is 556
GB), so we recommend running this script in the background:
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR> > download.log 2> download_all.log &
```
* **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
subdirectory in the AlphaFold repository directory.** If it is, the
Docker build will be slow as the large databases will be copied into the
docker build context.
* It is possible to run AlphaFold with reduced databases; please refer to
the [complete documentation](#genetic-databases).
1. Check that AlphaFold will be able to use a GPU by running:
```bash
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```
The output of this command should show a list of your GPUs. If it doesn't,
check if you followed all steps correctly when setting up the
[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
or take a look at the following
[NVIDIA Docker issue](https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-801479573).
If you wish to run AlphaFold using Singularity (a common containerization
platform on HPC systems) we recommend using some of the third party
Singularity setups as linked in
https://github.com/deepmind/alphafold/issues/10 or
https://github.com/deepmind/alphafold/issues/24.
1. Build the Docker image:
```bash
docker build -f docker/Dockerfile -t alphafold .
```
If you encounter the following error:
```
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease' is not signed.
```
use the workaround described in
https://github.com/deepmind/alphafold/issues/463#issuecomment-1124881779.
1. Install the `run_docker.py` dependencies. Note: You may optionally wish to
create a
[Python Virtual Environment](https://docs.python.org/3/tutorial/venv.html)
to prevent conflicts with your system's Python environment.
```bash
pip3 install -r docker/requirements.txt
```
1. Make sure that the output directory exists (the default is `/tmp/alphafold`)
and that you have sufficient permissions to write into it.
1. Run `run_docker.py` pointing to a FASTA file containing the protein
sequence(s) for which you wish to predict the structure (`--fasta_paths`
parameter). AlphaFold will search for the available templates before the
date specified by the `--max_template_date` parameter; this could be used to
avoid certain templates during modeling. `--data_dir` is the directory with
downloaded genetic databases and `--output_dir` is the absolute path to the
output directory.
```bash
python3 docker/run_docker.py \
--fasta_paths=your_protein.fasta \
--max_template_date=2022-01-01 \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
1. Once the run is over, the output directory will contain predicted
structures of the target protein. Please check the documentation below for
additional options and troubleshooting tips.
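Two of the requirements above (the download directory must not sit inside the repository, and the output directory must exist and be writable) are easy to get wrong. The following is a minimal sketch of a pre-flight check, not part of AlphaFold itself; the function name `preflight_problems` and its arguments are hypothetical and chosen only for illustration.

```python
import os
from pathlib import Path


def preflight_problems(repo_dir, download_dir, output_dir):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: it only encodes the path rules stated in the
    installation steps and does not inspect Docker or the GPU.
    """
    repo = Path(repo_dir).resolve()
    download = Path(download_dir).resolve()
    output = Path(output_dir)
    problems = []

    # A download directory inside the repository would be copied into the
    # Docker build context, which makes the build slow.
    if download == repo or repo in download.parents:
        problems.append(f"{download} is inside the repository {repo}")

    # The output directory is documented as an absolute path that must
    # already exist and be writable.
    if not output.is_absolute():
        problems.append(f"{output} is not an absolute path")
    elif not output.is_dir():
        problems.append(f"{output} does not exist")
    elif not os.access(output, os.W_OK):
        problems.append(f"{output} is not writable")

    return problems
```

Calling `preflight_problems(".", "/data/alphafold_dbs", "/tmp/alphafold")` before `run_docker.py` would, under these assumptions, return an empty list when the layout is sound.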
### Genetic databases
This step requires `aria2c` to be installed on your machine.
AlphaFold needs multiple genetic (sequence) databases to run:
* [BFD](https://bfd.mmseqs.com/),
* [MGnify](https://www.ebi.ac.uk/metagenomics/),
* [PDB70](http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/),
* [PDB](https://www.rcsb.org/) (structures in the mmCIF format),
* [PDB seqres](https://www.rcsb.org/) – only for AlphaFold-Multimer,
* [UniRef30 (formerly UniClust30)](https://uniclust.mmseqs.com/),
* [UniProt](https://www.uniprot.org/uniprot/) – only for AlphaFold-Multimer,
* [UniRef90](https://www.uniprot.org/help/uniref).
We provide a script `scripts/download_all_data.sh` that can be used to download
and set up all of these databases:
* Recommended default:
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR>
```
will download the full databases.
* With `reduced_dbs` parameter:
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs
```
will download a reduced version of the databases to be used with the
`reduced_dbs` database preset. This should be used with the corresponding
AlphaFold parameter `--db_preset=reduced_dbs` later during the AlphaFold run
(please see [AlphaFold parameters](#running-alphafold) section).
:ledger: **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
subdirectory in the AlphaFold repository directory.** If it is, the Docker build
will be slow as the large databases will be copied during the image creation.
We don't provide exactly the database versions used in CASP14 – see the
[note on reproducibility](#note-on-casp14-reproducibility). Some of the
databases are mirrored for speed, see [mirrored databases](#mirrored-databases).
:ledger: **Note: The total download size for the full databases is around 556 GB
and the total size when unzipped is 2.62 TB. Please make sure you have enough
hard drive space, bandwidth and time to download. We recommend using an
SSD for better genetic search performance.**
:ledger: **Note: If the download directory and datasets don't have full read and
write permissions, it can cause errors with the MSA tools, with opaque
(external) error messages. Please ensure the required permissions are applied,
e.g. with the `sudo chmod 755 --recursive "$DOWNLOAD_DIR"` command.**
The `download_all_data.sh` script will also download the model parameter files.
Once the script has finished, you should have the following directory structure:
```
$DOWNLOAD_DIR/ # Total: ~ 2.62 TB (download: 556 GB)
bfd/ # ~ 1.8 TB (download: 271.6 GB)
# 6 files.
mgnify/ # ~ 120 GB (download: 67 GB)
mgy_clusters_2022_05.fa
params/ # ~ 5.3 GB (download: 5.3 GB)
# 5 CASP14 models,
# 5 pTM models,
# 5 AlphaFold-Multimer models,
# LICENSE,
# = 16 files.
pdb70/ # ~ 56 GB (download: 19.5 GB)
# 9 files.
pdb_mmcif/ # ~ 238 GB (download: 43 GB)
mmcif_files/
# About 199,000 .cif files.
obsolete.dat
pdb_seqres/ # ~ 0.2 GB (download: 0.2 GB)
pdb_seqres.txt
small_bfd/ # ~ 17 GB (download: 9.6 GB)
bfd-first_non_consensus_sequences.fasta
uniref30/ # ~ 206 GB (download: 52.5 GB)
# 7 files.
uniprot/ # ~ 105 GB (download: 53 GB)
uniprot.fasta
uniref90/ # ~ 67 GB (download: 34 GB)
uniref90.fasta
```
`bfd/` is only downloaded if you download the full databases, and `small_bfd/`
is only downloaded if you download the reduced databases.
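As a rough illustration, the layout above can be checked with a few lines of Python. This is a sketch only: the directory names are copied from the listing, while the helper name `missing_download_dirs` and the assumption that every other directory is present in both download modes are ours, not something the download scripts guarantee.

```python
from pathlib import Path

# Directory names copied from the listing above. Treating these as present
# in both download modes is an assumption based on that listing.
COMMON_DIRS = [
    "mgnify", "params", "pdb70", "pdb_mmcif",
    "pdb_seqres", "uniref30", "uniprot", "uniref90",
]
# bfd/ comes with the full databases, small_bfd/ with the reduced ones.
PRESET_DIRS = {"full_dbs": ["bfd"], "reduced_dbs": ["small_bfd"]}


def missing_download_dirs(download_dir, download_mode="full_dbs"):
    """Return the expected subdirectories that are absent (hypothetical helper)."""
    if download_mode not in PRESET_DIRS:
        raise ValueError(f"unknown download mode: {download_mode!r}")
    root = Path(download_dir)
    expected = COMMON_DIRS + PRESET_DIRS[download_mode]
    return [name for name in expected if not (root / name).is_dir()]
```

An empty result only says the directories exist; it does not verify file counts, sizes, or permissions.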
### Model parameters
While the AlphaFold code is licensed under the Apache 2.0 License, the AlphaFold
parameters and CASP15 prediction data are made available under the terms of the
CC BY 4.0 license. Please see the [Disclaimer](#license-and-disclaimer) below
for more detail.
The AlphaFold parameters are available from
https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar, and
are downloaded as part of the `scripts/download_all_data.sh` script. This script
will download parameters for:
* 5 models which were used during CASP14, and were extensively validated for
structure prediction quality (see Jumper et al. 2021, Suppl. Methods 1.12
for details).
* 5 pTM models, which were fine-tuned to produce pTM (predicted TM-score) and
predicted aligned error (PAE) values alongside their structure predictions
(see Jumper et al. 2021, Suppl. Methods 1.9.7 for details).
* 5 AlphaFold-Multimer models that produce pTM and PAE values alongside their
structure predictions.
### Updating existing installation
If you have a previous version you can either reinstall fully from scratch
(remove everything and run the setup from scratch) or you can do an incremental
update that will be significantly faster but will require a bit more work. Make
sure you follow these steps in the exact order they are listed below:
1. **Update the code.**
* Go to the directory with the cloned AlphaFold repository and run `git
fetch origin main` to get all code updates.
1. **Update the UniProt, UniRef, MGnify and PDB seqres databases.**
* Remove `<DOWNLOAD_DIR>/uniprot`.
* Run `scripts/download_uniprot.sh <DOWNLOAD_DIR>`.
* Remove `<DOWNLOAD_DIR>/uniclust30`.
* Run `scripts/download_uniref30.sh <DOWNLOAD_DIR>`.
* Remove `<DOWNLOAD_DIR>/uniref90`.
* Run `scripts/download_uniref90.sh <DOWNLOAD_DIR>`.
* Remove `<DOWNLOAD_DIR>/mgnify`.
* Run `scripts/download_mgnify.sh <DOWNLOAD_DIR>`.
* Remove `<DOWNLOAD_DIR>/pdb_mmcif`. PDB SeqRes and PDB must be from exactly
the same date; skipping this step can cause errors when searching for
templates when running AlphaFold-Multimer.
* Run `scripts/download_pdb_mmcif.sh <DOWNLOAD_DIR>`.
* Run `scripts/download_pdb_seqres.sh <DOWNLOAD_DIR>`.
1. **Update the model parameters.**
* Remove the old model parameters in `<DOWNLOAD_DIR>/params`.
* Download new model parameters using
`scripts/download_alphafold_params.sh <DOWNLOAD_DIR>`.
1. **Follow [Running AlphaFold](#running-alphafold).**
#### Using deprecated model weights
To use the deprecated v2.2.0 AlphaFold-Multimer model weights:
1. Change `SOURCE_URL` in `scripts/download_alphafold_params.sh` to
`https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar`,
and download the old parameters.
2. Change the `_v3` to `_v2` in the multimer `MODEL_PRESETS` in `config.py`.
To use the deprecated v2.1.0 AlphaFold-Multimer model weights:
1. Change `SOURCE_URL` in `scripts/download_alphafold_params.sh` to
`https://storage.googleapis.com/alphafold/alphafold_params_2022-01-19.tar`,
and download the old parameters.
2. Remove the `_v3` in the multimer `MODEL_PRESETS` in `config.py`.
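The two recipes above follow one pattern: a dated parameter archive plus a suffix on the multimer model names. The sketch below tabulates that pattern for reference. The URLs and suffixes are the ones quoted in this README; the table name, the `"current"` label and the function `weights_info` are hypothetical, and the sketch does not edit `config.py` or the download script.

```python
PARAMS_URL_TEMPLATE = (
    "https://storage.googleapis.com/alphafold/alphafold_params_{date}.tar"
)

# (archive date, suffix used in the multimer MODEL_PRESETS), as described
# in this README. "current" is just a label chosen for this sketch.
MULTIMER_WEIGHTS = {
    "current": ("2022-12-06", "_v3"),
    "v2.2.0": ("2022-03-02", "_v2"),
    "v2.1.0": ("2022-01-19", ""),
}


def weights_info(version):
    """Return (SOURCE_URL, model-name suffix) for a multimer weights version."""
    try:
        date, suffix = MULTIMER_WEIGHTS[version]
    except KeyError:
        raise ValueError(f"unknown weights version: {version!r}") from None
    return PARAMS_URL_TEMPLATE.format(date=date), suffix
```

For example, `weights_info("v2.1.0")` returns the 2022-01-19 archive URL with an empty suffix, matching the instruction to remove `_v3`.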
## Running AlphaFold
**The simplest way to run AlphaFold is using the provided Docker script.** This
was tested on Google Cloud with a machine using the `nvidia-gpu-cloud-image`
with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional
3 TB disk, and an A100 GPU. For your first run, please follow the instructions
from
[Installation and running your first prediction](#installation-and-running-your-first-prediction)
section.
1. By default, AlphaFold will attempt to use all visible GPU devices. To use a
subset, specify a comma-separated list of GPU UUID(s) or index(es) using the
`--gpu_devices` flag. See
[GPU enumeration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration)
for more details.
1. You can control which AlphaFold model to run by adding the `--model_preset=`
flag. We provide the following models:
* **monomer**: This is the original model used at CASP14 with no
ensembling.
* **monomer\_casp14**: This is the original model used at CASP14 with
`num_ensemble=8`, matching our CASP14 configuration. This is largely
provided for reproducibility as it is 8x more computationally expensive
for limited accuracy gain (+0.1 average GDT gain on CASP14 domains).
* **monomer\_ptm**: This is the original CASP14 model fine tuned with the
pTM head, providing a pairwise confidence measure. It is slightly less
accurate than the normal monomer model.
* **multimer**: This is the [AlphaFold-Multimer](#citing-this-work) model.
To use this model, provide a multi-sequence FASTA file. In addition, the
UniProt database should have been downloaded.
1. You can control MSA speed/quality tradeoff by adding
`--db_preset=reduced_dbs` or `--db_preset=full_dbs` to the run command. We
provide the following presets:
* **reduced\_dbs**: This preset is optimized for speed and lower hardware
requirements. It runs with a reduced version of the BFD database. It
requires 8 CPU cores (vCPUs), 8 GB of RAM, and 600 GB of disk space.
* **full\_dbs**: This runs with all genetic databases used at CASP14.
Running the command above with the `monomer` model preset and the
`reduced_dbs` data preset would look like this:
```bash
python3 docker/run_docker.py \
--fasta_paths=T1050.fasta \
--max_template_date=2020-05-14 \
--model_preset=monomer \
--db_preset=reduced_dbs \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
1. After generating the predicted model, AlphaFold runs a relaxation step to
improve local geometry. By default, only the best model (by pLDDT) is
relaxed (`--models_to_relax=best`); alternatively, all of the models
(`--models_to_relax=all`) or none of the models (`--models_to_relax=none`)
can be relaxed.
1. The relaxation step can be run on GPU (faster, but could be less stable) or
CPU (slow, but stable). This can be controlled with
`--enable_gpu_relax=true` (default) or `--enable_gpu_relax=false`.
1. AlphaFold can reuse MSAs (multiple sequence alignments) for the same
sequence via `--use_precomputed_msas=true` option; this can be useful for
trying different AlphaFold parameters. This option assumes that the
directory structure generated by the first AlphaFold run in the output
directory exists and that the protein sequence is the same.
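When runs are launched from scripts, it can help to assemble the flags above in one place and reject misspelled preset names early. The following is a minimal sketch under stated assumptions: the flag names and allowed values are taken from this section, while the function `build_run_command` and the default values of `model_preset` and `db_preset` are ours for illustration. It only builds the argument list and does not invoke Docker.

```python
MODEL_PRESETS = {"monomer", "monomer_casp14", "monomer_ptm", "multimer"}
DB_PRESETS = {"reduced_dbs", "full_dbs"}
MODELS_TO_RELAX = {"best", "all", "none"}


def build_run_command(fasta_paths, max_template_date, data_dir, output_dir,
                      model_preset="monomer", db_preset="full_dbs",
                      models_to_relax="best", enable_gpu_relax=True,
                      use_precomputed_msas=False):
    """Return an argv list for docker/run_docker.py (hypothetical helper)."""
    if model_preset not in MODEL_PRESETS:
        raise ValueError(f"unknown model preset: {model_preset!r}")
    if db_preset not in DB_PRESETS:
        raise ValueError(f"unknown db preset: {db_preset!r}")
    if models_to_relax not in MODELS_TO_RELAX:
        raise ValueError(f"unknown models_to_relax: {models_to_relax!r}")

    def flag(value):
        # Boolean flags are written as true/false in this README.
        return "true" if value else "false"

    return [
        "python3", "docker/run_docker.py",
        f"--fasta_paths={','.join(fasta_paths)}",
        f"--max_template_date={max_template_date}",
        f"--model_preset={model_preset}",
        f"--db_preset={db_preset}",
        f"--models_to_relax={models_to_relax}",
        f"--enable_gpu_relax={flag(enable_gpu_relax)}",
        f"--use_precomputed_msas={flag(use_precomputed_msas)}",
        f"--data_dir={data_dir}",
        f"--output_dir={output_dir}",
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root; every flag is emitted explicitly so the command does not depend on the script's own defaults.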
### Running AlphaFold-Multimer
All steps are the same as when running the monomer system, but you will have to:
* provide an input FASTA file with multiple sequences,
* set `--model_preset=multimer`.
An example that folds a protein complex `multimer.fasta`:
```bash
python3 docker/run_docker.py \
--fasta_paths=multimer.fasta \
--max_template_date=2020-05-14 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
By default the multimer system will run 5 seeds per model (25 total predictions).
For a small drop in accuracy you may wish to run a single seed per model. This
can be done via the `--num_multimer_predictions_per_model` flag, e.g. set it to
`--num_multimer_predictions_per_model=1` to run a single seed per model.
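The arithmetic behind these numbers is simply models times seeds. A tiny sketch, assuming the 5 AlphaFold-Multimer models described in this README (the function name is hypothetical):

```python
def total_multimer_predictions(predictions_per_model=5, num_models=5):
    """Number of structures produced by one multimer run (illustrative only)."""
    if predictions_per_model < 1 or num_models < 1:
        raise ValueError("both counts must be at least 1")
    return predictions_per_model * num_models
```

With the default of 5 seeds this gives 25 predictions; with `--num_multimer_predictions_per_model=1` it gives 5.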
### AlphaFold prediction speed
The table below reports prediction runtimes for proteins of various lengths. We
only measure unrelaxed structure prediction with three recycles while excluding
runtimes from MSA and template search. When running `docker/run_docker.py` with
`--benchmark=true`, this runtime is stored in `timings.json`. All runtimes are
from a single A100 NVIDIA GPU. Prediction speed on A100 for smaller structures
can be improved by increasing `global_config.subbatch_size` in
`alphafold/model/config.py`.
No. residues | Prediction time (s)
-----------: | ------------------:
100 | 4.9
200 | 7.7
300 | 13
400 | 18
500 | 29
600 | 36
700 | 53
800 | 60
900 | 91
1,000 | 96
1,100 | 140
1,500 | 280
2,000 | 450
2,500 | 969
3,000 | 1,240
3,500 | 2,465
4,000 | 5,660
4,500 | 12,475
5,000 | 18,824
### Examples
Below are examples of how to use AlphaFold in different scenarios.
#### Folding a monomer
Say we have a monomer with the sequence `<SEQUENCE>`. The input fasta should be:
```fasta
>sequence_name
<SEQUENCE>
```
Then run the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=monomer.fasta \
--max_template_date=2021-11-01 \
--model_preset=monomer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
#### Folding a homomer
Say we have a homomer with 3 copies of the same sequence `<SEQUENCE>`. The input
fasta should be:
```fasta
>sequence_1
<SEQUENCE>
>sequence_2
<SEQUENCE>
>sequence_3
<SEQUENCE>
```
Then run the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=homomer.fasta \
--max_template_date=2021-11-01 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
#### Folding a heteromer
Say we have an A2B3 heteromer, i.e. with 2 copies of `<SEQUENCE A>` and 3 copies
of `<SEQUENCE B>`. The input fasta should be:
```fasta
>sequence_1
<SEQUENCE A>
>sequence_2
<SEQUENCE A>
>sequence_3
<SEQUENCE B>
>sequence_4
<SEQUENCE B>
>sequence_5
<SEQUENCE B>
```
Then run the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=heteromer.fasta \
--max_template_date=2021-11-01 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
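The A2B3 input shown above can also be generated programmatically. The following is a minimal sketch and not part of AlphaFold: the helper name `make_multimer_fasta` and the short placeholder sequences are illustrative assumptions. It writes one FASTA record per chain copy and numbers the records `sequence_1`, `sequence_2`, and so on, as in the examples above.

```python
def make_multimer_fasta(chains):
    """Build FASTA text with one record per chain copy.

    `chains` is a list of (sequence, copies) pairs, so an A2B3 heteromer is
    [(seq_a, 2), (seq_b, 3)]. Hypothetical helper, not an AlphaFold API.
    """
    records = []
    index = 1
    for sequence, copies in chains:
        if copies < 1:
            raise ValueError("each chain needs at least one copy")
        for _ in range(copies):
            records.append(f">sequence_{index}\n{sequence}")
            index += 1
    return "\n".join(records) + "\n"


# Placeholder sequences for illustration only.
fasta_text = make_multimer_fasta([("MKTAYIAK", 2), ("GSHMLEDP", 3)])
print(fasta_text)
```

The resulting text can be saved as `heteromer.fasta` and passed to `--fasta_paths` together with `--model_preset=multimer`.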
#### Folding multiple monomers one after another
Say we have two monomers, `monomer1.fasta` and `monomer2.fasta`.
We can fold both sequentially by using the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=monomer1.fasta,monomer2.fasta \
--max_template_date=2021-11-01 \
--model_preset=monomer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
#### Folding multiple multimers one after another
Say we have two multimers, `multimer1.fasta` and `multimer2.fasta`.
We can fold both sequentially by using the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=multimer1.fasta,multimer2.fasta \
--max_template_date=2021-11-01 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
### AlphaFold output
The outputs will be saved in a subdirectory of the directory provided via the
`--output_dir` flag of `run_docker.py` (defaults to `/tmp/alphafold/`). The
outputs include the computed MSAs, unrelaxed structures, relaxed structures,
ranked structures, raw model outputs, prediction metadata, and section timings.
The `--output_dir` directory will have the following structure:
```
<target_name>/
    features.pkl
    ranked_{0,1,2,3,4}.pdb
    ranking_debug.json
    relax_metrics.json
    relaxed_model_{1,2,3,4,5}.pdb
    result_model_{1,2,3,4,5}.pkl
    timings.json
    unrelaxed_model_{1,2,3,4,5}.pdb
    msas/
        bfd_uniref_hits.a3m
        mgnify_hits.sto
        uniref90_hits.sto
```
The contents of each output file are as follows:
* `features.pkl` – A `pickle` file containing the input feature NumPy arrays
used by the models to produce the structures.
* `unrelaxed_model_*.pdb` – A PDB format text file containing the predicted
structure, exactly as outputted by the model.
* `relaxed_model_*.pdb` – A PDB format text file containing the predicted
structure, after performing an Amber relaxation procedure on the unrelaxed
structure prediction (see Jumper et al. 2021, Suppl. Methods 1.8.6 for
details).
* `ranked_*.pdb` – A PDB format text file containing the predicted structures,
after reordering by model confidence. Here `ranked_i.pdb` should contain the
prediction with the (`i + 1`)-th highest confidence (so that `ranked_0.pdb`
has the highest confidence). To rank model confidence, we use predicted LDDT
(pLDDT) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6 for details).
If `--models_to_relax=all` then all ranked structures are relaxed. If
`--models_to_relax=best` then only `ranked_0.pdb` is relaxed (the rest are
unrelaxed). If `--models_to_relax=none`, then the ranked structures are all
unrelaxed.
* `ranking_debug.json` – A JSON format text file containing the pLDDT values
used to perform the model ranking, and a mapping back to the original model
names.
* `relax_metrics.json` – A JSON format text file containing relax metrics, for
instance remaining violations.
* `timings.json` – A JSON format text file containing the times taken to run
each section of the AlphaFold pipeline.
* `msas/` – A directory containing the files describing the various genetic
tool hits that were used to construct the input MSA.
* `result_model_*.pkl` – A `pickle` file containing a nested dictionary of the
  various NumPy arrays directly produced by the model. In addition to the
  output of the structure module, this includes auxiliary outputs such as:
    * Distograms (`distogram/logits` contains a NumPy array of shape [N_res,
      N_res, N_bins] and `distogram/bin_edges` contains the definition of the
      bins).
    * Per-residue pLDDT scores (`plddt` contains a NumPy array of shape
      [N_res] with the range of possible values from `0` to `100`, where `100`
      means most confident). This can serve to identify sequence regions
      predicted with high confidence or as an overall per-target confidence
      score when averaged across residues.
    * Present only if using pTM models: predicted TM-score (`ptm` field
      contains a scalar). As a predictor of a global superposition metric,
      this score is designed to also assess whether the model is confident in
      the overall domain packing.
    * Present only if using pTM models: predicted pairwise aligned errors
      (`predicted_aligned_error` contains a NumPy array of shape [N_res,
      N_res] with the range of possible values from `0` to
      `max_predicted_aligned_error`, where `0` means most confident). This can
      serve for a visualisation of domain packing confidence within the
      structure.
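As a rough illustration of how these fields might be read back, the sketch below loads one `result_model_*.pkl` file and summarises it. The function name `summarize_result` is hypothetical, and it assumes that the `distogram/logits` notation above denotes the nested entry `result["distogram"]["logits"]`. Unpickling a real result file also requires NumPy (and any other libraries referenced by the pickle) to be installed.

```python
import pickle


def summarize_result(path):
    """Return a small summary of a result_model_*.pkl file (illustrative)."""
    with open(path, "rb") as f:
        result = pickle.load(f)  # Only unpickle files you trust.

    plddt = result["plddt"]  # Per-residue confidence, 0 to 100.
    summary = {
        "num_residues": len(plddt),
        "mean_plddt": float(sum(plddt)) / len(plddt),
    }
    # pTM and PAE are present only for pTM models, so treat them as optional.
    if "ptm" in result:
        summary["ptm"] = float(result["ptm"])
    if "predicted_aligned_error" in result:
        pae = result["predicted_aligned_error"]
        summary["pae_shape"] = (len(pae), len(pae[0]))
    # Assumption: "distogram/logits" denotes result["distogram"]["logits"].
    if "distogram" in result:
        summary["distogram_bins"] = len(result["distogram"]["logits"][0][0])
    return summary
```

The mean pLDDT computed here corresponds to the per-target confidence score described above.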
The pLDDT confidence measure is stored in the B-factor field of the output PDB
files (although unlike a B-factor, higher pLDDT is better, so care must be taken
when using for tasks such as molecular replacement).
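Because pLDDT occupies the B-factor field, it can be recovered from any output PDB file with plain text parsing. The sketch below is illustrative only (the function name `plddt_per_residue` is hypothetical). It relies on the fixed-column PDB layout, in which the B-factor occupies columns 61 to 66 of each `ATOM` record, and it assumes that all atoms of a residue carry the same value, so reading the CA atom is enough.

```python
def plddt_per_residue(pdb_text):
    """Map (chain_id, residue_number) -> pLDDT read from the B-factor column."""
    scores = {}
    for line in pdb_text.splitlines():
        # Use one atom per residue; CA is present in every standard residue.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain_id = line[21]
            residue_number = int(line[22:26])
            scores[(chain_id, residue_number)] = float(line[60:66])
    return scores


# Example usage (the path is a placeholder):
#   with open("ranked_0.pdb") as f:
#       scores = plddt_per_residue(f.read())
```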
This code has been tested to match mean top-1 accuracy on a CASP14 test set with
pLDDT ranking over 5 model predictions (some CASP targets were run with earlier
versions of AlphaFold and some had manual interventions; see our forthcoming
publication for details). Some targets such as T1064 may also have high
individual run variance over random seeds.
## Inferencing many proteins
The provided inference script is optimized for predicting the structure of a
single protein, and it will compile the neural network to be specialized to
exactly the size of the sequence, MSA, and templates. For large proteins, the
compile time is a negligible fraction of the runtime, but it may become more
significant for small proteins or if the multi-sequence alignments are already
precomputed. In the bulk inference case, it may make sense to use our
`make_fixed_size` function to pad the inputs to a uniform size, thereby reducing
the number of compilations required.
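The saving from padding can be illustrated with a small sketch. This is not the `make_fixed_size` function itself: the helper name `padded_length` and the bucket size of 256 residues are illustrative assumptions. Rounding each sequence length up to a multiple of the bucket size means that all proteins in the same bucket share one compiled shape.

```python
import math


def padded_length(num_residues, bucket_size=256):
    """Round a sequence length up to the next multiple of `bucket_size`."""
    return math.ceil(num_residues / bucket_size) * bucket_size


lengths = [97, 130, 255, 300, 410, 512, 530]
padded = [padded_length(n) for n in lengths]
print(padded)  # [256, 256, 256, 512, 512, 512, 768]
print(len(set(lengths)), "shapes unpadded vs", len(set(padded)), "padded")
```

In this example seven distinct input sizes collapse to three. Since the network is specialized to the size of the sequence, MSA, and templates, the MSA and template dimensions would need the same treatment.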
We do not provide a bulk inference script, but it should be straightforward to
develop on top of the `RunModel.predict` method with a parallel system for
precomputing multi-sequence alignments. Alternatively, this script can be run
repeatedly with only moderate overhead.
## Note on CASP14 reproducibility
AlphaFold's output for a small number of proteins has high inter-run variance,
and may be affected by changes in the input data. The CASP14 target T1064 is a
notable example; the large number of SARS-CoV-2-related sequences recently
deposited changes its MSA significantly. This variability is somewhat mitigated
by the model selection process: running 5 models and taking the most confident.
To reproduce the results of our CASP14 system as closely as possible you must
use the same database versions we used in CASP. These may not match the default
versions downloaded by our scripts.
For genetics:
* UniRef90:
[v2020_01](https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2020_01/uniref/)
* MGnify:
[v2018_12](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/)
* Uniclust30: [v2018_08](http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/)
* BFD: [only version available](https://bfd.mmseqs.com/)
For templates:
* PDB: (downloaded 2020-05-14)
* PDB70:
[2020-05-13](http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/pdb70_from_mmcif_200513.tar.gz)
An alternative for templates is to use the latest PDB and PDB70, but pass the
flag `--max_template_date=2020-05-14`, which restricts templates only to
structures that were available at the start of CASP14.
## Citing this work
If you use the code or data in this package, please cite:
```bibtex
@Article{AlphaFold2021,
author = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\v{Z}}{\'\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},
journal = {Nature},
title = {Highly accurate protein structure prediction with {AlphaFold}},
year = {2021},
volume = {596},
number = {7873},
pages = {583--589},
doi = {10.1038/s41586-021-03819-2}
}
```
In addition, if you use the AlphaFold-Multimer mode, please cite:
```bibtex
@article {AlphaFold-Multimer2021,
author = {Evans, Richard and O{\textquoteright}Neill, Michael and Pritzel, Alexander and Antropova, Natasha and Senior, Andrew and Green, Tim and {\v{Z}}{\'\i}dek, Augustin and Bates, Russ and Blackwell, Sam and Yim, Jason and Ronneberger, Olaf and Bodenstein, Sebastian and Zielinski, Michal and Bridgland, Alex and Potapenko, Anna and Cowie, Andrew and Tunyasuvunakool, Kathryn and Jain, Rishub and Clancy, Ellen and Kohli, Pushmeet and Jumper, John and Hassabis, Demis},
journal = {bioRxiv},
title = {Protein complex prediction with AlphaFold-Multimer},
year = {2021},
elocation-id = {2021.10.04.463034},
doi = {10.1101/2021.10.04.463034},
URL = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034},
eprint = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034.full.pdf},
}
```
## Community contributions
Colab notebooks provided by the community (please note that these notebooks may
vary from our full AlphaFold system and we did not validate their accuracy):
* The
[ColabFold AlphaFold2 notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb)
by Martin Steinegger, Sergey Ovchinnikov and Milot Mirdita, which uses an
API hosted at the Södinglab based on the MMseqs2 server
[(Mirdita et al. 2019, Bioinformatics)](https://academic.oup.com/bioinformatics/article/35/16/2856/5280135)
for the multiple sequence alignment creation.
## Acknowledgements
AlphaFold communicates with and/or references the following separate libraries
and packages:
* [Abseil](https://github.com/abseil/abseil-py)
* [Biopython](https://biopython.org)
* [Colab](https://research.google.com/colaboratory/)
* [Docker](https://www.docker.com)
* [HH Suite](https://github.com/soedinglab/hh-suite)
* [HMMER Suite](http://eddylab.org/software/hmmer)
* [Haiku](https://github.com/deepmind/dm-haiku)
* [JAX](https://github.com/google/jax/)
* [Kalign](https://msa.sbc.su.se/cgi-bin/msa.cgi)
* [matplotlib](https://matplotlib.org/)
* [ML Collections](https://github.com/google/ml_collections)
* [NumPy](https://numpy.org)
* [OpenMM](https://github.com/openmm/openmm)
* [OpenStructure](https://openstructure.org)
* [py3Dmol](https://github.com/avirshup/py3dmol)
* [Sonnet](https://github.com/deepmind/sonnet)
* [TensorFlow](https://github.com/tensorflow/tensorflow)
* [Tree](https://github.com/deepmind/tree)
* [tqdm](https://github.com/tqdm/tqdm)
We thank all their contributors and maintainers!
## Get in Touch
If you have any questions not covered in this overview, please contact the
AlphaFold team at [alphafold@deepmind.com](mailto:alphafold@deepmind.com).
We would love to hear your feedback and understand how AlphaFold has been useful
in your research. Share your stories with us at
[alphafold@deepmind.com](mailto:alphafold@deepmind.com).
## License and Disclaimer
This is not an officially supported Google product.
Copyright 2022 DeepMind Technologies Limited.
AlphaFold 2 and its output are for theoretical modeling only. They are not
intended, validated, or approved for clinical use. You should not use the
AlphaFold 2 or its output for clinical purposes or rely on them for medical or
other professional advice. Any content regarding those topics is provided for
informational purposes only and is not a substitute for advice from a qualified
professional.
Output of AlphaFold 2 are predictions with varying levels of confidence and
should be interpreted carefully. Use discretion before relying on, publishing,
downloading or otherwise using AlphaFold 2 and its output.
### AlphaFold Code License
Licensed under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of the
License at https://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
### Model Parameters License
The AlphaFold parameters are made available under the terms of the Creative
Commons Attribution 4.0 International (CC BY 4.0) license. You can find details
at: https://creativecommons.org/licenses/by/4.0/legalcode
### Third-party software
Use of the third-party software, libraries or code referred to in the
[Acknowledgements](#acknowledgements) section above may be governed by separate
terms and conditions or license provisions. Your use of the third-party
software, libraries or code is subject to any such terms and you should check
that you can comply with any applicable restrictions or terms and conditions
before use.
### Mirrored Databases
The following databases have been mirrored by DeepMind, and are available with
reference to the following:
* [BFD](https://bfd.mmseqs.com/) (unmodified), by Steinegger M. and Söding J.,
available under a
[Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
* [BFD](https://bfd.mmseqs.com/) (modified), by Steinegger M. and Söding J.,
modified by DeepMind, available under a
[Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
See the Methods section of the
[AlphaFold proteome paper](https://www.nature.com/articles/s41586-021-03828-1)
for details.
* [Uniref30: v2021_03](http://wwwuser.gwdg.de/~compbiol/uniclust/2021_03/)
(unmodified), by Mirdita M. et al., available under a
[Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
* [MGnify: v2022_05](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/README.txt)
(unmodified), by Mitchell AL et al., available free of all copyright
restrictions and made fully and freely available for both non-commercial and
commercial use under
[CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).
================================================
FILE: afdb/README.md
================================================
# AlphaFold Protein Structure Database
## Introduction
The AlphaFold UniProt release (214M predictions) is hosted on
[Google Cloud Public Datasets](https://console.cloud.google.com/marketplace/product/bigquery-public-data/deepmind-alphafold),
and is available to download at no cost under a
[CC-BY-4.0 licence](http://creativecommons.org/licenses/by/4.0/legalcode). The
dataset is in a Cloud Storage bucket, and metadata is available on BigQuery. A
Google Cloud account is required for the download, but the data can be freely
used under the terms of the
[CC-BY 4.0 Licence](http://creativecommons.org/licenses/by/4.0/legalcode).
This document provides an overview of how to access and download the dataset for
different use cases. Please refer to the [AlphaFold database FAQ](https://www.alphafold.com/faq)
for further information on what proteins are in the database and a changelog of
releases.
:ledger: **Note: The full dataset is difficult to manipulate without significant
computational resources (the size of the dataset is 23 TiB, 3 * 214M files).**
There are also alternatives to downloading the full dataset:
1. Download a premade subset (covering important species / Swiss-Prot) via our
[download page](https://alphafold.ebi.ac.uk/download).
2. Download a custom subset of the data. See below.
If you need to download the full dataset then please see the "Bulk download"
section. See "Creating a Google Cloud Account" below for more information on how
to avoid any surprise costs when using Google Cloud Public Datasets.
## Licence
Data is available for academic and commercial use, under a
[CC-BY-4.0 licence](http://creativecommons.org/licenses/by/4.0/legalcode).
EMBL-EBI expects attribution (e.g. in publications, services or products) for
any of its online services, databases or software in accordance with good
scientific practice.
If you make use of an AlphaFold prediction, please cite the following papers:
* [Jumper, J et al. Highly accurate protein structure prediction with
AlphaFold. Nature
(2021).](https://www.nature.com/articles/s41586-021-03819-2)
* [Varadi, M et al. AlphaFold Protein Structure Database: massively expanding
the structural coverage of protein-sequence space with high-accuracy models.
Nucleic Acids Research
(2021).](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab1061/6430488)
AlphaFold Data Copyright (2022) DeepMind Technologies Limited.
## Disclaimer
The AlphaFold Data and other information provided on this site is for
theoretical modelling only, caution should be exercised in its use. It is
provided 'as-is' without any warranty of any kind, whether expressed or implied.
For clarity, no warranty is given that use of the information shall not infringe
the rights of any third party. The information is not intended to be a
substitute for professional medical advice, diagnosis, or treatment, and does
not constitute medical or other professional advice.
## Format
Dataset file names start with a protein identifier of the form `AF-[a UniProt
accession]-F[a fragment number]`.
Three files are provided for each entry:
* **model_v4.cif** – contains the atomic coordinates for the predicted protein
structure, along with some metadata. Useful references for this file format
are the [ModelCIF](https://github.com/ihmwg/ModelCIF) and
[PDBx/mmCIF](https://mmcif.wwpdb.org) project sites.
* **confidence_v4.json** – contains a confidence metric output by AlphaFold
called pLDDT. This provides a number for each residue, indicating how
confident AlphaFold is in the *local* surrounding structure. pLDDT ranges
from 0 to 100, where 100 is most confident. This is also contained in the
CIF file.
* **predicted_aligned_error_v4.json** – contains a confidence metric output by
AlphaFold called PAE. This provides a number for every pair of residues,
which is lower when AlphaFold is more confident in the relative position of
the two residues. PAE is more suitable than pLDDT for judging confidence in
relative domain placements.
[See here](https://alphafold.ebi.ac.uk/faq#faq-7) for a description of the
format.
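The naming scheme can be sketched in a few lines. The helper name `entry_files` is hypothetical. The bucket name and the hyphen joining the identifier to each file suffix follow the example queries later in this document.

```python
BUCKET = "gs://public-datasets-deepmind-alphafold-v4"
FILE_SUFFIXES = (
    "model_v4.cif",
    "confidence_v4.json",
    "predicted_aligned_error_v4.json",
)


def entry_files(uniprot_accession, fragment=1):
    """Return the three bucket paths for one AlphaFold DB entry."""
    entry_id = f"AF-{uniprot_accession}-F{fragment}"
    return [f"{BUCKET}/{entry_id}-{suffix}" for suffix in FILE_SUFFIXES]


for path in entry_files("A8H2R3"):
    print(path)
```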
Predictions grouped by NCBI taxonomy ID are available as
`proteomes/proteome-tax_id-[TAX ID]-[SHARD ID]_v4.tar` within the same
bucket.
There are also two extra files stored in the bucket:
* `accession_ids.csv` – This file contains a list of all the UniProt
  accessions that have predictions in AlphaFold DB. The file is in CSV format
  and includes the following columns, separated by a comma:
    * UniProt accession, e.g. A8H2R3
    * First residue index (UniProt numbering), e.g. 1
    * Last residue index (UniProt numbering), e.g. 199
    * AlphaFold DB identifier, e.g. AF-A8H2R3-F1
    * Latest version, e.g. 4
* `sequences.fasta` – This file contains sequences for all proteins in the
  current database version in FASTA format. The identifier rows start with
  ">AFDB", followed by the AlphaFold DB identifier and the name of the
  protein. The sequence rows contain the corresponding amino acid sequences.
  Each sequence is on a single line, i.e. there is no wrapping.
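A minimal sketch of reading `accession_ids.csv` with the standard library is shown below. The names `AccessionRow` and `read_accession_ids` are hypothetical, and the code assumes that the file has no header row and that the columns appear in the order listed above.

```python
import csv
import io
from typing import NamedTuple


class AccessionRow(NamedTuple):
    uniprot_accession: str
    first_residue: int
    last_residue: int
    afdb_id: str
    latest_version: int


def read_accession_ids(lines):
    """Yield one AccessionRow per CSV line (assumes no header row)."""
    for accession, first, last, afdb_id, version in csv.reader(lines):
        yield AccessionRow(accession, int(first), int(last), afdb_id, int(version))


# The sample row uses the example values from the column list above.
sample = io.StringIO("A8H2R3,1,199,AF-A8H2R3-F1,4\n")
rows = list(read_accession_ids(sample))
print(rows[0].afdb_id, rows[0].last_residue - rows[0].first_residue + 1)
```

For the real file, pass an open file handle instead of the `io.StringIO` sample. Because the function is a generator, rows are processed one at a time rather than loaded into memory together.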
## Creating a Google Cloud Account
Downloading from the Google Cloud Public Datasets (rather than from AFDB or 3D
Beacons) requires a Google Cloud account. See the
[Google Cloud get started](https://cloud.google.com/docs/get-started) page, and
explore the [free tier account usage limits](https://cloud.google.com/free).
**IMPORTANT: After the trial period has finished (90 days), to continue access,
you are required to upgrade to a billing account. While your free tier access
(including access to the Public Datasets storage bucket) continues, usage beyond
the free tier will incur costs – please familiarise yourself with the pricing
for the services that you use to avoid any surprises.**
1. Go to
   [https://cloud.google.com/datasets](https://cloud.google.com/datasets).
2. Create an account:
    1. Click "get started for free" in the top right corner.
    2. Agree to all terms of service.
    3. Follow the setup instructions. Note that a payment method is required,
       but this will not be used unless you enable billing.
    4. Access to the Google Cloud Public Datasets storage bucket is always at
       no cost and you will have access to the
       [free tier.](https://cloud.google.com/free/docs/gcp-free-tier#free-tier-usage-limits)
3. Set up a project:
    1. In the top left corner, click the navigation menu (three horizontal bars
       icon).
    2. Select: "Cloud overview" -> "Dashboard".
    3. In the top left corner there is a project menu bar (likely says "My
       First Project"). Select this and a "Select a Project" box will appear.
    4. To keep using this project, click "Cancel" at the bottom of the box.
    5. To create a new project, click "New Project" at the top of the box:
        1. Select a project name.
        2. For location, if your organization has a Cloud account then select
           this, otherwise leave as is.
4. Install `gsutil`:
    1. Follow these
       [instructions](https://cloud.google.com/storage/docs/gsutil_install).
## Accessing the dataset
The data is available from:
* GCS data bucket:
[gs://public-datasets-deepmind-alphafold-v4](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4)
## Bulk download
We don't recommend downloading the full dataset unless required for processing
with local computational resources, for example in an academic high performance
computing centre.
We estimate that a 1 Gbps internet connection will allow download of the full
database in roughly 2.5 days.
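As a sanity check on this estimate, the ideal transfer time can be computed from the dataset size of 23 TiB quoted in the introduction. The sketch below assumes the link is fully saturated with no protocol overhead, so it gives a lower bound.

```python
TIB = 2**40  # Bytes in one tebibyte.

dataset_bytes = 23 * TIB
link_bits_per_second = 1e9  # 1 Gbps.

seconds = dataset_bytes * 8 / link_bits_per_second
days = seconds / 86_400
print(f"{days:.2f} days")  # 2.34 days at full line rate
```

The quoted figure of roughly 2.5 days is slightly higher than this lower bound, which is consistent with some overhead on a real connection.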
While we don’t know the exact nature of your computational infrastructure, below
are some suggested approaches for downloading the dataset. Please reach out to
[alphafold@deepmind.com](mailto:alphafold@deepmind.com) if you have any
questions.
The recommended way of downloading the whole database is by downloading
1,015,797 sharded proteome tar files using the command below. This is
significantly faster than downloading all of the individual files because of
large constant per-file latency.
```bash
gsutil -m cp -r gs://public-datasets-deepmind-alphafold-v4/proteomes/ .
```
You will then have to un-tar all of the proteomes and un-gzip all of the
individual files. Note that after un-tarring, there will be about 644M files, so
make sure your filesystem can handle this.
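The un-tar and un-gzip steps can be scripted with the Python standard library. The sketch below is illustrative: the function name `unpack_shard` is hypothetical, and it assumes that the compressed members inside each tar file carry a `.gz` extension.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_shard(tar_path, out_dir):
    """Extract one proteome tar and decompress every .gz member in place."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        tar.extractall(out_dir)  # Only extract archives from a trusted source.

    unpacked = []
    # sorted() materializes the listing before files are added or removed.
    for gz_path in sorted(out_dir.rglob("*.gz")):
        target = gz_path.with_suffix("")  # Drops the trailing ".gz".
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # Remove the compressed copy to save space.
        unpacked.append(target)
    return unpacked
```

Processing one shard at a time in this way keeps the number of files created per step bounded, which may help on filesystems that struggle with very large directories.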
### Storage Transfer Service
Some users might find the
[Storage Transfer Service](https://cloud.google.com/storage-transfer-service) a
convenient way to set up the transfer between this bucket and another bucket, or
another cloud service. *Using this service may incur costs*. Please check the
[pricing page](https://cloud.google.com/storage-transfer/pricing) for more
detail, particularly for transfers to other cloud services.
## Downloading subsets of the data
### AlphaFold Database search
For simple queries, for example by protein name, gene name or UniProt accession
you can use the main search bar on
[alphafold.ebi.ac.uk](https://alphafold.ebi.ac.uk).
### 3D Beacons
[3D-Beacons](https://3d-beacons.org) is an international collaboration of
protein structure data providers to create a federated network with unified data
access mechanisms. The 3D-Beacons platform allows users to retrieve coordinate
files and metadata of experimentally determined and theoretical protein models
from data providers such as AlphaFold DB.
More information about how to access AlphaFold predictions using 3D-Beacons is
available at
[3D-Beacons documentation](https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/docs).
### Other premade species subsets
Downloads for some model organism proteomes, global health proteomes and
Swiss-Prot are available on the
[AFDB website](https://alphafold.ebi.ac.uk/download). These are generated from
[reference proteomes](https://www.uniprot.org/help/reference_proteome). If you
want other species, or *all* proteins for a particular species, please continue
reading.
We provide 1,015,797 sharded tar files for all species in
[gs://public-datasets-deepmind-alphafold-v4/proteomes/](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/proteomes/).
We shard each proteome so that each shard contains at most 10,000 proteins
(which corresponds to 30,000 files per shard, since there are 3 files per
protein). To download a proteome of your choice, you have to do the following
steps:
1. Find the [NCBI taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy)
(`[TAX_ID]`) of the species in question.
2. Run
   `gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-[TAX_ID]-*_v4.tar .`
   to download all shards for this proteome.
3. Un-tar all of the downloaded files and un-gzip all of the individual files.
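The wildcard pattern and the sharding arithmetic described above can be sketched as follows. The helper names are hypothetical, and the proteome size of 25,000 proteins is made up for illustration. Taxonomy ID 9606 is *Homo sapiens*. Because each shard holds at most 10,000 proteins, the computed shard count is a lower bound.

```python
import math

PROTEOMES = "gs://public-datasets-deepmind-alphafold-v4/proteomes"
MAX_PROTEINS_PER_SHARD = 10_000
FILES_PER_PROTEIN = 3


def shard_pattern(tax_id):
    """Wildcard matching every shard of one proteome."""
    return f"{PROTEOMES}/proteome-tax_id-{tax_id}-*_v4.tar"


def min_shards(num_proteins):
    """Fewest shards that can hold `num_proteins` proteins."""
    return math.ceil(num_proteins / MAX_PROTEINS_PER_SHARD)


print(shard_pattern(9606))
# Hypothetical proteome of 25,000 proteins.
print(min_shards(25_000), "shards,", 25_000 * FILES_PER_PROTEIN, "files")
```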
### File manifests
Pre-made lists of files (manifests) are available at
[gs://public-datasets-deepmind-alphafold-v4/manifests](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/manifests/).
Note that these filenames do not include the bucket prefix, but this can be
added once the files have been downloaded to your filesystem.
One can also define their own list of files, for example created by BigQuery
(see below). `gsutil` can be used to download these files with
```bash
cat [manifest file] | gsutil -m cp -I .
```
This will be much slower than downloading the tar files (grouped by species)
because each file has an associated overhead.
### BigQuery
**IMPORTANT: The
[free tier](https://cloud.google.com/bigquery/pricing#free-tier) of Google Cloud
comes with [BigQuery Sandbox](https://cloud.google.com/bigquery/docs/sandbox)
with 1 TB of free processed query data each month. Repeated queries within a
month could exceed this limit and if you have
[upgraded to a paid Cloud Billing account](https://cloud.google.com/free/docs/gcp-free-tier#how-to-upgrade)
you may be charged.**
**This should be sufficient for running a number of queries on the metadata
table, though the usage depends on the size of the columns queried and selected.
Please look at the
[BigQuery pricing page](https://cloud.google.com/bigquery/pricing) for more
information.**
**This is the user's responsibility so please ensure you keep track of your
billing settings and resource usage in the console.**
BigQuery provides a serverless and highly scalable analytics tool enabling SQL
queries over large datasets. The metadata for the UniProt dataset takes up 113
GiB and so can be challenging to process and analyse locally. The table name is:
* BigQuery metadata table:
[bigquery-public-data.deepmind_alphafold.metadata](https://console.cloud.google.com/bigquery?project=bigquery-public-data&ws=!1m5!1m4!4m3!1sbigquery-public-data!2sdeepmind_alphafold!3smetadata)
With BigQuery SQL you can do complex queries, e.g. find all high accuracy
predictions for a particular species, or even join on to other datasets, e.g. to
an experimental dataset by the `uniprotSequence`, or to the NCBI taxonomy by
`taxId`.
If you would find additional information in the metadata useful please file a
GitHub issue.
#### Setup
Follow the
[BigQuery Sandbox set up guide](https://cloud.google.com/bigquery/docs/sandbox).
#### Exploring the metadata
The column names and associated data types available can be found using the
following query.
```sql
SELECT column_name, data_type FROM bigquery-public-data.deepmind_alphafold.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'metadata'
```
**Column name** | **Data type** | **Description**
---------------------- | --------------- | ---------------
allVersions | `ARRAY<INT64>` | An array of AFDB versions this prediction has had
entryId | `STRING` | The AFDB entry ID, e.g. "AF-Q1HGU3-F1"
fractionPlddtConfident | `FLOAT64` | Fraction of the residues in the prediction with pLDDT between 70 and 90
fractionPlddtLow | `FLOAT64` | Fraction of the residues in the prediction with pLDDT between 50 and 70
fractionPlddtVeryHigh | `FLOAT64` | Fraction of the residues in the prediction with pLDDT greater than 90
fractionPlddtVeryLow | `FLOAT64` | Fraction of the residues in the prediction with pLDDT less than 50
gene | `STRING` | The name of the gene if known, e.g. "COII"
geneSynonyms | `ARRAY<STRING>` | Additional synonyms for the gene
globalMetricValue | `FLOAT64` | The mean pLDDT of this prediction
isReferenceProteome | `BOOL` | Is this protein part of the reference proteome?
isReviewed | `BOOL` | Has this protein been reviewed, i.e. is it part of SwissProt?
latestVersion | `INT64` | The latest AFDB version for this prediction
modelCreatedDate | `DATE` | The date of creation for this entry, e.g. "2022-06-01"
organismCommonNames | `ARRAY<STRING>` | List of common organism names
organismScientificName | `STRING` | The scientific name of the organism
organismSynonyms | `ARRAY<STRING>` | List of synonyms for the organism
proteinFullNames | `ARRAY<STRING>` | Full names of the protein
proteinShortNames | `ARRAY<STRING>` | Short names of the protein
sequenceChecksum | `STRING` | [CRC64 hash](https://www.uniprot.org/help/checksum) of the sequence. Can be used for cheaper lookups.
sequenceVersionDate | `DATE` | Date when the sequence data was last modified in UniProt
taxId | `INT64` | NCBI taxonomy id of the originating species
uniprotAccession | `STRING` | Uniprot accession ID
uniprotDescription | `STRING` | The name recommended by the UniProt consortium
uniprotEnd | `INT64` | Number of the last residue in the entry relative to the UniProt entry. This is equal to the length of the protein unless we are dealing with protein fragments.
uniprotId | `STRING` | The Uniprot EntryName field
uniprotSequence | `STRING` | Amino acid sequence for this prediction
uniprotStart | `INT64` | Number of the first residue in the entry relative to the UniProt entry. This is 1 unless we are dealing with protein fragments.
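To make the four `fractionPlddt*` columns concrete, the sketch below computes the same kind of fractions from a list of per-residue pLDDT values. The function name `plddt_fractions` is hypothetical, and the treatment of scores lying exactly on a boundary (in particular 70) is an assumption, since the table does not specify it.

```python
def plddt_fractions(plddt):
    """Fraction of residues in each pLDDT confidence band."""
    if not plddt:
        raise ValueError("need at least one residue")
    counts = {"very_low": 0, "low": 0, "confident": 0, "very_high": 0}
    for score in plddt:
        if score < 50:
            counts["very_low"] += 1
        elif score < 70:  # Assumption: exactly 70 counts as confident.
            counts["low"] += 1
        elif score <= 90:  # "Very high" is strictly greater than 90.
            counts["confident"] += 1
        else:
            counts["very_high"] += 1
    return {name: count / len(plddt) for name, count in counts.items()}


# Made-up per-residue scores for illustration.
fractions = plddt_fractions([45.0, 62.0, 75.0, 88.0, 93.0, 97.0])
print(fractions)
print(fractions["confident"] + fractions["very_high"] > 0.5)  # True
```

The final line mirrors a filter of the form `(fractionPlddtVeryHigh + fractionPlddtConfident) > 0.5`, i.e. more than half of the residues are confident or better.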
#### Producing summary statistics
The following query gives the mean of the prediction confidence fractions per
species.
```sql
SELECT
  organismScientificName AS name,
  SUM(fractionPlddtVeryLow) / COUNT(fractionPlddtVeryLow) AS mean_plddt_very_low,
  SUM(fractionPlddtLow) / COUNT(fractionPlddtLow) AS mean_plddt_low,
  SUM(fractionPlddtConfident) / COUNT(fractionPlddtConfident) AS mean_plddt_confident,
  SUM(fractionPlddtVeryHigh) / COUNT(fractionPlddtVeryHigh) AS mean_plddt_very_high,
  COUNT(organismScientificName) AS num_predictions
FROM bigquery-public-data.deepmind_alphafold.metadata
GROUP BY name
ORDER BY num_predictions DESC;
```
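
For readers who export metadata rows and want to check the aggregation locally, the sketch below mirrors the query for a single column in plain Python: `SUM(x) / COUNT(x)` is the mean over non-NULL values within each species, and the result is ordered by the number of predictions. The function name `mean_per_species` and the sample rows are made up for illustration; only the column names come from the table above.

```python
import collections
from typing import Any, Dict, Iterable, List, Mapping, Optional, Tuple


def mean_per_species(
    rows: Iterable[Mapping[str, Any]], column: str
) -> List[Tuple[str, Optional[float], int]]:
  """Returns (species, mean of `column`, num_predictions), largest group first."""
  values: Dict[str, List[float]] = collections.defaultdict(list)
  num_predictions: Dict[str, int] = collections.Counter()
  for row in rows:
    name = row['organismScientificName']
    num_predictions[name] += 1
    # Like SUM and COUNT in SQL, skip missing (NULL) values of the column.
    if row.get(column) is not None:
      values[name].append(float(row[column]))
  result = []
  for name, count in num_predictions.items():
    vals = values.get(name, [])
    mean = sum(vals) / len(vals) if vals else None
    result.append((name, mean, count))
  # Equivalent of ORDER BY num_predictions DESC.
  return sorted(result, key=lambda item: item[2], reverse=True)


# Hypothetical rows, not real AFDB values.
toy_rows = [
    {'organismScientificName': 'Homo sapiens', 'fractionPlddtVeryHigh': 0.5},
    {'organismScientificName': 'Homo sapiens', 'fractionPlddtVeryHigh': 0.25},
    {'organismScientificName': 'Mus musculus', 'fractionPlddtVeryHigh': 0.75},
]
result = mean_per_species(toy_rows, 'fractionPlddtVeryHigh')
```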
#### Producing lists of files
We expect that the most important use for the metadata will be to create subsets
of proteins according to various criteria, so that users can choose to only copy
a subset of the 214M proteins that exist in the dataset. An example query is
given below:
```sql
WITH file_rows AS (
  WITH file_cols AS (
    SELECT
      CONCAT(entryID, '-model_v4.cif') AS m,
      CONCAT(entryID, '-predicted_aligned_error_v4.json') AS p
    FROM bigquery-public-data.deepmind_alphafold.metadata
    WHERE organismScientificName = "Homo sapiens"
      AND (fractionPlddtVeryHigh + fractionPlddtConfident) > 0.5
  )
  SELECT * FROM file_cols UNPIVOT (files FOR filetype IN (m, p))
)
SELECT CONCAT('gs://public-datasets-deepmind-alphafold-v4/', files) AS files
FROM file_rows
```
In this case, the list has been filtered to only include proteins from *Homo
sapiens* for which over half the residues are confident or better (>70 pLDDT).
This creates a table with one column "files", where each row is the cloud
location of one of the two file types that has been provided for each protein.
There is an additional `confidence_v4.json` file which contains the
per-residue pLDDT. This information is already in the CIF file, but the JSON
file may be preferable when only the pLDDT values are needed.
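
The same file locations can also be assembled outside BigQuery once the entry IDs are known. The following is a minimal sketch, not an official client: the function name `afdb_v4_file_uris` is hypothetical, the entry ID is only an example value, and the assumption that the confidence file follows the same `<entryID>-confidence_v4.json` naming pattern as the other two files should be checked against the bucket contents.

```python
from typing import List

# Bucket prefix taken from the query above.
_BUCKET = 'gs://public-datasets-deepmind-alphafold-v4/'


def afdb_v4_file_uris(
    entry_id: str, include_confidence: bool = False
) -> List[str]:
  """Builds the cloud locations of the v4 files for one entry (illustrative)."""
  suffixes = ['-model_v4.cif', '-predicted_aligned_error_v4.json']
  if include_confidence:
    # Assumption: the confidence file uses the same naming pattern.
    suffixes.append('-confidence_v4.json')
  return [f'{_BUCKET}{entry_id}{suffix}' for suffix in suffixes]


# Example entry ID used purely for illustration.
uris = afdb_v4_file_uris('AF-P69905-F1')
```

Each returned string corresponds to one row of the "files" column produced by the query.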
This allows users to bulk download the exact proteins they need, without having
to download the entire dataset. Other columns may also be used to select subsets
of proteins, and we point the user to the
[BigQuery documentation](https://cloud.google.com/bigquery/docs) to understand
other ways to filter for their desired protein lists. Likewise, follow that
documentation to download these file subsets locally, as the most appropriate
approach will depend on the file size. Note that it may be easier to download
large files using [Colab](https://colab.research.google.com/)
(e.g. pandas `to_csv`).
#### Previous versions
Previous versions of AFDB will remain available at
[gs://public-datasets-deepmind-alphafold](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold)
to enable reproducible research. We recommend using the latest version (v4).
================================================
FILE: alphafold/__init__.py
================================================
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""An implementation of the inference pipeline of AlphaFold v2.0."""
================================================
FILE: alphafold/common/__init__.py
================================================
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Common data types and constants used within Alphafold."""
================================================
FILE: alphafold/common/confidence.py
================================================
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Functions for processing confidence metrics."""
import json
from typing import Dict, Optional, Tuple
import numpy as np
def _softmax(x: np.ndarray, axis: Optional[int] = None):
  x = np.asarray(x)
  x_max = np.max(x, axis=axis, keepdims=True)
  exp_x_shifted = np.exp(x - x_max)
  return exp_x_shifted / np.sum(exp_x_shifted, axis=axis, keepdims=True)


def compute_plddt(logits: np.ndarray) -> np.ndarray:
  """Computes per-residue pLDDT from logits.

  Args:
    logits: [num_res, num_bins] output from the PredictedLDDTHead.

  Returns:
    plddt: [num_res] per-residue pLDDT.
  """
  num_bins = logits.shape[-1]
  bin_width = 1.0 / num_bins
  bin_centers = np.arange(start=0.5 * bin_width, stop=1.0, step=bin_width)
  probs = np.array(_softmax(logits, axis=-1))
  predicted_lddt_ca = np.sum(probs * bin_centers[None, :], axis=-1)
  return predicted_lddt_ca * 100


def _confidence_category(score: float) -> str:
  """Categorizes pLDDT into: disordered (D), low (L), medium (M), high (H)."""
  if 0 <= score < 50:
    return 'D'
  if 50 <= score < 70:
    return 'L'
  elif 70 <= score < 90:
    return 'M'
  elif 90 <= score <= 100:
    return 'H'
  else:
    raise ValueError(f'Invalid pLDDT score {score}')


def confidence_json(plddt: np.ndarray) -> str:
  """Returns JSON with confidence score and category for every residue.

  Args:
    plddt: Per-residue confidence metric data.

  Returns:
    String with a formatted JSON.

  Raises:
    ValueError: If `plddt` has a rank different than 1.
  """
  if plddt.ndim != 1:
    raise ValueError(f'The plddt array must be rank 1, got: {plddt.shape}.')

  confidence = {
      'residueNumber': list(range(1, len(plddt) + 1)),
      'confidenceScore': [round(float(s), 2) for s in plddt],
      'confidenceCategory': [_confidence_category(s) for s in plddt],
  }
  return json.dumps(confidence, indent=None, separators=(',', ':'))
def _calculate_bin_centers(breaks: np.ndarray):
  """Gets the bin centers from the bin edges.

  Args:
    breaks: [num_bins - 1] the error bin edges.

  Returns:
    bin_centers: [num_bins] the error bin centers.
  """
  step = breaks[1] - breaks[0]

  # Add half-step to get the center
  bin_centers = breaks + step / 2
  # Add a catch-all bin at the end.
  bin_centers = np.concatenate([bin_centers, [bin_centers[-1] + step]], axis=0)
  return bin_centers


def _calculate_expected_aligned_error(
    alignment_confidence_breaks: np.ndarray,
    aligned_distance_error_probs: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray]:
  """Calculates expected aligned distance errors for every pair of residues.

  Args:
    alignment_confidence_breaks: [num_bins - 1] the error bin edges.
    aligned_distance_error_probs: [num_res, num_res, num_bins] the predicted
      probs for each error bin, for each pair of residues.

  Returns:
    predicted_aligned_error: [num_res, num_res] the expected aligned distance
      error for each pair of residues.
    max_predicted_aligned_error: The maximum predicted error possible.
  """
  bin_centers = _calculate_bin_centers(alignment_confidence_breaks)

  # Tuple of expected aligned distance error and max possible error.
  return (
      np.sum(aligned_distance_error_probs * bin_centers, axis=-1),
      np.asarray(bin_centers[-1]),
  )


def compute_predicted_aligned_error(
    logits: np.ndarray, breaks: np.ndarray
) -> Dict[str, np.ndarray]:
  """Computes aligned confidence metrics from logits.

  Args:
    logits: [num_res, num_res, num_bins] the logits output from
      PredictedAlignedErrorHead.
    breaks: [num_bins - 1] the error bin edges.

  Returns:
    aligned_confidence_probs: [num_res, num_res, num_bins] the predicted
      aligned error probabilities over bins for each residue pair.
    predicted_aligned_error: [num_res, num_res] the expected aligned distance
      error for each pair of residues.
    max_predicted_aligned_error: The maximum predicted error possible.
  """
  aligned_confidence_probs = np.array(_softmax(logits, axis=-1))
  predicted_aligned_error, max_predicted_aligned_error = (
      _calculate_expected_aligned_error(
          alignment_confidence_breaks=breaks,
          aligned_distance_error_probs=aligned_confidence_probs,
      )
  )
  return {
      'aligned_confidence_probs': aligned_confidence_probs,
      'predicted_aligned_error': predicted_aligned_error,
      'max_predicted_aligned_error': max_predicted_aligned_error,
  }


def pae_json(pae: np.ndarray, max_pae: float) -> str:
  """Returns the PAE in the same format as is used in the AFDB.

  Note that the values are presented as floats to 1 decimal place, whereas AFDB
  returns integer values.

  Args:
    pae: The n_res x n_res PAE array.
    max_pae: The maximum possible PAE value.

  Returns:
    PAE output format as a JSON string.
  """
  # Check the PAE array is the correct shape.
  if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
    raise ValueError(f'PAE must be a square matrix, got {pae.shape}')

  # Round the predicted aligned errors to 1 decimal place.
  rounded_errors = np.round(pae.astype(np.float64), decimals=1)
  formatted_output = [{
      'predicted_aligned_error': rounded_errors.tolist(),
      'max_predicted_aligned_error': max_pae,
  }]
  return json.dumps(formatted_output, indent=None, separators=(',', ':'))
def predicted_tm_score(
    logits: np.ndarray,
    breaks: np.ndarray,
    residue_weights: Optional[np.ndarray] = None,
    asym_id: Optional[np.ndarray] = None,
    interface: bool = False,
) -> np.ndarray:
  """Computes predicted TM alignment or predicted interface TM alignment score.

  Args:
    logits: [num_res, num_res, num_bins] the logits output from
      PredictedAlignedErrorHead.
    breaks: [num_bins - 1] the error bin edges.
    residue_weights: [num_res] the per residue weights to use for the
      expectation.
    asym_id: [num_res] the asymmetric unit ID - the chain ID. Only needed for
      ipTM calculation, i.e. when interface=True.
    interface: If True, interface predicted TM score is computed.

  Returns:
    ptm_score: The predicted TM alignment or the predicted interface TM (ipTM)
      score.
  """

  # residue_weights has to be in [0, 1], but can be floating-point, i.e. the
  # exp. resolved head's probability.
  if residue_weights is None:
    residue_weights = np.ones(logits.shape[0])

  bin_centers = _calculate_bin_centers(breaks)

  num_res = int(np.sum(residue_weights))
  # Clip num_res to avoid negative/undefined d0.
  clipped_num_res = max(num_res, 19)

  # Compute d_0(num_res) as defined by TM-score, eqn. (5) in Zhang & Skolnick
  # "Scoring function for automated assessment of protein structure template
  # quality", 2004: http://zhanglab.ccmb.med.umich.edu/papers/2004_3.pdf
  d0 = 1.24 * (clipped_num_res - 15) ** (1.0 / 3) - 1.8

  # Convert logits to probs.
  probs = np.array(_softmax(logits, axis=-1))

  # TM-Score term for every bin.
  tm_per_bin = 1.0 / (1 + np.square(bin_centers) / np.square(d0))
  # E_distances tm(distance).
  predicted_tm_term = np.sum(probs * tm_per_bin, axis=-1)

  pair_mask = np.ones(shape=(num_res, num_res), dtype=bool)
  if interface:
    pair_mask *= asym_id[:, None] != asym_id[None, :]

  predicted_tm_term *= pair_mask

  pair_residue_weights = pair_mask * (
      residue_weights[None, :] * residue_weights[:, None]
  )
  normed_residue_mask = pair_residue_weights / (
      1e-8 + np.sum(pair_residue_weights, axis=-1, keepdims=True)
  )
  per_alignment = np.sum(predicted_tm_term * normed_residue_mask, axis=-1)
  return np.asarray(per_alignment[(per_alignment * residue_weights).argmax()])
================================================
FILE: alphafold/common/confidence_test.py
================================================
# Copyright 2023 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Test confidence metrics."""
from absl.testing import absltest
from alphafold.common import confidence
import numpy as np
class ConfidenceTest(absltest.TestCase):

  def test_pae_json(self):
    pae = np.array([[0.01, 13.12345], [20.0987, 0.0]])
    pae_json = confidence.pae_json(pae=pae, max_pae=31.75)
    self.assertEqual(
        pae_json,
        '[{"predicted_aligned_error":[[0.0,13.1],[20.1,0.0]],'
        '"max_predicted_aligned_error":31.75}]',
    )

  def test_confidence_json(self):
    plddt = np.array([42, 42.42])

    confidence_json = confidence.confidence_json(plddt=plddt)

    print(confidence_json)

    self.assertEqual(
        confidence_json,
        (
            '{"residueNumber":[1,2],'
            '"confidenceScore":[42.0,42.42],'
            '"confidenceCategory":["D","D"]}'
        ),
    )


if __name__ == '__main__':
  absltest.main()
================================================
FILE: alphafold/common/mmcif_metadata.py
================================================
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""mmCIF metadata."""
from typing import Mapping, Sequence
from alphafold import version
import numpy as np
_DISCLAIMER = """ALPHAFOLD DATA, COPYRIGHT (2021) DEEPMIND TECHNOLOGIES LIMITED.
THE INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD BE
EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY OF ANY KIND,
WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT USE OF THE INFORMATION
SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD PARTY. DISCLAIMER: THE INFORMATION IS
NOT INTENDED TO BE A SUBSTITUTE FOR PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR
TREATMENT, AND DOES NOT CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE. IT IS
AVAILABLE FOR ACADEMIC AND COMMERCIAL PURPOSES, UNDER CC-BY 4.0 LICENCE."""
# Authors of the Nature paper we reference in the mmCIF.
_MMCIF_PAPER_AUTHORS = (
    'Jumper, John',
    'Evans, Richard',
    'Pritzel, Alexander',
    'Green, Tim',
    'Figurnov, Michael',
    'Ronneberger, Olaf',
    'Tunyasuvunakool, Kathryn',
    'Bates, Russ',
    'Zidek, Augustin',
    'Potapenko, Anna',
    'Bridgland, Alex',
    'Meyer, Clemens',
    'Kohl, Simon A. A.',
    'Ballard, Andrew J.',
    'Cowie, Andrew',
    'Romera-Paredes, Bernardino',
    'Nikolov, Stanislav',
    'Jain, Rishub',
    'Adler, Jonas',
    'Back, Trevor',
    'Petersen, Stig',
    'Reiman, David',
    'Clancy, Ellen',
    'Zielinski, Michal',
    'Steinegger, Martin',
    'Pacholska, Michalina',
    'Berghammer, Tamas',
    'Silver, David',
    'Vinyals, Oriol',
    'Senior, Andrew W.',
    'Kavukcuoglu, Koray',
    'Kohli, Pushmeet',
    'Hassabis, Demis',
)
# Authors of the mmCIF - we set them to be equal to the authors of the paper.
_MMCIF_AUTHORS = _MMCIF_PAPER_AUTHORS
def add_metadata_to_mmcif(
    old_cif: Mapping[str, Sequence[str]], model_type: str
) -> Mapping[str, Sequence[str]]:
  """Adds AlphaFold metadata in the given mmCIF."""
  cif = {}

  # ModelCIF conformation dictionary.
  cif['_audit_conform.dict_name'] = ['mmcif_ma.dic']
  cif['_audit_conform.dict_version'] = ['1.3.9']
  cif['_audit_conform.dict_location'] = [
      'https://raw.githubusercontent.com/ihmwg/ModelCIF/master/dist/'
      'mmcif_ma.dic'
  ]

  # License and disclaimer.
  cif['_pdbx_data_usage.id'] = ['1', '2']
  cif['_pdbx_data_usage.type'] = ['license', 'disclaimer']
  cif['_pdbx_data_usage.details'] = [
      'Data in this file is available under a CC-BY-4.0 license.',
      _DISCLAIMER,
  ]
  cif['_pdbx_data_usage.url'] = [
      'https://creativecommons.org/licenses/by/4.0/',
      '?',
  ]
  cif['_pdbx_data_usage.name'] = ['CC-BY-4.0', '?']

  # Structure author details.
  cif['_audit_author.name'] = []
  cif['_audit_author.pdbx_ordinal'] = []
  for author_index, author_name in enumerate(_MMCIF_AUTHORS, start=1):
    cif['_audit_author.name'].append(author_name)
    cif['_audit_author.pdbx_ordinal'].append(str(author_index))

  # Paper author details.
  cif['_citation_author.citation_id'] = []
  cif['_citation_author.name'] = []
  cif['_citation_author.ordinal'] = []
  for author_index, author_name in enumerate(_MMCIF_PAPER_AUTHORS, start=1):
    cif['_citation_author.citation_id'].append('primary')
    cif['_citation_author.name'].append(author_name)
    cif['_citation_author.ordinal'].append(str(author_index))

  # Paper citation details.
  cif['_citation.id'] = ['primary']
  cif['_citation.title'] = [
      'Highly accurate protein structure prediction with AlphaFold'
  ]
  cif['_citation.journal_full'] = ['Nature']
  cif['_citation.journal_volume'] = ['596']
  cif['_citation.page_first'] = ['583']
  cif['_citation.page_last'] = ['589']
  cif['_citation.year'] = ['2021']
  cif['_citation.journal_id_ASTM'] = ['NATUAS']
  cif['_citation.country'] = ['UK']
  cif['_citation.journal_id_ISSN'] = ['0028-0836']
  cif['_citation.journal_id_CSD'] = ['0006']
  cif['_citation.book_publisher'] = ['?']
  cif['_citation.pdbx_database_id_PubMed'] = ['34265844']
  cif['_citation.pdbx_database_id_DOI'] = ['10.1038/s41586-021-03819-2']

  # Type of data in the dataset including data used in the model generation.
  cif['_ma_data.id'] = ['1']
  cif['_ma_data.name'] = ['Model']
  cif['_ma_data.content_type'] = ['model coordinates']

  # Description of number of instances for each entity.
  cif['_ma_target_entity_instance.asym_id'] = old_cif['_struct_asym.id']
  cif['_ma_target_entity_instance.entity_id'] = old_cif[
      '_struct_asym.entity_id'
  ]
  cif['_ma_target_entity_instance.details'] = ['.'] * len(
      cif['_ma_target_entity_instance.entity_id']
  )

  # Details about the target entities.
  cif['_ma_target_entity.entity_id'] = cif[
      '_ma_target_entity_instance.entity_id'
  ]
  cif['_ma_target_entity.data_id'] = ['1'] * len(
      cif['_ma_target_entity.entity_id']
  )
  cif['_ma_target_entity.origin'] = ['.'] * len(
      cif['_ma_target_entity.entity_id']
  )

  # Details of the models being deposited.
  cif['_ma_model_list.ordinal_id'] = ['1']
  cif['_ma_model_list.model_id'] = ['1']
  cif['_ma_model_list.model_group_id'] = ['1']
  cif['_ma_model_list.model_name'] = ['Top ranked model']

  cif['_ma_model_list.model_group_name'] = [
      f'AlphaFold {model_type} v{version.__version__} model'
  ]
  cif['_ma_model_list.data_id'] = ['1']
  cif['_ma_model_list.model_type'] = ['Ab initio model']

  # Software used.
  cif['_software.pdbx_ordinal'] = ['1']
  cif['_software.name'] = ['AlphaFold']
  cif['_software.version'] = [f'v{version.__version__}']
  cif['_software.type'] = ['package']
  cif['_software.description'] = ['Structure prediction']
  cif['_software.classification'] = ['other']
  cif['_software.date'] = ['?']

  # Collection of software into groups.
  cif['_ma_software_group.ordinal_id'] = ['1']
  cif['_ma_software_group.group_id'] = ['1']
  cif['_ma_software_group.software_id'] = ['1']

  # Method description to conform with ModelCIF.
  cif['_ma_protocol_step.ordinal_id'] = ['1', '2', '3']
  cif['_ma_protocol_step.protocol_id'] = ['1', '1', '1']
  cif['_ma_protocol_step.step_id'] = ['1', '2', '3']
  cif['_ma_protocol_step.method_type'] = [
      'coevolution MSA',
      'template search',
      'modeling',
  ]

  # Details of the metrics used to assess model confidence.
  cif['_ma_qa_metric.id'] = ['1', '2']
  cif['_ma_qa_metric.name'] = ['pLDDT', 'pLDDT']
  # Accepted values are distance, energy, normalised score, other, zscore.
  cif['_ma_qa_metric.type'] = ['pLDDT', 'pLDDT']
  cif['_ma_qa_metric.mode'] = ['global', 'local']
  cif['_ma_qa_metric.software_group_id'] = ['1', '1']

  # Global model confidence metric value.
  cif['_ma_qa_metric_global.ordinal_id'] = ['1']
  cif['_ma_qa_metric_global.model_id'] = ['1']
  cif['_ma_qa_metric_global.metric_id'] = ['1']
  global_plddt = np.mean(
      [float(v) for v in old_cif['_atom_site.B_iso_or_equiv']]
  )
  cif['_ma_qa_metric_global.metric_value'] = [f'{global_plddt:.2f}']

  cif['_atom_type.symbol'] = sorted(set(old_cif['_atom_site.type_symbol']))

  return cif
================================================
FILE: alphafold/common/protein.py
================================================
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Protein data type."""
import collections
import dataclasses
import functools
import io
from typing import Any, Dict, List, Mapping, Optional, Tuple
from alphafold.common import mmcif_metadata
from alphafold.common import residue_constants
from Bio.PDB import MMCIFParser
from Bio.PDB import PDBParser
from Bio.PDB.mmcifio import MMCIFIO
from Bio.PDB.Structure import Structure
import numpy as np
FeatureDict = Mapping[str, np.ndarray]
ModelOutput = Mapping[str, Any] # Is a nested dict.
# Complete sequence of chain IDs supported by the PDB format.
PDB_CHAIN_IDS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
PDB_MAX_CHAINS = len(PDB_CHAIN_IDS) # := 62.
# Data to fill the _chem_comp table when writing mmCIFs.
_CHEM_COMP: Mapping[str, Tuple[Tuple[str, str], ...]] = {
    'L-peptide linking': (
        ('ALA', 'ALANINE'),
        ('ARG', 'ARGININE'),
        ('ASN', 'ASPARAGINE'),
        ('ASP', 'ASPARTIC ACID'),
        ('CYS', 'CYSTEINE'),
        ('GLN', 'GLUTAMINE'),
        ('GLU', 'GLUTAMIC ACID'),
        ('HIS', 'HISTIDINE'),
        ('ILE', 'ISOLEUCINE'),
        ('LEU', 'LEUCINE'),
        ('LYS', 'LYSINE'),
        ('MET', 'METHIONINE'),
        ('PHE', 'PHENYLALANINE'),
        ('PRO', 'PROLINE'),
        ('SER', 'SERINE'),
        ('THR', 'THREONINE'),
        ('TRP', 'TRYPTOPHAN'),
        ('TYR', 'TYROSINE'),
        ('VAL', 'VALINE'),
    ),
    'peptide linking': (('GLY', 'GLYCINE'),),
}
@dataclasses.dataclass(frozen=True)
class Protein:
  """Protein structure representation."""

  # Cartesian coordinates of atoms in angstroms. The atom types correspond to
  # residue_constants.atom_types, i.e. the first three are N, CA, C.
  atom_positions: np.ndarray  # [num_res, num_atom_type, 3]

  # Amino-acid type for each residue represented as an integer between 0 and
  # 20, where 20 is 'X'.
  aatype: np.ndarray  # [num_res]

  # Binary float mask to indicate presence of a particular atom. 1.0 if an atom
  # is present and 0.0 if not. This should be used for loss masking.
  atom_mask: np.ndarray  # [num_res, num_atom_type]

  # Residue index as used in PDB. It is not necessarily continuous or 0-indexed.
  residue_index: np.ndarray  # [num_res]

  # 0-indexed number corresponding to the chain in the protein that this residue
  # belongs to.
  chain_index: np.ndarray  # [num_res]

  # B-factors, or temperature factors, of each residue (in sq. angstroms units),
  # representing the displacement of the residue from its ground truth mean
  # value.
  b_factors: np.ndarray  # [num_res, num_atom_type]

  def __post_init__(self):
    if len(np.unique(self.chain_index)) > PDB_MAX_CHAINS:
      raise ValueError(
          f'Cannot build an instance with more than {PDB_MAX_CHAINS} chains '
          'because these cannot be written to PDB format.'
      )
def _from_bio_structure(
    structure: Structure, chain_id: Optional[str] = None
) -> Protein:
  """Takes a Biopython structure and creates a `Protein` instance.

  WARNING: All non-standard residue types will be converted into UNK. All
    non-standard atoms will be ignored.

  Args:
    structure: Structure from the Biopython library.
    chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
      Otherwise all chains are parsed.

  Returns:
    A new `Protein` created from the structure contents.

  Raises:
    ValueError: If the number of models included in the structure is not 1.
    ValueError: If insertion code is detected at a residue.
  """
  models = list(structure.get_models())
  if len(models) != 1:
    raise ValueError(
        'Only single model PDBs/mmCIFs are supported. Found'
        f' {len(models)} models.'
    )
  model = models[0]

  atom_positions = []
  aatype = []
  atom_mask = []
  residue_index = []
  chain_ids = []
  b_factors = []

  for chain in model:
    if chain_id is not None and chain.id != chain_id:
      continue
    for res in chain:
      if res.id[2] != ' ':
        raise ValueError(
            f'PDB/mmCIF contains an insertion code at chain {chain.id} and'
            f' residue index {res.id[1]}. These are not supported.'
        )
      res_shortname = residue_constants.restype_3to1.get(res.resname, 'X')
      restype_idx = residue_constants.restype_order.get(
          res_shortname, residue_constants.restype_num
      )
      pos = np.zeros((residue_constants.atom_type_num, 3))
      mask = np.zeros((residue_constants.atom_type_num,))
      res_b_factors = np.zeros((residue_constants.atom_type_num,))
      for atom in res:
        if atom.name not in residue_constants.atom_types:
          continue
        pos[residue_constants.atom_order[atom.name]] = atom.coord
        mask[residue_constants.atom_order[atom.name]] = 1.0
        res_b_factors[residue_constants.atom_order[atom.name]] = atom.bfactor
      if np.sum(mask) < 0.5:
        # If no known atom positions are reported for the residue then skip it.
        continue
      aatype.append(restype_idx)
      atom_positions.append(pos)
      atom_mask.append(mask)
      residue_index.append(res.id[1])
      chain_ids.append(chain.id)
      b_factors.append(res_b_factors)

  # Chain IDs are usually characters so map these to ints.
  unique_chain_ids = np.unique(chain_ids)
  chain_id_mapping = {cid: n for n, cid in enumerate(unique_chain_ids)}
  chain_index = np.array([chain_id_mapping[cid] for cid in chain_ids])

  return Protein(
      atom_positions=np.array(atom_positions),
      atom_mask=np.array(atom_mask),
      aatype=np.array(aatype),
      residue_index=np.array(residue_index),
      chain_index=chain_index,
      b_factors=np.array(b_factors),
  )
def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
  """Takes a PDB string and constructs a `Protein` object.

  WARNING: All non-standard residue types will be converted into UNK. All
    non-standard atoms will be ignored.

  Args:
    pdb_str: The contents of the pdb file
    chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
      Otherwise all chains are parsed.

  Returns:
    A new `Protein` parsed from the pdb contents.
  """
  with io.StringIO(pdb_str) as pdb_fh:
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(id='none', file=pdb_fh)
    return _from_bio_structure(structure, chain_id)


def from_mmcif_string(
    mmcif_str: str, chain_id: Optional[str] = None
) -> Protein:
  """Takes a mmCIF string and constructs a `Protein` object.

  WARNING: All non-standard residue types will be converted into UNK. All
    non-standard atoms will be ignored.

  Args:
    mmcif_str: The contents of the mmCIF file
    chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
      Otherwise all chains are parsed.

  Returns:
    A new `Protein` parsed from the mmCIF contents.
  """
  with io.StringIO(mmcif_str) as mmcif_fh:
    parser = MMCIFParser(QUIET=True)
    structure = parser.get_structure(structure_id='none', filename=mmcif_fh)
    return _from_bio_structure(structure, chain_id)
def _chain_end(atom_index, end_resname, chain_name, residue_index) -> str:
  chain_end = 'TER'
  # Columns 12-17 of a TER record are blank, hence the six spaces.
  return (
      f'{chain_end:<6}{atom_index:>5}      {end_resname:>3} '
      f'{chain_name:>1}{residue_index:>4}'
  )


def to_pdb(prot: Protein) -> str:
  """Converts a `Protein` instance to a PDB string.

  Args:
    prot: The protein to convert to PDB.

  Returns:
    PDB string.
  """
  restypes = residue_constants.restypes + ['X']
  res_1to3 = lambda r: residue_constants.restype_1to3.get(restypes[r], 'UNK')
  atom_types = residue_constants.atom_types

  pdb_lines = []

  atom_mask = prot.atom_mask
  aatype = prot.aatype
  atom_positions = prot.atom_positions
  residue_index = prot.residue_index.astype(np.int32)
  chain_index = prot.chain_index.astype(np.int32)
  b_factors = prot.b_factors

  if np.any(aatype > residue_constants.restype_num):
    raise ValueError('Invalid aatypes.')

  # Construct a mapping from chain integer indices to chain ID strings.
  chain_ids = {}
  for i in np.unique(chain_index):  # np.unique gives sorted output.
    if i >= PDB_MAX_CHAINS:
      raise ValueError(
          f'The PDB format supports at most {PDB_MAX_CHAINS} chains.'
      )
    chain_ids[i] = PDB_CHAIN_IDS[i]

  pdb_lines.append('MODEL     1')
  atom_index = 1
  last_chain_index = chain_index[0]
  # Add all atom sites.
  for i in range(aatype.shape[0]):
    # Close the previous chain if in a multichain PDB.
    if last_chain_index != chain_index[i]:
      pdb_lines.append(
          _chain_end(
              atom_index,
              res_1to3(aatype[i - 1]),
              chain_ids[chain_index[i - 1]],
              residue_index[i - 1],
          )
      )
      last_chain_index = chain_index[i]
      atom_index += 1  # Atom index increases at the TER symbol.

    res_name_3 = res_1to3(aatype[i])
    for atom_name, pos, mask, b_factor in zip(
        atom_types, atom_positions[i], atom_mask[i], b_factors[i]
    ):
      if mask < 0.5:
        continue

      record_type = 'ATOM'
      name = atom_name if len(atom_name) == 4 else f' {atom_name}'
      alt_loc = ''
      insertion_code = ''
      occupancy = 1.00
      element = atom_name[0]  # Protein supports only C, N, O, S, this works.
      charge = ''
      # PDB is a columnar format, every space matters here!
      atom_line = (
          f'{record_type:<6}{atom_index:>5} {name:<4}{alt_loc:>1}'
          f'{res_name_3:>3} {chain_ids[chain_index[i]]:>1}'
          f'{residue_index[i]:>4}{insertion_code:>1}   '
          f'{pos[0]:>8.3f}{pos[1]:>8.3f}{pos[2]:>8.3f}'
          f'{occupancy:>6.2f}{b_factor:>6.2f}          '
          f'{element:>2}{charge:>2}'
      )
      pdb_lines.append(atom_line)
      atom_index += 1

  # Close the final chain.
  pdb_lines.append(
      _chain_end(
          atom_index,
          res_1to3(aatype[-1]),
          chain_ids[chain_index[-1]],
          residue_index[-1],
      )
  )
  pdb_lines.append('ENDMDL')
  pdb_lines.append('END')

  # Pad all lines to 80 characters.
  pdb_lines = [line.ljust(80) for line in pdb_lines]
  return '\n'.join(pdb_lines) + '\n'  # Add terminating newline.
def ideal_atom_mask(prot: Protein) -> np.ndarray:
  """Computes an ideal atom mask.

  `Protein.atom_mask` typically is defined according to the atoms that are
  reported in the PDB. This function computes a mask according to heavy atoms
  that should be present in the given sequence of amino acids.

  Args:
    prot: `Protein` whose fields are `numpy.ndarray` objects.

  Returns:
    An ideal atom mask.
  """
  return residue_constants.STANDARD_ATOM_MASK[prot.aatype]


def from_prediction(
    features: FeatureDict,
    result: ModelOutput,
    b_factors: Optional[np.ndarray] = None,
    remove_leading_feature_dimension: bool = True,
) -> Protein:
  """Assembles a protein from a prediction.

  Args:
    features: Dictionary holding model inputs.
    result: Dictionary holding model outputs.
    b_factors: (Optional) B-factors to use for the protein.
    remove_leading_feature_dimension: Whether to remove the leading dimension of
      the `features` values.

  Returns:
    A protein instance.
  """
  fold_output = result['structure_module']

  def _maybe_remove_leading_dim(arr: np.ndarray) -> np.ndarray:
    return arr[0] if remove_leading_feature_dimension else arr

  if 'asym_id' in features:
    chain_index = _maybe_remove_leading_dim(features['asym_id'])
  else:
    chain_index = np.zeros_like(_maybe_remove_leading_dim(features['aatype']))

  if b_factors is None:
    b_factors = np.zeros_like(fold_output['final_atom_mask'])

  return Protein(
      aatype=_maybe_remove_leading_dim(features['aatype']),
      atom_positions=fold_output['final_atom_positions'],
      atom_mask=fold_output['final_atom_mask'],
      residue_index=_maybe_remove_leading_dim(features['residue_index']) + 1,
      chain_index=chain_index,
      b_factors=b_factors,
  )
def to_mmcif(
prot: Protein,
file_id: str,
model_type: str,
) -> str:
"""Converts a `Protein` instance to an mmCIF string.
WARNING 1: The _entity_poly_seq is filled with unknown (UNK) residues for any
missing residue indices in the range from min(1, min(residue_index)) to
max(residue_index). E.g. for a protein object with positions for residues
2 (MET), 3 (LYS), 6 (GLY), this method would set the _entity_poly_seq to:
1 UNK
2 MET
3 LYS
4 UNK
5 UNK
6 GLY
This is done to preserve the residue numbering.
WARNING 2: Converting ground truth mmCIF file to Protein and then back to
mmCIF using this method will convert all non-standard residue types to UNK.
If you need to preserve non-standard residue types, you need to store more
mmCIF metadata in the Protein object (e.g. all fields except for the
_atom_site loop).
WARNING 3: Converting ground truth mmCIF file to Protein and then back to
mmCIF using this method will not retain the original chain indices.
WARNING 4: In case of multiple identical chains, they are assigned different
`_atom_site.label_entity_id` values.
Args:
prot: A protein to convert to mmCIF string.
file_id: The file ID (usually the PDB ID) to be used in the mmCIF.
model_type: 'Multimer' or 'Monomer'.
Returns:
A valid mmCIF string.
Raises:
ValueError: If the amino acid types array contains an index larger than
the number of known residue types (the 20 standard types plus unknown).
"""
atom_mask = prot.atom_mask
aatype = prot.aatype
atom_positions = prot.atom_positions
residue_index = prot.residue_index.astype(np.int32)
chain_index = prot.chain_index.astype(np.int32)
b_factors = prot.b_factors
# Construct a mapping from chain integer indices to chain ID strings.
chain_ids = {}
# We count unknown residues as protein residues.
for entity_id in np.unique(chain_index): # np.unique gives sorted output.
chain_ids[entity_id] = _int_id_to_str_id(entity_id + 1)
mmcif_dict = collections.defaultdict(list)
mmcif_dict['data_'] = file_id.upper()
mmcif_dict['_entry.id'] = file_id.upper()
label_asym_id_to_entity_id = {}
# Entity and chain information.
for entity_id, chain_id in chain_ids.items():
# Add all chain information to the _struct_asym table.
label_asym_id_to_entity_id[str(chain_id)] = str(entity_id)
mmcif_dict['_struct_asym.id'].append(chain_id)
mmcif_dict['_struct_asym.entity_id'].append(str(entity_id))
# Add information about the entity to the _entity_poly table.
mmcif_dict['_entity_poly.entity_id'].append(str(entity_id))
mmcif_dict['_entity_poly.type'].append(residue_constants.PROTEIN_CHAIN)
mmcif_dict['_entity_poly.pdbx_strand_id'].append(chain_id)
# Generate the _entity table.
mmcif_dict['_entity.id'].append(str(entity_id))
mmcif_dict['_entity.type'].append(residue_constants.POLYMER_CHAIN)
# Add the residues to the _entity_poly_seq table.
for entity_id, (res_ids, aas) in _get_entity_poly_seq(
aatype, residue_index, chain_index
).items():
for res_id, aa in zip(res_ids, aas):
mmcif_dict['_entity_poly_seq.entity_id'].append(str(entity_id))
mmcif_dict['_entity_poly_seq.num'].append(str(res_id))
mmcif_dict['_entity_poly_seq.mon_id'].append(
residue_constants.resnames[aa]
)
# Populate the chem comp table.
for chem_type, chem_comp in _CHEM_COMP.items():
for chem_id, chem_name in chem_comp:
mmcif_dict['_chem_comp.id'].append(chem_id)
mmcif_dict['_chem_comp.type'].append(chem_type)
mmcif_dict['_chem_comp.name'].append(chem_name)
# Add all atom sites.
atom_index = 1
for i in range(aatype.shape[0]):
res_name_3 = residue_constants.resnames[aatype[i]]
if aatype[i] <= len(residue_constants.restypes):
atom_names = residue_constants.atom_types
else:
raise ValueError(
'Amino acid types array contains entries with too many protein types.'
)
for atom_name, pos, mask, b_factor in zip(
atom_names, atom_positions[i], atom_mask[i], b_factors[i]
):
if mask < 0.5:
continue
type_symbol = residue_constants.atom_id_to_type(atom_name)
mmcif_dict['_atom_site.group_PDB'].append('ATOM')
mmcif_dict['_atom_site.id'].append(str(atom_index))
mmcif_dict['_atom_site.type_symbol'].append(type_symbol)
mmcif_dict['_atom_site.label_atom_id'].append(atom_name)
mmcif_dict['_atom_site.label_alt_id'].append('.')
mmcif_dict['_atom_site.label_comp_id'].append(res_name_3)
mmcif_dict['_atom_site.label_asym_id'].append(chain_ids[chain_index[i]])
mmcif_dict['_atom_site.label_entity_id'].append(
label_asym_id_to_entity_id[chain_ids[chain_index[i]]]
)
mmcif_dict['_atom_site.label_seq_id'].append(str(residue_index[i]))
mmcif_dict['_atom_site.pdbx_PDB_ins_code'].append('.')
mmcif_dict['_atom_site.Cartn_x'].append(f'{pos[0]:.3f}')
mmcif_dict['_atom_site.Cartn_y'].append(f'{pos[1]:.3f}')
mmcif_dict['_atom_site.Cartn_z'].append(f'{pos[2]:.3f}')
mmcif_dict['_atom_site.occupancy'].append('1.00')
mmcif_dict['_atom_site.B_iso_or_equiv'].append(f'{b_factor:.2f}')
mmcif_dict['_atom_site.auth_seq_id'].append(str(residue_index[i]))
mmcif_dict['_atom_site.auth_asym_id'].append(chain_ids[chain_index[i]])
mmcif_dict['_atom_site.pdbx_PDB_model_num'].append('1')
atom_index += 1
metadata_dict = mmcif_metadata.add_metadata_to_mmcif(mmcif_dict, model_type)
mmcif_dict.update(metadata_dict)
return _create_mmcif_string(mmcif_dict)
@functools.lru_cache(maxsize=256)
def _int_id_to_str_id(num: int) -> str:
"""Encodes a number as a string, using reverse spreadsheet style naming.
Args:
num: A positive integer.
Returns:
A string that encodes the positive integer using reverse spreadsheet style
naming, e.g. 1 = A, 2 = B, ..., 27 = AA, 28 = BA, 29 = CA, ... This is the
usual way to encode chain IDs in mmCIF files.
"""
if num <= 0:
raise ValueError(f'Only positive integers allowed, got {num}.')
num = num - 1 # 1-based indexing.
output = []
while num >= 0:
output.append(chr(num % 26 + ord('A')))
num = num // 26 - 1
return ''.join(output)
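# Illustrative sketch (not part of the library): a standalone copy of the
# encoding loop above, so the "reverse spreadsheet" order can be checked by
# hand. The name `int_id_to_str_id_sketch` is hypothetical.

```python
def int_id_to_str_id_sketch(num: int) -> str:
    """Encodes a positive integer as a reverse spreadsheet style chain ID."""
    if num <= 0:
        raise ValueError(f'Only positive integers allowed, got {num}.')
    num = num - 1  # Switch from 1-based to 0-based.
    output = []
    while num >= 0:
        # The least significant "digit" is emitted first.
        output.append(chr(num % 26 + ord('A')))
        num = num // 26 - 1
    return ''.join(output)


examples = {n: int_id_to_str_id_sketch(n) for n in (1, 26, 27, 28, 53)}
```

# Because the least significant letter comes first, 28 is encoded as 'BA',
# whereas a spreadsheet would label its 28th column 'AB'.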
def _get_entity_poly_seq(
aatypes: np.ndarray, residue_indices: np.ndarray, chain_indices: np.ndarray
) -> Dict[int, Tuple[List[int], List[int]]]:
"""Constructs gapless residue index and aatype lists for each chain.
Args:
aatypes: A numpy array with aatypes.
residue_indices: A numpy array with residue indices.
chain_indices: A numpy array with chain indices.
Returns:
A dictionary mapping chain indices to a tuple with list of residue indices
and a list of aatypes. Missing residues are filled with UNK residue type.
"""
if (
aatypes.shape[0] != residue_indices.shape[0]
or aatypes.shape[0] != chain_indices.shape[0]
):
raise ValueError(
'aatypes, residue_indices, chain_indices must have the same length.'
)
# Group the present residues by chain index.
present = collections.defaultdict(list)
for chain_id, res_id, aa in zip(chain_indices, residue_indices, aatypes):
present[chain_id].append((res_id, aa))
# Add any missing residues (from 1 to the first residue and for any gaps).
entity_poly_seq = {}
for chain_id, present_residues in present.items():
present_residue_indices = set([x[0] for x in present_residues])
min_res_id = min(present_residue_indices) # Could be negative.
max_res_id = max(present_residue_indices)
new_residue_indices = []
new_aatypes = []
present_index = 0
for i in range(min(1, min_res_id), max_res_id + 1):
new_residue_indices.append(i)
if i in present_residue_indices:
new_aatypes.append(present_residues[present_index][1])
present_index += 1
else:
new_aatypes.append(20) # Unknown amino acid type.
entity_poly_seq[chain_id] = (new_residue_indices, new_aatypes)
return entity_poly_seq
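# Illustrative sketch (not part of the library): the gap filling above,
# reduced to a single chain and plain Python lists. It reproduces the example
# from the `to_mmcif` docstring (residues 2, 3 and 6 present). The helper name
# `fill_residue_gaps` is hypothetical, and unlike the function above it looks
# residues up by index in a dict instead of advancing a running counter.

```python
from typing import List, Sequence, Tuple

UNK_AATYPE = 20  # Index used for the unknown residue type in the code above.


def fill_residue_gaps(
    residue_indices: Sequence[int], aatypes: Sequence[int]
) -> Tuple[List[int], List[int]]:
  """Returns gapless residue indices and aatypes for one chain."""
  present = dict(zip(residue_indices, aatypes))
  start = min(1, min(present))  # Numbering starts at 1, or lower if needed.
  stop = max(present)
  new_indices = list(range(start, stop + 1))
  new_aatypes = [present.get(i, UNK_AATYPE) for i in new_indices]
  return new_indices, new_aatypes


# Residues 2 (MET = 12), 3 (LYS = 11) and 6 (GLY = 7) are present.
indices, types = fill_residue_gaps([2, 3, 6], [12, 11, 7])
```

# Here `indices` runs from 1 to 6 and positions 1, 4 and 5 receive the unknown
# type, matching the _entity_poly_seq listing in the `to_mmcif` docstring.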
def _create_mmcif_string(mmcif_dict: Dict[str, Any]) -> str:
"""Converts mmCIF dictionary into mmCIF string."""
mmcifio = MMCIFIO()
mmcifio.set_dict(mmcif_dict)
with io.StringIO() as file_handle:
mmcifio.save(file_handle)
return file_handle.getvalue()
================================================
FILE: alphafold/common/protein_test.py
================================================
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from absl.testing import absltest
from absl.testing import parameterized
from alphafold.common import protein
from alphafold.common import residue_constants
import numpy as np
# Internal import (7716).
TEST_DATA_DIR = 'alphafold/common/testdata/'
class ProteinTest(parameterized.TestCase):
def _check_shapes(self, prot, num_res):
"""Check that the processed shapes are correct."""
num_atoms = residue_constants.atom_type_num
self.assertEqual((num_res, num_atoms, 3), prot.atom_positions.shape)
self.assertEqual((num_res,), prot.aatype.shape)
self.assertEqual((num_res, num_atoms), prot.atom_mask.shape)
self.assertEqual((num_res,), prot.residue_index.shape)
self.assertEqual((num_res,), prot.chain_index.shape)
self.assertEqual((num_res, num_atoms), prot.b_factors.shape)
@parameterized.named_parameters(
dict(
testcase_name='chain_A',
pdb_file='2rbg.pdb',
chain_id='A',
num_res=282,
num_chains=1,
),
dict(
testcase_name='chain_B',
pdb_file='2rbg.pdb',
chain_id='B',
num_res=282,
num_chains=1,
),
dict(
testcase_name='multichain',
pdb_file='2rbg.pdb',
chain_id=None,
num_res=564,
num_chains=2,
),
)
def test_from_pdb_str(self, pdb_file, chain_id, num_res, num_chains):
pdb_file = os.path.join(
absltest.get_default_test_srcdir(), TEST_DATA_DIR, pdb_file
)
with open(pdb_file) as f:
pdb_string = f.read()
prot = protein.from_pdb_string(pdb_string, chain_id)
self._check_shapes(prot, num_res)
self.assertGreaterEqual(prot.aatype.min(), 0)
# Allow equal since unknown restypes have index equal to restype_num.
self.assertLessEqual(prot.aatype.max(), residue_constants.restype_num)
self.assertLen(np.unique(prot.chain_index), num_chains)
def test_to_pdb(self):
with open(
os.path.join(
absltest.get_default_test_srcdir(), TEST_DATA_DIR, '2rbg.pdb'
)
) as f:
pdb_string = f.read()
prot = protein.from_pdb_string(pdb_string)
pdb_string_reconstr = protein.to_pdb(prot)
for line in pdb_string_reconstr.splitlines():
self.assertLen(line, 80)
prot_reconstr = protein.from_pdb_string(pdb_string_reconstr)
np.testing.assert_array_equal(prot_reconstr.aatype, prot.aatype)
np.testing.assert_array_almost_equal(
prot_reconstr.atom_positions, prot.atom_positions
)
np.testing.assert_array_almost_equal(
prot_reconstr.atom_mask, prot.atom_mask
)
np.testing.assert_array_equal(
prot_reconstr.residue_index, prot.residue_index
)
np.testing.assert_array_equal(prot_reconstr.chain_index, prot.chain_index)
np.testing.assert_array_almost_equal(
prot_reconstr.b_factors, prot.b_factors
)
@parameterized.named_parameters(
dict(
testcase_name='glucagon',
pdb_file='glucagon.pdb',
model_type='Monomer',
),
dict(testcase_name='7bui', pdb_file='5nmu.pdb', model_type='Multimer'),
)
def test_to_mmcif(self, pdb_file, model_type):
with open(
os.path.join(
absltest.get_default_test_srcdir(), TEST_DATA_DIR, pdb_file
)
) as f:
pdb_string = f.read()
prot = protein.from_pdb_string(pdb_string)
file_id = 'test'
mmcif_string = protein.to_mmcif(prot, file_id, model_type)
prot_reconstr = protein.from_mmcif_string(mmcif_string)
np.testing.assert_array_equal(prot_reconstr.aatype, prot.aatype)
np.testing.assert_array_almost_equal(
prot_reconstr.atom_positions, prot.atom_positions
)
np.testing.assert_array_almost_equal(
prot_reconstr.atom_mask, prot.atom_mask
)
np.testing.assert_array_equal(
prot_reconstr.residue_index, prot.residue_index
)
np.testing.assert_array_equal(prot_reconstr.chain_index, prot.chain_index)
np.testing.assert_array_almost_equal(
prot_reconstr.b_factors, prot.b_factors
)
def test_ideal_atom_mask(self):
with open(
os.path.join(
absltest.get_default_test_srcdir(), TEST_DATA_DIR, '2rbg.pdb'
)
) as f:
pdb_string = f.read()
prot = protein.from_pdb_string(pdb_string)
ideal_mask = protein.ideal_atom_mask(prot)
non_ideal_residues = set([102] + list(range(127, 286)))
for i, (res, atom_mask) in enumerate(
zip(prot.residue_index, prot.atom_mask)
):
if res in non_ideal_residues:
self.assertFalse(np.all(atom_mask == ideal_mask[i]), msg=f'{res}')
else:
self.assertTrue(np.all(atom_mask == ideal_mask[i]), msg=f'{res}')
def test_too_many_chains(self):
num_res = protein.PDB_MAX_CHAINS + 1
num_atom_type = residue_constants.atom_type_num
with self.assertRaises(ValueError):
_ = protein.Protein(
atom_positions=np.random.random([num_res, num_atom_type, 3]),
aatype=np.random.randint(0, 21, [num_res]),
atom_mask=np.random.randint(0, 2, [num_res]).astype(np.float32),
residue_index=np.arange(1, num_res + 1),
chain_index=np.arange(num_res),
b_factors=np.random.uniform(1, 100, [num_res]),
)
if __name__ == '__main__':
absltest.main()
================================================
FILE: alphafold/common/residue_constants.py
================================================
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Constants used in AlphaFold."""
import collections
import functools
import os
from typing import Final, List, Mapping, Tuple
from jax import tree
import numpy as np
# Internal import (35fd).
# Distance from one CA to next CA [trans configuration: omega = 180].
ca_ca = 3.80209737096
# Format: The list for each AA type contains chi1, chi2, chi3, chi4 in
# this order (or a relevant subset from chi1 onwards). ALA and GLY don't have
# chi angles so their chi angle lists are empty.
chi_angles_atoms = {
'ALA': [],
# Chi5 in arginine is always 0 +- 5 degrees, so ignore it.
'ARG': [
['N', 'CA', 'CB', 'CG'],
['CA', 'CB', 'CG', 'CD'],
['CB', 'CG', 'CD', 'NE'],
['CG', 'CD', 'NE', 'CZ'],
],
'ASN': [['N', 'CA', 'CB', 'CG'], ['CA', 'CB', 'CG', 'OD1']],
'ASP': [['N', 'CA', 'CB', 'CG'], ['CA', 'CB', 'CG', 'OD1']],
'CYS': [['N', 'CA', 'CB', 'SG']],
'GLN': [
['N', 'CA', 'CB', 'CG'],
['CA', 'CB', 'CG', 'CD'],
['CB', 'CG', 'CD', 'OE1'],
],
'GLU': [
['N', 'CA', 'CB', 'CG'],
['CA', 'CB', 'CG', 'CD'],
['CB', 'CG', 'CD', 'OE1'],
],
'GLY': [],
'HIS': [['N', 'CA', 'CB', 'CG'], ['CA', 'CB', 'CG', 'ND1']],
'ILE': [['N', 'CA', 'CB', 'CG1'], ['CA', 'CB', 'CG1', 'CD1']],
'LEU': [['N', 'CA', 'CB', 'CG'], ['CA', 'CB', 'CG', 'CD1']],
'LYS': [
['N', 'CA', 'CB', 'CG'],
['CA', 'CB', 'CG', 'CD'],
['CB', 'CG', 'CD', 'CE'],
['CG', 'CD', 'CE', 'NZ'],
],
'MET': [
['N', 'CA', 'CB', 'CG'],
['CA', 'CB', 'CG', 'SD'],
['CB', 'CG', 'SD', 'CE'],
],
'PHE': [['N', 'CA', 'CB', 'CG'], ['CA', 'CB', 'CG', 'CD1']],
'PRO': [['N', 'CA', 'CB', 'CG'], ['CA', 'CB', 'CG', 'CD']],
'SER': [['N', 'CA', 'CB', 'OG']],
'THR': [['N', 'CA', 'CB', 'OG1']],
'TRP': [['N', 'CA', 'CB', 'CG'], ['CA', 'CB', 'CG', 'CD1']],
'TYR': [['N', 'CA', 'CB', 'CG'], ['CA', 'CB', 'CG', 'CD1']],
'VAL': [['N', 'CA', 'CB', 'CG1']],
}
# If chi angles are given in a fixed-length array, this matrix determines how
# to mask them for each AA type. The order is as per restype_order (see below).
chi_angles_mask = [
[0.0, 0.0, 0.0, 0.0], # ALA
[1.0, 1.0, 1.0, 1.0], # ARG
[1.0, 1.0, 0.0, 0.0], # ASN
[1.0, 1.0, 0.0, 0.0], # ASP
[1.0, 0.0, 0.0, 0.0], # CYS
[1.0, 1.0, 1.0, 0.0], # GLN
[1.0, 1.0, 1.0, 0.0], # GLU
[0.0, 0.0, 0.0, 0.0], # GLY
[1.0, 1.0, 0.0, 0.0], # HIS
[1.0, 1.0, 0.0, 0.0], # ILE
[1.0, 1.0, 0.0, 0.0], # LEU
[1.0, 1.0, 1.0, 1.0], # LYS
[1.0, 1.0, 1.0, 0.0], # MET
[1.0, 1.0, 0.0, 0.0], # PHE
[1.0, 1.0, 0.0, 0.0], # PRO
[1.0, 0.0, 0.0, 0.0], # SER
[1.0, 0.0, 0.0, 0.0], # THR
[1.0, 1.0, 0.0, 0.0], # TRP
[1.0, 1.0, 0.0, 0.0], # TYR
[1.0, 0.0, 0.0, 0.0], # VAL
]
# The following chi angles are pi periodic: they can be rotated by a multiple
# of pi without affecting the structure.
chi_pi_periodic = [
[0.0, 0.0, 0.0, 0.0], # ALA
[0.0, 0.0, 0.0, 0.0], # ARG
[0.0, 0.0, 0.0, 0.0], # ASN
[0.0, 1.0, 0.0, 0.0], # ASP
[0.0, 0.0, 0.0, 0.0], # CYS
[0.0, 0.0, 0.0, 0.0], # GLN
[0.0, 0.0, 1.0, 0.0], # GLU
[0.0, 0.0, 0.0, 0.0], # GLY
[0.0, 0.0, 0.0, 0.0], # HIS
[0.0, 0.0, 0.0, 0.0], # ILE
[0.0, 0.0, 0.0, 0.0], # LEU
[0.0, 0.0, 0.0, 0.0], # LYS
[0.0, 0.0, 0.0, 0.0], # MET
[0.0, 1.0, 0.0, 0.0], # PHE
[0.0, 0.0, 0.0, 0.0], # PRO
[0.0, 0.0, 0.0, 0.0], # SER
[0.0, 0.0, 0.0, 0.0], # THR
[0.0, 0.0, 0.0, 0.0], # TRP
[0.0, 1.0, 0.0, 0.0], # TYR
[0.0, 0.0, 0.0, 0.0], # VAL
[0.0, 0.0, 0.0, 0.0], # UNK
]
# Atom positions relative to the 8 rigid groups, defined by the pre-omega, phi,
# psi and chi angles:
# 0: 'backbone group',
# 1: 'pre-omega-group', (empty)
# 2: 'phi-group', (currently empty, because it defines only hydrogens)
# 3: 'psi-group',
# 4,5,6,7: 'chi1,2,3,4-group'
# The atom positions are relative to the axis-end-atom of the corresponding
# rotation axis. The x-axis is in direction of the rotation axis, and the y-axis
# is defined such that the dihedral-angle-defining atom (the last entry in
# chi_angles_atoms above) is in the xy-plane (with a positive y-coordinate).
# format: [atomname, group_idx, rel_position]
rigid_group_atom_positions = {
'ALA': [
['N', 0, (-0.525, 1.363, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.526, -0.000, -0.000)],
['CB', 0, (-0.529, -0.774, -1.205)],
['O', 3, (0.627, 1.062, 0.000)],
],
'ARG': [
['N', 0, (-0.524, 1.362, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.525, -0.000, -0.000)],
['CB', 0, (-0.524, -0.778, -1.209)],
['O', 3, (0.626, 1.062, 0.000)],
['CG', 4, (0.616, 1.390, -0.000)],
['CD', 5, (0.564, 1.414, 0.000)],
['NE', 6, (0.539, 1.357, -0.000)],
['NH1', 7, (0.206, 2.301, 0.000)],
['NH2', 7, (2.078, 0.978, -0.000)],
['CZ', 7, (0.758, 1.093, -0.000)],
],
'ASN': [
['N', 0, (-0.536, 1.357, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.526, -0.000, -0.000)],
['CB', 0, (-0.531, -0.787, -1.200)],
['O', 3, (0.625, 1.062, 0.000)],
['CG', 4, (0.584, 1.399, 0.000)],
['ND2', 5, (0.593, -1.188, 0.001)],
['OD1', 5, (0.633, 1.059, 0.000)],
],
'ASP': [
['N', 0, (-0.525, 1.362, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.527, 0.000, -0.000)],
['CB', 0, (-0.526, -0.778, -1.208)],
['O', 3, (0.626, 1.062, -0.000)],
['CG', 4, (0.593, 1.398, -0.000)],
['OD1', 5, (0.610, 1.091, 0.000)],
['OD2', 5, (0.592, -1.101, -0.003)],
],
'CYS': [
['N', 0, (-0.522, 1.362, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.524, 0.000, 0.000)],
['CB', 0, (-0.519, -0.773, -1.212)],
['O', 3, (0.625, 1.062, -0.000)],
['SG', 4, (0.728, 1.653, 0.000)],
],
'GLN': [
['N', 0, (-0.526, 1.361, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.526, 0.000, 0.000)],
['CB', 0, (-0.525, -0.779, -1.207)],
['O', 3, (0.626, 1.062, -0.000)],
['CG', 4, (0.615, 1.393, 0.000)],
['CD', 5, (0.587, 1.399, -0.000)],
['NE2', 6, (0.593, -1.189, -0.001)],
['OE1', 6, (0.634, 1.060, 0.000)],
],
'GLU': [
['N', 0, (-0.528, 1.361, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.526, -0.000, -0.000)],
['CB', 0, (-0.526, -0.781, -1.207)],
['O', 3, (0.626, 1.062, 0.000)],
['CG', 4, (0.615, 1.392, 0.000)],
['CD', 5, (0.600, 1.397, 0.000)],
['OE1', 6, (0.607, 1.095, -0.000)],
['OE2', 6, (0.589, -1.104, -0.001)],
],
'GLY': [
['N', 0, (-0.572, 1.337, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.517, -0.000, -0.000)],
['O', 3, (0.626, 1.062, -0.000)],
],
'HIS': [
['N', 0, (-0.527, 1.360, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.525, 0.000, 0.000)],
['CB', 0, (-0.525, -0.778, -1.208)],
['O', 3, (0.625, 1.063, 0.000)],
['CG', 4, (0.600, 1.370, -0.000)],
['CD2', 5, (0.889, -1.021, 0.003)],
['ND1', 5, (0.744, 1.160, -0.000)],
['CE1', 5, (2.030, 0.851, 0.002)],
['NE2', 5, (2.145, -0.466, 0.004)],
],
'ILE': [
['N', 0, (-0.493, 1.373, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.527, -0.000, -0.000)],
['CB', 0, (-0.536, -0.793, -1.213)],
['O', 3, (0.627, 1.062, -0.000)],
['CG1', 4, (0.534, 1.437, -0.000)],
['CG2', 4, (0.540, -0.785, -1.199)],
['CD1', 5, (0.619, 1.391, 0.000)],
],
'LEU': [
['N', 0, (-0.520, 1.363, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.525, -0.000, -0.000)],
['CB', 0, (-0.522, -0.773, -1.214)],
['O', 3, (0.625, 1.063, -0.000)],
['CG', 4, (0.678, 1.371, 0.000)],
['CD1', 5, (0.530, 1.430, -0.000)],
['CD2', 5, (0.535, -0.774, 1.200)],
],
'LYS': [
['N', 0, (-0.526, 1.362, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.526, 0.000, 0.000)],
['CB', 0, (-0.524, -0.778, -1.208)],
['O', 3, (0.626, 1.062, -0.000)],
['CG', 4, (0.619, 1.390, 0.000)],
['CD', 5, (0.559, 1.417, 0.000)],
['CE', 6, (0.560, 1.416, 0.000)],
['NZ', 7, (0.554, 1.387, 0.000)],
],
'MET': [
['N', 0, (-0.521, 1.364, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.525, 0.000, 0.000)],
['CB', 0, (-0.523, -0.776, -1.210)],
['O', 3, (0.625, 1.062, -0.000)],
['CG', 4, (0.613, 1.391, -0.000)],
['SD', 5, (0.703, 1.695, 0.000)],
['CE', 6, (0.320, 1.786, -0.000)],
],
'PHE': [
['N', 0, (-0.518, 1.363, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.524, 0.000, -0.000)],
['CB', 0, (-0.525, -0.776, -1.212)],
['O', 3, (0.626, 1.062, -0.000)],
['CG', 4, (0.607, 1.377, 0.000)],
['CD1', 5, (0.709, 1.195, -0.000)],
['CD2', 5, (0.706, -1.196, 0.000)],
['CE1', 5, (2.102, 1.198, -0.000)],
['CE2', 5, (2.098, -1.201, -0.000)],
['CZ', 5, (2.794, -0.003, -0.001)],
],
'PRO': [
['N', 0, (-0.566, 1.351, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.527, -0.000, 0.000)],
['CB', 0, (-0.546, -0.611, -1.293)],
['O', 3, (0.621, 1.066, 0.000)],
['CG', 4, (0.382, 1.445, 0.0)],
# ['CD', 5, (0.427, 1.440, 0.0)],
['CD', 5, (0.477, 1.424, 0.0)], # manually made angle 2 degrees larger
],
'SER': [
['N', 0, (-0.529, 1.360, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.525, -0.000, -0.000)],
['CB', 0, (-0.518, -0.777, -1.211)],
['O', 3, (0.626, 1.062, -0.000)],
['OG', 4, (0.503, 1.325, 0.000)],
],
'THR': [
['N', 0, (-0.517, 1.364, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.526, 0.000, -0.000)],
['CB', 0, (-0.516, -0.793, -1.215)],
['O', 3, (0.626, 1.062, 0.000)],
['CG2', 4, (0.550, -0.718, -1.228)],
['OG1', 4, (0.472, 1.353, 0.000)],
],
'TRP': [
['N', 0, (-0.521, 1.363, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.525, -0.000, 0.000)],
['CB', 0, (-0.523, -0.776, -1.212)],
['O', 3, (0.627, 1.062, 0.000)],
['CG', 4, (0.609, 1.370, -0.000)],
['CD1', 5, (0.824, 1.091, 0.000)],
['CD2', 5, (0.854, -1.148, -0.005)],
['CE2', 5, (2.186, -0.678, -0.007)],
['CE3', 5, (0.622, -2.530, -0.007)],
['NE1', 5, (2.140, 0.690, -0.004)],
['CH2', 5, (3.028, -2.890, -0.013)],
['CZ2', 5, (3.283, -1.543, -0.011)],
['CZ3', 5, (1.715, -3.389, -0.011)],
],
'TYR': [
['N', 0, (-0.522, 1.362, 0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.524, -0.000, -0.000)],
['CB', 0, (-0.522, -0.776, -1.213)],
['O', 3, (0.627, 1.062, -0.000)],
['CG', 4, (0.607, 1.382, -0.000)],
['CD1', 5, (0.716, 1.195, -0.000)],
['CD2', 5, (0.713, -1.194, -0.001)],
['CE1', 5, (2.107, 1.200, -0.002)],
['CE2', 5, (2.104, -1.201, -0.003)],
['OH', 5, (4.168, -0.002, -0.005)],
['CZ', 5, (2.791, -0.001, -0.003)],
],
'VAL': [
['N', 0, (-0.494, 1.373, -0.000)],
['CA', 0, (0.000, 0.000, 0.000)],
['C', 0, (1.527, -0.000, -0.000)],
['CB', 0, (-0.533, -0.795, -1.213)],
['O', 3, (0.627, 1.062, -0.000)],
['CG1', 4, (0.540, 1.429, -0.000)],
['CG2', 4, (0.533, -0.776, 1.203)],
],
}
# A list of atoms (excluding hydrogen) for each AA type. PDB naming convention.
# pylint: disable=line-too-long
# pyformat: disable
residue_atoms = {
'ALA': ['C', 'CA', 'CB', 'N', 'O'],
'ARG': ['C', 'CA', 'CB', 'CG', 'CD', 'CZ', 'N', 'NE', 'O', 'NH1', 'NH2'],
'ASP': ['C', 'CA', 'CB', 'CG', 'N', 'O', 'OD1', 'OD2'],
'ASN': ['C', 'CA', 'CB', 'CG', 'N', 'ND2', 'O', 'OD1'],
'CYS': ['C', 'CA', 'CB', 'N', 'O', 'SG'],
'GLU': ['C', 'CA', 'CB', 'CG', 'CD', 'N', 'O', 'OE1', 'OE2'],
'GLN': ['C', 'CA', 'CB', 'CG', 'CD', 'N', 'NE2', 'O', 'OE1'],
'GLY': ['C', 'CA', 'N', 'O'],
'HIS': ['C', 'CA', 'CB', 'CG', 'CD2', 'CE1', 'N', 'ND1', 'NE2', 'O'],
'ILE': ['C', 'CA', 'CB', 'CG1', 'CG2', 'CD1', 'N', 'O'],
'LEU': ['C', 'CA', 'CB', 'CG', 'CD1', 'CD2', 'N', 'O'],
'LYS': ['C', 'CA', 'CB', 'CG', 'CD', 'CE', 'N', 'NZ', 'O'],
'MET': ['C', 'CA', 'CB', 'CG', 'CE', 'N', 'O', 'SD'],
'PHE': ['C', 'CA', 'CB', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'N', 'O'],
'PRO': ['C', 'CA', 'CB', 'CG', 'CD', 'N', 'O'],
'SER': ['C', 'CA', 'CB', 'N', 'O', 'OG'],
'THR': ['C', 'CA', 'CB', 'CG2', 'N', 'O', 'OG1'],
'TRP': ['C', 'CA', 'CB', 'CG', 'CD1', 'CD2', 'CE2', 'CE3', 'CZ2', 'CZ3', 'CH2', 'N', 'NE1', 'O'],
'TYR': ['C', 'CA', 'CB', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'N', 'O', 'OH'],
'VAL': ['C', 'CA', 'CB', 'CG1', 'CG2', 'N', 'O']
}
# pyformat: enable
# pylint: enable=line-too-long
# Naming swaps for ambiguous atom names.
# Due to symmetries in the amino acids the naming of atoms is ambiguous in
# 4 of the 20 amino acids.
# (The LDDT paper lists 7 amino acids as ambiguous, but the naming ambiguities
# in LEU, VAL and ARG can be resolved by using the 3D constellations of
# the 'ambiguous' atoms and their neighbours.)
residue_atom_renaming_swaps = {
'ASP': {'OD1': 'OD2'},
'GLU': {'OE1': 'OE2'},
'PHE': {'CD1': 'CD2', 'CE1': 'CE2'},
'TYR': {'CD1': 'CD2', 'CE1': 'CE2'},
}
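# Illustrative sketch (not part of the library): one way the swap table above
# could be applied to a list of atom names. The table only stores one direction
# of each pair, so the sketch makes it symmetric first. The names
# `RENAMING_SWAPS` and `swap_ambiguous_atom_names` are hypothetical.

```python
from typing import Dict, List, Sequence

# Local copy of the table above, so that the sketch is self-contained.
RENAMING_SWAPS: Dict[str, Dict[str, str]] = {
    'ASP': {'OD1': 'OD2'},
    'GLU': {'OE1': 'OE2'},
    'PHE': {'CD1': 'CD2', 'CE1': 'CE2'},
    'TYR': {'CD1': 'CD2', 'CE1': 'CE2'},
}


def swap_ambiguous_atom_names(
    resname: str, atom_names: Sequence[str]
) -> List[str]:
  """Returns atom names with every ambiguous pair exchanged."""
  swaps = RENAMING_SWAPS.get(resname, {})
  both_ways = dict(swaps)
  both_ways.update({new: old for old, new in swaps.items()})
  return [both_ways.get(name, name) for name in atom_names]


phe_swapped = swap_ambiguous_atom_names(
    'PHE', ['CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ']
)
```

# Residues without an entry in the table, such as ALA, come back unchanged.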
# Van der Waals radii [Angstrom] of the atoms (from Wikipedia)
van_der_waals_radius = {
'C': 1.7,
'N': 1.55,
'O': 1.52,
'S': 1.8,
}
Bond = collections.namedtuple(
'Bond', ['atom1_name', 'atom2_name', 'length', 'stddev']
)
BondAngle = collections.namedtuple(
'BondAngle',
['atom1_name', 'atom2_name', 'atom3name', 'angle_rad', 'stddev'],
)
@functools.lru_cache(maxsize=None)
def load_stereo_chemical_props() -> Tuple[
Mapping[str, List[Bond]],
Mapping[str, List[Bond]],
Mapping[str, List[BondAngle]],
]:
"""Load stereo_chemical_props.txt into a nice structure.
Load literature values for bond lengths and bond angles and translate
bond angles into the length of the opposite edge of the triangle
("residue_virtual_bonds").
Returns:
residue_bonds: Dict that maps resname -> list of Bond tuples.
residue_virtual_bonds: Dict that maps resname -> list of Bond tuples.
residue_bond_angles: Dict that maps resname -> list of BondAngle tuples.
"""
stereo_chemical_props_path = os.path.join(
os.path.dirname(os.path.abspath(__file__)), 'stereo_chemical_props.txt'
)
with open(stereo_chemical_props_path, 'rt') as f:
stereo_chemical_props = f.read()
lines_iter = iter(stereo_chemical_props.splitlines())
# Load bond lengths.
residue_bonds = {}
next(lines_iter) # Skip header line.
for line in lines_iter:
if line.strip() == '-':
break
bond, resname, length, stddev = line.split()
atom1, atom2 = bond.split('-')
if resname not in residue_bonds:
residue_bonds[resname] = []
residue_bonds[resname].append(
Bond(atom1, atom2, float(length), float(stddev))
)
residue_bonds['UNK'] = []
# Load bond angles.
residue_bond_angles = {}
next(lines_iter) # Skip empty line.
next(lines_iter) # Skip header line.
for line in lines_iter:
if line.strip() == '-':
break
bond, resname, angle_degree, stddev_degree = line.split()
atom1, atom2, atom3 = bond.split('-')
if resname not in residue_bond_angles:
residue_bond_angles[resname] = []
residue_bond_angles[resname].append(
BondAngle(
atom1,
atom2,
atom3,
float(angle_degree) / 180.0 * np.pi,
float(stddev_degree) / 180.0 * np.pi,
)
)
residue_bond_angles['UNK'] = []
def make_bond_key(atom1_name, atom2_name):
"""Unique key to lookup bonds."""
return '-'.join(sorted([atom1_name, atom2_name]))
# Translate bond angles into distances ("virtual bonds").
residue_virtual_bonds = {}
for resname, bond_angles in residue_bond_angles.items():
# Create a fast lookup dict for bond lengths.
bond_cache = {}
for b in residue_bonds[resname]:
bond_cache[make_bond_key(b.atom1_name, b.atom2_name)] = b
residue_virtual_bonds[resname] = []
for ba in bond_angles:
bond1 = bond_cache[make_bond_key(ba.atom1_name, ba.atom2_name)]
bond2 = bond_cache[make_bond_key(ba.atom2_name, ba.atom3name)]
# Compute distance between atom1 and atom3 using the law of cosines
# c^2 = a^2 + b^2 - 2ab*cos(gamma).
gamma = ba.angle_rad
length = np.sqrt(
bond1.length**2
+ bond2.length**2
- 2 * bond1.length * bond2.length * np.cos(gamma)
)
# Propagation of uncertainty assuming uncorrelated errors.
dl_outer = 0.5 / length
dl_dgamma = (2 * bond1.length * bond2.length * np.sin(gamma)) * dl_outer
dl_db1 = (2 * bond1.length - 2 * bond2.length * np.cos(gamma)) * dl_outer
dl_db2 = (2 * bond2.length - 2 * bond1.length * np.cos(gamma)) * dl_outer
stddev = np.sqrt(
(dl_dgamma * ba.stddev) ** 2
+ (dl_db1 * bond1.stddev) ** 2
+ (dl_db2 * bond2.stddev) ** 2
)
residue_virtual_bonds[resname].append(
Bond(ba.atom1_name, ba.atom3name, length, stddev)
)
return (residue_bonds, residue_virtual_bonds, residue_bond_angles)
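# Illustrative sketch (not part of the library): the "virtual bond" step from
# the function above, isolated for a single bond angle. It applies the law of
# cosines and the same first-order error propagation, using only the standard
# library. The function name `virtual_bond` and the 3-4-5 test triangle are
# assumptions chosen so the result is easy to verify by hand.

```python
import math
from typing import Tuple


def virtual_bond(
    len1: float,
    std1: float,
    len2: float,
    std2: float,
    angle_rad: float,
    angle_std: float,
) -> Tuple[float, float]:
  """Returns length and stddev of the edge opposite to the bond angle."""
  # Law of cosines: c^2 = a^2 + b^2 - 2ab*cos(gamma).
  length = math.sqrt(
      len1**2 + len2**2 - 2 * len1 * len2 * math.cos(angle_rad)
  )
  # Partial derivatives of c with respect to gamma, a and b.
  dl_outer = 0.5 / length
  dl_dgamma = (2 * len1 * len2 * math.sin(angle_rad)) * dl_outer
  dl_db1 = (2 * len1 - 2 * len2 * math.cos(angle_rad)) * dl_outer
  dl_db2 = (2 * len2 - 2 * len1 * math.cos(angle_rad)) * dl_outer
  # Uncorrelated errors add in quadrature.
  stddev = math.sqrt(
      (dl_dgamma * angle_std) ** 2
      + (dl_db1 * std1) ** 2
      + (dl_db2 * std2) ** 2
  )
  return length, stddev


# Right angle between edges of length 3 and 4; only the angle is uncertain.
length, stddev = virtual_bond(3.0, 0.0, 4.0, 0.0, math.pi / 2, 0.01)
```

# For this triangle the opposite edge has length 5 and dc/dgamma = ab/c = 2.4,
# so an angle uncertainty of 0.01 rad gives a length uncertainty of 0.024.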
# Between-residue bond lengths for general bonds (first element) and for Proline
# (second element).
between_res_bond_length_c_n = [1.329, 1.341]
between_res_bond_length_stddev_c_n = [0.014, 0.016]
# Between-residue cos_angles.
between_res_cos_angles_c_n_ca = [-0.5203, 0.0353] # degrees: 121.352 +- 2.315
between_res_cos_angles_ca_c_n = [-0.4473, 0.0311] # degrees: 116.568 +- 1.995
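# Illustrative sketch (not part of the library): a quick consistency check
# between the first element of each cosine list above and the angle in degrees
# quoted in its trailing comment. Only the mean values are checked here; the
# variable names are hypothetical.

```python
import math

# Mean angles in degrees, taken from the comments above.
c_n_ca_deg = 121.352
ca_c_n_deg = 116.568

cos_c_n_ca = math.cos(math.radians(c_n_ca_deg))
cos_ca_c_n = math.cos(math.radians(ca_c_n_deg))

# Both values should agree with the tabulated cosines to four decimals.
matches_c_n_ca = abs(cos_c_n_ca - (-0.5203)) < 1e-4
matches_ca_c_n = abs(cos_ca_c_n - (-0.4473)) < 1e-4
```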
# This mapping is used when we need to store atom data in a format that requires
# fixed atom data size for every residue (e.g. a numpy array).
# pyformat: disable
atom_types = [
'N', 'CA', 'C', 'CB', 'O', 'CG', 'CG1', 'CG2', 'OG', 'OG1', 'SG', 'CD',
'CD1', 'CD2', 'ND1', 'ND2', 'OD1', 'OD2', 'SD', 'CE', 'CE1', 'CE2', 'CE3',
'NE', 'NE1', 'NE2', 'OE1', 'OE2', 'CH2', 'NH1', 'NH2', 'OH', 'CZ', 'CZ2',
'CZ3', 'NZ', 'OXT'
]
# pyformat: enable
atom_order = {atom_type: i for i, atom_type in enumerate(atom_types)}
atom_type_num = len(atom_types) # := 37.
# A compact atom encoding with 14 columns
# pylint: disable=line-too-long
# pylint: disable=bad-whitespace
# pyformat: disable
restype_name_to_atom14_names = {
'ALA': ['N', 'CA', 'C', 'O', 'CB', '', '', '', '', '', '', '', '', ''],
'ARG': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD', 'NE', 'CZ', 'NH1', 'NH2', '', '', ''],
'ASN': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'OD1', 'ND2', '', '', '', '', '', ''],
'ASP': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'OD1', 'OD2', '', '', '', '', '', ''],
'CYS': ['N', 'CA', 'C', 'O', 'CB', 'SG', '', '', '', '', '', '', '', ''],
'GLN': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD', 'OE1', 'NE2', '', '', '', '', ''],
'GLU': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD', 'OE1', 'OE2', '', '', '', '', ''],
'GLY': ['N', 'CA', 'C', 'O', '', '', '', '', '', '', '', '', '', ''],
'HIS': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'ND1', 'CD2', 'CE1', 'NE2', '', '', '', ''],
'ILE': ['N', 'CA', 'C', 'O', 'CB', 'CG1', 'CG2', 'CD1', '', '', '', '', '', ''],
'LEU': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD1', 'CD2', '', '', '', '', '', ''],
'LYS': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD', 'CE', 'NZ', '', '', '', '', ''],
'MET': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'SD', 'CE', '', '', '', '', '', ''],
'PHE': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', '', '', ''],
'PRO': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD', '', '', '', '', '', '', ''],
'SER': ['N', 'CA', 'C', 'O', 'CB', 'OG', '', '', '', '', '', '', '', ''],
'THR': ['N', 'CA', 'C', 'O', 'CB', 'OG1', 'CG2', '', '', '', '', '', '', ''],
'TRP': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD1', 'CD2', 'NE1', 'CE2', 'CE3', 'CZ2', 'CZ3', 'CH2'],
'TYR': ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'OH', '', ''],
'VAL': ['N', 'CA', 'C', 'O', 'CB', 'CG1', 'CG2', '', '', '', '', '', '', ''],
'UNK': ['', '', '', '', '', '', '', '', '', '', '', '', '', ''],
}
# pyformat: enable
# pylint: enable=line-too-long
# pylint: enable=bad-whitespace
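# Illustrative sketch (not part of the library): how a row of the compact
# 14-column encoding relates to the 37-column `atom_types` order. Each atom14
# name is looked up in the atom37 order, and empty padding slots are reported
# as None here. That padding choice and the helper name
# `atom14_to_atom37_indices` are assumptions of this sketch.

```python
from typing import List, Optional, Sequence

# Local copies of the tables above, so that the sketch is self-contained.
ATOM_TYPES = [
    'N', 'CA', 'C', 'CB', 'O', 'CG', 'CG1', 'CG2', 'OG', 'OG1', 'SG', 'CD',
    'CD1', 'CD2', 'ND1', 'ND2', 'OD1', 'OD2', 'SD', 'CE', 'CE1', 'CE2', 'CE3',
    'NE', 'NE1', 'NE2', 'OE1', 'OE2', 'CH2', 'NH1', 'NH2', 'OH', 'CZ', 'CZ2',
    'CZ3', 'NZ', 'OXT',
]
ATOM_ORDER = {name: i for i, name in enumerate(ATOM_TYPES)}
SER_ATOM14 = ['N', 'CA', 'C', 'O', 'CB', 'OG'] + [''] * 8


def atom14_to_atom37_indices(
    atom14_names: Sequence[str],
) -> List[Optional[int]]:
  """Maps each atom14 slot to its atom37 column, or None for padding."""
  return [ATOM_ORDER[name] if name else None for name in atom14_names]


ser_indices = atom14_to_atom37_indices(SER_ATOM14)
```

# Note that the two encodings order the backbone differently: atom14 uses
# N, CA, C, O, CB while atom37 uses N, CA, C, CB, O.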
# This is the standard residue order when coding AA type as a number.
# Reproduce it by taking 3-letter AA codes and sorting them alphabetically.
# pyformat: disable
restypes = [
'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P',
'S', 'T', 'W', 'Y', 'V'
]
# pyformat: enable
restype_order = {restype: i for i, restype in enumerate(restypes)}
restype_num = len(restypes) # := 20.
unk_restype_index = restype_num # Catch-all index for unknown restypes.
restypes_with_x = restypes + ['X']
restype_order_with_x = {restype: i for i, restype in enumerate(restypes_with_x)}
def sequence_to_onehot(
    sequence: str, mapping: Mapping[str, int], map_unknown_to_x: bool = False
) -> np.ndarray:
  """Maps the given sequence into a one-hot encoded matrix.

  Args:
    sequence: An amino acid sequence.
    mapping: A dictionary mapping amino acids to integers.
    map_unknown_to_x: If True, any amino acid that is not in the mapping will be
      mapped to the unknown amino acid 'X'. If the mapping doesn't contain amino
      acid 'X', an error will be thrown. If False, any amino acid not in the
      mapping will throw an error.

  Returns:
    A numpy array of shape (seq_len, num_unique_aas) with one-hot encoding of
    the sequence.

  Raises:
    ValueError: If the mapping doesn't contain values from 0 to
      num_unique_aas - 1 without any gaps.
  """
  num_entries = max(mapping.values()) + 1

  if sorted(set(mapping.values())) != list(range(num_entries)):
    raise ValueError(
        'The mapping must have values from 0 to num_unique_aas-1 '
        'without any gaps. Got: %s'
        % sorted(mapping.values())
    )

  one_hot_arr = np.zeros((len(sequence), num_entries), dtype=np.int32)

  for aa_index, aa_type in enumerate(sequence):
    if map_unknown_to_x:
      if aa_type.isalpha() and aa_type.isupper():
        aa_id = mapping.get(aa_type, mapping['X'])
      else:
        raise ValueError(f'Invalid character in the sequence: {aa_type}')
    else:
      aa_id = mapping[aa_type]
    one_hot_arr[aa_index, aa_id] = 1

  return one_hot_arr
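# Illustrative sketch (not part of the original module): the same one-hot idea
# written in vectorized form so that it runs standalone. The name
# `one_hot_encode` and the three-letter `toy_mapping` are assumptions made for
# this example only; it skips the validation that sequence_to_onehot performs.

```python
import numpy as np


def one_hot_encode(sequence, mapping, map_unknown_to_x=False):
  """Hypothetical helper: one-hot encodes a sequence via an integer mapping."""
  num_entries = max(mapping.values()) + 1
  if map_unknown_to_x:
    # Unknown letters fall back to the id of 'X'.
    ids = [mapping.get(aa, mapping['X']) for aa in sequence]
  else:
    ids = [mapping[aa] for aa in sequence]
  # Row i of the identity matrix is the one-hot vector for class i.
  return np.eye(num_entries, dtype=np.int32)[ids]


toy_mapping = {'A': 0, 'C': 1, 'X': 2}
# 'B' is not in toy_mapping, so it is encoded as 'X'.
encoded = one_hot_encode('ACB', toy_mapping, map_unknown_to_x=True)
```

# In this toy case the result has one row per residue and one column per id,
# so the third row marks the 'X' column.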
restype_1to3 = {
'A': 'ALA',
'R': 'ARG',
'N': 'ASN',
'D': 'ASP',
'C': 'CYS',
'Q': 'GLN',
'E': 'GLU',
'G': 'GLY',
'H': 'HIS',
'I': 'ILE',
'L': 'LEU',
'K': 'LYS',
'M': 'MET',
'F': 'PHE',
'P': 'PRO',
'S': 'SER',
'T': 'THR',
'W': 'TRP',
'Y': 'TYR',
'V': 'VAL',
}
PROTEIN_CHAIN: Final[str] = 'polypeptide(L)'
POLYMER_CHAIN: Final[str] = 'polymer'
def atom_id_to_type(atom_id: str) -> str:
  """Convert atom ID to atom type, works only for standard protein residues.

  Args:
    atom_id: Atom ID to be converted.

  Returns:
    String corresponding to atom type.

  Raises:
    ValueError: If atom ID not recognized.
  """
  if atom_id.startswith('C'):
    return 'C'
  elif atom_id.startswith('N'):
    return 'N'
  elif atom_id.startswith('O'):
    return 'O'
  elif atom_id.startswith('H'):
    return 'H'
  elif atom_id.startswith('S'):
    return 'S'
  raise ValueError('Atom ID not recognized.')
# NB: restype_3to1 differs from Bio.PDB.protein_letters_3to1 by being a simple
# 1-to-1 mapping of 3 letter names to one letter names. The latter contains
# many more, and less common, three letter names as keys and maps many of these
# to the same one letter name (including 'X' and 'U' which we don't use here).
restype_3to1 = {v: k for k, v in restype_1to3.items()}
# Define a restype name for all unknown residues.
unk_restype = 'UNK'
resnames = [restype_1to3[r] for r in restypes] + [unk_restype]
resname_to_idx = {resname: i for i, resname in enumerate(resnames)}
# The mapping here uses hhblits convention, so that B is mapped to D, J and O
# are mapped to X, U is mapped to C, and Z is mapped to E. Other than that the
# remaining 20 amino acids are kept in alphabetical order.
# There are 2 non-amino acid codes, X (representing any amino acid) and
# "-" representing a missing amino acid in an alignment. The id for these
# codes is put at the end (20 and 21) so that they can easily be ignored if
# desired.
HHBLITS_AA_TO_ID = {
'A': 0,
'B': 2,
'C': 1,
'D': 2,
'E': 3,
'F': 4,
'G': 5,
'H': 6,
'I': 7,
'J': 20,
'K': 8,
'L': 9,
'M': 10,
'N': 11,
'O': 20,
'P': 12,
'Q': 13,
'R': 14,
'S': 15,
'T': 16,
'U': 1,
'V': 17,
'W': 18,
'X': 20,
'Y': 19,
'Z': 3,
'-': 21,
}
# Partial inversion of HHBLITS_AA_TO_ID.
ID_TO_HHBLITS_AA = {
0: 'A',
1: 'C', # Also U.
2: 'D', # Also B.
3: 'E', # Also Z.
4: 'F',
5: 'G',
6: 'H',
7: 'I',
8: 'K',
9: 'L',
10: 'M',
11: 'N',
12: 'P',
13: 'Q',
14: 'R',
15: 'S',
16: 'T',
17: 'V',
18: 'W',
19: 'Y',
20: 'X', # Includes J and O.
21: '-',
}
restypes_with_x_and_gap = restypes + ['X', '-']
MAP_HHBLITS_AATYPE_TO_OUR_AATYPE = tuple(
restypes_with_x_and_gap.index(ID_TO_HHBLITS_AA[i])
for i in range(len(restypes_with_x_and_gap))
)
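# Illustrative sketch (not part of the original module) of how a permutation
# like MAP_HHBLITS_AATYPE_TO_OUR_AATYPE can be applied. The names `our_order`,
# `hhblits_order`, and `hhblits_to_ours` are assumptions for this example; the
# two orderings are copied from restypes and ID_TO_HHBLITS_AA above.

```python
import numpy as np

# AlphaFold residue order (20 amino acids, then 'X' and the gap '-').
our_order = list('ARNDCQEGHILKMFPSTWYV') + ['X', '-']
# HHblits id order: alphabetical one-letter codes, then 'X' (20) and '-' (21).
hhblits_order = list('ACDEFGHIKLMNPQRSTVWY') + ['X', '-']

# Entry i is the AlphaFold index of the residue that HHblits calls id i.
hhblits_to_ours = np.array([our_order.index(aa) for aa in hhblits_order])

# A, C, R and the gap, expressed as HHblits ids.
hhblits_ids = np.array([0, 1, 14, 21])
# Fancy indexing remaps a whole array of ids in one step.
our_ids = hhblits_to_ours[hhblits_ids]
```

# Because 'X' and '-' sit at positions 20 and 21 in both orderings, only the 20
# standard amino acids are actually permuted.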
def _make_standard_atom_mask() -> np.ndarray:
  """Returns [num_res_types, num_atom_types] mask array."""
  # +1 to account for unknown (all 0s).
  mask = np.zeros([restype_num + 1, atom_type_num], dtype=np.int32)
  for restype, restype_letter in enumerate(restypes):
    restype_name = restype_1to3[restype_letter]
    atom_names = residue_atoms[restype_name]
    for atom_name in atom_names:
      atom_type = atom_order[atom_name]
      mask[restype, atom_type] = 1
  return mask
STANDARD_ATOM_MASK = _make_standard_atom_mask()
# A one hot representation for the first and second atoms defining the axis
# of rotation for each chi-angle in each residue.
def chi_angle_atom(atom_index: int) -> np.ndarray:
  """Define chi-angle rigid groups via one-hot representations."""
  chi_angles_index = {}
  one_hots = []

  for k, v in chi_angles_atoms.items():
    indices = [atom_types.index(s[atom_index]) for s in v]
    indices.extend([-1] * (4 - len(indices)))
    chi_angles_index[k] = indices

  for r in restypes:
    res3 = restype_1to3[r]
    one_hot = np.eye(atom_type_num)[chi_angles_index[res3]]
    one_hots.append(one_hot)

  one_hots.append(np.zeros([4, atom_type_num]))  # Add zeros for residue `X`.
  one_hot = np.stack(one_hots, axis=0)
  one_hot = np.transpose(one_hot, [0, 2, 1])

  return one_hot
chi_atom_1_one_hot = chi_angle_atom(1)
chi_atom_2_one_hot = chi_angle_atom(2)
# An array like chi_angles_atoms but using indices rather than names.
chi_angles_atom_indices = [chi_angles_atoms[restype_1to3[r]] for r in restypes]
chi_angles_atom_indices = tree.map(
lambda atom_name: atom_order[atom_name], chi_angles_atom_indices
)
chi_angles_atom_indices = np.array([
chi_atoms + ([[0, 0, 0, 0]] * (4 - len(chi_atoms)))
for chi_atoms in chi_angles_atom_indices
])
# Mapping from (res_name, atom_name) pairs to the atom's chi group index
# and atom index within that group.
chi_groups_for_atom = collections.defaultdict(list)
for res_name, chi_angle_atoms_for_res in chi_angles_atoms.items():
  for chi_group_i, chi_group in enumerate(chi_angle_atoms_for_res):
    for atom_i, atom in enumerate(chi_group):
      chi_groups_for_atom[(res_name, atom)].append((chi_group_i, atom_i))
chi_groups_for_atom = dict(chi_groups_for_atom)
def _make_rigid_transformation_4x4(ex, ey, translation):
  """Create a rigid 4x4 transformation matrix from two axes and a translation."""
  # Normalize ex.
  ex_normalized = ex / np.linalg.norm(ex)

  # Make ey perpendicular to ex.
  ey_normalized = ey - np.dot(ey, ex_normalized) * ex_normalized
  ey_normalized /= np.linalg.norm(ey_normalized)

  # Compute ez as the cross product of the two normalized axes.
  eznorm = np.cross(ex_normalized, ey_normalized)
  m = np.stack([ex_normalized, ey_normalized, eznorm, translation]).transpose()
  m = np.concatenate([m, [[0.0, 0.0, 0.0, 1.0]]], axis=0)
  return m
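# Illustrative sketch (not part of the original module): a standalone copy of
# the Gram-Schmidt construction above, used to check that the result is a
# proper rigid transform. The name `make_frame` and the input vectors are
# assumptions chosen for this example.

```python
import numpy as np


def make_frame(ex, ey, translation):
  """Hypothetical helper: builds a 4x4 rigid transform from two axes."""
  ex = ex / np.linalg.norm(ex)
  ey = ey - np.dot(ey, ex) * ex  # Remove the component of ey along ex.
  ey = ey / np.linalg.norm(ey)
  ez = np.cross(ex, ey)  # Third axis of a right-handed basis.
  # Columns are the three basis vectors followed by the translation.
  m = np.stack([ex, ey, ez, translation]).transpose()
  return np.concatenate([m, [[0.0, 0.0, 0.0, 1.0]]], axis=0)


frame = make_frame(
    ex=np.array([2.0, 0.0, 0.0]),
    ey=np.array([1.0, 3.0, 0.0]),
    translation=np.array([1.0, 2.0, 3.0]),
)
rotation = frame[:3, :3]
# The local origin, in homogeneous coordinates, maps to the translation.
origin_image = frame @ np.array([0.0, 0.0, 0.0, 1.0])
```

# The rotation block should be orthonormal with determinant +1, which is what
# makes the transform rigid (no scaling or reflection).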
# create an array with (restype, atomtype) --> rigid_group_idx
# and an array with (restype, atomtype, coord) for the atom positions
# and compute affine transformation matrices (4,4) from one rigid group to the
# previous group
restype_atom37_to_rigid_group = np.zeros([21, 37], dtype=int)
restype_atom37_mask = np.zeros([21, 37], dtype=np.float32)
restype_atom37_rigid_group_positions = np.zeros([21, 37, 3], dtype=np.float32)
restype_atom14_to_rigid_group = np.zeros([21, 14], dtype=int)
restype_atom14_mask = np.zeros([21, 14], dtype=np.float32)
restype_atom14_rigid_group_positions = np.zeros([21, 14, 3], dtype=np.float32)
restype_rigid_group_default_frame = np.zeros([21, 8, 4, 4], dtype=np.float32)
def _make_rigid_group_constants():
  """Fill the arrays above."""
  for restype, restype_letter in enumerate(restypes):
    resname = restype_1to3[restype_letter]
    for atomname, group_idx, atom_position in rigid_group_atom_positions[
        resname
    ]:
      atomtype = atom_order[atomname]
      restype_atom37_to_rigid_group[restype, atomtype] = group_idx
      restype_atom37_mask[restype, atomtype] = 1
      restype_atom37_rigid_group_positions[restype, atomtype, :] = atom_position

      atom14idx = restype_name_to_atom14_names[resname].index(atomname)
      restype_atom14_to_rigid_group[restype, atom14idx] = group_idx
      restype_atom14_mask[restype, atom14idx] = 1
      restype_atom14_rigid_group_positions[restype, atom14idx, :] = (
          atom_position
      )

  for restype, restype_letter in enumerate(restypes):
    resname = restype_1to3[restype_letter]
    atom_positions = {
        name: np.array(pos)
        for name, _, pos in rigid_group_atom_positions[resname]
    }

    # backbone to backbone is the identity transform
    restype_rigid_group_default_frame[restype, 0, :, :] = np.eye(4)

    # pre-omega-frame to backbone (currently dummy identity matrix)
    restype_rigid_group_default_frame[restype, 1, :, :] = np.eye(4)

    # phi-frame to backbone
    mat = _make_rigid_transformation_4x4(
        ex=atom_positions['N'] - atom_positions['CA'],
        ey=np.array([1.0, 0.0, 0.0]),
        translation=atom_positions['N'],
    )
    restype_rigid_group_default_frame[restype, 2, :, :] = mat

    # psi-frame to backbone
    mat = _make_rigid_transformation_4x4(
        ex=atom_positions['C'] - atom_positions['CA'],
        ey=atom_positions['CA'] - atom_positions['N'],
        translation=atom_positions['C'],
    )
    restype_rigid_group_default_frame[restype, 3, :, :] = mat

    # chi1-frame to backbone
    if chi_angles_mask[restype][0]:
      base_atom_names = chi_angles_atoms[resname][0]
      base_atom_positions = [atom_positions[name] for name in base_atom_names]
      mat = _make_rigid_transformation_4x4(
          ex=base_atom_positions[2] - base_atom_positions[1],
          ey=base_atom_positions[0] - base_atom_positions[1],
          translation=base_atom_positions[2],
      )
      restype_rigid_group_default_frame[restype, 4, :, :] = mat

    # chi2-frame to chi1-frame
    # chi3-frame to chi2-frame
    # chi4-frame to chi3-frame
    # luckily all rotation axes for the next frame start at (0,0,0) of the
    # previous frame
    for chi_idx in range(1, 4):
      if chi_angles_mask[restype][chi_idx]:
        axis_end_atom_name = chi_angles_atoms[resname][chi_idx][2]
        axis_end_atom_position = atom_positions[axis_end_atom_name]
        mat = _make_rigid_transformation_4x4(
            ex=axis_end_atom_position,
            ey=np.array([-1.0, 0.0, 0.0]),
            translation=axis_end_atom_position,
        )
        restype_rigid_group_default_frame[restype, 4 + chi_idx, :, :] = mat
_make_rigid_group_constants()
def make_atom14_dists_bounds(
    overlap_tolerance=1.5, bond_length_tolerance_factor=15
):
  """Compute upper and lower distance bounds used to assess violations."""
  restype_atom14_bond_lower_bound = np.zeros([21, 14, 14], np.float32)
  restype_atom14_bond_upper_bound = np.zeros([21, 14, 14], np.float32)
  restype_atom14_bond_stddev = np.zeros([21, 14, 14], np.float32)
  residue_bonds, residue_virtual_bonds, _ = load_stereo_chemical_props()
  for restype, restype_letter in enumerate(restypes):
    resname = restype_1to3[restype_letter]
    atom_list = restype_name_to_atom14_names[resname]

    # Create lower and upper bounds for clashes between non-bonded atoms.
    for atom1_idx, atom1_name in enumerate(atom_list):
      if not atom1_name:
        continue
      atom1_radius = van_der_waals_radius[atom1_name[0]]
      for atom2_idx, atom2_name in enumerate(atom_list):
        if (not atom2_name) or atom1_idx == atom2_idx:
          continue
        atom2_radius = van_der_waals_radius[atom2_name[0]]
        lower = atom1_radius + atom2_radius - overlap_tolerance
        upper = 1e10
        restype_atom14_bond_lower_bound[restype, atom1_idx, atom2_idx] = lower
        restype_atom14_bond_lower_bound[restype, atom2_idx, atom1_idx] = lower
        restype_atom14_bond_upper_bound[restype, atom1_idx, atom2_idx] = upper
        restype_atom14_bond_upper_bound[restype, atom2_idx, atom1_idx] = upper

    # Overwrite lower and upper bounds for bonds and for virtual bonds (the
    # latter encode bond angles as distances).
    for b in residue_bonds[resname] + residue_virtual_bonds[resname]:
      atom1_idx = atom_list.index(b.atom1_name)
      atom2_idx = atom_list.index(b.atom2_name)
      lower = b.length - bond_length_tolerance_factor * b.stddev
      upper = b.length + bond_length_tolerance_factor * b.stddev
      restype_atom14_bond_lower_bound[restype, atom1_idx, atom2_idx] = lower
      restype_atom14_bond_lower_bound[restype, atom2_idx, atom1_idx] = lower
      restype_atom14_bond_upper_bound[restype, atom1_idx, atom2_idx] = upper
      restype_atom14_bond_upper_bound[restype, atom2_idx, atom1_idx] = upper
      restype_atom14_bond_stddev[restype, atom1_idx, atom2_idx] = b.stddev
      restype_atom14_bond_stddev[restype, atom2_idx, atom1_idx] = b.stddev
  return {
      'lower_bound': restype_atom14_bond_lower_bound,  # shape (21,14,14)
      'upper_bound': restype_atom14_bond_upper_bound,  # shape (21,14,14)
      'stddev': restype_atom14_bond_stddev,  # shape (21,14,14)
  }
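# Illustrative sketch (not part of the original module) of the two kinds of
# bounds computed above, worked out for a single atom pair. All numbers below
# are assumptions for this example; the module reads the real ones from
# van_der_waals_radius and load_stereo_chemical_props().

```python
# Example radii and bond statistics, in Angstroms (illustrative values).
vdw_radius = {'C': 1.7, 'N': 1.55}
overlap_tolerance = 1.5
bond_length_tolerance_factor = 15

# Non-bonded pair: only a lower (clash) bound applies; the upper bound is
# set so high that it is never violated.
clash_lower = vdw_radius['N'] + vdw_radius['C'] - overlap_tolerance
clash_upper = 1e10

# Bonded pair: a symmetric window of `factor` standard deviations around the
# ideal bond length.
bond_length, bond_stddev = 1.459, 0.020
bond_lower = bond_length - bond_length_tolerance_factor * bond_stddev
bond_upper = bond_length + bond_length_tolerance_factor * bond_stddev
```

# With these example values the bonded window is 0.3 Angstroms wide on each
# side of the ideal length, which shows how permissive a factor of 15 is.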
# pyformat: disable
CCD_NAME_TO_ONE_LETTER: Mapping[str, str] = {
'00C': 'C', '01W': 'X', '02K': 'A', '03Y': 'C', '07O': 'C', '08P': 'C',
'0A0': 'D', '0A1': 'Y', '0A2': 'K', '0A8': 'C', '0AA': 'V', '0AB': 'V',
'0AC': 'G', '0AD': 'G', '0AF': 'W', '0AG': 'L', '0AH': 'S', '0AK': 'D',
'0AM': 'A', '0AP': 'C', '0AU': 'U', '0AV': 'A', '0AZ': 'P', '0BN': 'F',
'0C': 'C', '0CS': 'A', '0DC': 'C', '0DG': 'G', '0DT': 'T', '0FL': 'A',
'0G': 'G', '0NC': 'A', '0SP': 'A', '0U': 'U', '10C': 'C', '125': 'U',
'126': 'U', '127': 'U', '128': 'N', '12A': 'A', '143': 'C', '193': 'X',
'1AP': 'A', '1MA': 'A', '1MG': 'G', '1PA': 'F', '1PI': 'A', '1PR': 'N',
'1SC': 'C', '1TQ': 'W', '1TY': 'Y', '1X6': 'S', '200': 'F', '23F': 'F',
'23S': 'X', '26B': 'T', '2AD': 'X', '2AG': 'A', '2AO': 'X', '2AR': 'A',
'2AS': 'X', '2AT': 'T', '2AU': 'U', '2BD': 'I', '2BT': 'T', '2BU': 'A',
'2CO': 'C', '2DA': 'A', '2DF': 'N', '2DM': 'N', '2DO': 'X', '2DT': 'T',
'2EG': 'G', '2FE': 'N', '2FI': 'N', '2FM': 'M', '2GT': 'T', '2HF': 'H',
'2LU': 'L', '2MA': 'A', '2MG': 'G', '2ML': 'L', '2MR': 'R', '2MT': 'P',
'2MU': 'U', '2NT': 'T', '2OM': 'U', '2OT': 'T', '2PI': 'X', '2PR': 'G',
'2SA': 'N', '2SI': 'X', '2ST': 'T', '2TL': 'T', '2TY': 'Y', '2VA': 'V',
'2XA': 'C', '32S': 'X', '32T': 'X', '3AH': 'H', '3AR': 'X', '3CF': 'F',
'3DA': 'A', '3DR': 'N', '3GA': 'A', '3MD': 'D', '3ME': 'U', '3NF': 'Y',
'3QN': 'K', '3TY': 'X', '3XH': 'G', '4AC': 'N', '4BF': 'Y', '4CF': 'F',
'4CY': 'M', '4DP': 'W', '4FB': 'P', '4FW': 'W', '4HT': 'W', '4IN': 'W',
'4MF': 'N', '4MM': 'X', '4OC': 'C', '4PC': 'C', '4PD': 'C', '4PE': 'C',
'4PH': 'F', '4SC': 'C', '4SU': 'U', '4TA': 'N', '4U7': 'A', '56A': 'H',
'5AA': 'A', '5AB': 'A', '5AT': 'T', '5BU': 'U', '5CG': 'G', '5CM': 'C',
'5CS': 'C', '5FA': 'A', '5FC': 'C', '5FU': 'U', '5HP': 'E', '5HT': 'T',
'5HU': 'U', '5IC': 'C', '5IT': 'T', '5IU': 'U', '5MC': 'C', '5MD': 'N',
'5MU': 'U', '5NC': 'C', '5PC': 'C', '5PY': 'T', '5SE': 'U', '64T': 'T',
'6CL': 'K', '6CT': 'T', '6CW': 'W', '6HA': 'A', '6HC': 'C', '6HG': 'G',
'6HN': 'K', '6HT': 'T', '6IA': 'A', '6MA': 'A', '6MC': 'A', '6MI': 'N',
'6MT': 'A', '6MZ': 'N', '6OG': 'G', '70U': 'U', '7DA': 'A', '7GU': 'G',
'7JA': 'I', '7MG': 'G', '8AN': 'A', '8FG': 'G', '8MG': 'G', '8OG': 'G',
'9NE': 'E', '9NF': 'F', '9NR': 'R', '9NV': 'V', 'A': 'A', 'A1P': 'N',
'A23': 'A', 'A2L': 'A', 'A2M': 'A', 'A34': 'A', 'A35': 'A', 'A38': 'A',
'A39': 'A', 'A3A': 'A', 'A3P': 'A', 'A40': 'A', 'A43': 'A', 'A44': 'A',
'A47': 'A', 'A5L': 'A', 'A5M': 'C', 'A5N': 'N', 'A5O': 'A', 'A66': 'X',
'AA3': 'A', 'AA4': 'A', 'AAR': 'R', 'AB7': 'X', 'ABA': 'A', 'ABR': 'A',
'ABS': 'A', 'ABT': 'N', 'ACB': 'D', 'ACL': 'R', 'AD2': 'A', 'ADD': 'X',
'ADX': 'N', 'AEA': 'X', 'AEI': 'D', 'AET': 'A', 'AFA': 'N', 'AFF': 'N',
'AFG': 'G', 'AGM': 'R', 'AGT': 'C', 'AHB': 'N', 'AHH': 'X', 'AHO': 'A',
'AHP': 'A', 'AHS': 'X', 'AHT': 'X', 'AIB': 'A', 'AKL': 'D', 'AKZ': 'D',
'ALA': 'A', 'ALC': 'A', 'ALM': 'A', 'ALN': 'A', 'ALO': 'T', 'ALQ': 'X',
'ALS': 'A', 'ALT': 'A', 'ALV': 'A', 'ALY': 'K', 'AN8': 'A', 'AP7': 'A',
'APE': 'X', 'APH': 'A', 'API': 'K', 'APK': 'K', 'APM': 'X', 'APP': 'X',
'AR2': 'R', 'AR4': 'E', 'AR7': 'R', 'ARG': 'R', 'ARM': 'R', 'ARO': 'R',
'ARV': 'X', 'AS': 'A', 'AS2': 'D', 'AS9': 'X', 'ASA': 'D', 'ASB': 'D',
'ASI': 'D', 'ASK': 'D', 'ASL': 'D', 'ASM': 'X', 'ASN': 'N', 'ASP': 'D',
'ASQ': 'D', 'ASU': 'N', 'ASX': 'B', 'ATD': 'T', 'ATL': 'T', 'ATM': 'T',
'AVC': 'A', 'AVN': 'X', 'AYA': 'A', 'AZK': 'K', 'AZS': 'S', 'AZY': 'Y',
'B1F': 'F', 'B1P': 'N', 'B2A': 'A', 'B2F': 'F', 'B2I': 'I', 'B2V': 'V',
'B3A': 'A', 'B3D': 'D', 'B3E': 'E', 'B3K': 'K', 'B3L': 'X', 'B3M': 'X',
'B3Q': 'X', 'B3S': 'S', 'B3T': 'X', 'B3U': 'H', 'B3X': 'N', 'B3Y': 'Y',
'BB6': 'C', 'BB7': 'C', 'BB8': 'F', 'BB9': 'C', 'BBC': 'C', 'BCS': 'C',
'BE2': 'X', 'BFD': 'D', 'BG1': 'S', 'BGM': 'G', 'BH2': 'D', 'BHD': 'D',
'BIF': 'F', 'BIL': 'X', 'BIU': 'I', 'BJH': 'X', 'BLE': 'L', 'BLY': 'K',
'BMP': 'N', 'BMT': 'T', 'BNN': 'F', 'BNO': 'X', 'BOE': 'T', 'BOR': 'R',
'BPE': 'C', 'BRU': 'U', 'BSE': 'S', 'BT5': 'N', 'BTA': 'L', 'BTC': 'C',
'BTR': 'W', 'BUC': 'C', 'BUG': 'V', 'BVP': 'U', 'BZG': 'N', 'C': 'C',
'C1X': 'K', 'C25': 'C', 'C2L': 'C', 'C2S': 'C', 'C31': 'C', 'C32': 'C',
'C34': 'C', 'C36': 'C', 'C37': 'C', 'C38': 'C', 'C3Y': 'C', 'C42': 'C',
'C43': 'C', 'C45': 'C', 'C46': 'C', 'C49': 'C', 'C4R': 'C', 'C4S': 'C',
'C5C': 'C', 'C66': 'X', 'C6C': 'C', 'CAF': 'C', 'CAL': 'X', 'CAR': 'C',
'CAS': 'C', 'CAV': 'X', 'CAY': 'C', 'CB2': 'C', 'CBR': 'C', 'CBV': 'C',
'CCC': 'C', 'CCL': 'K', 'CCS': 'C', 'CDE': 'X', 'CDV': 'X', 'CDW': 'C',
'CEA': 'C', 'CFL': 'C', 'CG1': 'G', 'CGA': 'E', 'CGU': 'E', 'CH': 'C',
'CHF': 'X', 'CHG': 'X', 'CHP': 'G', 'CHS': 'X', 'CIR': 'R', 'CLE': 'L',
'CLG': 'K', 'CLH': 'K', 'CM0': 'N', 'CME': 'C', 'CMH': 'C', 'CML': 'C',
'CMR': 'C', 'CMT': 'C', 'CNU': 'U', 'CP1': 'C', 'CPC': 'X', 'CPI': 'X',
'CR5': 'G', 'CS0': 'C', 'CS1': 'C', 'CS3': 'C', 'CS4': 'C', 'CS8': 'N',
'CSA': 'C', 'CSB': 'C', 'CSD': 'C', 'CSE': 'C', 'CSF': 'C', 'CSI': 'G',
'CSJ': 'C', 'CSL': 'C', 'CSO': 'C', 'CSP': 'C', 'CSR': 'C', 'CSS': 'C',
'CSU': 'C', 'CSW': 'C', 'CSX': 'C', 'CSZ': 'C', 'CTE': 'W', 'CTG': 'T',
'CTH': 'T', 'CUC': 'X', 'CWR': 'S', 'CXM': 'M', 'CY0': 'C', 'CY1': 'C',
'CY3': 'C', 'CY4': 'C', 'CYA': 'C', 'CYD': 'C', 'CYF': 'C', 'CYG': 'C',
'CYJ': 'X', 'CYM': 'C', 'CYQ': 'C', 'CYR': 'C', 'CYS': 'C', 'CZ2': 'C',
'CZZ': 'C', 'D11': 'T', 'D1P': 'N', 'D3': 'N', 'D33': 'N', 'D3P': 'G',
'D3T': 'T', 'D4M': 'T', 'D4P': 'X', 'DA': 'A', 'DA2': 'X', 'DAB': 'A',
'DAH': 'F', 'DAL': 'A', 'DAR': 'R', 'DAS': 'D', 'DBB': 'T', 'DBM': 'N',
'DBS': 'S', 'DBU': 'T', 'DBY': 'Y', 'DBZ': 'A', 'DC': 'C', 'DC2': 'C',
'DCG': 'G', 'DCI': 'X', 'DCL': 'X', 'DCT': 'C', 'DCY': 'C', 'DDE': 'H',
'DDG': 'G', 'DDN': 'U', 'DDX': 'N', 'DFC': 'C', 'DFG': 'G', 'DFI': 'X',
'DFO': 'X', 'DFT': 'N', 'DG': 'G', 'DGH': 'G', 'DGI': 'G', 'DGL': 'E',
'DGN': 'Q', 'DHA': 'S', 'DHI': 'H', 'DHL': 'X', 'DHN': 'V', 'DHP': 'X',
'DHU': 'U', 'DHV': 'V', 'DI': 'I', 'DIL': 'I', 'DIR': 'R', 'DIV': 'V',
'DLE': 'L', 'DLS': 'K', 'DLY': 'K', 'DM0': 'K', 'DMH': 'N', 'DMK': 'D',
'DMT': 'X', 'DN': 'N', 'DNE': 'L', 'DNG': 'L', 'DNL': 'K', 'DNM': 'L',
'DNP': 'A', 'DNR': 'C', 'DNS': 'K', 'DOA': 'X', 'DOC': 'C', 'DOH': 'D',
'DON': 'L', 'DPB': 'T', 'DPH': 'F', 'DPL': 'P', 'DPP': 'A', 'DPQ': 'Y',
'DPR': 'P', 'DPY': 'N', 'DRM': 'U', 'DRP': 'N', 'DRT': 'T', 'DRZ': 'N',
'DSE': 'S', 'DSG': 'N', 'DSN': 'S', 'DSP': 'D', 'DT': 'T', 'DTH': 'T',
'DTR': 'W', 'DTY': 'Y', 'DU': 'U', 'DVA': 'V', 'DXD': 'N', 'DXN': 'N',
'DYS': 'C', 'DZM': 'A', 'E': 'A', 'E1X': 'A', 'ECC': 'Q', 'EDA': 'A',
'EFC': 'C', 'EHP': 'F', 'EIT': 'T', 'ENP': 'N', 'ESB': 'Y', 'ESC': 'M',
'EXB': 'X', 'EXY': 'L', 'EY5': 'N', 'EYS': 'X', 'F2F': 'F', 'FA2': 'A',
'FA5': 'N', 'FAG': 'N', 'FAI': 'N', 'FB5': 'A', 'FB6': 'A', 'FCL': 'F',
'FFD': 'N', 'FGA': 'E', 'FGL': 'G', 'FGP': 'S', 'FHL': 'X', 'FHO': 'K',
'FHU': 'U', 'FLA': 'A', 'FLE': 'L', 'FLT': 'Y', 'FME': 'M', 'FMG': 'G',
'FMU': 'N', 'FOE': 'C', 'FOX': 'G', 'FP9': 'P', 'FPA': 'F', 'FRD': 'X',
'FT6': 'W', 'FTR': 'W', 'FTY': 'Y', 'FVA': 'V', 'FZN': 'K', 'G': 'G',
'G25': 'G', 'G2L': 'G', 'G2S': 'G', 'G31': 'G', 'G32': 'G', 'G33': 'G',
'G36': 'G', 'G38': 'G', 'G42': 'G', 'G46': 'G', 'G47': 'G', 'G48': 'G',
'G49': 'G', 'G4P': 'N', 'G7M': 'G', 'GAO': 'G', 'GAU': 'E', 'GCK': 'C',
'GCM': 'X', 'GDP': 'G', 'GDR': 'G', 'GFL': 'G', 'GGL': 'E', 'GH3': 'G',
'GHG': 'Q', 'GHP': 'G', 'GL3': 'G', 'GLH': 'Q', 'GLJ': 'E', 'GLK': 'E',
'GLM': 'X', 'GLN': 'Q', 'GLQ': 'E', 'GLU': 'E', 'GLX': 'Z', 'GLY': 'G',
'GLZ': 'G', 'GMA': 'E', 'GMS': 'G', 'GMU': 'U', 'GN7': 'G', 'GND': 'X',
'GNE': 'N', 'GOM': 'G', 'GPL': 'K', 'GS': 'G', 'GSC': 'G', 'GSR': 'G',
'GSS': 'G', 'GSU': 'E', 'GT9': 'C', 'GTP': 'G', 'GVL': 'X', 'H2U': 'U',
'H5M': 'P', 'HAC': 'A', 'HAR': 'R', 'HBN': 'H', 'HCS': 'X', 'HDP': 'U',
'HEU': 'U', 'HFA': 'X', 'HGL': 'X', 'HHI': 'H', 'HIA': 'H', 'HIC': 'H',
'HIP': 'H', 'HIQ': 'H', 'HIS': 'H', 'HL2': 'L', 'HLU': 'L', 'HMR': 'R',
'HOL': 'N', 'HPC': 'F', 'HPE': 'F', 'HPH': 'F', 'HPQ': 'F', 'HQA': 'A',
'HRG': 'R', 'HRP': 'W', 'HS8': 'H', 'HS9': 'H', 'HSE': 'S', 'HSL': 'S',
'HSO': 'H', 'HTI': 'C', 'HTN': 'N', 'HTR': 'W', 'HV5': 'A', 'HVA': 'V',
'HY3': 'P', 'HYP': 'P', 'HZP': 'P', 'I': 'I', 'I2M': 'I', 'I58': 'K',
'I5C': 'C', 'IAM': 'A', 'IAR': 'R', 'IAS': 'D', 'IC': 'C', 'IEL': 'K',
'IG': 'G', 'IGL': 'G', 'IGU': 'G', 'IIL': 'I', 'ILE': 'I', 'ILG': 'E',
'ILX': 'I', 'IMC': 'C', 'IML': 'I', 'IOY': 'F', 'IPG': 'G', 'IPN': 'N',
'IRN': 'N', 'IT1': 'K', 'IU': 'U', 'IYR': 'Y', 'IYT': 'T', 'IZO': 'M',
'JJJ': 'C', 'JJK': 'C', 'JJL': 'C', 'JW5': 'N', 'K1R': 'C', 'KAG': 'G',
'KCX': 'K', 'KGC': 'K', 'KNB': 'A', 'KOR': 'M', 'KPI': 'K', 'KST': 'K',
'KYQ': 'K', 'L2A': 'X', 'LA2': 'K', 'LAA': 'D', 'LAL': 'A', 'LBY': 'K',
'LC': 'C', 'LCA': 'A', 'LCC': 'N', 'LCG': 'G', 'LCH': 'N', 'LCK': 'K',
'LCX': 'K', 'LDH': 'K', 'LED': 'L', 'LEF': 'L', 'LEH': 'L', 'LEI': 'V',
'LEM': 'L', 'LEN': 'L', 'LET': 'X', 'LEU': 'L', 'LEX': 'L', 'LG': 'G',
'LGP': 'G', 'LHC': 'X', 'LHU': 'U', 'LKC': 'N', 'LLP': 'K', 'LLY': 'K',
'LME': 'E', 'LMF': 'K', 'LMQ': 'Q', 'LMS': 'N', 'LP6': 'K', 'LPD': 'P',
'LPG': 'G', 'LPL': 'X', 'LPS': 'S', 'LSO': 'X', 'LTA': 'X', 'LTR': 'W',
'LVG': 'G', 'LVN': 'V', 'LYF': 'K', 'LYK': 'K', 'LYM': 'K', 'LYN': 'K',
'LYR': 'K', 'LYS': 'K', 'LYX': 'K', 'LYZ': 'K', 'M0H': 'C', 'M1G': 'G',
'M2G': 'G', 'M2L': 'K', 'M2S': 'M', 'M30': 'G', 'M3L': 'K', 'M5M': 'C',
'MA': 'A', 'MA6': 'A', 'MA7': 'A', 'MAA': 'A', 'MAD': 'A', 'MAI': 'R',
'MBQ': 'Y', 'MBZ': 'N', 'MC1': 'S', 'MCG': 'X', 'MCL': 'K', 'MCS': 'C',
'MCY': 'C', 'MD3': 'C', 'MD6': 'G', 'MDH': 'X', 'MDR': 'N', 'MEA': 'F',
'MED': 'M', 'MEG': 'E', 'MEN': 'N', 'MEP': 'U', 'MEQ': 'Q', 'MET': 'M',
'MEU': 'G', 'MF3': 'X', 'MG1': 'G', 'MGG': 'R', 'MGN': 'Q', 'MGQ': 'A',
'MGV': 'G', 'MGY': 'G', 'MHL': 'L', 'MHO': 'M', 'MHS': 'H', 'MIA': 'A',
'MIS': 'S', 'MK8': 'L', 'ML3': 'K', 'MLE': 'L', 'MLL': 'L', 'MLY': 'K',
'MLZ': 'K', 'MME': 'M', 'MMO': 'R', 'MMT': 'T', 'MND': 'N', 'MNL': 'L',
'MNU': 'U', 'MNV': 'V', 'MOD': 'X', 'MP8': 'P', 'MPH': 'X', 'MPJ': 'X',
'MPQ': 'G', 'MRG': 'G', 'MSA': 'G', 'MSE': 'M', 'MSL': 'M', 'MSO': 'M',
'MSP': 'X', 'MT2': 'M', 'MTR': 'T', 'MTU': 'A', 'MTY': 'Y', 'MVA': 'V',
'N': 'N', 'N10': 'S', 'N2C': 'X', 'N5I': 'N', 'N5M': 'C', 'N6G': 'G',
'N7P': 'P', 'NA8': 'A', 'NAL': 'A', 'NAM': 'A', 'NB8': 'N', 'NBQ': 'Y',
'NC1': 'S', 'NCB': 'A', 'NCX': 'N', 'NCY': 'X', 'NDF': 'F', 'NDN': 'U',
'NEM': 'H', 'NEP': 'H', 'NF2': 'N', 'NFA': 'F', 'NHL': 'E', 'NIT': 'X',
'NIY': 'Y', 'NLE': 'L', 'NLN': 'L', 'NLO': 'L', 'NLP': 'L', 'NLQ': 'Q',
'NMC': 'G', 'NMM': 'R', 'NMS': 'T', 'NMT': 'T', 'NNH': 'R', 'NP3': 'N',
'NPH': 'C', 'NPI': 'A', 'NSK': 'X', 'NTY': 'Y', 'NVA': 'V', 'NYM': 'N',
'NYS': 'C', 'NZH': 'H', 'O12': 'X', 'O2C': 'N', 'O2G': 'G', 'OAD': 'N',
'OAS': 'S', 'OBF': 'X', 'OBS': 'X', 'OCS': 'C', 'OCY': 'C', 'ODP': 'N',
'OHI': 'H', 'OHS': 'D', 'OIC': 'X', 'OIP': 'I', 'OLE': 'X', 'OLT': 'T',
'OLZ': 'S', 'OMC': 'C', 'OMG': 'G', 'OMT': 'M', 'OMU': 'U', 'ONE': 'U',
'ONH': 'A', 'ONL': 'X', 'OPR': 'R', 'ORN': 'A', 'ORQ': 'R', 'OSE': 'S',
'OTB': 'X', 'OTH': 'T', 'OTY': 'Y', 'OXX': 'D', 'P': 'G', 'P1L': 'C',
'P1P': 'N', 'P2T': 'T', 'P2U': 'U', 'P2Y': 'P', 'P5P': 'A', 'PAQ': 'Y',
'PAS': 'D', 'PAT': 'W', 'PAU': 'A', 'PBB': 'C', 'PBF': 'F', 'PBT': 'N',
'PCA': 'E', 'PCC': 'P', 'PCE': 'X', 'PCS': 'F', 'PDL': 'X', 'PDU': 'U',
'PEC': 'C', 'PF5': 'F', 'PFF': 'F', 'PFX': 'X', 'PG1': 'S', 'PG7': 'G',
'PG9': 'G', 'PGL': 'X', 'PGN': 'G', 'PGP': 'G', 'PGY': 'G', 'PHA': 'F',
'PHD': 'D', 'PHE': 'F', 'PHI': 'F', 'PHL': 'F', 'PHM': 'F', 'PIV': 'X',
'PLE': 'L', 'PM3': 'F', 'PMT': 'C', 'POM': 'P', 'PPN': 'F', 'PPU': 'A',
'PPW': 'G', 'PQ1': 'N', 'PR3': 'C', 'PR5': 'A', 'PR9': 'P', 'PRN': 'A',
'PRO': 'P', 'PRS': 'P', 'PSA': 'F', 'PSH': 'H', 'PST': 'T', 'PSU': 'U',
'PSW': 'C', 'PTA': 'X', 'PTH': 'Y', 'PTM': 'Y', 'PTR': 'Y', 'PU': 'A',
'PUY': 'N', 'PVH': 'H', 'PVL': 'X', 'PYA': 'A', 'PYO': 'U', 'PYX': 'C',
'PYY': 'N', 'QMM': 'Q', 'QPA': 'C', 'QPH': 'F', 'QUO': 'G', 'R': 'A',
'R1A': 'C', 'R4K': 'W', 'RE0': 'W', 'RE3': 'W', 'RIA': 'A', 'RMP': 'A',
'RON': 'X', 'RT': 'T', 'RTP': 'N', 'S1H': 'S', 'S2C': 'C', 'S2D': 'A',
'S2M': 'T', 'S2P': 'A', 'S4A': 'A', 'S4C': 'C', 'S4G': 'G', 'S4U': 'U',
'S6G': 'G', 'SAC': 'S', 'SAH': 'C', 'SAR': 'G', 'SBL': 'S', 'SC': 'C',
'SCH': 'C', 'SCS': 'C', 'SCY': 'C', 'SD2': 'X', 'SDG': 'G', 'SDP': 'S',
'SEB': 'S', 'SEC': 'A', 'SEG': 'A', 'SEL': 'S', 'SEM': 'S', 'SEN': 'S',
'SEP': 'S', 'SER': 'S', 'SET': 'S', 'SGB': 'S', 'SHC': 'C', 'SHP': 'G',
'SHR': 'K', 'SIB': 'C', 'SLA': 'P', 'SLR': 'P', 'SLZ': 'K', 'SMC': 'C',
'SME': 'M', 'SMF': 'F', 'SMP': 'A', 'SMT': 'T', 'SNC': 'C', 'SNN': 'N',
'SOC': 'C', 'SOS': 'N', 'SOY': 'S', 'SPT': 'T', 'SRA': 'A', 'SSU': 'U',
'STY': 'Y', 'SUB': 'X', 'SUN': 'S', 'SUR': 'U', 'SVA': 'S', 'SVV': 'S',
'SVW': 'S', 'SVX': 'S', 'SVY': 'S', 'SVZ': 'X', 'SYS': 'C', 'T': 'T',
'T11': 'F', 'T23': 'T', 'T2S': 'T', 'T2T': 'N', 'T31': 'U', 'T32': 'T',
'T36': 'T', 'T37': 'T', 'T38': 'T', 'T39': 'T', 'T3P': 'T', 'T41': 'T',
'T48': 'T', 'T49': 'T', 'T4S': 'T', 'T5O': 'U', 'T5S': 'T', 'T66': 'X',
'T6A': 'A', 'TA3': 'T', 'TA4': 'X', 'TAF': 'T', 'TAL': 'N', 'TAV': 'D',
'TBG': 'V', 'TBM': 'T', 'TC1': 'C', 'TCP': 'T', 'TCQ': 'Y', 'TCR': 'W',
'TCY': 'A', 'TDD': 'L', 'TDY': 'T', 'TFE': 'T', 'TFO': 'A', 'TFQ': 'F',
'TFT': 'T', 'TGP': 'G', 'TH6': 'T', 'THC': 'T', 'THO': 'X', 'THR': 'T',
'THX': 'N', 'THZ': 'R', 'TIH': 'A', 'TLB': 'N', 'TLC': 'T', 'TLN': 'U',
'TMB': 'T', 'TMD': 'T', 'TNB': 'C', 'TNR': 'S', 'TOX': 'W', 'TP1': 'T',
'TPC': 'C', 'TPG': 'G', 'TPH': 'X', 'TPL': 'W', 'TPO': 'T', 'TPQ': 'Y',
'TQI': 'W', 'TQQ': 'W', 'TRF': 'W', 'TRG': 'K', 'TRN': 'W', 'TRO': 'W',
'TRP': 'W', 'TRQ': 'W', 'TRW': 'W', 'TRX': 'W', 'TS': 'N', 'TST': 'X',
'TT': 'N', 'TTD': 'T', 'TTI': 'U', 'TTM': 'T', 'TTQ': 'W', 'TTS': 'Y',
'TY1': 'Y', 'TY2': 'Y', 'TY3': 'Y', 'TY5': 'Y', 'TYB': 'Y', 'TYI': 'Y',
'TYJ': 'Y', 'TYN': 'Y', 'TYO': 'Y', 'TYQ': 'Y', 'TYR': 'Y', 'TYS': 'Y',
'TYT': 'Y', 'TYU': 'N', 'TYW': 'Y', 'TYX': 'X', 'TYY': 'Y', 'TZB': 'X',
'TZO': 'X', 'U': 'U', 'U25': 'U', 'U2L': 'U', 'U2N': 'U', 'U2P': 'U',
'U31': 'U', 'U33': 'U', 'U34': 'U', 'U36': 'U', 'U37': 'U', 'U8U': 'U',
'UAR': 'U', 'UCL': 'U', 'UD5': 'U', 'UDP': 'N', 'UFP': 'N', 'UFR': 'U',
'UFT': 'U', 'UMA': 'A', 'UMP': 'U', 'UMS': 'U', 'UN1': 'X', 'UN2': 'X',
'UNK': 'X', 'UR3': 'U', 'URD': 'U', 'US1': 'U', 'US2': 'U', 'US3': 'T',
'US5': 'U', 'USM': 'U', 'VAD': 'V', 'VAF': 'V', 'VAL': 'V', 'VB1': 'K',
'VDL': 'X', 'VLL': 'X', 'VLM': 'X', 'VMS': 'X', 'VOL': 'X', 'X': 'G',
'X2W': 'E', 'X4A': 'N', 'XAD': 'A', 'XAE': 'N', 'XAL': 'A', 'XAR': 'N',
'XCL': 'C', 'XCN': 'C', 'XCP': 'X', 'XCR': 'C', 'XCS': 'N', 'XCT': 'C',
'XCY': 'C', 'XGA': 'N', 'XGL': 'G', 'XGR': 'G', 'XGU': 'G', 'XPR': 'P',
'XSN': 'N', 'XTH': 'T', 'XTL': 'T', 'XTR': 'T', 'XTS': 'G', 'XTY': 'N',
'XUA': 'A', 'XUG': 'G', 'XX1': 'K', 'Y': 'A', 'YCM': 'C', 'YG': 'G',
'YOF': 'Y', 'YRR': 'N', 'YYG': 'G', 'Z': 'C', 'Z01': 'A', 'ZAD': 'A',
'ZAL': 'A', 'ZBC': 'C', 'ZBU': 'U', 'ZCL': 'F', 'ZCY': 'C', 'ZDU': 'U',
'ZFB': 'X', 'ZGU': 'G', 'ZHP': 'N', 'ZTH': 'T', 'ZU0': 'T', 'ZZJ': 'A',
}
# pyformat: enable
================================================
FILE: alphafold/common/residue_constants_test.py
================================================
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Test that residue_constants generates correct values."""
from absl.testing import absltest
from absl.testing import parameterized
from alphafold.common import residue_constants
import numpy as np
class ResidueConstantsTest(parameterized.TestCase):

  @parameterized.parameters(
      ('ALA', 0),
      ('CYS', 1),
      ('HIS', 2),
      ('MET', 3),
      ('LYS', 4),
      ('ARG', 4),
  )
  def testChiAnglesAtoms(self, residue_name, chi_num):
    chi_angles_atoms = residue_constants.chi_angles_atoms[residue_name]
    self.assertLen(chi_angles_atoms, chi_num)
    for chi_angle_atoms in chi_angles_atoms:
      self.assertLen(chi_angle_atoms, 4)

  def testChiGroupsForAtom(self):
    for k, chi_groups in residue_constants.chi_groups_for_atom.items():
      res_name, atom_name = k
      for chi_group_i, atom_i in chi_groups:
        self.assertEqual(
            atom_name,
            residue_constants.chi_angles_atoms[res_name][chi_group_i][atom_i],
        )
  @parameterized.parameters(
      ('ALA', 5),
      ('ARG', 11),
      ('ASN', 8),
      ('ASP', 8),
      ('CYS', 6),
      ('GLN', 9),
      ('GLU', 9),
      ('GLY', 4),
      ('HIS', 10),
      ('ILE', 8),
      ('LEU', 8),
      ('LYS', 9),
      ('MET', 8),
      ('PHE', 11),
      ('PRO', 7),
      ('SER', 6),
      ('THR', 7),
      ('TRP', 14),
      ('TYR', 12),
      ('VAL', 7),
  )
  def testResidueAtoms(self, residue_name, num_residue_atoms):
    # The first parameter is a three-letter residue name, not an atom name.
    residue_atoms = residue_constants.residue_atoms[residue_name]
    self.assertLen(residue_atoms, num_residue_atoms)
  def testStandardAtomMask(self):
    with self.subTest('Check shape'):
      self.assertEqual(
          residue_constants.STANDARD_ATOM_MASK.shape,
          (
              21,
              37,
          ),
      )

    with self.subTest('Check values'):
      str_to_row = lambda s: [c == '1' for c in s]  # More clear/concise.
      np.testing.assert_array_equal(
          residue_constants.STANDARD_ATOM_MASK,
          np.array([
              # NB This was defined by c+p but looks sane.
              # Rows follow residue_constants.restypes; each string has one
              # character per entry of atom_types (37 in total).
              str_to_row('11111                                '),  # ALA
              str_to_row('111111     1           1     11 1    '),  # ARG
              str_to_row('111111         11                    '),  # ASN
              str_to_row('111111          11                   '),  # ASP
              str_to_row('11111     1                          '),  # CYS
              str_to_row('111111     1             11          '),  # GLN
              str_to_row('111111     1              11         '),  # GLU
              str_to_row('111 1                                '),  # GLY
              str_to_row('111111       11     1    1           '),  # HIS
              str_to_row('11111 11    1                        '),  # ILE
              str_to_row('111111      11                       '),  # LEU
              str_to_row('111111     1       1               1 '),  # LYS
              str_to_row('111111            11                 '),  # MET
              str_to_row('111111      11      11          1    '),  # PHE
              str_to_row('111111     1                         '),  # PRO
              str_to_row('11111   1                            '),  # SER
              str_to_row('11111  1 1                           '),  # THR
              str_to_row('111111      11       11 1   1    11  '),  # TRP
              str_to_row('111111      11      11         11    '),  # TYR
              str_to_row('11111 11                             '),  # VAL
              str_to_row('                                     '),  # UNK
          ]),
      )

    with self.subTest('Check row totals'):
      # Check each row has the right number of atoms.
      for row, restype in enumerate(residue_constants.restypes):  # A, R, ...
        long_restype = residue_constants.restype_1to3[restype]  # ALA, ARG, ...
        atoms_names = residue_constants.residue_atoms[
            long_restype
        ]  # ['C', 'CA', 'CB', 'N', 'O'], ...
        self.assertLen(
            atoms_names,
            residue_constants.STANDARD_ATOM_MASK[row, :].sum(),
            long_restype,
        )
  def testAtomTypes(self):
    self.assertEqual(residue_constants.atom_type_num, 37)

    self.assertEqual(residue_constants.atom_types[0], 'N')
    self.assertEqual(residue_constants.atom_types[1], 'CA')
    self.assertEqual(residue_constants.atom_types[2], 'C')
    self.assertEqual(residue_constants.atom_types[3], 'CB')
    self.assertEqual(residue_constants.atom_types[4], 'O')

    self.assertEqual(residue_constants.atom_order['N'], 0)
    self.assertEqual(residue_constants.atom_order['CA'], 1)
    self.assertEqual(residue_constants.atom_order['C'], 2)
    self.assertEqual(residue_constants.atom_order['CB'], 3)
    self.assertEqual(residue_constants.atom_order['O'], 4)
    self.assertEqual(residue_constants.atom_type_num, 37)
  def testRestypes(self):
    three_letter_restypes = [
        residue_constants.restype_1to3[r] for r in residue_constants.restypes
    ]
    for restype, exp_restype in zip(
        three_letter_restypes, sorted(residue_constants.restype_1to3.values())
    ):
      self.assertEqual(restype, exp_restype)
    self.assertEqual(residue_constants.restype_num, 20)
  def testSequenceToOneHotHHBlits(self):
    one_hot = residue_constants.sequence_to_onehot(
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ-', residue_constants.HHBLITS_AA_TO_ID
    )
    exp_one_hot = np.array([
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
])
np.testing.assert_array_equal(one_hot, exp_one_hot)

  def testSequenceToOneHotStandard(self):
    one_hot = residue_constants.sequence_to_onehot(
        'ARNDCQEGHILKMFPSTWYV', residue_constants.restype_order
    )
    np.testing.assert_array_equal(one_hot, np.eye(20))

  def testSequenceToOneHotUnknownMapping(self):
    seq = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    expected_out = np.zeros([26, 21])
    # pyformat: disable
    for row, position in enumerate([
        0, 20, 4, 3, 6, 13, 7, 8, 9, 20, 11, 10, 12, 2, 20, 14, 5, 1, 15, 16,
        20, 19, 17, 20, 18, 20
    ]):
      expected_out[row, position] = 1
    # pyformat: enable
    aa_types = residue_constants.sequence_to_onehot(
        sequence=seq,
        mapping=residue_constants.restype_order_with_x,
        map_unknown_to_x=True,
    )
    self.assertTrue((aa_types == expected_out).all())

  @parameterized.named_parameters(
      ('lowercase', 'aaa'),  # Insertions in A3M.
      ('gaps', '---'),  # Gaps in A3M.
      ('dots', '...'),  # Gaps in A3M.
      ('metadata', '>TEST'),  # FASTA metadata line.
  )
  def testSequenceToOneHotUnknownMappingError(self, seq):
    with self.assertRaises(ValueError):
      residue_constants.sequence_to_onehot(
          sequence=seq,
          mapping=residue_constants.restype_order_with_x,
          map_unknown_to_x=True,
      )


if __name__ == '__main__':
  absltest.main()
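
The one-hot tests in this file check three behaviours: the 20 standard residues map to an identity matrix, unrecognised uppercase letters fall back to the `X` column when `map_unknown_to_x=True`, and lowercase letters, gaps, dots and FASTA header characters raise `ValueError`. The block below is a minimal sketch of an encoder with those behaviours, not the code in `residue_constants.py`. The names `onehot_sketch`, `RESTYPES`, `RESTYPE_ORDER` and `RESTYPE_ORDER_WITH_X` are hypothetical stand-ins chosen for illustration, and the residue ordering is taken from the string used in `testSequenceToOneHotStandard`.

```python
import numpy as np

# Hypothetical stand-ins for illustration; not the library's own tables.
RESTYPES = 'ARNDCQEGHILKMFPSTWYV'
RESTYPE_ORDER = {aa: i for i, aa in enumerate(RESTYPES)}
RESTYPE_ORDER_WITH_X = dict(RESTYPE_ORDER, X=len(RESTYPES))


def onehot_sketch(sequence, mapping, map_unknown_to_x=False):
  """Returns a [len(sequence), num_classes] one-hot integer array."""
  num_classes = max(mapping.values()) + 1
  one_hot = np.zeros((len(sequence), num_classes), dtype=np.int32)
  for row, aa in enumerate(sequence):
    if map_unknown_to_x:
      # Only uppercase letters are accepted here; gaps, dots, lowercase
      # insertions and FASTA header characters are rejected.
      if not (aa.isalpha() and aa.isupper()):
        raise ValueError(f'Invalid character in the sequence: {aa}')
      col = mapping.get(aa, mapping['X'])
    else:
      col = mapping[aa]  # Raises KeyError for an unmapped character.
    one_hot[row, col] = 1
  return one_hot


# 'A' is a standard residue; 'B' and 'X' both land in the X column (20).
encoded = onehot_sketch('ABX', RESTYPE_ORDER_WITH_X, map_unknown_to_x=True)
print(encoded.argmax(axis=1).tolist())  # [0, 20, 20]
```

In this sketch the character check runs before the lookup, so an A3M insertion such as `a` fails loudly rather than being counted as an unknown residue.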
================================================
FILE: alphafold/common/testdata/2rbg.pdb
================================================
HEADER STRUCTURAL GENOMICS, UNKNOWN FUNCTION 19-SEP-07 2RBG
TITLE CRYSTAL STRUCTURE OF HYPOTHETICAL PROTEIN(ST0493) FROM
TITLE 2 SULFOLOBUS TOKODAII
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: PUTATIVE UNCHARACTERIZED PROTEIN ST0493;
COMPND 3 CHAIN: A, B;
COMPND 4 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: SULFOLOBUS TOKODAII;
SOURCE 3 ORGANISM_TAXID: 111955;
SOURCE 4 STRAIN: STRAIN 7;
SOURCE 5 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE 6 EXPRESSION_SYSTEM_TAXID: 562;
SOURCE 7 EXPRESSION_SYSTEM_STRAIN: ROSETTA834(DE3);
SOURCE 8 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID;
SOURCE 9 EXPRESSION_SYSTEM_PLASMID: PET-21A
KEYWDS HYPOTHETICAL PROTEIN, STRUCTURAL GENOMICS, UNKNOWN FUNCTION,
KEYWDS 2 NPPSFA, NATIONAL PROJECT ON PROTEIN STRUCTURAL AND
KEYWDS 3 FUNCTIONAL ANALYSES, RIKEN STRUCTURAL GENOMICS/PROTEOMICS
KEYWDS 4 INITIATIVE, RSGI
EXPDTA X-RAY DIFFRACTION
AUTHOR J.JEYAKANTHAN,S.KURAMITSU,S.YOKOYAMA,RIKEN STRUCTURAL
AUTHOR 2 GENOMICS/PROTEOMICS INITIATIVE (RSGI)
REVDAT 2 24-FEB-09 2RBG 1 VERSN
REVDAT 1 30-SEP-08 2RBG 0
JRNL AUTH J.JEYAKANTHAN,S.KURAMITSU,S.YOKOYAMA
JRNL TITL CRYSTAL STRUCTURE OF HYPOTHETICAL PROTEIN(ST0493)
JRNL TITL 2 FROM SULFOLOBUS TOKODAII
JRNL REF TO BE PUBLISHED
JRNL REFN
REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 1.75 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : CNS 1.1
REMARK 3 AUTHORS : BRUNGER,ADAMS,CLORE,DELANO,GROS,GROSSE-
REMARK 3 : KUNSTLEVE,JIANG,KUSZEWSKI,NILGES, PANNU,
REMARK 3 : READ,RICE,SIMONSON,WARREN
REMARK 3
REMARK 3 REFINEMENT TARGET : ENGH & HUBER
REMARK 3
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.75
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 33.49
REMARK 3 DATA CUTOFF (SIGMA(F)) : 0.000
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : 2067291.840
REMARK 3 DATA CUTOFF LOW (ABS(F)) : 0.0000
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 99.3
REMARK 3 NUMBER OF REFLECTIONS : 25029
REMARK 3
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT
REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM
REMARK 3 R VALUE (WORKING SET) : 0.173
REMARK 3 FREE R VALUE : 0.196
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 4.900
REMARK 3 FREE R VALUE TEST SET COUNT : 1216
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : 0.006
REMARK 3
REMARK 3 FIT IN THE HIGHEST RESOLUTION BIN.
REMARK 3 TOTAL NUMBER OF BINS USED : 8
REMARK 3 BIN RESOLUTION RANGE HIGH (A) : 1.75
REMARK 3 BIN RESOLUTION RANGE LOW (A) : 1.83
REMARK 3 BIN COMPLETENESS (WORKING+TEST) (%) : 96.80
REMARK 3 REFLECTIONS IN BIN (WORKING SET) : 2906
REMARK 3 BIN R VALUE (WORKING SET) : 0.1980
REMARK 3 BIN FREE R VALUE : 0.2420
REMARK 3 BIN FREE R VALUE TEST SET SIZE (%) : 5.10
REMARK 3 BIN FREE R VALUE TEST SET COUNT : 156
REMARK 3 ESTIMATED ERROR OF BIN FREE R VALUE : 0.019
REMARK 3
REMARK 3 NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
REMARK 3 PROTEIN ATOMS : 2060
REMARK 3 NUCLEIC ACID ATOMS : 0
REMARK 3 HETEROGEN ATOMS : 5
REMARK 3 SOLVENT ATOMS : 316
REMARK 3
REMARK 3 B VALUES.
REMARK 3 FROM WILSON PLOT (A**2) : 13.30
REMARK 3 MEAN B VALUE (OVERALL, A**2) : 16.90
REMARK 3 OVERALL ANISOTROPIC B VALUE.
REMARK 3 B11 (A**2) : 2.81000
REMARK 3 B22 (A**2) : -1.00000
REMARK 3 B33 (A**2) : -1.81000
REMARK 3 B12 (A**2) : 0.00000
REMARK 3 B13 (A**2) : -1.31000
REMARK 3 B23 (A**2) : 0.00000
REMARK 3
REMARK 3 ESTIMATED COORDINATE ERROR.
REMARK 3 ESD FROM LUZZATI PLOT (A) : 0.16
REMARK 3 ESD FROM SIGMAA (A) : 0.06
REMARK 3 LOW RESOLUTION CUTOFF (A) : 5.00
REMARK 3
REMARK 3 CROSS-VALIDATED ESTIMATED COORDINATE ERROR.
REMARK 3 ESD FROM C-V LUZZATI PLOT (A) : 0.19
REMARK 3 ESD FROM C-V SIGMAA (A) : 0.14
REMARK 3
REMARK 3 RMS DEVIATIONS FROM IDEAL VALUES.
REMARK 3 BOND LENGTHS (A) : 0.005
REMARK 3 BOND ANGLES (DEGREES) : 1.10
REMARK 3 DIHEDRAL ANGLES (DEGREES) : 22.00
REMARK 3 IMPROPER ANGLES (DEGREES) : 0.70
REMARK 3
REMARK 3 ISOTROPIC THERMAL MODEL : RESTRAINED
REMARK 3
REMARK 3 ISOTROPIC THERMAL FACTOR RESTRAINTS. RMS SIGMA
REMARK 3 MAIN-CHAIN BOND (A**2) : NULL ; NULL
REMARK 3 MAIN-CHAIN ANGLE (A**2) : NULL ; NULL
REMARK 3 SIDE-CHAIN BOND (A**2) : NULL ; NULL
REMARK 3 SIDE-CHAIN ANGLE (A**2) : NULL ; NULL
REMARK 3
REMARK 3 BULK SOLVENT MODELING.
REMARK 3 METHOD USED : FLAT MODEL
REMARK 3 KSOL : 0.37
REMARK 3 BSOL : 51.20
REMARK 3
REMARK 3 NCS MODEL : NULL
REMARK 3
REMARK 3 NCS RESTRAINTS. RMS SIGMA/WEIGHT
REMARK 3 GROUP 1 POSITIONAL (A) : NULL ; NULL
REMARK 3 GROUP 1 B-FACTOR (A**2) : NULL ; NULL
REMARK 3
REMARK 3 PARAMETER FILE 1 : PROTEIN_REP.PARAM
REMARK 3 PARAMETER FILE 2 : LIGAND.PARAM
REMARK 3 PARAMETER FILE 3 : ION.PARAM
REMARK 3 PARAMETER FILE 5 : WATER_REP.PARAM
REMARK 3 PARAMETER FILE 6 : NULL
REMARK 3 TOPOLOGY FILE 1 : PROTEIN.TOP
REMARK 3 TOPOLOGY FILE 2 : LIGAND.TOP
REMARK 3 TOPOLOGY FILE 3 : ION.TOP
REMARK 3 TOPOLOGY FILE 5 : WATER_PROTIN.TOP
REMARK 3 TOPOLOGY FILE 6 : NULL
REMARK 3
REMARK 3 OTHER REFINEMENT REMARKS: NULL
REMARK 4
REMARK 4 2RBG COMPLIES WITH FORMAT V. 3.15, 01-DEC-08
REMARK 100
REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY PDBJ ON 27-SEP-07.
REMARK 100 THE RCSB ID CODE IS RCSB044658.
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION
REMARK 200 DATE OF DATA COLLECTION : 16-JUN-07
REMARK 200 TEMPERATURE (KELVIN) : 100
REMARK 200 PH : 7.5
REMARK 200 NUMBER OF CRYSTALS USED : 1
REMARK 200
REMARK 200 SYNCHROTRON (Y/N) : Y
REMARK 200 RADIATION SOURCE : SPRING-8
REMARK 200 BEAMLINE : BL26B2
REMARK 200 X-RAY GENERATOR MODEL : NULL
REMARK 200 MONOCHROMATIC OR LAUE (M/L) : M
REMARK 200 WAVELENGTH OR RANGE (A) : 0.97899, 0.9, 0.97931
REMARK 200 MONOCHROMATOR : SI-1 1 1 DOUBLE CRYSTAL
REMARK 200 MONOCHROMATOR
REMARK 200 OPTICS : RH COATED BENT-CYRINDRICAL
REMARK 200 MIRROR
REMARK 200
REMARK 200 DETECTOR TYPE : CCD
REMARK 200 DETECTOR MANUFACTURER : MARMOSAIC 225 MM CCD
REMARK 200 INTENSITY-INTEGRATION SOFTWARE : HKL-2000
REMARK 200 DATA SCALING SOFTWARE : SCALEPACK
REMARK 200
REMARK 200 NUMBER OF UNIQUE REFLECTIONS : 25105
REMARK 200 RESOLUTION RANGE HIGH (A) : 1.750
REMARK 200 RESOLUTION RANGE LOW (A) : 50.000
REMARK 200 REJECTION CRITERIA (SIGMA(I)) : NULL
REMARK 200
REMARK 200 OVERALL.
REMARK 200 COMPLETENESS FOR RANGE (%) : 99.6
REMARK 200 DATA REDUNDANCY : NULL
REMARK 200 R MERGE (I) : 0.05900
REMARK 200 R SYM (I) : 0.06300
REMARK 200 <I/SIGMA(I)> FOR THE DATA SET : NULL
REMARK 200
REMARK 200 IN THE HIGHEST RESOLUTION SHELL.
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE HIGH (A) : 1.75
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE LOW (A) : 1.81
REMARK 200 COMPLETENESS FOR SHELL (%) : 96.9
REMARK 200 DATA REDUNDANCY IN SHELL : NULL
REMARK 200 R MERGE FOR SHELL (I) : 0.14300
REMARK 200 R SYM FOR SHELL (I) : 0.13300
REMARK 200 <I/SIGMA(I)> FOR SHELL : NULL
REMARK 200
REMARK 200 DIFFRACTION PROTOCOL: MAD
REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE: MAD
REMARK 200 SOFTWARE USED: SOLVE
REMARK 200 STARTING MODEL: NULL
REMARK 200
REMARK 200 REMARK: NULL
REMARK 280
REMARK 280 CRYSTAL
REMARK 280 SOLVENT CONTENT, VS (%): 41.69
REMARK 280 MATTHEWS COEFFICIENT, VM (ANGSTROMS**3/DA): 2.11
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 30% PEG 4K, 0.2M AMMONIUM SULFATE,
REMARK 280 PH 7.5, MICROBATCH, TEMPERATURE 293K
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY
REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP: P 1 21 1
REMARK 290
REMARK 290 SYMOP SYMMETRY
REMARK 290 NNNMMM OPERATOR
REMARK 290 1555 X,Y,Z
REMARK 290 2555 -X,Y+1/2,-Z
REMARK 290
REMARK 290 WHERE NNN -> OPERATOR NUMBER
REMARK 290 MMM -> TRANSLATION VECTOR
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY TRANSFORMATIONS
REMARK 290 THE FOLLOWING TRANSFORMATIONS OPERATE ON THE ATOM/HETATM
REMARK 290 RECORDS IN THIS ENTRY TO PRODUCE CRYSTALLOGRAPHICALLY
REMARK 290 RELATED MOLECULES.
REMARK 290 SMTRY1 1 1.000000 0.000000 0.000000 0.00000
REMARK 290 SMTRY2 1 0.000000 1.000000 0.000000 0.00000
REMARK 290 SMTRY3 1 0.000000 0.000000 1.000000 0.00000
REMARK 290 SMTRY1 2 -1.000000 0.000000 0.000000 0.00000
REMARK 290 SMTRY2 2 0.000000 1.000000 0.000000 32.59200
REMARK 290 SMTRY3 2 0.000000 0.000000 -1.000000 0.00000
REMARK 290
REMARK 290 REMARK: NULL
REMARK 300
REMARK 300 BIOMOLECULE: 1, 2, 3
REMARK 300 SEE REMARK 350 FOR THE AUTHOR PROVIDED AND/OR PROGRAM
REMARK 300 GENERATED ASSEMBLY INFORMATION FOR THE STRUCTURE IN
REMARK 300 THIS ENTRY. THE REMARK MAY ALSO PROVIDE INFORMATION ON
REMARK 300 BURIED SURFACE AREA.
REMARK 350
REMARK 350 COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN
REMARK 350 BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE
REMARK 350 MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS
REMARK 350 GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND
REMARK 350 CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN.
REMARK 350
REMARK 350 BIOMOLECULE: 1
REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: DIMERIC
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A, B
REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 0.00000
REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 0.00000
REMARK 350 BIOMT3 1 0.000000 0.000000 1.000000 0.00000
REMARK 350
REMARK 350 BIOMOLECULE: 2
REMARK 350 SOFTWARE DETERMINED QUATERNARY STRUCTURE: MONOMERIC
REMARK 350 SOFTWARE USED: PISA
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A
REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 0.00000
REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 0.00000
REMARK 350 BIOMT3 1 0.000000 0.000000 1.000000 0.00000
REMARK 350
REMARK 350 BIOMOLECULE: 3
REMARK 350 SOFTWARE DETERMINED QUATERNARY STRUCTURE: MONOMERIC
REMARK 350 SOFTWARE USED: PISA
REMARK 350 APPLY THE FOLLOWING TO CHAINS: B
REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 0.00000
REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 0.00000
REMARK 350 BIOMT3 1 0.000000 0.000000 1.000000 0.00000
REMARK 465
REMARK 465 MISSING RESIDUES
REMARK 465 THE FOLLOWING RESIDUES WERE NOT LOCATED IN THE
REMARK 465 EXPERIMENT. (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 465 IDENTIFIER; SSSEQ=SEQUENCE NUMBER; I=INSERTION CODE.)
REMARK 465
REMARK 465 M RES C SSSEQI
REMARK 465 MSE A 1
REMARK 465 PRO A 2
REMARK 465 MSE B 1
REMARK 465 PRO B 2
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: TORSION ANGLES
REMARK 500
REMARK 500 TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:
REMARK 500 (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;
REMARK 500 SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)
REMARK 500
REMARK 500 EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSI-
REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
REMARK 500
REMARK 500 M RES CSSEQI PSI PHI
REMARK 500 PHE A 121 76.88 -102.11
REMARK 500 CYS A 122 -73.41 -165.90
REMARK 500 CYS B 122 -70.28 -161.68
REMARK 500
REMARK 500 REMARK: NULL
REMARK 800
REMARK 800 SITE
REMARK 800 SITE_IDENTIFIER: AC1
REMARK 800 EVIDENCE_CODE: SOFTWARE
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE SO4 B 127
REMARK 900
REMARK 900 RELATED ENTRIES
REMARK 900 RELATED ID: STO001000493.1 RELATED DB: TARGETDB
DBREF 2RBG A 1 126 UNP Q975B5 Q975B5_SULTO 1 126
DBREF 2RBG B 1 126 UNP Q975B5 Q975B5_SULTO 1 126
SEQRES 1 A 126 MSE PRO TYR LYS ASN ILE LEU THR LEU ILE SER VAL ASN
SEQRES 2 A 126 ASN ASP ASN PHE GLU ASN TYR PHE ARG LYS ILE PHE LEU
SEQRES 3 A 126 ASP VAL ARG SER SER GLY SER LYS LYS THR THR ILE ASN
SEQRES 4 A 126 VAL PHE THR GLU ILE GLN TYR GLN GLU LEU VAL THR LEU
SEQRES 5 A 126 ILE ARG GLU ALA LEU LEU GLU ASN ILE ASP ILE GLY TYR
SYMBOL INDEX (847 symbols across 76 files)
FILE: alphafold/common/confidence.py
function _softmax (line 23) | def _softmax(x: np.ndarray, axis: Optional[int] = None):
function compute_plddt (line 30) | def compute_plddt(logits: np.ndarray) -> np.ndarray:
function _confidence_category (line 47) | def _confidence_category(score: float) -> str:
function confidence_json (line 61) | def confidence_json(plddt: np.ndarray) -> str:
function _calculate_bin_centers (line 84) | def _calculate_bin_centers(breaks: np.ndarray):
function _calculate_expected_aligned_error (line 102) | def _calculate_expected_aligned_error(
function compute_predicted_aligned_error (line 127) | def compute_predicted_aligned_error(
function pae_json (line 158) | def pae_json(pae: np.ndarray, max_pae: float) -> str:
function predicted_tm_score (line 184) | def predicted_tm_score(
FILE: alphafold/common/confidence_test.py
class ConfidenceTest (line 23) | class ConfidenceTest(absltest.TestCase):
method test_pae_json (line 25) | def test_pae_json(self):
method test_confidence_json (line 34) | def test_confidence_json(self):
FILE: alphafold/common/mmcif_metadata.py
function add_metadata_to_mmcif (line 72) | def add_metadata_to_mmcif(
FILE: alphafold/common/protein.py
class Protein (line 65) | class Protein:
method __post_init__ (line 92) | def __post_init__(self):
function _from_bio_structure (line 100) | def _from_bio_structure(
function from_pdb_string (line 182) | def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Pro...
function from_mmcif_string (line 202) | def from_mmcif_string(
function _chain_end (line 224) | def _chain_end(atom_index, end_resname, chain_name, residue_index) -> str:
function to_pdb (line 232) | def to_pdb(prot: Protein) -> str:
function ideal_atom_mask (line 327) | def ideal_atom_mask(prot: Protein) -> np.ndarray:
function from_prediction (line 343) | def from_prediction(
function to_mmcif (line 384) | def to_mmcif(
function _int_id_to_str_id (line 524) | def _int_id_to_str_id(num: int) -> str:
function _get_entity_poly_seq (line 546) | def _get_entity_poly_seq(
function _create_mmcif_string (line 594) | def _create_mmcif_string(mmcif_dict: Dict[str, Any]) -> str:
FILE: alphafold/common/protein_test.py
class ProteinTest (line 28) | class ProteinTest(parameterized.TestCase):
method _check_shapes (line 30) | def _check_shapes(self, prot, num_res):
method test_from_pdb_str (line 63) | def test_from_pdb_str(self, pdb_file, chain_id, num_res, num_chains):
method test_to_pdb (line 76) | def test_to_pdb(self):
method test_to_mmcif (line 114) | def test_to_mmcif(self, pdb_file, model_type):
method test_ideal_atom_mask (line 142) | def test_ideal_atom_mask(self):
method test_too_many_chains (line 160) | def test_too_many_chains(self):
FILE: alphafold/common/residue_constants.py
function load_stereo_chemical_props (line 413) | def load_stereo_chemical_props() -> Tuple[
function sequence_to_onehot (line 587) | def sequence_to_onehot(
function atom_id_to_type (line 659) | def atom_id_to_type(atom_id: str) -> str:
function _make_standard_atom_mask (line 768) | def _make_standard_atom_mask() -> np.ndarray:
function chi_angle_atom (line 786) | def chi_angle_atom(atom_index: int) -> np.ndarray:
function _make_rigid_transformation_4x4 (line 831) | def _make_rigid_transformation_4x4(ex, ey, translation):
function _make_rigid_group_constants (line 860) | def _make_rigid_group_constants():
function make_atom14_dists_bounds (line 939) | def make_atom14_dists_bounds(
FILE: alphafold/common/residue_constants_test.py
class ResidueConstantsTest (line 23) | class ResidueConstantsTest(parameterized.TestCase):
method testChiAnglesAtoms (line 33) | def testChiAnglesAtoms(self, residue_name, chi_num):
method testChiGroupsForAtom (line 39) | def testChiGroupsForAtom(self):
method testResidueAtoms (line 70) | def testResidueAtoms(self, atom_name, num_residue_atoms):
method testStandardAtomMask (line 74) | def testStandardAtomMask(self):
method testAtomTypes (line 127) | def testAtomTypes(self):
method testRestypes (line 143) | def testRestypes(self):
method testSequenceToOneHotHHBlits (line 153) | def testSequenceToOneHotHHBlits(self):
method testSequenceToOneHotStandard (line 188) | def testSequenceToOneHotStandard(self):
method testSequenceToOneHotUnknownMapping (line 194) | def testSequenceToOneHotUnknownMapping(self):
method testSequenceToOneHotUnknownMappingError (line 217) | def testSequenceToOneHotUnknownMappingError(self, seq):
FILE: alphafold/data/feature_processing.py
function _is_homomer_or_monomer (line 60) | def _is_homomer_or_monomer(chains: Iterable[pipeline.FeatureDict]) -> bool:
function pair_and_merge (line 74) | def pair_and_merge(
function crop_chains (line 110) | def crop_chains(
function _crop_single_chain (line 142) | def _crop_single_chain(
function process_final (line 198) | def process_final(np_example: pipeline.FeatureDict) -> pipeline.FeatureD...
function _correct_msa_restypes (line 207) | def _correct_msa_restypes(np_example):
function _make_seq_mask (line 215) | def _make_seq_mask(np_example):
function _make_msa_mask (line 220) | def _make_msa_mask(np_example):
function _filter_features (line 231) | def _filter_features(np_example: pipeline.FeatureDict) -> pipeline.Featu...
function process_unmerged_features (line 236) | def process_unmerged_features(
FILE: alphafold/data/mmcif_parsing.py
class Monomer (line 35) | class Monomer:
class AtomSite (line 43) | class AtomSite:
class ResiduePosition (line 56) | class ResiduePosition:
class ResidueAtPosition (line 63) | class ResidueAtPosition:
class MmcifObject (line 71) | class MmcifObject:
class ParsingResult (line 97) | class ParsingResult:
class ParseError (line 110) | class ParseError(Exception):
function mmcif_loop_to_list (line 114) | def mmcif_loop_to_list(
function mmcif_loop_to_dict (line 146) | def mmcif_loop_to_dict(
function parse (line 170) | def parse(
function _get_first_model (line 293) | def _get_first_model(structure: PdbStructure) -> PdbStructure:
function get_release_date (line 301) | def get_release_date(parsed_info: MmCIFDict) -> str:
function _get_header (line 307) | def _get_header(parsed_info: MmCIFDict) -> PdbHeader:
function _get_atom_site_list (line 342) | def _get_atom_site_list(parsed_info: MmCIFDict) -> Sequence[AtomSite]:
function _get_protein_chains (line 359) | def _get_protein_chains(
function _is_set (line 411) | def _is_set(data: str) -> bool:
FILE: alphafold/data/msa_identifiers.py
class Identifiers (line 51) | class Identifiers:
function _parse_sequence_identifier (line 55) | def _parse_sequence_identifier(msa_sequence_identifier: str) -> Identifi...
function _extract_sequence_identifier (line 75) | def _extract_sequence_identifier(description: str) -> Optional[str]:
function get_identifiers (line 84) | def get_identifiers(description: str) -> Identifiers:
FILE: alphafold/data/msa_pairing.py
class MSAStatistics (line 71) | class MSAStatistics:
method from_chain_features (line 88) | def from_chain_features(
method __len__ (line 124) | def __len__(self) -> int:
method get_top_msa_rows (line 127) | def get_top_msa_rows(self, num_rows: int) -> np.ndarray:
method to_species_dict (line 132) | def to_species_dict(self) -> Mapping[bytes, 'MSAStatistics']:
function create_paired_features (line 153) | def create_paired_features(
function pad_features (line 188) | def pad_features(feature: np.ndarray, feature_name: str) -> np.ndarray:
function _match_rows_by_sequence_similarity (line 220) | def _match_rows_by_sequence_similarity(
function pair_sequences (line 256) | def pair_sequences(
function reorder_paired_rows (line 322) | def reorder_paired_rows(
function block_diag (line 349) | def block_diag(*arrs: np.ndarray, pad_value: float = 0.0) -> np.ndarray:
function _correct_post_merged_feats (line 358) | def _correct_post_merged_feats(
function _pad_templates (line 411) | def _pad_templates(
function _merge_features_from_multiple_chains (line 434) | def _merge_features_from_multiple_chains(
function _merge_homomers_dense_msa (line 469) | def _merge_homomers_dense_msa(
function _concatenate_paired_and_unpaired_features (line 498) | def _concatenate_paired_and_unpaired_features(
function merge_chain_features (line 513) | def merge_chain_features(
function deduplicate_unpaired_sequences (line 546) | def deduplicate_unpaired_sequences(
FILE: alphafold/data/parsers.py
class Msa (line 30) | class Msa:
method __post_init__ (line 37) | def __post_init__(self):
method __len__ (line 50) | def __len__(self):
method truncate (line 53) | def truncate(self, max_seqs: int):
class TemplateHit (line 62) | class TemplateHit:
function parse_fasta (line 75) | def parse_fasta(fasta_string: str) -> Tuple[Sequence[str], Sequence[str]]:
function parse_stockholm (line 104) | def parse_stockholm(stockholm_string: str) -> Msa:
function parse_a3m (line 166) | def parse_a3m(a3m_string: str) -> Msa:
function _convert_sto_seq_to_a3m (line 205) | def _convert_sto_seq_to_a3m(
function convert_stockholm_to_a3m (line 215) | def convert_stockholm_to_a3m(
function _keep_line (line 274) | def _keep_line(line: str, seqnames: Set[str]) -> bool:
function truncate_stockholm_msa (line 294) | def truncate_stockholm_msa(stockholm_msa_path: str, max_sequences: int) ...
function remove_empty_columns_from_stockholm_msa (line 317) | def remove_empty_columns_from_stockholm_msa(stockholm_msa: str) -> str:
function deduplicate_stockholm_msa (line 357) | def deduplicate_stockholm_msa(stockholm_msa: str) -> str:
function _get_hhr_line_regex_groups (line 392) | def _get_hhr_line_regex_groups(
function _update_hhr_residue_indices_list (line 401) | def _update_hhr_residue_indices_list(
function _parse_hhr_hit (line 414) | def _parse_hhr_hit(detailed_lines: Sequence[str]) -> TemplateHit:
function parse_hhr (line 518) | def parse_hhr(hhr_string: str) -> Sequence[TemplateHit]:
function parse_e_values_from_tblout (line 536) | def parse_e_values_from_tblout(tblout: str) -> Dict[str, float]:
function _get_indices (line 551) | def _get_indices(sequence: str, start: int) -> List[int]:
class HitMetadata (line 570) | class HitMetadata:
function _parse_hmmsearch_description (line 579) | def _parse_hmmsearch_description(description: str) -> HitMetadata:
function parse_hmmsearch_a3m (line 601) | def parse_hmmsearch_a3m(
FILE: alphafold/data/pipeline.py
function make_sequence_features (line 37) | def make_sequence_features(
function make_msa_features (line 57) | def make_msa_features(msas: Sequence[parsers.Msa]) -> FeatureDict:
function run_msa_tool (line 94) | def run_msa_tool(
class DataPipeline (line 123) | class DataPipeline:
method __init__ (line 126) | def __init__(
method process (line 174) | def process(self, input_fasta_path: str, msa_output_dir: str) -> Featu...
FILE: alphafold/data/pipeline_multimer.py
class _FastaChain (line 40) | class _FastaChain:
function _make_chain_id_map (line 45) | def _make_chain_id_map(
function temp_fasta_file (line 72) | def temp_fasta_file(fasta_str: str):
function convert_monomer_features (line 79) | def convert_monomer_features(
function int_id_to_str_id (line 108) | def int_id_to_str_id(num: int) -> str:
function add_assembly_features (line 130) | def add_assembly_features(
function pad_msa (line 170) | def pad_msa(np_example, min_num_seq):
class DataPipeline (line 184) | class DataPipeline:
method __init__ (line 187) | def __init__(
method _process_single_chain (line 218) | def _process_single_chain(
method _all_seq_msa_features (line 248) | def _all_seq_msa_features(self, input_fasta_path, msa_output_dir):
method process (line 269) | def process(
FILE: alphafold/data/templates.py
class Error (line 35) | class Error(Exception):
class NoChainsError (line 39) | class NoChainsError(Error):
class SequenceNotInTemplateError (line 43) | class SequenceNotInTemplateError(Error):
class NoAtomDataInTemplateError (line 47) | class NoAtomDataInTemplateError(Error):
class TemplateAtomMaskAllZerosError (line 51) | class TemplateAtomMaskAllZerosError(Error):
class QueryToTemplateAlignError (line 55) | class QueryToTemplateAlignError(Error):
class CaDistanceError (line 59) | class CaDistanceError(Error):
class MultipleChainsError (line 63) | class MultipleChainsError(Error):
class PrefilterError (line 68) | class PrefilterError(Exception):
class DateError (line 72) | class DateError(PrefilterError):
class AlignRatioError (line 76) | class AlignRatioError(PrefilterError):
class DuplicateError (line 80) | class DuplicateError(PrefilterError):
class LengthError (line 84) | class LengthError(PrefilterError):
function _get_pdb_id_and_chain (line 98) | def _get_pdb_id_and_chain(hit: parsers.TemplateHit) -> Tuple[str, str]:
function _is_after_cutoff (line 108) | def _is_after_cutoff(
function _parse_obsolete (line 133) | def _parse_obsolete(obsolete_file_path: str) -> Mapping[str, Optional[st...
function _parse_release_dates (line 156) | def _parse_release_dates(path: str) -> Mapping[str, datetime.datetime]:
function _assess_hhsearch_hit (line 175) | def _assess_hhsearch_hit(
function _find_template_in_pdb (line 244) | def _find_template_in_pdb(
function _realign_pdb_template_to_query (line 316) | def _realign_pdb_template_to_query(
function _check_residue_distances (line 453) | def _check_residue_distances(
function _get_atom_positions (line 477) | def _get_atom_positions(
function _extract_template_features (line 544) | def _extract_template_features(
function _build_query_to_hit_index_mapping (line 692) | def _build_query_to_hit_index_mapping(
class SingleHitResult (line 750) | class SingleHitResult:
function _read_file (line 757) | def _read_file(path):
function _process_single_hit (line 763) | def _process_single_hit(
class TemplateSearchResult (line 908) | class TemplateSearchResult:
class TemplateHitFeaturizer (line 914) | class TemplateHitFeaturizer(abc.ABC):
method __init__ (line 917) | def __init__(
method get_templates (line 979) | def get_templates(
class HhsearchHitFeaturizer (line 985) | class HhsearchHitFeaturizer(TemplateHitFeaturizer):
method get_templates (line 988) | def get_templates(
class HmmsearchHitFeaturizer (line 1053) | class HmmsearchHitFeaturizer(TemplateHitFeaturizer):
method get_templates (line 1056) | def get_templates(
FILE: alphafold/data/tools/hhblits.py
class HHBlits (line 32) | class HHBlits:
method __init__ (line 35) | def __init__(
method query (line 100) | def query(self, input_fasta_path: str) -> List[Mapping[str, Any]]:
FILE: alphafold/data/tools/hhsearch.py
class HHSearch (line 29) | class HHSearch:
method __init__ (line 32) | def __init__(
method output_format (line 65) | def output_format(self) -> str:
method input_format (line 69) | def input_format(self) -> str:
method query (line 72) | def query(self, a3m: str) -> str:
method get_template_hits (line 111) | def get_template_hits(
FILE: alphafold/data/tools/hmmbuild.py
class Hmmbuild (line 27) | class Hmmbuild(object):
method __init__ (line 30) | def __init__(self, *, binary_path: str, singlemx: bool = False):
method build_profile_from_sto (line 44) | def build_profile_from_sto(self, sto: str, model_construction='fast') ...
method build_profile_from_a3m (line 60) | def build_profile_from_a3m(self, a3m: str) -> str:
method _build_profile (line 80) | def _build_profile(self, msa: str, model_construction: str = 'fast') -...
FILE: alphafold/data/tools/hmmsearch.py
class Hmmsearch (line 29) | class Hmmsearch(object):
method __init__ (line 32) | def __init__(
method output_format (line 76) | def output_format(self) -> str:
method input_format (line 80) | def input_format(self) -> str:
method query (line 83) | def query(self, msa_sto: str) -> str:
method query_with_hmm (line 90) | def query_with_hmm(self, hmm: str) -> str:
method get_template_hits (line 135) | def get_template_hits(
FILE: alphafold/data/tools/jackhmmer.py
class Jackhmmer (line 31) | class Jackhmmer:
method __init__ (line 34) | def __init__(
method _query_chunk (line 92) | def _query_chunk(
method query (line 171) | def query(
method query_multiple (line 177) | def query_multiple(
FILE: alphafold/data/tools/kalign.py
function _to_a3m (line 26) | def _to_a3m(sequences: Sequence[str]) -> str:
class Kalign (line 36) | class Kalign:
method __init__ (line 39) | def __init__(self, *, binary_path: str):
method align (line 50) | def align(self, sequences: Sequence[str]) -> str:
FILE: alphafold/data/tools/utils.py
function tmpdir_manager (line 25) | def tmpdir_manager(base_dir: Optional[str] = None):
function timing (line 35) | def timing(msg: str):
FILE: alphafold/model/all_atom.py
function squared_difference (line 44) | def squared_difference(x, y):
function get_chi_atom_indices (line 48) | def get_chi_atom_indices():
function atom14_to_atom37 (line 75) | def atom14_to_atom37(
function atom37_to_atom14 (line 95) | def atom37_to_atom14(
function atom37_to_frames (line 115) | def atom37_to_frames(
function atom37_to_torsion_angles (line 288) | def atom37_to_torsion_angles(
function torsion_angles_to_frames (line 495) | def torsion_angles_to_frames(
function frames_and_literature_positions_to_atom14_pos (line 599) | def frames_and_literature_positions_to_atom14_pos(
function extreme_ca_ca_distance_violations (line 647) | def extreme_ca_ca_distance_violations(
function between_residue_bond_loss (line 685) | def between_residue_bond_loss(
function _loss_and_violation_mask (line 830) | def _loss_and_violation_mask(
function between_residue_clash_loss (line 847) | def between_residue_clash_loss(
function within_residue_violations (line 974) | def within_residue_violations(
function find_optimal_renaming (line 1058) | def find_optimal_renaming(
function frame_aligned_point_error (line 1159) | def frame_aligned_point_error(
function _make_renaming_matrices (line 1231) | def _make_renaming_matrices():
function get_alt_atom14 (line 1264) | def get_alt_atom14(aatype, positions, mask):
FILE: alphafold/model/all_atom_multimer.py
function squared_difference (line 26) | def squared_difference(x, y):
function _make_chi_atom_indices (line 30) | def _make_chi_atom_indices():
function _make_renaming_matrices (line 57) | def _make_renaming_matrices():
function _make_restype_atom37_mask (line 86) | def _make_restype_atom37_mask():
function _make_restype_atom14_mask (line 99) | def _make_restype_atom14_mask():
function _make_restype_atom37_to_atom14 (line 114) | def _make_restype_atom37_to_atom14():
function _make_restype_atom14_to_atom37 (line 132) | def _make_restype_atom14_to_atom37():
function _make_restype_atom14_is_ambiguous (line 149) | def _make_restype_atom14_is_ambiguous():
function _make_restype_rigidgroup_base_atom37_idx (line 170) | def _make_restype_rigidgroup_base_atom37_idx():
function get_atom37_mask (line 215) | def get_atom37_mask(aatype):
function get_atom14_mask (line 219) | def get_atom14_mask(aatype):
function get_atom14_is_ambiguous (line 223) | def get_atom14_is_ambiguous(aatype):
function get_atom14_to_atom37_map (line 227) | def get_atom14_to_atom37_map(aatype):
function get_atom37_to_atom14_map (line 231) | def get_atom37_to_atom14_map(aatype):
function atom14_to_atom37 (line 235) | def atom14_to_atom37(
function atom37_to_atom14 (line 252) | def atom37_to_atom14(aatype, all_atom_pos, all_atom_mask):
function get_alt_atom14 (line 271) | def get_alt_atom14(aatype, positions: geometry.Vec3Array, mask):
function atom37_to_frames (line 291) | def atom37_to_frames(
function torsion_angles_to_frames (line 398) | def torsion_angles_to_frames(
function frames_and_literature_positions_to_atom14_pos (line 478) | def frames_and_literature_positions_to_atom14_pos(
function extreme_ca_ca_distance_violations (line 517) | def extreme_ca_ca_distance_violations(
function between_residue_bond_loss (line 539) | def between_residue_bond_loss(
function between_residue_clash_loss (line 664) | def between_residue_clash_loss(
function within_residue_violations (line 758) | def within_residue_violations(
function find_optimal_renaming (line 814) | def find_optimal_renaming(
function frame_aligned_point_error (line 873) | def frame_aligned_point_error(
function get_chi_atom_indices (line 949) | def get_chi_atom_indices():
function compute_chi_angles (line 976) | def compute_chi_angles(
function make_transform_from_reference (line 1046) | def make_transform_from_reference(
FILE: alphafold/model/all_atom_test.py
function _relu (line 33) | def _relu(x):
function _get_positions_for_ca_c_n_violation_mask (line 38) | def _get_positions_for_ca_c_n_violation_mask():
function _get_mask_for_ca_c_n_violation_mask (line 44) | def _get_mask_for_ca_c_n_violation_mask():
function get_identity_rigid (line 50) | def get_identity_rigid(shape):
function get_global_rigid_transform (line 60) | def get_global_rigid_transform(rot_angle, translation, bcast_dims):
class AllAtomTest (line 88) | class AllAtomTest(parameterized.TestCase):
method test_frame_aligned_point_error_perfect_on_global_transform (line 96) | def test_frame_aligned_point_error_perfect_on_global_transform(
method test_frame_aligned_point_error_matches_expected (line 166) | def test_frame_aligned_point_error_matches_expected(
method test_between_residue_bond_loss (line 266) | def test_between_residue_bond_loss(
FILE: alphafold/model/base_config.py
function _strip_optional (line 30) | def _strip_optional(t: type[Any]) -> type[Any]:
class _Autocreate (line 42) | class _Autocreate:
method __init__ (line 44) | def __init__(self, **defaults: Any):
function autocreate (line 48) | def autocreate(**defaults: Any) -> Any:
function _clone_field (line 53) | def _clone_field(
class ConfigMeta (line 70) | class ConfigMeta(type):
method __new__ (line 73) | def __new__(mcs, name, bases, classdict):
class BaseConfig (line 132) | class BaseConfig(metaclass=ConfigMeta):
method _coercable_fields (line 148) | def _coercable_fields(self) -> Mapping[str, tuple[type['BaseConfig'], ...
method as_dict (line 151) | def as_dict(self, include_none: bool = True) -> Mapping[str, Any]:
method __setattr__ (line 168) | def __setattr__(self, name: str, value: Any) -> None:
method _toggle_freeze (line 178) | def _toggle_freeze(self, frozen: bool) -> None:
method freeze (line 186) | def freeze(self) -> None:
method unfreeze (line 191) | def unfreeze(self: _ConfigT) -> Iterator[_ConfigT]:
FILE: alphafold/model/base_config_test.py
class InnerConfig (line 23) | class InnerConfig(base_config.BaseConfig):
class OuterConfig (line 28) | class OuterConfig(base_config.BaseConfig):
class ModelConfigTest (line 38) | class ModelConfigTest(absltest.TestCase):
method _equal_at_path (line 40) | def _equal_at_path(self, path, a, b):
method test_post_init_is_chained (line 43) | def test_post_init_is_chained(self):
method test_config_is_dataclass (line 53) | def test_config_is_dataclass(self):
method test_nested_values_not_provided (line 56) | def test_nested_values_not_provided(self):
method test_config_dict_escape_hatch (line 62) | def test_config_dict_escape_hatch(self):
method test_config_from_dict (line 71) | def test_config_from_dict(self):
method test_create_config (line 92) | def test_create_config(self):
method test_freeze (line 113) | def test_freeze(self):
method test_unfreeze (line 134) | def test_unfreeze(self):
FILE: alphafold/model/common_modules.py
function get_initializer_scale (line 31) | def get_initializer_scale(initializer_name, input_shape):
class Linear (line 54) | class Linear(hk.Module):
method __init__ (line 62) | def __init__(
method __call__ (line 98) | def __call__(self, inputs):
class LayerNorm (line 138) | class LayerNorm(hk.LayerNorm):
method __init__ (line 146) | def __init__(
method __call__ (line 172) | def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
FILE: alphafold/model/config.py
class MaskedMsa (line 689) | class MaskedMsa(base_config.BaseConfig):
class CommonData (line 696) | class CommonData(base_config.BaseConfig):
class EvalData (line 708) | class EvalData(base_config.BaseConfig):
class Data (line 719) | class Data(base_config.BaseConfig):
class MsaRowAttentionWithPairBias (line 724) | class MsaRowAttentionWithPairBias(base_config.BaseConfig):
class MsaColumnAttention (line 732) | class MsaColumnAttention(base_config.BaseConfig):
class MsaTransition (line 740) | class MsaTransition(base_config.BaseConfig):
class OuterProductMean (line 747) | class OuterProductMean(base_config.BaseConfig):
class TriangleAttention (line 756) | class TriangleAttention(base_config.BaseConfig):
class TriangleMultiplication (line 764) | class TriangleMultiplication(base_config.BaseConfig):
class PairTransition (line 773) | class PairTransition(base_config.BaseConfig):
class Evoformer (line 780) | class Evoformer(base_config.BaseConfig):
class TemplateAttention (line 792) | class TemplateAttention(base_config.BaseConfig):
class DgramFeatures (line 799) | class DgramFeatures(base_config.BaseConfig):
class TemplatePairStackAttention (line 805) | class TemplatePairStackAttention(base_config.BaseConfig):
class TemplatePairStackTriangleMultiplication (line 815) | class TemplatePairStackTriangleMultiplication(base_config.BaseConfig):
class TemplatePairStackTransition (line 824) | class TemplatePairStackTransition(base_config.BaseConfig):
class TemplatePairStack (line 831) | class TemplatePairStack(base_config.BaseConfig):
class Template (line 840) | class Template(base_config.BaseConfig):
class PrevPos (line 852) | class PrevPos(base_config.BaseConfig):
class EmbeddingsAndEvoformer (line 858) | class EmbeddingsAndEvoformer(base_config.BaseConfig):
class GlobalConfig (line 881) | class GlobalConfig(base_config.BaseConfig):
class DistogramHead (line 892) | class DistogramHead(base_config.BaseConfig):
class PredictedAlignedErrorHead (line 899) | class PredictedAlignedErrorHead(base_config.BaseConfig):
class ExperimentallyResolvedHead (line 909) | class ExperimentallyResolvedHead(base_config.BaseConfig):
class Fape (line 916) | class Fape(base_config.BaseConfig):
class Sidechain (line 922) | class Sidechain(base_config.BaseConfig):
class StructureModuleHead (line 931) | class StructureModuleHead(base_config.BaseConfig):
class PredictedLDDTHead (line 957) | class PredictedLDDTHead(base_config.BaseConfig):
class MaskedMSAHead (line 966) | class MaskedMSAHead(base_config.BaseConfig):
class Heads (line 971) | class Heads(base_config.BaseConfig):
class Model (line 980) | class Model(base_config.BaseConfig):
class AlphaFoldConfig (line 990) | class AlphaFoldConfig(base_config.BaseConfig):
function model_config (line 995) | def model_config(name: str) -> ml_collections.ConfigDict:
function get_model_config (line 1009) | def get_model_config(name: str, frozen: bool = True) -> AlphaFoldConfig:
function _apply_model_1_diff (line 1025) | def _apply_model_1_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_2_diff (line 1034) | def _apply_model_2_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_3_diff (line 1042) | def _apply_model_3_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_4_diff (line 1047) | def _apply_model_4_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_5_diff (line 1052) | def _apply_model_5_diff(cfg: AlphaFoldConfig) -> None: # pylint: disabl...
function _apply_model_1_ptm_diff (line 1056) | def _apply_model_1_ptm_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_2_ptm_diff (line 1066) | def _apply_model_2_ptm_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_3_ptm_diff (line 1075) | def _apply_model_3_ptm_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_4_ptm_diff (line 1081) | def _apply_model_4_ptm_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_5_ptm_diff (line 1087) | def _apply_model_5_ptm_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_1_multimer_v3_diff (line 1091) | def _apply_model_1_multimer_v3_diff(cfg: AlphaFoldConfig) -> None: # py...
function _apply_model_2_multimer_v3_diff (line 1095) | def _apply_model_2_multimer_v3_diff(cfg: AlphaFoldConfig) -> None: # py...
function _apply_model_3_multimer_v3_diff (line 1099) | def _apply_model_3_multimer_v3_diff(cfg: AlphaFoldConfig) -> None: # py...
function _apply_model_4_multimer_v3_diff (line 1103) | def _apply_model_4_multimer_v3_diff(cfg: AlphaFoldConfig) -> None:
function _apply_model_5_multimer_v3_diff (line 1107) | def _apply_model_5_multimer_v3_diff(cfg: AlphaFoldConfig) -> None:
function _common_updates (line 1111) | def _common_updates(cfg: AlphaFoldConfig) -> None:
FILE: alphafold/model/config_test.py
class ConfigTest (line 22) | class ConfigTest(parameterized.TestCase):
method test_config_dict_and_dataclass_agree (line 25) | def test_config_dict_and_dataclass_agree(self, model_name):
FILE: alphafold/model/data.py
function get_model_haiku_params (line 25) | def get_model_haiku_params(model_name: str, data_dir: str) -> hk.Params:
FILE: alphafold/model/features.py
function make_data_config (line 28) | def make_data_config(
function np_example_to_features (line 45) | def np_example_to_features(
FILE: alphafold/model/folding.py
function squared_difference (line 33) | def squared_difference(x, y):
class InvariantPointAttention (line 37) | class InvariantPointAttention(hk.Module):
method __init__ (line 50) | def __init__(
method __call__ (line 74) | def __call__(self, inputs_1d, inputs_2d, mask, affine):
class FoldIteration (line 290) | class FoldIteration(hk.Module):
method __init__ (line 301) | def __init__(self, config, global_config, name='fold_iteration'):
method __call__ (line 306) | def __call__(
function generate_affines (line 397) | def generate_affines(
class StructureModule (line 465) | class StructureModule(hk.Module):
method __init__ (line 471) | def __init__(
method __call__ (line 479) | def __call__(self, representations, batch, is_training, safe_key=None):
method loss (line 527) | def loss(self, value, batch):
function compute_renamed_ground_truth (line 573) | def compute_renamed_ground_truth(
function backbone_loss (line 634) | def backbone_loss(ret, batch, value, config):
function sidechain_loss (line 701) | def sidechain_loss(batch, value, config):
function structural_violation_loss (line 746) | def structural_violation_loss(ret, batch, value, config):
function find_structural_violations (line 765) | def find_structural_violations(
function compute_violation_metrics (line 868) | def compute_violation_metrics(
function supervised_chi_loss (line 907) | def supervised_chi_loss(ret, batch, value, config):
function generate_new_affine (line 973) | def generate_new_affine(sequence_mask):
function l2_normalize (line 983) | def l2_normalize(x, axis=-1, epsilon=1e-12):
class MultiRigidSidechain (line 989) | class MultiRigidSidechain(hk.Module):
method __init__ (line 992) | def __init__(self, config, global_config, name='rigid_sidechain'):
method __call__ (line 997) | def __call__(self, affine, representations_list, aatype):
FILE: alphafold/model/folding_multimer.py
function squared_difference (line 40) | def squared_difference(x: jnp.ndarray, y: jnp.ndarray) -> jnp.ndarray:
function make_backbone_affine (line 45) | def make_backbone_affine(
class QuatRigid (line 65) | class QuatRigid(hk.Module):
method __init__ (line 68) | def __init__(
method __call__ (line 102) | def __call__(self, activations: jnp.ndarray) -> geometry.Rigid3Array:
class PointProjection (line 144) | class PointProjection(hk.Module):
method __init__ (line 147) | def __init__(
method __call__ (line 174) | def __call__(
class InvariantPointAttention (line 194) | class InvariantPointAttention(hk.Module):
method __init__ (line 205) | def __init__(
method __call__ (line 229) | def __call__(
class FoldIteration (line 384) | class FoldIteration(hk.Module):
method __init__ (line 393) | def __init__(
method __call__ (line 403) | def __call__(
function generate_monomer_rigids (line 483) | def generate_monomer_rigids(
class StructureModule (line 552) | class StructureModule(hk.Module):
method __init__ (line 558) | def __init__(
method __call__ (line 568) | def __call__(
method loss (line 631) | def loss(
function compute_atom14_gt (line 767) | def compute_atom14_gt(
function backbone_loss (line 798) | def backbone_loss(
function compute_frames (line 827) | def compute_frames(
function sidechain_loss (line 860) | def sidechain_loss(
function structural_violation_loss (line 900) | def structural_violation_loss(
function find_structural_violations (line 922) | def find_structural_violations(
function compute_violation_metrics (line 1030) | def compute_violation_metrics(
function supervised_chi_loss (line 1067) | def supervised_chi_loss(
function l2_normalize (line 1119) | def l2_normalize(
function get_renamed_chi_angles (line 1127) | def get_renamed_chi_angles(
class MultiRigidSidechain (line 1143) | class MultiRigidSidechain(hk.Module):
method __init__ (line 1146) | def __init__(
method __call__ (line 1156) | def __call__(
FILE: alphafold/model/geometry/rigid_matrix_vector.py
class Rigid3Array (line 32) | class Rigid3Array:
method __matmul__ (line 38) | def __matmul__(self, other: Rigid3Array) -> Rigid3Array:
method inverse (line 43) | def inverse(self) -> Rigid3Array:
method apply_to_point (line 49) | def apply_to_point(self, point: vector.Vec3Array) -> vector.Vec3Array:
method apply_inverse_to_point (line 53) | def apply_inverse_to_point(self, point: vector.Vec3Array) -> vector.Ve...
method compose_rotation (line 58) | def compose_rotation(self, other_rotation):
method identity (line 66) | def identity(cls, shape, dtype=jnp.float32) -> Rigid3Array:
method scale_translation (line 73) | def scale_translation(self, factor: Float) -> Rigid3Array:
method to_array (line 77) | def to_array(self):
method from_array (line 83) | def from_array(cls, array):
method from_array4x4 (line 89) | def from_array4x4(cls, array: jnp.ndarray) -> Rigid3Array:
method __getstate__ (line 103) | def __getstate__(self):
method __setstate__ (line 106) | def __setstate__(self, state):
FILE: alphafold/model/geometry/rotation_matrix.py
class Rot3Array (line 33) | class Rot3Array:
method inverse (line 48) | def inverse(self) -> Rot3Array:
method apply_to_point (line 56) | def apply_to_point(self, point: vector.Vec3Array) -> vector.Vec3Array:
method apply_inverse_to_point (line 64) | def apply_inverse_to_point(self, point: vector.Vec3Array) -> vector.Ve...
method __matmul__ (line 68) | def __matmul__(self, other: Rot3Array) -> Rot3Array:
method identity (line 76) | def identity(cls, shape, dtype=jnp.float32) -> Rot3Array:
method from_two_vectors (line 83) | def from_two_vectors(
method from_array (line 108) | def from_array(cls, array: jnp.ndarray) -> Rot3Array:
method to_array (line 114) | def to_array(self) -> jnp.ndarray:
method from_quaternion (line 126) | def from_quaternion(
method random_uniform (line 154) | def random_uniform(cls, key, shape, dtype=jnp.float32) -> Rot3Array:
method __getstate__ (line 160) | def __getstate__(self):
method __setstate__ (line 163) | def __setstate__(self, state):
FILE: alphafold/model/geometry/struct_of_array.py
function get_item (line 21) | def get_item(instance, key):
function get_shape (line 33) | def get_shape(instance):
function get_len (line 44) | def get_len(instance):
function get_dtype (line 54) | def get_dtype(instance):
function replace (line 79) | def replace(instance, **kwargs):
function post_init (line 83) | def post_init(instance):
function flatten (line 135) | def flatten(instance):
function make_metadata_class (line 151) | def make_metadata_class(cls):
function get_fields (line 164) | def get_fields(cls_or_instance, filterfn, return_values=False):
function get_array_fields (line 175) | def get_array_fields(cls, return_values=False):
function get_metadata_fields (line 183) | def get_metadata_fields(cls, return_values=False):
class StructOfArray (line 191) | class StructOfArray:
method __init__ (line 194) | def __init__(self, same_dtype=True):
method __call__ (line 197) | def __call__(self, cls):
FILE: alphafold/model/geometry/test_utils.py
function assert_rotation_matrix_equal (line 25) | def assert_rotation_matrix_equal(
function assert_rotation_matrix_close (line 35) | def assert_rotation_matrix_close(
function assert_array_equal_to_rotation_matrix (line 41) | def assert_array_equal_to_rotation_matrix(
function assert_array_close_to_rotation_matrix (line 56) | def assert_array_close_to_rotation_matrix(
function assert_vectors_equal (line 62) | def assert_vectors_equal(vec1: vector.Vec3Array, vec2: vector.Vec3Array):
function assert_vectors_close (line 68) | def assert_vectors_close(vec1: vector.Vec3Array, vec2: vector.Vec3Array):
function assert_array_close_to_vector (line 74) | def assert_array_close_to_vector(array: jnp.ndarray, vec: vector.Vec3Arr...
function assert_array_equal_to_vector (line 78) | def assert_array_equal_to_vector(array: jnp.ndarray, vec: vector.Vec3Arr...
function assert_rigid_equal_to_rigid (line 82) | def assert_rigid_equal_to_rigid(
function assert_rigid_close_to_rigid (line 89) | def assert_rigid_close_to_rigid(
function assert_rot_trans_equal_to_rigid (line 96) | def assert_rot_trans_equal_to_rigid(
function assert_rot_trans_close_to_rigid (line 105) | def assert_rot_trans_close_to_rigid(
FILE: alphafold/model/geometry/utils.py
function unstack (line 21) | def unstack(value: jnp.ndarray, axis: int = -1) -> List[jnp.ndarray]:
FILE: alphafold/model/geometry/vector.py
class Vec3Array (line 33) | class Vec3Array:
method __post_init__ (line 49) | def __post_init__(self):
method __add__ (line 56) | def __add__(self, other: Vec3Array) -> Vec3Array:
method __sub__ (line 59) | def __sub__(self, other: Vec3Array) -> Vec3Array:
method __mul__ (line 62) | def __mul__(self, other: Float) -> Vec3Array:
method __rmul__ (line 65) | def __rmul__(self, other: Float) -> Vec3Array:
method __truediv__ (line 68) | def __truediv__(self, other: Float) -> Vec3Array:
method __neg__ (line 71) | def __neg__(self) -> Vec3Array:
method __pos__ (line 74) | def __pos__(self) -> Vec3Array:
method cross (line 77) | def cross(self, other: Vec3Array) -> Vec3Array:
method dot (line 84) | def dot(self, other: Vec3Array) -> Float:
method norm (line 88) | def norm(self, epsilon: float = 1e-6) -> Float:
method norm2 (line 96) | def norm2(self):
method normalized (line 99) | def normalized(self, epsilon: float = 1e-6) -> Vec3Array:
method zeros (line 104) | def zeros(cls, shape, dtype=jnp.float32):
method to_array (line 112) | def to_array(self) -> jnp.ndarray:
method from_array (line 116) | def from_array(cls, array):
method __getstate__ (line 119) | def __getstate__(self):
method __setstate__ (line 125) | def __setstate__(self, state):
function square_euclidean_distance (line 132) | def square_euclidean_distance(
function dot (line 154) | def dot(vector1: Vec3Array, vector2: Vec3Array) -> Float:
function cross (line 158) | def cross(vector1: Vec3Array, vector2: Vec3Array) -> Float:
function norm (line 162) | def norm(vector: Vec3Array, epsilon: float = 1e-6) -> Float:
function normalized (line 166) | def normalized(vector: Vec3Array, epsilon: float = 1e-6) -> Vec3Array:
function euclidean_distance (line 170) | def euclidean_distance(
function dihedral_angle (line 190) | def dihedral_angle(
function random_gaussian_vector (line 219) | def random_gaussian_vector(shape, key, dtype=jnp.float32):
FILE: alphafold/model/layer_stack.py
function _check_no_varargs (line 39) | def _check_no_varargs(f):
function nullcontext (line 52) | def nullcontext():
function maybe_with_rng (line 56) | def maybe_with_rng(key):
function maybe_fold_in (line 63) | def maybe_fold_in(key, data):
class _LayerStack (line 70) | class _LayerStack(hk.Module):
method __init__ (line 73) | def __init__(self, count: int, unroll: int, name: Optional[str] = None):
method __call__ (line 79) | def __call__(self, x, *args_ys):
method _call_wrapped (line 169) | def _call_wrapped(
class _LayerStackNoState (line 177) | class _LayerStackNoState(_LayerStack):
method __init__ (line 180) | def __init__(
method _call_wrapped (line 188) | def _call_wrapped(self, args, y):
class _LayerStackWithState (line 198) | class _LayerStackWithState(_LayerStack):
method __init__ (line 201) | def __init__(
method _call_wrapped (line 208) | def _call_wrapped(self, x, *args):
function layer_stack (line 212) | def layer_stack(
FILE: alphafold/model/layer_stack_test.py
function _slice_layers_params (line 30) | def _slice_layers_params(layers_params):
class LayerStackTest (line 42) | class LayerStackTest(parameterized.TestCase):
method test_layer_stack (line 45) | def test_layer_stack(self, unroll):
method test_layer_stack_multi_args (line 86) | def test_layer_stack_multi_args(self):
method test_layer_stack_no_varargs (line 126) | def test_layer_stack_no_varargs(self):
method test_layer_stack_grads (line 156) | def test_layer_stack_grads(self, unroll):
method test_random (line 207) | def test_random(self):
method test_threading (line 227) | def test_threading(self):
method test_nested_stacks (line 252) | def test_nested_stacks(self):
method test_with_state_multi_args (line 274) | def test_with_state_multi_args(self):
method test_with_container_state (line 303) | def test_with_container_state(self):
FILE: alphafold/model/lddt.py
function lddt (line 19) | def lddt(
FILE: alphafold/model/lddt_test.py
class LddtTest (line 21) | class LddtTest(parameterized.TestCase, absltest.TestCase):
method test_lddt (line 79) | def test_lddt(self, predicted_pos, true_pos, exp_lddt):
FILE: alphafold/model/mapping.py
function _set_docstring (line 36) | def _set_docstring(docstr: str) -> Callable[[T], T]:
function _maybe_slice (line 46) | def _maybe_slice(array, i, slice_size, axis):
function _maybe_get_size (line 55) | def _maybe_get_size(array, axis):
function _expand_axes (line 62) | def _expand_axes(axes, values, name='sharded_apply'):
function sharded_map (line 70) | def sharded_map(
function sharded_apply (line 101) | def sharded_apply(
function inference_subbatch (line 224) | def inference_subbatch(
FILE: alphafold/model/model.py
function get_confidence_metrics (line 31) | def get_confidence_metrics(
class RunModel (line 72) | class RunModel:
method __init__ (line 75) | def __init__(
method init_params (line 104) | def init_params(self, feat: features.FeatureDict, random_seed: int = 0):
method process_features (line 122) | def process_features(
method eval_shape (line 143) | def eval_shape(self, feat: features.FeatureDict) -> jax.ShapeDtypeStruct:
method predict (line 153) | def predict(
FILE: alphafold/model/modules.py
function softmax_cross_entropy (line 38) | def softmax_cross_entropy(logits, labels):
function sigmoid_cross_entropy (line 44) | def sigmoid_cross_entropy(logits, labels):
function apply_dropout (line 53) | def apply_dropout(*, tensor, safe_key, rate, is_training, broadcast_dim=...
function dropout_wrapper (line 66) | def dropout_wrapper(
function create_extra_msa_feature (line 108) | def create_extra_msa_feature(batch):
class AlphaFoldIteration (line 134) | class AlphaFoldIteration(hk.Module):
method __init__ (line 145) | def __init__(self, config, global_config, name='alphafold_iteration'):
method __call__ (line 150) | def __call__(
class AlphaFold (line 289) | class AlphaFold(hk.Module):
method __init__ (line 295) | def __init__(self, config, name='alphafold'):
method __call__ (line 300) | def __call__(
class TemplatePairStack (line 414) | class TemplatePairStack(hk.Module):
method __init__ (line 420) | def __init__(self, config, global_config, name='template_pair_stack'):
method __call__ (line 425) | def __call__(self, pair_act, pair_mask, is_training, safe_key=None):
class Transition (line 515) | class Transition(hk.Module):
method __init__ (line 522) | def __init__(self, config, global_config, name='transition_block'):
method __call__ (line 527) | def __call__(self, act, mask, is_training=True):
function glorot_uniform (line 573) | def glorot_uniform():
class Attention (line 579) | class Attention(hk.Module):
method __init__ (line 582) | def __init__(self, config, global_config, output_dim, name='attention'):
method __call__ (line 589) | def __call__(self, q_data, m_data, mask, nonbatched_bias=None):
class GlobalAttention (line 686) | class GlobalAttention(hk.Module):
method __init__ (line 692) | def __init__(self, config, global_config, output_dim, name='attention'):
method __call__ (line 699) | def __call__(self, q_data, m_data, q_mask):
class MSARowAttentionWithPairBias (line 795) | class MSARowAttentionWithPairBias(hk.Module):
method __init__ (line 801) | def __init__(
method __call__ (line 808) | def __call__(self, msa_act, msa_mask, pair_act, is_training=False):
class MSAColumnAttention (line 858) | class MSAColumnAttention(hk.Module):
method __init__ (line 864) | def __init__(self, config, global_config, name='msa_column_attention'):
method __call__ (line 869) | def __call__(self, msa_act, msa_mask, is_training=False):
class MSAColumnGlobalAttention (line 910) | class MSAColumnGlobalAttention(hk.Module):
method __init__ (line 916) | def __init__(self, config, global_config, name='msa_column_global_atte...
method __call__ (line 921) | def __call__(self, msa_act, msa_mask, is_training=False):
class TriangleAttention (line 963) | class TriangleAttention(hk.Module):
method __init__ (line 970) | def __init__(self, config, global_config, name='triangle_attention'):
method __call__ (line 975) | def __call__(self, pair_act, pair_mask, is_training=False):
class MaskedMsaHead (line 1027) | class MaskedMsaHead(hk.Module):
method __init__ (line 1036) | def __init__(self, config, global_config, name='masked_msa_head'):
method __call__ (line 1046) | def __call__(self, representations, batch, is_training):
method loss (line 1068) | def loss(self, value, batch):
class PredictedLDDTHead (line 1079) | class PredictedLDDTHead(hk.Module):
method __init__ (line 1086) | def __init__(self, config, global_config, name='predicted_lddt_head'):
method __call__ (line 1091) | def __call__(self, representations, batch, is_training):
method loss (line 1133) | def loss(self, value, batch):
class PredictedAlignedErrorHead (line 1181) | class PredictedAlignedErrorHead(hk.Module):
method __init__ (line 1188) | def __init__(
method __call__ (line 1195) | def __call__(self, representations, batch, is_training):
method loss (line 1224) | def loss(self, value, batch):
class ExperimentallyResolvedHead (line 1284) | class ExperimentallyResolvedHead(hk.Module):
method __init__ (line 1291) | def __init__(
method __call__ (line 1298) | def __call__(self, representations, batch, is_training):
method loss (line 1320) | def loss(self, value, batch):
function _layer_norm (line 1344) | def _layer_norm(axis=-1, name='layer_norm'):
class TriangleMultiplication (line 1358) | class TriangleMultiplication(hk.Module):
method __init__ (line 1365) | def __init__(self, config, global_config, name='triangle_multiplicatio...
method __call__ (line 1370) | def __call__(self, left_act, left_mask, is_training=True):
method _triangle_multiplication (line 1389) | def _triangle_multiplication(self, left_act, left_mask):
method _fused_triangle_multiplication (line 1471) | def _fused_triangle_multiplication(self, left_act, left_mask):
class DistogramHead (line 1519) | class DistogramHead(hk.Module):
method __init__ (line 1525) | def __init__(self, config, global_config, name='distogram_head'):
method __call__ (line 1530) | def __call__(self, representations, batch, is_training):
method loss (line 1559) | def loss(self, value, batch):
function _distogram_log_loss (line 1565) | def _distogram_log_loss(logits, bin_edges, batch, num_bins):
class OuterProductMean (line 1600) | class OuterProductMean(hk.Module):
method __init__ (line 1606) | def __init__(
method __call__ (line 1614) | def __call__(self, act, mask, is_training=True):
function dgram_from_positions (line 1692) | def dgram_from_positions(positions, num_bins, min_bin, max_bin):
function pseudo_beta_fn (line 1729) | def pseudo_beta_fn(aatype, all_atom_positions, all_atom_masks):
class EvoformerIteration (line 1751) | class EvoformerIteration(hk.Module):
method __init__ (line 1757) | def __init__(
method __call__ (line 1765) | def __call__(self, activations, masks, is_training=True, safe_key=None):
class EmbeddingsAndEvoformer (line 1904) | class EmbeddingsAndEvoformer(hk.Module):
method __init__ (line 1911) | def __init__(self, config, global_config, name='evoformer'):
method __call__ (line 1916) | def __call__(self, batch, is_training, safe_key=None):
class SingleTemplateEmbedding (line 2150) | class SingleTemplateEmbedding(hk.Module):
method __init__ (line 2156) | def __init__(self, config, global_config, name='single_template_embedd...
method __call__ (line 2161) | def __call__(self, query_embedding, batch, mask_2d, is_training):
class TemplateEmbedding (line 2258) | class TemplateEmbedding(hk.Module):
method __init__ (line 2265) | def __init__(self, config, global_config, name='template_embedding'):
method __call__ (line 2270) | def __call__(self, query_embedding, template_batch, mask_2d, is_traini...
FILE: alphafold/model/modules_multimer.py
function reduce_fn (line 41) | def reduce_fn(x, mode):
function gumbel_noise (line 52) | def gumbel_noise(key: jnp.ndarray, shape: Sequence[int]) -> jnp.ndarray:
function gumbel_max_sample (line 73) | def gumbel_max_sample(key: jnp.ndarray, logits: jnp.ndarray) -> jnp.ndar...
function gumbel_argsort_sample_idx (line 92) | def gumbel_argsort_sample_idx(
function make_masked_msa (line 121) | def make_masked_msa(batch, key, config, epsilon=1e-6):
function nearest_neighbor_clusters (line 164) | def nearest_neighbor_clusters(batch, gap_agreement_weight=0.0):
function create_msa_feat (line 212) | def create_msa_feat(batch):
function create_extra_msa_feature (line 236) | def create_extra_msa_feature(batch, num_extra_msa):
function sample_msa (line 266) | def sample_msa(key, batch, max_seq):
function make_msa_profile (line 301) | def make_msa_profile(batch):
class AlphaFoldIteration (line 310) | class AlphaFoldIteration(hk.Module):
method __init__ (line 318) | def __init__(self, config, global_config, name='alphafold_iteration'):
method __call__ (line 323) | def __call__(
class AlphaFold (line 427) | class AlphaFold(hk.Module):
method __init__ (line 430) | def __init__(self, config, name='alphafold'):
method __call__ (line 435) | def __call__(
class EmbeddingsAndEvoformer (line 539) | class EmbeddingsAndEvoformer(hk.Module):
method __init__ (line 545) | def __init__(self, config, global_config, name='evoformer'):
method _relative_encoding (line 550) | def _relative_encoding(self, batch):
method __call__ (line 629) | def __call__(self, batch, is_training, safe_key=None):
class TemplateEmbedding (line 846) | class TemplateEmbedding(hk.Module):
method __init__ (line 849) | def __init__(self, config, global_config, name='template_embedding'):
method __call__ (line 854) | def __call__(
class SingleTemplateEmbedding (line 941) | class SingleTemplateEmbedding(hk.Module):
method __init__ (line 944) | def __init__(self, config, global_config, name='single_template_embedd...
method __call__ (line 949) | def __call__(
class TemplateEmbeddingIteration (line 1110) | class TemplateEmbeddingIteration(hk.Module):
method __init__ (line 1113) | def __init__(
method __call__ (line 1120) | def __call__(self, act, pair_mask, is_training=True, safe_key=None):
function template_embedding_1d (line 1196) | def template_embedding_1d(batch, num_channel, global_config):
FILE: alphafold/model/prng.py
function safe_dropout (line 21) | def safe_dropout(*, tensor, safe_key, rate, is_deterministic, is_training):
class SafeKey (line 28) | class SafeKey:
method __init__ (line 31) | def __init__(self, key):
method _assert_not_used (line 35) | def _assert_not_used(self):
method get (line 39) | def get(self):
method split (line 44) | def split(self, num_keys=2):
method duplicate (line 50) | def duplicate(self, num_keys=2):
function _safe_key_flatten (line 56) | def _safe_key_flatten(safe_key):
function _safe_key_unflatten (line 61) | def _safe_key_unflatten(aux_data, children):
FILE: alphafold/model/prng_test.py
class PrngTest (line 20) | class PrngTest(absltest.TestCase):
method test_key_reuse (line 22) | def test_key_reuse(self):
FILE: alphafold/model/quat_affine.py
function rot_to_quat (line 93) | def rot_to_quat(rot, unstack_inputs=False):
function rot_list_to_tensor (line 128) | def rot_list_to_tensor(rot_list):
function vec_list_to_tensor (line 140) | def vec_list_to_tensor(vec_list):
function quat_to_rot (line 145) | def quat_to_rot(normalized_quat):
function quat_multiply_by_vec (line 161) | def quat_multiply_by_vec(quat, vec):
function quat_multiply (line 169) | def quat_multiply(quat1, quat2):
function apply_rot_to_vec (line 177) | def apply_rot_to_vec(rot, vec, unstack=False):
function apply_inverse_rot_to_vec (line 190) | def apply_inverse_rot_to_vec(rot, vec):
class QuatAffine (line 200) | class QuatAffine(object):
method __init__ (line 203) | def __init__(
method to_tensor (line 249) | def to_tensor(self):
method apply_tensor_fn (line 256) | def apply_tensor_fn(self, tensor_fn):
method apply_rotation_tensor_fn (line 265) | def apply_rotation_tensor_fn(self, tensor_fn):
method scale_translation (line 274) | def scale_translation(self, position_scale):
method from_tensor (line 285) | def from_tensor(cls, tensor, normalize=False):
method pre_compose (line 291) | def pre_compose(self, update):
method apply_to_point (line 322) | def apply_to_point(self, point, extra_dims=0):
method invert_point (line 349) | def invert_point(self, transformed_point, extra_dims=0):
method __repr__ (line 377) | def __repr__(self):
function _multiply (line 381) | def _multiply(a, b):
function make_canonical_transform (line 401) | def make_canonical_transform(
function make_transform_from_reference (line 484) | def make_transform_from_reference(
FILE: alphafold/model/quat_affine_test.py
class QuatAffineTest (line 31) | class QuatAffineTest(absltest.TestCase):
method _assert_check (line 33) | def _assert_check(self, to_check, tol=1e-5):
method test_conversion (line 41) | def test_conversion(self):
method test_double_cover (line 67) | def test_double_cover(self):
method test_homomorphism (line 81) | def test_homomorphism(self):
method test_batching (line 114) | def test_batching(self):
method assertAllClose (line 138) | def assertAllClose(self, a, b, rtol=1e-06, atol=1e-06):
method assertAllEqual (line 141) | def assertAllEqual(self, a, b):
FILE: alphafold/model/r3.py
function squared_difference (line 56) | def squared_difference(x, y):
function invert_rigids (line 60) | def invert_rigids(r: Rigids) -> Rigids:
function invert_rots (line 68) | def invert_rots(m: Rots) -> Rots:
function rigids_from_3_points (line 73) | def rigids_from_3_points(
function rigids_from_list (line 101) | def rigids_from_list(l: List[jnp.ndarray]) -> Rigids:
function rigids_from_quataffine (line 107) | def rigids_from_quataffine(a: quat_affine.QuatAffine) -> Rigids:
function rigids_from_tensor4x4 (line 112) | def rigids_from_tensor4x4(
function rigids_from_tensor_flat9 (line 143) | def rigids_from_tensor_flat9(
function rigids_from_tensor_flat12 (line 154) | def rigids_from_tensor_flat12(
function rigids_mul_rigids (line 163) | def rigids_mul_rigids(a: Rigids, b: Rigids) -> Rigids:
function rigids_mul_rots (line 171) | def rigids_mul_rots(r: Rigids, m: Rots) -> Rigids:
function rigids_mul_vecs (line 176) | def rigids_mul_vecs(r: Rigids, v: Vecs) -> Vecs:
function rigids_to_list (line 181) | def rigids_to_list(r: Rigids) -> List[jnp.ndarray]:
function rigids_to_quataffine (line 186) | def rigids_to_quataffine(r: Rigids) -> quat_affine.QuatAffine:
function rigids_to_tensor_flat9 (line 199) | def rigids_to_tensor_flat9(
function rigids_to_tensor_flat12 (line 210) | def rigids_to_tensor_flat12(
function rots_from_tensor3x3 (line 217) | def rots_from_tensor3x3(
function rots_from_two_vecs (line 236) | def rots_from_two_vecs(e0_unnormalized: Vecs, e1_unnormalized: Vecs) -> ...
function rots_mul_rots (line 267) | def rots_mul_rots(a: Rots, b: Rots) -> Rots:
function rots_mul_vecs (line 275) | def rots_mul_vecs(m: Rots, v: Vecs) -> Vecs:
function vecs_add (line 284) | def vecs_add(v1: Vecs, v2: Vecs) -> Vecs:
function vecs_dot_vecs (line 289) | def vecs_dot_vecs(v1: Vecs, v2: Vecs) -> jnp.ndarray:
function vecs_cross_vecs (line 294) | def vecs_cross_vecs(v1: Vecs, v2: Vecs) -> Vecs:
function vecs_from_tensor (line 303) | def vecs_from_tensor(x: jnp.ndarray) -> Vecs: # shape (..., 3) # shape...
function vecs_robust_normalize (line 310) | def vecs_robust_normalize(v: Vecs, epsilon: float = 1e-8) -> Vecs:
function vecs_robust_norm (line 324) | def vecs_robust_norm(v: Vecs, epsilon: float = 1e-8) -> jnp.ndarray:
function vecs_sub (line 337) | def vecs_sub(v1: Vecs, v2: Vecs) -> Vecs:
function vecs_squared_distance (line 342) | def vecs_squared_distance(v1: Vecs, v2: Vecs) -> jnp.ndarray:
function vecs_to_tensor (line 351) | def vecs_to_tensor(v: Vecs) -> jnp.ndarray: # shape (...) # shape(..., 3)
FILE: alphafold/model/tf/data_transforms.py
function cast_64bit_ints (line 35) | def cast_64bit_ints(protein):
function make_seq_mask (line 53) | def make_seq_mask(protein):
function make_template_mask (line 60) | def make_template_mask(protein):
function curry1 (line 68) | def curry1(f):
function add_distillation_flag (line 78) | def add_distillation_flag(protein, distillation):
function make_all_atom_aatype (line 85) | def make_all_atom_aatype(protein):
function fix_templates_aatype (line 90) | def fix_templates_aatype(protein):
function correct_msa_restypes (line 105) | def correct_msa_restypes(protein):
function squeeze_features (line 126) | def squeeze_features(protein):
function make_random_crop_to_size_seed (line 155) | def make_random_crop_to_size_seed(protein):
function randomly_replace_msa_with_unknown (line 162) | def randomly_replace_msa_with_unknown(protein, replace_proportion):
function sample_msa (line 186) | def sample_msa(protein, max_seq, keep_extra):
function crop_extra_msa (line 215) | def crop_extra_msa(protein, max_extra_msa):
function delete_extra_msa (line 227) | def delete_extra_msa(protein):
function block_delete_msa (line 235) | def block_delete_msa(protein, config):
function nearest_neighbor_clusters (line 278) | def nearest_neighbor_clusters(protein, gap_agreement_weight=0.0):
function summarize_clusters (line 317) | def summarize_clusters(protein):
function make_msa_mask (line 343) | def make_msa_mask(protein):
function pseudo_beta_fn (line 354) | def pseudo_beta_fn(aatype, all_atom_positions, all_atom_masks):
function make_pseudo_beta (line 376) | def make_pseudo_beta(protein, prefix=''):
function add_constant_field (line 390) | def add_constant_field(protein, key, value):
function shaped_categorical (line 395) | def shaped_categorical(probs, epsilon=1e-10):
function make_hhblits_profile (line 404) | def make_hhblits_profile(protein):
function make_masked_msa (line 417) | def make_masked_msa(protein, config, replace_fraction):
function make_fixed_size (line 452) | def make_fixed_size(
function make_msa_feat (line 491) | def make_msa_feat(protein):
function select_feat (line 538) | def select_feat(protein, feature_list):
function crop_templates (line 543) | def crop_templates(protein, max_templates):
function random_crop_to_size (line 551) | def random_crop_to_size(
function make_atom14_masks (line 628) | def make_atom14_masks(protein):
FILE: alphafold/model/tf/input_pipeline.py
function nonensembled_map_fns (line 34) | def nonensembled_map_fns(data_config):
function ensembled_map_fns (line 65) | def ensembled_map_fns(data_config):
function process_tensors_from_config (line 131) | def process_tensors_from_config(tensors, data_config):
function compose (line 165) | def compose(x, fs):
FILE: alphafold/model/tf/protein_features.py
class FeatureType (line 25) | class FeatureType(enum.Enum):
function register_feature (line 75) | def register_feature(
function shape (line 84) | def shape(
FILE: alphafold/model/tf/protein_features_test.py
class FeaturesTest (line 19) | class FeaturesTest(absltest.TestCase):
method testFeatureNames (line 21) | def testFeatureNames(self):
method testReplacement (line 30) | def testReplacement(self):
FILE: alphafold/model/tf/proteins_dataset.py
function _first (line 24) | def _first(tensor: tf.Tensor) -> tf.Tensor:
function parse_reshape_logic (line 29) | def parse_reshape_logic(
function _make_features_metadata (line 95) | def _make_features_metadata(
function np_to_tensor_dict (line 109) | def np_to_tensor_dict(
FILE: alphafold/model/tf/shape_helpers.py
function shape_list (line 19) | def shape_list(x):
FILE: alphafold/model/tf/shape_helpers_test.py
class ShapeTest (line 20) | class ShapeTest(tf.test.TestCase):
method setUp (line 22) | def setUp(self):
method test_shape_list (line 26) | def test_shape_list(self):
FILE: alphafold/model/tf/utils.py
class SeedMaker (line 19) | class SeedMaker(object):
method __init__ (line 22) | def __init__(self, initial_seed=0):
method __call__ (line 25) | def __call__(self):
function make_random_seed (line 34) | def make_random_seed():
FILE: alphafold/model/utils.py
function stable_softmax (line 30) | def stable_softmax(logits: jax.Array) -> jax.Array:
function bfloat16_creator (line 44) | def bfloat16_creator(next_creator, shape, dtype, init, context):
function bfloat16_getter (line 51) | def bfloat16_getter(next_getter, value, context):
function bfloat16_context (line 60) | def bfloat16_context():
function final_init (line 65) | def final_init(config):
function batched_gather (line 72) | def batched_gather(params, indices, axis=0, batch_dims=0):
function mask_mean (line 80) | def mask_mean(mask, value, axis=None, drop_mask_channel=False, eps=1e-10):
function flat_params_to_haiku (line 112) | def flat_params_to_haiku(params: Mapping[str, np.ndarray]) -> hk.Params:
function padding_consistent_rng (line 124) | def padding_consistent_rng(f):
function ks_normal_test (line 181) | def ks_normal_test(
FILE: alphafold/notebooks/notebook_utils.py
function clean_and_validate_single_sequence (line 24) | def clean_and_validate_single_sequence(
function clean_and_validate_input_sequences (line 54) | def clean_and_validate_input_sequences(
function merge_chunked_msa (line 80) | def merge_chunked_msa(
function show_msa_info (line 112) | def show_msa_info(
function empty_placeholder_template_features (line 144) | def empty_placeholder_template_features(
function check_cell_execution_order (line 170) | def check_cell_execution_order(
FILE: alphafold/notebooks/notebook_utils_test.py
class NotebookUtilsTest (line 104) | class NotebookUtilsTest(parameterized.TestCase):
method test_clean_and_validate_sequence_ok (line 113) | def test_clean_and_validate_sequence_ok(self, sequence, exp_clean):
method test_clean_and_validate_sequence_bad (line 129) | def test_clean_and_validate_sequence_bad(self, sequence, exp_error):
method test_validate_input_ok (line 141) | def test_validate_input_ok(self, input_sequences, exp_sequences):
method test_validate_input_bad (line 154) | def test_validate_input_bad(self, input_sequences, exp_error):
method test_merge_chunked_msa_no_hits (line 162) | def test_merge_chunked_msa_no_hits(self):
method test_merge_chunked_msa (line 171) | def test_merge_chunked_msa(self):
method test_show_msa_info (line 196) | def test_show_msa_info(self, mocked_stdout):
method test_empty_placeholder_template_features (line 218) | def test_empty_placeholder_template_features(self, num_templates):
method test_check_cell_execution_order_correct (line 236) | def test_check_cell_execution_order_correct(self):
method test_check_cell_execution_order_missing (line 243) | def test_check_cell_execution_order_missing(
FILE: alphafold/relax/amber_minimize.py
function will_restrain (line 40) | def will_restrain(atom: openmm_app.Atom, rset: str) -> bool:
function _add_restraints (line 49) | def _add_restraints(
function _openmm_minimize (line 79) | def _openmm_minimize(
function _get_pdb_string (line 116) | def _get_pdb_string(topology: openmm_app.Topology, positions: unit.Quant...
function _check_cleaned_atoms (line 123) | def _check_cleaned_atoms(pdb_cleaned_string: str, pdb_ref_string: str):
function _check_residues_are_well_defined (line 145) | def _check_residues_are_well_defined(prot: protein.Protein):
function _check_atom_mask_is_ideal (line 155) | def _check_atom_mask_is_ideal(prot):
function clean_protein (line 162) | def clean_protein(prot: protein.Protein, checks: bool = True):
function make_atom14_positions (line 194) | def make_atom14_positions(prot):
function find_violations (line 333) | def find_violations(prot_np: protein.Protein):
function get_violation_metrics (line 370) | def get_violation_metrics(prot: protein.Protein):
function _run_one_iteration (line 383) | def _run_one_iteration(
function run_pipeline (line 446) | def run_pipeline(
FILE: alphafold/relax/amber_minimize_test.py
function _load_test_protein (line 27) | def _load_test_protein(data_path):
class AmberMinimizeTest (line 33) | class AmberMinimizeTest(absltest.TestCase):
method test_multiple_disulfides_target (line 35) | def test_multiple_disulfides_target(self):
method test_raises_invalid_protein_assertion (line 49) | def test_raises_invalid_protein_assertion(self):
method test_iterative_relax (line 67) | def test_iterative_relax(self):
method test_find_violations (line 79) | def test_find_violations(self):
FILE: alphafold/relax/cleanup.py
function fix_pdb (line 27) | def fix_pdb(pdbfile, alterations_info):
function clean_structure (line 64) | def clean_structure(pdb_structure, alterations_info):
function _remove_heterogens (line 75) | def _remove_heterogens(fixer, alterations_info, keep_water):
function _replace_met_se (line 97) | def _replace_met_se(pdb_structure, alterations_info):
function _remove_chains_of_length_one (line 111) | def _remove_chains_of_length_one(pdb_structure, alterations_info):
FILE: alphafold/relax/cleanup_test.py
function _pdb_to_structure (line 22) | def _pdb_to_structure(pdb_str):
function _lines_to_structure (line 27) | def _lines_to_structure(pdb_lines):
class CleanupTest (line 31) | class CleanupTest(absltest.TestCase):
method test_missing_residues (line 33) | def test_missing_residues(self):
method test_missing_atoms (line 57) | def test_missing_atoms(self):
method test_remove_heterogens (line 86) | def test_remove_heterogens(self):
method test_fix_nonstandard_residues (line 105) | def test_fix_nonstandard_residues(self):
method test_replace_met_se (line 125) | def test_replace_met_se(self):
method test_remove_chains_of_length_one (line 142) | def test_remove_chains_of_length_one(self):
FILE: alphafold/relax/relax.py
class AmberRelaxation (line 23) | class AmberRelaxation(object):
method __init__ (line 26) | def __init__(
method process (line 60) | def process(
FILE: alphafold/relax/relax_test.py
class RunAmberRelaxTest (line 25) | class RunAmberRelaxTest(absltest.TestCase):
method setUp (line 27) | def setUp(self):
method test_process (line 42) | def test_process(self):
method test_unresolved_violations (line 78) | def test_unresolved_violations(self):
FILE: alphafold/relax/utils.py
function overwrite_b_factors (line 22) | def overwrite_b_factors(pdb_str: str, bfactors: np.ndarray) -> str:
function assert_equal_nonterminal_atom_types (line 64) | def assert_equal_nonterminal_atom_types(
FILE: alphafold/relax/utils_test.py
class UtilsTest (line 25) | class UtilsTest(absltest.TestCase):
method test_overwrite_b_factors (line 27) | def test_overwrite_b_factors(self):
FILE: conftest.py
function initialize_absl_flags (line 27) | def initialize_absl_flags(request):
FILE: docker/run_docker.py
function _create_mount (line 133) | def _create_mount(mount_name: str, path: str) -> Tuple[types.Mount, str]:
function main (line 159) | def main(argv):
FILE: run_alphafold.py
class ModelsToRelax (line 51) | class ModelsToRelax(enum.Enum):
function _check_flag (line 269) | def _check_flag(flag_name: str, other_flag_name: str, should_be_set: bool):
function _jnp_to_np (line 278) | def _jnp_to_np(output: Dict[str, Any]) -> Dict[str, Any]:
function _save_confidence_json_file (line 288) | def _save_confidence_json_file(
function _save_mmcif_file (line 301) | def _save_mmcif_file(
function _save_pae_json_file (line 326) | def _save_pae_json_file(
function predict_structure (line 345) | def predict_structure(
function main (line 558) | def main(argv):
FILE: run_alphafold_test.py
class RunAlphafoldTest (line 29) | class RunAlphafoldTest(parameterized.TestCase):
method test_end_to_end (line 35) | def test_end_to_end(self, models_to_relax):
Condensed preview: 118 files, each showing path, character count, and a content snippet. The full structured content is 2,431K chars.
[
{
"path": ".dockerignore",
"chars": 32,
"preview": ".dockerignore\ndocker/Dockerfile\n"
},
{
"path": "CONTRIBUTING.md",
"chars": 973,
"preview": "# How to Contribute\n\nWe welcome small patches related to bug fixes and documentation, but we do not\nplan to make any maj"
},
{
"path": "LICENSE",
"chars": 11358,
"preview": "\n Apache License\n Version 2.0, January 2004\n "
},
{
"path": "README.md",
"chars": 35362,
"preview": "\n\n# AlphaFold\n\nThis package provides an implementation of the inference pipeline of AlphaFold\n"
},
{
"path": "afdb/README.md",
"chars": 19197,
"preview": "# AlphaFold Protein Structure Database\n\n## Introduction\n\nThe AlphaFold UniProt release (214M predictions) is hosted on\n["
},
{
"path": "alphafold/__init__.py",
"chars": 663,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/common/__init__.py",
"chars": 655,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/common/confidence.py",
"chars": 8160,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/common/confidence_test.py",
"chars": 1476,
"preview": "# Copyright 2023 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/common/mmcif_metadata.py",
"chars": 7561,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/common/protein.py",
"chars": 20850,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/common/protein_test.py",
"chars": 5927,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/common/residue_constants.py",
"chars": 51758,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/common/residue_constants_test.py",
"chars": 9514,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/common/testdata/2rbg.pdb",
"chars": 225504,
"preview": "HEADER STRUCTURAL GENOMICS, UNKNOWN FUNCTION 19-SEP-07 2RBG \nTITLE CRYSTAL STRUCTURE OF HYPOTHET"
},
{
"path": "alphafold/common/testdata/5nmu.pdb",
"chars": 788048,
"preview": "MODEL 1 \nATOM 1 N GLY B 1 -13.429 "
},
{
"path": "alphafold/common/testdata/glucagon.pdb",
"chars": 51273,
"preview": "HEADER HORMONE 17-OCT-77 1GCN \nTITLE X-RAY ANALYSIS OF GLUCAGON AN"
},
{
"path": "alphafold/data/__init__.py",
"chars": 634,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/feature_processing.py",
"chars": 8694,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/mmcif_parsing.py",
"chars": 14006,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/msa_identifiers.py",
"chars": 2959,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/msa_pairing.py",
"chars": 18928,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/parsers.py",
"chars": 21349,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/pipeline.py",
"chars": 10532,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/pipeline_multimer.py",
"chars": 11161,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/templates.py",
"chars": 41424,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/tools/__init__.py",
"chars": 639,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/tools/hhblits.py",
"chars": 5470,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/tools/hhsearch.py",
"chars": 3667,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/tools/hmmbuild.py",
"chars": 4555,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/tools/hmmsearch.py",
"chars": 4558,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/tools/jackhmmer.py",
"chars": 8321,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/tools/kalign.py",
"chars": 3420,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/data/tools/utils.py",
"chars": 1223,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/__init__.py",
"chars": 617,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/all_atom.py",
"chars": 47121,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/all_atom_multimer.py",
"chars": 39595,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/all_atom_test.py",
"chars": 9428,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/base_config.py",
"chars": 6420,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/base_config_test.py",
"chars": 5036,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/common_modules.py",
"chars": 5814,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/config.py",
"chars": 39500,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/config_test.py",
"chars": 1263,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/data.py",
"chars": 1097,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/features.py",
"chars": 2343,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/folding.py",
"chars": 36649,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/folding_multimer.py",
"chars": 40774,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/geometry/__init__.py",
"chars": 1172,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/geometry/rigid_matrix_vector.py",
"chars": 4155,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/geometry/rotation_matrix.py",
"chars": 5682,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/geometry/struct_of_array.py",
"chars": 7685,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/geometry/test_utils.py",
"chars": 3893,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/geometry/utils.py",
"chars": 859,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/geometry/vector.py",
"chars": 6780,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/layer_stack.py",
"chars": 8997,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/layer_stack_test.py",
"chars": 10301,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/lddt.py",
"chars": 3535,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/lddt_test.py",
"chars": 2614,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/mapping.py",
"chars": 8219,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/model.py",
"chars": 6269,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/modules.py",
"chars": 74007,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/modules_multimer.py",
"chars": 41844,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/prng.py",
"chars": 1978,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/prng_test.py",
"chars": 1179,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/quat_affine.py",
"chars": 17206,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/quat_affine_test.py",
"chars": 4930,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/r3.py",
"chars": 10919,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/__init__.py",
"chars": 633,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/data_transforms.py",
"chars": 21254,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/input_pipeline.py",
"chars": 5322,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/protein_features.py",
"chars": 5011,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/protein_features_test.py",
"chars": 1459,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/proteins_dataset.py",
"chars": 4567,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/shape_helpers.py",
"chars": 1415,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/shape_helpers_test.py",
"chars": 1286,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/shape_placeholders.py",
"chars": 812,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/tf/utils.py",
"chars": 1040,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/model/utils.py",
"chars": 7294,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/notebooks/__init__.py",
"chars": 626,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/notebooks/notebook_utils.py",
"chars": 6720,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/notebooks/notebook_utils_test.py",
"chars": 9566,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/relax/__init__.py",
"chars": 618,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/relax/amber_minimize.py",
"chars": 18993,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/relax/amber_minimize_test.py",
"chars": 4437,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/relax/cleanup.py",
"chars": 4807,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/relax/cleanup_test.py",
"chars": 6496,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/relax/relax.py",
"chars": 3247,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/relax/relax_test.py",
"chars": 3706,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/relax/testdata/model_output.pdb",
"chars": 7861,
"preview": "ATOM 1 C MET A 1 1.921 -46.152 7.786 1.00 4.39 C \nATOM 2 CA MET A 1 1.631 "
},
{
"path": "alphafold/relax/testdata/multiple_disulfides_target.pdb",
"chars": 119447,
"preview": "MODEL 0\nATOM 1 N MET A 1 19.164 -28.457 26.130 1.00 0.00 N \nATOM 2 CA MET A "
},
{
"path": "alphafold/relax/testdata/with_violations.pdb",
"chars": 96362,
"preview": "MODEL 0\nATOM 1 N SER A 1 23.291 1.505 0.613 1.00 6.08 N \nATOM 2 CA SER A "
},
{
"path": "alphafold/relax/testdata/with_violations_casp14.pdb",
"chars": 96362,
"preview": "MODEL 0\nATOM 1 N SER A 1 27.311 -3.395 37.375 1.00 8.64 N \nATOM 2 CA SER A "
},
{
"path": "alphafold/relax/utils.py",
"chars": 2493,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/relax/utils_test.py",
"chars": 1981,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "alphafold/version.py",
"chars": 674,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "conftest.py",
"chars": 936,
"preview": "# Copyright 2025 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "docker/Dockerfile",
"chars": 3513,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "docker/requirements.txt",
"chars": 79,
"preview": "# Dependencies necessary to execute run_docker.py\nabsl-py==1.0.0\ndocker==5.0.0\n"
},
{
"path": "docker/run_docker.py",
"chars": 10933,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "docs/technical_note_v2.3.0.md",
"chars": 3774,
"preview": "# AlphaFold v2.3.0\n\nUpdate (2026-03-11): Added note about cluster filtering change.\n\nThis technical note describes updat"
},
{
"path": "notebooks/AlphaFold.ipynb",
"chars": 738,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {\n \"id\": \"pc5-mbsX9PZC\"\n },\n \"sou"
},
{
"path": "pyproject.toml",
"chars": 2448,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "requirements.txt",
"chars": 221,
"preview": "absl-py==1.0.0\nbiopython==1.79\ndm-haiku==0.0.12\ndocker==5.0.0\njax==0.4.26\nmatplotlib==3.8.0\nml-collections==0.1.0\nnumpy="
},
{
"path": "run_alphafold.py",
"chars": 23280,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "run_alphafold_test.py",
"chars": 4311,
"preview": "# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
},
{
"path": "scripts/download_all_data.sh",
"chars": 2450,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_alphafold_params.sh",
"chars": 1402,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_bfd.sh",
"chars": 1530,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_mgnify.sh",
"chars": 1438,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_pdb70.sh",
"chars": 1414,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_pdb_mmcif.sh",
"chars": 2355,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_pdb_seqres.sh",
"chars": 1494,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_small_bfd.sh",
"chars": 1362,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_uniprot.sh",
"chars": 2063,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_uniref30.sh",
"chars": 1485,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "scripts/download_uniref90.sh",
"chars": 1331,
"preview": "#!/usr/bin/env bash\n#\n# Copyright 2021 DeepMind Technologies Limited\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "server/README.md",
"chars": 9148,
"preview": "# JSON file format for AlphaFold Server jobs\n\nYou can\n[download an example JSON file here](https://github.com/google-dee"
},
{
"path": "server/example.json",
"chars": 2510,
"preview": "[\n {\n \"name\": \"Test Fold Job Number One\",\n \"modelSeeds\": [],\n \"sequences\": [\n {\n \"proteinChain\": {"
}
]
About this extraction
This page contains the full source code of the google-deepmind/alphafold GitHub repository, extracted and formatted as plain text. The extraction includes 118 files (2.3 MB), approximately 598.6k tokens, and a symbol index with 847 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract.