Computation-Driven Innovative Drug Discovery
  • Name: Patch Analysis v2.1
    Description: Calculates the size and distribution of patches (regions enriched in positively charged, negatively charged, or hydrophobic residues) on the protein surface, to help address problems such as protein aggregation. Running it from the WeView 3D structure viewer is recommended, because patch locations can then be inspected directly on the 3D structure. v2.1 update: supports setting the pH value and the CDR numbering scheme, highlights CDR residues, and outputs the CDR patch area.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-04-29 15:01:18
    Reference:

    Patch Analysis v2.1

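    Introduction

    This module calculates regions of the protein surface that are relatively enriched in electrostatic or hydrophobic character, highlighting regions that play an important role in protein-protein interactions. This is particularly useful for predicting reversible aggregation driven by weak non-covalent interactions. In particular, hydrophobic interactions have long been regarded as a major component of attractive interactions between macromolecules. In antibodies, electrostatic interactions have been implicated in self-association, while dipole-dipole interactions are thought to cause the fibrillation of β-sheets. Patch analysis can also be run on protein structures through the WeView interface.

    v2.1 updates

    • Supports setting the pH value.
    • Supports CDR numbering, highlights CDR residues, and outputs the CDR patch area.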

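    Parameters

    Structure PDB File

    Protein structure file in PDB format.

    pH

    pH value, used to determine the protonation state of the protein.

    Antibody Numbering

    Antibody numbering scheme; no_use disables numbering.

    Hydrophobic Cutoff

    Hydrophobic Cutoff is a cutoff defined in terms of hydrophobic amino acids (usually Leu, Ile, Val, Phe, Trp, and Met). It relates the number of hydrophobic residues on the surface to the surface area in order to pick out regions that may be biologically important. In general, the chemical character reported for a patch varies with its surface density and with the number of highly hydrophobic residues it contains.

    Positive Cutoff

    Positive Cutoff is a cutoff defined in terms of cationic amino acids. It relates the number of cationic residues on the surface to the surface area in order to pick out regions that may be biologically important, typically surface regions that may take part in ionic interactions.

    Negative Cutoff

    Negative Cutoff is a cutoff defined in terms of anionic amino acids. It relates the number of anionic residues on the surface to the surface area in order to pick out regions that may be biologically important, typically surface regions that may take part in ionic interactions.

    SASA Cutoff

    SASA Cutoff is a cutoff on solvent-accessible surface area; patches below the cutoff are filtered out.

    Distance Cutoff

    Distance Cutoff is an atom-distance cutoff; only atoms closer than the cutoff are assigned to the same cluster. Smaller values give smaller patches.

    Min Distance Cutoff

    Min Distance Cutoff is the distance cutoff between patches; patches closer than the cutoff are merged into one patch.

    Result Type

    Output file format: csv or json.

    Put simply, a cutoff here is a strength threshold on the electrostatic or hydrophobic potential, in kcal/mol; only surface above the threshold is counted toward the patch area. A lower threshold yields more patches.

    Keep Original

    Do not add missing atoms (including hydrogens) or optimize the structure.

    Neutral N-terminus

    Neutralize the N-terminal residue of the protein.

    Neutral C-terminus

    Neutralize the C-terminal residue of the protein.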

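    Results

    The output includes:

    Output File Name | Description
    patch_list.csv | CSV file of patch results. Focus on the Area(Å^2) value, which gives the size of each patch; the larger the patch, the more suspect it is. Pay particular attention to patches larger than 100 Å².
    input_prot.pdb | The protonated PDB structure.
    patch_list_sum.csv | Summary of the surface area occupied by each of the three patch types (Hyd: hydrophobic center, Neg: negative charge center, Pos: positive charge center). Pay particular attention to patches larger than 100 Å².

    patch_list.csv contains the following fields:

    Field Name | Description
    Type | Patch type. Hyd: hydrophobic center; Neg: negative charge center; Pos: positive charge center
    Area(Å^2) | Surface area of each patch
    Residues | Residues belonging to each patch

    patch_list_sum.csv contains the following fields:

    Field Name | Description
    Type | Patch type. Hyd: hydrophobic center; Neg: negative charge center; Pos: positive charge center
    Total Areas | Total surface area of the patches of this type
    Areas of The Largest | Surface area of the largest patch of this type
    Number of Areas More Than 100 | Number of patches larger than 100 Å²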

    References

    • Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348. DOI: 10.1142/p477
    • Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514. DOI: 10.1007/BF00124359
    • Wildman, S.A., Crippen, G.M.; Prediction of Physicochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873. DOI: 10.1021/ci990307l

    Patch Analysis v2.1

    Introduction

    Protein Patches calculates both electrostatic (excess charge) and hydrophobic surface patches to show regions of significance with respect to protein-protein interactions. This can be particularly useful in the prediction of reversible aggregation, which typically arises from relatively weak non-covalent interactions. In particular, hydrophobic interactions have long been recognized as major contributors in high affinity interactions between macromolecules. In antibodies, electrostatic interactions have been implicated in forming self-associated aggregates [Karshikoff 2006], while dipole-dipole interactions are believed to be the cause of fibrillogenic association of β-sheets. At the same time, protein structures can also be analyzed for patches through the WeView interface.
    Electrostatic patches.
    The surface electrostatic field is estimated using an exponentially decaying Debye-Hückel field with a screening length of λ_D = 3.5 Å.
    The resulting map mostly reflects excess charge close to the molecular surface.
    Significant patches are established by cutting the surface along isocontour lines where the absolute field value equals 40 kcal/mol/C and keeping the regions above that value. Finally, a default minimum patch area of 40 Å² filters out smaller, presumably less relevant, patches.
    Hydrophobicity map.
    The hydrophobic potential is calculated from the Wildman and Crippen octanol-water partition coefficients f = log P [Wildman 1999]:

    where f_i is the coefficient of atom i and g(r_i) is a Fermi-type distance-dependent weighting function proposed by Heiden et al. [Heiden 1993]:

    with r_cut = 5 Å and α = 1.5.
    As with the electrostatic (excess charge) patches, significant hydrophobic patches are established by cutting the surface along isocontour lines, retaining only the portion above a potential threshold of 0.09 kcal/mol, and filtering for a minimum patch area of 50 Å².
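
    As a rough illustration of a Fermi-weighted hydrophobic potential, the sketch below assumes a commonly quoted Heiden-style weight, g(r) = (exp(-α·r_cut) + 1) / (exp(α·(r - r_cut)) + 1), and a plain (unnormalized) sum of atomic contributions f_i·g(r_i). Both the functional form and the unnormalized sum are assumptions, not the module's confirmed implementation; the function names, coordinates, and f_i values are hypothetical.

```python
import math

R_CUT = 5.0   # Å, value quoted in the text
ALPHA = 1.5   # steepness, value quoted in the text


def fermi_weight(r, r_cut=R_CUT, alpha=ALPHA):
    """Assumed Fermi-type weight: 1 at r = 0, about 0.5 at r = r_cut, near 0 far beyond it."""
    return (math.exp(-alpha * r_cut) + 1.0) / (math.exp(alpha * (r - r_cut)) + 1.0)


def hydrophobic_potential(point, atoms):
    """Sum atomic log P contributions f_i, each weighted by its distance to `point`.

    `atoms` is a list of ((x, y, z), f_i) pairs. Using an unnormalized sum
    is an assumption made for this sketch.
    """
    return sum(f_i * fermi_weight(math.dist(point, xyz)) for xyz, f_i in atoms)


# Hypothetical atoms: coordinates in Å and made-up f_i values.
atoms = [((0.0, 0.0, 0.0), 0.5), ((3.0, 0.0, 0.0), -0.2), ((20.0, 0.0, 0.0), 0.4)]
print(hydrophobic_potential((1.0, 0.0, 0.0), atoms))
```

    With this weight, atoms well inside r_cut contribute almost fully and atoms far outside contribute almost nothing, which is what makes the map local.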

    v2.1 updates

    • Supports setting the pH value
    • Supports CDR numbering, highlights CDR residues, and outputs the CDR patch area.

    Parameters

    Structure PDB File

    Protein structure file in PDB format.

    pH

    pH value for protein protonation

    Antibody Numbering

    Antibody Numbering type, no_use indicates no antibody numbering applied.

    Hydrophobic Cutoff

    Hydrophobic Cutoff is a cutoff defined in terms of hydrophobic amino acids (usually Leu, Ile, Val, Phe, Trp, and Met). It relates the number of hydrophobic residues on the surface to the surface area in order to pick out regions that may be biologically important. In general, the chemical character reported for a patch varies with its surface density and with the number of highly hydrophobic residues it contains.

    Positive Cutoff

    Positive Cutoff is a cutoff defined in terms of cationic amino acids. It relates the number of cationic residues on the surface to the surface area in order to pick out regions that may be biologically important, typically surface regions that may take part in ionic interactions.

    Negative Cutoff

    Negative Cutoff is a cutoff defined in terms of anionic amino acids. It relates the number of anionic residues on the surface to the surface area in order to pick out regions that may be biologically important, typically surface regions that may take part in ionic interactions.

    SASA Cutoff

    SASA Cutoff is a cutoff on solvent-accessible surface area (SASA); patches below the cutoff are filtered out.

    Distance Cutoff

    Distance Cutoff is a cutoff on the distance between neighboring atoms; only atoms closer than the cutoff are assigned to the same patch. Lower values result in smaller patches.

    Min Distance Cutoff

    Min Distance Cutoff is the cutoff for the distance between neighboring patch points (Å). Patches closer than the cutoff are merged.

    Result Type

    Output file format: json or csv.

    Keep Original

    Do not add missing atoms (including hydrogens) or optimize the structure.

    Results

    The output includes:

    Output File Name | Description
    patch_list.csv | A CSV file containing the patch results. The main value of interest is Area (Å^2), the size of each patch. Larger patches are more suspect; pay particular attention to patches larger than 100 Å².
    input_prot.pdb | The protonated PDB structure.
    patch_list_sum.csv | Summarizes the surface area occupied by each of the three patch types (Hyd: hydrophobic center, Neg: negative charge center, Pos: positive charge center). Pay particular attention to patches larger than 100 Å².

    patch_list.csv contains the following fields:

    Field Name | Description
    Type | The type of patch: Hyd (hydrophobic center), Neg (negative charge center), Pos (positive charge center).
    Area (Å^2) | The surface area of each patch.
    Residues | The residues belonging to each patch.

    patch_list_sum.csv contains the following fields:

    Field Name | Description
    Type | The type of patch: Hyd (hydrophobic center), Neg (negative charge center), Pos (positive charge center).
    Total Areas | The total surface area of the patches of this type.
    Areas of The Largest | The surface area of the largest patch of this type.
    Number of Areas More Than 100 | The number of patches with an area larger than 100 Å².
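
    A minimal sketch of post-processing patch_list.csv to list the patches above the 100 Å² attention threshold. It assumes the file has a header row with the Type, Area and Residues fields described above; the exact spelling of the area column, the residue notation, and the sample rows are hypothetical, so the column name is a parameter.

```python
import csv
import io

# Hypothetical file content; real residue notation may differ.
SAMPLE = """Type,Area(Å^2),Residues
Hyd,152.3,"A:LEU45;A:PHE47"
Pos,61.0,"A:LYS12;A:ARG15"
Neg,118.9,"B:ASP30;B:GLU33"
"""


def large_patches(csv_text, area_col="Area(Å^2)", min_area=100.0):
    """Return the rows whose patch area exceeds min_area (in Å²)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row for row in rows if float(row[area_col]) > min_area]


flagged = large_patches(SAMPLE)
for row in flagged:
    print(row["Type"], row["Area(Å^2)"], row["Residues"])
```

    For a real run, read the file with open("patch_list.csv", encoding="utf-8") and pass its contents to large_patches.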

    References

    • Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348. DOI: 10.1142/p477
    • Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514. DOI: 10.1007/BF00124359
    • Wildman, S.A., Crippen, G.M.; Prediction of Physicochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873. DOI: 10.1021/ci990307l
  • Name: Immunogenicity Prediction (WeADApt v4.2)
    Description: The latest version of WeADApt (formerly AlphaMHC), an end-to-end immunogenicity prediction system developed by Wecomput on a multimodal-fusion deep learning architecture. It applies NLP techniques and a new multimodal-fusion deep neural network, trained on nearly 1 billion immunogenicity-related wet-lab data points (affinity, presentation, NGS, mass spectrometry, and other data), to predict clinical immunogenicity risk end to end from sequence. It has been validated on several hundred real clinical immunogenicity records from the FDA and EMA (including mono-/multispecific antibodies and recombinant proteins). v4.2 is the current main version: compared with v4.1, it offers improved prediction specificity and better discrimination between epitopes of different risk levels, which makes it more suitable for de-immunization engineering.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-10-18 10:50:56
    Reference:

    Immunogenicity Prediction (WeADApt v4.2)

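    Introduction

    WeADApt (Wecomput ADA prediction) is an immunogenicity prediction system developed by Wecomput on a multimodal-fusion deep learning architecture (also known as AlphaMHC).

    The method uses a new multimodal-fusion deep neural network architecture trained on nearly 1 billion immunogenicity-related wet-lab data points (affinity, presentation, NGS, mass spectrometry, and other data). It combines several immunogenicity-related models into a single immune-response simulation system that models the immunogenicity of biologics such as proteins, antibodies, peptides, and vaccines, and identifies potentially immunogenic T-cell epitopes (peptides that trigger an immune response in humans in the clinic). This gives end-to-end prediction from sequence to clinical immunogenicity risk, validated on several hundred real clinical immunogenicity records from the FDA and EMA (including mono-/multispecific antibodies and recombinant proteins).

    On the same clinical ADA dataset of 42 antibody molecules, the correlation achieved by WeADApt (v4) exceeds that of the well-known commercial software EpiMatrix (R² = 0.49 vs. R² = 0.42).
    image.png

    Version v4.2

    This is the current main version as of 2025/04/30.
    Compared with v4.1, it further improves prediction specificity and discriminates better between epitopes of different risk levels, so the results give more guidance for de-immunization engineering.

    The main improvements of v4.2 over the previous version, v4.1, are:

    • Optimized algorithm architecture
    • Test set doubled in size
    • Classification F1 improved by 18%
    • Specificity improved by 26%
    • Sensitivity improved by 4%

    Performance

    Test data:

    More than 200 molecules with known immunogenicity, together with their ADA distributions, were collected from FDA and EMA clinical trials. Predictive performance was assessed as the correlation between the model's predicted value and the observed ADA incidence.
    image.png

    Monoclonal antibodies (mAb)

    The figure below shows the results on ADA data for more than 200 clinical-stage and marketed monoclonal antibodies curated by Wecomput; the Pearson correlation between the prediction score and ADA incidence reaches R = 0.76.
    image.png

    A score of 0.2 is a suitable threshold between high and low risk for monoclonal antibodies (an ADA incidence above 20% is defined as high risk).

    Bispecific antibodies (BsAb)

    WeADApt v4 is designed to handle any molecular format: symmetric or asymmetric proteins, with or without repeated domains. Only the non-redundant chains need to be supplied (if all repeated chains are supplied, they are handled automatically).

    The figure below shows the performance on the bispecific antibody ADA dataset curated by Wecomput; the Pearson correlation between the prediction score and ADA incidence reaches R = 0.60.
    image.png

    Note one difference from v4.1: because the score distribution has changed, a score of 0.4 is the cutoff that separates high-risk and low-risk bispecific molecules well in this version.

    The system predicts effects at the sequence level only, so it is particularly suited to relative comparison and screening of molecules against the same type of target.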
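
    A minimal sketch of applying the score thresholds quoted above (0.2 for monoclonal antibodies, 0.4 for bispecific antibodies in v4.2). The function name, the format labels, and the choice of a strict "greater than" comparison at the boundary are assumptions made for illustration; the scores are made up.

```python
# Thresholds quoted in the text for WeADApt v4.2.
THRESHOLDS = {"mAb": 0.2, "BsAb": 0.4}


def risk_label(score, molecule_format):
    """Label a prediction score as high or low risk for the given format.

    Treating a score exactly at the threshold as low risk is an assumption.
    """
    return "high" if score > THRESHOLDS[molecule_format] else "low"


# Hypothetical scores for three candidates.
candidates = [("cand_1", 0.12, "mAb"), ("cand_2", 0.31, "mAb"), ("cand_3", 0.31, "BsAb")]
for name, score, fmt in candidates:
    print(name, fmt, score, risk_label(score, fmt))
```

    The same score can fall on different sides of the cutoff depending on the format, which is why the format is an explicit argument.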

  • Name: PPI Binding Energy & Contacts
    Description: Predicts protein-protein binding affinity from the properties of interfacial contacts and of the non-interacting surface.
    Tags: undefined
    Author: Li C Xue
    Release: 2025-04-24 09:39:09
    Reference: Xue LC, Rodrigues JP, Kastritis PL, Bonvin AM, Vangone A. PRODIGY: a web server for predicting the binding affinity of protein-protein complexes. Bioinformatics. 2016 Dec 1;32(23):3676-3678

    PPI Binding Energy & Contacts

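    Introduction

    This module combines interfacial contact features with non-interacting surface (NIS) features to predict protein-protein binding affinity, and can also output the residues at the contact interface. It is based on the PRODIGY model, which estimates binding affinity by linear regression on the interfacial contacts and the physicochemical properties of the NIS, both of which have been shown to have a significant effect on affinity.

    The affinity is calculated with the following formula:
    image.png

    In the formula, ICs_xxx/yyy is the number of interfacial contacts detected between the two interacting proteins, where xxx/yyy gives the types of the contacting residues (charged, polar, apolar); for example, ICs_charged/apolar is the number of contacts between charged and apolar residues. Two residues are considered to be in contact if any pair of their heavy atoms is closer than 5.5 Å.

    The model was validated on a dataset of 81 complexes; the Pearson correlation coefficient between predicted and experimental affinities is 0.73 (p < 0.0001), with a root-mean-square error (RMSE) of 1.89 kcal/mol.
    image.png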

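    Parameters

    Structure

    Structure file of the protein complex, in .pdb or .cif format. Several complex structures can be packaged together for batch prediction; supported archive formats are .tar, .tar.gz, and .zip.

    Group

    Combines several chains of the structure into groups. The chains in a group are treated as one unit, and binding affinity is calculated only between groups. Format: chain names within a group are separated by commas, and groups are separated by spaces.
    Example: H,L A treats chains H and L as one group and chain A as another, and calculates the affinity between the two groups.

    Note:

    1. If this parameter is not set, affinity is calculated for every pair of chains in contact in the structure.
    2. For antibody-antigen affinity calculations, the antibody heavy and light chains should be merged into one group, and the affinity calculated between that group and the antigen chain.

    Contacts

    Outputs the residue pairs at the inter-chain contact interface.

    Output

    File name of the prediction result; default prodigy_output.csv.

    Output_CRP

    File name of the contact-interface residue pair result; default contacts.txt.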

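    Results

    The prediction result file prodigy_output.csv contains the following columns:

    Column Name | Description
    Name | Structure name
    Binding_Affinity (kcal/mol) | Predicted binding affinity, in kcal/mol
    Dissociation_Constant (25.0˚C) | Dissociation constant at 25 °C, calculated from ΔG = RT ln K_d
    Intermolecular Contacts | Total number of contacting residue pairs
    Charged_Charged Contacts | Number of charged-charged residue contacts
    Charged_Polar Contacts | Number of charged-polar residue contacts
    Charged_Apolar Contacts | Number of charged-apolar residue contacts
    Polar_Polar Contacts | Number of polar-polar residue contacts
    Apolar_Polar Contacts | Number of apolar-polar residue contacts
    Apolar_Apolar Contacts | Number of apolar-apolar residue contacts
    Percentage of Apolar NIS | Percentage of the non-interacting surface that is apolar
    Percentage of Charged NIS | Percentage of the non-interacting surface that is charged

    The optional contact interface file Contacts.txt records one contacting residue pair per line, with residue names, numbers, and chain names.

    In batch mode, the output consists of two archives:

    • prodigy_output.tar.gz: affinity prediction results
    • Contacts.tar.gz: contacting residue pair results

    References

    • Xue LC, Rodrigues JP, Kastritis PL, Bonvin AM, Vangone A. PRODIGY: a web server for predicting the binding affinity of protein-protein complexes. Bioinformatics. 2016 Dec 1;32(23):3676–3678. DOI: 10.1093/bioinformatics/btw514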

    PPI Binding Energy & Contacts

    Introduction

    This module predicts protein-protein binding affinity by combining interfacial contact features with non-interacting surface characteristics. It also provides residue-level information for the contact interface. The module is based on the PRODIGY model, which applies linear regression using properties of interfacial contacts and non-interacting surfaces (NIS), both of which have been shown to influence binding affinity.

    The binding affinity is calculated using the following formula:
    image.png

    ICs_xxx/yyy is the number of interfacial contacts found between the two interacting proteins, categorized by the polarity/charge of the contacting residues (e.g., ICs_charged/apolar is the number of interfacial contacts between charged and apolar residues). Two residues are considered to be in contact if any of their heavy atoms are within 5.5 Å of each other.
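
    The contact rule above can be sketched as follows. This is an illustration of the 5.5 Å heavy-atom criterion only, not the module's implementation; the data layout (a dict of residue name to heavy-atom coordinates per chain) and all coordinates are hypothetical.

```python
import math

CUTOFF = 5.5  # Å, heavy-atom distance criterion quoted in the text


def residues_in_contact(atoms_a, atoms_b, cutoff=CUTOFF):
    """True if any heavy atom of one residue is within `cutoff` of any heavy atom of the other."""
    return any(math.dist(a, b) < cutoff for a in atoms_a for b in atoms_b)


def count_contacts(chain_a, chain_b, cutoff=CUTOFF):
    """Count contacting residue pairs between two chains.

    Each chain maps a residue label to a list of heavy-atom (x, y, z) coordinates.
    """
    return sum(
        residues_in_contact(atoms_a, atoms_b, cutoff)
        for atoms_a in chain_a.values()
        for atoms_b in chain_b.values()
    )


# Hypothetical coordinates in Å.
chain_h = {"ARG50": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], "TYR52": [(0.0, 12.0, 0.0)]}
chain_a = {"ASP101": [(6.0, 0.0, 0.0)], "LEU103": [(30.0, 0.0, 0.0)]}
print(count_contacts(chain_h, chain_a))
```

    Only ARG50 and ASP101 are in contact here, because one pair of their atoms is 4.5 Å apart.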

    The model's prediction accuracy was evaluated on a dataset of 81 complexes. The Pearson correlation coefficient between predicted and experimental binding affinities is 0.73 (p < 0.0001), with a root-mean-square error (RMSE) of 1.89 kcal/mol.
    image.png

    Parameters

    Structure

    The protein complex structure in PDB or CIF format. Multiple complex structures can be packaged together for batch prediction. Supported package formats: .tar, .tar.gz, or .zip.

    Group

    Allows grouping of multiple chains in the structure. Chains in the same group are treated as a single unit, and binding affinity is only calculated between groups. Use chain IDs to define groups: separate chains in the same group with commas, and separate groups with spaces.
    Example: H,L A means chains H and L are treated as one group, and chain A as another group. The binding affinity is then calculated between these two groups.

    Note:

    1. If this parameter is not specified, binding affinity will be calculated for all contacting chain pairs in the complex.
    2. For antibody-antigen binding affinity calculations, the heavy and light chains of the antibody should be grouped together using this parameter to compute affinity with the antigen chain.
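
    A minimal sketch of how a Group string such as H,L A can be read: spaces separate groups and commas separate the chains inside a group. The function names are hypothetical, and the module's own parser may treat edge cases differently.

```python
from itertools import combinations


def parse_groups(text):
    """Split a Group string into a list of chain-ID lists."""
    return [[chain for chain in group.split(",") if chain] for group in text.split()]


def group_pairs(text):
    """List the group pairs between which affinity would be calculated."""
    return list(combinations(parse_groups(text), 2))


print(parse_groups("H,L A"))     # antibody heavy + light chain vs. antigen
print(group_pairs("H,L A B"))    # three groups give three group pairs
```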

    Contacts

    Outputs residue pairs at the inter-chain contact interface.

    Output

    Filename for the binding affinity prediction result. Default: prodigy_output.csv

    Output_CRP

    Filename for the contact interface residue pairs. Default: contacts.txt

    Results

    The binding affinity prediction result is saved in prodigy_output.csv, which includes the following columns:

    Column Name | Description
    Name | Structure name
    Binding_Affinity (kcal/mol) | Predicted binding affinity in kcal/mol
    Dissociation_Constant (25.0˚C) | Dissociation constant at 25 °C, calculated using ΔG = RT ln K_d
    Intermolecular Contacts | Total number of interfacial residue pairs
    Charged_Charged Contacts | Number of contacts between charged residues
    Charged_Polar Contacts | Number of contacts between charged and polar residues
    Charged_Apolar Contacts | Number of contacts between charged and apolar residues
    Polar_Polar Contacts | Number of contacts between polar residues
    Apolar_Polar Contacts | Number of contacts between apolar and polar residues
    Apolar_Apolar Contacts | Number of contacts between apolar residues
    Percentage of Apolar NIS | Percentage of apolar non-interacting surface
    Percentage of Charged NIS | Percentage of charged non-interacting surface
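
    The relation ΔG = RT ln K_d behind the Dissociation_Constant column can be worked through as below. This is a sketch: the function name is hypothetical, and the gas constant is the standard value in kcal/(mol·K), which may differ in its last digits from the constant the module uses.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol·K)


def kd_from_dg(dg_kcal_per_mol, temp_celsius=25.0):
    """Dissociation constant (M) from a binding free energy via ΔG = RT ln K_d."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(dg_kcal_per_mol / (R_KCAL * temp_kelvin))


# A more negative ΔG means tighter binding, i.e. a smaller K_d.
for dg in (-8.0, -10.0, -12.0):
    print(f"dG = {dg} kcal/mol -> Kd = {kd_from_dg(dg):.2e} M")
```

    At 25 °C, RT is about 0.59 kcal/mol, so each additional 1.4 kcal/mol of binding energy lowers K_d by roughly a factor of ten.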

    The optional contact interface file Contacts.txt lists one contacting residue pair per line, including residue names, numbers, and chain IDs.

    In batch mode:

    • Binding affinity results are packaged in prodigy_output.tar.gz
    • Contact interface results are packaged in Contacts.tar.gz

    References

    • Xue LC, Rodrigues JP, Kastritis PL, Bonvin AM, Vangone A. PRODIGY: a web server for predicting the binding affinity of protein-protein complexes. Bioinformatics. 2016 Dec 1;32(23):3676–3678. DOI: 10.1093/bioinformatics/btw514

  • Name: MolSeeker
    Description: Deep-learning-based prediction of small-molecule ADMET properties.
    Tags: undefined
    Author: MolSeeker
    Release: 2024-07-30 00:00:00
    Reference:

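    MolSeeker: ADMET Prediction Driven by Large-Model Multi-Agents

    Introduction

    MolSeeker predicts ADMET properties with a large-model multi-agent approach: a distributed system of cooperating agents predicts a drug's absorption, distribution, metabolism, excretion, and toxicity in the body. It combines the semantic understanding of large language models (LLMs) with the division of labor of a multi-agent system (MAS). The ADMET prediction task is broken down into submodules such as data cleaning, molecular representation, property modeling, and decision reasoning, forming a closed, automated pipeline from molecular structure input to drug-likeness prediction.

    Parameters

    1. Physicochemical properties (Physicochem): LogD, LogP, pKa, pKb, Solubility_Kinetic, Solubility_FASSIF
    2. Absorption: Caco2_A2B (Cls), Caco2_A2B (Reg), PAMPA (Reg)
    3. Distribution: BBB (Cls), MDCK_Efflux (Reg)
    4. Metabolism: HLM (Cls), HLM (Reg), hHep (Cls), hHep (Reg)
    5. Toxicity: hERG (Cls), hERG (Reg), AMES (Cls), Hepatotoxicity (Cls)

    Results

    Result Name | Description
    MW | Molecular weight: the relative molecular mass of the compound
    TPSA | Topological polar surface area; reflects the polarity of the molecule and affects properties such as absorption and distribution
    PAINS | Whether a PAINS structural alert is present; PAINS (Pan Assay Interference Compounds) are compounds that may interfere with many biological assays
    SaScore | Synthetic accessibility score; estimates how difficult the compound is to synthesize
    cLogP | Calculated octanol-water partition coefficient; reflects the lipophilicity of the compound
    LogD_pred | Predicted distribution coefficient; used to assess partitioning of the compound at different pH values
    LogP_pred | Predicted octanol-water partition coefficient; helps judge the lipophilicity of the compound
    pKa_pred | Predicted acid dissociation constant; describes the acidic character of the compound
    pKb_pred | Predicted base dissociation constant; describes the basic character of the compound
    Solubility_Kinetic_Pred | Predicted kinetic solubility at pH 6.5
    Solubility_FASSIF_Pred | Predicted thermodynamic solubility in FaSSIF medium (simulated gastrointestinal fluid) at pH 6.5
    Caco2(Cls)_Pred | Caco-2 permeability, classification; judges the ability of the drug to permeate Caco-2 cells (class boundary: 1×10⁻⁶ cm/s)
    Caco2(Reg)_Pred | Caco-2 permeability, regression; quantifies how well the drug permeates Caco-2 cells
    HLM(Cls)_Pred | Human liver microsome stability, classification; stability class of the drug in human liver microsomes (class boundary: 15 µL/min/mg protein)
    HLM(Reg)_Pred | Human liver microsome stability, regression; numerical estimate of the drug's stability in human liver microsomes
    hHep(Reg)_Pred | Human hepatocyte regression prediction; quantitative prediction of the human hepatocyte endpoint
    hHep(Cls)_Pred | Human hepatocyte classification prediction; class assignment for the human hepatocyte endpoint (class boundary: 10 µL/min/1E6 cells)
    PAMPA(Reg)_Pred | Parallel artificial membrane permeability assay, regression; numerical estimate of permeation through an artificial membrane
    MDCK_Efflux(Reg)_Pred | MDCK cell efflux, regression; quantifies the efflux of the drug by MDCK cells
    BBB(Cls)_Pred | Blood-brain barrier penetration, classification; class of the drug's ability to cross the blood-brain barrier
    hERG(Cls)_Pred | hERG channel inhibition, classification; risk class for hERG channel inhibition (class boundary: 10 µM)
    hERG(Reg)_Pred | hERG channel inhibition, regression; quantitative estimate of the hERG channel inhibition risk
    AMES(Cls)_Pred | Ames test mutagenicity, classification; whether the drug is predicted to be mutagenic
    Hepatotoxicity(Cls)_Pred | Hepatotoxicity, classification; risk class for liver toxicity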
  • Name: Back Mutation Grouping v2.5
    Description: The grouping step of the antibody humanization design workflow; groups back mutations according to the back-mutation scoring table generated by the Mutation Score module.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-04-03 10:23:26
    Reference:

    Back Mutation Grouping v2.5

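    Introduction

    This module is the grouping step of the antibody humanization design workflow. It groups back mutations according to the back-mutation scoring table produced by the Mutation Score module and returns the mutated sequences.

    Updates:

    • New parameter Combination Max Cutoff: mutations scoring above this cutoff are back-mutated automatically.
    • New parameter Combination Site Cutoff: for each chain, among the positions whose back-mutation score lies between Combination Min Cutoff and Combination Max Cutoff, the top n by score are selected for combination mutations.

    Parameters

    Grafted Chain

    Sequence file of the antibody after CDR grafting, in FASTA format, generated by the Grafting module.

    Raw Chain

    Antibody sequence file, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Name of the output mutated-sequence file, in FASTA format.

    Cutoff

    Cutoff values for grouping by score, separated by commas. For example, 2,5,10 puts mutations scoring above 10 in one group, those scoring 5 to 10 in another, and those scoring below 2 in a separate group.

    Output Policy

    File to which the back mutations are written.

    Type

    Conventional antibody (Antibody) or nanobody (Nanobody).

    Combination Min Cutoff

    Cutoff for mutation combinations. Amino acids whose back-mutation score from the Mutation Score module is above this cutoff take part in generating mutation combinations.

    Combination Max Cutoff

    Mutations scoring above this cutoff are back-mutated automatically.

    Combination Site Cutoff

    For each chain, among the positions whose back-mutation score lies between Combination Min Cutoff and Combination Max Cutoff, the top n by score are selected for combination mutations.

    Results

    mutate_policy.json: the mutation grouping results obtained from the different cutoff values.

    combination_mutate_policy.json: the mutation grouping results obtained from the combination cutoffs, for the high-throughput humanization design workflow.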

    Back Mutation Grouping v2.5

    Introduction

    This module is the grouping step of the antibody humanization workflow. It groups back mutations according to the mutation scoring table generated by the Mutation Score module.

    Parameters

    Grafted Chain

    Sequence file of the antibody CDR region after grafting, in FASTA format, generated by the Grafting module.

    Raw Chain

    Sequence file of the antibody, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Specify the name of the output mutation sequence file, in FASTA format.

    Cutoff

    Cutoff values for grouping by score, separated by commas. For example, 2,5,10 puts mutations with scores greater than 10 in one group, those with scores between 5 and 10 in another, and those with scores less than 2 in a separate group.
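
    A minimal sketch of binning mutation scores by a comma-separated Cutoff string. It assumes each interval between consecutive cutoffs forms its own group (so 2,5,10 gives four bins, including one for scores from 2 to 5) and that a score equal to a cutoff goes to the higher bin; both are assumptions, and the site labels and scores are made up.

```python
from bisect import bisect_right


def parse_cutoffs(text):
    """Turn a string such as "2,5,10" into a sorted list of floats."""
    return sorted(float(value) for value in text.split(","))


def group_by_cutoff(scores, cutoffs):
    """Map bin index -> list of sites; bin 0 is below the lowest cutoff."""
    groups = {index: [] for index in range(len(cutoffs) + 1)}
    for site, score in scores.items():
        groups[bisect_right(cutoffs, score)].append(site)
    return groups


# Hypothetical back-mutation scores keyed by position.
scores = {"H48": 12.0, "H67": 7.5, "L46": 3.0, "L71": 0.5}
groups = group_by_cutoff(scores, parse_cutoffs("2,5,10"))
print(groups)
```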

    Output Policy

    Specify the file for the output of back mutations.

    Type

    Antibody or Nanobody

    Combination Min Cutoff

    Cutoff value for mutation combinations. Amino acids with scores (generated from Mutation Score module) greater than the cutoff value are involved in the mutation combinations.

    Combination Max Cutoff

    Mutations scoring above this cutoff are back-mutated automatically.

    Combination Site Cutoff

    For each chain, select the top n positions with back mutation scores between the Combination Min Cutoff and the Combination Max Cutoff for combination mutations.
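
    The interplay of the three Combination parameters can be sketched for a single chain as follows. This is an illustration under stated assumptions, not the module's code: the function name, the handling of scores exactly at a cutoff, and the choice to enumerate every subset of the selected sites are all hypothetical.

```python
from itertools import combinations


def plan_combinations(scores, min_cutoff, max_cutoff, site_cutoff):
    """Return (forced sites, candidate sites, list of variant site sets) for one chain."""
    # Scores above the max cutoff are always back-mutated.
    forced = sorted(site for site, score in scores.items() if score > max_cutoff)
    # Scores between the two cutoffs compete for the top-n candidate slots.
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    candidates = [site for site, score in ranked if min_cutoff < score <= max_cutoff]
    candidates = candidates[:site_cutoff]
    # Every subset of the candidates is combined with the forced sites.
    variants = [
        forced + list(subset)
        for size in range(len(candidates) + 1)
        for subset in combinations(candidates, size)
    ]
    return forced, candidates, variants


# Hypothetical scores for one chain.
scores = {"A": 12.0, "B": 8.0, "C": 6.0, "D": 4.0, "E": 1.0}
forced, candidates, variants = plan_combinations(scores, 3.0, 10.0, 2)
print(forced, candidates, len(variants))
```

    With n candidate sites the number of variants grows as 2^n, which is one reason to cap the sites with Combination Site Cutoff.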

    Results

    The mutation grouping results file mutate_policy.json is generated based on different cutoff values.
    The mutation grouping results file combination_mutate_policy.json is generated based on combination cutoff values.

  • Name: Protease (MMP) Cleavage Prediction
    Description: Predicts the cleavage efficiency of peptides (≤10 amino acids) by 18 matrix metalloproteinases (MMPs), and generates peptide substrates that match a specified target cleavage profile.
    Tags: undefined
    Author: Carmen Martin-Alonso
    Release: 2025-03-26 16:03:42
    Reference: Carmen Martin-Alonso, Sarah Alamdari, Tahoura S. Samad, Kevin K. Yang, Sangeeta N. Bhatia, Ava P. Amini. Deep learning guided design of protease substrates. bioRxiv 2025.02.27.640681

    Protease (MMP) Cleavage Prediction

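    Introduction

    This module has two functions:
    1. Predicting how efficiently peptides (no longer than 10 amino acids) are cleaved by 18 matrix metalloproteinases (MMPs).
    2. Generating peptide substrates that match a specified target cleavage profile (e.g., cleaved only by MMP13).

    The module is built on CleaveNet, a deep learning tool for protease substrate design that combines prediction and generation, moving from "virtual screening" to "intelligent design".
    image.png
    CleaveNet has two core modules:
    Prediction module

    • Based on a Transformer architecture and trained on large-scale mRNA-display peptide library data.
    • For 18 MMPs, predicts how efficiently a peptide is cleaved by each protease; the test-set Pearson correlation reaches 0.80, better than traditional binary classification models.
    • The model reproduces known cleavage motifs and also finds new substrate preferences, such as the role of methionine at P4, extending the understanding of protease specificity.

    Generation module

    • Uses conditional generation: users specify the target cleavage profile through condition tags (e.g., "high activity for MMP13, low activity for other MMPs").
    • Steers generation with an attention mechanism; the generated 6-mer peptides reach 89% novelty, going beyond the limits of the training data.
    • About 5.5 times more efficient than traditional virtual screening, and supports complex design requirements such as "dual-protease logic gate" substrates.

    This end-to-end design workflow markedly improves the efficiency and precision of substrate design and offers a new computation-driven approach to protease research.

    Experimental validation
    To assess CleaveNet in practice, the research team targeted MMP13 (a collagenase associated with cancer metastasis, wound healing, and osteoarthritis), designed and synthesized 95 peptide substrates, and measured their cleavage efficiency by fluorescence resonance energy transfer (FRET). The results showed:

    • Cleavage efficiency: all CleaveNet-designed MMP13 substrates were cleaved effectively; one substrate (DL73) was cleaved 39% more efficiently than the best substrate in the training set (p<0.01).
    • Specificity: 3 substrates (e.g., DL41) were absolutely specific to MMP13 and were not cleaved by other MMPs; 5 substrates (e.g., DL48) showed both high activity and high selectivity, filling a gap left by traditional methods.
    • Mechanistic insight: analysis of the generated sequences revealed a preference for leucine at P2 and a role for aspartic acid at P3', suggesting new directions for studying the specificity mechanism of MMP13.

    These results confirm CleaveNet's ability to design efficient and specific substrates, and show its potential to reveal unknown substrate preferences.

    Parameters

    Prediction

    Peptide Sequence

    Required. Peptide sequences in txt or fasta format; multiple sequences are supported (in txt format, one peptide per line). Note: peptides must not exceed 10 residues; longer sequences are filtered out automatically.
    An example in txt format: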

    LRVFL
    FMPLNFTASG
    LGPYAMTSRG
    AARFKKFATE
    

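    Output

    Optional. File name for the predicted MMP cleavage results; default "pred_cleavage.csv".

    Generation

    Number of Peptides

    Optional. Number of peptides to generate; default 50.

    Z-score of MMPs

    Optional. Cleavage conditions for peptide generation, as a CSV file. The file gives a cleavage Z-score for each MMP; a larger value means cleavage is more likely, and values may be negative. The usual threshold is 2.5: a value above it indicates that cleavage is very likely. The model generates peptides according to the Z-scores set for the individual MMPs. Note: a Z-score must be set for all 18 MMPs; none may be missing.
    An example of the file content: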

    MMP1,MMP10,MMP11,MMP12,MMP13,MMP14,MMP15,MMP16,MMP17,MMP19,MMP2,MMP20,MMP24,MMP25,MMP3,MMP7,MMP8,MMP9
    2.0,1.2,2.2,2.9,3.3,4.1,2.2,3.1,2.2,1.91,3.6,2.83,2.7,0.2,1.8,2.1,0.61,4.2
    

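    The content above is one set of conditions. Several sets can be supplied at once, one set per line; the specified number of peptides is generated for each set. An example with several sets of conditions: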

    MMP1,MMP10,MMP11,MMP12,MMP13,MMP14,MMP15,MMP16,MMP17,MMP19,MMP2,MMP20,MMP24,MMP25,MMP3,MMP7,MMP8,MMP9
    2.0,1.2,2.2,2.9,3.3,4.1,2.2,3.1,2.2,1.91,3.6,2.83,2.7,0.2,1.8,2.1,0.61,4.2
    3.33,2.5,3.6,2.7,2.9,5.2,3.4,4.2,2.7,2.0,3.5,2.6,4.0,3.1,3.5,0.61,2.1,2.9
    

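    Temperature

    Optional. Sampling temperature, which controls the diversity of the generated peptide sequences; default 1.0. Larger values give more diversity. For lower diversity, 0.7 is recommended; for higher diversity, 1.2 to 1.5 is recommended.

    Output

    Optional. Name of the output sequence file, in fasta or txt format; default "gen_seqs.fasta".

    Results

    Prediction

    The file of predicted MMP cleavage results, pred_cleavage.csv by default, contains the following fields:

    Field Name | Description
    SEQ | Peptide sequence
    MMP1, MMP2, MMP3, … | Z-score for how strongly each MMP cleaves the peptide. A larger value means cleavage is more likely; the current threshold is 2.5, and a value above it indicates that cleavage is very likely.

    Generation

    The generated sequence file, "gen_seqs.fasta" by default.

    References

    • Deep learning guided design of protease substrates. Carmen Martin-Alonso, Sarah Alamdari, Tahoura S. Samad, Kevin K. Yang, Sangeeta N. Bhatia, Ava P. Amini bioRxiv 2025.02.27.640681; DOI: 10.1101/2025.02.27.640681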

    Protease (MMP) Cleavage Prediction

    Introduction

    This module has two functions:
    1. Predicting the cleavage efficiency of peptides (≤10 amino acids) by 18 matrix metalloproteinases (MMPs).
    2. Generating corresponding peptide substrates based on a specified cleavage profile (e.g., only cleaved by MMP13).

    Built on the CleaveNet model, a deep-learning-based protease substrate design tool, it integrates prediction and generation, shifting from “virtual screening” to “intelligent design”.
    CleaveNet has two core modules:
    Prediction Module

    • Trained on a large-scale mRNA-display peptide library using a Transformer architecture.
    • Predicts peptide cleavage efficiency by 18 MMPs, with a test-set Pearson correlation of 0.80, outperforming traditional binary-classification models.
    • Reproduces known cleavage motifs and reveals new substrate preferences (e.g., methionine at P4), enhancing understanding of protease specificity.

    Generation Module

    • Uses conditional generation. Users can set target cleavage profiles (e.g., “high MMP13 activity, low other MMP activities”) via conditional tags.
    • Adjusts generation direction with attention mechanisms. Generated 6-mer peptides have 89% novelty, going beyond the limits of the training data.
    • Is about 5.5 times more efficient than traditional virtual screening, supporting complex designs like “dual-protease logic gate” substrates.

    This end-to-end design process improves substrate design efficiency and accuracy, offering a new computation-driven method for protease research.

    Experimental Validation
    To assess CleaveNet’s practicality, the team targeted MMP13 (a collagenase linked to cancer metastasis, wound healing, and osteoarthritis). They designed and synthesized 95 peptide substrates, validating cleavage efficiency via fluorescence resonance energy transfer (FRET). Results showed:

    • Cleavage efficiency: all CleaveNet-designed MMP13 substrates were efficiently cleaved. One (DL73) had 39% higher efficiency than the best training-set substrate (p<0.01).
    • Specificity: three substrates (e.g., DL41) were absolutely specific to MMP13, and five (e.g., DL48) had both high activity and selectivity, addressing gaps left by traditional methods.
    • Mechanistic insight: analysis of generated sequences revealed a leucine preference at P2 and a role for aspartic acid at P3’, offering new insights into MMP13’s specificity mechanism.

    These results confirm CleaveNet’s ability to design efficient, specific substrates and its potential to uncover unknown substrate preferences.

    Parameters

    Prediction

    Peptide Sequence

    Required. Peptide sequences in txt or fasta format; multiple sequences are supported (in txt format, place each peptide on a separate line). Note: peptides must not exceed 10 residues; longer sequences are filtered out automatically.
    An example in txt format is as follows:

    LRVFL
    FMPLNFTASG
    LGPYAMTSRG
    AARFKKFATE
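
    Before submitting a txt input like the one above, the 10-residue limit can be checked locally so that no peptide is dropped silently. The sketch below is a hypothetical pre-check, not part of the module; the function name and the 12-residue example sequence are made up.

```python
MAX_LEN = 10  # length limit stated in the text


def split_by_length(text, max_len=MAX_LEN):
    """Return (kept, dropped) peptide lists from txt-style input, one peptide per line."""
    kept, dropped = [], []
    for line in text.splitlines():
        sequence = line.strip().upper()
        if not sequence:
            continue  # skip blank lines
        (kept if len(sequence) <= max_len else dropped).append(sequence)
    return kept, dropped


sample = "LRVFL\nFMPLNFTASG\nLGPYAMTSRGAA\n"
kept, dropped = split_by_length(sample)
print("kept:", kept)
print("would be filtered out:", dropped)
```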
    

    Output

    Optional. File name for the predicted MMP cleavage results; default “pred_cleavage.csv”.

    Generation

    Number of Peptides

    Optional parameter, specify the number of peptides to be generated, default is 50.

    Z-score of MMPs

    Optional. Cleavage conditions for peptide generation, as a CSV file. The file gives a cleavage Z-score for each MMP. A higher value indicates a higher likelihood of cleavage, and values can be negative. The usual threshold is 2.5: a value above it indicates a very high probability of cleavage. The model generates peptides according to the Z-scores set for the individual MMPs. Note: a Z-score must be set for all 18 MMPs; none can be missing.

    An example of the file content is as follows:

    MMP1,MMP10,MMP11,MMP12,MMP13,MMP14,MMP15,MMP16,MMP17,MMP19,MMP2,MMP20,MMP24,MMP25,MMP3,MMP7,MMP8,MMP9
    2.0,1.2,2.2,2.9,3.3,4.1,2.2,3.1,2.2,1.91,3.6,2.83,2.7,0.2,1.8,2.1,0.61,4.2
    

    The above content is a set of conditions, and multiple sets of conditions can also be input simultaneously. Just place each set of conditions on a separate line. Peptides of the specified quantity will be generated for each set of conditions. An example of multiple sets of conditions is as follows:

    MMP1,MMP10,MMP11,MMP12,MMP13,MMP14,MMP15,MMP16,MMP17,MMP19,MMP2,MMP20,MMP24,MMP25,MMP3,MMP7,MMP8,MMP9
    2.0,1.2,2.2,2.9,3.3,4.1,2.2,3.1,2.2,1.91,3.6,2.83,2.7,0.2,1.8,2.1,0.61,4.2
    3.33,2.5,3.6,2.7,2.9,5.2,3.4,4.2,2.7,2.0,3.5,2.6,4.0,3.1,3.5,0.61,2.1,2.9
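
    A condition file in this layout can be written programmatically, which makes it easy to check that all 18 MMPs are present. The sketch below is hypothetical helper code: the column order copies the example header above, while the function name and the Z-score values chosen for an "MMP13 only" profile are made up for illustration.

```python
import csv
import io

# Column order copied from the example header in the text.
MMPS = ["MMP1", "MMP10", "MMP11", "MMP12", "MMP13", "MMP14", "MMP15", "MMP16", "MMP17",
        "MMP19", "MMP2", "MMP20", "MMP24", "MMP25", "MMP3", "MMP7", "MMP8", "MMP9"]


def make_condition_csv(conditions):
    """Render a list of {MMP name: Z-score} dicts as CSV text, one condition per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(MMPS)
    for condition in conditions:
        missing = [name for name in MMPS if name not in condition]
        if missing:
            raise ValueError(f"Z-score missing for: {', '.join(missing)}")
        writer.writerow([condition[name] for name in MMPS])
    return buffer.getvalue()


# Made-up target profile: high Z-score for MMP13, low for all others.
selective = {name: -1.0 for name in MMPS}
selective["MMP13"] = 3.0
csv_text = make_condition_csv([selective])
print(csv_text)
```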
    

    Temperature

    Optional parameter, specify the temperature condition for controlling the diversity of the generated peptide sequences. The default value is 1.0. A higher value indicates higher diversity. If you want lower diversity, it is recommended to use 0.7. If you want higher diversity, it is recommended to use a value between 1.2 and 1.5.

    Output

    Optional parameter, specify the output file name for the sequences in fasta or txt format. The default is “gen_seqs.fasta”.

    Results

    Prediction

    The file of predicted MMP cleavage results, pred_cleavage.csv by default, contains the following fields:

    Field Name | Description
    SEQ | Peptide sequence
    MMP1, MMP2, MMP3, … | Z-score for the strength of cleavage by each MMP. A higher value indicates a higher likelihood of cleavage. The current threshold is 2.5; a value above it indicates a very high probability of cleavage.
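
    A minimal sketch of reading the prediction table and listing, for each peptide, the MMPs whose Z-score exceeds the 2.5 threshold. It assumes a header row with SEQ followed by one column per MMP, as described above; the sample shows only three MMP columns and made-up values.

```python
import csv
import io

THRESHOLD = 2.5  # Z-score threshold quoted in the text

# Hypothetical excerpt of a prediction file.
SAMPLE = """SEQ,MMP1,MMP13,MMP9
LRVFL,0.4,3.1,2.6
FMPLNFTASG,-0.8,1.9,0.2
"""


def likely_cleavers(csv_text, threshold=THRESHOLD):
    """Map each peptide to the MMPs predicted to cleave it (Z-score above threshold)."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sequence = row.pop("SEQ")
        result[sequence] = [mmp for mmp, z in row.items() if float(z) > threshold]
    return result


hits = likely_cleavers(SAMPLE)
print(hits)
```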

    Generation

    The generated sequence file, default is “gen_seqs.fasta”.

    References

    • Deep learning guided design of protease substrates. Carmen Martin-Alonso, Sarah Alamdari, Tahoura S. Samad, Kevin K. Yang, Sangeeta N. Bhatia, Ava P. Amini bioRxiv 2025.02.27.640681; DOI: 10.1101/2025.02.27.640681
  • Name: Computing Electrostatic Surfaces
    Description: 分析蛋白质表面的静电区域(正电、负电区域)的大小和分布 Analyze the electrostatic patches of protein surfaces.
    Author: Valentin J Hoerschinger
    Release: 2025-03-19 15:15:14
    Reference: Hoerschinger VJ, Waibl F, Pomarici ND, Loeffler JR, Deane CM, Georges G, Kettenberger H, Fernández-Quintero ML, Liedl KR. PEP-Patch: Electrostatics in Protein-Protein Recognition, Specificity, and Antibody Developability. J Chem Inf Model. 2023 Nov 27;63(22):6964-6971.

    Computing Electrostatic Surfaces

    简介

    该模块用于分析和可视化蛋白质表面的静电特性,这对分子识别、蛋白质溶解性、粘度和抗体的可开发性等过程至关重要。它主要通过定义“Patch”来识别和量化蛋白质表面的静电势,这些Patch是具有统一正或负电势值的连接区域。
    主要功能和特点:

    • 静电势计算:
      该工具使用APBS(自适应泊松-玻尔兹曼求解器)来计算静电势。此外,它还可以接受用户提供的势图或基于疏水性尺度的映射。
    • 分子表面生成:
      工具生成分子表面,并将计算的静电势映射到该表面。然后,可以通过颜色编码来可视化该表面,以指示正负区域。
    • Patch识别:
      识别和量化蛋白质表面上不同的正电和负电静电Patch,这对于理解蛋白质-蛋白质相互作用和抗体开发非常重要。

    参数说明

    Structure PDB

    蛋白结构文件,PDB格式。

    Surface Type

    分子表面的类型:sas或者ses。以下是两个选项的解释:

    • 溶剂可及表面(SAS,Solvent-Accessible Surface):SAS 是溶剂探针(通常是水分子)在分子表面滚动时,其中心轨迹形成的表面。
    • 溶剂排除表面(SES,Solvent-Excluded Surface):SES 是溶剂探针围绕分子滚动时,其最靠近分子的外部轮廓所形成的表面。

    Probe Radius

    探针半径,单位为纳米(默认:0.14)。

    Size Cutoff

    Patch面积(area )阈值,单位为Ų。如果 Size Cutoff = 0,则不过滤任何 patch,即所有 patch 都会被保留。

    pH Value

    pH 值。

    Output Patch

    输出Patch文件名称

    结果说明

    输出结果包括:

    输出文件名称 说明
    patches.csv 识别出的蛋白质表面静电Patch的信息。
    apbs.pqr APBS计算静电势的输入文件。PQR文件类似于PDB文件,但包含了每个原子的电荷和半径信息。
    apbs.pqr.dx 通过APBS计算得到的静电势分布数据。DX文件是网格格式,描述了蛋白质周围空间的静电势值。
    apbs.pdb APBS计算静电势的PDB文件

    其中patches.csv包括信息如下:

    字段名称 说明
    nr 代表Patch的编号。这是每个识别出的静电Patch的唯一标识符,用于区分不同的Patch。
    type 表示Patch的类型,通常为“positive”或“negative”,指示Patch的电荷性质是正电还是负电。
    npoints Patch中包含的表面点的数量。这些点构成了Patch在蛋白质表面上的区域。
    area Patch的面积,单位为Ų。这表示Patch在蛋白质表面上覆盖的物理面积。
    value Patch的总静电势值,通常为Patch内所有点的静电势值的总和或平均值。这反映了Patch的整体静电强度。
    residue Patch中的氨基酸残基,通常是Patch所在区域的一个代表性残基。这个残基可能是Patch中电荷最集中的位置或最显著的氨基酸。其他的氨基酸编号与apbs.pdb对应。

    参考文献

    • Hoerschinger VJ, Waibl F, Pomarici ND, Loeffler JR, Deane CM, Georges G, Kettenberger H, Fernández-Quintero ML, Liedl KR. PEP-Patch: Electrostatics in Protein-Protein Recognition, Specificity, and Antibody Developability. J Chem Inf Model. 2023 Nov 27;63(22):6964-6971. DOI: 10.1021/acs.jcim.3c01490

    Computing Electrostatic Surfaces

    Introduction

    This module is designed for analyzing and visualizing the electrostatic properties of protein surfaces, which are critical for processes such as molecular recognition, protein solubility, viscosity, and antibody developability. It primarily identifies and quantifies the electrostatic potential on protein surfaces by defining “patches,” which are connected regions with uniform positive or negative potential values.
    Key Features:

    • Electrostatic Potential Calculation:
      This tool uses APBS (Adaptive Poisson-Boltzmann Solver) to compute electrostatic potentials. Additionally, it can accept user-provided potential maps or mappings based on hydrophobicity scales.

    • Molecular Surface Generation:
      The tool generates molecular surfaces and maps the calculated electrostatic potentials onto these surfaces. The surface can then be visualized using color coding to indicate positive and negative regions.

    • Patch Identification:
      It identifies and quantifies different positive and negative electrostatic patches on the protein surface, which are crucial for understanding protein-protein interactions and antibody development.

    Parameters

    Structure PDB

    The protein structure file in PDB format.

    Surface Type

    The type of molecular surface: SAS or SES. Below are explanations for the two options:

    • Solvent-Accessible Surface (SAS): SAS represents the surface formed by the center trajectory of a solvent probe (usually a water molecule) rolling over the molecular surface.
    • Solvent-Excluded Surface (SES): SES represents the outer contour closest to the molecule formed when the solvent probe rolls around the molecule.

    Probe Radius

    The radius of the probe, measured in nanometers (default: 0.14).

    Size Cutoff

    Patch area threshold (area), measured in Ų. If Size Cutoff = 0, no patch will be filtered, meaning all patches will be retained.

    pH Value

    The pH value.

    Output Patch

    The name of the output file for identified patches.

    Result

    The output includes the following files:

    File Name Description
    patches.csv Information about the identified electrostatic patches on the protein surface.
    apbs.pqr Input file for APBS electrostatic potential calculations. PQR files are similar to PDB files but include charge and radius information for each atom.
    apbs.pqr.dx Electrostatic potential distribution data calculated by APBS. DX files are grid-format files describing the electrostatic potential values in the space surrounding the protein.
    apbs.pdb PDB file with electrostatic potential information calculated by APBS.

    The patches.csv file includes the following information:

    Field Name Description
    nr Patch number. This is a unique identifier for each identified electrostatic patch.
    type Patch type, typically “positive” or “negative,” indicating whether the patch is positively or negatively charged.
    npoints The number of surface points in the patch, which defines the region of the patch on the protein surface.
    area The area of the patch in Ų, representing the physical coverage of the patch on the protein surface.
    value The total electrostatic potential value of the patch, usually the sum or average of all potential values within the patch. This indicates the overall electrostatic intensity of the patch.
    residue Representative amino acid residue within the patch, typically the residue with the highest charge concentration or the most prominent residue in the patch. Other residue numbers correspond to the apbs.pdb file.
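
    As a rough illustration of how patches.csv might be post-processed, the sketch below re-applies a size cutoff and ranks patches by area. It assumes the file is comma-separated with the field names listed above as its header; the sample rows, including the residue labels, are invented, and `filter_patches` is a hypothetical helper.

```python
import csv


def filter_patches(csv_text, size_cutoff=0.0, patch_type=None):
    """Return patches with area >= size_cutoff, largest first.

    A size_cutoff of 0 keeps every patch, matching the Size Cutoff parameter.
    """
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    kept = []
    for row in csv.DictReader(lines):
        area = float(row["area"])
        if area < size_cutoff:
            continue
        if patch_type is not None and row["type"] != patch_type:
            continue
        kept.append(
            {
                "nr": int(row["nr"]),
                "type": row["type"],
                "area": area,
                "value": float(row["value"]),
                "residue": row["residue"],
            }
        )
    return sorted(kept, key=lambda patch: patch["area"], reverse=True)


# Invented sample content for illustration only.
sample = """
nr,type,npoints,area,value,residue
0,positive,420,310.5,152.3,LYS45
1,negative,150,95.2,-60.1,ASP101
2,positive,30,18.4,7.9,ARG12
"""

for patch in filter_patches(sample, size_cutoff=50.0):
    print(patch["nr"], patch["type"], patch["area"], patch["residue"])
```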

    References

    • Hoerschinger VJ, Waibl F, Pomarici ND, Loeffler JR, Deane CM, Georges G, Kettenberger H, Fernández-Quintero ML, Liedl KR. PEP-Patch: Electrostatics in Protein-Protein Recognition, Specificity, and Antibody Developability. J Chem Inf Model. 2023 Nov 27;63(22):6964-6971. DOI: 10.1021/acs.jcim.3c01490
  • Name: Patch Analysis v2
    Description: 分析蛋白质表面的Patch(正电、负电、疏水残基富集区域)的大小和分布,用于解决蛋白质的聚集等问题。一般建议通过WeView三维结构可视化编辑器来使用该功能,可以在三维结构中直观地查看patch的位置。 Calculate patches (positively charged, negatively charged, or hydrophobic regions) on the protein surface to address protein aggregation issues. It is recommended to use in the WeView, as it allows for a visual inspection of the patch locations within the 3D structure.
    Author: WECOMPUT
    Release: 2022-04-14 15:01:18
    Reference:

    Patch Analysis v2

    简介

    该模块计算蛋白质表面静电和疏水作用相对富集的区域,用于显示出在蛋白质-蛋白质相互作用中其重要作用的区域,这对于预测基于非共价弱相互作用的可逆的聚集现象尤其有用。尤其是,疏水相互作用长期被认为是大分子间吸引相互作用的主要组成部分。对于抗体,静电相互作用牵涉到了自聚集,而偶极-偶极相互作用被认为是导致β-折叠的纤维化的原因。同时,也可以通过WeView界面对蛋白结构进行Patch分析。

    V2 更新内容

    • 优化原子参数,提高计算准确性。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式

    Hydrophobic Cutoff

    Hydrophobic cutoff是一个以疏水性氨基酸(通常包括Leu,Ile,Val,Phe,Trp和Met)为基础定义的截断值,用于将表面上疏水性氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,Patch区域中获得的化学性质信息会根据其表面密度和具有高疏水性氨基酸的数量而有所变化。

    Positive Cutoff

    Positive Cutoff是一个以阳离子氨基酸为基础定义的截断值,用于将表面上阳离子氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,positive cutoff方式用于筛选出可能参与离子相互作用的蛋白质表面区域。

    Negative Cutoff

    Negative Cutoff是一个以阴离子氨基酸为基础定义的截断值,用于将表面上阴离子氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,negative cutoff方式用于筛选出可能参与离子相互作用的蛋白质表面区域。

    SASA Cutoff

    SASA Cutoff是一个以溶剂可及表面积为基础定义的截断值,低于截断值的patch会被过滤掉。

    Distance Cutoff

    Distance Cutoff是原子距离截断值,低于截断值的才会认为属于同一聚集块。值越小,聚集块patch越小。

    Min Distance Cutoff

    Min Distance Cutoff是patch之间的距离截断值,距离小于截断值的归为同一个patch。

    Result Type

    输出文件格式,csv或者json
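    Put simply, each cutoff is a strength threshold on the electrostatic or hydrophobic potential, in kcal/mol; only surface above the threshold is counted toward patch area. The lower the threshold, the more patches are found.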

    Keep Original

    不添加缺失原子(包括氢原子)和结构优化。

    Neutral N-terminus

    Neutralizes the N-terminal residue of the protein.

    Neutral C-terminus

    Neutralizes the C-terminal residue of the protein.

    结果说明

    输出结果包括:

    输出文件名称 说明
    patch_list.csv Patch结果的csv文件。主要关注Area(Å^2)数值,代表patch的大小,越大则越可疑,重点关注100 Å以上的patch。
    input_prot.pdb 质子化后的pdb结构。
    patch_list_sum.csv 统计了三种patch类型(Hyd:疏水中心,Neg:负电中心,Pos:正电中心)在蛋白表面所占面积,重点关注100 Å以上的patch。

    其中patch_list.csv,包含信息如下:

    字段名称 说明
    Type Patch的类型,Hyd:疏水中心,Neg:负电中心,Pos:正电中心
    Area(Å^2) 每个Patch的蛋白质表面区域面积
    Residues 每个Patch的对应的残基

    其中patch_list_sum.csv,包含信息如下:

    字段名称 说明
    Type Patch的类型,Hyd:疏水中心,Neg:负电中心,Pos:正电中心
    Total Areas Patch的蛋白质表面区域总面积
    Areas of The Largest Patch的蛋白质表面区域最大面积
    Number of Areas More Than 100 超过100 Å以上的patch的数目

    参考文献

    • Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348. DOI: 10.1142/p477
    • Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514. DOI: 10.1007/BF00124359
    • Wildman, S.A., Crippen, G.M.; Prediction of Physiochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873. DOI: 10.1021/ci990307l

    Patch Analysis v2

    Introduction

    Protein Patches calculates both electrostatic (excess charge) and hydrophobic surface patches to show regions of significance with respect to protein-protein interactions. This can be particularly useful in the prediction of reversible aggregation, which typically arises from relatively weak non-covalent interactions. In particular, hydrophobic interactions have long been recognized as major contributors in high affinity interactions between macromolecules. In antibodies, electrostatic interactions have been implicated in forming self-associated aggregates [Karshikoff 2006], while dipole-dipole interactions are believed to be the cause of fibrillogenic association of β-sheets. At the same time, protein structures can also be analyzed for patches through the WeView interface.
    Electrostatic patches.
    The surface electrostatic field is estimated using an exponentially decaying Debye-Hückel field with a screening length of λ_D = 3.5 Å.
    The map thus obtained is one mostly of excess charge close to the molecular surface.
    Significant patches are established by cutting the surface along isocontour lines of absolute field value equal to 40 kcal/mol/C, keeping regions above. Finally, a default minimal patch area of 40 Å² filters out smaller, presumably less relevant, patches.
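
    To make the screened field concrete, the sketch below evaluates a textbook Debye-Hückel (screened Coulomb) potential at a probe point. It is an illustration under stated assumptions, not the module's implementation: the dielectric constant of 1, the Coulomb constant in kcal·Å/(mol·e²), the point-charge input format, and the function name are all assumptions; only the screening length of 3.5 Å comes from the text.

```python
import math

COULOMB_KCAL = 332.06371  # kcal*Å/(mol*e^2), Coulomb constant in these units
SCREENING_LENGTH = 3.5    # Å, screening length quoted in the text


def screened_potential(point, charges, dielectric=1.0,
                       screening_length=SCREENING_LENGTH):
    """Sum q * exp(-r / lambda) / r over point charges given as (x, y, z, q).

    The dielectric constant is an assumption of this sketch.
    """
    total = 0.0
    for x, y, z, q in charges:
        r = math.dist(point, (x, y, z))
        if r == 0.0:
            continue  # skip a charge sitting exactly on the probe point
        total += q * math.exp(-r / screening_length) / r
    return COULOMB_KCAL * total / dielectric


# A single unit charge at the origin, probed at increasing distances.
charge = [(0.0, 0.0, 0.0, 1.0)]
for distance in (2.0, 3.5, 7.0):
    print(distance, round(screened_potential((distance, 0.0, 0.0), charge), 2))
```

    With a screening length this short, contributions fall off quickly with distance, which is why the resulting map mostly reflects excess charge close to the surface.
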
    Hydrophobicity map.
    The hydrophobic potential is calculated from the Wildman and Crippen octanol-water partition coefficients f=log P [Wildman 1999]:

    where f_i is the coefficient of atom i and g(r_i) is a Fermi-type distance-dependent weighting function proposed by Heiden et al. [Heiden 1993]:

    with r_cut = 5 Å and α = 1.5.
    As with electrostatic (excess charge) patches, significant hydrophobic patches are established by cutting the surface along isocontour lines, retaining only the portion above a potential threshold of 0.09 kcal/mol, and filtering for a minimal patch area of 50 Å².
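
    The description above can be sketched in code. This is a hedged illustration rather than the module's exact formula: it assumes the potential at a surface point is the plain sum of f_i · g(r_i) over atoms, and that g is the Fermi-type function g(r) = (exp(−α·r_cut) + 1) / (exp(α·(r − r_cut)) + 1), which equals 1 at r = 0 and roughly 0.5 at r = r_cut. The atom coordinates and coefficients below are invented.

```python
import math

R_CUT = 5.0   # Å, from the text
ALPHA = 1.5   # from the text


def fermi_weight(r, r_cut=R_CUT, alpha=ALPHA):
    """Assumed Fermi-type weight: 1 at r = 0, about 0.5 at r = r_cut, then near 0."""
    exponent = alpha * (r - r_cut)
    if exponent > 700.0:
        return 0.0  # avoid overflow in exp() for very distant atoms
    return (math.exp(-alpha * r_cut) + 1.0) / (math.exp(exponent) + 1.0)


def hydrophobic_potential(point, atoms):
    """Sum f_i * g(r_i) over atoms given as (x, y, z, f_i)."""
    total = 0.0
    for x, y, z, coefficient in atoms:
        total += coefficient * fermi_weight(math.dist(point, (x, y, z)))
    return total


# Invented atoms: one hydrophobic (positive f) and one polar (negative f).
atoms = [
    (0.0, 0.0, 0.0, 0.36),
    (4.0, 0.0, 0.0, -0.20),
]
print(round(hydrophobic_potential((1.0, 0.0, 0.0), atoms), 4))
```

    The weighting makes atoms within about r_cut of the surface point contribute almost fully, while atoms beyond it contribute almost nothing.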

    V2 updates

    • Optimized atom parameters to improve calculation accuracy.

    Parameters

    Structure PDB File

    Protein structure file in PDB format.

    Hydrophobic Cutoff

    Hydrophobic Cutoff is a cutoff value defined on the basis of hydrophobic amino acids (usually Leu, Ile, Val, Phe, Trp, and Met). It compares the number of hydrophobic amino acids on the surface with the surface area in order to identify regions that may have important biological functions. In general, the chemical-property information obtained for a patch varies with its surface density and with the number of highly hydrophobic amino acids it contains.

    Positive Cutoff

    Positive Cutoff is a cutoff value defined on the basis of cationic amino acids. It compares the number of cationic amino acids on the surface with the surface area in order to identify regions that may have important biological functions. In general, the positive cutoff is used to find protein surface regions that may take part in ionic interactions.

    Negative Cutoff

    Negative Cutoff is a cutoff value defined on the basis of anionic amino acids. It compares the number of anionic amino acids on the surface with the surface area in order to identify regions that may have important biological functions. In general, the negative cutoff is used to find protein surface regions that may take part in ionic interactions.

    SASA Cutoff

    SASA Cutoff is a cutoff value based on solvent-accessible surface area (SASA). Patches below the cutoff are filtered out.

    Distance Cutoff

    Distance Cutoff is the interatomic distance cutoff: only atoms closer together than this value are treated as belonging to the same patch. Lower values result in smaller patches.

    Min Distance Cutoff

    Min Distance Cutoff is the cutoff for the distance between points of neighboring patches (Å). Patches closer together than the cutoff are merged into a single patch.

    Result Type

    Output file format: json or csv.

    Keep Original

    Do not add missing atoms (including hydrogen atoms) or optimize the structure.

    Results

    The output includes:

    Output File Name Description
    patch_list.csv A CSV file containing the patch results. The key value is Area (Å^2), the size of the patch. Larger patches are more suspect; pay particular attention to patches larger than 100 Å².
    input_prot.pdb The protonated PDB structure.
    patch_list_sum.csv Summarizes the surface area occupied by the three patch types (Hyd: hydrophobic center, Neg: negative charge center, Pos: positive charge center) on the protein surface. Pay particular attention to patches larger than 100 Å².

    Details of patch_list.csv:
    The file contains the following information:

    Field Name Description
    Type The type of patch: Hyd (hydrophobic center), Neg (negative charge center), Pos (positive charge center).
    Area (Å^2) The surface area of each patch on the protein.
    Residues The residues corresponding to each patch.

    Details of patch_list_sum.csv:
    The file contains the following information:

    Field Name Description
    Type The type of patch: Hyd (hydrophobic center), Neg (negative charge center), Pos (positive charge center).
    Total Areas The total surface area of patches on the protein.
    Areas of The Largest The largest surface area of a patch on the protein.
    Number of Areas More Than 100 The number of patches with an area larger than 100 Å².
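
    The relationship between the two files can be illustrated by recomputing the summary fields from per-patch areas. This is a sketch only: the input is a list of invented (Type, Area) pairs rather than a real patch_list.csv, the `summarize` helper is hypothetical, and treating exactly 100 Å² as not "more than 100" is an assumption.

```python
from collections import defaultdict


def summarize(patches, large_area=100.0):
    """Aggregate (type, area) pairs into the patch_list_sum.csv fields."""
    summary = defaultdict(
        lambda: {
            "Total Areas": 0.0,
            "Areas of The Largest": 0.0,
            "Number of Areas More Than 100": 0,
        }
    )
    for patch_type, area in patches:
        entry = summary[patch_type]
        entry["Total Areas"] += area
        entry["Areas of The Largest"] = max(entry["Areas of The Largest"], area)
        if area > large_area:
            entry["Number of Areas More Than 100"] += 1
    return dict(summary)


# Invented patch areas in Å² for illustration.
patches = [("Hyd", 120.0), ("Hyd", 60.0), ("Pos", 100.0), ("Neg", 250.5)]
for patch_type, fields in summarize(patches).items():
    print(patch_type, fields)
```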

    References

    • Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348. DOI: 10.1142/p477
    • Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514. DOI: 10.1007/BF00124359
    • Wildman, S.A., Crippen, G.M.; Prediction of Physiochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873. DOI: 10.1021/ci990307l
  • Name: Molecular Docking (AutoDock-GPU v2)
    Description: 基于AutoDock的分子对接工具,采用GPU加速版本,主要用于预测分子之间的结合模式和相互作用,得到分子对接的能量和结合亲和力等信息。它还可以计算和比较多个分子之间的结合能力,用于药物分子的筛选、设计和优化。AutoDock-GPU是AutoDock4.2.6的OpenCL和Cuda加速版本,其利用可并行的LGA,从而通过在多个计算单元上并行处理配体-受体结合构象。配体文件支持的输入格式为SD(.sdf, .sd)、PDB(.pdb)和MOL(.mol)。受体结构文件支持的输入格式为PDB(.pdb)。 It is a docking simulation tool used primarily to predict binding modes and interactions between molecules and obtain information such as molecular docking energy and binding affinity. It can also calculate and compare the binding abilities of multiple molecules, making it useful for drug molecule screening, design, and optimization. AutoDock-GPU is the OpenCL and Cuda accelerated version of AutoDock4.2.6, which leverages its embarrassingly parallelizable LGA by processing ligand-receptor poses in parallel over multiple compute units. The supported input formats for ligand files are SD (.sdf, .sd), PDB (.pdb), and MOL (.mol). The supported input format for receptor files is PDB (.pdb).
    Author: Forli lab
    Release: 2022-06-08 16:00:00
    Reference: Santos-Martins D, Solis-Vasquez L, Tillack AF, et al., Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. J Chem Theory Comput. 2021 Feb 9;17(2):1060-1073. doi: 10.1021/acs.jctc.0c01006.

    Molecular Docking (AutoDock-GPU v2)

    简介

    该模块是一种用于分子对接模拟工具,主要用于预测分子之间的结合模式和相互作用,得到分子对接的能量和结合亲和力等信息。它还可以计算和比较多个分子之间的结合能力,用于药物分子的筛选、设计和优化。AutoDock-GPU是AutoDock4.2.6的OpenCL和Cuda加速版本,其利用可并行的LGA,从而通过在多个计算单元上并行处理配体-受体结合构象。

    参数说明

    支持自行上传小分子文件(Private Ligand Library)或者选择公共分子虚筛库(Public Ligand Library)。

    Private Ligand Library (Comp<100)

    Binding Mode

    对接模式为刚性配体对接(rigid)或者柔性配体对接(flex),
    刚性配体对接:配体自身保持刚性,经平移、旋转,在口袋内寻找合适的结合取向。
    柔性配体对接:配体在固定某些非关键部位的键长、键角的前提下允许其构象发生一定程度的变化。

    Receptor

    受体结构文件,PDB格式

    Private Ligand

    配体结构文件,支持SDF、PDB、MOL格式。只会计算前100的分子。

    Box Center

    对接口袋中心的三维坐标(XYZ),空格分割。例如:0 0 0。

    Box Size

    对接口袋长方体盒子的大小,必须是整数,空格分割,例如 24 22 32。

    Number of Poses

    每个分子保留的最大结合模式数量

    TopN

    虚拟筛选中保留打分排名前n个分子。

    Unbound Model

    未结合状态模型选择:

    • bound:适用于已知结合模式的精确优化,假设配体初始构象接近结合状态。
    • extended:适用于探索结合模式的中等灵活配体,从自由分子状态开始搜索。
    • compact:适用于高度灵活或折叠配体,提供最大范围的结合模式探索,但计算成本最高所需时间最长。

    Keep Heterogens

    保留非标准氨基酸,格式为[链名]:[残基名称]-[残基编号],如A:UNL-311。不能包含特殊离子的小分子结构。

    Private Ligand Library (Comp<10,000)

    Private Ligand

    配体结构文件,支持SDF、PDB、MOL格式。只会计算前10,000的分子。
    其余参数与**Private Ligand Library (Comp<100)**模式一致。

    Public Ligand Library模式

    Public Ligand

    提供17个公共分子虚筛库用于分子对接,包括:

    1. Alinda :~77万库存分子,源自中国香港的Alinda Chemical公司,致力于分子砌块和新颖筛选化合物的研发供应。
    2. Analyticon :~4万库存分子,源自德国的天然产物品牌,专注天然产物提取及类似物合成工作,产品质量稳定。
    3. Asinex :~57万库存分子,源自美国的品牌,多年来致力于类先导化合物及分子砌块的研发供应,价格较贵。
    4. Bionet :~30万库存分子,源自英国的品牌,拥有多年的有机合成经验。
    5. Chembridge :~137万库存分子,源自美国的化合物品牌,总部位于圣地亚哥,拥有多样性库、大环库等多种热门化合物库。
    6. Chemdiv :~156万库存分子,全球最大的化合物品牌之一,拥有5000多种化合物骨架结构和100多种化合物库,性价比高。
    7. Enamine :~407万库存分子,源自乌克兰的化合物品牌,具有较强的化合物研发能力,有高性价比化合物和高价值化合物两类产品。
    8. Eximed :~6万库存分子,源自乌克兰的化合物品牌,近20年来致力于提供高通量筛选化合物及相关服务。
    9. HTS :~6万库存分子,源自德国的HTS Biochemie Innovationen化合物品牌,致力于为制药、农业和生物技术公司开发独特的化合物。
    10. IBS :~55万库存分子,源自俄罗斯的InterBioScreen化合物品牌,拥有多种天然产物及衍生物。
    11. Life_Chemicals :~54万库存分子,源自加拿大的化合物品牌,拥有2900多种化合物骨架结构,化合物规格较齐全且有对应价格。
    12. Maybridge :~5万库存分子,源自英国的化合物品牌,Thermofisher旗下,产品数量少而专,每种产品均具有较大库存。
    13. Otava :~29万库存分子,源自加拿大的化合物品牌,专门从事特色化合物,生物化学药品和生物分析试剂的开发和生成。
    14. Princeton :~153万库存分子,源自美国的化合物品牌,20多年来设计独特的小分子化合物用于药物开发。
    15. Specs :~20万库存分子,源自荷兰的化合物品牌,价格优势明显。
    16. UORSY :~68万库存分子,源自乌克兰的化合物品牌,产品主要用于高通量筛选和药物发现,价格与Enamine接近。
    17. Vitas-m :~140万库存分子,源自美国的化合物品牌,在香港拥有发货中心,到货速度快,价格适中。

    其他参数与Private Ligand Library模式相同,公共库只允许刚性对接。

    结果说明

    输出结果包括:

    输出文件名称 说明
    TopNScores.csv 分子对接得到的打分csv文件。输出小分子最多为10,000。
    complex_001.pdb 展示配体与受体的复合物构象文件。
    output_ligand_topn.sdf 筛选后配体的SDF文件。根据指定的topN数生成,最多为10,000。
    output_complex_topn.tar.bz2 小分子与受体对接后的复合物构象PDB文件压缩包,最多生成前1000小分子的复合物结构。
    TopNScores_Molecule_Info.csv 当Private Ligand Library模式,该csv中不仅有打分信息,还有配体原有信息。

    其中TopNScores.csv包括信息如下:

    字段名称 说明
    Name 对接小分子名称
    Bingding Energy (AutoDock GPU) 对接打分结果
    Cluster RMSD 指一个配体构象相对于同一聚类(cluster)中的中心构象(通常是最低能量构象)的均方根偏差(RMSD)。RMSD 截断值为2.0 Å。
    Reference RMSD 指对接得到的配体构象与 参考构象(通常是实验解析的晶体结构或用户指定的标准结构)之间的 RMSD。

    其中TopNScores_Molecule_Info.csv包含TopNScores.csv的信息和SDF格式小分子原有信息。

    参考文献

    Santos-Martins D, Solis-Vasquez L, Tillack AF, et al., Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. J Chem Theory Comput. 2021 Feb 9;17(2):1060-1073.

    Molecular Docking (AutoDock-GPU v2)

    Introduction

    This module is a molecular docking simulation tool primarily used for predicting molecular binding modes and interactions. It provides information on docking energy and binding affinity. Additionally, it allows for the calculation and comparison of binding abilities among multiple molecules, facilitating the screening, design, and optimization of drug molecules.

    AutoDock-GPU is the OpenCL and CUDA-accelerated version of AutoDock 4.2.6, utilizing parallelizable LGA (Lamarckian Genetic Algorithm) to process ligand-receptor binding conformations in parallel across multiple computing units.


    Parameters

    It supports private ligand file uploads (Private Ligand Library) or the selection of public virtual screening libraries (Public Ligand Library).

    Private Ligand Library (Comp <100)

    Binding Mode

    Docking mode can be either rigid docking or flexible docking:

    • Rigid docking: The ligand remains rigid, undergoing translation and rotation within the binding pocket to find an optimal binding orientation.
    • Flexible docking: The ligand is allowed to undergo conformational changes while keeping certain non-critical bond lengths and angles fixed.

    Receptor

    • Format: PDB

    Private Ligand

    • Formats Supported: SDF, PDB, MOL
    • Limitation: Only the first 100 molecules in the file will be processed.

    Box Center

    • The XYZ coordinates of the docking pocket center, separated by spaces.
      • Example: 0 0 0

    Box Size

    • The size of the docking pocket, represented as a rectangular box with integer values separated by spaces.
      • Example: 24 22 32
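
    One common way to choose Box Center and Box Size is to wrap a known ligand (for example, a co-crystallized one) in a padded bounding box. The sketch below shows the arithmetic under assumptions: the coordinates are invented, the 5 Å padding is an arbitrary choice, the helper names are hypothetical, and the box size is assumed to be expressed in the same length unit as the coordinates.

```python
import math


def docking_box(coords, padding=5.0):
    """Return (center, size) of a padded bounding box around (x, y, z) points.

    Sizes are rounded up because Box Size must be given as integers.
    """
    if not coords:
        raise ValueError("at least one coordinate is required")
    xs, ys, zs = zip(*coords)
    center = [(min(axis) + max(axis)) / 2 for axis in (xs, ys, zs)]
    size = [math.ceil(max(axis) - min(axis) + 2 * padding) for axis in (xs, ys, zs)]
    return center, size


def as_parameter(values):
    """Format values as the space-separated string the parameters expect."""
    return " ".join(f"{value:g}" for value in values)


# Invented ligand atom coordinates.
ligand_atoms = [(0.0, 0.0, 0.0), (10.0, 4.0, 6.0), (5.0, 1.0, 2.0)]
center, size = docking_box(ligand_atoms)
print("Box Center:", as_parameter(center))
print("Box Size:", as_parameter(size))
```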

    Number of Poses

    • The maximum number of binding modes retained for each molecule.

    TopN

    • The number of top-scoring molecules retained from the virtual screening.

    Unbound Model

    Defines the unbound state model:

    • bound: Assumes the initial ligand conformation is close to the bound state, suitable for precise optimization with known binding modes.
    • extended: Begins from a free molecular state, suitable for moderately flexible ligands to explore binding modes.
    • compact: Best for highly flexible or folded ligands, allowing the broadest exploration of binding modes but with higher computational costs and longer runtime.

    Keep Heterogens

    • Retains non-standard amino acids.
    • Format: [Chain Name]:[Residue Name]-[Residue Number], e.g., A:UNL-311.
    • Restriction: Cannot include small molecular structures containing special ions.
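
    The Keep Heterogens format can be checked before submission with a small parser. This is an illustrative sketch: the regular expression encodes only the [Chain Name]:[Residue Name]-[Residue Number] pattern given above, and `parse_heterogen` is a hypothetical helper. How multiple entries are separated is not documented here, so only a single entry is handled.

```python
import re

# [Chain Name]:[Residue Name]-[Residue Number], e.g. A:UNL-311
HETEROGEN_PATTERN = re.compile(
    r"^(?P<chain>[^:\s]+):(?P<name>[^-\s]+)-(?P<number>-?\d+)$"
)


def parse_heterogen(spec):
    """Split a single heterogen spec into (chain, residue name, residue number)."""
    match = HETEROGEN_PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f"expected [Chain]:[Residue Name]-[Residue Number], got {spec!r}")
    return match["chain"], match["name"], int(match["number"])


print(parse_heterogen("A:UNL-311"))
```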

    Private Ligand Library (Comp <10,000)

    Private Ligand

    • Formats Supported: SDF, PDB, MOL
    • Limitation: Only the first 10,000 molecules in the file will be processed.

    🔹 Other parameters are identical to those in Private Ligand Library (Comp <100) mode.


    Public Ligand Library

    Public Ligand

    Provides 17 public virtual screening libraries for molecular docking, including:

    1. Alinda (~770,000 molecules) - Hong Kong-based company specializing in molecular building blocks and novel screening compounds.
    2. Analyticon (~40,000 molecules) - German brand specializing in natural product extraction and analog synthesis.
    3. Asinex (~570,000 molecules) - US-based company focused on lead-like compounds and molecular building blocks, but relatively expensive.
    4. Bionet (~300,000 molecules) - UK-based company with extensive organic synthesis expertise.
    5. Chembridge (~1.37 million molecules) - US-based company with a diverse compound collection, including macrocycles.
    6. Chemdiv (~1.56 million molecules) - One of the largest compound brands globally, offering over 5,000 scaffolds and 100+ libraries.
    7. Enamine (~4.07 million molecules) - Ukraine-based company known for cost-effective and high-value compounds.
    8. Eximed (~60,000 molecules) - Ukraine-based company providing high-throughput screening compounds.
    9. HTS (~60,000 molecules) - German company developing unique compounds for pharmaceutical, agricultural, and biotech applications.
    10. IBS (~550,000 molecules) - Russian company specializing in natural products and derivatives.
    11. Life Chemicals (~540,000 molecules) - Canadian company with diverse scaffolds and transparent pricing.
    12. Maybridge (~50,000 molecules) - UK-based ThermoFisher subsidiary focusing on high-quality compounds.
    13. Otava (~290,000 molecules) - Canadian company specializing in biochemical drugs and reagents.
    14. Princeton (~1.53 million molecules) - US-based company with 20+ years of expertise in small molecule drug discovery.
    15. Specs (~200,000 molecules) - Dutch company known for its cost-effective compounds.
    16. UORSY (~680,000 molecules) - Ukraine-based company with a price range similar to Enamine.
    17. Vitas-m (~1.4 million molecules) - US-based company with a Hong Kong shipping center, offering fast delivery and moderate pricing.

    🔹 Other parameters are identical to Private Ligand Library, but only rigid docking is allowed.


    Result

    The docking results include:

    File Name Description
    TopNScores.csv CSV file containing docking scores for up to 10,000 molecules.
    complex_001.pdb Ligand-receptor complex conformation file.
    output_ligand_topn.sdf Top-N selected ligands in SDF format (max 10,000).
    output_complex_topn.tar.bz2 Compressed file of the top 1,000 ligand-receptor complex structures in PDB format.
    TopNScores_Molecule_Info.csv If using the Private Ligand Library mode, this CSV includes both docking scores and original ligand information.

    📌 TopNScores.csv Fields:

    Field Name Description
    Name Name of the docked molecule.
    Binding Energy (AutoDock GPU) Docking score.
    Cluster RMSD RMSD relative to the cluster center (default cutoff: 2.0 Å).
    Reference RMSD RMSD relative to the reference structure (e.g., crystal structure).

    The TopNScores_Molecule_Info.csv file contains the information from TopNScores.csv along with the original data of small molecules in SDF format.
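
    A short sketch of ranking molecules from TopNScores.csv follows. It assumes a comma-separated file whose header uses the field names listed above; because the exact spelling of the energy column may differ between versions of the documentation, the column is located by its suffix. The molecule names and scores are invented, and `rank_by_energy` is a hypothetical helper. In AutoDock, a lower (more negative) binding energy indicates stronger predicted binding.

```python
import csv

ENERGY_SUFFIX = "Energy (AutoDock GPU)"


def rank_by_energy(csv_text):
    """Return (name, energy) pairs sorted from strongest to weakest predicted binding."""
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    rows = list(csv.DictReader(lines))
    if not rows:
        return []
    energy_columns = [name for name in rows[0] if name.endswith(ENERGY_SUFFIX)]
    if not energy_columns:
        raise ValueError("no binding energy column found")
    column = energy_columns[0]
    scored = [(row["Name"], float(row[column])) for row in rows]
    return sorted(scored, key=lambda item: item[1])  # most negative first


# Invented sample rows for illustration.
sample = """
Name,Binding Energy (AutoDock GPU),Cluster RMSD,Reference RMSD
mol_a,-7.2,0.0,1.3
mol_b,-9.1,0.0,2.4
mol_c,-5.5,0.0,3.0
"""

for name, energy in rank_by_energy(sample):
    print(name, energy)
```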


    References

    Santos-Martins D, Solis-Vasquez L, Tillack AF, et al., Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. J Chem Theory Comput. 2021; 17(2): 1060-1073.

  • Name: Antibody Design (RFAntibody)
    Description: 基于RFAntibody(抗体微调版RFdiffusion)的抗体从头设计 RFAntibody (Antibody Fine-tuned RFdiffusion) -based de novo antibody design
    Author: Bennett NR
    Release: 2025-03-17 09:44:07
    Reference: Bennett NR, Watson JL, Ragotte RJ, Borst AJ, See DL, Weidle C, Biswas R, Shrock EL, Leung PJY, Huang B, Goreshnik I, Ault R, Carr KD, Singer B, Criswell C, Vafeados D, Sanchez MG, Kim HM, Torres SV, Chan S, Baker D. Atomically accurate de novo design of antibodies with RFdiffusion. bioRxiv. 2024.03.14.585103.

    Antibody Design (RFAntibody)

    简介

    RFantibody 是目前最先进的抗体从头生成方法,通过人工智能(AI)技术实现抗体的从头(de novo)设计,包括单域抗体(VHH)和单链抗体片段(scFv),能够精准结合用户指定的目标表位,并已通过湿实验验证其功能。

    RFantibody基于蛋白质结构预测模型RoseTTAFold2(RF2)和蛋白质生成模型RFdiffusion,通过对原始RFdiffusion进行微调,开发出专用于抗体设计的RFdiffusion版本。其核心原理如下:

    • 抗体结构特性利用:RFdiffusion在蛋白质数据库(PDB)中的抗体结构数据(约8100个抗体结构)上进行微调,重点训练抗体特有的互补决定区(CDR)loop 区域,同时保持框架结构接近用户指定的优化框架。训练过程中,通过逐步添加噪声(Cα 坐标加入三维高斯噪声,残基方向加入 SO(3) 布朗运动),网络学习预测去噪后的结构。

    • 表位靶向设计:通过引入"热点"(Hotspot)特征,用户可指定目标蛋白上的表位,网络通过CDR loop与表位的相互作用进行设计。训练时,抗体框架以全局坐标无关的方式提供(通过二维距离和二面角矩阵表示),允许网络自由设计CDR Loop构象及抗体与目标的刚体定位。

    • 序列设计与验证:结构设计后,使用ProteinMPNN生成CDR loop区序列,优化与目标表位的相互作用。设计的抗体通过微调后的RF2进行结构预测和自一致性验证,筛选高潜力候选分子。

    • 支持 VHH 和 scFv 设计:RFdiffusion 不仅支持单域抗体(VHH)的设计,还可应用于单链抗体片段(scFv)的设计。scFv 设计涉及重链和轻链的所有六个 CDR 的设计。

    通过上述方法,RFantibody能够生成多样化的抗体结构,显著区别于训练数据集,同时实现与目标表位的高度形状互补性和功能性结合。


    RFantibody项目针对多个疾病相关表位进行了VHH和scFv设计,并通过表面等离子共振(SPR)、冷冻电镜(cryo-EM)、中和实验等手段验证了设计的有效性。以下是具体实验结果及分析:

    1, 单域抗体(VHH)设计与实验验证

    实验选择了多个疾病相关靶点,包括流感血凝素(HA)、呼吸道合胞病毒(RSV)位点I和III、SARS-CoV-2受体结合域(RBD)、艰难梭菌毒素B(TcdB)和IL-7Rα。以下为关键结果:

    • 结合亲和力(KD):

      • 流感HA:针对HA茎部表位的VHH设计中,最高亲和力结合体(VHH_flu_01)KD值为78 nM,其他结合体KD值分别为546 nM、698 nM和790 nM。实验使用昆虫细胞表达的单体HA(模拟去糖基化状态)以匹配计算设计条件。
      • SARS-CoV-2 RBD:最佳VHH结合体KD值为5.5 μM,通过竞争实验(与已知结合体AHB2竞争)确认结合至目标表位。
      • TcdB:针对Frizzled-7表位的VHH最佳结合体KD值为260 nM,结合特异性高,未观察到与同源性70%的Clostridium sordellii毒素L(TcsL)的交叉反应。
    • 中和活性(EC50):

      • TcdB:针对TcdB的VHH在中和实验中表现出功能性,在CSPG4敲除细胞中中和TcdB毒性,EC50值为460 nM,表明其潜在的治疗应用价值。
    • 结构准确性(cryo-EM):

      • 流感HA:通过cryo-EM解析了VHH_flu_01与原生糖基化HA三聚体的复合物结构(分辨率3.0 Å)。66%的HA颗粒结合了至多两个VHH,部分未结合可能由于N296糖基的遮挡。实验结构与设计模型高度一致,整体RMSD为1.45 Å,CDR3 RMSD为0.8 Å,关键CDR3残基(V100、V101、S103、F108)与HA茎部表位的相互作用如设计预期。
      • TcdB:针对TcdB的原始设计(VHH_TcdB_H2)和亲和力成熟后版本(VHH_TcdB_H2_ortho)进行了cryo-EM分析。原始设计确认结合至Frizzled-7表位,成熟后版本(分辨率5.7 Å)显示更高的结合比例,结构符合设计预期。
      • SARS-CoV-2 RBD:亲和力成熟后的VHH(VHH_RBD_D4_ortho19)结合至RBD"上"构象表位(分辨率3.9 Å)。
    • 亲和力成熟(OrthoRep):

      • 使用OrthoRep系统对TcdB、流感HA和SARS-CoV-2 RBD的VHH进行亲和力成熟,结合亲和力提升约两个数量级,同时保留了原始表位特异性。

    2, 单链抗体片段(scFv)设计与实验验证

    进一步扩展至scFv设计,涉及重链和轻链六个CDR的设计,采用结构导向的组合库策略以提高成功率。实验靶点包括TcdB的Frizzled-7表位和Phox2b/HLA-C*07:02复合物。

    • 结合亲和力(KD):

      • TcdB:通过组合库筛选出针对Frizzled-7表位的scFv,最高亲和力结合体(scFv6)KD值为72 nM,其他结合体的KD值未详细列出。竞争实验(与Frizzled-7竞争)确认结合至目标表位,未与无关受体CSPG4竞争。
      • Phox2b/HLA-C*07:02:针对神经母细胞瘤相关表位的scFv结合体KD值为400 nM(SPR)和1 μM(ITC),特异性结合至Phox2b肽,未结合R6A突变肽。尝试将其转化为CAR-T细胞未显示细胞毒性,可能因亲和力不足或抗原密度低。
    • 结构准确性(cryo-EM):

      • TcdB:两个scFv(scFv5和scFv6)结合至Frizzled-7表位的cryo-EM结构验证了设计准确性。scFv6的分辨率为3.6 Å,整体RMSD为0.9 Å,六个CDR的骨架RMSD分别为CDRH1=0.4 Å、CDRH2=0.3 Å、CDRH3=0.7 Å、CDRL1=0.2 Å、CDRL2=1.1 Å、CDRL3=0.2 Å,侧链构象及相互作用符合设计。scFv5(分辨率6.1 Å)以不同接近角度结合,实验结构与设计模型一致。

    3, 实验结果分析

    • 结构多样性:设计的VHH和scFv的CDR区与自然抗体显著不同,且针对TcdB的Frizzled-7表位无已知抗体,表明RFdiffusion实现了真正的从头设计。
    • 功能性与应用潜力:TcdB VHH的中和活性(EC50=460 nM)和scFv的高亲和力(KD=72 nM)显示出治疗潜力,但Phox2b scFv的CAR-T应用失败表明需进一步优化亲和力或抗原表达。

    4, 总结

    RFantibody通过微调RFdiffusion网络,实现了从头设计VHH和scFv的目标,能够靶向多种疾病相关表位。实验结果显示设计的抗体具有较高的结构准确性(RMSD低至0.9 Å)和功能性(KD低至72 nM,EC50为460 nM)。cryo-EM验证了设计的原子级精度,而亲和力成熟和组合库策略进一步提升了成功率。

    参数说明

    Complex

    用于抗体设计的抗体-抗原复合物结构,PDB格式。如果指定了该参数,后续的Antigen,Antibody参数不用再指定。如果不指定该参数,则需要分别输入Antigen与Antibody的结构。
    注意:
    1,当前只支持单链抗原,如存在多链时会提示错误,可以使用蛋白编辑工具去掉抗原多余的链,保留单链抗原即可。
    2,抗体编号方式只支持Chothia,会自动转成Chothia编号。

    Antigen

    指定抗原的结构文件,PDB格式。
    说明:抗原结构通常需要截短以减少计算开销,建议保留表位周围约 10Å 的区域即可。

    Antibody

    指定抗体的结构文件,PDB格式。

    Number of designs

    指定设计的抗体数量,默认为20。

    H-CDR1, H-CDR2, H-CDR3, L-CDR1, L-CDR2, L-CDR3

    分别指定需要设计的抗体重、轻链CDR区的长度范围。格式为:起始长度-终止长度(如:5-13),或单一长度(如:7)。
    说明:这些参数定义了每个CDR区的允许长度范围,如果设置的是起始长度-终止长度(如:5-13),模型将从中均匀采样长度。如果设置的是单一长度(如:7),则该CDR将以指定长度进行设计。如果不指定某个CDR的长度范围(如:不设置H-CDR1的长度),则该CDR将保持原始结构和序列不被设计。
    对于VHH设计,仅需指定H-CDR1, H-CDR2, H-CDR3;对于scFv设计,可指定所有六个CDR。长度选择可参考自然抗体的CDR 长度分布,推荐较短的H-CDR3(如:5-13),以降低设计难度。

    Hotspot

    指定抗原上的结合位点残基,用于定义抗体结合的表位。格式为:逗号分隔的残基列表,格式为 305,456

    • 说明:结合位点残基帮助模型聚焦于特定表位。选择时建议挑选表位中3个以上疏水性残基,避免过多极性或糖基化区域。

    结果说明

    经过抗体设计后,得到的抗体-抗原复合物结构,并根据质量评估指标进行排序。包括:

    结构文件:按结构质量排序的PDB格式抗体-抗原复合物结构的打包文件 de_novo_antibody.tar及最优的设计结果rank_1.pdb
    结构评分:CSV格式的评估指标表格 cdr_sequences.csv,包含如下信息:

    字段名称 说明
    Design_ID 预测结构的文件名
    ipAE 预测对齐误差交互值(the predicted interaction alignment error),衡量抗体与抗原结合界面的结构预测置信度,该指标反映了抗体-抗原复合物界面的结构稳定性和预测准确性,数值越小表示结合界面预测越可靠,推荐选择ipAE<10的设计进行实验验证
    pLDDT 预测局部距离差异测试,衡量整体结构预测的质量和可靠性,该指标反映了抗体结构本身的稳定性和折叠质量,数值范围为 0-1.0,数值越接近1.0表示结构预测越可靠,推荐选择pLDDT > 0.8的设计进行实验验证

    输出示例

    Design_ID,CDR_H3,ipAE,pLDDT
    rank_1,IAYTPGAPLF,8.91,0.92
    rank_2,VAPSKTDALF,9.29,0.92
    

    参考文献

    Bennett NR, Watson JL, Ragotte RJ, Borst AJ, See DL, Weidle C, Biswas R, Shrock EL, Leung PJY, Huang B, Goreshnik I, Ault R, Carr KD, Singer B, Criswell C, Vafeados D, Sanchez MG, Kim HM, Torres SV, Chan S, Baker D. Atomically accurate de novo design of antibodies with RFdiffusion. bioRxiv. 2024.03.14.585103.

    Antibody Design (RFAntibody)

    Introduction

    RFantibody is the most advanced de novo antibody generation method currently available. Through artificial intelligence (AI) technology, it achieves de novo design of antibodies, including single-domain antibodies (VHH) and single-chain antibody fragments (scFv), capable of precisely binding to user-specified target epitopes, with functionality validated through wet lab experiments.

    RFantibody is based on the protein structure prediction model RoseTTAFold2 (RF2) and the protein generation model RFdiffusion. By fine-tuning the original RFdiffusion, a specialized version for antibody design has been developed. Its core principles are as follows:

    • Utilization of Antibody Structural Features: RFdiffusion is fine-tuned on antibody structural data (approximately 8,100 antibody structures) from the Protein Data Bank (PDB), focusing on training the antibody-specific complementarity-determining region (CDR) loops while maintaining framework structures close to user-specified optimized frameworks. During training, noise is gradually added (3D Gaussian noise to Cα coordinates, SO(3) Brownian motion to residue orientations), and the network learns to predict the denoised structure.

    • Epitope-Targeted Design: By introducing “Hotspot” features, users can specify epitopes on target proteins, and the network designs through interactions between CDR loops and the epitope. During training, the antibody framework is provided in a globally coordinate-independent manner (represented by 2D distance and dihedral angle matrices), allowing the network to freely design CDR loop conformations and rigid-body positioning of the antibody relative to the target.

    • Sequence Design and Validation: After structural design, ProteinMPNN is used to generate sequences for CDR loop regions, optimizing interactions with the target epitope. The designed antibodies are validated through structure prediction and self-consistency verification using the fine-tuned RF2, screening for high-potential candidates.

    • Support for VHH and scFv Design: RFdiffusion supports not only the design of single-domain antibodies (VHH) but also single-chain antibody fragments (scFv). scFv design involves designing all six CDRs of the heavy and light chains.

    Through these methods, RFantibody can generate diverse antibody structures that significantly differ from the training dataset while achieving high shape complementarity and functional binding to target epitopes.

    image.png
    Experimental Validation

    The RFantibody project has conducted VHH and scFv designs targeting multiple disease-related epitopes and validated their effectiveness through surface plasmon resonance (SPR), cryo-electron microscopy (cryo-EM), neutralization assays, and other methods. The following are specific experimental results and analyses:

    1, Single-Domain Antibody (VHH) Design and Experimental Validation

    Experiments selected multiple disease-related targets, including influenza hemagglutinin (HA), respiratory syncytial virus (RSV) sites I and III, SARS-CoV-2 receptor-binding domain (RBD), Clostridioides difficile toxin B (TcdB), and IL-7Rα. Key results include:

    • Binding Affinity (KD):

      • Influenza HA: Among VHH designs targeting the HA stem epitope, the highest affinity binder (VHH_flu_01) had a KD value of 78 nM, with other binders having KD values of 546 nM, 698 nM, and 790 nM. Experiments used insect cell-expressed monomeric HA (simulating deglycosylated state) to match computational design conditions.
      • SARS-CoV-2 RBD: The best VHH binder had a KD value of 5.5 μM, confirmed to bind to the target epitope through competition experiments (competing with known binder AHB2).
      • TcdB: The best VHH binder targeting the Frizzled-7 epitope had a KD value of 260 nM, with high binding specificity and no observed cross-reactivity with Clostridium sordellii toxin L (TcsL), which has 70% homology.
    • Neutralization Activity (EC50):

      • TcdB: VHHs targeting TcdB demonstrated functionality in neutralization assays, neutralizing TcdB toxicity in CSPG4 knockout cells with an EC50 value of 460 nM, indicating potential therapeutic applications.
    • Structural Accuracy (cryo-EM):

      • Influenza HA: Cryo-EM resolved the structure of VHH_flu_01 in complex with the native, glycosylated HA trimer at 3.0 Å resolution. 66% of HA particles bound up to two VHHs; the incomplete occupancy may be due to shielding by the N296 glycan. The experimental structure closely matched the design model, with an overall RMSD of 1.45 Å and a CDR3 RMSD of 0.8 Å, and key CDR3 residues (V100, V101, S103, F108) interacted with the HA stem epitope as designed.
      • TcdB: Cryo-EM analysis was performed on the original design (VHH_TcdB_H2) and its affinity-matured version (VHH_TcdB_H2_ortho) targeting TcdB. The analysis confirmed that the original design binds the Frizzled-7 epitope; the matured version (5.7 Å resolution) showed a higher proportion of bound particles, and the structures were consistent with the design.
      • SARS-CoV-2 RBD: The affinity-matured VHH (VHH_RBD_D4_ortho19) bound to the RBD “up” conformation epitope (resolution 3.9 Å).
    • Affinity Maturation (OrthoRep):

      • The OrthoRep system was used for affinity maturation of VHHs targeting TcdB, influenza HA, and SARS-CoV-2 RBD, improving binding affinity by approximately two orders of magnitude while maintaining original epitope specificity.

    2, Single-Chain Antibody Fragment (scFv) Design and Experimental Validation

    Extending the approach to scFvs required designing all six CDRs of the heavy and light chains; a structure-guided combinatorial library strategy was adopted to increase success rates. Experimental targets included the Frizzled-7 epitope of TcdB and the Phox2b/HLA-C*07:02 complex.

    • Binding Affinity (KD):

      • TcdB: Through combinatorial library screening, scFvs targeting the Frizzled-7 epitope were identified, with the highest affinity binder (scFv6) having a KD value of 72 nM. KD values for other binders were not detailed. Competition experiments (competing with Frizzled-7) confirmed binding to the target epitope, with no competition with the unrelated receptor CSPG4.
      • Phox2b/HLA-C*07:02: The scFv targeting this neuroblastoma-related epitope had KD values of 400 nM (SPR) and 1 μM (ITC); it bound specifically to the Phox2b peptide but not to the R6A mutant peptide. When it was converted into a CAR-T format, no cytotoxicity was observed, possibly because of insufficient affinity or low antigen density.
    • Structural Accuracy (cryo-EM):

      • TcdB: Cryo-EM structures of two scFvs (scFv5 and scFv6) binding to the Frizzled-7 epitope validated design accuracy. scFv6 had a resolution of 3.6 Å, overall RMSD of 0.9 Å, and backbone RMSDs for the six CDRs of CDRH1=0.4 Å, CDRH2=0.3 Å, CDRH3=0.7 Å, CDRL1=0.2 Å, CDRL2=1.1 Å, CDRL3=0.2 Å, with side chain conformations and interactions conforming to design. scFv5 (resolution 6.1 Å) bound with a different approach angle, with the experimental structure consistent with the design model.

    3, Analysis of Experimental Results

    • Structural Diversity: The designed VHHs and scFvs had CDR regions significantly different from natural antibodies, and there were no known antibodies for the Frizzled-7 epitope of TcdB, indicating that RFdiffusion achieved true de novo design.
    • Functionality and Application Potential: The neutralization activity of TcdB VHH (EC50=460 nM) and high affinity of scFv (KD=72 nM) demonstrated therapeutic potential, but the failure of Phox2b scFv in CAR-T applications indicated the need for further optimization of affinity or antigen expression.

    4, Summary

    By fine-tuning the RFdiffusion network, RFantibody enables de novo design of VHHs and scFvs that target a range of disease-related epitopes. Experimentally, the designed antibodies show high structural accuracy (RMSD as low as 0.9 Å) and functional activity (KD as low as 72 nM, EC50 of 460 nM). Cryo-EM confirmed the atomic-level accuracy of the designs, while affinity maturation and combinatorial library strategies further improved success rates.

    Parameters

    Complex

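    The antibody-antigen complex structure used for antibody design, in PDB format. If this parameter is specified, the Antigen and Antibody parameters below do not need to be specified. If it is not specified, the Antigen and Antibody structures must be provided separately.
    Note:
    1. Only single-chain antigens are currently supported; an error is reported if the antigen has multiple chains. Use a protein editing tool to remove the extra antigen chains and keep a single chain.
    2. Only Chothia antibody numbering is supported; input structures are automatically converted to Chothia numbering.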

    Antigen

    The structure file of the antigen, in PDB format.
    Note: The antigen structure usually needs to be truncated to reduce computational cost. It is recommended to retain only the region within approximately 10 Å around the epitope.

    Antibody

    The structure file of the antibody, in PDB format.

    Number of designs

    The number of antibodies to be designed, with a default value of 20.

    H-CDR1, H-CDR2, H-CDR3, L-CDR1, L-CDR2, L-CDR3

    Specify the length range of the CDR regions in the heavy and light chains to be designed. The format is: start length-end length (e.g., 5-13), or a single length (e.g., 7).
    Note: These parameters define the allowed length range for each CDR region. If a range is specified (e.g., 5-13), the model will uniformly sample lengths within this range. If a single length is specified (e.g., 7), the CDR will be designed with the given length. If the length range of a CDR is not specified (e.g., H-CDR1 is not set), that CDR will retain its original structure and sequence without being designed.
    For VHH design, only H-CDR1, H-CDR2, and H-CDR3 need to be specified; for scFv design, all six CDRs can be specified. The length selection can refer to the natural distribution of CDR lengths in antibodies. It is recommended to use a shorter H-CDR3 (e.g., 5-13) to reduce design complexity.
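    The CDR length format described above can be illustrated with a minimal Python sketch. The function names parse_cdr_length and sample_cdr_length are hypothetical helpers written for this example and are not part of the module; the sketch only mirrors the documented behavior (a range is sampled uniformly, a single value is used as-is, an unset value means the CDR is not designed).

```python
import random


def parse_cdr_length(spec):
    """Parse '5-13' or '7' into an inclusive (min, max) tuple.

    Returns None for an empty/unset value, meaning the CDR keeps its
    original structure and sequence (hypothetical helper, not the module's API).
    """
    if spec is None or not spec.strip():
        return None
    parts = spec.strip().split("-")
    if len(parts) == 1:
        lo = hi = int(parts[0])
    elif len(parts) == 2:
        lo, hi = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"invalid CDR length spec: {spec!r}")
    if lo < 1 or hi < lo:
        raise ValueError(f"invalid CDR length bounds: {spec!r}")
    return lo, hi


def sample_cdr_length(spec, rng=random):
    """Draw one length uniformly from the inclusive range, or None if unset."""
    bounds = parse_cdr_length(spec)
    if bounds is None:
        return None
    return rng.randint(*bounds)  # randint includes both endpoints


if __name__ == "__main__":
    print(parse_cdr_length("5-13"))   # a range
    print(parse_cdr_length("7"))      # a single fixed length
    print(sample_cdr_length("5-13"))  # one uniformly sampled length
```

    Treating a single value as a range with equal bounds keeps the sampling code identical for both input forms.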

    Hotspot

    Specify the binding site residues on the antigen to define the epitope for antibody binding. The format is: a comma-separated list of residues, e.g., 305,456.
    Note: Binding site residues help the model focus on specific epitopes. It is recommended to select more than three hydrophobic residues within the epitope and avoid areas with excessive polarity or glycosylation.

    Result Description

    Antibody design produces antibody-antigen complex structures, ranked by quality assessment metrics. The outputs are:
    Structure Files: de_novo_antibody.tar, an archive of the designed antibody-antigen complex structures in PDB format, sorted by structural quality, plus the top-ranked design, rank_1.pdb.
    Structure Scores: cdr_sequences.csv, a CSV file of assessment metrics with the following fields:

    Field Name Description
    Design_ID The filename of the predicted structure
    ipAE Predicted interaction alignment error, which measures the confidence of the structural prediction at the antibody-antigen binding interface. This metric reflects the stability and accuracy of the antibody-antigen complex interface. Lower values indicate more reliable predictions. Designs with ipAE < 10 are recommended for experimental validation.
    pLDDT Predicted Local Distance Difference Test, which measures the overall quality and reliability of the structural prediction. This metric reflects the stability and folding quality of the antibody structure itself. The value ranges from 0 to 1.0, with values closer to 1.0 indicating more reliable structural predictions. Designs with pLDDT > 0.8 are recommended for experimental validation.

    Example

    Design_ID,CDR_H3,ipAE,pLDDT
    rank_1,IAYTPGAPLF,8.91,0.92
    rank_2,VAPSKTDALF,9.29,0.92
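
    As a rough illustration of the recommended thresholds (ipAE < 10 and pLDDT > 0.8), the following Python sketch filters rows of a cdr_sequences.csv-style table. The function name select_designs is a hypothetical helper for this example; the column names are taken from the example output above.

```python
import csv
import io


def select_designs(csv_text, max_ipae=10.0, min_plddt=0.8):
    """Return Design_IDs whose ipAE is below max_ipae and pLDDT above min_plddt."""
    reader = csv.DictReader(io.StringIO(csv_text))
    keep = []
    for row in reader:
        ipae = float(row["ipAE"])
        plddt = float(row["pLDDT"])
        if ipae < max_ipae and plddt > min_plddt:
            keep.append(row["Design_ID"])
    return keep


# The two rows from the example output above.
example = """Design_ID,CDR_H3,ipAE,pLDDT
rank_1,IAYTPGAPLF,8.91,0.92
rank_2,VAPSKTDALF,9.29,0.92
"""

if __name__ == "__main__":
    print(select_designs(example))  # both example designs pass the thresholds
```

    To apply it to a real result file, the file contents can be read into a string first and passed in the same way.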
    

    References

    Bennett NR, Watson JL, Ragotte RJ, Borst AJ, See DL, Weidle C, Biswas R, Shrock EL, Leung PJY, Huang B, Goreshnik I, Ault R, Carr KD, Singer B, Criswell C, Vafeados D, Sanchez MG, Kim HM, Torres SV, Chan S, Baker D. Atomically accurate de novo design of antibodies with RFdiffusion. bioRxiv. 2024.03.14.585103.

  • Name: Nanobody Thermostability Prediction
    Description: 基于机器学习方法预测纳米抗体的热稳定性(Tm值)。 Machine learning-based prediction of nanobodies thermostability (Tm value).
    Tags: undefined
    Author: Aubin Ramon
    Release: 2025-03-06 11:26:44
    Reference: Ramon A, Ni M, Predeina O, Gaffey R, Kunz P, Onuoha S, Sormanni P. Prediction of protein biophysical traits from limited data: a case study on nanobody thermostability through NanoMelt. MAbs. 2025 Dec;17(1):2442750.

    Nanobody Thermostability Prediction

    简介

    该模块用于预测纳米抗体的热稳定性(Tm值)。模型使用多种序列嵌入(如ESM-1b、one-hot、VHSE)来表示蛋白质序列,并通过不同的回归模型(如ridge、GPR、RF、SVR)进行处理,最后进行集成学习得到最终模型。通过文献整合和新测量,构建了一个包含640个独特纳米抗体序列的熔化温度数据集。具体来说,从NbThermo数据库中添加了511个独特序列点,并通过实验生成了129个新数据点。在测试集上表现出较高的预测准确性,Pearson相关系数为0.853,Spearman相关系数为0.832,MAE为4.1°C,SDR为0.86。
    模型的整体架构如下图所示:
    image.png
    模型预测效果如下图所示:
    image.png

    参数说明

    Nanobody Sequence

    纳米抗体的序列文件,FASTA格式

    Output

    输出结果文件名,默认为Tm_pred.csv。

    结果说明

    输出结果文件为Tm_pred.csv,包含信息如下:

    字段名称 说明
    ID 序列ID
    Aligned Sequence 输入序列与数据库序列进行alignment后的输出序列格式
    Sequence 输入序列
    NanoMelt Tm (C) 预测得到的Tm值

    备注:部分纳米抗体无法预测其热稳定性。

    参考文献

    Ramon A, Ni M, Predeina O, Gaffey R, Kunz P, Onuoha S, Sormanni P. Prediction of protein biophysical traits from limited data: a case study on nanobody thermostability through NanoMelt. MAbs. 2025 Dec;17(1):2442750.

    Nanobody Thermostability Prediction

    Introduction

    This module is designed to predict the thermostability (Tm value) of nanobodies. The model uses various sequence embeddings, such as ESM-1b, one-hot, and VHSE, to represent protein sequences. These are processed by different regression models, including ridge, GPR, RF, and SVR, and finally integrated to produce the final model. A dataset comprising the melting temperatures of 640 unique nanobody sequences was constructed through literature integration and new measurements. Specifically, 511 unique sequence points were added from the NbThermo database, and 129 new data points were generated experimentally. The model demonstrates high prediction accuracy on the test set, with a Pearson correlation coefficient of 0.853, a Spearman correlation coefficient of 0.832, an MAE of 4.1°C, and an SDR of 0.86.
    The overall architecture of the model is shown in the figure below:
    image.png
    The model’s prediction performance is illustrated in the figure below:
    image.png

    Parameters

    Nanobody Sequence

    The sequence file of the nanobody, in FASTA format.

    Output

    The name of the output result file, default is Tm_pred.csv.

    Result

    The output result file is Tm_pred.csv, containing the following information:

    Field Name Description
    ID Sequence ID
    Aligned Sequence The input sequence after alignment to the database sequences
    Sequence Input sequence
    NanoMelt Tm (C) Predicted Tm value, in °C

    Note: The thermostability of some nanobodies cannot be predicted.

    References

    Ramon A, Ni M, Predeina O, Gaffey R, Kunz P, Onuoha S, Sormanni P. Prediction of protein biophysical traits from limited data: a case study on nanobody thermostability through NanoMelt. MAbs. 2025 Dec;17(1):2442750.

  • Name: MD Solvation v2
    Description: 对MD体系加入水盒子和离子。v2新增自主添加金属离子环境功能。 Adds water box and ions for the system. Add user-specified ions in version v2.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-02-19 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD Solvation v2

    Introduction

    Solvates the MD system by adding a water box and ions.

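    Parameters

    Receptor Topology

    Input receptor topology file; can be generated by the GMX Receptor Parameterization module.

    Receptor GRO

    Input receptor structure file; can be generated by the GMX Receptor Parameterization module.

    Receptor ITP

    Input receptor parameter file (compressed archive); can be generated by the GMX Receptor Parameterization module.

    Ligand GRO

    Input ligand structure file (compressed archive); can be generated by the GMX Ligand Parameterization module.

    Ligand ITP

    Input ligand parameter file (compressed archive); can be generated by the GMX Ligand Parameterization module.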

    Ions

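    Ions to add. Supported ions: sodium (NA), potassium (K), chloride (CL), calcium (CA), magnesium (MG), and zinc (ZN). To add several ion types at once, separate them with colons, e.g., NA:K:MG.

    Number of Ions

    Number of ions to add. When adding several ion types, give one colon-separated value per entry in Ions, in the same order, e.g., 15:20:30.
    Note: Specify either Number of Ions or Concentration of Ions, not both.

    Concentration of Ions

    Concentration of ions to add. When adding several ion types, give one colon-separated value per entry in Ions, in the same order, e.g., 0.15:0.3:0.1.
    Note: Specify either Number of Ions or Concentration of Ions, not both.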
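
    The pairing between Ions and Number of Ions or Concentration of Ions can be sketched in Python as follows. This is only an illustration of the documented colon-separated format and the either-or rule; parse_ion_settings is a hypothetical helper, not the module's own code, and the sketch makes no assumption about the concentration unit.

```python
# Ion names listed as supported in the Ions parameter description.
SUPPORTED_IONS = {"NA", "K", "CL", "CA", "MG", "ZN"}


def parse_ion_settings(ions, numbers=None, concentrations=None):
    """Pair each ion name with its count or concentration.

    Exactly one of `numbers` or `concentrations` must be given
    (hypothetical helper mirroring the documented input rules).
    """
    if bool(numbers) == bool(concentrations):
        raise ValueError("give exactly one of numbers or concentrations")
    names = [name.strip().upper() for name in ions.split(":")]
    unknown = [name for name in names if name not in SUPPORTED_IONS]
    if unknown:
        raise ValueError(f"unsupported ions: {unknown}")
    if numbers:
        mode = "count"
        values = [int(v) for v in numbers.split(":")]
    else:
        mode = "concentration"
        values = [float(v) for v in concentrations.split(":")]
    if len(values) != len(names):
        raise ValueError("need one value per ion, in the same order")
    return mode, dict(zip(names, values))


if __name__ == "__main__":
    print(parse_ion_settings("NA:K:MG", numbers="15:20:30"))
    print(parse_ion_settings("NA:K:MG", concentrations="0.15:0.3:0.1"))
```

    Rejecting input that sets both fields, or neither, surfaces the either-or rule early instead of silently preferring one value.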

    Output Topology

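    Output topology file for the complete system.

    Output GRO

    Output structure file for the complete system.

    Output ITP

    Output parameter file (compressed archive) for the system.

    Distance Restraints

    Distance restraints; these take effect only when Disre is not set to no. The format is: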

    [AtomIndex1]  [AtomIndex2]  [Type]  [Index]  [Type]  [Low]  [Up1]  [Up2]  [Factor]
    

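    Here, AtomIndex1 and AtomIndex2 are atom numbers in system.gro. Type is the restraint type, usually set to 1 (see Table 1 for the available types). Index is the calculation order. Low, Up1, and Up2 are the distance bounds between the two atoms, in nm: a distance between Low and Up1 is unrestrained, but the distance must not exceed Up2. Factor is a multiplier: Factor times "Disre Force Constant" gives the restraint force constant, in kJ/mol/nm^2.
    For example: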

    10     16      1       0       1      0.0     0.3     0.4     1.0
    10     46      1       1       1      0.0     0.3     0.4     1.0
    16     22      1       2       1      0.0     0.3     0.4     2.5
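
    The nine-column distance restraint rows can be parsed as in the Python sketch below. The names DistanceRestraint, parse_distance_restraints, and effective_force_constant are hypothetical and exist only for this illustration, and the value 1000.0 used for "Disre Force Constant" is an arbitrary example, not a default of the module.

```python
from typing import NamedTuple


class DistanceRestraint(NamedTuple):
    atom1: int
    atom2: int
    rtype: int      # first Type column
    index: int      # calculation order
    rtype2: int     # second Type column
    low: float      # nm
    up1: float      # nm
    up2: float      # nm
    factor: float   # multiplier for "Disre Force Constant"


def parse_distance_restraints(text):
    """Parse whitespace-separated restraint rows; blank lines are skipped."""
    restraints = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 9:
            raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
        ints = [int(f) for f in fields[:5]]
        floats = [float(f) for f in fields[5:]]
        restraint = DistanceRestraint(*ints, *floats)
        if not restraint.low <= restraint.up1 <= restraint.up2:
            raise ValueError(f"bounds must satisfy Low <= Up1 <= Up2: {line!r}")
        restraints.append(restraint)
    return restraints


def effective_force_constant(restraint, disre_force_constant):
    """Factor times "Disre Force Constant", in kJ/mol/nm^2."""
    return restraint.factor * disre_force_constant


# The three example rows from the text.
example_rows = """
10     16      1       0       1      0.0     0.3     0.4     1.0
10     46      1       1       1      0.0     0.3     0.4     1.0
16     22      1       2       1      0.0     0.3     0.4     2.5
"""

if __name__ == "__main__":
    for r in parse_distance_restraints(example_rows):
        print(r.atom1, r.atom2, effective_force_constant(r, 1000.0))
```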
    

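    Table 1: The three restraint types GROMACS can apply to atom pairs

    Type Code Restraint Type When Used
    1 Complex NMR distance restraints When Disre Type is ensemble, i.e., for nonbonded interactions; set to 1
    6 Simple harmonic restraints When Disre Type is simple, i.e., for intramolecular bonded interactions; may be set to 6 or 10
    10 Piecewise linear/harmonic restraints When Disre Type is simple, i.e., for intramolecular bonded interactions; may be set to 6 or 10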

    Angle Restraints

    Angle restraints restrain the angle between two pairs of atoms; they take effect only when Disre is not set to no. The format is:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Theta0]  [Force Constant]  [Multiplicity]
    

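    Here, AtomIndex1-AtomIndex2 is the first atom pair and AtomIndex3-AtomIndex4 is the second atom pair. Type is not used here; set it to 1. Theta0 is the restrained angle, in degrees. Force Constant is the restraint force constant, in kJ/mol. Multiplicity is the multiplicity.
    For example: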

    2642     2643     2635     2652     1     67.0     1500     1
    

    Dihedral Restraints

    Dihedral restraints; these take effect only when Disre is not set to no. The format is:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Label]  [Phi]  [dPhi]  [KFactor]  [Power]
    

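    Here, AtomIndex1 to AtomIndex4 are the atoms that define the dihedral. Type is the restraint function type and is always 1. Label is ignored. Phi is the reference angle and dPhi is the angular deviation from the reference angle, both in degrees. KFactor is a multiplier: KFactor times "Disre Force Constant" gives the restraint force constant, in kJ/mol/rad^2. Power is ignored.
    For example: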

    2642      2643      2635      2652      1      67.0      1500      1
    

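    The restraint potential is shown below:
    image.png
    where Φ' is the reference angle Phi, ΔΦ is the deviation from the reference angle (dPhi), and K_dihr is the restraint force constant (KFactor).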

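    Results

    The output includes:

    Output File Name Description
    system.gro Molecular coordinate file of the system
    system_itp.tar.gz Forces applied to hold atom positions fixed during equilibration of the system
    system.top Topology file of the system

    References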

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: 10.1016/j.softx.2015.06.001

  • Name: Human Germline BLAST (v2.1)
    Description: 通过序列比对在人类生殖系数据库中搜索与目标抗体序列接近的同源模板,输出对应的模板序列以及序列一致性信息。 Search the human germline database for homologs of the target antibody sequence, and output the template sequences and the corresponding identities.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-02-11 14:29:03
    Reference:

    Human Germline BLAST (v2.1)

    简介

    通过序列比对在人类生殖系数据库中搜索与目标抗体序列最接近的同源模板,输出对应的模板序列以及序列一致性信息。

    参数说明

    Sequence String模式

    Input Sequence

    抗体的序列(纯序列信息,非FASTA格式文件)。

    Type

    抗体编号类型:kabat、chothia、imgt。

    TopHits

    输出同源性最高的n条序列。

    Fasta File模式

    FASTA File

    抗体的序列文件,FASTA格式。

    Type

    抗体编号类型:kabat、chothia、imgt。

    TopHits

    输出同源性最高的n条序列。

    结果说明

    输出参数 输出文件名称 说明
    Hits Sequence hits.fasta 包含同源性最高的n条序列的序列文件
    Result result.json 包含找到的Germline模板以及序列的一致性信息

    相关内容

    抗体常用的germline模板:
    image.png

    临床后期及已上市抗体的germline配对分布情况(统计自Adimab数据集):
    image.png
    image.png
    Adimab_germline_usage.jpeg

    Human Germline BLAST (v2.1)

    Introduction

    This module uses sequence alignment to search the human germline database for the templates most homologous to a target antibody sequence, and outputs the matching template sequences together with their sequence identity.

    Parameter Description

    Sequence String Mode

    Input Sequence

    The antibody sequence (a plain sequence string, not a FASTA file).

    Type

    Type of antibody numbering: kabat, chothia, imgt.

    TopHits

    Number of top hits to output.

    Fasta File Mode

    FASTA File

    Antibody sequence file in FASTA format.

    Type

    Type of antibody numbering: kabat, chothia, imgt.

    TopHits

    Number of top hits to output.

    Result Description

    Output Parameter | Output File Name | Description
    Hits Sequence | hits.fasta | Sequence file containing the n most homologous sequences
    Result | result.json | The germline templates found and their sequence identity information

    Related Content

    Commonly used germline templates for antibodies:
    image.png

    Distribution of germline pairing for late-stage and marketed antibodies (statistics from the Adimab dataset):
    image.png
    image.png
    Adimab_germline_usage.jpeg

  • Name: Grafting (v2.4)
    Description: Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.4 Graft antibody CDRs to target frameworks, normally for humanization. Version: v2.4
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-02-11 14:25:31
    Reference:

    Grafting v2.4

    简介

    Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.4

    参数说明

    Antibody Sequence File

    抗体序列文件,FASTA格式

    Numbering Type

    抗体编号规则:kabat,imgt,chothia

    Output File

    指定输出抗体graft后的序列文件名称,FASTA格式

    Output Policy

    指定输出graft策略文件,JSON格式

    Germline Score

    指定输出抗体FR区序列比对同源性打分文件

    Germline

    指定轻链或重链使用特定germline模板,也可都指定,写法如下:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    其中链名来自于流程第一步输入的fasta文件。
    例1:以下语句为链"Infliximab.H"指定了模板"IGHV3-7*01":

    Infliximab.H:IGHV3-7*01
    

    例2:以下语句为两条链分别指定了模板:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
    

    Template Sequence

    指定参考模板序列,FASTA格式

    Germline Hits

    指定输出FR区序列比对结果文件,FASTA格式

    Number of Hits

    指定输出命中序列的数目

    结果说明

    输出结果包括:

    输出文件名称 说明
    germline_hits.fasta 输出FR区序列比对结果文件
    germline_score.json 输出抗体FR区序列比对同源性打分文件
    grafted.fasta 输出抗体graft后的序列文件名称
    graft_policy.json 输出graft策略文件

    Grafting v2.4

    Introduction

    The Grafting module is used to graft the CDRs of an antibody onto a specific framework region template, typically used for humanization design. Version: v2.4

    Parameter Description

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Numbering Type

    Antibody numbering rule: kabat, imgt, chothia.

    Output File

    Specify the output file name for the grafted antibody sequence in FASTA format.

    Output Policy

    Specify the output grafting strategy file in JSON format.

    Germline Score

    Specify the output file for the homology scores of the antibody FR region sequences.

    Germline

    Specify the specific germline template to be used for the light chain or heavy chain, or both, in the format:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    The chain names are those in the FASTA file supplied in the first step of the workflow.
    Example 1: The following statement specifies the template “IGHV3-7*01” for the chain “Infliximab.H”:

    Infliximab.H:IGHV3-7*01
    

    Example 2: The following statement specifies templates for two chains separately:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
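
    The Germline value format shown above (seq_name:germline_name pairs separated by commas) can be parsed as in this minimal Python sketch. parse_germline_spec is a hypothetical helper written for illustration and is not part of the Grafting module.

```python
def parse_germline_spec(spec):
    """Turn 'seq1:germ1,seq2:germ2' into {'seq1': 'germ1', 'seq2': 'germ2'}."""
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon: chain name on the left, template on the right.
        name, sep, germline = item.partition(":")
        if not sep or not name or not germline:
            raise ValueError(f"expected seq_name:germline_name, got {item!r}")
        if name in mapping:
            raise ValueError(f"chain specified twice: {name!r}")
        mapping[name] = germline
    return mapping


if __name__ == "__main__":
    print(parse_germline_spec("Infliximab.H:IGHV3-7*01"))
    print(parse_germline_spec("Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01"))
```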
    

    Template Sequence

    Specify the reference template sequence in FASTA format.

    Germline Hits

    Specify the output file for the FR region sequence alignment results in FASTA format.

    Number of Hits

    Specify the number of sequences to output.

    Result Description

    The output includes:

    Output File Name Description
    germline_hits.fasta Output file for FR region sequence alignment results
    germline_score.json Output file for homology scores of the antibody FR region sequences
    grafted.fasta Output file containing the grafted antibody sequence
    graft_policy.json Output file for the grafting strategy
  • Name: Mutation Energy of Stability (Pythia)
    Description: 基于自监督图神经网络预测突变对蛋白稳定性影响。 A self-supervised graph neural network for protein stability prediction upon mutation.
    Tags: undefined
    Author: Jinyuan Sun
    Release: 2025-02-10 10:28:28
    Reference: Jinyuan Sun, Tong Zhu, Yinglu Cui, Bian Wu, Structure-based self-supervised learning enables ultrafast protein stability prediction upon mutation, The Innovation, Volume 6, Issue 1, 2025, 100750, ISSN 2666-6758

    Mutation Energy of Stability (Pythia)

    简介

    该模块基于Pythia模型实现,该模型是一种针对零样本 ∆∆G 预测量身定制的自监督图神经网络。

    蛋白质突变效应预测是解码分子进化机制、优化蛋白质工程改造的关键物理量。然而,传统预测方法面临两大挑战:一是基于物理力场的计算方法(如自由能微扰)计算复杂度高,难以满足大规模筛选需求;二是依赖于实验数据的监督学习方法易受训练集偏差影响,泛化能力受限。

    为了应对这些问题,研究团队提出了Pythia框架,它结合了图神经网络与注意力机制,能够直接从蛋白质的三维结构中学习氨基酸之间的相互作用。通过这种“零监督”预训练策略,Pythia突破了传统方法对标记数据的依赖,成功捕捉了蛋白质折叠过程中隐藏的物理化学约束规律。

    Pythia的模型架构采用了将蛋白质局部结构转化为k近邻图的方式,每个氨基酸作为节点,通过欧几里得距离连接其32个最近的氨基酸。节点的特征包括氨基酸类型以及主链的二面角,边的特征则涉及主链原子之间的距离、序列位置和链信息。通过消息传递神经网络(MPNN)架构,Pythia可以高效地更新每个氨基酸节点的信息,并对突变的稳定性变化进行准确预测。

    与传统的基于物理力场的方法相比,Pythia能够在单核计算中实现每分钟预测约50,000个突变,速度提升了5个数量级。其在标准测试集S2648上的Spearman相关系数为0.616,Pearson相关系数为0.598,表现优于现有的所有对比模型。这一进展为大规模蛋白质序列空间扫描提供了强大的计算支持,能够处理多达2600万个高质量蛋白质结构数据,显著加深了我们对蛋白质序列空间的理解。

    在实验验证中,Pythia表现出了比传统能量函数方法高出一倍的成功率,充分证明了其在实际应用中的可靠性。同时,Pythia的可解释性也为蛋白质工程提供了宝贵的生物学见解,使其更易于应用于复杂的蛋白质工程任务。

    模型架构:Pythia将蛋白质局部结构转换为k近邻图,其中每个氨基酸作为一个节点,并通过欧几里得距离连接其32个最近的氨基酸。节点的特征包括氨基酸类型和主链的二面角(φ、ψ、ω),边的特征包括主链原子之间的距离、序列位置和链信息。
    image.png

    训练目标:Pythia的训练目标是预测中心节点的自然氨基酸类型,使用来自节点和边的信息。

    消息传递神经网络(MPNN):Pythia采用消息传递神经网络(MPNN)架构,具体为带有注意力机制的消息传递层(AMPL)。在每个AMPL层中,顶点表示通过注意力块更新,然后与边表示连接以派生消息表示,最终通过另一个注意力块进一步细化节点表示。

    损失函数:通过估计特定位置处每个氨基酸的概率来实现ΔΔG的预测。

    在与其他自监督预训练模型和基于力场的方法的比较基准中,Pythia以极高的相关性超越其他同类算法,同时以最少的参数运行,使得计算速度显著加快,高达10^5倍。Pythia的功效通过其在预测柠檬烯环氧水解酶 (LEH) 的热稳定突变中的应用得到证实,实验成功率显著提高。
    S2648数据集上的性能:Pythia在S2648数据集上的Spearman相关系数为0.616,Pearson相关系数为0.598,优于所有测试的模型。
    S669数据集上的性能:在S669数据集上,Pythia的Spearman相关系数为0.66,在所有评估的方法中表现最佳。
    image.png

    大规模数据集上的性能:在一个包含约100万个突变的百万级数据集上,Pythia的Spearman相关系数为0.602,Pearson相关系数为0.633,AUROC为0.83,AUPRC为0.88。
    计算速度:Pythia的计算速度比传统的力场方法快10^5倍,能够在20秒内完成S2648数据集的计算,单核速度约为50,000个突变/分钟。

    参数说明

    Structure PDB

    蛋白结构文件,PDB格式。不支持含有非标准氨基酸的蛋白。

    Chain

    指定要突变扫描的链名,可多链,用英文逗号分隔,如:A,B,默认为空,表示全部链都扫描。

    Output

    输出文件名称,默认mutation_energy.csv。

    Output_fmt

    特定格式化的输出文件名称,默认mutation_energy_fmt.csv。

    备注:当前24GB的GPU显存支持计算的残基数量在2000个左右。

    结果说明

    输出mutation_energy.csv结果文件,包含以下信息:

    字段名称 说明
    Chain 链名称,如:'A’表示A链
    Mutation 单点突变信息,如:'G1A’表示该链中,残基位置编号为1的残基甘氨酸G,突变为丙氨酸A,残基位置编号从1开始按顺序编号(非PDB文件中的残基序号)
    Energy 突变对应的能量变化,负值表示突变使得体系能量降低,体系变得更稳定。负得越多表示稳定性提升越多

    输出mutation_energy_fmt.csv结果文件,包含如下信息:

    字段名称 说明
    Chain PDB结构中的链名称
    WT PDB结构中的初始AA
    Pos AA位置编号,从1开始
    Consensus 该位置出现能量最优的AA
    L,A,G,V… 该位置每种AA对应的能量变化值,变化值为负时,表示更稳定,负得越多,越稳定

    输出结果对应的热图mutation_energy_[chain].png

    参考文献

    • Jinyuan Sun, Tong Zhu, Yinglu Cui, Bian Wu, Structure-based self-supervised learning enables ultrafast protein stability prediction upon mutation, The Innovation, Volume 6, Issue 1, 2025, 100750, ISSN 2666-6758, DOI: 10.1016/j.xinn.2024.100750

    Mutation Energy of Stability (Pythia)

    Introduction

    This module is implemented based on the Pythia model, which is a self-supervised graph neural network specifically designed for zero-shot ∆∆G prediction.

    Predicting the effects of mutations on proteins is key to decoding the mechanisms of molecular evolution and to optimizing protein engineering. However, traditional prediction methods face two major challenges: first, computational methods based on physical force fields (such as free energy perturbation) are computationally expensive, making them unsuitable for large-scale screening; second, supervised learning methods that rely on experimental data are susceptible to training set biases, which limits their generalization ability.

    To address these issues, the research team proposed the Pythia framework, which combines graph neural networks with attention mechanisms to learn interactions between amino acids directly from the three-dimensional structure of proteins. Through this “zero-supervision” pre-training strategy, Pythia overcomes the traditional methods’ dependence on labeled data and successfully captures the hidden physicochemical constraints in the protein folding process.

    The architecture of Pythia converts the local structure of a protein into a k-nearest neighbor graph, in which each amino acid is a node connected to its 32 nearest amino acids by Euclidean distance. Node features include the amino acid type and backbone dihedral angles, while edge features include distances between backbone atoms, sequence positions, and chain information. Using a message-passing neural network (MPNN) architecture, Pythia efficiently updates the information at each amino acid node and predicts the change in stability caused by a mutation.
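
    The k-nearest neighbor graph construction can be sketched in plain Python as follows. This is a simplified illustration, not Pythia's implementation: it assumes one representative 3D coordinate per residue (which atom Pythia uses is not stated here), and knn_graph is a hypothetical function name.

```python
import math


def knn_graph(coords, k=32):
    """Map each residue index to the indices of its k nearest other residues.

    `coords` is a list of (x, y, z) tuples, one per residue (an assumption of
    this sketch). Neighbors are ordered from nearest to farthest by Euclidean
    distance; if fewer than k other residues exist, all of them are returned.
    """
    graph = {}
    for i, ci in enumerate(coords):
        dists = [(math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]
    return graph


if __name__ == "__main__":
    # Four toy residues on a line; k=2 keeps the example readable.
    toy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
    print(knn_graph(toy, k=2))
```

    This brute-force version is quadratic in the number of residues, which is fine for illustration; a spatial index would be the usual choice for large structures.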

    Compared to traditional physical force field-based methods, Pythia can predict approximately 50,000 mutations per minute on a single core, a speed increase of five orders of magnitude. On the standard test set S2648, it achieves a Spearman correlation coefficient of 0.616 and a Pearson correlation coefficient of 0.598, outperforming all of the models it was compared against. This speed makes large-scale scanning of protein sequence space practical, scaling to as many as 26 million high-quality protein structures and deepening our understanding of protein sequence space.

    In experimental validation, Pythia achieved a success rate twice that of traditional energy function methods, supporting its reliability in practical applications. Additionally, Pythia's interpretability offers valuable biological insights for protein engineering, making it easier to apply to complex protein engineering tasks.

    Model Architecture: Pythia transforms the local structure of proteins into a k-nearest neighbor graph, where each amino acid is represented as a node, connected to its 32 nearest amino acids by Euclidean distance. The features of the nodes include the amino acid type and the backbone dihedrals (φ, ψ, ω), while the features of the edges include the distances between backbone atoms, sequence positions, and chain information.
    image.png

    Training Objective: The training objective of Pythia is to predict the natural amino acid type of the central node, using information from both nodes and edges.

    Message Passing Neural Network (MPNN): Pythia employs a message passing neural network (MPNN) architecture built from Attention-based Message Passing Layers (AMPL). In each AMPL layer, the vertex representations are updated through an attention block and then concatenated with the edge representations to derive message representations; a second attention block then further refines the node representations.

    Loss Function: The prediction of ΔΔG is achieved by estimating the probability of each amino acid at specific positions.
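
    One common way to turn per-position amino acid probabilities into a stability score is a log-odds ratio between the mutant and wild-type residues. The Python sketch below shows that idea only; the exact formula and scaling used by Pythia are not given in this document, so log_odds_score and its sign convention (negative means predicted stabilizing, matching the Energy field of the output) are assumptions of this example.

```python
import math


def log_odds_score(probs, wild_type, mutant):
    """Return -ln(p_mutant / p_wild_type) for one position.

    `probs` maps one-letter amino acid codes to predicted probabilities at
    that position. A negative score means the model finds the mutant residue
    more likely than the wild type (assumed convention for this sketch).
    """
    return -math.log(probs[mutant] / probs[wild_type])


if __name__ == "__main__":
    # Toy probabilities for a single position (invented for illustration).
    position_probs = {"G": 0.2, "A": 0.4, "P": 0.05}
    print(log_odds_score(position_probs, "G", "A"))  # negative: A favored over G
    print(log_odds_score(position_probs, "G", "P"))  # positive: P disfavored
```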

    In benchmark comparisons with other self-supervised pre-training models and force-field-based methods, Pythia achieves the highest correlations among the compared algorithms while using the fewest parameters, which speeds up computation by up to 10^5-fold. The effectiveness of Pythia is demonstrated through its application in predicting thermostabilizing mutations of limonene epoxide hydrolase (LEH), with a notable increase in experimental success rates.
    Performance on the S2648 Dataset: Pythia achieves a Spearman correlation coefficient of 0.616 and a Pearson correlation coefficient of 0.598 on the S2648 dataset, outperforming all tested models.
    Performance on the S669 Dataset: On the S669 dataset, Pythia achieves a Spearman correlation coefficient of 0.66, performing the best among all evaluated methods.

    Performance on Large-scale Datasets: On a large dataset containing approximately 1 million mutations, Pythia achieves a Spearman correlation coefficient of 0.602, a Pearson correlation coefficient of 0.633, an AUROC of 0.83, and an AUPRC of 0.88.
    Computational Speed: Pythia is up to 10^5 times faster than traditional force-field methods. It completes calculations on the S2648 dataset in 20 seconds, with a single-core speed of approximately 50,000 mutations per minute.

    Parameters

    Structure PDB

    Protein structure file in PDB format. Proteins containing non-standard amino acids are not supported.

    Chain

    Specify the chain names to be scanned for mutations. Multiple chains can be listed, separated by commas, e.g., A,B. The default is empty, which means all chains will be scanned.

    Output

    Output file name. The default is mutation_energy.csv.

    Results

    Outputs a mutation_energy.csv file containing the following information:

    Field Name Description
    Chain Chain name, e.g., ‘A’ represents chain A
    Mutation Single point mutation information, e.g., ‘G1A’ indicates that in this chain, the residue at position 1, Glycine (G), is mutated to Alanine (A), with residue positions numbered sequentially starting from 1 (not the residue numbering in the PDB file)
    Energy The energy change associated with the mutation; negative values indicate that the mutation lowers the system’s energy, making it more stable. The more negative the value, the greater the increase in stability.
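As a rough sketch of how the result file might be post-processed, the snippet below reads the three documented columns and keeps the mutations predicted to be stabilizing. It assumes a comma-delimited file with a header row; the sample rows are invented for illustration, and `parse_mutation` is a hypothetical helper.

```python
import csv
import io
import re

# Hypothetical file contents using the three documented columns
sample = """Chain,Mutation,Energy
A,G1A,-0.8
A,G1W,1.5
B,L7P,2.3
"""

def parse_mutation(code):
    """Split 'G1A' into (wild type, sequential position, mutant)."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if m is None:
        raise ValueError(f"unrecognized mutation: {code}")
    return m.group(1), int(m.group(2)), m.group(3)

rows = list(csv.DictReader(io.StringIO(sample)))
# Keep mutations predicted to be stabilizing (negative Energy), most negative first
stabilizing = sorted(
    (r for r in rows if float(r["Energy"]) < 0),
    key=lambda r: float(r["Energy"]),
)
```

Because positions are numbered sequentially from 1 rather than by PDB residue number, any mapping back to PDB numbering has to be done separately.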

    References

    • Jinyuan Sun, Tong Zhu, Yinglu Cui, Bian Wu, Structure-based self-supervised learning enables ultrafast protein stability prediction upon mutation, The Innovation, Volume 6, Issue 1, 2025, 100750, ISSN 2666-6758, DOI: 10.1016/j.xinn.2024.100750
  • Name: Mutation Energy of Binding (Pythia-PPI)
    Description: 基于深度学习和多任务学习的预测突变对蛋白-蛋白亲和力影响。 Deep learning and multi-task learning based prediction of protein-protein binding affinity changes upon mutations.
    Tags: undefined
    Author: Fangting Tao
    Release: 2025-02-10 10:36:50
    Reference: Fangting Tao, Jinyuan Sun, Bian Wu, George F Gao. Reliable prediction of protein-protein binding affinity changes upon mutations with Pythia-PPI, bioRxiv 2024.10.28.620752.

    Mutation Energy of Binding (Pythia-PPI)

    简介

    Mutation Energy of Binding (Pythia-PPI)模块基于Pythia-PPI模型实现,该模型基于深度学习,结合了多任务学习和自蒸馏策略,以克服实验数据稀缺的瓶颈,并提高预测准确性。Pythia-PPI由两个模块组成:预训练的结构图编码器模块和ΔΔG预测模块。该模型使用k-最近邻(k-NN)图将蛋白质或蛋白质-蛋白质复合物的局部结构转换为图表示,每个氨基酸作为一个节点,与其32个最近的氨基酸基于C-alpha原子的欧几里得距离建立连接。输入的结构图编码器结合了氨基酸类型的一热编码,以及使用正弦和余弦函数表示的主链二面角(φ、ψ和ω)作为节点特征。边特征则考虑了五个主链原子(C-alpha、C、N、O和C-beta)之间的距离,以及序列位置和链信息。通过结构图编码器,节点和边输入特征被转换为嵌入,这些嵌入与预训练模块中的氨基酸概率相结合,形成ΔΔG预测模块的输入向量。Pythia-PPI采用迁移学习和多任务学习相结合的方法,共享结构编码器层以预测突变对PPI结合亲和力和蛋白质稳定性的影响。
    使用了SKEMPI数据集进行基准测试,并与其他方法进行了比较。结果显示,Pythia-PPI在SKEMPI数据集上的皮尔逊相关系数从0.6447提高到0.7850,在病毒-受体数据集上的皮尔逊相关系数从0.3654提高到0.6051。这些结果表明Pythia-PPI是一个分析蛋白质-蛋白质相互作用适应性景观的有力工具。

    参数说明

    Structure PDB

    蛋白复合物结构文件,PDB格式。不支持含有非标准氨基酸的蛋白。

    Chain

    指定要突变扫描的链名,可多链,用英文逗号分隔,如:A,B,默认为空,表示全部链都扫描。

    Output

    输出文件名称,默认mutation_ddg.csv。

    Output_fmt

    特定格式化输出的结果文件名称,默认mutation_ddg_fmt.csv。

    备注:当前24GB的GPU显存支持计算的残基数量在1500个左右。

    结果说明

    输出mutation_ddg.csv结果文件,包含以下信息:

    字段名称 说明
    Chain 链名称,如:'A’表示A链
    Mutation 单点突变信息,如:'G1A’表示该链中,残基位置编号为1的残基甘氨酸G,突变为丙氨酸A,残基位置编号从1开始按顺序编号(非PDB文件中的残基序号)
    ddG_Pred 突变对应的结合自由能ddG变化,负值表示突变使得亲和力变高,负得越多表示亲和力提升越多

    输出mutation_ddg_fmt.csv结果文件,包含如下信息:

    字段名称 说明
    Chain PDB结构中的链名称
    WT PDB结构中的初始AA
    Pos AA位置编号,从1开始
    Consensus 该位置出现能量最优的AA
    L,A,G,V… 该位置每种AA对应的能量变化值,变化值为负时,表示更稳定,负得越多,越稳定

    输出结果对应的热图mutation_ddg_[chain].png

    参考文献

    • Reliable prediction of protein-protein binding affinity changes upon mutations with Pythia-PPI, Fangting Tao, Jinyuan Sun, Bian Wu, George F Gao bioRxiv 2024.10.28.620752; DOI: 10.1101/2024.10.28.620752

    Mutation Energy of Binding (Pythia-PPI)

    Introduction

    The Mutation Energy of Binding (Pythia-PPI) module is implemented based on the Pythia-PPI model, which utilizes deep learning and combines multi-task learning with a self-distillation strategy to overcome the bottleneck of scarce experimental data and improve prediction accuracy. Pythia-PPI consists of two modules: a pre-trained structural graph encoder module and a ΔΔG prediction module. The model uses a k-nearest neighbors (k-NN) graph to convert the local structure of proteins or protein-protein complexes into a graph representation, where each amino acid is represented as a node, connected to its 32 nearest amino acids based on the Euclidean distance of C-alpha atoms. The input structural graph encoder combines one-hot encoding of amino acid types with backbone dihedrals (φ, ψ, and ω) represented using sine and cosine functions as node features. Edge features take into account the distances between five backbone atoms (C-alpha, C, N, O, and C-beta), as well as sequence positions and chain information. Through the structural graph encoder, the input features for nodes and edges are transformed into embeddings, which are combined with amino acid probabilities from the pre-trained module to form the input vector for the ΔΔG prediction module. Pythia-PPI employs a combination of transfer learning and multi-task learning, sharing structural encoder layers to predict the effects of mutations on PPI binding affinity and protein stability.

    Pythia-PPI was benchmarked on the SKEMPI dataset against other methods. It raised the Pearson correlation coefficient from 0.6447 to 0.7850 on the SKEMPI dataset and from 0.3654 to 0.6051 on the virus-receptor dataset. These results indicate that Pythia-PPI is a powerful tool for analyzing the fitness landscape of protein-protein interactions.

    Parameters

    Structure PDB

    Protein complex structure file in PDB format. Proteins containing non-standard amino acids are not supported.

    Chain

    Specify the chain names to be scanned for mutations. Multiple chains can be listed, separated by commas, e.g., A,B. The default is empty, which means all chains will be scanned.

    Output

    Output file name. The default is mutation_ddg.csv.

    Output_fmt

    Formatted output file name. The default is mutation_ddg_fmt.csv.

    Note: a GPU with 24 GB of memory currently supports structures of about 1,500 residues.

    Results

    Outputs a mutation_ddg.csv file containing the following information:

    Field Name Description
    Chain Chain name, e.g., ‘A’ represents chain A
    Mutation Single point mutation information, e.g., ‘G1A’ indicates that in this chain, the residue at position 1, Glycine (G), is mutated to Alanine (A), with residue positions numbered sequentially starting from 1 (not the residue numbering in the PDB file)
    ddG_Pred The change in binding free energy (ddG) corresponding to the mutation; negative values indicate that the mutation increases affinity, with more negative values indicating a greater increase in affinity.

    Outputs a mutation_ddg_fmt.csv file containing the following information:

    Field Name Description
    Chain Chain name in the PDB structure
    WT Initial AA in the PDB structure
    Pos Position index of the AA, starting from 1
    Consensus The AA with the most favorable (lowest) ddG at that position
    L, A, G, V… The ddG of each AA at that position. Negative values indicate that the mutation increases affinity, with more negative values indicating a greater increase in affinity.

    A heatmap of the results is also output as mutation_ddg_[chain].png.

    References

    • Reliable prediction of protein-protein binding affinity changes upon mutations with Pythia-PPI, Fangting Tao, Jinyuan Sun, Bian Wu, George F Gao bioRxiv 2024.10.28.620752; DOI: 10.1101/2024.10.28.620752
  • Name: Antibody (Off-) Target Prediction (WeTarScan)
    Description: 基于结构相似性原理从抗原-抗体数据库中(相似抗体可能具有相似靶点)预测抗体的潜在靶点(脱靶效应)。 Structure similarity-based antibody (Off-) target prediction from antibody-antigen interaction database.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-01-06 10:17:53
    Reference:

    Antibody (Off-) Target Prediction

    简介

    Antibody (Off-) Target Prediction模块对输入的抗体进行潜在靶点预测,基于丰富的抗体-抗原相互作用数据库,寻找与输入抗体在序列及结构上高度相似的一系列抗体。基于相似性原理(相似抗体可能具有相似靶点),这些高度相似的抗体对应的抗原靶点可能是输入抗体的潜在靶点。当前抗体-抗原相互作用数据库包含16万对抗原-抗体复合物,主要来源于文献、专利等开源数据。

    参数说明

    Antibody Structure

    待预测靶点的抗体结构文件,PDB格式或CIF格式。推荐使用AF3-like相关的结构预测模块进行抗体结构预测,如:Protenix

    Mode

    搜索模式,支持4种模式(默认为模式3):

    • 模式1: 完整抗体模式,以完整的抗体重轻链结构进行数据库检索。
    • 模式2: 抗体CDR模式,仅提取抗体的CDR区域结构进行数据库检索。
    • 模式3: 抗体重链CDR模式,仅提取抗体的重链CDR区域结构进行数据库检索。
    • 模式4: 抗体重链CDR3模式,仅提取抗体的重链CDR3区域结构进行数据库检索。

    TopN

    保留打分排名最高的前N个结果,默认为50。

    Species

    物种信息过滤:

    • Human表示仅保留人源靶点。
    • Any表示不做任何限制。

    Output

    输出结果的文件名,默认为“hits.csv”

    结果说明

    检索结果文件默认为hits.csv,包含信息如下:

    字段名 说明
    Query 查询抗体结构名称
    Target 数据库的抗体结构名称
    Antigen Name 预测的靶点名称
    Description 对数据库结构的描述
    Antigen Organism Label 靶点的来源物种
    Comprehensive Score 潜在靶点的综合打分,数值在0-1.0之间,越接近1.0,表示成为抗体靶点的可能性越大,默认基于该打分对潜在靶点进行排序。该打分综合了多种结构比对与复合物评价指标。
    Alignment TMScore \ Query TMScore \ Target TMScore TM-score (Template Modeling Score) 是一种结构比对指标,用于衡量两个蛋白质三维结构的相似性,与 RMSD相比,TM-score 更加稳定,对结构长度不敏感,能更准确地反映蛋白质结构的全局相似性。其取值范围在0到1之间,TM-score > 0.5 表示显著相似。其中,Query TMScore指使用查询抗体结构进行长度归一化;Target TMScore指使用数据库抗体结构进行长度归一化;Alignment TMScore指使用查询抗体和数据库抗体的序列匹配区的结构进行长度归一化。
    RMSD 查询抗体与数据库抗体的骨架结构中alpha碳原子C𝛂位置差异的均方根偏差。
    RMSD_score 基于结构比对叠合后的主链C𝛂原子的位置差异RMSD值,进行归一化获得,计算公式为:RMSD_score = exp(-RMSD/3.8),将其归一化到0-1.0之间,其中3.8为经验参数。
    DockQ 衡量抗体与潜在靶点之间的虚拟结合参数,其值在0-1.0之间,越大表示抗体越能与潜在靶点结合。

    参考文献

    • van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
    • Schneider, C., Raybould, M.I.J., Deane, C.M. (2022) SAbDab in the Age of Biotherapeutics: Updates including SAbDab-Nano, the Nanobody Structure Tracker. Nucleic Acids Res. 50(D1):D1368-D1372
    • Brennan Abanades et al. “The Patent and Literature Antibody Database (PLAbDab): an evolving reference set of functionally diverse, literature-annotated antibody sequences and structures”. In: Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D545-D551
    • Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017)

    Antibody (Off-) Target Prediction

    Introduction

    The Antibody (Off-) Target Prediction module predicts potential targets of an input antibody. It searches a large database of antibody-antigen interactions for antibodies that are highly similar to the input antibody in both sequence and structure. Following the similarity principle (similar antibodies may have similar targets), the antigens bound by these antibodies are potential targets of the input antibody. The database currently contains 160,000 antigen-antibody complexes, sourced mainly from publicly available data such as literature and patents.

    Parameter

    Antibody Structure

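    Structure file of the antibody whose targets are to be predicted, in PDB or CIF format. An AF3-like structure prediction module, such as Protenix, is recommended for predicting the antibody structure.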

    Mode

    Search mode. Four modes are supported (default: Mode 3):

    • Mode 1: Full Antibody Mode, where the complete heavy and light chain structure of the antibody is used for database search.
    • Mode 2: Antibody CDR Mode, where only the CDR regions of the antibody are extracted for database search.
    • Mode 3: Antibody Heavy Chain CDR Mode, where only the CDR regions of the heavy chain are extracted for database search.
    • Mode 4: Antibody Heavy Chain CDR3 Mode, where only the CDR3 region of the heavy chain is extracted for database search.

    TopN

    Retain the top N results with the highest scores, with the default being 50.

    Species

    Species Information Filtering:

    • Human: Retain only human-derived targets.
    • Any: No restrictions.

    Output

    Name of the output file. The default is “hits.csv”.

    Result

    The search result file hits.csv contains the following information:

    Field Name Description
    Query Name of the query antibody structure
    Target Name of the antibody structure in the database
    Antigen Name Name of the predicted target
    Description Description of the structure in the database
    Antigen Organism Label Source organism of the target
    Comprehensive Score Overall score of the potential target, ranging from 0 to 1.0. The closer the score is to 1.0, the more likely the hit is a target of the antibody. By default, hits are ranked by this score. The score integrates several structural alignment and complex evaluation metrics.
    Alignment TMScore \ Query TMScore \ Target TMScore TM-score (Template Modeling Score) is a structural alignment metric used to measure the similarity between two protein 3D structures. Compared to RMSD, TM-score is more stable and less sensitive to structural length, providing a more accurate reflection of the global similarity of protein structures. It ranges from 0 to 1, with TM-score > 0.5 indicating significant similarity. Query TMScore refers to length normalization using the query antibody structure; Target TMScore refers to length normalization using the database antibody structure; Alignment TMScore refers to length normalization using the sequence-matched regions of the query and database antibodies.
    RMSD Root mean square deviation (RMSD) of the Cα atom positions between the backbone structures of the query antibody and the database antibody.
    RMSD_score A normalized score derived from the RMSD of the backbone Cα atom positions after structural superposition. The formula is RMSD_score = exp(-RMSD/3.8), which maps the RMSD to the range 0-1.0; 3.8 is an empirical parameter.
    DockQ A virtual binding parameter that measures the interaction between an antibody and a potential target, with values ranging from 0 to 1.0. The higher the value, the greater the likelihood of the antibody binding to the potential target.
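The RMSD_score formula from the table can be written directly as a small function. This is a minimal sketch: the function name `rmsd_score` and the sample RMSD values are illustrative, while the formula and the 3.8 constant come from the table.

```python
import math

def rmsd_score(rmsd, scale=3.8):
    """Map a non-negative RMSD to (0, 1]; an RMSD of 0 gives exactly 1.0."""
    if rmsd < 0:
        raise ValueError("RMSD must be non-negative")
    return math.exp(-rmsd / scale)

for rmsd in (0.0, 1.0, 3.8, 10.0):
    print(f"RMSD {rmsd:5.1f} -> RMSD_score {rmsd_score(rmsd):.3f}")
```

The score decreases monotonically with RMSD, so a tighter structural superposition always scores higher.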

    Reference

    • van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
    • Schneider, C., Raybould, M.I.J., Deane, C.M. (2022) SAbDab in the Age of Biotherapeutics: Updates including SAbDab-Nano, the Nanobody Structure Tracker. Nucleic Acids Res. 50(D1):D1368-D1372
    • Brennan Abanades et al. “The Patent and Literature Antibody Database (PLAbDab): an evolving reference set of functionally diverse, literature-annotated antibody sequences and structures”. In: Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D545-D551
    • Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017)
  • Name: Protein Design (LigandMPNN)
    Description: LigandMPNN是一种基于深度学习的蛋白质序列设计方法,专门用于模拟蛋白质与非蛋白质组分(如小分子、核苷酸和金属)之间的相互作用。它是 ProteinMPNN的升级版,能够在蛋白质设计中加入非蛋白质的组分,从而提升对非蛋白-蛋白相互作用的理解。 LigandMPNN is a deep learning-based protein sequence design method specifically designed to simulate interactions between proteins and non-protein components (such as small molecules, nucleotides, and metals). It is an upgraded version of ProteinMPNN and can incorporate non-protein components into protein design, thereby enhancing the understanding of non-protein-protein interactions.
    Tags: undefined
    Author: Justas Dauparas
    Release: 2025-01-07 10:31:52
    Reference: Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, D. Baker. Atomic context-conditioned protein sequence design using LigandMPNN. bioRxiv 2023.12.22.573103;

    Protein Design (LigandMPNN)

    简介

    LigandMPNN是一种基于深度学习的蛋白质序列设计方法,专门用于模拟蛋白质与非蛋白质组分(如小分子、核苷酸和金属)之间的相互作用。它是 ProteinMPNN的升级版,能够在蛋白质设计中加入非蛋白质的组分,从而提升对非蛋白-蛋白相互作用的理解。
    主要特点和优势:
    全面建模:LigandMPNN结合了蛋白质图、配体图和蛋白质-配体图三种图结构,全面建模蛋白质与非蛋白质组分的相互作用。
    高性能:在恢复与小分子、核苷酸和金属相互作用的原生背景序列方面,LigandMPNN的表现优于传统方法如Rosetta和ProteinMPNN。
    侧链预测:除了生成蛋白质序列,LigandMPNN还能生成侧链构象,允许对结合相互作用进行更详尽的评估。
    LigandMPNN 在酶设计、小分子结合剂开发以及生物传感器的设计中具有广阔的应用前景。其高效性和准确性使其成为蛋白质工程领域的重要工具。

    参数说明

    Structure PDB

    蛋白的结构文件,PDB格式。

    Chain

    指定需要设计的链,多条链用空格分割,例如:A,B。

    Number of Sequences

    输出设计的序列数目。

    Sampling Temp

    氨基酸采样温度,T=0.0表示取argmax,T>>1.0表示随机采样。建议的取值为0.1、0.15、0.2、0.25、0.3。较高的值会导致更多的多样性。

    Position Type

    设计残基模式:

    • 固定(Fix)指定下一步Position中的残基在设计时保持不变。
    • 设计(Design)指定下一步Position中的残基可进行设计而其他未指定残基在设计时保持不变。

    Position

    可选参数,设置氨基酸序号,对设置的氨基酸根据Position Type选项进行固定或设计。当参数Chain设置为A,C时,此参数如果设置为1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40意味着对链A中的残基1 2 3…25和链C中的残基10 11 12…40进行固定或设计。
    注意:同一条链的氨基酸序号用空格分隔,不同链的氨基酸用逗号分隔。根据原始 PDB 编号设计的,并且支持插入代码。

    Omit_AAS

    指定在生成的结果序列中不许出现的氨基酸种类。

    Other Parameter

    可选参数,可指定设计时参考的模式。具体含义如下:
    –homomer:基于同源多聚体进行序列设计;
    –pack_side_chains:对设计的序列生成侧链结构。

    结果说明

    最终设计的序列文件result.fasta,里面包含最终设计的序列。
    其中序列名称:

    1. overall_confidence:设计序列的全序列的置信度评分,数值在0~1.0之间,数值越大表示序列置信度越高
    2. ligand_confidence:设计序列的所有已设计残基的置信度评分,数值在0~1.0之间,数值越大表示已设计部分序列的置信度越高
    3. seq_rec:序列恢复率(与原序列的相似程度),0-1之间,越高表示与原序列越相似

    指定参数--pack_side_chains时,输出设计后的结构打包文件packed_side_chains.tar.gz,包含最终设计的序列对应的复合物结构PDB文件。

    参考文献

    Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, D. Baker. Atomic context-conditioned protein sequence design using LigandMPNN. bioRxiv 2023.12.22.573103;

    Protein Design (LigandMPNN)

    Introduction

    LigandMPNN is a deep learning-based protein sequence design method that explicitly models interactions between proteins and non-protein components (such as small molecules, nucleotides, and metals). It extends ProteinMPNN by incorporating non-protein components into protein design, so that interactions between the protein and these components are taken into account.

    Key Features and Advantages:

    • Comprehensive Modeling: LigandMPNN integrates three types of graph structures: protein graphs, ligand graphs, and protein-ligand graphs, to comprehensively model interactions between proteins and non-protein components.
    • High Performance: LigandMPNN outperforms Rosetta and ProteinMPNN at recovering the native sequence of residues that interact with small molecules, nucleotides, and metals.
    • Side Chain Prediction: In addition to generating protein sequences, LigandMPNN can also generate side chain conformations, allowing for a more detailed assessment of binding interactions.

    LigandMPNN has broad application prospects in enzyme design, small molecule binder development, and biosensor design. Its efficiency and accuracy make it an important tool in the field of protein engineering.

    Parameter

    PDB File

    Protein structure file in PDB format.

    Chain

    Specify the chain to be designed, multiple chains are separated by spaces, for example: A,B.

    Number of Sequences

    Number of designed sequences to output.

    Position Type

    Residue Design Mode:

    • Fix: the residues listed in the Position parameter below remain unchanged during design.
    • Design: the residues listed in the Position parameter below are designed, while all other residues remain unchanged.

    Position

    Optional parameter. Residue numbers to fix or design, according to the Position Type option. For example, when Chain is set to A C and this parameter is set to 1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40, residues 1 2 3…25 of chain A and residues 10 11 12…40 of chain C are fixed or designed.

    Note: Residue numbers within the same chain are separated by spaces, and the groups for different chains are separated by commas. Positions follow the original PDB numbering, and insertion codes are supported.
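The Position syntax can be illustrated with a small parser. This is a sketch of the format described above rather than the module's own parsing code; `parse_positions` is a hypothetical helper, and chains are passed as a list to stay independent of how the Chain field is delimited.

```python
def parse_positions(chains, positions):
    """Map each chain ID to its list of residue labels (kept as strings)."""
    # Commas separate chains; whitespace separates residues within a chain
    groups = [g.split() for g in positions.split(",")]
    if len(groups) != len(chains):
        raise ValueError("one comma-separated group is expected per chain")
    return dict(zip(chains, groups))

selected = parse_positions(
    ["A", "C"],
    "1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40",
)
```

Residue labels are kept as strings rather than converted to integers so that a label carrying an insertion code is preserved unchanged.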

    Sampling Temp

    Amino acid sampling temperature, T=0.0 means argmax, T>>1.0 means random sampling. The suggested values are 0.1, 0.15, 0.2, 0.25, 0.3. Higher values result in more diversity.
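The effect of the sampling temperature can be seen in a generic temperature-scaled softmax sampler. The sketch below shows the standard technique and is not LigandMPNN's internal code; `sample_residue` and the logits are hypothetical.

```python
import math
import random

def sample_residue(logits, temperature, rng=random):
    """Pick one amino acid from a dict of logits at the given temperature."""
    if temperature == 0.0:
        # T = 0 reduces to argmax
        return max(logits, key=logits.get)
    peak = max(logits.values())
    # Subtract the peak before exponentiating for numerical stability
    weights = {aa: math.exp((v - peak) / temperature) for aa, v in logits.items()}
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]

# Hypothetical logits for three residue types at one position
logits = {"A": 2.0, "L": 1.0, "G": 0.0}
rng = random.Random(0)
greedy = sample_residue(logits, 0.0)
diverse = {sample_residue(logits, 100.0, rng) for _ in range(200)}
```

At T=0.0 the highest-scoring residue is always chosen, while at a very large T the weights become nearly uniform, which is why higher values give more diverse sequences.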

    Omit_AAS

    Optional parameter specifying the types of amino acids that are not allowed to appear in the generated sequence.

    Other Parameter

    Optional parameter that selects additional design options:
    --homomer: design sequences for a homo-oligomer;
    --pack_side_chains: generate side-chain conformations for the designed sequences.

    Result

    The output file result.fasta contains the final designed sequences.
    Each sequence header includes the following fields:

    1. overall_confidence: The confidence score for the entire designed sequence, with values ranging from 0 to 1.0. A higher value indicates a higher confidence in the sequence.
    2. ligand_confidence: The confidence score for all designed residues in the sequence, with values ranging from 0 to 1.0. A higher value indicates a higher confidence in the designed portion of the sequence.
    3. seq_rec: Sequence recovery rate (similarity to the original sequence), ranging from 0 to 1. A higher value indicates greater similarity to the original sequence.

    When the parameter --pack_side_chains is specified, the output is a packed structure file named packed_side_chains.tar.gz, which includes the PDB file of the final designed sequence’s corresponding complex structure.

    Reference

    Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, D. Baker. Atomic context-conditioned protein sequence design using LigandMPNN. bioRxiv 2023.12.22.573103;

  • Name: Thermostability Prediction
    Description: 基于TemBERTure开发的Thermostability Prediction是一个用于预测蛋白质热稳定性的深度学习工具,专注于氨基酸序列分析。它包括两个模型:TemBERTureCLS和TemBERTureTm。TemBERTureCLS是一个分类模型,用于预测蛋白质序列的热类别,即判断其是嗜热的还是非嗜热的。TemBERTureTm是一个回归模型,用于根据蛋白质序列预测其熔点温度(Tm)。这两个模型都基于protBERT-BFD语言模型,该模型在大量蛋白质序列数据集上进行了预训练。通过基于适配器的方法进行高效微调,使得TemBERTure能够在不需要广泛重新训练的情况下,稳健地适应特定任务。 Thermostability Prediction, developed based on TemBERTure, is a deep learning tool designed to predict protein thermostability, focusing on amino acid sequence analysis. It includes two models: TemBERTureCLS and TemBERTureTm. TemBERTureCLS is a classification model used to predict the thermal category of a protein sequence, determining whether it is thermophilic or non-thermophilic. TemBERTureTm is a regression model used to predict the melting temperature (Tm) of a protein based on its sequence. Both models are based on the protBERT-BFD language model, which has been pre-trained on a large dataset of protein sequences. By using an adapter-based fine-tuning approach, TemBERTure can efficiently and robustly adapt to specific tasks without the need for extensive retraining.
    Tags: undefined
    Author: Chiara Rodella
    Release: 2025-01-08 09:28:20
    Reference: Chiara Rodella, Symela Lazaridi, Thomas Lemmin, TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms, Bioinformatics Advances, Volume 4, Issue 1, 2024, vbae103

    Thermostability Prediction

    简介

    基于TemBERTure开发的Thermostability Prediction是一个用于预测蛋白质热稳定性的深度学习工具,专注于氨基酸序列分析。它包括两个模型:TemBERTureCLS和TemBERTureTm。TemBERTureCLS是一个分类模型,用于预测蛋白质序列的热类别,即判断其是嗜热的还是非嗜热的。TemBERTureTm是一个回归模型,用于根据蛋白质序列预测其熔点温度(Tm)。这两个模型都基于protBERT-BFD语言模型,该模型在大量蛋白质序列数据集上进行了预训练。通过基于适配器的方法进行高效微调,使得TemBERTure能够在不需要广泛重新训练的情况下,稳健地适应特定任务。
    TemBERTureCLS与其他常用模型的预测结果比较
    image.png
    TemBERTureTm与其他常用模型的预测结果比较
    image.png

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式

    结果说明

    默认输出结果文件为predicted_Tm.csv,包含信息如下:

    字段名称 说明
    ID 序列ID
    Tm 预测得到的蛋白Melting Temperature (Tm) 值
    Thermostability Type 预测得到的蛋白热稳定性类别,有两种:Thermophilic与Non-thermophilic
    Thermophilicity Prediction Score 预测得到的蛋白嗜热性概率评分,数值在0-1.0之间,越大表示蛋白嗜热的概率越高

    参考文献

    • Chiara Rodella, Symela Lazaridi, Thomas Lemmin, TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms, Bioinformatics Advances, Volume 4, Issue 1, 2024, vbae103 DOI:10.1093/bioadv/vbae103

    Thermostability Prediction

    Introduction

    Thermostability Prediction, developed based on TemBERTure, is a deep learning tool designed to predict protein thermostability, focusing on amino acid sequence analysis. It includes two models: TemBERTureCLS and TemBERTureTm. TemBERTureCLS is a classification model used to predict the thermal category of a protein sequence, determining whether it is thermophilic or non-thermophilic. TemBERTureTm is a regression model used to predict the melting temperature (Tm) of a protein based on its sequence. Both models are based on the protBERT-BFD language model, which has been pre-trained on a large dataset of protein sequences. By using an adapter-based fine-tuning approach, TemBERTure can efficiently and robustly adapt to specific tasks without the need for extensive retraining.
    Comparison of TemBERTureCLS with other common models’ prediction results
    image.png
    Comparison of TemBERTureTm with other common models’ prediction results
    image.png

    Parameter

    Protein Sequence

    The protein sequence file in FASTA format.

    Result

    The output result file is predicted_Tm.csv, containing the following information:

    Field Name Description
    ID Sequence ID
    Tm Predicted protein melting temperature (Tm) value
    Thermostability Type Predicted protein thermostability category: either Thermophilic or Non-thermophilic
    Thermophilicity Prediction Score Predicted probability score of protein thermophilicity, ranging from 0 to 1.0, where a higher score indicates a higher likelihood of the protein being thermophilic

    Reference

    • Chiara Rodella, Symela Lazaridi, Thomas Lemmin, TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms, Bioinformatics Advances, Volume 4, Issue 1, 2024, vbae103 DOI:10.1093/bioadv/vbae103
  • Name: GMX Metadynamics Generation
    Description: GMX Metadynamics Generation模块是生成可用于Metadynamics模拟的输入文件。 The GMX Metadynamics Generation module is used to generate input files for Metadynamics simulations.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-12-02 15:42:53
    Reference:

    GMX Metadynamics Generation

    简介

    GMX Metadynamics Generation模块是生成可用于Metadynamics模拟的输入文件。

    参数说明

    GRO File

    提交模拟体系的gro文件。该文件可以从MD Solvation模块获取。

    PBC

    Metadynamics模拟阶段是否考虑周期性边界条件,yes或者no。

    CV Group1

    组成集合变量CV的第一个组所包含的原子。

    CV Group2

    组成集合变量CV的第二个组所包含的原子。

    CV Group3

    组成集合变量CV的第三个组所包含的原子。

    CV Group4

    组成集合变量CV的第四个组所包含的原子。
    备注:

    • Group1和Group2组成DISTANCE集合变量,Group1,Group2和Group3组成ANGLE集合变量,Group1,Group2,Group3和Group4组成TORSION集合变量
    • Group的书写规则:a5表示GRO文件中第5个原子,a5-10表示GRO文件中第5-10个原子,aCA表示GRO文件中名字为CA的原子,同理r5, r5-10, rASP分别表示GRO文件中第5位残基,第5-10位残基和名字为ASP的残基,一些特殊的字符如Protein,Protein-H, MainChain等亦可使用,也可以合并使用,但需用逗号隔开,如"a5-10,r5,r8-10,UNK"表示GRO文件中第5-10位原子、第5位残基、第8-10位残基以及名字叫UNK的分子
    • 多个CV的处理方式:如果要定义多个集合变量,则在Group定义中用"//"将不同集合变量对应的原子组进行分割,如a5//r5-10表示a5是第一个集合变量对应的原子组,r5-10是第二个集合变量对应的原子组,当集合变量在某个Group没有对应的原子组时,用none表示,比如第一个集合变量是DISTANCE,第二个集合变量是ANGLE,那么第一个DISTANCE集合变量在Group3中没有对应的原子组,此时在Group3可以写none//r5-10,表示第一个集合变量在Group3中没有对应的原子组,而第二个集合变量在Group3中对应的原子组为r5-10

    Component

    集合变量DISTANCE对应的成分,其成分有x,y,z和xyz,分别表示计算DISTANCE仅考虑x,y,z维度以及xyz三个维度都考虑,有多个集合变量时用"//"进行分割。

    Metad Height

    施加的沉积高斯函数的高度,默认1.0

    Metad Width

    施加的沉积高斯函数的宽度或者标准差,有多个集合变量时用"//"进行分割,默认0.05

    Metad Frequency

    施加的沉积高斯函数的频率,默认500,即每500个时间步长进行一次高斯函数沉积

    CV Min

    集合变量的边界最小值,有多个集合变量时用"//"进行分割。无默认值时即不考虑边界,此时计算量会增加,强烈建议设置边界。

    CV Max

    集合变量的边界最大值,有多个集合变量时用"//"进行分割,无默认值时即不考虑边界,此时计算量会增加,强烈建议设置边界。

    CV Space

    集合变量的窗口大小,有多个集合变量时用"//"进行分割,默认等于metad_width的1/5

    CV Bin

    集合变量的窗口数量,有多个集合变量时用"//"进行分割,默认等于150,CV Space和CV Bin的相乘等于CV Max和CV Min的差值,因此当CV Space和CV Bin同时设置时以对应窗口数最多的为准

    Adaptive

    是否考虑施加自适应沉积函数, geom或者diff,默认为不填,即不考虑自适应。

    Sigma Min

    施加的自适应高斯函数的宽度或者标准差的最小值,有多个集合变量时用"//"进行分割,默认等于0。

    Sigma Max

    施加的自适应高斯函数的宽度或者标准差的最大值,有多个集合变量时用"//"进行分割,默认等于0。

    Reweight

    是否考虑重加权以获得重加权因子,对获得归一化偏势,yes或者no,默认no,即不考虑重加权,一般在体系收敛后才考虑重加权。

    Reweight Ngauss

    计算重加权因子时施加的高斯函数的个数,默认等于50。

    Reweight Bin

    计算重加权因子时集合变量的窗口数量,其值不能小于CV Bin的值,有多个集合变量时用"//"进行分割,默认等于CV Bin。

    Well Tempered

    是否考虑回火metadynamics模拟,yes或者no。

    Temperature

    回火metadynamics模拟时对应的基础温度,默认等于300K

    Bias Factor

    回火Metadynamics模拟时对应的偏置因子,其值等于(T+deltaT)/T,默认等于1,此时未进行偏置模拟,若进行偏置模拟,偏置因子应大于1

    TAU

    回火Metadynamics模拟时对应的施加的沉积高斯函数的高度,Height=kb*DeltaT*Frequency*TimeStep/TAU,默认等于0,即直接使用设置的沉积函数的高度代替。

    Step

    Metadynamics模拟时指定的输出步长,默认100。

    Gauss File

    Metadynamics模拟时指定的沉积高斯函数的输出文件名。

    CV File

    Metadynamics模拟时指定的集合变量的输出文件名。

    PLUMED Index File

    Metadynamics模拟时指定的CV Group的输出文件名,该文件中包含所有的CV Group的原子组,用于下一步Metadynamics的输入文件。

    PLUMED Data File

    Metadynamics模拟时指定的参数的输出文件名,该文件中包含计算时所需的参数,用于下一步Metadynamics的输入文件。

    结果说明

    输出结果包括:

    输出文件名称 说明
    HILLS.dat Metadynamics模拟时指定的沉积高斯函数输出
    COLVAR.dat Metadynamics模拟时指定的集合变量的输出
    PLUMED.ndx NDX文件指定的组成集合变量的原子组
    PLUMED.dat 下一步Metadynamics计算所需的参数文件

    上述两个生成的文件将作为下一步metadynamics模拟的输入文件。

    GMX Metadynamics Generation

    Introduction

    The GMX Metadynamics Generation module is used to generate input files for Metadynamics simulations.

    Parameter

    GRO File

    The GRO file of the simulation system. This file can be obtained from the MD Solvation module.

    PBC

    Whether to consider periodic boundary conditions during the Metadynamics simulation phase, yes or no.

    CV Group1

    Atoms included in the first group that makes up the collective variable (CV).

    CV Group2

    Atoms included in the second group that makes up the collective variable (CV).

    CV Group3

    Atoms included in the third group that makes up the collective variable (CV).

    CV Group4

    Atoms included in the fourth group that makes up the collective variable (CV).
    Note:

    • Group1 and Group2 form the DISTANCE collective variable, Group1, Group2, and Group3 form the ANGLE collective variable, and Group1, Group2, Group3, and Group4 form the TORSION collective variable.
    • The notation for Groups: a5 represents the 5th atom in the GRO file, a5-10 represents atoms 5 to 10, and aCA represents the atoms named CA. Similarly, r5, r5-10, and rASP represent the 5th residue, residues 5 to 10, and the residues named ASP, respectively. Predefined group names such as Protein, Protein-H, and MainChain can also be used. Selections can be combined, separated by commas. For example, “a5-10,r5,r8-10,UNK” represents atoms 5 to 10, the 5th residue, residues 8 to 10, and the molecule named UNK in the GRO file.
    • Handling multiple CVs: If you want to define multiple collective variables, separate the corresponding atom groups for different collective variables in the Group definition using “//”. For example, a5//r5-10 indicates that a5 corresponds to the atom group for the first collective variable, and r5-10 corresponds to the second collective variable. If there is no corresponding atom group for a collective variable in a Group, use “none” to indicate this. For instance, if the first collective variable is DISTANCE and the second is ANGLE, and the first DISTANCE collective variable has no corresponding atom group in Group3, you can write none//r5-10 in Group3 to indicate that the first collective variable has no corresponding atom group, while the second collective variable corresponds to r5-10 in Group3.
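The “//” and “none” conventions can be illustrated with a short parser. It is a sketch only: `split_cv_groups` and `cv_type` are hypothetical helpers, and the assumptions that every Group field lists the same number of “//”-separated entries and that the CV type follows from how many Groups are filled are inferred from the notes above, not confirmed behavior of the module.

```python
def split_cv_groups(*groups):
    """Return, per collective variable, its atom selections from each Group field."""
    columns = [g.split("//") for g in groups]
    n_cv = max(len(c) for c in columns)
    if any(len(c) != n_cv for c in columns):
        raise ValueError("every Group must list the same number of CVs")
    cvs = []
    for i in range(n_cv):
        # 'none' marks a Group that this CV does not use
        cvs.append([c[i].split(",") for c in columns if c[i] != "none"])
    return cvs

def cv_type(n_groups):
    """Two groups form a DISTANCE, three an ANGLE, four a TORSION."""
    return {2: "DISTANCE", 3: "ANGLE", 4: "TORSION"}[n_groups]

# Group1, Group2, Group3 for one DISTANCE CV followed by one ANGLE CV
cvs = split_cv_groups("a5//r1", "a20//r3", "none//r5-10")
kinds = [cv_type(len(cv)) for cv in cvs]
```

Here the first CV uses only Group1 and Group2, so Group3 carries `none` in its first slot, mirroring the none//r5-10 example in the note.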

    Component

    The components corresponding to the DISTANCE collective variable, which can be x, y, z, and xyz, representing calculations of DISTANCE considering only the x, y, z dimensions or all three dimensions, respectively. Use “//” to separate multiple collective variable components.

    Metad Height

    The height of the deposited Gaussian function, default is 1.0.

    Metad Width

    The width or standard deviation of the deposited Gaussian function. Use “//” to separate multiple collective variable widths, default is 0.05.

    Metad Frequency

    The frequency of depositing the Gaussian function, default is 500, meaning a Gaussian function deposition occurs every 500 time steps.

    CV Min

    The minimum boundary value of the collective variable. Use “//” to separate the minimums of multiple collective variables. There is no default value; if it is left unset, no boundary is applied, which increases the computational load. Setting boundaries is strongly recommended.

    CV Max

    The maximum boundary value of the collective variable. Use “//” to separate the maximums of multiple collective variables. There is no default value; if it is left unset, no boundary is applied, which increases the computational load. Setting boundaries is strongly recommended.

    CV Space

    The window size of the collective variable. Use “//” to separate multiple collective variable window sizes, default is 1/5 of metad_width.

    CV Bin

    The number of windows for the collective variable. Use “//” to separate the bin counts of multiple collective variables; default is 150. The product of CV Space and CV Bin equals the difference between CV Max and CV Min, so when both CV Space and CV Bin are set, whichever yields the larger number of windows takes precedence.
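
    The relationship between CV Space, CV Bin, CV Max, and CV Min can be sketched as follows. This is an illustration under stated assumptions rather than the module's implementation: the function name resolve_bins is hypothetical, the window count implied by CV Space is rounded up, and the documented default of CV Space (1/5 of metad_width) is not modelled.

```python
import math


def resolve_bins(cv_min, cv_max, space=None, bins=None):
    """Return (number_of_windows, window_size) for one collective variable."""
    width = cv_max - cv_min
    if width <= 0:
        raise ValueError("CV Max must be greater than CV Min")
    candidates = []
    if bins is not None:
        candidates.append(int(bins))
    if space is not None:
        # Window count implied by the window size; the small offset guards
        # against floating-point noise such as 30.000000000000004.
        candidates.append(math.ceil(width / space - 1e-9))
    if not candidates:
        candidates.append(150)  # documented default of CV Bin
    # When both are given, the setting with the larger window count wins.
    n_windows = max(candidates)
    return n_windows, width / n_windows


if __name__ == "__main__":
    print(resolve_bins(0.0, 3.0, space=0.01, bins=150))
    print(resolve_bins(0.0, 3.0, space=0.1, bins=150))
```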

    Adaptive

    Whether to apply an adaptive Gaussian deposition scheme, geom or diff. The default is empty, which means adaptive deposition is not used.

    Sigma Min

    The minimum width or standard deviation of the applied adaptive Gaussian function. Use “//” to separate multiple collective variable minimums, default is 0.

    Sigma Max

    The maximum width or standard deviation of the applied adaptive Gaussian function. Use “//” to separate multiple collective variable maximums, default is 0.

    Reweight

    Whether to consider reweighting to obtain the reweighting factor for normalization of the bias potential, yes or no, default is no, which means reweighting is not considered. Reweighting is generally considered only after the system has converged.

    Reweight Ngauss

    The number of Gaussian functions applied when calculating the reweighting factor, default is 50.

    Reweight Bin

    The number of windows for the collective variable when calculating the reweighting factor, which cannot be less than the value of CV Bin. Use “//” to separate multiple collective variable bin counts, default is equal to CV Bin.

    Well Tempered

    Whether to run well-tempered Metadynamics, yes or no.

    Temperature

    The base temperature of the well-tempered Metadynamics simulation, default is 300 K.

    Bias Factor

    The bias factor of the well-tempered Metadynamics simulation, which equals (T + deltaT) / T. The default is 1, meaning well-tempered biasing is not applied; when it is applied, the bias factor must be greater than 1.

    TAU

    A parameter that sets the height of the deposited Gaussian function in well-tempered Metadynamics: Height = kb * DeltaT * Frequency * TimeStep / TAU. The default is 0, meaning the value set in Metad Height is used directly.
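
    The two relations given for Bias Factor and TAU can be checked with a few lines of arithmetic. This is a minimal sketch, assuming energies in kJ/mol and that the time step and TAU are both in ps; the function names and the metad_height fallback argument are hypothetical.

```python
# Boltzmann constant in kJ/(mol*K); the energy unit is an assumption.
KB_KJ_PER_MOL_K = 0.0083144626


def delta_t(temperature, bias_factor):
    """Invert bias_factor = (T + deltaT) / T to obtain deltaT in K."""
    return (bias_factor - 1.0) * temperature


def height_from_tau(temperature, bias_factor, frequency, time_step, tau,
                    metad_height=1.0):
    """Height = kb * DeltaT * Frequency * TimeStep / TAU.

    A TAU of 0 (the documented default) means the configured Metad Height
    is used directly.
    """
    if tau == 0:
        return metad_height
    return (KB_KJ_PER_MOL_K * delta_t(temperature, bias_factor)
            * frequency * time_step / tau)


if __name__ == "__main__":
    # 300 K, bias factor 10, deposition every 500 steps of 0.002 ps, TAU = 1 ps
    print(height_from_tau(300.0, 10.0, 500, 0.002, 1.0))
```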

    Step

    The specified output step length during the Metadynamics simulation, default is 100.

    Gauss File

    The output file name for the deposited Gaussian function during the Metadynamics simulation.

    CV File

    The output file name for the collective variable during the Metadynamics simulation.

    PLUMED Index File

    The output file name for the CV Group during the Metadynamics simulation, which contains all the atom groups of the CV Group for the next step’s Metadynamics input file.

    PLUMED Data File

    The output file name for the parameters during the Metadynamics simulation, which contains the parameters required for calculations for the next step’s Metadynamics input file.

    Result

    The output results include:

    Output File Name Description
    HILLS.dat Output of the deposited Gaussian function specified during the Metadynamics simulation
    COLVAR.dat Output of the collective variable specified during the Metadynamics simulation
    PLUMED.ndx NDX file specifying the atom groups that make up the collective variable
    PLUMED.dat Parameter file required for the next step of Metadynamics calculation

    The last two files above (PLUMED.ndx and PLUMED.dat) serve as input files for the next step of the Metadynamics simulation.

  • Name: Free Energy Surface Analysis
    Description: 基于PLUMED元动力学模拟后的自由能计算。 Free energy surface analysis for PLUMED based metadynamics.
    Tags: undefined
    Author:
    Release: 2024-11-21 00:00:00
    Reference:

    Free Energy Surface Analysis

    简介

    Free Energy Surface Analysis模块是对基于PLUMED元动力学模拟后得到的模拟结果进行自由能计算。

    参数说明

    Input File

    基于PLUMED元动力学模拟后输出的沉积高斯函数文件,默认为HILLS.dat文件。

    Histogram

    对沉积高斯函数文件进行自由能计算时是否考虑直方图分布方法,yes或者no,默认no。

    Sigma

    当考虑直方图分布方法时高斯函数的宽度值,有多个集合变量(即CV)时用"//"进行分割,比如0.35//0.35。只有当Histogram值为yes时Sigma参数才会生效,当有多个CV而只设置了一个宽度值时,则表示该宽度值适用于所有CV。默认0.05。

    CV Name

    CV名称,对沉积高斯函数文件进行自由能计算时只考虑该指定的CV。当不指定CV时则考虑沉积高斯函数文件中包含的所有CV,当指定CV时则不能考虑直方图分布方法。

    CV Min

    集合变量的边界最小值,有多个集合变量时用"//"进行分割,比如0.1//0.3,强烈建议设置边界。当有多个CV而只设置了一个边界最小值时,则表示该最小值适用于所有CV。

    CV Max

    集合变量的边界最大值,有多个集合变量时用"//"进行分割,比如0.1//0.3,强烈建议设置边界。当有多个CV而只设置了一个边界最大值时,则表示该最大值适用于所有CV。

    Grid Size

    集合变量的窗口大小,有多个集合变量时用"//"进行分割,比如0.1//0.3。仅当设置了CV Min和CV Max值时,Grid Size才会生效。当有多个CV而只设置了一个窗口大小值时,则表示该窗口大小值适用于所有CV。

    Bin

    集合变量的窗口数量,有多个集合变量时用"//"进行分割,比如150//300。仅当设置了CV Min和CV Max值时,Bin才会生效。当有多个CV而只设置了一个窗口数量值时,则表示该窗口数量值适用于所有CV。Grid Size和Bin相乘等于CV Max和CV Min的差值,因此当Grid Size和Bin同时设置时以对应窗口数最多的为准。

    Temperature

    温度,对沉积高斯函数文件进行自由能计算时使用的温度值,默认300K

    Min to Zero

    是否对输出的自由能数据进行归零处理,即将自由能数据进行相对移动以保证最小值移动到0的位置,yes或者no,默认no。

    Stride

    沉积高斯函数的数量,在对沉积高斯函数文件进行自由能计算时,每隔该指定的沉积高斯函数的数量进行一次自由能计算。当不设置该数量值时表示对所有的沉积高斯函数在整体上只进行一次自由能计算。

    Output File

    输出结果文件,文件中包含随CV变化的自由能数据,默认为FES.csv文件。当指定了Stride值时,默认文件为FES.dat.tar.gz。

    结果说明

    输出结果包括:

    输出文件名称 说明
    FES.csv 随CV变化的自由能数据文件
    FES.dat.tar.gz 随CV变化的自由能数据压缩文件

    Free Energy Surface Analysis

    Introduction

    The Free Energy Surface Analysis module calculates the free energy from the results output by PLUMED-based metadynamics simulations.

    Parameter

    Input File

    The deposited Gaussian function file output by the metadynamics simulation. Default “HILLS.dat”.

    Histogram

    Whether to use the histogram method when calculating the free energy from the deposited Gaussian function file. “yes” or “no”, default “no”.

    Sigma

    Width of the Gaussian function used by the histogram method. If there are multiple CVs, separate the values with “//”, such as 0.35//0.35. Only effective when the histogram method is used. When there are multiple CVs and only one width value is set, that value is applied to all CVs. Default 0.05.

    CV Name

    The CV to consider in the free energy calculation based on the deposited Gaussian function file. When no CV is specified, all CVs contained in the file are considered. When a CV is specified, the histogram method cannot be used.

    CV Min

    The minimum boundary value of the collective variable. Use “//” to separate multiple collective variable minimums, such as 0.1//0.3. It is strongly recommended to set boundaries. When there are multiple CVs and only one minimum value is set, it means that the minimum value will be applied to all CVs.

    CV Max

    The maximum boundary value of the collective variable. Use “//” to separate multiple collective variable maximums, such as 0.1//0.3. It is strongly recommended to set boundaries. When there are multiple CVs and only one maximum value is set, it means that the maximum value will be applied to all CVs.
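
    The "//" convention shared by Sigma, CV Min, CV Max, Grid Size, and Bin, including the rule that a single value applies to every CV, can be sketched as below. The function name per_cv_values is hypothetical, and rejecting a count that matches neither 1 nor the number of CVs is an assumption about sensible behaviour, not documented behaviour.

```python
def per_cv_values(field, n_cvs):
    """Expand a '//'-separated field into one float per collective variable."""
    values = [float(v) for v in field.split("//")]
    if len(values) == 1:
        # A single value is applied to all CVs.
        values = values * n_cvs
    if len(values) != n_cvs:
        raise ValueError(
            "expected 1 or %d values, got %d" % (n_cvs, len(values)))
    return values


if __name__ == "__main__":
    print(per_cv_values("0.1//0.3", 2))
    print(per_cv_values("0.35", 2))
```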

    Grid Size

    The window size of the collective variable. Use “//” to separate multiple collective variable window sizes, such as 0.1//0.3. Only effective when CV Min and CV Max values are set. When there are multiple CVs and only one window size value is set, it means that the window size value will be applied to all CVs.

    Bin

    The number of windows for the collective variable. Use “//” to separate the bin counts of multiple collective variables, such as 150//300. Only effective when CV Min and CV Max values are set. When there are multiple CVs and only one value is set, that value is applied to all CVs. The product of Grid Size and Bin equals the difference between CV Max and CV Min, so when both Grid Size and Bin are set, whichever yields the larger number of windows takes precedence.

    Temperature

    The temperature value used in the free energy calculation based on the deposited Gaussian function file. Default 300K.

    Min to Zero

    Whether to shift the output free energy data so that its minimum value lies at zero. “yes” or “no”, default “no”.
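
    To make the idea concrete, the sketch below sums deposited Gaussians on a one-dimensional grid and optionally shifts the minimum to zero. It is a simplified illustration, not the module's algorithm: it assumes a single CV, takes hills as (center, sigma, height) tuples rather than reading a HILLS.dat file, and ignores the rescaling used in well-tempered runs. The function name free_energy_1d is hypothetical.

```python
import math


def free_energy_1d(hills, cv_min, cv_max, n_bins, min_to_zero=False):
    """Estimate F(s) = -sum_i h_i * exp(-(s - c_i)^2 / (2 * sigma_i^2)) on a grid.

    hills: iterable of (center, sigma, height) tuples.
    Returns (grid, fes) with n_bins + 1 points each.
    """
    hills = list(hills)
    step = (cv_max - cv_min) / n_bins
    grid = [cv_min + i * step for i in range(n_bins + 1)]
    fes = []
    for s in grid:
        bias = sum(
            height * math.exp(-((s - center) ** 2) / (2.0 * sigma ** 2))
            for center, sigma, height in hills
        )
        fes.append(-bias)
    if min_to_zero:
        # Shift the whole profile so that its lowest point is exactly 0.
        lowest = min(fes)
        fes = [value - lowest for value in fes]
    return grid, fes


if __name__ == "__main__":
    example_hills = [(0.5, 0.05, 1.0), (0.5, 0.05, 1.0)]  # made-up data
    grid, fes = free_energy_1d(example_hills, 0.0, 1.0, 10, min_to_zero=True)
    for s, f in zip(grid, fes):
        print("%.2f %.4f" % (s, f))
```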

    Stride

    The number of deposited Gaussian functions per free energy calculation. When the free energy is calculated from the deposited Gaussian function file, one calculation is performed each time this number of Gaussian functions has been accumulated. When the stride is not set, a single free energy calculation is performed over all deposited Gaussian functions.

    Output File

    The output file, which contains the free energy as a function of the CVs. Default FES.csv. When a Stride value is specified, the default is FES.dat.tar.gz.

    Result

    The output results include:

    Output File Name Description
    FES.csv Free energy data as a function of the CVs
    FES.dat.tar.gz Compressed archive of free energy data as a function of the CVs
  • Name: MD Clustering (v2)
    Description: MD Clustering是对动力学轨迹进行归簇分析。 MD Clustering is a clustering analysis of dynamic trajectories.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-04 11:40:38
    Reference:

    MD Clustering

    简介

    MD Clustering是对动力学轨迹进行归簇分析。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    Cutoff

    聚类时结构的RMSD截断值(nm)

    Cluster Method

    聚类算法:linkage, jarvis-patrick, monte-carlo, diagonalization, gromos, 默认使用gromos算法。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    输出结果包括:

    输出文件名称 说明
    clusters.pdb 差异较大的每个簇的代表性结构
    clust-size.xvg 各个簇的帧数
    clust-id.xvg 各个簇和轨迹帧号的对应关系

    MD Clustering

    Introduction

    MD Clustering is a clustering analysis of molecular dynamics trajectories.

    Parameter Description

    Path File

    Path file produced by the MD simulation. It can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    Cutoff

    RMSD cutoff value for clustering (in nm).

    Cluster Method

    Clustering algorithm: linkage, jarvis-patrick, monte-carlo, diagonalization, gromos. The default method is gromos.
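
    For orientation, the gromos method is commonly described as: count each structure's neighbours within the cutoff, take the structure with the most neighbours together with those neighbours as a cluster, remove them from the pool, and repeat. The sketch below follows that description on a precomputed RMSD matrix. It is an illustration only, not the module's code: the function name gromos_cluster is hypothetical, the cutoff comparison is assumed to be inclusive, and ties are broken by the lowest frame index.

```python
def gromos_cluster(rmsd, cutoff):
    """Cluster frames from a symmetric RMSD matrix (list of lists, in nm).

    Returns a list of (center_frame, member_frames) tuples, largest first.
    """
    remaining = set(range(len(rmsd)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            # Neighbours of frame i that are still unassigned (includes i itself).
            members = [j for j in sorted(remaining) if rmsd[i][j] <= cutoff]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters


if __name__ == "__main__":
    # Made-up RMSD values: frames 0-2 are similar, frame 3 is an outlier.
    matrix = [
        [0.00, 0.05, 0.08, 0.50],
        [0.05, 0.00, 0.06, 0.50],
        [0.08, 0.06, 0.00, 0.50],
        [0.50, 0.50, 0.50, 0.00],
    ]
    print(gromos_cluster(matrix, 0.1))
```

    A smaller Cutoff produces more, tighter clusters; a larger one merges frames into fewer clusters.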

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.
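
    The selection syntax used by Custom Resid and Custom Atom (ranges with "-", separate entries with commas) can be expanded as in the sketch below. The function name parse_number_list is hypothetical, and tolerating spaces after commas is an assumption based on the example above.

```python
def parse_number_list(text):
    """Expand a selection such as '1-10,15' into a list of integers."""
    numbers = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError("range start exceeds range end: " + part)
            numbers.extend(range(first, last + 1))
        else:
            numbers.append(int(part))
    return numbers


if __name__ == "__main__":
    print(parse_number_list("1-10, 15"))
```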

    Skip Time (ns)

    Time interval between frames (in ns).

    Result Description

    The output results include:

    Output File Name Description
    clusters.pdb Representative structures of each cluster with significant differences
    clust-size.xvg Number of frames in each cluster
    clust-id.xvg Correspondence between clusters and trajectory frame numbers
  • Name: MD Hbond (v2)
    Description: MD Hbond对于指定组别之间的氢键分析。 MD Hbond for hydrogen bond analysis between specified groups.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-05 17:34:57
    Reference:

    MD Hbond

    简介

    MD Hbond模板对于指定组别之间的氢键分析。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group1

    选择需要计算的氢键组别1:Protein,DNA,RNA。如果两个组相同则分析的是组内氢键。
    可以根据PDB中小分子的名称填写组别名称。

    System Group2

    选择需要计算的氢键组别2:Protein,DNA,RNA。如果两个组相同则分析的是组内氢键。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid1

    自定义需要计算的组1残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom1

    自定义需要计算的组1原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Resid2

    自定义需要计算的组2残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom2

    自定义需要计算的组2原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    输出结果包括:

    输出文件名称 说明
    hbnum.csv 氢键分析CSV文件
    hbnum.xvg 氢键分析XVG文件
    hbnum.png 氢键分析PNG文件

    其中hbnum.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Hydrogen bonds 氢键数目
    Pairs within 0.35 nm 两个组相距0.35nm内的接触的原子数目

    MD Hbond

    Introduction

    The MD Hbond template analyzes hydrogen bonds between specified groups.

    Parameter Description

    Path File

    Path file produced by the MD simulation. It can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group1

    Select the hydrogen bond group 1 for calculation: Protein, DNA, RNA. If both groups are the same, it analyzes intra-group hydrogen bonds. You can specify group names based on the names of small molecules in the PDB file.

    System Group2

    Select the hydrogen bond group 2 for calculation: Protein, DNA, RNA. If both groups are the same, it analyzes intra-group hydrogen bonds. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid1

    Custom residue numbers for group 1 calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom1

    Custom atom numbers for group 1 calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Custom Resid2

    Custom residue numbers for group 2 calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom2

    Custom atom numbers for group 2 calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Result Description

    The output results include:

    Output File Name Description
    hbnum.csv Hydrogen bond analysis CSV file
    hbnum.xvg Hydrogen bond analysis XVG file
    hbnum.png Hydrogen bond analysis PNG file

    The hbnum.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Hydrogen bonds Number of hydrogen bonds
    Pairs within 0.35 nm Number of atom pairs between the two groups that lie within 0.35 nm of each other
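
    A result file with the fields listed above can be summarised with the standard csv module. This is a sketch under assumptions: the exact header row of hbnum.csv is assumed to match the field names in the table, the sample rows are made up for illustration, and the function name mean_hbonds is hypothetical.

```python
import csv
import io

# Made-up sample in the assumed layout of hbnum.csv.
SAMPLE = """Time (ns),Hydrogen bonds,Pairs within 0.35 nm
0.0,3,10
0.1,5,12
0.2,4,11
"""


def mean_hbonds(handle):
    """Return the average number of hydrogen bonds over all frames."""
    counts = [int(row["Hydrogen bonds"]) for row in csv.DictReader(handle)]
    if not counts:
        raise ValueError("no data rows found")
    return sum(counts) / len(counts)


if __name__ == "__main__":
    print(mean_hbonds(io.StringIO(SAMPLE)))
    # For a real file: with open("hbnum.csv", newline="") as fh: mean_hbonds(fh)
```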
  • Name: MD Trajectory (v2)
    Description: 可根据起始帧数、结束帧数以及间隔帧数对平衡模拟进行轨迹提取,从而将其转换为GRO或者PDB轨迹文件。 MD Trajectory converts Gromacs trajectory file (xtc) into GRO or PDB file for visualization.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD Trajectory

    简介

    可根据起始帧数、结束帧数以及间隔帧数对平衡模拟进行轨迹提取,从而将其转换为GRO或者PDB轨迹文件。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run模块或者AlphaAutoMD模块中获取。

    Type

    文件输出类型:GRO或者PDB。

    Water

    输出文件是否保留水盒子。

    Start Time (ps)

    起始位置(单位ps)。

    End Time (ps)

    结束位置(单位ps)。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。对于膜体系的轨迹提取是必填项。

    Keep Heterogens

    是否保留体系中的溶剂(Water以及Ion):不保留(none),都保留(all),指定保留溶剂范围(specify)。

    Specify Heterogens

    指定需要保留的特殊组别如:水(Water),离子(Ion);或者指定保留组别的范围,规定格式为:需要保留的溶剂组别(Water或者Ion):限定距离(单位Å):目标组别,中间使用冒号(:)进行分隔,例如Water:3:ligand。
    注:组别名称可以通过MD Solvation模块的index文件查询;若目标组别是小分子,可以根据PDB中小分子的名称填写组别名称,多个小分子可填写ligand表示。

    结果说明

    输出结果包括:

    输出文件名称 说明
    md_finally.pdb 最后一帧结构文件
    md_center.pdb/.gro PDB/GRO格式轨迹文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    MD Trajectory

    Introduction

    The MD Trajectory module allows for the extraction of trajectories from equilibrium simulations based on the starting frame number, ending frame number, and frame interval, converting them into GRO or PDB trajectory files.

    Parameter

    Path File

    Path file produced by the MD simulation. It can be obtained from the GMX MD Run module or the AlphaAutoMD module.

    Type

    File output type: GRO or PDB.

    Water

    Whether to retain the water box in the output files.

    Start Time (ps)

    Starting time (in ps).

    End Time (ps)

    Ending time (in ps).

    Skip Time (ps)

    Time interval, in ps.

    Index File

    Index file in ndx format. This is a required parameter for extracting trajectories in membrane systems.

    Keep Heterogens

    Whether to retain the solvent in the system (Water and Ion): retain none (none), retain all (all), or retain a specified range (specify).

    Specify Heterogens

    Specify the groups to retain, such as Water or Ion, or specify a retention range in the format solvent group to retain (Water or Ion) : cutoff distance (in Å) : target group, with the fields separated by colons (:), e.g., Water:3:ligand.
    Note: Group names can be looked up in the index file of the MD Solvation module. If the target group is a small molecule, enter its name as it appears in the PDB; multiple small molecules can be referred to collectively as ligand.
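
    The Specify Heterogens format can be split into its three fields as sketched below. The function name parse_heterogen_spec and the dictionary keys are hypothetical; restricting the solvent field to Water or Ion follows the text above, and treating a bare "Water" or "Ion" as "keep the whole group" is an assumption.

```python
ALLOWED_SOLVENTS = ("Water", "Ion")


def parse_heterogen_spec(spec):
    """Parse 'Water', 'Ion', or 'Water:3:ligand' into a dictionary."""
    parts = [p.strip() for p in spec.split(":")]
    if parts[0] not in ALLOWED_SOLVENTS:
        raise ValueError("solvent group must be Water or Ion: " + parts[0])
    if len(parts) == 1:
        # Whole group retained, no distance restriction.
        return {"solvent": parts[0], "distance_angstrom": None, "target": None}
    if len(parts) == 3:
        return {
            "solvent": parts[0],
            "distance_angstrom": float(parts[1]),
            "target": parts[2],
        }
    raise ValueError("expected 'group' or 'group:distance:target': " + spec)


if __name__ == "__main__":
    print(parse_heterogen_spec("Water:3:ligand"))
```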

    Result

    The output results include:

    Output File Name Description
    md_finally.pdb Structure file of the final frame
    md_center.pdb PDB format trajectory file
    md_center.gro GRO format trajectory file

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: MD Gyration (v2)
    Description: MD Gyration回旋半径分析,可用来衡量体系模拟时的质权平均半径。 MD Gyration is a radius of gyration analysis, which can be used to measure the mass-weighted average radius of the system during simulation.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-05 16:24:54
    Reference:

    MD Gyration

    简介

    MD Gyration回旋半径分析,可用来衡量体系模拟时的质权平均半径。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    输出结果包括:

    输出文件名称 说明
    gyrate.csv 回转半径CSV文件
    gyrate.xvg 回转半径XVG文件
    gyrate.png 回转半径PNG文件

    其中gyrate.csv包括信息如下:

    字段名称 说明
    Time (ps) 时间
    Rg 回旋半径
    Rg(X) 绕着x轴的回旋半径
    Rg(Y) 绕着y轴的回旋半径
    Rg(Z) 绕着z轴的回旋半径

    MD Gyration

    Introduction

    MD Gyration is a radius of gyration analysis used to measure the mass-weighted average radius of a system during simulation.
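
    As a worked illustration of a mass-weighted radius, the sketch below computes Rg and the three per-axis values for one frame. It is not the module's code: the function name is hypothetical, and the per-axis values are assumed to use the distance of each atom from the axis through the centre of mass (for the x-axis, y² + z²).

```python
import math


def radius_of_gyration(masses, coords):
    """Return (Rg, Rg_x, Rg_y, Rg_z) for one frame.

    masses: list of atomic masses; coords: list of (x, y, z) tuples.
    """
    total = sum(masses)
    com = [sum(m * c[k] for m, c in zip(masses, coords)) / total
           for k in range(3)]
    sq = [0.0, 0.0, 0.0]  # mass-weighted sums of dx^2, dy^2, dz^2
    for m, c in zip(masses, coords):
        for k in range(3):
            sq[k] += m * (c[k] - com[k]) ** 2
    rg = math.sqrt((sq[0] + sq[1] + sq[2]) / total)
    rg_x = math.sqrt((sq[1] + sq[2]) / total)  # about the x-axis
    rg_y = math.sqrt((sq[0] + sq[2]) / total)  # about the y-axis
    rg_z = math.sqrt((sq[0] + sq[1]) / total)  # about the z-axis
    return rg, rg_x, rg_y, rg_z


if __name__ == "__main__":
    print(radius_of_gyration([1.0, 1.0], [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]))
```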

    Parameter Description

    Path File

    Path file produced by the MD simulation. It can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The output results include:

    Output File Name Description
    gyrate.csv Gyration radius CSV file
    gyrate.xvg Gyration radius XVG file
    gyrate.png Gyration radius PNG file

    The gyrate.csv file includes the following information:

    Field Name Description
    Time (ps) Time
    Rg Radius of gyration
    Rg(X) Radius of gyration around the x-axis
    Rg(Y) Radius of gyration around the y-axis
    Rg(Z) Radius of gyration around the z-axis
  • Name: MD SASA (v2)
    Description: MD SASA模块是计算指定组别的溶剂可及表面积(solvent accessible surface area,SASA)。 MD SASA module calculates the solvent accessible surface area (SASA) for a specified group.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-06 00:29:36
    Reference:

    MD SASA

    简介

    MD SASA模块是计算指定组别的溶剂可及表面积(solvent accessible surface area,SASA)。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    输出结果包括:

    输出文件名称 说明
    area.csv 溶剂可及表面积CSV文件
    area.xvg 溶剂可及表面积XVG文件
    area.png 溶剂可及表面积PNG文件

    其中area.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Total Area (nm^2) 溶剂可及表面积
    Hydrophobic (nm^2) 疏水表面积
    Hydrophilic (nm^2) 亲水表面积

    MD SASA

    Introduction

    The MD SASA module calculates the solvent accessible surface area (SASA) of specified groups.

    Parameter Description

    Path File

    Path file produced by the MD simulation. It can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The output results include:

    Output File Name Description
    area.csv Solvent accessible surface area CSV file
    area.xvg Solvent accessible surface area XVG file
    area.png Solvent accessible surface area PNG file

    The area.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Total Area (nm^2) Total solvent accessible surface area
    Hydrophobic (nm^2) Hydrophobic surface area
    Hydrophilic (nm^2) Hydrophilic surface area
  • Name: MD Distance (v2)
    Description: MD Distance是分子动力学轨迹的距离分析模块,输出分子动力学过程中两个组之间距离 (质心距离或几何中心距离) 随时间的变化。 MD Distance is a distance analysis module that outputs the distance changes between two groups (center of mass distance or geometric center distance) over time.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-08-22 09:35:48
    Reference:

    MD Distance

    简介

    MD Distance是针对分子动力学轨迹的距离分析模块,输出两个组之间距离 (质心距离或几何中心距离) 随时间的变化情况。自定义组别时需要注意,如果只需测量两个原子之间的距离则填写Custom Atom1和Custom Atom2即可;当同时填写Custom Resid1和Custom Atom1时,组别1的原子数是Custom Atom1与Custom Resid1交集的原子。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group1

    选择需要计算的组别1:Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    System Group2

    选择需要计算的组别2:Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid1

    自定义需要计算的组1残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Atom1

    自定义需要计算的组1原子编号,连续参数可用“-”表示,不连续原子用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Resid2

    自定义需要计算的组2残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Atom2

    自定义需要计算的组2原子编号,连续参数可用“-”表示,不连续原子用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    dist.csv 距离分析CSV文件
    dist.xvg 距离分析XVG文件
    dist.png 距离分析PNG文件

    其中dist.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Distance (nm) 组别之间的距离

    MD Distance

    Introduction

    MD Distance is a distance analysis module for molecular dynamics trajectories. It outputs how the distance between two groups (center-of-mass distance or geometric-center distance) changes over time. When defining custom groups, note that if you only need to measure the distance between two atoms, filling in Custom Atom1 and Custom Atom2 is sufficient. When both Custom Resid1 and Custom Atom1 are filled in, group 1 consists of the atoms in the intersection of Custom Atom1 and Custom Resid1.
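
    The difference between the two distance definitions can be seen in a short sketch for a single frame. The function names are hypothetical, and periodic boundary conditions are ignored for simplicity, which is an assumption rather than a statement about the module.

```python
import math


def center(coords, masses=None):
    """Geometric center if masses is None, otherwise the center of mass."""
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    return tuple(
        sum(m * c[k] for m, c in zip(masses, coords)) / total
        for k in range(3)
    )


def group_distance(coords1, coords2, masses1=None, masses2=None):
    """Distance between the centers of two atom groups (same unit as input)."""
    return math.dist(center(coords1, masses1), center(coords2, masses2))


if __name__ == "__main__":
    group1 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    group2 = [(1.0, 3.0, 0.0)]
    print(group_distance(group1, group2))                      # geometric centers
    print(group_distance(group1, group2, masses1=[3.0, 1.0]))  # center of mass for group 1
```

    For two single-atom groups both definitions reduce to the plain interatomic distance.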

    Parameter Description

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD (GMX2023) module.

    System Group1

    Select the group 1 for calculation: Protein, DNA, RNA.
    You can enter the group name based on the name of the small molecule in the PDB.

    System Group2

    Select the group 2 for calculation: Protein, DNA, RNA.
    You can enter the group name based on the name of the small molecule in the PDB.

    Custom Resid1

    Custom residue numbers for group 1. Use “-” for consecutive residues and commas to separate non-consecutive residues, for example: 1-10,15. Custom groups are combined as a union.

    Custom Atom1

    Custom atom numbers for group 1. Use “-” for consecutive atoms and commas to separate non-consecutive atoms, for example: 1-10,15. Custom groups are combined as a union.

    Custom Resid2

    Custom residue numbers for group 2. Use “-” for consecutive residues and commas to separate non-consecutive residues, for example: 1-10,15. Custom groups are combined as a union.

    Custom Atom2

    Custom atom numbers for group 2. Use “-” for consecutive atoms and commas to separate non-consecutive atoms, for example: 1-10,15. Custom groups are combined as a union.

    Skip Time (ns)

    Time interval for each frame (in ns).

    Result Description

    The output includes:

    Output File Name Description
    dist.csv Distance analysis CSV file
    dist.xvg Distance analysis XVG file
    dist.png Distance analysis PNG file

    The dist.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Distance (nm) Distance between the groups
  • Name: MMPBSA (v2)
    Description: MMPBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。 MMPBSA calculates the binding free energy between the receptor and ligand and provides energy decomposition data, binding constant (Ka), and inhibitor constant (Ki).
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-08-03 09:10:29
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMPBSA

    简介

    MMPBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMPBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMPBSA_result.txt MMPBSA结果汇总文件。
    MMPBSA_Residue.csv 能量分解数据CSV文件。
    MMPBSA.pdb 原子对应的MMPBSA能量放到PDB文件。可以做对应能量类别的表面图,从图的颜色深浅可以看出哪些区域对结合能贡献值较大。
    MMPBSA.tar.gz MMPBSA所有原始文件。包括_mmpbsa_residue_#.txt是每个残基对应的每个能量类别的数值,共包含7个能量类别:范德华能(VDW)、静电能(ELE)、溶剂化能极性部分(PB)、溶剂化能非极性部分(SA)、VDW+ELE=MM、PB+SA=PBSA、MM+PBSA=Binding/MMPBSA。_mmpbsa_residue.txt是对上述7个文件的总结,即为MMPBSA_Residue.csv对应的原始文件。_mmpbsa_atom#.pdb是将每个原子对应的能量类别放到pdb文件,与MMPBSA.pdb相似。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMPBSA

    Introduction

    MMPBSA calculates the binding free energy between a receptor and a ligand and provides energy decomposition data, the binding constant (Ka), and the inhibitor constant (Ki). The entropy term is calculated with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropic component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation; it requires that a structurally stable part of the trajectory be selected for the calculation.
    This module provides four calculation methods. Trajectory calculates the receptor-ligand binding free energy from a molecular dynamics trajectory. One Structure calculates the receptor-ligand binding free energy for a single-frame PDB structure or a docked PDB structure; in the MMPBSA of One Structure workflow, a PDB file can be input directly.
    Index Name takes the group names of the receptor and ligand as input, while Custom Name takes the residue numbers of the receptor and ligand in the PDB.
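
    For readers unfamiliar with the quantities named above, the sketch below shows one common way to write them. It is an illustration under assumptions, not the module's implementation: the interaction entropy follows the published form -TΔS = kT ln⟨exp(β(E_int - ⟨E_int⟩))⟩, energies are assumed to be in kJ/mol, Ka is taken as exp(-ΔG/RT) relative to a 1 M standard state, Ki is taken as 1/Ka, and all function names are hypothetical.

```python
import math

R_KJ_PER_MOL_K = 0.0083144626  # gas constant in kJ/(mol*K)


def interaction_entropy(interaction_energies, temperature=300.0):
    """Return -T*dS (kJ/mol) from per-frame receptor-ligand interaction energies."""
    kt = R_KJ_PER_MOL_K * temperature
    mean = sum(interaction_energies) / len(interaction_energies)
    # Ensemble average of exp(beta * fluctuation of the interaction energy).
    average = sum(math.exp((e - mean) / kt) for e in interaction_energies)
    average /= len(interaction_energies)
    return kt * math.log(average)


def binding_constants(delta_g, temperature=300.0):
    """Return (Ka, Ki) for a binding free energy delta_g in kJ/mol.

    Assumes Ka = exp(-dG / RT) and Ki = 1 / Ka.
    """
    ka = math.exp(-delta_g / (R_KJ_PER_MOL_K * temperature))
    return ka, 1.0 / ka


if __name__ == "__main__":
    energies = [-101.0, -99.0, -100.5, -99.5]  # made-up values
    print(interaction_entropy(energies))
    print(binding_constants(-40.0))
```

    Because the average is exponential in the energy fluctuation, frames from an unstable part of the trajectory can dominate the result, which is why a stable RMSD window is recommended.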

    Parameter Description

    Trajectory Method

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand, can be Protein, DNA, or RNA. If it is a small molecule, enter its name in the PDB. Use ‘Other’ if the system contains a ligand (including small molecules) other than proteins.

    Start Time (ps)

    Time of the first frame to analyze, in ps. Choose a window in which the RMSD is stable, because structural instability inflates the computed entropy.

    End Time (ps)

    Time of the last frame to analyze, in ps. Choose a window in which the RMSD is stable, because structural instability inflates the computed entropy.

    Skip Time (ps)

    Time interval between analyzed frames, in ps.

    Index File

    Index file in ndx format. When there is a membrane structure in the system, you can obtain the index.ndx file through the Membrane Solvation module. In membrane simulations, the receptor is ‘receptor’, the ligand is ‘ligand’, and the membrane is ‘membrane’.
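
    Before submitting a membrane system, it can help to confirm that the index file really contains the three group names mentioned above. The sketch below reads the GROMACS index format (a bracketed group name followed by whitespace-separated atom numbers); the helper names and the sample content are assumptions for illustration.

```python
def parse_ndx(text):
    """Map each group name in a GROMACS index (.ndx) file to its atom numbers."""
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A header such as "[ receptor ]" opens a new group.
            current = line[1:-1].strip()
            groups[current] = []
        elif current is not None:
            groups[current].extend(int(tok) for tok in line.split())
    return groups


def missing_groups(groups, required=("receptor", "ligand", "membrane")):
    """List the required group names that are absent from the index."""
    return [name for name in required if name not in groups]


# Hypothetical index content for a small membrane system.
example = """[ receptor ]
1 2 3 4
5 6
[ ligand ]
7 8
[ membrane ]
9 10 11
"""
groups = parse_ndx(example)
print(missing_groups(groups))  # prints []
```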

    Custom Receptor

    Defines the receptor group for the binding energy calculation. Enter the sequence numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Numbering starts from 1 and is independent of the residue numbering in the original PDB.

    Custom Ligand

    Defines the ligand group for the binding energy calculation. Enter the sequence numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Numbering starts from 1 and is independent of the residue numbering in the original PDB.
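
    The range syntax used by Custom Receptor and Custom Ligand can be expanded as in the sketch below. The function name is hypothetical, and rejecting descending ranges is an assumption about reasonable validation, not documented module behavior.

```python
def parse_residue_ranges(spec):
    """Expand a string such as '1-211,212-213' into a sorted list of residue numbers."""
    residues = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part}")
            residues.update(range(start, end + 1))
        else:
            residues.add(int(part))
    return sorted(residues)


# Both spellings from the description select the same 213 residues.
print(parse_residue_ranges("1-213") == parse_residue_ranges("1-211,212-213"))
```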

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    System parameter compression file in tar.gz format. Obtained from the MD Solvation module or Membrane Solvation module.

    Result Description

    The output includes:

    Output File Name Description
    MMPBSA_result.txt Summary file of MMPBSA results.
    MMPBSA_Residue.csv Energy decomposition data in CSV format.
    MMPBSA.pdb PDB file annotated with the per-atom MMPBSA energies. It can be used to draw a surface map for a given energy category, where the color depth shows which regions contribute most to the binding energy.
    MMPBSA.tar.gz All raw MMPBSA files. The _mmpbsa_residue_#.txt files hold the per-residue values of each energy category, 7 categories in total: van der Waals energy (VDW), electrostatic energy (ELE), polar part of the solvation energy (PB), nonpolar part of the solvation energy (SA), VDW+ELE=MM, PB+SA=PBSA, and MM+PBSA=Binding/MMPBSA. _mmpbsa_residue.txt summarizes these 7 files and is the raw file underlying MMPBSA_Residue.csv. The _mmpbsa_atom#.pdb files write each energy category onto the atoms of a PDB file, similar to MMPBSA.pdb.

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: MD RMS (v2)
    Description: The MD RMS module calculates the root mean square deviation (RMSD) and root mean square fluctuation (RMSF) of an equilibrium simulation trajectory to analyze the structural stability and structural changes of the system.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    RMS

    Introduction

    By calculating the Root Mean Square Deviation (RMSD) and Root Mean Square Fluctuation (RMSF) of equilibrium simulation trajectories, the stability and structural changes of the system can be analyzed.

    Parameter Description

    Path File

    The path file obtained after MD simulation, which can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    Analysis Type

    Select the type of analysis: RMSD or RMSF (multiple selections possible).

    System Group

    Select the group to be calculated.

    Custom Resid

    Residue numbers to include in the calculation. Use “-” for a consecutive range and commas to separate non-consecutive residues, for example: 1-10,15.

    Custom Atom

    Atom names to include in the calculation, separated by commas, for example: CA,O,H. The final selection is the intersection of Custom Atom and Custom Resid.

    Skip Time (ps)

    Index File

    Index file obtained from the Membrane Solvation module.

    Result Description

    The output results include:

    Output File Name Description
    rmsd_result.csv CSV file of RMSD for the selected group
    rmsd_result.png PNG file of RMSD for the selected group
    rmsd_result.xvg XVG file of RMSD for the selected group
    rmsf_*.csv CSV file of RMSF for the selected group
    rmsf_*.png PNG file of RMSF for the selected group
    rmsf_*.xvg XVG file of RMSF for the selected group
    bfac.pdb PDB file whose B-factor column carries the atomic RMSF, converted to B-factor values with the formula <Δr^2>=3B/(8π^2).
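
    The conversion behind bfac.pdb can be written out directly: solving <Δr^2>=3B/(8π^2) for B gives B = 8π^2<Δr^2>/3. The sketch below is a minimal illustration with hypothetical function names; the B-factor comes out in the square of whatever length unit the RMSF uses, and the unit the module writes to the file is not stated here.

```python
import math


def rmsf_to_bfactor(rmsf):
    """Convert an RMSF value to a B-factor via B = 8*pi^2/3 * RMSF^2."""
    return 8.0 * math.pi ** 2 / 3.0 * rmsf ** 2


def bfactor_to_rmsf(bfactor):
    """Invert the relation: RMSF = sqrt(3*B / (8*pi^2))."""
    return math.sqrt(3.0 * bfactor / (8.0 * math.pi ** 2))


# An RMSF of 1.0 (in any length unit) maps to B of about 26.3 in that unit squared.
print(round(rmsf_to_bfactor(1.0), 2))
```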
  • Name: MD PCA (v2)
    Description: The motion of a large flexible system of N atoms, such as a protein, takes 3N Cartesian coordinates to describe, and data of such high dimensionality are hard to understand and analyze intuitively. The MD PCA (principal component analysis) module extracts the main influencing factors (eigenvectors) from these high-dimensional data. The first few eigenvectors (principal components; the first two are PC1 and PC2) generally describe most of the molecular motion.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-06 00:51:22
    Reference:

    MD PCA

    Introduction

    For a large flexible system with N atoms such as a protein, its motion trajectory requires 3N-dimensional Cartesian coordinates to describe, making it difficult to understand and analyze high-dimensional data. The MD PCA (Principal Component Analysis, PCA) module can analyze the main influencing factors (eigenvectors) from high-dimensional data. The first few eigenvectors (principal components, such as the first two principal components PC1, PC2) can generally describe most of the information about molecular motion.

    Parameter Description

    Path File

    Path file obtained after MD simulation, can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The module produces the result files listed below. Files that share a name but differ in suffix (PNG, CSV, XVG) contain the same data in different formats.

    Output File Name Description
    Gibbs_2d.png/Gibbs_3d.png 2D and 3D free energy landscape plots when only two principal components are considered
    average.pdb Computed average structure file
    eigenvalues.xvg/.png/.csv Eigenvalues file
    filtered.pdb Filtered trajectory file after dimensionality reduction
    proj1.xvg/.png/.csv Projection of the trajectory onto principal component PC1
    proj2.xvg/.png/.csv Projection of the trajectory onto principal component PC2
    proj_all.xvg Combined file of the PC1 and PC2 projections
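
    A free energy landscape over two principal components is commonly obtained by histogramming the PC1/PC2 projections and applying G = -kT ln(P/P_max). The sketch below illustrates that general approach only; it is an assumption that the Gibbs_2d.png and Gibbs_3d.png plots are built this way, and the function name, bin count, and temperature are hypothetical.

```python
import math
from collections import Counter

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol*K)


def free_energy_surface(pc1, pc2, bins=4, temperature=300.0):
    """Return {(i, j): G} with G = -kT ln(P/P_max) in kJ/mol for occupied bins."""
    if len(pc1) != len(pc2) or not pc1:
        raise ValueError("pc1 and pc2 must be non-empty and of equal length")

    def bin_index(value, low, high):
        if high == low:
            return 0
        # The maximum value is clamped into the last bin.
        return min(int((value - low) / (high - low) * bins), bins - 1)

    lo1, hi1, lo2, hi2 = min(pc1), max(pc1), min(pc2), max(pc2)
    counts = Counter(
        (bin_index(x, lo1, hi1), bin_index(y, lo2, hi2)) for x, y in zip(pc1, pc2)
    )
    most = max(counts.values())
    kt = K_B * temperature
    # The most populated bin gets G = 0; rarer bins get higher free energy.
    return {cell: -kt * math.log(n / most) for cell, n in counts.items()}


# Hypothetical projections: three frames in one basin, one frame elsewhere.
surface = free_energy_surface([0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0], bins=2)
print(sorted(surface))
```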
  • Name: MD (GMX2024)
    Description: The GMX MD Run (GMX2024) module runs a GROMACS-based molecular dynamics simulation using the prepared system topology and parameter files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 11:21:21
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    GMX MD Run (GMX2024)

    Introduction

    Submit the GROMACS input files to run a molecular dynamics simulation and obtain the trajectory produced by the equilibrium simulation.

    Parameter Description

    GRO File

    Submit the gro file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    Topology File

    Submit the top file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    ITP File

    Submit the itp file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    Minimize MDP File

    Submit the parameter file for energy minimization, in mdp format. This file can be generated with the Minimization method of the GMX MDP Generation module or with GMX MDP Generation (Auto).

    NPT MDP File

    Submit the parameter file for the NPT (isothermal-isobaric) simulation, in mdp format. This file can be generated with the NPT method of the GMX MDP Generation module or with GMX MDP Generation (Auto).

    MD MDP File

    Submit the parameter file for the equilibrium simulation, in mdp format. This file can be generated with the MD method of the GMX MDP Generation module or with GMX MDP Generation (Auto).

    Result Description

    The output results include:

    Output File Name Description
    md.cpt Checkpoint file for the MD simulation
    md.gro Molecular coordinate file for the MD simulation
    md.log Log file for the MD simulation
    md.tpr All initial data required for the MD simulation (molecular topology, initial structure, etc.)
    mini.gro Molecular coordinate file for the minimization run
    mini.log Log file for the minimization run
    mini.tpr All initial data required for the minimization run (molecular topology, initial structure, etc.)
    npt.gro Molecular coordinate file for the NPT simulation
    npt.log Log file for the NPT simulation
    npt.tpr All initial data required for the NPT simulation (molecular topology, initial structure, etc.)
    path.txt File recording where the simulation trajectory is stored; it can be used as the Path File input of downstream analysis modules.
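
    The output names above (mini, npt, md) suggest three consecutive stages, each starting from the final coordinates of the previous one. The sketch below only builds the corresponding gmx grompp and gmx mdrun command lines; it does not execute anything. The input file names and the chaining are assumptions for illustration, and the way the platform actually invokes GROMACS (for example, any restraint or GPU options) is not documented here.

```python
def stage_commands(
    gro="system.gro",
    top="topol.top",
    mdps=(("mini", "mini.mdp"), ("npt", "npt.mdp"), ("md", "md.mdp")),
):
    """Build grompp/mdrun command lists for consecutive simulation stages."""
    commands = []
    coords = gro
    for name, mdp in mdps:
        # grompp combines parameters, coordinates, and topology into a .tpr file.
        commands.append(
            ["gmx", "grompp", "-f", mdp, "-c", coords, "-p", top, "-o", f"{name}.tpr"]
        )
        # mdrun -deffnm writes name.gro, name.log, and so on for this stage.
        commands.append(["gmx", "mdrun", "-deffnm", name])
        coords = f"{name}.gro"  # the next stage starts from these coordinates
    return commands


for cmd in stage_commands():
    print(" ".join(cmd))
```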

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: Structure Prediction (Protenix)
    Description: Protenix is the PyTorch version of the AlphaFold3 model reproduced by the AML AI4Science team at ByteDance. - Model Performance: Protenix has been benchmarked against existing models and shows strong performance in structure prediction across different types of molecules. As a fully open-source model, it enables researchers to generate new predictions and fine-tune the model to meet specific application needs. - Methodology: During the reproduction process, Protenix was implemented based on the description of AF3, optimizing some ambiguous steps, correcting typographical errors, and making targeted adjustments based on model behavior. By sharing this reproduction experience, the team hopes to support the community in further advancing the field based on these improvements.
    Tags: undefined
    Author: ByteDance
    Release: 2024-12-30 09:46:26
    Reference: https://github.com/bytedance/Protenix

    Structure Prediction (Protenix)

    Introduction

    Protenix is the PyTorch version of the AlphaFold3 model reproduced by the AML AI4Science team at ByteDance. Here is a summary of the main contributions from the ByteDance AML AI4Science team:
    - Model Performance: Protenix has been benchmarked against existing models, demonstrating strong performance in structure prediction across different types of molecules. As a fully open-source model, it enables researchers to generate new predictions and fine-tune the model to meet specific application needs.
    - Methodology: During the reproduction process, Protenix was implemented based on the description of AF3, optimizing some ambiguous steps, correcting typographical errors, and making targeted adjustments based on model behavior. By sharing our reproduction experience, we hope to support the community in further advancing the field based on these improvements.
    - Accessibility: Protenix has been open-sourced, providing model weights, inference code, and training code for research purposes.

    Parameter

    Protein Sequence

    A sequence file for proteins in FASTA format, supporting multiple sequences.
    Note: For multi-protein complex structure prediction, the amino acid sequence input format is as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    A sequence file for DNA nucleic acids in FASTA format, supporting multiple sequences.

    RNA Sequence

    A sequence file for RNA nucleic acids in FASTA format, supporting multiple sequences.

    Note: The number of residues/bases currently supported for calculation is around 1,400.

    Ligand

    A text file containing information about small molecules in TXT format. It supports SMILES or CCD Code (Chemical Component Dictionary number). If using the SMILES format, each line should contain one small molecule; if using CCD Code, each line can contain one or more small molecules, separated by commas, and prefixed with CCD. Examples are as follows:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Note: This file is not for protein or peptide ligands given as amino acid sequences.
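
    The two line formats of the ligand file can be told apart by the CCD prefix. The sketch below is one way to read such a file; the function name and the returned tuple layout are assumptions for illustration, not the module's own parser.

```python
def parse_ligand_file(text):
    """Return a list of (kind, value) tuples, kind being 'smiles' or 'ccd'."""
    ligands = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) > 1 and fields[0] == "CCD":
            # A CCD line may list one or more codes after the prefix.
            ligands.extend(("ccd", code) for code in fields[1:] if code)
        else:
            # Anything else is treated as one SMILES string per line.
            ligands.append(("smiles", line))
    return ligands


sample = """CC(=O)OC1C[NH+]2CCC1CC2
CCD,ATP,HY3,P1L
CCD,MG
"""
for kind, value in parse_ligand_file(sample):
    print(kind, value)
```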

    Modification

    Optional. A text file (TXT format) containing post-translational modification (PTM) information. Each line holds one PTM entry, which consists of three parts:

    • The index of the sequence in which the PTM occurs
    • The CCD code of the PTM type
    • The position of the modified residue in that sequence

    The three parts are separated by commas. For example, 1,HY3,1 means that the first residue of the first sequence carries a PTM of type HY3 (the CCD code for 3-hydroxyproline, i.e. hydroxylated proline).
    Notes:
    • Sequence indices start at 1 and follow the order and number of sequences given in the Protein, DNA, and RNA parameters. For example, with 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the first protein sequence is 1, the second protein sequence is 2, the DNA sequence is 3, and the RNA sequence is 4.
    • For an introduction to the CCD, see https://www.wwpdb.org/data/ccd ; codes can be looked up at https://www.ebi.ac.uk/pdbe-srv/pdbechem/ .

    An example of a file containing multiple PTM entries:
    1,HY3,1
    1,P1L,5
    2,HY3,3
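
    The numbering rule and the three-field entry format can be sketched as follows. The function names are hypothetical; placing ligands after the sequences follows the rule stated under Covalent Bond.

```python
def number_entities(n_protein, n_dna, n_rna, n_ligand=0):
    """Assign 1-based entity numbers in the order protein, DNA, RNA, ligand."""
    labels = (
        ["protein"] * n_protein
        + ["DNA"] * n_dna
        + ["RNA"] * n_rna
        + ["ligand"] * n_ligand
    )
    return {index: label for index, label in enumerate(labels, start=1)}


def parse_modifications(text):
    """Parse lines such as '1,HY3,1' into (sequence_index, ccd_code, position)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        seq, code, pos = (field.strip() for field in line.split(","))
        entries.append((int(seq), code, int(pos)))
    return entries


print(number_entities(2, 1, 1))
print(parse_modifications("1,HY3,1\n1,P1L,5\n2,HY3,3\n"))
```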
    

    Covalent Bond

    A text file containing covalent bond information in TXT format. Each line contains one covalent bond entry, which consists of two atom entries. Each atom entry consists of three parts:

    • The sequential number of the sequence or small molecule containing the atom (the numbering rule is based on the sequence numbering defined in Modification, with the order of small molecules added at the end)
    • The position number of the residue containing the atom (if the residue is a small molecule, the number is 1)
    • The standard name of the atom:
      • By default, the atom name defined in the CCD (Chemical Component Dictionary).
      • If the ligand is given as a SMILES string, the 0-based position index of the atom in the SMILES string.

    These three parts are separated by commas. For example, 3,1,CA indicates the CA atom of the first residue (or small molecule) in the third entity (sequence or small molecule).
    A covalent bond is represented by two atom entries separated by a semicolon, such as: 1,1,CA;2,1,CA, indicating a covalent bond composed of two atoms, with the first atom being 1,1,CA and the second atom being 2,1,CA.
    An example of a file containing multiple covalent bond entries is as follows:

    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
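
    A covalent bond line splits first on the semicolon and then on commas. The sketch below parses the file into pairs of atom descriptors; the function names are hypothetical. The atom field is kept as a string because it may be either a CCD atom name or a SMILES position index.

```python
def parse_atom(spec):
    """Parse 'entity,position,atom' into (int, int, str)."""
    entity, position, atom = (field.strip() for field in spec.split(","))
    return int(entity), int(position), atom


def parse_covalent_bonds(text):
    """Parse one bond per line, each written as 'atom_spec;atom_spec'."""
    bonds = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        first, second = line.split(";")
        bonds.append((parse_atom(first), parse_atom(second)))
    return bonds


for bond in parse_covalent_bonds("1,1,CA;2,1,CA\n1,1,CA;3,1,CHA\n"):
    print(bond)
```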
    

    Ion

    One or more ion names, written on a single line of text and separated by commas. A count can be given after an ion name, separated by a colon. In the example below, the line specifies 2 MG ions, 1 ZN ion, and 3 CU ions:

    MG:2,ZN,CU:3
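
    The ion line can be turned into a name-to-count mapping as sketched below. The function name is hypothetical; treating an ion without a count as a single ion follows the ZN entry in the example.

```python
def parse_ions(line):
    """Parse 'MG:2,ZN,CU:3' into a {name: count} dictionary."""
    counts = {}
    for item in line.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, number = item.partition(":")
        name = name.strip()
        # An ion listed without a count contributes one ion.
        counts[name] = counts.get(name, 0) + (int(number) if number else 1)
    return counts


print(parse_ions("MG:2,ZN,CU:3"))  # prints {'MG': 2, 'ZN': 1, 'CU': 3}
```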
    

    Format

    The output structure format supports PDB or CIF, with PDB format as the default.

    Enhanced Mode

    In this mode, 1000 random seeds are used by default and 5 structures are sampled per seed, giving 5000 sampled structures in total. The top-scoring predictions are then selected from this pool to obtain a more accurate result. The mode is particularly suited to high-precision prediction of antibody-antigen complex structures; studies have shown that it improves the accuracy of antibody-antigen complex prediction by 60%. The input parameters are the same as in Single Mode, and one run takes approximately 10 to 20 hours.

    Note:

    The total length of the sequence cannot exceed 1300.

    Result

    The output consists of the five top-ranked complex structures (rank_1-5.cif) and pred_scores_protenix.csv. The CSV file contains the following columns:

    Column Name Description
    Name The name of the complex structure.
    Ranking_Score Score used to rank predicted structures by quality; it ranges from -100 to 1.5, and a higher value indicates a better prediction. It combines four indicators (pTM, ipTM, fraction_disordered, and has_clash): Ranking_Score = 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash, where disorder denotes fraction_disordered. Note: for a monomer, ipTM is 0, so the Ranking_Score is comparatively low; refer to pTM instead.
    pLDDT Confidence score for local structure, ranging from 0 to 100; higher values indicate a more reliable prediction. Values below 70 indicate low reliability, and values below 50 indicate very low confidence, typically a disordered prediction.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure; higher is more accurate. A score above 0.5 suggests that the overall fold may be similar to the true structure.
    ipTM The interface predicted template modeling score, reported only when the predicted structure is a complex. It measures how accurately the relative positions of the subunits are predicted; higher is more accurate. Scores above 0.8 indicate a high-quality prediction, scores below 0.6 suggest that the prediction may have failed, and scores between 0.6 and 0.8 are a gray area in which correctness is uncertain.
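
    The Ranking_Score formula and the ipTM thresholds from the table can be checked by hand with a few lines of code. The function names and the band labels below are hypothetical; the weights and cutoffs are the ones quoted in the table.

```python
def ranking_score(iptm, ptm, fraction_disordered, has_clash):
    """Ranking_Score = 0.8*ipTM + 0.2*pTM + 0.5*disorder - 100*has_clash."""
    return (
        0.8 * iptm
        + 0.2 * ptm
        + 0.5 * fraction_disordered
        - 100.0 * float(has_clash)
    )


def iptm_band(iptm):
    """Classify ipTM using the 0.6 and 0.8 thresholds from the table."""
    if iptm > 0.8:
        return "high quality"
    if iptm < 0.6:
        return "possible failure"
    return "uncertain"


# A monomer has ipTM = 0, so its score is driven by the 0.2 * pTM term.
print(round(ranking_score(0.0, 0.9, 0.0, False), 2))  # prints 0.18
print(iptm_band(0.72))  # prints uncertain
```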

    Reference

    https://github.com/bytedance/Protenix

  • Name: Generate Humanized Variants
    Description: Generates humanized variant sequences in batch from the Grafting and Back Mutation Grouping results in antibody humanization design.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-12-23 00:00:00
    Reference:

    Generate Humanized Variants

    Introduction

    Generate humanized variant sequences based on the Grafting and Back Mutation Grouping results.

    Parameters

    Graft Policy

    Graft policy file in JSON format generated by the Grafting module.

    Mutate Policy

    Combination mutate policy file (combination_mutate_policy.json) in JSON format, generated by the Back Mutation Grouping module.

    Results

    The output is the humanized sequence file humanized_variants_esmfold.fasta, in which the light and heavy chain sequences are joined into a single record with a colon (:) so that the file can be passed directly to the ESMFold module for batch structure prediction. Example:

    >L1H1
    EIVLTQSPATLSLSPGERATLSCRASQFVGSSIHWYQQKPGQAPRLLIYYASESMSGIPARFSGSGSGTDFTLTISSLEPEDFAVYYCQQSHSWPFTFGQGTKLEIK:EVQLVESGGGLVQPGGSLRLSCAASGFIFSNHYMSWVRQAPGKGLEWVSVIRSKSINSATYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARNYYGSTYDYWGQGTTVTVSS
    >L1H2
    EIVLTQSPATLSLSPGERATLSCRASQFVGSSIHWYQQKPGQAPRLLIYYASESMSGIPARFSGSGSGTDFTLTISSLEPEDFAVYYCQQSHSWPFTFGQGTKLEIK:EVQLVESGGGLVQPGGSLRLSCAASGFIFSNHYMSWVRQAPGKGLEWVSEIRSKSINSATYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARNYYGSTYDYWGQGTTVTVSS
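
    If the individual chains are needed again downstream, the colon-joined records can be split back into light and heavy chains. The sketch below is a minimal reader with a hypothetical function name; it assumes the light chain comes first, as in the example above, and the short sequences are placeholders.

```python
def read_paired_fasta(text):
    """Return {name: (light, heavy)} from records whose sequence is 'LIGHT:HEAVY'."""
    raw = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            raw[name] = ""
        elif name is not None:
            raw[name] += line  # tolerate sequences wrapped over several lines
    pairs = {}
    for key, sequence in raw.items():
        light, heavy = sequence.split(":")
        pairs[key] = (light, heavy)
    return pairs


# Shortened placeholder sequences, not real variants.
demo = ">L1H1\nEIVLTQ:EVQLVE\n>L1H2\nEIVLTQ:EVQLVQ\n"
for variant, (light, heavy) in read_paired_fasta(demo).items():
    print(variant, light, heavy)
```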
    
  • Name: Humanization Report (v2.4)
    Description: Humanization Report is the antibody humanization design reporting module; it generates the final humanization design report and the corresponding patent example paragraphs. Compared with v2.3, RMSD and energy information are added.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-12-23 00:00:00
    Reference:

    Humanization Report v2.4

    Introduction

    Humanization Report v2.4 is the antibody humanization design reporting module. It generates the final antibody humanization design report and the corresponding patent example paragraphs. Compared with v2.3, RMSD and energy information are added.

    Parameter Description

    Graft Policy

    The Graft Policy file generated by the Grafting module.

    Mutate Policy

    The Policy file generated by the Back Mutation Grouping module.

    Antibody Type

    Antibody type, Antibody or Nanobody

    Germline Score File

    Graft germline score file in JSON format generated by the Grafting module

    Mutation Score File

    Mutation score file in CSV format generated by the Mutation module

    Antibody RMSD File

    Antibody structure RMSD file in CSV format generated by the Antibody RMSD module

    Antibody RMSD Top

    Select the top N antibodies with the smallest RMSD values from the RMSD ranking
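    The selection described above amounts to sorting by RMSD and keeping the N smallest values. The following is a minimal sketch under stated assumptions: the column names `name` and `rmsd` and the helper `top_n_by_rmsd` are hypothetical, since the actual layout of the Antibody RMSD CSV file is not documented here.

```python
import csv
import heapq
import io


def top_n_by_rmsd(csv_text, n, name_col="name", rmsd_col="rmsd"):
    """Return the n rows with the smallest RMSD as (name, rmsd) tuples.

    The column names are assumptions made for this illustration only.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = heapq.nsmallest(n, rows, key=lambda row: float(row[rmsd_col]))
    return [(row[name_col], float(row[rmsd_col])) for row in best]


# Made-up example data, not real module output.
example = "name,rmsd\nH1L1,1.25\nH2L1,0.80\nH3L2,2.10\n"
print(top_n_by_rmsd(example, 2))
```

    Setting Antibody RMSD Top to 2 would correspond to `n=2` in this sketch.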

    Folding Stability File

    Protein folding stability file in CSV format predicted by the Absolute Folding Stability module

    Result Description

    The output results include:

    Output File Name Description
    BM.pptx Summary file of back mutation sites
    batch_registration_template.xlsx Batch registration template file
    hotspot_summary.xlsx Summary of hotspot sites
    patent_example_template.docx Patent example paragraphs containing the humanized design sequences (Chinese version)
    patent_example_en_template.docx Patent example paragraphs containing the humanized design sequences (English version)
    back_mutation_grouping.md Grouping for back mutations
    humanized_variants.fasta Antibody humanization design sequence file in FASTA format
    Report.docx Antibody humanization design report, including information on sequences, grouping, etc., involved in the entire humanization design process
    candidate_score.xlsx Summary of structure and energy scores for the humanized antibody sequences

    The batch_registration_template.xlsx file contains the following information:

    Field Name Description
    Protein Sequence Protein sequence
    Molecule Name Molecule name

    The hotspot_summary.xlsx file contains the following information:

    Field Name Description
    ID Antibody sequence name
    Sequence-CDR CDR sequence region
    Deamidation Deamidation site
    Isomerization Isomerization site
    Cleavage Cleavage site
    Hydrolysis Hydrolysis site
    Glycosylation Glycosylation site
    Cys Number of cysteines
    Oxidation Oxidation site
    High risk High-risk rate
    High risk sites High-risk sites
  • Name: Structure Prediction (HelixFold3)
    Description: 百度螺旋桨PaddleHelix团队研发的HelixFold3,在常规的小分子配体、核酸分子(包括DNA和RNA)以及蛋白质的结构预测精度上已与AlphaFold3相媲美。 HelixFold3, developed by the Baidu PaddleHelix team, is comparable to AlphaFold3 in the accuracy of structure prediction of conventional small molecule ligands, nucleic acid molecules (including DNA and RNA), and proteins.
    Tags: undefined
    Author: Baidu
    Release: 2024-12-13 15:51:05
    Reference: Liu, L., Zhang, S., Xue, Y., Ye, X., Ye, X., Zhu, K., Li, Y., Li, Y., Zhao, W., Yu, H., Wu, Z., Zhang, X., & Fang, X. (2024). Technical Report of HelixFold3 for Biomolecular Structure Prediction. arXiv:2408.16975

    Structure Prediction (HelixFold3)

    简介

    百度螺旋桨PaddleHelix团队研发的HelixFold3,在常规的小分子配体、核酸分子(包括DNA和RNA)以及蛋白质的结构预测精度上已与AlphaFold3相媲美。为了评估其在蛋白质-配体结构预测中的效果,HelixFold3与其他主流方法在PoseBusters数据集上的表现进行了对比。HelixFold3即便在没有指定蛋白质结构的情况下,仍然展示出卓越的表现,成功率甚至超过了依赖已知蛋白质结构的方法,其预测精度与目前顶尖的AlphaFold3相当,这表明HelixFold3在蛋白质-配体相互作用预测领域的出色潜力。HelixFold3在蛋白质-蛋白质复合体结构预测方面已经略微超越了AlphaFold-Multimer的表现,展示出更强的预测能力。
    image.png
    image.png

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式,支持多条序列。
    注意:多蛋白复合物结构预测,其氨基酸序列输入格式如下:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    DNA核酸的序列文件,FASTA格式,支持多条序列。

    RNA Sequence

    RNA核酸分子的序列文件,FASTA格式,支持多条序列。

    备注:

    序列总长度不可超过2000。

    Ligand

    文本文件包含小分子信息,TXT格式。HF3支持绝大多数重核数量不超过50的配体,支持SMILES或 CCD Code(化学组分词典编号)。如果使用SMILES格式,每行应包含一个小分子;如果使用CCD Code,每行可以包含一个或多个小分子,使用逗号分隔,并加上CCD前缀,示例如下:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    注意:不适用于配体蛋白或多肽的氨基酸序列格式输入。

    备注:

    水、助剂和少量特殊的配体目前是模型所不支持的。模型会将这些配体从CCD列表中除去,如果您通过输入SMILES的方式进行了这些输入,可能会造成结果的表现下降。具体不支持的配体的CCD列表参见HF3 FAQ https://paddlehelix.baidu.com/app/tut/guide/all/helixfold3faq 。每种配体数量不能超过50。

    Modification

    包含翻译后修饰(PTM)信息的文本文件,TXT格式。每行放置一个PTM信息,每个PTM信息由三部分组成:

    • 发生PTM序列的顺序编号
    • PTM类型的CCD编号
    • 发生PTM的残基位置编号
      三部分由逗号分隔,例如:1,HY3,1 表示第一条序列的第一个残基,发生了类型为HY3(CCD编号,为3-羟基脯氨酸,为脯氨酸的羟基化)的PTM。不同的氨基酸支持的CCD不同,具体参考https://paddlehelix.baidu.com/app/tut/guide/all/helixfold3json 里的说明。

    备注:

    • 序列的顺序编号,是依次按上述参数Protein、DNA、RNA中的序列顺序与数量,从1开始进行编号,例如:当有2条蛋白序列,1条DNA序列,1条RNA序列时,各序列对应的编号为:第一条蛋白序列编号为1,第二条蛋白序列编号为2,DNA序列编号为3,RNA序列编号为4
    • CCD的介绍参考https://www.wwpdb.org/data/ccd 编号查询网址为https://www.ebi.ac.uk/pdbe-srv/pdbechem/
      包含多个PTM信息的文件内容示例如下:
    1,HY3,1
    1,P1L,5
    2,HY3,3
    

    ion

    目前支持的离子为:MG, ZN, CL, CA, NA, MN, MN3, K, FE, FE2, CU, CU1, CU3, CO
    不支持分行书写,需要写在一行,可以包含一个或多个离子,不同的离子使用逗号分隔,冒号后对应的是离子的数量,每种离子数量不能超过50。示例如下:

    MG:2,ZN,CU:3
    

    结果说明

    输出结果文件为排名前5的复合物结构rank_1-5.cif和ranking_scores.csv,csv中包含信息如下:

    字段名称 说明
    Name 复合物结构名称
    Ranking_Score 对预测结构的质量排序的指标分数,数值越大表示预测结构的质量越高。此分数综合了 ptm、iptm 和 has_clash,计算公式为: 0.8 × ipTM + 0.2 × pTM - 1 × has_clash。注意:结构为单体时,因为ipTM为0,整体的综合得分偏低,可参考pTM即可。

    参考文献

    Liu, L., Zhang, S., Xue, Y., Ye, X., Ye, X., Zhu, K., Li, Y., Li, Y., Zhao, W., Yu, H., Wu, Z., Zhang, X., & Fang, X. (2024). Technical Report of HelixFold3 for Biomolecular Structure Prediction. arXiv:2408.16975

    Structure Prediction (HelixFold3)

    Introduction

    HelixFold3, developed by the Baidu PaddleHelix team, is comparable to AlphaFold3 in structure prediction accuracy for conventional small-molecule ligands, nucleic acids (DNA and RNA), and proteins. To assess its performance in protein-ligand structure prediction, it was compared with other leading methods on the PoseBusters dataset. Even without a specified protein structure, HelixFold3 achieved a higher success rate than methods that rely on known protein structures, and its accuracy is on par with AlphaFold3. In protein-protein complex structure prediction, HelixFold3 slightly outperforms AlphaFold-Multimer.
    image.png
    image.png

    Parameter

    Protein Sequence

    The sequence file of proteins in FASTA format, supporting multiple sequences.
    Note: For multi-protein complex structure prediction, the amino acid sequence input format is as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    The sequence file of DNA nucleic acids in FASTA format, supporting multiple sequences.

    RNA Sequence

    The sequence file of RNA nucleic acid molecules in FASTA format, supporting multiple sequences.

    Note:

    The total sequence length must not exceed 2000.
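    The 2000 limit can be checked before a job is submitted. The following is a minimal sketch, assuming the limit applies to the summed length of all protein, DNA, and RNA sequences; `read_fasta` and `total_length` are hypothetical helper names, not part of the module.

```python
MAX_TOTAL_LENGTH = 2000  # limit stated in the note above


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def total_length(*fasta_texts):
    """Sum the lengths of every sequence across the given FASTA inputs."""
    return sum(len(seq) for text in fasta_texts for _, seq in read_fasta(text))


# Toy inputs for illustration only.
protein = ">protein1\nAAS\nGG\n>peptide\nASDF\n"
dna = ">dna1\nGACCTCT\n"
n = total_length(protein, dna)
print(n, "within limit" if n <= MAX_TOTAL_LENGTH else "too long")
```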

    Ligand

    A text file containing small molecule information in TXT format. HelixFold3 (HF3) supports most ligands with no more than 50 heavy atoms. Ligands can be given as SMILES or as CCD codes (Chemical Component Dictionary identifiers). With SMILES, each line must contain exactly one small molecule; with CCD codes, each line can contain one or more small molecules, separated by commas and prefixed with CCD. An example is as follows:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Note: This parameter does not accept protein or peptide ligands given as amino acid sequences.

    Note:

    Water, adjuvants, and a few special ligands are currently not supported by the model, which removes them from the CCD list. If you supply them as SMILES instead, prediction quality may degrade. For the list of unsupported CCD codes, see the HF3 FAQ https://paddlehelix.baidu.com/app/tut/guide/all/helixfold3faq . No ligand may have more than 50 copies.
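    The two line formats of the ligand file and the per-ligand limit can be checked before submission. The following is a minimal sketch, not the module's own parser; `parse_ligand_file` and `over_limit` are hypothetical helper names, and the sketch assumes that copies of a ligand are counted by how often the same identifier appears in the file.

```python
from collections import Counter

MAX_COPIES = 50  # per-ligand limit stated in the note above


def parse_ligand_file(text):
    """Return a list of ("smiles" | "ccd", identifier) tuples."""
    ligands = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if fields[0] == "CCD":
            # A CCD line lists one or more codes after the "CCD" prefix.
            ligands.extend(("ccd", code) for code in fields[1:] if code)
        else:
            # Any other line is treated as a single SMILES string.
            ligands.append(("smiles", line))
    return ligands


def over_limit(ligands, limit=MAX_COPIES):
    """Return the ligands that appear more than `limit` times (assumed counting rule)."""
    counts = Counter(ligands)
    return {ligand: n for ligand, n in counts.items() if n > limit}


ligand_text = "CC(=O)OC1C[NH+]2CCC1CC2\nCCD,ATP,HY3,P1L\nCCD,MG\n"
parsed = parse_ligand_file(ligand_text)
print(parsed)
print(over_limit(parsed))
```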

    Modification

    A text file containing post-translational modification (PTM) information in TXT format. Each line contains one PTM information entry, consisting of three parts:

    • Sequence order number where the PTM occurs
    • CCD number of the PTM type
    • Residue position number where the PTM occurs

    The three parts are separated by commas. For example, 1,HY3,1 indicates that the first residue of the first sequence carries a PTM of type HY3 (the CCD code for 3-hydroxyproline, i.e., hydroxylated proline). Different amino acids support different CCD codes; see the tutorial at https://paddlehelix.baidu.com/app/tut/guide/all/helixfold3json .

    Note:

    • The sequence order number is numbered sequentially according to the order and number of sequences in the parameters Protein, DNA, and RNA, starting from 1. For example, if there are 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the sequence numbers are: the first protein sequence is numbered 1, the second protein sequence is numbered 2, the DNA sequence is numbered 3, and the RNA sequence is numbered 4.
    • For CCD introduction, refer to https://www.wwpdb.org/data/ccd . The CCD number lookup website is https://www.ebi.ac.uk/pdbe-srv/pdbechem/ .

    An example of a file containing multiple PTM information entries is as follows:

    1,HY3,1
    1,P1L,5
    2,HY3,3
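
    The numbering rule and the three-field line format described above can be sketched as a small parser. This is an illustration only; `number_sequences` and `parse_ptm_file` are hypothetical helper names, not part of the module.

```python
def number_sequences(n_protein, n_dna, n_rna):
    """Map 1-based sequence order numbers to (kind, index within kind).

    Sequences are numbered in the order Protein, DNA, RNA, starting from 1.
    """
    numbering = {}
    order = 1
    for kind, count in (("protein", n_protein), ("dna", n_dna), ("rna", n_rna)):
        for index in range(1, count + 1):
            numbering[order] = (kind, index)
            order += 1
    return numbering


def parse_ptm_file(text):
    """Parse lines of the form 'sequence_number,CCD_code,residue_position'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        seq_no, ccd_code, position = (field.strip() for field in line.split(","))
        entries.append((int(seq_no), ccd_code, int(position)))
    return entries


numbering = number_sequences(n_protein=2, n_dna=1, n_rna=1)
ptms = parse_ptm_file("1,HY3,1\n1,P1L,5\n2,HY3,3\n")
for seq_no, ccd_code, position in ptms:
    kind, index = numbering[seq_no]
    print(f"{ccd_code} at residue {position} of {kind} sequence {index}")
```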
    

    ion

    Currently supported ions include: MG, ZN, CL, CA, NA, MN, MN3, K, FE, FE2, CU, CU1, CU3, CO

    All ions must be written on a single line; multi-line input is not supported. The line can contain one or more ions separated by commas. The number of copies of an ion is given after a colon, and no ion may have more than 50 copies. An example is as follows:

    MG:2,ZN,CU:3
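
    The single-line ion format can be validated with a few lines of code. The following is a minimal sketch; `parse_ions` is a hypothetical helper, and it assumes that an ion written without a count (such as ZN in the example) means one copy, which the text above does not state explicitly.

```python
SUPPORTED_IONS = {
    "MG", "ZN", "CL", "CA", "NA", "MN", "MN3",
    "K", "FE", "FE2", "CU", "CU1", "CU3", "CO",
}
MAX_PER_ION = 50  # per-ion limit stated above


def parse_ions(line):
    """Parse a line such as 'MG:2,ZN,CU:3' into a {ion: count} dict."""
    ions = {}
    for item in line.strip().split(","):
        name, _, count = item.strip().partition(":")
        if name not in SUPPORTED_IONS:
            raise ValueError(f"unsupported ion: {name!r}")
        # Assumption: a missing ":count" means a single copy.
        ions[name] = ions.get(name, 0) + (int(count) if count else 1)
    for name, n in ions.items():
        if not 1 <= n <= MAX_PER_ION:
            raise ValueError(f"{name}: count {n} is outside 1..{MAX_PER_ION}")
    return ions


print(parse_ions("MG:2,ZN,CU:3"))
```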
    

    Result

    The output result files are the top 5 ranked complex structures rank_1-5.cif and ranking_scores.csv, with the following information in the CSV:

    Field Name Description
    Name Name of the complex structure
    Ranking_Score A score used to rank the quality of the predicted structures; higher values indicate better quality. It combines three metrics, pTM, ipTM, and has_clash: Ranking_Score = 0.8 × ipTM + 0.2 × pTM - 1 × has_clash. Note: For a monomeric structure, ipTM is 0, so the Ranking_Score is relatively low; in such cases, refer to pTM instead.
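
    The formula above is easy to reproduce when comparing predictions by hand. The following is a minimal sketch with made-up input values; `ranking_score` is a hypothetical helper name, and `has_clash` is treated as 0 or 1.

```python
def ranking_score(iptm, ptm, has_clash):
    """Ranking_Score = 0.8 * ipTM + 0.2 * pTM - 1 * has_clash."""
    return 0.8 * iptm + 0.2 * ptm - 1.0 * float(has_clash)


# Complex without clashes (illustrative numbers only).
complex_score = ranking_score(iptm=0.75, ptm=0.80, has_clash=0)

# Monomer: ipTM is 0, so the score is capped at 0.2 * pTM.
monomer_score = ranking_score(iptm=0.0, ptm=0.90, has_clash=0)

print(round(complex_score, 3), round(monomer_score, 3))
```

    The monomer case shows why the score looks low for single chains: only the 0.2 × pTM term contributes.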

    Reference

    Liu, L., Zhang, S., Xue, Y., Ye, X., Ye, X., Zhu, K., Li, Y., Li, Y., Zhao, W., Yu, H., Wu, Z., Zhang, X., & Fang, X. (2024). Technical Report of HelixFold3 for Biomolecular Structure Prediction. arXiv:2408.16975

  • Name: Patent BLAST
    Description: A module for searching patent sequences with either the full-length antibody sequence or its CDR regions. Existing BLAST programs (such as NCBI BLAST) usually search patents with the whole variable-region sequence. For antibodies, however, function depends mainly on the CDRs, while the framework regions (FRs) matter less. Because FRs are generic, many different antibodies share identical or highly homologous FRs, and FRs make up most of the sequence, so a variable-region BLAST returns many hits with similar FRs but dissimilar CDRs. In addition, patent applications often claim the CDRs separately from the complete variable-region sequence to obtain a broader scope of protection, so homology searches that target the CDRs are necessary during antibody development. For this purpose, the WeMol team developed this program, which retrieves the sequences closest to the target CDRs from the existing patent library. Data updated: Dec 2024
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-04-06 00:00:00
    Reference:

    Patent BLAST

    简介

    针对抗体全长或者CDR区进行序列检索的模块。从专利中检索一条抗体可变区时,现有的BLAST程序(例如NCBI BLAST)通常是以全序列进行检索,但是对于抗体而言,功能主要取决于CDR,FR相对不重要,并且由于FR的通用性,许多不同抗体的FR是相同或高度同源的,而FR占序列的比重更高,就导致以抗体的可变区BLAST会得到很多FR相似但CDR不相似的序列。并且,专利申请时,除了保护可变区完整序列,很多情况也会对抗体CDR进行单独保护,以获得更大的保护范围,因此在抗体开发过程中,以CDR为目标进行同源序列检索就很有必要了。为此,唯信团队开发了该程序,可以从现有专利库中检索到与目标CDR最接近的序列。数据更新于:Dec 2024

    • 序列来源:NCBI专利序列库
    • 来自美国专利局USPTO提交的美国专利序列和通过INSDC合作包括的欧洲和日本专利序列
    • 包含已授权专利中的权利要求和实施例中的全部序列
    • 原始数据链接:https://ftp.ncbi.nlm.nih.gov/blast/db/pataa.tar.gz
    • WeMol数据更新于:Dec 2024
    • 数量:>700万个蛋白序列,其中14万条抗体CDR序列
    • 检索原理:提取专利序列数据库中的抗体序列,使用Kabat规则识别CDR区,并将CDR1/2/3拼接成新的CDR序列,与目标抗体拼接后的CDRs进行比对,输出同源性最高的数条。

    例如,输入序列L的完整序列,进行检索后,返回检索到同源性较高的序列的CDR,如下图所示。
    image.png

    如果需要查看某个检索到的序列的出处,可以根据检索的CDR的序列编号,从任务输出的log文件中找到对应的专利名,
    例如序列ATJ10081.1来自于US专利9493553(SEQ ID为39),并且US专利9670274、9890209等多个专利中也出现了该CDR片段,他们的比对情况包括同源性也展示在后面,如下图所示。

    image.png

    根据唯信团队经验,通常CDR的保护范围精确到具体序列,即差异一个以上氨基酸,即视为不在专利的保护范围之内,但不排除存在等同侵权的风险,仅供参考。

    参数说明

    Antibody Sequence File

    抗体序列文件, FASTA格式

    Type

    指定序列比对数据库类型:抗体全长(full)或者抗体CDR区域 (cdr)。
    CDR区域数据库为专利保护抗体数据库。

    结果说明

    输出结果包括:

    输出文件名称 说明
    align.fst 序列比对结果文件
    blast.log 序列比对日志文件

    Patent BLAST

    Introduction

    A module for searching patent sequences with either the full-length antibody sequence or its CDR regions. Existing BLAST programs (such as NCBI BLAST) usually search patents with the whole variable-region sequence. For antibodies, however, function depends mainly on the CDRs, while the framework regions (FRs) matter less. Because FRs are generic, many different antibodies share identical or highly homologous FRs, and FRs make up most of the sequence, so a variable-region BLAST returns many hits with similar FRs but dissimilar CDRs. In addition, patent applications often claim the CDRs separately from the complete variable-region sequence to obtain a broader scope of protection, so homology searches that target the CDRs are necessary during antibody development. For this purpose, the WeMol team developed this program, which retrieves the sequences closest to the target CDRs from the existing patent library. Data updated: Dec 2024

    • Sequence Source: NCBI patent sequence database
    • Includes US patent sequences submitted to the USPTO and European and Japanese patent sequences included through collaboration with INSDC
    • Contains all sequences in the claims and examples of granted patents
    • Original data link: https://ftp.ncbi.nlm.nih.gov/blast/db/pataa.tar.gz
    • WeMol data updated: Dec 2024
    • Quantity: >7 million protein sequences, including 140,000 antibody CDR sequences
    • Search Principle: Extract antibody sequences from the patent sequence database, identify CDR regions using Kabat rules, concatenate CDR1/2/3 into a new CDR sequence, compare it with the concatenated CDRs of the target antibody, and output the top matching sequences based on homology.
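
    The concatenate-and-compare step of the search principle can be illustrated as follows. This is a sketch under stated assumptions, not the module's implementation: CDRs are taken as already identified (Kabat annotation is not reproduced here), Python's `difflib.SequenceMatcher` ratio stands in for the module's unspecified homology score, and all sequences and names (`concat_cdrs`, `rank_hits`, `hit_A`, and so on) are made-up placeholders.

```python
from difflib import SequenceMatcher


def concat_cdrs(cdr1, cdr2, cdr3):
    """Join CDR1/2/3 into one string, as described in the search principle."""
    return cdr1 + cdr2 + cdr3


def rank_hits(query_cdrs, database, top_n=3):
    """Rank database entries by similarity of their concatenated CDRs.

    SequenceMatcher.ratio() is a generic string similarity used here only as
    a stand-in for a real alignment-based homology score.
    """
    query = concat_cdrs(*query_cdrs)
    scored = []
    for name, cdrs in database.items():
        matcher = SequenceMatcher(None, query, concat_cdrs(*cdrs), autojunk=False)
        scored.append((name, matcher.ratio()))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Placeholder CDRs for illustration; not taken from any patent.
query = ("RASQSISSYLN", "AASSLQS", "QQSYSTPLT")
database = {
    "hit_A": ("RASQSISSYLN", "AASSLQS", "QQSYSTPLT"),
    "hit_B": ("RASQGISSYLA", "AASTLQS", "QQSYSTPWT"),
    "hit_C": ("KSSQSLLYS", "WASTRES", "QQYYSYPYT"),
}
top_hits = rank_hits(query, database, top_n=2)
print(top_hits)
```

    Comparing only the concatenated CDRs keeps the shared framework regions from dominating the similarity score, which is the motivation given above.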

    For example, when the complete sequence of sequence L is submitted, the search returns the CDRs of the most homologous sequences, as shown in the image below.
    image.png

    To check the source of a retrieved sequence, look up the sequence number of the retrieved CDR in the task's log file to find the corresponding patent. For example, sequence ATJ10081.1 comes from US Patent 9493553 (SEQ ID 39), and the same CDR fragment also appears in several other patents, such as US Patents 9670274 and 9890209; their alignments and homology values are listed as well, as shown in the image below.

    image.png

    Based on the experience of the WeMol team, the protection range of CDRs is usually specified down to the specific sequence, meaning that a difference of one or more amino acids is considered outside the scope of patent protection. However, there may still be risks of equivalent infringement, so this information is for reference only.

    Parameter Description

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Type

    Specifies the sequence alignment database type: antibody full-length (full) or antibody CDR region (cdr).
    The CDR database is a database of patent-protected antibodies.

    Result Description

    The output includes:

    Output File Name Description
    align.fst Sequence alignment result file
    blast.log Sequence alignment log file
  • Name: CIF2PDB
    Description: 将mmCIF文件转换成PDB文件。 Convert mmCIF files into PDB files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-12-13 15:13:35
    Reference:

    CIF2PDB

    简介

    CIF2PDB模块是基于BioPython将mmCIF文件转换成PDB文件。
    单独化合物CIF转换部分存在问题。

    参数说明

    CIF File

    mmCIF文件的结构。

    Output PDB

    输出PDB文件名称

    结果说明

    输出PDB文件,默认为output.pdb。

    CIF2PDB

    Introduction

    The CIF2PDB module uses BioPython to convert mmCIF files into PDB files.
    Conversion of CIF files that contain only a standalone compound is known to have problems.

    Parameter

    CIF File

    Structure file in mmCIF format.

    Output PDB

    Name of the output PDB file.

    Result

    The output PDB file; the default name is output.pdb.
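
    For readers who want to run the same kind of conversion locally, the following is a minimal sketch using BioPython's `MMCIFParser` and `PDBIO` classes. It is not the module's own code; the function name `cif_to_pdb` and the structure identifier are hypothetical, and BioPython must be installed for the call to work.

```python
def cif_to_pdb(cif_path, pdb_path="output.pdb"):
    """Convert an mmCIF structure file to PDB format with BioPython."""
    # Imported inside the function so this sketch can be loaded even where
    # BioPython is not installed; calling the function still requires it.
    from Bio.PDB import MMCIFParser, PDBIO

    parser = MMCIFParser(QUIET=True)  # QUIET suppresses parser warnings
    structure = parser.get_structure("structure", cif_path)

    writer = PDBIO()
    writer.set_structure(structure)
    writer.save(pdb_path)
    return pdb_path


# Example call (requires an existing mmCIF file):
# cif_to_pdb("rank_1.cif", "rank_1.pdb")
```

    PDB is a more restrictive format than mmCIF, so structures that exceed its limits (for example, chain identifiers longer than one character) may fail to convert.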

  • Name: Target-Cyclic Peptide Complex Structure Prediction
    Description: The Target-Cyclic Peptide Complex Structure Prediction module is based on AlphaFold2 and predicts the 3D structure of a head-to-tail cyclic peptide in complex with a target protein.
    Tags: undefined
    Author: Qiuzhen Li
    Release: 2024-11-19 15:19:22
    Reference: Li Q, Vlachos E.N., Bryant P. Design of linear and cyclic peptide binders of different lengths only from a protein target sequence. bioRxiv. 2024. p. 2024.06.20.599739. doi:10.1101/2024.06.20.599739

    Target-Cyclic Peptide Complex Structure Prediction

    简介

    Target-Cyclic Peptide Complex Structure Prediction模块基于Alphafold2,用于预测首尾相接(Head-to-tail)环肽和靶点蛋白的复合物三维结构。
    靶点-环肽复合物预测示例,展示了首尾相接酰胺键:
    image.png

    参数说明

    Target Sequence File

    靶标蛋白的序列文件,只支持输入一条链,不支持多条链,FASTA格式

    Cyclic Peptide

    环肽的序列,如:“ARDCPLVNPL”

    结果说明

    输出结果包括:

    输出文件名称 说明
    rank_1-5.pdb 设计的复合物结构文件,共5个
    score.csv 复合物结构名称及打分文件

    其中score.csv包括信息如下:

    字段名称 说明
    Name 复合物结构名称
    pLDDT 局部结构的可信度指标,值范围是0-100,该值越大说明预测的结构越可靠。低于70被认为可靠性较低,低于50基本认为是可信度非常低,为无序预测
    pTM The predicted template modeling score预测的TM分数,衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM The interface predicted template modeling score预测的亚基接触面的TM分数,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定

    参考文献

    Li Q, Vlachos E.N., Bryant P. Design of linear and cyclic peptide binders of different lengths only from a protein target sequence. bioRxiv. 2024. p. 2024.06.20.599739.

    Target-Cyclic Peptide Complex Structure Prediction

    Introduction

    The Target-Cyclic Peptide Complex Structure Prediction module is based on AlphaFold2 and is used to predict the three-dimensional structure of head-to-tail cyclic peptides in complex with target proteins. An example of a target-cyclic peptide complex prediction is shown below, demonstrating the head-to-tail amide bond:
    image.png

    Parameter

    Target Sequence File

    The sequence file of the target protein in FASTA format. Only a single chain is supported; multiple chains are not.

    Cyclic Peptide

    The sequence of the cyclic peptide, e.g., “ARDCPLVNPL”.

    Hotspot

    Residues on the target sequence that are binding sites for the cyclic peptide, numbered starting from 1. Specifying these sites can improve the accuracy and success rate of the design. Multiple sites are separated by commas, e.g., “23,45,67”.

    Result

    The output results include:

    Output File Name Description
    rank_1-5.pdb Structure files of the designed complexes, a total of 5.
    score.csv File containing the names and scores of the complex structures.

    The score.csv includes the following information:

    Field Name Description
    Name Name of the complex structure
    pLDDT Local structure reliability indicator, with values ranging from 0 to 100. A higher value indicates a more reliable predicted structure. Values below 70 are considered less reliable, and those below 50 are generally regarded as having very low reliability, indicating disordered predictions.
    pTM The predicted template modeling score, measuring the overall accuracy of the predicted structure. A higher score indicates greater accuracy, and a score above 0.5 suggests that the overall fold of the structure may resemble the true structure.
    ipTM The interface predicted template modeling score, measuring the predicted accuracy of the relative positions of the subunits in the complex. A higher score indicates greater accuracy, with scores above 0.8 indicating high-quality predictions, below 0.6 suggesting potential failure in predictions, and scores between 0.6 and 0.8 being in a gray area where the correctness of the prediction is uncertain.
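
    The thresholds in the table can be turned into a quick triage of score.csv rows. The following is a minimal sketch; the function names and the label strings are hypothetical, and only the cut-off values come from the table above.

```python
def describe_plddt(plddt):
    """Classify pLDDT (0-100) with the 50 and 70 cut-offs from the table."""
    if plddt < 50:
        return "very low"
    if plddt < 70:
        return "low"
    return "higher confidence"


def describe_ptm(ptm):
    """pTM above 0.5 suggests the overall fold may resemble the true structure."""
    return "fold may be correct" if ptm > 0.5 else "fold uncertain"


def describe_iptm(iptm):
    """Classify ipTM with the 0.6 and 0.8 cut-offs from the table."""
    if iptm > 0.8:
        return "high quality"
    if iptm < 0.6:
        return "possible failure"
    return "uncertain"


# Illustrative values only.
print(describe_plddt(85), describe_ptm(0.62), describe_iptm(0.71))
```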

    References

    Li Q, Vlachos E.N., Bryant P. Design of linear and cyclic peptide binders of different lengths only from a protein target sequence. bioRxiv. 2024. p. 2024.06.20.599739.

  • Name: Structure Prediction (Boltz-1)
    Description: 基于MIT的Boltz-1算法的AF3 like结构预测模型,融合了模型架构、速度优化和数据处理方面的创新。在预测生物分子复合物的3D结构方面,它达到了AlphaFold3级别的准确度。Boltz-1在一系列基准测试中表现出与最先进的商业模型相当的性能,为结构生物学中可商业化使用的工具树立了新的标杆。 An AF3-like structure prediction model based on the Boltz-1 algorithm from MIT. It integrates innovations in model architecture, speed optimization, and data processing. It achieves AlphaFold3-level accuracy in predicting the 3D structures of biomolecular complexes. Boltz-1 demonstrates performance comparable to state-of-the-art commercial models across a range of benchmarks, setting a new standard for commercially usable tools in structural biology.
    Tags: undefined
    Author: Jeremy Wohlwend
    Release: 2024-11-20 09:34:01
    Reference: Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, Regina Barzilay. Boltz-1: Democratizing Biomolecular Interaction Modeling. bioRxiv 2024.11.19.624167

    Structure Prediction (Boltz-1)

    简介

    基于MIT(麻省理工学院)的Boltz-1算法的AF3 like结构预测模型。Boltz-1是一种开源深度学习模型,融合了模型架构、速度优化和数据处理方面的创新,在预测生物分子复合物的 3D结构方面达到了 AlphaFold3 级的准确度。Boltz-1 在一系列不同的基准测试中表现出与最先进的商业模型相当的性能,为结构生物学中可商业化使用的工具树立了新的标杆。
    image.png

    参数说明

    Single Mode

    Protein Sequence

    蛋白的序列文件,FASTA格式,支持多条序列。
    注意:多蛋白复合物结构预测,其氨基酸序列输入格式如下:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    DNA核酸的序列文件,FASTA格式,支持多条序列。

    RNA Sequence

    RNA核酸分子的序列文件,FASTA格式,支持多条序列。

    备注:当前24GB的GPU显存支持计算的残基/碱基数量在1000个左右。

    Ligand

    文本文件包含小分子信息,TXT格式。支持SMILES或 CCD Code(化学组分词典编号)。如果使用SMILES格式,每行应包含一个小分子;如果使用CCD Code,每行可以包含一个或多个小分子,使用逗号分隔,并加上CCD前缀。示例如下:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    注意:不适用于配体蛋白或多肽的氨基酸序列格式输入。

    Modification

    包含翻译后修饰(PTM)信息的文本文件,TXT格式。每行放置一个PTM信息,每个PTM信息由三部分组成:

    • 发生PTM序列的顺序编号
    • PTM类型的CCD编号
    • 发生PTM的残基位置编号
      三部分由逗号分隔,例如:1,HY3,1 表示第一条序列的第一个残基,发生了类型为HY3(CCD编号,为3-羟基脯氨酸,为脯氨酸的羟基化)的PTM
      备注:
    • 序列的顺序编号,是依次按上述参数Protein、DNA、RNA中的序列顺序与数量,从1开始进行编号,例如:当有2条蛋白序列,1条DNA序列,1条RNA序列时,各序列对应的编号为:第一条蛋白序列编号为1,第二条蛋白序列编号为2,DNA序列编号为3,RNA序列编号为4
    • CCD的介绍参考https://www.wwpdb.org/data/ccd 编号查询网址为https://www.ebi.ac.uk/pdbe-srv/pdbechem/
      包含多个PTM信息的文件内容示例如下:
    1,HY3,1
    1,P1L,5
    2,HY3,3
    

    Covalent Bond

    共价键信息的文本文件,TXT格式。每行放置一个共价键信息,每个共价键信息包含两个原子信息,每个原子信息由三部分组成:

    • 原子所在序列或小分子的顺序编号(编号规则在Modification中定义的序列编号规则基础上,在最后加入小分子的顺序即可)
    • 原子所在残基的位置编号(如残基为小分子时,编号为1)
    • 原子的标准名称(CCD中定义)
      三部分由逗号分隔,例如:3,1,CA表示第三个实体(序列或小分子)中的第一个残基(或小分子)的CA原子
      一个共价键是由两个原子信息组成,原子间用分号分隔,如:1,1,CA;2,1,CA
      表示一个共价键,该共价键由两个原子组成,第一个原子为1,1,CA,第二个原子为2,1,CA
      包含多个共价键信息的文件内容示例如下:
    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
    

    Pocket

    结合位点类型限制信息的文本文件,TXT格式。每行放置一个结合位点信息,每个结合位点信息由两部分组成:

    • Binder的顺序编号(与共价键定义中的序列或小分子的顺序编号一致),Binder可以是小分子,蛋白/核酸序列的任意一种,目前一个结合位点只支持定义一条Binder(即一个编号)
    • 结合位点的残基信息,每个残基信息由其所在序列编号与残基位置编号组成,逗号分隔,如:1,25 表示第一条序列中的第25个残基;可以定义多个残基信息,由英文分号“;”进行分隔,如1,25;1,27;1,32;1,38表示第一条序列中的第25/27/32/38号残基形成结合位点
      上述两部信息之间也用英文分号“;”进行分隔,例如:2;1,55;1,62;1,91;1,92;1,99;1,110表示第二个实体(序列或小分子)作为Binder,与第一条序列的第55/62/91/92/99/110号残基所形成的结合位点进行结合。
      包含多个结合位点信息的文件内容示例如下:
    2;1,55;1,62;1,91;1,92;1,99;1,110
    3;1,25;1,27;1,32;1,38
    

    Domain

    定义的残基区域信息。模块将输出区域中所有残基平均的pLDDT数值。一个残基区域由序列顺序编号与残基组合编号组成:

    • 序列顺序编号(同Modification参数中的定义),值为1时,可省略(即默认为1)
    • 残基组合编号,使用残基位置编号,多个残基用逗号分隔,指定残基范围用横杠符号。如:“3,10,24-30”表示目标序列上的第3、第10与第24至30号残基。
      例如:1:24,28,32-40 表示第一条序列中的第24/28/32至40号残基所组成的区域,因为是第一条序列,数值1可以省略,等同于24,28,32-40 ,该区域的所有残基的平均pLDDT值将输出到结果文件中。

    残基区域支持定义多个,每个残基区域之间用英文“;”分隔,例如:
    1:24,28,32-40;2:15,23,50-60表示定义了两个区域,区域一为第一条序列的第24/28/32至40号残基,区域二为第二条序列的第15/23/50至60号残基。两个区域各自的残基平均pLDDT值,将输出到结果文件中。

    Format

    输出结构的格式,支持PDB或CIF格式,默认为PDB格式。

    Batch Mode

    Protein Sequence

    蛋白的序列文件,FASTA格式,支持多条序列。
    每一条记录代表一个待预测的结构,每条记录的名称要唯一不能重复。一条记录中有多条链时,通过英文冒号(:)相连,文件内容示例如下:

    >1
    EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
    >2
    YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
    

    表示有两个待预测的结构,第一条记录的名称为1,有三条蛋白链,用:进行分隔。第二条记录的名称为2,为单链。

    DNA Sequence

    DNA核酸的序列文件,FASTA格式,支持多条序列。
    每一条序列记录代表一个待预测的结构,每条记录的名称要唯一不能重复(可与Protein参数中的记录名称一致,表示该记录的DNA序列与Protein序列归属于同一结构)。一条记录中有多条链时,通过英文冒号(:)相连,文件内容示例如下:

    >dna
    GACCTCT:CCTAGCT
    >1
    CCTAGCT
    

    表示有两条记录,第一条的名称为dna,有两条DNA链,用:进行分隔,因为该名称不存在与Protein示例记录中,属于新结构。第二条的名称为1,有一条DNA链,因为该名称存在于Protein示例记录中,则表示同属一个结构(该结构同时包含Protein序列和该DNA序列)。

    RNA Sequence

    RNA核酸分子的序列文件,FASTA格式,支持多条序列。
    每一条序列记录代表一个待预测的结构,每条记录的名称要唯一不能重复(可与DNA或Protein参数中的记录名称一致,表示该记录的RNA序列与上面的DNA序列或Protein序列归属于同一结构)。一条记录中有多条链时,通过英文冒号(:)相连,文件内容示例如下:

    >1
    AGCU
    >rna
    AGGCU:UGAUC
    

    表示有两条记录,第一条的名称为1,为单链,因为该名称存在于DNA及Protein示例记录中,表示同属一个结构(该结构同时包含了Protein序列、DNA序列及该RNA序列)。第二条的名称为rna,有两条RNA链,用:进行分隔,因为该名称不存在于DNA或Protein示例记录中,属于新结构。

    Ligand

    文本文件包含小分子信息,TXT格式。支持SMILES或 CCD Code(化学组分词典编号)。如果使用SMILES格式,每行应包含一个小分子;如果使用CCD Code,每行可以包含一个或多个小分子,使用逗号分隔。
    每行代表一个待预测的结构,每行可放置多个ligand,且以唯一不重复的名称开头(该名称可与上述RNA,DNA或Protein参数中的记录名称一致,表示该行的所有ligands,与上述的RNA或DNA或Protein序列归属于同一结构),名称与所有ligands都以英文冒号(:)分隔。文件内容示例如下:

    1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3
    lig:CC1CC(CN2NC[N+](=C3CCC(OC(F)(F)F)CC3)C2=O)CC(C)C1OC(C)(C)C(=O)[O-]
    

    表示有两条记录,第一条的名称为1,有三个ligand(一个SMILES,两个CCD codes),因为该名称存在于上述的RNA或DNA或Protein示例记录中,表示同属一个结构。第二条的名称为lig,有一个ligand(为SMILES),因为该名称不存在上述的RNA或DNA或Protein示例记录中,属于新结构。

    Modification

    包含翻译后修饰(PTM)信息的文本文件,TXT格式。每个PTM的信息与Single模式中一致(参考Single模式中的定义)。
    每行定义一个结构的所有PTM信息,且以唯一名称开头(该名称必须存在于前述的Protein或DNA或RNA记录中),都以英文冒号(:)分隔。文件内容示例如下:

    1:1,HY3,1:1,P1L,5:2,HY3,3
    2:1,HY3,1:2,HY3,3
    

    表示前述名称为1的结构中(Protein或DNA或RNA),有三个PTM。名称为2的结构中,有两个PTM。

    Covalent Bond

    共价键信息的文本文件,TXT格式。每个共价键的信息与Single模式中一致(参考Single模式中的定义)。
    Batch模式下,每行定义一个结构的所有共价键,且以唯一名称开头(该名称必须存在于前述的Protein或DNA或RNA记录中),都以英文冒号(:)分隔。文件内容示例如下:

    1:1,1,CA;2,1,CA:1,1,CA;3,1,CHA
    2:1,1,CA;3,1,CHA
    

    表示前述名称为1的结构中(Protein或DNA或RNA),有两个共价键。名称为2的结构中,有一个共价键。

    Pocket

    结合位点类型限制信息的文本文件,TXT格式。每个结合位点信息的定义与Single模式中一致(参考Single模式中的定义)。
    Batch模式下,每行定义一个结构的所有结合位点限制信息,且以唯一名称开头(该名称必须存在于前述的Protein或DNA或RNA记录中),都以英文冒号(:)分隔。文件内容示例如下:

    1:2;1,55;1,62;1,91;1,92;1,99;1,110
    2:1;2,15;2,17;2,18;2,56:1;3,76;3,78;3,96
    

    表示前述名称为1的结构中(Protein或DNA或RNA),有一个结合位点限制。名称为2的结构中,有两个结合位点限制。

    Format

    输出结构的格式,支持PDB或CIF格式,默认为PDB格式。

    结果说明

    输出结果文件为排名前5的复合物结构rank_1-5.cif和pred_scores_boltz.csv,csv中包含信息如下:

    字段名称 说明
    Name 复合物结构名称
    Confidence_Score 对预测结构的质量排序的指标分数,数值在0~1.0之间,越大表示预测结构的质量越高。该分数综合考虑了两个指标:iptm(单体时为pTM), complex_plddt, 计算公式为: Confidence_Score = 0.8 × complex_plddt + 0.2 × ipTM
    pTM 对结构预测得到的TM score,衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM 对结构中的相互作用界面预测得到的TM score,当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定
    Complex_pLDDT 对复合物预测得到的平均pLDDT score,值范围是0-1.0,该值越大说明预测的结构越可靠。低于0.7被认为可靠性较低,低于0.5基本认为是可信度非常低,为无序预测
    Complex_ipLDDT 将复合物中相互作用界面的权重提升后,预测得到的pLDDT score,值范围是0-1.0,该值越大说明预测的结构越可靠
    pLDDT_domain 当设置Domain参数时,预测得到的区域残基的平均pLDDT数值,多个区域时,数值用英文分号";"分隔

    final_results.tar.gz文件为Batch模式下额外生成一个所有预测结果的打包文件

    参考文献

    Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, Regina Barzilay. Boltz-1: Democratizing Biomolecular Interaction Modeling. bioRxiv 2024.11.19.624167

    Structure Prediction (Boltz-1)

    Introduction

    An AF3-like structure prediction model based on the Boltz-1 algorithm from MIT (Massachusetts Institute of Technology). Boltz-1 is an open-source deep learning model that integrates innovations in model architecture, speed optimization, and data processing, and it achieves AlphaFold3-level accuracy in predicting the 3D structures of biomolecular complexes. Across a range of benchmarks, Boltz-1 performs comparably to state-of-the-art commercial models, setting a new standard for commercially usable tools in structural biology.
    image.png

    Parameter

    Single Mode

    Protein Sequence

    The sequence file of proteins in FASTA format, supporting multiple sequences.
    Note: For multi-protein complex structure prediction, the amino acid sequence input format is as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    The sequence file of DNA nucleic acids in FASTA format, supporting multiple sequences.

    RNA Sequence

    The sequence file of RNA nucleic acid molecules in FASTA format, supporting multiple sequences.

    Note: With the current 24 GB of GPU memory, about 1,000 residues/bases can be computed.

    Ligand

    A text file containing small molecule information in TXT format. It supports SMILES or CCD Code (Chemical Component Dictionary number). If using the SMILES format, each line should contain one small molecule; if using the CCD Code, each line can contain one or more small molecules, separated by commas and prefixed with CCD. An example is as follows:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Note: Not applicable for amino acid sequence format input of ligand proteins or polypeptides.

    Modification

    A text file containing post-translational modification (PTM) information in TXT format. Each line contains one PTM information entry, consisting of three parts:

    • Sequence order number where the PTM occurs
    • CCD number of the PTM type
    • Residue position number where the PTM occurs

    The three parts are separated by commas. For example, 1,HY3,1 indicates that the first residue of the first sequence carries a PTM of type HY3 (the CCD code for 3-hydroxyproline, i.e., hydroxylated proline).

    Note:

    • The sequence order number is numbered sequentially according to the order and number of sequences in the parameters Protein, DNA, and RNA, starting from 1. For example, if there are 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the sequence numbers are: the first protein sequence is numbered 1, the second protein sequence is numbered 2, the DNA sequence is numbered 3, and the RNA sequence is numbered 4.
    • For CCD introduction, refer to https://www.wwpdb.org/data/ccd . The CCD number lookup website is https://www.ebi.ac.uk/pdbe-srv/pdbechem/ .

    An example of a file containing multiple PTM information entries is as follows:

    1,HY3,1
    1,P1L,5
    2,HY3,3
    

    Covalent Bond

    A text file containing covalent bond information in TXT format. Each line defines one covalent bond as a pair of atoms, and each atom is described by three parts:

    • Sequence or small molecule order number (following the sequence numbering rule defined in Modification, with small molecule order added at the end)
    • Position number of the residue where the atom is located (if the residue is a small molecule, the number is 1)
    • Standard name of the atom (as defined in CCD)

    The three parts are separated by commas. For example, 3,1,CA indicates the CA atom of the first residue (or small molecule) in the third entity (sequence or small molecule).

    A covalent bond consists of two atom information entries, separated by a semicolon, such as 1,1,CA;2,1,CA, indicating a covalent bond composed of two atoms: the first atom is 1,1,CA, and the second atom is 2,1,CA.

    An example of a file containing multiple covalent bond information entries is as follows:

    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
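
    As an illustration of the Covalent Bond format, the sketch below splits one line into its two atom entries. The names AtomRef and parse_bond_line are hypothetical; the code only assumes the "entity,position,atom;entity,position,atom" layout described above and does not check atom names against the CCD.

```python
from typing import NamedTuple


class AtomRef(NamedTuple):
    entity_index: int      # order number of the sequence or small molecule
    residue_position: int  # residue position (1 for a small molecule)
    atom_name: str         # standard atom name as defined in the CCD


def parse_atom(text: str) -> AtomRef:
    """Parse one 'entity,position,atom' entry."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated fields: {text!r}")
    return AtomRef(int(parts[0]), int(parts[1]), parts[2])


def parse_bond_line(line: str) -> tuple[AtomRef, AtomRef]:
    """Parse one covalent bond: two atom entries separated by a semicolon."""
    atoms = line.strip().split(";")
    if len(atoms) != 2:
        raise ValueError(f"expected exactly two atoms: {line!r}")
    return parse_atom(atoms[0]), parse_atom(atoms[1])


first, second = parse_bond_line("1,1,CA;3,1,CHA")
print(first, second)
```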
    

    Pocket

    A text file containing pocket restraint information in TXT format. Each line describes one pocket and consists of two parts:

    • The order number of the Binder (using the same numbering as the sequences and small molecules in the Covalent Bond definition). The Binder can be a small molecule or a protein/nucleic acid sequence. Currently, only one Binder (i.e., one number) is supported per pocket.
    • The residues of the pocket. Each residue is given as the order number of its sequence and its residue position number, separated by a comma; for example, 1,25 indicates the 25th residue of the first sequence. Multiple residues are separated by semicolons (;); for example, 1,25;1,27;1,32;1,38 indicates that the 25th, 27th, 32nd, and 38th residues of the first sequence form the pocket.

    The two parts are also separated by a semicolon (;). For example, 2;1,55;1,62;1,91;1,92;1,99;1,110 indicates that the second entity (sequence or small molecule) acts as the Binder and binds to the pocket formed by the 55th, 62nd, 91st, 92nd, 99th, and 110th residues of the first sequence.

    An example of a file containing multiple pockets is as follows:

    2;1,55;1,62;1,91;1,92;1,99;1,110
    3;1,25;1,27;1,32;1,38
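
    The Pocket line format can be sketched in Python as follows. Pocket and parse_pocket are hypothetical names used for illustration only; the code assumes nothing beyond the semicolon and comma layout described above.

```python
from typing import NamedTuple


class Pocket(NamedTuple):
    binder_index: int                # order number of the Binder entity
    residues: list[tuple[int, int]]  # (sequence order number, residue position)


def parse_pocket(entry: str) -> Pocket:
    """Parse 'binder;seq,pos;seq,pos;...' into a Pocket."""
    fields = [f.strip() for f in entry.strip().split(";")]
    if len(fields) < 2:
        raise ValueError(f"a pocket needs a Binder and at least one residue: {entry!r}")
    binder = int(fields[0])
    residues = []
    for field in fields[1:]:
        seq, pos = field.split(",")
        residues.append((int(seq), int(pos)))
    return Pocket(binder, residues)


pocket = parse_pocket("3;1,25;1,27;1,32;1,38")
print(pocket.binder_index, pocket.residues)
```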
    

    Domain

    The residue regions to report on. The module outputs the average pLDDT value of all residues in each region. A residue region is written as a sequence order number and a residue list, separated by a colon (:):

    • Sequence order number (as defined in the Modification parameter). A value of 1 can be omitted, since it is the default.
    • Residue list, using residue position numbers, with multiple residues separated by commas and residue ranges indicated by hyphens. For example, 3,10,24-30 indicates the 3rd, 10th, and 24th to 30th residues of the target sequence.

    For example, 1:24,28,32-40 indicates the region composed of the 24th, 28th, and 32nd to 40th residues of the first sequence. Since it is the first sequence, the number 1 can be omitted, giving the equivalent 24,28,32-40. The average pLDDT value of all residues in this region is written to the result file.
    Multiple residue regions are supported, separated by semicolons (;). For example, 1:24,28,32-40;2:15,23,50-60 defines two regions: region one consists of the 24th, 28th, and 32nd to 40th residues of the first sequence, and region two consists of the 15th, 23rd, and 50th to 60th residues of the second sequence. The average pLDDT value of each region is written to the result file.
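
    To make the Domain syntax concrete, the sketch below expands a Domain string into explicit residue positions and averages per-residue pLDDT values over them. The function names and the pLDDT dictionary are hypothetical; how the module itself computes the average is not documented here.

```python
def parse_residue_list(text: str) -> list[int]:
    """Expand '24,28,32-40' into explicit residue positions."""
    positions = []
    for item in text.split(","):
        item = item.strip()
        if "-" in item:
            start, end = (int(x) for x in item.split("-"))
            if end < start:
                raise ValueError(f"descending range: {item!r}")
            positions.extend(range(start, end + 1))
        else:
            positions.append(int(item))
    return positions


def parse_domains(spec: str) -> list[tuple[int, list[int]]]:
    """Parse 'seq:residues;seq:residues'; the sequence number defaults to 1."""
    regions = []
    for region in spec.split(";"):
        region = region.strip()
        if not region:
            continue
        if ":" in region:
            seq, residues = region.split(":", 1)
            seq_index = int(seq)
        else:
            seq_index, residues = 1, region
        regions.append((seq_index, parse_residue_list(residues)))
    return regions


def mean_plddt(positions: list[int], plddt: dict[int, float]) -> float:
    """Average pLDDT over the given positions (plddt maps position -> value)."""
    return sum(plddt[p] for p in positions) / len(positions)


print(parse_domains("1:24,28,32-40;2:15,23,50-60"))
```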

    Format

    The output structure format supports PDB or CIF, with PDB format as the default.

    Batch Mode

    Protein Sequence

    The protein sequence file in FASTA format, supporting multiple sequences. Each record represents a structure to be predicted, and each record name must be unique. If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >1
    EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
    >2
    YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
    

    This indicates two structures to be predicted, with the first record named 1 containing three protein chains separated by colons. The second record is named 2 and contains a single chain.
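
    The Batch-mode FASTA convention (one record per structure, chains joined by colons) can be sketched as below. parse_batch_fasta is a hypothetical helper, not part of the module; it only enforces the uniqueness rule stated above.

```python
def parse_batch_fasta(text: str) -> dict[str, list[str]]:
    """Map each record name to its list of chains (split on ':')."""
    records: dict[str, str] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate record name: {name!r}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines of one record are concatenated.
            records[name] += line
    return {n: seq.split(":") for n, seq in records.items()}


example = """>1
EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
>2
YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
"""
structures = parse_batch_fasta(example)
for record_name, chains in structures.items():
    print(record_name, len(chains), "chain(s)")
```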

    DNA Sequence

    The DNA sequence file in FASTA format, supporting multiple sequences. Each sequence record represents a structure to be predicted, and each record name must be unique. (It can match the record names in the Protein parameter, indicating that the DNA sequence belongs to the same structure as the Protein sequence.) If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >dna
    GACCTCT:CCTAGCT
    >1
    CCTAGCT
    

    This indicates two records, with the first named dna containing two DNA chains separated by a colon. Since this name does not appear in the Protein example records, it represents a new structure. The second record is named 1, containing one DNA chain, and since this name exists in the Protein example records, it indicates that they belong to the same structure (which contains both Protein and DNA sequences).

    RNA Sequence

    The RNA sequence file in FASTA format, supporting multiple sequences. Each sequence record represents a structure to be predicted, and each record name must be unique. (It can match the record names in the DNA or Protein parameters, indicating that the RNA sequence belongs to the same structure.) If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >1
    AGCU
    >rna
    AGGCU:UGAUC
    

    This indicates two records, with the first named 1, which is a single chain. Since this name exists in the DNA and Protein example records, it indicates that they belong to the same structure (which includes Protein, DNA, and this RNA sequence). The second record is named rna, containing two RNA chains separated by a colon. Since this name does not appear in the DNA or Protein example records, it represents a new structure.

    Ligand

    A text file containing information on small molecules in TXT format. It supports either SMILES or CCD Code. If using SMILES format, each line should contain one small molecule; if using CCD Code, each line can contain one or more small molecules, separated by commas. Each line represents a structure to be predicted, and each line must start with a unique name (this name can match those in the RNA, DNA, or Protein parameters, indicating that all ligands in that line belong to the same structure). The name and all ligands are separated by a colon (:). Example content:

    1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3
    lig:CC1CC(CN2NC[N+](=C3CCC(OC(F)(F)F)CC3)C2=O)CC(C)C1OC(C)(C)C(=O)[O-]
    

    This indicates two records, with the first named 1, containing three ligands (one SMILES and two CCD codes). Since this name exists in the RNA, DNA, or Protein example records, it indicates that they belong to the same structure. The second record is named lig, containing one ligand (in SMILES format). Since this name does not appear in the RNA, DNA, or Protein example records, it represents a new structure.
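
    A Batch-mode Ligand line can be taken apart as sketched below. parse_ligand_line is a hypothetical helper. It assumes, as in the examples above, that CCD entries start with the CCD prefix and that the SMILES strings themselves contain no colon; a SMILES with atom-map labels such as [C:1] would need different handling.

```python
def parse_ligand_line(line: str) -> tuple[str, list[tuple[str, str]]]:
    """Split 'name:ligand:ligand...' into the name and (kind, value) pairs."""
    name, *entries = line.strip().split(":")
    if not name or not entries:
        raise ValueError(f"expected 'name:ligand[:ligand...]': {line!r}")
    ligands = []
    for entry in entries:
        if entry.startswith("CCD,"):
            # One CCD entry may list several codes after the prefix.
            ligands.extend(("ccd", code.strip()) for code in entry.split(",")[1:])
        else:
            ligands.append(("smiles", entry))
    return name, ligands


name, ligands = parse_ligand_line("1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3")
print(name, ligands)
```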

    Modification

    A text file containing post-translational modification (PTM) information in TXT format. Each PTM entry is consistent with that in Single mode (refer to the definitions in Single mode). Each line defines all PTM information for a structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), separated by colons (:). Example content:

    1:1,HY3,1:1,P1L,5:2,HY3,3
    2:1,HY3,1:2,HY3,3
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has three PTMs, while the structure named 2 has two PTMs.

    Covalent Bond

    A text file containing covalent bond information in TXT format. Each covalent bond entry is consistent with that in Single mode (refer to the definitions in Single mode). In Batch mode, each line defines all covalent bonds for a structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), separated by colons (:). Example content:

    1:1,1,CA;2,1,CA:1,1,CA;3,1,CHA
    2:1,1,CA;3,1,CHA
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has two covalent bonds, while the structure named 2 has one covalent bond.

    Pocket

    A text file containing pocket information in TXT format. Each pocket entry follows the Single mode definition. In Batch mode, each line defines all pockets for one structure and starts with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records); the name and the pocket entries are separated by colons (:). Example content:

    1:2;1,55;1,62;1,91;1,92;1,99;1,110
    2:1;2,15;2,17;2,18;2,56:1;3,76;3,78;3,96
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has one pocket, while the structure named 2 has two pockets.
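
    The Batch-mode Modification, Covalent Bond, and Pocket files share the same outer layout: a structure name followed by colon-separated entries. The sketch below shows that shared step only; split_batch_line and check_names are hypothetical helpers, and each returned entry would still be parsed according to its Single mode definition.

```python
def split_batch_line(line: str) -> tuple[str, list[str]]:
    """Split 'name:entry:entry...' into the structure name and its entries."""
    name, *entries = line.strip().split(":")
    if not name or not entries:
        raise ValueError(f"expected 'name:entry[:entry...]': {line!r}")
    return name, entries


def check_names(lines: list[str], known_names: set[str]) -> None:
    """Raise if a line refers to a structure name that was never defined."""
    for line in lines:
        name, _ = split_batch_line(line)
        if name not in known_names:
            raise ValueError(f"unknown structure name: {name!r}")


pocket_lines = [
    "1:2;1,55;1,62;1,91;1,92;1,99;1,110",
    "2:1;2,15;2,17;2,18;2,56:1;3,76;3,78;3,96",
]
check_names(pocket_lines, known_names={"1", "2"})
for line in pocket_lines:
    structure, entries = split_batch_line(line)
    print(structure, len(entries), "pocket(s)")
```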

    Format

    The output structure format supports PDB or CIF, with PDB format as the default.

    Result

    The output files are the top 5 ranked complex structures (rank_1-5.cif) and pred_scores_boltz.csv. The CSV contains the following fields:

    Field Name | Description
    Name | Name of the complex structure
    Confidence_Score | Score used to rank the quality of the predicted structures, ranging from 0 to 1.0; higher values indicate better quality. It combines two metrics, ipTM (pTM is used instead for monomers) and complex_plddt: Confidence_Score = 0.8 × complex_plddt + 0.2 × ipTM
    pTM | Predicted TM score for the complex
    ipTM | Predicted TM score when aggregating at the interfaces
    Complex_pLDDT | Average pLDDT score for the complex
    Complex_ipLDDT | Average pLDDT score when upweighting interface tokens
    pLDDT_domain | The average pLDDT value of the domain residues when the Domain parameter is set. For multiple domains, the values are separated by semicolons (;).

    In Batch mode, an additional archive, final_results.tar.gz, is generated containing all prediction results.
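
    The Confidence_Score formula above can be reproduced directly, which is useful as a sanity check when post-processing pred_scores_boltz.csv. This is a sketch under assumptions: the CSV text below is made-up sample data, and the exact header layout of the real file is assumed to follow the field names in the table.

```python
import csv
import io


def confidence_score(complex_plddt: float, iptm: float) -> float:
    """Confidence_Score = 0.8 * complex_plddt + 0.2 * ipTM (pTM for monomers)."""
    return 0.8 * complex_plddt + 0.2 * iptm


# Hypothetical sample content; real values come from pred_scores_boltz.csv.
SAMPLE_CSV = """Name,Confidence_Score,pTM,ipTM,Complex_pLDDT
rank_1,0.82,0.70,0.50,0.90
rank_2,0.74,0.65,0.50,0.80
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
best = max(rows, key=lambda r: float(r["Confidence_Score"]))
print("best structure:", best["Name"])

# Recompute each score from its two components and compare with the stored value.
for row in rows:
    recomputed = confidence_score(float(row["Complex_pLDDT"]), float(row["ipTM"]))
    print(row["Name"], round(recomputed, 2), row["Confidence_Score"])
```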

    Reference

    Boltz-1: Democratizing Biomolecular Interaction Modeling. Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, Regina Barzilay. bioRxiv 2024.11.19.624167

  • Name: Structure Prediction (Chai-1)
    Description: An AF3-like structure prediction model based on the Chai-1 algorithm from Chai Discovery, Inc. (backed by OpenAI). Supports complexes of proteins, nucleic acids (DNA, RNA), small-molecule ligands, and metal ions.
    Tags: undefined
    Author: Chai Discovery
    Release: 2024-12-02 00:00:00
    Reference: Chai-1: Decoding the molecular interactions of life. Chai Discovery, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, Kevin Wu. doi: 10.1101/2024.10.10.615955

    Structure Prediction (Chai-1)

    简介

    基于Chai Discovery, Inc.(OpenAI投资)的Chai-1算法的AF3 like结构预测模型。Chai-1是一种用于分子结构预测的多模态基础模型,在各种基准测试中均表现出色,可以预测包括蛋白质、小分子、DNA、RNA、糖基化等。

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式,支持多条序列。
    注意:多蛋白复合物结构预测,其氨基酸序列输入格式如下:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    DNA核酸的序列文件,FASTA格式,支持多条序列。

    RNA Sequence

    RNA核酸分子的序列文件,FASTA格式,支持多条序列。

    备注:当前24GB的GPU显存能计算的残基/碱基数量在1000个左右。

    在Protein、DNA、RNA序列中,都支持残基或碱基的修饰,用CCD进行定义,CCD的介绍参考https://www.wwpdb.org/data/ccd 编号查询网址为https://www.ebi.ac.uk/pdbe-srv/pdbechem/
    定义残基或碱基修饰时,直接在序列中用英文括号‘()’包含CCD code即可,示例如下:

    >seq
    (ACE)GQLEEIAK
    

    表示在序列的N端发生了乙酰化;

    >seq
    AGSHSMRYFSTSVSR(HY3)GRGEPRFIAVG
    

    表示序列中的残基P发生了羟基化修饰,变成HY3(CCD code)

    Ligand

    文本文件包含小分子的结构信息,用SMILES格式,支持多个小分子,每行放置一个,示例如下:

    CC(=O)OC1C[NH+]2CCC1CC2
    [Mg+2]
    

    注意:不适用于配体蛋白或多肽的氨基酸序列格式输入。

    Restraints

    包含残基间距离限制信息的文本文件。距离限制的类型有两种:两个残基间的距离限制,一个残基与一条链之间的距离限制。

    两个残基间的距离限制的定义由五部分组成:

    • 残基1所在序列的顺序编号(序列的顺序编号,是依次按上述参数Protein、DNA、RNA中的序列顺序与数量,从1开始进行编号,例如:当有2条蛋白序列,1条DNA序列,1条RNA序列时,各序列对应的编号为:第一条蛋白序列编号为1,第二条蛋白序列编号为2,DNA序列编号为3,RNA序列编号为4)
    • 残基1的符号及位置编号(如:R84表示84号残基R)
    • 残基2所在序列的顺序编号
    • 残基2的符号及位置编号
    • 残基间的最大距离(单位为埃)

    五部分由逗号分隔,例如:1,R84,3,G7,10.0
    表示第1条序列中的84号残基R,与第3条序列中的7号残基G,之间的最大距离为10.0埃。

    一个残基与一条链之间的距离限制表示该残基与链中任意一个残基的距离满足限制即可。其定义方式与上述类似,差异在于,残基1与残基2的符号及位置编号,其中一个需设置为0(不可同时为0),例如:1,R84,3,0,10.0
    表示第1条序列中的84号残基R,与第3条链的任意一个残基/碱基的最大距离为10.0埃即可。

    支持放置多个距离限制,每行放置一个即可,包含多个距离限制信息的文件内容示例如下:

    1,H189,3,L4,8.0
    1,R84,3,0,10.0
    

    结果说明

    输出结果文件为排名前5的复合物结构rank_1-5.cif和pred_scores_chai1.csv,csv中包含信息如下:

    列名 说明
    Name 结构名称
    Aggregate_Score 对预测结构的质量排序的指标分数,值范围在-100至1.0之间,越大表示预测结构的质量越高。该分数综合考虑了三个指标:ptm, iptm, has_clash, 计算公式为: Aggregate_Score = 0.8 × ipTM + 0.2 × pTM − 100 × has_clash。注意:结构为单体时,因为ipTM为0,整体的综合得分偏低,可参考pTM即可。
    pTM 预测的TM分数(the predicted template modeling score),衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM 预测的亚基接触面的TM分数(the interface predicted template modeling score),当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定

    参考文献

    Chai-1: Decoding the molecular interactions of life. Chai Discovery team, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, Kevin Wu
    bioRxiv 2024.10.10.615955

    Structure Prediction (Chai-1)

    Introduction

    An AF3-like structure prediction model based on the Chai-1 algorithm from Chai Discovery, Inc. Chai-1 is a multimodal foundation model for molecular structure prediction that performs well across a variety of benchmarks. It can predict structures involving proteins, small molecules, DNA, RNA, glycosylations, and more.

    Parameter

    Protein Sequence

    The sequence file of proteins in FASTA format, supporting multiple sequences.
    Note: For multi-protein complex structure prediction, the amino acid sequence input format is as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    The sequence file of DNA nucleic acids in FASTA format, supporting multiple sequences.

    RNA Sequence

    The sequence file of RNA nucleic acid molecules in FASTA format, supporting multiple sequences.

    Note: With the current 24 GB of GPU memory, roughly 1000 residues/bases can be processed.

    Protein, DNA, and RNA sequences all support modified residues or bases, defined by CCD codes. For an introduction to the CCD, see https://www.wwpdb.org/data/ccd ; CCD codes can be looked up at https://www.ebi.ac.uk/pdbe-srv/pdbechem/
    To define a residue or base modification, place the CCD code in parentheses "()" directly in the sequence, as shown in the following examples:

    >seq
    (ACE)GQLEEIAK
    

    This indicates acetylation at the N-terminus of the sequence.

    >seq
    AGSHSMRYFSTSVSR(HY3)GRGEPRFIAVG
    

    Indicates that residue P in the sequence is hydroxylated and becomes HY3 (CCD code).
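
    To see how such an annotated sequence breaks down into residues, the sketch below splits it into one token per position, where a parenthesised CCD code counts as a single residue. tokenize_sequence is a hypothetical helper for illustration and is not how Chai-1 itself reads the input.

```python
import re

# Either a parenthesised CCD code, e.g. "(HY3)", or a single one-letter residue.
_TOKEN = re.compile(r"\(([A-Za-z0-9]+)\)|([A-Za-z])")


def tokenize_sequence(sequence: str) -> list[str]:
    """Return one token per residue; modified residues keep their CCD code."""
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = _TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError(f"unexpected character at index {pos}: {sequence[pos]!r}")
        tokens.append(match.group(1) or match.group(2))
        pos = match.end()
    return tokens


acetylated = tokenize_sequence("(ACE)GQLEEIAK")
hydroxylated = tokenize_sequence("AGSHSMRYFSTSVSR(HY3)GRGEPRFIAVG")
print(acetylated)
print(len(hydroxylated), "residues; position 16 is", hydroxylated[15])
```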

    Ligand

    A text file containing the structures of small molecules in SMILES format. Multiple small molecules are supported, one per line, as shown in the following example:

    CC(=O)OC1C[NH+]2CCC1CC2
    [Mg+2]
    

    Note: This parameter does not accept ligand proteins or polypeptides given as amino acid sequences.

    Restraints

    A text file containing distance restraints between residues. Two types of distance restraint are supported: a restraint between two residues, and a restraint between a residue and a chain.

    A distance restraint between two residues consists of five parts:

    • Order number of the sequence containing residue 1 (sequences are numbered from 1, following the order and number of sequences in the Protein, DNA, and RNA parameters above. For example, with 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the first protein sequence is numbered 1, the second protein sequence 2, the DNA sequence 3, and the RNA sequence 4)
    • Symbol and position number of residue 1 (e.g., R84 for residue R at position 84)
    • Order number of the sequence containing residue 2
    • Symbol and position number of residue 2
    • Maximum distance between the residues (in angstroms)

    The five parts are separated by commas. For example, 1,R84,3,G7,10.0 means that the maximum distance between residue R84 of the first sequence and residue G7 of the third sequence is 10.0 angstroms.

    A distance restraint between a residue and a chain is satisfied when the distance between the residue and any residue in the chain meets the limit. It is defined in the same way as above, except that the symbol and position number of either residue 1 or residue 2 (but not both) is set to 0. For example, 1,R84,3,0,10.0 means that residue R84 of the first sequence must lie within 10.0 angstroms of at least one residue/base of the third chain.

    Multiple distance restraints are supported, one per line. An example file containing multiple distance restraints is as follows:

    1,H189,3,L4,8.0
    1,R84,3,0,10.0
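
    Both restraint types can be read with one small parser, sketched below. Restraint and parse_restraint are hypothetical names; a residue field of 0 is represented as None, meaning "any residue of that chain", and the rule that both residue fields cannot be 0 is enforced.

```python
import re
from typing import NamedTuple, Optional


class Restraint(NamedTuple):
    seq1: int
    res1: Optional[tuple[str, int]]  # (symbol, position); None means any residue
    seq2: int
    res2: Optional[tuple[str, int]]
    max_distance: float              # in angstroms


_RESIDUE = re.compile(r"([A-Za-z])(\d+)")


def _parse_residue(text: str) -> Optional[tuple[str, int]]:
    text = text.strip()
    if text == "0":
        return None
    match = _RESIDUE.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid residue field: {text!r}")
    return match.group(1).upper(), int(match.group(2))


def parse_restraint(line: str) -> Restraint:
    """Parse 'seq1,res1,seq2,res2,max_distance'."""
    parts = line.strip().split(",")
    if len(parts) != 5:
        raise ValueError(f"expected 5 comma-separated fields: {line!r}")
    res1, res2 = _parse_residue(parts[1]), _parse_residue(parts[3])
    if res1 is None and res2 is None:
        raise ValueError("residue 1 and residue 2 cannot both be 0")
    return Restraint(int(parts[0]), res1, int(parts[2]), res2, float(parts[4]))


pair = parse_restraint("1,H189,3,L4,8.0")
to_chain = parse_restraint("1,R84,3,0,10.0")
print(pair)
print(to_chain)
```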
    

    Result

    The output files are the top 5 ranked complex structures (rank_1-5.cif) and pred_scores_chai1.csv. The CSV contains the following fields:

    Field Name | Description
    Name | Name of the complex structure
    Aggregate_Score | Score used to rank the quality of the predicted structures, ranging from -100 to 1.0; higher values indicate higher quality. It combines three metrics, pTM, ipTM, and has_clash: Aggregate_Score = 0.8 × ipTM + 0.2 × pTM − 100 × has_clash. Note: for a monomeric structure, ipTM is 0, so the Aggregate_Score is relatively low; in that case, refer to pTM alone.
    pTM | The predicted template modeling score, which measures the overall accuracy of the predicted structure. A higher score indicates greater accuracy. A score above 0.5 suggests that the overall fold may be similar to the true structure.
    ipTM | The interface predicted template modeling score, available only when the predicted structure is a complex. It measures the accuracy of the predicted relative positions of the subunits in the complex. A higher score indicates greater accuracy. Scores above 0.8 indicate high-quality predictions, scores below 0.6 suggest that the prediction may have failed, and scores between 0.6 and 0.8 are a gray area in which the correctness of the prediction is uncertain.
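
    The Aggregate_Score formula and the ipTM thresholds quoted above translate into a few lines of Python. This is an illustrative sketch only; the function names and the returned labels are hypothetical.

```python
def aggregate_score(iptm: float, ptm: float, has_clash: bool) -> float:
    """Aggregate_Score = 0.8 * ipTM + 0.2 * pTM - 100 * has_clash."""
    return 0.8 * iptm + 0.2 * ptm - 100.0 * float(has_clash)


def interpret_iptm(iptm: float) -> str:
    """Apply the thresholds from the table: > 0.8, < 0.6, and the gray area."""
    if iptm > 0.8:
        return "high quality"
    if iptm < 0.6:
        return "likely failed"
    return "uncertain"


# A clash dominates the score, so clashing models always rank last.
print(aggregate_score(0.5, 0.9, has_clash=False))
print(aggregate_score(0.5, 0.9, has_clash=True))

# For a monomer ipTM is 0, so the score reduces to 0.2 * pTM.
print(aggregate_score(0.0, 0.9, has_clash=False))
print(interpret_iptm(0.7))
```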
  • Name: ADMET Prediction (v2)
    Description: ADMET Prediction (v2) is a machine learning-based module for predicting the ADMET properties of small molecules. It enables rapid batch predictions of ADMET properties and supports four common and efficient machine learning algorithms: Graph Attention Neural Network (GAT), Light Gradient Boosting Machine (LightGBM), Random Forest, and Gradient Boosting Machine (GBM). The module supports two methods for molecular feature representation, molecular fingerprints and molecular descriptors, allowing for quick batch predictions on libraries of small molecule compounds. It supports 27 ADMET properties, covered by 7 regression models and 20 classification models. The module selects the ideal machine learning algorithm and molecular featurization method based on the predictive performance data provided in the documentation.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-11-28 00:00:00
    Reference:

    ADMET Prediction (v2)

    简介

    ADMET Prediction (v2)是一个基于机器学习的小分子ADMET性质预测模块。能快速批量预测小分子的ADMET性质,支持图注意力神经网络模型(GNN)、轻量梯度提升树模型(LGBM)、随机森林模型(RF)、梯度提升树模型(XGBT)4种常见高效的机器学习算法,分子特征支持分子指纹(Morgan FP)以及分子描述符(Descriptors)两种方法,能对小分子化合物库进行快速批量预测。模块支持27种ADMET性质,其中7种回归模型,20种分类模型。不同机器学习方法以及分子特征化方法预测性能如下:
    image.png
    image.png
    模块自动选择最理想的机器学习算法和分子特征化方法的组合进行预测。

    参数说明

    Small Molecules

    待预测的小分子文件,SDF格式。

    Properties

    ADMET预测列表,ADMET性质见结果说明部分。

    Predicted Results

    输出的预测结果文件,默认为predicted_results.csv

    结果说明

    输出结果中,如果是分类模型,输出0或1分类。如果是回归模型,预测出实际值。
    ADMET性质信息如下:

    Dataset Dataset Abbr. ADMET Type Dataset Type Endpoints Description
    Caco-2 (Cell Effective Permeability), Wang et al. caco2 Absorption Regression logPapp
    PAMPA Permeability, NCATS pampa Absorption Binary classification high permeability (1) or low-to-moderate permeability (0) in PAMPA assay
    HIA (Human Intestinal Absorption), Hou et al. hia Absorption Binary classification good permeability (1) or poor permeability (0)
    Pgp (P-glycoprotein) Inhibition, Broccatelli et al. pgp Absorption Binary classification inhibitor (1) or non-inhibitor (0)
    Bioavailability, Ma et al. bioavailability Absorption Binary classification High (1) or low (0) bioavailability
    Lipophilicity, AstraZeneca lipophilicity Absorption Regression octanol/water distribution coefficient (logD at pH 7.4)
    Solubility, AqSolDB solubility Absorption Regression logS
    Hydration Free Energy, FreeSolv freesolv Absorption Regression Hydration Free Energy (kcal/mol)
    BBB (Blood-Brain Barrier), Martins et al. bbbp Distribution Binary classification High (1) or low (0) blood-brain barrier penetration
    PPBR (Plasma Protein Binding Rate), AstraZeneca ppbr Distribution Regression Plasma Protein Binding Rate (0-100)
    CYP P450 2C19 Inhibition, Veith et al. cyp2c19_inhibition Metabolism Binary Classification P450 2C19 inhibitor (1) or non-inhibitor (0)
    CYP P450 2D6 Inhibition, Veith et al. cyp2d6_inhibition Metabolism Binary Classification P450 2D6 inhibitor (1) or non-inhibitor (0)
    CYP P450 3A4 Inhibition, Veith et al. cyp3a4_inhibition Metabolism Binary Classification P450 3A4 inhibitor (1) or non-inhibitor (0)
    CYP P450 1A2 Inhibition, Veith et al. cyp1a2_inhibition Metabolism Binary Classification P450 1A2 inhibitor (1) or non-inhibitor (0)
    CYP P450 2C9 Inhibition, Veith et al. cyp2c9_inhibition Metabolism Binary Classification P450 2C9 inhibitor (1) or non-inhibitor (0)
    CYP2C9 Substrate, Carbon-Mangels et al. cyp2c9_substrate Metabolism Binary Classification CYP2C9 substrate (1) or non-substrate (0)
    CYP2D6 Substrate, Carbon-Mangels et al. cyp2d6_substrate Metabolism Binary Classification CYP2D6 substrate (1) or non-substrate (0)
    CYP3A4 Substrate, Carbon-Mangels et al. cyp3a4_substrate Metabolism Binary Classification CYP3A4 substrate (1) or non-substrate(0)
    Microsome Clearance, AstraZeneca clearance_microsome Excretion Regression Microsome Clearance (CL)
    Acute Toxicity LD50 ld50 Toxicity Regression Acute Toxicity LD50
    hERG blockers herg_blockers Toxicity Binary classification hERG blockers (1) or non-blockers (0)
    hERG Karim et al. herg_karim Toxicity Binary classification hERG blockers (1) or non-blockers (0)
    Ames Mutagenicity ames Toxicity Binary classification high (1) or low (0) ames mutagenicity
    DILI (Drug Induced Liver Injury) dili Toxicity Binary classification high (1) or low (0) drug induced liver injury
    Skin Reaction skin Toxicity Binary classification high (1) or low (0) skin reaction
    ClinTox clintox Toxicity Binary classification high (1) or low (0) ClinTox
    Carcinogens carcinogens Toxicity Binary classification high (1) or low (0) Carcinogens

    ADMET Prediction (v2)

    Introduction

    ADMET Prediction (v2) is a machine learning-based module for predicting the ADMET properties of small molecules. It enables rapid batch predictions of ADMET properties and supports four common and efficient machine learning algorithms: Graph Attention Neural Network (GAT), Light Gradient Boosting Machine (LightGBM), Random Forest, and Gradient Boosting Machine (GBM). The module supports two methods for molecular feature representation, molecular fingerprints and molecular descriptors, allowing for quick batch predictions on libraries of small molecule compounds. It supports 27 ADMET properties, covered by 7 regression models and 20 classification models. The predictive performance of the different machine learning methods and molecular featurization methods is as follows:
    [Figures: predictive performance of each machine learning method and molecular featurization method]
    The module automatically selects the best-performing combination of machine learning algorithm and molecular featurization method for prediction.

    Parameters

    Small Molecules

    The small molecule file to predict, in SDF format.

    Properties

    The list of ADMET properties to predict. The available properties are listed in the Results section.

    Predicted Results

    The output file for the prediction results. The default is predicted_results.csv

    Results

    For classification models, the output is the predicted class, 0 or 1. For regression models, the output is the predicted value. The ADMET properties are as follows:

    Dataset | Dataset Abbr. | ADMET Type | Dataset Type | Endpoints Description
    Caco-2 (Cell Effective Permeability), Wang et al. | caco2 | Absorption | Regression | logPapp
    PAMPA Permeability, NCATS | pampa | Absorption | Binary classification | high permeability (1) or low-to-moderate permeability (0) in PAMPA assay
    HIA (Human Intestinal Absorption), Hou et al. | hia | Absorption | Binary classification | good permeability (1) or poor permeability (0)
    Pgp (P-glycoprotein) Inhibition, Broccatelli et al. | pgp | Absorption | Binary classification | inhibitor (1) or non-inhibitor (0)
    Bioavailability, Ma et al. | bioavailability | Absorption | Binary classification | high (1) or low (0) bioavailability
    Lipophilicity, AstraZeneca | lipophilicity | Absorption | Regression | octanol/water distribution coefficient (logD at pH 7.4)
    Solubility, AqSolDB | solubility | Absorption | Regression | logS
    Hydration Free Energy, FreeSolv | freesolv | Absorption | Regression | Hydration Free Energy (kcal/mol)
    BBB (Blood-Brain Barrier), Martins et al. | bbbp | Distribution | Binary classification | high (1) or low (0) blood-brain barrier penetration
    PPBR (Plasma Protein Binding Rate), AstraZeneca | ppbr | Distribution | Regression | Plasma Protein Binding Rate (0-100)
    CYP P450 2C19 Inhibition, Veith et al. | cyp2c19_inhibition | Metabolism | Binary classification | P450 2C19 inhibitor (1) or non-inhibitor (0)
    CYP P450 2D6 Inhibition, Veith et al. | cyp2d6_inhibition | Metabolism | Binary classification | P450 2D6 inhibitor (1) or non-inhibitor (0)
    CYP P450 3A4 Inhibition, Veith et al. | cyp3a4_inhibition | Metabolism | Binary classification | P450 3A4 inhibitor (1) or non-inhibitor (0)
    CYP P450 1A2 Inhibition, Veith et al. | cyp1a2_inhibition | Metabolism | Binary classification | P450 1A2 inhibitor (1) or non-inhibitor (0)
    CYP P450 2C9 Inhibition, Veith et al. | cyp2c9_inhibition | Metabolism | Binary classification | P450 2C9 inhibitor (1) or non-inhibitor (0)
    CYP2C9 Substrate, Carbon-Mangels et al. | cyp2c9_substrate | Metabolism | Binary classification | CYP2C9 substrate (1) or non-substrate (0)
    CYP2D6 Substrate, Carbon-Mangels et al. | cyp2d6_substrate | Metabolism | Binary classification | CYP2D6 substrate (1) or non-substrate (0)
    CYP3A4 Substrate, Carbon-Mangels et al. | cyp3a4_substrate | Metabolism | Binary classification | CYP3A4 substrate (1) or non-substrate (0)
    Microsome Clearance, AstraZeneca | clearance_microsome | Excretion | Regression | Microsome Clearance (CL)
    Acute Toxicity LD50 | ld50 | Toxicity | Regression | Acute Toxicity LD50
    hERG blockers | herg_blockers | Toxicity | Binary classification | hERG blockers (1) or non-blockers (0)
    hERG Karim et al. | herg_karim | Toxicity | Binary classification | hERG blockers (1) or non-blockers (0)
    Ames Mutagenicity | ames | Toxicity | Binary classification | high (1) or low (0) Ames mutagenicity
    DILI (Drug Induced Liver Injury) | dili | Toxicity | Binary classification | high (1) or low (0) drug induced liver injury
    Skin Reaction | skin | Toxicity | Binary classification | high (1) or low (0) skin reaction
    ClinTox | clintox | Toxicity | Binary classification | high (1) or low (0) ClinTox
    Carcinogens | carcinogens | Toxicity | Binary classification | high (1) or low (0) Carcinogens
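
    When post-processing the predictions, it helps to know which properties are regression values and which are 0/1 classes. The sketch below encodes that split using the dataset abbreviations from the table above. The helper names are hypothetical, and the column layout of predicted_results.csv is not assumed here.

```python
# Dataset abbreviations taken from the table above.
REGRESSION = {
    "caco2", "lipophilicity", "solubility", "freesolv",
    "ppbr", "clearance_microsome", "ld50",
}
CLASSIFICATION = {
    "pampa", "hia", "pgp", "bioavailability", "bbbp",
    "cyp2c19_inhibition", "cyp2d6_inhibition", "cyp3a4_inhibition",
    "cyp1a2_inhibition", "cyp2c9_inhibition",
    "cyp2c9_substrate", "cyp2d6_substrate", "cyp3a4_substrate",
    "herg_blockers", "herg_karim", "ames", "dili", "skin",
    "clintox", "carcinogens",
}


def task_type(prop: str) -> str:
    """Return 'regression' or 'classification' for a dataset abbreviation."""
    if prop in REGRESSION:
        return "regression"
    if prop in CLASSIFICATION:
        return "classification"
    raise KeyError(f"unknown ADMET property: {prop!r}")


def describe_prediction(prop: str, value: float) -> str:
    """Render a prediction according to its task type."""
    if task_type(prop) == "regression":
        return f"{prop}: predicted value {value:g}"
    label = int(value)
    if label not in (0, 1):
        raise ValueError(f"classification output must be 0 or 1, got {value!r}")
    return f"{prop}: class {label}"


print(describe_prediction("solubility", -3.2))
print(describe_prediction("herg_blockers", 1))
```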
  • Name: Structure Prediction (WFold)
    Description: A third-party implementation of an AlphaFold3-like structure prediction model, with prediction results generally consistent with those of AF3 and accuracy surpassing that of AF2.
    Tags: undefined
    Author:
    Release: 2024-11-20 09:34:01
    Reference:

    Structure Prediction (WFold)

    简介

    Structure Prediction模块是基于最新的生物分子结构预测模型,进行各类生物分子的结构预测。

    参数说明

    Single Mode

    Protein Sequence

    蛋白的序列文件,FASTA格式,支持多条序列。
    注意:多蛋白复合物结构预测,其氨基酸序列输入格式如下:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    DNA核酸的序列文件,FASTA格式,支持多条序列。

    RNA Sequence

    RNA核酸分子的序列文件,FASTA格式,支持多条序列。

    Ligand

    文本文件包含小分子信息,TXT格式。支持SMILES或 CCD Code(化学组分词典编号)。如果使用SMILES格式,每行应包含一个小分子;如果使用CCD Code,每行可以包含一个或多个小分子,使用逗号分隔,并加上CCD前缀。示例如下:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    注意:不适用于配体蛋白或多肽的氨基酸序列格式输入。

    Modification

    包含翻译后修饰(PTM)信息的文本文件,TXT格式。每行放置一个PTM信息,每个PTM信息由三部分组成:

    • 发生PTM序列的顺序编号
    • PTM类型的CCD编号
    • 发生PTM的残基位置编号
      三部分由逗号分隔,例如:1,HY3,1 表示第一条序列的第一个残基,发生了类型为HY3(CCD编号,为3-羟基脯氨酸,为脯氨酸的羟基化)的PTM
      备注:
    • 序列的顺序编号,是依次按上述参数Protein、DNA、RNA中的序列顺序与数量,从1开始进行编号,例如:当有2条蛋白序列,1条DNA序列,1条RNA序列时,各序列对应的编号为:第一条蛋白序列编号为1,第二条蛋白序列编号为2,DNA序列编号为3,RNA序列编号为4
    • CCD的介绍参考https://www.wwpdb.org/data/ccd 编号查询网址为https://www.ebi.ac.uk/pdbe-srv/pdbechem/
      包含多个PTM信息的文件内容示例如下:
    1,HY3,1
    1,P1L,5
    2,HY3,3
    

    Covalent Bond

    共价键信息的文本文件,TXT格式。每行放置一个共价键信息,每个共价键信息包含两个原子信息,每个原子信息由三部分组成:

    • 原子所在序列或小分子的顺序编号(编号规则在Modification中定义的序列编号规则基础上,在最后加入小分子的顺序即可)
    • 原子所在残基的位置编号(如残基为小分子时,编号为1)
    • 原子的标准名称(CCD中定义)
      三部分由逗号分隔,例如:3,1,CA表示第三个实体(序列或小分子)中的第一个残基(或小分子)的CA原子
      一个共价键是由两个原子信息组成,原子间用分号分隔,如:1,1,CA;2,1,CA
      表示一个共价键,该共价键由两个原子组成,第一个原子为1,1,CA,第二个原子为2,1,CA
      包含多个共价键信息的文件内容示例如下:
    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
    

    Batch Mode

    Protein Sequence

    蛋白的序列文件,FASTA格式,支持多条序列。
    每一条记录代表一个待预测的结构,每条记录的名称要唯一不能重复。一条记录中有多条链时,通过英文冒号(:)相连,文件内容示例如下:

    >1
    EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
    >2
    YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
    

    表示有两个待预测的结构,第一条记录的名称为1,有三条蛋白链,用:进行分隔。第二条记录的名称为2,为单链。

    DNA Sequence

    DNA核酸的序列文件,FASTA格式,支持多条序列。
    每一条序列记录代表一个待预测的结构,每条记录的名称要唯一不能重复(可与Protein参数中的记录名称一致,表示该记录的DNA序列与Protein序列归属于同一结构)。一条记录中有多条链时,通过英文冒号(:)相连,文件内容示例如下:

    >dna
    GACCTCT:CCTAGCT
    >1
    CCTAGCT
    

    表示有两条记录,第一条的名称为dna,有两条DNA链,用:进行分隔,因为该名称不存在与Protein示例记录中,属于新结构。第二条的名称为1,有一条DNA链,因为该名称存在于Protein示例记录中,则表示同属一个结构(该结构同时包含Protein序列和该DNA序列)。

    RNA Sequence

    RNA核酸分子的序列文件,FASTA格式,支持多条序列。
    每一条序列记录代表一个待预测的结构,每条记录的名称要唯一不能重复(可与DNA或Protein参数中的记录名称一致,表示该记录的RNA序列与上面的DNA序列或Protein序列归属于同一结构)。一条记录中有多条链时,通过英文冒号(:)相连,文件内容示例如下:

    >1
    AGCU
    >rna
    AGGCU:UGAUC
    

    表示有两条记录,第一条的名称为1,为单链,因为该名称存在于DNA及Protein示例记录中,表示同属一个结构(该结构同时包含了Protein序列、DNA序列及该RNA序列)。第二条的名称为rna,有两条RNA链,用:进行分隔,因为该名称不存在于DNA或Protein示例记录中,属于新结构。

    Ligand

    文本文件包含小分子信息,TXT格式。支持SMILES或 CCD Code(化学组分词典编号)。如果使用SMILES格式,每行应包含一个小分子;如果使用CCD Code,每行可以包含一个或多个小分子,使用逗号分隔。
    每行代表一个待预测的结构,每行可放置多个ligand,且以唯一不重复的名称开头(该名称可与上述RNA,DNA或Protein参数中的记录名称一致,表示该行的所有ligands,与上述的RNA或DNA或Protein序列归属于同一结构),名称与所有ligands都以英文冒号(:)分隔。文件内容示例如下:

    1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3
    lig:CC1CC(CN2NC[N+](=C3CCC(OC(F)(F)F)CC3)C2=O)CC(C)C1OC(C)(C)C(=O)[O-]
    

    表示有两条记录,第一条的名称为1,有三个ligand(一个SMILES,两个CCD codes),因为该名称存在于上述的RNA或DNA或Protein示例记录中,表示同属一个结构。第二条的名称为lig,有一个ligand(为SMILES),因为该名称不存在上述的RNA或DNA或Protein示例记录中,属于新结构。

    Modification

    包含翻译后修饰(PTM)信息的文本文件,TXT格式。每个PTM的信息与Single模式中一致(参考Single模式中的定义)。
    每行定义一个结构的所有PTM信息,且以唯一名称开头(该名称必须存在于前述的Protein或DNA或RNA记录中),都以英文冒号(:)分隔。文件内容示例如下:

    1:1,HY3,1:1,P1L,5:2,HY3,3
    2:1,HY3,1:2,HY3,3
    

    表示前述名称为1的结构中(Protein或DNA或RNA),有三个PTM。名称为2的结构中,有两个PTM。

    Covalent Bond

    共价键信息的文本文件,TXT格式。每个共价键的信息与Single模式中一致(参考Single模式中的定义)。
    Batch模式下,每行定义一个结构的所有共价键,且以唯一名称开头(该名称必须存在于前述的Protein或DNA或RNA记录中),都以英文冒号(:)分隔。文件内容示例如下:

    1:1,1,CA;2,1,CA:1,1,CA;3,1,CHA
    2:1,1,CA;3,1,CHA
    

    表示前述名称为1的结构中(Protein或DNA或RNA),有两个共价键。名称为2的结构中,有一个共价键。

    Enhanced Mode

    该模式下,会默认使用1000个随机种子,每个随机种子进行5个结构采样,共进行5000个结构的大批量采样,并从中选择评分靠前的多个预测结构,最终获得更高精度的预测结构。该模式特别适用于抗原-抗体复合物结构的高精度预测,有研究表明该模式下抗体-抗原复合物结构预测准确性提升60%。该模式的输入参数与Single Mode一致,一次运行时间约10~20小时。
    备注:Enhanced Mode模式下预测的氨基酸序列不能超过450AA。

    结果说明

    输出结果文件为排名前5的复合物结构rank_1-5.pdb和pred_scores.csv,csv中包含信息如下:

    字段名称 说明
    Name 复合物结构名称
    Ranking_Score 对预测结构的质量排序的指标分数,值范围在-100至1.5之间,越大表示预测结构的质量越高。该分数综合考虑了四个指标:ptm, iptm, fraction_disordered,has_clash, 计算公式为: Ranking_Score = 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash
    pTM 预测的TM分数(the predicted template modeling score),衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM 预测的亚基接触面的TM分数(the interface predicted template modeling score),当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定
    final_results.tar.gz Batch模式下额外生成一个所有预测结果的打包文件

    Structure Prediction (WFold)

    Introduction

    The Structure Prediction module is designed to predict the structures of various biomolecules based on the latest biomolecular structure prediction models.

    Parameter

    Single Mode

    Protein Sequence

    The protein sequence file in FASTA format, supporting multiple sequences.
    Note: For multi-protein complex structure prediction, the amino acid sequence input format is as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    The DNA sequence file in FASTA format, supporting multiple sequences.

    RNA Sequence

    The RNA sequence file in FASTA format, supporting multiple sequences.

    Ligand

    A text file (TXT format) describing small molecules, given either as SMILES strings or as CCD codes (Chemical Component Dictionary identifiers). A SMILES line contains exactly one small molecule. A CCD line starts with CCD and lists one or more codes, with all items separated by commas. Example:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Note: This parameter does not accept protein or peptide ligands supplied as amino acid sequences.

    Modification

    A text file containing post-translational modification (PTM) information in TXT format. Each line contains one PTM entry, which consists of three parts:

    • The order number of the sequence where the PTM occurs.
    • The CCD number for the type of PTM.
    • The position number of the residue where the PTM occurs.

    These three parts are separated by commas. For example, 1,HY3,1 means that residue 1 of sequence 1 carries a PTM of type HY3 (the CCD code for 3-hydroxyproline, that is, a hydroxylated proline).

    Remarks:

    • The sequence order number is assigned sequentially based on the order and number of sequences in the Protein, DNA, and RNA parameters, starting from 1. For example, if there are 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the corresponding numbers would be: the first protein sequence is 1, the second protein sequence is 2, the DNA sequence is 3, and the RNA sequence is 4.
    • For information on CCD, refer to https://www.wwpdb.org/data/ccd . CCD codes can be looked up at https://www.ebi.ac.uk/pdbe-srv/pdbechem/ .
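
    The numbering rule in the first remark can be sketched as follows. This is only an illustration of the rule as stated (proteins first, then DNA, then RNA, counting from 1); the function name `entity_numbers` and its label strings are hypothetical and not part of the module.

```python
def entity_numbers(n_protein, n_dna, n_rna):
    """Map 1-based order numbers to sequences: proteins, then DNA, then RNA."""
    labels = (
        [f"protein {i}" for i in range(1, n_protein + 1)]
        + [f"DNA {i}" for i in range(1, n_dna + 1)]
        + [f"RNA {i}" for i in range(1, n_rna + 1)]
    )
    # enumerate(..., start=1) gives the order number used in PTM entries
    return {number: label for number, label in enumerate(labels, start=1)}


# The case from the remark: 2 protein sequences, 1 DNA sequence, 1 RNA sequence
numbering = entity_numbers(n_protein=2, n_dna=1, n_rna=1)
for number, label in numbering.items():
    print(number, label)
```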

    Example content of a file with multiple PTM entries:

    1,HY3,1
    1,P1L,5
    2,HY3,3
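
    A file in this format could be read with a few lines of Python. The sketch below only assumes what is described above (three comma-separated parts per line); the names `PTM`, `parse_ptm_line` and `parse_ptm_file` are hypothetical helpers, not functions of the module.

```python
from typing import NamedTuple


class PTM(NamedTuple):
    sequence_number: int   # 1-based order number of the sequence
    ccd_code: str          # CCD code of the modification type
    residue_position: int  # 1-based position of the modified residue


def parse_ptm_line(line):
    """Parse one entry such as '1,HY3,1'."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated parts, got: {line!r}")
    seq, ccd, pos = parts
    return PTM(int(seq), ccd, int(pos))


def parse_ptm_file(text):
    """Parse all non-empty lines of a Modification file."""
    return [parse_ptm_line(l) for l in text.splitlines() if l.strip()]


example = "1,HY3,1\n1,P1L,5\n2,HY3,3\n"
ptms = parse_ptm_file(example)
for ptm in ptms:
    print(ptm)
```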
    

    Covalent Bond

    A text file containing covalent bond information in TXT format. Each line contains one covalent bond entry, which consists of two atom entries. Each atom entry consists of three parts:

    • The order number of the sequence or small molecule (numbering follows the sequence numbering defined under Modification, extended to include the small molecules in their order).
    • The position number of the residue where the atom is located (if the residue is a small molecule, the number is 1).
    • The standard name of the atom (as defined in CCD).

    These three parts are separated by commas. For example, 3,1,CA indicates the CA atom of the first residue (or small molecule) in the third entity (sequence or small molecule).

    A covalent bond is composed of two atom entries, separated by a semicolon, such as 1,1,CA;2,1,CA, indicating that this covalent bond consists of the first atom 1,1,CA and the second atom 2,1,CA.

    Example content of a file with multiple covalent bond entries:

    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
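
    The entry format can be illustrated with a small parser. It assumes only the layout described above (two atoms separated by a semicolon, each atom made of three comma-separated parts); `Atom`, `parse_atom` and `parse_bond` are hypothetical names used for this sketch.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    entity_number: int     # order number of the sequence or small molecule
    residue_position: int  # residue position (1 for a small molecule)
    atom_name: str         # standard atom name as defined in CCD


def parse_atom(text):
    """Parse one atom entry such as '3,1,CA'."""
    entity, position, name = (p.strip() for p in text.split(","))
    return Atom(int(entity), int(position), name)


def parse_bond(line):
    """Parse one covalent bond entry such as '1,1,CA;2,1,CA'."""
    first, second = line.strip().split(";")
    return parse_atom(first), parse_atom(second)


example = "1,1,CA;2,1,CA\n1,1,CA;3,1,CHA"
bonds = [parse_bond(l) for l in example.splitlines() if l.strip()]
for first, second in bonds:
    print(first, "<->", second)
```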
    

    Batch Mode

    Protein Sequence

    The protein sequence file in FASTA format, supporting multiple sequences. Each record represents a structure to be predicted, and each record name must be unique. If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >1
    EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
    >2
    YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
    

    This indicates two structures to be predicted, with the first record named 1 containing three protein chains separated by colons. The second record is named 2 and contains a single chain.
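
    As an illustration of this layout, the sketch below splits a Batch-mode FASTA text into records and chains. It is not the module's own parser; `parse_batch_fasta` is a hypothetical helper, and joining sequence lines that wrap over several lines is an assumption.

```python
def parse_batch_fasta(text):
    """Return {record name: [chain, ...]} for a Batch-mode FASTA text."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate record name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[name] += line  # assumption: wrapped lines are joined
    # chains inside one record are separated by colons
    return {n: seq.split(":") for n, seq in records.items()}


example = """>1
EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
>2
YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
"""
chains = parse_batch_fasta(example)
for record_name, record_chains in chains.items():
    print(record_name, len(record_chains), "chain(s)")
```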

    DNA Sequence

    The DNA sequence file in FASTA format, supporting multiple sequences. Each sequence record represents a structure to be predicted, and each record name must be unique. (It can match the record names in the Protein parameter, indicating that the DNA sequence belongs to the same structure as the Protein sequence.) If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >dna
    GACCTCT:CCTAGCT
    >1
    CCTAGCT
    

    This indicates two records, with the first named dna containing two DNA chains separated by a colon. Since this name does not appear in the Protein example records, it represents a new structure. The second record is named 1, containing one DNA chain, and since this name exists in the Protein example records, it indicates that they belong to the same structure (which contains both Protein and DNA sequences).

    RNA Sequence

    The RNA sequence file in FASTA format, supporting multiple sequences. Each sequence record represents a structure to be predicted, and each record name must be unique. (It can match the record names in the DNA or Protein parameters, indicating that the RNA sequence belongs to the same structure.) If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >1
    AGCU
    >rna
    AGGCU:UGAUC
    

    This indicates two records, with the first named 1, which is a single chain. Since this name exists in the DNA and Protein example records, it indicates that they belong to the same structure (which includes Protein, DNA, and this RNA sequence). The second record is named rna, containing two RNA chains separated by a colon. Since this name does not appear in the DNA or Protein example records, it represents a new structure.

    Ligand

    A text file (TXT format) describing small molecules, given either as SMILES or as CCD codes. Each line represents one structure to be predicted and starts with a unique name (the name may match a record name in the RNA, DNA, or Protein parameters, in which case all ligands on that line belong to that structure). The name is followed by one or more ligand fields, and the name and all ligand fields are separated by colons (:). A ligand field is either a single SMILES string or a comma-separated CCD group that holds one or more codes. Example content:

    1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3
    lig:CC1CC(CN2NC[N+](=C3CCC(OC(F)(F)F)CC3)C2=O)CC(C)C1OC(C)(C)C(=O)[O-]
    

    This indicates two records, with the first named 1, containing three ligands (one SMILES and two CCD codes). Since this name exists in the RNA, DNA, or Protein example records, it indicates that they belong to the same structure. The second record is named lig, containing one ligand (in SMILES format). Since this name does not appear in the RNA, DNA, or Protein example records, it represents a new structure.
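
    The way a line breaks down into ligands can be sketched as below. Two assumptions are made that the text does not state explicitly: a CCD group keeps the CCD prefix shown in the example, and SMILES strings contain no colon (a SMILES with atom-map labels such as [C:1] would not survive this simple split). `parse_ligand_line` is a hypothetical helper.

```python
def parse_ligand_line(line):
    """Split 'name:ligand:ligand...' into the name and a list of ligands."""
    name, *fields = line.strip().split(":")
    ligands = []
    for field in fields:
        if field.startswith("CCD,"):
            # a CCD group may list several codes after the CCD prefix
            ligands += [("ccd", code) for code in field.split(",")[1:]]
        else:
            ligands.append(("smiles", field))
    return name, ligands


lines = [
    "1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3",
    "lig:CC1CC(CN2NC[N+](=C3CCC(OC(F)(F)F)CC3)C2=O)CC(C)C1OC(C)(C)C(=O)[O-]",
]
parsed = [parse_ligand_line(l) for l in lines]
for name, ligands in parsed:
    print(name, len(ligands), "ligand(s)")
```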

    Modification

    A text file containing post-translational modification (PTM) information in TXT format. Each PTM entry is consistent with that in Single mode (refer to the definitions in Single mode). Each line defines all PTM information for a structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), separated by colons (:). Example content:

    1:1,HY3,1:1,P1L,5:2,HY3,3
    2:1,HY3,1:2,HY3,3
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has three PTMs, while the structure named 2 has two PTMs.
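
    A Batch-mode line can be taken apart in the same spirit as a Single-mode entry. The sketch below assumes only the layout shown in the example (structure name first, then colon-separated PTM entries); `parse_batch_ptm_line` is a hypothetical name.

```python
def parse_batch_ptm_line(line):
    """Split 'name:seq,CCD,pos:seq,CCD,pos...' into the name and PTM tuples."""
    name, *entries = line.strip().split(":")
    ptms = []
    for entry in entries:
        seq, ccd, pos = (p.strip() for p in entry.split(","))
        ptms.append((int(seq), ccd, int(pos)))
    return name, ptms


first = parse_batch_ptm_line("1:1,HY3,1:1,P1L,5:2,HY3,3")
second = parse_batch_ptm_line("2:1,HY3,1:2,HY3,3")
print(first)
print(second)
```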

    Covalent Bond

    A text file containing covalent bond information in TXT format. Each covalent bond entry is consistent with that in Single mode (refer to the definitions in Single mode). In Batch mode, each line defines all covalent bonds for a structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), separated by colons (:). Example content:

    1:1,1,CA;2,1,CA:1,1,CA;3,1,CHA
    2:1,1,CA;3,1,CHA
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has two covalent bonds, while the structure named 2 has one covalent bond.

    Enhanced Mode

    In this mode, 1000 random seeds are used by default and each seed produces 5 structure samples, for 5000 sampled structures in total. The top-scoring predictions are then selected from this pool to obtain a more accurate predicted structure. This mode is particularly suitable for high-precision prediction of antibody-antigen complex structures, and studies have shown that the accuracy of antibody-antigen complex structure prediction can be increased by 60% in this mode. The input parameters are the same as in Single Mode, and one run takes approximately 10 to 20 hours.
    Note: In Enhanced Mode, the amino acid sequence to be predicted cannot exceed 450 residues.

    Result

    The output result files include the top 5 ranked complex structures rank_1-5.pdb and ranking_scores.csv, which contains the following information:

    Field Name Description
    Name Name of the complex structure.
    Ranking_Score A score used to rank the quality of the predicted structures, ranging from -100 to 1.5, where a higher score indicates better quality. It combines four indicators (ptm, iptm, fraction_disordered, has_clash) as: Ranking_Score = 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash, where disorder is fraction_disordered.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure. A higher score indicates greater accuracy. If the score is greater than 0.5, it suggests that the overall folding of the structure may be similar to the true structure.
    ipTM The interface predicted template modeling score, which measures the accuracy of the predicted relative positions of the subunits in the complex. A higher score indicates greater accuracy. Scores above 0.8 indicate high-quality predictions, while scores below 0.6 suggest potential failure of the prediction. Scores between 0.6 and 0.8 are in a gray area, where the accuracy of the prediction is uncertain.
    final_results.tar.gz An additional compressed file containing all predicted results generated in Batch mode.
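
    The Ranking_Score formula from the table can be written out directly. The function below is a plain transcription of that formula for illustration; the name `ranking_score` and the sample values are made up, and `has_clash` is assumed to be a 0/1 flag.

```python
def ranking_score(iptm, ptm, fraction_disordered, has_clash):
    """Ranking_Score = 0.8*ipTM + 0.2*pTM + 0.5*disorder - 100*has_clash."""
    return (
        0.8 * iptm
        + 0.2 * ptm
        + 0.5 * fraction_disordered
        - 100.0 * float(has_clash)
    )


# Illustrative values only, not taken from a real prediction
score_ok = ranking_score(iptm=0.85, ptm=0.90, fraction_disordered=0.05, has_clash=False)
score_clash = ranking_score(iptm=0.85, ptm=0.90, fraction_disordered=0.05, has_clash=True)
print(score_ok, score_clash)
```

    The clash penalty of 100 dominates every other term, which is why any structure flagged with a clash ranks below all clash-free structures.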
  • Name: Evaluate Nucleic Acid (AlphaRNA)
    Description: 用于评估核酸序列的其表达量和半衰期、抗体滴度等。支持人、小鼠、大鼠、猪等种属。 Evaluate the expression and half-life of nucleic acid sequences, antibody titers, etc. Support human, mouse, rat, pig and other species.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-11-20 16:47:10
    Reference:

    Evaluate Nucleic Acid (AlphaRNA)

    简介

    Evaluate Nucleic Acid (AlphaRNA)模块用于评估核酸序列的其表达量和半衰期、抗体滴度等。支持人、小鼠、大鼠、猪等种属。

    参数说明

    Nucleic Acid Sequence

    核酸序列,必须为3的倍数,否则截断尾部序列以达到3的倍数序列,比如:GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGG

    Species

    序列所属物种,Homo_Sapiens、Mammalian、Pig、Rat。

    结果

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    AUP AUP (Amino Acid Usage Pattern)指的是氨基酸使用模式的指标,通常用于评估特定氨基酸在序列中的使用频率。值越高,表示该氨基酸在序列中使用的频率越高。
    CAI CAI (Codon Adaptation Index)是一个用于评估特定基因的密码子使用偏好度的指标,值范围从 0 到 1。接近 1 表示该基因的密码子使用模式与高表达基因的模式相似,通常与基因表达效率相关。
    GCR GCR (Gene Codon Ratio)是基因密码子比率的指标,反映了基因中不同密码子的相对使用情况。值越高,表示基因中使用的密码子与参考密码子库的偏好越一致。
    MFE MFE (Minimum Free Energy)是指核酸序列的最低自由能,通常用于评估 RNA 二级结构的稳定性。值越低表示结构越稳定。负值表示该序列在折叠时释放能量,形成稳定的构象。
    Aug Positions Aug Positions表示在序列中发现的AUG(起始密码子)的位置。结果空时表示在序列中没有找到AUG密码子。
    Sequence 根据输入的核酸序列翻译得到的氨基酸序列。
    Secondary Structure RNA序列的预测二级结构。

    Evaluate Nucleic Acid (AlphaRNA)

    Introduction

    The Evaluate Nucleic Acid (AlphaRNA) module assesses properties of nucleic acid sequences such as expression level, half-life, and antibody titer. It supports human, mouse, rat, pig, and other species.

    Parameter

    Nucleic Acid Sequence

    The nucleic acid sequence. Its length must be a multiple of three; otherwise, bases are truncated from the tail until the length is a multiple of three. For example: GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGG.
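
    The truncation rule can be illustrated in a couple of lines. This is a sketch of the behaviour described above, not the module's code; `trim_to_codons` is a hypothetical name, and upper-casing the input is an added convenience.

```python
def trim_to_codons(sequence):
    """Drop trailing bases so that the length is a multiple of three."""
    seq = sequence.strip().upper()
    usable = len(seq) - len(seq) % 3
    return seq[:usable]


example = "GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGG"
trimmed = trim_to_codons(example)
short = trim_to_codons("ATGGCCA")  # one extra base at the tail
print(len(trimmed), short)
```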

    Species

    The species to which the sequence belongs, such as Homo_Sapiens, Mammalian, Pig, or Rat.

    Results

    The output result file is result.csv, which contains the following information:

    Field Name Description
    AUP AUP (Amino Acid Usage Pattern) indicates the usage pattern of amino acids, typically used to assess the frequency of specific amino acids in the sequence. A higher value indicates a higher frequency of that amino acid in the sequence.
    CAI CAI (Codon Adaptation Index) is a metric used to evaluate the codon usage preference of a specific gene, with values ranging from 0 to 1. A value close to 1 indicates that the codon usage pattern of the gene is similar to that of highly expressed genes, which is often related to gene expression efficiency.
    GCR GCR (Gene Codon Ratio) is an indicator of the gene codon ratio, reflecting the relative usage of different codons within the gene. A higher value indicates that the codons used in the gene are more consistent with the preferences of the reference codon library.
    MFE MFE (Minimum Free Energy) refers to the minimum free energy of the nucleic acid sequence, typically used to assess the stability of RNA secondary structures. Lower values indicate more stable structures. Negative values indicate that the sequence releases energy when folded, forming a stable conformation.
    Aug Positions Aug Positions indicates the positions of AUG (start codon) found in the sequence. An empty result means that no AUG codons were found in the sequence.
    Sequence The amino acid sequence translated from the input nucleic acid sequence.
    Secondary Structure The predicted secondary structure of the RNA sequence.
  • Name: Back Mutation Grouping (v2.4)
    Description: Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。 Back Mutation Grouping is a grouping module in the antibody humanization design workflow, which groups the back mutations based on the mutation scoring table generated by the Mutation Score module and returns the back mutated sequence.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-11-15 15:21:07
    Reference:

    Back Mutation Grouping v2.4

    简介

    Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。

    参数说明

    Grafted Chain

    抗体CDR区嫁接后序列文件,FASTA格式,由Grafting模块生成

    Raw Chain

    抗体序列文件,FASTA格式

    Mutation Score

    人源化突变评分文件,CSV格式,由Mutation Score模块生成

    Output File

    指定输出的突变序列文件名称,FASTA格式

    Cutoff

    打分分组的截断值,逗号分割,例如:2,5,10表示将氨基酸突变评分大于10的为一组,5~10的氨基酸为一组,小于2的氨基酸分为一组。

    Output Policy

    指定输出的回复突变的文件

    结果说明

    根据不同截断值得到突变分组结果文件mutate_policy.json。

    Back Mutation Grouping v2.4

    Introduction

    Back Mutation Grouping is a grouping module in the humanization design process of antibodies, which groups back mutations based on the mutation score table output by the Mutation Score module and returns the mutated sequences.

    Parameters

    Grafted Chain

    Antibody sequence file after CDR grafting, in FASTA format, generated by the Grafting module.

    Raw Chain

    Sequence file of the antibody, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Specify the name of the output mutation sequence file, in FASTA format.

    Cutoff

    Cutoff values used to group mutations by score, separated by commas. For example, 2,5,10 places mutations scoring above 10 in one group, mutations scoring between 5 and 10 in another group, and mutations scoring below 2 in a separate group.
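
    One way to picture the grouping is as a binning of scores by the sorted cutoffs. The sketch below is an assumption-laden illustration, not the module's algorithm: it treats scores between 2 and 5 as a group of their own (the description does not mention that range), and it sends a score equal to a cutoff to the higher group. `parse_cutoffs` and `group_index` are hypothetical names.

```python
from bisect import bisect_right


def parse_cutoffs(text):
    """Turn '2,5,10' into a sorted list of numbers."""
    return sorted(float(v) for v in text.split(",") if v.strip())


def group_index(score, cutoffs):
    """Return 0 for scores below the first cutoff, 1 for the next bin, etc.

    Assumption: a score equal to a cutoff falls into the higher bin.
    """
    return bisect_right(cutoffs, score)


cutoffs = parse_cutoffs("2,5,10")
groups = {score: group_index(score, cutoffs) for score in (1.0, 3.0, 7.0, 12.0)}
print(groups)
```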

    Output Policy

    Specify the file for the output of back mutations.

    Results

    The mutation grouping results file, mutate_policy.json, is generated based on different cutoff values.

  • Name: Template-guided Structure Prediction
    Description: Protein structure prediction with ColabFold, guided by a custom protein structure template.
    Tags: undefined
    Author: Mirdita M
    Release: 2024-11-04 15:24:56
    Reference: Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022 Jun;19(6):679-682.

    Template-guided structure prediction

    简介

    基于自定义的蛋白结构模板,采用colabfold进行蛋白结构预测。

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式

    Template Structure

    蛋白的模板结构,PDB格式

    结果说明

    输出文件名称 说明
    rank_001.pdb 预测得到的最佳复合物结构。
    pdbs.tar.gz 预测得到的前5个最佳复合物结构的压缩包文件。
    scores.csv 预测结构的评分文件

    其中scores.csv包含如下信息:

    字段名称 说明
    Name 预测结构的文件名
    pLDDT 局部结构的可信度指标,值范围是0-100,该值越大说明预测的结构越可靠。低于70被认为可靠性较低,低于50基本认为是可信度非常低,为无序预测
    pTM 预测的TM分数(the predicted template modeling score),衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM 预测的亚基接触面的TM分数(the interface predicted template modeling score),当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定

    参考文献

    Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022 Jun;19(6):679-682.

    Template-guided Structure Prediction

    Introduction

    Protein structure prediction is performed using ColabFold based on a custom protein structure template.

    Parameter

    Protein Sequence

    The sequence file of the protein in FASTA format.

    Template Structure

    The template structure of the protein in PDB format.

    Result Description

    Output File Name Description
    rank_001.pdb The predicted best complex structure.
    pdbs.tar.gz A compressed file containing the top 5 best complex structures.
    scores.csv The scoring file for the predicted structures.

    The scores.csv file contains the following information:

    Field Name Description
    Name The file name of the predicted structure.
    pLDDT The confidence score for local structure, with a range of 0-100. A higher value indicates a more reliable prediction. Values below 70 are considered low reliability, below 50 are generally regarded as very low confidence, indicating disordered predictions.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure. A higher score indicates greater accuracy. If the score is greater than 0.5, it suggests that the overall folding of the structure may be similar to the true structure.
    ipTM The interface predicted template modeling score, which measures the accuracy of the predicted relative positions of the subunits in the complex. A higher score indicates greater accuracy. Scores above 0.8 indicate high-quality predictions, while scores below 0.6 suggest potential failure of the prediction. Scores between 0.6 and 0.8 are in a gray area, where the accuracy of the prediction is uncertain.

    References

    Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022 Jun;19(6):679-682.

  • Name: TCR-pMHC Complex Structure Prediction
    Description: 细胞免疫系统是人体免疫的重要组成部分,它使用 T 细胞受体 (TCR) 识别由主要组织相容性复合体 (MHC) 蛋白呈递的肽形式的抗原蛋白。准确定义TCR的结构基础及其与肽-MHC的结合可以为正常和异常免疫提供重要见解,并有助于指导疫苗和免疫疗法的设计。鉴于实验确定的TCR-肽-MHC结构数量有限,而每个个体内的TCR以及抗原靶标数量巨大,因此需要准确的建模方法。该模块基于TCRmodel2实现,TCRmodel2在AlphaFold基础上对TCR-肽-MHC复合物建模做了优化,与原生AlphaFold和其他基于基准测试的TCR-肽-MHC复合物建模方法相比,其准确度相似或更高,可在30分钟内完成复合物结构预测。 The cellular immune system is a crucial component of the human immune response, utilizing T cell receptors (TCRs) to recognize peptide-form antigens presented by major histocompatibility complex (MHC) proteins. Accurately defining the structural basis of TCRs and their binding to peptide-MHC complexes can provide important insights into both normal and abnormal immune responses and assist in guiding the design of vaccines and immunotherapies. Given the limited number of experimentally determined TCR-peptide-MHC structures and the vast number of TCRs and antigen targets within each individual, accurate modeling methods are needed. This module is based on TCRmodel2, which optimizes TCR-peptide-MHC complex modeling on the foundation of AlphaFold. It achieves comparable or higher accuracy than native AlphaFold and other benchmark-based TCR-peptide-MHC modeling methods, completing complex structure predictions within 30 minutes.
    Tags: undefined
    Author: Rui Yin
    Release: 2024-11-08 10:35:19
    Reference: Yin R, Ribeiro-Filho HV, Lin V, Gowthaman R, Cheung M, Pierce BG. TCRmodel2: high-resolution modeling of T cell receptor recognition using deep learning. Nucleic Acids Res. 2023 Jul 5;51(W1):W569-W576.

    TCR-pMHC Complex Structure Prediction

    简介

    细胞免疫系统是人体免疫的重要组成部分,它使用 T 细胞受体 (TCR) 识别由主要组织相容性复合体 (MHC) 蛋白呈递的肽形式的抗原蛋白。准确定义TCR的结构基础及其与肽-MHC的结合可以为正常和异常免疫提供重要见解,并有助于指导疫苗和免疫疗法的设计。鉴于实验确定的TCR-肽-MHC结构数量有限,而每个个体内的TCR以及抗原靶标数量巨大,因此需要准确的建模方法。该模块基于TCRmodel2实现,TCRmodel2在AlphaFold基础上对TCR-肽-MHC复合物建模做了优化,与原生AlphaFold和其他基于基准测试的TCR-肽-MHC复合物建模方法相比,其准确度相似或更高,可在30分钟内完成复合物结构预测。

    参数说明

    TCR α

    TCR α链的序列,如:AQEVTQIPAALSVPEGENLVLNCSFTDSAIYNLQWFRQDPGKGLTSLLLIQSSQREQTSGRLNASLDKSSGRSTLYIAASQPGDSATYLCAVTNQAGTALIFGKGTTLSVSS

    TCR β

    TCR β链的序列,如:NAGVTQTPKFQVLKTGQSMTLQCSQDMNHEYMSWYRQDPGMGLRLIHYSVGAGITDQGEVPNGYNVSRSTTEDFPLRLLSAAPSQTSVYFCASSYSIRGSRGEQFFGPGTRLTVL

    Peptide Sequence

    多肽序列,如:LAWEWWRTVAL
    注:输入的多肽序列长度需要符合相应要求,如下:
    I型TCR-pMHC复合物中,多肽的序列长度在8-15之间;
    II型TCR-pMHC复合物中,多肽的长度为11。

    MHC(I or II α)

    MHC-I型序列或MHC-II α链序列。
    当预测I型TCR-pMHC复合物时,输入MHC-I型序列,如:SHSLKYFHTSVSRPGRGEPRFISVGYVDDTQFVRFDNDAASPRMVPRAPWMEQEGSEYWDRETRSARDTAQIFRVNLRTLRGYYNQSEAGSHTLQWMHGCELGPDGRFLRGYEQFAYDGKDYLTLNEDLRSWTAVDTAAQISEQKSNDASEAEHQRAYLEDTCVEWLHKYLEKGKETLLH
    当预测II型TCR-pMHC复合物时,输入MHC-II α链序列,如:IKADHVSTYAAFVQTHRPTGEFMFEFDEDEMFYVDLDKKETVWHLEEFGQAFSFEAQGGLANIAILNNNLNTLIQRSNHTQAT

    MHC II β

    MHC-II β链序列,当预测II型TCR-pMHC复合物时才需要输入,如:PENYLFQGRQECYAFNGTQRFLERYIYNREEFARFDSDVGEFRAVTELGRPAAEYWNSQKDILEEKRAVPDRMCRHNYELGGPMTLQR

    结果说明

    输出结果包括:

    输出文件名称 说明
    ranked_0.pdb 预测得到的最佳复合物结构。
    pdbs.tar.gz 预测得到的前5个最佳复合物结构的压缩包文件。
    scores.csv 结构评分文件

    其中scores.csv包含如下信息:

    字段名称 说明
    PDB 复合物PDB结构的文件名
    Model_Confidence 结构的置信度评分,是pTM与ipTM评分的加权综合值,数值在0-1之间,越接近1表示结构模型质量越好
    pLDDT 局部结构的可信度指标,值范围是0-100,该值越大说明预测的结构越可靠。低于70被认为可靠性较低,低于50基本认为是可信度非常低,为无序预测
    pTM the predicted template modeling score预测的TM分数,衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM the interface predicted template modeling score预测的亚基接触面的TM分数,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定
    TCR-pMHC_ipTM TCR与pMHC之间的ipTM值

    参考文献

    Yin R, Ribeiro-Filho HV, Lin V, Gowthaman R, Cheung M, Pierce BG. TCRmodel2: high-resolution modeling of T cell receptor recognition using deep learning. Nucleic Acids Res. 2023 Jul 5;51(W1):W569-W576.

    TCR-pMHC Complex Structure Prediction

    Introduction

    The cellular immune system is a crucial component of human immunity. It uses T cell receptors (TCRs) to recognize antigens presented as peptides by major histocompatibility complex (MHC) proteins. Accurately defining the structural basis of TCRs and their binding to peptide-MHC complexes can provide important insights into both normal and abnormal immune responses and help guide the design of vaccines and immunotherapies. Because few TCR-peptide-MHC structures have been determined experimentally while each individual carries a vast number of TCRs and antigen targets, accurate modeling methods are needed. This module is based on TCRmodel2, which optimizes AlphaFold for TCR-peptide-MHC complex modeling. In benchmark tests, its accuracy is similar to or higher than that of native AlphaFold and other TCR-peptide-MHC modeling methods, and a complex structure prediction completes within 30 minutes.

    Parameter

    TCR α

    The sequence of the TCR α chain, for example: AQEVTQIPAALSVPEGENLVLNCSFTDSAIYNLQWFRQDPGKGLTSLLLIQSSQREQTSGRLNASLDKSSGRSTLYIAASQPGDSATYLCAVTNQAGTALIFGKGTTLSVSS

    TCR β

    The sequence of the TCR β chain, for example: NAGVTQTPKFQVLKTGQSMTLQCSQDMNHEYMSWYRQDPGMGLRLIHYSVGAGITDQGEVPNGYNVSRSTTEDFPLRLLSAAPSQTSVYFCASSYSIRGSRGEQFFGPGTRLTVL

    Peptide Sequence

    The peptide sequence, for example: LAWEWWRTVAL.
    Note: The length of the input peptide sequence must meet the following requirements:
    For Class I TCR-pMHC complexes, the peptide must be 8 to 15 residues long;
    For Class II TCR-pMHC complexes, the peptide must be exactly 11 residues long.
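
    A peptide can be checked against these limits before a job is submitted. The sketch below assumes that both ends of the Class I range are inclusive, which the text does not state; `check_peptide_length` is a hypothetical helper.

```python
def check_peptide_length(peptide, mhc_class):
    """Return True if the peptide length fits the stated limit for the class."""
    n = len(peptide.strip())
    if mhc_class == "I":
        return 8 <= n <= 15  # assumption: both bounds inclusive
    if mhc_class == "II":
        return n == 11
    raise ValueError(f"mhc_class must be 'I' or 'II', got {mhc_class!r}")


example_peptide = "LAWEWWRTVAL"
ok_class_i = check_peptide_length(example_peptide, "I")
ok_class_ii = check_peptide_length(example_peptide, "II")
print(ok_class_i, ok_class_ii)
```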

    MHC (I or II α)

    The MHC-I sequence or MHC-II α chain sequence.
    When predicting Class I TCR-pMHC complexes, input the MHC-I sequence, for example: SHSLKYFHTSVSRPGRGEPRFISVGYVDDTQFVRFDNDAASPRMVPRAPWMEQEGSEYWDRETRSARDTAQIFRVNLRTLRGYYNQSEAGSHTLQWMHGCELGPDGRFLRGYEQFAYDGKDYLTLNEDLRSWTAVDTAAQISEQKSNDASEAEHQRAYLEDTCVEWLHKYLEKGKETLLH.
    When predicting Class II TCR-pMHC complexes, input the MHC-II α chain sequence, for example: IKADHVSTYAAFVQTHRPTGEFMFEFDEDEMFYVDLDKKETVWHLEEFGQAFSFEAQGGLANIAILNNNLNTLIQRSNHTQAT.

    MHC II β

    The MHC-II β chain sequence, which is required only when predicting Class II TCR-pMHC complexes, for example: PENYLFQGRQECYAFNGTQRFLERYIYNREEFARFDSDVGEFRAVTELGRPAAEYWNSQKDILEEKRAVPDRMCRHNYELGGPMTLQR.

    Result

    The output results include:

    Output File Name Description
    ranked_0.pdb The predicted best complex structure.
    pdbs.tar.gz A compressed file containing the top 5 predicted complex structures.
    scores.csv Structure scoring file.

    The scores.csv contains the following information:

    Field Name Description
    PDB The filename of the complex PDB structure.
    Model_Confidence The confidence score of the structure, which is a weighted composite value of pTM and ipTM scores, ranging from 0 to 1, with values closer to 1 indicating better model quality.
    pLDDT A measure of the reliability of the local structure, ranging from 0 to 100; higher values indicate more reliable predictions. Values below 70 are considered low reliability, and below 50 are deemed very low reliability, indicating disordered predictions.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure; higher values indicate greater accuracy. A score greater than 0.5 suggests that the overall folding of the structure may resemble the true structure.
    ipTM The interface predicted template modeling score, which measures the accuracy of the predicted relative positions of subunits within the complex; higher values indicate greater accuracy. Scores above 0.8 indicate high-quality predictions, while scores below 0.6 suggest potential failure, and scores between 0.6 and 0.8 are in a gray area where correctness is uncertain.
    TCR-pMHC_ipTM The ipTM value between the TCR and pMHC.
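
    The ipTM thresholds from the table can be applied when screening scores.csv. The sketch below encodes only those thresholds; the function name `interpret_iptm` and the returned labels are hypothetical.

```python
def interpret_iptm(iptm):
    """Classify an ipTM value using the thresholds given in the table."""
    if iptm > 0.8:
        return "high quality"
    if iptm < 0.6:
        return "possible failure"
    return "uncertain"  # 0.6 to 0.8: gray area


labels = [interpret_iptm(v) for v in (0.85, 0.70, 0.40)]
print(labels)
```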

    References

    Yin R, Ribeiro-Filho HV, Lin V, Gowthaman R, Cheung M, Pierce BG. TCRmodel2: high-resolution modeling of T cell receptor recognition using deep learning. Nucleic Acids Res. 2023 Jul 5;51(W1):W569-W576.

  • Name: Alanine Scan (MMPBSA v2)
    Description: Alanine Scan (MMPBSA)是计算丙氨酸突变后的结合自由能。 Alanine Scan (MMPBSA) calculates components of binding free energy after alanine mutation using the MM-PBSA method.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-08-03 09:10:47
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMPBSA)

    简介

    Alanine Scan (MMPBSA)是计算丙氨酸突变后的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMPBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Mutation Residue

    突变扫描为丙氨酸(ALA)的氨基酸位置。格式为res1:res2:res3:res4,其中“res1-res4”数字为残基编号。

    Force File

    丙氨酸扫描时使用的力场。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。只能是AMBER力场下构建的膜体系才能进行计算。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMPBSA_result.txt MMPBSA丙氨酸突变结果汇总文件。
    MMPBSA_Residue.csv 丙氨酸突变能量分解数据CSV文件。
    MMPBSA.pdb 丙氨酸突变后,原子对应的MMPBSA能量放到PDB文件。可以做对应能量类别的表面图,从图的颜色深浅可以看出哪些区域对结合能贡献值较大。
    MMPBSA.tar.gz MMPBSA所有原始文件。包括_mmpbsa_residue_#.txt是每个残基对应的每个能量类别的数值,共包含7个能量类别:范德华能(VDW)、静电能(ELE)、溶剂化能极性部分(PB)、溶剂化能非极性部分(SA)、VDW+ELE=MM、PB+SA=PBSA、MM+PBSA=Binding/MMPBSA。_mmpbsa_residue.txt是对上述7个文件的总结,即为MMPBSA_Residue.csv对应的原始文件。_mmpbsa_atom#.pdb是将每个原子对应的能量类别放到pdb文件,与MMPBSA.pdb相似。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMPBSA)

    Introduction

    Alanine Scan (MMPBSA) calculates the binding free energy after alanine mutations and provides energy decomposition data, binding constants (Ka), and inhibition constants (Ki). The entropy term is computed with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropy component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation; the calculation should be restricted to the structurally stable part of the simulation.
    This module provides four calculation methods. Trajectory calculates the receptor-ligand binding free energy from a molecular dynamics trajectory. One Structure calculates the receptor-ligand binding free energy for a single-frame PDB structure or a docked PDB structure; in the MMPBSA of One Structure workflow, a PDB file can be supplied directly.
    Index Name takes the group names of the receptor and ligand, while Custom Name takes the residue numbers of the receptor and ligand in the PDB.

    Parameter Description

    Trajectory Method

    Path File

    Path file produced by an MD simulation; it can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand, can be Protein, DNA, or RNA. If it is a small molecule, enter its name in the PDB. Use ‘Other’ if the system contains a ligand (including small molecules) other than proteins.

    Mutation Residue

    Amino acid positions to be scanned by mutation to alanine (ALA). The format is res1:res2:res3:res4, where res1 through res4 are residue numbers.

    Force File

    Force field used for alanine scanning.

    Start Time (ps)

    Time of the first frame, in ps. It is best to select a region where the RMSD is stable, to avoid an inflated entropy term caused by structural instability.

    End Time (ps)

    Time of the last frame, in ps. It is best to select a region where the RMSD is stable, to avoid an inflated entropy term caused by structural instability.

    Skip Time (ps)

    Time interval in ps.

    Index File

    Index file in ndx format. When there is a membrane structure in the system, you can obtain the index.ndx file through the Membrane Solvation module. In membrane simulations, the receptor is ‘receptor’, the ligand is ‘ligand’, and the membrane is ‘membrane’. Only membrane systems built under the AMBER force field can be calculated.

    Custom Receptor

    Define receptor groups for binding energy calculation. Enter the sequence numbers of protein amino acids or nucleic acid bases in the group. For example, 1-213 or 1-211,212-213. Protein amino acid or nucleic acid base numbers start from 1 and are independent of the initial pdb amino acid numbering.

    Custom Ligand

    Define ligand groups for binding energy calculation. Enter the sequence numbers of protein amino acids or nucleic acid bases in the group. For example, 1-213 or 1-211,212-213. Protein amino acid or nucleic acid base numbers start from 1 and are independent of the initial pdb amino acid numbering.
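
    The range syntax used by Custom Receptor and Custom Ligand can be illustrated with a minimal sketch. The function name parse_residue_ranges is hypothetical and is not part of the module; it only shows how a specification such as 1-211,212-213 expands into 1-based residue indices, assuming ranges are inclusive.

```python
def parse_residue_ranges(spec: str) -> list[int]:
    """Expand a spec such as '1-211,212-213' into sorted 1-based indices.

    Hypothetical helper for illustration; ranges are assumed inclusive.
    """
    residues = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
        else:
            start = end = int(part)
        # Indices start from 1 and a range must not run backwards.
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {part!r}")
        residues.update(range(start, end + 1))
    return sorted(residues)
```

    Under these assumptions, 1-213 and 1-211,212-213 select the same 213 residues.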

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    System parameter compression file in tar.gz format. Obtained from the MD Solvation module or Membrane Solvation module.

    Result Description

    The output includes:

    Output File Name Description
    MMPBSA_result.txt Summary file of MMPBSA alanine mutation results.
    MMPBSA_Residue.csv Energy decomposition data for alanine mutations in CSV format.
    MMPBSA.pdb PDB file in which each atom carries its MMPBSA energy after the alanine mutations. It can be used to draw a surface map for a given energy category; the color intensity shows which regions contribute most to the binding energy.
    MMPBSA.tar.gz All raw MMPBSA files. The _mmpbsa_residue_#.txt files contain the per-residue values of each energy category, 7 categories in total: van der Waals energy (VDW), electrostatic energy (ELE), polar part of the solvation energy (PB), nonpolar part of the solvation energy (SA), VDW+ELE=MM, PB+SA=PBSA, MM+PBSA=Binding/MMPBSA. _mmpbsa_residue.txt summarizes these 7 files and is the raw file behind MMPBSA_Residue.csv. The _mmpbsa_atom#.pdb files assign each energy category to the atoms of a PDB file, similar to MMPBSA.pdb.
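
    The relations between the seven energy categories can be written out as a short sketch. The class name ResidueEnergy and the numeric values in the usage note are illustrative assumptions, not output of the module; only the sums VDW+ELE=MM, PB+SA=PBSA and MM+PBSA=Binding come from the description above.

```python
from dataclasses import dataclass


@dataclass
class ResidueEnergy:
    """Per-residue energy components (hypothetical container)."""
    vdw: float  # van der Waals energy
    ele: float  # electrostatic energy
    pb: float   # polar part of the solvation energy
    sa: float   # nonpolar part of the solvation energy

    @property
    def mm(self) -> float:
        # MM = VDW + ELE
        return self.vdw + self.ele

    @property
    def pbsa(self) -> float:
        # PBSA = PB + SA
        return self.pb + self.sa

    @property
    def binding(self) -> float:
        # Binding (MMPBSA) = MM + PBSA
        return self.mm + self.pbsa
```

    For example, with made-up components VDW=-3.0, ELE=-1.5, PB=2.0 and SA=-0.5, the sketch gives MM=-4.5, PBSA=1.5 and Binding=-3.0.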

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: Human Germline BLAST (v2.0)
    Description: 通过序列比对在人类生殖系数据库中搜索与目标抗体序列接近的同源模板,输出对应的模板序列以及序列一致性信息。 Search the human germline database for homologs of the target antibody sequence, and output the template sequences and the corresponding identities.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-12-20 16:24:53
    Reference:

    Human Germline BLAST

    简介

    通过序列比对在人类生殖系数据库中搜索与目标抗体序列最接近的同源模板,输出对应的模板序列以及序列一致性信息。

    参数说明

    Sequence String模式

    Input Sequence

    抗体的序列(纯序列信息,非FASTA格式文件)。

    Type

    抗体编号类型:kabat、chothia、imgt。

    TopHits

    输出同源性最高的n条序列。

    Fasta File模式

    FASTA File

    抗体的序列文件,FASTA格式。

    Type

    抗体编号类型:kabat、chothia、imgt。

    TopHits

    输出同源性最高的n条序列。

    结果说明

    输出参数 输出文件名称 说明
    Hits Sequence hits.fasta 包含同源性最高的n条序列的序列文件
    Result result.json 包含找到的Germline模板以及序列的一致性信息

    相关内容

    抗体常用的germline模板:
    image.png

    临床后期及已上市抗体的germline配对分布情况(统计自Adimab数据集):
    image.png
    image.png
    Adimab_germline_usage.jpeg

    Human Germline BLAST

    Introduction

    This module uses sequence alignment to search the human germline database for the homologous templates closest to a target antibody sequence. It outputs the template sequences along with their sequence identity information.

    Parameter Description

    Sequence String Mode

    Input Sequence

    The antibody sequence as a plain sequence string (not a FASTA file).

    Type

    Type of antibody numbering: kabat, chothia, imgt.

    TopHits

    Number of top hits to output.

    Fasta File Mode

    FASTA File

    Antibody sequence file in FASTA format.

    Type

    Type of antibody numbering: kabat, chothia, imgt.

    TopHits

    Number of top hits to output.

    Result Description

    Output Parameter Output File Name Description
    Hits Sequence hits.fasta Sequence file containing the n most homologous sequences
    Result result.json File containing the Germline templates found and their sequence identity information

    Related Content

    Commonly used germline templates for antibodies:
    image.png

    Distribution of germline pairings among late-stage clinical and marketed antibodies (statistics from the Adimab dataset):
    image.png
    image.png
    Adimab_germline_usage.jpeg

  • Name: Humanization Report (v2.3)
    Description: 抗体人源化设计报告生成模块,用于生成最终的抗体人源化设计报告以及相应的专利实施例段落。 Humanization Report is an antibody humanization design reporting module for generating the final humanization design report as well as the corresponding patent example paragraphs.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-01-19 09:19:22
    Reference:

    Humanization Report

    简介

    Humanization Report是抗体人源化设计报告生成模块,用于生成最终的抗体人源化设计报告以及相应的专利实施例段落。

    参数说明

    Graft Policy

    Grafting模块生成的Graft Policy文件。

    Mutate Policy

    Back Mutation Grouping模块生成的Policy文件。

    结果说明

    输出结果包括:

    输出文件名称 说明
    BM.pptx 回复突变位点汇总文件
    batch_registration_template.xlsx 批量注册模板文件
    hotspot_summary.xlsx 风险位点总结
    patent_example_template.docx 人源化设计序列在相应的专利实施例段落
    humanized_variants.fasta 抗体人源化设计序列文件,FASTA格式
    Report.docx 抗体人源化设计报告,包括整个人源化设计过程涉及的序列、分组等信息

    其中batch_registration_template.xlsx包含如下信息:

    字段名称 说明
    Protein Sequence 蛋白序列
    Molecule Name 分子名称

    其中hotspot_summary.xlsx包含如下信息:

    字段名称 说明
    ID 抗体序列名称
    Sequence-CDR CDR序列区域
    Deamidation 脱酰胺位点
    Isomerization 异构化位点
    Cleavage 酶切位点
    Hydrolysis 水解位点
    Glycosylation 糖基化位点
    Cys 半胱氨酸数量
    Oxidation 氧化位点
    High risk 高风险率
    High risk sites 高风险位点

    Humanization Report

    Introduction

    Humanization Report is the report generation module for antibody humanization design. It generates the final antibody humanization design report and the corresponding patent example paragraphs.

    Parameter Description

    Graft Policy

    The Graft Policy file generated by the Grafting module.

    Mutate Policy

    The Policy file generated by the Back Mutation Grouping module.

    Result Description

    The output results include:

    Output File Name Description
    BM.pptx Summary file of back mutation sites
    batch_registration_template.xlsx Batch registration template file
    hotspot_summary.xlsx Summary of hotspot sites
    patent_example_template.docx Humanization design sequences placed in the corresponding patent example paragraphs
    humanized_variants.fasta Antibody humanization design sequence file in FASTA format
    Report.docx Antibody humanization design report, including information on sequences, grouping, etc., involved in the entire humanization design process

    The batch_registration_template.xlsx file contains the following information:

    Field Name Description
    Protein Sequence Protein sequence
    Molecule Name Molecule name

    The hotspot_summary.xlsx file contains the following information:

    Field Name Description
    ID Antibody sequence name
    Sequence-CDR CDR sequence region
    Deamidation Deamidation site
    Isomerization Isomerization site
    Cleavage Cleavage site
    Hydrolysis Hydrolysis site
    Glycosylation Glycosylation site
    Cys Number of cysteines
    Oxidation Oxidation site
    High risk High-risk rate
    High risk sites High-risk sites
  • Name: Back Mutation Grouping (v2.3)
    Description: Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。 Back Mutation Grouping is a grouping module in the antibody humanization design workflow, which groups the back mutations based on the mutation scoring table generated by the Mutation Score module and returns the back mutated sequence.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-01-17 15:21:07
    Reference:

    Back Mutation Grouping v2.3

    简介

    Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。

    参数说明

    Grafted Chain

    抗体CDR区嫁接后序列文件,FASTA格式,由Grafting模块生成

    Raw Chain

    抗体序列文件,FASTA格式

    Mutation Score

    人源化突变评分文件,CSV格式,由Mutation Score模块生成

    Output File

    指定输出的突变序列文件名称,FASTA格式

    Cutoff

    打分分组的截断值,逗号分割,例如:2,5,10表示将氨基酸突变评分大于10的为一组,5~10的氨基酸为一组,小于2的氨基酸分为一组。

    Output Policy

    指定输出的回复突变的文件

    结果说明

    根据不同截断值得到突变分组结果文件mutate_policy.json。

    Back Mutation Grouping v2.3

    Introduction

    Back Mutation Grouping is a grouping module in the humanization design process of antibodies, which groups back mutations based on the mutation score table output by the Mutation Score module and returns the mutated sequences.

    Parameters

    Grafted Chain

    CDR-grafted antibody sequence file, in FASTA format, generated by the Grafting module.

    Raw Chain

    Sequence file of the antibody, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Specify the name of the output mutation sequence file, in FASTA format.

    Cutoff

    Cutoff values for grouping by score, separated by commas. For example, 2,5,10 places amino acid mutations scoring above 10 in one group, those scoring between 5 and 10 in another group, and those scoring below 2 in a separate group.
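
    How a comma-separated cutoff string partitions scores can be sketched as follows. This is an illustration rather than the module's implementation: the function name group_by_cutoffs, the mutation labels in the usage note, and the handling of scores that fall exactly on a cutoff (assigned here to the lower group) are all assumptions. With three cutoffs the sketch yields four intervals, including the one between 2 and 5.

```python
from bisect import bisect_left


def group_by_cutoffs(scores: dict[str, float], cutoffs: str) -> dict[int, list[str]]:
    """Assign each mutation to an interval defined by the cutoffs.

    Group 0 holds the lowest scores; the highest group holds scores
    strictly above the largest cutoff. Hypothetical helper.
    """
    bounds = sorted(float(c) for c in cutoffs.split(","))
    groups: dict[int, list[str]] = {i: [] for i in range(len(bounds) + 1)}
    for mutation, score in scores.items():
        # bisect_left keeps a score equal to a cutoff in the lower group.
        groups[bisect_left(bounds, score)].append(mutation)
    return groups
```

    With the cutoffs 2,5,10 and illustrative scores of 12.3, 7.0, 3.1 and 0.5, the four mutations land in groups 3, 2, 1 and 0 respectively.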

    Output Policy

    Specify the file for the output of back mutations.

    Results

    The mutation grouping results file, mutate_policy.json, is generated based on different cutoff values.

  • Name: Mutation Score (v2.3)
    Description: Mutation Score是抗体人源化设计中核心模块,是一个基于结构的自动化评分模块。该模块基于抗体的结构信息以及CDR嫁接后的序列信息,对graft后的FR区的每个氨基酸在替换前后的变化程度进行量化评分。评分越高,说明CDR嫁接过程中氨基酸的替换对CDR区的构象可能影响较大,需要进行回复突变。模块输出每个氨基酸的打分值,用于抗体人源化设计流程中后续的分组和人源化抗体序列的生成。 Mutation Score is a core module in the antibody humanization design workflow: a structure-based, automated scoring module. Based on the structure of the antibody and the CDR-grafted sequence, this module quantitatively scores the degree of change before and after the replacement of each amino acid in the FR region. The higher the score, the greater the potential impact of the amino acid replacement on the conformation of the CDR region during CDR grafting, indicating the need for a back mutation. The module outputs the score for each amino acid, which is used for subsequent grouping and generation of humanized antibody sequences in the antibody humanization design workflow.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 11:14:32
    Reference: To be submitted

    Mutation Score

    简介

    Mutation Score是抗体人源化设计中核心模块,是一个基于结构的自动化评分模块。该模块基于抗体的结构信息以及CDR嫁接后的序列信息,对移植抗体(graft)后的FR区的每个氨基酸在替换前后的变化程度进行量化评分。评分越高,说明CDR嫁接过程中氨基酸的替换对CDR区的构象可能影响较大,需要进行回复突变。模块输出每个氨基酸的打分值,用于抗体人源化设计流程中后续的分组和人源化抗体序列的生成。

    参数说明

    Sequence File

    抗体Fv区序列文件,FASTA格式。

    Model File

    抗体结构文件,PDB格式。

    Grafted Sequence

    抗体CDR区Graft后的序列文件,FASTA格式。

    Output Score

    指定输出打分文件的名称,CSV格式。

    Antibody Type

    抗体类型:

    • Antibody:常规抗体
    • Nanobody:纳米抗体

    结果说明

    输出结果文件为score.csv,包含信息如下:

    字段名称 说明
    Chain 轻链或重链
    UID 为残基的标准编号(默认为 Kabat)
    Position 残基在序列中的位置
    Donor Residue 原始氨基酸
    Template Residue 人源模板的目标氨基酸
    score 回复突变智能评分,Score 越高,认为其回复突变的必要性越高。通常Score>10为高优先级,5-10为中优先级,其他为低优先级

    Mutation Score

    Introduction

    Mutation Score is a core module in antibody humanization design: a structure-based, automated scoring module. Using the antibody structure and the CDR-grafted sequence, it quantitatively scores how much each amino acid in the FR region changes when it is replaced during grafting. A higher score indicates that the replacement made during CDR grafting is more likely to affect the conformation of the CDR region, suggesting that a back mutation is needed. The module outputs a score for each amino acid, which is used for the subsequent grouping and generation of humanized antibody sequences in the antibody humanization design process.

    Parameter Description

    Sequence File

    Sequence file of the antibody Fv region in FASTA format.

    Model File

    Antibody structure file in PDB format.

    Grafted Sequence

    CDR-grafted antibody sequence file, in FASTA format.

    Output Score

    Specify the name of the output scoring file in CSV format.

    Antibody Type

    Type of antibody:

    • Antibody: Conventional antibody
    • Nanobody: Nanobody

    Result Description

    The output result file is named score.csv and includes the following information:

    Field Name Description
    Chain Light chain or heavy chain
    UID Standard numbering for residues (default is Kabat)
    Position Position of the residue in the sequence
    Donor Residue Original amino acid
    Template Residue Target amino acid from the human template
    Score Back mutation score; a higher score suggests a greater need for a back mutation. Typically, a Score > 10 is high priority, 5-10 is medium priority, and all others are low priority.
  • Name: Grafting (v2.3)
    Description: Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.3 Graft antibody CDRs to target frameworks, normally for humanization. Version: v2.3
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-09-12 09:40:16
    Reference:

    Grafting v2.3

    简介

    Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.3

    参数说明

    Antibody Sequence File

    抗体序列文件,FASTA格式

    Numbering Type

    抗体编号规则:kabat,imgt,chothia

    Output File

    指定输出抗体graft后的序列文件名称,FASTA格式

    Output Policy

    指定输出graft策略文件,JSON格式

    Germline Score

    指定输出抗体FR区序列比对同源性打分文件

    Germline

    指定轻链或重链使用特定germline模板,也可都指定,写法如下:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    其中链名来自于流程第一步输入的fasta文件。
    例1:以下语句为链"Infliximab.H"指定了模板"IGHV3-7*01":

    Infliximab.H:IGHV3-7*01
    

    例2:以下语句为两条链分别指定了模板:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
    

    Template Sequence

    指定参考模板序列,FASTA格式

    Germline Hits

    指定输出FR区序列比对结果文件,FASTA格式

    Number of Hits

    指定输出命中序列的数目

    结果说明

    输出结果包括:

    输出文件名称 说明
    germline_hits.fasta 输出FR区序列比对结果文件
    germline_score.json 输出抗体FR区序列比对同源性打分文件
    grafted.fasta 输出抗体graft后的序列文件名称
    graft_policy.json 输出graft策略文件

    Grafting v2.3

    Introduction

    The Grafting module is used to graft the CDRs of an antibody onto a specific framework region template, typically used for humanization design. Version: v2.3

    Parameter Description

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Numbering Type

    Antibody numbering rule: kabat, imgt, chothia.

    Output File

    Specify the output file name for the grafted antibody sequence in FASTA format.

    Output Policy

    Specify the output grafting strategy file in JSON format.

    Germline Score

    Specify the output file for the homology scores of the antibody FR region sequences.

    Germline

    Specify the specific germline template to be used for the light chain or heavy chain, or both, in the format:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    The chain names come from the FASTA file supplied in the first step of the workflow.
    Example 1: The following statement specifies the template “IGHV3-7*01” for the chain “Infliximab.H”:

    Infliximab.H:IGHV3-7*01
    

    Example 2: The following statement specifies templates for two chains separately:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
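
    The Germline string format above can be illustrated with a minimal parsing sketch. The function name parse_germline_spec is hypothetical and is not part of the Grafting module; the sketch assumes each entry is split at its first colon.

```python
def parse_germline_spec(spec: str) -> dict[str, str]:
    """Parse 'seq_name1:germline_name1,seq_name2:germline_name2'.

    Returns a mapping from chain name to germline template name.
    Hypothetical helper for illustration only.
    """
    mapping: dict[str, str] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # Split at the first colon: chain name on the left, germline on the right.
        name, sep, germline = item.partition(":")
        if not sep or not name or not germline:
            raise ValueError(f"expected seq_name:germline_name, got {item!r}")
        mapping[name] = germline
    return mapping
```

    Applied to Example 2, the sketch maps Infliximab.H to IGHV3-7*01 and Infliximab.L to IGKV1-39*01.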
    

    Template Sequence

    Specify the reference template sequence in FASTA format.

    Germline Hits

    Specify the output file for the FR region sequence alignment results in FASTA format.

    Number of Hits

    Specify the number of sequences to output.

    Result Description

    The output includes:

    Output File Name Description
    germline_hits.fasta Output file for FR region sequence alignment results
    germline_score.json Output file for homology scores of the antibody FR region sequences
    grafted.fasta Output file name for the grafted antibody sequence
    graft_policy.json Output file for the grafting strategy
  • Name: Antibody Numbering v2
    Description: 抗体编号模块,用于注释抗体可变区(Fv)或恒定区(包括 Fc), 支持几乎所有主流的抗体编号规则,如可变区广泛使用的Kabat、Chothia 和 IMGT,以及恒定区主要使用的EU规则。 A module for antibody numbering for variable regions and constant regions. Mainstream numbering schemes are supported, e.g., Kabat, Chothia, and IMGT are widely used for Fv, and EU is the most used scheme for the constant region.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-09-23 16:45:09
    Reference:

    Antibody Numbering v2

    简介

    Antibody Numbering v2是抗体编号模块,用于注释抗体可变区(Fv)或恒定区(包括 Fc), 支持几乎所有主流的抗体编号规则,如可变区广泛使用的Kabat、Chothia 和 IMGT,以及恒定区主要使用的EU规则。

    参数说明

    Variable Region (Fv)模式

    Fasta File

    抗体序列文件,FASTA格式,支持多序列模式。

    Numbering Scheme

    编号规则,支持Kabat、Chothia、IMGT,可多选。

    Constant Region (Fc)模式

    Fasta File

    抗体序列文件,FASTA格式,支持多序列模式。

    结果说明

    • Variable Region (Fv)模式下的输出结果包括:
    输出文件名称 说明
    output_chothia(imgt\kabat).csv 抗体可变区三种编号规则的csv文件
    output_chothia(imgt\kabat).json 抗体可变区三种编号规则的json文件

    三种不同编号规则的csv文件,包含信息如下:

    字段名称 说明
    molecule 抗体序列名称
    chain_type 抗体链类型:重链(VH)或者轻链(VL)
    is_cdr 判断是否为CDR区
    loc 序列位置
    numbering 序列编号
    insertion 插入序列编号
    region 抗体可变区类型:CDR1、CDR2或者CDR3
    domain 区域
    • Constant Region (Fc)模式下的输出结果包括:
    输出文件名称 说明
    output_EU.csv 抗体恒定区EU编号规则的csv文件
    output_EU.json 抗体恒定区EU编号规则的json文件

    其中output_EU.csv文件,包含信息如下:

    字段名称 说明
    Chain 抗体序列链类型
    Position 序列位置
    Eu numbering 序列EU编号
    Residue 抗体氨基酸缩写
    IgG1 Ref IgG1氨基酸缩号
    Region 抗体恒定类型:CH1、CH2、CH3、Hinge
    Mutation(IgG1) 原序列突变成IgG1的突变信息

    Antibody Numbering v2

    Introduction

    Antibody Numbering v2 is the antibody numbering module for the annotations of antibody variable region (Fv) or constant region (including Fc). It supports almost all mainstream antibody numbering rules, such as Kabat, Chothia and IMGT, which are widely used in the variable region, and EU rules, which are mainly used in the constant region.

    Parameter

    Variable Region (Fv) Mode

    Fasta File

    Antibody sequence file in FASTA format.

    Numbering Scheme

    Numbering scheme: Kabat, Chothia, or IMGT. Multiple schemes can be selected.

    Report

    Visualize all three Fv numbering schemes and the CDR regions in an HTML page.

    Constant Region (Fc) Mode

    Fasta File

    Antibody sequence file in FASTA format.

    Result

    • Variable Region (Fv) mode contains the following output results:
    Output File Name Description
    results.html HTML page visualizing all three Fv numbering schemes and the CDR regions.
    output_chothia(imgt\kabat).csv CSV files of the Fv numbering under each of the three schemes.
    output_chothia(imgt\kabat).json JSON files of the Fv numbering under each of the three schemes.

    Three csv files with different numbering rules contain the following information:

    Field Name Description
    molecule Antibody sequence name
    chain_type Antibody chain type: heavy chain (VH) or light chain (VL)
    is_cdr Whether the residue is in a CDR region
    loc Sequence position
    numbering Sequence numbering
    insertion Insertion sequence number
    region Antibody variable region type: CDR1, CDR2, or CDR3
    domain Area
    • Constant Region (Fc) mode contains the following output results:
    Output File Name Description
    output_EU.csv EU numberings for constant region in csv file
    output_EU.json EU numberings for constant region in json file

    The output_EU.csv file contains the following information:

    Field Name Description
    Chain Type of antibody sequence chain
    Position Sequence position
    Eu numbering Sequence EU numbering
    Residue Antibody amino acid abbreviation
    IgG1 Ref IgG1 amino acid abbreviation
    Region Constant Region type of antibody: CH1, CH2, CH3, Hinge
    Mutation(IgG1) Mutation information of the original sequence mutated into IgG1
  • Name: Immunogenicity Prediction (WeADApt v4.1.0)
    Description: 唯信开发的基于多模融合深度学习的端到端免疫原性预测系统WeADApt(原名:AlphaMHC)v4.1。该版本为备用版本,目前最新主力版本为v4.2。 The backup version of the deep learning immunogenicity prediction system, WeADApt (formerly known as AlphaMHC) v4.1. Latest version is v4.2.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-10-18 10:50:56
    Reference:

    Immunogenicity Prediction (WeADApt v4.1.0)

    简介

    WeADApt (Wecomput ADA prediction) 是一种基于多模融合架构的免疫原性预测系统。该方法有机地将多个与免疫原性相关的模型融合,构成一个高效的免疫反应模拟系统,可准确地模拟蛋白、抗体、多肽、疫苗等生物药的免疫原性,并能鉴别潜在的免疫原性的T细胞表位(引起临床人体免疫应答的肽段)。
    该模块为备份版本,最新版本为:v4.2。

    性能测试

    使用100多个临床及上市抗体的ADA数据的测试结果显示,预测的打分(MolScore)与ADA发生率的相关性达到R=0.68(下图)。

    image.png
    在同样的42个分子的数据集上,WeADApt预测的相关性超过了知名的商业软件EpiMatrix(R2=0.49 vs R2=0.42)。

    image.png

    打分

    0.2分适合作为单抗的高/低风险的阈值(>20% ADA定义为高风险)。

    关于双抗/多特异性分子

    这类分子仅需输入不重复的链即可
    在唯信收集的双抗ADA数据集的测试表现如下图所示。以0.6的分数作为分界线,可以较好的区分高、低风险的双抗分子。双抗
    注意,由于存在较多的B细胞清除双抗,其MOA会对ADA产生有较大的影响。

    image.png

    用法

    推荐从WeSeq中运行该功能,可以进行更多可视化交互

    image.png

    查看结果

    image.png
    Score为预测的免疫原性风险评分(范围0-1),Risk为风险评级

    image.png

    image.png

    注意对照结构,排除不可及(包埋的)表位(下图)
    image.png

    去免疫原性

    最简单的方式是进行人源片段的替换,可以直接在WeSeq中进行(下图)。
    image.png

    也可以通过频率分析功能引入人源突变。
    突变完之后再对突变体预测一下免疫原性是否降低。

    注意:从weseq中计算v4免疫原性的结果可以自动保存并且随时再打开的
    企业微信截图_17350890464449.png
    企业微信截图_1735089029621.png
    企业微信截图_17350890159377.png

    Immunogenicity Prediction (WeADApt v4.1.0)

    Introduction

    WeADApt (Wecomput ADA prediction) is an immunogenicity prediction system based on a multi-modal fusion architecture. This method integrates multiple models related to immunogenicity to form an efficient immune response simulation system. It can accurately simulate the immunogenicity of biologics such as proteins, antibodies, peptides, and vaccines, and identify potential immunogenic T-cell epitopes (peptide segments that elicit clinical human immune responses). This module is a backup version; the latest version is v4.2.

    Performance Testing

    Testing results using ADA data from over 100 clinical and marketed antibodies show that the predicted scores (MolScore) correlate with ADA incidence at R=0.68 (see the figure below).

    image.png

    On the same dataset of 42 molecules, WeADApt's predictions correlate more strongly with ADA incidence than those of the well-known commercial software EpiMatrix (R²=0.49 vs R²=0.42).

    image.png

    Scoring

    A score of 0.2 is suitable as a threshold for high/low risk in monoclonal antibodies (>20% ADA defined as high risk).

    About Bispecific/Multispecific Molecules

    For these molecules, only the non-redundant chains need to be entered. Performance on the bispecific ADA dataset collected by Wecomput is shown in the figure below. Using a score of 0.6 as the dividing line separates high-risk and low-risk bispecific molecules reasonably well. Note that many bispecifics deplete B cells, so their mechanism of action (MOA) can strongly affect ADA development.

    图片.png
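
    The two score thresholds mentioned above (0.2 for monoclonal antibodies, 0.6 for bispecifics) can be applied as in the sketch below. It rests on several assumptions: that a higher score means higher risk, that a score exactly at the threshold counts as low risk, and that the 0-1 score range applies. The function risk_label is hypothetical and does not reproduce the Risk rating that the module itself reports.

```python
MAB_THRESHOLD = 0.2         # monoclonal antibodies (from the Scoring section)
BISPECIFIC_THRESHOLD = 0.6  # bispecific molecules (from this section)


def risk_label(score: float, bispecific: bool = False) -> str:
    """Return 'high' or 'low' for a predicted score in the range 0-1.

    Assumption: scores strictly above the threshold are high risk.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score is expected to lie between 0 and 1")
    threshold = BISPECIFIC_THRESHOLD if bispecific else MAB_THRESHOLD
    return "high" if score > threshold else "low"
```

    Under these assumptions a score of 0.35 would be labeled high for a monoclonal antibody but low for a bispecific.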

    Usage

    It is recommended to run this function from WeSeq for more visual interactions.

    图片.png

    Viewing Results

    图片.png

    Score is the predicted immunogenicity risk score (range 0-1), and Risk is the risk rating.

    图片.png

    图片.png

    Check the epitopes against the structure and exclude inaccessible (buried) ones (see the figure below).

    图片.png

    De-immunization

    The simplest way is to perform human fragment replacement, which can be done directly in WeSeq (see the figure below).

    图片.png

    Human mutations can also be introduced through the frequency analysis feature. After mutation, predict the immunogenicity of the mutants to see if it has decreased.

    Note: v4 immunogenicity results calculated in WeSeq are saved automatically and can be reopened at any time.
    企业微信截图_17350890464449.png
    企业微信截图_1735089029621.png
    企业微信截图_17350890159377.png

  • Name: Grafting (v2.2)
    Description: Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.2 Graft antibody CDRs to target frameworks, normally for humanization. Version: v2.2
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-09-12 09:40:16
    Reference:

    Grafting v2.2

    简介

    Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.2

    参数说明

    Antibody Sequence File

    抗体序列文件,FASTA格式

    Numbering Type

    抗体编号规则:kabat,imgt,chothia

    Output File

    指定输出抗体graft后的序列文件名称,FASTA格式

    Output Policy

    指定输出graft策略文件,JSON格式

    Germline Score

    指定输出抗体FR区序列比对同源性打分文件

    Germline

    指定轻链或重链使用特定germline模板,也可都指定,写法如下:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    其中链名来自于流程第一步输入的fasta文件。
    例1:以下语句为链"Infliximab.H"指定了模板"IGHV3-7*01":

    Infliximab.H:IGHV3-7*01
    

    例2:以下语句为两条链分别指定了模板:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
    

    Template Sequence

    指定参考模板序列,FASTA格式

    Germline Hits

    指定输出FR区序列比对结果文件,FASTA格式

    Number of Hits

    指定输出命中序列的数目

    结果说明

    输出结果包括:

    输出文件名称 说明
    germline_hits.fasta 输出FR区序列比对结果文件
    germline_score.json 输出抗体FR区序列比对同源性打分文件
    grafted.fasta 输出抗体graft后的序列文件名称
    graft_policy.json 输出graft策略文件

    Grafting v2.2

    Introduction

    The Grafting module is used to graft the CDRs of an antibody onto a specific framework region template, typically used for humanization design. Version: v2.2

    Parameter Description

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Numbering Type

    Antibody numbering rule: kabat, imgt, chothia.

    Output File

    Specify the output file name for the grafted antibody sequence in FASTA format.

    Output Policy

    Specify the output grafting strategy file in JSON format.

    Germline Score

    Specify the output file for the homology scores of the antibody FR region sequences.

    Germline

    Specify the specific germline template to be used for the light chain or heavy chain, or both, in the format:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    The chain names come from the FASTA file supplied in the first step of the workflow.
    Example 1: The following statement specifies the template “IGHV3-7*01” for the chain “Infliximab.H”:

    Infliximab.H:IGHV3-7*01
    

    Example 2: The following statement specifies templates for two chains separately:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
    

    Template Sequence

    Specify the reference template sequence in FASTA format.

    Germline Hits

    Specify the output file for the FR region sequence alignment results in FASTA format.

    Number of Hits

    Specify the number of sequences to output.

    Result Description

    The output includes:

    Output File Name Description
    germline_hits.fasta Output file for FR region sequence alignment results
    germline_score.json Output file for homology scores of the antibody FR region sequences
    grafted.fasta Output file name for the grafted antibody sequence
    graft_policy.json Output file for the grafting strategy
  • Name: Disulfide Bond Search
    Description: Disulfide Bond Search模块计算蛋白质中潜在的二硫键位置,这对优化蛋白质的稳定性有所作用。二硫键作为对蛋白质的稳定性有极大的作用,但是加入不合理的二硫键也会容易引起聚集,表达量降低甚至错误折叠等不利影响。 The Disulfide Bond Search module calculates potential disulfide bond positions in proteins, which is useful for optimizing protein stability. Disulfide bonds contribute greatly to protein stability, but poorly chosen disulfide bonds can easily lead to aggregation, reduced expression, or even misfolding.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-09-07 10:46:01
    Reference:

    Disulfide Bond Search

    简介

    Disulfide Bond Search模块计算蛋白质中潜在的二硫键位置,这对优化蛋白质的稳定性有所作用。二硫键作为对蛋白质的稳定性有极大的作用,但是加入不合理的二硫键也会容易引起聚集,表达量降低甚至错误折叠等不利影响。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式

    Chain

    指定需要设计的链,多条链用逗号分割,例如:A,B。

    Position

    设置氨基酸序号,当参数Chain设置为A,C时,此参数如果设置为1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40意味着对A中的残基1 2 3…25和链C中的残基10 11 12…40进行设计。如果不填,则该链的所有残基都参与设计。
    注意:这里的氨基酸序号是从1开始,而不是PDB文件中带有的氨基酸序号。同一条链的氨基酸序号用空格分隔,不同链的氨基酸用逗号分隔。

    Interchain

    是否只选择链间的二硫键。

    Distance

    可设置Cβ之间的距离,默认5.0Å。

    结果说明

    输出结果包括:

    输出文件名称 说明
    ss_bond.csv 输出自然顺序编号、PDB文件中的残基编号以及Cβ之间的距离信息的CSV文件。
    ss_index.fasta 序列名编号为自然顺序编号并将预测位点突变为CYS的FASTA文件。
    ss_uid.fasta 序列名编号为PDB文件中的残基编号并将预测位点突变为CYS的的FASTA文件。

    Disulfide Bond Search

    Introduction

    The Disulfide Bond Search module calculates potential disulfide bond positions in proteins, which can be useful for optimizing protein stability. Disulfide bonds play a significant role in stabilizing proteins, but improper addition of disulfide bonds can lead to aggregation, reduced expression levels, or even misfolding.

    Parameter

    Structure PDB File

    The structure file of the protein in PDB format.

    Chain

    Specify the chains to be designed. Multiple chains are separated by commas, e.g. A,B.

    Position

    Residue positions to design. When the Chain parameter is set to A,C, setting this parameter to 1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40 designs the listed residues of chain A (1 2 3...25) and the listed residues of chain C (10 11 12...40). If left empty, all residues of the chain are included in the design.
    Note: Positions are counted from 1 along the sequence; they are not the residue numbers in the PDB file. Positions within the same chain are separated by spaces, and positions for different chains are separated by commas.
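
    The way the Chain and Position parameters pair up can be sketched as below. The function parse_positions is a hypothetical illustration, not the module's parser; it assumes the comma-separated position blocks follow the same order as the chains and uses None to stand for "all residues of the chain".

```python
from typing import Optional


def parse_positions(chains: str, positions: str) -> dict[str, Optional[list[int]]]:
    """Pair each chain with its block of space-separated 1-based positions.

    Hypothetical helper; None means every residue of the chain is designed.
    """
    chain_ids = [c.strip() for c in chains.split(",") if c.strip()]
    if not positions.strip():
        return {chain: None for chain in chain_ids}
    blocks = positions.split(",")
    if len(blocks) != len(chain_ids):
        raise ValueError("expected one position block per chain")
    result: dict[str, Optional[list[int]]] = {}
    for chain, block in zip(chain_ids, blocks):
        numbers = [int(token) for token in block.split()]
        if any(n < 1 for n in numbers):
            raise ValueError("positions are counted from 1")
        result[chain] = numbers or None
    return result
```

    For instance, chains A,C with positions 1 2 3, 10 11 12 assigns positions 1 to 3 to chain A and 10 to 12 to chain C.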

    Interchain

    Whether to select only interchain disulfide bonds.

    Distance

    The distance between Cβ atoms can be set, with a default of 5.0 Å.
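
    A Cβ-Cβ distance screen of the kind this parameter controls can be sketched as follows. This is only an illustration under stated assumptions: the function cbeta_pairs, its input layout (a mapping from (chain, position) to Cβ coordinates in Å), and the inclusive cutoff are hypothetical, and the module may apply further geometric criteria that are not described here.

```python
import math
from itertools import combinations

Residue = tuple[str, int]               # (chain ID, 1-based position)
Coord = tuple[float, float, float]      # Cβ coordinates in Å


def cbeta_pairs(cb_coords: dict[Residue, Coord],
                cutoff: float = 5.0,
                interchain_only: bool = False) -> list[tuple[Residue, Residue, float]]:
    """List residue pairs whose Cβ atoms lie within the cutoff distance."""
    pairs = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(cb_coords.items()), 2):
        # Optionally keep only pairs that bridge two different chains.
        if interchain_only and res_a[0] == res_b[0]:
            continue
        distance = math.dist(xyz_a, xyz_b)
        if distance <= cutoff:
            pairs.append((res_a, res_b, round(distance, 2)))
    return pairs
```

    The interchain_only flag mirrors the Interchain parameter: when it is set, pairs within the same chain are skipped.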

    Result

    The output includes:

    Output File Name Description
    ss_bond.csv A CSV file listing the sequential residue index (counted from 1), the residue number in the PDB file, and the distance between Cβ atoms.
    ss_index.fasta A FASTA file in which sequence names use the sequential residue index and the predicted sites are mutated to CYS.
    ss_uid.fasta A FASTA file in which sequence names use the residue number from the PDB file and the predicted sites are mutated to CYS.
  • Name: Pocket Finder
    Description: 基于几何特性和物理化学特性来识别这些口袋,其主要功能是快速、准确地识别蛋白质表面的潜在口袋。蛋白质口袋(或活性位点)是蛋白质表面的小区域,通常是药物分子或其他小分子结合的地方。识别这些口袋对于药物设计和蛋白质功能研究至关重要。 It identifies the pockets based on geometric and physicochemical properties, and its main function is to quickly and accurately identify potential pockets on the protein surface. Protein pockets (or active sites) are small areas on the surface of proteins, usually where drug molecules or other small molecules bind. Identifying these pockets is crucial for drug design and protein function research.
    Author: Vincent Le Guilloux; Peter Schmidtke
    Release: 2024-09-06 15:58:52
    Reference: Vincent Le Guilloux, Peter Schmidtke and Pierre Tuffery, Fpocket: An open source platform for ligand pocket detection, BMC Bioinformatics 2009, 10:168 Peter Schmidtke and Xavier Barril, Understanding and predicting druggability. A high-throughput method for detection of drug binding sites, J Med Chem 2010, 53(15):5858-67

    Pocket Finder

    简介

    Pocket Finder模块基于几何特性和物理化学特性来识别这些口袋,其主要功能是快速、准确地识别蛋白质表面的潜在口袋。蛋白质口袋(或活性位点)是蛋白质表面的小区域,通常是药物分子或其他小分子结合的地方。识别这些口袋对于药物设计和蛋白质功能研究至关重要。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式。

    Minimum Radius

    最小alpha球的半径。

    Maximum Radius

    最大alpha球的半径。

    Distance Threshold

    聚类算法的距离阈值。

    Clustering Method

    用于将Voronoi顶点分组的聚类方法:

    • s是单链接聚类(single linkage clustering)。
    • m是完全链接聚类(complete linkage clustering)。
    • a是平均链接聚类(average linkage clustering)。
    • c是质心链接聚类(centroid linkage clustering)。

    Clustering Measure

    聚类的距离度量方法:

    • e是欧几里得距离(euclidean distance)。
    • b是曼哈顿距离(Manhattan distance)。

    Minimum Number

    每个口袋的最小alpha球数量。

    结果说明

    输出结果包括:

    输出文件名称 说明
    pocket_properties.csv 口袋信息CSV文件
    pockets.tar.gz 蛋白分析后得到的PDB文件压缩包
    pocket*_atm.pdb 分别输出所有口袋的PDB(原子)文件格式

    其中pocket_properties.csv包含如下信息:

    字段名称 说明
    Pocket 口袋顺序
    Score 口袋综合得分,考虑了口袋的大小、形状和疏水性等因素。打分越高说明口袋更好,更有可能在生物学上具有相关性或适合药物结合。
    Druggability Score 评估口袋结合药物分子的潜力,打分越高说明口袋药物可及性越高。
    Total SASA 口袋可被溶剂分子接触的总表面积,单位为平方埃(Å²);SASA越大,可容纳的配体结构越大。
    Polar SASA 总SASA中由极性原子贡献的部分,反映了口袋的亲水性。
    Apolar SASA 总SASA中由非极性原子贡献的部分,反映了口袋的疏水性。
    Volume 口袋的体积,单位为立方埃(Å³)。较大的体积表示口袋较大,能够容纳更大的配体或多个结合位点。

    参考文献

    • Vincent Le Guilloux, Peter Schmidtke and Pierre Tuffery, Fpocket: An open source platform for ligand pocket detection, BMC Bioinformatics 2009, 10:168
    • Peter Schmidtke and Xavier Barril, Understanding and predicting druggability. A high-throughput method for detection of drug binding sites, J Med Chem 2010, 53(15):5858-67

    Pocket Finder

    Introduction

    The Pocket Finder module identifies pockets based on geometric and physicochemical properties. Its main function is to quickly and accurately identify potential pockets on the protein surface. Protein pockets (or active sites) are small regions on the protein surface where drug molecules or other small molecules typically bind. Identifying these pockets is crucial for drug design and protein function studies.

    Parameter

    Structure PDB File

    The structure file of the protein in PDB format.

    Minimum Radius

    The minimum radius of the alpha sphere.

    Maximum Radius

    The maximum radius of the alpha sphere.

    Distance Threshold

    The distance threshold for the clustering algorithm.

    Clustering Method

    The clustering method used to group Voronoi vertices:

    • s for single linkage clustering.
    • m for complete linkage clustering.
    • a for average linkage clustering.
    • c for centroid linkage clustering.

    Clustering Measure

    The distance metric for clustering:

    • e for Euclidean distance.
    • b for Manhattan distance.
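    To make the two distance measures and the role of the distance threshold concrete, here is a small sketch. It is not fpocket's implementation; `single_linkage` is a hypothetical helper that groups points whenever a chain of neighbors within the threshold connects them, which is the behavior of single linkage clustering cut at that threshold.

```python
import math


def euclidean(p, q):
    """Straight-line distance (measure 'e')."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences (measure 'b')."""
    return sum(abs(a - b) for a, b in zip(p, q))


def single_linkage(points, threshold, dist=euclidean):
    """Label points so that any two within `threshold` share a cluster label."""
    labels = list(range(len(points)))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= threshold:
                old, new = labels[j], labels[i]
                # Merge the two clusters by relabeling every member of one.
                labels = [new if lab == old else lab for lab in labels]
    return labels


p, q = (0.0, 0.0, 0.0), (1.0, 2.0, 2.0)
print(euclidean(p, q))  # 3.0
print(manhattan(p, q))  # 5.0
```

    The same pair of points is farther apart under the Manhattan measure, so for a fixed threshold the Manhattan measure tends to produce smaller clusters.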

    Minimum Number

    The minimum number of alpha spheres per pocket.

    Result

    The output results include:

    Output File Name Description
    pocket_properties.csv CSV file with pocket information
    pockets.tar.gz Compressed archive of PDB files obtained from the protein analysis
    pocket*_atm.pdb PDB (atom) file format for each pocket

    The pocket_properties.csv file contains the following information:

    Field Name Description
    Pocket Pocket order
    Score Comprehensive score of the pocket, considering factors such as size, shape, and hydrophobicity. A higher score indicates a better pocket, more likely to be biologically relevant or suitable for drug binding.
    Druggability Score Assesses the potential of the pocket to bind drug molecules. A higher score indicates higher druggability.
    Total SASA Total solvent-accessible surface area of the pocket, in square angstroms (Å²); a larger SASA suggests that the pocket can accommodate a larger ligand.
    Polar SASA The portion of the total SASA contributed by polar atoms. Reflects the hydrophilicity of the pocket.
    Apolar SASA The portion of the total SASA contributed by apolar atoms. Reflects the hydrophobicity of the pocket.
    Volume The volume of the pocket, in cubic angstroms (Å³). A larger volume indicates a larger pocket that can accommodate larger ligands or multiple binding sites.
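    One way to work with pocket_properties.csv is to rank pockets by druggability, sketched below. The column headers are assumed to match the field names in the table above, the 0.5 cutoff is an arbitrary illustration, and the demo rows are made-up values, not real output.

```python
import csv
import io


def rank_pockets(csv_text, min_druggability=0.5):
    """Keep pockets at or above the cutoff, most druggable first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    keep = [r for r in rows if float(r["Druggability Score"]) >= min_druggability]
    return sorted(keep, key=lambda r: float(r["Druggability Score"]), reverse=True)


# Illustrative data only; header names are an assumption.
demo = """Pocket,Score,Druggability Score,Total SASA,Polar SASA,Apolar SASA,Volume
1,0.80,0.35,120.0,50.0,70.0,400.0
2,0.60,0.90,200.0,60.0,140.0,650.0
3,0.10,0.55,90.0,40.0,50.0,300.0
"""

for row in rank_pockets(demo):
    apolar_fraction = float(row["Apolar SASA"]) / float(row["Total SASA"])
    print(row["Pocket"], row["Druggability Score"], round(apolar_fraction, 2))
```

    The apolar fraction printed alongside each pocket is a quick way to compare hydrophobic character between pockets of different size.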

    References

    • Vincent Le Guilloux, Peter Schmidtke and Pierre Tuffery, Fpocket: An open source platform for ligand pocket detection, BMC Bioinformatics 2009, 10:168
    • Peter Schmidtke and Xavier Barril, Understanding and predicting druggability. A high-throughput method for detection of drug binding sites, J Med Chem 2010, 53(15):5858-67
  • Name: Restrained Complex Structure Prediction
    Description: 基于ColabDock框架实现,ColabDock框架通过整合多种实验限制条件,显著提升了蛋白-蛋白对接预测的准确性。 Implemented based on the ColabDock framework, which significantly improves the accuracy of protein-protein docking prediction by integrating multiple experimental constraints.
    Author: Shihao Feng
    Release: 2024-08-22 11:55:25
    Reference: Feng, S., Chen, Z., Zhang, C. et al. Integrated structure prediction of protein–protein docking with experimental restraints using ColabDock. Nat Mach Intell, 2024.

    Restrained Complex Structure Prediction

    简介

    Restrained Complex Structure Prediction模块基于ColabDock框架实现,ColabDock框架通过整合多种实验限制条件,显著提升了蛋白-蛋白对接预测的准确性。其创新点包括:

    • 无需大规模重新训练或微调:ColabDock框架通过梯度反向传播直接整合实验限制,避免了对深度学习模型进行大规模的重新训练或微调,提高了计算效率。
    • 多源实验数据的整合能力:ColabDock能够处理不同形式和来源的实验数据,包括但不限于化学交联质谱(XL-MS)、核磁共振化学位移扰动(CSP)、共价标记(CL)和模拟的深度突变扫描(DMS)等,增强了模型的适用性和灵活性。
    • 提升预测精度:通过在多个数据集上的评估,ColabDock展现出了超越现有方法的预测精度,尤其是在考虑实验限制条件时。不仅在具有模拟残基和表面约束的复杂结构预测中优于HADDOCK和ClusPro,而且在结合核磁共振化学位移扰动和共价标记辅助的情况下也表现出色。

    ColabDock框架的工作流程分为两个主要阶段:

    1. 生成阶段
    • ColabDock生成阶段的目标是生成与提供的实验限制和模板相一致的蛋白质复合物结构。
    • 该阶段使用梯度反向传播(Backprop)来优化输入序列配置文件的对数空间,从而引导结构预测模型(AF2)产生符合实验限制的复杂结构。
    • 输入包括:蛋白质序列配置文件、每条蛋白链的模板,以及实验限制条件。
    • 优化过程:模型通过调整序列配置文件来改变对接结构,同时保持预测的蛋白质序列与输入序列一致。
      image.png
    2. 预测阶段
    • 预测阶段使用生成的结构和每个链的模板进行最终的复杂结构预测。
    • 这个阶段利用AlphaFold2(AF2)或其他深度学习模型来评估和细化复合物结构,提高预测的精确度。
    • 预测阶段的输出是最终的蛋白质复合物结构预测,它考虑了实验限制并结合了深度学习模型的预测能力。
      image.png

    ColabDock主要关注两种类型的约束。第一种约束限制了残基对之间的距离低于某一阈值,属于残基-残基层面的约束(称为1v1约束)。这类约束包括源自交联质谱(XL-MS)的约束。第二种约束定义了在蛋白质表面上可能接触的两组残基之间的约束,但具体的接触信息未知。此类约束属于界面层面的约束(称为MvN约束),典型示例包括多种NMR实验和共价标记(CL)。
    ColabDock在模拟约束条件下的性能验证情况如下图所示:
    image.png
    如图a所示,在仅提供两个1v1约束的情况下,81.08%的蛋白质复合物的最大DockQ值超过了0.23,尤其考虑到从这些约束中获取的结构信息相对有限。当提供三到五个约束时,成功率接近100%。如图b所示,对于含有两、三和五对约束的蛋白质复合物,其约束满足率分别为0.55、0.77和0.80。这些结果表明,ColabDock能够高效利用提供的约束来获得高质量的复合物结构。

    评估ColabDock在MvN约束下的性能时,先基于上述1v1样本生成了MvN样本。这些样本的挑战性更大,因为MvN约束的模糊性使得多个1v1约束组合可能满足同一组MvN约束。如图c所示,111个样本中有100个预测结构的最大DockQ值超过了0.23。其中,75个样本的top1结构的DockQ值超过0.23。随着约束数量的增加,ColabDock的准确性也相应提高,top1结构的成功率从两个约束时的62.16%上升到三个和五个约束时的70.27%。在预测结构中,约束满足率与实验结构中的比例相似(图d)。这些结果表明,ColabDock同样能够高效利用模糊的约束条件来改善结构预测。

    为了评估ColabDock中预测阶段的必要性,在上述1v1和MvN约束实验中,收集了最后十个优化步骤中的结构,大多数优化过程已经收敛。在生成阶段和预测阶段的DockQ值差异较大的情况下(这里定义为大于0.1),预测阶段在69.9%的1v1约束复合物中表现更好(图e),在MvN约束复合物中这一比例为68.8%(图f)。这些结果表明,AF2的能量景观可以帮助优化生成阶段的构象并提高预测的准确性。

    ColabDock与传统限制性对接方法比较如下图所示:
    image.png
    基于37个蛋白质复合物的独立基准集。与HADDOCK和ClusPro进行了比较。对于基准集中的每个复合物,采样两、三和五个1v1约束来指导对接,最终生成了111个样本。ColabDock在大多数样本中优于HADDOCK和ClusPro(图a)。ColabDock的平均DockQ值为0.477,而HADDOCK和ClusPro的DockQ值分别为0.287和0.191。无论1v1约束的数量多少,ColabDock在三种方法中均表现最佳(图b)。这些结果表明,ColabDock在稀疏约束条件下有生成可靠结构的潜力,这与验证集的观察结果一致。

    为了进一步评估ColabDock在界面级别约束下的表现,作为验证数据集,将上述描述的1v1约束转换为MvN约束。由于ClusPro在111个样本中有7个无法给出预测,将其排除,并对剩余的104个样本进行比较。与1v1约束下的表现相比,由于MvN约束的模糊性,ColabDock、HADDOCK和ClusPro在MvN约束下的表现有所下降,但ColabDock仍然优于其他两种方法(图c)。实验再次表明,无论MvN约束的数量多少,ColabDock在DockQ上均表现最佳(图d)。

    实验衍生的约束中常常包含相距较远的残基,作者将其称为“松散约束”。为了测试模型在相关任务中的表现,有意在距离范围为8Å到20Å之间加入了松散约束。对于基准集中的每个复合物,松散约束的数量从1到5不等,而总约束数量固定为5个,共生成了185个样本。排除了9个ClusPro无法处理的样本,并对剩余的176个样本进行了三种方法的比较。结果显示,ColabDock表现最佳,平均DockQ值为0.344,平均α碳原子r.m.s.d.(Cα-r.m.s.d.)为6.55Å(图e)。这些结果表明,ColabDock对约束的质量依赖较低。当与高质量约束结合时,ColabDock能够预测出比其他两种方法更为精确的结构。

    抗原抗体复合物预测
    抗体-抗原复合物建模一直是一个长期存在的挑战,因为互补决定区(CDRs)的灵活性和缺乏共同进化信号。深度突变扫描(DMS)是一种常用技术,用于确定可能参与抗体-抗原结合的残基。基于一个包含45个复合物的抗体-抗原基准集,通过采样界面上的残基来模拟DMS衍生的约束。预测效果及与传统方法的比较情况如下图所示:
    image.png

    图a所示,ColabDock优于HADDOCK和ClusPro,其平均DockQ值为0.223,平均r.m.s.d.为9.57Å。对于DockQ值大于0.49的样本数量,ColabDock也超过了HADDOCK和ClusPro(图b)。

    以1AHW为例:1AHW是一个人类组织因子-抗体(5G9)复合物,参与了血液凝固蛋白酶级联过程。如图c所示,随机从抗体中采样了五个界面残基(轻链的His91和Gly92,重链的Asp31、Tyr32和Asn100),以及从抗原中采样了七个界面残基(Lys165、Thr167、Val192、Thr197、Val198、Asn199和Asp204)。这些在抗体中采样的残基主要分布在L1 CDR、H1 CDR和H3 CDR区域。图d展示了AF-Multimer的预测结构以及三种对接方法的结构。如图e所示,ColabDock捕捉到了大多数界面上的天然接触,其DockQ值为0.770,r.m.s.d.为1.17Å,而其他方法的预测结构与天然构象有较大差异。这一案例研究表明,ColabDock在构象探索和构象排序方面都优于其他两种方法。

    参数说明

    Complex Structure

    初始蛋白复合物结构文件,PDB格式
    注:该结构由多条链组成,链与链之间的相对位置可任意放置,无要求。由于显存大小限制,当前最大支持的最终复合物尺寸大小不超过800个残基。

    Chains

    复合物中提取多条链,用于组成最终的复合物结构,链名之间用逗号分隔,如:A,H,L

    Fix Chains

    提取的多条链中指定相对位置固定的每对链,支持定义多对,链名之间用逗号分隔,每行一对,示例如下:

    H,L
    A,H
    

    表示链H与L之间的相对位置固定,链A与H之间的相对位置固定。

    Threthold

    实验限制的距离阈值,表示设置限制的残基间的距离需小于该阈值。默认为8.0 Å,值范围为2.0 Å - 22.0 Å,建议采用默认值。

    1v1 Restrains

    单个残基之间的限制条件,限制单个残基之间的距离在上述定义的阈值参数内,残基之间用逗号(,)分隔,支持定义多个条件(每行定义一个),示例如下:

    A20,H50
    A78,L98
    

    该参数表示设置的限制条件有2个:

    • A链的第20位残基和H链的第50位残基之间的距离要小于阈值;
    • A链的第78位残基和L链的第98位残基之间的距离要小于阈值。

    注意:残基编号为位置编号,即每条链按顺序从1开始进行编号,以下编号规则一致。

    MvN Restrains

    单个残基与残基组合之间的限制条件,限制单个残基与多个残基集合中至少一个残基之间的距离在上述定义的阈值参数内,单个残基与残基组合之间用逗号(,)分隔,残基组合内部用分号(;)分隔,可支持定义多个条件(每行定义一个),示例如下:

    A10,H60-70;H78;L90
    A78,H60-70;L56;L69
    A120,L30-L36;H68;H72
    2
    

    该参数表示设置的限制条件有3个,分别是:

    • A链第10位残基与残基组合(H链第60至70位、H链第78位及L链第90位残基)中的至少一个残基之间的距离小于阈值;
    • A链第78位残基与残基组合(H链第60至70位、L链第56位及L链第69位残基)中的至少一个残基之间的距离小于阈值;
    • A链第120位残基与残基组合(L链第30至36位、H链第68位及H链第72位残基)中的至少一个残基之间的距离小于阈值;
    • 最后一行的数值2,表示上述3个条件中,满足任意2个条件即可,如限制条件只有1个时,该数值可以省略。

    Rep Threthold

    限制残基间排斥的距离阈值,表示设定的排斥残基间的距离需大于该阈值。默认为8.0 Å,值范围为2.0 Å - 22.0 Å,建议采用默认值。

    Rep 1v1 Restrains

    单个残基间的排斥限制条件,限制单个残基之间的距离需大于上述定义的排斥阈值,残基之间用逗号(,)分隔,可支持定义多个条件(每行定义一个),示例如下:

    15,98
    60,205
    

    该参数表示设置的排斥限制条件有2个:

    • 编号顺序为第15和第98的残基之间的距离要大于排斥阈值;
    • 编号第60和第205的残基之间的距离要大于排斥阈值。

    结果说明

    输出1st_best.pdb结果文件,为预测得到的最优复合物结构文件。
    输出pdbs.tar.gz文件,为预测得到的前5个最优复合物结构文件压缩包。
    输出summary.txt文件,包含以下信息:

    列名 说明
    pdb 复合物结构文件名
    iptm 复合物结构的质量好坏评价指标,0-1之间,越接近1表示预测结构的质量越好
    # of satisfied restraints 限制条件的数量,以及预测的复合物结构能满足的条件数量,如:2/2表示有2个限制条件,预测得到的复合物结构都能满足;1/2表示有2个限制条件,但复合物结构只满足了其中1个

    备注:
    可能存在以下个别情况,属正常现象

    1. 1st_best.pdb的iptm打分并不是5个结构里最优的;
    2. 结构中有个别残基间的肽键发生断裂;
      有待结构预测模型的进一步优化。

    参考文献

    • Feng, S., Chen, Z., Zhang, C. et al. Integrated structure prediction of protein–protein docking with experimental restraints using ColabDock. Nat Mach Intell, 2024.
    • Nat. Mach. Intell. | 突破对接瓶颈:ColabDock革新蛋白质-蛋白质结构预测

    Restrained Complex Structure Prediction

    Introduction

    The module is implemented based on the ColabDock framework, which significantly improves the accuracy of protein-protein docking predictions by integrating a variety of experimental constraints. Its innovations include:

    • No need for large-scale retraining or fine-tuning: The ColabDock framework directly integrates experimental constraints through gradient backpropagation, avoiding large-scale retraining or fine-tuning of deep learning models and improving computational efficiency.
    • Integration of multi-source experimental data: ColabDock can handle experimental data of different forms and sources, including but not limited to chemical cross-linking mass spectrometry (XL-MS), NMR chemical shift perturbation (CSP), covalent labeling (CL), and simulated deep mutational scanning (DMS), which broadens the applicability and flexibility of the model.
    • Improved prediction accuracy: Through evaluation on multiple data sets, ColabDock has demonstrated prediction accuracy that exceeds existing methods, especially when considering experimental constraints. Not only does it outperform HADDOCK and ClusPro in complex structure predictions with simulated residues and surface constraints, but it also performs well when combined with NMR chemical shift perturbation and covalent labeling assistance.

    The workflow of the ColabDock framework is divided into two main stages:

    1. Generation stage
    • The goal of the ColabDock generation stage is to generate a protein complex structure that is consistent with the provided experimental constraints and template.
    • This stage uses gradient backpropagation (Backprop) to optimize the logarithmic space of the input sequence profile, thereby guiding the structure prediction model (AF2) to produce a complex structure that meets the experimental constraints.
    • The input includes: protein sequence profile, template for each protein chain, and experimental constraints.
    • Optimization process: The model changes the docked structure by adjusting the sequence profile while keeping the predicted protein sequence consistent with the input sequence.
      image.png
    2. Prediction stage
    • The prediction stage uses the generated structures and templates for each chain to make final complex structure predictions.
    • This stage uses AlphaFold2 (AF2) or other deep learning models to evaluate and refine the complex structure and improve the accuracy of the predictions.
    • The output of the prediction stage is the final protein complex structure prediction, which takes into account experimental constraints and combines the predictive power of deep learning models.
      image.png

    ColabDock focuses on two types of constraints. The first type of constraints restricts the distance between residue pairs to be below a certain threshold and are residue-residue level constraints (called 1v1 constraints). This type of constraints includes constraints derived from cross-linking mass spectrometry (XL-MS). The second type of constraints defines constraints between two groups of residues that may contact on the protein surface, but the specific contact information is unknown. This type of constraints belongs to the interface level constraints (called MvN constraints), and typical examples include various NMR experiments and covalent labeling (CL).
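    The difference between the two constraint types can be sketched as a pair of checks on a predicted structure. This is an illustration rather than ColabDock's code: `coords` is a hypothetical mapping from a residue label to one representative atom position (which atom is used is not stated here), and treating an interface-level constraint as satisfied when at least one cross-group pair is within the threshold follows the description of the MvN Restrains parameter in this module.

```python
import math


def satisfied_1v1(coords, res_a, res_b, threshold=8.0):
    """Residue-residue constraint: the two residues must be closer than threshold."""
    return math.dist(coords[res_a], coords[res_b]) < threshold


def satisfied_interface(coords, group_a, group_b, threshold=8.0):
    """Interface constraint: any one cross-group pair within threshold suffices."""
    return any(
        math.dist(coords[a], coords[b]) < threshold
        for a in group_a
        for b in group_b
    )


# Made-up coordinates in angstroms, for illustration only.
coords = {
    "A20": (0.0, 0.0, 0.0),
    "H50": (5.0, 0.0, 0.0),
    "H60": (20.0, 0.0, 0.0),
    "L90": (0.0, 7.0, 0.0),
}
print(satisfied_1v1(coords, "A20", "H50"))                   # True
print(satisfied_interface(coords, ["A20"], ["H60", "L90"]))  # True
```

    Because the interface check passes as soon as any pair is close, many different binding poses can satisfy the same MvN constraint, which is the ambiguity discussed in the evaluation below.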

    ColabDock's performance with simulated constraints is shown in the following figure:
    image.png
    As shown in Figure a, with only two 1v1 constraints provided, 81.08% of the protein complexes had a maximum DockQ value above 0.23, which is notable given the relatively limited structural information such constraints carry. When three to five constraints were provided, the success rate was close to 100%. As shown in Figure b, for protein complexes with two, three, and five pairs of constraints, the constraint satisfaction rates were 0.55, 0.77, and 0.80, respectively. These results show that ColabDock can efficiently use the provided constraints to obtain high-quality complex structures.

    To evaluate ColabDock under MvN constraints, MvN samples were first generated from the 1v1 samples above. These samples are more challenging because MvN constraints are ambiguous: multiple combinations of 1v1 constraints can satisfy the same set of MvN constraints. As shown in Figure c, for 100 of the 111 samples the maximum DockQ value of the predicted structures exceeds 0.23, and for 75 samples the top1 structure has a DockQ value above 0.23. As the number of constraints increases, the accuracy of ColabDock also increases, with the success rate of the top1 structure rising from 62.16% with two constraints to 70.27% with three and five constraints. In the predicted structures, the constraint satisfaction rate is similar to that in the experimental structures (Figure d). These results show that ColabDock can also make effective use of ambiguous constraints to improve structure prediction.
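    The reported percentages and counts can be cross-checked with a little arithmetic. The sketch below assumes the 111 samples are split evenly across the three settings (two, three, and five constraints), which is an inference from the figures quoted here rather than something the text states for this dataset.

```python
samples_total = 111
settings = 3  # two, three, and five constraints
per_setting = samples_total // settings  # 37 samples per setting (assumed even split)

# Top1 success rates in percent, as quoted in the text.
reported_top1_rate = {2: 62.16, 3: 70.27, 5: 70.27}

# Convert each rate back to a whole number of samples.
top1_counts = {
    n: round(rate / 100 * per_setting) for n, rate in reported_top1_rate.items()
}
print(top1_counts)                          # {2: 23, 3: 26, 5: 26}
print(sum(top1_counts.values()))            # 75
print(round(100 / samples_total * 100, 2))  # 90.09
```

    The per-setting counts add up to the 75 top1 successes quoted above, and 100 of 111 samples corresponds to about 90% with a maximum DockQ above 0.23.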

    To evaluate the necessity of the prediction stage in ColabDock, structures from the last ten optimization steps, by which point most optimization runs had converged, were collected in the 1v1 and MvN experiments above. In cases where the DockQ values of the generation stage and the prediction stage differ substantially (defined here as a difference greater than 0.1), the prediction stage performs better in 69.9% of the 1v1-constrained complexes (Figure e) and in 68.8% of the MvN-constrained complexes (Figure f). These results suggest that the energy landscape of AF2 can help refine conformations from the generation stage and improve prediction accuracy.

    The comparison between ColabDock and traditional restrictive docking methods is shown in the figure below:
    image.png
    The comparison with HADDOCK and ClusPro is based on an independent benchmark set of 37 protein complexes. For each complex in the benchmark set, two, three, and five 1v1 constraints were sampled to guide docking, yielding 111 samples. ColabDock outperformed HADDOCK and ClusPro on most samples (Figure a). The average DockQ value of ColabDock was 0.477, compared with 0.287 for HADDOCK and 0.191 for ClusPro. Regardless of the number of 1v1 constraints, ColabDock performed best among the three methods (Figure b). These results show that ColabDock has the potential to generate reliable structures under sparse constraints, consistent with the observations on the validation set.

    To further evaluate ColabDock under interface-level constraints, the 1v1 constraints described above were converted to MvN constraints, as was done for the validation dataset. Because ClusPro could not produce predictions for 7 of the 111 samples, those samples were excluded and the remaining 104 were compared. Compared with the 1v1 case, the performance of ColabDock, HADDOCK, and ClusPro declined under MvN constraints because of their ambiguity, but ColabDock still outperformed the other two methods (Figure c). ColabDock again achieved the best DockQ regardless of the number of MvN constraints (Figure d).

    Experimentally derived constraints often contain residues that are far apart, which the authors call “loose constraints.” In order to test the performance of the model in related tasks, loose constraints were intentionally added with distances ranging from 8Å to 20Å. For each complex in the benchmark set, the number of loose constraints ranged from 1 to 5, while the total number of constraints was fixed at 5, generating a total of 185 samples. Nine samples that ClusPro could not handle were excluded, and the three methods were compared on the remaining 176 samples. The results showed that ColabDock performed best, with an average DockQ value of 0.344 and an average α-carbon atom r.m.s.d. (Cα-r.m.s.d.) of 6.55Å (Figure e). These results indicate that ColabDock has a low dependence on the quality of constraints. When combined with high-quality constraints, ColabDock is able to predict more accurate structures than the other two methods.

    Antigen-antibody complex prediction
    Modeling antibody-antigen complexes has been a long-standing challenge due to the flexibility of complementarity determining regions (CDRs) and the lack of co-evolutionary signals. Deep mutational scanning (DMS) is a commonly used technique to identify residues that may be involved in antibody-antigen binding. Based on an antibody-antigen benchmark set of 45 complexes, DMS-derived constraints were simulated by sampling residues on the interface. The prediction results and comparison with traditional methods are shown in the figure below:
    image.png

    As shown in Figure a, ColabDock outperforms HADDOCK and ClusPro, with an average DockQ value of 0.223 and an average r.m.s.d. of 9.57 Å. For the number of samples with a DockQ value greater than 0.49, ColabDock also exceeds HADDOCK and ClusPro (Figure b).

    Take 1AHW as an example: 1AHW is a complex of human tissue factor and antibody 5G9, and tissue factor participates in the blood coagulation protease cascade. As shown in Figure c, five interface residues were randomly sampled from the antibody (His91 and Gly92 of the light chain; Asp31, Tyr32, and Asn100 of the heavy chain), and seven interface residues were sampled from the antigen (Lys165, Thr167, Val192, Thr197, Val198, Asn199, and Asp204). The residues sampled from the antibody are mainly located in the L1, H1, and H3 CDRs. Figure d shows the structure predicted by AF-Multimer and the structures from the three docking methods. As shown in Figure e, ColabDock captures most of the native contacts at the interface, with a DockQ value of 0.770 and an r.m.s.d. of 1.17 Å, while the structures predicted by the other methods differ considerably from the native conformation. This case study suggests that ColabDock outperforms the other two methods in both conformational exploration and conformational ranking.

    Parameters

    Complex Structure

    Original protein complex structure file, PDB format
    Note: This structure consists of multiple chains, and the relative positions between chains can be placed arbitrarily. Due to the limitation of GPU memory, the current maximum supported final complex size does not exceed 800 residues.
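    Before submitting a job, the size of the selected chains can be checked against the 800-residue limit mentioned in the note. The sketch below counts distinct residues in PDB ATOM records; it is a hypothetical pre-check, not part of the module, and it assumes standard fixed-column PDB formatting.

```python
MAX_RESIDUES = 800  # limit quoted in the note above


def count_residues(pdb_lines, chains):
    """Count distinct residues in ATOM records of the selected chains."""
    seen = set()
    for line in pdb_lines:
        if line.startswith("ATOM") and line[21] in chains:
            # Columns 23-27 hold the residue number plus insertion code.
            seen.add((line[21], line[22:27]))
    return len(seen)


def fits_limit(pdb_lines, chains):
    """True if the selected chains stay within the residue limit."""
    return count_residues(pdb_lines, chains) <= MAX_RESIDUES
```

    For the Chains value A,H,L, the call would be `fits_limit(lines, {"A", "H", "L"})`, where `lines` are the lines of the input PDB file.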

    Chains

    Multiple chains are extracted from the original complex to form the final complex structure. The chain names are separated by commas, such as: A,H,L

    Fix Chains

    Specify each pair of chains with fixed relative positions among the extracted multiple chains. Multiple pairs can be defined. Chain names are separated by comma, with one pair per line. The example is as follows:

    H,L
    A,H
    

    It means that the relative position between chains H and L is fixed, and the relative position between chains A and H is fixed.

    Threthold

    The distance threshold for experimental restraints: the distance between restrained residues must be less than this threshold. The default value is 8.0 Å, and the allowed range is 2.0 Å - 22.0 Å. The default value is recommended.

    1v1 Restrains

    Restraints between single residues: the distance between the two residues must be below the threshold defined above. The two residues are separated by a comma. Multiple restraints can be defined (one per line). For example:

    A20,H50
    A78,L98
    

    This parameter indicates that there are two restrictions set:

    • The distance between the 20th residue of the A chain and the 50th residue of the H chain must be less than the threshold;
    • The distance between the 78th residue of the A chain and the 98th residue of the L chain must be less than the threshold.

    Note: Residue numbers are positional, i.e., each chain is numbered sequentially starting from 1. The same numbering rule applies to the parameters below.
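    The 1v1 input format can be read with a few lines of Python. This is a sketch of the syntax shown in the example above, assuming single-letter chain IDs; `parse_1v1` is a hypothetical helper, not the module's parser.

```python
import re

# One chain letter followed by a 1-based position, e.g. "A20".
RESIDUE = re.compile(r"^([A-Za-z])(\d+)$")


def parse_1v1(text):
    """Turn lines like 'A20,H50' into pairs of (chain, position) tuples."""
    restraints = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        left, right = (RESIDUE.match(tok.strip()) for tok in line.split(","))
        if not (left and right):
            raise ValueError(f"bad restraint line: {line!r}")
        restraints.append(((left[1], int(left[2])), (right[1], int(right[2]))))
    return restraints


print(parse_1v1("A20,H50\nA78,L98\n"))
# [(('A', 20), ('H', 50)), (('A', 78), ('L', 98))]
```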

    MvN Restrains

    Restraints between a single residue and a group of residues: the distance between the single residue and at least one residue in the group must be below the threshold defined above. The single residue and the group are separated by a comma, and members of the group are separated by semicolons. Multiple restraints can be defined (one per line). For example:

    A10,H60-70;H78;L90
    A78,H60-70;L56;L69
    A120,L30-L36;H68;H72
    2
    

    This parameter indicates that there are three restrictions set, namely:

    • The distance between the 10th residue of the A chain and at least one residue in the residue combination (residues 60 to 70 of the H chain, 78 of the H chain, and 90 of the L chain) is less than the threshold;
    • The distance between the 78th residue of the A chain and at least one residue in the residue combination (residues 60 to 70 of the H chain, 56 of the L chain, and 69 of the L chain) is less than the threshold;
    • The distance between the 120th residue of the A chain and at least one residue in the residue combination (residues 30 to 36 of the L chain, 68 of the H chain, and 72 of the H chain) is less than the threshold;
    • The value 2 on the last line means that satisfying any two of the three restraints above is sufficient. If there is only one restraint, this value can be omitted.
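    The MvN format, including ranges and the trailing count, can be parsed as sketched below. `expand` and `parse_mvn` are hypothetical helpers. Two points are assumptions: a range such as L30-L36 is read as staying within the chain named on its left side, and when the trailing count is omitted all restraints are taken as required.

```python
import re


def expand(token):
    """'H60-70' or 'L30-L36' -> [('H', 60), ...]; 'H78' -> [('H', 78)]."""
    m = re.fullmatch(r"([A-Za-z])(\d+)(?:-[A-Za-z]?(\d+))?", token.strip())
    if not m:
        raise ValueError(f"bad residue token: {token!r}")
    chain, start = m[1], int(m[2])
    end = int(m[3]) if m[3] else start
    return [(chain, i) for i in range(start, end + 1)]


def parse_mvn(text):
    """Return (restraints, required), where each restraint is (single, group)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    required = None
    if lines and lines[-1].isdigit():
        required = int(lines.pop())  # trailing count of restraints to satisfy
    restraints = []
    for line in lines:
        single, group = line.split(",", 1)
        members = [res for tok in group.split(";") for res in expand(tok)]
        restraints.append((expand(single)[0], members))
    if required is None:
        required = len(restraints)  # assumption: all restraints are required
    return restraints, required


example = "A10,H60-70;H78;L90\nA78,H60-70;L56;L69\nA120,L30-L36;H68;H72\n2\n"
restraints, required = parse_mvn(example)
print(len(restraints), required)  # 3 2
```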

    Rep Threthold

    The distance threshold for limiting the repulsion between residues, indicating that the distance between the set repulsive residues must be greater than this threshold. The default value is 8.0 Å, and the value range is 2.0 Å - 22.0 Å. It is recommended to use the default value.

    Rep 1v1 Restrains

    Repulsive restraints between single residues: the distance between the two residues must be greater than the repulsion threshold defined above. The two residues are separated by a comma. Multiple restraints can be defined (one per line). For example:

    15,98
    60,205
    

    This parameter indicates that two repulsive restraints are set:

    • The distance between residues 15 and 98 must be greater than the repulsion threshold;
    • The distance between residues 60 and 205 must be greater than the repulsion threshold.

    Results

    1st_best.pdb: the best predicted complex structure.
    pdbs.tar.gz: a compressed archive of the top 5 predicted complex structures.
    summary.txt: contains the following information:

    Field Description
    pdb File name of the complex structure
    iptm Quality metric for the complex structure, between 0 and 1; the closer to 1, the better the quality of the predicted structure
    # of satisfied restraints The total number of constraints and the number of constraints that the predicted complex structure can satisfy. For example, 2/2 means that there are 2 constraints and the predicted complex structure can satisfy them all; 1/2 means that there are 2 constraints, but the complex structure only satisfies one of them.

    Note:
    The following may occasionally occur and is expected behavior:

    1. The iptm score of 1st_best.pdb is not the best among the 5 structures;
    2. The peptide bond between a few individual residues in the structure is broken.
      These are limitations of the current structure prediction model and await further optimization.

    References

    • Feng, S., Chen, Z., Zhang, C. et al. Integrated structure prediction of protein–protein docking with experimental restraints using ColabDock. Nat Mach Intell, 2024.
    • Nat. Mach. Intell. | 突破对接瓶颈:ColabDock革新蛋白质-蛋白质结构预测
  • Name: Germline Blast
    Description: Germline Blast模块基于IgBlastp实现,通过序列比对从IMGT reference sequences数据库中搜索与目标抗体序列最接近的同源模板,输出对应的模板序列以及序列一致性等信息。数据库中默认检索的序列类型为:IMGT V genes(F+ORF+in-frame P)。 Based on IgBlastp, the Germline Blast module will search for the homologous template closest to the target antibody sequence in the IMGT reference sequences database through sequence alignment and output the corresponding template sequence and sequence consistency. The default sequence types searched in the database are: IMGT V genes (F+ORF+in-frame P).
    Author: Jian Ye; Lefranc
    Release: 2024-08-29 15:34:27
    Reference: Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013 Jul;41(Web Server issue):W34-40. Lefranc, M.-P. and Lefranc, G. The Immunoglobulin FactsBook Academic Press, London, UK (458 pages), (2001)

    Germline Blast

    简介

    Germline Blast模块基于IgBlastp实现,通过氨基酸序列比对从IMGT reference sequences数据库中搜索与目标抗体序列最接近的同源模板,输出对应的模板序列以及序列一致性等信息。数据库中默认检索的序列类型为:IMGT V genes(F+ORF+in-frame P)。

    参数说明

    Antibody Sequence File

    抗体的序列文件,FASTA格式,如包含多条序列,仅对第一条序列进行分析。

    Numbering Scheme

    抗体编号类型:kabat和imgt

    TopHits

    输出同源性最高的N条序列,默认值为10。

    Species

    序列所属物种:Human,Mouse,Rat,Rabbit,Rhesus Monkey,默认值为Human。

    结果说明

    输出参数 输出文件名称 说明
    Hits Sequence hits.fasta 包含同源性最高的n条序列的序列文件
    Result result.csv 包含找到的Germline序列以及序列的一致性信息
    Alignment Summary align_info_top_germline.csv 包含查询序列与同源性最高的Germline V基因序列的比对信息

    参考文献

    • Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013 Jul;41(Web Server issue):W34-40.
    • Lefranc, M.-P. and Lefranc, G. The Immunoglobulin FactsBook Academic Press, London, UK (458 pages), (2001)

    Germline Blast

    Introduction

    The Germline Blast module is based on IgBlastp and searches for the most homologous templates to the target antibody sequence from the IMGT reference sequences database through sequence alignment. It outputs the corresponding template sequences and sequence identity information. The default sequence types searched in the database are: IMGT V genes (F+ORF+in-frame P).

    Parameter

    Antibody Sequence File

    The antibody sequence file in FASTA format. If multiple sequences are included, only the first sequence will be analyzed.

    Numbering Scheme

    The antibody numbering scheme: kabat and imgt.

    TopHits

    The number of top homologous sequences to output, with a default value of 10.

    Species

    The species of the sequence: Human, Mouse, Rat, Rabbit, Rhesus Monkey, with the default value being Human.

    Results

    Output Parameter Output File Name Description
    Hits Sequence hits.fasta A sequence file containing the top N homologous sequences
    Result result.csv Contains the identified germline sequences and sequence identity information
    Alignment Summary align_info_top_germline.csv Contains alignment information between the query sequence and the top homologous germline V gene sequences

    Reference

    • Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013 Jul;41(Web Server issue):W34-40.
    • Lefranc, M.-P. and Lefranc, G. The Immunoglobulin FactsBook Academic Press, London, UK (458 pages), (2001)
  • Name: Immunogenicity Prediction (WeADApt v4.0.2)
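    Description: The latest version, v4.0.2, of Wecomput's new-generation immunogenicity prediction method. To keep concepts distinct and avoid ambiguity, starting from this version the whole prediction system adopts the new name WeADApt (short for WEcomput ADA PredicTion); AlphaMHC now refers only to the underlying MHC interaction prediction model within the system. On the ADA prediction test task, this version outperforms methods including EpiMatrix, NetMHCIIpan, DeepMHCII, IEDB Consensus, and AlphaMHC v1/v2/v3.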
    Author: WECOMPUT
    Release: 2024-08-29 17:00:00
    Reference: Please cite www.wecomput.com if needed.

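    WeADApt (Wecomput ADA prediction) is an immunogenicity prediction system built on a multi-model fusion architecture. It combines several immunogenicity-related models into an efficient immune-response simulation system that can accurately model the immunogenicity of biologics such as proteins, antibodies, peptides, and vaccines, and can identify potentially immunogenic T cell epitopes (peptide segments that trigger immune responses in humans in the clinic).
    This module is the latest version: v4.0.2.

    Performance Testing

    Tests on ADA data from more than 100 clinical-stage and marketed antibodies show that the predicted score (MolScore) correlates with ADA incidence at R = 0.68 (figure below).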

    image.png

    On the same dataset of 42 molecules, the correlation of WeADApt predictions exceeds that of the well-known commercial software EpiMatrix (R² = 0.49 vs. R² = 0.42).

    image.png

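    Scoring

    A score of 0.2 is a suitable threshold between high and low risk for monoclonal antibodies (high risk is defined as >20% ADA).

    Bispecific and Multispecific Molecules

    For these molecules, only the non-redundant chains need to be entered.
    Performance on the bispecific antibody ADA dataset collected by Wecomput is shown in the figure below. Using a score of 0.6 as the cutoff separates high-risk and low-risk bispecific molecules reasonably well.
    Note that many bispecifics are B-cell-depleting, and their mechanism of action (MOA) can strongly affect ADA development.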

    image.png
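
    The two cutoffs quoted above (0.2 for monoclonal antibodies, 0.6 for bispecifics) can be applied to a predicted score as in the minimal sketch below. The function name, the format labels, and the choice of a strict "greater than" comparison are assumptions made for illustration, not part of WeADApt.

```python
# Cutoffs quoted in the text; the dictionary keys are hypothetical labels.
THRESHOLDS = {"mab": 0.2, "bispecific": 0.6}

def classify_risk(mol_score: float, molecule_format: str = "mab") -> str:
    """Return 'high' when the score exceeds the cutoff for the format, else 'low'."""
    threshold = THRESHOLDS[molecule_format]
    # Assumption: a score exactly at the cutoff is treated as low risk.
    return "high" if mol_score > threshold else "low"

print(classify_risk(0.35))                # monoclonal antibody, above 0.2
print(classify_risk(0.35, "bispecific"))  # bispecific, below 0.6
```

    The same score can therefore fall on different sides of the cutoff depending on the molecule format, which is why the format has to be supplied explicitly.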

    Usage

    Running this function from WeSeq is recommended, since it offers more interactive visualization.

    image.png

    Viewing results

    image.png
    Score is the predicted immunogenicity risk score (range 0-1); Risk is the risk rating.

    image.png

    image.png

    Check the epitopes against the structure and exclude inaccessible (buried) ones (figure below).
    image.png

    Deimmunization

    The simplest approach is to replace segments with human ones, which can be done directly in WeSeq (figure below).
    image.png

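    Human mutations can also be introduced with the frequency analysis function.
    After mutating, rerun the prediction on the variant to check whether its immunogenicity has decreased.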

    Citation

    The immunogenicity risk and potential T cell epitopes of the XX protein were predicted using the WeADApt v4.0.2 method of WeMol (wemol.wecomput.com, Wecomput Technology Co., Ltd.).

  • Name: Target-based Cyclic Peptide Design
    Description: The Target-based Cyclic Peptide Design module is built on the EvoBind2 model, which in turn is built on AF2, and designs novel head-to-tail cyclic peptides from the target protein sequence. High-affinity binders can be selected directly from the predicted confidence metrics, a binding site can be specified, and adversarial designs are avoided through isomer evaluation, which greatly improves the success rate.
    Tags: undefined
    Author: Qiuzhen Li
    Release: 2024-08-22 10:15:48
    Reference: Li, Qiuzhen et al. Design of linear and cyclic peptide binders of different lengths only from a protein target sequence. bioRxiv. 2024. p. 2024.06.20.599739.

    Target-based Cyclic Peptide Design

    简介

    Target-based Cyclic Peptide Design模块基于Evobind2模型实现,Evobind2模型是基于AF2实现的,旨在基于目标蛋白质序列设计新型首尾相接(Head-to-Tail)环肽。通过预测的置信度指标直接选择高亲和力结合环肽,支持指定结合位点。
    EvoBind2设计环肽/多肽的流程如下:
    image.png
    环肽设计案例,展示了首尾相接酰胺键:
    image.png

    参数说明

    Target Sequence

    靶标蛋白的序列文件,FASTA格式。

    Length

    设计的环肽长度。长度的选择将直接影响设计的环肽大小和潜在结合能力,推荐长度范围为6-20。

    Init Sequence

    环肽的起始序列,如果提供了该参数,模块将以此序列为基础进行优化,如:ARDCPLVNPL。在已知的有效序列基础上进行优化,而不是从头开始,有助于加快设计过程和提高设计效率。

    Hotspot

    靶标序列上的环肽结合位点残基,编号从1开始。指定这些位点可以提高设计的准确性和成功率。多个位点通过逗号分隔,如:23,45,67。

    结果说明

    输出结果包括:

    输出文件名称 说明
    best.pdb 最优设计的复合物结构文件
    design_pdbs.tar.gz 评分前20的复合物结构压缩文件
    top20.csv 评分前20的复合物结构名称及打分文件

    其中top20.csv包含如下信息:

    字段名称 说明
    ID 复合物结构ID
    pLDDT 预测得到的复合物LDDT评分,数值在0-100之间,越大表示结构质量越好
    Sequence 环肽序列

    参考文献

    Li, Qiuzhen et al. Design of linear and cyclic peptide binders of different lengths only from a protein target sequence. bioRxiv. 2024. p. 2024.06.20.599739.

    Target-based Cyclic Peptide Design

    Introduction

    The Target-based Cyclic Peptide Design module is built on the EvoBind2 model, which in turn is built on AF2, and designs novel head-to-tail cyclic peptides from the target protein sequence. High-affinity binders can be selected directly from the predicted confidence metrics, a binding site can be specified, and adversarial designs are avoided through isomer evaluation, which greatly improves the success rate.
    The EvoBind2 workflow for designing cyclic and linear peptides is as follows:
    image.png
    A cyclic peptide design example, showing the head-to-tail amide bond:
    image.png

    Parameter

    Target Sequence

    Target protein sequence file in FASTA format

    Length

    Length of the designed cyclic peptide. The length directly affects the size and potential binding capacity of the designed peptide; the recommended range is 6-20.

    Init Sequence

    The starting sequence of the cyclic peptide, e.g., ARDCPLVNPL. If this parameter is provided, the module optimizes from this sequence. Starting from a known effective sequence rather than from scratch helps speed up the design process and improve design efficiency.

    Hotspot

    Cyclic peptide binding site residues on the target sequence, numbered from 1. Specifying these sites can improve the accuracy and success of the design. Multiple sites are separated by commas, such as 23,45,67.
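
    Because hotspots are 1-based positions on the target sequence, it can be worth validating the string before submitting a job. The sketch below is one way to do that; the helper name and the idea of checking against the target length are assumptions for illustration, not part of the module.

```python
def parse_hotspots(text: str, target_length: int) -> list[int]:
    """Parse a comma-separated, 1-based residue list such as '23,45,67'."""
    positions = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate a trailing comma
        position = int(token)
        if not 1 <= position <= target_length:
            raise ValueError(f"hotspot {position} is outside 1..{target_length}")
        positions.append(position)
    # Duplicates are collapsed and the result is sorted for readability.
    return sorted(set(positions))

print(parse_hotspots("23,45,67", target_length=120))
```

    A position of 0, or one beyond the end of the target sequence, raises a ValueError instead of silently being passed on.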

    Result

    The output includes:

    Output File Name Description
    best.pdb Structure file of the best designed complex
    design_pdbs.tar.gz Compressed file containing the top 20 complex structures
    top20.csv File containing the names and scores of the top 20 complex structures

    The top20.csv file contains the following information:

    Field Name Description
    ID Complex structure ID
    pLDDT Predicted LDDT score of the complex, ranging from 0 to 100, with higher values indicating better structure quality
    Sequence Cyclic peptide sequence
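
    As a rough illustration of how top20.csv might be post-processed, the sketch below ranks designs by pLDDT and drops those under a chosen floor. It assumes a comma-delimited file whose header row uses exactly the field names listed above; the rows and the 70.0 floor are invented for the example.

```python
import csv
import io

# Invented rows that only mimic the documented fields (ID, pLDDT, Sequence).
example_csv = """ID,pLDDT,Sequence
design_1,78.4,ARDCPLVNPL
design_2,85.1,GRDCPWVNPL
design_3,62.0,ARDAPLVNPK
"""

def rank_designs(handle, min_plddt: float = 0.0) -> list[dict]:
    """Return rows sorted by descending pLDDT, dropping rows below min_plddt."""
    rows = [row for row in csv.DictReader(handle)
            if float(row["pLDDT"]) >= min_plddt]
    return sorted(rows, key=lambda row: float(row["pLDDT"]), reverse=True)

ranked = rank_designs(io.StringIO(example_csv), min_plddt=70.0)
for row in ranked:
    print(row["ID"], row["pLDDT"], row["Sequence"])
```

    For a real run, the in-memory text would be replaced by an open file handle on top20.csv.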

    Reference

    Li, Qiuzhen et al. Design of linear and cyclic peptide binders of different lengths only from a protein target sequence. bioRxiv. 2024. p. 2024.06.20.599739.

  • Name: Mutation Energy of Stability (ThermoMPNN)
    Description: The Mutation Energy of Stability (ThermoMPNN) module is implemented based on the ThermoMPNN model, a deep neural network that predicts the stability change caused by a single-point mutation from the initial structure of the protein. The model uses structural features extracted from ProteinMPNN, a deep neural network that predicts a protein's amino acid sequence from its three-dimensional structure, and achieves excellent prediction performance on established benchmark datasets.
    Tags: undefined
    Author: Henry Dieckhaus
    Release: 2024-08-07 15:14:52
    Reference: Dieckhaus H, et al. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proc Natl Acad Sci U S A. 2024 Feb 6;121(6):e2314853121.

    Mutation Energy of Stability (ThermoMPNN)

    简介

    Mutation Energy of Stability (ThermoMPNN)模块基于ThermoMPNN模型实现,此深度神经网络模型可根据蛋白初始结构,预测单点突变对应的稳定性变化。模型使用从ProteinMPNN(一种深度神经网络模型,可根据蛋白质的三维结构预测其氨基酸序列)中提取的结构特征,在已建立的基准数据集上实现了优秀的预测性能。通常认为,ddG < -0.5 kcal/mol 可能是一个有利于稳定性的突变。ThermoMPNN 在 Fireprot(HF)数据集上的正预测值为 56%(34/61 个预测为稳定的突变),在 Megascale 数据集上为 46%(1,312/2,852)。

    模型架构与数据集分析如下图所示:
    image.png
    模型预测效果与其他方法效果比较见下图:
    image.png

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式,支持单体或复合物结构

    Target Chain

    用于稳定性突变分析的链名称,仅支持单链,如:A

    结果说明

    输出result.csv结果文件,包含以下信息:

    列名 说明
    Chain 链名称,如:'A’表示A链
    Mutation 单点突变信息,如:'G1A’表示序列编号为1的残基甘氨酸G,突变为丙氨酸A,序列编号从1开始按顺序编号(非PDB文件中的残基序号)
    ddG_pred 突变对应的能量变化,负值表示体系能量较低,体系变得更稳定。负得越多表示稳定性提升越多。ddG < -0.5 kcal/mol 可能是一个有利于稳定性的突变

    参考文献

    Dieckhaus H, et al. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proc Natl Acad Sci U S A. 2024 Feb 6;121(6):e2314853121.

    Mutation Energy of Stability (ThermoMPNN)

    Introduction

    The Mutation Energy of Stability (ThermoMPNN) module is based on the ThermoMPNN model. This deep neural network predicts the stability change caused by a single-point mutation from the initial structure of the protein. The model uses structural features extracted from ProteinMPNN (a deep neural network that predicts amino acid sequences from the three-dimensional structure of proteins) and achieves excellent predictive performance on established benchmark datasets. If a ΔΔG° < -0.5 kcal/mol is taken to indicate a stabilizing mutation, ThermoMPNN achieves a PPV of 56% (34/61 predicted stabilizing mutations) on the Fireprot (HF) dataset and 46% (1,312/2,852) on the Megascale dataset.

    The model architecture and dataset analysis are shown in the figure below:
    image.png
    The comparison of the model’s predictive performance with other methods is shown in the figure below:
    image.png

    Parameter Description

    Structure PDB File

    The structure file of the protein in PDB format, supporting monomer or complex structures.

    Target Chain

    The name of the chain for stability mutation analysis, supporting only single chains, e.g., A.

    Result Description

    The output result.csv file contains the following information:

    Column Name Description
    Chain The name of the chain, e.g., ‘A’ for chain A
    Mutation Single-point mutation information, e.g., ‘G1A’ means the residue glycine G at sequence number 1 is mutated to alanine A. The sequence number starts from 1 in order (not the residue number in the PDB file)
    ddG_pred The energy change corresponding to the mutation. A negative value indicates lower system energy and increased stability. The more negative, the greater the stability improvement. ddG < -0.5 kcal/mol may indicate a stabilizing mutation
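
    The -0.5 kcal/mol guideline can be applied to result.csv as in the sketch below, which also splits a label such as 'G1A' into its parts. It assumes a comma-delimited file with the three documented column names in the header; the rows are invented, and the helper names are hypothetical.

```python
import csv
import io
import re

# Invented rows that follow the documented columns (Chain, Mutation, ddG_pred).
example_csv = """Chain,Mutation,ddG_pred
A,G1A,-0.82
A,L5P,1.95
A,K7R,-0.31
"""

MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_mutation(label: str) -> tuple[str, int, str]:
    """Split a label such as 'G1A' into (wild type, 1-based position, mutant)."""
    match = MUTATION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant

def stabilizing(handle, cutoff: float = -0.5) -> list[tuple[str, float]]:
    """Return (mutation, ddG_pred) pairs below the cutoff, most negative first."""
    hits = [(row["Mutation"], float(row["ddG_pred"]))
            for row in csv.DictReader(handle)]
    return sorted((hit for hit in hits if hit[1] < cutoff), key=lambda hit: hit[1])

print(stabilizing(io.StringIO(example_csv)))
print(parse_mutation("G1A"))
```

    Keep in mind that the parsed position is the sequential index used by the module, not the residue number in the PDB file.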

    References

    Dieckhaus H, et al. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proc Natl Acad Sci U S A. 2024 Feb 6;121(6):e2314853121.

  • Name: Homology Tree
    Description: Homology Tree模块用于生成同源性进化树。 The Homology Tree module is used to generate homologous evolutionary trees.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-08-05 15:30:13
    Reference:

    Homology Tree

    简介

    Homology Tree模块用于生成同源性进化树。

    参数说明

    Input File

    蛋白序列文件,FASTA格式。

    结果说明

    输出结果包括:

    输出文件名称 说明
    alignment.fasta 按树结构顺序输出的叠合后的序列文件的FASTA文件
    tree.png 多重序列树结构图片

    Homology Tree

    Introduction

    The Homology Tree module is used to generate homologous evolutionary trees.

    Parameter

    Input File

    Protein sequence file in FASTA format.

    Result

    The output includes:

    Output File Name Description
    alignment.fasta FASTA file of the aligned sequences, output in the order of the tree structure.
    tree.png Image of the tree built from the multiple sequences
  • Name: Structure Evolution
    Description: The Structure Evolution module is implemented based on the ESMIF model. The ESMIF inverse folding model predicts protein sequences from the atomic coordinates of the protein backbone. The model was trained on 12 million protein structures predicted by AlphaFold2 and consists of invariant geometric input processing layers followed by a sequence-to-sequence Transformer. It achieves 51% native sequence recovery on structurally held-out backbones and 72% recovery for buried residues. The model is also trained with span masking, so it tolerates missing backbone coordinates and can predict sequences for partially masked structures. This module can be used for both affinity maturation and stability optimization.
    Tags: undefined
    Author: VARUN R. SHANKER
    Release: 2024-07-29 16:11:04
    Reference: Varun R. Shanker et al., Unsupervised evolution of protein and antibody complexes with a structure-informed language model. Science,385,46-53(2024).

    Structure Evolution

    简介

    Structure Evolution模块基于ESMIF模型实现,ESMIF逆折叠模型旨在根据蛋白质主链原子坐标预测蛋白序列。该模型使用AlphaFold2预测的1200万个蛋白质结构进行训练,包含不变几何输入处理层,随后是一个序列到序列的Transformer,对于在结构上保持不变的主干序列实现51%的本地序列恢复率,对于埋藏残基实现72%的恢复率。该模型还经过跨度屏蔽训练,能够容忍缺失的主链坐标,因此可以预测部分被屏蔽结构的序列。该模块既可以用于亲和力成熟,也可以用于稳定性优化。
    image.png

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式,支持单体或复合物结构

    Target Chain

    用于进化分析的链名称,默认为A链

    Positions

    指定目标链中的多个残基,进行多点突变分析。使用残基位置编号(从1开始),多个残基用逗号分隔,指定残基范围用横杠符号。如:“3,10,24-30”表示目标链上的第3、第10与第24至30号残基,参与多点突变分析。
    备注:如不设置该参数,表示采用目标链的全长序列进行突变分析。

    Min Mutations

    指定突变点最小数目,默认值为1,表示从单点突变开始进行突变分析。如设置为2,表示从两点组合突变开始进行突变分析。

    Max Mutations

    指定突变点最大数目,默认值为3,表示至多进行三点组合突变。如设置为2时,表示最多进行两个点的多点组合突变。

    Max Substitutions

    指定参与多点突变分析的每个残基,其最大的替换数目,默认为5,表示每个残基最多突变为5种不同的其他残基。
    备注:理论上,每种残基可以突变为其他19种天然残基,但因多点突变可能引起的组合爆炸,这里我们限制了最大替换数目。每个残基具体替换的其他残基类别,会根据ESMIF模型给出的该位置残基的概率分布,优先选择概率高的残基类别。

    Predicted Mutation Probability

    输出CSV文件名称,包含了突变以及对应的突变的可能性。

    结果说明

    输出结果文件,包含以下信息:

    列名 说明
    Mutation 单点突变信息,如:'WT’表示野生型原序列,'G1A’表示序列编号为1的残基甘氨酸G,突变为丙氨酸A,序列编号从1开始按顺序编号(非PDB文件中的残基序号)
    Log_likelihood 突变序列对应的模型预测概率对数值,越大表示该突变序列越好
    Log_likelihood_target_chain 在结构为复合物情况下,进行分析的目标链序列对应的模型预测概率对数值,越大表示该突变序列越好

    参考文献

    • Varun R. Shanker et al., Unsupervised evolution of protein and antibody complexes with a structure-informed language model. Science, 385, 46-53 (2024).DOI: 10.1126/science.adk8946

    Structure Evolution

    Introduction

    The Structure Evolution module is based on the ESMIF model and is used for structure-based analysis of advantageous mutations. The ESMIF inverse folding model predicts protein sequences from the coordinates of protein backbone atoms. It was trained on 12 million protein structures predicted by AlphaFold2 and consists of invariant geometric input processing layers followed by a sequence-to-sequence Transformer. It achieves 51% native sequence recovery on structurally held-out backbones and 72% recovery for buried residues. The model is also trained with span masking, so it tolerates missing backbone coordinates and can predict sequences for partially masked structures. This module can be used for both affinity maturation and stability optimization.
    image.png

    Parameter

    Structure PDB File

    The structural file of the protein in PDB format, supporting both monomer and complex structures.

    Target Chain

    The name of the chain used for evolutionary analysis. Only single chains are supported. After uploading the structural file, you can select a chain name from the list of chains.

    Positions

    Specifies the residues in the target chain to include in the multi-point mutation analysis. Use residue position numbers (starting at 1); separate multiple residues with commas and use a hyphen to give a range. For example, “3,10,24-30” selects residues 3, 10, and 24 to 30 on the target chain for multi-point mutation analysis.
    Note: if this parameter is not set, the full-length sequence of the target chain is used for mutation analysis.
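
    The Positions syntax (commas between entries, a hyphen for a range, 1-based numbering) can be expanded as in the following sketch. The function is a hypothetical helper written for illustration; how the module itself parses the string is not documented here.

```python
def parse_positions(text: str) -> list[int]:
    """Expand a string such as '3,10,24-30' into sorted 1-based positions."""
    positions = set()
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range {token!r} runs backwards")
            positions.update(range(start, end + 1))  # ranges are inclusive
        else:
            positions.add(int(token))
    if any(position < 1 for position in positions):
        raise ValueError("positions are 1-based and must be >= 1")
    return sorted(positions)

print(parse_positions("3,10,24-30"))
```

    For the example string this yields nine positions: 3, 10, and 24 through 30 inclusive.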

    Min Mutations

    Specifies the minimum number of mutated positions. The default is 1, meaning the analysis starts with single-point mutations. If set to 2, the analysis starts with two-point combinations.

    Max Mutations

    Specifies the maximum number of mutated positions. The default is 3, meaning at most three-point combinations are generated. If set to 2, at most two-point combinations are generated.

    Max Substitutions

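    Specifies the maximum number of substitutions for each residue participating in the multi-point mutation analysis. The default is 5, meaning each residue is mutated to at most 5 other residue types.
    Note: in theory each residue can be mutated to any of the other 19 natural residues, but multi-point mutation can cause a combinatorial explosion, so the maximum number of substitutions is limited. The substitutions for each residue are chosen from the probability distribution that the ESMIF model gives for that position, with higher-probability residue types taken first.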
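
    To see why the substitution limit matters, the count of candidate variants can be estimated as the sum over k (from Min Mutations to Max Mutations) of C(n, k) · s^k, where n is the number of selected positions and s is Max Substitutions. The sketch below evaluates this upper bound; it assumes every combination is enumerated, which may not match what the module actually does.

```python
from math import comb

def count_variants(n_positions: int, min_mutations: int = 1,
                   max_mutations: int = 3, max_substitutions: int = 5) -> int:
    """Count variants: choose k positions, then one of s substitutions at each."""
    upper = min(max_mutations, n_positions)
    return sum(comb(n_positions, k) * max_substitutions ** k
               for k in range(min_mutations, upper + 1))

# Nine positions, e.g. the selection "3,10,24-30".
print(count_variants(9))                        # 11445 with the defaults
print(count_variants(9, max_substitutions=19))  # 589323 if all 19 alternatives were allowed
```

    Under these assumptions, limiting each position to 5 substitutions cuts the search space by roughly a factor of 50 for nine positions.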

    Predicted Mutation Probability

    Output CSV file containing the mutations and corresponding probabilities.

    Result

    The output file contains the following information:

    Column Name Description
    Mutation Single-point mutation information, e.g., ‘WT’ represents the wild-type original sequence, ‘G1A’ indicates that the residue glycine (G) at sequence position 1 is mutated to alanine (A). Sequence numbering starts from 1 in order (not the residue number in the PDB file).
    Log_likelihood The log value of the predicted probability of the mutated sequence by the model. The higher the value, the better the mutated sequence. If this value is greater than the corresponding value of WT, it indicates that the mutation is advantageous.
    Log_likelihood_target_chain In the case of complex structures, the log value of the predicted probability of the target chain sequence analyzed by the model. The higher the value, the better the mutated sequence. If this value is greater than the corresponding value of WT, it indicates that the mutation is advantageous.
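
    The comparison against the WT row described above can be automated as in this sketch. It assumes a comma-delimited file with the three documented column names in its header; the rows are invented, and the helper name is hypothetical.

```python
import csv
import io

# Invented rows that follow the documented columns.
example_csv = """Mutation,Log_likelihood,Log_likelihood_target_chain
WT,-1.520,-1.480
G1A,-1.495,-1.470
L5P,-1.710,-1.690
K7R,-1.450,-1.465
"""

def better_than_wild_type(handle, column: str = "Log_likelihood") -> list[tuple[str, float]]:
    """Return (mutation, gain over WT) for rows scoring above WT, best first."""
    rows = list(csv.DictReader(handle))
    wild_type = next(float(row[column]) for row in rows if row["Mutation"] == "WT")
    gains = [(row["Mutation"], float(row[column]) - wild_type)
             for row in rows if row["Mutation"] != "WT"]
    return sorted((gain for gain in gains if gain[1] > 0),
                  key=lambda gain: gain[1], reverse=True)

for name, gain in better_than_wild_type(io.StringIO(example_csv)):
    print(f"{name}: +{gain:.3f}")
```

    For a complex, passing column="Log_likelihood_target_chain" restricts the comparison to the analysed chain.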

    Reference

    • Varun R. Shanker et al., Unsupervised evolution of protein and antibody complexes with a structure-informed language model. Science, 385, 46-53 (2024).DOI: 10.1126/science.adk8946
  • Name: HPLC Modeling
    Description: HPLC Modeling模块用于高效液相色谱(HPLC)领域的计算模拟,采用人工智能(AI)方法对HPLC实验的保留时间(RT)进行预测,并根据待分离化合物结构进行HPLC方法的推荐。 HPLC Modeling module is designed for computational modeling in the field of High-Performance Liquid Chromatography (HPLC). It utilizes artificial intelligence (AI) methods to predict the retention time (RT) of HPLC experiments and recommends HPLC methods based on the structure of the compounds to be separated.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-07-16 12:35:07
    Reference:

    HPLC Modeling

    简介

    该模块用于高效液相色谱(HPLC)领域的计算模拟,采用人工智能(AI)方法对HPLC实验的保留时间(RT)进行预测,也可根据主成分分子结构及期望的RT值,推荐相应的HPLC方法。

    HPLC RT Prediction 保留时间预测

    参数说明

    Chemical Structure

    必填参数,待预测RT的化学结构,支持单结构或批量结构:
    单结构可通过WeDraw进行结构绘制,会自动保存为结构文件上传
    批量结构支持上传结构文件,文件格式为SDF格式或SMI/TXT格式,后者每行放置一个结构的SMILES

    Mobile Phase A 与 Mobile Phase B

    必填参数,流动相类型,单选,一般情况Mobile Phase B为有机相

    Buffer

    选填参数,缓冲液类型,单选,如没有可不选择

    PH

    必填参数,流动相整体的PH值,默认值为7.0

    Column

    必填参数,色谱柱类型,单选,可选择系统提供的色谱柱类型,也可选择列表最后的‘Custom’表示自定义色谱柱信息

    Column Name, Size, Length, Diameter, Bonded Molecule 自定义色谱柱的相关信息

    选填参数,在色谱柱类型选择‘Custom’时,表示自定义色谱柱,需要提供色谱柱相关信息:

    参数 说明
    Column Name 色谱柱名称
    Size 填料颗粒的粒径,单位为微米(μm)
    Length 色谱柱长度,单位为毫米(mm)
    Diameter 色谱柱内径,单位为毫米(mm)
    Bonded Molecule 色谱柱基质键连的核心分子结构,SMILES格式,比如C18色谱柱,其键连核心分子为十八烷烃,其SMILES为 CCCCCCCCCCCCCCCCCC

    Elution

    必填参数,洗脱类型,0表示等度洗脱,1表示梯度洗脱

    Isocratic Elution

    选填参数,当选择等度洗脱时,需设置等度洗脱条件:
    流速(单位:毫升/分钟 ml/min)
    流动相B的比例(取值0~1之间)
    两者用逗号分隔,如:流速为0.5ml/min,流动相B比例为0.2,则该参数填写为 0.5,0.2

    Gradient Elution

    选填参数,当选择梯度洗脱时,需设置梯度洗脱条件:
    时间(单位:分钟 min)
    流速(单位:毫升/分钟 ml/min)
    流动相B比例(取值0~1之间)
    三者用逗号分隔,每行三个数值,格式如下:
    时间1, 流速1, 流动相B比例1
    时间2, 流速2, 流动相B比例2
    … …

    结果说明

    result.txt文件,包含预测的保留时间RT,以及代表性的分子特征数据:

    列名 说明
    ID 结构编号Index
    SMILES 结构SMILES
    RT_Predict 预测的RT数值,单位为分钟(min)
    SlogP 计算的分子logP值, Wildman-Crippen logP
    TASA 疏水表面积,单位为平方埃(A2), Total hydrophobic surface area
    TPSA 极性表面积,单位为平方埃(A2), Total polar surface area
    RASA 相对疏水表面积,Relative hydrophobic surface area
    RPSA 相对极性表面积,Relative polar surface area
    nHeavyAtom 重原子数量,Number of heavy atoms
    nAromAtom 芳香性原子数量,Number of aromatic atoms
    nAcid 酸性基团数量,Acidic group count
    nBase 碱性基团数量,Basic group count
    RNCG 相对负电荷,Relative negative charge
    RPCG 相对正电荷,Relative positive charge
    nHBA 氢键受体数量,Number of hydrogen bond acceptor
    nHBD 氢键供体数量,Number of hydrogen bond donor
    fragCpx 结构片段复杂度,Fragment complexity
    GeomDiameter 几何直径,Geometric diameter
    nRing 环数量,Ring count
    naRing 芳香环数量,Aromatic ring count
    nRot 可旋转键数量,Rotable bond count
    RotRatio 可旋转键比例,Rotable bond ratio

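    Recommendation System (HPLC method recommendation)

    Parameter Description

    Principal Compound

    Mandatory parameter, the molecular structure of the principal compound.

    Expected RT

    Optional parameter, the expected RT value for the principal compound, e.g., 6.0.

    Result Description

    The recommend_methods.csv file contains the recommended HPLC methods and the predicted retention time (RT) of the principal compound under each method:

    Column Name Description
    ID Structure Index
    Expected_RT Expected RT value
    Predicted_RT Predicted RT value under the current HPLC method
    Mobile_A Mobile phase A
    Mobile_B Mobile phase B
    Addictive Additive name
    PH Overall pH value
    Column Chromatographic column
    Elution Elution mode

    The mol_property.csv file contains the molecular feature data of the principal compound (same as described in the HPLC RT Prediction results).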
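
    One simple way to use recommend_methods.csv is to order the recommended methods by how close the predicted RT is to the expected RT. The sketch below does this under the assumption that the file is comma-delimited with the column names listed above in its header; all rows and cell values are invented.

```python
import csv
import io

# Invented rows; column names follow the documented list.
example_csv = """ID,Expected_RT,Predicted_RT,Mobile_A,Mobile_B,Addictive,PH,Column,Elution
1,6.0,7.4,Water,Acetonitrile,Formic acid,3.0,C18,gradient
2,6.0,5.8,Water,Methanol,,7.0,C18,isocratic
3,6.0,6.9,Water,Acetonitrile,,7.0,C8,gradient
"""

def rank_by_rt_error(handle) -> list[dict]:
    """Sort recommended methods by |Predicted_RT - Expected_RT|, smallest first."""
    rows = list(csv.DictReader(handle))
    return sorted(rows, key=lambda row: abs(float(row["Predicted_RT"])
                                            - float(row["Expected_RT"])))

for row in rank_by_rt_error(io.StringIO(example_csv)):
    print(row["ID"], row["Column"], row["Elution"], row["Predicted_RT"])
```

    Rows without an Expected_RT value would need separate handling, since the parameter is optional.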

    HPLC Modeling

    Introduction

    This module is designed for computational modeling in the field of High-Performance Liquid Chromatography (HPLC). It utilizes artificial intelligence (AI) methods to predict the retention time (RT) of HPLC experiments and recommends HPLC methods based on the structure of the compounds to be separated.

    HPLC RT Prediction (Retention Time Prediction)

    Parameter Description

    Chemical Structure

    Mandatory parameter, the chemical structure for which RT is to be predicted. Supports single or batch structures:

    • Single structure can be drawn using WeDraw and will be automatically saved as a structure file for upload.
    • Batch structures can be uploaded as a file in SDF format or in SMI/TXT format; in an SMI/TXT file, each line holds the SMILES of one structure.

    Mobile Phase A and Mobile Phase B

    Mandatory parameter, the type of mobile phase, single selection. Generally, Mobile Phase B is the organic phase.

    Buffer

    Optional parameter, the type of buffer, single selection. If not applicable, it can be left unselected.

    pH

    Mandatory parameter, the overall pH value of the mobile phase. The default value is 7.0.

    Column

    Mandatory parameter, the type of chromatographic column, single selection. You can choose from the system-provided column types or select ‘Custom’ at the end of the list to define your own column information.

    Custom Column Information (Column Name, Size, Length, Diameter, Bonded Molecule)

    Optional parameter, required when selecting ‘Custom’ as the column type. Provides the relevant details for the custom column:

    Parameter Description
    Column Name The name of the column
    Size The particle size of the packing material, in micrometers (µm)
    Length The length of the column, in millimeters (mm)
    Diameter The internal diameter of the column, in millimeters (mm)
    Bonded Molecule The core molecular structure bonded to the column matrix, in SMILES format. For example, for a C18 column, the bonded core molecule is octadecane, represented as CCCCCCCCCCCCCCCCCC

    Elution

    Mandatory parameter, the type of elution, 0 for isocratic elution and 1 for gradient elution.

    Isocratic Elution

    Optional parameter, required when selecting isocratic elution. Specifies the conditions for isocratic elution:

    • Flow rate (unit: milliliters per minute, ml/min)
    • Proportion of Mobile Phase B (value between 0 and 1)
      Both values are separated by a comma. For example, if the flow rate is 0.5 ml/min and the proportion of Mobile Phase B is 0.2, this parameter should be entered as 0.5,0.2

    Gradient Elution

    Optional parameter, required when selecting gradient elution. Specifies the conditions for gradient elution:

    • Time (unit: minutes, min)
    • Flow rate (unit: milliliters per minute, ml/min)
    • Proportion of Mobile Phase B (value between 0 and 1)
      These three values are separated by commas, with each line containing three values in the following format:
      Time1, FlowRate1, ProportionB1
      Time2, FlowRate2, ProportionB2
      …
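
    The gradient table format above (one "time, flow rate, proportion of B" triple per line) can be checked before submission with a sketch like the one below. The function name is hypothetical, and the requirement that times strictly increase is an assumption added for the example rather than a documented rule.

```python
def parse_gradient(text: str) -> list[tuple[float, float, float]]:
    """Parse 'time, flow rate, proportion of B' lines into validated tuples."""
    steps = []
    for line_number, line in enumerate(text.strip().splitlines(), start=1):
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            raise ValueError(f"line {line_number}: expected 3 values, got {len(fields)}")
        time_min, flow_ml_min, fraction_b = (float(field) for field in fields)
        if not 0.0 <= fraction_b <= 1.0:
            raise ValueError(f"line {line_number}: proportion of B must be within 0..1")
        # Assumption: each step must come later than the previous one.
        if steps and time_min <= steps[-1][0]:
            raise ValueError(f"line {line_number}: time must increase")
        steps.append((time_min, flow_ml_min, fraction_b))
    return steps

program = """0, 0.5, 0.05
10, 0.5, 0.95
12, 0.5, 0.95"""
print(parse_gradient(program))
```

    The isocratic case is the two-value analogue: a single "flow rate, proportion of B" pair such as 0.5,0.2.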

    Result Description

    The result.txt file includes the predicted retention time (RT) and representative molecular feature data:

    Column Name Description
    ID Structure Index
    SMILES Structure SMILES
    RT_Predict Predicted RT value, in minutes (min)
    SlogP Calculated molecular logP value, Wildman-Crippen logP
    TASA Total hydrophobic surface area, in square angstroms (Ų)
    TPSA Total polar surface area, in square angstroms (Ų)
    RASA Relative hydrophobic surface area
    RPSA Relative polar surface area
    nHeavyAtom Number of heavy atoms
    nAromAtom Number of aromatic atoms
    nAcid Number of acidic groups
    nBase Number of basic groups
    RNCG Relative negative charge
    RPCG Relative positive charge
    nHBA Number of hydrogen bond acceptors
    nHBD Number of hydrogen bond donors
    fragCpx Structural fragment complexity
    GeomDiameter Geometric diameter
    nRing Number of rings
    naRing Number of aromatic rings
    nRot Number of rotatable bonds
    RotRatio Rotatable bond ratio
  • Name: Structural Alignment (USalign)
    Description: Structural Alignment (USalign)是基于USalign的结构叠合工具。 Structural Alignment (USalign) is a structural superposition tool based on USalign.
    Tags: undefined
    Author: Yang Zhang
    Release: 2024-06-17 00:00:00
    Reference: Chengxin Zhang, Morgan Shine et al. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. (2022)

    Structural Alignment (USalign)

    简介

    Structural Alignment (USalign)是基于USalign的结构叠合工具。

    参数说明

    Sample Structure

    需要进行叠合的蛋白结构文件,PDB格式

    Reference Structure

    叠合操作中位置保持不变的参考结构文件,PDB格式

    Aligned Structure

    输出的叠合后的结构文件名称

    结果说明

    输出叠合后结构文件:aligned_structure.pdb,TM-Score值在stdout.txt文件中。

    参考文献

    C Zhang, M Shine, AM Pyle, Y Zhang. US-align: Universal structure alignment of proteins, nucleic acids and macromolecular complexes. Nature Methods, 19: 1109-1115 (2022).

    Structural Alignment (USalign)

    Introduction

    Structural Alignment (USalign) is a structural superposition tool based on USalign.

    Parameter Description

    Sample Structure

    The protein structure file to be aligned, in PDB format.

    Reference Structure

    The reference structure file that remains fixed during the alignment operation, in PDB format.

    Aligned Structure

    The name of the output file for the aligned structure.

    Result Description

    The output includes the aligned structure file named aligned_structure.pdb and the TM-Score value in the stdout.txt file.
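
    Since the TM-Score is reported only inside stdout.txt, a small parser can pull it out. The sketch below assumes that stdout.txt contains lines in the usual US-align report style, "TM-score= <value> (normalized by ...)"; the excerpt and its numbers are invented, and the exact layout of the file on this platform is an assumption.

```python
import re

# Illustrative excerpt in the style of a US-align text report; numbers are invented.
example_stdout = """Aligned length=  120, RMSD=   1.85, Seq_ID=n_identical/n_aligned= 0.842
TM-score= 0.91234 (normalized by length of Structure_1: L=125, d0=4.12)
TM-score= 0.88017 (normalized by length of Structure_2: L=130, d0=4.19)
"""

def extract_tm_scores(text: str) -> list[float]:
    """Return every value that follows 'TM-score=' in the report, in order."""
    return [float(value)
            for value in re.findall(r"TM-score=\s*([0-9]*\.?[0-9]+)", text)]

print(extract_tm_scores(example_stdout))
```

    When two values are present, they are normalized by the lengths of the two input structures respectively, so it matters which one is quoted.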

    References

    C Zhang, M Shine, AM Pyle, Y Zhang. US-align: Universal structure alignment of proteins, nucleic acids, and macromolecular complexes. Nature Methods, 19: 1109-1115 (2022).

  • Name: Antibody Design (MEAN)
    Description: 基于MEAN模型实现,采用多通道等变图注意力网络,用于设计CDR的一维序列和三维结构。 Implemented based on the MEAN model, which utilizes a multi-channel equivariant graph attention network. It can be used to design the one-dimensional sequence and three-dimensional structure of CDRs.
    Tags: undefined
    Author: Xiangzhe Kong
    Release: 2024-06-26 11:34:29
    Reference: Xiangzhe Kong, Wenbing Huang, Yang Liu. Conditional Antibody Design as 3D Equivariant Graph Translation. arXiv. 2023.

    Antibody Design (MEAN)

    简介

    Antibody Design (MEAN)模块基于MEAN模型实现,该模型采用多通道等变图注意力网络,可用于设计CDR的一维序列和三维结构。具体而言,MEAN 通过导入额外的结构信息(包括目标抗原和抗体的轻链)将抗体设计公式化为条件图翻译问题。然后,MEAN重新采用 E(3)-等变消息传递以及提出的注意机制来更好地捕捉不同结构信息之间的几何相关性。最后,它通过多轮渐进式全景模式输出一维序列和三维结构,与以前的自回归方法相比,它具有更高的效率和精度。MEAN在序列和结构建模、抗原结合CDR设计和结合亲和力优化方面明显超越了届时最优模型。具体而言,抗原结合CDR设计相对于基线模型改进约为23%,亲和力优化相对于基线模型改进约为34%。
    MEAN模型架构如下图所示:
    image.png
    image.png

    参数说明

    Structure PDB File

    抗体-抗原复合物结构或抗体结构(建议采用复合物结构,设计效果更佳),PDB格式

    Heavy Chain

    指定结构中的抗体重链名称,默认值为H,注意如果上传的结构中抗体重链命名非H,请修改该参数为相应的链名

    Light Chain

    指定结构中的抗体轻链名称,默认值为L,注意如果上传的结构中抗体轻链命名非L,请修改该参数为相应的链名

    Design Type

    设计模式,有两种设计模式:CDR-H3设计与亲和力优化(Optimized)

    Number

    亲和力优化中,生成的结构数量,默认值为100

    结果说明

    CDR-H3设计

    输出结果包括:

    输出文件名称 说明
    cdrs.txt文件 包含设计的CDR-H3序列
    design.pdb文件 设计后的复合物结构文件,注意抗体结构只保留Fv区域

    亲和力优化

    输出结果包括:

    输出文件名称 说明
    ddg_scores.txt文件 优化后结构与原结构的亲和力差异评分
    opt_best.pdb文件 亲和力最优结构文件,注意抗体结构只保留Fv区域
    log.txt 亲和力优化文件日志
    opt.zip 优化后的多个结构的压缩文件

    其中,ddg_scores.txt文件,包含信息如下:

    列名 说明
    Name 结构名称
    ddG 与原结构的亲和力差异评分ddG,单位为kcal/mol,数值为负时表示亲和力有提升,负得越多表示亲和力提升越好

    参考文献

    Xiangzhe Kong, Wenbing Huang, Yang Liu. Conditional Antibody Design as 3D Equivariant Graph Translation. arXiv. 2023.

    Antibody Design (MEAN)

    Introduction

    The Antibody Design (MEAN) module is implemented based on the MEAN model, which employs a multi-channel equivariant graph attention network to design the one-dimensional sequence and three-dimensional structure of CDRs (complementarity-determining regions). Specifically, MEAN formulates antibody design as a conditional graph translation problem by importing additional structural context, including the target antigen and the light chain of the antibody. MEAN then uses E(3)-equivariant message passing together with a proposed attention mechanism to better capture the geometric correlations between the different components. Finally, it outputs both the one-dimensional sequence and the three-dimensional structure through a multi-round progressive full-shot scheme, which is more efficient and more accurate than previous autoregressive approaches. MEAN significantly outperforms the state-of-the-art models of its time in sequence and structure modeling, antigen-binding CDR design, and binding affinity optimization. Specifically, the relative improvement over baseline models is approximately 23% for antigen-binding CDR design and approximately 34% for affinity optimization.
    The MEAN model architecture is shown in the figure below:
    image.png
    image.png

    Parameter Description

    Structure PDB File

    The structure of the antibody-antigen complex or the antibody structure (the complex structure is recommended for better design results), in PDB format.

    Heavy Chain

    Specify the name of the antibody heavy chain in the structure, the default value is H. Note that if the antibody heavy chain in the uploaded structure is not named H, please modify this parameter to the corresponding chain name.

    Light Chain

    Specify the name of the antibody light chain in the structure, the default value is L. Note that if the antibody light chain in the uploaded structure is not named L, please modify this parameter to the corresponding chain name.

    Design Type

    The design mode. Two modes are available: CDR-H3 design and affinity optimization (Optimized).

    Number

    The number of structures generated in affinity optimization; the default value is 100.

    Result Description

    CDR-H3 Design

    The output results include:

    Output File Name Description
    cdrs.txt Contains the designed CDR-H3 sequences
    design.pdb The designed complex structure file, note that only the Fv region of the antibody structure is retained

    Affinity Optimization

    The output results include:

    Output File Name Description
    ddg_scores.txt Affinity difference scores between the optimized structure and the original structure
    opt_best.pdb The structure file with the best affinity, note that only the Fv region of the antibody structure is retained
    log.txt Affinity optimization log file
    opt.zip Compressed file of multiple optimized structures

    The ddg_scores.txt file contains the following information:

    Column Name Description
    Name Structure name
    ddG Affinity difference score ddG with the original structure, in kcal/mol. A negative value indicates an improvement in affinity, and the more negative, the better the improvement in affinity

    References

    Xiangzhe Kong, Wenbing Huang, Yang Liu. Conditional Antibody Design as 3D Equivariant Graph Translation. arXiv. 2023.

  • Name: Venn Diagram Plot
    Description: Venn Diagram Plot是一个制作韦恩图(Venn diagram)模块,常用于比较两个集合的重叠区域以及提取公共部分内容。例如,用于中药网络药理学分析中提取中药成分预测靶点与疾病相关靶点的交集。 Venn Diagram Plot is a module for creating Venn diagrams, commonly used to compare the overlapping areas of two sets and to extract the common elements. For example, it can be used in traditional Chinese medicine network pharmacology analysis to extract the intersection of predicted targets of traditional Chinese medicine components and disease-related targets.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-06-23 00:00:00
    Reference:

    Venn Diagram Plot

    简介

    Venn Diagram Plot是一个制作韦恩图(Venn diagram)模块,常用于比较两个集合的重叠区域以及提取公共部分内容。用于中药网络药理学分析中提取中药成分预测靶点与疾病相关靶点的交集。

    参数说明

    Set A File

    集合A文件,TXT格式,每行一个元素。

    Set B File

    集合B文件,TXT格式,每行一个元素。

    Labels

    作图时显示的图例,逗号分割,如:set A,set B

    Case Sensitive

    比较时是否大小写敏感:
    Yes:区分大小写比较
    No:不区分大小写比较

    Output Intersection

    输出包含交集部分内容的文件名称,默认为intersection.txt

    结果说明

    输出韦恩图文件venn_diagram.png以及交集部分内容的文本文件intersection.txt

    Venn Diagram Plot

    Introduction

    The Venn Diagram Plot module is used to create Venn diagrams, which are commonly utilized to compare the overlapping regions of two sets and extract the common elements. This is particularly useful in traditional Chinese medicine network pharmacology analysis for identifying the intersection of predicted targets of herbal components and disease-related targets.

    Parameter Description

    Set A File

    The file for set A, in TXT format, with one element per line.

    Set B File

    The file for set B, in TXT format, with one element per line.

    Labels

    The labels to be displayed in the diagram, separated by commas, e.g., set A,set B.

    Case Sensitive

    Whether the comparison is case-sensitive:

    • Yes: Case-sensitive comparison
    • No: Case-insensitive comparison

    Output Intersection

    The name of the output file containing the intersection elements, default is intersection.txt.

    Result Description

    The output includes a Venn diagram file named venn_diagram.png and a text file containing the intersection elements named intersection.txt.
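
    The comparison behind the intersection file can be sketched in a few lines of Python. This is an illustration only, not the module's actual implementation: the function name `intersect_sets` and the choice to report matches using the spelling from Set A are assumptions made for the example.

```python
def intersect_sets(set_a, set_b, case_sensitive=True):
    """Return the elements of set_a that also occur in set_b, without repeats.

    With case_sensitive=False, elements are compared after lowercasing and
    the spelling from set_a is kept in the result (an assumption).
    """
    normalize = (lambda s: s) if case_sensitive else str.lower
    keys_b = {normalize(item) for item in set_b}
    seen = set()
    common = []
    for item in set_a:
        key = normalize(item)
        if key in keys_b and key not in seen:
            seen.add(key)
            common.append(item)
    return common


targets_a = ["TP53", "egfr", "AKT1"]
targets_b = ["EGFR", "TP53"]
print(intersect_sets(targets_a, targets_b))                        # ['TP53']
print(intersect_sets(targets_a, targets_b, case_sensitive=False))  # ['TP53', 'egfr']
```

    With Case Sensitive set to No, gene symbols that differ only in capitalization (egfr vs. EGFR) count as the same element, which is often what is wanted when the two lists come from different databases.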

  • Name: Protein-Protein Interaction (STRING)
    Description: Protein-Protein Interaction (STRING)是基于STRING的提取蛋白相互作用模块。String是一个蛋白互作网络数据库,包含蛋白直接物理作用的互作关系以及间接作用的互作关系。 Protein-Protein Interaction (STRING) is a module for extracting protein interactions based on STRING. STRING is a protein interaction network database that includes both direct physical interactions and indirect functional associations between proteins.
    Tags: undefined
    Author: STRING
    Release: 2024-06-21 00:00:00
    Reference: Szklarczyk D, et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023 Jan 6;51(D1):D638-646.

    Protein-Protein Interaction (STRING)

    简介

    Protein-Protein Interaction (STRING)是基于STRING的提取蛋白相互作用模块。String是一个蛋白互作网络数据库,包含蛋白直接物理作用的互作关系以及间接作用的互作关系。

    参数说明

    Protein List

    蛋白名称列表文件,TXT格式,一行一个蛋白名称

    Cutoff

    蛋白-蛋白关联性打分的截断值,0~1之间,只导出combined_score为截断值以上的蛋白-蛋白相互作用数据。

    Related Protein

    是否输出相关蛋白;
    Yes:代表输出与输入蛋白相关的蛋白
    No:代表只输出输入蛋白之间存在的相互作用

    结果说明

    输出蛋白-蛋白相互作用文件string_interactions.tsv,每一列说明如下:

    列名 说明
    node1 节点1的蛋白名称
    node2 节点2的蛋白名称
    node1_string_id 节点1在STRING数据库中标准ID
    node2_string_id 节点2在STRING数据库中标准ID
    neighborhood_on_chromosome 基于基因组邻近性预测的相互作用得分。
    gene_fusion 基于基因融合事件预测的相互作用得分。
    phylogenetic_cooccurrence 基于共同出现(共现性)预测的相互作用得分。
    homology 蛋白之间的同源性。
    coexpression 基于共同表达(共表达)预测的相互作用得分。
    experimentally_determined_interaction 基于实验数据(例如,酵母双杂交实验)预测的相互作用得分。
    database_annotated 基于已知数据库信息预测的相互作用得分。
    automated_textmining 基于文本挖掘预测的相互作用得分。
    combined_score 综合所有上述信息计算得到的综合得分。

    参考文献

    • Szklarczyk D, et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023 Jan 6;51(D1):D638-646.
    • https://cn.string-db.org/

    Protein-Protein Interaction (STRING)

    Introduction

    Protein-Protein Interaction (STRING) is a module based on the STRING database for extracting protein interaction data. STRING is a protein interaction network database that includes both direct physical interactions and indirect functional associations between proteins.

    Parameter Description

    Protein List

    A file containing a list of protein names, in TXT format, with one protein name per line.

    Cutoff

    A cutoff value for the protein-protein association score, ranging from 0 to 1. Only protein-protein interactions with a combined score above this cutoff will be exported.

    Related Protein

    Whether to output related proteins:

    • Yes: Output proteins related to the input proteins.
    • No: Only output interactions among the input proteins.

    Result Description

    The output is a protein-protein interaction file named string_interactions.tsv. Each column is described as follows:

    Column Name Description
    node1 Protein name of node 1
    node2 Protein name of node 2
    node1_string_id Standard STRING ID for node 1
    node2_string_id Standard STRING ID for node 2
    neighborhood_on_chromosome Interaction score based on genomic neighborhood prediction
    gene_fusion Interaction score based on gene fusion events
    phylogenetic_cooccurrence Interaction score based on phylogenetic co-occurrence
    homology Homology between proteins
    coexpression Interaction score based on co-expression
    experimentally_determined_interaction Interaction score based on experimental data (e.g., yeast two-hybrid)
    database_annotated Interaction score based on known database information
    automated_textmining Interaction score based on text mining
    combined_score Combined score calculated from all the above information
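
    To post-process the exported table, for example to apply a stricter cutoff afterwards, the file can be read with the standard csv module. The sketch below assumes the header line uses exactly the column names listed above and treats the cutoff as inclusive; both points are assumptions, and `filter_interactions` is a hypothetical helper name.

```python
import csv
import io


def filter_interactions(tsv_text, cutoff):
    """Keep rows whose combined_score is at or above the cutoff (assumed inclusive)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["combined_score"]) >= cutoff]


# Tiny made-up table with only the columns needed for the illustration.
sample = (
    "node1\tnode2\tcombined_score\n"
    "TP53\tMDM2\t0.999\n"
    "TP53\tABC1\t0.310\n"
)
for row in filter_interactions(sample, cutoff=0.4):
    print(row["node1"], row["node2"], row["combined_score"])
```

    For a real file, pass the contents of string_interactions.tsv, for example `open("string_interactions.tsv", encoding="utf-8").read()`.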

    References

    • Szklarczyk D, et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023 Jan 6;51(D1):D638-646.
    • STRING Database: https://cn.string-db.org/
  • Name: Gene Enrichment (DAVID)
    Description: Gene Enrichment (DAVID)是基于DAVID的基因功能富集分析模块,DAVID是一个生物信息数据库,整合了生物学数据和分析工具,为大规模的基因或蛋白列表提供系统综合的生物功能注释信息。 Gene Enrichment (DAVID) is a gene function enrichment analysis module based on DAVID. DAVID is a bioinformatics database that integrates biological data and analysis tools, providing systematic and comprehensive biological functional annotation information for large-scale gene or protein lists.
    Tags: undefined
    Author: DAVID
    Release: 2024-06-21 00:00:00
    Reference: B.T. Sherman, M. Hao, J. Qiu, X. Jiao, M.W. Baseler, H.C. Lane, T. Imamichi and W. Chang. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Research. 23 March 2022. 50(W1):W216-W221. doi:10.1093/nar/gkac194.

    Gene Enrichment (DAVID)

    简介

    Gene Enrichment (DAVID)是基于DAVID的基因功能富集分析模块,DAVID是一个生物信息数据库,整合了生物学数据和分析工具,为大规模的基因或蛋白列表提供系统综合的生物功能注释信息。

    参数说明

    Gene List

    基因列表文件,TXT格式,一行一个基因/蛋白。

    Gene Identifier

    基因名称类型,支持多种数据库基因名称。

    P-value

    P-value,基因富集中统计差异检验使用的p值的截断值,只保留低于该截断值的富集条目。

    Gene Count

    基因数目截断值,只保留大于该截断值的富集条目。

    Category

    基因富集的类别,包括细胞组分(Cellular Component,CC),分子功能(Molecular Function,MF),生物学过程(Biological Process,BP)。

    Report File

    输出基因富集的结果文件,TSV格式。

    结果说明

    结果输出chartReport.tsv文件,文件中每一列代表说明如下:

    列名 说明
    Category 注释类别,例如GOTERM_BP_DIRECT(生物过程)、GOTERM_MF_DIRECT(分子功能)、GOTERM_CC_DIRECT(细胞组分)、KEGG_PATHWAY(KEGG通路)等。
    Term 具体的注释术语或通路名称。
    Count 输入基因集中注释到该术语的基因数目。
    % 输入基因集中注释到该术语的基因占总输入基因的百分比。
    PValue 富集分析的p值,表示注释到该术语的基因数目与随机情况下的期望数目之间的显著性差异。
    Benjamini Benjamini-Hochberg校正后的p值,用于控制假发现率(FDR)。
    FDR 假发现率,表示在所有显著结果中,预期的错误发现比例。
    Genes 注释到该术语的输入基因的列表,通常以逗号分隔。
    List Total 输入基因集中总的基因数目。
    Pop Hits 背景基因集中注释到该术语的基因数目。
    Pop Total 背景基因集的总基因数目。
    Fold Enrichment 富集倍数,即输入基因集中注释到该术语的基因比例(Count / List Total)与背景基因集中注释到该术语的基因比例(Pop Hits / Pop Total)之比。

    参考文献

    • B.T. Sherman, M. Hao, J. Qiu, X. Jiao, M.W. Baseler, H.C. Lane, T. Imamichi and W. Chang. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Research. 23 March 2022. 50(W1):W216-W221. doi:10.1093/nar/gkac194.
    • https://david.ncifcrf.gov

    Gene Enrichment (DAVID)

    Introduction

    Gene Enrichment (DAVID) is a gene functional enrichment analysis module based on DAVID. DAVID is a bioinformatics database that integrates biological data and analytical tools to provide systematic and comprehensive biological functional annotation information for large-scale gene or protein lists.

    Parameter Description

    Gene List

    A file containing the gene list in TXT format, with one gene/protein per line.

    Gene Identifier

    The type of gene name, supporting multiple database gene names.

    P-value

    The p-value cutoff for the enrichment significance test. Only enrichment entries with a p-value below this cutoff are retained.

    Gene Count

    The cutoff value of the number of genes, retaining only enrichment entries with a gene count greater than this cutoff value.

    Category

    The category of gene enrichment, including Cellular Component (CC), Molecular Function (MF), and Biological Process (BP).

    Report File

    The output file of gene enrichment results, in TSV format.

    Result Description

    The results are output in the chartReport.tsv file, with each column representing the following descriptions:

    Column Name Description
    Category Annotation category, such as GOTERM_BP_DIRECT (Biological Process), GOTERM_MF_DIRECT (Molecular Function), GOTERM_CC_DIRECT (Cellular Component), KEGG_PATHWAY (KEGG Pathway), etc.
    Term Specific annotation term or pathway name.
    Count The number of genes in the input gene set annotated to this term.
    % The percentage of genes in the input gene set annotated to this term.
    PValue The p-value of the enrichment analysis, indicating the significance of the difference between the number of genes annotated to this term and the expected number under random conditions.
    Benjamini The p-value after Benjamini-Hochberg correction, used to control the false discovery rate (FDR).
    FDR False discovery rate, indicating the expected proportion of false discoveries among all significant results.
    Genes The list of input genes annotated to this term, usually separated by commas.
    List Total The total number of genes in the input gene set.
    Pop Hits The number of genes in the background gene set annotated to this term.
    Pop Total The total number of genes in the background gene set.
    Fold Enrichment The fold enrichment: the proportion of input genes annotated to this term (Count / List Total) divided by the proportion of background genes annotated to this term (Pop Hits / Pop Total).
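
    Fold Enrichment can be recomputed from four other columns of the report, because DAVID defines it as a ratio of two proportions: (Count / List Total) divided by (Pop Hits / Pop Total). The helper name `fold_enrichment` below is chosen for this illustration only.

```python
def fold_enrichment(count, list_total, pop_hits, pop_total):
    """Ratio of the annotated fraction in the input list to that in the background."""
    if list_total <= 0 or pop_total <= 0 or pop_hits <= 0:
        raise ValueError("list_total, pop_total and pop_hits must be positive")
    input_fraction = count / list_total
    background_fraction = pop_hits / pop_total
    return input_fraction / background_fraction


# 10 of 100 input genes vs. 200 of 20,000 background genes annotated to a term.
print(fold_enrichment(count=10, list_total=100, pop_hits=200, pop_total=20000))
```

    In this example 10% of the input genes carry the annotation against 1% of the background, so the term is about tenfold enriched.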

    References

    • B.T. Sherman, M. Hao, J. Qiu, X. Jiao, M.W. Baseler, H.C. Lane, T. Imamichi, and W. Chang. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Research. 23 March 2022. 50(W1):W216-W221. doi:10.1093/nar/gkac194.
    • https://david.ncifcrf.gov
  • Name: TCM Chemical Ingredients
    Description: TCM Chemical Ingredients是基于中药名称提取中药化学成分的模块。 TCM Chemical Ingredients is a module for extracting chemical structures of Chinese herbs.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-06-20 00:00:00
    Reference: Liu Z, Cai C, Du J. et al. TCMIO: a comprehensive database of traditional Chinese medicine on Immuno-oncology. Front Pharmacol 2020;11:439.

    TCM Chemical Ingredients

    简介

    TCM Chemical Ingredients用于提取中药的化学成分的结构信息。

    参数说明

    TCM Name

    中药的名称,支持中文名、英文名、拼音名,支持多个名称,英文逗号分割。比如:人参,黄芪

    Remove Duplicates

    是否对成分的结构进行去重处理

    结果说明

    输出文件 描述
    ingredients.sdf 化学成分的结构文件,SDF格式
    ingredients.csv 化学成分的结构文件,CSV格式,里面包含SMILES等结构信息

    参考文献

    Liu Z, Cai C, Du J, et al. TCMIO: a comprehensive database of traditional Chinese medicine on Immuno-oncology. Front Pharmacol 2020;11:439.

    TCM Chemical Ingredients

    Introduction

    The TCM Chemical Ingredients module is used to extract structural information of chemical ingredients from traditional Chinese medicines (TCM).

    Parameter Description

    TCM Name

    The name(s) of the traditional Chinese medicine(s), given as Chinese, English, or Pinyin names. Multiple names are separated by English (half-width) commas, for example: 人参,黄芪 (ginseng, astragalus).

    Remove Duplicates

    Whether to remove duplicate structures of the ingredients:

    • Yes: Remove duplicates
    • No: Do not remove duplicates

    Result Description

    The output includes the following files:

    Output File Description
    ingredients.sdf Structural file of the chemical ingredients in SDF format
    ingredients.csv Structural file of the chemical ingredients in CSV format, containing SMILES and other structural information

    References

    Liu Z, Cai C, Du J, et al. TCMIO: a comprehensive database of traditional Chinese medicine on Immuno-oncology. Front Pharmacol 2020;11:439.

  • Name: Target Prioritization (OpenTargets)
    Description: 提取疾病相关靶点蛋白,基于OpenTarget数据库及其疾病-靶点相关性打分方法。 A module for extracting disease-related target proteins, based on the OpenTarget database and its disease-target association scoring method.
    Tags: undefined
    Author: Open Targets
    Release: 2024-06-20 00:00:00
    Reference: Ochoa, D et al. The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Research, 2023, DOI: 10.1093/nar/gkac1046

    Target Prioritization (OpenTargets)

    简介

    Target Prioritization (OpenTargets) 是提取疾病相关靶点蛋白的模块,基于OpenTarget数据库及其疾病-靶点相关性打分方法。

    参数说明

    Disease Name

    疾病的英文名称,如rheumatoid arthritis

    Data Type

    数据类型,包括直接关联和全部关联的数据。
    direct:直接关联数据,指有直接证据表明该疾病和靶点存在关联。
    all:全部关联数据,包括了间接关联数据,间接关联是基于本体论推断出来的疾病靶点关系。
    详细可参考:https://platform-docs.opentargets.org/associations

    Cutoff

    疾病-靶点关系打分的截断值,只输出大于截断值的靶点信息。

    Target Class

    靶点类型,默认为all 代表全部

    结果说明

    输出疾病及靶点相关的文件,包括:

    文件名称 文件说明
    disease_info.csv 疾病信息表
    target_info.csv 靶点信息表
    targets_by_data_source.csv 基于数据来源的疾病-靶点关系打分表
    targets_by_data_type.csv 基于数据类型的疾病-靶点关系打分表
    uniprot_ids.txt 靶点的蛋白UniProt ID列表
    genes.txt 靶点的基因名称列表

    参考文献

    https://platform-docs.opentargets.org/

    Target Prioritization (OpenTargets)

    Introduction

    The Target Prioritization (OpenTargets) module is used to extract disease-related target proteins based on the OpenTargets database and its disease-target association scoring method.

    Parameter Description

    Disease Name

    The English name of the disease, such as rheumatoid arthritis.

    Data Type

    The type of data, including directly associated and all associated data.

    • direct: Directly associated data, indicating there is direct evidence linking the disease to the target.
    • all: All associated data, including indirect associations inferred through ontological relationships.

    For more details, refer to: https://platform-docs.opentargets.org/associations

    Cutoff

    The cutoff value for the disease-target association score. Only target information with a score greater than this cutoff will be output.

    Target Class

    The type of target, default is all representing all target classes.

    Result Description

    The output includes files related to the disease and its targets:

    File Name Description
    disease_info.csv Disease information table
    target_info.csv Target information table
    targets_by_data_source.csv Disease-target association scores by data source
    targets_by_data_type.csv Disease-target association scores by data type
    uniprot_ids.txt List of target protein UniProt IDs
    genes.txt List of target gene names

    References

    OpenTargets Platform Documentation: https://platform-docs.opentargets.org/

  • Name: Structure Minimization
    Description: Structure Minimization是结构优化模块,支持氢原子优化、氨基酸侧链优化、整体优化三种方式。 Structure Minimization is a structure optimization module, supporting three methods: hydrogen atom optimization, amino acid side chain optimization, and overall optimization.
    Tags: undefined
    Author:
    Release: 2024-05-29 14:41:20
    Reference:

    Structure Minimization

    简介

    Structure Minimization是结构优化模块,支持氢原子优化、氨基酸侧链优化、整体优化三种方式。

    参数说明

    PDB File

    结构文件,PDB格式。

    Relax Type

    优化类型,支持以下几种:
    hydrogen:约束限制所有非氢原子,对结构上的氢原子进行优化。
    sidechain:约束蛋白骨架,优化蛋白氨基酸侧链,若存在小分子,整个小分子进行限制。
    all:系统整体优化,不做任何限制约束。
    可多选,进行多步优化。

    Cycle Number

    能量优化的步数。

    Force Field

    采用的分子力场,默认ff14SB。ff19SB, ff14SB适合蛋白和核酸的凝聚相模拟,也支持小分子。

    Restrain Force Constant

    约束力常数,单位为kcal/mol/Å^2,数值越大,约束能力越强。

    Output Name

    输出文件名称,默认minimized_structure.pdb。

    结果说明

    输出结果为优化后的结构文件minimized_structure.pdb,保留了输入文件中的链和氨基酸编号信息。

    Structure Minimization

    Introduction

    The Structure Minimization module is used for structural optimization, supporting three types of optimizations: hydrogen atom optimization, amino acid side chain optimization, and overall optimization.

    Parameter Description

    PDB File

    The structure file in PDB format.

    Relax Type

    The type of optimization, supporting the following options:

    • hydrogen: Constrains all non-hydrogen atoms and optimizes the hydrogen atoms in the structure.
    • sidechain: Constrains the protein backbone and optimizes the amino acid side chains. If small molecules are present, the entire small molecule is constrained.
    • all: Performs overall system optimization without any constraints.

    Multiple options can be selected to run a multi-step optimization.

    Cycle Number

    The number of steps for energy optimization.

    Force Field

    The molecular force field used, default is ff14SB. ff19SB and ff14SB are suitable for condensed phase simulations of proteins and nucleic acids, and also support small molecules.

    Restrain Force Constant

    The restraint force constant, in kcal/mol/Å². The larger the value, the stronger the restraint.
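
    To give a feel for the units, a positional restraint is commonly modeled as a harmonic penalty on the displacement of an atom from its reference position. The sketch below assumes the form E = k·d², the convention used for AMBER-style Cartesian restraints. Whether this module applies exactly this form, or one with a factor of 1/2, is not stated here, so treat the numbers as an illustration only.

```python
def restraint_energy(position, reference, force_constant):
    """Harmonic restraint energy (kcal/mol) for a single atom.

    position, reference: (x, y, z) coordinates in Å.
    force_constant: k in kcal/mol/Å^2.
    Assumes E = k * d^2, where d is the displacement from the reference.
    """
    squared_distance = sum((p - r) ** 2 for p, r in zip(position, reference))
    return force_constant * squared_distance


# An atom displaced by 1 Å under k = 10 kcal/mol/Å^2 pays 10 kcal/mol.
print(restraint_energy((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 10.0))  # 10.0
```

    Because the penalty grows with the square of the displacement, doubling the force constant doubles the cost of the same movement, while doubling the displacement quadruples it.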

    Output Name

    The name of the output file, default is minimized_structure.pdb.

    Result Description

    The output is the optimized structure file minimized_structure.pdb, retaining the chain and amino acid numbering information from the input file.

  • Name: Replace Chain Name
    Description: Replace Chain Name用于替换PDB文件中的链名。 Performs in-place replacement of a chain identifier by another.
    Tags: undefined
    Author:
    Release: 2024-06-07 00:00:00
    Reference:
  • Name: Structure Preparation
    Description: Structure Preparation是蛋白结构处理模块,用于补全缺失原子和残基,以及蛋白氨基酸残基的质子化判断以及加氢操作。 Structure Preparation is a protein structure preparation module used for adding missing atoms and residues, as well as for protonation determination and hydrogenation of protein amino acid residues.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-06-07 00:00:00
    Reference: J. Chem. Theory Comput. 2011, 7 (2), 525–537.

    Structure Preparation

    简介

    蛋白结构处理模块,用于补全缺失原子和残基,以及蛋白氨基酸残基的质子化判断以及加氢操作。采用pdbfixer补全缺失,采用propka3进行质子化判断。

    参数说明

    Structure File

    蛋白的结构文件,PDB格式

    Chains

    提取指定链处理,默认all,代表选择全部链,输入链名,多条链用英文逗号隔开,如A,B表示从PDB文件中提取A,B链进行结构处理。注意链名之间不要用空格。

    Delete Heterogens

    删除非标准蛋白或核酸残基,如水、离子、以及其他PDB中HETATM记录。
    all:表示删除所有HETATM记录,包括水、离子、小分子等;
    water:表示仅删除水;
    ions:表示仅删除离子,默认为NA,CL;
    custom:表示需要删除其他定制的残基名称,由Custom Heterogens参数指定。
    Heterogens详细介绍可参考:https://www.wwpdb.org/documentation/file-format-content/format23/sect4.html

    Custom Heterogens

    自定义Heterogens的残基名称,多个用英文逗号分隔,如ZN,MG

    Delete Hydrogens

    删除氢原子,Yes表示删除,No表示不删除。

    Add

    添加缺失的重原子或者残基。
    heavy:表示添加缺失重原子
    residues:表示添加缺失残基,默认也会添加缺失的原子

    Protonation

    是否进行质子化判断并添加氢原子,采用propka方法进行蛋白残基的质子化判断。
    Yes:代表根据质子化判断结果进行加氢操作,
    No:代表不加氢处理

    pH

    用于蛋白质子化状态判断的pH值。

    Naming Scheme

    输出PDB文件中残基和原子的命名方式。
    PDB:标准氨基酸格式,如组氨酸为HIS;
    AMBER:AMBER格式,如组氨酸为HID/HIE/HIP;
    CHARMM:CHARMM格式,如组氨酸为HSE/HSD/HSP。

    Prepared Structure

    输出的处理后的蛋白结构文件,PDB格式。默认文件名为:prepared_structure.pdb。

    结果说明

    输出处理好的结构文件,PDB格式。文件中的原子和残基类型按照指定Naming Scheme方法。

    参考文献

    • Olsson, M. H. M.; Søndergaard, C. R.; Rostkowski, M.; Jensen, J. H. PROPKA3: Consistent Treatment of Internal and Surface Residues in Empirical pKa Predictions. J. Chem. Theory Comput. 2011, 7 (2), 525–537. https://doi.org/10.1021/ct100578z.
    • https://github.com/jensengroup/propka
    • https://github.com/openmm/pdbfixer

    Structure Preparation

    Introduction

    The Structure Preparation module is used for completing missing atoms and residues in protein structures, as well as determining the protonation states of amino acid residues and adding hydrogen atoms. It uses pdbfixer for completing missing parts and propka3 for protonation state determination.

    Parameter Description

    Structure File

    The protein structure file in PDB format.

    Chains

    Specify the chains to be processed. The default is all, which means all chains will be processed. To specify chains, input the chain names separated by commas without spaces, e.g., A,B to process chains A and B from the PDB file.

    Delete Heterogens

    Remove non-standard protein or nucleic acid residues such as water, ions, and other HETATM records in the PDB.

    • all: Remove all HETATM records, including water, ions, small molecules, etc.
    • water: Remove only water.
    • ions: Remove only ions, default is NA,CL.
    • custom: Remove other specified residues, indicated by the Custom Heterogens parameter.

    For more details on Heterogens, refer to: https://www.wwpdb.org/documentation/file-format-content/format23/sect4.html

    Custom Heterogens

    Specify custom heterogens to be removed by their residue names, separated by commas, e.g., ZN,MG.
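
    The effect of the Delete Heterogens options can be illustrated with a small filter over PDB records. This is a sketch under stated assumptions, not the module's implementation (which relies on pdbfixer): the function name `delete_heterogens` is hypothetical, water is assumed to be named HOH or WAT, and only HETATM records are examined.

```python
WATER_NAMES = {"HOH", "WAT"}   # assumed water residue names
DEFAULT_IONS = {"NA", "CL"}    # default ion names given in the option list


def delete_heterogens(pdb_lines, mode, custom=()):
    """Drop HETATM records according to a Delete Heterogens-style mode.

    mode: "all", "water", "ions", or "custom".
    custom: residue names to drop when mode == "custom", e.g. ("ZN", "MG").
    """
    if mode == "all":
        unwanted = None  # None means every HETATM record is dropped
    elif mode == "water":
        unwanted = WATER_NAMES
    elif mode == "ions":
        unwanted = DEFAULT_IONS
    elif mode == "custom":
        unwanted = {name.strip().upper() for name in custom}
    else:
        raise ValueError(f"unknown mode: {mode}")

    kept = []
    for line in pdb_lines:
        if line.startswith("HETATM"):
            residue_name = line[17:20].strip()  # PDB columns 18-20
            if unwanted is None or residue_name in unwanted:
                continue
        kept.append(line)
    return kept
```

    For example, `delete_heterogens(lines, "custom", custom=("ZN", "MG"))` removes zinc and magnesium records while keeping water and any bound ligand.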

    Delete Hydrogens

    Remove hydrogen atoms.

    • Yes: Delete hydrogen atoms.
    • No: Do not delete hydrogen atoms.

    Add

    Add missing heavy atoms or residues.

    • heavy: Add missing heavy atoms.
    • residues: Add missing residues, which also adds missing atoms by default.

    Protonation

    Determine protonation states and add hydrogen atoms using the propka method.

    • Yes: Add hydrogen atoms based on protonation state determination.
    • No: Do not add hydrogen atoms.

    pH

    The pH value used for determining the protonation states of the protein residues.

    Naming Scheme

    The naming convention for residues and atoms in the output PDB file.

    • PDB: Standard amino acid format, e.g., histidine as HIS.
    • AMBER: AMBER format, e.g., histidine as HID/HIE/HIP.
    • CHARMM: CHARMM format, e.g., histidine as HSE/HSD/HSP.
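
    The histidine names differ between schemes because AMBER and CHARMM encode the protonation state in the residue name. The mapping below is a small illustration; the function name `histidine_name` and the state labels are hypothetical and are not identifiers used by the module.

```python
# Protonation state -> residue name in each scheme.
HISTIDINE_NAMES = {
    "delta":   {"AMBER": "HID", "CHARMM": "HSD"},  # proton on ND1
    "epsilon": {"AMBER": "HIE", "CHARMM": "HSE"},  # proton on NE2
    "both":    {"AMBER": "HIP", "CHARMM": "HSP"},  # doubly protonated, charge +1
}


def histidine_name(state, scheme):
    """Return the histidine residue name for a protonation state and naming scheme."""
    if scheme == "PDB":
        return "HIS"  # the standard PDB name does not encode protonation
    return HISTIDINE_NAMES[state][scheme]


print(histidine_name("epsilon", "AMBER"))   # HIE
print(histidine_name("epsilon", "CHARMM"))  # HSE
print(histidine_name("epsilon", "PDB"))     # HIS
```

    Choosing the scheme that matches the downstream simulation package avoids having to rename residues by hand before building a topology.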

    Prepared Structure

    The name of the output processed protein structure file in PDB format. The default file name is prepared_structure.pdb.

    Result Description

    The output is a processed structure file in PDB format. The atoms and residue types in the file follow the specified naming scheme.

    References

    • Olsson, M. H. M.; Søndergaard, C. R.; Rostkowski, M.; Jensen, J. H. PROPKA3: Consistent Treatment of Internal and Surface Residues in Empirical pKa Predictions. J. Chem. Theory Comput. 2011, 7 (2), 525–537. https://doi.org/10.1021/ct100578z.
    • PROPKA GitHub Repository: https://github.com/jensengroup/propka
    • PDBFixer GitHub Repository: https://github.com/openmm/pdbfixer
  • Name: Antibody RMSD
    Description: Antibody RMSD模块对参考抗体结构及其他CDR相同的抗体结构,进行基于CDR区域的结构叠合,并计算CDR区域的RMSD值。 The Antibody RMSD module performs a CDR region-based structure superposition of the reference antibody and other CDR identical antibody structures, and calculates the RMSD value of the CDR region.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-05-22 14:24:47
    Reference:

    Antibody RMSD

    简介

    Antibody RMSD模块对参考抗体结构及其他CDR相同的抗体结构,进行基于CDR区域的结构叠合,并计算CDR区域的RMSD值。支持普通抗体及纳米抗体。
    应用场景:人源化后的抗体序列,预测抗体结构后,比较各结构CDR区域的RMSD差异。支持普通抗体及纳米抗体。

    参数说明

    Antibody Structures

    多个抗体结构PDB文件的压缩打包文件,TAR格式

    Reference Structure

    进行RMSD计算的参考抗体结构,PDB格式

    Aligned PDB

    抗体叠合结构输出名称,TAR.GZ格式

    结果说明

    • RMSD计算结果,CSV格式文件‘result.csv’,包含信息如下:
    列名 说明
    Reference Antibody 参考抗体结构的名称
    The other Antibody 用于计算RMSD的其他抗体结构名称
    RMSD_CDRs CDR区域整体的RMSD值
    RMSD_CDR1 CDR1的RMSD值
    RMSD_CDR2 CDR2的RMSD值
    RMSD_CDR3 CDR3的RMSD值

    注意:进行RMSD计算的两个抗体结构,其CDR区域序列应相同,如有差异会导致计算出错。

    Antibody RMSD

    Introduction

    The Antibody RMSD module superposes the reference antibody structure and other antibody structures with identical CDRs based on their CDR regions, and calculates the RMSD values of the CDR regions. Both conventional antibodies and nanobodies are supported.
    Application scenario: after humanizing antibody sequences and predicting their structures, compare the RMSD differences in the CDR regions across the structures.

    Parameter Description

    Antibody Structures

    Compressed TAR file containing multiple antibody structure PDB files.

    Reference Structure

    Reference antibody structure in PDB format for RMSD calculation.

    Aligned PDB

    Output name for the archive of superposed antibody structures, in TAR.GZ format.

    Result Description

    • RMSD calculation results in a CSV format file ‘result.csv’, including the following information:
    Column Name Description
    Reference Antibody Name of the reference antibody structure
    The other Antibody Name of the other antibody structure used for RMSD calculation
    RMSD_CDRs RMSD value of the overall CDR regions
    RMSD_CDR1 RMSD value of CDR1
    RMSD_CDR2 RMSD value of CDR2
    RMSD_CDR3 RMSD value of CDR3

    Note: The CDR region sequences of the two antibody structures used for RMSD calculation should be identical; any differences may lead to calculation errors.
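
    The requirement for identical CDR sequences follows from how RMSD is defined: atoms are compared pairwise, so both structures must supply the same atoms in the same order. The sketch below computes the RMSD for two coordinate lists that are assumed to be already superposed; the superposition step itself is not shown, and the function name `rmsd` is chosen for this example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    The two lists must pair up atom by atom and are assumed to be superposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for point_a, point_b in zip(coords_a, coords_b):
        total += sum((a - b) ** 2 for a, b in zip(point_a, point_b))
    return math.sqrt(total / len(coords_a))


reference_cdr = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
model_cdr = [(0.0, 0.0, 0.0), (1.5, 0.0, 2.0)]
print(round(rmsd(reference_cdr, model_cdr), 3))  # 1.414
```

    If the CDR sequences differ, the atom lists no longer have the same length or order, and a pairwise comparison like this one cannot be set up.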

  • Name: Target Prediction (FastTargetPred)
    Description: 基于二维相似度的小分子靶点预测模块,活性分子及靶点数据来源于ChEMBL数据库。 A small molecule target prediction module based on 2D similarity. Active molecules and target data are derived from ChEMBL database.
    Tags: undefined
    Author: Ludovic Chaput
    Release: 2024-04-25 14:16:17
    Reference: Chaput L, Guillaume V, Singh N, Deprez B, Villoutreix BO. FastTargetPred: a program enabling the fast prediction of putative protein targets for input chemical databases. Bioinformatics. 2020 Aug 15;36(14):4225-4226.

    Target Prediction (FastTargetPred)

    简介

    Target Prediction (FastTargetPred)是基于二维相似度的小分子靶点预测模块,活性分子及靶点数据来源于ChEMBL25数据库,相似度计算采用1024位ECFP4的分子指纹,特点是速度快,几小时完成数十万化合物的靶点预测。

    参数说明

    SDF File

    小分子结构文件,SDF格式

    Tanimoto Threshold

    相似度(Tanimoto)阈值。从ChEMBL中查找大于相似度阈值的化合物。

    Output File

    输出文件名称

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 靶点预测结果的csv文件
    result.html 靶点预测结果的html文件

    其中输出结果包含信息如下:

    字段名称 说明
    Query name 查询分子名称
    Database molecule id ChEMBL中相似找出的相似分子ID
    Target id 靶标分子ID
    Score 相似度数值
    Uniprot 蛋白Uniprot ID
    Uniprot name Uniprot分子名称
    Status 数据发表情况
    Protein names 蛋白名称
    Gene names 基因名称
    Organism 物种名称
    CHEMBL 靶点CHEMBL分子ID
    Involvement in disease 参与疾病类型
    Geneontology (biological process) 基因本体(生物过程)
    Cross-reference (Reactome) 交叉引用(Reactome)

    参考文献

    Chaput L, Guillaume V, Singh N, Deprez B, Villoutreix BO. FastTargetPred: a program enabling the fast prediction of putative protein targets for input chemical databases. Bioinformatics. 2020 Aug 15;36(14):4225-4226. https://doi.org/10.1093/bioinformatics/btaa494

    Target Prediction (FastTargetPred)

    Introduction

    Target Prediction (FastTargetPred) is a module for predicting small molecule targets based on 2D similarity. The active molecules and target data are sourced from the ChEMBL25 database. Similarity calculation uses 1024-bit ECFP4 molecular fingerprints. The main feature of this module is its speed, capable of predicting targets for hundreds of thousands of compounds within a few hours.

    Parameter Description

    SDF File

    The structure file of small molecules in SDF format.

    Tanimoto Threshold

    The similarity (Tanimoto) threshold. Compounds from ChEMBL with a similarity greater than this threshold will be considered.
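
    The Tanimoto coefficient between two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below represents each fingerprint as a set of on-bit positions. It does not generate ECFP4 fingerprints, and the names `tanimoto` and `similar_compounds` as well as the example compound IDs are made up for the illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    bits_a, bits_b = set(bits_a), set(bits_b)
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def similar_compounds(query_bits, database, threshold):
    """Return (name, score) pairs with similarity greater than the threshold, best first."""
    hits = []
    for name, bits in database.items():
        score = tanimoto(query_bits, bits)
        if score > threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


query = {1, 2, 3}
database = {"cmpd_A": {2, 3, 4}, "cmpd_B": {1, 2, 3}, "cmpd_C": {7, 8}}
print(similar_compounds(query, database, threshold=0.4))
# [('cmpd_B', 1.0), ('cmpd_A', 0.5)]
```

    Raising the threshold returns fewer, more closely related reference compounds, and therefore fewer but more confident target suggestions.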

    Output File

    The name of the output file.

    Result Description

    The output results include:

    Output File Name Description
    result.csv CSV file containing the target prediction results
    result.html HTML file containing the target prediction results

    The output results contain the following information:

    Field Name Description
    Query name Name of the query molecule
    Database molecule id ID of the similar molecule found in ChEMBL
    Target id ID of the target molecule
    Score Similarity score
    Uniprot Uniprot ID of the protein
    Uniprot name Name of the Uniprot molecule
    Status Publication status of the data
    Protein names Names of the proteins
    Gene names Names of the genes
    Organism Name of the organism
    CHEMBL ChEMBL ID of the target
    Involvement in disease Types of diseases involved
    Geneontology (biological process) Gene ontology (biological process)
    Cross-reference (Reactome) Cross-reference (Reactome)

    References

    • Chaput L, Guillaume V, Singh N, Deprez B, Villoutreix BO. FastTargetPred: a program enabling the fast prediction of putative protein targets for input chemical databases. Bioinformatics. 2020 Aug 15;36(14):4225-4226. https://doi.org/10.1093/bioinformatics/btaa494
  • Name: Electrostatic Potential Calculation (APBS)
    Description: 基于APBS方法计算生物大分子结构的静电势能,并绘制表面图。 为了可视化显示表面图,请从结构编辑器WeView中执行该功能:Weview->Analysis->Electrostatics。 Calculate the electrostatic potential energy of biomolecular structures using the APBS method and generate surface plots. To visualize the surface maps, execute this function from the structure editor WeView: WeView->Analysis->Electrostatics.
    Tags: undefined
    Author: Elizabeth Jurrus
    Release: 2024-04-19 15:42:10
    Reference: Jurrus E, et al. Improvements to the APBS biomolecular solvation software suite. Protein Sci, 27 (1), 112-128, 2018.

    Electrostatic Potential Calculation (APBS)

    简介

    静电势(ESP,electrostatic potential)表面是指在分子周围某个曲面上静电势的分布,通过静电势对蛋白质表面着色有助于识别带电分子或极性分子的结合位点。正电位区域与负电荷互补,而负电位区域与正电荷互补。蛋白质静电势对于蛋白质的稳定性、折叠、酶催化、蛋白质间相互作用以及与其他分子的结合等方面起着关键作用。APBS(Adaptive Poisson-Boltzmann Solver )是业界著名的计算生物大分子结构静电势能的工具。

    参数说明

    PDB File

    蛋白结构文件,PDB格式

    Output Format

    输出文件格式,支持DX或者CUBE

    结果说明

    输出静电势能结果文件potential.dx或者potential.cube,用于将静电势能渲染到蛋白表面上。

    参考文献

    • Jurrus E, et al. Improvements to the APBS biomolecular solvation software suite. Protein Sci, 27 (1), 112-128, 2018. https://doi.org/10.1002/pro.3280
    • Vascon, F, et al. Protein Electrostatics: From Computational and Structural Analysis to Discovery of Functional Fingerprints and Biotechnological Design. Comput. Struct. Biotechnol. J. 2020, 18, 1774–1789. https://doi.org/10.1016/j.csbj.2020.06.029

    Electrostatic Potential Calculation (APBS)

    Introduction

    Electrostatic potential (ESP) surfaces represent the distribution of electrostatic potential around a molecule on a given surface. Coloring the protein surface based on electrostatic potential helps identify binding sites for charged or polar molecules. Regions with positive potential complement negatively charged molecules, while regions with negative potential complement positively charged molecules. Protein electrostatic potential plays a crucial role in protein stability, folding, enzymatic catalysis, protein-protein interactions, and binding with other molecules. APBS (Adaptive Poisson-Boltzmann Solver) is a renowned tool for calculating the electrostatic potential of biological macromolecules.

    Parameter Description

    PDB File

    The protein structure file in PDB format.

    Output Format

    The format of the output file, supporting DX or CUBE.

    Result Description

    The output electrostatic potential result file, named potential.dx or potential.cube, can be used to render the electrostatic potential on the protein surface.

    References

    • Jurrus E, et al. Improvements to the APBS biomolecular solvation software suite. Protein Sci, 27 (1), 112-128, 2018. https://doi.org/10.1002/pro.3280
    • Vascon, F, et al. Protein Electrostatics: From Computational and Structural Analysis to Discovery of Functional Fingerprints and Biotechnological Design. Comput. Struct. Biotechnol. J. 2020, 18, 1774–1789. https://doi.org/10.1016/j.csbj.2020.06.029
  • Name: Absolute Folding Stability
    Description: Absolute Folding Stability Prediction模块通过蛋白序列生成模型ESM-IF,预测蛋白质的绝对稳定性ΔG。模型在收集的数据集上进行测试,发现预测误差RMSE ≈ 1.5 kcal/mol,Pearson相关系数为0.7。这一进展降低了通过实验方法(如CD、荧光定量PCR等)获取蛋白质稳定性ΔG的成本,为计算预测蛋白质的折叠稳定性ΔG带来重大突破。 The Absolute Folding Stability Prediction module predicts the absolute stability ΔG of proteins using the protein sequence generation model ESM-IF. When tested on collected datasets, the model exhibited a prediction error with an RMSE of approximately 1.5 kcal/mol and a Pearson correlation coefficient of 0.7. This advancement reduces the cost of obtaining protein stability ΔG through experimental methods such as CD and fluorescence quantitative PCR, making a significant breakthrough in computationally predicting the folding stability ΔG of proteins.
    Tags: undefined
    Author: Sergey Ovchinnikov
    Release: 2024-05-16 10:11:19
    Reference: Cagiada, M., Ovchinnikov, S., Lindorff-Larsen, K. Predicting absolute protein folding stability using generative models. bioRxiv 2024.03.14.584940

    Absolute Folding Stability

    简介

    通过蛋白序列逆折叠模型ESM-IF,预测蛋白质的绝对稳定性ΔG。
    传统的物理方法(如FoldX、Rosetta等)预测蛋白稳定性ΔG,依赖于高置信度结构pdb,如果突变太多,结构置信度降低,预测结果较差。在ProteinGym的benchmark结果表明,生成模型ESM-IF在zero-shot预测DMS数据的蛋白突变稳定性ΔΔG达到同类最佳水平。该方法是在突变预测基础上的延伸,利用ESM-IF模型直接预测完整蛋白折叠稳定性的绝对ΔG值。
    经过测试,预测误差RMSE ≈ 1.5 kcal/mol,相关系数为0.7,是预测蛋白质的折叠稳定性ΔG的重大突破。

    原理:
    f3245508-826b-45b5-9f82-d92ca9ea15f6.webp

    • xk : 蛋白某位点为氨基酸k时,使用ESM-IF计算的log-likelihood库
    • xj : 蛋白遍历20种氨基酸时,在该位点为j时,使用ESM-IF计算的log-likelihood
    • Lk:Softmax得到蛋白某位点为氨基酸k时,对稳定性的贡献大小

    然后,将蛋白质所有氨基酸位点的Lk加和,得到蛋白整体的log-likelihood。
    最后,通过线性整体log-likelihood与实验稳定性ΔG拟合得到拟合参数,根据a/b就可以将log-likelihood转换成蛋白稳定性ΔG了。

    模型预测效果如下图所示:
    在两个不同数据集的 265 种蛋白质的预测稳定性值和实验稳定性值进行了比较。Spearman相关系数 (ρs) 为0.69,误差RMSE约为1.36 kcal/mol,相关性较好。
    image.png
    与其他基线模型比较结果如下图所示:
    image.png

    参数说明

    Protein Structure (PDB)

    蛋白结构文件,PDB格式

    Protein Structure (TAR)

    多个蛋白结构PDB的压缩文件,TAR格式。
    当同时上传蛋白结构PDB和压缩包时会合并计算。

    结果说明

    • 绝对稳定性计算结果CSV格式文件‘result.csv’,包含信息如下:
    列名 说明
    Name 结构名称
    Absolute_Folding_Stability (kcal/mol) dG,越大越好,代表去折叠状态能量减去折叠状态能量,即去折叠需要的能量值,通常为正值,能量越大表示需要能量越多,折叠状态越稳定

    企业微信截图_17201609906097.png

    参考文献

    Cagiada, M., Ovchinnikov, S., Lindorff-Larsen, K. Predicting absolute protein folding stability using generative models. bioRxiv 2024.03.14.584940; https://doi.org/10.1101/2024.03.14.584940

    Absolute Folding Stability

    Introduction

    The absolute folding stability ($\Delta G$) of a protein can be predicted using the inverse folding model ESM-IF. Traditional physical methods (such as FoldX, Rosetta, etc.) for predicting protein stability $\Delta G$ rely on high-confidence structure PDB files. If mutations are numerous, the structural confidence decreases, leading to poor prediction results. Benchmark results from ProteinGym show that the generative model ESM-IF achieves state-of-the-art performance in zero-shot prediction of protein mutation stability $\Delta \Delta G$ on DMS data. This method extends mutation prediction by using the ESM-IF model to directly predict the absolute $\Delta G$ value of the complete protein folding stability.

    In testing, the prediction error (RMSE) is approximately 1.5 kcal/mol and the Pearson correlation coefficient is about 0.7, a significant step forward in predicting the folding stability $\Delta G$ of proteins.

    Principle
    f3245508-826b-45b5-9f82-d92ca9ea15f6.webp

    • $x_k$: Log-likelihood calculated by ESM-IF when the residue at a given site is amino acid $k$.
    • $x_j$: Log-likelihood calculated by ESM-IF when the residue at that site is amino acid $j$, with $j$ running over all 20 amino acids.
    • $L_k$: Contribution to stability of amino acid $k$ at that site, obtained by applying a softmax over the 20 values $x_j$.

    The log-likelihood of the entire protein is obtained by summing the $L_k$ values over all amino acid sites. Finally, this total log-likelihood is fitted linearly to experimental stability values $\Delta G$, which yields the fit parameters $a$ and $b$. With $a$ and $b$, the log-likelihood of any protein can be converted into a stability $\Delta G$.
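    The steps above can be sketched as follows. The exact formula is given in the figure, which is not reproduced here, so this sketch assumes that $L_k$ is the logarithm of the softmax probability of amino acid $k$ at a site, and that the final conversion is the straight line $\Delta G = a \cdot L + b$. The function names and the values of a and b are hypothetical placeholders, not the fitted parameters of the module.

```python
import math
from typing import Sequence


def site_log_prob(site_scores: Sequence[float], k: int) -> float:
    """Assumed L_k: log-softmax of the score of amino acid k over all scores at a site."""
    top = max(site_scores)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_norm = top + math.log(sum(math.exp(x - top) for x in site_scores))
    return site_scores[k] - log_norm


def sequence_log_likelihood(scores: Sequence[Sequence[float]],
                            residues: Sequence[int]) -> float:
    """Sum L_k over all sites; residues[i] is the amino-acid index observed at site i."""
    return sum(site_log_prob(s, k) for s, k in zip(scores, residues))


def to_delta_g(total_log_likelihood: float, a: float, b: float) -> float:
    """Linear conversion with fit parameters a and b (hypothetical values in this sketch)."""
    return a * total_log_likelihood + b


# Toy input: 3 sites with identical scores for all 20 amino acids.
toy_scores = [[0.0] * 20 for _ in range(3)]
toy_residues = [0, 5, 19]
total = sequence_log_likelihood(toy_scores, toy_residues)
print(total)  # each site contributes log(1/20)
print(to_delta_g(total, a=-0.5, b=1.0))
```

    With uniform scores every site contributes log(1/20), which makes the toy case easy to check by hand; real ESM-IF scores would differ per site and per amino acid.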

    Model Prediction Performance
    The predicted stability values and experimental stability values for 265 proteins in two different datasets were compared. The Spearman correlation coefficient ($\rho_s$) is 0.69, and the error RMSE is about 1.36 kcal/mol, indicating good correlation.
    image.png

    Comparison with Other Baseline Models
    image.png

    Parameters

    Protein Structure (PDB)

    The protein structure file in PDB format.

    Protein Structure (TAR)

    A compressed file containing multiple protein structure PDBs in TAR format. When both the protein structure PDB and the compressed file are uploaded, they will be calculated together.

    Results

    • The absolute stability calculation result is provided in a CSV format file ‘result.csv’, containing the following information:
    Column Name Description
    Name Structure name
    Absolute_Folding_Stability (kcal/mol) $\Delta G$, the higher the better, representing the energy difference between the unfolded and folded states. It is usually a positive value, with higher values indicating greater stability in the folded state.
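    A short sketch of post-processing result.csv, assuming the file is comma-separated with a header row that uses exactly the two column names listed above. The structure names and values below are made up for illustration.

```python
import csv
import io
from typing import List, Tuple

STABILITY_COLUMN = "Absolute_Folding_Stability (kcal/mol)"


def rank_by_stability(csv_text: str) -> List[Tuple[str, float]]:
    """Return (name, dG) pairs sorted from most to least stable (largest dG first)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = [(row["Name"], float(row[STABILITY_COLUMN])) for row in reader]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical file content; real files come from the module output.
example_csv = (
    "Name,Absolute_Folding_Stability (kcal/mol)\n"
    "variant_a,3.2\n"
    "variant_b,5.1\n"
    "variant_c,-0.4\n"
)

for name, dg in rank_by_stability(example_csv):
    print(f"{name}\t{dg:.1f}")
```

    For a file on disk, the same function can be fed with the text returned by open("result.csv").read().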

    企业微信截图_17201609906097.png

    References

    • Cagiada, M., Ovchinnikov, S., Lindorff-Larsen, K. Predicting absolute protein folding stability using generative models. bioRxiv 2024.03.14.584940; https://doi.org/10.1101/2024.03.14.584940
  • Name: De novo Generation (REINVENT4)
    Description: Small molecule de novo generation based on REINVENT4. Several generation modes are supported: Reinvent (create new molecules from scratch), Libinvent (decorate a scaffold), Linkinvent (design a linker between two fragments), and Mol2Mol (optimize molecules within a user-defined similarity range). REINVENT 4 enables and facilitates de novo design, R-group replacement, library design, linker design, scaffold hopping and molecule optimization.
    Tags: undefined
    Author: Hannes H. Loeffler
    Release: 2024-05-16 14:52:00
    Reference: Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). https://doi.org/10.1186/s13321-024-00812-5

    De novo Generation (REINVENT4)

    简介

    De novo Generation (REINVENT4)是基于阿斯利康开源的REINVENT4算法用于小分子全新生成的模块。支持多种分子生成方式:Reinvent - 从头开始创造新类药分子,Libinvent - 修饰一个骨架,Linkinvent - 设计两个片段之间的linker,Mol2Mol - 在用户定义的相似度范围内优化分子。
    image.png

    参数说明

    Model Type

    模型类型:Reinvent、LibInvent、LinkInvent、Mol2Mol
    Reinvent是从头开始创造新分子,Libinvent修饰一个骨架,Linkinvent识别两个片段之间的连接器,而Mol2Mol 则在用户定义的相似度范围内优化分子。

    Small Molecule Structure File

    小分子结构文件,SDF或者SMILES格式。除了Reinvent外,其余模型为必填项。

    Unique Molecules

    输出唯一的标准化分子:true或者false

    Randomize Molecules

    对原子进行随机“洗牌操作”:true或者false。“随机洗牌”是为了避免数据投入的顺序对网络训练造成影响。

    Number Molecules

    生成的分子个数,注意:它乘以输入分子的个数为最终输出总分子数

    Sample Strategy

    仅在Mol2Mol使用:beamsearch或者multinomial

    • Beam Search(束搜索):这是一种搜索算法,通过扩展最有希望的节点来探索图形,但仅限于一组有限的节点。它常用于优化和搜索问题中,能够同时跟踪多个假设并逐步扩展。束搜索在您希望找到最可能的状态或输出序列时非常有用。
    • Multinomial Sampling(多项式采样):这种方法涉及从一个概率分布中进行采样,其中每个结果都被分配了一个概率。它通常用于需要在选择过程中引入变异性或随机性的场景。多项式采样在探索多样化的可能性集或模拟不同情境时非常有益。

    Mol2Mol Priors

    在 Mol2Mol 中,有5种不同的训练模型:

    1. Low_similarity:Tanimoto similarity > 0.5;
    2. Medium_similarity:0.5 < Tanimoto similarity < 0.7,通常表示中等程度的结构相似性;
    3. High_similarity:Tanimoto similarity > 0.7,表示高度相似的分子;
    4. Scaffold:要求分子具有相同的Murcko骨架,Murcko骨架是一种用于描述分子结构的核心骨架;
    5. Generic_scaffold:要求分子具有相同的未标记的Murcko骨架,指在Murcko骨架中未标记特定原子或功能团的结构。

    Temperature

    仅在Mol2Mol使用:多项抽样中的温度

    Use CUDA

    使用GPU进行计算:true或者false

    Output CSV File

    输出CSV文件名称

    Output SDF File

    输出SDF文件名称

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 全新生成的化合物CSV文件,包含了SMILES信息
    denovo.sdf 全新生成的化合物SDF文件

    参考文献

    • Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). https://doi.org/10.1186/s13321-024-00812-5
    • https://github.com/MolecularAI/REINVENT4

    De novo Generation (REINVENT4)

    Introduction

    De novo Generation (REINVENT4) is a module based on AstraZeneca’s open-source REINVENT4 algorithm for generating new small molecules. It supports various molecule generation methods: Reinvent - creating new drug-like molecules from scratch, Libinvent - modifying a scaffold, Linkinvent - designing a linker between two fragments, and Mol2Mol - optimizing molecules within a user-defined similarity range.

    image.png

    Parameters

    Model Type

    Type of model to use: Reinvent, LibInvent, LinkInvent, Mol2Mol.

    • Reinvent: Creates new molecules from scratch.
    • LibInvent: Decorates (modifies) a given scaffold.
    • LinkInvent: Designs a linker between two fragments.
    • Mol2Mol: Optimizes molecules within a user-defined similarity range.

    Small Molecule Structure File

    File containing small molecule structures in SDF or SMILES format. This is required for all models except Reinvent.

    Unique Molecules

    Whether to output unique standardized molecules: true or false.

    Randomize Molecules

    Whether to perform random atom shuffling: true or false. “Random shuffling” helps to avoid the impact of input order on network training.

    Number Molecules

    Number of molecules to generate. Note that this number multiplied by the number of input molecules gives the total number of output molecules.

    Sample Strategy

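    Used only in Mol2Mol: beamsearch or multinomial.

    • Beam search: A search algorithm that explores a graph by expanding only the most promising nodes, keeping a limited set of candidates at each step. It tracks several hypotheses at once and extends them step by step, which is useful when the most probable output sequences are wanted.
    • Multinomial sampling: Draws each outcome from a probability distribution in which every outcome has an assigned probability. It introduces randomness into the selection and is useful for exploring a diverse set of possibilities.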

    Mol2Mol Priors

    In Mol2Mol, five pre-trained models (priors) are available:

    1. Low_similarity: Tanimoto similarity > 0.5.
    2. Medium_similarity: 0.5 < Tanimoto similarity < 0.7, a moderate degree of structural similarity.
    3. High_similarity: Tanimoto similarity > 0.7, highly similar molecules.
    4. Scaffold: Requires molecules to have the same Murcko scaffold, the core framework of the molecule.
    5. Generic_scaffold: Requires molecules to have the same generic Murcko scaffold, that is, the Murcko scaffold with atom types and bond orders ignored.
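    To make the similarity ranges concrete, the sketch below computes a Tanimoto similarity on two toy fingerprints, represented as Python sets of "on" bit indices, and lists which of the three similarity ranges above contain the value. The fingerprints and function names are illustrative assumptions; a real workflow would derive fingerprints from molecules with a cheminformatics toolkit.

```python
from typing import List, Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by all on-bits of either fingerprint."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def matching_similarity_priors(similarity: float) -> List[str]:
    """List the similarity priors whose stated Tanimoto range contains the value."""
    names = []
    if similarity > 0.5:
        names.append("Low_similarity")
    if 0.5 < similarity < 0.7:
        names.append("Medium_similarity")
    if similarity > 0.7:
        names.append("High_similarity")
    return names


# Toy fingerprints: 3 shared bits out of 5 distinct bits in total.
reference_fp = {1, 2, 3, 4}
candidate_fp = {2, 3, 4, 5}
sim = tanimoto(reference_fp, candidate_fp)
print(sim, matching_similarity_priors(sim))
```

    Because the Low_similarity range is open-ended, a value in the medium or high range also falls inside it, which the returned list makes visible.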

    Temperature

    Used only in Mol2Mol: Temperature for multinomial sampling.
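    The role of the temperature in multinomial sampling can be illustrated with a generic sketch: the model scores (logits) for the next token are divided by the temperature before being turned into probabilities, so low temperatures concentrate probability on the top-scoring token and high temperatures flatten the distribution. This is a textbook illustration with made-up scores and hypothetical function names, not the REINVENT4 implementation.

```python
import math
import random
from typing import List, Sequence


def temperature_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: Sequence[float], temperature: float,
                 rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
p_cold = temperature_softmax(toy_logits, 0.5)
p_hot = temperature_softmax(toy_logits, 2.0)
print([round(p, 3) for p in p_cold])
print([round(p, 3) for p in p_hot])
print(sample_token(toy_logits, 1.0, random.Random(0)))
```

    Beam search, by contrast, is deterministic: it keeps the highest-scoring partial sequences at each step instead of drawing them at random.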

    Use CUDA

    Whether to use GPU for computation: true or false.

    Output CSV File

    Name of the output CSV file.

    Output SDF File

    Name of the output SDF file.

    Results

    The output includes:

    Output File Name Description
    result.csv CSV file containing newly generated compounds, including SMILES information
    denovo.sdf SDF file containing newly generated compounds

    References

    • Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). https://doi.org/10.1186/s13321-024-00812-5
    • https://github.com/MolecularAI/REINVENT4
  • Name: Grafting (v2.1)
    Description: Grafts antibody CDRs onto target framework templates, typically for humanization design. Version: v2.1
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-05-07 16:43:58
    Reference:

    Grafting v2.1

    简介

    Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.1

    参数说明

    Antibody Sequence File

    抗体序列文件,FASTA格式

    Numbering Type

    抗体编号规则:kabat,imgt,chothia

    Output File

    指定输出抗体graft后的序列文件名称,FASTA格式

    Output Policy

    指定输出graft策略文件,JSON格式

    Germline Score

    指定输出抗体FR区序列比对同源性打分文件

    Germline

    指定轻链或重链使用特定germline模板,也可都指定,写法如下:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    其中链名来自于流程第一步输入的fasta文件。
    例1:以下语句为链"Infliximab.H"指定了模板"IGHV3-7*01":

    Infliximab.H:IGHV3-7*01
    

    例2:以下语句为两条链分别指定了模板:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
    

    Template Sequence

    指定参考模板序列,FASTA格式

    Germline Hits

    指定输出FR区序列比对结果文件,FASTA格式

    Number of Hits

    指定输出命中序列的数目

    结果说明

    输出结果包括:

    输出文件名称 说明
    germline_hits.fasta 输出FR区序列比对结果文件
    germline_score.json 输出抗体FR区序列比对同源性打分文件
    grafted.fasta 输出抗体graft后的序列文件名称
    graft_policy.json 输出graft策略文件

    Grafting v2.1

    Introduction

    The Grafting module is used to graft the CDRs of an antibody onto a specific framework region template, typically used for humanization design. Version: v2.1

    Parameter Description

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Numbering Type

    Antibody numbering rule: kabat, imgt, chothia.

    Output File

    Specify the output file name for the grafted antibody sequence in FASTA format.

    Output Policy

    Specify the output grafting strategy file in JSON format.

    Germline Score

    Specify the output file for the homology scores of the antibody FR region sequences.

    Germline

    Specify the specific germline template to be used for the light chain or heavy chain, or both, in the format:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    Where the chain names come from the FASTA file input in the first step of the process.
    Example 1: The following statement specifies the template “IGHV3-7*01” for the chain “Infliximab.H”:

    Infliximab.H:IGHV3-7*01
    

    Example 2: The following statement specifies templates for two chains separately:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
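    A small sketch of how a Germline value in this format could be checked before submitting a job. It assumes that sequence names contain no colon or comma; the function name parse_germline_spec is hypothetical and not part of the module.

```python
from typing import Dict


def parse_germline_spec(spec: str) -> Dict[str, str]:
    """Split 'seq_name1:germline_name1,seq_name2:germline_name2' into a mapping."""
    mapping: Dict[str, str] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        name, sep, germline = item.partition(":")
        if not sep or not name.strip() or not germline.strip():
            raise ValueError(f"expected 'seq_name:germline_name', got {item!r}")
        mapping[name.strip()] = germline.strip()
    return mapping


templates = parse_germline_spec("Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01")
print(templates)
```

    A malformed entry such as "Infliximab.H" without a germline name raises a ValueError instead of being silently dropped.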
    

    Template Sequence

    Specify the reference template sequence in FASTA format.

    Germline Hits

    Specify the output file for the FR region sequence alignment results in FASTA format.

    Number of Hits

    Specify the number of hit sequences to output.

    Result Description

    The output includes:

    Output File Name Description
    germline_hits.fasta Output file for FR region sequence alignment results
    germline_score.json Output file for homology scores of the antibody FR region sequences
    grafted.fasta Output file name for the grafted antibody sequence
    graft_policy.json Output file for the grafting strategy
  • Name: Structural Energy
    Description: Calculates the energy of multiple protein structures based on a physical model (an empirical molecular mechanics force field) and compares it with the energy of a reference protein structure.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-04-28 11:26:05
    Reference:

    Structural Energy

    简介

    该模块基于物理模型(分子力学经验力场)计算多个蛋白结构的能量,并与参考蛋白结构的能量进行比较。

    参数说明

    Target Structure

    多个蛋白结构PDB文件的压缩打包文件,TAR格式

    Reference Structure

    进行能量比对的参考蛋白结构,PDB格式

    结果说明

    • 能量比对的结果CSV格式文件‘energy_rank.csv’,包含信息如下:
    列名 说明
    Name 结构名称
    Score 能量打分,数值负得越多表示能量越低

    Structural Energy

    Introduction

    This module calculates the energy of multiple protein structures based on a physical model (an empirical molecular mechanics force field) and compares these energies with the energy of a reference structure.

    Parameter Description

    Target Structures

    Compressed TAR file containing multiple protein structure PDB files.

    Reference Structure

    Reference structure in PDB format for energy comparisons.

    Result Description

    • The result of energy comparison is stored in a CSV file named ‘energy_rank.csv’, which includes the following information:
    Column Name Description
    Name Structure name
    Score Energy score, where a more negative value indicates lower energy
  • Name: De novo Generation (REINVENT4) ALL
    Description: De novo Generation (REINVENT4) is a small molecule de novo generation module based on REINVENT4, the open-source algorithm developed by AstraZeneca. Several generation modes are supported: Reinvent (create new molecules from scratch), Libinvent (decorate a scaffold), Linkinvent (design a linker between two fragments), and Mol2Mol (optimize molecules within a user-defined similarity range). REINVENT 4 enables and facilitates de novo design, R-group replacement, library design, linker design, scaffold hopping and molecule optimization.
    Tags: undefined
    Author: Hannes H. Loeffler
    Release: 2024-05-16 14:52:00
    Reference: Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). https://doi.org/10.1186/s13321-024-00812-5

    De novo Generation (REINVENT4)

    简介

    De novo Generation (REINVENT4)是基于REINVENT4算法的小分子全新生成模块。支持多种分子生成方式:Reinvent - 从头开始创造新分子,Libinvent - 修饰一个骨架,Linkinvent - 设计两个片段之间的linker,Mol2Mol - 在用户定义的相似度范围内优化分子。
    分子从头算法示意图如下:根据RNN模型生成Celecoxib相似结构的化合物。
    image.png
    image.png
    De novo Generation (REINVENT4)目前有四种运行模式:

    1. Sampling:在分子设计中,采样是指生成新的分子结构。REINVENT4使用不同的方法和策略来生成具有潜在生物活性的新分子。这些方法可能包括基于化学规则的分子生成、基于统计模型的分子生成、以及深度学习模型等。采样过程是生成新分子的起点。
    2. Scoring:生成的新分子需要进行评分以确定它们的潜在生物活性或其他性质。在REINVENT4中,评分通常涉及使用预先训练的模型或算法对分子进行预测和评估。这些评分可以基于分子的结构、性质、相互作用等方面。评分过程有助于筛选出具有最高潜在活性的分子。
    3. Transfer Learning:在分子设计中,迁移学习是指利用一个任务上学到的知识来帮助另一个相关任务。在REINVENT4中,迁移学习可以用于将已经学习到的知识或模型应用于新的分子设计任务。这有助于加速新模型的训练和提高性能。
    4. Staged Learning:分阶段学习是一种训练模型的方法,其中模型在多个阶段逐步提高性能。在REINVENT4中,分阶段学习可能涉及使用不同的数据集、模型架构或超参数来训练模型的不同阶段。这种方法有助于提高模型的鲁棒性和性能。

    参数说明

    Sampling模式

    Number Molecules

    生成的分子个数,注意:它乘以输入分子的个数为最终输出总分子数

    Sample Strategy

    仅在Mol2Mol使用:beamsearch或者multinomial

    Temperature

    仅在Mol2Mol使用:多项抽样中的温度

    Scoring模式

    Scoring Type

    'arithmetic_mean’表示加权算术平均值,'geometric_mean’表示加权几何平均值。

    Weight

    打分成分的权重,当输入多个成分时,需要输入对应的权重,用逗号分隔开。

    Transfer Learning模式

    Epochs

    迁移学习中的训练轮数

    Pairs Upper Threshold

    相似性的上限阈值 (0-1)

    Pairs Lower Threshold

    相似性的下限阈值 (0-1)

    Pairs Minimum Cardinality

    相似性的最小基数

    Pairs Maximum Cardinality

    相似性的最大基数

    Staged Learning模式

    Diversity Filter Type

    启动分集过滤器的部分:“IdenticalMurckoScaffold”, “IdenticalTopologicalScaffold”, “ScaffoldSimilarity”, “PenalizeSameSmiles”

    Sigma

    RL reward函数的sigma值

    Rate

    torch优化器的学习率

    Sample Strategy

    仅在Mol2Mol使用:beamsearch或者multinomial

    Distance Threshold

    仅Mol2Mol:距离阈值

    以下为共同输入参数

    Model Type

    模型类型:Reinvent、LibInvent、LinkInvent、Mol2Mol
    Reinvent是从头开始创造新分子,Libinvent修饰一个骨架,Linkinvent识别两个片段之间的连接器,而Mol2Mol 则在用户定义的相似度范围内优化分子。

    Small Molecule Structure File

    小分子结构文件,SDF或者SMILES格式。在Sampling以及Staged Learning中,除Reinvent外,其余模式为必填项。

    Unique Molecules

    输出唯一的标准化分子:true或者false

    Randomize Molecules

    对原子进行随机“洗牌操作”:true或者false。“随机洗牌”是为了避免数据投入的顺序对网络训练造成影响。

    Component Type

    成分名称,多个打分成分时,用逗号分隔开:

    Qed:Quantitative Estimate of Drug-likeness (QED) 是用于评估分子药物样性的指标,通常用于筛选具有潜在药物活性的化合物。
    
    SlogP:SlogP 是指分子的分配系数的对数值,用于描述分子的亲脂性。它是预测分子在脂质和水相之间分布的一种指标。
    
    MolecularWeight:分子量是指一个分子中所有原子的质量总和,通常以原子单位(Dalton)表示。
    
    TPSA:Topological Polar Surface Area (TPSA) 是描述分子极性表面积的指标,有助于预测分子的溶解度、渗透性等性质。
    
    GraphLength:图长度是指分子结构中原子之间的最短路径长度。
    
    NumAtomStereoCenters:描述分子中的手性中心数量。
    
    HBondAcceptors 和 HBondDonors:分别指代分子中可供氢键受体和给体的原子数量。
    
    NumRotBond:描述分子中旋转键的数量,用于衡量分子的自由度。
    
    Csp3、numsp、numsp2、numsp3:分别表示分子中 sp3、sp、sp2 杂化的原子数量。
    
    NumHeavyAtoms:非氢原子的数量。
    
    NumHeteroAtoms:描述分子中杂原子(非碳、非氢)的数量。
    
    NumRings、NumAromaticRings、NumAliphaticRings:分别表示分子中环的总数、芳香环的数量和脂肪环的数量。
    
    pmi:Polar Surface Area Modifier Index,用于描述分子的极性表面积。
    
    TanimotoDistance:Tanimoto 距离是一种用于比较分子结构相似性的指标。
    
    custom_alerts:自定义警告,用于描述特定结构或性质的分子可能存在的问题或风险。
    

    Use CUDA

    使用GPU进行计算:true或者false

    Output CSV File

    输出CSV文件名称

    Output SDF File

    输出SDF文件名称

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 全新生成的化合物CSV文件
    denovo.sdf 全新生成的化合物SDF文件
    model_tf.ckpt/TL.model 迁移学习后的模型文件

    参考文献

    • Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). https://doi.org/10.1186/s13321-024-00812-5
    • https://github.com/MolecularAI/REINVENT4

    De novo Generation (REINVENT4)

    Introduction

    De novo Generation (REINVENT4) is a module for the generation of new small molecules based on the REINVENT4 algorithm. It supports various molecular generation methods: Reinvent - creating new molecules from scratch, Libinvent - modifying a scaffold, Linkinvent - designing a linker between two fragments, Mol2Mol - optimizing molecules within a user-defined similarity range.
    The schematic diagram of the de novo algorithm is as follows: generating compounds similar in structure to Celecoxib using an RNN model.
    image.png
    image.png
    De novo Generation (REINVENT4) currently has four operating modes:

    1. Sampling: Generates new molecular structures from a trained model. Sampling is the starting point for producing candidate molecules with potential biological activity.
    2. Scoring: Scores molecules to estimate their potential biological activity or other properties. The scores can be based on the structure, properties, interactions, and other aspects of the molecules, and they help to identify the most promising candidates.
    3. Transfer Learning: Uses knowledge learned on one task to help with a related task. In REINVENT4, an already trained model is applied to a new molecular design task and trained further, which is faster than training a new model from scratch and can improve performance.
    4. Staged Learning: Trains the model in several consecutive stages so that its performance improves step by step. The stages can differ in their settings, which helps to improve the robustness and performance of the model.

    Parameter Explanation

    Sampling Mode

    Number Molecules

    The number of molecules generated. Note: The final output total number of molecules is the product of this value and the number of input molecules.

    Sample Strategy

    Used only in Mol2Mol: beamsearch or multinomial

    Temperature

    Used only in Mol2Mol: temperature in multinomial sampling

    Scoring Mode

    Scoring Type

    ‘arithmetic_mean’ for weighted arithmetic mean, ‘geometric_mean’ for weighted geometric mean.

    Weight

    The weight of the scoring components. When multiple components are input, the corresponding weights need to be input, separated by commas.
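    The two aggregation types can be written out as follows. This is a generic sketch of a weighted arithmetic mean and a weighted geometric mean, assuming each component score lies between 0 and 1; the function names and example values are illustrative, and the exact treatment of edge cases in REINVENT4 may differ.

```python
import math
from typing import Sequence


def weighted_arithmetic_mean(scores: Sequence[float], weights: Sequence[float]) -> float:
    """sum(w_i * s_i) / sum(w_i)"""
    return sum(w * s for s, w in zip(scores, weights)) / sum(weights)


def weighted_geometric_mean(scores: Sequence[float], weights: Sequence[float]) -> float:
    """prod(s_i ** w_i) ** (1 / sum(w_i)), computed in log space."""
    if any(s <= 0 for s in scores):
        return 0.0  # a single zero component drives the product to zero
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights))
    return math.exp(log_sum / sum(weights))


component_scores = [0.8, 0.2]   # e.g. two scoring components (made-up values)
component_weights = [1.0, 1.0]  # one weight per component, as entered in Weight

print(weighted_arithmetic_mean(component_scores, component_weights))
print(weighted_geometric_mean(component_scores, component_weights))
```

    The geometric mean penalizes a molecule that scores poorly on any single component more strongly than the arithmetic mean does, which is the main practical difference between the two options.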

    Transfer Learning Mode

    Epochs

    The number of training epochs in transfer learning

    Pairs Upper Threshold

    Upper threshold of similarity (0-1)

    Pairs Lower Threshold

    Lower threshold of similarity (0-1)

    Pairs Minimum Cardinality

    Minimum cardinality of similarity

    Pairs Maximum Cardinality

    Maximum cardinality of similarity

    Staged Learning Mode

    Diversity Filter Type

    Type of diversity filter to enable: “IdenticalMurckoScaffold”, “IdenticalTopologicalScaffold”, “ScaffoldSimilarity”, “PenalizeSameSmiles”

    Sigma

    Sigma value of the RL reward function

    Rate

    Learning rate of the torch optimizer

    Sample Strategy

    Used only in Mol2Mol: beamsearch or multinomial

    Distance Threshold

    Only Mol2Mol: distance threshold

    Common Input Parameters

    Model Type

    Model type: Reinvent, LibInvent, LinkInvent, Mol2Mol.
    Reinvent creates new molecules from scratch, LibInvent decorates a scaffold, LinkInvent designs a linker between two fragments, and Mol2Mol optimizes molecules within a user-defined similarity range.

    Small Molecule Structure File

    Small molecule structure file, in SDF or SMILES format. In Sampling and Staged Learning, this is required for all modes except Reinvent.

    Unique Molecules

    Output unique standardized molecules: true or false

    Randomize Molecules

    Random “shuffle” of atoms: true or false. “Random shuffle” is to avoid the impact of the input order of data on network training.

    Component Type

    Component names, separated by commas for multiple scoring components:

    Qed: Quantitative Estimate of Drug-likeness (QED) is an indicator used to evaluate the drug-likeness of a molecule, typically used to screen compounds with potential drug activity.
    
    SlogP: SlogP refers to the logarithmic value of a molecule's partition coefficient, used to describe the lipophilicity of the molecule. It is an indicator for predicting the distribution of the molecule between lipid and water phases.
    
    MolecularWeight: Molecular weight is the total mass of all atoms in a molecule, usually expressed in daltons (Da).
    
    TPSA: Topological Polar Surface Area (TPSA) is an indicator that describes the polar surface area of a molecule, helping to predict properties such as solubility and permeability.
    
    GraphLength: Graph length refers to the longest topological distance in the molecular graph, that is, the largest of the shortest bond paths between any two atoms.
    
    NumAtomStereoCenters: Describes the number of chiral centers in the molecule.
    
    HBondAcceptors and HBondDonors: Indicate the number of atoms in the molecule that can act as hydrogen bond acceptors and donors, respectively.
    
    NumRotBond: Describes the number of rotatable bonds in the molecule, used to measure the flexibility of the molecule.
    
    Csp3, numsp, numsp2, numsp3: Csp3 is the fraction of sp3-hybridized carbon atoms; numsp, numsp2, and numsp3 are the numbers of sp-, sp2-, and sp3-hybridized atoms in the molecule, respectively.
    
    NumHeavyAtoms: Number of non-hydrogen atoms.
    
    NumHeteroAtoms: Describes the number of heteroatoms (non-carbon, non-hydrogen) in the molecule.
    
    NumRings, NumAromaticRings, NumAliphaticRings: Represent the total number of rings, the number of aromatic rings, and the number of aliphatic rings in the molecule, respectively.
    
    pmi: Principal moment of inertia, a descriptor of the three-dimensional shape of the molecule.
    
    TanimotoDistance: Tanimoto distance is an indicator used to compare the structural similarity of molecules.
    
    custom_alerts: Custom structural alerts, used to flag molecules that contain specific structures associated with potential problems or risks.
    

    Use CUDA

    Use GPU for computation: true or false

    Output CSV File

    Name of the output CSV file

    Output SDF File

    Name of the output SDF file

    Results Explanation

    The output results include:

    Output File Name Description
    result.csv CSV file of the newly generated compounds
    denovo.sdf SDF file of the newly generated compounds
    model_tf.ckpt/TL.model Model file after transfer learning

    References

    • Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). https://doi.org/10.1186/s13321-024-00812-5
    • https://github.com/MolecularAI/REINVENT4
  • Name: Signal Peptide Prediction
    Description: The Signal Peptide Prediction module is implemented based on SignalP 6.0, a machine learning model capable of detecting all types of signal peptides (SP) and suitable for metagenomic data. Signal peptides are short amino acid sequences that control protein secretion and translocation in all organisms.
    Tags: undefined
    Author: Teufel, F.
    Release: 2024-03-29 10:01:15
    Reference: Teufel, F., Almagro Armenteros, J.J., Johansen, A.R. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol 40, 1023–1025 (2022).

    Signal Peptide Prediction

    简介

    信号肽 (SP) 是控制所有生物体中蛋白质分泌和易位的短氨基酸序列。SP可以从序列数据中预测,但现有算法无法检测所有已知类型的SP。该模块基于SignalP 6.0实现,是一种机器学习模型,可以检测所有五种SP 类型,并且适用于宏基因组数据。
    SignalP 6.0模型架构如下图所示:
    image.png
    模型预测效果如下图所示:
    image.png

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式

    Organism

    物种信息,支持2种:eukarya, other, 默认是other

    结果说明

    • Tab分隔符的文本文件‘prediction_results.txt’,每一行为一条序列的预测结果,每列信息如下:
    字段名称 说明
    ID 序列ID
    Prediction 预测的结果类型,‘SP’表示预测含有信号肽,‘OTHER’表示预测不含信号肽
    SP(Sec/SPI) Sec/SPI类型信号肽的预测概率,SP(Sec/SPI): “standard” secretory signal peptides transported by the Sec translocon and cleaved by Signal Peptidase I (Lep)
    LIPO(Sec/SPII) Sec/SPII类型信号肽的预测概率,Sec/SPII: lipoprotein signal peptides transported by the Sec translocon and cleaved by Signal Peptidase II (Lsp)
    TAT(Tat/SPI) Tat/SPI类型信号肽的预测概率,Tat/SPI: Tat signal peptides transported by the Tat translocon and cleaved by Signal Peptidase I (Lep)
    TATLIPO(Tat/SPII) Tat/SPII类型信号肽的预测概率, Tat/SPII: Tat lipoprotein signal peptides transported by the Tat translocon and cleaved by Signal Peptidase II (Lsp)
    PILIN(Sec/SPIII) Sec/SPIII类型信号肽的预测概率, Sec/SPIII: Pilin and pilin-like signal peptides transported by the Sec translocon and cleaved by Signal Peptidase III (PilD/PibD)
    CS Position SPase酶切位点(序列位置)及预测概率
    • 去除信号肽的成熟蛋白序列文件‘processed_entries.fasta’

    参考文献

    Teufel, F., Almagro Armenteros, J.J., Johansen, A.R. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol 40, 1023–1025 (2022).

    Signal Peptide Prediction

    Introduction

    Signal peptides (SP) are short amino acid sequences that control the secretion and translocation of proteins in all organisms. While SPs can be predicted from sequence data, existing algorithms may not detect all known types of SPs. This module is based on SignalP 6.0, a machine learning model that can detect all five types of SPs and is suitable for metagenomic data.
    The architecture of the SignalP 6.0 model is shown in the following figure:
    image.png
    The predictive performance of the model is illustrated in the following figure:
    image.png

    Parameter Description

    Protein Sequence

    The sequence file of the protein in FASTA format.

    Organism

    Organism information, supporting two types: eukarya, other. The default is other.

    Result Description

    • A tab-separated text file ‘prediction_results.txt’ where each row represents the prediction results for a sequence, with the following column information:
    Field Name Description
    ID Sequence ID
    Prediction Predicted result type. ‘SP’ indicates the presence of a signal peptide, ‘OTHER’ indicates the absence of a signal peptide
    SP(Sec/SPI) Prediction probability of Sec/SPI-type signal peptides, where SP(Sec/SPI) refers to “standard” secretory signal peptides transported by the Sec translocon and cleaved by Signal Peptidase I (Lep)
    LIPO(Sec/SPII) Prediction probability of Sec/SPII-type signal peptides, where Sec/SPII refers to lipoprotein signal peptides transported by the Sec translocon and cleaved by Signal Peptidase II (Lsp)
    TAT(Tat/SPI) Prediction probability of Tat/SPI-type signal peptides, where Tat/SPI refers to Tat signal peptides transported by the Tat translocon and cleaved by Signal Peptidase I (Lep)
    TATLIPO(Tat/SPII) Prediction probability of Tat/SPII-type signal peptides, where Tat/SPII refers to Tat lipoprotein signal peptides transported by the Tat translocon and cleaved by Signal Peptidase II (Lsp)
    PILIN(Sec/SPIII) Prediction probability of Sec/SPIII-type signal peptides, where Sec/SPIII refers to Pilin and pilin-like signal peptides transported by the Sec translocon and cleaved by Signal Peptidase III (PilD/PibD)
    CS Position SPase cleavage site (sequence position) and prediction probability
    • A file ‘processed_entries.fasta’ containing mature protein sequences with signal peptides removed.
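    A sketch of reading prediction_results.txt, assuming the file is tab-separated and that the column names are given in the last comment line (starting with "#") before the data rows. The function name and the sample content, which shows only a subset of the columns listed above, are hypothetical.

```python
from typing import Dict, List


def parse_prediction_results(text: str) -> List[Dict[str, str]]:
    """Return one dict per sequence, keyed by the column names of the header line."""
    header: List[str] = []
    records: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # keep the most recent comment line as the column header
            header = line.lstrip("#").strip().split("\t")
            continue
        records.append(dict(zip(header, line.split("\t"))))
    return records


# Made-up example with a subset of the columns described above.
sample_results = (
    "# ID\tPrediction\tSP(Sec/SPI)\tCS Position\n"
    "seq1\tSP\t0.98\tCS pos: 22-23. Pr: 0.97\n"
    "seq2\tOTHER\t0.01\t\n"
)

records = parse_prediction_results(sample_results)
with_signal_peptide = [r["ID"] for r in records if r["Prediction"] != "OTHER"]
print(with_signal_peptide)
```

    Probabilities are kept as strings here; convert them with float() before applying a numeric threshold.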

    References

    Teufel, F., Almagro Armenteros, J.J., Johansen, A.R. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol 40, 1023–1025 (2022).

  • Name: Sequence Embedding Generation
    Description: The Sequence Embedding Generation module is based on the large-scale pre-trained ESM protein language models. It extracts sequence embeddings that can be used for downstream property prediction tasks (such as mutation-induced changes in affinity or stability, or the developability of antibody sequences) and that serve as sequence features for training discriminative models. ESM is a general-purpose protein language model trained on sequence databases such as UniRef50/90 (tens of millions of sequences). Models with different parameter counts (8 million, 35 million, 150 million, 650 million, 3 billion, 15 billion) are provided and can be used to predict structure, function, and other protein properties directly from sequence. In structure prediction, for example, ESMFold avoids the need for external evolutionary databases, MSAs and templates; its accuracy is close to that of AlphaFold2 when MSA information exists and significantly better than AlphaFold2 when no MSA information is available, and it is tens of times faster than AlphaFold2.
    Tags: undefined
    Author: Zeming Lin
    Release: 2024-03-25 17:13:30
    Reference: Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574

    Sequence Embedding Generation

    简介

    该模块基于ESM大规模预训练蛋白语言模型实现。提取序列的向量化特征信息(embeddings),可用于下游序列性质(如:突变对应的亲和力变化、稳定性变化,抗体序列可开发性等)预测任务,为判别模型的训练提供序列特征。
    ESM模型是通用蛋白质语言模型,采用UniRef50/90等序列数据库(数千万条序列)进行模型训练,提供了不同参数量(800万,3500万,1.5亿,6.5亿,30亿,150亿)的各类模型,可用于直接从蛋白序列预测结构、功能和其他蛋白质性质。如在结构预测中,ESM避免了对外部进化数据库、MSA和模板的需求,计算精度与AlphaFold2(存在MSA信息时)接近,无可用MSA信息时,计算精度ESM要显著优于AlphaFold2。计算速度比AlphaFold2快数十倍。

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式
    注意:多条序列时,序列名称应避免重复,模块会对重复的序列名称进行重命名,格式为“原序列名_数字”

    Model

    选择用于提取序列特征的模型,可用模型及特征维度说明如下:

    模型名称 参数量 特征维度 模型层数
    ESM1b_650M 650M 1280 33
    ESM1v_650M 650M 1280 33
    ESM2_8M 8M 320 6
    ESM2_35M 35M 480 12
    ESM2_150M 150M 640 30
    ESM2_650M 650M 1280 33
    ESM2_3B 3B 2560 36
    ESM2_15B 15B 5120 48

    备注:“M”表示Million(百万),“B”表示Billion(十亿),ESM-2-15B模型需要的GPU卡显存大小约为32GB

    结果说明

    每条序列会输出一个特征信息文件“序列名.pt”,包含了该序列的向量化特征信息,该特征信息由模型最后一层产生。多条序列会输出多个pt文件,并压缩为feats.tar压缩文件。
    特征信息文件可通过torch加载,如下:
    import torch
    embs = torch.load('序列名.pt')
    embs['mean_representations']['模型层数']

    参考文献

    Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574

    Sequence Embedding Generation

    Introduction

    This module is based on the ESM (Evolutionary Scale Modeling) large-scale pre-trained protein language model. It extracts vectorized feature information (embeddings) from sequences, which can be used for downstream sequence property prediction tasks such as changes in affinity and stability corresponding to mutations, developability of antibody sequences, etc., providing sequence features for discriminative model training.
    The ESM model is a general-purpose protein language model trained on sequence databases such as UniRef50/90 (tens of millions of sequences). It is available in several parameter sizes (8 million, 35 million, 150 million, 650 million, 3 billion, 15 billion) and can be used to predict protein structure, function, and other properties directly from sequence. In structure prediction, ESM needs no external evolutionary database, multiple sequence alignment (MSA), or templates. Its accuracy is comparable to that of AlphaFold2 when MSA information is available and significantly better when it is not. It is also tens of times faster than AlphaFold2.

    Parameter Description

    Protein Sequence

    The sequence file of the protein in FASTA format.
    Note: When multiple sequences are provided, sequence names should be unique to avoid duplication. The module will rename duplicated sequence names in the format “original_sequence_name_number”.
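
    The renaming rule above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the helper names `parse_fasta` and `rename_duplicates` are hypothetical, and the assumption that numbering starts at 1 and that the first occurrence keeps its original name is ours, since the text only gives the pattern “original_sequence_name_number”.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            # Use the first whitespace-delimited token of the header as the name.
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def rename_duplicates(records):
    """Rename repeated names as name_1, name_2, ... (assumed numbering).

    The first occurrence keeps its name. A renamed entry could still collide
    with a name that already exists in the file; that case is not handled.
    """
    seen = Counter()
    renamed = []
    for name, seq in records:
        new_name = f"{name}_{seen[name]}" if seen[name] else name
        renamed.append((new_name, seq))
        seen[name] += 1
    return renamed
```

    Checking names before submission in this way makes it easier to match each output file back to its input sequence.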

    Model

    Select the model used to extract sequence features. The available models and their feature dimensions are as follows:

    Model Name Parameters Feature Dimension Number of Layers
    ESM1b_650M 650M 1280 33
    ESM1v_650M 650M 1280 33
    ESM2_8M 8M 320 6
    ESM2_35M 35M 480 12
    ESM2_150M 150M 640 30
    ESM2_650M 650M 1280 33
    ESM2_3B 3B 2560 36
    ESM2_15B 15B 5120 48

    Note: “M” stands for million and “B” for billion. The ESM2_15B model requires approximately 32 GB of GPU memory.

    Result Description

    Each sequence will output a feature information file named “sequence_name.pt,” which contains the vectorized feature information of that sequence generated by the last layer of the model. For multiple sequences, multiple pt files will be output and compressed into a feats.tar file.
    The feature information file can be loaded using torch as follows:
    import torch
    embs = torch.load('sequence_name.pt')
    # 'number_of_layers' is a placeholder for the layer key of the chosen model
    embs['mean_representations']['number_of_layers']
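
    For multiple sequences, the pt files first have to be unpacked from feats.tar. The sketch below shows one way to do that with the standard library and then read the mean representation of one sequence. The function names are hypothetical, and the type of the layer key (for example the integer 33 for a 33-layer model, as in the open-source ESM extraction script) is an assumption; inspect `embs['mean_representations'].keys()` on a real file to confirm it.

```python
import tarfile
from pathlib import Path


def extract_pt_files(tar_path, out_dir):
    """Unpack every .pt member of the archive into out_dir (flattened).

    Returns the extracted paths sorted by file name.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".pt")):
                continue
            # Keep only the base name so members cannot escape out_dir.
            target = out_dir / Path(member.name).name
            with tar.extractfile(member) as src:
                target.write_bytes(src.read())
            paths.append(target)
    return sorted(paths)


def load_mean_embedding(pt_path, layer):
    """Load one feature file and return its mean representation.

    `layer` must match the key stored in the file (assumed here to be the
    layer index of the selected model).
    """
    import torch  # imported lazily so the tar helper works without PyTorch

    embs = torch.load(pt_path, map_location="cpu")
    return embs["mean_representations"][layer]
```

    A typical call sequence would be `paths = extract_pt_files("feats.tar", "feats")` followed by `load_mean_embedding(paths[0], layer)` for each file.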

    References

    Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574

  • Name: NGS Analysis
    Description: Analyzes antibody DNA sequences obtained by NGS: IGV, IGD, and IGJ gene annotation; translation of DNA into amino acid sequences; antibody numbering and CDR identification; identification of uncommon residues and favorable mutations using protein (antibody) language models (ESM, IgLM); PTM (post-translational modification) liability analysis that flags low- and high-risk sites; sequence-based physicochemical properties (isoelectric point pI, molecular weight in kDa, hydrophobicity index); sequence clustering; and SHM (somatic hypermutation) rate calculation.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-03-26 09:19:24
    Reference: Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574; Shuai RW, Ruffolo JA, Gray JJ. IgLM: Infilling language modeling for antibody sequence design. Cell Syst. 2023 Nov 15;14(11):979-989.e4; Milot Mirdita, Martin Steinegger, Johannes Söding, MMseqs2 desktop and local web server app for fast, interactive sequence searches, Bioinformatics, Volume 35, Issue 16, August 2019, Pages 2856–2858; Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157(1):105-132.

    NGS Analysis

    简介

    该模块用于NGS测序后的DNA序列(抗体)分析,具体分析内容包括:

    • IGV、IGD、IGJ基因标注(IgBlast)
    • DNA序列翻译为氨基酸序列(抗体),并进行CDR识别
    • 基于蛋白(抗体)语言模型,分析不常见残基及优势突变(ESM,IgLM)
    • PTM(翻译后修饰)风险位点分析,标记低、高风险位点
    • 序列特征计算(等电点pI,分子量kDa,疏水性)
    • 序列聚类分析(MMseqs2)

    NGS Analysis操作指南

    参数说明

    DNA

    DNA Sequence

    NGS测序后的DNA序列,FASTA/AB1格式

    Species

    物种类型,支持2种:HUMAN, MOUSE。默认为HUMAN

    Numbering Scheme

    编号规则,支持imgt, chothia, kabat

    Cluster

    氨基酸序列聚类方案,支持2种:full, cdr。‘full’表示使用全长序列进行聚类,‘cdr’表示使用CDR序列进行聚类(具体CDR位置在参数‘CDRs’中设定),默认为‘cdr’

    CDRs

    指定用于聚类的CDR区域,在‘Cluster’参数为cdr时生效。可选区域为(支持多选):CDR1,CDR2,CDR3。默认选择CDR3。

    Identity

    聚类中采用的序列一致性数值,范围在0-1之间,默认值为0.5

    Vgene

    聚类前是否要求IGV基因名称一致的序列归为一组,默认为False

    Protein

    Protein Sequence

    NGS测序后的蛋白序列,FASTA格式

    Species

    物种类型,支持2种:HUMAN, MOUSE。默认为HUMAN

    Numbering Scheme

    编号规则,支持imgt, chothia, kabat

    Cluster

    氨基酸序列聚类方案,支持2种:full, cdr。‘full’表示使用全长序列进行聚类,‘cdr’表示使用CDR序列进行聚类(具体CDR位置在参数‘CDRs’中设定),默认为‘cdr’

    CDRs

    指定用于聚类的CDR区域,在‘Cluster’参数为cdr时生效。可选区域为(支持多选):CDR1,CDR2,CDR3。默认选择CDR3。

    Identity

    聚类中采用的序列一致性数值,范围在0-1之间,默认值为0.5

    Vgene

    聚类前是否要求IGV基因名称一致的序列归为一组,默认为False

    结果说明

    输出result.csv结果文件,包含以下信息:

    列名 说明 备注
    ID 序列名称
    DNA_Seq DNA序列
    Protein_Seq 翻译后的氨基酸序列
    Chain 链类型:IGH/IGK/IGL
    CDR1_AA CDR1的氨基酸序列
    CDR2_AA CDR2的氨基酸序列
    CDR3_AA CDR3的氨基酸序列
    CDR1_Length CDR1的氨基酸序列长度
    CDR2_Length CDR2的氨基酸序列长度
    CDR3_Length CDR3的氨基酸序列长度
    Unusual_Residue(ESM) 基于ESM模型的不常见残基及优势突变 如:'V11L’表示序列中第11位的V是模型判定的该位置不常见残基,L为模型判定的该位置优势突变残基
    Unusual_Residue(IgLM) 基于IgLM模型的不常见残基及优势突变 同上
    V_Gene_First 匹配的首个IGV基因名称。 IGV基因名称可能存在多个匹配,这里列出首个。注:输入为蛋白序列时,该字段忽略。
    V_Gene IGV基因名称 如同时匹配多个基因名,用‘;’分隔
    D_Gene IGD基因名称 同上,注:输入为蛋白序列时,该字段忽略。
    J_Gene IGJ基因名称 同上,注:输入为蛋白序列时,该字段忽略。
    CDR1_Highrisk_Hotspots CDR1中的PTM高风险位点 如:‘NG(1)’表示高风险位点‘NG’出现1次
    CDR2_Highrisk_Hotspots CDR2中的PTM高风险位点 同上
    CDR3_Highrisk_Hotspots CDR3中的PTM高风险位点 同上
    CDR1_Lowrisk_Hotspots CDR1中的PTM低风险位点 同上
    CDR2_Lowrisk_Hotspots CDR2中的PTM低风险位点 同上
    CDR3_Lowrisk_Hotspots CDR3中的PTM低风险位点 同上
    Mutations(AA) 与Germline序列比对所对应的突变,并标注了突变所在区域(FR或CDR),多个突变用分号分隔 如: 'V29I(CDR1)'表示编号29的残基存在突变,其中Germline序列中残基是V,当前抗体序列中残基为I,根据抗体编号规则所在的区域为CDR1
    SHM(AA) 基于氨基酸序列计算得到的体系超突变率 SHM: Somatic hypermutation,计算方式是将当前序列与Germline参考序列进行比对,序列突变总数量与序列长度的比值即为SHM
    SHM(NA) 基于DNA序列计算得到的体系超突变率 同上,注:输入为蛋白序列时,该字段忽略。
    pI 等电点
    kDa 分子量(千道尔顿)
    Hydrophobicity 疏水性指数 序列各氨基酸的Kyte-Doolittle疏水指数之和,主要用来快速粗略比较近似序列的相对疏水程度高低
    Pre_Cluster_Group 聚类分析中的组别名称 序列聚类前先进行序列分组,各组内序列再进行聚类分析。当选择CDR聚类时,CDR序列长度一致的序列归为一组。组别名称由各聚类参数组合而成,如:组名为‘8_8_18’,表示该组由CDR1,2,3长度分别为8,8,18的多条序列组成。如果分组参数设定要求IGV基因名称一致,则IGV基因名称也会出现在组别名称中,如:‘8_8_18_IGKV1-12*01’
    Cluster_ID 序列所属类别的名称 如:‘2_3’表示第2组第3个类别
    Cluster_Size 序列所属类别包含的序列数目 如:‘5’表示该类别含有5条序列
    Cluster_Center 序列是否为聚类中心 '1’表示是,‘0’表示不是
    Cluster_Ident 聚类后的类别中,成员序列与聚类中心序列的序列一致性 聚类时,如果选择全长序列聚类,这里即为全长序列的一致性;如选择CDR进行聚类,则为选中的CDR区域序列的整体一致性
    Cluster_CDR1_Ident 聚类后的类别中,成员序列与聚类中心序列的CDR1序列的一致性
    Cluster_CDR2_Ident 聚类后的类别中,成员序列与聚类中心序列的CDR2序列的一致性
    Cluster_CDR3_Ident 聚类后的类别中,成员序列与聚类中心序列的CDR3序列的一致性

    输出进化树信息,为打包文件tree.tar,包含多个进化树文件tree_clusterXXX.txt,每个进化树文件包含该聚类类别(cluster)中所有成员序列CDR区域的进化分析结果。

    风险位点说明:
    image.png
    其中打勾标记的位点NXS, NXT, NG, DHK, DG, DD和Cys共7个位点为默认的潜在PTM高风险位点,通常需重点关注,其余为低风险位点。

    参考文献

    1. Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023).
    2. Shuai RW, Ruffolo JA, Gray JJ. IgLM: Infilling language modeling for antibody sequence design. Cell Syst. 2023 Nov 15;14(11):979-989.e4.
    3. Milot Mirdita, Martin Steinegger, Johannes Söding, MMseqs2 desktop and local web server app for fast, interactive sequence searches, Bioinformatics, Volume 35, Issue 16, August 2019, Pages 2856–2858.
    4. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982 May 5;157(1):105-32.

    NGS Analysis

    Introduction

    This module analyzes antibody DNA sequences obtained by NGS. The analysis includes:

    • IGV, IGD, IGJ gene annotation (IgBLAST)
    • Translation of DNA sequences into amino acid sequences (antibody) and CDR identification
    • Identification of uncommon residues and favorable mutations using protein (antibody) language models (ESM, IgLM)
    • PTM (post-translational modification) hotspot analysis, flagging low- and high-risk sites
    • Sequence property calculation (isoelectric point pI, molecular weight in kDa, hydrophobicity)
    • Sequence clustering (MMseqs2)

    Parameter

    DNA

    DNA Sequence

    DNA sequences obtained by NGS, in FASTA or AB1 format.

    Species

    Species of origin. Two are supported: HUMAN and MOUSE. The default is HUMAN.

    Numbering Scheme

    Antibody numbering scheme. Supported values: imgt, chothia, and kabat.

    Cluster

    Clustering scheme for the amino acid sequences. Two are supported: full and cdr. ‘full’ clusters on the full-length sequence; ‘cdr’ clusters on CDR sequences (the CDRs used are set in the ‘CDRs’ parameter). The default is ‘cdr’.

    CDRs

    CDR regions used for clustering. Takes effect only when ‘Cluster’ is set to ‘cdr’. Multiple selections are supported: CDR1, CDR2, CDR3. The default is CDR3.

    Identity

    Sequence identity threshold used for clustering, in the range 0 to 1. The default is 0.5.

    Vgene

    Whether sequences must share the same IGV gene name to be placed in the same group before clustering. The default is False.

    Protein

    Protein Sequence

    Protein sequences derived from NGS, in FASTA format.

    Species

    Species of origin. Two are supported: HUMAN and MOUSE. The default is HUMAN.

    Numbering Scheme

    Antibody numbering scheme. Supported values: imgt, chothia, and kabat.

    Cluster

    Clustering scheme for the amino acid sequences. Two are supported: full and cdr. ‘full’ clusters on the full-length sequence; ‘cdr’ clusters on CDR sequences (the CDRs used are set in the ‘CDRs’ parameter). The default is ‘cdr’.

    CDRs

    CDR regions used for clustering. Takes effect only when ‘Cluster’ is set to ‘cdr’. Multiple selections are supported: CDR1, CDR2, CDR3. The default is CDR3.

    Identity

    Sequence identity threshold used for clustering, in the range 0 to 1. The default is 0.5.

    Vgene

    Whether sequences must share the same IGV gene name to be placed in the same group before clustering. The default is False.

    Result

    Export the result file result.csv, which includes the following information:

    Field Name | Description | Notes
    ID | Sequence name
    DNA_Seq | DNA sequence
    Protein_Seq | Translated amino acid sequence
    Chain | Chain type: IGH/IGK/IGL
    CDR1_AA | Amino acid sequence of CDR1
    CDR2_AA | Amino acid sequence of CDR2
    CDR3_AA | Amino acid sequence of CDR3
    CDR1_Length | Length of the CDR1 amino acid sequence
    CDR2_Length | Length of the CDR2 amino acid sequence
    CDR3_Length | Length of the CDR3 amino acid sequence
    Unusual_Residue(ESM) | Uncommon residues and favorable mutations according to the ESM model | e.g., ‘V11L’ means the V at position 11 is judged by the model to be uncommon at that position, and L is the residue the model favors there
    Unusual_Residue(IgLM) | Uncommon residues and favorable mutations according to the IgLM model | Same as above
    V_Gene_First | Name of the first matching IGV gene | There may be multiple IGV gene matches; the first is listed here. Note: ignored when the input is a protein sequence.
    V_Gene | Name of the IGV gene | If several gene names match, they are separated by ‘;’
    D_Gene | Name of the IGD gene | Same as above. Note: ignored when the input is a protein sequence.
    J_Gene | Name of the IGJ gene | Same as above. Note: ignored when the input is a protein sequence.
    CDR1_Highrisk_Hotspots | PTM high-risk sites in CDR1 | e.g., ‘NG(1)’ means the high-risk site ‘NG’ appears once
    CDR2_Highrisk_Hotspots | PTM high-risk sites in CDR2 | Same as above
    CDR3_Highrisk_Hotspots | PTM high-risk sites in CDR3 | Same as above
    CDR1_Lowrisk_Hotspots | PTM low-risk sites in CDR1 | Same as above
    CDR2_Lowrisk_Hotspots | PTM low-risk sites in CDR2 | Same as above
    CDR3_Lowrisk_Hotspots | PTM low-risk sites in CDR3 | Same as above
    Mutations(AA) | Mutations relative to the germline sequence, annotated with the region (FR or CDR) in which each occurs; multiple mutations are separated by semicolons | e.g., ‘V29I(CDR1)’ means the residue numbered 29 is mutated: the germline residue is V, the residue in the current antibody sequence is I, and under the antibody numbering scheme the position lies in CDR1
    SHM(AA) | Somatic hypermutation (SHM) rate computed from the amino acid sequence | The current sequence is aligned with the germline reference sequence; SHM is the ratio of the total number of mutations to the sequence length
    SHM(NA) | Somatic hypermutation (SHM) rate computed from the DNA sequence | Same as above. Note: ignored when the input is a protein sequence.
    pI | Isoelectric point
    kDa | Molecular weight (kilodaltons)
    Hydrophobicity | Hydrophobicity index | Sum of the Kyte-Doolittle hydropathy indices of all amino acids in the sequence; mainly useful for a quick, rough comparison of relative hydrophobicity among similar sequences
    Pre_Cluster_Group | Group name used in cluster analysis | Sequences are first divided into groups and then clustered within each group. With CDR clustering, sequences with identical CDR lengths fall into the same group. The group name is built from the clustering parameters, e.g., ‘8_8_18’ denotes sequences whose CDR1, CDR2, and CDR3 lengths are 8, 8, and 18. If grouping also requires a matching IGV gene name, the gene name is included in the group name, e.g., ‘8_8_18_IGKV1-12*01’
    Cluster_ID | Name of the cluster the sequence belongs to | e.g., ‘2_3’ means the third cluster in the second group
    Cluster_Size | Number of sequences in the cluster | e.g., ‘5’ means the cluster contains 5 sequences
    Cluster_Center | Whether the sequence is the cluster center | ‘1’ means yes, ‘0’ means no
    Cluster_Ident | Sequence identity between the member sequence and the cluster center | With full-length clustering, this is the identity over the full-length sequences; with CDR clustering, it is the overall identity over the selected CDR regions
    Cluster_CDR1_Ident | CDR1 sequence identity between the member sequence and the cluster center
    Cluster_CDR2_Ident | CDR2 sequence identity between the member sequence and the cluster center
    Cluster_CDR3_Ident | CDR3 sequence identity between the member sequence and the cluster center
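
    Three of the per-sequence fields in this table (Mutations(AA), SHM(AA), and Hydrophobicity) follow simple formulas, which the sketch below illustrates. It is a hedged illustration rather than the module's code: the function names are hypothetical, the two sequences are assumed to be already aligned, gap-free, and of equal length, and positions are plain 1-based indices rather than antibody numbering, so region labels such as ‘(CDR1)’ are not produced. The hydropathy values are the published Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def list_mutations(germline, query):
    """Return substitutions such as 'V29I' between two aligned sequences.

    Positions are 1-based indices into the alignment (an assumption; the
    module reports antibody-numbered positions).
    """
    if len(germline) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{g}{pos}{q}"
        for pos, (g, q) in enumerate(zip(germline, query), start=1)
        if g != q
    ]


def shm_rate(germline, query):
    """Number of mutations divided by the sequence length."""
    return len(list_mutations(germline, query)) / len(query)


def hydrophobicity(sequence):
    """Sum of Kyte-Doolittle indices over all residues in the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence.upper())
```

    Because the hydrophobicity value is a sum rather than a mean, it grows with sequence length, which is why the table recommends it only for comparing similar sequences.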

    The module also outputs phylogenetic tree information as an archive, tree.tar, containing one tree file per cluster, named tree_clusterXXX.txt. Each tree file holds the evolutionary analysis of the CDR regions of all member sequences in that cluster.

    Risk Site Description:
    image.png
    The default potential PTM high-risk sites marked with check marks include NXS, NXT, NG, DHK, DG, DD, and Cys, totaling 7 sites. These sites typically require special attention, while the rest are considered low-risk sites.
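
    Counting the high-risk motifs in a CDR and formatting them in the ‘NG(1)’ style can be sketched as below. Several points are assumptions rather than documented behavior: ‘X’ in NXS/NXT is treated as any residue, ‘DHK’ is read literally as the three-residue motif, Cys is labeled ‘C’, overlapping occurrences are counted separately, and the function name `count_hotspots` is hypothetical. The low-risk motifs appear only in the figure and are therefore not included.

```python
import re

# High-risk motifs named in the text, written as regular expressions.
# Assumptions: X = any residue; 'DHK' taken literally; Cys labeled 'C'.
HIGH_RISK_MOTIFS = {
    "NXS": "N[A-Z]S",
    "NXT": "N[A-Z]T",
    "NG": "NG",
    "DHK": "DHK",
    "DG": "DG",
    "DD": "DD",
    "C": "C",
}


def count_hotspots(cdr_seq, motifs=HIGH_RISK_MOTIFS):
    """Return entries like 'NG(1)' for every motif found in the sequence.

    A zero-width lookahead is used so that overlapping occurrences
    (for example 'DD' twice in 'DDD') are each counted.
    """
    seq = cdr_seq.upper()
    hits = []
    for label, pattern in motifs.items():
        count = len(re.findall(f"(?=(?:{pattern}))", seq))
        if count:
            hits.append(f"{label}({count})")
    return hits
```

    For example, `count_hotspots("GNGSDDC")` reports the N-glycosylation sequon NGS, the NG and DD motifs, and the free cysteine.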

    Reference

    1. Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023).
    2. Shuai RW, Ruffolo JA, Gray JJ. IgLM: Infilling language modeling for antibody sequence design. Cell Syst. 2023 Nov 15;14(11):979-989.e4.
    3. Milot Mirdita, Martin Steinegger, Johannes Söding, MMseqs2 desktop and local web server app for fast, interactive sequence searches, Bioinformatics, Volume 35, Issue 16, August 2019, Pages 2856–2858.
    4. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982 May 5;157(1):105-32.
  • Name: Human Fragment BLAST
    Description: Human Fragment BLAST takes 9-mer peptides (nonapeptides) as input and searches the human fragment libraries (Germline, TCR, NextProt, OAS) for the most similar 9-mer peptides.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-03-06 12:01:50
    Reference:

    Human Fragment BLAST

    简介

    Human Fragment BLAST是基于输入的九肽, 在人源片段库(Germline, TCR, NextProt, OAS)中搜索最相似的9肽。

    参数说明

    Peptide Fragment

    九肽片段,多个肽段用逗号分隔,例如:
    NFFWHLHFP,GKGITLSVR,TPEALFVMT,GGIPIINCA,CVAIAEDRK

    Minimun

    相同氨基酸的最小数量(相同位置),默认为7。

    Output File

    输出文件名称

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    Query 原始9肽
    SameCnt 相同个数
    Target 匹配到的9肽
    DiffMask 以*号标记氨基酸不同的位置
    From 生成片段的来源数据库

    Human Fragment BLAST

    Introduction

    Human Fragment BLAST takes 9-mer peptides (nonapeptides) as input and searches the human fragment libraries (Germline, TCR, NextProt, OAS) for the most similar 9-mer peptides.

    Parameter

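    Peptide Fragment

    9-mer peptide fragments; separate multiple peptides with commas, for example:
    NFFWHLHFP,GKGITLSVR,TPEALFVMT,GGIPIINCA,CVAIAEDRK

    Minimun

    Minimum number of identical amino acids (at the same positions). The default is 7.

    Output File

    Name of the output file.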

    Result

    The output file is result.csv and contains the following information:

    Field Name Description
    Query The original 9-mer peptide
    SameCnt Number of identical residues
    Target The matched 9-mer peptide
    DiffMask Positions where the amino acids differ, marked with *
    From The source database of the matched fragment
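
    The position-by-position comparison behind SameCnt and DiffMask can be sketched as follows. This is a minimal illustration under stated assumptions: the fragment libraries are represented as a plain dictionary mapping a source name to a list of 9-mers, identical positions in the mask are shown as ‘.’ (the text only specifies ‘*’ for differing positions), and the function names are hypothetical.

```python
def compare_peptides(query, target):
    """Return (SameCnt, DiffMask) for two peptides of equal length."""
    if len(query) != len(target):
        raise ValueError("peptides must have the same length")
    same = sum(q == t for q, t in zip(query, target))
    # '*' marks a differing position; '.' for identical ones is an assumption.
    mask = "".join("." if q == t else "*" for q, t in zip(query, target))
    return same, mask


def search_fragments(query, libraries, minimum=7):
    """Collect library peptides sharing at least `minimum` identical positions.

    `libraries` maps a source name (e.g. 'Germline') to an iterable of 9-mers.
    Rows are sorted so the most similar peptides come first.
    """
    rows = []
    for source, peptides in libraries.items():
        for target in peptides:
            same, mask = compare_peptides(query, target)
            if same >= minimum:
                rows.append({
                    "Query": query,
                    "SameCnt": same,
                    "Target": target,
                    "DiffMask": mask,
                    "From": source,
                })
    rows.sort(key=lambda row: row["SameCnt"], reverse=True)
    return rows
```

    With the default minimum of 7, a 9-mer is reported only if it differs from the query at no more than two positions.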
  • Name: Protein Structure Prediction (RaptorX-Single)
    Description: Implemented with the RaptorX-Single algorithm, a single-sequence protein structure prediction method that requires no multiple sequence alignment (MSA) information. It integrates several protein language models with a structure generation module. The authors report that, besides running faster than MSA-based methods such as AlphaFold2, RaptorX-Single outperforms AlphaFold2 and other MSA-free methods when predicting antibody structures, proteins with very few sequence homologs, and single-mutation effects, and that it performs comparably to AlphaFold2 when the target protein has a large number of sequence homologs.
    Tags: undefined
    Author: Xiaoyang Jing
    Release: 2024-03-04 16:21:12
    Reference: RaptorX-Single: exploring the advantage of single sequence based protein structure prediction. Xiaoyang Jing, Fandi Wu, Jinbo Xu. bioRxiv 2023.04.24.538081

    Protein Structure Prediction (RaptorX-Single)

    简介

    该模块基于RaptorX-Single算法实现,RaptorX-Single是一种基于单一序列的蛋白质结构预测方法,无需multiple sequence alignment(MSA)信息。它集成了多个蛋白质语言模型和一个结构生成模块,研究结果表明,RaptorX-Single除了比AlphaFold2等基于MSA的方法运行得更快之外,在预测抗体结构、极少同源序列的蛋白和单突变效应方面也优于AlphaFold2和其他无MSA的方法。当预测的蛋白序列有大量同源序列时,RaptorX-Single的预测结果也优于AlphaFold2。
    RaptorX-Single的神经网络架构:
    image.png
    对抗体结构预测精度比较:
    image.png

    参数说明

    Sequence File

    普通蛋白或抗体序列文件(不超过1000个氨基酸),FASTA格式,如:
    >Nb21
    MAQVQLVESGGGLVQAGGSLRLSCAVSGLGAHRVGWFRRAPGKEREFVAAIGANGGNTNYLDSVKGRFTISRDNAKNTIYLQMNSLKPQDTAVYYCAARDIETAEYTYWGQGTQVTVS

    注意:

    只支持预测单链蛋白或抗体,如果FASTA文件有多条链,每条链会单独预测为一个PDB结构。

    Model for Prediction

    选择预测结构时使用的模型,有两个模型可供选择:
    protein表示蛋白模型,对应RaptorX-Single-ESM1b-ESM1v-ProtTrans.pt;
    antibody表示抗体模型,对应RaptorX-Single-ESM1b-ESM1v-ProtTrans-Ab.pt。
    如果预测蛋白,请选择前者,如果预测抗体,请选择后者

    结果说明

    输出结果包括:

    输出文件名称 说明
    first.pdb 默认输出第一条序列的预测结构。
    structs.tar 针对含有多条序列的fasta文件,压缩包中含所有的序列的预测结构。

    参考文献

    RaptorX-Single: exploring the advantage of single sequence based protein structure prediction. Xiaoyang Jing, Fandi Wu, Jinbo Xu. bioRxiv 2023.04.24.538081
    https://doi.org/10.1101/2023.04.24.538081

    Protein Structure Prediction (RaptorX-Single)

    Introduction

    This module is implemented with the RaptorX-Single algorithm, a single-sequence protein structure prediction method that requires no multiple sequence alignment (MSA) information. It integrates several protein language models with a structure generation module. The authors report that, besides running faster than MSA-based methods such as AlphaFold2, RaptorX-Single outperforms AlphaFold2 and other MSA-free methods when predicting antibody structures, proteins with very few sequence homologs, and single-mutation effects. They also report that it performs comparably to AlphaFold2 when the target protein has a large number of sequence homologs.
    Network Architecture for RaptorX-Single:
    image.png
    Comparison of the accuracy of antibody structure prediction:
    image.png

    Parameter

    Sequence File

    Protein or antibody sequence file (not more than 1000 amino acids) in FASTA format, example:
    >Nb21
    MAQVQLVESGGGLVQAGGSLRLSCAVSGLGAHRVGWFRRAPGKEREFVAAIGANGGNTNYLDSVKGRFTISRDNAKNTIYLQMNSLKPQDTAVYYCAARDIETAEYTYWGQGTQVTVS

    Note:

    This module supports only single-chain proteins or antibodies. If the FASTA file contains multiple chains, each chain is predicted separately as its own PDB structure.

    Model for Prediction

    Select the model used for structure prediction. Two models are available:
    ‘protein’ is the general protein model, corresponding to RaptorX-Single-ESM1b-ESM1v-ProtTrans.pt;
    ‘antibody’ is the antibody model, corresponding to RaptorX-Single-ESM1b-ESM1v-ProtTrans-Ab.pt.
    Choose the former for proteins and the latter for antibodies.

    Result

    The output includes:

    Output File Description
    first.pdb The predicted structure of the first sequence, output by default.
    structs.tar For FASTA files with multiple sequences, an archive containing the predicted structures of all sequences.

    Reference

    RaptorX-Single: exploring the advantage of single sequence based protein structure prediction. Xiaoyang Jing, Fandi Wu, Jinbo Xu. bioRxiv 2023.04.24.538081
    https://doi.org/10.1101/2023.04.24.538081

  • Name: Germline AA Distribution Frequency
    Description: Germline AA Distribution Frequency模块输出抗体各位置的germline的氨基酸频率分布。可以按指定的germline基因家族分别输出(通常关注与目标序列同家族germline基因的频率分布情况)。 Germline AA Distribution Frequency module outputs the amino acid frequency distribution of the germline at each position of the antibody. It can output the distribution separately according to the specified germline gene family (usually focusing on the frequency distribution of the germline genes in the same family as the target sequence).
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-01-26 00:00:00
    Reference:

    Germline AA Distribution Frequency

    简介

    该模块输出指定的germline基因家族(部分或全部)的各位置的氨基酸频率分布,以供突变设计参考。

    输入方式1

    输入一条抗体序列(多条序列时只处理第一条序列)。
    程序根据输入序列进行BLAST,判断其对应的基因家族,如IGHV1。
    再输出对应家族的germline基因的AA频率分布。

    输入方式2

    不输入序列,则直接输出勾选的链类型(Group选项)或基因家族(Single选项)对应的germline的频率分布。

    其中:
    若勾选某Group,仅统计对应类型(kappa, lambda, heavy)的所有家族germline的频率分布。
    若勾选Single中的某个family(如IGHV1),只输出指定的germline基因家族的AA频率分布(因为通常仅关注与目标序列同家族germline基因的频率分布情况,与我们序列不同家族的其他germline的频率分布的参考意义不大)。

    输出

    抗体各位置的germline的氨基酸频率分布。

    Germline AA Distribution Frequency

    Introduction

    This module outputs the amino acid frequency distribution at each position of the specified germline gene family (partially or entirely) for reference in mutation design.

    Input Method 1

    Input an antibody sequence (if multiple sequences are provided, only the first sequence is processed).
    The program uses BLAST to determine the corresponding gene family of the input sequence, such as IGHV1.
    Then it outputs the amino acid frequency distribution of the corresponding germline genes in that family.

    Input Method 2

    If no sequence is provided, the module directly outputs the frequency distribution of the selected chain type (Group option) or gene family (Single option) of germline genes.

    Specifically:

    • If a Group is selected, it will only calculate the frequency distribution of all germline genes of the corresponding type (kappa, lambda, heavy).
    • If a specific family is selected in the Single option (e.g., IGHV1), it will only output the amino acid frequency distribution of the specified germline gene family (as typically only the frequency distribution of germline genes from the same family as the target sequence is of interest, and the frequency distribution of germline genes from different families has limited relevance to our sequence design).

    Output

    The amino acid frequency distribution of germline genes at each position in the antibody.
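
    The per-position frequency table can be illustrated with a short sketch. It assumes the germline sequences of the chosen family have already been aligned to a common length (for example through a shared numbering scheme) and counts a gap character as its own symbol; the function name `position_frequencies` is hypothetical.

```python
from collections import Counter


def position_frequencies(aligned_seqs):
    """Return one {amino_acid: frequency} dict per alignment column."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all sequences must be aligned to the same length")
    table = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_seqs)
        total = sum(counts.values())
        table.append({aa: n / total for aa, n in counts.items()})
    return table
```

    Residues with a high germline frequency at a position are natural candidates when choosing a substitution for mutation design.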

  • Name: AA Probability Prediction
    Description: Uses pre-trained large-scale protein language models (PLMs, also called pLLMs) to predict the probability of each of the 20 amino acids (AAs) at every position in a sequence. Like evolutionarily conserved AAs, the high-probability AAs predicted by a language model tend to improve structural stability, folding, and expression, and may even increase affinity, which gives them a potential advantage over random mutations. Available models: ESM, a sequence-based PLM for proteins including antibodies; IgLM, a sequence-based PLM for antibodies only, with a selectable species (for example human); All in One, which runs both ESM and IgLM; ESMIF, a structure-aware PLM for proteins including antibodies; AntiFold, an ESMIF model fine-tuned on antibody data and better suited to antibodies and nanobodies. Without a structure, use a sequence-only model such as ESM or IgLM; with an experimental or predicted structure, the structure-aware models perform better on tasks that depend strongly on local structure, such as stability and affinity.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-01-23 20:07:02
    Reference: Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574; Shuai et al., 2023, Cell Systems 14, 979–989. DOI: 10.1016/j.cels.2023.10.001

    AA Probability Prediction

    简介

    基于预训练的大规模蛋白质语言模型(也叫做PLM或pLLM),预测序列中每个氨基酸(AA)位置处20种AA出现的概率。与进化上更保守的AA类似,语言模型预测的高概率AA,有利于提升结构的稳定性、改善蛋白的折叠、提升蛋白质的表达能力等、甚至提升亲和力,比随机盲目突变具有潜在的优势。相比于基于MSA序列统计的PSSM,语言模型的预测速度更快,更多地考虑了序列内AA之间的相互作用,自身的变化也更敏感。

    该模块基于ESM、IgLM等大规模预训练蛋白(抗体)语言模型实现。

    蛋白质语言模型介绍

    目前WeMol中集成了多个PLM大模型,并基于PLM开发了多种应用,涉及的PLM模型如下:
    image.png

    ESM模型

    ESM模型是一个通用蛋白质语言模型,主要采用UniRef序列数据库进行模型训练,提供了不同参数量(800万,3500万,1.5亿,6.5亿,30亿,150亿)的各类模型,可用于直接从蛋白序列预测结构、功能和其他蛋白质性质。ESM在预测蛋白结构时避免了对外部进化数据库、MSA和模板的需求,计算精度与AlphaFold2(存在MSA信息时)接近(无可用MSA信息时,计算精度ESM要显著优于AlphaFold2),计算速度比AlphaFold2快数十倍。模块中采用150亿参数的ESM2模型。
    image.png

    IgLM模型

    IgLM是一种用于构建合成抗体库的深度生成语言模型。与利用单向上下文生成序列的方法相比,IgLM 基于自然语言中的文本输入进行抗体设计。因此它能利用双向上下文重新设计抗体序列。IgLM基于5.58亿条抗体重链和轻链可变序列进行训练,并根据每个序列的链类型和来源物种进行了调整。
    image.png

    ESMIF模型

    ESMIF逆折叠模型旨在根据蛋白质主链原子坐标预测蛋白序列。该模型使用AlphaFold2预测的1200万个蛋白质结构进行训练,包含不变几何输入处理层,随后是一个序列到序列的Transformer,对于在结构上保持不变的主干序列实现51%的本地序列恢复率,对于埋藏残基实现72%的恢复率。该模型还经过跨度屏蔽训练,能够容忍缺失的主链坐标,因此可以预测部分被屏蔽结构的序列。

    AntiFold模型

    AntiFold是使用抗体结构数据对ESMIF模型进行fine-tune微调得到,其在抗体CDR区序列恢复方面优于其他逆折叠工具,设计序列与已解析的序列具有高度结构相似性。此外,它在预测抗体-抗原结合亲和力时具有更强的相关性,同时在包括抗原信息的情况下性能会进一步增强。AntiFold为破坏与抗原结合的抗体残基突变给与低概率,并显示出在指导抗体优化的同时保留结构相关特性的前景

    Nanobody模型

    该模型用于预测纳米抗体序列中每个残基位置的20种残基出现的概率。模型采用类似AntiBerta(基于BERT的抗体语言模型)的网络架构,使用纳米抗体的序列数据集,进行模型训练得到。序列数据集包含开源序列与商业序列(未开源)两部分,其中开源序列整合了来自专利、NCBI GenBank、Protein Data Bank(PDB)以及科学出版物中的纳米抗体序列(约2.1万条),商业序列是基于新一代测序(NGS)技术,对多个商业研发项目进行测序得到的序列(约1100万条)。

    参数说明

    ESM

    Protein Sequence

    蛋白序列,如:QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    如果是抗体,请将重链、轻链序列分开预测。

    Model

    模型类型,可选esm2模型或者esm1b模型。

    IgLM

    Protein Sequence

    蛋白序列,如:QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    如果是抗体,请将重链、轻链序列分开预测。

    Chain Type

    抗体链类型,H表示重链,L表示轻链

    Species

    物种类型,支持6种:HUMAN, CAMEL, MOUSE, RABBIT, RAT, RHESUS。

    All in One

    Antibody Sequence

    抗体序列,如:QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    抗体序列,需将重链、轻链序列分开预测。

    Species

    物种类型,支持6种:HUMAN, CAMEL, MOUSE, RABBIT, RAT, RHESUS。

    ESMIF

    PDB File

    蛋白结构,pdb格式。

    Threshold

    残基概率的阈值,概率大于该阈值的突变残基会输出到突变列表文件。

    Regions

    定义的残基区域,区域内突变概率大于阈值的残基,其突变信息会输出到突变列表文件,残基区域的格式为链名:残基区域,残基区域即指定PDB文件中的残基编号(注意是PDB文件中带有的残基索引编号,起始编号可能不为1),多个残基用逗号分隔,指定残基范围用横杠符号,如A:24,28,32-40 表示残基区域为蛋白A链的24/28/32至40号残基。
    支持定义多个残基区域,每行定义一个,如:

    A:24,28,32-40
    B:12-24
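
    The region syntax above (chain name, colon, then comma-separated residue numbers and hyphenated ranges) can be parsed as sketched below. The function names are hypothetical, and the sketch handles only non-negative integer residue numbers; negative numbers and PDB insertion codes are not covered.

```python
def parse_region(line):
    """Parse one region line such as 'A:24,28,32-40' into (chain, residues)."""
    chain, sep, spec = line.strip().partition(":")
    if not sep or not chain or not spec:
        raise ValueError(f"expected 'chain:residues', got {line!r}")
    residues = []
    for token in spec.split(","):
        token = token.strip()
        if "-" in token:
            start, end = token.split("-", 1)
            # Ranges are inclusive on both ends.
            residues.extend(range(int(start), int(end) + 1))
        else:
            residues.append(int(token))
    return chain, residues


def parse_regions(text):
    """Parse a multi-line region definition into {chain: [residue numbers]}."""
    regions = {}
    for line in text.splitlines():
        if line.strip():
            chain, residues = parse_region(line)
            regions.setdefault(chain, []).extend(residues)
    return regions
```

    The resulting numbers refer to the residue numbering stored in the PDB file, which may not start at 1.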
    

    AntiFold

    PDB File

    抗体/纳米抗体,及与抗原的复合物结构文件,PDB格式。

    Antigen Chain

    填写输入pdb结构中的抗原链名。

    注意:如果文件中有多个抗体/纳米抗体,识别按顺序排的最后一个。

    Nanobody

    Nanobody Sequence

    纳米抗体序列(序列长度不超过198个残基,当序列长度超过198时,会自动识别抗体Fv区域并保留,序列其余部分去除),如:

    seq
    QLVSGPEVKKPGASVKVSCKASGYIFNNYGISWVRQAPGQGLEWMGWISTDNGNTNYAQKVQGRVTMTTDTSTSTAYMELRSLRYDDTAVYYCATNWGSYFEHWGQGTLVTVSS

    结果说明

    ESM、IgLM以及Nanobody

    输出result.csv结果文件,包含以下信息:

    字段名称 说明
    WT 序列中的初始AA
    POS AA的位置索引(从1开始)
    Consensus 该位置出现概率最大的AA
    L,A,G,V… 该位置每种AA出现的概率

    输出chain_score.csv结果文件,包含以下信息:

    字段名称 说明
    Name 序列名称
    Chain_Score 序列打分,是序列中每个位置残基的预测概率的算术平均值
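
    The Consensus column and the Chain_Score value follow directly from the per-position probability table, as the sketch below shows. It assumes that the "predicted probability of the residue at each position" means the probability assigned to the residue actually present in the input sequence (the WT residue); that reading, the input layout, and the function name are assumptions for illustration.

```python
def consensus_and_chain_score(sequence, prob_rows):
    """Build result rows and the chain score from per-position probabilities.

    `prob_rows[i]` maps each amino acid to its predicted probability at
    position i of `sequence`.
    """
    if len(sequence) != len(prob_rows):
        raise ValueError("need one probability row per residue")
    rows = []
    for pos, (wt, probs) in enumerate(zip(sequence, prob_rows), start=1):
        # Consensus is the amino acid with the highest predicted probability.
        consensus = max(probs, key=probs.get)
        rows.append({"WT": wt, "POS": pos, "Consensus": consensus, **probs})
    # Chain score: arithmetic mean of the probability of the WT residue
    # at each position (assumed interpretation).
    chain_score = sum(p[wt] for wt, p in zip(sequence, prob_rows)) / len(sequence)
    return rows, chain_score
```

    Positions where Consensus differs from WT and the WT probability is low are the ones a language model flags as candidates for mutation.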

    All in One

    All in One模式中,每一条序列都会输出5个文件,分别是:
    1,ESM目录下的 '序列名.csv’文件,与上述result.csv格式一致
    2,ESM目录下的’序列名_unusualAA.csv’文件,保存ESM模型预测得到的序列中不常见残基及其优势突变
    3,IgLM目录下的 '序列名.csv’文件,与上述result.csv格式一致
    4,IgLM目录下的’序列名_unusualAA.csv’文件,保存IgLM模型预测得到的序列中不常见残基及其优势突变
    5,all目录下的 '序列名.csv’文件,保存序列每个位置由ESM与IgLM预测得到的可能优势突变及概率

    ESMIF和AntiFold

    输出result.csv结果文件,包含以下信息:

    字段名称 说明
    Chain PDB结构中的链名称
    WT PDB结构中的初始AA
    Pos PDB文件中的AA编号
    Consensus 该位置出现概率最大的AA
    L,A,G,V… 该位置每种AA出现的概率

    参考文献

    1, Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model.Science379,1123-1130(2023).DOI:10.1126/science.ade2574
    https://www.science.org/doi/abs/10.1126/science.ade2574
    2, Shuai et al., 2023, Cell Systems 14, 979–989.
    https://doi.org/10.1016/j.cels.2023.10.001

    AA Probability Prediction

    Introduction

    Based on pre-trained large-scale protein language models (PLMs, also called pLLMs), this module predicts the probability of each of the 20 amino acids (AAs) at every position in a sequence. Like evolutionarily conserved AAs, the high-probability AAs predicted by a language model tend to improve structural stability, protein folding, and expression, and may even increase affinity, which gives them a potential advantage over random mutations. Compared with PSSMs derived from MSA statistics, language models predict faster, take more account of interactions between AAs within the sequence, and are more sensitive to changes in the sequence itself.

    This module is based on large-scale pre-trained protein (antibody) language models such as ESM and IgLM.

    Protein Language Model Overview

    Several PLMs are integrated into WeMol, and various applications have been developed on top of them. The models used by this module are described below.

    ESM Model

    The ESM model is a general protein language model that primarily uses the UniRef sequence database for model training. It offers various models with different parameter sizes (8 million, 35 million, 150 million, 650 million, 3 billion, 15 billion) that can be used to predict structure, function, and other protein properties directly from protein sequences. ESM avoids the need for external evolutionary databases, MSA, and templates when predicting protein structures. Its computational accuracy is close to AlphaFold2 (when MSA information is available) and significantly superior to AlphaFold2 in the absence of MSA information. ESM2 with 15 billion parameters is used in this module.

    IgLM Model

    IgLM is a deep generative language model used to construct synthetic antibody libraries. Unlike methods that generate sequences from unidirectional context only, IgLM formulates antibody design as text infilling, a technique from natural language processing, which lets it use bidirectional context when redesigning spans of an antibody sequence. IgLM is trained on 558 million antibody heavy and light chain variable sequences and is conditioned on the chain type and source species of each sequence.

    ESMIF Model

    The ESMIF inverse folding model aims to predict protein sequences from their backbone atom coordinates. Trained on 12 million protein structures predicted by AlphaFold2, the ESMIF model consists of invariant geometric input processing layers followed by a sequence-to-sequence transformer. It achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues. The model is also trained with span masking to tolerate missing backbone coordinates and can predict sequences for partially masked structures.

    AntiFold Model

    AntiFold is fine-tuned from the ESMIF model on antibody structural data. It outperforms other inverse folding tools in antibody CDR sequence recovery, and the structures of its designed sequences show high similarity to the experimentally resolved structures. It also correlates more strongly with antibody-antigen binding affinity, and its performance improves further when antigen information is included. AntiFold assigns low probabilities to mutations in antibody residues that disrupt antigen binding, which shows its promise for guiding antibody optimization while retaining structure-related features.

    Nanobody Model

    This model predicts the probability of each of the 20 residues at every position in a nanobody sequence. It uses an AntiBERTa-like architecture (a BERT-based antibody language model) and is trained on nanobody sequence datasets. These datasets have two parts: open-source sequences (around 21,000 from patents, NCBI GenBank, PDB, and publications) and commercial sequences (around 11 million from NGS of multiple R&D projects).

    Parameter

    ESM

    Protein Sequence

    Protein sequence, e.g., QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    If it is an antibody, predict heavy and light chain sequences separately.

    Model

    Model type: either esm2 or esm1b.

    IgLM

    Protein Sequence

    Protein sequence, e.g., QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    If it is an antibody, predict heavy and light chain sequences separately.

    Chain Type

    Antibody chain type, H for heavy chain, L for light chain.

    Species

    Species type, supports 6 types: HUMAN, CAMEL, MOUSE, RABBIT, RAT, RHESUS.

    All in One

    Antibody Sequence

    Antibody sequence, e.g., QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    For an antibody, predict the heavy and light chain sequences separately.

    Species

    Species type, supports 6 types: HUMAN, CAMEL, MOUSE, RABBIT, RAT, RHESUS.

    ESMIF

    PDB File

    Protein structure, in pdb format.

    Threshold

    The threshold for residue probability. Mutated residues with probabilities exceeding this threshold will be output to the mutation list file.

    Regions

    Defined residue regions. Mutation information for residues within these regions, whose mutation probability exceeds the threshold, will be output to the mutation list file. The format for residue regions is Chain:ResidueRegion, where ResidueRegion specifies the residue indices in the PDB file (note that the indices are the residue indices as they appear in the PDB file, which may not start from 1). Multiple residues can be separated by commas, and residue ranges can be specified using a hyphen, e.g., A:24,28,32-40 represents residues 24, 28, and 32 to 40 of chain A in the protein.
    Multiple residue regions can be defined, with each region on a separate line, e.g.:

    A:24,28,32-40  
    B:12-24  
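
The Regions format described above can be parsed with a few lines of Python. This is a minimal sketch, not the module's own parser: the function name `parse_regions` is hypothetical, and it assumes plain positive residue numbers (no insertion codes or negative indices).

```python
def parse_regions(text):
    """Parse lines such as 'A:24,28,32-40' into {chain: sorted residue numbers}."""
    regions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between regions
        chain, spec = line.split(":", 1)
        residues = regions.setdefault(chain.strip(), set())
        for item in spec.split(","):
            item = item.strip()
            if "-" in item:
                # A hyphen marks an inclusive residue range, e.g. 32-40.
                start, end = (int(x) for x in item.split("-", 1))
                residues.update(range(start, end + 1))
            else:
                residues.add(int(item))
    return {chain: sorted(nums) for chain, nums in regions.items()}


regions = parse_regions("A:24,28,32-40\nB:12-24\n")
print(regions["A"])
```

Expanding ranges up front makes it easy to test whether a given PDB residue index falls inside a user-defined region.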
    

    AntiFold

    PDB File

    Structure files of antibodies/nanobodies and their complexes with antigens, in PDB format.

    Antigen Chain

    Enter the antigen chain name in the input PDB structure.

    Note: If there are multiple antibodies/nanobodies in the file, identify the last one in sequential order.

    Nanobody

    Nanobody Sequence

    Sequence of Nanobody, such as:

    seq
    QLVSGPEVKKPGASVKVSCKASGYIFNNYGISWVRQAPGQGLEWMGWISTDNGNTNYAQKVQGRVTMTTDTSTSTAYMELRSLRYDDTAVYYCATNWGSYFEHWGQGTLVTVSS
    Only single-chain sequences can be submitted, and the sequence length must not exceed 198 residues.

    Result

    ESM, IgLM and Nanobody

    Output result.csv file containing the following information:

    Field Name Description
    WT Initial AA in the sequence
    POS Position index of the AA (starting from 1)
    Consensus Most probable AA at that position
    L, A, G, V… Probability of each AA appearing at that position

    Output chain_score.csv file containing the following information:

    Field Name Description
    Name Sequence name
    Chain_Score Sequence score, the arithmetic mean of predicted probabilities of residues at each position in the sequence
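
As an illustration of how a Chain_Score could be derived from a result.csv table, the sketch below averages the probability assigned to the WT residue at each position. This reading of "predicted probabilities of residues" is an assumption, as are the comma-separated layout and the sample values, which are invented and show only three amino-acid columns.

```python
import csv
import io

# Hypothetical excerpt of a result.csv file (invented probabilities).
RESULT_CSV = """WT,POS,Consensus,L,A,G
A,1,A,0.1,0.7,0.2
G,2,L,0.6,0.1,0.3
L,3,L,0.8,0.1,0.1
"""


def chain_score(csv_text):
    """Arithmetic mean of the probability assigned to the WT residue at each position."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # row["WT"] names the column that holds the WT residue's probability.
    wt_probs = [float(row[row["WT"]]) for row in rows]
    return sum(wt_probs) / len(wt_probs)


score = chain_score(RESULT_CSV)
print(round(score, 3))
```

Position 2 in the sample has a consensus residue (L) that differs from the WT residue (G), which is the kind of position that lowers the sequence score.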

    All in One

    In the All in One mode, each sequence will output 5 files:

    1. ‘sequence_name.csv’ file in the ESM directory, following the format of result.csv
    2. ‘sequence_name_unusualAA.csv’ file in the ESM directory, saving uncommon residues and their advantageous mutations predicted by the ESM model
    3. ‘sequence_name.csv’ file in the IgLM directory, following the format of result.csv
    4. ‘sequence_name_unusualAA.csv’ file in the IgLM directory, saving uncommon residues and their advantageous mutations predicted by the IgLM model
    5. ‘sequence_name.csv’ file in the all directory, saving potential advantageous mutations and probabilities predicted by ESM and IgLM at each position in the sequence

    ESMIF and AntiFold

    Output result.csv file containing the following information:

    Field Name Description
    Chain Chain name in the PDB structure
    WT Initial AA in the PDB structure
    Pos Index of the AA in the PDB file
    Consensus Most probable AA at that position
    L, A, G, V… Probability of each AA appearing at that position
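
The Threshold parameter can be pictured as a filter over this table: any non-WT residue whose probability exceeds the threshold becomes a candidate mutation. The following sketch assumes a comma-separated result.csv with the columns listed above; the helper name `mutations_above` and the sample values are hypothetical, and the format of the module's actual mutation list file is not reproduced here.

```python
import csv
import io

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical two-row excerpt of a result.csv (invented values, three AA columns).
RESULT_CSV = """Chain,WT,Pos,Consensus,L,A,G
A,A,24,L,0.70,0.20,0.10
A,G,25,G,0.05,0.15,0.80
"""


def mutations_above(csv_text, threshold):
    """Return (chain, WT, Pos, mutant, probability) for non-WT residues above threshold."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for aa in AMINO_ACIDS:
            # Skip amino acids without a column and the WT residue itself.
            if aa in row and aa != row["WT"] and float(row[aa]) > threshold:
                # Pos stays a string because PDB numbering may carry insertion codes.
                hits.append((row["Chain"], row["WT"], row["Pos"], aa, float(row[aa])))
    return hits


hits = mutations_above(RESULT_CSV, 0.5)
print(hits)
```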

    References

    1. Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574
      https://www.science.org/doi/abs/10.1126/science.ade2574
    2. Shuai et al., 2023, Cell Systems 14, 979–989.
      https://doi.org/10.1016/j.cels.2023.10.001
  • Name: Cyclic Peptide Structure Prediction
    Description: 利用线性肽的序列生成环肽的结构。 Use the sequence of linear peptides to generate the structure of a cyclic peptide.
    Tags: undefined
    Author:
    Release: 2024-01-21 00:00:00
    Reference:

    Cyclic Peptide Structure Prediction

    简介

    Cyclic Peptide Structure Prediction模块利用线性肽的序列生成环肽的结构。

    参数说明

    Peptide Sequence

    线性肽的氨基酸序列,示例:
    GRCTQAWPPICFPD
    只支持输入一条序列。

    Linear Peptide Structure

    线性肽的结构文件,PDB格式

    结果说明

    预测输出环肽的结构文件 pep_cyclic.pdb

    参考文献

    Cyclic peptide structure prediction and design using AlphaFold. Stephen A. Rettie, Katelyn V. Campbell, Asim K. Bera, Alex Kang, Simon Kozlov, Joshmyn De La Cruz, Victor Adebomi, Guangfeng Zhou, Frank DiMaio, Sergey Ovchinnikov, Gaurav Bhardwaj; https://doi.org/10.1101/2023.02.25.529956

    Cyclic Peptide Structure Prediction

    Introduction

    The Cyclic Peptide Structure Prediction module generates the structure of a cyclic peptide from the sequence of a linear peptide.

    Parameter Description

    Peptide Sequence

    Amino acid sequence of the linear peptide, for example:
    GRCTQAWPPICFPD
    Only one sequence can be entered.

    Linear Peptide Structure

    Structure file of the linear peptide in PDB format.

    Result Description

    The predicted structure of the cyclic peptide is output as the structure file pep_cyclic.pdb.

    References

    Cyclic peptide structure prediction and design using AlphaFold. Stephen A. Rettie, Katelyn V. Campbell, Asim K. Bera, Alex Kang, Simon Kozlov, Joshmyn De La Cruz, Victor Adebomi, Guangfeng Zhou, Frank DiMaio, Sergey Ovchinnikov, Gaurav Bhardwaj; https://doi.org/10.1101/2023.02.25.529956

  • Name: Immune Protein Structure Prediction
    Description: 基于ImmuneBuilder算法预测免疫蛋白的结构。ImmuneBuilder是一组深度学习模型,专门预测抗体、纳米抗体和T细胞受体的结构,精度高的同时比AlphaFold2快得多。 Predicts the structure of immune proteins based on ImmuneBuilder. ImmuneBuilder is a set of deep learning models that accurately predict the structures of antibodies (ABodyBuilder2), nanobodies (NanoBodyBuilder2), and T-cell receptors (TCRBuilder2). ImmuneBuilder generates structures with state-of-the-art accuracy while being much faster than AlphaFold2.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-10-19 10:50:28
    Reference: Abanades, B., Wong, W.K., Boyles, F. et al. ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins. Commun Biol 6, 575 (2023).

    Immune Protein Structure Prediction (ImmuneBuilder2)

    简介

    Immune Protein Structure Prediction模块是基于ImmuneBuilder的免疫蛋白结构预测模块。ImmuneBuilder是一组深度学习模型,可以准确预测抗体(ABodyBuilder2)、纳米抗体(NanoBodyBuilder2)和T细胞受体(TCRBuilder2)的结构;ImmuneBuilder生成的结构精度高,同时比AlphaFold2快得多。

    参数说明

    Immune Protein Sequence File

    抗体、纳米抗体或者TCR的序列文件,FASTA格式。
    支持多条序列一次性计算,相应的序列顺序需满足以下要求:
    对于抗体序列,每个抗体的重、轻链为一组,相邻放置即可(先后顺序没有要求),示例如下:

    >seq1.H
    xxxxxxxxxxxx
    >seq1.L
    xxxxxxxxx
    >seq2.H
    xxxxxxxxxxxx
    >seq2.L
    xxxxxxxxx
    

    对于TCR序列,每个TCR的alpha、beta链为一组,相邻放置即可(先后顺序没有要求),示例如下

    >seq1.A
    xxxxxxx
    >seq1.B
    xxxxxxx
    >seq2.A
    xxxxxxx
    >seq2.B
    xxxxxxx
    

    对于纳米抗体没有特殊要求。

    Type

    预测蛋白结构类型:Antibody、Nanobody以及TCR。

    Numbering Scheme

    抗体编号类型,支持kabat、chothia、imgt、raw。

    Output File

    输出文件名称,默认结构名称为model.pdb。

    结果说明

    输出结果为预测的免疫蛋白pdb结构,默认名称为model.pdb。
    可以进行批量生成结构文件,所有文件在model.tar压缩文件中。

    参考文献

    Abanades, B., Wong, W.K., Boyles, F. et al. ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins. Commun Biol 6, 575 (2023).

    https://github.com/oxpig/ImmuneBuilder

    Immune Protein Structure Prediction (ImmuneBuilder2)

    Introduction

    The Immune Protein Structure Prediction module is based on ImmuneBuilder and predicts the structures of immune proteins. ImmuneBuilder is a set of deep learning models that accurately predict the structures of antibodies (ABodyBuilder2), nanobodies (NanoBodyBuilder2), and T cell receptors (TCRBuilder2). ImmuneBuilder generates highly accurate structures while being much faster than AlphaFold2.

    Parameter Description

    Immune Protein Sequence File

    Sequence file of the antibody, nanobody, or TCR in FASTA format.
    Supports calculating multiple sequences at once, with the sequence order meeting the following requirements:
    For antibody sequences, the heavy and light chain of an antibody constitute a pair, which should be placed adjacent to each other (the order does not matter), as shown below:

    >seq1.H
    xxxxxxxxxxxx
    >seq1.L
    xxxxxxxxx
    >seq2.H
    xxxxxxxxxxxx
    >seq2.L
    xxxxxxxxx
    

    For TCR sequences, the alpha and beta chain of TCR constitute a pair, which can be placed adjacent to each other (the order does not matter), as shown below:

    >seq1.A
    xxxxxxx
    >seq1.B
    xxxxxxx
    >seq2.A
    xxxxxxx
    >seq2.B
    xxxxxxx
    

    There are no specific naming requirements for nanobody sequences.
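
To check an input file before submission, the paired layout above can be read and grouped as follows. This is a sketch under stated assumptions: it groups chains by the `<name>.<chain>` header prefix used in the examples, whereas the module is documented only as requiring the two chains of a pair to be adjacent. The function names and the short sequences are placeholders.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples in file order."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line  # sequences may span several lines
    return [(name, seq) for name, seq in records]


def group_chains(records):
    """Group records named '<name>.<chain>' into {name: {chain: sequence}}."""
    grouped = {}
    for header, seq in records:
        name, sep, chain = header.rpartition(".")
        if not sep:
            # No '.' in the header (e.g. a nanobody): keep it as a single entry.
            name, chain = header, ""
        grouped.setdefault(name, {})[chain] = seq
    return grouped


FASTA = ">seq1.H\nEVQLV\n>seq1.L\nDIQMT\n>seq2.L\nDIVMT\n>seq2.H\nQVQLV\n"
pairs = group_chains(read_fasta(FASTA))
print(pairs)
```

Because the grouping is keyed on the name rather than on position, the order of the two chains within a pair does not matter, which matches the rule stated above.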

    Type

    Type of protein structure to predict: Antibody, Nanobody, or TCR.

    Numbering Scheme

    Antibody numbering scheme, supporting Kabat, Chothia, IMGT, and raw.

    Output File

    Name of the output file; the default is model.pdb.

    Result Description

    The output result is the predicted immune protein PDB structure, with the default name as model.pdb.
    Batch generation of structure files is supported, and all files are compressed in the model.tar file.

    References

    Abanades, B., Wong, W.K., Boyles, F. et al. ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins. Commun Biol 6, 575 (2023).

    https://github.com/oxpig/ImmuneBuilder

  • Name: Nanobody Humanization (Llamanade)
    Description: Llamanade基于NGS数据库和高分辨率结构,系统分析了Nbs的序列和结构特性,进而确定了可能有助于提高溶解度、结构稳定性和抗原结合的保守残基,以促进Nbs的人源化的理性设计,已成功应用于一批结构多样、强效的SARS-CoV-2中和Nbs人源化工作。对给定的Nbs进行全面人源化分析只需不到一分钟时间。 Llamanade systematically analyzes the sequence and structural properties of Nbs based on NGS databases and high-resolution structures. The analysis revealed a large amount of framework diversity and highlighted key differences between Nbs and human immunoglobulin G (IgG) antibodies. Conserved residues that may contribute to solubility, structural stability, and antigen binding were identified to facilitate the rational humanization of Nbs. Llamanade has been successfully applied to humanize a group of structurally diverse and potent SARS-CoV-2-neutralizing Nbs. A comprehensive humanization analysis of a given Nb takes less than a minute.
    Tags: undefined
    Author: Zhe Sang
    Release: 2024-01-11 00:00:00
    Reference: Sang Z, Xiang Y, Bahar I, Shi Y. Llamanade: An open-source computational pipeline for robust nanobody humanization. Structure. 2022, doi: 10.1016/j.str.2021.11.006

    Nanobody Humanization

    简介

    纳米抗体(Nanobody, Nbs)是最近出现的一类很有前景的生物医学和治疗应用抗体片段。尽管Nbs具有显著的理化特性,但它来自于驼科动物,可能需要 "人源化"才能提高临床试验的转化潜力。该模块基于Llamanade实现。Llamanade基于NGS(下一代测序)数据库和高分辨率结构,系统分析了Nbs的序列和结构特性。揭示了大量的框架多样性,并强调了Nbs与人类免疫球蛋白G(IgG)抗体之间的关键差异。确定了可能有助于提高溶解度、结构稳定性和抗原结合的保守残基,以促进Nbs的合理人源化。模块以Nbs序列为输入,提供序列特征、模型结构等信息,并优化Nbs人源化的解决方案。对给定的Nbs进行全面人源化分析只需不到一分钟时间。已成功应用于一批结构多样、强效的SARS-CoV-2中和Nbs人源化工作。

    参数说明

    Nanobody Sequence

    纳米抗体的序列,fasta格式,如:

    >Nb21
    MAQVQLVESGGGLVQAGGSLRLSCAVSGLGAHRVGWFRRAPGKEREFVAAIGANGGNTNYLDSVKGRFTISRDNAKNTIYLQMNSLKPQDTAVYYCAARDIETAEYTYWGQGTQVTVS

    结果说明

    输出humanized_data.csv结果文件,包含以下信息:
    Position:残基编号
    Original AA:原来残基
    Humanized?: 是否需要人源化,True表示需要,False表示不需要
    Humanized AA: 人源化后的残基
    备注:抗体编号方式采用Martin模式。

    参考文献

    Llamanade: An open-source computational pipeline for robust nanobody humanization
    Sang, Zhe et al. Structure, Volume 30, Issue 3, 418 - 429.e3
    https://doi.org/10.1016/j.str.2021.11.006

    Nanobody Humanization

    Introduction

    Nanobodies (Nbs) are a recently emerged class of promising antibody fragments for biomedical and therapeutic applications. Despite their remarkable physicochemical properties, Nbs are derived from camelids and may need to be "humanized" to improve their translational potential in clinical trials. This module is implemented based on Llamanade, which systematically analyzes the sequence and structural properties of Nbs based on NGS (next-generation sequencing) databases and high-resolution structures. The analysis revealed a large amount of framework diversity and highlighted key differences between Nbs and human immunoglobulin G (IgG) antibodies. Conserved residues that may contribute to solubility, structural stability, and antigen binding were identified to facilitate the rational humanization of Nbs. The module takes an Nb sequence as input, provides information such as sequence features and a model structure, and optimizes the humanization solution. A comprehensive humanization analysis of a given Nb takes less than a minute. Llamanade has been successfully applied to humanize a group of structurally diverse and potent SARS-CoV-2-neutralizing Nbs.

    Parameter

    Nanobody Sequence

    Nanobody sequence in FASTA format, such as:

    >Nb21
    MAQVQLVESGGGLVQAGGSLRLSCAVSGLGAHRVGWFRRAPGKEREFVAAIGANGGNTNYLDSVKGRFTISRDNAKNTIYLQMNSLKPQDTAVYYCAARDIETAEYTYWGQGTQVTVS

    Result

    The output file humanized_data.csv contains the following information:
    Position: residue number
    Original AA: original residue
    Humanized?: whether the residue needs to be humanized (0 means no, 1 means yes)
    Humanized AA: residue after humanization
    Note: Residues are numbered using the Martin scheme.
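
A short sketch of how the humanized_data.csv columns might be consumed downstream: it lists every flagged residue as a substitution string. The sample rows are invented, the helper names are hypothetical, and the flag is read tolerantly as either 0/1 or True/False because the exact spelling written by the module is an assumption here.

```python
import csv
import io

# Hypothetical excerpt of humanized_data.csv; positions and residues are invented.
HUMANIZED_CSV = """Position,Original AA,Humanized?,Humanized AA
1,Q,0,Q
5,V,1,L
11,L,False,L
14,A,True,P
"""


def is_flagged(value):
    """Accept both 0/1 and True/False spellings of the 'Humanized?' column."""
    return value.strip().lower() in {"1", "true"}


def list_substitutions(csv_text):
    """Return substitutions such as 'V5L' for every residue flagged for humanization."""
    subs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if is_flagged(row["Humanized?"]):
            subs.append(f'{row["Original AA"]}{row["Position"]}{row["Humanized AA"]}')
    return subs


substitutions = list_substitutions(HUMANIZED_CSV)
print(substitutions)
```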

    Reference

    Llamanade: An open-source computational pipeline for robust nanobody humanization
    Sang, Zhe et al. Structure, Volume 30, Issue 3, 418 - 429.e3
    https://doi.org/10.1016/j.str.2021.11.006

  • Name: mRNA 5'UTRs optimization
    Description: 是一种新颖的深度生成模型,设计用于在 mRNA 序列中创建 N1-甲基假尿苷 (m1Ψ) 5'UTR。Smart5UTR 利用多任务自动编码器框架,利用从大型数据集中学习到的潜在特征,有效地生成 5'UTR 序列。Smart5UTR设计的mRNA的性能已通过体外和体内实验得到验证。这个强大的工具简化了m1Ψ-5'UTRs的设计,有助于开发更有效的mRNA疗法。 A novel deep generative model designed to create N1-methyl-pseudouridine (m1Ψ) 5' UTRs in mRNA sequences. Smart5UTR utilizes a multi-task autoencoder framework to effectively generate 5' UTR sequences by leveraging latent features learned from large datasets. The performance of mRNAs designed by Smart5UTR has been validated through both in vitro and in vivo experiments. This powerful tool simplifies the design of m1Ψ-5' UTRs and contributes to the development of more effective mRNA therapies.
    Tags: undefined
    Author: Xiaoshan Tang
    Release: 2024-01-09 00:00:00
    Reference: Xiaoshan Tang, et al., A novel deep generative model for mRNA vaccine development: Designing 5ʹ UTRs with N1-methyl-pseudouridine modification, Acta Pharmaceutica Sinica B, 2023

    mRNA 5’UTRs optimization

    简介

    该模块基于Smart5UTR模型实现,Smart5UTR 是一种新颖的深度生成模型,设计用于在 mRNA 序列中创建 N1-甲基假尿苷 (m1Ψ) 5’ UTR。Smart5UTR 利用多任务自动编码器框架,利用从大型数据集中学习到的潜在特征,有效地生成 5’ UTR 序列。Smart5UTR设计的mRNA的性能已通过体外和体内实验得到验证。这个强大的工具简化了m1Ψ-5’UTRs的设计,有助于开发更有效的mRNA疗法。

    参数说明

    Sequence of 5’UTR

    mRNA 5’UTR的序列,如:GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGGTT
    备注:输入序列长度不超过50碱基。

    结果说明

    输出result.csv结果文件,包含以下信息:
    Original Sequence: 初始序列
    Optimized Sequence: 优化后的序列
    Optimized MRL: 优化序列预测的MRL值

    MRL解释:
    mean ribosome load (MRL) 平均核糖体加载值,是反映mRNA序列翻译效率的指标,值越大表示翻译效率越高,一般大于5.0

    参考文献

    Xiaoshan Tang, et al., A novel deep generative model for mRNA vaccine development: Designing 5ʹ UTRs with N1-methyl-pseudouridine modification, Acta Pharmaceutica Sinica B, 2023
    https://doi.org/10.1016/j.apsb.2023.11.003

    mRNA 5’UTRs optimization

    Introduction

    Smart5UTR is a novel deep generative model designed for creating N1-methyl-pseudouridine (m1Ψ) 5’ UTRs in mRNA sequences. Utilizing a multi-task autoencoder framework, Smart5UTR efficiently generates 5’ UTR sequences by leveraging the latent features learned from a large dataset. The performance of Smart5UTR-designed mRNA has been validated through in vitro and in vivo experiments. This powerful tool streamlines the design of m1Ψ-5’ UTRs, contributing to the development of more effective mRNA therapeutics.

    Parameter

    Sequence of 5’UTR

    Sequence of mRNA 5’UTR, such as: GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGGTT
    Note: The input sequence must not exceed 50 nt.
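
The length limit can be checked before a job is submitted. The sketch below is only a client-side sanity check: the function name `check_utr` is hypothetical, and accepting both T and U as well as lower-case input is an assumption rather than documented module behavior.

```python
MAX_UTR_LENGTH = 50  # limit stated in the note above


def check_utr(sequence):
    """Return a cleaned, upper-case 5'UTR or raise ValueError if it is unusable."""
    cleaned = sequence.strip().upper()
    if not cleaned:
        raise ValueError("empty sequence")
    invalid = set(cleaned) - set("ACGTU")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    if len(cleaned) > MAX_UTR_LENGTH:
        raise ValueError(f"sequence has {len(cleaned)} bases; the limit is {MAX_UTR_LENGTH}")
    return cleaned


example = check_utr("GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGGTT")
print(len(example))
```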

    Result

    The output csv file of optimized sequence includes Original Sequence, Optimized Sequence and Optimized MRL.

    Mean ribosome load (MRL) is the average number of ribosomes associated with a given mRNA and serves as a proxy for translation efficiency. Higher values indicate higher translation efficiency; values are generally greater than 5.0.

    Reference

    Xiaoshan Tang, et al., A novel deep generative model for mRNA vaccine development: Designing 5ʹ UTRs with N1-methyl-pseudouridine modification, Acta Pharmaceutica Sinica B, 2023
    https://doi.org/10.1016/j.apsb.2023.11.003

  • Name: Immunogenicity Prediction (AlphaMHC v3.0 beta)
    Description: AlphaMHC算法采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条与免疫原性相关的湿实验数据(包括亲和力数据、NGS数据、质谱数据等)进行训练,实现了从序列到临床免疫原性风险的端到端的预测,并通过上百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)进行验证。 该版本使用抗体为主的临床ADA数据进行测试精度达到90%,AUC达0.91,性能优于v2.0版本。 The AlphaMHC algorithm employs popular NLP (Natural Language Processing) techniques and a novel multi-modal fusion deep neural network architecture. It integrates nearly one billion wet lab data points related to immunogenicity (including affinity data, NGS data, mass spectrometry data, etc.) for training, achieving end-to-end prediction from sequence to clinical immunogenicity risk. This has been validated with hundreds of real clinical immunogenicity data points from the FDA and EMA (including mono- and multi-specific antibodies, recombinant proteins, etc.). This version is the latest and has been tested primarily with clinical ADA data from antibodies, achieving an accuracy of 90% and an AUC of 0.91. Its performance surpasses that of version 2.0, and it is recommended for trial.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-11-30 00:00:00
    Reference:

    Immunogenicity Prediction (AlphaMHC v3.0 beta)

    介绍

    AlphaMHC v3.0在多个方面相比v2.0进行了大幅优化,
    主要包括:
    1、风险评分优化,能更好的反映多重HLA激活的风险贡献;
    2、引入新的EL和TCR等更多来源的数据,提升了对可递呈表位的预测能力,对TCR分子的支持更好;
    3、全新的结果可视化面板(通过WeSeq运行);

    为了更好的交互体验和对结果进行可视化,推荐从WeSeq中使用本功能。

    测试数据:
    从FDA和EMA的临床试验中收集了已知免疫原性的分子及其ADA的分布,使用模型对ADA明显较高(ADA>20%)及较低(ADA<5%)的分子进行分类以测试其预测性能。
    image.png

    测试结果:
    AlphaMHC v3.0全面超越常见算法及v2.0,性能同类最佳(SOTA)
    image.png
    右图中:

    • ACC是准确度,代表所有分子中预测正确的比例;
    • PRECISION是精确率(阳性预测值),指预测为高风险的分子中,实际为高ADA分子的比例;
    • RECALL是召回率(敏感性),指全部高ADA分子中被预测为高风险的比例;
    • F1是综合了精确率和召回率的指标(二者的调和平均);
      以上指标都是越高越好。

    参数

    Fasta File

    蛋白序列文件,FASTA格式。支持多条链以及多分子模式。
    对于多分子模式,序列名称规则为:分子名.链名,例如:

    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    >mol2.A
    XXXXXXX
    >mol2.B
    XXXXXXX
    

    结果说明

    • Molecule Score:
      预测的每个分子的免疫原性风险评分以及风险
      (同个分子的多条链的预测结果汇总后综合评估所得)。
    • TCE Score:
      预测出的T细胞表位(TCE)以及多个评分指标。

    Molecule Score 包含以下信息:

    指标 说明
    Protein ID 输入蛋白的名称,如果是多条序列组成的蛋白,会自动合并
    Score 预测的免疫原性风险评分,值越大,风险越高。为所预测短肽的TCE score的求和
    Risk 对应的免疫原性风险等级

    TCE Score 包含以下信息:

    指标 说明
    Protein ID 所在分子的名称,同个分子的多条序列组成的蛋白会自动合并
    Sequence ID 所在序列的名称
    Core_Pos 表位序列的起始位置
    Core 表位序列(TCE)
    Score 表位序列的风险评分,分数越高越可能引起免疫原性。其范围是0-不限
    MHC_Count 可激活的MHC亚型数,考虑了MHC-II的递呈
    Tolerance 免疫耐受的可能性
    Germline 是否存在于人胚系基因中
    NextProt 是否存在于人蛋白组中
    OAS 在NGS人源抗体中出现的频率
    TCR 是否存在于人TCR基因中
    LAC 是否存在于低ADA临床药物(Low ADA CST)中

    Immunogenicity Prediction (AlphaMHC v3.0 beta)

    Introduction

    AlphaMHC v3.0 has undergone significant optimizations compared to v2.0 in several aspects, including:

    1. Improved risk scoring to better reflect the risk contributions of multiple HLA activations.
    2. Introduction of data from additional sources such as EL and TCR, improving the prediction of presentable epitopes and providing better support for TCR molecules.
    3. Brand new visualization panel for results (run through WeSeq).

    For a better interactive experience and visualization of results, it is recommended to use this feature through WeSeq.

    Test Data:
    Molecules with known immunogenicity and their ADA distributions were collected from FDA and EMA clinical trial data. To test predictive performance, the model was used to classify molecules with clearly high ADA (>20%) versus low ADA (<5%).
    image.png

    Test Results:
    AlphaMHC v3.0 surpasses common algorithms and v2.0 comprehensively, achieving state-of-the-art performance (SOTA).
    image.png
    In the image on the right:

    • ACC is accuracy: the proportion of all molecules that are predicted correctly.
    • PRECISION is the positive predictive value: the proportion of molecules predicted as high risk that are actually high-ADA molecules.
    • RECALL is sensitivity: the proportion of all high-ADA molecules that are predicted as high risk.
    • F1 is the harmonic mean of precision and recall. Higher values are better for all these metrics.
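
For reference, the four metrics can be computed from a confusion matrix in which "positive" means a high-ADA molecule. The counts below are invented for illustration and are not AlphaMHC results; the function name is hypothetical, and the sketch does not guard against empty classes (division by zero).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, PRECISION, RECALL and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)   # predicted high risk that are truly high ADA
    recall = tp / (tp + fn)      # truly high ADA that are predicted high risk
    return {
        "ACC": (tp + tn) / total,
        "PRECISION": precision,
        "RECALL": recall,
        "F1": 2 * precision * recall / (precision + recall),
    }


# Invented counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 3) for name, value in metrics.items()})
```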

    Parameters

    Fasta File

    Protein sequence file in FASTA format. Supports multiple chains and multiple molecule modes.
    For multiple molecule mode, the sequence naming convention is: molecule name.chain name, for example:

    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    >mol2.A
    XXXXXXX
    >mol2.B
    XXXXXXX
    

    Result Description

    • Molecule Score:
      The predicted immunogenicity risk score for each molecule and its risk.
      (Comprehensive evaluation obtained by summarizing the predictions of multiple chains of the same molecule).
    • TCE Score:
      Predicted T cell epitopes (TCE) and multiple scoring metrics.


    Molecule Score contains the following information:

    Indicator Description
    Protein ID Name of the input protein; if the protein is composed of multiple sequences, they will be automatically merged
    Score Predicted immunogenicity risk score; higher values indicate higher risk. It is the sum of the TCE scores of the predicted peptides
    Risk Corresponding immunogenicity risk level

    TCE Score contains the following information:

    Indicator Description
    Protein ID Name of the molecule it belongs to; proteins composed of multiple sequences within the same molecule will be automatically merged
    Sequence ID Name of the sequence it belongs to
    Core_Pos Starting position of the epitope sequence
    Core Epitope sequence (TCE)
    Score Risk score of the epitope sequence; a higher score means the epitope is more likely to cause immunogenicity. The minimum is 0 and there is no upper bound
    MHC_Count Number of activatable MHC subtypes, considering MHC-II presentation
    Tolerance Possibility of immunological tolerance
    Germline Whether it exists in human germline genes
    NextProt Whether it exists in the human proteome
    OAS Frequency of occurrence in NGS-derived human antibodies
    TCR Whether it exists in human TCR genes
    LAC Whether it exists in low-ADA clinical drugs (Low ADA CST)
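
Since the Molecule Score is described as the sum of the TCE scores, the relationship between the two tables can be sketched as a group-by over Protein ID. The rows below are invented placeholders, the function name is hypothetical, and the mapping from Score to Risk level is not documented, so it is left out.

```python
from collections import defaultdict

# Hypothetical TCE Score rows as (Protein ID, Core, Score); all values are invented.
TCE_ROWS = [
    ("mol1", "FVNQHLCGS", 1.5),
    ("mol1", "LVEALYLVC", 0.5),
    ("mol2", "YQLENYCNA", 2.25),
]


def molecule_scores(rows):
    """Sum the TCE scores of all epitopes that belong to the same molecule."""
    totals = defaultdict(float)
    for protein_id, _core, score in rows:
        totals[protein_id] += score
    return dict(totals)


scores = molecule_scores(TCE_ROWS)
print(scores)
```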
  • Name: Mutation Score (v2.1)
    Description: Mutation Score是抗体人源化设计中核心模块,是一个基于结构的自动化评分模块。该模块基于抗体的结构信息以及CDR嫁接后的序列信息,对graft后的FR区的每个氨基酸在替换前后的变化程度进行量化评分。评分越高,说明CDR嫁接过程中氨基酸的替换对CDR区的构象可能影响较大,需要进行回复突变。模块输出每个氨基酸的打分值,用于抗体人源化设计流程中后续的分组和人源化抗体序列的生成。 Mutation Score is a core module in antibody humanization design workflow, which is a structure-based automated scoring module. Based on the structure information of the antibody and the CDR-grafted sequence information, this module quantitatively scores the degree of change before and after the replacement of each amino acid in the FR region. The higher the score, the greater the potential impact of the amino acid replacement on the conformation change of the CDR region during CDR grafting, indicating the need for auto-back mutation. The module outputs the score for each amino acid, which is used for subsequent grouping and generation of humanized antibody sequences in the antibody humanization design workflow.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 11:14:32
    Reference: To be submitted

    Mutation Score

    简介

    Mutation Score是抗体人源化设计中核心模块,是一个基于结构的自动化评分模块。该模块基于抗体的结构信息以及CDR嫁接后的序列信息,对移植抗体(graft)后的FR区的每个氨基酸在替换前后的变化程度进行量化评分。评分越高,说明CDR嫁接过程中氨基酸的替换对CDR区的构象可能影响较大,需要进行回复突变。模块输出每个氨基酸的打分值,用于抗体人源化设计流程中后续的分组和人源化抗体序列的生成。

    参数说明

    Sequence File

    抗体Fv区序列文件,FASTA格式。

    Model File

    抗体结构文件,PDB格式。

    Grafted Sequence

    抗体CDR区Graft后的序列文件,FASTA格式。

    Output Score

    指定输出打分文件的名称,CSV格式。

    Antibody Type

    抗体类型:

    • Antibody:常规抗体
    • Nanobody:纳米抗体

    结果说明

    输出结果文件为score.csv,包含信息如下:

    字段名称 说明
    Chain 轻链或重链
    UID 为残基的标准编号(默认为 Kabat)
    Position 残基在序列中的位置
    Donor Residue 原始氨基酸
    Template Residue 人源模板的目标氨基酸
    score 回复突变智能评分,Score 越高,认为其回复突变的必要性越高。通常Score>10为高优先级,5-10为中优先级,其他为低优先级

    Mutation Score

    Introduction

    Mutation Score is a core module in antibody humanization design: a structure-based, automated scoring module. Using the antibody structure and the CDR-grafted sequence, it quantitatively scores how much each amino acid in the FR regions changes when it is replaced during grafting. A higher score indicates that the replacement may have a larger impact on the conformation of the CDRs, suggesting that a back mutation is needed. The module outputs a score for each amino acid, which is used for subsequent grouping and generation of humanized antibody sequences in the antibody humanization design workflow.

    Parameter Description

    Sequence File

    Sequence file of the antibody Fv region in FASTA format.

    Model File

    Antibody structure file in PDB format.

    Grafted Sequence

    Antibody sequence file after CDR grafting, in FASTA format.

    Output Score

    Specify the name of the output scoring file in CSV format.

    Antibody Type

    Type of antibody:

    • Antibody: Conventional antibody
    • Nanobody: Nanobody

    Result Description

    The output result file is named score.csv and includes the following information:

    Field Name Description
    Chain Light chain or heavy chain
    UID Standard numbering for residues (default is Kabat)
    Position Position of the residue in the sequence
    Donor Residue Original amino acid
    Template Residue Target amino acid from the human template
    Score Back-mutation score; a higher score suggests a greater need for a back mutation. Typically, Score > 10 is high priority, 5-10 is medium priority, and anything else is low priority.
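
The priority bands for Score can be written as a small helper when post-processing score.csv. The function name is hypothetical, and the treatment of the boundary values (exactly 5 and exactly 10 counted as medium) is an assumption, since the text only gives "> 10" and "5-10".

```python
def back_mutation_priority(score):
    """Map a Mutation Score value to the priority band described above."""
    if score > 10:
        return "high"
    if score >= 5:
        return "medium"  # assumes both 5 and 10 fall in the 5-10 band
    return "low"


for value in (12.3, 10, 7.5, 4.9):
    print(value, back_mutation_priority(value))
```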
  • Name: Ramachandran Plots
    Description: Ramachandran Plots模块是对同源建模后模型质量的评估,仅仅考虑蛋白的构象是否合理,并不涉及能量问题。 Evaluate the quality of models after homology modeling, focusing on the reasonableness of the protein's conformation without considering energy issues.
    Tags: undefined
    Author: Manish Sud
    Release: 2023-11-20 10:25:37
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. Lovell SC, Davis IW, Arendall WB 3rd, de Bakker PI, Word JM, Prisant MG, Richardson JS, Richardson DC. Structure validation by Calpha geometry: phi,psi and Cbeta deviation. Proteins. 2003 Feb 15;50(3):437-50.

    Ramachandran Plots

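    Introduction

    The Ramachandran Plots module evaluates the quality of a model after homology modeling. It considers only whether the protein's conformation is reasonable and does not take energy into account. In a Ramachandran plot, φ (phi) is the rotation angle of the C-N bond to the left of the alpha carbon in a peptide unit, and ψ (psi) is the rotation angle of the C-C bond to the right of the alpha carbon. Generally, if more than 90% of the residues in the protein fall within the allowed and most favored regions, the model's conformation can be considered consistent with the rules of stereochemistry.

    Parameter Description

    Structure PDB File

    The structure file of the protein in PDB format.

    Chain ID

    Name of the chain to plot. If left blank, it defaults to all.

    Figure Resolution

    Resolution of the image (in dots per inch).

    Result Description

    The output includes:

    Output File Name Description
    result_General.png Ramachandran plot for general residues
    result_Glycine.png Ramachandran plot for glycine residues
    result_PreProline.png Ramachandran plot for residues preceding proline
    result_Proline.png Ramachandran plot for proline residues

    In the plots, green marks the most favored regions, light green the allowed regions, and white the disallowed regions. Cyan dots are residues in allowed regions and red dots are residues in disallowed regions. When fewer than 5% of the residues lie in the white regions, the protein structure is considered reasonable.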

    References

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    Lovell SC, Davis IW, Arendall WB 3rd, de Bakker PI, Word JM, Prisant MG, Richardson JS, Richardson DC. Structure validation by Calpha geometry: phi,psi and Cbeta deviation. Proteins. 2003 Feb 15;50(3):437-50.

    Ramachandran Plots

    Introduction

    The Ramachandran Plots module evaluates the quality of a model after homology modeling. It considers only whether the protein's conformation is reasonable and does not take energy into account. In a Ramachandran plot, φ (phi) is the rotation angle about the N-Cα bond (the bond to the left of the alpha carbon in a peptide unit), and ψ (psi) is the rotation angle about the Cα-C bond (the bond to the right of the alpha carbon). Generally, if more than 90% of the residues fall within the allowed and most favored regions, the model's conformation is considered consistent with the rules of stereochemistry.
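    To make the two angles concrete, the sketch below computes a signed torsion angle from four points using only the Python standard library. It is not the module's own code. For residue i, φ would use the atoms C(i-1), N(i), Cα(i), C(i) and ψ would use N(i), Cα(i), C(i), N(i+1); the coordinates in the usage lines are made up for illustration.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Made-up coordinates: a right-angle twist and a fully extended (trans) arrangement.
print(dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))
print(dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 1.0)))
```

    Each residue then contributes one (φ, ψ) point to the plot that matches its type (general, glycine, pre-proline, or proline).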

    Parameter Description

    • Structure PDB File: The structure file of the protein in PDB format.
    • Chain ID: Select the chain name for plotting. If left blank, it defaults to all.
    • Figure Resolution: Resolution of the image (in dots per inch).

    Result Description

    The output includes:

    Output File Name Description
    result_General.png Ramachandran plot for general residues
    result_Glycine.png Ramachandran plot for glycine residues
    result_PreProline.png Ramachandran plot for residues before proline
    result_Proline.png Ramachandran plot for proline residues

    In the plots, green represents the most favored regions, light green represents allowed regions, white represents disallowed regions, cyan dots represent amino acids in allowed regions, and red dots represent amino acids in disallowed regions. When the percentage of amino acids in the white region is less than 5%, the protein structure is considered reasonable.

    References

    • Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    • Lovell SC, Davis IW, Arendall WB 3rd, de Bakker PI, Word JM, Prisant MG, Richardson JS, Richardson DC. Structure validation by Calpha geometry: phi, psi and Cbeta deviation. Proteins. 2003 Feb 15;50(3):437-50.
  • Name: Therapeutic Antibody Profiler
    Description: The Therapeutic Antibody Profiler (TAP) quickly scores an antibody to assess its developability. It compares your antibody variable domain against multiple developability guidelines derived from clinical-stage therapeutic values. Based on the structure of the variable region, TAP calculates the following properties to see whether your antibody design is commensurate with those of clinical-stage therapeutics: Patches of Surface Hydrophobicity (PSH) metric across the CDR Vicinity; Patches of Positive Charge (PPC) metric across the CDR Vicinity; Patches of Negative Charge (PNC) metric across the CDR Vicinity; Structural Fv Charge Symmetry Parameter (SFvCSP).
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-11-13 00:00:00
    Reference:

    Therapeutic Antibody Profiler

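    Introduction

    The Therapeutic Antibody Profiler (TAP) calculates developability properties of an antibody from the structure of its variable region. TAP calculates the following five properties to determine whether the developability metrics of the input antibody match those of clinical-stage antibodies:

    • Total CDR Length: total length of the CDR regions
    • Patches of Surface Hydrophobicity (PSH) metric across the CDR Vicinity: degree of surface hydrophobicity in and around the CDR regions
    • Patches of Positive Charge (PPC) metric across the CDR Vicinity: degree of positive surface charge in and around the CDR regions
    • Patches of Negative Charge (PNC) metric across the CDR Vicinity: degree of negative surface charge in and around the CDR regions
    • Structural Fv Charge Symmetry Parameter (SFvCSP): degree of net charge imbalance between the heavy and light chains of the Fv region

    The developability guideline ranges calculated from the Fv regions of 851 therapeutic antibodies (clinical Phase I and later) are as follows (last updated on 24 February 2025):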

    Property Amber Region Red Region
    Total CDR Length (L) 37 ≤ L ≤ 42 L < 37
    55 ≤ L ≤ 65 L > 65
    Patches of Surface Hydrophobicity (PSH) 95.77 ≤ PSH ≤ 111.40 PSH < 95.77
    167.64 ≤ PSH ≤ 211.65 PSH > 211.65
    Patches of Positive Charge (PPC) 1.34 ≤ PPC ≤ 4.20 PPC > 4.24
    Patches of Negative Charge (PNC) 1.99 ≤ PNC ≤ 4.43 PNC > 5.67
    Structural Fv Charge Symmetry Parameter (SFvCSP) -30.60 ≤ SFvCSP ≤ -6.00 SFvCSP < -30.60

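    Amber Region: The metric lies within the range calculated from the Fv regions of 851 therapeutic antibodies (clinical Phase I and later) and is considered reasonable
    Red Region: The metric lies in the unreasonable region and needs to be adjusted
    The ranges of the Amber Region and Red Region are defined in the table above.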

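    Parameter Description

    Antibody Fv Structure (PDB)

    Antibody structure file in PDB format.

    Antibody Fv Structure (TAR)

    A TAR archive containing multiple antibody Fv structures.
    When both a PDB file and a TAR archive are uploaded, they are combined for calculation.

    Score

    Output score file in CSV format.

    Result Description

    The TAP score file is output in CSV format and contains the following information:
    Total CDR Length: number of amino acids in the CDR regions
    CDR Vicinity PSH Score (Kyte & Doolittle): degree of surface hydrophobicity in and around the CDR regions
    CDR Vicinity PPC Score: degree of positive surface charge in and around the CDR regions
    CDR Vicinity PNC Score: degree of negative surface charge in and around the CDR regions
    SFvCSP Score: degree of net charge imbalance between the heavy and light chains of the Fv region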

    References

    Five computational developability guidelines for therapeutic antibody profiling. Matthew I. J. Raybould, Claire Marks, Konrad Krawczyk, Bruck Taddese, Jaroslaw Nowak, Alan P. Lewis, Alexander Bujotzek, Jiye Shi, and Charlotte M. Deane, Proceedings of the National Academy of Sciences, 2019, 116 (10) 4025-4030
    https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabpred/tap

    Therapeutic Antibody Profiler

    Introduction

    The Therapeutic Antibody Profiler (TAP) compares your antibody variable domain sequence against multiple developability guidelines derived from clinical-stage therapeutic values.

    TAP calculates the following properties to see whether your antibody design is commensurate with those of clinical-stage therapeutics:

    • Total CDR Length
    • Patches of Surface Hydrophobicity (PSH) metric across the CDR Vicinity
    • Patches of Positive Charge (PPC) metric across the CDR Vicinity
    • Patches of Negative Charge (PNC) metric across the CDR Vicinity
    • Structural Fv Charge Symmetry Parameter (SFvCSP)

    The TAP Guidelines were last updated on 24th February 2025:

    Property Amber Region Red Region
    Total CDR Length (L) 37 ≤ L ≤ 42 L < 37
    55 ≤ L ≤ 65 L > 65
    Patches of Surface Hydrophobicity (PSH) 95.77 ≤ PSH ≤ 111.40 PSH < 95.77
    167.64 ≤ PSH ≤ 211.65 PSH > 211.65
    Patches of Positive Charge (PPC) 1.34 ≤ PPC ≤ 4.20 PPC > 4.24
    Patches of Negative Charge (PNC) 1.99 ≤ PNC ≤ 4.43 PNC > 5.67
    Structural Fv Charge Symmetry Parameter (SFvCSP) -30.60 ≤ SFvCSP ≤ -6.00 SFvCSP < -30.60

    Amber Region: Within the reasonable range observed for the Fv regions of 851 post-Phase-I therapeutic antibodies
    Red Region: Unreasonable region; the developability needs to be optimized
    The table above defines the ranges of the Amber Region and Red Region.
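
    As a rough illustration of how the thresholds in the table can be applied, the sketch below flags Total CDR Length, PSH, and SFvCSP values. The function names and the "unflagged" label for values outside both listed regions are assumptions made here, not part of TAP. PPC and PNC are left out because the table leaves a gap between the upper amber bound and the red bound for those two properties.

```python
def flag_two_sided(value, red_below, amber_low_max, amber_high_min, red_above):
    """Flag a property whose table rows give both a low band and a high band."""
    if value < red_below or value > red_above:
        return "red"
    if value <= amber_low_max or value >= amber_high_min:
        return "amber"
    return "unflagged"  # label chosen for this sketch; the table does not name it


def flag_total_cdr_length(length):
    # Amber: 37-42 or 55-65; red: below 37 or above 65.
    return flag_two_sided(length, 37, 42, 55, 65)


def flag_psh(psh):
    # Amber: 95.77-111.40 or 167.64-211.65; red: below 95.77 or above 211.65.
    return flag_two_sided(psh, 95.77, 111.40, 167.64, 211.65)


def flag_sfvcsp(sfvcsp):
    # Amber: -30.60 to -6.00; red: below -30.60.
    if sfvcsp < -30.60:
        return "red"
    if sfvcsp <= -6.00:
        return "amber"
    return "unflagged"


print(flag_total_cdr_length(48), flag_psh(180.0), flag_sfvcsp(-35.0))
```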

    Parameter

    Antibody Fv Structure (PDB)

    Antibody Structure file in PDB format

    Antibody Fv Structure (TAR)

    A TAR archive containing multiple antibody Fv structures. When both a PDB file and a TAR archive are uploaded, they are combined for calculation.

    Score

    Output score file in CSV format

    Result

    The output csv file of TAP properties includes Total CDR Length, CDR Vicinity PSH Score (Kyte & Doolittle), CDR Vicinity PPC Score, CDR Vicinity PNC Score, SFvCSP Score.

    Reference

    Five computational developability guidelines for therapeutic antibody profiling. Matthew I. J. Raybould, Claire Marks, Konrad Krawczyk, Bruck Taddese, Jaroslaw Nowak, Alan P. Lewis, Alexander Bujotzek, Jiye Shi, and Charlotte M. Deane, Proceedings of the National Academy of Sciences, 2019, 116 (10) 4025-4030
    https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabpred/tap

  • Name: Residue Mutation
    Description: The Residue Mutation module uses PyMOL to mutate specified amino acids in a protein PDB structure and produces the mutated PDB structure, to facilitate comparison of the structures before and after mutation and subsequent analysis.
    Tags: undefined
    Author: Manish Sud
    Release: 2023-11-07 10:24:00
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Residue Mutation

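    Introduction

    The Residue Mutation module uses PyMOL to mutate specified amino acids in a protein PDB structure and produces the mutated PDB structure, so that the structures before and after mutation can be compared and analyzed further.

    Parameter Description

    Structure PDB File

    The structure file of the protein in PDB format.

    Mutation Residue Names

    Names of the residues to mutate, in the format <ChainID>:<ResName><ResNum><ResName>,…
    For example:
    E:LEU49CYS,SER53TYR,I:ILE11VAL
    Here, "E" and "I" are chain names. The first amino acid after the chain name is the original amino acid and the second is the amino acid it is mutated to. The chain and the amino acid are separated by a colon (:), and multiple mutations are separated by commas (,).

    Output PDB File

    Name of the output PDB file.

    Result Description

    The output file is the mutated PDB structure, result.pdb.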

    References

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    Schrödinger, LLC. The PyMOL Molecular Graphics System, Version 2.0. 2015 Nov 22.
    Schrödinger, LLC. The AxPyMOL Molecular Graphics Plugin for PowerPoint, Version 2.0. 2015 Nov 22.

    Residue Mutation

    Introduction

    The Residue Mutation module uses PyMOL to mutate specified amino acids in a protein PDB structure, resulting in a mutated PDB structure for comparison and further analysis.

    Parameter Description

    • Structure PDB File: The structure file of the protein in PDB format.
    • Mutation Residue Names: The format for specifying mutation amino acids is: <ChainID>:<ResName1><ResNum1><ResName2>,…
      For example: E:LEU49CYS,SER53TYR,I:ILE11VAL
      Here, “E” and “I” are chain names, followed by the original amino acid and the mutated amino acid names; chains and amino acids are separated by a colon (:), and multiple mutation points are separated by commas (,).
    • Output PDB File: Name of the output PDB file.
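
    The mutation specification format can be checked before submitting a job with a small parser such as the sketch below. It is not the module's own parser. It assumes, based on the example (where SER53TYR carries no chain prefix), that a mutation without a chain ID inherits the chain of the preceding entry; the names Mutation and parse_mutations are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Mutation(NamedTuple):
    chain: str
    original: str
    position: int
    target: str


# Optional "<ChainID>:" prefix, then <ResName><ResNum><ResName>.
_TOKEN = re.compile(r"^(?:(\w+):)?([A-Za-z]{3})(-?\d+)([A-Za-z]{3})$")


def parse_mutations(spec: str) -> List[Mutation]:
    """Parse a string such as 'E:LEU49CYS,SER53TYR,I:ILE11VAL'."""
    mutations: List[Mutation] = []
    chain: Optional[str] = None
    for raw in spec.split(","):
        token = raw.strip()
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"cannot parse mutation: {token!r}")
        chain_id, original, position, target = match.groups()
        if chain_id is not None:
            chain = chain_id  # assumption: later entries reuse this chain
        if chain is None:
            raise ValueError("the first mutation must name a chain")
        mutations.append(
            Mutation(chain, original.upper(), int(position), target.upper())
        )
    return mutations


for mutation in parse_mutations("E:LEU49CYS,SER53TYR,I:ILE11VAL"):
    print(mutation)
```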

    Result Description

    The output file is the mutated PDB structure named result.pdb.

    References

    • Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    • Schrödinger, LLC. The PyMOL Molecular Graphics System, Version 2.0. 2015 Nov 22.
    • Schrödinger, LLC. The AxPyMOL Molecular Graphics Plugin for PowerPoint, Version 2.0. 2015 Nov 22.
  • Name: Nanobody Structure Prediction
    Description: The Nanobody Structure Prediction module predicts the structure of nanobody variable regions and is based on ImmuneBuilder/NanoBodyBuilder2. ImmuneBuilder is a set of deep learning models that accurately predict the structures of antibodies (ABodyBuilder2), nanobodies (NanoBodyBuilder2), and T-cell receptors (TCRBuilder2). ImmuneBuilder generates structures with state-of-the-art precision while being much faster than AlphaFold2.
    Tags: undefined
    Author: ImmuneBuilder
    Release: 2023-10-19 10:50:28
    Reference: Abanades, B., Wong, W.K., Boyles, F. et al. ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins. Commun Biol 6, 575 (2023).

    Nanobody Structure Prediction

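    Introduction

    The Nanobody Structure Prediction module predicts nanobody structures and is based on ImmuneBuilder.
    ImmuneBuilder is a set of deep learning models that can accurately predict the structures of antibodies (ABodyBuilder2), nanobodies (NanoBodyBuilder2), and T-cell receptors (TCRBuilder2).
    ImmuneBuilder generates highly accurate structures and is much faster than AlphaFold2.

    According to comparative tests run by the WECOMPUT team on recent nanobody crystal structures, NanoBodyBuilder and ESMFold outperform other well-known algorithms.

    Note: This module is only suitable for predicting variable region conformations. For full-length antibodies, antibodies with multiple variable regions, and similar cases, use Protein Structure Prediction (AlphaFold2.3.2) or Protein Structure Prediction (ESMFold) instead.

    Parameter Description

    Nanobody Sequence File

    Sequence file of the nanobody in FASTA format.

    Result Description

    The output includes:

    Output File Name Description
    firstnano.pdb Nanobody structure (the first one)
    model.tar Archive containing all predicted structures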

    References

    Abanades, B., Wong, W.K., Boyles, F. et al. ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins. Commun Biol 6, 575 (2023).

    Nanobody Structure Prediction

    Introduction

    The Nanobody Structure Prediction module predicts nanobody variable region structures and is based on ImmuneBuilder (NanoBodyBuilder2).
    ImmuneBuilder is a set of deep learning models that can accurately predict the structures of antibodies (ABodyBuilder2), nanobodies (NanoBodyBuilder2), and T cell receptors (TCRBuilder2).
    ImmuneBuilder generates highly accurate structures and is much faster than AlphaFold2.

    According to comparative tests run by the WECOMPUT team on recent nanobody crystal structures, NanoBodyBuilder and ESMFold outperform other well-known algorithms.

    Note: This module is suitable for predicting variable region conformations. For full-length antibodies or antibodies with multiple variable regions, Protein Structure Prediction (AlphaFold2.3.2) or Protein Structure Prediction (ESMFold) should be used for structure prediction.

    Parameter Description

    • Nanobody Sequence File: Sequence file of the nanobody in FASTA format.

    Result Description

    The output includes:

    Output File Name Description
    firstnano.pdb Nanobody structure (1st)
    model.tar Compressed file containing all predicted structures

    References

    Abanades, B., Wong, W.K., Boyles, F. et al. ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins. Commun Biol 6, 575 (2023).

  • Name: IgG Modeling
    Description: Models the full-length sequence of an antibody to construct the complete three-dimensional structure of an IgG, supporting both monospecific and bispecific antibodies. It automatically identifies the variable region (Fv) sequences within the full-length sequence and models them using a state-of-the-art method (currently ESMFold). The remaining parts of the IgG, including the Fc and linker, are built by homology modeling under spatial restraints, using the crystal structures of known full-length antibodies as templates. This approach yields better results than directly predicting the complete IgG structure using methods like AF2.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-09-23 00:00:00
    Reference:

    IgG Modeling

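    Introduction

    IgG Modeling models the full-length sequence of an antibody to build the complete three-dimensional structure of an IgG, and supports both monospecific and bispecific antibodies.
    It automatically identifies the variable region (Fv) sequences within the full-length sequence and models them with a state-of-the-art method (currently ESMFold). The rest of the IgG, including the Fc and linker, is built by homology modeling under spatial restraints, using crystal structures of known full-length antibodies as templates. This gives better results than predicting the complete IgG structure directly with methods such as AF2.

    Parameter Description

    Heavy Chain 1 Sequence

    Sequence of the first heavy chain of the antibody.

    Light Chain 1 Sequence

    Sequence of the first light chain of the antibody.

    Heavy Chain 2 Sequence

    Sequence of the second heavy chain of the antibody. Optional; enter it only when modeling a bispecific antibody.

    Light Chain 2 Sequence

    Sequence of the second light chain of the antibody. Optional; enter it only when modeling a bispecific antibody.

    Isotype

    IgG subtype; IgG1 and IgG4 are currently supported.
    Note:
    1) For a monospecific antibody, only H1 and L1 need to be entered. H2 is taken to be the same as H1 and L2 the same as L1, so the final model contains two identical heavy chains and two identical light chains.
    2) For a bispecific antibody, the sequences of all four chains must be entered. The final model contains two different heavy chains and two different light chains.

    Result Description

    The output includes:

    Output File Name Description
    antibody_001.pdb-antibody_003.pdb Three full-length antibody structures
    scores.csv Scores of the full-length antibody structures. MolPDF is the Modeller assessment score; the lower the score, the more the model is recommended.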

    References

    • Webb B, Sali A. Comparative Protein Structure Modeling Using MODELLER. Curr Protoc Bioinformatics. 2016 Jun 20;54:5.6.1-5.6.37.
    • Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291-325.
    • Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993 Dec 5;234(3):779-815.
    • Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000 Sep;9(9):1753-73.
    • Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130.

    IgG Modeling

    Introduction

    IgG Modeling is used to model the full-length sequence of antibodies to construct the complete three-dimensional structure of antibody IgG, supporting both monospecific and bispecific antibodies. It automatically identifies the variable region (Fv) sequence in the full-length sequence and models it using state-of-the-art methods (currently ESMFold). The remaining parts of IgG, including Fc and linker, are modeled homologously based on the crystal structure of known full-length antibodies as templates, using spatial constraints, which yields better results compared to directly predicting the complete IgG structure using methods like AF2.

    Parameter Description

    • Heavy Chain 1 Sequence: Sequence of the first heavy chain of the antibody.
    • Light Chain 1 Sequence: Sequence of the first light chain of the antibody.
    • Heavy Chain 2 Sequence: Sequence of the second heavy chain of the antibody, optional, only required for bispecific antibody modeling.
    • Light Chain 2 Sequence: Sequence of the second light chain of the antibody, optional, only required for bispecific antibody modeling.
    • Isotype: IgG subtype, currently supporting IgG1 and IgG4.
      Note:
    1. When modeling a monospecific antibody, only the sequences for H1 and L1 need to be provided. H1 is the same as H2, and L1 is the same as L2, resulting in a model containing two identical heavy chains and two identical light chains.
    2. When modeling a bispecific antibody, sequences for all four chains need to be provided, resulting in a model containing two different heavy chains and two different light chains.

    Result

    The output includes:

    Output File Name Description
    antibody_001.pdb-antibody_003.pdb Structures of three full-length antibodies
    scores.csv Scoring of the full-length antibody structures, where MolPDF is the evaluation score from Modeller; a smaller score is more favorable.
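
    Since a smaller MolPDF is more favorable, choosing among the three output models comes down to taking the row with the minimum score. The sketch below shows one way to do that with the Python standard library. The column names "Model" and "MolPDF" and the score values are assumptions for illustration; check the header of the actual scores.csv before using them.

```python
import csv
import io


def pick_best_model(csv_text, name_column="Model", score_column="MolPDF"):
    """Return (model name, score) for the row with the lowest MolPDF."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no rows found in the score table")
    best = min(rows, key=lambda row: float(row[score_column]))
    return best[name_column], float(best[score_column])


# Made-up content standing in for scores.csv.
example_scores = """Model,MolPDF
antibody_001.pdb,8123.5
antibody_002.pdb,7950.2
antibody_003.pdb,8340.9
"""
print(pick_best_model(example_scores))
```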

    References

    • Webb B, Sali A. Comparative Protein Structure Modeling Using MODELLER. Curr Protoc Bioinformatics. 2016 Jun 20;54:5.6.1-5.6.37.
    • Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291-325.
    • Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993 Dec 5;234(3):779-815.
    • Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000 Sep;9(9):1753-73.
    • Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130.
  • Name: Substructure Search
    Description: Substructure Search is a tool for searching a small molecule library for molecules that contain a specified substructure and writing the matched molecules to an SDF file. Substructure search is a common operation in cheminformatics research and can also be used for virtual screening, to find molecules containing specific functional fragments in commercial small molecule libraries for subsequent experimental validation.
    Tags: undefined
    Author: Manish Sud
    Release: 2023-09-21 10:07:46
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Substructure Search

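    Introduction

    The Substructure Search module searches a compound library for molecules that contain a specified substructure and writes them to an SDF file. Substructure search is a common operation in cheminformatics research and can also be used for virtual screening, to find molecules containing specific functional fragments in commercial small molecule libraries for subsequent experimental validation.

    Parameter Description

    Search by uploaded substructure file: File Search

    Substructure File

    File containing the substructure to search for, in SDF or SMI format.

    Draw the substructure in WeDraw: Draw

    Substructure File

    Draw the template small molecule in the WeDraw interface. Only a single small molecule is allowed.

    Search by SMILES string: Smiles Search

    Substructure Smiles

    SMILES string of the substructure to search for, for example
    c1ccccc1
    CC(N)=O

    Public Library

    Select the molecular library to search. The module provides 16 public molecular databases for substructure searching:

    1. Analyticon: ~40,000 in-stock molecules. A German natural product brand focused on natural product extraction and analog synthesis, with stable product quality.
    2. Asinex: ~520,000 in-stock molecules. A US brand that has developed and supplied lead-like compounds and building blocks for over 20 years; relatively expensive.
    3. Bionet: ~230,000 in-stock molecules. A UK brand with over 20 years of organic synthesis experience.
    4. Chembridge: ~1.56 million in-stock molecules. A US compound brand headquartered in San Diego, offering popular compound libraries such as diversity and macrocycle libraries.
    5. Chemdiv: ~1.6 million in-stock molecules. One of the world's largest compound brands, with over 5,000 compound scaffolds and over 100 compound libraries; good value for money.
    6. Enamine: ~2.73 million in-stock molecules. A Ukrainian compound brand with strong compound R&D capabilities, offering both cost-effective and high-value compounds.
    7. Eximed: ~60,000 in-stock molecules. A Ukrainian compound brand that has provided high-throughput screening compounds and related services for nearly 20 years.
    8. HTS_Biochemie_Innovationen: ~60,000 in-stock molecules. A German compound brand that develops unique compounds for pharmaceutical, agricultural, and biotechnology companies.
    9. IBScreen: ~480,000 in-stock molecules. A Russian compound brand offering a variety of natural products and derivatives.
    10. Life_Chemicals: ~500,000 in-stock molecules. A Canadian compound brand with over 2,900 compound scaffolds, a full range of compound specifications, and listed prices.
    11. Maybridge: ~50,000 in-stock molecules. A UK compound brand owned by Thermofisher, with a small, specialized catalog and a large stock of each product.
    12. Otava: ~270,000 in-stock molecules. A Canadian compound brand specializing in the development and production of specialty compounds, biochemicals, and bioanalytical reagents.
    13. Princeton: ~1.53 million in-stock molecules. A US compound brand that has designed unique small molecules for drug development for over 20 years.
    14. Specs: ~210,000 in-stock molecules. A Dutch compound brand with a clear price advantage.
    15. UORSY: ~680,000 in-stock molecules. A Ukrainian compound brand whose products are mainly used for high-throughput screening and drug discovery; prices are close to Enamine's.
    16. Vitas-m: ~1.4 million in-stock molecules. A US compound brand with a shipping center in Hong Kong, fast delivery, and moderate prices.

    Note: Choose either Public Library or Private Library.

    Private Library

    Personal molecular library to search; only SDF format is supported.
    Note: Choose either Public Library or Private Library.

    Output File

    Name of the output file; the default is matched_molecules.sdf.

    Result Description

    The result file, matched_molecules.sdf, contains the compounds in the library that contain the substructure.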

    References

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Substructure Search

    Introduction

    The Substructure Search module is a tool for searching for specific substructures within a compound library and outputting them to an SDF file. Substructure searching is a common operation in cheminformatics research and can be used for virtual screening to identify molecules in commercial small molecule libraries containing specific functional fragments for subsequent experimental validation.

    Parameter Description

    File Search for Substructure Search

    Substructure File

    File containing the substructure to search for, in SDF or SMI format.

    Draw for Substructure Search

    Substructure File

    Draw a template small molecule using the WeDraw interface, allowing only a single small molecule.

    Smiles Search for Substructure Search

    Substructure Smiles

    SMILES string of the substructure to search for, for example:
    c1ccccc1
    CC(N)=O

    Public Library

    Select the public molecular library to search. The module provides 16 public molecular databases for substructure searching:

    1. Analyticon: ~40,000 in-stock molecules. A German natural product brand focused on natural product extraction and analog synthesis, with stable product quality.
    2. Asinex: ~520,000 in-stock molecules. A US brand that has developed and supplied lead-like compounds and building blocks for over 20 years; relatively expensive.
    3. Bionet: ~230,000 in-stock molecules. A UK brand with over 20 years of organic synthesis experience.
    4. Chembridge: ~1.56 million in-stock molecules. A US compound brand headquartered in San Diego, offering popular compound libraries such as diversity and macrocycle libraries.
    5. Chemdiv: ~1.6 million in-stock molecules. One of the world's largest compound brands, with over 5,000 compound scaffolds and over 100 compound libraries; good value for money.
    6. Enamine: ~2.73 million in-stock molecules. A Ukrainian compound brand with strong compound R&D capabilities, offering both cost-effective and high-value compounds.
    7. Eximed: ~60,000 in-stock molecules. A Ukrainian compound brand that has provided high-throughput screening compounds and related services for nearly 20 years.
    8. HTS_Biochemie_Innovationen: ~60,000 in-stock molecules. A German compound brand that develops unique compounds for pharmaceutical, agricultural, and biotechnology companies.
    9. IBScreen: ~480,000 in-stock molecules. A Russian compound brand offering a variety of natural products and derivatives.
    10. Life_Chemicals: ~500,000 in-stock molecules. A Canadian compound brand with over 2,900 compound scaffolds, a full range of compound specifications, and listed prices.
    11. Maybridge: ~50,000 in-stock molecules. A UK compound brand owned by Thermofisher, with a small, specialized catalog and a large stock of each product.
    12. Otava: ~270,000 in-stock molecules. A Canadian compound brand specializing in the development and production of specialty compounds, biochemicals, and bioanalytical reagents.
    13. Princeton: ~1.53 million in-stock molecules. A US compound brand that has designed unique small molecules for drug development for over 20 years.
    14. Specs: ~210,000 in-stock molecules. A Dutch compound brand with a clear price advantage.
    15. UORSY: ~680,000 in-stock molecules. A Ukrainian compound brand whose products are mainly used for high-throughput screening and drug discovery; prices are close to Enamine's.
    16. Vitas-m: ~1.4 million in-stock molecules. A US compound brand with a shipping center in Hong Kong, fast delivery, and moderate prices.
      Note: Choose either Public Library or Private Library.

    Private Library

    Personal molecular library for searching, supporting SDF format.
    Note: Choose either Public Library or Private Library.

    Output File

    Name of the output file, default is matched_molecules.sdf.

    Result Description

    The result file contains compounds from the compound library that contain the specified substructure, saved as matched_molecules.sdf.

    References

    • Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
  • Name: Small Molecule Minimization
    Description: Small Molecule Minimization performs energy minimization on small molecule structures and generates the optimized 3D structures. The UFF or MMFF molecular force field can be used for energy minimization. The conformation sampling methods SDG, ETDG, KDG, and ETKDG can be used to generate the initial 3D coordinates.
    Tags: undefined
    Author: Manish Sud
    Release: 2023-09-15 14:38:46
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. Riniker, S.; Landrum, G. A. Better informed distance geometry: Using what we know to improve conformation generation. JCIM. 2015, 55, 2562-2574. Rappe, A.K.; Casewit, C.J.; Colwell, K.S.; Goddard III, W.A.; Skiff, W.M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J. Am. Chem. Soc. 1992, 114, 10024-10035. Halgren, T.A.; Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of MMFF94. 1996, J. Comput. Chem., 17, 490-519. Halgren, T.A.; Merck molecular force field. V. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J. Comput. Chem. 1996, 17, 616-641.

    Small Molecule Minimization

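    Introduction

    Small Molecule Minimization performs energy minimization on small molecule structures and outputs the optimized 3D structures. It supports two molecular force fields, UFF and MMFF, and four conformation sampling methods (SDG, ETDG, KDG, and ETKDG) for generating the initial 3D conformation. Note that only the single lowest-energy conformation is output for each molecule; for conformational searches, the 3D Conf (AlphaConf) module is recommended.

    Parameter Description

    Small Molecule File

    Small molecule file; supports Mol (.mol), SD (.sdf, .sd), and SMILES (.smi .csv, .tsv, .txt).

    Output File

    Name of the output file; only SDF format is supported. The default is minimized_struture.sdf.

    Conformer Generator

    3D conformation method: SDG, ETDG, KDG, ETKDG, None.

    1. SDG: Standard Distance Geometry (SDG)
    2. ETDG: Experimental Torsion-angle preference with Distance Geometry
    3. KDG: basic Knowledge-terms with Distance Geometry
    4. ETKDG: Experimental Torsion-angle preference along with basic Knowledge-terms with Distance Geometry
    5. None: No conformer generation algorithm is used to build an initial conformation; force field optimization is run directly on the 3D conformation in the input file. Do not choose this option when the input file contains 2D structures or is in SMILES format.

    Forcefield Method

    Force field used for energy minimization: UFF (Universal Force Field) or MMFF (Merck Molecular Force Field).

    Multiprocessing

    Use parallel computing.

    Maximum Number of Iterations

    Maximum number of iterations performed for each molecule during force field optimization. The default is 500.

    Random Seed

    Random seed, used to reproduce the optimized structure.

    Result Description

    The output is the energy-minimized 3D structure file of the small molecule, minimized_struture.sdf.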

    References

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    Riniker, S.; Landrum, G. A. Better informed distance geometry: Using what we know to improve conformation generation. JCIM. 2015, 55, 2562-2574.
    Rappe, A.K.; Casewit, C.J.; Colwell, K.S.; Goddard III, W.A.; Skiff, W.M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J. Am. Chem. Soc. 1992, 114, 10024-10035.
    Halgren, T.A.; Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of MMFF94. 1996, J. Comput. Chem., 17, 490-519.
    Halgren, T.A.; Merck molecular force field. V. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J. Comput. Chem. 1996, 17, 616-641.

    Small Molecule Minimization

    Introduction

    Small Molecule Minimization is a tool module designed to perform energy minimization optimization on small molecule structures and obtain the optimized 3D structure. It supports two molecular force fields, UFF and MMFF, as well as four conformation sampling methods: SDG, ETDG, KDG, and ETKDG, used to generate initial 3D conformations. Note that only one energy-minimized conformation is output for each molecule, and for conformational search, it is recommended to use the 3D Conf (AlphaConf) module.

    Parameter Description

    Small Molecule File

    Input file for the small molecule, supporting Mol (.mol), SD (.sdf, .sd), SMILES (.smi .csv, .tsv, .txt) formats.

    Output File

    Name of the output file, only supports SDF format, default is minimized_structure.sdf.

    Conformer Generator

    3D conformation method: SDG, ETDG, KDG, ETKDG, None.

    1. SDG: Standard Distance Geometry (SDG)
    2. ETDG: Experimental Torsion-angle preference with Distance Geometry
    3. KDG: Basic Knowledge-terms with Distance Geometry
    4. ETKDG: Experimental Torsion-angle preference along with basic Knowledge-terms with Distance Geometry
    5. None: No conformer generation algorithm is used to build an initial conformation; force field optimization is run directly on the 3D conformation in the input file. Do not choose this option when the input file contains 2D structures or is in SMILES format.

    Forcefield Method

    Force field used for energy minimization: UFF (Universal Force Field) or MMFF (Merck Molecular Force Field).

    Multiprocessing

    Utilize parallel computing.

    Maximum Number of Iterations

    Maximum number of iterations performed for each molecule during force field optimization, default is 500.

    Random Seed

    Random seed, used to make the optimized structure reproducible.

    Result Description

    The output is the energy-minimized 3D structure file of the small molecule, minimized_structure.sdf.

    References

    • Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    • Riniker, S.; Landrum, G. A. Better informed distance geometry: Using what we know to improve conformation generation. JCIM. 2015, 55, 2562-2574.
    • Rappe, A.K.; Casewit, C.J.; Colwell, K.S.; Goddard III, W.A.; Skiff, W.M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J. Am. Chem. Soc. 1992, 114, 10024-10035.
    • Halgren, T.A. Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of MMFF94. J. Comput. Chem. 1996, 17, 490-519.
    • Halgren, T.A. Merck Molecular Force Field. V. Extension of MMFF94 Using Experimental Data, Additional Computational Data, and Empirical Rules. J. Comput. Chem. 1996, 17, 616-641.
  • Name: Mutation Score (v2.0)
    Description: Mutation Score is a core module of the antibody humanization design workflow and a structure-based, automated scoring module. Based on the antibody structure and the CDR-grafted sequence, it quantitatively scores how much each amino acid in the FR region changes when it is replaced during grafting. The higher the score, the greater the potential impact of the replacement on the conformation of the CDR region, indicating that a back mutation is needed. The module outputs a score for each amino acid, which is used for the subsequent grouping and generation of humanized antibody sequences in the workflow.
    Author: WECOMPUT
    Release: 2021-10-22 11:14:32
    Reference:

    Mutation Score

    简介

    Mutation Score是抗体人源化设计中核心模块,是一个基于结构的自动化评分模块。该模块基于抗体的结构信息以及CDR嫁接后的序列信息,对移植抗体(graft)后的FR区的每个氨基酸在替换前后的变化程度进行量化评分。评分越高,说明CDR嫁接过程中氨基酸的替换对CDR区的构象可能影响较大,需要进行回复突变。模块输出每个氨基酸的打分值,用于抗体人源化设计流程中后续的分组和人源化抗体序列的生成。

    参数说明

    Sequence File

    抗体Fv区序列文件,FASTA格式。

    Model File

    抗体结构文件,PDB格式。

    Grafted Sequence

    抗体CDR区Graft后的序列文件,FASTA格式。

    Output Score

    指定输出打分文件的名称,CSV格式。

    Antibody Type

    抗体类型:

    • Antibody:常规抗体
    • Nanobody:纳米抗体

    结果说明

    输出结果文件为score.csv,包含信息如下:

    字段名称 说明
    Chain 轻链或重链
    UID 为残基的标准编号(默认为 Kabat)
    Position 残基在序列中的位置
    Donor Residue 原始氨基酸
    Template Residue 人源模板的目标氨基酸
    score 回复突变智能评分,Score 越高,认为其回复突变的必要性越高。通常Score>10为高优先级,5-10为中优先级,其他为低优先级

    Mutation Score

    Introduction

    Mutation Score is a core module of the antibody humanization design workflow: a structure-based, automated scoring module. Using the antibody structure and the CDR-grafted sequence, it quantifies, for each amino acid in the framework (FR) region, how much the residue changes when it is replaced during grafting. A higher score indicates that the replacement is more likely to affect the conformation of the CDR region, so a back mutation should be considered. The module outputs a score for each amino acid, which is used downstream for grouping and for generating humanized antibody sequences.

    Parameter Description

    Sequence File

    Sequence file of the antibody Fv region in FASTA format.

    Model File

    Antibody structure file in PDB format.

    Grafted Sequence

    Sequence file of the antibody after CDR grafting, in FASTA format.

    Output Score

    Specify the name of the output scoring file in CSV format.

    Antibody Type

    Type of antibody:

    • Antibody: Conventional antibody
    • Nanobody: Nanobody

    Result Description

    The output result file is named score.csv and includes the following information:

    Field Name Description
    Chain Light chain or heavy chain
    UID Standard numbering for residues (default is Kabat)
    Position Position of the residue in the sequence
    Donor Residue Original amino acid
    Template Residue Target amino acid from the human template
    Score Back-mutation score; the higher the score, the more necessary a back mutation is considered to be. Typically, Score > 10 is high priority, 5-10 is medium priority, and anything lower is low priority.
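
    The priority bands quoted for Score can be applied to score.csv with a few lines of Python. This is a minimal sketch, not part of the module: it assumes the CSV header uses exactly the field names listed above (in particular a column named Score), and the rows in EXAMPLE are invented for illustration.

```python
import csv
import io


def priority(score: float) -> str:
    """Map a score to the priority bands quoted in the field description."""
    if score > 10:
        return "high"
    if score >= 5:
        return "medium"
    return "low"


def rank_back_mutations(csv_text: str) -> list:
    """Return rows sorted by descending score, each tagged with a priority."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["Priority"] = priority(float(row["Score"]))
    return sorted(rows, key=lambda r: float(r["Score"]), reverse=True)


# Hypothetical file content; real files are produced by the module.
EXAMPLE = """Chain,UID,Position,Donor Residue,Template Residue,Score
L,L36,35,Y,F,1.2
H,H71,72,R,V,12.4
H,H48,48,I,V,6.0
"""

for row in rank_back_mutations(EXAMPLE):
    print(row["Chain"], row["UID"], row["Score"], row["Priority"])
```

    If the actual header spells the column differently (for example score in lower case), only the dictionary key needs to change.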
  • Name: PDB ReNumbering
    Description: A tool module that renumbers protein residues and supports renumbering antibody structures with the kabat, imgt, and chothia schemes. It takes a protein structure PDB file as input and outputs a renumbered PDB file.
    Author: WECOMPUT
    Release: 2023-09-19 00:00:00
    Reference:

    PDB ReNumbering

    简介

    PDB ReNumbering是针对蛋白残基重新编号的工具模块,同时支持抗体kabat,imgt以及chothia的重编号。输入蛋白结构PDB文件,输出重新编号后的PDB文件。

    参数说明

    Protein Structure File

    输入蛋白结构文件,PDB格式。

    Renumbering Type

    重编号类型,支持指定链从指定数字开始编号,同时支持抗体结构重新编号。
    numeric:氨基酸序号重编号
    kabat:抗体kabat编号规则重编号
    imgt:抗体imgt编号规则重编号
    chothia:抗体chothia编号规则重编号

    Chain Name

    链名,指定具体的链名进行重编号操作。

    Start

    针对氨基酸序号重编号,指定起始编号数字。

    Output File

    重编号后的文件名称。

    结果说明

    重编号后的结构文件名称,默认输出renumbering.pdb。
    注意:如果输入是抗体结构,输出结构中重链的链名会自动改为H,轻链链名会改为L。

    PDB ReNumbering

    Introduction

    PDB ReNumbering is a tool module for renumbering protein residues. It also supports renumbering antibodies according to the kabat, imgt, and chothia numbering schemes. It takes a protein structure file in PDB format as input and outputs the renumbered PDB file.

    Parameter Description

    Protein Structure File

    Input protein structure file in PDB format.

    Renumbering Type

    Renumbering type. Numeric renumbering restarts a specified chain from a specified number; antibody structures can also be renumbered with an antibody numbering scheme.

    • numeric: Renumber amino acid residues numerically.
    • kabat: Renumber according to the kabat antibody numbering scheme.
    • imgt: Renumber according to the imgt antibody numbering scheme.
    • chothia: Renumber according to the chothia antibody numbering scheme.

    Chain Name

    Chain name, specifies the chain to perform renumbering.

    Start

    For renumbering amino acid residues numerically, specifies the starting number.

    Output File

    Name of the renumbered file.

    Result Description

    The renumbered structure file is named renumbering.pdb by default.
    Note: If the input is an antibody structure, the chain names in the output structure will be automatically changed to H for the heavy chain and L for the light chain.
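
    The numeric renumbering type can be illustrated with a short Python sketch. It is not the module's implementation: it assumes standard fixed-column PDB records (chain ID in column 22, residue number in columns 23-26, insertion code in column 27), handles only ATOM and HETATM records, and does not cover the antibody schemes. Both function names are hypothetical.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Build a fixed-column PDB ATOM record (hypothetical helper for the demo)."""
    return (f"ATOM  {serial:>5d} {name:<4s} {resname:>3s} {chain}{resseq:>4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def renumber_chain(pdb_lines, chain, start=1):
    """Renumber the residues of one chain sequentially, beginning at `start`."""
    out = []
    current_key = None
    new_number = start - 1
    for line in pdb_lines:
        is_atom = line.startswith(("ATOM", "HETATM"))
        if is_atom and len(line) >= 27 and line[21] == chain:
            key = line[22:27]  # original residue number plus insertion code
            if key != current_key:
                current_key = key
                new_number += 1
            # Rewrite columns 23-26 and blank the insertion code in column 27.
            line = f"{line[:22]}{new_number:>4d} {line[27:]}"
        out.append(line)
    return out


demo = [
    atom_line(1, " N", "GLY", "A", 10, 11.104, 6.134, -6.504),
    atom_line(2, " CA", "GLY", "A", 10, 11.639, 6.071, -5.147),
    atom_line(3, " N", "ALA", "A", 11, 12.300, 7.210, -4.480),
    atom_line(4, " N", "SER", "B", 5, 20.000, 1.000, 2.000),
]
for line in renumber_chain(demo, chain="A", start=100):
    print(line)
```

    Only chain A is touched here; residues 10 and 11 become 100 and 101, and chain B keeps its original numbering.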

  • Name: AC2SDF
    Description: Converts the compressed binary conformation file (AC.GZ) generated by the AlphaConf module into an SDF file for easier viewing.
    Author: WECOMPUT
    Release: 2023-08-23 00:00:00
    Reference:

    AC2SDF

    简介

    AC2SDF模块是一个格式转换工具,用于将AlphaConf模块生成的构象压缩二进制文件AC.GZ转为便于查看结构的SDF文件。

    参数说明

    Conformation Library (AC)

    输入构象文件,AC.GZ格式,由AlphaConf模块生成

    Fragment Library

    片段库文件,AUX.GZ格式,由AlphaConf模块生成

    SDF File

    转换生成的SDF文件名称

    结果说明

    输出文件名称 说明
    ligands_confs.sd 转换生成的SDF文件,可通过WeView直接查看构象

    AC2SDF

    Introduction

    The AC2SDF module is a format conversion tool used to convert the compressed binary conformation file AC.GZ generated by the AlphaConf module into an SDF file for easier visualization of the structure.

    Parameter Description

    Conformation Library (AC)

    Input conformation file in AC.GZ format generated by the AlphaConf module.

    Fragment Library

    Fragment library file in AUX.GZ format generated by the AlphaConf module.

    SDF File

    Name of the converted SDF file.

    Result Description

    Output File Name Description
    ligands_confs.sd Converted SDF file that can be viewed directly using WeView for conformation visualization.
  • Name: Sequence Mutation
    Description: Sequence Mutation is a protein sequence mutation module used to generate mutated sequences in bulk for specific sites. Mutation strategies include position-based mutations, homology-based mutations, mutations based on antibody CDR regions, and mutations based on both antibody CDR regions and homology. Supported mutation types are alanine mutation, histidine mutation, and saturation mutation.
    Author: WECOMPUT
    Release: 2023-08-22 00:00:00
    Reference:

    Sequence Mutation

    简介

    Sequence Mutation是蛋白序列突变模块,用于针对特定位点批量生成突变序列,支持多样的突变策略,包括设定不同的突变位置及突变类型。

    突变策略包括:

    • 基于指定位置的突变
    • 基于同源序列的突变
    • 基于抗体CDR区的突变
    • 基于抗体CDR区和同源性的突变

    突变类型支持:

    • 丙氨酸突变
    • 组氨酸突变
    • 饱和突变
    • 同源突变(同源序列中的进化突变)

    参数说明:基于位置的突变

    Protein Sequence

    蛋白原始序列或者fasta格式的序列

    Mutation Location

    突变位点,支持多个位点,英文逗号分割,例如:2,3

    Mutation Type

    突变类型,支持三种类型:Ala 丙氨酸突变,His 组氨酸突变,Sat 饱和突变

    Chain Name

    链名,输出突变信息时加上指定链名

    Mutants Sequences

    生成突变序列的文件名称,FASTA格式

    Mutation Policy

    蛋白突变信息文件,TXT格式

    参数说明:基于同源序列的突变

    Protein Sequence

    蛋白原始序列或者fasta格式的序列

    Homologous Sequences

    同源序列,一般由序列比对产生的结果文件,FASTA 格式

    Alignment Methods

    序列比对的方法,mafft或者muscle

    Frequency Cutoff

    频数截断值,大于截断值的氨基酸才会选择作为突变目标

    Chain Name

    链名,输出突变信息时加上指定链名

    Mutants Sequences

    生成突变序列的文件名称,FASTA格式

    Mutation Policy

    蛋白突变信息文件,TXT格式

    参数说明:基于抗体CDR区的突变

    Antibody Sequence

    蛋白原始序列或者fasta格式的序列

    Antibody Numbering

    抗体CDR编号规则:kabat, imgt, chothia

    Mutation Type

    突变类型,支持三种类型:Ala 丙氨酸突变,His 组氨酸突变,Sat 饱和突变

    Chain Name

    链名,输出突变信息时加上指定链名

    Mutants Sequences

    生成的包含蛋白突变序列的文件名称,FASTA格式

    Mutation Policy

    生成的包含蛋白突变信息的文件名称,TXT格式

    参数说明:基于抗体CDR区及同源性的突变

    Antibody Sequence

    蛋白原始序列或者fasta格式的序列

    Antibody Numbering

    抗体CDR编号规则:kabat, imgt, chothia

    Homologous Sequences

    同源序列,一般由序列比对产生的结果文件,FASTA 格式

    Alignment Methods

    序列比对的方法,mafft或者muscle

    Frequency Cutoff

    频数截断值,大于截断值的氨基酸才会选择作为突变目标

    Chain Name

    链名,输出突变信息时加上指定链名

    Mutants Sequences

    生成的包含蛋白突变序列的文件名称,FASTA格式

    Mutation Policy

    生成的包含蛋白突变信息的文件名称,TXT格式

    结果说明

    输出文件名称 说明
    mutants.fasta 生成突变序列的文件名称,FASTA格式
    mutations.txt 蛋白突变信息文件,TXT格式,每行一个突变记录,例如:Q2A 代表第2位氨基酸Q突变为氨基酸A

    Sequence Mutation

    Introduction

    Sequence Mutation is a protein sequence mutation module that allows for batch generation of mutated sequences at specific positions, supporting various mutation strategies including setting different mutation positions and types.

    Mutation strategies include:

    • Position-based mutations
    • Homologous sequence-based mutations
    • Antibody CDR region mutations
    • Antibody CDR region and homology-based mutations

    Supported mutation types include:

    • Alanine mutations
    • Histidine mutations
    • Saturation mutations
    • Homologous mutations (evolutionary mutations from homologous sequences)

    Parameter Description: Position-based Mutations

    Protein Sequence

    Original protein sequence or sequence in FASTA format.

    Mutation Location

    Mutation positions, support for multiple positions separated by commas, e.g., 2,3.

    Mutation Type

    Mutation types, supporting three types: Ala (Alanine mutation), His (Histidine mutation), Sat (Saturation mutation).

    Chain Name

    Chain name to be included in the mutation information output.

    Mutants Sequences

    File name for generated mutated sequences in FASTA format.

    Mutation Policy

    Protein mutation information file in TXT format.
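
    A position-based run can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the module's code: it generates single-site mutants only (whether the module also combines positions is not stated), uses the 20 standard amino acids for Sat, and the chain-prefixed record format is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Replacement residues tried for each Mutation Type value.
CANDIDATES = {"Ala": "A", "His": "H", "Sat": AMINO_ACIDS}


def position_mutants(sequence, locations, mutation_type, chain=""):
    """Return (record, mutant_sequence) pairs for single-site substitutions."""
    if mutation_type not in CANDIDATES:
        raise ValueError(f"unknown mutation type: {mutation_type}")
    results = []
    for token in locations.split(","):
        pos = int(token)  # positions are 1-based, as in the record "Q2A"
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"position {pos} is outside the sequence")
        wild = sequence[pos - 1]
        for target in CANDIDATES[mutation_type]:
            if target == wild:
                continue  # skip substitutions that change nothing
            record = f"{wild}{pos}{target}"
            if chain:
                record = f"{chain}:{record}"  # assumed prefix format
            mutant = sequence[:pos - 1] + target + sequence[pos:]
            results.append((record, mutant))
    return results


for record, mutant in position_mutants("MQVK", "2,3", "Ala"):
    print(record, mutant)
```

    With Sat, each position yields 19 mutants, so the two positions above would produce 38 sequences.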

    Parameter Description: Homologous Sequence-based Mutations

    Protein Sequence

    Original protein sequence or sequence in FASTA format.

    Homologous Sequences

    Homologous sequences, typically generated from sequence alignment results in FASTA format.

    Alignment Methods

    Alignment methods for sequence alignment: mafft or muscle.

    Frequency Cutoff

    Frequency cutoff value, only amino acids with frequencies greater than the cutoff value will be selected as mutation targets.

    Chain Name

    Chain name to be included in the mutation information output.

    Mutants Sequences

    File name for generated mutated sequences in FASTA format.

    Mutation Policy

    Protein mutation information file in TXT format.
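
    The role of Frequency Cutoff can be sketched as follows. This is a simplified illustration, not the module's algorithm: it assumes the homologous sequences are already aligned to the query column by column (same length, gaps written as "-"), and it interprets the cutoff as a raw count of occurrences. The function name is hypothetical.

```python
from collections import Counter


def homolog_targets(query, aligned_homologs, cutoff):
    """Map 1-based positions to residues seen more than `cutoff` times."""
    if any(len(seq) != len(query) for seq in aligned_homologs):
        raise ValueError("all aligned sequences must match the query length")
    suggestions = {}
    for i, wild in enumerate(query):
        # Count the residues observed in this alignment column, ignoring gaps.
        counts = Counter(seq[i] for seq in aligned_homologs if seq[i] != "-")
        picks = sorted(aa for aa, n in counts.items() if n > cutoff and aa != wild)
        if picks:
            suggestions[i + 1] = picks
    return suggestions


homologs = ["MEVK", "MEIK", "MQIR", "MEVK"]
print(homolog_targets("MQVK", homologs, cutoff=1))
```

    In this toy alignment, E at position 2 and I at position 3 occur more than once and differ from the query, so they are kept; R at position 4 occurs only once and is filtered out.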

    Parameter Description: Antibody CDR region Mutations

    Antibody Sequence

    Original protein sequence or sequence in FASTA format.

    Antibody Numbering

    Antibody CDR numbering rule: kabat, imgt, chothia.

    Mutation Type

    Mutation types, supporting three types: Ala (Alanine mutation), His (Histidine mutation), Sat (Saturation mutation).

    Chain Name

    Chain name to be included in the mutation information output.

    Mutants Sequences

    File name for generated mutated protein sequences in FASTA format.

    Mutation Policy

    File name for generated protein mutation information in TXT format.

    Parameter Description: Antibody CDR region and Homology-based Mutations

    Antibody Sequence

    Original protein sequence or sequence in FASTA format.

    Antibody Numbering

    Antibody CDR numbering rule: kabat, imgt, chothia.

    Homologous Sequences

    Homologous sequences, typically generated from sequence alignment results in FASTA format.

    Alignment Methods

    Alignment methods for sequence alignment: mafft or muscle.

    Frequency Cutoff

    Frequency cutoff value, only amino acids with frequencies greater than the cutoff value will be selected as mutation targets.

    Chain Name

    Chain name to be included in the mutation information output.

    Mutants Sequences

    File name for generated mutated protein sequences in FASTA format.

    Mutation Policy

    File name for generated protein mutation information in TXT format.

    Result Description

    Output File Name Description
    mutants.fasta File name for generated mutated sequences in FASTA format.
    mutations.txt Protein mutation information file in TXT format, with each line representing a mutation record, e.g., Q2A represents the mutation of amino acid Q at position 2 to amino acid A.
  • Name: Interaction Auto Plot
    Description: Interaction Auto Plot uses PyMOL to draw protein-protein and protein-small molecule interaction diagrams.
    Author: WECOMPUT
    Release: 2023-08-22 15:22:57
    Reference: https://pymol.org/

    Interaction Auto Plot

    简介

    Interaction Auto Plot是基于Pymol绘制蛋白-蛋白、蛋白-小分子相互作用图。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    Structure PDB File

    复合物结构文件,PDB格式。

    Frame for Plot

    选择输入路径文件(Path File)类型:

    1. last_frame表示Path File为GMX MD Run (GMX2023)模块模拟后的路径。
    2. cluster_frame表示Path File为MD Cluster模块运行后的路径。

    Task

    相互作用分析类型:
    1.protein_protein是分析蛋白-蛋白相互作用。
    2.protein_ligand是分析蛋白-小分子相互作用。

    Chain

    分析相互作用的两条链,例如A,B,两条链之间用逗号隔开。仅当Task为protein_protein时,该值生效。

    Contact List

    自定义相互作用的氢键和盐桥则上传Excel文件,csv或者xlsx格式。仅当Task为protein_protein时,该值生效。

    结果说明

    输出结果包括:

    输出文件名称 说明
    file.png 生成的相互作用图
    file.pdb 生成的用于作图的pdb文件
    file.pse Pymol的pse文件,可导入Pymol软件自行根据喜好调整颜色、字体、视角等。

    Interaction Auto Plot

    Introduction

    Interaction Auto Plot is a tool for generating protein-protein or protein-small molecule interaction plots using PyMOL.

    Parameter Description

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD (GMX2023) module.

    Structure PDB File

    Complex structure file in PDB format.

    Frame for Plot

    Select the type of input path file (Path File):

    1. last_frame indicates the Path File is from the output of GMX MD Run (GMX2023) module simulation.
    2. cluster_frame indicates the Path File is from the output of MD Cluster module.

    Task

    Type of interaction analysis:

    1. protein_protein for analyzing protein-protein interactions.
    2. protein_ligand for analyzing protein-small molecule interactions.

    Chain

    Specify the two chains for interaction analysis, e.g., A,B. Separate the two chains by a comma. This parameter is only effective when Task is set to protein_protein.

    Contact List

    To plot custom hydrogen bonds and salt bridges, upload a spreadsheet file (csv or xlsx format) that lists them. This parameter is only effective when Task is set to protein_protein.

    Result Description

    The output includes:

    Output File Name Description
    file.png Generated interaction plot.
    file.pdb PDB file generated for plotting.
    file.pse PyMOL session file (pse), which can be opened in PyMOL to adjust colors, fonts, viewing angle, and so on.
  • Name: MD Distance
    Description: MD Distance is a distance analysis module for molecular dynamics trajectories. It outputs how the distance between two groups (center-of-mass distance or geometric-center distance) changes over time.
    Author: WECOMPUT
    Release: 2023-08-22 09:35:48
    Reference:

    MD Distance

    简介

    MD Distance是针对分子动力学轨迹的距离分析模块,输出两个组之间距离 (质心距离或几何中心距离) 随时间的变化情况。自定义组别时需要注意,如果只需测量两个原子之间的距离则填写Custom Atom1和Custom Atom2即可;当同时填写Custom Resid1和Custom Atom1时,组别1的原子数是Custom Atom1与Custom Resid1交集的原子。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group1

    选择需要计算的组别1:Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    System Group2

    Select group 2 for the calculation: Protein, DNA, or RNA.
    For a small molecule, enter its name as it appears in the PDB file.

    Custom Resid1

    自定义需要计算的组1残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Atom1

    自定义需要计算的组1原子编号,连续参数可用“-”表示,不连续原子用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Resid2

    自定义需要计算的组2残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Atom2

    自定义需要计算的组2原子编号,连续参数可用“-”表示,不连续原子用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    dist.csv 距离分析CSV文件
    dist.xvg 距离分析XVG文件
    dist.png 距离分析PNG文件

    其中dist.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Distance (nm) 组别之间的距离

    MD Distance

    Introduction

    MD Distance is a distance analysis module for molecular dynamics trajectories. It reports how the distance between two groups (center-of-mass distance or geometric-center distance) changes over time. When defining custom groups, note the following: to measure the distance between two atoms, fill in only Custom Atom1 and Custom Atom2. When both Custom Resid1 and Custom Atom1 are filled in, group 1 consists of the atoms that belong to both selections (the intersection of Custom Atom1 and Custom Resid1).

    Parameter Description

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD (GMX2023) module.

    System Group1

    Select group 1 for the calculation: Protein, DNA, or RNA.
    For a small molecule, enter its name as it appears in the PDB file.

    System Group2

    Select group 2 for the calculation: Protein, DNA, or RNA.
    For a small molecule, enter its name as it appears in the PDB file.

    Custom Resid1

    Residue numbers that define custom group 1. Use “-” for a consecutive range and commas between non-consecutive residues, for example: 1-10,15. The group is the union of the specified residues.

    Custom Atom1

    Atom numbers that define custom group 1. Use “-” for a consecutive range and commas between non-consecutive atoms, for example: 1-10,15. The group is the union of the specified atoms.

    Custom Resid2

    Residue numbers that define custom group 2. Use “-” for a consecutive range and commas between non-consecutive residues, for example: 1-10,15. The group is the union of the specified residues.

    Custom Atom2

    Atom numbers that define custom group 2. Use “-” for a consecutive range and commas between non-consecutive atoms, for example: 1-10,15. The group is the union of the specified atoms.
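
    The selection syntax and the intersection rule described in the introduction can be sketched in Python. This is an illustrative reading of the documented behavior, not the module's code; the atom_to_residue mapping and both function names are hypothetical.

```python
def parse_selection(spec):
    """Expand a string such as '1-10,15' into a set of integers."""
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            selected.update(range(int(low), int(high) + 1))
        else:
            selected.add(int(part))
    return selected


def group_atoms(atom_to_residue, resid_spec="", atom_spec=""):
    """Resolve one custom group to a set of atom numbers.

    atom_to_residue maps atom number -> residue number (hypothetical input).
    When both specs are given, the result is their intersection.
    """
    selections = []
    if resid_spec.strip():
        residues = parse_selection(resid_spec)
        selections.append({a for a, r in atom_to_residue.items() if r in residues})
    if atom_spec.strip():
        selections.append(parse_selection(atom_spec) & set(atom_to_residue))
    if not selections:
        return set()
    return set.intersection(*selections)


atoms = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}  # five atoms spread over three residues
print(sorted(group_atoms(atoms, resid_spec="1-2", atom_spec="2-5")))
```

    Residues 1-2 cover atoms 1 to 4 and the atom range covers atoms 2 to 5, so the group resolves to the atoms common to both.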

    Skip Time (ns)

    Time interval between analyzed frames (in ns).

    Result Description

    The output includes:

    Output File Name Description
    dist.csv Distance analysis CSV file
    dist.xvg Distance analysis XVG file
    dist.png Distance analysis PNG file

    The dist.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Distance (nm) Distance between the groups
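
    Once dist.csv is available, a quick summary can be computed with the Python standard library. The sketch below assumes the header row uses exactly the two field names listed above; the sample values are invented.

```python
import csv
import io
import statistics


def summarize_distance(csv_text):
    """Return frame count, mean distance, and the largest distance with its time."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    times = [float(r["Time (ns)"]) for r in rows]
    dist = [float(r["Distance (nm)"]) for r in rows]
    i_max = max(range(len(dist)), key=dist.__getitem__)
    return {
        "frames": len(dist),
        "mean_nm": statistics.fmean(dist),
        "max_nm": dist[i_max],
        "time_of_max_ns": times[i_max],
    }


# Hypothetical file content for illustration only.
SAMPLE = """Time (ns),Distance (nm)
0.0,1.20
0.1,1.30
0.2,1.10
"""

print(summarize_distance(SAMPLE))
```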
  • Name: Peptide VS
    Description: This module integrates AutoDock Vina and AutoDock CrankPep for protein-peptide docking, predicting the conformation of the protein-peptide complex and providing the docking energy and binding affinity.
    Author: WECOMPUT
    Release: 2023-08-24 14:37:51
    Reference: J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling. O. Trott, A. J. Olson, AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading, Journal of Computational Chemistry 31 (2010) 455-461 Podtelezhnikov, A.A., Wild, D.L. CRANKITE: A fast polypeptide backbone conformation sampler. Source Code Biol Med 3, 12 (2008).

    Peptide VS

    简介

    Peptide VS模块集成了AutoDock Vina与AutoDock CrankPep进行蛋白-多肽的分子对接,从而预测蛋白-多肽的构象、得到分子对接的能量以及结合亲和力。AutoDock CrankPep则是一个专门用于多肽对接工具,其基于蛋白折叠和刚性受体网格能量背景下,采用蒙特卡罗方法对多肽的折叠进行计算,产生多肽的对接构象。

    参数说明

    Receptor File

    受体结构文件,PDB格式。

    Peptide Sequence String

    Amino acid sequences of the peptides; peptides of up to 20 amino acids can be docked successfully. One sequence per line, for example:
    AINMDSFHTWKVLECGRPQY
    HRIAQCSDKW
    IYSADCLPKG
    AAAAIS

    Box Center

    对接口袋中心的三维坐标(XYZ),空格分割。例如:10 2 -11。

    Box Size

    对接口袋长方体盒子的大小,必须是整数,空格分割,例如 30 30 30。

    Out Pose

    每个多肽与蛋白对接后输出的构象数目,默认为10。

    结果说明

    输出结果包括:

    输出文件名称 说明
    Scores.csv 提交多肽与受体的打分文件。
    output_complex_top1.pdb 展示打分第一的多肽与受体的复合物构象。
    output_complex_topn.tar.gz TopN多肽“Out Pose”构象数与受体形成的复合物结构PDB文件压缩包。

    其中Scores.csv包括信息如下:

    字段名称 说明
    Name 对接多肽名称
    Score(kcal/mol) 对接打分,该值越低说明结合亲和力越高。
    Cluster RMSD 聚类后,构象之间的RMSD
    Average RMSD 平均RMSD
    Complex File Name 复合物文件名称

    参考文献

    J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling.
    O. Trott, A. J. Olson, AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading, Journal of Computational Chemistry 31 (2010) 455-461.
    Podtelezhnikov, A.A., Wild, D.L. CRANKITE: A fast polypeptide backbone conformation sampler. Source Code Biol Med 3, 12 (2008).

    Peptide VS

    Introduction

    The Peptide VS module integrates AutoDock Vina and AutoDock CrankPep for protein-peptide docking. It predicts the conformation of the protein-peptide complex and reports the docking energy and binding affinity. AutoDock Vina is a molecular docking tool that compares the binding affinities of multiple molecules and is used for screening, designing, and optimizing drug molecules. AutoDock CrankPep is a docking tool dedicated to peptides: it folds the peptide with a Monte Carlo search in the energy landscape of a rigid receptor grid, producing docked peptide conformations. The approach has been shown to redock peptides of up to 20 amino acids successfully.

    Parameter Description

    Receptor File

    Structure file of the receptor in PDB format.

    Peptide Ligand

    Structure file of the peptide ligand in SDF format. Obtained from the Peptide Structure Generation module.

    Box Center

    Three-dimensional coordinates (XYZ) of the docking pocket center, separated by spaces. For example: -44.497 -22 -5.

    Box Size

    Size of the docking pocket rectangular box, must be integers, separated by spaces, for example 30 30 30.
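
    The two box parameters can be validated before submission with a small Python sketch. It assumes the box is axis-aligned and centered on Box Center, which is the usual docking-box convention but is not stated explicitly here; the function name is hypothetical.

```python
def parse_box(center_text, size_text):
    """Parse 'x y z' strings into a center, an integer size, and box corners."""
    center = tuple(float(v) for v in center_text.split())
    if len(center) != 3:
        raise ValueError("Box Center needs three numbers")
    try:
        size = tuple(int(v) for v in size_text.split())
    except ValueError:
        raise ValueError("Box Size values must be integers") from None
    if len(size) != 3 or any(v <= 0 for v in size):
        raise ValueError("Box Size needs three positive integers")
    # Corners of the box, assuming it is centered on Box Center.
    lower = tuple(c - s / 2 for c, s in zip(center, size))
    upper = tuple(c + s / 2 for c, s in zip(center, size))
    return center, size, lower, upper


center, size, lower, upper = parse_box("-44.497 -22 -5", "30 30 30")
print("center:", center)
print("lower corner:", lower)
print("upper corner:", upper)
```

    A non-integer size such as 30.5 is rejected, matching the requirement that Box Size values be integers.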

    TopN

    Number of top-scoring peptides (top N) to include in the output files; default is 100.

    Out Pose

    Number of conformations output for each peptide-protein docking, default is 10.

    Result Description

    The output includes:

    Output File Name Description
    Scores.csv Scoring file for the docking of peptides with the receptor.
    output_complex_top1.pdb Conformation of the top scoring peptide-receptor complex.
    output_complex_topn.tar.gz Compressed archive of PDB files for the complexes formed by the receptor and the top N peptides, with “Out Pose” conformations per peptide.

    The Scores.csv file includes the following information:

    Field Name Description
    Name Name of the docked peptide
    Score(kcal/mol) Docking score; lower values indicate higher binding affinity.
    Complex File Name Name of the complex file

    References

    J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling.
    O. Trott, A. J. Olson, AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading, Journal of Computational Chemistry 31 (2010) 455-461.
    Podtelezhnikov, A.A., Wild, D.L. CRANKITE: A fast polypeptide backbone conformation sampler. Source Code Biol Med 3, 12 (2008).

  • Name: Alanine Scan (MMPBSA)
    Description: Alanine Scan (MMPBSA) calculates the binding free energy, and its components, after alanine mutation using the MM-PBSA method.
    Author: WECOMPUT
    Release: 2023-08-03 09:10:47
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMPBSA)

    简介

    Alanine Scan (MMPBSA)是计算丙氨酸突变后的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMPBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Mutation Residue

    突变扫描为丙氨酸(ALA)的氨基酸位置。格式为res1:res2:res3:res4,其中“res1-res4”数字为残基编号。

    Force File

    丙氨酸扫描时使用的力场。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。只能是AMBER力场下构建的膜体系才能进行计算。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMPBSA_result.txt MMPBSA丙氨酸突变结果汇总文件。
    MMPBSA_Residue.csv 丙氨酸突变能量分解数据CSV文件。
    MMPBSA.pdb 丙氨酸突变后,原子对应的MMPBSA能量放到PDB文件。可以做对应能量类别的表面图,从图的颜色深浅可以看出哪些区域对结合能贡献值较大。
    MMPBSA.tar.gz MMPBSA所有原始文件。包括_mmpbsa_residue_#.txt是每个残基对应的每个能量类别的数值,共包含7个能量类别:范德华能(VDW)、静电能(ELE)、溶剂化能极性部分(PB)、溶剂化能非极性部分(SA)、VDW+ELE=MM、PB+SA=PBSA、MM+PBSA=Binding/MMPBSA。_mmpbsa_residue.txt是对上述7个文件的总结,即为MMPBSA_Residue.csv对应的原始文件。_mmpbsa_atom#.pdb是将每个原子对应的能量类别放到pdb文件,与MMPBSA.pdb相似。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMPBSA)

    Introduction

    Alanine Scan (MMPBSA) calculates the binding free energy after alanine mutations and provides energy decomposition data, the binding constant (Ka), and the inhibitor constant (Ki). Entropy is calculated with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropy component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation; the calculation should be restricted to the structurally stable part of the trajectory.
    This module provides four calculation methods. Trajectory calculates the binding free energy between the receptor and ligand from a molecular dynamics trajectory. One Structure calculates it for a single-frame PDB structure or a docked PDB structure; in the MMPBSA of One Structure workflow, a PDB file can be supplied directly.
    With Index Name, the receptor and ligand are specified by group name; with Custom Name, they are specified by their residue numbers in the PDB.
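
    The two thermodynamic quantities mentioned above can be illustrated numerically. The sketch below is not the module's code. It uses the interaction entropy expression published by Duan et al. (2016), -TΔS = RT ln⟨exp(ΔE/RT)⟩ with ΔE the deviation of the per-frame interaction energy from its mean, and the textbook relation ΔG = -RT ln Ka. Energies in kcal/mol, a temperature of 298.15 K, and treating Ki as 1/Ka are assumptions; how the module itself derives Ka and Ki is not documented here.

```python
import math
from statistics import fmean

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def interaction_entropy_term(interaction_energies, temperature=298.15):
    """Return -T*dS (kcal/mol) from per-frame interaction energies (kcal/mol)."""
    rt = R_KCAL * temperature
    mean_e = fmean(interaction_energies)
    # Average of exp(dE/RT), where dE is the fluctuation around the mean.
    boltz = fmean(math.exp((e - mean_e) / rt) for e in interaction_energies)
    return rt * math.log(boltz)


def binding_constants(delta_g, temperature=298.15):
    """Convert a binding free energy (kcal/mol) into (Ka, Ki)."""
    rt = R_KCAL * temperature
    ka = math.exp(-delta_g / rt)  # association constant, 1 M standard state
    ki = 1.0 / ka                 # treated here as a dissociation constant
    return ka, ki


# Hypothetical per-frame interaction energies from a stable trajectory window.
energies = [-45.2, -44.1, -46.8, -45.0, -43.9]
print("-TdS (kcal/mol):", interaction_entropy_term(energies))
print("Ka, Ki for dG = -10 kcal/mol:", binding_constants(-10.0))
```

    Because the exponential average grows quickly with the size of the fluctuations, frames from an unstable part of the trajectory inflate the entropy term, which is why a stable window is recommended.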

    Parameter Description

    Trajectory Method

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand; can be Protein, DNA, or RNA. For a small molecule, enter its name as it appears in the PDB. If everything in the system other than the protein is the ligand (including small molecules), Other can be used.

    Mutation Residue

    Residue positions to be mutated to alanine (ALA) in the scan. The format is res1:res2:res3:res4, where res1 to res4 are residue numbers.
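
    A Mutation Residue string can be checked before a run with a few lines of Python. This is only a sketch of the documented colon-separated format; rejecting duplicate residues is an assumption made for the example, and the function name is hypothetical.

```python
def parse_mutation_residues(text):
    """Split 'res1:res2:...' into a list of residue numbers, keeping the order."""
    residues = []
    for token in text.split(":"):
        token = token.strip()
        if not token.isdigit():
            raise ValueError(f"not a residue number: {token!r}")
        number = int(token)
        if number in residues:
            raise ValueError(f"residue {number} is listed twice")
        residues.append(number)
    return residues


print(parse_mutation_residues("45:67:102:130"))
```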

    Force File

    Force field used for alanine scanning.

    Start Time (ps)

    Start frame time in ps. It is best to choose a window in which the RMSD is stable, so that structural instability does not inflate the overall entropy.

    End Time (ps)

    End frame time in ps. It is best to choose a window in which the RMSD is stable, so that structural instability does not inflate the overall entropy.

    Skip Time (ps)

    Time interval in ps.

    Index File

    Index file in ndx format. When the system contains a membrane, the index.ndx file can be obtained from the Membrane Solvation module. In membrane simulations, the receptor group is named receptor, the ligand group ligand, and the membrane group membrane. The calculation is supported only for membrane systems built with the AMBER force field.

    Custom Receptor

    Define the receptor group for the binding energy calculation. Enter the sequence numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Numbering restarts from 1 and is independent of the residue numbering in the initial PDB file.

    Custom Ligand

    Define the ligand group for the binding energy calculation. Enter the sequence numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Numbering restarts from 1 and is independent of the residue numbering in the initial PDB file.

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    System parameter compression file in tar.gz format. Obtained from the MD Solvation module or Membrane Solvation module.

    Result Description

    The output includes:

    Output File Name Description
    MMPBSA_result.txt Summary file of MMPBSA alanine mutation results.
    MMPBSA_Residue.csv Energy decomposition data for alanine mutations in CSV format.
    MMPBSA.pdb PDB file with the MMPBSA energy after alanine mutation assigned to each atom. It can be used to draw a surface map for a given energy category; the color depth shows which regions contribute most to the binding energy.
    MMPBSA.tar.gz All original MMPBSA files. mmpbsa_residue#.txt holds the per-residue values of each energy category. There are 7 categories: van der Waals energy (VDW), electrostatic energy (ELE), polar part of the solvation energy (PB), nonpolar part of the solvation energy (SA), MM = VDW + ELE, PBSA = PB + SA, and Binding/MMPBSA = MM + PBSA. _mmpbsa_residue.txt summarizes these 7 files and is the raw source of MMPBSA_Residue.csv. _mmpbsa_atom#.pdb writes the energy of each category onto the atoms of a PDB file, similar to MMPBSA.pdb.

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: MMPBSA
    Description: MMPBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。 MMPBSA calculates the binding free energy between the receptor and ligand and provides energy decomposition data, binding constant (Ka), and inhibitor constant (Ki).
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-08-03 09:10:29
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMPBSA

    简介

    MMPBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMPBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMPBSA_result.txt MMPBSA结果汇总文件。
    MMPBSA_Residue.csv 能量分解数据CSV文件。
    MMPBSA.pdb 原子对应的MMPBSA能量放到PDB文件。可以做对应能量类别的表面图,从图的颜色深浅可以看出哪些区域对结合能贡献值较大。
    MMPBSA.tar.gz MMPBSA所有原始文件。包括_mmpbsa_residue_#.txt是每个残基对应的每个能量类别的数值,共包含7个能量类别:范德华能(VDW)、静电能(ELE)、溶剂化能极性部分(PB)、溶剂化能非极性部分(SA)、VDW+ELE=MM、PB+SA=PBSA、MM+PBSA=Binding/MMPBSA。_mmpbsa_residue.txt是对上述7个文件的总结,即为MMPBSA_Residue.csv对应的原始文件。_mmpbsa_atom#.pdb是将每个原子对应的能量类别放到pdb文件,与MMPBSA.pdb相似。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMPBSA

    Introduction

    MMPBSA calculates the binding free energy between a receptor and a ligand and reports energy decomposition data, the binding constant (Ka), and the inhibitor constant (Ki). Entropy is computed with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropy component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation; it should be applied only to the structurally stable part of the trajectory.
    This module provides four calculation methods. Trajectory computes the receptor-ligand binding free energy from a molecular dynamics trajectory. One Structure computes it for a single-frame PDB structure or a docked PDB structure; in the MMPBSA of One Structure workflow, a PDB file can be supplied directly.
    Index Name takes the receptor and ligand group names as input, while Custom Name takes the residue numbers of the receptor and ligand in the PDB.
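
As a rough illustration of the interaction entropy idea from the cited Duan et al. paper, the sketch below evaluates -TΔS = kT ln⟨exp(βΔE_int)⟩ from per-frame receptor-ligand interaction energies, and converts a binding free energy to Ka through ΔG = -RT ln Ka. The function names, the kcal/mol unit, and the 300 K default are assumptions made for this example; this is not the module's actual implementation.

```python
import math

K_B = 0.0019872041  # gas constant in kcal/(mol*K); the energy unit is an assumption


def interaction_entropy(e_int, temperature=300.0):
    """Return -T*dS from per-frame interaction energies (hypothetical helper)."""
    beta = 1.0 / (K_B * temperature)
    mean = sum(e_int) / len(e_int)
    # Fluctuation of each frame around the trajectory average
    fluct = [e - mean for e in e_int]
    # Shift by the largest fluctuation so the exponentials cannot overflow
    shift = max(fluct)
    avg_exp = sum(math.exp(beta * (f - shift)) for f in fluct) / len(fluct)
    return (math.log(avg_exp) + beta * shift) / beta


def binding_constant(delta_g, temperature=300.0):
    """Ka from dG = -RT ln Ka, assuming a 1 M standard state."""
    return math.exp(-delta_g / (K_B * temperature))
```

Because the average is dominated by the largest energy fluctuations, frames from an unstable part of the trajectory can inflate the result, which is why a stable RMSD window is recommended.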

    Parameter Description

    Trajectory Method

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand; can be Protein, DNA, or RNA. For a small molecule, enter its name in the PDB. If everything in the system other than the protein acts as the ligand (small molecules included), enter ‘Other’.

    Start Time (ps)

    Start time of the frames to analyze, in ps. Choose a window in which the RMSD is stable; frames from a structurally unstable period inflate the overall entropy.

    End Time (ps)

    End time of the frames to analyze, in ps. Choose a window in which the RMSD is stable; frames from a structurally unstable period inflate the overall entropy.

    Skip Time (ps)

    Time interval in ps.

    Index File

    Index file in ndx format. When there is a membrane structure in the system, you can obtain the index.ndx file through the Membrane Solvation module. In membrane simulations, the receptor is ‘receptor’, the ligand is ‘ligand’, and the membrane is ‘membrane’.

    Custom Receptor

    Define receptor groups for binding energy calculation. Enter the sequence numbers of protein amino acids or nucleic acid bases in the group. For example, 1-213 or 1-211,212-213. Protein amino acid or nucleic acid base numbers start from 1 and are independent of the initial pdb amino acid numbering.

    Custom Ligand

    Define ligand groups for binding energy calculation. Enter the sequence numbers of protein amino acids or nucleic acid bases in the group. For example, 1-213 or 1-211,212-213. Protein amino acid or nucleic acid base numbers start from 1 and are independent of the initial pdb amino acid numbering.
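
The range syntax used by Custom Receptor and Custom Ligand can be sketched as follows. The parser and the renumbering helper are hypothetical illustrations of the "1-211,212-213" format and of the 1-based renumbering described above; they are not taken from the module itself.

```python
def parse_residue_spec(spec):
    """Expand a spec such as '1-211,212-213' into sorted 1-based residue indices."""
    residues = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part}")
            residues.update(range(start, end + 1))
        else:
            residues.add(int(part))
    if residues and min(residues) < 1:
        raise ValueError("residue indices start from 1")
    return sorted(residues)


def renumber_from_one(pdb_residue_ids):
    """Map original PDB residue ids (in file order) to sequential 1-based indices."""
    return {orig: i for i, orig in enumerate(pdb_residue_ids, start=1)}
```

With this reading, "1-213" and "1-211,212-213" select the same residues, and a chain that starts at residue 27 in the PDB would be addressed starting from 1.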

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    System parameter compression file in tar.gz format. Obtained from the MD Solvation module or Membrane Solvation module.

    Result Description

    The output includes:

    Output File Name Description
    MMPBSA_result.txt Summary file of MMPBSA results.
    MMPBSA_Residue.csv Energy decomposition data in CSV format.
    MMPBSA.pdb PDB file with the MMPBSA energy assigned to each atom. It can be used to draw a surface map for a given energy category; the color depth shows which regions contribute most to the binding energy.
    MMPBSA.tar.gz All original MMPBSA files. mmpbsa_residue#.txt holds the per-residue values of each energy category. There are 7 categories: van der Waals energy (VDW), electrostatic energy (ELE), polar part of the solvation energy (PB), nonpolar part of the solvation energy (SA), MM = VDW + ELE, PBSA = PB + SA, and Binding/MMPBSA = MM + PBSA. _mmpbsa_residue.txt summarizes these 7 files and is the raw source of MMPBSA_Residue.csv. _mmpbsa_atom#.pdb writes the energy of each category onto the atoms of a PDB file, similar to MMPBSA.pdb.

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: Grafting (v2)
    Description: Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。 Graft antibody CDRs to target frameworks, normally for humanization.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 11:07:18
    Reference:

    Grafting v2

    简介

    Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2

    参数说明

    Antibody Sequence File

    抗体序列文件,FASTA格式

    Numbering Type

    抗体编号规则:kabat,imgt,chothia

    Output File

    指定输出抗体graft后的序列文件名称,FASTA格式

    Output Policy

    指定输出graft策略文件,JSON格式

    Germline Score

    指定输出抗体FR区序列比对同源性打分文件

    Germline

    指定轻链或重链使用特定germline模板,也可都指定,写法如下:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    其中链名来自于流程第一步输入的fasta文件。
    例1:以下语句为链"Infliximab.H"指定了模板"IGHV3-7*01":

    Infliximab.H:IGHV3-7*01
    

    例2:以下语句为两条链分别指定了模板:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
    

    Template Sequence

    指定参考模板序列,FASTA格式

    Germline Hits

    指定输出FR区序列比对结果文件,FASTA格式

    Number of Hits

    指定输出命中序列的数目

    结果说明

    输出结果包括:

    输出文件名称 说明
    germline_hits.fasta 输出FR区序列比对结果文件
    germline_score.json 输出抗体FR区序列比对同源性打分文件
    grafted.fasta 输出抗体graft后的序列文件名称
    graft_policy.json 输出graft策略文件

    Grafting v2

    Introduction

    The Grafting module is used to graft antibody CDRs onto specific framework region templates, typically used in humanization design. Version: v2

    Parameter Description

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Numbering Type

    Antibody numbering rule: kabat, imgt, chothia.

    Output File

    Specify the output file name for the grafted antibody sequence in FASTA format.

    Output Policy

    Specify the output grafting strategy file in JSON format.

    Germline Score

    Specify the output file for the alignment scores of the antibody FR region sequences.

    Germline

    Specify the germline template to use for the light chain, the heavy chain, or both, in the following format:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    The chain names come from the FASTA file supplied in the first step of the workflow.
    Example 1: The following statement specifies the template “IGHV3-7*01” for the chain “Infliximab.H”:

    Infliximab.H:IGHV3-7*01
    

    Example 2: The following statement specifies templates for two chains:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
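
A minimal sketch of how a Germline value in this format could be split into a chain-to-template mapping. The function name is hypothetical and the module's own parsing may differ, for example in how it treats whitespace or duplicate chain names.

```python
def parse_germline_spec(spec):
    """Parse 'seq_name1:germline_name1,seq_name2:germline_name2' into a dict."""
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon: chain name before it, germline name after it
        seq_name, sep, germline = item.partition(":")
        if not sep or not seq_name or not germline:
            raise ValueError(f"expected seq_name:germline_name, got {item!r}")
        mapping[seq_name] = germline
    return mapping
```

Applied to Example 2, this would map Infliximab.H to IGHV3-7*01 and Infliximab.L to IGKV1-39*01.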
    

    Template Sequence

    Specify the reference template sequence in FASTA format.

    Germline Hits

    Specify the output file for the FR region sequence alignment results in FASTA format.

    Number of Hits

    Specify the number of hit sequences to output.

    Result Description

    The output includes:

    Output File Name Description
    germline_hits.fasta Output file for FR region sequence alignment results
    germline_score.json Output file for alignment scores of the antibody FR region sequences
    grafted.fasta Output file name for the grafted antibody sequence
    graft_policy.json Output file for the grafting strategy
  • Name: De-immunization Design
    Description: 在具有免疫原性的序列上引入逐渐增加的突变(通常建议人源的突变),以降低或者去除免疫原性。 本模块通常不单独使用,一般结合AlphaMHC进行使用,自动化去除免疫原性请使用【De-immunization Design】流程。 Introduce progressively increasing mutations on sequences with immunogenicity to reduce or eliminate immunogenicity. This module is usually not used alone and is generally used in combination with AlphaMHC. For automated de-immunization, please use the De-immunization Design process.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-11 09:59:10
    Reference:

    De-immunization Design

    简介

    对具有免疫原性的序列通过突变的方式降低或者去除免疫原性。需要结合AlphaMHC2进行使用。

    参数说明

    Sequence File

    序列文件,FASTA格式。

    Detail File

    AlphaMHC2输出的detail文件。

    Mutation File

    突变文件,文本格式包含突变信息,格式如下:
    L21H
    G26K
    其中L,G代表序列残基名称,21,26代表21或26位氨基酸残基,H/K代表突变后的残基名称。

    Mutation Maximums

    最多进行多少次突变,谨慎设置,建议不超过3,不然会因为产生过量组合导致内存溢出。

    Output Sequences Number

    最终输出序列数目,序列优先按照TCE_LEN从低到高排序,同样情况下越少突变点排名越高。

    Output File

    输出文件名称,FASTA格式,默认result.fst。

    Random Mutation

    是否对突变文件之外的位点进行随机突变。

    Seed

    随机种子,因为量过大,所以程序中部分地方会使用随机采样,可以通过控制seed来获取重复结果。

    结果说明

    输出降低(去除)免疫原性的序列文件为result.fst。

    De-immunization Design

    Introduction

    De-immunization design involves reducing or eliminating immunogenicity in sequences through mutations. It needs to be used in conjunction with AlphaMHC2.

    Parameter Description

    Sequence File

    Sequence file in FASTA format.

    Detail File

    Detail file output from AlphaMHC2.

    Mutation File

    Mutation file in text format containing mutation information. The format is as follows:
    L21H
    G26K
    Where ‘L’ and ‘G’ represent the residue names in the sequence, ‘21’ and ‘26’ represent the positions of the amino acid residues, and ‘H/K’ represent the mutated residue names.
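
The mutation notation can be read with a few lines of Python. This is a sketch under stated assumptions: one-letter residue codes, 1-based positions, and hypothetical helper names; the actual module may validate its input differently.

```python
import re

# Original residue, position, mutated residue, e.g. L21H
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(line):
    """Split 'L21H' into ('L', 21, 'H')."""
    match = MUTATION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid mutation: {line!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def apply_mutations(sequence, mutations):
    """Apply parsed mutations to a sequence, checking the original residue first."""
    residues = list(sequence)
    for wild, pos, mutant in mutations:
        if not 1 <= pos <= len(residues):
            raise ValueError(f"position {pos} is outside the sequence")
        if residues[pos - 1] != wild:
            raise ValueError(f"expected {wild} at position {pos}, found {residues[pos - 1]}")
        residues[pos - 1] = mutant
    return "".join(residues)
```

Checking the original residue before substituting catches a mutation file that was written against a different sequence or numbering.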

    Mutation Maximums

    Maximum number of mutations to perform. Exercise caution in setting this parameter. It is recommended not to exceed 3 to avoid memory overflow due to excessive combinations.
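
To see why small limits are recommended, the number of candidate variants can be estimated by counting subsets of the mutation list. The sketch assumes every listed mutation targets a different position and that variants are formed by choosing between 1 and the maximum number of mutations; the function name is hypothetical and the module may enumerate combinations differently.

```python
from math import comb


def count_variants(n_candidates, max_mutations):
    """Count non-empty mutation subsets with at most max_mutations members."""
    return sum(comb(n_candidates, k) for k in range(1, max_mutations + 1))
```

For 20 candidate mutations, a limit of 3 gives 1350 subsets under these assumptions, while a limit of 6 already gives 60459.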

    Output Sequences Number

    Number of sequences in the final output. Sequences are sorted by TCE_LEN in ascending order; when TCE_LEN is equal, sequences with fewer mutation sites rank higher.

    Output File

    Output file name in FASTA format. Default is result.fst.

    Random Mutation

    Whether to perform random mutations at sites outside the mutation file.

    Seed

    Random seed for reproducibility. Random sampling is used in some parts of the program, and controlling the seed allows for obtaining reproducible results.

    Result Description

    The output file containing sequences with reduced (or eliminated) immunogenicity is named result.fst.

  • Name: MD PCA
    Description: MD PCA(Principal component analysis,PCA)模块可以从高维数据中分析出主要的影响因素 (本征向量) 。前几个本征向量(主成分,如前两个主成份则为 PC1,PC2) 一般可以描述分子运动的大部分信息。N个原子的柔性大体系如蛋白,其运动轨迹需要3N维笛卡尔坐标来描述,这样高维的数据很难理解和直观分析。 MD PCA (Principal component analysis) module can analyze the main influencing factors (eigenvectors) from the high-dimensional data. The first few eigenvectors (principal components, such as PC1 and PC2 for the first two principal components) can generally describe most of the information about molecular motion. The motion path of a flexible large system with N atoms, such as protein, needs 3N Cartesian coordinates to describe, so it is difficult to understand and intuitively analyze the high-dimensional data.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-06 00:51:22
    Reference:

    MD PCA

    简介

    N个原子的柔性大体系如蛋白,其运动轨迹需要3N维笛卡尔坐标来描述,这样高维的数据很难理解和直观分析。MD PCA(Principal component analysis,PCA)模块可以从高维数据中分析出主要的影响因素 (本征向量) 。前几个本征向量(主成分,如前两个主成份则为 PC1,PC2) 一般可以描述分子运动的大部分信息。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    得到结果文件,每种类型的文件如果包含PNG、CSV以及XVG后缀,相同名称只是表现形式不同,数据一样

    输出文件名称 说明
    average.pdb 计算后的平均结构文件
    filtered.xtc 计算的降维过滤后的轨迹文件
    eigenvalues.xvg 本征值文件
    proj1.xvg 对应的主成分PC1文件
    proj2.xvg 对应的主成分PC2文件
    proj_all.xvg 计算的PC1到PC2的主成份合并文件
    Gibbs_2d.png/Gibbs_3d.png 只计算两个主成分时的二维和三维自由能景观图

    MD PCA

    Introduction

    For a large flexible system with N atoms such as a protein, its motion trajectory requires 3N-dimensional Cartesian coordinates to describe, making it difficult to understand and analyze high-dimensional data. The MD PCA (Principal Component Analysis, PCA) module can analyze the main influencing factors (eigenvectors) from high-dimensional data. The first few eigenvectors (principal components, such as the first two principal components PC1, PC2) can generally describe most of the information about molecular motion.

    Parameter Description

    Path File

    Path file obtained after MD simulation, can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The result files are listed below. Files that share a name but have different suffixes (PNG, CSV, XVG) contain the same data in different formats.

    Output File Name Description
    average.pdb Computed average structure file
    filtered.xtc Filtered trajectory file after dimensionality reduction
    eigenvalues.xvg Eigenvalues file
    proj1.xvg Corresponding principal component PC1 file
    proj2.xvg Corresponding principal component PC2 file
    proj_all.xvg Combined file of principal components PC1 to PC2
    Gibbs_2d.png/Gibbs_3d.png 2D and 3D free energy landscape plots when only two principal components are considered
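
A free energy landscape of this kind is commonly estimated from the PC1/PC2 projections as G = -kT ln(P / P_max), where P is the population of each histogram bin. The sketch below follows that common recipe; whether the module uses the same formula, bin count, or units is an assumption, and the function name is hypothetical.

```python
import math

K_B = 0.0083144626  # gas constant in kJ/(mol*K)


def free_energy_grid(pc1, pc2, bins=20, temperature=300.0):
    """Return a bins x bins grid of G = -kT ln(P / P_max); empty bins are None."""
    lo1, hi1 = min(pc1), max(pc1)
    lo2, hi2 = min(pc2), max(pc2)

    def index(value, lo, hi):
        if hi == lo:
            return 0
        # Clamp so the maximum value falls into the last bin
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(pc1, pc2):
        counts[index(x, lo1, hi1)][index(y, lo2, hi2)] += 1

    peak = max(max(row) for row in counts)
    kt = K_B * temperature
    return [[-kt * math.log(c / peak) if c else None for c in row] for row in counts]
```

The most populated bin gets a free energy of zero, and rarely visited regions appear as higher values.
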
  • Name: MD SASA
    Description: MD SASA模块是计算指定组别的溶剂可及表面积(solvent accessible surface area,SASA)。 MD SASA module calculates the solvent accessible surface area (SASA) for a specified group.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-06 00:29:36
    Reference:

    MD SASA

    简介

    MD SASA模块是计算指定组别的溶剂可及表面积(solvent accessible surface area,SASA)。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    输出结果包括:

    输出文件名称 说明
    area.csv 溶剂可及表面积CSV文件
    area.xvg 溶剂可及表面积XVG文件
    area.png 溶剂可及表面积PNG文件

    其中area.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Total Area (nm^2) 溶剂可及表面积
    Hydrophobic (nm^2) 疏水表面积
    Hydrophilic (nm^2) 亲水表面积

    MD SASA

    Introduction

    The MD SASA module calculates the solvent accessible surface area (SASA) of specified groups.

    Parameter Description

    Path File

    Path file obtained after MD simulation, can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The output results include:

    Output File Name Description
    area.csv Solvent accessible surface area CSV file
    area.xvg Solvent accessible surface area XVG file
    area.png Solvent accessible surface area PNG file

    The area.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Total Area (nm^2) Total solvent accessible surface area
    Hydrophobic (nm^2) Hydrophobic surface area
    Hydrophilic (nm^2) Hydrophilic surface area
  • Name: MD Hbond
    Description: MD Hbond对于指定组别之间的氢键分析。 MD Hbond for hydrogen bond analysis between specified groups.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-05 17:34:57
    Reference:

    MD Hbond

    简介

    MD Hbond模板对于指定组别之间的氢键分析。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group1

    选择需要计算的氢键组别1:Protein,DNA,RNA。如果两个组相同则分析的是组内氢键。
    可以根据PDB中小分子的名称填写组别名称。

    System Group2

    选择需要计算的氢键组别2:Protein,DNA,RNA。如果两个组相同则分析的是组内氢键。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid1

    自定义需要计算的组1残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom1

    自定义需要计算的组1原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Resid2

    自定义需要计算的组2残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom2

    自定义需要计算的组2原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    输出结果包括:

    输出文件名称 说明
    hbnum.csv 氢键分析CSV文件
    hbnum.xvg 氢键分析XVG文件
    hbnum.png 氢键分析PNG文件

    其中hbnum.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Hydrogen bonds 氢键数目
    Pairs within 0.35 nm 两个组相距0.35nm内的接触的原子数目

    MD Hbond

    Introduction

    The MD Hbond module analyzes hydrogen bonds between specified groups.

    Parameter Description

    Path File

    Path file obtained after MD simulation, can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group1

    Select the hydrogen bond group 1 for calculation: Protein, DNA, RNA. If both groups are the same, it analyzes intra-group hydrogen bonds. You can specify group names based on the names of small molecules in the PDB file.

    System Group2

    Select the hydrogen bond group 2 for calculation: Protein, DNA, RNA. If both groups are the same, it analyzes intra-group hydrogen bonds. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid1

    Custom residue numbers for group 1 calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom1

    Custom atom numbers for group 1 calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Custom Resid2

    Custom residue numbers for group 2 calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom2

    Custom atom numbers for group 2 calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Result Description

    The output results include:

    Output File Name Description
    hbnum.csv Hydrogen bond analysis CSV file
    hbnum.xvg Hydrogen bond analysis XVG file
    hbnum.png Hydrogen bond analysis PNG file

    The hbnum.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Hydrogen bonds Number of hydrogen bonds
    Pairs within 0.35 nm Number of atom pairs between the two groups that are within 0.35 nm of each other
  • Name: MD Gyration
    Description: MD Gyration回旋半径分析,可用来衡量体系模拟时的质权平均半径。 MD Gyration performs radius of gyration analysis, which measures the mass-weighted average radius of the system during the simulation.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-05 16:24:54
    Reference:

    MD Gyration

    简介

    MD Gyration回旋半径分析,可用来衡量体系模拟时的质权平均半径。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    输出结果包括:

    输出文件名称 说明
    gyrate.csv 回转半径CSV文件
    gyrate.xvg 回转半径XVG文件
    gyrate.png 回转半径PNG文件

    其中gyrate.csv包括信息如下:

    字段名称 说明
    Time (ps) 时间
    Rg 回旋半径
    Rg(X) 绕着x轴的回旋半径
    Rg(Y) 绕着y轴的回旋半径
    Rg(Z) 绕着z轴的回旋半径

    MD Gyration

    Introduction

    MD Gyration is a radius of gyration analysis used to measure the mass-weighted average radius of a system during simulation.
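
For a single frame, the mass-weighted radius of gyration is Rg = sqrt(Σ m_i |r_i - r_com|² / Σ m_i). The sketch below computes it together with the values about each axis, assuming that Rg(X) uses only the displacement perpendicular to the x-axis (and likewise for y and z). The function name is hypothetical and the module's exact definition is not confirmed here.

```python
import math


def radius_of_gyration(coords, masses):
    """Return (Rg, Rg about x, Rg about y, Rg about z) for one frame."""
    total = sum(masses)
    # Center of mass
    com = [sum(m * c[i] for c, m in zip(coords, masses)) / total for i in range(3)]
    # Mass-weighted squared displacement along each axis
    sq = [0.0, 0.0, 0.0]
    for c, m in zip(coords, masses):
        for i in range(3):
            sq[i] += m * (c[i] - com[i]) ** 2
    rg = math.sqrt(sum(sq) / total)
    # Rg about an axis counts only the displacement perpendicular to that axis
    rg_x = math.sqrt((sq[1] + sq[2]) / total)
    rg_y = math.sqrt((sq[0] + sq[2]) / total)
    rg_z = math.sqrt((sq[0] + sq[1]) / total)
    return rg, rg_x, rg_y, rg_z
```

A steady Rg over time suggests that the overall compactness of the structure is not changing.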

    Parameter Description

    Path File

    Path file obtained after MD simulation, can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The output results include:

    Output File Name Description
    gyrate.csv Radius of gyration CSV file
    gyrate.xvg Radius of gyration XVG file
    gyrate.png Radius of gyration PNG file

    The gyrate.csv file includes the following information:

    Field Name Description
    Time (ps) Time
    Rg Radius of gyration
    Rg(X) Radius of gyration around the x-axis
    Rg(Y) Radius of gyration around the y-axis
    Rg(Z) Radius of gyration around the z-axis
  • Name: MD Clustering
    Description: MD Clustering是对动力学轨迹进行归簇分析。 MD Clustering is a clustering analysis of dynamic trajectories.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-04 11:40:38
    Reference:

    MD Clustering

    简介

    MD Clustering是对动力学轨迹进行归簇分析。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    Cutoff

    聚类时结构的RMSD截断值(nm)

    Cluster Method

    聚类算法:linkage, jarvis-patrick, monte-carlo, diagonalization, gromos, 默认使用gromos算法。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    输出结果包括:

    输出文件名称 说明
    clusters.pdb 差异较大的每个簇的代表性结构
    clust-size.xvg 各个簇的帧数
    cluster.xvg 各个簇和轨迹帧号的对应关系

    MD Clustering

    Introduction

    MD Clustering is a clustering analysis of molecular dynamics trajectories.

    Parameter Description

    Path File

    Path file obtained after MD simulation, can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    Cutoff

    RMSD cutoff value for clustering (in nm).

    Cluster Method

    Clustering algorithm: linkage, jarvis-patrick, monte-carlo, diagonalization, gromos. The default method is gromos.
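
The idea behind the gromos method can be sketched as follows: count each frame's neighbors within the RMSD cutoff, take the frame with the most neighbors as a cluster center, remove it together with its neighbors, and repeat on what is left. This simplified version works on a precomputed RMSD matrix and uses a hypothetical function name; tie-breaking and the exact cutoff comparison in the real tool may differ.

```python
def gromos_cluster(rmsd, cutoff):
    """Cluster frames from a symmetric RMSD matrix; returns [(center, members), ...]."""
    remaining = set(range(len(rmsd)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            # Neighbors of frame i among the frames not yet assigned (includes i itself)
            members = [j for j in sorted(remaining) if rmsd[i][j] <= cutoff]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters
```

A smaller Cutoff yields more, tighter clusters; a larger one merges conformations into fewer clusters.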

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10,15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10,15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Result Description

    The output results include:

    Output File Name Description
    clusters.pdb Representative structures of each cluster with significant differences
    clust-size.xvg Number of frames in each cluster
    cluster.xvg Correspondence between clusters and trajectory frame numbers
  • Name: GMX MDP Generation (Auto)
    Description: GMX MDP Generation (Auto)模块主要是根据所选体系(膜,受体,配体)自动生成分子动力学模拟过程中所需的MDP文件,此文件是Gromacs分子动力学模拟需要用到输入文件,里面包含各种参数。若需要设置更细节的参数,请前往Minimize MDP Generation,NPT MDP Generation,MD MDP Generation模块。 The GMX MDP Generation (Auto) module automatically generates the MDP files required for a molecular dynamics simulation based on the selected system (membrane, receptor, ligand). The MDP file is the input file that holds the parameters for a Gromacs molecular dynamics simulation. To set more detailed parameters, use the Minimize MDP Generation, NPT MDP Generation, and MD MDP Generation modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-06-26 10:33:46
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    GMX MDP Generation (Auto)

    简介

    GMX MDP Generation (Auto)模块主要是根据所选体系(膜,受体,配体)自动生成分子动力学模拟过程中所需的MDP文件,此文件是Gromacs分子动力学模拟需要用到输入文件,里面包含各种参数。

    参数说明

    Group Name

    选择体系中存在的结构类型:membrane代表膜结构,receptor代表大分子结构(蛋白或者核酸),ligand代表小分子结构。

    Simulation Time (ns)

    模拟时长,单位为ns

    Time Step

    时间步长,单位ps

    Coupling Reference Temperature

    参考温度,单位为K

    结果说明

    输出结果包括:

    输出文件名称 说明
    mini.mdp 最小化MDP文件
    npt.mdp/npt.tar.gz NPT MDP文件
    md.mdp/md.tar.gz MD MDP文件

    参考文献

    Abraham, Mark James et al. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1 (2015): 19-25.

    GMX MDP Generation (Auto)

    Introduction

    The GMX MDP Generation (Auto) module is designed to automatically generate the MDP files required for molecular dynamics simulations based on the selected system (membrane, receptor, ligand). The MDP file is an input file required for Gromacs molecular dynamics simulations, containing various parameters.

    Parameter Description

    Group Name

    Select the type of structure present in the system: membrane for membrane structure, receptor for macromolecular structure (protein or nucleic acid), ligand for small molecule structure.

    Simulation Time (ns)

    Duration of the simulation, in units of ns.

    Time Step

    Time step for the simulation, in units of ps.

    Coupling Reference Temperature

    Reference temperature for the temperature coupling, in units of K.
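    The three numeric parameters above map onto standard GROMACS MDP options. The sketch below is an illustration only, not the module's actual template: it converts the simulation time and time step into a step count (nsteps = time in ps / dt) and repeats the reference temperature once per coupling group. The function name build_md_mdp, the V-rescale thermostat, the tau_t value, and the group names are assumptions made for this example.

```python
def build_md_mdp(sim_time_ns, dt_ps, ref_t_k, groups):
    """Return a minimal MDP fragment (hypothetical helper, not the module's code).

    sim_time_ns -- simulation length in ns
    dt_ps       -- time step in ps
    ref_t_k     -- reference temperature in K
    groups      -- temperature-coupling group names (assumed to exist in the index file)
    """
    if sim_time_ns <= 0 or dt_ps <= 0:
        raise ValueError("simulation time and time step must be positive")
    if not groups:
        raise ValueError("at least one coupling group is required")

    # 1 ns = 1000 ps; round() absorbs floating-point noise in the division.
    nsteps = round(sim_time_ns * 1000.0 / dt_ps)

    options = {
        "integrator": "md",
        "dt": dt_ps,
        "nsteps": nsteps,
        "tcoupl": "V-rescale",                      # assumed thermostat
        "tc-grps": " ".join(groups),
        "tau_t": " ".join(["1.0"] * len(groups)),   # assumed coupling constant (ps)
        "ref_t": " ".join([str(ref_t_k)] * len(groups)),
    }
    return "".join(f"{key:<10} = {value}\n" for key, value in options.items())


if __name__ == "__main__":
    print(build_md_mdp(100, 0.002, 300.0, ["Protein", "Water_and_ions"]))
```

    With a 100 ns run and a 0.002 ps time step, the fragment contains nsteps = 50000000.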

    Result Description

    The output results include:

    Output File Name      Description
    mini.mdp              MDP file for minimization
    npt.mdp/npt.tar.gz    MDP file for NPT ensemble simulation
    md.mdp/md.tar.gz      MDP file for MD simulation

    Reference

    Abraham, Mark James et al. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1 (2015): 19-25.

  • Name: siRNA Designer
    Description: siRNA Designer基于靶点基因序列,设计siRNA分子序列。 siRNA Designer designs siRNA molecular sequences based on target gene sequences.
    Author: WECOMPUT
    Release: 2023-06-25 23:18:05
    Reference:

    siRNA Designer

    简介

    siRNA Designer基于靶点基因序列,设计siRNA分子序列。该方法考虑了多条siRNA设计规则,如下:

    • 36% < GCpercent < 52%
    • no internal short repeats
    • no GC stretches (more than 10 GC contiguous repeats)
    • 5’ end of the guide RNA is A/U
    • 5’ end of the passenger RNA is G/C
    • at least 4 A/U residues in the last 7bp of the 5’ end of the guide
    • No G at position 13 of the passenger
    • A/U at position 19 of the passenger
    • G/C at position 19 in guide

    参数说明

    RNA FASTA File

    靶点基因序列,支持多条,FASTA格式。

    结果说明

    输出结果文件为siRNAcandidates_序列名称.csv,包含信息如下:

    字段名称 说明
    Target starting position 靶点基因序列的起始位置
    Target ending position 靶点基因序列的终止位置
    Target sequence(21nt target + 2nt overhang) 靶点序列
    Target score 靶点打分,越高越好
    Guide sequence(5’->3’) 结合靶点基因的序列,也称为antisense sequence
    Passenger sequence(5’->3’) 与Guide sequence配对的序列
    Guide Tm Guide sequence计算的Melting Temperature值,一般情况下Tm值越低,发生副作用的可能性越小
    Passenger Tm Passenger sequence计算的Melting Temperature值

    siRNA Designer

    Introduction

    siRNA Designer designs siRNA molecule sequences based on target gene sequences. This method considers multiple siRNA design rules as follows:

    • 36% < GCpercent < 52%
    • no internal short repeats
    • no GC stretches (more than 10 GC contiguous repeats)
    • 5’ end of the guide RNA is A/U
    • 5’ end of the passenger RNA is G/C
    • at least 4 A/U residues in the last 7bp of the 5’ end of the guide
    • No G at position 13 of the passenger
    • A/U at position 19 of the passenger
    • G/C at position 19 in guide
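    The position and composition rules in this list can be sketched as simple checks on a candidate. This is a minimal illustration, not the module's implementation. It assumes a 19-nt guide written 5'->3' with a perfectly complementary passenger, measures GC content on the guide, reads "the last 7bp of the 5' end of the guide" as the first seven guide positions, and leaves out the "no internal short repeats" rule because the source does not define it. The function names are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("AUGC", "UACG")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]


def check_design_rules(guide):
    """Evaluate the listed rules for a 19-nt guide strand (hypothetical helper)."""
    guide = guide.upper()
    if len(guide) != 19 or set(guide) - set("AUGC"):
        raise ValueError("expected a 19-nt RNA sequence over A, U, G, C")

    # Assumption: the passenger is the exact reverse complement of the guide.
    passenger = reverse_complement(guide)
    gc_percent = 100.0 * sum(base in "GC" for base in guide) / len(guide)

    return {
        "gc_between_36_and_52": 36.0 < gc_percent < 52.0,
        "no_gc_stretch": re.search(r"[GC]{11,}", guide) is None,
        "guide_5p_is_AU": guide[0] in "AU",
        "passenger_5p_is_GC": passenger[0] in "GC",
        "guide_5p_AU_rich": sum(base in "AU" for base in guide[:7]) >= 4,
        "passenger_13_not_G": passenger[12] != "G",
        "passenger_19_is_AU": passenger[18] in "AU",
        "guide_19_is_GC": guide[18] in "GC",
    }


if __name__ == "__main__":
    for rule, passed in check_design_rules("UAUACGAUGCAGCUAGUCG").items():
        print(f"{rule}: {passed}")
```

    Because the two strands are complementary, the rule on the 5' end of one strand and the rule on position 19 of the other always agree; the sketch keeps both checks so that each rule in the list has a visible counterpart.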

    Parameter Description

    RNA FASTA File

    Target gene sequences, supports multiple sequences in FASTA format.

    Result Description

    The output file is named siRNAcandidates_<sequence name>.csv and contains the following fields:

    Field Name                                      Description
    Target starting position                        Starting position of the target gene sequence
    Target ending position                          Ending position of the target gene sequence
    Target sequence (21nt target + 2nt overhang)    Target sequence
    Target score                                    Score assigned to the target; higher is better
    Guide sequence (5’->3’)                         Sequence that binds to the target gene, also known as the antisense sequence
    Passenger sequence (5’->3’)                     Sequence that pairs with the guide sequence
    Guide Tm                                        Melting temperature calculated for the guide sequence. In general, a lower Tm indicates a lower likelihood of side effects
    Passenger Tm                                    Melting temperature calculated for the passenger sequence
  • Name: Membrane Solvation
    Description: Membrane Solvation对输入的膜,受体,配体文件加入水盒子和离子。 The Membrane Solvation module adds a water box and ions to the input membrane, receptor, and ligand files.
    Author: WECOMPUT
    Release: 2023-06-21 16:33:21
    Reference:

    Membrane Solvation

    简介

    Membrane Solvation对输入的膜,受体,配体文件加入水盒子和离子。

    参数说明

    Membrane Topology

    膜拓扑文件,top格式,可由GMX Membrane Parameterization模块生成。

    Membrane GRO

    膜结构文件,gro格式,可由GMX Membrane Parameterization模块生成。

    Membrane ITP

    膜参数压缩文件,tar.gz格式,可由GMX Membrane Parameterization模块生成。

    Receptor Topology

    受体拓扑文件,top格式,可由GMX Receptor Parameterization模块生成。

    Receptor GRO

    受体结构文件,gro格式,可由GMX Receptor Parameterization模块生成。

    Receptor ITP

    受体参数压缩文件,tar.gz格式,可由GMX Receptor Parameterization模块生成。

    Ligand GRO

    配体结构文件,多配体输入压缩文件,gro格式,可由GMX Ligand Parameterization模块生成。

    Ligand ITP

    配体参数压缩文件,tar.gz格式,可由GMX Ligand Parameterization模块生成。

    Output Topology

    体系拓扑文件的输出名称

    Output GRO

    体系结构文件的输出名称

    Output ITP

    体系参数压缩文件的输出名称

    Output Index

    体系索引文件的输出名称

    结果说明

    输出结果包括:

    输出文件名称 说明
    system.gro 体系的分子坐标文件
    system_itp.tar.gz 体系平衡模拟时固定原子位置所施加的力
    system.top 体系的拓扑文件
    index.ndx 体系的索引文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    Membrane Solvation

    Introduction

    Membrane Solvation adds water boxes and ions to the input membrane, receptor, and ligand files.

    Parameter Description

    Membrane Topology

    Topology file of the membrane in .top format, can be generated by the GMX Membrane Parameterization module.

    Membrane GRO

    Structure file of the membrane in .gro format, can be generated by the GMX Membrane Parameterization module.

    Membrane ITP

    Compressed parameter file of the membrane in .tar.gz format, can be generated by the GMX Membrane Parameterization module.

    Receptor Topology

    Topology file of the receptor in .top format, can be generated by the GMX Receptor Parameterization module.

    Receptor GRO

    Structure file of the receptor in .gro format, can be generated by the GMX Receptor Parameterization module.

    Receptor ITP

    Compressed parameter file of the receptor in .tar.gz format, can be generated by the GMX Receptor Parameterization module.

    Ligand GRO

    Structure file of the ligand in .gro format (a compressed archive when there are multiple ligands); can be generated by the GMX Ligand Parameterization module.

    Ligand ITP

    Compressed parameter file of the ligand in .tar.gz format, can be generated by the GMX Ligand Parameterization module.

    Output Topology

    Output name of the system topology file.

    Output GRO

    Output name of the system structure file.

    Output ITP

    Output name of the compressed system parameter file.

    Output Index

    Output name of the system index file.

    Result Description

    The output results include:

    Output File Name     Description
    system.gro           Molecular coordinate file of the system
    system_itp.tar.gz    Compressed parameter files describing the forces applied to hold atom positions fixed during equilibration of the system
    system.top           Topology file of the system
    index.ndx            Index file of the system

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: GMX Membrane Parameterization
    Description: GMX Membrane Parameterization模块是根据Amber或者Charmm生成膜结构的GRO,ITP以及TOP文件。 The GMX Membrane Parameterization module generates the GRO, ITP, and TOP files for a membrane structure using the Amber or Charmm force field.
    Author: WECOMPUT
    Release: 2023-06-21 16:31:29
    Reference:

    GMX Membrane Parameterization

    简介

    GMX Membrane Parameterization模块是根据Amber或者Charmm生成膜结构的GRO,ITP以及TOP文件。

    参数说明

    Membrane Structure File

    膜结构文件,PDB格式,必须是纯膜结构,并允许水和离子存在

    Force Field

    只支持“amber”力场和“charmm”力场。默认的“amber”力场。
    需要特别注意的是:

    1. 当选择“charmm”力场时,“GMX Receptor Parameterization”模块力场必须选择“charmm36-jul2020”版本。
    2. 当存在小分子时,有且只能选择“amber”力场进行计算。

    结果说明

    输出结果包括:

    输出文件名称 说明
    membrane.top 膜的拓扑文件
    membrane.gro 膜的结构文件
    membrane_itp.tar.gz 膜的参数压缩文件

    GMX Membrane Parameterization

    Introduction

    The GMX Membrane Parameterization module is used to generate GRO, ITP, and TOP files for membrane structures based on Amber or Charmm force fields.

    Parameter Description

    Membrane Structure File

    The membrane structure file in PDB format. It must be a membrane-only structure, although water and ions may be present.

    Force Field

    Supports only the “amber” force field and the “charmm” force field. The default is the “amber” force field. It is important to note:

    1. When the “charmm” force field is selected, the force field in the “GMX Receptor Parameterization” module must be set to the “charmm36-jul2020” version.
    2. When small molecules are present, only the “amber” force field can be selected for calculations.

    Result Description

    The output results include:

    Output File Name       Description
    membrane.top           Topology file for the membrane
    membrane.gro           Structure file for the membrane
    membrane_itp.tar.gz    Compressed parameter file for the membrane
  • Name: Membrane System Construction
    Description: Membrane System Construction构建膜结构的PDB文件。 Membrane System Construction module builds the PDB file of the membrane structure.
    Author: WECOMPUT
    Release: 2023-06-21 16:29:32
    Reference:

    Membrane System Construction

    简介

    Membrane System Construction构建膜结构的PDB文件。
    需要注意的是:Amber参数涉及有大分子的AMBER力场、小分子的GAFF力场、糖的GLYCAM以及磷脂的LIPID力场,这四个力场是可以兼容的。Charmm也有自己一套力场,涉及有CHARMM力场(适用于大分子、糖、磷脂)和CGenFF力场(适用于小分子),这两个力场是相互兼容的。
    目前WEMOL上只支持GAFF力场的小分子计算,所以当存在小分子时,膜的成分必须为AMBER力场下的。

    参数说明

    Lipid Component

    必须遵循格式:lipid1:lipid2//lipid3,“//”用于区分上膜和下膜,没有“//”表示上膜和下膜中相同的脂质成分!
    注:在charmm力场作用下,支持以下38种脂质构建膜:

    CHL1 SITO ERG1 DLPA DLPC DLPE DLPG DLPS DMPA DMPC DMPE DMPG DMPS DOPA DOPC DOPE DOPG DOPS DPPA DPPC DPPE DPPG DPPS DSPA DSPC DSPE DSPG DSPS POPA POPC POPE POPG POPS SOPA SOPC SOPE SOPG SOPS
    

    注:在charmm力场作用下,还支持以下26种心磷脂膜:

    LACH LACL LBCH LBCL LCCH LCCL LDCH LDCL OACH OACL OCCH OCCL PACL PMCH PMCL POCH POCL PVCL TLCH TLCL TMCH TMCL TOCH TOCL TYCH TYCL
    

    注:在amber力场作用下,支持以下253种脂质构建膜:

    CHL1 AHPA AHPC AHPE AHPG AHPS ALPA ALPC ALPE ALPG ALPS AMPA AMPC AMPE AMPG AMPS AOPA AOPC AOPE AOPG AOPS APPA APPC APPE APPG APPS ASPA ASPC ASPE ASPG ASPS DAPA DAPC DAPE DAPG DAPS DHPA DHPC DHPE DHPG DHPS DLPA DLPC DLPE DLPG DLPS DMPA DMPC DMPE DMPG DMPS DOPA DOPC DOPE DOPG DOPS DPPA DPPC DPPE DPPG DPPS DSPA DSPC DSPE DSPG DSPS HAPA HAPC HAPE HAPG HAPS HLPA HLPC HLPE HLPG HLPS HMPA HMPC HMPE HMPG HMPS HOPA HOPC HOPE HOPG HOPS HPPA HPPC HPPE HPPG HPPS HSPA HSPC HSPE HSPG HSPS LAPA LAPC LAPE LAPG LAPS LHPA LHPC LHPE LHPG LHPS LMPA LMPC LMPE LMPG LMPS LOPA LOPC LOPE LOPG LOPS LPPA LPPC LPPE LPPG LPPS LSPA LSPC LSPE LSPG LSPS MAPA MAPC MAPE MAPG MAPS MHPA MHPC MHPE MHPG MHPS MLPA MLPC MLPE MLPG MLPS MOPA MOPC MOPE MOPG MOPS MPPA MPPC MPPE MPPG MPPS MSPA MSPC MSPE MSPG MSPS OAPA OAPC OAPE OAPG OAPS OHPA OHPC OHPE OHPG OHPS OLPA OLPC OLPE OLPG OLPS OMPA OMPC OMPE OMPG OMPS OPPA OPPC OPPE OPPG OPPS OSPA OSPC OSPE OSPG OSPS PAPA PAPC PAPE PAPG PAPS PHPA PHPC PHPE PHPG PHPS PLPA PLPC PLPE PLPG PLPS PMPA PMPC PMPE PMPG PMPS POPA POPC POPE POPG POPS PSPA PSPC PSPE PSPG PSPS SAPA SAPC SAPE SAPG SAPS SHPA SDPA SHPC SDPC SHPE SDPE SHPG SDPG SHPS SDPS SLPA SLPC SLPE SLPG SLPS SMPA SMPC SMPE SMPG SMPS SOPA SOPC SOPE SOPG SOPS SPPA SPPC SPPE SPPG SPPS PSM SSM
    

    Lipid Ratio

    膜成分比例,格式为ratio1:ratio2//ratio3

    Lipid Number

    膜成分数量比例,格式为number1:number2//number3

    Orientated Structure File

    定向结构文件,pdb格式

    Ions

    添加离子类型,格式为ion1:ion2//ion3,“//”用于区分上下膜,没有“//”表示上下膜中离子成分相同!支持以下5种离子:NA、K、CL、CA、MG。

    Ions Concentration

    离子成分比例,格式为conc1:conc2//conc3,与Ion参数顺序相同

    Ions Number

    离子成分数量比例,格式为number1:number2//number3,与Ion参数顺序相同

    Force Field

    只支持“amber”力场和“charmm”力场。默认的“amber”力场

    Length of XY

    膜的X轴和Y轴长度,默认为50 Å

    Length of Z

    膜的Z轴长度,默认为100 Å

    结果说明

    输出结果包括:

    输出文件名称 说明
    membrane_lipid.pdb 纯膜体系下生成的结构文件,当存在配体或者受体时不会生成该文件。
    membrane_orientation.pdb 膜与受体/配体/复合物的结构文件,纯膜时不生成该文件。
    orientation.pdb 受体/配体/复合物的取向结构,纯膜时不生成该文件。

    Membrane System Construction

    Introduction

    Membrane System Construction is used to build PDB files for membrane structures. It is important to note that the Amber parameters involve the AMBER force field for macromolecules, the GAFF force field for small molecules, the GLYCAM force field for sugars, and the LIPID force field for phospholipids. These four force fields are compatible. Charmm also has its own set of force fields, including the CHARMM force field (for macromolecules, sugars, and phospholipids) and the CGenFF force field (for small molecules), which are mutually compatible. Currently, WEMOL only supports calculations for small molecules using the GAFF force field, so when small molecules are present, the membrane components must be under the AMBER force field.

    Parameter Description

    Lipid Component

    Must follow the format: lipid1:lipid2//lipid3. “//” is used to differentiate between the upper and lower membrane components. If there is no “//”, it indicates the same lipid component in the upper and lower membranes.
    Note: Under the Charmm force field, the membrane construction supports the following 38 lipid types:

    CHL1 SITO ERG1 DLPA DLPC DLPE DLPG DLPS DMPA DMPC DMPE DMPG DMPS DOPA DOPC DOPE DOPG DOPS DPPA DPPC DPPE DPPG DPPS DSPA DSPC DSPE DSPG DSPS POPA POPC POPE POPG POPS SOPA SOPC SOPE SOPG SOPS
    

    Under the Charmm force field, it also supports the following 26 cardiolipin types:

    LACH LACL LBCH LBCL LCCH LCCL LDCH LDCL OACH OACL OCCH OCCL PACL PMCH PMCL POCH POCL PVCL TLCH TLCL TMCH TMCL TOCH TOCL TYCH TYCL
    

    Under the Amber force field, the membrane construction supports 253 lipid types:

    CHL1 AHPA AHPC AHPE AHPG AHPS ALPA ALPC ALPE ALPG ALPS AMPA AMPC AMPE AMPG AMPS AOPA AOPC AOPE AOPG AOPS APPA APPC APPE APPG APPS ASPA ASPC ASPE ASPG ASPS DAPA DAPC DAPE DAPG DAPS DHPA DHPC DHPE DHPG DHPS DLPA DLPC DLPE DLPG DLPS DMPA DMPC DMPE DMPG DMPS DOPA DOPC DOPE DOPG DOPS DPPA DPPC DPPE DPPG DPPS DSPA DSPC DSPE DSPG DSPS HAPA HAPC HAPE HAPG HAPS HLPA HLPC HLPE HLPG HLPS HMPA HMPC HMPE HMPG HMPS HOPA HOPC HOPE HOPG HOPS HPPA HPPC HPPE HPPG HPPS HSPA HSPC HSPE HSPG HSPS LAPA LAPC LAPE LAPG LAPS LHPA LHPC LHPE LHPG LHPS LMPA LMPC LMPE LMPG LMPS LOPA LOPC LOPE LOPG LOPS LPPA LPPC LPPE LPPG LPPS LSPA LSPC LSPE LSPG LSPS MAPA MAPC MAPE MAPG MAPS MHPA MHPC MHPE MHPG MHPS MLPA MLPC MLPE MLPG MLPS MOPA MOPC MOPE MOPG MOPS MPPA MPPC MPPE MPPG MPPS MSPA MSPC MSPE MSPG MSPS OAPA OAPC OAPE OAPG OAPS OHPA OHPC OHPE OHPG OHPS OLPA OLPC OLPE OLPG OLPS OMPA OMPC OMPE OMPG OMPS OPPA OPPC OPPE OPPG OPPS OSPA OSPC OSPE OSPG OSPS PAPA PAPC PAPE PAPG PAPS PHPA PHPC PHPE PHPG PHPS PLPA PLPC PLPE PLPG PLPS PMPA PMPC PMPE PMPG PMPS POPA POPC POPE POPG POPS PSPA PSPC PSPE PSPG PSPS SAPA SAPC SAPE SAPG SAPS SHPA SDPA SHPC SDPC SHPE SDPE SHPG SDPG SHPS SDPS SLPA SLPC SLPE SLPG SLPS SMPA SMPC SMPE SMPG SMPS SOPA SOPC SOPE SOPG SOPS SPPA SPPC SPPE SPPG SPPS PSM SSM
    

    Lipid Ratio

    The ratio of membrane components, format is ratio1:ratio2//ratio3.

    Lipid Number

    The number ratio of membrane components, format is number1:number2//number3.
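    The Lipid Component, Lipid Ratio, and Lipid Number fields share one convention: ":" separates entries within a leaflet, and "//" separates the upper leaflet from the lower one. The sketch below shows one way to parse that convention and pair each lipid with its ratio. It is an illustration under stated assumptions, not the platform's parser; the function names are hypothetical, and how the module validates or normalizes the values is not described in the source.

```python
def parse_leaflets(spec):
    """Split 'a:b//c' into (upper, lower) lists.

    Without '//', the same entries are used for both leaflets.
    """
    parts = spec.split("//")
    if len(parts) > 2:
        raise ValueError("at most one '//' separator is allowed")
    upper = [item.strip() for item in parts[0].split(":")]
    lower = [item.strip() for item in parts[1].split(":")] if len(parts) == 2 else list(upper)
    if any(not item for item in upper + lower):
        raise ValueError(f"empty entry in {spec!r}")
    return upper, lower


def leaflet_composition(components, ratios):
    """Pair each lipid name with its ratio, leaflet by leaflet."""
    names_up, names_lo = parse_leaflets(components)
    ratio_up, ratio_lo = parse_leaflets(ratios)
    if len(names_up) != len(ratio_up) or len(names_lo) != len(ratio_lo):
        raise ValueError("components and ratios must have the same layout")
    upper = dict(zip(names_up, (float(r) for r in ratio_up)))
    lower = dict(zip(names_lo, (float(r) for r in ratio_lo)))
    return upper, lower


if __name__ == "__main__":
    upper, lower = leaflet_composition("POPC:CHL1//POPE", "3:1//1")
    print("upper leaflet:", upper)
    print("lower leaflet:", lower)
```

    The same parse_leaflets helper would apply to any other field that uses the ":" and "//" convention.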

    Orientated Structure File

    The oriented structure file in PDB format.

    Ions

    Types of ions to add, format is ion1:ion2//ion3. “//” is used to differentiate between the upper and lower membranes. If there is no “//”, it indicates the same ion component in the upper and lower membranes. It supports the following 5 types of ions: NA, K, CL, CA, MG.

    Ions Concentration

    The concentration ratio of ions, format is conc1:conc2//conc3, in the same order as the Ion parameter.

    Ions Number

    The number ratio of ion components, format is number1:number2//number3, in the same order as the Ion parameter.

    Force Field

    Supports only the “amber” force field and the “charmm” force field. Default is the “amber” force field.

    Length of XY

    The length of the membrane along the X and Y axes, default is 50 Å.

    Length of Z

    The length of the membrane along the Z axis, default is 100 Å.

    Result Description

    The output results include:

    Output File Name            Description
    membrane_lipid.pdb          Structure file generated for a pure membrane system. Not generated when ligands or receptors are present.
    membrane_orientation.pdb    Structure file of the membrane with the receptor/ligand/complex. Not generated for a pure membrane system.
    orientation.pdb             Oriented structure of the receptor/ligand/complex. Not generated for a pure membrane system.
  • Name: Molecule In Membrane
    Description: Molecule In Membrane模块是生成受体/配体/复合物取向位置的结构文件。 The Molecule In Membrane module generates structure files of the receptor/ligand/complex in its oriented position.
    Author: WECOMPUT
    Release: 2023-06-21 15:13:41
    Reference:

    Molecule In Membrane

    简介

    Molecule In Membrane模块是生成受体/配体/复合物取向位置与膜的结构文件。

    参数说明

    Receptor File

    受体结构,PDB格式。如果一个受体含有配体,可以把它们组合成一个受体结构。

    Receptor Position

    “center”,“upper”或“lower”,默认“upper”,即受体相对于膜的位置

    Receptor Orientation

    “inside”或“outside”,默认为“outside”,即n端相对于膜的取向,只有受体在“center”时才有效。

    Receptor Heteroatom

    “yes”或“no”,默认“no”,即当受体定向时是否考虑受体结构中的非受体分子,仅当受体位于“center”时有效。

    Receptor Z Shift

    受体结构的向Z轴位移距离,仅当受体处于“center”时有效。

    Ligand File

    配体结构,PDB格式。通常是指相对于受体的独立配体分子

    Ligand Position

    “center”、“upper”或“lower”,当受体不在“center”时默认为“center”,当受体在“center”时默认为“upper”,即配体相对于膜的位置

    Ligand Orientation

    “inside”或“outside”,默认为“outside”,即n端相对于膜的取向,只有配体在“center”时才有效。

    Ligand Z Shift

    配体结构的向Z轴位移距离,仅当配体处于“center”时有效。

    Ligand Number

    配体分子数,默认为1。只有配体在“upper”或“lower”时才有效

    Length of XY

    膜的X轴和Y轴长度,默认为50 Å

    Length of Z

    膜的Z轴长度,默认为100 Å

    结果说明

    输出结果包括:

    输出文件名称 说明
    orientation.pdb 受体/配体/复合物的结构文件
    orientation_dum.pdb 显示受体/配体/复合物与膜的相对位置的结构文件

    Molecule In Membrane

    Introduction

    The Molecule In Membrane module is used to generate structural files of the orientation of receptors/ligands/complexes relative to a membrane.

    Parameter Description

    Receptor File

    The structure of the receptor in PDB format. If a receptor contains a ligand, they can be combined into a single receptor structure.

    Receptor Position

    “center”, “upper”, or “lower”, default is “upper”, indicating the position of the receptor relative to the membrane.

    Receptor Orientation

    “inside” or “outside”, default is “outside”, indicating the orientation of the N-terminus of the receptor relative to the membrane. This parameter is only effective when the receptor is in the “center” position.

    Receptor Heteroatom

    “yes” or “no”, default is “no”, indicating whether non-receptor molecules in the receptor structure should be considered when orienting the receptor. This parameter is only effective when the receptor is in the “center” position.

    Receptor Z Shift

    The distance the receptor structure is shifted along the Z-axis. This parameter is only effective when the receptor is in the “center” position.

    Ligand File

    The structure of the ligand in PDB format. Typically, this refers to an independent ligand molecule relative to the receptor.

    Ligand Position

    “center”, “upper”, or “lower”, default is “center” when the receptor is not in the “center” position, and default is “upper” when the receptor is in the “center” position, indicating the position of the ligand relative to the membrane.
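    The ligand default depends on where the receptor is placed, which is easy to misread. The sketch below restates the stated defaults as code; it is a hypothetical helper for illustration, not part of the module.

```python
VALID_POSITIONS = ("center", "upper", "lower")


def resolve_positions(receptor_position=None, ligand_position=None):
    """Apply the documented defaults and return (receptor, ligand) positions."""
    # The receptor defaults to "upper".
    receptor = receptor_position or "upper"
    # The ligand defaults to "upper" when the receptor occupies the center,
    # and to "center" otherwise.
    ligand = ligand_position or ("upper" if receptor == "center" else "center")
    for value in (receptor, ligand):
        if value not in VALID_POSITIONS:
            raise ValueError(f"position must be one of {VALID_POSITIONS}, got {value!r}")
    return receptor, ligand


if __name__ == "__main__":
    print(resolve_positions())            # both defaults
    print(resolve_positions("center"))    # receptor in the membrane center
```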

    Ligand Orientation

    “inside” or “outside”, default is “outside”, indicating the orientation of the N-terminus of the ligand relative to the membrane. This parameter is only effective when the ligand is in the “center” position.

    Ligand Z Shift

    The distance the ligand structure is shifted along the Z-axis. This parameter is only effective when the ligand is in the “center” position.

    Ligand Number

    The number of ligand molecules, default is 1. This parameter is only effective when the ligand is in the “upper” or “lower” position.

    Length of XY

    The length of the membrane along the X and Y axes, default is 50 Å.

    Length of Z

    The length of the membrane along the Z axis, default is 100 Å.

    Result Description

    The output results include:

    Output File Name       Description
    orientation.pdb        Structural file of the receptor/ligand/complex
    orientation_dum.pdb    Structural file showing the position of the receptor/ligand/complex relative to the membrane
  • Name: Mutation Energy with Small Molecule
    Description: Mutation Energy with Small Molecule用于预测突变对小分子与蛋白界面能的影响。将一个或多个残基突变为新残基,从而对蛋白质与小分子的结合能进行分析。 Mutation Energy with Small Molecule predicts the effect of mutations on the interface energy between a small molecule and a protein. One or more residues are mutated to new residues, and the resulting change in the binding energy between the protein and the small molecule is analyzed.
    Author: Aldeghi M
    Release: 2023-06-14 11:26:30
    Reference: Aldeghi M, Gapsys V, de Groot BL. Accurate Estimation of Ligand Binding Affinity Changes upon Protein Mutation. ACS Cent Sci. 2018 Dec 26;4(12):1708-1718.

    Mutation Energy with Small Molecule

    简介

    Mutation Energy with Small Molecule用于预测突变对小分子与蛋白界面能的影响。将一个或多个残基突变为新残基,从而对蛋白质与小分子的结合能进行分析。

    参数说明

    Complex Structure File

    复合物结构文件包含小分子,PDB格式。复合物结构文件不能存在非标准氨基酸残基和质子化后的氨基酸残基,如HIE、HID、HIP、ASH以及CYX。

    Mutation File

    突变文件,文本文件包含突变信息,格式如下:
    GB26R;
    GB26H;
    其中G代表原始残基,
    B代表PDB文件中待突变残基所在的链名,
    26代表残基位置编号,
    R, H代表要突变成的突变残基。

    Ligand Name

    如果存在多个小分子时,使用“A:LIG”选择其中一个小分子进行计算。

    结果说明

    输出结果文件为scores.csv,包含信息如下:

    字段名称 说明
    Chain 链名称
    Residue ID 氨基酸编号
    Wild Resname 突变前氨基酸名称缩写
    Mutate Resname 突变后氨基酸名称缩写
    Score (kcal/mol) 突变结合自由能变化∆∆G

    其中结合能变化∆∆G标准如下:
    结合力提升:∆∆G <= -1.0 kcal/mol
    无显著变化:-1.0 kcal/mol < ∆∆G < 1.0 kcal/mol
    结合力下降:∆∆G >= 1.0 kcal/mol

    参考文献

    Aldeghi M, Gapsys V, de Groot BL. Accurate Estimation of Ligand Binding Affinity Changes upon Protein Mutation. ACS Cent Sci. 2018 Dec 26;4(12):1708-1718.

    Mutation Energy with Small Molecule

    Introduction

    Mutation Energy with Small Molecule predicts the effect of mutations on the interface energy between a small molecule and a protein. One or more residues are mutated to new residues, and the resulting change in the binding energy between the protein and the small molecule is analyzed.

    Parameter Description

    Complex Structure File

    The complex structure file, in PDB format, must include the small molecule. It must not contain non-standard amino acid residues or residues renamed for their protonation state, such as HIE, HID, HIP, ASH, and CYX.

    Mutation File

    The mutation file is a text file containing mutation information in the following format:
    GB26R;
    GB26H;
    Where:
    G is the original residue,
    B is the chain in the PDB file that contains the residue to be mutated,
    26 is the residue position number,
    R and H are the residues to mutate to.

    Ligand Name

    If there are multiple small molecules, use “A:LIG” to select one of the small molecules for calculation.

    Result Description

    The output file is scores.csv, containing the following information:

    Field Name          Description
    Chain               Chain name
    Residue ID          Amino acid number
    Wild Resname        Abbreviated name of the amino acid before mutation
    Mutate Resname      Abbreviated name of the amino acid after mutation
    Score (kcal/mol)    Change in binding free energy upon mutation, ∆∆G

    The binding free energy change ∆∆G is interpreted as follows:
    Increase in binding affinity: ∆∆G <= -1.0 kcal/mol
    No significant change: -1.0 kcal/mol < ∆∆G < 1.0 kcal/mol
    Decrease in binding affinity: ∆∆G >= 1.0 kcal/mol
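    These thresholds translate directly into a small classifier, shown here as a sketch for post-processing the Score column of scores.csv. The function name and the returned labels are choices made for this example.

```python
def classify_ddg(ddg_kcal_per_mol):
    """Label a binding free energy change using the thresholds given above."""
    if ddg_kcal_per_mol <= -1.0:
        return "increase in binding affinity"
    if ddg_kcal_per_mol >= 1.0:
        return "decrease in binding affinity"
    return "no significant change"


if __name__ == "__main__":
    for value in (-1.8, -1.0, 0.3, 1.0, 2.4):
        print(f"{value:+.1f} kcal/mol: {classify_ddg(value)}")
```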

    Reference

    Aldeghi M, Gapsys V, de Groot BL. Accurate Estimation of Ligand Binding Affinity Changes upon Protein Mutation. ACS Cent Sci. 2018 Dec 26;4(12):1708-1718.

  • Name: Protein Mutation Predictor
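    Description: Protein Mutation Predictor是预测蛋白结构中的各位点的潜在氨基酸突变。其主要是通过提取氨基酸的微环境,用深度学习(CNN卷积神经网络)进行训练建模,用于预测蛋白位点(已知微环境)的可能氨基酸突变。 The Protein Mutation Predictor predicts potential amino acid mutations at each site of a protein structure. It extracts the microenvironment of each amino acid, trains a deep learning model (a convolutional neural network, CNN) on it, and uses the model to predict likely amino acid mutations at protein sites whose microenvironment is known.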
    Author: Raghav Shroff
    Release: 2023-06-14 10:14:51
    Reference: Shroff R, Cole AW, Diaz DJ, Morrow BR, Donnell I, Annapareddy A, Gollihar J, Ellington AD, Thyer R. Discovery of Novel Gain-of-Function Mutations Guided by Structure-Based Deep Learning. ACS Synth Biol. 2020 Nov 20;9(11):2927-2935.

    Protein Mutation Predictor

    简介

    Protein Mutation Predictor是预测蛋白结构中的各位点的潜在氨基酸突变。其主要是通过提取氨基酸的微环境,用深度学习(CNN卷积神经网络)进行训练建模,用于预测蛋白位点(已知微环境)的可能氨基酸突变。
    氨基酸环境提取过程:

    • 用一个3D立方体网格(格点间距10A),其大小确保覆盖整个蛋白
    • 遍历每个格点,离该格点最近的原子,其所属的氨基酸设为该格点的中心氨基酸
    • 以该中心氨基酸的Cβ原子作为中心,用20A的box为范围提取周围信息,做法是:
      • CA-N键和CA-C键构成box的x-y平面,z轴方向是与CA-CB键的方向同一侧
      • 甘氨酸GLY的Cβ原子坐标为其他19个氨基酸Cβ原子坐标的平均值
      • 盒子分隔为1A的立方小格(体素),分别提取小格内的信息,分为不同类别,每一种类别作为后续的一个数据channel:
        • 是否存在碳原子(有,该体素就标记1,反之为0)
        • 是否存在氧原子(同上)
        • 是否存在氮原子(同上)
        • 是否存在硫原子(同上)
        • 是否存在氢原子(同上)
        • 该原子的部分电荷(partial charges)
        • 该原子的溶剂可接触面积

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式

    结果说明

    输出结果文件为summary.csv,包含信息如下:

    字段名称 说明
    Actual_AA 蛋白氨基酸残基标识
    RES_SEQ 氨基酸索引(PDB文件中)
    Max_prob_AA 该位点概率最高的氨基酸残基突变类型
    ALA_prob,CYS_prob… 该位点突变为各类其他残基的概率值,在0-1之间,值越大概率越高

    参考文献

    Shroff R, Cole AW, Diaz DJ, Morrow BR, Donnell I, Annapareddy A, Gollihar J, Ellington AD, Thyer R. Discovery of Novel Gain-of-Function Mutations Guided by Structure-Based Deep Learning. ACS Synth Biol. 2020 Nov 20;9(11):2927-2935.

    Protein Mutation Predictor

    Introduction

    The Protein Mutation Predictor predicts potential amino acid mutations at various positions in protein structures. It primarily involves extracting the amino acid microenvironment, training a model using deep learning (CNN convolutional neural network), and predicting possible amino acid mutations at protein sites with known microenvironments.
    Amino acid environment extraction process:

    • Build a 3D cubic grid (grid spacing of 10 Å) sized to cover the entire protein
    • Iterate over the grid points; for each one, the amino acid that owns the atom closest to the grid point is designated the central amino acid of that grid point
    • Center a 20 Å box on the Cβ atom of this central amino acid and extract the surrounding information, where:
      • The CA-N and CA-C bonds define the x-y plane of the box, and the z-axis points to the same side as the CA-CB bond
      • For glycine (GLY), the Cβ coordinates are taken as the average of the Cβ coordinates of the other 19 amino acids
      • The box is divided into 1 Å cubic voxels; the information in each voxel is extracted by category, and each category becomes one data channel for later processing:
        • Presence of carbon atom (marked as 1 if present, 0 otherwise)
        • Presence of oxygen atom (same as above)
        • Presence of nitrogen atom (same as above)
        • Presence of sulfur atom (same as above)
        • Presence of hydrogen atom (same as above)
        • Partial charges of the atom
        • Solvent accessible surface area of the atom
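    The box construction in the steps above can be sketched in plain Python. This is a simplified illustration, not the published implementation: it builds the local axes from the N, CA, C, and CB coordinates, centers a 20 Å box on CB, and records which 1 Å voxels are occupied for each of the five element channels. The partial charge and solvent accessible surface area channels, and the averaged Cβ position for glycine, are left out. All function names are hypothetical.

```python
import math

ELEMENT_CHANNELS = ("C", "O", "N", "S", "H")
BOX = 20.0    # box edge length in Å
VOXEL = 1.0   # voxel edge length in Å


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    norm = math.sqrt(_dot(a, a))
    return (a[0] / norm, a[1] / norm, a[2] / norm)


def local_frame(n, ca, c, cb):
    """Return orthonormal (x, y, z) axes for the residue-centered box.

    x lies along CA->N, the x-y plane is the N-CA-C plane, and z points
    to the same side of that plane as the CA->CB bond.
    """
    x = _unit(_sub(n, ca))
    z = _unit(_cross(_sub(n, ca), _sub(c, ca)))
    if _dot(z, _sub(cb, ca)) < 0:
        z = (-z[0], -z[1], -z[2])
    y = _cross(z, x)  # completes a right-handed frame
    return x, y, z


def voxelize(atoms, n, ca, c, cb):
    """Map atoms to occupied voxels per element channel.

    atoms is an iterable of (element, (x, y, z)) pairs. The result maps each
    element symbol to the set of (i, j, k) voxel indices it occupies.
    """
    axes = local_frame(n, ca, c, cb)
    size = int(BOX / VOXEL)
    occupied = {element: set() for element in ELEMENT_CHANNELS}
    for element, xyz in atoms:
        if element not in occupied:
            continue
        rel = _sub(xyz, cb)  # the box is centered on the CB atom
        local = [_dot(rel, axis) for axis in axes]
        index = tuple(int(math.floor((v + BOX / 2) / VOXEL)) for v in local)
        if all(0 <= i < size for i in index):
            occupied[element].add(index)
    return occupied
```

    A dense tensor for a CNN would be obtained by writing 1 into a 20 x 20 x 20 array per channel at each occupied index; the sets are used here only to keep the example free of external dependencies.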

    Parameter Description

    Structure PDB File

    Protein structure file in PDB format

    Result Description

    The output file is summary.csv, containing the following information:

    Field Name             Description
    Actual_AA              Amino acid residue identifier
    RES_SEQ                Amino acid index (in the PDB file)
    Max_prob_AA            Amino acid residue mutation type with the highest probability at that position
    ALA_prob, CYS_prob…    Probability of mutating to each of the other residue types at that position, from 0 to 1; higher values indicate higher probability
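    One common way to use summary.csv is to list the positions where the most probable residue differs from the residue actually present. The sketch below does this with the standard csv module. It assumes that the probability columns follow the "<RES>_prob" pattern shown above and that Actual_AA uses the same residue codes as the column prefixes; the sample rows are made up for illustration and show only three probability columns.

```python
import csv
import io


def suggest_mutations(csv_text, min_prob=0.5):
    """Return (RES_SEQ, actual, suggested, probability) tuples, best first."""
    suggestions = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Collect the per-residue probabilities, e.g. "ALA_prob" -> "ALA".
        probs = {name[:-len("_prob")]: float(value)
                 for name, value in row.items() if name.endswith("_prob")}
        best = max(probs, key=probs.get)
        if best != row["Actual_AA"] and probs[best] >= min_prob:
            suggestions.append((row["RES_SEQ"], row["Actual_AA"], best, probs[best]))
    return sorted(suggestions, key=lambda item: item[3], reverse=True)


# Made-up sample data in the documented column layout.
SAMPLE = """Actual_AA,RES_SEQ,Max_prob_AA,ALA_prob,CYS_prob,GLY_prob
ALA,10,ALA,0.80,0.05,0.15
GLY,11,CYS,0.10,0.70,0.20
CYS,12,GLY,0.30,0.30,0.40
"""

if __name__ == "__main__":
    for res_seq, actual, suggested, prob in suggest_mutations(SAMPLE):
        print(f"position {res_seq}: {actual} -> {suggested} (p = {prob:.2f})")
```

    The min_prob cutoff is an arbitrary choice for the example; the source does not recommend a threshold.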

    Reference

    Shroff R, Cole AW, Diaz DJ, Morrow BR, Donnell I, Annapareddy A, Gollihar J, Ellington AD, Thyer R. Discovery of Novel Gain-of-Function Mutations Guided by Structure-Based Deep Learning. ACS Synth Biol. 2020 Nov 20;9(11):2927-2935.

  • Name: Immunogenicity Prediction (AlphaMHC v2.1)
    Description: AlphaMHC算法采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条公开及私有的与免疫原性相关的湿实验数据(包括亲和力数据、NGS数据、质谱数据等)进行训练,成功实现了从序列到临床免疫原性风险的端到端的预测,并通过上百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)进行验证,AlphaMHC能够准确区分免疫原性的高低,ROC-AUC达0.87,准确性超过80%(部分测试集高达91%),表现出比现有方法显著更优的预测性能,是已知唯一一个可以得到临床数据验证的算法。 The AlphaMHC algorithm utilizes popular NLP natural language processing technology and a novel multimodal fusion deep neural network architecture. It integrates nearly one billion publicly and privately available wet lab experimental data related to immunogenicity (including affinity data, NGS data, mass spectrometry data, etc.) for training. It successfully achieves end-to-end prediction of immunogenicity risk from sequence to clinical application and has been validated using over a hundred clinical real-world immunogenicity data from FDA and EMA (including mono-/multi-specific antibodies and recombinant proteins). AlphaMHC can accurately distinguish between high and low immunogenicity, with an ROC-AUC of 0.87 and an accuracy of over 80% (up to 91% for some test sets). It exhibits significantly superior predictive performance compared to existing methods and is the only algorithm known to have been validated with clinical data.
    Author: WECOMPUT
    Release: 2022-05-03 13:53:09
    Reference:

    Immunogenicity Prediction (AlphaMHC v2.0)

    简介

    AlphaMHC是唯信计算为解决现有预测方法的已知问题而开发的下一代免疫原性预测算法,采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条公开及私有的与免疫原性相关的湿实验数据(包括亲和力数据、NGS数据、质谱数据等)进行训练,成功实现了从序列到临床免疫原性风险的端到端的预测,并通过上百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)进行验证,AlphaMHC能够准确区分免疫原性的高低,ROC-AUC达0.87,准确性超过80%(部分测试集高达91%),表现出比现有方法显著更优的预测性能,是已知唯一一个可以得到临床数据验证的算法。

    算法特点:

    • 显著扩展训练集空间。 除了公开可用的数据集外,我们还从文献、专利和湿实验室合作者那里收集了更多数据。 除了最常用的亲和力数据外,还考虑了更多的数据类型,例如 T 细胞激活数据、蛋白质组学数据、抗体测序数据等,它们贡献了超过 10 亿个数据条目/点。
    • 与仅预测 MHC 肽结合亲和力的大多数其他算法不同,AlphaMHC 预测临床水平的最终免疫原性,同时考虑除肽结合之外的其他重要影响因素,例如免疫呈递/耐受性、等位基因频率等。
    • 针对多达 5000 多个 MHC-II 等位基因训练深度神经网络模型。 在并行计算的支持下,所有支持的 MHC 等位基因都可以以高通量的方式同时计算,而之前的方法通常只能一次性指定少数HLA等位基因。

    参数说明

    Fasta File

    蛋白序列文件,FASTA格式。支持多条链以及多分子模式。对于多分子模式,序列名称规则为:分子名.链名,例如:

    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    >mol2.A
    XXXXXXX
    >mol2.B
    XXXXXXX
    

    HLA Allotypes

    预测HLA等位基因型。推荐使用"rep",速度更快。
    rep:32个代表性等位基因型,适用于一般人群。
    all:用于训练的所有非冗余人类等位基因型(1166个)

    Binding Affinity Profile

    导出每个 HLA 等位基因的结合亲和力曲线图,展示了与每条蛋白质链的 N 端到 C 端的所有15肽的结合亲和力。注意:即使“HLA Allotypes”选项设置为全部,也只会绘制代表性 HLA的曲线。

    结果说明

    输出结果包括:

    输出文件名称 说明
    score_immunogenicity_risk.csv 该结果展示了预测的每个分子的免疫原性风险(自动将同个分子的多条链的预测的潜在T细胞表位的结果进行汇总后综合评估所得)。
    detail_tce_of_chains.csv 该结果评估可以进行定向改造的HLA呈递表位,以降低免疫原性。
    BAProfile_of_mol.chain.png 不同HLA亚型与每条链的不同位置的亲和力的分布情况,更精细的展示了不同HLA的亲和力的差异。 从左到右的分布图表示从其中一条蛋白质链的N末端移动到C末端的15聚肽窗口的结合亲和力。 即使“HLA同种异型”选项设置为“全部”,也只会包括代表性的HLA等位基因。
    Heatmap_of_mol.chain.png 每个肽与代表性HLA之间结合亲和力的热图。Z-score是pAffinity,值越大(浅色)意味着预测结合越强。

    其中score_immunogenicity_risk.csv包括信息如下:

    字段名称 说明
    Protein_Id 蛋白序列名称
    Risk 预测的分子整体风险评估,高风险的分子为high,否则为low。
    Score 表位总长度,是整体风险评估的重要依据。
    TCE_Sequences 表位序列

    其中detail_tce_of_chains.csv包括信息如下:

    字段名称 说明
    Sequences 蛋白序列名称
    TCE 每条链的相对的高风险的T细胞表位
    Alleles_Number 递呈的HLA亚型数
    Alleles 递呈的HLA亚型
    Min_Affinity 亲和力最小值
    Median_Affinity 亲和力中位数
    Max_Affinity 亲和力最大值

    Immunogenicity Prediction (AlphaMHC v2.0)

    Introduction

    AlphaMHC is a next-generation immunogenicity prediction algorithm developed by Wecomput to address known problems with existing prediction methods. It uses natural language processing (NLP) techniques and a new multi-modal fusion deep neural network architecture, and is trained on nearly one billion public and private wet-lab data points related to immunogenicity (affinity data, NGS data, mass spectrometry data, and others). It predicts clinical immunogenicity risk end to end from sequence, and has been validated against over a hundred real-world clinical immunogenicity records from the FDA and EMA, covering mono- and multi-specific antibodies and recombinant proteins. AlphaMHC distinguishes high from low immunogenicity with an ROC-AUC of 0.87 and an accuracy above 80% (up to 91% on some test sets), a significantly better predictive performance than existing methods. It is the only known algorithm that has been validated with clinical data.
    F13.png

    Feature highlights

    • Significantly expanded training set space. Besides the publicly available data sets, we have collected more data from literature, patents, and wet lab collaborators. Besides the most used affinity data, more data types are considered, e.g., T cell activation data, proteomics data, antibody sequencing data, etc., which contributes over 1 billion more data entries/points.
    • Unlike most other algorithms which predict only the MHC-peptide binding affinity, AlphaMHC predicts the eventual immunogenicity at the clinical level, taking into consideration other important influencing factors besides peptide binding, such as immune presentation/tolerance, allele frequency, etc.
    • A deep neural network model is trained for more than 5000 MHC-II alleles. With parallel computing, all supported MHC alleles can be calculated simultaneously in a high-throughput manner, whereas similar methods can usually afford only a few representative alleles within a reasonable time.

    Parameter

    Fasta File

    Protein sequence file in FASTA format. Multiple chains and multiple molecules are supported. In multi-molecule mode, each sequence name must follow the rule molecule name, a dot, then chain name, for example:

    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    >mol2.A
    XXXXXXX
    >mol2.B
    XXXXXXX
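The naming rule above can be illustrated with a small parser. This is a minimal sketch, not the module's own code: the function name `group_chains` is hypothetical, the sequences are made up, and splitting the header on its last dot is an assumption about how molecule and chain names are separated.

```python
from collections import defaultdict


def group_chains(fasta_text):
    """Group FASTA records by molecule using the 'molecule.chain' header rule."""
    molecules = defaultdict(dict)
    mol = chain = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            # Assumption: the chain name is whatever follows the last dot.
            mol, _, chain = name.rpartition(".")
            if not mol:  # no dot: treat the whole name as a single-chain molecule
                mol, chain = name, ""
            molecules[mol][chain] = ""
        elif mol is not None:
            molecules[mol][chain] += line
    return {m: dict(chains) for m, chains in molecules.items()}


# Made-up sequences, for illustration only.
example = """>mol1.A
EVQLVESGG
>mol1.B
DIQMTQSPS
>mol2.A
QVQLQESGP
"""
groups = group_chains(example)
print(groups)
```

Grouping by molecule in this way mirrors how the per-molecule risk is described as an aggregate over all chains of the same molecule.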
    

    HLA Allotypes

    The set of HLA alleles to predict against. “rep” is recommended because it is faster.
    rep: 32 representative alleles, applicable to the general population.
    all: all 1166 non-redundant human alleles used for training.

    Binding Affinity Profile

    Export binding affinity curves for each HLA allele, showing the predicted binding affinity of every 15-mer peptide window from the N-terminus to the C-terminus of each protein chain. Note: even if the “HLA Allotypes” option is set to all, curves are plotted only for the representative HLAs.

    Result

    The output includes:

    Output File Name Description
    score_immunogenicity_risk.csv The predicted immunogenicity risk of each molecule, obtained by aggregating the predicted potential T cell epitopes from all chains of the same molecule and evaluating the overall risk.
    detail_tce_of_chains.csv The HLA-presented epitopes that could be targeted for engineering to reduce immunogenicity.
    BAProfile_of_mol.chain.png The binding affinity profile between each chain and the 32 representative HLAs. From left to right, the profile shows the binding affinity of a 15-mer peptide window moving from the N-terminus to the C-terminus of one protein chain. Note: only representative HLA alleles are included, even if the “HLA Allotypes” option is set to “all”.
    Heatmap_of_mol.chain.png Heat map of the binding affinity between each peptide and the representative HLAs. The Z-score shown is pAffinity; a greater value (lighter color) means stronger predicted binding.

    score_immunogenicity_risk.csv contains the following information:

    Field Name Description
    Protein_Id Protein sequence name
    Risk The overall risk assessment for the predicted molecule, with “high” indicating high-risk molecules and “low” indicating low-risk molecules.
    Score The total length of the epitopes, which is an important basis for overall risk assessment.
    TCE_Sequences The epitope sequences
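As a hedged illustration of how this file might be consumed, the sketch below lists the molecules flagged as high risk. It assumes the file is comma-separated with a header row that uses exactly the field names listed above; the helper name `high_risk_molecules` and the sample rows are invented for the example.

```python
import csv
import io


def high_risk_molecules(csv_text):
    """Return the Protein_Id of every row whose Risk column is 'high'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Protein_Id"] for row in reader
            if row["Risk"].strip().lower() == "high"]


# Made-up rows, for illustration only; real output will differ.
sample = (
    "Protein_Id,Risk,Score,TCE_Sequences\n"
    "mol1,high,45,AAAAAAAAAAAAAAA\n"
    "mol2,low,0,\n"
)
flagged = high_risk_molecules(sample)
print(flagged)
```

For a real run, the text would be read from score_immunogenicity_risk.csv instead of the inline sample.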

    detail_tce_of_chains.csv contains the following information:

    Field Name Description
    Sequences Protein sequence name
    TCE Relatively high-risk T cell epitopes of each chain
    Alleles_Number Number of HLA subtypes presenting the epitope
    Alleles The HLA subtypes presenting the epitope
    Min_Affinity Minimum affinity
    Median_Affinity Median affinity
    Max_Affinity Maximum affinity
  • Name: Solvent Exposure (SASA)
    Description: Calculates the solvent exposure of each residue, expressed as its solvent accessible surface area (SASA), from a protein structure (PDB file).
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-06-08 12:56:06
    Reference: NA

    Solvent Exposure (SASA)

    简介

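    Based on a protein structure (PDB file), this module calculates the solvent exposure of each residue, expressed as its solvent accessible surface area (SASA).
    The relative solvent accessible surface area (RSASA, also written rASA) of an amino acid residue measures how exposed the residue is to solvent. It is calculated as:
    RSASA = SASA / MaxSASA
    where SASA is the solvent accessible surface area of the residue and MaxSASA is the maximum solvent accessible surface area of that amino acid type, both in Å².
    To measure the relative solvent accessible surface area of amino acid side chains, MaxSASA values obtained from Gly-X-Gly tripeptides are typically used, where X is the residue of interest. Several MaxSASA scales are listed below.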

    Residue Tien et al. 2013 (theor.)[1] Tien et al. 2013 (emp.)[1] Miller et al. 1987[2] Rose et al. 1985[3]
    Alanine 129.0 121.0 113.0 118.1
    Arginine 274.0 265.0 241.0 256.0
    Asparagine 195.0 187.0 158.0 165.5
    Aspartate 193.0 187.0 151.0 158.7
    Cysteine 167.0 148.0 140.0 146.1
    Glutamate 223.0 214.0 183.0 186.2
    Glutamine 225.0 214.0 189.0 193.2
    Glycine 104.0 97.0 85.0 88.1
    Histidine 224.0 216.0 194.0 202.5
    Isoleucine 197.0 195.0 182.0 181.0
    Leucine 201.0 191.0 180.0 193.1
    Lysine 236.0 230.0 211.0 225.8
    Methionine 224.0 203.0 204.0 203.4
    Phenylalanine 240.0 228.0 218.0 222.8
    Proline 159.0 154.0 143.0 146.8
    Serine 155.0 143.0 122.0 129.8
    Threonine 172.0 163.0 146.0 152.5
    Tryptophan 285.0 264.0 259.0 266.3
    Tyrosine 263.0 255.0 229.0 236.8
    Valine 174.0 165.0 160.0 164.5

    判断溶液可及性的 rASA 阈值

    通常有以下标准:

    rASA >0.5(50%):残基被认为是暴露于溶液的(solvent-exposed)
    rASA < 0.2(20%):残基被认为是埋藏在蛋白质内部的(buried)
    0.2 ≤ rASA ≤ 0.5:残基处于部分暴露状态。
    

    具体阈值的选择可能取决于研究的目的。例如,某些分析可能使用更严格或宽松的标准来划分。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式。

    结果说明

    计算出来的各种溶剂可及表面积值,可根据需求选择需要的类型:

    字段名称 说明
    ResidueType 残基类型
    Chain ID 链名称
    Residue Number 残基编号
    total Total SASA of residue
    polar Polar SASA(极性)
    apolar Apolar SASA(非极性)
    mainChain Main chain SASA
    sideChain Side chain SASA
    relativeTotal* Relative total SASA
    relativePolar Relative polar SASA
    relativeApolar Relative Apolar SASA
    relativeMainChain Relative main chain SASA
    relativeSideChain* Relative side chain SASA

    *常用的比如:

    • relativeSideChain,残基侧链的暴露程度(很多时候主链不需要考虑)
    • relativeTotal,残基的暴露程度(考虑了侧链+主链)

    参考文献

    https://en.wikipedia.org/wiki/Relative_accessible_surface_area
    Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilites of residues in proteins. PLoS One. 2013 Nov 21;8(11):e80635.
    Miller S, Lesk AM, Janin J, Chothia C. The accessible surface area and stability of oligomeric proteins. Nature. 1987 Aug 27-Sep 2;328(6133):834-6.
    Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH. Hydrophobicity of amino acid residues in globular proteins. Science. 1985 Aug 30;229(4716):834-8.
    https://freesasa.github.io/doxygen/Geometry.html

    Solvent Exposure (SASA)

    Introduction

    Based on a protein structure (PDB file), this module calculates the solvent exposure of each residue, expressed as its solvent accessible surface area (SASA). The relative solvent accessible surface area (RSASA, also written rASA) of an amino acid residue measures how exposed the residue is to solvent. It is calculated as:
    RSASA = SASA / MaxSASA
    Here, SASA is the solvent accessible surface area of the residue and MaxSASA is the maximum solvent accessible surface area of that amino acid type, both in Å². To measure the relative solvent accessible surface area of amino acid side chains, MaxSASA values obtained from Gly-X-Gly tripeptides are typically used, where X is the residue of interest. Several MaxSASA scales are shown below.

    Residue Tien et al. 2013 (theor.)[1] Tien et al. 2013 (emp.)[1] Miller et al. 1987[2] Rose et al. 1985[3]
    Alanine 129.0 121.0 113.0 118.1
    Arginine 274.0 265.0 241.0 256.0
    Asparagine 195.0 187.0 158.0 165.5
    Aspartate 193.0 187.0 151.0 158.7
    Cysteine 167.0 148.0 140.0 146.1
    Glutamate 223.0 214.0 183.0 186.2
    Glutamine 225.0 214.0 189.0 193.2
    Glycine 104.0 97.0 85.0 88.1
    Histidine 224.0 216.0 194.0 202.5
    Isoleucine 197.0 195.0 182.0 181.0
    Leucine 201.0 191.0 180.0 193.1
    Lysine 236.0 230.0 211.0 225.8
    Methionine 224.0 203.0 204.0 203.4
    Phenylalanine 240.0 228.0 218.0 222.8
    Proline 159.0 154.0 143.0 146.8
    Serine 155.0 143.0 122.0 129.8
    Threonine 172.0 163.0 146.0 152.5
    Tryptophan 285.0 264.0 259.0 266.3
    Tyrosine 263.0 255.0 229.0 236.8
    Valine 174.0 165.0 160.0 164.5
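The relation between an absolute SASA value and the scales in this table can be sketched as follows, assuming the usual definition RSASA = SASA / MaxSASA. The dictionary holds a subset of the theoretical Tien et al. 2013 column copied from the table; the three-letter keys and the function name `relative_sasa` are choices made for this example, and the module itself may use a different scale.

```python
# Theoretical MaxSASA values (Å^2) for a few residues, copied from the
# "Tien et al. 2013 (theor.)" column of the table above.
MAX_SASA_TIEN_THEOR = {
    "ALA": 129.0,
    "ARG": 274.0,
    "GLY": 104.0,
    "LYS": 236.0,
    "TRP": 285.0,
}


def relative_sasa(residue, sasa):
    """Return SASA / MaxSASA for a residue given its absolute SASA in Å^2."""
    return sasa / MAX_SASA_TIEN_THEOR[residue.upper()]


# An alanine with 64.5 Å^2 exposed is half as exposed as the reference maximum.
print(relative_sasa("ALA", 64.5))
```

Because the four scales differ by up to tens of Å², the same absolute SASA can give noticeably different relative values depending on the scale chosen.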

    Parameter Description

    Structure PDB File

    Protein structure file in PDB format.

    Result Description

    Several types of solvent accessible surface area are calculated for each residue; use whichever fields suit your analysis:

    Field Name Description
    ResidueType Residue type
    Chain ID Chain name
    Residue Number Residue number
    total Total SASA of residue
    polar Polar SASA
    apolar Apolar SASA
    mainChain Main chain SASA
    sideChain Side chain SASA
    relativeTotal* Relative total SASA
    relativePolar Relative polar SASA
    relativeApolar Relative Apolar SASA
    relativeMainChain Relative main chain SASA
    relativeSideChain* Relative side chain SASA

    *Commonly used fields:

    • relativeSideChain: exposure of the residue side chain (the main chain often does not need to be considered)
    • relativeTotal: exposure of the whole residue (side chain plus main chain)

    Determining Solvent Accessibility with rASA Thresholds

    Typically, the following criteria are used:

    rASA > 0.5 (50%): Residues are considered solvent-exposed.
    rASA < 0.2 (20%): Residues are considered buried within the protein.
    0.2 ≤ rASA ≤ 0.5: Residues are in a partially exposed state.
    

    The choice of specific thresholds may depend on the purpose of the study. For example, some analyses may use stricter or more lenient criteria for classification.
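The three-way classification above can be written as a small function. This is only a sketch: the function name and the category labels are made up for illustration, and the cutoffs are exposed as parameters because, as noted, other thresholds may suit other analyses.

```python
def classify_exposure(rasa, buried_below=0.2, exposed_above=0.5):
    """Classify a residue by its relative accessible surface area (rASA)."""
    if rasa > exposed_above:
        return "exposed"
    if rasa < buried_below:
        return "buried"
    # 0.2 <= rASA <= 0.5 with the default cutoffs
    return "partially exposed"


for value in (0.1, 0.2, 0.35, 0.5, 0.8):
    print(value, classify_exposure(value))
```

The function can be applied to either relativeTotal or relativeSideChain, depending on whether the main chain should count towards exposure.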

    Reference

    Relative accessible surface area - Wikipedia: https://en.wikipedia.org/wiki/Relative_accessible_surface_area
    [1] Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilities of residues in proteins. PLoS One. 2013 Nov 21;8(11):e80635.
    [2] Miller S, Lesk AM, Janin J, Chothia C. The accessible surface area and stability of oligomeric proteins. Nature. 1987 Aug 27-Sep 2;328(6133):834-6.
    [3] Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH. Hydrophobicity of amino acid residues in globular proteins. Science. 1985 Aug 30;229(4716):834-8.
    Geometry - FreeSASA Documentation: https://freesasa.github.io/doxygen/Geometry.html

  • Name: Multiple Sequence Alignment (MAFFT)
    Description: A multiple sequence alignment program based on MAFFT that supports both protein (amino acid) and nucleic acid sequences.
    Tags: undefined
    Author: Kazutaka Katoh
    Release: 2023-06-06 00:00:00
    Reference: Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005 Jan 20;33(2):511-8.

    Multiple Sequence Alignment (MAFFT)

    简介

    基于MAFFT的多序列比对工具,支持蛋白和核酸序列的比对。

    参数说明

    Sequence File

    蛋白或者核酸的序列文件,FASTA格式

    结果说明

    输出结果为多序列比对后的结果文件:alignment.fasta

    参考文献

    Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005 Jan 20;33(2):511-8.
    https://mafft.cbrc.jp/alignment/software/manual/manual.html

    Multiple Sequence Alignment (MAFFT)

    Introduction

    A MAFFT-based multiple sequence alignment tool that supports both protein and nucleic acid sequences.

    Parameter Description

    Sequence File

    Sequence file containing protein or nucleic acid sequences in FASTA format.

    Result Description

    The output result is the aligned sequences saved in the file: alignment.fasta.

    Reference

    Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005 Jan 20;33(2):511-8.
    MAFFT Manual: https://mafft.cbrc.jp/alignment/software/manual/manual.html

  • Name: mRNA Optimization (LinearDesign)
    Description: Combines sequence optimization based on the codon adaptation index (CAI) with optimization based on the minimum free energy (MFE) of folding, producing mRNA sequences that are both stable and efficiently translated within minutes.
    Tags: undefined
    Author: Zhang H
    Release: 2023-06-01 14:23:54
    Reference: Zhang, H., Zhang, L., Lin, A. et al. Algorithm for Optimized mRNA Design Improves Stability and Immunogenicity. Nature. 2023 May 2.

    mRNA Optimization (LinearDesign)

    简介

    mRNA Optimization (LinearDesign)模块是将基于密码子适应指数(codon adaptation index,CAI)的序列优化和基于折叠最小自由能(Minimum Free Energy,MFE)序列优化结合起来,能够在数分钟之内通过算法得到即稳定又能够高效翻译的mRNA序列。
    image.png

    参数说明

    Protein Sequence File

    待优化的mRNA序列对应的蛋白序列,支持多条,Fasta格式。

    LAMBDA

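    Controls the balance between the two optimization objectives, MFE and CAI. The default is 0.0. The value must be greater than or equal to 0; larger values weight the optimization more strongly towards CAI.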

    结果说明

    输出结果为result.txt,包含信息如下:

    字段名称 说明
    mRNA sequence 优化后蛋白对应的mRNA序列
    mRNA structure mRNA的二级结构
    mRNA folding free energy mRNA折叠自由能
    mRNA CAI mRNA密码子适应指数值

    参考文献

    Zhang, H., Zhang, L., Lin, A. et al. Algorithm for Optimized mRNA Design Improves Stability and Immunogenicity. Nature. 2023 May 2.

    mRNA Optimization (LinearDesign)

    Introduction

    The mRNA Optimization (LinearDesign) module combines sequence optimization based on the Codon Adaptation Index (CAI) with optimization based on the Minimum Free Energy (MFE) of folding. Within minutes, it can produce mRNA sequences that are both stable and efficiently translated.
    image.png

    Parameter Description

    Protein Sequence File

    Protein sequence corresponding to the mRNA sequence to be optimized, supports multiple sequences in Fasta format.

    LAMBDA

    Controls the balance between the two optimization objectives, MFE and CAI. The default is 0.0. The value must be greater than or equal to 0; larger values weight the optimization more strongly towards CAI.

    Result Description

    The output result is stored in result.txt, containing the following information:

    Field Name Description
    mRNA sequence Optimized mRNA sequence corresponding to the protein
    mRNA structure Secondary structure of the mRNA
    mRNA folding free energy Free energy of mRNA folding
    mRNA CAI Codon Adaptation Index value of the mRNA
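To make the reported CAI value easier to interpret, here is a minimal sketch of the standard definition: CAI is the geometric mean of the relative adaptiveness weights of the codons in a sequence. The weight table below is invented for illustration and is not a real codon usage table, and the module's own implementation may differ in details such as which codons are excluded.

```python
import math


def cai(codons, weights):
    """Geometric mean of per-codon relative adaptiveness weights (0 < w <= 1)."""
    logs = [math.log(weights[codon]) for codon in codons]
    return math.exp(sum(logs) / len(logs))


# Hypothetical weights: 1.0 marks the most frequent synonymous codon.
toy_weights = {"GCC": 1.0, "GCU": 0.5, "AAG": 1.0, "AAA": 0.25}

optimal = cai(["GCC", "AAG"], toy_weights)
mixed = cai(["GCC", "GCU", "AAG", "AAA"], toy_weights)
print(optimal, mixed)
```

A sequence built only from the most frequent codons scores 1.0, and every less frequent codon pulls the value down, which is why raising LAMBDA pushes the design towards commonly used codons.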

    Reference

    Zhang, H., Zhang, L., Lin, A. et al. Algorithm for Optimized mRNA Design Improves Stability and Immunogenicity. Nature. 2023 May 2.

  • Name: PSI4 EDA
    Description: The PSI4 EDA module uses Symmetry-Adapted Perturbation Theory (SAPT) to decompose the interaction energy between fragments into its energy components.
    Tags: undefined
    Author: Smith DGA
    Release: 2023-05-31 15:34:44
    Reference: Smith DGA, Burns LA, Simmonett AC, et al. Psi4 1.4: Open-source software for high-throughput quantum chemistry. J Chem Phys. 2020 May 14;152(18):184108.

    PSI4 EDA

    简介

    PSI4 EDA模块是基于对称匹配微扰理论(Symmetry-Adapted Perturbation Theory, SAPT)将片段相互作用能进行能量分解。

    参数说明

    Molecular Structure File

    分子结构文件,xyz、pdb、mol、mol2、gjf、com或者fchk格式。

    Task Type

    计算任务的类型:当前仅为energy decomposition。

    Fragment1

    片段1的原子编号,例如:1-3,5,7。

    Fragment2

    片段2的原子编号,例如:1-3,5,7。

    Method Basis

    选择SAPT方法和基组。sSAPT0/jun-cc-pVDZ、SAPT2+/aug-cc-pVDZ、SAPT2+(3)dMP2/aug-cc-pVTZ,精度依次提升,默认 sSAPT0/jun-cc-pVDZ。

    Charge of Fragment1

    片段1分子总电荷,默认0。

    Multiplicity of Fragment1

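    Spin multiplicity of fragment 1 (usually the number of unpaired electrons + 1). Only closed-shell systems are currently supported. Default: 1.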

    Charge of Fragment2

    片段2分子总电荷,默认0。

    Multiplicity of Fragment2

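    Spin multiplicity of fragment 2 (usually the number of unpaired electrons + 1). Only closed-shell systems are currently supported. Default: 1.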

    结果说明

    sapt.out文件为计算结果输出信息。

    参考文献

    Smith DGA, Burns LA, Simmonett AC, et al. Psi4 1.4: Open-source software for high-throughput quantum chemistry. J Chem Phys. 2020 May 14;152(18):184108.

    PSI4 EDA

    Introduction

    The PSI4 EDA module uses Symmetry-Adapted Perturbation Theory (SAPT) to decompose the interaction energy between two fragments into its energy components.

    Parameter Description

    Molecular Structure File

    Molecular structure file in formats such as xyz, pdb, mol, mol2, gjf, com, or fchk.

    Task Type

    Type of computation task, currently only supports energy decomposition.

    Fragment1

    Atom indices for fragment 1, e.g., 1-3,5,7.

    Fragment2

    Atom indices for fragment 2, e.g., 1-3,5,7.
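The fragment specification format shown in the example (comma-separated indices and inclusive ranges) can be expanded as sketched below. The function name `parse_atom_indices` is hypothetical and the module's own parser may accept other forms; the overlap check at the end is an extra sanity step suggested here, not a documented feature.

```python
def parse_atom_indices(spec):
    """Expand a spec such as '1-3,5,7' into a list of 1-based atom indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            indices.extend(range(start, end + 1))  # ranges are inclusive
        else:
            indices.append(int(part))
    return indices


fragment1 = parse_atom_indices("1-3,5,7")
fragment2 = parse_atom_indices("4,6,8-10")
# The two fragments should not share atoms.
overlap = set(fragment1) & set(fragment2)
print(fragment1, fragment2, overlap)
```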

    Method Basis

    Selects the SAPT method and basis set. The options are sSAPT0/jun-cc-pVDZ, SAPT2+/aug-cc-pVDZ, and SAPT2+(3)dMP2/aug-cc-pVTZ, in order of increasing accuracy. The default is sSAPT0/jun-cc-pVDZ.

    Charge of Fragment1

    Total charge of fragment 1, default is 0.

    Multiplicity of Fragment1

    Multiplicity of fragment 1 (usually the number of unpaired electrons + 1), currently supports only closed-shell, default is 1.

    Charge of Fragment2

    Total charge of fragment 2, default is 0.

    Multiplicity of Fragment2

    Multiplicity of fragment 2 (usually the number of unpaired electrons + 1), currently supports only closed-shell, default is 1.

    Result Description

    The sapt.out file contains the output information of the computation results.

    Reference

    Smith DGA, Burns LA, Simmonett AC, et al. Psi4 1.4: Open-source software for high-throughput quantum chemistry. J Chem Phys. 2020 May 14;152(18):184108.

  • Name: Extended Tight Binding Molecular Dynamics (XTB MD)
    Description: Extended Tight Binding Molecular Dynamics (XTB MD) is a molecular dynamics simulation based on tight-binding quantum chemistry methods (similar to semi-empirical DFT) that can simulate the dynamics of large systems with thousands of atoms.
    Tags: undefined
    Author: Bannwarth C
    Release: 2023-05-31 15:06:57
    Reference: Bannwarth C, Caldeweyher E, Ehlert S, et al. Extended tight-binding quantum chemistry methods. WIREs Comput Mol Sci. 2021; 11:e1493.

    Extended Tight Binding Molecular Dynamics (XTB MD)

    简介

    Extended Tight Binding Molecular Dynamics (XTB MD)是基于紧束缚量子化学方法 (类似于半经验DFT) 的动力学模拟,可计算上千个原子的大体系的动力学过程。

    参数说明

    Molecular Structure File

    分子结构文件,xyz、pdb、mol、mol2、gjf、com或者fchk格式。

    Simulation Time (ps)

    动力学模拟总时间,单位ps,默认10。

    Time Step (fs)

    动力学模拟步长,单位fs,默认1。

    Trajectory Output Steps (fs)

    轨迹文件输出的时间间隔,单位fs,默认100。

    Temperature

    动力学模拟温度,单位K,默认298.15。

    Theory Version

    GFNn-xTB理论的版本。GFN0-xTB、GFN1-xTB、GFN2-xTB,默认GFN2-xTB。

    Solvent

    选择隐式溶剂模型。gas、toluene、thf、methanol、h2o、ether、chcl3、acetonitrile、acetone、cs2,默认气相条件(gas)。

    Charge

    分子总电荷,默认0。

    Spin Multiplicity

    Spin multiplicity of the molecule (usually the number of unpaired electrons + 1). Default: 1.

    结果说明

    输出结果包括:

    输出文件名称 说明
    xtb.trj 动力学过程坐标轨迹文件,为xyz格式。后缀改为.xyz可通过支持.xyz格式的可视化软件查看模拟动画。文件里面每一帧第二行记录了能量信息。
    result.out 计算结果输出信息

    参考文献

    Bannwarth C, Caldeweyher E, Ehlert S, et al. Extended tight-binding quantum chemistry methods. WIREs Comput Mol Sci. 2021; 11:e1493.

    Extended Tight Binding Molecular Dynamics (XTB MD)

    Introduction

    Extended Tight Binding Molecular Dynamics (XTB MD) is a dynamical simulation based on tight-binding quantum chemical methods (similar to semi-empirical DFT), capable of simulating the dynamics of large systems with thousands of atoms.

    Parameter Description

    Molecular Structure File

    Molecular structure file in formats such as xyz, pdb, mol, mol2, gjf, com, or fchk.

    Simulation Time (ps)

    Total simulation time in picoseconds, default is 10.

    Time Step (fs)

    Time step for the dynamics simulation in femtoseconds, default is 1.

    Trajectory Output Steps (fs)

    Time interval for outputting trajectory files in femtoseconds, default is 100.
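The three timing parameters are linked by simple arithmetic, sketched below with the documented defaults. The function name is hypothetical, and the frame count is approximate because it is an assumption that exactly one frame is written per output interval (the engine may also write the starting structure).

```python
def md_step_counts(sim_time_ps=10.0, time_step_fs=1.0, output_interval_fs=100.0):
    """Return (integration steps, approximate trajectory frames)."""
    total_fs = sim_time_ps * 1000.0  # 1 ps = 1000 fs
    n_steps = round(total_fs / time_step_fs)
    n_frames = round(total_fs / output_interval_fs)
    return n_steps, n_frames


steps, frames = md_step_counts()
print(steps, frames)
```

With the defaults, a 10 ps run at a 1 fs step takes 10000 integration steps and yields roughly 100 trajectory frames.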

    Temperature

    Temperature for the dynamics simulation in Kelvin, default is 298.15.

    Theory Version

    Version of the GFNn-xTB theory. Options include GFN0-xTB, GFN1-xTB, GFN2-xTB, default is GFN2-xTB.

    Solvent

    Selection of implicit solvent model. Options include gas, toluene, thf, methanol, h2o, ether, chcl3, acetonitrile, acetone, cs2, default is gas phase (gas).

    Charge

    Total charge of the molecule, default is 0.

    Spin Multiplicity

    Spin multiplicity of the molecule (usually the number of unpaired electrons + 1), default is 1.

    Result Description

    The output includes:

    Output File Name Description
    xtb.trj Coordinate trajectory file of the dynamics process in xyz format. Change the suffix to .xyz to view the simulation animation using software that supports .xyz format. Each frame in the file records energy information in the second line.
    result.out Output information of the calculation results.
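Because the trajectory is a sequence of xyz frames, it can be read with a few lines of code. The sketch below assumes the standard multi-frame xyz layout (an atom count line, a comment line, then one line per atom); it returns the comment line unparsed because the exact format of the energy entry is not specified here. The sample frames and their comment text are made up.

```python
def read_xyz_frames(text):
    """Split multi-frame xyz text into (comment, atoms) tuples."""
    lines = text.splitlines()
    frames = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():  # skip stray blank lines between frames
            i += 1
            continue
        n_atoms = int(lines[i].split()[0])
        comment = lines[i + 1].strip()  # second line: holds the energy information
        atoms = []
        for row in lines[i + 2 : i + 2 + n_atoms]:
            symbol, x, y, z = row.split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        frames.append((comment, atoms))
        i += 2 + n_atoms
    return frames


# Two made-up frames of a hydrogen molecule; the comment text is invented.
sample = """2
 energy: -1.000 (made-up comment line)
H 0.0 0.0 0.0
H 0.0 0.0 0.74
2
 energy: -1.001 (made-up comment line)
H 0.0 0.0 0.0
H 0.0 0.0 0.75
"""
trajectory = read_xyz_frames(sample)
print(len(trajectory), trajectory[0][0])
```

For a real run, the text would come from xtb.trj, for example via `open("xtb.trj").read()`.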

    Reference

    Bannwarth C, Caldeweyher E, Ehlert S, et al. Extended tight-binding quantum chemistry methods. WIREs Comput Mol Sci. 2021; 11:e1493.

  • Name: Antibody Sequence Prediction (IgLM)
    Description: Antibody Sequence Prediction (IgLM)模块是抗体序列生成与优化,该方法从Observed Antibody Space (OAS) 收集抗体序列。OAS数据库包含六个物种的天然抗体序列:人类、小鼠、大鼠、兔子、恒河猴和骆驼。为了研究模型能力的影响,训练了两个版本的模型: IgLM和IgLM-S,分别有13M和1.4M的训练参数。两个IgLM模型都在558万非冗余序列上训练,这些序列基于95%相似性聚类。在训练过程中,随机屏蔽了抗体序列中10到20个残基,以便在推理过程中实现任意跨度的多样化。此外,还对序列中的链型(重链或轻链)和原产物种进行了限定,提供这样的背景能够控制物种特异性抗体序列的产生。该方法被证明可以从各种物种中产生全长的重链和轻链序列,以及具有改进可开发性的填充CDR环库。该方法是一个强大的抗体设计工具,可应用于各种抗体序列设计场景。 The Antibody Sequence Prediction (IgLM) module is designed for the generation and optimization of antibody sequences, utilizing the Observed Antibody Space (OAS) to collect antibody sequences. The OAS database contains natural antibody sequences from six species: humans, mice, rats, rabbits, rhesus monkeys, and camels. To investigate the impact of model capacity, two versions of the model were trained: IgLM and IgLM-S, with 13 million and 1.4 million training parameters, respectively. Both IgLM models were trained on 5.58 million non-redundant sequences, clustered based on 95% similarity. During training, 10 to 20 residues in the antibody sequences were randomly masked to enable diversification of arbitrary spans during inference. Additionally, constraints were applied to the chain type (heavy chain or light chain) and the originating species of the sequences, providing a framework to control the generation of species-specific antibody sequences. This method has been shown to produce full-length heavy and light chain sequences from various species, as well as improved developability for filling CDR loop libraries. The method serves as a powerful tool for antibody design and can be applied to various antibody sequence design scenarios.
    Tags: undefined
    Author: Richard W. Shuai
    Release: 2023-05-29 09:07:25
    Reference: Richard W. Shuai, Jeffrey A. Ruffolo, Jeffrey J. Gray. Generative language modeling for antibody design. bioRxiv 2021.12.13.472419.

    Antibody Sequence Prediction (IgLM)

    简介

    Antibody Sequence Prediction(IgLM)模块是抗体序列生成与优化,该方法从Observed Antibody Space (OAS) 收集抗体序列。OAS数据库包含六个物种的天然抗体序列:人类、小鼠、大鼠、兔子、恒河猴和骆驼。为了研究模型能力的影响,训练了两个版本的模型: IgLM和IgLM-S,分别有13M和1.4M的训练参数。两个IgLM模型都在558万非冗余序列上训练,这些序列基于95%相似性聚类。在训练过程中,随机屏蔽了抗体序列中10到20个残基,以便在推理过程中实现任意跨度的多样化。此外,还对序列中的链型(重链或轻链)和原产物种进行了限定,提供这样的背景能够控制物种特异性抗体序列的产生。该方法被证明可以从各种物种中产生全长的重链和轻链序列,以及具有改进可开发性的填充CDR环库。该方法是一个强大的抗体设计工具,可应用于各种抗体序列设计场景。

    参数说明

    Antibody Sequence File

    抗体序列,仅支持1条序列,FASTA格式。

    Chain Type

    设定为抗体重链或轻链,值为"H" 或 “L”。

    Start Index of AA

    指定序列中进行改造优化的氨基酸起始值,整数值,从1开始。需要说明的是,并不是说从优化起始值-终止值的氨基酸就会完全一对一的进行修改,模型里是指定开始到结束的残基作为1个MASK TOKEN提供给模型进行生成,具体生成多少个残基,是看模型学习的情况。

    End Index of AA

    指定序列中进行改造优化的氨基酸终止值,整数值。需要说明的是,并不是说从优化起始值-终止值的氨基酸就会完全一对一的进行修改,模型里是指定开始到结束的残基作为1个MASK TOKEN提供给模型进行生成,具体生成多少个残基,是看模型学习的情况。

    Species Type

    设定物种信息,默认是人源。

    Number of Designed Sequences

    设定设计的序列数量,默认100。

    结果说明

    输出结果文件为generated_seqs.fasta,包含生产的序列信息,fasta格式。

    参考文献

    Richard W. Shuai, Jeffrey A. Ruffolo, Jeffrey J. Gray. Generative language modeling for antibody design. bioRxiv 2021.12.13.472419.

    Antibody Sequence Prediction (IgLM)

    Introduction

    The Antibody Sequence Prediction (IgLM) module is designed for antibody sequence generation and optimization. The method is trained on antibody sequences collected from the Observed Antibody Space (OAS) database, which contains natural antibody sequences from six species: human, mouse, rat, rabbit, rhesus monkey, and camel. To study the impact of model capacity, two versions of the model were trained, IgLM and IgLM-S, with 13M and 1.4M trainable parameters, respectively. Both models were trained on 5.58 million non-redundant sequences clustered at 95% similarity. During training, spans of 10 to 20 residues in the antibody sequences were randomly masked so that arbitrary spans can be diversified at inference time. In addition, each sequence is conditioned on its chain type (heavy or light) and species of origin, which makes it possible to control the generation of species-specific antibody sequences. The method has been shown to generate full-length heavy and light chain sequences for various species, as well as infilled CDR loop libraries with improved developability. It is a powerful antibody design tool applicable to a wide range of antibody sequence design scenarios.

    Parameter Description

    Antibody Sequence File

    Antibody sequence in FASTA format, supporting only one sequence.

    Chain Type

    Specify the antibody chain type as heavy (“H”) or light (“L”).

    Start Index of AA

    Specify the start index of the region to optimize, as an integer starting from 1. Note that the residues between the start and end indices are not modified one-to-one. The model replaces the whole span from start to end with a single mask token and generates a new segment in its place, so the number of residues generated is decided by the model and may differ from the length of the original span.

    End Index of AA

    Specify the ending amino acid index for optimization in the sequence, an integer value. Similarly, the optimization does not necessarily modify each amino acid from the start to end index one-to-one.
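The effect of the two indices can be pictured with plain string slicing. This sketch does not call IgLM; it only shows which residues would be replaced by the mask, assuming a 1-based start and an inclusive end (the inclusiveness of the end index is an assumption, and the function names and sequences are made up).

```python
def split_for_infill(sequence, start, end):
    """Split a sequence into (prefix, masked span, suffix); start is 1-based, end inclusive."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("indices must satisfy 1 <= start <= end <= len(sequence)")
    prefix = sequence[: start - 1]
    masked = sequence[start - 1 : end]
    suffix = sequence[end:]
    return prefix, masked, suffix


def rebuild(prefix, generated, suffix):
    """Join the untouched flanks with whatever segment the model generated."""
    return prefix + generated + suffix


prefix, masked, suffix = split_for_infill("ABCDEFGHIJ", 3, 5)
# A generated segment need not have the same length as the masked span.
designed = rebuild(prefix, "XXXXXX", suffix)
print(prefix, masked, suffix, designed)
```

Here three residues are masked but six are inserted, so the designed sequence is longer than the input, which is the behaviour the note above describes.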

    Species Type

    Set the species information, default is human.

    Number of Designed Sequences

    Set the number of sequences to be designed, default is 100.

    Result Description

    The output result file is named generated_seqs.fasta, containing the information of the generated sequences in FASTA format.

    Reference

    Richard W. Shuai, Jeffrey A. Ruffolo, Jeffrey J. Gray. Generative language modeling for antibody design. bioRxiv 2021.12.13.472419.

  • Name: PTM Hotspot by Structure
    Description: Predicts high-risk PTM sites in a protein from its structure, which is more accurate than sequence-based prediction. The current version predicts the probability of isomerization at aspartic acid (ASP) sites.
    Tags: undefined
    Author: Sharma VK, Patapoff TW, Kabakoff B, Pai S, Hilario E, Zhang B, Li C, Borisov O, Kelley RF, Chorny I, Zhou JZ, Dill KA, Swartz TE
    Release: 2023-05-19 12:40:06
    Reference: In silico selection of therapeutic antibodies for development: viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 30;111(52):18601-6.

    PTM Hotspot by Structure

    简介

    PTM Hotspot by Structure模块通过快速的蒙特卡罗模拟采样,获得蛋白的多样性构象,通过分析多构象的溶剂暴露情况和结构波动情况来预测天冬氨酸(ASP)的异构化的概率。

    参数说明

    Protein Structure File

    蛋白的结构文件,PDB格式。

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    Chain 蛋白链名称
    Residue Index 氨基酸索引(PDB文件中)
    Pred_Score 预测得到的ASP残基异构化评分,分数值在0-1之间,越大表示异构化的可能性越高
    Labile 最终判别异构化的值,1表示预测发生异构化,0表示预测无异构化

    参考文献

    Sharma VK, Patapoff TW, Kabakoff B, Pai S, Hilario E, Zhang B, Li C, Borisov O, Kelley RF, Chorny I, Zhou JZ, Dill KA, Swartz TE. In silico selection of therapeutic antibodies for development: viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 30;111(52):18601-6.

    PTM Hotspot by Structure

    Introduction

    The PTM Hotspot by Structure module uses rapid Monte Carlo simulation sampling to obtain diverse protein conformations. By analyzing the solvent exposure and structural fluctuations of multiple conformations, it predicts the probability of aspartic acid (ASP) isomerization.

    Parameter Description

    Protein Structure File

    Protein structure file in PDB format.

    Result Description

    The output result file is named result.csv, containing the following information:

    Field Name Description
    Chain Name of the protein chain
    Residue Index Amino acid index (in the PDB file)
    Pred_Score Predicted score for ASP residue isomerization, with values ranging from 0 to 1; higher values indicate a higher likelihood of isomerization
    Labile Final determination of isomerization; 1 indicates predicted isomerization, 0 indicates predicted non-isomerization

    References

    Sharma VK, Patapoff TW, Kabakoff B, Pai S, Hilario E, Zhang B, Li C, Borisov O, Kelley RF, Chorny I, Zhou JZ, Dill KA, Swartz TE. In silico selection of therapeutic antibodies for development: viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 30;111(52):18601-6.

  • Name: Protein Isoelectric Point (pI)
    Description: The protein isoelectric point (pI) is the pH at which a molecule carries no net electrical charge, and it is a key parameter affecting a molecule's physicochemical properties and even its function. It is a critical parameter for many analytical biochemistry and proteomics techniques, especially 2D gel electrophoresis (2D-PAGE), capillary isoelectric focusing (cIEF), X-ray crystallography, and liquid chromatography–mass spectrometry (LC-MS). This module calculates pI from sequence using several different algorithms and can merge the results of multiple chains. In a comparison against experimentally measured pI values for a set of internal antibodies, the Wecomput team found the Sillero algorithm to be relatively more accurate, so it is recommended.
    Tags: undefined
    Author:
    Release: 2023-05-15 18:01:25
    Reference:

    Protein Isoelectric Point

    简介

    Protein Isoelectric Point(pI),即分子不带净电荷的pH值,是影响分子理化性质甚至功能的关键参数。该模块使用多种不同的算法,基于序列计算分子的pI数值,并可以对多条链的结果进行合并计算。

    基于唯信团队使用部分内部抗体实测pI数据的对比,Sillero算法的精度相对更高,推荐采用。

    唯信测试用的抗体分子和对应的实测pI数值区间和均值如下图所示。

    image.png

    用不同算法计算的pI数值与实测均值的差值及相关性如下图所示。

    image.png

    基于R和RMSE等指标,Sillero的相关性略优于其他算法。

    464e925a1c78788da290f4691171545.png

    参数说明

    Protein Sequence File

    蛋白的序列文件,FASTA格式。

    pI Result File

    使用所选模型预测pI的输出文件,默认名称result.csv。

    Plot

    绘制二维散点图,默认False。

    Plot File

    二维散点图(分子量与等电点)表示为热图,默认名称result.png。

    Merge Chain

    根据链名,将来自同一序列的多条链的pI值进行合并计算。
    例如:mol1.chain1与mol1.chain2将被合并为mol1分子的结果。同名的链也会被视为同一个分子。

    Merge Output File

    仅当merge_chain=True时可用。默认值:merged.csv。

    Job Number

    并行任务数,默认为1。

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.png 当Plot=True时输出二维散点图(分子量与等电点),热图形式
    result.csv 使用所选模型预测pI的输出文件
    merged.csv 多条链的pI合并输出文件

    其中result.csv包括信息如下:

    字段名称 说明
    Protein ID 蛋白序列名称
    Molecular weight (Da) 蛋白分子量
    pI 蛋白等电点

    参考文献

    Kozlowski LP. IPC - Isoelectric Point Calculator. Biol Direct. 2016 Oct 21;11(1):55.

    Protein Isoelectric Point

    Introduction

    Protein Isoelectric Point (pI), the pH at which a molecule carries no net charge, is a key parameter that influences the physical and functional properties of a molecule. This module uses various algorithms to calculate the pI value of a molecule based on its sequence and can merge results for multiple chains.
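Sequence-based pI calculators generally work by computing the net charge of the sequence as a function of pH with the Henderson-Hasselbalch equation and then searching for the pH where that charge is zero. The sketch below illustrates the idea for a single chain. The pKa values are illustrative placeholders only, not the Sillero set or any other published scale, and the function names are hypothetical.

```python
# Illustrative pKa values only; NOT the Sillero scale or any published set.
PKA_POSITIVE = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"Cterm": 3.0, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}


def net_charge(seq, ph):
    """Net charge of a single chain at a given pH (Henderson-Hasselbalch)."""
    positive = {"Nterm": 1, **{aa: seq.count(aa) for aa in "KRH"}}
    negative = {"Cterm": 1, **{aa: seq.count(aa) for aa in "DECY"}}
    charge = sum(n / (1 + 10 ** (ph - PKA_POSITIVE[g])) for g, n in positive.items())
    charge -= sum(n / (1 + 10 ** (PKA_NEGATIVE[g] - ph)) for g, n in negative.items())
    return charge


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection search for the pH at which the net charge crosses zero."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:  # still positive: pI lies at higher pH
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


basic = isoelectric_point("KKKKG")
acidic = isoelectric_point("DDDDG")
print(round(basic, 2), round(acidic, 2))
```

Bisection works here because the net charge decreases monotonically as pH rises. The algorithms offered by the module differ mainly in the pKa values they assign to each ionizable group.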

    Based on a comparison against experimentally measured pI data for a set of internal antibodies, the Wecomput team found that the Sillero algorithm is relatively more accurate, so it is recommended.

    The figure below shows the antibody molecules used in the Wecomput tests, together with the range and mean of the experimentally measured pI values for each.

    image.png

    The figure below illustrates the differences and correlations between the pI values calculated using different algorithms and the experimentally measured averages.

    image.png

    Based on metrics such as R and RMSE, the Sillero algorithm agrees slightly better with the measured values than the other algorithms.

    464e925a1c78788da290f4691171545.png

    Parameter Description

    Protein Sequence File

    File containing the protein sequence in FASTA format.

    pI Result File

    Output file for predicted pI values using the selected model, default name is result.csv.

    Plot

    Whether to plot a two-dimensional scatter plot, default is False.

    Plot File

    Output image of the two-dimensional scatter plot (molecular weight vs. isoelectric point), rendered as a heat map; default name is result.png.

    Merge Chain

    Merge the pI values of multiple chains that belong to the same molecule, based on chain names.
    For example: mol1.chain1 and mol1.chain2 will be merged into the result for the molecule mol1. Chains with the same name are considered as part of the same molecule.
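    The naming convention can be sketched as a simple grouping step. This is a minimal sketch that assumes the molecule name is the part of the record ID before the first "."; the function name is hypothetical, and how the module combines the grouped chains into a single pI is not described here.

```python
from collections import defaultdict


def group_chains_by_molecule(chain_ids):
    """Group record IDs such as 'mol1.chain1' under the molecule name 'mol1'."""
    groups = defaultdict(list)
    for chain_id in chain_ids:
        # Assumption: everything before the first '.' is the molecule name;
        # an ID without '.' is used as the molecule name itself.
        molecule = chain_id.split(".", 1)[0]
        groups[molecule].append(chain_id)
    return dict(groups)
```

    Chains that share an identical name fall into the same group, which matches the rule that same-named chains are treated as one molecule.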

    Merge Output File

    Available only when merge_chain=True, default value is merged.csv.

    Job Number

    Number of parallel tasks, default is 1.

    Result Description

    The output includes:

    Output File Name Description
    result.png Output of the two-dimensional scatter plot (molecular weight vs. isoelectric point) if Plot=True, in heatmap format
    result.csv Output file for predicted pI values using the selected model
    merged.csv Merged output file for pI values of multiple chains

    The result.csv file includes the following information:

    Field Name Description
    Protein ID Protein sequence name
    Molecular weight (Da) Protein molecular weight
    pI Protein isoelectric point

    References

    Kozlowski LP. IPC - Isoelectric Point Calculator. Biol Direct. 2016 Oct 21;11(1):55.

  • Name: DP4 AI
    Description: DP4-AI模块是自动处理和归属原始13C和1H-NMR数据,通过对理论计算NMR数据和实验NMR数据进行有机分子结构确证。DP4-AI利用客观模型选择的核磁共振选峰方法,以及将计算的13C和1H核磁共振化学位移与实验核磁共振数据中的峰值匹配的算法。 The DP4-AI module automatically processes and assigns raw 13C and 1H NMR data, and confirms organic molecular structures by comparing theoretically calculated NMR data with experimental NMR data. DP4-AI uses an NMR peak-picking method based on objective model selection, together with an algorithm that matches calculated 13C and 1H NMR chemical shifts to peaks in the experimental NMR data.
    Tags: undefined
    Author: Howarth, A
    Release: 2023-05-15 10:47:32
    Reference: Howarth, A.; Ermanis, K.; Goodman, J. M. DP4-AI Automated NMR Data Analysis: Straight from Spectrometer to Structure. Chem. Sci. 2020, 11 (17), 4351–4359.

    DP4-AI

    简介

    DP4-AI模块是自动处理和归属原始13C和1H-NMR数据,通过对理论计算NMR数据和实验NMR数据进行有机分子结构确证。DP4-AI利用客观模型选择的核磁共振选峰方法,以及将计算的13C和1H核磁共振化学位移与实验核磁共振数据中的峰值匹配的算法。

    参数说明

    Workflow

    计算流程:

    • gmns : 生成构象-力场优化-量化NMR计算-DP4概率计算
    • gnomes : 生成构象-量化优化-量化高精度能量计算-量化NMR计算-DP4概率计算

    Molecule Structure File

    分子结构文件,SDF格式(仅允许输入一个分子)

    NMR Experimental File

    可以提供两种形式的文件:一种是实验的原始核磁共振数据(Bruker文件格式),ZIP或者RAR格式。另外一种是已经解析的数据(NMR.txt),TXT格式。
    原始核磁数据打包成ZIP或者RAR格式压缩包上传,文件夹包含Carbon或Proton,文件夹的内容如下:
    image.png
    已解析的数据为txt文件并且必须以NMR.txt命名。NMR.txt示例如下:

    1. 第一部分是碳的化学位移。也可以表示为如59.58(any),127.88(any)…即括号内的原子名都写成any
    2. 第二部分是氢的化学位移。也可以表示为如4.81(any),7.18(any)…即括号内的原子名都写成any
    3. 第三部分是等效原子。每一新行的原子视为等效原子
    4. 第四部分是定义要忽略的原子
      image.png

    Stereo Center

    确定立体异构中心原子来生成异构体,多个原子用英文逗号隔开,如:

    • 3号碳原子则写:3
    • 3号和4号碳原子则写:3,4
      默认auto为自动生成异构

    Maximum Number Conformations

    每个结构最多计算NMR的构象数,默认5

    DFT Software

    DFT或NMR计算的软件。gaussian 和 nwchem,默认 nwchem

    Solvent

    选择溶剂模型。可选择的模型:none, water, benzene, chloroform, methanol, dimethylsulfoxide, pyridine, acetone, 默认不使用溶剂模型(none)

    Charge

    分子总电荷,默认0

    结果说明

    得到计算过程输出文件(result.out)和结果总结文件(conf_1NMR.dp4),内容如下所示。

    参考文献

    Howarth, A.; Ermanis, K.; Goodman, J. M. DP4-AI Automated NMR Data Analysis: Straight from Spectrometer to Structure. Chem. Sci. 2020, 11 (17), 4351–4359.

    DP4-AI

    Introduction

    The DP4-AI module automatically processes and assigns raw 13C and 1H NMR data, and confirms organic molecular structures by comparing theoretically calculated NMR data with experimental NMR data. DP4-AI uses an NMR peak-picking method based on objective model selection, together with an algorithm that matches calculated 13C and 1H NMR chemical shifts to peaks in the experimental NMR data.

    Parameter Description

    Workflow

    Calculation workflow:

    • gmns: Conformer generation - Force field optimization - Quantum NMR calculation - DP4 probability calculation
    • gnomes: Conformer generation - Quantum optimization - Quantum high-precision energy calculation - Quantum NMR calculation - DP4 probability calculation

    Molecule Structure File

    Molecular structure file in SDF format (only one molecule allowed).

    NMR Experimental File

    Two types of files can be provided: one is the raw experimental NMR data (Bruker file format), in ZIP or RAR format. The other type is pre-analyzed data (NMR.txt) in TXT format.
    The raw NMR data should be packed in a ZIP or RAR format archive, containing folders named Carbon or Proton, with contents as shown below:
    image.png
    The pre-analyzed data should be a TXT file named NMR.txt. An example of NMR.txt is as follows:

    1. The first part is carbon chemical shifts, which can be represented as 59.58(any),127.88(any)… where atom names in parentheses are written as ‘any’.
    2. The second part is hydrogen chemical shifts, which can be represented as 4.81(any),7.18(any)… where atom names in parentheses are written as ‘any’.
    3. The third part lists equivalent atoms; the atoms written on each new line are treated as equivalent to one another.
    4. The fourth part defines atoms to be ignored.
      image.png

    Stereo Center

    Specify atoms for determining stereo centers to generate stereoisomers. Multiple atoms should be separated by commas, for example:

    • For carbon atom 3: 3
    • For carbon atoms 3 and 4: 3,4
      Default ‘auto’ generates stereoisomers automatically.

    Maximum Number Conformations

    Maximum number of conformations to calculate NMR per structure, default is 5.

    DFT Software

    Software for DFT or NMR calculations, choose between gaussian and nwchem, default is nwchem.

    Solvent

    Select the solvent model. Available models: none, water, benzene, chloroform, methanol, dimethylsulfoxide, pyridine, acetone. Default is no solvent model (none).

    Charge

    Total charge of the molecule, default is 0.

    Result Description

    The output includes the calculation process output file (result.out) and the result summary file (conf_1NMR.dp4), with contents as shown below.

    References

    Howarth, A.; Ermanis, K.; Goodman, J. M. DP4-AI Automated NMR Data Analysis: Straight from Spectrometer to Structure. Chem. Sci. 2020, 11 (17), 4351–4359.

  • Name: Protein Structure Prediction (AlphaFold2.3.2)
    Description: AlphaFold2 是一个高度准确的蛋白质结构预测算法,在CASP14部分测试中的表现接近实验水平,主要适用于有一定同源序列的蛋白及复合物。 v2.3.2是截止于2023年10月的最新版本。推荐使用AF3 like模块(比如Boltz-1、Chai-1、HelixFold3和Protenix等)。 AlphaFold2 is a highly accurate protein structure prediction package. This is a completely new model that was entered in CASP14 and published in Nature. Version: v2.3.2. It is recommended to use AF3-like modules (such as Boltz-1, Chai-1, HelixFold3, and Protenix).
    Tags: undefined
    Author: DeepMind, Jumper, J., Evans, R., Pritzel, A. et al.
    Release: 2021-11-09 08:00:00
    Reference: Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589.

    AlphaFold2(v2.3.2)

    简介

    AlphaFold2是目前业界优秀的蛋白质结构预测方法。由Deepmind 团队开发,在2020年的蛋白质结构预测大赛CASP14中, AlphaFold 2 得到了接近90分的成绩,排名第一,大幅度领先第二名,对大部分蛋白质结构的预测与真实结构只差一个原子的宽度,达到了人类利用冷冻电镜等复杂仪器观察预测的水平,这是蛋白质结构预测史无前例的巨大进步。后续更新版本支持复合物结构预测,包括蛋白-多肽复合物的预测。

    当前版本:v2.3.2, 是截止于2023年10月的最新版本。
    image.png
    image.png
    image.png
    上图:蛋白单体预测精度
    image.png
    上图:蛋白复合物预测精度

    参数说明

    Input File

    输入序列文件,fasta格式

    Type

    预测任务类型,monomer 或者 multimer
    monomer:单体蛋白,单条链
    multimer:复合物,多条链,最大可以6条链,超过6条系统不处理

    Relax

    优化结构模式
    all:优化所有的结构
    best:只优化打分最高的结构,这个模式只输出一个结构
    none:不做优化

    MSA Database

    多序列比对使用的数据库
    full_dbs:全库,更耗时,但相比reduced_db更精确
    reduced_dbs:精简库,速度更快,但是牺牲准确性

    结果说明

    输出结果包括:

    输出文件名称 说明
    ranking_debug.csv 预测模型可信度评估文件,其中包含用于执行模型排名的pLDDT, ipTM, pTM值,以及到原始模型名称的映射。
    ranked_*.pdb 预测最终蛋白结构文件。默认提供1个打分最高的优化后的结构
    PAE_0.csv 当预测为复合物结构时,生成最优模型的Predicted aligned error(PAE)热图CSV数据。
    PAE_Heatmap_0.png 当预测为复合物结构时,生成最优模型的Predicted aligned error(PAE)热图。
    PAE.tar.gz 当预测为复合物结构时,生成所有模型的Predicted aligned error(PAE)热图。

    其中评估结构预测可信度指标分为pLDDT和ipTM:

    • pLDDT是针对单体结构预测可信度指标,值范围是0-100,该值越大说明预测的结构越可靠。低于70被认为可靠性较低,低于50基本认为是可信度非常低,为无序预测。
    pLDDT > 90:Very high
    90 > pLDDT > 70:Confident
    70 > pLDDT > 50:Low
    pLDDT < 50:Very low
    
    • pTM和ipTM用于评估复合物预测的准确性。pTM和ipTM的加权组合是针对复合物预测可信度指标:model confidence = 0.8 · ipTM + 0.2 · pTM,值范围是0-1,该值越大说明预测的复合物结构越可靠。
      • pTM(the predicted template modelling)是AlphaFold-Multimer预测复合物整体结构的综合测量,该值高于0.5表示复合物的整体预测折叠可能类似于真实结构,其低于 0.5表示预测结构可能是错误的。
      • ipTM(the interface predicted template modelling)是不同链残基之间相互作用的评分,该值高于0.8表明高质量的预测结果,低于0.6表明预测结果可能失败,介于0.6-0.8之间是一个灰色地带,预测可能正确或者错误。
    ipTM >= 0.80:High quality 
    0.6 <=  ipTM <  0.80:Acceptable quality
    0.00 <=  ipTM <  0.6:Incorrect
    

    对结构准确性分析应该综合考虑所有指标,包括pTM、ipTM、pLDDT 和 PAE。

    参考文献

    • Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589.
    • Richard Evans, Michael O’Neill, Alexander Pritzel, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2021 Oct;463034.
    • https://github.com/deepmind/alphafold
    • https://github.com/deepmind/alphafold/releases/tag/v2.3.0

    AlphaFold2 (v2.3.2)

    Introduction

    AlphaFold2 is one of the leading protein structure prediction methods. It was developed by the DeepMind team. In the 2020 CASP14 protein structure prediction competition, AlphaFold2 achieved a score close to 90, ranking first and significantly outperforming the second-place entry. For most proteins, its predictions differed from the experimental structure by about the width of a single atom, an accuracy comparable to that of experimental techniques such as cryo-electron microscopy. Later versions support the prediction of complex structures, including protein-peptide complexes.

    Current Version: v2.3.2, the latest version as of October 2023.
    image.png
    image.png
    image.png
    Above: Protein monomer prediction accuracy
    image.png
    Above: Protein complex prediction accuracy

    Parameter Description

    Input File

    Input sequence file in FASTA format.

    Type

    Prediction task type, either monomer or multimer.
    monomer: Single protein, single chain.
    multimer: Complex, multiple chains, with a maximum of 6 chains. Systems with more than 6 chains are not processed.

    Relax

    Structure optimization mode.
    all: Optimize all structures.
    best: Optimize only the highest-scoring structure; this mode outputs only one structure.
    none: No optimization.

    MSA Database

    Database used for multiple sequence alignment.
    full_dbs: Full databases; more time-consuming but more accurate than reduced_dbs.
    reduced_dbs: Reduced database, faster but sacrifices accuracy.

    Result Description

    The output includes:

    Output File Name Description
    ranking_debug.csv Confidence evaluation file of the prediction model, containing pLDDT, ipTM, pTM values used for model ranking and mapping to the original model names.
    ranked_*.pdb Final predicted protein structure files. By default, the optimized highest-scoring structure is provided.
    PAE_0.csv For complex structure predictions, generates a Predicted Aligned Error (PAE) heatmap CSV data for the best model.
    PAE_Heatmap_0.png For complex structure predictions, generates a Predicted Aligned Error (PAE) heatmap for the best model.
    PAE.tar.gz For complex structure predictions, generates PAE heatmaps for all models.

    The confidence metrics for structure prediction include pLDDT, pTM, and ipTM:

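    • pLDDT is a confidence metric for monomer structure prediction, ranging from 0 to 100. A higher value indicates a more reliable structure prediction. Values below 70 are considered low confidence, and values below 50 are considered very low confidence and typically correspond to disordered regions.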
    pLDDT > 90: Very high
    90 > pLDDT > 70: Confident
    70 > pLDDT > 50: Low
    pLDDT < 50: Very low
    
    • pTM and ipTM are used to evaluate the accuracy of complex predictions. The weighted combination of pTM and ipTM serves as a confidence metric for complex predictions: model confidence = 0.8 · ipTM + 0.2 · pTM. The value ranges from 0 to 1, with higher values indicating a more reliable predicted complex structure.
      • pTM (predicted template modelling score) measures the overall structure of the complex predicted by AlphaFold-Multimer. A value above 0.5 suggests that the overall predicted fold of the complex may be similar to the real structure, whereas a value below 0.5 suggests that the predicted structure may be incorrect.
      • ipTM (the interface predicted template modelling) scores the interactions between residues of different chains. A value above 0.8 indicates a high-quality prediction, a value below 0.6 indicates a likely failure of the prediction, and values between 0.6 and 0.8 represent a gray area where the prediction may be correct or incorrect.
    ipTM >= 0.80: High quality
    0.6 <= ipTM < 0.80: Acceptable quality
    0.00 <= ipTM < 0.6: Incorrect
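    The thresholds and the weighted confidence above can be applied to the values read from ranking_debug.csv with a few small helpers. This is a minimal sketch; the function names are hypothetical, and the treatment of values that fall exactly on a pLDDT boundary (90, 70, 50) is an assumption, since the bands listed above leave those points unassigned.

```python
def plddt_band(plddt: float) -> str:
    """Map a pLDDT value (0-100) to the confidence bands listed above."""
    # Assumption: a value exactly on a boundary falls into the lower band.
    if plddt > 90:
        return "Very high"
    if plddt > 70:
        return "Confident"
    if plddt > 50:
        return "Low"
    return "Very low"


def iptm_band(iptm: float) -> str:
    """Map an ipTM value (0-1) to the quality bands listed above."""
    if iptm >= 0.8:
        return "High quality"
    if iptm >= 0.6:
        return "Acceptable quality"
    return "Incorrect"


def model_confidence(iptm: float, ptm: float) -> float:
    """Weighted complex confidence: 0.8 * ipTM + 0.2 * pTM."""
    return 0.8 * iptm + 0.2 * ptm
```

    For example, a complex with ipTM = 0.75 and pTM = 0.5 has a model confidence of about 0.7 and falls in the "Acceptable quality" ipTM band.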
    

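    Structural accuracy should be assessed by considering all metrics together, including pTM, ipTM, pLDDT, and PAE.

    References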

    • Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589.
    • Richard Evans, Michael O’Neill, Alexander Pritzel, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2021 Oct;463034.
    • https://github.com/deepmind/alphafold
    • https://github.com/deepmind/alphafold/releases/tag/v2.3.0
  • Name: Solubility Score (CamSol)
    Description: Solubility Score (CamSol)模块使用CamSol算法预测蛋白的溶解度评分。该方法考虑了最直接影响蛋白质溶解度的氨基酸的物理化学特性,包括疏水性、静电荷以及它们在空间的相互作用。通过对这些特性的组合来定义溶解度分数。该方法在预测突变对蛋白质溶解度的影响方面具有很高的准确性。与其他现有方法相比,如SOLpro和 PROSO II,在测试的56个变体中,该方法正确预测了54个突变体在突变后溶解度的变化,而SOLpro和PROSO II分别为40和32个。 Solubility Score (CamSol) module predicts a protein's solubility score. The method considers the physicochemical properties of amino acids that most directly affect protein solubility, including hydrophobicity, electrostatic charge, and their interaction in space. The solubility fraction is defined by a combination of these properties. This method has high accuracy in predicting the effect of mutation on protein solubility. Compared to other existing methods, such as SOLpro and PROSO II, the method correctly predicted changes in solubility of 54 mutants after mutation out of 56 variants tested, compared to 40 for SOLpro and 32 for PROSO II.
    Tags: undefined
    Author: Sormanni P
    Release: 2023-05-09 10:09:23
    Reference: The CamSol method of rational design of protein mutants with enhanced solubility. J Mol Biol. 2015 Jan 30;427(2):478-90.

    Solubility Score (CamSol)

    简介

    Solubility Score (CamSol)模块的功能是预测蛋白的溶解度评分。该方法考虑了最直接影响蛋白质溶解度的氨基酸的物理化学特性,包括疏水性、静电荷以及它们在空间的相互作用。通过对这些特性的组合来定义溶解度分数。该方法在预测突变对蛋白质溶解度的影响方面具有很高的准确性。与其他现有方法相比,如SOLpro和 PROSO II,在测试的56个变体中,该方法正确预测了54个突变体在突变后溶解度的变化,而SOLpro和PROSO II分别为40和32个。

    参数说明

    Protein Sequence File

    蛋白的序列文件,FASTA格式

    结果说明

    输出结果文件为result.csv,包含信息如下:
    image.png

    字段名称 说明
    ID 蛋白名称
    Score 预测得到的溶解度评分,该分数越大表示溶解性越好,特别的,当分数小于-1时,溶解性很差,当分数大于1时,表示溶解性很好

    参考文献

    Sormanni P, Aprile FA, Vendruscolo M. The CamSol method of rational design of protein mutants with enhanced solubility. J Mol Biol. 2015 Jan 30;427(2):478-90.

    Solubility Score (CamSol)

    Introduction

    The Solubility Score (CamSol) module predicts a protein's solubility score. The method considers the physicochemical properties of amino acids that most directly affect protein solubility, including hydrophobicity, electrostatic charge, and their interaction in space. The solubility score is defined by a combination of these properties. The method is highly accurate in predicting the effect of mutations on protein solubility: of 56 variants tested, it correctly predicted the change in solubility upon mutation for 54, compared with 40 for SOLpro and 32 for PROSO II.

    Parameter

    Protein Sequence File

    Protein sequence file in FASTA format

    Result

    The output file is result.csv and contains the following information:
    image.png

    Field Name Description
    ID Protein name
    Score Predicted solubility score. The higher the score, the better the solubility: a score below -1 indicates poor solubility, and a score above 1 indicates very good solubility.
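    The score thresholds can be applied to result.csv as sketched below. The sketch assumes a comma-separated file whose header row contains the ID and Score fields listed above; the function names and band labels are hypothetical.

```python
import csv
import io


def solubility_band(score: float) -> str:
    """Interpret a CamSol score using the -1 / +1 thresholds."""
    if score < -1.0:
        return "poorly soluble"
    if score > 1.0:
        return "highly soluble"
    return "intermediate"


def classify_results(csv_text: str) -> dict:
    """Return {ID: band} for every row of a result.csv-style table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["ID"]: solubility_band(float(row["Score"])) for row in reader}
```

    To classify a real output file, pass its contents, for example classify_results(open("result.csv").read()).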

    Reference

    Sormanni P, Aprile FA, Vendruscolo M. The CamSol method of rational design of protein mutants with enhanced solubility. J Mol Biol. 2015 Jan 30;427(2):478-90.

  • Name: Antibody Viscosity Prediction
    Description: 粘度是影响抗体药物开发的重要因素,临床上抗体往往需要静脉内或皮下给药,需要高浓度的抗体溶液(>100mg/mL)才能以小剂量注射获得与治疗相关的剂量,但是高浓度的抗体往往表现出高粘度,这对抗体药物的开发,制造和给药提出了挑战。研究发现,抗体序列是决定抗体粘度的关键因素,文献报道抗体粘度与Fv区域的电荷、VH和VL区域电荷的不对称性FvCSP和Fv区域的疏水指数HI存在相关性,基于抗体序列预测抗体粘度是一个有效方法。 Viscosity is an important factor affecting the development of antibody drugs. Clinically, antibodies often need to be administered intravenously or subcutaneously, requiring a high concentration of antibody solution (>100mg/mL) to obtain a therapeutic dose at a small dose. However, high concentrations of antibodies often exhibit high viscosity, which poses a challenge to the development, manufacture and administration of antibody drugs. It has been found that antibody sequence is the key factor to determine antibody viscosity. It has been reported that antibody viscosity is correlated with charge in Fv region, charge asymmetry in VH and VL region, FvCSP, and hydrophobic index HI in Fv region. It is an effective method to predict antibody viscosity based on antibody sequence.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-05-05 00:00:00
    Reference: In silico selection of therapeutic antibodies for development: Viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 15;111(52):E18601-18606. doi: 10.1073/pnas.1421779112.

    Antibody Viscosity Prediction

    简介

    粘度是影响抗体药物开发的重要因素,临床上抗体往往需要静脉内或皮下给药,需要高浓度的抗体溶液(>100mg/mL)才能以小剂量注射获得与治疗相关的剂量,但是高浓度的抗体往往表现出高粘度,这对抗体药物的开发,制造和给药提出了挑战。研究发现,抗体序列是决定抗体粘度的关键因素,文献报道抗体粘度与Fv区域的电荷、VH和VL区域电荷的不对称性FvCSP和Fv区域的疏水指数HI存在相关性,基于抗体序列预测抗体粘度是一个有效方法。
    粘度计算方法如下所示:
    η (cP; 180 mg/mL, 25°C) = 10^[0.15 + 1.26(0.60)·ϕ − 0.043(0.047)·q − 0.020(0.015)·qsym]
    其中,ϕ代表Fv区域的疏水指数HI,q代表Fv电荷,qsym代表VH和VL区域电荷的不对称性FvCSP。

    参数说明

    Sequence模式

    Heavy Chain Sequence

    抗体重链的序列(纯序列信息,非FASTA格式文件)。

    Light Chain Sequence

    抗体轻链的序列(纯序列信息,非FASTA格式文件)。

    FASTA File模式

    Antibody Fasta File

    抗体的序列文件,FASTA格式,支持多抗体模式。不支持纳米抗体序列。重链必须包括标识符".H"或者"_H",轻链必须包含标识符".L"或者"_L":

    > name.H
    XXXXXX
    > name.L
    XXXXXX
    

    结果说明

    得到result.csv文件,包含信息如下:

    字段名称 说明
    Sequence ID 抗体序列名称
    Fv Heavy Chain Charge 重链电荷
    Fv Light Chain Charge 轻链电荷
    Fv Charge Symmetry Parameter 电荷对称性指标
    Fv Hydrophobicity Index 疏水性指数
    Viscosity 抗体粘度

    参考文献

    In silico selection of therapeutic antibodies for development: Viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 15;111(52):E18601-18606.

    Antibody Viscosity Prediction

    Introduction

    Viscosity is an important factor in antibody drug development. Clinically, antibodies are often administered intravenously or subcutaneously, and a high-concentration antibody solution (>100 mg/mL) is needed to deliver a therapeutically relevant dose in a small injection volume. However, high-concentration antibody solutions often exhibit high viscosity, which poses challenges for the development, manufacture, and administration of antibody drugs. The antibody sequence has been found to be a key determinant of viscosity: viscosity has been reported to correlate with the charge of the Fv region, the charge asymmetry between the VH and VL regions (FvCSP), and the hydrophobicity index (HI) of the Fv region. Predicting viscosity from the antibody sequence is therefore an effective approach.
    Viscosity is calculated as follows:
    η (cP; 180 mg/mL, 25°C) = 10^[0.15 + 1.26(0.60)·ϕ − 0.043(0.047)·q − 0.020(0.015)·qsym]
    where ϕ is the hydrophobicity index (HI) of the Fv region, q is the Fv charge, and qsym is the charge asymmetry between the VH and VL regions (FvCSP).
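    The viscosity formula documented for this module, η = 10^[0.15 + 1.26·ϕ − 0.043·q − 0.020·qsym] in cP at 180 mg/mL and 25°C, can be evaluated directly once the three descriptors are known. The sketch below is an illustration with a hypothetical function name. It uses only the central coefficients and ignores the values given in parentheses in the formula, which are presumably coefficient uncertainties. How the module derives ϕ, q, and qsym from the sequence is not shown.

```python
def predicted_viscosity_cp(hydrophobicity_index: float,
                           fv_charge: float,
                           fv_csp: float) -> float:
    """Viscosity in cP (180 mg/mL, 25 degrees C) from the three Fv descriptors."""
    exponent = (0.15
                + 1.26 * hydrophobicity_index  # phi: Fv hydrophobicity index (HI)
                - 0.043 * fv_charge            # q: Fv charge
                - 0.020 * fv_csp)              # qsym: VH/VL charge asymmetry (FvCSP)
    return 10.0 ** exponent
```

    Because the descriptors enter through an exponent, a change of 0.1 in the hydrophobicity index alone changes the predicted viscosity by a factor of 10^0.126, roughly 1.34.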

    Parameter

    Sequence Method

    Heavy Chain Sequence

    Antibody heavy chain sequence (plain sequence text, not a FASTA file).

    Light Chain Sequence

    Antibody light chain sequence (plain sequence text, not a FASTA file).

    FASTA File Method

    Antibody Fasta File

    Antibody sequence file in FASTA format; multiple antibodies are supported. Nanobody sequences are not supported. Heavy-chain identifiers must include ".H" or "_H", and light-chain identifiers must include ".L" or "_L":

    > name.H
    XXXXXX
    > name.L
    XXXXXX
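    The identifier convention above can be checked before submitting a file. The following sketch pairs heavy and light chains by name; it assumes that the chain marker is the last two characters of the record ID (".H", "_H", ".L", or "_L"), which is an interpretation of the rule rather than a documented detail, and the function name is hypothetical.

```python
def pair_antibody_chains(fasta_text: str) -> dict:
    """Return {antibody_name: {"H": sequence, "L": sequence}} from FASTA text."""
    records = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = ""
        elif current is not None:
            records[current] += line

    antibodies = {}
    for record_id, sequence in records.items():
        # Assumption: the ID ends with a separator ('.' or '_') plus 'H' or 'L'.
        base, separator, chain = record_id[:-2], record_id[-2:-1], record_id[-1:]
        if separator not in (".", "_") or chain not in ("H", "L"):
            raise ValueError(f"Record '{record_id}' lacks a .H/_H/.L/_L marker")
        antibodies.setdefault(base, {})[chain] = sequence
    return antibodies
```

    An antibody whose entry contains only "H" or only "L" after pairing is missing its partner chain.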
    

    Result

    The output is a result.csv file containing the following information:

    Field Name Description
    Sequence ID Antibody sequence name
    Fv Heavy Chain Charge Fv heavy chain charge
    Fv Light Chain Charge Fv light chain charge
    Fv Charge Symmetry Parameter Fv charge symmetry index
    Fv Hydrophobicity Index Fv hydrophobicity index
    Viscosity Antibody viscosity

    Reference

    In silico selection of therapeutic antibodies for development: Viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 15;111(52):E18601-18606.

  • Name: Alanine Mutation - MMPBSA (Deprecated)
    Description: 基于g_mmpbsa计算丙氨酸突变后的结合自由能。 MMPBSA calculates components of binding free energy after alanine mutation using the MM-PBSA method.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 16:47:54
    Reference: Kumari et al (2014) g_mmpbsa - A GROMACS tool for high-throughput MM-PBSA calculations. J. Chem. Inf. Model. 54:1951-1962. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    Alanine Scanning (MMPBSA)

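    Introduction

    Calculates the binding free energy between a receptor and a ligand after alanine-scanning mutation.

    Parameter Description

    Path File

    The path file produced by an MD simulation; it can be obtained from the GMX MD Run module or the AlphaAutoMD module.

    Energy Option

    Type of energy calculation: pb or gb.
    pb: Calculate the desolvation free energy with the PB method, and the nonpolar solvation free energy according to the INP option in pbsa.
    gb: Calculate the desolvation free energy with the GB model in sander.

    Ligand Mol2

    Upload the ligand mol2 file, which can be obtained from the GMX Ligand Parameterization module. One of Ligand Mol2 and Custom Group must be provided.

    Custom Group

    Defines the two groups between which the binding energy is calculated, separated by "/". Each group is given as protein residue numbers, for example 1-213/214-426 or 1-211,212-213/214-426. Residues are renumbered starting from 1, independent of the residue numbering in the original PDB file. One of Ligand Mol2 and Custom Group must be provided.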
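    The Custom Group syntax ("/" between the two groups, "," between ranges, "-" inside a range) can be expanded into explicit residue lists as follows. This is a minimal sketch with a hypothetical function name; it only illustrates the notation and does not reproduce the module's own parser.

```python
def parse_custom_group(spec: str):
    """Expand e.g. '1-211,212-213/214-426' into two sorted residue lists."""
    groups = []
    for group_text in spec.split("/"):
        residues = set()
        for part in group_text.split(","):
            part = part.strip()
            if "-" in part:
                start, end = (int(value) for value in part.split("-"))
                residues.update(range(start, end + 1))  # ranges are inclusive
            else:
                residues.add(int(part))
        groups.append(sorted(residues))
    if len(groups) != 2:
        raise ValueError("Exactly two groups separated by '/' are expected")
    return groups[0], groups[1]
```

    Both examples from the parameter description, 1-213/214-426 and 1-211,212-213/214-426, expand to the same two groups: residues 1 to 213 and residues 214 to 426.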

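    Decomp

    Whether to perform energy decomposition. (Default: no)

    Mutation Scanning

    Type of scanning mutation, ALA or GLY. When this option is selected, the mutation site must be specified in Mutation Residue.

    Mutation Residue

    Position of the residue to mutate, written as the chain name and the residue number separated by a colon, for example A:23. Note: only one residue can be mutated per run. Check whether the residue numbering has changed after the calculation, so that the intended mutation site matches the actual one.

    Startframe

    Index of the first frame.

    Endframe

    Index of the last frame.

    Skipframe

    Interval between frames.

    Result Description

    The output includes:

    Output File Name Description
    mmpbsa_energy_gb(pb)_*.csv CSV file of the binding free energy over time, calculated with the gb (pb) method
    mmpbsa_energy_mutation_gb(pb)_*.csv CSV file of the binding free energy over time after alanine mutation, calculated with the gb (pb) method
    mmpbsa_energy_total_*.dat DAT file of the binding free energy over time, calculated with the gb (pb) method
    mmpbsa_result_*.dat DAT file of the total binding free energy

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.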

  • Name: Antibody Viscosity Predictor
    Description: 通过预测抗体的spatial charge map (SCM) score来反映抗体分子的粘度。 Estimates the viscosity of an antibody by predicting its spatial charge map (SCM) score.
    Tags: undefined
    Author: Pin-Kuang Lai
    Release: 2023-04-27 14:59:34
    Reference: Lai PK. DeepSCM: An efficient convolutional neural network surrogate model for the screening of therapeutic antibody viscosity. Comput Struct Biotechnol J. 2022 Apr 29;20:2143-2152.

    Antibody Viscosity Predictor(DeepSCM)

    简介

    通过预测抗体的spatial charge map (SCM) score来反映抗体分子的粘度。先进行了高通量动力学模拟MD,计算了6596个非冗余抗体可变区域的spatial charge map(SCM)分数(mAbs. 2015, 8(1):43–48)。然后根据这个数据集开发了一个卷积神经网络模型,只需要序列信息。在测试集(N = 1320)上,模型预测的SCM分数与MD模拟后计算的SCM分数的线性相关系数达到0.9。该模型被应用于筛选38种治疗性抗体的粘度,并正确地进行了分类,只有一个错误的分类。该模型将促进高浓度抗体粘度的筛选。
    image.png

    参数说明

    Heavy Chain

    抗体重链序列,FASTA格式,支持多条重链,例如:

    >name_1	
    [heavy chain sequence]
    >name_2
    [heavy chain sequence]
    >name_3
    [heavy chain sequence]
    

    同一个抗体的重、轻链序列的名称要一致!

    Light Chain

    抗体轻链序列,FASTA格式,支持多条轻链,例如:

    >name_1	
    [light chain sequence]
    >name_2
    [light chain sequence]
    >name_3
    [light chain sequence]
    

    同一个抗体的重、轻链序列的名称要一致!

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    Name 抗体名称
    SCM_score 预测得到的SCM分值,该分值大于1000时,表示粘度高

    参考文献

    Lai PK. DeepSCM: An efficient convolutional neural network surrogate model for the screening of therapeutic antibody viscosity. Comput Struct Biotechnol J. 2022 Apr 29;20:2143-2152.

    Antibody Viscosity Predictor (DeepSCM)

    Introduction

    The Antibody Viscosity Predictor (DeepSCM) predicts the viscosity of antibodies by estimating the spatial charge map (SCM) score of the antibody molecule. High-throughput molecular dynamics simulations were conducted to calculate the spatial charge map (SCM) scores of 6596 non-redundant antibody variable regions (mAbs. 2015, 8(1):43–48). Subsequently, a convolutional neural network model was developed based on this dataset, requiring only sequence information. On a test set (N = 1320), the model’s predicted SCM scores showed a linear correlation coefficient of 0.9 with SCM scores calculated after MD simulations. The model was applied to screen the viscosity of 38 therapeutic antibodies, correctly classifying them with only one misclassification. This model will facilitate the screening of high-concentration antibody viscosities.
    image.png

    Parameter Description

    Heavy Chain

    Antibody heavy chain sequences in FASTA format, supporting multiple heavy chains, for example:

    >name_1	
    [heavy chain sequence]
    >name_2
    [heavy chain sequence]
    >name_3
    [heavy chain sequence]
    

    The names of the heavy and light chain sequences of the same antibody must match!

    Light Chain

    Antibody light chain sequences in FASTA format, supporting multiple light chains, for example:

    >name_1	
    [light chain sequence]
    >name_2
    [light chain sequence]
    >name_3
    [light chain sequence]
    

    The names of the heavy and light chain sequences of the same antibody must match!
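    Because heavy and light chains are matched by record name, it can help to compare the two FASTA inputs before running the module. The helper below is a hypothetical sketch that reports names present in only one of the two inputs.

```python
def fasta_names(fasta_text: str) -> list:
    """Record names (header lines without '>') in order of appearance."""
    return [line[1:].strip()
            for line in fasta_text.splitlines()
            if line.startswith(">")]


def unmatched_names(heavy_fasta: str, light_fasta: str) -> list:
    """Names that occur in only one of the heavy/light chain inputs."""
    heavy = set(fasta_names(heavy_fasta))
    light = set(fasta_names(light_fasta))
    return sorted(heavy ^ light)  # symmetric difference
```

    An empty list means every antibody has both a heavy and a light chain record under the same name.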

    Result Description

    The output file is result.csv, containing the following information:

    Field Name Description
    Name Antibody name
    SCM_score Predicted SCM score, where a score greater than 1000 indicates high viscosity

    References

    Lai PK. DeepSCM: An efficient convolutional neural network surrogate model for the screening of therapeutic antibody viscosity. Comput Struct Biotechnol J. 2022 Apr 29;20:2143-2152.

  • Name: Extended Tight Binding (XTB)
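    Description: Fast energy calculation and structure optimization based on extended tight-binding quantum chemistry methods (similar to semi-empirical DFT); applicable to large systems with thousands of atoms.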
    Tags: undefined
    Author: Christoph Bannwarth
    Release: 2023-04-27 11:23:42
    Reference: Bannwarth C, Caldeweyher E, Ehlert S, et al. Extended tight-binding quantum chemistry methods. WIREs Comput Mol Sci. 2021; 11:e1493.

    XTB (Extended Tight Binding)

    简介

    基于紧束缚量子化学方法 (类似于半经验DFT) 的快速能量计算和结构优化,可计算上千个原子的大体系。

    参数说明

    Structure

    分子结构文件,xyz, pdb, mol, mol2, gjf, com, fchk格式。

    Task Type

    计算任务的类型:能量计算(single point)和结构优化(optimization),默认optimization。

    Theory Version

    GFNn-xTB理论的版本。GFN0-xTB, GFN1-xTB, GFN2-xTB,默认GFN2-xTB

    Solvent

    选择隐式溶剂模型:gas, toluene, thf, methanol, h2o, ether, chcl3, acetonitrile, acetone, cs2。默认气相条件(gas)

    Charge

    分子总电荷,默认为0.

    Spin Multiplicity

    分子自旋多重度(一般为单电子数目+1),默认1。

    结果说明

    xtbopt.xyz是最后结构的xyz坐标文件,文件里面第二行记录了能量信息
    xtbopt.log是优化过程每一帧的坐标,为xyz格式。后缀改为.xyz可通过支持.xyz结构的可视化软件查看
    .out文件为计算结果输出信息

    参考文献

    Bannwarth C, Caldeweyher E, Ehlert S, et al. Extended tight-binding quantum chemistry methods. WIREs Comput Mol Sci. 2021; 11:e1493.

    XTB (Extended Tight Binding)

    Introduction

    Fast energy calculation and structure optimization based on extended tight-binding quantum chemistry methods (similar to semi-empirical DFT), applicable to large systems with thousands of atoms.

    Parameter

    Structure

    Molecular structure file in xyz, pdb, mol, mol2, gjf, com, fchk format.

    Task Type

    Type of calculation: energy calculation (single point) or structure optimization (optimization). Default: optimization.

    Theory Version

    Version of the GFNn-xTB theory: GFN0-xTB, GFN1-xTB, or GFN2-xTB. Default: GFN2-xTB.

    Solvent

    Implicit solvent model: gas, toluene, thf, methanol, h2o, ether, chcl3, acetonitrile, acetone, or cs2. Default: gas (gas phase).

    Charge

    Total charge of the molecule. Default: 0.

    Spin Multiplicity

    Molecular spin multiplicity (usually the number of single electrons +1), default 1.
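    For orientation, the parameters above map naturally onto the options of the standalone xtb command-line program. The sketch below builds such a command as an argument list. It is an assumption about how the settings could be expressed, not a description of how this module invokes xtb, and the function name is hypothetical. The xtb option --uhf takes the number of unpaired electrons, which is the spin multiplicity minus 1.

```python
def xtb_command(structure: str,
                task: str = "optimization",
                gfn: int = 2,
                solvent: str = "gas",
                charge: int = 0,
                multiplicity: int = 1) -> list:
    """Build an xtb argument list from the module-style parameters."""
    if multiplicity < 1:
        raise ValueError("Spin multiplicity must be at least 1")
    unpaired_electrons = multiplicity - 1  # multiplicity = unpaired electrons + 1
    args = ["xtb", structure,
            "--gfn", str(gfn),
            "--chrg", str(charge),
            "--uhf", str(unpaired_electrons)]
    if task == "optimization":
        args.append("--opt")            # omitted for a single-point calculation
    if solvent != "gas":
        args += ["--alpb", solvent]     # implicit solvation; skipped in gas phase
    return args
```

    With the defaults, xtb_command("mol.xyz") corresponds to a gas-phase GFN2-xTB geometry optimization of a neutral singlet.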

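    Result

    xtbopt.xyz contains the xyz coordinates of the final structure; the second line of the file records the energy.
    xtbopt.log contains the coordinates of every frame of the optimization in xyz format. After the extension is changed to .xyz, it can be opened in any visualization software that supports xyz files.
    The .out file contains the calculation output.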

    Reference

    Bannwarth C, Caldeweyher E, Ehlert S, et al. Extended tight-binding quantum chemistry methods. WIREs Comput Mol Sci. 2021; 11:e1493.

  • Name: Protein Structure Prediction (AlphaFold2.3.2)_部分可见_内部使用_序列个数和长度_放宽限制(官方原版)
    Description: AlphaFold2 is a highly accurate protein structure prediction package. This is a completely new model that was entered in CASP14 and published in Nature. AlphaFold2 是一个高度准确的蛋白质结构预测包,是目前最高精度的方法之一,甚至接近实验水平。 这是其最新的模型,已进入 CASP14 并发表在 Nature 上。
    Tags: undefined
    Author: DeepMind, Jumper, J., Evans, R., Pritzel, A. et al.
    Release: 2021-11-09 08:00:00
    Reference: Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2

    AlphaFold2

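    Introduction

    AlphaFold2 is a leading protein structure prediction method. It was developed by the DeepMind team. In the 2020 CASP14 protein structure prediction competition, AlphaFold2 achieved a score close to 90, ranking first and significantly outperforming the second-place entry. For most proteins, its predictions differed from the experimental structure by about the width of a single atom, an accuracy comparable to that of experimental techniques such as cryo-electron microscopy. Later versions support the prediction of complex structures, including protein-peptide complexes.

    image.png

    image.png

    Above: Protein monomer prediction accuracy
    image.png
    Above: Protein complex prediction accuracy

    Input Parameters

    Input File

    Input sequence file in FASTA format.

    Type

    Prediction task type, either monomer or multimer.
    monomer: Single protein, single chain.
    multimer: Complex, multiple chains, with a maximum of 6 chains. Systems with more than 6 chains are not processed.

    Calculation Example

    Model Confidence Evaluation File

    ranking_debug.json is a JSON text file containing the pLDDT values used for model ranking, together with a mapping to the original model names.
    image.png

    AlphaFold2 provides a confidence metric for monomer structure prediction called pLDDT, ranging from 0 to 100. A higher value indicates a more reliable predicted structure. Values below 70 are considered low confidence, and values below 50 are considered very low confidence and typically correspond to disordered regions.
    Very high (pLDDT > 90)
    Confident (90 > pLDDT > 70)
    Low (70 > pLDDT > 50)
    Very low (pLDDT < 50)

    For complex prediction, the confidence metric is DockQ, ranging from 0 to 1. A higher value indicates a more reliable predicted complex structure.
    0.00 <= DockQ < 0.23 - Incorrect
    0.23 <= DockQ < 0.49 - Acceptable quality
    0.49 <= DockQ < 0.80 - Medium quality
    DockQ >= 0.80 - High quality

    Final Predicted Protein Structure Files

    By default, 5 predicted structures are provided for a monomer and 25 for a complex.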
    image.png

  • Name: Protein Structure Prediction (AlphaFold2.3.2)_部分可见_内部使用_序列个数和长度_放宽限制(跳过缓存)
    Description: AlphaFold2 is a highly accurate protein structure prediction package. This is a completely new model that was entered in CASP14 and published in Nature. AlphaFold2 是一个高度准确的蛋白质结构预测包,是目前最高精度的方法之一,甚至接近实验水平。 这是其最新的模型,已进入 CASP14 并发表在 Nature 上。
    Tags: undefined
    Author: DeepMind, Jumper, J., Evans, R., Pritzel, A. et al.
    Release: 2021-11-09 08:00:00
    Reference: Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2

    AlphaFold2

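    Introduction

    AlphaFold2 is a leading protein structure prediction method. It was developed by the DeepMind team. In the 2020 CASP14 protein structure prediction competition, AlphaFold2 achieved a score close to 90, ranking first and significantly outperforming the second-place entry. For most proteins, its predictions differed from the experimental structure by about the width of a single atom, an accuracy comparable to that of experimental techniques such as cryo-electron microscopy. Later versions support the prediction of complex structures, including protein-peptide complexes.

    image.png

    image.png

    Above: Protein monomer prediction accuracy
    image.png
    Above: Protein complex prediction accuracy

    Input Parameters

    Input File

    Input sequence file in FASTA format.

    Type

    Prediction task type, either monomer or multimer.
    monomer: Single protein, single chain.
    multimer: Complex, multiple chains, with a maximum of 6 chains. Systems with more than 6 chains are not processed.

    Calculation Example

    Model Confidence Evaluation File

    ranking_debug.json is a JSON text file containing the pLDDT values used for model ranking, together with a mapping to the original model names.
    image.png

    AlphaFold2 provides a confidence metric for monomer structure prediction called pLDDT, ranging from 0 to 100. A higher value indicates a more reliable predicted structure. Values below 70 are considered low confidence, and values below 50 are considered very low confidence and typically correspond to disordered regions.
    Very high (pLDDT > 90)
    Confident (90 > pLDDT > 70)
    Low (70 > pLDDT > 50)
    Very low (pLDDT < 50)

    For complex prediction, the confidence metric is DockQ, ranging from 0 to 1. A higher value indicates a more reliable predicted complex structure.
    0.00 <= DockQ < 0.23 - Incorrect
    0.23 <= DockQ < 0.49 - Acceptable quality
    0.49 <= DockQ < 0.80 - Medium quality
    DockQ >= 0.80 - High quality

    Final Predicted Protein Structure Files

    By default, 5 predicted structures are provided for a monomer and 25 for a complex.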
    image.png

  • Name: Molecular Docking (DiffDock)
    Description: A diffusion generative model for docking small molecules to proteins. DiffDock achieves a top-1 success rate of 38% (RMSD < 2 Å) on PDBBind, significantly exceeding the previous state of the art of traditional docking (23%) and deep learning (20%) methods. Whereas previous methods struggle to dock onto computationally folded structures (maximum accuracy 10.4%), DiffDock maintains significantly higher accuracy (21.7%). DiffDock also has fast inference times and provides confidence estimates with high selective accuracy.
    Author: Gabriele Corso
    Release: 2023-04-21 17:05:53
    Reference: Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. arXiv:2210.01776, 2022 Oct (v1).

    Molecular Docking (DiffDock)

    Introduction

    Molecular Docking (DiffDock) is a diffusion generative model used primarily for docking small molecules to proteins. DiffDock achieves a top-1 success rate of 38% (RMSD < 2 Å) on PDBBind, significantly surpassing the previous state of the art of traditional docking methods (23%) and deep learning methods (20%). Whereas previous methods struggle to dock onto computationally folded structures (maximum accuracy of 10.4%), DiffDock maintains significantly higher accuracy (21.7%). DiffDock also offers fast inference times and provides confidence estimates with high selective accuracy.

    Parameter Description

    Receptor File

    Structure file of the protein in PDB format. Supports up to 1022 amino acids.

    Ligand File

    Structure file of the small molecule in SDF format.

    Number of Poses

    The number of conformations obtained for each ligand docked with the receptor, default is 10.

    Result Description

    The output includes:

    Output File Name              Description
    Scores.csv                    Scoring file for all ligands (≤2000) with the receptor.
    output_ligand.sdf             SDF file containing all ligands after docking.
    output_complex_topn.tar.gz    Compressed file containing, for each ligand among the TopN small molecules, the PDB file of its top-scoring complex conformation.
    display_complex.pdb           Complex conformation of the ligand and receptor, used for display.

    The Scores.csv contains the following information:

    Field Name           Description
    Ligand ID            Ligand identifier.
    Confidence           Docking confidence score. Confidence scores are hard to interpret and compare across different complexes or protein conformations, but a rough comparison can be made using the following criteria (c is the confidence score of the top pose): c > 0 indicates high confidence; -1.5 < c < 0 indicates moderate confidence; c < -1.5 indicates low confidence.
    Complex File Name    Name of the complex.
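
    The confidence criteria above can be applied to Scores.csv with a few lines of Python. This is a minimal sketch rather than part of the module: the function names are hypothetical, the CSV header names are assumed to match the field names in the table above exactly, and the treatment of scores equal to 0 or -1.5 is an assumption because the criteria leave those boundaries open.

    ```python
    import csv
    import io
    from typing import List, Tuple


    def classify_confidence(c: float) -> str:
        """Rough confidence label for the top-pose confidence score c."""
        if c > 0:
            return "high"
        # Assumption: c == 0 counts as moderate and c == -1.5 counts as low.
        if c > -1.5:
            return "moderate"
        return "low"


    def summarize_scores(csv_text: str) -> List[Tuple[str, float, str]]:
        """Return (ligand id, confidence, label) tuples, best score first.

        Assumes the header row uses the names "Ligand ID" and "Confidence".
        """
        reader = csv.DictReader(io.StringIO(csv_text))
        rows = []
        for row in reader:
            score = float(row["Confidence"])
            rows.append((row["Ligand ID"], score, classify_confidence(score)))
        rows.sort(key=lambda item: item[1], reverse=True)
        return rows
    ```

    Reading the real file is then a matter of passing open("Scores.csv").read() to summarize_scores.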

    References

    Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. arXiv:2210.01776, 2022 Oct (v1).

  • Name: Synthetic Accessibility Score
    Description: The SA Score (synthetic accessibility score) estimates how easily a compound can be synthesized. Synthesis difficulty is rated on a scale from 1 to 10: the closer to 1, the easier the synthesis, and the closer to 10, the more difficult. The SA Score is widely used to predict the synthetic feasibility of new compounds and to speed up compound screening and drug discovery.
    Author: Peter Ertl
    Release: 2023-04-21 16:46:22
    Reference: Ertl P, Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform. 2009 Jun 10;1(1):8.

    Synthetic Accessibility Score

    Introduction

    The Synthetic Accessibility Score estimates how easily a compound can be synthesized. It rates the synthesis difficulty of small molecules on a scale of 1 to 10, with values closer to 1 indicating easier synthesis and values closer to 10 indicating more difficult synthesis. The score combines fragment contributions with a complexity penalty. The fragment contributions are derived from how commonly each fragment occurs among millions of molecules in the PubChem database, while the complexity penalty accounts for non-standard structural features such as macrocycles, fused non-standard rings, stereochemical complexity, and molecular size. The method was validated on 40 compounds by comparing SA Scores with synthesis-difficulty ratings from experienced medicinal chemists; the two sets of scores agree well (r² = 0.89), which supports the reliability of the SA Score for ranking synthetic feasibility. The SA Score has become a widely used metric for predicting the synthetic feasibility of new compounds and for speeding up compound screening and drug discovery.

    Parameter Description

    File Mode

    Input File

    Small molecule structure file in SDF or SMILES format.

    Smiles Mode

    Smiles String

    Small molecule structures in SMILES format. Multiple molecules are supported, one SMILES string per line, for example:
    CSC1=C(c2ccc(C)s2)/C(=N/C(C)(C)C)C1
    CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5

    Result Description

    The output file is sa_score.csv, containing the following information:

    Field Name    Description
    smiles        SMILES structure of the small molecule
    Name          Name of the small molecule
    sa_score      Synthetic Accessibility Score value for the compound
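
    A typical follow-up step is to rank compounds by sa_score and keep the easier ones. The sketch below assumes that the header row of sa_score.csv uses exactly the field names listed above; the function name and the cutoff value used in the example are hypothetical and are not recommendations from the module.

    ```python
    import csv
    import io
    from typing import List, Optional, Tuple


    def rank_by_sa_score(csv_text: str,
                         max_score: Optional[float] = None) -> List[Tuple[str, str, float]]:
        """Return (Name, smiles, sa_score) tuples sorted from easiest to hardest.

        If max_score is given, compounds scoring above it are dropped.
        """
        reader = csv.DictReader(io.StringIO(csv_text))
        rows = [(row["Name"], row["smiles"], float(row["sa_score"])) for row in reader]
        if max_score is not None:
            rows = [row for row in rows if row[2] <= max_score]
        return sorted(rows, key=lambda row: row[2])
    ```

    For example, rank_by_sa_score(text, max_score=4.0) keeps only compounds with a score of 4.0 or lower, where 4.0 is an arbitrary illustrative cutoff.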

    References

    Ertl P, Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform. 2009 Jun 10;1(1):8.

  • Name: Protein Structure Prediction (AlphaFold2.3.2)_部分可见_测试
    Description: AlphaFold2 is a highly accurate protein structure prediction package and one of the most accurate methods currently available, in some cases approaching experimental accuracy. The model was entered in CASP14 and published in Nature.
    Author: DeepMind, Jumper, J., Evans, R., Pritzel, A. et al.
    Release: 2021-11-09 08:00:00
    Reference: Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2

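    AlphaFold2

    Introduction

    AlphaFold2, developed by DeepMind, is among the most accurate protein structure prediction methods currently available. In CASP14, the 2020 protein structure prediction assessment, AlphaFold 2 ranked first by a wide margin with a score close to 90 (out of 100). For most targets its predictions differ from the experimental structure by about the width of an atom, an accuracy comparable to that of experimental techniques such as cryo-electron microscopy. Later releases also support complex prediction, including protein-peptide complexes.

    [Figure: prediction accuracy for protein monomers]
    [Figure: prediction accuracy for protein complexes]

    Input Parameters

    Input File

    Input sequence file in FASTA format.

    Type

    Prediction task type: monomer or multimer.
    monomer: a single protein chain.
    multimer: a complex of multiple chains. At most 6 chains are supported; inputs with more than 6 chains are not processed.

    Example Results

    Model confidence file

    ranking_debug.json is a JSON text file containing the pLDDT values used to rank the models, together with a mapping back to the original model names.

    AlphaFold2 reports pLDDT as a confidence measure for monomer predictions. It ranges from 0 to 100, and higher values indicate a more reliable prediction. Values below 70 indicate low reliability, and values below 50 indicate very low confidence, typically corresponding to disordered regions.
    Very high (pLDDT > 90)
    Confident (90 > pLDDT > 70)
    Low (70 > pLDDT > 50)
    Very low (pLDDT < 50)

    For complex prediction, the quality metric is DockQ. It ranges from 0 to 1, and higher values indicate a more reliable predicted complex.
    0.00 <= DockQ < 0.23 - Incorrect
    0.23 <= DockQ < 0.49 - Acceptable quality
    0.49 <= DockQ < 0.80 - Medium quality
    DockQ >= 0.80 - High quality

    Predicted structure files

    By default, 5 predicted structures are provided for a monomer and 25 for a complex.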

  • Name: Cleavage Site Prediction
    Description: Predicts proteolytic cleavage sites for eight commonly used proteases (trypsin, ArgC, chymotrypsin, GluC, LysC, AspN, LysN, and LysargiNase).
    Author: Yang, J
    Release: 2023-04-13 17:51:49
    Reference: Yang, J.; Gao, Z.; Ren, X.; Sheng, J.; Xu, P.; Chang, C.; Fu, Y. DeepDigest: Prediction of Protein Proteolytic Digestion with Deep Learning. Anal. Chem. 2021, 93 (15), 6094–6103.

    Proteotypic Cleavage Site Predictor

    Introduction

    The Proteotypic Cleavage Site Predictor module uses deep learning to predict the proteolytic cleavage sites of eight commonly used proteases (trypsin, ArgC, chymotrypsin, GluC, LysC, AspN, LysN, and LysargiNase). It combines convolutional neural networks with long short-term memory (LSTM) networks to achieve high accuracy and robustness. Compared with traditional machine learning algorithms (logistic regression, random forest, and support vector machine), it predicts more accurately for all eight proteases.
    The eight proteases and their cleavage specificities are:

    1. Trypsin is a digestive protease secreted by the pancreas. It cleaves peptide bonds on the C-terminal side of lysine and arginine residues.
    2. ArgC cleaves on the C-terminal side of arginine residues (Arg).
    3. Chymotrypsin is a digestive protease secreted by the pancreas. It cleaves on the C-terminal side of aromatic residues (Phe, Tyr, Trp).
    4. GluC cleaves on the C-terminal side of glutamic acid residues (Glu).
    5. LysC cleaves on the C-terminal side of lysine residues (Lys).
    6. AspN cleaves on the N-terminal side of aspartic acid residues (Asp).
    7. LysN cleaves on the N-terminal side of lysine residues (Lys).
    8. LysargiNase cleaves on the N-terminal side of arginine and lysine residues, mirroring the specificity of trypsin.
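
    The specificities listed above determine the theoretical digested peptides that appear in the output. The following Python sketch performs a simple rule-based in-silico digestion. It is an illustration only and is not the deep learning model used by this module: the function name and rule table are hypothetical, every matching site is cut (no missed cleavages), and known exceptions such as reduced cleavage before proline are ignored.

    ```python
    from typing import Dict, List, Tuple

    # (target residues, side): "C" cuts after the residue, "N" cuts before it.
    CLEAVAGE_RULES: Dict[str, Tuple[str, str]] = {
        "trypsin": ("KR", "C"),
        "ArgC": ("R", "C"),
        "chymotrypsin": ("FYW", "C"),
        "GluC": ("E", "C"),
        "LysC": ("K", "C"),
        "AspN": ("D", "N"),
        "LysN": ("K", "N"),
        "LysargiNase": ("KR", "N"),
    }


    def digest(sequence: str, protease: str) -> List[str]:
        """Split a protein sequence into fully cleaved theoretical peptides."""
        residues, side = CLEAVAGE_RULES[protease]
        peptides = []
        start = 0
        for index, amino_acid in enumerate(sequence):
            if amino_acid not in residues:
                continue
            cut = index + 1 if side == "C" else index
            # Skip cuts at the very ends, which would produce empty peptides.
            if start < cut < len(sequence):
                peptides.append(sequence[start:cut])
                start = cut
        peptides.append(sequence[start:])
        return peptides
    ```

    For example, digest("AKRDEK", "trypsin") cuts after the internal K and R, while digest("AKRDEK", "AspN") cuts once, before the D.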

    Parameter

    Protein Sequence File

    Protein sequence file in FASTA format

    Result

    One CSV file is produced for each of the eight proteases. Each CSV file contains the following information:

    Field Name                              Description
    Protein id                              The identity of the protein from which the peptide is digested.
    Peptide sequence                        The sequence of the theoretical digested peptide.
    Digestibility of the N-terminal site    The predicted cleavage probability of the cleavage site at the N-terminus of the peptide.
    Digestibility of the C-terminal site    The predicted cleavage probability of the cleavage site at the C-terminus of the peptide.
    Digestibility of the missed site(s)     The predicted cleavage probabilities of all missed cleavage sites (sites other than the N- and C-terminal ones) within the peptide.

    Note: probabilities range from 0 to 1; the closer to 1, the more likely the cleavage.

    Reference

    Yang, J.; Gao, Z.; Ren, X.; Sheng, J.; Xu, P.; Chang, C.; Fu, Y. DeepDigest: Prediction of Protein Proteolytic Digestion with Deep Learning. Anal. Chem. 2021, 93 (15), 6094–6103.

  • Name: Protein Design (RFDiffusion)
    Description: Diffusion models have had considerable success in image and language generative modeling but limited success when applied to protein modeling, likely due to the complexity of protein backbone geometry and sequence-structure relationships. David Baker's group developed RoseTTAFold Diffusion (RFdiffusion) and demonstrated the power and generality of the method by experimentally characterizing the structures and functions of hundreds of new designs. By fine-tuning the RoseTTAFold structure prediction network on protein structure denoising tasks, they obtained a generative model of protein backbones that achieves outstanding performance on unconditional and topology-constrained protein monomer design, protein binder design, symmetric oligomer design, enzyme active site scaffolding, and symmetric motif scaffolding for therapeutic and metal-binding protein design. In a manner analogous to networks that produce images from user-specified inputs, RFdiffusion enables the design of diverse, complex, functional proteins from simple molecular specifications.
    Author: Joseph L. Watson, David Baker
    Release: 2023-04-06 15:43:44
    Reference: Joseph L. Watson, David Juergens, Nathaniel R. Bennett, et al. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv 2022.12.09.519842.

    Protein Design (RFDiffusion)

    Introduction

    RFdiffusion is a generative model of protein backbones obtained by fine-tuning the RoseTTAFold structure prediction network on protein structure denoising tasks within a diffusion probabilistic framework. It performs well on unconditional and topology-constrained protein monomer design, protein binder design, symmetric oligomer design, enzyme active site scaffolding, and symmetric motif scaffolding for therapeutic and metal-binding protein design. RFdiffusion can design diverse, complex, and functional proteins from simple molecular specifications.

    The module supports multi-scenario protein design, including Motif Scaffolding, Unconditional protein generation, Symmetric unconditional generation (cyclic, dihedral, and tetrahedral symmetries), Symmetric motif scaffolding, Binder design, and Design diversification (“partial diffusion”).

    Parameter Description

    Custom Mode

    Reference Protein Structure

    The reference protein used for design.

    Design Type

    Two design types are supported, ‘Motif_Scaffold’ and ‘Binder’:

    • ‘Motif_Scaffold’: Design based on the scaffold structure of the reference protein, as defined by the subsequent parameters.
    • ‘Binder’: Design of a Binder protein for the receptor structure.

    Contigs

    Defines the protein design strategy, specifying which parts of the protein are designed and which are retained.
    For example, when this parameter is set to ‘5-15/A10-25/30-40/0 B1-100’:

    • ‘5-15’: First design a motif with a length between 5 and 15 (the exact length is chosen at random; to fix it at 10, write 10-10).
    • ‘/A10-25’: Next take residues 10 to 25 of chain A from the reference protein, with their N-terminus connected to the C-terminus of the ‘5-15’ motif. Note: in the chain name + residue number form, a range must be written even for a single residue, so A10 should be written as A10-10.
    • ‘/30-40’: Next design a motif with a length between 30 and 40 (chosen at random), with its N-terminus connected to the C-terminus of the preceding segment.
    • ‘/0 ’: Chain break. The previous chain ends and whatever follows forms a new chain. Note the space after the 0.
    • ‘B1-100’: Take residues 1 to 100 of chain B from the reference protein as a new chain.

    Note:

    1. If residues are missing from the input PDB file, the numbers of the missing residues must not appear in the Contigs parameter. For example, if residue 45 is missing from chain A, expressions that cover residue 45, such as A45 or A10-50, should be avoided. A10-50 can be rewritten as A10-44/A46-50.
    2. For Binder design, the receptor must be included in the Contigs and separated from the Binder by the ‘/0 ’ chain break. For example, to design a Binder of 70-100 amino acids against a single-chain receptor of 150 amino acids named chain A, set Contigs to ‘A1-150/0 70-100’. Here ‘A1-150’ is the receptor protein, ‘/0 ’ is the chain break that keeps the receptor and the designed Binder from being joined by a peptide bond, and ‘70-100’ sets the Binder length to 70-100 amino acids.
    3. Residue numbers in Contigs and Hotspot Residues must be the residue numbers used in the original PDB file. For antibody calculations with insertion codes, the PDB file can first be renumbered with PDB ReNumbering.

    Hotspot Residues

    In Binder mode, hotspot residues in the receptor can be specified as chain name followed by residue number, separated by commas, such as ‘A59,A83,A91’.

    Symmetry

    Designs symmetric proteins. The value is C_N or D_N, where C denotes cyclic symmetry, D denotes dihedral symmetry, and N is the number of monomers. For example, C2 designs a cyclic symmetric protein containing 2 monomers.
    Note: When designing symmetric proteins, the Contigs parameter must match the symmetry. For example, when Symmetry is set to C2, Contigs should describe two chains.

    Binder Mode

    Reference Protein Structure

    The reference protein used for design.

    Index Type

    Sets the kind of residue index used by the subsequent parameters (Receptor Range, Initial Binder, Hotspot Residues).
    There are two options, UID and POS. UID is the residue numbering stored in the PDB file, which may contain gaps and need not start at 1. POS is the positional (sequential) numbering, starting at 1.
    The default value is UID.

    Receptor Range

    Defines the receptor protein, that is, which part of the reference protein is used as the receptor. The format is chain name + residue number or range, with multiple segments separated by commas. For example, A1-50,A70-100,A105,A108,B1-108 selects residues 1 to 50, 70 to 100, 105, and 108 of chain A and residues 1 to 108 of chain B as the receptor.
    Note: The residue numbers entered here must follow the numbering chosen in Index Type.

    Length of Binder

    Defines the length of the Binder protein, either as a fixed length or as a range. For example:
    20 means the Binder is 20 residues long;
    20-50 means the Binder is between 20 and 50 residues long, with the actual length determined by the final design.

    Initial Binder

    Specifies an initial Binder in the structure, that is, which part of the reference protein is the initial Binder. The model extends the Binder without changing the initial Binder. For example, B1-10 specifies residues 1 to 10 of chain B as the initial Binder, which the model then extends.

    Hotspot Residues

    Specifies hotspot residues in the receptor as the binding site for the Binder protein. The format is chain name + residue number or range, with multiple segments separated by commas. For example, A59-61,A83,A91 specifies residues 59 to 61, 83, and 91 of chain A as the binding site.

    Scaffolding & Infilling Mode

    Reference Protein Structure

    The reference protein used for design.

    Index Type

    Sets the kind of residue index used by the subsequent parameter (Design Range).
    There are two options, UID and POS. UID is the residue numbering stored in the PDB file, which may contain gaps and need not start at 1. POS is the positional (sequential) numbering, starting at 1.
    The default value is UID.

    Design Range

    Defines the part of the protein backbone to be designed, that is, which part of the reference protein is designed. The format is chain name + residue number or range, with multiple segments separated by commas. For example, A1-50,A70-100,A105,A108,B1-108 selects residues 1 to 50, 70 to 100, 105, and 108 of chain A and residues 1 to 108 of chain B for backbone design.
    Note: The residue numbers entered here must follow the numbering chosen in Index Type.

    Length

    Defines the designed length of each segment in Design Range, with multiple lengths separated by commas. If this parameter is not set, each segment is designed at its original length.
    Note: The number of lengths must equal the number of segments in Design Range, in the same order. A length can take one of the following forms:

    • A non-negative integer: 0 means the segment is ignored and not designed; any positive integer is the designed length of that segment.
    • The letter N: the segment keeps its original length.
    • A length range such as 5-10: the designed length varies between 5 and 10 residues, with the actual length determined by the final design.

    Example: N,5-10,15 defines three lengths (so Design Range should also contain three segments). The first segment keeps its length, the second is designed with a length of 5-10, and the third is designed with a length of 15.

    Other Design Mode

    An optional alternative design mode. When set to Fix, the Design Range defined above is held fixed and all other regions of the structure are designed.

    Fluctuation Length

    When Other Design Mode is set to Fix, the other regions are designed with lengths that vary around their original lengths. This parameter sets the size of that variation. The default is 5, meaning up to 5 residues fewer or more than the original length.
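
    The Contigs syntax used in Custom mode can be made concrete with a small parser. The sketch below is an illustration only and is not the parser used by RFdiffusion or by this module. The function names are hypothetical, and it assumes that chains are separated by whitespace, that a trailing ‘0’ token marks a chain break, and that every segment is written as a range (for example A10-10 rather than A10).

    ```python
    import re
    from typing import List, Optional, Tuple

    # (chain, start, end); chain is None for a segment that is designed de novo.
    Segment = Tuple[Optional[str], int, int]

    _SEGMENT = re.compile(r"([A-Za-z]?)(\d+)-(\d+)")


    def parse_contigs(contigs: str) -> List[List[Segment]]:
        """Parse a Contigs string into one list of segments per chain."""
        chains: List[List[Segment]] = []
        for chain_spec in contigs.split():
            segments: List[Segment] = []
            for token in chain_spec.split("/"):
                if token == "0":  # chain-break marker, carries no residues
                    continue
                match = _SEGMENT.fullmatch(token)
                if match is None:
                    raise ValueError(f"unrecognized contig token: {token!r}")
                chain = match.group(1) or None
                start, end = int(match.group(2)), int(match.group(3))
                if start > end:
                    raise ValueError(f"range start exceeds end in {token!r}")
                segments.append((chain, start, end))
            if segments:
                chains.append(segments)
        return chains


    def length_bounds(segments: List[Segment]) -> Tuple[int, int]:
        """Minimum and maximum total length of one chain."""
        shortest = longest = 0
        for chain, start, end in segments:
            if chain is None:
                # Designed segment: start and end are the length limits.
                shortest += start
                longest += end
            else:
                # Retained segment: fixed number of residues.
                count = end - start + 1
                shortest += count
                longest += count
        return shortest, longest
    ```

    For the example ‘5-15/A10-25/30-40/0 B1-100’, the first chain contains a designed segment, 16 retained residues from chain A, and a second designed segment, so its total length lies between 51 and 71 residues.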

    Result Description

    Output PDB files for different design modes.

    Note:

    • The Binder design output will be a poly-glycine (poly-G) sequence, which is not an error. RFdiffusion is a backbone generation model and does not generate sequences for the designed regions. Therefore, another method must be used to generate appropriate sequences for the Binder. It is recommended to use ProteinMPNN for sequence design (this module is deployed in WeMol, and the overall complex PDB generated here can be used for sequence design).
    • The output PDB file will be renumbered starting from 1.

    Reference

    Joseph L. Watson, David Juergens, Nathaniel R. Bennett, et al. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv 2022.12.09.519842.

  • Name: Neighbor Search
    Description: Given a compressed package of PDB files, saves every PDB structure in which the closest distance between atoms of the two selected amino acid residues is below the cutoff value to a new compressed file named result.tar.gz.
    Author: Wecomput
    Release: 2023-04-04 13:06:13
    Reference: NA

    Neighbor Search

    Introduction

    This module analyzes the uploaded PDB files and returns the PDB structures in which the closest interatomic distance between the two selected amino acid residues is less than a user-defined cutoff value.

    Parameter Description

    Input Compressed File

    Upload a compressed file containing PDB files or a single PDB file.

    Chain1: Residue ID

    Specify the position of the first group of amino acid residues for distance measurement in the format of Chain Name:Residue Number, separated by a colon. For example: A:401.

    Chain2: Residue ID

    Specify the position of the second group of amino acid residues for distance measurement in the format of Chain Name:Residue Number, separated by a colon. For example: A:900.

    Distance Cutoff

    Distance cutoff value.

    Result Description

    The output includes:

    Output File Name    Description
    result.tar.gz       Compressed file containing all PDB files that meet the criteria
    output_1.pdb        The first PDB file that meets the criteria

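    The distance test described above can be sketched in plain Python. This is a minimal illustration, not the module's implementation: the function names are hypothetical, only ATOM and HETATM records in the standard fixed-column PDB layout are read, insertion codes and alternate locations are ignored, and distances are in the units of the PDB coordinates (Å).

    ```python
    import math
    from typing import List, Tuple

    Coordinate = Tuple[float, float, float]


    def residue_atoms(pdb_text: str, chain: str, residue_number: int) -> List[Coordinate]:
        """Collect atom coordinates of one residue from PDB-format text."""
        coordinates = []
        for line in pdb_text.splitlines():
            if not line.startswith(("ATOM", "HETATM")) or len(line) < 54:
                continue
            # Fixed columns: chain ID at 22, residue number at 23-26,
            # x/y/z at 31-38, 39-46 and 47-54 (1-based).
            if line[21] == chain and int(line[22:26]) == residue_number:
                coordinates.append(
                    (float(line[30:38]), float(line[38:46]), float(line[46:54]))
                )
        return coordinates


    def min_distance(pdb_text: str, selection1: str, selection2: str) -> float:
        """Closest interatomic distance between two 'Chain:Number' selections."""
        chain1, number1 = selection1.split(":")
        chain2, number2 = selection2.split(":")
        atoms1 = residue_atoms(pdb_text, chain1, int(number1))
        atoms2 = residue_atoms(pdb_text, chain2, int(number2))
        if not atoms1 or not atoms2:
            raise ValueError("selected residue not found in structure")
        return min(math.dist(a, b) for a in atoms1 for b in atoms2)


    def is_neighbor(pdb_text: str, selection1: str, selection2: str, cutoff: float) -> bool:
        """True when the closest interatomic distance is below the cutoff."""
        return min_distance(pdb_text, selection1, selection2) < cutoff
    ```

    A structure would be kept when is_neighbor(text, "A:401", "A:900", cutoff) returns True for the chosen cutoff.
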
  • Name: Protein Physico-chemical Properties
    Description: Protein Physico-chemical Properties is a module for calculating the physicochemical properties of protein sequences. The computed properties include molecular weight, isoelectric point, extinction coefficient, instability index, aromaticity, grand average of hydropathicity (GRAVY), and secondary structure composition.
    Author: Wilkins MR, Gasteiger E, Bairoch A, Sanchez JC, Williams KL, Appel RD, Hochstrasser DF.
    Release: 2023-03-27 17:15:36
    Reference: Methods Mol Biol. 1999;112:531-52. doi: 10.1385/1-59259-584-7:531. PMID: 10027275.

    Protein Physico-chemical Properties

    简介

    对上传的蛋白Fasta序列分析其蛋白的理化性质,包括分子质量、等电点、消光系数、不稳定系数、蛋白质的芳香值、总平均亲水性以及二级结构占比。

    参数说明

    Protein Sequence File

    输入的蛋白FASTA文件,格式:FASTA。

    Output File

    输出文件名称,必须为CSV后缀。

    Merge Chain

    是否合并来自同一蛋白质链的信息。

    Merge Output File

    仅当merge_chain=True时可用。默认值:merged.csv。

    Job Number

    并行任务数,默认为1。

    pH Value

    指定计算净电荷(net charge)的pH值

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 序列名称和蛋白质的信息一一对应的CSV文件
    merged.csv 合并来自同一蛋白质链的信息的CSV文件

    其中result.csv和merged.csv,包含信息如下:

    字段名称 说明
    Sequence ID 蛋白序列名称
    Molecular Weight 蛋白序列分子量
    Isoelectric Point 蛋白序列等电点
    Molar Extinction Coefficient (without disulfide bond) 假设半胱氨酸被还原时的摩尔消光系数,单位为M-1·cm-1。
    Extinction Coefficient (without disulfide bond) 假设半胱氨酸被还原时的消光系数,单位为g·L-1。
    Molar Extinction Coefficient (with disulfide bond) 假设成对半胱氨酸形成的二硫键的摩尔消光系数,单位为M-1·cm-1。
    Extinction Coefficient (with disulfide bond) 假设成对半胱氨酸形成的二硫键的消光系数,单位为g·L-1。
    Instability Index 蛋白的不稳定指数,当该数值高于40时都表示蛋白质不稳定(半衰期很短)。
    Aromaticity 蛋白质的芳香值,即为Phe+Trp+Tyr的相对频率。
    Grand average of hydropathicity (GRAVY) 总平均亲水性,若此数值为负值则说明该蛋白为亲水性蛋白,反之为疏水性蛋白。
    Helix Fraction 计算Helix结构在蛋白上所占比例。Helix中的氨基酸:V,I,Y,F,W,L。
    Turn Fraction 计算Trun结构在蛋白上所占比例。Trun中氨基酸顺序为:N,P,G,S。
    Sheet Fraction 计算Sheet结构在蛋白上所占比例。Sheet中氨基酸:E,M,A,L。
    Net Charge 蛋白序列在特定pH值下的净电荷,采用Biopython中的电荷计算功能函数进行计算

    参考文献

    Wilkins MR, Gasteiger E, Bairoch A, Sanchez JC, Williams KL, Appel RD, Hochstrasser DF. Protein identification and analysis tools in the ExPASy server. Methods Mol Biol. 1999;112:531-52.

    Protein Physico-chemical Properties

    Introduction

    This module analyzes the physicochemical properties of the proteins in an uploaded FASTA file. The properties include molecular weight, isoelectric point, extinction coefficient, instability index, aromaticity, grand average of hydropathicity (GRAVY), and secondary structure composition.

    Parameter Description

    Protein Sequence File

    Input protein sequence file in FASTA format.

    Output File

    Name of the output file, must have a CSV extension.

    Merge Chain

    Whether to merge information from the same protein chain.

    Merge Output File

    Only available when merge_chain=True. Default value: merged.csv.

    Job Number

    Number of parallel tasks, default is 1.

    pH Value

    Specifies the pH value for calculating the net charge

    Result Description

    The output includes:

    Output File Name Description
    result.csv CSV file mapping sequence names to protein information
    merged.csv CSV file containing merged information from the same protein chain

    Both result.csv and merged.csv contain the following information:

    Field Name Description
    Sequence ID Protein sequence name
    Molecular Weight Molecular weight of the protein sequence
    Isoelectric Point Isoelectric point of the protein sequence
    Molar Extinction Coefficient (without disulfide bond) Molar extinction coefficient assuming all cysteines are reduced, in M⁻¹·cm⁻¹
    Extinction Coefficient (without disulfide bond) Extinction coefficient assuming all cysteines are reduced, expressed for a protein concentration of 1 g·L⁻¹
    Molar Extinction Coefficient (with disulfide bond) Molar extinction coefficient assuming paired cysteines form disulfide bonds, in M⁻¹·cm⁻¹
    Extinction Coefficient (with disulfide bond) Extinction coefficient assuming paired cysteines form disulfide bonds, expressed for a protein concentration of 1 g·L⁻¹
    Instability Index Instability index of the protein; values above 40 indicate that the protein is unstable (short half-life)
    Aromaticity Aromaticity of the protein, i.e. the relative frequency of Phe+Trp+Tyr
    Grand average of hydropathicity (GRAVY) Mean hydropathy value over all residues; negative values indicate a hydrophilic protein, positive values a hydrophobic one
    Helix Fraction Fraction of residues counted as helix-forming: V, I, Y, F, W, L
    Turn Fraction Fraction of residues counted as turn-forming: N, P, G, S
    Sheet Fraction Fraction of residues counted as sheet-forming: E, M, A, L
    Net Charge Net charge of the protein sequence at the specified pH, calculated with the charge function in Biopython
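Several of the fields above are simple functions of the amino acid composition. The following is a minimal sketch, not the module's own code: it assumes the Kyte-Doolittle hydropathy scale for GRAVY and the ProtParam-style coefficients for the molar extinction coefficient at 280 nm (5500 per Trp, 1490 per Tyr, 125 per cystine). The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values per residue (assumed scale for GRAVY).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aromaticity(seq: str) -> float:
    """Relative frequency of Phe + Trp + Tyr in the sequence."""
    return sum(seq.count(aa) for aa in "FWY") / len(seq)


def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def molar_extinction_coefficients(seq: str) -> tuple:
    """Return (reduced, with_disulfides) molar extinction coefficients in M^-1 cm^-1.

    'reduced' assumes all cysteines are free; 'with_disulfides' assumes every
    pair of cysteines forms one cystine.
    """
    reduced = 5500 * seq.count("W") + 1490 * seq.count("Y")
    with_disulfides = reduced + 125 * (seq.count("C") // 2)
    return reduced, with_disulfides
```

A negative `gravy` value corresponds to the "hydrophilic" reading in the table, and the two values returned by `molar_extinction_coefficients` correspond to the "without" and "with disulfide bond" columns.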

    Reference

    Wilkins MR, Gasteiger E, Bairoch A, Sanchez JC, Williams KL, Appel RD, Hochstrasser DF. Protein identification and analysis tools in the ExPASy server. Methods Mol Biol. 1999;112:531-52.

  • Name: Receptor-Based Peptide Design
    Description: 基于受体结构(目前支持单链)的结合多肽设计。多链受体可先拼接成单链。 Binding peptide design based on receptor structure (currently supports single-chain). Multi-chain receptors can first be spliced into a single chain.
    Tags: undefined
    Author: Wecomput
    Release: 2023-03-27 09:43:52
    Reference: NA

    Receptor-Based Peptide Design

    简介

    Receptor-based Peptide Design模块是进行基于受体结构(目前支持单链)的结合多肽设计。该模块算法是基于AlphaFold2与Colabdesign实现。

    参数说明

    Receptor Structure

    PDB格式的受体结构。

    Binder Length

    设定肽binder的长度,如:10。

    Chain

    指定PDB文件中作为受体的链,如:“B”,如果结构中只有一条链,可以不用指定。
    注意:目前仅支持单链模式,且链的长度不超过500个氨基酸。

    Hotspot Residues

    指定受体中的热点残基,如:‘1-10,12,15’

    Binder Sequence

    指定多肽binder的起始序列,如设定,则会在此序列的基础上继续设计。

    Binder Chain

    如果已有多肽binder在参数1的PDB文件中,指定该多肽为哪条链,可以此为基础进行多肽binder的优化设计。

    Use Multimer

    默认False,是否使用Alphafold-Multimer进行设计

    Flexible

    是否设定受体的骨架为柔性。

    Output

    指定输出的结构评分文件名称,默认为“design_scores.csv”

    结果说明

    输出5个肽binder设计的PDB文件:result_0~4.pdb,为受体中选择的链结构与设计肽的复合物。5个设计结果为5次平行设计的不同结果。
    输出结构的评分指标:design_scores.csv,包含如下信息:

    字段名称 说明
    Name 预测结构的文件名
    pLDDT 局部结构的可信度指标,值范围是0-1.0,该值越大说明预测的结构越可靠。低于0.7被认为可靠性较低,低于0.5基本认为是可信度非常低,为无序预测
    pTM 预测的TM分数(the predicted template modeling score),衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM 预测的亚基接触面的TM分数(the interface predicted template modeling score),当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定

    Receptor-Based Peptide Design

    Introduction

    The Receptor-Based Peptide Design module designs binding peptides from a receptor structure (currently only single-chain receptors are supported). The algorithm is implemented on top of AlphaFold2 and ColabDesign.

    Parameter Description

    Receptor Structure

    The receptor structure in PDB format.

    Binder Length

    Specifies the length of the peptide binder, e.g., 10.

    Chain

    Specifies the chain in the PDB file to be used as the receptor, e.g., “B”. If the structure contains only one chain, this parameter can be omitted.
    Note: Currently, only single-chain mode is supported, and the chain length must not exceed 500 amino acids.

    Hotspot Residues

    Specifies the hotspot residues in the receptor, e.g., ‘1-10,12,15’.
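The hotspot string mixes ranges and single residue numbers. As an illustration only, the sketch below expands a string such as ‘1-10,12,15’ into an explicit list of residue numbers. The function name `parse_hotspots` is hypothetical, and the module's own parser may accept other forms that are not handled here.

```python
def parse_hotspots(spec: str) -> list:
    """Expand a hotspot string such as '1-10,12,15' into sorted residue numbers."""
    residues = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            residues.update(range(start, end + 1))  # ranges are inclusive
        else:
            residues.add(int(part))
    return sorted(residues)
```

Expanding the string before submission makes it easy to check that every hotspot actually exists in the selected receptor chain.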

    Binder Sequence

    Specifies the starting sequence of the peptide binder. If provided, the design will be based on this sequence.

    Binder Chain

    If a peptide binder already exists in the PDB file given as Receptor Structure, this parameter specifies which chain is the peptide, so that the design can start from and optimize this existing binder.

    Use Multimer

    Default is False. Specifies whether to use AlphaFold-Multimer for design.

    Flexible

    Specifies whether to set the receptor backbone as flexible.

    Output

    Specifies the name of the output structure scoring file; default is “design_scores.csv”.

    Result Description

    Five PDB files are output for the designed peptide binders: result_0.pdb to result_4.pdb. Each file is a complex of the selected receptor chain and a designed peptide; the five files are the results of five parallel design runs.
    The structure scoring file design_scores.csv contains the following information:

    Field Name Description
    Name The file name of the predicted structure.
    pLDDT The confidence score for local structure, with a range of 0-1.0. A higher value indicates a more reliable prediction. Values below 0.7 are considered low reliability; values below 0.5 are regarded as very low confidence and indicate disordered predictions.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure. A higher score indicates greater accuracy. If the score is greater than 0.5, it suggests that the overall folding of the structure may be similar to the true structure.
    ipTM The interface predicted template modeling score, which measures the accuracy of the predicted relative positions of the subunits in the complex. A higher score indicates greater accuracy. Scores above 0.8 indicate high-quality predictions, while scores below 0.6 suggest potential failure of the prediction. Scores between 0.6 and 0.8 are in a gray area, where the accuracy of the prediction is uncertain.
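The ipTM thresholds described above can be applied mechanically when triaging designs. The sketch below is an illustration under assumptions: the rows are hypothetical values standing in for entries of design_scores.csv, and the function names are not part of the module.

```python
def iptm_band(iptm: float) -> str:
    """Classify an ipTM score using the thresholds given in the table."""
    if iptm > 0.8:
        return "high quality"
    if iptm < 0.6:
        return "likely failed"
    return "uncertain"  # the 0.6-0.8 gray area


def rank_designs(rows: list) -> list:
    """Sort designs by ipTM (best first) and attach the quality band."""
    ranked = sorted(rows, key=lambda row: row["ipTM"], reverse=True)
    return [(row["Name"], row["ipTM"], iptm_band(row["ipTM"])) for row in ranked]


# Hypothetical scores, for illustration only.
example_rows = [
    {"Name": "result_0.pdb", "pTM": 0.71, "ipTM": 0.55},
    {"Name": "result_1.pdb", "pTM": 0.80, "ipTM": 0.86},
    {"Name": "result_2.pdb", "pTM": 0.76, "ipTM": 0.67},
]
ranking = rank_designs(example_rows)
```

Designs in the "uncertain" band are the ones where the table itself says correctness cannot be judged from the score alone.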
  • Name: Antibody Paratope Prediction
    Description: 预测抗体上与抗原结合的氨基酸位点(称为Paratope),基于等变图神经网络的深度学习模型,使用抗体结构进行训练和预测,预测精度在现有方法中最佳。 Predicts the amino acid residues on an antibody that bind the antigen, known as the paratope. The algorithm is a deep learning model based on an equivariant graph neural network, trained on and applied to antibody structures, and is reported to have the highest prediction accuracy among existing methods.
    Tags: undefined
    Author: Lewis Chinery
    Release: 2023-03-23 10:19:16
    Reference: Lewis Chinery, Newton Wahome, Iain Moal, Charlotte M. Deane. Paragraph - Antibody Paratope prediction using Graph Neural Networks with minimal feature vectors. bioRxiv 2022.06.10.495640; doi: https://doi.org/10.1101/2022.06.10.495640

    Antibody Paratope Prediction

    简介

    Antibody Paratope Predictor模块的功能是预测抗体上与抗原结合的氨基酸位点,称为Paratope。其算法是基于等变图神经网络的深度学习模型,使用抗体结构进行训练和预测,预测精度在现有方法中最佳。

    参数说明

    Antibody PDB File

    需要预测的抗体结构,链名称必须为H, L, H/L才能判断为抗体结构。

    结果说明

    输出文件为result.csv,包含信息如下:

    字段名称 说明
    pdb 文件名
    chain_type 抗体链类型
    chain_id 抗体链标识
    IMGT 抗体氨基酸对应的IMGT编号
    AA 抗体氨基酸名称
    atom_num 抗体氨基酸的Alpha碳原子的原子编号(PDB文件中)。
    x,y,z 抗体氨基酸的Alpha碳原子的坐标。
    pred 该氨基酸为Paratope的预测概率(取值范围0-1),参考值为0.734,大于参考值时,为Paratope的可能性高,值越大可能性越高。

    参考文献

    Lewis Chinery, Newton Wahome, Iain Moal, Charlotte M. Deane. Paragraph - Antibody Paratope prediction using Graph Neural Networks with minimal feature vectors. bioRxiv 2022.06.10.495640.

    Antibody Paratope Prediction

    Introduction

    The Antibody Paratope Predictor module predicts the amino acid residues on an antibody that bind the antigen, known as the paratope. The algorithm is a deep learning model based on an equivariant graph neural network, trained on and applied to antibody structures, and is reported to achieve the highest prediction accuracy among existing methods.

    Parameter Description

    Antibody PDB File

    The antibody structure for which the paratope needs to be predicted. The chain names must be H, L, or H/L to be recognized as an antibody structure.

    Result Description

    The output file is result.csv, containing the following information:

    Field Name Description
    pdb File name
    chain_type Antibody chain type
    chain_id Antibody chain identifier
    IMGT IMGT number corresponding to the antibody amino acid
    AA Antibody amino acid name
    atom_num Atom number of the alpha carbon of the antibody amino acid in the PDB file
    x, y, z Coordinates of the alpha carbon of the antibody amino acid
    pred Predicted probability that the amino acid is part of the Paratope (range 0-1). A reference value of 0.734 is provided; a value greater than this indicates a high likelihood of being part of the Paratope, with higher values indicating higher likelihood.
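A typical post-processing step is to keep only residues whose `pred` exceeds the 0.734 reference value. The sketch below assumes that result.csv is comma-separated with a header row using the field names listed above; the example rows and the function name are hypothetical.

```python
import csv
import io

REFERENCE_THRESHOLD = 0.734  # reference value quoted in the field description


def predicted_paratope(csv_text: str, threshold: float = REFERENCE_THRESHOLD) -> list:
    """Return (chain_id, IMGT, AA, pred) for residues above the threshold, best first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = [
        (row["chain_id"], row["IMGT"], row["AA"], float(row["pred"]))
        for row in reader
        if float(row["pred"]) > threshold
    ]
    return sorted(hits, key=lambda hit: hit[3], reverse=True)


# Hypothetical file content, for illustration only.
EXAMPLE_CSV = """pdb,chain_type,chain_id,IMGT,AA,atom_num,x,y,z,pred
demo,H,H,105,ALA,812,1.0,2.0,3.0,0.41
demo,H,H,107,TYR,826,1.5,2.5,3.5,0.91
demo,L,L,50,ASP,1650,4.0,5.0,6.0,0.80
"""
paratope = predicted_paratope(EXAMPLE_CSV)
```

To read a real file, pass its contents (for example `open("result.csv").read()`) instead of `EXAMPLE_CSV`, after checking that the header names match.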

    Reference

    Lewis Chinery, Newton Wahome, Iain Moal, Charlotte M. Deane. Paragraph - Antibody Paratope prediction using Graph Neural Networks with minimal feature vectors. bioRxiv 2022.06.10.495640.

  • Name: Small Molecule Aptamer Screening
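    Description: Small Molecule Aptamer Screening是基于机器学习模型进行大规模适配体序列筛选模块,利用Small Molecule Aptamer Training模块得到的模型文件对适配体序列进行小分子-适配体亲和力预测。 Small Molecule Aptamer Screening is a module for large-scale screening of aptamer sequences based on machine learning models. It uses the model files produced by the Small Molecule Aptamer Training module to predict small molecule-aptamer affinity.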
    Tags: undefined
    Author: 华南环境科学研究所
    Release: 2023-03-21 14:39:52
    Reference:

    Small Molecule Aptamer Screening

    简介

    Small Molecule Aptamer Screening是基于机器学习模型进行大规模适配体序列筛选模块,利用Small Molecule Aptamer Training模块得到的模型文件对适配体序列进行小分子-适配体亲和力预测。

    参数说明

    Small Molecule Structure(Smiles)

    输入小分子的smiles信息

    Aptamer Sequence File

    输入适配体序列信息,txt文件格式,一行一条序列

    GCGGATGAAGACTGGTGTGAGGGGATGGGTTAGGTGGAGGTGGTTATTCCGGGAATTCGCCCTAAATACGAGCAAC
    GCGGATGAAGACTGGTGTCCCTTATGGTGGGTGCGCTGGGGCTGCAATCTTTTGGCTGGCCCTAAATACGAGCAAC
    TGTGTGTGAGACTTCGTTCCGGCGATGGGGTAGGGGGTGTGGAGGGGCCGGACGGAGGGGCAGCAAGGCATCAGAGGTAT
    AGCAGCACAGAGGTCAGTTCGTCCATTATTCTGGTAGCGTTGAACAACATTCAACACGCCCCTATGCGTGCTACCGTGAA
    AGCAGCACAGAGGTCAGTTCGTCGAATCAGCACCTCTGCATAGGTTACGTTTATACTGCGCCTATGCGTGCTACCGTGAA
    

    Model File

    机器学习模型文件,由Small Molecule Aptamer Training模块训练输出得到。

    Screening Results File

    筛选预测结果输出文件名称,默认result.csv

    Write Smiles

    结果文件是否包含小分子smiles信息

    Sort

    是否对预测结果进行排序,默认根据预测亲和力值从小到大排序。

    TopN

    只输出亲和力排名前N条序列的结果

    Cutoff

    只输出亲和力在指定截断值以下的结果,单位是nM, 比如,500表示只保留亲和力Kd值小于500的序列信息。

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    pred_kd_nM 代表预测的亲和力Kd值,单位是nM
    sequence 输入序列信息
    smiles 小分子的smiles信息

    Small Molecule Aptamer Screening

    Introduction

    Small Molecule Aptamer Screening is a module for large-scale screening of aptamer sequences based on machine learning models. It utilizes the model files obtained from the Small Molecule Aptamer Training module to predict the affinity between small molecules and aptamer sequences.

    Parameter Description

    Small Molecule Structure (Smiles)

    Input the SMILES information of the small molecule.

    Aptamer Sequence File

    Input the aptamer sequence information in a TXT file format, with one sequence per line.

    GCGGATGAAGACTGGTGTGAGGGGATGGGTTAGGTGGAGGTGGTTATTCCGGGAATTCGCCCTAAATACGAGCAAC
    GCGGATGAAGACTGGTGTCCCTTATGGTGGGTGCGCTGGGGCTGCAATCTTTTGGCTGGCCCTAAATACGAGCAAC
    TGTGTGTGAGACTTCGTTCCGGCGATGGGGTAGGGGGTGTGGAGGGGCCGGACGGAGGGGCAGCAAGGCATCAGAGGTAT
    AGCAGCACAGAGGTCAGTTCGTCCATTATTCTGGTAGCGTTGAACAACATTCAACACGCCCCTATGCGTGCTACCGTGAA
    AGCAGCACAGAGGTCAGTTCGTCGAATCAGCACCTCTGCATAGGTTACGTTTATACTGCGCCTATGCGTGCTACCGTGAA
    

    Model File

    Machine learning model file obtained from the Small Molecule Aptamer Training module.

    Screening Results File

    Output file name for the screening prediction results, default is result.csv.

    Write Smiles

    Indicates whether the result file should include the small molecule SMILES information.

    Sort

    Specifies whether to sort the prediction results. By default, the results are sorted by predicted affinity value in ascending order, so sequences with the smallest predicted Kd come first.

    TopN

    Outputs only the top N sequences ranked by affinity.

    Cutoff

    Outputs only results with affinity values below a specified cutoff value, in units of nM. For example, a value of 500 means that only sequences with an affinity Kd value less than 500 nM will be retained.
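The Sort, TopN and Cutoff options combine naturally into one filtering step. The sketch below shows one plausible order of operations (cutoff, then sort, then top N); this order is an assumption, as are the function name and the example Kd values.

```python
def screen(predictions, sort=True, top_n=None, cutoff=None):
    """Filter and rank (sequence, pred_kd_nM) pairs.

    cutoff: keep only rows with predicted Kd strictly below this value (nM).
    sort:   order rows by ascending predicted Kd (strongest binders first).
    top_n:  keep only the N best rows; implies sorting.
    """
    rows = list(predictions)
    if cutoff is not None:
        rows = [row for row in rows if row[1] < cutoff]
    if sort or top_n is not None:
        rows = sorted(rows, key=lambda row: row[1])
    if top_n is not None:
        rows = rows[:top_n]
    return rows


# Hypothetical predictions, for illustration only.
example = [("SEQ_A", 820.0), ("SEQ_B", 35.5), ("SEQ_C", 410.0), ("SEQ_D", 499.9)]
selected = screen(example, top_n=2, cutoff=500)
```

Because a lower Kd means tighter binding, "top N" here means the N smallest predicted values.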

    Result Description

    The output file is result.csv, containing the following information:

    Field Name Description
    pred_kd_nM Predicted affinity Kd value in nM
    sequence Input sequence information
    smiles SMILES information of the small molecule
  • Name: Small Molecule Aptamer Training
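    Description: Small Molecule Aptamer Training是针对小分子-适配体数据集进行多种机器学习回归模型的训练,得到模型预测性能信息,并输出准确性最高的模型,模型可用于后续适配体的筛选。 Small Molecule Aptamer Training trains several machine learning regression models on a small molecule-aptamer dataset, reports their predictive performance, and outputs the most accurate model, which can be used for subsequent aptamer screening.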
    Tags: undefined
    Author: 华南环境科学研究所
    Release: 2023-03-21 14:39:06
    Reference:

    Small Molecule Aptamer Training

    简介

    Small Molecule Aptamer Training模块是训练小分子-核酸适配体亲和力的数据的回归模型,训练模型支持以下常用回归模型:LinearRegression,KNN,SVR,Ridge,Lasso,DecisionTree,ExtraTree,RandomForest,MLP,AdaBoost,GradientBoost,Bagging,XGBoost,LightGBM,NeuralNetwork。通过采用交叉验证比较不同模型的预测效果,然后保留交叉验证效果排名前三的回归算法,对全部数据集进行训练,得到最终的预测模型,模型可用于小分子-适配体亲和力的预测。

    参数说明

    Training Data File

    输入训练数据集csv文件,包括小分子smiles以及适配体序列文件,注意:只支持DNA适配体序列

    K-mers

    适配体特征提取k-mers取值,默认值为2。

    K-Fold CV

    模型训练过程采用的k倍交叉验证,目前有5倍交叉验证和10倍交叉验证。

    Seed

    随机数,用于重复训练结果或者比较不同随机数结果。

    结果说明

    输出结果包括:

    输出文件名称 说明
    correlation.png 交叉验证中实验值与预测值相关性图。
    score_detail.csv 交叉验证打分详细信息。
    score_summary.csv 模型预测性能指标。
    best1.pt 预测性能排名第一的模型文件
    best2.pt 预测性能排名第二的模型文件
    best3.pt 预测性能排名第三的模型文件

    Small Molecule Aptamer Training

    Introduction

    The Small Molecule Aptamer Training module trains regression models on small molecule-nucleic acid aptamer affinity data. The following common regression models are supported: LinearRegression, KNN, SVR, Ridge, Lasso, DecisionTree, ExtraTree, RandomForest, MLP, AdaBoost, GradientBoost, Bagging, XGBoost, LightGBM, NeuralNetwork. Cross-validation is used to compare the predictive performance of the models, and the three algorithms with the best cross-validation performance are retained. These are then trained on the entire dataset to obtain the final prediction models, which can be used to predict the affinity between small molecules and aptamers.

    Parameter Description

    Training Data File

    Input training dataset as a CSV file containing the small molecule SMILES and the aptamer sequences. Note: Only DNA aptamer sequences are supported.

    K-mers

    Value for extracting aptamer features using k-mers, with a default value of 2.
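As an illustration of what a k-mer feature is, the sketch below counts overlapping k-mers in a DNA sequence and lays the counts out in a fixed order, giving 16 features for the default k=2. This is a generic k-mer count vector under stated assumptions; the module's actual feature extraction (normalization, ordering, handling of ambiguous bases) is not documented here, and the function name is hypothetical.

```python
from itertools import product

DNA_ALPHABET = "ACGT"


def kmer_count_vector(sequence: str, k: int = 2) -> list:
    """Count overlapping k-mers and return counts in lexicographic k-mer order."""
    kmers = ["".join(letters) for letters in product(DNA_ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in counts:  # silently skip k-mers containing non-ACGT characters
            counts[kmer] += 1
    return [counts[kmer] for kmer in kmers]


features = kmer_count_vector("ACGTAC", k=2)
```

With k=2 the vector follows the order AA, AC, AG, AT, CA, ..., TT, so sequences of different lengths map to vectors of the same size.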

    K-Fold CV

    Number of folds for k-fold cross-validation during model training. Currently supports 5-fold and 10-fold cross-validation.

    Seed

    Random seed, used to reproduce training results or to compare results obtained with different seeds.
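The K-Fold CV and Seed parameters interact: the seed fixes how samples are shuffled into folds, so the same seed gives the same splits. The sketch below is a minimal standard-library illustration of seeded k-fold splitting, not the module's implementation; the function name is hypothetical.

```python
import random


def kfold_indices(n_samples: int, k: int, seed: int):
    """Yield (train_indices, test_indices) for each of k folds, shuffled by seed."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(kfold_indices(n_samples=10, k=5, seed=0))
```

Each sample appears in exactly one test fold, which is what allows the cross-validation scores of different models to be compared on equal terms.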

    Result Description

    The output results include:

    Output File Name Description
    correlation.png Graph showing the correlation between experimental and predicted values in cross-validation.
    score_detail.csv Detailed scoring information from cross-validation.
    score_summary.csv Performance metrics of the model predictions.
    best1.pt Model file for the top-performing model.
    best2.pt Model file for the second-best performing model.
    best3.pt Model file for the third-best performing model.
  • Name: Antibody Design (DiffAb)
    Description: 基于扩散概率模型和等价神经网络的抗体设计,可针对特定抗原结构生成抗体,也可基于抗体-抗原复合物结构进行抗体结构和序列的优化。 Luo et al. developed a deep generative model that jointly models sequences and structures of CDRs based on diffusion probabilistic models and equivariant neural networks. The model is capable of sequence-structure co-design, sequence design for given backbone structures, and antibody optimization.
    Tags: undefined
    Author: Shitong Luo
    Release: 2023-03-20 09:25:36
    Reference: Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, Jianzhu Ma. Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures. bioRxiv 2022.07.10.499510; doi: https://doi.org/10.1101/2022.07.10.499510

    Antibody Design

    简介

    基于扩散概率模型和等价神经网络,进行抗体设计,可针对特定抗原结构生成抗体,也可基于抗体-抗原复合物结构进行抗体结构和序列的优化。
    抗体是免疫系统的蛋白质,通过与特定的抗原(如病毒和细菌)结合来保护宿主。抗体和抗原之间的结合主要是由抗体的互补性决定区域(CDR)决定的。该模块是基于扩散概率模型和等价神经网络的深度生成模型,对CDR的序列和结构共同建模。该方法可明确针对特定抗原结构生成抗体,是最早的蛋白质结构扩散概率模型之一。能进行序列-结构协同设计、给定骨架结构的序列设计和抗体优化。

    参数说明

    本模块存在两种模式:Antibody Optimization或Design Without Bound Antibody Frameworks,前者是上传抗体-抗原复合物结构,后者是上传单独的抗原结构。

    Antibody Optimization

    Antibody-Antigen Complex Structure

    抗体-抗原复合物结构文件,PDB格式

    Mode

    设计模式选择,对于抗原-抗体复合物有4种设计模式可选:

    1. Optimize:优化单个CDR的序列和结构。此模式需要抗体-抗原复合物结构和CDR标签。
    2. Fixbb:固定抗体的主干结构,仅逐个采样CDR的序列。此模式需要抗体-抗原复合物结构。
    3. Sample_one_CDR:逐个采样CDR的序列和结构。
    4. Sample_multi_CDRs:同时采样所有CDR的序列和结构。

    CDR Label

    只有在指定Optimize设计模式后,才需要选择该参数,默认值为H_CDR3,一共有6个选项:H_CDR1、H_CDR2、H_CDR3、L_CDR1、L_CDR2、L_CDR3。

    Design Without Bound Antibody Frameworks

    Antigen Structure

    单独的抗原结构文件,PDB格式

    Mode

    设计模式选择,对于抗原结构有2种设计模式可选:

    1. Sample_one_CDR:逐个采样CDR的序列和结构。
    2. Sample_multi_CDRs:同时采样所有CDR的序列和结构。

    结果说明

    1.输出一个结构优化后或构建后的压缩包result.tar.gz。
    2.展示不同设计模式的第一个结构优化结果,输出结果分别如下:
    (1) ‘Optimize’模式,输出结果包括:

    输出文件名称 说明
    H_CDR1-O1_0000.pdb O1表示优化次数为1,对应的优化程度很低,序列变化很小
    H_CDR1-O2_0000.pdb O2表示优化次数为2,优化程度低,序列变化小
    H_CDR1-O4_0000.pdb 优化次数为4,优化程度较低,序列变化较小
    H_CDR1-O8_0000.pdb 优化次数为8,优化程度一般,序列变化一般
    H_CDR1-O16_0000.pdb 优化次数为16,优化程度较高,序列变化较大
    H_CDR1-O32_0000.pdb 优化次数为32,优化程度高,序列变化大
    H_CDR1-O64_0000.pdb 优化次数为64,优化程度很高,序列变化很大

    (2) ‘Fixbb’模式,输出结果包括:

    输出文件名称 说明
    H_CDR1_0000.pdb 重链CDR1区优化的结构文件
    H_CDR2_0000.pdb 重链CDR2区优化的结构文件
    H_CDR3_0000.pdb 重链CDR3区优化的结构文件
    L_CDR1_0000.pdb 轻链CDR1区优化的结构文件
    L_CDR2_0000.pdb 轻链CDR2区优化的结构文件
    L_CDR3_0000.pdb 轻链CDR3区优化的结构文件

    (3) ‘Sample_one_CDR’模式,输出文件名称与’Fixbb’ 模式相同。
    (4) 'Sample_multi_CDRs’模式,输出CDR区进行优化后的结构文件"MultipleCDRs_0000.pdb"。

    参考文献

    Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, Jianzhu Ma. Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures. bioRxiv. 2022.07.10.499510

    Antibody Design

    Introduction

    This module performs antibody design based on diffusion probabilistic models and equivariant neural networks. It can generate antibodies targeting a specific antigen structure, and can optimize antibody structures and sequences starting from an antibody-antigen complex structure.
    Antibodies are proteins of the immune system that protect the host by binding to specific antigens such as viruses and bacteria. The binding between antibodies and antigens is primarily determined by the complementarity-determining regions (CDRs) of the antibodies. This module is a deep generative model based on diffusion probabilistic models and equivariant neural networks that jointly models the sequences and structures of CDRs. The method explicitly generates antibodies targeting specific antigen structures and is one of the earliest diffusion probabilistic models for protein structures. It enables sequence-structure co-design, sequence design for given backbone structures, and antibody optimization.

    Parameter Description

    This module has two modes: Antibody Optimization or Design Without Bound Antibody Frameworks, where the former involves uploading antibody-antigen complex structures and the latter involves uploading standalone antigen structures.

    Antibody Optimization

    Antibody-Antigen Complex Structure

    Structure file of the antibody-antigen complex in PDB format.

    Mode

    Design mode selection for the antigen-antibody complex with four available options:

    1. Optimize: Optimizes the sequence and structure of a single CDR. This mode requires the antibody-antigen complex structure and a CDR label.
    2. Fixbb: Fixes the backbone structure of the antibody and samples the sequence of each CDR individually. This mode requires the antibody-antigen complex structure.
    3. Sample_one_CDR: Samples the sequence and structure of each CDR individually.
    4. Sample_multi_CDRs: Simultaneously samples the sequences and structures of all CDRs.

    CDR Label

    This parameter is only required when selecting the Optimize design mode, with a default value of H_CDR3. There are a total of six options: H_CDR1, H_CDR2, H_CDR3, L_CDR1, L_CDR2, L_CDR3.

    Design Without Bound Antibody Frameworks

    Antigen Structure

    Structure file of the standalone antigen in PDB format.

    Mode

    Design mode selection for antigen structures with two available options:

    1. Sample_one_CDR: Samples the sequence and structure of each CDR individually.
    2. Sample_multi_CDRs: Simultaneously samples the sequences and structures of all CDRs.

    Result Description

    1. Outputs a compressed file, result.tar.gz, containing the optimized or constructed structure.

    2. Displays the first structure optimization results for different design modes as follows:
      (1)For the Optimize mode, the output includes:

      Output File Name Description
      H_CDR1-O1_0000.pdb O1: 1 optimization step; very low degree of optimization, very small sequence changes
      H_CDR1-O2_0000.pdb O2: 2 optimization steps; low degree of optimization, small sequence changes
      H_CDR1-O4_0000.pdb O4: 4 optimization steps; relatively low degree of optimization, relatively small sequence changes
      H_CDR1-O8_0000.pdb O8: 8 optimization steps; moderate degree of optimization, moderate sequence changes
      H_CDR1-O16_0000.pdb O16: 16 optimization steps; relatively high degree of optimization, relatively large sequence changes
      H_CDR1-O32_0000.pdb O32: 32 optimization steps; high degree of optimization, large sequence changes
      H_CDR1-O64_0000.pdb O64: 64 optimization steps; very high degree of optimization, very large sequence changes

      (2)For the Fixbb mode, the output includes:

      Output File Name Description
      H_CDR1_0000.pdb Structure file optimized for the heavy chain CDR1 region
      H_CDR2_0000.pdb Structure file optimized for the heavy chain CDR2 region
      H_CDR3_0000.pdb Structure file optimized for the heavy chain CDR3 region
      L_CDR1_0000.pdb Structure file optimized for the light chain CDR1 region
      L_CDR2_0000.pdb Structure file optimized for the light chain CDR2 region
      L_CDR3_0000.pdb Structure file optimized for the light chain CDR3 region

    (3)For the Sample_one_CDR mode, the output file names are the same as the Fixbb mode.
    (4)For the Sample_multi_CDRs mode, the output is the structure file “MultipleCDRs_0000.pdb” after optimizing the CDR regions.
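The output file names encode the CDR, the number of optimization steps (Optimize mode only) and a sample index. The sketch below parses the names listed above with a regular expression inferred from those examples; the pattern and the function name are assumptions, and other files in result.tar.gz may not follow it.

```python
import re

# Pattern inferred from the file names listed above (an assumption).
NAME_PATTERN = re.compile(
    r"^(?:(?P<cdr>[HL]_CDR[123])(?:-O(?P<steps>\d+))?|(?P<multi>MultipleCDRs))"
    r"_(?P<index>\d{4})\.pdb$"
)


def parse_output_name(file_name: str):
    """Return the CDR label, optimization steps and sample index, or None."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None
    steps = match.group("steps")
    return {
        "cdr": match.group("cdr") or match.group("multi"),
        "opt_steps": int(steps) if steps is not None else None,
        "index": int(match.group("index")),
    }


parsed = parse_output_name("H_CDR1-O16_0000.pdb")
```

Grouping files by `opt_steps` makes it straightforward to compare how far the sequence drifts from the starting CDR as the number of optimization steps grows.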

    Reference

    Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, Jianzhu Ma. Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures. bioRxiv. 2022.07.10.499510

  • Name: ORCA RUN
    Description: ORCA是一种灵活、高效且易于使用的量子化学通用工具,特别强调开壳层分子的光谱性质。它具有各种标准的量子化学方法,从半经验方法到DFT到单参考和多参考相关的从头算方法。它还可以处理环境和相对论效应。 ORCA is a flexible, efficient and easy-to-use general purpose tool for quantum chemistry with specific emphasis on spectroscopic properties of open-shell molecules. It features a wide variety of standard quantum chemical methods ranging from semiempirical methods to DFT to single- and multireference correlated ab initio methods. It can also treat environmental and relativistic effects.
    Tags: undefined
    Author: F. Neese
    Release: 2023-03-19 19:56:09
    Reference: Neese, F.; Wennmohs, F.; Becker, U.; Riplinger, C. The ORCA Quantum Chemistry Program Package. The Journal of Chemical Physics 2020, 152, 224108.
  • Name: ORCA INP Generation
    Description: ORCA INP Generation是生成ORCA输入文件的工具,输入结构文件,指定计算的目的以及计算采用的方法和基组,自动生成ORCA计算所需要的输入文件,可直接用于ORCA的计算。 A script to generate ORCA input file. ORCA is a flexible, efficient and easy-to-use general purpose tool for quantum chemistry with specific emphasis on spectroscopic properties of open-shell molecules. It features a wide variety of standard quantum chemical methods ranging from semiempirical methods to DFT to single- and multireference correlated ab initio methods. It can also treat environmental and relativistic effects.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-03-19 19:53:13
    Reference: Neese, F.; Wennmohs, F.; Becker, U.; Riplinger, C. The ORCA Quantum Chemistry Program Package. The Journal of Chemical Physics 2020, 152, 224108.
  • Name: GMX MD Run (GMX2023)
    Description: GMX MD Run (GMX2023)模块是利用已经准备好的体系拓扑文件以及参数文件进行基于GROMACS的分子动力学模拟。 GMX MD Run (GMX2023) runs a GROMACS MD task using the prepared system topology and parameter files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 11:21:21
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    GMX MD Run (GMX2023)

    简介

    提交GROMACS对应文件,从而进行分子动力学模拟,得到平衡模拟后得到的轨迹文件。

    参数说明

    GRO File

    提交模拟体系的gro文件。该文件可以从MD Solvation或者Membrane Solvation模块获取。

    Topology File

    提交模拟体系的top文件。该文件可以从MD Solvation或者Membrane Solvation模块获取。

    ITP File

    提交模拟体系的itp文件。该文件可以从MD Solvation或者Membrane Solvation模块获取。

    Minimize MDP File

    提交进行最小化的参数化文件,文件格式为mdp。可以根据GMX MDP Generation模块的Minimization方法或者GMX MDP Generation (Auto)生成。

    NPT MDP File

    提交进行等压等温的参数化文件,文件格式为mdp。可以根据GMX MDP Generation模块的NPT方法或者GMX MDP Generation (Auto)生成。

    MD MDP File

    提交进行平衡模拟的参数化文件,文件格式为mdp。可以根据GMX MDP Generation模块的MD方法或者GMX MDP Generation (Auto)生成。

    结果说明

    输出结果包括:

    输出文件名称 说明
    md.cpt md模拟断点文件
    md.gro md的分子坐标文件
    md.log md记录文件
    md.tpr md模拟所需的所有初始化数据(分子拓扑、初始结构等)
    mini.gro mini运行的分子坐标文件
    mini.log mini运行记录文件
    mini.tpr mini模拟运行所需的所有初始化数据(分子拓扑、初始结构等)
    npt.gro npt的分子坐标文件
    npt.log npt记录文件
    npt.tpr npt模拟所需的所有初始化数据(分子拓扑、初始结构等)
    path.txt 模拟轨迹文件存储路径,可用于后续分析模块的Path File输入。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    GMX MD Run (GMX2023)

    Introduction

    Submits the prepared GROMACS input files to run a molecular dynamics simulation (energy minimization, NPT simulation and equilibrium MD) and produces the trajectory files of the equilibrium simulation.

    Parameter Description

    GRO File

    Submit the gro file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    Topology File

    Submit the top file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    ITP File

    Submit the itp file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    Minimize MDP File

    Submit the parameter file for energy minimization, in mdp format. This file can be generated using the GMX MDP Generation module with the Minimization method or GMX MDP Generation (Auto).

    NPT MDP File

    Submit the parameter file for the NPT (isothermal-isobaric) simulation, in mdp format. This file can be generated using the GMX MDP Generation module with the NPT method or GMX MDP Generation (Auto).

    MD MDP File

    Submit the parameter file for the equilibrium simulation, in mdp format. This file can be generated using the GMX MDP Generation module with the MD method or GMX MDP Generation (Auto).
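The three MDP files correspond to three consecutive stages, each of which is normally prepared with `gmx grompp` and run with `gmx mdrun`. The sketch below only builds the command lines for such a chain; it is an assumption about a typical workflow, not the module's actual invocation, and the MDP and input file names are placeholders. The command lists can be passed to `subprocess.run` on a machine with GROMACS installed.

```python
def gromacs_stage_commands(gro="system.gro", top="system.top"):
    """Build grompp/mdrun command lists for minimization, NPT and MD stages.

    Each stage starts from the coordinates written by the previous stage.
    """
    stages = [
        ("mini", "mini.mdp", gro),        # energy minimization
        ("npt", "npt.mdp", "mini.gro"),   # NPT simulation from minimized structure
        ("md", "md.mdp", "npt.gro"),      # MD simulation from NPT structure
    ]
    commands = []
    for name, mdp, coordinates in stages:
        commands.append([
            "gmx", "grompp",
            "-f", mdp,            # run parameters
            "-c", coordinates,    # starting coordinates
            "-r", coordinates,    # reference coordinates for position restraints
            "-p", top,            # topology
            "-o", f"{name}.tpr",  # portable run input file
        ])
        # -deffnm sets the default name for all output files (name.gro, name.log, ...)
        commands.append(["gmx", "mdrun", "-deffnm", name])
    return commands


commands = gromacs_stage_commands()
```

With `-deffnm`, the files written by each stage (for example mini.gro, npt.log, md.cpt) follow the same naming scheme as the output files listed in the result description.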

    Result Description

    The output results include:

    Output File Name Description
    md.cpt Checkpoint file for the MD simulation
    md.gro Molecular coordinate file for the MD simulation
    md.log Log file for the MD simulation
    md.tpr All initial data required for the MD simulation (molecular topology, initial structure, etc.)
    mini.gro Molecular coordinate file for the minimization run
    mini.log Log file for the minimization run
    mini.tpr All initial data required for the minimization run (molecular topology, initial structure, etc.)
    npt.gro Molecular coordinate file for the NPT simulation
    npt.log Log file for the NPT simulation
    npt.tpr All initial data required for the NPT simulation (molecular topology, initial structure, etc.)
    path.txt Text file recording the storage path of the simulation trajectory files; it can be used as the Path File input of subsequent analysis modules.

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: AlphaAutoMD (GMX2023)
    Description: AlphaAutoMD是一个完全自动化的基于gromacs的分子动力学模块,用户提交PDB文件,指定力场、模拟时长等关键参数即可快速提交MD作业,该模块内部整合了多个计算模块,大部分参数采用默认参数。 AlphaAutoMD is a fully automated GROMACS-based molecular dynamics module that lets users quickly submit an MD job from a PDB file by specifying a few key parameters such as the force field and simulation time; most other parameters use default values. The module integrates MD PDB Prepare, Protein Protonation, GMX Receptor Parameterization, GMX Ligand Parameterization, MD Solvation, GMX MD Run, and RMS.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    AlphaAutoMD

    简介

    提交一个pdb文件自动进行分子动力学模拟,为初步接触分子动力学模拟提供便捷操作界面。

    参数说明

    PDB File

    结构文件,PDB格式。
    需要注意体系中若存在配体,其名称不能为*号且必须以HETATM开头。同一小分子中的原子名(如下图所示位置)不能相同。不需要模拟的结构最好是删除。如下所示为正确的小分子结构文件:

    HETATM 3767  C1  GOL A 302      -4.671 -11.067  -0.429  1.00 43.56           C  
    HETATM 3768  O1  GOL A 302      -5.324  -9.793  -0.300  1.00 41.43           O  
    

    image.png

    若体系中有特殊金属原子,只能选AMBER力场。离子需要按照特定书写格式,以下为一些常见原子书写格式:

      # Mg2+离子
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+离子
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+离子
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+离子
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+离子
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+离子
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
    

    96dcbfca9ffb96541221e86f6db9c5a.jpg

    其中atom type,residue需要是大写,atom name只需是标准金属离子(可以通过文本编辑器查看书写格式是否相同)。

    Force Field

    力场文件类型:
    amber03,amber14sb_parmbsc1适合蛋白和核酸的凝聚相模拟,也支持小分子。
    gromos系列适合烷烃、蛋白、核酸凝聚相的模拟。
    注意:根据提交的pdb结构选取力场。

    Water Type

    水的类型:
    spc:最好用于GROMOS力场。
    spce:对纯水体系比SPC、TIP3P都好。
    tip3p:最好用于amber。
    tip4p:最好用于opls。
    tip5p:不适用于混合模拟。

    Simulation Time (ns)

    模拟时长,单位ns

    结果说明

    输出结果包括:

    输出文件名称 说明
    md.cpt md模拟断点文件
    md.gro md的分子坐标文件
    md.log md记录文件
    md.mdp md参数文件
    md.tpr md模拟所需的所有初始化数据(分子拓扑、初始结构等)
    mini.gro mini运行的分子坐标文件
    mini.log mini运行记录文件
    mini.mdp mini运行参数文件
    mini.tpr mini模拟运行所需的所有初始化数据(分子拓扑、初始结构等)
    npt.cpt npt模拟断点文件
    npt.gro npt的分子坐标文件
    npt.log npt记录文件
    npt.mdp npt参数文件
    npt.tpr npt模拟所需的所有初始化数据(分子拓扑、初始结构等)
    protein.pdb 体系中的蛋白PDB文件
    predict_pKa.txt 蛋白质子化记录文件
    protein_protonation.pdb 蛋白质子化PDB文件
    receptor.gro 受体的分子坐标文件
    receptor_itp.tar.gz 受体平衡模拟时固定原子位置所施加的力
    receptor.top 受体的拓扑文件
    system.gro 体系的分子坐标文件
    system_itp.tar.gz 体系平衡模拟时固定原子位置所施加的力
    system.top 体系的拓扑文件
    interaction_energy.csv 体系能量随时间变化的csv文件
    interaction_energy.png 体系能量随时间变化的png文件
    interaction_pressure.csv 体系压力随时间变化的csv文件
    interaction_pressure.png 体系压力随时间变化的png文件
    rmsd_result.csv RMSD的CSV文件
    rmsd_result.png RMSD的PNG文件
    rmsd_result.xvg RMSD的XVG文件
    rmsf_Protein.csv 蛋白RMSF的CSV文件
    rmsf_Protein.png 蛋白RMSF的PNG文件
    rmsf_Protein.xvg 蛋白RMSF的XVG文件
    path.txt 模拟轨迹文件存储路径,可用于后续分析模块的Path File输入。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    AlphaAutoMD

    Introduction

    Runs a molecular dynamics simulation automatically from a submitted PDB file, giving newcomers to molecular dynamics a convenient interface.

    Parameter Description

    PDB File

    Structure file in PDB format.
    If the system contains ligands, the ligand name cannot contain “*” and every ligand record must start with HETATM. Atom names within the same small molecule (the position shown in the figure below) must be unique. Structures that do not need to be simulated are best deleted. The following is an example of a correctly formatted small-molecule record:

    HETATM 3767  C1  GOL A 302      -4.671 -11.067  -0.429  1.00 43.56           C  
    HETATM 3768  O1  GOL A 302      -5.324  -9.793  -0.300  1.00 41.43           O  
    

    image.png

    If the system contains special metal atoms, only the AMBER force field can be selected. Ions must be written in a specific format; some common ions are formatted as follows:

      # Mg2+ ion
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+ ion
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+ ion
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+ ion
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+ ion
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+ ion
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
    

    96dcbfca9ffb96541221e86f6db9c5a.jpg

    The atom type and residue name must be uppercase, and the atom name must be that of a standard metal ion (open the file in a text editor to check that it matches the format above).
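
    The ligand rules above (HETATM records, no “*” in the name, unique atom names within one small molecule) can be checked before submission. The following is a minimal sketch, not part of AlphaAutoMD: the function name check_hetatm_records is hypothetical, and it assumes standard fixed-column PDB records (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26).

```python
from collections import Counter


def check_hetatm_records(pdb_text):
    """Return a list of human-readable problems found in HETATM records.

    Hypothetical helper for illustration; assumes fixed-column PDB format.
    """
    problems = []
    seen = Counter()
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        atom_name = line[12:16].strip()   # columns 13-16
        res_name = line[17:20].strip()    # columns 18-20
        chain_id = line[21:22]            # column 22
        res_seq = line[22:26].strip()     # columns 23-26
        if "*" in res_name:
            problems.append(f"residue name contains '*': {res_name!r}")
        seen[(chain_id, res_seq, res_name, atom_name)] += 1
    # An atom name must not repeat inside the same residue of the same chain.
    for (chain_id, res_seq, res_name, atom_name), count in seen.items():
        if count > 1:
            problems.append(
                f"atom name {atom_name} appears {count} times "
                f"in {res_name} {chain_id}{res_seq}"
            )
    return problems
```

    An empty list means no problem was detected by these two checks; it does not guarantee that the module will accept the file.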

    Force Field

    Types of force field files:

    • amber03 and amber14sb_parmbsc1 are suited to condensed-phase simulations of proteins and nucleic acids, and also support small molecules.
    • The gromos series is suited to condensed-phase simulations of alkanes, proteins, and nucleic acids.
      Note: Choose the force field according to the submitted PDB structure.

    Water Type

    Types of water:

    • spc: best used for the GROMOS force field.
    • spce: better than SPC and TIP3P for pure water systems.
    • tip3p: best used for amber.
    • tip4p: best used for opls.
    • tip5p: not suitable for mixed simulations.
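
    The pairings listed above can be captured in a small lookup table. This is only a sketch: the function name suggest_water_model and the prefix matching on the force-field name are assumptions made for illustration, and the table holds nothing beyond the recommendations stated in this section.

```python
# Pairings taken from the Water Type notes; everything else is an assumption.
SUGGESTED_WATER = {
    "gromos": "spc",    # spc: best used for the GROMOS force field
    "amber": "tip3p",   # tip3p: best used for amber
    "opls": "tip4p",    # tip4p: best used for opls
}


def suggest_water_model(force_field):
    """Return the suggested water model for a force-field name, or None."""
    name = force_field.lower()
    for family, water in SUGGESTED_WATER.items():
        if name.startswith(family):
            return water
    return None
```

    For example, suggest_water_model("amber14sb_parmbsc1") returns "tip3p", while a force field outside these three families returns None and leaves the choice to the user.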

    Simulation Time (ns)

    Duration of the simulation, in ns.

    Result Description

    The output results include:

    Output File Name Description
    md.cpt Checkpoint file for the md simulation
    md.gro Molecular coordinate file for md
    md.log Log file for md
    md.mdp Parameter file for md
    md.tpr All initial data required for the md simulation (molecular topology, initial structure, etc.)
    mini.gro Molecular coordinate file for mini run
    mini.log Log file for mini run
    mini.mdp Parameter file for mini run
    mini.tpr All initial data required for the mini run (molecular topology, initial structure, etc.)
    npt.cpt Checkpoint file for the npt simulation
    npt.gro Molecular coordinate file for npt
    npt.log Log file for npt
    npt.mdp Parameter file for npt
    npt.tpr All initial data required for the npt simulation (molecular topology, initial structure, etc.)
    protein.pdb PDB file of the protein in the system
    predict_pKa.txt Record file of the protein protonation step
    protein_protonation.pdb PDB file of the protonated protein
    receptor.gro Molecular coordinate file for the receptor
    receptor_itp.tar.gz Position restraints (itp files) applied to hold atoms in place during receptor equilibration
    receptor.top Topology file for the receptor
    system.gro Molecular coordinate file for the system
    system_itp.tar.gz Position restraints (itp files) applied to hold atoms in place during system equilibration
    system.top Topology file for the system
    interaction_energy.csv CSV file of system energy over time
    interaction_energy.png PNG file of system energy over time
    interaction_pressure.csv CSV file of system pressure over time
    interaction_pressure.png PNG file of system pressure over time
    rmsd_result.csv CSV file for RMSD
    rmsd_result.png PNG file for RMSD
    rmsd_result.xvg XVG file for RMSD
    rmsf_Protein.csv CSV file for protein RMSF
    rmsf_Protein.png PNG file for protein RMSF
    rmsf_Protein.xvg XVG file for protein RMSF
    path.txt Storage path of the simulation trajectory files; it can be supplied as the Path File input of downstream analysis modules.
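
    The XVG outputs listed above are plain text in the GROMACS/Grace convention: lines starting with # are comments, lines starting with @ are plot directives, and the remaining lines are whitespace-separated numeric columns. The sketch below reads such a file into rows of floats. The helper names are hypothetical, and the assumption that rmsd_result.xvg holds time in the first column and RMSD in the second is not confirmed by this page.

```python
def read_xvg(text):
    """Parse XVG text into a list of rows of floats.

    Lines starting with '#', '@' or '&' (comments, plot directives,
    dataset separators) are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#@&":
            continue
        rows.append([float(value) for value in line.split()])
    return rows


def mean_of_column(rows, col=1):
    """Average one column, e.g. the assumed RMSD column of rmsd_result.xvg."""
    values = [row[col] for row in rows]
    return sum(values) / len(values)
```

    Typical use would be rows = read_xvg(open("rmsd_result.xvg").read()), followed by mean_of_column(rows) over the equilibrated part of the trajectory.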

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: SDF File Split
    Description: SDF File Split is a compound library file splitting module. It splits one large SDF file into multiple SDF files, either by number of files or by number of molecules per file, so that each resulting SD file contains a compound subset of similar size.
    Author: Manish Sud
    Release: 2023-03-12 22:33:44
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    SDF File Split

    简介

    SDF File Split是个化合物库文件分割模块,可以将一个大的SDF文件分割为多个SDF文件,支持按文件个数或者分子数目分割,使得分割后的每个SD文件分子数目接近。

    参数说明

    Split by Files Number模式

    SDF File

    小分子库结构文件,SDF格式

    Files Number

    生成文件的数目

    Prefix

    新生成SDF文件的前缀,默认subset,生成的文件名为:subset1.sdf,subset2.sdf,以此类推。

    Split by Compounds Number模式

    SDF File

    小分子库结构文件,SDF格式

    Compounds Number

    每个新生成的SD文件包含的分子数目

    Prefix

    新生成SDF文件的前缀,默认subset,生成的文件名为:subset1.sdf,subset2.sdf,以此类推。

    结果说明

    拆分后的SDF文件列表文件。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    SDF File Split

    Introduction

    SDF File Split is a compound library file splitting module that can divide a large SDF file into multiple SDF files. It supports splitting based on the number of files or the number of compounds, ensuring that the number of molecules in each split SDF file is similar.

    Parameter Description

    Split by Files Number Mode

    SDF File

    Structure file of the small molecule library, in SDF format.

    Files Number

    Number of files to generate.

    Prefix

    Prefix for the newly generated SDF files, default is “subset”. The generated files will be named as: subset1.sdf, subset2.sdf, and so on.

    Split by Compounds Number Mode

    SDF File

    Structure file of the small molecule library, in SDF format.

    Compounds Number

    Number of compounds to include in each newly generated SDF file.

    Prefix

    Prefix for the newly generated SDF files, default is “subset”. The generated files will be named as: subset1.sdf, subset2.sdf, and so on.

    Result Description

    A list file enumerating the SDF files produced by the split.
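
    The two split modes can be illustrated in a few lines. In an SDF file each record ends with a line containing only $$$$, which is what the sketch relies on. The function names are hypothetical and the way the actual module (built on MayaChemTools) distributes molecules among files is not documented here, so the near-equal distribution below is an assumption.

```python
def read_sdf_records(text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records  # trailing lines without a closing '$$$$' are ignored


def split_by_files_number(records, files_number):
    """Split into files_number chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), files_number)
    chunks, start = [], 0
    for i in range(files_number):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def split_by_compounds_number(records, compounds_number):
    """Split into chunks of compounds_number records (last may be smaller)."""
    return [records[i:i + compounds_number]
            for i in range(0, len(records), compounds_number)]


def subset_names(count, prefix="subset"):
    """File names in the documented pattern: subset1.sdf, subset2.sdf, ..."""
    return [f"{prefix}{i}.sdf" for i in range(1, count + 1)]
```

    With seven molecules, Files Number = 3 gives chunks of 3, 2 and 2 molecules, while Compounds Number = 3 gives chunks of 3, 3 and 1.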

    Reference Literature

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

  • Name: Molecular Docking (DOCK)
    Description: Molecular Docking (DOCK) is a DOCK 6-based molecular docking module for identifying potential binding geometries and interactions of a molecule with a target. Specifically, docking is the identification of the low-energy binding modes of a small molecule, or ligand, within the active site of a macromolecule, or receptor, whose structure is known. A compound that interacts strongly with, or binds, a receptor associated with a disease may inhibit its function and thus act as a drug. Solving the docking problem computationally requires an accurate representation of the molecular energetics as well as an efficient algorithm to search the potential binding modes.
    Author: Allen, W.J.
    Release: 2023-03-12 21:53:35
    Reference: Allen, W.J.; Balius, T.E.; Mukherjee, S.; Brozell, S. R.; Moustakas, D. T.; Lang, P. T.; Case, D. A.; Kuntz, I. D.; Rizzo, R. C. DOCK 6: Impact of New Features and Current Docking Performance. J. Comput. Chem. 36: 1132-1156, 2015.

    Molecular Docking (DOCK)

    简介

    DOCK用于识别分子与受体蛋白的潜在结合位点和相互作用。具体来说,对接是在已知结构的大分子或受体的活性位点内,识别小分子或配体的低能量结合模式。一种化合物与与疾病相关的受体强烈相互作用或结合,可能会抑制其功能,从而起到药物的作用。计算解决对接问题需要分子能量学的准确表示以及搜索潜在结合模式的有效算法。
    历史上,DOCK算法使用几何匹配算法来解决刚体对接问题,将配体结合到结合口袋中。近年来,该算法增加了一些重要功能,提高了算法找到最低能量结合模式的能力,包括基于力场的评分、动态优化、改进的刚体对接匹配算法和柔性配体对接算法。近年来通过添加新的功能,如力场评分、增强的溶剂化模型、基于参考的评分选项和从头设计,从而继续提高算法预测配体结合位点的准确性。

    参数说明

    支持自行上传小分子文件(Private Ligand Library)或者选择公共分子虚筛库(Public Ligand Library)。

    Private Ligand Library模式

    Binding Mode

    对接模式为刚性配体对接(rigid)或者柔性配体对接(flex),
    刚性配体对接:配体自身保持刚性,经平移、旋转,在口袋内寻找合适的结合取向。
    柔性配体对接:配体在固定某些非关键部位的键长、键角的前提下允许其构象发生一定程度的变化。

    Receptor File

    用于对接的受体分子,只支持pdb格式。

    Private Ligand

    对接的配体分子,支持sdf和mol2格式。

    Box Center

    配体结合口袋中心xyz坐标,用空格分开,例如 “10.734 2.033 -11.537”。

    Box Size

    配体结合口袋大小,用空格分开,例如 “24 22 32”。

    TopN

    指定打分前TopN作为输出文件。

    Rank Result

    结果文件是否按照亲和力由高到低的排序,越高打分越小。

    Public Ligand Library模式

    Public Library

    提供17个公共分子虚筛库用于分子对接,包括:

    1. Alinda :~77万库存分子,源自中国香港的Alinda Chemical公司,致力于分子砌块和新颖筛选化合物的研发供应。
    2. Analyticon :~4万库存分子,源自德国的天然产物品牌,专注天然产物提取及类似物合成工作,产品质量稳定。
    3. Asinex :~57万库存分子,源自美国的品牌,多年来致力于类先导化合物及分子砌块的研发供应,价格较贵。
    4. Bionet :~30万库存分子,源自英国的品牌,拥有多年的有机合成经验。
    5. Chembridge :~137万库存分子,源自美国的化合物品牌,总部位于圣地亚哥,拥有多样性库、大环库等多种热门化合物库。
    6. Chemdiv :~156万库存分子,全球最大的化合物品牌之一,拥有5000多种化合物骨架结构和100多种化合物库,性价比高。
    7. Enamine :~407万库存分子,源自乌克兰的化合物品牌,具有较强的化合物研发能力,有高性价比化合物和高价值化合物两类产品。
    8. Eximed :~6万库存分子,源自乌克兰的化合物品牌,近20年来致力于提供高通量筛选化合物及相关服务。
    9. HTS :~6万库存分子,源自德国的HTS Biochemie Innovationen化合物品牌,致力于为制药、农业和生物技术公司开发独特的化合物。
    10. IBS :~55万库存分子,源自俄罗斯的InterBioScreen化合物品牌,拥有多种天然产物及衍生物。
    11. Life_Chemicals :~54万库存分子,源自加拿大的化合物品牌,拥有2900多种化合物骨架结构,化合物规格较齐全且有对应价格。
    12. Maybridge :~5万库存分子,源自英国的化合物品牌,Thermofisher旗下,产品数量少而专,每种产品均具有较大库存。
    13. Otava :~29万库存分子,源自加拿大的化合物品牌,专门从事特色化合物,生物化学药品和生物分析试剂的开发和生成。
    14. Princeton :~153万库存分子,源自美国的化合物品牌,20多年来设计独特的小分子化合物用于药物开发。
    15. Specs :~20万库存分子,源自荷兰的化合物品牌,价格优势明显。
    16. UORSY :~68万库存分子,源自乌克兰的化合物品牌,产品主要用于高通量筛选和药物发现,价格与Enamine接近。
    17. Vitas-m :~140万库存分子,源自美国的化合物品牌,在香港拥有发货中心,到货速度快,价格适中。

    其他参数与Private Ligand Library模式相同。

    结果说明

    输出结果包括:

    输出文件名称 说明
    TopNScores.csv 分子对接得到的打分csv文件。输出小分子最多为100,000。
    complex_001.pdb 展示配体与受体的复合物构象文件。当Rank Result=yes时,得到亲和力最高的复合物,Rank Result=no则输出第一个小分子对接后的复合物结构。
    output_ligand_topn.sdf 筛选后配体的SDF文件。根据指定的topN数生成,最多为100,000。
    output_complex_topn.tar.bz2 小分子与受体对接后的复合物构象PDB文件压缩包,最多生成前1000小分子的复合物结构。
    TopNScores_Molecule_Info.csv 当Private Ligand Library模式,该csv中不仅有打分信息,还有配体原有信息。

    参考文献

    Allen, W.J.; Balius, T.E.; Mukherjee, S.; Brozell, S. R.; Moustakas, D. T.; Lang, P. T.; Case, D. A.; Kuntz, I. D.; Rizzo, R. C. DOCK 6: Impact of New Features and Current Docking Performance. J. Comput. Chem. 36: 1132-1156, 2015.

    Molecular Docking (DOCK)

    Introduction

    DOCK is used to identify potential binding modes and interactions between molecules and receptor proteins. Specifically, docking is the identification of the low-energy binding modes of a small molecule, or ligand, within the active site of a macromolecule, or receptor, whose structure is known. A compound that interacts strongly with, or binds, a disease-related receptor may inhibit its function and thus act as a drug. Solving the docking problem computationally requires an accurate representation of molecular energetics and an efficient algorithm to search the potential binding modes.

    Historically, DOCK solved the rigid-body docking problem with a geometric matching algorithm that places the ligand into the binding pocket. Features added over the years improved its ability to find the lowest-energy binding mode: force-field-based scoring, on-the-fly optimization, an improved matching algorithm for rigid-body docking, and an algorithm for flexible ligand docking. More recent additions, such as enhanced solvation models, reference-based scoring options, and de novo design, continue to improve the accuracy of the predicted ligand binding modes.

    Parameter Description

    Supports uploading a Private Ligand Library or selecting a Public Ligand Library for docking.

    Private Ligand Library

    Receptor File

    The receptor molecule used for docking, only supports the pdb format.

    Private Ligand

    The ligand molecule for docking, supports sdf and mol2 formats.

    Binding Mode

    Choose between rigid docking (rigid) or flexible docking (flex):

    • Rigid docking: The ligand itself is kept rigid and is translated and rotated to find a suitable binding orientation within the pocket.
    • Flexible docking: The ligand is allowed a certain degree of conformational change, while the bond lengths and bond angles of some non-critical parts are held fixed.

    Box Center

    The xyz coordinates of the center of the ligand binding pocket, separated by spaces, e.g., “10.734 2.033 -11.537”.

    Box Size

    The size of the ligand binding pocket, separated by spaces, e.g., “24 22 32”.
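
    Box Center and Box Size are both given as three space-separated numbers. The sketch below parses them and derives the two opposite corners of the box. It assumes that Box Size is the full edge length along x, y and z and that the box is centered on Box Center; this page does not state either point, and the function names are hypothetical.

```python
def parse_triplet(value):
    """Parse a string such as '10.734 2.033 -11.537' into three floats."""
    parts = value.split()
    if len(parts) != 3:
        raise ValueError(f"expected three space-separated numbers, got {value!r}")
    return tuple(float(part) for part in parts)


def box_corners(center, size):
    """Return (lower, upper) corners, assuming size is the full edge length."""
    c = parse_triplet(center)
    s = parse_triplet(size)
    lower = tuple(ci - si / 2 for ci, si in zip(c, s))
    upper = tuple(ci + si / 2 for ci, si in zip(c, s))
    return lower, upper
```

    A quick sanity check is to confirm that the coordinates of a known ligand, if one is available, fall between the two corners.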

    TopN

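    Number of top-scoring results (top N) to write to the output files.

    Rank Result

    Whether to sort the result file by affinity from highest to lowest; the higher the affinity, the lower the score.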

    Public Ligand Library

    Public Library

    Provides 17 public virtual screening libraries for molecular docking: Alinda, Analyticon, Asinex, Bionet, Chembridge, Chemdiv, Enamine, Eximed, HTS_Biochemie_Innovationen, IBScreen, Life_Chemicals, Maybridge, Otava, Princeton, Specs, UORSY, and Vitas-m.

    1. Alinda : ~770,000 stock molecules, sourced from Alinda Chemical in Hong Kong, dedicated to the development and supply of molecular building blocks and novel screening compounds.
    2. Analyticon : ~40,000 stock molecules, a German brand specializing in natural product extraction and analogue synthesis, known for stable product quality.
    3. Asinex : ~570,000 stock molecules, an American brand focused on the development and supply of lead-like compounds and molecular building blocks for many years, relatively expensive.
    4. Bionet : ~300,000 stock molecules, a UK brand with many years of experience in organic synthesis.
    5. Chembridge : ~1,370,000 stock molecules, an American compound brand headquartered in San Diego, offering diverse libraries, macrocyclic libraries, and other popular compound libraries.
    6. Chemdiv : ~1,560,000 stock molecules, one of the world’s largest compound brands, with over 5,000 compound scaffolds and more than 100 compound libraries, and good value for money.
    7. Enamine : ~4,070,000 stock molecules, a Ukrainian compound brand with strong compound development capabilities, offering two product lines: cost-effective compounds and high-value compounds.
    8. Eximed : ~60,000 stock molecules, a Ukrainian compound brand dedicated to providing high-throughput screening compounds and related services for nearly 20 years.
    9. HTS : ~60,000 stock molecules, a German compound brand HTS Biochemie Innovationen, dedicated to developing unique compounds for pharmaceutical, agricultural, and biotechnology companies.
    10. IBS : ~550,000 stock molecules, a Russian compound brand InterBioScreen, offering a variety of natural products and derivatives.
    11. Life Chemicals : ~540,000 stock molecules, a Canadian compound brand with over 2,900 compound scaffolds, offering a wide range of compound specifications at corresponding prices.
    12. Maybridge : ~50,000 stock molecules, a UK compound brand under Thermo Fisher, known for a small but specialized product range with large inventories for each product.
    13. Otava : ~290,000 stock molecules, a Canadian compound brand specializing in the development and production of specialty compounds, biochemical drugs, and bioanalytical reagents.
    14. Princeton : ~1,530,000 stock molecules, an American compound brand that has been designing unique small molecules for drug development for over 20 years.
    15. Specs : ~200,000 stock molecules, a Dutch compound brand with a clear price advantage.
    16. UORSY : ~680,000 stock molecules, a Ukrainian compound brand, mainly used for high-throughput screening and drug discovery, with prices similar to Enamine.
    17. Vitas-M : ~1,400,000 stock molecules, an American compound brand with a shipping center in Hong Kong, offering fast delivery and moderate prices.

    Other parameters are the same as in the Private Ligand Library mode.

    Result Description

    The output includes:

    Output File Name Description
    Max_poses_ligand.sdf SDF file of the top 3000 ligands by score; all docking results are output if the library contains fewer than 100,000 ligands.
    Max_poses_scores.csv Scoring file for all ligands (≤ 100,000) docked with the receptor.
    output_complex_topn.tar.gz Compressed archive of PDB files holding the best complex conformation of each top-N ligand with the receptor; complex structures are generated for at most the first 1,000 small molecules.
    complex_001.pdb File showing the top complex conformation of the ligand with the receptor based on scoring.
    topN_ligand.sdf SDF file of the top N ligands based on docking scores.
    topN_scores.csv Scoring file sorted by the highest docking score for each ligand with the receptor.

    Reference Literature

    Allen, W.J.; Balius, T.E.; Mukherjee, S.; Brozell, S. R.; Moustakas, D. T.; Lang, P. T.; Case, D. A.; Kuntz, I. D.; Rizzo, R. C. DOCK 6: Impact of New Features and Current Docking Performance. J. Comput. Chem. 36: 1132-1156, 2015.

  • Name: Enumerate Stereoisomers
    Description: Enumerate Stereoisomers is a tool for performing a combinatorial enumeration of stereoisomers for molecules around all or unassigned chiral atoms and bonds. Both cis-trans isomers and enantiomers are supported.
    Author: Manish Sud
    Release: 2023-03-12 20:08:04
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Enumerate Stereoisomers

    简介

    Enumerate Stereoisomers是枚举小分子立体异构体的工具,支持顺反异构体和对映异构体两种形式的枚举。立体异构(stereoisomerism)是在有相同分子式的化合物分子中,原子或原子团互相连接的次序相同,但在空间的排列方式不同,与构造异构同属有机化学范畴中的同分异构现象。对所有或未分配的手性原子和键周围的分子进行立体异构体的组合枚举。

    参数说明

    Enumerate Stereoisomers (File)模式

    Input File

    小分子结构文件,支持SMILES、MOL、SDF格式。

    Output File

    指定输出文件的名称,支持SDF(.sd)和SMILES格式(.smi)。

    Mode

    枚举模式,包括如下:
    UnassignedOnly:只枚举未分配手性原子和键的分子的构型异构体。所有原子和键都分配手性时,选择该选项得到该分子本身。
    All:枚举所有立体异构体,包括构型异构和构象异构。

    Number

    每个分子产生异构体的最大数目。

    Enumerate Stereoisomers (String)模式

    Smiles String

    小分子的smiles字符串,一行一个分子

    结果说明

    得到小分子构型异构体的组合SDF文件generated_isomers.sdf。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. doi: 10.1021/acs.jcim.6b00505.

    Enumerate Stereoisomers

    Introduction

    Enumerate Stereoisomers is a tool for enumerating stereoisomers of small molecules, supporting both cis-trans isomers and enantiomers. Stereoisomerism occurs when compounds with the same molecular formula have their atoms or groups connected in the same order but arranged differently in space; together with constitutional (structural) isomerism, it is one of the forms of isomerism in organic chemistry. The tool performs a combinatorial enumeration of stereoisomers around all, or only the unassigned, chiral atoms and bonds of each molecule.

    Parameter Description

    Enumerate Stereoisomers (File) Mode

    Input File

    The small molecule structure file, supporting SMILES, MOL, and SDF formats.

    Output File

    Specify the name of the output file, supporting SDF (.sd) and SMILES (.smi) formats.

    Mode

    Enumeration modes include:

    • UnassignedOnly: Enumerate configurational isomers only around chiral atoms and bonds that are unassigned. When every atom and bond already has its stereochemistry assigned, this option returns the molecule itself.
    • All: Enumerate all stereoisomers, including configurational and conformational isomers.

    Number

    Maximum number of isomers to generate for each molecule.

    Enumerate Stereoisomers (String) Mode

    Smiles String

    SMILES string of the small molecule, one molecule per line.

    Result Description

    Produces generated_isomers.sdf, a combined SDF file containing the configurational isomers of the input molecules.
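
    The effect of the Mode and Number parameters can be illustrated with a toy model in which each stereocenter is either labeled ("R" or "S") or unassigned (None). This is only a sketch of the combinatorics, not the module's implementation: it ignores double-bond (cis-trans) stereo, meso compounds and symmetry, and the function name enumerate_labels is hypothetical.

```python
from itertools import islice, product


def enumerate_labels(centers, mode="UnassignedOnly", number=None):
    """Enumerate label combinations for a list of stereocenters.

    centers: list of "R", "S" or None (unassigned).
    mode: "UnassignedOnly" varies only the None entries; "All" varies every center.
    number: maximum number of combinations to return (None means no cap).
    """
    choices = []
    for label in centers:
        if mode == "All" or label is None:
            choices.append(("R", "S"))   # this center is enumerated
        else:
            choices.append((label,))     # this center keeps its assignment
    return list(islice(product(*choices), number))
```

    In this model n enumerated centers give at most 2^n combinations, so a molecule with one assigned and one unassigned center yields two results in UnassignedOnly mode and four in All mode, and a fully assigned molecule in UnassignedOnly mode yields only itself.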

    Reference Literature

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. doi: 10.1021/acs.jcim.6b00505.

  • Name: SDF Viewer
    Description: SDF Viewer is a visualization module for small-molecule compound libraries. From an SDF file it generates an interactive, searchable HTML table with columns corresponding to the molecules and the alphanumerical data available in the input file, making compound structures and properties easy to browse.
    Author: Manish Sud
    Release: 2023-03-10 00:00:00
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    SDF Viewer

    简介

    SDF Viewer是小分子化合物库的可视化模块,可以针对一个SDF文件生成一个结构和属性可视、可交互、可检索的HTML页面,方便浏览化合物的结构和属性信息。

    参数说明

    SDF File

    小分子结构文件,SDF格式

    HTML File

    输出HTML文件名,默认为library.html

    结果说明

    针对SDF文件生成一个结构和属性可视、可交互、可检索的HTML页面library.html。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. doi: 10.1021/acs.jcim.6b00505.

    SDF Viewer

    Introduction

    The SDF Viewer is a visualization module for small-molecule compound libraries. From an SDF file it generates an interactive, searchable HTML page that displays the structures and properties of the compounds, making both easy to browse.

    Parameter Description

    SDF File

    The small molecule structure file in SDF format.

    HTML File

    The output HTML file name, defaulting to library.html.

    Result Description

    Generates an interactive and searchable HTML page (library.html) that visualizes the structures and properties of compounds in the SDF file.

    Reference Literature

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. doi: 10.1021/acs.jcim.6b00505.

  • Name: Antibody-Antigen Docking (HADDOCK)
    Description: HADDOCK v3.0 is a bottom-up reimagination of the long-standing, time-proven HADDOCK used for integrative modeling of biomolecular complexes. This young and still very experimental version (use it at your own risk!) aims to modularize and extend HADDOCK’s core functions. In its current implementation, HADDOCK v3.0 still lacks the full repertoire of features present in the production web server version, HADDOCK v2.4. However, it is able to take full advantage of ambiguous interaction restraints (AIRs) to drive the docking process.
    Author: Cyril Dominguez
    Release: 2023-03-06 14:09:05
    Reference: Dominguez, C., Boelens, R. & Bonvin, A. M. J. J. HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J Am Chem Soc 125, 1731–1737 (2003).

    HADDOCK

    简介

    HADDOCK v3.0 是一个自下而上的对长期以来被证实的HADDOCK的重新构想,用于生物分子复合物的综合建模。旨在对HADDOCK的核心功能进行模块化和扩展。它能够充分利用模糊的相互作用约束(AIRs)来驱动对接过程。使用蛋白质-蛋白质对接基准5对它进行了评估,并与实时版本(v2.4)进行了比较。该评估是使用每个复合物的真实界面(3.9 Å)进行的,并以成功率表示;在按HADDOCK-score排名的特定解决方案子集中,至少有一个对接解决方案低于指定阈值的BM5目标数量。

    参数说明

    Antibody File

    用于进行对接的抗体PDB文件

    Antigen File

    用于进行对接的抗原PDB文件

    结果说明

    输出结果包括:

    输出文件名称 说明
    score.csv 复合物构象的对接能量打分文件
    result.tar.gz 所有复合物构象PDB文件压缩包
    cluster_01_model.pdb-cluster_10_model.pdb 打分前十的复合物构象

    其中score.csv,包含信息如下:

    字段名称 说明
    RANK 打分排序
    Score 对接能量打分,其中打分值越低,结合能力越强。

    参考文献

    Dominguez C, Boelens R, Bonvin AM. HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J Am Chem Soc. 2003 Feb 19;125(7):1731-7.

    HADDOCK

    Introduction

    HADDOCK v3.0 is a bottom-up reimagining of the long-established HADDOCK for integrative modeling of biomolecular complexes. It aims to modularize and extend HADDOCK’s core functions, and it can take full advantage of ambiguous interaction restraints (AIRs) to drive the docking process. It was evaluated on the Protein-Protein Docking Benchmark 5 (BM5) and compared with the production version (v2.4). The evaluation used the true interface of each complex (3.9 Å) and is expressed as a success rate: the number of BM5 targets with at least one docking solution below a specified threshold within a given subset of solutions ranked by HADDOCK score.

    Parameter Description

    Antibody File

    PDB file of the antibody used for docking.

    Antigen File

    PDB file of the antigen used for docking.

    Result Description

    The output results include:

    Output File Name Description
    score.csv Docking energy scoring file for complex conformations.
    result.tar.gz Compressed archive of all complex conformation PDB files.
    cluster_01_model.pdb-cluster_10_model.pdb The ten top-scoring complex conformations.

    In score.csv, the information is as follows:

    Field Name Description
    RANK Ranking based on scoring.
    Score Docking energy score, where lower scores indicate stronger binding capability.
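
    Because lower scores indicate stronger binding, the best model is the row with the minimum Score. The following sketch picks it out with the standard csv module. It assumes score.csv is comma-separated with a header row containing exactly the RANK and Score fields described above; the function name best_model is hypothetical.

```python
import csv
import io


def best_model(csv_text):
    """Return the row (as a dict) with the lowest Score value."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("score.csv contains no data rows")
    return min(rows, key=lambda row: float(row["Score"]))
```

    If the file is already sorted by RANK, the result should simply be the first row; the explicit minimum guards against an unsorted file.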

    Reference Literature

    Dominguez C, Boelens R, Bonvin AM. HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J Am Chem Soc. 2003 Feb 19;125(7):1731-7.

  • Name: Cyclic Peptide Design
    Description: Based on the cyclic peptide design algorithm AfCycDesign, which modifies the AlphaFold network for accurate structure prediction and design of cyclic peptides. The module can design cyclic peptides on the backbone of a template structure or design them de novo. Results show that this approach can accurately predict the structures of native cyclic peptides from a single sequence, with 36 out of 49 cases predicted with high confidence (pLDDT > 0.85) matching the native structure with a root mean squared deviation (RMSD) of less than 1.5 Å. Redesign of macrocyclic backbone sequences using AlphaFold produced seven sequences of different sizes and structures whose designs matched the X-ray crystal structures very closely.
    Author: Stephen A.
    Release: 2023-03-03 16:09:18
    Reference: Stephen A. Rettie, Katelyn V. Campbell, Asim K. Bera, Alex Kang, Simon Kozlov, Joshmyn De La Cruz, Victor Adebomi, Guangfeng Zhou, Frank DiMaio, Sergey Ovchinnikov, Gaurav Bhardwaj. Cyclic peptide structure prediction and design using AlphaFold. bioRxiv 2023.02.25.529956.

    Cyclic Peptide Design

    简介

    基于AfCycDesign算法,利用ColabDesign与AlphaFold2等技术,基于模板分子结构骨架的环肽设计,或进行全新环肽设计。测试表明,这种方法能够准确地预测来自单一序列的原生环状肽的结构,在49个案例中,有36个被预测为高置信度的环状肽,pLDDT>0.85,与原生结构相匹配,均方根偏差(RMSD)小于1.5 Å。

    参数说明

    本模块存在两种模式FixBB与Hallucination,其中前者表示进行基于模板蛋白(环肽)结构骨架的环肽设计;后者表示进行全新的环肽设计,不参考模板骨架,可设置环肽长度。

    FixBB模式参数

    Structural Template

    上传模板蛋白(环肽)结构。注意,环肽长度不能超过100个氨基酸。

    Chain

    指定模板蛋白中用于参考设计的蛋白链标识,如:“B”,如果结构中只有一条链,可以不用指定。

    Fix Position

    指定设计时固定模板蛋白中的某些位置的氨基酸不变化,如:‘1,5-10’ 将固定模板蛋白中的第1和5至10的氨基酸不变。

    Hallucination模式参数

    Peptide Length

    指定全新设计的环肽长度,如:20.

    Remove Residue

    指定设计时需要去除的氨基酸类型,如:“C,W”表示设计的环肽不会出现cysteine和Tryptophan。

    结果说明

    设计的环肽的三维结构文件result.pdb。

    参考文献

    Stephen A. Rettie, Katelyn V. Campbell, Asim K. Bera, Alex Kang, Simon Kozlov, Joshmyn De La Cruz, Victor Adebomi, Guangfeng Zhou, Frank DiMaio, Sergey Ovchinnikov, Gaurav Bhardwaj. Cyclic peptide structure prediction and design using AlphaFold. bioRxiv 2023.02.25.529956.

    Cyclic Peptide Design

    Introduction

    The Cyclic Peptide Design module uses the AfCycDesign algorithm, together with ColabDesign and AlphaFold2, either to design cyclic peptides on the structural backbone of a template molecule or to design entirely new cyclic peptides. Tests show that the method can accurately predict the structures of native cyclic peptides from a single sequence: 36 of 49 cases were predicted with high confidence (pLDDT > 0.85) and matched the native structure with a root mean square deviation (RMSD) of less than 1.5 Å.

    Parameter Description

    This module has two modes, FixBB and Hallucination. FixBB designs cyclic peptides on the structural backbone of a template protein (cyclic peptide). Hallucination designs entirely new cyclic peptides without reference to a template backbone and lets you set the peptide length.

    FixBB Mode Parameters

    Structural Template

    Upload the template protein (cyclic peptide) structure. Note that the length of the cyclic peptide cannot exceed 100 amino acids.

    Chain

    Specify the protein chain identifier used for reference design in the template protein, e.g., “B”. If there is only one chain in the structure, this can be left unspecified.

    Fix Position

    Specify the amino acids in the template protein that should remain fixed during design, e.g., ‘1,5-10’ will fix amino acids at positions 1 and 5 to 10 in the template protein.
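
    The Fix Position syntax combines single positions and inclusive ranges separated by commas. A minimal sketch of how such a string expands into individual residue positions is shown below; the function name parse_positions is hypothetical, and the module's own parser may accept or reject inputs differently.

```python
def parse_positions(spec):
    """Expand a spec such as '1,5-10' into a sorted list of positions."""
    positions = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            positions.update(range(start, end + 1))  # ranges are inclusive
        else:
            positions.add(int(part))
    return sorted(positions)
```

    For the example above, parse_positions("1,5-10") gives positions 1, 5, 6, 7, 8, 9 and 10.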

    Hallucination Mode Parameters

    Peptide Length

    Specify the length of the newly designed cyclic peptide, e.g., 20.

    Remove Residue

    Specify the amino acid types to exclude from the design, e.g., “C,W” means that the designed cyclic peptide will contain neither cysteine nor tryptophan.

    Result Description

    The three-dimensional structure file of the designed cyclic peptide is stored in result.pdb.

    Reference Literature

    Stephen A. Rettie, Katelyn V. Campbell, Asim K. Bera, Alex Kang, Simon Kozlov, Joshmyn De La Cruz, Victor Adebomi, Guangfeng Zhou, Frank DiMaio, Sergey Ovchinnikov, Gaurav Bhardwaj. Cyclic peptide structure prediction and design using AlphaFold. bioRxiv 2023.02.25.529956.

  • Name: Mutation Energy of Binding (GeoPPI)
    Description: Mutation Energy of Binding (GeoPPI)模块是基于深度学习的框架,使用蛋白质复合物的深度几何表征来模拟突变对结合亲和力的影响,从而预测氨基酸突变对蛋白质-蛋白质亲和力的影响。 Deep geometric representations for modeling effects of mutations on protein-protein binding affinity.
    Tags: undefined
    Author: GeoPPI
    Release: 2023-02-28 15:46:02
    Reference: Liu X, Luo Y, Li P, Song S, Peng J. Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol. 2021 Aug 4;17(8):e1009284. doi: 10.1371/journal.pcbi.1009284.

    Mutation Energy of Binding (GeoPPI)

    简介

    基于深度学习技术预测氨基酸突变对蛋白质-蛋白质相互作用的影响。该模块是基于开源的GeoPPI方法开发的,使用蛋白质复合物的深度几何表征来模拟突变对结合亲和力的影响。为了实现几何结构的强大表达能力和预测的稳健性,模块依次采用了两个组件,即一个几何编码器(擅长提取图形特征)和一个梯度增强树(GBT,擅长避免过度拟合)。几何编码器是一个图形神经网络,在相邻的原子上执行神经信息传递,以更新中心原子的表征。它通过一个新的自我监督学习方案进行训练,以产生蛋白质结构的深度几何表示。基于这些对复合物及其突变体的学习表征,GBT从突变数据中学习,以预测相应的结合亲和力变化。

    参数说明

    PDB File

    野生型的复合物结构,PDB格式。

    Mutation File

    突变列表文件,TXT格式,每行包含突变信息,格式如下:

    TI17R,EI19R;E_I
    AI15R;E_I
    

    每行突变信息及一个相互作用链信息,用分号“;”分隔,其中:
    TI17R中的T表示野生型的氨基酸,I表示该氨基酸所在的链,17表示结构文件中该氨基酸的序号,R表示突变后的氨基酸。当存在多点突变时,突变信息用逗号(“,”)隔开,如TI17R,EI19R。E_I表示复合物中产生相互作用的蛋白链是E链与I链;相应的,如果是多条链与多条链产生相互作用,如:HL_WV,表示H、L链与W、V链产生相互作用。
    需要注意的是,突变信息可以是多点或者单点,但是每一行的相互作用链信息只能是一个。

    结果说明

    输出结果文件为score.csv,包含信息如下:

    字段名称 说明
    Mutation 突变位点
    Chain 突变点所在的链
    Interaction_Chains 相互作用之间的链名称
    deltaEnergy 该突变引起的结合能量的变化(wildtype-mutant),值越小说明突变后结合越弱,该突变位点对受配体之间结合越重要,单位为kcal/mol。

    参考文献

    Liu X, Luo Y, Li P, Song S, Peng J. Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol. 2021 Aug 4;17(8):e1009284.

    MIT License

    Copyright © 2021 LiuXianggen
    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
    THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    Mutation Energy of Binding (GeoPPI)

    Introduction

    The Mutation Energy of Binding (GeoPPI) module uses deep learning to predict the effect of amino acid mutations on protein-protein interactions. It is built on the open-source GeoPPI method, which uses deep geometric representations of protein complexes to model the impact of mutations on binding affinity. To combine expressive geometric representations with robust prediction, the module applies two components in sequence: a geometric encoder (good at extracting graph features) and gradient-boosting trees (GBT, good at avoiding overfitting). The geometric encoder is a graph neural network that performs neural message passing over neighboring atoms to update the representation of each central atom. It is trained with a self-supervised learning scheme to produce deep geometric representations of protein structures. From these learned representations of a complex and its mutants, the GBT learns from mutation data to predict the corresponding change in binding affinity.

    Parameter Description

    PDB File

    The structure of the wild-type complex in PDB format.

    Mutation File

    A file listing mutations in TXT format, with each line containing mutation information in the following format:

    TI17R,EI19R;E_I
    AI15R;E_I
    

    Each line contains the mutation information followed by the interaction chain information, separated by a semicolon (“;”), where:

    • In TI17R, T is the wild-type amino acid, I is the chain on which the amino acid is located, 17 is the residue number of the amino acid in the structure file, and R is the mutated amino acid. Multiple mutations are separated by commas (“,”), as in TI17R,EI19R.
    • E_I indicates that the interacting protein chains in the complex are chain E and chain I. When several chains interact with several chains, as in HL_WV, chains H and L interact with chains W and V.

    Note that the mutation information on a line may describe a single-point or a multi-point mutation, but each line may carry only one interaction chain specification.
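
    The line format described above can be checked before submission with a small parser. The sketch below is only an illustration: the names Mutation and parse_geoppi_line are hypothetical, and it assumes single-character chain identifiers and one-letter amino acid codes, as in the examples.

```python
import re
from typing import NamedTuple

# wild-type residue, chain ID, residue number, mutant residue (e.g. TI17R)
MUTATION_RE = re.compile(r"^([A-Z])([A-Za-z0-9])(-?\d+)([A-Z])$")


class Mutation(NamedTuple):
    wild_type: str
    chain: str
    position: int
    mutant: str


def parse_geoppi_line(line: str):
    """Split 'TI17R,EI19R;E_I' into a mutation list and the two partner chain groups."""
    mutation_text, sep, partner_text = line.strip().partition(";")
    if not sep or not partner_text:
        raise ValueError(f"missing ';' or interaction chains: {line!r}")
    mutations = []
    for token in mutation_text.split(","):
        match = MUTATION_RE.match(token.strip())
        if match is None:
            raise ValueError(f"malformed mutation: {token!r}")
        wild_type, chain, position, mutant = match.groups()
        mutations.append(Mutation(wild_type, chain, int(position), mutant))
    side_a, sep, side_b = partner_text.partition("_")
    if not sep or not side_a or not side_b:
        raise ValueError(f"malformed interaction chains: {partner_text!r}")
    # 'HL_WV' means chains H and L on one side, W and V on the other.
    return mutations, (list(side_a), list(side_b))


mutations, partners = parse_geoppi_line("TI17R,EI19R;E_I")
print(mutations)
print(partners)
```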

    Result Description

    The output result file is score.csv, which includes the following information:

    Field Name Description
    Mutation The mutation site
    Chain The chain where the mutation occurs
    Interaction_Chains Names of the interacting chains
    deltaEnergy The change in binding energy caused by the mutation (wild type minus mutant), in kcal/mol. A smaller value indicates weaker binding after the mutation, meaning the mutated site matters more for binding between the receptor and the ligand.

    Reference Literature

    Liu X, Luo Y, Li P, Song S, Peng J. Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol. 2021 Aug 4;17(8):e1009284.

    MIT License

    Copyright © 2021 LiuXianggen
    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
    THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  • Name: Mutation Energy of Binding (ddG Predictor)
    Description: Mutation Energy of Binding (ddG Predictor)是基于几何神经网络模型预测氨基酸突变对蛋白-蛋白亲和力的影响。 An attention-based geometric neural network architecture to learn the mutational effect on protein–protein interactions from three-dimensional protein complex structures.
    Tags: undefined
    Author: Sisi Shan.
    Release: 2023-02-26 12:44:43
    Reference: Shan S, Luo S, Yang Z, Hong J, Su Y, Ding F, Fu L, Li C, Chen P, Ma J, Shi X, Zhang Q, Berger B, Zhang L, Peng J. Deep learning guided optimization of human antibody against SARS-CoV-2 variants with broad neutralization. Proc Natl Acad Sci U S A. 2022 Mar 15;119(11):e2122954119.

    Mutation Energy of Binding (ddG Predictor)

    简介

    Mutation Energy of Binding (ddG Predictor)模块功能是预测氨基酸突变对蛋白质-蛋白质亲和力的影响。采用基于注意力的几何神经网络架构,从三维蛋白质复合体结构中学习突变对蛋白质-蛋白质相互作用的影响。该模型的几何部分通过考虑其周围原子的接近程度为每个残基学习一个矢量嵌入。基于这些学习到的几何嵌入,注意力网络学习识别蛋白质界面附近有助于结合亲和力的关键残基对。具体来说,对于蛋白质复合物中的每个残基,网络首先通过注意机制识别其他残基的重要性,并从这些残基中学习包括空间接近性和物理化学特性在内的信息。因此,聚合的信息可以编码环境以及每个残基的相互作用特征。使用模型对野生型(WT)和突变复合物进行编码,以获得WT和突变的embeding信息。然后,额外的神经网络层比较这两个embeding来预测能量的变化ΔΔG。该模型通过对SKEMPI(V2.0)数据集进行逐个复合体的五倍交叉验证来评估。由1,131个单点突变(S1131)组成的子集被用来作为模型和其他基线的基准。另外一个由多点突变组成的子集(M1707)也被用来作为基准。该模型能够做出与实验结合数据具有中度至高度相关性的预测,并且也优于目前最先进的方法GeoPPI,以及其他一些最近提出的预测单一突变效应的方法。

    基准测试

    image.png

    参数说明

    PDB File

    野生型的复合物结构,PDB格式

    Mutation File

    单点突变列表文件,TXT格式,每行一个单点突变信息,格式如下:

    QA1D;
    QA1S;
    

    QA1D中的Q表示野生型的氨基酸,A表示该氨基酸所在的链,1表示结构文件中该氨基酸的序号,D表示突变后的氨基酸。

    结果说明

    输出结果文件为score.csv,包含信息如下:

    字段名称 说明
    Mutation 突变位点
    Chain 突变点所在的链
    deltaEnergy 该突变引起的结合能量的变化,单位为kcal/mol,(Energy[mutant]-Energy[wild])

    参考文献

    Shan S, Luo S, Yang Z, Hong J, Su Y, Ding F, Fu L, Li C, Chen P, Ma J, Shi X, Zhang Q, Berger B, Zhang L, Peng J. Deep learning guided optimization of human antibody against SARS-CoV-2 variants with broad neutralization. Proc Natl Acad Sci U S A. 2022 Mar 15;119(11):e2122954119.

    Mutation Energy of Binding (ddG Predictor)

    Introduction

    The Mutation Energy of Binding (ddG Predictor) module predicts the impact of amino acid mutations on protein-protein binding affinity. It uses an attention-based geometric neural network to learn the effects of mutations on protein-protein interactions from the three-dimensional structures of protein complexes. The geometric part of the model learns a vector embedding for each residue from the proximity of its surrounding atoms. Using these geometric embeddings, the attention network learns to identify key residue pairs near the protein interface that contribute to binding affinity. Specifically, for each residue in the complex, the network first weighs the importance of the other residues through an attention mechanism and learns information from them, including spatial proximity and physicochemical properties. The aggregated information therefore encodes the environment and interaction features of each residue. The model encodes both the wild-type (WT) and the mutant complex to obtain their embeddings, and additional neural network layers then compare the two embeddings to predict the change in energy, ΔΔG. The model was evaluated by five-fold cross-validation, split by complex, on the SKEMPI (V2.0) dataset. A subset of 1,131 single-point mutations (S1131) served as a benchmark for the model and other baselines, and a second subset of multi-point mutations (M1707) was also used as a benchmark. The model's predictions show moderate to high correlation with experimental binding data, and it outperforms GeoPPI, the state-of-the-art method at the time, as well as several other recently proposed methods for predicting single-mutation effects.

    Benchmark Testing

    image.png

    Parameter Description

    PDB File

    The structure of the wild-type complex in PDB format.

    Mutation File

    A TXT file listing single-point mutations, one mutation per line, in the following format:

    QA1D;
    QA1S;
    

    In QA1D, Q represents the wild-type amino acid, A represents the chain where the amino acid is located, 1 represents the sequence number of the amino acid in the structure file, and D represents the mutated amino acid.
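
    A mutation file in this format is easy to generate programmatically. The sketch below writes the 19 possible substitutions at one position, one per line with the trailing semicolon shown above. The helper name saturation_lines is hypothetical, and scanning every substitution is only an example use, not a requirement of the module.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def saturation_lines(wild_type: str, chain: str, position: int) -> list[str]:
    """Return mutation-file lines such as 'QA1D;' for every non-wild-type residue."""
    if wild_type not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid: {wild_type!r}")
    return [
        f"{wild_type}{chain}{position}{mutant};"
        for mutant in AMINO_ACIDS
        if mutant != wild_type
    ]


lines = saturation_lines("Q", "A", 1)
print(len(lines))            # 19
print("\n".join(lines[:3]))  # QA1A; QA1C; QA1D; on separate lines
```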

    Result Description

    The output result file is score.csv, which includes the following information:

    Field Name Description
    Mutation The mutation site
    Chain The chain where the mutation occurs
    deltaEnergy The change in binding energy caused by the mutation, in kcal/mol (Energy[mutant] - Energy[wild]).

    Reference Literature

    Shan S, Luo S, Yang Z, Hong J, Su Y, Ding F, Fu L, Li C, Chen P, Ma J, Shi X, Zhang Q, Berger B, Zhang L, Peng J. Deep learning guided optimization of human antibody against SARS-CoV-2 variants with broad neutralization. Proc Natl Acad Sci U S A. 2022 Mar 15;119(11):e2122954119.

  • Name: Protein Sequence Generation (ProGen)
    Description: ProGen是一种语言模型,可以在大型蛋白质家族中生成具有功能的蛋白质序列,类似于在各种话题上生成语法和语义正确的自然语言句子。该模型使用来自>19,000个家族的2.8亿个蛋白质序列进行训练,并附加了控制标签以指定蛋白质属性。可以进一步对ProGen进行微调,以改善来自具有足够同源样本家族的蛋白质生成性能。 ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples.
    Tags: undefined
    Author: Ali Madani
    Release: 2023-02-11 00:00:00
    Reference: Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL Jr, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023 Jan 26.

    Protein Sequence Generation (ProGen)

    简介

    ProGen是一种语言模型,可以在大型蛋白质家族中生成具有可预测功能的蛋白质序列,类似于在不同主题上生成语法和语义正确的自然语言句子。该模型基于来自> 19,000个家族的2.8亿个蛋白质序列进行训练,并增加了指定蛋白质属性的控制标签。基于Progen2模型实现,ProGen2模型可扩展到64亿个参数,并在不同的序列数据集上进行训练,这些数据集来自基因组、元基因组和免疫剧目数据库的10亿多个蛋白质。ProGen2模型在捕捉观察到的进化序列的分布、产生新的可行的序列,并预测蛋白质的适应性等方面显示出最先进的性能。
    Protein Sequence Generation (ProGen)目前主要功能是基于Reference序列,进行序列的增长(从Reference序列末端开始增长),后续开放其他场景的序列生成功能。

    参数说明

    Model

    模型类型有2种可选(progen2-large,progen2-xlarge)。
    模型信息:
    progen2-large,参数数量2.7 Billion,神经网络层数32。
    progen2-xlarge,模型参数数量6.4 Billion,神经网络层数32。

    Reference Sequence

    作为参考的序列(填序列信息)
    注意:不支持多条序列,多条序列会被合并为一条序列。

    Number of Samples

    生成序列的数目。
    注意:序列长度不超过1024个氨基酸。

    结果说明

    生成的蛋白序列文件result.fasta。

    参考文献

    Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL Jr, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023 Jan 26.

    Protein Sequence Generation (ProGen)

    Introduction

    ProGen is a language model that generates protein sequences with predictable function across large protein families, much as a language model generates grammatically and semantically correct natural-language sentences on diverse topics. The model was trained on 280 million protein sequences from more than 19,000 families and is augmented with control tags that specify protein properties. This module is implemented on the ProGen2 models, which scale up to 6.4 billion parameters and are trained on more than one billion proteins drawn from genomic, metagenomic, and immune repertoire databases. ProGen2 shows state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness.

    Currently, the main function of Protein Sequence Generation (ProGen) is to extend sequences based on a reference sequence (growing from the end of the reference sequence). Additional sequence generation functionalities for other scenarios will be made available in the future.

    Parameter Description

    Model

    There are two model options available: progen2-large and progen2-xlarge.
    Model details:

    • progen2-large: 2.7 billion parameters, 32 neural network layers.
    • progen2-xlarge: 6.4 billion parameters, 32 neural network layers.

    Reference Sequence

    The reference sequence to extend (enter the sequence itself).
    Note: Multiple sequences are not supported; if several are supplied, they are merged into a single sequence.

    Number of Samples

    The number of sequences to generate.
    Note: The sequence length should not exceed 1024 amino acids.

    Result Description

    The generated protein sequence file is named result.fasta.
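
    The following sketch shows one way to inspect result.fasta after a run. It rests on assumptions not stated above: that each generated record begins with the reference sequence (since generation extends from its end), and that record names look like the hypothetical sample_1 and sample_2 used here. The 1024-residue limit is the one given under Number of Samples.

```python
MAX_LENGTH = 1024  # length limit stated under "Number of Samples"


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record name: sequence}, joining wrapped lines."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def summarize(reference: str, records: dict[str, str]) -> dict[str, dict]:
    """Report length, limit check, and the part appended after the reference."""
    report = {}
    for name, sequence in records.items():
        extension = sequence[len(reference):] if sequence.startswith(reference) else None
        report[name] = {
            "length": len(sequence),
            "within_limit": len(sequence) <= MAX_LENGTH,
            "extension": extension,  # None if the record does not start with the reference
        }
    return report


example = ">sample_1\nMKTAYIAKQR\nGLSD\n>sample_2\nMKTAYIAKQRWW\n"
print(summarize("MKTAYIAKQR", read_fasta(example)))
```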

    Reference Literature

    Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL Jr, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023 Jan 26.

  • Name: Peptide Structure Generation
    Description: 基于多肽序列生成多肽结构:输入多肽的氨基酸序列,生成线性多肽的二维或者三维结构文件,一般用于小肽结构的创建。 A tool for generating peptide structures based on peptide sequences. Input the amino acid sequence of the peptide, and generate a two-dimensional or three-dimensional structure file of the linear peptide. This tool is generally used for creating small peptide structures.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-02-07 14:55:10
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

    Peptide Structure Generation

    简介

    Peptide Structure Generation模块只需要输入多肽序列字符或者文件,就能生成多肽的三维或者二维结构的SDF文件。

    参数说明

    Peptide Sequence模式

    Peptide Sequence String

    输入氨基酸序列,每行表示一条多肽,支持同时生成多条多肽。

    Generated Structure (.sdf)

    输出文件名称。

    Structure Type

    输出多肽结构类型:3d或者2d。

    Peptide File模式

    Peptide Sequence File

    输入氨基酸序列txt文件,与“Peptide Sequence”相同。
    其他参数与Peptide Sequence模式相同。

    结果说明

    得到多肽三维结构的SDF文件output.sdf。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

    Peptide Structure Generation

    Introduction

    The Peptide Structure Generation module can generate three-dimensional or two-dimensional structures of peptides in SDF format based on input peptide sequences.

    Parameter Description

    Peptide Sequence Mode

    Peptide Sequence String

    Input amino acid sequences, with each line representing a peptide. Multiple peptides can be generated simultaneously.
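
    Before submitting, the one-peptide-per-line input can be checked with a short script. The sketch below assumes that only the 20 standard one-letter amino acid codes are accepted, which the description above does not state; read_peptides is a hypothetical helper name.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: standard residues only


def read_peptides(text: str) -> list[str]:
    """Return one upper-case peptide per non-empty line, rejecting unknown codes."""
    peptides = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        sequence = line.strip().upper()
        if not sequence:
            continue  # skip blank lines
        unknown = sorted(set(sequence) - STANDARD_RESIDUES)
        if unknown:
            raise ValueError(f"line {line_number}: unsupported residue code(s) {unknown}")
        peptides.append(sequence)
    return peptides


print(read_peptides("ACDK\n\nGGHW\n"))  # ['ACDK', 'GGHW']
```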

    Generated Structure (.sdf)

    Output file name.

    Structure Type

    Specify the type of peptide structure to generate: 3D or 2D.

    Peptide File Mode

    Peptide Sequence File

    A text file of amino acid sequences, in the same format as the Peptide Sequence String input.
    Other parameters are the same as in the Peptide Sequence mode.

    Result Description

    The output is an SDF file named output.sdf containing the three-dimensional structure of the peptide.

    Reference Literature

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

  • Name: Protein FEP
    Description: 基于唯信计算自主研发的Protein FEP算法,实现了蛋白稳定性与蛋白复合物亲和力的相对结合自由能计算,能用于判断单点突变对蛋白稳定性、蛋白复合物结合亲和力的影响。 Based on the Protein FEP algorithm developed by WECOMPUT, the module is capable of computing the relative binding free energy of protein stability and protein-protein binding affinity, which can be used to determine the effect of single-point mutations on protein stability and protein complex binding affinity.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-01-23 00:00:00
    Reference:

    Protein FEP

    简介

    Protein FEP是基于唯信计算自主研发的基于蛋白的自由能微扰算法AlphaFEP,实现了更高效、更精确的蛋白稳定性与蛋白复合物亲和力的相对结合自由能计算,能用于判断单点突变对蛋白稳定性、蛋白复合物结合亲和力的影响。

    基准测试

    众多文献报道,FEP方法相比于半经验方法、机器学习方法及GB/PBSA等自由能计算方法,精度更高(例如 http://dx.doi.org/10.1016/j.jmb.2023.168187,见下图,其中PCC代表预测值与SPR实验值的相关性,越高越好)。

    image.png

    唯信开发的AlphaFEP算法媲美已知的FEP方法,例如Schrodinger的FEP+,并大幅超越其他经典的非FEP方法。下图:结合自由能的预测值与实测值的相关性。

    image.png

    AlphaFEP技术特点

    1. 独特的自适应混合采样方法,允许分子构象在不同计算窗口之间跳跃,且通过随机逼近法实现自由能调整,进而保证每个窗口采样数分布最优,可在有限模拟时间内实现更多构象采样,采样效率相较同类方法提升一个数量级,提高了计算的精度和重现性。
    2. 改进的自由能计算MBAR方法:DC-MBAR,以基于多态模拟采样的数据来预测自由能。首先计算任意两个炼金态之间的重叠,并将那些具有足够重叠的状态定义为相邻状态。与传统的MBAR方法(一次使用所有数据计算每个状态的自由能)不同,DC-MBAR专注于预测相邻状态之间的自由能变化。为了准确地估计自由能变化,MBAR方程中包括与两个相邻状态重叠且大于定义阈值的其他状态。在特定阈值下,DC-MBAR预测的自由能非常接近传统MBAR方法计算的自由能。此外,DC-MBAR方案可以减少计算和存储成本。DC-MBAR方法的一个重要特征是线性缩放,这意味着随着状态数的变化,CPU时间是一条直线关系。由于基于对的计算是相互独立且可并行的,因此可以利用HPC群集上所有可访问的CPU内核,这使DC-MBAR策略更加有效。

    参数说明

    Single-point Mutation模式

    PDB File

    蛋白的结构文件,PDB格式

    Mutation

    指定单点突变的位置(如:S52K,S代表野生型氨基酸,52表示该氨基酸在蛋白PDB文件中的索引值,K代表突变后的氨基酸)

    Type

    指定单点突变类型:稳定性(S)或者结合亲和力(B)

    Chain

    指定单点突变所在的链名称

    Multipoint Mutation模式

    PDB File

    蛋白的结构文件,PDB格式

    Mutation List

    多点突变列表文件(.txt),例如:

    L28E,H
    K30T,H
    

    其中,“L”和“K”是WT;“28”和“30”是PDB文件中的残基ID;“E”和“T”是突变;“H”代表残基的链名。
    多点突变只支持结合亲和力(B)类型的计算。

    结果说明

    输出结果文件为result.txt,包含信息如下:

    字段名称 说明
    ligand dG 配体自由能
    complex dG 复合物自由能
    final ddG 最终突变引起的自由能(结合自由能或折叠自由能)变化,单位为kcal/mol,负值表示蛋白更稳定或结合更强,反之亦然。

    参考文献

    Jia X, Ge H, Mei Y. Free energy change estimation: The Divide and Conquer MBAR method. J Comput Chem. 2021; 42: 1204–1211. https://doi.org/10.1002/jcc.26533

    Protein FEP

    Introduction

    Protein FEP is built on AlphaFEP, a protein free energy perturbation algorithm developed in-house by WECOMPUT. It provides more efficient and more accurate relative free energy calculations for protein stability and protein complex binding affinity, and can be used to assess the impact of single-point mutations on both.

    Benchmark Testing

    Numerous published studies report that FEP methods are more accurate than semi-empirical methods, machine learning methods, and other free energy calculation methods such as GB/PBSA (for example, http://dx.doi.org/10.1016/j.jmb.2023.168187; see the figure below, where PCC is the correlation between predicted values and SPR experimental values, and higher is better).

    image.png

    The AlphaFEP algorithm developed by WECOMPUT is on par with established FEP methods such as Schrödinger’s FEP+ and substantially outperforms classical non-FEP methods. The figure below shows the correlation between predicted and measured binding free energies.

    image.png

    AlphaFEP Technical Features

    1. A unique adaptive hybrid sampling method lets molecular conformations jump between calculation windows, with the free energy adjusted by stochastic approximation so that the number of samples is distributed optimally across windows. More conformations are sampled within a limited simulation time: sampling efficiency improves by an order of magnitude over comparable methods, which improves the precision and reproducibility of the calculation.
    2. DC-MBAR, an improved variant of the MBAR free energy estimator that predicts free energies from multi-state simulation data. It first calculates the overlap between every pair of alchemical states and defines pairs with sufficient overlap as neighboring states. Unlike traditional MBAR, which uses all data at once to compute the free energy of every state, DC-MBAR focuses on the free energy change between neighboring states. To estimate that change accurately, the MBAR equation for a pair also includes any other state whose overlap with both neighboring states exceeds a defined threshold. At suitable thresholds, the free energies predicted by DC-MBAR are very close to those from traditional MBAR, while computational and storage costs are reduced. A key feature of DC-MBAR is linear scaling: CPU time grows linearly with the number of states. Because the pair-based calculations are independent of one another and can run in parallel, all accessible CPU cores of an HPC cluster can be used, which makes the DC-MBAR strategy more efficient still.
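
    The state-selection step described for DC-MBAR can be sketched as follows. This is not the AlphaFEP implementation: it only shows, for a given overlap matrix, which pairs would count as neighbors and which extra states would enter each pair's MBAR estimate. The overlap values, both thresholds, and the function names are illustrative assumptions, and the free energy estimation itself is left out.

```python
def neighbor_pairs(overlap, neighbor_threshold):
    """Pairs of states (i, j), i < j, whose mutual overlap is at least the threshold."""
    n = len(overlap)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if overlap[i][j] >= neighbor_threshold
    ]


def states_for_pair(overlap, pair, include_threshold):
    """The pair itself plus every other state overlapping BOTH members above the threshold."""
    i, j = pair
    extra = [
        k
        for k in range(len(overlap))
        if k not in pair
        and overlap[k][i] > include_threshold
        and overlap[k][j] > include_threshold
    ]
    return sorted([i, j] + extra)


# Hypothetical symmetric overlap matrix for four alchemical states.
overlap = [
    [1.00, 0.40, 0.15, 0.01],
    [0.40, 1.00, 0.35, 0.12],
    [0.15, 0.35, 1.00, 0.45],
    [0.01, 0.12, 0.45, 1.00],
]

for pair in neighbor_pairs(overlap, neighbor_threshold=0.3):
    print(pair, states_for_pair(overlap, pair, include_threshold=0.1))
```

    Because each pair is handled from its own small subset of states, the per-pair estimates can be computed independently of one another, which is the property the parallelization argument above relies on.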

    Parameter Description

    Single-point Mutation Mode

    PDB File

    Structure file of the protein in PDB format.

    Mutation

    Specify the position of the single-point mutation (e.g., S52K, where S represents the wild-type amino acid, 52 is the index of the amino acid in the protein PDB file, and K represents the mutated amino acid).

    Type

    Specify the type of single-point mutation: stability (S) or binding affinity (B).

    Chain

    Specify the chain where the single-point mutation occurs.

    Multipoint Mutation Mode

    PDB File

    Structure file of the protein in PDB format.

    Mutation List

    File containing a list of multipoint mutations (.txt), for example:

    L28E,H
    K30T,H
    

    Here, “L” and “K” represent wild-type residues, “28” and “30” are residue IDs in the PDB file, “E” and “T” represent mutations, and “H” denotes the chain name of the residue.
    Multipoint mutations are only supported for binding affinity (B) type calculations.
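
    A mutation list in this format can be validated with a few lines of code. The sketch below assumes one-letter amino acid codes and single-character chain names, as in the example; parse_mutation_list is a hypothetical helper, not part of the module.

```python
import re

# wild-type residue, residue ID, mutant residue, then ',' and the chain name (e.g. L28E,H)
LINE_RE = re.compile(r"^([A-Z])(-?\d+)([A-Z]),(\w)$")


def parse_mutation_list(text: str) -> list[dict]:
    """Parse lines such as 'L28E,H' into dictionaries, skipping blank lines."""
    mutations = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"malformed mutation line: {line!r}")
        wild_type, residue_id, mutant, chain = match.groups()
        mutations.append(
            {
                "wild_type": wild_type,
                "residue_id": int(residue_id),
                "mutant": mutant,
                "chain": chain,
            }
        )
    return mutations


print(parse_mutation_list("L28E,H\nK30T,H\n"))
```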

    Result Description

    The output result file is named result.txt and includes the following information:

    Field Name Description
    ligand dG Ligand free energy
    complex dG Complex free energy
    final ddG Final change in free energy (binding or folding) caused by the mutation, in kcal/mol. A negative value indicates that the protein is more stable or has stronger binding affinity, and vice versa.

    Reference Literature

    Jia X, Ge H, Mei Y. Free energy change estimation: The Divide and Conquer MBAR method. J Comput Chem. 2021; 42: 1204–1211. https://doi.org/10.1002/jcc.26533

  • Name: Antibody Sequence Prediction (AbLang)
    Description: 根据OAS数据库中的抗体序列训练的语言模型,可预测抗体序列中指定位点可能的氨基酸,或者修复抗体序列数据中缺失残基,在抗体序列预测中优于通用蛋白质语言模型(如Meta开发的ESM-1b模型)。 Language models trained on the antibody sequences in the OAS database have the power of predicting or restoring missing residues in antibody sequences.
    Tags: undefined
    Author: AbLang
    Release: 2023-01-16 00:00:00
    Reference: Bioinform Adv. 2022 Jun 17;2(1):vbac046

    Antibody Sequence Prediction

    简介

    根据OAS数据库中的抗体序列训练的语言模型,可预测抗体序列中指定位点可能的氨基酸,或者修复抗体序列数据中缺失残基,在抗体序列预测中优于通用蛋白质语言模型如Meta开发的ESM-1b模型。

    输入参数

    Fasta File

    抗体序列文件,FASTA格式。使用*表示需要修复区域,支持多条序列。抗体序列文件如下所示:

    >H
    EV*LVESG*GLVQPGKSLRLSCVASGFTFSGYGMH
    

    Chain Type

    指定抗体序列是重链还是轻链,值为“H”或“L”。

    结果说明

    预测概率最高的一条抗体序列,其文件为result.fasta。

    参考文献

    Olsen TH, Moal IH, Deane CM. AbLang: an antibody language model for completing antibody sequences. Bioinform Adv. 2022 Jun 17;2(1):vbac046.

    Antibody Sequence Prediction

    Introduction

    The Antibody Sequence Prediction module utilizes a language model trained on antibody sequences from the OAS database to predict the likely amino acids at specified positions in antibody sequences or to fill in missing residues in antibody sequence data. This model outperforms general protein language models like the ESM-1b model developed by Meta in antibody sequence prediction.

    Input Parameters

    Fasta File

    Antibody sequence file in FASTA format. Use “*” to mark the positions to be filled in. Multiple sequences are supported. An example of an antibody sequence file is shown below:

    >H
    EV*LVESG*GLVQPGKSLRLSCVASGFTFSGYGMH
    

    Chain Type

    Specify whether the antibody sequence is heavy chain (“H”) or light chain (“L”).
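
    To make the “*” convention of the Fasta File input concrete, the sketch below locates the masked positions in the example sequence and shows how residues would be substituted back in. The residues passed to fill_masks are placeholders chosen for illustration, not output of the model, and both function names are hypothetical.

```python
def masked_positions(sequence: str, mask: str = "*") -> list[int]:
    """1-based positions of the mask character in the sequence."""
    return [index for index, char in enumerate(sequence, start=1) if char == mask]


def fill_masks(sequence: str, residues: list[str], mask: str = "*") -> str:
    """Replace each mask character, left to right, with the given residues."""
    if sequence.count(mask) != len(residues):
        raise ValueError("number of residues must equal number of masked positions")
    replacements = iter(residues)
    return "".join(next(replacements) if char == mask else char for char in sequence)


heavy = "EV*LVESG*GLVQPGKSLRLSCVASGFTFSGYGMH"
print(masked_positions(heavy))       # [3, 9]
print(fill_masks(heavy, ["Q", "G"]))  # placeholder residues, for illustration only
```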

    Result Description

    The predicted antibody sequence with the highest probability is saved in the file result.fasta.

    Reference Literature

    Olsen TH, Moal IH, Deane CM. AbLang: an antibody language model for completing antibody sequences. Bioinform Adv. 2022 Jun 17;2(1):vbac046.

  • Name: Structure Clustering
    Description: Structure Clustering是基于分子指纹的小分子结构聚类模块,其采用的聚类方法有Butina或任何其他可用的分层聚类方法。 Structure Clustering is a module that clusters small molecules based on a variety of 2D fingerprints, using Butina or hierarchical clustering methods.
    Tags: undefined
    Author: Butina, D
    Release: 2021-10-28 10:15:43
    Reference: Butina D. Unsupervised database clustering based on Daylight's fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. J. Chem. Inf. Model. 1999, 39, 747-750.

    Structure Clustering

    简介

    Structure Clustering是基于分子指纹的小分子结构聚类模块,其采用的聚类方法有Butina或任何其他可用的分层聚类方法。

    参数说明

    Input File

    小分子的结构文件,支持SDF、SMILES格式。

    Output File

    输出文件名称。

    Clustering Numbers

    在分层聚类过程中生成的聚类的数目。

    Similarity Cutoff

    Butina聚类算法中使用的相似度截断值。

    Clustering Method

    聚类算法,包括如下:

    • Butina
    • Centroid
    • CLink
    • Gower
    • McQuitty
    • SLink
    • UPGMA
    • Ward

    Fingerprints

    用于计算相似度或者距离的分子指纹类型,包括如下:

    • AtomPairs
    • MACCS166Keys
    • Morgan
    • MorganFeatures
    • PathLength
    • TopologicalTorsions

    Fingerprints Type

    分子指纹方式,包括如下:

    • IntVect
    • BitVect
    • auto

    Similarity Metric

    相似度计算指标,包括如下:

    • Tanimoto
    • Cosine
    • Dice

    结果说明

    在原有SDF文件中加入聚类编号,得到新的SDF文件output.sdf。

    参考文献

    Butina D. Unsupervised database clustering based on Daylight’s fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. J. Chem. Inf. Model. 1999, 39, 747-750.

    Structure Clustering

    Introduction

    Structure Clustering is a module for clustering small molecule structures based on molecular fingerprints. It employs clustering methods such as Butina or any other available hierarchical clustering method.

    Parameter Description

    Input File

    The structure file of the small molecules. Supported formats are SDF and SMILES.

    Output File

    Name of the output file.

    Clustering Numbers

    Number of clusters generated during the hierarchical clustering process.

    Similarity Cutoff

    Similarity cutoff value used in the Butina clustering algorithm.
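
    The role of the similarity cutoff in Butina clustering can be seen in a small pure-Python sketch. It is not the module's implementation: fingerprints are represented as sets of on-bit indices, similarity is Tanimoto, and the toy fingerprints and the cutoff of 0.5 are illustrative assumptions. Two molecules are neighbors when their similarity is at least the cutoff; the molecule with the most neighbors becomes the first centroid and takes its unassigned neighbors with it.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina(fingerprints: list, cutoff: float) -> list:
    """Cluster by the Butina procedure; each cluster lists its centroid first."""
    n = len(fingerprints)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Visit molecules from most to fewest neighbors (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbors[i] - assigned)
        clusters.append(members)
        assigned.update(members)
    return clusters


toy = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {10, 11, 12}, {10, 11, 13}, {20}]
print(butina(toy, cutoff=0.5))  # [[0, 1, 2], [3, 4], [5]]
```

    Raising the cutoff makes the neighbor criterion stricter, so clusters become smaller and more numerous.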

    Clustering Method

    Clustering algorithms available include:

    • Butina
    • Centroid
    • CLink
    • Gower
    • McQuitty
    • SLink
    • UPGMA
    • Ward

    Fingerprints

    Types of molecular fingerprints used for similarity or distance calculation include:

    • AtomPairs
    • MACCS166Keys
    • Morgan
    • MorganFeatures
    • PathLength
    • TopologicalTorsions

    Fingerprints Type

    Types of molecular fingerprint representations include:

    • IntVect
    • BitVect
    • auto

    Similarity Metric

    Similarity metrics for calculation include:

    • Tanimoto
    • Cosine
    • Dice
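
    For bit-vector fingerprints, the three metrics listed above reduce to simple counts of shared and total on-bits. The sketch below uses Python sets of on-bit indices as a stand-in for fingerprints; the function names and example values are illustrative only.

```python
import math


def tanimoto(a: set, b: set) -> float:
    """|A and B| / |A or B|"""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a: set, b: set) -> float:
    """2 |A and B| / (|A| + |B|)"""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a: set, b: set) -> float:
    """|A and B| / sqrt(|A| |B|)"""
    denominator = math.sqrt(len(a) * len(b))
    return len(a & b) / denominator if denominator else 0.0


fp_a, fp_b = {1, 2, 3, 4}, {3, 4, 5, 6}
print(tanimoto(fp_a, fp_b), dice(fp_a, fp_b), cosine(fp_a, fp_b))
```

    For the same pair of fingerprints, Tanimoto is never larger than Dice, so a cutoff tuned for one metric does not carry over unchanged to another.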

    Result Description

    Cluster numbers are added to the records of the original SDF file, and the result is written to a new SDF file named output.sdf.

    Reference Literature

    Butina D. Unsupervised database clustering based on Daylight’s fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. J. Chem. Inf. Model. 1999, 39, 747-750.

  • Name: Sequence Clustering
    Description: Sequence Clustering使用DBSCAN算法对多序列比对(MSA)后的结果进行聚类分析,将多序列分为多个cluster类别,并通过可视化模块UMAP进行序列的embedding,并获取二维可视化信息。 Sequence clustering uses the DBSCAN algorithm to perform cluster analysis on the results of multiple sequence alignment (MSA), dividing multiple sequences into multiple cluster categories, and using the visualization module UMAP to embed sequences and obtain two-dimensional visualization information.
    Tags: undefined
    Author: Hannah K. Wayment-Steele
    Release: 2023-01-10 00:00:00
    Reference: Hannah K. Wayment-Steele, Sergey Ovchinnikov, Lucy Colwell, Dorothee Kern. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. bioRxiv 2022.10.17.512570.

    Sequence Clustering

    简介

    Sequence Clustering使用DBSCAN算法对多序列比对(MSA)后的结果进行聚类分析,将多序列分为多个cluster类别,并通过可视化模块UMAP进行序列的embedding,并获取二维可视化信息。

    参数说明

    Input File

    需要聚类序列的多序列比对结果文件(fasta格式),可以由Multiple Sequence Alignment模块产生的alignmnet.fasta。

    结果说明

    输出结果文件为res_clustering_assignments.tsv,包含信息如下:

    字段名称 说明
    SequenceName 序列名称
    sequence 序列
    frac_gaps 后续序列与参考序列(第一条序列)氨基酸差异(填充‘-’)的比例
    dbscan_label 聚类后的类别标签(如果值为-1表示未分配类别)
    UMAP 1,UMAP 2 二维可视化坐标信息(UMAP 1,UMAP 2对应X,Y坐标)

    参考文献

    Hannah K. Wayment-Steele, Sergey Ovchinnikov, Lucy Colwell, Dorothee Kern. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. bioRxiv 2022.10.17.512570.

    Sequence Clustering

    Introduction

    Sequence Clustering uses the DBSCAN algorithm to perform cluster analysis on the results of multiple sequence alignment (MSA), dividing multiple sequences into different cluster categories. It utilizes the UMAP visualization module to embed sequences and obtain two-dimensional visualization information.

    Parameter Description

    Input File

    Multiple sequence alignment file (FASTA format) containing the sequences to cluster, for example the alignmnet.fasta file produced by the Multiple Sequence Alignment module.

    Result Description

    The output result file is res_clustering_assignments.tsv, which includes the following information:

    Field Name Description
    SequenceName Name of the sequence
    sequence The sequence itself
    frac_gaps Proportion of gaps (‘-’) in the sequence compared to the reference sequence (the first sequence)
    dbscan_label Cluster label after clustering (if the value is -1, it means the sequence is unassigned to any cluster)
    UMAP 1, UMAP 2 Two-dimensional visualization coordinate information (UMAP 1 corresponds to the X-coordinate and UMAP 2 corresponds to the Y-coordinate)
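A minimal sketch of post-processing the result file, assuming it is tab-separated with a header row that uses the field names listed above. The function name and the demo rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(tsv_text):
    """Map each dbscan_label to the names of the sequences assigned to it."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    clusters = defaultdict(list)
    for row in reader:
        clusters[int(row["dbscan_label"])].append(row["SequenceName"])
    return dict(clusters)


# Hypothetical three-row file; label -1 marks an unassigned sequence.
demo = (
    "SequenceName\tsequence\tfrac_gaps\tdbscan_label\tUMAP 1\tUMAP 2\n"
    "seq1\tMKTL\t0.0\t0\t1.2\t3.4\n"
    "seq2\tMK-L\t0.25\t0\t1.3\t3.3\n"
    "seq3\tM--L\t0.5\t-1\t7.8\t0.1\n"
)
print(group_by_cluster(demo))
```

To process a real run, read res_clustering_assignments.tsv into a string (or pass the open file to csv.DictReader directly) and skip the -1 group if only assigned clusters are of interest.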

    Reference Literature

    Hannah K. Wayment-Steele, Sergey Ovchinnikov, Lucy Colwell, Dorothee Kern. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. bioRxiv 2022.10.17.512570.

  • Name: Extract Sequence from Structure (PDB2FASTA)
    Description: Extract Sequence from Structure (PDB2FASTA)模块是从蛋白的PDB文件中将序列提取出来保存为FASTA文件。常规氨基酸序列用单字母表示,其他类型都标注为X。 Extracts the protein sequences in a PDB file to FASTA. Amino acids are represented by their one-letter code while all others are represented by 'X'.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-12-09 00:00:00
    Reference:

    Extract Sequence from Structure (PDB2FASTA)

    简介

    Extract Sequence from Structure (PDB2FASTA)模块是从蛋白的PDB文件中将序列提取出来保存为FASTA文件。常规氨基酸序列用单字母表示,其他类型都标注为X。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式。

    Chain Name

    将指定链的序列转存为fasta格式,默认all代表将所有链的序列输出。

    Output Sequence

    输出序列文件名称,FASTA格式。

    结果说明

    得到蛋白的序列文件,默认为seq.fasta。

    Extract Sequence from Structure (PDB2FASTA)

    Introduction

    The Extract Sequence from Structure (PDB2FASTA) module extracts the sequences from a protein PDB file and saves them as a FASTA file. Standard amino acids are written with their one-letter codes; all other residue types are written as X.

    Parameter Description

    Structure PDB File

    The protein’s structure file in PDB format.

    Chain Name

    Chain whose sequence is written to the FASTA file. The default, all, outputs the sequences of every chain.

    Output Sequence

    Name of the output sequence file in FASTA format.

    Result Description

    The protein sequence file, named seq.fasta by default.
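The extraction described above can be sketched as follows. This is not the module's code: it assumes fixed-column PDB ATOM records, counts each residue once by chain and residue number, and uses hypothetical function names and FASTA header names.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def pdb_to_sequences(pdb_text):
    """Return {chain: one-letter sequence} built from ATOM records."""
    residues = {}
    seen = set()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        res_name = line[17:20].strip()       # columns 18-20: residue name
        chain = line[21:22].strip() or "_"   # column 22: chain identifier
        res_id = (chain, line[22:27])        # columns 23-27: number + insertion code
        if res_id in seen:                   # count each residue only once
            continue
        seen.add(res_id)
        residues.setdefault(chain, []).append(THREE_TO_ONE.get(res_name, "X"))
    return {chain: "".join(codes) for chain, codes in residues.items()}


def to_fasta(sequences):
    """Format the per-chain sequences as FASTA text (header names are illustrative)."""
    return "".join(f">chain_{chain}\n{seq}\n" for chain, seq in sequences.items())
```

Any residue name missing from the 20-entry table falls back to X, which mirrors the behaviour described in the introduction.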

  • Name: 3-letter AA Conversion
    Description: 把三字母表示的氨基酸转换为单字母表示。"ASP ILE VAL ASN"转换为 "DIVN". Convert 3-letter amino acid codes to 1-letter codes. E.g., "ASP ILE VAL ASN" is converted to "DIVN".
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-11-18 00:00:00
    Reference:

    3-letter AA Conversion

    简介

    把三字母表示的氨基酸转换为单字母表示。"ASP ILE VAL ASN"转换为 “DIVN”.

    参数说明

    File模式

    Input File

    包含三字符氨基酸序列的文本文件

    Output File

    指定输出序列文件的名称,FASTA格式

    Text模式

    Input String

    三字符代表的氨基酸序列,例如:
    ASP ILE VAL ASN

    Output File

    指定输出序列文件的名称,FASTA格式

    结果说明

    三字母表示的氨基酸转换为单字母,并以序列FASTA格式输出sequence.fasta。

    3-letter AA Conversion

    Introduction

    Converts three-letter amino acid codes to one-letter codes. For example, “ASP ILE VAL ASN” is converted to “DIVN”.

    Parameter Description

    File Mode

    Input File

    Text file containing sequences of three-character amino acids.

    Output File

    Specify the name of the output sequence file in FASTA format.

    Text Mode

    Input String

    Sequence of three-character amino acids, for example:
    ASP ILE VAL ASN

    Output File

    Specify the name of the output sequence file in FASTA format.

    Result Description

    Converts three-letter amino acid representations to single-letter representations and outputs the sequence in FASTA format as sequence.fasta.
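The conversion amounts to a table lookup per whitespace-separated token. The sketch below is illustrative only: the function name is hypothetical, and mapping unrecognised codes to X is an assumption, since this document does not say how the module treats them.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(text):
    """Convert whitespace-separated three-letter codes to a one-letter string."""
    return "".join(THREE_TO_ONE.get(token.upper(), "X") for token in text.split())


print(three_to_one("ASP ILE VAL ASN"))  # DIVN
```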

  • Name: Sequence Translation
    Description: Sequence Translation是DNA序列转换成RNA序列和蛋白序列的工具。 Sequence Translation is a tool for Translating DNA sequences into RNA and protein sequences.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-11-18 17:19:28
    Reference:

    Sequence Translation

    简介

    Sequence Translation是DNA序列转换成RNA序列和蛋白序列的工具。

    参数说明

    DNA Sequence File

    DNA序列文件,FASTA格式

    DNA Sequence String

    DNA序列,例如:

    TTATGACGTTATTCTACTTTGATTGTGCGAGACAATGCTACCTTACCGGTCGGAACTCGATCGGTTGAACTCTATCACGCCTGGTCTTCGAAGTTAGCAC
    

    结果说明

    输出结果包括:

    输出文件名称 说明
    prepared_dna.fasta 转换成DNA的FASTA文件
    protein.fasta 转换成蛋白的FASTA文件
    mrna.fasta 转换成mRNA的FASTA文件

    Sequence Translation

    Introduction

    Sequence Translation is a tool for converting DNA sequences into RNA sequences and protein sequences.

    Parameter Description

    DNA Sequence File

    DNA sequence file in FASTA format.

    DNA Sequence String

    DNA sequence, for example:

    TTATGACGTTATTCTACTTTGATTGTGCGAGACAATGCTACCTTACCGGTCGGAACTCGATCGGTTGAACTCTATCACGCCTGGTCTTCGAAGTTAGCAC
    

    Result Description

    The output includes:

    Output File Name Description
    prepared_dna.fasta Prepared DNA sequence in FASTA format
    protein.fasta Translated protein sequence in FASTA format
    mrna.fasta Transcribed mRNA sequence in FASTA format
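A minimal sketch of the two conversions, using the standard genetic code. It assumes the input is the coding strand read in frame 0 and that translation stops at the first stop codon; the module's actual frame handling is not described here, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Write the coding-strand DNA as mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate frame 0 of a DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[start:start + 3], "X")  # X for ambiguous codons
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(transcribe("ATGGCCTAA"), translate("ATGGCCTAA"))
```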
  • Name: Protein Structure Prediction (ESMFold)
    Description: ESMFold是Meta公司开发的蛋白结构预测模型,使用大型语言模型从主序列直接推断结构,预测的速度比AlphaFold方法快60倍,同时能够保持分辨率和准确性。 ESMFold is a protein structure prediction model developed by Meta company, which uses a large language model to directly infer structure from the primary sequence. It predicts structures 60 times faster than AlphaFold while maintaining resolution and accuracy.
    Tags: undefined
    Author: Meta
    Release: 2022-11-11 00:00:00
    Reference: Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130.

    Protein Structure Prediction (ESMFold)

    简介

    ESMFold使用大型语言模型从主序列直接推断结构,预测的速度比最先进的方法快60倍,同时能够保持分辨率和准确性。AlphaFold2和其他替代方法使用多序列比对(MSA)和类似蛋白质的模板来实现原子分辨率结构预测的最佳性能获突破性成功;而ESMFold通过利用语言模型的内部表征,只用一个序列作为输入就能生成结构预测。ESMFold与AlphaFold2和RoseTTAFold具有相似的准确性,但ESMFold在探索宏基因组蛋白质的结构空间方面速度更快。

    参数说明

    ESMFold Batch Mode模式

    Fasta File

    蛋白序列文件,FASTA格式,支持多条序列。
    预测复合物,多条链通过英文冒号(:)相连,举例:

    >complex
    MGITQTPYKVSISGLYLRARV:QVQLQQSGAELARPGASVKMSCKASGYTFTRYTMHWVKQR
    

    Max tokens per batch

    每个GPU前向传递中的最大令牌数。这将使较短的序列分组进行批量预测。如果在短序列上发生内存不足问题,降低此值可以有所帮助。

    Chunk Size

    较低的值将导致更低的内存使用,但会降低速度。推荐值:128、64、32。

    ESMFold Single Mode模式

    Fasta File

    蛋白序列文件,FASTA格式,多条序列时默认为复合物预测。

    结果说明

    输出结果包括:

    输出文件名称 说明
    seq1.pdb 默认输出第一条序列的预测结构。
    result.tar.gz 针对含有多条序列的fasta文件,压缩包中含所有的序列的预测结构。
    score.csv 预测结构的打分,包含结构可靠性指标pLDDT与pTM,pLDDT数值范围在0-100,数值越大表示结构可靠性越高,pTM数值范围在0-1,数值越大表示结构可靠性越高
    stdout.txt 模块的标准输出信息。

    参考文献

    Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130.

    Protein Structure Prediction (ESMFold)

    Introduction

    ESMFold uses a large language model to infer structure directly from the primary sequence, with predictions up to 60 times faster than state-of-the-art methods while maintaining resolution and accuracy. AlphaFold2 and other methods achieved breakthrough success in atomic-resolution structure prediction by using multiple sequence alignments (MSAs) and templates of similar proteins; ESMFold instead leverages the internal representations of a language model and generates a structure prediction from a single input sequence. ESMFold reaches accuracy similar to AlphaFold2 and RoseTTAFold, and its speed makes it practical for exploring the structural space of metagenomic proteins.

    Parameter Description

    ESMFold Batch Mode

    Fasta File

    Protein sequence file in FASTA format, supporting multiple sequences.
    For predicting complexes, multiple chains are connected by a colon (:) as shown below:

    >complex
    MGITQTPYKVSISGLYLRARV:QVQLQQSGAELARPGASVKMSCKASGYTFTRYTMHWVKQR
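As a small illustration of the colon convention, the helpers below build and split a complex record. The function names are hypothetical; only the colon-joined format comes from the description above.

```python
def complex_record(name, chains):
    """Build one FASTA record whose chains are joined with ':'."""
    return f">{name}\n{':'.join(chains)}\n"


def split_chains(sequence_line):
    """Recover the individual chain sequences from a colon-joined line."""
    return sequence_line.strip().split(":")


record = complex_record("complex", ["MGITQTPYKVSISGLYLRARV", "QVQLQQSGAELARPGASVKM"])
print(record)
print(split_chains(record.splitlines()[1]))
```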
    

    Max tokens per batch

    Maximum number of tokens in each GPU forward pass. Shorter sequences are grouped together for batched prediction up to this limit. Lowering the value can help if out-of-memory errors occur with short sequences.

    Chunk Size

    A lower value leads to lower memory usage but decreases speed. Recommended values: 128, 64, 32.

    ESMFold Single Mode

    Fasta File

    Protein sequence file in FASTA format. When the file contains multiple sequences, they are predicted together as a complex by default.

    Result Description

    The output includes:

    Output File Name Description
    seq1.pdb Default output of the predicted structure for the first sequence.
    result.tar.gz For fasta files containing multiple sequences, the compressed file includes predicted structures for all sequences.
    score.csv Scores for the predicted structures, including the reliability indicators pLDDT and pTM. pLDDT ranges from 0 to 100 and pTM from 0 to 1; for both, a higher value indicates a more reliable structure.
    stdout.txt Standard output.

    Reference

    Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130.

  • Name: Retrosynthetic Prediction (AiZynthFinder)
    Description: Retrosynthetic Prediction (AiZynthFinder)是阿斯利康开发的针对小分子的逆反应合成路线预测算法。AiZynthFinder算法基于蒙特卡罗树搜索最终得到可被购买的小分子,用于合成输出分子。树搜索策略采用神经网络方法对已知的反应库进行训练得到。 Retrosynthetic Prediction (AiZynthFinder) is a tool for retrosynthetic planning. The algorithm is based on a Monte Carlo tree search that recursively breaks down a molecule to purchasable precursors. The tree search is guided by a policy that suggests possible precursors by utilizing a neural network trained on a library of known reaction templates.
    Tags: undefined
    Author: Samuel Genheden
    Release: 2022-10-27 00:00:00
    Reference: Genheden S, Thakkar A, Chadimová V, et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J Cheminform. 2020 Nov 17;12(1):70.

    Retrosynthetic Prediction (AiZynthFinder)

    简介

    Retrosynthetic Prediction (AiZynthFinder)是阿斯利康开发的针对小分子的逆反应合成路线预测算法。AiZynthFinder算法基于蒙特卡罗树搜索最终得到可被购买的小分子,用于合成输出分子。树搜索策略采用神经网络方法对已知的反应库进行训练得到。

    参数说明

    Smiles String

    目标小分子的结构文件,SMILES格式,如:
    Cc1cccc(c1N(CC(=O)Nc2ccc(cc2)c3ncon3)C(=O)C4CCS(=O)(=O)CC4)C

    结果说明

    得到逆合成分析的路线图route000.png-route010.png。

    参考文献

    Genheden S, Thakkar A, Chadimová V, et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J Cheminform. 2020 Nov 17;12(1):70.
    http://www.github.com/MolecularAI/aizynthfinder

    Retrosynthetic Prediction (AiZynthFinder)

    Introduction

    AiZynthFinder is a tool for retrosynthetic planning. The algorithm is based on a Monte Carlo tree search that recursively breaks down a molecule to purchasable precursors. The tree search is guided by a policy that suggests possible precursors by utilizing a neural network trained on a library of known reaction templates.

    Parameter

    Smiles String

    Target molecule in SMILES format, for example:
    Cc1cccc(c1N(CC(=O)Nc2ccc(cc2)c3ncon3)C(=O)C4CCS(=O)(=O)CC4)C

    Result

    Retrosynthetic route diagrams, written as route000.png to route010.png.

    Reference

    Genheden S, Thakkar A, Chadimová V, et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J Cheminform. 2020 Nov 17;12(1):70.
    http://www.github.com/MolecularAI/aizynthfinder

  • Name: Protein Design (ABACUS-R)
    Description: ABACUS-R用于设计能自主地折叠到给定目标骨架的氨基酸序列。该方法通过使用多任务学习训练的编码器-解码器网络,从中心残基的三维局部环境中预测其侧链类型。 若使用该模块发表论文请引用:Liu, Y., Zhang, L., Wang, W. et al. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci. 2022 June 21; 2:451–462. ABACUS-R is used to design amino acid sequences that can autonomously fold into a given target backbone. This method utilizes an encoder-decoder network trained with multi-task learning to predict the side-chain type of a central residue based on its three-dimensional local environment. Please cite if you find this module helpful: Liu, Y., Zhang, L., Wang, W. et al. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci. 2022 June 21; 2:451–462.
    Tags: undefined
    Author: Haiyan Liu's group, University of Science and Technology of China (中科大刘海燕课题组)
    Release: 2022-10-14 00:00:00
    Reference: Liu, Y., Zhang, L., Wang, W. et al. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci. 2022 June 21; 2:451–462.

    Protein Design (ABACUS-R)

    简介

    ABACUS-R是一种基于深度学习的方法,用于设计能自主地折叠到给定目标骨架的氨基酸序列。该方法通过使用多任务学习策略训练的编码器-解码器网络,从其三维局部环境预测中心残基的侧链类型。该网络编码的环境特征包括周围残基的侧链类型,但不包括构象周围残基的侧链构象。这消除了重建和优化侧链结构的需要,并大大简化了序列设计过程。广泛的湿实验结果,包括通过X射线晶体学解决的五个结构,表明ABACUS-R比最先进的基于能量函数的序列设计方法,在成功率和精度上有很大的优势。

    参数说明

    PDB File

    蛋白结构文件,PDB格式。蛋白结构不能超过300个氨基酸。

    Chain

    指定需要设计的链,只支持单链。

    Number of Designs

    输出设计的序列数量,最大值100。

    SPFile

    限制文件,文本格式,包含指定位点的氨基酸信息,例如:

    1,A
    2,A
    

    表示A链的第1和2位的氨基酸在设计中不变。

    结果说明

    输出结果文件为seqs_design.fasta,里面包含最终设计的序列。

    参考文献

    Liu, Y., Zhang, L., Wang, W. et al. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci. 2022 June 21; 2:451–462.

    Protein Design (ABACUS-R)

    Introduction

    ABACUS-R is a deep-learning-based method for designing amino acid sequences that can autonomously fold into a given target backbone. It uses an encoder-decoder network trained with multi-task learning to predict the side-chain type of a central residue from its three-dimensional local environment. The environmental features encoded by the network include the side-chain types of the surrounding residues but not their side-chain conformations, which removes the need to rebuild and optimize side-chain structures and greatly simplifies the sequence design process. Extensive wet-lab results, including five structures solved by X-ray crystallography, indicate that ABACUS-R has significant advantages in success rate and accuracy over state-of-the-art energy-function-based sequence design methods.

    Parameter

    PDB File

    Protein structure file in PDB format. The protein structure must not exceed 300 amino acids.

    Chain

    Chain to design. Only a single chain is supported.

    Number of Designs

    Number of designed sequences to output (maximum 100).

    SP File

    Constraints file, in text format, containing amino acid information at specified sites, for example:

    1,A
    2,A
    

    Indicates that the amino acids at positions 1 and 2 of the A chain are not changed in the design.
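A sketch of reading such a file, following the interpretation given above (one "position,chain" pair per line). The function name is hypothetical, and the module's own parser may accept or reject other variants.

```python
def parse_sp_file(text):
    """Return a list of (position, chain) pairs from 'position,chain' lines."""
    fixed = []
    for line in text.splitlines():
        line = line.strip()
        if not line:            # ignore blank lines
            continue
        position, chain = line.split(",")
        fixed.append((int(position), chain.strip()))
    return fixed


print(parse_sp_file("1,A\n2,A\n"))
```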

    Result

    The output file is seqs_design.fasta, which contains the final designed sequences.

    Reference

    Liu, Y., Zhang, L., Wang, W. et al. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci. 2022 June 21; 2:451–462.

  • Name: Antibody Fv Structure Prediction (IgFold)
    Description: IgFold是一种基于深度学习的快速预测抗体Fv结构的方法。IgFold由一个预先训练的语言模型和直接预测骨架原子坐标的图网络组成,该语言模型训练了558M个天然抗体序列。 IgFold在显著更短的时间内(不到一分钟)预测出与其他方法(包括AlphaFold)相似或更好质量的结构。 注意:输入的抗体Fv区抗体序列名称中必须包含重链标识符:H,Heavy,.H;轻链标识符:L,Light,.L。 已知问题:部分预测结构会比输入序列缺失个别氨基酸,请留意! IgFold, a fast deep learning method for antibody structure prediction. IgFold consists of a pre-trained language model trained on 558M natural antibody sequences followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under one minute).
    Tags: undefined
    Author: Ruffolo JA
    Release: 2022-10-14 00:00:00
    Reference: Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun. 2023 Apr 25;14(1):2389.

    Antibody Structure Prediction (IgFold)

    简介

    IgFold是一种基于深度学习的快速预测抗体Fv结构的方法。IgFold由一个预先训练的语言模型和直接预测骨架原子坐标的图网络组成,该语言模型训练了558M个天然抗体序列。IgFold在显著更短的时间内(不到一分钟)预测出与其他方法(包括AlphaFold)相似或更好质量的结构。注:该模块只适合预测可变区构象,如果是全长抗体或者包含多个可变区的抗体等情况,需要使用Protein Structure Prediction (AlphaFold2.3.2)或者Protein Structure Prediction (ESMFold)进行结构预测。

    参数说明

    Fv Sequence (fasta)

    输入抗体Fv区重链和或轻链序列,其中抗体序列名称中必须包含重链标识符:H,Heavy,.H;轻链标识符:L,Light,.L。例如:

    >antibody.H
    XXXXXX
    >antibody.L
    XXXXXX
    

    结果说明

    输出文件为预测抗体的结构文件antibody_pred.pdb。
    【已知问题】部分预测结构会比输入序列缺失个别氨基酸,请留意!

    参考文献

    Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun. 2023 Apr 25;14(1):2389.

    Antibody Structure Prediction (IgFold)

    Introduction

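    IgFold is a fast deep learning method for antibody Fv structure prediction. It consists of a pre-trained language model, trained on 558M natural antibody sequences, followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under one minute). Note: this module is suited only to predicting variable-region conformations. For full-length antibodies or antibodies containing multiple variable regions, use Protein Structure Prediction (AlphaFold2.3.2) or Protein Structure Prediction (ESMFold) instead.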

    Parameter

    Fv Sequence (fasta)

    Heavy-chain and/or light-chain Fv sequences in FASTA format. Each heavy-chain sequence name must contain :H, Heavy, or .H; each light-chain sequence name must contain :L, Light, or .L. Example:

    >antibody.H
    XXXXXX
    >antibody.L
    XXXXXX
    

    Result

    The output file, antibody_pred.pdb, contains the predicted antibody structure.
    Known issue: some predicted structures lack a few residues that are present in the input sequence, so check the output against the input.

    Reference

    Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun. 2023 Apr 25;14(1):2389.

  • Name: MHC-I Binding Prediction
    Description: MHC-I型亲和力预测模型。模型训练是利用亲和力(BA)和质谱洗脱配体(MS eluted ligand)的数据,基于NNAlign框架增加了预测特定MHC分子结合肽段的亲和力值和肽段的长度。NetMHCpan-4.0的方法提高了在肿瘤新抗原,验证的洗脱配体(ELs),T细胞免疫表位的预测准确性。 A model for predicting MHC-I binding affinity. The model is trained using affinity (BA) and mass spectrometry eluted ligand (MS eluted ligand) data, and it incorporates the prediction of the affinity values and peptide lengths of specific MHC molecules using the NNAlign framework. The NetMHCpan-4.0 method improves the accuracy of predicting tumor neoantigens, validated eluted ligands (ELs), and T-cell epitopes.
    Tags: undefined
    Author: Morten Nielsen
    Release: 2022-10-14 00:00:00
    Reference: Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020 Jul 2;48(W1):W449-W454.

    MHC-I Binding Prediction

    简介

    基于神经网络的MHC-I型相互作用预测模型。模型训练是利用亲和力和质谱洗脱配体的数据,预测特定MHC分子结合肽段的亲和力值和肽段的长度,可用于肿瘤新抗原的预测。

    参数说明

    Protein Sequence File

    蛋白的序列文件,FASTA格式。

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    Seq_ID 蛋白序列名称
    Pos 肽段在蛋白质序列中的残基编号(从0开始)
    MHC MHC分子/等位基因名称
    Peptide 潜在配体的氨基酸序列
    Core 直接与MHC接触的最小的9个氨基酸结合核心
    Of 核心在肽段中的起始位置(如果>0,则该方法预测N-末端突出)
    Gp 如有删除,删除的位置
    Gl 如有删除,删除的长度
    Ip 如有插入,插入的位置
    Il 如有插入,插入的长度
    Icore 相互作用核心。这是包括插入和删除的结合核心序列
    Identity 蛋白质标识符,即FASTA条目的名称
    Score 原始预测得分。(EL:质谱洗脱配体,BA:亲和力)
    %Rank 预测结合得分与一组随机天然肽相比的排名。此测量不受某些分子固有偏向于更高或更低的预测亲和力的影响。强结合物被定义为具有%rank<0.5的物质,而弱结合物则具有%rank<2。我们建议基于%Rank而不是得分选择候选配体。(EL:质谱洗脱配体,BA:亲和力)
    Aff(nM) 亲和力大小
    BindLevel 如果%Rank低于强结合物的指定阈值(默认为0.5%),则将识别肽段为强结合物。如果%Rank高于强结合物的阈值但低于弱结合物的指定阈值(默认为2%),则将识别肽段为弱结合物。(SB:强结合物,WB:弱结合物)

    参考文献

    Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020 Jul 2;48(W1):W449-W454.

    MHC-I Binding Prediction

    Introduction

    A neural network-based model for predicting MHC-I interactions. The model is trained using affinity and mass spectrometry-eluted ligand data to forecast the affinity values and lengths of peptides binding to specific MHC molecules. This can be employed for predicting tumor neoantigens.

    Parameter

    Protein Sequence File

    Protein sequence file in FASTA format.

    Result

    The output file is result.csv and contains the following information:

    Field Name Description
    Seq_ID Protein sequence name.
    Pos Residue number (starting from 0) of the peptide in the protein sequence.
    MHC MHC molecule / allele name.
    Peptide Amino acid sequence of the potential ligand.
    Core The minimal 9 amino acid binding core directly in contact with the MHC.
    Of The starting position of the Core within the Peptide (if > 0, the method predicts an N-terminal protrusion).
    Gp Position of the deletion, if any.
    Gl Length of the deletion, if any.
    Ip Position of the insertion, if any.
    Il Length of the insertion, if any.
    Icore Interaction core. This is the sequence of the binding core including any insertions or deletions.
    Identity Protein identifier, i.e. the name of the FASTA entry.
    Score The raw prediction score. (EL: MS eluted ligand, BA: Binding Affinity)
    %Rank Rank of the predicted binding score compared to a set of random natural peptides. This measure is not affected by the inherent bias of certain molecules towards higher or lower mean predicted affinities. Strong binders are defined as having %Rank < 0.5 and weak binders as having %Rank < 2. We recommend selecting candidate binders based on %Rank rather than Score. (EL: MS eluted ligand, BA: Binding Affinity)
    Aff(nM) Predicted binding affinity in nM.
    BindLevel The peptide is identified as a strong binder if its %Rank is below the strong-binder threshold (by default, 0.5%). It is identified as a weak binder if its %Rank is above the strong-binder threshold but below the weak-binder threshold (by default, 2%). (SB: Strong Binder, WB: Weak Binder)
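The BindLevel rule can be written as a short function. This sketch uses the default thresholds quoted above (0.5 and 2); the function name is hypothetical, and returning an empty string for non-binders is an assumption about how unlabelled rows appear.

```python
def bind_level(rank, strong=0.5, weak=2.0):
    """Classify a %Rank value as strong binder (SB), weak binder (WB), or neither."""
    if rank < strong:
        return "SB"
    if rank < weak:
        return "WB"
    return ""


for rank in (0.1, 0.5, 1.9, 2.0):
    print(rank, repr(bind_level(rank)))
```

Both comparisons are strict, so a peptide with %Rank exactly 0.5 is labelled WB and one with exactly 2 is left unlabelled.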

    Reference

    Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020 Jul 2;48(W1):W449-W454.

  • Name: NPT MDP Generation
    Description: 该模块主要是生成等温等压(NPT)的MDP文件,此文件是Gromacs分子动力学模拟需要用到输入文件,里面包含各种参数。 Generate Gromacs MD input file at constant temperature and pressure (NPT).
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 17:14:19
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    NPT MDP Generation

    简介

    NPT MDP Generation是生成等温等压(NPT)MDP文件的模块。

    参数说明

    Define

    Define用于传递预处理器的定义,可以使用任何定义来控制自定义拓扑文件(.top)中的选项。可选择的定义包括以下选项:

    1. DPOSRES用于实现位置约束。选择该项时必须填写Force Constant of POSRE,否则无效。
    2. none为无定义。

    Integrator

    模拟中积分方式的选择:md算法。
    md是蛙跳法,对符合牛顿公式的运动进行积分。

    Time Step

    时间步长,单位为ps。(默认为0.001)

    Simulation Time (ns)

    模拟时长,单位为ns。

    Group(s) for Center of Mass

    质心进行操作的组,可以是索引文件中的一个,或者多个组。默认为整个系统。

    Motion Mode

    系统或者系统中各个组质心的操作。(默认为None)

    • Linear:移去质心平移速度
    • Angular:去掉质心的平移和质心周围的旋转速度
    • Linear-acceleration-correction:去除质心平移速度。修正质心位置,假设在nstcomm步骤上有线性加速度。这对于期望质心上的加速度在mdp:nstcomm步长上几乎是恒定的情况是有用的。例如,当使用绝对引用拉入组时,就会发生这种情况。
    • None:对质心运动没有限制

    Coordinates Output Steps

    在轨迹文件中写入坐标的频率。(默认为0)

    Velocities Output Steps

    在轨迹文件中写入速度(v)的频率。(默认为0)

    Forces Output Steps

    在轨迹文件中写入力的频率。(默认为0)

    Log Output Steps

    在log文件中写入能量的频率。(默认为50)

    Energies Output Steps

    在记录能量的文件中写入能量的频率。(默认为100)

    Compressed Coordinates Steps

    输入压缩的轨迹文件的频率。(默认为50)

    Compressed Groups

    输入轨迹包含的结构。默认为整个系统。

    PBC

    周期化边界条件设置(默认为xyz)。

    • xyz:在所有方向上使用周期性边界条件。
    • no:不使用周期边界条件,忽略方框。要模拟没有截止,设置所有截止和nstlist为0。为了在没有截断的情况下获得最佳性能,请将nstlist设置为零并将ns-type =simple设置为简单。
    • xy:只在x和y方向上使用周期边界条件。这只适用于ns-type =grid,并且可以与墙壁结合使用。没有墙或只有一个墙,系统尺寸在z方向上是无限的。因此不能采用压力耦合法或埃瓦尔德求和法。当使用两面墙时,这些缺点就不适用了。

    Coulomb Type

    原子静电相互作用的计算方法,默认为PME。

    • Cut-off:具有对列表半径rlist 和库仑截止 rcoulomb 的平面截止,其中 rlist>=rcoulumb。
    • Ewald:经典的Ewald sum静电学。实空间截止Coulomb Cutoff应等于rlist,使用例如rlist=0.9,rcoulomb=0.9。在reciprocal space中使用的波矢量的最高幅度由傅里叶间距控制。direct/reciprocal space 的相对精度由 ewald rtol 控制。
    • PME: 用于具体指静电相互作用或库仑力的Fast smooth Particle-Mesh Ewald(SPME)。Direct space类似于Ewald sum,而reciprocal space使用FFT执行。网格尺寸由傅里叶间距控制,插值顺序由pme-order控制。

    Coulomb Cutoff

    库仑力截止距离,单位nm(默认为1.2)

    VdW Type

    范德华相互作用的计算方法,默认为Cut-off。

    • Cut-off:用对列表半径rlist和VdW截断rvdw的普通截断,其中rlist >= rvdw。
    • PME:用于VdW相互作用的快速平滑粒子网格Ewald (SPME)。网格尺寸采用傅里叶间距控制,插补顺序采用pme-order控制。正/倒易空间的相对精度由ewald-rtoll-lj控制,倒易例程使用的具体组合规则由lj-pme-comb-rule设置。

    VdW Cutoff

    LJ或Buckingham截止距离,单位nm(默认为1.2)

    Dispersion Correction

    能量和压力的长程色散校正方法(默认为EnerPres)。

    • no:不做任何修正
    • EnerPres:适用于能量和压力的长程分散校正
    • Ener:仅对能量应用长程色散修正

    Temperature Coupling

    温度耦合的方法(默认为V-rescale)。

    • V-rescale:使用随机项的速度重标度的温度耦合(JCP 126, 014101)。这个恒温器类似于Berendsen耦合,使用tau-t进行相同的缩放,但随机项确保生成适当的规范集合。随机种子用ld-seed设置。即使tau-t =0,这个恒温器也能正常工作。对于NVT模拟,保存的能量被写入能量和日志文件。
    • Berendsen:与Berendsen恒温器的温度耦合到温度为ref-t的浴槽,时间常数为tau-t。几个组可以单独耦合,它们在tc-grps字段中指定,并用空格分隔。
    • no:无温度耦合。

    Coupling Groups

    耦合到单独的温度浴的组别,多个组别用空格间隔。

    Time for Temperature Coupling

    温度耦合时间常数,单位为ps。(默认为0.2)

    Coupling Reference Temperature

    耦合的参考温度,即动力学模拟的温度,单位为K。(默认为300)

    Pressure Coupling

    压力耦合的方法(默认为Berendsen)。

    • Parrinello-Rahman:扩展系综压力耦合,其中盒向量服从运动方程。原子的运动方程和这个是耦合的。不会发生瞬时缩放。对于Nose-Hoover温度耦合,时间常数tau-p是压力在平衡状态下波动的周期。当您希望在数据收集期间应用压力缩放时,这可能是一种更好的方法,但要注意,如果您从不同的压力开始,您可能会得到非常大的振荡。对于NPT系综的精确波动很重要的模拟,或者如果压力耦合时间很短,则可能不合适,因为在GROMACS实现的某些步骤中使用了之前的时间步长压力来代替当前的时间步长压力。
    • Berendsen:指数弛豫压力与时间常数tau-p的耦合。这个盒子每隔几步就缩放一次。有人认为,这并不能产生正确的热力学集合,但这是在运行开始时缩放盒子的最有效方法。
    • no:无压力耦合。这意味着一个固定的盒子大小。

    Pressure Coupling Type

    压力耦合的各向同性类型。每种类型取一个或多个可压缩性(compressibility)和Coupling Reference Pressure。Time for Pressure Coupling仅允许一个值。(默认为isotropic)

    • isotropic:时间常数为Time for Pressure Coupling的各向同性压力耦合。可压缩性(compressibility)和Coupling Reference Pressure各需要一个值.
    • semisotropic:在x和y方向上各向同性但在方向上不同的压力耦合。这对于膜模拟是有用的。对于x/y和z方向,分别需要可压缩性(compressibility)和Coupling Reference Pressure的两个值。
    • anisotropic:与之前相同,但xx、yy、zz、xy/yx、xz/zx和yz/zy组件分别需要6个值。当非对角压缩性设置为零时,矩形盒子将保持矩形。请注意,各向异性缩放可能会导致模拟盒子发生极端变形。
    • surface-tension:平行于xy平面的表面的表面张力耦合。对Z方向使用法向压力耦合,而表面张力耦合到盒子的x/y尺度。第一个Coupling Reference Pressure是参考表面张力乘以表面数(单位bar*nm),第二个值是参考z-pressure(单位bar)。这两个可压缩性(compressibility)分别是xy和方向上的压缩率。z-compressibility的值应该相当精确,因为它会影响表面张力的收敛,也可以将其设置为零,使盒子具有恒定的高度。

    Time for Pressure Coupling

    压力耦合的时间常数(所有方向一个值),单位为ps。(默认为2)

    Coupling Reference Pressure

    耦合的参考压力,单位为bar。(默认为1)

    Compressibility

    可压缩性(注:这实际上是在bar^-1)对于水在1atm和300k的可压缩性是4.5e-5 bar^-1。所需值的数量由pcoupltype [bar^-1]暗示。

    Constraints

    限制类型。(默认为none)

    • none:除了拓扑文件中明确定义的外,没有限制。
    • hbonds:给含有氢原子的键添加限制。
    • all-bonds:给所有的键添加限制。
    • h-angles:给所有的键添加限制,同时给含有氢原子的角度添加限制。
    • all-angles:给所有的键和角度添加限制。

    Output File

    输出文件名称

    结果说明

    得到一个计算NPT的MDP文件npt.mdp。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    NPT MDP Generation

    Introduction

    The NPT MDP Generation module is used to generate the MDP file for an isothermal-isobaric (NPT) simulation.

    Parameter Description

    Define

    The Define section is used to pass preprocessor definitions that can control options in custom topology files (.top). Available options include:

    1. DPOSRES: Enables position restraints. The Force Constant of POSRE parameter must also be set; otherwise the option has no effect.
    2. none: No definitions.

    Integrator

    Integration algorithm used in the simulation: md.
    md is a leap-frog algorithm for integrating Newton’s equations of motion.

    Time Step

    Time step size in ps. (Default is 0.001)

    Simulation Time (ns)

    Duration of the simulation in ns.
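GROMACS MDP files express the run length as a number of integration steps (nsteps) rather than as a duration, so the time step and the simulation time together determine the step count. The sketch below shows that arithmetic; the function name is hypothetical and this is not necessarily how the module performs the conversion.

```python
def n_steps(simulation_time_ns, time_step_ps=0.001):
    """Number of integration steps needed to cover the simulation time."""
    total_ps = simulation_time_ns * 1000.0   # 1 ns = 1000 ps
    return round(total_ps / time_step_ps)    # round to absorb floating-point error


print(n_steps(1.0))         # 1 ns at the default 0.001 ps time step
print(n_steps(10, 0.002))   # 10 ns at a 0.002 ps time step
```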

    Group(s) for Center of Mass

    Group(s) for center of mass operations, can be one or multiple groups from the index file. Default is the entire system.

    Motion Mode

    Center of mass motion removal applied to the system or to the individual groups in the system. (Default is None)

    • Linear: Removes center of mass translational velocities.
    • Angular: Removes center of mass translational velocities and rotational velocities around the center of mass.
    • Linear-acceleration-correction: Removes center of mass translational velocities and corrects the center of mass position, assuming linear acceleration over nstcomm steps. This is useful when the acceleration of the center of mass is expected to be nearly constant over nstcomm steps, which can occur, for example, when pulling on a group using an absolute reference.
    • None: No restrictions on center of mass motion.

    Coordinates Output Steps

    Frequency of writing coordinates to the trajectory file. (Default is 0)

    Velocities Output Steps

    Frequency of writing velocities to the trajectory file. (Default is 0)

    Forces Output Steps

    Frequency of writing forces to the trajectory file. (Default is 0)

    Log Output Steps

    Frequency of writing energy to the log file. (Default is 50)

    Energies Output Steps

    Frequency of writing energy to the energy file. (Default is 100)

    Compressed Coordinates Steps

    Frequency of writing coordinates to the compressed trajectory file. (Default is 50)

    Compressed Groups

    Group(s) written to the compressed trajectory file. Default is the entire system.

    PBC

    Setting for periodic boundary conditions (Default is xyz).

    • xyz: Periodic boundary conditions in all directions.
    • no: No periodic boundary conditions, ignore the box. For simulating without cutoffs, set all cutoffs and nstlist to 0. For optimal performance without cutoffs on a single MPI rank, set nstlist to 0 and ns-type=simple.
    • xy: Periodic boundary conditions only in the x and y directions. This is only valid for ns-type=grid and can be used with walls. Without walls or with only one wall, the system size is infinite in the z direction, so pressure coupling or Ewald sum methods cannot be used. When using two walls, these limitations do not apply.

    Coulomb Type

    Method for calculating atomic electrostatic interactions, default is PME.

    • Cut-off: Plain cut-off with pair-list radius rlist and Coulomb cut-off rcoulomb, where rlist >= rcoulomb.
    • Ewald: Classical Ewald sum electrostatics. The real-space cut-off (Coulomb Cutoff, rcoulomb) should be equal to rlist, for example rlist=0.9, rcoulomb=0.9. The highest magnitude of the wave vectors used in reciprocal space is controlled by the Fourier spacing. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol.
    • PME: Fast smooth Particle-Mesh Ewald (SPME) electrostatics. Direct space is treated as in the Ewald sum, while the reciprocal part is performed with FFTs. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order.

    Coulomb Cutoff

    Coulomb force cut-off distance in nm. (Default is 1.2)

    VdW Type

    Method for calculating van der Waals interactions, default is Cut-off.

    • Cut-off: Plain cut-off with pair-list radius rlist and VdW cut-off rvdw, where rlist >= rvdw.
    • PME: Fast smooth Particle-Mesh Ewald (SPME) for VdW interactions. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol-lj, and the specific combination rule used in reciprocal space is set by lj-pme-comb-rule.

    VdW Cutoff

    LJ or Buckingham cut-off distance in nm. (Default is 1.2)

    Dispersion Correction

    Method for long-range dispersion correction for energy and pressure (Default is EnerPres).

    • no: No corrections are applied.
    • EnerPres: Long-range dispersion correction applied for both energy and pressure.
    • Ener: Long-range dispersion correction applied only for energy.

    Temperature Coupling

    Method for temperature coupling (Default is V-rescale).

    • V-rescale: Temperature coupling using velocity rescaling with a stochastic term (JCP 126, 014101). This thermostat is similar to Berendsen coupling, with the same scaling using tau-t, but the stochastic term ensures that a proper canonical ensemble is generated. The random seed is set with ld-seed. This thermostat works correctly even for tau-t = 0. For NVT simulations, the conserved energy quantity is written to the energy and log files.
    • Berendsen: Temperature coupling to a bath at temperature ref-t with an exponential relaxation time tau-t. Several groups can be coupled separately, specified in the tc-grps field and separated by spaces.
    • no: No temperature coupling.

    Coupling Groups

    Groups to which temperature baths are coupled, multiple groups separated by spaces.

    Time for Temperature Coupling

    Time constant for temperature coupling in ps. (Default is 0.2)

    Coupling Reference Temperature

    Reference temperature for coupling in K. (Default is 300)

    Pressure Coupling

    Method for pressure coupling (Default is Berendsen).

    • Parrinello-Rahman: Extended-ensemble pressure coupling in which the box vectors are subject to an equation of motion, and the equations of motion of the atoms are coupled to it. No instantaneous scaling takes place. As with Nose-Hoover temperature coupling, the time constant tau-p is the period of pressure fluctuations at equilibrium. This is probably a better method when you want to apply pressure scaling during data collection, but be aware that you can get very large oscillations if you start from a different pressure. It may not be appropriate when the exact fluctuations of the NPT ensemble are important, or when the pressure coupling time is very short, because some steps of the GROMACS implementation use the pressure of the previous time step in place of that of the current time step.
    • Berendsen: Exponential relaxation pressure coupling with time constant tau-p. The box is scaled every few steps. It has been argued that this does not generate a correct thermodynamic ensemble, but it is the most efficient way to scale the box at the beginning of a run.
    • no: No pressure coupling. This means a fixed box size.

    Pressure Coupling Type

    Kind of isotropy of the pressure coupling. Each type takes one or more values for Compressibility and Coupling Reference Pressure. Time for Pressure Coupling takes only a single value. (Default is isotropic)

    • isotropic: Isotropic pressure coupling with a time constant of Time for Pressure Coupling. Requires one value each for compressibility and Coupling Reference Pressure.
    • semiisotropic: Pressure coupling that is isotropic in the x and y directions but different in the z direction. Useful for membrane simulations. Requires two values each for compressibility and Coupling Reference Pressure, for the x/y and z directions respectively.
    • anisotropic: Same as above, but requires six values each for xx, yy, zz, xy/yx, xz/zx, and yz/zy components. When non-diagonal compressibilities are set to zero, the rectangular box will remain rectangular. Note that anisotropic scaling may lead to extreme deformations of the simulation box.
    • surface-tension: Surface tension coupling for surfaces parallel to the xy plane. Uses normal pressure coupling in the z direction, while the surface tension is coupled to the x/y dimensions of the box. The first Coupling Reference Pressure value is the reference surface tension multiplied by the number of surfaces (units bar*nm); the second value is the reference z-pressure (units bar). The two compressibility values are for the x/y and z directions respectively. The z-compressibility should be reasonably accurate because it influences the convergence of the surface tension; it can also be set to zero to keep the box height constant.

    Time for Pressure Coupling

    Time constant for pressure coupling (one value for all directions) in ps. (Default is 2)

    Coupling Reference Pressure

    Reference pressure for coupling in bar. (Default is 1)

    Compressibility

    Compressibility in bar^-1. For water at 1 atm and 300 K, the compressibility is 4.5e-5 bar^-1. The number of values required is determined by Pressure Coupling Type (pcoupltype).
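
    The value counts described for each Pressure Coupling Type can be checked before an MDP file is written. The following is a minimal sketch, not the module's own validation; the function name check_pressure_values is hypothetical, and the type names use the GROMACS mdp spelling.

```python
# Number of values expected for Compressibility and Coupling Reference Pressure
# for each Pressure Coupling Type, following the descriptions in this section.
# Keys use the GROMACS mdp spelling of the pcoupltype values.
EXPECTED_VALUES = {
    "isotropic": 1,
    "semiisotropic": 2,
    "anisotropic": 6,
    "surface-tension": 2,
}


def check_pressure_values(pcoupltype, compressibility, ref_p):
    """Return a list of problems; an empty list means the counts are consistent.

    Hypothetical helper for illustration only.
    """
    if pcoupltype not in EXPECTED_VALUES:
        return [f"unknown pressure coupling type: {pcoupltype!r}"]
    expected = EXPECTED_VALUES[pcoupltype]
    problems = []
    for name, values in (("compressibility", compressibility), ("ref-p", ref_p)):
        if len(values) != expected:
            problems.append(
                f"{name}: expected {expected} value(s) for {pcoupltype}, got {len(values)}"
            )
    return problems


if __name__ == "__main__":
    # Membrane-style setup: separate x/y and z values.
    print(check_pressure_values("semiisotropic", [4.5e-5, 4.5e-5], [1.0, 1.0]))
```

    An empty list means the number of compressibility and reference pressure values matches the selected coupling type.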

    Constraints

    Type of constraints. (Default is none)

    • none: No constraints other than those explicitly defined in the topology file.
    • h-bonds: Adds constraints to bonds involving hydrogen atoms.
    • all-bonds: Adds constraints to all bonds.
    • h-angles: Adds constraints to all bonds and to angles involving hydrogen atoms.
    • all-angles: Adds constraints to all bonds and all angles.

    Output File

    Output file name.

    Result Description

    Generates an MDP file named npt.mdp for the NPT calculation.
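
    To illustrate how the parameters listed above map to an MDP file, here is a minimal sketch that renders the quoted defaults as GROMACS mdp options. It is not the module's implementation: the helper render_mdp is hypothetical, the tc-grps and compressibility values are assumptions (no default is quoted for them), and options described earlier in the NPT section are omitted.

```python
# Defaults quoted in the parameter descriptions, keyed by GROMACS mdp option name.
NPT_DEFAULTS = {
    "nstxout": 0,                 # Coordinates Output Steps
    "nstvout": 0,                 # Velocities Output Steps
    "nstfout": 0,                 # Forces Output Steps
    "nstlog": 50,                 # Log Output Steps
    "nstenergy": 100,             # Energies Output Steps
    "nstxout-compressed": 50,     # Compressed Coordinates Steps
    "pbc": "xyz",
    "coulombtype": "PME",
    "rcoulomb": 1.2,
    "vdwtype": "Cut-off",
    "rvdw": 1.2,
    "DispCorr": "EnerPres",
    "tcoupl": "V-rescale",
    "tc-grps": "System",          # assumption: no default is quoted for Coupling Groups
    "tau-t": 0.2,
    "ref-t": 300,
    "pcoupl": "Berendsen",
    "pcoupltype": "isotropic",
    "tau-p": 2,
    "ref-p": 1,
    "compressibility": 4.5e-5,    # assumption: the value quoted for water
    "constraints": "none",
}


def render_mdp(overrides=None):
    """Return MDP text with one 'key = value' line per option (hypothetical helper)."""
    params = dict(NPT_DEFAULTS)
    params.update(overrides or {})
    width = max(len(key) for key in params)
    return "\n".join(f"{key:<{width}} = {value}" for key, value in params.items()) + "\n"


if __name__ == "__main__":
    # Example: raise the reference temperature and constrain bonds to hydrogen.
    print(render_mdp({"ref-t": 310, "constraints": "h-bonds"}))
```

    Keeping the defaults in one dictionary makes it easy to see which values differ from the documented defaults for a given run.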

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: Minimize MDP Generation
    Description: Minimize MDP Generation模块主要是生成Gromacs分子动力学模拟需要用到体系能量优化(Minimization)的输入MDP文件。 The Minimize MDP Generation module is primarily used to generate input MDP files that are required for Minimization of Gromacs molecular dynamics simulations.
    Author: WECOMPUT
    Release: 2022-10-09 16:35:14
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    Minimize MDP Generation

    简介

    Minimize MDP Generation是生成能量优化(Minimization)MDP文件的模块。

    参数说明

    Integrator

    模拟中积分方式的选择:cg和steep算法。
    cg用于能量最小化的共轭梯度算法,在接近能量极小点时通常比steep收敛更快。
    steep用于能量最小化的最陡下降算法。一般在setup的能量最小化中使用。

    Simulation Time (ns)

    最小化的最大时间,-1没有最大值。

    Convergency Value of Minimization

    最大容许力,单位为kJ/(mol·nm)。当最大作用力小于此值,认为最小化过程收敛。(默认为100)

    Initial Step

    起始步长,单位为nm。(默认为0.01)

    Coordinates Output Steps

    在轨迹文件中写入坐标的频率。(默认为50)

    Log Output Steps

    在log文件中写入能量的频率。(默认为50)

    Energies Output Steps

    在记录能量的文件中写入能量的频率。(默认为50)

    PBC

    周期化边界条件设置:
    xyz: 在所有方向上使用周期性边界条件
    no: 不使用周期性边界条件,忽略box。若要进行无截断的模拟,请将所有Cutoff相关选项和nstlist设置为0。若要在单个MPI rank上获得无截断模拟的最佳性能,请将nstlist设置为0,并设置ns-type=simple。
    xy: 仅在x和y方向使用周期性边界条件。这仅适用于ns-type=grid,并可与墙(walls)结合使用。如果没有墙或只有一面墙,系统在z方向上的大小是无限的,因此不能使用压力耦合或Ewald求和方法。当使用两面墙时,这些限制不存在。

    Coulomb Type

    原子静电相互作用的计算方法,默认为PME。

    1. Cut-off:使用对列表半径rlist和库仑截断rcoulomb的普通截断,其中rlist >= rcoulomb。
    2. Ewald:经典的Ewald求和静电计算。实空间截断Coulomb Cutoff应等于rlist,例如rlist=0.9,rcoulomb=0.9。倒易空间中使用的波矢的最大模由傅里叶间距控制。正/倒易空间的相对精度由ewald-rtol控制。
    3. PME:用于静电相互作用的快速平滑粒子网格Ewald(SPME)方法。正空间部分与Ewald求和类似,倒易空间部分使用FFT计算。网格尺寸由傅里叶间距控制,插值阶数由pme-order控制。

    Coulomb Cutoff

    库仑力截断距离,单位为nm。(默认为1.2)

    VdW Type

    范德华相互作用的计算方法,默认为Cut-off。

    1. Cut-off:用对列表半径rlist和VdW截断rvdw的普通截断,其中rlist >= rvdw。
    2. PME:用于VdW相互作用的快速平滑粒子网格Ewald (SPME)。网格尺寸由傅里叶间距控制,插值阶数由pme-order控制。正/倒易空间的相对精度由ewald-rtol-lj控制,倒易空间部分使用的具体组合规则由lj-pme-comb-rule设置。

    VdW Cutoff

    LJ或Buckingham截止距离,单位nm。(默认为1.2)

    Constraints

    控制拓扑中被转换为刚性完整约束的键类型。典型的刚性水模型没有键,因此不受此关键字的影响。
    none:不将键转换为约束
    h-bonds:将涉及氢原子的键转换为约束
    all-bonds:将所有键转换为约束
    h-angles:将所有键转换为约束,并将涉及氢原子的角度转换为键约束
    all-angles:将所有键转换为约束,并将所有角度转换为键约束

    Output File

    输出文件名称

    结果说明

    得到一个计算最小化的MDP文件mini.mdp。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    Minimize MDP Generation

    Introduction

    The Minimize MDP Generation module is used to generate the MDP file for energy minimization.

    Parameter Description

    Integrator

    Choice of integration method in the simulation: cg and steep algorithms.
    cg is the conjugate gradient algorithm for energy minimization. It is slower than steep in the early stages of a minimization but becomes more efficient close to the energy minimum.
    steep is the steepest descent algorithm for energy minimization. It is generally used for the initial minimization during system setup.

    Simulation Time (ns)

    Maximum time for minimization; -1 means no limit.

    Convergency Value of Minimization

    Maximum allowable force in kJ/(mol·nm). Minimization is considered converged when the maximum force is below this value. (Default is 100)

    Initial Step

    Initial step size in nm. (Default is 0.01)
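
    The roles of the convergence value and the initial step can be seen in a small sketch of the steepest descent scheme described in the GROMACS manual: each trial move is scaled by the largest force component, the step grows by a factor of 1.2 after an accepted move and is halved after a rejected one. This is a toy illustration on an arbitrary energy function, not the GROMACS implementation; the function and argument names are hypothetical.

```python
def steepest_descent(energy, force, x0, emtol=100.0, emstep=0.01, max_steps=10000):
    """Toy steepest descent minimizer.

    energy(x) returns a float, force(x) returns a list of force components.
    emtol mirrors 'Convergency Value of Minimization' and emstep mirrors
    'Initial Step'. Returns (coordinates, energy, steps used, converged flag).
    """
    x = list(x0)
    h = emstep
    e = energy(x)
    for step in range(max_steps):
        f = force(x)
        fmax = max(abs(c) for c in f)
        if fmax < emtol:
            # Largest force component is below the tolerance: converged.
            return x, e, step, True
        # Move along the force direction; the largest displacement equals h.
        trial = [xi + fi / fmax * h for xi, fi in zip(x, f)]
        e_trial = energy(trial)
        if e_trial < e:
            x, e = trial, e_trial   # accept the move and lengthen the step
            h *= 1.2
        else:
            h *= 0.5                # reject the move and shorten the step
    return x, e, max_steps, False


if __name__ == "__main__":
    k = 1000.0  # toy harmonic force constant, kJ/(mol*nm^2)
    harmonic_energy = lambda x: 0.5 * k * sum(c * c for c in x)
    harmonic_force = lambda x: [-k * c for c in x]
    coords, e_final, steps, ok = steepest_descent(harmonic_energy, harmonic_force, [1.0, -0.5])
    print(ok, steps, coords)
```

    A smaller convergence value demands more steps, while the initial step mainly affects how quickly the first few moves make progress.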

    Coordinates Output Steps

    Frequency of writing coordinates in the trajectory file. (Default is 50)

    Log Output Steps

    Frequency of writing energy to the log file. (Default is 50)

    Energies Output Steps

    Frequency of writing energy to the energy file. (Default is 50)

    PBC

    Setting for periodic boundary conditions:

    • xyz: Periodic boundary conditions in all directions.
    • no: No periodic boundary conditions, ignore the box. To simulate without cutoffs, set all Cutoff-related options and nstlist to 0. For the best performance without cutoffs on a single MPI rank, set nstlist to 0 and ns-type=simple.
    • xy: Periodic boundary conditions only in the x and y directions. This is only valid for ns-type=grid and can be used with walls. If there are no walls or only one wall, the system is infinite in the z direction, so pressure coupling or Ewald sum methods cannot be used. When using two walls, these limitations do not exist.

    Coulomb Type

    Method for calculating atomic electrostatic interactions, default is PME.

    1. Cut-off: Plain cut-off with pair-list radius rlist and Coulomb cut-off rcoulomb, where rlist >= rcoulomb.
    2. Ewald: Classical Ewald sum electrostatics. The real-space cut-off (Coulomb Cutoff, rcoulomb) should be equal to rlist, for example rlist=0.9, rcoulomb=0.9. The highest magnitude of the wave vectors used in reciprocal space is controlled by the Fourier spacing. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol.
    3. PME: Fast smooth Particle-Mesh Ewald (SPME) electrostatics. Direct space is treated as in the Ewald sum, while the reciprocal part is performed with FFTs. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order.

    Coulomb Cutoff

    Coulomb cut-off distance in nm. (Default is 1.2)

    VdW Type

    Method for calculating van der Waals interactions, default is Cut-off.

    1. Cut-off: Plain cut-off with pair-list radius rlist and VdW cut-off rvdw, where rlist >= rvdw.
    2. PME: Fast smooth Particle-Mesh Ewald (SPME) for VdW interactions. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol-lj, and the specific combination rule used in reciprocal space is set by lj-pme-comb-rule.

    VdW Cutoff

    LJ or Buckingham cut-off distance in nm. (Default is 1.2)

    Constraints

    Controls which types of bonds in the topology are converted to rigid constraints. Typical rigid water models have no bonds, so they are not affected by this keyword.

    • none: No bonds are converted to constraints.
    • h-bonds: Bonds involving hydrogen atoms are converted to constraints.
    • all-bonds: All bonds are converted to constraints.
    • h-angles: All bonds are converted to constraints, and angles involving hydrogen atoms are converted to bond constraints.
    • all-angles: All bonds are converted to constraints, and all angles are converted to bond constraints.

    Output File

    Output file name.

    Result Description

    Generates an MDP file named mini.mdp for the energy minimization calculation.

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: MD PDB Prepare
    Description: 在分子动力学模拟前处理PDB结构,结合PDBFixer工具对输入PDB文件中的蛋白结构进行修复,再分离出PDB文件中的蛋白结构、小分子结构以及核酸结构。 A structure preparation module to run before molecular dynamics. Missing residues in the PDB file are added using PDBFixer, and the protein, nucleic acid, and ligands are extracted and output individually.
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: P. Eastman, M. S. Friedrichs, J. D. Chodera, R. J. Radmer, C. M. Bruns, J. P. Ku, K. A. Beauchamp, T. J. Lane, L.-P. Wang, D. Shukla, T. Tye, M. Houston, T. Stich, C. Klein, M. R. Shirts, and V. S. Pande. 2013. OpenMM 4: A Reusable, Extensible, Hardware Independent Library for High Performance Molecular Simulation. Journal of Chemical Theory and Computation. ACS Publications. 9(1): 461-469.

    MD PDB Prepare

    简介

    MD PDB Prepare是一个在分子动力学模拟前处理PDB结构的模块,结合PDBFixer工具对输入PDB文件中的蛋白结构进行修复,再分离出PDB文件中的蛋白结构、小分子结构以及核酸结构。

    参数说明

    PDB File

    结构文件,PDB格式。
    需要注意体系中若存在配体,其名称不能为*号且必须以HETATM开头。如下所示为正确的小分子结构文件:

    HETATM 3767  C1  GOL A 302      -4.671 -11.067  -0.429  1.00 43.56           C  
    HETATM 3768  O1  GOL A 302      -5.324  -9.793  -0.300  1.00 41.43           O  
    

    若体系中有特殊金属原子,只能选AMBER力场。离子需要按照特定书写格式,以下为一些常见原子书写格式:

      # Mg2+离子
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+离子
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+离子
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+离子
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+离子
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+离子
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
      # Cu2+离子
      HETATM 1431  CU  CU  A 301     -23.030  15.955  -4.315  1.00 47.40          CU
    

    其中atom type,residue需要是大写,atom name只需是标准金属离子(可以通过文本编辑器查看书写格式是否相同)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    protein.pdb 分离得到体系中蛋白文件
    ligand.pdb/ligand_pdb.tar.gz 分离得到体系中小分子文件或者压缩文件
    nucleic_acid.pdb 分离得到体系中核酸文件
    membrane.pdb/lipid_membrane.pdb 分离得到体系中膜结构

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    MD PDB Prepare

    Introduction

    MD PDB Prepare is a module for pre-processing PDB structures before molecular dynamics simulations. It uses the PDBFixer tool to repair protein structures in the input PDB file and separates the protein structure, small molecule structure, and nucleic acid structure from the PDB file.

    Parameter Description

    PDB File

    Structure file in PDB format.
    Note that if the system contains a ligand, the ligand's name must not be an asterisk (*) and its records must begin with HETATM. Below is an example of a correctly written small molecule:

    HETATM 3767  C1  GOL A 302      -4.671 -11.067  -0.429  1.00 43.56           C  
    HETATM 3768  O1  GOL A 302      -5.324  -9.793  -0.300  1.00 41.43           O  
    

    If the system contains special metal atoms, only the AMBER force field can be selected. Ions must be written in a specific format. Record formats for some common ions are shown below:

      # Mg2+ ion
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+ ion
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+ ion
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+ ion
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+ ion
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+ ion
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
      # Cu2+ ion
      HETATM 1431  CU  CU  A 301     -23.030  15.955  -4.315  1.00 47.40          CU
    

    The atom type and residue name must be uppercase, and the atom name must be the standard name of the metal ion (open the file in a text editor to check that the format matches the examples above).
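
    The formatting rules above can be checked with a short script before a file is submitted. The sketch below slices the standard fixed-width PDB columns; it is an illustration rather than the module's own check, the function names are hypothetical, and it assumes that "atom type" refers to the element field in columns 77-78.

```python
def parse_hetatm(line):
    """Slice the fixed-width PDB columns that matter for ligand and ion records."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "atom_name": line[12:16].strip(), # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21:22].strip(),     # column 22
        "res_seq": line[22:26].strip(),   # columns 23-26
        "element": line[76:78].strip(),   # columns 77-78 (assumed to be the "atom type")
    }


def check_record(line):
    """Return a list of problems for one record; an empty list means it looks fine."""
    rec = parse_hetatm(line)
    problems = []
    if rec["record"] != "HETATM":
        problems.append("record must start with HETATM")
    if not rec["res_name"] or "*" in rec["res_name"]:
        problems.append("residue name must not be empty or '*'")
    for key in ("res_name", "element"):
        if rec[key] != rec[key].upper():
            problems.append(f"{key} must be uppercase")
    return problems


if __name__ == "__main__":
    sample = "HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG"
    print(parse_hetatm(sample))
    print(check_record(sample))
```

    Because PDB is a fixed-width format, slicing by column is more reliable than splitting on whitespace, which breaks when adjacent fields touch.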

    Result Description

    The output results include:

    Output File Name                 Description
    protein.pdb                      Separated protein file from the system
    ligand.pdb/ligand_pdb.tar.gz     Separated small molecule file or compressed file from the system
    nucleic_acid.pdb                 Separated nucleic acid file from the system
    membrane.pdb/lipid_membrane.pdb  Separated membrane structure from the system

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: MD Trajectory
    Description: 可根据起始帧数、结束帧数以及间隔帧数对平衡模拟进行轨迹提取,从而将其转换为GRO或者PDB轨迹文件。 MD Trajectory converts Gromacs trajectory file (xtc) into GRO or PDB file for visualization.
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD Trajectory

    简介

    可根据起始帧数、结束帧数以及间隔帧数对平衡模拟进行轨迹提取,从而将其转换为GRO或者PDB轨迹文件。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run模块或者AlphaAutoMD模块中获取。

    Type

    文件输出类型:GRO或者PDB。

    Water

    输出文件是否保留水盒子。

    Start Time (ps)

    起始位置(单位ps)。

    End Time (ps)

    结束位置(单位ps)。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。对于膜体系的轨迹提取是必填项。

    结果说明

    输出结果包括:

    输出文件名称 说明
    md_finally.pdb 最后一帧结构文件
    md_center.pdb PDB格式轨迹文件
    md_center.gro GRO格式轨迹文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    MD Trajectory

    Introduction

    The MD Trajectory module extracts frames from an equilibrium simulation trajectory according to the start time, end time, and time interval, and converts them into GRO or PDB trajectory files.

    Parameter Description

    Path File

    Path file produced by the MD simulation; it can be obtained from the GMX MD Run module or the AlphaAutoMD module.

    Type

    File output type: GRO or PDB.

    Water

    Whether to retain the water box in the output files.

    Start Time (ps)

    Starting time (in ps).

    End Time (ps)

    Ending time (in ps).

    Skip Time (ps)

    Time interval, in ps.

    Index File

    Index file in ndx format. This is a required parameter for extracting trajectories in membrane systems.
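
    For readers who want to reproduce a similar extraction by hand, the time-window parameters above correspond naturally to the -b, -e, and -dt options of gmx trjconv. The sketch below only builds the command line; how the module actually invokes GROMACS is not documented here, so the function name and the input file names (md.xtc, md.tpr) are assumptions.

```python
def build_trjconv_command(xtc, tpr, out_type="pdb", start_ps=None, end_ps=None,
                          skip_ps=None, index_file=None):
    """Build a 'gmx trjconv' argument list for extracting part of a trajectory.

    Hypothetical helper: the input file names and the output name are illustrative.
    """
    if out_type not in ("gro", "pdb"):
        raise ValueError("out_type must be 'gro' or 'pdb'")
    cmd = ["gmx", "trjconv", "-f", xtc, "-s", tpr, "-o", f"md_center.{out_type}"]
    if start_ps is not None:
        cmd += ["-b", str(start_ps)]   # first time to read, in ps
    if end_ps is not None:
        cmd += ["-e", str(end_ps)]     # last time to read, in ps
    if skip_ps is not None:
        cmd += ["-dt", str(skip_ps)]   # only write frames at multiples of this interval
    if index_file is not None:
        cmd += ["-n", index_file]      # index file, e.g. for membrane systems
    return cmd


if __name__ == "__main__":
    print(" ".join(build_trjconv_command("md.xtc", "md.tpr", "pdb", 0, 1000, 10)))
```

    Note that gmx trjconv asks interactively which group to write, so an automated run would also need to supply that selection; this is where a choice such as keeping or dropping the water would be made.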

    Result Description

    The output results include:

    Output File Name  Description
    md_finally.pdb    Structure file of the final frame
    md_center.pdb     PDB format trajectory file
    md_center.gro     GRO format trajectory file

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: MMPBSA (Deprecated)
    Description: 基于g_mmpbsa计算受体与配体之间的结合自由能。 MMPBSA calculates components of binding free energy using the MM-PBSA method.
    Author: WECOMPUT
    Release: 2022-10-09 16:47:54
    Reference: Kumari et al (2014) g_mmpbsa - A GROMACS tool for high-throughput MM-PBSA calculations. J. Chem. Inf. Model. 54:1951-1962. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MMPBSA

    简介

    计算受体与配体之间的结合自由能,支持pb和gb,同时支持能量分解。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD模块中获取。

    Energy Option

    选择计算能量类型:pb或者gb。
    pb:用PB法计算脱溶自由能,并根据pbsa中的INP选项计算非极性溶剂化自由能。
    gb:用GB模型计算sander脱溶自由能。

    Ligand Mol2

    上传配体的mol2文件,可由GMX Ligand Parameterization模块获取。Ligand Mol2和Custom Group必须选填其中一个参数。

    Custom Group

    定义进行结合能计算的两个组别,组别之间用"/"分隔开。组别中填写的是蛋白氨基酸的序号,例如1-213/214-426或者1-211,212-213/214-426。蛋白氨基酸序号从1开始重新编号,与初始pdb氨基酸编号无关。Ligand Mol2和Custom Group必须选填其中一个参数。

    Decomp

    能量分解计算:yes或者no。(默认:no)

    Startframe

    起始帧位置。

    Endframe

    结束帧位置。

    Skipframe

    每一帧的间隔时间(单位ps)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    mmpbsa_decomp_*.csv gb(pb)方法下能量分解CSV文件
    mmpbsa_decomp_gb(pb)_*.dat gb(pb)方法下能量分解dat文件
    mmpbsa_energy_gb(pb)_*.csv gb(pb)方法下得到的结合自由能随时间变化的CSV文件
    mmpbsa_energy_total_*.dat gb(pb)方法下得到的结合自由能随时间变化的dat文件
    mmpbsa_result_*.dat 总结合自由能dat文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    MMPBSA

    Introduction

    MMPBSA calculates the binding free energy between a receptor and a ligand, supporting both pb and gb methods, as well as energy decomposition.

    Parameter Description

    Path File

    Path file produced by the MD simulation; it can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD module.

    Energy Option

    Select the type of energy calculation: pb or gb.
    pb: Calculate the desolvation free energy using the PB method and calculate the nonpolar solvation free energy based on the INP option in PBSA.
    gb: Calculate the desolvation free energy using the GB model in sander.

    Ligand Mol2

    Upload the ligand's mol2 file, which can be obtained from the GMX Ligand Parameterization module. Either Ligand Mol2 or Custom Group must be provided.

    Custom Group

    Defines the two groups between which the binding energy is calculated, separated by "/". Each group is given as protein residue numbers, for example 1-213/214-426 or 1-211,212-213/214-426. Residues are renumbered starting from 1, independently of the residue numbering in the original pdb file. Either Ligand Mol2 or Custom Group must be provided.
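
    The Custom Group syntax can be illustrated with a small parser. This is a sketch of how such a string may be interpreted, not the module's own code; the function names are hypothetical, and accepting a single residue number without a range (such as "5") is an assumption.

```python
def parse_residue_spec(spec):
    """Expand a comma-separated list of residue ranges such as '1-211,212-213'."""
    residues = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(p) for p in part.split("-"))
            if first > last:
                raise ValueError(f"range start exceeds end: {part!r}")
            residues.extend(range(first, last + 1))
        else:
            # Assumption: a bare number selects a single residue.
            residues.append(int(part))
    return residues


def parse_custom_group(text):
    """Split 'groupA/groupB' into two lists of residue numbers (numbering starts at 1)."""
    groups = text.split("/")
    if len(groups) != 2:
        raise ValueError("expected exactly two groups separated by '/'")
    group_a, group_b = (parse_residue_spec(g) for g in groups)
    if min(group_a + group_b) < 1:
        raise ValueError("residue numbering starts at 1")
    if set(group_a) & set(group_b):
        raise ValueError("the two groups must not share residues")
    return group_a, group_b


if __name__ == "__main__":
    a, b = parse_custom_group("1-211,212-213/214-426")
    print(len(a), len(b))
```

    Both example strings from the description expand to the same two groups, residues 1 to 213 and residues 214 to 426.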

    Decomp

    Energy decomposition calculation: yes or no. (Default: no)

    Startframe

    Starting frame position.

    Endframe

    Ending frame position.

    Skipframe

    Time interval for each frame (in ps).

    Result Description

    The output results include:

    Output File Name            Description
    mmpbsa_decomp_*.csv         Energy decomposition CSV file for the gb (pb) method
    mmpbsa_decomp_gb(pb)_*.dat  Energy decomposition dat file for the gb (pb) method
    mmpbsa_energy_gb(pb)_*.csv  CSV file of the binding free energy over time for the gb (pb) method
    mmpbsa_energy_total_*.dat   Dat file of the binding free energy over time for the gb (pb) method
    mmpbsa_result_*.dat         Dat file with the total binding free energy

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: Protein Protonation
    Description: 预测每个蛋白残基的pKa并根据指定的pH值判断每个残基的质子化状态。 Predicts the pKa value of each protein residue using PROPKA3 and determines the protonation state based on the specified pH value.
    Author: Jan H. Jensen
    Release: 2022-09-29 00:00:00
    Reference: Olsson, Mats HM, Chresten R. Sondergaard, Michal Rostkowski, and Jan H. Jensen. "PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions." Journal of Chemical Theory and Computation 7, no. 2 (2011): 525-537. doi:10.1021/ct100578z

    Protein Protonation

    简介

    Protein Protonation是蛋白质子化模块,主要用于预测每个蛋白残基的pKa,并根据指定的pH值判断每个残基的质子化状态。

    参数说明

    PDB File

    蛋白的结构文件,PDB格式,该文件可以由MD PDB Prepare模块提取得到。

    pH

    pH值,默认为7。

    N terminal

    N端残基质子化状态,只有charge和neutral两个选项,默认charge。

    C Terminal

    C端残基质子化状态,只有charge和neutral两个选项,默认charge。

    Custom Residues

    自定义残基质子化状态。

    Output PDB File

    预测的含质子化状态的结构文件。

    结果说明

    输出结果包括:

    输出文件名称 说明
    protein_protonation.pdb 质子化状态的结构文件
    predict_pKa.txt 含pKa值输出文件

    参考文献

    Olsson, Mats HM, Chresten R. Sondergaard, Michal Rostkowski, and Jan H. Jensen. “PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions.” Journal of Chemical Theory and Computation 7, no. 2 (2011): 525-537.

    Protein Protonation

    Introduction

    The Protein Protonation module is primarily used to predict the pKa of each protein residue and determine the protonation state of each residue based on the specified pH value.

    Parameter Description

    PDB File

    The structure file of the protein in PDB format, which can be obtained from the MD PDB Prepare module.

    pH

    pH value, default is 7.
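
    The relationship between a predicted pKa and the chosen pH follows the Henderson-Hasselbalch equation. The sketch below shows that relationship only; the exact rule the module applies to the PROPKA3 predictions is not documented here, and the function names and example pKa values are illustrative assumptions.

```python
def protonated_fraction(pka, ph):
    """Fraction of a titratable group that is protonated at the given pH.

    From Henderson-Hasselbalch: [HA] / ([HA] + [A-]) = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def is_protonated(pka, ph=7.0):
    """Simple majority rule: treat the group as protonated when pH is below its pKa."""
    return ph < pka


if __name__ == "__main__":
    # Illustrative pKa values, not predictions for any particular structure.
    for name, pka in (("acidic side chain", 4.0), ("basic side chain", 10.5)):
        state = "protonated" if is_protonated(pka) else "deprotonated"
        print(f"{name}: pKa {pka}, fraction protonated at pH 7 = "
              f"{protonated_fraction(pka, 7.0):.3f}, {state}")
```

    At a pH equal to the pKa the two states are equally populated, which is why residues with a predicted pKa near the chosen pH deserve a closer look, for example through the Custom Residues parameter.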

    N terminal

    Protonation state of the N-terminal residue, with options of charge and neutral, default is charge.

    C Terminal

    Protonation state of the C-terminal residue, with options of charge and neutral, default is charge.

    Custom Residues

    Customize the protonation state of residues.

    Output PDB File

    Structure file with predicted protonation states.

    Result Description

    The output results include:

    Output File Name         Description
    protein_protonation.pdb  Structure file with protonation states
    predict_pKa.txt          Output file containing pKa values

    Reference

    Olsson, Mats HM, Chresten R. Sondergaard, Michal Rostkowski, and Jan H. Jensen. “PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions.” Journal of Chemical Theory and Computation 7, no. 2 (2011): 525-537.

  • Name: GMX Receptor Parameterization
    Description: 根据Gromacs生成受体(包括蛋白或者核酸)的GRO,ITP以及TOP文件。 Generate gro, itp, and top files for receptor (protein or nucleic acid) for molecular dynamics using Gromacs.
    Author: WECOMPUT
    Release: 2022-10-09 12:49:42
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    GMX Receptor Parameterization

    简介

    GMX Receptor Parameterization模块根据Gromacs生成受体(包括蛋白或者核酸)的GRO,ITP以及TOP文件。

    参数说明

    Protein PDB

    蛋白结构文件。提交的蛋白质文件最好经过Protein Protonation模块的处理。
    若体系中有特殊金属原子,只能选AMBER力场。离子需要按照特定书写格式,以下为一些常见原子书写格式:

      # Mg2+离子
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+离子
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+离子
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+离子
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+离子
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+离子
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
      # Cu2+离子
      HETATM 1431  CU  CU  A 301     -23.030  15.955  -4.315  1.00 47.40          CU
    

    其中atom type,residue需要是大写,atom name只需是标准金属离子(可以通过文本编辑器查看书写格式是否相同)。

    Nucleic Acid PDB

    核酸结构文件。

    Force Field

    力场,默认amber03。以下是各个力场适用于哪些情况:
    amber03,amber14sb_parmbsc1适合蛋白和核酸的凝聚相模拟,也支持小分子。
    charmm27,charmm36-jul2020适用于核酸和脂(膜)。
    gromos54a7适合烷烃、蛋白、核酸凝聚相的模拟。
    oplsaa适合高分子模拟。
    注意:根据提交的pdb结构选取力场。

    Water Model

    水模型,默认spc。
    spc:最好用于GROMOS力场。
    spce:对纯水体系比SPC、TIP3P都好。
    tip3p:最好用于amber。
    tip4p:最好用于opls。

    结果说明

    输出结果包括:

    输出文件名称 说明
    receptor.gro 受体的分子坐标文件
    receptor_itp.tar.gz 受体平衡模拟时固定原子位置所施加的力
    receptor.top 受体的拓扑文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    GMX Receptor Parameterization

    Introduction

    The GMX Receptor Parameterization module generates GRO, ITP, and TOP files for receptors (including proteins or nucleic acids) based on Gromacs.

    Parameter Description

    Protein PDB

    Protein structure file. The submitted protein file is preferably processed through the Protein Protonation module.
    If the system contains special metal atoms, only the AMBER force field can be selected. Ions must be written in a specific format. Record formats for some common ions are shown below:

      # Mg2+ ion
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+ ion
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+ ion
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+ ion
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+ ion
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+ ion
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
      # Cu2+ ion
      HETATM 1431  CU  CU  A 301     -23.030  15.955  -4.315  1.00 47.40          CU
    

    The atom type and residue name must be uppercase, and the atom name must be the standard name of the metal ion (open the file in a text editor to check that the format matches the examples above).

    Nucleic Acid PDB

    Nucleic acid structure file.

    Force Field

    Force field. The default is amber03. Typical uses of each force field:
    amber03 and amber14sb_parmbsc1 are suitable for condensed-phase simulations of proteins and nucleic acids, and also support small molecules.
    charmm27 and charmm36-jul2020 are suitable for nucleic acids and lipids (membranes).
    gromos54a7 is suitable for condensed-phase simulations of alkanes, proteins, and nucleic acids.
    oplsaa is suitable for polymer simulations.
    Note: choose the force field according to the submitted PDB structure.

    Water Model

    Water model. The default is spc.
    spc: best used with the GROMOS force field.
    spce: better than SPC and TIP3P for pure water systems.
    tip3p: best used with the AMBER force fields.
    tip4p: best used with the OPLS force field.
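    The pairing of force fields and water models described above can be written as a small lookup. This is only an illustrative sketch; the function name suggest_water_model is hypothetical, and it returns None for force fields for which the text gives no recommendation.

```python
from typing import Optional

# Recommendations taken from the Water Model list above; the function
# itself is illustrative and not part of the module.
def suggest_water_model(force_field: str) -> Optional[str]:
    """Return the water model the text recommends for a force field family."""
    name = force_field.lower()
    if name.startswith("gromos"):
        return "spc"
    if name.startswith("amber"):
        return "tip3p"
    if name.startswith("opls"):
        return "tip4p"
    return None  # no recommendation is given in the text


for ff in ("amber03", "gromos54a7", "oplsaa", "charmm27"):
    print(ff, suggest_water_model(ff))
```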

    Result Description

    The output results include:

    Output File Name       Description
    receptor.gro           Coordinate file of the receptor
    receptor_itp.tar.gz    Archive of receptor ITP files, including the forces applied to hold atomic positions fixed during equilibration
    receptor.top           Topology file of the receptor

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1-2, (2015), 19-25.

  • Name: GMX Ligand Parameterization
    Description: The GMX Ligand Parameterization module generates the MOL2, GRO, and ITP files required for GROMACS molecular dynamics simulations from small-molecule PDB files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 10:40:45
    Reference:
    1. O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33. doi: 10.1186/1758-2946-3-33. PMID: 21982300; PMCID: PMC3198950.
    2. Abraham, M. J.; Murtola, T.; Schulz, R.; Páll, S.; Smith, J. C.; Hess, B.; Lindahl, E., GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 2015, 1-2, 19-25.
    3. Case, D. A.; Darden, T. A.; Cheatham, T. E., III; et al., AMBER 16, University of California, San Francisco, 2016.
    4. Sousa da Silva, A.W., Vranken, W.F. ACPYPE - AnteChamber PYthon Parser interfacE. BMC Res Notes 5, 367 (2012). https://doi.org/10.1186/1756-0500-5-367.
    5. Wang J, Wang W, Kollman PA, Case DA. Automatic atom type and bond type perception in molecular mechanical calculations. J Mol Graph Model. 2006 Oct;25(2):247-60. doi: 10.1016/j.jmgm.2005.12.005. Epub 2006 Feb 3. PMID: 16458552.
    6. Wang J, Wolf RM, Caldwell JW, Kollman PA, Case DA. Development and testing of a general amber force field. J Comput Chem. 2004 Jul 15;25(9):1157-74. doi: 10.1002/jcc.20035. Erratum in: J Comput Chem. 2005 Jan 15;26(1):114. PMID: 15116359.
    7. Lu T, Chen F. Multiwfn: a multifunctional wavefunction analyzer. J Comput Chem. 2012 Feb 15;33(5):580-92. doi: 10.1002/jcc.22885. Epub 2011 Dec 8. PMID: 22162017.
    8. Neese F, Wennmohs F, Becker U, Riplinger C. The ORCA quantum chemistry program package. J Chem Phys. 2020 Jun 14;152(22):224108. doi: 10.1063/5.0004608. PMID: 32534543.

    GMX Ligand Parameterization

    简介

    基于obabel,Antechamber(Ambertool),ACPYPE以及ORCA对小分子进行处理。将小分子的PDB文件根据所需电荷,电荷类型和自旋多重度进行处理,从而生成Gromacs分子动力学模拟所需的GRO和ITP文件。

    参数说明

    Small Molecule PDB File

    支持pdb和tar.gz的文件格式。当单个配体时提交pdb文件,多个配体时提交含有pdb的tar.gz文件。该文件最好经过MD PDB Prepare模块处理。
    配体分子不能用*号,最好是重新命名成英文名称。

    Charge Type

    选取计算的电荷类型,默认为bcc电荷。

    pH

    如设置则配体在该pH环境下加氢;如不设置,按全氢加氢。注意:设置pH后,如果配体电荷不为0,自旋多重度不为1,则需要在Charge Multiplicity设置。

    Charge Multiplicity

    指明要计算的配体文件的电荷和自旋多重度,默认为电荷为0,自旋多重度为1。格式要求:配体文件名称(不包含后缀) 电荷值 自旋多重度,例如提交文件为ligand.pdb、电荷为0、自旋多重度为1,则该栏输入为“ligand 0 1”。

    结果说明

    输出结果包括:

    输出文件名称 说明
    ligand.gro 受体的分子坐标文件
    ligand_itp.tar.gz 受体平衡模拟时固定原子位置所施加的力
    ligand.mol2/ligand_mol2.tar.gz 分子结构的mol2文件,多个配体时为tar.gz文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    GMX Ligand Parameterization

    Introduction

    Small molecules are processed with obabel, Antechamber (AmberTools), ACPYPE, and ORCA. The ligand PDB file is processed according to the specified charge, charge type, and spin multiplicity to generate the GRO and ITP files required for GROMACS molecular dynamics simulations.

    Parameter Description

    Small Molecule PDB File

    Accepts pdb and tar.gz files. Submit a pdb file for a single ligand, or a tar.gz archive containing the pdb files for multiple ligands. The file should preferably be processed with the MD PDB Prepare module first.
    Ligand names must not contain asterisks (*); it is best to rename ligands with English names.

    Charge Type

    Charge model used to compute the atomic charges. The default is bcc (AM1-BCC charges, as the option is named in Antechamber).

    pH

    If set, hydrogens are added to the ligand according to the specified pH; if not set, all hydrogens are added. Note: when pH is set and the resulting ligand charge is not 0 or the spin multiplicity is not 1, both values must be specified in Charge Multiplicity.

    Charge Multiplicity

    Charge and spin multiplicity of the submitted ligand file. The default charge is 0 and the default spin multiplicity is 1. Format: ligand file name (without the extension), charge, and spin multiplicity, separated by spaces. For example, for a submitted file ligand.pdb with charge 0 and spin multiplicity 1, enter “ligand 0 1” in this field.
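    The entry format described above (file name without extension, charge, spin multiplicity) can be validated with a few lines of code. The sketch below is illustrative only; the names LigandState and parse_charge_multiplicity are hypothetical and do not come from the module.

```python
from typing import NamedTuple


class LigandState(NamedTuple):
    name: str          # ligand file name without the extension
    charge: int        # net charge
    multiplicity: int  # spin multiplicity (1 = singlet)


def parse_charge_multiplicity(entry: str) -> LigandState:
    """Parse an entry such as 'ligand 0 1' into its three fields."""
    parts = entry.split()
    if len(parts) != 3:
        raise ValueError(f"expected 'name charge multiplicity', got {entry!r}")
    name, charge, multiplicity = parts[0], int(parts[1]), int(parts[2])
    if multiplicity < 1:
        raise ValueError("spin multiplicity must be a positive integer")
    return LigandState(name, charge, multiplicity)


print(parse_charge_multiplicity("ligand 0 1"))
```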

    Result Description

    The output results include:

    Output File Name                  Description
    ligand.gro                        Coordinate file of the ligand
    ligand_itp.tar.gz                 Archive of ligand ITP files, including the forces applied to hold atomic positions fixed during equilibration
    ligand.mol2/ligand_mol2.tar.gz    Mol2 file of the molecular structure; a tar.gz archive when multiple ligands are submitted

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1-2, (2015), 19-25.

  • Name: MD MDP Generation
    Description: The MD MDP Generation module generates the MDP file for the production MD run. This file is an input file required by GROMACS molecular dynamics simulations and contains the simulation parameters.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 14:08:30
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD MDP Generation

    简介

    MD MDP Generation是生成平衡模拟(MD)MDP文件的模块。

    参数说明

    Define

    Define用于传递预处理器的定义,可以使用任何定义来控制自定义拓扑文件(.top)中的选项。可选择的定义包括以下选项:

    • DPOSRES用于实现位置约束。选择该项时必须填写Force Constant of POSRE,否则无效。
    • DFLEXIBLE将使用柔性水而不是刚性水进入拓扑结构,这对正常模式分析很有用。

    Integrator

    模拟中积分方式的选择:md算法。
    md是蛙跳法,对符合牛顿公式的运动进行积分。

    Time Step

    时间步长,单位为ps。(默认为0.001)

    Simulation Time (ns)

    模拟时长,单位为ns。

    Group(s) for Center of Mass

    质心进行操作的组,可以是索引文件中的一个,或者多个组。默认为整个系统。

    Motion Mode

    系统或者系统中各个组质心的操作。(默认为None)

    • Linear:移去质心平移速度
    • Angular:去掉质心的平移和质心周围的旋转速度
    • Linear-acceleration-correction:去除质心平移速度。修正质心位置,假设在nstcomm步骤上有线性加速度。这对于期望质心上的加速度在mdp:nstcomm步长上几乎是恒定的情况是有用的。例如,当使用绝对引用拉入组时,就会发生这种情况。
    • None:对质心运动没有限制

    Coordinates Output Steps

    在轨迹文件中写入坐标的频率。(默认为0)

    Velocities Output Steps

    在轨迹文件中写入速度(v)的频率。(默认为0)

    Forces Output Steps

    在轨迹文件中写入力的频率。(默认为0)

    Log Output Steps

    在log文件中写入能量的频率。(默认为5000)

    Energies Output Steps

    在记录能量的文件中写入能量的频率。(默认为1000)

    Compressed Coordinates Steps

    输入压缩的轨迹文件的频率。(默认为1000)

    Compressed Groups

    输入轨迹包含的结构。默认为整个系统。

    PBC

    周期化边界条件设置(默认为xyz)。

    • xyz:在所有方向上使用周期性边界条件。
    • no:不使用周期边界条件,忽略方框。要模拟没有截止,设置所有截止和nstlist为0。为了在没有截断的情况下获得最佳性能,请将nstlist设置为零并将ns-type =simple设置为简单。
    • xy:只在x和y方向上使用周期边界条件。这只适用于ns-type =grid,并且可以与墙壁结合使用。没有墙或只有一个墙,系统尺寸在z方向上是无限的。因此不能采用压力耦合法或埃瓦尔德求和法。当使用两面墙时,这些缺点就不适用了。

    Coulomb Type

    原子静电相互作用的计算方法,默认为PME。

    • Cut-off:具有对列表半径rlist 和库仑截止 rcoulomb 的平面截止,其中 rlist>=rcoulumb。
    • Ewald:经典的Ewald sum静电学。实空间截止Coulomb Cutoff应等于rlist,使用例如rlist=0.9,rcoulomb=0.9。在reciprocal space中使用的波矢量的最高幅度由傅里叶间距控制。direct/reciprocal space 的相对精度由 ewald rtol 控制。
    • PME: 用于具体指静电相互作用或库仑力的Fast smooth Particle-Mesh Ewald(SPME)。Direct space类似于Ewald sum,而reciprocal space使用FFT执行。网格尺寸由傅里叶间距控制,插值顺序由pme-order控制。

    Coulomb Cutoff

    库仑力截止距离,单位nm。(默认为1.2)

    VdW Type

    范德华相互作用的计算方法,默认为Cut-off。

    • Cut-off:用对列表半径rlist和VdW截断rvdw的普通截断,其中rlist >= rvdw。
    • PME:用于VdW相互作用的快速平滑粒子网格Ewald (SPME)。网格尺寸采用傅里叶间距控制,插补顺序采用pme-order控制。正/倒易空间的相对精度由ewald-rtoll-lj控制,倒易例程使用的具体组合规则由lj-pme-comb-rule设置。

    VdW Cutoff

    LJ势或Buckingham的阈值,单位为nm。(默认为1.2)

    Dispersion Correction

    能量和压力的长程色散校正方法(默认为EnerPres)。

    • no:不做任何修正
    • EnerPres:适用于能量和压力的长程分散校正
    • Ener:仅对能量应用长程色散修正

    Temperature Coupling

    温度耦合的方法(默认为V-rescale)。

    • V-rescale:使用随机项的速度重标度的温度耦合(JCP 126, 014101)。这个恒温器类似于Berendsen耦合,使用tau-t进行相同的缩放,但随机项确保生成适当的规范集合。随机种子用ld-seed设置。即使tau-t =0,这个恒温器也能正常工作。对于NVT模拟,保存的能量被写入能量和日志文件。
    • Berendsen:与Berendsen恒温器的温度耦合到温度为ref-t的浴槽,时间常数为tau-t。几个组可以单独耦合,它们在tc-grps字段中指定,并用空格分隔。
    • no:无温度耦合。

    Coupling Groups

    耦合到单独的温度浴的组别,多个组别用空格间隔。

    Time for Temperature Coupling

    耦合时间常数,每个组别都需要定义温度,-1表示无温度耦合,单位为ps。(默认为0.2)

    Coupling Reference Temperature

    耦合的参考温度,即动力学模拟的温度,单位为K。(默认为300)

    Pressure Coupling

    压力耦合的方法(默认为Berendsen)。

    • Parrinello-Rahman:扩展系综压力耦合,其中盒向量服从运动方程。原子的运动方程和这个是耦合的。不会发生瞬时缩放。对于Nose-Hoover温度耦合,时间常数tau-p是压力在平衡状态下波动的周期。当您希望在数据收集期间应用压力缩放时,这可能是一种更好的方法,但要注意,如果您从不同的压力开始,您可能会得到非常大的振荡。对于NPT系综的精确波动很重要的模拟,或者如果压力耦合时间很短,则可能不合适,因为在GROMACS实现的某些步骤中使用了之前的时间步长压力来代替当前的时间步长压力。
    • Berendsen:指数弛豫压力与时间常数tau-p的耦合。这个盒子每隔几步就缩放一次。有人认为,这并不能产生正确的热力学集合,但这是在运行开始时缩放盒子的最有效方法。
    • no:无压力耦合。这意味着一个固定的盒子大小。

    Pressure Coupling Type

    压力耦合的各向同性类型。每种类型取一个或多个可压缩性(compressibility)和Coupling Reference Pressure。Time for Pressure Coupling仅允许一个值。(默认为isotropic)

    • isotropic:时间常数为Time for Pressure Coupling的各向同性压力耦合。可压缩性(compressibility)和Coupling Reference Pressure各需要一个值.
    • semisotropic:在x和y方向上各向同性但在方向上不同的压力耦合。这对于膜模拟是有用的。对于x/y和z方向,分别需要可压缩性(compressibility)和Coupling Reference Pressure的两个值。
    • anisotropic:与之前相同,但xx、yy、zz、xy/yx、xz/zx和yz/zy组件分别需要6个值。当非对角压缩性设置为零时,矩形盒子将保持矩形。请注意,各向异性缩放可能会导致模拟盒子发生极端变形。
    • surface-tension:平行于xy平面的表面的表面张力耦合。对Z方向使用法向压力耦合,而表面张力耦合到盒子的x/y尺度。第一个Coupling Reference Pressure是参考表面张力乘以表面数(单位bar*nm),第二个值是参考z-pressure(单位bar)。这两个可压缩性(compressibility)分别是xy和方向上的压缩率。z-compressibility的值应该相当精确,因为它会影响表面张力的收敛,也可以将其设置为零,使盒子具有恒定的高度。

    Time for Pressure Coupling

    压力耦合的时间常数(所有方向一个值),单位为ps。(默认为2)

    Coupling Reference Pressure

    耦合的参考压力,单位为bar。(默认为1)

    Compressibility

    可压缩性(注:这实际上是在bar^-1)对于水在1atm和300k的可压缩性是4.5e-5 bar^-1。所需值的数量由pcoupltype [bar^-1]暗示。

    Constraints

    限制类型。(默认为none)

    • none:除了拓扑文件中明确定义的外,没有限制。
    • hbonds:给含有氢原子的键添加限制。
    • all-bonds:给所有的键添加限制。
    • h-angles:给所有的键添加限制,同时给含有氢原子的角度添加限制。
    • all-angles:给所有的键和角度添加限制。

    Force Constant of POSRE

    xyz方向的位置限制的力常数,三个数值之间用逗号分隔开,单位为kJ/(mol·nm^2)。例如:500,500,500。

    Disre Type

    MD运行中距离、角度、二面角限制是否生效:
    no表示忽略拓扑文件中的约束信息;
    simple表示简单的(每分子)的距离约束;
    ensemble表示一个模拟盒中分子系综的距离约束。

    Disre Weighting

    约束力权重类型:
    equal表示将约束力平分到约束中的所有原子对上;
    conservative表示约束力为约束势的导数, 将导致原子对的权重为r^-7.,当Time Constant for Restraints=0时,约束力为保守力。

    Disre Mixed

    Dirse mixed采用的方法:
    no表示计算约束力时使用时间平均的违反;
    yes表示计算约束力时使用时间平均违反与瞬时违反乘积的平方根。

    Force Constant

    约束的力常数,乘以拓扑文件中相互作用约束给出的Factor即为最终的约束力大小。

    Time Constant for Restraints

    限制约束的时间,设置为0时表示MD过程中一直进行约束,单位为ps。

    Dirse Output Steps

    将约束中所有原子对的运行距离和瞬时距离写入能量文件的间隔步数。间隔越小该文件越大。

    Output File

    输出文件名称

    结果说明

    生成跑MD的MDP文件md.mdp。

    MD MDP Generation

    Introduction

    MD MDP Generation is a module for generating the MDP file for equilibrium simulations (MD).

    Parameter Description

    Define

    Passes definitions to the preprocessor; any definition can be used to control options in a custom topology file (.top). The available options are:

    • DPOSRES (written as -DPOSRES in the MDP define field) enables position restraints. When this option is selected, Force Constant of POSRE must also be filled in; otherwise the option has no effect.
    • DFLEXIBLE (written as -DFLEXIBLE) includes flexible water instead of rigid water in the topology, which is useful for normal mode analysis.

    Integrator

    Integration algorithm used in the simulation: md.
    md is a leap-frog algorithm for integrating Newton’s equations of motion.

    Time Step

    Time step, in ps. (Default is 0.001)

    Simulation Time (ns)

    Simulation duration, in ns.
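    Time Step is given in ps and Simulation Time in ns, so the number of integration steps follows from a unit conversion. The sketch below shows the arithmetic; the function name number_of_steps is hypothetical, and the assumption is that the module derives the GROMACS nsteps value in this way.

```python
def number_of_steps(sim_time_ns: float, time_step_ps: float = 0.001) -> int:
    """Number of MD steps needed to cover sim_time_ns with the given time step."""
    if time_step_ps <= 0:
        raise ValueError("time step must be positive")
    # 1 ns = 1000 ps; round() guards against floating-point residue
    return round(sim_time_ns * 1000.0 / time_step_ps)


print(number_of_steps(1.0))          # 1 ns at the default 0.001 ps step
print(number_of_steps(100.0, 0.002)) # 100 ns at a 0.002 ps step
```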

    Group(s) for Center of Mass

    Group(s) on which the center-of-mass motion treatment acts; one or more groups from the index file. The default is the entire system.

    Motion Mode

    How the center-of-mass motion of the system, or of the groups in it, is treated. (Default is None)

    • Linear: Remove the center-of-mass translational velocity
    • Angular: Remove the center-of-mass translational velocity and the rotational velocity around the center of mass
    • Linear-acceleration-correction: Remove the center-of-mass translational velocity and correct the center-of-mass position assuming linear acceleration over nstcomm steps. This is useful when the acceleration of the center of mass is expected to be nearly constant over nstcomm steps, for example when pulling on a group with an absolute reference.
    • None: No restriction on center-of-mass motion

    Coordinates Output Steps

    Frequency of writing coordinates to the trajectory file. (Default is 0)

    Velocities Output Steps

    Frequency of writing velocities to the trajectory file. (Default is 0)

    Forces Output Steps

    Frequency of writing forces to the trajectory file. (Default is 0)

    Log Output Steps

    Frequency of writing energies to the log file. (Default is 5000)

    Energies Output Steps

    Frequency of writing energies to the energy file. (Default is 1000)

    Compressed Coordinates Steps

    Number of steps between writing coordinates to the compressed trajectory file. (Default is 1000)

    Compressed Groups

    Groups written to the compressed trajectory file. The default is the entire system.

    PBC

    Periodic boundary conditions setting. (Default is xyz)

    • xyz: Use periodic boundary conditions in all directions.
    • no: Use no periodic boundary conditions and ignore the box. To simulate without cut-offs, set all cut-offs and nstlist to 0. For best performance without cut-offs, set nstlist to 0 and ns-type=simple.
    • xy: Use periodic boundary conditions in the x and y directions only. This works only with ns-type=grid and can be combined with walls. Without walls, or with only one wall, the system is infinite in the z direction, so pressure coupling and Ewald summation cannot be used. These restrictions do not apply when two walls are used.

    Coulomb Type

    Method for calculating electrostatic interactions between atoms. The default is PME.

    • Cut-off: Plain cut-off with pair list radius rlist and Coulomb cut-off rcoulomb, where rlist >= rcoulomb.
    • Ewald: Classical Ewald summation electrostatics. The real-space cut-off (Coulomb Cutoff, rcoulomb) should be equal to rlist, for example rlist=0.9, rcoulomb=0.9. The highest magnitude of the wave vectors used in reciprocal space is controlled by the Fourier spacing. The relative accuracy of the direct and reciprocal space sums is controlled by ewald-rtol.
    • PME: Fast smooth Particle-Mesh Ewald (SPME) electrostatics. Direct space is treated as in the Ewald sum, while the reciprocal part is computed with FFTs. The grid dimensions are controlled by the Fourier spacing, and the interpolation order is controlled by pme-order.

    Coulomb Cutoff

    Coulomb force cut-off distance, in nm. (Default is 1.2)

    VdW Type

    Method for calculating van der Waals interactions, default is Cut-off.

    • Cut-off: Plain cut-off with pair list radius rlist and VdW cut-off rvdw, where rlist >= rvdw.
    • PME: Fast smooth Particle-Mesh Ewald (SPME) for van der Waals interactions. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol-lj, and the specific combination rule for the LJ-PME is set by lj-pme-comb-rule.

    VdW Cutoff

    Cut-off distance for the LJ or Buckingham potential, in nm. (Default is 1.2)

    Dispersion Correction

    Method for long-range dispersion correction for energy and pressure. (Default is EnerPres)

    • no: No correction is applied.
    • EnerPres: Long-range dispersion correction is applied for both energy and pressure.
    • Ener: Only the energy is corrected for long-range dispersion.

    Temperature Coupling

    Method for temperature coupling. (Default is V-rescale)

    • V-rescale: Temperature coupling using velocity rescaling with a stochastic term (JCP 126, 014101). This thermostat is similar to Berendsen coupling, with the same scaling using tau-t, but the stochastic term ensures that a proper canonical ensemble is generated. The random seed is set with ld-seed. This thermostat works correctly even for tau-t = 0. For NVT simulations, the conserved energy quantity is written to the energy and log files.
    • Berendsen: Temperature coupling with a Berendsen thermostat to a bath at temperature ref-t, with time constant tau-t. Several groups can be coupled separately; they are specified in the tc-grps field, separated by spaces.
    • no: No temperature coupling.

    Coupling Groups

    Groups that are coupled to separate temperature baths; separate multiple groups with spaces.

    Time for Temperature Coupling

    Time constant for temperature coupling, in ps. One value is required for each group in Coupling Groups; -1 means no temperature coupling. (Default is 0.2)

    Coupling Reference Temperature

    Reference temperature for coupling, that is, the temperature of the dynamics simulation, in K. (Default is 300)

    Pressure Coupling

    Method for pressure coupling. (Default is Berendsen)

    • Parrinello-Rahman: Extended-ensemble pressure coupling in which the box vectors are subject to an equation of motion. The equation of motion for the atoms is coupled to this. No instantaneous scaling takes place. As with Nose-Hoover temperature coupling, the time constant tau-p is the period of pressure fluctuations at equilibrium. This is probably a better method when you want to apply pressure scaling during data collection, but beware that you can get very large oscillations if you start from a different pressure. It may not be appropriate for simulations where the exact fluctuations of the NPT ensemble are important, or when the pressure coupling time is very short, because some steps of the GROMACS implementation use the pressure of the previous time step in place of the current one.
    • Berendsen: Exponential relaxation pressure coupling with time constant tau-p. The box is scaled every few steps. It has been argued that this does not yield a correct thermodynamic ensemble, but it is the most efficient way to scale a box at the beginning of a run.
    • no: No pressure coupling. This means a fixed box size.

    Pressure Coupling Type

    Specifies the kind of isotropy of the pressure coupling. Each kind takes one or more values for Compressibility and Coupling Reference Pressure. Only a single value is permitted for Time for Pressure Coupling. (Default is isotropic)

    • isotropic: Isotropic pressure coupling with time constant Time for Pressure Coupling. One value each is required for Compressibility and Coupling Reference Pressure.
    • semiisotropic: Pressure coupling that is isotropic in the x and y directions but different in the z direction. This is useful for membrane simulations. Two values each are required for Compressibility and Coupling Reference Pressure, for the x/y and z directions respectively.
    • anisotropic: Same as semiisotropic, but six values are required, for the xx, yy, zz, xy/yx, xz/zx, and yz/zy components respectively. When the off-diagonal compressibilities are set to zero, a rectangular box stays rectangular. Note that anisotropic scaling can lead to extreme deformation of the simulation box.
    • surface-tension: Surface tension coupling for surfaces parallel to the xy plane. Normal pressure coupling is used for the z direction, while the surface tension is coupled to the x/y dimensions of the box. The first Coupling Reference Pressure value is the reference surface tension times the number of surfaces (in bar*nm), and the second value is the reference z-pressure (in bar). The two Compressibility values are the compressibilities in the x/y and z directions respectively. The z-compressibility should be reasonably accurate because it influences the convergence of the surface tension; it can also be set to zero to keep the box height constant.

    Time for Pressure Coupling

    Time constant for pressure coupling (one value for all directions), in ps. (Default is 2)

    Coupling Reference Pressure

    Reference pressure for coupling, in bar. (Default is 1)

    Compressibility

    Compressibility, in bar^-1. For water at 1 atm and 300 K, the compressibility is 4.5e-5 bar^-1. The number of values required is determined by Pressure Coupling Type (pcoupltype).
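    The number of Compressibility and Coupling Reference Pressure values depends on the Pressure Coupling Type, as listed above. A minimal sketch of that consistency check follows; the function name check_pressure_values is hypothetical, and the key semiisotropic uses the GROMACS spelling of that option.

```python
# Number of values each pressure coupling type expects, per the list above.
VALUES_PER_TYPE = {
    "isotropic": 1,
    "semiisotropic": 2,    # x/y and z
    "anisotropic": 6,      # xx, yy, zz, xy/yx, xz/zx, yz/zy
    "surface-tension": 2,  # x/y and z
}


def check_pressure_values(pcoupltype: str, compressibility: list, ref_p: list) -> None:
    """Raise ValueError if the value counts do not match the coupling type."""
    expected = VALUES_PER_TYPE[pcoupltype]
    for label, values in (("compressibility", compressibility), ("ref-p", ref_p)):
        if len(values) != expected:
            raise ValueError(
                f"{pcoupltype} needs {expected} {label} value(s), got {len(values)}"
            )


check_pressure_values("isotropic", [4.5e-5], [1.0])
check_pressure_values("semiisotropic", [4.5e-5, 4.5e-5], [1.0, 1.0])
print("value counts are consistent")
```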

    Constraints

    Type of constraints. (Default is none)

    • none: No constraints other than those explicitly defined in the topology file.
    • hbonds: Constraints added to bonds involving hydrogen atoms.
    • all-bonds: Constraints added to all bonds.
    • h-angles: Constraints added to all bonds, and to angles that involve hydrogen atoms.
    • all-angles: Constraints added to all bonds and angles.

    Force Constant of POSRE

    Force constants for position restraints in the x, y, and z directions: three values separated by commas, in kJ/(mol·nm^2). For example: 500,500,500.
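    To show how the three comma-separated values are typically used, the sketch below parses the field and writes a GROMACS [ position_restraints ] section (atom index, function type 1, and the three force constants). The function names and the choice of atom indices are hypothetical; how the module itself builds its restraint files is not stated in this document.

```python
def parse_force_constants(text: str) -> tuple:
    """Parse a string such as '500,500,500' into (fx, fy, fz)."""
    values = tuple(float(v) for v in text.split(","))
    if len(values) != 3:
        raise ValueError("expected three comma-separated values: fx,fy,fz")
    return values


def posre_section(atom_indices, text: str) -> str:
    """Build a [ position_restraints ] section for the given atom indices."""
    fx, fy, fz = parse_force_constants(text)
    lines = ["[ position_restraints ]", "; atom  funct      fx      fy      fz"]
    for index in atom_indices:
        lines.append(f"{index:6d}{1:7d}{fx:8.1f}{fy:8.1f}{fz:8.1f}")
    return "\n".join(lines)


print(posre_section([1, 5, 7], "500,500,500"))
```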

    Disre Type

    Whether the distance, angle, and dihedral restraints are applied during the MD run:
    no ignores the restraint information in the topology file;
    simple applies simple (per-molecule) distance restraints;
    ensemble applies distance restraints over an ensemble of molecules in one simulation box.

    Disre Weighting

    Weighting of the restraint force:
    equal divides the restraint force equally over all atom pairs in the restraint;
    conservative takes the restraint force as the derivative of the restraint potential, which weights the atom pairs by r^-7. When Time Constant for Restraints is 0, the restraint force is conservative.

    Disre Mixed

    Method used for Disre Mixed:
    no uses the time-averaged violation to compute the restraint force;
    yes uses the square root of the product of the time-averaged violation and the instantaneous violation to compute the restraint force.

    Force Constant

    Force constant for the restraints. It is multiplied by the Factor given for each restraint in the topology file to obtain the final restraint strength.

    Time Constant for Restraints

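    Time constant for the running time average of the restrained distances, in ps. When set to 0, no time averaging is applied and the restraints act on the instantaneous distances throughout the MD run.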

    Dirse Output Steps

    Number of steps between writing the running time-averaged and instantaneous distances of all atom pairs involved in restraints to the energy file. The smaller the interval, the larger the file.

    Output File

    Output file name.

    Result Description

    Generates the MDP file md.mdp for running MD.
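    As an illustration of what such a file contains, the sketch below renders the defaults listed in this section as MDP key-value pairs. The keys are standard GROMACS .mdp option names, but the mapping from this module's form fields to those keys, the "System" group name, and the function name render_mdp are assumptions; the file the module actually writes may differ.

```python
from typing import Optional

# Defaults quoted in the parameter descriptions above; the mapping to
# GROMACS .mdp keys is an assumption made for this sketch.
DEFAULTS = {
    "integrator": "md",
    "dt": 0.001,                 # Time Step, ps
    "comm-mode": "None",         # Motion Mode
    "nstxout": 0,
    "nstvout": 0,
    "nstfout": 0,
    "nstlog": 5000,
    "nstenergy": 1000,
    "nstxout-compressed": 1000,
    "pbc": "xyz",
    "coulombtype": "PME",
    "rcoulomb": 1.2,
    "vdwtype": "Cut-off",
    "rvdw": 1.2,
    "DispCorr": "EnerPres",
    "tcoupl": "V-rescale",
    "tc-grps": "System",         # assumed name for "the entire system"
    "tau-t": 0.2,
    "ref-t": 300,
    "pcoupl": "Berendsen",
    "pcoupltype": "isotropic",
    "tau-p": 2,
    "ref-p": 1,
    "compressibility": 4.5e-5,
    "constraints": "none",
}


def render_mdp(sim_time_ns: float, overrides: Optional[dict] = None) -> str:
    """Return MDP text for a run of sim_time_ns nanoseconds."""
    params = dict(DEFAULTS)
    params.update(overrides or {})
    # Simulation Time is in ns and dt in ps, hence the factor of 1000
    params["nsteps"] = round(sim_time_ns * 1000.0 / params["dt"])
    return "\n".join(f"{key:<20} = {value}" for key, value in params.items()) + "\n"


print(render_mdp(10.0, {"ref-t": 310}))
```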

  • Name: MD Solvation
    Description: The MD Solvation module adds a water box and ions to the input receptor and ligand files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 15:49:33
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD Solvation

    简介

    MD Solvation将原有的受配体结构中加入水分子和离子。

    参数说明

    Receptor Topology

    输入的受体拓扑文件,可由GMX Receptor Parameterization模块生成。

    Receptor GRO

    输入的受体结构文件,可由GMX Receptor Parameterization模块生成。

    Receptor ITP

    输入的受体参数(压缩)文件,可由GMX Receptor Parameterization模块生成。

    Ligand GRO

    输入的配体结构(压缩)文件,可由GMX Ligand Parameterization模块生成。

    Ligand ITP

    输入的配体参数(压缩)文件,可由GMX Ligand Parameterization模块生成。

    Output Topology

    输出的体系总的拓扑文件

    Output GRO

    输出的体系总的结构文件

    Output ITP

    输出的体系参数的(压缩)文件

    Distance Restraints

    距离限制,仅当Disre不为no时生效,格式如下所示:

    [AtomIndex1]  [AtomIndex2]  [Type]  [Index]  [Type]  [Low]  [Up1]  [Up2]  [Factor]
    

    其中,AtomIndex1和AtomIndex2为在system.gro的原子编号;Type为施加约束类型,通常设置为1,Type类型见表1;Index是计算顺序;Low、Up1、Up2为原子间限制距离,Low到Up1区间的原子距离是不受限制的,但是不能超过Up2,单位为nm;Factor为因子,将Factor乘以“Disre Force Constant”即为限制力的大小,单位为kJ/mol/nm2。
    例如:

    10     16      1       0       1      0.0     0.3     0.4     1.0
    10     46      1       1       1      0.0     0.3     0.4     1.0
    16     22      1       2       1      0.0     0.3     0.4     2.5
    

    表1:GROMACS中三种约束类型对原子对进行限制

    Type Code 约束类型 作用情况
    1 Complex NMR distance restraints 当Disre Type为ensemble时,即非键相互作用设置为1
    6 Simple harmonic restraints 当Disre Type为simple时,即分子内成键相互作用设定,可设为6或者10.
    10 Piecewise linear/harmonic restraints 当Disre Type为simple时,即分子内成键相互作用设定,可设为6或者10

    Angle Restraints

    角度限制是两对原子间角度的限制,仅当Disre不为no时生效,格式如下所示:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Theta0]  [Force Constant]  [Multiplicity]
    

    其中,AtomIndex1-AtomIndex2是第一对原子编号;AtomIndex3-AtomIndex4为第二对原子编号;Type在这里无用,定义为1即可;Theta0为约束的角度,单位为deg;Force Constant为约束力常数,单位为kJ/mol;Multiplicity为多重度。
    例如

    2642     2643     2635     2652     1     67.0     1500     1
    

    Dihedral Restraints

    二面角限制,仅当Disre不为no时生效,格式如下所示:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Label]  [Phi]  [dPhi]  [KFactor]  [Power]
    

    其中,AtomIndex1-AtomIndex4为组成二面角的原子编号;Type为约束类型函数,总是为1;Label无效;Phi为参考角,dPhi为超出参考角的角度值,单位为deg;KFactor为因子,将KFactor乘以“Disre Force Constant”即为限制力的大小,单位为 kJ/mol/rad2;Power无效。
    例如:

    2642      2643      2635      2652      1      67.0      1500      1
    

    约束势函数如下所示:
    image.png
    其中,Φ’为参考角Phi,ΔΦ为超出参考角的值dPhi,K_dihr为限制力的大小KFactor。

    结果说明

    输出结果包括:

    输出文件名称 说明
    system.gro 体系的分子坐标文件
    system_itp.tar.gz 体系平衡模拟时固定原子位置所施加的力
    system.top 体系的拓扑文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    MD Solvation

    Introduction

    MD Solvation adds water molecules and ions to the original receptor-ligand structure.

    Parameter Description

    Receptor Topology

    Input receptor topology file. It can be generated by the GMX Receptor Parameterization module.

    Receptor GRO

    Input receptor structure file. It can be generated by the GMX Receptor Parameterization module.

    Receptor ITP

    Input receptor parameter file (compressed). It can be generated by the GMX Receptor Parameterization module.

    Ligand GRO

    Input ligand structure file (compressed). It can be generated by the GMX Ligand Parameterization module.

    Ligand ITP

    Input ligand parameter file (compressed). It can be generated by the GMX Ligand Parameterization module.

    Output Topology

    Output topology file of the whole system.

    Output GRO

    Output structure file of the whole system.

    Output ITP

    Output parameter file of the whole system (compressed).

    Distance Restraints

    Distance restraints. They take effect only when Disre is not “no”. The format is:

    [AtomIndex1]  [AtomIndex2]  [Type]  [Index]  [Type']  [Low]  [Up1]  [Up2]  [Factor]
    

    AtomIndex1 and AtomIndex2 are atom indices in system.gro. Type is the restraint type, usually set to 1 (see Table 1 for the type codes). Index is the restraint index, which gives the calculation order. Type' is the second type column of the GROMACS distance restraint format and is usually set to 1. Low, Up1, and Up2 are the distance bounds in nm: no restraint force acts while the distance is between Low and Up1, the restraint potential is harmonic between Up1 and Up2, and it becomes linear beyond Up2. Factor is a multiplier: Factor times the “Disre Force Constant” gives the strength of the restraint, in kJ/(mol·nm^2).
    For example:

    10     16      1       0       1      0.0     0.3     0.4     1.0
    10     46      1       1       1      0.0     0.3     0.4     1.0
    16     22      1       2       1      0.0     0.3     0.4     2.5
    

    Table 1: The three restraint types used in GROMACS to restrain atom pairs

    Type Code    Restraint Type                          Application
    1            Complex NMR distance restraints         Non-bonded interactions; use 1 when Disre Type is ensemble
    6            Simple harmonic restraints              Intramolecular bonded interactions; use 6 or 10 when Disre Type is simple
    10           Piecewise linear/harmonic restraints    Intramolecular bonded interactions; use 6 or 10 when Disre Type is simple
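    The nine-column distance restraint rows shown above are easy to mistype, so a quick structural check can help before submission. The sketch below parses the example rows from this section; the class and function names are hypothetical, and the ordering check Low <= Up1 <= Up2 is an assumption about what counts as a sensible row.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DistanceRestraint:
    ai: int        # AtomIndex1
    aj: int        # AtomIndex2
    rtype: int     # first Type column
    index: int     # Index
    rtype2: int    # second Type column
    low: float     # nm
    up1: float     # nm
    up2: float     # nm
    factor: float  # multiplier for the restraint force constant


def parse_distance_restraints(text: str) -> List[DistanceRestraint]:
    """Parse whitespace-separated restraint rows, one restraint per line."""
    restraints = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 9:
            raise ValueError(f"line {number}: expected 9 fields, got {len(fields)}")
        ints = [int(v) for v in fields[:5]]
        floats = [float(v) for v in fields[5:]]
        restraint = DistanceRestraint(*ints, *floats)
        if not restraint.low <= restraint.up1 <= restraint.up2:
            raise ValueError(f"line {number}: expected Low <= Up1 <= Up2")
        restraints.append(restraint)
    return restraints


EXAMPLE = """\
10     16      1       0       1      0.0     0.3     0.4     1.0
10     46      1       1       1      0.0     0.3     0.4     1.0
16     22      1       2       1      0.0     0.3     0.4     2.5
"""

for item in parse_distance_restraints(EXAMPLE):
    print(item)
```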

    Angle Restraints

    Angle restraints restrain the angle between two pairs of atoms. They take effect only when Disre is not “no”. The format is:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Theta0]  [Force Constant]  [Multiplicity]
    

    AtomIndex1-AtomIndex2 is the first pair of atom indices and AtomIndex3-AtomIndex4 is the second pair. Type is not used here and can be set to 1. Theta0 is the restrained angle, in degrees. Force Constant is the restraint force constant, in kJ/mol. Multiplicity is the multiplicity.
    For example:

    2642     2643     2635     2652     1     67.0     1500     1
    

    Dihedral Restraints

    Dihedral restraints. They take effect only when Disre is not “no”. The format is:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Label]  [Phi]  [dPhi]  [KFactor]  [Power]
    

    AtomIndex1-AtomIndex4 are the indices of the atoms that form the dihedral. Type is the restraint function type and is always 1. Label is not used. Phi is the reference angle and dPhi is the deviation from the reference angle, both in degrees. KFactor is a multiplier: KFactor times the “Disre Force Constant” gives the strength of the restraint, in kJ/(mol·rad^2). Power is not used.
    For example:

    2642      2643      2635      2652      1      67.0      1500      1
    

    The restraint potential function is as follows:
    [Image: dihedral restraint potential function]
    Here Φ’ is the reference angle Phi, ΔΦ is the deviation from the reference angle dPhi, and K_dihr is the restraint strength KFactor.
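    Since the formula image is not reproduced here, the sketch below evaluates the flat-bottomed harmonic form that GROMACS uses for dihedral restraints: the energy is zero while the dihedral stays within dPhi of the reference angle and grows quadratically beyond that. That the module applies exactly this form is an assumption, and the function and argument names are hypothetical.

```python
import math


def dihedral_restraint_energy(phi_deg: float, phi0_deg: float,
                              dphi_deg: float, k: float) -> float:
    """Flat-bottomed harmonic dihedral restraint energy.

    phi_deg:  current dihedral angle, degrees
    phi0_deg: reference angle (Phi), degrees
    dphi_deg: allowed deviation (dPhi), degrees
    k:        force constant, kJ/(mol rad^2)
    """
    # Wrap the deviation into [-180, 180) so that the restraint is periodic
    deviation = (phi_deg - phi0_deg + 180.0) % 360.0 - 180.0
    # Only the part of the deviation outside the flat bottom is penalized
    excess = max(abs(deviation) - dphi_deg, 0.0)
    return 0.5 * k * math.radians(excess) ** 2


print(dihedral_restraint_energy(70.0, 67.0, 5.0, 1500.0))  # inside the flat bottom
print(dihedral_restraint_energy(82.0, 67.0, 5.0, 1500.0))  # 10 degrees outside it
```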

    Result Description

    The output results include:

    Output File Name     Description
    system.gro           Coordinate file of the system
    system_itp.tar.gz    Archive of system ITP files, including the forces applied to hold atomic positions fixed during equilibration
    system.top           Topology file of the system

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1-2, (2015), 19-25.

  • Name: MD RMS
    Description: The MD RMS module calculates the root mean square deviation (RMSD) and root mean square fluctuation (RMSF) of an MD simulation trajectory to analyze the structural stability of the system and how the structure changes.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    RMS

    简介

    通过计算平衡模拟轨迹的均方根偏差(RMSD,Root Mean Square Deviation)和均方根波动(RMSF,Root Mean Square Fluctuation),从而分析结构的稳定性和结构变化情况。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在**GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)**模块中获取。

    Analysis Type

    选择分析类型:RMSD或者RMSF(可多选)。

    System Group

    选择需要计算的组别。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Custom Atom

    自定义需要计算的原子名称,用逗号隔开,例如:CA,O,H。与Custom Resid是交集关系。

    Skip Time (ps)

    Index File

    索引文件,可由Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    rmsd_result.csv 所选组别的RMSD的CSV文件
    rmsd_result.png 所选组别的RMSD的PNG文件
    rmsd_result.xvg 所选组别的RMSD的XVG文件
    rmsf_*.csv 所选组别的RMSF的CSV文件
    rmsf_*.png 所选组别的RMSF的PNG文件
    rmsf_*.xvg 所选组别的RMSF的XVG文件
    bfac.pdb PDB中的B-Factor一列为原子RMSF值通过公式<Δr²> = 3B/(8π²)转换得到。

    RMS

    Introduction

    By calculating the Root Mean Square Deviation (RMSD) and Root Mean Square Fluctuation (RMSF) of equilibrium simulation trajectories, the stability and structural changes of the system can be analyzed.

    Parameter Description

    Path File

    The path file obtained after MD simulation, which can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    Analysis Type

    Select the type of analysis: RMSD or RMSF (multiple selections possible).

    System Group

    Select the group to be calculated.

    Custom Resid

    Residue numbers to include in the calculation. Use “-” for a continuous range and commas to separate non-contiguous residues, for example: 1-10,15.

    Custom Atom

    Atom names to include in the calculation, separated by commas, for example: CA,O,H. The final selection is the intersection of this list with Custom Resid.
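
    The two selection strings can be read as follows. This is a minimal sketch of the syntax described above, not the module's own parser; the function names and the (residue number, atom name) tuples are hypothetical.

```python
def parse_resid_spec(spec):
    """Expand a string such as '1-10,15' into a sorted list of residue numbers."""
    resids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            resids.update(range(start, end + 1))
        else:
            resids.add(int(part))
    return sorted(resids)

def select_atoms(atoms, resid_spec, atom_spec):
    """Keep atoms whose residue number AND atom name are both selected."""
    resids = set(parse_resid_spec(resid_spec))
    names = {name.strip() for name in atom_spec.split(",") if name.strip()}
    return [a for a in atoms if a[0] in resids and a[1] in names]

# Made-up atoms as (residue number, atom name) pairs.
atoms = [(1, "CA"), (1, "CB"), (12, "CA"), (15, "O")]
selected = select_atoms(atoms, "1-10,15", "CA,O,H")
print(selected)  # [(1, 'CA'), (15, 'O')]
```

    Only atoms that satisfy both filters survive, which is what the intersection relationship between Custom Resid and Custom Atom means.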

    Skip Time (ps)

    Index File

    Index file, which can be obtained from the Membrane Solvation module.

    Result Description

    The output results include:

    Output File Name Description
    rmsd_result.csv CSV file of RMSD for the selected group
    rmsd_result.png PNG file of RMSD for the selected group
    rmsd_result.xvg XVG file of RMSD for the selected group
    rmsf_*.csv CSV file of RMSF for the selected group
    rmsf_*.png PNG file of RMSF for the selected group
    rmsf_*.xvg XVG file of RMSF for the selected group
    bfac.pdb PDB file whose B-factor column holds the atomic RMSF values converted with the formula <Δr²> = 3B/(8π²).
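
    Rearranging <Δr²> = 3B/(8π²) gives B = (8π²/3)·RMSF². The sketch below applies that relation; the function names are hypothetical, and the length unit the module uses for RMSF (nm or Å) is not stated here, so B simply comes out in the squared unit of the input.

```python
import math

def rmsf_to_bfactor(rmsf):
    """Return B = (8*pi^2 / 3) * RMSF^2, in the squared unit of rmsf."""
    return 8.0 * math.pi ** 2 / 3.0 * rmsf ** 2

def bfactor_to_rmsf(bfactor):
    """Inverse conversion: RMSF = sqrt(3*B / (8*pi^2))."""
    return math.sqrt(3.0 * bfactor / (8.0 * math.pi ** 2))

print(round(rmsf_to_bfactor(1.0), 2))  # 26.32
```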
  • Name: AlphaAutoMD
    Description: AlphaAutoMD是一个完全自动化的gromacs MD流,允许用户提交PDB文件快速提交MD作业。该模块整合了多个计算模块,大部分参数采用默认参数。 The AlphaAutoMD module is a fully automated GROMACS molecular dynamics workflow that lets users quickly submit an MD job from a PDB file. It integrates MD PDB Prepare, Protein Protonation, GMX Receptor Parameterization, GMX Ligand Parameterization, MD Solvation, GMX MD Run, and RMS, and most parameters use default values.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    AlphaAutoMD

    简介

    提交一个pdb文件自动进行分子动力学模拟,为初步接触分子动力学模拟提供便捷操作界面。

    参数说明

    PDB File

    结构文件,PDB格式。
    需要注意体系中若存在配体,其名称不能为*号且必须以HETATM开头。同一小分子中的原子名(如下图所示位置)不能相同。不需要模拟的结构最好是删除。如下所示为正确的小分子结构文件:

    HETATM 3767  C1  GOL A 302      -4.671 -11.067  -0.429  1.00 43.56           C  
    HETATM 3768  O1  GOL A 302      -5.324  -9.793  -0.300  1.00 41.43           O  
    

    image.png

    若体系中有特殊金属原子,只能选AMBER力场。离子需要按照特定书写格式,以下为一些常见原子书写格式:

      # Mg2+离子
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+离子
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+离子
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+离子
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+离子
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+离子
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
    

    96dcbfca9ffb96541221e86f6db9c5a.jpg

    其中atom type,residue需要是大写,atom name只需是标准金属离子(可以通过文本编辑器查看书写格式是否相同)。

    Force Field

    力场文件类型:
    amber03,amber14sb_parmbsc1适合蛋白和核酸的凝聚相模拟,也支持小分子。
    gromos系列适合烷烃、蛋白、核酸凝聚相的模拟。
    注意:根据提交的pdb结构选取力场。

    Water Type

    水的类型:
    spc:最好用于GROMOS力场。
    spce:对纯水体系比SPC、TIP3P都好。
    tip3p:最好用于amber。
    tip4p:最好用于opls。
    tip5p:不适用于混合模拟。

    Simulation Time (ns)

    模拟时长,单位ns

    结果说明

    输出结果包括:

    输出文件名称 说明
    md.cpt md模拟断点文件
    md.gro md的分子坐标文件
    md.log md记录文件
    md.mdp md参数文件
    md.tpr md模拟所需的所有初始化数据(分子拓扑、初始结构等)
    mini.gro mini运行的分子坐标文件
    mini.log mini运行记录文件
    mini.mdp mini运行参数文件
    mini.tpr mini模拟运行所需的所有初始化数据(分子拓扑、初始结构等)
    npt.cpt npt模拟断点文件
    npt.gro npt的分子坐标文件
    npt.log npt记录文件
    npt.mdp npt参数文件
    npt.tpr npt模拟所需的所有初始化数据(分子拓扑、初始结构等)
    protein.pdb 体系中的蛋白PDB文件
    predict_pKa.txt 蛋白质子化记录文件
    protein_protonation.pdb 蛋白质子化PDB文件
    receptor.gro 受体的分子坐标文件
    receptor_itp.tar.gz 受体平衡模拟时固定原子位置所施加的力
    receptor.top 受体的拓扑文件
    system.gro 体系的分子坐标文件
    system_itp.tar.gz 体系平衡模拟时固定原子位置所施加的力
    system.top 体系的拓扑文件
    interaction_energy.csv 体系能量随时间变化的csv文件
    interaction_energy.png 体系能量随时间变化的png文件
    interaction_pressure.csv 体系压力随时间变化的csv文件
    interaction_pressure.png 体系压力随时间变化的png文件
    rmsd_result.csv RMSD的CSV文件
    rmsd_result.png RMSD的PNG文件
    rmsd_result.xvg RMSD的XVG文件
    rmsf_Protein.csv 蛋白RMSF的CSV文件
    rmsf_Protein.png 蛋白RMSF的PNG文件
    rmsf_Protein.xvg 蛋白RMSF的XVG文件
    path.txt 模拟轨迹文件存储路径,可用于后续分析模块的Path File输入。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    AlphaAutoMD

    Introduction

    Submit a PDB file to run a molecular dynamics simulation automatically. The module offers a convenient interface for users who are new to molecular dynamics simulation.

    Parameter Description

    PDB File

    Structure file in PDB format.
    Note that if the system contains ligands, a ligand’s name must not be “*” and its records must begin with HETATM. Atom names within the same small molecule (the column shown in the figure below) must be unique. Structures that do not need to be simulated are best deleted. The following is an example of a correct small-molecule structure:

    HETATM 3767  C1  GOL A 302      -4.671 -11.067  -0.429  1.00 43.56           C  
    HETATM 3768  O1  GOL A 302      -5.324  -9.793  -0.300  1.00 41.43           O  
    

    image.png
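
    The ligand rules above (HETATM records, a name other than “*”, unique atom names within one molecule) can be checked before upload with a short script. This is only a sketch: the function name is hypothetical, it relies on the standard fixed-column PDB layout, and the module may apply further checks of its own.

```python
from collections import Counter

def check_ligand_records(pdb_text):
    """Return a list of problems found in the HETATM records of a PDB string."""
    problems = []
    seen = Counter()
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        atom_name = line[12:16].strip()   # columns 13-16
        res_name = line[17:20].strip()    # columns 18-20
        # Identify one ligand by residue name, chain ID and residue number.
        ligand = (res_name, line[21:22], line[22:26].strip())
        if res_name == "*":
            problems.append(f"ligand name '*' is not allowed: {line[:26]}")
        seen[(ligand, atom_name)] += 1
    for (ligand, atom_name), count in seen.items():
        if count > 1:
            problems.append(f"atom name {atom_name} occurs {count} times in {ligand}")
    return problems

good = (
    "HETATM 3767  C1  GOL A 302      -4.671 -11.067  -0.429  1.00 43.56           C\n"
    "HETATM 3768  O1  GOL A 302      -5.324  -9.793  -0.300  1.00 41.43           O"
)
bad = good.replace("3768  O1 ", "3768  C1 ")  # duplicate atom name on purpose
print(check_ligand_records(good))  # []
print(len(check_ligand_records(bad)))  # 1
```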

    If the system contains special metal atoms, only the AMBER force field can be selected. Ions must be written in a specific format; the records below show the format for some common ions:

      # Mg2+ ion
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+ ion
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+ ion
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+ ion
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+ ion
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+ ion
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
    

    96dcbfca9ffb96541221e86f6db9c5a.jpg

    The atom type and residue name must be uppercase, and the atom name must be that of a standard metal ion (open the file in a text editor to check that the formatting matches the examples above).

    Force Field

    Force field types:

    • amber03 and amber14sb_parmbsc1 are suited to condensed-phase simulations of proteins and nucleic acids, and also support small molecules.
    • The gromos series is suited to condensed-phase simulations of alkanes, proteins, and nucleic acids.
      Note: Select the force field based on the submitted PDB structure.

    Water Type

    Types of water:

    • spc: best used with the GROMOS force field.
    • spce: better than SPC and TIP3P for pure water systems.
    • tip3p: best used with AMBER.
    • tip4p: best used with OPLS.
    • tip5p: not suitable for mixed simulations.

    Simulation Time (ns)

    Duration of the simulation, in ns.

    Result Description

    The output results include:

    Output File Name Description
    md.cpt Checkpoint file for the md simulation
    md.gro Molecular coordinate file for md
    md.log Log file for md
    md.mdp Parameter file for md
    md.tpr All initial data required for the md simulation (molecular topology, initial structure, etc.)
    mini.gro Molecular coordinate file for mini run
    mini.log Log file for mini run
    mini.mdp Parameter file for mini run
    mini.tpr All initial data required for the mini run (molecular topology, initial structure, etc.)
    npt.cpt Checkpoint file for the npt simulation
    npt.gro Molecular coordinate file for npt
    npt.log Log file for npt
    npt.mdp Parameter file for npt
    npt.tpr All initial data required for the npt simulation (molecular topology, initial structure, etc.)
    protein.pdb PDB file of the protein in the system
    predict_pKa.txt Record file for protein protonation
    protein_protonation.pdb PDB file for protein protonation
    receptor.gro Molecular coordinate file for the receptor
    receptor_itp.tar.gz Restraint forces applied to hold atoms in place during equilibration of the receptor
    receptor.top Topology file for the receptor
    system.gro Molecular coordinate file for the system
    system_itp.tar.gz Restraint forces applied to hold atoms in place during equilibration of the system
    system.top Topology file for the system
    interaction_energy.csv CSV file of system energy over time
    interaction_energy.png PNG file of system energy over time
    interaction_pressure.csv CSV file of system pressure over time
    interaction_pressure.png PNG file of system pressure over time
    rmsd_result.csv CSV file for RMSD
    rmsd_result.png PNG file for RMSD
    rmsd_result.xvg XVG file for RMSD
    rmsf_Protein.csv CSV file for protein RMSF
    rmsf_Protein.png PNG file for protein RMSF
    rmsf_Protein.xvg XVG file for protein RMSF
    path.txt Storage path of the simulation trajectory files; can be used as the Path File input of subsequent analysis modules.

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: Target Prediction (3DSTarPred)
    Description: Target Prediction (3DSTarPred)是基于三维形状相似度的小分子靶点预测模块,活性分子及靶点数据来源于ChEMBL数据库及PDBbind数据库。 The target prediction module is based on the similar property principle. 3D similarities are calculated against the ChEMBL database, which contains over 2 million active compounds and related target information.
    Tags: undefined
    Author: Ailin Liu
    Release: 2022-09-26 00:00:00
    Reference: NA

    Target Prediction (3DSTarPred)

    简介

    Target Prediction是一个基于三维形状相似度的靶点预测模块。采用业界权威的小分子活性数据库ChEMBL 29以及PDBbind数据库,提取配体结构信息得到配体数据库,分别得到1221364和12745个小分子。利用唯信自主开发的构象生成算法AlphaConf生成活性分子构象库,每个分子最多生成32个构象,得到配体构象库。同时从ChEMBL 29和 PDBbind中提取小分子化合物靶点信息构建靶点数据库,分别得到5298个和3121个靶点,包含所有小分子与靶点的关系。利用分子三维形状药效团比对算法,开发靶点预测模块,实现针对查询分子对活性分子构象库的三维相似度搜索,从中筛选出与查询分子三维相似的活性分子,然后利用靶点数据库中的分子-靶点关系数据,提取出靶点信息。
    image.png

    参数说明

    SDF File

    小分子结构文件,SDF格式。

    Reference Database

    靶点预测的参考数据库,
    pdb:使用PDB数据库中的配体进行3D相似性计算。
    chembl29:使用ChEMBL29中的配体进行3D相似性计算。

    Query Conformations

    每个分子的构象数。

    Similarity Threshold

    相似度阈值,取值范围在0-1之间。

    Activity Threshold

    活性阈值,取值在0~100000nM (100uM)之间。

    Ranking

    靶点列表排序方法:similarity是按照相似度值排序。overall是按照相似度值×活性值大小排序。

    结果说明

    输出结果包括:

    输出文件名称 说明
    predicted_target.csv 预测的靶点列表信息文件
    detail.csv 小分子和配体分子的相似度及活性信息文件
    overlay_1_1.sdf 小分子和配体的叠合文件

    其中predicted_target.csv包括信息如下:

    字段名称 说明
    mol_id 小分子名称
    rank 靶点排序
    pref_name 靶点名称
    accession 靶点Uniprot编号
    organism 靶点种属
    target_type 靶点类别
    similarity 相似度最大值
    standard_value 活性最大值
    overall 相似度最大值*活性最大值
    sim_ligands_cnt 相似度配体数目
    chembl_id 相似度配体编号
    overlay_sdf 叠合文件

    其中detail.csv包括信息如下:

    字段名称 说明
    chembl_id 相似度配体编号
    mol_id 小分子名称
    similarity 相似度值
    activity_value 活性值
    activity_unites 活性单位(nM)
    activity_type 活性类型(IC50/Ki/Kd/EC50/Potency)
    target_type 靶点类别
    pref_name 靶点名称
    accession 靶点Uniprot编号
    organism 靶点种属
    reference 参考文献
    standard_value pIC50值

    Target Prediction (3DSTarPred)

    Introduction

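    Target Prediction is a target prediction module based on 3D shape similarity and the similar property principle. 3D similarities are calculated against ligands from the ChEMBL database, which contains over 2 million active compounds and related target information, and from PDBbind. Ligand structures extracted from ChEMBL 29 and PDBbind give ligand databases of 1,221,364 and 12,745 small molecules, respectively. The in-house conformer generation algorithm AlphaConf builds the conformer library of these active molecules, with at most 32 conformers per molecule. Target information extracted from ChEMBL 29 and PDBbind forms the target database, with 5,298 and 3,121 targets respectively, and records every molecule-target relationship. For a query molecule, the module runs a 3D shape and pharmacophore alignment search against the conformer library, selects the active molecules that are similar in 3D, and then uses the molecule-target relationships to report the corresponding targets.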
    image.png

    Parameter

    SDF File

    Small molecule structure file in SDF format.

    Reference Database

    Reference database for target prediction:
    pdb: use ligands from the PDB database for the 3D similarity calculation.
    chembl29: use ligands from ChEMBL 29 for the 3D similarity calculation.

    Query Conformations

    Number of conformations per molecule.

    Similarity Threshold

    Similarity threshold, in the range 0 to 1.

    Activity Threshold

    Activity threshold, in the range 0 to 100000 nM (100 µM).

    Ranking

    Sorting method for the target list: similarity sorts by the similarity value; overall sorts by similarity value × activity value.
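
    The two ranking modes can be pictured with a small sketch. The target rows below are made up, the function name is hypothetical, and two points are assumptions: that the list is sorted in descending order, and that the activity value is on a scale where larger means more active (as with the pIC50-style standard_value field).

```python
targets = [
    {"pref_name": "Target A", "similarity": 0.90, "standard_value": 5.0},
    {"pref_name": "Target B", "similarity": 0.80, "standard_value": 8.0},
    {"pref_name": "Target C", "similarity": 0.85, "standard_value": 6.0},
]

def rank_targets(rows, ranking="similarity"):
    """Sort target rows in descending order by the chosen ranking key."""
    if ranking == "similarity":
        key = lambda r: r["similarity"]
    elif ranking == "overall":
        key = lambda r: r["similarity"] * r["standard_value"]
    else:
        raise ValueError(f"unknown ranking: {ranking}")
    return sorted(rows, key=key, reverse=True)

print([t["pref_name"] for t in rank_targets(targets, "similarity")])
# ['Target A', 'Target C', 'Target B']
print([t["pref_name"] for t in rank_targets(targets, "overall")])
# ['Target B', 'Target C', 'Target A']
```

    A moderately similar ligand with strong activity can therefore outrank a more similar but weakly active one under overall ranking.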

    Result

    The output includes:

    Output File Name Description
    predicted_target.csv Information on the predicted target list
    detail.csv Similarity and activity information for the query molecule and the matched ligands
    overlay_1_1.sdf Overlay (superposition) of the query molecule and a matched ligand

    predicted_target.csv contains the following information:

    Field Name Description
    mol_id Small molecule name
    rank Target rank
    pref_name Target name
    accession UniProt accession of the target
    organism Target species
    target_type Target type
    similarity Maximum similarity
    standard_value Maximum activity
    overall Maximum similarity × maximum activity
    sim_ligands_cnt Number of similar ligands
    chembl_id IDs of the similar ligands
    overlay_sdf Overlay SDF file

    detail.csv contains the following information:

    Field Name Description
    chembl_id ID of the similar ligand
    mol_id Small molecule name
    similarity Similarity value
    activity_value Activity value
    activity_unites Activity units (nM)
    activity_type Activity type (IC50/Ki/Kd/EC50/Potency)
    target_type Target type
    pref_name Target name
    accession UniProt accession of the target
    organism Target species
    reference Literature reference
    standard_value pIC50 value
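
    The relation between an activity in nM and a pIC50-style value is the usual negative base-10 logarithm of the molar concentration. The sketch below shows that standard conversion; the function name is hypothetical, and whether the module derives standard_value in exactly this way is an assumption.

```python
import math

def nm_to_pactivity(value_nm):
    """Convert an activity in nM to its negative log10 molar value (e.g. pIC50)."""
    if value_nm <= 0:
        raise ValueError("activity must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    return 9.0 - math.log10(value_nm)

print(nm_to_pactivity(100000.0))  # 4.0
print(nm_to_pactivity(10.0))      # 8.0
```

    Under this conversion the upper activity threshold of 100000 nM (100 µM) corresponds to a value of 4.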
  • Name: Scaffold Constrained Generation
    Description: 传统分子生成模型无法限制特定骨架,限制了分子生成在结构优化中的应用,该模块可以限制骨架,指定优化部位,特异性的生成全新分子库。 During the optimization of a lead series, it is common to have scaffold constraints imposed on the structure of the molecules designed. Without enforcing such constraints, the probability of generating molecules with the required scaffold is extremely low and hinders the practicality of generative models for de novo drug design.
    Tags: undefined
    Author: Maxime Langevin
    Release: 2022-08-20 00:00:00
    Reference: Langevin M, Minoux H, Levesque M, Bianciotto M. Scaffold-Constrained Molecular Generation. J Chem Inf Model. 2020 Dec 28;60(12):5637-5646.

    Scaffold Constrained Generation

    简介

    传统分子生成模型无法限制特定骨架,限制了分子生成在结构优化中的应用,Scaffold Constrained Generation是一种骨架限制的生成模型,可以限制骨架,指定优化部位,特异性的生成全新分子库。

    参数说明

    SDF File模式

    SDF File

    小分子结构文件,SDF格式。

    Draw模式

    SDF File

    使用WeDraw生成小分子结构文件,SDF格式。

    Smiles模式

    Smiles String

    输入小分子SMILES格式字段:
    CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5

    Number of Molecules

    期望生成的分子数目。

    Output File

    最终输出文件的文件名称,默认为scg_results.sdf。

    结果说明

    生成优化后的分子库的sdf文件scg_results.sdf。

    参考文献

    Langevin M, Minoux H, Levesque M, Bianciotto M. Scaffold-Constrained Molecular Generation. J Chem Inf Model. 2020 Dec 28;60(12):5637-5646.

    Scaffold Constrained Generation

    Introduction

    Traditional molecular generation models cannot be constrained to a specific scaffold, which limits their use in structure optimization. Scaffold Constrained Generation is a scaffold-constrained generative model: it keeps the scaffold fixed, lets the user specify the sites to optimize, and generates a new, targeted molecular library.

    Parameter Description

    SDF File Mode

    SDF File

    Small molecule structure file in SDF format.

    Draw Mode

    SDF File

    Generate small molecule structure file using WeDraw, in SDF format.

    SMILES Mode

    SMILES String

    Input small molecule in SMILES format:
    CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5

    Number of Molecules

    The desired number of molecules to generate.

    Output File

    The file name for the final output file, default is scg_results.sdf.

    Result Description

    The optimized molecular library is saved in an SDF file named scg_results.sdf.

    Reference

    Langevin M, Minoux H, Levesque M, Bianciotto M. Scaffold-Constrained Molecular Generation. J Chem Inf Model. 2020 Dec 28;60(12):5637-5646.

  • Name: De novo Generation (Moses)
    Description: 基于深度学习的分子生成模块,实现了多种主流的分子生成模型,包括字符级循环神经网络,变分自编码器,以及对抗自编码器。 A deep learning-based molecular generation module, which implements various mainstream molecular generation models, including character-level recurrent neural networks, variational autoencoders, and adversarial autoencoders.
    Tags: undefined
    Author: Daniil Polykovskiy
    Release: 2022-08-19 00:00:00
    Reference: Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front Pharmacol. 2020 Dec 18;11:565644.

    De novo Generation (Moses)

    简介

    De novo Generation (Moses)是基于深度学习的分子生成模块,实现了多种主流的分子生成模型,包括字符级循环神经网络,变分自编码器,以及对抗自编码器。

    参数说明

    Model

    分子生成模型,目前包含以下几种:
    char_rnn:Character-level Recurrent Neural Network(CharRNN)字符级循环神经网络。
    vae:Variational Autoencoder(VAE)变分自编码器。
    aae:Adversarial Autoencoder(AAE)对抗自编码器。

    Number of Molecules

    期望生成的分子数目。

    Seed

    采样随机数。

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.sdf 生成sdf格式分子库。
    result.csv 生成smiles格式分子库,写入csv文件中,首行列名smiles。

    参考文献

    Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front Pharmacol. 2020 Dec 18;11:565644.

    De novo Generation (Moses)

    Introduction

    De novo Generation (Moses) is a deep learning-based molecular generation module that implements various mainstream molecular generation models, including character-level recurrent neural networks, variational autoencoders, and adversarial autoencoders.

    Parameter Description

    Model

    Molecular generation model, currently includes the following:

    • char_rnn: Character-level Recurrent Neural Network (CharRNN).
    • vae: Variational Autoencoder (VAE).
    • aae: Adversarial Autoencoder (AAE).

    Number of Molecules

    The desired number of molecules to generate.

    Seed

    Random seed used for sampling.

    Result Description

    The output includes:

    Output File Name Description
    result.sdf Generated molecular library in SDF format.
    result.csv Generated molecular library in SMILES format, written to a CSV file with the column name “smiles”.
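
    Reading the generated SMILES back from result.csv needs only the standard csv module. In this sketch the function name is hypothetical and an in-memory string with two made-up molecules stands in for the real file.

```python
import csv
import io

def read_smiles(csv_file):
    """Return the non-empty values of the 'smiles' column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [row["smiles"] for row in reader if row["smiles"]]

# Stand-in for open("result.csv", newline=""); the molecules are made up.
demo = io.StringIO("smiles\nCCO\nc1ccccc1\n")
print(read_smiles(demo))  # ['CCO', 'c1ccccc1']
```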

    Reference

    Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front Pharmacol. 2020 Dec 18;11:565644.

  • Name: Protein Design (ProteinMPNN)
    Description: ProteinMPNN基于给定的骨架结构生成合理的序列。在训练过程中加入噪声提高了蛋白质结构模型的序列恢复率,并且产生的序列可以更稳健地编码它们的结构。在天然蛋白质骨架上,ProteinMPNN的序列恢复率为52.4%,高于Rosetta的32.9%。X射线晶体学、低温电镜和功能研究证明了ProteinMPNN的广泛实用性和高准确性,它成功挽救了以前用Rosetta或AlphaFold设计失败的蛋白质单体、环状同源多聚体、四面体纳米颗粒和目标结合蛋白等。本模块也集成了基于ProteinMPNN使用抗体数据微调得到的AbMPNN模型,可更好地进行抗体设计。 ProteinMPNN generates plausible sequences based on a given backbone structure. ProteinMPNN achieves a sequence recovery rate of 52.4% on natural protein scaffolds, compared to 32.9% for Rosetta. Adding noise during the training process can improve the sequence recovery rate of the protein structural model, and the resulting sequences can more robustly encode their structures. X-ray crystallography, cryo-electron microscopy, and functional studies have also demonstrated the wide applicability and high accuracy of ProteinMPNN, which has successfully rescued previously failed protein monomers, cyclic homooligomers, tetrahedral nanoparticles, and target-binding proteins designed using Rosetta or AlphaFold.
    Tags: undefined
    Author: Dauparas J, Anishchenko I, Bennett N, et al.
    Release: 2022-08-17 23:23:03
    Reference: Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science. 2022 Oct 7;378(6615):49-56.

    Protein Design (ProteinMPNN)

    简介

    ProteinMPNN是一种基于深度学习的蛋白质序列设计方法,在天然蛋白质骨架上,ProteinMPNN的序列恢复率为52.4%,而Rosetta为32.9%。在训练过程中加入噪声可以提高蛋白质结构模型的序列恢复率,并且产生的序列可以更稳健地编码它们的结构。X射线晶体学、低温电镜和功能研究也证明了ProteinMPNN的广泛实用性和高准确性,它成功挽救了以前用Rosetta或AlphaFold设计失败的蛋白质单体、环状同源多聚体、四面体纳米颗粒和目标结合蛋白等。
    image.png

    在ProteinMPNN的基础上,Exscientia提出了一种针对抗体结构进行优化的微调逆折叠模型AbMPNN,该模型在抗体序列恢复和结构稳健性方面优于通用蛋白质模型,尤其在超可变区CDR-H3环上有显著改进。
    image.png
    image.png
    image.png

    参数说明

    PDB File

    蛋白的结构文件,PDB格式。

    Chain

    指定需要设计的链,多条链用空格分割,例如:‘A,B’。

    Number of Sequences

    输出设计的序列数目。

    Sampling Temp

    氨基酸采样温度,T=0.0表示取argmax,T>>1.0表示随机采样。建议的取值为0.1、0.15、0.2、0.25、0.3。较高的值会导致更多的多样性。

    Position Type

    设计残基模式:固定(Fix,指定下一步Position中的残基在设计时保持不变)或者设计(Design,指定下一步Position中的残基可进行设计而其他未指定残基在设计时保持不变)。默认:Fix。

    Position

    可选参数,设置氨基酸序号,对设置的氨基酸根据’Position Type’选项进行固定或设计。当参数Chain设置为’A,C’ 时,此参数如果设置为 ‘1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40’ 意味着对A中的残基1 2 3…25和链C中的残基10 11 12…40进行固定或设计。
    注意:这里的氨基酸序号是从1开始,而不是PDB文件中带有的氨基酸序号。同一条链的氨基酸序号用空格分隔,不同链的氨基酸用逗号分隔。

    Omit_AAS

    可选参数,指定在生成的结果序列中不许出现的氨基酸种类。

    Other Parameter

    可选参数,可指定设计时参考的模式。具体含义如下:
    Homomer:基于同源多聚体进行序列设计;
    use_soluble_mode:基于可溶蛋白模型进行序列设计,即SolMPNN,仅使用可溶蛋白数据进行训练的MPNN模型。
    antibody_design:基于抗体优化模型AbMPNN进行序列设计,AbMPNN是使用抗体结构数据对ProteinMPNN模型进行微调得到的模型。
    以上模式都不选择时,会使用默认的ProteinMPNN模型,即使用PDB数据库的全部蛋白结构训练的模型。

    Save Probability

    MPNN预测的每个位置的概率:0为不进行预测,1为是进行预测。

    结果说明

    输出结果文件为seqs/result.fasta,里面包含最终设计的序列。
    其中序列名称:

    1. score:设计残基的概率打分(设计残基平均概率的负对数,越小表示概率值越大(越接近1),一般是越小越好)
    2. global score:序列中所有残基的概率打分(所有残基平均概率的负对数,越小表示概率值越大(越接近1),一般是越小越好)
    3. seq_recovery:序列恢复率(与原序列的相似程度),0-1之间,越高表示与原序列越相似

    参考文献

    https://github.com/dauparas/ProteinMPNN
    Robust deep learning based protein sequence design using ProteinMPNN,bioRxiv 2022.06.03.494563
    AbMPNN: https://arxiv.org/abs/2310.19513

    Protein Design (ProteinMPNN)

    Introduction

    ProteinMPNN is a deep learning-based protein sequence design method that achieves a sequence recovery rate of 52.4% on natural protein scaffolds, compared to 32.9% for Rosetta. Adding noise during the training process can improve the sequence recovery rate of the protein structural model, and the resulting sequences can more robustly encode their structures. X-ray crystallography, cryo-electron microscopy, and functional studies have also demonstrated the wide applicability and high accuracy of ProteinMPNN, which has successfully rescued previously failed protein monomers, cyclic homooligomers, tetrahedral nanoparticles, and target-binding proteins designed using Rosetta or AlphaFold.
    image.png
    Building on ProteinMPNN, Exscientia introduced AbMPNN, a fine-tuned inverse folding model optimized for antibody structures. It outperforms general protein models in antibody sequence recovery and structural robustness, with a marked improvement on the hypervariable CDR-H3 loop.
    image.png
    image.png
    image.png

    Parameter

    PDB File

    Protein structure file in PDB format.

    Chain

    Specify the chain to be designed, multiple chains are separated by spaces, for example: ‘A,B’.

    Number of Sequences

    Number of designed sequences to output.

    Sampling Temp

    Amino acid sampling temperature, T=0.0 means argmax, T>>1.0 means random sampling. The suggested values are 0.1, 0.15, 0.2, 0.25, 0.3. Higher values result in more diversity.
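
    The effect of the sampling temperature can be illustrated with generic softmax sampling. This sketch is not ProteinMPNN's own code: the function name and the logits are hypothetical, and it only shows why T=0 reduces to argmax while large T approaches uniform random sampling.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Pick an index from logits; temperature 0 falls back to argmax."""
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

print(sample_with_temperature([0.1, 2.0, 0.3], 0.0))  # 1
```

    Dividing the logits by a small T sharpens the distribution around the most likely amino acid, which is why low values give less diverse sequences.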

    Position Type

    Residue design mode: Fix (the residues listed in Position stay unchanged during design) or Design (the residues listed in Position are designed while all other residues stay unchanged). Default: Fix.

    Position

    Optional parameter listing the residue positions to fix or design, according to the ‘Position Type’ option. When Chain is set to ‘A C’ and this parameter is set to ‘1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40’, residues 1-8, 23, and 25 of chain A and residues 10-20 and 40 of chain C are fixed or designed.

    Note: Positions are counted from 1 and are not the residue numbers given in the PDB file. Positions within the same chain are separated by spaces, and the position groups of different chains are separated by commas.
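
    The pairing of chains with position groups can be sketched as follows. The function name is hypothetical, and as an assumption the chain list is accepted with either spaces or commas as separators.

```python
import re

def parse_positions(chains, positions):
    """Map each chain ID to its list of 1-based residue positions."""
    chain_ids = [c for c in re.split(r"[,\s]+", chains.strip()) if c]
    groups = [g.split() for g in positions.split(",")]
    if len(chain_ids) != len(groups):
        raise ValueError("need one comma-separated position group per chain")
    return {c: [int(p) for p in g] for c, g in zip(chain_ids, groups)}

result = parse_positions(
    "A C", "1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40"
)
print(result["A"])  # [1, 2, 3, 4, 5, 6, 7, 8, 23, 25]
print(result["C"])  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 40]
```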

    Omit_AAS

    Optional parameter specifying the types of amino acids that are not allowed to appear in the generated sequence.

    Other Parameter

    Optional parameter specifying the reference mode for design. Specific meanings are as follows:
    Homomer: Sequence design based on homo-oligomers;
    use_soluble_mode: Sequence design based on soluble protein models, namely SolMPNN, the MPNN model trained exclusively on soluble protein data.
    antibody_design: Sequence design based on the antibody optimization model AbMPNN, the model obtained by fine-tuning the ProteinMPNN model using antibody structure data.

    When none of the above options are selected, the default ProteinMPNN model will be used, which is trained on all protein structures from the PDB database.

    Save Probability

    Whether to save the per-position amino acid probabilities predicted by MPNN: 0 = do not save, 1 = save.

    Result

    The output file is seqs/result.fasta, which contains the final designed sequences.
    Each sequence name carries the following values:

    1. score: the negative log probability of the sampled amino acids, averaged over the designed residues. The smaller the value, the higher the probability (closer to 1); smaller is generally better.
    2. global score: the negative log probability of the sampled/fixed amino acids, averaged over all residues in all chains. The smaller the value, the higher the probability (closer to 1); smaller is generally better.
    3. seq_recovery: the sequence recovery rate (similarity to the original sequence), between 0 and 1; the higher the value, the more similar the design is to the original sequence.
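
    The three quantities can be reproduced on toy data as below. This is a sketch of the definitions given above rather than ProteinMPNN's implementation: the function names and sequences are made up, and recovery is computed here over all positions for simplicity.

```python
import math

def mean_neg_log_prob(probs):
    """Average of -log(p) over the given per-residue probabilities."""
    return sum(-math.log(p) for p in probs) / len(probs)

def seq_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    return sum(d == n for d, n in zip(designed, native)) / len(native)

print(seq_recovery("MKTAYIAK", "MKTVYIGK"))  # 0.75
```

    Passing only the designed positions to mean_neg_log_prob corresponds to score, and passing every position corresponds to global score; a value of 0 would mean every residue was assigned probability 1.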

    Reference

    • ProteinMPNN:
      Robust deep learning-based protein sequence design using ProteinMPNN. Science. 2022 Oct 7;378(6615):49-56.
    • AbMPNN:
      https://arxiv.org/abs/2310.19513
  • Name: Protein Design (RFDesign)
    Description: Protein Design (RFDesign)是基于RoseTTAFold进行蛋白设计(幻想和修改)的一种相当通用的方法。除了提供需进行设计的功能位点的结构和序列外,无需其他输入,而且与目前的非深度学习方法不同,不需要指定二级结构或骨架的拓扑结构,并能同时产生序列和结构。 Protein Design (RFDesign) is a fairly general method for protein design (hallucination and inpainting) based on RoseTTAFold. Apart from the structure and sequence of the functional sites to be designed, no other input is required. Unlike current non-deep-learning methods, it does not require the secondary structure or backbone topology to be specified, and it generates sequence and structure simultaneously.
    Tags: undefined
    Author: RosettaCommons
    Release: 2022-08-17 23:07:30
    Reference: Wang J, Lisanza S, Juergens D, Tischer D, Watson JL, Castro KM, Ragotte R, Saragovi A, Milles LF, Baek M, Anishchenko I, Yang W, Hicks DR, Expòsit M, Schlichthaerle T, Chun JH, Dauparas J, Bennett N, Wicky BIM, Muenks A, DiMaio F, Correia B, Ovchinnikov S, Baker D. Scaffolding protein functional sites using deep learning. Science. 2022 Jul 22;377(6604):387-394.

    Protein Design (RFDesign)

    简介

    RFDesign是基于RoseTTAFold进行蛋白设计(幻想和修改)的一种相当通用的方法。除了提供需进行设计的功能位点的结构和序列外,无需其他输入,而且与目前的非深度学习方法不同,不需要指定二级结构或骨架的拓扑结构,并能同时产生序列和结构。

    参数说明

    PDB File

    蛋白的结构文件,PDB格式

    Contigs

    定义蛋白的设计策略,指定蛋白中的哪部分被保留、移除和修改。
    如:该参数设置为 ‘A25-50,10,A61-79’ 时,
    ● ‘A25-50’ 表示从上传的PDB结构中 A25-A50的氨基酸序列和结构会保留,并复制到新产生的蛋白序列/结构中,因为A25是第一个指定的氨基酸,所以在新产生的蛋白中,将变为第一个氨基酸。
    ● ‘,10’ 表示连接到残基1-26(新蛋白中) 的氨基酸中,有10个进行修改的氨基酸,这10个氨基酸的序列和结构都将通过RFDesign的算法生成。
    ● ‘,A61-79’ 表示连接上述10个修改氨基酸的后续残基是从上传的PDB文件中复制过来的A61-A79的残基。

    Number of Designs

    设计产生的序列/结构数量

    结果说明

    输出结果包括:

    输出文件名称 说明
    result/res_0.pdb-result/res_4.pdb 设计得到的蛋白结构文件,默认生成5个结构。
    result.fasta 所有设计结构的FASTA文件。

    参考文献

    Wang J, Lisanza S, Juergens D, et al. Scaffolding protein functional sites using deep learning. Science. 2022 Jul 22;377(6604):387-394.
    https://github.com/RosettaCommons/RFDesign

    Protein Design (RFDesign)

    Introduction

    RFDesign is a fairly general method for protein design (hallucination and inpainting) based on RoseTTAFold. Apart from the structure and sequence of the functional sites to be designed, no other input is required. Unlike current non-deep-learning methods, it does not require the secondary structure or backbone topology to be specified, and it generates sequence and structure simultaneously.

    Parameter

    PDB File

    Protein structure file in PDB format

    Contigs

    Define protein design strategies, specifying which parts of the protein are kept, removed, and modified.
    For example: when this parameter is set to ‘A25-50,10,A61-79’,
    ● ‘A25-50’ means that the sequence and structure of residues A25-A50 in the uploaded PDB structure are kept and copied into the newly generated protein sequence/structure. Because A25 is the first residue specified, it becomes the first residue of the new protein.
    ● ‘,10’ means that 10 new residues follow the copied segment (residues 1-26 of the new protein); the sequence and structure of these 10 residues are generated by the RFDesign algorithm.
    ● ‘,A61-79’ means that the residues following those 10 generated residues are residues A61-A79 copied from the uploaded PDB file.
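
    The renumbering implied by a contig string can be worked out with a short sketch. The function name is hypothetical, and only the two element forms used in the example (a kept chain range and a plain residue count) are handled; any other forms the tool may accept are not covered.

```python
import re

def parse_contigs(contigs):
    """Lay out a contig string such as 'A25-50,10,A61-79' on the new chain.

    Returns (kind, source, new_start, new_end) tuples, where kind is 'kept'
    for residues copied from the input PDB and 'designed' for new residues.
    """
    layout, position = [], 1
    for part in contigs.split(","):
        part = part.strip()
        kept = re.fullmatch(r"([A-Za-z])(\d+)-(\d+)", part)
        if kept:
            length = int(kept.group(3)) - int(kept.group(2)) + 1
            kind = "kept"
        elif part.isdigit():
            length = int(part)
            kind = "designed"
        else:
            raise ValueError(f"unsupported contig element: {part!r}")
        layout.append((kind, part, position, position + length - 1))
        position += length
    return layout

for row in parse_contigs("A25-50,10,A61-79"):
    print(row)
# ('kept', 'A25-50', 1, 26)
# ('designed', '10', 27, 36)
# ('kept', 'A61-79', 37, 55)
```

    The example contig therefore yields a 55-residue protein: 26 copied residues, 10 generated residues, and 19 copied residues.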

    Number of Designs

    Number of sequences/structures generated by design

    Result

    The output includes:

    Output File Name Description
    result/res_0.pdb-result/res_4.pdb Designed protein structure files; 5 structures are generated by default.
    result.fasta FASTA file for all designed protein structures.

    Reference

    Wang J, Lisanza S, Juergens D, et al. Scaffolding protein functional sites using deep learning. Science. 2022 Jul 22;377(6604):387-394.
    https://github.com/RosettaCommons/RFDesign

  • Name: FASTA File
    Description: FASTA文件是一个用于指定fasta文件的模块,可用于其他模块的输入。会对FASTA文件的有效性进行判断。 FASTA File is a module for specifying a FASTA file that can be used as input to other modules. The validity of the FASTA file is checked.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-08-15 14:11:28
    Reference: NA

    FASTA File

    简介

    FASTA File是一个指定FASTA文件的模块,可以用于其他模块的输入。会对FASTA文件的有效性进行判断。

    参数说明

    FASTA File

    上传FASTA文件

    结果说明

    输出一个对应的FASTA文件,会对文件的有效性进行判断。

    FASTA File

    Introduction

    FASTA File is a module for specifying a FASTA file that can be used as input to other modules. The validity of the FASTA file is checked.

    Parameter

    FASTA File

    Upload a FASTA file.

    Result

    Outputs the corresponding FASTA file after checking its validity.
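
    The checks the module performs are not listed here, so the following is only a sketch of what a basic FASTA validity test might look like. The function name, the accepted alphabet, and the error messages are all assumptions.

```python
def validate_fasta(text):
    """Return (is_valid, message) for FASTA-formatted text."""
    # Assumed alphabet: amino acid codes plus ambiguity, stop and gap symbols.
    allowed = set("ACDEFGHIKLMNPQRSTVWYBXZUO*-")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return False, "first non-empty line must be a '>' header"
    has_sequence = True  # becomes False right after each header
    for ln in lines:
        if ln.startswith(">"):
            if not has_sequence:
                return False, "header without a sequence"
            has_sequence = False
        else:
            bad = set(ln.upper()) - allowed
            if bad:
                return False, f"unexpected characters: {''.join(sorted(bad))}"
            has_sequence = True
    if not has_sequence:
        return False, "last header has no sequence"
    return True, "ok"

print(validate_fasta(">seq1\nMKTAYIAK\n>seq2\nACDEFG\n"))  # (True, 'ok')
```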

  • Name: AlphaShape
    Description: 基于分子三维形状和药效团的虚拟筛选,算法在三维构象的基础上进行基于分子三维相似性的虚拟筛选。通过结合高斯函数与深度神经网络模型,计算精度领先同类型商业算法。 A molecular shape and pharmacophore-based virtual screening module. The AlphaShape algorithm performs virtual screening or protein structure search based on the three-dimensional similarity of molecules on the basis of three-dimensional conformation. By combining the Gaussian function and the deep neural network model, the calculation accuracy achieves SOTA.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-11-11 03:23:06
    Reference: X. Yan, J. Li, et al.. J. Chem. Inf. Model., 2013, 53(8), 1967–1978. X. Yan, J. Li, et al., J. Comput. Chem., 2014, 35(15), 1122-1130.

    AlphaShape

    简介

    AlphaShape(简称AlphaS)是一种构象表征与识别算法,可以基于分子的三维空间形状和药效团等药学特征比较进行高通量的虚拟筛选,可以最大化区分海量化合物中与已知活性分子相似的活性化合物(筛选的化合物库分子可使用AlphaConf进行构象生成)。也可用于蛋白质结构域匹配以指导蛋白质设计。

    通过创造性地在高斯函数表征方式之上融合深度学习技术,AlphaShape虚拟筛选的计算精度已经领先同超越主流商业算法(例如Schrodinger的Phase,OpenEye的ROCS),在DUD-E标准数据集的测试中,虚拟筛选的AUC值达到了0.837(对比Phase与ROCS的0.663及0.696)。
    通过采用高性能计算(HPC)技术,特别是NVIDIA的GPU加速技术,目前在搜索或筛选速度上都领先同领域商业软件。以小分子化合物筛选为例,使用一块GPU卡,数小时即可筛完全世界所有的现货商业化合物库的数千万分子,一天可高通量虚拟筛选上亿个化合物分子。

    目前已被多家合作药企用于虚拟筛选并成功发现生物活性分子。
    除了高精度之外,AlphaShape 还充分利用了GPU的能力。 一张GPU卡每天可以筛选大约 5000万种化合物。

    参数说明

    Private Library私有库筛选模式

    Query File

    输入查询分子文件,SDF格式

    Conformation Library

    小分子的构象库文件,由AlphaConf模块产生,AC.GZ格式

    Fragment Library

    小分子的片段库文件,由AlphaConf模块产生,AUX.GZ格式

    Top N

    输出和每个查询分子相似度排名前n个分子,默认100。

    Generate Query Conformation

    是否对输入的查询分子产生3D构象,True 表示生成,当输入分子是2D结构时可用,False表示不生成,直接使用输入分子的3D结构。

    Similarity Hits File

    输出最终相似度命中化合物的文件名称,SDF格式,默认文件名为hits.sdf

    Public Library系统公共库筛选模式

    Query File

    输入查询分子文件,SDF格式

    Public Library

    系统内置的小分子化合物数据库,可多选。

    Top N

    输出和每个查询分子相似度排名前n个分子,默认100。

    Generate Query Conformation

    是否对输入的查询分子产生3D构象,True 表示生成,当输入分子是2D结构时可用,False表示不生成,直接使用输入分子的3D结构。

    Similarity Hits File

    输出最终相似度命中化合物的文件名称,SDF格式,默认文件名为hits.sdf

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 相似度值信息,包含查询分子名称与库中分子名称。
    hits.sdf 筛选相似度最高的n个化合物。多个查询分子时,这个文件是多个查询分子命中化合物合并去重后的结果。
    result/AA-173-40757587.sdf 查询分子对应的命中化合物。每个查询分子都会生成一个对应的包含top n个命中化合物的文件

    其中result.csv,包含信息如下:

    字段名称 说明
    querymol 查询分子化合物名称
    confdb 化合物库名称
    molname 命中化合物名称
    Total Similarity 3D相似度值

    AlphaShape

    Introduction

    AlphaShape (AlphaS for short) is a conformation representation and recognition algorithm that enables high-throughput virtual screening by comparing the three-dimensional shape and pharmacophore features of molecules. It is designed to pick out, from very large compound collections, the active compounds that resemble known active molecules (conformations for the molecules in the screening library can be generated with AlphaConf). It can also be used for protein domain matching to guide protein design.

    By combining a Gaussian function representation with deep learning, AlphaShape achieves higher virtual screening accuracy than mainstream commercial algorithms such as Schrödinger’s Phase and OpenEye’s ROCS. On the DUD-E benchmark dataset, AlphaShape reached a virtual screening AUC of 0.837, compared with 0.663 for Phase and 0.696 for ROCS.

    By employing high-performance computing (HPC), in particular NVIDIA GPU acceleration, AlphaShape currently leads commercial software in the field in search and screening speed. For small-molecule screening, for example, a single GPU card can screen the tens of millions of molecules in the world's in-stock commercial compound libraries within a few hours, and high-throughput virtual screening of more than one hundred million compounds is possible in a day.

    It has been used by several partner pharmaceutical companies for virtual screening, leading to the successful discovery of bioactive molecules. In addition to its accuracy, AlphaShape makes full use of GPU capabilities: a single GPU card can screen approximately 50 million compounds per day.

    Parameter Description

    Private Library Screening Mode

    Query File

    Input file of query molecules in SDF format.

    Conformation Library

    File of conformation libraries for small molecules, generated by the AlphaConf module, in AC.GZ format.

    Fragment Library

    File of fragment libraries for small molecules, generated by the AlphaConf module, in AUX.GZ format.

    Top N

    Output the top N molecules ranked by similarity to each query molecule, default is 100.

    Generate Query Conformation

    Whether to generate 3D conformations for the input query molecules. True for generation, useful when the input molecules are in 2D structure; False for direct use of the input molecules’ 3D structures.

    Similarity Hits File

    File name for the final hit compounds based on similarity, in SDF format, default file name is hits.sdf.

    Public Library Screening Mode

    Query File

    Input file of query molecules in SDF format.

    Public Library

    System’s built-in small molecule compound database, multiple selections allowed.

    Top N

    Output the top N molecules ranked by similarity to each query molecule, default is 100.

    Generate Query Conformation

    Whether to generate 3D conformations for the input query molecules. True for generation, useful when the input molecules are in 2D structure; False for direct use of the input molecules’ 3D structures.

    Similarity Hits File

    File name for the final hit compounds based on similarity, in SDF format, default file name is hits.sdf.

    Result Description

    The output includes:

    Output File Name Description
    result.csv Information on similarity values, including query molecule names and library molecule names.
    hits.sdf Top N screened compounds based on similarity. For multiple query molecules, this file is the merged and deduplicated result of top N hit compounds for each query molecule.
    result/AA-173-40757587.sdf Hit compounds corresponding to the query molecule. A file containing the top N hit compounds is generated for each query molecule.

    In result.csv, the information includes:

    Field Name Description
    querymol Query molecule name
    confdb Compound library name
    molname Hit compound name
    Total Similarity 3D similarity value
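
    As an illustration of how result.csv might be post-processed, the following Python sketch groups hits by query molecule and keeps the most similar ones. The column names are taken from the field table above; the comma-separated layout, the sample rows, and the function name top_hits are assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the documented column layout.
SAMPLE = """querymol,confdb,molname,Total Similarity
q1,libA,mol_003,0.71
q1,libA,mol_001,0.93
q2,libA,mol_001,0.55
q1,libB,mol_104,0.88
q2,libB,mol_207,0.80
"""


def top_hits(csv_text: str, n: int) -> dict:
    """Return the n most similar (molname, confdb, similarity) hits per query molecule."""
    hits = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        hits[row["querymol"]].append(
            (row["molname"], row["confdb"], float(row["Total Similarity"]))
        )
    return {
        query: sorted(rows, key=lambda r: r[2], reverse=True)[:n]
        for query, rows in hits.items()
    }


best = top_hits(SAMPLE, 2)
for query, rows in best.items():
    print(query, rows)

# Merged, de-duplicated hit names across all queries, analogous to hits.sdf.
merged = sorted({name for rows in best.values() for name, _, _ in rows})
print(merged)
```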
  • Name: Format Conversion (RDKit)
    Description: 分子文件格式转换工具。 A molecular file format conversion tool.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-28 02:46:13
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    File Convert

    简介

    File Convert是基于RDKit对分子文件格式之间进行转换的模块。支持的输入文件格式为:SD(.sdf、.sd)、SMILES(.smi、.csv、.tsv、.txt)。支持的输出文件格式为:SD(.sdf、.sd)、SMILES(.smi)、PDB(.pdb)。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    Output File

    输出文件名。更改文件扩展名。

    结果说明

    输入SDF文件转换成SMILES格式output.smi文件。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    File Convert

    Introduction

    The File Convert module is designed to convert molecular file formats using RDKit. Supported input file formats include: SD (.sdf, .sd), SMILES (.smi, .csv, .tsv, .txt). Supported output file formats include: SD (.sdf, .sd), SMILES (.smi), PDB (.pdb).

    Parameter Description

    Input File

    Input file containing the molecular structure in SDF or SMILES format.

    Output File

    Name of the output file. Change the file extension to select the output format.

    Result Description

    For example, an input SDF file is converted to SMILES format and saved as output.smi.
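
    The relationship between file extension and format can be sketched in Python as follows. The extension lists come from the Introduction above; the function name plan_conversion is hypothetical, and the conversion itself (done by RDKit in the module) is not shown.

```python
from pathlib import Path

# Supported formats as listed in the Introduction.
INPUT_FORMATS = {
    ".sdf": "SD", ".sd": "SD",
    ".smi": "SMILES", ".csv": "SMILES", ".tsv": "SMILES", ".txt": "SMILES",
}
OUTPUT_FORMATS = {".sdf": "SD", ".sd": "SD", ".smi": "SMILES", ".pdb": "PDB"}


def plan_conversion(input_name: str, output_name: str) -> tuple:
    """Map input and output file names to (input format, output format)."""
    in_ext = Path(input_name).suffix.lower()
    out_ext = Path(output_name).suffix.lower()
    if in_ext not in INPUT_FORMATS:
        raise ValueError(f"unsupported input extension: {in_ext!r}")
    if out_ext not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output extension: {out_ext!r}")
    return INPUT_FORMATS[in_ext], OUTPUT_FORMATS[out_ext]


print(plan_conversion("ligands.sdf", "output.smi"))  # ('SD', 'SMILES')
```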

    References

    • Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
  • Name: Antibody Numbering
    Description: Antibody Numbering是抗体编号模块,用于注释抗体可变区(Fv)或恒定区(包括 Fc), 支持几乎所有主流的抗体编号规则,如可变区广泛使用的Kabat、Chothia 和 IMGT,以及恒定区主要使用的EU规则。 Antibody Numbering is a module for antibody numbering for variable regions and constant regions. Mainstream numbering schemes are supported, e.g., Kabat, Chothia, and IMGT are widely used for Fv, and EU is the most used scheme for the constant region.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-12-20 07:21:54
    Reference:

    Antibody Numbering

    简介

    编号及注释抗体可变区(Fv)或恒定区(包括 Fc)。 支持几乎所有主流的编号规则, 例如:可变区广泛使用的Kabat、Chothia 和 IMGT,以及恒定区主要使用的EU规则。

    参数说明

    Variable Region (Fv)模式

    Fasta File

    抗体序列文件,FASTA格式,支持多序列模式。

    Numbering Scheme

    编号规则,支持Kabat、Chothia、IMGT,可多选。

    Report

    是否生成包含三种编号规则的HTML文件。

    Constant Region (Fc)模式

    Fasta File

    抗体序列文件,FASTA格式,支持多序列模式。

    结果说明

    • Variable Region (Fv)模式下的输出结果包括:
    输出文件名称 说明
    results.html 抗体可变区三种编号规则的HTML文件
    output_chothia(imgt\kabat).csv 抗体可变区三种编号规则的csv文件
    output_chothia(imgt\kabat).json 抗体可变区三种编号规则的json文件

    三种不同编号规则的csv文件,包含信息如下:

    字段名称 说明
    Molecule 抗体序列名称
    chain_type 抗体链类型:重链(VH)或者轻链(VL)
    is_cdr 判断是否为CDR区
    loc 序列位置
    numbering 序列编号
    region 抗体可变区类型:CDR1、CDR2或者CDR3
    insertion 插入序列编号
    • Constant Region (Fc)模式下的输出结果包括:
    输出文件名称 说明
    output_EU.csv 抗体恒定区EU编号规则的csv文件
    output_EU.json 抗体恒定区EU编号规则的json文件

    其中output_EU.csv文件,包含信息如下:

    字段名称 说明
    Chain 抗体序列链类型
    Position 序列位置
    Eu numbering 序列EU编号
    Residue 抗体氨基酸缩写
    IgG1 Ref IgG1氨基酸缩号
    Region 抗体恒定类型:CH1、CH2、CH3、Hinge
    Mutation(IgG1) 原序列突变成IgG1的突变信息

    Antibody Numbering

    Introduction

    Numbers and annotates the antibody variable region (Fv) or constant region (including Fc). Mainstream numbering schemes are supported: Kabat, Chothia, and IMGT, which are widely used for Fv, and EU, the scheme most used for the constant region.

    Parameter

    Variable Region (Fv) Mode

    Fasta File

    Antibody sequence file in FASTA format; multiple sequences are supported.

    Numbering Scheme

    Numbering scheme to apply: Kabat, Chothia, or IMGT; multiple selections are allowed.

    Report

    Whether to generate an HTML page showing the Fv numbering and CDR regions under all three schemes.

    Constant Region (Fc) Mode

    Fasta File

    Antibody sequence file in FASTA format; multiple sequences are supported.

    Result

    • Variable Region (Fv) mode produces the following output:
    Output File Name Description
    results.html HTML page showing the Fv numbering and CDR regions under all three schemes.
    output_chothia(imgt\kabat).csv CSV file of the Fv numbering, one file per numbering scheme.
    output_chothia(imgt\kabat).json JSON file of the Fv numbering, one file per numbering scheme.

    The CSV file for each numbering scheme contains the following fields:

    Field Name Description
    Molecule Antibody sequence name
    chain_type Antibody chain type: heavy chain (VH) or light chain (VL)
    is_cdr Whether the residue lies in a CDR
    loc Sequence position
    numbering Sequence numbering
    region Antibody variable region type: CDR1, CDR2, or CDR3
    insertion Insertion sequence number
    • Constant Region (Fc) mode produces the following output:
    Output File Name Description
    output_EU.csv CSV file of the EU numbering of the constant region
    output_EU.json JSON file of the EU numbering of the constant region

    The output_EU.csv file contains the following fields:

    Field Name Description
    Chain Type of antibody sequence chain
    Position Sequence position
    Eu numbering Sequence EU numbering
    Residue Antibody amino acid abbreviation
    IgG1 Ref IgG1 amino acid abbreviation
    Region Constant Region type of antibody: CH1, CH2, CH3, Hinge
    Mutation(IgG1) Mutation that changes the original sequence into IgG1
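
    As an illustration of how the per-scheme Fv CSV files might be used, the Python sketch below collects the numbered positions of each CDR. The column names come from the Fv field table above; the sample rows, the encoding of is_cdr as True/False, the FR1 region label, and the function name cdr_positions are assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the documented column layout.
SAMPLE_CSV = """Molecule,chain_type,is_cdr,loc,numbering,region,insertion
mAb1,VH,False,25,25,FR1,
mAb1,VH,True,26,26,CDR1,
mAb1,VH,True,27,27,CDR1,
mAb1,VH,True,53,52,CDR2,A
mAb1,VL,True,24,24,CDR1,
"""


def cdr_positions(csv_text: str) -> dict:
    """Map (molecule, chain type, region) to the list of CDR position labels."""
    cdrs = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["is_cdr"].strip().lower() in ("true", "1", "yes"):
            key = (row["Molecule"], row["chain_type"], row["region"])
            # A position label is the scheme number plus any insertion code, e.g. 52A.
            cdrs[key].append(row["numbering"] + row["insertion"].strip())
    return dict(cdrs)


for key, positions in cdr_positions(SAMPLE_CSV).items():
    print(key, positions)
```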
  • Name: Molecular Docking (AutoDock-GPU)
    Description: Molecular Docking (AutoDock-GPU)是基于AutoDock的分子对接工具,采用GPU加速版本,主要用于预测分子之间的结合模式和相互作用,得到分子对接的能量和结合亲和力等信息。它还可以计算和比较多个分子之间的结合能力,用于药物分子的筛选、设计和优化。AutoDock-GPU是AutoDock4.2.6的OpenCL和Cuda加速版本,其利用可并行的LGA,从而通过在多个计算单元上并行处理配体-受体结合构象。配体文件支持的输入格式为SD(.sdf, .sd)、PDB(.pdb)和MOL(.mol)。受体结构文件支持的输入格式为PDB(.pdb)。 The Molecular Docking (AutoDock-GPU) module is a docking simulation tool used primarily to predict binding modes and interactions between molecules and obtain information such as molecular docking energy and binding affinity. It can also calculate and compare the binding abilities of multiple molecules, making it useful for drug molecule screening, design, and optimization. AutoDock-GPU is the OpenCL and Cuda accelerated version of AutoDock4.2.6, which leverages its embarrassingly parallelizable LGA by processing ligand-receptor poses in parallel over multiple compute units. The supported input formats for ligand files are SD (.sdf, .sd), PDB (.pdb), and MOL (.mol). The supported input format for receptor files is PDB (.pdb).
    Tags: undefined
    Author: Forli lab
    Release: 2022-06-08 16:00:00
    Reference: Santos-Martins D, Solis-Vasquez L, Tillack AF, et al., Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. J Chem Theory Comput. 2021 Feb 9;17(2):1060-1073. doi: 10.1021/acs.jctc.0c01006.

    Molecular Docking (AutoDock-GPU)

    简介

    Molecular Docking (AutoDock-GPU) 是一种用于分子对接模拟的工具,主要用于预测分子之间的结合模式和相互作用,评估对接的能量和结合亲和力等信息。同时,它还可用于计算和比较多个分子之间的结合能力,广泛应用于药物分子的筛选、设计和优化。AutoDock-GPU 是 AutoDock 4.2.6 的 OpenCL 和 CUDA 加速版本。它利用可并行的遗传算法(LGA),通过在多个计算单元上并行处理配体-受体的结合构象,大幅提升计算效率。
    AutoDock 使用一种半经验的自由能力场来评估对接模拟中的构象。该力场基于大量具有已知结构和抑制常数(Ki)的蛋白质-抑制剂复合物进行参数化,评分过程分为两步:

    • 评估分子内部能量:估算配体和蛋白质从未结合状态到结合构象的分子内能量变化。
    • 评估分子间能量:计算配体和蛋白质结合后的相互作用能量。

    AutoDock的自由能评分函数(ΔG)包含六种成对能量项(V),以及结合时构象熵的损失(ΔSconf):
    image.png
    其中,L 表示“配体”,P 表示“蛋白质”。

    AutoDock力场中的成对能量项V包括以下四种主要相互作用,每种相互作用的组成和贡献如下:

    • 色散/排斥势(范德华相互作用):描述原子间的吸引与排斥平衡,例如两个碳原子之间的相互作用。
    • 氢键势:描述方向性氢键的作用,例如氧原子和氢原子之间的结合能,其最低值约为 –2 kcal/mol,表明较强的结合。
    • 静电势:描述带电原子之间的静电吸引力,例如两个带相反全原子电荷的原子之间的相互作用。
    • 脱溶剂化势:模拟溶剂效应对结合自由能的影响,例如一个碳原子在不同距离上排开约 10 个水分子所产生的贡献。

    成对能量的计算公式如下:
    image.png
    其中,第一项为范德华相互作用,第二项为方向性氢键作用,第三项为静电作用,第四项为溶剂化效应。

    参数说明

    该模块存在两种对接方法Rigid Docking和Flexible Docking。

    Rigid Docking方法

    Receptor File

    上传受体蛋白文件,格式为PDB。受体蛋白被定义为刚性。

    Ligand File

    上传配体文件,当配体为一个时允许上传SDF,PDB和MOL格式,当配体为多个时(≤2000)只允许上传SDF格式。注:配体需要是三维结构,可用Small Molecule Minimization模块转换。

    Box Center

    配体结合口袋中心xyz坐标,用空格分开,例如 10.734 2.033 -11.537。

    Box Size

    配体结合口袋大小,用空格分开,例如 24 22 32。

    TopN

    指定打分前TopN小分子作为输出文件,默认为100。

    Run Pose

    每个配体与受体对接时得到的总构象数,默认为50。

    Out Pose

    每个配体与蛋白对接后输出的构象数目,默认为10。该数值应当≤“Run Pose”。

    Rotatable Bonds

    配体可旋转键大于该值时被剔除,默认为50。

    MW Threshold

    配体分子量大于该值时被剔除,默认为1000。

    Flexible Docking方法

    Receptor File

    上传受体蛋白文件,格式为PDB。受体蛋白被设置为局部柔性,柔性残基由Flexible Residues定义。

    Flexible Residues

    定义柔性残基其格式为"链名称":“氨基酸名称”“氨基酸编号”,每个氨基酸用逗号隔开,例如:“A:ALA1221,A:MET1211,A:LEU1140”。柔性氨基酸必须在口袋附近。
    其他参数相同

    结果说明

    输出结果包括:

    输出文件名称 说明
    Scores.csv 提交所有配体(≤2000)与受体的打分文件。
    output_complex_topn.tar.gz TopN小分子中每个配体与受体打分最高的复合物构象PDB文件压缩包。
    output_complex_top10.pdb 展示每个配体与受体打分最高的前十复合物构象文件(仅展示作用,不适用于后续计算)。
    output_ligand_topn.sdf 对接打分topN的配体SDF文件。
    TopNScores.csv 按照每个配体与受体对接打分最高的排序得到打分文件。
    output_complex_topn_pdbqt.tar.gz TopN小分子中每个配体与受体打分最高的复合物构象PDBQT文件压缩包。

    其中TopNScores.csv包括信息如下:

    字段名称 说明
    Ligand 对接配体名称。
    Mol Index 对接配体在原始SDF文件的编号。
    Score(kcal/mol) 对接打分,该值越低说明结合亲和力越高。
    Complex File Name 复合物文件名称。

    参考文献

    Santos-Martins D, Solis-Vasquez L, Tillack AF, et al., Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. J Chem Theory Comput. 2021 Feb 9;17(2):1060-1073.

    Molecular Docking (AutoDock-GPU)

    Introduction

    Molecular Docking (AutoDock-GPU) is a tool for molecular docking simulations, used primarily to predict binding modes and interactions between molecules and to evaluate docking energies and binding affinities. It can also calculate and compare the binding capabilities of multiple molecules, which makes it widely applicable to the screening, design, and optimization of drug molecules. AutoDock-GPU is an OpenCL- and CUDA-accelerated version of AutoDock 4.2.6. It exploits the embarrassingly parallel Lamarckian genetic algorithm (LGA), processing ligand-receptor poses in parallel across multiple compute units, which greatly improves computational efficiency.

    AutoDock employs a semi-empirical free energy force field to evaluate conformations in docking simulations. This force field is parameterized based on a large number of protein-inhibitor complexes with known structures and inhibition constants (Ki). The scoring process is divided into two steps:

    • Evaluating the internal energy of the molecules: Estimating the change in internal energy of the ligand and protein from the unbound state to the bound conformation.
    • Evaluating the intermolecular energy: Calculating the interaction energy after the ligand and protein bind.

    The free energy scoring function (ΔG) in AutoDock includes six pairwise energy terms (V), as well as the loss of conformational entropy upon binding (ΔSconf):

    [Equation image: AutoDock free energy scoring function]

    Here, L represents “ligand” and P represents “protein.”
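
    For reference, the scoring function is written as follows in the AutoDock 4.2 documentation. It is reproduced here on the assumption that AutoDock-GPU, as an accelerated version of AutoDock 4.2.6, uses the same form:

```latex
\Delta G =
  \left( V^{L-L}_{\mathrm{bound}} - V^{L-L}_{\mathrm{unbound}} \right)
+ \left( V^{P-P}_{\mathrm{bound}} - V^{P-P}_{\mathrm{unbound}} \right)
+ \left( V^{P-L}_{\mathrm{bound}} - V^{P-L}_{\mathrm{unbound}} + \Delta S_{\mathrm{conf}} \right)
```

    The six V terms are the bound and unbound energies of the ligand-ligand, protein-protein, and protein-ligand atom pairs.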

    The pairwise energy terms V in the AutoDock force field include four main types of interactions, each with its composition and contribution as follows:

    • Dispersion/Repulsion Potential (Van der Waals interactions): Describes the balance of attraction and repulsion between atoms, such as the interaction between two carbon atoms.
    • Hydrogen Bond Potential: Describes the directional hydrogen bonding interactions, such as the binding energy between an oxygen atom and a hydrogen atom, with a minimum value of approximately –2 kcal/mol, indicating a strong interaction.
    • Electrostatic Potential: Describes the electrostatic attraction between charged atoms, such as the interaction between two oppositely charged atoms carrying full atomic charges.
    • Desolvation Potential: Simulates the solvent effect on the binding free energy, for example, the contribution of a carbon atom displacing about 10 water molecules at different distances.

    The formula for calculating pairwise energy is as follows:

    [Equation image: pairwise energy V]

    Here the first term is the van der Waals (dispersion/repulsion) interaction, the second the directional hydrogen bond, the third the electrostatic interaction, and the fourth the desolvation effect.
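
    For reference, the pairwise term has the following form in the AutoDock 4.2 documentation, again on the assumption that AutoDock-GPU keeps it unchanged:

```latex
V = W_{\mathrm{vdw}} \sum_{i,j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)
  + W_{\mathrm{hbond}} \sum_{i,j} E(t) \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right)
  + W_{\mathrm{elec}} \sum_{i,j} \frac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}}
  + W_{\mathrm{sol}} \sum_{i,j} \left( S_i V_j + S_j V_i \right) e^{-r_{ij}^{2} / 2\sigma^{2}}
```

    The weights W are empirical constants fitted to experimental binding data, r_ij is the distance between atoms i and j, E(t) is the angular weighting of the hydrogen bond, q is the atomic charge, and S and V are the solvation parameter and atomic volume.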

    Parameter

    The module provides two docking methods: Rigid Docking and Flexible Docking.

    Rigid Docking Method

    Receptor File

    Upload the receptor protein file in PDB format. The receptor is set to be rigid.

    Ligand File

    Upload the ligand file. For a single ligand, SDF, PDB, and MOL formats are accepted; for multiple ligands (≤2000), only SDF is accepted. Note: ligands must be three-dimensional structures; the Small Molecule Minimization module can be used for the conversion.

    Box Center

    The xyz coordinates of the center of the binding pocket, separated by spaces. For example, 10.734 2.033 -11.537.

    Box Size

    The size of the binding pocket, separated by spaces. For example, 24 22 32.
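
    The two box parameters can be turned into explicit bounds with a few lines of Python. This is a sketch only: it assumes that Box Size is expressed in the same length units as the coordinates and that the box is centered on Box Center; the function names are hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def parse_triplet(text: str) -> Vec3:
    """Parse three space-separated numbers such as '10.734 2.033 -11.537'."""
    parts = text.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 values, got {len(parts)}: {text!r}")
    x, y, z = (float(p) for p in parts)
    return (x, y, z)


def box_bounds(center: Vec3, size: Vec3) -> Tuple[Vec3, Vec3]:
    """Return the lower and upper corners of a box centered on `center`."""
    if any(s <= 0 for s in size):
        raise ValueError("box dimensions must be positive")
    lower = tuple(c - s / 2 for c, s in zip(center, size))
    upper = tuple(c + s / 2 for c, s in zip(center, size))
    return lower, upper


def contains(bounds: Tuple[Vec3, Vec3], point: Vec3) -> bool:
    """True if the point lies inside the box (faces included)."""
    lower, upper = bounds
    return all(lo <= p <= hi for lo, p, hi in zip(lower, point, upper))


bounds = box_bounds(parse_triplet("10.734 2.033 -11.537"), parse_triplet("24 22 32"))
print(bounds)
print(contains(bounds, (12.0, 0.0, -5.0)))  # True
```

    A check like this is a quick way to confirm that a known ligand position falls inside the box before a job is submitted.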

    TopN

    Number of top-scoring ligands written to the output files; the default is 100.

    Run Pose

    Total number of poses generated for each ligand docked to the receptor; the default is 50.

    Out Pose

    Number of poses output for each ligand after docking; the default is 10. This value must be ≤ Run Pose.

    Rotatable Bonds

    Ligands with more rotatable bonds than this value are discarded; the default is 50.

    MW Threshold

    Ligands with a molecular weight above this value are discarded; the default is 1000.

    Flexible Docking Method

    Receptor File

    Upload the receptor protein file in PDB format. The receptor is set to be locally flexible and the flexible residues are defined by Flexible Residues.

    Flexible Residues

    Flexible residues are given in the format Chain:ResidueNameResidueID, with entries separated by commas, for example “A:ALA1221,A:MET1211,A:LEU1140”. Flexible residues must be located near the docking pocket.
    All other parameters are the same as in the Rigid Docking method.
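
    The Flexible Residues format can be checked before submission with a small parser such as the Python sketch below. The pattern (one-character chain, three-letter residue name, integer residue number) is inferred from the example above and is an assumption, as are the names FlexResidue and parse_flexible_residues.

```python
import re
from typing import List, NamedTuple


class FlexResidue(NamedTuple):
    chain: str
    name: str
    number: int


# Assumed entry shape: chain, colon, three-letter residue name, residue number.
_ENTRY = re.compile(r"([A-Za-z0-9]):([A-Za-z]{3})(-?\d+)")


def parse_flexible_residues(spec: str) -> List[FlexResidue]:
    """Parse 'A:ALA1221,A:MET1211' into a list of FlexResidue records."""
    residues = []
    for entry in spec.split(","):
        entry = entry.strip()
        match = _ENTRY.fullmatch(entry)
        if match is None:
            raise ValueError(f"malformed flexible residue: {entry!r}")
        chain, name, number = match.groups()
        residues.append(FlexResidue(chain, name.upper(), int(number)))
    return residues


for residue in parse_flexible_residues("A:ALA1221,A:MET1211,A:LEU1140"):
    print(residue)
```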

    Result

    The output includes:

    Output File Name Description
    Scores.csv Docking scores of all submitted ligands (≤2000) against the receptor.
    output_complex_topn.tar.gz Archive of PDB files containing the best-scoring complex conformation of each TopN ligand with the receptor.
    output_complex_top10.pdb The ten best-scoring complex conformations of each ligand with the receptor (for display only; not suitable for subsequent calculations).
    output_ligand_topn.sdf SDF file of the TopN ligands by docking score.
    TopNScores.csv Score file sorted by the best docking score of each ligand with the receptor.
    output_complex_topn_pdbqt.tar.gz Archive of PDBQT files containing the best-scoring complex conformation of each TopN ligand with the receptor.

    TopNScores.csv contains the following information:

    Field Name Description
    Ligand Ligand Name
    Mol Index Number of the ligand in the original SDF file.
    Score(kcal/mol) Docking score; the lower the value, the higher the predicted binding affinity.
    Complex File Name Complex file name

    Reference

    Santos-Martins D, Solis-Vasquez L, Tillack AF, et al., Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. J Chem Theory Comput. 2021 Feb 9;17(2):1060-1073.

  • Name: Metabolism Site Prediction
    Description: Metabolism Site Prediction模块为预测小分子被P450代谢的代谢位点。模型对小分子的每个原子进行评估被代谢的可能性,并通过打分排序。支持的小分子输入文件格式为:SD(.sdf、.sd)、SMILES(.smi)。 The Metabolism Site Prediction module predicts which sites in a molecule are most liable to metabolism by cytochrome P450. The supported small-molecule input file formats are SD (.sdf, .sd) and SMILES (.smi).
    Tags: undefined
    Author: Rydberg P
    Release: 2022-05-27 08:27:00
    Reference: Rydberg P, Gloriam DE, Olsen L. The SMARTCyp cytochrome P450 metabolism prediction server. Bioinformatics. 2010 Dec 1;26(23):2988-9.

    Metabolism Site Prediction

    简介

    Metabolism Site Prediction模块为预测小分子被P450代谢的代谢位点。模型对小分子的每个原子进行评估被代谢的可能性,并通过打分排序。支持的小分子输入文件格式为:SD(.sdf、.sd)、SMILES(.smi)。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    结果说明

    输出结果包括:

    输出文件名称 说明
    molecule_1_atomNumbers.png 原子编号图片
    molecule_1_heteroAtoms.png P450代谢酶(CYP3A4)预测结果图
    molecule_1_heteroAtoms1A2.png P450代谢酶(CYP1A2)预测结果图
    molecule_1_heteroAtoms2C19.png P450代谢酶(CYP2C19)预测结果图
    molecule_1_heteroAtoms2C9.png P450代谢酶(CYP2C9)预测结果图
    molecule_1_heteroAtoms2D6.png P450代谢酶(CYP2D6)预测结果图
    results.csv 评估被代谢可能性的csv文件
    results.html 评估被代谢可能性的html文件

    其中results.html,包含如下信息:

    Field Name Description
    Rank 排序
    Atom 原子类型和序号
    Score 最终的打分,也是排序的标准,打分越低,排名越前,被代谢的可能性越高。
    Energy 能量值,基于DFT计算以及原子匹配得到的原子激活的能量值。是打分Score的重要参考项。
    Accessibility 原子到分子中心的相对拓扑距离。

    参考文献

    Rydberg P, Gloriam DE, Olsen L. The SMARTCyp cytochrome P450 metabolism prediction server. Bioinformatics. 2010 Dec 1;26(23):2988-9.

    Metabolism Site Prediction

    Introduction

    The Metabolism Site Prediction module is used to predict the metabolism sites of small molecules by P450 enzymes. The model evaluates the likelihood of each atom in the small molecule being metabolized and ranks them based on scores. Supported input file formats for small molecules include: SD (.sdf, .sd) and SMILES (.smi).

    Parameter Description

    Input File

    Input file containing the small molecule structure in SDF or SMILES format.

    Result Description

    The output includes:

    Output File Name Description
    molecule_1_atomNumbers.png Image showing atom numbering
    molecule_1_heteroAtoms.png Prediction results for P450 enzyme (CYP3A4)
    molecule_1_heteroAtoms1A2.png Prediction results for P450 enzyme (CYP1A2)
    molecule_1_heteroAtoms2C19.png Prediction results for P450 enzyme (CYP2C19)
    molecule_1_heteroAtoms2C9.png Prediction results for P450 enzyme (CYP2C9)
    molecule_1_heteroAtoms2D6.png Prediction results for P450 enzyme (CYP2D6)
    results.csv CSV file evaluating the likelihood of metabolism
    results.html HTML file evaluating the likelihood of metabolism

    The results in results.html include the following information:

    Field Name Description
    Rank Ranking
    Atom Atom type and number
    Score Final score, also the sorting criterion. The lower the score, the higher the ranking, indicating a higher likelihood of metabolism.
    Energy Activation energy of the atom, derived from DFT calculations and assigned by atom matching. An important component of the Score.
    Accessibility Relative topological distance of the atom to the molecular center.
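
    The table does not state how Energy and Accessibility are combined into Score. The original SMARTCyp publication combines them as Score = Energy - 8 × Accessibility; the Python sketch below assumes that form, which may differ from what this module or later SMARTCyp versions apply. The atom labels and values are invented for illustration.

```python
from typing import List, Tuple

# Weight on accessibility in the original SMARTCyp score (assumed to apply here).
ACCESSIBILITY_WEIGHT = 8.0


def smartcyp_score(energy: float, accessibility: float) -> float:
    """Score = Energy - 8 * Accessibility; lower means more likely to be metabolized."""
    return energy - ACCESSIBILITY_WEIGHT * accessibility


def rank_atoms(atoms: List[Tuple[str, float, float]]) -> List[Tuple[int, str, float]]:
    """Return (rank, atom label, score) sorted from most to least likely site."""
    scored = sorted(
        ((label, smartcyp_score(e, a)) for label, e, a in atoms),
        key=lambda item: item[1],
    )
    return [(rank, label, score) for rank, (label, score) in enumerate(scored, start=1)]


# Hypothetical (atom, Energy, Accessibility) values.
example = [("C.3", 75.9, 0.75), ("C.12", 41.1, 1.0), ("N.7", 39.8, 0.5)]
for rank, label, score in rank_atoms(example):
    print(rank, label, round(score, 2))
```

    In this example C.12 ranks first even though N.7 has the lower Energy, because the higher Accessibility of C.12 lowers its Score further.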

    References

    • Rydberg P, Gloriam DE, Olsen L. The SMARTCyp cytochrome P450 metabolism prediction server. Bioinformatics. 2010 Dec 1;26(23):2988-9.
  • Name: Structure Prepare
    Description: Structure Prepare是蛋白结构准备模块,包括添加缺失原子、添加氢原子、预测残基质子化状态、根据力场添加原子电荷和半径、生成新的PDB文件。氨基酸残基的pKa值采用PROPKA预测。 Structure Prepare module prepares structures for further calculations by reconstructing missing atoms, adding hydrogens, predicting protonation states, assigning atomic charges and radii from specified force fields, and generating PDB and PQR files. PROPKA is used to predict the pKa values of ionizable groups in protein. PDB2PQR tool is used for the structure preparation.
    Tags: undefined
    Author:
    Release: 2022-05-27 14:23:48
    Reference: Dolinsky TJ, Czodrowski P, Li H, Nielsen JE, Jensen JH, Klebe G, Baker NA. PDB2PQR: expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Nucleic Acids Res. 2007 Jul;35(Web Server issue):W522-5. Dolinsky TJ, Nielsen JE, McCammon JA, Baker NA. PDB2PQR: an automated pipeline for the setup of Poisson-Boltzmann electrostatics calculations. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W665-7.

    Structure Prepare

    简介

    PDB2PQR是Nathan Baker Group开发维护的蛋白电荷处理系统,能够将pdb输入的蛋白结构输出为pqr格式带原子电荷和原子半径的文件。PDB2PQR可以使用多种力场来参数化蛋白,并且可以添加氢原子并优化氢键网络,修复残基缺失侧链,判断二硫键,计算指定pH下残基pKa来计算质子化状态。输出pqr文件还可以根据力场规范格式化残基和原子类型,用于动力学输入。

    参数说明

    Input File

    输入蛋白结构文件,PDB格式。

    Output PDB

    输出PDB文件的名称。

    Output PQR

    指定输出PQR文件的名称,PQR是一个修改后的PDB的格式文件,原子坐标后面包含了原子的电荷信息,在HETATM中包含了原子半径信息。

    Forcefield

    力场类型,支持AMBER力场和CHARMM力场。

    Output Forcefield

    使用来自给定力场的名称,支持AMBER力场和CHARMM力场。

    Titration Method

    用于计算滴定状态的方法。若pH值不为中性时,需要勾选该选项才能生效。

    pH Values

    指定的pH值环境,用于计算质子化状态使用。

    Other Parameter

    其他参数:
    --drop-water:先去掉水再处理。
    --keep-chain: 在PQR文件中保留链名。

    结果说明

    输出结果包括:

    输出文件名称 说明
    output.pdb 修复后的结构文件
    output.pqr 带原子电荷和原子半径的结构文件

    参考文献

    Dolinsky TJ, Czodrowski P, Li H, Nielsen JE, Jensen JH, Klebe G, Baker NA. PDB2PQR: expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Nucleic Acids Res. 2007 Jul;35(Web Server issue):W522-5.
    Dolinsky TJ, Nielsen JE, McCammon JA, Baker NA. PDB2PQR: an automated pipeline for the setup of Poisson-Boltzmann electrostatics calculations. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W665-7.

    Structure Prepare

    Introduction

    PDB2PQR is a protein charge-assignment tool developed and maintained by the Nathan Baker group. It converts a protein structure in PDB format into a PQR file that carries atomic charges and atomic radii. PDB2PQR can parameterize proteins with several force fields, add hydrogen atoms and optimize the hydrogen-bond network, rebuild missing side-chain atoms, identify disulfide bonds, and predict residue pKa values to assign protonation states at a specified pH. The residue and atom names in the output PQR file can also be formatted according to a force field's conventions for use as input to dynamics simulations.

    Parameter Description

    Input File

    Input protein structure file in PDB format.

    Output PDB

    Name of the output PDB file.

    Output PQR

    Name of the output PQR file. PQR is a modified PDB format in which the occupancy and temperature-factor columns are replaced by the atomic charge and the atomic radius.

    Forcefield

    Type of force field, supporting AMBER and CHARMM force fields.

    Output Forcefield

    Naming scheme for the output: residue and atom names follow the given force field. AMBER and CHARMM are supported.

    Titration Method

    Method used to calculate titration states. This option must be selected for a non-neutral pH value to take effect.

    pH Values

    Specified pH value environment used for calculating protonation states.

    Other Parameters

    Other parameters:
    --drop-water: Remove water molecules before processing.
    --keep-chain: Retain chain names in the PQR file.

    Result Description

    The output includes:

    Output File Name Description
    output.pdb Repaired structure file
    output.pqr Structure file with atomic charges and atomic radii
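
    As an illustration of how output.pqr can be read, the Python sketch below parses whitespace-separated PQR atom records and sums the charges. It assumes the usual PQR layout (record type, serial, atom name, residue name, optional chain ID, residue number, x, y, z, charge, radius); the sample lines and the names PqrAtom and parse_pqr_line are hypothetical.

```python
from typing import NamedTuple, Optional


class PqrAtom(NamedTuple):
    serial: int
    atom: str
    residue: str
    chain: str
    resnum: str
    x: float
    y: float
    z: float
    charge: float
    radius: float


def parse_pqr_line(line: str) -> Optional[PqrAtom]:
    """Parse one ATOM/HETATM record; return None for any other line."""
    tokens = line.split()
    if not tokens or tokens[0] not in ("ATOM", "HETATM"):
        return None
    # The last five fields are always x, y, z, charge, radius.
    x, y, z, charge, radius = (float(t) for t in tokens[-5:])
    head = tokens[1:-5]
    if len(head) == 5:      # chain ID present (e.g. written with --keep-chain)
        serial, atom, residue, chain, resnum = head
    elif len(head) == 4:    # chain ID omitted
        serial, atom, residue, resnum = head
        chain = ""
    else:
        raise ValueError(f"unexpected PQR record: {line!r}")
    return PqrAtom(int(serial), atom, residue, chain, resnum, x, y, z, charge, radius)


SAMPLE = """REMARK   hypothetical example records
ATOM      1  N   MET A   1      27.340  24.430   2.614  0.1592 1.8240
ATOM      2  CA  MET A   1      26.266  25.413   2.842  0.0221 1.9080
ATOM      3  C   MET     1      26.913  26.639   3.531  0.6123 1.9080
"""

atoms = [a for a in map(parse_pqr_line, SAMPLE.splitlines()) if a is not None]
print(len(atoms), round(sum(a.charge for a in atoms), 4))
```

    Summing the charges over a whole structure gives its net charge, which is a quick sanity check on the assigned protonation states.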

    References

    • Dolinsky TJ, Czodrowski P, Li H, Nielsen JE, Jensen JH, Klebe G, Baker NA. PDB2PQR: expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Nucleic Acids Res. 2007 Jul;35(Web Server issue):W522-5.
    • Dolinsky TJ, Nielsen JE, McCammon JA, Baker NA. PDB2PQR: an automated pipeline for the setup of Poisson-Boltzmann electrostatics calculations. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W665-7.
  • Name: Toxic Fragment Identification
    Description: AlphaTox模块用于识别小分子的毒效片段,从文献中收集了大量的毒效片段构成毒效片段库,利用子结构匹配方法,实现对化合物库中每个分子进行毒效片段匹配,并通过不同颜色区分。 AlphaTox is a toxicity prediction and toxicity fragment detection module for small molecules. Toxicity fragments were collected from the reported literatures.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-05-26 16:00:00
    Reference:

    Toxic Fragment Identification

    简介

    Toxic Fragment Identification模块用于识别小分子的毒效片段,从文献中收集了大量的毒效片段构成毒效片段库,利用子结构匹配方法,实现对化合物库中每个分子进行毒效片段匹配,并通过不同颜色区分。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    结果说明

    得到化合物库中与小分子毒效片段匹配的output.xlsx文件,并通过不同颜色区分毒性片段。
    output.xlsx包括如下信息:

    字段名称 说明
    Smiles 分子的smiles
    Image 分子的化学结构图片,包括毒效片段的匹配。
    MolName 分子名称
    Smarts 毒效片段的Smarts
    Bad_type 毒性类型
    BadNum 毒性数量
    Literature 参考文献
    Colors 毒效片段匹配颜色

    Bad_type毒性类型,包括如下:

    Potential_electrophilic_agents,Inpharmatica,Idiosyncratic_toxicity_(RM_formation),Non-genotoxic_carcinogenicity,Endocrine_disruption,MLSMR,AlphaScreen-HIS-FHs,AlphaScreen-FHs,Nonbiodegradable_compounds,Acute_Aquatic_Toxicity,AlphaScreen-GST-FHs,LINT,Promiscuity,LD50_mo_oral,Reactive,_unstable,_toxic,Skin_sensitization,Chelating_agents,Genotoxic_carcinogenicity,_mutagenicity,Developmental_and_mitochondrial_toxicity,PAINS,Hepatotoxicity_Nephrotoxicity,SMARTSfilter,Hepatotoxicity,Toxtree,Myelotoxicity
    

    Toxic Fragment Identification

    Introduction

    The Toxic Fragment Identification module is used to identify toxic fragments of small molecules. A large library of toxic fragments has been collected from the literature. Using a substructure matching method, this module matches toxic fragments in each molecule of the compound library and distinguishes them with different colors.

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Result Description

    The module writes output.xlsx, which lists the toxic fragments matched in each molecule of the compound library; the matched fragments are color-coded.

    The output.xlsx includes the following information:

    Field Name Description
    Smiles Molecular SMILES
    Image Chemical structure image of the molecule, including the matched toxic fragments.
    MolName Molecule name
    Smarts Toxic fragment SMARTS
    Bad_type Type of toxicity
    BadNum Number of toxicities
    Literature Literature reference
    Colors Colors for toxic fragment matches

    The Bad_type toxicity types include:

    Potential_electrophilic_agents, Inpharmatica, Idiosyncratic_toxicity_(RM_formation), Non-genotoxic_carcinogenicity, Endocrine_disruption, MLSMR, AlphaScreen-HIS-FHs, AlphaScreen-FHs, Nonbiodegradable_compounds, Acute_Aquatic_Toxicity, AlphaScreen-GST-FHs, LINT, Promiscuity, LD50_mo_oral, Reactive,_unstable,_toxic, Skin_sensitization, Chelating_agents, Genotoxic_carcinogenicity,_mutagenicity, Developmental_and_mitochondrial_toxicity, PAINS, Hepatotoxicity_Nephrotoxicity, SMARTSfilter, Hepatotoxicity, Toxtree, Myelotoxicity
    
  • Name: mRNA Optimization (AlphaRNA)
    Description: Optimize mRNA sequences for better codon usage bias and more stable secondary structures, to improve expression level, half-life, antibody titer, etc. Supported species include human, mouse, rat, and pig. When a UTR is provided: if fixed UTR is not selected, the UTR and CDS are optimized together; if fixed UTR is selected, only the CDS region is optimized, but the CDS will still be paired with the UTR as far as possible. To optimize the CDS alone, without pairing it to a UTR, input only the CDS and omit the UTR.
    Author: WECOMPUT
    Release: 2022-05-17 07:01:58
    Reference:

    mRNA Optimization (AlphaRNA)

    简介

    AlphaRNA是Wecomput开发的程序,可以有效地共同优化CAI(Codon Adaption Index)和MFE(Minimum free energy)/AUP(Average unpaired probability)。AlphaRNA提供了一种基于DFA图进行Motif约束的方法,该方法在不明显增加计算量的同时,隐式地将约束加入到密码子优化地过程中以获得更好的密码子偏好性和更稳定的二级结构,以优化其表达量和半衰期、抗体滴度等。可以支持任意数量和长度的序列。
    image.png

    参数说明

    Amino acid sequence of CDS/ORF

    所需要优化的编码区氨基酸序列,例如:

    MDIDPYKEFGATVELLSFLPSDFFPSVRDLLDTASALYREALESPEHCSPHHTALRQAIL
    

    Enzyme restrictions

    要限制(避免出现在优化后序列中)的酶切位点,可多选。

    Motif restrictions

    需要限制的Motif序列,可指定多个,可手动输入不在列表中的新序列,使用空白符分隔。

    Weights of CAI

    CAI的lambda系数,正值越大能够调大结果中的CAI, 可选择多个。

    Weights of GCR

    GCR的lambda系数,正值越大能够调大结果中的GCR, 可选择多个。

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    lambda_cai CAI的lambda系数
    lambda_gcr GCR的lambda系数
    full_sequence 优化后的序列
    CAI 密码子适应指数
    AUP 平均未配对率
    MFE Structure 最小自由能二级结构
    dG(MFE)[kcal/mol] 最小自由能

    mRNA Optimization (AlphaRNA)

    Introduction

    AlphaRNA is a program developed by Wecomput that efficiently co-optimizes the Codon Adaptation Index (CAI) and the minimum free energy (MFE) / average unpaired probability (AUP). It provides a DFA-graph-based method for motif-constrained codon optimization: constraints are incorporated implicitly into the codon optimization process without significantly increasing the computational cost. The goal is better codon usage bias and more stable secondary structures, which in turn improve expression level, half-life, antibody titer, etc. Any number of sequences of any length is supported.
    image.png

    Parameter

    Amino acid sequence of CDS/ORF

    The amino acid sequence of the coding region that needs to be optimized, for example:

    MDIDPYKEFGATVELLSFLPSDFFPSVRDLLDTASALYREALESPEHCSPHHTALRQAIL
    

    Enzyme restrictions

    Restriction enzyme sites to exclude from the optimized sequence. Multiple sites can be selected.

    Motif restrictions

    Motif sequences to exclude from the optimized sequence. Multiple motifs can be specified, and sequences that are not in the list can be entered manually, separated by whitespace.

    Weights of CAI

    The lambda coefficient of CAI. A larger positive value increases the CAI of the result. Multiple values can be selected.

    Weights of GCR

    The lambda coefficient of GCR. A larger positive value increases the GCR of the result. Multiple values can be selected.

    Result

    The output file is result.csv and contains the following information:

    Field Name Description
    lambda_cai Lambda coefficients of CAI
    lambda_gcr Lambda coefficients of GCR
    full_sequence The optimized sequence
    CAI Codon adaptation index
    AUP Average unpaired probability
    MFE Structure The minimum free energy structure
    dG(MFE)[kcal/mol] The value of the minimum free energy
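To make the CAI column more concrete, here is a minimal sketch of the standard CAI definition: the geometric mean of the relative-adaptiveness weights of the codons in a coding sequence. The weight table `TOY_WEIGHTS` and the function name `cai` are hypothetical and only for illustration. AlphaRNA's own species-specific codon tables and implementation are not described here.

```python
import math

# Hypothetical relative-adaptiveness weights (codon frequency divided by the
# frequency of the most-used synonymous codon). Real values depend on the species.
TOY_WEIGHTS = {"GCC": 1.0, "GCT": 0.25, "AAG": 1.0, "AAA": 0.5}


def cai(cds, weights):
    """Return the geometric mean of the weights of the codons in `cds`."""
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA alphabets
    if not cds or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a non-zero multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    log_sum = sum(math.log(weights[codon]) for codon in codons)
    return math.exp(log_sum / len(codons))


print(round(cai("GCCGCU", TOY_WEIGHTS), 3))  # one optimal and one rare codon
```

A sequence built only from the most-used codons scores 1.0, and every rarer codon pulls the value down, which is why a larger CAI weight pushes the optimizer toward frequent codons.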
  • Name: Extract Fv Sequence
    Description: Extract Fv Sequence是从抗体全长序列中提取Fv区序列的工具。 Extract Fv Sequence is a tool for Extracting the Fv region sequence from antibody full-length sequence.
    Author: WECOMPUT
    Release: 2022-05-16 11:18:14
    Reference:

    Extract Fv Sequence

    简介

    Extract Fv Sequence是从抗体全长序列中提取Fv区序列的工具。

    参数说明

    Antibody Sequence File

    抗体全长序列文件,FASTA格式

    Output File

    指定输出抗体Fv序列文件的名称,FASTA格式

    结果说明

    得到仅含Fv区域的序列FASTA文件Fv.fasta。

    Extract Fv Sequence

    Introduction

    Extract Fv Sequence is a tool used to extract the Fv region sequence from the full-length antibody sequence.

    Parameter Description

    Antibody Sequence File

    The full-length antibody sequence file in FASTA format.

    Output File

    Specify the name of the output file for the antibody Fv sequence in FASTA format.

    Result Description

    The output is a FASTA file, Fv.fasta, containing only the Fv region sequence.

  • Name: RNA Secondary Structure Prediction
    Description: 使用动态编程算法预测单链RNA或DNA序列的二级结构,返回单一的最佳结构和最低自由能。 Predict secondary structures of single-stranded RNA or DNA sequences using dynamic programming algorithms which yield a single optimal structure and the minimum free energy.
    Author: Zuker & Stiegler
    Release: 2022-04-29 08:00:00
    Reference: Lorenz R, Bernhart SH, Höner Zu Siederdissen C, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011 Nov 24;6:26.

    RNA Secondary Structure Prediction

    简介

    使用动态编程算法预测单链RNA或DNA序列的二级结构,返回单一的RNA最佳结构和最低自由能。

    RNA二级结构符号说明

    长度为n的序列上的结构由相等长度的括号和点组成的字符串表示。i和j之间的碱基对用“(”在i和“)”在在j位置表示,未配对的碱基用“.”表示。如下为RNA二级结构表示方式。

      (((..((((...)))).))) 
    

    与之对应的RNA二级结构图为:
    image.png

    参数说明

    RNA Sequence File

    RNA序列文件,FASTA格式。

    Output File

    输出文件名称。

    结果说明

    输出结果包括:

    输出文件名称 说明
    output.txt RNA序列二级结构的文本文件,其中包括序列、最佳二级结构以及与其对应的最小自由能(kcal/mol)。
    SeqN_2D.png 第N条RNA序列对应的二级结构图

    参考文献

    Lorenz R, Bernhart SH, Höner Zu Siederdissen C, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011 Nov 24;6:26.

    RNA Secondary Structure Prediction

    Introduction

    Predicts the secondary structure of a single-stranded RNA or DNA sequence with a dynamic programming algorithm, returning a single optimal structure and its minimum free energy.

    RNA Secondary Structure Symbols

    The secondary structure of a sequence of length n is written as a string of the same length made up of parentheses and dots. A base pair between positions i and j is represented by “(” at position i and “)” at position j, and an unpaired base is represented by “.”. For example:

    (((..((((...)))).))) 
    

    The corresponding RNA secondary structure diagram is shown below:
    image.png
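The notation described above can be decoded into explicit base pairs with a simple stack. The sketch below is illustrative only: the function name `base_pairs` is hypothetical, and it handles just the three symbols described in this section.

```python
def base_pairs(structure):
    """Return the 1-based (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)  # remember where the pair opens
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))  # close the innermost open pair
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(base_pairs("(((..((((...)))).)))"))
# [(1, 20), (2, 19), (3, 18), (6, 16), (7, 15), (8, 14), (9, 13)]
```

Each “(” is matched with the nearest unmatched “)” to its right, so the example string encodes a three-pair outer stem enclosing a four-pair hairpin.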

    Parameter Description

    RNA Sequence File

    RNA sequence file in FASTA format.

    Output File

    Name of the output file.

    Result Description

    The output results include:

    Output File Name Description
    output.txt Text file of the RNA sequence’s secondary structure, including the sequence, best secondary structure, and the corresponding minimum free energy (kcal/mol).
    SeqN_2D.png Secondary structure diagram for the Nth RNA sequence

    Reference

    Lorenz R, Bernhart SH, Höner Zu Siederdissen C, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011 Nov 24;6:26.

  • Name: RNA 3D Structure Prediction
    Description: 在给定二级结构和实验限制的情况下,从头预测RNA的三维结构模型(可长达约 300 nts )。除了要预测的 RNA 序列外,您还需要提供一个描述二级结构的文件:具有以圆点符号表示的二级结构的文本文件。 Build three-dimensional de novo models of RNAs of sizes up to ~300 nts, given secondary structure and experimental constraints. Besides the RNA sequence to predict, you also need to provide a secondary structure file: a text file with secondary structure described in the dot-parentheses notation.
    Author: Cheng, C.Y., Chou, F.-C., and Das, R.
    Release: 2022-04-30 00:00:00
    Reference: Cheng CY, Chou FC, Das R. Modeling complex RNA tertiary folds with Rosetta. Methods Enzymol. 2015;553:35-64.

    RNA 3D Structure Prediction

    简介

    RNA 3D Structure Prediction是基于Rosetta中的RNA结构建模算法是基于现有RNA晶体结构的短片段(1到3个核苷酸)的组装,其序列与目标RNA的子序列相匹配。RNA片段组装(Fragment Assembly of RNA, FARNA)算法是一个蒙特卡洛过程,由一个低分辨率的基于知识的能量函数指导。然后,这些模型可以在全原子力场下进一步完善,以产生更真实的结构。由此产生的能量也能更好地区分原生构象和非原生构象。该计算方法被称为FARFAR(RNA片段组装与全原子细化)。

    参数说明

    Input File

    从5’到3’的序列。通常用小写字母,但大写字母是可以接受的,并且会被转换。支持多条序列同时生成3D结构。

    Secstru File

    点括号表示RNA二级结构文件。可以通过模块“RNA Secondary Structure Prediction”获取。
    RNA二级结构文件,文本格式,例如:

    >a
    auauccccauauaucccauauauccccgcgcgucccgcgc
    ........((((((...))))))....(((((...))))) ( -6.60)
    >b
    aaauccccauauaucccauauauccccgcgcgucccgcgc
    ........((((((...))))))....(((((...))))) ( -6.60)
    

    结果说明

    得到RNA结构的PDB文件S_000001.pdb。

    参考文献

    Cheng CY, Chou FC, Das R. Modeling complex RNA tertiary folds with Rosetta. Methods Enzymol. 2015;553:35-64.

    RNA 3D Structure Prediction

    Introduction

    RNA 3D Structure Prediction uses the RNA structure modeling algorithm in Rosetta, which assembles short fragments (1 to 3 nucleotides) taken from existing RNA crystal structures whose sequences match subsequences of the target RNA. The Fragment Assembly of RNA (FARNA) algorithm is a Monte Carlo process guided by a low-resolution, knowledge-based energy function. The resulting models can then be refined in a full-atom force field to produce more realistic structures, and the refined energies better distinguish native from non-native conformations. The combined method is known as FARFAR (Fragment Assembly of RNA with Full Atom Refinement).

    Parameter Description

    Input File

    Sequence(s) from 5’ to 3’. Typically in lowercase letters, but uppercase letters are acceptable and will be converted. Supports generating 3D structures for multiple sequences simultaneously.

    Secstru File

    RNA secondary structure file in dot-bracket notation. This can be obtained using the “RNA Secondary Structure Prediction” module.
    Example RNA secondary structure file in text format:

    >a
    auauccccauauaucccauauauccccgcgcgucccgcgc
    ........((((((...))))))....(((((...))))) ( -6.60)
    >b
    aaauccccauauaucccauauauccccgcgcgucccgcgc
    ........((((((...))))))....(((((...))))) ( -6.60)
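Before submitting a job, it can help to check that each record in the secondary structure file is consistent. The sketch below is a hypothetical pre-check, not part of the module. It assumes each record is exactly three lines (a `>` header, the sequence, and the structure optionally followed by an energy in parentheses), as in the example above.

```python
import re

# Trailing energy such as "( -6.60)" at the end of the structure line.
ENERGY = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*\)\s*$")


def parse_secstru(text):
    """Parse records of the form: '>name', sequence line, structure line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per record")
    records = []
    for i in range(0, len(lines), 3):
        header, sequence, third = lines[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"missing '>' header: {header!r}")
        structure = third.split()[0]  # dot-bracket string comes first
        match = ENERGY.search(third)
        energy = float(match.group(1)) if match else None
        if len(structure) != len(sequence):
            raise ValueError(f"{header}: sequence and structure lengths differ")
        if structure.count("(") != structure.count(")"):
            raise ValueError(f"{header}: unbalanced brackets")
        records.append({"name": header[1:], "sequence": sequence,
                        "structure": structure, "energy": energy})
    return records


EXAMPLE = """>a
auauccccauauaucccauauauccccgcgcgucccgcgc
........((((((...))))))....(((((...))))) ( -6.60)
>b
aaauccccauauaucccauauauccccgcgcgucccgcgc
........((((((...))))))....(((((...))))) ( -6.60)
"""

for record in parse_secstru(EXAMPLE):
    print(record["name"], len(record["sequence"]), record["energy"])
```

The length and bracket checks catch the most common formatting slips, such as a structure line copied from a different sequence.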
    

    Result Description

    The output is the PDB file of the predicted RNA structure, S_000001.pdb.

    Reference

    Cheng CY, Chou FC, Das R. Modeling complex RNA tertiary folds with Rosetta. Methods Enzymol. 2015;553:35-64.

  • Name: Immunogenicity Prediction (AlphaMHC v2.0)
    Description: AlphaMHC算法采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条公开及私有的与免疫原性相关的湿实验数据(包括亲和力数据、NGS数据、质谱数据等)进行训练,成功实现了从序列到临床免疫原性风险的端到端的预测,并通过上百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)进行验证,AlphaMHC能够准确区分免疫原性的高低,ROC-AUC达0.87,准确性超过80%(部分测试集高达91%),表现出比现有方法显著更优的预测性能,是已知唯一一个可以得到高质量临床数据验证的算法。 注:推荐在WeSeq序列编辑器中调用此功能(Immunogenicity按钮),可以在序列中直观看到T细胞表位的位置。 The AlphaMHC algorithm utilizes popular NLP natural language processing technology and a novel multimodal fusion deep neural network architecture. It integrates nearly one billion publicly and privately available wet lab experimental data related to immunogenicity (including affinity data, NGS data, mass spectrometry data, etc.) for training. It successfully achieves end-to-end prediction of immunogenicity risk from sequence to clinical application and has been validated using over a hundred clinical real-world immunogenicity data from FDA and EMA (including mono-/multi-specific antibodies and recombinant proteins). AlphaMHC can accurately distinguish between high and low immunogenicity, with an ROC-AUC of 0.87 and an accuracy of over 80% (up to 91% for some test sets). It exhibits significantly superior predictive performance compared to existing methods and is the only algorithm known to have been validated with clinical data.
    Author: WECOMPUT
    Release: 2022-05-03 13:53:09
    Reference:

    Immunogenicity Prediction (AlphaMHC v2.0)

    简介

    AlphaMHC是唯信计算为解决现有预测方法的已知问题而开发的下一代免疫原性预测算法,采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条公开及私有的与免疫原性相关的湿实验数据(包括亲和力数据、NGS数据、质谱数据等)进行训练,成功实现了从序列到临床免疫原性风险的端到端的预测,并通过上百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)进行验证,AlphaMHC能够准确区分免疫原性的高低,ROC-AUC达0.87,准确性超过80%(部分测试集高达91%),表现出比现有方法显著更优的预测性能,是已知唯一一个可以得到临床数据验证的算法。
    F13.png

    算法特点:

    • 显着扩展的训练集空间。除了公开可用的数据集外,我们还从文献、专利和湿实验室合作者那里收集了更多数据。除了最常用的亲和力数据外,还考虑了更多的数据类型,例如T细胞激活数据、蛋白质组学数据、抗体测序数据等,它们贡献了超过10亿个数据条目/点。
    • 与仅预测MHC肽结合亲和力的大多数其他算法不同,AlphaMHC 预测临床水平的最终免疫原性,同时考虑除肽结合之外的其他重要影响因素,例如免疫呈递/耐受性、HLA等位基因频率等。
    • 针对上千个MHC-II型等位基因训练深度神经网络模型。在并行计算的支持下,所有支持的 MHC 等位基因都可以以高通量的方式同时计算。
    • 基于独家收集的高质量临床ADA数据集进行验证和优化

    参数说明

    Fasta File

    蛋白序列文件,FASTA格式。支持多条链以及多分子模式。

    请注意按下面的规则来书写序列名,因为目前免疫原性风险的评分是以整个分子为单位的,链名会影响到程序区分同个分子的多条链,并影响对于分子总的风险评级(risk per molecule),但不影响对链的TCE的识别。

    对于多条链的分子,序列名称应写为:分子名.链名,".“之前是分子名,”.“之后是链名,同个分子的不同链,只要”."之前的分子名保持一致就可以了,链名随意,顺序不限。

    例如,下面mol1是常见的单抗,mol2是多抗:

    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    
    >mol2.L1
    XXXXXXX
    >mol2.H1
    XXXXXXX
    >mol2.L2
    XXXXXXX
    >mol2.H2
    XXXXXXX
    

    HLA Allotypes

    预测HLA等位基因型。
    rep:32个代表性等位基因型,适用于一般人群。
    all:用于训练的所有非冗余人类等位基因型(1166个)。

    一般推荐使用默认的"rep",因为免疫原性的风险评分(risk)是基于rep的代表性HLA来确定的。

    Binding Affinity Profile

    导出每个 HLA 等位基因的结合亲和力曲线图,展示了与每条蛋白质链的 N 端到 C 端的所有15肽的结合亲和力。注意:即使“HLA Allotypes”选项设置为全部,也只会绘制代表性 HLA的曲线。

    结果说明

    输出结果包括:

    输出文件名称 说明
    score_immunogenicity_risk.csv 该结果展示了预测的每个分子的免疫原性风险(自动将同个分子的多条链的预测的潜在T细胞表位的结果进行汇总后综合评估所得)。
    detail_tce_of_chains.csv 该结果评估可以进行定向改造的HLA呈递表位,以降低免疫原性。
    BAProfile_of_mol.chain.png 不同HLA亚型与每条链的不同位置的亲和力的分布情况,更精细的展示了不同HLA的亲和力的差异。 从左到右的分布图表示从其中一条蛋白质链的N末端移动到C末端的15聚肽窗口的结合亲和力。 即使“HLA同种异型”选项设置为“全部”,也只会包括代表性的HLA等位基因。
    Heatmap_of_mol.chain.png 每个肽与代表性HLA之间结合亲和力的热图。Z-score是pAffinity,值越大(浅色)意味着预测结合越强。

    其中score_immunogenicity_risk.csv包括信息如下:

    字段名称 说明
    Protein_Id 蛋白序列名称
    Risk 预测的分子整体风险评估,高风险的分子为high,否则为low。
    Score 表位总长度,是整体风险评估的重要依据。
    TCE_Sequences 表位序列

    其中detail_tce_of_chains.csv包括信息如下:

    字段名称 说明
    Sequences 蛋白序列名称
    TCE 每条链的相对的高风险的T细胞表位
    Alleles_Number 递呈的HLA亚型数
    Alleles 递呈的HLA亚型
    Min_Affinity 亲和力最小值
    Median_Affinity 亲和力中位数
    Max_Affinity 亲和力最大值

    Immunogenicity Prediction (AlphaMHC v2.0)

    Introduction

    AlphaMHC is a next-generation immunogenicity prediction algorithm developed by Wecomput to address known issues with existing prediction methods. It uses natural language processing (NLP) techniques and a new multimodal fusion deep neural network architecture, and is trained on nearly one billion public and private wet-lab data points related to immunogenicity, including affinity data, NGS data, and mass spectrometry data. It provides end-to-end prediction from sequence to clinical immunogenicity risk and has been validated on more than one hundred real-world clinical immunogenicity records from the FDA and EMA, covering mono- and multi-specific antibodies and recombinant proteins. AlphaMHC distinguishes high from low immunogenicity with an ROC-AUC of 0.87 and an accuracy above 80% (up to 91% on some test sets), a significant improvement over existing methods. It is the only known algorithm that has been validated with clinical data.
    F13.png

    Feature highlights

    • Significantly expanded training set. In addition to publicly available data sets, we have collected data from the literature, patents, and wet-lab collaborators. Beyond the commonly used affinity data, further data types are considered, e.g., T cell activation data, proteomics data, and antibody sequencing data, which together contribute over 1 billion data entries/points.
    • Unlike most other algorithms, which predict only the MHC-peptide binding affinity, AlphaMHC predicts the eventual immunogenicity at the clinical level, taking into account other important factors besides peptide binding, such as immune presentation/tolerance and HLA allele frequency.
    • A deep neural network model is trained for up to 5000+ MHC-II alleles. With the support of parallel computing, all supported MHC alleles can be calculated simultaneously in a high-throughput manner, whereas similar methods can usually afford only a few representative alleles within a reasonable time.
    • Validated and optimized on an exclusively collected, high-quality clinical ADA data set.

    Parameter

    Fasta File

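    Protein sequence file in FASTA format. Multiple chains and multi-molecule mode are supported. Follow the naming rule below when writing sequence names: the immunogenicity risk is currently scored per molecule, so the names determine how the program groups the chains of one molecule. This affects the overall risk rating (risk per molecule) but not the identification of T cell epitopes (TCE) in each chain. For a multi-chain molecule, write each sequence name as molecule name.chain name: the part before “.” is the molecule name and the part after it is the chain name. Chains of the same molecule only need to share the same molecule name; chain names are arbitrary and may appear in any order. For example: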

    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    >mol2.A
    XXXXXXX
    >mol2.B
    XXXXXXX
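The naming rule can be checked before submission by grouping the FASTA headers by molecule. The sketch below is a hypothetical helper, not the module's own parser. It assumes the molecule name is everything before the first “.” and that a header without “.” is a single-chain molecule with an empty chain name.

```python
def group_chains(fasta_text):
    """Map each molecule name to the list of its chain names, in file order."""
    molecules = {}
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line.startswith(">"):
            continue  # sequence lines are not needed for grouping
        fields = line[1:].split()
        if not fields:
            raise ValueError("empty FASTA header")
        molecule, _, chain = fields[0].partition(".")
        molecules.setdefault(molecule, []).append(chain)
    return molecules


FASTA = """>mol1.A
XXXXXXX
>mol1.B
XXXXXXX
>mol2.L1
XXXXXXX
>mol2.H1
XXXXXXX
"""

print(group_chains(FASTA))
# {'mol1': ['A', 'B'], 'mol2': ['L1', 'H1']}
```

If a chain ends up under an unexpected molecule name, its epitopes would be counted toward the wrong molecule-level risk rating, so a quick look at this grouping is worthwhile.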
    

    HLA Allotypes

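    The set of HLA allotypes to predict against. “rep” is generally recommended: it is faster, and the immunogenicity risk score is determined from the representative HLA set.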
    rep: 32 representative allelic types, applicable to the general population.
    all: all non-redundant human allele types used for training (1166).

    Binding Affinity Profile

    Exports a binding affinity profile plot for each HLA allele, showing the binding affinity of every 15-mer peptide from the N-terminus to the C-terminus of each protein chain. Note: even if the “HLA Allotypes” option is set to “all”, profiles are plotted only for the representative HLAs.

    Result

    The output includes:

    Output File Name Description
    score_immunogenicity_risk.csv The predicted immunogenicity risk of each molecule, obtained by aggregating the predicted potential T cell epitopes from all chains of the same molecule and evaluating the overall risk.
    detail_tce_of_chains.csv The HLA-presented epitopes that can be targeted for engineering to reduce immunogenicity.
    BAProfile_of_mol.chain.png The distribution profile of the binding affinity between each chain and the 32 representative HLAs. From left to right, the profile shows the binding affinity of a 15-mer peptide window moving from the N-terminus to the C-terminus of one protein chain. Only representative HLA alleles are included, even if the “HLA Allotypes” option is set to “all”.
    Heatmap_of_mol.chain.png The heat map of the binding affinity between each peptide and the representative HLAs. The Z-score is pAffinity; a greater value (lighter color) means stronger predicted binding.

    score_immunogenicity_risk.csv contains the following information:

    Field Name Description
    Protein_Id Protein sequence name
    Risk The overall risk assessment for the predicted molecule, with “high” indicating high-risk molecules and “low” indicating low-risk molecules.
    Score The total length of the epitopes, which is an important basis for overall risk assessment.
    TCE_Sequences The epitope sequences
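As an illustration of how this file might be consumed downstream, the sketch below lists the molecules rated “high”. It assumes a comma-separated file whose header row uses the field names above. The sample rows in `SAMPLE` are made up for the example and are not real predictions.

```python
import csv
import io

# Hypothetical file content, for illustration only.
SAMPLE = """Protein_Id,Risk,Score,TCE_Sequences
mol1,high,45,XXXXXXXXXXXXXXX
mol2,low,0,
"""


def high_risk_molecules(csv_file):
    """Return the Protein_Id of every row whose Risk is 'high'."""
    reader = csv.DictReader(csv_file)
    return [row["Protein_Id"] for row in reader
            if row["Risk"].strip().lower() == "high"]


print(high_risk_molecules(io.StringIO(SAMPLE)))  # ['mol1']
```

With a real result file, pass `open("score_immunogenicity_risk.csv", newline="")` instead of the in-memory sample.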

    detail_tce_of_chains.csv contains the following information:

    Field Name Description
    Sequences Protein sequence name
    TCE Relatively high-risk T cell epitopes of each chain
    Alleles_Number Number of HLA subtypes presented
    Alleles The HLA subtypes presented
    Min_Affinity Minimum affinity
    Median_Affinity Median affinity
    Max_Affinity Maximum affinity
  • Name: Codon Optimization
    Description: Codon Optimization可用于密码子优化(基于PCR的基因合成的自动寡核苷酸设计)。整个基因组序列的可用性极大地增加了蛋白质靶标的数量,其中许多需要在原始DNA来源以外的细胞中过度表达。合成基因可以针对表达进行优化,并构建为易于突变操作而无需考虑亲本基因组。然而,合成基因的设计和构建,尤其是那些编码大蛋白质的基因,可能是一个缓慢、困难和令人困惑的过程。该模块通过基于PCR的方法自动设计用于基因合成的寡核苷酸。 Codon optimization can be used for optimizing codons (i.e., the genetic code) for the automated oligonucleotide design of gene synthesis based on PCR. The availability of whole genome sequences has greatly increased the number of protein targets, many of which need to be overexpressed in cells outside of their native DNA source. Synthetic genes can be optimized for expression and constructed to be easily mutagenized without consideration for the parental genome. However, designing and constructing synthetic genes, particularly those encoding large proteins, can be a slow, difficult, and confusing process. This module automatically designs oligonucleotides for gene synthesis using a PCR-based approach.
    Author: DNAWorks
    Release: 2022-04-15 11:52:22
    Reference: Nucleic Acids Res. 2002 May 15;30(10):e43.

    Codon Optimization

    简介

    基于知名的DNAWorks算法对氨基酸或DNA序列进行密码子优化(基于PCR的基因合成的自动寡核苷酸设计)。

    整个基因组序列的可用性极大地增加了蛋白质靶标的数量,其中许多需要在原始DNA来源以外的细胞中过度表达。合成基因可以针对表达进行优化,并构建为易于突变操作而无需考虑亲本基因组。然而,合成基因的设计和构建,尤其是那些编码大蛋白质的基因,可能是一个缓慢、困难和令人困惑的过程。该模块通过基于PCR的方法自动设计用于基因合成的寡核苷酸。
    image.png

    参数说明

    Sequence File

    蛋白或者核酸的序列文件,FASTA格式。

    Sequence Type

    序列类型,蛋白或者核酸。

    Organism

    几种常用生物的密码子频率基于每个密码子在相应生物基因组的蛋白质编码区中出现的次数。大肠杆菌有两种选项:基于所有基因的标准频率(E. coli),或在指数增长期间以高水平表达的 II 类基因频率(ecoli2),通常建议用后者。

    Annealing Temperature

    退火温度参数为一组合成寡核苷酸设定了理想的退火温度。 可接受的退火温度范围在 58 至 70°C 之间。

    Oligo Length

    寡核苷酸长度参数限制了一组合成寡核苷酸中的任何一个可以达到的核苷酸长度。可接受的寡核苷酸长度范围在 30 到 999 nt 之间。

    Codon Frequency Threshold

    密码子频率阈值参数设置:密码子用于反向翻译蛋白质序列到DNA的截断值。

    Oligonucleotides Concentration

    寡核苷酸的浓度。寡核苷酸必须在100 uM (1E-4 M)和1 nM (1E-9 M)之间。

    Cations Concentration

    一价阳离子(Na+,K+)的浓度。单价阳离子必须在10到1000mM之间。

    Magnesium Concentration

    镁离子的浓度。镁离子浓度必须在0到200mM之间。

    Solution Number

    执行中生成的寡核苷酸的数量,每个作业的最大运行次数为999次。

    Thermodynamically Balanced Mode

    检查是否为热力学平衡由内而外合成法 (thermodynamically balanced inside-out, TBIO)输出模式。

    Restriction Site Screen

    要求被排除在合成基因的蛋白质编码区之外的位点,每个位点之间用逗号隔开,例如Aatll,Acc65I。
    支持非简并位点共117种:

    AatII,Acc65I,AclI,AcuI,AfeI,AflII,AgeI,AlwI,ApaI,ApaLI,AscI,AseI,AsiSI,AvrII,BamHI,BbsI,BbvCI,BbvI,BccI,BceAI,BciVI,BclI,BfrBI,BfuAI,BglII,BmgBI,BmrI,BmtI,BpmI,BpuEI,BsaI,BseRI,BseYI,BsgI,BsiWI,BsmAI,BsmBI,BsmFI,BsmI,BspCNI,BspDI,BspEI,BspHI,BspMI,BsrBI,BsrDI,BsrGI,BsrI,BssHII,BssSI,BstBI,BstZ17I,BtgZI,BtsI,ClaI,DraI,EagI,EarI,EciI,EcoRI,EcoRV,FauI,FokI,FseI,FspI,HgaI,HindIII,HpaI,HphI,KasI,KpnI,MboII,MfeI,MluI,MlyI,MscI,NaeI,NarI,NcoI,NdeI,NgoMIV,NheI,NotI,NruI,NsiI,PacI,PaeR7I,PciI,PleI,PmeI,PmlI,PsiI,PspOMI,PstI,PvuI,PvuII,SacI,SacII,SalI,SapI,SbfI,ScaI,SfaNI,SfoI,SmaI,SnaBI,SpeI,SphI,SspI,StuI,SwaI,TliI,TspRI,XbaI,XhoI,XmaI,ZraI
    

    支持简并位点共62种:

    AccI,AflIII,AhdI,AleI,AlwNI,ApoI,AvaI,BanI,BanII,BcgI,BglI,BlpI,Bme1580I,Bpu10I,BsaAI,BsaBI,BsaHI,BsaJI,BsaWI,BsaXI,BsiEI,BsiHKAI,BslI,BsoBI,Bsp1286I,BsrFI,BstAPI,BstEII,BstF5I,BstXI,BstYI,Bsu36I,BtgI,Cac8I,DraIII,DrdI,EaeI,EcoNI,EcoO109I,HaeII,HincII,Hpy188III,MmeI,MslI,MspA1I,MwoI,NlaIV,NspI,PflFI,PflMI,PpuMI,PshAI,RsrII,SexAI,SfcI,SfiI,SgrAI,SmlI,StyI,Tth111I,XcmI,XmnI
    

    Custom Site Screen

    自定义被排除在合成基因的蛋白质编码区之外的位点,自定义位点格式必须包含名称和序列,名称和序列之间用空格隔开,多个位点时用逗号隔开,例如:Aatll GACGTC,Acc65I GGTACC。

    Output File

    输出结果文件的名称。

    结果说明

    输出结果文件为result.txt,包含优化后的密码子序列以及序列相关信息。

    参考文献

    Hoover DM, Lubkowski J. DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 2002 May 15;30(10):e43.

    Codon Optimization

    Introduction

    Codon Optimization performs codon optimization of amino acid or DNA sequences with the well-known DNAWorks algorithm (automated oligonucleotide design for PCR-based gene synthesis). The availability of whole genome sequences has greatly increased the number of protein targets, many of which need to be overexpressed in cells other than the original source of the DNA. Synthetic genes can be optimized for expression and constructed for easy mutational manipulation without regard to the parent genome. However, designing and constructing synthetic genes, particularly those encoding large proteins, can be a slow, difficult, and confusing process. This module automates the design of oligonucleotides for PCR-based gene synthesis.
    image.png

    Parameter

    Sequence File

    Protein or nucleotide sequences in FASTA format

    Sequence Type

    The sequence type: protein or nucleic acid.

    Organism

    The codon frequencies of several commonly used organisms are based on the number of times each codon appears in the protein-coding regions of the respective organism’s genome. For Escherichia coli there are two options: the standard frequencies based on all genes (E. coli), or the frequencies of Class II genes, which are expressed at high levels during exponential growth (ecoli2). The latter is usually recommended.

    Annealing Temperature

    The annealing temperature parameter sets the ideal annealing temperature for a set of synthetic oligonucleotides. Acceptable annealing temperatures range from 58 to 70°C.

    Oligo Length

    The oligonucleotide length parameter sets the maximum length that any oligonucleotide in the synthetic set may reach. Acceptable oligonucleotide lengths range from 30 to 999 nt.

    Codon Frequency Threshold

    The codon frequency threshold is the cutoff that determines which codons are used when back-translating the protein sequence to DNA.

    Oligonucleotides Concentration

    Concentration of the oligonucleotides. The value must be between 1 nM (1E-9 M) and 100 µM (1E-4 M).

    Cations Concentration

    Concentration of monovalent cations (Na+, K+). The value must be between 10 and 1000 mM.

    Magnesium Concentration

    Concentration of magnesium ions. The value must be between 0 and 200 mM.

    Solution Number

    The number of solutions (oligonucleotide sets) to generate in one execution, up to a maximum of 999 runs per job.
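The numeric limits quoted in the parameter descriptions above can be checked before a job is submitted. The sketch below is a hypothetical pre-check. The dictionary keys are invented names rather than the module's real parameter identifiers, and the lower bound of 1 for the solution number is an assumption.

```python
# Allowed ranges as stated in the parameter descriptions; keys are hypothetical.
LIMITS = {
    "annealing_temperature_c": (58, 70),
    "oligo_length_nt": (30, 999),
    "oligo_concentration_m": (1e-9, 1e-4),
    "monovalent_cation_mm": (10, 1000),
    "magnesium_mm": (0, 200),
    "solution_number": (1, 999),  # lower bound of 1 is assumed
}


def out_of_range(params):
    """Return {name: (value, low, high)} for every parameter outside its range."""
    problems = {}
    for name, value in params.items():
        low, high = LIMITS[name]
        if not low <= value <= high:
            problems[name] = (value, low, high)
    return problems


print(out_of_range({"annealing_temperature_c": 62, "oligo_length_nt": 20}))
# {'oligo_length_nt': (20, 30, 999)}
```

Returning all violations at once, rather than stopping at the first, makes it easier to correct a whole parameter set in one pass.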

    Thermodynamically Balanced Mode

    Check this option to use the thermodynamically balanced inside-out (TBIO) output mode.

    Restriction Site Screen

    Restriction sites to exclude from the protein-coding region of the synthetic gene, separated by commas, for example: AatII,Acc65I.
    A total of 117 non-degenerate sites are supported:

    AatII,Acc65I,AclI,AcuI,AfeI,AflII,AgeI,AlwI,ApaI,ApaLI,AscI,AseI,AsiSI,AvrII,BamHI,BbsI,BbvCI,BbvI,BccI,BceAI,BciVI,BclI,BfrBI,BfuAI,BglII,BmgBI,BmrI,BmtI,BpmI,BpuEI,BsaI,BseRI,BseYI,BsgI,BsiWI,BsmAI,BsmBI,BsmFI,BsmI,BspCNI,BspDI,BspEI,BspHI,BspMI,BsrBI,BsrDI,BsrGI,BsrI,BssHII,BssSI,BstBI,BstZ17I,BtgZI,BtsI,ClaI,DraI,EagI,EarI,EciI,EcoRI,EcoRV,FauI,FokI,FseI,FspI,HgaI,HindIII,HpaI,HphI,KasI,KpnI,MboII,MfeI,MluI,MlyI,MscI,NaeI,NarI,NcoI,NdeI,NgoMIV,NheI,NotI,NruI,NsiI,PacI,PaeR7I,PciI,PleI,PmeI,PmlI,PsiI,PspOMI,PstI,PvuI,PvuII,SacI,SacII,SalI,SapI,SbfI,ScaI,SfaNI,SfoI,SmaI,SnaBI,SpeI,SphI,SspI,StuI,SwaI,TliI,TspRI,XbaI,XhoI,XmaI,ZraI
    

    A total of 62 degenerate sites are supported:

    AccI,AflIII,AhdI,AleI,AlwNI,ApoI,AvaI,BanI,BanII,BcgI,BglI,BlpI,Bme1580I,Bpu10I,BsaAI,BsaBI,BsaHI,BsaJI,BsaWI,BsaXI,BsiEI,BsiHKAI,BslI,BsoBI,Bsp1286I,BsrFI,BstAPI,BstEII,BstF5I,BstXI,BstYI,Bsu36I,BtgI,Cac8I,DraIII,DrdI,EaeI,EcoNI,EcoO109I,HaeII,HincII,Hpy188III,MmeI,MslI,MspA1I,MwoI,NlaIV,NspI,PflFI,PflMI,PpuMI,PshAI,RsrII,SexAI,SfcI,SfiI,SgrAI,SmlI,StyI,Tth111I,XcmI,XmnI
    

    Custom Site Screen

    Custom sites to exclude from the protein-coding region(s) of the synthetic gene. Each custom site must consist of a name and a sequence separated by a space; multiple sites are separated by commas. Example: AatII GACGTC,Acc65I GGTACC.
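The custom site format can be parsed and used to scan a DNA sequence as sketched below. This is an illustration only: the function names are hypothetical, only non-degenerate sites written with A, C, G, and T are handled, and the module's own screening logic is not shown here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_custom_sites(spec):
    """Parse 'Name SEQ,Name SEQ' into a {name: sequence} dict."""
    sites = {}
    for entry in spec.split(","):
        name, sequence = entry.split()  # exactly one space-separated pair
        sites[name] = sequence.upper()
    return sites


def find_sites(dna, sites):
    """Return {name: [0-based positions]} for sites present on either strand."""
    dna = dna.upper()
    hits = {}
    for name, site in sites.items():
        # Search for the site and its reverse complement on the given strand.
        patterns = {site, site.translate(COMPLEMENT)[::-1]}
        positions = sorted({i for p in patterns
                            for i in range(len(dna) - len(p) + 1)
                            if dna.startswith(p, i)})
        if positions:
            hits[name] = positions
    return hits


sites = parse_custom_sites("AatII GACGTC,Acc65I GGTACC")
print(find_sites("AAAGACGTCTTT", sites))  # {'AatII': [3]}
```

Both example sites are palindromic, so their reverse complements equal the sites themselves. The reverse-complement search only matters for asymmetric recognition sequences.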

    Output File

    Name of the output result file.

    Result

    The output file is result.txt, which contains the optimized codon sequence and sequence-related information.

    Reference

    Hoover DM, Lubkowski J. DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 2002 May 15;30(10):e43.

  • Name: Patch Analysis
    Description: 分析蛋白质表面的Patch(正电、负电、疏水残基富集区域)的大小和分布,用于解决蛋白质的聚集等问题。一般建议通过WeView三维结构可视化编辑器来使用该功能,可以在三维结构中直观地查看patch的位置。 Calculate patches (positively charged, negatively charged, or hydrophobic regions) on the protein surface to address protein aggregation issues. It is recommended to use in the WeView, as it allows for a visual inspection of the patch locations within the 3D structure.
    Author: WECOMPUT
    Release: 2022-04-14 15:01:18
    Reference:

    Patch Analysis

    简介

    Patch Analysis模块计算蛋白质表面静电和疏水作用相对富集的区域,用于显示出在蛋白质-蛋白质相互作用中其重要作用的区域,这对于预测基于非共价弱相互作用的可逆的聚集现象尤其有用。尤其是,疏水相互作用长期被认为是大分子间吸引相互作用的主要组成部分。对于抗体,静电相互作用牵涉到了自聚集,而偶极-偶极相互作用被认为是导致β-折叠的纤维化的原因。同时,也可以通过WeView界面对蛋白结构进行Patch分析。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式

    Hydrophobic Cutoff

    Hydrophobic cutoff是一个以疏水性氨基酸(通常包括Leu,Ile,Val,Phe,Trp和Met)为基础定义的截断值,用于将表面上疏水性氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,Patch区域中获得的化学性质信息会根据其表面密度和具有高疏水性氨基酸的数量而有所变化。

    Positive Cutoff

    Positive Cutoff是一个以阳离子氨基酸为基础定义的截断值,用于将表面上阳离子氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,positive cutoff方式用于筛选出可能参与离子相互作用的蛋白质表面区域。

    Negative Cutoff

    Negative Cutoff是一个以阴离子氨基酸为基础定义的截断值,用于将表面上阴离子氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,negative cutoff方式用于筛选出可能参与离子相互作用的蛋白质表面区域。

    SASA Cutoff

    SASA Cutoff是一个以溶剂可及表面积为基础定义的截断值,低于截断值的patch会被过滤掉。

    Distance Cutoff

    Distance Cutoff是原子距离截断值,低于截断值的才会认为属于同一聚集块。值越小,聚集块patch越小。

    Min Distance Cutoff

    Min Distance Cutoff是patch之间的距离截断值,距离小于截断值的归为同一个patch。

    Result Type

    输出文件格式,csv或者json
    通俗地讲,cutoff代表静电势能或疏水势能的强度阈值,单位是kcal/mol,超过阈值才会被计入面积。阈值越小,则patch越多。

    Keep Original

    不添加缺失原子(包括氢原子)和结构优化。

    Neutral N-terminus

    使得N-氮端的蛋白残基中性化。

    Neutral C-terminus

    使得C-氮端的蛋白残基中性化。

    结果说明

    输出结果文件为result.csv和input_prot.pdb,包含信息如下:

    字段名称 说明
    Type Patch的类型,Hyd:疏水中心,Neg:负电中心,Pos:正电中心
    Area(Å^2) 每个Patch的蛋白质表面区域面积
    Residues 每个Patch的对应的残基

    参考文献

    Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348.
    Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514.
    Wildman, S.A., Crippen, G.M.; Prediction of Physiochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873.

    Patch Analysis

    Introduction

    Patch Analysis calculates both electrostatic (excess charge) and hydrophobic surface patches to show regions that are significant for protein-protein interactions. This is particularly useful for predicting reversible aggregation, which typically arises from relatively weak non-covalent interactions. In particular, hydrophobic interactions have long been recognized as major contributors to high-affinity interactions between macromolecules. In antibodies, electrostatic interactions have been implicated in forming self-associated aggregates [Karshikoff 2006], while dipole-dipole interactions are believed to cause the fibrillogenic association of β-sheets. Patch analysis can also be run on protein structures through the WeView interface.
    Electrostatic patches.
    The surface electrostatic field is estimated using an exponentially decaying Debye-Hückel field with a screening length of λ_D = 3.5 Å.
    image.png
    The map thus obtained is mostly one of excess charge close to the molecular surface.
    Significant patches are established by cutting the surface along iso-contour lines of absolute field value equal to 40 kcal/mol/C, keeping the regions above. Finally, a default minimal patch area of 40 Å² filters out smaller, presumably less relevant, patches.
    Hydrophobicity map.
    The hydrophobic potential is calculated from the Wildman and Crippen octanol-water partition coefficients f = log P [Wildman 1999]:
    image.png
    where f_i is the coefficient of atom i and g(r_i) is a Fermi-type distance-dependent weighting function proposed by Heiden et al. [Heiden 1993]:
    image.png
    with r_cut = 5 Å and α = 1.5.
    As with the electrostatic (excess charge) patches, significant hydrophobic patches are established by cutting the surface along iso-contour lines, retaining only the portion above a potential threshold of 0.09 kcal/mol, and filtering for a minimal patch area of 50 Å².
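The formulas for the hydrophobic potential appear only as images in this page, so the sketch below is a guess at their shape rather than the module's actual implementation. It assumes the standard Fermi form g(r) = 1 / (1 + exp(α(r − r_cut))) and an unnormalized sum of f_i · g(r_i) over atoms; both forms and the function names are assumptions. Only r_cut = 5 Å and α = 1.5 come from the text.

```python
import math


def fermi_weight(r, r_cut=5.0, alpha=1.5):
    """Assumed Fermi-type weight: ~1 near the atom, 0.5 at r_cut, ~0 far away."""
    x = alpha * (r - r_cut)
    if x > 700:  # avoid overflow in exp() for very distant atoms
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


def hydrophobic_potential(point, atoms):
    """Sum f_i * g(r_i) over atoms given as (x, y, z, f_i) tuples (assumed form)."""
    total = 0.0
    for x, y, z, f_i in atoms:
        r_i = math.dist(point, (x, y, z))  # distance from surface point to atom i
        total += f_i * fermi_weight(r_i)
    return total


# One hypothetical atom with f_i = 0.2 located exactly r_cut away from the point.
print(hydrophobic_potential((0.0, 0.0, 0.0), [(5.0, 0.0, 0.0, 0.2)]))  # 0.1
```

Under this assumed form, atoms well inside r_cut contribute almost their full coefficient, while atoms a few Å beyond it contribute almost nothing, which keeps the potential local to the surface point.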

    Parameter

    Structure PDB File

    Protein structure file in PDB format.

    Hydrophobic Cutoff

    Hydrophobic Cutoff is a cutoff value defined on the basis of hydrophobic amino acids (usually including Leu, Ile, Val, Phe, Trp, and Met). It is used to compare the number of hydrophobic amino acids on the surface with the surface area, so as to screen out regions that may have important biological functions. Generally speaking, the chemical property information obtained for a patch region varies with its surface density and the number of highly hydrophobic amino acids it contains.

    Positive Cutoff

    Positive Cutoff is a cutoff value defined on the basis of cationic amino acids. It is used to compare the number of cationic amino acids on the surface with the surface area, so as to screen out regions that may have important biological functions. Generally speaking, the positive cutoff is used to screen for protein surface regions that may be involved in ionic interactions.

    Negative Cutoff

    Negative Cutoff is a cutoff value defined on the basis of anionic amino acids. It is used to compare the number of anionic amino acids on the surface with the surface area, so as to screen out regions that may have important biological functions. Generally speaking, the negative cutoff is used to screen for protein surface regions that may be involved in ionic interactions.

    SASA Cutoff

    SASA Cutoff is a cutoff value defined on the basis of solvent-accessible surface area. Patches whose area is below the cutoff are filtered out.

    Distance Cutoff

    Distance Cutoff is an atom distance cutoff. Only atoms closer than the cutoff are considered to belong to the same cluster. Lower values result in smaller patches.

    Min Distance Cutoff

    Min Distance Cutoff is the cutoff value for the distance between neighboring patches (Å). Patches separated by less than the cutoff are merged into one patch.

    Result Type

    Output file format, JSON or CSV.

    Keep Original

    Do not add missing atoms or optimize the structure.

    Result

    The output files are result.csv and input_prot.pdb. The result file contains the following information:

    Field Name Description
    Type Patch type. Hyd: hydrophobic patch, Neg: negative patch, Pos: positive patch
    Area(Å^2) Protein surface area of the patch
    Residues Residues that make up the patch

    References

    Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348.
    Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514.
    Wildman, S.A., Crippen, G.M.; Prediction of Physiochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873.

  • Name: PDB Mutation
    Description: 突变PDB格式的蛋白质结构并返回突变后的结构。 Mutate a protein structure in PDB format and return mutated structure.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-04-12 00:00:00
    Reference:

    PDB Mutation

    简介

    PDB Mutation是用于突变PDB格式的蛋白质结构并返回突变后的结构。

    参数说明

    PDB File

    蛋白的结构文件,PDB格式

    Mutation File

    突变文本文件,包含突变信息,格式如下:

    KA100N;KA101T;
    KA100T;
    

    第一字母代表的是原始残基,第二个字母代表PDB文件中待突变残基所在的链名,后面的数字代表残基位置编号,最后一个字母代表突变后的残基。

    结果说明

    输出结果包括:

    输出文件名称 说明
    mutation_result.tar.gz 所有突变体PDB结构的压缩包文件
    mutation_001.pdb 每个突变体的结构PDB文件

    PDB Mutation

    Introduction

    PDB Mutation is a tool used to mutate protein structures in PDB format and return the mutated structures.

    Parameter Description

    PDB File

    Structure file of the protein in PDB format.

    Mutation File

    Mutation text file containing mutation information in the following format:

    KA100N;KA101T;
    KA100T;
    

    The first letter represents the original residue, the second letter represents the chain name of the residue to be mutated in the PDB file, the following number represents the residue position number, and the last letter represents the mutated residue.
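    As an illustration of this format, the following sketch parses one line of a mutation file into its parts. It assumes that the chain name is a single letter and that each line describes one mutant (possibly with several point mutations separated by semicolons); the names Mutation and parse_mutation_line are hypothetical and not part of the module.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    original: str   # original residue, one-letter code
    chain: str      # chain name in the PDB file
    position: int   # residue position number
    mutant: str     # residue after mutation, one-letter code


# original residue, chain letter, position number, mutated residue
_PATTERN = re.compile(r"^([A-Z])([A-Za-z])(-?\d+)([A-Z])$")


def parse_mutation_line(line):
    """Parse a line such as 'KA100N;KA101T;' into a list of Mutation tuples."""
    mutations = []
    for token in line.strip().split(";"):
        token = token.strip()
        if not token:
            continue  # the trailing ';' yields an empty token
        match = _PATTERN.match(token)
        if match is None:
            raise ValueError(f"Unrecognized mutation: {token!r}")
        original, chain, position, mutant = match.groups()
        mutations.append(Mutation(original, chain, int(position), mutant))
    return mutations
```

    For example, parsing the first line of the sample file above yields two point mutations on chain A, at positions 100 and 101.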

    Result Description

    The output results include:

    Output File Name Description
    mutation_result.tar.gz Compressed file containing all mutated PDB structures
    mutation_001.pdb PDB file for each mutated structure
  • Name: Patent Sequence Listing
    Description: 批量从专利文本文件中提取序列的工具。很多大分子专利会附带一个序列清单文件,里面存储了专利要求中的全部序列,但是人工很难高效读取,利用此模块可以一次性批量提取。其中Image(OCR)是基于图像的蛋白质序列转换为3个字母编码或1个字母编码的序列。 A tool for extracting sequences in bulk from patent text files. Many macromolecule patents come with a sequence listing file that contains all the sequences in the patent claims. However, it is difficult for humans to efficiently read and extract these sequences. With this module, all sequences can be extracted in bulk at once. The Image(OCR) is the conversion of image-based protein sequences into 3-letter coded or 1-letter coded sequences.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-03-22 14:36:49
    Reference: https://github.com/xinyu-dev/PatentSeq

    Patent Sequence Listing

    简介

    通过解析美国(https://patentcenter.uspto.gov/ )和国际(https://patentscope2.wipo.int/search/en/search.jsf )专利附带的序列清单(Sequence Listing)文件,里面存储了专利权利要求的序列,但是人工很难读取,该模块可以从中一次性批量提取专利中所有具有正式编号(SEQ ID NO.)的序列。

    1. Sequence Listing文件下载

    序列清单(Sequence Listing)文件内容示例:
    image.png

    用法:
    (1)从专利网站搜索专利:

    • WO专利从WIPO的网站PatentScope搜索:
      https://patentscope2.wipo.int/search/en/search.jsf
    • US专利从USPTO的网站搜索:
      https://patentcenter.uspto.gov/
      (2)在专利的页面中找到Sequence Listing文件并下载。
      image.png
      从WIPO网站下载
      image.png
      从USPTO网站下载
      (3)使用该模块,提交下载到的文件即可。

    2. Image(OCR)
    Image(OCR)是基于图像的蛋白质序列转换为3个字母编码或1个字母编码的序列。
    注意:截图时请务必省略标题,类似下图。
    Example_Seq1.png

    TXT(XML)方法

    参数说明

    Sequence Listing File

    专利文件,TXT或者XML格式。

    结果说明

    输出结果包括:

    输出文件名称 说明
    seq_list.csv 记录所有序列信息的csv文件
    seq_list.fasta 记录所有序列信息的fasta文件

    其中seq_list.csv包括信息如下:

    字段名称 说明
    idx 序列编号
    type 序列类型,DNA/蛋白
    sequence 序列信息

    Image(OCR)方法

    参数说明

    Image File

    专利图片文件,PNG或者JPG格式

    Output File

    输出文件名称,默认为result.fasta

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.fasta 专利图片转换成一个字母序列的FASTA文件
    result.txt 包含图片文件的字符,转换成一个字母和三个字母的序列

    Patent Sequence Listing

    Introduction

    The Sequence Listing files attached to U.S. (https://patentcenter.uspto.gov/ ) and international (https://patentscope2.wipo.int/search/en/search.jsf ) patents store the sequences claimed in the patents, but they are difficult to read manually. This module parses these files and extracts, in bulk, all sequences that have an official number (SEQ ID NO.).

    1. Sequence Listing File Download

    Example content of a Sequence Listing file:
    image.png

    Usage:
    (1) Search for patents on patent websites:

    • For WO patents, search on WIPO’s PatentScope:
      https://patentscope2.wipo.int/search/en/search.jsf
    • For US patents, search on USPTO’s website:
      https://patentcenter.uspto.gov/
      (2) Find and download the Sequence Listing file on the patent page.
      image.png
      Download from the WIPO website
      image.png
      Download from the USPTO website
      (3) Submit the downloaded file to this module.

    2. Image(OCR)

    Image(OCR) is for converting protein sequences from images into three-letter or one-letter coded sequences.
    Note: When taking screenshots, please be sure to omit the headers, similar to the image below.
    Example_Seq1.png
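
    The three-letter to one-letter conversion mentioned above can be sketched as follows. This covers only the text conversion after character recognition, not the OCR step itself; the function name and the choice to emit X for unrecognized tokens are assumptions, not the module's documented behavior.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(text):
    """Convert whitespace-separated three-letter codes to a one-letter sequence.

    Tokens that are not recognized are written as 'X' (an assumption of this sketch).
    """
    tokens = text.split()
    return "".join(THREE_TO_ONE.get(token.capitalize(), "X") for token in tokens)
```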

    TXT(XML) Method

    Parameter Description

    Sequence Listing File

    Patent file in TXT or XML format.

    Result Description

    The output includes:

    Output File Name Description
    seq_list.csv CSV file recording all sequence information
    seq_list.fasta FASTA file recording all sequence information

    The seq_list.csv includes the following information:

    Field Name Description
    idx Sequence number
    type Sequence type, DNA/protein
    sequence Sequence information

    Image(OCR) Method

    Parameter Description

    Image File

    Patent image file in PNG or JPG format

    Output File

    Output file name, default is result.fasta

    Result Description

    The output includes:

    Output File Name Description
    result.fasta FASTA file of one-letter sequences converted from patent images
    result.txt Characters from image files converted into one-letter and three-letter sequences
  • Name: Tumor Gene Expression (TCGA)
    Description: 基于TCGA和GTEx等数据,检索指定基因在肿瘤和正常组织的表达情况,统计并绘制肿瘤细胞、肿瘤组织、正常组织等的基因表达差异,帮助药物靶点选择、研发立项和决策。 Plot gene expression in normal and tumor tissues, based on databases TCGA and GTEx. It retrieves the expression of specified genes in tumors and normal tissues, counts and maps the gene expression differences of tumor cells, tumor tissues, and normal tissues, etc., to help drug target selection and decision-making.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-03-22 14:35:06
    Reference:

    Tumor Gene Expression (TCGA)

    简介

    基于TCGA和GTEx等数据,检索指定基因在肿瘤和正常组织的表达情况,统计并绘制肿瘤细胞、肿瘤组织、正常组织等的基因表达差异,帮助药物靶点选择、研发立项和决策。

    参数说明

    Gene Name

    基因名称,输入的基因名须对应HGNC(https://www.genenames.org/)的"Approved Symbol"。例如:在HGNC搜索“PD-1”,得知“approved symbol”为“PDCD1”,后者“PDCD1”是该程序需要的输入。

    结果说明

    输出结果包括:

    输出文件名称 说明
    tcga_expression.jpeg 不同疾病中该基因分别在肿瘤、正常、癌旁组织的表达量分布。
    tcga_tissue_expression.jpeg 不同组织中该基因分别在肿瘤、正常、癌旁组织的表达量分布。

    Tumor Gene Expression (TCGA)

    Introduction

    Plot gene expression in normal and tumor tissues, based on the TCGA and GTEx databases. The module retrieves the expression of a specified gene in tumors and normal tissues, then summarizes and plots the expression differences among tumor cells, tumor tissues, normal tissues, and so on, to support drug target selection, project initiation, and decision-making.

    Parameter

    Gene Name

    The entered gene name must correspond to the “Approved Symbol” in HGNC (https://www.genenames.org/). For example, searching for “PD-1” in HGNC shows that the approved symbol is “PDCD1”; “PDCD1” is the input the program requires.

    Result

    The output includes:

    Output File Name Description
    tcga_expression.jpeg Expression distribution of the gene in tumor, normal, and adjacent tissues across different diseases.
    tcga_tissue_expression.jpeg Expression distribution of the gene in tumor, normal, and adjacent tissues across different tissues.
  • Name: Multiple Sequence Alignment
    Description: 基于渐进(progressive)比对算法进行多重序列比对,绘制进化树与序列对比图。 Align multiple sequences using progressive alignment algorithm for evolutionary analysis, generating phylogenetic trees.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-03-21 11:41:36
    Reference:

    Multiple Sequence Alignment

    简介

    Multiple Sequence Alignment 是多重序列比对模块,用于进化分析,绘制进化树,帮助对候选序列进行聚类、分析多样性等。

    参数说明

    Input File

    蛋白序列文件,FASTA格式。

    结果说明

    输出结果包括:

    输出文件名称 说明
    alignment.fasta 多重序列进行比对后的FASTA文件
    alignment.png 多重序列进行比对后的PNG文件
    newick.txt 多重序列进行多样性分析的结果文件
    tree.png 多重序列进化树图片

    Multiple Sequence Alignment

    Introduction

    Multiple Sequence Alignment is a module for aligning multiple sequences, used for evolutionary analysis, drawing evolutionary trees, and aiding in clustering and analyzing diversity of candidate sequences.

    Parameter

    Input File

    Protein sequence file in FASTA format.

    Result

    The output includes:

    Output File Name Description
    alignment.fasta FASTA file after aligning multiple sequences
    alignment.png PNG file after aligning multiple sequences
    newick.txt Result file of the evolutionary (diversity) analysis of the multiple sequences
    tree.png Phylogenetic tree image for the multiple sequences
  • Name: Structural Alignment
    Description: Structural Alignment是对两个蛋白质的三维结构进行叠合的工具。使用BLOSUM62矩阵和Needleman-Wunsch算法在两个序列之间执行全局配对比对,返回叠合后的蛋白结构,同时输出RMSD值。 Structural Alignment is a tool for the sequence-based structural alignment of two proteins. Performs a global pairwise alignment between two sequences using the BLOSUM62 matrix and the Needleman-Wunsch algorithm. Returns the alignment, the sequence identity, and the residue mapping between both original sequences.
    Tags: undefined
    Author: Biopython
    Release: 2022-03-17 14:43:33
    Reference:

    Structural Alignment

    简介

    Structural Alignment是对两个蛋白质的三维结构进行叠合的工具。使用BLOSUM62矩阵和Needleman-Wunsch算法在两个序列之间执行全局配对比对,返回叠合后的蛋白结构,同时输出RMSD值。

    参数说明

    Reference Structure

    参考蛋白的结构文件,PDB格式

    Sample Structure

    需要叠合蛋白的结构文件,PDB格式

    Reference Chain

    指定参考蛋白的链名,默认是A链

    Sample Chain

    指定需要叠合蛋白的链名,默认是A链

    Output File

    指定输出叠合后的结构文件,PDB格式

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 参考蛋白与样本蛋白之间的RMSD值记录文件
    alignment_renumbering_pred.pdb 叠合后的结构文件

    其中result.csv包含如下信息:

    字段名称 说明
    Reference 参考蛋白构象
    Sample 需要叠合的蛋白构象
    RMSD 叠合后的RMSD值

    Structural Alignment

    Introduction

    Structural Alignment is a tool for overlaying the 3D structures of two proteins. It performs a global pairwise alignment between two sequences using the BLOSUM62 matrix and the Needleman-Wunsch algorithm, returning the aligned protein structures and outputting the RMSD value.
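
    To illustrate the global alignment step, here is a minimal Needleman-Wunsch sketch. It is not the module's implementation: for brevity it uses a simple match/mismatch score with a linear gap penalty instead of the BLOSUM62 matrix, and the function name and default scores are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] vs b[j-1] (stand-in for BLOSUM62).
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

    The aligned columns that contain a residue in both sequences give the residue pairs that a structural superposition would then use.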

    Parameter Description

    Reference Structure

    Structure file of the reference protein in PDB format.

    Sample Structure

    Structure file of the protein to be aligned in PDB format.

    Reference Chain

    Specify the chain name of the reference protein, default is chain A.

    Sample Chain

    Specify the chain name of the protein to be aligned, default is chain A.

    Output File

    Specify the output structure file after alignment in PDB format.

    Result Description

    The output results include:

    Output File Name Description
    result.csv RMSD value record file between the reference protein and the sample protein
    alignment_renumbering_pred.pdb Aligned structure file

    The result.csv file contains the following information:

    Field Name Description
    Reference Conformation of the reference protein
    Sample Conformation of the protein to be aligned
    RMSD RMSD value after alignment
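
    For reference, the RMSD reported in result.csv is the root-mean-square deviation over paired atoms. The sketch below computes it for coordinates that have already been superposed; which atoms the module pairs (for example Cα atoms of aligned residues) is not stated here, so the inputs are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```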
  • Name: AIM-Ig Builder
    Description: 基于唯信开发的AIM-Ig平台,将指定的两个可变区组装为不对称类IgG双抗。 Assemble two specified variable regions into asymmetric IgG-like bispecific antibodies using the AIM-Ig platform developed by WECOMPUT.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-03-06 09:33:28
    Reference:

    AIM-Ig Builder

    简介

    使用唯信开发的AIM-Ig平台,将指定的两个可变区组装为不对称类IgG双抗。其中包含了Fv区,CH1-CL与CH3的突变,如果客户有纯化的需求,可以自行于Hole侧加入H435R突变。

    参数说明

    H1 Sequence

    第一个抗体的重链序列。

    L1 Sequence

    第一个抗体的轻链序列。

    H2 Sequence

    第二个抗体的重链序列。

    L2 Sequence

    第二个抗体的轻链序列。

    BsAb Sequence File

    适用性最好的两组双抗的序列文件名称,默认名:BsAb.fasta。

    BsAb Additional

    适用性次一级,部分序列上有优异效果的两组双抗序列文件名称,默认名:BsAb_additional.fasta。

    结果说明

    输出参数 输出文件名称 说明
    BsAb Sequence BsAb.fasta 适用性最好的两组双抗的序列
    BsAb Additional BsAb_additional.fasta 适用性次好,部分序列上有优异效果两组双抗的序列

    AIM-Ig Builder

    Introduction

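    Assemble two specified variable regions into asymmetric IgG-like bispecific antibodies using the AIM-Ig platform developed by WECOMPUT. The design includes mutations in the Fv region, CH1-CL, and CH3. If purification is required, users can add the H435R mutation on the Hole side themselves.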

    Parameter

    H1 Sequence

    The heavy chain sequence of the first antibody.

    L1 Sequence

    The light chain sequence of the first antibody.

    H2 Sequence

    The heavy chain sequence of the second antibody.

    L2 Sequence

    The light chain sequence of the second antibody.

    BsAb Sequence File

    Name of the sequence file for the two bispecific antibody designs with the best applicability. Default filename: BsAb.fasta.

    BsAb Additional

    Name of the sequence file for the two bispecific antibody designs with second-tier applicability, which perform very well on some sequences. Default filename: BsAb_additional.fasta.

    Result

    Output Parameter Output File Name Description
    BsAb Sequence BsAb.fasta Sequences of the two bispecific antibody designs with the best applicability
    BsAb Additional BsAb_additional.fasta Sequences of the two bispecific antibody designs with second-tier applicability, which perform very well on some sequences
  • Name: Alanine Scan
    Description: 丙氨酸扫描可以将蛋白质的每一个残基分别突变为Ala,并计算丙氨酸突变导致的自由能变化。 它对于快速扫描很有用,因为残基中的极性相互作用和位阻在突变为丙氨酸时都会被破坏。 Alanine scanning involves mutating each residue of a protein to Ala and calculating the resulting change in free energy. It is useful for rapid scanning because polar interactions and steric hindrance within the residue will be disrupted upon mutation to Ala.
    Tags: undefined
    Author: Schymkowitz J
    Release: 2022-03-10 17:57:36
    Reference:

    Alanine Scan

    简介

    丙氨酸扫描可以将蛋白质的每一个残基分别突变为Ala,并计算丙氨酸突变导致的自由能变化。 它对于快速扫描很有用,因为残基中的极性相互作用和位阻在突变为丙氨酸时都会被破坏。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式

    Output File

    指定输出文件名称,CSV格式

    结果说明

    输出结果文件为output.csv,包含信息如下:

    字段名称 说明
    Index 氨基酸索引(PDB文件中)
    Residue 氨基酸名称(PDB文件中)
    Mutation Residue 突变氨基酸名称
    detalEnergy 氨基酸突变成丙氨酸的能量变化,以Kcal/mol为单位。

    参考文献

    Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W382-8.

    Alanine Scan

    Introduction

    Alanine scanning involves mutating each residue of a protein to Ala and calculating the resulting change in free energy. It is useful for rapid scanning because polar interactions and steric hindrance within the residue will be disrupted upon mutation to Ala.

    Parameter

    Structure PDB File

    Protein structure file in PDB format

    Output File

    Specify the output file in CSV format

    Result

    The output file is output.csv and contains the following information:

    Field Name Description
    Index Amino acid index (in PDB file)
    Residue Amino acid name (in PDB file)
    Mutation Residue Mutant amino acid name
    detalEnergy Energy change, in kcal/mol, when the amino acid is mutated to Ala.

    Reference

    Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W382-8.

  • Name: PDB Insertion Removal
    Description: 用于去掉抗体PDB文件中的插入序列,因为某些计算工具不支持PDB中的插入序列。比如,20A改成20。 Renumber the antibody PDB file to remove any insertion codes in UID, to make such PDB compatible with other tools.
    Tags: undefined
    Author: Rodrigues JPGLM
    Release: 2022-03-10 16:10:28
    Reference: Rodrigues JPGLM, Teixeira JMC, Trellet M, Bonvin AMJJ. pdb-tools: a swiss army knife for molecular structures. F1000Res. 2018 Dec 20;7:1961.

    PDB Insertion Removal

    简介

    PDB Insertion Removal模块用于去掉抗体PDB文件中的插入序列,因为某些计算工具不支持PDB中的插入序列。比如,20A改成20。

    参数说明

    Structure PDB File

    抗体结构文件,PDB格式。

    结果说明

    得到去掉抗体中的插入序列的PDB文件prepared_insert.pdb。

    参考文献

    Rodrigues JPGLM, Teixeira JMC, Trellet M, Bonvin AMJJ. pdb-tools: a swiss army knife for molecular structures. F1000Res. 2018 Dec 20;7:1961.

    PDB Insertion Removal

    Introduction

    The PDB Insertion Removal module removes insertion codes from the residue numbering of antibody PDB files, because some computational tools do not support insertion codes in PDB files. For example, residue 20A is changed to 20.

    Parameter Description

    Structure PDB File

    Antibody structure file in PDB format.

    Result Description

    The output is prepared_insert.pdb, the antibody PDB file with the insertion codes removed.
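
    A minimal sketch of insertion code removal on fixed-column PDB records is shown below. It is an illustration only, not the module's implementation: to avoid duplicate residue numbers, it shifts the numbering of downstream residues in the same chain (so 52, 52A, 53 become 52, 53, 54), which may differ from how the module renumbers. It handles only ATOM and HETATM records, and the function name is hypothetical.

```python
def remove_insertion_codes(pdb_lines):
    """Return PDB lines with insertion codes removed and residues renumbered per chain."""
    out = []
    offset = {}     # chain -> cumulative shift applied to residue numbers
    last_seen = {}  # chain -> (residue number, insertion code) of the previous residue
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 27:
            chain = line[21]            # column 22: chain identifier
            resseq = int(line[22:26])   # columns 23-26: residue sequence number
            icode = line[26]            # column 27: insertion code
            key = (resseq, icode)
            if last_seen.get(chain) != key:
                # A new residue starts; an insertion code pushes numbering up by one.
                if icode != " ":
                    offset[chain] = offset.get(chain, 0) + 1
                last_seen[chain] = key
            new_number = resseq + offset.get(chain, 0)
            line = f"{line[:22]}{new_number:>4d} {line[27:]}"
        out.append(line)
    return out
```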

    References

    Rodrigues JPGLM, Teixeira JMC, Trellet M, Bonvin AMJJ. pdb-tools: a swiss army knife for molecular structures. F1000Res. 2018 Dec 20;7:1961.

  • Name: Antibody Fv Structure Prediction
    Description: 一种从序列中预测准确抗体 FV 结构的深度学习方法。通过引入直接可解释的注意力机制,深度神经网络关注物理上重要的残基对(例如,近端芳烃和关键氢键相互作用)。 A deep learning method for predicting accurate antibody FV structures from sequence. By introducing a directly interpretable attention mechanism, the deep neural network attends to physically important residue pairs (e.g., proximal aromatics and key hydrogen bonding interactions).
    Tags: undefined
    Author: Ruffolo JA
    Release: 2022-03-04 10:34:26
    Reference: Ruffolo JA, Sulam J, Gray JJ. Antibody structure prediction using interpretable deep learning. Patterns (N Y). 2021 Dec 9;3(2):100406.

    Antibody Fv Structure Prediction

    简介

    随着市场对于治疗性抗体需求的快速增长,依赖实验方法确定抗体结构的方法已经无法满足需求。在这里,我们提出了一种深度学习方法DeepAb,用于从序列中准确预测抗体FV结构。我们通过一组结构多样、治疗相关的抗体评估DeepAb,发现我们的方法始终优于领先的替代方法。以前的深度学习方法就像“黑匣子”一样运作,对它们的预测几乎没有提供什么解释说明。通过引入一种可直接解释的注意机制,我们表明我们的网络关注物理上重要的残基对(例如,近芳烃和关键的氢键相互作用)。最后,我们提出了一种新的基于网络置信度的突变评分指标,并表明对于某一特定抗体,所有8个排名靠前的突变都提高了结合亲和力。该模型将有助于广泛的抗体预测和设计任务。基本流程如下图所示。
    image.png
    我们的抗体结构预测方法DeepAb由两个主要阶段组成(如下图所示)。第一个阶段是一个深度残差卷积网络,用于预测Fv结构,用残差对之间的相对距离和方向表示。该网络只需要轻重链序列作为输入,并设计了可解释组件,以提供对模型预测的洞察。第二阶段是一个基于fast Rosetta,利用网络的预测来实现结构设计。
    image.png
    抗体结构预测的DeepAb方法示意图

    参数说明

    Fasta File

    输出轻链和重链的抗体序列文件,重链必须包括标识符":H"或者"Heavy",轻链必须包含标识符":L"或者"Light":

    >:H
    XXXXXX
    >:L
    XXXXXX
    

    Number of Decoys

    要创建的构象数量。选择能量最低的构象作为最终预测结构。

    Processes Number

    并行计算数量。

    Native PDB

    测量与Chothia格式的原始PDB之间的RMSD值。

    Use GPU

    使用GPU进行加速。

    Convert Format

    使用AbNum将最终预测结构转换为Chothia格式。

    Single Chain

    当预测FASTA只有一条链。注意:FASTA文件应该包含一个标记为“H”的条目(即使序列是一个轻链)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    pred_result/pred.deepab.pdb 能量最低的预测结构
    result.tar.gz 所有预测结构压缩包文件

    参考文献

    Ruffolo JA, Sulam J, Gray JJ. Antibody structure prediction using interpretable deep learning. Patterns (N Y). 2021 Dec 9;3(2):100406.

    Antibody Fv Structure Prediction

    Introduction

    Therapeutic antibodies make up a rapidly growing segment of the biologics market. However, rational design of antibodies is hindered by reliance on experimental methods for determining antibody structures. Here, we present DeepAb, a deep learning method for predicting accurate antibody FV structures from sequence. We evaluate DeepAb on a set of structurally diverse, therapeutically relevant antibodies and find that our method consistently outperforms the leading alternatives. Previous deep learning methods have operated as “black boxes” and offered few insights into their predictions. By introducing a directly interpretable attention mechanism, we show our network attends to physically important residue pairs (e.g., proximal aromatics and key hydrogen bonding interactions). Finally, we present a novel mutant scoring metric derived from network confidence and show that for a particular antibody, all eight of the top-ranked mutations improve binding affinity. This model will be useful for a broad range of antibody prediction and design tasks.
    image.png
    Our method for antibody structure prediction, DeepAb, consists of two main stages (see the figure below). The first stage is a deep residual convolutional network that predicts Fv structure, represented as relative distances and orientations between pairs of residues. The network requires only heavy and light chain sequences as input and is designed with interpretable components to provide insight into model predictions. The second stage is a fast Rosetta-based protocol for structure realization using the predictions from the network.
    image.png

    Parameter

    Fasta File

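    Antibody sequence file containing the heavy and light chains. The heavy chain header must include the identifier ":H" or "Heavy", and the light chain header must include the identifier ":L" or "Light".
    E.g.: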

    >:H
    XXXXXX
    >:L
    XXXXXX
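
    The header convention above can be checked before submission with a small sketch like the following. It assumes the identifiers ":H"/"Heavy" and ":L"/"Light" described in the Chinese parameter description; the function name is hypothetical and the module's own parsing may differ.

```python
def read_chains(fasta_text):
    """Return {'H': heavy_sequence, 'L': light_sequence} from FASTA text."""
    chains = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            if ":H" in header or "Heavy" in header:
                current = "H"
            elif ":L" in header or "Light" in header:
                current = "L"
            else:
                raise ValueError(f"Header lacks a chain identifier: {line!r}")
            chains[current] = ""
        elif current is not None:
            chains[current] += line  # sequences may span several lines
    return chains
```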
    

    Number of Decoys

    Number of decoys to create. The lowest energy decoy will be selected as final predicted structure.

    Processes Number

    Maximum number of parallel processes that should be used for creating decoys.

    Native PDB

    Native PDB in Chothia format for measuring RMSDs.

    Use GPU

    Use GPU for acceleration.

    Convert Format

    Convert final predicted structure to Chothia format using AbNum.

    Single Chain

    Predict from a FASTA file that contains only one chain. Note: The FASTA file should contain a single entry labeled ‘H’ (even if the sequence is a light chain).

    Result

    The output includes:

    Output File Name Description
    pred_result/pred.deepab.pdb Predicted structure with the lowest energy
    result.tar.gz Compressed archive of all predicted structures

    Reference

    Ruffolo JA, Sulam J, Gray JJ. Antibody structure prediction using interpretable deep learning. Patterns (N Y). 2021 Dec 9;3(2):100406.

  • Name: Aggregation Score
    Description: 预测蛋白质结构中的聚集倾向和蛋白质溶解度,通过考虑序列和结构来预测蛋白质中易聚集的位点,这对于球状蛋白质特别有用,其中容易聚集的位点可能埋藏在天然结构内并且序列不连续。通过考虑天然氨基酸的实验聚集倾向尺度,该方法可以准确预测蛋白质聚集倾向。 Design for the rational design of protein solubility and aggregation tendency in protein structures. It allows researchers to predict aggregation-prone sites in proteins by considering both sequence and structure. This is particularly useful for globular proteins, where aggregation-prone sites may be buried within the native structure and the sequence may be discontinuous. By considering experimental aggregation propensity scales of natural amino acids, this method can accurately predict protein aggregation tendency.
    Tags: undefined
    Author: Zambrano R
    Release: 2022-03-01 14:05:39
    Reference: Zambrano R, Jamroz M, Szczasiuk A, Pujols J, Kmiecik S, Ventura S. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures. Nucleic Acids Res. 2015 Jul 1;43(W1):W306-13.

    Aggregation Score

    简介

    该模块用于预测蛋白质结构中的聚集倾向和蛋白质溶解度,通过考虑序列和结构来预测蛋白质中易聚集的位点,这对于球状蛋白质特别有用,其中容易聚集的位点可能埋藏在天然结构内并且序列不连续。通过考虑天然氨基酸的实验聚集倾向尺度,该方法可以准确预测蛋白质聚集倾向,也可用于预测构象紊乱中家族性突变的致病作用。任何已知或预测的蛋白质结构都是适用的,它具备其他基于序列的算法未考虑的特性,例如蛋白质动态波动和蛋白质序列中距离较远的残基的空间聚类,这对于从初始折叠状态准确预测蛋白质聚集非常重要。
    底层算法Aggrescan3D(A3D)旨在预测蛋白质在其折叠状态下的聚集倾向。为了实现这个目标,A3D使用蛋白质的三维结构作为输入,这些结构可以通过X射线衍射、溶液NMR或建模方法得到,并以pdb格式表示。在分析之前,这些结构会经过能量最小化处理。该方法利用了实验得出的天然氨基酸内在聚集倾向尺度,并将这个尺度应用于蛋白质的三维结构中。在A3D方法中,结构中每个特定氨基酸的内在聚集倾向会受到其特定的结构环境的调节。聚集倾向是通过以每个残基Cα碳为中心的球形区域计算得出的。这为结构中每个氨基酸提供了一个独特的经过结构修正的聚集值(A3D分数),其公式如下:

    image.png

    其中:Aggi是球心处残基的内在聚集倾向;RSAi是其相对于溶剂暴露的表面积;Agge是包括在球体中的每个额外残基的内在聚集倾向,RSAe是其相对于溶剂暴露的表面积,dist是到中心残基i的距离。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式。如果没有已知结构,可以用结构预测模块预测。

    结果说明

    输出结果包括:

    名称 说明
    Aggregation Score (result_A3D.csv) 蛋白结构中每个氨基酸聚集倾向和蛋白质溶解度的打分文件
    Structure (output.pdb) 根据聚集倾向和蛋白质溶解度得到的结构文件,在PDB文件温度因子一栏填入计算得到的聚集度和溶解度数值
    result_A.png A链中每个氨基酸对应的聚集度和溶解度打分值的png格式图片
    result_A.svg A链中每个氨基酸对应的聚集度和溶解度打分值的svg格式图片

    其中result_A3D.csv包括信息如下:

    字段名称 说明
    protein 氨基酸残基折叠
    chain 蛋白链名称
    residue 氨基酸索引(PDB文件中)
    residue_name 氨基酸名称缩写(PDB文件中)
    score 聚集度和溶解度打分值,该数值为正代表氨基酸促进聚集,为负代表氨基酸促进溶解。

    参考文献

    • Zambrano R, Jamroz M, Szczasiuk A, Pujols J, Kmiecik S, Ventura S. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures. Nucleic Acids Res. 2015 Jul 1;43(W1):W306-13.
    • Aleksander Kuriata, Valentin Iglesias, Jordi Pujols, Mateusz Kurcinski, Sebastian Kmiecik, Salvador Ventura, Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W300–W307

    Aggregation Score

    Introduction

    This module is used to predict the aggregation propensity and protein solubility in protein structures. By considering both sequence and structure, it predicts sites in proteins that are prone to aggregation, which is particularly useful for globular proteins where aggregation-prone sites may be buried within the native structure and not contiguous in sequence. By considering experimentally derived aggregation propensity scales of natural amino acids, this method accurately predicts protein aggregation propensity and can be used to predict the pathogenic effects of familial mutations in conformational disorders. Any known or predicted protein structure is applicable. It incorporates features not considered by other sequence-based algorithms, such as protein dynamic fluctuations and spatial clustering of residues that are distant in the protein sequence, which is crucial for accurately predicting protein aggregation from the initial folding state.

    The underlying algorithm, Aggrescan3D (A3D), aims to predict the aggregation propensity of proteins in their folded states. To achieve this, A3D uses the protein’s 3D structure as input, which can be obtained through X-ray crystallography, solution NMR, or modeling methods, and is represented in PDB format. These structures undergo energy minimization before analysis. The method takes an experimentally determined intrinsic aggregation propensity scale for the natural amino acids and applies it to the protein’s 3D structure. In the A3D method, the intrinsic aggregation propensity of each specific amino acid in the structure is modulated by its specific structural environment. The aggregation propensity is calculated within a spherical region centered on the Cα carbon of each residue. This provides a unique, structurally corrected aggregation value (A3D score) for each amino acid in the structure. The calculation formula is as follows:

    image.png

    Where:

    • Aggi represents the intrinsic aggregation propensity of the residue at the center of the sphere.
    • RSAi is the relative solvent accessibility of the residue.
    • Agge is the intrinsic aggregation propensity of each additional residue included in the sphere.
    • RSAe is the relative solvent accessibility of each additional residue included in the sphere.
    • dist is the distance to the central residue i.

    Parameter Description

    Structure PDB File

    The structure file of the protein in PDB format. If the structure is not known, it can be predicted using the structure prediction module.

    Result Description

    The output results include:

    Name Description
    Aggregation Score (result_A3D.csv) A scoring file for the aggregation propensity and protein solubility of each amino acid in the protein structure.
    Structure (output.pdb) Structure file obtained based on the aggregation propensity and protein solubility, with the calculated aggregation and solubility values filled in the temperature factor column of the PDB file.
    result_A.png A PNG format image showing the aggregation and solubility scores for each amino acid in chain A.
    result_A.svg An SVG format image showing the aggregation and solubility scores for each amino acid in chain A.

    The result_A3D.csv file includes the following information:

    Field Name Description
    protein Fold of the amino acid residue.
    chain Protein chain name.
    residue Amino acid index in the PDB file.
    residue_name Amino acid name abbreviation in the PDB file.
    score Aggregation and solubility score, where a positive value indicates promotion of aggregation and a negative value indicates promotion of solubility.
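
    As a usage illustration, the sketch below lists the residues with a positive score, which the table above describes as promoting aggregation. It assumes that the CSV header row uses exactly the field names listed above; the function name and the threshold parameter are hypothetical.

```python
import csv
import io


def aggregation_prone(csv_text, threshold=0.0):
    """Return (chain, residue, residue_name, score) for rows with score above threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        score = float(row["score"])
        if score > threshold:
            hits.append((row["chain"], row["residue"], row["residue_name"], score))
    return hits
```

    With a file on disk, the text can be obtained with open("result_A3D.csv").read() before calling the function.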

    References

    • Zambrano R, Jamroz M, Szczasiuk A, Pujols J, Kmiecik S, Ventura S. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures. Nucleic Acids Res. 2015 Jul 1;43(W1):W306-13.
    • Aleksander Kuriata, Valentin Iglesias, Jordi Pujols, Mateusz Kurcinski, Sebastian Kmiecik, Salvador Ventura, Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W300–W307
  • Name: Sequence Mutagenesis (Saturated)
    Description: Sequence Mutagenesis (Saturated)是用于枚举蛋白质序列指定位置饱和突变的所有可能性,生成所有对应突变的文本文件和突变体序列文件。 Enumerate all possible point mutations at specified positions in a protein sequence, and generate text files for all corresponding mutations and mutant sequence files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-23 14:29:03
    Reference:

    Sequence Mutagenesis (Saturated)

    简介

    Sequence Mutagenesis (Saturated)是用于枚举蛋白质序列指定位置饱和突变的所有可能性,生成所有对应突变的文本文件和突变体序列文件。

    参数说明

    Input File

    蛋白序列文件,FASTA格式。

    Mutation Location

    突变位置,多个位置可以用逗号(,)隔开。

    Output File

    指定输出突变后的序列文件的名称,FASTA格式。

    Output Mutation Residue

    包含突变信息的文本文件的名称。

    Chain Name

    指定链名,生成带有链名的突变信息。

    结果说明

    输出结果包括:

    输出文件名称 说明
    mutated_seqs.fasta 突变后的序列文件
    individual.txt 突变文件信息,包含链信息
    mutated_polict.txt 突变文件信息,不包含链信息

    Sequence Mutagenesis (Saturated)

    Introduction

    Sequence Mutagenesis (Saturated) is used to enumerate all possibilities of saturated mutations at specified positions in a protein sequence, generating text files with all corresponding mutations and mutated sequence files.
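
    The enumeration described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the function names, the 1-based position convention, and the label format (wild-type residue, position, new residue, as in C20S) are assumptions made for the example.

```python
# Minimal sketch of saturation mutagenesis on a sequence.
# Assumptions (not from the module itself): positions are 1-based,
# only the 20 standard amino acids are used, and labels look like "C20S".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def saturate(sequence, positions):
    """Yield (label, mutant_sequence) for every single substitution."""
    for pos in positions:
        wild_type = sequence[pos - 1]
        for new_residue in AMINO_ACIDS:
            if new_residue == wild_type:
                continue  # skip the wild-type residue itself
            label = f"{wild_type}{pos}{new_residue}"
            mutant = sequence[:pos - 1] + new_residue + sequence[pos:]
            yield label, mutant


def to_fasta(records):
    """Format (label, sequence) pairs as FASTA text."""
    return "".join(f">{label}\n{seq}\n" for label, seq in records)


variants = list(saturate("MKTAY", [2, 5]))
print(len(variants))  # 38: two positions, 19 substitutions each
print(to_fasta(variants[:2]))
```

    Each position contributes 19 variants, so the number of output sequences grows linearly with the number of positions listed.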

    Parameter Description

    Input File

    Protein sequence file in FASTA format.

    Mutation Location

    Mutation positions; separate multiple positions with commas (,).

    Output File

    Specify the name of the output file containing the mutated sequence in FASTA format.

    Output Mutation Residue

    Name of the text file containing mutation information.

    Chain Name

    Specify the chain name to generate mutation information with chain names.

    Result Description

    The output results include:

    Output File Name Description
    mutated_seqs.fasta FASTA file containing the mutated sequences.
    individual.txt Mutation list with chain information.
    mutated_polict.txt Mutation list without chain information.
  • Name: Mutation Format Conversion
    Description: Mutation Format Conversion将突变文件中的突变信息加上链名,转换为适用于结构的格式。如将C20S改为CA20S。 Mutation Format Conversion is a tool to convert the list of mutations applicable to the sequence to a format suitable for the structure by adding the chain name.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-23 11:37:06
    Reference:

    Mutation Format Conversion

    简介

    Mutation Format Conversion将突变文件中的突变信息加上链名,转换为适用于结构的格式。如将C20S改为CA20S。

    参数说明

    Mutation File

    突变文件,TXT格式

    Chain Name

    指定链名

    结果说明

    将原本不带链名的转换为带有链名的突变文件individual.txt。

    Mutation Format Conversion

    Introduction

    Mutation Format Conversion adds chain names to the mutation information in the mutation file, converting it into a format suitable for structures. For example, converting C20S to CA20S.
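
    The conversion can be illustrated with a short sketch. The function name and the assumption that each mutation is written as one wild-type letter, a position number, and one new-residue letter are made up for this example; only the C20S to CA20S mapping comes from the description above.

```python
import re

# Assumed input form: wild-type letter, position, new residue (e.g. "C20S").
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def add_chain(mutation, chain):
    """Insert the chain name after the wild-type residue letter."""
    match = MUTATION.match(mutation.strip())
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    wild_type, position, new_residue = match.groups()
    return f"{wild_type}{chain}{position}{new_residue}"


print(add_chain("C20S", "A"))  # CA20S
```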

    Parameter Description

    Mutation File

    Mutation file in TXT format.

    Chain Name

    Specify the chain name.

    Result Description

    Converts the original mutation file without chain names to a mutation file with chain names, named individual.txt.

  • Name: Mutation Energy of Stability
    Description: Mutation Energy of Stability用于预测突变对蛋白质稳定性的影响。首先能量优化蛋白结构,将一个或多个残基突变为新残基,通过计算折叠自由能变化进行蛋白质稳定性分析。 Mutation Energy of Stability is used to predict the effect of mutations on protein stability. First, the protein structure is energy optimized, and one or more residues are mutated to new ones. Protein stability analysis is then performed by calculating the change in folding free energy.
    Tags: undefined
    Author: Schymkowitz J
    Release: 2022-02-18 22:13:26
    Reference: Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W382-8.

    Mutation Energy of Stability

    简介

    蛋白质稳定度的增强可以提高蛋白表达效率,甚至可以耐酸耐碱和高温,提高制剂的稳定性。使用序列信息来计算预测蛋白的折叠的稳定程度,可以大大降低实验研究的成本。本功能使用FoldX作为底层引擎来计算蛋白的稳定性。

    参数说明

    Input File

    输入蛋白PDB文件
    注意:输入PDB中的UID不能有Insertion Code,使用PDB Insertion Removal模块处理PDB文件可以去除Insertion Code。

    Mutant File

    突变文件,文本文件包含突变信息,格式如下:

    GB26R;
    GB26H,SB32K;
    

    其中G、S代表原始残基,
    B代表PDB文件中待突变残基所在的链名,
    26代表残基位置编号,
    R, H, K代表要突变成的突变残基

    Folding Energy

    指定包含折叠自由能影响的输出文件的名称,CSV格式

    结果说明

    输出结果文件为score.csv,下列表格中总能量(蛋白质折叠的吉布斯能量)和各能量分解项单位均为kcal/mol,包含信息如下:

    字段名称 说明
    Mutation 突变氨基酸位点
    FileName PDB文件名
    Total Energy 预测的蛋白质整体稳定性
    Backbone HBond 骨架氢键的贡献
    SideChain HBond 侧链-侧链和侧链-骨架贡献氢键的贡献
    Van der Waals 范德华力的贡献
    Eletrostatics 静电相互作用
    Solvation Polar 极性基团的惩罚
    Solvation Hydrophobic 疏水基团的贡献
    Van der Waals clashes 由于范德华冲突(残基间)导致的能量惩罚
    Entropy Side Chain 固定侧链的熵成本
    Entropy Main Chain 固定主链的熵成本
    Cis Bond 顺式肽键的成本
    Torsional Clash 范德华的扭转冲突(残基内)
    Backbone Clash 骨架-骨架范德华力,不在综合考虑范围内
    Helix Dipole 螺旋偶极子的静电贡献
    Water Bridge 水桥的贡献
    Disulfide 二硫键的贡献
    Electrostatic Kon 预配合物中分子间的静电相互作用
    Partial Covalent Bonds 络合金属的相互作用
    Energy Ionisation 电离能的贡献
    Entropy Complex 形成复合物的熵成本
    Residue Number 残基数

    突变稳定性的判断标准为:

    1. −5 kcal/mol < ΔΔG < −0.75 kcal/mol:蛋白突变后为稳定状态。
    2. −0.75 kcal/mol < ΔΔG < 1 kcal/mol:突变前后对蛋白稳定性的影响不大。
    3. ΔΔG ≥ 1 kcal/mol:蛋白突变后为不稳定状态。
      其中,ΔΔG为Total Energy[mutant]-Total Energy[wild]。

    参考文献

    Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W382-8.

    Mutation Energy of Stability

    Introduction

    Enhancing protein stability can increase protein expression efficiency and can even confer resistance to acid, alkali, and high temperature, which improves formulation stability. Predicting folding stability computationally can significantly reduce the cost of experimental research. This module uses FoldX as the underlying engine to calculate protein stability.

    Parameter

    PDB File

    Protein structure file to be mutated, in PDB format.
    Note: Residue identifiers in the input PDB must not contain insertion codes. Process the PDB file with the PDB Insertion Removal module to remove them.

    Mutation List

    Mutation file: a text file containing the mutation information in the following format:

    GB26R;
    GB26H,SB32K;
    

    where G and S are the original residues,
    B is the chain name of the residue to be mutated in the PDB file,
    26 and 32 are the residue position numbers,
    and R, H, and K are the residues to mutate to.
    Note: Please upload a text file in TXT format.
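
    As a rough illustration of the format above, the sketch below parses such a file into structured records. The parser, its names, and the assumption that chain names are single letters are hypothetical; the module's own parsing may differ.

```python
import re
from typing import List, NamedTuple


class Mutation(NamedTuple):
    wild_type: str
    chain: str
    position: int
    mutant: str


# Assumed token form: wild-type letter, single-letter chain, position, new residue.
TOKEN = re.compile(r"^([A-Z])([A-Za-z])(\d+)([A-Z])$")


def parse_mutation_file(text: str) -> List[List[Mutation]]:
    """Return one list of mutations per ';'-terminated entry."""
    variants = []
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip blank space after the last ';'
        variant = []
        for token in entry.split(","):
            match = TOKEN.match(token.strip())
            if match is None:
                raise ValueError(f"unrecognized mutation: {token!r}")
            wild_type, chain, position, mutant = match.groups()
            variant.append(Mutation(wild_type, chain, int(position), mutant))
        variants.append(variant)
    return variants


for variant in parse_mutation_file("GB26R;\nGB26H,SB32K;\n"):
    print(variant)
```

    Each entry ending in a semicolon is one variant, so the second entry in the example is a double mutant.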

    Folding Energy

    Specifies the name of the output file containing the effects of folding free energy, in CSV format

    Result

    The output file is score.csv. Each row gives the total energy and its decomposition into individual terms, all in kcal/mol. The columns are described below:

    Field Name Description
    Mutation Mutant amino acid site
    FileName PDB file
    Total Energy Predicted overall stability of the protein
    Backbone HBond Contribution of backbone hydrogen bonds
    SideChain HBond Contribution of sidechain-sidechain and sidechain-backbone hydrogen bonds
    Van der Waals Contribution of van der Waals interactions
    Eletrostatics Electrostatic interactions
    Solvation Polar Penalization for burying polar groups
    Solvation Hydrophobic Contribution of hydrophobic groups
    Van der Waals clashes Energy penalization due to van der Waals clashes (interresidue)
    Entropy Side Chain Entropy cost of fixing the side chain
    Entropy Main Chain Entropy cost of fixing the main chain
    Cis Bond Cost of having a cis peptide bond
    Torsional Clash Van der Waals torsional clashes (intraresidue)
    Backbone Clash Backbone-backbone van der Waals clashes; not included in the total
    Helix Dipole Electrostatic contribution of the helix dipole
    Water Bridge Contribution of water bridges
    Disulfide Contribution of disulfide bonds
    Electrostatic Kon Electrostatic interaction between molecules in the precomplex
    Partial Covalent Bonds Interactions with bound metals
    Energy Ionisation Contribution of ionisation energy
    Entropy Complex Entropy cost of forming a complex
    Residue Number Number of residues

    The criteria for judging mutation stability are:

    1. −5 kcal/mol < ΔΔG < −0.75 kcal/mol: The mutation stabilizes the protein.
    2. −0.75 kcal/mol < ΔΔG < 1 kcal/mol: The mutation has little effect on protein stability.
    3. ΔΔG ≥ 1 kcal/mol: The mutation destabilizes the protein.
      Where ΔΔG is Total Energy[mutant] − Total Energy[wild].
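
    The criteria above can be applied to the Total Energy values with a few lines of code. This is a sketch only: the function name and labels are illustrative, and values that the criteria do not cover (ΔΔG ≤ −5 kcal/mol, or exactly −0.75 kcal/mol) are reported as "unclassified" rather than guessed.

```python
def classify_ddg(total_mutant, total_wild):
    """Return (ddG, label) using the three stability ranges listed above.

    Energies are in kcal/mol. Values outside the listed ranges are
    labeled "unclassified" because the criteria do not define them.
    """
    ddg = total_mutant - total_wild
    if -5.0 < ddg < -0.75:
        label = "stabilizing"
    elif -0.75 < ddg < 1.0:
        label = "neutral"
    elif ddg >= 1.0:
        label = "destabilizing"
    else:
        label = "unclassified"
    return ddg, label


# Hypothetical energies for a wild type at -10.0 kcal/mol.
for mutant_energy in (-12.0, -9.5, -8.0, -16.0):
    print(classify_ddg(mutant_energy, -10.0))
```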

    Reference

    Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W382-8.

  • Name: Structure Mutagenesis
    Description: Structure Mutagenesis模块是从蛋白结构文件得到蛋白的序列信息,然后对指定位点进行饱和突变或者丙氨酸突变,得到包含突变信息的突变文件和突变序列。用于后续其他模块进行结构突变。 The Structure Mutagenesis module obtains protein sequence information from protein structure files and performs saturation mutagenesis or alanine mutagenesis at specified sites to generate a mutation file and a mutated sequence file containing mutation information. This module is used for subsequent structural mutation analysis in other modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-17 22:36:02
    Reference:

    Structure Mutagenesis

    简介

    对复合物界面区域进行单点或者多点的虚拟饱和突变,从而获得不同格式的突变文件以及突变后的Fasta文件。这为后续复合物之间的亲和力以及对突变体之间的结合自由能计算提供基础。

    参数说明

    Input File

    蛋白结构文件,PDB格式。

    Mutation Site

    突变位点文件,JSON格式,一般由Complex Interface Analysis模块生成的json文件。

    Chain Name

    指定链名。

    Output Sequence

    指定输出突变后的序列文件的名称。

    Mutated Policy

    指定输出突变文件的名称,不包含链信息。

    Chain Mutated Policy

    指定输出突变文件的名称,包含指定链信息。

    Mode

    突变模式:

    • Saturation:饱和突变,突变为其他19种氨基酸。
    • AlaScan:丙氨酸突变,仅突变为丙氨酸。

    结果说明

    输出结果包括:

    输出文件名称 说明
    mutated_policy.txt 突变文件信息,不包含链信息
    mutated_policy_with_chain.txt 突变文件信息,包含链信息
    output_mutated_seqs.fasta 突变后的序列文件

    Structure Mutagenesis

    Introduction

    Virtual saturation mutagenesis is performed at one or more sites in the interface region of a complex, producing mutation files in different formats and a FASTA file of the mutated sequences. These outputs provide the basis for subsequent calculations of the affinity between the complex partners and of the binding free energy of the mutants.

    Parameter Description

    Input File

    Protein structure file in PDB format.

    Mutation Site

    Mutation site file in JSON format, typically generated by the Complex Interface Analysis module.

    Chain Name

    Specify the chain name.

    Output Sequence

    Specify the name of the output file containing the mutated sequence.

    Mutated Policy

    Specify the name of the output mutation file without chain information.

    Chain Mutated Policy

    Specify the name of the output mutation file with specified chain information.

    Mode

    Mutation mode:

    • Saturation: Saturation mutagenesis, mutating to the other 19 amino acids.
    • AlaScan: Alanine scanning mutagenesis, mutating only to alanine.

    Result Description

    The output results include:

    Output File Name Description
    mutated_policy.txt Mutation list without chain information.
    mutated_policy_with_chain.txt Mutation list with chain information.
    output_mutated_seqs.fasta FASTA file containing the mutated sequences.
  • Name: Complex Interface Analysis
    Description: Complex Interface Analysis模块是基于结构的分析蛋白质复合物相互作用界面的关键残基。 Complex Interface Analysis module is designed for identifying the residues on the interface of a structure complex.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-17 14:41:05
    Reference:

    Complex Interface Analysis

    简介

    Complex Interface Analysis模块是基于结构的分析蛋白质复合物相互作用界面的关键残基。

    参数说明

    Structure PDB File

    蛋白复合物结构文件,PDB格式

    结果说明

    输出结果包括:

    输出文件名称 说明
    interaction_score.csv 记录复合物不同链之间相互作用的能量的文件
    interface_residues.csv 记录相互作用界面关键氨基酸的csv文件
    interface_residues.json 记录相互作用界面关键氨基酸的json文件

    其中interaction_score.csv包括信息如下:

    字段名称 说明
    PDB 蛋白质复合物结构名称
    Group1 链名称
    Group2 链名称
    Interaction Energy 相互作用能(kcal/mol)

    其中interface_residues.csv包括信息如下:

    字段名称 说明
    Chain1_and_Chain2 Chain1链和Chain2链之间相互作用的关键氨基酸,此处Chain1和Chain2为蛋白结构文件中的链名称。

    Complex Interface Analysis

    Introduction

    The Complex Interface Analysis module performs a structure-based analysis to identify the key residues at the interaction interface of a protein complex.

    Parameter Description

    Structure PDB File

    Protein complex structure file in PDB format.

    Result Description

    The output results include:

    Output File Name Description
    interaction_score.csv File recording the energy of interactions between different chains of the complex.
    interface_residues.csv CSV file recording key amino acids at the interaction interface.
    interface_residues.json JSON file recording key amino acids at the interaction interface.

    The interaction_score.csv file includes the following information:

    Field Name Description
    PDB Name of the protein complex structure.
    Group1 Chain name.
    Group2 Chain name.
    Interaction Energy Interaction energy (kcal/mol).

    The interface_residues.csv file includes the following information:

    Field Name Description
    Chain1_and_Chain2 Key amino acids involved in the interaction between Chain1 and Chain2, where Chain1 and Chain2 are chain names in the protein structure file.
  • Name: Protein BLAST
    Description: Protein BLAST是蛋白BLAST数据库,该数据库序列整合了GenPept、Swissprot、PIR、PRF、PDB、RefSeq等序列数据库。 Protein BLAST is a protein BLAST database that integrates sequences from several databases, including GenPept, Swissprot, PIR, PRF, PDB, and RefSeq.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-15 11:00:04
    Reference:

    Protein BLAST

    简介

    Protein BLAST是蛋白BLAST数据库,该数据库序列整合了GenPept、Swissprot、PIR、PRF、PDB、RefSeq等序列数据库。

    参数说明

    Input File

    蛋白序列文件,FASTA格式。

    Type

    指定序列比对数据库类型:蛋白,抗体,或者CDR区域。
    nr:蛋白Blast数据库。
    oas:Observed Antibody Space,抗体Blast数据库。
    cdr:CDR区域数据库,专利保护抗体数据库 。

    结果说明

    输出结果文件为alignment.fasta,是序列对齐后的FASTA文件,可在WeSeq中查看。

    Protein BLAST

    Introduction

    Protein BLAST is a protein BLAST database that integrates sequences from databases such as GenPept, Swissprot, PIR, PRF, PDB, and RefSeq.

    Parameter Description

    Input File

    Protein sequence file in FASTA format.

    Type

    Specifies the sequence alignment database type: protein, antibody, or CDR region.
    nr: Protein BLAST database.
    oas: Observed Antibody Space, an antibody BLAST database.
    cdr: CDR region database, built from patent-protected antibodies.

    Result Description

    The output result file is alignment.fasta, which is a FASTA file of the aligned sequences that can be viewed in WeSeq.

  • Name: Sequence Mutagenesis (Directed) for Ab
    Description: Sequence Mutagenesis (Directed) for Ab是根据模板抗体序列和描述突变的突变文件 (json) 批量生成突变抗体序列,通常突变文件由 BLAST 和 MSA 自动生成。这对于高通量抗体工程设计很有用。 Generate sequences of mutated antibody sequences based on a template antibody sequence and a mutation file (json) listing all mutations (normally the mutation file is automatically generated by BLAST and MSA). This is useful for high-throughput antibody engineering design.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-10 10:22:35
    Reference:

    Sequence Mutagenesis (Directed) for Ab

    简介

    Sequence Mutagenesis (Directed) for Ab是根据模板抗体序列和描述突变的突变文件(json)批量生成突变抗体序列,通常突变文件由BLAST和MSA自动生成。这对于高通量抗体工程设计很有用。

    参数说明

    Input File

    抗体的序列文件,FASTA格式

    Mutation File

    突变文件,JSON格式

    Cutoff

    突变频率截断值,默认10,只针对突变频率超过截断值的氨基酸生成对应的突变信息。用于过滤掉低频率的突变氨基酸。

    Numbering Type

    抗体编号类型:kabat,chothia,imgt以及none

    结果说明

    输出结果包括:

    输出文件名称 说明
    gen.fr.fasta 骨架区(framework region,FR)FASTA文件
    gen.fr.mutations.txt 骨架区(framework region,FR)突变文件信息
    gen.cdr.fasta 互补决定区(complementarity-determining region, CDR)FASTA文件
    gen.cdr.mutations.txt 互补决定区(complementarity-determining region, CDR)突变文件信息

    Sequence Mutagenesis (Directed) for Ab

    Introduction

    Sequence Mutagenesis (Directed) for Ab batch-generates mutated antibody sequences from a template antibody sequence and a mutation file (JSON format) that describes the mutations. The mutation file is typically generated automatically by BLAST and MSA. This is particularly useful for high-throughput antibody engineering design.

    Parameter Description

    Input File

    Antibody sequence file in FASTA format.

    Mutation File

    Mutation file in JSON format.

    Cutoff

    Mutation frequency cutoff (default: 10). Mutation information is generated only for amino acids whose mutation frequency exceeds the cutoff, which filters out low-frequency mutations.

    Numbering Type

    Antibody numbering type: kabat, chothia, imgt, or none.

    Result Description

    The output results include:

    Output File Name Description
    gen.fr.fasta FASTA file for the Framework Region (FR)
    gen.fr.mutations.txt Mutation file information for the Framework Region (FR)
    gen.cdr.fasta FASTA file for the Complementarity-Determining Region (CDR)
    gen.cdr.mutations.txt Mutation file information for the Complementarity-Determining Region (CDR)
  • Name: Mutation List Generation
    Description: Mutation List Generation是基于一个原始序列,从经过序列比对后得到的序列(例如BLAST得到的同源序列)中提取每个位点出现过的所有突变(同源突变/共识突变),生成一个突变列表,并按位点统计突变的频率。 Generate a list of mutations (also known as consensus mutations) from a set of aligned sequences (normally generated by BLAST).
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-10 10:22:00
    Reference:

    Mutation List Generation

    简介

    Mutation List Generation是基于一个原始序列,从经过序列比对后得到的序列(例如BLAST得到的同源序列)中提取每个位点出现过的所有突变(同源突变/共识突变),生成一个突变列表,并按位点统计突变的频率。

    参数说明

    Reference Seq

    参考蛋白序列,FASTA格式

    Homologs

    同源序列文件,一般由参考序列BLAST数据库后得到,FASTA格式

    结果说明

    输出结果包括:

    输出文件名称 说明
    mutations.csv 突变统计文件,包含每个位点的突变的类型及其百分比,CSV格式
    output.json 突变统计文件,包含每个位点的突变类型及其频率,JSON格式
    mutations.txt 突变文件,根据前面的突变统计信息生成,包含了野生型氨基酸、位置以及突变后氨基酸

    其中mutations.csv包括信息如下:

    字段名称 说明
    WT 野生型氨基酸
    Position 突变位置
    Mutations and frequency 突变氨基酸及其频率

    Mutation List Generation

    Introduction

    Mutation List Generation takes an original (reference) sequence and a set of aligned sequences (for example, homologous sequences obtained from BLAST) and extracts every mutation (homologous mutation/consensus mutation) observed at each position. It generates a mutation list and reports the frequency of each mutation by position.
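
    The per-position counting described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the homologs are assumed to be already aligned to a gap-free reference of the same length, gaps ("-") and unknown residues ("X") are ignored, and the frequency is expressed as a percentage of all homologs.

```python
from collections import Counter, defaultdict


def mutation_list(reference, homologs):
    """Return rows of (wild type, position, mutant residue, percent of homologs).

    Assumes every homolog is aligned column-by-column to a gap-free reference.
    """
    counts = defaultdict(Counter)
    for homolog in homologs:
        pairs = zip(reference, homolog)
        for position, (ref_res, hom_res) in enumerate(pairs, start=1):
            if hom_res in ("-", "X") or hom_res == ref_res:
                continue  # gap, unknown residue, or no change
            counts[position][hom_res] += 1

    rows = []
    for position in sorted(counts):
        for residue, n in counts[position].most_common():
            percent = 100.0 * n / len(homologs)
            rows.append((reference[position - 1], position, residue, percent))
    return rows


# Hypothetical aligned homologs of the reference "MKTAY".
rows = mutation_list("MKTAY", ["MRTAY", "MRTAF", "MKTAY", "M-TAY"])
for wild_type, position, residue, percent in rows:
    print(f"{wild_type}{position}{residue}\t{percent:.1f}%")
```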

    Parameter Description

    Reference Seq

    Reference protein sequence in FASTA format.

    Homologs

    Homologous sequence file typically obtained by BLASTing the reference sequence against a database, in FASTA format.

    Result Description

    The output results include:

    Output File Name Description
    mutations.csv Mutation statistics file containing the type and percentage of mutations at each position, in CSV format
    output.json Mutation statistics file containing the type and frequency of mutations at each position, in JSON format
    mutations.txt Mutation file generated based on the mutation statistics information, containing the wild-type amino acid, position, and mutated amino acid

    The mutations.csv file includes the following information:

    Field Name Description
    WT Wild-type amino acid
    Position Mutation position
    Mutations and frequency Mutated amino acid and its frequency
  • Name: Solubility Score
    Description: 基于序列的蛋白溶解度预测。 Sequence-based protein solubility prediction.
    Tags: undefined
    Author: Hon J
    Release: 2022-01-24 11:53:25
    Reference: Bioinformatics. 2021 Apr 9;37(1):23-28. Bioinformatics. 2017 Oct 1;33(19):3098-3100. J Mol Biol. 2015 Jan 30;427(2):478-90.

    Solubility Score

    简介

    蛋白质溶解度不良阻碍了许多治疗和工业上有用的蛋白质的生产。通过实验手段增加溶解度的努力往往成功率低,并且通常会降低生物活性。使用序列信息来计算预测蛋白的溶解度,可以大大降低实验研究的成本。
    本模块使用CamSol、SoluProt和Protein-Sol算法进行溶解度预测。其中:

    • CamSol是利用最直接影响蛋白质溶解度的氨基酸的物理化学特性,包括疏水性、静电荷以及它们在空间的相互作用,通过对这些特性的组合来定义溶解度分数。该方法在预测突变对蛋白质溶解度的影响方面具有很高的准确性。与其他现有方法相比,如SOLpro和 PROSO II,在测试的56个变体中,该方法正确预测了54个突变体在突变后溶解度的变化,而SOLpro和PROSO II分别为40和32个。
    • SoluProt是一个基于序列信息预测溶解度的机器学习模型,使用了高质量的TargetTrack数据集进行训练,并使用NESG数据库的3100条序列进行了验证,准确度优于其他现有预测算法(评测结果见下表)。基于梯度增强机器模型并采用 96 个基于序列的特征,例如氨基酸含量、与 PDB 序列的序列同一性以及几种聚合的物理化学特性。 对溶解度的预测准确度为 58.5%,AUC 为 0.62,高于其他同类工具。
    • Protein-Sol提供了一种快速的基于序列的方法来预测蛋白质的溶解度,共采用了35个基于序列的特征进行模型构建。使用来自于大肠杆菌,酵母和人源的上万个蛋白数据进行了模型训练和验证测试。注意:要求输入序列长度大于20个氨基酸残基。

    结果说明

    输出结果包括:

    输出文件名称 说明
    protein-sol_score_show.png Protein–Sol方法下,针对Folding Propensity和Charge两个指标的分布图。横坐标Windows为每21个氨基酸为一个片段组别。
    result_per_chain.csv 三种方法下,每条链的预测溶解度结果。
    result_per_residue.csv Protein–Sol方法下,不同蛋白区域对应的溶解度情况(该结果仅针对第一条链)。

    其中result_per_chain.csv包括信息如下:

    字段名称 说明
    Protein ID 蛋白序列名称
    Solubility (CamSol) CamSol方法预测的溶解度。越大表示溶解性越好,大于1时,表示溶解性很好;当分数小于-1时,溶解性很差。
    Solubility (Soluprot) Soluprot方法预测的溶解度
    Solubility (Protein-Sol) Protein-Sol方法预测的溶解度
    pI 蛋白等电点

    其中result_per_residue.csv包括信息如下:

    字段名称 说明
    ID 蛋白序列名称
    Kyte-Doolittle Hydropathy 氨基酸亲水指数是一个描述其支链的亲水性或疏水性程度大小的值。亲水指数越小代表该氨基酸段的亲水性越强。
    Folding Propensity 该数值描述蛋白折叠程度,该数值越大,越不利于蛋白溶解。
    Entropy 熵是在某种分子折叠构象下能保证该分子最稳定(熵最大)。熵越大越不利于蛋白溶解。
    Charge 蛋白质表面带有的电荷值,带电蛋白均有利于溶解度,无论正负。
    Sequence 所分析的序列段。

    参考文献

    • Hon J, Marusiak M, Martinek T, Kunka A, Zendulka J, Bednar D, Damborsky J. SoluProt: prediction of soluble protein expression in Escherichia coli. Bioinformatics. 2021 Apr 9;37(1):23-28. DOI: 10.1093/bioinformatics/btaa1102
    • Hebditch M, Carballo-Amador MA, Charonis S, Curtis R, Warwicker J. Protein-Sol: a web tool for predicting protein solubility from sequence. Bioinformatics. 2017 Oct 1;33(19):3098-3100. DOI: 10.1093/bioinformatics/btx345
    • Sormanni P, Aprile FA, Vendruscolo M. The CamSol method of rational design of protein mutants with enhanced solubility. J Mol Biol. 2015 Jan 30;427(2):478-90. DOI: 10.1016/j.jmb.2014.09.026

    Solubility Score

    Introduction

    Poor protein solubility hinders the production of many therapeutically and industrially useful proteins. Efforts to increase solubility through experimental means often have low success rates and can compromise biological activity. Calculating protein solubility based on sequence information can significantly reduce the cost of experimental research.
    This module uses the CamSol, SoluProt, and Protein-Sol algorithms for solubility prediction. Specifically:

    • CamSol utilizes the physical and chemical properties of amino acids that most directly affect protein solubility, including hydrophobicity, electrostatic charges, and their spatial interactions, to define a solubility score based on a combination of these properties. This method demonstrates high accuracy in predicting the impact of mutations on protein solubility. In a test of 56 variants, it correctly predicted the solubility changes after mutation for 54 variants, compared to 40 and 32 for SOLpro and PROSO II, respectively.
    • SoluProt is a machine learning model that predicts solubility from sequence information. It was trained on a high-quality TargetTrack dataset and validated on 3100 sequences from the NESG database, where it was more accurate than other existing prediction algorithms (see evaluation results in the table below). It is a gradient boosting machine model that uses 96 sequence-based features, such as amino acid content, sequence identity to PDB sequences, and several aggregated physicochemical properties. Its solubility prediction accuracy is 58.5%, with an AUC of 0.62, higher than other similar tools.
    • Protein-Sol provides a fast sequence-based method for predicting protein solubility, built on 35 sequence-based features. The model was trained and validated on data for more than ten thousand proteins from Escherichia coli, yeast, and human sources. Note: input sequences must be longer than 20 amino acid residues.

    Results

    The output results include:

    Output File Name Description
    protein-sol_score_show.png Profile plot of Folding Propensity and Charge from the Protein-Sol method. On the horizontal axis (Windows), each window is a segment of 21 amino acids.
    result_per_chain.csv Predicted solubility of each chain from the three methods.
    result_per_residue.csv Solubility-related properties of different protein regions from the Protein-Sol method (first chain only).

    The result_per_chain.csv includes the following information:

    Field Name Description
    Protein ID Protein sequence name
    Solubility (CamSol) Predicted solubility by CamSol. A higher score indicates better solubility, with scores greater than 1 indicating good solubility and scores less than -1 indicating poor solubility.
    Solubility (SoluProt) Predicted solubility by SoluProt
    Solubility (Protein-Sol) Predicted solubility by Protein-Sol
    pI Isoelectric point of the protein
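
    The CamSol thresholds in the table above can be applied to result_per_chain.csv as in the sketch below. The column names come from the field table; the comma-delimited layout with a header row, the category labels, and the sample rows are assumptions made for the example.

```python
import csv
import io


def camsol_category(score):
    """Map a CamSol score to the ranges described above."""
    if score > 1.0:
        return "good"
    if score < -1.0:
        return "poor"
    return "intermediate"


def categorize(csv_text):
    """Return {Protein ID: category} from CSV text with the documented columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["Protein ID"]: camsol_category(float(row["Solubility (CamSol)"]))
        for row in reader
    }


# Hypothetical file content; real files also carry the other listed columns.
SAMPLE = """Protein ID,Solubility (CamSol)
mAb_1,1.42
mAb_2,-1.30
mAb_3,0.15
"""

print(categorize(SAMPLE))
```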

    The result_per_residue.csv includes the following information:

    Field Name Description
    ID Protein sequence name
    Kyte-Doolittle Hydropathy Hydropathy index, describing how hydrophilic or hydrophobic the amino acid side chains are. A smaller index indicates a more hydrophilic segment.
    Folding Propensity Folding propensity of the segment; higher values are less favorable for protein solubility.
    Entropy Entropy of the segment; higher entropy is less favorable for protein solubility.
    Charge Charge carried on the protein surface; charge favors solubility whether it is positive or negative.
    Sequence The analyzed sequence segment.
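
    To make the windowed Kyte-Doolittle values concrete, the sketch below averages the published Kyte-Doolittle scale over 21-residue windows. It is an illustration only: the sliding step of one residue and the plain averaging are assumptions, and Protein-Sol's own windowing and normalization may differ.

```python
# Kyte-Doolittle hydropathy scale (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(sequence, window=21):
    """Average hydropathy of each window, sliding one residue at a time."""
    if len(sequence) < window:
        raise ValueError("sequence is shorter than the window")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(sequence) - window + 1)
    ]


# Hypothetical 25-residue sequence: five windows of 21 residues.
profile = window_hydropathy("MKTAYIAKQRQISFVKSHFSRQLEE")
print(len(profile))
print([round(value, 2) for value in profile])
```

    Lower averages mark more hydrophilic windows, which matches the interpretation given in the field table.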

    References

    • Hon J, Marusiak M, Martinek T, Kunka A, Zendulka J, Bednar D, Damborsky J. SoluProt: prediction of soluble protein expression in Escherichia coli. Bioinformatics. 2021 Apr 9;37(1):23-28. DOI: 10.1093/bioinformatics/btaa1102
    • Hebditch M, Carballo-Amador MA, Charonis S, Curtis R, Warwicker J. Protein-Sol: a web tool for predicting protein solubility from sequence. Bioinformatics. 2017 Oct 1;33(19):3098-3100. DOI: 10.1093/bioinformatics/btx345
    • Sormanni P, Aprile FA, Vendruscolo M. The CamSol method of rational design of protein mutants with enhanced solubility. J Mol Biol. 2015 Jan 30;427(2):478-90. DOI: 10.1016/j.jmb.2014.09.026
  • Name: Humanization Report
    Description: Humanization Report是抗体人源化设计报告生成模块,用于生成最终的抗体人源化设计报告以及相应的专利实施例段落。 Humanization Report is an antibody humanization design reporting module for generating humanization design reports as well as patent example paragraphs.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-01-19 09:19:22
    Reference:

    Humanization Report

    简介

    Humanization Report是抗体人源化设计报告生成模块,用于生成最终的抗体人源化设计报告以及相应的专利实施例段落。

    参数说明

    Graft Policy

    Grafting模块生成的Graft Policy文件。

    Mutate Policy

    Back Mutation Grouping模块生成的Policy文件。

    结果说明

    输出结果包括:

    输出文件名称 说明
    BM.pptx 回复突变位点汇总文件
    batch_registration_template.xlsx 批量注册模板文件
    hotspot_summary.xlsx 风险位点总结
    patent_example_template.docx 人源化设计序列在相应的专利实施例段落
    humanized_variants.fasta 抗体人源化设计序列文件,FASTA格式
    Report.docx 抗体人源化设计报告,包括整个人源化设计过程涉及的序列、分组等信息

    其中batch_registration_template.xlsx包含如下信息:

    字段名称 说明
    Protein Sequence 蛋白序列
    Molecule Name 分子名称

    其中hotspot_summary.xlsx包含如下信息:

    字段名称 说明
    ID 抗体序列名称
    Sequence-CDR CDR序列区域
    Deamidation 脱酰胺位点
    Isomerization 异构化位点
    Cleavage 酶切位点
    Hydrolysis 水解位点
    Glycosylation 糖基化位点
    Cys 半胱氨酸数量
    Oxidation 氧化位点
    High risk 高风险率
    High risk sites 高风险位点

    Humanization Report

    Introduction

    Humanization Report is the report-generation module for antibody humanization design. It produces the final antibody humanization design report and the corresponding patent example paragraphs.

    Parameter Description

    Graft Policy

    The Graft Policy file generated by the Grafting module.

    Mutate Policy

    The Policy file generated by the Back Mutation Grouping module.

    Result Description

    The output results include:

    Output File Name Description
    BM.pptx Summary file of back mutation sites
    batch_registration_template.xlsx Batch registration template file
    hotspot_summary.xlsx Summary of hotspot sites
    patent_example_template.docx Patent example paragraphs containing the humanized design sequences
    humanized_variants.fasta Antibody humanization design sequence file in FASTA format
    Report.docx Antibody humanization design report, including information on sequences, grouping, etc., involved in the entire humanization design process

    The batch_registration_template.xlsx file contains the following information:

    Field Name Description
    Protein Sequence Protein sequence
    Molecule Name Molecule name

    The hotspot_summary.xlsx file contains the following information:

    Field Name Description
    ID Antibody sequence name
    Sequence-CDR CDR sequence region
    Deamidation Deamidation site
    Isomerization Isomerization site
    Cleavage Cleavage site
    Hydrolysis Hydrolysis site
    Glycosylation Glycosylation site
    Cys Number of cysteines
    Oxidation Oxidation site
    High risk High-risk rate
    High risk sites High-risk sites
  • Name: Back Mutation Grouping
    Description: Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。 Back Mutation Grouping is a grouping module in the antibody humanization design workflow, which groups the back mutations based on the mutation scoring table generated by the Mutation Score module and returns the back mutated sequence.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-01-17 15:21:07
    Reference:

    Back Mutation Grouping

    简介

    Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。

    参数说明

    Grafted Chain

    抗体CDR区嫁接后序列文件,FASTA格式,由Grafting模块生成

    Raw Chain

    抗体序列文件,FASTA格式

    Mutation Score

    人源化突变评分文件,CSV格式,由Mutation Score模块生成

    Output File

    指定输出的突变序列文件名称,FASTA格式

    Cutoff

    打分分组的截断值,逗号分割,例如:2,5,10表示将氨基酸突变评分大于10的为一组,5~10的氨基酸为一组,小于2的氨基酸分为一组。

    Output Policy

    指定输出的回复突变的文件

    结果说明

    根据不同截断值得到突变分组结果文件mutate_policy.json。

    Back Mutation Grouping

    Introduction

    Back Mutation Grouping is a grouping module in the humanization design process of antibodies, which groups back mutations based on the mutation score table output by the Mutation Score module and returns the mutated sequences.

    Parameter Description

    Grafted Chain

    Sequence file of the antibody CDR region after grafting, in FASTA format, generated by the Grafting module.

    Raw Chain

    Sequence file of the antibody, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Specify the name of the output mutation sequence file, in FASTA format.

    Cutoff

    Cutoff values used to group mutations by score, separated by commas. For example, 2,5,10 places mutations with scores greater than 10 in one group, those with scores between 5 and 10 in another group, and those with scores less than 2 in a separate group.
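
    The grouping by cutoffs can be sketched as a simple binning step. This is a hedged illustration rather than the module's logic: the function name, the sample scores, the handling of scores that fall exactly on a cutoff, and the treatment of the 2 to 5 range (which the description above does not mention) are all assumptions.

```python
from bisect import bisect_right


def group_by_cutoffs(scores, cutoffs):
    """Bin mutations by score.

    With cutoffs [2, 5, 10] the bins are: 0 for score < 2, 1 for 2 <= score < 5,
    2 for 5 <= score < 10, and 3 for score >= 10 (boundary handling is assumed).
    """
    bounds = sorted(cutoffs)
    groups = {index: [] for index in range(len(bounds) + 1)}
    for mutation, score in scores.items():
        groups[bisect_right(bounds, score)].append(mutation)
    return groups


# Parse the comma-separated parameter value, then bin hypothetical scores.
cutoffs = [float(value) for value in "2,5,10".split(",")]
scores = {"A24V": 1.0, "S49A": 3.5, "R71V": 7.0, "T93A": 12.0}
print(group_by_cutoffs(scores, cutoffs))
```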

    Output Policy

    Specify the file for the output of back mutations.

    Result Description

    The mutation grouping results file, mutate_policy.json, is generated based on different cutoff values.

  • Name: Human Germline BLAST
    Description: 通过序列比对在人类生殖系数据库中搜索与目标抗体序列接近的同源模板,输出对应的模板序列以及序列一致性信息。 Search the human germline database for homologs of the target antibody sequence, and output the template sequences and the corresponding identities.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-12-20 16:24:53
    Reference:

    Human Germline BLAST

    简介

    通过序列比对在人类生殖系数据库中搜索与目标抗体序列最接近的同源模板,输出对应的模板序列以及序列一致性信息。

    参数说明

    Sequence String模式

    Input Sequence

    抗体的序列(纯序列信息,非FASTA格式文件)。

    Type

    抗体编号类型:kabat、chothia、imgt。

    TopHits

    输出同源性最高的n条序列。

    Fasta File模式

    FASTA File

    抗体的序列文件,FASTA格式。

    Type

    抗体编号类型:kabat、chothia、imgt。

    TopHits

    输出同源性最高的n条序列。

    结果说明

    输出参数 输出文件名称 说明
    Hits Sequence hits.fasta 包含同源性最高的n条序列的序列文件
    Result result.json 包含找到的Germline模板以及序列的一致性信息

    相关内容

    抗体常用的germline模板:
    image.png

    临床后期及已上市抗体的germline配对分布情况(统计自Adimab数据集):
    image.png
    image.png
    Adimab_germline_usage.jpeg

    Human Germline BLAST

    Introduction

    This module uses sequence alignment to search the human germline database for the templates closest to a target antibody sequence. It outputs the matching template sequences together with their sequence identity.

    Parameter Description

    Sequence String Mode

    Input Sequence

    The antibody sequence as a plain string (sequence only, not a FASTA file).

    Type

    Type of antibody numbering: kabat, chothia, imgt.

    TopHits

    Number of top hits to output.

    Fasta File Mode

    FASTA File

    Antibody sequence file in FASTA format.

    Type

    Type of antibody numbering: kabat, chothia, imgt.

    TopHits

    Number of top hits to output.

    Result Description

    Output Parameter Output File Name Description
    Hits Sequence hits.fasta File containing the top n sequences with the highest homology
    Result result.json File containing the germline templates found and their sequence identity information
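
For readers unfamiliar with the term, sequence identity can be illustrated with a short Python sketch. It assumes the two sequences are already aligned to equal length and that gapped columns are excluded from the denominator; the module's exact definition is not stated here and may differ, and the example sequences are made up.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for query_residue, template_residue in zip(aligned_query, aligned_template):
        if query_residue == gap or template_residue == gap:
            continue  # skip gapped columns (an assumption about the definition)
        compared += 1
        if query_residue == template_residue:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up nine-residue fragments: six of the nine positions match.
print(round(percent_identity("EVQLVESGG", "QVQLVQSGA"), 2))
```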

    Related Content

    Commonly used germline templates for antibodies:
    image.png

    Distribution of germline pairing for late-stage and marketed antibodies (statistics from the Adimab dataset):
    image.png
    image.png
    Adimab_germline_usage.jpeg

  • Name: Homology Modeling (Antibody)
    Description: 使用经典同源建模方法从序列构建抗体结构模型。支持抗体(由重链和轻链可变区组成)。 Build an antibody structure model from sequence using classical homology modeling. Antibodies consisting of heavy- and light-chain variable regions are supported.
    Tags: undefined
    Author: RosettaCommons
    Release: 2021-10-22 10:51:24
    Reference: Weitzner BD, Jeliazkov JR, Lyskov S, et al. Modeling and docking of antibody structures with Rosetta. Nat Protoc. 2017 Feb;12(2):401-416.

    Homology Modeling (Antibody)

    简介

    抗体模型一直是模型预测中的重点,并对抗体-抗原对接的准确性有较大影响,但由于CDR区的序列较为多变且随机性较大,而难以使用MSA进行结构预测。采用Rosetta中antibody模块,可以通过能量的优化来更进一步的优化CDR区结构,以确保结构的准确性。
    Homology Modeling (Antibody)模块的技术特点包含如下:
    1.自动识别CDR区和FR区,根据每个FR区和CDR区的序列分别搜索数据库,以寻找最佳模板。
    2.针对CDR区结构,根据能量稳定性进行结构优化,以得到更高精度的结构。
    3.全自动完成。输入抗体序列后就可以直接生成模型。
    Rosetta是业界抗体设计广泛使用的工具,在众多案例中得到应用。
    image.png
    上图:在运用KIC+kink的计算方法后,H3与软件打分的正相关性较高

    参数说明

    Input File

    抗体的序列文件,FASTA格式。

    结果说明

    得到预测抗体结构文件grafting/model.0.pdb。

    参考文献

    Weitzner BD, Jeliazkov JR, Lyskov S, et al. Modeling and docking of antibody structures with Rosetta. Nat Protoc. 2017 Feb;12(2):401-416.

    Homology Modeling (Antibody)

    Introduction

    Antibody modeling has always been a focus in structure prediction, and it has a significant impact on the accuracy of antibody-antigen docking. However, due to the high variability and randomness of the CDR region sequences, it is challenging to use Multiple Sequence Alignment (MSA) for structure prediction. By utilizing the antibody module in Rosetta, the structure of the CDR regions can be further optimized through energy minimization to ensure structural accuracy.

    The technical features of the Homology Modeling (Antibody) module include:

    1. Automatic identification of CDR regions and FR regions, searching databases for the best templates based on the sequences of each FR and CDR region.
    2. Structural optimization of CDR regions based on energy stability to obtain higher-precision structures.
    3. Fully automated process. Models can be generated directly after inputting the antibody sequence.

    Rosetta is a widely used tool in the industry for antibody design and has been applied in numerous cases.
    image.png
    Figure above: after applying the KIC+kink calculation method, H3 shows a high positive correlation with the software score.

    Parameter Description

    Input File

    Sequence file of the antibody in FASTA format.

    Result Description

    Obtain the predicted antibody structure file grafting/model.0.pdb.

    Reference

    Weitzner BD, Jeliazkov JR, Lyskov S, et al. Modeling and docking of antibody structures with Rosetta. Nat Protoc. 2017 Feb;12(2):401-416.

  • Name: ADMET Prediction
    Description: 基于通信消息传递神经网络的ADMET预测。CMPNN通过通信内核加强节点和边之间的消息交互来改进分子嵌入。此外,通过引入新的消息增强模块,丰富了消息生成过程。 ADMET Prediction based on Communicative Message Passing Neural Network. CMPNN improves molecular embedding by strengthening the message interactions between nodes and edges through a communicative kernel. In addition, the message generation process is enriched by introducing a new message booster module.
    Tags: undefined
    Author: Ying Song
    Release: 2022-01-16 15:00:22
    Reference: Ying Song, Shuangjia Zheng, Zhangming Niu, et al., Communicative Representation Learning on Attributed Molecular Graphs. International Joint Conference on Artificial Intelligence. 2020. 29:2831-2838.

    ADMET Prediction

    简介

    ADMET Prediction模块基于通信消息传递神经网络(Communicative Message Passing Neural Network,CMPNN)对化合物在体内的吸收、分布、代谢、排泄和毒性特性进行预测并且评估其潜在药效,从而筛选出更有前途的化合物,缩短新药研发周期。

    早期的图神经网络(GNN),尤其是消息传递神经网络(MPNN)及其变体,在分子图建模方面取得了显著成效。然而,这些模型主要关注节点(原子)或边(键)的信息,可能导致对分子图的表示不够充分。CMPNN模型通过增强节点和边之间的消息交互,改进了分子图的嵌入。该模型引入了消息增强器(Message Booster)模块,丰富了消息生成过程。同时,设计了节点-边消息通信函数,以更好地利用节点和边的信息。

    CMPNN的核心原理在于将分子表示为图结构,通过消息传递机制在不同节点之间传递信息。通过迭代优化多个局部属性和全局特征的计算,CMPNN最终生成整个分子的特征表示。该模型能够在不同级别的分子特征之间进行有效的信息传递和整合,使其在分子预测、反应预测和药物发现等领域表现优异。

    图1. CMPNN 嵌入生成过程。

    图2. MPNN、DMPNN和CMPNN三种模型在区分毒性和非毒性原子方面的能力。CMPNN能够更精细地区分毒性原子和非毒性原子,红色点(有毒原子)和蓝色点(非毒性原子)之间的分离更为明显。

    图3. CMPNN模型在BBBP和ESOL数据集上的消融研究结果。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。对SMILES文件上传格式要求如下所示,第一行必须为smiles字段:

    smiles
    CCCC(C)C
    CCOC(=O)c1cncn1C(C)c2ccccc2
    

    Property

    选择需要预测的ADMET性质,包括如下:

    1. bbbp:血脑屏障穿透(BBBP)数据集包括超过2040个化合物的渗透特性的二进制标签,用于预测化合物是否能够穿过血脑屏障。
    2. clintox:ClinTox临床毒性预测数据集是一个用于预测药物分子毒性的数据集,其中包括通过美国FDA(美国食品药品监督管理局)批准的1491种药物分子和由于毒性在临床试验中失败的化合物列表。
    3. sider:Sider数据集提供了已上市药物及其相应的不良药物反应信息,可以用于预测某个药物的副作用。Sider数据集包括1427种已获得批准的药物,共涵盖了超过14000种不良反应。这些反应被分成了27个系统器官类别,如心血管系统、神经系统、免疫系统等,描述了这些药物在各个方面可能导致的副作用以及与身体系统的关联。
    4. tox21:Tox21数据集针对与药物毒性相关的12个不同靶点预测化合物毒性,包括核受体和应激反应途径相关靶点。
    5. freesolv:FreeSolv数据集包含一些化合物溶解度相关的数据,用于预测化合物的溶解度。在数据集中,每个化合物都有其化学结构、摩尔质量、实验量、溶解度等信息;此外,对于一些小分子还提供了水合自由能数据。
    6. esol:ESOL数据集包括1128个小分子化合物及其相应的水溶性,用于预测化合物的水溶性。
      对于分类性质(bbbp、clintox、sider、tox21),预测值为1表示阳性(例如具有毒性或对相应靶点有活性),为0表示阴性;freesolv和esol输出连续数值。

    结果说明

    选择不同ADMET性质,输出不同结果的result.csv文件,包含信息如下:

    1. 选择bbbp预测化合物是否能够穿过血脑屏障,n_np为1时说明化合物存在穿过血脑屏障的靶点,为0则代表不能穿过血脑屏障。
    2. 选择clintox预测化合物是否包含该ClinTox数据集所列药物毒性,CT_TOX为1时说明化合物存在毒性,为0则不存在。
    3. 选择sider预测化合物是否存在各自不良反应,列表第一行为各自不良反应的名称,如Hepatobiliary_disorders为有关肝脏方面的不良反应,Hepatobiliary_disorders为1是代表存在肝脏不良反应,为0则不存在。
    4. 选择tox21预测化合物与核受体和应激反应相关的毒性,如果存在相关靶点则为1,不存在则为0。当为1时表示包含这个靶点所含的毒性,使其失去原有功能。
      其中NR表示为与核受体相关的靶点(Nuclear receptor),包括以下几种:
      1)NR-AR:AR指“雄激素受体”(Androgen Receptor),是一种核受体家族的成员。AR通常通过结合雄激素(如睾酮等)而被激活,并在细胞核中调节基因的转录活性及其生成的蛋白质,参与了许多重要的生理功能,如生殖、肌肉发育、骨密度、皮肤基质合成等生长、发育与维护等过程。
      2)NR-AR-LBD:AR-LBD指的是雄激素受体(Androgen Receptor)的配体结合域(Ligand-Binding Domain)部分,它是与外源性配体相互作用并进行信号转导的重要区域。AR-LBD区域丰富的氨基酸残基使其能够与多种内源性激素和外源性激素具有高度的亲和力,并在细胞核内激活基因转录从而调控基因表达。
      3)NR-AhR:AhR是指芳香族环烃受体(Aryl Hydrocarbon Receptor),是一种核受体,被广泛表达在细胞核中。它被认为是调节生物体对多种环境污染物的敏感性和毒性的关键因素。AhR能够与多种环境污染物(如二恶英、苯并芘等)结合并调节下游基因的表达,参与了许多生理过程,如免疫应答、细胞周期、细胞分化、代谢和生殖等过程。
      4)NR-Aromatase:Aromatase是指芳香化酶,是一种通过催化转化雄激素为雌激素的酶类蛋白质。它主要参与产生雌激素的生理调节,在许多组织和器官中都广泛表达,如卵巢、骨骼、脑组织、肌肉和脂肪组织等。
      5)NR-ER:ER是指雌激素受体(Estrogen Receptor),是一个存在于细胞内的核受体蛋白。ER通常分为两个亚型:ERα和ERβ。ERα主要分布在与性激素相关的女性器官(如乳腺、子宫),而ERβ则分布在人体的许多器官和组织中。ER的主要功能是与雌激素结合并调控基因的表达。
      6)NR-ER-LBD:ER-LBD指的是雌激素受体的配体结合域(Ligand-binding domain of Estrogen Receptor),是雌激素受体的一部分。ER-LBD是一种结构域,存在于雌激素受体分子的C端,其主要功能是结合雌激素类分子。通过改变ER-LBD的构象,可以调节雌激素受体的激活状态,从而在治疗某些疾病,如乳腺癌和骨质疏松症等方面具有重要的应用前景。
      7)NR-PPAR-gamma:PPAR-gamma是指过氧化物酶体增殖因子活化受体gamma(Peroxisome proliferator-activated receptor gamma),是一种转录因子,属于PPARs家族的一员。PPAR-gamma主要表达于脂肪组织和免疫细胞中,它参与了多种生理过程,如代谢、免疫调节、细胞分化等。
      其中,SR表示为与应激相关的靶点(Stress Response)。每列含义如下:
      1)SR-ARE:ARE是指抗氧化反应元件(Antioxidant Response Element),是一种基础结构的DNA序列,其中包含与氧化应激信号反应体系相关的一些基因,如抗氧化酶、代谢调节因子等。
      2)SR-ATAD5:ATAD5(ATPase family AAA domain-containing protein 5)是一种ATPase家族的蛋白。ATAD5存在于人类细胞中,定位于细胞核,并在DNA损伤修复过程中发挥重要作用。
      3)SR-HSE:HSE是指热应激反应元件(Heat shock response element),是一段基因组DNA序列,其主要功能是响应细胞内和细胞外的热应激,通过激活热休克蛋白(HSP)基因的表达,保护细胞免受热应激的伤害。
      4)SR-MMP:MMP指线粒体膜电位(Mitochondrial Membrane Potential),线粒体膜电位下降反映线粒体功能受损。
      5)SR-p53:p53是肿瘤抑制蛋白,防止癌症发生和发展。
    5. 选择freesolv预测化合物的水合自由能,模型基于包含实验和计算水合自由能的FreeSolv数据库训练。
    6. 选择esol为ESOL数据集,其包含1128个化合物及相应的水溶性(log10(mol/L)),计算得到logSolubility值。

    参考文献

    Ying Song, Shuangjia Zheng, Zhangming Niu, et al., Communicative Representation Learning on Attributed Molecular Graphs. International Joint Conference on Artificial Intelligence. 2020. 29:2831-2838.

    ADMET Prediction

    Introduction

    The ADMET Prediction module uses the Communicative Message Passing Neural Network (CMPNN) to predict the in vivo absorption, distribution, metabolism, excretion, and toxicity properties of compounds and to evaluate their potential efficacy, thereby identifying more promising compounds and shortening the drug development cycle.

    Early Graph Neural Networks (GNNs), especially Message Passing Neural Networks (MPNNs) and their variants, achieved significant success in molecular graph modeling. However, these models mainly focus on the information of nodes (atoms) or edges (bonds), which may lead to insufficient representation of the molecular graph. The CMPNN model improves the embedding of molecular graphs by enhancing message interactions between nodes and edges. This model introduces a Message Booster module to enrich the message generation process. Additionally, a node-edge message communication function is designed to better utilize the information from nodes and edges.

    The core principle of CMPNN is to represent molecules as graph structures and transmit information between different nodes through a message passing mechanism. By iteratively optimizing the computation of multiple local properties and global features, CMPNN ultimately generates a feature representation of the entire molecule. This model can effectively transmit and integrate information between different levels of molecular features, making it excel in areas such as molecular prediction, reaction prediction, and drug discovery.

    Figure 1. CMPNN embedding generation algorithm.

    Figure 2. The ability of the MPNN, DMPNN, and CMPNN models to distinguish between toxic and non-toxic atoms. The CMPNN is able to differentiate toxic atoms from non-toxic atoms more precisely, with a more pronounced separation between red dots (toxic atoms) and blue dots (non-toxic atoms).

    Figure 3. Ablation results on BBBP and ESOL datasets.

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format. For SMILES file upload, the format should follow as shown below, where the first line must be the smiles field:

    smiles
    CCCC(C)C
    CCOC(=O)c1cncn1C(C)c2ccccc2
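
A SMILES input file in this layout can be produced and checked with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the check only verifies the header line described above, not the chemical validity of each SMILES string.

```python
import io


def write_smiles_file(smiles_list, handle):
    """Write a one-column SMILES file whose first line is the 'smiles' header."""
    handle.write("smiles\n")
    for smiles in smiles_list:
        handle.write(smiles.strip() + "\n")


def check_smiles_file(text):
    """Return True if the first non-empty line is 'smiles' and at least one entry follows."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return len(lines) > 1 and lines[0] == "smiles"


buffer = io.StringIO()
write_smiles_file(["CCCC(C)C", "CCOC(=O)c1cncn1C(C)c2ccccc2"], buffer)
print(buffer.getvalue())
print(check_smiles_file(buffer.getvalue()))
```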
    

    Property

    Select the ADMET properties to predict, including:

    1. bbbp: Blood-Brain Barrier Penetration (BBBP) dataset includes binary labels of permeation properties for over 2040 compounds, used to predict whether a compound can penetrate the blood-brain barrier.
    2. clintox: The ClinTox clinical toxicity prediction dataset is used to predict the toxicity of drug molecules, including a list of 1491 drug molecules approved by the U.S. FDA and compounds that failed in clinical trials due to toxicity.
    3. sider: The Sider dataset provides information on adverse drug reactions associated with marketed drugs, used to predict the side effects of a drug. The Sider dataset includes 1427 approved drugs, covering over 14,000 adverse reactions categorized into 27 organ system classes, such as cardiovascular system, nervous system, immune system, etc., describing the possible side effects of these drugs and their associations with body systems.
    4. tox21: The Tox21 dataset predicts compound toxicity related to 12 different targets associated with nuclear receptors and stress response pathways.
    5. freesolv: The FreeSolv dataset contains experimental and calculated hydration free energies of small molecules in water, a solubility-related property, and is used to predict the hydration free energy of compounds. Each entry includes the compound's chemical structure together with its experimental and calculated values.
    6. esol: The ESOL dataset includes 1128 small molecule compounds and their corresponding water solubility, used to predict the water solubility of compounds.
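      For the classification properties (bbbp, clintox, sider, tox21), a predicted value of 1 indicates a positive label (for example, toxic, or active against the corresponding target) and 0 indicates a negative one; freesolv and esol output continuous values.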

    Result Description

    Selecting different ADMET properties will output different result.csv files, containing the following information:

    1. Selecting bbbp predicts whether a compound can penetrate the blood-brain barrier. An n_np value of 1 means the compound is predicted to cross the barrier; 0 means it is not.
    2. Selecting clintox predicts clinical toxicity as labeled in the ClinTox dataset. A CT_TOX value of 1 means the compound is predicted to be toxic; 0 means it is not.
    3. Selecting sider predicts adverse reactions by category. The header row lists the adverse reaction names, such as Hepatobiliary_disorders for liver-related adverse reactions. A value of 1 indicates that the adverse reaction is predicted; 0 indicates that it is not.
    4. Selecting tox21 predicts toxicity related to nuclear receptors and stress response pathways. A value of 1 means the compound is predicted to act on the corresponding target and disrupt its normal function; 0 means no such activity is predicted.
      Columns prefixed with NR are nuclear receptor-related targets:
      a) NR-AR: Androgen Receptor (AR) is a member of the nuclear receptor family involved in various physiological functions.
      b) NR-AR-LBD: Ligand-binding domain of Androgen Receptor (AR-LBD) interacts with ligands and mediates signaling.
      c) NR-AhR: Aryl Hydrocarbon Receptor (AhR) regulates sensitivity to environmental pollutants.
      d) NR-Aromatase: Aromatase converts androgens to estrogens.
      e) NR-ER: Estrogen Receptor (ER) regulates gene expression.
      f) NR-ER-LBD: Ligand-binding domain of Estrogen Receptor (ER-LBD) modulates receptor activation.
      g) NR-PPAR-gamma: Peroxisome proliferator-activated receptor gamma (PPAR-gamma) regulates metabolism and immune response.
      Columns prefixed with SR are stress response-related targets:
      a) SR-ARE: Antioxidant Response Element (ARE) responds to oxidative stress.
      b) SR-ATAD5: ATPase family AAA domain-containing protein 5 (ATAD5) plays a role in DNA damage repair.
      c) SR-HSE: Heat Shock Response Element (HSE) responds to heat stress.
      d) SR-MMP: Mitochondrial Membrane Potential (MMP); a loss of membrane potential indicates impaired mitochondrial function.
      e) SR-p53: Tumor suppressor protein p53 prevents cancer development.
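    5. Selecting freesolv predicts the hydration free energy of each compound; the model is trained on the FreeSolv database of experimental and calculated hydration free energies.
    6. Selecting esol predicts water solubility, reported as a logSolubility value (log10 of the solubility in mol/L); the model is trained on the 1128 compounds of the ESOL dataset.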
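
As an illustration of reading the binary columns, the sketch below lists the targets flagged with 1 for each compound. The column layout (a smiles column followed by one column per target) and the values in the sample are assumptions made for the example; the real result.csv may be organized differently.

```python
import csv
import io

# Invented sample in an assumed layout; not real prediction output.
SAMPLE = """smiles,NR-AR,NR-AhR,SR-MMP,SR-p53
CCCC(C)C,0,0,0,0
CCOC(=O)c1cncn1C(C)c2ccccc2,0,1,1,0
"""


def flagged_targets(csv_text, id_column="smiles"):
    """Map each compound to the list of columns whose predicted value is 1."""
    flagged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        flagged[row[id_column]] = [
            name
            for name, value in row.items()
            if name != id_column and value.strip() == "1"
        ]
    return flagged


for compound, targets in flagged_targets(SAMPLE).items():
    print(compound, targets or "no targets flagged")
```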

    References

    Ying Song, Shuangjia Zheng, Zhangming Niu, et al., Communicative Representation Learning on Attributed Molecular Graphs. International Joint Conference on Artificial Intelligence. 2020. 29:2831-2838.

  • Name: Protein Docking (FRODOCK)
    Description: FRODOCK可以在两个蛋白质结构之间执行详尽的6D对接。这种近似能够非常有效地产生许多关于这两种蛋白质如何相互作用的潜在预测。由于它是刚体近似,因此仅当预期结合后构象变化减少时才有效,通常用作第一步初始对接,后面可以接进一步柔性对接。 FRODOCK performs exhaustive 6D docking between two protein structures. This approximation efficiently generates many candidate predictions of how the two proteins may interact. Because it is a rigid-body approximation, it is only effective when the conformational changes expected upon binding are small, and it is typically used as an initial docking step followed by flexible docking.
    Tags: undefined
    Author: Ramírez-Aportela E
    Release: 2022-01-12 19:45:05
    Reference: Ramírez-Aportela E, López-Blanco JR, Chacón P. FRODOCK 2.0: fast protein-protein docking server. Bioinformatics. 2016 Aug 1;32(15):2386-8.

    Protein Docking (FRODOCK)

    简介

    FRODOCK是由西班牙Pablo Chacón教授开发的蛋白-蛋白对接软件。FRODOCK使用球谐函数(spherical harmonics)的旋转搜索提高对接效率。全局能量优化采用 6D(3D 旋转 + 3D平移)刚体详尽搜索(rigid-body exhaustive search)固定配体的构象。复合物的结合能考虑范德华力、静电和去溶剂化三个能量项。在抗原-抗体复合物、酶-底物、其他蛋白复合物的基准测试集中效果表现很好。具有以下技术特点:

    1. 采用球谐函数旋转搜索提高对接效率。
    2. 采用6D(3D 旋转 + 3D平移)进行详尽搜索采样。

    参数说明

    Receptor File

    受体结构文件,PDB格式。

    Ligand File

    配体结构文件,PDB格式。

    Interaction Type

    相互作用类型。

    Constraints File

    限制文件,文本格式如下:

    # RECEPT_____ LIGAND_____ D__
    # -------------------------------
    GLY A 269 SER A 81 5
    GLY A 269 LEU A 84 10
    

    其中"GLY A 269"代表受体部分的残基名称"GLY"、链名称"A"、残基编号"269";"SER A 81"代表配体部分的残基名称"SER"、链名称"A"、残基编号"81";"5"代表受体残基与配体残基之间的距离为5Å。

    Clusters Number

    生成构象聚类最大数目。

    Output TopN

    保存的得分最高分子的PDB文件。

    Reference File

    参考结合配体分子(用于比较),格式:PDB。

    结果说明

    输出结果包括:

    输出文件名称 说明
    complex_01.pdb-complex_10.pdb 输出打分前十的复合物构象
    output_complex_TopN.tar.gz 输出所有复合物结构的压缩包文件
    TopN_score.csv 提供复合物构象的对接打分,其中打分值越大,结合能力越强。
    output_ligand_TopN.tar.gz 输出所有配体结构的压缩包文件

    其中TopN_score.csv包括信息如下:

    字段名称 说明
    NO 打分排序
    Euler1 配体旋转α角度(ZYZ顺序旋转的欧拉角)
    Euler2 配体旋转β角度(ZYZ顺序旋转的欧拉角)
    Euler3 配体旋转γ角度(ZYZ顺序旋转的欧拉角)
    posX 配体质心所在位置的X坐标
    posY 配体质心所在位置的Y坐标
    posZ 配体质心所在位置的Z坐标
    Absolute_Energy_Score 绝对能量分数用来评估复合物结合能力强弱。
    Ligand_File 配体文件名称
    complex_pdb 复合物文件名称

    参考文献

    Ramírez-Aportela E, López-Blanco JR, Chacón P. FRODOCK 2.0: fast protein-protein docking server. Bioinformatics. 2016 Aug 1;32(15):2386-8.

    Protein Docking (FRODOCK)

    Introduction

    FRODOCK is a protein-protein docking software developed by Professor Pablo Chacón from Spain. FRODOCK utilizes spherical harmonics for rotation search to enhance docking efficiency. Global energy optimization is achieved through a 6D (3D rotation + 3D translation) rigid-body exhaustive search with fixed ligand conformation. The binding energy of the complex considers van der Waals forces, electrostatic interactions, and desolvation energy. It has shown good performance in benchmark tests with antigen-antibody complexes, enzyme-substrate interactions, and other protein complexes. It features the following technical aspects:

    1. Utilizes spherical harmonics for rotation search to enhance docking efficiency.
    2. Utilizes 6D (3D rotation + 3D translation) for exhaustive search sampling.

    Parameter Description

    Receptor File

    Structure file of the receptor in PDB format.

    Ligand File

    Structure file of the ligand in PDB format.

    Interaction Type

    Type of interaction.

    Constraints File

    Text file specifying constraints, with the format:

    # RECEPT_____ LIGAND_____ D__
    # -------------------------------
    GLY A 269 SER A 81 5
    GLY A 269 LEU A 84 10
    

    Where “GLY A 269” represents the residue name “GLY”, chain “A”, residue number “269” in the receptor part; “SER A 81” represents the residue “SER”, chain “A”, residue number “81” in the ligand part; and “5” represents a distance of 5Å between the receptor and ligand residues.
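
The constraints format above is simple enough to validate before submitting a job. The following is a minimal parsing sketch, assuming whitespace-separated fields and that lines starting with # are comments; the Constraint type and function name are hypothetical and not part of FRODOCK.

```python
from typing import NamedTuple


class Constraint(NamedTuple):
    receptor_residue: str
    receptor_chain: str
    receptor_number: int
    ligand_residue: str
    ligand_chain: str
    ligand_number: int
    distance: float  # in angstroms


def parse_constraints(text):
    """Parse constraint lines, skipping blank lines and '#' comment lines."""
    constraints = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 7:
            raise ValueError(f"line {line_number}: expected 7 fields, got {len(fields)}")
        constraints.append(
            Constraint(fields[0], fields[1], int(fields[2]),
                       fields[3], fields[4], int(fields[5]), float(fields[6]))
        )
    return constraints


EXAMPLE = """# RECEPT_____ LIGAND_____ D__
# -------------------------------
GLY A 269 SER A 81 5
GLY A 269 LEU A 84 10
"""
for constraint in parse_constraints(EXAMPLE):
    print(constraint)
```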

    Clusters Number

    Maximum number of conformation clusters to generate.

    Output TopN

    Number of top-scoring molecules to save as PDB files.

    Reference File

    Reference ligand molecule for comparison, in PDB format.

    Result Description

    The output includes:

    Output File Name Description
    complex_01.pdb-complex_10.pdb Output of the top ten scored complex conformations
    output_complex_TopN.tar.gz Compressed file containing all complex structures
    TopN_score.csv Provides docking scores for complex conformations, where higher scores indicate stronger binding affinity
    output_ligand_TopN.tar.gz Compressed file containing all ligand structures

    The TopN_score.csv file includes the following information:

    Field Name Description
    NO Ranking based on scores
    Euler1 Ligand rotation angle α (Euler angle, ZYZ rotation order)
    Euler2 Ligand rotation angle β (Euler angle, ZYZ rotation order)
    Euler3 Ligand rotation angle γ (Euler angle, ZYZ rotation order)
    posX X-coordinate of the ligand center of mass
    posY Y-coordinate of the ligand center of mass
    posZ Z-coordinate of the ligand center of mass
    Absolute_Energy_Score Absolute energy score for evaluating binding strength
    Ligand_File Ligand file name
    complex_pdb Complex file name
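
To make the Euler and position columns concrete, the sketch below builds a rotation matrix from three ZYZ Euler angles and applies it, followed by a translation, to a point. Several things are assumptions here rather than documented FRODOCK behavior: angles are taken in radians, the composition is R = Rz(α)·Ry(β)·Rz(γ) acting on column vectors, and the rotation is applied before the translation. Check the actual convention before relying on it.

```python
import math


def rot_z(angle):
    """Rotation matrix about the Z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_y(angle):
    """Rotation matrix about the Y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def zyz_to_matrix(alpha, beta, gamma):
    """Compose Rz(alpha) * Ry(beta) * Rz(gamma) (assumed convention)."""
    return matmul(matmul(rot_z(alpha), rot_y(beta)), rot_z(gamma))


def apply_pose(point, rotation, translation):
    """Rotate a point, then translate it."""
    return tuple(
        sum(rotation[i][k] * point[k] for k in range(3)) + translation[i]
        for i in range(3)
    )


rotation = zyz_to_matrix(math.pi / 2, 0.0, 0.0)
print(apply_pose((1.0, 0.0, 0.0), rotation, (0.0, 0.0, 0.0)))
```

With β = 0 the two Z rotations simply add, which gives a quick sanity check of any implementation.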

    Reference

    Ramírez-Aportela E, López-Blanco JR, Chacón P. FRODOCK 2.0: fast protein-protein docking server. Bioinformatics. 2016 Aug 1;32(15):2386-8.

  • Name: Human Antibody BLAST
    Description: Human Antibody BLAST是人类抗体数据库Blast模块,该数据库目前包含来自超过75项不同研究的超过10亿个序列,涵盖了来自人类的多种免疫状态和个体。提交抗体序列,将返回同源性最高的人源同源抗体序列,可用于高级抗体人源化设计、亲和力成熟、去免疫原性、抗体工程等。 BLAST human antibody database for homologs, which currently contains over one billion sequences, from over 75 different studies. These repertoires cover diverse immune states and individuals from humans. Submit an antibody sequence, and homologous human antibody sequences will be returned and could be used for advanced antibody humanization, affinity maturation, de-immunization, etc.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-01-13 18:17:41
    Reference: Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol. 2018 Oct 15;201(8):2502-2509.

    Human Antibody BLAST

    简介

    Observed Antibody Space 数据库 (OAS) 是一个收集和注释免疫组库以用于大规模分析的项目。它目前包含来自超过75项不同研究的超过10亿个真实抗体序列。这些库涵盖了不同的免疫状态、生物体(主要是人类和小鼠)和个体。本功能从OAS库中搜索同源的人源抗体序列,通过序列比对,可以得到不同位点的进化信息,常用于对亲和力成熟或是对人源化过程中突变位点的选择提供参考依据,指导抗体设计。

    参数说明

    Input File

    抗体序列文件,FASTA格式。

    结果说明

    通过序列比对,可以得到不同位点的进化信息文件alignment.fasta。

    参考文献

    Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol. 2018 Oct 15;201(8):2502-2509.

    Human Antibody BLAST

    Introduction

    The Observed Antibody Space (OAS) database is a project that collects and annotates immune repertoires for large-scale analysis. It currently contains over 1 billion real antibody sequences from more than 75 different studies. These libraries cover different immune states, organisms (primarily humans and mice), and individuals. This feature searches for homologous human antibody sequences from the OAS database. By aligning sequences, evolutionary information at different sites can be obtained. This is commonly used to provide reference for the selection of mutation sites during affinity maturation or humanization processes, guiding antibody design.

    Parameter Description

    Input File

    Antibody sequence file in FASTA format.

    Result Description

    The sequence alignment, which carries the evolutionary information for each site, is saved as alignment.fasta.

    Reference

    Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol. 2018 Oct 15;201(8):2502-2509.

  • Name: Structure Relax
    Description: Structure Relax模块是用于消除晶体结构中的不合理构象,优化点突变设计的结构,以及比较多个不同结构的能量等。通过多次迭代进行氨基酸侧链重排以及能量最小化的计算来搜索给定三维结构的在局部能垒的最优构象。 The Structure Relax module is used to eliminate unreasonable conformations in crystal structures, optimize structures designed with point mutations, and compare energy levels of multiple structures. The module searches for the optimal conformation within the local energy barriers of a given 3D structure through multiple iterations of amino acid side chain rearrangement and energy minimization calculations.
    Tags: undefined
    Author: RosettaCommons
    Release: 2022-01-13 10:19:41
    Reference: Weitzner BD, Jeliazkov JR, Lyskov S, et al. Modeling and docking of antibody structures with Rosetta. Nat Protoc. 2017 Feb;12(2):401-416.

    Structure Relax

    简介

    用于消除晶体结构中的不合理构象,优化点突变设计的结构,以及比较多个不同结构的能量等。通过多次迭代进行氨基酸侧链重排以及能量最小化的计算来搜索给定三维结构的在局部能垒的最优构象。

    参数说明

    Input File

    蛋白结构文件,PDB格式

    结果说明

    输出优化后的结构文件relax_model_0001.pdb。

    参考文献

    Weitzner BD, Jeliazkov JR, Lyskov S, et al. Modeling and docking of antibody structures with Rosetta. Nat Protoc. 2017 Feb;12(2):401-416.

    Structure Relax

    Introduction

    This module removes unreasonable conformations from crystal structures, optimizes structures built for point-mutation designs, and allows the energies of different structures to be compared. It iterates amino acid side-chain repacking and energy minimization to find the lowest-energy conformation near the input three-dimensional structure, that is, within its local energy basin.

    Parameter Description

    Input File

    Protein structure file in PDB format.

    Result Description

    The optimized structure file is output as relax_model_0001.pdb.

    Reference

    Weitzner BD, Jeliazkov JR, Lyskov S, et al. Modeling and docking of antibody structures with Rosetta. Nat Protoc. 2017 Feb;12(2):401-416.

  • Name: Mutation Energy of Binding
    Description: Mutation Energy of Binding模块旨在计算突变对复合物结合能的影响。根据输入的复合物结构及突变文件构建突变结构,并计算链之间的结合能,与野生型对比,计算突变前后链之间的结合能变化。能量越负,说明突变越有利于指定链之间的结合。 Mutation Energy of Binding module aims to calculate the effect of mutations on the binding energy of a complex. Based on the input complex structure and mutation files, it builds mutation structures and calculates the binding energy between chains. By comparing with the wild type, it calculates the change in binding energy between chains before and after the mutation. The more negative the energy, the more favorable the mutation is for binding between specified chains.
    Tags: undefined
    Author: Schymkowitz J
    Release: 2022-01-12 22:45:53
    Reference: Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W382-8.

    Mutation Energy of Binding

    简介

    本模块旨在计算突变对复合物结合能的影响。
    根据输入的复合物结构及突变文件构建突变结构,并基于物理模型(分子力学经验力场)计算链之间的结合能,与野生型对比,计算突变前后链之间的结合能变化。能量越负,说明突变越有利于指定链之间的结合。

    参数说明

    PDB File

    蛋白复合物的结构文件,PDB格式。
    注意:输入的PDB中的UID不能有Insertion Code,使用PDB Insertion Removal模块处理PDB文件可以去除Insertion Code。

    Mutant File

    突变文件,文本格式包含突变信息,格式如下:

    GB26R;
    GB26H;
    GB26K,YB27H;
    

    其中G代表序列残基名称,B代表PDB文件中蛋白链名称,26代表26位氨基酸残基,R/H/K 代表突变后的残基名称。

    结果说明

    输出结果包括:

    输出文件名称 说明
    Mutation_pdb_file.tar.gz 突变结构文件压缩包
    Interface_A_B.csv 突变前后,链A和链B之间相互作用能量变化

    其中Interface_A_B.csv包括信息如下:

    字段名称 说明
    Mutation 突变氨基酸位点
    File Name 蛋白结构文件名称
    Chain1 Name 链名称
    Chain2 Name 链名称
    Interaction Energy 链Chain1和链Chain2之间相互作用能,单位kcal/mol。
    deltaEnergy 突变后与野生型两条链之间相互作用能的差值,单位kcal/mol。(Energy[mutant]-Energy[wild])

    参考文献

    Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W382-8.

    Mutation Energy of Binding

    Introduction

    Mutation Energy of Binding module aims to calculate the effect of mutations on the binding energy of a complex. Based on the input complex structure and mutation files, it builds mutation structures and calculates the binding energy between chains. By comparing with the wild type, it calculates the change in binding energy between chains before and after the mutation. The more negative the energy, the more favorable the mutation is for binding between specified chains.

    Parameter Description

    PDB File

    Protein complex structure file in PDB format.
    Note: Residue identifiers (UIDs) in the input PDB must not contain insertion codes. The PDB Insertion Removal module can be used to strip insertion codes from the file.

    Mutant File

    Mutation file, containing mutation information in text format, the format is as follows:

    GB26R;
    GB26H;
    GB26K,YB27H;
    

    Here G is the one-letter code of the original residue, B is the name of the protein chain in the PDB file, 26 is the residue number, and R/H/K is the residue after mutation. Each line ends with a semicolon; mutations separated by commas on one line (GB26K,YB27H) are applied together.

    Result Description

    The output includes:

    Output File Name Description
    Mutation_pdb_file.tar.gz Mutant structure file compression package
    Interface_A_B.csv Change in the interaction energy between chain A and chain B before and after mutation

    Interface_A_B.csv contains the following information:

    Field Name Description
    Mutation Mutant amino acid site
    File Name Protein structure file name
    Chain1 Name Chain name
    Chain2 Name Chain name
    Interaction Energy Interaction energy between Chain1 and Chain2, in kcal/mol
    deltaEnergy Difference in interaction energy between the mutant and the wild type, in kcal/mol (Energy[mutant]-Energy[wild])
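
The deltaEnergy definition can be worked through with a short Python example. The energies below are invented for illustration and are not output of the module; the only thing taken from the description is the formula Energy[mutant]-Energy[wild] and the rule that more negative values favor binding.

```python
def rank_mutations(mutant_energies, wild_energy):
    """Return (mutation, deltaEnergy) pairs sorted from most to least favorable.

    deltaEnergy = Energy[mutant] - Energy[wild], in kcal/mol.
    """
    deltas = {
        mutation: energy - wild_energy
        for mutation, energy in mutant_energies.items()
    }
    return sorted(deltas.items(), key=lambda item: item[1])


# Invented interaction energies in kcal/mol, for illustration only.
wild_energy = -12.0
mutant_energies = {"GB26R": -13.5, "GB26H": -11.0, "GB26K": -12.25}

for mutation, delta in rank_mutations(mutant_energies, wild_energy):
    verdict = "favors binding" if delta < 0 else "does not favor binding"
    print(f"{mutation}: {delta:+.2f} kcal/mol ({verdict})")
```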

    Reference

    Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W382-8.

  • Name: Protein Docking (HDOCK)
    Description: HDOCK是一个集成了同源搜索、基于模板建模、结构预测、大分子对接、生物信息整合的快速蛋白质-蛋白质对接程序。HDOCK使用基于快速傅里叶变换 (FFT) 的对接算法对所有结合模式进行全局采样,然后通过迭代导出的基于知识的评分函数对结合模式进行打分。在多个基准测试中显示很好的预测效果。服务器使用混合对接策略来预测两种分子(如蛋白质和核酸)之间的结合复合物。基于模板的建模和从头开始的自由对接的混合算法进行蛋白质-蛋白质和蛋白质- DNA/RNA 对接。 HDOCK is a fast protein-protein docking program that integrates homology search, template-based modeling, structure prediction, large molecule docking, and bioinformatics integration. The server employs a hybrid docking strategy to predict binding complexes between two types of molecules, such as proteins and nucleic acids. It utilizes a combination of template-based modeling and de novo docking algorithms for protein-protein and protein-DNA/RNA docking.
    Tags: undefined
    Author: Yan Y; Huang S-Y
    Release: 2022-01-12 15:21:06
    Reference: Yan Y, Tao H, He J, Huang S-Y.* The HDOCK server for integrated protein-protein docking. Nature Protocols, 2020; doi: https://doi.org/10.1038/s41596-020-0312-x. Yan Y, Zhang D, Zhou P, Li B, Huang S-Y. HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy. Nucleic Acids Res. 2017;45(W1):W365-W373. Yan Y, Wen Z, Wang X, Huang S-Y. Addressing recent docking challenges: A hybrid strategy to integrate template-based and free protein-protein docking. Proteins 2017;85:497-512. Huang S-Y, Zou X. A knowledge-based scoring function for protein-RNA interactions derived from a statistical mechanics-based iterative method. Nucleic Acids Res. 2014;42:e55. Huang S-Y, Zou X. An iterative knowledge-based scoring function for protein-protein recognition. Proteins 2008;72:557-579.

    Protein Docking (HDOCK)

    Introduction

    HDOCK is a fast protein-protein docking program developed by the team of Professor Shengyou Huang at the School of Physics, Huazhong University of Science and Technology. It integrates homology search, template-based modeling, structure prediction, macromolecular docking, and bioinformatics integration. HDOCK uses a docking algorithm based on Fast Fourier Transform (FFT) to globally sample all binding modes and then scores the binding modes using an iteratively derived knowledge-based scoring function. It has shown good predictive performance in multiple benchmark tests. Its technical features include:

    1. Support for amino acid sequences as input and hybrid docking strategies.
    2. Support for protein-DNA/RNA docking.
    3. Fast computation speed, completing docking in minutes.

    Parameter Description

    Receptor File

    Structure file of the receptor in PDB format.

    Ligand File

    Structure file of the ligand in PDB format.

    Output TopN

    Number of top-scoring complex PDB files to output.

    Grid Space

    Translation grid spacing.

    Angle Interval

    Rotation angle interval.

    Receptor Binding Site

    Residues of the receptor’s binding site.
    Binding site residues can be submitted as a file (.txt) with the following format:

    195:A
    203-206:A
    108:B
    

    This indicates residue 195 of chain A, residues 203-206 of chain A, and residue 108 of chain B. Note that residues in the file should be on separate lines.
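
    The binding-site file format above can be read with a few lines of Python. This is a minimal sketch, not HDOCK's own parser: the function name `parse_binding_site` is hypothetical, and it assumes non-negative residue numbers without insertion codes.

```python
def parse_binding_site(text):
    """Expand lines such as '195:A' or '203-206:A' into (chain, residue_number) pairs."""
    residues = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        numbers, chain = line.split(":")
        if "-" in numbers:
            # A range such as 203-206 covers both end points.
            first, last = (int(n) for n in numbers.split("-"))
        else:
            first = last = int(numbers)
        residues.extend((chain, n) for n in range(first, last + 1))
    return residues


example = """195:A
203-206:A
108:B
"""
print(parse_binding_site(example))
```

    Expanding ranges up front keeps later steps simple, since each residue becomes one explicit (chain, number) pair.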

    Ligand Binding Site

    Residues of the ligand’s binding site.
    Binding site residues can be submitted as a file (.txt) with the same format as above.

    195:A
    203-206:A
    108:B
    

    Restraints

    Distance constraints between interacting amino acids.
    Distance constraints can be provided as a file (.txt) with the following format:

    195:A 236:B 8
    215-218:A 306:B 6
    

    Here, the distance between residue 195 of chain A in the receptor and residue 236 of chain B in the ligand is within 8 angstroms. The distance between residues 215-218 of chain A in the receptor and residue 306 of chain B in the ligand is within 6 angstroms.
    Note: For each constraint, the first field is the receptor, the second field is the ligand, and the third field is the constraint distance. Residues should be in the format num:chainID or num1-num2:chainID, where residue number and chain ID refer to the input structure (if the input is a structure) or model structure (if the input is a sequence).
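
    The three-field restraint lines described above can be parsed as follows. This is an illustrative sketch only: `parse_residue_spec` and `parse_restraints` are hypothetical helper names, and the dictionary layout is an assumption made for readability.

```python
def parse_residue_spec(spec):
    """Turn 'num:chainID' or 'num1-num2:chainID' into (chain, first, last)."""
    numbers, chain = spec.split(":")
    first, _, last = numbers.partition("-")
    return chain, int(first), int(last or first)


def parse_restraints(text):
    """Read lines of the form '<receptor residues> <ligand residues> <distance>'."""
    restraints = []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got: {raw!r}")
        receptor, ligand, distance = fields
        restraints.append({
            "receptor": parse_residue_spec(receptor),   # first field
            "ligand": parse_residue_spec(ligand),       # second field
            "max_distance": float(distance),            # third field, in angstroms
        })
    return restraints


for restraint in parse_restraints("195:A 236:B 8\n215-218:A 306:B 6\n"):
    print(restraint)
```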

    Cluster Cutoff

    RMSD cutoff value for clustering.

    Keep Receptor Heterogens

    Whether to keep heterogens (non-protein residues such as water, ions, and small molecules) in the receptor: keep all (all), keep only water (water), keep only the specified heterogens (specify), or remove all heterogens (none).

    Receptor Specify Heterogens

    Heterogens in the receptor to keep, with multiple residues separated by commas (,). For example: “X:UNL-1”, where X is the chain name, UNL is the heterogen residue name, and 1 is the residue number.

    Keep Ligand Heterogens

    Whether to keep heterogens (non-protein residues such as water, ions, and small molecules) in the ligand: keep all (all), keep only water (water), keep only the specified heterogens (specify), or remove all heterogens (none).

    Ligand Specify Heterogens

    Heterogens in the ligand to keep, with multiple residues separated by commas (,). For example: “X:UNL-1”, where X is the chain name, UNL is the heterogen residue name, and 1 is the residue number.

    Result

    The output includes:

    Output File Name Description
    complex_01.pdb-complex_10.pdb Top ten scoring complex conformations
    score.csv Provides docking scores for complex conformations, where lower scores indicate stronger binding
    TopNComplex.tar.gz Compressed file containing all complex structures

    The score.csv file includes the following information:

    Field Name Description
    Number Score ranking
    RMSD RMSD of complex conformations
    Score Docking energy score, where lower scores indicate stronger binding
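
    Because lower scores indicate stronger binding, the best pose is the row of score.csv with the minimum Score. The sketch below assumes the file is comma-separated with a header row named exactly Number, RMSD, Score; the numeric values are made-up placeholders, not HDOCK output.

```python
import csv
import io


def best_pose(csv_text):
    """Return the row with the lowest (most favorable) docking score."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return min(rows, key=lambda row: float(row["Score"]))


# Placeholder content standing in for a real score.csv.
sample = "Number,RMSD,Score\n1,0.00,-250.1\n2,35.20,-231.7\n3,12.45,-228.9\n"
print(best_pose(sample))
```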

    References

    • Yan Y, Tao H, He J, Huang S-Y. The HDOCK server for integrated protein-protein docking. Nature Protocols, 2020; doi: https://doi.org/10.1038/s41596-020-0312-x.
    • Yan Y, Zhang D, Zhou P, Li B, Huang S-Y. HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy. Nucleic Acids Res. 2017;45(W1):W365-W373.
  • Name: PDB Fixer
    Description: Fixes problems in PDB files. It can add missing heavy atoms, add missing hydrogen atoms, build missing loops, convert non-standard residues to their standard equivalents, select a single position for atoms with multiple alternate locations, delete unwanted chains from the model, delete unwanted heterogens, and build a water box for explicit solvent simulations.
    Author: P. Eastman, M. S. Friedrichs, J. D. Chodera, R. J. Radmer, C. M. Bruns, J. P. Ku, K. A. Beauchamp, T. J. Lane, L.-P. Wang, D. Shukla, T. Tye, M. Houston, T. Stich, C. Klein, M. R. Shirts, and V. S. Pande.
    Release: 2022-01-05 11:55:40
    Reference: Eastman P, Friedrichs MS, Chodera JD, et al. OpenMM 4: A Reusable, Extensible, Hardware Independent Library for High Performance Molecular Simulation. J Chem Theory Comput. 2013 Jan 8;9(1):461-469.

    PDB Fixer

    Introduction

    PDB Fixer is a module designed to address issues in PDB files, including adding missing heavy atoms, adding missing hydrogen atoms, building missing loops, converting non-standard residues to standard residues, selecting a position for atoms with multiple alternative locations, removing unnecessary chains from models, removing unwanted small molecules, and constructing a water box for explicit solvent simulation. It is commonly used in pre-processing protein structures before molecular dynamics simulations, such as completing missing residues. The problems that PDB Fixer can address include:

    1. Adding missing heavy atoms
    2. Adding missing hydrogen atoms
    3. Building missing loops
    4. Replacing non-standard residues and adding missing residues
    5. Removing unwanted heterogens (such as water, ions, non-protein structures)

    Parameter Description

    Input File

    PDB structure file.

    Output File

    Name of the output PDB file.

    Add Atoms

    Add missing atoms in the structure: all missing atoms (all), missing heavy atoms (heavy), missing hydrogen atoms (hydrogen), do not add (none).

    Keep Heterogens

    Whether to keep heterogens (non-protein residues such as water, ions, and small molecules): keep all (all), keep only water (water), or remove all heterogens (none).

    pH Value

    pH value used when adding missing hydrogen atoms.

    Add Residue

    Add missing amino acids.

    Replace Nonstandard

    Convert non-standard amino acids to standard amino acids.
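
    For readers who want to reproduce these steps outside the platform, the sketch below uses the open-source PDBFixer Python library. It assumes the module wraps that library; the function `fix_pdb` and the way the parameters above map onto PDBFixer calls are assumptions, not the platform's actual implementation.

```python
def fix_pdb(input_path, output_path, add_atoms="all", keep_heterogens="all",
            ph=7.0, add_residues=True, replace_nonstandard=True):
    """Repair a PDB file; option names mirror the parameters described above."""
    if add_atoms not in ("all", "heavy", "hydrogen", "none"):
        raise ValueError("add_atoms must be all, heavy, hydrogen, or none")
    if keep_heterogens not in ("all", "water", "none"):
        raise ValueError("keep_heterogens must be all, water, or none")

    # Imported lazily so the option checks work without the libraries installed.
    from pdbfixer import PDBFixer
    from openmm.app import PDBFile

    fixer = PDBFixer(filename=input_path)
    fixer.findMissingResidues()
    if not add_residues:
        fixer.missingResidues = {}          # skip rebuilding missing residues
    if replace_nonstandard:
        fixer.findNonstandardResidues()
        fixer.replaceNonstandardResidues()
    if keep_heterogens != "all":
        fixer.removeHeterogens(keepWater=(keep_heterogens == "water"))
    fixer.findMissingAtoms()
    if add_atoms not in ("all", "heavy"):
        fixer.missingAtoms = {}             # do not add heavy atoms
        fixer.missingTerminals = {}
    fixer.addMissingAtoms()                 # also builds any missing residues
    if add_atoms in ("all", "hydrogen"):
        fixer.addMissingHydrogens(ph)
    with open(output_path, "w") as handle:
        PDBFile.writeFile(fixer.topology, fixer.positions, handle)
```

    A call such as `fix_pdb("input.pdb", "output.pdb", keep_heterogens="water")` would then write the repaired structure; the file names here are placeholders.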

    Result Description

    The repaired structure is written to output.pdb.

    References

    Eastman P, Friedrichs MS, Chodera JD, et al. OpenMM 4: A Reusable, Extensible, Hardware Independent Library for High Performance Molecular Simulation. J Chem Theory Comput. 2013 Jan 8;9(1):461-469.

  • Name: SeqKit
    Description: An ultrafast, comprehensive toolkit for FASTA/Q processing that rapidly performs common FASTA/Q file manipulations.
    Author: Shen W
    Release: 2022-01-04 17:15:54
    Reference: Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962.

    SeqKit

    Introduction

    SeqKit is a tool dedicated to processing FASTA/Q sequence files. It is written in Go, offers comprehensive functionality, and is stable in use. The module provides the following main features:

    1. Edit sequences (point mutations, insertions, deletions).
    2. Remove duplicate sequences by name or by sequence, optionally saving the duplicated sequences to a file and writing the duplication counts and lists to another file.
    3. Transform sequences (reverse, complement, extract IDs, etc.).

    Parameter Description

    Clean Mode

    FASTA File

    Sequence file in FASTA format.

    GAP

    Specify the gap characters to be cleaned from the sequence.

    Output File

    Specify the output sequence file name in FASTA format.

    Edit Mode

    FASTA File

    Sequence file in FASTA format.

    Output File

    Specify the output sequence file name in FASTA format.

    Point Mutation

    Point mutation: change the base at a given position. For example, “2:C” changes the second base to cytosine (C); “-1:A” changes the last base to adenine (A).

    Deletion Mutation

    Deletion mutation: delete a subsequence within a specified range. For example, “1:2” deletes the first two bases, “-3:-1” deletes the last three bases.

    Insertion Mutation

    Insertion mutation: insert bases after the specified position. For example, “0:ACGT” inserts ACGT at the beginning, “-1:*” appends * at the end.
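
    The position conventions of the three mutation types (1-based positions, negative positions counted from the end, and 0 meaning "before the first base" for insertions) can be illustrated in plain Python. This is a sketch of the semantics described above, not SeqKit's code; all function names are hypothetical.

```python
def _to_position(position, length):
    """Map a 1-based position (negative = counted from the end) to a positive 1-based one."""
    return position if position > 0 else length + position + 1


def point_mutation(seq, spec):
    """'2:C' replaces the 2nd base with C; '-1:A' replaces the last base with A."""
    pos, base = spec.split(":")
    i = _to_position(int(pos), len(seq))
    return seq[:i - 1] + base + seq[i:]


def deletion(seq, spec):
    """'1:2' removes the first two bases; '-3:-1' removes the last three."""
    start, end = (int(p) for p in spec.split(":"))
    s, e = _to_position(start, len(seq)), _to_position(end, len(seq))
    return seq[:s - 1] + seq[e:]


def insertion(seq, spec):
    """'0:ACGT' inserts at the beginning; '-1:*' appends * after the last base."""
    pos, bases = spec.split(":")
    p = int(pos)
    i = 0 if p == 0 else _to_position(p, len(seq))
    return seq[:i] + bases + seq[i:]


seq = "AAGTTT"
print(point_mutation(seq, "2:C"), deletion(seq, "-3:-1"), insertion(seq, "0:ACGT"))
```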

    Threads

    Number of CPUs.

    Remove Duplicates Mode

    FASTA File

    Sequence file in FASTA format.

    Output File

    Specify the output sequence file name in FASTA format.

    Duplicated Type

    Delete duplicate sequences by name (-n) or by sequence (-s).

    Save Data

    Save a file with the count and list of duplicate sequences (-D) or save a file with duplicate sequences (-d).
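
    The difference between removing duplicates by name and by sequence can be sketched as follows. The function `remove_duplicates` is hypothetical and works on in-memory (name, sequence) pairs rather than FASTA files. It keeps the first occurrence of each key and compares keys exactly (case-sensitively), which is an assumption about the desired behavior.

```python
def remove_duplicates(records, by="seq"):
    """Split (name, sequence) records into first occurrences and later duplicates."""
    if by not in ("name", "seq"):
        raise ValueError("by must be 'name' or 'seq'")
    seen = set()
    unique, duplicates = [], []
    for name, seq in records:
        key = name if by == "name" else seq
        if key in seen:
            duplicates.append((name, seq))   # what a "duplicated sequences" file would hold
        else:
            seen.add(key)
            unique.append((name, seq))
    return unique, duplicates


records = [("a", "ACGT"), ("b", "ACGT"), ("a", "TTTT")]
print(remove_duplicates(records, by="seq"))
print(remove_duplicates(records, by="name"))
```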

    Threads

    Number of CPUs.

    Transform Mode

    FASTA File

    Sequence file in FASTA format.

    Output File

    Specify the output sequence file name in FASTA format.

    Transform Sequences

    Transformation types include:
    --complement: Complementary sequences
    --dna2rna: DNA to RNA conversion
    --rna2dna: RNA to DNA conversion
    --lower-case: Print sequences in lowercase
    --upper-case: Print sequences in uppercase

    Threads

    Number of CPUs.
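
    What each transformation type in Transform Mode does to a sequence can be shown with a short Python sketch. The dictionary `TRANSFORMS` and its keys are illustrative names chosen to mirror the options above; this is not how SeqKit implements them. Note that complement here does not reverse the sequence.

```python
# Complement table for DNA and RNA letters; U is treated like T.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")

TRANSFORMS = {
    "complement": lambda s: s.translate(_COMPLEMENT),
    "dna2rna": lambda s: s.replace("T", "U").replace("t", "u"),
    "rna2dna": lambda s: s.replace("U", "T").replace("u", "t"),
    "lower-case": str.lower,
    "upper-case": str.upper,
}


def transform(seq, kind):
    """Apply one named transformation to a sequence string."""
    return TRANSFORMS[kind](seq)


for kind in TRANSFORMS:
    print(kind, transform("ACGTacgt", kind))
```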

    FASTA2Seq Mode

    FASTA File

    Sequence file in FASTA format.

    Output File

    Specify the output sequence file name in FASTA format.

    Result Description

    A FASTA file processed according to the selected mode and options.

    References

    Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962.

  • Name: Property Filter
    Description: Selects a subset of molecules based either on properties imported with the molecule (for example from an SDF file) or on properties calculated on the fly. Supported input file formats: SD (.sdf, .sd). Supported output file formats: SD (.sdf, .sd).
    Author: Open Babel
    Release: 2021-12-28 06:06:09
    Reference: O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

    Property Filter

    Introduction

    The Property Filter module allows for the selection of a subset of molecules based on imported molecular properties (e.g., imported from an SDF file) or calculated at runtime. Supported input file formats include: SD (.sdf, .sd). Supported output file formats include: SD (.sdf, .sd).

    Parameter Description

    Input File

    Small molecule structure file in SDF format.

    Property

    Filter properties, with the meanings of related descriptors as follows:

    L5 (Lipinski rule of five): A set of rules used to evaluate the potential of compounds as oral drugs, including the following criteria: HBD<5, HBA1<10, MW<500, and logP<5.
    HBA1 (Number of hydrogen bond acceptors 1 [JoelLib]): Used to identify hydrogen bond acceptors in compounds that match this pattern, with the SMARTS format: [$([!#6;+0]);!$([F,Cl,Br,I]);!$([o,s,nX3]);!$([Nv5,Pv5,Sv4,Sv6])]
    HBA2 (Number of hydrogen bond acceptors 2 [JoelLib]): Used to identify another pattern of hydrogen bond acceptors, with the SMARTS format: [$([$([#8,#16]);!$(*=N~O);!$(*~N=O);X1,X2]),$([#7;v3;!$([nH]);!$(*(-a)-a)])]
    HBD (Number of hydrogen bond donors [JoelLib]): Matches the SMARTS format [!#6;!H0], used to identify hydrogen bond donors in compounds that match this pattern.
    logP (Octanol/water partition coefficient): The octanol/water partition coefficient, which measures the ratio of a compound's distribution between octanol and water, typically used to predict compound hydrophobicity.
    MW (Molecular weight): The molecular weight.
    abonds (Number of aromatic bonds): The number of aromatic bonds, SMARTS format: *:*.
    atoms (Number of atoms): The number of atoms, calculated by adding or removing hydrogen atoms to count total or heavy atoms, SMARTS format: *.
    bonds (Number of bonds): The number of bonds, calculated by adding or removing hydrogen atoms to count total bonds or bonds between heavy atoms, SMARTS format: *~*.
    cansmi (Canonical SMILES): Canonical SMILES (Simplified Molecular Input Line Entry System), used to uniquely represent the linear structure of a compound.
    cansmiNS (Canonical SMILES without isotopes or stereo): Canonical SMILES without isotope or stereochemistry information.
    dbonds (Number of double bonds): The number of double bonds, SMARTS format: *=*.
    formula (Chemical formula): The chemical formula.
    InChI (IUPAC InChI identifier): The International Chemical Identifier, a standardized text string to represent the structure of a compound.
    InChIKey (InChIKey): A simplified version of InChI, a fixed-length string used for quick lookup and identification of compounds.
    MP (Melting point): The melting point, a descriptor developed by Andy Lang, used to predict the melting point of compounds.
    MR (Molar refractivity): Molar refractivity, a measure of the compound's volume and polarizability, typically used to assess intermolecular interactions.
    nF (Number of fluorine atoms): The number of fluorine atoms, SMARTS format: F, used to identify the number of fluorine atoms in a compound.
    s/smarts (SMARTS filter): A SMARTS filter used to filter compounds based on specific patterns.
    sbonds (Number of single bonds): The number of single bonds, SMARTS format: *-*.
    tbonds (Number of triple bonds): The number of triple bonds, SMARTS format: *#*.
    title (For comparing a molecule's title): Used for comparing the titles of molecules.
    TPSA (Topological polar surface area): The topological polar surface area, the total surface area of polar regions in a molecule, typically used to predict drug absorption and permeability.
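
    As a concrete reading of the L5 entry in the list above, the sketch below checks the four thresholds against descriptor values that are assumed to have been computed already. The function name `passes_l5` and the example numbers are hypothetical placeholders; no descriptor calculation is performed here.

```python
# Upper limits from the L5 definition: HBD<5, HBA1<10, MW<500, logP<5.
L5_LIMITS = {"HBD": 5, "HBA1": 10, "MW": 500, "logP": 5}


def passes_l5(descriptors):
    """Return True when every L5 descriptor is strictly below its limit."""
    return all(descriptors[name] < limit for name, limit in L5_LIMITS.items())


# Placeholder descriptor values for two imaginary molecules.
small_molecule = {"HBD": 1, "HBA1": 4, "MW": 180.2, "logP": 1.3}
large_molecule = {"HBD": 6, "HBA1": 12, "MW": 812.0, "logP": 3.0}
print(passes_l5(small_molecule), passes_l5(large_molecule))
```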
    

    Relation

    Select the name of the property and the desired relation (such as >, <, =, >=, <=, !=), separated by commas. When filtering by L5, fill in None for this field.

    Value

    The cutoff value for the property filter. When filtering by L5, fill in None for this field.

    Logic Operator

    Logical operators (&& or ||) connecting the conditions, separated by commas.
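
    How the Property, Relation, Value, and Logic Operator fields combine can be sketched as below. The function `matches` is hypothetical, and evaluating the conditions strictly left to right is an assumption, since operator precedence is not stated here.

```python
import operator

RELATIONS = {
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
}


def matches(descriptors, properties, relations, values, logic_operators):
    """Evaluate 'prop rel value' conditions joined by && / ||, left to right."""
    results = [
        RELATIONS[rel](descriptors[prop], val)
        for prop, rel, val in zip(properties, relations, values)
    ]
    outcome = results[0]
    for op, result in zip(logic_operators, results[1:]):
        if op == "&&":
            outcome = outcome and result
        elif op == "||":
            outcome = outcome or result
        else:
            raise ValueError(f"unknown logic operator: {op!r}")
    return outcome


# Placeholder descriptor values for one molecule.
molecule = {"MW": 320.0, "logP": 2.1}
print(matches(molecule, ["MW", "logP"], ["<", ">="], [500, 3], ["&&"]))
print(matches(molecule, ["MW", "logP"], ["<", ">="], [500, 3], ["||"]))
```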

    Output File

    The name of the output file.

    Result Description

    The filtered SDF structure file output.sdf is obtained.

    References

    O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

  • Name: Homology Modeling (Protein)
    Description: Homology or comparative modeling of protein three-dimensional structures. The user provides an alignment of the sequence to be modeled with known related structures. Comparative protein structure modeling is performed by satisfaction of spatial restraints. Many other tasks are also supported, including de novo modeling of loops in protein structures, optimization of protein structure models against a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching sequence databases, and comparing protein structures.
    Author: B. Webb*; M.A. Marti-Renom*; A. Sali*; A. Fiser, R.K. Do*.
    Release: 2021-12-21 17:39:18
    Reference: (1) B. Webb, A. Sali. Comparative Protein Structure Modeling Using Modeller. Current Protocols in Bioinformatics 54, John Wiley & Sons, Inc., 5.6.1-5.6.37, 2016. M.A. Marti-Renom, A. Stuart, A. Fiser, R. Sánchez, F. Melo, A. Sali. Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291-325, 2000. (2) A. Sali & T.L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815, 1993. (3) A. Fiser, R.K. Do, & A. Sali. Modeling of loops in protein structures, Protein Science 9. 1753-1773, 2000.

    Homology Modeling (Protein)

    Introduction

    Homology Modeling (Protein) uses the long-established protein homology modeling program Modeller for homology or comparative modeling of protein three-dimensional structures. The user provides an alignment of the sequence to be modeled with known related structures. Comparative protein structure modeling is performed by satisfaction of spatial restraints. Many other tasks are also supported, including de novo modeling of loops in protein structures, optimization of protein structure models against a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching sequence databases, and comparing protein structures.

    Parameter

    Protein Sequence File

    Protein sequence file in FASTA format.

    Models

    Number of predicted structures.

    Template PDB File

    Template file used to build the PDB structure.

    Log File

    Name of log file

    Result

    The output includes:

    Output File Name Description
    output.log Output record file
    score.csv Scoring file for the predicted structures
    Top0001.pdb-Top0005.pdb Structure files of the five top-scoring models

    score.csv contains the following information:

    Field Name Description
    name Name of the predicted structure
    molpdf The MODELLER objective function, which measures how well the model satisfies the restraints derived from the alignment; lower values are better.
    DOPE score A statistical potential that estimates how likely the model is to resemble a real structure; lower values are better.
    Template The template PDB ID and chain name used to build the structure.

    Reference

    Webb B, Sali A. Comparative Protein Structure Modeling Using MODELLER. Curr Protoc Bioinformatics. 2016 Jun 20;54:5.6.1-5.6.37.
    Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291-325.
    Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993 Dec 5;234(3):779-815.
    Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000 Sep;9(9):1753-73.

  • Name: PTM Hotspot by Sequence
    Description: Scans antibody sequences for potential post-translational modification (PTM) hotspots (liabilities), a common risk in biologics development. For visual analysis, the PTM function in WeSeq is generally recommended; this module is more commonly used to assemble automated workflows.
    Author: WECOMPUT
    Release: 2021-12-20 16:13:18
    Reference: NA

    PTM Hotspot by Sequence

    Introduction

    This module scans antibody sequences to identify potential post-translational modification (PTM) hotspot sites. PTM sites are common risks in biologics development and include Oxidation, Glycosylation, Hydrolysis, Deamidation, Cleavage, Isomerization, and Cysteine sites.

    Parameter Description

    FASTA File

    Antibody sequence file in FASTA format.

    Result Description

    The output includes:

    Output File Name Description
    hotspots.md Information on hotspot sites in Markdown format
    Hotspots.json Information on hotspot sites in JSON format

    For antibody sequences, the module automatically identifies the CDR regions and outputs hotspot sites for both the CDR and the entire sequence regions.

    Explanation of hotspot sites:
    [Figure: table of hotspot site types, with high-risk sites check-marked; image not available]

    The check-marked sites (NXS, NXT, NG, DHK, DG, DD, and Cys) are potential high-risk PTM hotspots that require special attention.
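
    A simple way to locate several of the motifs named above in a sequence is a regular-expression scan. This is only a sketch and not the module's algorithm. It assumes X in NXS/NXT is any residue except proline (the usual N-glycosylation sequon convention), it omits DHK because its pattern is not defined here, and it does not restrict hits to CDR regions.

```python
import re

# Lookaheads let overlapping motifs (for example DD inside DDD) all be reported.
MOTIFS = {
    "NXS/NXT": r"N(?=[^P][ST])",
    "NG": r"N(?=G)",
    "DG": r"D(?=G)",
    "DD": r"D(?=D)",
    "Cys": r"C",
}


def scan_hotspots(sequence):
    """Return (motif, 1-based position) pairs sorted by position."""
    hits = []
    for name, pattern in MOTIFS.items():
        for match in re.finditer(pattern, sequence):
            hits.append((name, match.start() + 1))
    return sorted(hits, key=lambda hit: hit[1])


print(scan_hotspots("QNGSCDDG"))
```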

  • Name: Sequence Mutagenesis (Directed)
    Description: Generates mutants in batch from a template sequence. The user can define all mutation positions and the mutated amino acids in a text file.
    Author: WECOMPUT
    Release: 2021-12-15 00:00:00
    Reference: NA

    Sequence Mutagenesis (Directed)

    Introduction

    Sequence Mutagenesis (Directed) is a module for generating mutant variants in bulk based on a template sequence. Users can define all mutation positions and mutant amino acids in a text file.

    Parameter Description

    Sequence String

    Input sequence characters, for example:

    QAVVTQESALTTSPGETVTL
    

    Result Description

    Generate a FASTA file mutations.fasta that meets the mutation requirements.

  • Name: 2D Similarity Search
    Description: 2D similarity search based on molecular fingerprints. The fingerprint bit vector or vector string of the template small molecule is first computed for the chosen fingerprint type (MACCS keys, pharmacophore fingerprints, or extended connectivity fingerprints). It is then used to search the selected public or private library, returning the small molecules that are similar (or dissimilar) to the template molecule.
    Author: Kier LB; Filimonov D; Venkatraman V
    Release: 2021-12-15 07:40:57
    Reference: Kier LB, Hall LH, The E-State as the basis for molecular structure space definition and structure similarity. J. Chem. Inf. Comput. Sci. 2000, 40, 784-791. Filimonov D, Poroikov V, Borodina Y, Gloriozova T, Chemical similarity assessment through multilevel neighborhoods of atoms: Definition and comparison with the other descriptors. J. Chem. Inf. Comput. Sci., 1999, 39, 666-670. Venkatraman V, Pérez-Nueno VI, Mavridis L, Ritchie DW, Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J. Chem. Inf. Model., 2010, 50, 2079-2093. Durant, J.L.; Leland, B.A.; Henry, D.H.; Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273-1280. Willett, P.; Similarity searching using 2D structural fingerprints. Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology. 2011, 672, 133-58.

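    2D Similarity Search

    Introduction

    The 2D Similarity Search module performs two-dimensional similarity searches based on molecular fingerprints. Fingerprint bit vectors or vector strings are computed for the chosen fingerprint type (MACCS keys, pharmacophore fingerprints, or extended connectivity fingerprints) and used to screen a molecular database for compounds that are similar (or dissimilar) to the template molecule. Similarity is measured with the widely used Tanimoto coefficient, which is computed from the overlap between the fingerprints or descriptors of two compounds. It ranges from 0 to 1, and larger values indicate greater similarity. The main functions are:

    1. Screen the supplied compound database for structures that are 2D-similar to the query molecule and meet a given similarity threshold.
    2. Screen the supplied compound database for structures that are 2D-dissimilar to the query molecule and meet a given distance threshold.
    3. Support for multiple query molecules.

    Supported input file formats: SD (.sdf, .sd). Supported output file formats: SD (.sdf, .sd), CSV (.csv).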

    参数说明

    Template SDF File

    小分子结构文件,SDF格式。

    Template Smiles

    小分子结构,SMILES格式,支持多个小分子,一行一个SMILES,例如:

    CSC1=C(c2ccc(C)s2)/C(=N/C(C)(C)C)C1
    CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5
    

    Public Library

    选择用于相似性搜索的分子库,该模块提供17个公共分子数据库用于进行相似性搜索:

    1. Alinda :~77万库存分子,源自中国香港的Alinda Chemical公司,致力于分子砌块和新颖筛选化合物的研发供应。
    2. Analyticon :~4万库存分子,源自德国的天然产物品牌,专注天然产物提取及类似物合成工作,产品质量稳定。
    3. Asinex :~57万库存分子,源自美国的品牌,多年来致力于类先导化合物及分子砌块的研发供应,价格较贵。
    4. Bionet :~30万库存分子,源自英国的品牌,拥有多年的有机合成经验。
    5. Chembridge :~137万库存分子,源自美国的化合物品牌,总部位于圣地亚哥,拥有多样性库、大环库等多种热门化合物库。
    6. Chemdiv :~156万库存分子,全球最大的化合物品牌之一,拥有5000多种化合物骨架结构和100多种化合物库,性价比高。
    7. Enamine :~407万库存分子,源自乌克兰的化合物品牌,具有较强的化合物研发能力,有高性价比化合物和高价值化合物两类产品。
    8. Eximed :~6万库存分子,源自乌克兰的化合物品牌,近20年来致力于提供高通量筛选化合物及相关服务。
    9. HTS :~6万库存分子,源自德国的HTS Biochemie Innovationen化合物品牌,致力于为制药、农业和生物技术公司开发独特的化合物。
    10. IBS :~55万库存分子,源自俄罗斯的InterBioScreen化合物品牌,拥有多种天然产物及衍生物。
    11. Life_Chemicals :~54万库存分子,源自加拿大的化合物品牌,拥有2900多种化合物骨架结构,化合物规格较齐全且有对应价格。
    12. Maybridge :~5万库存分子,源自英国的化合物品牌,Thermofisher旗下,产品数量少而专,每种产品均具有较大库存。
    13. Otava :~29万库存分子,源自加拿大的化合物品牌,专门从事特色化合物,生物化学药品和生物分析试剂的开发和生成。
    14. Princeton :~153万库存分子,源自美国的化合物品牌,20多年来设计独特的小分子化合物用于药物开发。
    15. Specs :~20万库存分子,源自荷兰的化合物品牌,价格优势明显。
    16. UORSY :~68万库存分子,源自乌克兰的化合物品牌,产品主要用于高通量筛选和药物发现,价格与Enamine接近。
    17. Vitas-m :~140万库存分子,源自美国的化合物品牌,在香港拥有发货中心,到货速度快,价格适中。

    Public Library与Private Library选填其中一个。

    Private Library

    上传用于进行相似度搜索的个人分子数据库,格式为SDF。
    Public Library与Private Library选填其中一个。

    Fingerprint

    分子指纹类型:maccskey、phar、ecfp

    1. maccskey指纹是基于分子的结构和功能团片段生成的二进制指纹,可以用于进行药物相似性和虚拟筛选。
    2. phar(Pharmacophore fingerprints)识别分子中的药效团特征指纹,如氢键供体、氢键受体、疏水中心等,适合药物设计。
    3. ecfp(Extended Connectivity Fingerprints)是基于圆形子结构的分子指纹,适合相似性搜索和定量结构-活性关系(QSAR)建模。

    Cutoff

    当搜索模式为SimilaritySearch时,表示搜索相似度≥截断值的分子;当搜索模式为DissimilaritySearch时,表示搜索相似度≤截断值的分子。计算值取值范围是0~1。Cutoff默认为0.75。

    Search Mode

    指定搜索模式:SimilaritySearch是查找相似分子,DissimilaritySearch是查找不相似分子。

    结果说明

    输出结果包括:

    输出文件名称 说明
    hits_values.csv 添加数据库与模板分子相似度值。
    hits.sdf 数据库中筛选出与模板分子相似在截断值以内的化合物。

    其中hits_values.csv包括信息如下:

    字段名称 说明
    ReferenceCompoundID 模板分子库中分子的名称,无名称则别表示为“Cmpd”前缀+“分子编号”。
    DatabaseCompoundID 搜索库中符合条件的分子的名称,无名称同上。
    ComparisonValue 模板分子与分子库的相似度值。

    其余参数为所提供的分子数据库包含的描述。

    参考文献

    Kier LB, Hall LH, The E-State as the basis for molecular structure space definition and structure similarity. J. Chem. Inf. Comput. Sci. 2000, 40, 784-791.
    Filimonov D, Poroikov V, Borodina Y, Gloriozova T, Chemical similarity assessment through multilevel neighborhoods of atoms: Definition and comparison with the other Descriptors. J. Chem. Inf. Comput. Sci., 1999, 39, 666-670
    Venkatraman V, Prez-Nueno VI, Mavridis L, Ritchie DW, Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J. Chem. Inf. Model., 2010, 50, 2079-2093.
    Durant, J.L.; Leland, B.A.; Henry, D.H.; Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273-1280.
    Willett, P.; Similarity searching using 2D structural fingerprints. Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology. 2011, 672, 133-58.

    2D Similarity Search

    Introduction

    The 2D Similarity Search module performs 2D similarity searches based on molecular fingerprints. Fingerprint bit vectors or vector strings are calculated for the chosen fingerprint type (MACCS keys, pharmacophore fingerprints, or extended-connectivity fingerprints) and used to select compounds from a molecular database that are similar (or dissimilar) to the template molecule. Similarity is measured with the widely used Tanimoto coefficient, which is computed from the overlap between the fingerprints or descriptors of two compounds. It ranges from 0 to 1, and larger values indicate greater similarity. The main functions are as follows:

    1. Select compounds from the provided database that are 2D-similar to the query molecule and satisfy a given similarity threshold.
    2. Select compounds from the provided database that are 2D-dissimilar to the query molecule and satisfy a given distance threshold.
    3. Support multiple query molecules.

    Supported input file formats: SD (.sdf, .sd). Supported output file formats: SD (.sdf, .sd), CSV (.csv).
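
    The Tanimoto coefficient described above can be sketched in a few lines. This is a minimal illustration that assumes each fingerprint is represented as the set of its "on" bit positions; the function name and the example bit sets are hypothetical and are not taken from the module itself.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    # shared bits divided by the number of distinct bits set in either fingerprint
    return shared / (len(bits_a) + len(bits_b) - shared)


query = {1, 4, 7, 9}          # hypothetical on-bits of the template molecule
candidate = {1, 4, 8, 9, 12}  # hypothetical on-bits of a database molecule
print(tanimoto(query, candidate))  # 3 shared bits out of 6 distinct bits
```

    Identical fingerprints give 1.0 and fingerprints with no bits in common give 0.0, which matches the 0 to 1 range stated above.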

    Parameter Description

    Template SDF File

    Small molecule structure file in SDF format.

    Template Smiles

    Small molecule structures as SMILES strings. Multiple molecules are supported, one SMILES per line. Example:

    CSC1=C(c2ccc(C)s2)/C(=N/C(C)(C)C)C1
    CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5
    

    Public Library

    Select the molecular database for similarity search. This module provides 17 public molecular databases for conducting similarity search:

    1. Alinda : ~770,000 stock molecules, sourced from Alinda Chemical in Hong Kong, dedicated to the development and supply of molecular building blocks and novel screening compounds.
    2. Analyticon : ~40,000 stock molecules, a German brand specializing in natural product extraction and analogue synthesis, known for stable product quality.
    3. Asinex : ~570,000 stock molecules, an American brand focused on the development and supply of lead-like compounds and molecular building blocks for many years, relatively expensive.
    4. Bionet : ~300,000 stock molecules, a UK brand with many years of experience in organic synthesis.
    5. Chembridge : ~1,370,000 stock molecules, an American compound brand headquartered in San Diego, offering diverse libraries, macrocyclic libraries, and other popular compound libraries.
    6. Chemdiv : ~1,560,000 stock molecules, one of the world’s largest compound brands, with over 5,000 compound scaffolds and more than 100 compound libraries, offering high cost-effectiveness.
    7. Enamine : ~4,070,000 stock molecules, a Ukrainian compound brand with strong compound development capabilities, offering both high cost-effectiveness compounds and high-value compounds.
    8. Eximed : ~60,000 stock molecules, a Ukrainian compound brand dedicated to providing high-throughput screening compounds and related services for nearly 20 years.
    9. HTS : ~60,000 stock molecules, a German compound brand HTS Biochemie Innovationen, dedicated to developing unique compounds for pharmaceutical, agricultural, and biotechnology companies.
    10. IBS : ~550,000 stock molecules, a Russian compound brand InterBioScreen, offering a variety of natural products and derivatives.
    11. Life Chemicals : ~540,000 stock molecules, a Canadian compound brand with over 2,900 compound scaffolds, offering a wide range of compound specifications at corresponding prices.
    12. Maybridge : ~50,000 stock molecules, a UK compound brand under Thermo Fisher, known for a small but specialized product range with large inventories for each product.
    13. Otava : ~290,000 stock molecules, a Canadian compound brand specializing in the development and production of specialty compounds, biochemical drugs, and bioanalytical reagents.
    14. Princeton : ~1,530,000 stock molecules, an American compound brand that has been designing unique small molecules for drug development for over 20 years.
    15. Specs : ~200,000 stock molecules, a Dutch compound brand with a clear price advantage.
    16. UORSY : ~680,000 stock molecules, a Ukrainian compound brand, mainly used for high-throughput screening and drug discovery, with prices similar to Enamine.
    17. Vitas-M : ~1,400,000 stock molecules, an American compound brand with a shipping center in Hong Kong, offering fast delivery and moderate prices.

    Specify either a Public Library or a Private Library.

    Private Library

    Upload a personal molecular database in SDF format for similarity search.

    Specify either a Public Library or a Private Library.

    Fingerprint

    Fingerprint type: maccskey, phar, or ecfp.

    1. maccskey: MACCS keys are binary fingerprints generated from the structural and functional-group fragments of a molecule; they can be used for drug similarity assessment and virtual screening.
    2. phar: pharmacophore fingerprints encode the pharmacophore features of a molecule, such as hydrogen bond donors, hydrogen bond acceptors, and hydrophobic centers, and are suited to drug design.
    3. ecfp: extended-connectivity fingerprints are circular-substructure fingerprints suited to similarity searching and quantitative structure-activity relationship (QSAR) modeling.

    Cutoff

    In SimilaritySearch mode, molecules with a similarity ≥ the cutoff are returned. In DissimilaritySearch mode, molecules with a similarity ≤ the cutoff are returned. Values range from 0 to 1, and the default cutoff is 0.75.

    Search Mode

    Specify the search mode: SimilaritySearch finds similar molecules, and DissimilaritySearch finds dissimilar molecules.
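
    How Cutoff and Search Mode interact can be sketched as a simple filter over precomputed similarity values. This is an illustration only: the function name, the dictionary layout, and the compound names are assumptions, and the module's own implementation is not shown in this documentation.

```python
def select_hits(similarities: dict[str, float], cutoff: float = 0.75,
                mode: str = "SimilaritySearch") -> dict[str, float]:
    """Keep the database molecules that satisfy the cutoff for the given mode."""
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be between 0 and 1")
    if mode == "SimilaritySearch":
        keep = lambda value: value >= cutoff   # similar: at or above the cutoff
    elif mode == "DissimilaritySearch":
        keep = lambda value: value <= cutoff   # dissimilar: at or below the cutoff
    else:
        raise ValueError(f"unknown search mode: {mode}")
    return {name: value for name, value in similarities.items() if keep(value)}


# Hypothetical similarity values of three database molecules to one template.
scores = {"Cmpd1": 0.91, "Cmpd2": 0.75, "Cmpd3": 0.40}
similar = select_hits(scores)
dissimilar = select_hits(scores, mode="DissimilaritySearch")
print(similar)
print(dissimilar)
```

    Because both comparisons are inclusive, a molecule whose similarity equals the cutoff is returned in either mode.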

    Result

    The output includes:

    Output File Name | Description
    hits_values.csv | Similarity values between the template molecules and the database molecules.
    hits.sdf | Compounds from the database whose similarity to the template molecule satisfies the cutoff.

    The hits_values.csv file contains the following fields:

    Field Name | Description
    ReferenceCompoundID | Name of the molecule in the template library; if it has no name, the “Cmpd” prefix followed by the molecule number.
    DatabaseCompoundID | Name of the matching molecule in the searched library; unnamed molecules are labeled in the same way.
    ComparisonValue | Similarity value between the template molecule and the database molecule.

    The remaining columns are the data fields contained in the supplied molecular database.
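
    As a rough sketch of how the result table might be post-processed, the snippet below picks the best-scoring database molecule for each template. It assumes that hits_values.csv is comma-delimited with a header row using the three field names above; the sample rows and compound IDs are invented for illustration.

```python
import csv
import io

# Stand-in for an opened hits_values.csv file (hypothetical contents).
sample = io.StringIO(
    "ReferenceCompoundID,DatabaseCompoundID,ComparisonValue\n"
    "Cmpd1,Z1001,0.82\n"
    "Cmpd1,Z1002,0.93\n"
    "Cmpd2,Z1003,0.78\n"
)


def best_hit_per_template(handle):
    """Map each template ID to its (database ID, similarity) with the highest value."""
    best = {}
    for row in csv.DictReader(handle):
        value = float(row["ComparisonValue"])
        ref = row["ReferenceCompoundID"]
        if ref not in best or value > best[ref][1]:
            best[ref] = (row["DatabaseCompoundID"], value)
    return best


best = best_hit_per_template(sample)
print(best)
```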

    Reference

    Kier LB, Hall LH, The E-State as the basis for molecular structure space definition and structure similarity. J. Chem. Inf. Comput. Sci. 2000, 40, 784-791.
    Filimonov D, Poroikov V, Borodina Y, Gloriozova T, Chemical similarity assessment through multilevel neighborhoods of atoms: Definition and comparison with the other Descriptors. J. Chem. Inf. Comput. Sci., 1999, 39, 666-670.
    Venkatraman V, Pérez-Nueno VI, Mavridis L, Ritchie DW, Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J. Chem. Inf. Model., 2010, 50, 2079-2093.
    Durant, J.L.; Leland, B.A.; Henry, D.H.; Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273-1280.
    Willett, P.; Similarity searching using 2D structural fingerprints. Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology. 2011, 672, 133-58.

  • Name: Molecular Docking (SMINA)
    Description: A docking simulation tool based on SMINA, used primarily to predict binding modes and interactions between molecules and to obtain information such as docking energy and binding affinity. It can also calculate and compare the binding abilities of multiple molecules, making it useful for drug molecule screening, design, and optimization.
    Tags: undefined
    Author: David Ryan Koes
    Release: 2022-03-17 09:56:09
    Reference: Koes DR, Baumgartner MP, Camacho CJ. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J Chem Inf Model. 2013 Aug 26;53(8):1893-904.

    Molecular Docking (SMINA)

    Introduction

    Molecular Docking (SMINA) is a molecular docking tool based on SMINA, a fork of AutoDock Vina (http://vina.scripps.edu/ ). It predicts the binding modes and interactions between molecules and reports docking energies and binding affinities. It can also calculate and compare the binding abilities of multiple molecules, which is useful for screening, designing, and optimizing drug molecules. Compared with AutoDock Vina (version 1.1.2), SMINA supports:

    1. Ligand input in SDF format.
    2. Docking of multi-ligand SDF files.
    3. Output of more than 20 docking poses.
    4. Easier definition of flexible receptor residues.
    5. A greatly improved minimization algorithm (minimization tends to converge).

    SMINA uses the empirical scoring function of AutoDock Vina by default, but the function can be extended and adjusted for specific tasks. The Vina scoring function is an empirical model expressed as a linear weighted sum:

    [Formula image: Vina scoring function]

    The main components of the scoring function are as follows:

    • The Gaussian terms Vgauss,1 and Vgauss,2. Vgauss,1 describes favorable interactions between atoms that are close together and typically applies to ligand-receptor atom pairs within a specific distance range. Vgauss,2 accounts for contributions from atoms at larger distances and usually models medium-range attraction.
    • The repulsion term Vrepulsion penalizes atom pairs that come too close to each other.
    • The hydrophobic term Vhydrophobic describes favorable interactions between hydrophobic atoms, modeling the effect of hydrophobicity on ligand binding.
    • The hydrogen bond term VHBond captures the contribution of directional hydrogen bonds, which is most significant at short distances.
    • The conformational entropy penalty term Vtorsional is calculated using statistical methods and penalizes additional rotational degrees of freedom to reflect the loss of entropy.
    • ω1, ω2, ω3, ω4, ω5, and ω6 are the weights of the individual terms and represent the contribution of each interaction to the overall score. The weights are optimized on a training dataset, and users can adjust them for specific needs.
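
    The original formula is only available as an image, so the following is a reconstruction under stated assumptions rather than the author's exact expression. It assumes that the six weights multiply six terms in the order gauss1, gauss2, repulsion, hydrophobic, hydrogen bond, torsional, where the repulsion term is the steric repulsion term of the Vina scoring function:

```latex
V = \omega_1 V_{\mathrm{gauss},1}
  + \omega_2 V_{\mathrm{gauss},2}
  + \omega_3 V_{\mathrm{repulsion}}
  + \omega_4 V_{\mathrm{hydrophobic}}
  + \omega_5 V_{\mathrm{HBond}}
  + \omega_6 V_{\mathrm{torsional}}
```

    Written this way, changing a single weight rescales the contribution of one interaction type without touching the others, which is what makes the function straightforward to re-weight for a specific task.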

    Parameter Description

    Rigid Docking Mode

    Receptor File

    Protein receptor structure file in PDB or PDBQT format. The receptor protein is set as rigid.

    Ligand File

    Small molecule structure file in SDF format.

    Configure File

    Binding pocket information file in TXT format, obtainable from Weview. The file content is as follows:

    center_x = -44.497
    center_y = -22.273
    center_z = -4.922
    size_x = 40
    size_y = 40
    size_z = 40
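
    The file above is normally exported from Weview. Purely as an illustration of what the six values mean, the sketch below derives a box that encloses a set of ligand atom coordinates plus some padding and prints it in the same key = value layout. The function name, the padding value, and the coordinates are all hypothetical.

```python
def box_config(coords, padding=5.0):
    """Return config text for a box enclosing all atoms plus padding (in Å) per side."""
    if not coords:
        raise ValueError("need at least one atom coordinate")
    centers, sizes = [], []
    # zip(*coords) regroups the (x, y, z) tuples into one sequence per axis
    for axis, values in zip("xyz", zip(*coords)):
        low, high = min(values), max(values)
        centers.append(f"center_{axis} = {(low + high) / 2:.3f}")
        sizes.append(f"size_{axis} = {high - low + 2 * padding:g}")
    return "\n".join(centers + sizes)


# Hypothetical ligand atom positions in Å.
ligand_atoms = [(-46.0, -24.0, -6.0), (-43.0, -20.5, -3.5), (-44.5, -22.0, -5.0)]
print(box_config(ligand_atoms))
```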
    

    TopN

    Specify the top N small molecules for output, default is 100.

    Out Pose

    Number of conformations output for each ligand-protein docking, default is 10. This value should be ≤ “Run Pose”.

    Flexible Docking Mode

    Flexible Residue

    Define flexible residues in the format “chain name”:“amino acid number”, with each amino acid separated by a comma, e.g., “A:48,A:90,A:110”. Flexible amino acids must be near the pocket.
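
    The residue specification format can be checked before a job is submitted. The parser below is a minimal sketch of that idea; the function name is hypothetical and the validation rules (a non-empty chain name and an integer residue number) are assumptions based on the example above.

```python
def parse_flexible_residues(spec: str) -> list[tuple[str, int]]:
    """Parse 'A:48,A:90' into [('A', 48), ('A', 90)]."""
    residues = []
    for item in spec.split(","):
        chain, sep, number = item.strip().partition(":")
        if not sep or not chain or not number.lstrip("-").isdigit():
            raise ValueError(f"expected CHAIN:NUMBER, got {item!r}")
        residues.append((chain, int(number)))
    return residues


print(parse_flexible_residues("A:48,A:90,A:110"))
```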

    Flexible Distance (Å)

    Set all side chains within a specified distance from the ligand as flexible, unit is Å.
    Other parameters are the same as in Rigid Docking Mode.

    Result Description

    The output includes:

    Output File Name | Description
    Complex_Top1-10.pdb | Complex structure files for the ten top-scoring poses of each ligand with the receptor.
    score.csv | Docking scores for all ligands against the receptor.
    TopNscore.csv | Score file sorted by the best docking score of each ligand.
    output.TopNComplex.tar.gz | Archive of PDBQT files containing the top-scoring complex pose for each of the top N ligands.
    output.TopNLigand.sdf | SDF file of the top N ligands ranked by docking score.

    Reference

    Koes DR, Baumgartner MP, Camacho CJ. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J Chem Inf Model. 2013 Aug 26;53(8):1893-904.

  • Name: Batch Renaming
    Description: The Batch Renaming module renames the molecules in a chemical library. Users can standardize molecule names with a prefix and a defined length. For example, to name a library WCP0001 through WCP9999, enter the prefix WCP and a length of 4. The --keeptitle parameter preserves the previous names so that the relationship between old and new names is kept. The module can be used for custom molecule naming in large de novo libraries or private chemical libraries.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-11-18 09:38:17
    Reference: NA

    Batch Renaming

    The Batch Renaming module renames the molecules in a chemical library. Users can standardize molecule names with a prefix and a defined length. For example, to name a library WCP0001 through WCP9999, enter the prefix WCP and a length of 4. The --keeptitle parameter preserves the previous names so that the relationship between old and new names is kept. The module can be used for custom molecule naming in large de novo libraries or private chemical libraries. Supported input file formats: SD (.sdf, .sd). Supported output file formats: SD (.sdf, .sd).

    Parameter Description

    Input File

    Small molecule structure file in SDF format.

    Output File

    Name of the output SDF file.

    Prefix

    Custom prefix for the generated names. For example, the prefix C combined with a length of 3 generates names starting at C001.

    Length

    Fixed length of the numeric part of the name. For example, 4 generates names such as C0001, while 1 generates C1, C2, and so on.
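
    The way Prefix and Length combine can be illustrated with zero-padded string formatting. This is a sketch of the naming scheme only, with a hypothetical function name; it assumes numbering starts at 1, as in the WCP0001 and C001 examples.

```python
def generate_names(count: int, prefix: str, length: int, start: int = 1) -> list[str]:
    """Build names such as WCP0001 by zero-padding the running number to `length` digits."""
    return [f"{prefix}{index:0{length}d}" for index in range(start, start + count)]


print(generate_names(3, "WCP", 4))
print(generate_names(2, "C", 1))
```

    In this sketch a number with more digits than Length is simply written in full rather than truncated.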

    Location

    Position for the newly generated names:

    1. field: Add a new field to save the new name.
    2. title: Replace the previous molecule title.
    3. all: Perform both of the above operations.

    Field Name

    Name of the field that stores the newly generated name; only valid when Location is field or all.

    Keep Name

    Keep the previous molecule title name.

    Result Description

    Obtain the renamed SDF file named output.sdf.

  • Name: 3D Conf Generation (AlphaConf)
    Description: A very fast 3D conformation search and generation engine for small molecules. Its machine learning models for bond lengths and angles are based on millions of high-quality data points in PubChemQC. A new way of defining restrictive structural fragments is used to generate 3D molecular conformations, with accuracy better than that of similar algorithms. A non-repetitive conformation generation method saves considerable computation time, making it much faster than similar algorithms. The conformation compression technology reduces storage space by a factor of 400-800 compared with similar algorithms, which suits the construction of ultra-large-scale 3D conformation libraries and ultra-high-throughput virtual screening.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-11-11 03:20:54
    Reference:

    3D Conf Generation (AlphaConf)

    Introduction

    3D Conf Generation (AlphaConf) uses a proprietary 3D conformation generation algorithm developed by WECOMPUT to generate molecular conformation libraries very quickly. It is at least an order of magnitude faster than OpenEye's OMEGA, which is regarded as the most efficient commercial product, and more than an order of magnitude faster than Schrödinger's ConfGenX. Its conformational diversity and quality have been demonstrated in downstream applications, which makes AlphaConf well suited to ultra-high-throughput virtual screening in drug discovery. Its technical features are as follows:

    1. By using restrictive structural fragment definitions, conformation generation reaches an accuracy comparable to Schrödinger's ConfGenX and significantly better than similar open-source algorithms such as RDKit.
    2. A non-redundant conformation generation method saves a significant amount of computation time, making it much faster than similar algorithms.
    3. The proprietary AC format provides efficient data storage and retrieval. Compared with the mainstream SD format, AC files are about 400-800 times smaller. With multi-core parallelization, conformers for billions of drug molecules can therefore be generated in about a week and stored on disks with a few terabytes of capacity. Conformation retrieval is also fast: 1-2 million 3D conformations per second can be read from disk on a mid-range 8-core machine.

    [Figure: Comparison of AlphaConf with other conformation generation tools.]

    Parameter Description

    Input File

    Small molecule structure file in SDF format or gzip-compressed SDF format (.gz).

    Max Confs

    Maximum number of conformations per molecule. The default is 100.

    Energy Window

    Energy cutoff for conformations, in kcal/mol. The default is 20 kcal/mol.
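
    The combined effect of Max Confs and Energy Window can be pictured as a two-step filter. The sketch below is not AlphaConf's algorithm, which is not described here. It assumes that the energy window is measured relative to the lowest-energy conformer and that lower-energy conformers are kept first; both points are assumptions, as are the function name and the sample energies.

```python
def filter_conformers(energies, energy_window=20.0, max_confs=100):
    """Return indices of the conformers kept, ordered from lowest to highest energy."""
    if not energies:
        return []
    lowest = min(energies)
    ranked = sorted(range(len(energies)), key=lambda i: energies[i])
    # keep conformers whose energy lies within the window above the minimum
    within = [i for i in ranked if energies[i] - lowest <= energy_window]
    return within[:max_confs]


energies = [12.4, 3.1, 30.0, 3.9, 22.9]  # hypothetical energies in kcal/mol
print(filter_conformers(energies, energy_window=20.0, max_confs=3))
```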

    Output File

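    Name of the output conformation file, with the extension .sd, .ac, .ac.gz, or .aux.gz. When the extension is .ac.gz or .aux.gz, a fragment library file is written in addition to the conformation file; its name is derived automatically from the conformation file name (for example, the conformation file conf.ac.gz produces the fragment file conf.aux.gz).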

    Result

    The output includes:

    Output File Name | Description
    SelfConf.ac.gz | Compressed conformation file in AC format, used as the conformation library input for the AlphaShape module.
    SelfConf.aux.gz | Fragment library file in AUX format, used as the fragment library input for the AlphaShape module.
  • Name: Format Conversion (Open Babel)
    Description: A format conversion module based on Open Babel, a chemical toolbox designed to speak the many languages of chemical data. Open Babel is an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.
    Tags: undefined
    Author: Open Babel
    Release: 2021-11-05 09:20:49
    Reference: O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

    Format Conversion

    Introduction

    The Format Conversion module is based on Open Babel and is primarily used for handling various chemical data. It allows individuals from molecular modeling, chemistry, solid-state materials, biochemistry, or related fields to search, convert, analyze, or store data. Supported formats include: mol, smiles, sdf, xyz, mol2, pdbqt, cdx, cdxml, com, cube, ent, pdb, fchk, g16, gamin, gamout, gjf, gro, inchi, inchikey, png, svg.

    Parameter Description

    Input File

    Input structure file; the file name must include its extension.

    Output File

    Name of the output file; its extension must be changed to that of the target format.

    Coordinates

    Generate a file with 2D coordinates (--gen2D) or 3D coordinates (--gen3D).

    RemoveSalt

    Remove salt ions from the molecule and retain the largest fragment of the molecule.

    DeleteHydrogens

    Remove hydrogen atoms from the original structure.

    AddHydrogens

    Add hydrogen atoms to the structure.

    First Mol

    For input files containing multiple molecules, start importing at the molecule with this number.

    Last Mol

    For input files containing multiple molecules, stop importing at the molecule with this number.

    Skip Error

    Continue to the next object if an error occurs, if possible.
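
    For readers who want to reproduce a conversion locally, the parameters above correspond closely to options of the Open Babel obabel command. The sketch below only assembles the command line and does not run it. The mapping from module parameters to obabel options is an assumption made for illustration, and the function name and file names are hypothetical.

```python
def build_obabel_command(input_file, output_file, coordinates=None,
                         remove_salt=False, delete_hydrogens=False,
                         add_hydrogens=False, first_mol=None, last_mol=None,
                         skip_error=False):
    """Assemble an obabel argument list from the module-style parameters."""
    command = ["obabel", input_file, "-O", output_file]
    if coordinates in ("2D", "3D"):
        command.append(f"--gen{coordinates}")  # generate 2D or 3D coordinates
    elif coordinates is not None:
        raise ValueError("coordinates must be '2D', '3D', or None")
    if remove_salt:
        command.append("-r")   # keep only the largest contiguous fragment
    if delete_hydrogens:
        command.append("-d")   # delete hydrogens
    if add_hydrogens:
        command.append("-h")   # add hydrogens
    if first_mol is not None:
        command += ["-f", str(first_mol)]  # start import at this molecule
    if last_mol is not None:
        command += ["-l", str(last_mol)]   # end import at this molecule
    if skip_error:
        command.append("-e")   # continue after errors where possible
    return command


print(" ".join(build_obabel_command("in.sdf", "out.mol2", coordinates="3D",
                                    remove_salt=True, first_mol=1, last_mol=10,
                                    skip_error=True)))
```

    The resulting list can be passed to subprocess.run on a machine where Open Babel is installed.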

    Result Description

    Obtain a small molecule file in the format corresponding to the Output File suffix after processing.

    Reference

    O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

  • Name: Smina
    Description: Smina is a fork of AutoDock Vina (http://vina.scripps.edu/) that is customized to better support scoring function development and high-performance energy minimization. Its changes relative to Vina include easier definition of flexible receptor residues and vastly improved minimization algorithms. These changes make Smina much easier to use and 10-20x faster than AutoDock Vina.
    Tags: undefined
    Author: David Ryan Koes
    Release: 2021-10-29 16:06:00
    Reference: Koes DR, Baumgartner MP, Camacho CJ. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J Chem Inf Model. 2013 Aug 26;53(8):1893-904. Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading. J Comput Chem. 2010;31:455-461.
  • Name: Salts Removal
    Description: The Salts Removal module removes salts from molecules or simply counts the number of molecules containing salts.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-28 06:37:44
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Salts Removal

    Introduction

    The Salts Removal module can remove or count the salts present in molecules, providing the option to obtain the molecular structures without salts or the count of salts in the molecular structures.

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    Name of the output file.

    Mode

    Select whether to remove (remove) or count (count) salt ions.
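
    The two modes can be illustrated on SMILES strings, where disconnected components are separated by a dot. The sketch below uses a crude largest-fragment heuristic and is not the method used by MayaChemTools. It assumes that the fragment with the most letters in its SMILES is the parent molecule and that every multi-fragment record contains a salt; the function names and example molecules are hypothetical.

```python
def strip_salts(smiles: str) -> str:
    """Keep the dot-separated fragment with the most letters (a rough size proxy)."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))


def process(smiles_list, mode="remove"):
    """remove: return desalted SMILES; count: return the number of multi-fragment records."""
    if mode == "remove":
        return [strip_salts(s) for s in smiles_list]
    if mode == "count":
        return sum("." in s for s in smiles_list)
    raise ValueError(f"unknown mode: {mode}")


molecules = ["CC(=O)[O-].[Na+]", "c1ccccc1", "Cl.CCN(CC)CC"]
print(process(molecules, mode="remove"))
print(process(molecules, mode="count"))
```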

    Result Description

    Obtain a molecular structure file without salt ions named outfile.sdf.

    Reference

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

  • Name: Duplicates Removal
    Description: The Duplicates Removal module identifies and removes duplicate molecules based on canonical SMILES strings, or simply counts the number of duplicate molecules.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-28 06:27:43
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Duplicates Removal

    Introduction

    The Duplicates Removal module identifies and removes duplicate molecules based on canonical SMILES strings, or it can simply count the number of duplicate molecules. Supported input file formats are: MOL (.mol), SD (.sdf, .sd), SMILES (.smi, .csv, .tsv, .txt). Supported output file formats are: SD (.sdf, .sd), SMILES (.smi, .csv, .tsv, .txt).

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    Name of the output file.

    Mode

    Select whether to remove duplicate molecules (remove) or count them (count). The default is remove.
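
    The remove and count modes can be sketched as follows. This is an illustrative sketch, not the module's code: it assumes each molecule already carries a canonical SMILES string (for example, one generated by a cheminformatics toolkit), and the function name process_duplicates is hypothetical.

```python
def process_duplicates(records, mode="remove"):
    """Deduplicate (name, canonical_smiles) records by canonical SMILES.

    In "remove" mode the first occurrence of each structure is kept;
    in "count" mode only the number of duplicates is returned.
    """
    seen = set()
    unique = []
    n_duplicates = 0
    for name, canonical_smiles in records:
        if canonical_smiles in seen:
            n_duplicates += 1
        else:
            seen.add(canonical_smiles)
            unique.append((name, canonical_smiles))

    if mode == "count":
        return n_duplicates
    if mode == "remove":
        return unique
    raise ValueError(f"unknown mode: {mode!r}")


records = [("mol1", "CCO"), ("mol2", "c1ccccc1"), ("mol3", "CCO")]
print(process_duplicates(records, mode="remove"))
print(process_duplicates(records, mode="count"))
```

    Comparing canonical SMILES rather than raw input strings matters because the same molecule can be written in many ways (for example, "OCC" and "CCO"); canonicalization maps them to one key.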

    Result Description

    An SDF file with the duplicate molecules removed, named outfile.sdf.

    Reference

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

  • Name: Diverse Subset
    Description: Pick a subset of diverse molecules based on a variety of 2D fingerprints, using MaxMin or an available hierarchical clustering methodology, and write them to a file. The Dice and Tanimoto similarity functions available in RDKit can handle both IntVect and BitVect fingerprints. All other similarity functions, however, expect BitVect fingerprints when calculating pairwise similarity. Consequently, ExplicitBitVect fingerprints are generated for AtomPairs, Morgan, MorganFeatures, and TopologicalTorsions for similarity calculations instead of the default IntVect fingerprints.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-22 08:41:36
    Reference: Ashton, M., Barnard, J., Casset, F. et al. Identification of Diverse Database Subsets using Property-Based and Fragment-Based Molecular Descriptions. Quantitative Structure-Activity Relationships. 2002 Dec 27; 21:598-604.

    Diverse Subset

    Introduction

    The Diverse Subset module selects a subset of molecules based on multiple 2D fingerprints, using MaxMin or available hierarchical clustering methods, and writes them to a file. The Dice and Tanimoto similarity functions available in RDKit can handle fingerprints corresponding to IntVect and BitVect. However, all other similarity functions expect to use BitVect fingerprints to compute pairwise similarities. Therefore, for similarity calculations of AtomPairs, Morgan, MorganFeatures, and TopologicalTorsions, ExplicitBitVect fingerprints are used instead of the default IntVect fingerprints.

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    Name of the output file.

    Diverse Numbers

    Specify the number of partitions, that is, how many clusters to form. One molecule is written out per cluster.

    Mode

    Method used to pick the diverse subset: maximum-minimum distance (MaxMin) or hierarchical clustering.
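
    The MaxMin idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the module's implementation: fingerprints are represented as sets of on-bit indices, distance is 1 minus Tanimoto similarity, and the first pick is fixed at index 0 (production pickers often seed with a random molecule). The names tanimoto and maxmin_pick are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def maxmin_pick(fingerprints, n_pick, first=0):
    """Greedy MaxMin selection.

    Repeatedly picks the molecule whose distance to its nearest
    already-picked molecule is largest.
    """
    n_pick = min(n_pick, len(fingerprints))
    if n_pick == 0:
        return []

    picked = [first]
    # min_dist[i] = distance from molecule i to its closest picked molecule
    min_dist = [1.0 - tanimoto(fp, fingerprints[first]) for fp in fingerprints]

    while len(picked) < n_pick:
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fingerprints):
            d = 1.0 - tanimoto(fp, fingerprints[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
print(maxmin_pick(fps, 3))
```

    In this toy set the third fingerprint shares no bits with the others, so it is selected right after the seed; the picker then chooses whichever remaining molecule is least similar to both.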

    Similarity Metric

    Methods used to calculate molecular similarity, including Tanimoto, Cosine, and Dice.

    • Tanimoto Coefficient: The fraction of features shared by two molecules out of all features present in either one. For fingerprints A and B, where |A| is the number of on bits in A, the standard definition is:
      $T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$
    • Cosine Similarity: The cosine of the angle between two n-dimensional vectors in an n-dimensional space. The standard definition is:
      $\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$, which for binary fingerprints reduces to $\frac{|A \cap B|}{\sqrt{|A| \, |B|}}$
    • Dice Similarity: A measure of set similarity that weights the shared features twice. The standard definition is:
      $D(A, B) = \frac{2 \, |A \cap B|}{|A| + |B|}$
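
    The three metrics can be sketched for binary fingerprints represented as sets of on-bit indices. This is an illustrative sketch using the standard textbook definitions, not the module's code; returning 0.0 when a denominator is zero is a convention chosen here.

```python
import math


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def cosine(a, b):
    """Shared bits divided by the geometric mean of the bit counts."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def dice(a, b):
    """Twice the shared bits divided by the sum of the bit counts."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}
print(tanimoto(fp_a, fp_b), cosine(fp_a, fp_b), dice(fp_a, fp_b))
```

    For the same pair of fingerprints, Tanimoto is never larger than Dice, so similarity thresholds are not interchangeable between metrics.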

    Fingerprints

    Fingerprints used to calculate molecular similarity/distance.

    • Morgan encodes the circular environment around each atom, out to a chosen radius, into a molecular fingerprint.
    • AtomPairs encodes each pair of atoms in a molecule by the two atoms' environments and the shortest-path distance between them.
    • MACCS166Keys is a SMARTS-based fingerprint of 166 structural keys, stored as 167 bits; the meaning of each bit is listed at https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt .
    • PathLength enumerates all paths of the specified lengths in a molecule.
    • TopologicalTorsions encodes each sequence of four consecutively bonded atoms (a topological torsion).

    Result Description

    Cluster results are obtained based on the specified number of partitions, and the first molecule in each cluster is written to the file diverse_set.sdf.

    References

    Ashton, M., Barnard, J., Casset, F. et al. Identification of Diverse Database Subsets using Property-Based and Fragment-Based Molecular Descriptions. Quantitative Structure-Activity Relationships. 2002 Dec 27; 21:598-604.

  • Name: Descriptors (RDKit)
    Description: Calculate 2D/3D molecular descriptors for molecules and write them out to an SD or CSV/TSV text file. 2D descriptors: Autocorr2D, MolWt, Ipc, NumRotatableBonds, qed, etc. 3D descriptors: Autocorr3D, RadiusOfGyration, Eccentricity, etc. FragmentCountOnly descriptors: fr_Al_COO, fr_Al_OH, fr_Al_OH_noTert, etc. See the documentation for more descriptors.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-22 09:00:29
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Descriptors (RDKit)

    Introduction

    The Descriptors (RDKit) module calculates 2D/3D descriptors of molecules and writes them to an SD or CSV/TSV text file. 2D descriptors include Autocorr2D, MolWt, Ipc, NumRotatableBonds, qed, etc.; 3D descriptors include Autocorr3D, RadiusOfGyration, Eccentricity, etc.; and FragmentCountOnly descriptors include fr_Al_COO, fr_Al_OH, fr_Al_OH_noTert, etc. Supported input file formats are: Mol (.mol), SD (.sdf, .sd), SMILES (.smi, .txt, .csv, .tsv). Supported output file formats are: SD files (.sdf, .sd), CSV/TSV (.csv, .tsv, .txt).

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    File to save the calculated descriptors.

    Multiprocessing

    Use multiprocessing for computation (default: yes).

    Type

    Type of molecular descriptors to compute, options are 2D, 3D, FragmentCountOnly, and Specify.
    2D descriptors include the following:

    Autocorr2D, BalabanJ, BertzCT, Chi0, Chi1, Chi0n - Chi4n, Chi0v - Chi4v, EState_VSA1 - EState_VSA11, ExactMolWt, FpDensityMorgan1, FpDensityMorgan2, FpDensityMorgan3, FractionCSP3, HallKierAlpha, HeavyAtomCount, HeavyAtomMolWt, Ipc, Kappa1 - Kappa3, LabuteASA, MaxAbsEStateIndex, MaxAbsPartialCharge, MaxEStateIndex, MaxPartialCharge, MinAbsEStateIndex, MinAbsPartialCharge, MinEStateIndex, MinPartialCharge, MolLogP, MolMR, MolWt, NHOHCount, NOCount, NumAliphaticCarbocycles, NumAliphaticHeterocycles, NumAliphaticRings, NumAromaticCarbocycles, NumAromaticHeterocycles, NumAromaticRings, NumHAcceptors, NumHDonors, NumHeteroatoms, NumRadicalElectrons, NumRotatableBonds, NumSaturatedCarbocycles, NumSaturatedHeterocycles, NumSaturatedRings, NumValenceElectrons, PEOE_VSA1 - PEOE_VSA14, RingCount, SMR_VSA1 - SMR_VSA10, SlogP_VSA1 - SlogP_VSA12, TPSA, VSA_EState1 - VSA_EState10, qed
    

    FragmentCountOnly descriptors include the following:

    fr_Al_COO, fr_Al_OH, fr_Al_OH_noTert, fr_ArN, fr_Ar_COO, fr_Ar_N, fr_Ar_NH, fr_Ar_OH, fr_COO, fr_COO2, fr_C_O, fr_C_O_noCOO, fr_C_S, fr_HOCCN, fr_Imine, fr_NH0, fr_NH1, fr_NH2, fr_N_O, fr_Ndealkylation1, fr_Ndealkylation2, fr_Nhpyrrole, fr_SH, fr_aldehyde, fr_alkyl_carbamate, fr_alkyl_halide, fr_allylic_oxid, fr_amide, fr_amidine, fr_aniline, fr_aryl_methyl, fr_azide, fr_azo, fr_barbitur, fr_benzene, fr_benzodiazepine, fr_bicyclic, fr_diazo, fr_dihydropyridine, fr_epoxide, fr_ester, fr_ether, fr_furan, fr_guanido, fr_halogen, fr_hdrzine, fr_hdrzone, fr_imidazole, fr_imide, fr_isocyan, fr_isothiocyan, fr_ketone, fr_ketone_Topliss, fr_lactam, fr_lactone, fr_methoxy, fr_morpholine, fr_nitrile, fr_nitro, fr_nitro_arom, fr_nitro_arom_nonortho, fr_nitroso, fr_oxazole, fr_oxime, fr_para_hydroxylation, fr_phenol, fr_phenol_noOrthoHbond, fr_phos_acid, fr_phos_ester, fr_piperdine, fr_piperzine, fr_priamide, fr_prisulfonamd, fr_pyridine, fr_quatN, fr_sulfide, fr_sulfonamd, fr_sulfone, fr_term_acetylene, fr_tetrazole, fr_thiazole, fr_thiocyan, fr_thiophene, fr_unbrch_alkane, fr_urea
    

    3D descriptors include the following:

    Asphericity, Autocorr3D, Eccentricity, GETAWAY, InertialShapeFactor, MORSE, NPR1, NPR2, PMI1, PMI2, PMI3, RDF, RadiusOfGyration, SpherocityIndex, WHIM
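
    Several entries in the 2D list use a range shorthand such as "Chi0n - Chi4n" or "PEOE_VSA1 - PEOE_VSA14". The sketch below expands that shorthand into individual descriptor names. It is a reading aid based on the notation used in this list; the helper name expand_range is hypothetical, and whether the module itself accepts the shorthand as input is not stated here.

```python
import re

# Splits a name such as "Chi0n" into prefix "Chi", number "0", suffix "n".
_NAME_PATTERN = re.compile(r"^(.*?)(\d+)(\D*)$")


def expand_range(entry):
    """Expand 'Chi0n - Chi4n' into ['Chi0n', ..., 'Chi4n'].

    Entries without ' - ' are returned unchanged as a one-item list.
    """
    if " - " not in entry:
        return [entry.strip()]

    first, last = (part.strip() for part in entry.split(" - ", 1))
    m_first = _NAME_PATTERN.match(first)
    m_last = _NAME_PATTERN.match(last)
    if (
        not m_first
        or not m_last
        or m_first.group(1) != m_last.group(1)
        or m_first.group(3) != m_last.group(3)
    ):
        raise ValueError(f"not a valid descriptor range: {entry!r}")

    prefix, suffix = m_first.group(1), m_first.group(3)
    start, end = int(m_first.group(2)), int(m_last.group(2))
    return [f"{prefix}{i}{suffix}" for i in range(start, end + 1)]


print(expand_range("Chi0n - Chi4n"))
print(expand_range("Kappa1 - Kappa3"))
```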
    

    Descriptor Names

    This option is only used when Type is “Specify.” When applying multiple descriptors, separate them by commas, e.g., MolWt, qed.
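
    A comma-separated value such as "MolWt, qed" can be turned into a clean list of names as sketched below. This is an assumption about reasonable input handling, not the module's actual parser; the function name parse_descriptor_names is hypothetical.

```python
def parse_descriptor_names(value):
    """Split a comma-separated descriptor string into unique names.

    Surrounding whitespace and empty items are dropped; the original
    order is preserved.
    """
    names = []
    for raw in value.split(","):
        name = raw.strip()
        if name and name not in names:
            names.append(name)
    if not names:
        raise ValueError("no descriptor names given")
    return names


print(parse_descriptor_names("MolWt, qed"))
```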

    Result Description

    The numerical values of the specified descriptors for each molecule are stored in the descriptors.csv file.

    References

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

  • Name: PAINS Filter
    Description: Filter Pan-assay Interference molecules (PAINS) from an input file by performing a substructure search using SMARTS patterns, and either write the appropriate molecules to an output file or simply count the number of filtered molecules.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-22 03:29:53
    Reference: Baell JB, Holloway GA. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J Med Chem. 2010 Apr 8;53(7):2719-40.

    PAINS Filter

    Introduction

    The PAINS Filter module uses SMARTS substructure rules to find pan-assay interference compounds (PAINS), which tend to show up as false positives in screening assays, in the input file. It then either writes the appropriate molecules to the output file or counts the filtered molecules.

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    Name of the output file.

    Multiprocessing

    Whether to use multiprocessing for computation, options: yes or no, default is yes.

    Output PAINS

    Whether the output file includes molecules that match PAINS, options: yes or no, default is no.

    Result Description

    The output includes:

    Output File Name       Description
    output.sdf             Compounds that do not match the PAINS rules
    output_Filtered.sdf    Compounds that match the PAINS rules
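
    The split into the two output files can be sketched as follows. This is only an illustration of the partitioning and file naming: the helper names are hypothetical, and the matcher used in the example is a crude substring stand-in, not real SMARTS matching against the PAINS rule set.

```python
import os


def filtered_name(output_file):
    """Derive 'output_Filtered.sdf' from 'output.sdf'."""
    root, ext = os.path.splitext(output_file)
    return f"{root}_Filtered{ext}"


def partition_molecules(molecules, matches_pains):
    """Split molecules into (kept, flagged) using a matcher callable."""
    kept, flagged = [], []
    for mol in molecules:
        (flagged if matches_pains(mol) else kept).append(mol)
    return kept, flagged


def looks_like_azo(smiles):
    """Stand-in matcher: flags SMILES containing an N=N substring.

    A real filter would run SMARTS substructure searches instead.
    """
    return "N=N" in smiles


molecules = ["CCO", "c1ccccc1N=Nc1ccccc1"]
kept, flagged = partition_molecules(molecules, looks_like_azo)
print("output.sdf", kept)
print(filtered_name("output.sdf"), flagged)
```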

    References

    Baell JB, Holloway GA. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J Med Chem. 2010 Apr 8;53(7):2719-40.

  • Name: File
    Description: File is a module for specifying an input file that can be shared as input across multiple modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 10:35:43
    Reference: NA

    File

    Introduction

    The File module is used to specify input files and can be used for unified input across multiple modules.

    Parameter Description

    Input File

    Upload a small molecule structure file (SDF format) or a protein structure file (PDB format).

    Result Description

    Output the file after renaming.

  • Name: SD File
    Description: SD File is a module for specifying a small molecule structure file in SDF format so that one file can be used as input for multiple modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 17:14:38
    Reference: NA

    SDF File

    Introduction

    The SDF File module is used to specify an SDF file that can be used as input for other modules.

    Parameter Description

    SDF File

    Small molecule structure file in SDF format.

    Result Description

    Obtain an SDF file identical to the original file.

  • Name: PDB File
    Description: PDB File is a module for specifying a PDB file that can be used as input for other modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 17:16:59
    Reference: NA

    PDB File

    Introduction

    The PDB File module is used to specify a PDB file that can be used as input for other modules.

    Parameter Description

    PDB File

    Protein structure file in PDB format.

    Result Description

    Obtain a PDB file.