计算驱动创新药物研发
  • Name: Chemical Structure Search
    Description: 对小分子库进行相似度及子结构搜索 Similarity and substructure search against a chemical library
    Author: WECOMPUT
    Release: 2026-06-02 00:00:00
    Reference:

    Chemical Structure Search

    简介

    ChemicaLite 是一个基于 RDKit 的 SQLite 数据库扩展,专为化学信息学应用设计。可在化合物库中搜索目标分子,支持子结构搜索和相似性搜索两种模式。

    核心特性:

    • 基于 SQLite 扩展架构,支持标准 SQL 查询
    • 集成 RDKit 化学信息学工具包
    • 支持子结构搜索和相似性搜索
    • 使用 rdtree 索引实现高性能查询

    适用场景:

    • 化合物数据库管理:构建和管理大规模化合物库
    • 虚拟筛选:基于子结构或相似性搜索候选化合物
    • 化学空间分析:计算分子描述符和指纹

    参数说明

    Query

    查询分子文件,支持多个文件,格式为 .sdf、.smi、.smiles

    Private Library

    私有化合物库文件路径,与 Public Library 二选一

    Search Method

    搜索方法,可选值:

    • sim:相似性搜索,基于 Tanimoto 系数
    • sub:子结构搜索,基于 SMARTS 匹配
    • 默认为sim

    Threshold

    相似性阈值,范围 0.0-1.0,默认为 0.7,仅在相似性搜索时有效

    Hits SDF

    输出 SDF 文件路径,默认为 hits.sdf

    Hits Info

    命中信息 CSV 文件路径,默认为 hits.csv

    结果说明

    输出结果包括:

    文件名 说明
    hits.sdf 命中分子的 SDF 文件
    hits.csv 命中信息 CSV 文件(可选)

    其中 SDF 文件包含以下分子属性:

    属性名 说明
    QUERY_NAME 查询分子名称
    QUERY_FILE 查询文件路径
    QUERY_INDEX 查询分子序号
    SEARCH_METHOD 搜索方法
    HIT_INDEX 命中序号
    HIT_ID 命中分子 ID
    SIMILARITY 相似性分数(仅相似性搜索)

    其中 hits.csv 包含信息如下:

    列名 说明
    query_name 查询分子名称
    query_file 查询文件路径
    query_index 查询分子序号
    hit_id 命中分子 ID
    similarity 相似性分数

    Chemical Structure Search

    Introduction

    ChemicaLite is a SQLite database extension built on RDKit, designed for cheminformatics applications. It enables searching for target molecules within compound libraries, supporting two modes: substructure search and similarity search.

    Key features:

    • SQLite extension architecture with standard SQL query support
    • Integrated RDKit cheminformatics toolkit
    • Substructure and similarity search support
    • High-performance queries via rdtree indexing

    Use cases:

    • Compound database management: build and manage large-scale compound libraries
    • Virtual screening: search candidate compounds by substructure or similarity
    • Chemical space analysis: compute molecular descriptors and fingerprints

    Parameters

    Query

    Query molecule file(s); multiple files supported. Accepted formats: .sdf, .smi, .smiles.

    Private Library

    Path to a private compound library file. Mutually exclusive with Public Library.

    Search Method

    Search algorithm. Options:

    • sim — Similarity search based on Tanimoto coefficient
    • sub — Substructure search based on SMARTS matching
    • Default: sim

    Threshold

    Similarity threshold in the range 0.0–1.0. Default: 0.7. Applies to similarity search only.
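For reference, the Tanimoto coefficient behind the sim method and the Threshold parameter can be sketched in plain Python. This is a minimal illustration that assumes fingerprints are represented as sets of on-bit indices; the function names `tanimoto` and `passes_threshold` are hypothetical, and treating the threshold as inclusive is an assumption, not documented behavior of the tool.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def passes_threshold(query_bits: set, library_bits: set, threshold: float = 0.7) -> bool:
    """Assumption: a hit is kept when its similarity is >= threshold."""
    return tanimoto(query_bits, library_bits) >= threshold


query = {1, 4, 7, 9}
candidate = {1, 4, 7, 12}
print(tanimoto(query, candidate))          # 3 shared bits out of 5 distinct bits
print(passes_threshold(query, candidate))  # below the default threshold of 0.7
```

Raising the threshold narrows the hit list to closer analogues, while lowering it trades precision for recall.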

    Hits SDF

    Output SDF file path. Default: hits.sdf.

    Hits Info

    Output CSV file path for hit information. Default: hits.csv.


    Results

    Results consist of two files:

    File Description
    hits.sdf SDF file containing hit molecules
    hits.csv CSV file with hit metadata (optional)

    SDF molecule properties:

    Property Description
    QUERY_NAME Query molecule name
    QUERY_FILE Query file path
    QUERY_INDEX Query molecule index
    SEARCH_METHOD Search method used
    HIT_INDEX Hit index
    HIT_ID Hit molecule ID
    SIMILARITY Similarity score (similarity search only)

    hits.csv columns:

    Column Description
    query_name Query molecule name
    query_file Query file path
    query_index Query molecule index
    hit_id Hit molecule ID
    similarity Similarity score
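As a hedged sketch of post-processing, the snippet below picks the best-scoring hit for each query from a hits.csv-style table. It assumes the file is comma-delimited with a header row using the column names listed above; the sample rows and the helper name `best_hit_per_query` are invented for illustration.

```python
import csv
import io

# Invented sample rows that follow the documented column names.
SAMPLE = """query_name,query_file,query_index,hit_id,similarity
aspirin,query.smi,0,LIB-0001,0.91
aspirin,query.smi,0,LIB-0042,0.74
caffeine,query.smi,1,LIB-0310,0.82
"""


def best_hit_per_query(handle):
    """Return {query_name: (hit_id, similarity)} keeping the highest similarity."""
    best = {}
    for row in csv.DictReader(handle):
        score = float(row["similarity"])
        name = row["query_name"]
        if name not in best or score > best[name][1]:
            best[name] = (row["hit_id"], score)
    return best


best = best_hit_per_query(io.StringIO(SAMPLE))
for name, (hit_id, score) in best.items():
    print(name, hit_id, score)
```

To read a real result file, pass `open("hits.csv", newline="")` instead of the in-memory sample.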
  • Name: Peptide Design (PepCraft)
    Description: PepCraft 用于面向蛋白受体热点区域设计候选结合多肽。用户提供受体序列、目标 hotspot、多肽长度和多肽类型后,流程会生成多肽候选,并使用 Boltz-2 对受体-多肽复合物进行结构预测与打分,最终输出按综合评分排序的设计结果。 PepCraft is designed to generate candidate binding peptides targeting hotspot regions on protein receptors. Users provide the receptor sequence, target hotspot, peptide length, and peptide type; the pipeline then generates peptide candidates, uses Boltz-2 to perform structure prediction and scoring on the receptor–peptide complex, and outputs the design results ranked by overall score.
    Author: WECOMPUT
    Release: 2026-06-04 00:00:00
    Reference:

    Peptide Design (PepCraft)

    简介

    PepCraft 是唯信开发的从头多肽生成模型,用于面向蛋白受体热点区域设计候选结合多肽。

    用户提供受体序列、目标 hotspot、多肽长度和多肽类型后,PepCraft会生成多肽候选,并使用 Boltz-2 对受体-多肽复合物进行结构预测与打分,最终输出按综合评分排序的设计结果。

    当前支持三种多肽类型:

    • Linear:线性多肽。
    • Disulfide:首尾半胱氨酸形成二硫键约束的多肽。
    • Cyclic:head-to-tail 环肽。

    image.png

    相比于EvoBind等多肽设计方法,PepCraft在生成的质量和多样性方面具有显著优势,同时支持线性、环肽等各种多肽类型。

    image.png

    注:上图中PepSeek即为PepCraft

    PepCraft 的核心流程为“候选生成 - 结构验证 - 指标评分 - 迭代优化”。候选多肽可由 PepMLM、随机生成、突变和交叉等方式产生;结构验证阶段使用 Boltz-2 预测复合物,并结合整体置信度、界面质量和 hotspot 接触情况进行综合排序。

    为提升运行效率,流程会在每次任务开始时仅对受体序列搜索一次 MSA,后续所有候选多肽验证时复用该受体 MSA;多肽链始终使用 single-sequence mode,不单独搜索 MSA。

    参数说明

    Receptor Sequence

    受体蛋白序列文件。支持标准 FASTA 单行序列、标准 FASTA 多行序列,以及无 header 的纯序列输入。

    流程会自动进行格式检查与标准化,包括:

    • 合并多行序列;
    • 将序列统一为大写;
    • 检查非法氨基酸字符;
    • 确认输入仅包含一条受体序列。

    示例:

    >1SSC_1|Chain A|RIBONUCLEASE A
    KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEG
    

    Hotspot

    受体上的目标结合热点残基,使用 1-indexed 编号。支持单个残基、多个残基和连续区间。

    示例:

    15,16
    20-24,31
    

    Peptide Length

    目标多肽长度。

    示例:

    15
    

    Peptide Type

    多肽的化学类型或结构约束,用于限定生成多肽的理化性质,可选参数。

    • linear:线性多肽,无环化约束
    • disulfide
    • cyclic:环化多肽,首尾或侧链形成环化结构

    其中 disulfide 会约束多肽首尾为半胱氨酸,并在结构预测输入中加入首尾二硫键约束;cyclic 会在结构预测输入中设置环肽约束。

    结果说明

    PepCraft 输出打包结果 results.zip,其中包含按综合评分排序的候选多肽信息和对应结构文件。

    主要输出文件包括:

    文件 含义
    top_designs.csv Top 设计结果汇总表,默认输出前 20 个候选。
    rank_1.cif, rank_2.cif, … 按评分排序后的受体-多肽复合物结构文件。
    results.zip 最终交付压缩包,包含 top_designs.csv 和 ranked CIF 文件。

    top_designs.csv 输出以下信息:

    列名 含义
    rank 设计结果排名,按综合评分排序。
    design_id 设计编号,按排名使用 rank_N 表示。
    sequence 候选多肽序列。
    score 综合评分,默认按该列降序排序,越大越好。
    iptm 受体-多肽界面置信指标,越大越好。
    ptm_binder 多肽结构相关的 predicted TM-score。
    peptide_mean_min_distance_to_epitope 多肽到 hotspot 的平均最小距离,通常越小越好。

    结构文件仍按排名输出为 rank_1.cif、rank_2.cif 等;rank_1.cif 对应 top_designs.csv 第一行,rank_2.cif 对应第二行,以此类推。CSV 中不再包含结构文件路径或内部来源字段。

    Peptide Design (PepCraft)

    Introduction

    PepCraft is a peptide design framework for generating candidate binding peptides targeting hotspot regions on protein receptors. Given a receptor sequence, target hotspot residues, peptide length, and peptide type, the workflow generates peptide candidates and evaluates them using Boltz-2 structure prediction and scoring. Final peptide designs are ranked according to a composite score.

    Compared with peptide design methods such as EvoBind, PepCraft shows clear advantages in the quality and diversity of the generated peptides, and it supports multiple peptide types, including linear and cyclic peptides.

    image.png

    Note: PepSeek in the figure above refers to PepCraft

    Currently, three peptide types are supported:

    • Linear: Linear peptides.
    • Disulfide: Peptides constrained by a disulfide bond formed between N-terminal and C-terminal cysteines.
    • Cyclic: Head-to-tail cyclic peptides.

    The core PepCraft workflow consists of candidate generation → structure validation → metric scoring → iterative optimization. Candidate peptides can be generated using PepMLM, random generation, mutation, and crossover operations. During structure validation, Boltz-2 is used to predict receptor–peptide complex structures, which are subsequently ranked according to overall confidence, interface quality, and hotspot-contact metrics.

    To improve computational efficiency, receptor MSA is searched only once at the beginning of each task and reused throughout all subsequent peptide evaluations. Peptide chains are always modeled in single-sequence mode without independent MSA searches.


    Parameters

    Receptor Sequence

    Input receptor protein sequence file.

    The following formats are supported:

    • Standard FASTA with a single-line sequence
    • Standard FASTA with a multi-line sequence
    • Plain sequence without a FASTA header

    The workflow automatically performs format validation and normalization, including:

    • Merging multi-line sequences
    • Converting sequences to uppercase
    • Checking for invalid amino acid characters
    • Ensuring that only one receptor sequence is provided

    Example:

    >1SSC_1|Chain A|RIBONUCLEASE A
    KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEG
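The validation and normalization steps listed above can be sketched as follows. This is an illustration rather than PepCraft's actual code: the function name `normalize_receptor` is hypothetical, and accepting only the 20 standard amino acid letters is an assumption about what counts as an invalid character.

```python
# Assumption: only the 20 standard one-letter amino acid codes are valid.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_receptor(text: str) -> str:
    """Merge lines, uppercase, and validate a single receptor sequence."""
    header_count = 0
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header_count += 1
            if header_count > 1:
                raise ValueError("expected exactly one receptor sequence")
            continue
        chunks.append(line)
    sequence = "".join(chunks).upper()
    if not sequence:
        raise ValueError("no sequence found")
    invalid = sorted(set(sequence) - VALID_RESIDUES)
    if invalid:
        raise ValueError("invalid amino acid characters: " + "".join(invalid))
    return sequence


print(normalize_receptor(">demo\nketaaakf\nERQHMDSS\n"))
print(normalize_receptor("KETAAAKF"))  # plain sequence without a header
```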
    

    Hotspot

    Target binding hotspot residues on the receptor, using 1-indexed residue numbering.

    Supports individual residues, multiple residues, and residue ranges.

    Examples:

    15,16
    20-24,31
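A hotspot string in this syntax can be expanded into an explicit residue list with a few lines of Python. The parser below is a sketch under the assumption that ranges are inclusive on both ends; `parse_hotspot` is a hypothetical helper, not part of PepCraft.

```python
def parse_hotspot(spec: str) -> list:
    """Expand a spec such as '20-24,31' into sorted 1-indexed residue numbers."""
    residues = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError("range start exceeds end: " + token)
            residues.update(range(start, end + 1))  # inclusive range (assumption)
        else:
            residues.add(int(token))
    if any(r < 1 for r in residues):
        raise ValueError("hotspot residues are 1-indexed")
    return sorted(residues)


print(parse_hotspot("15,16"))
print(parse_hotspot("20-24,31"))
```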
    

    Peptide Length

    Target peptide length.

    Example:

    15
    

    Peptide Type

    Chemical type or structural constraint applied to generated peptides. Optional.

    Available options:

    • linear: Linear peptide without cyclization constraints
    • disulfide: Disulfide-constrained peptide
    • cyclic: Cyclic peptide

    For disulfide, PepCraft enforces cysteine residues at both peptide termini and introduces a terminal disulfide bond constraint during structure prediction.

    For cyclic, cyclic peptide constraints are applied during structure prediction.


    Results

    PepCraft produces a compressed result package named results.zip, containing ranked peptide candidates and their corresponding structure files.

    Main output files include:

    File Description
    top_designs.csv Summary table of top-ranked peptide designs. By default, the top 20 candidates are reported.
    rank_1.cif, rank_2.cif, … Receptor–peptide complex structure files ranked by overall score.
    results.zip Final delivery package containing top_designs.csv and all ranked CIF files.

    top_designs.csv contains the following information:

    Column Description
    rank Rank of the peptide design based on the composite score.
    design_id Design identifier, represented as rank_N.
    sequence Candidate peptide sequence.
    score Composite score used for ranking. Higher values indicate better designs.
    iptm Receptor–peptide interface confidence score. Higher values indicate higher confidence.
    ptm_binder Predicted TM-score associated with the peptide structure.
    peptide_mean_min_distance_to_epitope Mean minimum distance between the peptide and hotspot residues. Smaller values generally indicate better hotspot engagement.

    Structure files are output as:

    • rank_1.cif
    • rank_2.cif
    • …

    where rank_1.cif corresponds to the first row of top_designs.csv, rank_2.cif corresponds to the second row, and so on.

    The CSV file does not contain structure file paths or internal provenance fields.
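Because the CSV carries no file paths, the structure file for each row has to be derived from design_id. The sketch below shows one way to do that, assuming a comma-delimited file with the documented column names; the sample rows (sequences and scores) are invented, and `pair_designs_with_structures` is a hypothetical helper.

```python
import csv
import io

# Invented sample rows that follow the documented columns of top_designs.csv.
SAMPLE = """rank,design_id,sequence,score,iptm,ptm_binder,peptide_mean_min_distance_to_epitope
1,rank_1,GSWKLHPEDRTYAVN,0.83,0.79,0.71,4.2
2,rank_2,CTRPWQEELKHGSYC,0.77,0.74,0.69,5.1
"""


def pair_designs_with_structures(handle):
    """Return (sequence, CIF file name) pairs, one per CSV row, in rank order."""
    return [(row["sequence"], row["design_id"] + ".cif")
            for row in csv.DictReader(handle)]


pairs = pair_designs_with_structures(io.StringIO(SAMPLE))
for sequence, cif_name in pairs:
    print(sequence, "->", cif_name)
```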

  • Name: Molecular Docking (Gnina)
    Description: 基于深度学习的分子对接工具,采用卷积神经网络(CNN)评分函数对配体-受体结合构象进行打分和排序。 A deep learning-based molecular docking tool that employs convolutional neural network (CNN) scoring functions to score and rank ligand-receptor binding poses.
    Author: Andrew T McNutt
    Release: 2025-03-02 00:00:00
    Reference: McNutt A, Li Y, Meli R, Aggarwal R, Koes D R. GNINA 1.3: the next increment in molecular docking with deep learning[J]. J. Cheminformatics, 2025, 17:43.

    Molecular Docking (Gnina)

    简介

    基于深度学习的分子对接工具,采用卷积神经网络(CNN)评分函数对配体-受体结合构象进行打分和排序。Gnina在传统对接算法基础上引入深度学习评分,显著提升了对接精度和虚拟筛选效率,支持刚性对接、柔性残基对接和共价对接等多种模式。

    核心技术

    • CNN 评分函数:基于 PyTorch 的深度学习评分模型,训练于大规模蛋白质-配体复合物结构数据
    • 多模式对接:支持标准刚性对接、柔性残基对接和共价对接
    • 自动盒子构建:可根据参考配体自动计算搜索空间,简化参数设置
    • 知识蒸馏优化:GNINA 1.3 引入蒸馏模型,在保持精度的同时大幅提升筛选速度

    适用场景

    • 虚拟筛选:从大型化合物库中快速发现潜在活性分子
    • 先导化合物优化:分析配体与靶点的结合模式,指导结构改造
    • 共价药物设计:支持共价键合分子的对接计算
    • 分子动力学预处理:生成合理的初始结合构象用于后续模拟

    参数说明

    Receptor

    受体结构文件,包含对接计算中保持刚性的受体部分。

    Flex

    柔性受体侧链文件,指定对接过程中允许柔性的受体侧链。

    Ligand

    配体结构文件,支持多种分子格式。

    Flexres

    柔性残基列表,以逗号分隔的 chain:resid 格式指定需要柔性的残基。

    Flexdist Ligand

    Flexdist 模式的参考配体,用于自动识别该配体附近的柔性残基。

    Flexdist

    柔性化距离阈值,自动将距离 flexdist_ligand 该范围内的残基设为柔性。

    Flex Limit

    柔性残基数量的硬上限,限制最多允许多少个残基柔性化。

    Flex Max

    最多保留的最近柔性残基数量,当柔性残基超过限制时只保留距离最近的。

    Center X

    搜索盒子中心的 X 坐标,用于定义对接搜索空间的位置。

    Center Y

    搜索盒子中心的 Y 坐标。

    Center Z

    搜索盒子中心的 Z 坐标。

    Size X

    搜索盒子在 X 方向的尺寸,设置时必须为正值。

    Size Y

    搜索盒子在 Y 方向的尺寸,设置时必须为正值。

    Size Z

    搜索盒子在 Z 方向的尺寸,设置时必须为正值。

    Autobox Ligand

    参考配体文件,用于自动计算搜索盒子的中心和尺寸,无需手动指定 center 和 size 参数。

    Autobox Add

    在自动计算的搜索盒子周围添加的额外填充距离,用于扩展搜索空间。

    Scoring

    用于选择打分函数(scoring function),即评估配体与受体结合好坏的数学模型。

    • default(CNN 深度学习)
      gnina 默认使用的打分函数,基于卷积神经网络,在训练数据覆盖的体系上精度最高,适合对结果质量要求较高的场景。

    • vina(经验式)
      AutoDock Vina 原版打分函数,最经典且广泛使用,速度快、兼容性好,是虚拟筛选中的常用基准。

    • vinardo(经验式)
      Vina 的改进版本,在部分体系上精度优于原版 Vina,可作为 Vina 的替代选择。

    • ad4_scoring(经验式)
      AutoDock 4 的打分函数,需配合 AD4 力场参数文件使用,适合已有 AD4 工作流的场景。

    • dkoes_fast(知识式)
      dkoes 系列中速度最快的版本,精度相对较低,适合需要极高吞吐量的大规模粗筛。

    • dkoes_scoring(知识式)
      dkoes 系列的标准版本,在速度与精度之间取得平衡,是该系列的推荐选择。

    • dkoes_scoring_old(知识式)
      dkoes_scoring 的旧版实现,一般仅用于复现早期文献或历史计算结果。

    CNN Scoring

    CNN 评分模式,用于选择不同的深度学习评分策略。

    • none
      CNN 完全不介入,由传统打分函数独立完成全部计算,精度较低,适合超大规模粗筛场景。

    • rescore(默认)
      在传统方法完成构象搜索后,由 CNN 对所有姿势进行最终重打分和重排序,精度中高,是日常虚拟筛选的推荐模式。

    • refinement
      在初始姿势生成后,用 CNN 分数引导进一步局部优化,精度较高,适合中等规模的精细筛选。

    • metrorescore
      引入 Metropolis 采样以 CNN 分数驱动构象搜索,最终再执行 CNN 重打分,精度较高,适合构象空间复杂或结合口袋灵活的体系。

    • metrorefine
      结合 Metropolis 采样与 CNN 引导的局部优化,精度很高,适合对少量重要化合物进行精细对接评估。

    • all
      CNN 参与对接的全部阶段(搜索、优化、重打分),精度最高,计算代价也最大,适合对少量化合物进行最严格的精确评估。

    Num Modes

    输出的最大结合模式数量,即最终保留的候选构象数,默认为10

    Covalent Receptor Atom

    指定蛋白质中哪个原子与配体形成共价键

    A:145:SG      # A链第145位半胱氨酸的硫原子
    A:200:OG      # A链第200位丝氨酸的氧原子
    B:63:NZ       # B链第63位赖氨酸的氨基氮原子
    

    Covalent Lig Atom Pattern

    SMARTS 模式,用于识别配体中参与共价键的原子。

    C(=O)Cl           # 酰氯,与Cys/Ser/Lys反应
    C=C               # 迈克尔受体(丙烯酰胺类),与Cys反应
    [CH2]Br           # 卤代烷,烷基化反应
    C(=O)[F,Cl,Br]    # 通用酰卤模式
    [cH]1[cH][nH]c1   # 用于特定杂环弹头
    

    Covalent Lig Atom Position

    共价配体原子的初始放置坐标。

    12.345,7.890,-3.210     # 从晶体结构中读取的弹头原子坐标
    -5.100,22.300,8.750     # 从同源建模结构推测的坐标
    

    Covalent Bond Order

    共价键的键级,用于共价对接计算。

    1       # 单键(最常见,如 Cys-S–C 烷基化产物)
    2       # 双键(如与 Lys 形成的亚胺/席夫碱)
    1.5     # 芳香键(较少用)
    

    结果说明

    输出结果包括:对接的压缩文件docked.sdf.gz、解压后的小分子文件docked.sdf和打分文件docked.csv。
    打分文件docked.csv各指标说明:

    列名 说明
    name 小分子名
    mode 小分子构象
    minimizedAffinity 传统/经验 docking 亲和力,越负越好,单位为kcal/mol
    CNNscore 构象(pose)合理性评分,越接近 1 越好
    CNNaffinity CNN 预测结合强度,越大越好,单位为kcal/mol
    CNN_VS 虚拟筛选综合排序分,越大越好

    参考文献

    • McNutt A, Li Y, Meli R, Aggarwal R, Koes D R. GNINA 1.3: the next increment in molecular docking with deep learning[J]. J. Cheminformatics, 2025, 17:43. DOI: 10.1186/s13321-025-00973-x

    Molecular Docking (Gnina)

    Introduction

    A deep learning-based molecular docking tool that employs convolutional neural network (CNN) scoring functions to score and rank ligand–receptor binding poses. Building upon traditional docking algorithms, Gnina introduces deep learning scoring, significantly improving docking accuracy and virtual screening efficiency. It supports rigid docking, flexible residue docking, and covalent docking, among other modes.

    Core Technology

    • CNN Scoring Function: A PyTorch-based deep learning scoring model trained on large-scale protein–ligand complex structural data.
    • Multi-mode Docking: Supports standard rigid docking, flexible residue docking, and covalent docking.
    • Automatic Box Construction: Automatically calculates the search space based on a reference ligand, simplifying parameter setup.
    • Knowledge Distillation Optimization: GNINA 1.3 introduces distillation models that substantially boost screening speed while maintaining accuracy.

    Use Cases

    • Virtual Screening: Rapidly discover potentially active molecules from large compound libraries.
    • Lead Compound Optimization: Analyze ligand–target binding modes to guide structural modifications.
    • Covalent Drug Design: Support docking calculations for covalently binding molecules.
    • Molecular Dynamics Preprocessing: Generate reasonable initial binding poses for subsequent simulations.

    Parameters

    Receptor

    Receptor structure file containing the rigid portion of the receptor used in the docking calculation.

    Flex

    Flexible receptor sidechain file specifying receptor sidechains allowed to be flexible during docking.

    Ligand

    Ligand structure file supporting multiple molecular formats.

    Flexres

    Flexible residue list specifying residues to be made flexible in chain:resid format, comma-separated.

    Flexdist Ligand

    Reference ligand for flexdist mode, used to automatically identify flexible residues near this ligand.

    Flexdist

    Flexibilization distance threshold; residues within this distance from flexdist_ligand are automatically set as flexible.

    Flex Limit

    Hard limit on the number of flexible residues, restricting the maximum number of residues that can be made flexible.

    Flex Max

    Maximum number of nearest flexible residues to retain; when the number of flexible residues exceeds the limit, only the closest ones are kept.

    Center X

    X coordinate of the search box center, defining the position of the docking search space.

    Center Y

    Y coordinate of the search box center.

    Center Z

    Z coordinate of the search box center.

    Size X

    Search box dimension in the X direction; must be set to a positive value.

    Size Y

    Search box dimension in the Y direction; must be set to a positive value.

    Size Z

    Search box dimension in the Z direction; must be set to a positive value.

    Autobox Ligand

    Reference ligand file used to automatically calculate the search box center and size, eliminating the need to manually specify center and size parameters.

    Autobox Add

    Additional padding distance added around the automatically calculated search box to expand the search space.
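To make the relationship between Autobox Ligand, Autobox Add, and the manual Center/Size parameters concrete, here is a minimal sketch of an axis-aligned bounding box with padding. It assumes the padding is applied on both sides of each axis; the function name `autobox` and the example coordinates are illustrative, and gnina's internal computation may differ in detail.

```python
def autobox(coords, padding):
    """Return (center, size) of a padded axis-aligned box around ligand atoms.

    coords  -- iterable of (x, y, z) atom coordinates of the reference ligand
    padding -- extra distance added on each side of every axis (assumption)
    """
    axes = list(zip(*coords))  # three tuples: all x, all y, all z
    center = tuple((min(a) + max(a)) / 2 for a in axes)
    size = tuple((max(a) - min(a)) + 2 * padding for a in axes)
    return center, size


ligand_atoms = [(0.0, 0.0, 0.0), (2.0, 4.0, 6.0), (1.0, 1.0, 1.0)]
center, size = autobox(ligand_atoms, padding=1.0)
print("center:", center)
print("size:", size)
```

The resulting values correspond to what would otherwise be entered by hand as Center X/Y/Z and Size X/Y/Z.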

    Scoring

    Scoring function selection, i.e., the mathematical model used to evaluate ligand–receptor binding quality.

    • default (CNN deep learning)
      The scoring function gnina uses by default, based on a convolutional neural network. It is the most accurate on systems covered by the training data and suits scenarios that demand high-quality results.
    • vina (empirical)
      The original AutoDock Vina scoring function, the classic and most widely used choice. It is fast and broadly compatible, and serves as a common baseline in virtual screening.
    • vinardo (empirical)
      An improved variant of Vina that is more accurate than the original on some systems; it can be used as an alternative to Vina.
    • ad4_scoring (empirical)
      The AutoDock 4 scoring function. It requires AD4 force-field parameter files and suits existing AD4 workflows.
    • dkoes_fast (knowledge-based)
      The fastest member of the dkoes series, with comparatively lower accuracy; suitable for very high-throughput coarse screening.
    • dkoes_scoring (knowledge-based)
      The standard member of the dkoes series, balancing speed and accuracy; the recommended choice within the series.
    • dkoes_scoring_old (knowledge-based)
      The legacy implementation of dkoes_scoring, generally used only to reproduce early literature or historical results.

    CNN Scoring

    CNN scoring mode, used to select different deep learning scoring strategies.

    • none
      The CNN is completely uninvolved; all calculations are handled independently by the traditional scoring function. Accuracy is lower, making it suitable for ultra-large-scale coarse screening scenarios.
    • rescore (default)
      After the traditional method completes conformational search, the CNN performs final rescoring and reranking of all poses. Accuracy is medium-to-high, making it the recommended mode for routine virtual screening.
    • refinement
      After initial pose generation, CNN scores guide further local optimization. Accuracy is relatively high, suitable for medium-scale fine-grained screening.
    • metrorescore
      Incorporates Metropolis sampling driven by CNN scores for conformational search, followed by a final CNN rescoring step. Accuracy is relatively high, suitable for systems with complex conformational spaces or flexible binding pockets.
    • metrorefine
      Combines Metropolis sampling with CNN-guided local optimization. Accuracy is very high, suitable for detailed docking evaluation of a small number of important compounds.
    • all
      The CNN participates in every stage of docking (search, optimization, and rescoring). Accuracy is the highest, but so is the computational cost, making it appropriate for the most rigorous and precise evaluation of a small set of compounds.

    Num Modes

    Maximum number of binding modes to output, i.e., the final number of candidate poses retained. Default: 10.

    Covalent Receptor Atom

    Specifies which atom in the protein forms a covalent bond with the ligand.

    A:145:SG      # Sulfur atom of Cysteine 145 on chain A
    A:200:OG      # Oxygen atom of Serine 200 on chain A
    B:63:NZ       # Amino nitrogen atom of Lysine 63 on chain B
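The chain:residue:atom format shown above is easy to validate before submitting a job. The parser below is a small sketch; `ReceptorAtom` and `parse_receptor_atom` are hypothetical names, and the assumption is that the residue number is a plain integer without an insertion code.

```python
from typing import NamedTuple


class ReceptorAtom(NamedTuple):
    chain: str
    resid: int
    atom: str


def parse_receptor_atom(spec: str) -> ReceptorAtom:
    """Split a specification such as 'A:145:SG' into chain, residue number, atom name."""
    parts = spec.strip().split(":")
    if len(parts) != 3:
        raise ValueError("expected chain:resid:atom, got " + repr(spec))
    chain, resid, atom = parts
    return ReceptorAtom(chain, int(resid), atom)


print(parse_receptor_atom("A:145:SG"))
print(parse_receptor_atom("B:63:NZ"))
```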
    

    Covalent Lig Atom Pattern

    SMARTS pattern used to identify the atom in the ligand that participates in the covalent bond.

    C(=O)Cl           # Acyl chloride; reacts with Cys/Ser/Lys
    C=C               # Michael acceptor (acrylamide-like); reacts with Cys
    [CH2]Br           # Haloalkane; alkylation reaction
    C(=O)[F,Cl,Br]    # General acyl halide pattern
    [cH]1[cH][nH]c1   # For specific heterocyclic warheads
    

    Covalent Lig Atom Position

    Initial placement coordinates of the covalent ligand atom.

    12.345,7.890,-3.210     # Warhead atom coordinates read from a crystal structure
    -5.100,22.300,8.750     # Coordinates inferred from a homology model
    

    Covalent Bond Order

    Bond order of the covalent bond, used in covalent docking calculations.

    1       # Single bond (most common, e.g., Cys-S–C alkylation product)
    2       # Double bond (e.g., imine/Schiff base formed with Lys)
    1.5     # Aromatic bond (rarely used)
    

    Results

    The output includes a compressed docking file docked.sdf.gz, the extracted small molecule file docked.sdf, and a scoring file docked.csv.

    Column descriptions for the scoring file docked.csv:

    Column Description
    name Small molecule name
    mode Binding mode (pose) number of the small molecule
    minimizedAffinity Traditional/empirical docking affinity; more negative is better. Unit: kcal/mol
    CNNscore Pose rationality score; closer to 1 is better
    CNNaffinity CNN-predicted binding affinity; higher is better. Unit: pK
    CNN_VS Virtual screening comprehensive ranking score; higher is better
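For virtual screening, a common follow-up is to keep the best pose of each ligand and rank ligands by CNN_VS. The sketch below assumes docked.csv is comma-delimited with the column names listed above; the sample rows are invented, and `rank_ligands_by_cnn_vs` is a hypothetical helper.

```python
import csv
import io

# Invented sample rows that follow the documented columns of docked.csv.
SAMPLE = """name,mode,minimizedAffinity,CNNscore,CNNaffinity,CNN_VS
lig1,1,-8.1,0.92,6.8,6.256
lig1,2,-7.9,0.55,6.1,3.355
lig2,1,-9.0,0.40,7.0,2.800
"""


def rank_ligands_by_cnn_vs(handle):
    """Return [(name, best CNN_VS)] sorted from highest to lowest score."""
    best = {}
    for row in csv.DictReader(handle):
        score = float(row["CNN_VS"])
        if row["name"] not in best or score > best[row["name"]]:
            best[row["name"]] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


ranked = rank_ligands_by_cnn_vs(io.StringIO(SAMPLE))
for name, score in ranked:
    print(name, score)
```

Note that in this invented sample lig2 has the most negative minimizedAffinity but ranks lower by CNN_VS, which illustrates why the choice of ranking column matters.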

    Reference

    • McNutt A, Li Y, Meli R, Aggarwal R, Koes D R. GNINA 1.3: the next increment in molecular docking with deep learning[J]. J. Cheminformatics, 2025, 17:43. DOI: 10.1186/s13321-025-00973-x
  • Name: Structure Minimization
    Description: 用于在 GB 隐式溶剂下对蛋白质/核酸/小分子/复合物结构进行能量最小化,在指定突变的情况下也支持对突变体进行能量最小化(蛋白突变和核酸突变都支持)。 Structure Minimization performs energy minimization of protein/nucleic acid/small molecule/complex structures in GB implicit solvent. When mutations are specified, it also supports energy minimization of mutant structures (both protein and nucleic acid mutations are supported).
    Author: Peter Eastman
    Release: 2026-05-22 00:00:00
    Reference: Friedrichs M. S., et al. Accelerating molecular dynamic simulations on graphics processing units[J]. J. Comput. Chem., 2009, 30: 864-872. Eastman P., Pande V. S. Efficient nonbonded interactions for molecular dynamics on a graphics processing unit[J]. J. Comput. Chem., 2010, 31: 1268-1272. Eastman P., Pande V. S. Constant constraint matrix approximation: A robust, parallelizable constraint method for molecular simulations[J]. J. Chem. Theor. Comput., 2010, 6: 434-437. Eastman P., et al. OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials[J]. J. Phys. Chem. B, 2023, 128: 109-116.

    Structure Minimization

    简介

    Structure Minimization 用于在 GB 隐式溶剂下对蛋白质/核酸/小分子/复合物结构进行能量最小化,在指定突变的情况下也支持对突变体进行能量最小化(蛋白突变和核酸突变都支持)。优化过程中可自动检测小分子配体并使用 GAFF 力场进行参数化。
    Structure Minimization 提供两种最小化方法:

    1. openmm(默认):OpenMM 内置 LocalEnergyMinimizer(L-BFGS),在CPU和GPU计算平台上结果具有非确定性(结果不可重现)
    2. capped-sd:自定义的确定性能量最速下降法(GPU 力求值 + NumPy 坐标更新),在CPU和GPU计算平台上结果均可重现

    参数说明

    Input File

    输入的蛋白质/核酸/小分子/复合物 PDB 文件,必选项。如果存在残基编号间隙,可在 PDB 中提供 SEQRES 记录以便自动补全(晶体结构中一般都有SEQRES记录因此会自动补全)。

    Mutations

    突变指定,可选项。省略时进入 WT-only 模式(仅计算 WT 的结合自由能)。

    mutations.txt文件内容示例:

    #A100V (注释行,可省略)
    A:100:VAL
    A:100:VAL,A:105:LEU 
    

    备注:如果Input File中没有链名,可以不指定链名,如100:VAL(表示第100个残基突变为VAL),但当有多条链都包含有指定的突变残基时会报错

    Method

    最小化方法,必选项,默认 openmm。

    • openmm:OpenMM 内置 L-BFGS,速度快但 GPU 上非确定性
    • capped-sd:自定义的确定性最速下降方法,结果可重现

    Add Hydrogens

    控制在结构准备过程中如何处理氢原子。

    • --add-hydrogens:默认删除所有H,然后根据pH重建H原子
    • --no-add-hydrogens:跳过 H 处理,使用原始输入结构中的H原子,适用于原始输入结构已经进行过H处理的PDB文件

    Keep Hydrogens

    控制是否保留输入结构中的原始氢原子,可选项。默认删除所有原始氢原子,随后根据设定的 pH 条件重新构建全部氢原子。

    • --keep-hydrogens:保留输入结构中原始H原子,仅补缺失的H原子,适用于原始结构中已经包含了部分H原子,但仍然缺失H原子的PDB文件

    pH

    对Input File文件进行加氢时参考的pH状态,会根据pH值进行残基的质子化状态判定,默认 7.0

    Ligand SMILES

    小分子配体的 SMILES,可选项。用于确保小分子配体正确的键序和连接性,提供时会先去除配体 H 再进行键序匹配,完成后自动重新添加。(当输入结构没有提供键连关系和键序信息时对小分子配体很难做到准确加H,提供小分子配体的smiles可做到对小分子配体的准确加H)。

    Ligand SMILES书写格式:
    "OC[C@H]1O[C@@H](n2cnc3cc(Cl)c(Cl)cc32)[C@H](O)[C@@H]1O" (适用于Input File中只含有一种配体的情况)
    "RFZ:OC[C@H]1O[C@@H](n2cnc3cc(Cl)c(Cl)cc32)[C@H](O)[C@@H]1O,W2R:O=C(Nc1cccc(Oc2ccc(Nc3ncnc4ccn(CCOCCO)c34)cc2Cl)c1)NC1CCCCC1" (适用于Input File中含有多种配体的情况,以逗号分隔)
    

    Min Tolerance

    能量最小化收敛精度 (kJ/mol/nm),默认 1.0,值越小越精确。

    Max Steps

    能量最小化最大迭代步数,默认 5000。

    Restraint Force

    骨架位置限制力常数 (kJ/mol/nm^2),默认 100.0,设为 0 表示不对骨架位置进行限制。

    结果说明

    WT-only 模式输出

    文件 说明
    <prefix>_minimized.pdb WT 重优化后的结构

    突变模式输出

    文件 说明
    <prefix>_WT_minimized.pdb WT 重优化后的结构
    <prefix>_MUT_<链>_<残基号>_<目标残基>_minimized.pdb 各突变体最小化后的结构

    如何理解结果

    1. openmm方法在CPU和GPU计算平台上结果均不可重现
    2. capped-sd方法在CPU和GPU计算平台上结果均可重现

    注意事项

    1. 修饰残基:修饰残基嵌入蛋白链或者核酸链时会报错
    2. 缺失残基:PDB 中存在残基编号间隙但缺少 SEQRES 记录时会报错
    3. 突变支持:支持蛋白质残基突变(标准氨基酸)和核酸残基突变(DNA/RNA)
    4. 小分子 SMILES:强烈建议提供 --ligand-smiles 以确保正确的键序和连接性

    参考文献

    • Friedrichs M. S., et al. Accelerating molecular dynamic simulations on graphics processing units[J]. J. Comput. Chem., 2009, 30: 864-872. DOI: 10.1002/jcc.21209
    • Eastman P., Pande V. S. Efficient nonbonded interactions for molecular dynamics on a graphics processing unit[J]. J. Comput. Chem., 2010, 31: 1268-1272. DOI: 10.1002/jcc.21413
    • Eastman P., Pande V. S. Constant constraint matrix approximation: A robust, parallelizable constraint method for molecular simulations[J]. J. Chem. Theor. Comput., 2010, 6: 434-437. DOI: 10.1021/ct900463w
    • Eastman P., et al. OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials[J]. J. Phys. Chem. B, 2023, 128: 109-116. DOI: 10.1021/acs.jpcb.3c06662

    Structure Minimization

    Introduction

    Structure Minimization performs energy minimization on protein/nucleic acid/small-molecule/complex structures in GB implicit solvent. When mutations are specified, it also supports energy minimization of mutant structures (both protein and nucleic acid mutations are supported). During optimization, small-molecule ligands are automatically detected and parameterized using the GAFF force field.

    Structure Minimization provides two minimization methods:

    1. openmm (default): OpenMM’s built-in LocalEnergyMinimizer (L-BFGS). Results are non-deterministic on both CPU and GPU platforms (not reproducible).
    2. capped-sd: A custom deterministic steepest-descent minimizer (GPU force evaluation + NumPy coordinate updates). Results are reproducible on both CPU and GPU platforms.
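The internals of capped-sd are not described here, so the following is only a generic sketch of steepest descent with a capped per-step displacement, run on a toy two-variable energy function. The cap on step length, the parameter names, and the toy gradient are all assumptions chosen for illustration; the point is that a fixed update rule with no random component gives the same result on every run.

```python
import math


def capped_steepest_descent(gradient, start, step=0.1, max_move=0.05,
                            tolerance=1e-6, max_steps=5000):
    """Follow the negative gradient, never moving farther than max_move per step."""
    position = list(start)
    for _ in range(max_steps):
        grad = gradient(position)
        norm = math.sqrt(sum(g * g for g in grad))
        if norm < tolerance:
            break  # converged: gradient norm below tolerance
        # Shrink the step whenever step * norm would exceed the displacement cap.
        scale = min(step, max_move / norm)
        position = [p - scale * g for p, g in zip(position, grad)]
    return position


def toy_gradient(p):
    """Gradient of the toy energy (x - 1)^2 + 2 * (y + 2)^2, minimum at (1, -2)."""
    return [2.0 * (p[0] - 1.0), 4.0 * (p[1] + 2.0)]


minimum = capped_steepest_descent(toy_gradient, [5.0, 5.0])
print(minimum)
```

Capping the displacement keeps early steps small when forces are large, for example around steric clashes, at the cost of more iterations than a quasi-Newton method such as L-BFGS.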

    Parameters

    Input File

    Input protein/nucleic acid/small-molecule/complex PDB file. Required. If residue numbering gaps exist, a SEQRES record can be provided in the PDB for automatic completion (crystal structures typically contain SEQRES records, so completion is automatic).

    Mutations

    Mutation specification. Optional. When omitted, the tool enters WT-only mode.

    Example mutations.txt file content:

    #A100V (comment line, can be omitted)
    A:100:VAL
    A:100:VAL,A:105:LEU
    

    Note: If the Input File does not contain chain names, the chain name can be omitted, e.g. 100:VAL (indicating residue 100 is mutated to VAL). However, an error will be raised when multiple chains contain the specified mutation residue.
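The mutations.txt syntax can be parsed with a short helper. This sketch assumes that each non-comment line defines one variant and that comma-separated entries on the same line are applied together, which is how the example above reads; `parse_mutations` is a hypothetical name, and a missing chain is represented as None.

```python
def parse_mutations(text: str) -> list:
    """Parse mutations.txt content into a list of variants.

    Each variant is a list of (chain, residue_number, target_residue) tuples;
    chain is None when the entry uses the chain-less form such as '100:VAL'.
    """
    variants = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        variant = []
        for item in line.split(","):
            fields = item.strip().split(":")
            if len(fields) == 3:
                chain, resid, target = fields
            elif len(fields) == 2:
                chain = None
                resid, target = fields
            else:
                raise ValueError("cannot parse mutation: " + repr(item))
            variant.append((chain, int(resid), target.upper()))
        variants.append(variant)
    return variants


EXAMPLE = """#A100V (comment line, can be omitted)
A:100:VAL
A:100:VAL,A:105:LEU
"""
for variant in parse_mutations(EXAMPLE):
    print(variant)
```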

    Method

    Minimization method. Required. Default: openmm.

    • openmm: OpenMM’s built-in L-BFGS. Fast, but results are non-deterministic on both CPU and GPU platforms.
    • capped-sd: Custom deterministic steepest descent method. Results are reproducible.

    Add Hydrogens

    Controls how hydrogen atoms are handled during structure preparation.

    • --add-hydrogens (default): Deletes all H atoms, then rebuilds them according to pH.
    • --no-add-hydrogens: Skips H processing and uses the H atoms from the original input structure. Suitable for PDB files whose hydrogens have already been prepared.

    Keep Hydrogens

    Controls whether original hydrogen atoms from the input structure are preserved. Optional. By default, all original H atoms are deleted and subsequently rebuilt according to the set pH condition.

    • --keep-hydrogens: Preserves original H atoms from the input structure and only adds missing H atoms. Suitable for PDB files where the original structure already contains partial H atoms but still has missing H atoms.

    pH

    pH state referenced during hydrogen addition to the Input File. Residue protonation states are determined based on the pH value. Default: 7.0.

    Ligand SMILES

    SMILES string of the small-molecule ligand. Optional. Used to ensure correct bond order and connectivity of the small-molecule ligand. When provided, ligand H atoms are first removed for bond-order matching, then automatically re-added. (When the input structure does not provide bond connectivity and bond order information, accurate H addition for small-molecule ligands is difficult; providing the SMILES enables accurate H addition for the ligand.)
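The Ligand SMILES value takes one of two forms: a bare SMILES string when the Input File contains a single ligand, or comma-separated NAME:SMILES pairs when it contains several. The helper below is a hedged sketch for splitting such a value; `parse_ligand_smiles` is hypothetical, and treating a 1 to 3 character alphanumeric prefix before the first colon as a residue name is a heuristic assumption that can misfire on unusual SMILES.

```python
import re

# Heuristic (assumption): ligand residue names are 1-3 alphanumeric characters.
RESIDUE_NAME = re.compile(r"[A-Za-z0-9]{1,3}")


def parse_ligand_smiles(spec: str) -> dict:
    """Map residue name -> SMILES; a bare SMILES is stored under the key None."""
    ligands = {}
    for item in spec.strip().strip('"').split(","):
        item = item.strip()
        name, separator, smiles = item.partition(":")
        if separator and RESIDUE_NAME.fullmatch(name):
            ligands[name] = smiles
        else:
            ligands[None] = item
    return ligands


single = parse_ligand_smiles("OC[C@H]1O[C@@H](n2cnc3cc(Cl)c(Cl)cc32)[C@H](O)[C@@H]1O")
multi = parse_ligand_smiles("RFZ:CCO,W2R:O=C(NC1CCCCC1)Nc1ccccc1")
print(single)
print(multi)
```

The short SMILES in the multi-ligand call are placeholders chosen for brevity, not the actual structures of RFZ or W2R.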

    Ligand SMILES format:
    "OC[C@H]1O[C@@H](n2cnc3cc(Cl)c(Cl)cc32)[C@H](O)[C@@H]1O" (for cases where the Input File contains only one ligand)
    "RFZ:OC[C@H]1O[C@@H](n2cnc3cc(Cl)c(Cl)cc32)[C@H](O)[C@@H]1O,W2R:O=C(Nc1cccc(Oc2ccc(Nc3ncnc4ccn(CCOCCO)c34)cc2Cl)c1)NC1CCCCC1" (for cases where the Input File contains multiple ligands, comma-separated)
    

    Min Tolerance

    Energy minimization convergence tolerance (kJ/mol/nm). Default: 1.0. Smaller values are more precise.

    Max Steps

    Maximum number of energy minimization iterations. Default: 5000.

    Restraint Force

    Backbone position restraint force constant (kJ/mol/nm²). Default: 100.0. Set to 0 to disable backbone position restraints.

    Results

    WT-only Mode Output

    File Description
    <prefix>_minimized.pdb Re-optimized WT structure.

    Mutation Mode Output

    File Description
    <prefix>_WT_minimized.pdb Re-optimized WT structure.
    <prefix>_MUT_<chain>_<residue_number>_<target_residue>_minimized.pdb Minimized structure for each mutant.

    Interpreting Results

    1. The openmm method produces non-reproducible results on both CPU and GPU platforms.
    2. The capped-sd method produces reproducible results on both CPU and GPU platforms.

    Notes

    1. Modified residues: An error will be raised if modified residues are embedded in the protein or nucleic acid chain.
    2. Missing residues: An error will be raised if residue numbering gaps exist in the PDB but no SEQRES record is provided.
    3. Mutation support: Supports protein residue mutations (standard amino acids) and nucleic acid residue mutations (DNA/RNA).
    4. Small-molecule SMILES: It is strongly recommended to provide --ligand-smiles to ensure correct bond order and connectivity.

    References

    • Friedrichs M. S., et al. Accelerating molecular dynamic simulations on graphics processing units[J]. J. Comput. Chem., 2009, 30: 864-872. DOI: 10.1002/jcc.21209
    • Eastman P., Pande V. S. Efficient nonbonded interactions for molecular dynamics on a graphics processing unit[J]. J. Comput. Chem., 2010, 31: 1268-1272. DOI: 10.1002/jcc.21413
    • Eastman P., Pande V. S. Constant constraint matrix approximation: A robust, parallelizable constraint method for molecular simulations[J]. J. Chem. Theor. Comput., 2010, 6: 434-437. DOI: 10.1021/ct900463w
    • Eastman P., et al. OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials[J]. J. Phys. Chem. B, 2023, 128: 109-116. DOI: 10.1021/acs.jpcb.3c06662
  • Name: Mutation Energy Calculation (ddG)
    Description: 用于计算在蛋白质/核酸/小分子复合物结构中,由于突变而引起的结合自由能差(即突变能,ddG)。当不指定突变时可用于计算蛋白质/核酸/小分子复合物结构的结合自由能。 Mutation DDG calculates the binding free energy difference (DDG) caused by mutations in protein/nucleic acid/small molecule complex structures. When no mutation is specified, it can be used to calculate the binding free energy of protein/nucleic acid/small molecule complex structures.
    Tags: undefined
    Author: Peter Eastman
    Release: 2026-05-22 00:00:00
    Reference: Friedrichs M. S., et al. Accelerating molecular dynamic simulations on graphics processing units[J]. J. Comput. Chem., 2009, 30: 864-872. Eastman P., Pande V. S. Efficient nonbonded interactions for molecular dynamics on a graphics processing unit[J]. J. Comput. Chem., 2010, 31: 1268-1272. Eastman P., Pande V. S. Constant constraint matrix approximation: A robust, parallelizable constraint method for molecular simulations[J]. J. Chem. Theor. Comput., 2010, 6: 434-437. Eastman P., et al. OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials[J]. J. Phys. Chem. B, 2023, 128: 109-116.

    Mutation Energy Calculation (ddG)

    简介

    Mutation Energy Calculation (ddG) 用于计算在蛋白质/核酸/小分子复合物结构中,由于突变而引起的结合自由能差(即突变能,ddG)。当不指定突变时可用于计算蛋白质/核酸/小分子复合物结构的结合自由能。支持蛋白突变和核酸(DNA/RNA)突变。

    参数说明

    Input File

    输入的蛋白质/核酸/小分子/复合物 PDB 文件,必选项。如果存在残基编号间隙,可在 PDB 中提供 SEQRES 记录以便自动补全(晶体结构中一般都有SEQRES记录因此会自动补全)。

    Receptor Chains

    输入结构中受体链 ID(逗号分隔),默认为全部非配体链。

    D
    B,C
    

    Receptor Residues

    输入结构中受体残基号范围,默认为全部非配体链。

    1-100,120 (如输入结构中没有包含链名,可不指定链名,但当有多条链都包含有指定的残基时会报错)
     A:1-100,B:200
    

    Mutations

    突变指定,可选项。省略时进入 WT-only 模式(仅计算 WT 的结合自由能)。

    mutations.txt文件内容示例:

    #A100V (注释行,可省略)
    A:100:VAL
    A:100:VAL,A:105:LEU 
    

    备注:如果Input File中没有链名,可以不指定链名,如100:VAL(表示第100个残基突变为VAL),但当有多条链都包含有指定的突变残基时会报错

    Ligand Chains

    输入结构中配体链 ID(逗号分隔)。

    注意:Ligand Chains、Ligand Residues和Ligand Name参数三选一

    Ligand Residues

    输入结构中配体残基号范围。

    501-520,530 (如Input File中没有包含链名,可不指定链名,但当有多条链都包含有指定的残基时会报错)
    B:501-520
    

    Ligand Name

    从输入结构中指定的小分子的名称。

    RFZ
    LIG
    

    Ligand SMILES

    小分子配体的 SMILES,可选项。用于确保小分子配体正确的键序和连接性,提供时会先去除配体 H 再进行键序匹配,完成后自动重新添加。(当输入结构没有提供键连关系和键序信息时对小分子配体很难做到准确加H,提供小分子配体的smiles可做到对小分子配体的准确加H)。

    Ligand SMILES书写格式:

    "OC[C@H]1O[C@@H](n2cnc3cc(Cl)c(Cl)cc32)[C@H](O)[C@@H]1O" (适用于Input File中只含有一种配体的情况)
    "RFZ:OC[C@H]1O[C@@H](n2cnc3cc(Cl)c(Cl)cc32)[C@H](O)[C@@H]1O,W2R:O=C(Nc1cccc(Oc2ccc(Nc3ncnc4ccn(CCOCCO)c34)cc2Cl)c1)NC1CCCCC1" (适用于Input File中含有多种配体的情况,以逗号分隔)
    

    Add Hydrogens

    控制在结构准备过程中如何处理氢原子。

    • --add-hydrogens:默认删除所有H,然后根据pH重建H原子
    • --no-add-hydrogens:跳过 H 处理,使用原始输入结构中的H原子,适用于原始输入结构已经进行过H处理的PDB文件

    Keep Hydrogens

    控制是否保留输入结构中的原始氢原子,可选项。默认删除所有原始氢原子,随后根据设定的 pH 条件重新构建全部氢原子。

    • --keep-hydrogens:保留输入结构中原始H原子,仅补缺失的H原子,适用于原始结构中已经包含了部分H原子,但仍然缺失H原子的PDB文件

    pH

    对输入结构文件进行加氢时参考的pH状态,会根据pH值进行残基的质子化状态判定,默认 7.0

    Energy Model

    溶剂化模型,控制 PB/GB 静电相互作用的计算方法,必选项,默认 ALPB。

    • GB:Generalized Born,广义 Born 模型,适用于一般的溶剂化能计算。
    • ALPB:Analytical Linearized Poisson-Boltzmann,解析线性化泊松-玻尔兹曼模型,精度较高,适合需要更准确静电相互作用的场景。
    • CHAGB:Charge-Dependent GB,电荷依赖型 GB 模型,考虑原子电荷变化对溶剂化的影响。
    • CHAGBCAN:Charge-Dependent GB with canonical radii,使用标准原子半径的电荷依赖型 GB 模型。

    Inradii

    溶剂化半径,可选项

    • inpqr:使用 PQR 文件中的 BONDI 半径
    • bestgb:使用 GB 优化半径
    • chagb:使用 CHAGB 专用半径(仅限 CHAGB/CHAGBCAN 模型)

    Ele Corr

    启用 Debye-Huckel 静电屏蔽校正,默认关闭(不进行静电能校正)

    Temp

    温度 (K),默认 298.15

    Min Tolerance

    能量最小化收敛精度 (kJ/mol/nm),默认 1.0,值越小越精确。

    Max Steps

    能量最小化最大迭代步数,默认 5000。

    Restraint Force

    骨架位置限制力常数 (kJ/mol/nm^2),默认 100.0,设为 0 表示不对骨架位置进行限制。

    Output

    结果输出文件,可选项,默认mutations.csv

    结果说明

    1. 突变能输出到mutations.csv文件中
      包含信息如下:
    列名 说明
    mutation 突变标识,格式为 链:残基编号:突变后氨基酸,WT-only 表示野生型
    WT_G_bind 野生型结合自由能(kcal/mol)
    MUT_G_bind 突变型结合自由能(kcal/mol),WT-only 模式下为 N/A
    DDG 突变结合自由能变化(MUT_G_bind - WT_G_bind),WT-only 模式下为 N/A

    如果未指定突变,则进入WT-Only模式,csv文件中只有输入结构的结合自由能

    mutations.csv:
    mutation,WT_G_bind,MUT_G_bind,DDG
    WT-only,-15.2300,N/A,N/A
    
    2. 突变前后对应的PDB结构文件
      | 文件 | 说明 |
      |------|------|
      | WT_minimized.pdb | WT 能量最小化后的结构 |
      | MUT_<链名>_<残基号>_<突变残基名称>_minimized.pdb | MUT 能量最小化后的结构 |
      如果未指定突变,则进入WT-Only模式,则只输出WT_minimized.pdb结构

    如何理解结果

    1. DDG > 0:突变削弱结合(不利突变)
    2. DDG < 0:突变增强结合(有利突变)

    注意事项

    1. 修饰残基:修饰残基嵌入蛋白链或者核酸链时会报错
    2. 缺失残基:PDB 中存在残基编号间隙但缺少 SEQRES 记录时会报错
    3. 突变支持:支持蛋白质残基突变(标准氨基酸)和核酸残基突变(DNA/RNA)
    4. 配体指定:--ligand-chains、--ligand-residues、--ligand-name 三选一,至少提供一个
    5. 小分子 SMILES:强烈建议提供 --ligand-smiles 以确保正确的键序和连接性

    参考文献

    • Friedrichs M. S., et al. Accelerating molecular dynamic simulations on graphics processing units[J]. J. Comput. Chem., 2009, 30: 864-872. DOI: 10.1002/jcc.21209
    • Eastman P., Pande V. S. Efficient nonbonded interactions for molecular dynamics on a graphics processing unit[J]. J. Comput. Chem., 2010, 31: 1268-1272. DOI: 10.1002/jcc.21413
    • Eastman P., Pande V. S. Constant constraint matrix approximation: A robust, parallelizable constraint method for molecular simulations[J]. J. Chem. Theor. Comput., 2010, 6: 434-437. DOI: 10.1021/ct900463w
    • Eastman P., et al. OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials[J]. J. Phys. Chem. B, 2023, 128: 109-116. DOI: 10.1021/acs.jpcb.3c06662

    Mutation Energy Calculation (ddG)

    Introduction

    Mutation Energy Calculation (ddG) computes the change in binding free energy (i.e., mutation energy, ddG) caused by mutations in protein/nucleic acid/small-molecule complex structures. When no mutation is specified, it can be used to calculate the binding free energy of the protein/nucleic acid/small-molecule complex structure. Supports both protein mutations and nucleic acid (DNA/RNA) mutations.

    Parameters

    Input File

    Input protein/nucleic acid/small-molecule/complex PDB file. Required. If residue numbering gaps exist, a SEQRES record can be provided in the PDB for automatic completion (crystal structures typically contain SEQRES records, so completion is automatic).

    Mutations

    Mutation specification. Optional. When omitted, the tool enters WT-only mode (only the WT binding free energy is calculated).

    Example mutations.txt file content:

    #A100V (comment line, can be omitted)
    A:100:VAL
    A:100:VAL,A:105:LEU
    

    Note: If the Input File does not contain chain names, the chain name can be omitted, e.g. 100:VAL (indicating residue 100 is mutated to VAL). However, an error will be raised when multiple chains contain the specified mutation residue.
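The file format above can be checked before submission with a small parser. This is a minimal sketch, not the tool's own reader: the names `Mutation`, `parse_mutation`, and `parse_mutations_text` are hypothetical, and it assumes that each non-comment line describes one variant and that comma-separated entries on a line are applied together.

```python
from typing import List, NamedTuple, Optional


class Mutation(NamedTuple):
    chain: Optional[str]  # None when the chain name is omitted
    resnum: int           # residue number in the input structure
    target: str           # target residue name, e.g. "VAL"


def parse_mutation(token: str) -> Mutation:
    """Parse one 'A:100:VAL' or '100:VAL' token."""
    parts = [p.strip() for p in token.split(":")]
    if len(parts) == 3:
        chain, resnum, target = parts
    elif len(parts) == 2:
        chain = None
        resnum, target = parts
    else:
        raise ValueError(f"unrecognized mutation token: {token!r}")
    return Mutation(chain or None, int(resnum), target.upper())


def parse_mutations_text(text: str) -> List[List[Mutation]]:
    """Return one list of mutations per non-comment, non-blank line."""
    variants = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' comment lines
        tokens = [t for t in line.split(",") if t.strip()]
        variants.append([parse_mutation(t) for t in tokens])
    return variants


example = "#A100V\nA:100:VAL\nA:100:VAL,A:105:LEU\n"
for variant in parse_mutations_text(example):
    print(variant)
```

Running such a check locally catches malformed lines early; whether the tool itself accepts lowercase residue names or extra whitespace is not stated in this page.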

    Receptor Residues

    Receptor residue number range(s) from the input structure. Defaults to all non-ligand chains.

    1-100,120 (if the Input File does not contain chain names, the chain name can be omitted; however, an error will be raised when multiple chains contain the specified residues)
    A:1-100,B:200
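The residue range syntax shown in these examples can be expanded as follows. This is an illustrative sketch only: `parse_residue_spec` is a hypothetical helper, and it assumes non-negative residue numbers without insertion codes.

```python
from typing import List, Optional, Tuple


def parse_residue_spec(spec: str) -> List[Tuple[Optional[str], int, int]]:
    """Parse '1-100,120' or 'A:1-100,B:200' into (chain, first, last) tuples."""
    ranges = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        # rpartition leaves chain == '' when no 'CHAIN:' prefix is present
        chain, _, span = token.rpartition(":")
        first, dash, last = span.partition("-")
        start = int(first)
        end = int(last) if dash else start  # a single number is a 1-residue range
        if end < start:
            raise ValueError(f"descending range in {token!r}")
        ranges.append((chain or None, start, end))
    return ranges


print(parse_residue_spec("1-100,120"))
print(parse_residue_spec("A:1-100,B:200"))
```

A chain of `None` marks an entry without a chain name, which the page says is only unambiguous when a single chain contains the listed residues.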
    

    Receptor Chains

    Receptor chain ID(s) from the input structure (comma-separated). Defaults to all non-ligand chains.

    D
    B,C
    

    Ligand Chains

    Ligand chain ID(s) from the input structure (comma-separated).

    D
    B,C
    

    Note: Exactly one of Ligand Chains, Ligand Residues, and Ligand Name must be provided.

    Ligand Residues

    Ligand residue number range(s) from the input structure.

    501-520,530 (if the Input File does not contain chain names, the chain name can be omitted; however, an error will be raised when multiple chains contain the specified residues)
    B:501-520
    

    Ligand Name

    Specify small-molecule name(s) from the input structure.

    RFZ
    LIG
    

    Ligand SMILES

    SMILES string of the small-molecule ligand. Optional. Used to ensure correct bond order and connectivity of the small-molecule ligand. When provided, ligand H atoms are first removed for bond-order matching, then automatically re-added. (When the input structure does not provide bond connectivity and bond order information, accurate H addition for small-molecule ligands is difficult; providing the SMILES enables accurate H addition for the ligand.)
    Ligand SMILES format:

    "OC[C@H]1O[C@@H](n2cnc3cc(Cl)c(Cl)cc32)[C@H](O)[C@@H]1O" (for cases where the Input File contains only one ligand)
    "RFZ:OC[C@H]1O[C@@H](n2cnc3cc(Cl)c(Cl)cc32)[C@H](O)[C@@H]1O,W2R:O=C(Nc1cccc(Oc2ccc(Nc3ncnc4ccn(CCOCCO)c34)cc2Cl)c1)NC1CCCCC1" (for cases where the Input File contains multiple ligands, comma-separated)
    
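The two Ligand SMILES forms above (a bare SMILES, or comma-separated `NAME:SMILES` pairs) can be told apart with a simple heuristic. The sketch below is an assumption-laden illustration, not the tool's parser: `parse_ligand_smiles` is a hypothetical helper, and it treats a prefix of one to three alphanumeric characters followed by `:` as a ligand residue name. A SMILES that itself begins with such a pattern would be misread, so treat the result as a sanity check only.

```python
import re
from typing import Dict, Optional

# Assumption: ligand names are 1-3 character residue names placed before ':'.
_NAMED = re.compile(r"^([A-Za-z0-9]{1,3}):(.+)$")


def parse_ligand_smiles(value: str) -> Dict[Optional[str], str]:
    """Map ligand name -> SMILES; the key is None for an unnamed single ligand."""
    entries: Dict[Optional[str], str] = {}
    for token in value.strip().strip('"').split(","):
        token = token.strip()
        if not token:
            continue
        match = _NAMED.match(token)
        if match:
            entries[match.group(1).upper()] = match.group(2)
        else:
            entries[None] = token  # bare SMILES, single-ligand form
    return entries


print(parse_ligand_smiles("RFZ:OC[C@H]1O,W2R:O=C(N)C"))
```

The short SMILES strings in the usage line are placeholders chosen for brevity, not the ligands from the example above.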

    Add Hydrogens

    Controls how hydrogen atoms are handled during structure preparation.

    • --add-hydrogens (default): Deletes all H atoms, then rebuilds them according to pH.
    • --no-add-hydrogens: Skips H processing and uses H atoms from the original input structure. Suitable for PDB files that have already been H-treated.

    Keep Hydrogens

    Controls whether original hydrogen atoms from the input structure are preserved. Optional. By default, all original H atoms are deleted and subsequently rebuilt according to the set pH condition.

    • --keep-hydrogens: Preserves original H atoms from the input structure and only adds missing H atoms. Suitable for PDB files where the original structure already contains partial H atoms but still has missing H atoms.

    pH

    pH state referenced during hydrogen addition to the input structure file. Residue protonation states are determined based on the pH value. Default: 7.0.

    Energy Model

    Solvation model. Controls the calculation method for PB/GB electrostatic interactions. Required. Default: ALPB.

    • GB: Generalized Born model. Suitable for general solvation energy calculations.
    • ALPB: Analytical Linearized Poisson-Boltzmann model. Higher accuracy, suitable for scenarios requiring more accurate electrostatic interactions.
    • CHAGB: Charge-Dependent GB model. Considers the effect of atomic charge changes on solvation.
    • CHAGBCAN: Charge-Dependent GB with canonical radii. Uses standard atomic radii.

    Inradii

    Solvation radii. Optional.

    • inpqr: Uses BONDI radii from the PQR file.
    • bestgb: Uses GB-optimized radii.
    • chagb: Uses CHAGB-specific radii (for CHAGB/CHAGBCAN models only).

    Ele Corr

    Enable Debye-Huckel electrostatic shielding correction. Disabled by default (no electrostatic energy correction).

    Temp

    Temperature (K). Default: 298.15.

    Min Tolerance

    Energy minimization convergence tolerance (kJ/mol/nm). Default: 1.0. Smaller values are more precise.

    Max Steps

    Maximum number of energy minimization iterations. Default: 5000.

    Restraint Force

    Backbone position restraint force constant (kJ/mol/nm²). Default: 100.0. Set to 0 to disable backbone position restraints.

    Output

    Result output file. Optional. Default: mutations.csv.

    Results

    1. Mutation energies are output to the mutations.csv file, containing the following columns:
    Column Description
    mutation Mutation identifier, format: chain:residue_number:mutated_amino_acid; WT-only indicates wild type.
    WT_G_bind Wild-type binding free energy (kcal/mol).
    MUT_G_bind Mutant binding free energy (kcal/mol); N/A in WT-only mode.
    DDG Change in binding free energy upon mutation (MUT_G_bind - WT_G_bind); N/A in WT-only mode.

    If no mutation is specified, the tool enters WT-only mode, and the CSV file contains only the binding free energy of the input structure:

    mutations.csv:
    mutation,WT_G_bind,MUT_G_bind,DDG
    WT-only,-15.2300,N/A,N/A
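A table in this layout can be loaded with Python's standard `csv` module. The sketch below assumes only the four columns documented above and maps `N/A` cells to `None`; `read_mutation_table` is a hypothetical helper name.

```python
import csv
import io
from typing import Dict, List, Optional


def _to_float(cell: str) -> Optional[float]:
    """Convert a CSV cell to float, treating 'N/A' and empty cells as missing."""
    cell = cell.strip()
    return None if cell in ("", "N/A") else float(cell)


def read_mutation_table(handle) -> List[Dict[str, object]]:
    """Read mutations.csv rows into dictionaries with numeric energies."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "mutation": record["mutation"],
            "WT_G_bind": _to_float(record["WT_G_bind"]),
            "MUT_G_bind": _to_float(record["MUT_G_bind"]),
            "DDG": _to_float(record["DDG"]),
        })
    return rows


# The WT-only example from this page, read from memory instead of a file.
example = io.StringIO("mutation,WT_G_bind,MUT_G_bind,DDG\nWT-only,-15.2300,N/A,N/A\n")
rows = read_mutation_table(example)
print(rows)
```

To read a real result, pass an open file handle instead, for example `open("mutations.csv", newline="")`.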
    
    2. PDB structure files corresponding to pre- and post-mutation states:
    File Description
    WT_minimized.pdb WT structure after energy minimization.
    MUT_<chain>_<residue_number>_<mutated_residue_name>_minimized.pdb Mutant structure after energy minimization.

    If no mutation is specified, the tool enters WT-only mode and only outputs WT_minimized.pdb.

    Interpreting Results

    1. DDG > 0: The mutation weakens binding (unfavorable mutation).
    2. DDG < 0: The mutation strengthens binding (favorable mutation).
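The sign convention can be made concrete with a short calculation. In this sketch the energies are made-up numbers for illustration, and `neutral_band` is a hypothetical parameter (not a tool option) for treating very small changes as neutral.

```python
def ddg(wt_g_bind: float, mut_g_bind: float) -> float:
    """DDG = MUT_G_bind - WT_G_bind, both in kcal/mol."""
    return mut_g_bind - wt_g_bind


def classify_ddg(value: float, neutral_band: float = 0.0) -> str:
    """Label a DDG value; |value| <= neutral_band counts as neutral."""
    if value > neutral_band:
        return "weakens binding"
    if value < -neutral_band:
        return "strengthens binding"
    return "neutral"


# Illustrative values only: WT binds at -15.23, the mutant at -13.10 kcal/mol.
change = ddg(-15.23, -13.10)
print(round(change, 2), classify_ddg(change))
```

Because a less negative binding free energy means weaker binding, a mutant that moves from -15.23 to -13.10 kcal/mol has a positive DDG and is classified as unfavorable.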

    Notes

    1. Modified residues: An error will be raised if modified residues are embedded in the protein or nucleic acid chain.
    2. Missing residues: An error will be raised if residue numbering gaps exist in the PDB but no SEQRES record is provided.
    3. Mutation support: Supports protein residue mutations (standard amino acids) and nucleic acid residue mutations (DNA/RNA).
    4. Ligand specification: Exactly one of --ligand-chains, --ligand-residues, and --ligand-name must be provided.
    5. Small-molecule SMILES: It is strongly recommended to provide --ligand-smiles to ensure correct bond order and connectivity.

    References

    • Friedrichs M. S., et al. Accelerating molecular dynamic simulations on graphics processing units[J]. J. Comput. Chem., 2009, 30: 864-872. DOI: 10.1002/jcc.21209
    • Eastman P., Pande V. S. Efficient nonbonded interactions for molecular dynamics on a graphics processing unit[J]. J. Comput. Chem., 2010, 31: 1268-1272. DOI: 10.1002/jcc.21413
    • Eastman P., Pande V. S. Constant constraint matrix approximation: A robust, parallelizable constraint method for molecular simulations[J]. J. Chem. Theor. Comput., 2010, 6: 434-437. DOI: 10.1021/ct900463w
    • Eastman P., et al. OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials[J]. J. Phys. Chem. B, 2023, 128: 109-116. DOI: 10.1021/acs.jpcb.3c06662
  • Name: Peptide Property Prediction (PeptiVerse)
    Description: PeptiVerse 是一个面向治疗性多肽研发的通用属性预测方法,主要用于线性肽、环肽及化学修饰肽的关键成药性属性评估。 PeptiVerse is a universal property prediction platform for therapeutic peptide development, designed to evaluate key developability properties of linear peptides, cyclic peptides, and chemically modified peptides.
    Tags: undefined
    Author: PeptiVerse
    Release: 2026-05-15 00:00:00
    Reference: Zhang Y, Tang S, Chen T, Mahood E, Vincoff S, Chatterjee P. PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction

    Peptide Property Prediction (PeptiVerse)

    简介

    基于PeptiVerse深度学习模型的多肽ADMET性质预测工具,支持溶血性、溶解性、细胞穿透性、毒性、膜通透性、半衰期等多种性质的批量预测。输入支持标准氨基酸序列和 SMILES 化学结构两种格式,适用于线性肽、环肽及修饰肽的虚拟筛选与性质评估。

    适用场景

    • 多肽药物早期筛选:快速评估候选多肽的成药性关键性质
    • 安全性评价:预测溶血性、毒性等安全性指标
    • 递送潜力评估:评估细胞穿透性和膜通透能力
    • 靶点结合分析:预测多肽与目标蛋白的结合亲和力

    参数说明

    Peptide Sequence

    Input Peptides

    输入的FASTA 格式多肽序列文件:

    >id1
    ZCVBDSWERTA
    >id2
    WERTAZCV
    

    Property

    预测属性名称,必填,支持多选。

    • permeability_penetrance:细胞穿透性,预测多肽进入或穿透细胞膜的能力
    • hemolysis:溶血性,预测多肽破坏红细胞膜并引发溶血的风险
    • nf:抗污性 / 非特异性吸附,预测多肽发生非特异性蛋白吸附的倾向
    • solubility:溶解性,预测多肽在水相环境中的溶解能力
    • halflife:半衰期,预测多肽在体内的稳定性和半衰期表现

    Uncertainty

    是否计算预测不确定性。启用后输出结果包含不确定性估计值,有助于评估预测可靠性。

    • true:计算不确定性
    • false:不计算不确定性

    Output

    预测结果的输出文件路径,默认输出为 results.csv。

    Peptide SMILES

    Input Peptides

    输入的多肽文件,支持 SMILES 格式:

    N[C@@H](CC(C)C)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(=CN2)C1=C2C=CC=C1)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](CO)C(=O)N[C@@H](CC(=CN2)C1=C2C=CC=C1)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCSC)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H]([C@H](O)C)C(=O)NCC(=O)N[C@@H](CO)C(=O)O 
    

    Property

    预测属性名称,必填,支持多选。

    • hemolysis:溶血性,预测多肽破坏红细胞膜并引发溶血的风险
    • nf:抗污性 / 非特异性吸附,预测多肽发生非特异性蛋白吸附的倾向
    • solubility:溶解性,预测多肽在水相环境中的溶解能力
    • permeability_penetrance:细胞穿透性,预测多肽进入或穿透细胞膜的能力
    • toxicity:毒性,预测多肽潜在毒性风险
    • permeability_pampa:PAMPA 通透性,预测多肽在人工膜通透性实验中的表现
    • permeability_caco2:Caco-2 通透性,预测多肽在 Caco-2 细胞模型中的通透能力
    • halflife:半衰期,预测多肽在体内的稳定性和半衰期表现

    Uncertainty

    是否计算预测不确定性。启用后输出结果包含不确定性估计值,有助于评估预测可靠性。

    • true:计算不确定性
    • false:不计算不确定性

    Output

    预测结果的输出文件路径,默认输出为 results.csv。

    结果说明

    输出结果包括 results.csv 预测结果表格,包含每条多肽的各项预测性质及对应的不确定性。

    results.csv 包含信息如下:

    列名 说明
    id 多肽标识符,与输入文件中的 id 对应
    halflife 回归任务,血清半衰期预测值,反映多肽在体内的稳定性,越大越稳定。单位:小时 (h)
    halflife_uncertainty_type 半衰期不确定性的计算类型标识
    toxicity 分类任务(概率值),毒性预测值,评估多肽的潜在毒性风险,越小越安全。范围 [0, 1],无量纲
    toxicity_uncertainty_type 毒性不确定性的计算类型标识
    hemolysis 分类任务(概率值),溶血性预测值,评估破坏红细胞风险(HC50 < 100 μM 为溶血),越小越安全。范围 [0, 1],无量纲
    hemolysis_uncertainty_type 溶血性不确定性的计算类型标识
    permeability_pampa 回归任务,PAMPA 平行人工膜通透性预测值,反映被动跨膜扩散能力,越大通透性越好。好:> -6.0。单位:log Pe (log₁₀ cm/s),范围约 -9 ~ -5
    permeability_pampa_uncertainty PAMPA 通透性预测的共形预测区间。格式 (lo, hi) 元组
    permeability_pampa_uncertainty_type PAMPA 通透性不确定性的计算类型标识
    nf 分类任务(概率值),非特异性吸附(抗污性)预测值,评估非特异性相互作用倾向,越小抗污性越好。范围 [0, 1],无量纲
    nf_uncertainty 非特异性吸附预测的二元预测熵。范围 [0, ln2 ≈ 0.693]
    nf_uncertainty_type 非特异性吸附不确定性的计算类型标识
    solubility 分类任务(概率值),溶解性预测值,反映多肽在水相环境中的溶解能力,越大水溶性越好。范围 [0, 1],无量纲
    solubility_uncertainty_type 溶解性不确定性的计算类型标识
    permeability_penetrance 分类任务(概率值),细胞穿透性预测值,评估多肽进入细胞膜的能力,越大穿透能力越强。范围 [0, 1],无量纲
    permeability_penetrance_uncertainty 细胞穿透性预测的二元预测熵。范围 [0, ln2 ≈ 0.693]
    permeability_penetrance_uncertainty_type 细胞穿透性不确定性的计算类型标识
    permeability_caco2 回归任务,Caco-2 细胞通透性预测值,反映肠道吸收潜力,越大吸收越好。单位:log Pe (log₁₀ cm/s),范围约 -9 ~ -5
    permeability_caco2_uncertainty Caco-2 通透性预测的共形预测区间。格式 (lo, hi) 元组
    permeability_caco2_uncertainty_type Caco-2 通透性不确定性的计算类型标识
    • 不确定性类型说明
    类型标识 含义 取值范围 解读
    binary_predictive_entropy 二元预测熵(基于集成模型预测分布) [0, ln2 ≈ 0.693] 越接近 0 越确定,越接近 0.693 越不确定
    ensemble_predictive_entropy 集成预测熵(多分类) [0, ln(n)] 同上,n 为类别数
    binary_predictive_entropy_single_model 单模型二元预测熵 [0, ln2 ≈ 0.693] 仅基于单一模型,可信度低于集成版本
    conformal_prediction_interval 共形预测区间 (lo, hi) 无界 真实值有较高概率(如 90%)落在区间内,区间越窄越可信
    unavailable (no seed ensemble found) 无集成模型可用 — 无法量化不确定性,对该字段需谨慎
    unavailable (no MAPIE bundle for XGBoost regression) XGBoost 回归无 MAPIE 配套 — 无共形区间可用,对该字段需谨慎

    注意:不确定性指标仅在 Uncertainty 选择 true 时输出。

    参考文献

    • Zhang Y, Tang S, Chen T, Mahood E, Vincoff S, Chatterjee P. PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction [bioRxiv]. 2025. DOI: 10.64898/2025.12.31.697180

    Peptide Property Prediction (PeptiVerse)

    Introduction

    A deep learning-based multi-property prediction tool for peptides, supporting batch prediction of properties including hemolysis, solubility, cell penetration, toxicity, membrane permeability, half-life, and binding affinity. Input supports both standard amino acid sequences and SMILES chemical structure formats, making it suitable for virtual screening and property evaluation of linear peptides, cyclic peptides, and modified peptides.

    Use Cases

    • Early-stage peptide drug screening: Rapid assessment of key druggability properties for candidate peptides
    • Safety evaluation: Prediction of safety-related indicators such as hemolysis and toxicity
    • Delivery potential assessment: Evaluation of cell penetration and membrane permeability
    • Target binding analysis: Prediction of binding affinity between peptides and target proteins

    Parameters

    Peptide Sequence

    Input Peptides

    Input peptide sequence file in FASTA format:

    >id1
    ZCVBDSWERTA
    >id2
    WERTAZCV
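An input file in this layout can be validated before upload with a few lines of Python. This is a minimal sketch using a hypothetical `read_fasta` helper; it assumes the record id is the first whitespace-delimited word after `>` and that ids must be unique, neither of which is stated on this page.

```python
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {id: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("header line without an id")
            current = words[0]
            if current in records:
                raise ValueError(f"duplicate id: {current}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data before the first header")
        else:
            records[current] += line.upper()
    return records


peptides = read_fasta(">id1\nZCVBDSWERTA\n>id2\nWERTAZCV\n")
print(peptides)
```

The ids returned here are the values that should reappear in the `id` column of results.csv.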
    

    Property

    The properties to predict. Required; multiple selections supported.

    • permeability_penetrance: Cell penetration — predicts the ability of a peptide to enter or traverse the cell membrane
    • hemolysis: Hemolysis — predicts the risk of a peptide disrupting red blood cell membranes and causing hemolysis
    • nf: Non-fouling / non-specific adsorption — predicts the tendency of a peptide to undergo non-specific protein adsorption
    • solubility: Solubility — predicts the ability of a peptide to dissolve in an aqueous environment
    • halflife: Half-life — predicts the in vivo stability and half-life performance of a peptide

    Peptide SMILES

    Input Peptides

    Input peptide file in SMILES format:

    N[C@@H](CC(C)C)C(=O)N[C@@H](CC(=O)N)C(=O)...
    

    Property

    The properties to predict. Required; multiple selections supported.

    • hemolysis: Hemolysis — predicts the risk of a peptide disrupting red blood cell membranes and causing hemolysis
    • nf: Non-fouling / non-specific adsorption — predicts the tendency of a peptide to undergo non-specific protein adsorption
    • solubility: Solubility — predicts the ability of a peptide to dissolve in an aqueous environment
    • permeability_penetrance: Cell penetration — predicts the ability of a peptide to enter or traverse the cell membrane
    • toxicity: Toxicity — predicts the potential toxicity risk of a peptide
    • permeability_pampa: PAMPA permeability — predicts peptide performance in artificial membrane permeability assays
    • permeability_caco2: Caco-2 permeability — predicts peptide permeability in the Caco-2 cell model
    • halflife: Half-life — predicts the in vivo stability and half-life performance of a peptide
    • binding_affinity: Binding affinity — predicts the binding strength between a peptide and a target protein. If no Target Sequence is provided, this property will be automatically skipped.

    Uncertainty

    Whether to calculate prediction uncertainty. When enabled, the output includes uncertainty estimates to help assess prediction reliability.

    • true: Calculate uncertainty
    • false: Do not calculate uncertainty

    Output

    Output file path for prediction results. Defaults to results.csv.


    Results

    The output is a results.csv prediction table containing the predicted properties and corresponding uncertainty estimates for each peptide.

    Column Name Description
    id Peptide identifier, corresponding to the id in the input file
    halflife Regression task: predicted serum half-life value, reflecting peptide stability in vivo; higher values indicate greater stability. Unit: hours (h)
    halflife_uncertainty_type Uncertainty type identifier for half-life prediction
    toxicity Classification task (probability value): predicted toxicity score, assessing potential toxic risk of the peptide; lower values are safer. Range: [0, 1], dimensionless
    toxicity_uncertainty_type Uncertainty type identifier for toxicity prediction
    hemolysis Classification task (probability value): predicted hemolytic activity, assessing risk of red blood cell destruction (HC50 < 100 μM indicates hemolysis); lower values are safer. Range: [0, 1], dimensionless
    hemolysis_uncertainty_type Uncertainty type identifier for hemolysis prediction
    permeability_pampa Regression task: predicted PAMPA (Parallel Artificial Membrane Permeability Assay) value, reflecting passive trans-membrane diffusion ability; higher values indicate better permeability. Good: > -6.0. Unit: log Pe (log₁₀ cm/s), range approximately -9 ~ -5
    permeability_pampa_uncertainty Conformal prediction interval for PAMPA permeability. Format: (lo, hi) tuple
    permeability_pampa_uncertainty_type Uncertainty type identifier for PAMPA permeability prediction
    nf Classification task (probability value): predicted non-specific adsorption (antifouling property) score, assessing tendency for non-specific interactions; lower values indicate better antifouling. Range: [0, 1], dimensionless
    nf_uncertainty Binary predictive entropy for non-specific adsorption prediction. Range: [0, ln2 ≈ 0.693]
    nf_uncertainty_type Uncertainty type identifier for non-specific adsorption prediction
    solubility Classification task (probability value): predicted solubility score, reflecting peptide dissolution ability in aqueous environment; higher values indicate better water solubility. Range: [0, 1], dimensionless
    solubility_uncertainty_type Uncertainty type identifier for solubility prediction
    permeability_penetrance Classification task (probability value): predicted cell penetration ability, assessing peptide capacity to enter cell membrane; higher values indicate stronger penetration. Range: [0, 1], dimensionless
    permeability_penetrance_uncertainty Binary predictive entropy for cell penetration prediction. Range: [0, ln2 ≈ 0.693]
    permeability_penetrance_uncertainty_type Uncertainty type identifier for cell penetration prediction
    permeability_caco2 Regression task: predicted Caco-2 cell permeability value, reflecting intestinal absorption potential; higher values indicate better absorption. Unit: log Pe (log₁₀ cm/s), range approximately -9 ~ -5
    permeability_caco2_uncertainty Conformal prediction interval for Caco-2 permeability. Format: (lo, hi) tuple
    permeability_caco2_uncertainty_type Uncertainty type identifier for Caco-2 permeability prediction
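As a reading aid for the permeability columns, the sketch below converts a log Pe value to cm/s, applies the "Good: > -6.0" guideline from the table, and measures the width of a conformal interval. It assumes the interval cell is stored as a plain `(lo, hi)` literal such as `(-6.1, -5.3)`; the exact text PeptiVerse writes is not shown on this page, and the demo row is made up. All function names are hypothetical.

```python
import ast
import csv
import io
from typing import Dict, Tuple


def parse_interval(cell: str) -> Tuple[float, float]:
    """Parse a '(lo, hi)' cell into two floats (assumed plain-literal format)."""
    lo, hi = ast.literal_eval(cell.strip())
    return float(lo), float(hi)


def log_pe_to_cm_per_s(log_pe: float) -> float:
    """Undo the base-10 logarithm: Pe = 10 ** log Pe, in cm/s."""
    return 10.0 ** log_pe


def summarize_pampa(row: Dict[str, str]) -> Dict[str, object]:
    """Summarize one results.csv row for the PAMPA columns."""
    log_pe = float(row["permeability_pampa"])
    lo, hi = parse_interval(row["permeability_pampa_uncertainty"])
    return {
        "id": row["id"],
        "pe_cm_per_s": log_pe_to_cm_per_s(log_pe),
        "good": log_pe > -6.0,          # guideline from the table above
        "interval_width": hi - lo,      # narrower means a tighter prediction
    }


# Made-up row for illustration only; not real PeptiVerse output.
demo = io.StringIO(
    'id,permeability_pampa,permeability_pampa_uncertainty\n'
    'id1,-5.7,"(-6.1, -5.3)"\n'
)
summaries = [summarize_pampa(r) for r in csv.DictReader(demo)]
print(summaries)
```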

    Uncertainty Type Explanation

    Type Identifier Meaning Value Range Interpretation
    binary_predictive_entropy Binary predictive entropy (based on ensemble model prediction distribution) [0, ln2 ≈ 0.693] Closer to 0 indicates higher certainty; closer to 0.693 indicates greater uncertainty
    ensemble_predictive_entropy Ensemble predictive entropy (multiclass) [0, ln(n)] Same as above; n is the number of classes
    binary_predictive_entropy_single_model Single-model binary predictive entropy [0, ln2 ≈ 0.693] Based on a single model only; lower credibility than ensemble version
    conformal_prediction_interval Conformal prediction interval (lo, hi) Unbounded True value has high probability (e.g., 90%) of falling within the interval; narrower intervals are more credible
    unavailable (no seed ensemble found) No ensemble model available — Unable to quantify uncertainty; use caution when interpreting this field
    unavailable (no MAPIE bundle for XGBoost regression) XGBoost regression has no MAPIE support — No conformal interval available; use caution when interpreting this field
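The [0, ln2] range quoted for binary predictive entropy follows from the standard definition H(p) = -p ln p - (1 - p) ln(1 - p), which peaks at p = 0.5. The sketch below implements that textbook formula; averaging ensemble probabilities before taking the entropy is an assumption about how an ensemble version could be computed, not a description of PeptiVerse internals.

```python
import math
from typing import Sequence


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli prediction with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # limit of p*ln(p) as p -> 0 is 0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def ensemble_binary_entropy(probs: Sequence[float]) -> float:
    """Entropy of the mean predicted probability across ensemble members."""
    mean_p = sum(probs) / len(probs)
    return binary_entropy(mean_p)


print(binary_entropy(0.5))   # maximum, equal to ln 2
print(binary_entropy(0.99))  # confident prediction, close to 0
```

Under this definition a confident prediction near 0 or 1 yields an entropy near 0, while members that disagree strongly pull the mean toward 0.5 and the entropy toward ln 2.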

    Note: Uncertainty columns are only included in the output when Uncertainty is set to true.


    Reference

    • Zhang Y, Tang S, Chen T, Mahood E, Vincoff S, Chatterjee P. PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction [bioRxiv]. 2025. DOI: 10.64898/2025.12.31.697180
  • Name: Extract Fv and Analyze Contacts
    Description: 从输入 PDB 文件中自动提取抗体 Fv 区域及邻近分子片段,生成包含 Fv 与伙伴链的截断 PDB 和 Fv 序列文件,并进行界面(interface)和氢键(hydrogen bond)相互作用计算。 Automatically extracts the antibody Fv region and neighboring molecular fragments from an input PDB file, generates a truncated PDB containing Fv with partner chains and an Fv sequence file, and calculates interface and hydrogen bond interactions.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-05-09 00:00:00
    Reference:

    Extract Fv and Analyze Contacts

    简介

    从输入 PDB 文件中自动提取抗体 Fv 区域及邻近分子片段,生成包含 Fv 与伙伴链的截断 PDB 和 Fv 序列文件,并进行界面(interface)和氢键(hydrogen bond)相互作用计算。

    核心技术

    • 自识别 VH/VL:基于保守的 Cys-Trp motif 和 FR1 起始序列特征自动识别可变区,无需外部数据库
    • 编号方案:内置 IMGT、Kabat、Chothia、Martin、CCG 五种 CDR 位置定义
    • Fv 截断:按各方案对应的 Fv 末端位置截断,去除 CH1/CL 恒定区(如 IMGT≈128,Kabat≈113)
    • 伙伴链保留:通过 NeighborSearch 识别与 Fv 原子距离在截止范围内的所有链(抗原、小分子、离子等),一并写入输出 PDB

    适用场景

    • 抗体结构预处理:自动提取 Fv 区域用于后续人源化、亲和力成熟等流程
    • 分子相互作用分析:计算抗体与抗原、配体之间的界面接触和氢键网络
    • 结构数据准备:生成标准化的 Fv 结构文件和序列文件用于下游分析

    参数说明

    Input Structure

    输入的抗体 PDB 结构文件,需包含完整的抗体结构及可能结合的抗原、配体或其他分子。输入时请限制抗体及其相互作用的对象是一对一的,例如一个由轻重链构成的抗体对应一个抗原,而非多个抗体对应一个抗原。

    Numbering Scheme

    Fv 编号方案,用于确定 CDR 位置和 Fv 截断点。

    • IMGT:国际免疫基因组学标准,Fv 末端约 128 位
    • Kabat:基于序列变异性定义,Fv 末端约 113 位
    • Chothia:基于结构环区定义
    • Martin:基于 Kabat 的修订版本
    • CCG:Chemical Computing Group(MOE 软件)编号方案

    Contact Cutoff Distance

    Fv 与邻近分子的接触截止距离,用于识别需要保留的伙伴链。单位 Å,默认 10.0 Å。

    结果说明

    输出结果包括:

    文件名 说明
    extracted_fv.pdb 截断后的 Fv 及邻近伙伴链的 PDB 结构文件
    extracted_fv.fasta 提取的 Fv 氨基酸序列,可用于后续人源化流程
    interface_cb.json 界面相互作用计算结果,包含原子/残基级别的接触信息
    hydrogen_bond.json 氢键计算结果,包含供体-受体对、距离和角度信息
    extracted_HL.pdb 截断后Fv的PDB 结构文件

    Extract Fv and Analyze Contacts

    Introduction

    Automatically extracts the antibody Fv region and neighboring molecular fragments from an input PDB file, generates a truncated PDB containing Fv with partner chains and an Fv sequence file, and calculates interface and hydrogen bond interactions.

    Core Technologies

    • VH/VL Auto-identification: Automatically identifies variable regions based on conserved Cys-Trp motifs and FR1 starting sequence features without external databases
    • Numbering Schemes: Built-in definitions for five CDR positioning schemes: IMGT, Kabat, Chothia, Martin, and CCG
    • Fv Truncation: Truncates at Fv terminus positions according to each scheme to remove CH1/CL constant regions (e.g., IMGT≈128, Kabat≈113)
    • Partner Chain Retention: Uses NeighborSearch to identify all chains within the cutoff distance of Fv atoms (antigens, small molecules, ions, etc.) and writes them into the output PDB
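The partner-chain rule can be illustrated without any structure library. The sketch below is a brute-force stand-in for a neighbor search (the text presumably refers to a KD-tree search such as Biopython's `Bio.PDB.NeighborSearch`): it keeps every non-Fv chain that has at least one atom within the cutoff of any Fv atom. The `partner_chains` function and the coordinate dictionary are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Coord = Tuple[float, float, float]


def partner_chains(
    atoms_by_chain: Dict[str, List[Coord]],
    fv_chains: Set[str],
    cutoff: float = 10.0,
) -> Set[str]:
    """Return non-Fv chains with any atom within `cutoff` angstroms of an Fv atom."""
    fv_atoms = [xyz for chain in fv_chains for xyz in atoms_by_chain[chain]]
    partners = set()
    for chain, atoms in atoms_by_chain.items():
        if chain in fv_chains:
            continue
        # O(N*M) scan; fine for a sketch, a KD-tree is preferable for real structures
        if any(math.dist(a, b) <= cutoff for a in atoms for b in fv_atoms):
            partners.add(chain)
    return partners


# Toy coordinates: chains H and L form the Fv, A is nearby, X is far away.
toy = {"H": [(0.0, 0.0, 0.0)], "L": [(3.0, 0.0, 0.0)],
       "A": [(0.0, 8.0, 0.0)], "X": [(50.0, 0.0, 0.0)]}
print(partner_chains(toy, {"H", "L"}))
```

The default of 10.0 matches the documented Contact Cutoff Distance; lowering it retains fewer partner chains.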

    Use Cases

    • Antibody structure preprocessing: Extract Fv regions for downstream humanization, affinity maturation, and other workflows
    • Molecular interaction analysis: Calculate interface contacts and hydrogen bond networks between antibodies and antigens/ligands
    • Structural data preparation: Generate standardized Fv structure and sequence files for downstream analysis

    Parameters

    Input Structure

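    Input antibody PDB structure file, which should contain the complete antibody structure and any bound antigens, ligands, or other molecules. Keep the antibody and its interaction partner in a one-to-one relationship: for example, one antibody (one heavy chain and one light chain) bound to one antigen, rather than several antibodies bound to one antigen.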

    Numbering Scheme

    Fv numbering scheme used to determine CDR positions and Fv truncation points.

    • IMGT: International ImMunoGeneTics standard, Fv terminus around position 128
    • Kabat: Defines CDRs based on sequence variability, Fv terminus around position 113
    • Chothia: Defines CDRs based on structural loop regions
    • Martin: Revised version based on Kabat
    • CCG: Chemical Computing Group (MOE) numbering scheme

    Contact Cutoff Distance

    Contact cutoff distance between Fv and neighboring molecules for identifying partner chains to retain. Unit: Å, default 10.0 Å.

    Results

    The output includes the following files:

    File Name Description
    extracted_fv.pdb Truncated PDB structure file containing Fv and neighboring partner chains
    extracted_fv.fasta Extracted Fv amino acid sequence, available for downstream humanization workflows
    interface_cb.json Interface interaction calculation results, including atom/residue-level contact information
    hydrogen_bond.json Hydrogen bond calculation results, including donor-acceptor pairs, distances, and angles
    extracted_HL.pdb PDB structure file of the truncated Fv
  • Name: Immunogenicity Prediction Report
    Description: 对 Immunogenicity Prediction (AlphaMHC v3.0 beta)和 Immunogenicity Prediction (WeADApt v4.1.0)、Immunogenicity Prediction (WeADApt v4.2)、Immunogenicity Prediction (WeADApt v4.3) 四个免疫原性评估模块的结果进行汇总,生成分子和表位级别的整合报告。该模块为流程编排组件,需配合上游免疫原性预测模块使用。 Aggregates results from four immunogenicity assessment modules (Immunogenicity Prediction (AlphaMHC v3.0 beta), Immunogenicity Prediction (WeADApt v4.1.0), Immunogenicity Prediction (WeADApt v4.2), and Immunogenicity Prediction (WeADApt v4.3)) to generate integrated molecule-level and epitope-level reports. This module is a workflow orchestration component and must be used in conjunction with upstream immunogenicity prediction modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-05-08 00:00:00
    Reference:

    ImmuneReport

    简介

    对 Immunogenicity Prediction (AlphaMHC v3.0 beta)和 Immunogenicity Prediction (WeADApt v4.1.0)、Immunogenicity Prediction (WeADApt v4.2)、Immunogenicity Prediction (WeADApt v4.3) 四个免疫原性评估模块的结果进行汇总,生成分子和表位级别的整合报告。该模块为流程编排组件,需配合上游免疫原性预测模块使用。

    参数说明

    Input Directory

    基础输入目录,仅作为省略的输入文件参数的默认路径前缀。

    FASTA

    FASTA 格式的氨基酸序列文件。

    AlphaMHC v3 Molecule Score

    AlphaMHC v3.0 分子评分 CSV 文件。

    AlphaMHC v3 Epitope Score

    AlphaMHC v3.0 表位评分 CSV 文件。

    WeAdapt v4.1 Molecule Score

    WeAdapt v4.1 分子评分 CSV 文件。

    WeAdapt v4.1 Epitope Score

    WeAdapt v4.1 表位评分 CSV 文件。

    WeAdapt v4.2 Molecule Score

    WeAdapt v4.2 分子评分 CSV 文件。

    WeAdapt v4.2 Epitope Score

    WeAdapt v4.2 表位评分 CSV 文件。

    WeAdapt v4.3 Molecule Score

    WeAdapt v4.3 分子评分 CSV 文件。

    WeAdapt v4.3 Epitope Score

    WeAdapt v4.3 表位评分 CSV 文件。

    Molecule Summary

    分子汇总 CSV 输出路径。

    Epitope Summary

    表位汇总 CSV 输出路径。

    Errors

    记录级错误 CSV 输出路径。

    结果说明

    输出结果包括:

    文件名 说明
    molecule_summary.csv 分子级别汇总结果,整合各模块的分子评分
    epitope_summary.csv 表位级别汇总结果,整合各模块的表位评分
    errors.csv 记录级错误日志,汇总处理过程中的异常信息

    molecule_summary.csv文件包含信息如下:

    列名 说明
    molecule 蛋白质分子名称(取自 FASTA 和 CSV 中的 Protein ID)
    AlphaMHC_v3.0_score AlphaMHC v3.0 模块给出的分子级别评分
    WeAdapt_v4.1_score WeAdapt v4.1 模块给出的分子级别评分
    WeAdapt_v4.2_score WeAdapt v4.2 模块给出的分子级别评分
    WeAdapt_v4.3_score WeAdapt v4.3 模块给出的分子级别评分
    mean_score(v4) WeAdapt 三个版本(v4.1 / v4.2 / v4.3)评分的均值,AlphaMHC 不参与统计
    max_score(v4) WeAdapt 三个版本评分的最大值
    min_score(v4) WeAdapt 三个版本评分的最小值

    epitope_summary.csv文件包含信息如下:

    列名 说明
    molecule 蛋白质分子名称
    chain 序列 ID(chain 名称)
    epitope_id 表位编号,格式 Epitope_001,按分子内出现顺序递增
    epitope_position 表位在序列上的区间,格式 begin-end(1-based)
    epitope 代表性表位肽段序列(优先取 FASTA 对应区间子串,否则取聚类中最长肽段)
    mean_score(v4) 聚类中 WeAdapt 三版评分的均值(AlphaMHC 不参与统计)
    max_score(v4) 聚类中 WeAdapt 三版评分的最大值
    min_score(v4) 聚类中 WeAdapt 三版评分的最小值
    AlphaMHC_v3.0_score 聚类中 AlphaMHC v3.0 表位的最高评分
    WeAdapt_v4.1_score 聚类中 WeAdapt v4.1 表位的最高评分
    WeAdapt_v4.2_score 聚类中 WeAdapt v4.2 表位的最高评分
    WeAdapt_v4.3_score 聚类中 WeAdapt v4.3 表位的最高评分
    AlphaMHC_v3.0_HLA AlphaMHC v3.0 模块关联的 HLA 等位基因(该模块无 HLA 数据,始终为 /)
    WeAdapt_v4.1_HLA WeAdapt v4.1 模块关联的 HLA 等位基因,分号分隔
    WeAdapt_v4.2_HLA WeAdapt v4.2 模块关联的 HLA 等位基因,分号分隔
    WeAdapt_v4.3_HLA WeAdapt v4.3 模块关联的 HLA 等位基因,分号分隔
    overlapping_HLA 各模块 HLA 集合的交集(至少 2 个模块有 HLA 数据时才计算),无交集或数据不足时为 /

    ImmuneReport

    Introduction

    Aggregates results from four immunogenicity assessment modules (Immunogenicity Prediction (AlphaMHC v3.0 beta), Immunogenicity Prediction (WeADApt v4.1.0), Immunogenicity Prediction (WeADApt v4.2), and Immunogenicity Prediction (WeADApt v4.3)) to generate integrated molecule-level and epitope-level reports. This module is a workflow orchestration component and must be used together with the upstream immunogenicity prediction modules.

    Parameters

    Input Directory

    Base input directory used only as the default path prefix for omitted input file arguments.

    FASTA

    Amino acid sequence file in FASTA format.

    AlphaMHC v3 Molecule Score

    AlphaMHC v3.0 molecule score CSV file.

    AlphaMHC v3 Epitope Score

    AlphaMHC v3.0 epitope score CSV file.

    WeAdapt v4.1 Molecule Score

    WeAdapt v4.1 molecule score CSV file.

    WeAdapt v4.1 Epitope Score

    WeAdapt v4.1 epitope score CSV file.

    WeAdapt v4.2 Molecule Score

    WeAdapt v4.2 molecule score CSV file.

    WeAdapt v4.2 Epitope Score

    WeAdapt v4.2 epitope score CSV file.

    WeAdapt v4.3 Molecule Score

    WeAdapt v4.3 molecule score CSV file.

    WeAdapt v4.3 Epitope Score

    WeAdapt v4.3 epitope score CSV file.

    Molecule Summary

    Molecule summary CSV output path.

    Epitope Summary

    Epitope summary CSV output path.

    Errors

    Record-level error CSV output path.

    Results

    The output includes the following files:

    File Name Description
    molecule_summary.csv Molecule-level summary integrating scores from all modules
    epitope_summary.csv Epitope-level summary integrating scores from all modules
    errors.csv Record-level error log summarizing exceptions during processing

    The molecule_summary.csv file contains the following columns:

    Column Description
    molecule Protein molecule name (taken from the Protein ID in FASTA and CSV)
    AlphaMHC_v3.0_score Molecule-level score from the AlphaMHC v3.0 module
    WeAdapt_v4.1_score Molecule-level score from the WeAdapt v4.1 module
    WeAdapt_v4.2_score Molecule-level score from the WeAdapt v4.2 module
    WeAdapt_v4.3_score Molecule-level score from the WeAdapt v4.3 module
    mean_score(v4) Mean of the three WeAdapt version scores (v4.1 / v4.2 / v4.3); AlphaMHC is excluded
    max_score(v4) Maximum of the three WeAdapt version scores
    min_score(v4) Minimum of the three WeAdapt version scores
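
    The three (v4) columns can be reproduced from the per-version scores. The following is a minimal sketch, assuming scores are stored as plain numbers and that empty cells are skipped; how the module itself handles missing values is not documented here, and the example row is hypothetical.

```python
from statistics import mean

# Column names follow the molecule_summary.csv table above.
WEADAPT_COLUMNS = ("WeAdapt_v4.1_score", "WeAdapt_v4.2_score", "WeAdapt_v4.3_score")


def summarize_v4(row):
    """Return mean/max/min over the WeAdapt scores present in one row.

    AlphaMHC_v3.0_score is deliberately ignored, as stated in the table.
    Skipping empty cells is an assumption of this sketch.
    """
    scores = [float(row[c]) for c in WEADAPT_COLUMNS if row.get(c) not in (None, "")]
    if not scores:
        return {"mean_score(v4)": None, "max_score(v4)": None, "min_score(v4)": None}
    return {
        "mean_score(v4)": mean(scores),
        "max_score(v4)": max(scores),
        "min_score(v4)": min(scores),
    }


# Hypothetical row for illustration only.
example = {
    "molecule": "mAb1",
    "AlphaMHC_v3.0_score": "0.90",
    "WeAdapt_v4.1_score": "0.20",
    "WeAdapt_v4.2_score": "0.40",
    "WeAdapt_v4.3_score": "0.60",
}
summary = summarize_v4(example)
```

    Note that the AlphaMHC value of 0.90 has no effect on any of the three statistics.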

    The epitope_summary.csv file contains the following columns:

    Column Description
    molecule Protein molecule name
    chain Sequence ID (chain name)
    epitope_id Epitope identifier, formatted as Epitope_001, incrementing in order of appearance within the molecule
    epitope_position Epitope interval on the sequence, formatted as begin-end (1-based)
    epitope Representative epitope peptide sequence (preferentially taken from the corresponding FASTA subsequence; otherwise the longest peptide in the cluster)
    mean_score(v4) Mean of the three WeAdapt version scores within the cluster (AlphaMHC is excluded)
    max_score(v4) Maximum of the three WeAdapt version scores within the cluster
    min_score(v4) Minimum of the three WeAdapt version scores within the cluster
    AlphaMHC_v3.0_score Highest AlphaMHC v3.0 epitope score within the cluster
    WeAdapt_v4.1_score Highest WeAdapt v4.1 epitope score within the cluster
    WeAdapt_v4.2_score Highest WeAdapt v4.2 epitope score within the cluster
    WeAdapt_v4.3_score Highest WeAdapt v4.3 epitope score within the cluster
    AlphaMHC_v3.0_HLA HLA allele associated with the AlphaMHC v3.0 module (this module has no HLA data, always /)
    WeAdapt_v4.1_HLA HLA allele(s) associated with the WeAdapt v4.1 module, semicolon-separated
    WeAdapt_v4.2_HLA HLA allele(s) associated with the WeAdapt v4.2 module, semicolon-separated
    WeAdapt_v4.3_HLA HLA allele(s) associated with the WeAdapt v4.3 module, semicolon-separated
    overlapping_HLA Intersection of HLA sets across modules (computed only when at least 2 modules have HLA data); / when there is no overlap or insufficient data
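
    The overlapping_HLA rule can be sketched as a set intersection. This is an illustrative reading of the description above, assuming that modules without HLA data (cell value /) are left out of the intersection and that the result is written in sorted, semicolon-separated form; both points are assumptions, and the allele names are hypothetical.

```python
def overlapping_hla(hla_fields):
    """Intersect the HLA sets of the modules that report any alleles.

    hla_fields: raw cell values such as 'DRB1*01:01;DRB1*15:01' or '/'.
    Returns '/' when fewer than two modules have data or nothing overlaps.
    """
    sets = []
    for field in hla_fields:
        alleles = {a.strip() for a in field.split(";")}
        alleles.discard("")
        alleles.discard("/")
        if alleles:
            sets.append(alleles)
    if len(sets) < 2:
        return "/"
    common = set.intersection(*sets)
    return ";".join(sorted(common)) if common else "/"


# AlphaMHC has no HLA data, so its cell is always '/'.
shared = overlapping_hla([
    "/",
    "DRB1*01:01;DRB1*15:01",
    "DRB1*15:01",
    "DRB1*15:01;DRB1*07:01",
])
```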
  • Name: Antibody Numbering v3
    Description: 抗体编号模块,用于注释抗体可变区(Fv)或恒定区(包括 Fc), 支持几乎所有主流的抗体编号规则,如可变区广泛使用的Kabat、Chothia 和 IMGT,以及恒定区主要使用的EU规则。 An antibody numbering module for annotating variable regions (Fv) and constant regions (including Fc). It supports nearly all mainstream numbering schemes, such as Kabat, Chothia, and IMGT, which are widely used for the variable region, and EU, which is the main scheme for the constant region.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-04-23 00:00:00
    Reference:

    Antibody Numbering v3

    简介

    基于 ANARCI 和 mafft 的抗体序列编号工具,支持 FV 和 FC 批量编号。

    Antibody Numbering 是一个抗体序列编号工具,用于将抗体氨基酸序列映射到标准化编号体系。编号后的序列具有统一的位置参照,使得不同抗体之间的同源比对、CDR 精确定位、突变分析等工作成为可能。

    抗体序列的氨基酸残基数因克隆不同而差异较大,直接比较两条原始序列很难确定哪些位置是同源的。编号方案通过为每个残基赋予标准化编号来解决这个问题,使研究人员能够准确识别 CDR 和 FR 的边界。

    适用场景:

    • 抗体工程:精确定位 CDR 区域,指导人源化、亲和力成熟等改造工作。
    • 序列分析:对不同来源的抗体进行同源比较,识别保守位点和突变热点。
    • 数据库标准化:将序列统一到 IMGT 等标准编号体系,便于入库和检索。
    • 质量评估:通过编号结果识别缺失、插入、非典型残基等异常。

    FV 编号使用 ANARCI 引擎,自动识别输入序列中的可变区结构域,支持单条序列中包含多个结构域的情况。编号结果包含每个残基的标准化编号、CDR/FR 区域标注及链类型判定。支持 IMGT、Kabat、Chothia、Martin、AHo、CCG 等方案。

    FC 编号使用 mafft 多序列比对引擎,将输入序列与已知恒定区模板进行比对,通过匹配率判定同种型和亚型。适用于同型鉴定、Fc 工程改造等下游分析。支持 EU 和 Kabat 方案。

    每个编号方案生成独立的 JSON 和 CSV 结果文件。FV 还会生成未覆盖片段 FASTA,FC 还会生成模板匹配率 CSV。summary.jsonl 包含各方案的处理统计,failed.fasta 收集编号失败的序列。

    参数说明

    FV Numbering

    该模式针对抗体的Fv区序列(包括重链 VH 和轻链 VL),通过指定编号规则(如 Kabat、Chothia、或 IMGT等)对氨基酸残基进行标准化编号。

    FASTA File

    上传需要进行抗体编号的氨基酸序列文件。支持批量提交多条序列,文件内容应使用 FASTA 格式。

    Numbering Scheme

    可变区编号规则,支持IMGT、Kabat、Chothia、Martin、AHo、CCG可多选。

    FC Numbering

    通常用于抗体的EU、Kabat标准化编号。

    FASTA File

    上传需要进行抗体恒定区编号的氨基酸序列文件。支持批量提交多条序列,文件内容应使用 FASTA 格式。

    Numbering Scheme

    恒定区编号规则:eu,kabat。默认为eu。

    结果说明

    输出结果包含以下文件:

    文件名 说明
    summary.jsonl 汇总每个编号方案的处理统计,包括成功、未匹配、失败的序列数量
    failed.fasta 保存编号失败的原始序列
    output_{scheme}.json 抗体编号结果文件,json格式,按不同编号方案分别生成(如 Chothia、IMGT、Kabat、Martin),包含 residue 编号、区域标注和链类型等信息
    output_{scheme}.csv 抗体编号结果文件,csv格式,按不同编号方案分别生成(如 Chothia、IMGT、Kabat、Martin),包含 residue 编号、区域标注和链类型等信息
    non_fv_{scheme}.fasta 未被识别为 FV 可变区的剩余片段(仅 FV 编号)
    output_{scheme}_match_rate.csv 输入序列与各 FC 模板的匹配率(仅 FC 编号)

    FV Numbering模式输出的output_{scheme}.csv文件包含信息如下:

    列名 说明
    molecule 抗体链类型(VH = 重链可变区,VL = 轻链可变区)
    residue 氨基酸残基(单字母表示,如 E = 谷氨酸)
    chain_type 链的具体类型(如 VK = κ轻链,VL = λ轻链,VH = 重链)
    species 抗体来源物种(如 human、mouse)
    is_cdr 是否属于 CDR 区(True = CDR,False = 框架区 FR)
    loc 在原始序列中的位置(从1开始计数)
    numbering 抗体编号体系中的位置(如 IMGT/Kabat 编号)
    insertion 插入位点标记(如 A、B;无则为空)
    region 所属区域(FR1、CDR1、FR2、CDR2、FR3、CDR3、FR4)
    domain 所属结构域编号(用于区分多结构域抗体)

    FC Numbering模式输出的output_{scheme}.csv文件包含信息如下:

    列名 含义
    molecule 抗体分子ID
    chain_type 抗体链类型或来源注释,例如 Mouse IgG2a(小鼠IgG2a亚型)
    position EU编号体系中的残基编号(EU index位置)
    region 抗体结构区域标注(如 FR、CDR、hinge 等;“-”表示未归类或非关键区)
    ref_residue 参考序列(template / germline / wild-type)上的氨基酸
    residue 实际观测或目标结构中的氨基酸
    mutation 突变信息(ref → observed)。“-”表示无突变(完全一致)

    FC Numbering模式输出的output_{scheme}_match_rate.csv文件包含信息如下:

    列名 含义
    Chain 抗体链标识
    Template 用于比对的模板类型(如 IgG1_H 表示 IgG1 重链模板)
    MatchRate_CH1 CH1结构域的匹配率(序列或结构相似度)
    MatchRate_Hinge Hinge(铰链区)的匹配率
    MatchRate_CH2 CH2结构域的匹配率
    MatchRate_CH3 CH3结构域的匹配率
    MatchRate_Global 全局匹配率(整体结构/序列相似度)

    Antibody Numbering v3

    Introduction

    An antibody sequence numbering tool based on ANARCI and mafft, supporting batch numbering for FV and FC regions.

    Antibody Numbering is a tool that maps antibody amino acid sequences to standardized numbering schemes. Numbered sequences share a unified positional reference, enabling homologous alignment across different antibodies, precise CDR localization, and mutation analysis.

    The number of amino acid residues in antibody sequences varies widely across clones, making it difficult to identify homologous positions by comparing raw sequences directly. Numbering schemes resolve this by assigning each residue a standardized identifier, allowing researchers to accurately delineate CDR and FR boundaries.

    Use cases:

    • Antibody engineering: Precisely locate CDR regions to guide humanization, affinity maturation, and other modifications.
    • Sequence analysis: Perform homologous comparisons across antibodies from different sources, identifying conserved sites and mutation hotspots.
    • Database standardization: Unify sequences under standard numbering schemes such as IMGT for streamlined archiving and retrieval.
    • Quality assessment: Detect anomalies such as deletions, insertions, and atypical residues from numbering results.

    The FV numbering module uses the ANARCI engine to automatically identify variable-region domains in input sequences, supporting cases where a single sequence contains multiple domains. Results include standardized residue numbering, CDR/FR region annotations, and chain-type classification. Supported schemes include IMGT, Kabat, Chothia, Martin, AHo, and CCG.

    The FC numbering module uses the mafft multiple-sequence-alignment engine to align input sequences against known constant-region templates and determines isotype and subtype from the match rate. It is suited to downstream analyses such as isotype identification and Fc engineering. Supported schemes are EU and Kabat.

    Each numbering scheme generates independent JSON and CSV result files. FV numbering also produces an unassigned-segment FASTA file, and FC numbering produces a template match-rate CSV. summary.jsonl contains per-scheme processing statistics, and failed.fasta collects sequences that failed numbering.

    Parameters

    FV Numbering

    This mode targets Fv-region sequences of antibodies (including heavy chain VH and light chain VL), applying a standardized numbering scheme (e.g., Kabat, Chothia, IMGT) to amino acid residues.

    FASTA File

    Upload the amino acid sequence file for antibody numbering. Batch submission of multiple sequences is supported; file content must be in FASTA format.

    Numbering Scheme

    Variable-region numbering rules. Supports IMGT, Kabat, Chothia, Martin, AHo, and CCG. Multiple selection is allowed.

    FC Numbering

    Commonly used for standardized EU and Kabat numbering of antibody constant regions.

    FASTA File

    Upload the amino acid sequence file for antibody constant-region numbering. Batch submission of multiple sequences is supported; file content must be in FASTA format.

    Numbering Scheme

    Constant-region numbering rules: eu, kabat. Default is eu.

    Results

    Output results include the following files:

    Filename Description
    summary.jsonl Aggregated processing statistics for each numbering scheme, including counts of successful, unmatched, and failed sequences
    failed.fasta Raw sequences that failed numbering
    output_{scheme}.json Antibody numbering results in json format, generated per scheme (e.g., Chothia, IMGT, Kabat, Martin), containing residue numbering, region annotations, and chain-type information
    output_{scheme}.csv Antibody numbering results in csv format, generated per scheme (e.g., Chothia, IMGT, Kabat, Martin), containing residue numbering, region annotations, and chain-type information
    non_fv_{scheme}.fasta Remaining segments not identified as FV variable regions (FV numbering only)
    output_{scheme}_match_rate.csv Match rates between input sequences and each FC template (FC numbering only)

    The output_{scheme}.csv files produced by the FV Numbering mode contain the following columns:

    Column Description
    molecule Antibody chain type (VH = heavy chain variable region, VL = light chain variable region)
    residue Amino acid residue (single-letter code, e.g., E = Glutamic acid)
    chain_type Specific chain type (e.g., VK = κ light chain, VL = λ light chain, VH = heavy chain)
    species Source species of the antibody (e.g., human, mouse)
    is_cdr Whether the residue belongs to a CDR region (True = CDR, False = framework region FR)
    loc Position in the original sequence (1-based index)
    numbering Position in the numbering scheme (e.g., IMGT/Kabat numbering)
    insertion Insertion marker (e.g., A, B; empty if none)
    region Belonging region (FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4)
    domain Domain index (used to distinguish multi-domain antibodies)
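
    As an illustration of how these columns fit together, the sketch below reassembles CDR sequences from an FV output_{scheme}.csv. It assumes the file has a header row with exactly the column names listed above and that is_cdr is written as True/False; the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of an output_imgt.csv file.
SAMPLE = """molecule,residue,chain_type,species,is_cdr,loc,numbering,insertion,region,domain
mAb1_H,G,VH,human,True,26,27,,CDR1,1
mAb1_H,F,VH,human,True,27,28,,CDR1,1
mAb1_H,T,VH,human,True,28,29,,CDR1,1
mAb1_H,W,VH,human,False,36,41,,FR2,1
mAb1_H,A,VH,human,True,97,105,,CDR3,1
mAb1_H,R,VH,human,True,98,106,,CDR3,1
"""


def cdr_sequences(handle):
    """Group CDR residues by (molecule, domain, region) and join them
    in order of their position in the original sequence (loc)."""
    regions = defaultdict(list)
    for row in csv.DictReader(handle):
        if row["is_cdr"].strip().lower() == "true":
            key = (row["molecule"], row["domain"], row["region"])
            regions[key].append((int(row["loc"]), row["residue"]))
    return {key: "".join(res for _, res in sorted(items)) for key, items in regions.items()}


cdrs = cdr_sequences(io.StringIO(SAMPLE))
```

    Grouping by domain as well as molecule keeps the CDRs of multi-domain sequences separate.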

    The output_{scheme}_match_rate.csv files produced by the FC Numbering mode contain the following columns:

    Column Description
    Chain Antibody chain identifier
    Template Template type used for alignment (e.g., IgG1_H indicates an IgG1 heavy chain template)
    MatchRate_CH1 Match rate for the CH1 domain (sequence or structural similarity)
    MatchRate_Hinge Match rate for the Hinge region
    MatchRate_CH2 Match rate for the CH2 domain
    MatchRate_CH3 Match rate for the CH3 domain
    MatchRate_Global Global match rate (overall sequence/structural similarity)
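
    One way to read the match-rate table is to pick, for each chain, the template with the highest global match rate. The sketch below assumes a header row with the column names above and numeric rates; the template names other than IgG1_H and all values are hypothetical, and the module's own isotype decision rule may differ.

```python
import csv
import io

# Hypothetical excerpt of an output_eu_match_rate.csv file.
SAMPLE = """Chain,Template,MatchRate_CH1,MatchRate_Hinge,MatchRate_CH2,MatchRate_CH3,MatchRate_Global
HC1,IgG1_H,0.98,1.00,0.99,0.97,0.98
HC1,IgG2_H,0.91,0.60,0.93,0.95,0.90
HC1,IgG4_H,0.93,0.75,0.94,0.96,0.92
"""


def best_templates(handle):
    """Return {chain: (template, global_rate)} for the top-scoring template."""
    best = {}
    for row in csv.DictReader(handle):
        rate = float(row["MatchRate_Global"])
        chain = row["Chain"]
        if chain not in best or rate > best[chain][1]:
            best[chain] = (row["Template"], rate)
    return best


best = best_templates(io.StringIO(SAMPLE))
```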

    The output_{scheme}.csv files produced by the FC Numbering mode contain the following columns:

    Column Description
    molecule Antibody molecule ID
    chain_type Antibody chain type or source annotation, e.g., Mouse IgG2a subtype
    position Residue position in the EU numbering system (EU index position)
    region Structural region annotation (e.g., FR, CDR, hinge; “-” indicates unassigned or non-critical region)
    ref_residue Amino acid in the reference sequence (template / germline / wild-type)
    residue Amino acid observed in the target or input structure
    mutation Mutation annotation (ref → observed). “-” indicates no mutation (identical residue)
  • Name: Filter Antibody Sequences
    Description: 基于 ANARCI 的抗体序列快速分类工具,将输入 FASTA 文件中的序列自动划分为可编号、不可编号和异常序列三类,并分别输出到独立的 FASTA 文件中。 A rapid antibody sequence classification tool based on ANARCI that automatically partitions sequences from an input FASTA file into three categories: numberable, unnumberable, and invalid, exporting each to separate FASTA files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-04-30 00:00:00
    Reference:

    Filter Antibody Sequences

    简介

    基于 ANARCI 的抗体序列快速分类工具,将输入 FASTA 文件中的序列自动划分为可编号、不可编号和异常序列三类,并分别输出到独立的 FASTA 文件中。

    核心技术

    • ANARCI 编号判定:调用 ANARCI 引擎尝试对每个序列进行编号,根据编号成功与否判定序列类别
    • 三分类输出:自动将序列归类为可编号(numberable)、不可编号(unnumberable)和异常(invalid)
    • 并行处理:支持多核并行加速,提升大批量序列处理效率
    • 多编号方案:内置 IMGT、Kabat、Chothia、Martin 等主流编号方案

    适用场景

    • 批量序列质控:快速筛选出可被标准编号体系识别的抗体序列
    • 数据清洗:从混合序列集中分离异常或低质量序列
    • 下游分析预处理:为后续的抗体编号、人源化、CDR 分析等流程准备合格的输入数据

    参数说明

    Input FASTA

    输入的抗体氨基酸序列文件,需为标准 FASTA 格式,支持单条或多条序列。
    注意:仅包含完整或可识别 Fv(可变区)结构域的序列才能被 ANARCI 正确编号。

    Numbering Scheme

    ANARCI 编号方案,用于判定序列是否可被识别为抗体可变区并进行编号。

    • IMGT:国际免疫基因组学标准,最常用
    • Kabat:基于序列变异性定义
    • Chothia:基于结构环区定义
    • Martin:基于 Kabat 的修订版本

    默认使用 IMGT。

    Numberable Output

    可编号序列(即成功识别为 Fv 区域的序列)的输出文件路径。
    这些序列包含可解析的抗体可变区结构域,可被 ANARCI 成功编号,并适用于下游分析(如 CDR 定位、人源化等)。
    默认输出文件为 numberable.fasta。

    Unnumberable Output

    不可编号序列的输出文件路径。
    这些序列不包含可识别的 Fv 区域,或与标准抗体可变区差异过大,因此无法被 ANARCI 识别和编号。
    默认输出文件为 unnumberable.fasta。

    Invalid Output

    异常序列的输出文件路径。
    这些序列存在格式错误(如 FASTA 不规范)、包含非标准氨基酸,或其他导致无法解析的问题。
    默认输出文件为 invalid.fasta。

    输出结果包括以下文件:

    文件名 说明
    numberable.fasta 包含可被 ANARCI 识别为 Fv 区域并成功编号的序列,可直接用于下游编号与抗体工程分析
    unnumberable.fasta 不包含可识别 Fv 区域或偏离标准抗体结构的序列,无法进行编号
    invalid.fasta 输入异常序列,包括格式错误或非法字符等,未参与编号流程

    Filter Antibody Sequences

    Introduction

    A rapid antibody sequence classification tool based on ANARCI that automatically partitions sequences from an input FASTA file into three categories: numberable, unnumberable, and invalid, exporting each to separate FASTA files.

    Core Technologies

    • ANARCI Numbering Assessment: Calls the ANARCI engine to attempt numbering on each sequence, classifying based on success or failure
    • Three-Class Output: Automatically categorizes sequences as numberable, unnumberable, or invalid
    • Parallel Processing: Supports multi-core parallel acceleration for improved throughput on large batch jobs
    • Multiple Numbering Schemes: Built-in support for mainstream schemes including IMGT, Kabat, Chothia, and Martin

    Use Cases

    • Batch sequence quality control: Quickly filter antibody sequences recognizable by standard numbering systems
    • Data cleaning: Separate abnormal or low-quality sequences from mixed sequence sets
    • Downstream analysis preprocessing: Prepare qualified input data for subsequent antibody numbering, humanization, and CDR analysis workflows

    Parameters

    Input FASTA

    Input antibody amino acid sequence file in standard FASTA format, supporting single or multiple sequences.

    Note: Only sequences containing complete or recognizable Fv (variable region) domains can be correctly numbered by ANARCI.

    Numbering Scheme

    ANARCI numbering scheme used to determine whether a sequence can be recognized as an antibody variable region and subsequently numbered.

    • IMGT: International ImMunoGeneTics information system standard, most commonly used
    • Kabat: Based on sequence variability definitions
    • Chothia: Based on structural loop definitions
    • Martin: Revised version based on Kabat

    Default: IMGT.

    Numberable Output

    Output file path for numberable sequences (i.e., sequences successfully identified as Fv regions).

    These sequences contain parseable antibody variable region domains that can be successfully numbered by ANARCI, and are suitable for downstream analyses (e.g., CDR localization, humanization, etc.).

    Default output file: numberable.fasta.

    Unnumberable Output

    Output file path for unnumberable sequences.

    These sequences do not contain recognizable Fv regions or deviate too far from standard antibody variable regions, and therefore cannot be recognized or numbered by ANARCI.

    Default output file: unnumberable.fasta.

    Invalid Output

    Output file path for invalid sequences.

    These sequences have formatting errors (e.g., non-standard FASTA), contain non-standard amino acids, or have other issues that prevent parsing.

    Default output file: invalid.fasta.

    Results

    The output includes the following files:

    Filename Description
    numberable.fasta Sequences recognized by ANARCI as Fv regions and successfully numbered; ready for downstream numbering and antibody engineering analysis
    unnumberable.fasta Sequences without recognizable Fv regions or that deviate from standard antibody structures; cannot be numbered
    invalid.fasta Abnormal input sequences, including format errors or illegal characters; excluded from the numbering workflow
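
    The three-way split can be sketched as follows. This is not the module's implementation: the ANARCI call is replaced by a hypothetical can_number callback (here a toy length rule), and the validity check simply rejects empty sequences and anything outside the 20 standard amino acids.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def triage(records, can_number):
    """Sort record headers into numberable / unnumberable / invalid."""
    buckets = {"numberable": [], "unnumberable": [], "invalid": []}
    for header, seq in records:
        seq = seq.upper()
        if not seq or set(seq) - STANDARD_AA:
            buckets["invalid"].append(header)
        elif can_number(seq):
            buckets["numberable"].append(header)
        else:
            buckets["unnumberable"].append(header)
    return buckets


def toy_can_number(seq):
    """Hypothetical stand-in for the ANARCI numbering attempt."""
    return len(seq) >= 100


SAMPLE_FASTA = ">seq_ok\n" + "ACDEFGHIKL" * 11 + "\n>seq_short\nACDEFG\n>seq_bad\nACDXZ1\n"
buckets = triage(parse_fasta(SAMPLE_FASTA), toy_can_number)
```

    Checking validity before attempting numbering matches the description that invalid sequences are excluded from the numbering workflow.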
  • Name: Split Antibody Chain
    Description: 将输入的抗体序列分割成轻、重链文件 Split the input antibody sequence into light chain and heavy chain files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-04-24 00:00:00
    Reference:

    Split Antibody Chain

    简介

    Split Antibody Chain 是一个用于拆分抗体链的工具,能够将混合的抗体序列分离为重链、轻链和非抗体序列。

    核心思想
    本项目采用 基于抗体编号方案的链分类 策略:

    • 重链识别 根据指定的编号方案识别并输出抗体重链序列
    • 轻链识别 根据指定的编号方案识别并输出抗体轻链序列
    • 非抗体过滤 识别并分离不符合抗体特征的序列

    该流程以"基于 IMGT/Kabat/Chothia 编号方案的抗体链分类"为核心,实现抗体序列的自动化拆分和分类功能。

    参数说明

    Input File

    输入文件路径,FASTA 格式,为必选参数。
    注意:仅包含完整或可识别 Fv(可变区)结构域的序列才能被 ANARCI 识别为抗体重链和轻链。

    Numbering Scheme

    抗体编号方案,可选值包括 imgt、kabat 或 chothia。该方案用于链分类的标准依据。

    Heavy Chain

    输出包含抗体重链序列的文件名称

    Light Chain

    输出包含抗体轻链序列的文件名称

    Non-Antibody

    输出包含非抗体序列的文件名称

    结果说明

    输出结果包括以下 FASTA 格式文件:

    输出文件名称 说明
    heavy_chain.fasta 按照指定编号方案识别的重链序列
    light_chain.fasta 按照指定编号方案识别的轻链序列
    non_antibody.fasta 未识别为抗体的序列

    所有输出文件均为 FASTA 格式,每条记录包含序列标识符和氨基酸序列。

    Split Antibody Chain

    Introduction

    Split Antibody Chain is a tool for splitting mixed antibody sequences into heavy chains, light chains, and non-antibody sequences.

    Core concept

    This tool adopts a numbering-scheme-based chain classification strategy:

    • Heavy chain recognition: Identifies and outputs antibody heavy chain sequences according to the specified numbering scheme.
    • Light chain recognition: Identifies and outputs antibody light chain sequences according to the specified numbering scheme.
    • Non-antibody filtering: Identifies and separates sequences that do not meet antibody characteristics.

    The workflow centers on “antibody chain classification based on IMGT/Kabat/Chothia numbering schemes,” achieving automated splitting and classification of antibody sequences.

    Parameters

    Input File

    Input file path in FASTA format. Required.

    Note: Only sequences containing complete or recognizable Fv (variable region) domains can be recognized by ANARCI as antibody heavy or light chains.

    Numbering Scheme

    Antibody numbering scheme. Supported values: imgt, kabat, or chothia. This scheme serves as the standard basis for chain classification.

    Heavy Chain

    Output filename for sequences identified as antibody heavy chains.

    Light Chain

    Output filename for sequences identified as antibody light chains.

    Non-Antibody

    Output filename for sequences identified as non-antibody sequences.

    Results

    Output files include the following FASTA-format files:

    Output Filename Description
    heavy_chain.fasta Heavy chain sequences identified according to the specified numbering scheme.
    light_chain.fasta Light chain sequences identified according to the specified numbering scheme.
    non_antibody.fasta Sequences not recognized as antibodies.

    All output files are in FASTA format; each record contains a sequence identifier and the amino acid sequence.
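
    The routing into the three output files can be sketched as below. The classify callback is a hypothetical stand-in for the numbering-based chain assignment; the sketch assumes it returns "H" for heavy chains, "K" or "L" for kappa or lambda light chains, and None otherwise.

```python
def split_chains(records, classify):
    """Route (header, sequence) records into the three output files."""
    outputs = {"heavy_chain.fasta": [], "light_chain.fasta": [], "non_antibody.fasta": []}
    for header, seq in records:
        chain = classify(seq)
        if chain == "H":
            outputs["heavy_chain.fasta"].append((header, seq))
        elif chain in ("K", "L"):
            outputs["light_chain.fasta"].append((header, seq))
        else:
            outputs["non_antibody.fasta"].append((header, seq))
    return outputs


def to_fasta(records):
    """Render records as FASTA text, one header line and one sequence line each."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Toy lookup standing in for the real chain classifier (hypothetical data).
TOY_CHAIN_TYPES = {"EVQLVESGG": "H", "DIQMTQSPS": "K"}
records = [("ab1_vh", "EVQLVESGG"), ("ab1_vl", "DIQMTQSPS"), ("other", "MKTAYIAKQ")]
outputs = split_chains(records, TOY_CHAIN_TYPES.get)
heavy_text = to_fasta(outputs["heavy_chain.fasta"])
```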

  • Name: Small Molecule PK Predictor
    Description: 基于 MolPK 模型的药代动力学(PK)参数批量预测工具。利用预训练深度学习模型,从小分子结构(SMILES)及实验条件(物种、给药途径、剂量)预测 PK 参数 A batch pharmacokinetic (PK) parameter prediction tool based on the MolPK model. It uses a pretrained deep learning model to predict PK parameters from small-molecule structures (SMILES) and experimental conditions (species, administration route, dose).
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-04-30 00:00:00
    Reference:

    Small Molecule PK Predictor

    简介

    基于 MolPK 模型的药代动力学(PK)参数批量预测工具。利用预训练深度学习模型,从小分子结构(SMILES)及实验条件(物种、给药途径、剂量)预测 PK 参数,支持多种输入格式和灵活的批处理场景。

    核心技术

    • MolPK 深度学习模型:基于分子表征学习预测药代动力学参数
    • 多格式输入:支持 SMILES 文本、CSV 表格、SDF 结构文件
    • 灵活批处理:支持单条预测和批量预测,自动识别 CSV 列名
    • 双格式输出:同时输出 CSV 结果表和带 PK 属性的 SDF 结构文件

    适用场景

    • 药物筛选早期:快速评估候选分子的药代动力学性质
    • 批量预测:对大型化合物库进行高通量 PK 参数预测
    • 实验设计辅助:为体内实验提供剂量和物种选择的参考依据

    参数说明

    Input

    输入的待预测文件,支持 .smi(SMILES 文本)、.csv(表格)或 .sdf(结构文件)格式。为必填参数。

    Species

    实验物种,可选值为 rat(大鼠)、mou(小鼠)、dog(犬)、hum(人)。用于指定 PK 预测对应的物种背景。

    Route

    给药途径,可选值为 iv(静脉注射)、po(口服)。不同给药途径对 PK 曲线有显著影响。

    Dose

    给药剂量,单位为 mg/kg。用于指定预测时对应的剂量条件。

    Output

    预测结果的 CSV 输出路径。默认输出为 pred_pk_value.csv。

    Output SDF

    带 PK 预测属性的 SDF 结构文件输出路径。默认输出为 pred_with_pk_value.sdf。

    SMILES Column

    当输入为 CSV 文件时,指定包含 SMILES 字符串的列名。

    Species Column

    当输入为 CSV 文件时,指定包含物种信息的列名。

    Route Column

    当输入为 CSV 文件时,指定包含给药途径信息的列名。

    Dose Column

    当输入为 CSV 文件时,指定包含剂量信息的列名。

    结果说明

    输出结果包括:

    文件名 说明
    pred_pk_value.csv 预测的 PK 参数表格,包含每个分子的预测值及输入条件
    pred_with_pk_value.sdf 带 PK 预测属性的分子结构文件,可直接用于结构查看和进一步分析

    输出的预测结果文件 pred_pk_value.csv 包含信息如下:

    列名 说明
    _smi_line 原始输入的 SMILES 行字符串(通常包含分子结构及附加标识信息)
    SMILES 分子的标准 SMILES 表示,用于描述化学结构
    Species 实验物种(如 human、mouse、rat 等)
    Route 给药途径(如 IV、PO 等)
    Dose (mg/kg) 给药剂量,单位为 mg/kg
    CL (mL/min/kg) 清除率(Clearance),单位为 mL/min/kg,表示单位时间内药物从体内被清除的能力
    Vd (L/kg) 表观分布容积(Volume of distribution),单位为 L/kg,反映药物在体内的分布范围
    AUC (ng·h/mL) 曲线下面积(Area Under the Curve),单位为 ng·h/mL,表示药物暴露量
    T1/2 (h) 半衰期(Half-life),单位为小时,表示药物浓度降低一半所需时间

    Small Molecule PK Predictor

    Introduction

    A batch pharmacokinetic (PK) parameter prediction tool based on the MolPK model. It uses a pretrained deep learning model to predict PK parameters from small-molecule structures (SMILES) and experimental conditions (species, administration route, dose), and supports multiple input formats and flexible batch processing scenarios.

    Core Technologies

    • MolPK Deep Learning Model: Predicts pharmacokinetic parameters based on molecular representation learning
    • Multi-format Input: Supports SMILES text, CSV tables, and SDF structure files
    • Flexible Batch Processing: Supports both single and batch prediction with automatic CSV column detection
    • Dual-format Output: Simultaneously outputs CSV result tables and SDF structure files with PK attributes

    Use Cases

    • Early drug screening: Rapidly evaluate pharmacokinetic properties of candidate molecules
    • Batch prediction: High-throughput PK parameter prediction for large compound libraries
    • Experimental design support: Provides reference for dose and species selection in in vivo studies

    Parameters

    Input

    Input file for prediction, supporting .smi (SMILES text), .csv (table), or .sdf (structure file) formats. This is a required parameter.

    Species

    Experimental species, with options rat, mou (mouse), dog, or hum (human). Specifies the species background for PK prediction.

    Route

    Administration route, with options iv (intravenous) or po (oral). Different routes significantly affect PK profiles.

    Dose

    Administration dose in mg/kg. Specifies the dose condition for prediction.

    Output

    Output CSV path for prediction results. Default: pred_pk_value.csv.

    Output SDF

    Output SDF structure file path with PK prediction attributes. Default: pred_with_pk_value.sdf.

    SMILES Column

    When input is a CSV file, specifies the column name containing SMILES strings.

    Species Column

    When input is a CSV file, specifies the column name containing species information.

    Route Column

    When input is a CSV file, specifies the column name containing administration route information.

    Dose Column

    When input is a CSV file, specifies the column name containing dose information.

    Results

    The output includes the following files:

    File Name Description
    pred_pk_value.csv Predicted PK parameter table containing predicted values and input conditions for each molecule
    pred_with_pk_value.sdf Molecular structure file with PK prediction attributes, suitable for structure viewing and further analysis

    The pred_pk_value.csv file contains the following columns:

    Column Name Description
    _smi_line Original input SMILES line (may include structure and additional identifiers)
    SMILES Standard SMILES representation of the molecule
    Species Experimental species (e.g., human, mouse, rat)
    Route Administration route (e.g., IV, PO)
    Dose (mg/kg) Administered dose in mg/kg
    CL (mL/min/kg) Clearance, expressed in mL/min/kg, indicating the rate of drug elimination
    Vd (L/kg) Volume of distribution, in L/kg, reflecting the extent of drug distribution in the body
    AUC (ng·h/mL) Area Under the Curve, in ng·h/mL, representing overall drug exposure
    T1/2 (h) Half-life, in hours, indicating the time required for the drug concentration to decrease by half
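
    The units in this table are related through standard one-compartment pharmacokinetic relations, which can serve as a rough plausibility check on a result row. The sketch below is such a check under that one-compartment assumption (and, for AUC, an IV bolus dose); it is not a description of how the MolPK model computes its predictions, which need not satisfy these relations exactly.

```python
import math


def half_life_h(cl_ml_min_kg, vd_l_kg):
    """t1/2 = ln(2) * Vd / CL.

    Vd is converted from L/kg to mL/kg and CL from mL/min/kg to mL/h/kg,
    so the result is in hours.
    """
    return math.log(2) * (vd_l_kg * 1000.0) / (cl_ml_min_kg * 60.0)


def auc_iv_ng_h_ml(dose_mg_kg, cl_ml_min_kg):
    """AUC(0-inf) = Dose / CL for an IV bolus, with 1 mg = 1e6 ng."""
    return dose_mg_kg * 1e6 / (cl_ml_min_kg * 60.0)


# Example: CL = 20 mL/min/kg, Vd = 2 L/kg, IV dose = 1 mg/kg.
t_half = half_life_h(20.0, 2.0)
auc = auc_iv_ng_h_ml(1.0, 20.0)
```

    With these inputs the half-life is about 1.16 h and the AUC about 833 ng·h/mL.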
  • Name: Protein Evolution
    Description: 蛋白进化分析,快速找到能够协同作用的多重突变组合,基于MULTI-evolve实现。 Protein evolution analysis for rapidly identifying synergistic multi-site mutation combinations, based on the MULTI-evolve framework.
    Tags: undefined
    Author: Vincent Q. Tran
    Release: 2026-01-19 00:00:00
    Reference: Vincent Q. Tran et al., Rapid directed evolution guided by protein language models and epistatic interactions. [DOI:10.1126/science.aea1820](https://doi.org/10.1126/science.aea1820)

    Protein Evolution

    简介

    蛋白进化分析,快速找到能够协同作用的多重突变组合。基于MULTI-evolve框架实现,面向蛋白工程中的候选突变发现与组合优化,提供单点突变与多点突变两种工作模式:前者利用蛋白语言模型进行零样本评估,快速发现潜在有利的单点突变;后者基于实验测得的突变数据训练监督模型,并在候选突变池上进一步搜索高阶组合突变。该流程将蛋白语言模型、表观互作(epistasis)建模和后续实验构建衔接为一套端到端方案;其中单点突变部分实际整合了 5 个 ESM-1v 模型(esm1v_t33_650M_UR90S_1-5)、1 个 ESM-2 3B 模型(esm2_t36_3B_UR50D),以及结构感知的 ESM-IF1,多点突变部分则以全连接神经网络为核心预测器来学习序列与性质之间的映射。

    使用流程:
    1,计算步骤,先使用单点突变模式,获取优势单点突变(一般选择排名靠前的15-20个)
    2,湿实验步骤,对第一步选择的单点突变,及其所有两点突变的组合(100~200个组合),进行湿实验验证,获取突变对应的湿实验数据,请使用性质数据的比值(Fold-Change,FC值),即: 突变后的性质/野生型的性质。
    3,计算步骤,使用多点突变模式,输入第二步的湿实验结果,进行模型训练,并预测多点突变组合对应的FC值,给出推荐的优势多点突变组合。
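
    The size of the wet-lab panel in step 2 follows from the number of selected single mutations: 15 singles give C(15,2) = 105 pairs and 20 singles give C(20,2) = 190 pairs, which matches the stated range of 100 to 200 combinations. The sketch below illustrates this count and the fold-change (FC) ratio; it assumes the selected singles sit at distinct sites, and the property values are hypothetical.

```python
from itertools import combinations
from math import comb


def fold_change(mutant_value, wild_type_value):
    """FC = property of the mutant / property of the wild type."""
    if wild_type_value == 0:
        raise ValueError("wild-type value must be non-zero")
    return mutant_value / wild_type_value


def pairwise_designs(singles):
    """All two-point combinations, written in the ';'-separated form
    used by the training data (e.g. 'GA48R;DB106A')."""
    return [";".join(pair) for pair in combinations(singles, 2)]


pairs = pairwise_designs(["TA192V", "AC167R", "NA72A"])
panel_sizes = (comb(15, 2), comb(20, 2))
fc = fold_change(150.0, 100.0)  # hypothetical measurements
```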

    参数说明

    Single Point Mutation

    利用多个蛋白语言模型对蛋白单点突变的潜在效应进行突变概率预测,帮助研究者高效筛选更有希望进入后续实验验证的候选单点突变。模块并行提供 4 种筛选策略:ESM、ESM-IF、ESM-z 和 ESM-IF-z。突变位置从1开始按残基顺序编号。

    • 在 ESM 筛选中,每个 ESM 序列子模型都会在野生型序列背景下,分别计算目标位点上突变氨基酸与野生型氨基酸的条件概率,取对数后作差,得到该子模型对该单点突变的原始分数;随后再对所有序列子模型的分数取平均,作为最终的 ESM 综合得分。
    • 在 ESM-IF 筛选中,模型会结合输入的蛋白结构信息,对每个结构分别计算目标位点上突变残基与野生型残基的结构条件打分,并以两者差值作为该结构下的原始分数;当输入多个结构文件或多个构象时,再对各结构得到的分数取平均,作为最终的 ESM-IF 综合得分。
    • ESM-z 和 ESM-IF-z 则是在对应原始得分的基础上,进一步进行 z-score 标准化处理,使不同突变位置之间的分数更便于横向比较与排序。
      注意: z-score指的是一种标准化方法,模块提供了两种标准化方法,由Normalization控制。
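
    The ESM score described above is a log-probability difference between the mutant and wild-type residue at a site, averaged over the sequence sub-models. The sketch below shows only this arithmetic; the per-model probabilities are hypothetical numbers, and obtaining them from the actual ESM models is outside its scope.

```python
import math


def log_ratio(p_mut, p_wt):
    """Raw score of one sub-model: log p(mutant) - log p(wild type)."""
    return math.log(p_mut) - math.log(p_wt)


def ensemble_score(per_model_probs, wt, mut):
    """Average the raw scores over all sub-models.

    per_model_probs: one dict per sub-model mapping amino acid -> conditional
    probability at the target site, given the wild-type sequence.
    """
    scores = [log_ratio(probs[mut], probs[wt]) for probs in per_model_probs]
    return sum(scores) / len(scores)


# Hypothetical probabilities at one site for the mutation F -> L.
models = [{"F": 0.50, "L": 0.25}, {"F": 0.40, "L": 0.40}]
score = ensemble_score(models, wt="F", mut="L")
```

    A positive score means the models consider the mutant residue more likely than the wild type at that site; the ESM-IF score averages over structures in the same way.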

    Structure

    输入蛋白结构文件,支持 PDB 或 CIF 格式,用于结构模型评分。支持输入同一结构的批量构象(需压缩文件格式,支持:.zip,.tar, .tar.gz, .tgz,.tar.bz2, .tbz2,.tar.xz, .txz),模块会分别计算每个构象中的突变评分,再取不同构象的平均值,以降低单一构象带来的偏差。

    Chain

    指定链名,进行单点突变推荐,多链时用逗号分隔,如A,B。如果不指定该参数,则对结构中的每条链都会进行单点突变推荐。

    TopN

    设置每种集成方法对每条链推荐的候选单点突变数量,默认20。

    Positions Excluded

    需排除的突变位点的位置,使用链名+残基位置编号(从1开始按顺序),如:A100表示A链中位置顺序编号100的残基进行排除。多位置时使用逗号分隔,支持范围符号,例如:A10-20,A25,B30-36,B40表示:排除A链编号10至20、25的残基,B链编号30至36、40的残基。
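
    The Positions Excluded syntax can be parsed as sketched below. The sketch assumes single-letter chain names and inclusive ranges; the function name is hypothetical and this is not the module's own parser.

```python
import re

# chain letter, start position, optional '-end' position
_TOKEN = re.compile(r"^([A-Za-z])(\d+)(?:-(\d+))?$")


def parse_excluded(spec):
    """Parse e.g. 'A10-20,A25,B30-36,B40' into {chain: set of positions}."""
    excluded = {}
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"cannot parse excluded position: {token!r}")
        chain = match.group(1)
        start = int(match.group(2))
        end = int(match.group(3) or match.group(2))
        if end < start:
            raise ValueError(f"range end before start: {token!r}")
        excluded.setdefault(chain, set()).update(range(start, end + 1))
    return excluded


excluded = parse_excluded("A10-20,A25,B30-36,B40")
```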

    Normalization

    z-score 标准化的分组方式,可选 aa_substitution_type 和 aa_mutation,默认为:aa_substitution_type。
    两种方法说明如下:

    aa_substitution_type :按具体替换类型分组标准化。例如所有突变位置中, A→L的突变单独作为一组(如:A10L,A35L,A128L),所有G→V的突变为另一组;该方式更关注“从哪种氨基酸变成哪种氨基酸”。
    aa_mutation : 按突变后的目标氨基酸分组标准化。例如 A10P、G25P、L80P 都会归到 P 这一组;该方式更关注“最终变成了什么氨基酸”。

    Single Point Mutation File

    指定输出结果csv文件的名称。默认:SP_Mutation.csv

    Multiple Point Mutations

    基于实验数据训练预测模型,对候选突变进行自动筛选与组合,生成可用于实验验证的优势多点突变方案。该模式的典型使用场景是:先通过单点或双点突变实验获得一定规模的功能数据,再训练模型预测更高阶组合突变(通常为 >=3 位点)的潜在表现。

    Structure

    输入蛋白结构文件,支持 PDB 或 CIF 格式。

    Training Data

    输入.csv格式文件,CSV必须包含以下列:
    mutation :指定结构中的突变信息,使用原始残基+链名+残基位置编号(从1开始按顺序)+突变后的残基,如:KA100N表示A链中位置顺序编号100的残基K,突变为N。多点突变时用分号分隔,如:GA48R;DB106A
    property : 突变对应的性质变化倍数,即性质数据的比值(Fold-Change,FC值),即: 突变后的性质/野生型的性质。
    注意:
    1.突变样本数量需要大于20条
    2.模块会对输入内容进行检查;若存在数据错误,请查看 stderr.txt。

    Single Mutations

    用于进行多点组合突变的单点突变文件,同样使用原始残基+链名+残基位置编号(从1开始按顺序)+突变后的残基,输入格式如下:

    TA192V
    TB192K
    AC167R
    NA72A
    

    注意:
    1.如果不指定该参数,默认会将训练数据中的所有单点突变,进行组合,然后预测推荐。
    2.模块会对输入内容进行检查;若存在数据错误,请查看 stderr.txt。

    Top variants

    指定为每类组合突变推荐的TopN数量,默认为:3,即:三点组合突变推荐3个,四点组合突变推荐3个,五点组合突变推荐3个,…,最多推荐十点组合突变。

    Multiple Point Mutation File

    指定输出结果csv文件的名称。默认:MP_Mutation.csv

    结果说明

    单点突变模式下,结果输出SP_Mutation.csv,内容如下:

    Chain ID Mutations ESM ESM-IF ESM-z ESM-IF-z Count
    A F26L 1 0 1 1 3
    A A167R 1 0 1 0 2
    A A250D 0 1 0 1 2
    …

    说明:

    字段 说明
    Chain ID 当前推荐突变所属链 ID
    Mutations 单点突变名称,格式通常为“野生型氨基酸 + 位点 + 突变后氨基酸”,如 F26L 表示第 26 位(从1开始的位置顺序编号)由 F 突变为 L
    ESM 是否被 ESM 方法推荐,1 表示是,0 表示否
    ESM-IF 是否被 ESM-IF 方法推荐,1 表示是,0 表示否
    ESM-z 是否被 ESM-z 方法推荐,1 表示是,0 表示否
    ESM-IF-z 是否被 ESM-IF-z 方法推荐,1 表示是,0 表示否
    Count 该突变被多少种方法共同推荐,为各方法标记值之和
    • 模块基于ESM、ESM-IF、ESM-z 和 ESM-IF-z 4 种推荐方法对饱和单点突变进行筛选,每种推荐方法均按照对应的打分规则对候选突变进行排序,并依次选取前TopN个且位点不重复的突变作为推荐结果;被推荐的突变在对应列中记为1,未被推荐则记0

    在多点突变模式下,结果输出MP_Mutation.csv,结果内容如下:

    Variant ID Chain ID Mutations Number Mutations Sequence Average
    399 A 3 N72A/A167R/T192K MGKSYPTVSADYQDAVEKAKKKLRGFIAEKRCAPLMLRLAFHSAGTFDKGTKTGGPFGTIKHPAELAHSANAGLDIAVRLLEPLKAEFPILSYADFYQLAGVVAVEVTGGPKVPFHPGREDKPEPPPEGRLPDATKGSDHLRDVFGKAMGLTDQDIVALSGGHTIGRAHKERSGFEGPWTSNPLIFDNSYFKELLSGEKEGLLQLPSDKALLSDPVFRPLVDKYAADEDAFFADYAEAHQKLSELGFADA 0.7711919
    405 A 3 A167R/T192K/D222E MGKSYPTVSADYQDAVEKAKKKLRGFIAEKRCAPLMLRLAFHSAGTFDKGTKTGGPFGTIKHPAELAHSANNGLDIAVRLLEPLKAEFPILSYADFYQLAGVVAVEVTGGPKVPFHPGREDKPEPPPEGRLPDATKGSDHLRDVFGKAMGLTDQDIVALSGGHTIGRAHKERSGFEGPWTSNPLIFDNSYFKELLSGEKEGLLQLPSDKALLSDPVFRPLVEKYAADEDAFFADYAEAHQKLSELGFADA 0.754778
    201 A 4 L11Q/A40P/S63A/T116L EVQLVESGGGQVQPGGSLRLSCAASGFTFSDFYMEWVRQPPGKGLEWIAASRNKANDYTTEYAASVKGRFIVSRDDSKNSLYLQMNSLKTEDTAVYYCARSYYRYDGMDYWGQGTLVTVSS:EIVLTQSPATLSLSPGERATLSCSAISSVSYMYWYQQKPGQAPRLLIYDTSNLVSGVPARFSGSGSGTDYTLTISSLEPEDFAVYYCQQWNTYPYTFGGGTKVEIK 0.63438326
    460 A;B 4;1 Q13P/A40P/S63A/T116L;I105L EVQLVESGGGLVPPGGSLRLSCAASGFTFSDFYMEWVRQPPGKGLEWIAASRNKANDYTTEYAASVKGRFIVSRDDSKNSLYLQMNSLKTEDTAVYYCARSYYRYDGMDYWGQGTLVTVSS:EIVLTQSPATLSLSPGERATLSCSAISSVSYMYWYQQKPGQAPRLLIYDTSNLVSGVPARFSGSGSGTDYTLTISSLEPEDFAVYYCQQWNTYPYTFGGGTKVELK 0.67288095
    …

    说明:

    字段 说明
    Variant ID 候选变体编号,与 all 结果文件中的编号一致。all 结果将包含在结果打包文件中
    Chain ID 当前结果中实际发生突变的链 ID;单链或仅单条链发生突变时为单个链名,如 A;多条链同时突变时按字母顺序使用分号 ; 分隔
    Mutations Number 突变数量;仅单条链发生突变时为单个数字;多条链同时突变时按链顺序使用分号 ; 分隔
    Mutations 突变信息;链内多个突变使用 / 分隔;多条链同时突变时使用分号 ; 连接各链突变信息
    Sequence 被筛选变体对应的氨基酸序列;多链情况下按链顺序使用冒号 : 分隔
    Average 被筛选变体的综合平均预测得分,数值越高表示该变体预测表现越优

    同时,输出 MP_Mutation.tar.gz,其中包含最终合并结果 CSV。压缩包内包含以下文件:

    • MP_Mutation.csv
    • MP_Mutation_all.csv

    其中,MP_Mutation_all.csv 为全部筛选变体的完整结果文件。

    参考文献

    • Vincent Q. Tran et al., Rapid directed evolution guided by protein language models and epistatic interactions. DOI:10.1126/science.aea1820

    Protein Evolution

    Introduction

    Protein evolution analysis for rapidly identifying synergistic multi-site mutation combinations, based on the MULTI-evolve framework. This module is designed for candidate mutation discovery and combinatorial optimization in protein engineering. It provides two working modes: single-point mutation and multi-point mutation.

    The single-point mutation mode uses protein language models for zero-shot evaluation to rapidly identify potentially beneficial single mutations. The multi-point mutation mode trains a supervised model using experimentally measured mutation data and further searches for higher-order combinatorial mutations within the candidate mutation pool.

    This workflow integrates protein language models, epistasis modeling, and experimental validation into an end-to-end pipeline. The single-point mutation module integrates five ESM-1v models (esm1v_t33_650M_UR90S_1-5), one ESM-2 3B model (esm2_t36_3B_UR50D), and structure-aware ESM-IF1. The multi-point mutation module uses a fully connected neural network as the core predictor to learn the mapping between sequence and functional properties.

    image.png

    Workflow:

    1. Computational step: Use single-point mutation mode to obtain advantageous single mutations (typically selecting the top 15–20 candidates).
    2. Experimental step: Perform wet-lab validation on selected single mutations and their pairwise combinations (100–200 variants). Measure experimental properties and compute Fold-Change (FC), defined as: mutant property / wild-type property.
    3. Computational step: Use multi-point mutation mode with experimental data from step 2 to train the model and predict FC values for higher-order mutation combinations, then identify optimal multi-point variants.
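
The numbers in the workflow can be checked with a little arithmetic. The sketch below is illustrative only and is not part of the module: `pairwise_count` and `fold_change` are hypothetical helper names. It shows that 15 to 20 selected single mutations yield roughly the 100 to 200 pairwise combinations mentioned in step 2 (an upper bound, since two substitutions at the same position cannot be combined), and how an FC value is derived from raw measurements.

```python
from math import comb


def pairwise_count(n_singles: int) -> int:
    """Upper bound on the number of double mutants built from n single mutations."""
    return comb(n_singles, 2)


def fold_change(mutant_value: float, wild_type_value: float) -> float:
    """Fold-Change (FC) = mutant property / wild-type property."""
    if wild_type_value == 0:
        raise ValueError("wild-type property must be non-zero")
    return mutant_value / wild_type_value


for n in (15, 20):
    print(n, "singles ->", pairwise_count(n), "pairs")

# Hypothetical measurements: the mutant reads 3.0, the wild type reads 2.0.
print(fold_change(3.0, 2.0))
```

An FC above 1 means the mutant outperforms the wild type on the measured property; an FC below 1 means it performs worse.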

    Parameters

    Single Point Mutation

    This module uses multiple protein language models to predict the potential effects of single-point mutations, enabling efficient screening of promising candidates for experimental validation. Four screening strategies are provided: ESM, ESM-IF, ESM-z, and ESM-IF-z. Residue indexing starts from 1.

    • In the ESM strategy, each ESM sequence sub-model computes, in the wild-type sequence context, the conditional probabilities of the mutant and the wild-type amino acid at the target position. The difference of their logarithms is that sub-model's raw score for the mutation, and the final ESM score is the average of the raw scores across all sub-models.
    • In the ESM-IF strategy, structure information is incorporated. For each structure, a structural conditional score is computed for the mutation and wild type at the target position. The difference is used as the raw score. If multiple structures or conformations are provided, the final ESM-IF score is the average across all structures.
    • ESM-z and ESM-IF-z apply z-score normalization to the corresponding raw scores, enabling better comparison and ranking across mutation sites.

    Note: Z-score refers to a standardization method. Two normalization strategies are supported and controlled by the Normalization parameter.
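
The ESM scoring rule described above can be sketched in a few lines. This is a minimal illustration that assumes the per-model probabilities are already available; the function names and the probability values are hypothetical, and no ESM model is actually loaded here.

```python
import math


def esm_raw_score(p_mutant: float, p_wild_type: float) -> float:
    """Raw score of one sub-model: log P(mutant) - log P(wild type)."""
    return math.log(p_mutant) - math.log(p_wild_type)


def esm_ensemble_score(per_model_probs):
    """Average the raw scores over all sub-models.

    per_model_probs: list of (p_mutant, p_wild_type) tuples, one per sub-model.
    """
    scores = [esm_raw_score(pm, pw) for pm, pw in per_model_probs]
    return sum(scores) / len(scores)


# Hypothetical probabilities from three sub-models for a single mutation.
probs = [(0.10, 0.05), (0.08, 0.08), (0.02, 0.04)]
score = esm_ensemble_score(probs)
print(score)
```

A positive score means the models consider the mutant residue more likely than the wild-type residue at that position; in this made-up example the three sub-models disagree and the average is close to zero.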


    Structure

    Input protein structure file in PDB or CIF format for structure-based scoring. Multiple conformations of the same structure are supported (compressed formats: .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz). Scores are averaged across conformations to reduce bias from a single structure.


    Chain

    Specify chain IDs for single-point mutation recommendation. Multiple chains are separated by commas (e.g., A,B). If not specified, all chains in the structure will be analyzed.


    TopN

    Number of candidate single-point mutations recommended per chain for each integrated method. Default: 20.


    Positions Excluded

    Residues to exclude from mutation analysis. Format: Chain + residue index (starting from 1), e.g., A100. Multiple positions can be separated by commas and ranges are supported, e.g., A10-20,A25,B30-36,B40.
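
To make the exclusion syntax concrete, here is a small parser sketch. It is not the module's own parser: `parse_excluded` is a hypothetical name, and the sketch assumes single-letter chain IDs.

```python
import re

# One token: chain letter, start position, optional "-end" (assumes 1-letter chain IDs).
_TOKEN = re.compile(r"^([A-Za-z])(\d+)(?:-(\d+))?$")


def parse_excluded(spec: str) -> dict[str, set[int]]:
    """Expand a spec such as 'A10-20,A25' into {chain: {positions}}."""
    excluded: dict[str, set[int]] = {}
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"cannot parse position token: {token!r}")
        chain, start, end = match.group(1), int(match.group(2)), match.group(3)
        stop = int(end) if end is not None else start
        if stop < start:
            raise ValueError(f"range end before start in token: {token!r}")
        excluded.setdefault(chain, set()).update(range(start, stop + 1))
    return excluded


result = parse_excluded("A10-20,A25,B30-36,B40")
print({chain: sorted(positions) for chain, positions in result.items()})
```

For the example string, chain A ends up with positions 10 through 20 plus 25, and chain B with positions 30 through 36 plus 40.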


    Normalization

    Defines the grouping strategy for z-score normalization. Options: aa_substitution_type (default) and aa_mutation.

    • aa_substitution_type: Groups mutations by substitution type (e.g., A→L, G→V). Focuses on “which amino acid is replaced by which”.
    • aa_mutation: Groups mutations by the resulting amino acid. Focuses on “what amino acid it becomes”.
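
The difference between the two grouping strategies is easiest to see on a toy example. The sketch below is an assumption-laden illustration, not the module's code: it uses the population standard deviation and assigns 0 to groups with a single member or no spread, neither of which is specified in this document. The raw scores are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_key(mutation: str, mode: str) -> str:
    """Group label for a mutation such as 'A10L' under the chosen mode."""
    wild_type, mutant = mutation[0], mutation[-1]
    if mode == "aa_substitution_type":
        return f"{wild_type}>{mutant}"  # e.g. all A>L substitutions together
    if mode == "aa_mutation":
        return mutant  # e.g. everything that becomes P together
    raise ValueError(f"unknown normalization mode: {mode}")


def grouped_zscores(raw_scores: dict[str, float], mode: str) -> dict[str, float]:
    """Z-score each raw score within its group (assumed: population std)."""
    groups = defaultdict(list)
    for mutation in raw_scores:
        groups[group_key(mutation, mode)].append(mutation)
    z = {}
    for members in groups.values():
        values = [raw_scores[m] for m in members]
        mu, sigma = mean(values), pstdev(values)
        for m in members:
            # Assumption: groups without spread get a z-score of 0.
            z[m] = (raw_scores[m] - mu) / sigma if sigma > 0 else 0.0
    return z


raw = {"A10L": 1.0, "A35L": 3.0, "G25P": 2.0, "A10P": 4.0}
by_type = grouped_zscores(raw, "aa_substitution_type")
by_target = grouped_zscores(raw, "aa_mutation")
print(by_type)
print(by_target)
```

Under `aa_substitution_type`, G25P and A10P sit in separate groups (G>P and A>P), whereas under `aa_mutation` they are normalized together because both end in P.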

    Single Point Mutation File

    Output CSV file name for single-point mutation results. Default: SP_Mutation.csv.


    Multiple Point Mutations

    This module trains predictive models using experimental data to automatically screen and combine mutations, generating high-order mutation designs for experimental validation. A typical workflow involves generating experimental data from single- or double-point mutations, then training a model to predict higher-order combinations (≥3 sites).


    Structure

    Input protein structure file in PDB or CIF format.


    Training Data

    Input .csv file containing the following required columns:

    • mutation: Mutation information in the format WildTypeResidue + Chain + Position + MutantResidue, e.g., KA100N. For multi-point mutations, use semicolons, e.g., GA48R;DB106A.
    • property: Experimental fold-change (FC), defined as: mutant property / wild-type property.

    Note:

    1. The number of mutation samples must be greater than 20.
    2. The module will check the input content; if there are data errors, please refer to stderr.txt.
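
Before uploading a training file, it can help to check the `mutation` column locally. The parser below is a sketch under the assumption of single-letter chain IDs and upper-case one-letter residue codes; `Mutation` and `parse_variant` are hypothetical names, and the module's own validation may be stricter.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    wild_type: str
    chain: str
    position: int
    mutant: str


# WildTypeResidue + Chain + Position + MutantResidue, e.g. KA100N.
_PATTERN = re.compile(r"^([A-Z])([A-Za-z])(\d+)([A-Z])$")


def parse_variant(text: str) -> list[Mutation]:
    """Parse 'GA48R;DB106A' into a list of Mutation records."""
    mutations = []
    for token in text.split(";"):
        match = _PATTERN.match(token.strip())
        if match is None:
            raise ValueError(f"malformed mutation token: {token!r}")
        wild_type, chain, position, mutant = match.groups()
        mutations.append(Mutation(wild_type, chain, int(position), mutant))
    return mutations


print(parse_variant("KA100N"))
print(parse_variant("GA48R;DB106A"))
```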

    Single Mutations

    Single-point mutation file used for combinatorial generation. Format:

    TA192V
    TB192K
    AC167R
    NA72A
    

    Notes:

    1. If not specified, all single mutations in the training set will be used for combination.
    2. Input is validated; check stderr.txt for errors.
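
How the module searches the combination space is not described here, but the size of that space is easy to explore. The sketch below simply enumerates combinations with `itertools.combinations` and drops any combination that places two substitutions on the same chain and position; both the helper names and the fifth token `TA192K` are made up for illustration.

```python
from itertools import combinations


def site(mutation: str) -> str:
    """Chain + position part of a token, e.g. 'TA192V' -> 'A192'."""
    return mutation[1:-1]


def enumerate_combinations(singles, order):
    """Yield every `order`-point combination that never hits one site twice."""
    for combo in combinations(singles, order):
        if len({site(m) for m in combo}) == order:
            yield combo


# The four tokens from the example above plus one hypothetical extra (TA192K).
singles = ["TA192V", "TB192K", "AC167R", "NA72A", "TA192K"]
triples = list(enumerate_combinations(singles, 3))
print(len(triples))
for combo in triples:
    print("/".join(combo))
```

Note that TA192V and TB192K can coexist because they sit on different chains, while TA192V and TA192K cannot.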

    Top variants

    Number of top-ranked variants returned per mutation order. Default: 3 (e.g., top 3 for triple, quadruple, etc., up to decuple mutations).


    Multiple Point Mutation File

    Output CSV file name. Default: MP_Mutation.csv.


    Results

    Single-point Mutation Output (SP_Mutation.csv)

    Chain ID Mutations ESM ESM-IF ESM-z ESM-IF-z Count
    A F26L 1 0 1 1 3
    A A167R 1 0 1 0 2
    …

    Field Description:

    • Chain ID: Chain where mutation is recommended
    • Mutations: Mutation in format WildTypeResidue + Position + MutantResidue (index starts from 1), e.g., F26L
    • ESM / ESM-IF / ESM-z / ESM-IF-z: Whether selected by each method (1 = yes, 0 = no)
    • Count: Number of methods that selected the mutation

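    Each method ranks the saturation single-point mutations by its own scoring rule and selects the top-N mutations with no repeated positions. A selected mutation is marked 1 in that method's column, otherwise 0.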
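
The selection and the `Count` column can be reproduced with a short sketch. It assumes that "non-redundant" means at most one mutation per position, as stated in the Chinese description; the function names and the scores are hypothetical.

```python
def top_n_unique_positions(scores, n):
    """Pick the n best-scoring mutations, keeping at most one per position."""
    chosen, used_positions = [], set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for mutation, _ in ranked:
        position = mutation[1:-1]  # 'F26L' -> '26'
        if position in used_positions:
            continue
        chosen.append(mutation)
        used_positions.add(position)
        if len(chosen) == n:
            break
    return chosen


def count_table(selections):
    """Map each selected mutation to its per-method 0/1 flags plus Count."""
    table = {}
    all_mutations = sorted({m for picks in selections.values() for m in picks})
    for mutation in all_mutations:
        flags = {method: int(mutation in picks) for method, picks in selections.items()}
        flags["Count"] = sum(flags.values())
        table[mutation] = flags
    return table


scores = {"F26L": 2.0, "F26Y": 1.5, "A167R": 1.0, "A250D": 0.5}  # invented scores
print(top_n_unique_positions(scores, 2))

selections = {
    "ESM": ["F26L", "A167R"],
    "ESM-IF": ["A250D"],
    "ESM-z": ["F26L", "A167R"],
    "ESM-IF-z": ["F26L", "A250D"],
}
table = count_table(selections)
print(table["F26L"])
```

With these invented selections the flags reproduce the three rows of the sample table: F26L is picked by three methods, A167R and A250D by two each.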


    Multi-point Mutation Output (MP_Mutation.csv)

    Variant ID Chain ID Mutations Number Mutations Sequence Average
    399 A 3 N72A/A167R/T192K … 0.7711919
    …

    Field Description:

    • Variant ID: Variant index, consistent with the numbering in MP_Mutation_all.csv
    • Chain ID: Chains with mutations; multiple chains separated by semicolon
    • Mutations Number: Number of mutations per chain
    • Mutations: Mutation list; within-chain separated by /, between chains by ;
    • Sequence: Full amino acid sequence; multiple chains separated by :
    • Average: Mean predicted score; higher indicates better predicted performance

    The output package MP_Mutation.tar.gz contains:

    • MP_Mutation.csv
    • MP_Mutation_all.csv (complete results)
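
The multi-chain fields use three different separators, which is easy to get wrong when post-processing the CSV. The following sketch splits one row into per-chain pieces; `parse_mp_row` is a hypothetical helper and the sequences are shortened placeholders rather than real output.

```python
def parse_mp_row(chain_ids, mutation_counts, mutations, sequence):
    """Split the ';', '/' and ':' separated fields of one MP_Mutation.csv row."""
    chains = chain_ids.split(";")
    counts = [int(c) for c in mutation_counts.split(";")]
    per_chain = [block.split("/") for block in mutations.split(";")]
    if not (len(chains) == len(counts) == len(per_chain)):
        raise ValueError("Chain ID, Mutations Number and Mutations disagree")
    for chain, count, muts in zip(chains, counts, per_chain):
        if count != len(muts):
            raise ValueError(f"chain {chain}: expected {count} mutations, got {len(muts)}")
    return {
        "mutations": dict(zip(chains, per_chain)),
        "sequences": sequence.split(":"),  # one entry per chain in the structure
    }


row = parse_mp_row(
    "A;B",
    "4;1",
    "Q13P/A40P/S63A/T116L;I105L",
    "EVQLVESGGG:EIVLTQSPAT",  # placeholder sequences, truncated for brevity
)
print(row["mutations"])
print(row["sequences"])
```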

    References

    • Vincent Q. Tran et al., Rapid directed evolution guided by protein language models and epistatic interactions. DOI:10.1126/science.aea1820
  • Name: Enzyme pH Optimum Prediction (EpHod)
    Description: EpHod 是一个基于机器学习的酶最适 pH(pHopt)预测工具,旨在从氨基酸序列直接预测酶的最适工作 pH 值。 EpHod is a machine learning tool for predicting enzyme optimum pH (pHopt) directly from amino acid sequences.
    Tags: undefined
    Author: Japheth E. Gado
    Release: 2025-04-29 00:00:00
    Reference: Gado J E, Knotts M, Shaw A Y, et al. Machine learning prediction of enzyme optimum pH[J]. Nature Machine Intelligence, 2025, 7(5): 716-729.

    Enzyme pH Optimum Prediction (EpHod)

    简介

    EpHod 是一个基于机器学习的酶最适 pH(pHopt)预测工具,旨在从氨基酸序列直接预测酶的最适工作 pH 值。

    核心思想是通过蛋白质语言模型 ESM1v 提取酶序列特征,结合残差轻注意力(Residual Light Attention,RLAT)和支持向量回归(SVR)进行集成预测。模型直接从序列数据中学习与 pHopt 相关的结构和生物物理特征,包括残基与催化中心的距离、溶剂可及性等。

    参数说明

    Enzyme Fasta File

    输入的酶序列 FASTA 文件路径,必选项

    FASTA 文件每条序列以 > 开头,格式示例:

    >Q2YPV0 | Brucella abortus | 4.2.1.11 | 8.5 | 0.366
    MTAIIDIVGREILDSRGNPTVEVDVVLEDGSFGRAAVPSGASTGAHEAVELRDGGSRYLGKGVEKAVEVVNGKIFDAIAGMDAESQLLIDQTLIDLDGSANKGNLGANAILGVSLAVAKAAAQASGLPLYRYVGGTNAHVLPVPMMNIINGGAHADNPIDFQEFMILPVGATSIREAVRYGSEVFHTLKKRLKDAGHNTNVGDEGGFAPNLKNAQAALDFIMESIEKAGFKPGEDIALGLDCAATEFFKDGNYVYEGERKTRDPKAQAKYLAKLASDYPIVTIEDGMAEDDWEGWKYLTDLIGNKCQLVGDDLFVTNSARLRDGIRLGVANSILVKVNQIGSLSETLDAVETAHKAGYTAVMSHRSGETEDSTIADLAVATNCGQIKTGSLARSDRTAYNQLIRIEEELGKQARYAGRSALKLL
    

    Predicted Results

    输出预测结果文件名,默认为 prediction.csv

    结果说明

    预测结果为 CSV 文件,包含以下列:

    列名 说明
    index 序列 ID
    RLATtr 基于注意力机制的预测酶最适 pH
    SVR 基于支持向量回归的预测酶最适 pH
    Ensemble 集成预测值(上述两者平均)

    如何理解结果

    1. Ensemble 为推荐使用的预测值,综合了注意力机制和回归模型的优势
    2. 预测值为数值型 pH 值,代表酶的最适工作 pH
    3. 建议结合预测置信度(序列头部的 pLDDT 值)综合判断结果可靠性
    4. 序列长度超过 1022 残基时会被截断处理

    参考文献

    • Gado J E, Knotts M, Shaw A Y, et al. Machine learning prediction of enzyme optimum pH[J]. Nature Machine Intelligence, 2025, 7(5): 716-729. DOI: 10.1038/s42256-025-01026-6

    Enzyme pH Optimum Prediction (EpHod)

    Introduction

    EpHod is a machine learning tool for predicting enzyme optimum pH (pHopt) directly from amino acid sequences.

    The core approach uses the protein language model ESM1v to extract enzyme sequence features, combined with Residual Light Attention (RLAT) and Support Vector Regression (SVR) for ensemble prediction. The model learns structural and biophysical features directly from sequence data that relate to pHopt, including residue proximity to catalytic centers and solvent accessibility.

    Parameters

    Enzyme Fasta File

    Path to input enzyme sequence FASTA file, required
    Each sequence in the FASTA file starts with >, example format:

    >Q2YPV0 | Brucella abortus | 4.2.1.11 | 8.5 | 0.366
    MTAIIDIVGREILDSRGNPTVEVDVVLEDGSFGRAAVPSGASTGAHEAVELRDGGSRYLGKGVEKAVEVVNGKIFDAIAGMDAESQLLIDQTLIDLDGSANKGNLGANAILGVSLAVAKAAAQASGLPLYRYVGGTNAHVLPVPMMNIINGGAHADNPIDFQEFMILPVGATSIREAVRYGSEVFHTLKKRLKDAGHNTNVGDEGGFAPNLKNAQAALDFIMESIEKAGFKPGEDIALGLDCAATEFFKDGNYVYEGERKTRDPKAQAKYLAKLASDYPIVTIEDGMAEDDWEGWKYLTDLIGNKCQLVGDDLFVTNSARLRDGIRLGVANSILVKVNQIGSLSETLDAVETAHKAGYTAVMSHRSGETEDSTIADLAVATNCGQIKTGSLARSDRTAYNQLIRIEEELGKQARYAGRSALKLL
    

    Predicted Results

    Output prediction result filename, default prediction.csv

    Results

    Prediction result is a CSV file with the following columns:

    Column Description
    index Sequence ID
    RLATtr Attention-based pHopt prediction
    SVR Support vector regression prediction
    Ensemble Ensemble prediction (average of above)

    How to Interpret Results

    1. Ensemble is the recommended prediction value, combining attention mechanism and regression model advantages
    2. The predicted value is a numeric pH value representing the enzyme’s optimum working pH
    3. It is recommended to consider prediction confidence (pLDDT value in sequence header) when evaluating results
    4. Sequences longer than 1022 residues will be truncated
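
Two of the points above lend themselves to a quick local check: whether a sequence exceeds the 1022-residue limit, and whether `Ensemble` is the mean of the two model columns. The sketch below uses only the standard library; the helper names, the tolerance, and the sample values are assumptions made for illustration.

```python
import csv
import io

MAX_RESIDUES = 1022  # limit stated in this document


def exceeds_limit(sequence: str) -> bool:
    """True if the sequence is long enough to be truncated."""
    return len(sequence) > MAX_RESIDUES


def check_ensemble(row, tol=1e-3):
    """True if Ensemble equals the mean of RLATtr and SVR within a tolerance."""
    expected = (float(row["RLATtr"]) + float(row["SVR"])) / 2
    return abs(expected - float(row["Ensemble"])) <= tol


# Invented prediction values in the documented column layout.
sample = "index,RLATtr,SVR,Ensemble\nQ2YPV0,7.9,8.3,8.1\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(check_ensemble(rows[0]))
print(exceeds_limit("M" * 1500))
```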

    Reference

    • Gado J E, Knotts M, Shaw A Y, et al. Machine learning prediction of enzyme optimum pH[J]. Nature Machine Intelligence, 2025, 7(5): 716-729. DOI: 10.1038/s42256-025-01026-6
  • Name: Enzyme-Substrate Prediction (ESP)
    Description: 用于预测酶-底物反应活性的机器学习工具,旨在为实验筛选提供优先级排序。 ESP (Enzyme-Substrate Prediction) is a machine learning tool for predicting enzyme-substrate reaction activity, designed to provide priority ranking for experimental screening.
    Tags: undefined
    Author: Alexander Kroll
    Release: 2023-04-21 00:00:00
    Reference: Alexander Kroll, Sahasra Ranjan, Martin K. M. Engqvist & Martin J. Lercher. A general model to predict small molecule substrates of enzymes based on machine and deep learning. Nature Communications, 2023, 14(1): 2787.

    Enzyme-Substrate Prediction (ESP)

    简介

    ESP (Enzyme-Substrate Prediction) 是一个用于预测酶-底物反应活性的机器学习工具,旨在为实验筛选提供优先级排序。
    它要解决的问题是:在候选组合数量较大时,如何优先挑出更可能发生反应的酶-底物对,从而降低实验试错成本。
    ESP 的核心思想是联合利用两类信息:

    1. 代谢物表征:通过 GNN(图神经网络)提取底物的分子特征。
    2. 酶表征:通过 ESM1b(蛋白语言模型)提取酶的序列特征。
      最终将两类表征拼接后,使用 XGBoost 模型预测反应活性分数。

    参数说明

    Enzyme–Substrate Pair

    输入的底物-酶对列表文件,支持 .csv、.xlsx、.xls 格式,必选项
    文件应包含两列:substrate 和 enzyme

    substrate,enzyme
    C00069,MARLPFYLLVISTLLLVVTADSFLARPPSSSFLHALSNKRASTPASLPSCSLDFLLQTRGGTAANAATTALPTSALVERKGGAAVALEGGKTLWEKSKVWVFIGLWYFFNVAFNIYNKKVLNALPLPWTVSIAQLGLGALYTMFLWLVRARKMPTIAAPEMKTLSILGVLHAVSHITAITSLGAGAVSFTHIVKSAEPFFSAVFAGLFFGQFFSLPVYAALIPVVSGVAYASLKELTFTWLSFWCAMASNVVCAARGVVVKGMMGGKPTQSKDLTSSNMYSVLTILAALVLLPFGALVEGPGLHAAWKAAAAHPSLTNGGTELAKYLVYSGLTFFLYNEVAFAALESLHPISHAVANTIKRVVIIVVSVLVFRNPMSTQSIIGSSTAVIGVLLYSLAKHYCK
    

    Substrate Column

    底物列的列名,默认为 substrate

    Enzyme Column

    酶列的列名,默认为 enzyme

    Predicted Results

    输出结果文件名,默认为 predictions.csv

    结果说明

    输出文件

    输出的结果文件,CSV 格式,包含以下列:

    列名 说明
    substrate 底物 ID
    enzyme 酶的氨基酸序列
    complete 数据是否完整(True/False)
    metabolite_similarity_score 代谢物与训练集的相似度分数
    metabolite in training set 底物是否在训练集中
    #metabolite in training set 训练集中相似代谢物数量
    Prediction 预测值(0-1),值越高表示反应越可能发生

    如何理解结果

    1. Prediction:预测分数,值越高表示该酶-底物对越可能发生反应
    2. 该分数本质是概率型分数,不应直接等同于"反应一定发生"
    3. 建议将输出用于候选排序,按分数从高到低组织实验顺序(Top-K 优先)
    4. metabolite_similarity_score 反映底物与训练集的相似程度,可作为预测可信度的参考

    注意事项

    1. 底物输入格式:支持 KEGG Compound ID(如 C00069)、SMILES 或 InChI 格式
    2. 输入质量影响显著:底物 ID 合法性、SMILES 有效性会直接影响预测稳定性
    3. 结果使用原则:建议在同一任务上下文内做相对比较,不建议跨任务直接比较绝对分数阈值

    参考文献

    • Kroll A, Ranjan S, Engqvist M K M, Lercher M J. A general model to predict small molecule substrates of enzymes based on machine and deep learning[J]. Nature Communications, 2023, 14(1): 2787. DOI:10.1038/s41467-023-38347-2

    Enzyme-Substrate Prediction (ESP)

    Introduction

    ESP (Enzyme-Substrate Prediction) is a machine learning tool for predicting enzyme-substrate reaction activity, designed to provide priority ranking for experimental screening.
    It addresses the problem: when the number of candidate combinations is large, how to prioritize enzyme-substrate pairs that are more likely to react, thereby reducing experimental trial-and-error costs.
    The core idea of ESP is to jointly utilize two types of information:

    1. Metabolite representation: Extract substrate molecular features through GNN (Graph Neural Network)
    2. Enzyme representation: Extract enzyme sequence features through ESM1b (protein language model)
    Finally, the two representations are concatenated and passed to an XGBoost model, which predicts the reaction activity score.

    Parameters

    Enzyme–Substrate Pair

    Input file containing substrate-enzyme pairs, supports .csv, .xlsx, .xls format, required

    The file should contain two columns: substrate and enzyme

    substrate,enzyme
    C00069,MARLPFYLLVISTLLLVVTADSFLARPPSSSFLHALSNKRASTPASLPSCSLDFLLQTRGGTAANAATTALPTSALVERKGGAAVALEGGKTLWEKSKVWVFIGLWYFFNVAFNIYNKKVLNALPLPWTVSIAQLGLGALYTMFLWLVRARKMPTIAAPEMKTLSILGVLHAVSHITAITSLGAGAVSFTHIVKSAEPFFSAVFAGLFFGQFFSLPVYAALIPVVSGVAYASLKELTFTWLSFWCAMASNVVCAARGVVVKGMMGGKPTQSKDLTSSNMYSVLTILAALVLLPFGALVEGPGLHAAWKAAAAHPSLTNGGTELAKYLVYSGLTFFLYNEVAFAALESLHPISHAVANTIKRVVIIVVSVLVFRNPMSTQSIIGSSTAVIGVLLYSLAKHYCK
    C00002,MKGRRRRRREYCKFALLLVLYTLVLLLVPSVLDGGRDGDKGAEHCPGLQRSLGVWSLEAAAAGEREQGAEARAAEEGGANQSPRFPSNLSGAVGEAVSREKQHIYVHATWRTGSSFLGELFNQHPDVFYLYEPMWHLWQALYPGDAESLQGALRDMLRSLFRCDFSVLRLYAPPGDPAARAPDTANLTTAALFRWRTNKVICSPPLCPGAPRARAEVGLVEDTACERSCPPVAIRALEAECRKYPVVVIKDVRLLDLGVLVPLLRDPGLNLKVVQLFRDPRAVHNSRLKSRQGLLRESIQVLRTRQRGDRFHRVLLAHGVGARPGGQSRALPAAPRADFFLTGALEVICEAWLRDLLFARGAPAWLRRRYLRLRYEDLVRQPRAQLRRLLRFSGLRALAALDAFALNMTRGAAYGADRPFHLSARDAREAVHAWRERLSREQVRQVEAACAPAMRLLAYPRSGEEGDAEQPREGETPLEMDADGAT
    

    Substrate Column

    Column name for substrate, default substrate

    Enzyme Column

    Column name for enzyme, default enzyme

    Output File

    Output result filename, default predictions.csv

    Results

    Output File

    Output result file in CSV format, containing the following columns:

    Column Description
    substrate Substrate ID
    enzyme Enzyme amino acid sequence
    complete Whether data is complete (True/False)
    metabolite_similarity_score Similarity score between metabolite and training set
    metabolite in training set Whether substrate is in training set
    #metabolite in training set Number of similar metabolites in training set
    Prediction Prediction score (0-1), higher values indicate higher reaction likelihood

    How to Interpret Results

    1. Prediction: Prediction score, higher values indicate the enzyme-substrate pair is more likely to react
    2. This score is essentially a probability-like score and should not be directly equated with “reaction will definitely occur”
    3. It is recommended to use the output for candidate ranking, organizing experimental order from high to low scores (Top-K priority)
    4. metabolite_similarity_score reflects the similarity between substrate and training set, which can be used as a reference for prediction reliability
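
The Top-K ordering recommended above can be done with the standard `csv` module. The sketch below is illustrative: the sample rows and scores are invented, only a subset of the output columns is shown, and it assumes that rows with `complete` set to False carry no usable prediction.

```python
import csv
import io


def rank_pairs(csv_text: str, top_k: int):
    """Return the top_k complete rows, sorted by Prediction in descending order."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["complete"] != "True":
            continue  # assumption: incomplete rows are skipped
        row["Prediction"] = float(row["Prediction"])
        rows.append(row)
    rows.sort(key=lambda r: r["Prediction"], reverse=True)
    return rows[:top_k]


# Invented sample with shortened enzyme sequences.
sample = (
    "substrate,enzyme,complete,Prediction\n"
    "C00069,MARLPF,True,0.82\n"
    "C00002,MKGRRR,True,0.35\n"
    "C99999,MKGRRR,False,\n"
    "C00031,MARLPF,True,0.91\n"
)
top = rank_pairs(sample, 2)
for row in top:
    print(row["substrate"], row["Prediction"])
```

Sorting within one file keeps the comparison relative, which matches the advice not to compare absolute score thresholds across tasks.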

    Notes

    1. Substrate input format: Supports KEGG Compound ID (e.g., C00069), SMILES, or InChI format
    2. Input quality significantly affects results: Substrate ID validity and SMILES validity directly affect prediction stability
    3. Result usage principle: It is recommended to make relative comparisons within the same task context, and avoid directly comparing absolute score thresholds across tasks

    Reference

    • Kroll A, Ranjan S, Engqvist M K M, Lercher M J. A general model to predict small molecule substrates of enzymes based on machine and deep learning[J]. Nature Communications, 2023, 14(1): 2787. DOI: 10.1038/s41467-023-38347-2
  • Name: Catalytic Optimum Predictor (CatOpt)
    Description: Catalytic Optimum Predictor (CatOpt) 是一个基于深度学习的酶催化剂特性预测工具,用于从蛋白质序列预测酶的最适 pH 、最适温度和热变性温度。 CatOpt is a deep learning-based tool for predicting enzyme catalytic properties, including optimal pH, optimal temperature, and melting temperature from protein sequences.
    Tags: undefined
    Author: Sizhe Qiu
    Release: 2025-11-07 00:00:00
    Reference: Qiu S, Wang N K, Lu Y, et al. Deep Learning-Based Prediction of Enzyme Optimal pH and Design of Point Mutations to Improve Acid Resistance[J]. ACS Synthetic Biology, 2025, 14(12): 4897-4906

    Catalytic Optimum Predictor (CatOpt)

    简介

    Catalytic Optimum Predictor (CatOpt) 是一个基于深度学习的酶催化剂特性预测工具,用于从蛋白质序列预测酶的最适 pH 、最适温度和热变性温度。

    CatOpt 的核心思想是利用蛋白质语言模型 ESM2 提取酶序列的高维特征表征,结合多头自注意力机制的多尺度卷积神经网络,实现高精度的酶催化特性预测。

    参数说明

    Input Dataset

    输入数据集路径,CSV格式
    输入文件应包含 sequence 列,每行为蛋白质的氨基酸序列。

    sequence
    MARLPFYLLVISTLLLVVTADSFLARPPSSSFLHALSNKRASTPASLPSCSLDFLLQTRGGTAANAATTALPTSALVERKGGAAVALEGGKTLWEKSKVWVFIGLWYFFNVAFNIYNKKVLNALPLPWTVSIAQLGLGALYTMFLWLVRARKMPTIAAPEMKTLSILGVLHAVSHITAITSLGAGAVSFTHIVKSAEPFFSAVFAGLFFGQFFSLPVYAALIPVVSGVAYASLKELTFTWLSFWCAMASNVVCAARGVVVKGMMGGKPTQSKDLTSSNMYSVLTILAALVLLPFGALVEGPGLHAAWKAAAAHPSLTNGGTELAKYLVYSGLTFFLYNEVAFAALESLHPISHAVANTIKRVVIIVVSVLVFRNPMSTQSIIGSSTAVIGVLLYSLAKHYCK
    MKGRRRRRREYCKFALLLVLYTLVLLLVPSVLDGGRDGDKGAEHCPGLQRSLGVWSLEAAAAGEREQGAEARAAEEGGANQSPRFPSNLSGAVGEAVSREKQHIYVHATWRTGSSFLGELFNQHPDVFYLYEPMWHLWQALYPGDAESLQGALRDMLRSLFRCDFSVLRLYAPPGDPAARAPDTANLTTAALFRWRTNKVICSPPLCPGAPRARAEVGLVEDTACERSCPPVAIRALEAECRKYPVVVIKDVRLLDLGVLVPLLRDPGLNLKVVQLFRDPRAVHNSRLKSRQGLLRESIQVLRTRQRGDRFHRVLLAHGVGARPGGQSRALPAAPRADFFLTGALEVICEAWLRDLLFARGAPAWLRRRYLRLRYEDLVRQPRAQLRRLLRFSGLRALAALDAFALNMTRGAAYGADRPFHLSARDAREAVHAWRERLSREQVRQVEAACAPAMRLLAYPRSGEEGDAEQPREGETPLEMDADGAT
    

    Task

    预测任务类型:pHopt(最适 pH)、topt(最适温度)、tm(热变性温度)

    Output Path

    输出结果文件路径,默认为prediction_results.csv

    结果说明

    输出文件

    CSV 格式,包含以下列:

    列名 说明
    id 样本索引
    sequence 蛋白质氨基酸序列
    pred_{task} 预测值(pHopt/topt/tm)

    预测值范围

    任务 预测值范围
    pHopt 0 - 14
    topt 0 - 120 °C
    tm 0 - 100 °C

    注意事项

    1. 输入格式:仅支持标准氨基酸字母序列,不含特殊字符或未知氨基酸(X 除外)

    参考文献

    • Qiu S, Wang N K, Lu Y, et al. Deep Learning-Based Prediction of Enzyme Optimal pH and Design of Point Mutations to Improve Acid Resistance[J]. ACS Synthetic Biology, 2025, 14(12): 4897-4906.DOI: 10.1021/acssynbio.5c00679

    Catalytic Optimum Predictor (CatOpt)

    Introduction

    CatOpt is a deep learning-based tool for predicting enzyme catalytic properties, including optimal pH, optimal temperature, and melting temperature from protein sequences.

    The core idea of CatOpt is to leverage the ESM2 protein language model to extract high-dimensional sequence features, combined with a multi-scale convolutional neural network with multi-head self-attention mechanism, achieving high-precision enzyme catalytic property prediction.

    Parameters

    Input Dataset

    Path to the input dataset in CSV format.
    The input file must contain a sequence column, with each row representing a protein amino acid sequence.

    sequence
    MARLPFYLLVISTLLLVVTADSFLARPPSSSFLHALSNKRASTPASLPSCSLDFLLQTRGGTAANAATTALPTSALVERKGGAAVALEGGKTLWEKSKVWVFIGLWYFFNVAFNIYNKKVLNALPLPWTVSIAQLGLGALYTMFLWLVRARKMPTIAAPEMKTLSILGVLHAVSHITAITSLGAGAVSFTHIVKSAEPFFSAVFAGLFFGQFFSLPVYAALIPVVSGVAYASLKELTFTWLSFWCAMASNVVCAARGVVVKGMMGGKPTQSKDLTSSNMYSVLTILAALVLLPFGALVEGPGLHAAWKAAAAHPSLTNGGTELAKYLVYSGLTFFLYNEVAFAALESLHPISHAVANTIKRVVIIVVSVLVFRNPMSTQSIIGSSTAVIGVLLYSLAKHYCK
    MKGRRRRRREYCKFALLLVLYTLVLLLVPSVLDGGRDGDKGAEHCPGLQRSLGVWSLEAAAAGEREQGAEARAAEEGGANQSPRFPSNLSGAVGEAVSREKQHIYVHATWRTGSSFLGELFNQHPDVFYLYEPMWHLWQALYPGDAESLQGALRDMLRSLFRCDFSVLRLYAPPGDPAARAPDTANLTTAALFRWRTNKVICSPPLCPGAPRARAEVGLVEDTACERSCPPVAIRALEAECRKYPVVVIKDVRLLDLGVLVPLLRDPGLNLKVVQLFRDPRAVHNSRLKSRQGLLRESIQVLRTRQRGDRFHRVLLAHGVGARPGGQSRALPAAPRADFFLTGALEVICEAWLRDLLFARGAPAWLRRRYLRLRYEDLVRQPRAQLRRLLRFSGLRALAALDAFALNMTRGAAYGADRPFHLSARDAREAVHAWRERLSREQVRQVEAACAPAMRLLAYPRSGEEGDAEQPREGETPLEMDADGAT
    
    

    Task

    Prediction task type: pHopt (optimal pH), topt (optimal temperature), tm (melting temperature)

    Output Path

    Path to the output results file. Default: prediction_results.csv

    Results

    Output File

    CSV format with the following columns:

    Column Description
    id Sample index
    sequence Protein amino acid sequence
    pred_{task} Prediction value (pHopt/topt/tm)

    Prediction Range

    Task Prediction Range
    pHopt 0 - 14
    topt 0 - 120 °C
    tm 0 - 100 °C

    Notes

    1. Input Format: Only standard amino acid letter sequences are supported; no special characters or unknown amino acids (except X)
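
A small pre-flight check can catch unsupported characters before a job is submitted. The sketch below assumes upper-case one-letter codes and treats the 20 standard amino acids plus X as valid, following the note above; the helper names are hypothetical and the module's own validation may differ.

```python
import csv
import io

# 20 standard amino acids plus X (allowed according to the note above).
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}


def invalid_residues(sequence: str):
    """Sorted list of characters that are not allowed in a sequence."""
    return sorted(set(sequence) - ALLOWED)


def to_input_csv(sequences) -> str:
    """Build the text of a CSV with a single 'sequence' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sequence"])
    for seq in sequences:
        bad = invalid_residues(seq)
        if bad:
            raise ValueError(f"unsupported characters {bad} in {seq[:10]!r}")
        writer.writerow([seq])
    return buffer.getvalue()


print(invalid_residues("MKXB*"))
print(to_input_csv(["MKT", "AXG"]))
```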

    Reference

    • Qiu S, Wang N K, Lu Y, et al. Deep Learning-Based Prediction of Enzyme Optimal pH and Design of Point Mutations to Improve Acid Resistance[J]. ACS Synthetic Biology, 2025, 14(12): 4897-4906.DOI: 10.1021/acssynbio.5c00679
  • Name: Protein Contacts Profile
    Description: 对结构预测模型(如:Boltz2/Protenix/AF3等)预测的一组蛋白单体或复合物结构进行全面分析,包括:二级结构、溶剂可及性、疏水性、残基接触、结构置信度等等,对分析结果进行统一整理和对比展示。 Performs comprehensive analysis on a set of protein monomer or complex structures predicted by structure prediction models (e.g., Boltz2, Protenix, AF3). The analysis covers secondary structure, solvent accessibility, hydrophobicity, residue contacts, structural confidence, and more, with all results organized and presented in a unified comparative view.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-04-24 00:00:00
    Reference: Robert, X., Guillon, C. and Gouet, P. (2025) FoldScript: a web server for the efficient analysis of AI-generated 3D protein models, Nucleic Acids Res., 53(W1):W277-W282, DOI: https://doi.org/10.1093/nar/gkaf326 https://foldscript.ibcp.fr

    Protein Contacts Profile

    简介

    对结构预测模型(如:Boltz2/Protenix/AF3等)预测的一组蛋白单体或复合物结构进行全面分析,包括:二级结构、溶剂可及性、疏水性、残基接触、结构置信度等等,对分析结果进行统一整理和对比展示。

    核心思路:以参考链为分析目标,将二级结构、溶剂可及性、疏水性、残基接触、模型置信度,以及可选的同源序列和保守性信息等等汇总到同一份报告中。

    模块工作流 :

    • Gemmi:用于读取 PDB/mmCIF 结构,并从查询模型中规范化链、残基、原子和实体类型。
    • DSSP:提取每个残基的二级结构和溶剂可及性。
    • 二硫键计算模块:根据模型坐标计算以残基为中心的接触注释,并推断二硫键。
    • Kyte-Doolittle 疏水性评分:逐一计算残基疏水性。
    • 类 pLDDT 的置信度提取:当输入模型将置信度编码在原子的 B-factor字段中时,提取残基置信度数值。
    • BLAST+:在配置好的本地序列数据库中搜索同源蛋白序列。
    • Clustal Omega:对查询序列和命中序列进行比对,进行保守性分析。
    • 相互作用类型:对残基对相互作用进行跨结构统计,生成CSV文件,比较不同结构之间相互作用出现的频率与类型。
    • 统计汇总:基于上述分析结果,汇总输出多个CSV文件。
    • HTML 渲染:将上述分析结果渲染为交互式 HTML 报告和按链导出的 PDF。

    相互作用类型判断的阈值:

    相互作用类型 相互作用表示的编号 对应阈值
    疏水接触 hp 原子间距 < RvdW(A)+RvdW(B)+0.5 Å。
    盐桥 sb 距离 < 4.0 Å。
    π-阳离子 pc 距离 < 6.0 Å,角度 < 60°。
    π-π 堆积 ps 中心距 < 7.0 Å,法向角 < 30°;ψ角 < 45°。
    T-stacking ts 中心距 < 5.0 Å,相对 90° 的法向偏差 < 30°;ψ角 < 45°。
    范德华接触 vdw 原子间距 < RvdW(A)+RvdW(B)+0.5 Å。
    直接氢键 hbbb / hbsb / hbss / hblb / hbls / hbll D–A 距离 < 3.5 Å,静态结构默认角度阈值 180°。
    水桥氢键 wb / lwb 每一段氢键距离< 3.5 Å,角度 > 110°。
    扩展水桥氢键 wb2 / lwb2 每一段氢键距离< 3.5 Å,角度> 110°。

    参数说明

    Structures

    输入蛋白结构,支持单个输入和批量输入。单结构输入支持 .pdb、.cif、.mmcif 格式;批量输入需使用压缩包,支持 .zip、.tar、.tar.gz、.tgz、.tar.bz2、.tbz2、.tar.xz、.txz,最多支持100个结构。

    Sequence database

    指定通过BLAST+搜索筛选的序列数据库。可选:SwissProt(UniProt知识库中的Swiss-Prot数据库)和 PDBAA(从实验确定的三维结构数据库PDB衍生的序列)。

    E-value

    指定BLAST+搜索识别序列匹配的保留阈值。E值表示给定序列比对的统计显著性。E值越低(越接近零),匹配的显著性越高,可选阈值:
    1e-4、1e-5、1e-6、1e-7、1e-8、1e-9、1e-10、1e-11、1e-12。默认:1e-6

    Homolog Display Limit

    设置在序列搜索时,保留的最大序列数,可选数量:1~25。默认:5

    Columns

    指定一行序列展示的残基数量。默认:120

    VDW

    分析相互作用时,是否输出范德华相互作用,默认不输出。

    结果说明

    输出protein_contacts_profile.html,内容展示如下:
    注意: 若输入超过80个结构,HTML文档有可能过大而导致浏览器无法正常显示。
    image.png

    说明:
    A. 摘要
    在图的最上方,是当前活动面板的摘要区。

    • Reference Chain 表示当前正在查看的参考链;
    • Residue Span 表示这条链在图中覆盖的残基范围;
    • Model Count 表示本次一起参与比较的模型数量;
    • Output Status 表示当前链面板的结果状态;
    • Hydropathy Window、E-value Threshold、Weak Contact Cutoff、Strong Contact Cutoff、Homolog Display Limit 和 Database 对应这张图生成时使用的关键参数,说明如下:
    参数 默认值 说明
    Hydropathy Window 3 疏水性平滑窗口大小,用于计算 Kyte-Doolittle hydropathy。值越大,曲线越平滑;值越小,越能反映局部波动。应为正整数。
    E-value Threshold 1e-6 表示指定BLAST+搜索识别序列匹配的保留阈值E-value,E值越低(越接近零),匹配的显著性越高。
    Weak Contact Cutoff 3.7 Å 弱接触的距离上限,按非氢原子之间的最短距离判断。只有当最短距离 <= 3.7 Å 时,残基间才会被视为存在接触。通常表示“接触判定的外层阈值”。
    Strong Contact Cutoff 3.2 Å 强接触的距离阈值,按非氢原子之间的最短距离判断。当最短距离 < 3.2 Å 时记为强接触;否则若仍在 weak cutoff 内,则记为弱接触。
    Homolog Display Limit 5 最多检索并展示多少条同源序列。取值范围为 0-25;设为 0 时会跳过 homolog search,只保留 query 序列本身。
    Database SWISSPROT 用于同源搜索的数据库,PDBAA和SWISSPROT。
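
Hydropathy Window 的平滑效果可以用下面的示意代码来理解。代码使用公开的 Kyte-Doolittle 疏水性标度;窗口以当前残基为中心、在序列两端自动缩短,这一边界处理方式是本示例的假设,模块的实际实现可能不同,`hydropathy_profile` 也只是示例用的函数名。

```python
# Kyte-Doolittle 疏水性标度(公开文献数值)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 3):
    """按滑动窗口对每个残基的疏水性取平均(假设:两端窗口自动缩短)。"""
    if window < 1:
        raise ValueError("window 必须为正整数")
    half = window // 2
    profile = []
    for i in range(len(sequence)):
        segment = sequence[max(0, i - half): i + half + 1]
        profile.append(sum(KD[aa] for aa in segment) / len(segment))
    return profile


profile = hydropathy_profile("AIL", window=3)
print([round(value, 2) for value in profile])
```

窗口越大,参与平均的残基越多,曲线越平滑;窗口为 1 时即为逐残基的原始标度值。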

    B. 二级结构与置信度
    在结构部分,每个模型都会用图形标出参考链上的二级结构单元。

    • Alpha Helix、3₁₀ Helix 和 Pi Helix 都用卷曲波浪线表示,颜色不同,分别对应 α 螺旋、3₁₀ 螺旋和 π 螺旋。
    • Strand 用箭头表示,箭头方向反映链段方向。
    • Alpha Turn 和 Beta Turn 则表示两类更短的紧转角区域,通常出现在连接不同二级结构的局部片段中。

    如果输入结构中带有逐残基置信度信息,结构轨道还会按 pLDDT 着色。

    • 深蓝色表示 pLDDT >= 90,通常可以把这部分看作局部较可信的区域;
    • 青蓝色表示 70 <= pLDDT < 90,整体骨架往往已经比较可靠;
    • 黄色表示 50 <= pLDDT < 70,说明这一段需要更谨慎地解释;
    • 橙红色表示 pLDDT < 50,这类区域往往更灵活,也更容易出现不稳定或低置信度的构象。
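
上述四档颜色的划分可以写成一个简单的分段函数。以下仅为示意,`plddt_band` 是示例用的函数名,颜色名称直接取自上面的说明。

```python
def plddt_band(plddt: float) -> str:
    """根据 pLDDT 数值返回对应的颜色档位。"""
    if plddt >= 90:
        return "深蓝色"
    if plddt >= 70:
        return "青蓝色"
    if plddt >= 50:
        return "黄色"
    return "橙红色"


for value in (95.2, 90, 72.5, 50, 31.0):
    print(value, plddt_band(value))
```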

    把“结构形态”和“置信度颜色”结合起来看:如果某个螺旋或链段在不同模型里都出现,但颜色差异明显,就说明该局部形态可能存在不确定性。

    C. 抗体编号

    如果当前参考链被识别为抗体样链,还会额外显示一条 Antibody Numbering 轨道,并在界面中提供 Kabat、IMGT、Chothia 三种编号方案切换。

    • 这条轨道主要用来标出抗体可变区中的 CDR1、CDR2、CDR3 位置,帮助快速判断互补决定区落在序列的哪一段;
    • 切换编号方案时,变化的是 CDR 的边界定义和标签归属,而不是原始氨基酸序列本身;
    • 若参考链被识别为重链,则这些 CDR 标记对应重链可变区;若被识别为轻链,则对应轻链可变区;

    如果序列并不像抗体可变区,或者当前环境无法完成抗体编号,这条轨道可能不会显示。

    D. 序列与同源信息

    在结构轨道下方,报告会显示参考链的序列轴。

    • Query Sequence 表示目标序列本身;
    • Exact Match 表示该位置与查询序列完全一致;
    • Similar Substitution 表示氨基酸虽然不同,但仍属于较保守的替换。

    阅读时,它能帮助你区分“模型之间结构有差异,但序列背景本身很保守”和“这一段本来就在同源序列里变化较大”这两种情况。

    E. Accessibility与Hydropathy

    在序列信息下面,报告会显示两条理化性质轨道。
    Accessibility 用来表示残基在结构表面的暴露程度。

    • 深蓝色对应埋藏程度较高的区域;
    • 灰蓝色表示中间状态;
    • 浅蓝色表示更容易暴露于溶剂环境;
    • 金色则提示该位置表面暴露很明显;

    对于结构分析,这条轨道适合用来判断一个位点更像是核心残基、表面残基,还是潜在的界面区域。

    Hydropathy 则描述序列在局部窗口中的疏水或亲水倾向。

    • 橙色表示偏疏水,常见于埋藏区、界面内部或与脂质环境相容的区域;
    • 灰色表示性质居中;蓝色表示偏亲水,更常出现在暴露于水相环境的位置;

    这条轨道结合 Accessibility 轨道一起观察,能更好地判断某一段结构是否符合直觉:例如,一个明显暴露的区域如果同时又很疏水,就值得进一步留意它是否参与界面作用或是否处在特殊构象环境中。

    F. 接触与符号

    接触轨道用来描述参考链残基与其他链、配体或小分子的相互作用。
    蛋白-蛋白接触使用字母显示,字母表示接触对方所在的链。颜色用于区分接触强度:

    • 较深的接触标记对应强接触,当前阈值为最短非氢原子距离小于 3.2 Å;
    • 较浅的标记对应弱接触,距离位于 3.2 Å 到 3.7 Å 之间;
    • 若同一个位点同时存在多种接触,图中还会用额外外框加以标识,提醒这一位置可能处在更复杂的界面环境里;

    除了链字母,轨道里还会出现一些专门的符号。

    • S 用来标记二硫键;绿色通常表示链内二硫键,青色表示链间二硫键;
    • # 表示接触对方与当前位置具有相同的残基编号和残基类型,常见于对称相关的接触关系;
    • 对于非蛋白组分,模块也会用符号做简写显示,例如核酸、离子、糖类、卟啉样配体或其他小分子;参考下方的符号对照表

    建议先看“有没有接触”,再看“接触对象是谁”,最后再结合结构轨道判断这些接触是否稳定、是否集中出现在同一个局部区域。

    符号对照表:

    符号 说明
    A-Z / a-z 与对应链发生蛋白-蛋白接触
    S 二硫键位置
    # 与同编号、同类型残基发生对应接触
    * 与核酸接触
    + 与离子接触
    : 与卟啉样或相关大环配体接触
    " 与糖类配体接触
    ^ 与其他小分子或杂项配体接触

    G. 底部图例
    分成三列:

    • 左侧 Structure 主要解释螺旋、折叠链和转角这些结构符号以及抗体编号;
    • 中间 Tracks 主要解释序列比对、可及性、疏水性和置信度颜色;
    • 右侧 Contacts 主要解释强弱接触、二硫键和各类配体符号。

    接触残基对的详细信息文件contact_details.csv,示例如下:

    Chain Residue Pos Other_Chain Other_Residue Other_Pos Structures Distances Distance_Avg Interaction Types
    A Y 34 C S 32 1;2;3;4;5 3.34;3.32;3.00;3.31;3.39 3.27
    A H 38 B D 104 1;2;3;4;5 3.69;3.95;2.95;2.98;2.95 3.30 sb
    A Q 46 B Y 95 1;2;3;4;5 3.57;3.71;3.41;3.46;3.48 3.53
    A R 96 C E 54 1;4;5 3.74;3.60;3.99 3.77 sb
    B R 99 C D 55 2;3;4;5 3.74;2.77;2.82;2.84 3.04 sb

    说明:

    字段 说明
    Chain 第一个残基所在链。
    Residue 第一个残基类型。
    Pos 第一个残基从1开始的顺序编号。
    Other_Chain 形成接触的另一残基所在链。
    Other_Residue 另一残基类型。
    Other_Pos 另一残基从1开始的顺序编号。
    Structures 存在该接触残基对的结构编号列表,结构编号从1开始按出现顺序(见HTML文档中结构名称的展示顺序,从上向下)编号,使用分号 ; 分隔。
    Distances 各结构中该残基对的最小接触距离,顺序与 Structures 对应。
    Distance_Avg 所有接触距离(Distances)的平均值。
    Interaction Types 相互作用类型,使用分号 ; 分隔;未匹配时为空。

    接触残基的详细信息文件contact_residue_details.csv,示例如下:

    Chain Residue Pos SASA_Rel_Avg All …
    A E 1 0.68 0.40 …
    A S 32 0.34 1.00 …
    A Y 34 0.04 1.00 …
    A Q 46 0.51 1.00 …
    A Q 93 0.09 0.60 …
    B S 7 0.21 0.40 …

    说明:

    字段 说明
    Chain 残基所在链。
    Residue 残基类型。
    Pos 残基从1开始的顺序编号。
    SASA_Rel_Avg 该残基在全部结构中的平均相对溶剂可及性,取值范围为 0.00 到 1.00。
    Domain(Kabat/IMGT/Chothia) 如果是抗体链,会显示残基对应的CDR区域
    All / Cluster_n 该残基在全部结构/聚类结构簇中的作为接触残基出现的频率,取值范围为 0.00 到 1.00。当前示例包含 All 列,数值为0.40,表示该残基在全部结构的40%中作为接触残基出现。若存在多个聚类结构簇,还会增加 Cluster_1、Cluster_2 等列。

    接触残基汇总文件contact_consensus.csv,示例如下:

    Cluster Id Structure Count Cluster Center Combine Count Consensus Count Combine Residue Consensus Residue Consensus Residue (Threshold)
    All 5 proteinx_lig_rank_1 87 52 A1;A31-34;A36;A38;A40;A42;A46-48;A50;A53-54;A57;A59;A91;A95-100;A102;B7-10;B1… A32;A34;A36;A38;A40;A42;A47-48;A50;A53-54;A57;A91;A95-96;A98-100;B33;B35;B39;… A32;A34;A36;A38;A40;A42;A46-48;A50;A53-54;A57;A59;A91;A95-100;A102;B30;B33;B3…

    说明:

    字段 说明
    Cluster Id 统计范围标识。All 表示基于全部结构的统计;若存在多个聚类结构簇,还会出现 Cluster_1、Cluster_2 等结构簇范围统计。
    Structure Count 当前统计范围内参与汇总的结构数量。
    Cluster Center 当前簇的中心结构
    Combine Count 当前统计范围内,接触残基并集的数量。
    Consensus Count 当前统计范围内,接触残基交集的数量。
    Combine Residue 接触残基并集列表。残基编号使用从1开始的顺序编号,并保留链前缀;连续区间会压缩为 A31-34 这种格式。
    Consensus Residue 接触残基交集列表,格式与 Combine Residue 相同。
    Consensus Residue (Threshold) 达到统计范围内,结构数量百分比阈值的接触残基列表,格式与 Combine Residue 相同。默认阈值为 0.5,表示统计范围内50%的结构中出现的接触残基列表。

    结构聚类信息tm_clusters.csv,示例如下:

    Structure Cluster Id Cluster Size Cluster Center Is Representative
    chai-1_rank_1 1 5 proteinx_lig_rank_1 0
    chai-1_rank_2 1 5 proteinx_lig_rank_1 0
    proteinx_lig_rank_1 1 5 proteinx_lig_rank_1 1
    proteinx_lig_rank_2 1 5 proteinx_lig_rank_1 0
    proteinx_lig_rank_3 1 5 proteinx_lig_rank_1 0

    说明:

    字段 说明
    Structure 结构名称,不带后缀。
    Cluster Id 聚类后,该结构所属的结构簇编号。
    Cluster Size 该结构所在簇的成员数量。
    Cluster Center 该结构所在簇的中心结构名。
    Is Representative 是否为该簇的代表结构;1 表示是,0 表示否。

    用于聚类的相似性分数(TM_score)矩阵tm_score_matrix.csv,示例如下:

    Structure chai-1_rank_1 chai-1_rank_2 proteinx_lig_rank_1 proteinx_lig_rank_2 proteinx_lig_rank_3
    chai-1_rank_1 1.00 1.00 0.97 0.97 0.97
    chai-1_rank_2 1.00 1.00 0.98 0.98 0.98
    proteinx_lig_rank_1 0.97 0.98 1.00 1.00 1.00
    proteinx_lig_rank_2 0.97 0.98 1.00 1.00 1.00
    proteinx_lig_rank_3 0.97 0.98 1.00 1.00 1.00

    复合物中所有可能相互作用的列表cross_structure_interaction.csv,示例如下:

    Chain Residue Pos Other_Chain Other_Residue Other_Pos Structures Count Interaction Types
    B R 38 B E 46 1;2;3;4;5 5 sb
    C K 105 C W 2 1;2;3;4;5 5 pc
    C F 22 C F 7 1;2;3;4;5 5 ps
    B Y 27 B Y 32 1;2;3;4;5 5 ts
    B R 38 D ATP 1 3;4;5 3 pc

    说明:

    字段 说明
    Chain 第一个残基所在链。
    Residue 第一个残基类型;或配体名称,如 LIG。
    Pos 第一个残基从1开始的顺序编号。
    Other_Chain 另一残基所在链。
    Other_Residue 另一残基类型/配体名称
    Other_Pos 另一残基从1开始的顺序编号。
    Structures 存在该相互作用的结构编号列表,结构编号从1开始按出现顺序(见HTML文档中结构名称的展示顺序,从上向下)编号,使用分号 ; 分隔。
    Count Structures 中结构编号的数量。
    Interaction Types 该相互作用对,相互作用类型汇总,使用分号 ; 分隔。

    输出protein_contacts_profile_results.tar.gz,会包含HTML、PDF、CSV文档。

    参考文献

    • Robert, X., Guillon, C. and Gouet, P. (2025) FoldScript: a web server for the efficient analysis of AI-generated 3D protein models, Nucleic Acids Res., 53(W1):W277-W282, DOI: https://doi.org/10.1093/nar/gkaf326
    • https://foldscript.ibcp.fr

    Protein Contacts Profile

    Introduction

    Performs comprehensive analysis on a set of protein monomer or complex structures predicted by structure prediction models (e.g., Boltz2 / Protenix / AF3), including secondary structure, solvent accessibility, hydrophobicity, residue contacts, structural confidence, and more. Analysis results are organized and presented in a unified comparative view.

    Core concept: Using the reference chain as the analysis target, the report aggregates secondary structure, solvent accessibility, hydrophobicity, residue contacts, model confidence, and optionally homologous sequences and conservation information into a single document.

    Module workflow:

    • Gemmi: Reads PDB/mmCIF structures and normalizes chains, residues, atoms, and entity types from the query models.
    • DSSP: Extracts per-residue secondary structure and solvent accessibility.
    • Disulfide bond calculator: Computes residue-centric contact annotations from model coordinates and infers disulfide bonds.
    • Kyte-Doolittle hydrophobicity scoring: Computes per-residue hydrophobicity.
    • pLDDT-like confidence extraction: When the input model encodes confidence in atomic B-factor fields, per-residue confidence values are extracted.
    • BLAST+: Searches homologous protein sequences against the configured local sequence database.
    • Clustal Omega: Aligns query and hit sequences for conservation analysis.
    • Interaction types: Performs cross-structure statistics on residue-pair interactions, generating a CSV file comparing the frequency and types of interactions across different structures.
    • Statistical summary: Aggregates the above analysis results into multiple CSV files.
    • HTML rendering: Renders the analysis results into an interactive HTML report and per-chain PDF exports.

    Interaction type thresholds:

    Interaction Type Code Threshold
    Hydrophobic contact hp Inter-atomic distance < RvdW(A)+RvdW(B)+0.5 Å
    Salt bridge sb Distance < 4.0 Å
    π-Cation pc Distance < 6.0 Å, angle < 60°
    π-π Stacking ps Centroid distance < 7.0 Å, normal angle < 30°; ψ angle < 45°
    T-stacking ts Centroid distance < 5.0 Å, normal deviation from 90° < 30°; ψ angle < 45°
    van der Waals contact vdw Inter-atomic distance < RvdW(A)+RvdW(B)+0.5 Å
    Direct H-bond hbbb / hbsb / hbss / hblb / hbls / hbll D–A distance < 3.5 Å; default angle threshold 180° for static structures
    Water-bridged H-bond wb / lwb Each H-bond segment distance < 3.5 Å, angle > 110°
    Extended water-bridged H-bond wb2 / lwb2 Each H-bond segment distance < 3.5 Å, angle > 110°

    Parameters

    Structures

    Input protein structures, supporting both single and batch submission. Single-structure input supports .pdb, .cif, and .mmcif formats. Batch input must be provided as a compressed archive, supporting .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz. Maximum 100 structures per batch.

    Sequence database

    Specifies the sequence database for BLAST+ search filtering. Options: SwissProt (from the UniProt Knowledgebase) and PDBAA (derived from experimentally determined 3D structures in PDB).

    E-value

    Specifies the retention threshold for BLAST+ sequence match significance. The E-value indicates the statistical significance of a given sequence alignment. A lower E-value (closer to zero) indicates higher significance. Available thresholds: 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10, 1e-11, 1e-12. Default: 1e-6

    Homolog Display Limit

    Sets the maximum number of sequences retained during sequence search. Range: 1–25. Default: 5

    Columns

    Specifies the number of residues displayed per line. Default: 120

    VDW

    Whether to output van der Waals interactions during interaction analysis. Default: false

    Results

    Outputs protein_contacts_profile.html, displayed as follows:

    Note: If more than 80 structures are input, the HTML document may become too large for browsers to display properly.

    A. Summary

    At the top of the figure is the summary area for the current active panel.

    • Reference Chain: the reference chain currently being viewed;
    • Residue Span: the residue range covered by this chain in the figure;
    • Model Count: the number of models participating in this comparison;
    • Output Status: the result status of the current chain panel;
    • Hydropathy Window, E-value Threshold, Weak Contact Cutoff, Strong Contact Cutoff, Homolog Display Limit, and Database: the key parameters used during figure generation, described below:
    Parameter Default Description
    Hydropathy Window 3 Smoothing window size for Kyte-Doolittle hydropathy calculation. Larger values produce smoother curves; smaller values reflect local fluctuations more closely. Must be a positive integer.
    E-value Threshold 1e-6 Retention threshold for BLAST+ sequence match significance. A lower E-value (closer to zero) indicates higher significance.
    Weak Contact Cutoff 3.7 Å Upper distance limit for weak contacts, judged by the shortest distance between non-hydrogen atoms. Two residues are considered to be in contact only when this shortest distance is <= 3.7 Å, so this value acts as the outer threshold for contact detection.
    Strong Contact Cutoff 3.2 Å Distance threshold for strong contacts, judged by the shortest distance between non-hydrogen atoms. Distances < 3.2 Å are recorded as strong contacts; otherwise, if within the weak cutoff, they are recorded as weak contacts.
    Homolog Display Limit 5 Maximum number of homologous sequences retrieved and displayed. Range: 0–25; setting to 0 skips homolog search and retains only the query sequence itself.
    Database SWISSPROT Database used for homology search: PDBAA or SWISSPROT.
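
    The two cutoffs in the table can be read as a simple decision rule. The sketch below is illustrative only: the function and constant names are hypothetical and not part of the module, and it assumes the input is the shortest non-hydrogen atom distance between two residues, in Å.

```python
from typing import Optional

STRONG_CUTOFF = 3.2  # Å, strict upper bound for a strong contact (default from the table)
WEAK_CUTOFF = 3.7    # Å, inclusive upper bound for any contact (default from the table)


def classify_contact(min_distance: float) -> Optional[str]:
    """Classify a residue pair by its shortest non-hydrogen atom distance.

    Returns "strong", "weak", or None when the pair is not in contact.
    """
    if min_distance < STRONG_CUTOFF:
        return "strong"
    if min_distance <= WEAK_CUTOFF:
        return "weak"
    return None


for distance in (2.95, 3.2, 3.7, 4.5):
    print(distance, classify_contact(distance))
```

    Note that a distance of exactly 3.2 Å falls into the weak class, because the strong cutoff is strict while the weak cutoff is inclusive.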

    B. Secondary Structure and Confidence

    In the structure section, each model marks the secondary structure elements on the reference chain.

    • Alpha Helix, 3₁₀ Helix, and Pi Helix are all represented by coiled waves in different colors, corresponding to α-helix, 3₁₀-helix, and π-helix, respectively.
    • Strand is represented by arrows, with arrow direction reflecting strand orientation.
    • Alpha Turn and Beta Turn denote two classes of shorter tight-turn regions, typically appearing in local segments connecting different secondary structures.

    If the input structures contain per-residue confidence information, the structure tracks are colored by pLDDT.

    • Dark blue: pLDDT >= 90, typically regarded as locally highly reliable regions;
    • Cyan: 70 <= pLDDT < 90, generally indicating a fairly reliable backbone;
    • Yellow: 50 <= pLDDT < 70, suggesting this segment should be interpreted with caution;
    • Orange-red: pLDDT < 50, regions that are often more flexible and prone to unstable or low-confidence conformations.
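
    The color bins above amount to a lookup on the per-residue pLDDT value. A minimal sketch, with a hypothetical function name and the color names taken from the list:

```python
def plddt_color(plddt: float) -> str:
    """Map a per-residue pLDDT value to the color bin used for the structure track."""
    if plddt >= 90:
        return "dark blue"
    if plddt >= 70:
        return "cyan"
    if plddt >= 50:
        return "yellow"
    return "orange-red"


for value in (95.2, 90.0, 71.5, 50.0, 32.8):
    print(value, plddt_color(value))
```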

    Read structural shape and confidence color together: if a helix or strand appears in several models but its color differs noticeably between them, the local conformation may be uncertain.

    C. Antibody Numbering

    If the current reference chain is recognized as an antibody-like chain, an additional Antibody Numbering track is displayed, with toggle options for Kabat, IMGT, and Chothia numbering schemes in the interface.

    • This track primarily marks the positions of CDR1, CDR2, and CDR3 in the antibody variable region, helping to quickly determine which sequence segment the complementarity-determining regions fall into;
    • When switching numbering schemes, what changes are the CDR boundary definitions and label assignments, not the underlying amino acid sequence itself;
    • If the reference chain is recognized as a heavy chain, the CDR labels correspond to the heavy chain variable region; if recognized as a light chain, they correspond to the light chain variable region;

    If the sequence does not resemble an antibody variable region, or if antibody numbering cannot be completed in the current environment, this track may not be displayed.

    D. Sequence and Homology Information

    Below the structure tracks, the report displays the reference chain sequence axis.

    • Query Sequence: the target sequence itself;
    • Exact Match: indicates complete identity with the query sequence at this position;
    • Similar Substitution: indicates that the amino acid differs but belongs to a relatively conserved substitution.

    This track helps distinguish two cases: the models differ structurally although the sequence background is highly conserved, or the segment is inherently variable among the homologous sequences.

    E. Accessibility and Hydropathy

    Below the sequence information, the report displays two physicochemical property tracks.

    Accessibility indicates the degree of surface exposure of a residue in the structure.

    • Dark blue corresponds to more buried regions;
    • Gray-blue indicates an intermediate state;
    • Light blue indicates greater exposure to solvent;
    • Gold indicates clearly high surface exposure;

    For structural analysis, this track is useful for judging whether a site is more likely a core residue, a surface residue, or a potential interface region.

    Hydropathy describes the hydrophobic or hydrophilic tendency of the sequence in a local window.

    • Orange indicates hydrophobic bias, commonly found in buried regions, internal interfaces, or lipid-compatible environments;
    • Gray indicates intermediate properties;
    • Blue indicates hydrophilic bias, more frequently found at positions exposed to the aqueous environment.

    Read this track together with the Accessibility track to check whether a segment behaves as expected. For example, a clearly exposed region that is also strongly hydrophobic deserves a closer look: it may take part in an interface or sit in an unusual conformational environment.

    F. Contacts and Symbols

    Contact tracks describe interactions between reference chain residues and other chains, ligands, or small molecules.

    Protein-protein contacts are displayed as letters indicating the chain of the contact partner. Colors distinguish contact strength:

    • Darker contact markers correspond to strong contacts, currently defined as shortest non-hydrogen atom distance < 3.2 Å;
    • Lighter markers correspond to weak contacts, with distances between 3.2 Å and 3.7 Å.
    • If multiple contact types exist at the same site, additional borders are used to highlight that this position may be in a more complex interface environment.

    In addition to chain letters, the track contains specialized symbols.

    • S marks disulfide bonds; green typically indicates intra-chain disulfide bonds, cyan indicates inter-chain disulfide bonds;
    • # indicates that the contact partner shares the same residue number and residue type as the current position, commonly seen in symmetry-related contacts;
    • For non-protein components, the module uses shorthand symbols, e.g., nucleic acids, ions, sugars, porphyrin-like ligands, or other small molecules; refer to the Symbol Legend below.

    Recommended reading order: first check “whether there is a contact”, then “who the contact partner is”, and finally combine with structural tracks to judge whether these contacts are stable and whether they are concentrated in the same local region.

    Symbol Legend:

    Symbol Description
    A–Z / a–z Protein-protein contact with the corresponding chain
    S Disulfide bond position
    # Contact with a residue of the same number and same type
    * Contact with nucleic acid
    + Contact with ion
    : Contact with porphyrin-like or related macrocyclic ligand
    " Contact with carbohydrate ligand
    ^ Contact with other small molecule or miscellaneous ligand

    G. Bottom Legend

    Divided into three columns:

    • Left Structure: mainly explains structural symbols for helices, strands, and turns, as well as antibody numbering;
    • Middle Tracks: mainly explains sequence alignment, accessibility, hydrophobicity, and confidence colors;
    • Right Contacts: mainly explains strong/weak contacts, disulfide bonds, and various ligand symbols.

    Contact Details

    File: contact_details.csv

    Example:

    Chain Residue Pos Other_Chain Other_Residue Other_Pos Structures Distances Distance_Avg Interaction Types
    A Y 34 C S 32 1;2;3;4;5 3.34;3.32;3.00;3.31;3.39 3.27
    A H 38 B D 104 1;2;3;4;5 3.69;3.95;2.95;2.98;2.95 3.30 sb
    A Q 46 B Y 95 1;2;3;4;5 3.57;3.71;3.41;3.46;3.48 3.53
    A R 96 C E 54 1;4;5 3.74;3.60;3.99 3.77 sb
    B R 99 C D 55 2;3;4;5 3.74;2.77;2.82;2.84 3.04 sb

    Field descriptions:

    Field Description
    Chain Chain of the first residue.
    Residue Type of the first residue.
    Pos Sequential 1-based index of the first residue.
    Other_Chain Chain of the contacting partner residue.
    Other_Residue Type of the partner residue.
    Other_Pos Sequential 1-based index of the partner residue.
    Structures List of structure indices where this contact pair exists. Structure indices start from 1 in order of appearance (see structure name display order in the HTML document, top to bottom), separated by semicolons ;.
    Distances Minimum contact distances for this residue pair in each structure, in the same order as Structures.
    Distance_Avg Average of all contact distances (Distances).
    Interaction Types Interaction type(s), separated by semicolons ;; empty if not matched.
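
    The Structures and Distances fields are parallel semicolon-separated lists, so they can be paired up once split. The sketch below works on field values copied from the example rows; the helper names are hypothetical, and how the CSV file itself is delimited is not assumed here.

```python
def parse_semicolon_list(field, cast):
    """Split a semicolon-separated field and convert each item with `cast`."""
    return [cast(item) for item in field.split(";") if item]


def distance_average(distances_field):
    """Recompute Distance_Avg from a Distances field."""
    distances = parse_semicolon_list(distances_field, float)
    return sum(distances) / len(distances)


# Pair each structure index with its minimum contact distance.
structures = parse_semicolon_list("1;4;5", int)
distances = parse_semicolon_list("3.74;3.60;3.99", float)
per_structure = dict(zip(structures, distances))
print(per_structure)

# Average for the first example row (A Y 34 / C S 32).
print(f"{distance_average('3.34;3.32;3.00;3.31;3.39'):.2f}")
```

    Averages recomputed from the two-decimal values shown in the file can differ from Distance_Avg in the last digit (the row with Structures 1;4;5 is such a case), presumably because the module averages unrounded distances.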

    Contact Residue Details

    File: contact_residue_details.csv

    Example:

    Chain Residue Pos SASA_Rel_Avg All …
    A E 1 0.68 0.40 …
    A S 32 0.34 1.00 …
    A Y 34 0.04 1.00 …
    A Q 46 0.51 1.00 …
    A Q 93 0.09 0.60 …
    B S 7 0.21 0.40 …

    Field descriptions:

    Field Description
    Chain Chain containing the residue.
    Residue Residue type.
    Pos Sequential 1-based index of the residue.
    SASA_Rel_Avg Average relative solvent-accessible surface area of this residue across all structures, ranging from 0.00 to 1.00.
    Domain (Kabat/IMGT/Chothia) If the chain is an antibody chain, displays the corresponding CDR region for the residue.
    All / Cluster_n Frequency with which this residue appears as a contact residue across all structures / within a structural cluster, ranging from 0.00 to 1.00. The example shows the All column with value 0.40, indicating this residue appeared as a contact residue in 40% of all structures. If multiple structural clusters exist, additional columns Cluster_1, Cluster_2, etc. will be added.

    Contact Consensus

    File: contact_consensus.csv

    Example:

    Cluster Id Structure Count Cluster Center Combine Count Consensus Count Combine Residue Consensus Residue Consensus Residue (Threshold)
    All 5 proteinx_lig_rank_1 87 52 A1;A31-34;A36;A38;A40;A42;A46-48;A50;A53-54;A57;A59;A91;A95-100;A102;B7-10;B1… A32;A34;A36;A38;A40;A42;A47-48;A50;A53-54;A57;A91;A95-96;A98-100;B33;B35;B39;… A32;A34;A36;A38;A40;A42;A46-48;A50;A53-54;A57;A59;A91;A95-100;A102;B30;B33;B3…

    Field descriptions:

    Field Description
    Cluster Id Statistical scope identifier. All indicates statistics based on all structures; if multiple structural clusters exist, additional scope statistics such as Cluster_1, Cluster_2, etc. will appear.
    Structure Count Number of structures included in the current statistical scope.
    Cluster Center Center structure of the current cluster.
    Combine Count Number of residues in the union of contact residues within the current statistical scope.
    Consensus Count Number of residues in the intersection of contact residues within the current statistical scope.
    Combine Residue List of contact residues in the union. Residue numbering uses 1-based sequential indexing with chain prefixes; contiguous ranges are compressed into formats such as A31-34.
    Consensus Residue List of contact residues in the intersection, formatted the same as Combine Residue.
    Consensus Residue (Threshold) List of contact residues that reach the structure-percentage threshold within the statistical scope, formatted the same as Combine Residue. The default threshold is 0.5, i.e. contact residues that appear in at least 50% of the structures within the scope.
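
    The union, intersection, and threshold lists described above can be sketched with plain set arithmetic. This is an illustration under stated assumptions, not the module's code: the function names are hypothetical, each structure is represented as a set of (chain, position) pairs, and the threshold is taken as inclusive.

```python
from collections import Counter
from itertools import groupby


def summarize_contacts(per_structure, threshold=0.5):
    """Return (combine, consensus, thresholded) sets of (chain, pos) residues.

    per_structure: one set of contact residues per structure.
    """
    n = len(per_structure)
    counts = Counter(res for contacts in per_structure for res in contacts)
    combine = set(counts)                                    # union
    consensus = {r for r, c in counts.items() if c == n}     # intersection
    thresholded = {r for r, c in counts.items() if c / n >= threshold}
    return combine, consensus, thresholded


def compress(residues):
    """Format residues as 'A31-34;A36', merging consecutive positions per chain."""
    parts = []
    for chain, group in groupby(sorted(residues), key=lambda r: r[0]):
        positions = [pos for _, pos in group]
        start = prev = positions[0]
        for pos in positions[1:] + [None]:  # None flushes the last run
            if pos is not None and pos == prev + 1:
                prev = pos
                continue
            parts.append(f"{chain}{start}" if start == prev else f"{chain}{start}-{prev}")
            start = prev = pos
    return ";".join(parts)


models = [
    {("A", 31), ("A", 32), ("A", 33), ("A", 34), ("A", 36)},
    {("A", 32), ("A", 34), ("A", 36), ("B", 7)},
    {("A", 32), ("A", 33), ("A", 34), ("B", 7)},
    {("A", 32), ("A", 34), ("A", 36)},
]
combine, consensus, thresholded = summarize_contacts(models)
print(compress(combine))
print(compress(consensus))
print(compress(thresholded))
```

    The per-residue count divided by the number of structures is the same kind of frequency reported in the All / Cluster_n columns of contact_residue_details.csv.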

    Structural Clustering Information

    File: tm_clusters.csv

    Example:

    Structure Cluster Id Cluster Size Cluster Center Is Representative
    chai-1_rank_1 1 5 proteinx_lig_rank_1 0
    chai-1_rank_2 1 5 proteinx_lig_rank_1 0
    proteinx_lig_rank_1 1 5 proteinx_lig_rank_1 1
    proteinx_lig_rank_2 1 5 proteinx_lig_rank_1 0
    proteinx_lig_rank_3 1 5 proteinx_lig_rank_1 0

    Field descriptions:

    Field Description
    Structure Structure name, without file extension.
    Cluster Id Cluster index to which the structure belongs after clustering.
    Cluster Size Number of members in the cluster containing this structure.
    Cluster Center Center structure name of the cluster containing this structure.
    Is Representative Whether this structure is the representative of its cluster; 1 = yes, 0 = no.

    TM-score Matrix

    File: tm_score_matrix.csv (pairwise TM-score similarity matrix used for clustering)

    Example:

    Structure chai-1_rank_1 chai-1_rank_2 proteinx_lig_rank_1 proteinx_lig_rank_2 proteinx_lig_rank_3
    chai-1_rank_1 1.00 1.00 0.97 0.97 0.97
    chai-1_rank_2 1.00 1.00 0.98 0.98 0.98
    proteinx_lig_rank_1 0.97 0.98 1.00 1.00 1.00
    proteinx_lig_rank_2 0.97 0.98 1.00 1.00 1.00
    proteinx_lig_rank_3 0.97 0.98 1.00 1.00 1.00
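
    The document does not state how the cluster center is chosen from this matrix. One common choice, shown here purely as an assumption, is the medoid: the member with the highest mean TM-score to the other members of its cluster. The function name is hypothetical and the matrix values are copied from the example.

```python
names = [
    "chai-1_rank_1", "chai-1_rank_2",
    "proteinx_lig_rank_1", "proteinx_lig_rank_2", "proteinx_lig_rank_3",
]
matrix = [
    [1.00, 1.00, 0.97, 0.97, 0.97],
    [1.00, 1.00, 0.98, 0.98, 0.98],
    [0.97, 0.98, 1.00, 1.00, 1.00],
    [0.97, 0.98, 1.00, 1.00, 1.00],
    [0.97, 0.98, 1.00, 1.00, 1.00],
]


def medoid(names, matrix, members):
    """Return the member with the highest mean TM-score to the other members.

    Ties are resolved in favor of the member listed first (assumption).
    """
    members = list(members)

    def mean_score(i):
        others = [matrix[i][j] for j in members if j != i]
        return sum(others) / len(others) if others else 1.0

    return names[max(members, key=mean_score)]


center = medoid(names, matrix, range(len(names)))
print(center)
```

    With the example matrix the three proteinx_lig models tie, so the result depends on the tie-breaking rule; the module may use a different criterion.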

    Cross-Structure Interaction List

    File: cross_structure_interaction.csv

    Example:

    Chain Residue Pos Other_Chain Other_Residue Other_Pos Structures Count Interaction Types
    B R 38 B E 46 1;2;3;4;5 5 sb
    C K 105 C W 2 1;2;3;4;5 5 pc
    C F 22 C F 7 1;2;3;4;5 5 ps
    B Y 27 B Y 32 1;2;3;4;5 5 ts
    B R 38 D ATP 1 3;4;5 3 pc

    Field descriptions:

    Field Description
    Chain Chain of the first residue.
    Residue Type of the first residue; or ligand name, e.g. LIG.
    Pos Sequential 1-based index of the first residue.
    Other_Chain Chain of the partner residue.
    Other_Residue Type / ligand name of the partner residue.
    Other_Pos Sequential 1-based index of the partner residue.
    Structures List of structure indices where this interaction exists. Structure indices start from 1 in order of appearance (see structure name display order in the HTML document, top to bottom), separated by semicolons ;.
    Count Number of structure indices in Structures.
    Interaction Types Summary of interaction type(s) for this interaction pair, separated by semicolons ;.

    The output protein_contacts_profile_results.tar.gz contains HTML, PDF, and CSV documents.

    Reference

    • Robert, X., Guillon, C. and Gouet, P. (2025) FoldScript: a web server for the efficient analysis of AI-generated 3D protein models, Nucleic Acids Res., 53(W1):W277-W282, DOI: https://doi.org/10.1093/nar/gkaf326
    • https://foldscript.ibcp.fr
  • Name: MD Dipole
    Description: 能够计算出动力学轨迹体系的总偶极矩以及其波动情况。通过这些数据,可以计算出例如低介电常数介质的介电常数。对于具有净电荷的分子,其净电荷会在分子质心处进行扣除。 The module can calculate the total dipole moment of a molecular dynamics trajectory system and its fluctuations. Based on these data, properties such as the dielectric constant of low-dielectric media can be derived. For systems containing molecules with a net charge, the net charge is removed at the molecular center of mass before the dipole moment calculation.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-01-30 10:36:21
    Reference:

    MD Dipole

    简介

    模块能够计算出动力学轨迹体系的总偶极矩以及其波动情况。通过这些数据,可以计算出例如低介电常数介质的介电常数。对于具有净电荷的分子,其净电荷会在分子质心处进行扣除。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2024)模块或者AlphaAutoMD (GMX2024)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA,Complex。
    可以根据PDB中小分子的名称填写组别名称。
    注意:其中Complex指的是蛋白-小分子复合物体系。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。
    注意:
    1.使用该参数时必须指定完整分子的残基范围,不允许截断结构或遗漏残基。
    2.残基编号参考system.gro文件

    Custom Atom

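    Specify custom atom indices for the calculation. Contiguous ranges are written with “-”, and non-contiguous atoms are separated by commas, e.g., 1-10,15.
    Note:
    1. When using this parameter, the complete atom range of the molecule must be specified; truncated structures or missing atoms are not allowed.
    2. Atom numbering follows the system.gro file.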

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    输出结果包括:

    输出文件名称 说明
    aver.csv 偶极矩统计量的时间平均结果(CSV 格式)
    aver.png 偶极矩统计量随时间变化的可视化图像(PNG 格式)
    Mtot.csv 体系总偶极矩及其分量的时间序列数据(CSV 格式)
    Mtot.png 体系总偶极矩模长随时间变化的可视化图像(PNG 格式)

    其中aver.csv包括信息如下:

    字段名称 说明
    Time (ns) 模拟时间(单位:纳秒)
    <|M|^2> 体系总偶极矩模长平方的系综平均值
    <|M|>^2 体系总偶极矩模长系综平均值的平方
    <|M|^2> - <|M|>^2 总偶极矩模长的涨落项,表示偶极矩的方差
    <|M|>^2 / <|M|^2> 归一化的偶极矩相关比值,可用于介电常数计算

    其中Mtot.csv包括信息如下:

    字段名称 说明
    Time (ns) 模拟时间(单位:纳秒)
    M_x 体系总偶极矩在 x 方向的分量
    M_y 体系总偶极矩在 y 方向的分量
    M_z 体系总偶极矩在 z 方向的分量
    |M_tot| 体系总偶极矩向量的模长

    MD Dipole

    Introduction

    This module calculates the total dipole moment of a molecular dynamics trajectory system and its fluctuations. Based on these data, properties such as the dielectric constant of low-dielectric media can be derived. For molecules with a net charge, the net charge is subtracted at the molecular center of mass before the dipole moment calculation.


    Parameters

    Path File

    The trajectory file obtained after MD simulations. It can be generated by the GMX MD Run (GMX2024) module or the AlphaAutoMD (GMX2024) module.

    System Group

    Select the structural group to be included in the calculation: Backbone, Protein, DNA, RNA, or Complex.
    Custom group names can also be specified based on the names of small molecules defined in the PDB file.

    Note: Complex refers to a protein–small-molecule complex system.

    Custom Resid

    Specify custom residue indices for calculation. Continuous ranges can be denoted using “-”, and non-contiguous residues should be separated by commas, e.g., 1-10,15.

    Note:

    1. When using this parameter, the complete residue range of the molecule must be specified. Truncated structures or missing residues are not allowed.
    2. Residue numbering should follow the system.gro file.
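
    The range syntax used by Custom Resid (and by Custom Atom below) can be expanded into an explicit index list as follows. This is a sketch with a hypothetical function name; it shows how a specification such as 1-10,15 is read and does not reproduce the module's own parser.

```python
from typing import List


def parse_index_spec(spec: str) -> List[int]:
    """Expand a specification such as '1-10,15' into a list of integer indices."""
    indices: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            indices.extend(range(start, end + 1))  # ranges are inclusive
        else:
            indices.append(int(part))
    return indices


print(parse_index_spec("1-10,15"))
```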

    Custom Atom

    Specify custom atom indices for calculation. Continuous ranges can be denoted using “-”, and non-contiguous atoms should be separated by commas, e.g., 1-10,15.

    Note:

    1. When using this parameter, the complete atom range of the molecule must be specified. Truncated structures or missing atoms are not allowed.
    2. Atom numbering should follow the system.gro file.

    Skip Time (ns)

    Time interval between successive frames used in the calculation (unit: nanoseconds).


    Results

    The output results include the following files:

    Output file name Description
    aver.csv Time-averaged dipole moment statistics (CSV format)
    aver.png Visualization of dipole moment statistics as a function of time (PNG format)
    Mtot.csv Time series data of the total dipole moment and its vector components (CSV format)
    Mtot.png Visualization of the magnitude of the total dipole moment over time (PNG format)

    aver.csv File Contents

    Field name Description
    Time (ns) Simulation time (nanoseconds)
    <|M|^2> Ensemble average of the squared magnitude of the total dipole moment
    <|M|>^2 Square of the ensemble-averaged magnitude of the total dipole moment
    <|M|^2> - <|M|>^2 Fluctuation term of the dipole moment magnitude, representing its variance
    <|M|>^2 / <|M|^2> Normalized dipole correlation ratio, used for dielectric constant calculations

    Mtot.csv File Contents

    Field name Description
    Time (ns) Simulation time (nanoseconds)
    M_x x-component of the total dipole moment
    M_y y-component of the total dipole moment
    M_z z-component of the total dipole moment
    |M_tot| Magnitude of the total dipole moment vector
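
    The four statistics in aver.csv can be recomputed from the component columns of Mtot.csv. The sketch below follows the field definitions given above (averages of the magnitude |M|) and averages over all supplied frames; whether the module reports running averages over time, and which units it uses, are not stated here, so treat both as assumptions. The function name is hypothetical.

```python
import math


def dipole_statistics(frames):
    """Compute the aver.csv-style statistics from (M_x, M_y, M_z) tuples."""
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in frames]
    n = len(magnitudes)
    mean_sq = sum(m * m for m in magnitudes) / n   # <|M|^2>
    sq_mean = (sum(magnitudes) / n) ** 2           # <|M|>^2
    return {
        "<|M|^2>": mean_sq,
        "<|M|>^2": sq_mean,
        "<|M|^2> - <|M|>^2": mean_sq - sq_mean,    # variance of |M|
        "<|M|>^2 / <|M|^2>": sq_mean / mean_sq,
    }


# Three made-up frames, for illustration only.
stats = dipole_statistics([(3.0, 4.0, 0.0), (0.0, 0.0, 5.0), (6.0, 8.0, 0.0)])
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```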
  • Name: MMGBSA
    Description: MMGBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。 MMGBSA calculates the binding free energy between the receptor and ligand and provides energy decomposition data, binding constant (Ka), and inhibitor constant (Ki).
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-08-03 09:10:29
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMGBSA

    简介

    MMGBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMGBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在MD (GMX2024)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Reference Structure (GRO)

    参考结构。默认system.gro。可以在GMX MD Run (GMX2024)模块的输出结果文件中找到。当周期性边界条件处理不当时可以使用该参数。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMGBSA_result.csv MMGBSA结果汇总文件。
    MMGBSA_Residue.csv 能量分解数据CSV文件。
    MMGBSA.pdb 原子对应的MMGBSA能量放到PDB文件。可以做对应能量类别的表面图,从图的颜色深浅可以看出哪些区域对结合能贡献值较大。
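    MMGBSA.tar.gz All raw MMGBSA files. _mmgbsa_residue_#.txt holds the per-residue values of each energy category; there are 7 categories: van der Waals energy (VDW), electrostatic energy (ELE), polar solvation energy (PB), nonpolar solvation energy (SA), VDW+ELE=MM, PB+SA=GBSA, and MM+GBSA=Binding/MMGBSA. _mmgbsa_residue.txt summarizes these 7 files and is the raw file behind MMGBSA_Residue.csv. _mmgbsa_atom#.pdb stores the per-atom energy of each category in a PDB file, similar to MMGBSA.pdb.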

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMGBSA

    Introduction

    MMGBSA calculates the binding free energy between a receptor and a ligand and reports energy decomposition data, the binding constant (Ka), and the inhibition constant (Ki). Entropy is computed with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropic component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation. The method requires that the calculation be restricted to a structurally stable portion of the trajectory.
    This module provides four calculation methods. Trajectory calculates the receptor-ligand binding free energy from a molecular dynamics trajectory. One Structure calculates it for a single-frame PDB structure or a docked PDB structure; in the MMGBSA of One Structure workflow a PDB file can be supplied directly.
    Index Name takes the receptor and ligand group names as input, while Custom Name takes the residue numbers of the receptor and ligand in the PDB.
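    The interaction entropy of Duan et al. (2016) is -TΔS = kT ln⟨exp(βΔE_int)⟩, where ΔE_int is the deviation of the per-frame receptor-ligand interaction energy from its trajectory average. The sketch below is a minimal illustration of that formula, not the module's implementation; the function name interaction_entropy and the default temperature of 298.15 K are assumptions.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)


def interaction_entropy(e_int, temperature=298.15):
    """Return -T*dS (kJ/mol) from per-frame interaction energies (kJ/mol).

    Implements -T*dS = kT * ln < exp(beta * dE_int) >, where dE_int is each
    frame's interaction energy minus the average over all frames.
    The temperature default is an assumption made for this sketch.
    """
    kt = R_KJ * temperature
    mean = sum(e_int) / len(e_int)
    avg_exp = sum(math.exp((e - mean) / kt) for e in e_int) / len(e_int)
    return kt * math.log(avg_exp)


if __name__ == "__main__":
    stable = [-101.0, -100.0, -99.0]    # small fluctuations
    unstable = [-110.0, -100.0, -90.0]  # large fluctuations
    print(interaction_entropy(stable), interaction_entropy(unstable))
```

    Because the average is taken over an exponential, frames with large energy fluctuations dominate the result, which is why an unstable trajectory window yields a larger entropy term.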

    Parameter Description

    Trajectory Method

    Path File

    Path file obtained after MD simulation, available in the MD (GMX2024) module or AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand: Protein, DNA, or RNA. For a small molecule, enter its name as it appears in the PDB file. If everything in the system other than the protein is to be treated as the ligand (small molecules included), use ‘Other’.

    Reference Structure (GRO)

    Reference structure. Default: system.gro.
    This file can be found in the output results of the GMX MD Run (GMX2024) module.
    Use this parameter when periodic boundary conditions are not handled properly.

    Start Time (ps)

    Start frame time in ps. Preferably choose a window in which the RMSD is stable, so that structural instability does not inflate the overall entropy.

    End Time (ps)

    End frame time in ps. Preferably choose a window in which the RMSD is stable, so that structural instability does not inflate the overall entropy.

    Skip Time (ps)

    Time interval in ps.

    Index File

    Index file in ndx format. When there is a membrane structure in the system, you can obtain the index.ndx file through the Membrane Solvation module. In membrane simulations, the receptor is ‘receptor’, the ligand is ‘ligand’, and the membrane is ‘membrane’.

    Custom Receptor

    Defines the receptor group used in the binding energy calculation. Enter the residue numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Residues are renumbered from 1, independently of the numbering in the original PDB file.

    Custom Ligand

    Defines the ligand group used in the binding energy calculation. Enter the residue numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Residues are renumbered from 1, independently of the numbering in the original PDB file.
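    The range syntax accepted by Custom Receptor and Custom Ligand can be expanded as in the following minimal sketch. The function name parse_residue_ranges is hypothetical, and the sketch assumes that ranges are inclusive; the module's own parser is not documented here.

```python
def parse_residue_ranges(spec):
    """Expand a string such as '1-211,212-213' into a list of residue numbers.

    Assumes comma-separated items, each either a single number or an
    inclusive 'start-end' range (an assumption of this sketch).
    """
    residues = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            residues.extend(range(int(start), int(end) + 1))
        else:
            residues.append(int(part))
    return residues


if __name__ == "__main__":
    print(parse_residue_ranges("1-3,7"))
```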

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    Compressed archive of the system parameter files, in tar.gz format. Obtained from the MD Solvation module or Membrane Solvation module.

    Result Description

    The output includes:

    Output File Name Description
    MMGBSA_result.csv Summary file of MMGBSA results.
    MMGBSA_Residue.csv Energy decomposition data in CSV format.
    MMGBSA.pdb PDB file carrying the per-atom MMGBSA energies. It can be used to draw a surface map for a chosen energy category; the color intensity shows which regions contribute most to the binding energy.
    MMGBSA.tar.gz All raw MMGBSA files. Each _mmgbsa_residue_#.txt file holds the per-residue values of one energy category; there are 7 categories: van der Waals energy (VDW), electrostatic energy (ELE), polar solvation energy (PB), nonpolar solvation energy (SA), MM = VDW + ELE, GBSA = PB + SA, and Binding (MMGBSA) = MM + GBSA. _mmgbsa_residue.txt summarizes these 7 files and is the raw file behind MMGBSA_Residue.csv. _mmgbsa_atom#.pdb writes the per-atom energy of each category into a PDB file, similar to MMGBSA.pdb.
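    The three derived energy categories are plain sums of the four primary ones. The sketch below only restates those relations in Python; the function name combine_terms and the example values are made up for illustration.

```python
def combine_terms(vdw, ele, pb, sa):
    """Return the derived energy categories from the four primary terms.

    MM = VDW + ELE, GBSA = PB + SA, Binding = MM + GBSA.
    """
    mm = vdw + ele
    gbsa = pb + sa
    return {"MM": mm, "GBSA": gbsa, "Binding": mm + gbsa}


if __name__ == "__main__":
    # Illustrative numbers only, not output of the module.
    print(combine_terms(vdw=-50, ele=-20, pb=30, sa=-5))
```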

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: Alanine Scan (MMGBSA)
    Description: Alanine Scan (MMGBSA)是计算丙氨酸突变后的结合自由能。 Alanine Scan (MMGBSA) calculates components of binding free energy after alanine mutation using the MM-PBSA method.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-08-03 09:10:47
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMGBSA)

    简介

    Alanine Scan (MMGBSA)是计算丙氨酸突变后的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMGBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在MD (GMX2024)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Reference Structure (GRO)

    参考结构。默认system.gro。可以在GMX MD Run (GMX2024)模块的输出结果文件中找到。当周期性边界条件处理不当时可以使用该参数。

    Mutation Residue

    突变扫描为丙氨酸(ALA)的氨基酸位置。格式为‘32-34,36’。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Force File

    丙氨酸扫描时使用的力场。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。只能是AMBER力场下构建的膜体系才能进行计算。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMGBSA_result.csv/MMGBSA_Result_txt.tar.gz 丙氨酸突变结果csv文件。
    MMGBSA_Residue.csv/MMGBSA_Residue_csv.tar.gz 残基能量分解数据(CSV)。
    MMGBSA.pdb/MMGBSA_pdb.tar.gz 突变后能量映射到 PDB 文件,可用于可视化结合能贡献区域。
    MMGBSA.tar.gz 全部原始数据,包括:
    • _mmgbsa_residue_#.txt(7 类能量:VDW、ELE、PB、SA、MM、GBSA、Binding)
    • _mmgbsa_residue.txt(残基能量汇总,对应 MMGBSA_Residue.csv)
    • _mmgbsa_atom#.pdb(原子能量映射 PDB,类似 MMGBSA.pdb)。
    ALA_Scan_Results.csv 丙氨酸扫描所有残基突变结果。

    ALA_Scan_Results.csv,包含信息如下:

    字段名称 说明
    index 残基编号。
    Residue 原始残基名称。
    Mutation Residue 突变后的残基(通常为丙氨酸 ALA)。
    dH (kJ/mol) 焓贡献。
    Tds (kJ/mol) 熵贡献(TΔS)。
    dG (kJ/mol) 结合自由能变化。决定结合强弱的关键指标。越负说明亲和力越强。
    Ki (µM/L) 解离常数,结合亲和力的倒数。
    Ka (L/µM) 结合常数,亲和力大小。

    Ka 越大表示结合力强,Ki 越小表示抑制效果强。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMGBSA)

    Introduction

    Alanine Scan (MMGBSA) calculates the binding free energy after alanine mutations and reports energy decomposition data, the binding constant (Ka), and the inhibition constant (Ki). Entropy is computed with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropic component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation. The method requires that the calculation be restricted to a structurally stable portion of the trajectory.
    This module provides four calculation methods. Trajectory calculates the receptor-ligand binding free energy from a molecular dynamics trajectory. One Structure calculates it for a single-frame PDB structure or a docked PDB structure; in the MMGBSA of One Structure workflow a PDB file can be supplied directly.
    Index Name takes the receptor and ligand group names as input, while Custom Name takes the residue numbers of the receptor and ligand in the PDB.

    Parameters

    Trajectory Method

    Path File

    Path file obtained after MD simulation, available in the MD (GMX2024) module or AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand: Protein, DNA, or RNA. For a small molecule, enter its name as it appears in the PDB file. If everything in the system other than the protein is to be treated as the ligand (small molecules included), use ‘Other’.

    Reference Structure (GRO)

    Reference structure. Default: system.gro.
    This file can be found in the output results of the GMX MD Run (GMX2024) module.
    Use this parameter when periodic boundary conditions are not handled properly.

    Mutation Residue

    Positions of the residues to be mutated to alanine (ALA) in the scan, in the format ‘32-34,36’. Protein amino acids or nucleic acid bases are renumbered from 1, independently of the numbering in the original PDB file.

    Force File

    Force field used for alanine scanning.

    Start Time (ps)

    Start frame time in ps. Preferably choose a window in which the RMSD is stable, so that structural instability does not inflate the overall entropy.

    End Time (ps)

    End frame time in ps. Preferably choose a window in which the RMSD is stable, so that structural instability does not inflate the overall entropy.

    Skip Time (ps)

    Time interval in ps.

    Index File

    Index file in ndx format. When there is a membrane structure in the system, you can obtain the index.ndx file through the Membrane Solvation module. In membrane simulations, the receptor is ‘receptor’, the ligand is ‘ligand’, and the membrane is ‘membrane’. Only membrane systems built under the AMBER force field can be calculated.

    Custom Receptor

    Defines the receptor group used in the binding energy calculation. Enter the residue numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Residues are renumbered from 1, independently of the numbering in the original PDB file.

    Custom Ligand

    Defines the ligand group used in the binding energy calculation. Enter the residue numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Residues are renumbered from 1, independently of the numbering in the original PDB file.

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    Compressed archive of the system parameter files, in tar.gz format. Obtained from the MD Solvation module or Membrane Solvation module.

    Results

    The output includes:

    File Name Description
    MMGBSA_result.csv / MMGBSA_Result_txt.tar.gz Alanine mutation result (csv file).
    MMGBSA_Residue.csv / MMGBSA_Residue_csv.tar.gz Residue energy decomposition data (CSV).
    MMGBSA.pdb / MMGBSA_pdb.tar.gz Energy mapped onto the PDB file after mutation, useful for visualizing binding energy contribution regions.
    MMGBSA.tar.gz Complete raw data, including:
    • _mmgbsa_residue_#.txt (7 energy terms: VDW, ELE, PB, SA, MM, GBSA, Binding)
    • _mmgbsa_residue.txt (residue energy summary, corresponding to MMGBSA_Residue.csv)
    • _mmgbsa_atom#.pdb (atomic energy mapped PDB files, similar to MMGBSA.pdb).
    ALA_Scan_Results.csv Results of alanine scanning mutations for all residues.

    ALA_Scan_Results.csv Contents

    Field Name Description
    index Residue index number.
    Residue Original residue name.
    Mutation Residue Mutated residue (typically alanine, ALA).
    dH (kJ/mol) Enthalpy change.
    Tds (kJ/mol) Entropy term (TΔS).
    dG (kJ/mol) Binding free energy change, the key indicator of binding strength. The more negative the value, the stronger the affinity.
    Ki (µM/L) Dissociation constant, reciprocal of binding affinity.
    Ka (L/µM) Association constant, magnitude of binding affinity.

    Larger Ka indicates stronger binding affinity, while smaller Ki indicates stronger inhibitory effect.
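    The text does not state how the module derives Ki and Ka from dG. A common convention is ΔG = RT ln(Ki / c°) with Ka = 1 / Ki, and the sketch below assumes that convention, a temperature of 298.15 K, and a standard concentration c° of 1 mol/L. The function names are hypothetical.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)


def ki_from_dg(dg_kj_mol, temperature=298.15):
    """Return Ki in mol/L from a binding free energy in kJ/mol.

    Assumes dG = RT * ln(Ki / 1 M); this is a textbook convention, not a
    documented detail of the module.
    """
    return math.exp(dg_kj_mol / (R_KJ * temperature))


def ka_from_dg(dg_kj_mol, temperature=298.15):
    """Return Ka in L/mol as the reciprocal of Ki."""
    return 1.0 / ki_from_dg(dg_kj_mol, temperature)


if __name__ == "__main__":
    ki = ki_from_dg(-40.0)
    print(f"Ki = {ki * 1e6:.3f} uM, Ka = {ka_from_dg(-40.0) / 1e6:.3f} L/umol")
```

    Under this convention a more negative dG gives a smaller Ki and a larger Ka, which matches the interpretation given above.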

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: Cleavage Site Prediction (PeptideCutter)
    Description: 预测蛋白质序列中潜在的蛋白酶或化学试剂切割位点。模块基于PeptideCutter工具对应的文献资料复现。 Predict potential protease or chemical reagent cleavage sites in protein sequences. This module is reproduced based on the literature corresponding to the PeptideCutter tool.
    Tags: undefined
    Author: Elisabeth Gasteiger
    Release: 2026-03-10 15:06:37
    Reference: Gasteiger, E. et al. (2005). Protein Identification and Analysis Tools on the ExPASy Server. In: Walker, J.M. (eds) The Proteomics Protocols Handbook. Springer Protocols Handbooks. Humana Press.

    Cleavage Site Prediction (PeptideCutter)

    预测蛋白质序列中潜在的蛋白酶或化学试剂切割位点。模块基于PeptideCutter工具对应的文献资料复现。PeptideCutter 是瑞士生物信息学研究所(SIB)Expasy 平台提供的专业生物信息学工具。
    支持的蛋白酶或化学试剂的切割规则如下:

    Enzyme / Chemical Name Abbrev P4 P3 P2 P1 P1’ P2’
    Arg-C proteinase ArgC - - - R - -
    Asp-N endopeptidase AspN - - - - D -
    Asp-N endopeptidase + N-terminal Glu AspN+AspGluN - - - - D or E -
    BNPS-Skatole BNPS - - - W - -
    Caspase 1 Casp1 F,W,Y or L - H,A or T D not P,E,D,Q,K or R -
    Caspase 2 Casp2 D V A D not P,E,D,Q,K or R -
    Caspase 3 Casp3 D M Q D not P,E,D,Q,K or R -
    Caspase 4 Casp4 L E V D not P,E,D,Q,K or R -
    Caspase 5 Casp5 L or W E H D - -
    Caspase 6 Casp6 V E H or I D not P,E,D,Q,K or R -
    Caspase 7 Casp7 D E V D not P,E,D,Q,K or R -
    Caspase 8 Casp8 I or L E T D not P,E,D,Q,K or R -
    Caspase 9 Casp9 L E H D - -
    Caspase 10 Casp10 I E A D - -
    Chymotrypsin-high specificity (C-term to [FYW], not before P) Ch_hi - - - F or Y not P -
    - - - W not P -
    Chymotrypsin-low specificity (C-term to [FYWML], not before P) Ch_lo - - - F,L or Y not P -
    - - - W not M or P -
    - - - M not P or Y -
    - - - H not D,M,P or W -
    Clostripain (Clostridiopeptidase B) Clost - - - R - -
    CNBr CNBr - - - M - -
    Enterokinase EK D or E D or E D or E K - -
    Factor Xa FXa A,F,G,I,L,T,V or M D or E G R - -
    Formic acid HCOOH - - - D - -
    Glutamyl endopeptidase GluC - - - E - -
    GranzymeB GzmB I E P D - -
    Hydroxylamine (NH2OH) Hydro - - - N G -
    Iodosobenzoic acid Iodo - - - W - -
    LysC LysC - - - K - -
    LysN LysN - - - - K -
    Neutrophil elastase Elast - - - A or V - -
    NTCB (2-nitro-5-thiocyanobenzoic acid) NTCB - - - - C -
    Pepsin (pH1.3) Pn1.3 - not H,K or R not P not R F or L not P
    - not H,K or R not P F or L - not P
    Pepsin (pH>2) Pn2p - not H,K or R not P not R F,L,W or Y not P
    - not H,K or R not P F,L,W or Y - not P
    Proline-endopeptidase[*] Prol - - H,K or R P not P -
    Proteinase K ProtK - - - A,E,F,I,L,T,V,W or Y - -
    Staphylococcal peptidase I Staph - - not E E - -
    Tobacco etch virus protease TEV - Y - Q G or S -
    Thermolysin Therm - - - not D or E A,F,I,L,M or V not P
    Thrombin Throm - - G R G -
    A,F,G,I,L,T,V or M A,F,G,I,L,T,V,W or R P R not D or E not D or E
    Trypsin Tryps - - - K or R not P -
    - - W K P -
    - - M R P -

    *注:脯氨酸内肽酶仅能切割序列不超过30个氨基酸的底物。一种特殊的β螺旋结构域调控蛋白质水解:参见 Fulop 等,1998 年。

    Trypsin Exceptions (Blocking Rules)

    Enzyme Name P4 P3 P2 P1 P1’ P2’
    Trypsin - - C or D K D -
    Trypsin - - C K H or Y -
    Trypsin - - C R K -
    Trypsin - - R R H or R -

    参数说明

    Input File

    上传蛋白的序列文件,只能提交单链序列,FASTA格式

    Enzymes

    选择切割的切割酶或化学物质,输入all表示选择全部,同时支持多个输入,输入方式如:Tryps;Ch_hi(输出对应的缩写,使用;分隔)。仅限上方切割规则表中酶和化学物质。

    结果说明

    输出All_in_One.csv,内容为输入序列中的切割点表,内容如下:

    Chain ID Name of enzyme No. of cleavages Positions of cleavage sites
    seq_1 Arg-C proteinase 1 14
    seq_1 Asp-N endopeptidase 1 2
    seq_1 Asp-N endopeptidase + N-terminal Glu 2 2, 6
    seq_1 BNPS-Skatole 1 4
    seq_1 Chymotrypsin-high specificity (C-term to [FYW], not before P) 1 4
    seq_1 Chymotrypsin-low specificity (C-term to [FYWML], not before P) 2 4, 6
    seq_1 Clostripain (Clostridiopeptidase B) 1 14
    seq_1 Formic acid 1 3
    seq_1 Glutamyl endopeptidase 1 7
    seq_1 Iodosobenzoic acid 1 4
    seq_1 LysC 1 1
    seq_1 Neutrophil elastase 1 13
    seq_1 NTCB (2-nitro-5-thiocyanobenzoic acid) 1 4
    seq_1 Pepsin (pH>2) 1 4
    seq_1 Proteinase K 5 2, 4, 7, 8, 13
    seq_1 Staphylococcal peptidase I 1 7
    seq_1 Thermolysin 2 1, 12
    seq_1 Trypsin 1 1
    seq_2 Asp-N endopeptidase + N-terminal Glu 3 2, 11, 17
    seq_2 BNPS-Skatole 2 2, 16
    seq_2 Chymotrypsin-high specificity (C-term to [FYW], not before P) 5 2, 5, 11, 16, 17
    seq_2 Chymotrypsin-low specificity (C-term to [FYWML], not before P) 9 1, 2, 5, 6, 9, 10, 11, 16, 17
    seq_2 CNBr 1 6
    seq_2 Glutamyl endopeptidase 3 3, 12, 18
    seq_2 Iodosobenzoic acid 2 2, 16
    seq_2 LysC 2 13, 15
    seq_2 LysN 2 12, 14
    seq_2 Neutrophil elastase 1 4
    seq_2 NTCB (2-nitro-5-thiocyanobenzoic acid) 1 13
    seq_2 Pepsin (pH1.3) 3 4, 5, 10
    seq_2 Pepsin (pH>2) 4 4, 5, 10, 16
    seq_2 Proteinase K 10 1, 2, 3, 4, 5, 11, 12, 16, 17, 18
    seq_2 Staphylococcal peptidase I 3 3, 12, 18
    seq_2 Thermolysin 3 4, 5, 10
    seq_2 Trypsin 2 13, 15

    说明:

    字段 说明
    Chain ID 序列名称。 如果名称有重复时,会在原名称上添加上_dup1、如:A_dup1,1对应就是重复的次数。
    Name of enzyme 蛋白酶/化学试剂名称,用于标识采用哪一种切割规则(例如 Arg-C proteinase、Asp-N endopeptidase、BNPS-Skatole、CNBr 等)。
    No. of cleavages 该酶/试剂在对应序列上预测到的切割次数(切割位点数量)。应与 Positions of cleavage sites 中列出的位点个数一致。
    Positions of cleavage sites 切割位点在序列中的位置编号列表。用逗号 + 空格分隔(例如 2, 4, 7)。酶或化学试剂的切割发生在序列对应位置之后。

    输出All_in_One.html,内容将包含所有链的切割信息。

    输出clvg_site_pred_results.tar.gz,包含所有序列各自的csv以及HTML报告结果。

    参考文献

    • Gasteiger, E. et al. (2005). Protein Identification and Analysis Tools on the ExPASy Server. In: Walker, J.M. (eds) The Proteomics Protocols Handbook. Springer Protocols Handbooks. Humana Press. https://doi.org/10.1385/1-59259-890-0:571

    Cleavage Site Prediction (PeptideCutter)

    Predict potential protease or chemical reagent cleavage sites in protein sequences.
    This module is reproduced based on the literature associated with the PeptideCutter tool.
    PeptideCutter is a professional bioinformatics tool provided by the ExPASy platform of the Swiss Institute of Bioinformatics (SIB).

    The supported protease and chemical reagent cleavage rules are listed below:

    Enzyme / Chemical Name Abbrev P4 P3 P2 P1 P1’ P2’
    Arg-C proteinase ArgC - - - R - -
    Asp-N endopeptidase AspN - - - - D -
    Asp-N endopeptidase + N-terminal Glu AspN+AspGluN - - - - D or E -
    BNPS-Skatole BNPS - - - W - -
    Caspase 1 Casp1 F,W,Y or L - H,A or T D not P,E,D,Q,K or R -
    Caspase 2 Casp2 D V A D not P,E,D,Q,K or R -
    Caspase 3 Casp3 D M Q D not P,E,D,Q,K or R -
    Caspase 4 Casp4 L E V D not P,E,D,Q,K or R -
    Caspase 5 Casp5 L or W E H D - -
    Caspase 6 Casp6 V E H or I D not P,E,D,Q,K or R -
    Caspase 7 Casp7 D E V D not P,E,D,Q,K or R -
    Caspase 8 Casp8 I or L E T D not P,E,D,Q,K or R -
    Caspase 9 Casp9 L E H D - -
    Caspase 10 Casp10 I E A D - -
    Chymotrypsin-high specificity (C-term to [FYW], not before P) Ch_hi - - - F or Y not P -
    - - - W not P -
    Chymotrypsin-low specificity (C-term to [FYWML], not before P) Ch_lo - - - F,L or Y not P -
    - - - W not M or P -
    - - - M not P or Y -
    - - - H not D,M,P or W -
    Clostripain (Clostridiopeptidase B) Clost - - - R - -
    CNBr CNBr - - - M - -
    Enterokinase EK D or E D or E D or E K - -
    Factor Xa FXa A,F,G,I,L,T,V or M D or E G R - -
    Formic acid HCOOH - - - D - -
    Glutamyl endopeptidase GluC - - - E - -
    GranzymeB GzmB I E P D - -
    Hydroxylamine (NH2OH) Hydro - - - N G -
    Iodosobenzoic acid Iodo - - - W - -
    LysC LysC - - - K - -
    LysN LysN - - - - K -
    Neutrophil elastase Elast - - - A or V - -
    NTCB (2-nitro-5-thiocyanobenzoic acid) NTCB - - - - C -
    Pepsin (pH1.3) Pn1.3 - not H,K or R not P not R F or L not P
    - not H,K or R not P F or L - not P
    Pepsin (pH>2) Pn2p - not H,K or R not P not R F,L,W or Y not P
    - not H,K or R not P F,L,W or Y - not P
    Proline-endopeptidase[*] Prol - - H,K or R P not P -
    Proteinase K ProtK - - - A,E,F,I,L,T,V,W or Y - -
    Staphylococcal peptidase I Staph - - not E E - -
    Tobacco etch virus protease TEV - Y - Q G or S -
    Thermolysin Therm - - - not D or E A,F,I,L,M or V not P
    Thrombin Throm - - G R G -
    A,F,G,I,L,T,V or M A,F,G,I,L,T,V,W or R P R not D or E not D or E
    Trypsin Tryps - - - K or R not P -
    - - W K P -
    - - M R P -

    *Note: Proline endopeptidase can only cleave substrates whose sequences do not exceed 30 amino acids.
    An unusual β-propeller domain regulates proteolysis; see Fulop et al., 1998.


    Trypsin Exceptions (Blocking Rules)

    Enzyme Name P4 P3 P2 P1 P1’ P2’
    Trypsin - - C or D K D -
    Trypsin - - C K H or Y -
    Trypsin - - C R K -
    Trypsin - - R R H or R -
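    As an illustration of how the rule tables are read, the sketch below applies a simplified trypsin rule: cleave after K or R (P1) unless P1’ is P, and skip any site that matches one of the blocking rules above. It is a minimal sketch, not the module's code. The function name trypsin_sites is hypothetical, and the additional trypsin rows of the main table that involve W or M at P2 are deliberately not modelled.

```python
# Blocking rules from the table above: (allowed P2, P1, allowed P1').
BLOCKING_RULES = [
    ("CD", "K", "D"),
    ("C", "K", "HY"),
    ("C", "R", "K"),
    ("R", "R", "HR"),
]


def trypsin_sites(seq):
    """Return 1-based positions after which cleavage is predicted.

    Simplified rule set (an assumption of this sketch): P1 is K or R,
    P1' is not P, and no blocking rule matches.
    """
    sites = []
    for i, p1 in enumerate(seq[:-1]):  # the last residue has no P1'
        if p1 not in "KR":
            continue
        p1_prime = seq[i + 1]
        p2 = seq[i - 1] if i > 0 else ""
        if p1_prime == "P":
            continue
        blocked = any(
            p2 and p2 in rule_p2 and p1 == rule_p1 and p1_prime in rule_p1p
            for rule_p2, rule_p1, rule_p1p in BLOCKING_RULES
        )
        if not blocked:
            sites.append(i + 1)
    return sites


if __name__ == "__main__":
    print(trypsin_sites("AKRPGCKDA"))
```

    Positions are reported as the residue after which cleavage occurs, which follows the convention used for the Positions of cleavage sites field.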

    Parameters

    Input File

    Upload the protein sequence file.
    Only single-chain sequences are supported, and the file must be in FASTA format.

    Enzymes

    Select the proteases or chemical reagents used for cleavage. Enter all to select every enzyme and reagent. To select several, enter their abbreviations from the Abbrev column separated by ;, for example:

    Tryps;Ch_hi

    Only the enzymes and chemicals listed in the cleavage rule table above are accepted.
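    A value for the Enzymes parameter could be validated as in the following sketch. The abbreviations are copied from the Abbrev column of the rule table; the function name parse_enzymes and the error handling are assumptions, and the module may treat whitespace or letter case differently.

```python
ABBREVS = (
    "ArgC", "AspN", "AspN+AspGluN", "BNPS",
    "Casp1", "Casp2", "Casp3", "Casp4", "Casp5",
    "Casp6", "Casp7", "Casp8", "Casp9", "Casp10",
    "Ch_hi", "Ch_lo", "Clost", "CNBr", "EK", "FXa", "HCOOH",
    "GluC", "GzmB", "Hydro", "Iodo", "LysC", "LysN", "Elast",
    "NTCB", "Pn1.3", "Pn2p", "Prol", "ProtK", "Staph", "TEV",
    "Therm", "Throm", "Tryps",
)


def parse_enzymes(text):
    """Turn 'all' or a ';'-separated abbreviation list into a list of names."""
    text = text.strip()
    if text == "all":
        return list(ABBREVS)
    chosen = [item.strip() for item in text.split(";") if item.strip()]
    unknown = [item for item in chosen if item not in ABBREVS]
    if unknown:
        raise ValueError(f"unknown abbreviation(s): {', '.join(unknown)}")
    return chosen


if __name__ == "__main__":
    print(parse_enzymes("Tryps;Ch_hi"))
```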


    Results

    The output All_in_One.csv contains the predicted cleavage sites for the input sequences.

    Chain ID Name of enzyme No. of cleavages Positions of cleavage sites
    seq_1 Arg-C proteinase 1 14
    seq_1 Asp-N endopeptidase 1 2
    seq_1 Asp-N endopeptidase + N-terminal Glu 2 2, 6
    seq_1 BNPS-Skatole 1 4
    seq_1 Chymotrypsin-high specificity (C-term to [FYW], not before P) 1 4
    … … … …

    (remaining rows unchanged)

    Field Description
    Chain ID The sequence name. If duplicate names appear, a suffix such as _dup1 will be added (e.g., A_dup1), where 1 represents the duplication count.
    Name of enzyme Name of the protease or chemical reagent, identifying which cleavage rule was applied (e.g., Arg-C proteinase, Asp-N endopeptidase, BNPS-Skatole, CNBr).
    No. of cleavages The number of predicted cleavage events in the sequence for the given enzyme or reagent. This number should match the count of sites listed in Positions of cleavage sites.
    Positions of cleavage sites A list of cleavage site positions within the sequence. Values are separated by comma + space (e.g., 2, 4, 7). Cleavage occurs after the corresponding residue position.

    The output All_in_One.html contains the cleavage information for all chains.

    The output clvg_site_pred_results.tar.gz contains individual CSV files and HTML reports for each sequence.


    Reference

    • Gasteiger, E. et al. (2005). Protein Identification and Analysis Tools on the ExPASy Server. In: Walker, J.M. (eds) The Proteomics Protocols Handbook. Springer Protocols Handbooks. Humana Press. https://doi.org/10.1385/1-59259-890-0:571
  • Name: Humaness Score (BioPhi)
    Description: 基于BioPhi的抗体序列人源化评分 Antibody sequence humanness evaluation using BioPhi
    Tags: undefined
    Author: David Prihoda
    Release: 2026-03-26 00:00:00
    Reference: David Prihoda, Jad Maamary, Andrew Waight, Veronica Juan, Laurence Fayadat-Dilman, Daniel Svozil & Danny A. Bitton (2022) BioPhi: A platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning, mAbs, 14:1, DOI: https://doi.org/10.1080/19420862.2021.2020203

    Humaness Score (BioPhi)

    简介

    Humaness Score (BioPhi)是抗体序列人源化评分工具,不依赖于有限的人源种系(germline)序列,而是基于海量的天然人类抗体库(Observed Antibody Space, OAS)。该数据库包含来自数百个受试者的数亿条序列,这使得它能捕捉到更丰富、更多样化的抗体序列空间。将待评估的抗体序列切割成所有可能的、长度为9个氨基酸的短肽(9-mer),将这些短肽放到庞大的OAS数据库中进行搜索,找出每个短肽在真实人类抗体库中出现的频率,以及携带该短肽的个体数量。如果一个序列中的大多数短肽在人类抗体库中都很常见,那么它的OASis评分就高,意味着“看起来很人类”,免疫原性风险较低;反之,如果含有大量在人类中罕见的短肽,则评分低,提示可能需要进一步人源化改造。

    参数说明

    Antibody Sequence

    抗体序列文件,FASTA格式,同一抗体轻重链序列名可以通过后缀.H/.L、_VH/_VL、_HC/_LC识别,如:

    Antibody1.H
    XXXX
    Antibody1.L
    XXXX
    Antibody2.H
    XXXX
    Antibody2.L
    XXXX

    支持批量,最大支持1000条序列计算,超过1000的序列会忽略。

    Numbering Scheme

    编号方案,可选值包括 kabat、chothia、imgt、aho,默认值为 kabat

    CDR Definition

    CDR 定义方法,可选值包括 kabat、chothia、imgt、north,默认值为 kabat

    Min Percent Subjects

    考虑肽段为人类的最小 OAS 主体百分比,取值范围为 1-90,默认值为 10.0

    Score File

    输出序列人源化打分文件名称,XLSX格式

    结果说明

    输出文件默认为humaness_score.xlsx,文件中包含多个SHEET,第一个Overview内容包括:

    列名 说明
    Antibody 抗体名称
    Threshold 使用的阈值,loose:宽松 (≥1% subjects),relaxed 较宽松(≥10% subjects),medium 中等(≥50% subjects),strict 严格(≥90% subjects)
    OASis Percentile 抗体整体(重链+轻链)的 OASis 百分位数,得分越高,代表该序列在人类天然抗体库中出现的频率越高
    OASis Identity 抗体序列与人类天然抗体库中最接近序列的同一性(相似度)
    Germline Content 重链+轻链的胚系含量(与人类最接近的 V/J 基因的整体相似度)
    Heavy V Germline 重链 V 基因来源
    Heavy J Germline 重链 J 基因来源
    Heavy OASis Percentile 重链的 OASis 百分位数
    Heavy OASis Identity 重链与最接近人类胚系基因的相似度
    Heavy Non-human peptides 重链检测到的非人源肽段的数量
    Heavy Germline Content 重链的胚系含量(与人类最接近的 V/J 基因的整体相似度)
    Light V Germline 轻链 V 基因来源
    Light J Germline 轻链 J 基因来源
    Light OASis Percentile 轻链的 OASis 百分位数
    Light OASis Identity 轻链与最接近人类胚系基因的相似度
    Light Non-human peptides 轻链检测到的非人源肽段的数量
    Light Germline Content 轻链的胚系含量(与人类最接近的 V/J 基因的整体相似度)

    参考文献

    • David Prihoda, Jad Maamary, Andrew Waight, Veronica Juan, Laurence Fayadat-Dilman, Daniel Svozil & Danny A. Bitton (2022) BioPhi: A platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning, mAbs, 14:1, DOI: https://doi.org/10.1080/19420862.2021.2020203

    Humaness Score (BioPhi)

    Introduction

    Humaness Score (BioPhi) evaluates the humanness of antibody sequences. Rather than relying on a limited set of human germline sequences, it draws on the Observed Antibody Space (OAS), a very large collection of natural human antibody repertoires. The database holds hundreds of millions of sequences from hundreds of subjects, so it covers a richer and more diverse antibody sequence space.
    The sequence under evaluation is cut into all possible peptides of 9 amino acids (9-mers), and each peptide is searched against OAS to determine how often it occurs in real human antibody repertoires and how many individuals carry it. If most peptides in a sequence are common in human repertoires, the sequence receives a high OASis score: it “looks human” and carries a lower immunogenicity risk. If the sequence contains many peptides that are rare in humans, the score is low, which suggests that further humanization may be needed.
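    The 9-mer procedure can be sketched as follows. This is a toy illustration, not BioPhi's implementation: the prevalence dictionary stands in for the OAS lookup, and the function name fraction_human_peptides and the example values are hypothetical.

```python
def fraction_human_peptides(seq, prevalence, min_fraction=0.10, k=9):
    """Return the fraction of k-mers in seq that count as human.

    prevalence maps a peptide to the fraction of subjects carrying it
    (a stand-in for the OAS database lookup). A peptide counts as human
    when that fraction is at least min_fraction.
    """
    peptides = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not peptides:
        raise ValueError("sequence is shorter than the peptide length")
    human = sum(1 for p in peptides if prevalence.get(p, 0.0) >= min_fraction)
    return human / len(peptides)


if __name__ == "__main__":
    # Made-up prevalence values for illustration only.
    toy_prevalence = {"EVQLVESGG": 0.50, "VQLVESGGG": 0.05}
    print(fraction_human_peptides("EVQLVESGGGL", toy_prevalence))
```

    The min_fraction argument plays the role of the Min Percent Subjects parameter, expressed here as a fraction rather than a percentage.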

    Parameters

    Antibody Sequence

    Antibody sequence file in FASTA format. For the same antibody, heavy and light chain sequences can be identified using suffixes such as .H/.L, _VH/_VL, or _HC/_LC. Example:

    Antibody1.H
    XXXX
    Antibody1.L
    XXXX
    Antibody2.H
    XXXX
    Antibody2.L
    XXXX

    Batch processing is supported, up to a maximum of 1,000 sequences. Sequences beyond this limit are ignored.
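    Pairing heavy and light chains by name suffix could look like the sketch below. The suffix list comes from the description above; the function name pair_chains, the case-sensitive matching, and the handling of unmatched names are assumptions.

```python
import re

# A base name followed by .H/.L or by _VH/_VL/_HC/_LC.
SUFFIX = re.compile(r"^(?P<name>.+?)(?:\.(?P<dot>[HL])|_(?P<tag>VH|VL|HC|LC))$")


def pair_chains(names):
    """Group sequence names into {antibody: {'heavy': ..., 'light': ...}}.

    Names without a recognised suffix are skipped in this sketch.
    """
    pairs = {}
    for name in names:
        match = SUFFIX.match(name)
        if not match:
            continue
        is_heavy = match.group("dot") == "H" or match.group("tag") in ("VH", "HC")
        chain = "heavy" if is_heavy else "light"
        pairs.setdefault(match.group("name"), {})[chain] = name
    return pairs


if __name__ == "__main__":
    print(pair_chains(["Antibody1.H", "Antibody1.L", "mAb2_VH", "mAb2_VL"]))
```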

    Numbering Scheme

    Numbering scheme. Options: kabat, chothia, imgt, aho. Default: kabat.

    CDR Definition

    CDR definition method. Options: kabat, chothia, imgt, north. Default: kabat.

    Min Percent Subjects

    Minimum percentage of OAS subjects in which a peptide must be found for it to be considered human. Range: 1-90. Default: 10.0.

    Score File

    Name of the output file containing the humanness scores of the sequences, in XLSX format.

    Results

    The default output file is humaness_score.xlsx, which contains multiple sheets. The first sheet, “Overview,” includes the following columns:

    Column Description
    Antibody Antibody name
    Threshold Threshold used: loose (≥1% subjects), relaxed (≥10% subjects), medium (≥50% subjects), strict (≥90% subjects)
    OASis Percentile Overall (heavy + light chain) OASis percentile of the antibody. A higher value means the sequence is more human-like with respect to natural human antibody repertoires.
    OASis Identity Fraction of the antibody's 9-mer peptides (heavy + light chain) that are considered human at the chosen threshold.
    Germline Content Overall germline content (heavy + light chain) – the overall similarity to the closest human V/J genes.
    Heavy V Germline V gene origin for the heavy chain.
    Heavy J Germline J gene origin for the heavy chain.
    Heavy OASis Percentile OASis percentile for the heavy chain.
    Heavy OASis Identity Fraction of the heavy chain's 9-mer peptides that are considered human at the chosen threshold.
    Heavy Non-human peptides Number of non-human peptides detected in the heavy chain.
    Heavy Germline Content Germline content for the heavy chain – similarity to the closest human V/J genes.
    Light V Germline V gene origin for the light chain.
    Light J Germline J gene origin for the light chain.
    Light OASis Percentile OASis percentile for the light chain.
    Light OASis Identity Fraction of the light chain's 9-mer peptides that are considered human at the chosen threshold.
    Light Non-human peptides Number of non-human peptides detected in the light chain.
    Light Germline Content Germline content for the light chain – similarity to the closest human V/J genes.

    References

    • David Prihoda, Jad Maamary, Andrew Waight, Veronica Juan, Laurence Fayadat-Dilman, Daniel Svozil & Danny A. Bitton (2022) BioPhi: A platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning, mAbs, 14:1, DOI: https://doi.org/10.1080/19420862.2021.2020203
  • Name: Directed Evolution Library Analyzer
    Description: 通过DNA序列比对分析序列中对应的突变信息 Analyze the corresponding mutation information in the sequence through DNA sequence alignment.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-04-01 00:00:00
    Reference:

    Directed Evolution Library Analyzer

    进行DNA序列比对(查询序列VS模板序列),分析序列突变(相对于模板序列),并给出翻译的氨基酸序列,及其对应的残基突变(相对于模板序列)。

    参数说明

    Template

    DNA模板序列,fasta格式,支持多条。

    Sequence

    DNA查询序列,fasta格式文件或.seq文本文件(每个文件单一序列),支持批量,以打包压缩文件上传即可。

    Output

    比对结果文件名。默认:dna_analysis_res.csv

    结果说明

    输出dna_analysis_res.csv,字段说明:

    字段名 说明
    query 查询序列名称,取FASTA文件中的序列ID或.seq 文件名
    target 命中的模板序列名称,对应模板 FASTA 中的序列 ID
    template_nt_seq 模板的对齐序列(命中区域),可能包含 gap(-)
    matched_nt_seq 查询序列与模板对齐的部分(命中区域),可能包含 gap(-)
    identity 序列一致性百分比,例如 99.300,单位 %
    template_aa_aln 基于模板的对齐序列,翻译的氨基酸序列,可能包含 gap(-)
    matched_aa_aln 基于查询序列的对齐序列,翻译的氨基酸序列,可能包含 gap(-)
    nt_mutations 核酸突变列表(相对模板)
    aa_mutations 氨基酸突变列表(相对模板)

    突变表示规则

    表示格式 含义
    A123G 第 123 位由 A 替换为 G
    del123A 删除模板第 123 位的 A
    ins123_T 在模板第 123 位之后插入 T

    Directed Evolution Library Analyzer

    Aligns DNA query sequences against template sequences, reports the nucleotide mutations relative to the template, and provides the translated amino acid sequences together with the corresponding residue mutations (also relative to the template).

    Parameters

    Template

    DNA template sequences, FASTA format, supports multiple sequences.

    Sequence

    DNA query sequences, as a FASTA file or a .seq text file (one sequence per file). Batch processing is supported: upload the files as a compressed archive.

    Output

    Alignment results filename. Default: dna_analysis_res.csv

    Results

    The output file dna_analysis_res.csv contains the following fields:

    Field Name Description
    query Query sequence name, taken from the sequence ID in the FASTA file or the .seq filename
    target Matched template sequence name, corresponding to the sequence ID in the template FASTA
    template_nt_seq Aligned template sequence (hit region), may contain gaps (-)
    matched_nt_seq Part of the query sequence aligned with the template (hit region), may contain gaps (-)
    identity Sequence identity percentage, e.g., 99.300, unit %
    template_aa_aln Translated amino acid sequence based on the aligned template sequence, may contain gaps (-)
    matched_aa_aln Translated amino acid sequence based on the aligned query sequence, may contain gaps (-)
    nt_mutations List of nucleotide mutations (relative to the template)
    aa_mutations List of amino acid mutations (relative to the template)

    Mutation Representation Rules

    Representation Format Meaning
    A123G Substitution of A with G at position 123
    del123A Deletion of A at position 123 of the template
    ins123_T Insertion of T after position 123 of the template
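
    The notation above can be illustrated with a minimal sketch that derives a mutation list from a pair of aligned sequences such as template_nt_seq and matched_nt_seq. The function name call_mutations and its start parameter are hypothetical, and the tool's own numbering details (for example how consecutive inserted bases are grouped, or where the hit region starts) are not documented here, so this is an assumption-laden illustration rather than the tool's implementation.

```python
def call_mutations(template_aln: str, query_aln: str, start: int = 1) -> list[str]:
    """List differences between two aligned sequences, relative to the template.

    `start` is the template position of the first aligned column
    (assumed to be 1; the real tool may number differently).
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    mutations = []
    pos = start - 1  # template position of the last template base seen
    for t, q in zip(template_aln.upper(), query_aln.upper()):
        if t == "-" and q == "-":
            continue  # a column that is a gap in both rows carries no information
        if t == "-":
            # the query has an extra base after template position `pos`
            mutations.append(f"ins{pos}_{q}")
            continue
        pos += 1
        if q == "-":
            mutations.append(f"del{pos}{t}")   # template base missing in the query
        elif q != t:
            mutations.append(f"{t}{pos}{q}")   # substitution
    return mutations


# Toy alignment, not real data.
print(call_mutations("ACG-TA", "ATGCT-"))  # ['C2T', 'ins3_C', 'del5A']
```

    Each inserted base is reported as its own entry in this sketch; merging adjacent insertions into one entry would be an equally reasonable design choice.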
  • Name: Substrate Specificity Prediction (EZSpecificity)
    Description: EZSpecificity 是用于酶-底物特异性预测的模型化工具,目标是为实验筛选提供优先级排序。 EZSpecificity is a modeling tool for enzyme-substrate specificity prediction, designed to provide priority ranking for experimental screening.
    Tags: undefined
    Author: Haiyang Cui
    Release: 2025-10-08 00:00:00
    Reference: Enzyme specificity prediction using cross-attention graph neural networks. Nature, 2025.

    Substrate Specificity Prediction (EZSpecificity)

    简介

    EZSpecificity 是用于酶-底物特异性预测的模型化工具,目标是为实验筛选提供优先级排序。
    它要解决的问题是:在候选组合数量较大时,如何优先挑出更可能发生反应的酶-底物对,从而降低实验试错成本。
    根据论文报告,EZSpecificity 在未知酶/未知底物等外推场景下,相比对照方法(如 ESP)表现更稳定,并在卤化酶案例中给出更高的 Top-1 命中率。
    因此,它的定位是"实验前的筛选与排序工具",而不是"替代实验的最终判定工具"。
    EZSpecificity 的核心思想是联合利用三类信息:

    1. 序列信息:通过蛋白语言模型(ESM)提取酶序列表示,提供全局蛋白特征。
    2. 结构信息:将活性位点附近的酶-底物原子关系建模为 3D 图,捕获局部几何和化学环境。
    3. 交互信息:通过 cross-attention 强化"关键残基-关键原子"的对应关系,避免仅做简单特征拼接。
      最终模型输出一个分数用于排名。

    参数说明

    Batch Sequence Prediction

    适用于大规模筛选场景,输入酶序列文件和底物列表,自动生成 N×N 组合进行预测。

    Enzyme Sequence

    酶的序列文件,FASTA格式,支持多条序列,必选项

    >enzyme_405
    MLPLQDFPKFTAAAVQASPVFLDAHKTAQKAVDLIAEAAGNGAELVVFPEVF...
    >enzyme_483
    MQTRKIVRAAAVQAASPNYDLATGVDKTIELARQARDEGCDLIVFGETWL...
    

    Substrate Smile

    底物分子的结构信息,支持.smi格式,必选项

    substrate_smiles
    N#CC1=NC=CC=C1 sample_1
    N#CCC1=CC=CC=C1 sample_2
    

    Score File

    输出的结果文件, CSV 格式(默认文件名 predicted_scores.csv)

    Batch Structure Prediction

    适用于使用复合物结构进行预测的场景,通过 CSV 文件指定底物与结构的对应关系。

    Input CSV

    输入的CSV文件,文件中需包含底物结构substrate_smiles 列和复合物结构名称complex_name 列,必选项

    substrate_smiles,complex_name
    N#CC1=NC=CC=C1,complex_405
    N#CCC1=CC=CC=C1,complex_483
    

    Complex Structure

    复合物结构压缩包,支持 .zip/.tar/.tar.gz/.tgz/.tar.bz2/.tar.xz 格式,必选项
    压缩包内应包含与 complex_name 对应的 PDB 文件(如 complex_405.pdb)

    Smiles Column

    输入的CSV文件中SMILES列的列名,必选项

    Complex Column

    输入的CSV文件中复合物名称列的列名,必选项

    Score File

    输出的结果文件, CSV 格式(默认文件名 predicted_scores.csv)

    Error CSV

    解析后记录错误的列表文件(如有),CSV格式(默认文件名match_errors.csv)

    结果说明

    Batch Sequence Prediction

    输出的结果文件, CSV 格式(默认文件名 predicted_scores.csv)

    列名 说明
    substrate_smiles 底物的 SMILES 字符串
    enzyme_sequence 酶的氨基酸序列
    score 预测打分值,数值越高表示该酶-底物对越可能发生反应

    Batch Structure Prediction

    输出的结果文件, CSV 格式(默认文件名 predicted_scores.csv)

    列名 说明
    substrate_smiles 底物的 SMILES 字符串
    enzyme_sequence 酶的氨基酸序列
    score 预测打分值,数值越高表示该酶-底物对越可能发生反应
    complex_name 复合物结构名称
    matched_complex_file 根据复合物名称匹配到的结构文件

    解析后记录错误的列表文件(如有),CSV格式(默认文件名match_errors.csv)

    列名 说明
    complex_name 匹配失败的复合物结构名称
    error 匹配失败的原因或错误信息

    如何理解结果

    1. 分数越高,表示该酶-底物对在当前模型下更值得优先验证
    2. 该分数本质是排序分数,不应直接等同于"反应一定发生"
    3. 建议将输出用于候选排序,按分数从高到低组织实验顺序(Top-K 优先)

    注意事项

    1. 适用边界:本工具用于特异性筛选与排序,不直接提供反应机理、位点选择性或立体选择性的确定结论。
    2. 输入质量影响显著:SMILES 合法性、结构文件质量与匹配程度会直接影响预测稳定性。
    3. 结果使用原则:建议在同一任务上下文内做相对比较,不建议跨任务直接比较绝对分数阈值。

    参考文献

    • Enzyme specificity prediction using cross-attention graph neural networks. Nature, 2025. DOI: 10.1038/s41586-025-09697-2

    Substrate Specificity Prediction (EZSpecificity)

    Introduction

    EZSpecificity is a modeling tool for enzyme-substrate specificity prediction, designed to provide priority ranking for experimental screening.

    It addresses the following problem: when the number of candidate combinations is large, which enzyme-substrate pairs are most likely to react and should be tested first, so that experimental trial-and-error costs are reduced?

    According to the paper, EZSpecificity performs more stably than baseline methods (such as ESP) in extrapolation scenarios involving unknown enzymes or unknown substrates, and achieves a higher Top-1 hit rate in the halogenase case study.

    It is therefore positioned as a “pre-experiment screening and ranking tool” rather than a “final decision tool that replaces experiments”.

    The core idea of EZSpecificity is to jointly utilize three types of information:

    1. Sequence Information: Extract enzyme sequence representations through protein language models (ESM) to provide global protein features.
    2. Structure Information: Model enzyme-substrate atomic relationships near the active site as 3D graphs to capture local geometric and chemical environments.
    3. Interaction Information: Strengthen the correspondence between “key residues - key atoms” through cross-attention, avoiding simple feature concatenation.

    The final model outputs a score for ranking.

    Parameters

    Batch Sequence Prediction

    Suitable for large-scale screening. Given an enzyme sequence file and a substrate list, every enzyme is automatically paired with every substrate for prediction (N enzymes × M substrates).

    Enzyme Sequence

    Enzyme sequence file in FASTA format, supporting multiple sequences. Required.

    >enzyme_405
    MLPLQDFPKFTAAAVQASPVFLDAHKTAQKAVDLIAEAAGNGAELVVFPEVF...
    >enzyme_483
    MQTRKIVRAAAVQAASPNYDLATGVDKTIELARQARDEGCDLIVFGETWL...
    

    Substrate Smile

    Substrate structures as SMILES strings, in .smi format. Required.

    substrate_smiles
    N#CC1=NC=CC=C1 sample_1
    N#CCC1=CC=CC=C1 sample_2
    

    Score File

    Output result file in CSV format (default filename: predicted_scores.csv).
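
    As a rough illustration of the pairing step, the sketch below reads a FASTA text and a .smi text and builds every enzyme-substrate combination. The helper names read_fasta, read_smi, and all_pairs are hypothetical, the sequences are toy placeholders, and the handling of the substrate_smiles header line is an assumption based on the example above; the tool's own parsing may differ.

```python
from itertools import product


def read_fasta(text: str) -> list[tuple[str, str]]:
    """Return (name, sequence) tuples from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name = (line[1:].split() or [""])[0]  # first word of the header
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def read_smi(text: str, header: str = "substrate_smiles") -> list[tuple[str, str]]:
    """Return (smiles, name) tuples; the header line is skipped if present."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] == header:
            continue
        smiles = parts[0]
        name = parts[1] if len(parts) > 1 else smiles
        rows.append((smiles, name))
    return rows


def all_pairs(fasta_text: str, smi_text: str):
    """Pair every enzyme with every substrate."""
    return list(product(read_fasta(fasta_text), read_smi(smi_text)))


# Toy inputs (placeholder sequences, not real enzymes).
FASTA = ">enzyme_A\nMAAA\n>enzyme_B\nMGGG\n"
SMI = "substrate_smiles\nN#CC1=NC=CC=C1 sample_1\nN#CCC1=CC=CC=C1 sample_2\n"

for (enzyme, _seq), (smiles, sample) in all_pairs(FASTA, SMI):
    print(enzyme, sample, smiles)
```

    With 2 enzymes and 2 substrates this yields 4 pairs; the number of predictions grows as the product of the two list sizes.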

    Batch Structure Prediction

    Suitable for prediction scenarios using complex structures. Specifies the correspondence between substrates and structures via a CSV file.

    Input CSV

    Input CSV file containing substrate_smiles column and complex_name column. Required.

    substrate_smiles,complex_name
    N#CC1=NC=CC=C1,complex_405
    N#CCC1=CC=CC=C1,complex_483
    

    Complex Structure

    Complex structure archive, supporting .zip/.tar/.tar.gz/.tgz/.tar.bz2/.tar.xz formats. Required.

    The archive should contain PDB files corresponding to complex_name (e.g., complex_405.pdb).

    Smiles Column

    Column name for SMILES in the input CSV file. Required.

    Complex Column

    Column name for complex names in the input CSV file. Required.

    Score File

    Output result file in CSV format (default filename: predicted_scores.csv).

    Error CSV

    File listing the records that failed to match after parsing (if any), in CSV format (default filename: match_errors.csv).
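
    Before submitting, it can help to check that every complex_name in the input CSV has a structure file in the archive. The sketch below does this for a .zip archive held in memory. The function match_complexes, the matching rule (file stem equals complex_name, .pdb extension), and the error message text are all assumptions for illustration; the tool's actual matching logic is not documented here.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath


def match_complexes(csv_text: str, archive_bytes: bytes,
                    complex_col: str = "complex_name"):
    """Split CSV rows into matched (name, file) pairs and unmatched (name, error) pairs."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # Map "complex_405" -> "some/dir/complex_405.pdb"
        by_stem = {PurePosixPath(n).stem: n
                   for n in zf.namelist() if n.lower().endswith(".pdb")}
    matched, errors = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row[complex_col]
        if name in by_stem:
            matched.append((name, by_stem[name]))
        else:
            errors.append((name, "no matching PDB file in archive"))
    return matched, errors


def make_demo_zip(names):
    """Build a tiny in-memory zip with placeholder file contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for n in names:
            zf.writestr(n, "REMARK placeholder\n")
    return buf.getvalue()


CSV_TEXT = ("substrate_smiles,complex_name\n"
            "N#CC1=NC=CC=C1,complex_405\n"
            "N#CCC1=CC=CC=C1,complex_483\n")

matched, errors = match_complexes(CSV_TEXT, make_demo_zip(["complex_405.pdb"]))
print(matched)
print(errors)
```

    Here only complex_405 is matched; complex_483 would be the kind of entry expected to appear in the error file.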

    Results

    Batch Sequence Prediction

    Output result file in CSV format (default filename: predicted_scores.csv):

    Column Name Description
    substrate_smiles SMILES string of the substrate
    enzyme_sequence Amino acid sequence of the enzyme
    score Predicted score; higher values indicate higher likelihood of reaction
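
    Since the score is meant for ranking, a typical next step is to sort the output and keep the top candidates. The sketch below assumes the column names listed above; the function top_k is hypothetical and the score values in SAMPLE are made up for illustration.

```python
import csv
import io


def top_k(csv_text: str, k: int) -> list[dict]:
    """Return the k rows with the highest score, best first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["score"]), reverse=True)
    return rows[:k]


# Made-up scores and placeholder sequences, for illustration only.
SAMPLE = """substrate_smiles,enzyme_sequence,score
N#CC1=NC=CC=C1,MAAA,0.12
N#CCC1=CC=CC=C1,MAAA,0.87
N#CC1=NC=CC=C1,MGGG,0.55
"""

for row in top_k(SAMPLE, 2):
    print(row["score"], row["substrate_smiles"], row["enzyme_sequence"])
```

    To rank a real run, pass the contents of predicted_scores.csv instead of SAMPLE.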

    Batch Structure Prediction

    Output result file in CSV format (default filename: predicted_scores.csv):

    Column Name Description
    substrate_smiles SMILES string of the substrate
    enzyme_sequence Amino acid sequence of the enzyme
    score Predicted score; higher values indicate higher likelihood of reaction
    complex_name Complex structure name
    matched_complex_file Matched structure file based on complex name

    File listing the records that failed to match after parsing (if any), in CSV format (default filename: match_errors.csv):

    Column Name Description
    complex_name Complex structure name that failed to match
    error Reason or error information for the matching failure
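
    How to Interpret the Results

    1. A higher score means the enzyme-substrate pair deserves higher priority for validation under the current model.
    2. The score is essentially a ranking score and should not be read as a guarantee that the reaction occurs.
    3. Use the output to rank candidates and order experiments from the highest score down (Top-K first).

    Notes

    1. Scope: this tool is for specificity screening and ranking; it does not give definitive conclusions on reaction mechanism, site selectivity, or stereoselectivity.
    2. Input quality matters: the validity of the SMILES strings, the quality of the structure files, and how well they match directly affect prediction stability.
    3. Using the results: compare scores relative to one another within the same task; comparing absolute score thresholds across tasks is not recommended.

    References

    • Enzyme specificity prediction using cross-attention graph neural networks. Nature, 2025. DOI: 10.1038/s41586-025-09697-2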
  • Name: Structure Prediction (FKSFold-Chai)
    Description: 基于Chai-1开发的针对分子胶复合物体系的结构预测模型。 Multi-body structure prediction and molecular dynamics simulation tool built upon the Chai-1 algorithm, specifically designed for molecular glue complex systems.
    Tags: undefined
    Author: YDS Pharmatech, Inc.
    Release: 2025-10-15 15:11:22
    Reference: Chai-1: Decoding the molecular interactions of life. Chai Discovery, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, Kevin Wu. doi: 10.1101/2024.10.10.615955; FKSFold: Improving AlphaFold3-Type Predictions of Molecular Glue-Induced Ternary Complexes with Feynman-Kac-Steered Diffusion. Jian Shen, Shengmin Zhou, Xing Che. bioRxiv, doi: 10.1101/2025.05.03.651455

    Structure Prediction (FKSFold-Chai)

    简介

    基于Chai-1开发的针对分子胶复合物体系的结构预测模型。
    通过融合Feynman-Kac(FK)随机控制理论与AlphaFold3的扩散模型,引入界面预测TM-score(ipTM),在生成过程中实时评估蛋白质-蛋白界面质量,并通过FK公式派生的指导术语来修改反向扩散过程,优先保留高分结构,同时,使用FK框架能够将采样偏向于生物物理学上合理的构象,而无需对底层模型进行广泛的重新训练或损害生成结构的多样性。该方法成功预测了八个分子胶案例中的三种,其RMSD均小于3Å。

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式,支持多条序列。
    注意:多蛋白复合物结构预测,其氨基酸序列输入格式如下:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    DNA核酸的序列文件,FASTA格式,支持多条序列。

    RNA Sequence

    RNA核酸分子的序列文件,FASTA格式,支持多条序列。

    备注:当前24GB的GPU显存能计算的残基/碱基数量在2048个左右。

    在Protein、DNA、RNA序列中,都支持残基或碱基的修饰,用CCD进行定义,CCD的介绍参考https://www.wwpdb.org/data/ccd 编号查询网址为https://www.ebi.ac.uk/pdbe-srv/pdbechem/
    定义残基或碱基修饰时,直接在序列中用英文括号‘()’包含CCD code即可,示例如下:

    >seq
    (ACE)GQLEEIAK
    

    表示在序列的N端发生了乙酰化;

    >seq
    AGSHSMRYFSTSVSR(HY3)GRGEPRFIAVG
    

    表示序列中的残基P发生了羟基化修饰,变成HY3(CCD code)

    Ligand

    文本文件包含小分子的结构信息,用SMILES格式,支持多个小分子,每行放置一个,示例如下:

    CC(=O)OC1C[NH+]2CCC1CC2
    [Mg+2]
    

    注意:不适用于配体蛋白或多肽的氨基酸序列格式输入。

    Restraints

    包含残基间距离限制信息的文本文件。距离限制的类型有两种:两个残基间的距离限制,一个残基与一条链之间的距离限制。

    两个残基间的距离限制的定义由五部分组成:

    • 残基1所在序列的顺序编号(序列的顺序编号,是依次按上述参数Protein、DNA、RNA中的序列顺序与数量,从1开始进行编号,例如:当有2条蛋白序列,1条DNA序列,1条RNA序列时,各序列对应的编号为:第一条蛋白序列编号为1,第二条蛋白序列编号为2,DNA序列编号为3,RNA序列编号为4)
    • 残基1的符号及位置编号(如:R84表示84号残基R)
    • 残基2所在序列的顺序编号
    • 残基2的符号及位置编号
    • 残基间的最大距离(单位为埃)

    五部分由逗号分隔,例如:1,R84,3,G7,10.0
    表示第1条序列中的84号残基R,与第3条序列中的7号残基G,之间的最大距离为10.0埃。

    一个残基与一条链之间的距离限制表示该残基与链中任意一个残基的距离满足限制即可。其定义方式与上述类似,差异在于,残基1与残基2的符号及位置编号,其中一个需设置为0(不可同时为0),例如:1,R84,3,0,10.0
    表示第1条序列中的84号残基R,与第3条链的任意一个残基/碱基的最大距离为10.0埃即可。

    支持放置多个距离限制,每行放置一个即可,包含多个距离限制信息的文件内容示例如下:

    1,H189,3,L4,8.0
    1,R84,3,0,10.0
    

    结果说明

    输出结果文件为排名前5的复合物结构rank_1-5.cif和pred_scores_chai1.csv,csv中包含信息如下:

    列名 说明
    Name 结构名称
    aggregate_score 对预测结构的质量排序的指标分数,值范围在-100至1.0之间,越大表示预测结构的质量越高。该分数综合考虑了三个指标:ptm, iptm, has_clash, 计算公式为: Aggregate_Score = 0.8 × ipTM + 0.2 × pTM − 100 × has_clash
    ptm 预测的TM分数(the predicted template modeling score),衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    iptm 预测的亚基接触面的TM分数(the interface predicted template modeling score),当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定
    per_chain_ptm 每条链单独计算的 pTM 分数,用于评估复合物中各个单链结构折叠预测的可靠性。该值可用于判断某一条链是否预测质量较低。
    per_chain_pair_iptm 每一对链之间的 界面 ipTM 分数矩阵,用于评估复合物中不同链对之间相互作用界面的预测可靠性。
    has_inter_chain_clashes 是否存在 跨链原子碰撞(inter-chain clashes) 的标志。通常为布尔值或 0/1。若存在碰撞(1),说明不同链之间存在严重空间重叠,结构可能不合理。
    chain_chain_clashes 各链之间发生的 原子碰撞数量或碰撞统计信息,用于进一步评估复合物界面是否存在结构冲突。
    actif_ptm 用于衡量复合物预测中 参与相互作用界面的区域结构质量。相比整体 pTM,更关注界面区域结构的可靠性。
    mean_interface_ptm 所有预测界面区域的 平均 pTM 分数,用于整体评估复合物界面结构的可靠性。
    protein_mean_interface_ptm 仅针对 蛋白质链之间界面 计算的平均 interface pTM 分数,用于评估蛋白–蛋白相互作用界面预测质量。
    pae_scores 用于表示模型预测中不同残基之间的相对位置误差。数值越低表示预测越可靠,常用于分析结构域之间或链之间的相对定位可信度。

    参考文献

    • FKSFold: Improving AlphaFold3-Type Predictions of Molecular Glue-Induced Ternary Complexes with Feynman-Kac-Steered Diffusion. Jian Shen, Shengmin Zhou, Xing Che. bioRxiv,2025.05.03.651455.
      DOI:10.1101/2025.05.03.651455

    Structure Prediction (FKSFold-Chai)

    Introduction

    A structure prediction model built on Chai-1 and designed for molecular glue complex systems. It combines Feynman-Kac (FK) stochastic control theory with AlphaFold3’s diffusion model and uses the interface predicted TM-score (ipTM) to assess the quality of protein-protein interfaces in real time during generation. Guidance terms derived from the FK formula modify the reverse diffusion process so that high-scoring structures are preferentially retained. The FK framework also biases sampling toward biophysically plausible conformations without extensive retraining of the underlying model and without compromising the diversity of the generated structures. The method successfully predicted three of eight molecular glue cases, each with an RMSD below 3 Å.

    Parameters

    Protein Sequence

    The sequence file of proteins in FASTA format, supporting multiple sequences.
    Note: For multi-protein complex structure prediction, the amino acid sequence input format is as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    The sequence file of DNA nucleic acids in FASTA format, supporting multiple sequences.

    RNA Sequence

    The sequence file of RNA nucleic acid molecules in FASTA format, supporting multiple sequences.
    Note: With 24 GB of GPU memory, about 2048 residues/bases can currently be processed.
    Protein, DNA, and RNA sequences all support modified residues or bases, which are specified with Chemical Component Dictionary (CCD) codes. For an introduction to the CCD, see https://www.wwpdb.org/data/ccd; codes can be looked up at https://www.ebi.ac.uk/pdbe-srv/pdbechem/
    To define a modified residue or base, place its CCD code in parentheses '()' directly in the sequence, for example:

    >seq
    (ACE)GQLEEIAK
    

    This indicates that the N-terminus of the sequence is acetylated.

    >seq
    AGSHSMRYFSTSVSR(HY3)GRGEPRFIAVG
    

    This indicates that residue P in the sequence is hydroxylated and becomes HY3 (CCD code).
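
    To make the parenthesized notation concrete, the sketch below splits a sequence into one token per residue, treating each (CODE) group as a single modified residue. This is also a convenient way to count residues against the GPU memory limit. The function tokenize is hypothetical, and the pattern assumes CCD codes of one to five alphanumeric characters; the tool's own parser may be stricter or looser.

```python
import re

# Either a parenthesized CCD code, or a single one-letter residue/base.
TOKEN = re.compile(r"\(([A-Za-z0-9]{1,5})\)|([A-Za-z])")


def tokenize(seq: str) -> list[str]:
    """Split a sequence into residue tokens; '(ACE)' becomes the token 'ACE'."""
    tokens, i = [], 0
    while i < len(seq):
        m = TOKEN.match(seq, i)
        if m is None:
            raise ValueError(f"unexpected character at position {i}: {seq[i]!r}")
        tokens.append(m.group(1) or m.group(2))
        i = m.end()
    return tokens


print(tokenize("(ACE)GQLEEIAK"))
modified = tokenize("AGSHSMRYFSTSVSR(HY3)GRGEPRFIAVG")
print(len(modified), modified[15])  # 27 HY3
```

    Counting tokens rather than characters avoids overestimating the length of sequences that contain modifications.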

    Ligand

    A text file containing small-molecule structures in SMILES format. Multiple molecules are supported, one per line, for example:

    CC(=O)OC1C[NH+]2CCC1CC2
    [Mg+2]
    

    Note: This parameter does not accept protein or peptide ligands given as amino acid sequences.

    Restraints

    A text file containing distance restraints between residues. Two types of restraint are supported: a restraint between two residues, and a restraint between one residue and a chain.

    A restraint between two residues is defined by five parts:

    • Index of the sequence containing residue 1 (sequences are numbered from 1, following the order and number of sequences in the Protein, DNA, and RNA parameters above. For example, with 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the first protein sequence is 1, the second protein sequence is 2, the DNA sequence is 3, and the RNA sequence is 4)
    • Symbol and position number of residue 1 (e.g., R84 means residue R at position 84)
    • Index of the sequence containing residue 2
    • Symbol and position number of residue 2
    • Maximum distance between the residues (in angstroms)

    The five parts are separated by commas, for example: 1,R84,3,G7,10.0
    This means that residue R84 in sequence 1 and residue G7 in sequence 3 are at most 10.0 angstroms apart.

    A restraint between a residue and a chain is satisfied when the distance between that residue and any residue in the chain meets the limit. It is defined as above, except that the symbol and position number of either residue 1 or residue 2 is set to 0 (they cannot both be 0), e.g. 1,R84,3,0,10.0
    This means that residue R84 in sequence 1 is at most 10.0 angstroms from at least one residue/base of chain 3.

    Multiple restraints are supported, one per line. An example file containing several restraints:

    1,H189,3,L4,8.0
    1,R84,3,0,10.0
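
    The five-field format can be checked before submission with a small parser. The sketch below is illustrative only: the Restraint class and parse_restraint function are hypothetical names, and the validation rules simply mirror the description above (five comma-separated fields, 0 meaning "any residue in the chain", never on both sides).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Restraint:
    chain1: int              # 1-based sequence index of residue 1
    res1: Optional[str]      # e.g. "R84"; None means any residue in chain1
    chain2: int              # 1-based sequence index of residue 2
    res2: Optional[str]      # e.g. "G7"; None means any residue in chain2
    max_dist: float          # maximum distance in angstroms


def parse_restraint(line: str) -> Restraint:
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 comma-separated fields, got {len(parts)}")
    c1, r1, c2, r2, dist = parts
    res1 = None if r1 == "0" else r1
    res2 = None if r2 == "0" else r2
    if res1 is None and res2 is None:
        raise ValueError("residue 1 and residue 2 cannot both be 0")
    return Restraint(int(c1), res1, int(c2), res2, float(dist))


EXAMPLE = "1,H189,3,L4,8.0\n1,R84,3,0,10.0\n"
for line in EXAMPLE.splitlines():
    print(parse_restraint(line))
```

    The second example line parses to a residue-to-chain restraint, with res2 set to None.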
    

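    Results

    The output consists of the top five ranked complex structures (rank_1-5.cif) and pred_scores_chai1.csv. The CSV file contains the following fields: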

    Field Name Description
    Name Name of the complex structure
    aggregate_score Score used to rank the quality of predicted structures, ranging from -100 to 1.0; larger values indicate higher quality. It combines three metrics (ptm, iptm, has_clash): aggregate_score = 0.8 × ipTM + 0.2 × pTM − 100 × has_clash
    ptm The predicted template modeling (pTM) score, which measures the overall accuracy of the predicted structure. A higher score indicates greater accuracy. A score above 0.5 suggests that the overall fold may be similar to the true structure.
    iptm The interface predicted template modeling (ipTM) score, available only when the predicted structure is a complex. It measures the accuracy of the predicted relative positions of the subunits in the complex. A higher score indicates greater accuracy. Scores above 0.8 indicate high-quality predictions, scores below 0.6 suggest the prediction may have failed, and scores between 0.6 and 0.8 are a gray area where correctness is uncertain.
    per_chain_ptm The pTM score calculated for each individual chain, used to assess the structural reliability of each monomer within the complex.
    per_chain_pair_iptm A matrix containing ipTM scores for each pair of chains, used to evaluate the reliability of predicted interfaces between specific chain pairs.
    has_inter_chain_clashes A boolean or binary indicator (0/1) showing whether steric clashes occur between atoms of different chains. If clashes are present, the predicted complex structure may be physically unrealistic.
    chain_chain_clashes The number or statistics of atomic clashes between chains, providing more detailed information about structural conflicts at the interfaces.
    actif_ptm Active interface pTM, representing the predicted structural confidence specifically for residues involved in interaction interfaces.
    mean_interface_ptm The average pTM score across all predicted interaction interfaces, providing an overall estimate of interface structural reliability.
    protein_mean_interface_ptm The average interface pTM specifically for protein–protein interfaces, used to assess the quality of predicted protein interaction regions.
    pae_scores Predicted Aligned Error (PAE) matrix, representing the expected positional error between residue pairs. Lower values indicate higher confidence in the relative positioning of residues or domains.
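
    The aggregate score formula given in the table can be written out directly. The sketch below is a plain transcription of that formula; the function name and the example values are illustrative, and has_clash is assumed to be a 0/1 flag as described for has_inter_chain_clashes.

```python
def aggregate_score(iptm: float, ptm: float, has_clash: bool) -> float:
    """aggregate_score = 0.8 * ipTM + 0.2 * pTM - 100 * has_clash."""
    return 0.8 * iptm + 0.2 * ptm - 100.0 * float(has_clash)


# Illustrative values, not real predictions.
print(round(aggregate_score(iptm=0.8, ptm=0.6, has_clash=False), 3))
print(round(aggregate_score(iptm=0.8, ptm=0.6, has_clash=True), 3))
```

    Because the clash penalty is 100 while the other terms sum to at most 1.0, any structure with a clash ranks below every clash-free structure.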

    References

    • Chai-1: Decoding the molecular interactions of life. Chai Discovery team, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, Kevin Wu. bioRxiv, 2024.10.10.615955. DOI: 10.1101/2024.10.10.615955
    • FKSFold: Improving AlphaFold3-Type Predictions of Molecular Glue-Induced Ternary Complexes with Feynman-Kac-Steered Diffusion. Jian Shen, Shengmin Zhou, Xing Che. bioRxiv, 2025.05.03.651455. DOI: 10.1101/2025.05.03.651455
  • Name: BsAb Builder
    Description: 基于BsAb双抗序列编辑器输出的序列格式,进行双抗结构预测,当前支持含IgG的对称性双抗结构预测。 Based on the sequence format output by the BsAb bispecific antibody sequence editor, this module performs bispecific antibody structure prediction. Currently, it supports structure prediction for symmetrical bispecific antibodies containing IgG.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-03-06 00:00:00
    Reference:

    BsAb Builder

    基于BsAb双抗序列编辑器输出的序列格式,进行双抗结构预测,当前支持含IgG的对称性双抗结构预测。

    BsAb Builder

    Based on the sequence format output by the BsAb bispecific antibody sequence editor, this module performs bispecific antibody structure prediction. Currently, it supports structure prediction for symmetrical bispecific antibodies containing IgG.

  • Name: Antibody-Antigen Complex Structure Score
    Description: 基于 DeepRank-Ab 的抗体-抗原界面专用几何深度学习评分模型。 Based on DeepRank-Ab, a geometric deep learning scoring model specifically designed for antibody–antigen interfaces.
    Tags: undefined
    Author: Xiaotong Xu
    Release: 2026-02-06 00:00:00
    Reference: DeepRank-Ab: a dedicated scoring function for antibody-antigen complexes based on geometric deep learning.

    Antibody-Antigen Complex Structure Score

    简介

    预测抗体-抗原复合物结构的DockQ值,进行结构质量评价。模块基于DeepRank-Ab模型实现,DeepRank-Ab是一种专为抗体 - 抗原界面独特特性量身定制的几何深度学习评分函数。该函数的开发得益于一个精心构建的基准数据集,该数据集包含来自 1442 个复合物的 230 多万个诱饵构象,为稳健训练和无偏评估提供了所需的多样性。在多个独立测试集(包括非结合态 - 非结合态对接模型和 AlphaFold 生成的结构)上,DeepRank-Ab 持续优于所有评估方法,包括 AF3、HADDOCK 以及 FTDMP 等最先进的评分函数。它将 AF3 的 Top 1 成功率提升了 35.5%,并将平均 Top 1 DockQ 值提高了一倍以上。DeepRank-Ab 还能稳健泛化到训练分布之外,在外部抗体 - 抗原 CAPRI 靶点上实现 100% 的 Top 5 成功率,超越了所有测试方法。这些结果共同表明,DeepRank-Ab 是一种高效的评分方法,显著提升了近天然抗体 - 抗原构象的识别能力。

    参数说明

    Structures

    抗体/纳米抗体-抗原复合物结构文件,支持格式:.pdb、.cif、.pdb.gz、.cif.gz,支持批量结构,要求以压缩包形式输入,支持格式:.zip、.tar、.tar.gz、.tgz、.tar.bz2、.tbz2、.tar.xz、.txz

    结果说明

    输出deeprank_ab_result.csv,内容如下:

    Name Predicted DockQ
    test1 0.20
    test2 0.17
    字段 说明
    Name 结构/样本标识符,PDB 名称。
    Predicted DockQ DeepRank-Ab 预测的 DockQ 分数(数值越高通常表示复合物对接质量越好;范围常见在 0–1)。

    参考文献

    • DeepRank-Ab: a dedicated scoring function for antibody-antigen complexes based on geometric deep learning.DOI:10.64898/2025.12.03.691974

    Antibody-Antigen Complex Structure Score

    Introduction

    Predicts the DockQ score of antibody–antigen complex structures for structural quality assessment. This module is based on the DeepRank-Ab model, a geometric deep learning scoring function specifically designed for the unique characteristics of antibody–antigen interfaces. The development of this function is supported by a carefully curated benchmark dataset containing over 2.3 million decoy conformations from 1,442 complexes, providing the necessary diversity for robust training and unbiased evaluation.

    On multiple independent test sets, including unbound–unbound docking models and AlphaFold-generated structures, DeepRank-Ab consistently outperforms all other evaluated methods, including state-of-the-art scoring functions such as AF3, HADDOCK, and FTDMP. It improves AF3’s Top 1 success rate by 35.5% and more than doubles the average Top 1 DockQ score. DeepRank-Ab also generalizes robustly beyond the training distribution, achieving a 100% Top 5 success rate on external antibody–antigen CAPRI targets and outperforming all tested methods. Together, these results show that DeepRank-Ab is an effective scoring approach that significantly improves the identification of near-native antibody–antigen conformations.


    Parameters

    Structures

    Antibody–antigen and nanobody–antigen complex structure files are supported in the following formats: .pdb, .cif, .pdb.gz, .cif.gz.

    Batch submission of multiple structures is supported and must be provided as a compressed archive. Supported archive formats include: .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz.

    Results

    Generates deeprank_ab_result.csv with the following content:

    Name Predicted DockQ
    test1 0.20
    test2 0.17

    Field Description
    Name Structure/sample identifier, typically the PDB name.
    Predicted DockQ DockQ score predicted by DeepRank-Ab (higher values generally indicate better docking quality; typical range 0–1).

    References

    • DeepRank-Ab: a dedicated scoring function for antibody-antigen complexes based on geometric deep learning. DOI:10.64898/2025.12.03.691974
  • Name: Grafting v2.5
    Description: Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.5 Graft antibody CDRs to target frameworks, normally for humanization. Version: v2.5
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-02-11 14:25:31
    Reference:

    Grafting v2.5

    简介

    Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.5

    参数说明

    Antibody Sequence File

    抗体序列文件,FASTA格式

    Numbering Type

    抗体编号规则:kabat,imgt,chothia

    Output File

    指定输出抗体graft后的序列文件名称,FASTA格式

    Output Policy

    指定输出graft策略文件,JSON格式

    Germline Score

    指定输出抗体FR区序列比对同源性打分文件

    Germline

    指定轻链或重链使用特定germline模板,也可都指定,写法如下:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    其中链名来自于流程第一步输入的fasta文件。
    例1:以下语句为链"Infliximab.H"指定了模板"IGHV3-7*01":

    Infliximab.H:IGHV3-7*01
    

    例2:以下语句为两条链分别指定了模板:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
    

    Template V Sequence

    指定抗体可变区 V 基因 的参考模板序列,FASTA格式。

    Template J Sequence

    指定抗体可变区 J 基因 的参考模板序列,FASTA格式。

    Germline Hits

    指定输出FR区序列比对结果文件,FASTA格式

    Number of Hits

    指定输出命中序列的数目

    结果说明

    输出结果包括:

    输出文件名称 说明
    germline_hits.fasta 输出FR区序列比对结果文件
    germline_score.json 输出抗体FR区序列比对同源性打分文件
    grafted.fasta 输出抗体graft后的序列文件名称
    graft_policy.json 输出graft策略文件
    Germline Frequency germline 模板打分未知残基频率

    Grafting v2.5

    Introduction

    The Grafting module is used to graft the CDRs of an antibody onto a specific framework region template, typically used for humanization design. Version: v2.5

    Parameters

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Numbering Type

    Antibody numbering scheme: kabat, imgt, or chothia.

    Output File

    Specify the output file name for the grafted antibody sequence in FASTA format.

    Output Policy

    Specify the output grafting strategy file in JSON format.

    Germline Score

    Specify the output file for the homology scores of the antibody FR region sequences.

    Germline

    Specify the specific germline template to be used for the light chain or heavy chain, or both, in the format:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    The chain names come from the FASTA file supplied in the first step of the workflow.
    Example 1: The following statement specifies the template “IGHV3-7*01” for the chain “Infliximab.H”:

    Infliximab.H:IGHV3-7*01
    

    Example 2: The following statement specifies templates for two chains separately:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
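
    The Germline value is a comma-separated list of chain:template pairs, so it is easy to validate before a run. The sketch below shows one way to parse it; parse_germline_spec is a hypothetical helper, and the assumption that neither chain names nor germline names contain a comma or colon comes only from the examples above.

```python
def parse_germline_spec(spec: str) -> dict[str, str]:
    """Parse 'seq_name1:germline_name1,seq_name2:germline_name2' into a dict."""
    assignments = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        chain, sep, germline = item.partition(":")
        if not sep or not chain or not germline:
            raise ValueError(f"expected 'chain:germline', got {item!r}")
        assignments[chain] = germline
    return assignments


print(parse_germline_spec("Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01"))
```

    A chain that does not appear in the result is left without a user-specified template.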
    

    Template V Sequence

    Specify the reference template sequence of the antibody V gene in FASTA format.

    Template J Sequence

    Specify the reference template sequence of the antibody J gene in FASTA format.

    Germline Hits

    Specify the output file for the FR region sequence alignment results in FASTA format.

    Number of Hits

    Specify the number of hit sequences to output.

    Results

    The output includes:

    Output File Name Description
    germline_hits.fasta Output file for FR region sequence alignment results
    germline_score.json Output file for homology scores of the antibody FR region sequences
    grafted.fasta Output file name for the grafted antibody sequence
    graft_policy.json Output file for the grafting strategy
    Germline Frequency Frequency of unknown residues in germline template scoring
  • Name: Binder Design (BoltzGen) HTS
    Description: Binder Design (BoltzGen) HTS 是基于BoltzGen的高通量筛选模式,默认采样10000条,从中筛选出靠前的Hits Binder Design (BoltzGen) HTS is a high-throughput screening mode based on BoltzGen. It samples 10,000 sequences by default to screen for top hits.
    Tags: undefined
    Author: Hannes Stark
    Release: 2025-10-30 00:00:00
    Reference: https://hannes-stark.com/assets/boltzgen.pdf

    Binder Design (BoltzGen) HTS

    简介

    设计能够结合各种生物分子靶标的蛋白、肽类等生物分子。模块基于BoltzGen模型实现,BoltzGen是一个通用的全原子生成模型(all-atom generative model),能够在同一框架下完成多模态的binder设计任务。不同于前代模型只做“推断”,BoltzGen直接在扩散采样过程中生成目标分子与其结合体的全原子结构,并通过可控约束机制确保生成结果物理一致、功能可实现。同时具备良好的泛化性能,能够针对训练过程中未出现过的全新靶标进行有效设计。

    BoltzGen的方法设计承接了Boltz系列一贯的目标——在统一的几何表示与能量空间中,学习多模态分子的物理规律。与以往的预测模型不同,BoltzGen 并不输出单一结构预测,而是通过扩散建模(diffusion modeling)直接生成分子的全原子坐标。


    其采用扩散模型(diffusion model)框架,在全原子坐标空间中学习分子结构的分布。模型通过在每个采样步骤中向原子坐标加入高斯噪声,再逐步去噪恢复,从而近似真实的势能面分布。与传统的分子生成模型(如仅在残基层面建模)不同,BoltzGen的每个采样变量都是具体原子的位置向量。模型通过能量一致性约束(energy consistency)学习键长、键角、二面角等局部几何关系,从而在生成时自动保持化学合理性。这里对于全原子建模采用了Atom14的方法。

    这一设计使生成结构不仅在形状上接近真实蛋白,在能量上也符合分子物理规律。

    BoltzGen的架构如下图所示,由三大模块组成:输入层(Input Representation)、条件编码器(Condition Encoder)、扩散模型(Diffusion Model),输出为三维坐标的全原子结构。

    image.png

    BoltzGen的一个突出特点,是并非停留在计算层面的生成验证,而是进行了系统的湿实验评估。进行了十类实验任务(如下图所示),涵盖 蛋白质、肽类、环肽、小分子结合体及抗菌肽设计等多种体系,几乎囊括了现有结构设计模型所能覆盖的全部生物模态。这些实验的共同目标,是检验模型能否在“无同源模板、真实实验条件”下生成可表达、可结合的结构。不同于以往只验证折叠精度的预测模型,BoltzGen的验证标准是功能实现——即所生成分子是否能在实验中稳定结合目标。

    image.png
    BoltzGen 的实验结果显示出较高的一致性与通用性:

    • 在 26 个实验靶标中,有超过 60% 的生成候选在实验中表现出结合活性;
    • 模型生成的肽类与蛋白 binder 均表现出良好的可表达性(多数 >80% 可溶性);
    • 环肽和抗菌肽任务中,多个样本在无模板条件下仍能正确形成环化结构;
    • 小分子结合蛋白任务中,生成结果的结合构象与已知复合物 RMSD < 2.5 Å。

    在 BoltzGen 论文中,进行抗体和结合蛋白生成的湿实验验证时,抗原(目标蛋白)的主要输入方式是结构,但在特定情况下也可以通过序列输入。

    具体说明如下:

    默认输入方式:结构
    论文中明确提到,除非另有说明,实验中均是将目标的结构(structure)作为输入提供给 BoltzGen 。例如,在针对 9 个新型目标(Novel Targets)设计纳米抗体和蛋白质结合剂时,研究人员利用了目标的结构信息。

    灵活性:序列输入与协同折叠(Cofolding) BoltzGen 是一个全原子生成模型,能够同时进行结构预测和蛋白质设计 。当仅提供目标的序列(sequence)作为输入时,模型可以在设计结合剂的同时对目标进行折叠,最终生成结合复合物的原子结构 。

    特殊案例:

    • 无结构输入
      在针对 NPM1 蛋白的无序区(disordered region)设计多肽时,研究人员采用了“无结构输入”的策略 。他们提供了 NPM1 有序区域的结构,但让无序区域保持柔性,从而测试模型在处理缺乏固定结构的目标时的表现。
    • 小分子目标
      对于小分子目标,BoltzGen仅需要输入SMILES字符串(一种描述分子结构的序列表示法),并在设计过程中执行协同折叠。

    总结来说,虽然BoltzGen具备直接从序列出发进行设计的能力,但在该论文的大多数湿实验验证(特别是针对新型蛋白目标)中,结构是主要的输入方式。

    计算耗时

    抗原大小 生成模式 生成序列数量 计算耗时(小时)
    120 Nanobody 10000 4.5
    140 Nanobody 10000 5.0
    180 Nanobody 10000 6.0
    200 Nanobody 10000 6.5
    400 Nanobody 10000 11.5
    460 Nanobody 10000 15.0
    240 Antibody 10000 10.8
    290 Antibody 10000 15.0
    400 Antibody 10000 17.5

    参数说明

    De Novo Antibody

    Type

    指定抗体类型,目前支持Antibody(普通抗体)和Nanobody(纳米抗体)。

    Antigen Structure

    上传已有的抗原结构,PDB或CIF格式。

    Antigen Chains

    指定从结构中提取一些链作为抗原,可多选,如:A,B。如不设置该参数,表示提取结构中的所有链。

    Antigen Sequence

    如果没有已知的抗原结构,可上传抗原序列,fasta格式,支持多链。

    Binding Hotspot

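    Specify which antigen residues take part in binding, using chain name + residue position (sequential numbering starting from 1), e.g. A10-20,A25,B30-36,B40.
    This means the antigen binding site consists of residues 10 to 20 and 25 of chain A, and residues 30 to 36 and 40 of chain B.
    Notes:
    1. When an antigen sequence file is used, chains are named alphabetically in file order: the first chain is A, the second is B, and so on.
    2. If this parameter is not set, the model searches for potential binding sites on its own.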

    Custom Templates

    支持上传自定义的抗体或纳米抗体模板结构,会采用模板结构的FR区,对CDR区域(Chothia编号)进行重设计,可选择:

    • 单个结构文件(.pdb 或 .cif)
    • 批量结构文件(压缩包格式)

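    When several template structures are provided, every template is used for design. For example, with Number of Samples set to 10000 and the three default antibody templates, each template is sampled about 3333 times.
    If no custom template is provided, the system uses the built-in default antibody and nanobody templates listed below: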
    抗体模板:

    • 6CR1 — Adalimumab(阿达木单抗,Humira)
      靶点:TNF-α
      作用:阻断 TNF-α 与受体结合,抑制炎症反应

    • 6WGB — Dupilumab(度普利尤单抗,Dupixent)
      靶点:IL-4Rα
      作用:阻断 IL-4 / IL-13 信号通路,抑制 2 型炎症

    • 3HMW — Ustekinumab(乌司奴单抗,Stelara)
      靶点:IL-12 / IL-23 p40
      作用:同时抑制 Th1 和 Th17 炎症通路

    纳米抗体模板:

    • 7EOW — Caplacizumab(卡普赛珠单抗)
      靶点:vWF A1 域
      作用:阻断 vWF 与血小板结合,抑制血栓形成

    • 7XL0 — Vobarilizumab(ALX-0061,沃巴利珠单抗)
      靶点:IL-6R(+ 白蛋白结合)
      作用:抑制 IL-6 信号并延长半衰期

    • 8COH — TPP-3444(Gefurulimab / ALXN1720 组成部分)
      靶点:补体 C5
      作用:抑制补体激活

    • 8Z8V — ALB8(Ozoralizumab / ATN-103 组件)
      靶点:人血清白蛋白(HSA)
      作用:延长药物半衰期

    • Gontivimab(ALX-0171,格替韦单抗)
      靶点:RSV F 蛋白
      作用:阻断病毒融合,抑制感染

    • Isecarosmab(M-6495 / ALX-1141,艾司卡索单抗)
      靶点:ADAMTS-5
      作用:抑制软骨降解,具有抗炎作用

    • Sonelokimab
      靶点:IL-17A / IL-17F
      作用:双重抑制炎症因子,增强抗炎效果

    Number of Samples

    采样的序列数量,值越大,采样空间越大,筛选序列质量越高,对应计算时间也更长,最大支持20000。

    Number of Designs

    完成设计后,最终给出的结构数量,默认为30,最大支持100。

    Custom

    Protocol

    设计模式共有6种:

    • Protein:设计与靶点(蛋白或多肽)结合的蛋白,也可脱离靶点仅设计蛋白单体。
    • Peptide:设计与靶点蛋白结合的多肽(线性肽或环肽)。
    • Small_Molecule:设计与小分子结合的蛋白,不改变小分子本身。
    • Antibody: 设计与靶点结合的普通抗体,也可脱离靶点仅设计普通抗体自身
    • Nanobody:设计与靶点结合的纳米抗体,也可脱离靶点仅设计纳米抗体自身。
    • Redesign: 对已存在的蛋白/复合物结构,进行指定残基的重设计优化。

    设计规则的定义有三种方式:

    • 基于已有结构进行定义,可以是提取部分结构,也可以对部分结构进行设计。
    • 基于序列进行定义,指定序列中哪部分需要设计,哪部分残基不变。
    • 基于小分子文件进行定义,指定参与结合的小分子。

    三种方式可以自由组合。

    Structure

    上传已有蛋白结构,从中提取已有结构,或重新设计部分结构。例如:从上传的结构中提取靶点链、抗原链、纳米抗体链等。

    Chains

    指定从Structure中提取的链名,可多选,如:A,B。如不设置该参数,表示提取结构中的所有链。

    Include

    从Chains参数指定的链中,进一步确认需要提取的残基范围,使用链名+残基位置(从1开始的顺序编号,非PDB的UID编号)进行指定,如A10-20,A25,B1-36,B40。
    表示:从A链提取编号10至20、25的残基,从B链提取编号1至36、40的残基。
    如不设置该参数,表示提取Chains参数中指定的完整链。

    Exclude

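    From the chains specified by Chains, specify which residues are not extracted. This is the opposite of Include and uses the same format (sequential numbering starting from 1, not the residue UID in the PDB file). For example, A15,B36-42 means residue 15 of chain A and residues 36 to 42 of chain B are not extracted.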

    Design Positions

    已提取的结构中,指定需要重新设计的残基,指定方式同Include参数,如A10-12,B15,B40(从1开始的顺序编号,非PDB的UID编号)。
    注意:需要重新设计的残基编号应在已提取的结构中存在。

    Design SS

    对要设计的残基,指定二级结构类型。使用链名,SS类型:残基范围(从1开始的顺序编号,非PDB的UID编号)进行指定,每行放置一个,如:

    A,HELIX:10-12
    B,SHEET:15,LOOP:40
    

    二级结构类型可选:LOOP, HELIX, SHEET(大小写均可)。
    不指定该参数表示不强制二级结构类型。

    Binding Hotspot

    指定哪些残基参与结合(如链间或与小分子结合),指定方式同Include,如A12,B15-18(从1开始的顺序编号,非PDB的UID编号)。

    Non Binding

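    Specify which residues do not take part in binding (sequential numbering starting from 1, not the residue UID in the PDB file). This is the opposite of Binding Hotspot.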

    Design Insertions

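    Specify insertion designs, one per line, in the form chain name,insertion position,inserted length,secondary structure (positions use sequential numbering starting from 1, not the residue UID in the PDB file), e.g.: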

    A,10,5
    B,15,5-10,HELIX
    

    表示在A链的10号残基位置后,插入5个新残基,二级结构不确定(不强制)。在B链的15号残基位置后,插入5至10个残基(具体残基数量随机确定),二级结构为HELIX。

    二级结构类型的选择有3种(大小写皆可): LOOP, HELIX, or SHEET


    Structure Repetition

    同Structure定义。例如:指定已有的Binder结构。

    Repetition Chains

    同Chains定义

    Repetition Include

    同Include定义

    Repetition Exclude

    同Exclude定义

    Repetition Design Positions

    同Design Positions定义

    Repetition Design SS

    同Design SS定义

    Repetition Binding Hotspot

    同Binding Hotspot定义

    Repetition Non Binding

    同Non Binding定义

    Repetition Design Insertions

    同Design Insertions定义


    Sequence

    指定要设计的蛋白序列,每行一条,如:

    AAVTTTTPPP
    15-20AAAAAAVTTTT18PPP
    

    其中:

    • 字母表示序列中明确的残基(设计中不变)
    • 单个数值表示该位置要设计的长度,如18表示序列的该位置将设计18个残基。
    • 数值范围表示长度范围(具体设计长度在范围内随机指定),如15-20表示该位置将设计15至20个残基,具体长度在15至20之间随机指定。

    序列的ID默认从1开始按顺序编号。

    Sequence Binding

    指定序列中参与结合的残基,使用序列编号:残基范围格式,如:

    1:5,8-10
    2:30-35
    

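    This means residues 5 and 8 to 10 of the first sequence take part in binding, and residues 30 to 35 of the second sequence take part in binding.
    When a sequence contains a design length range, residue positions are computed using the minimum length.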

    Sequence Non Binding

    Specify residues in the sequences that do not take part in binding. This is the opposite of Sequence Binding.

    Sequence SS

    指定序列中残基的二级结构类型,使用序列编号,SS类型:残基范围定义,每行一条,如:

    1,HELIX:5-8
    2,SHEET:15,LOOP:40
    

    表示第一条序列编号5至8的残基,二级结构为HELIX;第二条序列编号15的残基,二级结构为SHEET,编号40的残基,二级结构为LOOP。

    注意: 有指定设计长度范围的序列,按长度最小值来确认剩余残基的位置。

    Sequence Cycle

    指定需要环化的序列编号,如1,2表示第1和第2条序列首尾相连。

    Ligand

    指定参与结合的小分子信息,文本文件,支持SMILES或 CCD Code(化学组分词典编号)。如果使用SMILES格式,每行应包含一个小分子;如果使用CCD Code,每行可以包含一个或多个小分子,使用逗号分隔,并加上CCD前缀。示例如下:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Covalent Bond

    共价键信息的文本文件,TXT格式。每行放置一个共价键信息,每个共价键信息包含两个原子信息,每个原子信息由三部分组成:

    • 原子所在序列或小分子的顺序编号(按上述参数设置的顺序,确定相应序列或小分子的顺序,从1开始编号。)
    • 原子所在残基的位置编号(如残基为小分子时,编号为1)
    • 原子的标准名称(CCD中定义)
      三部分由逗号分隔,例如:3,1,CA表示第三个实体(序列或小分子)中的第一个残基(或小分子)的CA原子
      一个共价键是由两个原子信息组成,原子间用分号分隔,如:1,1,CA;2,1,CA
      表示一个共价键,该共价键由两个原子组成,第一个原子为1,1,CA,第二个原子为2,1,CA
      包含多个共价键信息的文件内容示例如下:
    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
    

    当小分子为SMILES时,如CC(=O)NCCNC(C)=O,如果该小分子的顺序编号(按上述方式确认)为3,其第一个C原子参与形成共价键,与编号为1的链/序列中第一个残基的CA原子,则共价键的定义为1,1,CA;3,1,C1其中C1表示小分子的第一个C原子,如果是第二C原子,用C2表示。

    注意:

    • 当前Covalent Bond的定义中,出现的序列不能是结构文件(Structure)中,只能是序列文件(Sequence和Ligand)中
    • 序列中有指定设计长度范围的情况时,按长度最小值来确认后续残基的位置。如:15-20ACS,长度范围的序列长度按最小长度计算,即15,所以残基A的位置编号是16,C是17,S是18。

    Number of Samples

    采样的序列数量,值越大,采样空间越大,筛选序列质量越高,对应计算时间也更长,最大支持20000。

    Number of Designs

    完成设计后,最终给出的结构数量,默认为30,最大支持100。

    结果说明

    输出参数文件design_spec.yaml
    输出设计的复合物的序列文件,final_complex.fasta
    输出设计的复合物的序列文件(Batch模式),final_complex_batch.fasta,适合一些模块的Batch模式的输入,如Structure Prediction (Boltz-2)
    输出设计的链的序列文件,final_designed_chains.fasta
    输出设计打分文件final_designs_metrics.csv,csv文件每个指标含义如下:

    列名 说明
    id 设计分子的名称
    final_rank 通过各指标综合排序后的最终排名
    absolute_score 基于多种指标(结构指标,物理能量指标)计算的综合打分,但与final rank排序并不完全一致,供参考。
    structure_confidence 基于结构指标(ptm,iptm,pae)计算的结构置信度评分,供参考。
    design_ptm 设计结构的预测TM分数(0–1),反映模型对设计蛋白整体折叠结构的置信度。数值越高表示设计结构越合理,通常 >0.7 视为高置信度。
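    design_iptm Interface predicted TM score (0–1) between the designed structure and the target structure, reflecting confidence in the quality of the interaction interface. Higher values indicate a more plausible interface; values >0.7 are typically considered high confidence.
    design_to_target_iptm Interface predicted TM score (0–1) computed only between the designed residues and the target structure, reflecting confidence in the quality of the interaction interface. Higher values indicate a more plausible interface; values >0.7 are typically considered high confidence.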
    min_design_to_target_pae 最小预测对齐误差(Å),是残基对水平的置信度指标,用来衡量任意两个残基之间相对空间位置的预测可信度。这里表示设计的结构与靶点结构的残基之间相对位置的准确度。数值越小(例如 <10 Å)准确度越高。
    plip_saltbridge_refolded 重折叠后结构中的盐桥数量。盐桥(带相反电荷残基间的电性作用力)是维持蛋白稳定性的重要因素,数量越多通常结合越稳固。
    plip_hbonds_refolded 重折叠后结构中的氢键数量。氢键是二级结构和界面互补性的关键作用力,数量越多整体稳定性越好。
    delta_sasa_refolded 重折叠前后溶剂可及表面积变化(ΔSASA, Ų)。数值越大(例如 >2000 Ų)表示疏水核心包埋程度越高,通常代表更强的热稳定性。
    filter_rmsd 整个复合物(设计+靶点)结构重折叠后与原设计结构的RMSD值,用于验证生成结构与预测结构的一致性,数值越小越好。
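    design_ipsae_min Minimum ipSAE between the designed structure and the target structure (ipSAE is computed once from the design to the target and once from the target to the design, and the smaller of the two values is taken). ipSAE is an interface score computed from the PAE (Predicted Aligned Error) matrix. It ranges from 0 to 1, and larger values indicate a more reliable predicted protein-protein interaction interface. ipSAE > 0.7 indicates a high-quality, trustworthy interface prediction. ipSAE < 0.1 indicates that the prediction contains almost no credible interface, which helps rule out false-positive interactions.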
    design_to_target_ipsae 从设计结构出发,计算与靶点结构之间的ipSAE。
    ALA/GLY/GLU/LEU/VAL/CYS_fraction 设计的残基中,各类型氨基酸的比例
    contacts 预测结构中的接触界面残基
    contacts_overlap 与输入 hotspot 重叠的预测接触残基
    overlap_ratio 输入 hotspot 被预测接触残基覆盖的比例

    Note: contacts, contacts_overlap and overlap_ratio are output only when the Binding Hotspot parameter is set.
    输出设计的前5个结构rank1-5*.cif
    输出最后设计的结构打包文件final_designs.tar.gz
    输出设计的概述文件results_overview.pdf,包含结构的过滤 (Filtering Criteria)和排序标准(Sorting Criteria)。

    • 过滤标准 (Filtering Criteria)
    列名 说明
    has_x 阈值:0.0
    序列有效性检查。确保序列中不包含未知氨基酸(“X”),必须完全由标准的 20 种天然氨基酸组成,保证序列在物理上可被合成和表达。
    filter_rmsd 阈值:< 2.5 Å
    整体骨架的 RMSD。检查整个复合物(设计+靶点)在重折叠后是否保持原样,用于验证生成结构与预测结构的一致性。
    filter_rmsd_design 阈值:< 2.5 Å
    仅针对设计部分(Binder)的骨架 RMSD。确保即使靶点有微小移动,结合剂本身的结构依然是稳定的。
    designfolding-filter_rmsd 阈值:< 2.5 Å
    独立折叠稳定性检查。在没有靶点的情况下单独折叠结合剂并计算 RMSD。用于确保结合剂能独立折叠,从而大大提高湿实验中的表达成功率。
    ALA_fraction
    GLY_fraction
    GLU_fraction
    LEU_fraction
    VAL_fraction
    阈值:< 0.3 (30%)
    序列复杂度/多样性检查。限制丙氨酸、甘氨酸、谷氨酸、亮氨酸、缬氨酸的单项占比。防止模型为了刷高结构稳定性分数而生成单一重复序列,强制要求序列具备化学多样性,以保证特异性的相互作用能力。
    • 排序标准(Sorting Criteria)
    列名 说明
    design_to_target_iptm 权重为1
    界面预测 TM 得分(0–1),用于评估蛋白–蛋白相互作用界面的结构合理性。数值越大表明界面(如结合位点)越可能形成稳定相互作用。
    design_ptm 权重为1
    预测模板建模得分(0–1),反映模型对设计蛋白整体折叠结构的置信度。数值越高表示全局结构越合理,通常 >0.7 视为高置信度。
    neg_min_design_to_target_pae 权重为1
    负的最小界面预测对齐误差 (PAE)。PAE 越低越好(误差越小),取负值是为了方便排序(数值越大越好)。它代表模型对结合界面上“最确定的那个接触点”有多大把握。
    affinity_probability_binary1 权重为1
    亲和力预测概率。主要用于小分子结合剂场景。这是模型直接预测出的“该分子能结合”的概率值。
    plip_hbonds_refolded 权重为0.5
    重折叠后结构中的氢键数量。氢键是二级结构和界面互补性的关键作用力,数量越多整体稳定性越好。
    plip_saltbridge_refolded 权重为0.5
    重折叠后结构中的盐桥数量。盐桥(带相反电荷残基间的电性作用力)是维持蛋白稳定性的重要因素,数量越多通常结合越稳固。
    delta_sasa_refolded 权重为0.5
    重折叠前后溶剂可及表面积变化(ΔSASA, Ų)。数值越大(例如 >2000 Ų)表示疏水核心包埋程度越高,通常代表更强的热稳定性。

    设计教程

    遮蔽肽设计教程

    已知抗体结构

    1. 抗体编号
    应用WeView打开mH35抗体结构,进行抗体编号,确定重链CDR3的位置在H99-102,为遮蔽肽的结合位置
    image.png
    2. BoltzGen中输入参数设置

    • 选择Custom模式
    • Protocol中选择Peptide
    • Structure中上传mH35抗体结构
    • Chains中选择H和L链,作为受体链
    • Binding Hotspot中输入受体的结合位点,为重链的CDR3区域:H99-102
      image.png
    • Sequence中输入需要设计的多肽长度,遮蔽肽建议设计长度是:5-30
      image.png
    • 提交运行

    已知抗体序列

    1. 抗体编号
    应用WeSeq打开mH35抗体序列,进行抗体编号,确定重链CDR3的位置在99-102,为遮蔽肽的结合位置
    image.png
    image.png
    2. BoltzGen中输入参数设置

    • 选择Custom模式
    • Sequence中输入mH35抗体重轻链的序列以及遮蔽肽的长度,一条链一行,遮蔽肽建议设计长度是:5-30
    • Sequence Binding中设置受体的结合位点,为重链的CDR3区域:1:99-102
      image.png
    • 提交运行

    环肽设计教程

    已知受体结构

    • Protocol中选择Peptide。
    • Structure中上传受体结构。
    • Binding Hotspot中定义受体中结合位点(如有)。
      image.png
    • Sequence的输入分以下两种情况:
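      • If a template structure is available, enter the template cyclic peptide sequence together with the length of the inserted segment, e.g. C8-9AC: 8 to 9 residues are inserted after the first residue C, and the first C and the last C close the ring, as shown below: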
        image.png
      • 如果无模板结构,可直接输入序列长度,如8-10,预测与受体结合的8-10AA长度的环肽,如下:
        image.png
    • 成环情况分为以下两种:
      • 如果环肽是头尾肽键成环,可以在Sequence Cycle中填1。
        image.png
      • 如果环肽是二硫键成环,则Sequence Cycle不填写,在Covalent Bond中填入首尾两个Cys生成二硫键的信息:1,1,SG;1,11,SG。
        image.png
    • 提交运行

    已知受体序列

    • Protocol中选择Peptide。
      image.png
    • 根据环肽情况,Sequence的输入分以下两种情况:
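      • If the cyclic peptide has a template structure, enter the receptor sequences, the template cyclic peptide sequence and the length of the inserted segment, one sequence per line, as in the figure below. The receptor has two sequences, with sequence IDs 1 and 2. The cyclic peptide sequence is C8-9AC (8 to 9 residues are inserted after the first residue C, and the first C and the last C close the ring). It is on the third line, so its sequence ID is 3.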
        image.png
      • 如果无模板结构,可直接输入受体序列和环肽的序列长度,如下图,预测与受体结合的8-10AA长度的环肽。
        image.png
    • Sequence Binding中定义受体中结合位点/非结合位点(如有)。
      image.png
    • 成环情况分为以下两种:
      • 如果环肽是头尾肽键成环,可以在Sequence Cycle中填3。
        image.png
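      • If the ring is closed by a disulfide bond, leave Sequence Cycle empty and enter the disulfide bond between the first and last Cys in Covalent Bond. Because the cyclic peptide is sequence 3 here, the entry is: 3,1,SG;3,11,SG.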
        image.png
    • 提交运行

    参考文献

    • https://hannes-stark.com/assets/boltzgen.pdf

    Binder Design (BoltzGen)

    Introduction

    The Binder Design module is designed to generate proteins, peptides, and other biomolecules capable of binding to various biological targets. It is implemented based on the BoltzGen model — a universal all-atom generative model capable of performing multimodal binder design tasks within a unified framework. Unlike earlier models that focused solely on “inference,” BoltzGen directly generates the full-atom structures of target molecules and their complexes during diffusion sampling, ensuring physically consistent and functionally feasible results through controllable constraints. It also demonstrates strong generalization, enabling effective design for novel, unseen targets.

    The BoltzGen framework inherits the Boltz family’s core objective — to learn the physical laws of multimodal molecules within a unified geometric and energetic representation. Unlike traditional prediction models that output a single structure, BoltzGen uses diffusion modeling to directly generate full atomic coordinates of molecules.

    image.png

    BoltzGen adopts a diffusion model framework to learn molecular structure distributions in full-atom coordinate space. The model adds Gaussian noise to atomic coordinates at each sampling step and progressively denoises them to approximate the real potential energy surface. Unlike traditional residue-level molecular generators, BoltzGen models each atom’s position explicitly. Using energy consistency constraints, the model learns local geometric relationships — such as bond lengths, angles, and torsions — to ensure chemical plausibility during generation. The Atom14 method is used for full-atom representation.

    image.png

    This design ensures that generated structures are not only geometrically realistic but also physically valid in terms of molecular energetics.
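    As a rough illustration of what "adding Gaussian noise to atomic coordinates" means, the sketch below perturbs a toy set of atom positions at several noise levels. It is a generic example, not BoltzGen's actual noise schedule or network: the function name, the coordinates and the sigma values are all assumptions made for illustration.

```python
import random

Coord = tuple[float, float, float]

def add_gaussian_noise(coords: list[Coord], sigma: float, rng: random.Random) -> list[Coord]:
    """Return coords with independent N(0, sigma^2) noise added to every x, y, z value."""
    return [
        (x + sigma * rng.gauss(0.0, 1.0),
         y + sigma * rng.gauss(0.0, 1.0),
         z + sigma * rng.gauss(0.0, 1.0))
        for x, y, z in coords
    ]

# Toy three-atom fragment (made-up coordinates, in angstroms).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)]
rng = random.Random(0)

# Larger sigma pushes the coordinates further from the clean structure.
# A trained denoiser (not shown here) would work in the reverse direction.
for sigma in (0.0, 0.5, 2.0):
    noisy = add_gaussian_noise(atoms, sigma, rng)
    print(sigma, noisy[1])
```

    With sigma set to 0 the coordinates come back unchanged, which is the clean end of the noising process.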

    The BoltzGen architecture consists of three main modules: Input Representation, Condition Encoder, and Diffusion Model, outputting full-atom 3D coordinates.

    image.png

    A distinguishing feature of BoltzGen is that it goes beyond computational validation and includes extensive wet-lab experimental evaluation. Ten categories of experiments were performed (as shown below), covering proteins, peptides, cyclic peptides, protein–small molecule complexes, and antimicrobial peptides — encompassing nearly all biological modalities addressed by structural design models. The goal was to test whether BoltzGen can generate expressible, functional binders under real experimental conditions without any homologous templates. Unlike models that only validate structural accuracy, BoltzGen emphasizes functional success — i.e., whether the generated molecules can stably bind their targets experimentally.

    image.png

    Experimental results demonstrate high consistency and generality:

    • Among 26 experimental targets, over 60% of generated candidates exhibited measurable binding activity.
    • Generated peptide and protein binders showed excellent expression performance (most with >80% solubility).
    • In cyclic peptide and antimicrobial peptide tasks, multiple samples correctly formed cyclic structures without templates.
    • In protein–small molecule binding tasks, generated complexes achieved binding poses with RMSD < 2.5 Å compared to known complexes.

    image.png

    In the BoltzGen paper, during wet-lab validation of antibody and binder generation, the primary form of input for the antigen (target protein) is structural information, although sequence-only input is also supported in specific scenarios.

    Default Input Mode: Structure
    The paper explicitly states that, unless otherwise specified, the target structure is provided as input to BoltzGen in the experiments. For example, when designing nanobodies and protein binders against nine novel targets, the researchers relied on the structural information of the targets.

    Flexibility: Sequence Input and Cofolding
    BoltzGen is an all-atom generative model capable of performing structure prediction and protein design simultaneously. When only the target sequence is provided, the model can cofold the target and the binder, folding the target while designing the binder and ultimately generating the atomic structure of the bound complex.

    Special Cases

    • No Fixed Structure Input
      When designing peptides targeting the disordered region of the NPM1 protein, the researchers adopted a “no fixed structure input” strategy. They provided the structure of the ordered regions of NPM1 while leaving the disordered region flexible, allowing the model to evaluate performance on targets lacking a well-defined structure.

    • Small-Molecule Targets
      For small-molecule targets, BoltzGen requires only a SMILES string (a sequence-based representation of molecular structure) as input and performs cofolding during the design process.

    Summary

    In summary, although BoltzGen is capable of performing design directly from sequence-only inputs, in the majority of the wet-lab validation experiments reported in the paper—especially those involving novel protein targets—structural information was used as the primary form of input.

    Computation Time

    Antigen Size Generation Mode Number of Sequences Computation Time (hours)
    120 Nanobody 10000 4.5
    140 Nanobody 10000 5.0
    180 Nanobody 10000 6.0
    200 Nanobody 10000 6.5
    400 Nanobody 10000 11.5
    460 Nanobody 10000 15.0
    240 Antibody 10000 10.8
    290 Antibody 10000 15.0
    400 Antibody 10000 17.5

    Parameters

    De Novo Antibody

    Type

    Specifies the antibody type. Currently supports Antibody (conventional antibodies) and Nanobody.

    Antigen Structure

    Upload an existing antigen structure in PDB or CIF format.

    Antigen Chains

    Specify which chains in the structure should be extracted as the antigen.
    Multiple chains are allowed, e.g., A,B.
    If not set, all chains in the structure are used by default.

    Antigen Sequence

    If no antigen structure is available, you may upload an antigen sequence in FASTA format.
    Multi-chain sequences are supported.

    Binding Hotspot

    Specify which residues on the antigen participate in binding, using the format
    ChainName + ResidueIndex (indexing starts from 1), such as:
    A10-20,A25,B30-36,B40.

    This represents:

    • Chain A: residues 10–20 and 25
    • Chain B: residues 30–36 and 40

    Notes:

    1. When using an antigen sequence file, chain names are assigned alphabetically based on sequence order: the first chain is A, the second is B, and so on.
    2. If this parameter is not set, the model will automatically search for potential binding sites.
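    The residue selection syntax above (chain name + 1-based residue index, with optional ranges) can be expanded mechanically. The following is a minimal sketch, not the platform's own parser: the function name is hypothetical, and it assumes chain names consist of letters only.

```python
import re
from collections import defaultdict

# Hypothetical helper: one token is a chain name, a start index and an optional end index.
_TOKEN = re.compile(r"^([A-Za-z]+)(\d+)(?:-(\d+))?$")

def parse_residue_spec(spec: str) -> dict[str, list[int]]:
    """Expand 'A10-20,A25,B30-36,B40' into {chain: sorted 1-based residue indices}."""
    selected: dict[str, set[int]] = defaultdict(set)
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        chain, start = match.group(1), int(match.group(2))
        stop = int(match.group(3)) if match.group(3) is not None else start
        if start < 1 or stop < start:
            raise ValueError(f"invalid residue range: {token!r}")
        selected[chain].update(range(start, stop + 1))
    return {chain: sorted(indices) for chain, indices in selected.items()}

hotspots = parse_residue_spec("A10-20,A25,B30-36,B40")
print(hotspots["A"])  # residues 10 to 20 plus 25
print(hotspots["B"])  # residues 30 to 36 plus 40
```

    The same expansion applies to the other parameters that share this format, such as Include, Exclude and Design Positions.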

    Custom Templates

    Supports uploading custom antibody or nanobody template structures. The FR regions from the template structures will be adopted, while the CDR regions (Chothia numbering) will be redesigned. Options:

    • Single structure file (.pdb or .cif)
    • Batch structure files (compressed archive format)

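    When multiple template structures are provided, each template structure is used for design. For example, with Number of Samples set to 10000 and the three default antibody templates, each template is sampled about 3333 times.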
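    The split noted for this parameter (about 3333 samples per template for 10000 samples and the three default antibody templates) amounts to simple integer division. The sketch below assumes an even split, with any remainder given to the first templates. How the platform actually handles the remainder is not documented here, and the function name is hypothetical.

```python
def samples_per_template(total_samples: int, n_templates: int) -> list[int]:
    """Split total_samples as evenly as possible across n_templates."""
    if n_templates < 1:
        raise ValueError("at least one template is required")
    base, extra = divmod(total_samples, n_templates)
    # The first `extra` templates receive one additional sample (an assumption).
    return [base + (1 if i < extra else 0) for i in range(n_templates)]

print(samples_per_template(10000, 3))  # → [3334, 3333, 3333]
print(samples_per_template(10000, 7))  # seven default nanobody templates
```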

    If no custom template is provided, the system will use built-in default antibody and nanobody templates, listed below:

    Antibody Templates

    • 6CR1 — Adalimumab (Humira)

      • Target: TNF-α
      • Mechanism: Blocks TNF-α binding to its receptor, inhibiting inflammatory response
    • 6WGB — Dupilumab (Dupixent)

      • Target: IL-4Rα
      • Mechanism: Blocks IL-4 / IL-13 signaling pathway, suppressing type 2 inflammation
    • 3HMW — Ustekinumab (Stelara)

      • Target: IL-12 / IL-23 p40
      • Mechanism: Simultaneously inhibits Th1 and Th17 inflammatory pathways

    Nanobody Templates:

    • 7EOW — Caplacizumab

      • Target: vWF A1 domain
      • Mechanism: Blocks vWF-platelet binding, inhibiting thrombosis
    • 7XL0 — Vobarilizumab (ALX-0061)

      • Target: IL-6R (plus albumin binding)
      • Mechanism: Inhibits IL-6 signaling and extends half-life
    • 8COH — TPP-3444 (Gefurulimab / ALXN1720 component)

      • Target: Complement C5
      • Mechanism: Inhibits complement activation
    • 8Z8V — ALB8 (Ozoralizumab / ATN-103 component)

      • Target: Human serum albumin (HSA)
      • Mechanism: Extends drug half-life
    • Gontivimab (ALX-0171)

      • Target: RSV F protein
      • Mechanism: Blocks viral fusion, preventing infection
    • Isecarosmab (M-6495 / ALX-1141)

      • Target: ADAMTS-5
      • Mechanism: Inhibits cartilage degradation, with anti-inflammatory effects
    • Sonelokimab

      • Target: IL-17A / IL-17F
      • Mechanism: Dual inhibition of inflammatory cytokines, enhancing anti-inflammatory efficacy

    Number of Samples

    The number of sequences to sample. Larger values cover a larger sampling space and tend to yield higher-quality selected sequences, at the cost of longer computation time. Maximum: 20000.

    Number of Designs

    Number of final generated structures. Default: 30, Max: 100.

    Custom

    Protocol

    There are six design modes:

    • Protein: Design proteins that bind to a target (protein or peptide), or design standalone protein monomers without a target.
    • Peptide: Design peptides (linear or cyclic) that bind to a target protein.
    • Small_Molecule: Design proteins that bind to small molecules, without changing the small molecule itself.
    • Nanobody: Design nanobodies that bind to a target, or design nanobodies alone without a target.
    • Antibody: Design conventional antibodies that bind to a target, or design conventional antibodies alone without a target.
    • Redesign: Redesign and optimize specified residues of an existing protein or complex structure.

    Three approaches to define the design rule:

    • Based on existing structures, by extracting or redesigning specific regions.
    • Based on sequences, specifying which residues to design or keep fixed.
    • Based on small molecules, defining the binding partner using a molecular file.

    These approaches can be combined freely.


    Structure

    Upload an existing protein structure to extract or redesign certain regions, e.g., selecting specific chains such as antigen, nanobody, or receptor chains.

    Chains

    Specify chain IDs extracted from Structure, e.g., A,B.
    If not set, all chains will be extracted.

    Include

    From the selected chains (Chains), specify which residues to extract using chain ID + residue position (sequential numbering starting from 1, not the residue UID in the PDB file), e.g.:
    A10-20,A25,B1-36,B40
    This extracts residues 10–20 and 25 from chain A, and residues 1–36 and 40 from chain B.
    If not set, all residues in Chains are extracted.

    Exclude

    Specify residues not to extract from selected chains. Same format as Include, e.g. A15,B36-42.

    Design Positions

    Specify residues to redesign within the extracted structure, same format as Include, e.g. A10-12,B15,B40.
    Note: Residues to redesign must exist in the extracted structure.

    Design SS

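    Specify secondary structure types for designed residues, one chain per line, using the format chain ID,SS type:residue range (sequential numbering starting from 1, not the residue UID in the PDB file):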

    A,HELIX:10-12
    B,SHEET:15,LOOP:40
    

    Accepted types: LOOP, HELIX, SHEET (case-insensitive).
    If not specified, secondary structures are not constrained.
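    To make the line format concrete, the sketch below reads Design SS text into a per-chain mapping from residue index to secondary structure type. It is an illustrative parser with a hypothetical function name, and it assumes every entry after the chain ID has the form TYPE:index or TYPE:start-end, as in the two example lines.

```python
VALID_SS = {"LOOP", "HELIX", "SHEET"}

def parse_design_ss(text: str) -> dict[str, dict[int, str]]:
    """Map chain ID -> {1-based residue index: SS type} from Design SS lines."""
    result: dict[str, dict[int, str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        chain, *entries = [part.strip() for part in line.split(",")]
        per_chain = result.setdefault(chain, {})
        for entry in entries:
            ss_type, _, span = entry.partition(":")
            ss_type = ss_type.upper()  # types are case-insensitive
            if ss_type not in VALID_SS or not span:
                raise ValueError(f"bad entry {entry!r} in line {line!r}")
            start, _, end = span.partition("-")
            first, last = int(start), int(end or start)
            for index in range(first, last + 1):
                per_chain[index] = ss_type
    return result

print(parse_design_ss("A,HELIX:10-12\nB,sheet:15,LOOP:40"))
```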

    Design Insertions

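    Define insertion designs, one per line, using the format chain ID,insertion position,inserted length,secondary structure (the secondary structure field is optional; positions use sequential numbering starting from 1):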

    A,10,5
    B,15,5-10,HELIX
    

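    Meaning: insert 5 new residues after residue 10 of chain A, with no secondary structure constraint; insert 5 to 10 residues (the exact number is chosen at random) after residue 15 of chain B, with HELIX secondary structure.
    Accepted secondary structure types: LOOP, HELIX, SHEET (case-insensitive).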
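    The two example lines can be read into a small record type as follows. This is a sketch under the assumption that each line has three or four comma-separated fields; the Insertion class and the function name are hypothetical, not part of the tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Insertion:
    chain: str
    after_residue: int        # 1-based position the new residues follow
    min_length: int
    max_length: int           # equals min_length when a single number is given
    secondary_structure: Optional[str] = None  # None means unconstrained

def parse_insertion(line: str) -> Insertion:
    fields = [field.strip() for field in line.split(",")]
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields: {line!r}")
    chain, position, length = fields[:3]
    low, _, high = length.partition("-")
    ss = fields[3].upper() if len(fields) == 4 else None
    if ss is not None and ss not in {"LOOP", "HELIX", "SHEET"}:
        raise ValueError(f"unknown secondary structure: {fields[3]!r}")
    return Insertion(chain, int(position), int(low), int(high or low), ss)

for line in ("A,10,5", "B,15,5-10,HELIX"):
    print(parse_insertion(line))
```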

    Binding Hotspot

    Specify which residues participate in binding (e.g., between chains or with small molecules), same as Include, e.g. A12,B15-18.

    Non Binding

    Specify residues not involved in binding.


    Structure Repetition

    Same definition as Structure. For example, specify an existing binder structure.

    Repetition Chains

    Follow the same rules as the corresponding parameters above.

    Repetition Include

    Follow the same rules as the corresponding parameters above.

    Repetition Exclude

    Follow the same rules as the corresponding parameters above.

    Repetition Design Positions

    Follow the same rules as the corresponding parameters above.

    Repetition Design SS

    Follow the same rules as the corresponding parameters above.

    Repetition Design Insertions

    Follow the same rules as the corresponding parameters above.

    Repetition Binding Hotspot

    Follow the same rules as the corresponding parameters above.

    Repetition Non Binding

    Follow the same rules as the corresponding parameters above.


    Sequence

    Specify the designed protein sequences, one per line, e.g.:

    AAVTTTTPPP
    15-20AAAAAAVTTTT18PPP
    

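    Letters are fixed residues that stay unchanged during design.
    A single number is the number of residues to design at that position, e.g. 18 means 18 residues are designed there.
    A range is a variable length, e.g. 15-20 means 15 to 20 residues are designed there, with the exact length chosen at random within the range.
    Sequence IDs are assigned in input order, starting from 1.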
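    A sequence specification mixes fixed letters with numbers or ranges for the designed stretches. The sketch below splits such a line into segments and reports the shortest and longest total length it can produce. The function names and the tuple layout are illustrative assumptions.

```python
import re

_SEGMENT = re.compile(r"(\d+)(?:-(\d+))?|([A-Za-z]+)")

def parse_sequence_spec(spec: str) -> list[tuple]:
    """Split a spec into ('fixed', residues) and ('design', min_len, max_len) segments."""
    segments = []
    position = 0
    for match in _SEGMENT.finditer(spec):
        if match.start() != position:
            raise ValueError(f"unexpected character at offset {position} in {spec!r}")
        position = match.end()
        if match.group(3):
            segments.append(("fixed", match.group(3).upper()))
        else:
            low = int(match.group(1))
            high = int(match.group(2) or low)
            segments.append(("design", low, high))
    if position != len(spec):
        raise ValueError(f"unexpected character at offset {position} in {spec!r}")
    return segments

def length_bounds(segments: list[tuple]) -> tuple[int, int]:
    """Return the (minimum, maximum) total sequence length."""
    shortest = longest = 0
    for segment in segments:
        if segment[0] == "fixed":
            shortest += len(segment[1])
            longest += len(segment[1])
        else:
            shortest += segment[1]
            longest += segment[2]
    return shortest, longest

segments = parse_sequence_spec("15-20AAAAAAVTTTT18PPP")
print(segments)
print(length_bounds(segments))
```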

    Sequence Binding

    Specify which residues in the sequence are involved in binding:

    1:5,8-10
    2:30-35
    

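    Each line has the form sequence ID:residue range. The example means residues 5 and 8 to 10 of the first sequence, and residues 30 to 35 of the second sequence, take part in binding.
    When a sequence contains a design length range, residue positions are computed using the minimum length.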

    Sequence Non Binding

    Opposite of Sequence Binding, defines residues not involved in binding.

    Sequence SS

    Define secondary structure for sequence residues:

    1,HELIX:5-8
    2,SHEET:15,LOOP:40
    

    Positions are determined based on the minimum sequence length when variable ranges exist.

    Sequence Cycle

    Specify cyclic sequences, e.g. 1,2 means the first and second sequences are cyclized (head-to-tail connected).

    Ligand

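    Specify the small molecules involved in binding as a text file.
    SMILES and CCD codes (Chemical Component Dictionary IDs) are supported. With SMILES, put one molecule per line. With CCD codes, a line may hold one or more molecules separated by commas and must start with the CCD prefix.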

    Examples:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
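    A line of the ligand file is either a SMILES string or a comma-separated list that starts with CCD. The sketch below tells the two apart. It is a hypothetical helper and assumes the prefix is written in upper case, as in the example.

```python
def parse_ligand_line(line: str) -> dict:
    """Classify one ligand line as SMILES or as a list of CCD codes."""
    text = line.strip()
    if not text:
        raise ValueError("empty ligand line")
    fields = [field.strip() for field in text.split(",")]
    if fields[0] == "CCD":
        codes = [code for code in fields[1:] if code]
        if not codes:
            raise ValueError(f"no CCD codes given: {line!r}")
        return {"kind": "ccd", "codes": codes}
    # Anything without the CCD prefix is treated as one SMILES string.
    return {"kind": "smiles", "smiles": text}

for line in ("CC(=O)OC1C[NH+]2CCC1CC2", "CCD,ATP,HY3,P1L", "CCD,MG"):
    print(parse_ligand_line(line))
```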
    

    Covalent Bond

    TXT file defining covalent bonds.
    Each line specifies a bond between two atoms using the format:

    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
    

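    Each atom entry has three comma-separated parts: EntityID,ResidueIndex,AtomName. ResidueIndex is 1 when the entity is a small molecule, and AtomName is the standard atom name defined in the CCD. The two atoms of a bond are separated by a semicolon.
    Entity IDs are assigned based on the input order of sequences or ligands (starting from 1). For example, 3,1,CA refers to the CA atom of the first residue (or small molecule) of the third entity.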
    When the small molecule is provided as a SMILES string, e.g. CC(=O)NCCNC(C)=O, if the sequential index of this small molecule (determined as described above) is 3, and its first carbon atom participates in forming a covalent bond with the CA atom of the first residue in chain/sequence 1, then the covalent bond should be defined as:

    1,1,CA;3,1,C1

    Here, C1 denotes the first carbon atom of the small molecule. If it is the second carbon atom, it should be specified as C2.

    Notes:

    • In the current definition of Covalent Bond, the sequences involved must not come from structure files (Structure); they can only come from sequence files (Sequence and Ligand).
    • When a sequence specifies a design length range, the minimum length is used to determine subsequent residue positions.
      For example, for 15-20ACS, the sequence length is taken as 15. Therefore, the position indices are: A = 16, C = 17, S = 18.
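    The two rules above, the bond line format and the minimum-length convention for residue positions, can be checked with a short script. This is a sketch with hypothetical names (AtomRef, parse_bond, fixed_residue_positions), not the tool's own code.

```python
import re
from typing import NamedTuple

class AtomRef(NamedTuple):
    entity: int   # 1-based order of the sequence or ligand
    residue: int  # 1-based residue index (1 for a small molecule)
    atom: str     # atom name, e.g. CA or SG

def parse_bond(line: str) -> tuple[AtomRef, AtomRef]:
    """Parse 'entity,residue,atom;entity,residue,atom' into two AtomRef values."""
    atoms = []
    for part in line.strip().split(";"):
        entity, residue, atom = (field.strip() for field in part.split(","))
        atoms.append(AtomRef(int(entity), int(residue), atom))
    if len(atoms) != 2:
        raise ValueError(f"a bond needs exactly two atoms: {line!r}")
    return atoms[0], atoms[1]

def fixed_residue_positions(spec: str) -> list[tuple[int, str]]:
    """1-based positions of fixed residues, counting each design range at its minimum length."""
    positions = []
    cursor = 0
    for number, _upper, letters in re.findall(r"(\d+)(?:-(\d+))?|([A-Za-z]+)", spec):
        if letters:
            for residue in letters:
                cursor += 1
                positions.append((cursor, residue))
        else:
            cursor += int(number)  # only the minimum length is used
    return positions

print(parse_bond("1,1,CA;3,1,C1"))
print(fixed_residue_positions("15-20ACS"))  # → [(16, 'A'), (17, 'C'), (18, 'S')]
```

    Applied to the cyclic peptide pattern C8-9AC used in the tutorial, the last Cys falls at position 11, which is why a disulfide bond for that peptide refers to residues 1 and 11.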


    Number of Samples

    The number of sequences to sample. Larger values cover a larger sampling space and tend to yield higher-quality selected sequences, at the cost of longer computation time. Maximum: 20000.

    Number of Designs

    Number of final generated structures. Default: 30, Max: 100.


    Results

    • Output parameter file: design_spec.yaml

    • Output the sequence file of the designed complex: final_complex.fasta

    • Output the sequence file of the designed complex (Batch mode): final_complex_batch.fasta, suitable for Batch-mode inputs of some modules, such as Structure Prediction (Boltz-2)

    • Output the sequence file of the designed chains: final_designed_chains.fasta

    • Output the design scoring file: final_designs_metrics.csv. The meaning of each metric in the CSV file is as follows:

    Column Name Description
    id Name of the designed molecule
    final_rank Final ranking after comprehensive sorting based on all metrics
    absolute_score A composite score calculated from multiple metrics (structural metrics and physical energy metrics). It does not fully correspond to the final_rank ordering and is provided for reference.
    structure_confidence Structural confidence score calculated from structural metrics (pTM, ipTM, PAE), for reference.
    design_ptm Predicted Template Modeling score (0–1), reflecting confidence in the overall fold of the designed protein. Higher values indicate a more reasonable global structure; typically, values >0.7 are considered high confidence.
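    design_iptm Interface predicted TM score (0–1) between the designed structure and the target structure, reflecting confidence in the quality of the interaction interface. Higher values indicate a more plausible interface; typically, values >0.7 are considered high confidence.
    design_to_target_iptm Interface predicted TM score (0–1) computed only between the designed residues and the target structure, reflecting confidence in the quality of the interaction interface. Higher values indicate a more plausible interface; typically, values >0.7 are considered high confidence.
    min_design_to_target_pae Minimum Predicted Aligned Error (Å), a residue-pair-level confidence metric that measures the predicted reliability of relative spatial positions between residues. Here it represents the accuracy of relative positioning between residues of the designed structure and the target structure. Smaller values (e.g., <10 Å) indicate higher accuracy.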
    plip_saltbridge_refolded Number of salt bridges in the refolded structure. Salt bridges (electrostatic interactions between oppositely charged residues) are important for protein stability; higher numbers generally indicate more stable binding.
    plip_hbonds_refolded Number of hydrogen bonds in the refolded structure. Hydrogen bonds are key forces for secondary structure formation and interface complementarity; higher numbers usually imply better overall stability.
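    delta_sasa_refolded Change in solvent-accessible surface area before and after refolding (ΔSASA, Ų). Larger values (e.g., >2000 Ų) indicate greater burial of the hydrophobic core and usually represent stronger thermal stability.
    filter_rmsd RMSD between the refolded complex (design + target) and the original designed structure, used to verify consistency between the generated and predicted structures. Smaller is better.
    design_ipsae_min Minimum ipSAE between the designed structure and the target structure (ipSAE is computed once from the design to the target and once from the target to the design, and the smaller value is taken). ipSAE is an interface score computed from the PAE (Predicted Aligned Error) matrix and ranges from 0 to 1; larger values indicate a more reliable predicted protein-protein interaction interface. ipSAE > 0.7 indicates a high-quality, trustworthy interface prediction. ipSAE < 0.1 indicates almost no credible interface, which helps rule out false-positive interactions.
    design_to_target_ipsae ipSAE computed from the designed structure to the target structure.
    ALA/GLY/GLU/LEU/VAL/CYS_fraction Fraction of each amino acid type among the designed residues.
    contacts Contact interface residues in the predicted structure
    contacts_overlap Predicted contact residues that overlap with the input hotspots
    overlap_ratio Proportion of input hotspots covered by predicted contact residues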

    Note: The contacts, contacts_overlap, and overlap_ratio metrics are output only when the Binding Hotspot parameter is set.
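    Read literally, overlap_ratio is the share of input hotspot residues that also appear among the predicted contact residues. The sketch below computes it on toy data. The residue labels such as "A10" and the function name are assumptions, since the exact format of these columns is not described here.

```python
def hotspot_overlap(contacts: set[str], hotspots: set[str]) -> tuple[set[str], float]:
    """Return (contacts that are also hotspots, fraction of hotspots covered)."""
    overlap = contacts & hotspots
    ratio = len(overlap) / len(hotspots) if hotspots else 0.0
    return overlap, ratio

# Toy example: four requested hotspots, two of which end up in the interface.
predicted_contacts = {"A10", "A11", "B30"}
input_hotspots = {"A10", "A11", "A12", "A25"}

overlap, ratio = hotspot_overlap(predicted_contacts, input_hotspots)
print(sorted(overlap), ratio)  # → ['A10', 'A11'] 0.5
```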

    • Output the top 5 designed structures: rank1-5*.cif

    • Output the packaged file of the final designed structures: final_designs.tar.gz

    • The design overview file results_overview.pdf summarizes the Filtering Criteria and Sorting Criteria used for structural evaluation and ranking.

    • Filtering Criteria

    Column Description
    has_x Threshold: 0.0
    Sequence validity check. Ensures that the sequence contains no unknown amino acids (“X”) and is composed exclusively of the 20 standard natural amino acids, guaranteeing physical synthesizability and expressibility.
    filter_rmsd Threshold: < 2.5 Å
    Overall backbone RMSD. Evaluates whether the entire complex (design + target) maintains its structure after refolding, verifying consistency between the generated and predicted structures.
    filter_rmsd_design Threshold: < 2.5 Å
    Backbone RMSD of the designed component (Binder) only. Ensures that the binder itself remains structurally stable even if the target undergoes minor movements.
    designfolding-filter_rmsd Threshold: < 2.5 Å
    Independent folding stability check. The binder is folded without the target, and RMSD is computed to ensure it can fold autonomously, substantially improving the likelihood of successful experimental expression.
    ALA_fraction
    GLY_fraction
    GLU_fraction
    LEU_fraction
    VAL_fraction
    Threshold: < 0.3 (30%)
    Sequence complexity/diversity control. Limits the individual fractions of alanine, glycine, glutamate, leucine, and valine to prevent the model from generating overly repetitive sequences to artificially boost stability scores. This enforces chemical diversity and promotes specific interactions.
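    The filtering thresholds listed above can be applied to a metrics table in a few lines. The sketch below uses an in-memory toy CSV. It assumes that every filter metric is available as a column with exactly these names, which is not guaranteed for final_designs_metrics.csv, and the values are invented.

```python
import csv
import io

RMSD_COLUMNS = ("filter_rmsd", "filter_rmsd_design", "designfolding-filter_rmsd")
FRACTION_COLUMNS = ("ALA_fraction", "GLY_fraction", "GLU_fraction", "LEU_fraction", "VAL_fraction")

def passes_filters(row: dict) -> bool:
    """Apply the documented thresholds: has_x = 0, RMSDs < 2.5 Å, fractions < 0.3."""
    if float(row["has_x"]) != 0.0:
        return False
    if any(float(row[column]) >= 2.5 for column in RMSD_COLUMNS):
        return False
    return all(float(row[column]) < 0.3 for column in FRACTION_COLUMNS)

# Invented rows: d2 fails the complex RMSD check, d3 has too much alanine.
TOY_CSV = """id,has_x,filter_rmsd,filter_rmsd_design,designfolding-filter_rmsd,ALA_fraction,GLY_fraction,GLU_fraction,LEU_fraction,VAL_fraction
d1,0.0,1.2,0.9,1.8,0.10,0.05,0.12,0.08,0.07
d2,0.0,3.1,1.0,1.5,0.10,0.05,0.12,0.08,0.07
d3,0.0,1.0,0.8,1.1,0.35,0.05,0.12,0.08,0.07
"""

kept = [row["id"] for row in csv.DictReader(io.StringIO(TOY_CSV)) if passes_filters(row)]
print(kept)  # → ['d1']
```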
    • Sorting Criteria
    Column Description
    design_to_target_iptm Weight = 1
    Interface Predicted TM score (0–1), used to assess the structural plausibility of the protein–protein interaction interface. Higher values indicate a greater likelihood of forming stable interactions at the interface (e.g., binding sites).
    design_ptm Weight = 1
    Predicted Template Modeling score (0–1), reflecting confidence in the global fold of the designed protein. Higher values indicate a more plausible overall structure; values >0.7 are typically considered high confidence.
    neg_min_design_to_target_pae Weight = 1
    Negative minimum Predicted Aligned Error (PAE) at the interface. Lower PAE indicates better accuracy (smaller error); the negative sign is used to facilitate ranking (higher is better). This metric reflects the model’s confidence in the most certain contact point at the binding interface.
    affinity_probability_binary1 Weight = 1
    Predicted binding affinity probability, primarily used in small-molecule binder scenarios. This is the model’s direct estimate of the probability that the molecule binds.
    plip_hbonds_refolded Weight = 0.5
    Number of hydrogen bonds in the refolded structure. Hydrogen bonds are critical for secondary structure formation and interface complementarity; higher counts generally indicate better overall stability.
    plip_saltbridge_refolded Weight = 0.5
    Number of salt bridges in the refolded structure. Salt bridges (electrostatic interactions between oppositely charged residues) are key contributors to protein stability; higher counts typically correspond to stronger binding.
    delta_sasa_refolded Weight = 0.5
    Change in solvent-accessible surface area upon refolding (ΔSASA, Ų). Larger values (e.g., >2000 Ų) indicate greater burial of hydrophobic cores, generally associated with higher thermal stability.
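The filtering thresholds and sorting weights listed above can be sketched in a few lines. The column names, thresholds, and weights are copied from the two tables; how the tool actually combines the weighted metrics is not stated here, so the weighted sum of per-metric ranks below is only an assumption for illustration.

```python
from typing import Dict, List

# Thresholds copied from the Filtering Criteria table.
RMSD_COLUMNS = ["filter_rmsd", "filter_rmsd_design", "designfolding-filter_rmsd"]
FRACTION_COLUMNS = ["ALA_fraction", "GLY_fraction", "GLU_fraction",
                    "LEU_fraction", "VAL_fraction"]

# Weights copied from the Sorting Criteria table (higher metric value is better).
WEIGHTS = {
    "design_to_target_iptm": 1.0,
    "design_ptm": 1.0,
    "neg_min_design_to_target_pae": 1.0,
    "affinity_probability_binary1": 1.0,
    "plip_hbonds_refolded": 0.5,
    "plip_saltbridge_refolded": 0.5,
    "delta_sasa_refolded": 0.5,
}


def passes_filters(row: Dict[str, float]) -> bool:
    """Return True when a design satisfies every threshold present in the row."""
    if row.get("has_x", 0.0) > 0.0:  # unknown residues are not allowed
        return False
    if any(row[c] >= 2.5 for c in RMSD_COLUMNS if c in row):  # RMSD < 2.5 A
        return False
    if any(row[c] >= 0.3 for c in FRACTION_COLUMNS if c in row):  # fraction < 0.3
        return False
    return True


def rank_designs(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Filter designs, then order them by a weighted sum of per-metric ranks.

    ASSUMPTION: rank aggregation is a hypothetical choice, not the tool's
    documented method. Metrics missing from any kept design are skipped.
    """
    kept = [r for r in rows if passes_filters(r)]
    penalty = [0.0] * len(kept)
    for column, weight in WEIGHTS.items():
        if not all(column in r for r in kept):
            continue
        order = sorted(range(len(kept)), key=lambda i: kept[i][column], reverse=True)
        for position, i in enumerate(order):
            penalty[i] += weight * position  # position 0 is the best design
    best_first = sorted(range(len(kept)), key=lambda i: penalty[i])
    return [kept[i] for i in best_first]
```

Rank-based aggregation keeps metrics with very different scales (a 0-1 score versus a surface area in Å²) from dominating one another, which is why it is used in this sketch.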

    Design Tutorial

    Masking Peptide Design Tutorial

    Known Antibody Structure

    1. Antibody Numbering
    Open the mH35 antibody structure using WeView, perform antibody numbering, and determine that the heavy chain CDR3 is located at H99-102, which serves as the binding site for the masking peptide
    image.png
    2. Parameter Settings in BoltzGen

    • Select Custom mode
    • Select Peptide in Protocol
    • Upload mH35 antibody structure in Structure
    • Select H and L chains in Chains as receptor chains
    • Input the receptor binding site in Binding Hotspot, which is the CDR3 region of the heavy chain: H99-102
      image.png
    • Input the peptide length to be designed in Sequence. The recommended design length for masking peptides is: 5-30
      image.png
    • Submit and run

    Known Antibody Sequence

    1. Antibody Numbering
    Open the mH35 antibody sequence using WeSeq, perform antibody numbering, and determine that the heavy chain CDR3 is located at 99-102, which serves as the binding site for the masking peptide
    image.png
    image.png
    2. Parameter Settings in BoltzGen

    • Select Custom mode
    • Input the heavy and light chain sequences of the mH35 antibody and the length of the masking peptide in Sequence, one chain per line. The recommended design length for masking peptides is: 5-30
    • Set the receptor binding site in Sequence Binding, which is the CDR3 region of the heavy chain: 1:99-102
      image.png
    • Submit and run

    Cyclic Peptide Design Tutorial

    Known Receptor Structure

    • Select Peptide in Protocol.
    • Upload receptor structure in Structure.
    • Define binding hotspots/non-binding sites (if any) in the receptor in Binding Hotspot.
      image.png
    • Sequence input is divided into the following two cases:
      • If there is a template structure, input the template cyclic peptide sequence together with the length of the inserted segment. For example, C8-9AC inserts 8-9 designed residues after the first residue C, and the first and last C close the ring, as follows:
        image.png
      • If there is no template structure, enter only the sequence length, for example 8-10, to design cyclic peptides of 8-10 residues that bind the receptor, as follows:
        image.png
    • Cyclization is divided into the following two types:
      • If the cyclic peptide is cyclized by head-to-tail peptide bond, you can fill in 1 in Sequence Cycle.
        image.png
      • If the cyclic peptide is cyclized by disulfide bond, do not fill in Sequence Cycle, and fill in the disulfide bond information 1,1,SG;1,11,SG in Covalent Bond.
        image.png
    • Submit and run
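To make the Sequence notation used above concrete, here is a minimal sketch that splits a specification such as C8-9AC or 8-10 into fixed residues and designed length ranges. The grammar is inferred only from the examples in this tutorial, and the function names are hypothetical; the tool's real parser may accept more syntax.

```python
import re
from typing import List, Optional, Tuple

# A token is a length range ("8-9"), a single length ("8"), or one residue letter.
TOKEN = re.compile(r"(\d+)-(\d+)|(\d+)|([A-Za-z])")

Segment = Tuple[Optional[str], int, int]  # (fixed residue or None, min_len, max_len)


def parse_design_spec(spec: str) -> List[Segment]:
    """Split a spec like 'C8-9AC' into fixed residues and designed ranges."""
    spec = spec.strip()
    segments: List[Segment] = []
    pos = 0
    while pos < len(spec):
        match = TOKEN.match(spec, pos)
        if match is None:
            raise ValueError(f"Unexpected character {spec[pos]!r} in {spec!r}")
        low, high, single, residue = match.groups()
        if residue is not None:
            segments.append((residue, 1, 1))          # fixed template residue
        elif single is not None:
            segments.append((None, int(single), int(single)))
        else:
            if int(low) > int(high):
                raise ValueError(f"Invalid range {low}-{high} in {spec!r}")
            segments.append((None, int(low), int(high)))  # designed segment
        pos = match.end()
    return segments


def length_range(spec: str) -> Tuple[int, int]:
    """Return the minimum and maximum total peptide length implied by a spec."""
    segments = parse_design_spec(spec)
    return sum(s[1] for s in segments), sum(s[2] for s in segments)
```

Under this reading, C8-9AC yields peptides of 11 to 12 residues. With 8 inserted residues the closing cysteine is residue 11, which is consistent with the 1,11,SG entry in the disulfide example above.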

    Known Receptor Sequence

    • Select Peptide in Protocol.
      image.png
    • According to the cyclic peptide situation, Sequence input is divided into the following two cases:
      • If the cyclic peptide has a template structure, input the receptor sequence, template cyclic peptide sequence and the length of the insertion sequence. As shown in the figure below, each line is a sequence, the receptor has 2 sequences, and the receptor sequence IDs are 1 and 2 respectively. The cyclic peptide sequence is C8-9AC (insert 8-9 residues after the 1st residue C, with the first C and last C forming the cyclic peptide), and the cyclic peptide is located in the third row with sequence ID 3.
        image.png
      • If there is no template structure, you can directly input the receptor sequence and the sequence length of the cyclic peptide. As shown in the figure below, predict cyclic peptides of 8-10AA length that bind to the receptor.
        image.png
    • Define binding hotspots/non-binding sites (if any) in the receptor in Sequence Binding.
      image.png
    • Cyclization is divided into the following two types:
      • If the cyclic peptide is cyclized by head-to-tail peptide bond, you can fill in 3 in Sequence Cycle.
        image.png
      • If the cyclic peptide is cyclized by disulfide bond, do not fill in Sequence Cycle, and fill in the disulfide bond information 1,1,SG;1,11,SG in Covalent Bond.
        image.png
    • Submit and run

    Reference

    • https://hannes-stark.com/assets/boltzgen.pdf
  • Name: Antibody Polyreactivity Prediction
    Description: 基于PolyXpert模型预测治疗性普通抗体的多反应性,该模型对六种蛋白语言模型——antiBERTy、AntiBERTa2、IgBert、ESM-2、ProtBert和 ProtT5进行了微调,并使用其中效果最优的ESM-2微调模型,作为最终模型,用于临床前治疗性单克隆抗体的多反应性评估。 Polyreactivity of therapeutic conventional antibodies is predicted using the PolyXpert model. This model fine-tunes six protein language models—antiBERTy, AntiBERTa2, IgBert, ESM-2, ProtBert, and ProtT5—and selects the best-performing fine-tuned ESM-2 model as the final predictor for preclinical polyreactivity assessment of therapeutic monoclonal antibodies.
    Tags: undefined
    Author: Yuwei Zhou
    Release: 2026-01-17 00:00:00
    Reference: Yuwei Zhou, Haoxiang Tang, Changchun Wu, Zixuan Zhang, Jinyi Wei, Rong Gong, Samarappuli Mudiyanselage Savini Gunarathne, Changcheng Xiang, Jian Huang. Enhancing polyreactivity prediction of preclinical antibodies through fine-tuned protein language models. Journal of Pharmaceutical Analysis

    Antibody Polyreactivity Prediction

    简介

    预测治疗性普通抗体或纳米抗体的多反应性。模块基于PolyXpert模型及纳米抗体多反应性模型实现,PolyXpert模型对六种蛋白语言模型——antiBERTy、AntiBERTa2、IgBert、ESM-2、ProtBert和 ProtT5进行了微调,使其作为端到端的多反应性预测器,并使用其中效果最优的ESM-2微调模型,作为最终模型,用于临床前治疗性单克隆抗体的多反应性评估。纳米抗体多反应性预测,是通过机器学习模型,根据序列预测其与多种非靶标蛋白的非特异性结合倾向。

    治疗性普通抗体

    多反应性数据集的构建
    PolyXpert采用一个基于酵母展示体系构建的单链可变片段(scFv)多反应性数据集。该数据集包含两个独立的人源 scFv 文库(library #1 和 library #2),所有序列均经高通量测序获得。通过流式细胞分选(FACS),依据卵清蛋白、CHO 细胞来源的可溶性胞质蛋白(SCPs)、可溶性膜蛋白(SMPs)以及胰岛素四种多特异性试剂对 scFv 的多反应性水平进行表型划分。library #1 共包含 246,293 条唯一序列,其中高多反应性与低多反应性 scFv 分别为 115,038 条和 131,255 条;library #2 共包含 127,217 条序列,其中高多反应性 scFv 为 93,080 条,低多反应性 scFv 为 34,137 条。library #1 的序列多样性更高,因此划分为训练集(60%)、验证集(20%)和测试集(20%),而 library #2 被用作独立的外部测试集。

    治疗性抗体数据集
    治疗性抗体数据包括 48 条已获批准抗体序列和 89 条处于临床 II/III 期的抗体序列。对于每个抗体,同时提取了来自 12 项生物物理和生化实验的对应数据。随后,剔除了 6 条存在序列记录冲突的抗体,最终得到包含 131 条抗体序列的数据集。抗体多反应性通过该数据集中基于多特异性试剂的可溶性膜蛋白(PSR SMP)评分进行判定,以 0.27 作为分类阈值将抗体划分为高多反应性与低多反应性两类。

    微调蛋白语言模型的预测效果
    相比之下,微调后的 ESM-2 模型在训练数据集及两个测试数据集上均表现出最优且稳定的预测性能。在 library #2 外部独立测试集上,该模型取得了显著更高的整体判别能力和泛化性能。
    image.png

    在不同开发阶段治疗性抗体中的预测能力
    基于已有研究数据,共分析了 131 条单克隆抗体的 PSR SMP 评分。模型预测的高、低多反应性抗体分组在 PSR SMP 评分上呈现出显著差异,表明 PolyXpert 具备良好的判别能力。在临床阶段抗体和已获批准抗体两个子集中,同样观察到预测分组之间一致的差异趋势。
    image.png

    纳米抗体

    基于debbiemarkslab开源的模型实现,该模型的构建是从一个大型初始合成纳米抗体文库出发,分离获得了低多反应性和高多反应性的纳米抗体初始数据集。然后使用机器学习模型在初始数据集的深度测序数据上进行训练,以学习低多反应性和高多反应性纳米抗体的序列特征。
    image.png

    数据集构建

    • 实验流程
    1. 文库来源:使用一个大型、合成的naïve骆驼源纳米抗体酵母展示文库(模拟天然免疫库)。
    2. 筛选策略:
      MACS(磁珠分选):预富集高表达纳米抗体的酵母细胞。
      FACS(流式分选):使用 PSR(Polyspecificity Reagent) 对酵母细胞进行染色。PSR是由昆虫细胞(Sf9)膜蛋白提取的混合蛋白试剂,用于模拟体内非特异性结合环境。分选出 PSR-negative(低多反应性) 和 PSR-positive(高多反应性) 两个群体。
      深度测序:对分选后的群体进行Illumina MiSeq测序。
    • 数据规模与处理
      初始数据集:65,147条unique低多反应性序列 + 69,155条unique高多反应性序列 。
      扩展数据集:通过更深度的测序,扩展至 1,221,800条低多反应性 和 1,058,842条高多反应性序列 。
      序列预处理:
      使用 ANARCI 工具按 IMGT编号方案 对齐序列,识别CDR区域。
    • 训练/测试集划分
      为避免序列相似性导致的高估性能,采用基于聚类的严格划分:使用 k-means聚类 将序列分为5个簇。构建训练/测试分割时,确保测试集序列与训练集序列的距离(Levenshtein distance)> 10,且CDR区域序列相似度仅约 75% ,这种划分方式模拟了真实场景中模型面对全新序列的泛化能力。

    核心模型

    • 输入表示:仅提取纳米抗体的 CDR1、CDR2、CDR3 序列。使用ANARCI按IMGT方案对齐,将变长CDR序列映射到固定长度的编号位置。每个位置用20维one-hot向量表示氨基酸类型(共20种标准氨基酸)。拼接所有CDR位置,形成一个高维稀疏特征向量。
    • 模型结构:标准的L2正则化逻辑回归(L2-regularized Logistic Regression)。输出模型打分,经sigmoid转换为多反应性类别。
    • 性能:AUC = 0.85

    关键发现

    • 增加多反应性:精氨酸(Arg, R)在所有CDR区域;赖氨酸(Lys, K)、色氨酸(Trp, W)、酪氨酸(Tyr, Y)在CDR3。
    • 降低多反应性:酸性残基(Asp, Glu)在CDR2和CDR3。
    • 位置依赖性:尽管精氨酸总体上增加多反应性,但在CDR1的30号和38号位置,以及色氨酸在CDR3的105号位置,低多反应性克隆可以容忍这些残基 。

    参数说明

    Sequence

    待预测普通抗体的 Fv 区序列,或者纳米抗体序列,FASTA 格式,支持批量预测,最多可同时提交 500 对普通抗体(共 1000 条重、轻链序列),按顺序放置即可(每条抗体的轻、重链Fv序列不分先后),或1000条纳米抗体。
    示例如下:
    普通抗体

    >avelumab.H
    EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYIMMWVRQAPGKGLEWVSSIYPSGGITFYADTVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARIKLGTVTTVDYWGQGTLVTVSS
    >avelumab.L
    QSALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSNRPSGVSNRFSGSKSGNTASLTISGLQAEDEADYYCSSYTSSSTRVFGTGTKVTVLG
    >durvalumab.H
    EVQLVESGGGLVQPGGSLRLSCAASGFTFSRYWMSWVRQAPGKGLEWVANIKQDGSEKYYVDSVKGRFTISRDNAKNSLYLQMNSLRAEDTAVYYCAREGGWFGELAFDYWGQGTLVTVSS
    >durvalumab.L
    EIVLTQSPGTLSLSPGERATLSCRASQRVSSSYLAWYQQKPGQAPRLLIYDASSRATGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYGSLPWTFGQGTKVEIK
    
    

    纳米抗体

    >nanobody
    QVQLVESGGTLVQPGGSLRLSCAASRNISQIAILGWYRQAPGKQREAVAIITGGGPAHYADSVKGRFTISRDTAKNTTYLQMDSLKPEDTAVYFCNVRIRWGAADSWGQGIQVTVSS
    

    注意:
    1.当重链(Heavy chain)与轻链(Light chain)的序列名称不完全一致时,系统将按照以下规则确定抗体名称:

    • 若重链与轻链名称存在相同前缀,则该前缀将被识别为抗体名称。
    • 若不存在可识别的共同前缀,则默认使用重链序列名称作为抗体名称。

    建议采用统一且简洁的命名格式,例如:

    >V1.H
    >V1.L
    

    其中:

    • V1 表示同一抗体的唯一编号;
    • .H 和 .L 分别表示重链与轻链。

    Mode

    抗体类别选择,普通抗体或纳米抗体。

    结果说明

    输出result.csv文件
    普通抗体的内容格式如下:

    Name Possibility low-polyreactivity Possibility high-polyreactivity Polyreactivity
    Seq1 0.0003 0.9997 High
    Seq2 0.9993 0.0007 Low

    说明:

    列名 说明
    Name 抗体名称
    Possibility low-polyreactivity 预测为低多反应性(Low polyreactivity)的概率
    Possibility high-polyreactivity 预测为高多反应性(High polyreactivity)的概率
    Polyreactivity 最终分类标签,High属于高多反应性,Low属于低多反应性

    纳米抗体的内容格式如下:

    Name Polyreactivity Score
    sample_seq1 Low 1.1481
    sample_seq2 High -2.5228

    说明:

    列名 说明
    Name 纳米抗体序列名称
    Polyreactivity 多反应性(polyreactivity)分类标签,High = 预测为高多反应性,Low = 预测为低多反应性。
    Score 模型打分,分数越高,预测的多反应性越低;分数越低,预测的多反应性越高。

    输出纳米抗体打分的分布状态图 dist_pr_scores.png,示例如下:
    image.png
    图中给出了数据集(65,147条低多反应性序列 + 69,155条高多反应性序列)的模型打分分布情况,以及输入的纳米抗体序列(最多输出前10条)的预测模型打分在整个数据集中的所处位置。

    参考文献

    • Yuwei Zhou, Haoxiang Tang, Changchun Wu, Zixuan Zhang, Jinyi Wei, Rong Gong, Samarappuli Mudiyanselage Savini Gunarathne, Changcheng Xiang, Jian Huang. Enhancing polyreactivity prediction of preclinical antibodies through fine-tuned protein language models. Journal of Pharmaceutical Analysis, 2025, 101448. DOI: 10.1016/j.jpha.2025.101448
    • Harvey, E.P., Shin, JE., Skiba, M.A. et al. An in silico method to assess antibody fragment polyreactivity. Nat Commun 13, 7554 (2022). https://doi.org/10.1038/s41467-022-35276-4

    Antibody Polyreactivity Prediction

    Introduction

    Predicts the polyreactivity of therapeutic conventional antibodies or nanobodies. The module is implemented based on the PolyXpert model and a nanobody polyreactivity model.

    Conventional Antibody

    PolyXpert fine-tunes six protein language models (antiBERTy, AntiBERTa2, IgBert, ESM-2, ProtBert, and ProtT5) as end-to-end polyreactivity predictors. The fine-tuned ESM-2 model demonstrated the best and most consistent predictive performance across the training set and two test sets, achieving significantly higher overall discriminative ability and generalization on the external independent test set. It was selected as the final model for preclinical therapeutic monoclonal antibody polyreactivity evaluation.

    image.png

    The model-predicted high- and low-polyreactivity groups showed significant differences in PSR SMP scores, with consistent trends observed in both the clinical-stage antibody and approved antibody subsets.

    image.png

    Nanobody

    Based on the debbiemarkslab open-source model. Starting from a large synthetic nanobody library, low- and high-polyreactivity datasets were obtained via FACS sorting. A machine learning model was trained on deep sequencing data to learn CDR sequence features associated with polyreactivity (AUC = 0.85). The model takes CDR1, CDR2, and CDR3 sequences aligned by ANARCI under the IMGT scheme as one-hot encoded input, and outputs a polyreactivity score via L2-regularized logistic regression.
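The input representation described above (aligned CDR positions, one-hot encoded, fed to a logistic regression) can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the aligned string, the weights, and the bias are placeholders, and gap positions from the alignment are assumed to be written as "-" and encoded as all zeros.

```python
import math
from typing import List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(aligned_cdrs: str) -> List[float]:
    """Encode a fixed-length aligned CDR string as a flat one-hot vector.

    Each position contributes 20 entries; gap positions ('-') stay all zero.
    """
    vector = [0.0] * (len(aligned_cdrs) * len(AMINO_ACIDS))
    for position, residue in enumerate(aligned_cdrs):
        if residue == "-":
            continue
        vector[position * len(AMINO_ACIDS) + AA_INDEX[residue]] = 1.0
    return vector


def logistic_score(features: Sequence[float],
                   weights: Sequence[float],
                   bias: float) -> Tuple[float, float]:
    """Return the linear score and its sigmoid transform.

    ASSUMPTION: weights and bias are hypothetical; the real model's fitted
    parameters and the sign convention of its score are not reproduced here.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return z, 1.0 / (1.0 + math.exp(-z))
```

Because every position maps to its own block of 20 features, a fitted weight can be read directly as the contribution of one amino acid at one aligned CDR position.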

    image.png

    Parameters

    Sequence

    Fv region sequences of conventional antibodies or nanobody sequences in FASTA format. Supports batch prediction: up to 500 antibody pairs (1,000 heavy and light chain sequences total) or 1,000 nanobody sequences.

    Example (conventional antibody):

    >avelumab.H
    EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYIMMWVRQAPGKGLEWVSSIYPSGGITFYADTVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARIKLGTVTTVDYWGQGTLVTVSS
    >avelumab.L
    QSALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSNRPSGVSNRFSGSKSGNTASLTISGLQAEDEADYYCSSYTSSSTRVFGTGTKVTVLG
    

    Example (nanobody):

    >nanobody
    QVQLVESGGTLVQPGGSLRLSCAASRNISQIAILGWYRQAPGKQREAVAIITGGGPAHYADSVKGRFTISRDTAKNTTYLQMDSLKPEDTAVYFCNVRIRWGAADSWGQGIQVTVSS
    

    Note: It is recommended to use a unified naming format such as V1.H (heavy chain) and V1.L (light chain). If the heavy and light chain names share a common prefix, that prefix is used as the antibody name; otherwise the heavy chain sequence name is used by default.
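The naming rule in the note can be illustrated with a short sketch. It assumes that trailing separator characters are stripped from the shared prefix, which the note does not state; the function name is hypothetical.

```python
import os


def antibody_name(heavy_name: str, light_name: str) -> str:
    """Derive an antibody name from its heavy and light chain record names.

    Uses the common prefix of the two names; falls back to the heavy chain
    name when no usable prefix exists.
    ASSUMPTION: trailing '.', '_', '-', '|' and spaces are removed from the prefix.
    """
    prefix = os.path.commonprefix([heavy_name, light_name]).rstrip("._-| ")
    return prefix if prefix else heavy_name
```

For example, `antibody_name("avelumab.H", "avelumab.L")` gives `avelumab`. A character-wise prefix can also merge different antibodies whose IDs merely start alike, which is one reason to follow the recommended V1.H / V1.L pattern with unique IDs.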

    Mode

    Select antibody type: conventional antibody or nanobody.

    Results

    The output includes the following files:

    Output File Description
    result.csv Polyreactivity prediction results
    dist_pr_scores.png Nanobody score distribution plot (nanobody mode only)

    Columns in result.csv (conventional antibody):

    Column Description
    Name Antibody name
    Possibility low-polyreactivity Predicted probability of low polyreactivity
    Possibility high-polyreactivity Predicted probability of high polyreactivity
    Polyreactivity Final classification label: High = high polyreactivity, Low = low polyreactivity

    Columns in result.csv (nanobody):

    Column Description
    Name Nanobody sequence name
    Polyreactivity Classification label: High = high polyreactivity, Low = low polyreactivity
    Score Model score; higher score indicates lower predicted polyreactivity

    Example of dist_pr_scores.png:

    image.png

    References

    • Yuwei Zhou, Haoxiang Tang, Changchun Wu, Zixuan Zhang, Jinyi Wei, Rong Gong, et al. Enhancing polyreactivity prediction of preclinical antibodies through fine-tuned protein language models. Journal of Pharmaceutical Analysis, 2025, 101448. DOI: 10.1016/j.jpha.2025.101448
    • Harvey, E.P., Shin, J.E., Skiba, M.A. et al. An in silico method to assess antibody fragment polyreactivity. Nat Commun 13, 7554 (2022). DOI: 10.1038/s41467-022-35276-4
  • Name: Protein Protonation v2
    Description: 预测每个蛋白残基的pKa并根据指定的pH值判断每个残基的质子化状态。 Predicts the pKa value of each protein residue using PROPKA3 and determines each residue's protonation state at the specified pH.
    Tags: undefined
    Author: Jan H. Jensen
    Release: 2022-09-29 00:00:00
    Reference: Olsson, Mats HM, Chresten R. Sondergaard, Michal Rostkowski, and Jan H. Jensen. "PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions." Journal of Chemical Theory and Computation 7, no. 2 (2011): 525-537. doi:10.1021/ct100578z

    Protein Protonation

    简介

    Protein Protonation(蛋白质子化)模块用于预测每个蛋白残基的pKa,并根据指定的pH值判断每个残基的质子化状态。

    参数说明

    PDB File

    蛋白的结构文件,PDB格式,该文件可由MD PDB Prepare模块提取得到。

    pH

    pH值,默认为7。

    N terminal

    N端残基质子化状态,只有charge和neutral两个选项,默认charge。

    C Terminal

    C端残基质子化状态,只有charge和neutral两个选项,默认charge。

    Custom Residues

    自定义残基质子化状态。例如:HIS90HIE HIS91HIP。
    注:

    • 这里的氨基酸序号为预处理后结构中的顺序编号,从 1 开始计数,并非原始 PDB 文件中给出的残基编号;
    • 该功能仅支持输入单个 PDB 文件,不支持压缩包格式。

    结果说明

    输出结果包括:

    输出文件名称 说明
    protein_protonation.pdb 已完成质子化处理的蛋白质结构文件(PDB 格式)
    pka_summary_{pdb_name}.csv 各可电离残基的 pKa 计算结果及最终质子化状态
    pi_summary.csv 蛋白质等电点(pI)计算结果汇总
    result_protonation.zip 所有输出结果文件的压缩打包

    pka_summary_{pdb_name}.csv 文件内容如下:

    字段名 说明
    group 残基类型(如 ASP、GLU、HIS、LYS 等)
    resseq 残基在预处理后结构中的顺序编号(从 1 开始计数)
    chain 链 ID(若预处理过程中链 ID 被移除,则可能为空)
    pka 计算得到的残基 pKa 值
    model_pka 该残基在模型体系中的参考 pKa 值
    final_state 在目标条件下最终采用的质子化状态

    pi_summary.csv文件内容如下:

    字段名 说明
    pdb 输入的 PDB 文件名称
    folded_pi 蛋白质在折叠状态下的等电点(pI)
    unfolded_pi 蛋白质在非折叠(完全展开)状态下的等电点(pI)

    参考文献

    • Olsson, Mats HM, Chresten R. Sondergaard, Michal Rostkowski, and Jan H. Jensen. “PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions.” Journal of Chemical Theory and Computation 7, no. 2 (2011): 525-537. DOI:10.1021/ct100578z

    Protein Protonation

    Introduction

    The Protein Protonation module is primarily used to predict the pKa of each protein residue and determine the protonation state of each residue based on the specified pH value.
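The link between a predicted pKa and a protonation state at a given pH follows from the Henderson-Hasselbalch relation. The sketch below shows that relation only; the rule the module actually applies when assigning states is not documented here, so the simple "pH below pKa" cutoff is an assumption.

```python
def protonated_fraction(pka: float, ph: float) -> float:
    """Fraction of a titratable group in its protonated form (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def is_protonated(pka: float, ph: float) -> bool:
    """ASSUMPTION: treat the group as protonated when pH is below its pKa."""
    return ph < pka
```

At pH equal to the pKa the group is half protonated. For a histidine with a predicted pKa of 6.5 at the default pH of 7, `protonated_fraction(6.5, 7.0)` is about 0.24, so the neutral form dominates.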

    Parameters

    PDB File

    The structure file of the protein in PDB format, which can be obtained from the MD PDB Prepare module.

    pH

    pH value, default is 7.

    N terminal

    Protonation state of the N-terminal residue, with options of charge and neutral, default is charge.

    C Terminal

    Protonation state of the C-terminal residue, with options of charge and neutral, default is charge.

    Custom Residues

    Customize residue protonation states. For example: HIS90HIE HIS91HIP.
    Note:

    • The residue indices refer to the sequential numbering in the preprocessed structure, starting from 1, and do not correspond to the residue numbers in the original PDB file;
    • This feature supports only a single PDB file as input; compressed archives are not supported.
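A minimal sketch of how a Custom Residues string such as HIS90HIE HIS91HIP could be parsed, together with a helper that maps original PDB residue identifiers to sequential indices starting at 1. The token layout (three-letter name, index, three-letter target) is inferred from the example above, and the assumption that preprocessing numbers residues in file order is mine, not stated by the module.

```python
import re
from typing import Dict, Hashable, List, NamedTuple, Sequence


class Override(NamedTuple):
    original: str   # residue name in the input structure, e.g. "HIS"
    index: int      # sequential index in the preprocessed structure (1-based)
    target: str     # requested protonation variant, e.g. "HIE"


TOKEN = re.compile(r"^([A-Z]{3})(\d+)([A-Z]{3})$")


def parse_custom_residues(text: str) -> List[Override]:
    """Parse whitespace-separated tokens such as 'HIS90HIE HIS91HIP'."""
    overrides = []
    for token in text.split():
        match = TOKEN.match(token)
        if match is None:
            raise ValueError(f"Cannot parse custom residue token {token!r}")
        original, index, target = match.groups()
        overrides.append(Override(original, int(index), target))
    return overrides


def sequential_index(residue_ids: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Map residue identifiers, listed in file order, to 1-based indices.

    ASSUMPTION: the preprocessed structure numbers residues consecutively
    in the order they appear in the file.
    """
    return {rid: i for i, rid in enumerate(residue_ids, start=1)}
```

With such a mapping, a residue known as chain B residue 1 in the original file can be looked up to find the index to enter in Custom Residues.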

    Results

    The results include the following files:

    Output file name Description
    protein_protonation.pdb Protein structure file after protonation (PDB format)
    pka_summary_{pdb_name}.csv pKa calculation results and final protonation states of ionizable residues
    pi_summary.csv Summary of protein isoelectric point (pI) calculations
    result_protonation.zip Compressed archive containing all output result files

    The contents of pka_summary_{pdb_name}.csv are described below:

    Field name Description
    group Residue type (e.g., ASP, GLU, HIS, LYS, etc.)
    resseq Sequential residue index in the preprocessed structure (starting from 1)
    chain Chain ID (may be empty if the chain ID was removed during preprocessing)
    pka Calculated pKa value of the residue
    model_pka Reference pKa value of the residue in the model system
    final_state Final protonation state adopted under the target conditions

    The contents of pi_summary.csv are described below:

    Field name Description
    pdb Name of the input PDB file
    folded_pi Isoelectric point (pI) of the protein in the folded state
    unfolded_pi Isoelectric point (pI) of the protein in the unfolded (fully extended) state

    Reference

    • Olsson, Mats HM, Chresten R. Sondergaard, Michal Rostkowski, and Jan H. Jensen. “PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions.” Journal of Chemical Theory and Computation 7, no. 2 (2011): 525-537. DOI:10.1021/ct100578z
  • Name: Antibody Thermostability Prediction (AbMelt)
    Description: 预测抗体的熔体开始温度(Tm,on)、熔体温度(Tm)及抗体的聚集温度(Tagg)。基于AbMelt模型,通过在不同温度(300K、350K 和 400K)下进行分子动力学模拟,生成代表实验热稳定性测量过程中不同温度阶段的结构集合,模拟同源抗体结构的内在灵活性,并学习相关描述符预测相应熔体温度。 This module predicts the antibody melting onset temperature (Tm,on), melting temperature (Tm), and aggregation temperature (Tagg).
    Tags: undefined
    Author: Zachary A Rollins
    Release: 2025-11-06 00:00:00
    Reference: Rollins, Z. A.; Widatalla, T.; Cheng, A. C.; Metwally, E. AbMelt: Learning Antibody Thermostability from Molecular Dynamics. Biophys. J. 2024, 123, 2921–2933.

    Antibody Thermostability Prediction (AbMelt)

    简介

    预测抗体的熔体开始温度(Tm,on)、熔体温度(Tm)及抗体的聚集温度(Tagg)。模块基于AbMelt模型实现,AbMelt通过在不同温度(300K、350K 和 400K)下进行分子动力学模拟,生成代表实验热稳定性测量过程中不同温度阶段的结构集合,模拟同源抗体结构的内在灵活性,并学习相关描述符预测相应熔体温度。

    AbMelt的流程图如下:
    image.png

    用到的描述符信息如下图所示:
    image.png
    所有描述符的均值和标准差在20ns的平衡后以10ps的间隔计算。

    对描述符进行筛选:

    • 计算描述符与Tm,on ,Tm,Tagg之间的皮尔逊相关系数r,r>0.45且<0.95描述符保留
    • 使用网格搜索和随机森林递归选择剩余特征,并进行交叉验证和随机特征重要性排序
    • 最后,穷尽搜索排名前10位的特征以获得最佳特征组合(1-5个特征)

    机器学习使用了8种常用方法:线性回归、弹性网络、支持向量机、k近邻、决策树、随机森林、adaboost和xgboost。最佳模型为:
    image.png

    参数说明

    Structure

    用于熔体温度预测的抗体Fv区结构,PDB格式。

    Output

    输出文件名,包含Tm,on、Tm及Tagg预测值,CSV格式,默认为results.csv。

    结果说明

    results.csv文件,包含如下信息:

    列名 说明
    Name 结构名称
    gyr_cdrs_Rg_std_350 温度350K时,CDR区域回转半径的标准差
    bonds_contacts_std_350 温度350K时,内部接触的标准差
    rmsf_cdrl1_std_350 温度350K时,CDRL1区域的RMSF标准差
    rmsf_cdrs_mu_400 温度400K时,CDR区域的RMSF平均值
    gyr_cdrs_Rg_std_400 温度400K时,CDR区域回转半径的标准差
    all-temp_lamda_b=25_eq=20 lamda参数,与热容(heat capacity)相关,用于量化骨架结构N-H键矢量序参数(S²) 的温度依赖性
    all-temp-sasa_core_mean_k=20_eq=20 所有温度下,核心SASA的平均值
    all-temp-sasa_core_std_k=20_eq=20 所有温度下,核心SASA的标准差
    r-lamda_b=2.5_eq=20 lamda参数的线性拟合决定系数
    Tm 预测得到的Tm值
    Tagg 预测得到的Tagg值
    Tmonset 预测得到的Tm,on值

    参考文献

    • Rollins, Z. A.; Widatalla, T.; Cheng, A. C.; Metwally, E. AbMelt: Learning Antibody Thermostability from Molecular Dynamics. Biophys. J. 2024, 123, 2921–2933. DOI:10.1016/j.bpj.2024.06.003

    Antibody Thermostability Prediction (AbMelt)

    Introduction

    This module predicts the antibody melting onset temperature (Tm,on), melting temperature (Tm), and aggregation temperature (Tagg).
    It is implemented based on the AbMelt model, which performs molecular dynamics (MD) simulations at three temperatures (300 K, 350 K, and 400 K) to generate structural ensembles corresponding to different stages of experimental thermal stability measurements.
    AbMelt simulates the intrinsic flexibility of homologous antibody structures and learns relevant descriptors to predict corresponding melting temperatures.

    The workflow of AbMelt is illustrated below:
    image.png

    Descriptor information used in the model is shown below:
    image.png
    All descriptor means and standard deviations are calculated after 20 ns of equilibration, from frames sampled at 10 ps intervals.
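As a rough illustration of this averaging step, the sketch below drops the frames that fall within an initial equilibration window and then takes the mean and standard deviation of one descriptor time series. It assumes the first 20 ns are treated as equilibration and that the sample standard deviation is used; neither detail is confirmed by this page, and the function name is hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def descriptor_stats(values: Sequence[float],
                     dt_ps: float = 10.0,
                     equilibration_ns: float = 20.0) -> Tuple[float, float]:
    """Mean and sample standard deviation of a descriptor after equilibration.

    values: one descriptor value per saved frame, in time order.
    dt_ps: spacing between saved frames in picoseconds.
    equilibration_ns: initial part of the trajectory to discard.
    """
    skip = int(round(equilibration_ns * 1000.0 / dt_ps))  # 20 ns / 10 ps = 2000 frames
    production = list(values[skip:])
    if len(production) < 2:
        raise ValueError("Not enough frames after the equilibration window")
    return statistics.fmean(production), statistics.stdev(production)
```

The suffixes in output columns such as gyr_cdrs_Rg_std_350 or rmsf_cdrs_mu_400 correspond to this kind of per-temperature mean (mu) or standard deviation (std).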

    Descriptor selection was performed as follows:

    • Compute Pearson correlation coefficients (r) between each descriptor and Tm,on / Tm / Tagg; retain descriptors with 0.45 < r < 0.95.
    • Apply grid search and recursive feature elimination using random forests, followed by cross-validation and random feature importance ranking.
    • Finally, perform exhaustive search on the top 10 ranked features to obtain the best feature combinations (1–5 features).
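The first and last selection steps above can be sketched with the standard library alone. The correlation window is applied to r exactly as written; whether the original work used the absolute value of r is not stated here, so that is left as a comment. The random-forest ranking step in between is not reproduced, and all names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Iterator, List, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(descriptors: Dict[str, Sequence[float]],
                          target: Sequence[float],
                          low: float = 0.45,
                          high: float = 0.95) -> List[str]:
    """Keep descriptors with low < r < high against the target temperature.

    NOTE: r is used as written in the text; swap in abs(r) if strongly
    negative correlations should also be kept (an open assumption).
    """
    return [name for name, values in descriptors.items()
            if low < pearson_r(values, target) < high]


def feature_subsets(ranked: Sequence[str],
                    top_n: int = 10,
                    max_size: int = 5) -> Iterator[Tuple[str, ...]]:
    """Yield every combination of 1..max_size features from the top_n ranked ones."""
    top = list(ranked[:top_n])
    for size in range(1, max_size + 1):
        yield from combinations(top, size)
```

With 10 candidate features and subset sizes 1 to 5 the exhaustive search covers 637 combinations, which is small enough to fit and cross-validate every one of them.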

    Eight common machine-learning methods were evaluated:
    Linear Regression, Elastic Net, Support Vector Machine, k-Nearest Neighbors, Decision Tree, Random Forest, AdaBoost, and XGBoost.
    The best-performing models are shown below:
    image.png


    Parameters

    Structure

    The antibody Fv-region structure used for melting temperature prediction, in PDB format.

    Output

    Name of the output CSV file containing predicted Tm,on, Tm, and Tagg values.
    Default: results.csv.


    Results

    The file results.csv is generated, containing:

    Column Name Description
    Name Structure name
    gyr_cdrs_Rg_std_350 Standard deviation of the radius of gyration (Rg) of the CDR regions at 350 K
    bonds_contacts_std_350 Standard deviation of internal contacts at 350 K
    rmsf_cdrl1_std_350 Standard deviation of RMSF for the CDRL1 region at 350 K
    rmsf_cdrs_mu_400 Mean RMSF of the CDR regions at 400 K
    gyr_cdrs_Rg_std_400 Standard deviation of the radius of gyration (Rg) of the CDR regions at 400 K
    all-temp_lamda_b=25_eq=20 Lambda parameter related to heat capacity, used to quantify the temperature dependence of the backbone N–H bond vector order parameter (S²)
    all-temp-sasa_core_mean_k=20_eq=20 Mean core SASA across all temperatures
    all-temp-sasa_core_std_k=20_eq=20 Standard deviation of core SASA across all temperatures
    r-lamda_b=2.5_eq=20 Coefficient of determination (R²) from the linear fit of the lambda parameter
    Tm Predicted melting temperature (Tm)
    Tagg Predicted aggregation temperature (Tagg)
    Tmonset Predicted melting onset temperature (Tm,on)

    Reference

    • Rollins, Z. A.; Widatalla, T.; Cheng, A. C.; Metwally, E. AbMelt: Learning Antibody Thermostability from Molecular Dynamics. Biophys. J. 2024, 123, 2921–2933. DOI:10.1016/j.bpj.2024.06.003
  • Name: Peptide SMILES Generation
    Description: 进行多肽(含环肽)的从头生成、性质计算与分析、格式转换等,支持非天然氨基酸。模块基于p2smi工具包实现。 Conducts de novo peptide generation (including cyclic peptides), property computation and analysis, and format conversion, supporting non-natural amino acids. This module is built upon the p2smi toolkit.
    Tags: undefined
    Author: Aaron Feller
    Release: 2025-12-05 00:00:00
    Reference: Feller, A. L. and Wilke, C. O. (2025). p2smi: A Python Toolkit for Peptide FASTA-to-SMILES Conversion and Molecular Property Analysis.

    Peptide SMILES Generation

    简介

    进行多肽(含环肽)的从头生成、性质计算与分析、格式转换等,支持非天然氨基酸。模块基于p2smi工具包实现。
    image.png

    能够自动生成肽序列、将肽序列转换为 SMILES 字符串(支持环化结构和非天然氨基酸),并计算多种分子性质。此外,还提供修饰功能(如N-甲基化、PEG化)、合成可行性评估。

    主要功能:

    • 生成随机肽序列(支持非天然氨基酸、D 构型与多种环化方式)。
    • 将肽类 FASTA 文件转换为有效的 SMILES 字符串。
    • 支持五种环化类型:二硫键、头尾环化、侧链-侧链、侧链-N 端、侧链- C 端。
    • 计算多种分子性质(如分子量、logP、TPSA、Lipinski 指标等)。
    • 评估肽序列的合成可行性。

    非标准氨基酸信息表(共411个):

    Name Code Formula MolWeight SMILES
    Phenylglycine PG C8H9NO2 151.063328528 N[C@@H](c1ccccc1)C(=O)O
    4-methoxy-Phenylalanine 0A1 C10H13NO3 195.089543276 N[C@@H](Cc1ccc(OC)cc1)C(=O)O
    …

    详细列表见附录。

    参数说明

    Generation

    根据自定义约束条件生成随机多肽序列。

    Number

    生成的多肽序列数量,默认为10,最大为10000。

    Minimum Length

    多肽序列最小长度,默认为10。

    Maximum Length

    多肽序列最大长度,默认为20,最大长度为150。

    NCAA Percentage

    每条多肽序列中的非天然氨基酸(NCAA, Non-Canonical Amino Acids)占比,默认为0.1(10%),数值范围为0.0 ~ 1.0(0%~100%)。

    D-AA Percentage

    每条多肽序列中的D型氨基酸占比,默认为0.1(10%),数值范围为0.0 ~ 1.0(0%~100%)。

    Cyclization Types

    设置环肽的环化类型,可多选。all表示选择所有环肽类型,都不选时,则生成线性肽(默认为都不选)。
    支持的环化类型如下:

    标签(Tag) 类型(Type) 描述(Description)
    SS 二硫键(Disulfide) 半胱氨酸之间形成二硫键
    HT 头尾环化(Head-to-tail) 在 N 端与 C 端之间(形成酰胺键)
    SCSC 侧链–侧链(Sidechain–Sidechain) 侧链之间成环(形成缩肽-酯键)
    SCNT 侧链–N 端(Sidechain–N-Terminus) 侧链与 N 末端成环(形成酰胺键)
    SCCT 侧链–C 端(Sidechain–C-Terminus) 侧链与 C 末端成环

    Output

    生成的多肽序列文件,FASTA格式,默认为peptides.fasta。

    Format Conversion

    将FASTA格式的肽序列转换为SMILES格式,环肽需指定环化类型。

    Peptides

    多肽序列文件,FASTA格式。
    注意:

    • 大写单字母(ACDEFGHIKLMNPQRSTVWY)表示标准氨基酸
    • 小写单字母(acdefghiklmnpqrstvwy)表示D型氨基酸
    • 非标准氨基酸用大括号+非标准氨基酸Code表示 (例如:羟脯氨酸表示为{Hyp}),常用非标准氨基酸Code请见附录列表。
    • 环肽需将具体的环化类型(定义见上述参数Cyclization Types)标注在序列名称中(用|与名称分隔)

    示例如下:

    >seq_1
    AVRENmV
    >seq_2|SCCT
    PyWT{DMEG}{0BN}IKaYI{TFG3}RWtNQ{I2M}
    >seq_3|SCNT
    KI{D6MW}E{AHP}iiARCKE{MEN}
    

    序列seq_1是线性肽,由标准氨基酸和D型氨基酸m组成;seq_2是环肽,环化类型是SCCT,由标准氨基酸、D型氨基酸、非标准氨基酸组成;seq_3是环肽,环化类型是SCNT,由标准氨基酸、D型氨基酸、非标准氨基酸组成。
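The notation described above (uppercase for standard residues, lowercase for D-amino acids, braces for non-canonical codes, and the cyclization tag after "|" in the record name) can be read with a small tokenizer. This is an independent sketch based only on the rules listed here, not the p2smi parser; the function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
# A token is either a braced non-canonical code or a single letter.
TOKEN = re.compile(r"\{([^{}]+)\}|([A-Za-z])")


def parse_header(header: str) -> Tuple[str, Optional[str]]:
    """Split '>seq_2|SCCT' into ('seq_2', 'SCCT'); linear peptides give None."""
    name, _, cyclization = header.lstrip(">").partition("|")
    return name.strip(), (cyclization.strip() or None)


def tokenize(sequence: str) -> List[Tuple[str, str]]:
    """Return (kind, residue) pairs with kind in {'L', 'D', 'NCAA'}."""
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError(f"Unexpected character {sequence[pos]!r} at {pos}")
        code, letter = match.groups()
        if code is not None:
            tokens.append(("NCAA", code))            # e.g. {DMEG}
        elif letter.upper() not in STANDARD:
            raise ValueError(f"Unknown residue letter {letter!r}")
        elif letter.isupper():
            tokens.append(("L", letter))             # standard L-amino acid
        else:
            tokens.append(("D", letter.upper()))     # D-amino acid
        pos = match.end()
    return tokens
```

Counting tokens by kind gives the realized NCAA and D-amino-acid fractions of a sequence, which can be compared against the NCAA Percentage and D-AA Percentage settings used for generation.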

    Output SMILES

    转换后的SMILES字符串,文本格式,每行一个。默认为peptides.smi。

    Output CSV

    转换前后对应的信息文件,CSV格式,默认为peptides.csv。

    Property

    计算多肽的分子性质,包括:分子量(MW)、拓扑极性表面积(TPSA)、logP、氢键供体/受体、可旋转键数、环数量、Csp3 比例、重原子数、形式电荷、分子式,以及 Lipinski 规则评估等。

    Peptides

    用于性质计算的多肽,支持两种格式:多肽序列文件(FASTA格式),或者多肽SMILES(文本格式,每行放置一个SMILES字符串)。

    Output

    多肽的分子性质计算结果,CSV格式,默认为peptide_props.csv。

    Feasibility of Synthesis

    评估肽序列的可合成性,例如:N/Q 是否位于 N 端、Gly/Pro 模体、半胱氨酸数量、疏水性、总体电荷等。注意:目前仅支持天然氨基酸

    Fasta

    多肽序列文件,FASTA格式。仅支持天然氨基酸的多肽。

    Output

    可合成性评估结果,CSV格式,默认为synthesis_report.csv。

    结果说明

    Generation模式,输出多肽序列FASTA文件,示例如下:

    >seq_1|HT
    {FLA}dAVREN{6CL}mV
    >seq_2|SCCT
    PyWT{DMEG}{0BN}IKaYI{TFG3}RWtNQ{I2M}
    >seq_3|SCNT
    KI{D6MW}E{AHP}iiARCKE{MEN}
    >seq_4|HT
    YlCP{YCM}yR{ESC}EiD{DDAB}HYSY{LMQ}GT
    >seq_5|HT
    {ORN}{AA4}TQAqP{CSA}YKI{DTTQ}aVvH
    
    • 大写单字母(ACDEFGHIKLMNPQRSTVWY)表示标准氨基酸
    • 小写单字母(acdefghiklmnpqrstvwy)表示D型氨基酸
    • 非标准氨基酸用大括号+非标准氨基酸Code表示 (例如:羟脯氨酸表示为{Hyp}),常用非标准氨基酸Code请见附录列表。
    • 环肽需将具体的环化类型(定义见上述参数Cyclization Types)标注在序列名称中(用|与名称分隔)

    Format模式下,输出CSV文件和SMILES文件,CSV文件包含信息如下:

    字段名称 示例 说明
    Name seq_1 多肽序列名称
    Type HT 环肽的环化类型,线性肽为空值
    Sequence FALPciA{DQ36}S{ONL}MV{TTQ}RS 多肽序列
    SMILES N3[C@@H](Cc1ccccc1)C(=O) 转换后的SMILES字符串

    Property模式,输出CSV文件,包含信息如下:

    字段名称 示例 说明
    Formula(分子式) C49H80F3N15O17S 分子的元素组成
    Molecular weight(分子量) 1240.33 分子整体质量,单位道尔顿
    logP(脂溶性) -4.76 越低越亲水,该分子极度亲水
    TPSA(拓扑极性表面积) 516.33 反映极性强弱,越高越不易透膜
    H-bond donors(氢键供体) 16 可提供氢键的基团数量
    H-bond acceptors(氢键受体) 17 可接受氢键的基团数量
    Rotatable bonds(可旋转键) 21 分子柔性的衡量指标
    Rings(环数量) 1 分子内部的环结构数
    Fraction Csp3(Csp³ 碳比例) 0.694 反映三维度的比例(越高越立体)
    Heavy atoms(重原子数) 85 除氢以外的原子数量
    Formal charge(形式电荷) 0 分子整体电中性
    Lipinski pass(Lipinski 规则) false 不符合口服小分子规则(很正常,因其为大分子肽)
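The Lipinski entry in the table can be reproduced from the other columns with the classic rule-of-five limits (MW ≤ 500, logP ≤ 5, at most 5 H-bond donors, at most 10 H-bond acceptors). The sketch below is a plain re-statement of those limits; whether the module allows any violation before reporting a failure is not stated here, so requiring zero violations is an assumption.

```python
from typing import List


def lipinski_violations(mol_weight: float, logp: float,
                        h_donors: int, h_acceptors: int) -> List[str]:
    """List which rule-of-five limits a molecule exceeds."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("H-bond donors > 5")
    if h_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def lipinski_pass(mol_weight: float, logp: float,
                  h_donors: int, h_acceptors: int) -> bool:
    """ASSUMPTION: a molecule passes only when it has no violations."""
    return not lipinski_violations(mol_weight, logp, h_donors, h_acceptors)
```

For the example row above (MW 1240.33, logP -4.76, 16 donors, 17 acceptors) three limits are exceeded, so the result is false, as expected for a peptide of this size.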

    Feasibility of Synthesis模式,输出CSV文件,包含信息如下:

    字段名称 示例 说明
    Name seq_1 多肽序列名称
    Result FAIL 合成可行性评价,PASS表示好,FAIL表示差
    Description Failed charge: need 1 charged residue every 5 residues 合成可行性差的原因说明
    Sequence FALPciA{DQ36}S{ONL}MV{TTQ}RS 多肽序列

    参考文献

    • Feller, A. L. and Wilke, C. O. (2025). p2smi: A Python Toolkit for Peptide FASTA-to-SMILES Conversion and Molecular Property Analysis. DOI:10.48550/arXiv.2505.00719

    附录

    非标准氨基酸信息表

    Name Code Formula MolWeight SMILES
    Phenylglycine PG C8H9NO2 151.063328528 N[C@@H](c1ccccc1)C(=O)O
    4-methoxy-Phenylalanine 0A1 C10H13NO3 195.089543276 N[C@@H](Cc1ccc(OC)cc1)C(=O)O
    7-hydroxy-l-tryptophan 0AF C11H12N2O3 220.084792244 N[C@@H](Cc1c[nH]c2c1cccc2O)C(=O)O
    4-carbamimidoyl-l-phenylalanine 0BN C10H13N3O2 207.100776656 N[C@@H](Cc1ccc(cc1)C(=N)N)C(=O)O
    4-chloro-Phenylalanine 4CP C9H10ClNO2 199.04000624 N[C@@H](Cc1ccc(cc1)Cl)C(=O)O
    2-Allyl-glycine 2AG C7H11NO5 189.063722452 N[C@@H](CCCC(C(=O)O)=O)C(=O)O
    3-methyl-aspartic-acid 2AS C5H9NO4 147.053157768 N[C@H]([C@H](C)(C(=O)O))C(=O)O
    s-(difluoromethyl)-homocysteine 2FM C5H9F2NO2S 185.032205968 N[C@@H](CCSC(F)F)C(=O)O
    2-fluoro-l-histidine 2HF C6H12FN3O2 177.091354844 N[C@@H](C[C@@H]1CN[C@@H](N1)F)C(=O)O
    2-fluoro-l-histidine(1) 2HF1 C6H8FN3O2 173.060054716 N[C@@H](Cc1cnc(F)N1)C(=O)O
    2-fluoro-l-histidine(2) 2HF2 C6H8FN3O2 173.060054716 N[C@@H](Cc1c[nH]c(n1)F)C(=O)O
    l-2-amino-6-methylene-pimelic-acid 2NP C8H13NO4 187.084457896 N[C@@H](CCCC(=C)C(=O)O)C(=O)O
    3-(4H-thieno[3,2-b]pyrrol-6-yl)-L-alanine 32T C9H10N2O2S 210.04629856 N[C@H](Cc1c[nH]c2c1scc2)C(=O)O
    3-cyano-phenylalanine 3CF C10H10N2O2 190.07422756 N[C@@H](Cc1cccc(C#N)c1)C(=O)O
    (2s)-amino(3,5-dihydroxyphenyl)-ethanoic-acid 3FG C8H9NO4 183.053157768 N[C@@H](c1cc(O)cc(c1)O)C(=O)O
    4-hydroxy-glutamic-acid 3GL C5H9NO5 163.048072388 N[C@@H](C[C@@H](C(=O)O)O)C(=O)O
    3-Chloro-tyrosine 3MY C9H10ClNO3 215.03492086 N[C@H](Cc1ccc(c(c1)Cl)O)C(=O)O
    4-Bromo-phenylalanine 4BF C9H10BrNO2 242.98949066 N[C@@H](Cc1ccc(cc1)Br)C(=O)O
    4-cyano-phenylalanine 4CF C10H10N2O2 190.07422756 N[C@@H](Cc1ccc(cc1)C#N)C(=O)O
    nitrilo-l-methionine 4CY C5H8N2O2S 160.030648496 N[C@@H](CCSC#N)C(=O)O
    4-fluoro-tryptophan 4FW C11H11FN2O2 222.080455812 N[C@@H](Cc1c[nH]c2c1c(F)ccc2)C(=O)O
    4-hydroxymethyl-phenylalanine 4HMP C10H13NO3 195.089543276 N[C@@H](Cc1ccc(CO)cc1)C(=O)O
    4-hydroxy-tryptophan 4HT C11H12N2O3 220.084792244 N[C@@H](Cc1c[nH]c2c1c(O)ccc2)C(=O)O
    4-amino-l-tryptophan 4IN C11H13N3O2 219.100776656 N[C@@H](Cc1c[nH]c2c1c(N)ccc2)C(=O)O
    4-methyl-phenylalanine 4PH C10H13NO2 179.094628656 N[C@@H](Cc1ccc(cc1)C)C(=O)O
    6-carboxylysine 6CL C7H14N2O4 190.095356928 N[C@@H](CCC[C@H](C(=O)O)N)C(=O)O
    6-chloro-l-tryptophan 6CW C11H11ClN2O2 238.050905272 N[C@@H](Cc1c[nH]c2c1ccc(c2)Cl)C(=O)O
    2-amino-5-hydroxypentanoic-acid AA4 C5H11NO3 133.073893212 N[C@@H](CCCO)C(=O)O
    2-Aminobutyric-acid ABA C4H9NO2 103.063328528 N[C@@H](CC)C(=O)O
    cis-amiclenomycin ACZ C10H16N2O2 196.121177752 N[C@@H](CC[C@@H]1C=C[C@@H](C=C1)N)C(=O)O
    Adamanthane ADAM C13H21NO2 223.157228912 N[C@@H](C[C@]12C[C@H]3C[C@@H](C2)C[C@@H](C1)C3)C(=O)O
    5-methyl-arginine AGM C7H16N4O2 188.127325752 N[C@@H](CC[C@H](C)NC(=N)N)C(=O)O
    beta-hydroxyasparagine AHB C4H8N2O4 148.048406736 N[C@@H]([C@@H](C(=O)N)O)C(=O)O
    2-Aminoheptanoic-acid AHP C7H15NO2 145.11027872 N[C@@H](CCCCC)C(=O)O
    3-cyclohexyl-alanine ALC C9H17NO2 171.125928784 N[C@@H](CC1CCCCC1)C(=O)O
    1-Naphthyl-alanine ALN C13H13NO2 215.094628656 N[C@@H](Cc1cccc2c1cccc2)C(=O)O
    Allo-threonine ALO C4H9NO3 119.058243148 N[C@@H]([C@H](C)O)C(=O)O
    3-(9-anthryl)-alanine ANTH C17H15NO2 265.11027872 N[C@@H](Cc1c2ccccc2cc2c1cccc2)C(=O)O
    3-Methyl-phenylalanine APD C10H13NO2 179.094628656 N[C@@H](Cc1cccc(c1)C)C(=O)O
    m-amidinophenyl-3-alanine APM C10H13N3O2 207.100776656 N[C@@H](Cc1cccc(c1)C(=N)N)C(=O)O
    c-gamma-hydroxy-arginine ARO C6H14N4O3 190.106590308 N[C@@H](C[C@@H](O)CN=C(N)N)C(=O)O
    (2r)-2-amino-4-oxobutanoic-acid AS2 C4H7NO3 117.042593084 N[C@@H](CC=O)C(=O)O
    azido-alanine AZDA C3H7N4O2+ 131.05635188409 N[C@@H](CN=[N+]=N)C(=O)O
    Phenylserine BB8 C9H11NO3 181.073893212 N[C@@H]([C@@H](O)c1ccccc1)C(=O)O
    benzylcysteine BCS C10H13NO2S 211.066699656 N[C@@H](CSCc1ccccc1)C(=O)O
    beta-hydroxyaspartic-acid BHD C4H7NO5 149.032422324 N[C@@H]([C@H](O)C(=O)O)C(=O)O
    4,4-biphenylalanine BIF C15H15NO2 241.11027872 N[C@@H](Cc1ccc(cc1)c1ccccc1)C(=O)O
    5-bromo-l-isoleucine BIU C6H12BrNO2 209.005140724 N[C@@H]([C@@H](C)CCBr)C(=O)O
    3-(3-benzothienyl)-alanine BTH3 C11H11NO2S 221.051049592 N[C@@H](Cc1csc2c1cccc2)C(=O)O
    6-bromo-tryptophan BTR C11H11BrN2O2 282.000389692 N[C@@H](Cc1c[nH]c2c1ccc(c2)Br)C(=O)O
    Tertleucine BUG C6H13NO2 131.094628656 N[C@@H](C(C)(C)C)C(=O)O
    3-chloro-l-alanine C2N C3H6ClNO2 123.008706112 N[C@@H](CCl)C(=O)O
    canaline CAN C4H10N2O3 134.06914218 N[C@@H](CCON)C(=O)O
    carboxymethylated-cysteine CCS C5H9NO4S 179.025228768 N[C@@H](CSCC(=O)O)C(=O)O
    Cyclohexylglycine CHG C8H15NO2 157.11027872 N[C@@H](C1CCCCC1)C(=O)O
    3-chloro-4-hydroxy-phenylglycine CHP C8H8ClNO3 201.019270796 N[C@@H](c1ccc(c(c1)Cl)O)C(=O)O
    Citrulline CIR C6H13N3O3 175.095691276 N[C@@H](CCC[NH]C(=O)N)C(=O)O
    2-cyano-phenylalanine CNP2 C10H10N2O2 190.07422756 N[C@@H](Cc1ccccc1C#N)C(=O)O
    2,4-dichloro-phenylalanine CP24 C9H9Cl2NO2 233.001033888 N[C@@H](Cc1ccc(cc1Cl)Cl)C(=O)O
    3,4-dichloro-phenylalanine CP34 C9H9Cl2NO2 233.001033888 N[C@@H](Cc1ccc(c(c1)Cl)Cl)C(=O)O
    3-Cyclopentyl-alanine CPA3 C8H15NO2 157.11027872 N[C@@H](CC1CCCC1)C(=O)O
    2-Chloro-phenylglycine CPG2 C8H8ClNO2 185.024356176 N[C@@H](c1ccccc1Cl)C(=O)O
    3-Chloro-phenylglycine CPG3 C8H8ClNO2 185.024356176 N[C@@H](c1cccc(c1)Cl)C(=O)O
    4-Chloro-phenylglycine CPG4 C8H8ClNO2 185.024356176 N[C@@H](c1ccc(cc1)Cl)C(=O)O
    2-chloro-Phenylalanine CPH2 C9H10ClNO2 199.04000624 N[C@@H](Cc1ccccc1Cl)C(=O)O
    s-acetonylcysteine CSA C6H11NO3S 177.045964212 N[C@@H](CSCC(=O)C)C(=O)O
    Selenocysteine CSE C3H7NO2Se 168.964199764 N[C@@H](C[SeH])C(=O)O
    7-chloro-tryptophan CTE C11H11ClN2O2 238.050905272 N[C@@H](Cc1cNc2c1cccc2Cl)C(=O)O
    4-chloro-threonine CTH C4H8ClNO3 153.019270796 N[C@@H]([C@H](O)CCl)C(=O)O
    4-Hydroxy-phenylglycine D4P C8H9NO3 167.058243148 N[C@@H](c1ccc(cc1)O)C(=O)O
    Diaminobutyric-acid DAB C4H10N2O2 118.07422756 N[C@@H](CCN)C(=O)O
    3,4-Dihydroxy-phenylalanine DAH C9H11NO4 197.068807832 N[C@@H](Cc1ccc(c(c1)O)O)C(=O)O
    3,5-dibromotyrosine DBY C9H9Br2NO3 336.894917348 N[C@@H](Cc1cc(Br)c(c(c1)Br)O)C(=O)O
    3,3-dihydroxy-alanine DDZ C3H7NO4 121.037507704 N[C@@H](C(O)O)C(=O)O
    Diethylalanine DILE C7H15NO2 145.11027872 N[C@@H](C(CC)CC)C(=O)O
    3,3-diphenylalanine DIPH C15H15NO2 241.11027872 N[C@@H]([C@H](c1ccccc1)c1ccccc1)C(=O)O
    3,3-dimethyl-aspartic-acid DMK C6H11NO4 161.068807832 N[C@@H](C(C(=O)O)(C)C)C(=O)O
    3-ethyl-phenylalanine DMP3 C11H15NO2 193.11027872 N[C@@H](Cc1cc(CC)ccc1)C(=O)O
    2,3-Diaminopropanoic-acid DPP C3H8N2O2 104.058577496 N[C@@H](CN)C(=O)O
    Ethionine ESC C6H13NO2S 163.066699656 N[C@@H](CCSCC)C(=O)O
    3,4-Difluoro-phenylalanine F2F C9H9F2NO2 201.060134968 N[C@@H](Cc1ccc(c(c1)F)F)C(=O)O
    3-chloro-Phenylalanine FCL C9H10ClNO2 199.04000624 N[C@@H](Cc1cccc(c1)Cl)C(=O)O
    4-Fluoro-glutamic-acid FGA4 C5H8FNO4 165.043735956 N[C@@H](C[C@H](F)C(=O)O)C(=O)O
    2-amino-propanedioic-acid FGL C3H5NO4 119.02185764 NC(C(=O)O)C(=O)O
    Trifluoro-alanine FLA C3H4F3NO2 143.019413028 N[C@@H](C(F)(F)F)C(=O)O
    2-Fluoro-phenylglycine FPG2 C8H8FNO2 169.053906716 N[C@@H](c1ccccc1F)C(=O)O
    3-Fluoro-phenylglycine FPG3 C8H8FNO2 169.053906716 N[C@@H](c1cccc(c1)F)C(=O)O
    4-Fluoro-phenylglycine FPG4 C8H8FNO2 169.053906716 N[C@@H](c1ccc(cc1)F)C(=O)O
    2-Fluoro-Phenylalanine FPH2 C9H10FNO2 183.06955678 N[C@@H](Cc1ccccc1F)C(=O)O
    3-Fluoro-Phenylalanine FPH3 C9H10FNO2 183.06955678 N[C@@H](Cc1cccc(c1)F)C(=O)O
    6-fluoro-l-tryptophan FT6 C11H11FN2O2 222.080455812 N[C@@H](Cc1cNc2c1ccc(c2)F)C(=O)O
    5-Fluoro-tryptophan FTR C11H11FN2O2 222.080455812 N[C@@H](Cc1c[nH]c2c1cc(F)cc2)C(=O)O
    (2-furyl)-alanine FUA2 C7H9NO3 155.058243148 N[C@@H](Cc1ccco1)C(=O)O
    3-Fluoro-valine FVAL C5H10FNO2 135.06955678 N[C@@H](C(F)(C)C)C(=O)O
    2-Amino-4-guanidinobutryric-acid GBUT C5H14N4O2 162.111675688 N[C@@H](CCNC(N)N)C(=O)O
    2-Amino-3-guanidinopropionic-acid GDPR C4H12N4O2 148.096025624 N[C@@H](CNC(N)N)C(=O)O
    Canavanine GGB C5H12N4O3 176.090940244 N[C@@H](CCON=C(N)N)C(=O)O
    (2s,4s)-2,5-diamino-4-hydroxy-5-oxopentanoic-acid GHG C5H10N2O4 162.0640568 N[C@@H](C[C@H](O)C(=O)N)C(=O)O
    5-o-methyl-glutamic-acid GME C6H11NO4 161.068807832 N[C@@H](CCC(=O)OC)C(=O)O
    homocysteine HCS C4H9NO2S 135.035399528 N[C@@H](CCS)C(=O)O
    glutamine-hydroxamate HGA C5H10N2O4 162.0640568 N[C@@H](CCC(=O)NO)C(=O)O
    (2s)-2,8-diaminooctanoic-acid HHK C8H18N2O2 174.136827816 N[C@@H](CCCCCCN)C(=O)O
    4-Hydroxy-L-isoleucine HIL4 C6H13NO3 147.089543276 N[C@@H]([C@H]([C@@H](C)O)C)C(=O)O
    (2s,3r)-2-amino-3-hydroxy-4-methylpentanoic-acid HL2 C6H13NO3 147.089543276 N[C@@H]([C@H](O)C(C)C)C(=O)O
    Homoleucine HLEU C7H15NO2 145.11027872 N[C@@H](CCC(C)C)C(=O)O
    beta-hydroxyleucine HLU C6H13NO3 147.089543276 N[C@@H]([C@@H](O)C(C)C)C(=O)O
    4-amino-L-phenylalanine HOX C9H12N2O2 180.089877624 N[C@@H](Cc1ccc(cc1)N)C(=O)O
    Homophenylalanine HPE C10H13NO2 179.094628656 N[C@@H](CCc1ccccc1)C(=O)O
    3-(8-hydroxyquinolin-3-yl)-l-alanine HQA C12H12N2O3 232.084792244 N[C@@H](Cc1cnc2c(c1)cccc2O)C(=O)O
    homoarginine HRG C7H18N4O2 190.142975816 N[C@@H](CCCCNC(N)N)C(=O)O
    5-Hydroxy-tryptophan HRP C11H12N2O3 220.084792244 N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O
    homoserine HSER C4H9NO3 119.058243148 N[C@@H](CCO)C(=O)O
    beta-hydroxy-tryptophane HTR C11H12N2O3 220.084792244 N[C@@H]([C@H](c1c[nH]c2c1cccc2)O)C(=O)O
    3-hydroxy-l-valine HVA C5H11NO3 133.073893212 N[C@@H](C(O)(C)C)C(=O)O
    3-methyl-l-alloisoleucine I2M C7H15NO2 145.11027872 N[C@@H](C(CC)(C)C)C(=O)O
    alpha-amino-2-indanacetic-acid IGL C11H13NO2 191.094628656 N[C@@H](C1Cc2c(C1)cccc2)C(=O)O
    Allo-Isoleucine IIL C6H13NO2 131.094628656 N[C@@H]([C@@H](CC)C)C(=O)O
    4,5-dihydroxy-isoleucine ILX C6H13NO4 163.084457896 N[C@@H]([C@H]([C@H](CO)O)C)C(=O)O
    3-iodo-tyrosine IYR C9H10INO3 306.97054118 N[C@@H](Cc1ccc(c(c1)I)O)C(=O)O
    kynurenine KYN C10H12N2O3 208.084792244 N[C@@H](CC(=O)c1ccccc1N)C(=O)O
    6-hydroxy-l-norleucine LDO C6H13NO3 147.089543276 N[C@@H](CCCCO)C(=O)O
    Penicillamine LE1 C5H11NO2S 149.051049592 N[C@@H](C(S)(C)C)C(=O)O
    (4r)-5-oxo-l-leucine LED C6H11NO3 145.073893212 N[C@@H](C[C@@H](C)C=O)C(=O)O
    (4s)-5-fluoro-l-leucine LEF C6H12FNO2 149.085206844 N[C@@H](C[C@H](C)CF)C(=O)O
    (3r)-3-methyl-l-glutamic-acid LME C6H11NO4 161.068807832 N[C@@H]([C@H](C)CC(=O)O)C(=O)O
    3-methyl-l-glutamine LMQ C6H12N2O3 160.084792244 N[C@@H]([C@@H](C)CC(N)=O)C(=O)O
    vinylglycine LVG C4H7NO2 101.047678464 N[C@@H](C=C)C(=O)O
    4-oxo-l-valine LVN C5H9NO3 131.058243148 N[C@@H]([C@H](C)C=O)C(=O)O
    3,3-dimethyl-methionine-sulfoxide M2S C7H15NO3S 193.07726434 N[C@@H](C(C)(C)C[S@@](C)=O)C(=O)O
    hydroxy-l-methionine ME0 C5H11NO3S 165.045964212 N[C@@H](CCSCO)C(=O)O
    (3s)-3-methyl-l-glutamic-acid MEG C6H11NO4 161.068807832 N[C@@H]([C@@H](C)CC(=O)O)C(=O)O
    n-methyl-asparagine MEN C5H10N2O3 146.06914218 N[C@@H](CC(=O)NC)C(=O)O
    n5-methyl-glutamine MEQ C6H12N2O3 160.084792244 N[C@@H](CCC(=O)NC)C(=O)O
    s-oxymethionine MHO C5H11NO3S 165.045964212 N[C@@H](CC[S@](=O)C)C(=O)O
    5-Methoxy-tryptophan MOT5 C12H14N2O3 234.100442308 N[C@@H](Cc1cNc2ccc(OC)cc12)C(=O)O
    3,4-Dimethyl-phenylalanine MP34 C11H15NO2 193.11027872 N[C@@H](Cc1ccc(c(c1)C)C)C(=O)O
    2-Methyl-phenylalanine MPH2 C10H13NO2 179.094628656 N[C@@H](Cc1ccccc1C)C(=O)O
    5-Methyl-tryptophan MTR5 C12H14N2O2 218.105527688 N[C@@H](Cc1cNc2ccc(C)cc12)C(=O)O
    6-Methyl-tryptophan MTR6 C12H14N2O2 218.105527688 N[C@@H](Cc1cNc2c1ccc(c2)C)C(=O)O
    m-Tyrosine MTY C9H11NO3 181.073893212 N[C@@H](Cc1cccc(c1)O)C(=O)O
    2-Naphthyl-alanine NAL C13H13NO2 215.094628656 N[C@@H](Cc1ccc2c(c1)cccc2)C(=O)O
    5-hydroxy-1-naphthalene NAO1 C13H13NO3 231.089543276 N[C@@H](Cc1cccc2c1cc(O)cc2)C(=O)O
    6-hydroxy-2-naphthalene NAO2 C13H13NO3 231.089543276 N[C@@H](Cc1ccc2c(c1)cc(cc2)O)C(=O)O
    meta-nitro-tyrosine NIY C9H10N2O5 226.05897142 N[C@@H](Cc1ccc(c(c1)N(=O)=O)O)C(=O)O
    Norleucine NLE C6H13NO2 131.094628656 N[C@@H](CCCC)C(=O)O
    Norvaline NVA C5H11NO2 117.078978592 N[C@@H](CCC)C(=O)O
    o-acetylserine OAS C5H9NO4 147.053157768 N[C@@H](COC(=O)C)C(=O)O
    (2s)-2-amino-4,4-difluorobutanoic-acid OBF C4H7F2NO2 139.044484904 N[C@@H](CC(F)F)C(=O)O
    s-(2-hydroxyethyl)-l-cysteine OCY C5H11NO3S 165.045964212 N[C@@H](CSCCO)C(=O)O
    o-methyl-l-threonine OLT C5H11NO3 133.073893212 N[C@@H]([C@H](OC)C)C(=O)O
    Methionine-sulfone OMT C5H11NO4S 181.040878832 N[C@@H](CCS(=O)(=O)C)C(=O)O
    (betar)-beta-hydroxy-l-tyrosine OMX C9H11NO4 197.068807832 N[C@@H]([C@@H](c1ccc(cc1)O)O)C(=O)O
    (betar)-3-chloro-beta-hydroxy-l-tyrosine OMY C9H10ClNO4 231.02983548 N[C@@H]([C@@H](c1ccc(c(c1)Cl)O)O)C(=O)O
    5-oxo-l-norleucine ONL C6H11NO3 145.073893212 N[C@@H](CCC(=O)C)C(=O)O
    Ornithine ORN C5H12N2O2 132.089877624 N[C@@H](CCCN)C(=O)O
    o-Tyrosine OTYR C9H11NO3 181.073893212 N[C@@H](Cc1ccccc1O)C(=O)O
    4-benzoyl-phenylalanine PBF C16H15NO3 269.10519334 N[C@@H](Cc1ccc(cc1)C(=O)c1ccccc1)C(=O)O
    pentafluoro-phenylalanine PF5 C9H6F5NO2 255.031869532 N[C@@H](Cc1c(F)c(F)c(c(c1F)F)F)C(=O)O
    4-Fluoro-Phenylalanine PFF C9H10FNO2 183.06955678 N[C@@H](Cc1ccc(cc1)F)C(=O)O
    4-Iodo-Phenylalanine PHI C9H10INO2 290.97562656 N[C@@H](Cc1ccc(cc1)I)C(=O)O
    4-Nitro-phenylalanine PPN C9H10N2O4 210.0640568 N[C@@H](Cc1ccc(cc1)N(=O)=O)C(=O)O
    phosphotyrosine PTR C9H12NO6P 261.040223734 N[C@@H](Cc1ccc(cc1)OP(=O)(O)O)C(=O)O
    3-(2-Pyridyl)-alanine PYR2 C8H10N2O2 166.07422756 N[C@@H](Cc1ccccn1)C(=O)O
    3-(3-Pyridyl)-alanine PYR3 C8H10N2O2 166.07422756 N[C@@H](Cc1cccnc1)C(=O)O
    3-(4-Pyridyl)-alanine PYR4 C8H10N2O2 166.07422756 N[C@@H](Cc1ccncc1)C(=O)O
    3-(1-Pyrazolyl)-alanine PYZ1 C6H9N3O2 155.069476528 N[C@@H](Cn1cccn1)C(=O)O
    3-(2-Quinolyl)-alanine QU32 C12H12N2O2 216.089877624 N[C@@H](Cc1ccc2c(n1)cccc2)C(=O)O
    3-(3-quinolyl)-alanine QU33 C12H12N2O2 216.089877624 N[C@@H](Cc1cnc2c(c1)cccc2)C(=O)O
    3-(4-quinolyl)-alanine QU34 C12H12N2O2 216.089877624 N[C@@H](Cc1ccnc2c1cccc2)C(=O)O
    3-(5-Quinolyl)-alanine QU35 C12H12N2O2 216.089877624 N[C@@H](Cc1ccc2c(c1)nccc2)C(=O)O
    3-(6-Quinolyl)-alanine QU36 C12H12N2O2 216.089877624 N[C@@H](Cc1ccc2c(c1)cncc2)C(=O)O
    3-(2-quinoxalyl)-alanine QX32 C11H11N3O2 217.085126592 N[C@@H](Cc1cnc2c(n1)cccc2)C(=O)O
    phosphoserine SEP C3H8NO6P 185.008923606 N[C@@H](COP(=O)(O)O)C(=O)O
    thialysine SLZ C5H12N2O2S 164.061948624 N[C@@H](CSCCN)C(=O)O
    Methionine-sulfoxide SME C5H11NO3S 165.045964212 N[C@@H](CC[S@](=O)C)C(=O)O
    Styrylalanine STYA C11H13NO2 191.094628656 N[C@@H](CC=Cc1ccccc1)C(=O)O
    2s,4r-4-methylglutamate SYM C6H11NO4 161.068807832 N[C@@H](C[C@H](C(=O)O)C)C(=O)O
    4-tert-butyl-phenylalanine TBP4 C13H19NO2 221.141578848 N[C@@H](Cc1ccc(cc1)C(C)(C)C)C(=O)O
    3-(2-Tetrazolyl)-alanine TEZA C4H7N5O2 157.059974464 N[C@@H](Cn1nncn1)C(=O)O
    2-(Trifluoromethyl)-phenylglycine TFG2 C9H8F3NO2 219.050713156 N[C@@H](c1ccccc1C(F)(F)F)C(=O)O
    3-(Trifluoromethyl)-phenylglycine TFG3 C9H8F3NO2 219.050713156 N[C@@H](c1cccc(c1)C(F)(F)F)C(=O)O
    4-(Trifluoromethyl)-phenylglycine TFG4 C9H8F3NO2 219.050713156 N[C@@H](c1ccc(cc1)C(F)(F)F)C(=O)O
    5,5,5-Trifluoro-leucine TFLE C6H10F3NO2 185.06636322 N[C@@H](C[C@@H](C(F)(F)F)C)C(=O)O
    2-(Trifluoromethyl)-phenylalanine TFP2 C10H10F3NO2 233.06636322 N[C@@H](Cc1ccccc1C(F)(F)F)C(=O)O
    3-(Trifluoromethyl)-phenylalanine TFP3 C10H10F3NO2 233.06636322 N[C@@H](Cc1cccc(c1)C(F)(F)F)C(=O)O
    4-(Trifluoromethyl)-phenylalanine TFP4 C10H10F3NO2 233.06636322 N[C@@H](Cc1ccc(cc1)C(F)(F)F)C(=O)O
    4-hydroxy-l-threonine TH6 C4H9NO4 135.053157768 N[C@@H]([C@H](O)CO)C(=O)O
    3-(3-thienyl)-alanine THA3 C7H9NO2S 171.035399528 N[C@@H](Cc1cscc1)C(=O)O
    2-thienylglycine THG2 C6H7NO2S 157.019749464 N[C@@H](c1cccs1)C(=O)O
    3-thienylglycine THG3 C6H7NO2S 157.019749464 N[C@@H](c1cscc1)C(=O)O
    Thio-citrulline THIC C6H13N3O2S 191.072847656 N[C@@H](CCCNC(=S)N)C(=O)O
    3-(2-thienyl)-alanine TIH C7H9NO2S 171.035399528 N[C@@H](Cc1cccs1)C(=O)O
    phosphothreonine TPO C4H10NO6P 199.02457367 N[C@@H]([C@H](OP(=O)(O)O)C)C(=O)O
    2-hydroxy-tryptophan TRO C11H12N2O3 220.084792244 N[C@@H](Cc1c(O)[nH]c2c1cccc2)C(=O)O
    6-hydroxy-tryptophan TRX C11H12N2O3 220.084792244 N[C@@H](Cc1c[nH]c2c1ccc(c2)O)C(=O)O
    3-(1,2,4-Triazol-1-yl)-alanine TRZ4 C5H8N4O2 156.064725496 N[C@@H](Cn1cncn1)C(=O)O
    6-amino-7-hydroxy-l-tryptophan TTQ C11H13N3O3 235.095691276 N[C@@H](Cc1c[nH]c2c1ccc(c2O)N)C(=O)O
    3-Amino-L-tyrosine TY2 C9H12N2O3 196.084792244 N[C@@H](Cc1ccc(c(c1)N)O)C(=O)O
    3,5-diiodotyrosine TYI C9H9I2NO3 432.867189148 N[C@@H](Cc1cc(I)c(c(c1)I)O)C(=O)O
    3-amino-6-hydroxy-tyrosine TYQ C9H12N2O4 212.079706864 N[C@@H](Cc1cc(N)c(cc1O)O)C(=O)O
    (4-thiazolyl)-alanine TZA4 C6H8N2O2S 172.030648496 N[C@@H](Cc1cscn1)C(=O)O
    2-Aminoadipic-acid UN1 C6H11NO4 161.068807832 N[C@@H](CCCC(=O)O)C(=O)O
    Hydroxynorvaline VAH C5H11NO3 133.073893212 N[C@@H]([C@H](O)CC)C(=O)O
    3,5-Difluoro-phenylalanine WFP C9H9F2NO2 201.060134968 N[C@@H](Cc1cc(F)cc(c1)F)C(=O)O
    cysteine-s-acetamide YCM C5H10N2O3S 178.04121318 N[C@@H](CSCC(=O)N)C(=O)O
    3-fluorotyrosine YOF C9H10FNO3 199.0644714 N[C@@H](Cc1ccc(c(c1)F)O)C(=O)O
    d-Phenylglycine DPG C8H9NO2 151.063328528 N[C@H](c1ccccc1)C(=O)O
    d-4-methoxy-Phenylalanine D0A1 C10H13NO3 195.089543276 N[C@H](Cc1ccc(OC)cc1)C(=O)O
    d-7-hydroxy-l-tryptophan D0AF C11H12N2O3 220.084792244 N[C@H](Cc1c[nH]c2c1cccc2O)C(=O)O
    d-4-carbamimidoyl-l-phenylalanine D0BN C10H13N3O2 207.100776656 N[C@H](Cc1ccc(cc1)C(=N)N)C(=O)O
    d-4-chloro-Phenylalanine D200 C9H10ClNO2 199.04000624 N[C@H](Cc1ccc(cc1)Cl)C(=O)O
    d-2-Allyl-glycine D2AG C7H11NO5 189.063722452 N[C@H](CCCC(C(=O)O)=O)C(=O)O
    d-3-methyl-aspartic-acid D2AS C5H9NO4 147.053157768 N[C@@H]([C@H](C)(C(=O)O))C(=O)O
    d-s-(difluoromethyl)-homocysteine D2FM C5H9F2NO2S 185.032205968 N[C@H](CCSC(F)F)C(=O)O
    d-2-fluoro-l-histidine D2HF C6H12FN3O2 177.091354844 N[C@H](C[C@@H]1CN[C@@H](N1)F)C(=O)O
    d-2-fluoro-l-histidine(1) D2H1 C6H8FN3O2 173.060054716 N[C@H](Cc1cnc(F)N1)C(=O)O
    d-2-fluoro-l-histidine(2) D2H2 C6H8FN3O2 173.060054716 N[C@H](Cc1c[nH]c(n1)F)C(=O)O
    d-l-2-amino-6-methylene-pimelic-acid D2NP C8H13NO4 187.084457896 N[C@H](CCCC(=C)C(=O)O)C(=O)O
    d-3-(4H-thieno[3,2-b]pyrrol-6-yl)-L-alanine D32T C9H10N2O2S 210.04629856 N[C@@H](Cc1c[nH]c2c1scc2)C(=O)O
    d-3-cyano-phenylalanine D3CF C10H10N2O2 190.07422756 N[C@H](Cc1cccc(C#N)c1)C(=O)O
    d-(2s)-amino(3,5-dihydroxyphenyl)-ethanoic-acid D3FG C8H9NO4 183.053157768 N[C@H](c1cc(O)cc(c1)O)C(=O)O
    d-4-hydroxy-glutamic-acid D3GL C5H9NO5 163.048072388 N[C@H](C[C@@H](C(=O)O)O)C(=O)O
    d-3-Chloro-tyrosine D3MY C9H10ClNO3 215.03492086 N[C@@H](Cc1ccc(c(c1)Cl)O)C(=O)O
    d-4-Bromo-phenylalanine D4BF C9H10BrNO2 242.98949066 N[C@H](Cc1ccc(cc1)Br)C(=O)O
    d-4-cyano-phenylalanine D4CF C10H10N2O2 190.07422756 N[C@H](Cc1ccc(cc1)C#N)C(=O)O
    d-nitrilo-l-methionine D4CY C5H8N2O2S 160.030648496 N[C@H](CCSC#N)C(=O)O
    d-4-fluoro-tryptophan D4FW C11H11FN2O2 222.080455812 N[C@H](Cc1c[nH]c2c1c(F)ccc2)C(=O)O
    d-4-hydroxymethyl-phenylalanine D4HZ C10H13NO3 195.089543276 N[C@H](Cc1ccc(CO)cc1)C(=O)O
    d-4-hydroxy-tryptophan D4HT C11H12N2O3 220.084792244 N[C@H](Cc1c[nH]c2c1c(O)ccc2)C(=O)O
    d-4-amino-l-tryptophan D4IN C11H13N3O2 219.100776656 N[C@H](Cc1c[nH]c2c1c(N)ccc2)C(=O)O
    d-4-methyl-phenylalanine D4PH C10H13NO2 179.094628656 N[C@H](Cc1ccc(cc1)C)C(=O)O
    d-6-carboxylysine D6CL C7H14N2O4 190.095356928 N[C@H](CCC[C@H](C(=O)O)N)C(=O)O
    d-6-chloro-l-tryptophan D6CW C11H11ClN2O2 238.050905272 N[C@H](Cc1c[nH]c2c1ccc(c2)Cl)C(=O)O
    d-2-amino-5-hydroxypentanoic-acid DAA4 C5H11NO3 133.073893212 N[C@H](CCCO)C(=O)O
    d-2-Aminobutyric-acid DABA C4H9NO2 103.063328528 N[C@H](CC)C(=O)O
    d-cis-amiclenomycin DACZ C10H16N2O2 196.121177752 N[C@H](CC[C@@H]1C=C[C@@H](C=C1)N)C(=O)O
    d-Adamanthane DADM C13H21NO2 223.157228912 N[C@H](C[C@]12C[C@H]3C[C@@H](C2)C[C@@H](C1)C3)C(=O)O
    d-5-methyl-arginine DAGM C7H16N4O2 188.127325752 N[C@H](CC[C@H](C)NC(=N)N)C(=O)O
    d-beta-hydroxyasparagine DAHB C4H8N2O4 148.048406736 N[C@H]([C@@H](C(=O)N)O)C(=O)O
    d-2-Aminoheptanoic-acid DAHP C7H15NO2 145.11027872 N[C@H](CCCCC)C(=O)O
    d-3-cyclohexyl-alanine DALC C9H17NO2 171.125928784 N[C@H](CC1CCCCC1)C(=O)O
    d-1-Naphthyl-alanine DALN C13H13NO2 215.094628656 N[C@H](Cc1cccc2c1cccc2)C(=O)O
    d-Allo-threonine DALO C4H9NO3 119.058243148 N[C@H]([C@H](C)O)C(=O)O
    d-3-(9-anthryl)-alanine DNTL C17H15NO2 265.11027872 N[C@H](Cc1c2ccccc2cc2c1cccc2)C(=O)O
    d-3-Methyl-phenylalanine DAPD C10H13NO2 179.094628656 N[C@H](Cc1cccc(c1)C)C(=O)O
    d-m-amidinophenyl-3-alanine DAPM C10H13N3O2 207.100776656 N[C@H](Cc1cccc(c1)C(=N)N)C(=O)O
    d-c-gamma-hydroxy-arginine DARO C6H14N4O3 190.106590308 N[C@H](C[C@@H](O)CN=C(N)N)C(=O)O
    d-(2r)-2-amino-4-oxobutanoic-acid DAS2 C4H7NO3 117.042593084 N[C@H](CC=O)C(=O)O
    d-azido-alanine DZDA C3H7N4O2+ 131.05635188409 N[C@H](CN=[N+]=N)C(=O)O
    d-Phenylserine DBB8 C9H11NO3 181.073893212 N[C@H]([C@@H](O)c1ccccc1)C(=O)O
    d-benzylcysteine DBCS C10H13NO2S 211.066699656 N[C@H](CSCc1ccccc1)C(=O)O
    d-beta-hydroxyaspartic-acid DBHD C4H7NO5 149.032422324 N[C@H]([C@H](O)C(=O)O)C(=O)O
    d-4,4-biphenylalanine DBIF C15H15NO2 241.11027872 N[C@H](Cc1ccc(cc1)c1ccccc1)C(=O)O
    d-5-bromo-l-isoleucine DBIU C6H12BrNO2 209.005140724 N[C@H]([C@@H](C)CCBr)C(=O)O
    d-3-(3-benzothienyl)-alanine DTH9 C11H11NO2S 221.051049592 N[C@H](Cc1csc2c1cccc2)C(=O)O
    d-6-bromo-tryptophan DBTR C11H11BrN2O2 282.000389692 N[C@H](Cc1c[nH]c2c1ccc(c2)Br)C(=O)O
    d-Tertleucine DBUG C6H13NO2 131.094628656 N[C@H](C(C)(C)C)C(=O)O
    d-3-chloro-l-alanine DC2N C3H6ClNO2 123.008706112 N[C@H](CCl)C(=O)O
    d-canaline DCAN C4H10N2O3 134.06914218 N[C@H](CCON)C(=O)O
    d-carboxymethylated-cysteine DCCS C5H9NO4S 179.025228768 N[C@H](CSCC(=O)O)C(=O)O
    d-Cyclohexylglycine DCHG C8H15NO2 157.11027872 N[C@H](C1CCCCC1)C(=O)O
    d-3-chloro-4-hydroxy-phenylglycine DCHP C8H8ClNO3 201.019270796 N[C@H](c1ccc(c(c1)Cl)O)C(=O)O
    d-Citrulline DCIR C6H13N3O3 175.095691276 N[C@H](CCC[NH]C(=O)N)C(=O)O
    d-2-cyano-phenylalanine D2CF C10H10N2O2 190.07422756 N[C@H](Cc1ccccc1C#N)C(=O)O
    d-2,4-dichloro-phenylalanine D24E C9H9Cl2NO2 233.001033888 N[C@H](Cc1ccc(cc1Cl)Cl)C(=O)O
    d-3,4-dichloro-phenylalanine D34E C9H9Cl2NO2 233.001033888 N[C@H](Cc1ccc(c(c1)Cl)Cl)C(=O)O
    d-3-Cyclopentyl-alanine DCPE C8H15NO2 157.11027872 N[C@H](CC1CCCC1)C(=O)O
    d-2-Chloro-phenylglycine DCG6 C8H8ClNO2 185.024356176 N[C@H](c1ccccc1Cl)C(=O)O
    d-3-Chloro-phenylglycine DCG5 C8H8ClNO2 185.024356176 N[C@H](c1cccc(c1)Cl)C(=O)O
    d-4-Chloro-phenylglycine DCGD C8H8ClNO2 185.024356176 N[C@H](c1ccc(cc1)Cl)C(=O)O
    d-2-chloro-Phenylalanine DCF6 C9H10ClNO2 199.04000624 N[C@H](Cc1ccccc1Cl)C(=O)O
    d-s-acetonylcysteine DCSA C6H11NO3S 177.045964212 N[C@H](CSCC(=O)C)C(=O)O
    d-Selenocysteine DCSE C3H7NO2Se 168.964199764 N[C@H](C[SeH])C(=O)O
    d-7-chloro-tryptophan DCTE C11H11ClN2O2 238.050905272 N[C@H](Cc1cNc2c1cccc2Cl)C(=O)O
    d-4-chloro-threonine DCTH C4H8ClNO3 153.019270796 N[C@H]([C@H](O)CCl)C(=O)O
    d-4-Hydroxy-phenylglycine DD4P C8H9NO3 167.058243148 N[C@H](c1ccc(cc1)O)C(=O)O
    d-Diaminobutyric-acid DDAB C4H10N2O2 118.07422756 N[C@H](CCN)C(=O)O
    d-3,4-Dihydroxy-phenylalanine DDAH C9H11NO4 197.068807832 N[C@H](Cc1ccc(c(c1)O)O)C(=O)O
    d-3,5-dibromotyrosine DDBY C9H9Br2NO3 336.894917348 N[C@H](Cc1cc(Br)c(c(c1)Br)O)C(=O)O
    d-3,3-dihydroxy-alanine DDDZ C3H7NO4 121.037507704 N[C@H](C(O)O)C(=O)O
    d-Diethylalanine D2EL C7H15NO2 145.11027872 N[C@H](C(CC)CC)C(=O)O
    d-3,3-diphenylalanine D2F1 C15H15NO2 241.11027872 N[C@H]([C@H](c1ccccc1)c1ccccc1)C(=O)O
    d-3,3-dimethyl-aspartic-acid DDMK C6H11NO4 161.068807832 N[C@H](C(C(=O)O)(C)C)C(=O)O
    d-3-ethyl-phenylalanine DDF4 C11H15NO2 193.11027872 N[C@H](Cc1cc(CC)ccc1)C(=O)O
    d-2,3-Diaminopropanoic-acid DDPP C3H8N2O2 104.058577496 N[C@H](CN)C(=O)O
    d-Ethionine DESC C6H13NO2S 163.066699656 N[C@H](CCSCC)C(=O)O
    d-3,4-Difluoro-phenylalanine DF2F C9H9F2NO2 201.060134968 N[C@H](Cc1ccc(c(c1)F)F)C(=O)O
    d-3-chloro-Phenylalanine DFCL C9H10ClNO2 199.04000624 N[C@H](Cc1cccc(c1)Cl)C(=O)O
    d-4-Fluoro-glutamic-acid D4FG C5H8FNO4 165.043735956 N[C@H](C[C@H](F)C(=O)O)C(=O)O
    d-Trifluoro-alanine DFLA C3H4F3NO2 143.019413028 N[C@H](C(F)(F)F)C(=O)O
    d-2-Fluoro-phenylglycine DFP6 C8H8FNO2 169.053906716 N[C@H](c1ccccc1F)C(=O)O
    d-3-Fluoro-phenylglycine DFP7 C8H8FNO2 169.053906716 N[C@H](c1cccc(c1)F)C(=O)O
    d-4-Fluoro-phenylglycine DFP8 C8H8FNO2 169.053906716 N[C@H](c1ccc(cc1)F)C(=O)O
    d-2-Fluoro-Phenylalanine DFF2 C9H10FNO2 183.06955678 N[C@H](Cc1ccccc1F)C(=O)O
    d-3-Fluoro-Phenylalanine DFF3 C9H10FNO2 183.06955678 N[C@H](Cc1cccc(c1)F)C(=O)O
    d-6-fluoro-l-tryptophan DFT6 C11H11FN2O2 222.080455812 N[C@H](Cc1cNc2c1ccc(c2)F)C(=O)O
    d-5-Fluoro-tryptophan DFTR C11H11FN2O2 222.080455812 N[C@H](Cc1c[nH]c2c1cc(F)cc2)C(=O)O
    d-(2-furyl)-alanine DFUO C7H9NO3 155.058243148 N[C@H](Cc1ccco1)C(=O)O
    d-3-Fluoro-valine DFVL C5H10FNO2 135.06955678 N[C@H](C(F)(C)C)C(=O)O
    d-2-Amino-4-guanidinobutryric-acid DGBT C5H14N4O2 162.111675688 N[C@H](CCNC(N)N)C(=O)O
    d-2-Amino-3-guanidinopropionic-acid DGPA C4H12N4O2 148.096025624 N[C@H](CNC(N)N)C(=O)O
    d-Canavanine DGGB C5H12N4O3 176.090940244 N[C@H](CCON=C(N)N)C(=O)O
    d-(2s,4s)-2,5-diamino-4-hydroxy-5-oxopentanoic-acid DGHG C5H10N2O4 162.0640568 N[C@H](C[C@H](O)C(=O)N)C(=O)O
    d-5-o-methyl-glutamic-acid DGME C6H11NO4 161.068807832 N[C@H](CCC(=O)OC)C(=O)O
    d-homocysteine DHCS C4H9NO2S 135.035399528 N[C@H](CCS)C(=O)O
    d-glutamine-hydroxamate DHGA C5H10N2O4 162.0640568 N[C@H](CCC(=O)NO)C(=O)O
    d-(2s)-2,8-diaminooctanoic-acid DHHK C8H18N2O2 174.136827816 N[C@H](CCCCCCN)C(=O)O
    d-4-Hydroxy-L-isoleucine DHIL C6H13NO3 147.089543276 N[C@H]([C@H]([C@@H](C)O)C)C(=O)O
    d-(2s,3r)-2-amino-3-hydroxy-4-methylpentanoic-acid DHL2 C6H13NO3 147.089543276 N[C@H]([C@H](O)C(C)C)C(=O)O
    d-Homoleucine DHL1 C7H15NO2 145.11027872 N[C@H](CCC(C)C)C(=O)O
    d-beta-hydroxyleucine DHLU C6H13NO3 147.089543276 N[C@H]([C@@H](O)C(C)C)C(=O)O
    d-4-amino-L-phenylalanine DHOX C9H12N2O2 180.089877624 N[C@H](Cc1ccc(cc1)N)C(=O)O
    d-Homophenylalanine DHPE C10H13NO2 179.094628656 N[C@H](CCc1ccccc1)C(=O)O
    d-3-(8-hydroxyquinolin-3-yl)-l-alanine DHQA C12H12N2O3 232.084792244 N[C@H](Cc1cnc2c(c1)cccc2O)C(=O)O
    d-homoarginine DHRG C7H18N4O2 190.142975816 N[C@H](CCCCNC(N)N)C(=O)O
    d-5-Hydroxy-tryptophan DHRP C11H12N2O3 220.084792244 N[C@H](Cc1cNc2c1cc(O)cc2)C(=O)O
    d-homoserine DHSE C4H9NO3 119.058243148 N[C@H](CCO)C(=O)O
    d-beta-hydroxy-tryptophane DHTR C11H12N2O3 220.084792244 N[C@H]([C@H](c1c[nH]c2c1cccc2)O)C(=O)O
    d-3-hydroxy-l-valine DHVA C5H11NO3 133.073893212 N[C@H](C(O)(C)C)C(=O)O
    d-3-methyl-l-alloisoleucine DI2M C7H15NO2 145.11027872 N[C@H](C(CC)(C)C)C(=O)O
    d-alpha-amino-2-indanacetic-acid DIGL C11H13NO2 191.094628656 N[C@H](C1Cc2c(C1)cccc2)C(=O)O
    d-Allo-Isoleucine DIIL C6H13NO2 131.094628656 N[C@H]([C@@H](CC)C)C(=O)O
    d-4,5-dihydroxy-isoleucine DILX C6H13NO4 163.084457896 N[C@H]([C@H]([C@H](CO)O)C)C(=O)O
    d-3-iodo-tyrosine DIYR C9H10INO3 306.97054118 N[C@H](Cc1ccc(c(c1)I)O)C(=O)O
    d-kynurenine DKYN C10H12N2O3 208.084792244 N[C@H](CC(=O)c1ccccc1N)C(=O)O
    d-6-hydroxy-l-norleucine DLDO C6H13NO3 147.089543276 N[C@H](CCCCO)C(=O)O
    d-Penicillamine DLE1 C5H11NO2S 149.051049592 N[C@H](C(S)(C)C)C(=O)O
    d-(4r)-5-oxo-l-leucine DLED C6H11NO3 145.073893212 N[C@H](C[C@@H](C)C=O)C(=O)O
    d-(4s)-5-fluoro-l-leucine DLEF C6H12FNO2 149.085206844 N[C@H](C[C@H](C)CF)C(=O)O
    d-(3r)-3-methyl-l-glutamic-acid DLME C6H11NO4 161.068807832 N[C@H]([C@H](C)CC(O)=O)C(=O)O
    d-3-methyl-l-glutamine DLMQ C6H12N2O3 160.084792244 N[C@H]([C@@H](C)CC(N)=O)C(=O)O
    d-vinylglycine DLVG C4H7NO2 101.047678464 N[C@H](C=C)C(=O)O
    d-4-oxo-l-valine DLVN C5H9NO3 131.058243148 N[C@H]([C@H](C)C=O)C(=O)O
    d-3,3-dimethyl-methionine-sulfoxide DM2S C7H15NO3S 193.07726434 N[C@H](C(C)(C)C[S@@](C)=O)C(=O)O
    d-hydroxy-l-methionine DME0 C5H11NO3S 165.045964212 N[C@H](CCSCO)C(=O)O
    d-(3s)-3-methyl-l-glutamic-acid DMEG C6H11NO4 161.068807832 N[C@H]([C@@H](C)CC(=O)O)C(=O)O
    d-n-methyl-asparagine DMEN C5H10N2O3 146.06914218 N[C@H](CC(=O)NC)C(=O)O
    d-n5-methyl-glutamine DMEQ C6H12N2O3 160.084792244 N[C@H](CCC(=O)NC)C(=O)O
    d-s-oxymethionine DMHO C5H11NO3S 165.045964212 N[C@H](CC[S@](=O)C)C(=O)O
    d-5-Methoxy-tryptophan D5XW C12H14N2O3 234.100442308 N[C@H](Cc1cNc2ccc(OC)cc12)C(=O)O
    d-3,4-Dimethyl-phenylalanine DM34 C11H15NO2 193.11027872 N[C@H](Cc1ccc(c(c1)C)C)C(=O)O
    d-2-Methyl-phenylalanine D2MF C10H13NO2 179.094628656 N[C@H](Cc1ccccc1C)C(=O)O
    d-5-Methyl-tryptophan D5MW C12H14N2O2 218.105527688 N[C@H](Cc1cNc2ccc(C)cc12)C(=O)O
    d-6-Methyl-tryptophan D6MW C12H14N2O2 218.105527688 N[C@H](Cc1cNc2c1ccc(c2)C)C(=O)O
    d-m-Tyrosine DMTY C9H11NO3 181.073893212 N[C@H](Cc1cccc(c1)O)C(=O)O
    d-2-Naphthyl-alanine DNAL C13H13NO2 215.094628656 N[C@H](Cc1ccc2c(c1)cccc2)C(=O)O
    d-5-hydroxy-1-naphthalene D51N C13H13NO3 231.089543276 N[C@H](Cc1cccc2c1cc(O)cc2)C(=O)O
    d-6-hydroxy-2-naphthalene D62N C13H13NO3 231.089543276 N[C@H](Cc1ccc2c(c1)cc(cc2)O)C(=O)O
    d-meta-nitro-tyrosine DNIY C9H10N2O5 226.05897142 N[C@H](Cc1ccc(c(c1)N(=O)=O)O)C(=O)O
    d-Norleucine DNLE C6H13NO2 131.094628656 N[C@H](CCCC)C(=O)O
    d-Norvaline DNVA C5H11NO2 117.078978592 N[C@H](CCC)C(=O)O
    d-o-acetylserine DOAS C5H9NO4 147.053157768 N[C@H](COC(=O)C)C(=O)O
    d-(2s)-2-amino-4,4-difluorobutanoic-acid DOBF C4H7F2NO2 139.044484904 N[C@H](CC(F)F)C(=O)O
    d-s-(2-hydroxyethyl)-l-cysteine DOCY C5H11NO3S 165.045964212 N[C@H](CSCCO)C(=O)O
    d-o-methyl-l-threonine DOLT C5H11NO3 133.073893212 N[C@H]([C@H](OC)C)C(=O)O
    d-Methionine-sulfone DOMT C5H11NO4S 181.040878832 N[C@H](CCS(=O)(=O)C)C(=O)O
    d-(betar)-beta-hydroxy-l-tyrosine DOMX C9H11NO4 197.068807832 N[C@H]([C@@H](c1ccc(cc1)O)O)C(=O)O
    d-(betar)-3-chloro-beta-hydroxy-l-tyrosine DOMY C9H10ClNO4 231.02983548 N[C@H]([C@@H](c1ccc(c(c1)Cl)O)O)C(=O)O
    d-5-oxo-l-norleucine DONL C6H11NO3 145.073893212 N[C@H](CCC(=O)C)C(=O)O
    d-Ornithine DORN C5H12N2O2 132.089877624 N[C@H](CCCN)C(=O)O
    d-o-Tyrosine D2TR C9H11NO3 181.073893212 N[C@H](Cc1ccccc1O)C(=O)O
    d-4-benzoyl-phenylalanine DPBF C16H15NO3 269.10519334 N[C@H](Cc1ccc(cc1)C(=O)c1ccccc1)C(=O)O
    d-pentafluoro-phenylalanine DPF5 C9H6F5NO2 255.031869532 N[C@H](Cc1c(F)c(F)c(c(c1F)F)F)C(=O)O
    d-4-Fluoro-Phenylalanine DPFF C9H10FNO2 183.06955678 N[C@H](Cc1ccc(cc1)F)C(=O)O
    d-4-Iodo-Phenylalanine DPHI C9H10INO2 290.97562656 N[C@H](Cc1ccc(cc1)I)C(=O)O
    d-4-Nitro-phenylalanine DPPN C9H10N2O4 210.0640568 N[C@H](Cc1ccc(cc1)N(=O)=O)C(=O)O
    d-phosphotyrosine DPTR C9H12NO6P 261.040223734 N[C@H](Cc1ccc(cc1)OP(=O)(O)O)C(=O)O
    d-3-(2-Pyridyl)-alanine DY23 C8H10N2O2 166.07422756 N[C@H](Cc1ccccn1)C(=O)O
    d-3-(3-Pyridyl)-alanine DY33 C8H10N2O2 166.07422756 N[C@H](Cc1cccnc1)C(=O)O
    d-3-(4-Pyridyl)-alanine DY34 C8H10N2O2 166.07422756 N[C@H](Cc1ccncc1)C(=O)O
    d-3-(1-Pyrazolyl)-alanine DPZ4 C6H9N3O2 155.069476528 N[C@H](Cn1cccn1)C(=O)O
    d-3-(2-Quinolyl)-alanine DQ32 C12H12N2O2 216.089877624 N[C@H](Cc1ccc2c(n1)cccc2)C(=O)O
    d-3-(3-quinolyl)-alanine DQ33 C12H12N2O2 216.089877624 N[C@H](Cc1cnc2c(c1)cccc2)C(=O)O
    d-3-(4-quinolyl)-alanine DQ34 C12H12N2O2 216.089877624 N[C@H](Cc1ccnc2c1cccc2)C(=O)O
    d-3-(5-Quinolyl)-alanine DQ35 C12H12N2O2 216.089877624 N[C@H](Cc1ccc2c(c1)nccc2)C(=O)O
    d-3-(6-Quinolyl)-alanine DQ36 C12H12N2O2 216.089877624 N[C@H](Cc1ccc2c(c1)cncc2)C(=O)O
    d-3-(2-quinoxalyl)-alanine DQX3 C11H11N3O2 217.085126592 N[C@H](Cc1cnc2c(n1)cccc2)C(=O)O
    d-phosphoserine DSEP C3H8NO6P 185.008923606 N[C@H](COP(=O)(O)O)C(=O)O
    d-thialysine DSLZ C5H12N2O2S 164.061948624 N[C@H](CSCCN)C(=O)O
    d-Methionine-sulfoxide DSME C5H11NO3S 165.045964212 N[C@H](CC[S@](=O)C)C(=O)O
    d-Styrylalanine DSYA C11H13NO2 191.094628656 N[C@H](CC=Cc1ccccc1)C(=O)O
    d-2s,4r-4-methylglutamate DSYM C6H11NO4 161.068807832 N[C@H](C[C@H](C(=O)O)C)C(=O)O
    d-4-tert-butyl-phenylalanine D4TF C13H19NO2 221.141578848 N[C@H](Cc1ccc(cc1)C(C)(C)C)C(=O)O
    d-3-(2-Tetrazolyl)-alanine DTEZ C4H7N5O2 157.059974464 N[C@H](Cn1nncn1)C(=O)O
    d-2-(Trifluoromethyl)-phenylglycine D2TG C9H8F3NO2 219.050713156 N[C@H](c1ccccc1C(F)(F)F)C(=O)O
    d-3-(Trifluoromethyl)-phenylglycine D3TG C9H8F3NO2 219.050713156 N[C@H](c1cccc(c1)C(F)(F)F)C(=O)O
    d-4-(Trifluoromethyl)-phenylglycine D4TG C9H8F3NO2 219.050713156 N[C@H](c1ccc(cc1)C(F)(F)F)C(=O)O
    d-5,5,5-Trifluoro-leucine DTFL C6H10F3NO2 185.06636322 N[C@H](C[C@@H](C(F)(F)F)C)C(=O)O
    d-2-(Trifluoromethyl)-phenylalanine D2TF C10H10F3NO2 233.06636322 N[C@H](Cc1ccccc1C(F)(F)F)C(=O)O
    d-3-(Trifluoromethyl)-phenylalanine D3TF C10H10F3NO2 233.06636322 N[C@H](Cc1cccc(c1)C(F)(F)F)C(=O)O
    d-4-(Trifluoromethyl)-phenylalanine D4TM C10H10F3NO2 233.06636322 N[C@H](Cc1ccc(cc1)C(F)(F)F)C(=O)O
    d-4-hydroxy-l-threonine DTH6 C4H9NO4 135.053157768 N[C@H]([C@H](O)CO)C(=O)O
    d-3-(3-thienyl)-alanine D3TA C7H9NO2S 171.035399528 N[C@H](Cc1cscc1)C(=O)O
    d-2-thienylglycine D2TH C6H7NO2S 157.019749464 N[C@H](c1cccs1)C(=O)O
    d-3-thienylglycine D3TH C6H7NO2S 157.019749464 N[C@H](c1cscc1)C(=O)O
    d-Thio-citrulline DTVI C6H13N3O2S 191.072847656 N[C@H](CCCNC(=S)N)C(=O)O
    d-3-(2-thienyl)-alanine DTIH C7H9NO2S 171.035399528 N[C@H](Cc1cccs1)C(=O)O
    d-phosphothreonine DTPO C4H10NO6P 199.02457367 N[C@H]([C@H](OP(=O)(O)O)C)C(=O)O
    d-2-hydroxy-tryptophan DTRO C11H12N2O3 220.084792244 N[C@H](Cc1c(O)[nH]c2c1cccc2)C(=O)O
    d-6-hydroxy-tryptophan DTRX C11H12N2O3 220.084792244 N[C@H](Cc1c[nH]c2c1ccc(c2)O)C(=O)O
    d-3-(1,2,4-Triazol-1-yl)-alanine DTZR C5H8N4O2 156.064725496 N[C@H](Cn1cncn1)C(=O)O
    d-6-amino-7-hydroxy-l-tryptophan DTTQ C11H13N3O3 235.095691276 N[C@H](Cc1c[nH]c2c1ccc(c2O)N)C(=O)O
    d-3-Amino-L-tyrosine DTY2 C9H12N2O3 196.084792244 N[C@H](Cc1ccc(c(c1)N)O)C(=O)O
    d-3,5-diiodotyrosine DTYI C9H9I2NO3 432.8671891479999 N[C@H](Cc1cc(I)c(c(c1)I)O)C(=O)O
    d-3-amino-6-hydroxy-tyrosine DTYQ C9H12N2O4 212.079706864 N[C@H](Cc1cc(N)c(cc1O)O)C(=O)O
    d-(4-thiazolyl)-alanine D4TH C6H8N2O2S 172.030648496 N[C@H](Cc1cscn1)C(=O)O
    d-2-Aminoadipic-acid DUN1 C6H11NO4 161.068807832 N[C@H](CCCC(=O)O)C(=O)O
    d-Hydroxynorvaline DVAH C5H11NO3 133.073893212 N[C@H]([C@H](O)CC)C(=O)O
    d-3,5-Difluoro-phenylalanine DWFP C9H9F2NO2 201.060134968 N[C@H](Cc1cc(F)cc(c1)F)C(=O)O
    d-cysteine-s-acetamide DYCM C5H10N2O3S 178.04121318 N[C@H](CSCC(=O)N)C(=O)O
    d-3-fluorotyrosine DYOF C9H10FNO3 199.0644714 N[C@H](Cc1ccc(c(c1)F)O)C(=O)O

    Peptide SMILES Generation

    Introduction

    This module performs de novo peptide (including cyclic peptides) generation, property calculation, analysis, and format conversion, with full support for non-canonical amino acids (NCAAs).
    The module is built on the p2smi toolkit.

    It can automatically generate peptide sequences, convert peptide sequences to SMILES strings (including cyclized structures and non-canonical amino acids), and compute various molecular properties.
    In addition, it provides modification utilities (e.g., N-methylation, PEGylation) and synthetic feasibility assessment.

    Main Features:

    • Random peptide generation (supports non-canonical amino acids, D-amino acids, and multiple cyclization modes)
    • Convert peptide FASTA files to valid SMILES strings
    • Support five types of cyclization: disulfide, head-to-tail, sidechain–sidechain, sidechain–N-terminus, sidechain–C-terminus
    • Compute diverse molecular properties (e.g., MW, logP, TPSA, Lipinski rules)
    • Evaluate peptide synthetic feasibility

    Non-Canonical Amino Acids (411 total)

    Name Code Formula MolWeight SMILES
    Phenylglycine PG C8H9NO2 151.063328528 N[C@@H](c1ccccc1)C(=O)O
    4-methoxy-Phenylalanine 0A1 C10H13NO3 195.089543276 N[C@@H](Cc1ccc(OC)cc1)C(=O)O
    … … … … …

    Full list provided in Appendix.
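
The MolWeight values in this table look like monoisotopic (exact) masses rather than average molecular weights: phenylglycine, C8H9NO2, is listed as 151.063 instead of about 151.16. The sketch below recomputes such a value from the Formula column. It is only an illustration and not the module's own code. The function name, the isotope table, and the handling of a trailing charge sign are assumptions made for this example.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element
# that appears in the table. Values are rounded literature figures.
MONOISOTOPIC = {
    "H": 1.00782503223, "C": 12.0, "N": 14.00307400443,
    "O": 15.99491461957, "F": 18.99840316273, "P": 30.97376199842,
    "S": 31.9720711744, "Cl": 34.968852682, "Se": 79.9165218,
    "Br": 78.9183376, "I": 126.9044719,
}
ELECTRON_MASS = 0.000548579909


def monoisotopic_mass(formula):
    """Sum isotope masses for a formula such as 'C8H9NO2' or 'C3H7N4O2+'."""
    body = formula.rstrip("+-")
    # Assumption: each trailing '+' or '-' stands for one unit of charge.
    charge = formula.count("+") - formula.count("-")
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", body):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    # A cation has lost electrons, so their mass is subtracted.
    return mass - charge * ELECTRON_MASS


print(round(monoisotopic_mass("C8H9NO2"), 4))
```

Small differences in the last decimal places are expected, because the result depends on which isotope mass table is used.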


    Parameter Description

    Generation

    Generate random peptide sequences according to user-defined constraints.

    Number

    Number of generated peptide sequences.
    Default: 10, maximum: 10000.

    Minimum Length

    Minimum peptide length.
    Default: 10.

    Maximum Length

    Maximum peptide length.
    Default: 20, upper limit: 150.

    NCAA Percentage

    Fraction of non-canonical amino acids per peptide.
    Default: 0.1 (10%), range: 0.0 ~ 1.0.

    D-AA Percentage

    Fraction of D-amino acids per peptide.
    Default: 0.1 (10%), range: 0.0 ~ 1.0.

    Cyclization Types

    Cyclization strategy for cyclic peptides; multiple choices allowed.
    The all option selects every cyclization mode; if nothing is selected, linear peptides are produced (default).

    Tag Type Description
    SS Disulfide Disulfide bond between cysteines
    HT Head-to-tail N-terminus to C-terminus (amide bond)
    SCSC Sidechain–Sidechain Sidechain linkage (depsipeptide/ester bond)
    SCNT Sidechain–N-Terminus Sidechain to N-terminus cyclization
    SCCT Sidechain–C-Terminus Sidechain to C-terminus cyclization

    Output

    Generated peptide sequences in FASTA format.
    Default: peptides.fasta.
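
To show how the generation parameters interact, here is a minimal sketch of a random sequence sampler. It is not the p2smi implementation. The names `random_tokens` and `random_peptide` are hypothetical, the NCAA list is a small subset of codes from the Appendix, and both fractions are treated as independent per-residue probabilities, which is an assumption about how they are applied.

```python
import random

STANDARD = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical subset of NCAA codes taken from the Appendix table.
NCAA_CODES = ["ORN", "NLE", "ABA", "CIR", "HSER"]


def random_tokens(min_len=10, max_len=20, ncaa_frac=0.1, d_frac=0.1, rng=None):
    """Return a list of residue tokens in the documented notation."""
    rng = rng or random.Random()
    length = rng.randint(min_len, max_len)  # inclusive on both ends
    tokens = []
    for _ in range(length):
        if rng.random() < ncaa_frac:
            tokens.append("{" + rng.choice(NCAA_CODES) + "}")
            continue
        residue = rng.choice(STANDARD)
        # Glycine is achiral, so this sketch never writes it in lowercase.
        if residue != "G" and rng.random() < d_frac:
            residue = residue.lower()
        tokens.append(residue)
    return tokens


def random_peptide(**kwargs):
    """Join the tokens into a single sequence string."""
    return "".join(random_tokens(**kwargs))


print(random_peptide(rng=random.Random(0)))
```

Passing a seeded `random.Random` makes a run reproducible. Because the fractions are applied per residue, the realised NCAA and D-residue share of a single short peptide can differ noticeably from the requested value.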


    Format Conversion

    Convert FASTA-format peptide sequences to SMILES strings.
    For cyclic peptides, the cyclization type must be specified.

    Peptides

    Input peptide sequence file in FASTA format.

    Notes:

    • Uppercase letters (ACDEFGHIKLMNPQRSTVWY): standard amino acids
    • Lowercase letters: D-amino acids
    • NCAAs: represented as {Code} (e.g., hydroxyproline = {Hyp})
    • Cyclic peptides: cyclization type appended to sequence header using |

    Examples

    seq_1
    AVRENmV
    seq_2|SCCT
    PyWT{DMEG}{0BN}IKaYI{TFG3}RWtNQ{I2M}
    seq_3|SCNT
    KI{D6MW}E{AHP}iiARCKE{MEN}
    
    
    • seq_1: linear peptide with standard and D-amino acids
    • seq_2: cyclic peptide (SCCT) with standard, D-, and non-canonical amino acids
    • seq_3: cyclic peptide (SCNT) with mixed amino acid types
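
The notation rules above can be made concrete with a small tokenizer. It is a sketch for reading the sequence syntax only, not the parser used by the module. The function name `tokenize` and the `(kind, code)` output format are assumptions made for this example.

```python
import re

# One token is either a braced NCAA code or a single letter.
TOKEN = re.compile(r"\{([^{}]+)\}|([A-Za-z])")


def tokenize(sequence):
    """Split a sequence into (kind, code) pairs; kind is 'L', 'D' or 'NCAA'."""
    residues = []
    pos = 0
    for match in TOKEN.finditer(sequence):
        if match.start() != pos:
            raise ValueError(f"unexpected character at position {pos}")
        ncaa, letter = match.groups()
        if ncaa:
            residues.append(("NCAA", ncaa))
        elif letter.isupper():
            residues.append(("L", letter))
        else:
            # Lowercase letters denote the D form of the same residue.
            residues.append(("D", letter.upper()))
        pos = match.end()
    if pos != len(sequence):
        raise ValueError(f"unexpected character at position {pos}")
    return residues


print(tokenize("AVRENmV"))
```

For seq_1 this yields six L residues and one D-methionine. Braced codes are returned unchanged, so a D-form NCAA such as {D6MW} is still reported with kind 'NCAA'.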

    Output SMILES

    Converted SMILES strings, one per line.
    Default: peptides.smi.

    Output CSV

    Mapping file with sequence and SMILES information.
    Default: peptides.csv.


    Property

    Compute peptide molecular properties, including MW, TPSA, logP, H-bond donors/acceptors, rotatable bonds, ring count, fraction Csp³, heavy atom count, formal charge, formula, and Lipinski evaluation.

    Peptides

    Input peptides in FASTA format or SMILES text format (one SMILES per line).

    Output

    Results in CSV format.
    Default: peptide_props.csv.


    Feasibility of Synthesis

    Evaluate peptide synthetic feasibility based on:
    N/Q at N-terminus, Gly/Pro motifs, cysteine count, hydrophobicity, net charge, etc.
    (Currently supports standard amino acids only.)

    Fasta

    Peptide sequence file in FASTA format.

    Output

    Synthetic feasibility report in CSV format.
    Default: synthesis_report.csv.
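
The kind of rule-based check described above can be sketched as follows. This is an illustration and not the module's actual rule set. The function name, the cysteine limit, the set of residues counted as charged, and the sliding-window reading of "1 charged residue every 5 residues" are all assumptions, and only a few of the listed criteria are covered.

```python
# Assumption: D, E, K and R are the residues counted as charged.
CHARGED = set("DEKR")


def check_synthesis(seq, window=5, max_cys=2):
    """Return ('PASS' or 'FAIL', list of reasons) for an uppercase sequence."""
    problems = []
    if seq[:1] in ("N", "Q"):
        problems.append("N or Q at N-terminus")
    if seq.count("C") > max_cys:  # hypothetical threshold
        problems.append(f"more than {max_cys} cysteines")
    # Slide a window along the sequence; each window needs a charged residue.
    last_start = max(len(seq) - window, 0)
    for start in range(last_start + 1):
        if not CHARGED & set(seq[start:start + window]):
            problems.append(
                f"need 1 charged residue every {window} residues"
            )
            break
    return ("FAIL" if problems else "PASS"), problems


print(check_synthesis("QAAAAAAK"))
```

Returning the list of reasons alongside the verdict mirrors the Result and Description columns of the report, where a FAIL is accompanied by an explanation.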


    Example Output

    Generation Mode

    Example FASTA output:

    seq_1|HT
    {FLA}dAVREN{6CL}mV
    seq_2|SCCT
    PyWT{DMEG}{0BN}IKaYI{TFG3}RWtNQ{I2M}
    seq_3|SCNT
    KI{D6MW}E{AHP}iiARCKE{MEN}
    seq_4|HT
    YlCP{YCM}yR{ESC}EiD{DDAB}HYSY{LMQ}GT
    seq_5|HT
    {ORN}{AA4}TQAqP{CSA}YKI{DTTQ}aVvH
    
    

    Legend:

    • Uppercase letters: standard amino acids
    • Lowercase letters: D-amino acids
    • NCAAs: {Code}
    • Cyclization types: annotated using |

    Format Mode

    CSV contains:

    Field Example Description
    Name seq_1 Peptide name
    Type HT Cyclization type; empty for linear peptides
    Sequence FALPciA{DQ36}S{ONL}MV{TTQ}RS Peptide sequence
    SMILES N3[C@@H](Cc1ccccc1)C(=O) Converted SMILES
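
The Name, Type, and Sequence columns can all be read from the FASTA input. The sketch below is not the module's code: it assumes each record is one header line followed by one sequence line, accepts headers with or without a leading ">", and uses hypothetical function names. The SMILES column is left out because filling it requires the actual conversion step.

```python
import csv
import io


def fasta_records(text):
    """Yield (name, cyclization, sequence) from header/sequence line pairs."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for header, sequence in zip(lines[0::2], lines[1::2]):
        # The cyclization tag, if any, follows '|' in the header.
        name, _, cyclization = header.lstrip(">").partition("|")
        yield name, cyclization, sequence


def to_csv(text):
    """Render the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name", "Type", "Sequence"])
    writer.writerows(fasta_records(text))
    return buffer.getvalue()


example = "seq_1\nAVRENmV\nseq_3|SCNT\nKI{D6MW}E{AHP}iiARCKE{MEN}\n"
print(to_csv(example))
```

Linear peptides produce an empty Type field, which matches the table description above.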

    Property Mode

    The output CSV includes:

    Field Description
    Name Peptide name
    Sequence / SMILES Input representation
    Molecular Weight (MW) Peptide molecular weight
    logP Partition coefficient
    TPSA Topological polar surface area
    HBD / HBA Hydrogen bond donors / acceptors
    Rotatable Bonds Number of rotatable bonds
    Rings Number of rings
    Fraction Csp³ Fraction of carbon atoms that are sp³-hybridized
    Heavy Atom Count Number of heavy atoms
    Formal Charge Net formal charge
    Formula Molecular formula
    Lipinski Lipinski rule-of-five evaluation
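
For reference, the rule-of-five evaluation can be reproduced from four of the columns above. The sketch applies the standard thresholds (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10). The function name and the idea of reporting a list of violations are assumptions; the module may present the result differently. The input numbers in the usage line are made up for illustration.

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Return the rule-of-five criteria that a molecule violates."""
    rules = {
        "MW > 500": mw > 500,
        "logP > 5": logp > 5,
        "HBD > 5": hbd > 5,
        "HBA > 10": hba > 10,
    }
    return [name for name, broken in rules.items() if broken]


# Hypothetical property values for a mid-sized peptide.
print(lipinski_violations(mw=1200.0, logp=-3.0, hbd=15, hba=18))
```

Most peptides with ten or more residues exceed the molecular weight and hydrogen-bond limits, so several violations are the usual outcome for them and not a sign of a calculation error.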

    Feasibility of Synthesis Mode

    The output CSV file contains the following information:

    Field Name Example Description
    Name seq_1 Peptide sequence name
    Result FAIL Synthesis feasibility assessment: PASS indicates good feasibility; FAIL indicates poor feasibility
    Description Failed charge: need 1 charged residue every 5 residues Reason for a FAIL result
    Sequence FALPciA{DQ36}S{ONL}MV{TTQ}RS Peptide sequence

    References

    • p2smi: A Python Toolkit for Peptide FASTA-to-SMILES Conversion and Molecular Property Analysis.
      Feller, A. L. and Wilke, C. O. (2025).
      DOI: 10.48550/arXiv.2505.00719

    Appendix

    Table of Non-Canonical Amino Acids

    Name Code Formula MolWeight SMILES
    Phenylglycine PG C8H9NO2 151.063328528 N[C@@H](c1ccccc1)C(=O)O
    4-methoxy-Phenylalanine 0A1 C10H13NO3 195.089543276 N[C@@H](Cc1ccc(OC)cc1)C(=O)O
    7-hydroxy-l-tryptophan 0AF C11H12N2O3 220.084792244 N[C@@H](Cc1c[nH]c2c1cccc2O)C(=O)O
    4-carbamimidoyl-l-phenylalanine 0BN C10H13N3O2 207.100776656 N[C@@H](Cc1ccc(cc1)C(=N)N)C(=O)O
    4-chloro-Phenylalanine 4CP C9H10ClNO2 199.04000624 N[C@@H](Cc1ccc(cc1)Cl)C(=O)O
    2-Allyl-glycine 2AG C7H11NO5 189.063722452 N[C@@H](CCCC(C(=O)O)=O)C(=O)O
    3-methyl-aspartic-acid 2AS C5H9NO4 147.053157768 N[C@H]([C@H](C)(C(=O)O))C(=O)O
    s-(difluoromethyl)-homocysteine 2FM C5H9F2NO2S 185.032205968 N[C@@H](CCSC(F)F)C(=O)O
    2-fluoro-l-histidine 2HF C6H12FN3O2 177.091354844 N[C@@H](C[C@@H]1CN[C@@H](N1)F)C(=O)O
    2-fluoro-l-histidine(1) 2HF1 C6H8FN3O2 173.060054716 N[C@@H](Cc1cnc(F)N1)C(=O)O
    2-fluoro-l-histidine(2) 2HF2 C6H8FN3O2 173.060054716 N[C@@H](Cc1c[nH]c(n1)F)C(=O)O
    l-2-amino-6-methylene-pimelic-acid 2NP C8H13NO4 187.084457896 N[C@@H](CCCC(=C)C(=O)O)C(=O)O
    3-(4H-thieno[3,2-b]pyrrol-6-yl)-L-alanine 32T C9H10N2O2S 210.04629856 N[C@H](Cc1c[nH]c2c1scc2)C(=O)O
    3-cyano-phenylalanine 3CF C10H10N2O2 190.07422756 N[C@@H](Cc1cccc(C#N)c1)C(=O)O
    (2s)-amino(3,5-dihydroxyphenyl)-ethanoic-acid 3FG C8H9NO4 183.053157768 N[C@@H](c1cc(O)cc(c1)O)C(=O)O
    4-hydroxy-glutamic-acid 3GL C5H9NO5 163.048072388 N[C@@H](C[C@@H](C(=O)O)O)C(=O)O
    3-Chloro-tyrosine 3MY C9H10ClNO3 215.03492086 N[C@H](Cc1ccc(c(c1)Cl)O)C(=O)O
    4-Bromo-phenylalanine 4BF C9H10BrNO2 242.98949066 N[C@@H](Cc1ccc(cc1)Br)C(=O)O
    4-cyano-phenylalanine 4CF C10H10N2O2 190.07422756 N[C@@H](Cc1ccc(cc1)C#N)C(=O)O
    nitrilo-l-methionine 4CY C5H8N2O2S 160.030648496 N[C@@H](CCSC#N)C(=O)O
    4-fluoro-tryptophan 4FW C11H11FN2O2 222.080455812 N[C@@H](Cc1c[nH]c2c1c(F)ccc2)C(=O)O
    4-hydroxymethyl-phenylalanine 4HMP C10H13NO3 195.089543276 N[C@@H](Cc1ccc(CO)cc1)C(=O)O
    4-hydroxy-tryptophan 4HT C11H12N2O3 220.084792244 N[C@@H](Cc1c[nH]c2c1c(O)ccc2)C(=O)O
    4-amino-l-tryptophan 4IN C11H13N3O2 219.100776656 N[C@@H](Cc1c[nH]c2c1c(N)ccc2)C(=O)O
    4-methyl-phenylalanine 4PH C10H13NO2 179.094628656 N[C@@H](Cc1ccc(cc1)C)C(=O)O
    6-carboxylysine 6CL C7H14N2O4 190.095356928 N[C@@H](CCC[C@H](C(=O)O)N)C(=O)O
    6-chloro-l-tryptophan 6CW C11H11ClN2O2 238.050905272 N[C@@H](Cc1c[nH]c2c1ccc(c2)Cl)C(=O)O
    2-amino-5-hydroxypentanoic-acid AA4 C5H11NO3 133.073893212 N[C@@H](CCCO)C(=O)O
    2-Aminobutyric-acid ABA C4H9NO2 103.063328528 N[C@@H](CC)C(=O)O
    cis-amiclenomycin ACZ C10H16N2O2 196.121177752 N[C@@H](CC[C@@H]1C=C[C@@H](C=C1)N)C(=O)O
    Adamanthane ADAM C13H21NO2 223.157228912 N[C@@H](C[C@]12C[C@H]3C[C@@H](C2)C[C@@H](C1)C3)C(=O)O
    5-methyl-arginine AGM C7H16N4O2 188.127325752 N[C@@H](CC[C@H](C)NC(=N)N)C(=O)O
    beta-hydroxyasparagine AHB C4H8N2O4 148.048406736 N[C@@H]([C@@H](C(=O)N)O)C(=O)O
    2-Aminoheptanoic-acid AHP C7H15NO2 145.11027872 N[C@@H](CCCCC)C(=O)O
    3-cyclohexyl-alanine ALC C9H17NO2 171.125928784 N[C@@H](CC1CCCCC1)C(=O)O
    1-Naphthyl-alanine ALN C13H13NO2 215.094628656 N[C@@H](Cc1cccc2c1cccc2)C(=O)O
    Allo-threonine ALO C4H9NO3 119.058243148 N[C@@H]([C@H](C)O)C(=O)O
    3-(9-anthryl)-alanine ANTH C17H15NO2 265.11027872 N[C@@H](Cc1c2ccccc2cc2c1cccc2)C(=O)O
    3-Methyl-phenylalanine APD C10H13NO2 179.094628656 N[C@@H](Cc1cccc(c1)C)C(=O)O
    m-amidinophenyl-3-alanine APM C10H13N3O2 207.100776656 N[C@@H](Cc1cccc(c1)C(=N)N)C(=O)O
    c-gamma-hydroxy-arginine ARO C6H14N4O3 190.106590308 N[C@@H](C[C@@H](O)CN=C(N)N)C(=O)O
    (2r)-2-amino-4-oxobutanoic-acid AS2 C4H7NO3 117.042593084 N[C@@H](CC=O)C(=O)O
    azido-alanine AZDA C3H7N4O2+ 131.05635188409 N[C@@H](CN=[N+]=N)C(=O)O
    Phenylserine BB8 C9H11NO3 181.073893212 N[C@@H]([C@@H](O)c1ccccc1)C(=O)O
    benzylcysteine BCS C10H13NO2S 211.066699656 N[C@@H](CSCc1ccccc1)C(=O)O
    beta-hydroxyaspartic-acid BHD C4H7NO5 149.032422324 N[C@@H]([C@H](O)C(=O)O)C(=O)O
    4,4-biphenylalanine BIF C15H15NO2 241.11027872 N[C@@H](Cc1ccc(cc1)c1ccccc1)C(=O)O
    5-bromo-l-isoleucine BIU C6H12BrNO2 209.005140724 N[C@@H]([C@@H](C)CCBr)C(=O)O
    3-(3-benzothienyl)-alanine BTH3 C11H11NO2S 221.051049592 N[C@@H](Cc1csc2c1cccc2)C(=O)O
    6-bromo-tryptophan BTR C11H11BrN2O2 282.000389692 N[C@@H](Cc1c[nH]c2c1ccc(c2)Br)C(=O)O
    Tertleucine BUG C6H13NO2 131.094628656 N[C@@H](C(C)(C)C)C(=O)O
    3-chloro-l-alanine C2N C3H6ClNO2 123.008706112 N[C@@H](CCl)C(=O)O
    canaline CAN C4H10N2O3 134.06914218 N[C@@H](CCON)C(=O)O
    carboxymethylated-cysteine CCS C5H9NO4S 179.025228768 N[C@@H](CSCC(=O)O)C(=O)O
    Cyclohexylglycine CHG C8H15NO2 157.11027872 N[C@@H](C1CCCCC1)C(=O)O
    3-chloro-4-hydroxy-phenylglycine CHP C8H8ClNO3 201.019270796 N[C@@H](c1ccc(c(c1)Cl)O)C(=O)O
    Citrulline CIR C6H13N3O3 175.095691276 N[C@@H](CCC[NH]C(=O)N)C(=O)O
    2-cyano-phenylalanine CNP2 C10H10N2O2 190.07422756 N[C@@H](Cc1ccccc1C#N)C(=O)O
    2,4-dichloro-phenylalanine CP24 C9H9Cl2NO2 233.001033888 N[C@@H](Cc1ccc(cc1Cl)Cl)C(=O)O
    3,4-dichloro-phenylalanine CP34 C9H9Cl2NO2 233.001033888 N[C@@H](Cc1ccc(c(c1)Cl)Cl)C(=O)O
    3-Cyclopentyl-alanine CPA3 C8H15NO2 157.11027872 N[C@@H](CC1CCCC1)C(=O)O
    2-Chloro-phenylglycine CPG2 C8H8ClNO2 185.024356176 N[C@@H](c1ccccc1Cl)C(=O)O
    3-Chloro-phenylglycine CPG3 C8H8ClNO2 185.024356176 N[C@@H](c1cccc(c1)Cl)C(=O)O
    4-Chloro-phenylglycine CPG4 C8H8ClNO2 185.024356176 N[C@@H](c1ccc(cc1)Cl)C(=O)O
    2-chloro-Phenylalanine CPH2 C9H10ClNO2 199.04000624 N[C@@H](Cc1ccccc1Cl)C(=O)O
    s-acetonylcysteine CSA C6H11NO3S 177.045964212 N[C@@H](CSCC(=O)C)C(=O)O
    Selenocysteine CSE C3H7NO2Se 168.964199764 N[C@@H](C[SeH])C(=O)O
    7-chloro-tryptophan CTE C11H11ClN2O2 238.050905272 N[C@@H](Cc1cNc2c1cccc2Cl)C(=O)O
    4-chloro-threonine CTH C4H8ClNO3 153.019270796 N[C@@H]([C@H](O)CCl)C(=O)O
    4-Hydroxy-phenylglycine D4P C8H9NO3 167.058243148 N[C@@H](c1ccc(cc1)O)C(=O)O
    Diaminobutyric-acid DAB C4H10N2O2 118.07422756 N[C@@H](CCN)C(=O)O
    3,4-Dihydroxy-phenylalanine DAH C9H11NO4 197.068807832 N[C@@H](Cc1ccc(c(c1)O)O)C(=O)O
    3,5-dibromotyrosine DBY C9H9Br2NO3 336.894917348 N[C@@H](Cc1cc(Br)c(c(c1)Br)O)C(=O)O
    3,3-dihydroxy-alanine DDZ C3H7NO4 121.037507704 N[C@@H](C(O)O)C(=O)O
    Diethylalanine DILE C7H15NO2 145.11027872 N[C@@H](C(CC)CC)C(=O)O
    3,3-diphenylalanine DIPH C15H15NO2 241.11027872 N[C@@H]([C@H](c1ccccc1)c1ccccc1)C(=O)O
    3,3-dimethyl-aspartic-acid DMK C6H11NO4 161.068807832 N[C@@H](C(C(=O)O)(C)C)C(=O)O
    3-ethyl-phenylalanine DMP3 C11H15NO2 193.11027872 N[C@@H](Cc1cc(CC)ccc1)C(=O)O
    2,3-Diaminopropanoic-acid DPP C3H8N2O2 104.058577496 N[C@@H](CN)C(=O)O
    Ethionine ESC C6H13NO2S 163.066699656 N[C@@H](CCSCC)C(=O)O
    3,4-Difluoro-phenylalanine F2F C9H9F2NO2 201.060134968 N[C@@H](Cc1ccc(c(c1)F)F)C(=O)O
    3-chloro-Phenylalanine FCL C9H10ClNO2 199.04000624 N[C@@H](Cc1cccc(c1)Cl)C(=O)O
    4-Fluoro-glutamic-acid FGA4 C5H8FNO4 165.043735956 N[C@@H](C[C@H](F)C(=O)O)C(=O)O
    2-amino-propanedioic-acid FGL C3H5NO4 119.02185764 NC(C(=O)O)C(=O)O
    Trifluoro-alanine FLA C3H4F3NO2 143.019413028 N[C@@H](C(F)(F)F)C(=O)O
    2-Fluoro-phenylglycine FPG2 C8H8FNO2 169.053906716 N[C@@H](c1ccccc1F)C(=O)O
    3-Fluoro-phenylglycine FPG3 C8H8FNO2 169.053906716 N[C@@H](c1cccc(c1)F)C(=O)O
    4-Fluoro-phenylglycine FPG4 C8H8FNO2 169.053906716 N[C@@H](c1ccc(cc1)F)C(=O)O
    2-Fluoro-Phenylalanine FPH2 C9H10FNO2 183.06955678 N[C@@H](Cc1ccccc1F)C(=O)O
    3-Fluoro-Phenylalanine FPH3 C9H10FNO2 183.06955678 N[C@@H](Cc1cccc(c1)F)C(=O)O
    6-fluoro-l-tryptophan FT6 C11H11FN2O2 222.080455812 N[C@@H](Cc1cNc2c1ccc(c2)F)C(=O)O
    5-Fluoro-tryptophan FTR C11H11FN2O2 222.080455812 N[C@@H](Cc1c[nH]c2c1cc(F)cc2)C(=O)O
    (2-furyl)-alanine FUA2 C7H9NO3 155.058243148 N[C@@H](Cc1ccco1)C(=O)O
    3-Fluoro-valine FVAL C5H10FNO2 135.06955678 N[C@@H](C(F)(C)C)C(=O)O
    2-Amino-4-guanidinobutryric-acid GBUT C5H14N4O2 162.111675688 N[C@@H](CCNC(N)N)C(=O)O
    2-Amino-3-guanidinopropionic-acid GDPR C4H12N4O2 148.096025624 N[C@@H](CNC(N)N)C(=O)O
    Canavanine GGB C5H12N4O3 176.090940244 N[C@@H](CCON=C(N)N)C(=O)O
    (2s,4s)-2,5-diamino-4-hydroxy-5-oxopentanoic-acid GHG C5H10N2O4 162.0640568 N[C@@H](C[C@H](O)C(=O)N)C(=O)O
    5-o-methyl-glutamic-acid GME C6H11NO4 161.068807832 N[C@@H](CCC(=O)OC)C(=O)O
    homocysteine HCS C4H9NO2S 135.035399528 N[C@@H](CCS)C(=O)O
    glutamine-hydroxamate HGA C5H10N2O4 162.0640568 N[C@@H](CCC(=O)NO)C(=O)O
    (2s)-2,8-diaminooctanoic-acid HHK C8H18N2O2 174.136827816 N[C@@H](CCCCCCN)C(=O)O
    4-Hydroxy-L-isoleucine HIL4 C6H13NO3 147.089543276 N[C@@H]([C@H]([C@@H](C)O)C)C(=O)O
    (2s,3r)-2-amino-3-hydroxy-4-methylpentanoic-acid HL2 C6H13NO3 147.089543276 N[C@@H]([C@H](O)C(C)C)C(=O)O
    Homoleucine HLEU C7H15NO2 145.11027872 N[C@@H](CCC(C)C)C(=O)O
    beta-hydroxyleucine HLU C6H13NO3 147.089543276 N[C@@H]([C@@H](O)C(C)C)C(=O)O
    4-amino-L-phenylalanine HOX C9H12N2O2 180.089877624 N[C@@H](Cc1ccc(cc1)N)C(=O)O
    Homophenylalanine HPE C10H13NO2 179.094628656 N[C@@H](CCc1ccccc1)C(=O)O
    3-(8-hydroxyquinolin-3-yl)-l-alanine HQA C12H12N2O3 232.084792244 N[C@@H](Cc1cnc2c(c1)cccc2O)C(=O)O
    homoarginine HRG C7H18N4O2 190.142975816 N[C@@H](CCCCNC(N)N)C(=O)O
    5-Hydroxy-tryptophan HRP C11H12N2O3 220.084792244 N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O
    homoserine HSER C4H9NO3 119.058243148 N[C@@H](CCO)C(=O)O
    beta-hydroxy-tryptophane HTR C11H12N2O3 220.084792244 N[C@@H]([C@H](c1c[nH]c2c1cccc2)O)C(=O)O
    3-hydroxy-l-valine HVA C5H11NO3 133.073893212 N[C@@H](C(O)(C)C)C(=O)O
    3-methyl-l-alloisoleucine I2M C7H15NO2 145.11027872 N[C@@H](C(CC)(C)C)C(=O)O
    alpha-amino-2-indanacetic-acid IGL C11H13NO2 191.094628656 N[C@@H](C1Cc2c(C1)cccc2)C(=O)O
    Allo-Isoleucine IIL C6H13NO2 131.094628656 N[C@@H]([C@@H](CC)C)C(=O)O
    4,5-dihydroxy-isoleucine ILX C6H13NO4 163.084457896 N[C@@H]([C@H]([C@H](CO)O)C)C(=O)O
    3-iodo-tyrosine IYR C9H10INO3 306.97054117999994 N[C@@H](Cc1ccc(c(c1)I)O)C(=O)O
    kynurenine KYN C10H12N2O3 208.084792244 N[C@@H](CC(=O)c1ccccc1N)C(=O)O
    6-hydroxy-l-norleucine LDO C6H13NO3 147.089543276 N[C@@H](CCCCO)C(=O)O
    Penicillamine LE1 C5H11NO2S 149.051049592 N[C@@H](C(S)(C)C)C(=O)O
    (4r)-5-oxo-l-leucine LED C6H11NO3 145.073893212 N[C@@H](C[C@@H](C)C=O)C(=O)O
    (4s)-5-fluoro-l-leucine LEF C6H12FNO2 149.085206844 N[C@@H](C[C@H](C)CF)C(=O)O
    (3r)-3-methyl-l-glutamic-acid LME C6H11NO4 161.068807832 N[C@@H]([C@H](C)CC(=O)O)C(=O)O
    3-methyl-l-glutamine LMQ C6H12N2O3 160.084792244 N[C@@H]([C@@H](C)CC(N)=O)C(=O)O
    vinylglycine LVG C4H7NO2 101.047678464 N[C@@H](C=C)C(=O)O
    4-oxo-l-valine LVN C5H9NO3 131.058243148 N[C@@H]([C@H](C)C=O)C(=O)O
    3,3-dimethyl-methionine-sulfoxide M2S C7H15NO3S 193.07726434 N[C@@H](C(C)(C)C[S@@](C)=O)C(=O)O
    hydroxy-l-methionine ME0 C5H11NO3S 165.045964212 N[C@@H](CCSCO)C(=O)O
    (3s)-3-methyl-l-glutamic-acid MEG C6H11NO4 161.068807832 N[C@@H]([C@@H](C)CC(=O)O)C(=O)O
    n-methyl-asparagine MEN C5H10N2O3 146.06914218 N[C@@H](CC(=O)NC)C(=O)O
    n5-methyl-glutamine MEQ C6H12N2O3 160.084792244 N[C@@H](CCC(=O)NC)C(=O)O
    s-oxymethionine MHO C5H11NO3S 165.045964212 N[C@@H](CC[S@](=O)C)C(=O)O
    5-Methoxy-tryptophan MOT5 C12H14N2O3 234.100442308 N[C@@H](Cc1cNc2ccc(OC)cc12)C(=O)O
    3,4-Dimethyl-phenylalanine MP34 C11H15NO2 193.11027872 N[C@@H](Cc1ccc(c(c1)C)C)C(=O)O
    2-Methyl-phenylalanine MPH2 C10H13NO2 179.094628656 N[C@@H](Cc1ccccc1C)C(=O)O
    5-Methyl-tryptophan MTR5 C12H14N2O2 218.105527688 N[C@@H](Cc1cNc2ccc(C)cc12)C(=O)O
    6-Methyl-tryptophan MTR6 C12H14N2O2 218.105527688 N[C@@H](Cc1cNc2c1ccc(c2)C)C(=O)O
    m-Tyrosine MTY C9H11NO3 181.073893212 N[C@@H](Cc1cccc(c1)O)C(=O)O
    2-Naphthyl-alanine NAL C13H13NO2 215.094628656 N[C@@H](Cc1ccc2c(c1)cccc2)C(=O)O
    5-hydroxy-1-naphthalene NAO1 C13H13NO3 231.089543276 N[C@@H](Cc1cccc2c1cc(O)cc2)C(=O)O
    6-hydroxy-2-naphthalene NAO2 C13H13NO3 231.089543276 N[C@@H](Cc1ccc2c(c1)cc(cc2)O)C(=O)O
    meta-nitro-tyrosine NIY C9H10N2O5 226.05897142 N[C@@H](Cc1ccc(c(c1)N(=O)=O)O)C(=O)O
    Norleucine NLE C6H13NO2 131.094628656 N[C@@H](CCCC)C(=O)O
    Norvaline NVA C5H11NO2 117.078978592 N[C@@H](CCC)C(=O)O
    o-acetylserine OAS C5H9NO4 147.053157768 N[C@@H](COC(=O)C)C(=O)O
    (2s)-2-amino-4,4-difluorobutanoic-acid OBF C4H7F2NO2 139.044484904 N[C@@H](CC(F)F)C(=O)O
    s-(2-hydroxyethyl)-l-cysteine OCY C5H11NO3S 165.045964212 N[C@@H](CSCCO)C(=O)O
    o-methyl-l-threonine OLT C5H11NO3 133.073893212 N[C@@H]([C@H](OC)C)C(=O)O
    Methionine-sulfone OMT C5H11NO4S 181.040878832 N[C@@H](CCS(=O)(=O)C)C(=O)O
    (betar)-beta-hydroxy-l-tyrosine OMX C9H11NO4 197.068807832 N[C@@H]([C@@H](c1ccc(cc1)O)O)C(=O)O
    (betar)-3-chloro-beta-hydroxy-l-tyrosine OMY C9H10ClNO4 231.02983548 N[C@@H]([C@@H](c1ccc(c(c1)Cl)O)O)C(=O)O
    5-oxo-l-norleucine ONL C6H11NO3 145.073893212 N[C@@H](CCC(=O)C)C(=O)O
    Ornithine ORN C5H12N2O2 132.089877624 N[C@@H](CCCN)C(=O)O
    o-Tyrosine OTYR C9H11NO3 181.073893212 N[C@@H](Cc1ccccc1O)C(=O)O
    4-benzoyl-phenylalanine PBF C16H15NO3 269.10519334 N[C@@H](Cc1ccc(cc1)C(=O)c1ccccc1)C(=O)O
    pentafluoro-phenylalanine PF5 C9H6F5NO2 255.031869532 N[C@@H](Cc1c(F)c(F)c(c(c1F)F)F)C(=O)O
    4-Fluoro-Phenylalanine PFF C9H10FNO2 183.06955678 N[C@@H](Cc1ccc(cc1)F)C(=O)O
    4-Iodo-Phenylalanine PHI C9H10INO2 290.97562656 N[C@@H](Cc1ccc(cc1)I)C(=O)O
    4-Nitro-phenylalanine PPN C9H10N2O4 210.0640568 N[C@@H](Cc1ccc(cc1)N(=O)=O)C(=O)O
    phosphotyrosine PTR C9H12NO6P 261.04022373400005 N[C@@H](Cc1ccc(cc1)OP(=O)(O)O)C(=O)O
    3-(2-Pyridyl)-alanine PYR2 C8H10N2O2 166.07422756 N[C@@H](Cc1ccccn1)C(=O)O
    3-(3-Pyridyl)-alanine PYR3 C8H10N2O2 166.07422756 N[C@@H](Cc1cccnc1)C(=O)O
    3-(4-Pyridyl)-alanine PYR4 C8H10N2O2 166.07422756 N[C@@H](Cc1ccncc1)C(=O)O
    3-(1-Pyrazolyl)-alanine PYZ1 C6H9N3O2 155.069476528 N[C@@H](Cn1cccn1)C(=O)O
    3-(2-Quinolyl)-alanine QU32 C12H12N2O2 216.089877624 N[C@@H](Cc1ccc2c(n1)cccc2)C(=O)O
    3-(3-quinolyl)-alanine QU33 C12H12N2O2 216.089877624 N[C@@H](Cc1cnc2c(c1)cccc2)C(=O)O
    3-(4-quinolyl)-alanine QU34 C12H12N2O2 216.089877624 N[C@@H](Cc1ccnc2c1cccc2)C(=O)O
    3-(5-Quinolyl)-alanine QU35 C12H12N2O2 216.089877624 N[C@@H](Cc1ccc2c(c1)nccc2)C(=O)O
    3-(6-Quinolyl)-alanine QU36 C12H12N2O2 216.089877624 N[C@@H](Cc1ccc2c(c1)cncc2)C(=O)O
    3-(2-quinoxalyl)-alanine QX32 C11H11N3O2 217.085126592 N[C@@H](Cc1cnc2c(n1)cccc2)C(=O)O
    phosphoserine SEP C3H8NO6P 185.008923606 N[C@@H](COP(=O)(O)O)C(=O)O
    thialysine SLZ C5H12N2O2S 164.061948624 N[C@@H](CSCCN)C(=O)O
    Methionine-sulfoxide SME C5H11NO3S 165.045964212 N[C@@H](CC[S@](=O)C)C(=O)O
    Styrylalanine STYA C11H13NO2 191.094628656 N[C@@H](CC=Cc1ccccc1)C(=O)O
    2s,4r-4-methylglutamate SYM C6H11NO4 161.068807832 N[C@@H](C[C@H](C(=O)O)C)C(=O)O
    4-tert-butyl-phenylalanine TBP4 C13H19NO2 221.141578848 N[C@@H](Cc1ccc(cc1)C(C)(C)C)C(=O)O
    3-(2-Tetrazolyl)-alanine TEZA C4H7N5O2 157.059974464 N[C@@H](Cn1nncn1)C(=O)O
    2-(Trifluoromethyl)-phenylglycine TFG2 C9H8F3NO2 219.050713156 N[C@@H](c1ccccc1C(F)(F)F)C(=O)O
    3-(Trifluoromethyl)-phenylglycine TFG3 C9H8F3NO2 219.050713156 N[C@@H](c1cccc(c1)C(F)(F)F)C(=O)O
    4-(Trifluoromethyl)-phenylglycine TFG4 C9H8F3NO2 219.050713156 N[C@@H](c1ccc(cc1)C(F)(F)F)C(=O)O
    5,5,5-Trifluoro-leucine TFLE C6H10F3NO2 185.06636322 N[C@@H](C[C@@H](C(F)(F)F)C)C(=O)O
    2-(Trifluoromethyl)-phenylalanine TFP2 C10H10F3NO2 233.06636322 N[C@@H](Cc1ccccc1C(F)(F)F)C(=O)O
    3-(Trifluoromethyl)-phenylalanine TFP3 C10H10F3NO2 233.06636322 N[C@@H](Cc1cccc(c1)C(F)(F)F)C(=O)O
    4-(Trifluoromethyl)-phenylalanine TFP4 C10H10F3NO2 233.06636322 N[C@@H](Cc1ccc(cc1)C(F)(F)F)C(=O)O
    4-hydroxy-l-threonine TH6 C4H9NO4 135.053157768 N[C@@H]([C@H](O)CO)C(=O)O
    3-(3-thienyl)-alanine THA3 C7H9NO2S 171.035399528 N[C@@H](Cc1cscc1)C(=O)O
    2-thienylglycine THG2 C6H7NO2S 157.019749464 N[C@@H](c1cccs1)C(=O)O
    3-thienylglycine THG3 C6H7NO2S 157.019749464 N[C@@H](c1cscc1)C(=O)O
    Thio-citrulline THIC C6H13N3O2S 191.072847656 N[C@@H](CCCNC(=S)N)C(=O)O
    3-(2-thienyl)-alanine TIH C7H9NO2S 171.035399528 N[C@@H](Cc1cccs1)C(=O)O
    phosphothreonine TPO C4H10NO6P 199.02457367 N[C@@H]([C@H](OP(=O)(O)O)C)C(=O)O
    2-hydroxy-tryptophan TRO C11H12N2O3 220.084792244 N[C@@H](Cc1c(O)[nH]c2c1cccc2)C(=O)O
    6-hydroxy-tryptophan TRX C11H12N2O3 220.084792244 N[C@@H](Cc1c[nH]c2c1ccc(c2)O)C(=O)O
    3-(1,2,4-Triazol-1-yl)-alanine TRZ4 C5H8N4O2 156.064725496 N[C@@H](Cn1cncn1)C(=O)O
    6-amino-7-hydroxy-l-tryptophan TTQ C11H13N3O3 235.095691276 N[C@@H](Cc1c[nH]c2c1ccc(c2O)N)C(=O)O
    3-Amino-L-tyrosine TY2 C9H12N2O3 196.084792244 N[C@@H](Cc1ccc(c(c1)N)O)C(=O)O
    3,5-diiodotyrosine TYI C9H9I2NO3 432.8671891479999 N[C@@H](Cc1cc(I)c(c(c1)I)O)C(=O)O
    3-amino-6-hydroxy-tyrosine TYQ C9H12N2O4 212.079706864 N[C@@H](Cc1cc(N)c(cc1O)O)C(=O)O
    (4-thiazolyl)-alanine TZA4 C6H8N2O2S 172.030648496 N[C@@H](Cc1cscn1)C(=O)O
    2-Aminoadipic-acid UN1 C6H11NO4 161.068807832 N[C@@H](CCCC(=O)O)C(=O)O
    Hydroxynorvaline VAH C5H11NO3 133.073893212 N[C@@H]([C@H](O)CC)C(=O)O
    3,5-Difluoro-phenylalanine WFP C9H9F2NO2 201.060134968 N[C@@H](Cc1cc(F)cc(c1)F)C(=O)O
    cysteine-s-acetamide YCM C5H10N2O3S 178.04121318 N[C@@H](CSCC(=O)N)C(=O)O
    3-fluorotyrosine YOF C9H10FNO3 199.0644714 N[C@@H](Cc1ccc(c(c1)F)O)C(=O)O
    d-Phenylglycine DPG C8H9NO2 151.063328528 N[C@H](c1ccccc1)C(=O)O
    d-4-methoxy-Phenylalanine D0A1 C10H13NO3 195.089543276 N[C@H](Cc1ccc(OC)cc1)C(=O)O
    d-7-hydroxy-l-tryptophan D0AF C11H12N2O3 220.084792244 N[C@H](Cc1c[nH]c2c1cccc2O)C(=O)O
    d-4-carbamimidoyl-l-phenylalanine D0BN C10H13N3O2 207.100776656 N[C@H](Cc1ccc(cc1)C(=N)N)C(=O)O
    d-4-chloro-Phenylalanine D200 C9H10ClNO2 199.04000624 N[C@H](Cc1ccc(cc1)Cl)C(=O)O
    d-2-Allyl-glycine D2AG C7H11NO5 189.063722452 N[C@H](CCCC(C(=O)O)=O)C(=O)O
    d-3-methyl-aspartic-acid D2AS C5H9NO4 147.053157768 N[C@@H]([C@H](C)(C(=O)O))C(=O)O
    d-s-(difluoromethyl)-homocysteine D2FM C5H9F2NO2S 185.032205968 N[C@H](CCSC(F)F)C(=O)O
    d-2-fluoro-l-histidine D2HF C6H12FN3O2 177.091354844 N[C@H](C[C@@H]1CN[C@@H](N1)F)C(=O)O
    d-2-fluoro-l-histidine(1) D2H1 C6H8FN3O2 173.060054716 N[C@H](Cc1cnc(F)N1)C(=O)O
    d-2-fluoro-l-histidine(2) D2H2 C6H8FN3O2 173.060054716 N[C@H](Cc1c[nH]c(n1)F)C(=O)O
    d-l-2-amino-6-methylene-pimelic-acid D2NP C8H13NO4 187.084457896 N[C@H](CCCC(=C)C(=O)O)C(=O)O
    d-3-(4H-thieno[3,2-b]pyrrol-6-yl)-L-alanine D32T C9H10N2O2S 210.04629856 N[C@@H](Cc1c[nH]c2c1scc2)C(=O)O
    d-3-cyano-phenylalanine D3CF C10H10N2O2 190.07422756 N[C@H](Cc1cccc(C#N)c1)C(=O)O
    d-(2s)-amino(3,5-dihydroxyphenyl)-ethanoic-acid D3FG C8H9NO4 183.053157768 N[C@H](c1cc(O)cc(c1)O)C(=O)O
    d-4-hydroxy-glutamic-acid D3GL C5H9NO5 163.048072388 N[C@H](C[C@@H](C(=O)O)O)C(=O)O
    d-3-Chloro-tyrosine D3MY C9H10ClNO3 215.03492086 N[C@@H](Cc1ccc(c(c1)Cl)O)C(=O)O
    d-4-Bromo-phenylalanine D4BF C9H10BrNO2 242.98949066 N[C@H](Cc1ccc(cc1)Br)C(=O)O
    d-4-cyano-phenylalanine D4CF C10H10N2O2 190.07422756 N[C@H](Cc1ccc(cc1)C#N)C(=O)O
    d-nitrilo-l-methionine D4CY C5H8N2O2S 160.030648496 N[C@H](CCSC#N)C(=O)O
    d-4-fluoro-tryptophan D4FW C11H11FN2O2 222.080455812 N[C@H](Cc1c[nH]c2c1c(F)ccc2)C(=O)O
    d-4-hydroxymethyl-phenylalanine D4HZ C10H13NO3 195.089543276 N[C@H](Cc1ccc(CO)cc1)C(=O)O
    d-4-hydroxy-tryptophan D4HT C11H12N2O3 220.084792244 N[C@H](Cc1c[nH]c2c1c(O)ccc2)C(=O)O
    d-4-amino-l-tryptophan D4IN C11H13N3O2 219.100776656 N[C@H](Cc1c[nH]c2c1c(N)ccc2)C(=O)O
    d-4-methyl-phenylalanine D4PH C10H13NO2 179.094628656 N[C@H](Cc1ccc(cc1)C)C(=O)O
    d-6-carboxylysine D6CL C7H14N2O4 190.095356928 N[C@H](CCC[C@H](C(=O)O)N)C(=O)O
    d-6-chloro-l-tryptophan D6CW C11H11ClN2O2 238.050905272 N[C@H](Cc1c[nH]c2c1ccc(c2)Cl)C(=O)O
    d-2-amino-5-hydroxypentanoic-acid DAA4 C5H11NO3 133.073893212 N[C@H](CCCO)C(=O)O
    d-2-Aminobutyric-acid DABA C4H9NO2 103.063328528 N[C@H](CC)C(=O)O
    d-cis-amiclenomycin DACZ C10H16N2O2 196.121177752 N[C@H](CC[C@@H]1C=C[C@@H](C=C1)N)C(=O)O
    d-Adamanthane DADM C13H21NO2 223.157228912 N[C@H](C[C@]12C[C@H]3C[C@@H](C2)C[C@@H](C1)C3)C(=O)O
    d-5-methyl-arginine DAGM C7H16N4O2 188.127325752 N[C@H](CC[C@H](C)NC(=N)N)C(=O)O
    d-beta-hydroxyasparagine DAHB C4H8N2O4 148.048406736 N[C@H]([C@@H](C(=O)N)O)C(=O)O
    d-2-Aminoheptanoic-acid DAHP C7H15NO2 145.11027872 N[C@H](CCCCC)C(=O)O
    d-3-cyclohexyl-alanine DALC C9H17NO2 171.125928784 N[C@H](CC1CCCCC1)C(=O)O
    d-1-Naphthyl-alanine DALN C13H13NO2 215.094628656 N[C@H](Cc1cccc2c1cccc2)C(=O)O
    d-Allo-threonine DALO C4H9NO3 119.058243148 N[C@H]([C@H](C)O)C(=O)O
    d-3-(9-anthryl)-alanine DNTL C17H15NO2 265.11027872 N[C@H](Cc1c2ccccc2cc2c1cccc2)C(=O)O
    d-3-Methyl-phenylalanine DAPD C10H13NO2 179.094628656 N[C@H](Cc1cccc(c1)C)C(=O)O
    d-m-amidinophenyl-3-alanine DAPM C10H13N3O2 207.100776656 N[C@H](Cc1cccc(c1)C(=N)N)C(=O)O
    d-c-gamma-hydroxy-arginine DARO C6H14N4O3 190.106590308 N[C@H](C[C@@H](O)CN=C(N)N)C(=O)O
    d-(2r)-2-amino-4-oxobutanoic-acid DAS2 C4H7NO3 117.042593084 N[C@H](CC=O)C(=O)O
    d-azido-alanine DZDA C3H7N4O2+ 131.05635188409 N[C@H](CN=[N+]=N)C(=O)O
    d-Phenylserine DBB8 C9H11NO3 181.073893212 N[C@H]([C@@H](O)c1ccccc1)C(=O)O
    d-benzylcysteine DBCS C10H13NO2S 211.066699656 N[C@H](CSCc1ccccc1)C(=O)O
    d-beta-hydroxyaspartic-acid DBHD C4H7NO5 149.032422324 N[C@H]([C@H](O)C(=O)O)C(=O)O
    d-4,4-biphenylalanine DBIF C15H15NO2 241.11027872 N[C@H](Cc1ccc(cc1)c1ccccc1)C(=O)O
    d-5-bromo-l-isoleucine DBIU C6H12BrNO2 209.005140724 N[C@H]([C@@H](C)CCBr)C(=O)O
    d-3-(3-benzothienyl)-alanine DTH9 C11H11NO2S 221.051049592 N[C@H](Cc1csc2c1cccc2)C(=O)O
    d-6-bromo-tryptophan DBTR C11H11BrN2O2 282.000389692 N[C@H](Cc1c[nH]c2c1ccc(c2)Br)C(=O)O
    d-Tertleucine DBUG C6H13NO2 131.094628656 N[C@H](C(C)(C)C)C(=O)O
    d-3-chloro-l-alanine DC2N C3H6ClNO2 123.008706112 N[C@H](CCl)C(=O)O
    d-canaline DCAN C4H10N2O3 134.06914218 N[C@H](CCON)C(=O)O
    d-carboxymethylated-cysteine DCCS C5H9NO4S 179.025228768 N[C@H](CSCC(=O)O)C(=O)O
    d-Cyclohexylglycine DCHG C8H15NO2 157.11027872 N[C@H](C1CCCCC1)C(=O)O
    d-3-chloro-4-hydroxy-phenylglycine DCHP C8H8ClNO3 201.019270796 N[C@H](c1ccc(c(c1)Cl)O)C(=O)O
    d-Citrulline DCIR C6H13N3O3 175.095691276 N[C@H](CCC[NH]C(=O)N)C(=O)O
    d-2-cyano-phenylalanine D2CF C10H10N2O2 190.07422756 N[C@H](Cc1ccccc1C#N)C(=O)O
    d-2,4-dichloro-phenylalanine D24E C9H9Cl2NO2 233.001033888 N[C@H](Cc1ccc(cc1Cl)Cl)C(=O)O
    d-3,4-dichloro-phenylalanine D34E C9H9Cl2NO2 233.001033888 N[C@H](Cc1ccc(c(c1)Cl)Cl)C(=O)O
    d-3-Cyclopentyl-alanine DCPE C8H15NO2 157.11027872 N[C@H](CC1CCCC1)C(=O)O
    d-2-Chloro-phenylglycine DCG6 C8H8ClNO2 185.024356176 N[C@H](c1ccccc1Cl)C(=O)O
    d-3-Chloro-phenylglycine DCG5 C8H8ClNO2 185.024356176 N[C@H](c1cccc(c1)Cl)C(=O)O
    d-4-Chloro-phenylglycine DCGD C8H8ClNO2 185.024356176 N[C@H](c1ccc(cc1)Cl)C(=O)O
    d-2-chloro-Phenylalanine DCF6 C9H10ClNO2 199.04000624 N[C@H](Cc1ccccc1Cl)C(=O)O
    d-s-acetonylcysteine DCSA C6H11NO3S 177.045964212 N[C@H](CSCC(=O)C)C(=O)O
    d-Selenocysteine DCSE C3H7NO2Se 168.964199764 N[C@H](C[SeH])C(=O)O
    d-7-chloro-tryptophan DCTE C11H11ClN2O2 238.050905272 N[C@H](Cc1cNc2c1cccc2Cl)C(=O)O
    d-4-chloro-threonine DCTH C4H8ClNO3 153.019270796 N[C@H]([C@H](O)CCl)C(=O)O
    d-4-Hydroxy-phenylglycine DD4P C8H9NO3 167.058243148 N[C@H](c1ccc(cc1)O)C(=O)O
    d-Diaminobutyric-acid DDAB C4H10N2O2 118.07422756 N[C@H](CCN)C(=O)O
    d-3,4-Dihydroxy-phenylalanine DDAH C9H11NO4 197.068807832 N[C@H](Cc1ccc(c(c1)O)O)C(=O)O
    d-3,5-dibromotyrosine DDBY C9H9Br2NO3 336.894917348 N[C@H](Cc1cc(Br)c(c(c1)Br)O)C(=O)O
    d-3,3-dihydroxy-alanine DDDZ C3H7NO4 121.037507704 N[C@H](C(O)O)C(=O)O
    d-Diethylalanine D2EL C7H15NO2 145.11027872 N[C@H](C(CC)CC)C(=O)O
    d-3,3-diphenylalanine D2F1 C15H15NO2 241.11027872 N[C@H]([C@H](c1ccccc1)c1ccccc1)C(=O)O
    d-3,3-dimethyl-aspartic-acid DDMK C6H11NO4 161.068807832 N[C@H](C(C(=O)O)(C)C)C(=O)O
    d-3-ethyl-phenylalanine DDF4 C11H15NO2 193.11027872 N[C@H](Cc1cc(CC)ccc1)C(=O)O
    d-2,3-Diaminopropanoic-acid DDPP C3H8N2O2 104.058577496 N[C@H](CN)C(=O)O
    d-Ethionine DESC C6H13NO2S 163.066699656 N[C@H](CCSCC)C(=O)O
    d-3,4-Difluoro-phenylalanine DF2F C9H9F2NO2 201.060134968 N[C@H](Cc1ccc(c(c1)F)F)C(=O)O
    d-3-chloro-Phenylalanine DFCL C9H10ClNO2 199.04000624 N[C@H](Cc1cccc(c1)Cl)C(=O)O
    d-4-Fluoro-glutamic-acid D4FG C5H8FNO4 165.043735956 N[C@H](C[C@H](F)C(=O)O)C(=O)O
    d-Trifluoro-alanine DFLA C3H4F3NO2 143.019413028 N[C@H](C(F)(F)F)C(=O)O
    d-2-Fluoro-phenylglycine DFP6 C8H8FNO2 169.053906716 N[C@H](c1ccccc1F)C(=O)O
    d-3-Fluoro-phenylglycine DFP7 C8H8FNO2 169.053906716 N[C@H](c1cccc(c1)F)C(=O)O
    d-4-Fluoro-phenylglycine DFP8 C8H8FNO2 169.053906716 N[C@H](c1ccc(cc1)F)C(=O)O
    d-2-Fluoro-Phenylalanine DFF2 C9H10FNO2 183.06955678 N[C@H](Cc1ccccc1F)C(=O)O
    d-3-Fluoro-Phenylalanine DFF3 C9H10FNO2 183.06955678 N[C@H](Cc1cccc(c1)F)C(=O)O
    d-6-fluoro-l-tryptophan DFT6 C11H11FN2O2 222.080455812 N[C@H](Cc1cNc2c1ccc(c2)F)C(=O)O
    d-5-Fluoro-tryptophan DFTR C11H11FN2O2 222.080455812 N[C@H](Cc1c[nH]c2c1cc(F)cc2)C(=O)O
    d-(2-furyl)-alanine DFUO C7H9NO3 155.058243148 N[C@H](Cc1ccco1)C(=O)O
    d-3-Fluoro-valine DFVL C5H10FNO2 135.06955678 N[C@H](C(F)(C)C)C(=O)O
    d-2-Amino-4-guanidinobutryric-acid DGBT C5H14N4O2 162.111675688 N[C@H](CCNC(N)N)C(=O)O
    d-2-Amino-3-guanidinopropionic-acid DGPA C4H12N4O2 148.096025624 N[C@H](CNC(N)N)C(=O)O
    d-Canavanine DGGB C5H12N4O3 176.090940244 N[C@H](CCON=C(N)N)C(=O)O
    d-(2s,4s)-2,5-diamino-4-hydroxy-5-oxopentanoic-acid DGHG C5H10N2O4 162.0640568 N[C@H](C[C@H](O)C(=O)N)C(=O)O
    d-5-o-methyl-glutamic-acid DGME C6H11NO4 161.068807832 N[C@H](CCC(=O)OC)C(=O)O
    d-homocysteine DHCS C4H9NO2S 135.035399528 N[C@H](CCS)C(=O)O
    d-glutamine-hydroxamate DHGA C5H10N2O4 162.0640568 N[C@H](CCC(=O)NO)C(=O)O
    d-(2s)-2,8-diaminooctanoic-acid DHHK C8H18N2O2 174.136827816 N[C@H](CCCCCCN)C(=O)O
    d-4-Hydroxy-L-isoleucine DHIL C6H13NO3 147.089543276 N[C@H]([C@H]([C@@H](C)O)C)C(=O)O
    d-(2s,3r)-2-amino-3-hydroxy-4-methylpentanoic-acid DHL2 C6H13NO3 147.089543276 N[C@H]([C@H](O)C(C)C)C(=O)O
    d-Homoleucine DHL1 C7H15NO2 145.11027872 N[C@H](CCC(C)C)C(=O)O
    d-beta-hydroxyleucine DHLU C6H13NO3 147.089543276 N[C@H]([C@@H](O)C(C)C)C(=O)O
    d-4-amino-L-phenylalanine DHOX C9H12N2O2 180.089877624 N[C@H](Cc1ccc(cc1)N)C(=O)O
    d-Homophenylalanine DHPE C10H13NO2 179.094628656 N[C@H](CCc1ccccc1)C(=O)O
    d-3-(8-hydroxyquinolin-3-yl)-l-alanine DHQA C12H12N2O3 232.084792244 N[C@H](Cc1cnc2c(c1)cccc2O)C(=O)O
    d-homoarginine DHRG C7H18N4O2 190.142975816 N[C@H](CCCCNC(N)N)C(=O)O
    d-5-Hydroxy-tryptophan DHRP C11H12N2O3 220.084792244 N[C@H](Cc1cNc2c1cc(O)cc2)C(=O)O
    d-homoserine DHSE C4H9NO3 119.058243148 N[C@H](CCO)C(=O)O
    d-beta-hydroxy-tryptophane DHTR C11H12N2O3 220.084792244 N[C@H]([C@H](c1c[nH]c2c1cccc2)O)C(=O)O
    d-3-hydroxy-l-valine DHVA C5H11NO3 133.073893212 N[C@H](C(O)(C)C)C(=O)O
    d-3-methyl-l-alloisoleucine DI2M C7H15NO2 145.11027872 N[C@H](C(CC)(C)C)C(=O)O
    d-alpha-amino-2-indanacetic-acid DIGL C11H13NO2 191.094628656 N[C@H](C1Cc2c(C1)cccc2)C(=O)O
    d-Allo-Isoleucine DIIL C6H13NO2 131.094628656 N[C@H]([C@@H](CC)C)C(=O)O
    d-4,5-dihydroxy-isoleucine DILX C6H13NO4 163.084457896 N[C@H]([C@H]([C@H](CO)O)C)C(=O)O
    d-3-iodo-tyrosine DIYR C9H10INO3 306.97054117999994 N[C@H](Cc1ccc(c(c1)I)O)C(=O)O
    d-kynurenine DKYN C10H12N2O3 208.084792244 N[C@H](CC(=O)c1ccccc1N)C(=O)O
    d-6-hydroxy-l-norleucine DLDO C6H13NO3 147.089543276 N[C@H](CCCCO)C(=O)O
    d-Penicillamine DLE1 C5H11NO2S 149.051049592 N[C@H](C(S)(C)C)C(=O)O
    d-(4r)-5-oxo-l-leucine DLED C6H11NO3 145.073893212 N[C@H](C[C@@H](C)C=O)C(=O)O
    d-(4s)-5-fluoro-l-leucine DLEF C6H12FNO2 149.085206844 N[C@H](C[C@H](C)CF)C(=O)O
    d-(3r)-3-methyl-l-glutamic-acid DLME C6H11NO4 161.068807832 N[C@H]([C@H](C)CC(O)=O)C(=O)O
    d-3-methyl-l-glutamine DLMQ C6H12N2O3 160.084792244 N[C@H]([C@@H](C)CC(N)=O)C(=O)O
    d-vinylglycine DLVG C4H7NO2 101.047678464 N[C@H](C=C)C(=O)O
    d-4-oxo-l-valine DLVN C5H9NO3 131.058243148 N[C@H]([C@H](C)C=O)C(=O)O
    d-3,3-dimethyl-methionine-sulfoxide DM2S C7H15NO3S 193.07726434 N[C@H](C(C)(C)C[S@@](C)=O)C(=O)O
    d-hydroxy-l-methionine DME0 C5H11NO3S 165.045964212 N[C@H](CCSCO)C(=O)O
    d-(3s)-3-methyl-l-glutamic-acid DMEG C6H11NO4 161.068807832 N[C@H]([C@@H](C)CC(=O)O)C(=O)O
    d-n-methyl-asparagine DMEN C5H10N2O3 146.06914218 N[C@H](CC(=O)NC)C(=O)O
    d-n5-methyl-glutamine DMEQ C6H12N2O3 160.084792244 N[C@H](CCC(=O)NC)C(=O)O
    d-s-oxymethionine DMHO C5H11NO3S 165.045964212 N[C@H](CC[S@](=O)C)C(=O)O
    d-5-Methoxy-tryptophan D5XW C12H14N2O3 234.100442308 N[C@H](Cc1cNc2ccc(OC)cc12)C(=O)O
    d-3,4-Dimethyl-phenylalanine DM34 C11H15NO2 193.11027872 N[C@H](Cc1ccc(c(c1)C)C)C(=O)O
    d-2-Methyl-phenylalanine D2MF C10H13NO2 179.094628656 N[C@H](Cc1ccccc1C)C(=O)O
    d-5-Methyl-tryptophan D5MW C12H14N2O2 218.105527688 N[C@H](Cc1cNc2ccc(C)cc12)C(=O)O
    d-6-Methyl-tryptophan D6MW C12H14N2O2 218.105527688 N[C@H](Cc1cNc2c1ccc(c2)C)C(=O)O
    d-m-Tyrosine DMTY C9H11NO3 181.073893212 N[C@H](Cc1cccc(c1)O)C(=O)O
    d-2-Naphthyl-alanine DNAL C13H13NO2 215.094628656 N[C@H](Cc1ccc2c(c1)cccc2)C(=O)O
    d-5-hydroxy-1-naphthalene D51N C13H13NO3 231.089543276 N[C@H](Cc1cccc2c1cc(O)cc2)C(=O)O
    d-6-hydroxy-2-naphthalene D62N C13H13NO3 231.089543276 N[C@H](Cc1ccc2c(c1)cc(cc2)O)C(=O)O
    d-meta-nitro-tyrosine DNIY C9H10N2O5 226.05897142 N[C@H](Cc1ccc(c(c1)N(=O)=O)O)C(=O)O
    d-Norleucine DNLE C6H13NO2 131.094628656 N[C@H](CCCC)C(=O)O
    d-Norvaline DNVA C5H11NO2 117.078978592 N[C@H](CCC)C(=O)O
    d-o-acetylserine DOAS C5H9NO4 147.053157768 N[C@H](COC(=O)C)C(=O)O
    d-(2s)-2-amino-4,4-difluorobutanoic-acid DOBF C4H7F2NO2 139.044484904 N[C@H](CC(F)F)C(=O)O
    d-s-(2-hydroxyethyl)-l-cysteine DOCY C5H11NO3S 165.045964212 N[C@H](CSCCO)C(=O)O
    d-o-methyl-l-threonine DOLT C5H11NO3 133.073893212 N[C@H]([C@H](OC)C)C(=O)O
    d-Methionine-sulfone DOMT C5H11NO4S 181.040878832 N[C@H](CCS(=O)(=O)C)C(=O)O
    d-(betar)-beta-hydroxy-l-tyrosine DOMX C9H11NO4 197.068807832 N[C@H]([C@@H](c1ccc(cc1)O)O)C(=O)O
    d-(betar)-3-chloro-beta-hydroxy-l-tyrosine DOMY C9H10ClNO4 231.02983548 N[C@H]([C@@H](c1ccc(c(c1)Cl)O)O)C(=O)O
    d-5-oxo-l-norleucine DONL C6H11NO3 145.073893212 N[C@H](CCC(=O)C)C(=O)O
    d-Ornithine DORN C5H12N2O2 132.089877624 N[C@H](CCCN)C(=O)O
    d-o-Tyrosine D2TR C9H11NO3 181.073893212 N[C@H](Cc1ccccc1O)C(=O)O
    d-4-benzoyl-phenylalanine DPBF C16H15NO3 269.10519334 N[C@H](Cc1ccc(cc1)C(=O)c1ccccc1)C(=O)O
    d-pentafluoro-phenylalanine DPF5 C9H6F5NO2 255.031869532 N[C@H](Cc1c(F)c(F)c(c(c1F)F)F)C(=O)O
    d-4-Fluoro-Phenylalanine DPFF C9H10FNO2 183.06955678 N[C@H](Cc1ccc(cc1)F)C(=O)O
    d-4-Iodo-Phenylalanine DPHI C9H10INO2 290.97562656 N[C@H](Cc1ccc(cc1)I)C(=O)O
    d-4-Nitro-phenylalanine DPPN C9H10N2O4 210.0640568 N[C@H](Cc1ccc(cc1)N(=O)=O)C(=O)O
    d-phosphotyrosine DPTR C9H12NO6P 261.04022373400005 N[C@H](Cc1ccc(cc1)OP(=O)(O)O)C(=O)O
    d-3-(2-Pyridyl)-alanine DY23 C8H10N2O2 166.07422756 N[C@H](Cc1ccccn1)C(=O)O
    d-3-(3-Pyridyl)-alanine DY33 C8H10N2O2 166.07422756 N[C@H](Cc1cccnc1)C(=O)O
    d-3-(4-Pyridyl)-alanine DY34 C8H10N2O2 166.07422756 N[C@H](Cc1ccncc1)C(=O)O
    d-3-(1-Pyrazolyl)-alanine DPZ4 C6H9N3O2 155.069476528 N[C@H](Cn1cccn1)C(=O)O
    d-3-(2-Quinolyl)-alanine DQ32 C12H12N2O2 216.089877624 N[C@H](Cc1ccc2c(n1)cccc2)C(=O)O
    d-3-(3-quinolyl)-alanine DQ33 C12H12N2O2 216.089877624 N[C@H](Cc1cnc2c(c1)cccc2)C(=O)O
    d-3-(4-quinolyl)-alanine DQ34 C12H12N2O2 216.089877624 N[C@H](Cc1ccnc2c1cccc2)C(=O)O
    d-3-(5-Quinolyl)-alanine DQ35 C12H12N2O2 216.089877624 N[C@H](Cc1ccc2c(c1)nccc2)C(=O)O
    d-3-(6-Quinolyl)-alanine DQ36 C12H12N2O2 216.089877624 N[C@H](Cc1ccc2c(c1)cncc2)C(=O)O
    d-3-(2-quinoxalyl)-alanine DQX3 C11H11N3O2 217.085126592 N[C@H](Cc1cnc2c(n1)cccc2)C(=O)O
    d-phosphoserine DSEP C3H8NO6P 185.008923606 N[C@H](COP(=O)(O)O)C(=O)O
    d-thialysine DSLZ C5H12N2O2S 164.061948624 N[C@H](CSCCN)C(=O)O
    d-Methionine-sulfoxide DSME C5H11NO3S 165.045964212 N[C@H](CC[S@](=O)C)C(=O)O
    d-Styrylalanine DSYA C11H13NO2 191.094628656 N[C@H](CC=Cc1ccccc1)C(=O)O
    d-2s,4r-4-methylglutamate DSYM C6H11NO4 161.068807832 N[C@H](C[C@H](C(=O)O)C)C(=O)O
    d-4-tert-butyl-phenylalanine D4TF C13H19NO2 221.141578848 N[C@H](Cc1ccc(cc1)C(C)(C)C)C(=O)O
    d-3-(2-Tetrazolyl)-alanine DTEZ C4H7N5O2 157.059974464 N[C@H](Cn1nncn1)C(=O)O
    d-2-(Trifluoromethyl)-phenylglycine D2TG C9H8F3NO2 219.050713156 N[C@H](c1ccccc1C(F)(F)F)C(=O)O
    d-3-(Trifluoromethyl)-phenylglycine D3TG C9H8F3NO2 219.050713156 N[C@H](c1cccc(c1)C(F)(F)F)C(=O)O
    d-4-(Trifluoromethyl)-phenylglycine D4TG C9H8F3NO2 219.050713156 N[C@H](c1ccc(cc1)C(F)(F)F)C(=O)O
    d-5,5,5-Trifluoro-leucine DTFL C6H10F3NO2 185.06636322 N[C@H](C[C@@H](C(F)(F)F)C)C(=O)O
    d-2-(Trifluoromethyl)-phenylalanine D2TF C10H10F3NO2 233.06636322 N[C@H](Cc1ccccc1C(F)(F)F)C(=O)O
    d-3-(Trifluoromethyl)-phenylalanine D3TF C10H10F3NO2 233.06636322 N[C@H](Cc1cccc(c1)C(F)(F)F)C(=O)O
    d-4-(Trifluoromethyl)-phenylalanine D4TM C10H10F3NO2 233.06636322 N[C@H](Cc1ccc(cc1)C(F)(F)F)C(=O)O
    d-4-hydroxy-l-threonine DTH6 C4H9NO4 135.053157768 N[C@H]([C@H](O)CO)C(=O)O
    d-3-(3-thienyl)-alanine D3TA C7H9NO2S 171.035399528 N[C@H](Cc1cscc1)C(=O)O
    d-2-thienylglycine D2TH C6H7NO2S 157.019749464 N[C@H](c1cccs1)C(=O)O
    d-3-thienylglycine D3TH C6H7NO2S 157.019749464 N[C@H](c1cscc1)C(=O)O
    d-Thio-citrulline DTVI C6H13N3O2S 191.072847656 N[C@H](CCCNC(=S)N)C(=O)O
    d-3-(2-thienyl)-alanine DTIH C7H9NO2S 171.035399528 N[C@H](Cc1cccs1)C(=O)O
    d-phosphothreonine DTPO C4H10NO6P 199.02457367 N[C@H]([C@H](OP(=O)(O)O)C)C(=O)O
    d-2-hydroxy-tryptophan DTRO C11H12N2O3 220.084792244 N[C@H](Cc1c(O)[nH]c2c1cccc2)C(=O)O
    d-6-hydroxy-tryptophan DTRX C11H12N2O3 220.084792244 N[C@H](Cc1c[nH]c2c1ccc(c2)O)C(=O)O
    d-3-(1,2,4-Triazol-1-yl)-alanine DTZR C5H8N4O2 156.064725496 N[C@H](Cn1cncn1)C(=O)O
    d-6-amino-7-hydroxy-l-tryptophan DTTQ C11H13N3O3 235.095691276 N[C@H](Cc1c[nH]c2c1ccc(c2O)N)C(=O)O
    d-3-Amino-L-tyrosine DTY2 C9H12N2O3 196.084792244 N[C@H](Cc1ccc(c(c1)N)O)C(=O)O
    d-3,5-diiodotyrosine DTYI C9H9I2NO3 432.8671891479999 N[C@H](Cc1cc(I)c(c(c1)I)O)C(=O)O
    d-3-amino-6-hydroxy-tyrosine DTYQ C9H12N2O4 212.079706864 N[C@H](Cc1cc(N)c(cc1O)O)C(=O)O
    d-(4-thiazolyl)-alanine D4TH C6H8N2O2S 172.030648496 N[C@H](Cc1cscn1)C(=O)O
    d-2-Aminoadipic-acid DUN1 C6H11NO4 161.068807832 N[C@H](CCCC(=O)O)C(=O)O
    d-Hydroxynorvaline DVAH C5H11NO3 133.073893212 N[C@H]([C@H](O)CC)C(=O)O
    d-3,5-Difluoro-phenylalanine DWFP C9H9F2NO2 201.060134968 N[C@H](Cc1cc(F)cc(c1)F)C(=O)O
    d-cysteine-s-acetamide DYCM C5H10N2O3S 178.04121318 N[C@H](CSCC(=O)N)C(=O)O
    d-3-fluorotyrosine DYOF C9H10FNO3 199.0644714 N[C@H](Cc1ccc(c(c1)F)O)C(=O)O
  • Name: Protein Design (RFDiffusion3)
    Description: A diffusion-based protein structure generation model for designing binder proteins that bind to proteins, nucleic acids, or small molecules.
    Tags: undefined
    Author: David Baker
    Release: 2025-12-16 00:00:00
    Reference: De novo Design of All-atom Biomolecular Interactions with RFdiffusion3. Jasper Butcher, Rohith Krishna, Raktim Mitra, Rafael I. Brent, Yanjing Li, Nathaniel Corley, Paul Kim, Jonathan Funk, Simon Mathis, Saman Salike, Aiko Muraishi, Helen Eisenach, Tuscan Rock Thompson, Jie Chen, Yuliya Politanska, Enisha Sehgal, Brian Coventry, Odin Zhang, Bo Qiang, Kieran Didi, Max Kazman, Frank DiMaio, David Baker.

    Protein Design (RFDiffusion3)

    Introduction

    This module is designed for the de novo design of binder proteins that interact with proteins, nucleic acids, or small molecules. It is based on the RFDiffusion3 (RFD3) model, a diffusion-based protein structure generation framework. RFD3 introduces true multi-molecular co-diffusion, enabling simultaneous modeling of proteins together with their binding partners.

    Rather than generating new small-molecule or nucleic-acid sequences from scratch, RFD3 takes chemical entities (e.g., drug-like molecules or nucleic acids) as input and performs diffusion and denoising directly on all atomic coordinates. While generating protein backbone and side-chain atoms, the model simultaneously samples and optimizes the spatial configurations of non-protein molecules, thereby capturing induced-fit effects during binding.

    Compared with earlier versions (RFD1/2), RFD3 introduces several major advances:

    1. Atom-level diffusion modeling

    RFD3 performs diffusion and denoising at the individual atom level, rather than at the residue level. Backbone and side-chain atoms are modeled jointly, allowing geometric and physical constraints to be naturally expressed as conditioning signals, including:

    • Hydrogen-bond networks and donor/acceptor distributions
    • Solvent accessibility (buried vs. exposed regions)
    • Precise geometries of enzyme active sites
    • Center-of-mass and relative spatial constraints
    • Symmetry constraints (e.g., D2, C3, C5)

    2. Unified task coverage

    A key strength of RFD3 is its generality. With a single set of model parameters, it supports nearly all common protein + partner design scenarios, including:

    • Symmetric oligomers
    • Enzyme active-site scaffolding
    • Protein–small-molecule binding
    • Protein–DNA/RNA binding

    All tasks are handled within a unified all-atom diffusion framework.

    3. Faster and more powerful inference

    Despite operating at all-atom resolution, RFD3 is significantly faster than previous versions:

    • It adopts a new Transformer–U-Net hybrid architecture, with rewritten training and inference pipelines.
    • Benchmark results show that RFD3 requires roughly 1/10 the computational cost of RFdiffusion2 on the same hardware.
    • RFD3 outperforms earlier specialized models across four core tasks: protein–protein binders, protein–DNA binders, protein–small-molecule binders, and enzyme active-site design.

    This efficiency allows both finer-resolution modeling and increased sampling within the same GPU time, substantially shortening design iteration cycles.


    Parameters

    Binder

    Design binder proteins that interact with proteins, nucleic acids, or small molecules.

    Reference Structure

    The reference structure used for binder design. PDB or CIF format. May contain proteins, nucleic acids, and/or small molecules.

    Receptor Range

    Specifies which parts of the reference structure are treated as the receptor (protein or nucleic acid).
    Format: ChainID + Residue/Base UID, with multiple segments separated by commas.

    Example:

    A25-50,A70-100,A105,A108,/0,B75-108
    

    This selects residues/bases with UID 25–50, 70–100, 105, and 108 from chain A, and UID 75–108 from chain B.
    The /0 symbol indicates a chain break between chains A and B. Without it, residue B75 would be connected to the C-terminus of A108.

    Note: Residue/base numbering uses the UID as defined in the structure file. This may include non-1 starting indices, gaps, or insertion codes (e.g., A105A). Insertion codes are fully supported.
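
    The range syntax can be checked before a job is submitted. The following is a minimal Python sketch, not the module's own parser: the function name parse_ranges, the one-letter chain ID, and the regular expression are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical token pattern: one-letter chain ID, a UID, and an optional
# "-UID" end. A UID is an integer with an optional insertion code ("105A").
_TOKEN = re.compile(r"^([A-Za-z])(\d+[A-Za-z]?)(?:-(\d+[A-Za-z]?))?$")

Segment = Tuple[str, str, str]  # (chain, first UID, last UID)


def parse_ranges(spec: str) -> List[List[Segment]]:
    """Group the segments of a range string by output chain."""
    chains: List[List[Segment]] = [[]]
    for token in (part.strip() for part in spec.split(",")):
        if not token:
            continue
        if token == "/0":  # chain break: later segments start a new chain
            chains.append([])
            continue
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"unrecognised token: {token!r}")
        chain, first, last = match.groups()
        chains[-1].append((chain, first, last or first))
    return [group for group in chains if group]


# Print the segments of each output chain for the example above.
for number, segments in enumerate(parse_ranges("A25-50,A70-100,A105,A108,/0,B75-108")):
    print(number, segments)
```

    UIDs are kept as strings so that insertion codes such as 105A survive unchanged; mapping them to actual residues still requires the structure file.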

    Ligand

    When designing a small-molecule binder, specify the ligand name from the reference structure.

    Length of Binder

    Defines the length of the binder protein. Can be a fixed length or a range:

    • 20: binder length is exactly 20 residues
    • 20-50: binder length ranges from 20 to 50 residues; the exact length is determined by the final design

    Initial Binder

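    Specifies an initial binder fragment taken from the reference structure. The model keeps this fragment unchanged and extends it until the length or range set in Length of Binder is reached. The position of X relative to the fragment gives the direction of extension.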

    Examples:

    • X,B1-10: extend from the N-terminus of residue B1
    • B1-10,X: extend from the C-terminus of residue B10
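
    How the position of X maps to a terminus can be sketched in a few lines of Python. The function name parse_initial_binder and the restriction to a single fragment are assumptions for illustration; the module may accept other forms.

```python
from typing import Tuple


def parse_initial_binder(spec: str) -> Tuple[str, str]:
    """Return (fragment, terminus), where terminus is the end that is extended."""
    parts = [part.strip() for part in spec.split(",")]
    if len(parts) != 2 or parts.count("X") != 1:
        raise ValueError("expected one fragment and one X, e.g. 'X,B1-10'")
    if parts[0] == "X":
        return parts[1], "N"  # X before the fragment: grow from the N-terminus
    return parts[0], "C"      # X after the fragment: grow from the C-terminus


print(parse_initial_binder("X,B1-10"))
print(parse_initial_binder("B1-10,X"))
```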

    Hotspot

    Selects which of the residues/nucleotides specified in the Receptor Range parameter act as binding sites. Two input formats are supported:

    1. Residue/Nucleotide-level specification
      Use chain ID + residue/nucleotide index (UID). Multiple ranges can be separated by commas.
      Example: A59-61,A83,A91
      This specifies residues/nucleotides with indices 59–61, 83, and 91 on chain A as binding sites.

    2. Atom-level specification within residues/nucleotides
      Use chain ID + residue/nucleotide index (UID), followed by a colon and the standard atom names. Multiple atoms of one residue are separated by semicolons, and multiple residues are separated by commas.
      Example: A83:O;NZ,A91:OG
      This specifies atoms O and NZ in residue 83, and atom OG in residue 91 on chain A as binding sites.

    Predefined atom groups:

    Atom Group Description Example
    ALL All atoms of the residue A83:ALL
    BKBN Backbone atoms (N;CA;C;O) A83:BKBN
    TIP Key side-chain atoms (defined per residue type) A83:TIP
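
    As an illustration of the two Hotspot formats and the predefined atom groups, here is a minimal Python sketch. The function names are hypothetical, TIP_ATOMS copies only four entries from the per-residue TIP definitions listed in this section, and returning an empty list for residue types whose TIP entry is None is an assumption, not documented behavior.

```python
from typing import Dict, List, Optional

BACKBONE = ["N", "CA", "C", "O"]  # the BKBN group

# Subset of the TIP definitions; add the remaining residue types as needed.
TIP_ATOMS: Dict[str, Optional[List[str]]] = {
    "TYR": ["CZ", "OH"],
    "SER": ["CB", "OG"],
    "LYS": ["CE", "NZ"],
    "GLY": None,
}


def parse_hotspots(spec: str) -> Dict[str, Optional[List[str]]]:
    """Map each residue token to its atom list, or None for a whole residue."""
    result: Dict[str, Optional[List[str]]] = {}
    for item in (part.strip() for part in spec.split(",")):
        if not item:
            continue
        residue, colon, atoms = item.partition(":")
        result[residue] = [a for a in atoms.split(";") if a] if colon else None
    return result


def expand_groups(atoms: List[str], resname: str) -> List[str]:
    """Replace BKBN and TIP with explicit atom names; other names pass through."""
    expanded: List[str] = []
    for atom in atoms:
        if atom == "BKBN":
            expanded.extend(BACKBONE)
        elif atom == "TIP":
            expanded.extend(TIP_ATOMS.get(resname) or [])
        else:
            expanded.append(atom)
    return expanded


print(parse_hotspots("A83:O;NZ,A91:OG"))
print(expand_groups(["TIP"], "LYS"))
```

    Expanding TIP needs the residue type, which has to be read from the reference structure; the hotspot string alone does not carry it.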

    Predefined TIP atoms by residue type:

    "TRP": ["CG","CD1","CD2","NE1","CE2","CE3","CZ2","CZ3","CH2"],
    "HIS": ["CG","ND1","CD2","CE1","NE2"],
    "TYR": ["CZ","OH"],
    "PHE": ["CG","CD1","CD2","CE1","CE2","CZ"],
    "ASN": ["CB","CG","OD1","ND2"],
    "ASP": ["CB","CG","OD1","OD2"],
    "GLN": ["CG","CD","OE1","NE2"],
    "GLU": ["CG","CD","OE1","OE2"],
    "CYS": ["CB","SG"],
    "SER": ["CB","OG"],
    "THR": ["CB","OG1"],
    "LEU": ["CB","CG","CD1","CD2"],
    "VAL": ["CG1","CG2"],
    "ILE": ["CB","CG2"],
    "MET": ["SD","CE"],
    "LYS": ["CE","NZ"],
    "ARG": ["CD","NE","CZ","NH1","NH2"],
    "PRO": None,
    "ALA": None,
    "GLY": None
    

    Number of Designs

    Number of binder designs to generate (maximum: 100).


    Small Molecule Binder

    Reference Structure

    A reference structure containing the small molecule, in PDB or CIF format.

    Ligand

    The name of the small molecule in the reference structure, e.g., IAI.
    Note: If the ligand name also exists in the CCD (Chemical Component Dictionary, https://www.ebi.ac.uk/pdbe-srv/pdbechem/), the ligand's structure must match the CCD entry of that name; otherwise an error is raised. If the structures differ, rename the ligand to L:G or to another name that is not in the CCD, and make sure the name is unique.

    Fixed Ligand Atoms

    During design, the coordinates of atoms extracted from the reference structure may change. This parameter allows specific ligand atoms to be fixed so that their coordinates remain unchanged.
    Atoms are specified using the standard atom names from the structure. Multiple atoms can be separated by commas or semicolons, for example: N9,O8;C4;C1;N3;C10

    Buried Ligand Atoms

    Specifies which ligand atoms should be buried inside the binder protein (typically atoms involved in interactions) and not exposed to the solvent.
    The specification format is the same as for Fixed Ligand Atoms.

    Exposed Ligand Atoms

    Specifies which ligand atoms should be exposed to the solvent.
    The specification format is the same as for Fixed Ligand Atoms.
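
    The three ligand atom lists share one format, so a single helper can read all of them. The Python sketch below is illustrative only: the function names are hypothetical, and the overlap check reflects the assumption that an atom should not be requested as both buried and exposed; the module itself may or may not enforce this.

```python
import re
from typing import Iterable, List


def parse_atom_names(spec: str) -> List[str]:
    """Split an atom list on commas or semicolons, dropping empty entries."""
    return [name.strip() for name in re.split(r"[,;]", spec) if name.strip()]


def conflicting_atoms(buried: Iterable[str], exposed: Iterable[str]) -> List[str]:
    """Atoms that appear in both the buried and the exposed list."""
    return sorted(set(buried) & set(exposed))


fixed = parse_atom_names("N9,O8;C4;C1;N3;C10")
print(fixed)
print(conflicting_atoms(["N9", "O8"], ["O8", "C10"]))
```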

    Length of Binder

    Defines the length of the binder protein. This can be a fixed length or a length range, for example 20 or 20-50.

    • 20 means the binder protein has a length of 20 residues.
    • 20-50 means the binder protein length ranges from 20 to 50 residues, with the exact length determined by the final design.
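
    A length value can be normalized to a (minimum, maximum) pair before use. This is a small Python sketch with a hypothetical function name, covering only the two forms shown above.

```python
from typing import Tuple


def parse_length(spec: str) -> Tuple[int, int]:
    """Return (minimum, maximum) for a fixed length '20' or a range '20-50'."""
    low_text, _, high_text = spec.strip().partition("-")
    low = int(low_text)
    high = int(high_text) if high_text else low
    if low < 1 or high < low:
        raise ValueError(f"invalid length specification: {spec!r}")
    return low, high


print(parse_length("20"))
print(parse_length("20-50"))
```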

    Number of Designs

    Specifies the number of binder designs to generate (currently up to a maximum of 100).


    Enzyme

    Reference Structure

    The reference structure of the enzyme used during design, in PDB or CIF format. It may include all or part of the enzyme protein and substrate molecules (atoms).

    Length

    Defines the length of the enzyme. This can be a fixed length or a length range, for example 100 or 100-120.

    Fixed Atoms

    During design, for structures extracted from the reference structure, the coordinates of specified atoms can be fixed and kept unchanged.
    Atoms are specified in the same way as in the atom-level format of the Hotspot parameter in the Binder mode.
    When specifying atoms of small molecules, use ligand_name + atom_name, for example: IAI:N9;O8.

    Unindex

    Specify which of the residues extracted from the reference structure have their indices (positions in the designed chain) inferred by the model rather than fixed in advance.
    The residue selection format is the same as the Receptor Range parameter in the Binder mode.

    Ligand

    Specify the names of small molecules in the reference structure to be extracted into the designed complex structure. Multiple ligands can be specified, separated by commas, for example: NAD,IAI.
    Note: If the ligand name also exists in the CCD (Chemical Component Dictionary, https://www.ebi.ac.uk/pdbe-srv/pdbechem/), the ligand's structure must match the CCD entry of that name; otherwise an error is raised. If the structures differ, rename the ligand to L:G or to another name that is not in the CCD, and make sure the name is unique.

    Number of Designs

    Specify the number of designs to generate. The default is 10, and the maximum is 100.


    Custom

    Reference Structure

    The reference structure used during design, in PDB or CIF format. It may include proteins, nucleic acids, small molecules, etc.

    Contigs

    Define the main design strategy by specifying which parts are extracted from the reference structure and which parts are designed de novo. Multiple design segments are separated by commas.
    For example: A1-80,10,/0,B5-12, which means:

    • A1-80: First, extract residues 1 to 80 (UID, insertion codes supported) from chain A of the reference structure.
    • 10: Design a de novo motif with a length of 10 residues and connect it to the C-terminus of the previous motif A1-80. The motif length can also be given as a range, such as 24-50, meaning a length between 24 and 50 residues, with the exact value determined by the design result.
    • /0: A chain break symbol, indicating that the designed protein is split into a new chain at this point, and subsequent motifs belong to a new chain.
    • B5-12: Extract residues 5 to 12 (UID) from chain B of the reference structure.
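
    The total length implied by a Contigs string can be estimated ahead of time, which helps when choosing a compatible Length value. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, insertion codes are not handled, and a reference segment such as A1-80 is counted as 80 residues, which only holds when the UIDs in that range are contiguous in the structure file.

```python
import re
from typing import Tuple

_REFERENCE = re.compile(r"^[A-Za-z](\d+)(?:-(\d+))?$")  # e.g. A1-80 or A105
_DE_NOVO = re.compile(r"^(\d+)(?:-(\d+))?$")             # e.g. 10 or 24-50


def contig_length_bounds(contigs: str) -> Tuple[int, int]:
    """Return the (minimum, maximum) residue count described by a contig string."""
    low = high = 0
    for token in (part.strip() for part in contigs.split(",")):
        if token in ("", "/0"):  # chain breaks add no residues
            continue
        reference = _REFERENCE.match(token)
        de_novo = _DE_NOVO.match(token)
        if reference:
            start = int(reference.group(1))
            end = int(reference.group(2) or start)
            low += end - start + 1
            high += end - start + 1
        elif de_novo:
            shortest = int(de_novo.group(1))
            longest = int(de_novo.group(2) or shortest)
            low += shortest
            high += longest
        else:
            raise ValueError(f"unrecognised contig token: {token!r}")
    return low, high


print(contig_length_bounds("A1-80,10,/0,B5-12"))
print(contig_length_bounds("A1-80,24-50,/0,B5-12"))
```

    Comparing the lower bound with the Length parameter catches a total length that is too short for the motifs before a job is submitted.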

    Unfixed Sequence

    Specify which parts of the extracted known structure need to have their sequences changed. Multiple regions are separated by commas.
    For example: A20-30,A54-60 indicates that residues 20–30 and 54–60 (UID) of chain A, which are already specified in the Contigs parameter, need sequence optimization.
    Note: The regions to be redesigned must be included in the Contigs parameter; otherwise, an error will be raised.
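
    The requirement that every Unfixed Sequence region lies inside the Contigs selection can be checked with set arithmetic. This Python sketch uses hypothetical function names and handles plain integer UIDs only; tokens it does not recognise, including de novo lengths, chain breaks, and UIDs with insertion codes, are skipped.

```python
import re
from typing import List, Set, Tuple

_RANGE = re.compile(r"^([A-Za-z])(\d+)(?:-(\d+))?$")  # e.g. A20-30 or A54

Residue = Tuple[str, int]  # (chain, UID)


def residues(spec: str) -> Set[Residue]:
    """Expand reference-structure ranges into individual (chain, UID) pairs."""
    found: Set[Residue] = set()
    for token in spec.split(","):
        match = _RANGE.match(token.strip())
        if match is None:  # not a reference range: skip it
            continue
        chain, start = match.group(1), int(match.group(2))
        end = int(match.group(3) or start)
        found.update((chain, uid) for uid in range(start, end + 1))
    return found


def missing_from_contigs(unfixed: str, contigs: str) -> List[Residue]:
    """Residues listed as unfixed that the contig string does not extract."""
    return sorted(residues(unfixed) - residues(contigs))


print(missing_from_contigs("A20-30,A54-60", "A1-80,10,/0,B5-12"))
print(missing_from_contigs("A79-82", "A1-80,10,/0,B5-12"))
```

    An empty result means every unfixed residue is covered; any residue returned would be expected to trigger the error described in the note.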

    Length

    Specify the total length of the designed protein. This can be a fixed length or a range, for example: 100 or 100-200.
    Note: The total length must be greater than or equal to the total motif length defined in the Contigs parameter.

    Ligand

    Specify the names of small molecules in the reference structure to be extracted into the designed complex structure. Multiple ligands can be specified, separated by commas, for example: NAD,IAI.
    Note: If the ligand name also exists in the CCD (Chemical Component Dictionary, https://www.ebi.ac.uk/pdbe-srv/pdbechem/), the ligand's structure must match the CCD entry of that name; otherwise an error is raised. If the structures differ, rename the ligand to L:G or to another name that is not in the CCD, and make sure the name is unique.

    Hotspot

    Specify which parts of the extracted reference structure are binding sites. The format is the same as the Hotspot parameter in the Binder mode.

    Fixed Atoms

    During design, the coordinates of atoms in residues/nucleotides or small molecules extracted from the reference structure may change. This parameter can be used to fix the coordinates of selected atoms so they remain unchanged.
    Atoms are specified in the same way as in the atom-level format of the Hotspot parameter in the Binder mode.
    When specifying atoms of small molecules, use ligand_name + atom_name, for example: IAI:N9;O8.

    Buried

    Specify which parts of the extracted reference structure should be buried inside the protein and not exposed to the solvent.
    The specification format is the same as the Hotspot parameter in the Binder mode. It can target specific residues/nucleotides, small molecules, or even specific atoms.
    For example, to specify buried atoms of a small molecule: IAI:N9;O8;C4;C1;N3;C10.

    Exposed

    Specify which parts of the extracted reference structure should be exposed to the solvent.
    The specification format is the same as the Buried parameter.

    Hbond Donor Atoms

    Specify which atoms in the extracted reference structure act as hydrogen bond donors.
    The specification format is the same as the Fixed Atoms parameter.

    Hbond Acceptor Atoms

    Specify which atoms in the extracted reference structure act as hydrogen bond acceptors.
    The specification format is the same as the Fixed Atoms parameter.

    Redesign Sidechains

    Keep the backbone of the extracted reference structure fixed and redesign only the side chains.

    Center of Mass

    Specify the coordinates of the center of mass (COM) of the generated protein.
    The X, Y, and Z coordinates are separated by commas, for example: 15,2,-4.
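
    One way to pick a sensible value is to compute the center of mass of an existing group of atoms, for example a ligand or a pocket, and pass the result. The Python sketch below is illustrative: both function names are hypothetical, and the atom masses and coordinates must be supplied by the caller.

```python
from typing import Iterable, Tuple

Vector = Tuple[float, float, float]


def parse_com(spec: str) -> Vector:
    """Read 'X,Y,Z' into a tuple of floats; fails unless exactly three values are given."""
    x, y, z = (float(value) for value in spec.split(","))
    return (x, y, z)


def center_of_mass(atoms: Iterable[Tuple[float, Vector]]) -> Vector:
    """Mass-weighted mean position of (mass, (x, y, z)) pairs."""
    total = 0.0
    sums = [0.0, 0.0, 0.0]
    for mass, position in atoms:
        total += mass
        for axis in range(3):
            sums[axis] += mass * position[axis]
    if total == 0:
        raise ValueError("no atoms with non-zero mass")
    return (sums[0] / total, sums[1] / total, sums[2] / total)


print(parse_com("15,2,-4"))
print(center_of_mass([(12.0, (0.0, 0.0, 0.0)), (12.0, (2.0, 4.0, 6.0))]))
```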

    Number of Designs

    Specify the number of designs to generate. The default is 10, and the maximum is 100.


    Results

    • Designed structure file: res_design_0_model_0-5.cif
    • Corresponding sequence files: res_seqs_rfd3.fasta and res_seqs_rfd3_batch.fasta
    • Design evaluation metrics file: metrics_rfd3_summary.csv, which contains the following information:
    Column Name Description
    Name Structure name
    max_ca_deviation Maximum Cα atom deviation (Å), measuring the difference between the predicted structure and the ideal structure. Smaller values indicate more reasonable structures; typically < 0.5 Å
    n_chainbreaks Number of chain breaks, indicating backbone continuity. 0 means the backbone is fully continuous
    n_clashing.interresidue_clashes_w_sidechain Number of inter-residue side-chain clashes. 0 indicates no side-chain clashes
    n_clashing.interresidue_clashes_w_backbone Number of inter-residue backbone clashes. 0 indicates no backbone clashes
    non_loop_fraction Fraction of non-loop regions (helices + sheets) in the overall structure
    loop_fraction Fraction of loop regions in the overall structure
    helix_fraction Fraction of alpha-helix regions in the overall structure
    sheet_fraction Fraction of beta-sheet regions in the overall structure
    num_ss_elements Number of secondary structure elements
    radius_of_gyration Radius of gyration (Rg, Å), measuring structural compactness. Rg < 15 Å: extremely compact globular structure, typically highly stable; 15–20 Å: typical compact globular protein, stable; 20–25 Å: moderately compact, may contain flexible regions; Rg > 25 Å: relatively loose or extended conformation
    alanine_content Alanine content; higher alanine content favors helix formation
    glycine_content Glycine content; moderate glycine content provides structural flexibility
    num_residues Total number of residues
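The interpretation bands in the table above can be applied to a parsed row of metrics_rfd3_summary.csv. The following is a minimal sketch, not part of the module: the function names are hypothetical, the handling of the exact band edges (15, 20, 25 Å) is an assumption because the table does not say which side they fall on, and the 0.5 Å cut-off simply reuses the "typically < 0.5 Å" guidance as a hard limit.

```python
def classify_rg(rg: float) -> str:
    """Map a radius of gyration (in Å) to the compactness bands from the table."""
    if rg < 0:
        raise ValueError("radius of gyration cannot be negative")
    if rg < 15:
        return "extremely compact"
    if rg < 20:  # assumption: exactly 15 Å falls into this band
        return "typical compact globular"
    if rg <= 25:  # assumption: exactly 20 Å and 25 Å fall into this band
        return "moderately compact"
    return "loose or extended"


def passes_basic_checks(row: dict) -> bool:
    """True when a row has no chain breaks, no clashes, and a small Cα deviation."""
    return (
        float(row["n_chainbreaks"]) == 0
        and float(row["n_clashing.interresidue_clashes_w_sidechain"]) == 0
        and float(row["n_clashing.interresidue_clashes_w_backbone"]) == 0
        and float(row["max_ca_deviation"]) < 0.5  # assumption: guidance used as a cut-off
    )
```

Because the output structures are not ranked, a check of this kind is one way to shortlist designs before further evaluation.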

    Note: The current output structures are not ranked by structural quality; they are presented in the model’s default output order.

    • Packaged archive of all results: all_results_rfd3.tar.gz

    References

    • Butcher, J.; Krishna, R.; Mitra, R.; Brent, R. I.; Li, Y.; et al. De novo Design of All-atom Biomolecular Interactions with RFdiffusion3. bioRxiv (2025). DOI:10.1101/2025.09.18.676967
  • Name: Immunogenicity Prediction (WeADApt v4.3)
    Description: 唯信开发的基于多模融合深度学习的端到端免疫原性预测系统WeADApt(原名:AlphaMHC)的最新版本。采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条与免疫原性相关的湿实验数据(包括亲和力数据、呈递数据、NGS数据、质谱数据等)进行训练,实现了从序列到临床免疫原性风险的端到端的预测,并通过了数百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)的验证测试。 v4.3为最新主力版本,相比v4.2进一步提升了预测的特异性,且对不同风险水平的表位的区分度更高,更易于进行去免疫原性改造。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Immunogenicity -> WeADApt v4。 The latest version of the immunogenicity prediction system, WeADApt (formerly known as AlphaMHC). Compared to version v4.2, version v4.3 offers improved prediction specificity and better discrimination between epitopes of varying risk levels, making it more suitable for de-immunization modifications. It is recommended to use in the WeSeq: WeSeq -> Immunogenicity -> WeADApt v4.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-12-04 00:00:00
    Reference:

    Immunogenicity Prediction (WeADApt v4.3)

    简介

    WeADApt (Wecomput ADA prediction) 是唯信开发的基于多模融合深度学习架构的免疫原性预测系统(也被熟知为AlphaMHC)。

    该方法采用全新的多模融合深度神经网络架构,整合了近10亿条与免疫原性相关的湿实验数据(包括亲和力数据、呈递数据、NGS数据、质谱数据等)进行训练,有机地将多个与免疫原性相关的模型融合,构成一个高效的免疫反应模拟系统,可准确地模拟蛋白、抗体、多肽、疫苗等生物药的免疫原性,并能鉴别潜在的免疫原性的T细胞表位(引起临床人体免疫应答的肽段),实现了从序列到临床免疫原性风险的端到端的预测,并通过了数百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)的验证测试。

    image.png

    在同样的43个抗体分子的临床ADA数据集上,WeADApt v4.3预测的相关性超过了知名的商业软件EpiMatrix(R2=0.45 vs R2=0.42)。

    v4.3版本

    v4.3版本相对于上个版本v4.2主要有以下改进:

    • 优化了架构和超参数,提升模型对非单抗类和未知分子的泛化能力;
    • 优化了HLA亚型,覆盖更广泛的世界人群;
    • 优化了表位计算的逻辑;
    • 优化了相关组件,提升计算效率;
    • 重新设计了报告的样式,更加直观地展示表位位置。

    性能

    测试数据:

    从FDA和EMA的临床试验中收集了200余个已知免疫原性的分子及其ADA的分布,计算模型预测值与真实ADA发生率的相关性,以测试其预测性能。
    image.png

    单抗 mAb

    使用唯信收集整理的200多个临床及上市单抗的ADA数据的测试结果如下图所示,预测分数与ADA发生率的Spearman相关性提升到R=0.74。

    image.png

    0.2分适合作为单抗的高/低风险的阈值(>20% ADA定义为高风险)。

    双抗 BsAb

    WeADApt v4被设计为兼容各类的分子形式,不论是对称还是非对称、是否有重复结构域的任意蛋白分子,仅需输入不重复的链即可(重复链全部输入也会自动处理)。

    使用唯信收集整理的双抗ADA数据集的测试表现如下图所示,预测分数与ADA发生率的Pearson相关性达到R=0.62。
    image.png

    延续 v4.2 版本的设计,该版本以0.4的分数作为分界线时,可以较好地区分高、低风险的双抗分子。

    本系统仅从序列水平预测产生的影响,因此尤其适合同类靶点分子的相对比较和筛选。

    参数说明

    Fasta File

    待预测的 Fasta 文件。
    对序列名有要求,程序内部使用 “蛋白.链名” 的形式区分不同蛋白。

    计算量消耗
    采用阶梯式动态机制,根据提交的序列数量,对应消耗如下:

    • ≤ 5 条序列:10,000 计算量 / 条
    • 第 6–100 条序列:1,000 计算量 / 条
    • 超过 100 条的部分:100 计算量 / 条

    Molecule Score

    蛋白级别的打分和风险评估结果文件。
    默认值: MolScore.csv

    TCE Score

    表位(TCE)明细数据输出文件。
    默认值: TceScore.csv

    Export Details

    是否导出明细数据。
    默认值: no

    • 开启会影响运行效率
    • 当序列数 超过 20 时,即使设置为 yes 也不会输出明细

    Export HTML Reports

    是否导出可视化 HTML 报告。
    默认值: no

    • 开启会影响运行效率
    • 当序列数 超过 20 时,即使设置为 yes 也不会输出报告

    Risk Threshold

    风险评估阈值。
    默认值: 0.2

    • 双抗分子建议使用 0.4

    Hide Low TCE

    在表位明细输出中,屏蔽分数 小于该值 的表位。
    默认值: 0

    结果说明

    输出文件如下:

    • MolScore.csv:蛋白级别的打分和风险评估结果文件
    • TceScore.csv:表位(TCE)明细数据输出文件
    • Details.xlsx:更详细的数据,用于确认哪些 9 肽对结果影响更大
    • Plots.tar:压缩包,包含可视化报告,每个分子一个独立的 HTML 页面报告

    Immunogenicity Prediction (WeADApt v4.3)

    Introduction

    WeADApt (Wecomput ADA Prediction) is an immunogenicity prediction system developed by Wecomput, based on a multi-modal fusion deep learning architecture. The system is also widely known as AlphaMHC.

    This method adopts a novel multi-modal deep neural network framework and is trained on nearly 1 billion experimentally derived immunogenicity-related data points, including binding affinity data, antigen presentation data, NGS data, and mass spectrometry data. By organically integrating multiple immunogenicity-related models, WeADApt constructs an efficient immune response simulation system capable of accurately modeling the immunogenicity of biologics such as proteins, antibodies, peptides, and vaccines.

    WeADApt enables end-to-end prediction from sequence to clinical immunogenicity risk, and can identify potential T-cell epitopes that may trigger clinical immune responses. The system has been validated against hundreds of real-world clinical immunogenicity datasets from the FDA and EMA, covering mono-/multi-specific antibodies and recombinant proteins.

    On the same clinical ADA dataset of 43 antibody molecules, WeADApt v4.3 achieved a higher correlation than the well-known commercial software EpiMatrix (R² = 0.45 vs. R² = 0.42).

    image.png


    Version v4.3

    Compared to the previous version (v4.2), v4.3 introduces the following key improvements:

    • Model Generalization: Optimized architecture and hyperparameters to enhance the model’s generalization capabilities for non-mAb (monoclonal antibody) entities and unseen molecules.
    • HLA Coverage: Expanded and optimized HLA subtypes to provide broader coverage across diverse global populations.
    • Epitope Logic: Refined the underlying logic for epitope calculation for higher precision.
    • Performance: Optimized internal components to significantly improve computational efficiency and processing speed.
    • Reporting: Redesigned the report layout to provide a more intuitive visualization of epitope positions.

    Performance

    Test Dataset

    More than 200 molecules with known immunogenicity and corresponding ADA incidence rates were collected from FDA and EMA clinical trials. Model performance was evaluated by measuring the correlation between predicted scores and real ADA incidence rates.

    image.png


    Monoclonal Antibodies (mAb)

    Using a curated dataset of over 200 clinical and marketed monoclonal antibodies, the prediction scores achieved a Spearman correlation of R = 0.74 with observed ADA incidence rates.

    image.png

    A score of 0.2 is recommended as the threshold for distinguishing high-risk from low-risk monoclonal antibodies (an ADA incidence above 20% is defined as high risk).


    Bispecific Antibodies (BsAb)

    WeADApt v4 is designed to be compatible with various molecular formats, including symmetric or asymmetric architectures and proteins with repeated domains. Only the non-redundant chains need to be provided as input; if duplicate chains are included, they are handled automatically.

    On a curated bispecific antibody ADA dataset, WeADApt v4 achieved a Pearson correlation of R = 0.62 between predicted scores and observed ADA incidence rates.

    image.png

    Consistent with v4.2, a score threshold of 0.4 effectively separates high- and low-risk bispecific antibodies in v4.3.

    This system predicts immunogenicity solely at the sequence level, making it particularly suitable for relative comparison and screening of molecules targeting the same antigen.


    Parameters

    Fasta File

    FASTA file containing the sequences to be evaluated.
    Sequence identifiers must follow the format “Protein.ChainID”, which is used internally to distinguish different proteins.
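As an illustration of the "Protein.ChainID" naming rule, the sketch below groups FASTA records by protein before submission. It is not the module's own parser: the function name is hypothetical, and splitting the identifier at the last dot is an assumption about how names that contain several dots would be treated.

```python
from collections import defaultdict
from typing import Dict


def group_chains(fasta_text: str) -> Dict[str, Dict[str, str]]:
    """Group FASTA records named 'Protein.ChainID' into {protein: {chain: sequence}}."""
    proteins: Dict[str, Dict[str, str]] = defaultdict(dict)
    protein = chain = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            # Assumption: the chain ID is whatever follows the last dot.
            protein, sep, chain = fields[0].rpartition(".")
            if not sep or not protein or not chain:
                raise ValueError(f"header {fields[0]!r} is not 'Protein.ChainID'")
            proteins[protein][chain] = ""
        elif protein is None:
            raise ValueError("sequence data before the first header")
        else:
            proteins[protein][chain] += line
    return {name: dict(chains) for name, chains in proteins.items()}
```

Running this on a file before upload shows which chains will be treated as belonging to the same molecule and catches headers that lack a chain ID.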

    WeADApt v4.3 Compute Cost
    WeADApt v4.3 uses a tiered cost model in which the compute units consumed depend on the number of submitted sequences:

    • Sequences 1–5: 10,000 compute units per sequence
    • Sequences 6–100: 1,000 compute units per sequence
    • Sequences beyond 100: 100 compute units per sequence
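The tiers can be turned into a quick cost estimate. This is a minimal sketch that assumes the tiers are marginal, meaning each rate applies only to the sequences that fall inside its tier; the function and constant names are hypothetical.

```python
# (tier size, compute units per sequence); None marks the open-ended last tier.
TIERS = [(5, 10_000), (95, 1_000), (None, 100)]


def compute_units(n_sequences: int) -> int:
    """Estimate total compute units for a submission under marginal tiers."""
    if n_sequences < 0:
        raise ValueError("number of sequences cannot be negative")
    total = 0
    remaining = n_sequences
    for size, rate in TIERS:
        take = remaining if size is None else min(remaining, size)
        total += take * rate
        remaining -= take
        if remaining == 0:
            break
    return total
```

Under this reading, 120 sequences cost 5 × 10,000 + 95 × 1,000 + 20 × 100 = 147,000 compute units.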

    Molecule Score

    Protein-level scoring and immunogenicity risk assessment output file.
    Default: MolScore.csv


    TCE Score

    Output file containing detailed T-cell epitope (TCE) information.
    Default: TceScore.csv


    Export Details

    Whether to export detailed data.
    Default: no

    • Enabling this option may reduce computational efficiency
    • When the number of sequences exceeds 20, detailed outputs will not be generated even if set to yes

    Export HTML Reports

    Whether to export interactive HTML visualization reports.
    Default: no

    • Enabling this option may reduce computational efficiency
    • When the number of sequences exceeds 20, reports will not be generated even if set to yes

    Risk Threshold

    Threshold used for immunogenicity risk assessment.
    Default: 0.2

    • For bispecific antibodies, a threshold of 0.4 is recommended
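The two recommended thresholds can be captured in a small helper when post-processing scores. This is only a sketch: the labels and names are hypothetical, and treating a score exactly equal to the threshold as low risk is an assumption, since the documentation does not say how ties are handled.

```python
from typing import Optional

# Recommended thresholds from the documentation.
RECOMMENDED_THRESHOLDS = {"mAb": 0.2, "BsAb": 0.4}


def risk_label(score: float, molecule_type: str = "mAb",
               threshold: Optional[float] = None) -> str:
    """Return 'high' or 'low' for a molecule-level score."""
    if threshold is None:
        threshold = RECOMMENDED_THRESHOLDS[molecule_type]
    # Assumption: scores strictly above the threshold count as high risk.
    return "high" if score > threshold else "low"
```

For example, a score of 0.3 would be labelled high risk for a monoclonal antibody but low risk for a bispecific antibody.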

    Hide Low TCE

    Hide epitopes with scores below this value in the TCE output.
    Default: 0


    Results

    The system generates the following output files:

    • MolScore.csv: Protein-level scores and immunogenicity risk assessment
    • TceScore.csv: Detailed T-cell epitope information
    • Details.xlsx: Extended data for identifying which 9-mer peptides contribute most significantly to the final score
    • Plots.tar: Compressed archive containing visualization reports, with one standalone HTML report per molecule
  • Name: Join Structure
    Description: 将结构A中指定链的C端与结构B中指定链的N端进行拼接,形成新的结构。 This module joins two structures by connecting the **C-terminus** of a specified chain in Structure A with the **N-terminus** of a specified chain in Structure B, generating a new combined structure.
    Tags: undefined
    Author:
    Release: 2025-12-02 00:00:00
    Reference:

    Structure Join

    简介

    将结构A中指定链的C端与结构B中指定链的N端进行拼接,形成新的结构。

    参数说明

    Structure A

    用于拼接的结构之一,PDB格式,该结构中指定链的C端参与拼接。

    N-terminal Chain

    指定结构A中参与拼接的链名,仅单链,如H,如不指定,则默认使用第一条链。

    Structure B

    用于拼接的结构之一,PDB格式,该结构中指定链的N端参与拼接。

    C-terminal Chain

    指定结构B中参与拼接的链名,仅单链,如H,如不指定,则默认使用第一条链。

    Output

    拼接后的结构名称,默认为join_result.pdb

    结果说明

    输出拼接后的结构,默认为join_result.pdb

    Structure Join

    Introduction

    This module joins two structures by connecting the C-terminus of a specified chain in Structure A with the N-terminus of a specified chain in Structure B, generating a new combined structure.

    Parameters

    Structure A

    One of the input structures used for joining, in PDB format.
    The C-terminal end of the specified chain in this structure will be used for the join.

    N-terminal Chain

    The chain in Structure A to be used for joining.
    Must be a single chain, e.g., H.
    If not specified, the first chain in the structure is used by default.

    Structure B

    The second structure used for joining, in PDB format.
    The N-terminal end of the specified chain in this structure will be used for the join.

    C-terminal Chain

    The chain in Structure B to be used for joining.
    Must be a single chain, e.g., H.
    If not specified, the first chain in the structure is used by default.

    Output

    Name of the output file containing the joined structure.
    Default: join_result.pdb.


    Results

    The resulting joined structure is written to the output file, with the default name join_result.pdb.

  • Name: Binder Design (BoltzGen)
    Description: BoltzGen是一个全原子生成模型,能够从头生成结合各种生物分子靶标的抗体、蛋白、肽类等生物分子,也可以基于已有分子进行局部生成式优化。是经过较多湿实验验证的新一代从头生成模型,是目前WeMol平台上用于抗体(包括VHH)从头生成的首选。注意:本模块生成采样数固定为1000,如果准备进行湿实验验证,推荐使用“Binder Design (BoltzGen) HTS”模块进行更大量的生成,可以提高成功率。 BoltzGen is an all-atom generative model capable of de novo design of biomolecules such as antibodies, proteins, and peptides that bind to various biomolecular targets. It also supports local generative optimization based on existing molecules. As a next-generation de novo generative model validated by extensive wet-lab experiments, it is currently the preferred tool for de novo generation of antibodies (including VHHs) on the WeMol platform. Note: The number of generated samples for this module is fixed at 1000. For wet-lab validation studies, we recommend using the “Binder Design (BoltzGen) HTS” module to generate a larger number of candidates, which can improve success rates.
    Tags: undefined
    Author: Hannes Stark
    Release: 2025-10-30 00:00:00
    Reference: BoltzGen: Toward Universal Binder Design. bioRxiv 2025.11.20.689494; doi: https://doi.org/10.1101/2025.11.20.689494

    Binder Design (BoltzGen)

    简介

    设计能够结合各种生物分子靶标的蛋白、肽类等生物分子。模块基于BoltzGen模型实现,BoltzGen是一个通用的全原子生成模型(all-atom generative model),能够在同一框架下完成多模态的binder设计任务。不同于前代模型只做“推断”,BoltzGen直接在扩散采样过程中生成目标分子与其结合体的全原子结构,并通过可控约束机制确保生成结果物理一致、功能可实现。同时具备良好的泛化性能,能够针对训练过程中未出现过的全新靶标进行有效设计。

    BoltzGen的方法设计承接了Boltz系列一贯的目标——在统一的几何表示与能量空间中,学习多模态分子的物理规律。与以往的预测模型不同,BoltzGen 并不输出单一结构预测,而是通过扩散建模(diffusion modeling)直接生成分子的全原子坐标。

    image.png

    其采用扩散模型(diffusion model)框架,在全原子坐标空间中学习分子结构的分布。模型通过在每个采样步骤中向原子坐标加入高斯噪声,再逐步去噪恢复,从而近似真实的势能面分布。与传统的分子生成模型(如仅在残基层面建模)不同,BoltzGen的每个采样变量都是具体原子的位置向量。模型通过能量一致性约束(energy consistency)学习键长、键角、二面角等局部几何关系,从而在生成时自动保持化学合理性。这里对于全原子建模采用了Atom14的方法。

    image.png
    这一设计使生成结构不仅在形状上接近真实蛋白,在能量上也符合分子物理规律。

    BoltzGen的架构如下图所示,由三大模块组成:输入层(Input Representation)、条件编码器(Condition Encoder)、扩散模型(Diffusion Model),输出为三维坐标的全原子结构。

    image.png

    BoltzGen的一个突出特点,是并非停留在计算层面的生成验证,而是进行了系统的湿实验评估。进行了十类实验任务(如下图所示),涵盖 蛋白质、肽类、环肽、小分子结合体及抗菌肽设计等多种体系,几乎囊括了现有结构设计模型所能覆盖的全部生物模态。这些实验的共同目标,是检验模型能否在“无同源模板、真实实验条件”下生成可表达、可结合的结构。不同于以往只验证折叠精度的预测模型,BoltzGen的验证标准是功能实现——即所生成分子是否能在实验中稳定结合目标。

    image.png
    BoltzGen 的实验结果显示出较高的一致性与通用性:

    • 在 26 个实验靶标中,有超过 60% 的生成候选在实验中表现出结合活性;
    • 模型生成的肽类与蛋白 binder 均表现出良好的可表达性(多数 >80% 可溶性);
    • 环肽和抗菌肽任务中,多个样本在无模板条件下仍能正确形成环化结构;
    • 小分子结合蛋白任务中,生成结果的结合构象与已知复合物 RMSD < 2.5 Å。
      image.png

    在 BoltzGen 论文中,进行抗体和结合蛋白生成的湿实验验证时,抗原(目标蛋白)的主要输入方式是结构,但在特定情况下也可以通过序列输入。

    具体说明如下:

    默认输入方式:结构
    论文中明确提到,除非另有说明,实验中均是将目标的结构(structure)作为输入提供给 BoltzGen 。例如,在针对 9 个新型目标(Novel Targets)设计纳米抗体和蛋白质结合剂时,研究人员利用了目标的结构信息。

    灵活性:序列输入与协同折叠(Cofolding)
    BoltzGen 是一个全原子生成模型,能够同时进行结构预测和蛋白质设计。当仅提供目标的序列(sequence)作为输入时,模型可以在设计结合剂的同时对目标进行折叠,最终生成结合复合物的原子结构。

    特殊案例:

    • 无结构输入
      在针对 NPM1 蛋白的无序区(disordered region)设计多肽时,研究人员采用了“无结构输入”的策略 。他们提供了 NPM1 有序区域的结构,但让无序区域保持柔性,从而测试模型在处理缺乏固定结构的目标时的表现。
    • 小分子目标
      对于小分子目标,BoltzGen仅需要输入SMILES字符串(一种描述分子结构的序列表示法),并在设计过程中执行协同折叠。

    总结来说,虽然BoltzGen具备直接从序列出发进行设计的能力,但在该论文的大多数湿实验验证(特别是针对新型蛋白目标)中,结构是主要的输入方式。

    参数说明

    De Novo Antibody

    Type

    指定抗体类型,目前支持Antibody(普通抗体)和Nanobody(纳米抗体)。

    Antigen Structure

    上传已有的抗原结构,PDB或CIF格式。

    Antigen Chains

    指定从结构中提取一些链作为抗原,可多选,如:A,B。如不设置该参数,表示提取结构中的所有链。

    Binding Hotspot

    指定抗原中的哪些残基参与结合,使用链名+残基位置(从1开始的顺序编号)进行指定,如A10-20,A25,B30-36,B40。
    表示:抗原结合位点为A链编号10至20、25的残基,B链编号30至36、40的残基。
    注意:
    1.在使用抗原序列文件时,链名是按字母顺序命名(与链的位置顺序对应),第一条链的链名为A,第二条链的链名为B,依次命名。
    2.如不设置该参数,模型会自主寻找潜在的结合位点。

    Custom Templates

    支持上传自定义的抗体或纳米抗体模板结构,会采用模板结构的FR区,对CDR区域(Chothia编号)进行重设计,可选择:

    • 单个结构文件(.pdb 或 .cif)
    • 批量结构文件(压缩包格式)

    多个模板结构时,每个模板结构都会用于设计。
    如未提供自定义模板,系统将使用内置的默认抗体模板和纳米抗体模板,具体如下:
    抗体模板:

    • 6CR1 — Adalimumab(阿达木单抗,Humira)
      靶点:TNF-α
      作用:阻断 TNF-α 与受体结合,抑制炎症反应

    • 6WGB — Dupilumab(度普利尤单抗,Dupixent)
      靶点:IL-4Rα
      作用:阻断 IL-4 / IL-13 信号通路,抑制 2 型炎症

    • 3HMW — Ustekinumab(乌司奴单抗,Stelara)
      靶点:IL-12 / IL-23 p40
      作用:同时抑制 Th1 和 Th17 炎症通路

    纳米抗体模板:

    • 7EOW — Caplacizumab(卡普赛珠单抗)
      靶点:vWF A1 域
      作用:阻断 vWF 与血小板结合,抑制血栓形成

    • 7XL0 — Vobarilizumab(ALX-0061,沃巴利珠单抗)
      靶点:IL-6R(+ 白蛋白结合)
      作用:抑制 IL-6 信号并延长半衰期

    • 8COH — TPP-3444(Gefurulimab / ALXN1720 组成部分)
      靶点:补体 C5
      作用:抑制补体激活

    • 8Z8V — ALB8(Ozoralizumab / ATN-103 组件)
      靶点:人血清白蛋白(HSA)
      作用:延长药物半衰期

    • Gontivimab(ALX-0171,格替韦单抗)
      靶点:RSV F 蛋白
      作用:阻断病毒融合,抑制感染

    • Isecarosmab(M-6495 / ALX-1141,艾司卡索单抗)
      靶点:ADAMTS-5
      作用:抑制软骨降解,具有抗炎作用

    • Sonelokimab
      靶点:IL-17A / IL-17F
      作用:双重抑制炎症因子,增强抗炎效果

    Number of Designs

    完成设计后,最终给出的结构数量,默认为20,最大支持100,设计过程中产生的结构数量在1000左右。

    Custom

    Protocol

    设计模式共有6种:

    • Protein:设计与靶点(蛋白或多肽)结合的蛋白,也可脱离靶点仅设计蛋白单体。
    • Peptide:设计与靶点蛋白结合的多肽(线性肽或环肽)。
    • Small_Molecule:设计与小分子结合的蛋白,不改变小分子本身。
    • Antibody: 设计与靶点结合的普通抗体,也可脱离靶点仅设计普通抗体自身
    • Nanobody:设计与靶点结合的纳米抗体,也可脱离靶点仅设计纳米抗体自身。
    • Redesign: 对已存在的蛋白/复合物结构,进行指定残基的重设计优化。

    设计规则的定义有三种方式:

    • 基于已有结构进行定义,可以是提取部分结构,也可以对部分结构进行设计。
    • 基于序列进行定义,指定序列中哪部分需要设计,哪部分残基不变。
    • 基于小分子文件进行定义,指定参与结合的小分子。

    三种方式可以自由组合。

    Structure

    上传已有蛋白结构,从中提取已有结构,或重新设计部分结构。例如:从上传的结构中提取靶点链、抗原链、纳米抗体链等。

    Chains

    指定从Structure中提取的链名,可多选,如:A,B。如不设置该参数,表示提取结构中的所有链。

    Include

    从Chains参数指定的链中,进一步确认需要提取的残基范围,使用链名+残基位置(从1开始的顺序编号,非PDB的UID编号)进行指定,如A10-20,A25,B1-36,B40。
    表示:从A链提取编号10至20、25的残基,从B链提取编号1至36、40的残基。
    如不设置该参数,表示提取Chains参数中指定的完整链。

    Exclude

    从Chains参数指定的链中,确认哪些残基不提取,与Include参数作用相反,指定方式相同(从1开始的顺序编号,非PDB的UID编号)。如A15,B36-42,表示A链编号15、B链编号36至42的残基不提取。

    Design Positions

    已提取的结构中,指定需要重新设计的残基,指定方式同Include参数,如A10-12,B15,B40(从1开始的顺序编号,非PDB的UID编号)。
    注意:需要重新设计的残基编号应在已提取的结构中存在。

    Design SS

    对要设计的残基,指定二级结构类型。使用链名,SS类型:残基范围(从1开始的顺序编号,非PDB的UID编号)进行指定,每行放置一个,如:

    A,HELIX:10-12
    B,SHEET:15,LOOP:40
    

    二级结构类型可选:LOOP, HELIX, SHEET(大小写均可)。
    不指定该参数表示不强制二级结构类型。
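The Design SS syntax (one line per chain: chain name, then comma-separated TYPE:range items) can be checked before submission with a small parser. This is an illustrative sketch rather than the module's own parser; the function name is hypothetical, and it assumes each item holds a single residue or a single start-end range.

```python
from typing import Dict

VALID_SS = {"LOOP", "HELIX", "SHEET"}


def parse_design_ss(text: str) -> Dict[str, Dict[int, str]]:
    """Parse lines such as 'A,HELIX:10-12' into {chain: {position: SS type}}."""
    result: Dict[str, Dict[int, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        chain, *items = [part.strip() for part in line.split(",")]
        if not chain or not items:
            raise ValueError(f"expected 'chain,TYPE:range', got {line!r}")
        per_chain = result.setdefault(chain, {})
        for item in items:
            ss, sep, rng = item.partition(":")
            ss = ss.strip().upper()  # the type is case-insensitive
            if not sep or ss not in VALID_SS:
                raise ValueError(f"bad secondary-structure item {item!r}")
            start, _, end = rng.partition("-")
            low = int(start)
            high = int(end) if end else low
            if low < 1 or high < low:
                raise ValueError(f"bad residue range {rng!r}")
            for position in range(low, high + 1):
                per_chain[position] = ss
    return result
```

The same line layout is used by the Sequence SS parameter, where the first field is a sequence number instead of a chain name.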

    Binding Hotspot

    指定哪些残基参与结合(如链间或与小分子结合),指定方式同Include,如A12,B15-18(从1开始的顺序编号,非PDB的UID编号)。

    Non Binding

    指定哪些残基不参与结合(从1开始的顺序编号,非PDB的UID编号),与Binding Hotspot参数作用相反。

    Design Insertions

    指定插入突变设计,使用链名,插入位置,插入残基长度,二级结构的方式定义(插入位置为从1开始的顺序编号,非PDB的UID编号),每行一个,如:

    A,10,5
    B,15,5-10,HELIX
    

    表示在A链的10号残基位置后,插入5个新残基,二级结构不确定(不强制)。在B链的15号残基位置后,插入5至10个残基(具体残基数量随机确定),二级结构为HELIX。

    二级结构类型可选(大小写均可):LOOP、HELIX、SHEET。
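A Design Insertions file can be validated with a parser along the following lines. It is a sketch under the format described above (chain, position, length or length range, optional secondary structure); the class and function names are hypothetical and not part of BoltzGen or the platform.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_SS = {"LOOP", "HELIX", "SHEET"}


@dataclass(frozen=True)
class Insertion:
    chain: str
    after_residue: int      # 1-based sequential index, not the PDB residue number
    min_length: int
    max_length: int
    ss: Optional[str] = None  # None means the secondary structure is not enforced


def parse_insertions(text: str) -> List[Insertion]:
    """Parse lines such as 'A,10,5' or 'B,15,5-10,HELIX'."""
    insertions = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) not in (3, 4):
            raise ValueError(f"expected 3 or 4 fields, got {line!r}")
        chain, position, length = fields[:3]
        low, _, high = length.partition("-")
        min_length = int(low)
        max_length = int(high) if high else min_length
        after_residue = int(position)
        if not chain or after_residue < 1 or min_length < 1 or max_length < min_length:
            raise ValueError(f"invalid insertion {line!r}")
        ss = fields[3].upper() if len(fields) == 4 else None
        if ss is not None and ss not in VALID_SS:
            raise ValueError(f"unknown secondary structure {fields[3]!r}")
        insertions.append(Insertion(chain, after_residue, min_length, max_length, ss))
    return insertions
```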


    Structure Repetition

    同Structure定义。例如:指定已有的Binder结构。

    Repetition Chains

    同Chains定义

    Repetition Include

    同Include定义

    Repetition Exclude

    同Exclude定义

    Repetition Design Positions

    同Design Positions定义

    Repetition Design SS

    同Design SS定义

    Repetition Binding Hotspot

    同Binding Hotspot定义

    Repetition Non Binding

    同Non Binding定义

    Repetition Design Insertions

    同Design Insertions定义


    Sequence

    指定要设计的蛋白序列,每行一条,如:

    AAVTTTTPPP
    15-20AAAAAAVTTTT18PPP
    

    其中:

    • 字母表示序列中明确的残基(设计中不变)
    • 单个数值表示该位置要设计的长度,如18表示序列的该位置将设计18个残基。
    • 数值范围表示长度范围(具体设计长度在范围内随机指定),如15-20表示该位置将设计15至20个残基,具体长度在15至20之间随机指定。

    序列的ID默认从1开始按顺序编号。
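The Sequence syntax mixes fixed residues (letters) with design segments (a number or a number range). The sketch below tokenizes one line and computes the 1-based positions of the fixed residues, counting every design segment at its minimum length as the documentation prescribes for position-based parameters. The function names are hypothetical and this is not the module's own parser.

```python
import re
from typing import List, Tuple

_TOKEN = re.compile(r"(\d+)(?:-(\d+))?|([A-Za-z]+)")


def tokenize(spec: str) -> List[tuple]:
    """Split '15-20ACS' into [('design', (15, 20)), ('fixed', 'ACS')]."""
    spec = spec.strip()
    tokens = []
    position = 0
    for match in _TOKEN.finditer(spec):
        if match.start() != position:
            raise ValueError(f"unexpected character at offset {position} in {spec!r}")
        position = match.end()
        if match.group(3):
            tokens.append(("fixed", match.group(3).upper()))
        else:
            low = int(match.group(1))
            high = int(match.group(2) or low)
            if low < 1 or high < low:
                raise ValueError(f"bad design length {match.group(0)!r}")
            tokens.append(("design", (low, high)))
    if position != len(spec):
        raise ValueError(f"unexpected character at offset {position} in {spec!r}")
    return tokens


def fixed_positions(spec: str) -> List[Tuple[int, str]]:
    """1-based positions of fixed residues, with design segments at minimum length."""
    positions = []
    index = 0
    for kind, value in tokenize(spec):
        if kind == "fixed":
            for residue in value:
                index += 1
                positions.append((index, residue))
        else:
            index += value[0]  # minimum length of the design segment
    return positions
```

For the specification 15-20ACS this places A, C and S at positions 16, 17 and 18, which matches the numbering example given for the Covalent Bond parameter.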

    Sequence Binding

    指定序列中参与结合的残基,使用序列编号:残基范围格式,如:

    1:5,8-10
    2:30-35
    

    表示第一条序列中编号5、8至10的残基参与结合;第二条序列中编号30至35的残基参与结合。
    第二条序列中含有设计长度范围时,按最小长度计算残基位置。

    Sequence Non Binding

    指定序列中不参与结合的残基,与Sequence Binding作用相反。

    Sequence SS

    指定序列中残基的二级结构类型,使用序列编号,SS类型:残基范围定义,每行一条,如:

    1,HELIX:5-8
    2,SHEET:15,LOOP:40
    

    表示第一条序列编号5至8的残基,二级结构为HELIX;第二条序列编号15的残基,二级结构为SHEET,编号40的残基,二级结构为LOOP。

    注意: 有指定设计长度范围的序列,按长度最小值来确认剩余残基的位置。

    Sequence Cycle

    指定需要环化的序列编号,如1,2表示第1和第2条序列首尾相连。

    Ligand

    指定参与结合的小分子信息,文本文件,支持SMILES或 CCD Code(化学组分词典编号)。如果使用SMILES格式,每行应包含一个小分子;如果使用CCD Code,每行可以包含一个或多个小分子,使用逗号分隔,并加上CCD前缀。示例如下:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
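A Ligand file in this layout can be read with a few lines of code. The sketch below only separates SMILES lines from CCD lines; the function name is hypothetical, and it makes no claim about how the module numbers several CCD codes that share one line.

```python
from typing import List, Tuple


def parse_ligand_file(text: str) -> List[Tuple[str, List[str]]]:
    """Return one ('smiles', [...]) or ('ccd', [...]) entry per non-empty line."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.upper().startswith("CCD,"):
            codes = [code.strip() for code in line.split(",")[1:] if code.strip()]
            if not codes:
                raise ValueError(f"CCD line without codes: {line!r}")
            entries.append(("ccd", codes))
        else:
            # Any other line is treated as a single SMILES string.
            entries.append(("smiles", [line]))
    return entries
```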
    

    Covalent Bond

    共价键信息的文本文件,TXT格式。每行放置一个共价键信息,每个共价键信息包含两个原子信息,每个原子信息由三部分组成:

    • 原子所在序列或小分子的顺序编号(按上述参数设置的顺序,确定相应序列或小分子的顺序,从1开始编号。)
    • 原子所在残基的位置编号(如残基为小分子时,编号为1)
    • 原子的标准名称(CCD中定义)
      三部分由逗号分隔,例如:3,1,CA表示第三个实体(序列或小分子)中的第一个残基(或小分子)的CA原子
      一个共价键是由两个原子信息组成,原子间用分号分隔,如:1,1,CA;2,1,CA
      表示一个共价键,该共价键由两个原子组成,第一个原子为1,1,CA,第二个原子为2,1,CA
      包含多个共价键信息的文件内容示例如下:
    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
    

    当小分子以SMILES给出时(如CC(=O)NCCNC(C)=O),假设该小分子的顺序编号(按上述方式确认)为3,其第一个C原子与编号为1的链/序列中第一个残基的CA原子形成共价键,则共价键定义为1,1,CA;3,1,C1。其中C1表示小分子的第一个C原子;如果是第二个C原子,则用C2表示。
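Each line of a Covalent Bond file holds two atom references separated by a semicolon, and each reference has three comma-separated parts (entity number, residue number, atom name). The parser below is a minimal sketch for checking such a file; the type and function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class AtomRef(NamedTuple):
    entity: int   # 1-based order of the sequence or ligand
    residue: int  # 1-based residue index (1 for a small molecule)
    atom: str     # atom name, for example CA or SG


def parse_bond(line: str) -> Tuple[AtomRef, AtomRef]:
    """Parse '1,1,CA;3,1,CHA' into a pair of AtomRef values."""
    parts = line.strip().split(";")
    if len(parts) != 2:
        raise ValueError(f"a bond needs exactly two atoms: {line!r}")
    atoms = []
    for part in parts:
        fields = [field.strip() for field in part.split(",")]
        if len(fields) != 3:
            raise ValueError(f"an atom needs three fields: {part!r}")
        entity, residue, atom = int(fields[0]), int(fields[1]), fields[2]
        if entity < 1 or residue < 1 or not atom:
            raise ValueError(f"invalid atom reference: {part!r}")
        atoms.append(AtomRef(entity, residue, atom))
    return atoms[0], atoms[1]


def parse_bond_file(text: str) -> List[Tuple[AtomRef, AtomRef]]:
    """Parse every non-empty line of a Covalent Bond file."""
    return [parse_bond(line) for line in text.splitlines() if line.strip()]
```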

    注意:

    • 当前Covalent Bond的定义中,参与成键的链不能来自结构文件(Structure),只能来自Sequence和Ligand参数中定义的序列或小分子。
    • 序列中有指定设计长度范围的情况时,按长度最小值来确认后续残基的位置。如:15-20ACS,长度范围的序列长度按最小长度计算,即15,所以残基A的位置编号是16,C是17,S是18。

    Number of Designs

    完成设计后,最终给出的结构数量,默认为30,最大支持100,设计过程中产生的结构数量在1000左右。

    结果说明

    输出参数文件design_spec.yaml
    输出设计的复合物的序列文件,final_complex.fasta
    输出设计的复合物的序列文件(Batch模式),final_complex_batch.fasta,适合一些模块的Batch模式的输入,如Structure Prediction (Boltz-2)
    输出设计的链的序列文件,final_designed_chains.fasta
    输出设计打分文件final_designs_metrics.csv,csv文件每个指标含义如下:

    列名 说明
    id 设计分子的名称
    final_rank 通过各指标综合排序后的最终排名
    absolute_score 基于多种指标(结构指标,物理能量指标)计算的综合打分,但与final rank排序并不完全一致,供参考。
    structure_confidence 基于结构指标(ptm,iptm,pae)计算的结构置信度评分,供参考。
    design_ptm 设计结构的预测TM分数(0–1),反映模型对设计蛋白整体折叠结构的置信度。数值越高表示设计结构越合理,通常 >0.7 视为高置信度。
    design_iptm 设计结构与靶点结构相互作用界面的预测TM分数(0–1),反映相互作用界面质量的置信度。数值越高表示界面结构越合理,通常 >0.7 视为高置信度。
    design_to_target_iptm 仅设计的残基与靶点结构相互作用界面的预测TM分数(0–1),反映相互作用界面质量的置信度。数值越高表示界面结构越合理,通常 >0.7 视为高置信度。
    min_design_to_target_pae 最小预测对齐误差(Å),是残基对水平的置信度指标,用来衡量任意两个残基之间相对空间位置的预测可信度。这里表示设计的结构与靶点结构的残基之间相对位置的准确度。数值越小(例如 <10 Å)准确度越高。
    plip_saltbridge_refolded 重折叠后结构中的盐桥数量。盐桥(带相反电荷残基间的电性作用力)是维持蛋白稳定性的重要因素,数量越多通常结合越稳固。
    plip_hbonds_refolded 重折叠后结构中的氢键数量。氢键是二级结构和界面互补性的关键作用力,数量越多整体稳定性越好。
    delta_sasa_refolded 重折叠前后溶剂可及表面积变化(ΔSASA, Ų)。数值越大(例如 >2000 Ų)表示疏水核心包埋程度越高,通常代表更强的热稳定性。
    filter_rmsd 整个复合物(设计+靶点)结构重折叠后与原设计结构的RMSD值,用于验证生成结构与预测结构的一致性,数值越小越好。
    design_ipsae_min 设计结构与靶点结构之间的最小ipSAE数值(分别从设计结构出发计算与靶点结构之间的ipSAE、从靶点结构出发计算与设计结构之间的ipSAE,两者中取最小值)。ipSAE是基于pAE(predicted Aligned Error)矩阵计算得到的相互作用界面评价分数,取值范围是0到1,值越大,表示预测的蛋白-蛋白相互作用界面越可靠。ipSAE > 0.7 表明相互作用界面预测质量高,结构可信;ipSAE < 0.1 表明预测中几乎不存在可信互作界面,可排除假阳性相互作用。
    design_to_target_ipsae 从设计结构出发,计算与靶点结构之间的ipSAE。
    ALA/GLY/GLU/LEU/VAL/CYS_fraction 设计的残基中,各类型氨基酸的比例
    contacts 预测结构中的接触界面残基
    contacts_overlap 与输入 hotspot 重叠的预测接触残基
    overlap_ratio 输入 hotspot 被预测接触残基覆盖的比例

    注意:只有设置Binding Hotspot参数,才会输出contacts、contacts_overlap、overlap_ratio指标

    输出设计的前5个结构rank1-5*.cif
    输出最后设计的结构打包文件final_designs.tar.gz
    输出设计的概述文件results_overview.pdf,包含结构的过滤标准(Filtering Criteria)和排序标准(Sorting Criteria)。

    • 过滤标准 (Filtering Criteria)
    列名 说明
    has_x 阈值:0.0
    序列有效性检查。确保序列中不包含未知氨基酸(“X”),必须完全由标准的 20 种天然氨基酸组成,保证序列在物理上可被合成和表达。
    filter_rmsd 阈值:< 2.5 Å
    整体骨架的 RMSD。检查整个复合物(设计+靶点)在重折叠后是否保持原样,用于验证生成结构与预测结构的一致性。
    filter_rmsd_design 阈值:< 2.5 Å
    仅针对设计部分(Binder)的骨架 RMSD。确保即使靶点有微小移动,结合剂本身的结构依然是稳定的。
    designfolding-filter_rmsd 阈值:< 2.5 Å
    独立折叠稳定性检查。在没有靶点的情况下单独折叠结合剂并计算 RMSD。用于确保结合剂能独立折叠,从而大大提高湿实验中的表达成功率。
    ALA_fraction
    GLY_fraction
    GLU_fraction
    LEU_fraction
    VAL_fraction
    阈值:< 0.3 (30%)
    序列复杂度/多样性检查。限制丙氨酸、甘氨酸、谷氨酸、亮氨酸、缬氨酸的单项占比。防止模型为了刷高结构稳定性分数而生成单一重复序列,强制要求序列具备化学多样性,以保证特异性的相互作用能力。
    • 排序标准(Sorting Criteria)
    列名 说明
    design_to_target_iptm 权重为1
    界面预测 TM 得分(0–1),用于评估蛋白–蛋白相互作用界面的结构合理性。数值越大表明界面(如结合位点)越可能形成稳定相互作用。
    design_ptm 权重为1
    预测模板建模得分(0–1),反映模型对设计蛋白整体折叠结构的置信度。数值越高表示全局结构越合理,通常 >0.7 视为高置信度。
    neg_min_design_to_target_pae 权重为1
    负的最小界面预测对齐误差 (PAE)。PAE 越低越好(误差越小),取负值是为了方便排序(数值越大越好)。它代表模型对结合界面上“最确定的那个接触点”有多大把握。
    affinity_probability_binary1 权重为1
    亲和力预测概率。主要用于小分子结合剂场景。这是模型直接预测出的“该分子能结合”的概率值。
    plip_hbonds_refolded 权重为0.5
    重折叠后结构中的氢键数量。氢键是二级结构和界面互补性的关键作用力,数量越多整体稳定性越好。
    plip_saltbridge_refolded 权重为0.5
    重折叠后结构中的盐桥数量。盐桥(带相反电荷残基间的电性作用力)是维持蛋白稳定性的重要因素,数量越多通常结合越稳固。
    delta_sasa_refolded 权重为0.5
    重折叠前后溶剂可及表面积变化(ΔSASA, Ų)。数值越大(例如 >2000 Ų)表示疏水核心包埋程度越高,通常代表更强的热稳定性。
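The filtering criteria listed above can be re-applied to final_designs_metrics.csv, for example to test stricter cut-offs. The sketch below covers the filters only and does not reproduce the weighted ranking, because the way the weights are combined is not described here. The function names are hypothetical, and it is an assumption that every filter column appears in the CSV under exactly these names; columns that are missing are skipped.

```python
import csv
from typing import List, Mapping

RMSD_COLUMNS = ("filter_rmsd", "filter_rmsd_design", "designfolding-filter_rmsd")
FRACTION_COLUMNS = (
    "ALA_fraction", "GLY_fraction", "GLU_fraction", "LEU_fraction", "VAL_fraction",
)


def passes_filters(row: Mapping[str, object],
                   max_rmsd: float = 2.5, max_fraction: float = 0.3) -> bool:
    """Apply the documented thresholds to one row of metrics."""
    if "has_x" in row and float(row["has_x"]) != 0.0:
        return False  # sequence contains unknown residues
    if any(float(row[col]) >= max_rmsd for col in RMSD_COLUMNS if col in row):
        return False
    if any(float(row[col]) >= max_fraction for col in FRACTION_COLUMNS if col in row):
        return False
    return True


def passing_designs(csv_path: str) -> List[str]:
    """Return the id of every design in the CSV that passes all filters."""
    with open(csv_path, newline="") as handle:
        return [row["id"] for row in csv.DictReader(handle) if passes_filters(row)]
```

Calling passes_filters with a smaller max_rmsd is a simple way to see how many designs survive a stricter consistency requirement.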

    设计教程

    遮蔽肽设计教程

    已知抗体结构

    1. 抗体编号
    应用WeView打开mH35抗体结构,进行抗体编号,确定重链CDR3的位置在H99-102,为遮蔽肽的结合位置
    image.png
    2. BoltzGen中输入参数设置

    • 选择Custom模式
    • Protocol中选择Peptide
    • Structure中上传mH35抗体结构
    • Chains中选择H和L链,作为受体链
    • Binding Hotspot中输入受体的结合位点,为重链的CDR3区域:H99-102
      image.png
    • Sequence中输入需要设计的多肽长度,遮蔽肽建议设计长度是:5-30
      image.png
    • 提交运行

    已知抗体序列

    1. 抗体编号
    应用WeSeq打开mH35抗体序列,进行抗体编号,确定重链CDR3的位置在99-102,为遮蔽肽的结合位置
    image.png
    image.png
    2. BoltzGen中输入参数设置

    • 选择Custom模式
    • Sequence中输入mH35抗体重轻链的序列以及遮蔽肽的长度,一条链一行,遮蔽肽建议设计长度是:5-30
    • Sequence Binding中设置受体的结合位点,为重链的CDR3区域:1:99-102
      image.png
    • 提交运行

    环肽设计教程

    已知受体结构

    • Protocol中选择Peptide。
    • Structure中上传受体结构。
    • Binding Hotspot中定义受体中结合位点(如有)。
      image.png
    • Sequence的输入分以下两种情况:
      • 如果有模板结构,则输入模板环肽序列和插入序列的长度,比如C8-9AC,在第1位残基C后面插入8-9个残基,首位C和末尾C构建环肽,如下:
        image.png
      • 如果无模板结构,可直接输入序列长度,如8-10,预测与受体结合的8-10AA长度的环肽,如下:
        image.png
    • 成环情况分为以下两种:
      • 如果环肽是头尾肽键成环,可以在Sequence Cycle中填1。
        image.png
      • 如果环肽是二硫键成环,则Sequence Cycle不填写,在Covalent Bond中填入首尾两个Cys生成二硫键的信息:1,1,SG;1,11,SG。
        image.png
    • 提交运行

    已知受体序列

    • Protocol中选择Peptide。
      image.png
    • 根据环肽情况,Sequence的输入分以下两种情况:
      • 如果环肽有模板结构,则输入受体序列、模板环肽序列及插入序列的长度,如下图,每一行是一条序列,受体有2条序列,受体序列的ID分别为1、2。环肽序列为C8-9AC(在第1位残基C后面插入8-9个残基,首位C和末尾C构建环肽),环肽位于第三行,序列ID为3。
        image.png
      • 如果无模板结构,可直接输入受体序列和环肽的序列长度,如下图,预测与受体结合的8-10AA长度的环肽。
        image.png
    • Sequence Binding中定义受体中结合位点/非结合位点(如有)。
      image.png
    • 成环情况分为以下两种:
      • 如果环肽是头尾肽键成环,可以在Sequence Cycle中填3。
        image.png
      • 如果环肽是二硫键成环,则Sequence Cycle不填写,在Covalent Bond中填入首尾两个Cys生成二硫键的信息。环肽为第3条序列时,应写为:3,1,SG;3,11,SG。
        image.png
    • 提交运行

    参考文献

    • https://hannes-stark.com/assets/boltzgen.pdf

    Binder Design (BoltzGen)

    Introduction

    The Binder Design module is designed to generate proteins, peptides, and other biomolecules capable of binding to various biological targets. It is implemented based on the BoltzGen model — a universal all-atom generative model capable of performing multimodal binder design tasks within a unified framework. Unlike earlier models that focused solely on “inference,” BoltzGen directly generates the full-atom structures of target molecules and their complexes during diffusion sampling, ensuring physically consistent and functionally feasible results through controllable constraints. It also demonstrates strong generalization, enabling effective design for novel, unseen targets.

    The BoltzGen framework inherits the Boltz family’s core objective — to learn the physical laws of multimodal molecules within a unified geometric and energetic representation. Unlike traditional prediction models that output a single structure, BoltzGen uses diffusion modeling to directly generate full atomic coordinates of molecules.

    image.png

    BoltzGen adopts a diffusion model framework to learn molecular structure distributions in full-atom coordinate space. The model adds Gaussian noise to atomic coordinates at each sampling step and progressively denoises them to approximate the real potential energy surface. Unlike traditional residue-level molecular generators, BoltzGen models each atom’s position explicitly. Using energy consistency constraints, the model learns local geometric relationships — such as bond lengths, angles, and torsions — to ensure chemical plausibility during generation. The Atom14 method is used for full-atom representation.

    image.png

    This design ensures that generated structures are not only geometrically realistic but also physically valid in terms of molecular energetics.

    The BoltzGen architecture consists of three main modules: Input Representation, Condition Encoder, and Diffusion Model, outputting full-atom 3D coordinates.

    image.png

    A distinguishing feature of BoltzGen is that it goes beyond computational validation and includes extensive wet-lab experimental evaluation. Ten categories of experiments were performed (as shown below), covering proteins, peptides, cyclic peptides, protein–small molecule complexes, and antimicrobial peptides — encompassing nearly all biological modalities addressed by structural design models. The goal was to test whether BoltzGen can generate expressible, functional binders under real experimental conditions without any homologous templates. Unlike models that only validate structural accuracy, BoltzGen emphasizes functional success — i.e., whether the generated molecules can stably bind their targets experimentally.

    image.png

    Experimental results demonstrate high consistency and generality:

    • Among 26 experimental targets, over 60% of generated candidates exhibited measurable binding activity.
    • Generated peptide and protein binders showed excellent expression performance (most with >80% solubility).
    • In cyclic peptide and antimicrobial peptide tasks, multiple samples correctly formed cyclic structures without templates.
    • In protein–small molecule binding tasks, generated complexes achieved binding poses with RMSD < 2.5 Å compared to known complexes.

    image.png

    In the BoltzGen paper, during wet-lab validation of antibody and binder generation, the primary form of input for the antigen (target protein) is structural information, although sequence-only input is also supported in specific scenarios.

    Default Input Mode: Structure
    The paper explicitly states that, unless otherwise specified, the target structure is provided as input to BoltzGen in the experiments. For example, when designing nanobodies and protein binders against nine novel targets, the researchers relied on the structural information of the targets.

    Flexibility: Sequence Input and Cofolding
    BoltzGen is an all-atom generative model capable of performing structure prediction and protein design simultaneously. When only the target sequence is provided, the model can cofold the target and the binder, folding the target while designing the binder and ultimately generating the atomic structure of the bound complex.

    Special Cases

    • No Fixed Structure Input
      When designing peptides targeting the disordered region of the NPM1 protein, the researchers adopted a “no fixed structure input” strategy. They provided the structure of the ordered regions of NPM1 while leaving the disordered region flexible, in order to test how the model performs on targets that lack a fixed structure.

    • Small-Molecule Targets
      For small-molecule targets, BoltzGen requires only a SMILES string (a sequence-based representation of molecular structure) as input and performs cofolding during the design process.

    Summary

    In summary, although BoltzGen is capable of performing design directly from sequence-only inputs, in the majority of the wet-lab validation experiments reported in the paper—especially those involving novel protein targets—structural information was used as the primary form of input.


    Parameters

    De Novo Antibody

    Type

    Specifies the antibody type. Currently supports Antibody (conventional antibodies) and Nanobody.

    Antigen Structure

    Upload an existing antigen structure in PDB or CIF format.

    Antigen Chains

    Specify which chains in the structure should be extracted as the antigen.
    Multiple chains are allowed, e.g., A,B.
    If not set, all chains in the structure are used by default.

    Antigen Sequence

    If no antigen structure is available, you may upload an antigen sequence in FASTA format.
    Multi-chain sequences are supported.

    Binding Hotspot

    Specify which residues on the antigen participate in binding, using the format
    ChainName + ResidueIndex (indexing starts from 1), such as:
    A10-20,A25,B30-36,B40.

    This represents:

    • Chain A: residues 10–20 and 25
    • Chain B: residues 30–36 and 40

    Notes:

    1. When using an antigen sequence file, chain names are assigned alphabetically based on sequence order: the first chain is A, the second is B, and so on.
    2. If this parameter is not set, the model will automatically search for potential binding sites.
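
    The hotspot format above can be expanded into explicit residue lists before submission, which is a convenient way to double-check a selection. The following is a minimal sketch, not the platform's own parser; the function name parse_residue_spec is hypothetical, and chain names are assumed to consist of letters only.

```python
import re
from collections import defaultdict

# One item is a chain name followed by a residue index or an inclusive range,
# e.g. "A25" or "A10-20". Letter-only chain names are an assumption.
_ITEM = re.compile(r"^([A-Za-z]+)(\d+)(?:-(\d+))?$")


def parse_residue_spec(spec: str) -> dict[str, list[int]]:
    """Expand a spec such as 'A10-20,A25,B30-36,B40' into residues per chain."""
    residues: dict[str, set[int]] = defaultdict(set)
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        match = _ITEM.match(item)
        if match is None:
            raise ValueError(f"Cannot parse item: {item!r}")
        chain = match.group(1)
        start = int(match.group(2))
        stop = int(match.group(3)) if match.group(3) is not None else start
        # Indexing starts from 1, and a range must not run backwards.
        if start < 1 or stop < start:
            raise ValueError(f"Invalid residue range: {item!r}")
        residues[chain].update(range(start, stop + 1))
    return {chain: sorted(indices) for chain, indices in residues.items()}
```

    For the example A10-20,A25,B30-36,B40 this yields residues 10 to 20 plus 25 for chain A, and 30 to 36 plus 40 for chain B. The same chain-plus-range syntax is used by Include, Exclude, and Design Positions in Custom mode.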

    Custom Templates

    Supports uploading custom antibody or nanobody template structures. The FR regions from the template structures will be adopted, while the CDR regions (Chothia numbering) will be redesigned. Options:

    • Single structure file (.pdb or .cif)
    • Batch structure files (compressed archive format)

    When multiple template structures are provided, each template structure will be used for design.

    If no custom template is provided, the system will use built-in default antibody and nanobody templates, listed below:

    Antibody Templates

    • 6CR1 — Adalimumab (Humira)

      • Target: TNF-α
      • Mechanism: Blocks TNF-α binding to its receptor, inhibiting inflammatory response
    • 6WGB — Dupilumab (Dupixent)

      • Target: IL-4Rα
      • Mechanism: Blocks IL-4 / IL-13 signaling pathway, suppressing type 2 inflammation
    • 3HMW — Ustekinumab (Stelara)

      • Target: IL-12 / IL-23 p40
      • Mechanism: Simultaneously inhibits Th1 and Th17 inflammatory pathways

    Nanobody Templates

    • 7EOW — Caplacizumab

      • Target: vWF A1 domain
      • Mechanism: Blocks vWF-platelet binding, inhibiting thrombosis
    • 7XL0 — Vobarilizumab (ALX-0061)

      • Target: IL-6R (plus albumin binding)
      • Mechanism: Inhibits IL-6 signaling and extends half-life
    • 8COH — TPP-3444 (Gefurulimab / ALXN1720 component)

      • Target: Complement C5
      • Mechanism: Inhibits complement activation
    • 8Z8V — ALB8 (Ozoralizumab / ATN-103 component)

      • Target: Human serum albumin (HSA)
      • Mechanism: Extends drug half-life
    • Gontivimab (ALX-0171)

      • Target: RSV F protein
      • Mechanism: Blocks viral fusion, preventing infection
    • Isecarosmab (M-6495 / ALX-1141)

      • Target: ADAMTS-5
      • Mechanism: Inhibits cartilage degradation, with anti-inflammatory effects
    • Sonelokimab

      • Target: IL-17A / IL-17F
      • Mechanism: Dual inhibition of inflammatory cytokines, enhancing anti-inflammatory efficacy

    Number of Designs

    Number of final generated structures. Default: 20, Max: 100. Roughly 1000 candidate structures are sampled during the process.

    Custom

    Protocol

    There are six design modes:

    • Protein – Design proteins that bind to a target (protein or peptide), or design standalone protein monomers.
    • Peptide – Design peptides (linear or cyclic) that bind to a target protein.
    • Small_Molecule – Design proteins that bind to small molecules.
    • Nanobody – Design nanobodies that bind to a target, or standalone nanobodies.
    • Antibody – Design conventional antibodies that bind to a target, or standalone conventional antibodies.
    • Redesign – Redesign and optimize specified residues in an existing protein or complex structure.

    There are three approaches to defining the design rules:

    • Based on existing structures, by extracting or redesigning specific regions.
    • Based on sequences, specifying which residues to design or keep fixed.
    • Based on small molecules, defining the binding partner using a molecular file.

    These approaches can be combined freely.


    Structure

    Upload an existing protein structure to extract or redesign certain regions, e.g., selecting specific chains such as antigen, nanobody, or receptor chains.

    Chains

    Specify chain IDs extracted from Structure, e.g., A,B.
    If not set, all chains will be extracted.

    Include

    From the selected chains (Chains), specify which residues to extract using chainID + residue range, e.g.:
    A10-20,A25,B1-36,B40
    This extracts residues 10–20 and 25 from chain A, and residues 1–36 and 40 from chain B.
    If not set, all residues in Chains are extracted.

    Exclude

    Specify residues not to extract from selected chains. Same format as Include, e.g. A15,B36-42.

    Design Positions

    Specify residues to redesign within the extracted structure, same format as Include, e.g. A10-12,B15,B40.
    Note: The positions must correspond to residues that exist in the extracted structure.

    Design SS

    Specify secondary structure types for designed residues using the format:

    A,HELIX:10-12
    B,SHEET:15,LOOP:40
    

    Accepted types: LOOP, HELIX, SHEET (case-insensitive).
    If not specified, secondary structures are not constrained.
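
    As an illustration of the Design SS format, the sketch below turns each line into a per-residue constraint map. It assumes that every entry after the chain ID has the form TYPE:residue or TYPE:start-end, as in the two examples above; the function name parse_design_ss is hypothetical and this is not the platform's implementation.

```python
SS_TYPES = {"LOOP", "HELIX", "SHEET"}


def parse_design_ss(text: str) -> dict[str, dict[int, str]]:
    """Map chain ID -> {residue index -> secondary structure type}."""
    constraints: dict[str, dict[int, str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The first field is the chain ID, the rest are TYPE:span entries.
        chain, *entries = [part.strip() for part in line.split(",")]
        per_chain = constraints.setdefault(chain, {})
        for entry in entries:
            ss_type, _, span = entry.partition(":")
            ss_type = ss_type.upper()  # accepted types are case-insensitive
            if ss_type not in SS_TYPES or not span:
                raise ValueError(f"Bad entry {entry!r} in line {line!r}")
            start, _, end = span.partition("-")
            first, last = int(start), int(end or start)
            for index in range(first, last + 1):
                per_chain[index] = ss_type
    return constraints
```

    Because the chain ID is kept as a string, the same sketch also reads the Sequence SS format, where the first field is a sequence ID such as 1 or 2.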

    Design Insertions

    Define insertion mutations using the format:

    A,10,5
    B,15,5-10,HELIX
    

    Meaning: insert 5 residues after residue 10 of chain A; insert 5–10 residues after residue 15 of chain B with HELIX conformation.
    Accepted secondary structure types: LOOP, HELIX, SHEET.
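
    The insertion lines can be read into small records as sketched below. The Insertion class and parse_insertions function are hypothetical names used for illustration only; the field order (chain, residue, length or length range, optional secondary structure type) is taken from the two example lines above.

```python
from dataclasses import dataclass
from typing import Optional

SS_TYPES = {"LOOP", "HELIX", "SHEET"}


@dataclass(frozen=True)
class Insertion:
    chain: str
    after_residue: int
    min_length: int
    max_length: int
    ss_type: Optional[str] = None


def parse_insertions(text: str) -> list[Insertion]:
    """Parse lines such as 'A,10,5' or 'B,15,5-10,HELIX'."""
    insertions = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",") if field.strip()]
        if not fields:
            continue
        if len(fields) not in (3, 4):
            raise ValueError(f"Expected 3 or 4 fields: {line!r}")
        chain, position, length = fields[:3]
        low, _, high = length.partition("-")
        ss_type = fields[3].upper() if len(fields) == 4 else None
        if ss_type is not None and ss_type not in SS_TYPES:
            raise ValueError(f"Unknown secondary structure type: {fields[3]!r}")
        # A single number means a fixed insertion length (min == max).
        insertions.append(
            Insertion(chain, int(position), int(low), int(high or low), ss_type)
        )
    return insertions
```

    Storing both bounds makes a fixed-length insertion (5) and a variable-length one (5-10) look the same to downstream code.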

    Binding Hotspot

    Specify which residues participate in binding (e.g., between chains or with small molecules), using the same format as Include, e.g. A12,B15-18.

    Non Binding

    Specify residues not involved in binding.


    Structure Repetition

    Same definition as Structure. For example, specify an existing binder structure.

    Repetition Chains

    Follow the same rules as the corresponding parameters above.

    Repetition Include

    Follow the same rules as the corresponding parameters above.

    Repetition Exclude

    Follow the same rules as the corresponding parameters above.

    Repetition Design Positions

    Follow the same rules as the corresponding parameters above.

    Repetition Design SS

    Follow the same rules as the corresponding parameters above.

    Repetition Design Insertions

    Follow the same rules as the corresponding parameters above.

    Repetition Binding Hotspot

    Follow the same rules as the corresponding parameters above.

    Repetition Non Binding

    Follow the same rules as the corresponding parameters above.


    Sequence

    Specify the designed protein sequences, one per line, e.g.:

    AAVTTTTPPP
    15-20AAAAAAVTTTT18PPP
    

    Letters represent fixed residues; a number gives how many residues are to be designed at that point in the sequence.
    A range (e.g., 15-20) indicates a variable design length, chosen randomly within the range.
    Sequence IDs start from 1 by default.
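
    To make the Sequence notation concrete, the sketch below splits one line into fixed and designed segments and reports the shortest and longest total length it can produce. The function names are hypothetical and the parser is only an illustration of the notation described above.

```python
import re

# A token is either a design length ("18" or "15-20") or a run of fixed letters.
_TOKEN = re.compile(r"(\d+)(?:-(\d+))?|([A-Za-z]+)")


def tokenize_sequence_spec(spec: str) -> list[tuple]:
    """Return ('fixed', letters) and ('design', (min_len, max_len)) tokens."""
    tokens = []
    position = 0
    for match in _TOKEN.finditer(spec):
        if match.start() != position:
            raise ValueError(f"Unexpected character at offset {position}")
        position = match.end()
        low, high, letters = match.groups()
        if letters is not None:
            tokens.append(("fixed", letters.upper()))
        else:
            tokens.append(("design", (int(low), int(high or low))))
    if position != len(spec):
        raise ValueError(f"Unexpected character at offset {position}")
    return tokens


def length_bounds(spec: str) -> tuple[int, int]:
    """Shortest and longest total sequence length allowed by the spec."""
    shortest = longest = 0
    for kind, value in tokenize_sequence_spec(spec):
        if kind == "fixed":
            shortest += len(value)
            longest += len(value)
        else:
            shortest += value[0]
            longest += value[1]
    return shortest, longest
```

    For the second example line, the tokens are a 15 to 20 residue designed segment, a fixed stretch, an 18 residue designed segment, and the fixed tail PPP; the first example line contains only fixed residues.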

    Sequence Binding

    Specify which residues in the sequence are involved in binding:

    1:5,8-10
    2:30-35
    

    Binding residues are indexed based on the minimum sequence length when ranges are used.
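
    A short sketch of how the Sequence Binding lines could be read follows. It assumes one line per sequence in the form SequenceID:residues, as in the example above; parse_sequence_binding is a hypothetical helper, not part of the platform.

```python
def parse_sequence_binding(text: str) -> dict[int, list[int]]:
    """Parse lines such as '1:5,8-10' into {sequence ID: [residue indices]}."""
    binding: dict[int, set[int]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        seq_id, separator, spans = line.partition(":")
        if not separator:
            raise ValueError(f"Missing ':' in line {line!r}")
        residues = binding.setdefault(int(seq_id), set())
        for span in spans.split(","):
            start, _, end = span.strip().partition("-")
            # A single index is treated as a range of length one.
            residues.update(range(int(start), int(end or start) + 1))
    return {seq_id: sorted(indices) for seq_id, indices in binding.items()}
```

    With the example input, sequence 1 gets residues 5, 8, 9, and 10, and sequence 2 gets residues 30 to 35. The indices refer to positions counted with the minimum design length, as stated above.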

    Sequence Non Binding

    Opposite of Sequence Binding, defines residues not involved in binding.

    Sequence SS

    Define secondary structure for sequence residues:

    1,HELIX:5-8
    2,SHEET:15,LOOP:40
    

    Positions are determined based on the minimum sequence length when variable ranges exist.

    Sequence Cycle

    Specify which sequences are cyclic by sequence ID, e.g. 1,2 means that the first and second sequences are each cyclized (head-to-tail connected).

    Ligand

    Specify small molecules involved in binding.
    Supports SMILES or CCD Code formats.

    Examples:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
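
    The two ligand notations can be told apart by the leading CCD keyword. The sketch below only classifies a line and makes no claim about how the platform interprets several CCD codes on one line; parse_ligand_line is a hypothetical name, and the check relies on the fact that SMILES strings do not contain commas.

```python
def parse_ligand_line(line: str) -> tuple[str, list[str]]:
    """Classify one Ligand line as ('ccd', [codes]) or ('smiles', [smiles])."""
    text = line.strip()
    if not text:
        raise ValueError("Empty ligand line")
    fields = [field.strip() for field in text.split(",")]
    if fields[0].upper() == "CCD":
        # Everything after the keyword is treated as a CCD code.
        codes = [code.upper() for code in fields[1:] if code]
        if not codes:
            raise ValueError("CCD line without any code")
        return "ccd", codes
    # Anything else is passed through unchanged as a SMILES string.
    return "smiles", [text]
```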
    

    Covalent Bond

    TXT file defining covalent bonds.
    Each line specifies a bond between two atoms using the format:

    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
    

    Each atom entry = EntityID,ResidueIndex,AtomName.
    Entity IDs are assigned based on the input order of sequences or ligands (starting from 1).
    When the small molecule is provided as a SMILES string, its atoms are referred to by element symbol plus order of occurrence. For example, suppose the ligand CC(=O)NCCNC(C)=O has sequential index 3 (determined as described above) and its first carbon atom forms a covalent bond with the CA atom of the first residue in sequence 1. The covalent bond is then defined as:

    1,1,CA;3,1,C1

    Here, C1 denotes the first carbon atom of the small molecule. If it is the second carbon atom, it should be specified as C2.
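
    Before uploading a Covalent Bond file, it can help to check that every line has exactly two atom definitions with three fields each. The sketch below does this under the format described above; BondAtom and parse_covalent_bonds are hypothetical names and the checks are illustrative rather than the platform's own validation.

```python
from typing import NamedTuple


class BondAtom(NamedTuple):
    entity_id: int       # input order of the sequence or ligand, starting from 1
    residue_index: int   # residue position within that entity
    atom_name: str       # e.g. CA, SG, or C1 for a SMILES ligand


def parse_covalent_bonds(text: str) -> list[tuple[BondAtom, BondAtom]]:
    """Parse lines of the form 'EntityID,ResidueIndex,AtomName;EntityID,...'."""
    bonds = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        atoms = []
        for entry in line.split(";"):
            fields = [field.strip() for field in entry.split(",")]
            if len(fields) != 3:
                raise ValueError(
                    f"Line {line_number}: expected EntityID,ResidueIndex,AtomName"
                )
            atoms.append(BondAtom(int(fields[0]), int(fields[1]), fields[2]))
        if len(atoms) != 2:
            raise ValueError(f"Line {line_number}: a bond needs exactly two atoms")
        bonds.append((atoms[0], atoms[1]))
    return bonds
```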

    Notes:

    • In the current definition of Covalent Bond, the sequences involved must not come from structure files (Structure); they can only come from sequence files (Sequence and Ligand).
    • When a sequence specifies a design length range, the minimum length is used to determine subsequent residue positions.
      For example, for 15-20ACS, the sequence length is taken as 15. Therefore, the position indices are: A = 16, C = 17, S = 18.
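
    The minimum-length rule can be worked through in a few lines. The sketch below computes the 1-based position of every fixed residue when each design range is counted at its lower bound; fixed_positions is a hypothetical helper and it does not validate the input beyond what the pattern matches.

```python
import re


def fixed_positions(spec: str) -> dict[int, str]:
    """Map 1-based position -> fixed residue letter, using minimum design lengths."""
    positions = {}
    index = 0
    # Each match is either a design length (only the lower bound is captured)
    # or a run of fixed residue letters.
    for number, letters in re.findall(r"(\d+)(?:-\d+)?|([A-Za-z]+)", spec):
        if number:
            index += int(number)
        else:
            for letter in letters.upper():
                index += 1
                positions[index] = letter
    return positions
```

    For 15-20ACS this gives A = 16, C = 17, S = 18, matching the example above. For a cyclic peptide template such as C8-9AC, the two cysteines fall at positions 1 and 11, which are the residue indices to use when defining a disulfide bond between them.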


    Number of Designs

    Number of final generated structures. Default: 30, Max: 100.
    Roughly 1000 candidate structures are sampled during the process.


    Results

    • Output parameter file: design_spec.yaml

    • Output the sequence file of the designed complex: final_complex.fasta

    • Output the sequence file of the designed complex (Batch mode): final_complex_batch.fasta, suitable for Batch-mode inputs of some modules, such as Structure Prediction (Boltz-2)

    • Output the sequence file of the designed chains: final_designed_chains.fasta

    • Output the design scoring file: final_designs_metrics.csv. The meaning of each metric in the CSV file is as follows:

    Column Name Description
    id Name of the designed molecule
    final_rank Final ranking after comprehensive sorting based on all metrics
    absolute_score A composite score calculated from multiple metrics (structural metrics and physical energy metrics). It does not fully correspond to the final_rank ordering and is provided for reference.
    structure_confidence Structural confidence score calculated from structural metrics (pTM, ipTM, PAE), for reference.
    design_ptm Predicted Template Modeling score (0–1), reflecting confidence in the overall fold of the designed protein. Higher values indicate a more reasonable global structure; typically, values >0.7 are considered high confidence.
    design_to_target_iptm Interface predicted TM score (0–1), used to evaluate the structural plausibility of the protein–protein interaction interface. Higher values indicate a greater likelihood of forming a stable interface (e.g., binding site).
    min_design_to_target_pae Minimum Predicted Aligned Error (Å), a residue-pair–level confidence metric that measures the predicted reliability of relative spatial positions between residues. Here it represents the accuracy of relative positioning between residues of the designed structure and the target structure. Smaller values (e.g., <10 Å) indicate higher accuracy.
    plip_saltbridge_refolded Number of salt bridges in the refolded structure. Salt bridges (electrostatic interactions between oppositely charged residues) are important for protein stability; higher numbers generally indicate more stable binding.
    plip_hbonds_refolded Number of hydrogen bonds in the refolded structure. Hydrogen bonds are key forces for secondary structure formation and interface complementarity; higher numbers usually imply better overall stability.
    delta_sasa_refolded Change in solvent-accessible surface area before and after refolding (ΔSASA, Ų). Larger values (e.g., >2000 Ų) indicate greater burial of the hydrophobic core and usually represent stronger thermal stability.
    contacts Contact interface residues in the predicted structure
    contacts_overlap Predicted contact residues that overlap with the input hotspots
    overlap_ratio Proportion of input hotspots covered by predicted contact residues

    Note: The contacts, contacts_overlap, and overlap_ratio metrics are output only when the Binding Hotspot parameter is set.
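
    The metrics file can be post-processed with a few lines of standard-library code. The sketch below shortlists designs using the two rule-of-thumb thresholds quoted in the table (design_ptm above 0.7 and min_design_to_target_pae below 10 Å) and orders them by final_rank. The function name shortlist and the sample rows are made up for illustration; real files contain more columns.

```python
import csv
import io


def shortlist(csv_text: str, min_ptm: float = 0.7, max_pae: float = 10.0) -> list[dict]:
    """Keep rows passing both thresholds, ordered by final_rank (best first)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kept = [
        row
        for row in rows
        if float(row["design_ptm"]) > min_ptm
        and float(row["min_design_to_target_pae"]) < max_pae
    ]
    return sorted(kept, key=lambda row: int(row["final_rank"]))


# Illustrative values only, not real design results.
SAMPLE = """id,final_rank,design_ptm,design_to_target_iptm,min_design_to_target_pae
design_a,2,0.82,0.71,4.5
design_b,1,0.65,0.80,3.1
design_c,3,0.90,0.55,12.0
"""

for row in shortlist(SAMPLE):
    print(row["id"], row["final_rank"])
```

    In the sample, design_b is dropped for its low pTM and design_c for its high PAE, so only design_a remains. The thresholds are parameters because the values in the table are guidelines rather than hard limits.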

    • Output the top 5 designed structures: rank1-5*.cif

    • Output the packaged file of the final designed structures: final_designs.tar.gz

    • The design overview file results_overview.pdf summarizes the Filtering Criteria and Sorting Criteria used for structural evaluation and ranking.

    • Filtering Criteria

    Column Description
    has_x Threshold: 0.0
    Sequence validity check. Ensures that the sequence contains no unknown amino acids (“X”) and is composed exclusively of the 20 standard natural amino acids, guaranteeing physical synthesizability and expressibility.
    filter_rmsd Threshold: < 2.5 Å
    Overall backbone RMSD. Evaluates whether the entire complex (design + target) maintains its structure after refolding, verifying consistency between the generated and predicted structures.
    filter_rmsd_design Threshold: < 2.5 Å
    Backbone RMSD of the designed component (Binder) only. Ensures that the binder itself remains structurally stable even if the target undergoes minor movements.
    designfolding-filter_rmsd Threshold: < 2.5 Å
    Independent folding stability check. The binder is folded without the target, and RMSD is computed to ensure it can fold autonomously, substantially improving the likelihood of successful experimental expression.
    ALA_fraction, GLY_fraction, GLU_fraction, LEU_fraction, VAL_fraction Threshold: < 0.3 (30%)
    Sequence complexity/diversity control. Limits the individual fractions of alanine, glycine, glutamate, leucine, and valine to prevent the model from generating overly repetitive sequences to artificially boost stability scores. This enforces chemical diversity and promotes specific interactions.
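
    The filtering thresholds listed above can be expressed as a simple check over one design's metrics. This is a sketch under stated assumptions: the column names are taken from the table, has_x is assumed to pass only when it equals 0, and failed_filters is a hypothetical helper rather than the pipeline's own code.

```python
RMSD_COLUMNS = ("filter_rmsd", "filter_rmsd_design", "designfolding-filter_rmsd")
FRACTION_COLUMNS = (
    "ALA_fraction",
    "GLY_fraction",
    "GLU_fraction",
    "LEU_fraction",
    "VAL_fraction",
)


def failed_filters(metrics: dict[str, float]) -> list[str]:
    """Return the names of all filters that a design does not pass."""
    failed = []
    if metrics["has_x"] != 0.0:  # assumed: any unknown residue fails the design
        failed.append("has_x")
    # RMSD filters require a value strictly below 2.5 Å.
    failed += [name for name in RMSD_COLUMNS if not metrics[name] < 2.5]
    # Composition filters require each fraction strictly below 0.3.
    failed += [name for name in FRACTION_COLUMNS if not metrics[name] < 0.3]
    return failed
```

    Returning the list of failed filters, rather than a single pass/fail flag, makes it easier to see why a candidate was rejected.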
    • Sorting Criteria
    Column Description
    design_to_target_iptm Weight = 1
    Interface Predicted TM score (0–1), used to assess the structural plausibility of the protein–protein interaction interface. Higher values indicate a greater likelihood of forming stable interactions at the interface (e.g., binding sites).
    design_ptm Weight = 1
    Predicted Template Modeling score (0–1), reflecting confidence in the global fold of the designed protein. Higher values indicate a more plausible overall structure; values >0.7 are typically considered high confidence.
    neg_min_design_to_target_pae Weight = 1
    Negative minimum Predicted Aligned Error (PAE) at the interface. Lower PAE indicates better accuracy (smaller error); the negative sign is used to facilitate ranking (higher is better). This metric reflects the model’s confidence in the most certain contact point at the binding interface.
    affinity_probability_binary1 Weight = 1
    Predicted binding affinity probability, primarily used in small-molecule binder scenarios. This is the model’s direct estimate of the probability that the molecule binds.
    plip_hbonds_refolded Weight = 0.5
    Number of hydrogen bonds in the refolded structure. Hydrogen bonds are critical for secondary structure formation and interface complementarity; higher counts generally indicate better overall stability.
    plip_saltbridge_refolded Weight = 0.5
    Number of salt bridges in the refolded structure. Salt bridges (electrostatic interactions between oppositely charged residues) are key contributors to protein stability; higher counts typically correspond to stronger binding.
    delta_sasa_refolded Weight = 0.5
    Change in solvent-accessible surface area upon refolding (ΔSASA, Ų). Larger values (e.g., >2000 Ų) indicate greater burial of hydrophobic cores, generally associated with higher thermal stability.

    Design Tutorial

    Masking Peptide Design Tutorial

    Known Antibody Structure

    1. Antibody Numbering
    Open the mH35 antibody structure in WeView and perform antibody numbering. The heavy chain CDR3 is located at H99-102, which serves as the binding site for the masking peptide.
    image.png
    2. Parameter Settings in BoltzGen

    • Select Custom mode
    • Select Peptide in Protocol
    • Upload mH35 antibody structure in Structure
    • Select H and L chains in Chains as receptor chains
    • Input the receptor binding site in Binding Hotspot, which is the CDR3 region of the heavy chain: H99-102
      image.png
    • Input the peptide length to be designed in Sequence. The recommended design length for masking peptides is: 5-30
      image.png
    • Submit and run

    Known Antibody Sequence

    1. Antibody Numbering
    Open the mH35 antibody sequence in WeSeq and perform antibody numbering. The heavy chain CDR3 is located at residues 99-102, which serves as the binding site for the masking peptide.
    image.png
    image.png
    2. Parameter Settings in BoltzGen

    • Select Custom mode
    • Input the heavy and light chain sequences of the mH35 antibody and the length of the masking peptide in Sequence, one chain per line. The recommended design length for masking peptides is: 5-30
    • Set the receptor binding site in Sequence Binding, which is the CDR3 region of the heavy chain: 1:99-102
      image.png
    • Submit and run

    Cyclic Peptide Design Tutorial

    Known Receptor Structure

    • Select Peptide in Protocol.
    • Upload receptor structure in Structure.
    • Define binding hotspots/non-binding sites (if any) in the receptor in Binding Hotspot.
      image.png
    • Sequence input is divided into the following two cases:
      • If there is a template structure, input the template cyclic peptide sequence together with the length of the inserted segment. For example, C8-9AC inserts 8-9 designed residues after the first residue (C), with the first and last C closing the cycle, as follows:
        image.png
      • If there is no template structure, you can directly input the sequence length, such as 8-10, to predict cyclic peptides of 8-10AA length that bind to the receptor, as follows:
        image.png
    • Cyclization can be done in one of two ways:
      • If the peptide is cyclized by a head-to-tail peptide bond, enter 1 in Sequence Cycle.
        image.png
      • If the peptide is cyclized by a disulfide bond, leave Sequence Cycle empty and enter the disulfide bond 1,1,SG;1,11,SG in Covalent Bond.
        image.png
    • Submit and run

    Known Receptor Sequence

    • Select Peptide in Protocol.
      image.png
    • According to the cyclic peptide situation, Sequence input is divided into the following two cases:
      • If the cyclic peptide has a template structure, input the receptor sequences, the template cyclic peptide sequence, and the length of the inserted segment, one sequence per line. In the figure below, the receptor consists of two sequences with sequence IDs 1 and 2. The cyclic peptide is C8-9AC (8-9 designed residues inserted after the first residue C, with the first and last C closing the cycle); it is on the third line and therefore has sequence ID 3.
        image.png
      • If there is no template structure, you can directly input the receptor sequence and the sequence length of the cyclic peptide. As shown in the figure below, predict cyclic peptides of 8-10AA length that bind to the receptor.
        image.png
    • Define binding hotspots/non-binding sites (if any) in the receptor in Sequence Binding.
      image.png
    • Cyclization can be done in one of two ways:
      • If the peptide is cyclized by a head-to-tail peptide bond, enter the peptide's sequence ID in Sequence Cycle (3 in this example).
        image.png
      • If the peptide is cyclized by a disulfide bond, leave Sequence Cycle empty and enter the disulfide bond in Covalent Bond using the peptide's sequence ID, e.g. 3,1,SG;3,11,SG when the peptide is sequence 3.
        image.png
    • Submit and run

    Reference

    • https://hannes-stark.com/assets/boltzgen.pdf
  • Name: Antibody Design (IgGM)
    Description: IgGM是一种新型生成式基础模型,旨在加速高亲和力抗体的工程化设计。 A new generative foundation model developed to accelerate the engineering of high-affinity antibodies.
    Tags: undefined
    Author: Rubo Wang
    Release: 2025-10-21 16:33:47
    Reference: Wang, R., Wu, F., Shi, J., Song, Y., Kong, Y., Ma, J., He, B., Yan, Q., Ying, T., Zhao, P., Gao, X., & Yao, J. (2025). A Generative Foundation Model for Antibody Design. bioRxiv.

    Antibody Design (IgGM)

    简介

    基于抗原结构或抗原-抗体复合物结构进行抗体设计,需要有初始抗体序列。模块基于IgGM模型实现。IgGM是一种新型生成式基础模型,旨在加速高亲和力抗体的工程化设计。其学习抗原与抗体之间复杂的结合规律,以及抗体序列与结构之间的映射关系,从而支持多种抗体设计任务。在针对多种抗原的体外实验和计算机模拟基准评估中,其能稳定地产生具有高实测亲和力的抗体或纳米抗体。充分展示了其多样性与高效性,凸显其作为下一代抗体发现与优化强大工具的潜力。

    IgGM主要由三个核心组件组成:

    • 序列特征提取:利用预训练的蛋白语言模型(PPSM)来提取抗体序列的进化特征,就像在自然语言中理解语法和语义一样。
    • 抗原-抗体交互建模(Sgformer):这是关键的一步,它能够学习抗体和抗原之间的结合规律,而不仅仅是单独的抗体结构。
    • 生成预测模块:在前两步的基础上,直接输出抗体的序列和结构。

    IgGM的模型框架如下图所示:
    image.png

    对比结果显示,IgGM在多个CDR区域的预测准确性均高于ProteinMPNN、ProteinMPNN(Filtered)、IgMPNN与IgDesign(如下图所示):
    image.png
    这些结果表明,IgGM的设计与优化策略特别适合捕捉这些关键CDR区域的复杂结构与功能特征,从而提升整体的抗体设计效率。

    参数说明

    Complex

    Structure

    用于抗体设计的抗体-抗原复合物结构(支持普通抗体或纳米抗体),PDB格式。
    注意:当前只支持单链抗原,如存在多链时会默认提取第一条抗原链(或通过后续Chain参数指定抗原链)。

    Chain

    指定抗原链,仅单链。

    Positions

    定义抗体中需要进行设计的残基。
    指定格式为:链类型 + 残基编号或编号范围,其中链类型仅支持 H(重链) 和 L(轻链)。
    多个残基或编号范围之间使用逗号分隔。

    例如,参数设置为:

    H27,H28,H99,H100-103,L24-32
    

    表示:

    • 对 H 链 中编号为 27、28、99、100 至 103 的残基进行设计;
    • 对 L 链 中编号为 24 至 32 的残基进行设计。

    注意:

    1. 这里的残基编号是指从1开始的残基位置顺序编号,不是原PDB文件中的残基编号。
    2. 如果不指定链类型,则同时应用于所有抗体链。如24-32表示设计所有抗体链中的编号为24-32的残基。

    Number of Designs

    指定设计的抗体数量,默认为20,最大支持1000。

    Design Type

    指定需要使用的设计模型类型,有三种选择:

    • Design:通用设计模型,默认选择。
    • FR Design:专为抗体FR区域设计提供的模型,在进行FR区域设计时可选择。
    • Inverse Design:逆折叠模型,固定抗体结构骨架不变,进行序列设计,在使用抗体-抗原复合物时,可以选择。

    Relax

    指定是否进行结构Relax(使用OpenMM完成),默认不进行。在设计数量较大时,计算时间会显著增加。

    Output Prefix

    指定输出文件的前缀,默认为Result,则输出的文件名称为 Result_编号.fasta 与 Result_编号.pdb

    Antigen

    Structure

    指定抗原的结构文件,PDB格式。当前只支持单链抗原,如存在多链时会默认提取第一条抗原链(或通过后续Chain参数指定抗原链)。

    Chain

    指定抗原链,仅单链。

    Sequence

    指定普通抗体Fv区 或者 纳米抗体 的初始序列,fasta格式。如:

    >H
    QIQLVQSGPELKKPGETVKISCKASGYTFTDYGLNWVKQAPGKGLKWMGWINTYSGEPTYNDEFRGRFAFSLETSTITAYLKINNLKNEDTATYFCARGGNWDWYFDVWGAGTTVTVSS
    >L
    DIVLTQSPATLSVTPGDNVSLSCRASQIISNNLHWYQQKSHESPRLLIKYASQSISGIPSRFSGSGSGTDFTLSINSVETEDFGMYFCQQSNTWPLTCGSGTKLELN
    
    >nanobody
    QVQLVESGGTLVQPGGSLRLSCAASRNISQIAILGWYRQAPGKQREAVAIITGGGPAHYADSVKGRFTISRDTAKNTTYLQMDSLKPEDTAVYFCNVRIRWGAADSWGQGIQVTVSS
    

    Positions

    定义抗体中需要进行设计的残基。
    指定格式为:链类型 + 残基编号或编号范围,其中链类型仅支持 H(重链) 和 L(轻链)。
    多个残基或编号范围之间使用逗号分隔。

    例如,参数设置为:

    H27,H28,H99,H100-103,L24-32
    

    表示:

    • 对 H 链 中编号为 27、28、99、100 至 103 的残基进行设计;
    • 对 L 链 中编号为 24 至 32 的残基进行设计。

    注意:

    1. 这里的残基编号是指从1开始的残基位置顺序编号。
    2. 如果不指定链类型,则同时应用于所有抗体链。如24-32表示设计所有抗体链中的编号为24-32的残基。

    Epitope

    指定抗原链上的结合位点信息,格式:1-5,10,20

    Number of Designs

    同complex模式中的定义。

    Design Type

    指定需要使用的设计模型类型,有两种选择:

    • Design:通用设计模型,可指定抗体的任意区域进行设计,默认选择。
    • FR Design:专为抗体FR区域设计提供的模型,在进行FR区域设计时可选择。

    Relax

    同complex模式中的定义。

    Output Prefix

    同complex模式中的定义。

    结果说明

    • 设计结果对应的序列文件,fasta格式。经过去重处理,序列重复出现的频率也会保留到序列名中。单独输出5个序列文件直接查阅,所有序列文件会打包为seqs.tar.gz。注意:序列排名不分先后。
    • 相应的结构PDB文件,使用openMM模块进行了结构relax,并补全侧链结构。所有PDB文件的打包文件pdbs.tar.gz。
    • 复合物序列文件,fasta格式,包含设计的抗体序列与对应的抗原序列,用英文冒号:进行分隔。

    参考文献

    • Wang, R., Wu, F., Shi, J., Song, Y., Kong, Y., Ma, J., He, B., Yan, Q., Ying, T., Zhao, P., Gao, X., & Yao, J. (2025). A Generative Foundation Model for Antibody Design. bioRxiv. DOI: 10.1101/2025.09.12.675771

    Antibody Design (IgGM)

    Introduction

    This module performs antibody design based on either antigen structures or antigen–antibody complex structures, requiring an initial antibody sequence as input. The design is powered by the IgGM model, a new generative foundation model developed to accelerate the engineering of high-affinity antibodies. IgGM learns the complex binding relationships between antigens and antibodies, as well as the mapping between antibody sequences and structures, thus enabling various antibody design tasks.

    In both in vitro experiments and computational benchmarks across diverse antigens, IgGM consistently generates antibodies and nanobodies with high measured affinity, demonstrating its versatility and efficiency as a next-generation tool for antibody discovery and optimization.

    IgGM consists of three core components:

    • Sequence feature extraction: Uses a pretrained protein language model (PPSM) to extract evolutionary features from antibody sequences, similar to how grammar and semantics are captured in natural language.
    • Antigen–antibody interaction modeling (Sgformer): The key component that learns the binding rules between antigens and antibodies, rather than modeling antibodies in isolation.
    • Generative prediction module: Based on the above components, directly outputs the antibody sequence and structure.

    The IgGM model framework is illustrated below:
    image.png

    Comparative results show that IgGM achieves higher prediction accuracy across multiple CDR regions than ProteinMPNN, ProteinMPNN (Filtered), IgMPNN, and IgDesign (see figure below):
    image.png
    These results indicate that IgGM’s design and optimization strategies are particularly well-suited for capturing the complex structural and functional characteristics of critical CDR regions, thereby enhancing the overall efficiency of antibody design.


    Parameters

    Complex

    Structure

    The antigen–antibody or antigen-nanobody complex structure used for antibody/nanobody design, in PDB format.
    Note: Currently, only single-chain antigens are supported. If multiple chains exist, the first chain will be used by default (or the antigen chain can be specified with the Chain parameter).

    Chain

    Specifies the antigen chain (single chain only).

    Positions

    Define the residues in the antibody that need to be redesigned.
    The format is Chain Type + Residue Number or Range, where the chain type supports only H (heavy chain) and L (light chain).
    Multiple residues or ranges are separated by commas.
    For example, if the parameter is set as:

    H27,H28,H99,H100-103,L24-32
    

    This means:

    • Residues 27, 28, 99, and 100–103 in the H chain will be redesigned;
    • Residues 24–32 in the L chain will be redesigned.

    Notes:

    1. The residue numbering refers to sequential indices starting from 1, not the original PDB residue numbers.
    2. If no chain type is specified, the range applies to all antibody chains (e.g., 24-32 designs residues 24–32 in all antibody chains).
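
    The Positions syntax, including the rule that an entry without a chain type applies to every antibody chain, can be illustrated with a short parser. This is a sketch only; parse_positions is a hypothetical name, and the chains argument is an assumption that lets the same function cover a nanobody with a single chain.

```python
import re

# Optional chain type (H or L), then a residue index or an inclusive range.
_ITEM = re.compile(r"^([HL])?(\d+)(?:-(\d+))?$")


def parse_positions(spec: str, chains: tuple[str, ...] = ("H", "L")) -> dict[str, list[int]]:
    """Expand e.g. 'H27,H100-103,L24-32' into sequential indices per chain."""
    selected: dict[str, set[int]] = {chain: set() for chain in chains}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        match = _ITEM.match(item)
        if match is None:
            raise ValueError(f"Cannot parse item: {item!r}")
        chain = match.group(1)
        start = int(match.group(2))
        end = int(match.group(3) or match.group(2))
        # No chain type means the range applies to all antibody chains.
        targets = chains if chain is None else (chain,)
        for target in targets:
            if target not in selected:
                raise ValueError(f"Chain {target!r} is not among {chains}")
            selected[target].update(range(start, end + 1))
    return {chain: sorted(indices) for chain, indices in selected.items()}
```

    Note that the resulting numbers are sequential positions counted from 1 in each chain, so they should be compared against the sequence, not against the residue numbers stored in the PDB file.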

    Number of Designs

    Specifies the number of antibody designs to generate. Default is 20, maximum is 1000.

    Design Type

    Specifies the design model type to use. Three options are available:

    • Design: General-purpose design model (default).
    • FR Design: Model specialized for framework region (FR) design.
    • Inverse Design: Inverse folding model that fixes the backbone structure and performs sequence design. This mode is applicable when using antigen–antibody complex structures.

    Relax

    Specifies whether to perform structure relaxation using OpenMM. Default is no relaxation.
    Note: Relaxation can significantly increase computation time for large design batches.

    Output Prefix

    Specifies the prefix for output files. Default is Result, producing files such as Result_<index>.fasta and Result_<index>.pdb.


    Antigen

    Structure

    Specifies the antigen structure file in PDB format. Only single-chain antigens are supported; for multi-chain structures, the first chain is used by default (or can be specified using Chain).

    Chain

    Specifies the antigen chain (single chain only).

    Sequence

    Specifies the initial antibody Fv sequence or nanobody sequence in FASTA format, for example:

    >H
    QIQLVQSGPELKKPGETVKISCKASGYTFTDYGLNWVKQAPGKGLKWMGWINTYSGEPTYNDEFRGRFAFSLETSTITAYLKINNLKNEDTATYFCARGGNWDWYFDVWGAGTTVTVSS
    >L
    DIVLTQSPATLSVTPGDNVSLSCRASQIISNNLHWYQQKSHESPRLLIKYASQSISGIPSRFSGSGSGTDFTLSINSVETEDFGMYFCQQSNTWPLTCGSGTKLELN
    
    >nanobody
    QVQLVESGGTLVQPGGSLRLSCAASRNISQIAILGWYRQAPGKQREAVAIITGGGPAHYADSVKGRFTISRDTAKNTTYLQMDSLKPEDTAVYFCNVRIRWGAADSWGQGIQVTVSS
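
    Before uploading, the FASTA input can be checked for duplicate record names and non-standard residue letters. The sketch below uses only the standard library; read_fasta and invalid_residues are hypothetical helper names, and the restriction to the 20 standard amino acids is an assumption about what the model accepts.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record name: sequence}, joining wrapped lines."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("Header line without a name")
            name = parts[0]
            if name in records:
                raise ValueError(f"Duplicate record name: {name!r}")
            records[name] = ""
        elif name is None:
            raise ValueError("Sequence data found before the first header")
        else:
            records[name] += line.upper()
    return records


def invalid_residues(sequence: str) -> list[str]:
    """Return the letters in a sequence that are not standard amino acids."""
    return sorted(set(sequence) - STANDARD_AMINO_ACIDS)
```

    A conventional antibody input would yield two records named H and L, while a nanobody input yields a single record.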
    

    Positions

    Define the residues in the antibody that need to be redesigned.
    The format is Chain Type + Residue Number or Range, where the chain type supports only H (heavy chain) and L (light chain).
    Multiple residues or ranges are separated by commas.
    For example, if the parameter is set as:

    H27,H28,H99,H100-103,L24-32
    

    This means:

    • Residues 27, 28, 99, and 100–103 in the H chain will be redesigned;
    • Residues 24–32 in the L chain will be redesigned.

    Notes:

    1. The residue numbering refers to sequential indices starting from 1.
    2. If no chain type is specified, the range applies to all antibody chains (e.g., 24-32 designs residues 24–32 in all antibody chains).

    Epitope

    Specifies the binding site information on the antigen chain, in the format: 1-5,10,20.

    Number of Designs

    Same as in the Complex mode.

    Design Type

    Specifies the model type, with two options:

    • Design: General-purpose design model (default). You can specify any region of the antibody for design.
    • FR Design: Model specialized for framework region design.

    Relax

    Same as in the Complex mode.

    Output Prefix

    Same as in the Complex mode.


    Results

    • Designed sequences: Output in FASTA format. Duplicate sequences are removed, and the occurrence frequency is recorded in sequence headers.
      Five sequence files are provided for direct viewing, and all sequences are packaged as seqs.tar.gz.
      Note: Sequence ranking does not indicate affinity ranking.
    • Structure models: Output in PDB format. Structures are relaxed and side chains completed using OpenMM.
      All PDB files are packaged as pdbs.tar.gz.
    • Complex sequences: Output in FASTA format, containing both antibody and corresponding antigen sequences separated by a colon (:).

    Reference

    • Wang, R., Wu, F., Shi, J., Song, Y., Kong, Y., Ma, J., He, B., Yan, Q., Ying, T., Zhao, P., Gao, X., & Yao, J. (2025). A Generative Foundation Model for Antibody Design. bioRxiv. DOI: 10.1101/2025.09.12.675771
  • Name: ADMET-AI
    Description: 基于AI 快速、准确地预测药物分子的吸收、分布、代谢、排泄和毒性(ADMET)性质,适合大规模化合物筛选。 AI-based fast and accurate prediction of the Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties of drug molecules, suitable for large-scale compound screening.
    Tags: undefined
    Author: Kyle Swanson
    Release: 2025-10-16 14:18:31
    Reference: Swanson K, Walther P, Leitz J, Mukherjee S, Wu JC, Shivnaraine RV, Zou J. ADMET-AI: A machine learning ADMET platform for evaluation of large-scale chemical libraries. bioRxiv [Preprint]. 2023 Dec 28:2023.12.28.573531. doi: 10.1101/2023.12.28.573531.

    ADMET-AI

    简介

    用于药物发现的高性能ADMET预测平台,帮助研究人员从庞大的化学库和组合化学空间中筛选符合药物性质的小分子。随着高通量分子对接和生成式AI技术的发展,药物化学空间迅速扩大,实验验证的分子选择变得更加重要。ADMET-AI提供快速且准确的吸收、分布、代谢、排泄和毒性预测,实现批量预测。
    在性能方面,ADMET-AI在TDC ADMET排行榜上获得了最高的平均排名,同时是目前最快的网页端ADMET预测工具,相较于第二快的公共网页预测服务器,速度提升了45%。在本地运行模式下,对一百万个分子进行预测仅需约3.1小时,极大提高了大规模分子筛选的效率。
    image.png

    参数说明

    Small Molecule File

    小分子SMILES文件,CSV格式。文件内容如下:

    smiles,name
    O(c1ccc(cc1)CCOC)CC(O)CNC(C)C,lig1
    

    注意
    1.小分子SMILES列必须包含列名,示例文件中为smiles。
    2.name列用于表示分子名称或标识,可选填写。

    Smiles Column Name

    CSV文件中小分子SMILES的列名称,例如示例文件中是smiles

    Predicted Results

    预测结果文件,CSV格式。默认为predicted_results.csv。

    结果说明

    输出predicted_results.csv文件,包含信息如下:

    列名 含义
    smiles 分子的 SMILES 表示法
    name 分子名称或标识
    molecular_weight 分子量(Da)
    logP 分子的辛醇/水分配系数,反映疏水性
    hydrogen_bond_acceptors 氢键受体数量
    hydrogen_bond_donors 氢键供体数量
    Lipinski 是否符合 Lipinski 规则(药物可口服性评估)
    QED 药物化学综合评分(Quantitative Estimate of Drug-likeness)
    stereo_centers 分子的手性中心数量
    tpsa 极性表面积(Topological Polar Surface Area)
    AMES AMES 试验预测,评估致突变性
    BBB_Martins 跨血脑屏障能力预测(Martins 方法)
    Bioavailability_Ma 口服生物利用度预测(Ma 方法)
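    CYP1A2_Veith CYP1A2 inhibition prediction (Veith dataset)
    CYP2C19_Veith CYP2C19 inhibition prediction (Veith dataset)
    CYP2C9_Substrate_CarbonMangels CYP2C9 substrate prediction (Carbon-Mangels dataset)
    CYP2C9_Veith CYP2C9 inhibition prediction (Veith dataset)
    CYP2D6_Substrate_CarbonMangels CYP2D6 substrate prediction (Carbon-Mangels dataset)
    CYP2D6_Veith CYP2D6 inhibition prediction (Veith dataset)
    CYP3A4_Substrate_CarbonMangels CYP3A4 substrate prediction (Carbon-Mangels dataset)
    CYP3A4_Veith CYP3A4 inhibition prediction (Veith dataset)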
    Carcinogens_Lagunin 致癌性预测(Lagunin 方法)
    ClinTox 临床毒性预测
    DILI 药物诱导肝损伤(Drug-Induced Liver Injury)预测
    HIA_Hou 人体吸收率预测(Hou 方法)
    NR-AR-LBD 核受体雄激素受体结合域预测
    NR-AR 核受体雄激素受体活性预测
    NR-AhR 核受体芳烃受体活性预测
    NR-Aromatase 芳香酶抑制活性预测
    NR-ER-LBD 核受体雌激素受体结合域预测
    NR-ER 核受体雌激素受体活性预测
    NR-PPAR-gamma 核受体 PPAR-γ 活性预测
    PAMPA_NCATS PAMPA 渗透性预测(NCATS 方法)
    Pgp_Broccatelli P-glycoprotein inhibition prediction (Broccatelli dataset)
    SR-ARE 抗氧化反应元件诱导预测
    SR-ATAD5 DNA 损伤修复元件诱导预测
    SR-HSE 热休克元件诱导预测
    SR-MMP Mitochondrial membrane potential disruption prediction
    SR-p53 p53 信号通路影响预测
    Skin_Reaction 皮肤反应/刺激性预测
    hERG hERG 通道抑制预测(心脏毒性)
    Caco2_Wang Caco-2 细胞透过性预测
    Clearance_Hepatocyte_AZ 肝细胞清除率预测(AstraZeneca 方法)
    Clearance_Microsome_AZ 微粒体清除率预测
    Half_Life_Obach 半衰期预测(Obach 方法)
    HydrationFreeEnergy_FreeSolv 水化自由能(FreeSolv 数据库)
    LD50_Zhu 半数致死量预测(Zhu 方法)
    Lipophilicity_AstraZeneca 脂溶性预测(AstraZeneca 方法)
    PPBR_AZ 血浆蛋白结合率(AstraZeneca 方法)
    Solubility_AqSolDB 水溶性预测(AqSolDB 数据库)
    VDss_Lombardo 分布容积预测(Lombardo 方法)

    后缀 _drugbank_approved_percentile 的列表示对应属性在 DrugBank 批准药物集中的百分位数。例如:

    • molecular_weight_drugbank_approved_percentile 表示该分子分子量在 DrugBank 批准药物中的相对位置(0~100%)。

    参考文献

    • Swanson K, Walther P, Leitz J, Mukherjee S, Wu JC, Shivnaraine RV, Zou J. ADMET-AI: A machine learning ADMET platform for evaluation of large-scale chemical libraries. bioRxiv [Preprint]. 2023 Dec 28:2023.12.28.573531. DOI: 10.1101/2023.12.28.573531.

    ADMET-AI

    Introduction

    ADMET-AI is a high-performance ADMET prediction platform for drug discovery, helping researchers screen small molecules with favorable drug-like properties from large chemical libraries and combinatorial chemical spaces. With the development of high-throughput molecular docking and generative AI, the chemical space of potential drugs has rapidly expanded, making the selection of compounds for experimental validation increasingly important. ADMET-AI provides fast and accurate predictions of Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET), supporting batch predictions.

    In terms of performance, ADMET-AI achieves the highest average rank on the TDC ADMET leaderboard and is currently the fastest web-based ADMET predictor, offering a 45% speed increase compared to the second fastest public web server. In local mode, predictions for one million molecules take only about 3.1 hours, greatly improving the efficiency of large-scale molecular screening.


    Parameters

    Small Molecule File

    A CSV file containing small molecule SMILES. Example content:

    smiles,name
    O(c1ccc(cc1)CCOC)CC(O)CNC(C)C,lig1
    

    Note:
    1. The SMILES column must have a header; in the example file it is smiles.
    2. The name column holds the molecule name or identifier and is optional.

    Smiles Column Name

    Name of the column in the CSV file that contains the SMILES strings, e.g. smiles in the example file.
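
    The input format described above can be checked before submitting a job. The following is a minimal sketch using only the Python standard library; the function name read_smiles and the fallback naming scheme (mol1, mol2, ...) are assumptions made for illustration, not behavior of ADMET-AI itself.

```python
import csv
import io


def read_smiles(csv_text: str, smiles_column: str = "smiles"):
    """Return (smiles, name) pairs; name falls back to a running index."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or smiles_column not in reader.fieldnames:
        raise ValueError(f"missing required column: {smiles_column!r}")
    rows = []
    for i, row in enumerate(reader, start=1):
        smiles = (row[smiles_column] or "").strip()
        if not smiles:
            continue  # skip rows without a SMILES string
        # The name column is optional, so fall back to a generated label.
        name = (row.get("name") or "").strip() or f"mol{i}"
        rows.append((smiles, name))
    return rows


sample = "smiles,name\nO(c1ccc(cc1)CCOC)CC(O)CNC(C)C,lig1\n"
print(read_smiles(sample))
```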

    Predicted Results

    The predicted results file in CSV format. Defaults to predicted_results.csv.

    Results

    The output predicted_results.csv contains the following information:

    Column Name Meaning
    smiles SMILES representation of the molecule
    name Molecule name or identifier
    molecular_weight Molecular weight (Da)
    logP Octanol-water partition coefficient, indicating hydrophobicity
    hydrogen_bond_acceptors Number of hydrogen bond acceptors
    hydrogen_bond_donors Number of hydrogen bond donors
    Lipinski Whether the molecule satisfies Lipinski’s rules (oral drug-likeness)
    QED Quantitative Estimate of Drug-likeness (QED)
    stereo_centers Number of stereocenters
    tpsa Topological Polar Surface Area (TPSA)
    AMES AMES mutagenicity prediction
    BBB_Martins Blood-brain barrier penetration prediction (Martins dataset)
    Bioavailability_Ma Oral bioavailability prediction (Ma dataset)
    CYP1A2_Veith CYP1A2 inhibition prediction (Veith dataset)
    CYP2C19_Veith CYP2C19 inhibition prediction (Veith dataset)
    CYP2C9_Substrate_CarbonMangels CYP2C9 substrate prediction (Carbon-Mangels dataset)
    CYP2C9_Veith CYP2C9 inhibition prediction (Veith dataset)
    CYP2D6_Substrate_CarbonMangels CYP2D6 substrate prediction (Carbon-Mangels dataset)
    CYP2D6_Veith CYP2D6 inhibition prediction (Veith dataset)
    CYP3A4_Substrate_CarbonMangels CYP3A4 substrate prediction (Carbon-Mangels dataset)
    CYP3A4_Veith CYP3A4 inhibition prediction (Veith dataset)
    Carcinogens_Lagunin Carcinogenicity prediction (Lagunin dataset)
    ClinTox Clinical toxicity prediction
    DILI Drug-induced liver injury prediction
    HIA_Hou Human intestinal absorption prediction (Hou dataset)
    NR-AR-LBD Androgen receptor ligand-binding domain activity prediction
    NR-AR Androgen receptor activity prediction
    NR-AhR Aryl hydrocarbon receptor activity prediction
    NR-Aromatase Aromatase inhibition prediction
    NR-ER-LBD Estrogen receptor ligand-binding domain activity prediction
    NR-ER Estrogen receptor activity prediction
    NR-PPAR-gamma PPAR-γ activity prediction
    PAMPA_NCATS PAMPA permeability prediction (NCATS dataset)
    Pgp_Broccatelli P-glycoprotein inhibition prediction (Broccatelli dataset)
    SR-ARE Antioxidant response element induction prediction
    SR-ATAD5 ATAD5 (DNA damage response) induction prediction
    SR-HSE Heat shock response element induction prediction
    SR-MMP Mitochondrial membrane potential disruption prediction
    SR-p53 p53 pathway activation prediction
    Skin_Reaction Skin reaction / irritation prediction
    hERG hERG channel inhibition prediction (cardiotoxicity)
    Caco2_Wang Caco-2 cell permeability prediction (Wang dataset)
    Clearance_Hepatocyte_AZ Hepatocyte clearance prediction (AstraZeneca dataset)
    Clearance_Microsome_AZ Microsomal clearance prediction (AstraZeneca dataset)
    Half_Life_Obach Half-life prediction (Obach dataset)
    HydrationFreeEnergy_FreeSolv Hydration free energy prediction (FreeSolv dataset)
    LD50_Zhu Median lethal dose (LD50) prediction (Zhu dataset)
    Lipophilicity_AstraZeneca Lipophilicity prediction (AstraZeneca dataset)
    PPBR_AZ Plasma protein binding rate prediction (AstraZeneca dataset)
    Solubility_AqSolDB Aqueous solubility prediction (AqSolDB dataset)
    VDss_Lombardo Volume of distribution at steady state prediction (Lombardo dataset)
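
    The four physicochemical columns at the top of the table are the inputs to Lipinski's rule of five (molecular weight ≤ 500 Da, logP ≤ 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors). The sketch below recomputes the check from one result row. It returns the number of criteria met; how the Lipinski column itself is encoded in the output is not stated here, so treat this as an independent cross-check. The example values are hypothetical.

```python
def lipinski_rules_met(row: dict) -> int:
    """Count how many of the four Lipinski criteria a result row satisfies."""
    checks = [
        float(row["molecular_weight"]) <= 500,
        float(row["logP"]) <= 5,
        float(row["hydrogen_bond_donors"]) <= 5,
        float(row["hydrogen_bond_acceptors"]) <= 10,
    ]
    return sum(checks)


# Hypothetical values, for illustration only.
row = {"molecular_weight": "267.4", "logP": "1.6",
       "hydrogen_bond_donors": "2", "hydrogen_bond_acceptors": "4"}
print(lipinski_rules_met(row))
```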

    Columns with the suffix _drugbank_approved_percentile indicate the percentile of the property relative to approved drugs in DrugBank.
    Example: molecular_weight_drugbank_approved_percentile shows the relative position (0–100%) of the molecular weight among approved DrugBank compounds.
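
    To make the percentile columns concrete, the sketch below computes where a value falls within a reference set. The reference list is a hypothetical stand-in for the DrugBank approved set, and the "share of reference values less than or equal to the query" definition is an assumption; ADMET-AI may use a slightly different percentile convention.

```python
from bisect import bisect_right


def percentile_of(value: float, reference: list) -> float:
    """Percentage of reference values that are <= value (0-100)."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical molecular weights standing in for the DrugBank approved set.
reference_mw = [180.2, 250.3, 310.4, 420.5, 515.6]
print(percentile_of(310.4, reference_mw))
```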

    References

    • Swanson K, Walther P, Leitz J, Mukherjee S, Wu JC, Shivnaraine RV, Zou J. ADMET-AI: A machine learning ADMET platform for evaluation of large-scale chemical libraries. bioRxiv [Preprint]. 2023 Dec 28:2023.12.28.573531. DOI: 10.1101/2023.12.28.573531.
  • Name: CSV Merge
    Description: 批量合并多个CSV文件,并输出合并后的CSV文件。 Batch merge multiple CSV files and output a single merged CSV file.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-10-20 15:11:35
    Reference:

    CSV Merge

    简介

    批量合并多个CSV文件,并输出合并后的CSV文件。

    参数说明

    Archive File

    用于合并的多个CSV文件的打包文件,支持格式:.zip,.tar,.tar.gz,.tgz,.tar.bz2,.tbz2,.tar.xz,.txz

    CSV1

    参与合并的单个CSV文件。

    CSV2

    参与合并的单个CSV文件。

    CSV3

    参与合并的单个CSV文件。

    CSV4

    参与合并的单个CSV文件。

    CSV5

    参与合并的单个CSV文件。

    打包文件或单个CSV文件,可以自由设置,至少设置一个。

    Columns

    指定每个CSV文件需要提取并输出的列,使用文本文件,每行定义一个文件名对应的列名,用英文逗号分隔。未定义的文件,将提取并输出所有列。输出的列名默认是原文件中的列名,如需修改输出的列名称,在对应列名后加上:修改后的列名
    示例如下:

    ESM_output.csv,Mutation,Log_likelihood,Log_likelihood_target_chain
    pythia_output.csv,Mutation,Energy:ddG(pythia)
    pythia_ppi_output.csv,Mutation,ddG_Pred:ddG(pythia_ppi)
    

    表示:

    • From ESM_output.csv, extract the columns Mutation, Log_likelihood, and Log_likelihood_target_chain;
    • From pythia_output.csv, extract the columns Mutation and Energy, renaming Energy to ddG(pythia);
    • From pythia_ppi_output.csv, extract the columns Mutation and ddG_Pred, renaming ddG_Pred to ddG(pythia_ppi).

    Join Columns

    指定上述提取的列中,用于合并的列名,多列时用逗号分隔,如Mutation表示使用Mutation列进行合并,或者Mutation,Chain表示同时用Mutation,Chain两列进行合并。
    注意:如不指定该参数,默认会从各文件的提取列中,选择名称相同的公共列,如没有公共列则无法合并。

    Filter Type

    过滤方式,目前支持三种方式:TopN,WT,Both:
    TopN:对指定的列进行排序,选取排序靠前的N条记录。
    WT:对指定的列进行排序,选取数值优于野生型的记录。
    Both:同时采用前述两种过滤方式。

    Filter Columns

    指定用于过滤的列名,多列时使用逗号分隔,如:Energy(Pythia),ddG_pred(ThermoMPNN)表示使用列名为Energy(Pythia)及ddG_pred(ThermoMPNN)的列进行过滤。

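    • When this parameter is set, the output merged.csv gains a Hits_Count column that counts how many filter columns satisfy the filtering criteria. For example, a value of 2 means two columns meet the criteria.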

    Sort Direction

    指定Filter Columns参数中,每列的排序方式,1表示升序,0表示降序,与列名顺序对应,通过逗号分隔,如:1,0表示第一个列名用升序,第二个列名用降序。如不设置该参数,则默认都采用升序。

    TopN

    设置TopN过滤方式中的具体N值,正整数。

    Exclude Sites

    输出的突变信息和序列中,不包含指定的位点。
    格式为:残基位置或范围,如:‘1-10,36’,可加链名,如:‘A1-10,A36’,不加链名时,表示应用到所有可能链的相应位置

    Diverse AA

    进行二次过滤时,对同一位点的所有突变中,仅保留同类型/性质突变残基中的排名最优者,默认为True。

    Max AA per Site

    进行二次过滤时,允许同一位点中突变数量的最大值,默认为2,仅保留排名靠前的最大数量突变残基。

    Interface Chain

    用于指定目标链,多条链时使用逗号分隔,如 A,B。在二次筛选阶段,设置后只保留与目标链存在相互作用的突变,不设置则全部保留。如抗原-抗体复合物中,只需保留与抗原链相互作用界面上的突变时,设置该参数为抗原链名,可过滤掉重轻链相互作用界面上的突变。

    SASA

    可加入SASA(relativeSideChain)与Bfactor信息,模块Solvent Exposure (SASA)的输出文件。

    Output Sequences

    是否输出过滤后,相应突变对应的突变序列,单选,Yes或No,默认为Yes。注意:合并后的CSV文件中必须有包含突变信息的列,且突变信息的格式为原残基+突变位置+突变残基(如:G1A),才能进行正常的序列输出。

    Mutation Column

    定义包含突变信息的列名,默认为Mutation。

    Output

    输出合并文件,默认为merged.csv

    Output Fasta

    输出序列文件的名称,fasta格式,默认为mutated_seqs.fasta

    结果说明

    合并输出文件merged.csv。当指定Filter Columns参数时,输出的 merged.csv 文件中将新增:

    • Hits_Count,用于统计满足筛选条件的列数量。例如,当值为 2 时,表示有两列符合过滤条件。
    • Rank_列名,为该条记录在每个过滤列的排序Rank值。
    • Rank_Avg,满足过滤条件的过滤列的平均Rank值。

    结果优先按Hits_Count 列降序排序,然后按Rank_Avg列升序排列。

    突变序列对应的fasta文件mutated_seqs.fasta,Batch格式的复合物序列文件hits_complex_batch.fasta。

    二次过滤后的结果文件,相互作用界面上计算结果

    • 基于合并的计算结果,挑选的相互作用界面上的多样性子集interface_diverse_subset.csv

    • 对相互作用界面上的突变子集,经二次过滤后生成的Batch格式的复合物序列interface_diverse_complex_batch.fasta

    • 对相互作用界面上,经二次过滤后得到的多样性子集,各突变对应的突变序列interface_diverse_mutated_seqs.fasta

    • 对相互作用界面上,经二次过滤后得到的多样性子集,生成双点与三点突变组合,对应的复合物序列。interface_diverse_multi_mutants_complex_batch.fasta

    • 对相互作用界面上,经二次过滤后得到的多样性子集,生成双点与三点突变组合的序列interface_diverse_multi_mutants_seqs.fasta

    二次过滤后的结果文件,非相互作用界面上计算结果

    • 对非相互作用界面上,经二次过滤后得到的多样性子集non_interface_diverse_subset.csv
    • 对非相互作用界面上,经二次过滤后生成的Batch格式的复合物序列non_interface_diverse_complex_batch.fasta
    • 对非相互作用界面上,经二次过滤后得到的多样性子集,各突变对应的突变序列non_interface_diverse_mutated_seqs.fasta
    • 对非相互作用界面上,经二次过滤后得到的多样性子集,生成双点与三点突变组合,对应的复合物序列non_interface_diverse_multi_mutants_complex_batch.fasta
    • 对非相互作用界面上,经二次过滤后得到的多样性子集,生成双点与三点突变组合的序列non_interface_diverse_multi_mutants_seqs.fasta

    多链计算结果文件

    • cross_chain_merged.csv合并后的多链计算结果,包含所有链的综合评分与排序信息。对于多链体系,Cross_Chain_Rank 表示整体综合排名。
    • cross_chain_interface_diverse_subset.csv相互作用界面区域的多样性子集结果。该文件保留界面相关残基/构象中具有代表性的多样化候选,用于分析链间相互作用。
    • cross_chain_non_interface_diverse_subset.csv非相互作用界面区域的多样性子集结果。该文件主要反映非界面区域中的多样化候选分布,用于评估整体结构或序列多样性。

    CSV Merge

    Introduction

    Batch merge multiple CSV files and output a single merged CSV file.

    Parameters

    Archive File

    A compressed archive containing multiple CSV files to be merged. Supported formats: .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz.

    CSV1

    A single CSV file to be included in the merge.

    CSV2

    A single CSV file to be included in the merge.

    CSV3

    A single CSV file to be included in the merge.

    CSV4

    A single CSV file to be included in the merge.

    CSV5

    A single CSV file to be included in the merge.

    Either an archive file or individual CSV files can be provided. At least one must be specified.

    Columns

    Specifies the columns to extract and output from each CSV file. This parameter is provided as a text file, where each line defines the file name and its corresponding column names, separated by commas.
    Files not listed will have all columns extracted and output.
    By default, output column names are identical to the original names. To rename a column, append :new_name to the original column name.

    Example:

    ESM_output.csv,Mutation,Log_likelihood,Log_likelihood_target_chain
    pythia_output.csv,Mutation,Energy:ddG(pythia)
    pythia_ppi_output.csv,Mutation,ddG_Pred:ddG(pythia_ppi)
    

    Meaning:

    • Extract Mutation, Log_likelihood, and Log_likelihood_target_chain from ESM_output.csv;
    • Extract Mutation and Energy from pythia_output.csv, renaming Energy to ddG(pythia);
    • Extract Mutation and ddG_Pred from pythia_ppi_output.csv, renaming ddG_Pred to ddG(pythia_ppi).
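
    The Columns file format above is simple enough to parse by hand. The following is a minimal sketch of such a parser; the function name parse_columns_spec and the returned dictionary layout are illustrative choices, not the tool's internal representation.

```python
def parse_columns_spec(text: str) -> dict:
    """Return {file_name: {source_column: output_column}}."""
    spec = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        file_name, *columns = [part.strip() for part in line.split(",")]
        mapping = {}
        for col in columns:
            # "Energy:ddG(pythia)" renames Energy; a bare name is kept as-is.
            source, _, new_name = col.partition(":")
            mapping[source] = new_name or source
        spec[file_name] = mapping
    return spec


spec = parse_columns_spec(
    "ESM_output.csv,Mutation,Log_likelihood,Log_likelihood_target_chain\n"
    "pythia_output.csv,Mutation,Energy:ddG(pythia)\n"
)
print(spec["pythia_output.csv"])
```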

    Join Columns

    Specifies the column names used for merging. Multiple columns should be separated by commas, e.g.,
    Mutation (merge by the Mutation column), or
    Mutation,Chain (merge using both Mutation and Chain columns).

    Note: If this parameter is not specified, the tool will automatically use common columns with identical names among the extracted columns. If no common columns exist, merging cannot be performed.
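
    The default behavior described in the note (falling back to the columns shared by all files) can be sketched as follows. This sketch performs an inner join on lists of row dictionaries; whether the tool keeps unmatched rows is not stated here, so the inner join is an assumption, as are the function names and sample rows.

```python
def common_columns(tables):
    """Columns present in every table, in the order of the first table."""
    shared = set(tables[0][0])
    for rows in tables[1:]:
        shared &= set(rows[0])
    return [c for c in tables[0][0] if c in shared]


def merge_tables(tables, join_columns=None):
    """Inner-join lists of row dicts on join_columns (default: common columns)."""
    keys = join_columns or common_columns(tables)
    if not keys:
        raise ValueError("no common columns: cannot merge")
    merged = {tuple(r[k] for k in keys): dict(r) for r in tables[0]}
    for rows in tables[1:]:
        index = {tuple(r[k] for k in keys): r for r in rows}
        # Keep only keys found in both sides and combine their columns.
        merged = {k: {**v, **index[k]} for k, v in merged.items() if k in index}
    return list(merged.values())


# Hypothetical rows, for illustration only.
esm = [{"Mutation": "G1A", "Log_likelihood": "-1.2"},
       {"Mutation": "S2T", "Log_likelihood": "-0.4"}]
pythia = [{"Mutation": "G1A", "ddG(pythia)": "-0.8"}]
print(merge_tables([esm, pythia]))
```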

    Filter Type

    The filtering method. Three types are supported: TopN, WT, and Both:

    • TopN: Sort by selected columns and keep the top N records.
    • WT: Sort by selected columns and keep records that perform better than the wild-type.
    • Both: Apply both of the above filtering strategies.

    Default: Both.

    Filter Columns

    Specifies the column names used for filtering. Multiple columns should be separated by commas.
    Example:
    Energy(Pythia),ddG_pred(ThermoMPNN)

    Sort Direction

    Specifies the sort order for each column in Filter Columns: 1 means ascending and 0 means descending. Values are comma-separated and follow the order of the column names; e.g., 1,0 sorts the first column in ascending order and the second in descending order.
    If not specified, all columns default to ascending order.

    TopN

    Defines the N value for the TopN filtering strategy. Must be a positive integer. Default is 20.

    Exclude Sites

    The output mutation information and sequences will exclude the specified positions.
    The format should be residue indices or ranges, e.g., 1-10,36. Chain identifiers can be included, e.g., A1-10,A36.
    If no chain identifier is provided, the positions will be applied to all corresponding residues across all possible chains.
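
    The site specification can be expanded into explicit positions as in the sketch below. It assumes single-letter chain identifiers and non-negative integer residue numbers; the function name parse_sites is hypothetical.

```python
import re


def parse_sites(spec: str):
    """Expand '1-10,36' or 'A1-10,A36' into a set of (chain, position).

    chain is None when no chain identifier is given, meaning "any chain".
    """
    sites = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        match = re.fullmatch(r"([A-Za-z]?)(\d+)(?:-(\d+))?", token)
        if match is None:
            raise ValueError(f"cannot parse site: {token!r}")
        chain = match.group(1) or None
        start = int(match.group(2))
        end = int(match.group(3) or start)
        sites.update((chain, pos) for pos in range(start, end + 1))
    return sites


print(sorted(parse_sites("A1-3,A36")))
```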

    Mutation Column

    Specifies the column containing mutation information. Default: Mutation.

    Interface Only

    When performing secondary filtering, whether to retain only interface residues. Default: True.

    Diverse AA

    When performing secondary filtering, for all mutations at the same site, only the top-ranked mutation within each amino acid type/property group is retained. Default: True.

    Max AA per Site

    When performing secondary filtering, the maximum number of allowed mutations at the same site. Default: 2. Only the top-ranked mutations up to this maximum are retained.
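
    The combined effect of Diverse AA and Max AA per Site can be sketched as follows. The amino acid property groups below are a hypothetical grouping chosen for illustration (the grouping used by the tool is not documented here), and mutations are assumed to be written as OriginalResidue + Position + MutatedResidue with a lower rank meaning a better mutation.

```python
from collections import defaultdict

# Hypothetical property groups; the tool's actual grouping is not documented here.
AA_GROUPS = {
    **dict.fromkeys("AVLIMFWC", "hydrophobic"),
    **dict.fromkeys("STNQYGP", "polar"),
    **dict.fromkeys("KRH", "basic"),
    **dict.fromkeys("DE", "acidic"),
}


def diverse_subset(mutations, max_per_site=2):
    """mutations: list of (mutation, rank); lower rank is better.

    Keeps the best mutation per property group at each site, then the
    top max_per_site of those.
    """
    by_site = defaultdict(list)
    for mutation, _rank in sorted(mutations, key=lambda m: m[1]):
        by_site[mutation[1:-1]].append(mutation)
    kept = []
    for ordered in by_site.values():
        seen_groups, chosen = set(), []
        for mutation in ordered:
            group = AA_GROUPS[mutation[-1]]
            if group not in seen_groups:
                seen_groups.add(group)
                chosen.append(mutation)
        kept.extend(chosen[:max_per_site])
    return kept


ranked = [("G1A", 1), ("G1V", 2), ("G1K", 3), ("G1D", 4), ("S2T", 5)]
print(diverse_subset(ranked))
```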

    Interface Chain

    Specifies the target chain(s). Separate multiple chains with commas, e.g. A,B. During the secondary screening stage, when this parameter is set, only mutations that interact with the target chain(s) are retained; if left unset, all mutations are retained. For example, in an antigen–antibody complex, to retain only mutations on the interaction interface with the antigen chain, set this parameter to the antigen chain name to filter out mutations on the heavy–light chain interaction interface.

    SASA

    Optional inclusion of SASA (relativeSideChain) and B-factor information, using the output file from the Solvent Exposure (SASA) module.

    Output Sequences

    Determines whether to output the mutated sequences corresponding to the filtered variants. Options: Yes or No. Default is Yes.
    Note: The merged CSV must contain a column with mutation information in the format
    OriginalResidue + Position + MutatedResidue (e.g., G1A) to correctly generate sequences.
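
    The mutation format requirement follows from how a mutated sequence is derived from it. A minimal sketch, assuming the position is a 1-based index into the sequence (the tool may instead use the residue numbering of the structure); apply_mutation is a hypothetical helper.

```python
import re


def apply_mutation(sequence: str, mutation: str) -> str:
    """Apply a point mutation such as 'G1A' (1-based position)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unexpected mutation format: {mutation!r}")
    original, mutated = match.group(1), match.group(3)
    index = int(match.group(2)) - 1
    # The original residue must agree with the sequence at that position.
    if not 0 <= index < len(sequence) or sequence[index] != original:
        raise ValueError(f"{mutation} does not match the sequence")
    return sequence[:index] + mutated + sequence[index + 1:]


print(apply_mutation("GSHMK", "G1A"))
```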

    Output

    Name of the merged output file. Default: merged.csv.

    Output Fasta

    Name of the output FASTA file containing mutated sequences. Default: mutated_seqs.fasta.

    Results

    The merged output file is merged.csv. When the Filter Columns parameter is specified, the following additional columns will be included in the merged.csv file:

    • Hits_Count: Counts the number of columns that satisfy the filtering criteria. For example, a value of 2 indicates that two columns meet the filter conditions.
    • Rank_<column_name>: The ranking value of the current record within each filtered column.
    • Rank_Avg: The average rank across all filtered columns that meet the filter conditions.

    The results are first sorted in descending order by the Hits_Count column, and then in ascending order by the Rank_Avg column.
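
    The ranking and sorting described above can be sketched as follows for the TopN filter. The rule used here (a column counts as a hit when the record's rank in that column is within the top N) is an assumption about how the filter is applied; the column names and values are hypothetical.

```python
def rank_and_filter(rows, filter_columns, ascending, top_n):
    """Add Rank_<col>, Hits_Count and Rank_Avg to each row, then sort.

    A column counts as a hit when the row's rank in it is within top_n.
    """
    for col, asc in zip(filter_columns, ascending):
        ordered = sorted(rows, key=lambda r: float(r[col]), reverse=not asc)
        for rank, row in enumerate(ordered, start=1):
            row[f"Rank_{col}"] = rank
    for row in rows:
        hits = [row[f"Rank_{c}"] for c in filter_columns
                if row[f"Rank_{c}"] <= top_n]
        row["Hits_Count"] = len(hits)
        row["Rank_Avg"] = sum(hits) / len(hits) if hits else float("inf")
    # Hits_Count descending first, then Rank_Avg ascending.
    return sorted(rows, key=lambda r: (-r["Hits_Count"], r["Rank_Avg"]))


rows = [
    {"Mutation": "G1A", "ddG": -1.5, "LL": 0.2},
    {"Mutation": "S2T", "ddG": -0.5, "LL": 0.9},
    {"Mutation": "K3R", "ddG": 0.3, "LL": 0.5},
]
result = rank_and_filter(rows, ["ddG", "LL"], ascending=[1, 0], top_n=2)
print([(r["Mutation"], r["Hits_Count"], r["Rank_Avg"]) for r in result])
```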

    The FASTA file of mutated sequences is mutated_seqs.fasta; the batch-format complex sequence file is hits_complex_batch.fasta.

    Result files after secondary filtering: interface mutations

    • interface_diverse_subset.csv: The diverse subset of interface mutations selected from the merged results.
    • interface_diverse_complex_batch.fasta: Batch-format complex sequences for the interface subset.
    • interface_diverse_mutated_seqs.fasta: The mutated sequence for each mutation in the interface subset.
    • interface_diverse_multi_mutants_complex_batch.fasta: Complex sequences for the double- and triple-mutation combinations built from the interface subset.
    • interface_diverse_multi_mutants_seqs.fasta: Sequences for the double- and triple-mutation combinations built from the interface subset.
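
    Double- and triple-mutation combinations can be enumerated from a list of point mutations as in the sketch below. It assumes that combinations must use distinct sites and joins mutations with "+" for display; both choices, and the function name multi_mutants, are illustrative assumptions.

```python
from itertools import combinations


def multi_mutants(mutations, sizes=(2, 3)):
    """Combine point mutations at distinct sites into double and triple mutants."""
    combos = []
    for size in sizes:
        for group in combinations(mutations, size):
            sites = {m[1:-1] for m in group}
            if len(sites) == size:  # skip combinations that hit one site twice
                combos.append("+".join(group))
    return combos


print(multi_mutants(["G1A", "G1K", "S2T", "K3R"]))
```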

    Result files after secondary filtering: non-interface mutations

    • non_interface_diverse_subset.csv: The diverse subset of non-interface mutations obtained after secondary filtering.
    • non_interface_diverse_complex_batch.fasta: Batch-format complex sequences for the non-interface subset.
    • non_interface_diverse_mutated_seqs.fasta: The mutated sequence for each mutation in the non-interface subset.
    • non_interface_diverse_multi_mutants_complex_batch.fasta: Complex sequences for the double- and triple-mutation combinations built from the non-interface subset.
    • non_interface_diverse_multi_mutants_seqs.fasta: Sequences for the double- and triple-mutation combinations built from the non-interface subset.

    Multi-Chain Result Files

    • cross_chain_merged.csv: The merged multi-chain calculation results, including the overall scores and ranking information across all chains. In the multi-chain system, Cross_Chain_Rank represents the overall integrated ranking.

    • cross_chain_interface_diverse_subset.csv: The diversity subset results for interaction interface regions. This file retains representative and diverse candidates among interface-related residues/conformations, and is used to analyze inter-chain interactions.

    • cross_chain_non_interface_diverse_subset.csv: The diversity subset results for non-interaction interface regions. This file mainly reflects the distribution of diverse candidates in non-interface regions and is used to evaluate overall structural or sequence diversity.

  • Name: Protein Acid Stability
    Description: 计算蛋白的耐酸性指数,并统计蛋白整体及表面暴露的酸碱性残基及其比例,给出酸性残基集中的区域(Patches)。 Calculates the acid stability index (ASI) of proteins and provides statistics of acidic, basic, and hydrophobic residues in the whole protein and on the surface, along with acidic residue clusters (Patches).
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-10-12 00:00:00
    Reference:

    Protein Acid Stability

    简介

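    Calculates the acid stability index of a protein, reports the counts and ratios of acidic and basic residues for the whole protein and for its surface-exposed residues, and lists the regions where acidic residues cluster (Patches).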
    耐酸性指数(ASI)的计算公式为:
    ASI = 0.6*碱性残基比例 + 0.3*疏水性残基比例 - 0.5*酸性残基比例

    ASI取值范围在-0.5 ~ 0.6之间,越大表示耐酸性能力越强。
    表面暴露残基定义为相对溶剂可及表面积(RSA)大于25%的残基。

    参数说明

    Structure

    蛋白结构文件,PDB格式,支持批量,批量格式支持:.zip,.tar,.tar.gz,.tgz,.tar.bz2,.tbz2,.tar.xz,.txz. 目前最大支持1000个结构。

    Output Summary

    输出蛋白耐酸性指数及各类残基比例等,CSV格式,默认为acid_stability_summary.csv

    Output Patch

    输出酸性区域残基信息,CSV格式,默认为acid_sensitive_regions.csv

    结果说明

    蛋白耐酸性指数及各类残基比例结果文件acid_stability_summary.csv,包含内容如下:

    列名 说明
    PDB 结构文件名称
    TotalResidues 结构中的总残基数量
    SurfaceResidues 表面暴露残基的数量
    AcidicRatio 酸性残基的比例
    BasicRatio 碱性残基的比例
    HydrophobicRatio 疏水残基的比例
    SurfaceAcidicRatio 表面暴露残基中酸性残基的比例
    SurfaceBasicRatio 表面暴露残基中碱性残基的比例
    SurfaceHydrophobicRatio 表面暴露残基中疏水残基的比例
    NetCharge@pH2 在pH值=2时计算的Net Charge
    ASI_Global 基于所有残基计算的耐酸性指数ASI值
    ASI_Surface 仅基于表面暴露残基计算的耐酸性指数ASI值
    AcidicPatches 酸性残基区域的数量

    酸性区域残基信息文件acid_sensitive_regions.csv

    列名 说明
    PDB 结构文件名称
    ClusterID 酸性残基区域的ID
    Chain 所在链名
    ResSeq 组成残基的UID
    Residue 残基名

    Protein Acid Stability

    Introduction

    Calculates the acid stability index (ASI) of proteins and provides statistics of acidic, basic, and hydrophobic residues in the whole protein and on the surface, along with acidic residue clusters (Patches).
    The Acid Stability Index (ASI) is calculated as:
    ASI = 0.6*BasicResidueRatio + 0.3*HydrophobicResidueRatio - 0.5*AcidicResidueRatio

    ASI ranges from -0.5 to 0.6; higher values indicate stronger acid stability.
    Surface-exposed residues are defined as residues with relative solvent accessible surface area (RSA) greater than 25%.
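
    The ASI formula can be evaluated directly from a sequence. The sketch below takes acidic residues as D and E and basic residues as K, R, and H; the hydrophobic set is an assumption, since the module's exact residue classes are not listed here. The extremes of the stated range correspond to an all-basic sequence (0.6) and an all-acidic sequence (-0.5).

```python
ACIDIC = set("DE")
BASIC = set("KRH")
# Assumed hydrophobic set; the module's exact residue classes are not listed here.
HYDROPHOBIC = set("AVLIMFW")


def acid_stability_index(sequence: str) -> float:
    """ASI = 0.6*basic ratio + 0.3*hydrophobic ratio - 0.5*acidic ratio."""
    n = len(sequence)
    if n == 0:
        raise ValueError("empty sequence")
    basic = sum(aa in BASIC for aa in sequence) / n
    hydrophobic = sum(aa in HYDROPHOBIC for aa in sequence) / n
    acidic = sum(aa in ACIDIC for aa in sequence) / n
    return 0.6 * basic + 0.3 * hydrophobic - 0.5 * acidic


# Hypothetical sequence, for illustration only.
print(round(acid_stability_index("KKDDAAGG"), 3))
```

    For ASI_Surface, the same function would be applied to the surface-exposed residues only (RSA greater than 25%), which requires a solvent accessibility calculation that this sketch does not include.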

    Parameters

    Structure

    Protein structure files in PDB format. Supports batch processing with formats: .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz. The supported maximum number of structures is 1000.

    Output Summary

    Outputs protein acid stability index and residue ratios in CSV format. Default: acid_stability_summary.csv

    Output Patch

    Outputs acidic residue cluster information in CSV format. Default: acid_sensitive_regions.csv

    Result Description

    The acid stability summary file acid_stability_summary.csv contains:

    Column Description
    PDB Structure file name
    TotalResidues Total number of residues in the structure
    SurfaceResidues Number of surface-exposed residues
    AcidicRatio Ratio of acidic residues
    BasicRatio Ratio of basic residues
    HydrophobicRatio Ratio of hydrophobic residues
    SurfaceAcidicRatio Ratio of acidic residues among surface-exposed residues
    SurfaceBasicRatio Ratio of basic residues among surface-exposed residues
    SurfaceHydrophobicRatio Ratio of hydrophobic residues among surface-exposed residues
    NetCharge@pH2 Net charge calculated at pH 2
    ASI_Global ASI calculated using all residues
    ASI_Surface ASI calculated using only surface-exposed residues
    AcidicPatches Number of acidic residue clusters

    The acidic residue cluster file acid_sensitive_regions.csv contains:

    Column Description
    PDB Structure file name
    ClusterID ID of the acidic residue cluster
    Chain Chain name
    ResSeq Residue UID in the cluster
    Residue Residue name
  • Name: B-cell Epitope Prediction
    Description: 预测抗原中潜在的B细胞表位,及寻找两个抗原之间潜在相似残基。 Predict potential B-cell epitopes in antigens and identify potentially similar residues between two antigens.
    Tags: undefined
    Author: Tatiana I Shashkova
    Release: 2025-09-29 00:00:00
    Reference: Shashkova TI, Umerenkov D, Salnikov M, Strashnov PV, Konstantinova AV, Lebed I, Shcherbinin DN, Asatryan MN, Kardymon OL, Ivanisenko NV. SEMA: Antigen B-cell conformational epitope prediction using deep transfer learning. Front Immunol. 2022 Sep 15;13:960985.

    Antigen B-cell Epitope Prediction

    简介

    预测抗原中潜在的B细胞表位,及寻找两个抗原之间潜在相似残基。模块基于SEMA模型实现,其中表位预测工具融合了基于序列(SEMA-1D)和基于结构(SEMA-3D)的两种方法:

    • SEMA-1D 模型集成了一组ESM2蛋白质语言模型;
    • SEMA-3D 模型集成了一组预训练的蛋白质双模态SaProt模型。

    两个模型均经过微调,用于预测氨基酸残基与免疫球蛋白Fab区的抗原相互作用倾向。此外,表位预测工具还包含一个基于一级序列预测N-糖基化位点的模型,该模型同样基于ESM2。
    结合表位预测与相似性比对能够在两个抗原之间识别结构相似的表位,即便抗原整体相似度极低。该功能适用于比较不同病毒或细菌株的蛋白质,其底层神经网络在SaProt模型生成的嵌入向量上训练而成。

    参数说明

    Epitope-1D模式

    Sequence

    用于表位预测的蛋白序列,FASTA格式。最大支持100条序列。

    Output

    输出评分文件名,CSV格式,默认为result.csv

    Epitope-3D模式

    Structure

    用于表位预测的蛋白结构,PDB格式。

    Chain

    指定进行表位预测的蛋白链名称,多链用英文逗号分隔,如:A,B。如不指定,表示全部链都进行预测。

    Output

    输出评分文件名,CSV格式,默认为result.csv

    N-glycosylation模式

    Sequence

    用于N糖基化预测的蛋白序列,FASTA格式。最大支持100条序列。

    Structure

    用于N糖基化预测的蛋白结构,PDB格式。
    注意:上述序列和结构,只能选择其一,否则会提示错误

    Chain

    上传结构时,指定进行预测的蛋白链名称,多链用英文逗号分隔,如:A,B。如不指定,表示全部链都进行预测。

    Output

    输出评分文件名,CSV格式,默认为result.csv

    Comparison模式

    Structure_1

    用于比较局部结构相似性的第一个蛋白结构,PDB格式。

    Chain_1

    指定第一个蛋白用于比较的链名,多链用英文逗号分隔,如:A,B。

    Structure_2

    指定用于比较局部结构相似性的第二个蛋白结构,PDB格式。

    Chain_2

    指定第二个蛋白用于比较的链名,多链用英文逗号分隔,如:A,B。

    Output

    相似残基对的输出文件名,CSV格式,默认为result.csv。相似度值大于2.0时,表示两残基相似,相似度值越大表示残基对越相似。

    结果说明

    表位预测的打分文件result.csv,包含如下信息:

    列名 说明
    PDB_ID 结构名称
    Chain 链名称
    Residue position 残基UID编号
    AA 残基单字母名
    Epitope_score 表位预测概率值,表示该残基成为B细胞受体表位的可能性,数值在0-1之间,越大表示成为表位的可能性越高。

    N糖基化预测的打分文件result.csv,包含如下信息

    列名 说明
    PDB_ID 结构名称
    Chain 链名称
    Residue position 残基UID编号
    AA 残基单字母名
    PTM_score 该残基N是否发生糖基化的概率值,0-1之间,小于0.5表示不会,大于0.5表示会。
    PTM_label 根据PTM_score判断是否会发生N糖基化,0表示不会,1表示会。

    局部相似性比对结果文件result.csv,包含如下信息:

    列名 说明
    PDB_ID_1 第一个结构名称
    aa_1 残基名称
    Chain_1 链名称
    pos_1 残基UID编号
    PDB_ID_2 第二个结构名称
    aa_2 残基名称
    Chain_2 链名称
    pos_2 残基UID编号
    score 相似性打分,数值大于2.0时,表示相似,数值越大相似性越高。

    参考文献

    • Shashkova, T.I., Umerenkov, D., Salnikov, M., Strashnov, P.V., Konstantinova, A.V., Lebed, I., Shcherbinin, D.N., Asatryan, M.N., Kardymon, O.L., Ivanisenko, N.V. (2022). SEMA: Antigen B-cell conformational epitope prediction using deep transfer learning. Front. Immunol. DOI: 10.3389/fimmu.2022.960985

    Antigen B-cell Epitope Prediction

    Introduction

    Predict potential B-cell epitopes in antigens and identify potentially similar residues between two antigens. This module is implemented based on the SEMA model, which integrates sequence-based (SEMA-1D) and structure-based (SEMA-3D) epitope prediction methods:

    • The SEMA-1D model is an ensemble of ESM2 protein language models.
    • The SEMA-3D model is an ensemble of pre-trained bimodal (sequence and structure) SaProt protein models.

    Both models are fine-tuned to predict the propensity of amino acid residues to interact with the Fab region of immunoglobulins. Additionally, the epitope prediction tool includes a model for predicting N-glycosylation sites from primary sequences, also based on ESM2.

    Combining epitope prediction and similarity alignment allows the identification of structurally similar epitopes between two antigens, even when overall antigen similarity is low. This function is suitable for comparing proteins from different viral or bacterial strains, and the underlying neural network is trained on embeddings generated by the SaProt model.

    Parameters

    Epitope-1D Mode

    Sequence

    Protein sequence for epitope prediction, in FASTA format. Supports up to 100 sequences.

    Output

    Output score file name, CSV format, default is result.csv.

    Epitope-3D Mode

    Structure

    Protein structure for epitope prediction, in PDB format.

    Chain

    Specify the protein chains for epitope prediction. Multiple chains are separated by commas, e.g., A,B. If not specified, all chains are predicted.

    Output

    Output score file name, CSV format, default is result.csv.

    N-glycosylation Mode

    Sequence

    Protein sequence for N-glycosylation prediction, FASTA format. Supports up to 100 sequences.

    Structure

    Protein structure for N-glycosylation prediction, PDB format.
    Note: Only one of Sequence or Structure can be selected, otherwise an error will occur.

    Chain

    When uploading a structure, specify the chains for prediction. Multiple chains separated by commas, e.g., A,B. If not specified, all chains are predicted.

    Output

    Output score file name, CSV format, default is result.csv.

    Comparison Mode

    Structure_1

    The first protein structure for local similarity comparison, PDB format.

    Chain_1

    Specify chains in the first protein for comparison, multiple chains separated by commas, e.g., A,B.

    Structure_2

    The second protein structure for local similarity comparison, PDB format.

    Chain_2

    Specify chains in the second protein for comparison, multiple chains separated by commas, e.g., A,B.

    Output

    Output file for similar residue pairs, CSV format, default is result.csv. Residue pairs with similarity score greater than 2.0 are considered similar; the higher the score, the more similar the residues.

    Results

    Epitope prediction score file result.csv contains:

    Column Description
    PDB_ID Structure name
    Chain Chain name
    Residue position Residue UID
    AA Residue single-letter code
    Epitope_score Probability of being a B-cell epitope, ranging from 0 to 1; higher values indicate higher likelihood of being an epitope.

    N-glycosylation prediction score file result.csv contains:

    Column Description
    PDB_ID Structure name
    Chain Chain name
    Residue position Residue UID
    AA Residue single-letter code
    PTM_score Probability of N-glycosylation at this residue, 0-1; <0.5 indicates unlikely, >0.5 indicates likely.
    PTM_label Determined from PTM_score: 0 = not glycosylated, 1 = glycosylated.
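
    The relationship between PTM_score and PTM_label can be reproduced when post-processing the result file. A minimal sketch using the column names documented above; the sample rows are hypothetical, and scores exactly equal to 0.5 are treated as not glycosylated, which is an assumption because the text does not specify that case.

```python
import csv
import io


def predicted_glycosites(csv_text: str, threshold: float = 0.5):
    """Return (Chain, position, AA) for residues with PTM_score above threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["Chain"], int(row["Residue position"]), row["AA"])
        for row in reader
        if float(row["PTM_score"]) > threshold
    ]


# Hypothetical rows in the documented column layout.
sample = (
    "PDB_ID,Chain,Residue position,AA,PTM_score,PTM_label\n"
    "demo,A,17,N,0.91,1\n"
    "demo,A,42,N,0.12,0\n"
)
print(predicted_glycosites(sample))
```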

    Local similarity comparison result file result.csv contains:

    Column Description
    PDB_ID_1 First structure name
    aa_1 Residue name
    Chain_1 Chain name
    pos_1 Residue UID
    PDB_ID_2 Second structure name
    aa_2 Residue name
    Chain_2 Chain name
    pos_2 Residue UID
    score Similarity score, >2.0 indicates similar residues; higher values indicate higher similarity.
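
    Applying the 2.0 cutoff to the comparison output takes only a filter and a sort. The sketch below works on rows already loaded as dictionaries with the documented column names; the sample rows and the function name similar_pairs are hypothetical.

```python
def similar_pairs(rows, cutoff: float = 2.0):
    """Keep residue pairs scoring above cutoff, most similar first."""
    hits = [r for r in rows if float(r["score"]) > cutoff]
    return sorted(hits, key=lambda r: float(r["score"]), reverse=True)


# Hypothetical rows in the documented column layout.
rows = [
    {"Chain_1": "A", "pos_1": "45", "Chain_2": "B", "pos_2": "51", "score": "2.7"},
    {"Chain_1": "A", "pos_1": "46", "Chain_2": "B", "pos_2": "52", "score": "1.4"},
    {"Chain_1": "A", "pos_1": "80", "Chain_2": "B", "pos_2": "90", "score": "3.1"},
]
print([(r["pos_1"], r["pos_2"]) for r in similar_pairs(rows)])
```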

    References

    • Shashkova, T.I., Umerenkov, D., Salnikov, M., Strashnov, P.V., Konstantinova, A.V., Lebed, I., Shcherbinin, D.N., Asatryan, M.N., Kardymon, O.L., Ivanisenko, N.V. (2022). SEMA: Antigen B-cell conformational epitope prediction using deep transfer learning. Front. Immunol. DOI: 10.3389/fimmu.2022.960985
  • Name: Antibody Sequence Generation & Pairing (p-IgGen)
    Description: 生成抗体Fv区序列,或对已有Fv区序列进行序列自然性评分,可用于抗体重轻链配对分析。 Generates antibody Fv region sequences, or scores the naturalness (log-likelihood) of existing Fv sequences; the scores can be used for antibody heavy and light chain pairing analysis.
    Tags: undefined
    Author: Oliver M Turnbull
    Release: 2025-10-09 00:00:00
    Reference: Turnbull OM, Oglic D, Croasdale-Wood R, Deane CM. p-IgGen: a paired antibody generative language model. Bioinformatics. 2024 Nov 1;40(11):btae659.

    Antibody Sequence Generation & Pairing (p-IgGen)

    简介

    生成抗体Fv区序列,或对已有Fv区序列进行序列自然性评分(Log Likelihood)。Fv序列生成支持多种场景:

    • 基于已有VH(重链Fv序列),生成VL(轻链Fv序列)
    • 基于已有VL,生成VH
    • 基于部分Fv序列(可以是部分重链、部分轻链、或部分重轻链),生成VH、VL或完整Fv序列。

    该功能基于p-IgGen模型实现,p-IgGen是一个专门用于生成抗体重链-轻链配对序列的生成式蛋白质语言模型。由牛津大学与阿斯利康合作开发,其核心目标是生成具有天然抗体特征、且可开发性(developability)良好的抗体序列,用于抗体药物发现。

    p-IgGen模型特性如下:

    特性 描述
    训练数据 基于 Observed Antibody Space(OAS)数据库,包含约 2.5 亿条非配对序列和 180 万条配对序列
    模型结构 自回归解码器(decoder-only),使用旋转位置编码(RoPE),共 1730 万参数
    训练策略 阶段训练:非配对预训练(2.5亿条非配对序列,学习抗体序列的语言模式) → 配对微调(180万条配对序列,学习重链与轻链之间的配对关系)

    p-IgGen 的优势:

    1. 生成序列质量高
      与天然抗体在序列相似性、多样性、CDR 长度分布等方面高度一致,可成功通过结构建模工具(如 ABodyBuilder2)建模,置信度高。
    2. 保留 VH/VL 配对信息
      模型能识别天然配对关系,生成的序列在突变率、配对偏好上与天然抗体一致,在94%的测试中,真实配对的得分高于随机配对。

    参数说明

    Generate

    Generate模式基于已有序列(部分)生成序列

    Initial Sequence

    抗体序列,FASTA格式。最大支持500条序列。
    generate模式中,表示用于序列生成的部分Fv区序列(模型会在当前序列基础上延展生成新的序列)。

    Number of Sequences

    generate模式参数,指定生成的序列数量,默认为10,最大为1000。

    Output Sequence

    generate模式参数,输出生成的序列,FASTA格式。默认为generate.fasta

    Pairing Likelihood

    Pairing Likelihood模式用于抗体重、轻链配对评分

    VH Sequence

    抗体VH(重链Fv区)序列,FASTA格式。最大支持500条序列。
    Pairing Likelihood模式中,表示用于序列配对并进行序列评分的VH序列。

    VL Sequence

    抗体VL(轻链Fv区)序列,FASTA格式。最大支持500条序列。
    Pairing Likelihood模式中,表示用于序列配对并进行序列评分的VL序列。

    Output Pairs

    输出VH、VL配对后的序列文件,FASTA格式,VH与VL序列通过英文冒号分隔。默认为pairs.fasta

    Output Score

    Pairing Likelihood模式参数,输出序列评分文件名,CSV格式,默认为pred_scores.csv

    结果说明

    • generate模式参数,输入Initial Sequence,输出generate.fasta
    • Pairing Likelihood模式参数,输入VH Sequence和VL Sequence,输出pairs.fasta和pred_scores.csv,pred_scores.csv包含以下信息:
    列名 说明
    Name 序列名称
    Heavy VH序列名称,进行VH与VL配对评分时输出
    Light VL序列名称,进行VH与VL配对评分时输出
    Log Likelihood 序列自然性评分,数值在 -∞ ~ 0之间,数值越大表示序列越接近天然抗体序列。

    参考文献

    • Oliver M Turnbull, Dino Oglic, Rebecca Croasdale-Wood, Charlotte M Deane, p-IgGen: a paired antibody generative language model, Bioinformatics, Volume 40, Issue 11, November 2024, btae659. DOI:10.1093/bioinformatics/btae659

    Antibody Sequence Generation & Pairing (p-IgGen)

    Introduction

    Generate antibody Fv region sequences or perform naturalness scoring (Log Likelihood) on existing Fv sequences.
    Fv sequence generation supports multiple scenarios:

    • Generate VL (light-chain Fv sequence) based on a given VH (heavy-chain Fv sequence)
    • Generate VH based on a given VL
    • Generate VH, VL, or a complete Fv sequence based on a partial Fv sequence (which may include partial heavy chain, light chain, or both)

    This functionality is powered by the p-IgGen model — a generative protein language model specifically designed for paired antibody heavy-light chain sequence generation.
    Developed through collaboration between the University of Oxford and AstraZeneca, p-IgGen aims to generate antibody sequences that exhibit natural antibody-like features and good developability for antibody drug discovery.

    p-IgGen Model Characteristics

    Feature Description
    Training Data Based on the Observed Antibody Space (OAS) database, containing ~250 million unpaired sequences and 1.8 million paired sequences
    Model Architecture Decoder-only autoregressive model using Rotary Position Embeddings (RoPE) with 17.3 million parameters
    Training Strategy Two-stage training: unpaired pretraining (250M unpaired sequences to learn antibody sequence patterns) → paired fine-tuning (1.8M paired sequences to learn VH–VL pairing relationships)

    Advantages of p-IgGen

    1. High-quality generated sequences
      Generated sequences are highly consistent with natural antibodies in terms of sequence similarity, diversity, and CDR length distribution.
      They can be successfully modeled by structural modeling tools (e.g., ABodyBuilder2) with high confidence.

    2. Retention of VH/VL pairing information
      The model captures natural pairing relationships — generated sequences maintain realistic mutation rates and pairing preferences.
      In 94% of tests, real VH/VL pairs scored higher than random pairs.


    Parameters

    Generate

    The Generate mode produces new antibody sequences based on existing (partial) sequences.

    Initial Sequence

    Antibody sequences in FASTA format, supporting up to 500 sequences.
    In generate mode, this parameter specifies the partial Fv-region sequences on which the model will extend and generate new sequences.

    Number of Sequences

    A parameter for generate mode that specifies the number of sequences to generate.
    Default: 10; Maximum: 1000.

    Output Sequence

    A parameter for generate mode that specifies the output FASTA file for generated sequences.
    Default: generate.fasta.


    Pairing Likelihood

    The Pairing Likelihood mode evaluates the compatibility (pairing likelihood) between antibody heavy- and light-chain sequences.

    VH Sequence

    Antibody VH (heavy-chain Fv region) sequences in FASTA format.
    Supports up to 500 sequences.
    In Pairing Likelihood mode, this parameter supplies the VH sequences used for chain pairing and likelihood scoring.

    VL Sequence

    Antibody VL (light-chain Fv region) sequences in FASTA format.
    Supports up to 500 sequences.
    In Pairing Likelihood mode, this parameter supplies the VL sequences used for chain pairing and likelihood scoring.

    Output Pairs

    The FASTA file containing paired VH and VL sequences.
    VH and VL sequences are joined using a colon (:).
    Default: pairs.fasta.

    Output Score

    A Pairing Likelihood mode parameter specifying the CSV file name for likelihood scoring results.
    Default: pred_scores.csv.


    Results

    • In Generate mode:
      Input: Initial Sequence
      Output: generate.fasta

    • In Pairing Likelihood mode:
      Input: VH Sequence and VL Sequence
      Output: pairs.fasta and pred_scores.csv

    The file pred_scores.csv contains the following fields:

    Column Name Description
    Name Sequence name
    Heavy VH sequence name (output when VH–VL pairing is evaluated)
    Light VL sequence name (output when VH–VL pairing is evaluated)
    Log Likelihood Naturalness score of the sequence. The value ranges from −∞ to 0, and higher values indicate greater similarity to natural antibody sequences.
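
    To see why the score lies between −∞ and 0, recall that an autoregressive model assigns each residue a probability given the residues before it, and the log likelihood combines the logarithms of those probabilities. The sketch below sums them; whether p-IgGen reports a summed or a length-normalized value is not stated here, so treat the summation as an assumption. The function name and the probabilities are hypothetical and not real model output.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum the log-probabilities assigned to each residue of a sequence."""
    if any(not 0.0 < p <= 1.0 for p in token_probs):
        raise ValueError("each probability must be in (0, 1]")
    return sum(math.log(p) for p in token_probs)


# Hypothetical per-residue probabilities, for illustration only.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.3, 0.25]

ll_confident = sequence_log_likelihood(confident)
ll_uncertain = sequence_log_likelihood(uncertain)
print(ll_confident, ll_uncertain)
```

    Every term is the logarithm of a number no greater than 1, so the total can never exceed 0, and a sequence the model finds more plausible scores closer to 0.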

    Reference

    • Oliver M Turnbull, Dino Oglic, Rebecca Croasdale-Wood, Charlotte M Deane, p-IgGen: a paired antibody generative language model, Bioinformatics, Volume 40, Issue 11, November 2024, btae659. DOI:10.1093/bioinformatics/btae659
  • Name: Protein Binder Design (BindCraft)
    Description: 设计蛋白高亲和力Binder,可以是线性多肽或小蛋白。 Design high-affinity protein binders, which can be either linear peptides or small proteins.
    Tags: undefined
    Author: Martin Pacesa
    Release: 2025-09-12 12:25:50
    Reference: Pacesa, M., Nickel, L., Schellhaas, C. et al. One-shot design of functional protein binders with BindCraft. Nature (2025).

    Protein Binder Design (BindCraft)

    简介

    设计蛋白高亲和力Binder,可以是线性多肽或小蛋白。模块基于FreeBindCraft实现(FreeBindCraft不同于BindCraft之处在于去掉了商业收费的PyRosetta,使用开源替代),其算法特色在于巧妙利用了AlphaFold2(AF2)的预训练权重,通过反向传播(Backpropagation) 进行“序列幻觉”设计(Hallucination),从头生成能与目标蛋白精准结合的多肽/小蛋白。其自动化流程简洁高效:

    • 输入目标蛋白结构:AF2-multimer模型生成初始Binder骨架与序列,并同步优化结合界面。
    • 序列优化(ProteinMPNN):在固定结合界面的前提下,优化Binder核心与表面序列,提升其表达性和稳定性。
    • 质量过滤(AF2单体模型):最终通过AF2单体模型进行严格过滤,确保设计出的Binder质量可靠。
      11df0e4cf852cc43864c705a5ac8c911_e23a200422fb4ec982ea11036ed9a025.png

    与传统方法固定目标蛋白结构不同,FreeBindCraft允许目标和结合剂的骨架均保持一定灵活性,从而能动态“塑形”出完美匹配的界面,更真实地模拟自然界的诱导拟合(Induced Fit)过程。

    研究人员在12个极具挑战性的目标上测试了FreeBindCraft,涵盖了细胞受体、过敏原、基因编辑酶等,仅测试少量设计(6-53个)便取得了惊人成果:

    • 免疫检查点(PD-1/PD-L1):
      PD-1:53个设计中有13个成功,最强结合剂亲和力(Kd)<1 nM,并可有效阻断天然配体结合。
      PD-L1:9个设计中有7个成功,展现出高特异性。
    • 过敏原(Bet v1):成功设计出能中和IgE抗体反应的结合剂,单分子即可阻断50%的IgE结合,有望用于过敏治疗。
    • 基因编辑酶(SpCas9):6个设计全部成功,能有效抑制Cas9的编辑活性,为精准调控基因编辑提供了新工具。
    • 病毒重定向(AAV):成功设计出微型结合剂,使腺相关病毒(AAV)能特异性靶向HER2/PD-L1等靶点,将基因递送效率提升高达100倍,为基因疗法开辟了新途径。
      8509a0998e6153476c6b6d51a9978f9b_5495bb6dc77c4675930d1e8baead7a06.png

    实验验证有结合的binder数目及其与Binder长度的分布如下图所示:
    image.png

    有测定亲和力数值的Binder信息如下:
    企业微信截图_17591584823338.png

    关于FreeBindCraft功能总结

    全面开源

    FreeBindCraft不同于BindCraft之处在于去掉了商业收费的PyRosetta,用一套“全开源”组合策略来填补Rosetta 在原流程中的功能空缺,核心思路是:
    1.用 GPU 加速的开源物理引擎 OpenMM 替代 Rosetta 的 FastRelax,对复合体做结构松弛,速度提高 2–4 倍;
    2.用 MIT 授权的 sc-rs 库计算形状互补(Shape Complementarity),取代 Rosetta InterfaceAnalyzer 的 SC 打分;
    3.用 FreeSASA + Biopython 完成表面积/疏水性分析,替换 Rosetta 的界面能量项;
    4.结构比对、RMSD 计算等几何操作全部改用 Biopython,彻底去掉 Rosetta 的结构工具依赖;
    5.氢键网络评估因预测价值有限,直接舍弃,不再作为强制过滤条件。
    实测显示,90%的失败设计在前期已被AlphaFold2筛除,Rosetta能量阈值仅贡献约9%的额外拒绝,因此上述开源替换几乎无损性能。

    性能对比

    FreeBindCraft 在速度和效率上显著优于传统 BindCraft,运行快近 3 倍,所需轨迹减少 37%,同时保持设计质量和置信度不变。

    指标 传统 BindCraft (PyRosetta) FreeBindCraft (开源) 优势
    接受设计数 101 101 持平
    所需轨迹数 144 91 减少 37%,更高效
    运行时间 33.19 小时 12.25 小时 快 63%,近 3 倍加速
    平均 ipTM 0.785 0.792 持平,略优

    参数说明

    Target

    靶点蛋白的结构,PDB格式。靶点结构中尽量只保留与Binder结合的链,其他链去除,能缩短设计时间。

    Chain

    指定靶点结构中的哪些链作为受体与设计的Binder进行结合,多条链用英文逗号分隔,如:A或者A,B

    Hotspot

    指定结合位点的残基,支持范围符号,多个区域用英文逗号分隔,如1-10,12,15,如果有多条链时,可以在残基前加上链名来指定,如A1-10,A15,B1-20,B26。
    注意:
    1.当不指定该参数时,默认使用AF2-multimer预测的结合位点。
    2.残基编号为pdb文件的uid

    Length

    指定需设计的Binder长度,可以是固定长度,或长度范围,如10或者10-30。
    注意:

    • 长度<=30时,会认为是多肽,采用多肽的设计策略;长度>=31时,会认为是小蛋白,采用蛋白设计策略。所以指定长度范围时,不要跨越30,如设置为29-40时系统会提示错误。
    • 长度范围不要设置过大,跨度10个AA比较合适,范围过大时设计耗时很长,一般的长度范围与耗时如下:

    Number of Designs

    最终设计的Binder数量,默认为10,目前最大支持100,数量越多所需计算时间越长。

    不同的Binder长度,设计数量与所需计算时间大致如下:

    Length Number of Designs Time(h)
    65-150 100 ~48
    10 10 ~12
    50 10 ~4
    50 100 ~41
    100 10 ~2
    90-120 10 ~5

    Flexible

    指定靶点结构是否支持柔性,选中表示靶点链在设计中,其骨架坐标允许1–2Å的RMSD变化,以满足与Binder结合时的诱导契合。

    结果说明

    设计的靶点-Binder复合物结构,最多展示前5个。
    所有设计结果的打包文件designs.tar.gz
    设计结果的详细打分文件final_design_stats.csv

    打分指标及其解释见下表:

    特征 描述
    MPNN_score MPNN序列评分,一般不推荐使用,因为依赖于蛋白质本身
    MPNN_seq_recovery MPNN对原始轨迹的序列恢复率
    pLDDT AF2复合物预测的pLDDT置信度评分,归一化到0-1
    pTM AF2复合物预测的pTM置信度评分,归一化到0-1
    i_pTM AF2复合物预测的接口pTM置信度评分,归一化到0-1
    pAE AF2复合物预测的预测对齐误差,归一化(AF2对比n/31)到0-1
    i_pAE AF2复合物预测的接口预测对齐误差,归一化(AF2对比n/31)到0-1
    i_pLDDT AF2复合物预测的接口pLDDT置信度评分,归一化到0-1
    ss_pLDDT AF2复合物预测的二级结构pLDDT置信度评分,归一化到0-1
    Unrelaxed_Clashes 放松前的接口碰撞数量
    Relaxed_Clashes 放松后的接口碰撞数量
    Binder_Energy_Score 单独binder的Rosetta能量评分
    Surface_Hydrophobicity binder表面疏水性分数
    ShapeComplementarity 接口形状互补性
    PackStat 接口PackStat Rosetta得分
    dG 接口Rosetta dG能量
    dSASA 接口delta SASA(面积大小)
    dG/dSASA 接口能量除以接口面积
    Interface_SASA_% 接口覆盖binder表面的比例
    Interface_Hydrophobicity binder接口的疏水性比例
    n_InterfaceResidues 接口残基数量
    n_InterfaceHbonds 接口处的氢键数量
    InterfaceHbondsPercentage 氢键数量占接口面积比例
    n_InterfaceUnsatHbonds 接口处未满足的埋藏氢键数量
    InterfaceUnsatHbondsPercentage 未满足埋藏氢键占接口面积比例
    Interface_Helix% 接口处α螺旋比例
    Interface_BetaSheet% 接口处β折叠比例
    Interface_Loop% 接口处环结构比例
    Binder_Helix% binder结构中α螺旋比例
    Binder_BetaSheet% binder结构中β折叠比例
    Binder_Loop% binder结构中环结构比例
    InterfaceAAs 接口处每种氨基酸的数量
    HotspotRMSD binder相对于原始轨迹的未对齐RMSD,即重新预测的复合物中binder与原始结合位点的偏差
    Target_RMSD 在设计的binder背景下预测的目标RMSD,与输入PDB对比
    Binder_pLDDT 单独预测的binder pLDDT置信度评分
    Binder_pTM 单独预测的binder pTM置信度评分
    Binder_pAE 单独预测的binder预测对齐误差
    Binder_RMSD 单独预测的binder RMSD,与原始轨迹对比

    以N_开头的特征对应每个AlphaFold模型的统计信息,平均值为所有预测模型的平均。

    参考文献

    • Pacesa, M. et al. One-shot design of functional protein binders with BindCraft. Nature (2025). DOI:10.1038/s41586-025-09429-6.

    Protein Binder Design (BindCraft)

    Introduction

    Design high-affinity protein binders, which can be either linear peptides or small proteins.
    This module is based on FreeBindCraft (FreeBindCraft differs from BindCraft by removing the commercial PyRosetta dependency, using open-source alternatives instead).

    The algorithm leverages AlphaFold2 (AF2) pre-trained weights and performs sequence hallucination via backpropagation, generating de novo peptides/small proteins that bind precisely to target proteins.

    The automated workflow is streamlined and efficient:

    • Input target protein structure: The AF2-multimer model generates the initial binder backbone and sequence while optimizing the binding interface.
    • Sequence optimization (ProteinMPNN): With the interface fixed, the binder’s core and surface residues are optimized to improve expression and stability.
    • Quality filtering (AF2 monomer model): Final filtering is done with AF2 single-chain models to ensure reliable binder quality.
      11df0e4cf852cc43864c705a5ac8c911_e23a200422fb4ec982ea11036ed9a025.png

    Unlike traditional approaches that hold the target protein structure rigid, FreeBindCraft allows both the target and binder backbones to retain some flexibility, dynamically "sculpting" a well-matched interface and better mimicking the natural induced-fit process.

    Benchmark Results

    FreeBindCraft was tested on 12 challenging targets, including cell receptors, allergens, and genome-editing enzymes. Even with only a small number of designs (6–53), it achieved striking results:

    • Immune checkpoints (PD-1/PD-L1):

      • PD-1: 13 out of 53 designs were successful, with the strongest binder showing Kd < 1 nM, effectively blocking the natural ligand.
      • PD-L1: 7 out of 9 designs were successful, showing high specificity.
    • Allergen (Bet v1): Designed binders neutralized IgE binding, with a single molecule blocking 50% of IgE–antigen interactions, offering potential for allergy therapy.

    • Genome editing enzyme (SpCas9): All 6 designs successfully inhibited Cas9 editing activity, providing a precise tool for controlling gene editing.

    • Viral retargeting (AAV): Mini-binders were designed to redirect adeno-associated virus (AAV) to targets like HER2/PD-L1, boosting gene delivery efficiency up to 100-fold, opening new avenues for gene therapy.
      8509a0998e6153476c6b6d51a9978f9b_5495bb6dc77c4675930d1e8baead7a06.png

    The number of experimentally validated binders and their distribution by binder length are shown in the figure below:
    image.png

    Information on binders with measured affinity values is as follows:
    企业微信截图_17591584823338.png

    FreeBindCraft at a Glance

    Open-source Replacements in FreeBindCraft

    FreeBindCraft replaces Rosetta components with a fully open-source strategy, filling in functional gaps while maintaining performance:

    1. OpenMM (GPU-accelerated) replaces Rosetta’s FastRelax for complex relaxation, achieving 2–4× faster speed.
    2. sc-rs library (MIT-licensed) calculates shape complementarity, replacing Rosetta’s InterfaceAnalyzer SC score.
    3. FreeSASA + Biopython perform surface area/hydrophobicity analysis, replacing Rosetta’s interface energy term.
    4. Biopython handles structural alignment, RMSD, and geometry operations, eliminating Rosetta structural utilities.
    5. Hydrogen bond network evaluation was discarded due to limited predictive value, no longer used as a mandatory filter.

    Benchmarks show that ~90% of failed designs were already filtered out by AF2, with Rosetta thresholds only contributing ~9% additional rejection. Thus, these open-source replacements cause negligible performance loss.

    Head-to-head performance

    FreeBindCraft is almost 3× faster and needs 37% fewer trajectories while preserving design quality and AlphaFold confidence.

    Metric BindCraft (PyRosetta) FreeBindCraft (open source) Advantage
    Accepted designs 101 101 Equal
    Trajectories needed 144 91 –37 %, more efficient
    Runtime (B200 GPU) 33.19 h 12.25 h –63 %, ≈3× faster
    Mean ipTM 0.785 0.792 Equal, slightly better
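
    The percentages in the table can be rechecked from the raw numbers. The short calculation below uses only the values listed above; the variable names are illustrative.

```python
# Values copied from the comparison table above.
traj_bindcraft, traj_free = 144, 91
hours_bindcraft, hours_free = 33.19, 12.25

traj_reduction = 1 - traj_free / traj_bindcraft    # fraction of trajectories saved
time_reduction = 1 - hours_free / hours_bindcraft  # fraction of runtime saved
speedup = hours_bindcraft / hours_free             # how many times faster

print(f"trajectories: -{traj_reduction:.1%}")                        # trajectories: -36.8%
print(f"runtime: -{time_reduction:.1%}, speedup {speedup:.2f}x")     # runtime: -63.1%, speedup 2.71x
```

    The 63% runtime reduction and the "almost 3×" speedup therefore describe the same measurement from two angles.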

    Parameters

    Target

    The target protein structure, in PDB format. In the target structure, retain only the chain(s) that interact with the Binder and remove all others; this can significantly shorten the design time.

    Chain

    Specify which chains in the target structure are used as receptors for binder design. Multiple chains are separated by commas, e.g. A or A,B.

    Hotspot

    Specify binding site residues. Range syntax is supported, and multiple ranges are separated by commas, e.g. 1-10,12,15.
    For multi-chain targets, prefix residue numbers with chain IDs, e.g. A1-10,A15,B1-20,B26.
    Note:

    1. When this parameter is not specified, AF2-multimer predicted binding sites are used by default.
    2. Residue numbering corresponds to the PDB file’s unique identifier (uid).
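
    The hotspot syntax can be read as a comma-separated list of tokens, each an optional chain ID followed by a residue number or a start-end range. The sketch below expands such a string into individual residues. It is an illustration of the syntax only, not the module's own parser; the function name is hypothetical, and insertion codes and negative residue numbers are not handled.

```python
import re

# Optional one-letter chain ID, a residue number, and an optional "-end" part.
_TOKEN = re.compile(r"^([A-Za-z]?)(\d+)(?:-(\d+))?$")


def parse_hotspots(spec):
    """Expand e.g. 'A1-3,A15,B26' into a list of (chain, residue_number) tuples."""
    residues = []
    for token in spec.split(","):
        token = token.strip()
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"cannot parse hotspot token: {token!r}")
        chain = match.group(1)  # empty string when no chain prefix is given
        start = int(match.group(2))
        end = start if match.group(3) is None else int(match.group(3))
        if end < start:
            raise ValueError(f"range end precedes start: {token!r}")
        residues.extend((chain, number) for number in range(start, end + 1))
    return residues


print(parse_hotspots("1-3,12"))
print(parse_hotspots("A1-2,A15,B26"))
```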

    Length

    Specify binder length, either as a fixed length or a range, e.g. 10 or 10-30.
    Note:

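    • Length ≤30 → treated as a peptide; the peptide design strategy is applied.
    • Length ≥31 → treated as a small protein; the protein design strategy is applied.
    • Length ranges must not cross 30. For example, 29-40 is invalid and will raise an error.
    • Keep the length range narrow. A span of about 10 residues is appropriate; wider ranges take much longer to design.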
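
    The length rules above can be sketched as a small validation function. This is an illustration of the stated rules, not the module's own code; the function name and error messages are hypothetical.

```python
def classify_length(spec):
    """Return (strategy, min_len, max_len) for a spec such as '10' or '10-30'."""
    parts = spec.strip().split("-")
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"unrecognized length spec: {spec!r}")
    if low < 1 or high < low:
        raise ValueError(f"invalid length bounds: {spec!r}")
    if high <= 30:
        return ("peptide", low, high)  # whole range is at most 30
    if low >= 31:
        return ("protein", low, high)  # whole range is at least 31
    raise ValueError("length range must not cross 30")


print(classify_length("10-30"))
print(classify_length("90-120"))
```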

    Number of Designs

    The number of binders to design. Default: 10. Maximum supported: 100. Larger numbers of designs require longer computation time.

    For different binder lengths, the approximate number of designs and computation time are as follows:

    Length Number of Designs Time (h)
    65-150 100 ~48
    10 10 ~12
    50 10 ~4
    50 100 ~41
    100 10 ~2
    90-120 10 ~5

    Flexible

    Specify whether the target structure supports flexibility. If selected, target backbones are allowed RMSD changes of 1–2 Å during design to accommodate induced fit.


    Results

    The designed target–binder complex structures; at most the top 5 are displayed.
    All design results are packaged in the file designs.tar.gz.
    The detailed scoring file for the design results is final_design_stats.csv.

    The features and their explanations are provided in the table below:

    Features Description
    MPNN_score MPNN sequence score; generally not recommended for ranking because it depends on the protein
    MPNN_seq_recovery MPNN sequence recovery of original trajectory
    pLDDT pLDDT confidence score of AF2 complex prediction, normalised to 0-1
    pTM pTM confidence score of AF2 complex prediction, normalised to 0-1
    i_pTM interface pTM confidence score of AF2 complex prediction, normalised to 0-1
    pAE predicted alignment error of AF2 complex prediction, normalised to 0-1 (AF2 value divided by 31)
    i_pAE predicted interface alignment error of AF2 complex prediction, normalised to 0-1 (AF2 value divided by 31)
    i_pLDDT interface pLDDT confidence score of AF2 complex prediction, normalised to 0-1
    ss_pLDDT secondary structure pLDDT confidence score of AF2 complex prediction, normalised to 0-1
    Unrelaxed_Clashes number of interface clashes before relaxation
    Relaxed_Clashes number of interface clashes after relaxation
    Binder_Energy_Score Rosetta energy score for binder alone
    Surface_Hydrophobicity surface hydrophobicity fraction for binder
    ShapeComplementarity interface shape complementarity
    PackStat interface packstat rosetta score
    dG interface rosetta dG energy
    dSASA interface delta SASA (size)
    dG/dSASA interface energy divided by interface size
    Interface_SASA_% Fraction of binder surface covered by the interface
    Interface_Hydrophobicity Interface hydrophobicity fraction of binder interface
    n_InterfaceResidues number of interface residues
    n_InterfaceHbonds number of hydrogen bonds at the interface
    InterfaceHbondsPercentage number of hydrogen bonds compared to interface size
    n_InterfaceUnsatHbonds number of unsatisfied buried hydrogen bonds at the interface
    InterfaceUnsatHbondsPercentage number of unsatisfied buried hydrogen bonds compared to interface size
    Interface_Helix% proportion of alpha helices at the interface
    Interface_BetaSheet% proportion of beta sheets at the interface
    Interface_Loop% proportion of loops at the interface
    Binder_Helix% proportion of alpha helices in the binder structure
    Binder_BetaSheet% proportion of beta sheets in the binder structure
    Binder_Loop% proportion of loops in the binder structure
    InterfaceAAs number of amino acids of each type at the interface
    HotspotRMSD unaligned RMSD of binder compared to original trajectory, i.e. how far the binder in the repredicted complex is from the original binding site
    Target_RMSD RMSD of target predicted in context of the designed binder compared to input PDB
    Binder_pLDDT pLDDT confidence score of binder predicted alone
    Binder_pTM pTM confidence score of binder predicted alone
    Binder_pAE predicted alignment error of binder predicted alone
    Binder_RMSD RMSD of binder predicted alone compared to original trajectory

    Features starting with N_ correspond to the statistics for each individual AlphaFold model; averages are taken across all predicted models.


    Reference

    • Pacesa, M. et al. One-shot design of functional protein binders with BindCraft. Nature (2025). DOI:10.1038/s41586-025-09429-6.
  • Name: SDF File
    Description: SDF File用于指定SDF格式的小分子结构文件的模块,用于一个文件在多个模块的输入。 SDF File is a module for specifying small molecule structure in SDF format which could be used for multiple modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 17:14:38
    Reference: NA

    SDF File

    简介

    SDF File是一个用于指定SDF文件的模块,可用于其他模块的输入。

    参数说明

    Input File

    小分子结构文件,SDF

    结果说明

    得到一个与原文件相同的SDF文件

    SDF File

    Introduction

    The SDF File module is used to specify an SDF file that can be used as input for other modules.

    Parameters

    Input File

    Small molecule structure file in SDF format.

    Results

    Produces an SDF file identical to the input file.

  • Name: Excel2Fasta
    Description: 转换包含序列信息的EXCEL或CSV格式文件为序列Fasta格式文件。 Convert sequence information stored in **Excel** or **CSV** format files into **FASTA** format.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-09-12 10:34:18
    Reference:

    Excel2Fasta

    简介

    转换包含序列信息的EXCEL或CSV格式文件为序列Fasta格式文件。

    参数说明

    Input

    Excel或csv格式的文件,必需包含表头信息。

    ID

    Excel或csv格式文件中,序列ID所在的列名,如:Seq_ID,当该参数未设置时,序列名称默认从1开始进行顺序设置。

    Sequence

    Excel或csv格式文件中,序列所在的列名,如:Sequence

    Rest

    设置是否将文件中除去ID与Sequence外的其他列数据,若选择该选项,则将其他列数据以field=value的形式放置在Fasta文件的序列名中。

    Output

    输出Fasta文件名称,默认为convert.fasta

    结果说明

    输出Fasta文件,默认为convert.fasta

    Excel2Fasta

    Introduction

    Convert sequence information stored in Excel or CSV format files into FASTA format.

    Parameters

    Input

    A file in Excel or CSV format. It must include a header row.

    ID

    The column name in the Excel or CSV file that contains the sequence IDs (e.g., Seq_ID).
    If this parameter is not specified, sequence IDs will be assigned sequentially starting from 1.

    Sequence

    The column name in the Excel or CSV file that contains the sequences (e.g., Sequence).

    Rest

    Set whether to include columns other than ID and Sequence from the file. If this option is selected, the additional columns will be appended to the FASTA sequence name in the format field=value.

    Output

    The name of the output FASTA file. Default: convert.fasta.

    Result

    Outputs a FASTA file, with the default name convert.fasta.
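
    The conversion described by the ID, Sequence and Rest parameters can be sketched for the CSV case as follows. This is a minimal illustration using only the standard library, not the module's implementation; the function name, the space used to separate the field=value pairs, and the sample rows are all assumptions.

```python
import csv
import io

# Made-up input for illustration; the first line is the required header row.
SAMPLE = """Seq_ID,Sequence,Species
ab1,EVQLVESGG,human
ab2,DIQMTQSPS,mouse
"""


def table_to_fasta(handle, seq_col, id_col=None, rest=False):
    """Convert CSV rows (with header) into FASTA text."""
    records = []
    for n, row in enumerate(csv.DictReader(handle), start=1):
        # Without an ID column, sequences are named 1, 2, 3, ...
        name = row[id_col] if id_col else str(n)
        if rest:
            # Append the remaining columns to the header line as field=value.
            extras = [f"{k}={v}" for k, v in row.items() if k not in (id_col, seq_col)]
            name = " ".join([name] + extras)
        records.append(f">{name}\n{row[seq_col]}")
    return "\n".join(records) + "\n"


fasta = table_to_fasta(io.StringIO(SAMPLE), seq_col="Sequence", id_col="Seq_ID", rest=True)
print(fasta)
```

    Excel input would need a spreadsheet reader in place of csv.DictReader; the naming logic stays the same.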

  • Name: Ligand Preparation (Meeko)
    Description: 分子预处理工具,主要作用是对输入分子进行标准化和扩展,生成适合后续对接、虚拟筛选或机器学习的分子结构。 Molecular preprocessing tool, mainly used to standardize and expand input molecules, generating molecular structures suitable for subsequent docking, virtual screening, or machine learning.
    Tags: undefined
    Author:
    Release: 2025-09-12 00:00:00
    Reference:

    Ligand Preparation (Meeko)

    简介

    Meeko是一个分子预处理工具,主要作用是对输入分子进行标准化和扩展,生成适合后续对接、虚拟筛选或机器学习的分子结构。支持uff, mmff94, mmff94s, espaloma力场

    参数说明

    Small Molecule File

    小分子文件,支持Mol (.mol), SD (.sdf), SMILES (.smi )格式,支持单个或批量的小分子输入。

    PH

    根据指定的pH值(如 --ph 7.4),预测分子的质子化/去质子化状态,并在pH5–9范围内考虑其质子化异构体(protomer)和互变异构体(tautomer)。

    Acidbase

    默认枚举酸碱异构体,若选择该选项,则跳过酸碱异构体的生成。

    Tautomers

    默认枚举可能的互变异构体,若选择该选项,则跳过互变异构体的生成。

    Ringfix

    默认修复六元环的芳香化、张力结构等问题,若选择该选项,则跳过六元环的修复。

    Gen3d

    默认生成3D构象坐标,若选择该选项,则跳过3D坐标的生成,只保留2D。

    Force Field

    3D构象生成相关的参数,用于构象优化的力场,默认为MMFF94。

    • UFF:Universal Force Field,通用但精度一般,速度快。
    • MMFF94:Merck Molecular Force Field 94,适合小分子,精度较高。
    • MMFF94s:mmff94 的简化版,稍快。

    Name from Prop

    将分子名称设置为来自SDF文件中小分子属性,如:SDF文件中<IDNUMBER>,可以输入IDNUMBER,作为小分子的名称,适合大批量的小分子输入。

    Output File

    输出文件名称,支持SDF和HDF5格式。

    结果说明

    输出结果为优化后的结构文件preprocessed.sdf,每个小分子末尾都会包含ScrubInfo,ScrubInfo包含如下信息:

    列名 说明
    isomerGroup 输入小分子顺序编号(每个分子一个组号)
    isomerId 异构体编号信息(同一分子下的不同异构体)
    confId 构象编号信息(同一异构体下的不同3D构象)
    nr_isomers 该输入分子的异构体总数
    nr_conformers 该输入分子或异构体对应的3D构象总数

    Ligand Preparation (Meeko)

    Overview

    Meeko is a molecular preprocessing tool that standardizes and expands input molecules, generating structures suitable for subsequent docking, virtual screening, or machine learning. It supports force fields including UFF, MMFF94, MMFF94s, and ESPALOMA.

    Parameters

    Small Molecule File

    Input small molecule file in Mol (.mol), SD (.sdf), or SMILES (.smi) format. Both single-molecule and batch input are supported.

    PH

    Predicts the protonation/deprotonation states of molecules at a specified pH (e.g., --ph 7.4), considering protomers and tautomers within the pH range of 5–9.

    Acidbase

    By default, acid-base isomers are enumerated. If this option is selected, acid-base enumeration will be skipped.

    Tautomers

    By default, possible tautomers are enumerated. If this option is selected, tautomer enumeration will be skipped.

    Ringfix

    By default, issues in six-membered rings such as aromaticity or ring strain are fixed. If this option is selected, six-membered ring correction will be skipped.

    Gen3d

    By default, 3D coordinates are generated. If this option is selected, 3D generation is skipped, and only 2D coordinates are retained.

    Force Field

    Force field used for conformer optimization. Default is MMFF94.

    • UFF: Universal Force Field, general-purpose, fast but moderate accuracy.
    • MMFF94: Merck Molecular Force Field 94, suitable for small molecules with higher accuracy.
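    • MMFF94s: "Static" variant of MMFF94, with modified torsion and out-of-plane parameters that favor planar geometries at delocalized trivalent nitrogens; intended for energy-minimized structures.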

    Name from Prop

    Set the molecule name from a property in the SDF file, e.g., <IDNUMBER>. You can specify IDNUMBER as the molecule name, which is suitable for batch input of small molecules.
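
    To illustrate what naming by property means, the sketch below reads the value of one data item (here IDNUMBER) from each record of an SDF text. It is a simplified, standard-library-only reading of the SDF layout, in which a data item starts with a line such as "> <IDNUMBER>" followed by its value and records end with "$$$$". The function name and the two empty sample records are hypothetical; the module's actual behavior may differ.

```python
import re

# Matches a data-item header line such as "> <IDNUMBER>" and captures the tag.
_TAG = re.compile(r"^>.*<([^>]+)>")

SAMPLE_SDF = """mol_a
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <IDNUMBER>
Z123

$$$$
mol_b
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <IDNUMBER>
Z456

$$$$
"""


def names_from_prop(sdf_text, prop):
    """Return one name per SDF record, read from data item `prop` (None if absent)."""
    names = []
    for record in sdf_text.split("$$$$"):
        if not record.strip():
            continue  # skip the empty remainder after the final delimiter
        lines = record.splitlines()
        name = None
        for i, line in enumerate(lines):
            match = _TAG.match(line)
            if match and match.group(1) == prop and i + 1 < len(lines):
                name = lines[i + 1].strip()  # the value is on the following line
                break
        names.append(name)
    return names


print(names_from_prop(SAMPLE_SDF, "IDNUMBER"))
```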

    Output File

    Specifies the output file name. Supports SDF and HDF5 formats.

    Output Description

    The output is an optimized structure file preprocessed.sdf. Each molecule includes a ScrubInfo section containing the following information:

    Column Name Description
    isomerGroup Sequential group number of input molecules (one group per molecule)
    isomerId Isomer ID (different isomers of the same molecule)
    confId Conformer ID (different 3D conformers of the same isomer)
    nr_isomers Total number of isomers for the input molecule
    nr_conformers Total number of 3D conformers for the input molecule or isomer
  • Name: Protein Design (RFDiffusion2)
    Description: 用于从头设计具有理想催化活性的酶。模块基于RFdiffusion2模型,引入流匹配(flow matching)技术替代传统的扩散方法,能够在原子分辨率下直接对酶的活性位点进行骨架化设计,而无需预先指定序列位置或侧链构象,从而显著提高了设计的灵活性与成功率。 A tool for de novo design of enzymes with desired catalytic activity. The module is based on the RFdiffusion2 model, which introduces **flow matching** to replace traditional diffusion methods. It enables atom-level scaffold design of enzyme active sites directly, without predefining sequence positions or side-chain conformations, thus significantly improving design flexibility and success rates.
    Tags: undefined
    Author: Woody Ahern
    Release: 2025-09-02 15:24:57
    Reference: Atom level enzyme active site scaffolding using RFdiffusion2. Woody Ahern, Jason Yim, Doug Tischer, Saman Salike, Seth M. Woodbury, Donghyo Kim, Indrek Kalvet, Yakov Kipnis, Brian Coventry, Han Raut Altae-Tran, Magnus Bauer, Regina Barzilay, Tommi S. Jaakkola, Rohith Krishna, David Baker.

    Protein Design (RFDiffusion2)

    简介

    用于从头设计具有理想催化活性的酶。模块基于RFdiffusion2模型,引入流匹配(flow matching)技术替代传统的扩散方法,能够在原子分辨率下直接对酶的活性位点进行骨架化设计,而无需预先指定序列位置或侧链构象,从而显著提高了设计的灵活性与成功率。实验结果表明,RFdiffusion2 不仅在计算基准测试中超越了现有方法,还能生成具备体外催化活性的功能性酶,为从头酶设计开辟了新的可能。
    image.png

    RFdiffusion2在 <100 个酶设计的测试中,就拿到了5种具备实际催化活性的酶;其中一个锌水解酶的活性远超以往工程酶。在Benchmark验证中,从M-CSA数据库中筛选41个真实酶活性位点,每个活性位点提取原子级motif(随机选择催化残基的部分原子)作为输入,使用传统RFdiffusion与RFdiffusion2进行设计,结果显示RFdiffusion2成功解决了41个挑战任务,相比之下,传统的RFdiffusion只能解决其中16个。

    参数说明

    Reference Protein Structure

    在酶设计中,可通过参考结构(如酶活性位点的 Theozyme)作为PDB格式输入。在该结构中引入一个特殊的ORI伪原子(pseudo-atom),用于提供基序放置位置的先验信息。这个伪原子通常放置在酶活性口袋的几何中心,作为目标结构的参考点,引导模型在生成过程中合理定位活性位点及其周围支架的空间关系。
    可以使用PyMOL创建该伪原子,方法如下:

    # 1️⃣ 选择口袋残基,构建口袋的原子 selection
    # 这里假设口袋由 A 链的 11、72、92、94、117、177 号残基组成
    select pocket, (resi 11+72+92+94+117+177 and chain A)
    
    # 2️⃣ 在口袋几何中心创建伪原子 ORI
    # 参数说明:
    # - ORI: 创建的对象名
    # - pocket: 使用 selection 的几何中心作为位置
    # - name=ORI: 原子名为 ORI
    # - resn=ORI: 残基名为 ORI
    # - chain=P: 指定链名为 P
    # - resi=1: 残基编号为 1
    pseudoatom ORI, pocket, name=ORI, resn=ORI, chain=P, resi=1
    
    

    伪原子格式可以参考:

    HETATM   91  ORI ORI B 332       0.000   0.000   0.000  1.00  0.00           X 
    

    Contigs

    定义设计策略,可指定多段区域,用英文逗号分隔。例如:该参数设置为 46,A106-106,59,A166-166,2,A169-169,23,A193-193,46,表示:

    • '46’表示先设计长度为46的motif(也可以指定长度范围,如24-50,表示长度在24至50之间,具体多长是随机的)
    • ‘A106-106’表示紧接着从参考蛋白中取A链中编号为106的残基,其N端连接到上一段’46’设计的motif的C端(也同样可以指定范围,如:A100-118,表示从参考蛋白中取A链100-118的残基)。
    • '59’表示设计长度为59的motif,其N端连接到上一段motif的C端。
    • ‘A166-166’表示紧接着从参考蛋白中取A链中编号为166的残基,其N端连接到上一段motif的C端。
    • '2’表示设计长度为2的motif,其N端连接到上一段motif的C端。
    • ‘A169-169’表示紧接着从参考蛋白中取A链中编号为169的残基,其N端连接到上一段motif的C端。
    • '23’表示设计长度为23的motif,其N端连接到上一段motif的C端。
    • ‘A193-193’表示紧接着从参考蛋白中取A链中编号为193的残基,其N端连接到上一段motif的C端。
    • '46’表示设计长度为46的motif,其N端连接到上一段motif的C端。

    Ligand

    指定参考结构中,小分子或虚拟原子的名称,可设置多个,用英文逗号分隔,如:NAD,OXM

    Active Site Atoms

    指定构成活性口袋的原子,通过链名,残基名和原子名称来指定,格式为:链名残基名:原子1名称,原子2名称...,多个残基之间用英文分号分隔。例如:A106:NE,CD,CZ;A166:OD1,CG;A169:NH2,CZ;A193:NE2,CD2,CE1表示:
    活性口袋中的原子为:A链残基106中的NE,CD,CZ原子;A链残基166的OD1与CG原子;A链残基169的NH2与CZ原子;A链残基193的NE2,CD2,CE1原子。

    Number of Designs

    指定设计的数量,默认为10,最大不超过100

    Output Prefix

    输出文件的前缀,默认为result,对应的输出文件为result_0.pdb,result_1.pdb…

    结果说明

    设计得到的结构文件result_0.pdb,result_1.pdb…
    所有结果的打包文件result.tar.gz

    注意:

    • 设计得到的为聚丙氨酸(poly-A)序列,这并不是错误。因为RFdiffusion2是一种骨架生成模型,不会为设计的区域生成序列。这里推荐采用ProteinMPNN(中的ligandMPNN模式)进行序列设计(WeMol中已部署该模块,使用这里生成的PDB结构进行序列设计即可)。
    • 输出的PDB文件从1开始重新编号。

    参考文献

    • Atom level enzyme active site scaffolding using RFdiffusion2. Woody Ahern, Jason Yim, Doug Tischer, Saman Salike, Seth M. Woodbury, Donghyo Kim, Indrek Kalvet, Yakov Kipnis, Brian Coventry, Han Raut Altae-Tran, Magnus Bauer, Regina Barzilay, Tommi S. Jaakkola, Rohith Krishna, David Baker. DOI:10.1101/2025.04.09.648075

    Protein Design (RFDiffusion2)

    Introduction

    A tool for de novo design of enzymes with desired catalytic activity. The module is based on the RFdiffusion2 model, which introduces flow matching to replace traditional diffusion methods. It enables atom-level scaffold design of enzyme active sites directly, without predefining sequence positions or side-chain conformations, thus significantly improving design flexibility and success rates. Experimental results show that RFdiffusion2 not only outperforms existing methods in computational benchmarks but also generates functional enzymes with in vitro catalytic activity, opening new possibilities for de novo enzyme design.
    image.png
    In a test of fewer than 100 designed enzymes, RFdiffusion2 successfully produced 5 enzymes with actual catalytic activity; among them, one zinc hydrolase exhibited activity far surpassing previous engineered enzymes. In benchmark validation, 41 real enzyme active sites were extracted from the M-CSA database. For each site, atomic-level motifs (randomly selecting atoms from catalytic residues) were used as inputs for design by both traditional RFdiffusion and RFdiffusion2. The results showed that RFdiffusion2 successfully solved all 41 challenge cases, whereas traditional RFdiffusion solved only 16.

    Parameters

    Reference Protein Structure

    In enzyme design, a reference structure (such as the Theozyme of the enzyme active site) can be provided in PDB format.
    Within this structure, a special ORI pseudo-atom is introduced to provide prior spatial information for motif placement.

    This pseudo-atom is typically positioned at the geometric center of the enzyme active pocket, serving as a reference point to guide the model in properly aligning the active site with the surrounding scaffold during the design process.
    The pseudo-atom can be created in PyMOL as follows:

    # 1️⃣ Select the residues that form the binding pocket
    # Example: pocket consists of residues 11, 72, 92, 94, 117, 177 in chain A
    select pocket, (resi 11+72+92+94+117+177 and chain A)
    
    # 2️⃣ Create a pseudo-atom (ORI) at the geometric center of the pocket
    # Parameter explanation:
    # - ORI: name of the created object
    # - pocket: use the geometric center of this selection as position
    # - name=ORI: atom name set to ORI
    # - resn=ORI: residue name set to ORI
    # - chain=P: assign chain identifier as P
    # - resi=1: assign residue number as 1
    pseudoatom ORI, pocket, name=ORI, resn=ORI, chain=P, resi=1
    

    The pseudo-atom in the exported PDB file will follow a format similar to:

    HETATM   91  ORI ORI B 332       0.000   0.000   0.000  1.00  0.00           X
    

    Contigs

    Defines the design strategy. Multiple segments can be specified, separated by commas.
    Example:
    46,A106-106,59,A166-166,2,A169-169,23,A193-193,46

    This means:

    • 46: first design a motif of length 46 (a range can also be specified, e.g., 24-50, meaning a random length between 24 and 50).
    • A106-106: then take residue 106 from chain A of the reference protein, attaching its N-terminus to the C-terminus of the previously designed 46-length motif (a range such as A100-118 can also be given to take residues 100–118 from chain A).
    • 59: design a motif of length 59, attached to the previous motif’s C-terminus.
    • A166-166: take residue 166 from chain A, attach its N-terminus to the previous motif’s C-terminus.
    • 2: design a motif of length 2, attached to the previous motif’s C-terminus.
    • A169-169: take residue 169 from chain A, attach to the previous motif’s C-terminus.
    • 23: design a motif of length 23, attached to the previous motif’s C-terminus.
    • A193-193: take residue 193 from chain A, attach to the previous motif’s C-terminus.
    • 46: design another motif of length 46, attached to the previous motif’s C-terminus.
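The contig syntax described above can be made concrete with a small parser. This is a minimal Python sketch, not the module's own code: the names `Segment`, `parse_contigs`, and `length_bounds` are hypothetical, and it assumes single-letter chain IDs and only the two segment forms described above (a designed length or length range, and a reference residue range).

```python
import re
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    kind: str             # "design" (new backbone) or "reference" (copied residues)
    chain: Optional[str]  # chain ID for reference segments, None for designed ones
    start: int            # first residue number, or minimum designed length
    end: int              # last residue number, or maximum designed length


# Assumed token shapes: "A106-106" (reference) and "46" or "24-50" (design).
_REFERENCE = re.compile(r"^([A-Za-z])(\d+)-(\d+)$")
_DESIGN = re.compile(r"^(\d+)(?:-(\d+))?$")


def parse_contigs(contigs: str) -> list:
    """Split a comma-separated contig string into Segment records."""
    segments = []
    for token in contigs.split(","):
        token = token.strip()
        ref = _REFERENCE.match(token)
        des = _DESIGN.match(token)
        if ref:
            chain, start, end = ref.group(1), int(ref.group(2)), int(ref.group(3))
            segments.append(Segment("reference", chain, start, end))
        elif des:
            low = int(des.group(1))
            high = int(des.group(2)) if des.group(2) else low
            segments.append(Segment("design", None, low, high))
        else:
            raise ValueError(f"Unrecognized contig segment: {token!r}")
        if segments[-1].start > segments[-1].end:
            raise ValueError(f"Range is reversed in segment: {token!r}")
    return segments


def length_bounds(segments) -> tuple:
    """Return the (minimum, maximum) total chain length implied by the segments."""
    low = high = 0
    for seg in segments:
        if seg.kind == "design":
            low += seg.start
            high += seg.end
        else:
            count = seg.end - seg.start + 1  # reference ranges are inclusive
            low += count
            high += count
    return low, high
```

For the example string above, `length_bounds(parse_contigs(...))` gives `(180, 180)`: 176 designed residues plus the four reference residues.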

    Ligand

    Specifies small molecules or dummy atoms in the reference structure. Multiple ligands can be listed, separated by commas, e.g., NAD,OXM.

    Active Site Atoms

    Defines the atoms that make up the active pocket. Specified by chain ID, residue number, and atom names.
    Format:
    ChainResidue:Atom1,Atom2...
    Multiple residues are separated by semicolons.

    Example:
    A106:NE,CD,CZ;A166:OD1,CG;A169:NH2,CZ;A193:NE2,CD2,CE1

    This means:

    • Chain A, residue 106: atoms NE, CD, CZ
    • Chain A, residue 166: atoms OD1, CG
    • Chain A, residue 169: atoms NH2, CZ
    • Chain A, residue 193: atoms NE2, CD2, CE1
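The format above maps naturally onto a dictionary keyed by chain and residue number. The following Python sketch shows one way to read it; `parse_active_site_atoms` is a hypothetical helper, and single-letter chain IDs are an assumption.

```python
import re

# Assumed residue token shape: one chain letter followed by the residue number.
_RESIDUE = re.compile(r"^([A-Za-z])(\d+)$")


def parse_active_site_atoms(spec: str) -> dict:
    """Map (chain, residue number) to the list of atom names given for it."""
    site = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        residue, sep, atoms = entry.partition(":")
        match = _RESIDUE.match(residue.strip())
        atom_names = [a.strip() for a in atoms.split(",") if a.strip()]
        if not sep or not match or not atom_names:
            raise ValueError(f"Expected ChainResidue:Atom1,Atom2..., got {entry!r}")
        site[(match.group(1), int(match.group(2)))] = atom_names
    return site
```

Parsing the example string yields four entries, for instance `("A", 166)` mapped to `["OD1", "CG"]`.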

    Number of Designs

    Specify the number of designs; the default is 10, and the maximum allowed is 100.

    Output Prefix

    Prefix for the output files; the default is result, yielding files named result_0.pdb, result_1.pdb, …

    Result

    Structure files generated by the design: result_0.pdb, result_1.pdb, …
    An archive containing all results: result.tar.gz

    Notes:

    • The designed sequence is poly-alanine (poly-A). This is not an error: RFdiffusion2 is a backbone generation model and does not generate sequences for the designed regions. We recommend using ProteinMPNN (ligandMPNN mode) for sequence design. This module is already deployed in WeMol; simply use the PDB structures generated here as its input.
    • The output PDB file is renumbered starting from 1.

    Reference

    • Atom level enzyme active site scaffolding using RFdiffusion2. Woody Ahern, Jason Yim, Doug Tischer, Saman Salike, Seth M. Woodbury, Donghyo Kim, Indrek Kalvet, Yakov Kipnis, Brian Coventry, Han Raut Altae-Tran, Magnus Bauer, Regina Barzilay, Tommi S. Jaakkola, Rohith Krishna, David Baker. DOI:10.1101/2025.04.09.648075
  • Name: Small Molecule Generation (GenMol)
    Description: GenMol是基于diffusion model的开源AI框架,用于分子生成。它从大型化学数据库中学习,生成类药物分子。GenMol能够同时优化多种属性(类药物特性、合成可得性),并提供合成规划,大致确保分子可在实验室中合成。 GenMol is an open-source AI framework for molecular generation based on a diffusion model. It learns from large chemical databases to generate drug-like molecules. GenMol can simultaneously optimize multiple properties (such as drug-likeness and synthetic accessibility) and provide synthesis planning, roughly ensuring that the generated molecules can be synthesized in a laboratory.
    Tags: undefined
    Author: Seul Lee
    Release: 2025-08-13 11:04:57
    Reference: Lee, S., Kreis, K., Veccham, S. P., Liu, M., Reidenbach, D., Peng, Y., Paliwal, S., Nie, W., & Vahdat, A. (2025). GenMol: A Drug Discovery Generalist with Discrete Diffusion. arXiv preprint arXiv:2501.06158.

    Small Molecule Generation (GenMol)

    简介

    GenMol是基于diffusion model的开源AI框架,用于分子生成。它从大型化学数据库中学习,生成类药物分子。GenMol能够同时优化多种属性(类药物特性、合成可得性),并提供合成规划,大致确保分子可在实验室中合成。
    GenMol支持多种生成模式,满足不同的药物发现需求:

    • 连接子设计/骨架变换:生成连接两个侧链的连接子
    • 基团扩展:从给定基团片段扩展分子
    • 骨架修饰:为大型骨架添加修饰
    • 超结构生成:基于部分结构生成完整分子
    • 单步连接子设计:直接连接两个片段,无需中间混合

    企业微信截图_17580767039340.png

    参数说明

    Mode

    选择生成模式:Superstructure-Generation、Scaffold-Decoration、Motif-Extension、Linker-Design

    • Superstructure-Generation:超结构生成,基于部分结构生成完整分子
    • Scaffold-Decoration:骨架修饰,为大型骨架添加修饰
    • Motif-Extension:基团扩展,从给定基团片段扩展分子
    • Linker-Design:连接子设计,生成连接两个侧链的连接子

    Molecule

    分子结构文件,格式为SDF或SMILES,文件后缀为.sdf/.sd或.smi

    • Linker-Design模式需要输入带*的两个小分子,可以通过wedraw工具生成。
      image.png
      image.png
      image.png

    Number of samples

    该参数用于指定生成候选样本的数量。程序会按照该大小进行采样,随后自动过滤掉不符合设定片段拼接规则或子结构匹配要求的结果。因此,最终输出的有效样本数可能少于设定值。

    Randomness

    采样随机性因子,推荐范围 0–10;数值越低结果越稳定,数值越高结果越多样。

    Output

    输出文件名称

    结果说明

    生成符合要求的结果文件,result.sdf。

    参考文献

    • Lee, S., Kreis, K., Veccham, S. P., Liu, M., Reidenbach, D., Peng, Y., Paliwal, S., Nie, W., & Vahdat, A. (2025). GenMol: A Drug Discovery Generalist with Discrete Diffusion. arXiv preprint arXiv:2501.06158. URL: https://arxiv.org/abs/2501.06158

    Small Molecule Generation (GenMol)

    Introduction

    GenMol is an open-source AI framework for molecular generation based on a diffusion model. It learns from large chemical databases to generate drug-like molecules. GenMol can simultaneously optimize multiple properties (such as drug-likeness and synthetic accessibility) and provide synthesis planning, roughly ensuring that the generated molecules can be synthesized in a laboratory.

    GenMol supports multiple generation modes to meet different drug discovery needs:

    • Linker Design / Scaffold Transformation: Generate a linker that connects two side chains.
    • Motif Extension: Extend a molecule from a given motif fragment.
    • Scaffold Decoration: Add modifications to a large scaffold.
    • Superstructure Generation: Generate a complete molecule based on a partial structure.
    • Single-Step Linker Design: Directly connect two fragments without an intermediate mix.

    企业微信截图_17580767039340.png

    Parameters

    Mode

    Select the generation mode: Superstructure-Generation, Scaffold-Decoration, Motif-Extension, Linker-Design

    • Superstructure-Generation: Generate a complete molecule from a partial structure.
    • Scaffold-Decoration: Add modifications to a large scaffold.
    • Motif-Extension: Extend a molecule from a given motif fragment.
    • Linker-Design: Generate a linker to connect two side chains.

    Molecule

    Molecular structure file in SDF or SMILES format, with file extensions .sdf, .sd, or .smi.

    • In Linker-Design mode, two small molecules with * attachment points are required, which can be generated using the wedraw tool.
      image.png
      image.png
      image.png
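A quick pre-check of a Linker-Design input can catch missing attachment points before a job is submitted. This is only a sketch under stated assumptions: it assumes a .smi file with one SMILES string per line (optionally followed by a name), and `check_linker_fragments` is a hypothetical helper, not part of GenMol or wedraw.

```python
def check_linker_fragments(smi_text: str) -> list:
    """Return the two fragment SMILES, or raise if the input does not look usable."""
    # The first whitespace-separated token on each non-empty line is the SMILES.
    smiles = [line.split()[0] for line in smi_text.splitlines() if line.strip()]
    if len(smiles) != 2:
        raise ValueError(f"Linker-Design expects two fragments, found {len(smiles)}")
    for fragment in smiles:
        if "*" not in fragment:
            raise ValueError(f"Fragment has no * attachment point: {fragment}")
    return smiles
```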

    Number of samples

    This parameter specifies the number of candidate molecules to generate. The program will sample according to this value, and then automatically filter out those that do not meet the defined fragment linking rules or substructure matching requirements. Therefore, the final number of valid outputs may be smaller than the specified value.

    Randomness

    Sampling randomness factor, recommended range 0–10.
    Lower values lead to more stable results, while higher values produce more diverse outputs.

    Output

    Name of the output file.

    Results

    Generates the result file result.sdf containing the molecules that meet the specified requirements.

    Reference

    • Lee, S., Kreis, K., Veccham, S. P., Liu, M., Reidenbach, D., Peng, Y., Paliwal, S., Nie, W., & Vahdat, A. (2025). GenMol: A Drug Discovery Generalist with Discrete Diffusion. arXiv preprint arXiv:2501.06158. URL: https://arxiv.org/abs/2501.06158
  • Name: Small Molecule Generation from Pocket
    Description: 基于受体的结合口袋生成小分子配体。 This module generates small-molecule ligands based on the binding pocket of a receptor.
    Tags: undefined
    Author: Schneuing, A.
    Release: 2025-07-08 09:36:59
    Reference: Schneuing, A., Harris, C., Du, Y. et al. Structure-based drug design with equivariant diffusion models. Nat Comput Sci 4, 899–909 (2024)

    Small Molecule Generation from Pocket

    简介

    基于受体的结合口袋生成小分子配体。模块基于DiffSBDD模型实现,DiffSBDD于2024年发布,是近年来结构基础药物设计(SBDD)与生成式分子建模领域的代表性进展之一。模型充分利用了SE(3)-等变三维条件扩散模型的最新思想,将蛋白质结合口袋的几何结构直接作为条件输入,结合去噪扩散概率模型(DDPM),能够高效、灵活地生成与目标口袋空间匹配、具有潜在高亲和力的小分子候选物。相较于传统的对接筛选和先导优化方法,该模块可一次性输出多个具备合理构象、较高类药性(QED)和良好合成可行性的分子,显著降低了候选物设计的时间与人工偏差。
    DiffSBDD在多个基准数据集上的效果超过以往方法,如下图所示
    image.png

    该模块支持多种分子生成场景,助力用户在从头设计(de novo design)、子结构修复(fragment growing & linking)、骨架跃迁(scaffold hopping)等典型药物设计任务中快速获得高价值候选分子。

    参数说明

    Mode

    设计模式,共有四种不同模式可选:
    Denovo:从头生成,以复合物结构中的初始配体所在位置作为结合位点,从头生成一批新的配体分子。
    Inpaint:配体补全,以复合物结构中的初始配体作为结构增长起点,继续增长结构进一步占据结合位点。
    LinkerGen:链接片段生成,以复合物结构中两个配体片段为基础,自动进行链接片段的生成,将两个配体片段进行连接。注意:该模式下,复合物结构中必须存在且仅有两个结构片段位于结合位点。
    Optimize:配体性质优化,对复合物结构中的初始配体进行性质优化,两类可选性质。

    Structure

    蛋白与配体小分子的复合物结构文件,PDB格式。小分子所在的结合位置即新分子生成的位置。建议先使用 protein preparation 功能对非标准残基等进行优化

    Samples

    要生成的分子数量,默认为20,最大为1000。

    Output

    输出文件名称,默认为mols_gen.sdf。

    Atoms

    Inpaint模式参数,指定补全过程中需要添加的新原子数量,默认为10。

    Property

    Optimize模式参数,指定优化的分子性质,可选 sa(合成可行性)或 qed(类药性),默认值为sa。

    结果说明

    生成配体分子的结构文件mols_gen.sdf,包含多个分子,分子坐标是复合物中的配体坐标。
    Optimize模式下,SDF文件中包含打分信息:

    列名 说明
    Score 合成可行性或类药性的打分,数值在0-1之间,越大表示相应的性质越优

    参考文献

    • Schneuing, A., Harris, C., Du, Y. et al. Structure-based drug design with equivariant diffusion models. Nat Comput Sci 4, 899–909 (2024). DOI: 10.1038/s43588-024-00737-x

    Small Molecule Generation from Pocket

    Introduction

    This module generates small-molecule ligands based on the binding pocket of a receptor. It is implemented using the DiffSBDD model, which was released in 2024 and represents a major advancement in the field of structure-based drug design (SBDD) and generative molecular modeling. The model leverages the latest developments in SE(3)-equivariant 3D conditional diffusion models by taking the geometric structure of the protein binding pocket as direct input conditions. Combined with a denoising diffusion probabilistic model (DDPM), DiffSBDD can efficiently and flexibly generate small molecules that spatially match the target pocket and have potentially high binding affinity.

    Compared to traditional docking-based screening and lead optimization methods, this module can generate multiple candidate molecules in one go—each with reasonable conformations, high drug-likeness (QED), and good synthetic accessibility—greatly reducing design time and human bias.

    DiffSBDD outperforms previous methods across multiple benchmark datasets, as shown in the figure below:
    image.png

    This module supports various molecular generation scenarios, helping users quickly obtain high-value candidate compounds for tasks such as de novo design, fragment growing & linking, and scaffold hopping.

    Parameters

    Mode

    Design mode—four different modes are available:

    • Denovo: De novo generation. Generates a new batch of ligands from scratch based on the binding site occupied by the initial ligand in the complex structure.
    • Inpaint: Ligand completion. Extends the existing ligand structure in the complex to further occupy the binding site.
    • LinkerGen: Fragment linker generation. Automatically generates linkers between two ligand fragments located in the binding site.
      Note: In this mode, the complex must contain exactly two fragments positioned within the binding site.
    • Optimize: Ligand property optimization. Optimizes specific properties of the initial ligand in the complex; two types of properties are supported.

    Structure

    The complex structure file of the protein and ligand in PDB format. The position of the small molecule defines where new molecules will be generated.
    It is recommended to use the protein preparation function to clean non-standard residues beforehand.

    Samples

    Number of molecules to generate. Default is 20; maximum is 1000.

    Output

    Name of the output file. Default is mols_gen.sdf.

    Atoms

    Parameter for Inpaint mode. Specifies how many new atoms to add during the completion process. Default is 10.

    Property

    Parameter for Optimize mode. Specifies the molecular property to optimize:
    Options are sa (synthetic accessibility) or qed (drug-likeness).
    Default is sa.
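The parameters above interact: Atoms only matters in Inpaint mode and Property only in Optimize mode. The sketch below gathers the documented defaults and limits in one place. `PocketGenerationJob` and its field names are hypothetical (the field `prop` stands in for Property), and the lower bounds of 1 are assumptions, since only the maximum of 1000 is documented.

```python
from dataclasses import dataclass

MODES = ("Denovo", "Inpaint", "LinkerGen", "Optimize")
PROPERTIES = ("sa", "qed")


@dataclass
class PocketGenerationJob:
    mode: str
    structure: str               # path to the protein-ligand complex (PDB)
    samples: int = 20            # documented default
    output: str = "mols_gen.sdf" # documented default
    atoms: int = 10              # used by Inpaint only
    prop: str = "sa"             # used by Optimize only

    def validate(self) -> dict:
        """Check the values and return only the settings relevant to the mode."""
        if self.mode not in MODES:
            raise ValueError(f"Unknown mode: {self.mode}")
        if not 1 <= self.samples <= 1000:  # lower bound of 1 is an assumption
            raise ValueError("Samples must be between 1 and 1000")
        settings = {
            "mode": self.mode,
            "structure": self.structure,
            "samples": self.samples,
            "output": self.output,
        }
        if self.mode == "Inpaint":
            if self.atoms < 1:  # assumed lower bound
                raise ValueError("Atoms must be at least 1")
            settings["atoms"] = self.atoms
        if self.mode == "Optimize":
            if self.prop not in PROPERTIES:
                raise ValueError("Property must be 'sa' or 'qed'")
            settings["property"] = self.prop
        return settings
```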

    Output Description

    The generated ligand structures are saved in an .sdf file named mols_gen.sdf, containing multiple molecules whose coordinates align with the ligand in the complex.

    In Optimize mode, the SDF file also contains a score for each molecule:

    Column Name Description
    Score Score for synthetic accessibility or drug-likeness (ranging from 0 to 1; higher is better)
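The score can be read back without a cheminformatics toolkit, because SDF stores per-molecule data items as a `> <NAME>` header line followed by the value, with records separated by `$$$$`. The sketch below assumes the data item is named exactly `Score`, as in the table above; `read_sdf_field` is a hypothetical helper.

```python
def read_sdf_field(sdf_text: str, field: str = "Score") -> list:
    """Return the numeric value of one data item for each record that has it."""
    values = []
    for record in sdf_text.split("$$$$"):
        lines = record.splitlines()
        for i, line in enumerate(lines):
            # Data item headers look like ">  <Score>", possibly with extra text.
            if line.startswith(">") and f"<{field}>" in line:
                if i + 1 < len(lines):
                    values.append(float(lines[i + 1].strip()))
                break
    return values
```

Sorting the returned values in descending order puts the best-scoring molecules first, since higher is better for both properties.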

    Reference

    • Schneuing, A., Harris, C., Du, Y. et al. Structure-based drug design with equivariant diffusion models. Nat Comput Sci 4, 899–909 (2024). DOI: 10.1038/s43588-024-00737-x
  • Name: Genome Visualization
    Description: 将DNA序列转换为可视化图像,通过将DNA碱基序列映射到数值并按照核小体(nucleosome)排列模式组织成图像,最终生成彩色图像以直观展示DNA序列的结构特征。 Convert the DNA sequence into a visual image by mapping each DNA base to a numerical value and arranging the resulting values according to the nucleosome positioning pattern, ultimately producing a color image that intuitively displays the structural features of the DNA sequence.
    Tags: undefined
    Author: Song Qing
    Release: 2025-08-21 00:00:00
    Reference:

    Genome Visualization

    简介

    将DNA序列转换为可视化图像,通过将DNA碱基序列映射到数值并按照核小体(nucleosome)排列模式组织成图像,最终生成彩色图像以直观展示DNA序列的结构特征。功能特点:

    • 将DNA序列文件转换为可视化图像
    • 使用核小体排列模式进行数据组织
    • 生成RGB彩色图像,便于观察序列模式

    参数介绍

    Genome Sequence

    物种的基因组序列,FASTA格式

    Output

    输出图片文件的名称

    结果说明

    生成彩色图片,默认名称:genome_visualization.png

    Genome Visualization

    Introduction

    This tool converts DNA sequences into visual images by mapping DNA bases to numerical values and organizing them according to the nucleosome arrangement pattern. The result is a colorful image that intuitively displays the structural features of the DNA sequence.

    Features

    • Convert DNA sequence files into visual images
    • Organize data using nucleosome arrangement patterns
    • Generate RGB color images to facilitate observation of sequence patterns
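The general idea of turning a sequence into a pixel grid can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration, not the tool's actual scheme: the base-to-color table, the padding color, and the row width of 147 bases (the approximate length of DNA wrapped around one nucleosome core). `sequence_to_pixels` is a hypothetical name.

```python
# Illustrative color table; the tool's real base-to-value mapping is not documented.
BASE_COLORS = {
    "A": (0, 160, 0),
    "C": (0, 0, 255),
    "G": (255, 160, 0),
    "T": (255, 0, 0),
}
UNKNOWN = (128, 128, 128)  # N and other ambiguity codes
PAD = (0, 0, 0)            # fills the unused tail of the last row


def sequence_to_pixels(sequence: str, row_width: int = 147) -> list:
    """Return rows of RGB tuples, one pixel per base, row_width bases per row."""
    pixels = [BASE_COLORS.get(base, UNKNOWN) for base in sequence.upper()]
    rows = [pixels[i:i + row_width] for i in range(0, len(pixels), row_width)]
    if rows and len(rows[-1]) < row_width:
        rows[-1] = rows[-1] + [PAD] * (row_width - len(rows[-1]))
    return rows
```

With a fixed row width, any feature that repeats at that period lines up vertically, which is what makes periodic patterns visible in the image.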

    Parameters

    Genome Sequence

    The genome sequence of the species in FASTA format

    Output

    The name of the output image file

    Results

    A colorful image is generated with the default file name: genome_visualization.png

  • Name: Molecular Atom Index
    Description: 将分子结构转换为图片,并显示原子编号。 Convert a molecular structure into an image with atom indices labeled.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-07-08 10:18:57
    Reference:

    Molecular Atom Index

    简介

    将分子结构转换为图片,并显示原子编号。

    参数说明

    Molecule

    分子结构文件,格式为SDF或SMILES,文件后缀为.sdf/.sd或.smi

    Output

    输出图片名称,默认为mol.png

    结果说明

    标注了原子编号的分子结构图片。
    image.png

    Molecular Atom Index

    Introduction

    Converts a molecular structure into an image with atom indices labeled.

    Parameters

    Molecule

    Molecular structure file in SDF or SMILES format, with file extensions .sdf, .sd, or .smi.

    Output

    Name of the output image file. Default is mol.png.

    Result

    An image of the molecular structure with atom indices labeled.
    image.png

  • Name: MD DSSP
    Description: 蛋白质二级结构残基数目计算。 Residue count in protein secondary structures.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-07-22 15:35:35
    Reference: Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577-2637.

    MD DSSP

    简介

    蛋白二级结构残基数目计算。使用 DSSP 算法(即通过检测氨基酸残基之间特定的氢键模式)来确定蛋白质的二级结构。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run模块或者AlphaAutoMD模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA,Complex。
    可以根据PDB中小分子的名称填写组别名称。
    注:其中Complex指的是蛋白-小分子复合物体系。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。
    参考md.gro的残基编号。

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    得到结果文件,每种类型的文件如果包含PNG、CSV以及XVG后缀,相同名称只是表现形式不同,数据一样

    输出文件名称 说明
    num.xvg/.png/.csv 不同形式的二级结构的残基数目
    ss.png 每一帧每个残基的二级结构显示文件

    MD DSSP

    Introduction

    Calculation of the number of residues in protein secondary structures. The DSSP algorithm determines the secondary structure of proteins by identifying specific hydrogen bonding patterns between amino acid residues.

    Parameters

    Path File

    The trajectory file obtained after MD simulation. This can be retrieved from the GMX MD Run module or the AlphaAutoMD module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA, or Complex.
    You can also specify the group name based on the small molecule names in the PDB file.
    Note: “Complex” refers to protein-small molecule complex systems.

    Custom Resid

    Specify the residue numbers to analyze. Use a hyphen (-) for continuous ranges and commas (,) for discontinuous residues.
    Example: 1-10,15
    Refer to the residue numbering in md.gro

    Custom Atom

    Specify the atom numbers to analyze. Use a hyphen (-) for continuous ranges and commas (,) for discontinuous atoms.
    Example: 1-10,15
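Custom Resid and Custom Atom share the same range syntax. As a minimal illustration of how a string such as 1-10,15 expands, here is a Python sketch; `parse_index_ranges` is a hypothetical helper and assumes positive integer indices.

```python
def parse_index_ranges(spec: str) -> list:
    """Expand a string such as '1-10,15' into a list of integer indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Range is reversed: {part}")
            indices.extend(range(start, end + 1))  # ranges are inclusive
        else:
            indices.append(int(part))
    return indices
```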

    Skip Time (ns)

    Time interval between analyzed frames, in nanoseconds.

    Result

    The result files include the number of residues in different types of secondary structures.
    Each type of result may be available in PNG, CSV, and XVG formats. These files have the same content, just different representations.

    Output File Name Description
    num.xvg/.png/.csv Number of residues for each secondary structure type
    ss.png Secondary structure visualization for each residue in each frame
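For custom plots, num.xvg can be read directly. XVG is the plain-text plot format written by GROMACS tools: lines starting with # are comments, lines starting with @ are plotting directives (including series legends), and the remaining lines are whitespace-separated numbers. The sketch below assumes that layout; which columns num.xvg actually contains is not documented here, so treat the first column as the x value (time or frame) only after checking the file header. `read_xvg` is a hypothetical helper.

```python
import re

# Matches series legend directives such as: @ s0 legend "Helix"
_LEGEND = re.compile(r'^@\s*s(\d+)\s+legend\s+"(.*)"')


def read_xvg(text: str) -> tuple:
    """Return (series names, data rows) from the contents of an XVG file."""
    legends = {}
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("@"):
            match = _LEGEND.match(line)
            if match:
                legends[int(match.group(1))] = match.group(2)
            continue  # other plotting directives are ignored
        rows.append([float(value) for value in line.split()])
    names = [legends[index] for index in sorted(legends)]
    return names, rows
```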
  • Name: Stability Result Merge
    Description: 合并稳定性流程(Pythia,ThermoMPNN,ESMIF)输出的结果。 Merge the results output by the stability process (Pythia, ThermoMPNN, ESMIF).
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-07-15 14:23:45
    Reference:

    Stability Result Merge

    简介

    合并稳定性流程(Pythia,ThermoMPNN,ESMIF)输出的结果。

    参数说明

    ESM

    指定ESMIF的结果文件,csv格式,如:ESMIF_results.csv。

    Pythia

    指定Pythia的结果文件,csv格式,如:Pythia_results.csv。

    ThermoMPNN

    指定ThermoMPNN的结果文件,csv格式,如:ThermoMPNN_results.csv。

    Output

    结果合并输出的文件名称,默认为merged_results.csv

    结果说明

    结果合并输出文件merged_results.csv。

    Stability Result Merge

    Introduction

    Merge the output results from the stability evaluation pipelines: Pythia, ThermoMPNN, and ESMIF.

    Parameters

    ESM

    Specify the result file from ESMIF in CSV format, e.g., ESMIF_results.csv.

    Pythia

    Specify the result file from Pythia in CSV format, e.g., Pythia_results.csv.

    ThermoMPNN

    Specify the result file from ThermoMPNN in CSV format, e.g., ThermoMPNN_results.csv.

    Output

    Name of the merged output file. Default is merged_results.csv.

    Result

    The merged result will be output to the file merged_results.csv.
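The kind of merge this module performs can be illustrated with the standard csv module. This is a sketch under loud assumptions, not the module's implementation: the column layout of the three result files is not documented here, so the shared key column name `mutation` and the tool-name prefixes on the merged columns are hypothetical.

```python
import csv
import io


def merge_results(sources: dict, key: str = "mutation") -> str:
    """Outer-join several CSV texts on a shared key column.

    sources maps a tool name (for example "Pythia") to the CSV text it produced.
    Non-key columns are prefixed with the tool name to keep them distinct.
    """
    merged = {}
    columns = [key]
    for tool, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            record = merged.setdefault(row[key], {key: row[key]})
            for name, value in row.items():
                if name == key:
                    continue
                column = f"{tool}_{name}"
                if column not in columns:
                    columns.append(column)
                record[column] = value
    out = io.StringIO()
    # restval leaves a cell empty when a tool has no row for that key.
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(merged.values())
    return out.getvalue()
```

An outer join is used so that a mutation scored by only one tool still appears in the merged table, with empty cells for the others.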

  • Name: Batch Fasta Generator
    Description: 对不同文件中的序列进行组装,输出满足Boltz2批量预测模式需要的序列格式。 Assembles sequences from different files and outputs them in the sequence format required for Boltz2 batch prediction mode.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-07-02 09:46:02
    Reference:

    Batch Fasta Generator

    简介

    对不同文件中的序列进行组装,输出满足Boltz2批量预测模式需要的序列格式。

    参数说明

    Sequences_A

    进行序列组装的A文件,组装时的固定序列,FASTA格式

    Sequences_B

    进行序列组装的B文件,组装时的遍历序列,FASTA格式

    Sequences_C

    进行序列组装的C文件,组装时的遍历序列,FASTA格式

    Mode

    组装模式,选中表示对B,C文件中的序列进行交叉组装。具体组装逻辑见下述。

    组装逻辑:
    读取A文件中的所有序列,依次读取B文件及C文件中的相同顺序的一条序列进行组装。如果B文件与C文件中的序列数量不一致,或者其中一个文件为空时,则超出部分的序列单独与A文件序列进行组装。示例如下:
    A文件中有两条序列A1/2,B文件中有三条序列B1/2/3,C文件中有5条序列C1/2/3/4/5,输出组合后的序列为:

    >A1_A2_B1_C1
    A1:A2:B1:C1
    >A1_A2_B2_C2
    A1:A2:B2:C2
    >A1_A2_B3_C3
    A1:A2:B3:C3
    >A1_A2_C4
    A1:A2:C4
    >A1_A2_C5
    A1:A2:C5
    

    如果选择交叉组装模式,则对B,C文件中的序列进行交叉组装,输出组合后的序列为:

    >A1_A2_B1_C1
    A1:A2:B1:C1
    >A1_A2_B1_C2
    A1:A2:B1:C2
    >A1_A2_B1_C3
    A1:A2:B1:C3
    >A1_A2_B1_C4
    A1:A2:B1:C4
    >A1_A2_B1_C5
    A1:A2:B1:C5
    >A1_A2_B2_C1
    A1:A2:B2:C1
    >A1_A2_B2_C2
    A1:A2:B2:C2
    ......
    
    

    结果说明

    输出组装后的序列文件combined_seqs.fasta。

    Batch Fasta Generator

    Introduction

    Assembles sequences from different files and outputs them in the sequence format required for Boltz2 batch prediction mode.

    Parameters

    Sequences_A

    File A for sequence assembly, in FASTA format. Its sequences are fixed: all of them are included in every assembled record.

    Sequences_B

    File B for sequence assembly, in FASTA format. Its sequences are iterated over, one per assembled record.

    Sequences_C

    File C for sequence assembly, in FASTA format. Its sequences are iterated over, one per assembled record.

    Mode

    Assembly mode. If selected, sequences from files B and C will be cross-assembled. The specific assembly logic is described below.

    Assembly Logic:
    All sequences in file A are included in every assembled record. Sequences from files B and C are then read in order and paired by position: the first sequence of B with the first of C, the second with the second, and so on.
    If files B and C contain different numbers of sequences, or one of them is empty, each unpaired sequence is assembled with the file A sequences on its own.
    For example, if file A contains two sequences (A1, A2), file B contains three (B1, B2, B3), and file C contains five (C1 to C5), the assembled output is:

    >A1_A2_B1_C1  
    A1:A2:B1:C1  
    >A1_A2_B2_C2  
    A1:A2:B2:C2  
    >A1_A2_B3_C3  
    A1:A2:B3:C3  
    >A1_A2_C4  
    A1:A2:C4  
    >A1_A2_C5  
    A1:A2:C5  
    

    If cross-assembly mode is selected, every sequence in file B is combined with every sequence in file C. For the same input files, the output is:

    >A1_A2_B1_C1  
    A1:A2:B1:C1  
    >A1_A2_B1_C2  
    A1:A2:B1:C2  
    >A1_A2_B1_C3  
    A1:A2:B1:C3  
    >A1_A2_B1_C4  
    A1:A2:B1:C4  
    >A1_A2_B1_C5  
    A1:A2:B1:C5  
    >A1_A2_B2_C1  
    A1:A2:B2:C1  
    >A1_A2_B2_C2  
    A1:A2:B2:C2  
    ......
    

    Result

    The assembled sequence file will be output as combined_seqs.fasta.
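The assembly logic described above can be sketched in Python as follows. This is an illustration of the documented behavior rather than the module's source: `parse_fasta` and `assemble` are hypothetical names, and the fallback to position-wise pairing when cross mode is selected but file B or C is empty is an assumption.

```python
from itertools import product, zip_longest


def parse_fasta(text: str) -> list:
    """Return (name, sequence) pairs from FASTA text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def assemble(a: list, b: list, c: list, cross: bool = False) -> str:
    """Combine all records of A with records of B and C, as FASTA text."""
    if cross and b and c:
        pairs = product(b, c)      # every B with every C
    else:
        pairs = zip_longest(b, c)  # pair by position; the shorter side yields None
    lines = []
    for b_rec, c_rec in pairs:
        group = list(a) + [rec for rec in (b_rec, c_rec) if rec is not None]
        lines.append(">" + "_".join(name for name, _ in group))
        lines.append(":".join(seq for _, seq in group))
    return "".join(line + "\n" for line in lines)
```

With two records in A, three in B, and five in C, the default mode produces five assembled records and cross mode produces fifteen, in the order shown in the examples above.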

  • Name: AutoModel Protein v1.8
    Description: 利用小样本数据对ESM2蛋白质语言模型进行微调。支持分类和回归任务,四种训练方法: 1,基于BioNeMo框架的全参微调 2,基于BioNeMo框架的LoRA(Low-Rank Adaptation)参数高效微调。 3,序列特征迁移+传统机器学习(ML)预测头 4,序列特征迁移+多层感知机(MLP)预测头 A module for fine-tuning the ESM2 protein language model, supporting classification (binary) and regression tasks. The module offers four training methods: 1. Full-parameter fine-tuning based on the BioNeMo framework. 2. Parameter-efficient fine-tuning with LoRA (Low-Rank Adaptation), based on the BioNeMo framework. 3. Sequence feature transfer with a traditional machine learning (ML) prediction head. 4. Sequence feature transfer with a multi-layer perceptron (MLP) prediction head.
    Tags: undefined
    Author: WECOMPUT
    Release: 2026-03-11 00:00:00
    Reference:

    AutoModel Protein

    简介

    对ESM2蛋白质语言模型进行微调,支持分类(二分类)和回归任务。
    该模块提供了四种训练方法:
    1,基于BioNeMo框架的全参微调
    2,基于BioNeMo框架的LoRA(Low-Rank Adaptation)参数高效微调。
    3,序列特征迁移+传统机器学习(ML)预测头
    4,序列特征迁移+多层感知机(MLP)预测头

    默认会尝试所有训练方法,自动比较训练结果并选择最佳模型。训练完成后可基于训练后的最佳模型进行推理。

    参数说明

    Train 模式

    训练

    Train Method

    训练方法:All (所有方法)、Finetune (基于 BioNeMo 框架的全参微调)、Lora(基于 BioNeMo 框架的 LoRA 参数高效微调)、Ml (序列特征迁移 + 传统机器学习预测头)、MLP (序列特征迁移 + 多层感知机预测头)。

    Input File

    用于训练的数据文件路径,CSV 格式(逗号分隔的文本文件格式)。

    Sequence Column

    数据文件中蛋白序列所在列的列名称,如 “sequence”。

    Label Column

    数据文件中标签所在列的列名称,如 “label”,标签可以是序列的性质(如:亲和力、稳定性等),也可以是类别(0 或 1 等)。

    Task Type

    任务类型:classification 或 regression。

    Test Size

    训练数据中用于作为测试集的比例,默认值 0.2。

    Epochs

    训练轮次,默认 10。

    Batch Size

    训练时的批次大小,默认 16。

    Inference模式

    推理

    Input File

    用于推理的数据文件路径。支持以下格式:

    • CSV(逗号分隔的文本文件)
    • FASTA

    Sequence Column

    当输入为 CSV 格式时,指定序列所在列的列名称,如“sequence”。
    如未指定,将自动从 model_info_file 中读取训练时使用的列名称,此时需确保推理数据文件中的列名称与训练数据一致。

    当输入为 FASTA 格式时,无需填写该参数。

    Model Status File

    模型信息 JSON 文件路径(训练任务最终输出的 result.json 文件)。

    Inference Mode

    推理结果筛选方式:largest(由大到小排序)、smallest(由小到大排序)、closest(按最接近某个数值排序,仅适用于回归任务)。

    Top N

    筛选保留的样本数量,默认值 10000。

    Target Value

    如果选择 closest 模式,需要指定的目标值。

    Target Class

    对于分类任务,只保留特定类别的样本。

    结果说明

    训练结果

    • result.json:模型信息文件,包含任务ID、方法、模型路径等信息
    • methods_comparison.csv:不同方法的性能比较结果
      回归任务的模型评价指标:
    指标 说明
    Spearman Spearman相关性指标,-1至1之间,绝对值越大表示相关性越高,模型效果越好。不同训练方法得到回归模型通过该参数进行排序,选取最优模型。
    MAE 平均绝对误差,数值越小越好

    分类任务的模型评价指标:

    指标 说明
    Accuracy 准确率,整体预测正确的比例,0-1之间,越大表示模型效果越好
    Precision 精确率,预测为正例的样本中,实际为正的比例,0-1之间,越大表示模型效果越好
    Recall 召回率,实际为正例的样本中,被正确预测的比例,0-1之间,越大表示模型效果越好
    F1_score 精确率与召回率的调和平均值。不同训练方法得到分类模型通过该参数进行排序,选取最优模型。
    • train_report.pdf:各方法的性能结果报告(PDF格式)

    注意:当训练模型失败或指标不符合要求时(如:Spearman为0),不输出该模型及其指标。

    推理结果

    • predictions.csv:预测结果文件,输出序列及预测打分(与训练数据中label列的性质一致)。

    AutoModel Protein

    Introduction

    This module fine-tunes the ESM2 protein language model and supports classification (binary) and regression tasks. It offers four training methods:

    1. Full-parameter fine-tuning based on the BioNeMo framework.
    2. Parameter-efficient fine-tuning with LoRA (Low-Rank Adaptation), based on the BioNeMo framework.
    3. Sequence feature transfer with a traditional machine learning (ML) prediction head.
    4. Sequence feature transfer with a multi-layer perceptron (MLP) prediction head.

    By default, all training methods are attempted, and the results are automatically compared to select the best model. After training, inference can be performed using the best-trained model.

    Parameters

    Training Parameters

    Train Method

    Training strategy. Supported options:

    • all: Use all available methods
    • finetune: Full-parameter fine-tuning based on the BioNeMo framework
    • lora: Parameter-efficient fine-tuning using LoRA
    • ml: Sequence feature transfer with a traditional machine learning prediction head
    • mlp: Sequence feature transfer with an MLP prediction head

    Input File

    Path to the training dataset file.
    Only CSV format (comma-separated values) is supported.

    Sequence Column

    Name of the column containing protein sequences in the dataset (e.g., sequence).

    Label Column

    Name of the column containing labels in the dataset (e.g., label).
    Labels can represent:

    • Continuous values (e.g., affinity, stability)
    • Categorical values (e.g., 0 or 1)

    Task Type

    Type of task:

    • classification
    • regression

    Test Size

    Proportion of the dataset used as the test set.
    Default: 0.2

    Epochs

    Number of training epochs.
    Default: 10

    Batch Size

    Batch size used during training.
    Default: 16


    Inference Parameters

    Input File

    Path to the input file for inference. Supported formats:

    • CSV (comma-separated values)
    • FASTA

    Sequence Column

    For CSV input, the name of the column that contains the sequences, such as “sequence”.
    If not provided, the column name used during training is read from model_info_file; in that case the inference file must use the same column name as the training data.

    This parameter is not required when using FASTA format.

    Model Status File

    Path to the JSON file containing model metadata (i.e., result.json generated during training).

    Inference Mode

    Method used to filter and rank inference results:

    • largest: Sort results from largest to smallest
    • smallest: Sort results from smallest to largest
    • closest: Sort by proximity to a target value (only applicable to regression tasks)

    Top N

    Number of samples to retain after filtering.
    Default: 10000

    Target Value

    Required when using closest mode.
    Specifies the target value for ranking.

    Target Class

    Used in classification tasks to retain samples belonging to a specific class.
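How Inference Mode, Top N, Target Value, and Target Class combine can be sketched as a single selection step. This is an illustration only: `select_predictions` and the row keys `score` and `predicted_class` are hypothetical, and the layout of the real prediction records is not documented here.

```python
def select_predictions(rows, mode="largest", top_n=10000,
                       target_value=None, target_class=None):
    """Filter and rank prediction rows.

    rows is a list of dicts with a numeric "score" and, for classification,
    a "predicted_class" entry (both key names are assumptions).
    """
    if target_class is not None:
        rows = [r for r in rows if r.get("predicted_class") == target_class]
    if mode == "largest":
        ranked = sorted(rows, key=lambda r: r["score"], reverse=True)
    elif mode == "smallest":
        ranked = sorted(rows, key=lambda r: r["score"])
    elif mode == "closest":
        if target_value is None:
            raise ValueError("closest mode requires a target value")
        ranked = sorted(rows, key=lambda r: abs(r["score"] - target_value))
    else:
        raise ValueError(f"Unknown inference mode: {mode}")
    return ranked[:top_n]
```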

    Results

    Training Results

    • result.json: Model information file, including task ID, method, model path, etc.
    • methods_comparison.csv: Performance comparison results of different methods.

    Model Evaluation Metrics for Regression Tasks:

    Metric Description
    Spearman Spearman correlation coefficient, ranging from -1 to 1. A higher absolute value indicates stronger correlation and better model performance. Regression models from different training methods are ranked based on this metric to select the optimal model.
    MAE Mean Absolute Error. Smaller values indicate better performance.

    Model Evaluation Metrics for Classification Tasks:

    Metric Description
    Accuracy Proportion of correct predictions overall, ranging from 0 to 1. Higher values indicate better model performance.
    Precision Proportion of true positives among predicted positives, ranging from 0 to 1. Higher values indicate better performance.
    Recall Proportion of true positives correctly identified, ranging from 0 to 1. Higher values indicate better performance.
    F1_score Harmonic mean of precision and recall. Classification models from different training methods are ranked based on this metric to select the optimal model.
    • train_report.pdf: Performance reports for each method (in PDF format).

    Note: If model training fails or evaluation metrics do not meet requirements (e.g., Spearman = 0), the model and its metrics will not be included in the output.

    Inference Results

    • predictions.csv: Prediction results file containing each input sequence and its predicted score (the same quantity as the label column in the training data).
  • Name: Mutation Score v2.3
    Description: Mutation Score是抗体人源化设计中核心模块,是一个基于结构的自动化评分模块。该模块基于抗体的结构信息以及CDR嫁接后的序列信息,对graft后的FR区的每个氨基酸在替换前后的变化程度进行量化评分。评分越高,说明CDR嫁接过程中氨基酸的替换对CDR区的构象可能影响较大,需要进行回复突变。模块输出每个氨基酸的打分值,用于抗体人源化设计流程中后续的分组和人源化抗体序列的生成。 Mutation Score is a core module in antibody humanization design workflow, which is a structure-based automated scoring module. Based on the structure information of the antibody and the CDR-grafted sequence information, this module quantitatively scores the degree of change before and after the replacement of each amino acid in the FR region. The higher the score, the greater the potential impact of the amino acid replacement on the conformation change of the CDR region during CDR grafting, indicating the need for auto-back mutation. The module outputs the score for each amino acid, which is used for subsequent grouping and generation of humanized antibody sequences in the antibody humanization design workflow.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 11:14:32
    Reference: To be submitted

    Mutation Score

    简介

    Mutation Score是抗体人源化设计中核心模块,是一个基于结构的自动化评分模块。该模块基于抗体的结构信息以及CDR嫁接后的序列信息,对移植抗体(graft)后的FR区的每个氨基酸在替换前后的变化程度进行量化评分。评分越高,说明CDR嫁接过程中氨基酸的替换对CDR区的构象可能影响较大,需要进行回复突变。模块输出每个氨基酸的打分值,用于抗体人源化设计流程中后续的分组和人源化抗体序列的生成。

    参数说明

    Sequence File

    抗体Fv区序列文件,FASTA格式。

    Model File

    抗体结构文件,PDB格式。

    Grafted Sequence

    抗体CDR区Graft后的序列文件,FASTA格式。

    Germline Hits

    抗体FR区序列比对结果文件,FASTA格式

    Interface Score

    界面相互作用计算结果,包含原子/残基级别的接触信息

    Hbond Score

    氢键计算结果,包含供体-受体对、距离和角度信息

    Output Score

    指定输出打分文件的名称,CSV格式。

    Antibody Type

    抗体类型:

    • Antibody:常规抗体
    • Nanobody:纳米抗体

    Numbering Type

    抗体编号规则:kabat,imgt,chothia

    结果说明

    输出结果文件为score.csv,包含信息如下:

    字段名称 说明
    Chain 轻链或重链
    UID 为残基的标准编号(默认为 Kabat)
    Position 残基在序列中的位置
    Donor Residue 原始氨基酸
    Template Residue 人源模板的目标氨基酸
    score 回复突变智能评分,Score 越高,认为其回复突变的必要性越高。通常Score>10为高优先级,5-10为中优先级,其他为低优先级

    Mutation Score

    Introduction

    Mutation Score is a core module in antibody humanization design: a structure-based, automated scoring module. Using the antibody structure and the CDR-grafted sequence, it quantitatively scores, for each amino acid in the FR region, the degree of change caused by the replacement made during grafting. A higher score indicates that the amino acid replacement made during CDR grafting may have a larger impact on the conformation of the CDR region, suggesting the need for a back mutation. The module outputs a score for each amino acid, which is used for the subsequent grouping and generation of humanized antibody sequences in the antibody humanization design workflow.

    Parameters

    Sequence File

    Sequence file of the antibody Fv region in FASTA format.

    Model File

    Antibody structure file in PDB format.

    Grafted Sequence

    Sequence file of the antibody CDR region after grafting in FASTA format.

    Germline Hits

    Sequence alignment result file for the antibody FR region, in FASTA format.

    Output Score

    Specify the name of the output scoring file in CSV format.

    Interface Score

    Interface interaction calculation results, including atom/residue-level contact information

    Hbond Score

    Hydrogen bond calculation results, including donor-acceptor pairs, distances, and angles

    Antibody Type

    Type of antibody:

    • Antibody: Conventional antibody
    • Nanobody: Nanobody

    Numbering Type

    Antibody numbering scheme: kabat, imgt, chothia

    Results

    The output result file is named score.csv and includes the following information:

    Field Name Description
    Chain Light chain or heavy chain
    UID Standard numbering for residues (default is Kabat)
    Position Position of the residue in the sequence
    Donor Residue Original amino acid
    Template Residue Target amino acid from the human template
    Score Back-mutation score. A higher score suggests a greater need for a back mutation. Typically, Score > 10 is high priority, 5-10 is medium priority, and anything lower is low priority.
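
    The priority bands described for the Score column can be applied to score.csv with a short script. The sketch below is only illustrative: the sample rows are invented, the CSV header spelling is assumed to match the field names in the table, and treating the 5 and 10 boundaries as belonging to the medium band is an assumption, since the text does not say how boundary values are handled.

```python
import csv
import io

# Hypothetical score.csv content, for illustration only.
SAMPLE = """Chain,UID,Position,Donor Residue,Template Residue,Score
H,H48,48,I,V,12.5
H,H67,67,A,V,7.0
L,L4,4,L,M,1.2
"""


def priority(score):
    """Map a back-mutation score to the priority bands from the table."""
    if score > 10:
        return "high"
    if score >= 5:  # assumption: 5 and 10 themselves count as medium
        return "medium"
    return "low"


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
labels = {row["UID"]: priority(float(row["Score"])) for row in rows}
print(labels)
```

    To run this on a real result file, replace `io.StringIO(SAMPLE)` with an opened `score.csv`.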
  • Name: Retrosynthetic Prediction (LocalRetro)
    Description: 基于LocalRetro的小分子逆合成预测 Small molecule retrosynthetic prediction using LocalRetro
    Tags: undefined
    Author: Chen, S.
    Release: 2025-05-19 00:00:00
    Reference: Chen, S.; Jung, Y. Deep Retrosynthetic Reaction Prediction Using Local Reactivity and Global Attention. JACS Au 2021, 1 (10), 1612–1620. https://doi.org/10.1021/jacsau.1c00246.

    Retrosynthetic Prediction (LocalRetro)

    简介

    LocalRetro 是局部逆合成预测框架,其动机是化学直觉认为分子变化主要发生在化学反应过程中的局部。这与几乎所有现有的逆合成方法不同,这些方法根据分子的全局结构建议反应物,通常包含与反应没有直接关系的精细细节。这个局部概念产生了涉及原子和键编辑的局部反应模板。由于远程官能团也可以作为次要方面影响整个反应路径,因此进一步细化了所提出的局部编码逆合成模型,以通过全局注意力机制来解释化学反应的非局部效应。模型显示,对于包含 50016 个反应的 USPTO-50K 数据集,top-1 和 top-5 预测的准确率分别为 89.5% 和 99.2%。在包含 479035 个反应(USPTO-MIT)的大型数据集上 top-1 和 top-5 准确率分别为 87.0% 和 97.4%。通过从各种文献中正确预测五种候选药物分子的合成途径,还证明了该模型的实际应用。

    参数说明

    SMILES

    输入小分子的SMILES,支持多个批量预测,一行一个,示例:
    O=C(Nc4cccc(C(=O)N3CCN(c1ccnc2[nH]ccc12)C3)c4)c5cccc(C(F)(F)F)c5

    结果说明

    输出的CSV文件包含以下列:

    列名 说明
    Input SMILES 输入的原始分子SMILES
    Predicted Reactants 预测反应物的SMILES
    Predicted Site 预测的反应位点
    Local Reaction Template 局部反应模板
    Score 预测得分,范围0-1,分数越高,表明该反应发生概率越高

    注意: 每个输入分子可能产生多个预测反应,因此一个分子会对应多行数据。

    参考文献

    • Chen, S.; Jung, Y. Deep Retrosynthetic Reaction Prediction Using Local Reactivity and Global Attention. JACS Au 2021, 1 (10), 1612–1620. DOI: 10.1021/jacsau.1c00246

    Retrosynthetic Prediction (LocalRetro)

    Introduction

    LocalRetro is a local retrosynthesis framework motivated by the chemical intuition that molecular changes occur mostly locally during chemical reactions. This differs from nearly all existing retrosynthesis methods, which suggest reactants based on the global structures of the molecules, often including fine details not directly relevant to the reaction. The local concept yields local reaction templates involving atom and bond edits. Because remote functional groups can also affect the overall reaction path as a secondary aspect, the locally encoded retrosynthesis model is further refined to account for nonlocal effects of the chemical reaction through a global attention mechanism. The model shows round-trip accuracies of 89.5% (top-1) and 99.2% (top-5) on the USPTO-50K dataset, which contains 50,016 reactions. LocalRetro was further validated on a larger dataset containing 479,035 reactions (USPTO-MIT), with comparable round-trip top-1 and top-5 accuracies of 87.0% and 97.4%, respectively. The practical applicability of the model is also demonstrated by correctly predicting the synthesis pathways of five drug candidate molecules from the literature.

    Parameters

    SMILES

    SMILES strings of the input small molecules. Batch prediction is supported: provide one SMILES per line. Example:
    O=C(Nc4cccc(C(=O)N3CCN(c1ccnc2[nH]ccc12)C3)c4)c5cccc(C(F)(F)F)c5

    Results

    Output CSV file includes:

    Column Name Description
    Input SMILES Original SMILES of the input molecule
    Predicted Reactants SMILES of the predicted reactants
    Predicted Site Predicted reaction site
    Local Reaction Template Local reaction template used for the prediction
    Score Prediction score, ranging from 0 to 1. A higher score indicates a higher likelihood of the reaction.

    Note: Each input molecule may yield several predicted reactions, so one molecule can correspond to multiple rows.
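
    Because one input molecule can span several rows, a common first step is to group the rows by Input SMILES and rank them by Score. The following is a minimal sketch under stated assumptions: the CSV header spelling is taken from the table above, and the sample rows (including the placeholder reactant, site and template values) are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical output rows; reactants, sites and templates are placeholders.
SAMPLE = """Input SMILES,Predicted Reactants,Predicted Site,Local Reaction Template,Score
CCO,placeholder_reactants_A,site_A,template_A,0.62
CCO,placeholder_reactants_B,site_B,template_B,0.21
CC(=O)O,placeholder_reactants_C,site_C,template_C,0.88
"""


def group_predictions(csv_text):
    """Group rows by input molecule, highest score first."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Input SMILES"]].append(row)
    for rows in grouped.values():
        rows.sort(key=lambda r: float(r["Score"]), reverse=True)
    return dict(grouped)


grouped = group_predictions(SAMPLE)
# Top-ranked reactant set for each input molecule.
best = {smiles: rows[0]["Predicted Reactants"]
        for smiles, rows in grouped.items()}
print(best)
```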

    References

    • Chen, S.; Jung, Y. Deep Retrosynthetic Reaction Prediction Using Local Reactivity and Global Attention. JACS Au 2021, 1 (10), 1612–1620. DOI: 10.1021/jacsau.1c00246
  • Name: Ligand Protein Binding Prediction
    Description: 预测小分子与蛋白的亲和力(用pIC50表示)。 Predict the affinity of small molecules to proteins (represented by pIC50).
    Tags: undefined
    Author: Kexin Huang
    Release: 2025-06-17 10:17:30
    Reference: Kexin Huang, Tianfan Fu, Lucas M Glass, Marinka Zitnik, Cao Xiao, Jimeng Sun, DeepPurpose: a deep learning library for drug–target interaction prediction, Bioinformatics, Volume 36, Issue 22-23, December 2020, Pages 5545–5547

    Ligand Protein Binding Prediction

    简介

    预测小分子与蛋白的亲和力(用pIC50表示)。模块基于DeepPurpose框架实现,采用的预训练模型为MPNN_CNN_BindingDB,是基于BindingDB数据库训练的小分子-蛋白亲和力预测模型。

    模型架构如图所示:
    image.png
    image-1.png

    模型预测效果在当时是最佳的:
    image.png

    参数说明

    Sequences

    单个或多个蛋白序列,FASTA格式或TXT格式,每个蛋白使用一条序列表示(有多条链时,将单链序列首尾连接放在同一条序列中),TXT格式时,每行一个蛋白。

    Ligands

    小分子结构文件,TXT格式,支持多个底物分子,使用SMILES表示,每行一个分子,文件内容示例:

    OC1=CC=C(C[C@@H](C(O)=O)N)C=C1
    CC(O)O
    

    注意:
    输入每个小分子都会与每个蛋白计算亲和力,并输出结果。

    Output

    亲和力预测结果文件名,默认为pred_res.csv

    结果说明

    结果文件pred_res.csv,包含以下信息:

    列名 说明
    SMILES 小分子结构
    Target_ID 蛋白名称
    Target_Sequence 蛋白序列
    Score(pIC50) 预测的亲和力pIC50数值,越大表示亲和力越高,可与阳性对照分子的预测数值比较。

    参考文献

    Kexin Huang, Tianfan Fu, Lucas M Glass, Marinka Zitnik, Cao Xiao, Jimeng Sun, DeepPurpose: a deep learning library for drug–target interaction prediction, Bioinformatics, Volume 36, Issue 22-23, December 2020, Pages 5545–5547

    Ligand-Protein Binding Prediction

    Introduction

    This module predicts the binding affinity between small molecules and proteins, expressed as pIC50. It is implemented based on the DeepPurpose framework, using the pre-trained model MPNN_CNN_BindingDB, which was trained on the BindingDB dataset for small molecule–protein affinity prediction.

    The model architecture is shown below:
    image.png
    image-1.png

    At the time of its release, the model achieved state-of-the-art performance:
    image.png

    Parameters

    Sequences

    One or more protein sequences in FASTA or TXT format. Each protein should be represented by a single sequence. For multi-chain proteins, concatenate the chain sequences end-to-end into one line. In TXT format, each line represents one protein.

    Ligands

    Small molecule structure file in TXT format. Multiple ligand molecules are supported, each written as a SMILES string, one molecule per line. Example content:

    OC1=CC=C(C[C@@H](C(O)=O)N)C=C1  
    CC(O)O  
    

    Note:
    Each small molecule will be paired with each protein to compute the binding affinity, and the results will be output accordingly.
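
    The all-against-all pairing described in the note can be pictured as a Cartesian product, so the number of result rows is (number of ligands) × (number of proteins). This is only a sketch: the protein names and sequences are invented placeholders, and the row order of the real output file is not specified here.

```python
from itertools import product

# Hypothetical inputs; the sequences are short placeholders, not real targets.
proteins = {"protein_A": "MKTAYIAKQR", "protein_B": "GSHMLEDPVD"}
ligands = ["OC1=CC=C(C[C@@H](C(O)=O)N)C=C1", "CC(O)O"]

# Every ligand is paired with every protein (iterating a dict yields its keys).
pairs = [(smiles, target_id) for smiles, target_id in product(ligands, proteins)]

for smiles, target_id in pairs:
    print(target_id, smiles)
```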

    Output

    The output filename for affinity prediction results. Default is pred_res.csv.

    Results

    The result file pred_res.csv contains the following fields:

    Column Name Description
    SMILES Small molecule structure (SMILES format)
    Target_ID Protein name
    Target_Sequence Protein sequence
    Score (pIC50) Predicted binding affinity score (pIC50). A higher value indicates stronger binding, and can be compared with positive control molecules.
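
    For readers more used to concentrations, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, so each additional pIC50 unit corresponds to a tenfold lower IC50. The helper names below are hypothetical, and the converted values are only as reliable as the model's predictions.

```python
import math


def pic50_to_ic50_nm(pic50):
    """Convert pIC50 to IC50 in nanomolar (pIC50 = 9 - log10(IC50 in nM))."""
    return 10 ** (9 - pic50)


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar back to pIC50."""
    return 9 - math.log10(ic50_nm)


for value in (5.0, 6.0, 7.5):
    print(f"pIC50 {value} -> IC50 {pic50_to_ic50_nm(value):.1f} nM")
```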

    Reference

    Kexin Huang, Tianfan Fu, Lucas M Glass, Marinka Zitnik, Cao Xiao, Jimeng Sun, DeepPurpose: a deep learning library for drug–target interaction prediction, Bioinformatics, Volume 36, Issue 22–23, December 2020, Pages 5545–5547

  • Name: Sequence Clustering (MMseqs2)
    Description: 对蛋白、抗体序列进行聚类、可视化 Clustering and visualization for protein and antibody sequences
    Tags: undefined
    Author: Kallenborn F
    Release: 2025-06-30 10:44:57
    Reference: Kallenborn F, Chacon A, Hundt C, Sirelkhatim H, Didi K, Cha S, Dallago C, Mirdita M, Schmidt B, Steinegger M: GPU-accelerated homology search with MMseqs2. bioRxiv, doi: 10.1101/2024.11.13.623350 (2024) Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130.

    Sequence Clustering (MMseqs2)

    简介

    对蛋白、抗体序列进行聚类、可视化。模块使用MMseqs2算法对序列进行聚类分析,将多序列分为多个cluster类别,并通过ESM2模型对序列进行embedding,通过可视化模块UMAP对序列embedding进行降维,获取二维可视化信息。

    参数说明

    Sequence

    蛋白或抗体序列,FASTA格式

    Identity

    聚类中采用的最小序列一致性数值,范围在0-1之间,默认值为0.5,表示至少具有50% identity的序列才会被聚为一类。

    Type

    序列类型,选中表示抗体序列,否则为蛋白序列。

    Numbering Scheme

    序列类型为抗体时的编号规则,支持imgt, chothia, kabat

    Cluster

    序列聚类方案,支持2种:full, cdr(仅序列类型为抗体时可用)。‘full’表示使用全长序列进行聚类,‘cdr’表示使用CDR序列进行聚类(具体CDR位置在参数‘CDRs’中设定),默认为‘full’

    CDRs

    指定用于聚类的CDR区域,在‘Cluster’参数为cdr时生效。可选区域为(支持多选):CDR1,CDR2,CDR3。默认选择CDR3。

    结果说明

    输出cluster_res.csv结果文件,包含以下信息:

    列名 说明
    ID 序列名称
    Sequence 序列
    CDR1_AA CDR1的氨基酸序列,序列为抗体时输出
    CDR2_AA CDR2的氨基酸序列,序列为抗体时输出
    CDR3_AA CDR3的氨基酸序列,序列为抗体时输出
    Cluster_ID 序列所属类别编号,从1开始按顺序编号
    Cluster_Size 序列所属类别包含的序列数目,如:‘5’表示该类别含有5条序列
    Cluster_Center 序列是否为聚类中心,'1’表示是,‘0’表示不是

    参考文献

    • Kallenborn F, Chacon A, Hundt C, Sirelkhatim H, Didi K, Cha S, Dallago C, Mirdita M, Schmidt B, Steinegger M: GPU-accelerated homology search with MMseqs2. bioRxiv, DOI: 10.1101/2024.11.13.623350 (2024)
    • Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130. DOI: 10.1126/science.ade2574

    Sequence Clustering (MMseqs2)

    Introduction

    Cluster and visualize protein and antibody sequences. This module uses the MMseqs2 algorithm to perform clustering analysis on sequences, dividing multiple sequences into several cluster categories. It uses the ESM2 model to embed the sequences, and the visualization module UMAP to reduce the dimensionality of the sequence embeddings, obtaining two-dimensional visualization information.

    Parameters

    Sequence

    Protein or antibody sequences in FASTA format.

    Identity

    The minimum sequence identity value used in clustering, ranging from 0 to 1. The default value is 0.5, which means sequences must have at least 50% identity to be clustered together.

    Type

    Sequence type. When selected, the input is treated as antibody sequences; otherwise, it is treated as protein sequences.

    Numbering Scheme

    The numbering scheme for antibody sequences, supporting imgt, chothia, kabat.

    Cluster

    Sequence clustering scheme, supporting two types: full and cdr (only available for antibody sequences). ‘full’ means using the full-length sequence for clustering, while ‘cdr’ means using CDR sequences for clustering (specific CDR positions are set in the ‘CDRs’ parameter). The default is ‘full’.

    CDRs

    Specifies the CDR regions used for clustering, effective when the ‘Cluster’ parameter is set to cdr. Optional regions (supporting multiple selections) are: CDR1, CDR2, CDR3. The default selection is CDR3.

    Results

    Outputs a result file named cluster_res.csv containing the following information:

    Column Name Description
    ID Sequence name
    Sequence Sequence
    CDR1_AA Amino acid sequence of CDR1, output when the sequence is an antibody
    CDR2_AA Amino acid sequence of CDR2, output when the sequence is an antibody
    CDR3_AA Amino acid sequence of CDR3, output when the sequence is an antibody
    Cluster_ID Cluster category number of the sequence, numbered sequentially starting from 1
    Cluster_Size Number of sequences in the cluster category, e.g., ‘5’ means the category contains 5 sequences
    Cluster_Center Whether the sequence is a cluster center, ‘1’ indicates yes, ‘0’ indicates no
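
    A typical follow-up is to pick one representative per cluster from cluster_res.csv. The sketch below assumes the CSV header matches the column names in the table and uses invented rows; the CDR columns are omitted, as they would be for non-antibody input.

```python
import csv
import io

# Hypothetical cluster_res.csv content for plain protein input.
SAMPLE = """ID,Sequence,Cluster_ID,Cluster_Size,Cluster_Center
seq1,MKTAYIAK,1,2,1
seq2,MKTAYIAR,1,2,0
seq3,GSHMLEDP,2,1,1
"""


def cluster_centers(csv_text):
    """Map each Cluster_ID to (center sequence ID, cluster size)."""
    centers = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Cluster_Center"] == "1":
            centers[int(row["Cluster_ID"])] = (row["ID"],
                                               int(row["Cluster_Size"]))
    return centers


centers = cluster_centers(SAMPLE)
print(centers)
```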

    References

    • Kallenborn F, Chacon A, Hundt C, Sirelkhatim H, Didi K, Cha S, Dallago C, Mirdita M, Schmidt B, Steinegger M: GPU-accelerated homology search with MMseqs2. bioRxiv, DOI: 10.1101/2024.11.13.623350 (2024)
    • Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130. DOI: 10.1126/science.ade2574
  • Name: Back Mutation Grouping v2.6
    Description: 抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组 Back mutation grouping in the antibody humanization, which groups the back mutations based on the mutation scoring table generated by the Mutation Score module
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-04-03 10:23:26
    Reference:

    Back Mutation Grouping v2.6

    简介

    该模块是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组。

    更新内容:

    • 模块同时支持人源化和高通量人源化流程。

    参数说明

    方法1:Mutate

    Grafted Chain

    抗体CDR区嫁接后序列文件,FASTA格式,由Grafting模块生成

    Raw Chain

    抗体序列文件,FASTA格式

    Mutation Score

    人源化突变评分文件,CSV格式,由Mutation Score模块生成

    Output File

    指定输出的突变序列文件名称,FASTA格式

    Cutoff

    打分分组的截断值,逗号分割,例如:2,5,10表示将氨基酸突变评分大于10的为一组,5~10的氨基酸为一组,小于2的氨基酸分为一组。如果是纳米抗体,控制数量为 3 个,三个cutoff划分成4组:第一组仅T1,第二组开始T1全部+T2中一个轮换,第三组加入全部T2,第四组加入全部T3

    Output Policy

    指定输出的回复突变的文件

    Type

    普通抗体Antibody或者纳米抗体Nanobody

    方法2:HTS Mutate

    Grafted Chain

    抗体CDR区嫁接后序列文件,FASTA格式,由Grafting模块生成

    Raw Chain

    抗体序列文件,FASTA格式

    Mutation Score

    人源化突变评分文件,CSV格式,由Mutation Score模块生成

    Output File

    指定输出的突变序列文件名称,FASTA格式

    Cutoff

    打分分组的截断值,逗号分割,例如:2,5,10表示将氨基酸突变评分大于10的为一组,5~10的氨基酸为一组,小于2的氨基酸分为一组。如果是纳米抗体,控制数量为 3 个,三个cutoff划分成4组:第一组仅T1,第二组开始T1全部+T2中一个轮换,第三组加入全部T2,第四组加入全部T3

    Output Policy

    指定输出的回复突变的文件

    Type

    普通抗体Antibody或者纳米抗体Nanobody

    Combination Min Cutoff

    突变组合的截断值,Mutation Score模块中输出的氨基酸回复突变打分大于截断值的氨基酸参与生成突变组合

    Combination Max Cutoff

    高于截断值的突变自动进行回复突变

    Combination Site Cutoff

    每条链回复突变打分在Combination Min Cutoff与Combination Max Cutoff之间的选择打分前n个位置进行组合突变

    结果说明

    根据不同截断值得到突变分组结果文件mutate_policy.json。

    高通量方法HTS Mutate中根据组合突变截断值得到的突变分组结果文件combination_mutate_policy.json,高通量人源化设计流程。

    Back Mutation Grouping v2.6

    Introduction

    This module is the grouping step of the antibody humanization design workflow. It groups back mutations according to the back-mutation score table produced by the Mutation Score module.

    Update Log:

    • The module supports both the humanization and the high-throughput humanization workflows.

    Parameters

    Method 1: Mutate

    Grafted Chain

    Sequence file of the antibody CDR region after grafting, in FASTA format, generated by the Grafting module.

    Raw Chain

    Sequence file of the antibody, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Specify the name of the output mutation sequence file, in FASTA format.

    Cutoff

    The cutoff values for score-based grouping, separated by commas. For example, “2,5,10” means that amino acid mutations with scores above 10 form one group, those with scores between 5 and 10 form another, and those with scores below 2 form another. For nanobodies, the number of cutoffs is fixed at 3, and the three cutoffs produce 4 groups: the first group contains only T1; the second group contains all of T1 plus one T2 mutation at a time, in rotation; the third group adds all of T2; the fourth group adds all of T3.
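
    As a rough illustration of how a comma-separated cutoff string partitions scores, the sketch below treats the cutoffs as plain bin edges. This is an assumption rather than the module's documented behavior: the text does not state how scores between 2 and 5, or scores exactly on a cutoff, are handled, and the nanobody rotation rule is not modeled. The function names and sample scores are hypothetical.

```python
from bisect import bisect_left


def parse_cutoffs(text):
    """Parse a string such as '2,5,10' into a sorted list of floats."""
    return sorted(float(part) for part in text.split(","))


def bin_index(score, cutoffs):
    """Return 0 for the lowest bin up to len(cutoffs) for the highest.

    Assumption: a score equal to a cutoff falls into the lower bin,
    so with '2,5,10' only scores strictly above 10 reach the top bin.
    """
    return bisect_left(cutoffs, score)


cutoffs = parse_cutoffs("2,5,10")
scores = {"H48": 12.5, "H67": 7.0, "L4": 1.2}  # hypothetical UID -> score
bins = {uid: bin_index(score, cutoffs) for uid, score in scores.items()}
print(bins)
```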

    Output Policy

    Specify the file for the output of back mutations.

    Type

    Antibody or Nanobody

    Method 2: HTS Mutate

    Grafted Chain

    Sequence file of the antibody CDR region after grafting, in FASTA format, generated by the Grafting module.

    Raw Chain

    Sequence file of the antibody, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Specify the name of the output mutation sequence file, in FASTA format.

    Cutoff

    The cutoff values for score-based grouping, separated by commas. For example, “2,5,10” means that amino acid mutations with scores above 10 form one group, those with scores between 5 and 10 form another, and those with scores below 2 form another. For nanobodies, the number of cutoffs is fixed at 3, and the three cutoffs produce 4 groups: the first group contains only T1; the second group contains all of T1 plus one T2 mutation at a time, in rotation; the third group adds all of T2; the fourth group adds all of T3.

    Output Policy

    Specify the file for the output of back mutations.

    Type

    Antibody or Nanobody

    Combination Min Cutoff

    Cutoff value for mutation combinations. Amino acids with scores (generated from Mutation Score module) greater than the cutoff value are involved in the mutation combinations.

    Combination Max Cutoff

    Positions with scores above this cutoff are back-mutated automatically.

    Combination Site Cutoff

    For each chain, select the top n positions with back mutation scores between the Combination Min Cutoff and the Combination Max Cutoff for combination mutations.
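
    The three combination parameters can be read as a two-step filter per chain: positions scoring above the Max cutoff are always back-mutated, and of the positions between Min and Max, only the top n by score become candidates for combination. The sketch below follows that reading; the function name, the data layout, and the treatment of scores exactly on a cutoff are assumptions, and how the module then enumerates combinations is not shown.

```python
def select_sites(scores_by_chain, min_cutoff, max_cutoff, top_n):
    """Split positions into always-back-mutated and combination candidates.

    scores_by_chain: {chain: {uid: score}} (hypothetical layout).
    """
    always, candidates = {}, {}
    for chain, scores in scores_by_chain.items():
        # Above the Max cutoff: back-mutated automatically.
        always[chain] = sorted(uid for uid, s in scores.items()
                               if s > max_cutoff)
        # Between Min and Max: keep only the top_n highest-scoring positions.
        window = [(uid, s) for uid, s in scores.items()
                  if min_cutoff < s <= max_cutoff]
        window.sort(key=lambda item: item[1], reverse=True)
        candidates[chain] = [uid for uid, _ in window[:top_n]]
    return always, candidates


# Invented scores for illustration.
example = {
    "H": {"H48": 12.5, "H67": 7.0, "H71": 4.0, "H93": 3.0, "H2": 1.0},
    "L": {"L4": 1.2, "L36": 6.0},
}
always, candidates = select_sites(example, min_cutoff=2, max_cutoff=10, top_n=2)
print(always, candidates)
```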

    Results

    The mutation grouping results file mutate_policy.json is generated based on different cutoff values.
    In HTS Mutate, the mutation grouping results file combination_mutate_policy.json is generated based on combination cutoff values.

  • Name: Humanization Report v2.5
    Description: 抗体人源化设计报告生成模块,用于生成最终的抗体人源化设计报告以及相应的专利实施例段落。 Humanization Report is an antibody humanization design reporting module for generating humanization design reports as well as patent example paragraphs.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-12-23 00:00:00
    Reference:

    Humanization Report v2.5

    简介

    Humanization Report v2.5是抗体人源化设计报告生成模块,用于生成最终的抗体人源化设计报告以及相应的专利实施例段落。

    更新日志:

    • 同时支持人源化和高通量人源化流程

    参数说明

    方法1:Humanization Report

    Graft Policy

    Grafting模块生成的Graft Policy文件。

    Mutate Policy

    Back Mutation Grouping模块生成的Policy文件。

    Antibody Type

    抗体类型,Antibody 标准双链抗体,Nanobody 纳米抗体。

    Germline Score File

    Grafting模块生成的score文件,JSON格式

    Mutation Score File

    Mutation模块生成的score文件,CSV格式

    方法2:Humanization HTS Report

    Graft Policy

    Grafting模块生成的Graft Policy文件。

    Mutate Policy

    Back Mutation Grouping模块生成的Policy文件。

    Antibody Type

    抗体类型,Antibody 标准双链抗体,Nanobody 纳米抗体。

    Germline Score File

    Grafting模块生成的score文件,JSON格式

    Mutation Score File

    Mutation模块生成的score文件,CSV格式

    Antibody RMSD File

    抗体结构RMSD文件,由Antibody RMSD模块生成,CSV格式

    Antibody RMSD Top

    从RMSD排序中取前N个RMSD值小的抗体

    Folding Stability File

    Absolute Folding Stability模块预测生成的蛋白稳定性文件,CSV格式

    结果说明

    输出结果包括:

    输出文件名称 说明
    BM.pptx 回复突变位点汇总文件
    batch_registration_template.xlsx 批量注册模板文件
    hotspot_summary.xlsx 风险位点总结
    patent_example_template.docx 人源化设计序列在相应的专利实施例段落
    patent_example_en_template.docx 英文版人源化设计序列在相应的专利实施例段落
    back_mutation_grouping.md 回复突变分组信息
    candidate_score.xlsx 人源化抗体序列的结构和能量打分汇总
    humanized_variants.fasta 抗体人源化设计序列文件,FASTA格式
    Report.docx 抗体人源化设计报告,包括整个人源化设计过程涉及的序列、分组等信息

    其中batch_registration_template.xlsx包含如下信息:

    字段名称 说明
    Protein Sequence 蛋白序列
    Molecule Name 分子名称

    其中hotspot_summary.xlsx包含如下信息:

    字段名称 说明
    ID 抗体序列名称
    Sequence-CDR CDR序列区域
    Deamidation 脱酰胺位点
    Isomerization 异构化位点
    Cleavage 酶切位点
    Hydrolysis 水解位点
    Glycosylation 糖基化位点
    Cys 半胱氨酸数量
    Oxidation 氧化位点
    High risk 高风险率
    High risk sites 高风险位点

    Humanization Report v2.5

    Introduction

    The Humanization Report v2.5 is a module for generating reports on antibody humanization design, including the final antibody humanization design report and corresponding paragraphs for patent implementation examples.

    Update Log:

    • The module supports both the humanization and the high-throughput humanization workflows.

    Parameters

    Method 1: Humanization Report

    Graft Policy

    The Graft Policy file generated by the Grafting module.

    Mutate Policy

    The Policy file generated by the Back Mutation Grouping module.

    Antibody Type

    Antibody type: Antibody (standard two-chain antibody) or Nanobody.

    Germline Score File

    Graft germline score file in JSON format generated by the Grafting module

    Mutation Score File

    Mutation score file in CSV format, generated by the Mutation Score module

    Method 2: Humanization HTS Report

    Graft Policy

    The Graft Policy file generated by the Grafting module.

    Mutate Policy

    The Policy file generated by the Back Mutation Grouping module.

    Antibody Type

    Antibody type: Antibody (standard two-chain antibody) or Nanobody.

    Germline Score File

    Graft germline score file in JSON format generated by the Grafting module

    Mutation Score File

    Mutation score file in CSV format, generated by the Mutation Score module

    Antibody RMSD File

    Antibody structure RMSD file in CSV format, generated by the Antibody RMSD module

    Antibody RMSD Top

    Select the top N antibodies with the smallest RMSD values from the RMSD ranking

    Folding Stability File

    Protein folding stability file generated by Absolute Folding Stability module in CSV format

    Results

    The output results include:

    Output File Name Description
    BM.pptx Summary file of back mutation sites
    batch_registration_template.xlsx Batch registration template file
    hotspot_summary.xlsx Summary of hotspot sites
    patent_example_template.docx Humanization design sequences in corresponding patent implementation example paragraphs (Chinese version)
    patent_example_en_template.docx Humanization design sequences in corresponding patent implementation example paragraphs (English version)
    back_mutation_grouping.md Grouping for back mutations
    humanized_variants.fasta Antibody humanization design sequence file in FASTA format
    Report.docx Antibody humanization design report, including information on sequences, grouping, etc., involved in the entire humanization design process
    candidate_score.xlsx Candidate sequences energy and structure scores

    The batch_registration_template.xlsx file contains the following information:

    Field Name Description
    Protein Sequence Protein sequence
    Molecule Name Molecule name

    The hotspot_summary.xlsx file contains the following information:

    Field Name Description
    ID Antibody sequence name
    Sequence-CDR CDR sequence region
    Deamidation Deamidation site
    Isomerization Isomerization site
    Cleavage Cleavage site
    Hydrolysis Hydrolysis site
    Glycosylation Glycosylation site
    Cys Number of cysteines
    Oxidation Oxidation site
    High risk High-risk rate
    High risk sites High-risk sites
  • Name: Model Result Merge
    Description: 合并AF3-like模型(Boltz-2,Protenix,Chai-1)输出的结果。 Merge the results of AF3-like models (Boltz-2, Protenix, Chai-1).
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-06-11 15:54:07
    Reference:

    Model Result Merge

    简介

    合并AF3-like模型(Boltz-2,Protenix,Chai-1)输出的结果。

    参数说明

    Boltz

    指定Boltz2结果的打包文件,tar格式,如:Boltz_results.tar。

    Protenix

    指定Protenix结果的打包文件,tar格式,如:Protenix_results.tar。

    Chai-1

    指定Chai-1结果的打包文件,tar格式,如:Chai-1_results.tar。

    Output

    结构文件合并输出的打包文件名称,默认为merged_results.tar。

    Output Score

    打分文件合并输出的打包文件名称,默认为merged_results.csv。

    结果说明

    结构文件的合并输出打包文件merged_results.tar,包含输入的所有AF3-like模型预测结果。
    打分文件的合并输出打包文件merged_results.csv,包含所有AF3-like模型的打分。

    Model Result Merge

    Introduction

    Merge the output results of AF3-like models (Boltz-2, Protenix, Chai-1).

    Parameters

    Boltz

    Specify the packaged result file from Boltz-2 in tar format, e.g., Boltz_results.tar.

    Protenix

    Specify the packaged result file from Protenix in tar format, e.g., Protenix_results.tar.

    Chai-1

    Specify the packaged result file from Chai-1 in tar format, e.g., Chai-1_results.tar.

    Output

    Name of the merged output tar file containing structure files. Defaults to merged_results.tar.

    Output Score

    Name of the merged output file containing scores. Defaults to merged_results.csv.

    Results

    The merged output tar file merged_results.tar contains the structural prediction results from all the input AF3-like models.
    The merged score file merged_results.csv includes the scores from all AF3-like models.

  • Name: PPI Score Merge
    Description: 合并AF3-like模型打分结果与PPI模块打分结果,并汇总输出。 The AF3-like model scoring results and PPI module scoring results are merged, and the output is summarized.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-06-11 15:54:07
    Reference:

    PPI Score Merge

    简介

    合并AF3-like模型打分结果与PPI模块打分结果,并汇总输出。

    参数说明

    Model Score

    指定AF3-like多个模型打分的汇总文件,csv格式,如:merged_results.csv。

    Prodigy Score

    指定PPI模型Prodigy的结果打分文件,csv格式,如:prodigy_output.csv。

    Graphomer Score

    指定PPI模型Graphomer的结果打分文件,csv格式,如:PPI_pred.csv。

    Output

    打分合并输出的文件名称,默认为score_merge.csv。

    结果说明

    打分的合并输出打包文件score_merge.csv,包含所有AF3-like模型的打分及PPI模型打分。

    PPI Score Merge

    Introduction

    Merge the scoring results from AF3-like models with the PPI module scoring results and generate a consolidated output.

    Parameters

    Model Score

    Specify the consolidated score file from multiple AF3-like models in CSV format, e.g., merged_results.csv.

    Prodigy Score

    Specify the scoring result file from the PPI model Prodigy in CSV format, e.g., prodigy_output.csv.

    Graphomer Score

    Specify the scoring result file from the PPI model Graphomer in CSV format, e.g., PPI_pred.csv.

    Output

    Name of the merged output score file. Defaults to score_merge.csv.

    Results

    The merged score output file score_merge.csv includes scoring results from all AF3-like models and PPI models.

  • Name: Enzyme Kinetic Prediction
    Description: 基于UniKP框架预测酶的动力学参数Kcat和Km Predict the enzyme kinetic parameters Kcat and Km using UniKP
    Tags: undefined
    Author: Han Yu
    Release: 2025-06-03 11:15:35
    Reference: Yu H, Deng H, He J, Keasling JD, Luo X. UniKP: a unified framework for the prediction of enzyme kinetic parameters. Nat Commun. 2023 Dec 11;14(1):8211.

    Enzyme Kinetic Prediction

    简介

    该模块预测酶的动力学参数Kcat与Km。模块基于UniKP框架实现,UniKP是一个用于预测酶动力学参数的计算工具。它结合了蛋白质序列和底物结构信息,利用预训练的语言模型(如 ProtT5-XL-UniRef50)来生成酶的表示,并通过深度学习模型预测酶的动力学参数。

    UniKP框架由两个关键组件组成:表示模块和机器学习模块。表示模块使用预训练的语言模型对酶和底物的信息进行编码。具体而言,酶序列中的氨基酸使用ProtT5-XL-UniRef50模型转换为1024维的向量。对于每个蛋白质,应用平均池化方法得到其表示,这被发现是对于蛋白质任务最有效的方法。另一方面,底物结构被转换为简化的分子输入线条记录系统(SMILES)格式,并通过预训练的SMILES转换器进行处理,每个符号生成一个256维的向量。然后,对最后一层和倒数第二层的第一个输出进行平均池化和最大池化,将它们连接起来生成一个1024维的分子表示向量。蛋白质和底物的连接表示向量随后被输入到机器学习模块中(整体架构图如下)。
    image.png

    在kcat预测任务中使用DLKcat数据集进行验证。在没有任何额外参数优化的情况下,通过五轮随机分割的测试集上的平均确定系数(R2)值为0.68,比DLKcat提高了20%。此外,这五轮中DLKcat的最高值比UniKP的最低值低16%,进一步证明了UniKP的稳健性。预测值和实验测量值之间的均方根误差(RMSE)在UniKP中也比DLKcat低,无论是在训练集还是测试集中。在测试集中,预测值和实验测量值之间存在着强烈的相关性,相关系数(PCC)为0.85,整个数据集的相关系数为0.99,比DLKcat分别高出14%和11%。
    image.png

    参数说明

    Sequences

    单个或多个酶的序列,fasta格式,每个酶使用一条序列表示(当某个酶有多条链时,将多条单链序列首尾连接作为一条序列)。

    Ligands

    底物分子的文件,txt格式,支持多个底物分子,使用SMILES表示,每行一个分子,文件内容示例:

    OC1=CC=C(C[C@@H](C(O)=O)N)C=C1
    CC(O)O
    

    注意:
    1,输入的底物分子数量与酶数量应相同,模块会按文件中的顺序进行酶与底物分子配对。
    2,当有多个酶分子时,可只设置一个底物分子,表示每个酶都使用相同的底物分子。

    Output

    动力学参数预测结果文件名,默认为pred_res.csv

    结果说明

    动力学参数结果文件pred_res.csv,包含以下信息:

    列名 说明
    SeqID 序列名称
    Sequence 酶序列
    SMILES 底物分子
    Kcat(n/s) 酶的周转数,是酶的动力学参数之一。表示每个酶分子单位时间内能转化底物的最大分子数,单位为个/秒
    Km(mM) 米氏常数,是另一个酶的动力学参数。代表反应速率为最大反应速率一半时的底物浓度,单位为mM

    参考文献

    • Yu H, Deng H, He J, Keasling JD, Luo X. UniKP: a unified framework for the prediction of enzyme kinetic parameters. Nat Commun. 2023 Dec 11;14(1):8211. DOI: 10.1038/s41467-023-44113-1

    Enzyme Kinetic Prediction

    Introduction

    This module predicts the enzyme kinetic parameters Kcat and Km. It is implemented on top of the UniKP framework, a computational tool for predicting enzyme kinetic parameters. UniKP combines protein sequence and substrate structure information: it uses pre-trained language models (such as ProtT5-XL-UniRef50) to generate enzyme representations and then uses a machine learning model to predict the kinetic parameters.

    The UniKP framework consists of two key components: the representation module and the machine learning module. The representation module encodes information of enzymes and substrates using pre-trained language models. Specifically, amino acids in enzyme sequences are transformed into 1024-dimensional vectors using the ProtT5-XL-UniRef50 model. For each protein, average pooling is applied to obtain its representation, which has been found to be the most effective method for protein tasks. On the other hand, substrate structures are converted into Simplified Molecular Input Line Entry System (SMILES) format and processed by a pre-trained SMILES encoder, generating a 256-dimensional vector for each token. Then, average pooling and max pooling are applied to the first outputs of the last and penultimate layers, concatenated to form a 1024-dimensional molecular representation vector. The concatenated representation vectors of proteins and substrates are then fed into the machine learning module (overall architecture diagram shown below).
    image.png

    The Kcat prediction task was validated using the DLKcat dataset. Without any additional parameter tuning, the average coefficient of determination (R²) on five rounds of random splits of the test set was 0.68, which is a 20% improvement over DLKcat. Furthermore, the highest R² value of DLKcat in these five rounds was 16% lower than the lowest R² value of UniKP, further demonstrating UniKP’s robustness. The root mean square error (RMSE) between predicted and experimental values was also lower in UniKP than in DLKcat for both training and test sets. In the test set, there was a strong correlation between predicted and experimental values, with a Pearson correlation coefficient (PCC) of 0.85, and 0.99 for the entire dataset, which are 14% and 11% higher than DLKcat, respectively.
    image.png

    Parameters

    Sequences

    Sequences of one or more enzymes in FASTA format, with each enzyme represented by a single sequence (for multi-chain enzymes, concatenate the individual chain sequences end-to-end into one sequence).

    Ligands

    Substrate molecule file in TXT format. Multiple substrate molecules are supported, represented using SMILES notation, with one molecule per line. Example file content:

    OC1=CC=C(C[C@@H](C(O)=O)N)C=C1
    CC(O)O
    

    Note:

    1. The number of input substrate molecules should match the number of enzymes. The module pairs enzymes and substrates in the order they appear in the file.
    2. When multiple enzymes are provided, a single substrate molecule can be specified, indicating that the same substrate is used for all enzymes.
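    The pairing rules in the note above can be sketched as follows. This is a minimal illustration rather than the module's actual code; the function name `pair_inputs` and the plain-list inputs are assumptions made for the example.

```python
def pair_inputs(enzymes, substrates):
    """Pair enzyme sequences with substrate SMILES in input order.

    Hypothetical helper: a single substrate is reused for every enzyme;
    otherwise the two lists must have the same length.
    """
    if len(substrates) == 1:
        return [(enzyme, substrates[0]) for enzyme in enzymes]
    if len(enzymes) != len(substrates):
        raise ValueError(
            f"{len(enzymes)} enzymes but {len(substrates)} substrates; "
            "counts must match or exactly one substrate must be given"
        )
    return list(zip(enzymes, substrates))


# Two enzymes sharing one substrate (toy sequences, not real enzymes).
pairs = pair_inputs(["MKT", "MSA"], ["CC(O)O"])
print(pairs)
```

    Checking the counts before prediction makes a mismatched input fail early instead of silently pairing the wrong substrate with an enzyme.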

    Output

    Filename of the kinetic parameter prediction result file, default is pred_res.csv.

    Results

    The kinetic parameter result file pred_res.csv contains the following information:

    Column Name Description
    SeqID Sequence identifier
    Sequence Enzyme sequence
    SMILES Substrate molecule
    Kcat (n/s) Turnover number of the enzyme, one of its kinetic parameters: the maximum number of substrate molecules that one enzyme molecule converts per unit time, in molecules per second (s⁻¹)
    Km (mM) Michaelis constant, another kinetic parameter. It represents the substrate concentration at which the reaction rate is half of the maximum, in millimolar (mM)

    References

    • Yu H, Deng H, He J, Keasling JD, Luo X. UniKP: a unified framework for the prediction of enzyme kinetic parameters. Nat Commun. 2023 Dec 11;14(1):8211. DOI: 10.1038/s41467-023-44113-1
  • Name: DockQ
    Description: 评估预测的蛋白-蛋白复合物结构质量的工具和指标 A tool and metric for evaluating the quality of predicted protein-protein complex structures
    Tags: undefined
    Author: Claudio Mirabello
    Release: 2025-05-09 09:44:25
    Reference: Mirabello C, Wallner B. DockQ v2: improved automatic quality measure for protein multimers, nucleic acids, and small molecules. Bioinformatics. 2024 Oct 1;40(10):btae586.

    DockQ

    简介

    DockQ是一种用于评估预测的蛋白-蛋白复合物结构质量的工具和指标,它通过将三个相关但独立的质量测量指标(Fnat、LRMSD和iRMSD)组合成一个范围在[0,1]内的单个分数,来评估蛋白质对接模型的质量。DockQ的分数范围为0到1,分数越高表示模型质量越好。根据DockQ的分数,可以将对接模型的质量分为以下几类:

    分数范围 质量分类
    0.00 ≤ DockQ < 0.23 错误(Incorrect)
    0.23 ≤ DockQ < 0.49 可接受质量(Acceptable quality)
    0.49 ≤ DockQ < 0.80 中等质量(Medium quality)
    DockQ ≥ 0.80 高质量(High quality)

    DockQ的计算公式如下:
    image.png

    其中:
    Fnat:预测复合体在交界面上的作用残基在真实复合体中的比例。
    LRMSD:将预测的复合体和真实复合体的两条链中较长的链比对后,较短链的均方根偏差(RMSD)。
    iRMSD:度量界面上两个原子相距10 Å内的原子集合的RMSD。
    LRMSD与iRMSD是经过缩放后的数值,缩放公式如下:
    image.png

    参数说明

    Native

    必填参数,用于DockQ计算的Native复合物结构,PDB格式,一般为实验解析的结构。

    Model

    必填参数,用于DockQ计算的Model复合物结构,PDB格式,一般为AI模型预测或者分子对接等得到的模拟结构。

    Mapping

    可选参数,指定Native结构与Model结构中的链对应关系。相对应的链名之间用逗号分隔,多组链对应时,组间用分号分隔,如:A,E;B,D;C,F表示:

    • Native结构中的A链与Model结构中的E链对应。
    • Native结构中的B链与Model结构中的D链对应。
    • Native结构中的C链与Model结构中的F链对应。

    注意:
    1,设置该参数时,模块将根据设置的链对应关系来计算DockQ,如不设置该参数,模块会自动匹配所有有界面接触的两条链之间的对应关系,并计算匹配到的所有两条链的DockQ。
    2,在特定场景中,计算DockQ时,可能希望合并某些链作为整体来考虑。比如抗原-抗体复合物中,希望将抗体的重、轻链作为一个整体,计算与抗原之间的DockQ值。这种情况,可以在指定mapping参数时,将需要合并的链名写在一起即可,比如C,F;AB,ED 表示:

    • Native结构中的C链与Model结构中的F链对应。
    • Native结构中的A链与B链作为一个整体,与Model结构中的E链D链作为整体,进行链对应(AB链之间的界面,ED链之间的界面不再单独考虑)。

    Output

    输出结果文件名称,默认为dockq_res.csv

    结果说明

    预测结果文件dockq_res.csv,包含以下信息:

    列名 说明
    Native_chains Native结构中用于计算DockQ的链名,多个链名用分号分隔
    Model_chains Model结构中用于计算DockQ的链名,多个链名用分号分隔
    DockQ 计算得到的DockQ数值。DockQ的分数范围为0到1,分数越高表示模型质量越好。
    iRMSD 界面上两个原子相距10 Å内的原子集合的RMSD
    LRMSD 将预测的复合体和真实复合体的两条链中较长的链叠合后,较短链的RMSD
    fnat 预测复合体在交界面上的作用残基在真实复合体中的比例
    fnonnat 预测复合体在交界面上的作用残基不在真实复合体中的比例
    F1 预测复合体在交界面上的作用残基是否在真实复合体中,对应的精确率和召回率的调和平均值
    clashes 预测复合体中界面残基存在clash的数量,当两个残基的距离小于2Å时视为clash

    参考文献

    • Mirabello C, Wallner B. DockQ v2: improved automatic quality measure for protein multimers, nucleic acids, and small molecules. Bioinformatics. 2024 Oct 1;40(10):btae586; DOI: 10.1101/2024.05.28.596225

    DockQ

    Introduction

    DockQ is a tool and metric for evaluating the quality of predicted protein-protein complex structures. It combines three related but independent quality measures (Fnat, LRMSD, and iRMSD) into a single score ranging from 0 to 1 that assesses the accuracy of docking models. A higher DockQ score indicates better model quality. Based on the DockQ score, docking models can be classified as follows:

    Score Range Quality Category
    0.00 ≤ DockQ < 0.23 Incorrect
    0.23 ≤ DockQ < 0.49 Acceptable quality
    0.49 ≤ DockQ < 0.80 Medium quality
    DockQ ≥ 0.80 High quality

    DockQ is computed using the following formula:
    image.png

    Where:

    • Fnat: The fraction of native contacts (interface residues in the predicted complex that are also present in the native complex).
    • LRMSD: RMSD of the shorter chain after superimposing the longer chains of the predicted and native complexes.
    • iRMSD: RMSD of interface atoms within 10 Å across chains.

    LRMSD and iRMSD are scaled using the following equations:
    image.png
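    Because the formulas on this page are only available as images, the sketch below spells out the calculation in code. It assumes the scaling constants given in the DockQ publications (1.5 Å for iRMSD and 8.5 Å for LRMSD); these values and the function names are not taken from this page.

```python
def dockq_score(fnat, irmsd, lrmsd, d_irmsd=1.5, d_lrmsd=8.5):
    """Combine Fnat, iRMSD and LRMSD (both in Å) into a single DockQ score."""
    def scaled(rmsd, d):
        # Maps an RMSD in [0, inf) onto (0, 1]; 1 means a perfect match.
        return 1.0 / (1.0 + (rmsd / d) ** 2)

    return (fnat + scaled(irmsd, d_irmsd) + scaled(lrmsd, d_lrmsd)) / 3.0


def dockq_category(score):
    """Map a DockQ score to the quality classes listed in the table above."""
    if score >= 0.80:
        return "High quality"
    if score >= 0.49:
        return "Medium quality"
    if score >= 0.23:
        return "Acceptable quality"
    return "Incorrect"


score = dockq_score(fnat=0.5, irmsd=1.5, lrmsd=8.5)
print(round(score, 3), dockq_category(score))
```

    Each scaled RMSD term equals 0.5 when the RMSD equals its scaling constant, so the three terms contribute on a comparable 0 to 1 scale before averaging.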

    Parameters

    Native

    Required. The native (reference) structure in PDB format used for DockQ calculation, typically derived from experimental data.

    Model

    Required. The model structure in PDB format to be evaluated by DockQ, typically generated by AI models or docking simulations.

    Mapping

    Optional. Specifies the chain correspondence between the native and model structures. Chain names are separated by commas for each pair, and semicolons are used to separate multiple pairs.
    For example: A,E;B,D;C,F means:

    • Chain A in the native structure corresponds to chain E in the model.
    • Chain B in the native structure corresponds to chain D in the model.
    • Chain C in the native structure corresponds to chain F in the model.

    Note:

    1. When this parameter is provided, the module uses the specified mapping for DockQ calculation.
      If not set, the module will automatically match all chain pairs with interface contacts and calculate DockQ for each matched pair.

    2. In specific scenarios, it may be necessary to consider merged chains as a single unit (e.g., heavy and light chains of an antibody). For such cases, multiple chains can be combined in the mapping, e.g., C,F;AB,ED means:

      • Chain C in the native structure corresponds to chain F in the model.
      • Chains A and B in the native structure are treated as one unit and correspond to chains E and D in the model, also treated as one unit (interfaces within AB or ED are not considered separately).
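    The mapping syntax described above can be parsed with a few lines of code. The following is an illustrative sketch only: `parse_mapping` is a hypothetical helper, and it assumes single-character chain names so that a merged group such as AB can be split into chains A and B.

```python
def parse_mapping(mapping):
    """Parse a string such as 'C,F;AB,ED' into (native, model) chain groups."""
    pairs = []
    for item in mapping.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        parts = [p.strip() for p in item.split(",")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"malformed mapping entry: {item!r}")
        native, model = parts
        # Each character is one chain; several characters form a merged group.
        pairs.append((list(native), list(model)))
    return pairs


print(parse_mapping("C,F;AB,ED"))
```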

    Output

    Output file name for DockQ results. The default is dockq_res.csv.

    Results

    The result file dockq_res.csv contains the following information:

    Column Name Description
    Native_chains Chains in the native structure used for DockQ calculation (separated by semicolons)
    Model_chains Chains in the model structure used for DockQ calculation (separated by semicolons)
    DockQ Computed DockQ score. The DockQ score ranges from 0 to 1, with higher scores indicating better model quality.
    iRMSD Interface RMSD of atoms within 10 Å
    LRMSD RMSD of the shorter chain after aligning the longer chains
    fnat Fraction of native interface contacts
    fnonnat Fraction of non-native interface contacts
    F1 F1-score combining precision and recall for predicted interface residues
    clashes Number of clashes (residue pairs < 2 Å apart) in the predicted complex

    References

    • Mirabello C, Wallner B. DockQ v2: improved automatic quality measure for protein multimers, nucleic acids, and small molecules. Bioinformatics. 2024 Oct 1;40(10):btae586; DOI: 10.1101/2024.05.28.596225

  • Name: PPI Binding Energy (Graphomer)
    Description: 基于PPI-Graphomer模型预测蛋白-蛋白结合亲和力 Predict protein-protein binding affinity with the PPI-Graphomer model
    Tags: undefined
    Author: Jun Xie
    Release: 2025-05-08 17:24:24
    Reference: Xie, J., Zhang, Y., Wang, Z. et al. PPI-Graphomer: enhanced protein-protein affinity prediction using pretrained and graph transformer models. BMC Bioinformatics 26, 116 (2025).

    PPI Binding Energy (Graphomer)

    简介

    基于PPI-Graphomer模型预测蛋白-蛋白结合亲和力,该模型是一种专门感知界面残基作用的Graph Transformer模型,同时结合了多模态预训练模型,效果显著优于已有主流方法。

    模型设计采用:序列 + 结构 + 图神经网络三合一
    步骤一:特征提取,蛋白语言模型 + 结构模型协同

    • 使用ESM2(蛋白语言模型)用于提取蛋白序列中的进化和语义特征(1280 维 → 64 维);
    • 使用ESM-IF1(结构到序列逆折叠模型)用于提取 AlphaFold2 结构中的空间特征(512 维 → 32 维);
    • 使用多链结构用25 个 Gly 连接为一条伪序列,确保预训练模型能正常运行;
    • 特征拼接后送入下一阶段建模。

    步骤二:核心模块,PPI-Graphomer(界面建模利器)
    借鉴微软提出的 Graphormer 思想,引入结构感知的图 Transformer 模块,具体包括:

    编码方式 描述
    氨基酸对类型编码 AAType(vᵢ,vⱼ) 区分不同氨基酸组合,推测物理作用趋势
    相互作用力编码 Interact(vᵢ,vⱼ) 捕捉氢键、盐桥、π堆叠等相互作用数量
    距离权重 Dij + 接口遮罩 仅关注跨链、7Å内的残基对,提高关注焦点准确性

    这些信息被作为注意力偏置项加入到 Transformer 的 Attention 计算中,强化模型对关键界面信息的关注,最终获得接口表征。

    步骤三:特征拼接 + 回归预测
    使用“跳跃连接式”结构(skip-connection),将界面信息与全局序列结构信息拼接后输入 MLP 预测亲和力(ΔG),输出结果用于与真实值比较回归损失。

    模型整体架构示意图如下:
    image.png

    数据集与训练配置如下:
    主训练集:PDBbind(共 2376 条蛋白复合物,均转化为ΔG);
    测试集:
    Affinity Benchmark v1(Test set 1,75 个样本)
    PDBbind 精炼子集(Test set 2,87 个样本)

    预处理:
    移除序列过长(>2000 残基)样本;
    使用 BLAST 排除训练集中与测试集相似度>65%的样本,防止数据泄露;

    模型参数:
    Graphomer 层数:2 层;
    Attention 头数:8;
    训练轮次:20 epoch;
    使用 A40 GPU,推理内存仅需 4GB。

    模型预测效果如下:
    image.png

    与其他方法的结果比较如下:
    image.png

    参数说明

    Structure

    蛋白复合物结构,格式支持 .pdb 或 .cif。蛋白长度需小于2000AA(超过时会略过)。

    Structure TAR

    蛋白复合物结构,支持多个复合物结构打包进行批量预测,格式支持 .tar、.tar.*z 或 .zip,最大支持1000个结构。

    Output

    亲和力预测的结果文件名,默认为PPI_pred.csv

    结果说明

    亲和力预测结果文件PPI_pred.csv,包含以下信息:

    列名 说明
    Name 结构名称
    Binding_Affinity (kcal/mol) 预测的亲和力,为Gibbs自由能,单位为kcal/mol。负得越多,亲和力越强。注意:所提供的能量是复合物中所有链之间的亲和力总和。

    参考文献

    • Xie, J., Zhang, Y., Wang, Z. et al. PPI-Graphomer: enhanced protein-protein affinity prediction using pretrained and graph transformer models. BMC Bioinformatics 26, 116 (2025). DOI:10.1186/s12859-025-06123-2

    PPI Binding Energy (Graphomer)

    Introduction

    This module predicts protein–protein binding affinity. It is powered by the PPI-Graphomer model, a graph transformer architecture specifically designed to capture interface residue interactions. The model integrates multimodal pretrained features and significantly outperforms existing mainstream approaches.

    The model design integrates sequence + structure + graph neural network in a unified framework.

    Step 1: Feature Extraction – Coordinated Protein Language and Structure Modeling

    • ESM2 (a protein language model) is used to extract evolutionary and semantic features from protein sequences (1280-dim → 64-dim);
    • ESM-IF1 (inverse folding model from structure to sequence) is used to extract spatial features from AlphaFold2 structures (512-dim → 32-dim);
    • For multichain complexes, a pseudo-sequence is created by connecting chains with 25 Gly residues to ensure compatibility with pretrained models;
    • The features are then concatenated and passed to the next modeling stage.
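    The pseudo-sequence construction for multichain complexes can be illustrated as below. This is a sketch under stated assumptions, not the PPI-Graphomer code: the helper names are hypothetical, and `chain_spans` is added only to show how residues of the original chains could be located again after the linkers are inserted.

```python
GLY_LINKER = "G" * 25  # 25 glycine residues, as described above


def build_pseudo_sequence(chains):
    """Join chain sequences into one pseudo-sequence with Gly linkers."""
    return GLY_LINKER.join(chains)


def chain_spans(chains):
    """Return (start, end) index ranges of each chain in the pseudo-sequence."""
    spans, pos = [], 0
    for seq in chains:
        spans.append((pos, pos + len(seq)))
        pos += len(seq) + len(GLY_LINKER)
    return spans


chains = ["MKV", "AT"]  # toy sequences for illustration
pseudo = build_pseudo_sequence(chains)
print(len(pseudo), chain_spans(chains))
```

    Keeping the chain spans makes it possible to discard the embeddings of the linker positions, so that only real residues carry features into the next stage.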

    Step 2: Core Module – PPI-Graphomer (Interface Modeling Engine)
    Inspired by Microsoft’s Graphormer, a structure-aware graph transformer module is introduced. It includes:

    Encoding Type Description
    Amino Acid Pair Encoding AAType(vᵢ,vⱼ) Differentiates amino acid combinations to infer physical interaction trends
    Interaction Force Encoding Interact(vᵢ,vⱼ) Captures number of interactions such as hydrogen bonds, salt bridges, and π-stacking
    Distance Weight Dij + Interface Mask Focuses only on inter-chain residue pairs within 7Å to enhance attention accuracy

    These encodings are used as attention biases in the transformer’s attention mechanism, reinforcing the model’s focus on key interfacial residues to derive meaningful interface representations.

    Step 3: Feature Fusion + Affinity Regression
    Using a skip-connection design, the interface features are concatenated with global sequence and structure features and input into an MLP to predict binding affinity (ΔG). The predicted values are compared with ground truth to compute regression loss.

    The overall model architecture is illustrated below:
    image.png

    Dataset and Training Configuration

    • Primary training dataset: PDBbind (2,376 protein complexes, all converted to ΔG);

    • Test datasets:

      • Affinity Benchmark v1 (Test set 1, 75 samples)
      • PDBbind Refined Subset (Test set 2, 87 samples)

    Preprocessing:

    • Sequences longer than 2000 residues are removed;
    • Samples in the training set with >65% sequence similarity to test set (as determined by BLAST) are excluded to prevent data leakage.

    Model Hyperparameters:

    • Graphomer Layers: 2
    • Attention Heads: 8
    • Training Epochs: 20
    • GPU: NVIDIA A40 (inference memory requirement: only 4GB)

    Prediction Performance:
    image.png

    Comparison with Other Methods:
    image.png


    Parameters

    Structure

    Protein-complex structure; accepted formats: .pdb or .cif.
    The protein must be shorter than 2,000 amino acids (structures exceeding this limit will be skipped).

    Structure TAR

    Protein-complex structures for batch prediction; submit multiple complexes packed into a single archive.
    Accepted archive formats: .tar, .tar.*z, or .zip, containing up to 1,000 structures.

    Output

    Filename for the prediction results. Default is PPI_pred.csv.

    Results

    The output file PPI_pred.csv contains:

    Column Description
    Name Name of the structure
    Binding_Affinity (kcal/mol) Predicted binding affinity (Gibbs free energy) in kcal/mol. The more negative the value, the stronger the affinity. Note: The provided energy represents the total affinity among all chains within the complex.

    References

    • Xie, J., Zhang, Y., Wang, Z. et al. PPI-Graphomer: enhanced protein-protein affinity prediction using pretrained and graph transformer models. BMC Bioinformatics 26, 116 (2025). DOI:10.1186/s12859-025-06123-2
  • Name: Patch Analysis v2.1
    Description: 分析蛋白质表面的Patch(正电、负电、疏水残基富集区域)的大小和分布,用于解决蛋白质的聚集等问题。一般建议通过WeView三维结构可视化编辑器来使用该功能,可以在三维结构中直观地查看patch的位置。v2.1更新:支持设定pH值以及CDR编号,高亮CDR残基,输出CDR patch面积。 Calculate the size and distribution of patches (regions enriched in positively charged, negatively charged, or hydrophobic residues) on the protein surface to help address issues such as protein aggregation. It is recommended to use this feature through the WeView 3D structure viewer and editor, which lets you inspect patch locations directly in the 3D structure. v2.1 update: supports setting the pH value and CDR numbering, highlights CDR residues, and outputs the CDR patch area.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-04-29 15:01:18
    Reference:

    Patch Analysis v2.1

    简介

    该模块计算蛋白质表面静电和疏水作用相对富集的区域,用于显示出在蛋白质-蛋白质相互作用中其重要作用的区域,这对于预测基于非共价弱相互作用的可逆的聚集现象尤其有用。尤其是,疏水相互作用长期被认为是大分子间吸引相互作用的主要组成部分。对于抗体,静电相互作用牵涉到了自聚集,而偶极-偶极相互作用被认为是导致β-折叠的纤维化的原因。同时,也可以通过WeView界面对蛋白结构进行Patch分析。

    v2.1 更新内容

    • 支持设定PH值
    • 支持CDR编号,高亮CDR残基,输出CDR patch面积。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式

    pH

    pH值,用于蛋白质子化判断

    Antibody Numbering

    抗体编号方法,其中 no_use 不使用编号

    Hydrophobic Cutoff

    Hydrophobic cutoff是一个以疏水性氨基酸(通常包括Leu,Ile,Val,Phe,Trp和Met)为基础定义的截断值,用于将表面上疏水性氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,Patch区域中获得的化学性质信息会根据其表面密度和具有高疏水性氨基酸的数量而有所变化。

    Positive Cutoff

    Positive Cutoff是一个以阳离子氨基酸为基础定义的截断值,用于将表面上阳离子氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,positive cutoff方式用于筛选出可能参与离子相互作用的蛋白质表面区域。

    Negative Cutoff

    Negative Cutoff是一个以阴离子氨基酸为基础定义的截断值,用于将表面上阴离子氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,negative cutoff方式用于筛选出可能参与离子相互作用的蛋白质表面区域。

    SASA Cutoff

    SASA Cutoff是一个以溶剂可及表面积为基础定义的截断值,低于截断值的patch残基会被过滤掉。是残基侧链暴露程度的百分比,相对值,范围在0-100之间。

    Distance Cutoff

    Distance Cutoff是原子距离截断值,低于截断值的才会认为属于同一聚集块。值越小,聚集块patch越小。

    Min Distance Cutoff

    Min Distance Cutoff是patch之间的距离截断值,距离小于截断值的归为同一个patch。

    Result Type

    输出文件格式,csv或者json
    通俗地讲,cutoff代表静电势能或疏水势能的强度阈值,单位是kcal/mol,超过阈值才会被计入面积。阈值越小,则patch越多。

    Keep Original

    不添加缺失原子(包括氢原子)和结构优化。

    Neutral N-terminus

    使得N端的蛋白残基中性化。

    Neutral C-terminus

    使得C端的蛋白残基中性化。

    结果说明

    输出结果包括:

    输出文件名称 说明
    patch_list.csv Patch结果的csv文件。主要关注Area(Å^2)数值,代表patch的大小,越大则越可疑,重点关注100 Å^2以上的patch。
    input_prot.pdb 质子化后的pdb结构。
    patch_list_sum.csv 统计了三种patch类型(Hyd:疏水中心,Neg:负电中心,Pos:正电中心)在蛋白表面所占面积,重点关注100 Å^2以上的patch。

    其中patch_list.csv,包含信息如下:

    字段名称 说明
    Type Patch的类型,Hyd:疏水中心,Neg:负电中心,Pos:正电中心
    Area(Å^2) 每个Patch的蛋白质表面区域面积
    Residues 每个Patch的对应的残基

    其中patch_list_sum.csv,包含信息如下:

    字段名称 说明
    Type Patch的类型,Hyd:疏水中心,Neg:负电中心,Pos:正电中心
    Total Areas Patch的蛋白质表面区域总面积
    Areas of The Largest Patch的蛋白质表面区域最大面积
    Number of Areas More Than 100 面积超过100 Å^2的patch的数目

    参考文献

    • Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348. DOI: 10.1142/p477
    • Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514. DOI: 10.1007/BF00124359
    • Wildman, S.A., Crippen, G.M.; Prediction of Physiochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873. DOI: 10.1021/ci990307l

    Patch Analysis v2.1

    Introduction

    Protein Patches calculates both electrostatic (excess charge) and hydrophobic surface patches to show regions of significance with respect to protein-protein interactions. This can be particularly useful in the prediction of reversible aggregation, which typically arises from relatively weak non-covalent interactions. In particular, hydrophobic interactions have long been recognized as major contributors in high affinity interactions between macromolecules. In antibodies, electrostatic interactions have been implicated in forming self-associated aggregates [Karshikoff 2006], while dipole-dipole interactions are believed to be the cause of fibrillogenic association of β-sheets. At the same time, protein structures can also be analyzed for patches through the WeView interface.
    Electrostatic patches.
    The surface electrostatic field is estimated using an exponentially decaying Debye-Hückel field with a screening length of λD = 3.5 Å.
    The resulting map mostly reflects excess charge close to the molecular surface.
    Significant patches are established by cutting the surface along isocontour lines where the absolute field value equals 40 kcal/mol/C and keeping the regions above that value. Finally, a default minimal patch area of 40 Å² filters out smaller, presumably less relevant, patches.
    Hydrophobicity map.
    The hydrophobic potential is calculated from the Wildman and Crippen octanol-water partition coefficients f=log P [Wildman 1999]:

    where fi is the coefficient of atom i and g(ri) is a Fermi-type distance-dependent weighting function proposed by Heiden et al. [Heiden 1993]:

    with rcut = 5 Å and α = 1.5.
    As with the electrostatic (excess charge) patches, significant hydrophobic patches are established by cutting the surface along isocontour lines, retaining only the portion above a potential threshold of 0.09 kcal/mol, and filtering for a minimal patch area of 50 Å².
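    The equations for the hydrophobic potential are not reproduced on this page, so the sketch below only illustrates the general idea. It assumes a standard Fermi function g(r) = 1 / (1 + exp(α(r − rcut))) and an unnormalized sum of fi·g(ri) over atoms; both forms, and the function names, are assumptions and may differ from the expressions actually used by the module.

```python
import math

R_CUT = 5.0  # Å, value given in the text
ALPHA = 1.5  # value given in the text


def fermi_weight(r, r_cut=R_CUT, alpha=ALPHA):
    """Fermi-type weight: close to 1 near the atom, 0.5 at r_cut, then decaying."""
    return 1.0 / (1.0 + math.exp(alpha * (r - r_cut)))


def hydrophobic_potential(point, atoms):
    """Sum f_i * g(r_i) over atoms given as (x, y, z, f_i) tuples.

    Assumed form: no normalization by the sum of weights is applied.
    """
    total = 0.0
    for x, y, z, f_i in atoms:
        r = math.dist(point, (x, y, z))
        total += f_i * fermi_weight(r)
    return total


# Toy atoms with made-up log P contributions, for illustration only.
atoms = [(0.0, 0.0, 0.0, 0.3), (4.0, 0.0, 0.0, -0.2)]
print(hydrophobic_potential((1.0, 0.0, 0.0), atoms))
```

    With α = 1.5 the weight drops from nearly 1 to nearly 0 within a few ångströms around rcut, so only atoms close to the surface point contribute noticeably.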

    v2.1 updates

    • Supports setting the pH value
    • Supports CDR numbering, highlights CDR residues, and outputs the CDR patch area.

    Parameters

    Structure PDB File

    Protein structure file in PDB format.

    pH

    pH value used to determine the protonation state of the protein.

    Antibody Numbering

    Antibody numbering scheme; no_use means that no antibody numbering is applied.

    Hydrophobic Cutoff

    Hydrophobic Cutoff is a cutoff value defined based on hydrophobic amino acids (usually including Leu, Ile, Val, Phe, Trp, and Met). It is used to compare the number of hydrophobic amino acids on the surface with the surface area, in order to screen out regions that may have important biological functions. In general, the chemical property information obtained for a patch region varies with its surface density and its number of highly hydrophobic amino acids.

    Positive Cutoff

    Positive Cutoff is a cutoff value defined based on cationic amino acids. It is used to compare the number of cationic amino acids on the surface with the surface area, in order to screen out regions that may have important biological functions. In general, the positive cutoff is used to identify protein surface regions that may be involved in ionic interactions.

    Negative Cutoff

    Negative Cutoff is a cutoff value defined based on anionic amino acids. It is used to compare the number of anionic amino acids on the surface with the surface area, in order to screen out regions that may have important biological functions. In general, the negative cutoff is used to identify protein surface regions that may be involved in ionic interactions.

    SASA Cutoff

    SASA Cutoff is a cutoff value based on solvent-accessible surface area (SASA). Patch residues whose exposure falls below the cutoff are filtered out. The value is the relative side-chain exposure of a residue, expressed as a percentage between 0 and 100.

    Distance Cutoff

    Distance Cutoff is the atomic distance cutoff: only atoms closer than the cutoff are considered part of the same cluster. Lower values result in smaller patches.

    Min Distance Cutoff

    Min Distance Cutoff is the cutoff value for the distance between patches (Å). Patches closer to each other than the cutoff are merged into a single patch.

    Result Type

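    Output file format: csv or json.
    In plain terms, the cutoffs above are strength thresholds on the electrostatic or hydrophobic potential, in kcal/mol. Only surface regions exceeding the threshold count toward the patch area, so a lower threshold yields more patches.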

    Keep Original

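    Do not add missing atoms (including hydrogens) and do not optimize the structure.

    Neutral N-terminus

    Neutralize the N-terminal residue of the protein.

    Neutral C-terminus

    Neutralize the C-terminal residue of the protein.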

    Results

    The output includes:

    Output File Name Description
    patch_list.csv A CSV file containing patch results. The main focus is on the Area (Å^2) value, which represents the size of the patch. Larger patches are considered more suspicious, with particular attention to patches larger than 100 Å^2.
    input_prot.pdb The protonated PDB structure.
    patch_list_sum.csv Summarizes the surface area occupied by three types of patches (Hyd: hydrophobic center, Neg: negative charge center, Pos: positive charge center) on the protein surface. Focus is placed on patches larger than 100 Å^2.

    Details of patch_list.csv:
    The file contains the following information:

    Field Name Description
    Type The type of patch: Hyd (hydrophobic center), Neg (negative charge center), Pos (positive charge center).
    Area (Å^2) The surface area of each patch on the protein.
    Residues The residues corresponding to each patch.

    Details of patch_list_sum.csv:
    The file contains the following information:

    Field Name Description
    Type The type of patch: Hyd (hydrophobic center), Neg (negative charge center), Pos (positive charge center).
    Total Areas The total surface area of patches on the protein.
    Areas of The Largest The largest surface area of a patch on the protein.
    Number of Areas More Than 100 The number of patches with an area larger than 100 Å^2.
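    The relationship between the two result files can be sketched as follows: the summary columns are plain aggregates of the per-patch rows. The column header `Area(Å^2)`, the residue string format, and the function name are assumptions for illustration; check the header of your own patch_list.csv before reusing this.

```python
import csv
import io


def summarize_patches(rows, area_key="Area(Å^2)", min_area=100.0):
    """Aggregate per-patch rows into per-type total, largest area and count."""
    summary = {}
    for row in rows:
        area = float(row[area_key])
        entry = summary.setdefault(
            row["Type"], {"total": 0.0, "largest": 0.0, "n_large": 0}
        )
        entry["total"] += area
        entry["largest"] = max(entry["largest"], area)
        if area > min_area:
            entry["n_large"] += 1
    return summary


# Made-up rows standing in for patch_list.csv.
example = io.StringIO(
    "Type,Area(Å^2),Residues\n"
    "Hyd,150.0,A:LEU45 A:PHE47\n"
    "Hyd,60.0,A:VAL12\n"
    "Pos,110.5,A:LYS3 A:ARG8\n"
)
summary = summarize_patches(csv.DictReader(example))
print(summary)
```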

    References

    • Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348. DOI: 10.1142/p477
    • Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514. DOI: 10.1007/BF00124359
    • Wildman, S.A., Crippen, G.M.; Prediction of Physiochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873. DOI: 10.1021/ci990307l
  • Name: Immunogenicity Prediction (WeADApt v4.2)
    Description: 唯信开发的基于多模融合深度学习的端到端免疫原性预测系统WeADApt(原名:AlphaMHC)。采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条与免疫原性相关的湿实验数据(包括亲和力数据、呈递数据、NGS数据、质谱数据等)进行训练,实现了从序列到临床免疫原性风险的端到端的预测,并通过了数百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)的验证测试。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Immunogenicity -> WeADApt。注:该版本非最新版本,推荐使用更新版本。 The new generation of immunogenicity prediction system, WeADApt (formerly known as AlphaMHC). Compared to version v4.1, version v4.2 offers improved prediction specificity and better discrimination between epitopes of varying risk levels, making it more suitable for de-immunization modifications. It is recommended to be run from WeSeq -> Immunogenicity -> WeADApt v4.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-10-18 10:50:56
    Reference:

    Immunogenicity Prediction (WeADApt v4.2)

    简介

    WeADApt (Wecomput ADA prediction) 是唯信开发的基于多模融合深度学习架构的免疫原性预测系统(也被熟知为AlphaMHC)。

    image.png

    该方法采用全新的多模融合深度神经网络架构,整合了近10亿条与免疫原性相关的湿实验数据(包括亲和力数据、呈递数据、NGS数据、质谱数据等)进行训练,有机地将多个与免疫原性相关的模型融合,构成一个高效的免疫反应模拟系统,可准确地模拟蛋白、抗体、多肽、疫苗等生物药的免疫原性,并能鉴别潜在的免疫原性的T细胞表位(引起临床人体免疫应答的肽段),实现了从序列到临床免疫原性风险的端到端的预测,并通过了数百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)的验证测试。

    v4.2版本

    该版本相比v4.1进一步提升了预测的特异性,且对不同风险水平的表位的区分度更高,结果对于去免疫原性改造更有指导性。

    V4.2版本相对于上个版本v4.1主要有以下改进:

    • 算法架构优化
    • 测试集规模扩大1倍
    • 分类能力F1提升:18%
    • 特异性提升:26%
    • 敏感性提升:4%

    性能测试

    我们从FDA和EMA的临床试验中收集了200余个已知免疫原性的分子及其ADA的分布,计算模型预测值与真实ADA发生率的相关性,以测试预测性能。
    在二分类测试中,将>20% ADA定义为高风险,20%以下定义为低风险。
    image.png

    单抗 mAb

    使用唯信收集整理的166个临床及上市单抗的ADA数据的测试结果如下图所示,0.2分作为单抗的高/低风险的阈值,WeADApt表现出了最好的分类能力,准确率为86%,召回率为88%,富集率(AUC)为0.87,超过了行业知名学术软件IEDB、NetMHCIIpan等。

    image.png

    在EpiVax论文中公开的42个临床抗体分子的数据集上,WeADApt的预测结果与ADA的相关性超过了知名的商业软件EpiMatrix(R^2=0.52 vs R^2=0.42)。

    image.png

    双抗 BsAB

    WeADApt被设计为兼容各类的分子形式,不论是对称还是非对称、是否有重复结构域的任意蛋白分子,仅需输入不重复的链即可(重复链全部输入也会自动处理)。
    对比下图,WeADApt对于双抗的预测分数会相比实际ADA较单抗偏高,因此高风险的阈值建议比单抗相应提高至0.4附近。
    image.png
    本系统仅从序列水平预测产生的影响,因此尤其适合同类靶点和相同MOA分子的相对比较和筛选。

    实用建议

    关于版本选择

    新项目可以优先使用v4.2。对于已经使用过v4.1的项目,如果发现结果差异较大,可以参照已知临床分子的结果(比如阳性对照等),以一致性更高的版本为准。在可接受的情况下,尽量切换到v4.2。

    关于风险阈值

    实际项目中对于高风险阈值的定义,除了按照程序默认的单双抗0.2/0.4的标准之外,也可以以项目的阳性分子作为基准,因为不同靶点或MOA对于绝对值的影响还是蛮大的。

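    WeADApt 4.2 billing rules
    WeADApt 4.2 uses tiered, usage-based billing: the cost is computed in segments according to the number of submitted sequences, as follows:

    • Up to 5 sequences: 5000 compute units per sequence
    • Sequences 6–100: 500 compute units per sequence
    • Sequences beyond 100: 50 compute units per sequence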
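    A short sketch of how the tiered billing rules above add up for one job (5000 units per sequence for the first 5, 500 for sequences 6 to 100, 50 beyond 100). It assumes the tiers are marginal, meaning each sequence is charged at the rate of the tier it falls into; the function name is hypothetical and the platform's own accounting is authoritative.

```python
def weadapt_cost(n_sequences):
    """Total compute units for a job under the tiered rules above."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    first = min(n_sequences, 5)                 # sequences 1-5
    middle = min(max(n_sequences - 5, 0), 95)   # sequences 6-100
    rest = max(n_sequences - 100, 0)            # sequences beyond 100
    return first * 5000 + middle * 500 + rest * 50


for n in (5, 6, 100, 150):
    print(n, weadapt_cost(n))
```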

    Immunogenicity Prediction (WeADApt v4.2)

    Introduction

    WeADApt (Wecomput Anti-Drug Antibody prediction), also known as AlphaMHC, is Wecomput’s next-generation immunogenicity predictor built on a multimodal deep-learning framework.

    image.png

    The platform employs a novel multimodal deep neural network trained on nearly one billion wet-lab records spanning affinity assays, antigen-presentation data, NGS profiles and mass-spectrometry spectra. By fusing orthogonal immunogenic signals, the model functions as a high-throughput in-silico immune-response simulator that accurately forecasts the immunogenic potential of biologics—including proteins, antibodies, peptides and vaccines—and pinpoints clinically relevant T-cell epitopes. The pipeline delivers end-to-end risk prediction directly from sequence and has been validated against hundreds of human immunogenicity datapoints curated by the FDA and EMA, covering both mono- and multi-specific antibodies as well as recombinant proteins.

    v4.2 (Latest release as of 30 July 2025)

    Relative to v4.1, v4.2 delivers markedly higher specificity and sharper resolution between epitopes of differing risk levels, providing clearer guidance for de-immunization campaigns.

    Key improvements over v4.1

    • Algorithm architecture optimization
    • Test-set size doubled
    • F1 score ↑ 18 %
    • Specificity ↑ 26 %
    • Sensitivity ↑ 4 %

    Performance Testing

    We compiled >200 molecules with known clinical immunogenicity profiles and their observed ADA incidence from FDA- and EMA-led trials, then quantified the correlation between predicted and actual ADA rates. In binary classification, an ADA incidence >20 % was defined as high-risk and ≤20 % as low-risk.

    image.png

    Monoclonal Antibodies (mAb)

    Using a Wecomput-curated dataset of 166 clinically tested or marketed mAbs, we set a high-/low-risk threshold of 0.20. WeADApt achieved 86 % accuracy, 88 % recall and an AUC of 0.87—outperforming widely used academic tools such as IEDB and NetMHCIIpan.

    image.png

    On the dataset of 42 clinical antibody molecules published by EpiVax, WeADApt predictions correlated more strongly with observed ADA outcomes than those of the well-known commercial software EpiMatrix (R² = 0.52 vs. R² = 0.42).

    image.png

    Bispecific Antibodies (BsAb)

    WeADApt is designed to be compatible with a wide range of molecular formats, regardless of whether the protein is symmetric or asymmetric, or contains repeated domains. Users only need to input the non-redundant chains (repeated chains will be automatically processed if included).

    As shown in the figure below, WeADApt tends to yield slightly higher prediction scores for bispecific antibodies compared to monoclonal antibodies with similar observed ADA outcomes. Therefore, it is recommended to adjust the high-risk threshold upward to around 0.4 for bispecific molecules.

    image.png

    Practical Recommendations

    Version selection

    New projects are recommended to use version 4.2 by default.
    For ongoing projects that have already used version 4.1, if significant differences in results are observed, users may refer to known clinical molecules (e.g., positive controls) and adopt the version that shows higher consistency.
    Where feasible, switching to version 4.2 is encouraged.

    Risk thresholds

    Beyond the default cutoffs (0.20 for mAbs, ~0.40 for BsAbs), the high-risk threshold for a given project can be calibrated against project-specific positive controls, because target biology and MOA strongly influence absolute risk scores.
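    The threshold logic can be expressed compactly, as in the sketch below. The names are hypothetical, and treating a score strictly above the cutoff as high risk is an assumption; the optional `threshold` argument stands in for a project-specific cutoff derived from a positive control.

```python
DEFAULT_THRESHOLDS = {"mAb": 0.20, "BsAb": 0.40}  # default cutoffs from the text


def classify_risk(score, molecule_type="mAb", threshold=None):
    """Label a WeADApt score as 'high' or 'low' immunogenicity risk."""
    cutoff = DEFAULT_THRESHOLDS[molecule_type] if threshold is None else threshold
    return "high" if score > cutoff else "low"


print(classify_risk(0.30, "mAb"))
print(classify_risk(0.30, "BsAb"))
print(classify_risk(0.30, "mAb", threshold=0.35))
```

    The same score of 0.30 is labeled differently depending on the molecule format or the calibrated cutoff, which is why relative comparison within one target and MOA is the more robust use of the score.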

  • Name: PPI Binding Energy & Contacts
    Description: 基于界面接触特性与非相互作用表面特征预测蛋白-蛋白结合亲和力 Predict protein-protein binding affinity using properties of interfacial contacts and non-interacting surfaces
    Tags: undefined
    Author: Li C Xue
    Release: 2025-04-24 09:39:09
    Reference: Xue LC, Rodrigues JP, Kastritis PL, Bonvin AM, Vangone A. PRODIGY: a web server for predicting the binding affinity of protein-protein complexes. Bioinformatics. 2016 Dec 1;32(23):3676-3678

    PPI Binding Energy & Contacts

    简介

    该模块结合界面接触特征与非相互作用表面(NIS)特征,用于预测蛋白-蛋白结合亲和力,并可输出接触界面的残基信息。模块基于PRODIGY模型,该模型通过线性回归利用界面接触点和NIS的物理化学性质来估算结合亲和力,这些性质已被验证对亲和力具有显著影响。

    以下为亲和力的计算公式:
    image.png

    公式中的 ICsxxx/yyy 表示在相互作用的两个蛋白之间检测到的界面接触点数,xxx/yyy表示接触残基的类型(带电/极性/非极性等),例如 ICscharged/apolar 表示带电残基与非极性残基之间的接触点数量。若两个残基之间任意重原子的距离小于5.5 Å,则视为发生了接触。

    该模型在81个复合物的数据集上进行了验证,预测亲和力与实验值之间的皮尔逊相关系数为0.73(p < 0.0001),均方根误差(RMSE)为1.89 kcal/mol。
    image.png

    参数说明

    Structure

    蛋白复合物的结构文件,格式支持 .pdb 或 .cif。支持多个复合物结构打包进行批量预测,压缩格式支持 .tar、.tar.gz 或 .zip。注意:支持最大结构文件数量为1000

    Group

    用于将结构中的多个链组合为组,组内链作为整体,仅计算组与组之间的结合亲和力。组合格式为:组内链名用逗号分隔,组与组之间用分号分隔。
    示例:H,L;A 表示将链 H 和 L 作为一组,链 A 作为另一组,计算这两组之间的亲和力。

    注意:

    1. 若不设置该参数,则默认对结构中所有发生接触的链对进行亲和力计算。
    2. 在进行抗体-抗原亲和力计算时,应将抗体的重链与轻链合并为一个整体(即为一组),并与抗原链之间计算亲和力。

    Contacts

    输出链间接触界面的残基对信息。

    Output

    预测结果文件名,默认值为 prodigy_output.csv。

    Output_CRP

    接触界面残基对的结果文件名,默认值为 contacts.txt。

    结果说明

    预测结果文件 prodigy_output.csv 包含以下信息:

    列名 说明
    Name 结构名称
    Binding_Affinity (kcal/mol) 预测的结合亲和力,单位为 kcal/mol,值越小越好,负得越多表示结合越强
    Dissociation_Constant (25.0˚C) 根据公式 ΔG = RTlnKd 计算出的25°C下的解离常数
    Intermolecular Contacts 接触残基对总数
    Charged_Charged Contacts 带电残基-带电残基的接触对数
    Charged_Polar Contacts 带电残基-极性残基的接触对数
    Charged_Apolar Contacts 带电残基-非极性残基的接触对数
    Polar_Polar Contacts 极性残基-极性残基的接触对数
    Apolar_Polar Contacts 非极性残基-极性残基的接触对数
    Apolar_Apolar Contacts 非极性残基-非极性残基的接触对数
    Percentage of Apolar NIS 非极性非相互作用表面的百分比
    Percentage of Charged NIS 带电非相互作用表面的百分比

    可选接触界面结果文件 contacts.txt,每行记录一个接触残基对,包含残基名称、编号及所在链名。

    若启用批量模式,在设置contacts参数后,将给出打包文件:

    • contacts.tar.gz:接触残基对结果

    参考文献

    Xue LC, Rodrigues JP, Kastritis PL, Bonvin AM, Vangone A. PRODIGY: a web server for predicting the binding affinity of protein-protein complexes. Bioinformatics. 2016 Dec 1;32(23):3676–3678.

    PPI Binding Energy & Contacts

    Introduction

    This module predicts protein-protein binding affinity by combining interfacial contact features with non-interacting surface characteristics. It also provides residue-level information for the contact interface. The module is based on the PRODIGY model, which applies linear regression using properties of interfacial contacts and non-interacting surfaces (NIS), both of which have been shown to influence binding affinity.

    The binding affinity is calculated using the following formula:
    [Image: binding affinity formula]

    ICs_xxx/yyy is the number of interfacial contacts (ICs) found between the two interacting proteins, where xxx/yyy gives the classes of the two contacting residues (charged, polar, or apolar). For example, ICs_charged/apolar is the number of interfacial contacts between charged and apolar residues. Two residues are considered to be in contact if any of their heavy atoms are within 5.5 Å of each other.
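
    The contact definition above can be sketched in a few lines. This is a minimal illustration, not PRODIGY's own code: the input layout (a list of residue-class and heavy-atom-coordinate pairs per chain) and the function names are assumptions made for the example, and the residue classes are supplied by the caller.

```python
import math
from collections import Counter
from itertools import product

CUTOFF = 5.5  # Å; heavy-atom distance threshold from the text


def in_contact(atoms_a, atoms_b, cutoff=CUTOFF):
    """True if any heavy-atom pair of the two residues is closer than the cutoff."""
    return any(math.dist(a, b) < cutoff for a, b in product(atoms_a, atoms_b))


def count_contacts(chain_1, chain_2):
    """Count residue-residue contacts, keyed by the pair of residue classes.

    Each chain is a list of (residue_class, heavy_atom_coords) tuples, where
    residue_class is 'charged', 'polar' or 'apolar' (assigned by the caller).
    """
    counts = Counter()
    for (cls_1, atoms_1), (cls_2, atoms_2) in product(chain_1, chain_2):
        if in_contact(atoms_1, atoms_2):
            counts["/".join(sorted((cls_1, cls_2)))] += 1
    return counts


# Toy input (hypothetical coordinates in Å): two residues per chain.
chain_a = [("charged", [(0.0, 0.0, 0.0)]), ("apolar", [(20.0, 0.0, 0.0)])]
chain_b = [("apolar", [(3.0, 4.0, 0.0)]), ("polar", [(40.0, 0.0, 0.0)])]
contacts = count_contacts(chain_a, chain_b)
print(dict(contacts))
```

    Only the first residue of each chain lies within 5.5 Å of the other (distance 5.0 Å), so the toy input yields a single charged/apolar contact.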

    The model’s prediction accuracy was evaluated on a dataset of 81 complexes. The Pearson correlation coefficient between predicted and experimental binding affinities is 0.73 (p < 0.0001), with a root-mean-square error (RMSE) of 1.89 kcal/mol.

    Parameter

    Structure

    The protein complex structure in PDB or CIF format. Multiple complex structures can be packaged together for batch prediction. Supported package formats: .tar, .tar.gz, or .zip. The supported maximum number of structures is 1000.

    Group

    Groups multiple chains in the structure. Chains in the same group are treated as a single unit, and binding affinity is calculated only between groups. Define groups by chain ID: separate chains within a group with commas, and separate groups with semicolons.
    Example: H,L;A means chains H and L are treated as one group, and chain A as another group. The binding affinity is then calculated between these two groups.

    Note:

    1. If this parameter is not specified, binding affinity will be calculated for all contacting chain pairs in the complex.
    2. For antibody-antigen binding affinity calculations, the heavy and light chains of the antibody should be grouped together using this parameter to compute affinity with the antigen chain.
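
    As an illustration of the Group syntax, the sketch below parses a string such as H,L;A into a list of chain groups. The function name and the rejection of chains listed twice are assumptions for the example; the module's own validation rules are not documented here.

```python
def parse_groups(spec):
    """Parse a Group string such as 'H,L;A' into [['H', 'L'], ['A']]."""
    groups = []
    for part in spec.split(";"):
        chains = [c.strip() for c in part.split(",") if c.strip()]
        if chains:
            groups.append(chains)
    # Assumption: a chain may belong to one group only.
    flat = [c for group in groups for c in group]
    if len(flat) != len(set(flat)):
        raise ValueError(f"chain listed more than once: {spec!r}")
    return groups


antibody_antigen = parse_groups("H,L;A")
print(antibody_antigen)
```

    For an antibody-antigen complex, the heavy and light chains end up in the first group and the antigen chain in the second.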

    Contacts

    Outputs residue pairs at the inter-chain contact interface.

    Output

    Filename for the binding affinity prediction result. Default: prodigy_output.csv

    Output_CRP

    Filename for the contact interface residue pairs. Default: contacts.txt

    Result

    The binding affinity prediction result is saved in prodigy_output.csv, which includes the following columns:

    Column Name Description
    Name Structure name
    Binding_Affinity (kcal/mol) Predicted binding affinity in kcal/mol. Lower (more negative) values indicate stronger binding.
    Dissociation_Constant (25.0˚C) Dissociation constant at 25°C, calculated using: ΔG = RTlnKd
    Intermolecular Contacts Total number of interfacial residue pairs
    Charged_Charged Contacts Number of contacts between charged residues
    Charged_Polar Contacts Number of contacts between charged and polar residues
    Charged_Apolar Contacts Number of contacts between charged and apolar residues
    Polar_Polar Contacts Number of contacts between polar residues
    Apolar_Polar Contacts Number of contacts between apolar and polar residues
    Apolar_Apolar Contacts Number of contacts between apolar residues
    Percentage of Apolar NIS Percentage of apolar non-interacting surface
    Percentage of Charged NIS Percentage of charged non-interacting surface
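
    The relation ΔG = RT ln Kd used for the Dissociation_Constant column can be checked by hand. The sketch below is a stand-alone illustration of that conversion; the function name and the value used for the gas constant in kcal/(mol·K) are choices made for the example, not taken from the module.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol·K)


def dissociation_constant(delta_g, temp_celsius=25.0):
    """Convert a binding free energy (kcal/mol) to Kd (mol/L) via ΔG = RT ln Kd."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(delta_g / (R_KCAL * temp_kelvin))


kd = dissociation_constant(-10.0)
print(f"Kd = {kd:.2e} M")
```

    At 25 °C, a predicted ΔG of -10 kcal/mol corresponds to a Kd in the tens-of-nanomolar range, and each additional -1.36 kcal/mol or so lowers Kd by about one order of magnitude.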

    The optional contact interface file contacts.txt lists one contacting residue pair per line, including residue names, numbers, and chain IDs.

    In batch mode, when the Contacts parameter is set:

    • Contact interface results are packaged in contacts.tar.gz

    References

    Xue LC, Rodrigues JP, Kastritis PL, Bonvin AM, Vangone A. PRODIGY: a web server for predicting the binding affinity of protein-protein complexes. Bioinformatics. 2016 Dec 1;32(23):3676-3678.

  • Name: Back Mutation Grouping v2.5
    Description: 抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组 Back mutation grouping in the antibody humanization, which groups the back mutations based on the mutation scoring table generated by the Mutation Score module
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-04-03 10:23:26
    Reference:

    Back Mutation Grouping v2.5

    简介

    该模块是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。

    更新内容:

    • 新增参数Combination Max Cutoff,高于该截断值的突变自动进行回复突变。
    • 新增参数Combination Site Cutoff,每条链回复突变打分在Combination Min Cutoff与Combination Max Cutoff之间的选择打分前n个位置进行组合突变

    参数说明

    Grafted Chain

    抗体CDR区嫁接后序列文件,FASTA格式,由Grafting模块生成

    Raw Chain

    抗体序列文件,FASTA格式

    Mutation Score

    人源化突变评分文件,CSV格式,由Mutation Score模块生成

    Output File

    指定输出的突变序列文件名称,FASTA格式

    Cutoff

    打分分组的截断值,逗号分割,例如:2,5,10表示将氨基酸突变评分大于10的为一组,5~10的氨基酸为一组,小于2的氨基酸分为一组。

    Output Policy

    指定输出的回复突变的文件

    Type

    普通抗体Antibody或者纳米抗体Nanobody

    Combination Min Cutoff

    突变组合的截断值,Mutation Score模块中输出的氨基酸回复突变打分大于截断值的氨基酸参与生成突变组合

    Combination Max Cutoff

    高于截断值的突变自动进行回复突变

    Combination Site Cutoff

    每条链回复突变打分在Combination Min Cutoff与Combination Max Cutoff之间的选择打分前n个位置进行组合突变

    结果说明

    根据不同截断值得到突变分组结果文件mutate_policy.json。

    根据组合突变截断值得到的突变分组结果文件combination_mutate_policy.json,用于高通量人源化设计流程。

    Back Mutation Grouping v2.5

    Introduction

    This module is the grouping step of the antibody humanization workflow. It groups back mutations based on the mutation scoring table generated by the Mutation Score module and returns the mutated sequences.

    Parameters

    Grafted Chain

    Sequence file of the antibody CDR region after grafting, in FASTA format, generated by the Grafting module.

    Raw Chain

    Sequence file of the antibody, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Specify the name of the output mutation sequence file, in FASTA format.

    Cutoff

    Cutoff values used to group mutations by score, separated by commas. For example, 2,5,10 places mutations scoring above 10 in one group, those scoring between 5 and 10 in another, and those scoring below 2 in a separate group.
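
    One way to picture score-based grouping is as binning along the score axis. The sketch below is an assumption-laden illustration rather than the module's implementation: it treats the cutoffs as boundaries of contiguous bins (so scores between 2 and 5 get their own bin), puts a score equal to a cutoff in the higher bin, and uses hypothetical mutation labels.

```python
from bisect import bisect_right


def group_by_cutoffs(scores, cutoffs):
    """Bin mutations by score.

    scores:  {mutation_label: score}
    cutoffs: ascending list such as [2, 5, 10]
    Returns len(cutoffs) + 1 lists, from the lowest-scoring bin to the highest.
    """
    groups = [[] for _ in range(len(cutoffs) + 1)]
    for label, score in scores.items():
        groups[bisect_right(cutoffs, score)].append(label)
    return groups


cutoffs = [float(x) for x in "2,5,10".split(",")]
scores = {"A24V": 1.2, "S30T": 3.4, "K66R": 7.5, "L78V": 12.0}  # hypothetical
bins = group_by_cutoffs(scores, cutoffs)
print(bins)
```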

    Output Policy

    Specify the file for the output of back mutations.

    Type

    Antibody or Nanobody

    Combination Min Cutoff

    Cutoff for mutation combinations. Back mutations whose score from the Mutation Score module exceeds this cutoff take part in generating mutation combinations.

    Combination Max Cutoff

    Mutations scoring above this cutoff are automatically back-mutated.

    Combination Site Cutoff

    For each chain, select the top n positions with back mutation scores between the Combination Min Cutoff and the Combination Max Cutoff for combination mutations.
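
    The interplay of the three Combination parameters can be sketched as follows for a single chain. This is a hypothetical reading of the descriptions above, not the module's code: the boundary handling, the enumeration of every subset of the selected sites, and the mutation labels are all assumptions.

```python
from itertools import combinations


def plan_combinations(scores, min_cutoff, max_cutoff, top_n):
    """Split back mutations into always-applied and combinable sets.

    scores: {mutation_label: back mutation score} for one chain.
    """
    # Above Combination Max Cutoff: always back-mutated.
    always = sorted(m for m, s in scores.items() if s > max_cutoff)
    # Between the two cutoffs: keep the top_n highest-scoring sites.
    candidates = sorted(
        (m for m, s in scores.items() if min_cutoff < s <= max_cutoff),
        key=lambda m: scores[m],
        reverse=True,
    )[:top_n]
    # Assumption: every subset of the selected sites is one design.
    combos = [
        always + list(subset)
        for k in range(len(candidates) + 1)
        for subset in combinations(candidates, k)
    ]
    return always, candidates, combos


scores = {"A24V": 1.0, "S30T": 4.0, "K66R": 6.0, "L78V": 12.0, "T91S": 5.0}
always, candidates, combos = plan_combinations(scores, 3, 10, top_n=2)
print(always, candidates, len(combos))
```

    Under these assumptions, n selected sites produce 2^n designs per chain, which is why limiting n keeps the number of sequences manageable.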

    Results

    The mutation grouping results file mutate_policy.json is generated based on different cutoff values.
    The mutation grouping results file combination_mutate_policy.json is generated based on combination cutoff values.

  • Name: Protease (MMP) Cleavage Prediction
    Description: 预测肽段(长度不超过10个氨基酸)被18种基质金属蛋白酶(MMPs)切割的效率及基于指定目标切割谱生成相应的多肽底物。 Predicting the cleavage efficiency of peptides (≤10 amino acids) by 18 matrix metalloproteinases (MMPs) or generating corresponding peptide substrates based on a specified cleavage profile.
    Tags: undefined
    Author: Carmen Martin-Alonso
    Release: 2025-03-26 16:03:42
    Reference: Carmen Martin-Alonso, Sarah Alamdari, Tahoura S. Samad, Kevin K. Yang, Sangeeta N. Bhatia, Ava P. Amini. Deep learning guided design of protease substrates. bioRxiv 2025.02.27.640681

    Protease (MMP) Cleavage Prediction

    简介

    该模块具有两方面的功能:
    1,用于预测肽段(长度不超过10个氨基酸)被18种基质金属蛋白酶(MMPs)切割的效率。
    2,基于指定的目标切割谱(如:仅被MMP13切割),生成相应的多肽底物。

    模块基于CleaveNet模型实现,CleaveNet是一种基于深度学习的蛋白酶底物设计工具,通过整合预测与生成技术,实现了从“虚拟筛选”到“智能设计”的转变。
    CleaveNet包含两个核心模块:
    预测模块

    • 基于Transformer架构,训练于大规模mRNA展示肽段库数据。
    • 针对18种基质金属蛋白酶(MMPs),能够预测肽段被特定蛋白酶切割的效率,测试集Pearson相关系数达0.80,优于传统二分类模型。
    • 模型不仅复现了已知的酶切基序,还发现了新的底物偏好,例如甲硫氨酸在P4位的作用,拓展了对蛋白酶特异性的理解。

    生成模块

    • 采用条件化生成技术,用户可通过条件标签指定目标切割谱(如“对MMP13高活性、对其他MMPs低活性”)。
    • 通过注意力机制调整生成方向,生成的6-mer肽段新颖度达89%,突破了训练数据的局限性。
    • 与传统虚拟筛选相比,生成效率提升约5.5倍,支持复杂设计需求,如“双蛋白酶逻辑门”底物。

    这一端到端的设计流程显著提高了底物设计的效率和精准性,为蛋白酶研究提供了一种全新的计算驱动方法。

    实验验证
    为评估CleaveNet的实际应用能力,研究团队以MMP13(一种与癌症转移、伤口愈合和骨关节炎相关的胶原酶)为目标,设计并合成了95条肽段底物,并通过荧光共振能量转移(FRET)技术验证其切割效率。实验结果表明:

    • 切割效率:所有CleaveNet设计的MMP13底物均能被有效切割,其中一条底物(DL73)的切割效率比训练集中最优底物高出39%(p<0.01)。
    • 特异性:3条底物(如DL41)实现了对MMP13的绝对特异性,不被其他MMPs切割;5条底物(如DL48)同时表现出高活性和高选择性,填补了传统方法的空白。
    • 机制洞察:分析生成序列后,发现了P2位亮氨酸偏好和P3’位天冬氨酸的作用,为MMP13的特异性机制提供了新的研究方向。

    这些结果验证了CleaveNet在设计高效且特异性底物方面的能力,同时也展示了其揭示未知底物偏好的潜力。

    参数说明

    Prediction

    Peptide Sequence

    必填参数,多肽序列,txt或fasta格式,支持多条(txt格式时,每行放置一条多肽,最多支持1000条多肽)。注意:多肽长度不能超过10个残基,超过长度的多肽序列会自动被过滤掉。
    txt格式实例如下:

    LRVFL
    FMPLNFTASG
    LGPYAMTSRG
    AARFKKFATE
    

    Output

    可选参数,预测得到的MMPs酶切概率结果文件名称,默认为“pred_cleavage.csv”。

    Generation

    Number of Peptides

    可选参数,指定需要生成的多肽数量,默认为50。

    Z-score of MMPs

    可选参数,指定多肽生成的酶切条件,CSV文件格式。包含每种MMP酶的酶切概率Z-score值,值越大表示酶切的可能性越高,值可为负,一般阈值为2.5,大于该阈值时,表示极大可能被酶切。模型会根据设置的各种MMPs酶的酶切概率Z-score值进行多肽生成。注意:18种MMPs的Z-score数值都必须设定,不能缺少任意一种。
    文件内容实例如下:

    MMP1,MMP10,MMP11,MMP12,MMP13,MMP14,MMP15,MMP16,MMP17,MMP19,MMP2,MMP20,MMP24,MMP25,MMP3,MMP7,MMP8,MMP9
    2.0,1.2,2.2,2.9,3.3,4.1,2.2,3.1,2.2,1.91,3.6,2.83,2.7,0.2,1.8,2.1,0.61,4.2
    

    以上内容为一组条件,也支持多组条件同时输入,每行一组条件即可。每组条件都会生成指定数量的多肽。多组条件示例如下:

    MMP1,MMP10,MMP11,MMP12,MMP13,MMP14,MMP15,MMP16,MMP17,MMP19,MMP2,MMP20,MMP24,MMP25,MMP3,MMP7,MMP8,MMP9
    2.0,1.2,2.2,2.9,3.3,4.1,2.2,3.1,2.2,1.91,3.6,2.83,2.7,0.2,1.8,2.1,0.61,4.2
    3.33,2.5,3.6,2.7,2.9,5.2,3.4,4.2,2.7,2.0,3.5,2.6,4.0,3.1,3.5,0.61,2.1,2.9
    

    Temperature

    可选参数,指定生成的温度条件,用于控制生成多肽序列的多样性,默认为1.0,越大表示多样性越高。如果希望多样性低一些,推荐0.7,如果希望多样性再高一些,推荐1.2~1.5。

    Output

    可选参数,指定序列输出文件名称,fasta或txt格式,默认为“gen_seqs.fasta”。

    结果说明

    Prediction

    预测得到的MMPs酶切概率结果文件,默认为pred_cleavage.csv。包含如下内容:

    字段名称 说明
    SEQ 多肽序列
    MMP1,MMP2,MMP3,… 各种MMPs蛋白酶对多肽酶切能力强弱的Z-score数值,数值越大表示酶切的可能性越高,目前的阈值为2.5,大于该阈值时,表示极大可能被酶切。

    Generation

    生成的序列文件,默认为“gen_seqs.fasta”。

    参考文献

    • Deep learning guided design of protease substrates. Carmen Martin-Alonso, Sarah Alamdari, Tahoura S. Samad, Kevin K. Yang, Sangeeta N. Bhatia, Ava P. Amini bioRxiv 2025.02.27.640681; DOI: 10.1101/2025.02.27.640681

    Protease (MMP) Cleavage Prediction

    Introduction

    This module has two functions:
    1. Predicting the cleavage efficiency of peptides (≤10 amino acids) by 18 matrix metalloproteinases (MMPs).
    2. Generating corresponding peptide substrates based on a specified cleavage profile (e.g., only cleaved by MMP13).

    Built on the CleaveNet model, a deep-learning-based protease substrate design tool, it integrates prediction and generation, shifting from “virtual screening” to “intelligent design”.
    CleaveNet has two core modules:
    Prediction Module

    • Trained on a large-scale mRNA-displayed peptide library using a Transformer architecture.
    • Predicts peptide cleavage efficiency by 18 MMPs, with a test-set Pearson correlation of 0.80, outperforming traditional binary-classification models.
    • Reproduces known cleavage motifs and reveals new substrate preferences (e.g., methionine at P4), enhancing understanding of protease specificity.

    Generation Module

    • Uses conditional generation. Users can set target cleavage profiles (e.g., “high MMP13 activity, low other MMP activities”) via conditional tags.
    • Adjusts generation direction with attention mechanisms. Generated 6-mer peptides have 89% novelty, surpassing training data limits.
    • Is about 5.5 times more efficient than traditional virtual screening, supporting complex designs like “dual-protease logic gate” substrates.

    This end-to-end design process improves substrate design efficiency and accuracy, offering a new computation-driven method for protease research.

    Experimental Validation
    To assess CleaveNet’s practicality, the team targeted MMP13 (a collagenase linked to cancer metastasis, wound healing, and osteoarthritis). They designed and synthesized 95 peptide substrates, validating cleavage efficiency via fluorescence resonance energy transfer (FRET). Results showed:

    • All CleaveNet-designed MMP13 substrates were efficiently cleaved. One (DL73) had 39% higher efficiency than the best training-set substrate (p<0.01).
    • Three substrates (e.g., DL41) were absolutely specific to MMP13, and five (e.g., DL48) had both high activity and selectivity, addressing traditional method gaps.
    • Analysis of generated sequences revealed leucine preference at P2 and aspartic acid’s role at P3’, offering new insights into MMP13’s specificity mechanism.

    These results confirm CleaveNet’s ability to design efficient, specific substrates and its potential to uncover unknown substrate preferences.

    Parameters

    Prediction

    Peptide Sequence

    Required. Peptide sequences in txt or fasta format; multiple sequences are supported (in txt format, place one peptide per line; up to 1,000 peptides). Note: a peptide cannot exceed 10 residues; longer sequences are automatically filtered out.
    An example in txt format is as follows:

    LRVFL
    FMPLNFTASG
    LGPYAMTSRG
    AARFKKFATE
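
    Before submitting a txt file, the stated limits (at most 10 residues per peptide, up to 1,000 peptides) can be checked locally. The sketch below is an illustrative pre-check with hypothetical function and constant names; dropping over-length peptides mirrors the behavior stated in the Chinese description of this parameter.

```python
MAX_LEN = 10         # residues per peptide
MAX_PEPTIDES = 1000  # peptides per submission


def load_peptides(lines):
    """Return (kept, dropped): peptides within MAX_LEN and those exceeding it."""
    peptides = [line.strip().upper() for line in lines if line.strip()]
    kept = [p for p in peptides if len(p) <= MAX_LEN]
    dropped = [p for p in peptides if len(p) > MAX_LEN]
    if len(kept) > MAX_PEPTIDES:
        raise ValueError(f"{len(kept)} peptides exceed the limit of {MAX_PEPTIDES}")
    return kept, dropped


lines = ["LRVFL", "FMPLNFTASG", "", "LGPYAMTSRGAA", "AARFKKFATE"]
kept, dropped = load_peptides(lines)
print(kept, dropped)
```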
    

    Output

    Optional parameter, the file name of the predicted MMP cleavage probability results. Default: “pred_cleavage.csv”.

    Generation

    Number of Peptides

    Optional parameter, specify the number of peptides to be generated, default is 50.

    Z-score of MMPs

    Optional. A CSV file specifying the cleavage conditions for peptide generation. It contains a cleavage Z-score for each MMP: a higher value indicates a higher likelihood of cleavage, and values can be negative. The general threshold is 2.5; a value above it indicates a very high probability of cleavage. The model generates peptides according to the Z-scores set for the MMPs. Note: Z-scores must be set for all 18 MMPs; none can be missing.

    An example of the file content is as follows:

    MMP1,MMP10,MMP11,MMP12,MMP13,MMP14,MMP15,MMP16,MMP17,MMP19,MMP2,MMP20,MMP24,MMP25,MMP3,MMP7,MMP8,MMP9
    2.0,1.2,2.2,2.9,3.3,4.1,2.2,3.1,2.2,1.91,3.6,2.83,2.7,0.2,1.8,2.1,0.61,4.2
    

    The example above defines one set of conditions. Multiple sets can be supplied at once, one set per line, and the specified number of peptides is generated for each set. An example with multiple sets of conditions:

    MMP1,MMP10,MMP11,MMP12,MMP13,MMP14,MMP15,MMP16,MMP17,MMP19,MMP2,MMP20,MMP24,MMP25,MMP3,MMP7,MMP8,MMP9
    2.0,1.2,2.2,2.9,3.3,4.1,2.2,3.1,2.2,1.91,3.6,2.83,2.7,0.2,1.8,2.1,0.61,4.2
    3.33,2.5,3.6,2.7,2.9,5.2,3.4,4.2,2.7,2.0,3.5,2.6,4.0,3.1,3.5,0.61,2.1,2.9
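
    A conditions file of this shape can be assembled programmatically. The sketch below is one possible helper, with hypothetical names and illustrative Z-scores; it takes the column order from the example header above and refuses any condition that omits one of the 18 MMPs.

```python
import csv
import io

# Column order copied from the example header.
MMPS = ("MMP1,MMP10,MMP11,MMP12,MMP13,MMP14,MMP15,MMP16,MMP17,"
        "MMP19,MMP2,MMP20,MMP24,MMP25,MMP3,MMP7,MMP8,MMP9").split(",")


def build_conditions(conditions):
    """Render a list of {MMP name: z-score} dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(MMPS)
    for condition in conditions:
        missing = [m for m in MMPS if m not in condition]
        if missing:
            raise ValueError(f"missing Z-scores for: {', '.join(missing)}")
        writer.writerow([condition[m] for m in MMPS])
    return buffer.getvalue()


# Illustrative profile: high for MMP13, low for every other MMP.
selective = {mmp: -1.0 for mmp in MMPS}
selective["MMP13"] = 3.0
csv_text = build_conditions([selective])
print(csv_text)
```

    Writing the returned text to a file gives an input with one header row and one row per set of conditions.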
    

    Temperature

    Optional parameter, specify the temperature condition for controlling the diversity of the generated peptide sequences. The default value is 1.0. A higher value indicates higher diversity. If you want lower diversity, it is recommended to use 0.7. If you want higher diversity, it is recommended to use a value between 1.2 and 1.5.
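
    The effect of temperature on diversity can be seen with a small numerical example. The sketch below shows standard temperature scaling of a softmax distribution; how CleaveNet applies temperature internally is not described here, so this is a generic illustration with made-up logits.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate residues
for temperature in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

    Lower temperatures concentrate probability on the top-scoring option, so sampled sequences are more alike; higher temperatures flatten the distribution and increase diversity.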

    Output

    Optional parameter, specify the output file name for the sequences in fasta or txt format. The default is “gen_seqs.fasta”.

    Results

    Prediction

    The predicted MMPs cleavage probability results file, default is pred_cleavage.csv. It contains the following content:

    Field Name Description
    SEQ Peptide sequence
    MMP1, MMP2, MMP3, … Z-scores for cleavage of the peptide by each MMP. A higher value indicates a higher likelihood of cleavage. The current threshold is 2.5; a value above it indicates a very high probability of cleavage.
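
    To read the prediction table against the 2.5 threshold, something like the following can be used. It is a minimal sketch: the sample rows are invented for illustration and show only three of the MMP columns, and the function name is hypothetical.

```python
import csv
import io

THRESHOLD = 2.5  # Z-score threshold from the text

# Illustrative excerpt; a real pred_cleavage.csv has one column per MMP.
sample = """SEQ,MMP1,MMP13,MMP9
LRVFL,0.4,3.1,2.6
FMPLNFTASG,-0.2,1.9,0.3
"""


def likely_cleaving_mmps(csv_text, threshold=THRESHOLD):
    """Map each peptide to the MMPs whose Z-score exceeds the threshold."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        seq = row.pop("SEQ")
        result[seq] = [mmp for mmp, z in row.items() if float(z) > threshold]
    return result


hits = likely_cleaving_mmps(sample)
print(hits)
```

    A peptide with exactly one MMP above the threshold is a candidate selective substrate; one with none is unlikely to be cleaved by the enzymes covered.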

    Generation

    The generated sequence file, default is “gen_seqs.fasta”.

    References

    • Deep learning guided design of protease substrates. Carmen Martin-Alonso, Sarah Alamdari, Tahoura S. Samad, Kevin K. Yang, Sangeeta N. Bhatia, Ava P. Amini bioRxiv 2025.02.27.640681; DOI: 10.1101/2025.02.27.640681
  • Name: Computing Electrostatic Surfaces
    Description: 分析蛋白质表面的静电区域(正电、负电区域)的大小和分布 Analyze the electrostatic patches of protein surfaces.
    Tags: undefined
    Author: Valentin J Hoerschinger
    Release: 2025-03-19 15:15:14
    Reference: Hoerschinger VJ, Waibl F, Pomarici ND, Loeffler JR, Deane CM, Georges G, Kettenberger H, Fernández-Quintero ML, Liedl KR. PEP-Patch: Electrostatics in Protein-Protein Recognition, Specificity, and Antibody Developability. J Chem Inf Model. 2023 Nov 27;63(22):6964-6971.

    Computing Electrostatic Surfaces

    简介

    该模块用于分析和可视化蛋白质表面的静电特性,这对分子识别、蛋白质溶解性、粘度和抗体的可开发性等过程至关重要。它主要通过定义“Patch”来识别和量化蛋白质表面的静电势,这些Patch是具有统一正或负电势值的连接区域。
    主要功能和特点:

    • 静电势计算:
      该工具使用APBS(自适应泊松-玻尔兹曼求解器)来计算静电势。此外,它还可以接受用户提供的势图或基于疏水性尺度的映射。
    • 分子表面生成:
      工具生成分子表面,并将计算的静电势映射到该表面。然后,可以通过颜色编码来可视化该表面,以指示正负区域。
    • Patch识别:
      识别和量化蛋白质表面上不同的正电和负电静电Patch,这对于理解蛋白质-蛋白质相互作用和抗体开发非常重要。

    参数说明

    Structure PDB

    蛋白结构文件,PDB格式。

    Surface Type

    分子表面的类型:sas或者ses。以下是两个选项的解释:

    • 溶剂可及表面(SAS,Solvent-Accessible Surface):SAS 是溶剂探针(通常是水分子)在分子表面滚动时,其中心轨迹形成的表面。
    • 溶剂排除表面(SES,Solvent-Excluded Surface):SES 是溶剂探针围绕分子滚动时,其最靠近分子的外部轮廓所形成的表面。

    Probe Radius

    探针半径,单位为纳米(默认:0.14)。

    Size Cutoff

    Patch面积(area )阈值,单位为Ų。如果 Size Cutoff = 0,则不过滤任何 patch,即所有 patch 都会被保留。

    pH Value

    pH 值。

    Output Patch

    输出Patch文件名称

    结果说明

    输出结果包括:

    输出文件名称 说明
    patches.csv 识别出的蛋白质表面静电Patch的信息。
    apbs.pqr APBS计算静电势的输入文件。PQR文件类似于PDB文件,但包含了每个原子的电荷和半径信息。
    apbs.pqr.dx 通过APBS计算得到的静电势分布数据。DX文件是网格格式,描述了蛋白质周围空间的静电势值。
    apbs.pdb APBS计算静电势的PDB文件

    其中patches.csv包括信息如下:

    字段名称 说明
    nr 代表Patch的编号。这是每个识别出的静电Patch的唯一标识符,用于区分不同的Patch。
    type 表示Patch的类型,通常为“positive”或“negative”,指示Patch的电荷性质是正电还是负电。
    npoints Patch中包含的表面点的数量。这些点构成了Patch在蛋白质表面上的区域。
    area Patch的面积,单位为Ų。这表示Patch在蛋白质表面上覆盖的物理面积。
    value Patch的总静电势值,通常为Patch内所有点的静电势值的总和或平均值。这反映了Patch的整体静电强度。
    residue Patch中的氨基酸残基,通常是Patch所在区域的一个代表性残基。这个残基可能是Patch中电荷最集中的位置或最显著的氨基酸。其他的氨基酸编号与apbs.pdb对应。

    参考文献

    • Hoerschinger VJ, Waibl F, Pomarici ND, Loeffler JR, Deane CM, Georges G, Kettenberger H, Fernández-Quintero ML, Liedl KR. PEP-Patch: Electrostatics in Protein-Protein Recognition, Specificity, and Antibody Developability. J Chem Inf Model. 2023 Nov 27;63(22):6964-6971. DOI: 10.1021/acs.jcim.3c01490

    Computing Electrostatic Surfaces

    Introduction

    This module is designed for analyzing and visualizing the electrostatic properties of protein surfaces, which are critical for processes such as molecular recognition, protein solubility, viscosity, and antibody developability. It primarily identifies and quantifies the electrostatic potential on protein surfaces by defining “patches,” which are connected regions with uniform positive or negative potential values.
    Key Features:

    • Electrostatic Potential Calculation:
      This tool uses APBS (Adaptive Poisson-Boltzmann Solver) to compute electrostatic potentials. Additionally, it can accept user-provided potential maps or mappings based on hydrophobicity scales.

    • Molecular Surface Generation:
      The tool generates molecular surfaces and maps the calculated electrostatic potentials onto these surfaces. The surface can then be visualized using color coding to indicate positive and negative regions.

    • Patch Identification:
      It identifies and quantifies different positive and negative electrostatic patches on the protein surface, which are crucial for understanding protein-protein interactions and antibody development.

    Parameter

    Structure PDB

    The protein structure file in PDB format.

    Surface Type

    The type of molecular surface: SAS or SES. Below are explanations for the two options:

    • Solvent-Accessible Surface (SAS): SAS represents the surface formed by the center trajectory of a solvent probe (usually a water molecule) rolling over the molecular surface.
    • Solvent-Excluded Surface (SES): SES is the surface traced by the part of the solvent probe closest to the molecule as the probe rolls around it.

    Probe Radius

    The radius of the probe, measured in nanometers (default: 0.14).

    Size Cutoff

    Patch area threshold (area), measured in Ų. If Size Cutoff = 0, no patch will be filtered, meaning all patches will be retained.

    pH Value

    The pH value.

    Output Patch

    The name of the output file for identified patches.

    Result

    The output includes the following files:

    File Name Description
    patches.csv Information about the identified electrostatic patches on the protein surface.
    apbs.pqr Input file for APBS electrostatic potential calculations. PQR files are similar to PDB files but include charge and radius information for each atom.
    apbs.pqr.dx Electrostatic potential distribution data calculated by APBS. DX files are grid-format files describing the electrostatic potential values in the space surrounding the protein.
    apbs.pdb PDB file with electrostatic potential information calculated by APBS.

    The patches.csv file includes the following information:

    Field Name Description
    nr Patch number. This is a unique identifier for each identified electrostatic patch.
    type Patch type, typically “positive” or “negative,” indicating whether the patch is positively or negatively charged.
    npoints The number of surface points in the patch, which defines the region of the patch on the protein surface.
    area The area of the patch in Ų, representing the physical coverage of the patch on the protein surface.
    value The total electrostatic potential value of the patch, usually the sum or average of all potential values within the patch. This indicates the overall electrostatic intensity of the patch.
    residue Representative amino acid residue within the patch, typically the residue with the highest charge concentration or the most prominent residue in the patch. Other residue numbers correspond to the apbs.pdb file.
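
    A quick way to summarize patches.csv is to total the patch area by type, optionally applying an area cutoff in the same spirit as Size Cutoff. The sketch below uses the column names listed above; the sample rows, the residue label format, and the function name are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows; values and residue labels are made up.
sample = """nr,type,npoints,area,value,residue
0,positive,120,310.5,420.1,ARG45
1,negative,80,150.0,-200.3,GLU12
2,positive,10,18.2,9.7,LYS3
"""


def area_by_type(csv_text, size_cutoff=0.0):
    """Sum patch area (Å²) per patch type, skipping patches below size_cutoff."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        area = float(row["area"])
        if area >= size_cutoff:
            totals[row["type"]] += area
    return dict(totals)


print(area_by_type(sample))
print(area_by_type(sample, size_cutoff=50.0))
```

    With a cutoff of 0 every patch is counted; raising the cutoff removes the small positive patch from the total.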

    References

    • Hoerschinger VJ, Waibl F, Pomarici ND, Loeffler JR, Deane CM, Georges G, Kettenberger H, Fernández-Quintero ML, Liedl KR. PEP-Patch: Electrostatics in Protein-Protein Recognition, Specificity, and Antibody Developability. J Chem Inf Model. 2023 Nov 27;63(22):6964-6971. DOI: 10.1021/acs.jcim.3c01490
  • Name: Patch Analysis v2
    Description: 分析蛋白质表面的Patch(正电、负电、疏水残基富集区域)的大小和分布,用于解决蛋白质的聚集等问题。一般建议通过WeView三维结构可视化编辑器来使用该功能,可以在三维结构中直观地查看patch的位置。 Calculate patches (positively charged, negatively charged, or hydrophobic regions) on the protein surface to address protein aggregation issues. It is recommended to use in the WeView, as it allows for a visual inspection of the patch locations within the 3D structure.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-04-14 15:01:18
    Reference:

    Patch Analysis v2

    简介

    该模块计算蛋白质表面静电和疏水作用相对富集的区域,用于显示出在蛋白质-蛋白质相互作用中起重要作用的区域,这对于预测基于非共价弱相互作用的可逆的聚集现象尤其有用。尤其是,疏水相互作用长期被认为是大分子间吸引相互作用的主要组成部分。对于抗体,静电相互作用牵涉到了自聚集,而偶极-偶极相互作用被认为是导致β-折叠的纤维化的原因。同时,也可以通过WeView界面对蛋白结构进行Patch分析。

    V2 更新内容

    • 优化原子参数,提高计算准确性。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式

    Hydrophobic Cutoff

    Hydrophobic cutoff是一个以疏水性氨基酸(通常包括Leu,Ile,Val,Phe,Trp和Met)为基础定义的截断值,用于将表面上疏水性氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,Patch区域中获得的化学性质信息会根据其表面密度和具有高疏水性氨基酸的数量而有所变化。

    Positive Cutoff

    Positive Cutoff是一个以阳离子氨基酸为基础定义的截断值,用于将表面上阳离子氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,positive cutoff方式用于筛选出可能参与离子相互作用的蛋白质表面区域。

    Negative Cutoff

    Negative Cutoff是一个以阴离子氨基酸为基础定义的截断值,用于将表面上阴离子氨基酸的数量与表面面积相比较,从而筛选出可能具有重要生物学功能的区域。一般来说,negative cutoff方式用于筛选出可能参与离子相互作用的蛋白质表面区域。

    SASA Cutoff

    SASA Cutoff是一个以溶剂可及表面积为基础定义的截断值,低于截断值的patch会被过滤掉。

    Distance Cutoff

    Distance Cutoff是原子距离截断值,低于截断值的才会认为属于同一聚集块。值越小,聚集块patch越小。

    Min Distance Cutoff

    Min Distance Cutoff是patch之间的距离截断值,距离小于截断值的归为同一个patch。

    Result Type

    输出文件格式,csv或者json
    通俗地讲,cutoff代表静电势能或疏水势能的强度阈值,单位是kcal/mol,超过阈值才会被计入面积。阈值越小,则patch越多。

    Keep Original

    不添加缺失原子(包括氢原子)和结构优化。

    Neutral N-terminus

    使N端的蛋白残基中性化。

    Neutral C-terminus

    使C端的蛋白残基中性化。

    结果说明

    输出结果包括:

    输出文件名称 说明
    patch_list.csv Patch结果的csv文件。主要关注Area(Å^2)数值,代表patch的大小,越大则越可疑,重点关注100 Ų以上的patch。
    input_prot.pdb 质子化后的pdb结构。
    patch_list_sum.csv 统计了三种patch类型(Hyd:疏水中心,Neg:负电中心,Pos:正电中心)在蛋白表面所占面积,重点关注100 Ų以上的patch。

    其中patch_list.csv,包含信息如下:

    字段名称 说明
    Type Patch的类型,Hyd:疏水中心,Neg:负电中心,Pos:正电中心
    Area(Å^2) 每个Patch的蛋白质表面区域面积
    Residues 每个Patch的对应的残基

    其中patch_list_sum.csv,包含信息如下:

    字段名称 说明
    Type Patch的类型,Hyd:疏水中心,Neg:负电中心,Pos:正电中心
    Total Areas Patch的蛋白质表面区域总面积
    Areas of The Largest Patch的蛋白质表面区域最大面积
    Number of Areas More Than 100 面积超过100 Ų的patch的数目

    参考文献

    • Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348. DOI: 10.1142/p477
    • Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514. DOI: 10.1007/BF00124359
    • Wildman, S.A., Crippen, G.M.; Prediction of Physiochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873. DOI: 10.1021/ci990307l

    Patch Analysis v2

    Introduction

    Protein Patches calculates both electrostatic (excess charge) and hydrophobic surface patches to show regions of significance with respect to protein-protein interactions. This can be particularly useful in the prediction of reversible aggregation, which typically arises from relatively weak non-covalent interactions. In particular, hydrophobic interactions have long been recognized as major contributors in high affinity interactions between macromolecules. In antibodies, electrostatic interactions have been implicated in forming self-associated aggregates [Karshikoff 2006], while dipole-dipole interactions are believed to be the cause of fibrillogenic association of β-sheets. At the same time, protein structures can also be analyzed for patches through the WeView interface.
    Electrostatic patches.
    The surface electrostatic field is estimated using an exponentially decaying Debye-Hückel field with a screening length of λD=3.5Å.
    The map thus obtained is one mostly of excess charge close to the molecular surface.
    Significant patches are established by cutting the surface along isocontour lines of absolute field value equal to 40 kcal/mol/C, keeping regions above. Finally, a default minimal patch area of 40 Ų filters out smaller, presumably less relevant, patches.
    Hydrophobicity map.
    The hydrophobic potential is calculated from the Wildman and Crippen octanol-water partition coefficients f=log P [Wildman 1999]:

    where fi is the coefficient of atom i and g(ri) is a Fermi-type distance-dependent weighting function proposed by Heiden et al. [Heiden 1993]:

    with rcut=5Å and α=1.5.
    Similarly to electrostatic (excess charge) patches, significant hydrophobic patches are established by cutting the surface along isocontour lines, retaining only the portion above a potential threshold value of 0.09 kcal/mol, and filtering for a minimal patch area of 50 Ų.
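
    To give a feel for a Fermi-type distance weighting, the sketch below uses a generic Fermi function with the quoted parameters rcut = 5 Å and α = 1.5 and sums weighted atomic coefficients at a surface point. The exact expression of Heiden et al. is not reproduced in this text and may include a normalization, so the functional form, the function names, and the toy atoms are all assumptions.

```python
import math

R_CUT = 5.0  # Å, from the text
ALPHA = 1.5  # 1/Å, from the text


def fermi_weight(r, r_cut=R_CUT, alpha=ALPHA):
    """Generic Fermi function: close to 1 well inside r_cut, close to 0 beyond it."""
    return 1.0 / (1.0 + math.exp(alpha * (r - r_cut)))


def hydrophobic_potential(point, atoms):
    """Weighted sum of atomic coefficients f_i at a surface point.

    atoms: iterable of (f_i, (x, y, z)) pairs.
    """
    return sum(f * fermi_weight(math.dist(point, xyz)) for f, xyz in atoms)


# Toy atoms with made-up coefficients: one nearby, one far away.
atoms = [(0.5, (2.0, 0.0, 0.0)), (-0.3, (12.0, 0.0, 0.0))]
potential = hydrophobic_potential((0.0, 0.0, 0.0), atoms)
print(round(potential, 4))
```

    The nearby atom contributes almost its full coefficient while the distant one is effectively switched off, which is the role the weighting function plays in the potential.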

    V2 updates

    • Optimized atomic parameters to improve calculation accuracy.

    Parameters

    Structure PDB File

    Protein structure file in PDB format.

    Hydrophobic Cutoff

    Hydrophobic Cutoff is a cutoff value defined on the basis of hydrophobic amino acids (usually Leu, Ile, Val, Phe, Trp, and Met). It is used to compare the number of hydrophobic amino acids on the surface with the surface area, so as to screen out regions that may have important biological functions. In general, the chemical property information obtained for a patch region varies with its surface density and the number of highly hydrophobic amino acids.

    Positive Cutoff

    Positive Cutoff is a cut-off value defined based on cationic amino acids, which is used to compare the number of cationic amino acids on the surface with the surface area to screen out areas that may have important biological functions. Generally speaking, the positive cutoff method is used to screen protein surface regions that may be involved in ion interactions.

    Negative Cutoff

    Negative Cutoff is a value defined based on anionic amino acids, which is used to compare the number of anionic amino acids on a surface with the surface area, to screen out areas that may have important biological functions. Generally speaking, the negative cutoff method is used to screen the surface regions of proteins that may participate in ion interactions.

    SASA Cutoff

    SASA Cutoff is a cutoff value defined on the basis of solvent-accessible surface area (SASA). Patches with an area below the cutoff are filtered out.

    Distance Cutoff

    Distance Cutoff is the interatomic distance cutoff: atoms closer than the cutoff are assigned to the same patch. Lower values result in smaller patches.

    Min Distance Cutoff

    Min Distance Cutoff is the cutoff for the distance between neighboring patches (Å). Patches closer than the cutoff are merged into one patch.

    Result Type

    Output file format: csv or json.

    Keep Original

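    Do not add missing atoms (including hydrogens) or optimize the structure.

    Neutral N-terminus

    Neutralize the N-terminal residue of the protein.

    Neutral C-terminus

    Neutralize the C-terminal residue of the protein.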

    Results

    The output includes:

    Output File Name Description
    patch_list.csv A CSV file containing patch results. The main focus is on the Area (Å^2) value, which represents the size of the patch. Larger patches are considered more suspicious, with particular attention to patches larger than 100 Ų.
    input_prot.pdb The protonated PDB structure.
    patch_list_sum.csv Summarizes the surface area occupied by three types of patches (Hyd: hydrophobic center, Neg: negative charge center, Pos: positive charge center) on the protein surface. Focus is placed on patches larger than 100 Ų.

    The patch_list.csv file contains the following information:

    Field Name Description
    Type The type of patch: Hyd (hydrophobic center), Neg (negative charge center), Pos (positive charge center).
    Area (Å^2) The surface area of each patch on the protein.
    Residues The residues corresponding to each patch.

    The patch_list_sum.csv file contains the following information:

    | Field Name | Description |
    | --- | --- |
    | Type | The type of patch: Hyd (hydrophobic center), Neg (negative charge center), Pos (positive charge center). |
    | Total Areas | The total surface area of patches of this type on the protein. |
    | Areas of The Largest | The surface area of the largest patch of this type. |
    | Number of Areas More Than 100 | The number of patches with an area larger than 100 Å^2. |
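
    The three summary fields can be derived directly from patch_list.csv. The following Python sketch illustrates one way to do so. It assumes the column headers Type and Area (Å^2) appear exactly as listed above; the function name, the sample rows, and the residue notation in them are hypothetical.

```python
import csv
import io
from collections import defaultdict

AREA_COLUMN = "Area (Å^2)"   # column name as documented above (assumed exact)
LARGE_PATCH_AREA = 100.0      # Å^2, threshold used in patch_list_sum.csv


def summarize_patches(csv_text):
    """Return {patch type: (total area, largest area, count above threshold)}."""
    areas_by_type = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        areas_by_type[row["Type"]].append(float(row[AREA_COLUMN]))
    return {
        patch_type: (
            sum(areas),
            max(areas),
            sum(1 for area in areas if area > LARGE_PATCH_AREA),
        )
        for patch_type, areas in areas_by_type.items()
    }


# Hypothetical rows, used only to show the expected layout.
example = """Type,Area (Å^2),Residues
Hyd,120.5,LEU12;VAL15
Hyd,60.0,ILE40
Neg,101.0,ASP7;GLU9
"""
summary = summarize_patches(example)
```

    Each entry of summary corresponds to one row of patch_list_sum.csv: total area, largest area, and the number of patches above 100 Å^2 for that patch type.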

    References

    • Karshikoff, A.; Non-Covalent Interactions In Proteins; Imperial College Press (2006) pp. 348. DOI: 10.1142/p477
    • Heiden, W., Moeckel, G., Brickmann, J.; A New Approach to Analysis and Display of Local Lipophilicity/Hydrophilicity Mapped on Molecular Surfaces; J. Comput. Aided Mol. Des. 7 (1993) 503–514. DOI: 10.1007/BF00124359
    • Wildman, S.A., Crippen, G.M.; Prediction of Physiochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868–873. DOI: 10.1021/ci990307l
  • Name: Molecular Docking (AutoDock-GPU v2)
    Description: 基于AutoDock的GPU加速的小分子对接工具。建议通过WeView三维结构可视化编辑器来使用该功能,具体为WeView-> Dock。 AutoDock-GPU-based small molecule docking tool. It is recommended to use in the WeView: WeView-> Dock.
    Tags: undefined
    Author: Forli lab
    Release: 2022-06-08 16:00:00
    Reference: Santos-Martins D, Solis-Vasquez L, Tillack AF, et al., Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. J Chem Theory Comput. 2021 Feb 9;17(2):1060-1073. doi: 10.1021/acs.jctc.0c01006.

    Molecular Docking (AutoDock-GPU v2)

    简介

    该模块是一种分子对接模拟工具,主要用于预测分子之间的结合模式和相互作用,得到分子对接的能量和结合亲和力等信息。它还可以计算和比较多个分子之间的结合能力,用于药物分子的筛选、设计和优化。AutoDock-GPU是AutoDock 4.2.6的OpenCL和CUDA加速版本,利用可并行的LGA(拉马克遗传算法),在多个计算单元上并行处理配体-受体结合构象。

    参数说明

    支持自行上传小分子文件(Private Ligand Library)或者选择公共分子虚筛库(Public Ligand Library)。

    Private Ligand Library (Comp<100)

    Binding Mode

    对接模式为刚性配体对接(rigid)或者柔性配体对接(flex),
    刚性配体对接:配体自身保持刚性,经平移、旋转,在口袋内寻找合适的结合取向。
    柔性配体对接:配体在固定某些非关键部位的键长、键角的前提下允许其构象发生一定程度的变化。

    Receptor

    受体结构文件,PDB格式。要求受体原子数目不超过32768个。

    Private Ligand

    配体结构文件,支持SDF、PDB、MOL格式。只会计算前100的分子。

    Box Center

    对接口袋中心的三维坐标(XYZ),空格分割。例如:0 0 0。

    Box Size

    对接口袋长方体盒子的大小,必须是整数,空格分割,例如 24 22 32。

    Number of Poses

    每个分子保留的最大结合模式数量

    TopN

    虚拟筛选中保留打分排名前n个分子。

    Unbound Model

    未结合状态模型选择:

    • bound:适用于已知结合模式的精确优化,假设配体初始构象接近结合状态。
    • extended:适用于探索结合模式的中等灵活配体,从自由分子状态开始搜索。
    • compact:适用于高度灵活或折叠配体,提供最大范围的结合模式探索,但计算成本最高所需时间最长。

    Keep Heterogens

    保留非标准氨基酸,格式为[链名]:[残基名称]-[残基编号],如A:UNL-311。不能包含特殊离子的小分子结构。

    Private Ligand Library (Comp<10,000)

    Private Ligand

    配体结构文件,支持SDF、PDB、MOL格式。只会计算前10,000的分子。
    其余参数与Private Ligand Library (Comp<100)模式一致。

    Public Ligand Library模式

    Public Ligand

    提供17个公共分子虚筛库用于分子对接,包括:

    1. Alinda :~77万库存分子,源自中国香港的Alinda Chemical公司,致力于分子砌块和新颖筛选化合物的研发供应。
    2. Analyticon :~4万库存分子,源自德国的天然产物品牌,专注天然产物提取及类似物合成工作,产品质量稳定。
    3. Asinex :~57万库存分子,源自美国的品牌,多年来致力于类先导化合物及分子砌块的研发供应,价格较贵。
    4. Bionet :~30万库存分子,源自英国的品牌,拥有多年的有机合成经验。
    5. Chembridge :~137万库存分子,源自美国的化合物品牌,总部位于圣地亚哥,拥有多样性库、大环库等多种热门化合物库。
    6. Chemdiv :~156万库存分子,全球最大的化合物品牌之一,拥有5000多种化合物骨架结构和100多种化合物库,性价比高。
    7. Enamine :~407万库存分子,源自乌克兰的化合物品牌,具有较强的化合物研发能力,有高性价比化合物和高价值化合物两类产品。
    8. Eximed :~6万库存分子,源自乌克兰的化合物品牌,近20年来致力于提供高通量筛选化合物及相关服务。
    9. HTS :~6万库存分子,源自德国的HTS Biochemie Innovationen化合物品牌,致力于为制药、农业和生物技术公司开发独特的化合物。
    10. IBS :~55万库存分子,源自俄罗斯的InterBioScreen化合物品牌,拥有多种天然产物及衍生物。
    11. Life_Chemicals :~54万库存分子,源自加拿大的化合物品牌,拥有2900多种化合物骨架结构,化合物规格较齐全且有对应价格。
    12. Maybridge :~5万库存分子,源自英国的化合物品牌,Thermofisher旗下,产品数量少而专,每种产品均具有较大库存。
    13. Otava :~29万库存分子,源自加拿大的化合物品牌,专门从事特色化合物,生物化学药品和生物分析试剂的开发和生成。
    14. Princeton :~153万库存分子,源自美国的化合物品牌,20多年来设计独特的小分子化合物用于药物开发。
    15. Specs :~20万库存分子,源自荷兰的化合物品牌,价格优势明显。
    16. UORSY :~68万库存分子,源自乌克兰的化合物品牌,产品主要用于高通量筛选和药物发现,价格与Enamine接近。
    17. Vitas-m :~140万库存分子,源自美国的化合物品牌,在香港拥有发货中心,到货速度快,价格适中。

    其他参数与Private Ligand Library模式相同,公共库只允许刚性对接。

    结果说明

    输出结果包括:

    | 输出文件名称 | 说明 |
    | --- | --- |
    | TopNScores.csv | 分子对接得到的打分csv文件。输出小分子最多为10,000。 |
    | complex_001.pdb | 展示配体与受体的复合物构象文件。 |
    | output_ligand_topn.sdf | 筛选后配体的SDF文件。根据指定的topN数生成,最多为10,000。 |
    | output_complex_topn.tar.bz2 | 小分子与受体对接后的复合物构象PDB文件压缩包,最多生成前1000小分子的复合物结构。 |
    | TopNScores_Molecule_Info.csv | 在Private Ligand Library模式下,该csv中不仅有打分信息,还有配体原有信息。 |

    其中TopNScores.csv包括信息如下:

    | 字段名称 | 说明 |
    | --- | --- |
    | Name | 对接小分子名称 |
    | Binding Energy (AutoDock GPU) | 对接打分结果,单位为kcal/mol |
    | Cluster RMSD | 指一个配体构象相对于同一聚类(cluster)中的中心构象(通常是最低能量构象)的均方根偏差(RMSD)。RMSD 截断值为2.0 Å。 |
    | Reference RMSD | 指对接得到的配体构象与参考构象(通常是实验解析的晶体结构或用户指定的标准结构)之间的 RMSD。 |

    其中TopNScores_Molecule_Info.csv包含TopNScores.csv的信息和SDF格式小分子原有信息。

    参考文献

    Santos-Martins D, Solis-Vasquez L, Tillack AF, et al., Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. J Chem Theory Comput. 2021 Feb 9;17(2):1060-1073.

    Molecular Docking (AutoDock-GPU v2)

    Introduction

    This module is a molecular docking simulation tool primarily used for predicting molecular binding modes and interactions. It provides information on docking energy and binding affinity. Additionally, it allows for the calculation and comparison of binding abilities among multiple molecules, facilitating the screening, design, and optimization of drug molecules.

    AutoDock-GPU is an OpenCL- and CUDA-accelerated version of AutoDock 4.2.6. It exploits the parallelizable Lamarckian genetic algorithm (LGA) by processing ligand-receptor binding conformations in parallel across multiple compute units.


    Parameter

    It supports private ligand file uploads (Private Ligand Library) or the selection of public virtual screening libraries (Public Ligand Library).

    Private Ligand Library (Comp <100)

    Binding Mode

    Docking mode can be either rigid docking or flexible docking:

    • Rigid docking: The ligand remains rigid, undergoing translation and rotation within the binding pocket to find an optimal binding orientation.
    • Flexible docking: The ligand is allowed to undergo conformational changes while keeping certain non-critical bond lengths and angles fixed.

    Receptor

    • Format: PDB
    • Limitation: The receptor must contain no more than 32,768 atoms.

    Private Ligand

    • Formats Supported: SDF, PDB, MOL
    • Limitation: Only the first 100 molecules in the file will be processed.

    Box Center

    • The XYZ coordinates of the docking pocket center, separated by spaces.
      • Example: 0 0 0

    Box Size

    • The size of the docking pocket, represented as a rectangular box with integer values separated by spaces.
      • Example: 24 22 32
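
    When a reference ligand or a set of pocket atoms is available, Box Center and Box Size can be derived from their coordinates. The Python sketch below shows one possible approach. It assumes that coordinates and box edges are expressed in Å; the 5 Å padding and the function names are illustrative choices, not platform defaults.

```python
import math


def box_from_coords(coords, padding=5.0):
    """Return (center, size) of a box enclosing coords plus a margin.

    coords  : iterable of (x, y, z) tuples, assumed to be in Å
    padding : assumed margin in Å added on every side
    """
    xs, ys, zs = zip(*coords)
    axes = (xs, ys, zs)
    center = tuple(round((min(v) + max(v)) / 2, 3) for v in axes)
    # Box Size must be integers, so each edge length is rounded up.
    size = tuple(math.ceil(max(v) - min(v) + 2 * padding) for v in axes)
    return center, size


def as_parameter(values):
    """Format a tuple as the space-separated string the form expects."""
    return " ".join(str(v) for v in values)


# Hypothetical pocket atoms.
ligand = [(1.0, 2.0, 3.0), (5.0, 8.0, 4.5), (3.0, 4.0, 9.0)]
center, size = box_from_coords(ligand)
```

    Here as_parameter(center) and as_parameter(size) yield strings in the same space-separated form as the examples above.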

    Number of Poses

    • The maximum number of binding modes retained for each molecule.

    TopN

    • The number of top-scoring molecules retained from the virtual screening.

    Unbound Model

    Defines the unbound state model:

    • bound: Assumes the initial ligand conformation is close to the bound state, suitable for precise optimization with known binding modes.
    • extended: Begins from a free molecular state, suitable for moderately flexible ligands to explore binding modes.
    • compact: Best for highly flexible or folded ligands, allowing the broadest exploration of binding modes but with higher computational costs and longer runtime.

    Keep Heterogens

    • Retains non-standard amino acids.
    • Format: [Chain Name]:[Residue Name]-[Residue Number], e.g., A:UNL-311.
    • Restriction: Cannot include small molecular structures containing special ions.
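
    The Keep Heterogens format can be checked before submission. The following Python sketch parses one entry of the form [Chain Name]:[Residue Name]-[Residue Number]. The single-character chain name and the residue name of up to three characters follow common PDB conventions and are assumptions here, as is the function name.

```python
import re

# [Chain Name]:[Residue Name]-[Residue Number], e.g. A:UNL-311
# Assumed: one-character chain ID, residue name of 1 to 3 characters.
HETEROGEN_PATTERN = re.compile(r"^([A-Za-z0-9]):([A-Za-z0-9]{1,3})-(-?\d+)$")


def parse_heterogen(spec):
    """Split 'A:UNL-311' into ('A', 'UNL', 311); raise ValueError otherwise."""
    match = HETEROGEN_PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f"expected [Chain]:[ResName]-[ResNum], got {spec!r}")
    chain, name, number = match.groups()
    return chain, name, int(number)
```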

    Private Ligand Library (Comp <10,000)

    Private Ligand

    • Formats Supported: SDF, PDB, MOL
    • Limitation: Only the first 10,000 molecules in the file will be processed.

    🔹 Other parameters are identical to those in Private Ligand Library (Comp <100) mode.


    Public Ligand Library

    Public Ligand

    Provides 17 public virtual screening libraries for molecular docking, including:

    1. Alinda (~770,000 molecules) - Hong Kong-based company specializing in molecular building blocks and novel screening compounds.
    2. Analyticon (~40,000 molecules) - German brand specializing in natural product extraction and analog synthesis.
    3. Asinex (~570,000 molecules) - US-based company focused on lead-like compounds and molecular building blocks, but relatively expensive.
    4. Bionet (~300,000 molecules) - UK-based company with extensive organic synthesis expertise.
    5. Chembridge (~1.37 million molecules) - US-based company with a diverse compound collection, including macrocycles.
    6. Chemdiv (~1.56 million molecules) - One of the largest compound brands globally, offering over 5,000 scaffolds and 100+ libraries.
    7. Enamine (~4.07 million molecules) - Ukraine-based company known for cost-effective and high-value compounds.
    8. Eximed (~60,000 molecules) - Ukraine-based company providing high-throughput screening compounds.
    9. HTS (~60,000 molecules) - German company developing unique compounds for pharmaceutical, agricultural, and biotech applications.
    10. IBS (~550,000 molecules) - Russian company specializing in natural products and derivatives.
    11. Life Chemicals (~540,000 molecules) - Canadian company with diverse scaffolds and transparent pricing.
    12. Maybridge (~50,000 molecules) - UK-based ThermoFisher subsidiary focusing on high-quality compounds.
    13. Otava (~290,000 molecules) - Canadian company specializing in biochemical drugs and reagents.
    14. Princeton (~1.53 million molecules) - US-based company with 20+ years of expertise in small molecule drug discovery.
    15. Specs (~200,000 molecules) - Dutch company known for its cost-effective compounds.
    16. UORSY (~680,000 molecules) - Ukraine-based company with a price range similar to Enamine.
    17. Vitas-m (~1.4 million molecules) - US-based company with a Hong Kong shipping center, offering fast delivery and moderate pricing.

    🔹 Other parameters are identical to Private Ligand Library, but only rigid docking is allowed.


    Result

    The docking results include:

    | File Name | Description |
    | --- | --- |
    | TopNScores.csv | CSV file containing docking scores for up to 10,000 molecules. |
    | complex_001.pdb | Ligand-receptor complex conformation file. |
    | output_ligand_topn.sdf | Top-N selected ligands in SDF format (max 10,000). |
    | output_complex_topn.tar.bz2 | Compressed file of the top 1,000 ligand-receptor complex structures in PDB format. |
    | TopNScores_Molecule_Info.csv | If using the Private Ligand Library mode, this CSV includes both docking scores and original ligand information. |

    📌 TopNScores.csv Fields:

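    | Field Name | Description |
    | --- | --- |
    | Name | Name of the docked molecule. |
    | Binding Energy (AutoDock GPU) | Docking score, in kcal/mol. |
    | Cluster RMSD | RMSD of a ligand pose relative to the center pose of its cluster (usually the lowest-energy pose). The RMSD cutoff is 2.0 Å. |
    | Reference RMSD | RMSD between the docked ligand pose and the reference pose (usually an experimentally determined crystal structure or a user-specified reference structure). |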

    The TopNScores_Molecule_Info.csv file contains the information from TopNScores.csv along with the original data of small molecules in SDF format.
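
    For downstream filtering, the score table can be re-ranked locally. The Python sketch below sorts molecules by binding energy, where a more negative value indicates a more favorable predicted binding. It assumes the CSV headers match the field names listed above; the sample rows and the function name are hypothetical.

```python
import csv
import io

ENERGY_COLUMN = "Binding Energy (AutoDock GPU)"  # assumed exact CSV header


def top_n_by_energy(csv_text, n):
    """Return the n (name, energy) pairs with the lowest binding energy."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[ENERGY_COLUMN]))  # most negative first
    return [(row["Name"], float(row[ENERGY_COLUMN])) for row in rows[:n]]


# Hypothetical scores, used only to show the expected layout.
example = """Name,Binding Energy (AutoDock GPU),Cluster RMSD,Reference RMSD
lig_a,-7.2,0.0,1.8
lig_b,-9.1,0.0,2.4
lig_c,-6.5,0.0,3.0
"""
best = top_n_by_energy(example, 2)
```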


    References

    Santos-Martins D, Solis-Vasquez L, Tillack AF, et al., Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. J Chem Theory Comput. 2021; 17(2): 1060-1073.

  • Name: Antibody Design (RFantibody)
    Description: 基于RFantibody(抗体微调版RFdiffusion)的抗体从头设计,默认用Chothia编号。 RFantibody (Antibody Fine-tuned RFdiffusion) -based de novo antibody design. It uses Chothia numbering by default.
    Tags: undefined
    Author: Bennett NR
    Release: 2025-03-17 09:44:07
    Reference: Bennett, N.R., Watson, J.L., Ragotte, R.J. et al. Atomically accurate de novo design of antibodies with RFdiffusion. Nature 649, 183–193 (2026). https://doi.org/10.1038/s41586-025-09721-5

    Antibody Design (RFantibody)

    简介

    RFantibody 是目前最先进的抗体从头生成方法,通过人工智能(AI)技术实现抗体的从头(de novo)设计,包括单域抗体(VHH)和单链抗体片段(scFv),能够精准结合用户指定的目标表位,并已通过湿实验验证其功能。

    RFantibody基于蛋白质结构预测模型RoseTTAFold2(RF2)和蛋白质生成模型RFdiffusion,通过对原始RFdiffusion进行微调,开发出专用于抗体设计的RFdiffusion版本。其核心原理如下:

    • 抗体结构特性利用:RFdiffusion在蛋白质数据库(PDB)中的抗体结构数据(约8100个抗体结构)上进行微调,重点训练抗体特有的互补决定区(CDR)loop 区域,同时保持框架结构接近用户指定的优化框架。训练过程中,通过逐步添加噪声(Cα 坐标加入三维高斯噪声,残基方向加入 SO(3) 布朗运动),网络学习预测去噪后的结构。

    • 表位靶向设计:通过引入"热点"(Hotspot)特征,用户可指定目标蛋白上的表位,网络通过CDR loop与表位的相互作用进行设计。训练时,抗体框架以全局坐标无关的方式提供(通过二维距离和二面角矩阵表示),允许网络自由设计CDR Loop构象及抗体与目标的刚体定位。

    • 序列设计与验证:结构设计后,使用ProteinMPNN生成CDR loop区序列,优化与目标表位的相互作用。设计的抗体通过微调后的RF2进行结构预测和自一致性验证,筛选高潜力候选分子。

    • 支持 VHH 和 scFv 设计:RFdiffusion 不仅支持单域抗体(VHH)的设计,还可应用于单链抗体片段(scFv)的设计。scFv 设计涉及重链和轻链的所有六个 CDR 的设计。

    通过上述方法,RFantibody能够生成多样化的抗体结构,显著区别于训练数据集,同时实现与目标表位的高度形状互补性和功能性结合。


    RFantibody项目针对多个疾病相关表位进行了VHH和scFv设计,并通过表面等离子共振(SPR)、冷冻电镜(cryo-EM)、中和实验等手段验证了设计的有效性。以下是具体实验结果及分析:

    1. 单域抗体(VHH)设计与实验验证

    实验选择了多个疾病相关靶点,包括流感血凝素(HA)、呼吸道合胞病毒(RSV)位点I和III、SARS-CoV-2受体结合域(RBD)、艰难梭菌毒素B(TcdB)和IL-7Rα。以下为关键结果:

    • 结合亲和力(KD):

      • 流感HA:针对HA茎部表位的VHH设计中,最高亲和力结合体(VHH_flu_01)KD值为78 nM,其他结合体KD值分别为546 nM、698 nM和790 nM。实验使用昆虫细胞表达的单体HA(模拟去糖基化状态)以匹配计算设计条件。
      • SARS-CoV-2 RBD:最佳VHH结合体KD值为5.5 μM,通过竞争实验(与已知结合体AHB2竞争)确认结合至目标表位。
      • TcdB:针对Frizzled-7表位的VHH最佳结合体KD值为260 nM,结合特异性高,未观察到与同源性70%的Clostridium sordellii毒素L(TcsL)的交叉反应。
    • 中和活性(EC50):

      • TcdB:针对TcdB的VHH在中和实验中表现出功能性,在CSPG4敲除细胞中中和TcdB毒性,EC50值为460 nM,表明其潜在的治疗应用价值。
    • 结构准确性(cryo-EM):

      • 流感HA:通过cryo-EM解析了VHH_flu_01与原生糖基化HA三聚体的复合物结构(分辨率3.0 Å)。66%的HA颗粒结合了至多两个VHH,部分未结合可能由于N296糖基的遮挡。实验结构与设计模型高度一致,整体RMSD为1.45 Å,CDR3 RMSD为0.8 Å,关键CDR3残基(V100、V101、S103、F108)与HA茎部表位的相互作用如设计预期。
      • TcdB:针对TcdB的原始设计(VHH_TcdB_H2)和亲和力成熟后版本(VHH_TcdB_H2_ortho)进行了cryo-EM分析。原始设计确认结合至Frizzled-7表位,成熟后版本(分辨率5.7 Å)显示更高的结合比例,结构符合设计预期。
      • SARS-CoV-2 RBD:亲和力成熟后的VHH(VHH_RBD_D4_ortho19)结合至RBD"上"构象表位(分辨率3.9 Å)。
    • 亲和力成熟(OrthoRep):

      • 使用OrthoRep系统对TcdB、流感HA和SARS-CoV-2 RBD的VHH进行亲和力成熟,结合亲和力提升约两个数量级,同时保留了原始表位特异性。

    2. 单链抗体片段(scFv)设计与实验验证

    进一步扩展至scFv设计,涉及重链和轻链六个CDR的设计,采用结构导向的组合库策略以提高成功率。实验靶点包括TcdB的Frizzled-7表位和Phox2b/HLA-C*07:02复合物。

    • 结合亲和力(KD):

      • TcdB:通过组合库筛选出针对Frizzled-7表位的scFv,最高亲和力结合体(scFv6)KD值为72 nM,其他结合体的KD值未详细列出。竞争实验(与Frizzled-7竞争)确认结合至目标表位,未与无关受体CSPG4竞争。
      • Phox2b/HLA-C*07:02:针对神经母细胞瘤相关表位的scFv结合体KD值为400 nM(SPR)和1 μM(ITC),特异性结合至Phox2b肽,未结合R6A突变肽。尝试将其转化为CAR-T细胞未显示细胞毒性,可能因亲和力不足或抗原密度低。
    • 结构准确性(cryo-EM):

      • TcdB:两个scFv(scFv5和scFv6)结合至Frizzled-7表位的cryo-EM结构验证了设计准确性。scFv6的分辨率为3.6 Å,整体RMSD为0.9 Å,六个CDR的骨架RMSD分别为CDRH1=0.4 Å、CDRH2=0.3 Å、CDRH3=0.7 Å、CDRL1=0.2 Å、CDRL2=1.1 Å、CDRL3=0.2 Å,侧链构象及相互作用符合设计。scFv5(分辨率6.1 Å)以不同接近角度结合,实验结构与设计模型一致。

    3. 实验结果分析

    • 结构多样性:设计的VHH和scFv的CDR区与自然抗体显著不同,且针对TcdB的Frizzled-7表位无已知抗体,表明RFdiffusion实现了真正的从头设计。
    • 功能性与应用潜力:TcdB VHH的中和活性(EC50=460 nM)和scFv的高亲和力(KD=72 nM)显示出治疗潜力,但Phox2b scFv的CAR-T应用失败表明需进一步优化亲和力或抗原表达。

    4. 总结

    RFantibody通过微调RFdiffusion网络,实现了从头设计VHH和scFv的目标,能够靶向多种疾病相关表位。实验结果显示设计的抗体具有较高的结构准确性(RMSD低至0.9 Å)和功能性(KD低至72 nM,EC50为460 nM)。cryo-EM验证了设计的原子级精度,而亲和力成熟和组合库策略进一步提升了成功率。

    参数说明

    Complex

    用于抗体设计的抗体-抗原复合物结构,PDB格式。如果指定了该参数,后续的Antigen,Antibody参数不用再指定。如果不指定该参数,则需要分别输入Antigen与Antibody的结构。
    注意:当前只支持单链抗原,如存在多链时会提示错误,可以使用蛋白编辑工具去掉抗原多余的链,保留单链抗原即可。

    Antigen

    指定抗原的结构文件,PDB格式。
    说明:抗原结构通常需要截短以减少计算开销,建议保留表位周围约 10Å 的区域即可。

    Antibody

    指定抗体的结构文件,PDB格式。

    Number of designs

    指定设计的抗体数量,默认为20。

    Residues

    定义需要突变设计的残基,格式为“链名称+残基编号或范围”,多段残基用逗号分隔。例如:参数设置为H27,H28,H99,H100-103,L24-32时,表示:对H链中编号为27、28、99、100至103的残基,L链中编号为24-32的残基,进行突变设计。
    注意:
    1,这里的残基编号是指从1开始的残基位置顺序编号,不是原PDB文件中的残基编号。
    2,如指定了该参数,则不能再指定后续的CDR参数(HCDR1-3或LCDR1-3),否则会提示参数错误。
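
    As an illustration of the Residues format described above (chain name plus a 1-based position or range, comma-separated), the following Python sketch expands a specification into explicit positions per chain. The function name is hypothetical and single-character chain names are assumed.

```python
def parse_residue_spec(spec):
    """Expand 'H27,H28,H100-103' into {'H': [27, 28, 100, 101, 102, 103]}."""
    selection = {}
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        # Assumed: the first character is the chain name, the rest is a
        # 1-based position or an inclusive "start-end" range.
        chain, positions = token[0], token[1:]
        if "-" in positions:
            start, end = (int(part) for part in positions.split("-"))
        else:
            start = end = int(positions)
        if start < 1 or end < start:
            raise ValueError(f"invalid residue range: {token!r}")
        selection.setdefault(chain, []).extend(range(start, end + 1))
    return selection


selection = parse_residue_spec("H27,H28,H99,H100-103,L24-32")
```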

    H-CDR1, H-CDR2, H-CDR3, L-CDR1, L-CDR2, L-CDR3

    分别指定需要设计的抗体重、轻链CDR区的长度范围。格式为:起始长度-终止长度(如:5-13),或单一长度(如:7)。
    说明:这些参数定义了每个CDR区的允许长度范围,如果设置的是起始长度-终止长度(如:5-13),模型将从中均匀采样长度。如果设置的是单一长度(如:7),则该CDR将以指定长度进行设计。如果不指定某个CDR的长度范围(如:不设置H-CDR1的长度),则该CDR将保持原始结构和序列不被设计。需要指定至少一个CDR区域的长度进行设计,否则会提示错误。
    对于VHH设计,仅需指定H-CDR1, H-CDR2, H-CDR3;对于scFv设计,可指定所有六个CDR。长度选择可参考自然抗体的CDR 长度分布,推荐较短的H-CDR3(如:5-13),以降低设计难度。

    Hotspot

    指定抗原上的结合位点残基,用于定义抗体结合的表位。格式为逗号分隔的残基列表,例如 305,456。

    • 说明:结合位点残基帮助模型聚焦于特定表位。选择时建议挑选表位中3个以上疏水性残基,避免过多极性或糖基化区域。

    结果说明

    经过抗体设计后,得到的抗体-抗原复合物结构,并根据质量评估指标进行排序。包括:

    结构文件:按结构质量排序的PDB格式抗体-抗原复合物结构的打包文件 de_novo_antibody.tar及最优的设计结果rank_1.pdb
    结构评分:CSV格式的评估指标表格 cdr_sequences.csv,包含如下信息:

    | 字段名称 | 说明 |
    | --- | --- |
    | Design_ID | 预测结构的文件名 |
    | CDR_H1/H2/H3/L1/L2/L3 | 设计后得到的CDR序列 |
    | ipAE | 预测对齐误差交互值(the predicted interaction alignment error),衡量抗体与抗原结合界面的结构预测置信度,该指标反映了抗体-抗原复合物界面的结构稳定性和预测准确性,数值越小表示结合界面预测越可靠,推荐选择ipAE<10的设计进行实验验证 |
    | pLDDT | 预测局部距离差异测试,衡量整体结构预测的质量和可靠性,该指标反映了抗体结构本身的稳定性和折叠质量,数值范围为 0-1.0,数值越接近1.0表示结构预测越可靠,推荐选择pLDDT > 0.8的设计进行实验验证 |

    输出示例

    Design_ID,CDR_H3,ipAE,pLDDT
    rank_1,IAYTPGAPLF,8.91,0.92
    rank_2,VAPSKTDALF,9.29,0.92
    

    序列文件:所有设计抗体的序列汇总文件antibody_sequences.fasta

    参考文献

    • Bennett NR, Watson JL, Ragotte RJ, Borst AJ, See DL, Weidle C, Biswas R, Shrock EL, Leung PJY, Huang B, Goreshnik I, Ault R, Carr KD, Singer B, Criswell C, Vafeados D, Sanchez MG, Kim HM, Torres SV, Chan S, Baker D. Atomically accurate de novo design of antibodies with RFdiffusion. DOI:10.1101/2024.03.14.585103

    Antibody Design (RFantibody)

    Introduction

    RFantibody is a state-of-the-art AI method for de novo antibody design. It generates single-domain antibodies (VHH) and single-chain variable fragments (scFv) that bind user-specified target epitopes, and its designs have been validated in wet-lab experiments.

    RFantibody is based on the protein structure prediction model RoseTTAFold2 (RF2) and the protein generation model RFdiffusion. By fine-tuning the original RFdiffusion, a specialized version for antibody design has been developed. Its core principles are as follows:

    • Utilization of Antibody Structural Features: RFdiffusion is fine-tuned on antibody structural data (approximately 8,100 antibody structures) from the Protein Data Bank (PDB), focusing on training the antibody-specific complementarity-determining region (CDR) loops while maintaining framework structures close to user-specified optimized frameworks. During training, noise is gradually added (3D Gaussian noise to Cα coordinates, SO(3) Brownian motion to residue orientations), and the network learns to predict the denoised structure.

    • Epitope-Targeted Design: By introducing “Hotspot” features, users can specify epitopes on target proteins, and the network designs through interactions between CDR loops and the epitope. During training, the antibody framework is provided in a globally coordinate-independent manner (represented by 2D distance and dihedral angle matrices), allowing the network to freely design CDR loop conformations and rigid-body positioning of the antibody relative to the target.

    • Sequence Design and Validation: After structural design, ProteinMPNN is used to generate sequences for CDR loop regions, optimizing interactions with the target epitope. The designed antibodies are validated through structure prediction and self-consistency verification using the fine-tuned RF2, screening for high-potential candidates.

    • Support for VHH and scFv Design: RFdiffusion supports not only the design of single-domain antibodies (VHH) but also single-chain antibody fragments (scFv). scFv design involves designing all six CDRs of the heavy and light chains.

    Through these methods, RFantibody can generate diverse antibody structures that significantly differ from the training dataset while achieving high shape complementarity and functional binding to target epitopes.


    Experimental Validation

    The RFantibody project has conducted VHH and scFv designs targeting multiple disease-related epitopes and validated their effectiveness through surface plasmon resonance (SPR), cryo-electron microscopy (cryo-EM), neutralization assays, and other methods. The following are specific experimental results and analyses:

    1. Single-Domain Antibody (VHH) Design and Experimental Validation

    Experiments selected multiple disease-related targets, including influenza hemagglutinin (HA), respiratory syncytial virus (RSV) sites I and III, SARS-CoV-2 receptor-binding domain (RBD), Clostridioides difficile toxin B (TcdB), and IL-7Rα. Key results include:

    • Binding Affinity (KD):

      • Influenza HA: Among VHH designs targeting the HA stem epitope, the highest affinity binder (VHH_flu_01) had a KD value of 78 nM, with other binders having KD values of 546 nM, 698 nM, and 790 nM. Experiments used insect cell-expressed monomeric HA (simulating deglycosylated state) to match computational design conditions.
      • SARS-CoV-2 RBD: The best VHH binder had a KD value of 5.5 μM, confirmed to bind to the target epitope through competition experiments (competing with known binder AHB2).
      • TcdB: The best VHH binder targeting the Frizzled-7 epitope had a KD value of 260 nM, with high binding specificity and no observed cross-reactivity with Clostridium sordellii toxin L (TcsL), which has 70% homology.
    • Neutralization Activity (EC50):

      • TcdB: VHHs targeting TcdB demonstrated functionality in neutralization assays, neutralizing TcdB toxicity in CSPG4 knockout cells with an EC50 value of 460 nM, indicating potential therapeutic applications.
    • Structural Accuracy (cryo-EM):

      • Influenza HA: Cryo-EM resolved the complex structure of VHH_flu_01 with native glycosylated HA trimer (resolution 3.0 Å). 66% of HA particles bound up to two VHHs, with partial non-binding possibly due to N296 glycan shielding. The experimental structure highly aligned with the design model, with an overall RMSD of 1.45 Å, CDR3 RMSD of 0.8 Å, and key CDR3 residues (V100, V101, S103, F108) interacting with the HA stem epitope as designed.
      • TcdB: Cryo-EM analysis was performed on the original design (VHH_TcdB_H2) and affinity-matured version (VHH_TcdB_H2_ortho) targeting TcdB. The original design confirmed binding to the Frizzled-7 epitope, while the matured version (resolution 5.7 Å) showed higher binding proportions, with structures conforming to design expectations.
      • SARS-CoV-2 RBD: The affinity-matured VHH (VHH_RBD_D4_ortho19) bound to the RBD “up” conformation epitope (resolution 3.9 Å).
    • Affinity Maturation (OrthoRep):

      • The OrthoRep system was used for affinity maturation of VHHs targeting TcdB, influenza HA, and SARS-CoV-2 RBD, improving binding affinity by approximately two orders of magnitude while maintaining original epitope specificity.

    2. Single-Chain Antibody Fragment (scFv) Design and Experimental Validation

    Further expansion to scFv design involved designing six CDRs of heavy and light chains, adopting a structure-guided combinatorial library strategy to increase success rates. Experimental targets included the Frizzled-7 epitope of TcdB and the Phox2b/HLA-C*07:02 complex.

    • Binding Affinity (KD):

      • TcdB: Through combinatorial library screening, scFvs targeting the Frizzled-7 epitope were identified, with the highest affinity binder (scFv6) having a KD value of 72 nM. KD values for other binders were not detailed. Competition experiments (competing with Frizzled-7) confirmed binding to the target epitope, with no competition with the unrelated receptor CSPG4.
      • Phox2b/HLA-C*07:02: scFvs targeting the neuroblastoma-related epitope had KD values of 400 nM (SPR) and 1 μM (ITC), specifically binding to the Phox2b peptide but not to the R6A mutant peptide. Attempts to convert it to CAR-T cells did not show cytotoxicity, possibly due to insufficient affinity or low antigen density.
    • Structural Accuracy (cryo-EM):

      • TcdB: Cryo-EM structures of two scFvs (scFv5 and scFv6) binding to the Frizzled-7 epitope validated design accuracy. scFv6 had a resolution of 3.6 Å, overall RMSD of 0.9 Å, and backbone RMSDs for the six CDRs of CDRH1=0.4 Å, CDRH2=0.3 Å, CDRH3=0.7 Å, CDRL1=0.2 Å, CDRL2=1.1 Å, CDRL3=0.2 Å, with side chain conformations and interactions conforming to design. scFv5 (resolution 6.1 Å) bound with a different approach angle, with the experimental structure consistent with the design model.

    3. Analysis of Experimental Results

    • Structural Diversity: The designed VHHs and scFvs had CDR regions significantly different from natural antibodies, and there were no known antibodies for the Frizzled-7 epitope of TcdB, indicating that RFdiffusion achieved true de novo design.
    • Functionality and Application Potential: The neutralization activity of TcdB VHH (EC50=460 nM) and high affinity of scFv (KD=72 nM) demonstrated therapeutic potential, but the failure of Phox2b scFv in CAR-T applications indicated the need for further optimization of affinity or antigen expression.

    4. Summary

    RFantibody, through fine-tuning the RFdiffusion network, has achieved the goal of de novo designing VHHs and scFvs capable of targeting various disease-related epitopes. Experimental results show that the designed antibodies have high structural accuracy (RMSD as low as 0.9 Å) and functionality (KD as low as 72 nM, EC50 of 460 nM). Cryo-EM validated the atomic-level precision of the designs, while affinity maturation and combinatorial library strategies further improved success rates.

    Parameter

    Complex

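    The structure of the antibody-antigen complex used for antibody design, in PDB format. If this parameter is specified, the Antigen and Antibody parameters below do not need to be specified. If it is not specified, the Antigen and Antibody structures must be provided separately.
    Note: Only single-chain antigens are currently supported; a multi-chain antigen triggers an error. Use a protein editing tool to remove the extra antigen chains and keep a single chain.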

    Antigen

    The structure file of the antigen, in PDB format.
    Note: The antigen structure usually needs to be truncated to reduce computational cost. It is recommended to retain only the region within approximately 10 Å around the epitope.

    Antibody

    The structure file of the antibody, in PDB format.

    Number of designs

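    The number of antibodies to be designed, with a default value of 20.

    Residues

    Defines the residues to be redesigned (mutated). The format is the chain name followed by a residue number or range, with multiple entries separated by commas. For example, H27,H28,H99,H100-103,L24-32 selects residues 27, 28, 99, and 100 to 103 of chain H and residues 24 to 32 of chain L.
    Note:
    1. Residue numbers are sequential positions counted from 1, not the residue numbers in the original PDB file.
    2. If this parameter is specified, the CDR parameters below (H-CDR1 to H-CDR3, L-CDR1 to L-CDR3) must not be specified; otherwise a parameter error is reported.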

    H-CDR1, H-CDR2, H-CDR3, L-CDR1, L-CDR2, L-CDR3

    Specify the length range of the CDR regions in the heavy and light chains to be designed. The format is: start length-end length (e.g., 5-13), or a single length (e.g., 7).
    Note: These parameters define the allowed length range for each CDR region. If a range is specified (e.g., 5-13), the model will uniformly sample lengths within this range. If a single length is specified (e.g., 7), the CDR will be designed with the given length. If the length range of a CDR is not specified (e.g., H-CDR1 is not set), that CDR will retain its original structure and sequence without being designed. The length of at least one CDR region needs to be specified for the design; otherwise, an error will be prompted.
    For VHH design, only H-CDR1, H-CDR2, and H-CDR3 need to be specified; for scFv design, all six CDRs can be specified. The length selection can refer to the natural distribution of CDR lengths in antibodies. It is recommended to use a shorter H-CDR3 (e.g., 5-13) to reduce design complexity.
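
    The length-range rules above can be summarized in a few lines of Python. This is a minimal sketch of the documented behavior (a range is sampled uniformly, a single value is used as-is, an unset CDR is left untouched, and at least one CDR must be set); the function names are hypothetical and this is not the RFantibody implementation.

```python
import random


def parse_length_range(value):
    """Parse '5-13' into (5, 13) and '7' into (7, 7)."""
    parts = value.strip().split("-")
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"invalid CDR length setting: {value!r}")
    if low < 1 or high < low:
        raise ValueError(f"invalid CDR length setting: {value!r}")
    return low, high


def sample_cdr_lengths(settings, rng=random):
    """Pick one length per configured CDR; CDRs not in settings stay unchanged."""
    if not settings:
        raise ValueError("at least one CDR length must be specified")
    # randint is inclusive on both ends, giving a uniform choice over the range.
    return {cdr: rng.randint(*parse_length_range(v)) for cdr, v in settings.items()}


# VHH-style example: only heavy-chain CDRs are configured.
lengths = sample_cdr_lengths(
    {"H-CDR1": "7", "H-CDR2": "6", "H-CDR3": "5-13"}, random.Random(0)
)
```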

    Hotspot

    Specify the binding site residues on the antigen to define the epitope for antibody binding. The format is: a comma-separated list of residues, e.g., 305,456.
    Note: Binding site residues help the model focus on specific epitopes. It is recommended to select more than three hydrophobic residues within the epitope and avoid areas with excessive polarity or glycosylation.

    Result Description

    After antibody design, the antibody-antigen complex structures are obtained and sorted based on quality assessment metrics. These include:
    Structure Files: de_novo_antibody.tar, an archive of the antibody-antigen complex structures in PDB format sorted by structural quality, and rank_1.pdb, the top-ranked design.
    Structure Scores: A CSV file cdr_sequences.csv containing the assessment metrics, with the following information:

    | Field Name | Description |
    | --- | --- |
    | Design_ID | The filename of the predicted structure |
    | CDR_H1/H2/H3/L1/L2/L3 | Designed sequence of CDRs |
    | ipAE | Predicted interaction alignment error, which measures the confidence of the structural prediction at the antibody-antigen binding interface. This metric reflects the stability and accuracy of the antibody-antigen complex interface. Lower values indicate more reliable predictions. Designs with ipAE < 10 are recommended for experimental validation. |
    | pLDDT | Predicted Local Distance Difference Test, which measures the overall quality and reliability of the structural prediction. This metric reflects the stability and folding quality of the antibody structure itself. The value ranges from 0 to 1.0, with values closer to 1.0 indicating more reliable structural predictions. Designs with pLDDT > 0.8 are recommended for experimental validation. |

    Example

    Design_ID,CDR_H3,ipAE,pLDDT
    rank_1,IAYTPGAPLF,8.91,0.92
    rank_2,VAPSKTDALF,9.29,0.92
    

    Sequence File: antibody_sequences.fasta, a FASTA file containing the sequences of all designed antibodies.
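
    The recommended thresholds for the two scores (ipAE < 10 and pLDDT > 0.8) can be applied to cdr_sequences.csv with a short script. The Python sketch below assumes the column names shown in the example output; the function name is hypothetical, and the third row is an invented design added only to show a rejected case.

```python
import csv
import io


def select_designs(csv_text, max_ipae=10.0, min_plddt=0.8):
    """Return Design_IDs that pass both recommended thresholds."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["ipAE"]) < max_ipae and float(row["pLDDT"]) > min_plddt:
            selected.append(row["Design_ID"])
    return selected


# First two rows are taken from the example above; rank_3 is hypothetical.
example = """Design_ID,CDR_H3,ipAE,pLDDT
rank_1,IAYTPGAPLF,8.91,0.92
rank_2,VAPSKTDALF,9.29,0.92
rank_3,GGGGGGGGGG,12.40,0.75
"""
candidates = select_designs(example)
```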

    References

    • Bennett NR, Watson JL, Ragotte RJ, Borst AJ, See DL, Weidle C, Biswas R, Shrock EL, Leung PJY, Huang B, Goreshnik I, Ault R, Carr KD, Singer B, Criswell C, Vafeados D, Sanchez MG, Kim HM, Torres SV, Chan S, Baker D. Atomically accurate de novo design of antibodies with RFdiffusion. DOI:10.1101/2024.03.14.585103
  • Name: MD Solvation v2
    Description: 对MD体系加入水盒子和离子。v2新增自主添加金属离子环境功能。 Adds water box and ions for the system. Add user-specified ions in version v2.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-02-19 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD Solvation v2

    简介

    对MD体系进行溶剂化操作,添加水盒子和离子。

    参数说明

    Receptor Topology

    输入的受体拓扑文件,可由GMX Receptor Parameterization模块生成。

    Receptor GRO

    输入的受体结构文件,可由GMX Receptor Parameterization模块生成。

    Receptor ITP

    输入的受体参数(压缩)文件,可由GMX Receptor Parameterization模块生成。

    Ligand GRO

    输入的配体结构(压缩)文件,可由GMX Ligand Parameterization模块生成。

    Ligand ITP

    输入的配体参数(压缩)文件,可由GMX Ligand Parameterization模块生成。

    Ions

    需要添加的离子,支持钠离子NA,钾离子K,氯离子CL,钙离子CA,镁离子MG,锌离子ZN,同时添加多个使用英文冒号:分割,如NA:K:MG

    Number of Ions

    需要添加的离子数目,添加多种离子时,和Ions参数对应,使用英文冒号:分割,如15:20:30
    说明:Number of Ions与Concentration of Ions,选择其中一种输入,不要同时输入

    Concentration of Ions

    需要添加的离子浓度,单位为mol/L,添加多种离子时,和Ions参数对应,使用英文冒号:分割,如0.15:0.3:0.1
    说明:Number of Ions与Concentration of Ions,选择其中一种输入,不要同时输入

    Output Topology

    输出的体系总的拓扑文件

    Output GRO

    输出的体系总的结构文件

    Output ITP

    输出的体系参数的(压缩)文件

    Distance Restraints

    距离限制,仅当Disre不为no时生效,格式如下所示:

    [AtomIndex1]  [AtomIndex2]  [Type]  [Index]  [Type]  [Low]  [Up1]  [Up2]  [Factor]
    

    其中,AtomIndex1和AtomIndex2为在system.gro的原子编号;Type为施加约束类型,通常设置为1,Type类型见表1;Index是计算顺序;Low、Up1、Up2为原子间限制距离,Low到Up1区间的原子距离是不受限制的,但是不能超过Up2,单位为nm;Factor为因子,将Factor乘以“Disre Force Constant”即为限制力的大小,单位为kJ/mol/nm2。
    例如:

    10     16      1       0       1      0.0     0.3     0.4     1.0
    10     46      1       1       1      0.0     0.3     0.4     1.0
    16     22      1       2       1      0.0     0.3     0.4     2.5
    

    表1:GROMACS中三种约束类型对原子对进行限制

    | Type Code | 约束类型 | 作用情况 |
    | --- | --- | --- |
    | 1 | Complex NMR distance restraints | 当Disre Type为ensemble时,即非键相互作用设置为1 |
    | 6 | Simple harmonic restraints | 当Disre Type为simple时,即分子内成键相互作用设定,可设为6或者10 |
    | 10 | Piecewise linear/harmonic restraints | 当Disre Type为simple时,即分子内成键相互作用设定,可设为6或者10 |

    Angle Restraints

    角度限制是两对原子间角度的限制,仅当Disre不为no时生效,格式如下所示:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Theta0]  [Force Constant]  [Multiplicity]
    

    其中,AtomIndex1-AtomIndex2是第一对原子编号;AtomIndex3-AtomIndex4为第二对原子编号;Type在这里无用,定义为1即可;Theta0为约束的角度,单位为deg;Force Constant为约束力常数,单位为kJ/mol;Multiplicity为多重度。
    例如

    2642     2643     2635     2652     1     67.0     1500     1
    

    Dihedral Restraints

    二面角限制,仅当Disre不为no时生效,格式如下所示:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Label]  [Phi]  [dPhi]  [KFactor]  [Power]
    

    其中,AtomIndex1-AtomIndex4为组成二面角的原子编号;Type为约束类型函数,总是为1;Label无效;Phi为参考角,dPhi为超出参考角的角度值,单位为deg;KFactor为因子,将KFactor乘以“Disre Force Constant”即为限制力的大小,单位为 kJ/mol/rad2;Power无效。
    例如:

    2642      2643      2635      2652      1      67.0      1500      1
    

    约束势函数如下所示:
    image.png
    其中,Φ’为参考角Phi,ΔΦ为超出参考角的值dPhi,K_dihr为限制力的大小KFactor。

    Solute Box Type

    控制溶剂盒子的几何形状。

    • cubic:立方体盒
    • triclinic:一般三斜盒
    • dodecahedron:近似球形、体积更小。通常用于蛋白或小分子体系,因为它能在保证同样最小距离的前提下,减少约 30% 的水分子数,节约计算量。
    • octahedron:八面体盒

    Solute Box Distance

    体系中分子表面到盒子边界的最小距离(单位 nm)

    结果说明

    输出结果包括:

    输出文件名称 说明
    system.gro 体系的分子坐标文件
    system_itp.tar.gz 体系平衡模拟时固定原子位置所施加的力
    system.top 体系的拓扑文件
    index.ndx GROMACS 生成的索引文件,定义体系中原子或残基的分组信息(index groups),用于后续分析或计算时选择特定原子集合

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: 10.1016/j.softx.2015.06.001

    MD Solvation v2

    Introduction

    Solvates an MD system by adding a water box and ions.

    Parameters

    Receptor Topology

    Input receptor topology file, which can be generated by the GMX Receptor Parameterization module.

    Receptor GRO

    Input receptor structure file, which can be generated by the GMX Receptor Parameterization module.

    Receptor ITP

    Input receptor parameter (compressed) file, which can be generated by the GMX Receptor Parameterization module.

    Ligand GRO

    Input ligand structure (compressed) file, which can be generated by the GMX Ligand Parameterization module.

    Ligand ITP

    Input ligand parameter (compressed) file, which can be generated by the GMX Ligand Parameterization module.

    Ions

    Ions to be added. Supports sodium (NA), potassium (K), chloride (CL), calcium (CA), magnesium (MG), and zinc (ZN). To add multiple ion types simultaneously, separate them with a colon :, e.g. NA:K:MG.

    Number of Ions

    Number of ions to be added. When adding multiple ion types, this corresponds to the Ions parameter and should also be colon-separated, e.g. 15:20:30.

    Note: Choose either Number of Ions or Concentration of Ions; do not provide both.

    Concentration of Ions

    Concentration of ions to be added, in mol/L. When adding multiple ion types, this corresponds to the Ions parameter and should also be colon-separated, e.g. 0.15:0.3:0.1.

    Note: Choose either Number of Ions or Concentration of Ions; do not provide both.
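
    The pairing rules for Ions, Number of Ions, and Concentration of Ions can be sketched in Python. This is a minimal illustration rather than the module's own code: the function name parse_ion_spec and its arguments are hypothetical, and it assumes the three fields arrive as the colon-separated strings described above.

```python
SUPPORTED_IONS = {"NA", "K", "CL", "CA", "MG", "ZN"}


def parse_ion_spec(ions, numbers=None, concentrations=None):
    """Pair each ion name with either a count or a concentration in mol/L.

    Hypothetical helper: exactly one of `numbers` / `concentrations`
    may be given, mirroring the either-or rule in the text.
    """
    names = ions.split(":")
    unknown = [name for name in names if name not in SUPPORTED_IONS]
    if unknown:
        raise ValueError(f"unsupported ion(s): {unknown}")

    # Reject the cases where both fields or neither field is supplied.
    if (numbers is None) == (concentrations is None):
        raise ValueError("provide exactly one of numbers or concentrations")

    if numbers is not None:
        kind = "count"
        values = [int(v) for v in numbers.split(":")]
    else:
        kind = "mol/L"
        values = [float(v) for v in concentrations.split(":")]

    # Each ion type needs its own amount, in the same order.
    if len(values) != len(names):
        raise ValueError("amounts must match the number of ion types")
    return kind, dict(zip(names, values))
```

    For example, parse_ion_spec("NA:K:MG", numbers="15:20:30") pairs NA with 15, K with 20, and MG with 30, while supplying both amount fields raises an error.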

    Output Topology

    Output topology file for the entire system.

    Output GRO

    Output structure file for the entire system.

    Output ITP

    Output parameter (compressed) file for the entire system.

    Distance Restraints

    Distance restraints, effective only when Disre is not set to no. Format:

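    [AtomIndex1]  [AtomIndex2]  [Type]  [Index]  [Type']  [Low]  [Up1]  [Up2]  [Factor]
    
    • AtomIndex1 and AtomIndex2: Atom indices in system.gro.
    • Type: Restraint type, typically set to 1. See Table 1 for restraint types.
    • Index: Calculation order.
    • Type': Second type field of a GROMACS distance restraint (1 for a regular restraint, 2 for a restraint that is never time- or ensemble-averaged), typically set to 1.
    • Low, Up1, Up2: Distance bounds between the two atoms, in nm. Distances between Low and Up1 are unrestrained. In the GROMACS potential the penalty is harmonic below Low and between Up1 and Up2, and grows linearly beyond Up2.
    • Factor: Multiplier. The restraint force constant is Factor × “Disre Force Constant”. Unit: kJ/mol/nm².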

    Example:

    10     16      1       0       1      0.0     0.3     0.4     1.0
    10     46      1       1       1      0.0     0.3     0.4     1.0
    16     22      1       2       1      0.0     0.3     0.4     2.5
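
    As a rough illustration of how one restraint line maps onto an energy, the sketch below parses the nine columns and evaluates the piecewise distance-restraint potential documented in the GROMACS reference manual. The names DistanceRestraint, parse_distance_restraint, and restraint_energy are hypothetical, the force constant is passed in by hand, and time or ensemble averaging is ignored.

```python
from typing import NamedTuple


class DistanceRestraint(NamedTuple):
    atom1: int
    atom2: int
    func_type: int
    index: int
    type_prime: int
    low: float   # nm
    up1: float   # nm
    up2: float   # nm
    factor: float


def parse_distance_restraint(line):
    """Split one whitespace-separated restraint line into typed fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    ints = [int(v) for v in fields[:5]]
    floats = [float(v) for v in fields[5:]]
    return DistanceRestraint(*ints, *floats)


def restraint_energy(r, restraint, force_constant):
    """Energy in kJ/mol at distance r (nm), without any averaging."""
    k = restraint.factor * force_constant  # kJ/mol/nm^2
    if r < restraint.low:
        return 0.5 * k * (r - restraint.low) ** 2
    if r < restraint.up1:
        return 0.0  # flat region: no penalty between Low and Up1
    if r < restraint.up2:
        return 0.5 * k * (r - restraint.up1) ** 2
    # Beyond Up2 the penalty continues linearly instead of quadratically.
    return 0.5 * k * (restraint.up2 - restraint.up1) * (
        2 * r - restraint.up2 - restraint.up1
    )
```

    With the third example line and an assumed Disre Force Constant of 1000, the effective force constant is 2.5 × 1000 = 2500 kJ/mol/nm², and the energy is zero anywhere between 0.0 and 0.3 nm.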
    

    Table 1: Three GROMACS restraint types for atom pairs

    | Type Code | Restraint Type | Usage |
    | --- | --- | --- |
    | 1 | Complex NMR distance restraints | Use when Disre Type is ensemble, i.e., for non-bonded interactions. |
    | 6 | Simple harmonic restraints | Use when Disre Type is simple, i.e., for bonded interactions within a molecule; either 6 or 10 can be used. |
    | 10 | Piecewise linear/harmonic restraints | Use when Disre Type is simple, i.e., for bonded interactions within a molecule; either 6 or 10 can be used. |

    Angle Restraints

    Angle restraints define the angle between two atom pairs, effective only when Disre is not set to no. Format:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Theta0]  [Force Constant]  [Multiplicity]
    
    • AtomIndex1–AtomIndex2: First atom pair.
    • AtomIndex3–AtomIndex4: Second atom pair.
    • Type: Unused; set to 1.
    • Theta0: Restrained angle. Unit: deg.
    • Force Constant: Restraint force constant. Unit: kJ/mol.
    • Multiplicity: Multiplicity.

    Example:

    2642     2643     2635     2652     1     67.0     1500     1
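
    To make the geometry concrete, the sketch below computes the angle between the two atom-pair vectors and the corresponding energy, using the angle-restraint form given in the GROMACS reference manual, V = k (1 − cos(n (θ − θ0))). The function names are hypothetical and coordinates are plain (x, y, z) tuples supplied by the caller.

```python
import math


def pair_angle(a1, a2, a3, a4):
    """Angle in degrees between vector a1->a2 and vector a3->a4."""
    v1 = [q - p for p, q in zip(a1, a2)]
    v2 = [q - p for p, q in zip(a3, a4)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


def angle_restraint_energy(theta, theta0, force_constant, multiplicity=1):
    """Energy in kJ/mol for angle theta (deg) restrained to theta0 (deg)."""
    delta = math.radians(theta - theta0)
    return force_constant * (1.0 - math.cos(multiplicity * delta))
```

    The energy is zero when the measured angle equals Theta0 and reaches Force Constant when the two differ by 90 degrees at multiplicity 1.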
    

    Dihedral Restraints

    Dihedral angle restraints, effective only when Disre is not set to no. Format:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Label]  [Phi]  [dPhi]  [KFactor]  [Power]
    
    • AtomIndex1–AtomIndex4: Atom indices forming the dihedral angle.
    • Type: Restraint function type, always 1.
    • Label: Unused.
    • Phi: Reference angle. Unit: deg.
    • dPhi: Allowed deviation from the reference angle. Unit: deg.
    • KFactor: Multiplier. The restraint force constant is KFactor × “Disre Force Constant”. Unit: kJ/mol/rad².
    • Power: Unused.

    Example:

    2642      2643      2635      2652      1      67.0      1500      1
    

    The restraint potential is shown below:

    image.png

    Where Φ′ is the reference angle Phi, ΔΦ is the deviation dPhi, and K_dihr is the restraint force magnitude KFactor.

    Solute Box Type

    Controls the geometry of the solvent box.

    • cubic: Cubic box.
    • triclinic: General triclinic box.
    • dodecahedron: Rhombic dodecahedron, approximately spherical and smaller in volume. Typically used for protein or small-molecule systems because, for the same minimum distance, it needs approximately 30% fewer water molecules than a cubic box, which saves computational cost.
    • octahedron: Truncated octahedral box.

    Solute Box Distance

    Minimum distance from the molecular surface to the box boundary. Unit: nm.

    Output Description

    Output files include:

    | Output Filename | Description |
    | --- | --- |
    | system.gro | Molecular coordinates file of the system. |
    | system_itp.tar.gz | Forces applied to fix atom positions during system equilibration. |
    | system.top | Topology file of the system. |
    | index.ndx | GROMACS-generated index file defining atom or residue groups (index groups) in the system, used for selecting specific atom sets in subsequent analyses or calculations. |

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: 10.1016/j.softx.2015.06.001

  • Name: Human Germline BLAST v2.1
    Description: 通过序列比对在人类生殖系数据库中搜索与目标抗体序列接近的同源模板,输出对应的模板序列以及序列一致性信息。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Blast -> Human Germline BLAST。 Search the human germline database for homologs of the target antibody sequence and output the template sequences with their sequence identities. Recommended usage is through the WeSeq sequence editor: WeSeq -> Blast -> Human Germline BLAST.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-02-11 14:29:03
    Reference:

    Human Germline BLAST (v2.1)

    简介

    通过序列比对在人类生殖系数据库中搜索与目标抗体序列最接近的同源模板,输出对应的模板序列以及序列一致性信息。

    参数说明

    Sequence String模式

    Input Sequence

    抗体的序列(纯序列信息,非FASTA格式文件)。

    Type

    抗体编号类型:kabat、chothia、imgt。

    TopHits

    输出同源性最高的n条序列。

    Fasta File模式

    FASTA File

    抗体的序列文件,FASTA格式。

    Type

    抗体编号类型:kabat、chothia、imgt。

    TopHits

    输出同源性最高的n条序列。

    结果说明

    输出参数 输出文件名称 说明
    Hits Sequence hits.fasta 包含同源性最高的n条序列的序列文件
    Result result.json 包含找到的Germline模板以及序列的一致性信息

    相关内容

    抗体常用的germline模板:
    image.png

    临床后期及已上市抗体的germline配对分布情况(统计自Adimab数据集):
    image.png
    image.png
    Adimab_germline_usage.jpeg

    Human Germline BLAST (v2.1)

    Introduction

    This module uses sequence alignment to search the human germline database for the homologous templates closest to a target antibody sequence. It outputs the matching template sequences together with their sequence identity to the target.

    Parameter Description

    Sequence String Mode

    Input Sequence

    The antibody sequence, given as a raw sequence string rather than a FASTA file.

    Type

    Type of antibody numbering: kabat, chothia, imgt.

    TopHits

    Number of top hits to output.

    Fasta File Mode

    FASTA File

    Antibody sequence file in FASTA format.

    Type

    Type of antibody numbering: kabat, chothia, imgt.

    TopHits

    Number of top hits to output.

    Result Description

    | Output Parameter | Output File Name | Description |
    | --- | --- | --- |
    | Hits Sequence | hits.fasta | Sequence file containing the n sequences with the highest homology |
    | Result | result.json | The germline templates found and their sequence identity information |

    Related Content

    Commonly used germline templates for antibodies:
    image.png

    Distribution of germline pairing for late-stage and marketed antibodies (statistics from the Adimab dataset):
    image.png
    image.png
    Adimab_germline_usage.jpeg

  • Name: Grafting v2.4
    Description: Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.4 Graft antibody CDRs to target frameworks, normally for humanization. Version: v2.4
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-02-11 14:25:31
    Reference:

    Grafting v2.4

    简介

    Grafting模块是移植抗体的CDR到特定的框架区模板上,通常用于人源化设计。版本:v2.4

    参数说明

    Antibody Sequence File

    抗体序列文件,FASTA格式

    Numbering Type

    抗体编号规则:kabat,imgt,chothia

    Output File

    指定输出抗体graft后的序列文件名称,FASTA格式

    Output Policy

    指定输出graft策略文件,JSON格式

    Germline Score

    指定输出抗体FR区序列比对同源性打分文件

    Germline

    指定轻链或重链使用特定germline模板,也可都指定,写法如下:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    其中链名来自于流程第一步输入的fasta文件。
    例1:以下语句为链"Infliximab.H"指定了模板"IGHV3-7*01":

    Infliximab.H:IGHV3-7*01
    

    例2:以下语句为两条链分别指定了模板:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
    

    Template V Sequence

    指定抗体可变区 V 基因 的参考模板序列,FASTA格式。

    Template J Sequence

    指定抗体可变区 J 基因 的参考模板序列,FASTA格式。

    Germline Hits

    指定输出FR区序列比对结果文件,FASTA格式

    Number of Hits

    指定输出命中序列的数目

    结果说明

    输出结果包括:

    输出文件名称 说明
    germline_hits.fasta 输出FR区序列比对结果文件
    germline_score.json 输出抗体FR区序列比对同源性打分文件
    grafted.fasta 输出抗体graft后的序列文件名称
    graft_policy.json 输出graft策略文件

    Grafting v2.4

    Introduction

    The Grafting module grafts the CDRs of an antibody onto a specified framework region template, typically for humanization design. Version: v2.4

    Parameter Description

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Numbering Type

    Antibody numbering rule: kabat, imgt, chothia.

    Output File

    Specify the output file name for the grafted antibody sequence in FASTA format.

    Output Policy

    Specify the output grafting strategy file in JSON format.

    Germline Score

    Specify the output file for the homology scores of the antibody FR region sequences.

    Germline

    Assign a specific germline template to the light chain, the heavy chain, or both, in the format:

    seq_name1:germline_name1,seq_name2:germline_name2
    

    The chain names are taken from the FASTA file supplied in the first step of the workflow.
    Example 1: The following assigns the template “IGHV3-7*01” to the chain “Infliximab.H”:

    Infliximab.H:IGHV3-7*01
    

    Example 2: The following statement specifies templates for two chains separately:

    Infliximab.H:IGHV3-7*01,Infliximab.L:IGKV1-39*01
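
    A Germline string of this shape can be read into a chain-to-template mapping as sketched below. The function name parse_germline_spec is hypothetical and only illustrates the format; it is not the module's own parser.

```python
def parse_germline_spec(spec):
    """Turn 'chain:germline,chain:germline' into a dict keyed by chain name."""
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only; germline names such as
        # IGHV3-7*01 contain no colon, chain names may contain dots.
        chain, sep, germline = item.partition(":")
        if not sep or not chain or not germline:
            raise ValueError(f"expected seq_name:germline_name, got {item!r}")
        mapping[chain] = germline
    return mapping
```

    Applied to Example 2, this yields one entry for Infliximab.H and one for Infliximab.L.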
    

    Template V Sequence

    Specify the reference template sequence of the antibody V gene in FASTA format.

    Template J Sequence

    Specify the reference template sequence of the antibody J gene in FASTA format.

    Germline Hits

    Specify the output file for the FR region sequence alignment results in FASTA format.

    Number of Hits

    Specify the number of sequences to output.

    Result Description

    The output includes:

    | Output File Name | Description |
    | --- | --- |
    | germline_hits.fasta | FR region sequence alignment results |
    | germline_score.json | Homology scores from the FR region sequence alignment |
    | grafted.fasta | Grafted antibody sequences |
    | graft_policy.json | Grafting strategy |
  • Name: Mutation Energy of Stability (Pythia)
    Description: 基于自监督图神经网络预测突变对蛋白稳定性影响。 A self-supervised graph neural network for protein stability prediction upon mutation.
    Tags: undefined
    Author: Jinyuan Sun
    Release: 2025-02-10 10:28:28
    Reference: Jinyuan Sun, Tong Zhu, Yinglu Cui, Bian Wu, Structure-based self-supervised learning enables ultrafast protein stability prediction upon mutation, The Innovation, Volume 6, Issue 1, 2025, 100750, ISSN 2666-6758

    Mutation Energy of Stability (Pythia)

    简介

    该模块基于Pythia模型实现,该模型是一种针对零样本 ∆∆G 预测量身定制的自监督图神经网络。

    蛋白质突变效应预测是解码分子进化机制、优化蛋白质工程改造的关键物理量。然而,传统预测方法面临两大挑战:一是基于物理力场的计算方法(如自由能微扰)计算复杂度高,难以满足大规模筛选需求;二是依赖于实验数据的监督学习方法易受训练集偏差影响,泛化能力受限。

    为了应对这些问题,研究团队提出了Pythia框架,它结合了图神经网络与注意力机制,能够直接从蛋白质的三维结构中学习氨基酸之间的相互作用。通过这种“零监督”预训练策略,Pythia突破了传统方法对标记数据的依赖,成功捕捉了蛋白质折叠过程中隐藏的物理化学约束规律。

    Pythia的模型架构采用了将蛋白质局部结构转化为k近邻图的方式,每个氨基酸作为节点,通过欧几里得距离连接其32个最近的氨基酸。节点的特征包括氨基酸类型以及主链的二面角,边的特征则涉及主链原子之间的距离、序列位置和链信息。通过消息传递神经网络(MPNN)架构,Pythia可以高效地更新每个氨基酸节点的信息,并对突变的稳定性变化进行准确预测。

    与传统的基于物理力场的方法相比,Pythia能够在单核计算中实现每分钟预测约50,000个突变,速度提升了5个数量级。其在标准测试集S2648上的Spearman相关系数为0.616,Pearson相关系数为0.598,表现优于现有的所有对比模型。这一进展为大规模蛋白质序列空间扫描提供了强大的计算支持,能够处理多达2600万个高质量蛋白质结构数据,显著加深了我们对蛋白质序列空间的理解。

    在实验验证中,Pythia表现出了比传统能量函数方法高出一倍的成功率,充分证明了其在实际应用中的可靠性。同时,Pythia的可解释性也为蛋白质工程提供了宝贵的生物学见解,使其更易于应用于复杂的蛋白质工程任务。

    模型架构:Pythia将蛋白质局部结构转换为k近邻图,其中每个氨基酸作为一个节点,并通过欧几里得距离连接其32个最近的氨基酸。节点的特征包括氨基酸类型和主链的二面角(φ、ψ、ω),边的特征包括主链原子之间的距离、序列位置和链信息。
    image.png

    训练目标:Pythia的训练目标是预测中心节点的自然氨基酸类型,使用来自节点和边的信息。

    消息传递神经网络(MPNN):Pythia采用消息传递神经网络(MPNN)架构,具体为带有注意力机制的消息传递层(AMPL)。在每个AMPL层中,顶点表示通过注意力块更新,然后与边表示连接以派生消息表示,最终通过另一个注意力块进一步细化节点表示。

    损失函数:通过估计特定位置处每个氨基酸的概率来实现ΔΔG的预测。

    在与其他自监督预训练模型和基于力场的方法的比较基准中,Pythia以极高的相关性超越其他同类算法,同时以最少的参数运行,使得计算速度显著加快,高达10^5倍。Pythia的功效通过其在预测柠檬烯环氧水解酶 (LEH) 的热稳定突变中的应用得到证实,实验成功率显著提高。
    S2648数据集上的性能:Pythia在S2648数据集上的Spearman相关系数为0.616,Pearson相关系数为0.598,优于所有测试的模型。
    S669数据集上的性能:在S669数据集上,Pythia的Spearman相关系数为0.66,在所有评估的方法中表现最佳。
    image.png

    大规模数据集上的性能:在一个包含约100万个突变的百万级数据集上,Pythia的Spearman相关系数为0.602,Pearson相关系数为0.633,AUROC为0.83,AUPRC为0.88。
    计算速度:Pythia的计算速度比传统的力场方法快10^5倍,能够在20秒内完成S2648数据集的计算,单核速度约为50,000个突变/分钟。

    参数说明

    Structure PDB

    蛋白结构文件,PDB格式。不支持含有非标准氨基酸的蛋白。

    Chain

    指定要突变扫描的链名,可多链,用英文逗号分隔,如:A,B,默认为空,表示全部链都扫描。

    Numbering Type

    抗体编号规则,支持Kabat, Chothia和IMGT,默认为Kabat。

    TopN

    指定输出能量最优的前N个突变对应的序列,默认为100。

    Output

    输出文件名称,默认mutation_energy.csv。

    Output_fmt

    特定格式化的输出文件名称,默认mutation_energy_fmt.csv。

    Output_Chain_Seq

    输出TopN对应的突变链的序列,默认为mutant_seqs.fasta。

    Output_Cpx_Seq

    输出TopN对应的复合物序列,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式),默认为mutant_seqs_complex.fasta。

    备注:当前24GB的GPU显存支持计算的残基数量在2000个左右。

    结果说明

    输出mutation_energy.csv结果文件,包含以下信息:

    字段名称 说明
    Chain 链名称,如:'A’表示A链
    Mutation 单点突变信息,如:'G1A’表示该链中,残基位置编号为1的残基甘氨酸G,突变为丙氨酸A,残基位置编号从1开始按顺序编号(非PDB文件中的残基序号)
    Energy 突变对应的能量变化,负值表示突变使得体系能量降低,体系变得更稳定。负得越多表示稳定性提升越多

    输出mutation_energy_fmt.csv结果文件,包含如下信息:

    字段名称 说明
    Chain PDB结构中的链名称
    WT PDB结构中的初始AA
    Pos AA位置编号,从1开始
    Consensus 该位置出现能量最优的AA
    L,A,G,V… 该位置每种AA对应的能量变化值,变化值为负时,表示更稳定,负得越多,越稳定

    输出结果对应的热图mutation_energy_[chain].png
    输出TopN对应的突变链的序列mutant_seqs.fasta。
    输出TopN对应的复合物序列,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式)mutant_seqs_complex.fasta。

    参考文献

    • Jinyuan Sun, Tong Zhu, Yinglu Cui, Bian Wu, Structure-based self-supervised learning enables ultrafast protein stability prediction upon mutation, The Innovation, Volume 6, Issue 1, 2025, 100750, ISSN 2666-6758, DOI: 10.1016/j.xinn.2024.100750

    Mutation Energy of Stability (Pythia)

    Introduction

    This module is implemented based on the Pythia model, which is a self-supervised graph neural network specifically designed for zero-shot ∆∆G prediction.

    Predicting the effects of protein mutations is a key factor in decoding molecular evolution mechanisms and optimizing protein engineering modifications. However, traditional prediction methods face two major challenges: first, computational methods based on physical force fields (such as free energy perturbation) have high computational complexity, making them unsuitable for large-scale screening; second, supervised learning methods that rely on experimental data are susceptible to training set biases, limiting their generalization ability.

    To address these issues, the research team proposed the Pythia framework, which combines graph neural networks with attention mechanisms to learn interactions between amino acids directly from the three-dimensional structure of proteins. Through this “zero-supervision” pre-training strategy, Pythia overcomes the traditional methods’ dependence on labeled data and successfully captures the hidden physicochemical constraints in the protein folding process.

    The architecture of Pythia converts the local structure of proteins into k-nearest neighbor graphs, where each amino acid acts as a node connected to its 32 nearest amino acids based on Euclidean distance. Node features include amino acid type and backbone dihedral angles, while edge features involve distances between backbone atoms, sequence positions, and chain information. Using a message-passing neural network (MPNN) architecture, Pythia efficiently updates information for each amino acid node and accurately predicts changes in mutation stability.

    Compared to traditional physical force field-based methods, Pythia can predict approximately 50,000 mutations per minute on a single-core processor, achieving a speed increase of five orders of magnitude. On the standard test set S2648, it achieves a Spearman correlation coefficient of 0.616 and a Pearson correlation coefficient of 0.598, outperforming all existing comparative models. This advancement provides powerful computational support for large-scale scanning of protein sequence space, capable of handling up to 26 million high-quality protein structure data points, significantly deepening our understanding of protein sequence space.

    In experimental validation, Pythia demonstrated a success rate twice as high as traditional energy function methods, fully proving its reliability in practical applications. Additionally, Pythia’s interpretability offers valuable biological insights for protein engineering, making it more applicable to complex protein engineering tasks.

    Model Architecture: Pythia transforms the local structure of proteins into a k-nearest neighbor graph, where each amino acid is represented as a node, connected to its 32 nearest amino acids by Euclidean distance. The features of the nodes include the amino acid type and the backbone dihedrals (φ, ψ, ω), while the features of the edges include the distances between backbone atoms, sequence positions, and chain information.
    image.png
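
    The graph construction described above can be illustrated with a small pure-Python sketch. It assumes one representative coordinate per residue (for instance the C-alpha atom, which is an assumption here) and uses a hypothetical function name, knn_graph; it is not Pythia's implementation.

```python
import math


def knn_graph(coords, k=32):
    """Map each residue index to the indices of its k nearest residues.

    `coords` is a list of (x, y, z) tuples, one per residue. Neighbors are
    ranked by Euclidean distance; ties are broken by residue index.
    """
    neighbors = {}
    for i, ci in enumerate(coords):
        ranked = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i
        )
        # Chains shorter than k + 1 residues simply get fewer neighbors.
        neighbors[i] = [j for _, j in ranked[:k]]
    return neighbors
```

    Each residue becomes a node with up to k outgoing edges; real implementations typically use vectorized distance matrices rather than this quadratic loop.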

    Training Objective: The training objective of Pythia is to predict the natural amino acid type of the central node, using information from both nodes and edges.

    Message Passing Neural Network (MPNN): Pythia employs a message passing neural network (MPNN) architecture built from attention-based message passing layers (AMPL). In each AMPL layer, the vertex representations are first updated by an attention block and then concatenated with the edge representations to derive message representations; a second attention block uses these messages to further refine the node representations.

    Loss Function: The prediction of ΔΔG is achieved by estimating the probability of each amino acid at specific positions.
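
    One common way to turn per-position amino acid probabilities into a zero-shot stability score is a negative log-likelihood ratio between the mutant and the wild-type residue. The sketch below assumes that formulation; the exact scaling Pythia applies is not stated here, and the function name mutation_score is hypothetical.

```python
import math


def mutation_score(probabilities, wild_type, mutant):
    """Score a substitution from predicted amino acid probabilities.

    `probabilities` maps one-letter amino acid codes to the model's
    predicted probability at a single position. A negative score means
    the mutant residue is predicted to be more likely than the wild type,
    which is read as stabilizing under the assumed formulation.
    """
    return -math.log(probabilities[mutant] / probabilities[wild_type])
```

    Under this convention a mutant residue that the model finds twice as probable as the wild type scores −ln 2, which matches the sign convention of the Energy column (negative means more stable).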

    In benchmark comparisons with other self-supervised pre-trained models and force-field-based methods, Pythia achieves higher correlations than the other methods while using the fewest parameters, which speeds up computation by up to 10^5-fold. Its effectiveness was further demonstrated by predicting thermostabilizing mutations of limonene epoxide hydrolase (LEH), where it notably increased the experimental success rate.
    Performance on the S2648 Dataset: Pythia achieves a Spearman correlation coefficient of 0.616 and a Pearson correlation coefficient of 0.598 on the S2648 dataset, outperforming all tested models.
    Performance on the S669 Dataset: On the S669 dataset, Pythia achieves a Spearman correlation coefficient of 0.66, performing the best among all evaluated methods.
    image.png

    Performance on Large-scale Datasets: On a large dataset containing approximately 1 million mutations, Pythia achieves a Spearman correlation coefficient of 0.602, a Pearson correlation coefficient of 0.633, an AUROC of 0.83, and an AUPRC of 0.88.
    Computational Speed: Pythia is 10^5 times faster than traditional force-field methods, completing the S2648 dataset in 20 seconds at a single-core speed of approximately 50,000 mutations per minute.

    Parameters

    Structure PDB

    Protein structure file in PDB format. Proteins containing non-standard amino acids are not supported.

    Chain

    Specify the chain names to be scanned for mutations. Multiple chains can be listed, separated by commas, e.g., A,B. The default is empty, which means all chains will be scanned.

    Numbering Type

    Antibody numbering schemes, supporting Kabat, Chothia, and IMGT. The default scheme is Kabat.

    TopN

    Number of top-ranked mutations (those with the most favorable energy) whose sequences are output. The default value is 100.

    Output

    Output file name. The default is mutation_energy.csv.

    Output_fmt

    Formatted output file name. The default is mutation_energy_fmt.csv.

    Output_Chain_Seq

    Output file for the sequences of the mutated chains corresponding to TopN. The default is mutant_seqs.fasta.

    Output_Cpx_Seq

    Output file for the sequences of the complexes corresponding to TopN. The chains within each complex are separated by colons (:), as used in the batch mode of Boltz2 structure prediction. The default is mutant_seqs_complex.fasta.

    Note: A GPU with 24 GB of memory currently supports roughly 2,000 residues.

    Results

    Outputs a mutation_energy.csv file containing the following information:

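    | Field Name | Description |
    | --- | --- |
    | Chain | Chain name, e.g., ‘A’ represents chain A |
    | Mutation | Single point mutation, e.g., ‘G1A’ means that the residue at position 1 of this chain, glycine (G), is mutated to alanine (A). Positions are numbered sequentially starting from 1 (not the residue numbering in the PDB file). |
    | Energy | Energy change associated with the mutation. Negative values indicate that the mutation lowers the system’s energy and makes it more stable; the more negative the value, the greater the gain in stability. |

    Outputs a mutation_energy_fmt.csv file containing the following information:

    | Field Name | Description |
    | --- | --- |
    | Chain | Chain name in the PDB structure |
    | WT | Initial AA in the PDB structure |
    | Pos | Position index of the AA, starting from 1 |
    | Consensus | The AA with the most favorable energy at that position |
    | L, A, G, V… | Energy change for each AA at that position. Negative values indicate greater stability; the more negative, the more stable. |

    A heatmap of the results is written to mutation_energy_[chain].png.
    The sequences of the mutated chains corresponding to TopN are written to mutant_seqs.fasta.
    The sequences of the complexes corresponding to TopN are written to mutant_seqs_complex.fasta, with the chains within each complex separated by colons (:) (batch mode of Boltz2 structure prediction).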
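
    The mutation notation and the colon-separated complex sequence can be illustrated as follows. This is a sketch under stated assumptions: apply_mutation and complex_sequence are hypothetical names, positions are the sequential 1-based indices described above, and the exact FASTA layout the module writes is not reproduced.

```python
import re


def apply_mutation(sequence, mutation):
    """Apply a point mutation such as 'G1A' using 1-based sequential positions."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != wild_type:
        raise ValueError(f"expected {wild_type} at position {position}")
    return sequence[: position - 1] + mutant + sequence[position:]


def complex_sequence(chains, chain_id, mutation):
    """Join all chains with ':' after mutating one of them.

    `chains` maps chain names to sequences; insertion order is kept.
    """
    mutated = dict(chains)
    mutated[chain_id] = apply_mutation(chains[chain_id], mutation)
    return ":".join(mutated.values())
```

    For a two-chain complex with chain A "GSTA" and chain B "MK", mutation G1A on chain A gives the complex string "ASTA:MK".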

    References

    • Jinyuan Sun, Tong Zhu, Yinglu Cui, Bian Wu, Structure-based self-supervised learning enables ultrafast protein stability prediction upon mutation, The Innovation, Volume 6, Issue 1, 2025, 100750, ISSN 2666-6758, DOI: 10.1016/j.xinn.2024.100750
  • Name: Mutation Energy of Binding (Pythia-PPI)
    Description: 基于深度学习和多任务学习的预测突变对蛋白-蛋白亲和力影响。 Deep learning and multi-task learning based prediction of protein-protein binding affinity changes upon mutations.
    Tags: undefined
    Author: Fangting Tao
    Release: 2025-02-10 10:36:50
    Reference: Fangting Tao, Jinyuan Sun, Bian Wu, George F Gao. Reliable prediction of protein-protein binding affinity changes upon mutations with Pythia-PPI, bioRxiv 2024.10.28.620752.

    Mutation Energy of Binding (Pythia-PPI)

    简介

    Mutation Energy of Binding (Pythia-PPI)模块基于Pythia-PPI模型实现,该模型基于深度学习,结合了多任务学习和自蒸馏策略,以克服实验数据稀缺的瓶颈,并提高预测准确性。Pythia-PPI由两个模块组成:预训练的结构图编码器模块和ΔΔG预测模块。该模型使用k-最近邻(k-NN)图将蛋白质或蛋白质-蛋白质复合物的局部结构转换为图表示,每个氨基酸作为一个节点,与其32个最近的氨基酸基于C-alpha原子的欧几里得距离建立连接。输入的结构图编码器结合了氨基酸类型的一热编码,以及使用正弦和余弦函数表示的主链二面角(φ、ψ和ω)作为节点特征。边特征则考虑了五个主链原子(C-alpha、C、N、O和C-beta)之间的距离,以及序列位置和链信息。通过结构图编码器,节点和边输入特征被转换为嵌入,这些嵌入与预训练模块中的氨基酸概率相结合,形成ΔΔG预测模块的输入向量。Pythia-PPI采用迁移学习和多任务学习相结合的方法,共享结构编码器层以预测突变对PPI结合亲和力和蛋白质稳定性的影响。
    image.png
    使用了SKEMPI数据集进行基准测试,并与其他方法进行了比较。结果显示,Pythia-PPI在SKEMPI数据集上的皮尔逊相关系数从0.6447提高到0.7850,在病毒-受体数据集上的皮尔逊相关系数从0.3654提高到0.6051。这些结果表明Pythia-PPI是一个分析蛋白质-蛋白质相互作用适应性景观的有力工具。
    image.png

    参数说明

    Structure PDB

    蛋白复合物结构文件,PDB格式。不支持含有非标准氨基酸的蛋白。

    Chain

    指定要突变扫描的链名,可多链,用英文逗号分隔,如:A,B,默认为空,表示全部链都扫描。

    Numbering Type

    抗体编号规则,支持Kabat, Chothia和IMGT,默认为Kabat。

    TopN

    指定输出能量最优的前N个突变对应的序列,默认为100。

    Output

    输出文件名称,默认mutation_ddg.csv。

    Output_fmt

    特定格式化输出的结果文件名称,默认mutation_ddg_fmt.csv。

    Output_Chain_Seq

    输出TopN对应的突变链的序列,默认为mutant_seqs.fasta。

    Output_Cpx_Seq

    输出TopN对应的复合物序列,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式),默认为mutant_seqs_complex.fasta。

    备注:当前24GB的GPU显存支持计算的残基数量在1500个左右。

    结果说明

    输出mutation_ddg.csv结果文件,包含以下信息:

    字段名称 说明
    Chain 链名称,如:'A’表示A链
    Mutation 单点突变信息,如:'G1A’表示该链中,残基位置编号为1的残基甘氨酸G,突变为丙氨酸A,残基位置编号从1开始按顺序编号(非PDB文件中的残基序号)
    Energe(Pythia-PPI) 突变对应的结合自由能ddG变化,负值表示突变使得亲和力变高,负得越多表示亲和力提升越多

    输出mutation_ddg_fmt.csv结果文件,包含如下信息:

    字段名称 说明
    Chain PDB结构中的链名称
    WT PDB结构中的初始AA
    Pos AA位置编号,从1开始
    Consensus 该位置出现能量最优的AA
    L,A,G,V… 该位置每种AA对应的能量变化值,变化值为负时,表示更稳定,负得越多,越稳定

    输出结果对应的热图mutation_ddg_[chain].png

    输出TopN对应的突变链的序列mutant_seqs.fasta。
    输出TopN对应的复合物序列mutant_seqs_complex.fasta,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式)。

    参考文献

    • Reliable prediction of protein-protein binding affinity changes upon mutations with Pythia-PPI, Fangting Tao, Jinyuan Sun, Bian Wu, George F Gao bioRxiv 2024.10.28.620752; DOI: 10.1101/2024.10.28.620752

    Mutation Energy of Binding (Pythia-PPI)

    Introduction

    The Mutation Energy of Binding (Pythia-PPI) module is implemented based on the Pythia-PPI model, which utilizes deep learning and combines multi-task learning with a self-distillation strategy to overcome the bottleneck of scarce experimental data and improve prediction accuracy. Pythia-PPI consists of two modules: a pre-trained structural graph encoder module and a ΔΔG prediction module. The model uses a k-nearest neighbors (k-NN) graph to convert the local structure of proteins or protein-protein complexes into a graph representation, where each amino acid is represented as a node, connected to its 32 nearest amino acids based on the Euclidean distance of C-alpha atoms. The input structural graph encoder combines one-hot encoding of amino acid types with backbone dihedrals (φ, ψ, and ω) represented using sine and cosine functions as node features. Edge features take into account the distances between five backbone atoms (C-alpha, C, N, O, and C-beta), as well as sequence positions and chain information. Through the structural graph encoder, the input features for nodes and edges are transformed into embeddings, which are combined with amino acid probabilities from the pre-trained module to form the input vector for the ΔΔG prediction module. Pythia-PPI employs a combination of transfer learning and multi-task learning, sharing structural encoder layers to predict the effects of mutations on PPI binding affinity and protein stability.
    image.png
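
    The node features described above (a one-hot amino acid type plus sine and cosine of the three backbone dihedrals) can be sketched as a 26-dimensional vector. The amino acid ordering, the use of degrees as input, and the function name node_features are assumptions made for illustration only.

```python
import math

# Assumed ordering of the 20 standard amino acids (alphabetical by code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def node_features(residue, phi, psi, omega):
    """Return one-hot residue type followed by sin/cos of phi, psi, omega.

    Angles are given in degrees. Encoding each angle as (sin, cos) keeps
    the representation continuous across the -180/180 degree wrap-around.
    """
    if residue not in AMINO_ACIDS:
        raise ValueError(f"non-standard amino acid: {residue!r}")
    one_hot = [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
    angle_features = []
    for angle in (phi, psi, omega):
        radians = math.radians(angle)
        angle_features += [math.sin(radians), math.cos(radians)]
    return one_hot + angle_features
```

    The resulting vector has 20 + 6 = 26 entries per residue; edge features (inter-atomic distances, sequence offsets, chain membership) would be built separately.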

    Pythia-PPI was benchmarked on the SKEMPI dataset against other methods. It raised the Pearson correlation coefficient from 0.6447 to 0.7850 on the SKEMPI dataset and from 0.3654 to 0.6051 on the virus-receptor dataset. These results indicate that Pythia-PPI is a powerful tool for analyzing the fitness landscape of protein-protein interactions.
    image.png

    Parameters

    Structure PDB

    Protein complex structure file in PDB format. Proteins containing non-standard amino acids are not supported.

    Chain

    Specify the chain names to be scanned for mutations. Multiple chains can be listed, separated by commas, e.g., A,B. The default is empty, which means all chains will be scanned.

    Numbering Type

    Antibody numbering schemes, supporting Kabat, Chothia, and IMGT. The default scheme is Kabat.

    TopN

    Number of top-ranked mutations (those with the most favorable energy) whose sequences are output. The default value is 100.

    Output

    Output file name. The default is mutation_ddg.csv.

    Output_fmt

    Formatted output file name. The default is mutation_ddg_fmt.csv.

    Output_Chain_Seq

    Output file for the sequences of the mutated chains corresponding to TopN. The default is mutant_seqs.fasta.

    Output_Cpx_Seq

    Output file for the sequences of the complexes corresponding to TopN. The chains within each complex are separated by colons (:), as used in the batch mode of Boltz2 structure prediction. The default is mutant_seqs_complex.fasta.

    Note: A GPU with 24 GB of memory currently supports roughly 1,500 residues.

    Results

    Outputs a mutation_ddg.csv file containing the following information:

    | Field Name | Description |
    | --- | --- |
    | Chain | Chain name, e.g., ‘A’ represents chain A |
    | Mutation | Single point mutation, e.g., ‘G1A’ means that the residue at position 1 of this chain, glycine (G), is mutated to alanine (A). Positions are numbered sequentially starting from 1 (not the residue numbering in the PDB file). |
    | Energe(Pythia-PPI) | The change in binding free energy (ddG) corresponding to the mutation; negative values indicate that the mutation increases affinity, with more negative values indicating a greater increase in affinity. |

    Outputs a mutation_ddg_fmt.csv file containing the following information:

    | Field Name | Description |
    | --- | --- |
    | Chain | Chain name in the PDB structure |
    | WT | Initial AA in the PDB structure |
    | Pos | Position index of the AA, starting from 1 |
    | Consensus | The AA with the most favorable ddG at that position |
    | L, A, G, V… | The ddG of each AA at that position. Negative values indicate that the mutation increases affinity, with more negative values indicating a greater increase in affinity. |

    A heatmap of the results is written to mutation_ddg_[chain].png.

    The sequences of the mutated chains corresponding to TopN are written to mutant_seqs.fasta.

    The sequences of the complexes corresponding to TopN are written to mutant_seqs_complex.fasta, with the chains within each complex separated by colons (:) (batch mode of Boltz2 structure prediction).

    References

    • Reliable prediction of protein-protein binding affinity changes upon mutations with Pythia-PPI, Fangting Tao, Jinyuan Sun, Bian Wu, George F Gao bioRxiv 2024.10.28.620752; DOI: 10.1101/2024.10.28.620752
  • Name: Antibody (Off-) Target Prediction (WeTarScan)
    Description: 基于结构相似性原理从抗原-抗体数据库中(相似抗体可能具有相似靶点)预测抗体的潜在靶点(脱靶效应)。 Structure similarity-based antibody (Off-) target prediction from antibody-antigen interaction database.
    Tags: undefined
    Author: WECOMPUT
    Release: 2025-01-06 10:17:53
    Reference:

    Antibody (Off-) Target Prediction (WeTarScan)

    简介

    Antibody (Off-) Target Prediction模块对输入的抗体进行潜在靶点预测,基于丰富的抗体-抗原相互作用数据库,寻找与输入抗体在序列及结构上高度相似的一系列抗体。基于相似性原理(相似抗体可能具有相似靶点),这些高度相似的抗体对应的抗原靶点可能是输入抗体的潜在靶点。当前抗体-抗原相互作用数据库包含16万对抗原-抗体复合物,主要来源于文献、专利等开源数据。

    参数说明

    Antibody Structure

    待预测靶点的抗体结构文件,PDB格式或CIF格式。

    Mode

    搜索模式,支持4种模式(默认为模式2):

    • 模式1: 完整抗体模式,以完整的抗体重轻链结构进行数据库检索。
    • 模式2: 抗体CDR模式,仅提取抗体的CDR区域结构进行数据库检索。
    • 模式3: 抗体重链CDR模式,仅提取抗体的重链CDR区域结构进行数据库检索。
    • 模式4: 抗体重链CDR3模式,仅提取抗体的重链CDR3区域结构进行数据库检索。

    注意: 纳米抗体VHH只能使用模式3或模式4,使用其他模式会提示错误。

    TopN

    保留打分排名最高的前N个结果,默认为50。

    Species

    物种信息过滤:

    • Human表示仅保留人源靶点。
    • Any表示不做任何限制。

    Output

    输出结果的文件名,默认为pred_hits.csv

    结果说明

    结果文件有多个,根据抗体结构来源不同会有不同的预测结果,以及合并后的最终结果。
    当前抗体结构来源有2种:实验结构(来自PDB数据库)、Boltz模型预测结构。
    结果文件有:

    • 基于实验结构来源pred_hits_Experimental.csv
    • 基于Boltz预测结构pred_hits_Boltz.csv
    • 合并上述两者的最终结果pred_hits.csv

    pred_hits_Experimental.csv,pred_hits_Boltz.csv包含如下信息:

    字段名 说明
    Query 查询抗体结构名称
    Database 抗体结构来源
    Antigen Name 预测的靶点名称
    Description 对数据库结构的描述
    Antigen Organism 靶点的来源物种
    Comprehensive Score 潜在靶点的综合打分,数值在0-1.0之间,越接近1.0,表示成为抗体靶点的可能性越大,默认基于该打分对潜在靶点进行排序。该打分综合了多种结构比对与复合物评价指标。
    Alignment TMScore \ Query TMScore \ Target TMScore TM-score (Template Modeling Score) 是一种结构比对指标,用于衡量两个蛋白质三维结构的相似性,与 RMSD相比,TM-score 更加稳定,对结构长度不敏感,能更准确地反映蛋白质结构的全局相似性。其取值范围在0到1之间,TM-score > 0.5 表示显著相似。其中,Query TMScore指使用查询抗体结构进行长度归一化;Target TMScore指使用数据库抗体结构进行长度归一化;Alignment TMScore指使用查询抗体和数据库抗体的序列匹配区的结构进行长度归一化。
    DockQ 衡量抗体与潜在靶点之间的虚拟结合参数,其值在0-1.0之间,越大表示抗体越能与潜在靶点结合。

    pred_hits.csv包含信息如下:

    字段名 说明
    Query 查询抗体结构名称
    Antigen Name 预测的靶点名称
    Description 对数据库结构的描述
    Antigen Organism 靶点的来源物种
    Comprehensive Score (Boltz) 基于Boltz预测结构的抗体结构数据库对应的综合打分。
    Comprehensive Score (Experimental) 基于实验结构的抗体结构数据库对应的综合打分。
    Comprehensive Score 不同数据库来源的综合打分平均值,默认基于该打分对潜在靶点进行排序。

    参考文献

    • van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
    • Schneider, C., Raybould, M.I.J., Deane, C.M. (2022) SAbDab in the Age of Biotherapeutics: Updates including SAbDab-Nano, the Nanobody Structure Tracker. Nucleic Acids Res. 50(D1):D1368-D1372
    • Brennan Abanades et al. “The Patent and Literature Antibody Database (PLAbDab): an evolving reference set of functionally diverse, literature-annotated antibody sequences and structures”. In: Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D545-D551
    • Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017)

    Antibody (Off-) Target Prediction (WeTarScan)

    Introduction

    The Antibody (Off-) Target Prediction module predicts potential targets for the input antibody. Based on a rich database of antibody-antigen interactions, it identifies a series of antibodies that are highly similar to the input antibody in both sequence and structure. Following the principle of similarity (similar antibodies may have similar targets), the antigen targets corresponding to these highly similar antibodies could be potential targets for the input antibody. The current antibody-antigen interaction database contains 160,000 antigen-antibody complexes, primarily sourced from open-source data such as literature and patents.

    Parameter

    Antibody Structure

    Structure file of the antibody whose targets are to be predicted, in PDB or CIF format.

    Mode

    Search Modes, supporting 4 modes (default is Mode 2):

    • Mode 1: Full Antibody Mode, where the complete heavy and light chain structure of the antibody is used for database search.
    • Mode 2: Antibody CDR Mode, where only the CDR regions of the antibody are extracted for database search.
    • Mode 3: Antibody Heavy Chain CDR Mode, where only the CDR regions of the heavy chain are extracted for database search.
    • Mode 4: Antibody Heavy Chain CDR3 Mode, where only the CDR3 region of the heavy chain is extracted for database search.

    Note: Nanobodies (VHH) can only use Mode 3 or Mode 4; any other mode reports an error.

    TopN

    Retain the top N results with the highest scores, with the default being 50.

    Species

    Species Information Filtering:

    • Human: Retain only human-derived targets.
    • Any: No restrictions.

    Output

    Name of the output file; default is pred_hits.csv.

    Result

    There are multiple output files, each corresponding to a different antibody-structure source, plus a final merged result.
    Current antibody-structure sources are:

    • Experimental structures (from the PDB)
    • Structures predicted by the Boltz model

    Output files:

    • pred_hits_Experimental.csv – predictions based on experimental structures
    • pred_hits_Boltz.csv – predictions based on Boltz-predicted structures
    • pred_hits.csv – merged final results

    Contents of pred_hits_Experimental.csv and pred_hits_Boltz.csv:

    Field Description
    Query Name of the query antibody structure
    Database Source of the antibody structure
    Antigen Name Predicted target name
    Description Description of the database entry
    Antigen Organism Species of origin for the predicted target
    Comprehensive Score Overall score (0–1.0) for the potential target; closer to 1.0 indicates a higher likelihood of being the antibody’s true target. Targets are ranked by this score by default. The score integrates multiple structural-alignment and complex-quality metrics.
    Alignment TMScore / Query TMScore / Target TMScore TM-score (Template Modeling Score) measures global structural similarity between two protein 3-D structures. It is more robust and length-insensitive than RMSD, with values from 0 to 1. TM-score > 0.5 indicates significant similarity. Query TMScore normalizes by query antibody length; Target TMScore normalizes by database antibody length; Alignment TMScore normalizes by the structurally aligned region shared by both antibodies.
    DockQ Virtual binding quality score between the antibody and the potential target (0–1.0); higher values suggest stronger predicted binding.
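
    The three TMScore columns differ only in the length used for normalization. The sketch below uses the standard TM-score definition (a sum of 1 / (1 + (d/d0)^2) over aligned residue pairs, divided by the normalizing length). The function name `tm_score`, the example lengths, and the floor of 0.5 on d0 for short chains are assumptions made for illustration; the source does not state how WeTarScan computes these values internally.

```python
def tm_score(distances, norm_length):
    """TM-score of an alignment, normalized by norm_length.

    distances: per-residue distances (in Angstrom) between aligned residue pairs
    norm_length: length used for normalization (query, target, or aligned region)
    """
    # Length-dependent distance scale; short chains fall back to 0.5 (assumed convention).
    if norm_length > 21:
        d0 = max(0.5, 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8)
    else:
        d0 = 0.5
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_length


# Hypothetical alignment: 100 perfectly superposed residue pairs.
aligned = [0.0] * 100
query_tm = tm_score(aligned, norm_length=120)           # normalized by query length
target_tm = tm_score(aligned, norm_length=200)          # normalized by database antibody length
alignment_tm = tm_score(aligned, norm_length=len(aligned))  # normalized by aligned region
```

    The same alignment therefore gives a lower score when normalized by the longer of the two structures, which is why the three columns can disagree.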

    Contents of pred_hits.csv:

    Field Description
    Query Name of the query antibody structure
    Antigen Name Predicted target name
    Description Description of the database entry
    Antigen Organism Species of origin for the predicted target
    Comprehensive Score (Boltz) Comprehensive score derived from the Boltz-predicted structure database
    Comprehensive Score (Experimental) Comprehensive score derived from the experimental-structure database
    Comprehensive Score Mean of the scores from all sources; targets are ranked by this value by default
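
    As an illustration of how the merged table relates to the per-source tables, the sketch below averages the Comprehensive Score of each (Query, Antigen Name) pair across sources and sorts by the mean. The function `merge_hits` is hypothetical, and averaging only over the sources in which a target appears is an assumption; the source does not say how a target missing from one source is handled.

```python
from statistics import mean


def merge_hits(hits_by_source):
    """Merge per-source hit rows into one ranked list.

    hits_by_source: dict mapping a source name (e.g. "Experimental", "Boltz")
    to a list of row dicts with the keys "Query", "Antigen Name" and
    "Comprehensive Score".
    """
    merged = {}
    for source, rows in hits_by_source.items():
        for row in rows:
            key = (row["Query"], row["Antigen Name"])
            merged.setdefault(key, {})[source] = float(row["Comprehensive Score"])

    result = []
    for (query, antigen), scores in merged.items():
        out = {"Query": query, "Antigen Name": antigen}
        for source, score in scores.items():
            out[f"Comprehensive Score ({source})"] = score
        # Assumption: average over the sources that actually report this target.
        out["Comprehensive Score"] = mean(scores.values())
        result.append(out)

    # Rank by the averaged score, highest first.
    result.sort(key=lambda r: r["Comprehensive Score"], reverse=True)
    return result
```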

    Reference

    • van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
    • Schneider, C., Raybould, M.I.J., Deane, C.M. (2022) SAbDab in the Age of Biotherapeutics: Updates including SAbDab-Nano, the Nanobody Structure Tracker. Nucleic Acids Res. 50(D1):D1368-D1372
    • Brennan Abanades et al. “The Patent and Literature Antibody Database (PLAbDab): an evolving reference set of functionally diverse, literature-annotated antibody sequences and structures”. In: Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D545-D551
    • Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017)
  • Name: Protein Thermostability Prediction
    Description: 预测蛋白质热稳定性的深度学习工具,包括分类模型TemBERTureCLS和回归模型TemBERTureTm。 Deep learning tool designed to predict protein thermostability, including the classification model TemBERTureCLS and the regression model TemBERTureTm.
    Tags: undefined
    Author: Chiara Rodella
    Release: 2025-01-08 09:28:20
    Reference: Chiara Rodella, Symela Lazaridi, Thomas Lemmin, TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms, Bioinformatics Advances, Volume 4, Issue 1, 2024, vbae103

    Protein Thermostability Prediction

    简介

    基于TemBERTure开发的Thermostability Prediction是一个用于预测蛋白质热稳定性的深度学习工具,专注于氨基酸序列分析。它包括两个模型:TemBERTureCLS和TemBERTureTm。TemBERTureCLS是一个分类模型,用于预测蛋白质序列的热类别,即判断其是嗜热的还是非嗜热的。TemBERTureTm是一个回归模型,用于根据蛋白质序列预测其熔点温度(Tm)。这两个模型都基于protBERT-BFD语言模型,该模型在大量蛋白质序列数据集上进行了预训练。通过基于适配器的方法进行高效微调,使得TemBERTure能够在不需要广泛重新训练的情况下,稳健地适应特定任务。
    图:TemBERTureCLS与其他常用模型的预测结果比较
    图:TemBERTureTm与其他常用模型的预测结果比较

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式

    结果说明

    默认输出结果文件为predicted_Tm.csv,包含信息如下:

    字段名称 说明
    ID 序列ID
    Tm 预测得到的蛋白Melting Temperature (Tm) 值
    Thermostability Type 预测得到的蛋白热稳定性类别,有两种:Thermophilic与Non-thermophilic
    Thermophilicity Prediction Score 预测得到的蛋白嗜热性概率评分,数值在0-1.0之间,越大表示蛋白嗜热的概率越高

    参考文献

    • Chiara Rodella, Symela Lazaridi, Thomas Lemmin, TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms, Bioinformatics Advances, Volume 4, Issue 1, 2024, vbae103 DOI:10.1093/bioadv/vbae103

    Protein Thermostability Prediction

    Introduction

    Thermostability Prediction, developed based on TemBERTure, is a deep learning tool designed to predict protein thermostability, focusing on amino acid sequence analysis. It includes two models: TemBERTureCLS and TemBERTureTm. TemBERTureCLS is a classification model that predicts the thermal category of a protein sequence, determining whether it is thermophilic or non-thermophilic. TemBERTureTm is a regression model that predicts the melting temperature (Tm) of a protein from its sequence. Both models are based on the protBERT-BFD language model, which has been pre-trained on a large dataset of protein sequences. By using an adapter-based fine-tuning approach, TemBERTure can efficiently and robustly adapt to specific tasks without the need for extensive retraining.
    Figure: Comparison of TemBERTureCLS predictions with those of other commonly used models
    Figure: Comparison of TemBERTureTm predictions with those of other commonly used models

    Parameter

    Protein Sequence

    The protein sequence file in FASTA format.

    Result

    The output result file is predicted_Tm.csv, containing the following information:

    Field Name Description
    ID Sequence ID
    Tm Predicted protein melting temperature (Tm) value
    Thermostability Type Predicted protein thermostability category: either Thermophilic or Non-thermophilic
    Thermophilicity Prediction Score Predicted probability score of protein thermophilicity, ranging from 0 to 1.0, where a higher score indicates a higher likelihood of the protein being thermophilic
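
    A short sketch of post-processing predicted_Tm.csv with the column names listed above. The helper `thermophilic_ids`, its `min_score` threshold, and the sample rows are hypothetical; the values are made up purely to show the file layout.

```python
import csv
import io

# Made-up rows for illustration only; real files come from the module.
SAMPLE = """ID,Tm,Thermostability Type,Thermophilicity Prediction Score
seq1,72.4,Thermophilic,0.93
seq2,48.1,Non-thermophilic,0.12
"""


def thermophilic_ids(csv_text, min_score=0.5):
    """Return IDs classified as Thermophilic with a score of at least min_score."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["ID"]
        for row in reader
        if row["Thermostability Type"] == "Thermophilic"
        and float(row["Thermophilicity Prediction Score"]) >= min_score
    ]


selected = thermophilic_ids(SAMPLE)
```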

    Reference

    • Chiara Rodella, Symela Lazaridi, Thomas Lemmin, TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms, Bioinformatics Advances, Volume 4, Issue 1, 2024, vbae103 DOI:10.1093/bioadv/vbae103
  • Name: GMX Metadynamics Generation
    Description: 生成Metadynamics模拟的输入文件 Generate input files for Metadynamics simulations
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-12-02 15:42:53
    Reference:

    GMX Metadynamics Generation

    简介

    GMX Metadynamics Generation模块是生成可用于Metadynamics模拟的输入文件。

    参数说明

    GRO File

    提交模拟体系的gro文件。该文件可以从MD Solvation模块获取。

    PBC

    Metadynamics模拟阶段是否考虑周期性边界条件,yes或者no。

    CV Group1

    组成集合变量CV的第一个组所包含的原子。

    CV Group2

    组成集合变量CV的第二个组所包含的原子。

    CV Group3

    组成集合变量CV的第三个组所包含的原子。

    CV Group4

    组成集合变量CV的第四个组所包含的原子。
    备注:

    • Group1和Group2组成DISTANCE集合变量,Group1,Group2和Group3组成ANGLE集合变量,Group1,Group2,Group3和Group4组成TORSION集合变量
    • Group的书写规则:a5表示GRO文件中第5个原子,a5-10表示GRO文件中第5-10个原子,aCA表示GRO文件中名字为CA的原子,同理r5, r5-10, rASP分别表示GRO文件中第5位残基,第5-10位残基和名字为ASP的残基,一些特殊的字符如Protein,Protein-H, MainChain等亦可使用,也可以合并使用,但需用逗号隔开,如"a5-10,r5,r8-10,UNK"表示GRO文件中第5-10位原子、第5位残基、第8-10位残基以及名字叫UNK的分子
    • 多个CV的处理方式:如果要定义多个集合变量,则在Group定义中用"//"将不同集合变量对应的原子组进行分割,如a5//r5-10表示a5是第一个集合变量对应的原子组,r5-10是第二个集合变量对应的原子组,当集合变量在某个Group没有对应的原子组时,用none表示,比如第一个集合变量是DISTANCE,第二个集合变量是ANGLE,那么第一个DISTANCE集合变量在Group3中没有对应的原子组,此时在Group3可以写none//r5-10,表示第一个集合变量在Group3中没有对应的原子组,而第二个集合变量在Group3中对应的原子组为r5-10

    Component

    集合变量DISTANCE对应的成分,其成分有x,y,z和xyz,分别表示计算DISTANCE仅考虑x,y,z维度以及xyz三个维度都考虑,有多个集合变量时用"//"进行分割。

    Metad Height

    施加的沉积高斯函数的高度,默认1.0

    Metad Width

    施加的沉积高斯函数的宽度或者标准差,有多个集合变量时用"//"进行分割,默认0.05

    Metad Frequency

    施加的沉积高斯函数的频率,默认500,即每500个时间步长进行一次高斯函数沉积

    CV Min

    集合变量的边界最小值,有多个集合变量时用"//"进行分割。无默认值时即不考虑边界,此时计算量会增加,强烈建议设置边界。

    CV Max

    集合变量的边界最大值,有多个集合变量时用"//"进行分割,无默认值时即不考虑边界,此时计算量会增加,强烈建议设置边界。

    CV Space

    集合变量的窗口大小,有多个集合变量时用"//"进行分割,默认等于metad_width的1/5

    CV Bin

    集合变量的窗口数量,有多个集合变量时用"//"进行分割,默认等于150,CV Space和CV Bin的相乘等于CV Max和CV Min的差值,因此当CV Space和CV Bin同时设置时以对应窗口数最多的为准

    Adaptive

    是否考虑施加自适应沉积函数, geom或者diff,默认为不填,即不考虑自适应。

    Sigma Min

    施加的自适应高斯函数的宽度或者标准差的最小值,有多个集合变量时用"//"进行分割,默认等于0。

    Sigma Max

    施加的自适应高斯函数的宽度或者标准差的最大值,有多个集合变量时用"//"进行分割,默认等于0。

    Reweight

    是否考虑重加权以获得重加权因子,对获得归一化偏势,yes或者no,默认no,即不考虑重加权,一般在体系收敛后才考虑重加权。

    Reweight Ngauss

    计算重加权因子时施加的高斯函数的个数,默认等于50。

    Reweight Bin

    计算重加权因子时集合变量的窗口数量,其值不能小于CV Bin的值,有多个集合变量时用"//"进行分割,默认等于CV Bin。

    Well Tempered

    是否考虑回火metadynamics模拟,yes或者no。

    Temperature

    回火metadynamics模拟时对应的基础温度,默认等于300K

    Bias Factor

    回火Metadynamics模拟时对应的偏置因子,其值等于(T+deltaT)/T,默认等于1,此时未进行偏置模拟,若进行偏置模拟,偏置因子应大于1

    TAU

    回火Metadynamics模拟时对应的施加的沉积高斯函数的高度,Height=kb*DeltaT*Frequency*TimeStep/TAU,默认等于0,即直接使用设置的沉积函数的高度代替。

    Step

    Metadynamics模拟时指定的输出步长,默认100。

    Gauss File

    Metadynamics模拟时指定的沉积高斯函数的输出文件名。

    CV File

    Metadynamics模拟时指定的集合变量的输出文件名。

    PLUMED Index File

    Metadynamics模拟时指定的CV Group的输出文件名,该文件中包含所有的CV Group的原子组,用于下一步Metadynamics的输入文件。

    PLUMED Data File

    Metadynamics模拟时指定的参数的输出文件名,该文件中包含计算时所需的参数,用于下一步Metadynamics的输入文件。

    结果说明

    输出结果包括:

    输出文件名称 说明
    HILLS.dat Metadynamics模拟时指定的沉积高斯函数输出
    COLVAR.dat Metadynamics模拟时指定的集合变量的输出
    PLUMED.ndx NDX文件指定的组成集合变量的原子组
    PLUMED.dat 下一步Metadynamics计算所需的参数文件

    其中PLUMED.ndx和PLUMED.dat将作为下一步metadynamics模拟的输入文件。

    GMX Metadynamics Generation

    Introduction

    The GMX Metadynamics Generation module is used to generate input files for Metadynamics simulations.

    Parameter

    GRO File

    Submit the gro file of the simulation system. This file can be obtained from the MD Solvation module.

    PBC

    Whether to consider periodic boundary conditions during the Metadynamics simulation phase, yes or no.

    CV Group1

    Atoms included in the first group that makes up the collective variable (CV).

    CV Group2

    Atoms included in the second group that makes up the collective variable (CV).

    CV Group3

    Atoms included in the third group that makes up the collective variable (CV).

    CV Group4

    Atoms included in the fourth group that makes up the collective variable (CV).
    Note:

    • Group1 and Group2 form the DISTANCE collective variable, Group1, Group2, and Group3 form the ANGLE collective variable, and Group1, Group2, Group3, and Group4 form the TORSION collective variable.
    • The notation for Groups: a5 represents the 5th atom in the GRO file, a5-10 represents atoms 5 to 10 in the GRO file, aCA represents the atom named CA in the GRO file. Similarly, r5, r5-10, and rASP represent the 5th residue, residues 5 to 10, and the residue named ASP in the GRO file, respectively. Some special characters like Protein, Protein-H, MainChain, etc., can also be used and can be combined, separated by commas. For example, “a5-10,r5,r8-10,UNK” represents atoms 5 to 10, the 5th residue, residues 8 to 10, and a molecule named UNK in the GRO file.
    • Handling multiple CVs: If you want to define multiple collective variables, separate the corresponding atom groups for different collective variables in the Group definition using “//”. For example, a5//r5-10 indicates that a5 corresponds to the atom group for the first collective variable, and r5-10 corresponds to the second collective variable. If there is no corresponding atom group for a collective variable in a Group, use “none” to indicate this. For instance, if the first collective variable is DISTANCE and the second is ANGLE, and the first DISTANCE collective variable has no corresponding atom group in Group3, you can write none//r5-10 in Group3 to indicate that the first collective variable has no corresponding atom group, while the second collective variable corresponds to r5-10 in Group3.
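
    The group notation above can be read as a small grammar: “//” separates collective variables, commas separate selections, and each selection is an atom or residue range, an atom or residue name, or a named group. The parser below is a minimal sketch of that reading. The functions `parse_token` and `parse_group_field` and the returned tuple layout are hypothetical, and the rule that a lowercase a or r followed by an uppercase letter denotes a name (for example aCA, rASP) is an assumption inferred from the examples.

```python
import re


def parse_token(token):
    """Classify one comma-separated selection such as a5, a5-10, aCA, r8-10, rASP, UNK."""
    m = re.fullmatch(r"([ar])(\d+)(?:-(\d+))?", token)
    if m:
        kind = "atom" if m.group(1) == "a" else "residue"
        first = int(m.group(2))
        last = int(m.group(3)) if m.group(3) else first
        return (kind + "_range", first, last)
    m = re.fullmatch(r"([ar])([A-Z][A-Za-z0-9]*)", token)
    if m:
        kind = "atom" if m.group(1) == "a" else "residue"
        return (kind + "_name", m.group(2))
    # Anything else is treated as a named group (Protein, Protein-H, MainChain, UNK, ...).
    return ("group", token)


def parse_group_field(field):
    """Split a CV Group field into one selection list per collective variable.

    A collective variable without atoms in this group ("none") maps to None.
    """
    per_cv = []
    for cv_part in field.split("//"):
        cv_part = cv_part.strip()
        if not cv_part or cv_part.lower() == "none":
            per_cv.append(None)
        else:
            per_cv.append([parse_token(tok.strip()) for tok in cv_part.split(",")])
    return per_cv
```

    For instance, `parse_group_field("none//r5-10")` yields one entry per collective variable, with the first entry empty because a DISTANCE variable does not use Group3.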

    Component

    The component of the DISTANCE collective variable: x, y, z, or xyz. With x, y, or z the distance is calculated along that single dimension only; with xyz all three dimensions are considered. Use “//” to separate the components of multiple collective variables.

    Metad Height

    The height of the deposited Gaussian function, default is 1.0.

    Metad Width

    The width or standard deviation of the deposited Gaussian function. Use “//” to separate multiple collective variable widths, default is 0.05.

    Metad Frequency

    The frequency of depositing the Gaussian function, default is 500, meaning a Gaussian function deposition occurs every 500 time steps.

    CV Min

    The minimum boundary value of the collective variable. Use “//” to separate the minimums of multiple collective variables. There is no default value; if it is not set, no boundary is applied, which increases the computational load. It is strongly recommended to set boundaries.

    CV Max

    The maximum boundary value of the collective variable. Use “//” to separate the maximums of multiple collective variables. There is no default value; if it is not set, no boundary is applied, which increases the computational load. It is strongly recommended to set boundaries.

    CV Space

    The window size of the collective variable. Use “//” to separate the window sizes of multiple collective variables. Default is 1/5 of Metad Width.

    CV Bin

    The number of windows for the collective variable. Use “//” to separate multiple collective variable bin counts, default is 150. The product of CV Space and CV Bin equals the difference between CV Max and CV Min. Therefore, when both CV Space and CV Bin are set, the one with the highest corresponding window count will prevail.
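
    The relation between CV Space, CV Bin, CV Min, and CV Max can be written out as a small calculation. The sketch below assumes the rule stated above (when both are given, the setting that yields more windows is used); the function `resolve_bins` and its rounding behavior are hypothetical.

```python
import math


def resolve_bins(cv_min, cv_max, cv_space=None, cv_bin=None):
    """Return the number of windows implied by CV Space and/or CV Bin."""
    span = cv_max - cv_min
    candidates = []
    if cv_bin is not None:
        candidates.append(int(cv_bin))
    if cv_space is not None:
        # Space * Bin = Max - Min, so Bin = span / space (rounded up;
        # the small offset guards against floating-point noise).
        candidates.append(math.ceil(span / cv_space - 1e-9))
    if not candidates:
        raise ValueError("set at least one of cv_space or cv_bin")
    # When both are set, the larger window count wins.
    return max(candidates)
```

    With a range of 0.0 to 3.0, a CV Space of 0.01 implies 300 windows and overrides a CV Bin of 150, while a CV Space of 0.1 implies only 30 windows, so CV Bin is used instead.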

    Adaptive

    Whether to apply an adaptive deposition function: geom or diff. The default is empty, meaning adaptive deposition is not used.

    Sigma Min

    The minimum width or standard deviation of the applied adaptive Gaussian function. Use “//” to separate multiple collective variable minimums, default is 0.

    Sigma Max

    The maximum width or standard deviation of the applied adaptive Gaussian function. Use “//” to separate multiple collective variable maximums, default is 0.

    Reweight

    Whether to calculate the reweighting factor used to normalize the bias potential, yes or no. Default is no, meaning reweighting is not performed. Reweighting is generally considered only after the system has converged.

    Reweight Ngauss

    The number of Gaussian functions applied when calculating the reweighting factor, default is 50.

    Reweight Bin

    The number of windows for the collective variable when calculating the reweighting factor, which cannot be less than the value of CV Bin. Use “//” to separate multiple collective variable bin counts, default is equal to CV Bin.

    Well Tempered

    Whether to run well-tempered Metadynamics, yes or no.

    Temperature

    The base temperature for the well-tempered Metadynamics simulation, default is 300K.

    Bias Factor

    The bias factor for the well-tempered Metadynamics simulation, which equals (T + deltaT) / T. Default is 1, meaning no well-tempered biasing is applied; when well-tempered biasing is used, the bias factor should be greater than 1.

    TAU

    The time constant used to derive the height of the deposited Gaussian function in well-tempered Metadynamics: Height = kb * DeltaT * Frequency * TimeStep / TAU. Default is 0, meaning the Metad Height value is used directly.
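
    A worked instance of the TAU formula may help. The sketch below combines it with the Bias Factor definition, (T + deltaT) / T, so that deltaT = (Bias Factor - 1) * T. The function `gaussian_height`, the use of kJ/mol for energy and ps for time, and the example time step of 0.002 ps are assumptions; the time step is not a parameter of this module.

```python
KB = 0.0083144626  # Boltzmann constant in kJ/(mol*K)


def gaussian_height(temperature, bias_factor, frequency, timestep, tau, metad_height=1.0):
    """Gaussian height in kJ/mol from Height = kb * DeltaT * Frequency * TimeStep / TAU.

    timestep and tau must share the same time unit (ps assumed here).
    With tau == 0 the explicitly set Metad Height is returned unchanged.
    """
    if tau == 0:
        return metad_height
    delta_t = (bias_factor - 1.0) * temperature  # from bias_factor = (T + deltaT) / T
    return KB * delta_t * frequency * timestep / tau


# Example with assumed values: 300 K, bias factor 10, deposition every 500 steps,
# 0.002 ps time step, TAU of 10 ps.
height = gaussian_height(300.0, 10.0, 500, 0.002, 10.0)
```

    Under these assumed values deltaT is 2700 K and the resulting height is about 2.24 kJ/mol.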

    Step

    The specified output step length during the Metadynamics simulation, default is 100.

    Gauss File

    The output file name for the deposited Gaussian function during the Metadynamics simulation.

    CV File

    The output file name for the collective variable during the Metadynamics simulation.

    PLUMED Index File

    The output file name for the CV Group during the Metadynamics simulation, which contains all the atom groups of the CV Group for the next step’s Metadynamics input file.

    PLUMED Data File

    The output file name for the parameters during the Metadynamics simulation, which contains the parameters required for calculations for the next step’s Metadynamics input file.

    Result

    The output results include:

    Output File Name Description
    HILLS.dat Output of the deposited Gaussian function specified during the Metadynamics simulation
    COLVAR.dat Output of the collective variable specified during the Metadynamics simulation
    PLUMED.ndx NDX file specifying the atom groups that make up the collective variable
    PLUMED.dat Parameter file required for the next step of Metadynamics calculation

    Of these, PLUMED.ndx and PLUMED.dat serve as input files for the next step of the Metadynamics simulation.
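
    The content of the generated PLUMED.dat is not shown in this document. As a rough orientation only, the sketch below assembles a PLUMED input for a single DISTANCE collective variable from a few of the parameters described above, using standard PLUMED keywords (DISTANCE, METAD, PRINT). The function `build_plumed_input`, the labels cv1 and metad, and the use of two plain atom indices instead of index-file groups are assumptions; the file the module actually writes may be organized differently.

```python
def build_plumed_input(atom1, atom2, height=1.0, width=0.05, frequency=500,
                       cv_min=None, cv_max=None, cv_bin=150, step=100,
                       pbc=True, bias_factor=1.0, temperature=300.0,
                       gauss_file="HILLS.dat", cv_file="COLVAR.dat"):
    """Return PLUMED input text for one DISTANCE CV biased by METAD."""
    distance = f"cv1: DISTANCE ATOMS={atom1},{atom2}"
    if not pbc:
        distance += " NOPBC"  # ignore periodic boundary conditions

    metad = (f"metad: METAD ARG=cv1 SIGMA={width} HEIGHT={height} "
             f"PACE={frequency} FILE={gauss_file}")
    if cv_min is not None and cv_max is not None:
        # Grid boundaries reduce the cost of evaluating the accumulated bias.
        metad += f" GRID_MIN={cv_min} GRID_MAX={cv_max} GRID_BIN={cv_bin}"
    if bias_factor > 1.0:
        # Well-tempered Metadynamics.
        metad += f" BIASFACTOR={bias_factor} TEMP={temperature}"

    output = f"PRINT ARG=cv1,metad.bias STRIDE={step} FILE={cv_file}"
    return "\n".join([distance, metad, output]) + "\n"


plumed_text = build_plumed_input(5, 120, cv_min=0.0, cv_max=3.0)
```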

  • Name: Free Energy Surface Analysis
    Description: 基于PLUMED元动力学模拟后的自由能计算。 Free energy surface analysis for PLUMED based metadynamics.
    Tags: undefined
    Author:
    Release: 2024-11-21 00:00:00
    Reference:

    Free Energy Surface Analysis

    简介

    Free Energy Surface Analysis模块是对基于PLUMED元动力学模拟后得到的模拟结果进行自由能计算。

    参数说明

    Input File

    基于PLUMED元动力学模拟后输出的沉积高斯函数文件,默认为HILLS.dat文件。

    Histogram

    对沉积高斯函数文件进行自由能计算时是否考虑直方图分布方法,yes或者no,默认no。

    Sigma

    当考虑直方图分布方法时高斯函数的宽度值,有多个集合变量(即CV)时用"//"进行分割,比如0.35//0.35。只有当Histogram值为yes时Sigma参数才会生效,当有多个CV而只设置了一个宽度值时,则表示该宽度值适用于所有CV。默认0.05。

    CV Name

    CV名称,对沉积高斯函数文件进行自由能计算时只考虑该指定的CV。当不指定CV时则考虑沉积高斯函数文件中包含的所有CV,当指定CV时则不能考虑直方图分布方法。

    CV Min

    集合变量的边界最小值,有多个集合变量时用"//"进行分割,比如0.1//0.3,强烈建议设置边界。当有多个CV而只设置了一个边界最小值时,则表示该最小值适用于所有CV。

    CV Max

    集合变量的边界最大值,有多个集合变量时用"//"进行分割,比如0.1//0.3,强烈建议设置边界。当有多个CV而只设置了一个边界最大值时,则表示该最大值适用于所有CV。

    Grid Size

    集合变量的窗口大小,有多个集合变量时用"//"进行分割,比如0.1//0.3。仅当设置了CV Min和CV Max值时,Grid Size才会生效。当有多个CV而只设置了一个窗口大小值时,则表示该窗口大小值适用于所有CV。

    Bin

    集合变量的窗口数量,有多个集合变量时用"//"进行分割,比如150//300。仅当设置了CV Min和CV Max值时,Bin才会生效。当有多个CV而只设置了一个窗口数量值时,则表示该窗口数量值适用于所有CV。Grid Size和Bin相乘等于CV Max和CV Min的差值,因此当Grid Size和Bin同时设置时以对应窗口数最多的为准。

    Temperature

    温度,对沉积高斯函数文件进行自由能计算时使用的温度值,默认300K

    Min to Zero

    是否对输出的自由能数据进行归零处理,即将自由能数据进行相对移动以保证最小值移动到0的位置,yes或者no,默认no。

    Stride

    沉积高斯函数的数量,在对沉积高斯函数文件进行自由能计算时,每隔该指定的沉积高斯函数的数量进行一次自由能计算。当不设置该数量值时表示对所有的沉积高斯函数在整体上只进行一次自由能计算。

    Output File

    输出结果文件,文件中包含随CV变化的自由能数据,默认为FES.csv文件。当指定了Stride值时,默认文件为FES.dat.tar.gz。

    结果说明

    输出结果包括:

    输出文件名称 说明
    FES.csv 随CV变化的自由能数据文件
    FES.dat.tar.gz 随CV变化的自由能数据压缩文件

    Free Energy Surface Analysis

    Introduction

    The Free Energy Surface Analysis module calculates the free energy from the results of PLUMED-based metadynamics simulations.

    Parameter

    Input File

    The deposited Gaussian function file output by the metadynamics simulation. Default “HILLS.dat”.

    Histogram

    Whether to use the Histogram method when calculating the free energy from the deposited Gaussian function file. “yes” or “no”, default “no”.

    Sigma

    Width of the Gaussian function used by the Histogram method. If there are multiple CVs, separate the values with “//”, such as 0.35//0.35. Only effective when the Histogram method is used. When there are multiple CVs and only one width is set, that width is applied to all CVs. Default 0.05.

    CV Name

    The CV to consider in the free energy calculation. When no CV is specified, all CVs contained in the deposited Gaussian function file are considered; when a CV is specified, the Histogram method cannot be used.

    CV Min

    The minimum boundary value of the collective variable. Use “//” to separate multiple collective variable minimums, such as 0.1//0.3. It is strongly recommended to set boundaries. When there are multiple CVs and only one minimum value is set, it means that the minimum value will be applied to all CVs.

    CV Max

    The maximum boundary value of the collective variable. Use “//” to separate multiple collective variable maximums, such as 0.1//0.3. It is strongly recommended to set boundaries. When there are multiple CVs and only one maximum value is set, it means that the maximum value will be applied to all CVs.

    Grid Size

    The window size of the collective variable. Use “//” to separate multiple collective variable window sizes, such as 0.1//0.3. Only effective when CV Min and CV Max values are set. When there are multiple CVs and only one window size value is set, it means that the window size value will be applied to all CVs.

    Bin

    The number of windows for the collective variable. Use “//” to separate the values for multiple collective variables, such as 150//300. Only effective when CV Min and CV Max are set. When there are multiple CVs and only one value is set, that value is applied to all CVs. The product of Grid Size and Bin equals the difference between CV Max and CV Min. Therefore, when both Grid Size and Bin are set, the one that yields the larger number of windows takes precedence.

    Temperature

    The temperature value used in the free energy calculation based on the deposited Gaussian function file. Default 300K.

    Min to Zero

    Whether to shift the calculated free energy data so that its minimum value is at zero. “yes” or “no”, default “no”.

    Stride

    The number of deposited Gaussian functions between successive free energy calculations: the free energy is calculated once for every this many Gaussians in the deposited Gaussian function file. When Stride is not set, a single free energy calculation is performed over all deposited Gaussian functions.
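
    To make the role of Stride and Min to Zero concrete, the sketch below estimates a one-dimensional free energy profile as the negative sum of the deposited Gaussians, which is the basic (non-well-tempered) metadynamics estimate. The function names, the (center, sigma, height) tuple layout, and the choice to make each Stride block cumulative are assumptions; no rescaling for well-tempered runs is applied, and the module itself may proceed differently.

```python
import math


def free_energy_1d(hills, grid, min_to_zero=False):
    """Free energy on grid as minus the sum of Gaussians.

    hills: list of (center, sigma, height) tuples for a single CV.
    """
    fes = []
    for s in grid:
        bias = sum(h * math.exp(-((s - c) ** 2) / (2.0 * sig ** 2)) for c, sig, h in hills)
        fes.append(-bias)
    if min_to_zero:
        lowest = min(fes)
        fes = [f - lowest for f in fes]  # shift so the minimum sits at zero
    return fes


def free_energy_with_stride(hills, grid, stride=None, min_to_zero=False):
    """One profile over all hills, or one cumulative profile every `stride` hills."""
    if not stride:
        return [free_energy_1d(hills, grid, min_to_zero)]
    return [
        free_energy_1d(hills[:n], grid, min_to_zero)
        for n in range(stride, len(hills) + 1, stride)
    ]
```

    Comparing successive profiles from a strided calculation is a common way to judge whether the free energy estimate has stopped changing.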

    Output File

    The name of the output file, which contains the free energy data as a function of the CV. Default is FES.csv; when Stride is specified, the default is FES.dat.tar.gz.

    Result

    The output results include:

    Output File Name Description
    FES.csv Free energy data as a function of the CV
    FES.dat.tar.gz Compressed archive (tar.gz) of free energy data as a function of the CV
  • Name: MD Clustering v2
    Description: 对动力学轨迹进行归簇分析 Clustering analysis for dynamic trajectories.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-04 11:40:38
    Reference:

    MD Clustering (v2)

    简介

    MD Clustering是对动力学轨迹进行归簇分析。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    Cutoff

    聚类时结构的RMSD截断值(nm)

    Cluster Method

    聚类算法:linkage, jarvis-patrick, monte-carlo, diagonalization, gromos, 默认使用gromos算法。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA,Complex。
    可以根据PDB中小分子的名称填写组别名称。
    注:其中Complex指的是蛋白-小分子复合物体系。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    输出结果包括:

    输出文件名称 说明
    clusters.pdb 差异较大的每个簇的代表性结构
    clust-size.xvg 各个簇的帧数
    clust-id.xvg 各个簇和轨迹帧号的对应关系

    MD Clustering (v2)

    Introduction

    MD Clustering is a clustering analysis of molecular dynamics trajectories.

    Parameter Description

    Path File

    Path file produced by the MD simulation; it can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    Cutoff

    RMSD cutoff value for clustering (in nm).

    Cluster Method

    Clustering algorithm: linkage, jarvis-patrick, monte-carlo, diagonalization, gromos. The default method is gromos.
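
    For orientation, the default gromos method can be sketched as follows: count, for every frame, the frames within the RMSD cutoff; take the frame with the most neighbors as a cluster center; remove that cluster; repeat until no frames remain. The function `gromos_cluster` and the pre-computed RMSD matrix input are hypothetical, and this is an illustration of the idea rather than the module's implementation.

```python
def gromos_cluster(rmsd, cutoff):
    """Cluster frames from a symmetric RMSD matrix (list of lists, in nm).

    Returns a list of (center_frame, member_frames) tuples, largest-neighborhood first.
    """
    remaining = set(range(len(rmsd)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            # Neighbors of frame i among the frames not yet assigned (includes i itself).
            members = [j for j in sorted(remaining) if rmsd[i][j] <= cutoff]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters
```

    A smaller Cutoff yields more, tighter clusters; a larger one merges frames into fewer clusters.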

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA, or Complex.
    You can also specify the group name based on the small molecule names in the PDB file.
    Note: “Complex” refers to protein-small molecule complex systems.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.
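
    The Custom Resid and Custom Atom format (“-” for a continuous range, commas between entries) can be expanded into an explicit list of numbers. The helper `parse_index_spec` below is a hypothetical sketch of that expansion, assuming both ends of a range are inclusive.

```python
def parse_index_spec(spec):
    """Expand a specification such as "1-10, 15" into a list of integers."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            indices.extend(range(start, end + 1))  # both ends inclusive
        else:
            indices.append(int(part))
    return indices
```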

    Skip Time (ns)

    Time interval between frames (in ns).

    Result Description

    The output results include:

    Output File Name Description
    clusters.pdb Representative structures of each cluster with significant differences
    clust-size.xvg Number of frames in each cluster
    clust-id.xvg Correspondence between clusters and trajectory frame numbers
  • Name: MD Hbond v2
    Description: 分子动力学氢键分析 Hydrogen bond analysis between specified groups
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-05 17:34:57
    Reference:

    MD Hbond (v2)

    简介

    MD Hbond模板对于指定组别之间的氢键分析。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group1

    选择需要计算的氢键组别1:Protein,DNA,RNA。如果两个组相同则分析的是组内氢键。
    可以根据PDB中小分子的名称填写组别名称。

    System Group2

    选择需要计算的氢键组别2:Protein,DNA,RNA。如果两个组相同则分析的是组内氢键。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid1

    自定义需要计算的组1残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom1

    自定义需要计算的组1原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Resid2

    自定义需要计算的组2残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom2

    自定义需要计算的组2原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    输出结果包括:

    输出文件名称 说明
    hbnum.csv 氢键分析CSV文件
    hbnum.xvg 氢键分析XVG文件
    hbnum.png 氢键分析PNG文件

    其中hbnum.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Hydrogen bonds 氢键数目
    Pairs within 0.35 nm 两个组相距0.35nm内的接触的原子数目

    MD Hbond (v2)

    Introduction

    The MD Hbond module analyzes hydrogen bonds between specified groups.

    Parameter Description

    Path File

    Path file produced by the MD simulation; it can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group1

    Select the hydrogen bond group 1 for calculation: Protein, DNA, RNA. If both groups are the same, it analyzes intra-group hydrogen bonds. You can specify group names based on the names of small molecules in the PDB file.

    System Group2

    Select the hydrogen bond group 2 for calculation: Protein, DNA, RNA. If both groups are the same, it analyzes intra-group hydrogen bonds. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid1

    Custom residue numbers for group 1 calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom1

    Custom atom numbers for group 1 calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Custom Resid2

    Custom residue numbers for group 2 calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom2

    Custom atom numbers for group 2 calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Result Description

    The output results include:

    Output File Name Description
    hbnum.csv Hydrogen bond analysis CSV file
    hbnum.xvg Hydrogen bond analysis XVG file
    hbnum.png Hydrogen bond analysis PNG file

    The hbnum.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Hydrogen bonds Number of hydrogen bonds
    Pairs within 0.35 nm Number of atom pairs from the two groups that lie within 0.35 nm of each other
  • Name: MD Trajectory v2
    Description: 可根据起始帧数、结束帧数以及间隔帧数对平衡模拟进行轨迹提取,从而将其转换为GRO或者PDB轨迹文件。 MD Trajectory converts Gromacs trajectory file (xtc) into GRO or PDB file for visualization.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD Trajectory (v2)

    简介

    可根据起始帧数、结束帧数以及间隔帧数对平衡模拟进行轨迹提取,从而将其转换为GRO或者PDB轨迹文件。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run模块或者AlphaAutoMD模块中获取。

    Type

    文件输出类型:GRO或者PDB。

    Water

    输出文件是否保留水盒子。

    Start Time (ps)

    起始位置(单位ps)。

    End Time (ps)

    结束位置(单位ps)。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。对于膜体系的轨迹提取是必填项。

    Keep Heterogens

    是否保留体系中的溶剂(Water以及Ion):不保留(none),都保留(all),指定保留溶剂范围(specify)。

    Specify Heterogens

    指定需要保留的特殊组别如:水(Water),离子(Ion);或者指定保留组别的范围,规定格式为:需要保留的溶剂组别(Water或者Ion):限定距离(单位Å):目标组别,中间使用冒号(:)进行分隔,例如Water:3:ligand。
    注:组别名称可以通过MD Solvation模块的index文件查询;若目标组别是小分子,可以根据PDB中小分子的名称填写组别名称,多个小分子可填写ligand表示。

    结果说明

    输出结果包括:

    输出文件名称 说明
    md_finally.pdb 最后一帧结构文件
    md_center.pdb/.gro PDB/GRO格式轨迹文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    MD Trajectory (v2)

    Introduction

    The MD Trajectory module extracts part of an equilibrium simulation trajectory, selected by a start point, an end point, and an interval, and converts it into a GRO or PDB trajectory file.

    Parameter

    Path File

    Path file produced by the MD simulation. It is available from the GMX MD Run module or the AlphaAutoMD module.

    Type

    File output type: GRO or PDB.

    Water

    Whether to retain the water box in the output files.

    Start Time (ps)

    Starting time (in ps).

    End Time (ps)

    Ending time (in ps).

    Skip Time (ps)

    Time interval, in ps.

    Index File

    Index file in ndx format. This is a required parameter for extracting trajectories in membrane systems.

    Keep Heterogens

    Whether to retain the solvent (Water and Ion) in the system: none (retain nothing), all (retain everything), or specify (retain only a specified subset).

    Specify Heterogens

    Specify the groups to retain, such as Water or Ion. To retain only the solvent near a target, use the format solvent group to retain (Water or Ion):cutoff distance (in Å):target group, with the three fields separated by colons (:), e.g., Water:3:ligand.
    Note: Group names can be looked up in the index file from the MD Solvation module. If the target group is a small molecule, enter the small molecule's name as it appears in the PDB file; to refer to multiple small molecules, enter ligand.
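    As an illustration of the Specify Heterogens format, the sketch below splits an entry such as Water:3:ligand into its three fields. The class and function names are hypothetical, and the handling of a bare group name (e.g. Water) as "retain the whole group" is an assumption drawn from the description above, not the module's confirmed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeterogenRule:
    solvent: str                      # group to retain, e.g. "Water" or "Ion"
    cutoff_angstrom: Optional[float]  # distance limit in Å, None if unrestricted
    target: Optional[str]             # group the distance is measured from


def parse_heterogen_spec(spec: str) -> HeterogenRule:
    """Parse "Water" or "Water:3:ligand" into a HeterogenRule."""
    fields = [field.strip() for field in spec.split(":")]
    if len(fields) == 1 and fields[0]:
        # Bare group name: retain the whole group (assumed meaning).
        return HeterogenRule(fields[0], None, None)
    if len(fields) == 3 and all(fields):
        solvent, cutoff, target = fields
        return HeterogenRule(solvent, float(cutoff), target)
    raise ValueError(f"expected 'Group' or 'Group:distance:target', got {spec!r}")


print(parse_heterogen_spec("Water:3:ligand"))
```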

    Result

    The output results include:

    Output File Name Description
    md_finally.pdb Structure file of the final frame
    md_center.pdb PDB format trajectory file
    md_center.gro GRO format trajectory file

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: MD Gyration v2
    Description: 回旋半径分析,可用来衡量体系模拟时的质权平均半径 Radius of gyration analysis, which measures the mass-weighted average radius of the system during simulation
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-05 16:24:54
    Reference:

    MD Gyration (v2)

    简介

    MD Gyration回旋半径分析,可用来衡量体系模拟时的质权平均半径。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA,Complex。
    可以根据PDB中小分子的名称填写组别名称。
    注:其中Complex指的是蛋白-小分子复合物体系。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    输出结果包括:

    输出文件名称 说明
    gyrate.csv 回转半径CSV文件
    gyrate.xvg 回转半径XVG文件
    gyrate.png 回转半径PNG文件

    其中gyrate.csv包括信息如下:

    字段名称 说明
    Time (ps) 时间
    Rg 回旋半径
    Rg(X) 绕着x轴的回旋半径
    Rg(Y) 绕着y轴的回旋半径
    Rg(Z) 绕着z轴的回旋半径

    MD Gyration (v2)

    Introduction

    MD Gyration is a radius of gyration analysis used to measure the mass-weighted average radius of a system during simulation.

    Parameter Description

    Path File

    Path file produced by the MD simulation. It is available from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA, or Complex.
    You can also specify the group name based on the small molecule names in the PDB file.
    Note: “Complex” refers to protein-small molecule complex systems.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The output results include:

    Output File Name Description
    gyrate.csv Radius of gyration CSV file
    gyrate.xvg Radius of gyration XVG file
    gyrate.png Radius of gyration PNG file

    The gyrate.csv file includes the following information:

    Field Name Description
    Time (ps) Time
    Rg Radius of gyration
    Rg(X) Radius of gyration around the x-axis
    Rg(Y) Radius of gyration around the y-axis
    Rg(Z) Radius of gyration around the z-axis
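
    The XVG output can also be read directly. In GROMACS XVG files, lines beginning with # are comments and lines beginning with @ are plotting directives; the remaining lines hold whitespace-separated numeric columns. The sketch below is a minimal reader under that assumption. The function name read_xvg and the sample numbers are made up for illustration and are not real output of this module.

```python
def read_xvg(text: str) -> list[list[float]]:
    """Return the numeric rows of an XVG file, skipping comments and directives."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#@":
            continue  # '#' = comment, '@' = plotting directive
        rows.append([float(value) for value in line.split()])
    return rows


# Hypothetical excerpt with columns: time, Rg, Rg(X), Rg(Y), Rg(Z)
sample = """# illustrative data only
@    title "Radius of gyration"
@    xaxis  label "Time (ps)"
0    1.50  1.20  1.25  1.22
10   1.52  1.21  1.26  1.24
"""

for time, rg, *_ in read_xvg(sample):
    print(time, rg)
```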
  • Name: MD SASA v2
    Description: 计算指定组别的溶剂可及表面积 Calculates the solvent accessible surface area (SASA) for a specified group
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-06 00:29:36
    Reference:

    MD SASA (v2)

    简介

    MD SASA模块是计算指定组别的溶剂可及表面积(solvent accessible surface area,SASA)。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA,Complex。
    可以根据PDB中小分子的名称填写组别名称。
    注:其中Complex指的是蛋白-小分子复合物体系。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    输出结果包括:

    输出文件名称 说明
    area.csv 溶剂可及表面积CSV文件
    area.xvg 溶剂可及表面积XVG文件
    area.png 溶剂可及表面积PNG文件

    其中area.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Total Area (nm^2) 溶剂可及表面积
    Hydrophobic (nm^2) 疏水表面积
    Hydrophilic (nm^2) 亲水表面积

    MD SASA (v2)

    Introduction

    The MD SASA module calculates the solvent accessible surface area (SASA) of specified groups.

    Parameter Description

    Path File

    Path file produced by the MD simulation. It is available from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA, or Complex.
    You can also specify the group name based on the small molecule names in the PDB file.
    Note: “Complex” refers to protein-small molecule complex systems.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The output results include:

    Output File Name Description
    area.csv Solvent accessible surface area CSV file
    area.xvg Solvent accessible surface area XVG file
    area.png Solvent accessible surface area PNG file

    The area.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Total Area (nm^2) Total solvent accessible surface area
    Hydrophobic (nm^2) Hydrophobic surface area
    Hydrophilic (nm^2) Hydrophilic surface area
  • Name: MD Distance v2
    Description: 分子动力学轨迹的距离分析模块,输出分子动力学过程中两个组之间距离 (质心距离或几何中心距离) 随时间的变化。 MD distance analysis that outputs the distance changes between two groups (center of mass distance or geometric center distance) over time.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-08-22 09:35:48
    Reference:

    MD Distance (v2)

    简介

    MD Distance是针对分子动力学轨迹的距离分析模块,输出两个组之间距离 (质心距离或几何中心距离) 随时间的变化情况。自定义组别时需要注意,如果只需测量两个原子之间的距离则填写Custom Atom1和Custom Atom2即可;当同时填写Custom Resid1和Custom Atom1时,组别1的原子数是Custom Atom1与Custom Resid1交集的原子。

    参数说明

    System Group

    计算两个组之间距离 (质心距离或几何中心距离) 随时间的变化情况。

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2024)模块或者AlphaAutoMD (GMX2024)模块中获取。

    System Group1

    选择需要计算的组别1:Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    System Group2

    选择需要计算的组别2:Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Distance Type

    距离计算方式分为两种:质心距离(mass)和几何中心距离(geometry)。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)。

    Custom Group

    自定义组别,如果只需测量两个原子之间的距离则填写Custom Atom1和Custom Atom2即可;当同时填写Custom Resid1和Custom Atom1时,组别1的原子数是Custom Atom1与Custom Resid1交集的原子。

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2024)模块或者AlphaAutoMD (GMX2024)模块中获取。

    Custom Resid1

    自定义需要计算的组1残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Atom1

    自定义需要计算的组1原子编号,连续参数可用“-”表示,不连续原子用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Chain1

    自定义需要计算的组1的链名称,例如A。

    Custom Resid2

    自定义需要计算的组2残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Atom2

    自定义需要计算的组2原子编号,连续参数可用“-”表示,不连续原子用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Chain2

    自定义需要计算的组2的链名称,例如B。

    Distance Type

    距离计算方式分为两种:质心距离(mass)和几何中心距离(geometry)。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    dist.csv 距离分析CSV文件
    dist.xvg 距离分析XVG文件
    dist.png 距离分析PNG文件

    其中dist.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Distance (nm) 组别之间的距离

    MD Distance (v2)

    Introduction

    MD Distance is a distance analysis module for molecular dynamics trajectories. It reports how the distance between two groups (center-of-mass distance or geometric-center distance) changes over time. When defining custom groups, note that to measure the distance between just two atoms you only need to fill in Custom Atom1 and Custom Atom2. When both Custom Resid1 and Custom Atom1 are filled in, group 1 consists of the atoms in the intersection of Custom Atom1 and Custom Resid1.

    Parameters

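    System Group

    Calculates how the distance (center-of-mass distance or geometric-center distance) between two groups changes over time.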

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2024) module or AlphaAutoMD (GMX2024) module.

    System Group1

    Select group 1 for the calculation: Protein, DNA, RNA.
    You can enter the group name based on the name of the small molecule in the PDB.

    System Group2

    Select group 2 for the calculation: Protein, DNA, RNA.
    You can enter the group name based on the name of the small molecule in the PDB.

    Distance Type

    There are two types of distance calculations: center of mass distance (mass) and center of geometry distance (geometry).
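
    The difference between the two distance types can be shown with a small sketch. It assumes the conventional definitions (mass-weighted average position for mass, unweighted average position for geometry); the function names and the coordinates are illustrative and not taken from the module.

```python
import math


def center(coords, weights=None):
    """Weighted average position; with no weights this is the geometric center."""
    if weights is None:
        weights = [1.0] * len(coords)
    total = sum(weights)
    return tuple(
        sum(w * c[axis] for w, c in zip(weights, coords)) / total
        for axis in range(3)
    )


def group_distance(coords1, coords2, masses1=None, masses2=None):
    """Distance between group centers; pass masses for the center-of-mass variant."""
    return math.dist(center(coords1, masses1), center(coords2, masses2))


# Group 1: a light atom at the origin and a heavy atom at x = 2 (made-up values).
group1 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
masses1 = [1.0, 3.0]
# Group 2: a single atom at x = 4.
group2 = [(4.0, 0.0, 0.0)]
masses2 = [12.0]

print(group_distance(group1, group2))                    # geometry
print(group_distance(group1, group2, masses1, masses2))  # mass
```

    Because the heavy atom pulls the center of mass of group 1 toward x = 2, the mass distance here is shorter than the geometry distance.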

    Skip Time (ns)

    Time interval for each frame (in ns).

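    Custom Group

    Custom groups. To measure the distance between just two atoms, fill in only Custom Atom1 and Custom Atom2. When both Custom Resid1 and Custom Atom1 are filled in, group 1 consists of the atoms in the intersection of Custom Atom1 and Custom Resid1.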

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2024) module or AlphaAutoMD (GMX2024) module.

    Custom Resid1

    Custom residue numbers for group 1 that need to be calculated. Use “-” for continuous parameters and separate non-continuous residues with commas, for example: 1-10,15. Custom groups are unions of specified residues.

    Custom Atom1

    Custom atom numbers for group 1 that need to be calculated. Use “-” for continuous parameters and separate non-continuous atoms with commas, for example: 1-10,15. Custom groups are unions of specified atoms.

    Custom Chain1

    Custom chain name for group 1 to be included in the calculation, e.g., A.

    Custom Resid2

    Custom residue numbers for group 2 that need to be calculated. Use “-” for continuous parameters and separate non-continuous residues with commas, for example: 1-10,15. Custom groups are unions of specified residues.

    Custom Atom2

    Custom atom numbers for group 2 that need to be calculated. Use “-” for continuous parameters and separate non-continuous atoms with commas, for example: 1-10,15. Custom groups are unions of specified atoms.

    Custom Chain2

    Custom chain name for group 2 to be included in the calculation, e.g., B.

    Distance Type

    There are two types of distance calculations: center of mass distance (mass) and center of geometry distance (geometry).

    Skip Time (ns)

    Time interval for each frame (in ns).

    Results

    The output includes:

    Output File Name Description
    dist.csv Distance analysis CSV file
    dist.xvg Distance analysis XVG file
    dist.png Distance analysis PNG file

    The dist.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Distance (nm) Distance between the groups
  • Name: MMPBSA v2
    Description: MMPBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。 MMPBSA calculates the binding free energy between the receptor and ligand and provides energy decomposition data, binding constant (Ka), and inhibitor constant (Ki).
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-08-03 09:10:29
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMPBSA

    简介

    MMPBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMPBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Reference Structure (GRO)

    参考结构。默认system.gro。可以在GMX MD Run (GMX2024)模块的输出结果文件中找到。当周期性边界条件处理不当时可以使用该参数。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMPBSA_result.csv MMPBSA结果汇总文件。
    MMPBSA_Residue.csv 能量分解数据CSV文件。
    MMPBSA.pdb 原子对应的MMPBSA能量放到PDB文件。可以做对应能量类别的表面图,从图的颜色深浅可以看出哪些区域对结合能贡献值较大。
    MMPBSA.tar.gz MMPBSA所有原始文件。包括_mmpbsa_residue_#.txt是每个残基对应的每个能量类别的数值,共包含7个能量类别:范德华能(VDW)、静电能(ELE)、溶剂化能极性部分(PB)、溶剂化能非极性部分(SA)、VDW+ELE=MM、PB+SA=PBSA、MM+PBSA=Binding/MMPBSA。_mmpbsa_residue.txt是对上述7个文件的总结,即为MMPBSA_Residue.csv对应的原始文件。_mmpbsa_atom#.pdb是将每个原子对应的能量类别放到pdb文件,与MMPBSA.pdb相似。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMPBSA

    Introduction

    MMPBSA calculates the binding free energy between a receptor and a ligand, and provides energy decomposition data, the binding constant (Ka), and the inhibitor constant (Ki). The entropy term is computed with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropic component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation; it should be evaluated over a structurally stable portion of the trajectory.
    This module provides four calculation methods. Trajectory calculates the receptor-ligand binding free energy from a molecular dynamics simulation trajectory. One Structure calculates it for a single-frame PDB structure or a docked PDB structure; the MMPBSA of One Structure workflow accepts a PDB file directly as input.
    Index Name takes the group names of the receptor and ligand, while Custom Name takes their residue numbers in the PDB file.
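
    For orientation, the interaction entropy of Duan et al. (2016) is -TΔS = kT ln⟨exp(βΔE)⟩, where ΔE is the deviation of each frame's receptor-ligand interaction energy from its trajectory average and β = 1/kT. The sketch below evaluates that expression for a list of per-frame energies in kJ/mol. The function name, default temperature, and input values are assumptions for illustration; the module's own implementation may differ.

```python
import math

GAS_CONSTANT = 8.314462618e-3  # kJ/(mol*K)


def interaction_entropy(energies, temperature=298.15):
    """Return -T*dS in kJ/mol from per-frame interaction energies (kJ/mol)."""
    rt = GAS_CONSTANT * temperature
    mean_energy = sum(energies) / len(energies)
    # Average of exp(beta * dE), with dE the fluctuation around the mean.
    avg_exp = sum(math.exp((e - mean_energy) / rt) for e in energies) / len(energies)
    return rt * math.log(avg_exp)


# Made-up interaction energies for five frames.
frames = [-210.0, -205.0, -215.0, -208.0, -212.0]
print(interaction_entropy(frames))
```

    Large energy fluctuations dominate the exponential average, which is why an unstable stretch of trajectory inflates the entropy term.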

    Parameters

    Trajectory Method

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand, can be Protein, DNA, or RNA. If it is a small molecule, enter its name as it appears in the PDB file. Use ‘Other’ to treat everything in the system other than the protein (including small molecules) as the ligand.

    Reference Structure (GRO)

    Reference structure. Default: system.gro.
    This file can be found in the output results of the GMX MD Run (GMX2024) module.
    Use this parameter when periodic boundary conditions are not handled properly.

    Start Time (ps)

    Start frame time in ps. It is best to choose a window in which the RMSD is stable, to avoid the inflated entropy estimate caused by structural instability.

    End Time (ps)

    End frame time in ps. It is best to choose a window in which the RMSD is stable, to avoid the inflated entropy estimate caused by structural instability.

    Skip Time (ps)

    Time interval in ps.

    Index File

    Index file in ndx format. When there is a membrane structure in the system, you can obtain the index.ndx file through the Membrane Solvation module. In membrane simulations, the receptor is ‘receptor’, the ligand is ‘ligand’, and the membrane is ‘membrane’.

    Custom Receptor

    Define the receptor group for the binding energy calculation. Enter the sequence numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Residues are renumbered starting from 1, independent of the numbering in the original PDB file.

    Custom Ligand

    Define the ligand group for the binding energy calculation. Enter the sequence numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. Residues are renumbered starting from 1, independent of the numbering in the original PDB file.

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    Compressed archive of the system parameter files, in tar.gz format, obtained from the MD Solvation module or Membrane Solvation module.

    Results

    The output includes:

    Output File Name Description
    MMPBSA_result.csv Summary file of MMPBSA results.
    MMPBSA_Residue.csv Energy decomposition data in CSV format.
    MMPBSA.pdb PDB file in which each atom carries its MMPBSA energy. It can be used to draw a surface map for a given energy category, where the color intensity shows which regions contribute most to the binding energy.
    MMPBSA.tar.gz All raw MMPBSA files. The _mmpbsa_residue_#.txt files contain the per-residue values of each energy category, 7 categories in total: van der Waals energy (VDW), electrostatic energy (ELE), polar part of the solvation energy (PB), nonpolar part of the solvation energy (SA), VDW+ELE=MM, PB+SA=PBSA, MM+PBSA=Binding/MMPBSA. _mmpbsa_residue.txt summarizes those 7 files and is the raw file behind MMPBSA_Residue.csv. The _mmpbsa_atom#.pdb files assign each energy category to the atoms of a PDB file, similar to MMPBSA.pdb.
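
    The relationship between the seven energy categories can be written out as a short sketch. The conversion from a binding free energy to Ka and Ki is not specified in this documentation; the version below assumes the standard thermodynamic relation ΔG = -RT ln Ka with Ki = 1/Ka, so treat it as an illustration rather than the module's formula. All function names and numbers are hypothetical.

```python
import math

GAS_CONSTANT = 8.314462618e-3  # kJ/(mol*K)


def combine_terms(vdw, ele, pb, sa):
    """Derive the composite categories from the four primary terms (kJ/mol)."""
    mm = vdw + ele      # molecular mechanics part
    pbsa = pb + sa      # solvation part
    return {"MM": mm, "PBSA": pbsa, "Binding": mm + pbsa}


def binding_constants(delta_g, temperature=298.15):
    """Assumed relation: Ka = exp(-dG/RT), Ki = 1/Ka (1 M standard state)."""
    ka = math.exp(-delta_g / (GAS_CONSTANT * temperature))
    return ka, 1.0 / ka


# Made-up energies for one complex.
terms = combine_terms(vdw=-150.0, ele=-50.0, pb=170.0, sa=-20.0)
ka, ki = binding_constants(terms["Binding"])
print(terms, ka, ki)
```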

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: MD RMS v2
    Description: 计算平衡模拟轨迹的均方根偏差(RMSD)和均方根波动(RMSF),从而分析结构的稳定性和结构变化情况。 Calculates the RMSD or RMSF to analyze the structural stability of the system.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    RMS

    简介

    通过计算平衡模拟轨迹的均方根偏差(RMSD,Root Mean Square Deviation)和均方根波动(RMSF,Root Mean Square Fluctuation),从而分析结构的稳定性和结构变化情况。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2024)模块或者AlphaAutoMD模块中获取。

    Analysis Type

    选择分析类型:RMSD或者RMSF(可多选)。

    Reference Structure (GRO)

    参考结构。默认system.gro。可以在GMX MD Run (GMX2024)模块的输出结果文件中找到。当周期性边界条件处理不当时可以使用该参数。

    System Group

    选择需要计算的组别。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Custom Atom

    自定义需要计算的原子编号,用逗号隔开,例如:CA,O,H。与Custom Resid是交集关系。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,可由Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    rmsd_result.csv 所选组别的RMSD的CSV文件
    rmsd_result.png 所选组别的RMSD的PNG文件
    rmsd_result.xvg 所选组别的RMSD的XVG文件
    rmsf_*.csv 所选组别的RMSF的CSV文件
    rmsf_*.png 所选组别的RMSF的PNG文件
    rmsf_*.xvg 所选组别的RMSF的XVG文件
    bfac_*.pdb PDB中的B-Factor一列为原子RMSF值。RMSF值通过公式<Δr^2>=3B/(8π^2)转换为b-factor值。

    RMS

    Introduction

    By calculating the Root Mean Square Deviation (RMSD) and Root Mean Square Fluctuation (RMSF) of equilibrium simulation trajectories, the stability and structural changes of the system can be analyzed.
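
    The two quantities can be sketched from their textbook definitions: RMSD compares one frame with a reference structure, averaged over atoms, while RMSF measures how far each atom strays from its own average position, averaged over frames. The sketch assumes the frames are already superimposed on the reference (the least-squares fitting step is omitted) and uses made-up coordinates; the function names are illustrative.

```python
import math


def rmsd(frame, reference):
    """Root mean square deviation of one frame from a reference structure."""
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(frame, reference))
    return math.sqrt(squared / len(reference))


def rmsf(frames):
    """Per-atom root mean square fluctuation around the time-averaged position."""
    n_frames = len(frames)
    result = []
    for atom in range(len(frames[0])):
        mean = tuple(sum(f[atom][k] for f in frames) / n_frames for k in range(3))
        msf = sum(math.dist(f[atom], mean) ** 2 for f in frames) / n_frames
        result.append(math.sqrt(msf))
    return result


# Two frames of a two-atom toy system: atom 0 is fixed, atom 1 moves along x.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(rmsd(frames[1], frames[0]))
print(rmsf(frames))
```

    RMSD gives one value per frame (a time series), whereas RMSF gives one value per atom, which is why the module writes them to separate files.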

    Parameter Description

    Path File

    The path file obtained after MD simulation, which can be obtained from the GMX MD Run (GMX2024) module or the AlphaAutoMD module.

    Reference Structure (GRO)

    Reference structure. Default: system.gro.
    This file can be found in the output results of the GMX MD Run (GMX2024) module.
    Use this parameter when periodic boundary conditions are not handled properly.

    Analysis Type

    Select the type of analysis: RMSD or RMSF (multiple selections possible).

    System Group

    Select the group to be calculated.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues, for example: 1-10,15.

    Custom Atom

    Custom atom names for calculation, separated by commas, for example: CA, O, H. The selection is intersected with Custom Resid.

    Skip Time (ps)

    Time interval between frames (in ps).

    Index File

    Index file obtained from the Membrane Solvation module.

    Result Description

    The output results include:

    Output File Name Description
    rmsd_result.csv CSV file of RMSD for the selected group
    rmsd_result.png PNG file of RMSD for the selected group
    rmsd_result.xvg XVG file of RMSD for the selected group
    rmsf_*.csv CSV file of RMSF for the selected group
    rmsf_*.png PNG file of RMSF for the selected group
    rmsf_*.xvg XVG file of RMSF for the selected group
    bfac_*.pdb PDB file whose B-factor column holds the atomic RMSF, converted to a B-factor with the formula <Δr^2>=3B/(8π^2), i.e. B = (8π^2/3)<Δr^2>.
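
    The B-factor conversion above can be worked through numerically. The sketch assumes the RMSF is given in nm (the GROMACS length unit) and that the B-factor is wanted in Å^2 (the PDB convention); both unit choices and the function name are assumptions, not statements about the module's output.

```python
import math


def rmsf_to_bfactor(rmsf_nm: float) -> float:
    """Convert an RMSF in nm to a B-factor in Å^2 using B = (8*pi^2/3) * <dr^2>."""
    rmsf_angstrom = rmsf_nm * 10.0  # 1 nm = 10 Å
    return 8.0 * math.pi ** 2 / 3.0 * rmsf_angstrom ** 2


# An atom fluctuating by 0.1 nm (1 Å) maps to a B-factor of about 26.3 Å^2.
print(round(rmsf_to_bfactor(0.1), 1))
```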
  • Name: MD PCA v2
    Description: 从高维数据中分析出主要的影响因素 (本征向量) ,前几个本征向量(主成分,如前两个主成份则为 PC1,PC2) 一般可以描述分子运动的大部分信息。 Analyze the main influencing factors (eigenvectors) from the high-dimensional data. The first few eigenvectors (principal components, such as PC1 and PC2 for the first two principal components) can generally describe most of the information about molecular motion.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-07-06 00:51:22
    Reference:

    MD PCA (v2)

    简介

    N个原子的柔性大体系如蛋白,其运动轨迹需要3N维笛卡尔坐标来描述,这样高维的数据很难理解和直观分析。MD PCA(Principal component analysis,PCA)模块可以从高维数据中分析出主要的影响因素 (本征向量) 。前几个本征向量(主成分,如前两个主成份则为 PC1,PC2) 一般可以描述分子运动的大部分信息。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA,Complex。
    可以根据PDB中小分子的名称填写组别名称。
    注:其中Complex指的是蛋白-小分子复合物体系。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    得到结果文件,每种类型的文件如果包含PNG、CSV以及XVG后缀,相同名称只是表现形式不同,数据一样

    输出文件名称 说明
    Gibbs_2d.png/Gibbs_3d.png 只计算两个主成分时的二维和三维自由能景观图
    average.pdb 计算后的平均结构文件
    eigenvalues.xvg/.png/.csv 本征值文件
    filtered.pdb 计算的降维过滤后的轨迹文件
    proj1.xvg/.png/.csv 对应的主成分PC1文件
    proj2.xvg/.png/.csv 对应的主成分PC2文件
    proj_all.xvg 计算的PC1到PC2的主成份合并文件

    MD PCA (v2)

    Introduction

    For a large flexible system with N atoms such as a protein, its motion trajectory requires 3N-dimensional Cartesian coordinates to describe, making it difficult to understand and analyze high-dimensional data. The MD PCA (Principal Component Analysis, PCA) module can analyze the main influencing factors (eigenvectors) from high-dimensional data. The first few eigenvectors (principal components, such as the first two principal components PC1, PC2) can generally describe most of the information about molecular motion.
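
    The idea can be sketched with NumPy: build the covariance matrix of the 3N coordinates across frames, diagonalize it, and sort the eigenvectors by eigenvalue so that PC1 and PC2 come first. This is a generic PCA sketch on synthetic data, assuming pre-aligned frames and no mass weighting; it is not the module's implementation, and the function name pca is illustrative.

```python
import numpy as np


def pca(coords):
    """coords: (n_frames, 3N) array. Returns eigenvalues, eigenvectors, projections."""
    centered = coords - coords.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1]                   # largest first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    projections = centered @ eigenvectors                   # column 0 = PC1, column 1 = PC2
    return eigenvalues, eigenvectors, projections


# Synthetic 2-atom "trajectory": one collective motion plus small random noise.
rng = np.random.default_rng(0)
direction = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
amplitude = np.linspace(-1.0, 1.0, 50)
coords = np.outer(amplitude, direction) + 0.01 * rng.normal(size=(50, 6))

eigenvalues, eigenvectors, projections = pca(coords)
print(eigenvalues[0] / eigenvalues.sum())  # fraction of variance captured by PC1
```

    In this toy case almost all of the variance falls on PC1, which mirrors the statement that the first few eigenvectors describe most of the motion.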

    Parameter Description

    Path File

    Path file produced by the MD simulation. It is available from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA, or Complex.
    You can also specify the group name based on the small molecule names in the PDB file.
    Note: “Complex” refers to protein-small molecule complex systems.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The following result files are produced. Where a result is provided with PNG, CSV, and XVG suffixes, the files share a name and contain the same data in different formats.

    Output File Name Description
    Gibbs_2d.png/Gibbs_3d.png 2D and 3D free energy landscape plots when only two principal components are considered
    average.pdb Computed average structure file
    eigenvalues.xvg/.png/.csv Eigenvalues file
    filtered.pdb Filtered trajectory file after dimensionality reduction
    proj1.xvg/.png/.csv Corresponding principal component PC1 file
    proj2.xvg/.png/.csv Corresponding principal component PC2 file
    proj_all.xvg Combined file of principal components PC1 to PC2
  • Name: MD (GMX2024)
    Description: 利用准备好的体系拓扑文件以及参数文件进行基于GROMACS 2024 的分子动力学模拟。 Runs MD using the prepared system topology and parameter files based on GROMACS 2024.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 11:21:21
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    GMX MD Run (GMX2024)

    简介

    提交GROMACS对应文件,从而进行分子动力学模拟,得到平衡模拟后得到的轨迹文件。

    参数说明

    GRO File

    提交模拟体系的gro文件。该文件可以从MD Solvation或者Membrane Solvation模块获取。

    Topology File

    提交模拟体系的top文件。该文件可以从MD Solvation或者Membrane Solvation模块获取。

    ITP File

    提交模拟体系的itp文件。该文件可以从MD Solvation或者Membrane Solvation模块获取。

    Minimize MDP File

    提交进行最小化的参数化文件,文件格式为mdp。可以根据GMX MDP Generation模块的Minimization方法或者GMX MDP Generation (Auto)生成。

    NPT MDP File

    提交进行等压等温的参数化文件,文件格式为mdp。可以根据GMX MDP Generation模块的NPT方法或者GMX MDP Generation (Auto)生成。

    MD MDP File

    提交进行平衡模拟的参数化文件,文件格式为mdp。可以根据GMX MDP Generation模块的MD方法或者GMX MDP Generation (Auto)生成。

    结果说明

    输出结果包括:

    输出文件名称 说明
    md.cpt md模拟断点文件
    md.gro md的分子坐标文件
    md.log md记录文件
    md.tpr md模拟所需的所有初始化数据(分子拓扑、初始结构等)
    mini.gro mini运行的分子坐标文件
    mini.log mini运行记录文件
    mini.tpr mini模拟运行所需的所有初始化数据(分子拓扑、初始结构等)
    npt.gro npt的分子坐标文件
    npt.log npt记录文件
    npt.tpr npt模拟所需的所有初始化数据(分子拓扑、初始结构等)
    path.txt 模拟轨迹文件存储路径,可用于后续分析模块的Path File输入。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    GMX MD Run (GMX2024)

    Introduction

    Submits the prepared input files to GROMACS to run a molecular dynamics simulation and produce the trajectory files of the equilibrium simulation.

    Parameter Description

    GRO File

    Submit the gro file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    Topology File

    Submit the top file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    ITP File

    Submit the itp file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    Minimize MDP File

    Submit the parameter file for minimization, in mdp format. This file can be generated using the GMX MDP Generation module with the Minimization method or GMX MDP Generation (Auto).

    NPT MDP File

    Submit the parameter file for NPT (isothermal-isobaric) simulation, in mdp format. This file can be generated using the GMX MDP Generation module with the NPT method or GMX MDP Generation (Auto).

    MD MDP File

    Submit the parameter file for equilibrium simulation, in mdp format. This file can be generated using the GMX MDP Generation module with the MD method or GMX MDP Generation (Auto).

    Result Description

    The output results include:

    Output File Name Description
    md.cpt Checkpoint file for the MD simulation
    md.gro Molecular coordinate file for the MD simulation
    md.log Log file for the MD simulation
    md.tpr All initial data required for the MD simulation (molecular topology, initial structure, etc.)
    mini.gro Molecular coordinate file for the minimization run
    mini.log Log file for the minimization run
    mini.tpr All initial data required for the minimization run (molecular topology, initial structure, etc.)
    npt.gro Molecular coordinate file for the NPT simulation
    npt.log Log file for the NPT simulation
    npt.tpr All initial data required for the NPT simulation (molecular topology, initial structure, etc.)
    path.txt Path to store the simulation trajectory files, which can be used as input for the Path File in subsequent analysis modules.
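
    The output names above (mini.*, npt.*, md.*) suggest a three-stage run: minimization, then NPT, then MD. The sketch below only builds the command lines such a run might use; it is not the module's actual implementation. `gmx grompp` (`-f`, `-c`, `-p`, `-o`) and `gmx mdrun -deffnm` are standard GROMACS commands, while the function names and the chaining of each stage's .gro file into the next are assumptions made for illustration.

```python
from typing import List


def stage_commands(stage: str, mdp: str, start_gro: str, top: str) -> List[List[str]]:
    """Build the grompp and mdrun command lines for one stage (hypothetical helper)."""
    grompp = ["gmx", "grompp", "-f", mdp, "-c", start_gro, "-p", top, "-o", f"{stage}.tpr"]
    # With -deffnm, mdrun names its outputs after the stage: <stage>.gro, <stage>.log, ...
    mdrun = ["gmx", "mdrun", "-deffnm", stage]
    return [grompp, mdrun]


def build_pipeline(gro: str, top: str, mini_mdp: str, npt_mdp: str, md_mdp: str) -> List[List[str]]:
    """Chain minimization -> NPT -> MD, feeding each stage the previous .gro file."""
    commands: List[List[str]] = []
    commands += stage_commands("mini", mini_mdp, gro, top)
    commands += stage_commands("npt", npt_mdp, "mini.gro", top)  # assumed chaining
    commands += stage_commands("md", md_mdp, "npt.gro", top)     # assumed chaining
    return commands


if __name__ == "__main__":
    for cmd in build_pipeline("system.gro", "topol.top", "mini.mdp", "npt.mdp", "md.mdp"):
        print(" ".join(cmd))
```

    A real run may need extra options (for example position-restraint references or checkpoint continuation), which this sketch leaves out.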

    Reference Literature

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: Structure Prediction (Protenix v2.0)
    Description: Protenix is an AlphaFold3-like (AF3-like) structure prediction model developed by ByteDance, supporting multiple molecular modalities including proteins, nucleic acids (DNA, RNA), small molecules, and metal ions.
    Tags: undefined
    Author: ByteDance
    Release: 2024-12-30 09:46:26
    Reference: Protenix - Advancing Structure Prediction Through a Comprehensive AlphaFold3 Reproduction. ByteDance AML AI4Science Team, Xinshi Chen, Yuxuan Zhang, Chan Lu, Wenzhi Ma, Jiaqi Guan, Chengyue Gong, Jincai Yang, Hanyu Zhang, Ke Zhang, Shenghao Wu, Kuangqi Zhou, Yanping Yang, Zhenyu Liu, Lan Wang, Bo Shi, Shaochen Shi, Wenzhi Xiao.

    Structure Prediction (Protenix v2.0)

    简介

    Protenix是字节跳动公司AML AI4Science团队复现的pytorch版本的AlphaFold3模型。以下是ByteDance AML AI4Science团队的主要贡献概要:
    - 模型性能:将Protenix与现有的模型进行了基准测试。Protenix在不同分子类型的结构预测中表现出强大的性能。作为一个完全开源的模型,它使研究人员能够生成新的预测并对模型进行微调,以满足特定的应用需求。
    - 方法:在复现过程中,依据AF3的描述实现了Protenix,并优化了一些模糊步骤,纠正了排版错误,并根据模型行为进行了有针对性的调整。通过分享复现经验,希望支持社区在这些改进的基础上进一步推动该领域的发展。
    - 可访问性:已将Protenix开源,提供了模型权重、推理代码和可训练代码供研究用途。

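    Protenix v2.0
    Protenix v2.0 is a major upgrade of the open-source structure foundation model, released by ByteDance's AI for Science team in April 2026. It brings significant advances in protein structure prediction and biomolecular design, focusing on the difficult problem of antibody-antigen complex prediction while improving the chemical plausibility of small molecules. Compared with baseline models and the earlier Protenix-v1, Protenix-v2 shows substantial improvement: at a DockQ > 0.23 threshold, it raises the absolute success rate by 9 to 13 percentage points over Protenix-v1 on three test sets. Notably, Protenix-v2 with only 5 seeds outperforms Protenix-v1 with 1000 seeds, a clear efficiency gain.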

    参数说明

    Protein Sequence

    蛋白序列文件,FASTA格式,支持多条序列。
    注意:多蛋白复合物结构预测,其氨基酸序列输入格式如下:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    DNA序列文件,FASTA格式,支持多条序列。

    RNA Sequence

    RNA序列文件,FASTA格式,支持多条序列。

    备注:当前支持计算的残基/碱基数量在1400个左右。

    Ligand

    文本文件包含小分子信息,TXT格式。支持SMILES或 CCD Code(化学组分词典编号)。如果使用SMILES格式,每行应包含一个小分子;如果使用CCD Code,每行可以包含一个或多个小分子,使用逗号分隔,并加上CCD前缀。示例如下:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    注意:不适用于配体蛋白或多肽的氨基酸序列格式输入。

    Modification

    包含翻译后修饰(PTM)信息的文本文件,TXT格式。每行放置一个PTM信息,每个PTM信息由三部分组成:

    • 发生PTM序列的顺序编号
    • PTM类型的CCD编号
    • 发生PTM的残基位置编号
      三部分由逗号分隔,例如:1,HY3,1 表示第一条序列的第一个残基,发生了类型为HY3(CCD编号,为3-羟基脯氨酸,为脯氨酸的羟基化)的PTM
      备注:
    • 序列的顺序编号,是依次按上述参数Protein、DNA、RNA中的序列顺序与数量,从1开始进行编号,例如:当有2条蛋白序列,1条DNA序列,1条RNA序列时,各序列对应的编号为:第一条蛋白序列编号为1,第二条蛋白序列编号为2,DNA序列编号为3,RNA序列编号为4
    • CCD的介绍参考https://www.wwpdb.org/data/ccd 编号查询网址为https://www.ebi.ac.uk/pdbe-srv/pdbechem/
      包含多个PTM信息的文件内容示例如下:
    1,HY3,1
    1,P1L,5
    2,HY3,3
    

    Covalent Bond

    共价键信息的文本文件,TXT格式。每行放置一个共价键信息,每个共价键信息包含两个原子信息,每个原子信息由三部分组成:

    • 原子所在序列或小分子的顺序编号(编号规则在Modification中定义的序列编号规则基础上,在最后加入小分子的顺序即可)

    示例一:
    当有2条蛋白序列,1条DNA序列,1条RNA序列,2个小分子时。对应的编号为:
    第一条蛋白序列编号为1,第二条蛋白序列编号为2,DNA序列编号为3,RNA序列编号为4,第一个小分子对应的编号为5,第二个小分子对应的编号为6

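    Example 2:
    If there are 3 protein sequences and 2 small molecules, the numbering is as follows:
    The first protein sequence is 1, the second protein sequence is 2, the third protein sequence is 3, the first small molecule is 4, and the second small molecule is 5.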

    • 原子所在残基的位置编号(如残基为小分子时,编号为1)
    • 原子的标准名称:
      • 默认是CCD中定义的原子标准名称
      • 如果配体是SMILES,则是SMILES字符串中原子对应的从1开始位置序号。

    三部分由逗号分隔,

    • 当小分子为CCD时,如3,1,CA表示第三个实体(序列或小分子)中的第一个残基(或小分子)的CA原子
      一个共价键是由两个原子信息组成,原子间用分号分隔,如:1,1,CA;2,1,CA
      表示一个共价键,该共价键由两个原子组成,第一个原子为1,1,CA,第二个原子为2,1,CA
      包含多个共价键信息的文件内容示例如下:
    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
    
    • 当小分子为SMILES时,如CC(=O)NCCNC(C)=O,如果该小分子的顺序编号(按上述方式确认)为3,其第一个C原子参与形成共价键,与编号为1的链/序列中第一个残基的CA原子,则共价键的定义为1,1,CA;3,1,C1其中C1表示小分子的第一个C原子,如果是第二个C原子,用C2表示。
      文件内容示例如下:
    1,1,CA;3,1,C1
    

    Ion

    离子名称,可以包含一个或多个离子,需写在一行文本中,不同的离子使用英文逗号分隔,支持输入离子数量,使用英文冒号分隔。示例如下:

    MG:2,ZN,CU:3
    

    表示2个MG离子,1个ZN离子,3个CU离子

    Contacts

    包含残基间、或原子间、或残基与原子间的距离限制信息的文本文件,每行定义一个距离限制信息。

    每个距离限制的定义由四部分组成,每部分之间通过英文分号分隔:

    • 残基1或原子1的信息
      残基信息由两部分组成:残基所在序列的顺序编号(见Covalent Bond参数中定义),残基的位置编号(从1开始顺序编号),使用英文逗号分隔。如:1,24表示第一条序列的第24个残基。
      原子信息由三部分组成:原子所在序列或小分子的顺序编号,原子所在残基的位置编号(如残基为小分子时,编号为1),原子的标准名称(见Covalent Bond参数中定义)

    • 残基2或原子2的信息(同上)

    • 最大距离(单位为埃)

    • 最小距离(单位为埃)

    包含多个距离限制信息的文件内容示例如下:

    1,169;2,1,C5;6;0
    1,24,CA;2,1;6;0
    1,169;2,1;6;3
    1,169,CA;2,1,C5;6;3
    

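    This means:

    • The residue at position 169 in the first sequence and the C5 atom of residue 1 (which may also be a small molecule) in the second sequence are constrained to a distance of 0–6 Å.
    • The CA atom of the residue at position 24 in the first sequence and residue 1 in the second sequence are constrained to a distance of 0–6 Å.
    • The residue at position 169 in the first sequence and residue 1 in the second sequence are constrained to a distance of 3–6 Å.
    • The CA atom of the residue at position 169 in the first sequence and the C5 atom of residue 1 (which may also be a small molecule) in the second sequence are constrained to a distance of 3–6 Å.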

    Pocket

    结合位点类型限制信息的文本文件,TXT格式,当前只支持单个pocket信息。pocket信息由三部分组成:

    • 序列或小分子Binder的顺序编号(见Covalent Bond参数中定义)。
    • 结合位点的残基信息,每个残基信息由其所在序列编号与残基位置编号组成,逗号分隔,如:1,25 表示第一条序列中的第25个残基;可以定义多个残基信息,由英文分号“;”进行分隔,如1,25;1,27;1,32;1,38表示第一条序列中的第25/27/32/38号残基形成结合位点
    • Binder与结合位点之间的最大距离(单位为埃),如6
      上述三部分信息之间也用英文分号“;”进行分隔,例如:2;1,55;1,62;1,91;1,92;1,99;1,110;6表示第二个实体(序列或小分子)作为Binder,与第一条序列的第55/62/91/92/99/110号残基所形成的结合位点进行结合,且两者之间的最大距离为6埃。
      文件内容示例如下:
    2;1,55;1,62;1,91;1,92;1,99;1,110;6
    

    Use Protenix_Mini

    是否使用Protenix_Mini模型,该模型仅使用ESM2-3B特征,不依赖MSA信息,推理速度最快,适合高通量场景。

    Seed

    随机数种子,用于控制预测过程中的随机性。 输入格式:逗号分隔的整数,例如:1,39,248,1970,20967
    取值规则:至多取前 5 个整数作为随机种子
    默认值:1,39,248,1970,20967

    Format

    输出结构的格式,支持PDB或CIF格式,默认为PDB格式。

    Batch Mode

    Protein Sequence

    蛋白的序列文件,FASTA格式,支持多条序列。
    每一条记录代表一个待预测的结构,每条记录的名称要唯一不能重复。一条记录中有多条链时,通过英文冒号(:)相连,文件内容示例如下:

    >1
    EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
    >2
    YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
    

    表示有两个待预测的结构,第一条记录的名称为1,有三条蛋白链,用:进行分隔。第二条记录的名称为2,为单链。

    DNA Sequence

    DNA核酸的序列文件,FASTA格式,支持多条序列。
    每一条序列记录代表一个待预测的结构,每条记录的名称要唯一不能重复(可与Protein参数中的记录名称一致,表示该记录的DNA序列与Protein序列归属于同一结构)。一条记录中有多条链时,通过英文冒号(:)相连,文件内容示例如下:

    >dna
    GACCTCT:CCTAGCT
    >1
    CCTAGCT
    

    表示有两条记录,第一条的名称为dna,有两条DNA链,用:进行分隔,因为该名称不存在与Protein示例记录中,属于新结构。第二条的名称为1,有一条DNA链,因为该名称存在于Protein示例记录中,则表示同属一个结构(该结构同时包含Protein序列和该DNA序列)。

    RNA Sequence

    RNA核酸分子的序列文件,FASTA格式,支持多条序列。
    每一条序列记录代表一个待预测的结构,每条记录的名称要唯一不能重复(可与DNA或Protein参数中的记录名称一致,表示该记录的RNA序列与上面的DNA序列或Protein序列归属于同一结构)。一条记录中有多条链时,通过英文冒号(:)相连,文件内容示例如下:

    >1
    AGCU
    >rna
    AGGCU:UGAUC
    

    表示有两条记录,第一条的名称为1,为单链,因为该名称存在于DNA及Protein示例记录中,表示同属一个结构(该结构同时包含了Protein序列、DNA序列及该RNA序列)。第二条的名称为rna,有两条RNA链,用:进行分隔,因为该名称不存在于DNA或Protein示例记录中,属于新结构。

    Ligand

    文本文件包含小分子信息,TXT格式。支持SMILES或 CCD Code(化学组分词典编号)。如果使用SMILES格式,每行应包含一个小分子;如果使用CCD Code,每行可以包含一个或多个小分子,使用逗号分隔。
    每行代表一个待预测的结构,每行可放置多个ligand,且以唯一不重复的名称开头(该名称可与上述RNA,DNA或Protein参数中的记录名称一致,表示该行的所有ligands,与上述的RNA或DNA或Protein序列归属于同一结构),名称与所有ligands都以英文冒号(:)分隔。文件内容示例如下:

    1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3
    lig:CC1CC(CN2NC[N+](=C3CCC(OC(F)(F)F)CC3)C2=O)CC(C)C1OC(C)(C)C(=O)[O-]
    

    表示有两条记录,第一条的名称为1,有三个ligand(一个SMILES,两个CCD codes),因为该名称存在于上述的RNA或DNA或Protein示例记录中,表示同属一个结构。第二条的名称为lig,有一个ligand(为SMILES),因为该名称不存在上述的RNA或DNA或Protein示例记录中,属于新结构。
    注意:
    1.不适用于配体蛋白或多肽的氨基酸序列格式输入。
    2.在预测每个结构的序列数量相同的情况下,可为所有待预测结构设置相同的小分子信息,设置方式为输入一行小分子信息(可多个),且不设置结构名称,如CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,表示为所有结构都加入小分子CC(=O)OC1C[NH+]2CCC1CC2与ATP。

    Modification

    包含翻译后修饰(PTM)信息的文本文件,TXT格式。每个PTM的信息与Single模式中一致(参考Single模式中的定义)。
    每行定义一个结构的所有PTM信息,且以唯一名称开头(该名称必须存在于前述的Protein或DNA或RNA记录中),都以英文冒号(:)分隔。文件内容示例如下:

    1:1,HY3,1:1,P1L,5:2,HY3,3
    2:1,HY3,1:2,HY3,3
    

    表示前述名称为1的结构中(Protein或DNA或RNA),有三个PTM。名称为2的结构中,有两个PTM。
    注意:在预测每个结构的序列数量相同的情况下,可为所有待预测结构设置相同的PTM信息,设置方式为输入一行PTM信息,且不设置结构名称,如:1,HY3,1:1,P1L,5:2,HY3,3表示这些PTM信息将应用到所有结构。

    Ion

    离子名称,可以包含一个或多个离子,需写在一行文本中,不同的离子使用英文逗号分隔,支持输入离子数量,使用英文冒号分隔。每行定义一个结构的所有离子信息,且以唯一名称开头,都以英文冒号(:)分隔。文件内容示例如下:

    1:MG:2,ZN,CU:3
    

    表示前述名称为1的结构中,有2个MG离子,1个ZN离子,3个CU离子
    注意:在预测每个结构的序列数量相同的情况下,可为所有待预测结构设置相同的离子信息,设置方式为输入一行离子信息,且不设置结构名称,如:MG:2,ZN,CU:3,表示这些离子信息将应用到所有结构。

    Covalent Bond

    共价键信息的文本文件,TXT格式。每个共价键的信息与Single模式中一致(参考Single模式中的定义)。
    Batch模式下,每行定义一个结构的所有共价键,且以唯一名称开头(该名称必须存在于前述的Protein或DNA或RNA记录中),都以英文冒号(:)分隔。文件内容示例如下:

    1:1,1,CA;2,1,CA:1,1,CA;3,1,CHA
    2:1,1,CA;3,1,CHA
    

    表示前述名称为1的结构中(Protein或DNA或RNA),有两个共价键。名称为2的结构中,有一个共价键。
    注意:在预测每个结构的序列数量相同的情况下,可为所有待预测结构设置相同的共价键信息,设置方式为输入一行共价键信息,且不设置结构名称,如:1,1,CA;3,1,CHA表示该共价键信息将应用到所有结构。

    Contact

    接触类型限制信息的文本文件,TXT格式。每个接触信息的定义与Single模式中一致(参考Single模式中的定义)。
    Batch模式下,每行定义一个结构的所有接触信息,且以唯一名称开头(该名称必须存在于前述的Protein或DNA或RNA记录中),都以英文冒号(:)分隔。文件内容示例如下:

    1:1,35;2,62;6.0
    2:1,48;2,CA;6.0:1,35;2,62;6.0
    

    表示前述名称为1的结构中(Protein或DNA或RNA),有一个接触限制。名称为2的结构中,有两个接触限制。
    注意:在预测每个结构的序列数量相同的情况下,可为所有待预测结构设置相同的接触信息,设置方式为输入一行接触信息,且不设置结构名称,如:1,35;2,62;6.0表示该Contact信息将被应用到所有结构。

    Pocket

    结合位点类型限制信息的文本文件,TXT格式。每个结合位点信息的定义与Single模式中一致(参考Single模式中的定义)。
    Batch模式下,每行定义一个结构的所有结合位点限制信息,且以唯一名称开头(该名称必须存在于前述的Protein或DNA或RNA记录中),都以英文冒号(:)分隔。文件内容示例如下:

    1:2;1,55;1,62;1,91;1,92;1,99;1,110
    2:1;2,15;2,17;2,18;2,56:1;3,76;3,78;3,96
    

    表示前述名称为1的结构中(Protein或DNA或RNA),有一个结合位点限制。名称为2的结构中,有两个结合位点限制。
    注意:在预测每个结构的序列数量相同的情况下,可为所有待预测结构设置相同的限制信息,设置方式为输入一行限制信息,且不设置结构名称。如:2;1,55;1,62;1,91;1,92;1,99;1,110表示该pocket信息将被应用到所有结构。

    Use Protenix_Mini

    是否使用Protenix_Mini模型,该模型仅使用ESM2-3B特征,不依赖MSA信息,推理速度最快,适合高通量场景。

    Seed

    随机数种子,用于控制预测过程中的随机性。 输入格式:逗号分隔的整数,例如:1,39,248,1970,20967
    取值规则:至多取前 5 个整数作为随机种子
    默认值:1,39,248,1970,20967

    Enhanced Mode

    该模式下,会默认使用1000个随机种子,每个随机种子进行5个结构采样,共进行5000个结构的大批量采样,并从中选择评分靠前的多个预测结构,最终获得更高精度的预测结构。该模式特别适用于抗原-抗体复合物结构的高精度预测,有研究表明该模式下抗体-抗原复合物结构预测准确性提升60%。该模式的输入参数与Single Mode一致,一次运行时间约10~20小时。
    备注:
    序列总长度不可超过1300。

    结果说明

    输出结果文件为排名前5的复合物结构rank_1-5.pdb、pred_scores_protenix.csv和protenix_results.tar文件,csv中包含信息如下:

    列名 说明
    Name 复合物结构名称
    Ranking_Score 对预测结构的质量排序的指标分数,值范围在-100至1.5之间,越大表示预测结构的质量越高。该分数综合考虑了四个指标:ptm, iptm, fraction_disordered,has_clash, 计算公式为: Ranking_Score = 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash
    pLDDT 局部结构的可信度指标,值范围是0-100,该值越大说明预测的结构越可靠。低于70被认为可靠性较低,低于50基本认为是可信度非常低,为无序预测
    pTM 预测的TM分数(the predicted template modeling score),衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM 预测的亚基接触面的TM分数(the interface predicted template modeling score),当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定
    Avg_pAE 平均pae分数,pae是预测对齐误差,是残基对水平的置信度指标,用来衡量任意两个残基之间相对空间位置的预测可信度。数值<5,表示残基对之间相对位置预测非常可靠,通常位于同一结构域内;数值在5–10,表示预测较为准确,可能为柔性环区或轻微构象差异区域;数值在10–20,表示相对位置不确定性较高,常见于结构域间连接区或柔性区域;数值> 20,表示预测不可靠,可能为无序区域、错误折叠,或复合物界面不稳定。
    Min_pAE 所有pae分数中的最小值
    Avg_iPAE 结构中相互作用界面的平均pae分数
    Min_iPAE 结构中相互作用界面pae分数中的最小值
    pDockQ2_链名 该链的预测对接评分(pDock2),用于评估该链在复合物界面中的结合可靠性
    pDock2_Avg 链之间的平均预测对接评分,用于整体评估复合物界面质量
    • pDockQ2阈值(继承自 DockQ):
    pDockQ2 范围 结构质量评估
    < 0.23 不正确(Incorrect)
    0.23 – 0.49 可接受(Acceptable)
    0.49 – 0.80 中等质量(Medium)
    > 0.80 高质量(High quality)

    tar文件包含排名前5的复合物结构和pred_scores_protenix.csv打包文件。
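
    The Ranking_Score formula and the pDockQ2 thresholds listed above can be checked with a few lines of Python. This is a minimal sketch, not the tool's own code: the function names are hypothetical, has_clash is treated as 0 or 1, and the handling of values that fall exactly on a threshold (0.49, 0.80) is an assumption, since the table does not say which side they belong to.

```python
def ranking_score(iptm: float, ptm: float, disorder: float, has_clash: bool) -> float:
    """Ranking_Score = 0.8*ipTM + 0.2*pTM + 0.5*disorder - 100*has_clash."""
    clash_penalty = 100.0 if has_clash else 0.0
    return 0.8 * iptm + 0.2 * ptm + 0.5 * disorder - clash_penalty


def pdockq2_label(value: float) -> str:
    """Map a pDockQ2 value to the quality classes in the table above."""
    if value < 0.23:
        return "Incorrect"
    if value < 0.49:          # assumption: 0.49 itself counts as Medium
        return "Acceptable"
    if value <= 0.80:         # assumption: 0.80 itself counts as Medium
        return "Medium"
    return "High quality"


if __name__ == "__main__":
    print(ranking_score(iptm=0.9, ptm=0.8, disorder=0.0, has_clash=False))
    print(pdockq2_label(0.55))
```

    The clash term dominates by design: a single detected clash pushes the score far below any clash-free prediction, which is why the documented range extends down to about -100.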

    参考文献

    • Protenix - Advancing Structure Prediction Through a Comprehensive AlphaFold3 Reproduction. ByteDance AML AI4Science Team, Xinshi Chen, Yuxuan Zhang, Chan Lu, Wenzhi Ma, Jiaqi Guan, Chengyue Gong, Jincai Yang, Hanyu Zhang, Ke Zhang, Shenghao Wu, Kuangqi Zhou, Yanping Yang, Zhenyu Liu, Lan Wang, Bo Shi, Shaochen Shi, Wenzhi Xiao. bioRxiv 2025.01.08.631967; DOI: 10.1101/2025.01.08.631967

    Structure Prediction (Protenix v2.0)

    Introduction

    Protenix is a PyTorch reproduction of the AlphaFold3 model by the AML AI4Science team at ByteDance. The team's main contributions are summarized below:
    - Model Performance: Protenix has been benchmarked against existing models, demonstrating strong performance in structure prediction across different types of molecules. As a fully open-source model, it enables researchers to generate new predictions and fine-tune the model to meet specific application needs.
    - Methodology: Protenix was implemented from the description of AF3, with some ambiguous steps clarified, typographical errors corrected, and targeted adjustments made based on model behavior. By sharing our reproduction experience, we hope to support the community in building on these improvements to advance the field.
    - Accessibility: Protenix has been open-sourced, providing model weights, inference code, and training code for research purposes.

    Parameter

    Single Mode

    Protein Sequence

    A sequence file for proteins in FASTA format, supporting multiple sequences.
    Note: For multi-protein complex structure prediction, the amino acid sequence input format is as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    A sequence file for DNA nucleic acids in FASTA format, supporting multiple sequences.

    RNA Sequence

    A sequence file for RNA nucleic acids in FASTA format, supporting multiple sequences.

    Note: The total number of residues/bases currently supported per calculation is around 1,400.

    Ligand

    A text file containing information about small molecules in TXT format. It supports SMILES or CCD Code (Chemical Component Dictionary number). If using the SMILES format, each line should contain one small molecule; if using CCD Code, each line can contain one or more small molecules, separated by commas, and prefixed with CCD. Examples are as follows:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Note: This parameter does not accept protein or peptide ligands given as amino acid sequences.
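
    The two line formats described above (one SMILES per line, or a comma-separated list prefixed with CCD) can be told apart with a simple prefix check. The sketch below is illustrative only: the function names are hypothetical, and it assumes a line is a CCD list exactly when it starts with "CCD,".

```python
from typing import List, Tuple


def parse_ligand_line(line: str) -> List[Tuple[str, str]]:
    """Return (kind, value) pairs for one line of a ligand file."""
    line = line.strip()
    if line.startswith("CCD,"):
        # "CCD,ATP,HY3" -> one entry per CCD code after the prefix
        codes = [code.strip() for code in line.split(",")[1:] if code.strip()]
        return [("ccd", code) for code in codes]
    # Anything else is treated as a single SMILES string (assumption)
    return [("smiles", line)]


def parse_ligand_file(text: str) -> List[Tuple[str, str]]:
    """Parse every non-empty line of a ligand file."""
    ligands: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if line.strip():
            ligands.extend(parse_ligand_line(line))
    return ligands


if __name__ == "__main__":
    example = "CC(=O)OC1C[NH+]2CCC1CC2\nCCD,ATP,HY3,P1L\nCCD,MG\n"
    for kind, value in parse_ligand_file(example):
        print(kind, value)
```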

    Modification

    An optional text file (TXT format) containing post-translational modification (PTM) information. Each line holds one PTM entry, which consists of three parts:

    • The sequential number of the sequence where the PTM occurs
    • The CCD number for the PTM type
    • The position number of the residue where the PTM occurs
      These three parts are separated by commas. For example, 1,HY3,1 indicates that a PTM of type HY3 (CCD number, which is 3-hydroxyproline, a hydroxylation of proline) occurs at the first residue of the first sequence.
      Notes:
    • The sequential number of the sequence is assigned based on the order and quantity of sequences in the parameters Protein, DNA, and RNA, starting from 1. For example, if there are 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the corresponding numbers are: the first protein sequence is numbered 1, the second protein sequence is numbered 2, the DNA sequence is numbered 3, and the RNA sequence is numbered 4.
    • For an introduction to CCD, refer to https://www.wwpdb.org/data/ccd , and for the number query website, visit https://www.ebi.ac.uk/pdbe-srv/pdbechem/ .
      An example of a file containing multiple PTM entries is as follows:
    1,HY3,1
    1,P1L,5
    2,HY3,3
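
    Each PTM line above is a three-field, comma-separated record. A minimal parsing sketch follows, assuming the field order stated in the text (sequence number, CCD code, residue position); the `Ptm` type and function name are hypothetical.

```python
from typing import NamedTuple


class Ptm(NamedTuple):
    sequence_index: int    # 1-based number of the sequence carrying the PTM
    ccd_code: str          # CCD code of the modification, e.g. "HY3"
    residue_position: int  # 1-based residue position within that sequence


def parse_ptm_line(line: str) -> Ptm:
    """Parse one 'sequence,CCD,position' entry; raises ValueError if malformed."""
    seq, ccd, pos = (part.strip() for part in line.split(","))
    ptm = Ptm(int(seq), ccd, int(pos))
    if ptm.sequence_index < 1 or ptm.residue_position < 1:
        raise ValueError(f"indices are 1-based: {line!r}")
    return ptm


if __name__ == "__main__":
    for entry in ("1,HY3,1", "1,P1L,5", "2,HY3,3"):
        print(parse_ptm_line(entry))
```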
    

    Covalent Bond

    A text file (TXT format) containing covalent bond information. Each line represents one covalent bond, and each bond contains two atom entries. Each atom entry consists of three parts:

    • The sequence or small molecule order index of the atom (the numbering rule is based on the sequence numbering defined in Modification, with small molecules appended at the end).

    Example 1:
    If there are 2 protein sequences, 1 DNA sequence, 1 RNA sequence, and 2 small molecules, the numbering is as follows:
    The first protein sequence is 1, the second protein sequence is 2, the DNA sequence is 3, the RNA sequence is 4, the first small molecule is 5, and the second small molecule is 6.

    Example 2:
    If there are 3 protein sequences and 2 small molecules, the numbering is as follows:
    The first protein sequence is 1, the second protein sequence is 2, the third protein sequence is 3, the first small molecule is 4, and the second small molecule is 5.

    • The residue index of the atom (for small molecules, the residue index is 1)
    • The standard atom name:
      • By default, the standard atom name defined in CCD
      • If the ligand is represented by SMILES, the atom is identified by its 1-based position index in the SMILES string (for example, C1 denotes the first C atom).

    The three parts are separated by commas.

    • When the small molecule is in CCD format, for example, 3,1,CA represents the CA atom of the first residue (or small molecule) in the third entity (sequence or small molecule).
      A covalent bond consists of two atom entries separated by a semicolon, such as: 1,1,CA;2,1,CA
      This represents a covalent bond composed of two atoms, where the first atom is 1,1,CA and the second atom is 2,1,CA.
      An example of a file containing multiple covalent bond entries is as follows:
    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
    
    • When the small molecule is given as SMILES, for example CC(=O)NCCNC(C)=O, suppose its sequential number (determined as described above) is 3 and its first C atom forms a covalent bond with the CA atom of the first residue in chain/sequence 1. The covalent bond is then defined as 1,1,CA;3,1,C1, where C1 denotes the first C atom of the small molecule; the second C atom would be denoted C2.
      An example of the file content is as follows:
    1,1,CA;3,1,C1
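
    The entity numbering rule (proteins first, then DNA, then RNA, then small molecules) and the bond syntax above can be sketched as follows. The function names are hypothetical, and the code only splits the text fields; it does not check that atom names exist in the CCD or in a SMILES string.

```python
from typing import Dict, Tuple

Atom = Tuple[int, int, str]  # (entity number, residue position, atom name)


def number_entities(n_protein: int, n_dna: int, n_rna: int, n_ligand: int) -> Dict[int, str]:
    """Assign 1-based entity numbers in the order protein, DNA, RNA, ligand."""
    kinds = ["protein"] * n_protein + ["dna"] * n_dna + ["rna"] * n_rna + ["ligand"] * n_ligand
    return {index: kind for index, kind in enumerate(kinds, start=1)}


def parse_atom(field: str) -> Atom:
    """Parse 'entity,residue,atom', e.g. '3,1,CHA'."""
    entity, residue, atom = (part.strip() for part in field.split(","))
    return int(entity), int(residue), atom


def parse_bond(line: str) -> Tuple[Atom, Atom]:
    """Parse one covalent bond: two atom entries separated by a semicolon."""
    first, second = line.strip().split(";")
    return parse_atom(first), parse_atom(second)


if __name__ == "__main__":
    print(number_entities(2, 1, 1, 2))
    print(parse_bond("1,1,CA;3,1,C1"))
```

    With 2 proteins, 1 DNA, 1 RNA, and 2 small molecules this reproduces the numbering of Example 1: the small molecules receive numbers 5 and 6.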
    

    Ion

    One or more ion names, written on a single line and separated by commas. An ion count can be given after the name, separated by a colon; an ion without a count is included once. The example below specifies 2 MG ions, 1 ZN ion, and 3 CU ions:

    MG:2,ZN,CU:3
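
    As an illustration of the name:count syntax, the following sketch expands an ion line into a dictionary of counts. The function name is hypothetical, and a missing count is taken to mean one ion.

```python
from typing import Dict


def parse_ions(spec: str) -> Dict[str, int]:
    """Turn 'MG:2,ZN,CU:3' into {'MG': 2, 'ZN': 1, 'CU': 3}."""
    counts: Dict[str, int] = {}
    for item in spec.strip().split(","):
        name, _, count = item.partition(":")
        name = name.strip()
        # A bare name such as "ZN" means a single ion
        counts[name] = counts.get(name, 0) + (int(count) if count.strip() else 1)
    return counts


if __name__ == "__main__":
    print(parse_ions("MG:2,ZN,CU:3"))
```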
    

    Contacts

    A text file containing distance constraints between residues, atoms, or residues and atoms. Each line defines a distance constraint.

    Each distance constraint consists of four parts, separated by semicolons:

    • Information of residue 1 or atom 1
      Residue information includes two parts: the sequence number of the residue (as defined in the Covalent Bond parameters) and the position number of the residue (sequential numbering starting from 1), separated by a comma. For example, 1,24 indicates the 24th residue in the first sequence.
      Atom information includes three parts: the sequence number of the atom (or small molecule), the position number of the residue (if the residue is a small molecule, the number is 1), and the standard name of the atom (as defined in the Covalent Bond parameters).

    • Information of residue 2 or atom 2 (same as above)

    • Maximum distance (in Ångströms)

    • Minimum distance (in Ångströms)

    Example of a file containing multiple distance constraints:

    1,169;2,1,C5;6;0  
    1,24,CA;2,1;6;0  
    1,169;2,1;6;3  
    1,169,CA;2,1,C5;6;3  
    

    This means:

    • The residue at position 169 in the first sequence and the C5 atom of residue 1 (or small molecule) in the second sequence have a distance constraint between 0–6 Å.
    • The CA atom of residue 24 in the first sequence and residue 1 in the second sequence have a distance constraint between 0–6 Å.
    • The residue at position 169 in the first sequence and residue 1 in the second sequence have a distance constraint between 3–6 Å.
    • The CA atom of residue 169 in the first sequence and the C5 atom of residue 1 (or small molecule) in the second sequence have a distance constraint between 3–6 Å.
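
    The four-part constraint syntax above can be parsed as sketched below. The `Site` and `Contact` types are hypothetical names; a two-field site is treated as a whole residue and a three-field site as a single atom, following the description in this section.

```python
from typing import NamedTuple, Optional


class Site(NamedTuple):
    entity: int          # sequence or small-molecule number
    position: int        # residue position (1 for a small molecule)
    atom: Optional[str]  # None means the whole residue


class Contact(NamedTuple):
    first: Site
    second: Site
    max_distance: float  # in Å
    min_distance: float  # in Å


def parse_site(field: str) -> Site:
    parts = [part.strip() for part in field.split(",")]
    if len(parts) == 2:
        return Site(int(parts[0]), int(parts[1]), None)
    if len(parts) == 3:
        return Site(int(parts[0]), int(parts[1]), parts[2])
    raise ValueError(f"expected 2 or 3 comma-separated fields: {field!r}")


def parse_contact(line: str) -> Contact:
    """Parse 'site1;site2;max;min', e.g. '1,169;2,1,C5;6;0'."""
    first, second, max_d, min_d = line.strip().split(";")
    contact = Contact(parse_site(first), parse_site(second), float(max_d), float(min_d))
    if contact.min_distance > contact.max_distance:
        raise ValueError(f"minimum exceeds maximum: {line!r}")
    return contact


if __name__ == "__main__":
    print(parse_contact("1,169,CA;2,1,C5;6;3"))
```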

    Pocket

    A text file (TXT format) containing binding site type constraints. Currently, only single-pocket information is supported. Pocket information consists of three parts:

    • The sequence number of the binder (sequence or small molecule, as defined in the Covalent Bond parameters).
    • The residue information of the binding site. Each residue is defined by its sequence number and residue position number, separated by a comma. For example, 1,25 indicates the 25th residue in the first sequence. Multiple residues can be defined, separated by semicolons. For example, 1,25;1,27;1,32;1,38 indicates that residues 25, 27, 32, and 38 in the first sequence form the binding site.
    • The maximum distance (in angstroms) between the Binder and the binding site, e.g., 6.

    The three parts above are also separated by a semicolon. For example:
    2;1,55;1,62;1,91;1,92;1,99;1,110;6
    indicates that the second entity (sequence or small molecule) acts as the binder, binding to the pocket formed by residues 55, 62, 91, 92, 99, and 110 in the first sequence. The maximum distance between the Binder and Pocket residues is 6 angstroms.

    Example file content:

    2;1,55;1,62;1,91;1,92;1,99;1,110;6
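
    A pocket line can be read as: first field is the binder, last field is the maximum distance, and everything in between is a pocket residue. The sketch below assumes exactly that layout; the function name is hypothetical.

```python
from typing import List, Tuple


def parse_pocket(line: str) -> Tuple[int, List[Tuple[int, int]], float]:
    """Parse 'binder;seq,pos;...;max_distance' into (binder, residues, max_distance)."""
    fields = [field.strip() for field in line.strip().split(";")]
    if len(fields) < 3:
        raise ValueError(f"need binder, at least one residue, and a distance: {line!r}")
    binder = int(fields[0])
    max_distance = float(fields[-1])
    residues = []
    for field in fields[1:-1]:
        sequence, position = field.split(",")
        residues.append((int(sequence), int(position)))
    return binder, residues, max_distance


if __name__ == "__main__":
    print(parse_pocket("2;1,55;1,62;1,91;1,92;1,99;1,110;6"))
```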
    

    Use Protenix_Mini

    Whether to use the Protenix_Mini model. This model relies solely on ESM2-3B features and does not require MSA information. It offers the fastest inference speed and is suitable for high-throughput scenarios.

    Seed

    Random seed used to control the randomness in the prediction process. Input format: comma-separated integers, e.g. 1,39,248,1970,20967
    Parsing rule: Up to the first 5 integers are used as random seeds
    Default value: 1,39,248,1970,20967
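
    The parsing rule above (use at most the first 5 integers, fall back to the default when nothing is given) can be sketched as follows. The function and constant names are hypothetical, and treating an empty input as "use the default" is an assumption.

```python
from typing import List, Optional

DEFAULT_SEEDS = [1, 39, 248, 1970, 20967]
MAX_SEEDS = 5


def parse_seeds(text: Optional[str] = None) -> List[int]:
    """Return up to the first five seeds from a comma-separated string."""
    if text is None or not text.strip():
        return list(DEFAULT_SEEDS)
    seeds = [int(token) for token in text.split(",") if token.strip()]
    return seeds[:MAX_SEEDS]  # extra values beyond the fifth are ignored


if __name__ == "__main__":
    print(parse_seeds("1,2,3,4,5,6,7"))
    print(parse_seeds())
```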

    Format

    The output structure format supports PDB or CIF, with PDB format as the default.

    Batch Mode

    Protein Sequence

    The protein sequence file in FASTA format, supporting multiple sequences. Each record represents a structure to be predicted, and each record name must be unique. If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >1
    EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
    >2
    YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
    

    This indicates two structures to be predicted, with the first record named 1 containing three protein chains separated by colons. The second record is named 2 and contains a single chain.
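
    To make the Batch-mode record layout concrete, the sketch below reads such a FASTA text into a mapping from record name to its list of chains, splitting on the colon. The function name is hypothetical; joining multi-line sequences before splitting is an assumption about how wrapped records would be handled.

```python
from typing import Dict, List


def parse_batch_fasta(text: str) -> Dict[str, List[str]]:
    """Map each record name to its chains; chains in a record are separated by ':'."""
    sequences: Dict[str, str] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in sequences:
                raise ValueError(f"duplicate record name: {name!r}")
            sequences[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            sequences[name] += line
    return {record: seq.split(":") for record, seq in sequences.items()}


if __name__ == "__main__":
    example = ">1\nEVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS\n>2\nYYCAKDRRDMGYFQHWGQGTLVTVSS\n"
    for record, chains in parse_batch_fasta(example).items():
        print(record, len(chains))
```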

    DNA Sequence

    The DNA sequence file in FASTA format, supporting multiple sequences. Each sequence record represents a structure to be predicted, and each record name must be unique. (It can match the record names in the Protein parameter, indicating that the DNA sequence belongs to the same structure as the Protein sequence.) If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >dna
    GACCTCT:CCTAGCT
    >1
    CCTAGCT
    

    This indicates two records, with the first named dna containing two DNA chains separated by a colon. Since this name does not appear in the Protein example records, it represents a new structure. The second record is named 1, containing one DNA chain, and since this name exists in the Protein example records, it indicates that they belong to the same structure (which contains both Protein and DNA sequences).

    RNA Sequence

    The RNA sequence file in FASTA format, supporting multiple sequences. Each sequence record represents a structure to be predicted, and each record name must be unique. (It can match the record names in the DNA or Protein parameters, indicating that the RNA sequence belongs to the same structure.) If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >1
    AGCU
    >rna
    AGGCU:UGAUC
    

    This indicates two records, with the first named 1, which is a single chain. Since this name exists in the DNA and Protein example records, it indicates that they belong to the same structure (which includes Protein, DNA, and this RNA sequence). The second record is named rna, containing two RNA chains separated by a colon. Since this name does not appear in the DNA or Protein example records, it represents a new structure.
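
    The name-matching rule across the Protein, DNA, and RNA files (same record name means same structure) amounts to grouping records by name. A minimal sketch, using hypothetical names and plain dictionaries of record name to chain list as input:

```python
from typing import Dict, List

Records = Dict[str, List[str]]  # record name -> chains


def group_structures(protein: Records, dna: Records, rna: Records) -> Dict[str, Dict[str, List[str]]]:
    """Merge records that share a name into one structure."""
    structures: Dict[str, Dict[str, List[str]]] = {}
    for kind, records in (("protein", protein), ("dna", dna), ("rna", rna)):
        for name, chains in records.items():
            entry = structures.setdefault(name, {"protein": [], "dna": [], "rna": []})
            entry[kind].extend(chains)
    return structures


if __name__ == "__main__":
    protein = {"1": ["EVQLV", "NFMLT", "ELFPQ"], "2": ["YYCAK"]}
    dna = {"dna": ["GACCTCT", "CCTAGCT"], "1": ["CCTAGCT"]}
    rna = {"1": ["AGCU"], "rna": ["AGGCU", "UGAUC"]}
    for name, parts in group_structures(protein, dna, rna).items():
        print(name, {kind: len(chains) for kind, chains in parts.items()})
```

    With the example files from this section, the result has four structures: 1 (protein, DNA, and RNA chains), 2 (protein only), dna (DNA only), and rna (RNA only).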

    Ligand

    A text file containing information on small molecules in TXT format. It supports either SMILES or CCD Code. If using SMILES format, each line should contain one small molecule; if using CCD Code, each line can contain one or more small molecules, separated by commas. Each line represents a structure to be predicted, and each line must start with a unique name (this name can match those in the RNA, DNA, or Protein parameters, indicating that all ligands in that line belong to the same structure). The name and all ligands are separated by a colon (:). Example content:

    1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3
    lig:CC1CC(CN2NC[N+](=C3CCC(OC(F)(F)F)CC3)C2=O)CC(C)C1OC(C)(C)C(=O)[O-]
    

    This indicates two records, with the first named 1, containing three ligands (one SMILES and two CCD codes). Since this name exists in the RNA, DNA, or Protein example records, it indicates that they belong to the same structure. The second record is named lig, containing one ligand (in SMILES format). Since this name does not appear in the RNA, DNA, or Protein example records, it represents a new structure.

    Note:
    1. In Batch mode, if the Affinity parameter is set, each structure in the batch must have Affinity information; otherwise, an error is reported.
    2. The ordering of small-molecule binders depends solely on the order and number of sequences in the Protein, DNA, and RNA parameters; the ligand itself does not participate in the ordering.

    3. You can assign the same ligand information to all target structures by providing a single line of ligand data (multiple ligands are allowed) without a structure name. For example: CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP
    This indicates that all structures will include the ligands CC(=O)OC1C[NH+]2CCC1CC2 and ATP.
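
    A named Batch-mode ligand line can be split on colons into the structure name followed by its ligands. The sketch below handles only the named form and rests on two assumptions that the text does not state: the first field is always the structure name, and no SMILES string in the file contains a colon. The function name is hypothetical.

```python
from typing import List, Tuple


def parse_named_ligand_line(line: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Parse 'name:ligand:ligand...' into (name, [(kind, value), ...])."""
    name, *fields = line.strip().split(":")
    ligands: List[Tuple[str, str]] = []
    for field in fields:
        if field.startswith("CCD,"):
            # "CCD,ATP,HY3" contributes one ligand per CCD code
            ligands += [("ccd", code) for code in field.split(",")[1:] if code]
        else:
            ligands.append(("smiles", field))
    return name, ligands


if __name__ == "__main__":
    print(parse_named_ligand_line("1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3"))
```

    Distinguishing a named line from an unnamed, apply-to-all line is left out on purpose, because the documentation does not say how the tool makes that decision.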

    Modification

    A text file containing post-translational modification (PTM) information in TXT format. Each PTM entry is consistent with that in Single mode (refer to the definitions in Single mode). Each line defines all PTM information for a structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), separated by colons (:). Example content:

    1:1,HY3,1:1,P1L,5:2,HY3,3
    2:1,HY3,1:2,HY3,3
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has three PTMs, while the structure named 2 has two PTMs.
    Note: When every structure to be predicted has the same number of sequences, you can apply identical PTM information to all structures by entering a single line of PTM information without a structure name. For example, 1,HY3,1:1,P1L,5:2,HY3,3 applies these PTMs to all structures.

    Ion

    Ion names. One or more ions can be specified in a single line. Different ions are separated by commas, and the number of each ion can be specified using a colon (:).

    In Batch mode, each line defines all ion information for one structure. Each line must start with a unique name (structure identifier), and fields are separated by colons (:). An example is shown below:

    1:MG:2,ZN,CU:3
    

    This indicates that for the structure named 1, there are 2 MG ions, 1 ZN ion, and 3 CU ions.

    Note: When every structure to be predicted has the same number of sequences, you can assign the same ion information to all structures by providing a single line of ion information without a structure name. For example, MG:2,ZN,CU:3 applies this ion information to all structures.

    Covalent Bond

    A text file containing covalent bond information in TXT format. Each covalent bond entry is consistent with that in Single mode (refer to the definitions in Single mode). In Batch mode, each line defines all covalent bonds for a structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), separated by colons (:). Example content:

    1:1,1,CA;2,1,CA:1,1,CA;3,1,CHA
    2:1,1,CA;3,1,CHA
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has two covalent bonds, while the structure named 2 has one covalent bond.
    Note: When the sequence count for each structure to be predicted is the same, you can set identical covalent bond information for all structures by entering a single line of covalent bond information without specifying a structure name. For example: 1,1,CA;3,1,CHA indicates that this covalent bond information will be applied to all structures.

    Contact

    A text file in TXT format containing contact type restraint information. The definition of each contact restraint is consistent with that in Single mode (refer to the definition in Single mode).
    In Batch mode, each line defines all contact restraints for one structure, starting with a unique name (which must exist in the Protein, DNA, or RNA records above), with fields separated by colons (:). An example of the file content is as follows:

    1:1,35;2,62;6.0
    2:1,48;2,CA;6.0:1,35;2,62;6.0
    

    This indicates that in the structure named 1 (Protein, DNA, or RNA mentioned above), there is one contact restraint. In the structure named 2, there are two contact restraints.

    Note: When the sequence count for each structure to be predicted is the same, you can set identical contact information for all structures by entering a single line of contact information without specifying a structure name. For example: 1,35;2,62;6.0 indicates that this Contact information will be applied to all structures.

    Pocket

    A text file (TXT format) containing pocket information. Each pocket definition is the same as in Single mode (refer to the definitions in Single mode). In Batch mode, each line defines all pockets for one structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), with fields separated by colons (:). Example content:

    1:2;1,55;1,62;1,91;1,92;1,99;1,110
    2:1;2,15;2,17;2,18;2,56:1;3,76;3,78;3,96
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has one pocket, while the structure named 2 has two pockets.
    Note: When the sequence count for each structure to be predicted is the same, you can set identical constraint information for all structures by entering a single line of constraint information without specifying a structure name. For example: 2;1,55;1,62;1,91;1,92;1,99;1,110 indicates that this pocket information will be applied to all structures.

    Use Protenix_Mini

    Whether to use the Protenix_Mini model. This model relies solely on ESM2-3B features and does not require MSA information. It offers the fastest inference speed and is suitable for high-throughput scenarios.

    Seed

    Random seed used to control the randomness in the prediction process. Input format: comma-separated integers, e.g. 1,39,248,1970,20967
    Parsing rule: Up to the first 5 integers are used as random seeds
    Default value: 1,39,248,1970,20967

    Enhanced Mode

    In this mode, 1000 random seeds are used by default, and each seed generates 5 structure samples, for 5000 sampled structures in total. The top-scoring predictions are then selected to obtain a more accurate final structure. This mode is particularly suited to high-precision prediction of antibody-antigen complex structures; studies have shown that it can improve the accuracy of antibody-antigen complex prediction by 60%. The input parameters are the same as in Single Mode, and one run takes approximately 10 to 20 hours.

    Note:

    The total length of the sequence cannot exceed 1300.

    Results

    The output consists of the top 5 ranked complex structures (rank_1-5.cif) and the score file pred_scores_protenix.csv. The CSV file contains the following information:

    Column Name Description
    Name The name of the complex structure.
    Ranking_Score A score that ranks the quality of the predicted structure, with values ranging from -100 to 1.5; a higher value indicates a better predicted structure. The score combines four indicators: pTM, ipTM, fraction_disordered, and has_clash. The calculation formula is: Ranking_Score = 0.8 × ipTM + 0.2 × pTM + 0.5 × fraction_disordered − 100 × has_clash.
    pLDDT The confidence score for local structure, with a range of 0-100. A higher value indicates a more reliable prediction. Values below 70 are considered low reliability, below 50 are generally regarded as very low confidence, indicating disordered predictions.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure. A higher score indicates greater accuracy. If the score is greater than 0.5, it suggests that the overall folding of the structure may be similar to the true structure.
    ipTM The interface predicted template modeling score, which measures the accuracy of the predicted relative positions of the subunits in the complex. A higher score indicates greater accuracy. Scores above 0.8 indicate high-quality predictions, while scores below 0.6 suggest potential failure of the prediction. Scores between 0.6 and 0.8 are in a gray area, where the accuracy of the prediction is uncertain.
    Avg_pAE Average pAE score; pAE stands for Predicted Aligned Error, a residue-pair-level confidence metric measuring the prediction reliability of relative spatial positions between any two residues. Values <5 indicate highly reliable predictions of relative positions between residue pairs, typically within the same domain; values of 5–10 suggest relatively accurate predictions, possibly in flexible loop regions or areas with minor conformational differences; values of 10–20 indicate high uncertainty in relative positions, commonly found in inter-domain linkers or flexible regions; values >20 indicate unreliable predictions, possibly representing disordered regions, misfolding, or unstable complex interfaces.
    Min_pAE The minimum value among all pAE scores.
    Avg_iPAE The average value of interface pAE scores.
    Min_iPAE The minimum value among all interface pAE scores.
    pDockQ2_chain Predicted docking score (pDockQ2) for a specific chain, used to evaluate the reliability of that chain's interaction at the complex interface.
    pDockQ2_Avg Average predicted docking score between chains, used to assess the overall interface quality of the complex.
    • pDockQ2 thresholds (derived from DockQ):
    pDockQ2 Range Structure Quality Assessment
    < 0.23 Incorrect
    0.23 – 0.49 Acceptable
    0.49 – 0.80 Medium quality
    > 0.80 High quality
    The output tar archive contains the top 5 ranked complex structures and pred_scores_protenix.csv.
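
    As a worked illustration of the Ranking_Score formula and the pDockQ2 thresholds listed above, here is a minimal sketch. The function names are hypothetical, and the handling of values that fall exactly on a threshold (0.23, 0.49, 0.80) is an assumption, since the table does not specify it.

```python
def ranking_score(iptm, ptm, fraction_disordered, has_clash):
    """Ranking_Score = 0.8*ipTM + 0.2*pTM + 0.5*disorder - 100*has_clash."""
    return 0.8 * iptm + 0.2 * ptm + 0.5 * fraction_disordered - 100.0 * float(has_clash)


def pdockq2_quality(score):
    """Map a pDockQ2 value to the quality label from the threshold table."""
    # Assumption: lower bounds are inclusive, and 0.80 itself counts as medium.
    if score < 0.23:
        return "Incorrect"
    if score < 0.49:
        return "Acceptable"
    if score <= 0.80:
        return "Medium quality"
    return "High quality"
```

    The large clash penalty means that any structure flagged with has_clash ranks below every clash-free structure, regardless of its ipTM and pTM values.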

    Reference

    • Protenix - Advancing Structure Prediction Through a Comprehensive AlphaFold3 Reproduction. ByteDance AML AI4Science Team, Xinshi Chen, Yuxuan Zhang, Chan Lu, Wenzhi Ma, Jiaqi Guan, Chengyue Gong, Jincai Yang, Hanyu Zhang, Ke Zhang, Shenghao Wu, Kuangqi Zhou, Yanping Yang, Zhenyu Liu, Lan Wang, Bo Shi, Shaochen Shi, Wenzhi Xiao. bioRxiv 2025.01.08.631967; DOI:10.1101/2025.01.08.631967
  • Name: Generate Humanized Variants
    Description: Batch-generate humanized variant sequences from the Grafting and Back Mutation Grouping results in antibody humanization design.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-12-23 00:00:00
    Reference:

    Generate Humanized Variants

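    Introduction

    Batch-generates humanized sequences from the Grafting and Back Mutation Grouping results in antibody humanization design.

    Parameters

    Graft Policy

    Graft Policy file generated by the Grafting module, in JSON format.

    Mutate Policy

    Combination mutation policy file (combination_mutate_policy.json) generated by the Back Mutation Grouping module, in JSON format.

    Results

    The output is the humanized sequence file humanized_variants_esmfold.fasta. The light- and heavy-chain sequences are joined into a single record with a colon (:), so the file can be passed directly to the ESMFold module for batch structure prediction. Example: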

    >L1H1
    EIVLTQSPATLSLSPGERATLSCRASQFVGSSIHWYQQKPGQAPRLLIYYASESMSGIPARFSGSGSGTDFTLTISSLEPEDFAVYYCQQSHSWPFTFGQGTKLEIK:EVQLVESGGGLVQPGGSLRLSCAASGFIFSNHYMSWVRQAPGKGLEWVSVIRSKSINSATYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARNYYGSTYDYWGQGTTVTVSS
    >L1H2
    EIVLTQSPATLSLSPGERATLSCRASQFVGSSIHWYQQKPGQAPRLLIYYASESMSGIPARFSGSGSGTDFTLTISSLEPEDFAVYYCQQSHSWPFTFGQGTKLEIK:EVQLVESGGGLVQPGGSLRLSCAASGFIFSNHYMSWVRQAPGKGLEWVSEIRSKSINSATYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARNYYGSTYDYWGQGTTVTVSS
    

    Generate Humanized Variants

    Introduction

    Generate humanized variant sequences based on the Grafting and Back Mutation Grouping results.

    Parameters

    Graft Policy

    Graft policy file in JSON format generated by the Grafting module.

    Mutate Policy

    Combination mutate policy file generated by Back Mutation Grouping module in JSON format.

    Results

    The output file is humanized_variants_esmfold.fasta, in which the light- and heavy-chain sequences are concatenated into a single record using a colon (:). This format can be used directly in the ESMFold module for batch structure prediction. Example:

    >L1H1
    EIVLTQSPATLSLSPGERATLSCRASQFVGSSIHWYQQKPGQAPRLLIYYASESMSGIPARFSGSGSGTDFTLTISSLEPEDFAVYYCQQSHSWPFTFGQGTKLEIK:EVQLVESGGGLVQPGGSLRLSCAASGFIFSNHYMSWVRQAPGKGLEWVSVIRSKSINSATYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARNYYGSTYDYWGQGTTVTVSS
    >L1H2
    EIVLTQSPATLSLSPGERATLSCRASQFVGSSIHWYQQKPGQAPRLLIYYASESMSGIPARFSGSGSGTDFTLTISSLEPEDFAVYYCQQSHSWPFTFGQGTKLEIK:EVQLVESGGGLVQPGGSLRLSCAASGFIFSNHYMSWVRQAPGKGLEWVSEIRSKSINSATYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARNYYGSTYDYWGQGTTVTVSS
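
    To read such a file back into separate chains, a sketch along the following lines may help. The function name is hypothetical; it assumes one sequence line per record and that the light chain precedes the heavy chain, as in the example above.

```python
def read_paired_fasta(text):
    """Split colon-joined records into light and heavy chains, keyed by record name."""
    variants = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            # Assumption: exactly two chains per record, light first.
            light, heavy = line.split(":")
            variants[name] = {"light": light, "heavy": heavy}
    return variants
```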
    
  • Name: Humanization Report (v2.4)
    Description: Humanization Report is an antibody humanization design reporting module that generates the final humanization design report and the corresponding patent example paragraphs. Compared with v2.3, RMSD and energy information are added.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-12-23 00:00:00
    Reference:

    Humanization Report v2.4

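    Introduction

    Humanization Report v2.4 is the report-generation module for antibody humanization design. It produces the final humanization design report and the corresponding patent example paragraphs. Compared with v2.3, it adds RMSD and energy information.

    Parameter Description

    Graft Policy

    Graft Policy file generated by the Grafting module.

    Mutate Policy

    Policy file generated by the Back Mutation Grouping module.

    Antibody Type

    Antibody type: Antibody (standard two-chain antibody) or Nanobody.

    Germline Score File

    Score file generated by the Grafting module, in JSON format.

    Mutation Score File

    Score file generated by the Mutation module, in CSV format.

    Antibody RMSD File

    Antibody structure RMSD file generated by the Antibody RMSD module, in CSV format.

    Antibody RMSD Top

    Takes the top N antibodies with the smallest RMSD values from the RMSD ranking.

    Folding Stability File

    Protein stability file predicted by the Absolute Folding Stability module, in CSV format.

    Result Description

    The output includes:

    Output File Name Description
    BM.pptx Summary of back mutation sites
    batch_registration_template.xlsx Batch registration template file
    hotspot_summary.xlsx Summary of risk sites
    patent_example_template.docx Patent example paragraphs for the humanized design sequences
    patent_example_en_template.docx English version of the patent example paragraphs for the humanized design sequences
    back_mutation_grouping.md Back mutation grouping information
    candidate_score.xlsx Summary of structure and energy scores for the humanized antibody sequences
    humanized_variants.fasta Humanized antibody design sequences, in FASTA format
    Report.docx Antibody humanization design report, covering the sequences, groupings, and other information from the whole design process

    The batch_registration_template.xlsx file contains the following information:

    Field Name Description
    Protein Sequence Protein sequence
    Molecule Name Molecule name

    The hotspot_summary.xlsx file contains the following information:

    Field Name Description
    ID Antibody sequence name
    Sequence-CDR CDR sequence region
    Deamidation Deamidation site
    Isomerization Isomerization site
    Cleavage Cleavage site
    Hydrolysis Hydrolysis site
    Glycosylation Glycosylation site
    Cys Number of cysteines
    Oxidation Oxidation site
    High risk High-risk rate
    High risk sites High-risk sites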

    Humanization Report v2.4

    Introduction

    Humanization Report v2.4 is the report-generation module for antibody humanization design. It produces the final humanization design report and the corresponding patent example paragraphs. Compared with v2.3, RMSD and energy information are added.

    Parameter Description

    Graft Policy

    The Graft Policy file generated by the Grafting module.

    Mutate Policy

    The Policy file generated by the Back Mutation Grouping module.

    Antibody Type

    Antibody type: Antibody (standard two-chain antibody) or Nanobody.

    Germline Score File

    Graft germline score file in JSON format generated by the Grafting module

    Mutation Score File

    Mutation score file in CSV format generated by the Mutation module

    Antibody RMSD File

    Antibody structure RMSD file in CSV format generated by the Antibody RMSD module

    Antibody RMSD Top

    Select the top N antibodies with the smallest RMSD values from the RMSD ranking

    Folding Stability File

    Protein folding stability file generated by Absolute Folding Stability module in CSV format

    Result Description

    The output results include:

    Output File Name Description
    BM.pptx Summary file of back mutation sites
    batch_registration_template.xlsx Batch registration template file
    hotspot_summary.xlsx Summary of hotspot sites
    patent_example_template.docx Humanization design sequences in corresponding patent implementation example paragraphs (Chinese version)
    patent_example_en_template.docx Humanization design sequences in corresponding patent implementation example paragraphs (English version)
    back_mutation_grouping.md Grouping for back mutations
    humanized_variants.fasta Antibody humanization design sequence file in FASTA format
    Report.docx Antibody humanization design report, including information on sequences, grouping, etc., involved in the entire humanization design process
    candidate_score.xlsx Candidate sequences energy and structure scores

    The batch_registration_template.xlsx file contains the following information:

    Field Name Description
    Protein Sequence Protein sequence
    Molecule Name Molecule name

    The hotspot_summary.xlsx file contains the following information:

    Field Name Description
    ID Antibody sequence name
    Sequence-CDR CDR sequence region
    Deamidation Deamidation site
    Isomerization Isomerization site
    Cleavage Cleavage site
    Hydrolysis Hydrolysis site
    Glycosylation Glycosylation site
    Cys Number of cysteines
    Oxidation Oxidation site
    High risk High-risk rate
    High risk sites High-risk sites
  • Name: Patent BLAST
    Description: A module for sequence search against full-length antibody sequences or CDR regions. It is recommended to use this function through the WeSeq sequence editor: WeSeq -> Blast -> Patent BLAST. Data updated: Dec 2024.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-04-06 00:00:00
    Reference:

    Patent BLAST

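    Introduction

    A module for sequence search against full-length antibodies or their CDR regions. When searching patents for an antibody variable region, existing BLAST programs (for example NCBI BLAST) usually search with the full sequence. For antibodies, however, function depends mainly on the CDRs, while the framework regions (FRs) matter less. FRs are also generic: many different antibodies share identical or highly homologous FRs, and FRs make up most of the sequence. As a result, a BLAST search with the whole variable region returns many sequences with similar FRs but dissimilar CDRs. In addition, patent applications often claim the CDRs separately, beyond the complete variable-region sequence, to obtain a broader scope of protection. A homology search that targets the CDRs is therefore necessary during antibody development. The WECOMPUT team developed this program for that purpose: it retrieves the sequences closest to the target CDRs from the existing patent database. Data updated: Dec 2024

    • Sequence source: NCBI patent sequence database
    • Includes US patent sequences submitted to the USPTO, plus European and Japanese patent sequences included through the INSDC collaboration
    • Contains all sequences from the claims and examples of granted patents
    • Original data link: https://ftp.ncbi.nlm.nih.gov/blast/db/pataa.tar.gz
    • WeMol data updated: Dec 2024
    • Quantity: >7 million protein sequences, including 140,000 antibody CDR sequences
    • Search principle: antibody sequences are extracted from the patent sequence database, their CDRs are identified with the Kabat rules, and CDR1/2/3 are concatenated into a new CDR sequence. This is compared with the concatenated CDRs of the target antibody, and the sequences with the highest homology are returned.

    For example, after the complete sequence L is submitted, the search returns the CDRs of the sequences with high homology, as shown in the image below.
    image.png

    To trace the source of a retrieved sequence, use the sequence ID of the retrieved CDR to find the corresponding patent name in the task's output log file.
    For example, sequence ATJ10081.1 comes from US Patent 9493553 (SEQ ID 39), and the same CDR fragment also appears in several other patents, such as US Patents 9670274 and 9890209. Their alignments, including homology, are listed afterwards, as shown in the image below.

    image.png

    In the experience of the WECOMPUT team, CDR protection is usually specific to the exact sequence, so a sequence that differs by one or more amino acids is considered outside the scope of the patent. The risk of infringement under the doctrine of equivalents cannot be ruled out, however, so this is for reference only.

    Parameter Description

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Type

    Specifies the type of database to align against: full-length antibody (full) or antibody CDR regions (cdr).
    The CDR database is built from patent-protected antibodies.

    Result Description

    The output includes:

    Output File Name Description
    align.fst Sequence alignment result file
    blast.log Sequence alignment log file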

    Patent BLAST

    Introduction

    A module for sequence search against full-length antibodies or their CDR regions. When searching patents for an antibody variable region, existing BLAST programs (such as NCBI BLAST) usually search with the full sequence. For antibodies, however, function depends mainly on the CDRs, while the framework regions (FRs) matter less. FRs are also generic: many different antibodies share identical or highly homologous FRs, and FRs account for most of the sequence. As a result, a BLAST search with the whole variable region returns many sequences with similar FRs but dissimilar CDRs. In addition, patent applications often claim the CDRs separately, beyond the complete variable-region sequence, to obtain a broader scope of protection, so a homology search that targets the CDRs is necessary during antibody development. To this end, the WECOMPUT team developed this program, which retrieves the sequences closest to the target CDRs from the existing patent database. Data updated: Dec 2024

    • Sequence Source: NCBI patent sequence database
    • Includes US patent sequences submitted to the USPTO and European and Japanese patent sequences included through collaboration with INSDC
    • Contains all sequences from the claims and examples of granted patents
    • Original data link: https://ftp.ncbi.nlm.nih.gov/blast/db/pataa.tar.gz
    • WeMol data updated: Dec 2024
    • Quantity: >7 million protein sequences, including 140,000 antibody CDR sequences
    • Search Principle: Extract antibody sequences from the patent sequence database, identify CDR regions using Kabat rules, concatenate CDR1/2/3 into a new CDR sequence, compare it with the concatenated CDRs of the target antibody, and output the top matching sequences based on homology.
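
    The search principle can be sketched as follows. This is an illustration only: the module's actual alignment method is not described beyond "compare", so the sketch substitutes the similarity ratio from Python's difflib for a real alignment score. It also assumes the CDRs have already been identified; the Kabat annotation step is not shown, and all function names and example sequences are hypothetical.

```python
from difflib import SequenceMatcher


def concat_cdrs(cdr1, cdr2, cdr3):
    """Join CDR1/2/3 into the single string that is compared."""
    return cdr1 + cdr2 + cdr3


def rank_by_cdr_similarity(query_cdrs, library, top_n=3):
    """Rank library entries by similarity of their concatenated CDRs to the query."""
    query = concat_cdrs(*query_cdrs)
    scored = []
    for seq_id, cdrs in library.items():
        # Stand-in score: difflib ratio, not a BLAST alignment.
        ratio = SequenceMatcher(None, query, concat_cdrs(*cdrs), autojunk=False).ratio()
        scored.append((seq_id, ratio))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

    Comparing only the concatenated CDRs removes the framework regions from the score, which is the point of the approach: two antibodies with identical frameworks but different CDRs no longer appear similar.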

    For example, when the complete sequence L is submitted, the search returns the CDRs of the highly homologous sequences, as shown in the image below.
    image.png

    To trace the source of a retrieved sequence, use the sequence ID of the retrieved CDR to find the corresponding patent name in the task's output log file. For example, sequence ATJ10081.1 comes from US Patent 9493553 (SEQ ID 39), and the same CDR fragment also appears in several other patents, such as US Patents 9670274 and 9890209. Their alignment details and homology are listed as well, as shown in the image below.

    image.png

    In the experience of the WECOMPUT team, CDR protection is usually specific to the exact sequence, so a sequence that differs by one or more amino acids is considered outside the scope of patent protection. The risk of infringement under the doctrine of equivalents cannot be ruled out, however, so this information is for reference only.

    Parameter Description

    Antibody Sequence File

    Antibody sequence file in FASTA format.

    Type

    Specifies the type of database to align against: full-length antibody (full) or antibody CDR regions (cdr).
    The CDR database is built from patent-protected antibodies.

    Result Description

    The output includes:

    Output File Name Description
    align.fst Sequence alignment result file
    blast.log Sequence alignment log file
  • Name: CIF2PDB
    Description: Convert mmCIF files into PDB files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-12-13 15:13:35
    Reference:

    CIF2PDB

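    Introduction

    The CIF2PDB module uses BioPython to convert mmCIF files into PDB files.
    Conversion of CIF files that contain only a standalone compound currently has known issues.

    Parameters

    CIF File

    The input structure file in mmCIF format.

    • Supported formats: a single .cif file or a compressed archive of .cif files.
    • Archive support: .zip, .tar.gz, .tar.bz2, .tar.xz, .tar.

    Results

    • Single file conversion: if the input is a single CIF file, the output is a PDB file named convert_output.pdb.
    • Batch/archive conversion: if the input is an archive, the output is an archive named convert_output.tar.gz that contains all converted PDB files.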

    CIF2PDB

    Introduction

    The CIF2PDB module uses BioPython to convert mmCIF files into PDB files.
    Conversion of CIF files that contain only a standalone compound currently has known issues.

    Parameters

    CIF File

    The structural file(s) in mmCIF format.

    • Supported Formats: Individual .cif files or compressed archives.
    • Archive Support: .zip, .tar.gz, .tar.bz2, .tar.xz, and .tar.

    Results

    • Single File Conversion: If a single CIF file is provided, the output will be a PDB file named convert_output.pdb.
    • Archive Conversion: If a compressed archive is provided, the output will be a compressed package named convert_output.tar.gz containing the converted PDB files.
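
    For readers who want to reproduce a single-file conversion locally, the sketch below uses Biopython's MMCIFParser and PDBIO classes. It is not the module's own code: the helper names are hypothetical, and only the default output name and the list of archive suffixes are taken from the description above.

```python
from pathlib import Path

ARCHIVE_SUFFIXES = (".zip", ".tar.gz", ".tar.bz2", ".tar.xz", ".tar")


def is_archive(filename):
    """True if the input should be treated as an archive of CIF files."""
    return filename.lower().endswith(ARCHIVE_SUFFIXES)


def cif_to_pdb(cif_path, pdb_path="convert_output.pdb"):
    """Convert one mmCIF file to PDB format with Biopython."""
    # Imported lazily so is_archive() works without Biopython installed.
    from Bio.PDB import MMCIFParser, PDBIO

    parser = MMCIFParser(QUIET=True)
    structure = parser.get_structure(Path(cif_path).stem, cif_path)
    writer = PDBIO()
    writer.set_structure(structure)
    writer.save(str(pdb_path))
    return pdb_path
```

    Note that the PDB format is more restrictive than mmCIF (for example, chain identifiers are limited to a single character), so not every mmCIF file can be written as PDB without changes.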
  • Name: Structure Prediction (Boltz-2)
    Description: An AF3-like structure prediction model based on the Boltz-2 algorithm from MIT, supporting complexes of proteins, nucleic acids, small molecules, and metal ions. Compared with Boltz-1x, Boltz-2 adds affinity prediction.
    Tags: undefined
    Author: MIT
    Release: 2024-11-20 09:34:01
    Reference: Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, Regina Barzilay. Boltz-1 Democratizing Biomolecular Interaction Modeling. bioRxiv 2024.11.19.624167

    Structure Prediction (Boltz-2)

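    Introduction

    An AF3-like structure prediction model based on the Boltz-2 algorithm from MIT (Massachusetts Institute of Technology). Boltz-2 is an open-source deep learning model that combines innovations in model architecture, speed optimization, and data processing, and reaches AlphaFold3-level accuracy in predicting the 3D structures of biomolecular complexes. Across a range of benchmarks, Boltz-2 performs on par with state-of-the-art commercial models, setting a new standard for commercially usable tools in structural biology.
    image.png
    Updates
    Compared with Boltz-1x, Boltz-2 adds affinity prediction.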

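    Introduction to Boltz-2

    What is Boltz-2?

    Boltz-2 is a large AI model designed for biomolecular interactions. It can:

    • Predict where proteins and small molecules bind
    • Estimate how tightly they bind (affinity)
    • Model how structures change under different experimental conditions
      In short: from structure to function, Boltz-2 covers it all.

    What pain points does it address?

    Currently the most accurate affinity prediction method is free energy perturbation (FEP), but it is computationally expensive, and a single run can take days.
    In comparison:

    • Boltz-2 predicts 1000 times faster than FEP
    • Its accuracy approaches that of FEP
    • It supports large-scale screening and molecular design
      More importantly, Boltz-2 is open source and free to use for both academic research and pharmaceutical companies.

    How does it perform in practice?

    Boltz-2 has shown strong performance in several practical drug discovery scenarios:

    • Lead optimization: on the test set, Boltz-2 accurately identifies which small molecule binds more tightly, approaching FEP accuracy while being 1000 times faster.
    • Virtual screening: faced with hundreds of thousands of small molecules, Boltz-2 quickly picks out potential actives. In the TYK2 target test, for example, 8 of the top 10 compounds selected by Boltz-2 were confirmed by follow-up simulations.
    • Improved structure prediction: compared with the previous generation, Boltz-2 performs better on complex structures such as RNA and antibodies, and can tailor its predictions to the experimental method.

    What are the "secret weapons" behind it?

    Without going into technical details, Boltz-2 owes its strength mainly to three points:

    • Smarter data curation: the team selected high-quality data from large public databases and removed noise to train a more reliable model.
    • Coupling with generative models to screen and design at the same time: Boltz-2 can not only score molecules but also work with molecular generative models to design new small molecules, greatly expanding the accessible compound space.
    • Greater controllability: researchers can specify the conditions used for structure prediction, such as using NMR experimental data or adding binding sites of interest, and the model handles them flexibly.

    What can it do, and what can you do with it?

    Boltz-2 provides a powerful general-purpose platform for drug discovery, protein structure prediction, and AI-driven molecular design:

    • Pharmaceutical companies can screen candidate drugs at scale
    • Biology researchers can explore protein-small molecule interaction mechanisms
    • AI practitioners can build more specialized applications on top of it

    Boltz-2 gives AI, for the first time, a real ability to predict "whether a small molecule works well", with high speed and accuracy, opening a new era of intelligent drug discovery.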

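    Model Mechanism Q&A

    • Q1: Why is steering (structure guidance) not enabled by default in Boltz-2?
      A: Steering slows inference by about 2x, and the current parameters were tuned without steering. It may become the default in the future, but that requires re-tuning.

    • Q2: Can the steering potential push structures away from the true conformation?
      A: The purpose of steering is to guide sampling back onto the manifold of the true distribution. It does not blindly shrink the sampling space, but a balance has to be found between validity and physical plausibility.

    • Q3: Is structural similarity computed per pocket or over the whole structure? Is there data leakage?
      A: Whole-structure similarity is used. This is admittedly debatable, but in real drug discovery the targets usually come with sequence information. We have done our best to control the risk of information leakage.

    Structure and Affinity Prediction Q&A

    • Q4: Is Boltz-2 affinity prediction regression or classification?
      A: Both. The outputs include:

      • A continuous affinity value (such as ∆Ki)
      • A binary classification probability (binder vs decoy)
    • Q5: How is affinity data processed, and what about inaccurate data?
      A: Training focuses mainly on ∆Ki (relative values within the same experiment), because raw Ki/IC50 data carry large errors. The Cheng-Prusoff equation is used to put Ki and IC50 on a common scale. The training set keeps only dose-response measurements, and noisy or non-reproducible experiments are removed.

    • Q6: Does Boltz-2 require high structural accuracy?
      A: Yes. Only structures with ipTM ≥ 0.75 were used for training. Structure quality is a prerequisite for successful affinity prediction.

    • Q7: Does Boltz-2 support ligands involving metal ions?
      A: No. Complexes containing metal ions were filtered out during data preparation.

    Scope and Limitations Q&A

    • Q8: Which molecular systems is it suited for?
      A: Multimodal complexes of proteins, small molecules, RNA, DNA, and so on. Performance drops for large conformational changes or flexible proteins.

    • Q9: How does Boltz-2 compare with OpenFE and FEP+?
      A: On public benchmarks it outperforms OpenFE and is slightly below commercial-grade FEP+, but with a large speed advantage (about 1000x faster).

    • Q10: Does it work well on Recursion's internal datasets?
      A: Performance is mediocre, which shows that the model still has generalization issues on real-world distributions.

    Extensions and Future Directions Q&A

    • Q11: Can it be used for protein-protein affinity prediction?
      A: Not yet, but it is under development. A PPI affinity module is expected to be released in the coming months.

    • Q12: Can it predict ADME or toxicity?
      A: Some toxicity pathways are binding-driven, so structural models can assist prediction. See the BioEmu work (Frank Noé).

    • Q13: Can it predict drug resistance?
      A: We would like to know too, and hope to validate this later.

    • Q14: Can Boltz-2 be combined with MD data?
      A: This has been discussed, but there is no standard strategy yet. A hybrid "Boltz + MD" modeling framework may be explored in the future.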

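    Parameters

    Single Mode

    Protein Sequence

    Protein sequence file in FASTA format. Multiple sequences are supported.
    Note: for structure prediction of multi-protein complexes, the amino acid sequences are entered as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    DNA sequence file in FASTA format. Multiple sequences are supported.

    RNA Sequence

    RNA sequence file in FASTA format. Multiple sequences are supported.

    Note: with the current 24 GB of GPU memory, roughly 1000 residues/bases can be computed.

    Ligand

    A TXT file containing small-molecule information. SMILES and CCD codes (Chemical Component Dictionary codes) are supported. With SMILES, each line holds one small molecule. With CCD codes, each line can hold one or more small molecules, separated by commas and preceded by the CCD prefix. Example:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Note: this parameter does not accept protein or peptide ligands given as amino acid sequences.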

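    Modification

    A TXT file containing post-translational modification (PTM) information, one PTM per line. Each PTM consists of three parts:

    • The index of the sequence that carries the PTM
    • The CCD code of the PTM type
    • The position of the modified residue
      The three parts are separated by commas. For example, 1,HY3,1 means that the first residue of the first sequence carries a PTM of type HY3 (the CCD code for 3-hydroxyproline, i.e. hydroxylated proline).
      Notes:
    • Sequence indices start from 1 and follow the order and number of the sequences in the Protein, DNA, and RNA parameters above. For example, with 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the first protein sequence is 1, the second protein sequence is 2, the DNA sequence is 3, and the RNA sequence is 4.
    • CCD introduction: https://www.wwpdb.org/data/ccd (code lookup: https://www.ebi.ac.uk/pdbe-srv/pdbechem/)
      Example file with multiple PTMs:
    1,HY3,1
    1,P1L,5
    2,HY3,3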
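
    The numbering rule is easy to get wrong by hand, so a small sketch may be useful. The function names are hypothetical. Placing ligands after the RNA sequences follows the rule stated under Covalent Bond; everything else follows the Protein, DNA, RNA order described above.

```python
def entity_indices(n_protein, n_dna, n_rna, n_ligand=0):
    """Map 1-based entity indices to entity types in Protein, DNA, RNA, ligand order."""
    labels = (
        ["protein"] * n_protein
        + ["dna"] * n_dna
        + ["rna"] * n_rna
        + ["ligand"] * n_ligand
    )
    return {index: label for index, label in enumerate(labels, start=1)}


def parse_modification(line):
    """Parse 'sequence_index,CCD_code,residue_position' into typed values."""
    seq_index, ccd_code, position = line.strip().split(",")
    return int(seq_index), ccd_code, int(position)
```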
    

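    Cycle

    Indices of the sequences to cyclize. For example, 1,2 means that both the first and the second sequence are cyclized head to tail.

    Covalent Bond

    A TXT file containing covalent bond information, one bond per line. Each bond is described by two atoms, and each atom consists of three parts:

    • The index of the sequence or small molecule that contains the atom (numbering follows the sequence numbering defined under Modification, with the small molecules appended at the end in order)
    • The position of the residue that contains the atom (for a small molecule, the position is 1)
    • The standard atom name (as defined in the CCD)
      The three parts are separated by commas. For example, 3,1,CA denotes the CA atom of the first residue (or small molecule) in the third entity (sequence or small molecule).
      A covalent bond consists of two atoms separated by a semicolon, for example: 1,1,CA;2,1,CA
      This describes one covalent bond between the atom 1,1,CA and the atom 2,1,CA.
      Example file with multiple covalent bonds:
    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA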
    

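    Contact

    A TXT file containing contact constraints. Each line describes one contact pair (residues, or standard atom names from the CCD for small molecules) and consists of three parts:

    • The first residue or atom of the pair, given as the index of its sequence/small molecule and the residue position/atom name, separated by a comma. For example, 1,25 denotes residue 25 of the first entity (a sequence), and 2,CA denotes the CA atom of the second entity (a small molecule).
    • The second residue or atom of the pair, in the same format.
    • The maximum distance between the two residues or atoms in Å, for example 6.0. The supported range is 4.0 to 20.0.

    The three parts are separated by semicolons (;). For example, 1,35;2,62;6.0 means that residue 35 of the first sequence is in contact with residue 62 of the second sequence, at a maximum distance of 6 Å. 1,35;2,CA;6.0 means that residue 35 of the first sequence is in contact with the CA atom of the second entity (a small molecule), at a maximum distance of 6 Å.
    Example file with multiple contacts:

    1,35;2,62;6.0
    1,48;2,CA;6.0
    

    Pocket

    A TXT file containing binding-site constraints, one binding site per line. Each binding site consists of three parts:

    • The index of the binder (same numbering as the sequence/small-molecule indices used for covalent bonds). The binder can be a small molecule or a protein/nucleic acid sequence. Currently a binding site supports only one binder (one index).
    • The binding-site residues. Each residue is given as its sequence index and residue position, separated by a comma; for example, 1,25 denotes residue 25 of the first sequence. Multiple residues are separated by semicolons (;); for example, 1,25;1,27;1,32;1,38 means that residues 25, 27, 32, and 38 of the first sequence form the binding site.
    • The maximum distance between the binder and the binding-site residues in Å, for example 6.0. The supported range is 4.0 to 20.0.

    The three parts are also separated by semicolons (;). For example, 2;1,55;1,62;1,91;1,92;1,99;1,110;6.0 means that the second entity (sequence or small molecule) is the binder and binds to the site formed by residues 55, 62, 91, 92, 99, and 110 of the first sequence, at a maximum distance of 6 Å.
    Example file with multiple binding sites:

    2;1,55;1,62;1,91;1,92;1,99;1,110;6.0
    3;1,25;1,27;1,32;1,38;8.0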
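
    A pocket line in this format can be validated before submission with a sketch like the one below. The function name is hypothetical, and raising ValueError for an out-of-range distance is an assumption about how such input might be rejected, not the module's documented behavior.

```python
def parse_boltz_pocket(line):
    """Parse 'binder;seq,res;...;max_distance' into (binder, residues, distance)."""
    parts = line.strip().split(";")
    binder = int(parts[0])
    max_distance = float(parts[-1])
    # The documented supported range is 4.0 to 20.0 angstroms.
    if not 4.0 <= max_distance <= 20.0:
        raise ValueError(f"distance {max_distance} outside 4.0-20.0")
    residues = [tuple(int(x) for x in part.split(",")) for part in parts[1:-1]]
    return binder, residues, max_distance
```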
    

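    Template

    Template structure file used for structure modeling, in PDB or CIF format. CIF is recommended, because Boltz reports an error when a PDB file lacks header information. Templates currently apply only to protein sequences.

    Force

    Whether to add a hard constraint when modeling with a template:
    True: the template acts as a hard constraint, so the predicted structure is forced toward the template instead of only using it as reference information. This is implemented by adding a restraint potential to the energy function. Note: this mode may cause breaks in parts of the structure.
    False: the template is used only as reference information, and the predicted structure may deviate substantially from it. The default is False.

    Chain

    When Template is set and only some protein sequences should be modeled from the template, use this parameter to give the indices of those protein sequences (numbered as defined under Modification). Multiple protein sequences are supported, separated by commas.
    For example, to use the template only for the first protein sequence, set this parameter to 1. To use it for the first and second protein sequences, set it to 1,2.

    Affinity

    Index of the small molecule for which affinity is evaluated (numbering as defined under Covalent Bond). The value is a positive integer, and only one small molecule can be given. For example, 3 means that affinity is evaluated for the small molecule with index 3. The model evaluates the binding affinity between this small molecule and the rest of the complex.

    Domain

    User-defined residue regions. The module outputs the average pLDDT over all residues in each region. A region consists of a sequence index and a residue selection:

    • Sequence index (as defined under Modification). It can be omitted when the value is 1, which is the default.
    • Residue selection, given as residue positions. Multiple residues are separated by commas, and ranges use a hyphen. For example, "3,10,24-30" selects residues 3, 10, and 24 to 30 of the target sequence.
      For example, 1:24,28,32-40 defines the region formed by residues 24, 28, and 32 to 40 of the first sequence. Because this is the first sequence, the 1 can be omitted, so the definition is equivalent to 24,28,32-40. The average pLDDT over all residues in the region is written to the result file.

    Multiple regions can be defined, separated by semicolons (;). For example:
    1:24,28,32-40;2:15,23,50-60 defines two regions: region one is residues 24, 28, and 32 to 40 of the first sequence, and region two is residues 15, 23, and 50 to 60 of the second sequence. The average pLDDT of each region is written to the result file.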
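
    The Domain syntax can be expanded into explicit residue lists as sketched below. The function names are hypothetical, and the pLDDT lookup is a plain dictionary used for illustration; how the module stores per-residue pLDDT values is not described here.

```python
def parse_domains(spec):
    """Expand a spec like '1:24,28,32-40;2:15' into (sequence_index, positions) pairs."""
    domains = []
    for region in spec.split(";"):
        region = region.strip()
        if not region:
            continue
        if ":" in region:
            seq_text, residues = region.split(":")
            seq_index = int(seq_text)
        else:
            # The sequence index defaults to 1 when omitted.
            seq_index, residues = 1, region
        positions = []
        for token in residues.split(","):
            if "-" in token:
                start, end = token.split("-")
                positions.extend(range(int(start), int(end) + 1))
            else:
                positions.append(int(token))
        domains.append((seq_index, positions))
    return domains


def mean_plddt(plddt_by_position, positions):
    """Average pLDDT over the selected residue positions."""
    return sum(plddt_by_position[p] for p in positions) / len(positions)
```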

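    Seed

    Random seed used to control randomness in the prediction process.

    Format

    Output structure format, PDB or CIF. The default is PDB.

    Output_Score

    File name for the structure scores. The default is pred_scores_boltz.csv.

    Output_Affinity

    File name for the affinity scores. The default is pred_affinity_boltz.csv.

    Batch Mode

    Batch prediction uses tiered dynamic pricing, billed in segments by the number of predicted structures:

    • Up to 5 structures: 500 compute units per structure
    • Structures 6 to 100: 300 compute units per structure
    • Structures beyond 100: 100 compute units per structure

    Notes:
    1. The system currently supports batch prediction of at most 1000 structures.
    2. One FASTA record counts as one structure.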
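
    The pricing tiers can be turned into a quick estimate. The sketch below assumes marginal (segment-by-segment) billing, which is how the phrase "the part beyond 100" reads; the function name is hypothetical and the actual billing logic may differ.

```python
def batch_cost(n_structures):
    """Estimate compute units for a batch under marginal tiered pricing."""
    if not 0 <= n_structures <= 1000:
        raise ValueError("batch mode supports at most 1000 structures")
    tier1 = min(n_structures, 5)                 # first 5 structures
    tier2 = min(max(n_structures - 5, 0), 95)    # structures 6 to 100
    tier3 = max(n_structures - 100, 0)           # structures beyond 100
    return tier1 * 500 + tier2 * 300 + tier3 * 100
```

    Under this reading, a batch of 6 structures costs 5 × 500 + 1 × 300 = 2800 compute units.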

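    Protein Sequence

    Protein sequence file in FASTA format. Multiple sequences are supported.
    Each record represents one structure to predict, and record names must be unique. When a record has multiple chains, they are joined with colons (:). Example file:

    >1
    EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
    >2
    YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
    

    This defines two structures to predict. The first record is named 1 and has three protein chains separated by colons. The second record is named 2 and has a single chain.

    DNA Sequence

    DNA sequence file in FASTA format. Multiple sequences are supported.
    Each record represents one structure to predict, and record names must be unique. A name may match a record name in the Protein parameter, which means the DNA and protein sequences belong to the same structure. When a record has multiple chains, they are joined with colons (:). Example file:

    >dna
    GACCTCT:CCTAGCT
    >1
    CCTAGCT
    

    This defines two records. The first is named dna and has two DNA chains separated by a colon; because this name does not appear in the Protein example, it is a new structure. The second is named 1 and has one DNA chain; because this name appears in the Protein example, both belong to the same structure, which then contains the protein sequences and this DNA sequence.

    RNA Sequence

    RNA sequence file in FASTA format. Multiple sequences are supported.
    Each record represents one structure to predict, and record names must be unique. A name may match a record name in the DNA or Protein parameter, which means the RNA sequence belongs to the same structure as those DNA or protein sequences. When a record has multiple chains, they are joined with colons (:). Example file:

    >1
    AGCU
    >rna
    AGGCU:UGAUC
    

    This defines two records. The first is named 1 and has a single chain; because this name appears in the DNA and Protein examples, they all belong to the same structure, which then contains the protein sequences, the DNA sequence, and this RNA sequence. The second is named rna and has two RNA chains separated by a colon; because this name does not appear in the DNA or Protein examples, it is a new structure.

    Ligand

    A TXT file containing small-molecule information. SMILES and CCD codes (Chemical Component Dictionary codes) are supported. With SMILES, each entry holds one small molecule. With CCD codes, each entry can hold one or more small molecules, separated by commas.
    Each line represents one structure to predict and can hold several ligands. It starts with a unique name, which may match a record name in the RNA, DNA, or Protein parameters above to indicate that all ligands on the line belong to the same structure as those sequences. The name and the ligands are separated by colons (:). Example file:

    1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3
    lig:CC1CC(CN2NC[N+](=C3CCC(OC(F)(F)F)CC3)C2=O)CC(C)C1OC(C)(C)C(=O)[O-]
    

    This defines two records. The first is named 1 and has three ligands (one SMILES and two CCD codes); because this name appears in the RNA, DNA, or Protein examples above, they belong to the same structure. The second is named lig and has one ligand (a SMILES); because this name does not appear in the RNA, DNA, or Protein examples, it is a new structure.
    Notes:
    1. This parameter does not accept protein or peptide ligands given as amino acid sequences.
    2. When every structure to predict has the same number of sequences, the same ligands can be set for all structures by entering a single line of ligand information (one or more ligands) without a structure name. For example, CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP adds the small molecule CC(=O)OC1C[NH+]2CCC1CC2 and ATP to every structure.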

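    Modification

    A TXT file containing post-translational modification (PTM) information. Each PTM is defined as in Single mode (see the definition there).
    Each line defines all PTMs of one structure and starts with a unique name, which must exist in the Protein, DNA, or RNA records above. The name and the PTMs are separated by colons (:). Example file:

    1:1,HY3,1:1,P1L,5:2,HY3,3
    2:1,HY3,1:2,HY3,3
    

    This means that the structure named 1 (Protein, DNA, or RNA) has three PTMs, and the structure named 2 has two PTMs.
    Note: when every structure to predict has the same number of sequences, the same PTMs can be set for all structures by entering a single line of PTM information without a structure name. For example, 1,HY3,1:1,P1L,5:2,HY3,3 applies these PTMs to all structures.

    Cycle

    A TXT file containing the indices of the sequences to cyclize. Each line defines all cyclized sequences of one structure and starts with a unique name, which must exist in the Protein, DNA, or RNA records above. The fields are separated by colons (:). Example file:

    complexA:2
    complexB:2,3
    

    This means that in the structure named complexA (Protein, DNA, or RNA), the sequence with index 2 is cyclized head to tail. In the structure named complexB, the sequences with indices 2 and 3 are both cyclized head to tail.
    Note: when every structure to predict has the same number of sequences, the same cyclization can be set for all structures by entering a single line without a structure name. For example, 2 cyclizes the sequence with index 2 in every structure.

    Covalent Bond

    A TXT file containing covalent bond information. Each covalent bond is defined as in Single mode (see the definition there).
    In Batch mode, each line defines all covalent bonds of one structure and starts with a unique name, which must exist in the Protein, DNA, or RNA records above. The name and the bonds are separated by colons (:). Example file:

    1:1,1,CA;2,1,CA:1,1,CA;3,1,CHA
    2:1,1,CA;3,1,CHA
    

    This means that the structure named 1 (Protein, DNA, or RNA) has two covalent bonds, and the structure named 2 has one.
    Note: when every structure to predict has the same number of sequences, the same covalent bonds can be set for all structures by entering a single line without a structure name. For example, 1,1,CA;3,1,CHA applies this covalent bond to all structures.

    Contact

    A TXT file containing contact constraints. Each contact is defined as in Single mode (see the definition there).
    In Batch mode, each line defines all contacts of one structure and starts with a unique name, which must exist in the Protein, DNA, or RNA records above. The name and the contacts are separated by colons (:). Example file:

    1:1,35;2,62;6.0
    2:1,48;2,CA;6.0:1,35;2,62;6.0
    

    This means that the structure named 1 (Protein, DNA, or RNA) has one contact constraint, and the structure named 2 has two.
    Note: when every structure to predict has the same number of sequences, the same contacts can be set for all structures by entering a single line without a structure name. For example, 1,35;2,62;6.0 applies this contact to all structures.

    Pocket

    A TXT file containing binding-site constraints. Each binding site is defined as in Single mode (see the definition there).
    In Batch mode, each line defines all binding-site constraints of one structure and starts with a unique name, which must exist in the Protein, DNA, or RNA records above. The name and the binding sites are separated by colons (:). Example file:

    1:2;1,55;1,62;1,91;1,92;1,99;1,110;6.0
    2:1;2,15;2,17;2,18;2,56;6.0:1;3,76;3,78;3,96;8.0
    

    This means that the structure named 1 (Protein, DNA, or RNA) has one binding-site constraint, and the structure named 2 has two.
    Note: when every structure to predict has the same number of sequences, the same constraints can be set for all structures by entering a single line without a structure name. For example, 2;1,55;1,62;1,91;1,92;1,99;1,110;6.0 applies this pocket to all structures.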

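    Affinity

    Index of the small molecule for which affinity is evaluated (numbering as defined under Covalent Bond). Each affinity entry is defined as in Single mode.
    In Batch mode, each line defines one affinity entry and starts with a unique name, which must exist in the Protein, DNA, or RNA records above. The name and the index are separated by a colon (:). Example file:

    1:4
    2:5
    

    This means that affinity is computed for the structure named 1 (Protein, DNA, or RNA) with the small-molecule binder at index 4, and for the structure named 2 with the small-molecule binder at index 5.
    Notes:
    1. If Affinity is set in Batch mode, every structure in the batch must have an Affinity entry; otherwise an error is reported.
    2. The index of the small-molecule binder depends only on the order and number of the sequences in the Protein, DNA, and RNA parameters. The Ligand parameter does not take part in the ordering.
    3. When every structure to predict has the same number of sequences, the same affinity setting can be used for all structures by entering only the small-molecule index, for example 4, without a structure name.

    Template

    Template structure file used for structure modeling (same as in Single mode). Templates currently apply only to protein sequences.

    Format

    Output structure format, PDB or CIF. The default is PDB.

    Seed

    Random seed used to control randomness in the prediction process.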

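    Virtual Screening Mode

    In Virtual Screening mode, multiple small molecules can be submitted at once, and the affinity of each is computed separately against the protein/nucleic acid system. A single run currently supports at most 100 small molecules.

    Protein Sequence

    Protein sequence file in FASTA format. Multiple sequences are supported. (Same as in Single mode.)

    DNA Sequence

    DNA sequence file in FASTA format. Multiple sequences are supported. (Same as in Single mode.)

    RNA Sequence

    RNA sequence file in FASTA format. Multiple sequences are supported. (Same as in Single mode.)

    Note: with the current 24 GB of GPU memory, roughly 1000 residues/bases can be computed.

    Ligand

    A TXT file containing small-molecule information. SMILES and CCD codes (Chemical Component Dictionary codes) are supported. With SMILES, each line holds one small molecule. With CCD codes, each line can hold one or more small molecules, separated by commas and preceded by the CCD prefix. Example:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Modification

    A TXT file containing post-translational modification (PTM) information. (Same as in Single mode.)

    Cycle

    Indices of the sequences to cyclize. For example, 1,2 means that both the first and the second sequence are cyclized head to tail. (Same as in Single mode.)

    Covalent Bond

    A TXT file containing covalent bond information. (Same as in Single mode, except that small molecules cannot take part in covalent bonds.)

    Pocket

    A TXT file containing binding-site constraints. (Same as in Single mode.)

    Output_Affinity

    File name for the affinity scores. The default is pred_affinity_boltz.csv.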

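    Results

    • Single mode

    The output consists of the top 5 ranked complex structures (rank_1-5.cif), pred_scores_boltz.csv, pred_affinity_boltz.csv (if the Affinity parameter is set), and the files boltz_report.html and pae_report_Model_1-5.html generated by the interactive visualization tool PAE Viewer.
    pred_scores_boltz.csv contains the following information:

    Field Name Description
    Name Name of the complex structure
    Confidence_Score Score used to rank the quality of the predicted structures, between 0 and 1.0; a higher value indicates a better predicted structure. The score combines two indicators, ipTM (pTM for monomers) and complex_plddt: Confidence_Score = 0.8 × complex_plddt + 0.2 × ipTM
    pTM Predicted TM-score of the structure, which measures the overall accuracy of the prediction; higher is more accurate. A score above 0.5 suggests that the overall fold may be similar to the true structure.
    ipTM Predicted TM-score of the interaction interfaces, available only when the predicted structure is a complex. It measures the accuracy of the predicted relative positions of the subunits; higher is more accurate. Scores above 0.8 indicate high-quality predictions, scores below 0.6 suggest the prediction may have failed, and scores between 0.6 and 0.8 are a gray area where correctness is uncertain.
    ipSAE Interface score computed from the pAE (predicted aligned error) matrix, ranging from 0 to 1; a higher value indicates a more reliable predicted protein-protein interface. ipSAE > 0.7 indicates a high-quality, trustworthy interface prediction. ipSAE < 0.1 indicates that the prediction contains almost no reliable interface, which can be used to rule out false-positive interactions.
    Complex_pLDDT Average pLDDT score predicted for the complex, ranging from 0 to 1.0; a higher value indicates a more reliable structure. Values below 0.7 are considered low reliability, and values below 0.5 are regarded as very low confidence, indicating disordered predictions.
    Complex_ipLDDT pLDDT score computed with the interaction interfaces of the complex weighted more heavily, ranging from 0 to 1.0; a higher value indicates a more reliable structure.
    complex_pDE Average predicted distance error over all residue pairs in the complex, an indicator of prediction quality; lower is better. Typical ranges: high-quality regions < 2 Å, medium-quality regions 2-5 Å, low-quality/flexible regions > 5 Å.
    complex_ipDE Average predicted distance error over residue pairs in the interface region; lower is better. It specifically reflects the reliability of the predicted interface interactions. The thresholds are the same as above.
    pLDDT_domain When the Domain parameter is set, the average pLDDT of the residues in each region. For multiple regions, the values are separated by semicolons (;).
    Avg_pAE Average pAE score. pAE (predicted aligned error) is a residue-pair-level confidence metric that measures how reliably the relative spatial position of any two residues is predicted. Values < 5 indicate very reliable relative positions, typically within the same domain; values of 5–10 indicate fairly accurate predictions, possibly flexible loops or regions with minor conformational differences; values of 10–20 indicate high uncertainty in relative positions, common in inter-domain linkers or flexible regions; values > 20 indicate unreliable predictions, possibly disordered regions, misfolding, or unstable complex interfaces.
    Min_pAE Minimum of all pAE scores
    Avg_iPAE Average pAE score over the interaction interfaces of the structure
    Min_iPAE Minimum pAE score over the interaction interfaces of the structure
    Avg_Ligand_pAE When a ligand is present, the average of the pAE scores involving the ligand.
    Min_Ligand_pAE When a ligand is present, the minimum of the pAE scores involving the ligand.
    pDockQ2_<chain name> Predicted docking score (pDockQ2) of that chain, used to evaluate the reliability of the chain's binding at the complex interface
    pDockQ2_Avg Average predicted docking score between chains, used to assess the overall interface quality of the complex
    • pDockQ2 thresholds (derived from DockQ):
    pDockQ2 Range Structure Quality Assessment
    < 0.23 Incorrect
    0.23 – 0.49 Acceptable
    0.49 – 0.80 Medium quality
    > 0.80 High quality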
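
    As a worked example of the Confidence_Score formula and the ipTM interpretation bands given in the table above, consider the sketch below. The function names are hypothetical, and treating exactly 0.6 and exactly 0.8 as part of the gray area is an assumption.

```python
def confidence_score(complex_plddt, ptm, iptm=None):
    """Confidence_Score = 0.8*complex_plddt + 0.2*ipTM (pTM is used for monomers)."""
    interface_term = ptm if iptm is None else iptm
    return 0.8 * complex_plddt + 0.2 * interface_term


def iptm_band(iptm):
    """Map an ipTM value to the interpretation given in the score table."""
    if iptm > 0.8:
        return "high quality"
    if iptm < 0.6:
        return "possible failure"
    return "gray area"
```

    For instance, complex_plddt = 0.9 with ipTM = 0.7 gives 0.8 × 0.9 + 0.2 × 0.7 = 0.86, even though the ipTM alone falls in the gray area, so the two values are best read together.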

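    pred_affinity_boltz.csv contains the following information:

    Field Name Description
    Pred_Affinity(log(IC50)) Predicted binding affinity between the small molecule and the rest of the complex, given as the logarithm of the IC50, log(IC50), with IC50 in μM. A lower value indicates stronger affinity.
    Pred_Prob Probability that the small molecule is a true binder, between 0 and 1; a higher value means the molecule is more likely to be a binder.
    • Batch mode

    The output consists of final_results.tar.gz, pred_scores_boltz.csv, and pred_affinity_boltz.csv (if the Affinity parameter is set).
    final_results.tar.gz is an archive of all prediction results produced in Batch mode, containing the predicted structure PDB files and the score CSV files.
    pred_scores_boltz.csv and pred_affinity_boltz.csv have the same content as in Single mode.

    • Virtual Screening mode

    The output file pred_affinity_boltz.csv holds the affinity prediction results and contains the following information:

    Field Name Description
    ID Index of the small molecule, starting from 1
    Ligand SMILES or CCD code of the small molecule
    Pred_Affinity(log(IC50)) Predicted binding affinity between the small molecule and the rest of the complex, given as the logarithm of the IC50, log(IC50), with IC50 in μM. A lower value indicates stronger affinity.
    Pred_Prob Probability that the small molecule is a true binder, between 0 and 1; a higher value means the molecule is more likely to be a binder.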
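
    The Pred_Affinity value can be converted into more familiar units with a little arithmetic. The sketch below assumes the logarithm is base 10, which the field name log(IC50) does not state explicitly; the function names are hypothetical.

```python
def ic50_micromolar(pred_affinity):
    """Convert log10(IC50 in uM) back to an IC50 in uM."""
    return 10.0 ** pred_affinity


def pic50(pred_affinity):
    """pIC50 = -log10(IC50 in M) = 6 - log10(IC50 in uM)."""
    return 6.0 - pred_affinity
```

    Under this assumption, a predicted value of 0 corresponds to an IC50 of 1 μM (pIC50 of 6), and a value of -3 corresponds to 1 nM (pIC50 of 9).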

    final_results.tar.gz is an archive of all prediction results, containing the predicted structure PDB files and the score CSV files.

    References

    • Boltz-1: Democratizing Biomolecular Interaction Modeling. Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, Regina Barzilay. DOI: 10.1101/2024.11.19.624167
    • Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Dominique Beaini, Tommi Jaakkola, Regina Barzilay. DOI: 10.1101/2025.06.14.659707

    Structure Prediction (Boltz-2)

    Introduction

    This module is built on Boltz-2, an open-source deep learning model that combines advances in model architecture, speed optimization, and data processing. It reaches AlphaFold3-level accuracy in predicting the 3D structures of biomolecular complexes. Across a range of benchmarks, Boltz-2 performs comparably to state-of-the-art commercial models, setting a new standard for commercially usable tools in structural biology.

    Updates
    Compared to Boltz-1x, Boltz-2 adds affinity prediction.

    Introduction to Boltz-2


    What is Boltz-2?

    Boltz-2 is an AI model specifically designed for “biomolecular interactions”. It can:

    • Predict binding sites between proteins and small molecules
    • Determine the binding strength (affinity)
    • Simulate structural changes under different experimental conditions

    In short: From structure to function, Boltz-2 covers it all.

    What pain points does it address?

    Currently, the most accurate method for affinity prediction is Free Energy Perturbation (FEP), but it is computationally expensive and can take days to complete a single calculation.

    In comparison:

    • Boltz-2 is 1000 times faster than FEP
    • Its prediction accuracy is comparable to FEP
    • It supports large-scale virtual screening and molecular design

    Most importantly, Boltz-2 is open-source, meaning both researchers and pharmaceutical companies can use it for free!
    Figure: Boltz-2 presents a strong accuracy/speed trade-off for affinity prediction.


    How does it perform in practice?

    Boltz-2 has demonstrated outstanding performance in various real-world drug discovery scenarios:

    • Drug Optimization
      In test datasets, Boltz-2 can accurately determine which small molecule binds more strongly, achieving results close to FEP but 1000 times faster.

    • Virtual Screening
      When faced with hundreds of thousands of small molecules, Boltz-2 quickly identifies potential active compounds. For example, in the TYK2 target test, 8 out of the top 10 molecules selected by Boltz-2 were later validated as effective in simulations.

    • Enhanced Structure Prediction
      Compared to its predecessor, Boltz-2 performs better on complex structures such as RNA and antibodies. It can also customize predictions based on experimental conditions.
      Figure: Boltz-2 model architecture diagram


    What are its “secret weapons”?

    While we won’t dive into the technical details, Boltz-2’s strength lies in three key aspects:

    1. Smarter Data Curation
      The team carefully selected high-quality data from massive public databases and removed noise, resulting in a more reliable model.

    2. Integration with Generative Models
      Boltz-2 not only evaluates interactions but also works with molecular generative models to design new small molecules, significantly expanding the chemical space.

    3. Greater Customizability
      Researchers can specify conditions for structure predictions, such as incorporating NMR experimental data or focusing on specific binding sites of interest. The model adapts flexibly.
      Figure: Evaluation of the performance of Boltz-2 against existing co-folding models on a diverse set of unseen complexes


    What can it do? What can you do with it?

    Boltz-2 provides a powerful, general-purpose platform for drug discovery, protein structure prediction, and AI-driven molecular design:

    • Pharmaceutical companies can screen drug candidates at scale
    • Biological researchers can explore protein-small molecule interaction mechanisms
    • AI practitioners can develop more specialized applications based on Boltz-2

    Boltz-2 empowers AI to truly predict the effectiveness of small molecules for the first time, combining speed and accuracy to usher in a new era of intelligent drug discovery.

    Boltz-2 Q&A Collection



    Model Mechanism Q&A

    • Q1: Why doesn’t Boltz-2 enable Steering (structural guidance) by default?
      A: Steering slows inference by about 2x, and the current parameters are optimized without Steering. It may be enabled by default in the future, but parameter tuning will be required.

    • Q2: Does Steering Potential cause structures to deviate from their true conformations?
      A: Steering aims to guide sampling back to the “manifold of true distributions” without blindly shrinking the sampling space. However, it requires a balance between “effectiveness” and “physical plausibility.”

    • Q3: Is structural similarity calculated based on the pocket or the entire structure? Could there be data leakage?
      A: Structural similarity is calculated using the entire structure, which is indeed a controversial approach. However, in real-world drug discovery, target sequence information is often available. Efforts have been made to minimize the risk of information leakage.


    Structure and Affinity Prediction Q&A

    • Q4: Is Boltz-2’s affinity prediction regression-based or classification-based?
      A: Both. The output includes:

      • Continuous affinity values (e.g., ∆Ki)
      • Binary classification probabilities (binder vs. decoy)
    • Q5: How is affinity data processed? What happens if it’s inaccurate?
      A: The model primarily trains on ∆Ki (relative values within the same experiment) due to large errors in raw Ki/IC50 data. Ki and IC50 values are unified using the Cheng–Prusoff equation. The training set excludes high-noise or non-reproducible experiments, focusing on dose-response measurements.

    • Q6: Does Boltz-2 require high structural accuracy?
      A: Yes, it only trains on structures with ipTM ≥ 0.75. Structural quality is essential for successful affinity prediction.

    • Q7: Does Boltz-2 support ligands with metal ions?
      A: No, complexes containing metal ions are filtered out during data preparation.


    Applicability and Limitations Q&A

    • Q8: What molecular systems is Boltz-2 suitable for?
      A: Proteins, small molecules, RNA, DNA, and complexes that combine them. Performance decreases for systems with large conformational changes or flexible proteins.

    • Q9: How does Boltz-2 compare to OpenFE and FEP+?
      A: It outperforms OpenFE on public benchmarks but slightly underperforms compared to the commercial-grade FEP+. However, Boltz-2 has a significant speed advantage (~1000× faster).

    • Q10: Does Boltz-2 perform well on Recursion’s internal datasets?
      A: Performance is moderate, indicating the model still struggles with generalization to real-world distributions.


    Expansion and Future Directions Q&A

    • Q11: Can Boltz-2 predict protein–protein affinity?
      A: Not yet, but development is underway. A PPI affinity module is expected in the coming months.

    • Q12: Can Boltz-2 predict ADME or toxicity?
      A: Certain toxicity pathways are binding-driven, and structural models can assist in prediction. Related studies include BioEmu by Frank Noé.

    • Q13: Can Boltz-2 predict drug resistance?
      A: We hope to explore this in future validations.

    • Q14: Can Boltz-2 be used with MD data?
      A: There have been discussions, but no standard strategy exists yet. A future direction may involve exploring a “Boltz + MD” hybrid modeling framework.

    Parameters

    Single Mode

    Protein Sequence

    The sequence file of proteins in FASTA format, supporting multiple sequences.
    Note: For multi-protein complex structure prediction, the amino acid sequence input format is as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    The sequence file of DNA nucleic acids in FASTA format, supporting multiple sequences.

    RNA Sequence

    The sequence file of RNA nucleic acid molecules in FASTA format, supporting multiple sequences.

    Ligand

    A text file containing small molecule information in TXT format. It supports SMILES or CCD Code (Chemical Component Dictionary number). If using the SMILES format, each line should contain one small molecule; if using the CCD Code, each line can contain one or more small molecules, separated by commas and prefixed with CCD. An example is as follows:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Note: This parameter does not accept protein or peptide ligands written as amino acid sequences.

    Modification

    A text file containing post-translational modification (PTM) information in TXT format. Each line contains one PTM information entry, consisting of three parts:

    • Sequence order number where the PTM occurs
    • CCD number of the PTM type
    • Residue position number where the PTM occurs

    The three parts are separated by commas. For example, 1,HY3,1 indicates that the first residue of the first sequence carries a PTM of type HY3 (the CCD code for 3-hydroxyproline, a hydroxylated proline).

    Note:

    • The sequence order number is numbered sequentially according to the order and number of sequences in the parameters Protein, DNA, and RNA, starting from 1. For example, if there are 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the sequence numbers are: the first protein sequence is numbered 1, the second protein sequence is numbered 2, the DNA sequence is numbered 3, and the RNA sequence is numbered 4.
    • For CCD introduction, refer to https://www.wwpdb.org/data/ccd . The CCD number lookup website is https://www.ebi.ac.uk/pdbe-srv/pdbechem/ .

    An example of a file containing multiple PTM information entries is as follows:

    1,HY3,1
    1,P1L,5
    2,HY3,3
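
    A Modification file in this format can be checked before submission. The sketch below is a hypothetical helper, not part of the platform: the names Ptm, parse_ptm_line, and parse_ptm_file are invented here, and the only rules it enforces are the ones stated above (three comma-separated parts, with positive integers for the sequence order number and residue position).

```python
from typing import NamedTuple


class Ptm(NamedTuple):
    sequence_index: int    # order number of the sequence, starting from 1
    ccd_code: str          # CCD code of the PTM type, e.g. "HY3"
    residue_position: int  # position of the modified residue, starting from 1


def parse_ptm_line(line: str) -> Ptm:
    """Parse one 'sequence,CCD,position' entry and validate its shape."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated parts, got {len(parts)}: {line!r}")
    entry = Ptm(int(parts[0]), parts[1], int(parts[2]))
    if entry.sequence_index < 1 or entry.residue_position < 1:
        raise ValueError(f"indices must start from 1: {line!r}")
    if not entry.ccd_code:
        raise ValueError(f"missing CCD code: {line!r}")
    return entry


def parse_ptm_file(text: str) -> list[Ptm]:
    """Parse every non-empty line of a Modification file."""
    return [parse_ptm_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for ptm in parse_ptm_file("1,HY3,1\n1,P1L,5\n2,HY3,3\n"):
        print(ptm)
```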
    

    Cycle

    Specify the serial numbers of the sequences to be cyclized; for example, 1,2 indicates that both the first and the second sequences undergo head-to-tail cyclization.

    Covalent Bond

    A text file containing covalent bond information in TXT format. Each line contains one covalent bond information entry, and each entry includes two atom information entries, each consisting of three parts:

    • Sequence or small molecule order number (following the sequence numbering rule defined in Modification, with small molecule order added at the end)
    • Position number of the residue where the atom is located (if the residue is a small molecule, the number is 1)
    • Standard name of the atom (as defined in CCD)

    The three parts are separated by commas. For example, 3,1,CA indicates the CA atom of the first residue (or small molecule) in the third entity (sequence or small molecule).

    A covalent bond consists of two atom information entries, separated by a semicolon, such as 1,1,CA;2,1,CA, indicating a covalent bond composed of two atoms: the first atom is 1,1,CA, and the second atom is 2,1,CA.

    An example of a file containing multiple covalent bond information entries is as follows:

    1,1,CA;2,1,CA
    1,1,CA;3,1,CHA
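
    The covalent bond syntax can be parsed in the same spirit. This is a minimal sketch under the format described above (two atoms separated by a semicolon, each atom given as entity order number, residue position, atom name); BondAtom and the function names are hypothetical, and atom names are not checked against the CCD.

```python
from typing import NamedTuple


class BondAtom(NamedTuple):
    entity_index: int      # sequence or small-molecule order number, from 1
    residue_position: int  # residue position; 1 for a small molecule
    atom_name: str         # standard atom name as defined in the CCD


def parse_bond_atom(text: str) -> BondAtom:
    """Parse one 'entity,position,atom' triple."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3 or not parts[2]:
        raise ValueError(f"expected 'entity,position,atom': {text!r}")
    return BondAtom(int(parts[0]), int(parts[1]), parts[2])


def parse_bond_line(line: str) -> tuple[BondAtom, BondAtom]:
    """Parse one covalent bond entry: two atoms separated by a semicolon."""
    atoms = line.strip().split(";")
    if len(atoms) != 2:
        raise ValueError(f"a covalent bond needs exactly two atoms: {line!r}")
    return parse_bond_atom(atoms[0]), parse_bond_atom(atoms[1])


if __name__ == "__main__":
    print(parse_bond_line("1,1,CA;3,1,CHA"))
```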
    

    Contact

    A text file in TXT format containing contact restraint information. Each line describes one contact pair (each member is either a residue or, for a small molecule, a standard CCD atom name) and consists of three parts:

    • The first residue or atom in the contact pair, specified by its sequence/small-molecule order number and residue position number/atom name, separated by a comma. For example: 1,25 denotes the 25th residue in the first entity (sequence), and 2,CA denotes the CA atom in the second entity (small molecule).
    • The second residue or atom in the contact pair, in the same format as above.
    • The maximum distance (in Ångströms) between the residues or atoms in the pair, e.g. 6.0. Supported range is 4.0–20.0.

    These three pieces of information are separated by a semicolon “;”.
    Example: 1,35;2,62;6.0 means that residue 35 of the first sequence and residue 62 of the second sequence are in close contact, with a maximum distance of 6 Å.
    1,35;2,CA;6.0 means that residue 35 of the first sequence and the CA atom of the second entity (small molecule) are in close contact, with a maximum distance of 6 Å.

    A file containing multiple contact entries would look like:

    1,35;2,62;6.0
    1,48;2,CA;6.0
    

    Pocket

    A text file in TXT format containing pocket restraint information. Each line describes one pocket and consists of three parts:

    • The order number of the Binder (using the same sequence/small-molecule numbering as in the Covalent Bond definition). The Binder can be a small molecule or a protein/nucleic acid sequence. Currently only one Binder (i.e., one number) is supported per pocket.
    • The pocket residues. Each residue is given as the sequence order number and the residue position number, separated by a comma; for example, 1,25 denotes the 25th residue of the first sequence. Multiple residues are separated by semicolons; for example, 1,25;1,27;1,32;1,38 indicates that the 25th, 27th, 32nd, and 38th residues of the first sequence form the pocket.
    • The maximum distance (in Å) between the Binder and the pocket residues, e.g., 6.0.

    The three parts are also separated by semicolons. For example, 2;1,55;1,62;1,91;1,92;1,99;1,110;6.0 indicates that the second entity (sequence or small molecule) is the Binder and binds to the pocket formed by the 55th, 62nd, 91st, 92nd, 99th, and 110th residues of the first sequence, with a maximum distance of 6 Å between the Binder and the pocket residues.

    An example of a file containing multiple pocket entries is as follows:

    2;1,55;1,62;1,91;1,92;1,99;1,110;6.0
    3;1,25;1,27;1,32;1,38;8.0
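
    A pocket line can be split into its three parts as sketched below. The function name and return shape are hypothetical, and the sketch assumes every line ends with an explicit distance, as in the file example above.

```python
def parse_pocket_line(line: str) -> tuple[int, list[tuple[int, int]], float]:
    """Parse 'binder;seq,res;...;max_distance' into (binder, residues, distance)."""
    fields = [f.strip() for f in line.strip().split(";")]
    if len(fields) < 3:
        raise ValueError(f"need a binder, at least one residue, and a distance: {line!r}")
    binder = int(fields[0])             # only one Binder per pocket is supported
    max_distance = float(fields[-1])    # assumed to always be the last field
    residues = []
    for field in fields[1:-1]:
        parts = field.split(",")
        if len(parts) != 2:
            raise ValueError(f"expected 'sequence,position': {field!r}")
        residues.append((int(parts[0]), int(parts[1])))
    return binder, residues, max_distance


if __name__ == "__main__":
    print(parse_pocket_line("3;1,25;1,27;1,32;1,38;8.0"))
```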
    

    Template

    The template structure file used for template-based structure modeling, in PDB or CIF format. Currently, it applies only to protein sequences.

    Force

    When modeling with a template, you can choose whether to enforce the template as a constraint:
    • True: The template is treated as a hard constraint, meaning the predicted structure is forced to align closely with the template rather than using it only as a reference. This is achieved by introducing restraint potentials into the energy function. Note: This mode may cause structural breaks in some regions.
    • False: The template is used only as reference information, allowing the predicted structure to deviate significantly from the template if necessary.
    Default: False

    Chain

    When the Template parameter is set, specify the order number(s) (as defined in the Modification parameter) of the protein sequence(s) to be modeled from the template. Multiple sequences are separated by commas.
    Examples: Set to 1 to model the first protein sequence; set to 1,2 to model the first and second sequences.

    Affinity

    Specify the order number of the small molecule (using the numbering defined in the Covalent Bond parameter) for affinity evaluation. The value must be a positive integer, and only one small molecule can be specified. For example, 3 indicates that the small molecule with order number 3 is evaluated. The model assesses the binding affinity of this small molecule with the other components of the complex.
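
    Since the order number depends on how many sequences precede the ligands, a small helper can make the numbering explicit. This sketch follows the rule given under Modification and Covalent Bond (Protein, then DNA, then RNA, with small molecules appended at the end). It assumes each ligand counts as one entity in file order, which the text does not spell out, and the function name is hypothetical.

```python
def number_entities(n_protein: int, n_dna: int, n_rna: int, n_ligand: int = 0) -> dict[int, str]:
    """Map 1-based order numbers to entity kinds: protein, DNA, RNA, then ligands."""
    kinds = (
        ["protein"] * n_protein
        + ["dna"] * n_dna
        + ["rna"] * n_rna
        + ["ligand"] * n_ligand  # assumption: one order number per ligand
    )
    return {index: kind for index, kind in enumerate(kinds, start=1)}


if __name__ == "__main__":
    numbering = number_entities(n_protein=2, n_dna=1, n_rna=1, n_ligand=1)
    for index, kind in numbering.items():
        print(index, kind)
    # Under these assumptions, the Affinity value is the number of a "ligand" entry.
```

    With 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the first small molecule would receive order number 5 under these assumptions.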

    Domain

    Defines one or more residue regions. The module outputs the average pLDDT value of all residues in each region. A region is written as a sequence order number, a colon, and a residue list:
    • Sequence order number (as defined in the Modification parameter). If it is 1, the number and the colon can be omitted.
    • Residue list, given as residue position numbers. Multiple residues are separated by commas, and ranges are written with a hyphen. For example, 3,10,24-30 denotes the 3rd, 10th, and 24th to 30th residues of the target sequence.
    For example, 1:24,28,32-40 denotes the region composed of the 24th, 28th, and 32nd to 40th residues of the first sequence. Because it is the first sequence, the number 1 can be omitted, so 24,28,32-40 is equivalent. The average pLDDT value of all residues in this region is written to the result file.
    Multiple regions are supported, separated by semicolons (;). For example, 1:24,28,32-40;2:15,23,50-60 defines two regions: region one consists of the 24th, 28th, and 32nd to 40th residues of the first sequence, and region two consists of the 15th, 23rd, and 50th to 60th residues of the second sequence. The average pLDDT value of each region is written to the result file.
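
    The Domain syntax expands mechanically into explicit residue lists. The sketch below illustrates that expansion with hypothetical function names; it assumes ranges are inclusive on both ends, as the examples above suggest.

```python
def expand_residues(spec: str) -> list[int]:
    """Expand '24,28,32-40' into an explicit list of residue positions."""
    residues: list[int] = []
    for token in spec.split(","):
        token = token.strip()
        if "-" in token:
            start, end = (int(x) for x in token.split("-"))
            residues.extend(range(start, end + 1))  # ranges are inclusive
        else:
            residues.append(int(token))
    return residues


def parse_domains(text: str) -> list[tuple[int, list[int]]]:
    """Parse 'seq:residues;seq:residues' into (sequence number, residues) pairs."""
    regions = []
    for region in text.split(";"):
        region = region.strip()
        if not region:
            continue
        if ":" in region:
            seq, spec = region.split(":", 1)
            seq_index = int(seq)
        else:
            seq_index, spec = 1, region  # sequence number 1 may be omitted
        regions.append((seq_index, expand_residues(spec)))
    return regions


if __name__ == "__main__":
    for seq_index, residues in parse_domains("1:24,28,32-40;2:15,23,50-60"):
        print(seq_index, residues)
```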

    Seed

    Random seed used to control the randomness in the prediction process.

    Format

    The output structure format supports PDB or CIF, with PDB format as the default.

    Output_Score

    The filename for the structure scoring results, defaulting to “pred_scores_boltz.csv”.

    Output_Affinity

    The filename for the affinity scoring results, defaulting to “pred_affinity_boltz.csv”.

    Batch Mode

    The batch prediction mode adopts a tiered, dynamic pricing model, where computational cost is charged based on the number of predicted structures:

    • ≤ 5 structures: 500 compute units per structure
    • Structures 6–100: 300 compute units per structure
    • Structures beyond 100: 100 compute units per structure

    Notes:
    1. The system currently supports up to 1000 structures in a single batch prediction.
    2. One FASTA sequence is counted as one structure.
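
    The tiered pricing can be estimated ahead of a run. The sketch below assumes the tiers are marginal (the first 5 structures are charged at 500 units each, structures 6–100 at 300 each, and structures beyond 100 at 100 each), which is how the wording above reads; if the platform instead applies a single rate to the whole batch, the totals would differ. The function name is hypothetical.

```python
# (inclusive upper bound of the tier, compute units per structure)
TIERS = [(5, 500), (100, 300), (1000, 100)]
MAX_STRUCTURES = 1000  # documented limit for a single batch


def batch_cost(n_structures: int) -> int:
    """Total compute units for a batch, assuming marginal (per-tier) pricing."""
    if not 1 <= n_structures <= MAX_STRUCTURES:
        raise ValueError(f"a batch must contain 1 to {MAX_STRUCTURES} structures")
    cost = 0
    previous_bound = 0
    for upper_bound, rate in TIERS:
        in_tier = max(0, min(n_structures, upper_bound) - previous_bound)
        cost += in_tier * rate
        previous_bound = upper_bound
    return cost


if __name__ == "__main__":
    for n in (1, 5, 6, 100, 101, 1000):
        print(n, batch_cost(n))
```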

    Protein Sequence

    The protein sequence file in FASTA format, supporting multiple sequences. Each record represents a structure to be predicted, and each record name must be unique. If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >1
    EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
    >2
    YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
    

    This indicates two structures to be predicted, with the first record named 1 containing three protein chains separated by colons. The second record is named 2 and contains a single chain.
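
    The Batch-mode FASTA convention (one record per structure, chains joined by colons) can be read with a few lines of Python. This is a sketch with a hypothetical function name; it joins wrapped sequence lines before splitting on colons and rejects duplicate record names, as required above.

```python
def parse_batch_fasta(text: str) -> dict[str, list[str]]:
    """Return {record name: [chain sequences]} for a Batch-mode FASTA file."""
    chunks: dict[str, list[str]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in chunks:
                raise ValueError(f"duplicate record name: {name!r}")
            chunks[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks[name].append(line)
    # Join wrapped lines first, then split the record into chains on ':'.
    return {rec: "".join(parts).split(":") for rec, parts in chunks.items()}


EXAMPLE = """>1
EVQLVESGGGVVQPGRSLRLS:NFMLTQPHSVSVL:ELFPQWHLPIKIAAIIAS
>2
YYCAKDRRDMGYFQHWGQGTLVTVSSQWYQQRPGSAPTTVIYEDNQRPSGTPPTFMIAVFLPIVVLIFKSILFL
"""

if __name__ == "__main__":
    for record, chains in parse_batch_fasta(EXAMPLE).items():
        print(record, len(chains), "chain(s)")
```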

    DNA Sequence

    The DNA sequence file in FASTA format, supporting multiple sequences. Each sequence record represents a structure to be predicted, and each record name must be unique. (It can match the record names in the Protein parameter, indicating that the DNA sequence belongs to the same structure as the Protein sequence.) If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >dna
    GACCTCT:CCTAGCT
    >1
    CCTAGCT
    

    This indicates two records, with the first named dna containing two DNA chains separated by a colon. Since this name does not appear in the Protein example records, it represents a new structure. The second record is named 1, containing one DNA chain, and since this name exists in the Protein example records, it indicates that they belong to the same structure (which contains both Protein and DNA sequences).

    RNA Sequence

    The RNA sequence file in FASTA format, supporting multiple sequences. Each sequence record represents a structure to be predicted, and each record name must be unique. (It can match the record names in the DNA or Protein parameters, indicating that the RNA sequence belongs to the same structure.) If a record contains multiple chains, they should be connected by a colon (:). Example content:

    >1
    AGCU
    >rna
    AGGCU:UGAUC
    

    This indicates two records, with the first named 1, which is a single chain. Since this name exists in the DNA and Protein example records, it indicates that they belong to the same structure (which includes Protein, DNA, and this RNA sequence). The second record is named rna, containing two RNA chains separated by a colon. Since this name does not appear in the DNA or Protein example records, it represents a new structure.

    Ligand

    A text file containing information on small molecules in TXT format. It supports either SMILES or CCD Code. If using SMILES format, each line should contain one small molecule; if using CCD Code, each line can contain one or more small molecules, separated by commas. Each line represents a structure to be predicted, and each line must start with a unique name (this name can match those in the RNA, DNA, or Protein parameters, indicating that all ligands in that line belong to the same structure). The name and all ligands are separated by a colon (:). Example content:

    1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3
    lig:CC1CC(CN2NC[N+](=C3CCC(OC(F)(F)F)CC3)C2=O)CC(C)C1OC(C)(C)C(=O)[O-]
    

    This indicates two records, with the first named 1, containing three ligands (one SMILES and two CCD codes). Since this name exists in the RNA, DNA, or Protein example records, it indicates that they belong to the same structure. The second record is named lig, containing one ligand (in SMILES format). Since this name does not appear in the RNA, DNA, or Protein example records, it represents a new structure.
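
    The name-matching rule across the Protein, DNA, RNA, and Ligand files can be illustrated as follows. This is a sketch with hypothetical function names. It handles only named ligand lines, and it splits on every colon, so it would mis-parse a SMILES string that itself contains a colon (for example an atom-map label); that limitation belongs to this sketch, not necessarily to the platform.

```python
def parse_ligand_line(line: str) -> tuple[str, list[tuple[str, str]]]:
    """Parse 'name:ligand:ligand' into (name, [(kind, value), ...])."""
    name, *fields = line.strip().split(":")
    ligands: list[tuple[str, str]] = []
    for field in fields:
        if field.startswith("CCD,"):
            # 'CCD,ATP,HY3' lists one or more CCD codes after the prefix.
            ligands.extend(("ccd", code) for code in field.split(",")[1:])
        else:
            ligands.append(("smiles", field))
    return name, ligands


def group_by_structure(**inputs: dict[str, list]) -> dict[str, dict[str, list]]:
    """Merge per-file records so that equal names end up in the same structure."""
    structures: dict[str, dict[str, list]] = {}
    for kind, records in inputs.items():
        for name, items in records.items():
            structures.setdefault(name, {})[kind] = items
    return structures


if __name__ == "__main__":
    name, ligands = parse_ligand_line("1:CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP,HY3")
    merged = group_by_structure(
        protein={"1": ["EVQLVESGGGVVQPGRSLRLS"], "2": ["YYCAKDRRDMGYFQHWGQG"]},
        dna={"dna": ["GACCTCT", "CCTAGCT"], "1": ["CCTAGCT"]},
        rna={"1": ["AGCU"], "rna": ["AGGCU", "UGAUC"]},
        ligand={name: ligands},
    )
    for structure, parts in merged.items():
        print(structure, sorted(parts))
```

    With the example records from this section, the structure named 1 collects protein, DNA, RNA, and ligand entries, while 2, dna, and rna each form a separate structure.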

    Note:
    1. In Batch mode, if the Affinity parameter is set, each structure in the batch must have Affinity information; otherwise, an error will be reported.
    2. The ordering of small-molecule binders depends solely on the order and number of sequences in the Protein, DNA, and RNA parameters; the ligand itself does not participate in the ordering.
    3. You can assign the same ligand information to all target structures by providing a single line of ligand data (multiple ligands are allowed) without specifying structure names. For example, CC(=O)OC1C[NH+]2CCC1CC2:CCD,ATP indicates that all structures will include the ligands CC(=O)OC1C[NH+]2CCC1CC2 and ATP.

    Modification

    A text file containing post-translational modification (PTM) information in TXT format. Each PTM entry is consistent with that in Single mode (refer to the definitions in Single mode). Each line defines all PTM information for a structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), separated by colons (:). Example content:

    1:1,HY3,1:1,P1L,5:2,HY3,3
    2:1,HY3,1:2,HY3,3
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has three PTMs, while the structure named 2 has two PTMs.
    Note: When the sequence count for each structure to be predicted is the same, you can set identical PTM information for all structures by entering a single line of PTM information without specifying a structure name. For example: 1,HY3,1:1,P1L,5:2,HY3,3 indicates that these PTM information will be applied to all structures.

    Cycle

    A plain-text file (TXT) that lists the serial numbers of the sequences to be cyclized.
    Each line defines the cyclization information for one structure and must start with the unique name of that structure (exactly as given in the preceding Protein / DNA / RNA records).
    The name and the sequence numbers are separated by a colon (:).

    Example file content:

    complexA:2
    complexB:2,3
    

    In the structure named complexA, the 2nd sequence will be cyclized head-to-tail.
    In the structure named complexB, both the 2nd and 3rd sequences will be cyclized head-to-tail.
    Note: When the sequence count for each structure to be predicted is the same, you can set identical cyclization information for all structures by entering a single line of cyclization information without specifying a structure name. For example: 2 indicates that cyclization sequence index 2 will be set for all structures.

    Covalent Bond

    A text file containing covalent bond information in TXT format. Each covalent bond entry is consistent with that in Single mode (refer to the definitions in Single mode). In Batch mode, each line defines all covalent bonds for a structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), separated by colons (:). Example content:

    1:1,1,CA;2,1,CA:1,1,CA;3,1,CHA
    2:1,1,CA;3,1,CHA
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has two covalent bonds, while the structure named 2 has one covalent bond.
    Note: When the sequence count for each structure to be predicted is the same, you can set identical covalent bond information for all structures by entering a single line of covalent bond information without specifying a structure name. For example: 1,1,CA;3,1,CHA indicates that this covalent bond information will be applied to all structures.

    Contact

    A text file in TXT format containing contact type restraint information. The definition of each contact restraint is consistent with that in Single mode (refer to the definition in Single mode).
    In Batch mode, each line defines all contact restraints for one structure, starting with a unique name (which must exist in the aforementioned Protein, DNA, or RNA records), with fields separated by English colons (:). An example of the file content is as follows:

    1:1,35;2,62;6.0
    2:1,48;2,CA;6.0:1,35;2,62;6.0
    

    This indicates that in the structure named 1 (Protein, DNA, or RNA mentioned above), there is one contact restraint. In the structure named 2, there are two contact restraints.

    Note: When the sequence count for each structure to be predicted is the same, you can set identical contact information for all structures by entering a single line of contact information without specifying a structure name. For example: 1,35;2,62;6.0 indicates that this Contact information will be applied to all structures.

    Pocket

    A text file containing pockets information in TXT format. Each pocket is consistent with that in Single mode (refer to the definitions in Single mode). In Batch mode, each line defines all pockets for a structure, starting with a unique name (this name must exist in the previously defined Protein, DNA, or RNA records), separated by colons (:). Example content:

    1:2;1,55;1,62;1,91;1,92;1,99;1,110
    2:1;2,15;2,17;2,18;2,56:1;3,76;3,78;3,96
    

    This indicates that the structure named 1 (Protein, DNA, or RNA) has one pocket, while the structure named 2 has two pockets.
    Note: When the number of sequences predicted for each structure is the same, you can assign the same constraint information to all target structures. To do this, provide a single line of constraint information without specifying structure names.

    Affinity

    Specify the ligand index (using the numbering defined in the Covalent Bond parameter) for affinity evaluation. Each affinity entry is defined as in Single mode.

    In Batch mode, each line defines one affinity entry and must start with a unique name (which must exist in the previously defined Protein, DNA, or RNA records), separated by a colon (:). An example is shown below:

    1:4
    2:5
    

    This indicates that:

    • For the structure named 1 (Protein/DNA/RNA), affinity calculation is performed with ligand index 4.
    • For the structure named 2, affinity calculation is performed with ligand index 5.

    Note:

    1. In Batch mode, if the Affinity parameter is set, each structure must include corresponding affinity information; otherwise, an error will be raised.
    2. The ordering of ligand binders depends only on the sequence order and count defined in the Protein, DNA, and RNA parameters. Ligands are not included in the ordering.
    3. When the sequence count for each structure to be predicted is the same, you can set identical affinity information for all structures by entering the sequential index of the small molecule, e.g., 4, without specifying a structure name. This indicates that affinity calculation will be performed for all structures, with the small molecule Binder’s sequential index being 4.

    Template

    The template structure file used for template-based structure modeling (same as Single mode). It applies only to protein sequences.

    Force

    When performing structure modeling with a template, you can choose whether to apply forced constraints (Same as Single mode.)

    Format

    The output structure format supports PDB or CIF, with PDB format as the default.

    Seed

    Random seed used to control the randomness in the prediction process.

    Virtual Screening Mode

    In virtual screening mode, you may submit multiple small molecules in one job. Each molecule will be docked independently against the protein/nucleic-acid system to compute its binding affinity. A single run supports up to 100 small molecules.

    Protein Sequence

    Protein sequence file in FASTA format; multiple sequences are allowed. (Same as Single mode.)

    DNA Sequence

    DNA sequence file in FASTA format; multiple sequences are allowed. (Same as Single mode.)

    RNA Sequence

    RNA sequence file in FASTA format; multiple sequences are allowed. (Same as Single mode.)

    Note: With a 24 GB GPU, the current implementation accommodates ≈1,000 residues / bases.

    Ligand

    Plain-text file containing small-molecule information (TXT format).
    Supported formats:

    • SMILES: one molecule per line.
    • CCD Code (Chemical Component Dictionary identifier): one or more codes per line, comma-separated and prefixed with CCD.

    Example:

    CC(=O)OC1C[NH+]2CCC1CC2
    CCD,ATP,HY3,P1L
    CCD,MG
    

    Modification

    Plain-text file with post-translational modification (PTM) information (TXT format). (Same as Single mode.)

    Cycle

    Specify the serial numbers of the sequences to be cyclized; for example, 1,2 indicates that both the first and the second sequences undergo head-to-tail cyclization. (Same as Single mode.)

    Covalent Bond

    Plain-text file describing covalent-bond information (TXT format). (Same as Single mode; the small molecule in a covalent bond cannot participate in virtual screening.)

    Pocket

    Plain-text file specifying binding-site type constraints (TXT format). (Same as Single mode.)

    Template

    The template structure file used for template-based structure modeling (same as Single mode). It applies only to protein sequences.

    Force

    When performing structure modeling with a template, you can choose whether to apply forced constraints (Same as Single mode.)

    Output_Affinity

    Name of the output file containing affinity scores.
    Default: pred_affinity_boltz.csv

    Results

    • Single Mode

    The output files include the top 5 ranked complex structures (rank_1-5.cif), pred_scores_boltz.csv, pred_affinity_boltz.csv (if the Affinity parameter is specified), and the interactive visualization tools generated by PAE Viewer: boltz_report.html and pae_report_Model_1-5.html.
    The file pred_scores_boltz.csv contains the following information:

    Field Name Description
    Name Name of the complex structure
    Confidence_Score A score for ranking the quality of predicted structures, ranging from 0 to 1.0; higher values indicate better quality.
    It combines two metrics, complex_plddt and ipTM (pTM is used in place of ipTM for monomers):
    Confidence_Score = 0.8 × complex_plddt + 0.2 × ipTM
    pTM Predicted TM score for the complex
    ipTM Predicted TM score when aggregating at the interfaces
    ipSAE An interface evaluation score derived from the pAE (Predicted Aligned Error) matrix, ranging from 0 to 1.
    A higher value indicates a more reliable predicted protein–protein interaction interface.
    ipSAE > 0.7: high-quality interface prediction; the structure is trustworthy.
    ipSAE < 0.1: almost no credible interface is predicted; the interaction can be dismissed as a false positive.
    Complex_pLDDT Average pLDDT score for the complex
    Complex_ipLDDT Average pLDDT score when upweighting interface tokens
    pLDDT_domain When setting the Domain parameter, the average pLDDT value of the domain residues. For multiple domains, the values are separated by semicolons “;”.
    complex_pDE The average predicted distance error between all residue pairs in the complex. It is a metric for evaluating the quality of complex structure prediction, where lower values are better. Typical value ranges: High-quality regions: < 2 Å, Medium-quality regions: 2-5 Å, Low-quality/flexible regions: > 5 Å
    complex_ipDE The average predicted distance error for residue pairs in the complex interface region. Lower values are better, specifically reflecting the prediction reliability of interface interactions. Threshold ranges are the same as above.
    Avg_pAE Average pAE score; pAE stands for Predicted Aligned Error, a residue-pair-level confidence metric measuring the prediction reliability of relative spatial positions between any two residues.
    Values <5 indicate highly reliable predictions of relative positions between residue pairs, typically within the same domain;
    values of 5–10 suggest relatively accurate predictions, possibly in flexible loop regions or areas with minor conformational differences;
    values of 10–20 indicate high uncertainty in relative positions, commonly found in inter-domain linkers or flexible regions;
    values >20 indicate unreliable predictions, possibly representing disordered regions, misfolding, or unstable complex interfaces.
    Min_pAE The minimum value among all pAE scores.
    Avg_iPAE The average value of interface pAE scores.
    Min_iPAE The minimum value among all interface pAE scores.
    Avg_Ligand_pAE When ligand is present, the average value of pAE scores related to the ligand.
    Min_Ligand_pAE When ligand is present, the minimum value of pAE scores related to the ligand.
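
    As a worked example of the Confidence_Score formula in the table above, the sketch below recomputes the score from its two inputs. The function name and the is_monomer flag are illustrative assumptions, and complex_plddt is assumed to be on a 0 to 1 scale so that the result stays within the documented 0 to 1.0 range.

```python
def confidence_score(complex_plddt, iptm, ptm, is_monomer=False):
    """Confidence_Score = 0.8 * complex_plddt + 0.2 * ipTM.

    For monomers, pTM replaces ipTM, as stated in the field description.
    Assumes complex_plddt is expressed on a 0-1 scale (an assumption).
    """
    interface_term = ptm if is_monomer else iptm
    return 0.8 * complex_plddt + 0.2 * interface_term


complex_score = confidence_score(complex_plddt=0.90, iptm=0.75, ptm=0.80)
monomer_score = confidence_score(complex_plddt=0.90, iptm=0.0, ptm=0.80,
                                 is_monomer=True)
```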

    pred_affinity_boltz.csv contains the following information:

    Field Name Description
    Pred_Affinity(log(IC50)) The predicted binding affinity between the small molecule and other components in the complex, expressed as the logarithm of IC50, i.e., log(IC50). The unit of IC50 is μM; a lower value indicates stronger affinity.
    Pred_Prob Probability value indicating the likelihood that the small molecule is a true binder. The value ranges from 0 to 1, with a higher value indicating a greater probability of being a binder.
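
    Because Pred_Affinity is reported as log(IC50) with IC50 in μM, it is often convenient to convert it to more familiar units. The sketch below assumes the logarithm is base 10, which the text above does not state explicitly; the function name and dictionary keys are illustrative.

```python
def ic50_from_log(log_ic50_um):
    """Convert log10(IC50 [uM]) to IC50 in uM and nM, and to pIC50.

    Assumes a base-10 logarithm (not stated explicitly in the text).
    pIC50 = -log10(IC50 [M]) = 6 - log10(IC50 [uM]).
    """
    ic50_um = 10.0 ** log_ic50_um
    return {
        "ic50_uM": ic50_um,
        "ic50_nM": ic50_um * 1000.0,
        "pIC50": 6.0 - log_ic50_um,
    }


result = ic50_from_log(-1.0)  # a predicted log(IC50) of -1
```

    Under this assumption, a predicted value of -1 corresponds to an IC50 of 0.1 μM (100 nM), i.e. a pIC50 of 7.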
    • Batch Mode

    Outputs final_results.tar.gz, pred_scores_boltz.csv, and pred_affinity_boltz.csv (if the Affinity parameter is specified).
    The final_results.tar.gz file is a packaged archive of all prediction results generated in Batch mode, including predicted structure PDB files and scoring CSV files.
    pred_scores_boltz.csv and pred_affinity_boltz.csv are the same as in Single mode.

    • Virtual Screening Mode

    Outputs pred_affinity_boltz.csv as the affinity prediction result, containing the following information:

    Field Name Description
    ID Small molecule sequence number, starting from 1
    Ligand SMILES or CCD code of the small molecule
    Pred_Affinity(log(IC50)) Predicted binding affinity between the small molecule and other components in the complex, expressed as the logarithm of IC50, i.e., log(IC50). The unit of IC50 is μM; a lower value indicates stronger affinity.
    Pred_Prob Probability value indicating the likelihood that the small molecule is a true binder. The value ranges from 0 to 1, with a higher value indicating a greater probability of being a binder.
    pDockQ2_chain Predicted docking score (pDockQ2) for a specific chain, used to evaluate the reliability of that chain’s interaction at the complex interface
    pDockQ2_Avg Average predicted docking score between chains, used to assess the overall interface quality of the complex
    • pDockQ2 thresholds (derived from DockQ):
    pDockQ2 Range Structure Quality Assessment
    < 0.23 Incorrect
    0.23 – 0.49 Acceptable
    0.49 – 0.80 Medium quality
    > 0.80 High quality

    The final_results.tar.gz file is a packaged archive of all prediction results, including predicted structure PDB files and scoring CSV files.
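
    The pDockQ2 thresholds listed above can be applied programmatically when filtering screening results. This is a minimal sketch: the function name is hypothetical, and because the table does not say which band the boundary values 0.23, 0.49, and 0.80 belong to, the handling of those exact values is an assumption.

```python
def pdockq2_quality(score):
    """Map a pDockQ2 value to the quality label from the threshold table.

    Boundary handling is an assumption: 0.23 counts as Acceptable,
    0.49 and 0.80 count as Medium quality.
    """
    if score < 0.23:
        return "Incorrect"
    if score < 0.49:
        return "Acceptable"
    if score <= 0.80:
        return "Medium quality"
    return "High quality"


labels = [pdockq2_quality(value) for value in (0.10, 0.30, 0.60, 0.90)]
```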

    References

    • Boltz-1: Democratizing Biomolecular Interaction Modeling. Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, Regina Barzilay. DOI: 10.1101/2024.11.19.624167
    • Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Dominique Beaini, Tommi Jaakkola, Regina Barzilay. DOI: 10.1101/2025.06.14.659707
  • Name: Structure Prediction (Chai-1)
    Description: 基于Chai-1算法的AF3 like结构预测模型,支持蛋白、核酸、小分子,金属离子等复合物。 Structure prediction with Chai-1, supporting proteins, DNA, RNA, ions, and ligands.
    Tags: undefined
    Author: Chai Discovery
    Release: 2024-12-02 00:00:00
    Reference: Chai-1: Decoding the molecular interactions of life. Chai Discovery, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, Kevin Wu. doi: 10.1101/2024.10.10.615955

    Structure Prediction (Chai-1)

    简介

    基于Chai Discovery, Inc.(OpenAI投资)的Chai-1算法的AF3 like结构预测模型。Chai-1是一种用于分子结构预测的多模态基础模型,在各种基准测试中均表现出色,可以预测包括蛋白质、小分子、DNA、RNA、糖基化等。

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式,支持多条序列。
    注意:多蛋白复合物结构预测,其氨基酸序列输入格式如下:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    DNA核酸的序列文件,FASTA格式,支持多条序列。

    RNA Sequence

    RNA核酸分子的序列文件,FASTA格式,支持多条序列。

    备注:当前24GB的GPU显存能计算的残基/碱基数量在1000个左右。

    在Protein、DNA、RNA序列中,都支持残基或碱基的修饰,用CCD进行定义,CCD的介绍参考https://www.wwpdb.org/data/ccd 编号查询网址为https://www.ebi.ac.uk/pdbe-srv/pdbechem/
    定义残基或碱基修饰时,直接在序列中用英文括号‘()’包含CCD code即可,示例如下:

    >seq
    (ACE)GQLEEIAK
    

    表示在序列的N端发生了乙酰化;

    >seq
    AGSHSMRYFSTSVSR(HY3)GRGEPRFIAVG
    

    表示序列中的残基P发生了羟基化修饰,变成HY3(CCD code)

    Ligand

    文本文件包含小分子的结构信息,用SMILES格式,支持多个小分子,每行放置一个,示例如下:

    CC(=O)OC1C[NH+]2CCC1CC2
    [Mg+2]
    

    注意:不适用于配体蛋白或多肽的氨基酸序列格式输入。

    Restraints

    包含残基间距离限制信息的文本文件。距离限制的类型有两种:两个残基间的距离限制,一个残基与一条链之间的距离限制。

    两个残基间的距离限制的定义由五部分组成:

    • 残基1所在序列的顺序编号(序列的顺序编号,是依次按上述参数Protein、DNA、RNA中的序列顺序与数量,从1开始进行编号,例如:当有2条蛋白序列,1条DNA序列,1条RNA序列时,各序列对应的编号为:第一条蛋白序列编号为1,第二条蛋白序列编号为2,DNA序列编号为3,RNA序列编号为4)
    • 残基1的符号及位置编号(如:R84表示84号残基R)
    • 残基2所在序列的顺序编号
    • 残基2的符号及位置编号
    • 残基间的最大距离(单位为埃)

    五部分由逗号分隔,例如:1,R84,3,G7,10.0
    表示第1条序列中的84号残基R,与第3条序列中的7号残基G,之间的最大距离为10.0埃。

    一个残基与一条链之间的距离限制表示该残基与链中任意一个残基的距离满足限制即可。其定义方式与上述类似,差异在于,残基1与残基2的符号及位置编号,其中一个需设置为0(不可同时为0),例如:1,R84,3,0,10.0
    表示第1条序列中的84号残基R,与第3条链的任意一个残基/碱基的最大距离为10.0埃即可。

    支持放置多个距离限制,每行放置一个即可,包含多个距离限制信息的文件内容示例如下:

    1,H189,3,L4,8.0
    1,R84,3,0,10.0
    

    Use MSA

    是否使用MSA信息,默认使用。选择不使用时,则不会进行MSA查询,会使用ESM2特征代替MSA信息。

    Seed

    随机数种子,用于控制预测过程中的随机性。

    结果说明

    输出结果文件为排名前5的复合物结构rank_1-5.cif和pred_scores_chai1.csv,csv中包含信息如下:

    列名 说明
    Name 结构名称
    Aggregate_Score 对预测结构的质量排序的指标分数,值范围在-100至1.0之间,越大表示预测结构的质量越高。该分数综合考虑了三个指标:ptm, iptm, has_clash, 计算公式为: Aggregate_Score = 0.8 × ipTM + 0.2 × pTM − 100 × has_clash。注意:结构为单体时,因为ipTM为0,整体的综合得分偏低,可参考pTM即可。
    pTM 预测的TM分数(the predicted template modeling score),衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM 预测的亚基接触面的TM分数(the interface predicted template modeling score),当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定
    Avg_pAE 平均pae分数,pae是预测对齐误差,是残基对水平的置信度指标,用来衡量任意两个残基之间相对空间位置的预测可信度。数值<5,表示残基对之间相对位置预测非常可靠,通常位于同一结构域内;数值在5–10,表示预测较为准确,可能为柔性环区或轻微构象差异区域;数值在10–20,表示相对位置不确定性较高,常见于结构域间连接区或柔性区域;数值> 20,表示预测不可靠,可能为无序区域、错误折叠,或复合物界面不稳定。
    Min_pAE 所有pae分数中的最小值
    Avg_iPAE 结构中相互作用界面的平均pae分数
    Min_iPAE 结构中相互作用界面pae分数中的最小值
    pDockQ2_链名 该链的预测对接评分(pDock2),用于评估该链在复合物界面中的结合可靠性
    pDock2_Avg 链之间的平均预测对接评分,用于整体评估复合物界面质量
    • pDockQ2阈值(继承自 DockQ):
    pDockQ2 范围 结构质量评估
    < 0.23 不正确(Incorrect)
    0.23 – 0.49 可接受(Acceptable)
    0.49 – 0.80 中等质量(Medium)
    > 0.80 高质量(High quality)

    参考文献

    • Chai-1: Decoding the molecular interactions of life. Chai Discovery team, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, Kevin Wu. bioRxiv 2024.10.10.615955. DOI: 10.1101/2024.10.10.615955

    Structure Prediction (Chai-1)

    Introduction

    An AF3-like structure prediction model based on the Chai-1 algorithm from Chai Discovery. Chai-1 is a multimodal foundation model for molecular structure prediction that performs well across a range of benchmarks and can model proteins, small molecules, DNA, RNA, glycosylation, and more.

    Parameters

    Protein Sequence

    The sequence file of proteins in FASTA format, supporting multiple sequences.
    Note: For multi-protein complex structure prediction, the amino acid sequence input format is as follows:

    >protein1
    AASJ...
    >protein2
    AASJ...
    >peptide
    ASDF...
    

    DNA Sequence

    The sequence file of DNA nucleic acids in FASTA format, supporting multiple sequences.

    RNA Sequence

    The sequence file of RNA nucleic acid molecules in FASTA format, supporting multiple sequences.

    Note: with 24 GB of GPU memory, about 1,000 residues/bases can currently be processed.

    Residue and base modifications are supported in Protein, DNA, and RNA sequences and are defined with CCD codes. For an introduction to the CCD, see https://www.wwpdb.org/data/ccd; codes can be looked up at https://www.ebi.ac.uk/pdbe-srv/pdbechem/
    To define a modification, enclose the CCD code in parentheses "()" directly in the sequence, as in the following example:

    >seq
    (ACE)GQLEEIAK
    

    This indicates acetylation at the N-terminus of the sequence.

    >seq
    AGSHSMRYFSTSVSR(HY3)GRGEPRFIAVG
    

    Indicates that residue P in the sequence is hydroxylated and becomes HY3 (CCD code).
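
    To make the parenthesized CCD notation concrete, the sketch below splits such a sequence into ordinary residues and CCD-coded modifications. The function name, the token layout, and the regular expression are illustrative assumptions; this is not how the module itself parses its input.

```python
import re

# One token is either "(CODE)" with an alphanumeric CCD code, or one letter.
TOKEN = re.compile(r"\(([A-Za-z0-9]+)\)|([A-Za-z])")


def tokenize_modified_sequence(sequence):
    """Split a sequence into ("residue", X) and ("ccd", CODE) tokens."""
    tokens = []
    position = 0
    for match in TOKEN.finditer(sequence):
        # Any gap between matches means an unexpected character was skipped.
        if match.start() != position:
            raise ValueError(f"unexpected character at index {position}")
        ccd_code, residue = match.groups()
        tokens.append(("ccd", ccd_code) if ccd_code else ("residue", residue))
        position = match.end()
    if position != len(sequence):
        raise ValueError(f"unexpected character at index {position}")
    return tokens


acetylated = tokenize_modified_sequence("(ACE)GQLEEIAK")
hydroxylated = tokenize_modified_sequence("AGSHSMRYFSTSVSR(HY3)GRGEPRFIAVG")
```

    Each CCD code counts as a single position, so HY3 in the second example occupies position 16 of a 27-position sequence.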

    Ligand

    Text file containing small-molecule structures in SMILES format. Multiple molecules are supported, one per line, for example:

    CC(=O)OC1C[NH+]2CCC1CC2
    [Mg+2]
    

    Note: This parameter does not accept protein or peptide ligands given as amino acid sequences.

    Restraints

    Text file containing distance restraints between residues. Two types of restraint are supported: a restraint between two residues, and a restraint between a residue and a chain.

    A restraint between two residues is defined by five parts:

    • Sequence index of the sequence containing residue 1. Sequences are numbered from 1, following the order and number of sequences given in the Protein, DNA, and RNA parameters above. For example, with 2 protein sequences, 1 DNA sequence, and 1 RNA sequence, the first protein sequence is 1, the second protein sequence is 2, the DNA sequence is 3, and the RNA sequence is 4.
    • Symbol and position of residue 1 (e.g. R84 denotes residue R at position 84)
    • Sequence index of the sequence containing residue 2
    • Symbol and position of residue 2
    • Maximum distance between the residues (in angstroms)

    The five parts are separated by commas, for example: 1,R84,3,G7,10.0
    This means that the maximum distance between residue R84 in sequence 1 and residue G7 in sequence 3 is 10.0 angstroms.

    A restraint between a residue and a chain is satisfied when the distance between the residue and any residue in the chain meets the limit. It is defined in the same way as above, except that the symbol-and-position field of either residue 1 or residue 2 (not both) is set to 0, e.g. 1,R84,3,0,10.0
    This means that residue R84 in sequence 1 must lie within 10.0 angstroms of at least one residue/base of sequence 3.

    Multiple distance restraints are supported, one per line. An example file containing several restraints:

    1,H189,3,L4,8.0
    1,R84,3,0,10.0
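
    The five-field restraint format described above is simple to validate before submitting a job. The following is a minimal sketch under that description; the Restraint class and parse_restraint function are hypothetical names, and a residue field of 0 is represented as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Restraint:
    seq1: int                # sequence index of residue 1 (numbered from 1)
    res1: Optional[str]      # e.g. "R84"; None means "any residue of the chain"
    seq2: int                # sequence index of residue 2
    res2: Optional[str]
    max_distance: float      # angstroms


def parse_restraint(line):
    """Parse one comma-separated restraint line into a Restraint."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 5:
        raise ValueError("expected five comma-separated fields")
    seq1, res1, seq2, res2, distance = parts
    res1 = None if res1 == "0" else res1
    res2 = None if res2 == "0" else res2
    # The residue fields may not both be 0.
    if res1 is None and res2 is None:
        raise ValueError("residue 1 and residue 2 cannot both be 0")
    return Restraint(int(seq1), res1, int(seq2), res2, float(distance))


pair = parse_restraint("1,H189,3,L4,8.0")
residue_to_chain = parse_restraint("1,R84,3,0,10.0")
```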
    

    Use MSA

    Whether to use MSA information; enabled by default.
    If you choose not to use it, no MSA search will be performed and ESM2 features will be used instead of MSA information.

    Seed

    Random seed used to control the randomness in the prediction process.

    Results

    The output files are the top 5 ranked complex structures (rank_1-5.cif) and pred_scores_chai1.csv. The CSV file contains the following information:

    Field Name Description
    Name Name of the complex structure
    Aggregate_Score Score used to rank the quality of predicted structures, ranging from -100 to 1.0; larger values indicate higher quality. It combines three metrics, pTM, ipTM, and has_clash: Aggregate_Score = 0.8 × ipTM + 0.2 × pTM − 100 × has_clash. Note: for a monomer, ipTM is 0, so the Aggregate_Score is relatively low; refer to pTM instead.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure. A higher score indicates greater accuracy. If the score is greater than 0.5, it suggests that the overall folding of the structure may be similar to the true structure.
    ipTM The interface predicted template modeling score, which measures the accuracy of the predicted relative positions of the subunits in the complex. A higher score indicates greater accuracy. Scores above 0.8 indicate high-quality predictions, while scores below 0.6 suggest potential failure of the prediction. Scores between 0.6 and 0.8 are in a gray area, where the accuracy of the prediction is uncertain.
    Avg_pAE Average pAE score; pAE stands for Predicted Aligned Error, a residue-pair-level confidence metric measuring the prediction reliability of relative spatial positions between any two residues. Values <5 indicate highly reliable predictions of relative positions between residue pairs, typically within the same domain; values of 5–10 suggest relatively accurate predictions, possibly in flexible loop regions or areas with minor conformational differences; values of 10–20 indicate high uncertainty in relative positions, commonly found in inter-domain linkers or flexible regions; values >20 indicate unreliable predictions, possibly representing disordered regions, misfolding, or unstable complex interfaces.
    Min_pAE The minimum value among all pAE scores.
    Avg_iPAE The average value of interface pAE scores.
    Min_iPAE The minimum value among all interface pAE scores.
    pDockQ2_chain Predicted docking score (pDockQ2) for a specific chain, used to evaluate the reliability of that chain’s interaction at the complex interface
    pDock2_Avg Average predicted docking score between chains, used to assess the overall interface quality of the complex
    • pDockQ2 thresholds (derived from DockQ):
    pDockQ2 Range Structure Quality Assessment
    < 0.23 Incorrect
    0.23 – 0.49 Acceptable
    0.49 – 0.80 Medium quality
    > 0.80 High quality

    Reference

    • Chai-1: Decoding the molecular interactions of life. Chai Discovery team, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, Kevin Wu. bioRxiv 2024.10.10.615955. DOI: 10.1101/2024.10.10.615955
  • Name: ADMET Prediction (v2)
    Description: 基于机器学习的小分子ADMET性质预测模块,支持27种ADMET性质。 Machine learning-based module for predicting the ADMET properties of small molecules, supporting 27 ADMET properties.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-11-28 00:00:00
    Reference:

    ADMET Prediction (v2)

    简介

    ADMET Prediction (v2)是一个基于机器学习的小分子ADMET性质预测模块。能快速批量预测小分子的ADMET性质,支持图注意力神经网络模型(GNN)、轻量梯度提升树模型(LGBM)、随机森林模型(RF)、梯度提升树模型(XGBT)4种常见高效的机器学习算法,分子特征支持分子指纹(Morgan FP)以及分子描述符(Descriptors)两种方法,能对小分子化合物库进行快速批量预测。模块支持27种ADMET性质,其中7种回归模型,20种分类模型。不同机器学习方法以及分子特征化方法预测性能如下:
    image.png
    image.png
    模块自动选择最理想的机器学习算法和分子特征化方法的组合进行预测。

    参数说明

    Small Molecules

    待预测的小分子文件,SDF格式。

    Properties

    ADMET预测列表,ADMET性质见结果说明部分。

    Predicted Results

    输出的预测结果文件,默认为predicted_results.csv

    结果说明

    输出结果中,如果是分类模型,输出0或1分类。如果是回归模型,预测出实际值。
    ADMET性质信息如下:

    Dataset | Dataset Abbr. | ADMET Type | Dataset Type | Endpoints Description
    Caco-2 (Cell Effective Permeability), Wang et al. | caco2 | Absorption | Regression | logPapp
    PAMPA Permeability, NCATS | pampa | Absorption | Binary classification | high permeability (1) or low-to-moderate permeability (0) in PAMPA assay
    HIA (Human Intestinal Absorption), Hou et al. | hia | Absorption | Binary classification | good permeability (1) or poor permeability (0)
    Pgp (P-glycoprotein) Inhibition, Broccatelli et al. | pgp | Absorption | Binary classification | inhibitor (1) or non-inhibitor (0)
    Bioavailability, Ma et al. | bioavailability | Absorption | Binary classification | high (1) or low (0) bioavailability
    Lipophilicity, AstraZeneca | lipophilicity | Absorption | Regression | octanol/water distribution coefficient (logD at pH 7.4)
    Solubility, AqSolDB | solubility | Absorption | Regression | logS
    Hydration Free Energy, FreeSolv | freesolv | Absorption | Regression | hydration free energy (kcal/mol)
    BBB (Blood-Brain Barrier), Martins et al. | bbbp | Distribution | Binary classification | high (1) or low (0) blood-brain barrier penetration
    PPBR (Plasma Protein Binding Rate), AstraZeneca | ppbr | Distribution | Regression | plasma protein binding rate (0-100)
    CYP P450 2C19 Inhibition, Veith et al. | cyp2c19_inhibition | Metabolism | Binary classification | P450 2C19 inhibitor (1) or non-inhibitor (0)
    CYP P450 2D6 Inhibition, Veith et al. | cyp2d6_inhibition | Metabolism | Binary classification | P450 2D6 inhibitor (1) or non-inhibitor (0)
    CYP P450 3A4 Inhibition, Veith et al. | cyp3a4_inhibition | Metabolism | Binary classification | P450 3A4 inhibitor (1) or non-inhibitor (0)
    CYP P450 1A2 Inhibition, Veith et al. | cyp1a2_inhibition | Metabolism | Binary classification | P450 1A2 inhibitor (1) or non-inhibitor (0)
    CYP P450 2C9 Inhibition, Veith et al. | cyp2c9_inhibition | Metabolism | Binary classification | P450 2C9 inhibitor (1) or non-inhibitor (0)
    CYP2C9 Substrate, Carbon-Mangels et al. | cyp2c9_substrate | Metabolism | Binary classification | CYP2C9 substrate (1) or non-substrate (0)
    CYP2D6 Substrate, Carbon-Mangels et al. | cyp2d6_substrate | Metabolism | Binary classification | CYP2D6 substrate (1) or non-substrate (0)
    CYP3A4 Substrate, Carbon-Mangels et al. | cyp3a4_substrate | Metabolism | Binary classification | CYP3A4 substrate (1) or non-substrate (0)
    Microsome Clearance, AstraZeneca | clearance_microsome | Excretion | Regression | microsome clearance (CL)
    Acute Toxicity LD50 | ld50 | Toxicity | Regression | acute toxicity LD50
    hERG blockers | herg_blockers | Toxicity | Binary classification | hERG blockers (1) or non-blockers (0)
    hERG Karim et al. | herg_karim | Toxicity | Binary classification | hERG blockers (1) or non-blockers (0)
    Ames Mutagenicity | ames | Toxicity | Binary classification | high (1) or low (0) Ames mutagenicity
    DILI (Drug Induced Liver Injury) | dili | Toxicity | Binary classification | high (1) or low (0) drug-induced liver injury
    Skin Reaction | skin | Toxicity | Binary classification | high (1) or low (0) skin reaction
    ClinTox | clintox | Toxicity | Binary classification | high (1) or low (0) ClinTox
    Carcinogens | carcinogens | Toxicity | Binary classification | high (1) or low (0) carcinogenicity

    ADMET Prediction (v2)

    Introduction

    ADMET Prediction (v2) is a machine learning-based module for predicting the ADMET properties of small molecules, designed for rapid batch prediction over small-molecule compound libraries. It supports four common and efficient machine learning algorithms: graph attention neural network (GNN), light gradient boosting machine (LGBM), random forest (RF), and gradient boosted trees (XGBT). Molecules can be featurized in two ways: molecular fingerprints (Morgan FP) and molecular descriptors. The module covers 27 ADMET properties, 7 with regression models and 20 with classification models. The predictive performance of the different machine learning and featurization methods is as follows:
    image.png
    image.png
    The module automatically selects the best-performing combination of machine learning algorithm and molecular featurization method for each prediction.

    Parameters

    Small Molecules

    File of small molecules to predict, in SDF format.

    Properties

    List of ADMET properties to predict; see the Results section for the available properties.

    Predicted Results

    Name of the output prediction file. Default: predicted_results.csv

    Results

    For classification models, the output is a class label of 0 or 1; for regression models, the output is the predicted value. The ADMET properties are as follows:

    Dataset | Dataset Abbr. | ADMET Type | Dataset Type | Endpoints Description
    Caco-2 (Cell Effective Permeability), Wang et al. | caco2 | Absorption | Regression | logPapp
    PAMPA Permeability, NCATS | pampa | Absorption | Binary classification | high permeability (1) or low-to-moderate permeability (0) in PAMPA assay
    HIA (Human Intestinal Absorption), Hou et al. | hia | Absorption | Binary classification | good permeability (1) or poor permeability (0)
    Pgp (P-glycoprotein) Inhibition, Broccatelli et al. | pgp | Absorption | Binary classification | inhibitor (1) or non-inhibitor (0)
    Bioavailability, Ma et al. | bioavailability | Absorption | Binary classification | high (1) or low (0) bioavailability
    Lipophilicity, AstraZeneca | lipophilicity | Absorption | Regression | octanol/water distribution coefficient (logD at pH 7.4)
    Solubility, AqSolDB | solubility | Absorption | Regression | logS
    Hydration Free Energy, FreeSolv | freesolv | Absorption | Regression | hydration free energy (kcal/mol)
    BBB (Blood-Brain Barrier), Martins et al. | bbbp | Distribution | Binary classification | high (1) or low (0) blood-brain barrier penetration
    PPBR (Plasma Protein Binding Rate), AstraZeneca | ppbr | Distribution | Regression | plasma protein binding rate (0-100)
    CYP P450 2C19 Inhibition, Veith et al. | cyp2c19_inhibition | Metabolism | Binary classification | P450 2C19 inhibitor (1) or non-inhibitor (0)
    CYP P450 2D6 Inhibition, Veith et al. | cyp2d6_inhibition | Metabolism | Binary classification | P450 2D6 inhibitor (1) or non-inhibitor (0)
    CYP P450 3A4 Inhibition, Veith et al. | cyp3a4_inhibition | Metabolism | Binary classification | P450 3A4 inhibitor (1) or non-inhibitor (0)
    CYP P450 1A2 Inhibition, Veith et al. | cyp1a2_inhibition | Metabolism | Binary classification | P450 1A2 inhibitor (1) or non-inhibitor (0)
    CYP P450 2C9 Inhibition, Veith et al. | cyp2c9_inhibition | Metabolism | Binary classification | P450 2C9 inhibitor (1) or non-inhibitor (0)
    CYP2C9 Substrate, Carbon-Mangels et al. | cyp2c9_substrate | Metabolism | Binary classification | CYP2C9 substrate (1) or non-substrate (0)
    CYP2D6 Substrate, Carbon-Mangels et al. | cyp2d6_substrate | Metabolism | Binary classification | CYP2D6 substrate (1) or non-substrate (0)
    CYP3A4 Substrate, Carbon-Mangels et al. | cyp3a4_substrate | Metabolism | Binary classification | CYP3A4 substrate (1) or non-substrate (0)
    Microsome Clearance, AstraZeneca | clearance_microsome | Excretion | Regression | microsome clearance (CL)
    Acute Toxicity LD50 | ld50 | Toxicity | Regression | acute toxicity LD50
    hERG blockers | herg_blockers | Toxicity | Binary classification | hERG blockers (1) or non-blockers (0)
    hERG Karim et al. | herg_karim | Toxicity | Binary classification | hERG blockers (1) or non-blockers (0)
    Ames Mutagenicity | ames | Toxicity | Binary classification | high (1) or low (0) Ames mutagenicity
    DILI (Drug Induced Liver Injury) | dili | Toxicity | Binary classification | high (1) or low (0) drug-induced liver injury
    Skin Reaction | skin | Toxicity | Binary classification | high (1) or low (0) skin reaction
    ClinTox | clintox | Toxicity | Binary classification | high (1) or low (0) ClinTox
    Carcinogens | carcinogens | Toxicity | Binary classification | high (1) or low (0) carcinogenicity
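
    When post-processing predictions, it can help to know whether a given property is a class label or a continuous value. The lookup below is transcribed from the table above; the dictionary name and the describe_prediction helper are illustrative assumptions, and the layout of predicted_results.csv itself is not assumed here.

```python
# Abbreviation -> (ADMET type, task), transcribed from the table above.
ADMET_PROPERTIES = {
    "caco2": ("Absorption", "regression"),
    "pampa": ("Absorption", "classification"),
    "hia": ("Absorption", "classification"),
    "pgp": ("Absorption", "classification"),
    "bioavailability": ("Absorption", "classification"),
    "lipophilicity": ("Absorption", "regression"),
    "solubility": ("Absorption", "regression"),
    "freesolv": ("Absorption", "regression"),
    "bbbp": ("Distribution", "classification"),
    "ppbr": ("Distribution", "regression"),
    "cyp2c19_inhibition": ("Metabolism", "classification"),
    "cyp2d6_inhibition": ("Metabolism", "classification"),
    "cyp3a4_inhibition": ("Metabolism", "classification"),
    "cyp1a2_inhibition": ("Metabolism", "classification"),
    "cyp2c9_inhibition": ("Metabolism", "classification"),
    "cyp2c9_substrate": ("Metabolism", "classification"),
    "cyp2d6_substrate": ("Metabolism", "classification"),
    "cyp3a4_substrate": ("Metabolism", "classification"),
    "clearance_microsome": ("Excretion", "regression"),
    "ld50": ("Toxicity", "regression"),
    "herg_blockers": ("Toxicity", "classification"),
    "herg_karim": ("Toxicity", "classification"),
    "ames": ("Toxicity", "classification"),
    "dili": ("Toxicity", "classification"),
    "skin": ("Toxicity", "classification"),
    "clintox": ("Toxicity", "classification"),
    "carcinogens": ("Toxicity", "classification"),
}


def describe_prediction(abbr, value):
    """Render one predicted value according to its task type."""
    admet_type, task = ADMET_PROPERTIES[abbr]
    if task == "classification":
        return f"{abbr} ({admet_type}): class {int(value)}"
    return f"{abbr} ({admet_type}): predicted value {value}"


n_regression = sum(task == "regression" for _, task in ADMET_PROPERTIES.values())
n_classification = len(ADMET_PROPERTIES) - n_regression
```

    The counts reproduce the 7 regression and 20 classification models stated in the introduction.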
  • Name: Evaluate Nucleic Acid (AlphaRNA)
    Description: 用于评估核酸序列的其表达量和半衰期、抗体滴度等。支持人、小鼠、大鼠、猪等种属。 Evaluates nucleic acid sequences for expression level, half-life, antibody titer, and related properties. Supports human, mouse, rat, pig, and other species.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-11-20 16:47:10
    Reference:

    Evaluate Nucleic Acid (AlphaRNA)

    简介

    Evaluate Nucleic Acid (AlphaRNA)模块用于评估核酸序列的其表达量和半衰期、抗体滴度等。支持人、小鼠、大鼠、猪等种属。

    参数说明

    Nucleic Acid Sequence

    核酸序列,必须为3的倍数,否则截断尾部序列以达到3的倍数序列,比如:GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGG

    Species

    序列所属物种,Homo_Sapiens、Mammalian、Pig、Rat。

    结果

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    AUP AUP (Amino Acid Usage Pattern)指的是氨基酸使用模式的指标,通常用于评估特定氨基酸在序列中的使用频率。值越高,表示该氨基酸在序列中使用的频率越高。
    CAI CAI (Codon Adaptation Index)是一个用于评估特定基因的密码子使用偏好度的指标,值范围从 0 到 1。接近 1 表示该基因的密码子使用模式与高表达基因的模式相似,通常与基因表达效率相关。
    GCR GCR (Gene Codon Ratio)是基因密码子比率的指标,反映了基因中不同密码子的相对使用情况。值越高,表示基因中使用的密码子与参考密码子库的偏好越一致。
    MFE MFE (Minimum Free Energy)是指核酸序列的最低自由能,通常用于评估 RNA 二级结构的稳定性。值越低表示结构越稳定。负值表示该序列在折叠时释放能量,形成稳定的构象。
    Aug Positions Aug Positions表示在序列中发现的AUG(起始密码子)的位置。结果空时表示在序列中没有找到AUG密码子。
    Sequence 根据输入的核酸序列翻译得到的氨基酸序列。
    Secondary Structure RNA序列的预测二级结构。

    Evaluate Nucleic Acid (AlphaRNA)

    Introduction

    The Evaluate Nucleic Acid (AlphaRNA) module evaluates nucleic acid sequences for expression level, half-life, antibody titer, and related properties. Supported species include human, mouse, rat, and pig.

    Parameters

    Nucleic Acid Sequence

    The length of the nucleic acid sequence must be a multiple of three; otherwise, trailing bases are truncated until it is. For example: GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGG.
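
    The truncation rule can be illustrated with a short sketch. The function names are hypothetical and the module's own implementation may differ; the code only shows what dropping trailing bases to reach a multiple of three means.

```python
def truncate_to_codons(sequence):
    """Drop trailing bases so the length is a multiple of three."""
    usable_length = len(sequence) - len(sequence) % 3
    return sequence[:usable_length]


def split_codons(sequence):
    """Split a sequence into consecutive three-base codons after truncation."""
    trimmed = truncate_to_codons(sequence)
    return [trimmed[i:i + 3] for i in range(0, len(trimmed), 3)]


trimmed = truncate_to_codons("ATGGCCA")   # seven bases, one extra
codons = split_codons("ATGGCCA")
```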

    Species

    The species to which the sequence belongs, such as Homo_Sapiens, Mammalian, Pig, or Rat.

    Results

    The output result file is result.csv, which contains the following information:

    Field Name Description
    AUP AUP (Amino Acid Usage Pattern) indicates the usage pattern of amino acids, typically used to assess the frequency of specific amino acids in the sequence. A higher value indicates a higher frequency of that amino acid in the sequence.
    CAI CAI (Codon Adaptation Index) is a metric used to evaluate the codon usage preference of a specific gene, with values ranging from 0 to 1. A value close to 1 indicates that the codon usage pattern of the gene is similar to that of highly expressed genes, which is often related to gene expression efficiency.
    GCR GCR (Gene Codon Ratio) is an indicator of the gene codon ratio, reflecting the relative usage of different codons within the gene. A higher value indicates that the codons used in the gene are more consistent with the preferences of the reference codon library.
    MFE MFE (Minimum Free Energy) refers to the minimum free energy of the nucleic acid sequence, typically used to assess the stability of RNA secondary structures. Lower values indicate more stable structures. Negative values indicate that the sequence releases energy when folded, forming a stable conformation.
    Aug Positions Aug Positions indicates the positions of AUG (start codon) found in the sequence. An empty result means that no AUG codons were found in the sequence.
    Sequence The amino acid sequence translated from the input nucleic acid sequence.
    Secondary Structure The predicted secondary structure of the RNA sequence.
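
    As an illustration of the Aug Positions field, the sketch below scans a sequence for AUG start codons. It is an assumption-laden example: the function name is hypothetical, positions are reported 1-based, DNA input is converted to RNA first, and every reading frame is scanned. The module may use different conventions.

```python
def find_aug_positions(sequence):
    """Return 1-based start positions of every AUG in the sequence.

    Assumptions: T is treated as U, all reading frames are scanned,
    and positions are 1-based. An empty list means no AUG was found.
    """
    rna = sequence.upper().replace("T", "U")
    positions = []
    start = rna.find("AUG")
    while start != -1:
        positions.append(start + 1)
        start = rna.find("AUG", start + 1)
    return positions


rna_hits = find_aug_positions("GGAUGCCAUGA")
dna_hits = find_aug_positions("ATGAAATGA")
no_hits = find_aug_positions("GGAGCC")
```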
  • Name: Back Mutation Grouping (v2.4)
    Description: Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。 Back Mutation Grouping is a grouping module in the antibody humanization design workflow, which groups the back mutations based on the mutation scoring table generated by the Mutation Score module and returns the back mutated sequence.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-11-15 15:21:07
    Reference:

    Back Mutation Grouping v2.4

    简介

    Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。

    参数说明

    Grafted Chain

    抗体CDR区嫁接后序列文件,FASTA格式,由Grafting模块生成

    Raw Chain

    抗体序列文件,FASTA格式

    Mutation Score

    人源化突变评分文件,CSV格式,由Mutation Score模块生成

    Output File

    指定输出的突变序列文件名称,FASTA格式

    Cutoff

    打分分组的截断值,逗号分割,例如:2,5,10表示将氨基酸突变评分大于10的为一组,5~10的氨基酸为一组,小于2的氨基酸分为一组。

    Output Policy

    指定输出的回复突变的文件

    结果说明

    根据不同截断值得到突变分组结果文件mutate_policy.json。

    Back Mutation Grouping v2.4

    Introduction

    Back Mutation Grouping is a grouping module in the humanization design process of antibodies, which groups back mutations based on the mutation score table output by the Mutation Score module and returns the mutated sequences.

    Parameters

    Grafted Chain

    Sequence file of the antibody CDR region after grafting, in FASTA format, generated by the Grafting module.

    Raw Chain

    Sequence file of the antibody, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Specify the name of the output mutation sequence file, in FASTA format.

    Cutoff

    Cutoff values used to group mutations by score, separated by commas. For example, 2,5,10 places mutations scoring above 10 in one group, those scoring between 5 and 10 in another, and those scoring below 2 in another; scores between 2 and 5 presumably form a group of their own.
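
    The grouping by cutoffs can be sketched as a simple binning step. This is an illustration only: the function name and the mutation labels are hypothetical, the 2 to 5 band is assumed to form its own group, and a score equal to a cutoff is assumed to fall into the higher band.

```python
from bisect import bisect_right


def group_by_cutoffs(scores, cutoffs):
    """Bin mutations by score; group 0 is the lowest band.

    With cutoffs 2,5,10 the bands are: below 2, 2 to 5, 5 to 10, 10 and above.
    Boundary handling (a score equal to a cutoff goes to the higher band)
    is an assumption.
    """
    bounds = sorted(cutoffs)
    groups = {index: [] for index in range(len(bounds) + 1)}
    for name, score in scores.items():
        groups[bisect_right(bounds, score)].append(name)
    return groups


# Hypothetical mutation labels and scores, for illustration only.
scores = {"mut_a": 1.0, "mut_b": 3.0, "mut_c": 7.5, "mut_d": 12.0}
groups = group_by_cutoffs(scores, [2, 5, 10])
```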

    Output Policy

    Specify the file for the output of back mutations.

    Results

    The mutation grouping results file, mutate_policy.json, is generated based on different cutoff values.

  • Name: Template-guided Structure Prediction
    Description: 基于自定义的蛋白结构模板,采用colabfold进行蛋白结构预测。 Protein structure prediction with ColabFold, guided by a custom protein structure template.
    Tags: undefined
    Author: Mirdita M
    Release: 2024-11-04 15:24:56
    Reference: Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022 Jun;19(6):679-682.

    Template-guided Structure Prediction

    简介

    基于自定义的蛋白结构模板,采用colabfold进行蛋白结构预测。

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式

    Template Structure

    蛋白的模板结构,PDB格式

    结果说明

    输出文件名称 说明
    rank_001.pdb 预测得到的最佳复合物结构。
    pdbs.tar.gz 预测得到的前5个最佳复合物结构的压缩包文件。
    scores.csv 预测结构的评分文件

    其中scores.csv包含如下信息:

    字段名称 说明
    Name 预测结构的文件名
    pLDDT 局部结构的可信度指标,值范围是0-100,该值越大说明预测的结构越可靠。低于70被认为可靠性较低,低于50基本认为是可信度非常低,为无序预测
    pTM 预测的TM分数(the predicted template modeling score),衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM 预测的亚基接触面的TM分数(the interface predicted template modeling score),当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定

    参考文献

    • Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022 Jun;19(6):679-682. DOI: 10.1038/s41592-022-01488-1

    Template-guided Structure Prediction

    Introduction

    Protein structure prediction is performed using ColabFold based on a custom protein structure template.

    Parameters

    Protein Sequence

    The sequence file of the protein in FASTA format.

    Template Structure

    The template structure of the protein in PDB format.

    Result Description

    Output File Name Description
    rank_001.pdb The predicted best complex structure.
    pdbs.tar.gz A compressed file containing the top 5 best complex structures.
    scores.csv The scoring file for the predicted structures.

    The scores.csv file contains the following information:

    Field Name Description
    Name The file name of the predicted structure.
    pLDDT The confidence score for the local structure, on a scale of 0-100. A higher value indicates a more reliable prediction. Values below 70 indicate low reliability, and values below 50 indicate very low confidence, typically a disordered prediction.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure. A higher score indicates greater accuracy. A score above 0.5 suggests that the overall fold may be similar to the true structure.
    ipTM The interface predicted template modeling score, reported only when the predicted structure is a complex. It measures the accuracy of the predicted relative positions of the subunits in the complex. A higher score indicates greater accuracy. Scores above 0.8 indicate high-quality predictions, scores below 0.6 suggest that the prediction may have failed, and scores between 0.6 and 0.8 fall in a gray area where the correctness of the prediction is uncertain.
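
    The thresholds in the table above can be applied to the values in scores.csv with a few small helpers. This is only a sketch of the interpretation rules as stated; the function names and the band labels (for example "acceptable") are assumptions made for illustration and are not produced by the module itself.

```python
def plddt_band(plddt: float) -> str:
    """Map a pLDDT value (0-100) to the reliability bands described in the table."""
    if not 0.0 <= plddt <= 100.0:
        raise ValueError("pLDDT must lie in [0, 100]")
    if plddt < 50.0:
        return "very low"      # likely disordered
    if plddt < 70.0:
        return "low"
    return "acceptable"        # hypothetical label for values of 70 and above


def ptm_fold_plausible(ptm: float) -> bool:
    """pTM above 0.5 suggests the overall fold may resemble the true structure."""
    return ptm > 0.5


def iptm_band(iptm: float) -> str:
    """Map an ipTM value to the three bands described in the table."""
    if iptm > 0.8:
        return "high quality"
    if iptm < 0.6:
        return "likely failed"
    return "uncertain"         # gray area between 0.6 and 0.8


if __name__ == "__main__":
    print(plddt_band(65.0), ptm_fold_plausible(0.62), iptm_band(0.7))
```

    Keeping each metric in its own function makes it straightforward to adjust a single threshold without touching the others.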

    References

    • Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022 Jun;19(6):679-682. DOI: 10.1038/s41592-022-01488-1
  • Name: TCR-pMHC Complex Structure Prediction
    Description: 基于TCRmodel2实现,TCRmodel2在AlphaFold基础上对TCR-肽-MHC复合物建模做了优化,与原生AlphaFold和其他基于基准测试的TCR-肽-MHC复合物建模方法相比,其准确度相似或更高,可在30分钟内完成复合物结构预测。 TCR-peptide-MHC complex structure prediction based on TCRmodel2, which builds on AlphaFold and optimizes it for TCR-peptide-MHC complex modeling. In benchmarking, it achieves accuracy similar to or higher than native AlphaFold and other TCR-peptide-MHC modeling methods, and it completes a complex structure prediction within 30 minutes.
    Tags: undefined
    Author: Rui Yin
    Release: 2024-11-08 10:35:19
    Reference: Yin R, Ribeiro-Filho HV, Lin V, Gowthaman R, Cheung M, Pierce BG. TCRmodel2: high-resolution modeling of T cell receptor recognition using deep learning. Nucleic Acids Res. 2023 Jul 5;51(W1):W569-W576.

    TCR-pMHC Complex Structure Prediction

    简介

    细胞免疫系统是人体免疫的重要组成部分,它使用 T 细胞受体 (TCR) 识别由主要组织相容性复合体 (MHC) 蛋白呈递的肽形式的抗原蛋白。准确定义TCR的结构基础及其与肽-MHC的结合可以为正常和异常免疫提供重要见解,并有助于指导疫苗和免疫疗法的设计。鉴于实验确定的TCR-肽-MHC结构数量有限,而每个个体内的TCR以及抗原靶标数量巨大,因此需要准确的建模方法。该模块基于TCRmodel2实现,TCRmodel2在AlphaFold基础上对TCR-肽-MHC复合物建模做了优化,与原生AlphaFold和其他基于基准测试的TCR-肽-MHC复合物建模方法相比,其准确度相似或更高,可在30分钟内完成复合物结构预测。

    参数说明

    TCR α

    TCR α链的序列,如:AQEVTQIPAALSVPEGENLVLNCSFTDSAIYNLQWFRQDPGKGLTSLLLIQSSQREQTSGRLNASLDKSSGRSTLYIAASQPGDSATYLCAVTNQAGTALIFGKGTTLSVSS

    TCR β

    TCR β链的序列,如:NAGVTQTPKFQVLKTGQSMTLQCSQDMNHEYMSWYRQDPGMGLRLIHYSVGAGITDQGEVPNGYNVSRSTTEDFPLRLLSAAPSQTSVYFCASSYSIRGSRGEQFFGPGTRLTVL

    Peptide Sequence

    多肽序列,如:LAWEWWRTVAL
    注:输入的多肽序列长度需要符合相应要求,如下:
    I型TCR-pMHC复合物中,多肽的序列长度在8-15之间;
    II型TCR-pMHC复合物中,多肽的长度为11。

    MHC(I or II α)

    MHC-I型序列或MHC-II α链序列。
    当预测I型TCR-pMHC复合物时,输入MHC-I型序列,如:SHSLKYFHTSVSRPGRGEPRFISVGYVDDTQFVRFDNDAASPRMVPRAPWMEQEGSEYWDRETRSARDTAQIFRVNLRTLRGYYNQSEAGSHTLQWMHGCELGPDGRFLRGYEQFAYDGKDYLTLNEDLRSWTAVDTAAQISEQKSNDASEAEHQRAYLEDTCVEWLHKYLEKGKETLLH
    当预测II型TCR-pMHC复合物时,输入MHC-II α链序列,如:IKADHVSTYAAFVQTHRPTGEFMFEFDEDEMFYVDLDKKETVWHLEEFGQAFSFEAQGGLANIAILNNNLNTLIQRSNHTQAT

    MHC II β

    MHC-II β链序列,当预测II型TCR-pMHC复合物时才需要输入,如:PENYLFQGRQECYAFNGTQRFLERYIYNREEFARFDSDVGEFRAVTELGRPAAEYWNSQKDILEEKRAVPDRMCRHNYELGGPMTLQR

    结果说明

    输出结果包括:

    输出文件名称 说明
    ranked_0.pdb 预测得到的最佳复合物结构。
    pdbs.tar.gz 预测得到的前5个最佳复合物结构的压缩包文件。
    scores.csv 结构评分文件

    其中scores.csv包含如下信息:

    字段名称 说明
    PDB 复合物PDB结构的文件名
    Model_Confidence 结构的置信度评分,是pTM与ipTM评分的加权综合值,数值在0-1之间,越接近1表示结构模型质量越好
    pLDDT 局部结构的可信度指标,值范围是0-100,该值越大说明预测的结构越可靠。低于70被认为可靠性较低,低于50基本认为是可信度非常低,为无序预测
    pTM the predicted template modeling score预测的TM分数,衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM the interface predicted template modeling score预测的亚基接触面的TM分数,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定
    TCR-pMHC_ipTM TCR与pMHC之间的ipTM值

    参考文献

    Yin R, Ribeiro-Filho HV, Lin V, Gowthaman R, Cheung M, Pierce BG. TCRmodel2: high-resolution modeling of T cell receptor recognition using deep learning. Nucleic Acids Res. 2023 Jul 5;51(W1):W569-W576.

    TCR-pMHC Complex Structure Prediction

    Introduction

    The cellular immune system is a crucial component of human immunity. It uses T cell receptors (TCRs) to recognize antigenic proteins in the form of peptides presented by major histocompatibility complex (MHC) proteins. Accurately defining the structural basis of TCRs and their binding to peptide-MHC can provide important insights into normal and aberrant immunity and help guide the design of vaccines and immunotherapies. Because the number of experimentally determined TCR-peptide-MHC structures is limited, while the number of TCRs and antigen targets within each individual is vast, accurate modeling methods are needed. This module is based on TCRmodel2, which builds on AlphaFold and optimizes it for TCR-peptide-MHC complex modeling. In benchmarking, it achieves accuracy similar to or higher than native AlphaFold and other TCR-peptide-MHC modeling methods, and it completes a complex structure prediction within 30 minutes.

    Parameters

    TCR α

    The sequence of the TCR α chain, for example: AQEVTQIPAALSVPEGENLVLNCSFTDSAIYNLQWFRQDPGKGLTSLLLIQSSQREQTSGRLNASLDKSSGRSTLYIAASQPGDSATYLCAVTNQAGTALIFGKGTTLSVSS

    TCR β

    The sequence of the TCR β chain, for example: NAGVTQTPKFQVLKTGQSMTLQCSQDMNHEYMSWYRQDPGMGLRLIHYSVGAGITDQGEVPNGYNVSRSTTEDFPLRLLSAAPSQTSVYFCASSYSIRGSRGEQFFGPGTRLTVL

    Peptide Sequence

    The peptide sequence, for example: LAWEWWRTVAL.
    Note: The length of the input peptide sequence must meet the following requirements:
    For Class I TCR-pMHC complexes, the peptide must be 8 to 15 residues long;
    For Class II TCR-pMHC complexes, the peptide must be exactly 11 residues long.
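
    The length rule above is easy to check before submitting a job. The following is a minimal sketch; the function name and the "I" / "II" class labels are assumptions chosen for illustration, not part of the module's interface.

```python
CLASS_I_LENGTHS = range(8, 16)   # 8 to 15 residues, inclusive
CLASS_II_LENGTH = 11             # exactly 11 residues


def peptide_length_ok(peptide: str, mhc_class: str) -> bool:
    """Return True if the peptide length fits the stated rule for the MHC class."""
    n = len(peptide.strip())
    if mhc_class == "I":
        return n in CLASS_I_LENGTHS
    if mhc_class == "II":
        return n == CLASS_II_LENGTH
    raise ValueError("mhc_class must be 'I' or 'II'")


if __name__ == "__main__":
    # The example peptide from the text has 11 residues.
    print(peptide_length_ok("LAWEWWRTVAL", "I"), peptide_length_ok("LAWEWWRTVAL", "II"))
```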

    MHC (I or II α)

    The MHC-I sequence or MHC-II α chain sequence.
    When predicting Class I TCR-pMHC complexes, input the MHC-I sequence, for example: SHSLKYFHTSVSRPGRGEPRFISVGYVDDTQFVRFDNDAASPRMVPRAPWMEQEGSEYWDRETRSARDTAQIFRVNLRTLRGYYNQSEAGSHTLQWMHGCELGPDGRFLRGYEQFAYDGKDYLTLNEDLRSWTAVDTAAQISEQKSNDASEAEHQRAYLEDTCVEWLHKYLEKGKETLLH.
    When predicting Class II TCR-pMHC complexes, input the MHC-II α chain sequence, for example: IKADHVSTYAAFVQTHRPTGEFMFEFDEDEMFYVDLDKKETVWHLEEFGQAFSFEAQGGLANIAILNNNLNTLIQRSNHTQAT.

    MHC II β

    The MHC-II β chain sequence, which is required only when predicting Class II TCR-pMHC complexes, for example: PENYLFQGRQECYAFNGTQRFLERYIYNREEFARFDSDVGEFRAVTELGRPAAEYWNSQKDILEEKRAVPDRMCRHNYELGGPMTLQR.

    Result

    The output results include:

    Output File Name Description
    ranked_0.pdb The predicted best complex structure.
    pdbs.tar.gz A compressed file containing the top 5 predicted complex structures.
    scores.csv Structure scoring file.

    The scores.csv contains the following information:

    Field Name Description
    PDB The filename of the complex PDB structure.
    Model_Confidence The confidence score of the structure, which is a weighted composite value of pTM and ipTM scores, ranging from 0 to 1, with values closer to 1 indicating better model quality.
    pLDDT A measure of the reliability of the local structure, ranging from 0 to 100; higher values indicate more reliable predictions. Values below 70 are considered low reliability, and below 50 are deemed very low reliability, indicating disordered predictions.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure; higher values indicate greater accuracy. A score greater than 0.5 suggests that the overall folding of the structure may resemble the true structure.
    ipTM The interface predicted template modeling score, which measures the accuracy of the predicted relative positions of subunits within the complex; higher values indicate greater accuracy. Scores above 0.8 indicate high-quality predictions, while scores below 0.6 suggest potential failure, and scores between 0.6 and 0.8 are in a gray area where correctness is uncertain.
    TCR-pMHC_ipTM The ipTM value between the TCR and pMHC.
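
    The text describes Model_Confidence only as a weighted composite of pTM and ipTM and does not give the weights. As an illustration, the sketch below uses the AlphaFold-Multimer convention of 0.8 × ipTM + 0.2 × pTM; that TCRmodel2 reports exactly this combination is an assumption here, and the function name is hypothetical.

```python
def model_confidence(ptm: float, iptm: float, iptm_weight: float = 0.8) -> float:
    """Weighted combination of ipTM and pTM.

    The default weight of 0.8 follows the AlphaFold-Multimer convention and is
    an assumption with respect to this module's Model_Confidence column.
    """
    if not 0.0 <= iptm_weight <= 1.0:
        raise ValueError("iptm_weight must lie in [0, 1]")
    return iptm_weight * iptm + (1.0 - iptm_weight) * ptm


if __name__ == "__main__":
    print(round(model_confidence(ptm=0.75, iptm=0.85), 3))
```

    Because both inputs lie between 0 and 1 and the weights sum to 1, the result also lies between 0 and 1, consistent with the range given in the table.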

    References

    Yin R, Ribeiro-Filho HV, Lin V, Gowthaman R, Cheung M, Pierce BG. TCRmodel2: high-resolution modeling of T cell receptor recognition using deep learning. Nucleic Acids Res. 2023 Jul 5;51(W1):W569-W576.

  • Name: Alanine Scan (MMPBSA v2)
    Description: 计算丙氨酸突变后的结合自由能 Calculates components of binding free energy after alanine mutation using the MM-PBSA method.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-08-03 09:10:47
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMPBSA)

    简介

    Alanine Scan (MMPBSA)是计算丙氨酸突变后的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMPBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Reference Structure (GRO)

    参考结构。默认system.gro。可以在GMX MD Run (GMX2024)模块的输出结果文件中找到。当周期性边界条件处理不当时可以使用该参数。

    Mutation Residue

    突变扫描为丙氨酸(ALA)的氨基酸位置。格式为‘32-34,36’。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Force File

    丙氨酸扫描时使用的力场。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。只能是AMBER力场下构建的膜体系才能进行计算。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMPBSA_result.csv/MMPBSA_Result_txt.tar.gz 丙氨酸突变结果csv文件。
    MMPBSA_Residue.csv/MMPBSA_Residue_csv.tar.gz 残基能量分解数据(CSV)。
    MMPBSA.pdb/MMPBSA_pdb.tar.gz 突变后能量映射到 PDB 文件,可用于可视化结合能贡献区域。
    MMPBSA.tar.gz 全部原始数据,包括:
    • _mmpbsa_residue_#.txt(7 类能量:VDW、ELE、PB、SA、MM、PBSA、Binding)
    • _mmpbsa_residue.txt(残基能量汇总,对应 MMPBSA_Residue.csv)
    • _mmpbsa_atom#.pdb(原子能量映射 PDB,类似 MMPBSA.pdb)。
    ALA_Scan_Results.csv 丙氨酸扫描所有残基突变结果。

    ALA_Scan_Results.csv,包含信息如下:

    字段名称 说明
    index 残基编号。
    Residue 原始残基名称。
    Mutation Residue 突变后的残基(通常为丙氨酸 ALA)。
    dH (kJ/mol) 焓贡献。
    Tds (kJ/mol) 熵贡献(TΔS)。
    dG (kJ/mol) 结合自由能变化。决定结合强弱的关键指标。越负说明亲和力越强。
    Ki (µM/L) 解离常数,结合亲和力的倒数。
    Ka (L/µM) 结合常数,亲和力大小。

    Ka 越大表示结合力强,Ki 越小表示抑制效果强。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMPBSA)

    Introduction

    Alanine Scan (MMPBSA) calculates the binding free energy after alanine mutation and provides energy decomposition data, binding constants (Ka), and inhibition constants (Ki). Entropy is computed with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropic component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation. The calculation should be restricted to the structurally stable part of the trajectory.
    This module provides four calculation methods. Trajectory calculates the receptor-ligand binding free energy from a molecular dynamics trajectory. One Structure calculates the receptor-ligand binding free energy for a single-frame PDB structure or a docked PDB structure; in the MMPBSA of One Structure workflow, a PDB file can be input directly.
    Index Name takes the group names of the receptor and ligand, while Custom Name takes the residue numbers of the receptor and ligand in the PDB file.
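
    For reference, the interaction entropy of Duan et al. (cited in the references of this section) is -TΔS = kT ln⟨exp(βΔE)⟩, where ΔE is the deviation of the receptor-ligand interaction energy of each frame from its trajectory average and β = 1/kT. The sketch below evaluates that expression for a list of per-frame interaction energies. It is not the module's implementation: the function name, the default temperature of 298.15 K, and the input layout are assumptions.

```python
import math

# Gas constant in kJ/(mol*K), i.e. the Boltzmann constant per mole.
R_KJ_PER_MOL_K = 0.0083144626


def interaction_entropy(energies_kj_mol, temperature_k=298.15):
    """Return -T*dS in kJ/mol from per-frame interaction energies in kJ/mol."""
    energies = list(energies_kj_mol)
    if not energies:
        raise ValueError("at least one frame is required")
    kt = R_KJ_PER_MOL_K * temperature_k
    mean_energy = sum(energies) / len(energies)
    # Ensemble average of exp(beta * (E - <E>)) over the selected frames.
    avg = sum(math.exp((e - mean_energy) / kt) for e in energies) / len(energies)
    return kt * math.log(avg)


if __name__ == "__main__":
    frames = [-101.5, -99.0, -100.2, -98.7, -100.6]  # made-up example values
    print(round(interaction_entropy(frames), 3))
```

    The exponential average is dominated by the largest fluctuations, which is why frames from an unstable part of the trajectory can inflate the entropy term.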

    Parameters

    Trajectory Method

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand; can be Protein, DNA, or RNA. For a small molecule, enter its residue name as it appears in the PDB file. If everything in the system other than the protein is the ligand (small molecules included), use ‘Other’.

    Reference Structure (GRO)

    Reference structure. Default: system.gro.
    This file can be found in the output results of the GMX MD Run (GMX2024) module.
    Use this parameter when periodic boundary conditions are not handled properly.

    Mutation Residue

    Positions of the residues to be mutated to alanine (ALA) in the scan, in the format ‘32-34,36’. Protein amino acids or nucleic acid bases are renumbered starting from 1, independent of the residue numbering in the original PDB file.
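
    To make the ‘32-34,36’ format concrete, the sketch below expands such a string into an explicit list of 1-based positions. The same comma-and-range syntax appears in the Custom Receptor and Custom Ligand parameters. The function name is hypothetical and the parsing rules are inferred from the examples only.

```python
from typing import List


def parse_residue_spec(spec: str) -> List[int]:
    """Expand a string such as '32-34,36' into sorted, unique 1-based positions."""
    positions = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range start exceeds end: {part!r}")
            positions.update(range(start, end + 1))
        else:
            positions.add(int(part))
    if any(p < 1 for p in positions):
        raise ValueError("positions are numbered from 1")
    return sorted(positions)


if __name__ == "__main__":
    print(parse_residue_spec("32-34,36"))
```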

    Force File

    Force field used for alanine scanning.

    Start Time (ps)

    Start frame time in ps. It is best to choose a time window in which the RMSD is stable, to avoid the inflated overall entropy that structural instability causes.

    End Time (ps)

    End frame time in ps. It is best to choose a time window in which the RMSD is stable, to avoid the inflated overall entropy that structural instability causes.

    Skip Time (ps)

    Time interval in ps.

    Index File

    Index file in ndx format. When there is a membrane structure in the system, you can obtain the index.ndx file through the Membrane Solvation module. In membrane simulations, the receptor is ‘receptor’, the ligand is ‘ligand’, and the membrane is ‘membrane’. Only membrane systems built under the AMBER force field can be calculated.

    Custom Receptor

    Define receptor groups for binding energy calculation. Enter the sequence numbers of protein amino acids or nucleic acid bases in the group. For example, 1-213 or 1-211,212-213. Protein amino acid or nucleic acid base numbers start from 1 and are independent of the initial pdb amino acid numbering.

    Custom Ligand

    Define ligand groups for binding energy calculation. Enter the sequence numbers of protein amino acids or nucleic acid bases in the group. For example, 1-213 or 1-211,212-213. Protein amino acid or nucleic acid base numbers start from 1 and are independent of the initial pdb amino acid numbering.

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    Compressed archive of the system parameter files, in tar.gz format, obtained from the MD Solvation module or Membrane Solvation module.

    Results

    The output includes:

    File Name Description
    MMPBSA_result.csv / MMPBSA_Result_txt.tar.gz Alanine mutation result (csv file).
    MMPBSA_Residue.csv / MMPBSA_Residue_csv.tar.gz Residue energy decomposition data (CSV).
    MMPBSA.pdb / MMPBSA_pdb.tar.gz Energy mapped onto the PDB file after mutation, useful for visualizing binding energy contribution regions.
    MMPBSA.tar.gz Complete raw data, including:
    • _mmpbsa_residue_#.txt (7 energy terms: VDW, ELE, PB, SA, MM, PBSA, Binding)
    • _mmpbsa_residue.txt (residue energy summary, corresponding to MMPBSA_Residue.csv)
    • _mmpbsa_atom#.pdb (atomic energy mapped PDB files, similar to MMPBSA.pdb).
    ALA_Scan_Results.csv Results of alanine scanning mutations for all residues.

    ALA_Scan_Results.csv Contents

    Field Name Description
    index Residue index number.
    Residue Original residue name.
    Mutation Residue Mutated residue (typically alanine, ALA).
    dH (kJ/mol) Enthalpy change.
    Tds (kJ/mol) Entropy term (TΔS).
    dG (kJ/mol) Binding free energy change, the key indicator of binding strength. The more negative the value, the stronger the affinity.
    Ki (µM/L) Dissociation constant, reciprocal of binding affinity.
    Ka (L/µM) Association constant, magnitude of binding affinity.

    Larger Ka indicates stronger binding affinity, while smaller Ki indicates stronger inhibitory effect.
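
    The relationship between dG and the two constants can be illustrated with the standard thermodynamic relation ΔG = RT ln(Kd / c°), with Ka = 1 / Kd. The sketch below assumes T = 298.15 K and a 1 M standard state; the temperature and convention actually used by the module are not stated here, and the function names are hypothetical.

```python
import math

R_KJ_PER_MOL_K = 0.0083144626  # gas constant in kJ/(mol*K)


def kd_micromolar(dg_kj_mol: float, temperature_k: float = 298.15) -> float:
    """Dissociation constant in micromolar from a binding free energy in kJ/mol."""
    kd_molar = math.exp(dg_kj_mol / (R_KJ_PER_MOL_K * temperature_k))
    return kd_molar * 1e6


def ka_per_micromolar(dg_kj_mol: float, temperature_k: float = 298.15) -> float:
    """Association constant, the reciprocal of the dissociation constant."""
    return 1.0 / kd_micromolar(dg_kj_mol, temperature_k)


if __name__ == "__main__":
    for dg in (-30.0, -40.0, -50.0):
        print(dg, kd_micromolar(dg), ka_per_micromolar(dg))
```

    A more negative dG gives a smaller dissociation constant and a larger association constant, which matches the interpretation given above.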

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: Back Mutation Grouping (v2.3)
    Description: Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。 Back Mutation Grouping is a grouping module in the antibody humanization design workflow, which groups the back mutations based on the mutation scoring table generated by the Mutation Score module and returns the back mutated sequence.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-01-17 15:21:07
    Reference:

    Back Mutation Grouping v2.3

    简介

    Back Mutation Grouping是抗体人源化设计流程中分组模块,根据Mutation Score模块输出的回复突变评分表对回复突变进行分组,并返回突变后的序列。

    参数说明

    Grafted Chain

    抗体CDR区嫁接后序列文件,FASTA格式,由Grafting模块生成

    Raw Chain

    抗体序列文件,FASTA格式

    Mutation Score

    人源化突变评分文件,CSV格式,由Mutation Score模块生成

    Output File

    指定输出的突变序列文件名称,FASTA格式

    Cutoff

    打分分组的截断值,逗号分割,例如:2,5,10表示将氨基酸突变评分大于10的为一组,5~10的氨基酸为一组,小于2的氨基酸分为一组。

    Output Policy

    指定输出的回复突变的文件

    结果说明

    根据不同截断值得到突变分组结果文件mutate_policy.json。

    Back Mutation Grouping v2.3

    Introduction

    Back Mutation Grouping is a grouping module in the humanization design process of antibodies, which groups back mutations based on the mutation score table output by the Mutation Score module and returns the mutated sequences.

    Parameters

    Grafted Chain

    Sequence file of the antibody CDR region after grafting, in FASTA format, generated by the Grafting module.

    Raw Chain

    Sequence file of the antibody, in FASTA format.

    Mutation Score

    Humanization mutation score file, in CSV format, generated by the Mutation Score module.

    Output File

    Specify the name of the output mutation sequence file, in FASTA format.

    Cutoff

    Cutoff values used to group mutations by score, separated by commas. For example, 2,5,10 puts mutations scoring above 10 in one group, mutations scoring between 5 and 10 in another, and mutations scoring below 2 in a separate group.
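
    One way to picture cutoff-based grouping is as binning scores into consecutive intervals. The sketch below assumes that the cutoffs split the score axis into intervals and that a score equal to a cutoff falls into the higher interval. Neither the boundary handling nor the treatment of scores between 2 and 5 is specified in the text, so this is an illustration rather than the module's actual rule; all names are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List


def parse_cutoffs(text: str) -> List[float]:
    """Turn a string such as '2,5,10' into a sorted list of floats."""
    return sorted(float(item) for item in text.split(",") if item.strip())


def group_index(score: float, cutoffs: List[float]) -> int:
    """Index of the interval containing the score (0 = below the first cutoff)."""
    return bisect_right(cutoffs, score)


def group_mutations(scores: Dict[str, float], cutoffs: List[float]) -> Dict[int, List[str]]:
    """Collect mutation names by interval index."""
    groups: Dict[int, List[str]] = {}
    for name, score in scores.items():
        groups.setdefault(group_index(score, cutoffs), []).append(name)
    return groups


if __name__ == "__main__":
    cutoffs = parse_cutoffs("2,5,10")
    example = {"mutA": 1.0, "mutB": 7.5, "mutC": 12.0}  # made-up names and scores
    print(group_mutations(example, cutoffs))
```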

    Output Policy

    Specify the file for the output of back mutations.

    Results

    The mutation grouping results file, mutate_policy.json, is generated based on different cutoff values.

  • Name: Antibody Numbering v2
    Description: 抗体编号模块,用于注释抗体可变区(Fv)或恒定区(包括 Fc), 支持几乎所有主流的抗体编号规则,如可变区广泛使用的Kabat、Chothia 和 IMGT,以及恒定区主要使用的EU规则。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq-> Number。 An antibody numbering module that annotates variable regions (Fv) and constant regions (including Fc). It supports the mainstream numbering schemes: Kabat, Chothia, and IMGT, which are widely used for the Fv, and EU, the most commonly used scheme for the constant region. Running it through the WeSeq sequence editor is recommended: WeSeq -> Number.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-09-23 16:45:09
    Reference:

    Antibody Numbering v2

    简介

    Antibody Numbering v2是抗体编号模块,用于注释抗体可变区(Fv)或恒定区(包括 Fc), 支持几乎所有主流的抗体编号规则,如可变区广泛使用的Kabat、Chothia 和 IMGT,以及恒定区主要使用的EU规则。

    参数说明

    Variable Region (Fv)模式

    该模式针对抗体的Fv区序列(包括重链 VH 和轻链 VL),通过指定编号规则(如 Kabat、Chothia、或 IMGT)对氨基酸残基进行标准化编号。

    Fasta File

    抗体序列文件,FASTA格式,支持多序列模式。

    Numbering Scheme

    可变区编号规则,支持Kabat、Chothia、IMGT,可多选。

    Constant Region (Fc)模式

    通常用于抗体恒定区的EU、Kabat标准化编号。

    Fasta File

    抗体序列文件,FASTA格式,支持多序列模式。

    Numbering Scheme

    恒定区编号规则:eu,kabat。默认为eu。

    结果说明

    • Variable Region (Fv)模式下的输出结果包括:
    输出文件名称 说明
    output_chothia(imgt\kabat\martin).csv 抗体可变区四种编号规则的csv文件
    output_chothia(imgt\kabat\martin).json 抗体可变区四种编号规则的json文件
    output_nonfv.fasta 当输入文件是完整抗体序列(包含Fv和Fc)时,自动识别出Fv区;并非Fv部分提取出来单独保存为output_nonfv.fasta。如果输入只包含Fv区,则不输出。

    三种不同编号规则的csv文件,包含信息如下:

    字段名称 说明
    molecule 抗体序列名称
    chain_type 抗体链类型:重链(VH)或者轻链(VL)
    is_cdr 判断是否为CDR区
    loc 序列位置
    numbering 序列编号
    insertion 插入序列编号
    region 抗体可变区类型:CDR1、CDR2或者CDR3
    domain 区域
    • Constant Region (Fc)模式下EU编号的输出结果包括:
    输出文件名称 说明
    output_EU.csv 抗体恒定区EU编号规则的csv文件
    output_EU.json 抗体恒定区EU编号规则的json文件
    output_MatchRate.csv 跟不同IgG亚型相似度

    其中output_EU.csv文件,包含信息如下:

    字段名称 说明
    Chain 抗体序列链类型
    Position 序列位置
    Eu numbering 序列EU编号
    Residue 抗体氨基酸缩写
    IgG1 Ref IgG1氨基酸缩号
    Region 抗体恒定类型:CH1、CH2、CH3、Hinge
    Mutation(IgG1) 原序列突变成IgG1的突变信息

    注意:在 output_MatchRate.csv 文件中,如果 MatchRate_Global 数值偏低,说明该序列与标准 Fc 区域的相似性较差,可能并不是典型的 Fc 结构,而是linker 或随机插入的非 Fc 序列。

    • Constant Region (Fc)模式下Kabat编号的输出结果包括:
    输出文件名称 说明
    failed_to_number.fasta 不能进行恒定区编号的fasta文件
    output_fc_kabat.csv 抗体恒定区Kabat编号规则的csv文件
    output_fc_kabat.json 抗体恒定区Kabat编号规则的json文件

    其中output_fc_kabat.csv文件,包含信息如下:

    字段名称 说明
    molecule 抗体序列名称
    Residue 抗体氨基酸缩写
    chain_type 抗体链类型:重链(VH)或者轻链(VL)
    is_cdr 判断是否为CDR区
    loc 序列位置
    numbering 序列编号
    insertion 插入序列编号
    region 抗体可变区类型:CDR1、CDR2或者CDR3
    domain 区域

    Antibody Numbering v2

    Introduction

    Antibody Numbering v2 is an antibody numbering module that annotates the antibody variable region (Fv) or constant region (including Fc). It supports almost all mainstream antibody numbering schemes, such as Kabat, Chothia, and IMGT, which are widely used for the variable region, and EU, which is mainly used for the constant region.

    Parameters

    Variable Region (Fv) Mode

    This mode is for the Fv region of antibodies (including heavy chain VH and light chain VL). Amino acid residues are assigned standardized numbers according to the specified numbering scheme (e.g., Kabat, Chothia, or IMGT).

    Fasta File

    Antibody sequence file in FASTA format. Multiple sequences are supported.

    Numbering Scheme

    Variable region numbering schemes. Supports Kabat, Chothia, and IMGT. Multiple selections are allowed.

    Constant Region (Fc) Mode

    Typically used for EU or Kabat standardized numbering of antibody constant regions.

    Fasta File

    Antibody sequence file in FASTA format. Multiple sequences are supported.

    Numbering Scheme

    Numbering scheme for constant regions: EU or Kabat. The default is EU.

    Results

    • Under Variable Region (Fv) Mode, the output includes:
    Output File Name Description
    output_chothia(imgt\kabat\martin).csv CSV files for the four numbering schemes of antibody variable regions
    output_chothia(imgt\kabat\martin).json JSON files for the four numbering schemes of antibody variable regions
    output_nonfv.fasta When the input sequence contains a full antibody (Fv + Fc), the Fv region is automatically identified and the non-Fv region is saved to output_nonfv.fasta. If the input contains only the Fv region, this file is not generated.

    The CSV file for each numbering scheme contains the following fields:

    Field Name Description
    molecule Antibody sequence name
    chain_type Antibody chain type: heavy chain (VH) or light chain (VL)
    is_cdr Indicates whether the position belongs to a CDR
    loc Sequence position
    numbering Numbering index
    insertion Insertion code
    region Antibody variable region type: CDR1, CDR2, or CDR3
    domain Region/domain
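
    As an illustration of how the fields listed above might be used, the sketch below collects the numbered positions that belong to each CDR. The sample rows are invented, and the exact values written to columns such as is_cdr, region, and insertion are assumptions; check a real output file before relying on this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows using the column names listed above.
SAMPLE_CSV = """molecule,chain_type,is_cdr,loc,numbering,insertion,region,domain
mAb1,VH,False,30,30,,FR1,VH
mAb1,VH,True,31,31,,CDR1,VH
mAb1,VH,True,32,32,,CDR1,VH
mAb1,VH,True,101,100,A,CDR3,VH
"""


def cdr_positions(csv_text: str) -> dict:
    """Group position labels (numbering plus insertion code) by molecule, chain, and CDR."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        region = row["region"].strip()
        if region.upper().startswith("CDR"):
            label = row["numbering"].strip() + row["insertion"].strip()
            grouped[(row["molecule"], row["chain_type"], region)].append(label)
    return dict(grouped)


if __name__ == "__main__":
    for key, labels in cdr_positions(SAMPLE_CSV).items():
        print(key, labels)
```

    Filtering on the region column rather than is_cdr avoids guessing how the boolean flag is encoded in the file.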
    • Under Constant Region (Fc) Mode with EU numbering, the output includes:
    Output File Name Description
    output_EU.csv CSV file following EU numbering rules for antibody constant regions
    output_EU.json JSON file following EU numbering rules for antibody constant regions
    output_MatchRate.csv Similarity to different IgG subtypes

    The output_EU.csv file contains the following fields:

    Field Name Description
    Chain Antibody chain type
    Position Sequence position
    Eu numbering EU numbering index
    Residue Amino acid residue
    IgG1 Ref IgG1 reference residue
    Region Antibody constant region type: CH1, CH2, CH3, or Hinge
    Mutation(IgG1) Mutation information compared to IgG1

    Note: In the output_MatchRate.csv file, a low MatchRate_Global value indicates that the sequence has poor similarity to canonical Fc regions. Such a sequence is probably not a typical Fc structure and may instead be a linker or a randomly inserted non-Fc segment.

    • Under Constant Region (Fc) Mode with Kabat numbering, the output includes:
    Output File Name Description
    failed_to_number.fasta FASTA sequences that could not be numbered
    output_fc_kabat.csv CSV file following Kabat numbering rules for antibody constant regions
    output_fc_kabat.json JSON file following Kabat numbering rules for antibody constant regions

    The output_fc_kabat.csv file contains the following fields:

    Field Name Description
    molecule Antibody sequence name
    Residue Amino acid residue
    chain_type Antibody chain type: heavy chain (VH) or light chain (VL)
    is_cdr Indicates whether the position belongs to a CDR
    loc Sequence position
    numbering Numbering index
    insertion Insertion code
    region Antibody variable region type: CDR1, CDR2, or CDR3
    domain Region/domain
  • Name: Immunogenicity Prediction (WeADApt v4.1)
    Description: 唯信开发的基于多模融合深度学习的端到端免疫原性预测系统WeADApt(原名:AlphaMHC)v4.1。注:该版本不是最新版本,不是默认推荐的。 The new generation of the deep learning immunogenicity prediction system, WeADApt (formerly known as AlphaMHC) v4.1. This is not the latest version, and is generally not recommended by default.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-10-18 10:50:56
    Reference:

    Immunogenicity Prediction (WeADApt v4.1)

    简介

    WeADApt (Wecomput ADA prediction) 是一种基于多模融合架构的免疫原性预测系统。该方法有机地将多个与免疫原性相关的模型融合,构成一个高效的免疫反应模拟系统,可准确地模拟蛋白、抗体、多肽、疫苗等生物药的免疫原性,并能鉴别潜在的免疫原性的T细胞表位(引起临床人体免疫应答的肽段)。
    注:该模块非最新版本,通常推荐使用更新版本。

    性能测试

    使用100多个临床及上市抗体的ADA数据的测试结果显示,预测的打分(MolScore)与ADA发生率的相关性达到R=0.68(下图)。

    image.png
    在同样的42个分子的数据集上,WeADApt预测的相关性超过了知名的商业软件EpiMatrix(R2=0.49 vs R2=0.42)。

    image.png

    打分

    0.2分适合作为单抗的高/低风险的阈值(>20% ADA定义为高风险)。

    关于双抗/多特异性分子

    这类分子仅需输入不重复的链即可
    在唯信收集的双抗ADA数据集的测试表现如下图所示。以0.6的分数作为分界线,可以较好的区分高、低风险的双抗分子。双抗
    注意,由于存在较多的B细胞清除双抗,其MOA会对ADA产生有较大的影响。

    image.png

    用法

    推荐从WeSeq中运行该功能,可以进行更多可视化交互

    image.png

    查看结果

    image.png
    Score为预测的免疫原性风险评分(范围0-1),Risk为风险评级

    image.png

    image.png

    注意对照结构,排除不可及(包埋的)表位(下图)
    image.png

    去免疫原性

    最简单的方式是进行人源片段的替换,可以直接在WeSeq中进行(下图)。
    image.png

    也可以通过频率分析功能引入人源突变。
    突变完之后再对突变体预测一下免疫原性是否降低。

    注意:从weseq中计算v4免疫原性的结果可以自动保存并且随时再打开的

    Immunogenicity Prediction (WeADApt v4.1)

    Introduction

    WeADApt (Wecomput ADA prediction) is an immunogenicity prediction system based on a multi-modal fusion architecture. The method integrates multiple immunogenicity-related models into an efficient immune response simulation system. It can accurately simulate the immunogenicity of biologics such as proteins, antibodies, peptides, and vaccines, and identify potential immunogenic T-cell epitopes (peptide segments that elicit clinical human immune responses).
    Note: This module is not the latest version; a newer version is generally recommended.

    Performance Testing

    Testing results using ADA data from over 100 clinical and marketed antibodies show that the predicted scores (MolScore) correlate with ADA incidence at R=0.68 (see the figure below).

    image.png

    On the same dataset of 42 molecules, WeADApt's predictions correlate more strongly with ADA incidence than those of the well-known commercial software EpiMatrix (R² = 0.49 vs. R² = 0.42).

    image.png

    Scoring

    A score of 0.2 is suitable as a threshold for high/low risk in monoclonal antibodies (>20% ADA defined as high risk).

    About Bispecific/Multispecific Molecules

    For these types of molecules, only the non-redundant chains need to be input.
    The test performance on the bispecific antibody ADA dataset collected by Wecomput is shown in the figure below. With a score of 0.6 as the dividing line, high-risk and low-risk bispecific molecules can be distinguished reasonably well.
    Note that many bispecific antibodies deplete B cells, so their mechanism of action (MOA) can have a considerable effect on ADA incidence.

    图片.png
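
    The two thresholds mentioned above (0.2 for monoclonal antibodies and 0.6 for bispecific molecules) can be applied to a predicted score as follows. This is a sketch only: the function name, the molecule-type labels, and the treatment of a score exactly equal to the threshold are assumptions, and the result is not the Risk rating reported by the module.

```python
# Score thresholds taken from the text; the dictionary keys are hypothetical labels.
RISK_THRESHOLDS = {"monoclonal": 0.2, "bispecific": 0.6}


def risk_label(score: float, molecule_type: str = "monoclonal") -> str:
    """Return 'high' if the score exceeds the threshold for the molecule type."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if molecule_type not in RISK_THRESHOLDS:
        raise ValueError(f"unknown molecule type: {molecule_type!r}")
    return "high" if score > RISK_THRESHOLDS[molecule_type] else "low"


if __name__ == "__main__":
    print(risk_label(0.3), risk_label(0.3, "bispecific"))
```

    The same score can therefore fall on different sides of the dividing line depending on the molecule format.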

    Usage

    It is recommended to run this function from WeSeq for more visual interactions.

    图片.png

    Viewing Results

    图片.png

    Score is the predicted immunogenicity risk score (range 0-1), and Risk is the risk rating.

    图片.png

    图片.png

    Check the predicted epitopes against the structure and exclude those that are inaccessible (buried) (see the figure below).

    图片.png

    De-immunization

    The simplest way is to perform human fragment replacement, which can be done directly in WeSeq (see the figure below).

    图片.png

    Human-derived mutations can also be introduced through the frequency analysis feature. After mutating, predict the immunogenicity of the mutants again to check whether it has decreased.

    Note: v4 immunogenicity results computed from WeSeq are saved automatically and can be reopened at any time.

  • Name: Disulfide Bond Search
    Description: 计算蛋白质中潜在的二硫键位置 Calculates potential disulfide bond locations in proteins
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-09-07 10:46:01
    Reference:

    Disulfide Bond Search

    简介

    Disulfide Bond Search模块计算蛋白质中潜在的二硫键位置,这对优化蛋白质的稳定性有所作用。二硫键作为对蛋白质的稳定性有极大的作用,但是加入不合理的二硫键也会容易引起聚集,表达量降低甚至错误折叠等不利影响。

    参数说明

    Structure PDB File

    在使用 PDB 格式的蛋白质结构文件时,如果其中存在缺失残基,请务必先通过Structure Preparation模块进行补全。若缺失未补全,直接输入可能导致报错。

    Chain

    指定需要设计的链,多条链用逗号分割,例如:A,B。

    Position

    设置氨基酸序号,当参数Chain设置为A,C时,此参数如果设置为1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40意味着对A中的残基1 2 3…25和链C中的残基10 11 12…40进行设计。如果不填,则该链的所有残基都参与设计。
    注意:这里的氨基酸序号是从1开始,而不是PDB文件中带有的氨基酸序号。同一条链的氨基酸序号用空格分隔,不同链的氨基酸用逗号分隔。

    Interchain

    是否只选择链间的二硫键。

    Distance

    可设置Cβ之间的距离,默认5.0Å。

    结果说明

    输出结果包括:

    输出文件名称 说明
    ss_bond.csv 输出自然顺序编号、PDB文件中的残基编号以及Cβ之间的距离信息的CSV文件。
    ss_index.fasta 序列名编号为自然顺序编号并将预测位点突变为CYS的FASTA文件。
    ss_uid.fasta 序列名编号为PDB文件中的残基编号并将预测位点突变为CYS的FASTA文件。

    Disulfide Bond Search

    Introduction

    The Disulfide Bond Search module calculates potential disulfide bond positions in proteins, which can be useful for optimizing protein stability. Disulfide bonds play a significant role in stabilizing proteins, but improper addition of disulfide bonds can lead to aggregation, reduced expression levels, or even misfolding.

    Parameters

    Structure PDB File

    When using a protein structure file in PDB format, any missing residues must be completed in the Structure Preparation module before input. Failure to do so may result in errors.

    Chain

    Specify the chains to be designed. Multiple chains are separated by commas, e.g. A,B.

    Position

    Set the residue positions to design. When the Chain parameter is set to A,C, setting this parameter to 1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40 means designing residues 1-8, 23 and 25 of chain A and residues 10-20 and 40 of chain C. If left empty, all residues of the specified chains are included in the design.
    Note: Positions are counted sequentially from 1 along each chain; they are not the residue numbers in the PDB file. Positions within the same chain are separated by spaces, and the position lists of different chains are separated by commas.
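The Chain/Position syntax described above can be sketched as a small parser. This is only an illustration of the documented format, not the module's actual implementation; the function name `parse_positions` and the choice to return `None` for "all residues" are assumptions made for the example.

```python
from typing import Dict, List, Optional


def parse_positions(chain_param: str, position_param: str) -> Dict[str, Optional[List[int]]]:
    """Map each chain ID to its 1-based sequential positions (None means all residues)."""
    chains = [c.strip() for c in chain_param.split(",") if c.strip()]
    if not position_param.strip():
        # Empty Position: every residue of every listed chain takes part in the design.
        return {chain: None for chain in chains}
    groups = position_param.split(",")  # one group per chain, comma-separated
    if len(groups) != len(chains):
        raise ValueError("expected one position group per chain")
    selection: Dict[str, Optional[List[int]]] = {}
    for chain, group in zip(chains, groups):
        positions = [int(token) for token in group.split()]  # space-separated within a chain
        if any(p < 1 for p in positions):
            raise ValueError("positions are counted from 1")
        selection[chain] = positions
    return selection


selection = parse_positions(
    "A,C", "1 2 3 4 5 6 7 8 23 25, 10 11 12 13 14 15 16 17 18 19 20 40"
)
print(selection["A"])  # positions selected on chain A
print(selection["C"])  # positions selected on chain C
```

Because the positions are sequential rather than PDB residue numbers, a structure whose chain starts at PDB residue 20 would still use position 1 for its first residue.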

    Interchain

    Whether to select only interchain disulfide bonds.

    Distance

    The distance between Cβ atoms can be set, with a default of 5.0 Å.
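As a rough illustration of how the Distance and Interchain parameters interact, the sketch below filters residue pairs by Cβ-Cβ distance. It is a simplification under stated assumptions: the coordinates are made up, the cutoff is treated as inclusive, and the real module may apply additional geometric criteria that are not documented here.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Residue = Tuple[str, int]            # (chain ID, sequential position)
Point = Tuple[float, float, float]   # Cβ coordinates in Å


def candidate_pairs(
    cb_coords: Dict[Residue, Point],
    cutoff: float = 5.0,
    interchain_only: bool = False,
) -> List[Tuple[Residue, Residue, float]]:
    """Return residue pairs whose Cβ atoms lie within `cutoff` Å of each other."""
    hits = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(cb_coords.items()), 2):
        if interchain_only and res_a[0] == res_b[0]:
            continue  # skip pairs inside the same chain
        distance = math.dist(xyz_a, xyz_b)
        if distance <= cutoff:
            hits.append((res_a, res_b, round(distance, 2)))
    return hits


# Hypothetical Cβ coordinates, for illustration only.
coords = {
    ("A", 10): (0.0, 0.0, 0.0),
    ("A", 45): (3.0, 2.0, 1.0),
    ("B", 7): (0.0, 4.0, 0.0),
    ("B", 30): (12.0, 0.0, 0.0),
}
all_pairs = candidate_pairs(coords)
interchain_pairs = candidate_pairs(coords, interchain_only=True)
print(all_pairs)
print(interchain_pairs)
```

Raising the cutoff returns more candidates at the cost of more strained geometries, which is one reason to inspect the reported distances in ss_bond.csv before choosing mutations.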

    Results

    The output includes:

    Output File Name Description
    ss_bond.csv A CSV file containing information on the natural sequence number, residue number in the PDB file, and the distance between Cβ atoms.
    ss_index.fasta A FASTA file with sequence names numbered by natural sequence number, and predicted sites mutated to CYS.
    ss_uid.fasta A FASTA file with sequence names numbered by residue number in the PDB file, and predicted sites mutated to CYS.
  • Name: Pocket Finder
    Description: 基于几何特性和物理化学特性识别蛋白口袋。 Identify protein pockets based on geometric and physicochemical properties.
    Tags: undefined
    Author: Vincent Le Guilloux; Peter Schmidtke
    Release: 2024-09-06 15:58:52
    Reference: Vincent Le Guilloux, Peter Schmidtke and Pierre Tuffery, Fpocket: An open source platform for ligand pocket detection, BMC Bioinformatics 2009, 10:168 Peter Schmidtke and Xavier Barril, Understanding and predicting druggability. A high-throughput method for detection of drug binding sites, J Med Chem 2010, 53(15):5858-67

    Pocket Finder

    简介

    Pocket Finder模块基于几何特性和物理化学特性来识别这些口袋,其主要功能是快速、准确地识别蛋白质表面的潜在口袋。蛋白质口袋(或活性位点)是蛋白质表面的小区域,通常是药物分子或其他小分子结合的地方。识别这些口袋对于药物设计和蛋白质功能研究至关重要。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式。

    Minimum Radius

    最小alpha球的半径。

    Maximum Radius

    最大alpha球的半径。

    Distance Threshold

    距离阈值聚类算法

    Clustering Method

    用于将Voronoi顶点分组的聚类方法:

    • s是单链接聚类(single linkage clustering)。
    • m是完全链接聚类(complete linkage clustering)。
    • a是平均链接聚类(average linkage clustering)。
    • c是质心链接聚类(centroid linkage clustering)。

    Clustering Measure

    聚类的距离度量方法:

    • e是欧几里得距离(euclidean distance)。
    • b是曼哈顿距离(Manhattan distance)。

    Minimum Number

    每个口袋的最小alpha球数量。

    结果说明

    输出结果包括:

    输出文件名称 说明
    pocket_properties.csv 口袋信息CSV文件
    pockets.tar.gz 蛋白分析后得到的PDB文件压缩包
    pocket*_atm.pdb 分别输出所有口袋的PDB(原子)文件格式

    其中pocket_properties.csv包含如下信息:

    字段名称 说明
    Pocket 口袋顺序
    Score 口袋综合得分,考虑了口袋的大小、形状和疏水性等因素。打分越高说明口袋更好,更有可能在生物学上具有相关性或适合药物结合。
    Druggability Score 评估口袋结合药物分子的潜力,打分越高说明口袋药物可及性越高。
    Total SASA 口袋可被溶剂分子接触的总表面积,单位为平方埃Ų;SASA较大,可容纳配体结构越大。
    Polar SASA 总SASA中的极性部分,表示可被水分子接触的表面积。反映了口袋的亲水性。
    Apolar SASA 总SASA中的非极性部分,表示不可被水分子接触的表面积。反映了口袋的疏水性。
    Volume 口袋的体积,单位为ų。较大的体积表示口袋较大,能够容纳更大的配体或多个结合位点。

    Pocket Finder与分子对接(WeView-Dock)联用教程

    1. 在Pocket Finder中上传蛋白结构,预测对接口袋。
      图片.png
    2. 任务完成,打开pocket01_atm.pdb文件,跳转至WeView。
      图片.png
    3. 同时在WeView中上传蛋白的pdb文件。
      图片.png
    4. 进入对接程序,设置配受体文件,在Define Site时点击Selected。
      图片.png
    5. 选择整个Pocket作为对接口袋,获得对接中心坐标,口袋大小可按需调整
      图片.png
      图片.png
    6. 提交对接任务。

    参考文献

    • Vincent Le Guilloux, Peter Schmidtke and Pierre Tuffery, Fpocket: An open source platform for ligand pocket detection, BMC Bioinformatics 2009, 10:168
    • Peter Schmidtke and Xavier Barril, Understanding and predicting druggability. A high-throughput method for detection of drug binding sites, J Med Chem 2010, 53(15):5858-67

    Pocket Finder

    Introduction

    The Pocket Finder module identifies pockets based on geometric and physicochemical properties. Its main function is to quickly and accurately identify potential pockets on the protein surface. Protein pockets (or active sites) are small regions on the protein surface where drug molecules or other small molecules typically bind. Identifying these pockets is crucial for drug design and protein function studies.

    Parameters

    Structure PDB File

    The structure file of the protein in PDB format.

    Minimum Radius

    The minimum radius of the alpha sphere.

    Maximum Radius

    The maximum radius of the alpha sphere.

    Distance Threshold

    The distance threshold for the clustering algorithm.

    Clustering Method

    The clustering method used to group Voronoi vertices:

    • s for single linkage clustering.
    • m for complete linkage clustering.
    • a for average linkage clustering.
    • c for centroid linkage clustering.

    Clustering Measure

    The distance metric for clustering:

    • e for Euclidean distance.
    • b for Manhattan distance.

    Minimum Number

    The minimum number of alpha spheres per pocket.

    Results

    The output results include:

    Output File Name Description
    pocket_properties.csv CSV file with pocket information
    pockets.tar.gz Compressed archive of PDB files obtained from the protein analysis
    pocket*_atm.pdb PDB (atom) file format for each pocket

    The pocket_properties.csv file contains the following information:

    Field Name Description
    Pocket Pocket order
    Score Comprehensive score of the pocket, considering factors such as size, shape, and hydrophobicity. A higher score indicates a better pocket, more likely to be biologically relevant or suitable for drug binding.
    Druggability Score Assesses the potential of the pocket to bind drug molecules. A higher score indicates higher druggability.
    Total SASA Total solvent-accessible surface area of the pocket, in square angstroms (Ų); larger SASA indicates the ability to accommodate larger ligand structures.
    Polar SASA The portion of the total SASA contributed by polar atoms. Reflects the hydrophilicity of the pocket.
    Apolar SASA The portion of the total SASA contributed by apolar atoms. Reflects the hydrophobicity of the pocket.
    Volume The volume of the pocket, in cubic angstroms (ų). A larger volume indicates a larger pocket, capable of accommodating larger ligands or multiple binding sites.
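To show how the fields above might be used together, here is a minimal sketch that ranks pockets by Druggability Score after discarding small ones. It assumes the CSV header uses exactly the field names listed in the table; the numeric values and the `min_volume` filter are invented for the example.

```python
import csv
import io
from typing import Dict, List

# Made-up rows in the assumed layout of pocket_properties.csv.
SAMPLE_CSV = """Pocket,Score,Druggability Score,Total SASA,Polar SASA,Apolar SASA,Volume
1,0.45,0.62,180.2,60.1,120.1,520.3
2,0.38,0.81,150.0,40.0,110.0,410.7
3,0.20,0.10,60.5,30.5,30.0,150.2
"""


def rank_pockets(csv_text: str, min_volume: float = 0.0) -> List[Dict[str, str]]:
    """Drop pockets smaller than `min_volume` (Å^3) and sort by Druggability Score, best first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kept = [row for row in rows if float(row["Volume"]) >= min_volume]
    return sorted(kept, key=lambda row: float(row["Druggability Score"]), reverse=True)


ranked = rank_pockets(SAMPLE_CSV, min_volume=200.0)
for row in ranked:
    print(row["Pocket"], row["Druggability Score"], row["Volume"])
```

Sorting on Druggability Score rather than Score is a choice made for this example; for questions about functional sites rather than drug binding, the general Score may be the more relevant key.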

    Pocket Finder and Molecular Docking (WeView-Dock) Combined Tutorial

    1. Upload protein structure in Pocket Finder to predict docking pockets.
      图片.png
    2. After task completion, open the pocket01_atm.pdb file and jump to WeView.
      图片.png
    3. Simultaneously upload the protein pdb file in WeView.
      图片.png
    4. Enter the docking program, set the ligand and receptor files, and click Selected when defining the site (Define Site).
      图片.png
    5. Select the entire Pocket as the docking pocket, obtain the docking center coordinates, and adjust the pocket size as needed.
      图片.png
      图片.png
    6. Submit the docking task.
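Step 5 above derives a docking center and box size from the selected pocket. How WeView computes these values is not documented here, so the following is only an approximation of the idea: take the atoms of a pocket file such as pocket01_atm.pdb, compute their bounding box, and pad it. The function name `docking_box`, the 4 Å padding and the sample coordinates are all assumptions.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def docking_box(pdb_text: str, padding: float = 4.0) -> Tuple[Vec3, Vec3]:
    """Return (center, size) of an axis-aligned box around all ATOM/HETATM records."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Fixed-width PDB columns (1-based): x = 31-38, y = 39-46, z = 47-54.
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    if not coords:
        raise ValueError("no atom records found")
    lows = [min(c[i] for c in coords) for i in range(3)]
    highs = [max(c[i] for c in coords) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(lows, highs))
    size = tuple((hi - lo) + 2 * padding for lo, hi in zip(lows, highs))
    return center, size


# Made-up coordinates standing in for the atoms of a pocket file.
sample = "\n".join(
    f"ATOM  {i:5d}  CA  ALA A{i:4d}    {x:8.3f}{y:8.3f}{z:8.3f}"
    for i, (x, y, z) in enumerate([(10, 20, 30), (14, 26, 30), (12, 22, 38)], start=1)
)
center, size = docking_box(sample)
print("center:", center)
print("size:", size)
```

The padding plays the role of the manual size adjustment mentioned in step 5: a larger value gives the ligand more room to move at the cost of a bigger search space.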

    References

    • Vincent Le Guilloux, Peter Schmidtke and Pierre Tuffery, Fpocket: An open source platform for ligand pocket detection, BMC Bioinformatics 2009, 10:168
    • Peter Schmidtke and Xavier Barril, Understanding and predicting druggability. A high-throughput method for detection of drug binding sites, J Med Chem 2010, 53(15):5858-67
  • Name: Restrained Complex Structure Prediction
    Description: 基于ColabDock框架实现,ColabDock框架通过整合多种实验限制条件,显著提升了蛋白-蛋白对接预测的准确性。 Implemented based on the ColabDock framework, which significantly improves the accuracy of protein-protein docking prediction by integrating multiple experimental constraints.
    Tags: undefined
    Author: Shihao Feng
    Release: 2024-08-22 11:55:25
    Reference: Feng, S., Chen, Z., Zhang, C. et al. Integrated structure prediction of protein–protein docking with experimental restraints using ColabDock. Nat Mach Intell, 2024.

    Restrained Complex Structure Prediction

    简介

    Restrained Complex Structure Prediction模块基于ColabDock框架实现,ColabDock框架通过整合多种实验限制条件,显著提升了蛋白-蛋白对接预测的准确性。其创新点包括:

    • 无需大规模重新训练或微调:ColabDock框架通过梯度反向传播直接整合实验限制,避免了对深度学习模型进行大规模的重新训练或微调,提高了计算效率。
    • 多源实验数据的整合能力:ColabDock能够处理不同形式和来源的实验数据,包括但不限于化学交联质谱(XL-MS)、核磁共振化学位移扰动(CSP)、共价标记(CL)和模拟的深度突变扫描(DMS)等,增强了模型的适用性和灵活性。
    • 提升预测精度:通过在多个数据集上的评估,ColabDock展现出了超越现有方法的预测精度,尤其是在考虑实验限制条件时。不仅在具有模拟残基和表面约束的复杂结构预测中优于HADDOCK和ClusPro,而且在结合核磁共振化学位移扰动和共价标记辅助的情况下也表现出色。

    ColabDock框架的工作流程分为两个主要阶段:

    1. 生成阶段
    • ColabDock生成阶段的目标是生成与提供的实验限制和模板相一致的蛋白质复合物结构。
    • 该阶段使用梯度反向传播(Backprop)来优化输入序列配置文件的对数空间,从而引导结构预测模型(AF2)产生符合实验限制的复杂结构。
    • 输入包括:蛋白质序列配置文件、每条蛋白链的模板,以及实验限制条件。
    • 优化过程:模型通过调整序列配置文件来改变对接结构,同时保持预测的蛋白质序列与输入序列一致。
      image.png
    2. 预测阶段
    • 预测阶段使用生成的结构和每个链的模板进行最终的复杂结构预测。
    • 这个阶段利用AlphaFold2(AF2)或其他深度学习模型来评估和细化复合物结构,提高预测的精确度。
    • 预测阶段的输出是最终的蛋白质复合物结构预测,它考虑了实验限制并结合了深度学习模型的预测能力。
      image.png

    ColabDock主要关注两种类型的约束。第一种约束限制了残基对之间的距离低于某一阈值,属于残基-残基层面的约束(称为1v1约束)。这类约束包括源自交联质谱(XL-MS)的约束。第二种约束定义了在蛋白质表面上可能接触的两组残基之间的约束,但具体的接触信息未知。此类约束属于界面层面的约束(称为MvN约束),典型示例包括多种NMR实验和共价标记(CL)。
    ColabDock在模拟约束条件下的性能验证情况如下图所示:
    image.png
    如图a所示,在仅提供两个1v1约束的情况下,81.08%的蛋白质复合物的最大DockQ值超过了0.23,尤其考虑到从这些约束中获取的结构信息相对有限。当提供三到五个约束时,成功率接近100%。如图b所示,对于含有两、三和五对约束的蛋白质复合物,其约束满足率分别为0.55、0.77和0.80。这些结果表明,ColabDock能够高效利用提供的约束来获得高质量的复合物结构。

    评估ColabDock在MvN约束下的性能时,先基于上述1v1样本生成了MvN样本。这些样本的挑战性更大,因为MvN约束的模糊性使得多个1v1约束组合可能满足同一组MvN约束。如图c所示,111个样本中有100个预测结构的最大DockQ值超过了0.23。其中,75个样本的top1结构的DockQ值超过0.23。随着约束数量的增加,ColabDock的准确性也相应提高,top1结构的成功率从两个约束时的62.16%上升到三个和五个约束时的70.27%。在预测结构中,约束满足率与实验结构中的比例相似(图d)。这些结果表明,ColabDock同样能够高效利用模糊的约束条件来改善结构预测。

    为了评估ColabDock中预测阶段的必要性,在上述1v1和MvN约束实验中,收集了最后十个优化步骤中的结构,大多数优化过程已经收敛。在生成阶段和预测阶段的DockQ值差异较大的情况下(这里定义为大于0.1),预测阶段在69.9%的1v1约束复合物中表现更好(图e),在MvN约束复合物中这一比例为68.8%(图f)。这些结果表明,AF2的能量景观可以帮助优化生成阶段的构象并提高预测的准确性。

    ColabDock与传统限制性对接方法比较如下图所示:
    image.png
    基于37个蛋白质复合物的独立基准集。与HADDOCK和ClusPro进行了比较。对于基准集中的每个复合物,采样两、三和五个1v1约束来指导对接,最终生成了111个样本。ColabDock在大多数样本中优于HADDOCK和ClusPro(图a)。ColabDock的平均DockQ值为0.477,而HADDOCK和ClusPro的DockQ值分别为0.287和0.191。无论1v1约束的数量多少,ColabDock在三种方法中均表现最佳(图b)。这些结果表明,ColabDock在稀疏约束条件下有生成可靠结构的潜力,这与验证集的观察结果一致。

    为了进一步评估ColabDock在界面级别约束下的表现,作为验证数据集,将上述描述的1v1约束转换为MvN约束。由于ClusPro在111个样本中有7个无法给出预测,将其排除,并对剩余的104个样本进行比较。与1v1约束下的表现相比,由于MvN约束的模糊性,ColabDock、HADDOCK和ClusPro在MvN约束下的表现有所下降,但ColabDock仍然优于其他两种方法(图c)。实验再次表明,无论MvN约束的数量多少,ColabDock在DockQ上均表现最佳(图d)。

    实验衍生的约束中常常包含相距较远的残基,作者将其称为“松散约束”。为了测试模型在相关任务中的表现,有意在距离范围为8Å到20Å之间加入了松散约束。对于基准集中的每个复合物,松散约束的数量从1到5不等,而总约束数量固定为5个,共生成了185个样本。排除了9个ClusPro无法处理的样本,并对剩余的176个样本进行了三种方法的比较。结果显示,ColabDock表现最佳,平均DockQ值为0.344,平均α碳原子r.m.s.d.(Cα-r.m.s.d.)为6.55Å(图e)。这些结果表明,ColabDock对约束的质量依赖较低。当与高质量约束结合时,ColabDock能够预测出比其他两种方法更为精确的结构。

    抗原抗体复合物预测
    抗体-抗原复合物建模一直是一个长期存在的挑战,因为互补决定区(CDRs)的灵活性和缺乏共同进化信号。深度突变扫描(DMS)是一种常用技术,用于确定可能参与抗体-抗原结合的残基。基于一个包含45个复合物的抗体-抗原基准集,通过采样界面上的残基来模拟DMS衍生的约束。预测效果及与传统方法的比较情况如下图所示:
    image.png

    图a所示,ColabDock优于HADDOCK和ClusPro,其平均DockQ值为0.223,平均r.m.s.d.为9.57Å。对于DockQ值大于0.49的样本数量,ColabDock也超过了HADDOCK和ClusPro(图b)。

    以1AHW为例:1AHW是一个人类组织因子-抗体(5G9)复合物,参与了血液凝固蛋白酶级联过程。如图c所示,随机从抗体中采样了五个界面残基(轻链的His91和Gly92,重链的Asp31、Tyr32和Asn100),以及从抗原中采样了七个界面残基(Lys165、Thr167、Val192、Thr197、Val198、Asn199和Asp204)。这些在抗体中采样的残基主要分布在L1 CDR、H1 CDR和H3 CDR区域。图d展示了AF-Multimer的预测结构以及三种对接方法的结构。如图e所示,ColabDock捕捉到了大多数界面上的天然接触,其DockQ值为0.770,r.m.s.d.为1.17Å,而其他方法的预测结构与天然构象有较大差异。这一案例研究表明,ColabDock在构象探索和构象排序方面都优于其他两种方法。

    参数说明

    Complex Structure

    初始蛋白复合物结构文件,PDB格式
    注:该结构由多条链组成,链与链之间的相对位置可任意放置,无要求。由于显存大小限制,当前最大支持的最终复合物尺寸大小不超过800个残基。

    Chains

    复合物中提取多条链,用于组成最终的复合物结构,链名之间用逗号分隔,如:A,H,L

    Fix Chains

    提取的多条链中指定相对位置固定的每对链,支持定义多对,链名之间用逗号分隔,每行一对,示例如下:

    H,L
    A,H
    

    表示链H与L之间的相对位置固定,链A与H之间的相对位置固定。

    Threthold

    实验限制的距离阈值,表示设置限制的残基间的距离需小于该阈值。默认为8.0 Å,值范围为2.0 Å - 22.0 Å,建议采用默认值。

    1v1 Restrains

    单个残基之间的限制条件,限制单个残基之间的距离在上述定义的阈值参数内,残基之间用逗号(,)分隔,支持定义多个条件(每行定义一个),示例如下:

    A20,H50
    A78,L98
    

    该参数表示设置的限制条件有2个:

    • A链的第20位残基和H链的第50位残基之间的距离要小于阈值;
    • A链的第78位残基和L链的第98位残基之间的距离要小于阈值。

    注意:残基编号为位置编号,即每条链按顺序从1开始进行编号,以下编号规则一致。

    MvN Restrains

    单个残基与残基组合之间的限制条件,限制单个残基与多个残基集合中至少一个残基之间的距离在上述定义的阈值参数内,单个残基与残基组合之间用逗号(,)分隔,残基组合内部用分号(;)分隔,可支持定义多个条件(每行定义一个),示例如下:

    A10,H60-70;H78;L90
    A78,H60-70;L56;L69
    A120,L30-L36;H68;H72
    2
    

    该参数表示设置的限制条件有3个,分别是:

    • A链第10位残基与残基组合(H链第60至70位、H链第78位及L链第90位残基)中的至少一个残基之间的距离小于阈值;
    • A链第78位残基与残基组合(H链第60至70位、L链第56位及L链第69位残基)中的至少一个残基之间的距离小于阈值;
    • A链第120位残基与残基组合(L链第30至36位、H链第68位及H链第72位残基)中的至少一个残基之间的距离小于阈值;
    • 最后一行的数值2,表示上述3个条件中,满足任意2个条件即可,如限制条件只有1个时,该数值可以省略。

    Rep Threthold

    限制残基间排斥的距离阈值,表示设定的排斥残基间的距离需大于该阈值。默认为8.0 Å,值范围为2.0 Å - 22.0 Å,建议采用默认值。

    Rep 1v1 Restrains

    单个残基间的排斥限制条件,限制单个残基之间的距离需大于上述定义的排斥阈值,残基之间用逗号(,)分隔,可支持定义多个条件(每行定义一个),示例如下:

    15,98
    60,205
    

    该参数表示设置的排斥限制条件有2个:

    • 编号为第15和第98的残基之间的距离要大于排斥阈值;
    • 编号为第60和第205的残基之间的距离要大于排斥阈值。

    结果说明

    输出1st_best.pdb结果文件,为预测得到的最优复合物结构文件。
    输出pdbs.tar.gz文件,为预测得到的前5个最优复合物结构文件压缩包。
    输出summary.txt文件,包含以下信息:

    列名 说明
    pdb 复合物结构文件名
    iptm 复合物结构的质量好坏评价指标,0-1之间,越接近1表示预测结构的质量越好
    # of satisfied restraints 限制条件的数量,以及预测的复合物结构能满足的条件数量,如:2/2表示有2个限制条件,预测得到的复合物结构都能满足;1/2表示有2个限制条件,但复合物结构只满足了其中1个

    备注:
    可能存在以下个别情况,属正常现象

    1. 1st_best.pdb的iptm打分并不是5个结构里最优的;
    2. 结构中有个别残基间的肽键发生断裂;
      有待结构预测模型的进一步优化。

    参考文献

    • Feng, S., Chen, Z., Zhang, C. et al. Integrated structure prediction of protein–protein docking with experimental restraints using ColabDock. Nat Mach Intell, 2024.
    • Nat. Mach. Intell. | 突破对接瓶颈:ColabDock革新蛋白质-蛋白质结构预测

    Restrained Complex Structure Prediction

    Introduction

    The module is implemented based on the ColabDock framework, which significantly improves the accuracy of protein-protein docking predictions by integrating a variety of experimental constraints. Its innovations include:

    • No need for large-scale retraining or fine-tuning: The ColabDock framework directly integrates experimental constraints through gradient backpropagation, avoiding large-scale retraining or fine-tuning of deep learning models and improving computational efficiency.
    • Integration of multi-source experimental data: ColabDock can handle experimental data of different forms and from different sources, including but not limited to chemical cross-linking mass spectrometry (XL-MS), NMR chemical shift perturbation (CSP), covalent labeling (CL) and simulated deep mutational scanning (DMS), which broadens the applicability and flexibility of the model.
    • Improved prediction accuracy: In evaluations on multiple data sets, ColabDock achieved higher prediction accuracy than existing methods when experimental constraints are taken into account. It outperforms HADDOCK and ClusPro in complex structure prediction with simulated residue-level and surface-level constraints, and it also performs well when guided by NMR chemical shift perturbation and covalent labeling data.

    The workflow of the ColabDock framework is divided into two main stages:

    1. Generation stage
    • The goal of the ColabDock generation stage is to generate a protein complex structure that is consistent with the provided experimental constraints and template.
    • This stage uses gradient backpropagation (Backprop) to optimize the input sequence profile in logit space, thereby guiding the structure prediction model (AF2) to produce a complex structure that meets the experimental constraints.
    • The input includes: protein sequence profile, template for each protein chain, and experimental constraints.
    • Optimization process: The model changes the docked structure by adjusting the sequence profile while keeping the predicted protein sequence consistent with the input sequence.
      image.png
    2. Prediction stage
    • The prediction stage uses the generated structures and templates for each chain to make final complex structure predictions.
    • This stage uses AlphaFold2 (AF2) or other deep learning models to evaluate and refine the complex structure and improve the accuracy of the predictions.
    • The output of the prediction stage is the final protein complex structure prediction, which takes into account experimental constraints and combines the predictive power of deep learning models.
      image.png

    ColabDock focuses on two types of constraints. The first type requires the distance between a pair of residues to be below a given threshold; these are residue-residue level constraints (called 1v1 constraints), for example constraints derived from cross-linking mass spectrometry (XL-MS). The second type is defined between two sets of residues on the protein surface that may be in contact, where the specific contacting pairs are unknown; these are interface-level constraints (called MvN constraints), and typical examples come from various NMR experiments and covalent labeling (CL).

    The validation of ColabDock's performance with simulated constraints is shown in the following figure:
    image.png
    As shown in Figure a, with only two 1v1 constraints provided, 81.08% of the protein complexes had a maximum DockQ value above 0.23, which is notable given the relatively limited structural information these constraints carry. When three to five constraints were provided, the success rate was close to 100%. As shown in Figure b, for protein complexes with two, three, and five pairwise constraints, the constraint satisfaction rates were 0.55, 0.77, and 0.80, respectively. These results show that ColabDock can efficiently use the provided constraints to obtain high-quality complex structures.

    To evaluate the performance of ColabDock under MvN constraints, MvN samples were first generated from the 1v1 samples above. These samples are more challenging because MvN constraints are ambiguous: multiple combinations of 1v1 constraints can satisfy the same set of MvN constraints. As shown in Figure c, for 100 of the 111 samples the maximum DockQ value of the predicted structures exceeds 0.23, and for 75 samples the top1 structure has a DockQ value above 0.23. As the number of constraints increases, the accuracy of ColabDock increases accordingly, with the success rate of the top1 structure rising from 62.16% with two constraints to 70.27% with three and five constraints. The constraint satisfaction rate in the predicted structures is similar to that in the experimental structures (Figure d). These results show that ColabDock can also make effective use of ambiguous constraints to improve structure prediction.

    To evaluate the necessity of the prediction stage in ColabDock, structures from the last ten optimization steps, by which point most optimization runs had converged, were collected in the above 1v1 and MvN constrained experiments. In cases where the difference in DockQ values between the generation stage and the prediction stage is large (here defined as greater than 0.1), the prediction stage performs better in 69.9% of the 1v1 constrained complexes (Figure e) and in 68.8% of the MvN constrained complexes (Figure f). These results suggest that the energy landscape of AF2 can help refine the conformations from the generation stage and improve the accuracy of predictions.

    The comparison between ColabDock and traditional restraint-driven docking methods is shown in the figure below:
    image.png
    The comparison uses an independent benchmark set of 37 protein complexes, with HADDOCK and ClusPro as baselines. For each complex in the benchmark set, two, three, and five 1v1 constraints were sampled to guide docking, giving 111 samples in total. ColabDock outperformed HADDOCK and ClusPro in most samples (Figure a). The average DockQ value of ColabDock was 0.477, while those of HADDOCK and ClusPro were 0.287 and 0.191, respectively. Regardless of the number of 1v1 constraints, ColabDock performed best among the three methods (Figure b). These results show that ColabDock has the potential to generate reliable structures under sparse constraints, which is consistent with the observations on the validation set.

    To further evaluate the performance of ColabDock under interface-level constraints, the 1v1 constraints described above were converted to MvN constraints, as was done for the validation dataset. Since ClusPro could not give predictions for 7 of the 111 samples, these were excluded and the remaining 104 samples were compared. Compared with the 1v1 case, the performance of ColabDock, HADDOCK, and ClusPro all declined under MvN constraints because of their ambiguity, but ColabDock still outperformed the other two methods (Figure c). Again, ColabDock achieved the best DockQ regardless of the number of MvN constraints (Figure d).

    Experimentally derived constraints often contain residues that are far apart, which the authors call “loose constraints.” In order to test the performance of the model in related tasks, loose constraints were intentionally added with distances ranging from 8Å to 20Å. For each complex in the benchmark set, the number of loose constraints ranged from 1 to 5, while the total number of constraints was fixed at 5, generating a total of 185 samples. Nine samples that ClusPro could not handle were excluded, and the three methods were compared on the remaining 176 samples. The results showed that ColabDock performed best, with an average DockQ value of 0.344 and an average α-carbon atom r.m.s.d. (Cα-r.m.s.d.) of 6.55Å (Figure e). These results indicate that ColabDock has a low dependence on the quality of constraints. When combined with high-quality constraints, ColabDock is able to predict more accurate structures than the other two methods.

    Antigen-antibody complex prediction
    Modeling antibody-antigen complexes has been a long-standing challenge due to the flexibility of complementarity determining regions (CDRs) and the lack of co-evolutionary signals. Deep mutational scanning (DMS) is a commonly used technique to identify residues that may be involved in antibody-antigen binding. Based on an antibody-antigen benchmark set of 45 complexes, DMS-derived constraints were simulated by sampling residues on the interface. The prediction results and comparison with traditional methods are shown in the figure below:
    image.png

    As shown in Figure a, ColabDock outperforms HADDOCK and ClusPro, with an average DockQ value of 0.223 and an average r.m.s.d. of 9.57 Å. For the number of samples with a DockQ value greater than 0.49, ColabDock also exceeds HADDOCK and ClusPro (Figure b).

    Take 1AHW as an example: 1AHW is a human tissue factor-antibody (5G9) complex involved in the blood coagulation protease cascade. As shown in Figure c, five interface residues were randomly sampled from the antibody (His91 and Gly92 of the light chain; Asp31, Tyr32 and Asn100 of the heavy chain), and seven interface residues were sampled from the antigen (Lys165, Thr167, Val192, Thr197, Val198, Asn199 and Asp204). The residues sampled from the antibody are mainly located in the L1 CDR, H1 CDR and H3 CDR regions. Figure d shows the structure predicted by AF-Multimer and the structures from the three docking methods. As shown in Figure e, ColabDock captures most of the native contacts at the interface, with a DockQ value of 0.770 and an r.m.s.d. of 1.17 Å, while the structures predicted by the other methods differ substantially from the native conformation. This case study suggests that ColabDock outperforms the other two methods in both conformational exploration and conformational ranking.
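The DockQ cutoffs quoted in this section (0.23 and 0.49) match the quality bands commonly used with DockQ. The sketch below maps a score to a band; the 0.80 boundary for "high" quality is not stated in this document and is taken from the usual DockQ convention, so treat it as an assumption. The function names are illustrative.

```python
from typing import Iterable


def dockq_class(score: float) -> str:
    """Map a DockQ score in [0, 1] to the commonly used quality band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("DockQ scores lie between 0 and 1")
    if score < 0.23:
        return "incorrect"
    if score < 0.49:
        return "acceptable"
    if score < 0.80:  # assumed upper boundary, from the usual DockQ convention
        return "medium"
    return "high"


def success_rate(scores: Iterable[float], cutoff: float = 0.23) -> float:
    """Fraction of predictions whose DockQ exceeds `cutoff`."""
    values = list(scores)
    return sum(s > cutoff for s in values) / len(values)


# Values quoted in the text: the 1AHW example and two benchmark averages.
print(dockq_class(0.770), dockq_class(0.477), dockq_class(0.223))
```

Note that an average DockQ such as 0.223 describes a whole benchmark; individual complexes within it can still fall into the acceptable or better bands.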

    Parameters

    Complex Structure

    Original protein complex structure file, PDB format
    Note: This structure consists of multiple chains, and the relative positions between chains can be placed arbitrarily. Due to the limitation of GPU memory, the current maximum supported final complex size does not exceed 800 residues.

    Chains

    Multiple chains are extracted from the original complex to form the final complex structure. The chain names are separated by commas, such as: A,H,L

    Fix Chains

    Specify each pair of chains with fixed relative positions among the extracted multiple chains. Multiple pairs can be defined. Chain names are separated by comma, with one pair per line. The example is as follows:

    H,L
    A,H
    

    It means that the relative position between chains H and L is fixed, and the relative position between chains A and H is fixed.

    Threthold

    The distance threshold of the experimental restraint, which means that the distance between the residues to set the restraint must be less than this threshold. The default value is 8.0 Å, and the value range is 2.0 Å - 22.0 Å. The default value is recommended.

    1v1 Restrains

    Restraints between individual residues: the distance between the two residues of each pair must be below the threshold defined above. The two residues are separated by a comma. Multiple restraints can be defined (one per line). The following is an example:

    A20,H50
    A78,L98
    

    This parameter indicates that there are two restrictions set:

    • The distance between the 20th residue of the A chain and the 50th residue of the H chain must be less than the threshold;
    • The distance between the 78th residue of the A chain and the 98th residue of the L chain must be less than the threshold.
    Note: Residue numbers are positional, i.e., each chain is numbered sequentially starting from 1. The same numbering rule applies to the parameters below.
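The 1v1 syntax above can be read mechanically, as in the sketch below. It also shows how a "satisfied/total" count like the one reported in summary.txt could be derived from residue distances. The parser, the distance table and the strict "less than" comparison are assumptions for illustration, not the platform's actual code.

```python
import re
from typing import Dict, List, Tuple

Residue = Tuple[str, int]  # (chain ID, 1-based position within the chain)
RESIDUE_RE = re.compile(r"^([A-Za-z])(\d+)$")


def parse_residue(token: str) -> Residue:
    """Parse a token such as 'A20' into ('A', 20)."""
    match = RESIDUE_RE.match(token.strip())
    if match is None:
        raise ValueError(f"not a residue token: {token!r}")
    return match.group(1), int(match.group(2))


def parse_1v1(text: str) -> List[Tuple[Residue, Residue]]:
    """Parse one 'X,Y' residue pair per non-empty line."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            left, right = line.split(",")
            pairs.append((parse_residue(left), parse_residue(right)))
    return pairs


def count_satisfied(
    pairs: List[Tuple[Residue, Residue]],
    distances: Dict[Tuple[Residue, Residue], float],
    threshold: float = 8.0,
) -> str:
    """Return 'satisfied/total' for a table of measured distances (in Å)."""
    satisfied = sum(distances[pair] < threshold for pair in pairs)
    return f"{satisfied}/{len(pairs)}"


pairs = parse_1v1("A20,H50\nA78,L98\n")
# Hypothetical distances measured in a predicted structure.
measured = {pairs[0]: 6.4, pairs[1]: 11.2}
print(pairs)
print(count_satisfied(pairs, measured))
```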

    MvN Restrains

    The restriction conditions between a single residue and a residue combination limit the distance between a single residue and at least one residue in a set of multiple residues to be within the threshold parameters defined above. Single residues and residue combinations are separated by commas, and residue combinations are separated by semicolons. Multiple conditions can be defined (one per line). The following is an example:

    A10,H60-70;H78;L90
    A78,H60-70;L56;L69
    A120,L30-L36;H68;H72
    2
    

    This parameter indicates that there are three restrictions set, namely:

    • The distance between the 10th residue of the A chain and at least one residue in the residue combination (residues 60 to 70 of the H chain, 78 of the H chain, and 90 of the L chain) is less than the threshold;
    • The distance between the 78th residue of the A chain and at least one residue in the residue combination (residues 60 to 70 of the H chain, 56 of the L chain, and 69 of the L chain) is less than the threshold;
    • The distance between the 120th residue of the A chain and at least one residue in the residue combination (residues 30 to 36 of the L chain, 68 of the H chain, and 72 of the H chain) is less than the threshold;
    • The value 2 on the last line means that satisfying any two of the three conditions above is sufficient. If there is only one restraint, this value can be omitted.
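The MvN format combines single residues, ranges and a trailing count, which is easier to follow in code. The sketch below parses the example above; it accepts both range spellings shown there (H60-70 and L30-L36). Defaulting the required count to "all restraints" when the last line is omitted is an assumption, since the text only states that the value may be omitted when there is a single restraint.

```python
import re
from typing import List, Tuple

Residue = Tuple[str, int]
RESIDUE_RE = re.compile(r"^([A-Za-z])(\d+)$")


def parse_residue(token: str) -> Residue:
    """Parse a token such as 'A10' into ('A', 10)."""
    match = RESIDUE_RE.match(token.strip())
    if match is None:
        raise ValueError(f"not a residue token: {token!r}")
    return match.group(1), int(match.group(2))


def expand_group(token: str) -> List[Residue]:
    """Expand 'H78' to one residue, and 'H60-70' or 'L30-L36' to an inclusive range."""
    token = token.strip()
    if "-" not in token:
        return [parse_residue(token)]
    start, end = (part.strip() for part in token.split("-"))
    chain, first = parse_residue(start)
    last = int(end[1:]) if end[0].isalpha() else int(end)
    return [(chain, position) for position in range(first, last + 1)]


def parse_mvn(text: str) -> Tuple[List[Tuple[Residue, List[Residue]]], int]:
    """Return (restraints, number of restraints that must be satisfied)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    required = None
    if lines and lines[-1].isdigit():
        required = int(lines.pop())  # trailing count, e.g. "2"
    restraints = []
    for line in lines:
        single, groups = line.split(",", 1)
        members = [res for group in groups.split(";") for res in expand_group(group)]
        restraints.append((parse_residue(single), members))
    if required is None:
        required = len(restraints)  # assumption: all restraints required by default
    return restraints, required


example = "A10,H60-70;H78;L90\nA78,H60-70;L56;L69\nA120,L30-L36;H68;H72\n2\n"
restraints, required = parse_mvn(example)
print(required, [(single, len(members)) for single, members in restraints])
```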

    Rep Threthold

    The distance threshold for limiting the repulsion between residues, indicating that the distance between the set repulsive residues must be greater than this threshold. The default value is 8.0 Å, and the value range is 2.0 Å - 22.0 Å. It is recommended to use the default value.

    Rep 1v1 Restrains

    Repulsion restraints between individual residues: the distance between the two residues of each pair must be greater than the repulsion threshold defined above. The two residues are separated by a comma. Multiple restraints can be defined (one per line). The following is an example:

    15,98
    60,205
    

    This parameter indicates that two repulsion restraints are set:

    • The distance between residues 15 and 98 must be greater than the repulsion threshold;
    • The distance between residues 60 and 205 must be greater than the repulsion threshold.

    Results

    ‘1st_best.pdb’ file, which is the predicted optimal complex structure file.
    ‘pdbs.tar.gz’ file, which is the compressed package of the top 5 predicted optimal complex structure files.
    ‘summary.txt’ file, which contains the following information:

    Field Description
    pdb File name of complex structure
    iptm An evaluation index of the quality of the complex structure, between 0 and 1, the closer to 1, the better the quality of the predicted structure
    # of satisfied restraints The total number of constraints and the number of constraints that the predicted complex structure can satisfy. For example, 2/2 means that there are 2 constraints and the predicted complex structure can satisfy them all; 1/2 means that there are 2 constraints, but the complex structure only satisfies one of them.

    Note:
    The following may occasionally occur and is considered normal:

    1. The iptm score of 1st_best.pdb is not the highest among the 5 structures;
    2. A few peptide bonds between residues in the structure are broken.
      These issues await further optimization of the structure prediction model.

    References

    • Feng, S., Chen, Z., Zhang, C. et al. Integrated structure prediction of protein–protein docking with experimental restraints using ColabDock. Nat Mach Intell, 2024.
    • Nat. Mach. Intell. | 突破对接瓶颈:ColabDock革新蛋白质-蛋白质结构预测
  • Name: Germline Blast
    Description: 基于IgBlastp通过序列比对从IMGT reference sequences数据库中搜索与目标抗体序列最接近的同源模板,输出对应的模板序列以及序列一致性等信息。数据库中默认检索的序列类型为:IMGT V genes(F+ORF+in-frame P)。 IgBlastp based searching for the homologous template closest to the target antibody sequence in the IMGT reference sequences database through sequence alignment and output the corresponding template sequence and sequence consistency. The default sequence types searched in the database are: IMGT V genes (F+ORF+in-frame P).
    Tags: undefined
    Author: Jian Ye; Lefranc
    Release: 2024-08-29 15:34:27
    Reference: Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013 Jul;41(Web Server issue):W34-40. Lefranc, M.-P. and Lefranc, G. The Immunoglobulin FactsBook Academic Press, London, UK (458 pages), (2001)

    Germline Blast

    简介

    Germline Blast模块基于IgBlastp实现,通过氨基酸序列比对从IMGT reference sequences数据库中搜索与目标抗体序列最接近的同源模板,输出对应的模板序列以及序列一致性等信息。数据库中默认检索的序列类型为:IMGT V genes(F+ORF+in-frame P)。

    参数说明

    Antibody Sequence File

    抗体的序列文件,FASTA格式。

    Numbering Scheme

    抗体编号类型:kabat和imgt

    TopHits

    输出同源性最高的N条序列,默认值为10。

    Species

    序列所属物种:Human,Mouse,Rat,Rabbit,Rhesus_Monkey,Alpaca,默认值为Human。

    结果说明

    输出参数 输出文件名称 说明
    Hits Sequence hits.fasta 包含同源性最高的n条序列的序列文件
    Result result.csv 包含找到的Germline序列以及序列的一致性信息
    Alignment Summary align_info_top_germline.csv 包含查询序列与同源性最高的Germline V基因序列的比对信息

    参考文献

    • Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013 Jul;41(Web Server issue):W34-40. DOI:10.1093/nar/gkt382
    • Lefranc, M.-P. and Lefranc, G. The Immunoglobulin FactsBook Academic Press, London, UK (458 pages), (2001)

    Germline Blast

    Introduction

    The Germline Blast module is based on IgBlastp. It aligns the amino acid sequence of the target antibody against the IMGT reference sequences database to find the closest homologous germline templates, and outputs the matching template sequences together with their sequence identity information. By default, the sequence types searched in the database are IMGT V genes (F+ORF+in-frame P).

    Parameters

    Antibody Sequence File

    The antibody sequence file in FASTA format.

    Numbering Scheme

    The antibody numbering scheme: kabat or imgt.

    TopHits

    The number of top homologous sequences to output, with a default value of 10.

    Species

    The species of the sequence: Human, Mouse, Rat, Rabbit, Rhesus_Monkey, or Alpaca. The default is Human.

    Results

    Output Parameter Output File Name Description
    Hits Sequence hits.fasta A sequence file containing the top N homologous sequences
    Result result.csv Contains the identified germline sequences and sequence identity information
    Alignment Summary align_info_top_germline.csv Contains alignment information between the query sequence and the top homologous germline V gene sequences

    References

    • Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013 Jul;41(Web Server issue):W34-40. DOI:10.1093/nar/gkt382
    • Lefranc, M.-P. and Lefranc, G. The Immunoglobulin FactsBook Academic Press, London, UK (458 pages), (2001)
  • Name: Mutation Energy of Stability (ThermoMPNN)
    Description: 基于ThermoMPNN模型预测单点突变对稳定性变化 ThermoMPNN based model predicts the stability changes corresponding to a single point mutation
    Tags: undefined
    Author: Henry Dieckhaus
    Release: 2024-08-07 15:14:52
    Reference: Dieckhaus H, et al. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proc Natl Acad Sci U S A. 2024 Feb 6;121(6):e2314853121.

    Mutation Energy of Stability (ThermoMPNN)

    简介

    Mutation Energy of Stability (ThermoMPNN)模块基于ThermoMPNN模型实现,此深度神经网络模型可根据蛋白初始结构,预测单点突变对应的稳定性变化。模型使用从ProteinMPNN(一种深度神经网络模型,可根据蛋白质的三维结构预测其氨基酸序列)中提取的结构特征,在已建立的基准数据集上实现了优秀的预测性能。通常认为,ddG < -0.5 kcal/mol 可能是一个有利于稳定性的突变。ThermoMPNN 在 Fireprot(HF)数据集上的正预测值为 56%(34/61 个预测为稳定的突变),在 Megascale 数据集上为 46%(1,312/2,852)。

    模型架构与数据集分析如下图所示:
    image.png
    模型预测效果与其他方法效果比较见下图:
    image.png

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式,支持单体或复合物结构

    Target Chain

    用于稳定性突变分析的链名称,仅支持单链,如:A

    Numbering Type

    抗体编号规则,支持Kabat, Chothia和IMGT,默认为Kabat。

    TopN

    指定输出能量最优的前N个突变对应的序列,默认为100。

    Output

    输出文件名称,默认pred_res.csv。

    Output_Chain_Seq

    输出TopN对应的突变链的序列,默认为mutant_seqs.fasta。

    Output_Cpx_Seq

    输出TopN对应的复合物序列,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式),默认为mutant_seqs_complex.fasta。

    结果说明

    输出结果文件(默认为pred_res.csv),包含以下信息:

    列名 说明
    Chain 链名称,如:'A’表示A链
    Mutation 单点突变信息,如:'G1A’表示序列编号为1的残基甘氨酸G,突变为丙氨酸A,序列编号从1开始按顺序编号(非PDB文件中的残基序号)
    ddG_pred 突变对应的能量变化,负值表示体系能量较低,体系变得更稳定。负得越多表示稳定性提升越多。ddG < -0.5 kcal/mol 可能是一个有利于稳定性的突变

    输出TopN对应的突变链的序列mutant_seqs.fasta。
    输出TopN对应的复合物序列,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式)mutant_seqs_complex.fasta。

    参考文献

    • Dieckhaus H, et al. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proc Natl Acad Sci U S A. 2024 Feb 6;121(6):e2314853121. DOI:10.1073/pnas.2314853121

    Mutation Energy of Stability (ThermoMPNN)

    Introduction

    The Mutation Energy of Stability (ThermoMPNN) module is based on the ThermoMPNN model, a deep neural network that predicts the stability change caused by a single-point mutation from the initial structure of the protein. The model uses structural features extracted from ProteinMPNN (a deep neural network that predicts amino acid sequences from the three-dimensional structure of a protein) and performs well on established benchmark datasets. If ΔΔG° < -0.5 kcal/mol is taken to indicate a stabilizing mutation, ThermoMPNN achieves a positive predictive value (PPV) of 56% (34/61 predicted stabilizing mutations) on the Fireprot (HF) dataset and 46% (1,312/2,852) on the Megascale dataset.
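
    The PPV figures quoted above are plain ratios: the number of predicted stabilizing mutations that were confirmed, divided by the number predicted. A minimal sketch of that arithmetic follows; the function name `ppv` is only illustrative and is not part of the module.

```python
def ppv(true_positives: int, predicted_positives: int) -> float:
    """Positive predictive value: fraction of predicted positives that are correct."""
    if predicted_positives <= 0:
        raise ValueError("predicted_positives must be positive")
    return true_positives / predicted_positives


# Counts quoted in the introduction above.
fireprot_ppv = ppv(34, 61)
megascale_ppv = ppv(1312, 2852)

print(f"Fireprot (HF): {fireprot_ppv:.1%}")
print(f"Megascale:     {megascale_ppv:.1%}")
```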

    The model architecture and dataset analysis are shown in the figure below:
    image.png
    The comparison of the model’s predictive performance with other methods is shown in the figure below:
    image.png

    Parameters

    Structure PDB File

    The structure file of the protein in PDB format, supporting monomer or complex structures.

    Target Chain

    The name of the chain for stability mutation analysis, supporting only single chains, e.g., A.

    Numbering Type

    Antibody numbering schemes, supporting Kabat, Chothia, and IMGT. The default scheme is Kabat.

    TopN

    The number of top-ranked mutations, ordered by predicted energy, whose sequences are output. The default value is 100.

    Output

    Output file name. The default is pred_res.csv.

    Output_Chain_Seq

    Output file for the sequences of the mutated chain for the TopN mutations. The default is mutant_seqs.fasta.

    Output_Cpx_Seq

    Output file for the complex sequences for the TopN mutations. Chains within a complex are separated by colons (:), the format used by the batch mode of Boltz2 structure prediction. The default is mutant_seqs_complex.fasta.

    Results

    The output file (pred_res.csv by default) contains the following information:

    Column Name Description
    Chain The name of the chain, e.g., ‘A’ for chain A
    Mutation Single-point mutation information, e.g., ‘G1A’ means the residue glycine G at sequence number 1 is mutated to alanine A. The sequence number starts from 1 in order (not the residue number in the PDB file)
    ddG_pred The energy change corresponding to the mutation. A negative value indicates lower system energy and increased stability. The more negative, the greater the stability improvement. ddG < -0.5 kcal/mol may indicate a stabilizing mutation

    The sequences of the mutated chain for the TopN mutations are written to mutant_seqs.fasta.
    The complex sequences for the TopN mutations are written to mutant_seqs_complex.fasta, with the chains within each complex separated by colons (:) for the batch mode of Boltz2 structure prediction.
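
    The Mutation and ddG_pred columns can be combined to shortlist candidates. The sketch below is not the module's own code: it assumes the rows have already been read into dictionaries keyed by the column names above, and the example sequence and ddG values are made up for illustration. It applies the 1-based `G1A` notation to a sequence and keeps mutations below the -0.5 kcal/mol guideline.

```python
import re

# Mutations are written as <wild type><1-based position><mutant>, e.g. "G1A".
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutation(sequence, mutation):
    """Return the sequence with one point mutation applied (positions start at 1)."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != wild_type:
        raise ValueError(f"expected {wild_type} at position {position}")
    return sequence[: position - 1] + mutant + sequence[position:]


def stabilizing(rows, cutoff=-0.5):
    """Keep rows with ddG_pred below the cutoff, most negative first."""
    hits = [row for row in rows if row["ddG_pred"] < cutoff]
    return sorted(hits, key=lambda row: row["ddG_pred"])


# Hypothetical rows and sequence, for illustration only.
rows = [
    {"Chain": "A", "Mutation": "G1A", "ddG_pred": -0.8},
    {"Chain": "A", "Mutation": "S2T", "ddG_pred": 0.3},
    {"Chain": "A", "Mutation": "H3Y", "ddG_pred": -1.2},
]
best = stabilizing(rows)
mutants = [apply_mutation("GSHM", row["Mutation"]) for row in best]
print(mutants)
```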

    References

    • Dieckhaus H, et al. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proc Natl Acad Sci U S A. 2024 Feb 6;121(6):e2314853121. DOI:10.1073/pnas.2314853121
  • Name: Homology Tree
    Description: 生成同源性进化树 Generate homologous evolutionary trees
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-08-05 15:30:13
    Reference:

    Homology Tree

    简介

    Homology Tree模块用于生成同源性进化树。

    参数说明

    Input File

    蛋白序列文件,FASTA格式。

    结果说明

    输出结果包括:

    输出文件名称 说明
    alignment.fasta 按树结构顺序输出的叠合后的序列文件的FASTA文件
    tree.png 多重序列树结构图片

    Homology Tree

    Introduction

    The Homology Tree module is used to generate homologous evolutionary trees.

    Parameters

    Input File

    Protein sequence file in FASTA format.

    Results

    The output includes:

    Output File Name Description
    alignment.fasta FASTA file of the aligned sequences, written in the order in which they appear in the tree
    tree.png Image of the tree built from the multiple sequences
  • Name: Structure Evolution
    Description: 基于ESMIF模型实现,ESMIF逆折叠模型旨在根据蛋白质主链原子坐标预测蛋白序列,可用于亲和力成熟和稳定性优化。 The ESMIF inverse folding model aims to predict protein sequences based on the atomic coordinates of the protein backbone and can be used for affinity maturation and stability optimization.
    Tags: undefined
    Author: VARUN R. SHANKER
    Release: 2024-07-29 16:11:04
    Reference: Varun R. Shanker et al., Unsupervised evolution of protein and antibody complexes with a structure-informed language model. Science,385,46-53(2024).

    Structure Evolution

    简介

    Structure Evolution模块基于ESMIF模型实现,ESMIF逆折叠模型旨在根据蛋白质主链原子坐标预测蛋白序列。该模型使用AlphaFold2预测的1200万个蛋白质结构进行训练,包含不变几何输入处理层,随后是一个序列到序列的Transformer,对于在结构上保持不变的主干序列实现51%的本地序列恢复率,对于埋藏残基实现72%的恢复率。该模型还经过跨度屏蔽训练,能够容忍缺失的主链坐标,因此可以预测部分被屏蔽结构的序列。该模块既可以用于亲和力成熟,也可以用于稳定性优化。
    image.png

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式,支持单体或复合物结构

    Target Chain

    用于进化分析的链名称,仅支持单链,默认为A链

    Positions

    指定目标链中的多个残基,进行多点突变分析。使用残基位置编号(从1开始),多个残基用逗号分隔,指定残基范围用横杠符号。如:“3,10,24-30”表示目标链上的第3、第10与第24至30号残基,参与多点突变分析。
    备注:如不设置该参数,表示采用目标链的全长序列进行突变分析。

    Min Mutations

    指定突变点最小数目,默认值为1,表示从单点突变开始进行突变分析。如设置为2,表示从两点组合突变开始进行突变分析。

    Max Mutations

    指定突变点最大数目,默认值为3,表示至多进行三点组合突变。如设置为2时,表示最多进行两个点的多点组合突变。

    Max Substitutions

    指定参与多点突变分析的每个残基,其最大的替换数目,默认为5,表示每个残基最多突变为5种不同的其他残基。
    备注:理论上,每种残基可以突变为其他19种天然残基,但因多点突变可能引起的组合爆炸,这里我们限制了最大替换数目。每个残基具体替换的其他残基类别,会根据ESMIF模型给出的该位置残基的概率分布,优先选择概率高的残基类别。

    Predicted Mutation Probability

    输出CSV文件名称,包含了突变以及对应的突变的可能性。

    Numbering Type

    抗体编号规则,支持Kabat, Chothia和IMGT,默认为Kabat。

    TopN

    指定输出评分最优的前N个突变对应的序列,默认为100。

    Output_Chain_Seq

    输出TopN对应的突变链的序列,默认为mutant_seqs.fasta。

    Output_Cpx_Seq

    输出TopN对应的复合物序列,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式),默认为mutant_seqs_complex.fasta。

    结果说明

    输出结果文件,包含以下信息:

    列名 说明
    Mutation 单点突变信息,如:'WT’表示野生型原序列,'G1A’表示序列编号为1的残基甘氨酸G,突变为丙氨酸A,序列编号从1开始按顺序编号(非PDB文件中的残基序号)
    Log_likelihood 输入结构的全部序列对应的模型预测概率对数值,越大表示该突变序列越好
    Log_likelihood_target_chain 输入结构的目标链序列(对应参数Target Chain)对应的模型预测概率对数值,越大表示该突变序列越好
    Interface 用于标识残基是否位于分子接触界面。留空表示不进行界面计算;取值为 0 表示该残基不属于接触界面;取值为 1 表示该残基属于接触界面
    Domain(Chothia) 当输入为抗体序列或结构时,根据 Chothia 定义输出对应的FR(Framework Region)和CDR(Complementarity-Determining Region)区域注释
    Likelihood(ESMIF) Log_likelihood列进行去log,同时减去WT数值后的值,其数值大于0表示该突变优于WT,越大越好。
    Likelihood_target_chain(ESMIF) Log_likelihood_target_chain列进行去log,同时减去WT数值后的值,其数值大于0表示该突变优于WT,越大越好。

    注释:当输入结构为单链时,Log_likelihood与Log_likelihood_target_chain数值一致。当输入结构为复合物时,Log_likelihood对应的是复合物的全部序列的概率值,Log_likelihood_target_chain对应的是复合物中目标链序列(参数Target Chain)对应的概率值。

    输出TopN对应的突变链的序列mutant_seqs.fasta。
    输出TopN对应的复合物序列mutant_seqs_complex.fasta,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式)。

    参考文献

    • Varun R. Shanker et al., Unsupervised evolution of protein and antibody complexes with a structure-informed language model. Science, 385, 46-53 (2024). DOI: 10.1126/science.adk8946

    Structure Evolution

    Introduction

    The Structure Evolution module is based on the ESMIF model and is used for structure-based analysis of advantageous single-point and multi-point mutations. The ESMIF inverse folding model predicts protein sequences from the coordinates of protein backbone atoms. It is trained on 12 million protein structures predicted by AlphaFold2 and consists of invariant geometric input processing layers followed by a sequence-to-sequence Transformer. It achieves 51% native sequence recovery on structurally held-out backbones and 72% recovery for buried residues. The model is also trained with span masking, so it tolerates missing backbone coordinates and can predict sequences for partially masked structures. This module can be used for both affinity maturation and stability optimization.
    image.png

    Parameters

    Structure PDB File

    The structural file of the protein in PDB format, supporting both monomer and complex structures.

    Target Chain

    The name of the chain used for evolutionary analysis. Only single chains are supported. After uploading the structural file, you can select a chain name from the list of chains.

    Positions

    The residues in the target chain to include in the multi-point mutation analysis. Use residue position numbers (starting at 1), separate multiple residues with commas, and use a hyphen to give a range. For example, “3,10,24-30” selects residues 3, 10, and 24 through 30 of the target chain. If this parameter is not set, the full-length sequence of the target chain is used for the mutation analysis.
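
    The position syntax can be expanded into an explicit list of residue numbers. The following is a minimal sketch of one way to read it, not the module's own parser; the function name `parse_positions` is illustrative.

```python
from typing import List


def parse_positions(spec: str) -> List[int]:
    """Expand a spec such as "3,10,24-30" into sorted 1-based positions."""
    positions = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range start exceeds end: {token!r}")
            positions.update(range(start, end + 1))  # ranges are inclusive
        else:
            positions.add(int(token))
    if any(p < 1 for p in positions):
        raise ValueError("positions are numbered from 1")
    return sorted(positions)


print(parse_positions("3,10,24-30"))
```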

    Min Mutations

    The minimum number of mutation sites. The default is 1, meaning the analysis starts from single-point mutations. If set to 2, the analysis starts from two-point combinations.

    Max Mutations

    The maximum number of mutation sites. The default is 3, meaning at most three-point combinations are generated. If set to 2, at most two-point combinations are generated.

    Max Substitutions

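    The maximum number of substitutions for each residue taking part in the multi-point mutation analysis. The default is 5, meaning each residue is mutated to at most 5 other residue types. In principle each residue could be mutated to any of the other 19 natural residues, but this limit is applied to avoid a combinatorial explosion in multi-point mutations. The substituted residue types are chosen from the probability distribution that the ESMIF model gives for that position, with higher-probability residue types taken first.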
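
    To see why the number of substitutions per residue is capped, it helps to count how many variants the three settings imply. The sketch below is a back-of-the-envelope estimate, assuming that every selected position contributes exactly Max Substitutions candidates and that all combinations between Min Mutations and Max Mutations are enumerated; the module's actual enumeration may differ.

```python
from math import comb


def count_variants(n_positions, min_mut=1, max_mut=3, max_subs=5):
    """Count k-point variants for k = min_mut..max_mut over n_positions sites."""
    upper = min(max_mut, n_positions)
    # Choose k sites, then one of max_subs substitutions at each chosen site.
    return sum(comb(n_positions, k) * max_subs ** k for k in range(min_mut, upper + 1))


# Nine positions, as selected by "3,10,24-30".
capped = count_variants(9)                 # default cap of 5 substitutions
uncapped = count_variants(9, max_subs=19)  # all 19 other natural residues
print(capped, uncapped)
```

    With nine positions, the default cap gives 11,445 variants, while allowing all 19 alternatives would give 589,323.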

    Predicted Mutation Probability

    Output CSV file containing the mutations and corresponding probabilities.

    Numbering Type

    Antibody numbering schemes, supporting Kabat, Chothia, and IMGT.
    The default scheme is Kabat.

    TopN

    The number of top-scoring mutations whose sequences are output. The default value is 100.

    Output_Chain_Seq

    Output the sequences of the mutation chains corresponding to TopN, with a default file name of mutant_seqs.fasta.

    Output_Cpx_Seq

    Output the sequences of the complexes corresponding to TopN. The chains within the complex are separated by colons(:) (for batch mode of Boltz2 structure prediction), with a default file name of mutant_seqs_complex.fasta.

    Results

    The output file contains the following information:

    Column Name Description
    Mutation Single-point mutation information, e.g., ‘WT’ represents the wild-type original sequence, ‘G1A’ indicates that the residue glycine (G) at sequence position 1 is mutated to alanine (A). Sequence numbering starts from 1 in order (not the residue number in the PDB file).
    Log_likelihood The log-likelihood that the model assigns to the full sequence of the input structure. The higher the value, the better the mutated sequence. A value greater than that of WT indicates an advantageous mutation.
    Log_likelihood_target_chain The log-likelihood that the model assigns to the target chain sequence of the input structure (parameter Target Chain). The higher the value, the better the mutated sequence. A value greater than that of WT indicates an advantageous mutation.
    Interface Indicates whether a residue is part of a molecular interaction interface. Leaving the field empty disables interface calculation; a value of 0 denotes a non-interface residue, whereas 1 denotes an interface residue.
    Domain(Chothia) When the input is an antibody sequence or structure, this field outputs annotations of FR (Framework Regions) and CDR (Complementarity-Determining Regions) according to the Chothia numbering scheme
    Likelihood(ESMIF) Exponentiated log-likelihood value minus the WT value. Values greater than 0 indicate the mutation is superior to WT; larger values are better.
    Likelihood_target_chain(ESMIF) Exponentiated log_likelihood_target_chain value minus the WT value. Values greater than 0 indicate the mutation is superior to WT; larger values are better.

    The sequences of the mutated chain for the TopN mutations are written to mutant_seqs.fasta.

    The complex sequences for the TopN mutations are written to mutant_seqs_complex.fasta. The chains within each complex are separated by colons (:) for the batch mode of Boltz2 structure prediction.

    Note: When the input structure is a single chain, the value of Log_likelihood is consistent with that of Log_likelihood_target_chain. When the input structure is a complex, Log_likelihood corresponds to the probability value of the entire sequence of the complex, and Log_likelihood_target_chain corresponds to the probability value of the target chain sequence (parameter Target Chain) in the complex.
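
    The relationship between the log-likelihood columns and the two Likelihood(ESMIF) columns can be sketched as follows. This is one reading of the column description, assuming the WT value is exponentiated in the same way before it is subtracted; the exact formula used by the module is not stated here, and the numbers are made up for illustration.

```python
import math


def likelihood_vs_wt(log_likelihood, wt_log_likelihood):
    """exp(log-likelihood) of the mutant minus exp(log-likelihood) of WT (assumed form)."""
    return math.exp(log_likelihood) - math.exp(wt_log_likelihood)


# Hypothetical values: the mutant has a higher log-likelihood than WT.
delta = likelihood_vs_wt(-1.0, -1.2)
print(delta > 0)
```

    Under this reading the difference is positive exactly when the mutant's log-likelihood exceeds that of WT, which matches the "greater than 0" rule in the table.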

    Reference

    • Varun R. Shanker et al., Unsupervised evolution of protein and antibody complexes with a structure-informed language model. Science, 385, 46-53 (2024). DOI: 10.1126/science.adk8946
  • Name: Structure Comparison (US-align)
    Description: 基于USalign的结构叠合工具 Structural alignment tool based on USalign
    Tags: undefined
    Author: Yang Zhang
    Release: 2024-06-17 00:00:00
    Reference: Chengxin Zhang, Morgan Shine et al.US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes.(2022)

    Structure Comparison (US-align)

    简介

    进行蛋白或核酸的结构比对,支持单体或异源寡聚体。使用US-align工具实现。输出TM-score,RMSD等衡量结构相似性的指标。可比对序列不一致的蛋白或核酸结构。

    参数说明

    PDB1

    用于结构比对的第一个结构,支持批量结构,批量格式支持:.zip,.tar,.tar.gz,.tgz,.tar.bz2,.tbz2,.tar.xz,.txz ,当前最大支持1000个结构。

    PDB2

    用于结构比对的第二个结构,定义同上。

    注意:结构比对会将PDB1中的所有结构与PDB2中的所有结构进行两两比对。

    Chain Mapping

    指定结构中进行叠合的链,格式为:文件名:链名1,链名2,每行定义一个结构的链信息。示例如下:

    结构名称1:A,B
    结构名称2:C,D
    

    表示结构1中的A链与结构2中的C链进行叠合比对,B链与D链进行叠合比对。
    为了方便统一定义所有结构的叠合链,支持只输入逗号分隔的链名列表或链顺序列表,如:A,B或者1,2,前者表示所有结构中都用A,B链进行叠合,后者表示所有结构中都使用第一和第二条链进行叠合。
    若结构1与结构2共有链C,输入共有链名(如:C)或其位置索引(如:3)。若抗原为第三条链,填写C或3均可将其作为基准进行叠合。
    注意:结构比对会将PDB1中的所有结构与PDB2中的所有结构进行两两比对。

    Output

    比对结果文件,CSV格式,默认为align_results.csv。
    叠合的结构文件,默认为aligned_pdbs.tar.gz

    结果说明

    输出结构比对结果文件align_results.csv,包含信息如下:

    列名 说明
    PDB1 第一个结构的名称
    PDB2 第二个结构的名称
    TM-score (Norm by Length of PDB1) TM-score是用于评估蛋白质结构相似性的指标。范围在0到1之间:>0.5:通常认为两个蛋白质具有相同的折叠(同一家族);<0.3:表示结构随机无关(即使长度相同)。这里Norm by Length of PDB1表示将PDB1结构作为参考结构进行归一化的打分。
    TM-score (Norm by Length of PDB2) 表示将PDB2结构作为参考结构进行归一化的TM-score
    TM-score (Average) 以上两种归一化 TM-score 的平均值,用于给出两种结构整体相似性的综合评估。
    RMSD 两个结构的骨架结构RMSD值
    Aligned_length 两个结构比对过程中会进行叠合,叠合后的重叠长度(残基数量)。
    Sequence_identity 叠合部分的序列一致性
    Aligned_structure 叠合后的结构名称

    参考文献

    • Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang. US-align: Universal Structure Alignment of Proteins, Nucleic Acids and Macromolecular Complexes. Nature Methods, 19: 1109-1115 (2022). DOI:10.1038/s41592-022-01585-1

    Structure Comparison (US-align)

    Introduction

    Performs structural alignment of proteins or nucleic acids, supporting both monomers and hetero-oligomers. The alignment is implemented using the US-align tool and outputs metrics such as TM-score and RMSD to quantify structural similarity.
    It supports alignment between protein or nucleic acid structures with non-identical sequences.

    Parameters

    PDB1

    The first structure used for alignment. Batch processing is supported with the following archive formats: .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz.
    Up to 1000 structures are supported at a time.

    PDB2

    The second structure used for alignment, defined in the same way as PDB1.

    Note: Structural comparison will perform pairwise alignments between all structures in PDB1 and all structures in PDB2.

    Chain Mapping

    Specify the chains used for structural superposition. The format is:
    structure_name:chain1,chain2, where each line defines the chain information for one structure. Examples:

    structure1:A,B
    structure2:C,D
    

    This means that chain A in structure 1 is aligned with chain C in structure 2, and chain B is aligned with chain D.

    For convenience, to apply a unified chain mapping to all structures, you may also provide only a comma-separated list of chain names or chain indices, such as A,B or 1,2.

    • A,B indicates that chains A and B are used for alignment in all structures.
    • 1,2 indicates that the first and second chains are used for alignment in all structures.

    Specify the shared chain for alignment by entering its Chain ID (e.g., C) or Positional Index (e.g., 3). For example, if the antigen is the third chain, entering C or 3 will set it as the reference for superposition.

    Note: Structural alignment performs pairwise comparisons between all structures in PDB1 and all structures in PDB2.
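
    The two accepted Chain Mapping forms can be told apart mechanically. The sketch below is an illustrative parser, not the module's implementation; the function names and the returned tuple layout are assumptions.

```python
def parse_chain_mapping(text):
    """Return ("per_structure", {name: [chains]}) or ("global", [chains])."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty chain mapping")
    if all(":" in line for line in lines):
        mapping = {}
        for line in lines:
            # Split on the last colon so structure names may contain colons.
            name, chains = line.rsplit(":", 1)
            mapping[name.strip()] = [c.strip() for c in chains.split(",") if c.strip()]
        return "per_structure", mapping
    if len(lines) == 1 and ":" not in lines[0]:
        # A bare list such as "A,B" or "1,2" applies to every structure.
        return "global", [c.strip() for c in lines[0].split(",") if c.strip()]
    raise ValueError("mixed or unsupported chain mapping format")


def pair_chains(chains1, chains2):
    """Pair chains by order: first with first, second with second."""
    if len(chains1) != len(chains2):
        raise ValueError("both structures need the same number of chains")
    return list(zip(chains1, chains2))


kind, mapping = parse_chain_mapping("structure1:A,B\nstructure2:C,D")
pairs = pair_chains(mapping["structure1"], mapping["structure2"])
print(kind, pairs)
```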

    Output

    The alignment results are written to a CSV file, named align_results.csv by default.
    The aligned structures are written to an archive, named aligned_pdbs.tar.gz by default.

    Results

    The output file align_results.csv contains the following information:

    Field Description
    PDB1 Name of the first structure
    PDB2 Name of the second structure
    TM-score (Norm by Length of PDB1) TM-score is a measure of structural similarity between proteins. It ranges from 0 to 1: values >0.5 usually indicate the same fold (same family); values <0.3 indicate random or unrelated structures (even with similar lengths). “Norm by Length of PDB1” means normalization is based on the length of PDB1.
    TM-score (Norm by Length of PDB2) TM-score normalized by the length of PDB2
    TM-score (Average) The average of the two normalized TM-scores, providing an overall and balanced assessment of the structural similarity between the two proteins.
    RMSD RMSD value between the backbones of the two structures
    Aligned_length The number of residues that overlap after structural superposition
    Sequence_identity Sequence identity of the aligned region
    Aligned_structure Name of the superimposed structure
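
    The three TM-score columns differ only in the length used for normalization. The sketch below uses the standard TM-score formula for proteins with a fixed superposition; it is not US-align itself, which also searches for the superposition that maximizes the score and uses different distance scales for nucleic acids. The floor of 0.5 on d0 for short chains is an assumption taken from common practice, and the distances are made up.

```python
def d0(length):
    """Distance scale for proteins: 1.24 * (L - 15)^(1/3) - 1.8, floored at 0.5 (assumed)."""
    if length <= 21:
        return 0.5
    return max(1.24 * (length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances, ref_length):
    """TM-score for aligned residue pairs (distances in angstroms), normalized by ref_length."""
    scale = d0(ref_length)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / ref_length


# Hypothetical case: 50 residue pairs superimposed perfectly,
# PDB1 has 50 residues and PDB2 has 100.
distances = [0.0] * 50
tm_by_pdb1 = tm_score(distances, 50)
tm_by_pdb2 = tm_score(distances, 100)
tm_average = (tm_by_pdb1 + tm_by_pdb2) / 2
print(tm_by_pdb1, tm_by_pdb2, tm_average)
```

    In this example the smaller structure is fully covered (1.0 when normalized by PDB1) while only half of the larger one is matched (0.5 when normalized by PDB2), which is why the two normalized scores are reported separately alongside their average.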

    Reference

    • Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang. US-align: Universal Structure Alignment of Proteins, Nucleic Acids and Macromolecular Complexes. Nature Methods, 19: 1109-1115 (2022). DOI:10.1038/s41592-022-01585-1
  • Name: Antibody Design (MEAN)
    Description: 基于MEAN模型实现,采用多通道等变图注意力网络,用于设计CDR的一维序列和三维结构。 Implemented based on the MEAN model, which utilizes a multi-channel equivariant graph attention network. It can be used to design the one-dimensional sequence and three-dimensional structure of CDRs.
    Tags: undefined
    Author: Xiangzhe Kong
    Release: 2024-06-26 11:34:29
    Reference: Xiangzhe Kong, Wenbing Huang, Yang Liu. Conditional Antibody Design as 3D Equivariant Graph Translation. arXiv. 2023.

    Antibody Design (MEAN)

    简介

    Antibody Design (MEAN)模块基于MEAN模型实现,该模型采用多通道等变图注意力网络,可用于设计CDR的一维序列和三维结构。具体而言,MEAN 通过导入额外的结构信息(包括目标抗原和抗体的轻链)将抗体设计公式化为条件图翻译问题。然后,MEAN重新采用 E(3)-等变消息传递以及提出的注意机制来更好地捕捉不同结构信息之间的几何相关性。最后,它通过多轮渐进式全景模式输出一维序列和三维结构,与以前的自回归方法相比,它具有更高的效率和精度。MEAN在序列和结构建模、抗原结合CDR设计和结合亲和力优化方面明显超越了届时最优模型。具体而言,抗原结合CDR设计相对于基线模型改进约为23%,亲和力优化相对于基线模型改进约为34%。
    MEAN模型架构如下图所示:
    image.png
    image.png

    参数说明

    Structure PDB File

    抗体-抗原复合物结构或抗体结构(建议采用复合物结构,设计效果更佳),PDB格式

    Heavy Chain

    指定结构中的抗体重链名称,默认值为H,注意如果上传的结构中抗体重链命名非H,请修改该参数为相应的链名

    Light Chain

    指定结构中的抗体轻链名称,默认值为L,注意如果上传的结构中抗体轻链命名非L,请修改该参数为相应的链名

    Design Type

    设计模式,有两种设计模式:CDR-H3设计与亲和力优化(Optimized)

    Number

    亲和力优化中,生成的结构数量,默认值为100

    TopN

    指定输出亲和力最优的前N个突变对应的序列,默认为100。

    Output_Chain_Seq

    输出TopN对应的突变链的序列,默认为mutant_seqs.fasta。

    Output_Cpx_Seq

    输出TopN对应的复合物序列,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式),默认为mutant_seqs_complex.fasta。

    结果说明

    CDR-H3设计

    输出结果包括:

    输出文件名称 说明
    cdrs.txt文件 包含设计的CDR-H3序列
    design.pdb文件 设计后的复合物结构文件,注意抗体结构只保留Fv区域

    亲和力优化

    输出结果包括:

    输出文件名称 说明
    ddg_scores.txt文件 优化后结构与原结构的亲和力差异评分
    opt_best.pdb文件 亲和力最优结构文件,注意抗体结构只保留Fv区域
    log.txt 亲和力优化文件日志
    opt.zip 优化后的多个结构的压缩文件

    其中,ddg_scores.txt文件,包含信息如下:

    列名 说明
    Name 结构名称
    ddG 与原结构的亲和力差异评分ddG,单位为kcal/mol,数值为负时表示亲和力有提升,负得越多表示亲和力提升越好

    输出TopN对应的突变链的序列mutant_seqs.fasta。
    输出TopN对应的复合物序列mutant_seqs_complex.fasta,复合物中各链之间用冒号:分隔(Boltz2结构预测的批量模式)。

    参考文献

    Xiangzhe Kong, Wenbing Huang, Yang Liu. Conditional Antibody Design as 3D Equivariant Graph Translation. arXiv. 2023.

    Antibody Design (MEAN)

    Introduction

    The Antibody Design (MEAN) module is based on the MEAN model, which uses a multi-channel equivariant graph attention network to design both the one-dimensional sequence and the three-dimensional structure of CDRs (Complementarity-Determining Regions). MEAN formulates antibody design as a conditional graph translation problem by importing extra structural context, including the target antigen and the light chain of the antibody. It then uses E(3)-equivariant message passing together with a proposed attention mechanism to better capture the geometric correlations between the different components. Finally, it outputs the sequence and structure through a multi-round progressive full-shot scheme, which is more efficient and more accurate than earlier autoregressive approaches. MEAN significantly outperforms the state-of-the-art models of its time in sequence and structure modeling, antigen-binding CDR design, and binding affinity optimization: the relative improvement over baseline models is about 23% for antigen-binding CDR design and about 34% for affinity optimization.
    The MEAN model architecture is shown in the figure below:
    image.png
    image.png

    Parameter Description

    Structure PDB File

    The structure of the antibody-antigen complex or the antibody structure (the complex structure is recommended for better design results), in PDB format.

    Heavy Chain

    The name of the antibody heavy chain in the structure. The default value is H. If the heavy chain in the uploaded structure is not named H, change this parameter to the corresponding chain name.

    Light Chain

    The name of the antibody light chain in the structure. The default value is L. If the light chain in the uploaded structure is not named L, change this parameter to the corresponding chain name.

    Design Type

    The design mode. Two modes are available: CDR-H3 design and affinity optimization (Optimized).

    Number

    The number of structures generated in affinity optimization. The default value is 100.

    TopN

    The number of top-ranked mutations, ordered by affinity, whose sequences are output. The default value is 100.

    Output_Chain_Seq

    Output the sequences of the mutation chains corresponding to TopN. Default is mutant_seqs.fasta.

    Output_Cpx_Seq

    Output the sequences of the complexes corresponding to TopN. The chains within the complexes are separated by colon(:) (for batch mode structure prediction by Boltz2). Default is mutant_seqs_complex.fasta.

    Result

    CDR-H3 Design

    The output results include:

    Output File Name Description
    cdrs.txt Contains the designed CDR-H3 sequences
    design.pdb The designed complex structure file, note that only the Fv region of the antibody structure is retained

    Affinity Optimization

    The output results include:

    Output File Name Description
    ddg_scores.txt Affinity difference scores between the optimized structure and the original structure
    opt_best.pdb The structure file with the best affinity, note that only the Fv region of the antibody structure is retained
    log.txt Affinity optimization log file
    opt.zip Compressed file of multiple optimized structures

    The ddg_scores.txt file contains the following information:

    Column Name Description
    Name Structure name
    ddG Affinity difference score ddG with the original structure, in kcal/mol. A negative value indicates an improvement in affinity, and the more negative, the better the improvement in affinity

    The sequences of the mutated chain for the TopN mutations are written to mutant_seqs.fasta.

    The complex sequences for the TopN mutations are written to mutant_seqs_complex.fasta. The chains within each complex are separated by colons (:) for the batch mode of Boltz2 structure prediction.
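
    The colon-separated complex format can be illustrated with a short sketch. The record name, the chain order, and the example sequences below are assumptions made for illustration; only the colon separator comes from the description above.

```python
def complex_record(name, chain_seqs, target_chain, mutant_seq):
    """Build one FASTA record with the target chain replaced and chains joined by ':'."""
    if target_chain not in chain_seqs:
        raise KeyError(f"unknown chain: {target_chain}")
    seqs = dict(chain_seqs)  # copy, keeping the original chain order
    seqs[target_chain] = mutant_seq
    return f">{name}\n" + ":".join(seqs.values()) + "\n"


# Hypothetical three-chain complex: heavy chain H, light chain L, antigen A.
chains = {"H": "EVQ", "L": "DIQ", "A": "MKT"}
record = complex_record("mut_1", chains, "H", "EVK")
print(record, end="")
```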

    References

    Xiangzhe Kong, Wenbing Huang, Yang Liu. Conditional Antibody Design as 3D Equivariant Graph Translation. arXiv. 2023.

  • Name: Venn Diagram Plot
    Description: 绘制韦恩图(Venn diagram)工具 Venn diagrams drawing tool
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-06-23 00:00:00
    Reference:

    Venn Diagram Plot

    简介

    Venn Diagram Plot是一个制作韦恩图(Venn diagram)模块,常用于比较两个集合的重叠区域以及提取公共部分内容。用于中药网络药理学分析中提取中药成分预测靶点与疾病相关靶点的交集。

    参数说明

    Set A File

    集合A文件,TXT格式,每行一个元素。

    Set B File

    集合B文件,TXT格式,每行一个元素。

    Labels

    作图时显示的图例,逗号分割,如:set A,set B

    Case Sensitive

    比较时是否大小写敏感:
    Yes:区分大小写比较
    No:不区分大小写比较

    Output Intersection

    输出包含交集部分内容的文件名称,默认为intersection.txt

    结果说明

    输出韦恩图文件venn_diagram.png以及交集部分内容的文本文件intersection.txt

    Venn Diagram Plot

    Introduction

    The Venn Diagram Plot module is used to create Venn diagrams, which are commonly utilized to compare the overlapping regions of two sets and extract the common elements. This is particularly useful in traditional Chinese medicine network pharmacology analysis for identifying the intersection of predicted targets of herbal components and disease-related targets.

    Parameter Description

    Set A File

    The file for set A, in TXT format, with one element per line.

    Set B File

    The file for set B, in TXT format, with one element per line.

    Labels

    The labels to be displayed in the diagram, separated by commas, e.g., set A,set B.

    Case Sensitive

    Whether the comparison is case-sensitive:

    • Yes: Case-sensitive comparison
    • No: Case-insensitive comparison

    Output Intersection

    The name of the output file containing the intersection elements, default is intersection.txt.

    Result Description

    The output includes a Venn diagram file named venn_diagram.png and a text file containing the intersection elements named intersection.txt.
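
    The effect of the Case Sensitive option on the intersection can be sketched with plain sets. This is an illustration only: lower-casing is one assumed way to compare without regard to case, and the casing the module uses when it writes intersection.txt is not specified here. The gene names are example inputs.

```python
def read_set(lines, case_sensitive=True):
    """Build a set from one-element-per-line input, skipping blank lines."""
    items = (line.strip() for line in lines)
    return {item if case_sensitive else item.lower() for item in items if item}


def venn_counts(set_a, set_b):
    """Sizes of the three regions of a two-set Venn diagram."""
    return {
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        "both": len(set_a & set_b),
    }


lines_a = ["TP53", "egfr", "AKT1", ""]
lines_b = ["EGFR", "akt1", "MYC"]

strict = venn_counts(read_set(lines_a), read_set(lines_b))
relaxed = venn_counts(read_set(lines_a, False), read_set(lines_b, False))
print(strict, relaxed)
```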

  • Name: Protein-Protein Interaction (STRING)
    Description: 检索成对的蛋白-蛋白相互作用(PPI),基于STRING蛋白互作网络数据库,包含蛋白直接物理作用的互作关系以及间接作用的互作关系。 Extracting protein interactions based on STRING. STRING is a protein interaction network database that includes both direct physical interactions and indirect functional associations between proteins.
    Tags: undefined
    Author: STRING
    Release: 2024-06-21 00:00:00
    Reference: Szklarczyk D et.al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023 Jan 6;51(D1):D638-646.

    Protein-Protein Interaction (STRING)

    简介

    检索成对的蛋白-蛋白相互作用(PPI),基于STRING蛋白互作网络数据库,包含蛋白直接物理作用的互作关系以及间接作用的互作关系。

    参数说明

    Protein List

    蛋白名称列表文件,TXT格式,一行一个蛋白名称

    Cutoff

    蛋白-蛋白关联性打分的截断值,0~1之间,只导出combined_score为截断值以上的蛋白-蛋白相互作用数据。

    Related Protein

    是否输出相关蛋白;
    Yes:代表输出与输入蛋白相关的蛋白
    No:代表只输出输入蛋白之间存在的相互作用

    结果说明

    输出蛋白-蛋白相互作用文件string_interactions.tsv,每一列说明如下:

    列名 说明
    node1 节点1的蛋白名称
    node2 节点2的蛋白名称
    node1_string_id 节点1在STRING数据库中标准ID
    node2_string_id 节点2在STRING数据库中标准ID
    neighborhood_on_chromosome 基于基因组邻近性预测的相互作用得分。
    gene_fusion 基于基因融合事件预测的相互作用得分。
    phylogenetic_cooccurrence 基于共同出现(共现性)预测的相互作用得分。
    homology 蛋白之间的同源性。
    coexpression 基于共同表达(共表达)预测的相互作用得分。
    experimentally_determined_interaction 基于实验数据(例如,酵母双杂交实验)预测的相互作用得分。
    database_annotated 基于已知数据库信息预测的相互作用得分。
    automated_textmining 基于文本挖掘预测的相互作用得分。
    combined_score 综合所有上述信息计算得到的综合得分。

    参考文献

    • Szklarczyk D et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023 Jan 6;51(D1):D638-646.
    • https://cn.string-db.org/

    Protein-Protein Interaction (STRING)

    Introduction

    Protein-Protein Interaction (STRING) is a module based on the STRING database for extracting protein interaction data. STRING is a protein interaction network database that includes both direct physical interactions and indirect functional associations between proteins.

    Parameter Description

    Protein List

    A file containing a list of protein names, in TXT format, with one protein name per line.

    Cutoff

    A cutoff value for the protein-protein association score, ranging from 0 to 1. Only protein-protein interactions with a combined score above this cutoff will be exported.

    Related Protein

    Whether to output related proteins:

    • Yes: Output proteins related to the input proteins.
    • No: Only output interactions among the input proteins.

    Result Description

    The output is a protein-protein interaction file named string_interactions.tsv. Each column is described as follows:

    Column Name Description
    node1 Protein name of node 1
    node2 Protein name of node 2
    node1_string_id Standard STRING ID for node 1
    node2_string_id Standard STRING ID for node 2
    neighborhood_on_chromosome Interaction score based on genomic neighborhood prediction
    gene_fusion Interaction score based on gene fusion events
    phylogenetic_cooccurrence Interaction score based on phylogenetic co-occurrence
    homology Homology between proteins
    coexpression Interaction score based on co-expression
    experimentally_determined_interaction Interaction score based on experimental data (e.g., yeast two-hybrid)
    database_annotated Interaction score based on known database information
    automated_textmining Interaction score based on text mining
    combined_score Combined score calculated from all the above information
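
    The effect of the Cutoff parameter can be reproduced on an exported table. The following is a minimal sketch, assuming the file has a plain header row with the column names listed above and that scores at or above the cutoff are kept; the sample rows and the protein name GENE_X are made up for illustration.

```python
import csv
import io


def filter_interactions(tsv_text, cutoff):
    """Return the rows whose combined_score is at or above the cutoff."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["combined_score"]) >= cutoff]


# Made-up sample, reduced to three of the documented columns.
sample = (
    "node1\tnode2\tcombined_score\n"
    "TP53\tMDM2\t0.999\n"
    "TP53\tGENE_X\t0.412\n"
)

strong = filter_interactions(sample, cutoff=0.7)
print([(row["node1"], row["node2"]) for row in strong])
```

    To apply it to a real result, read string_interactions.tsv into a string and pass it in place of the sample.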

    References

    • Szklarczyk D et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023 Jan 6;51(D1):D638-646.
    • STRING Database: https://cn.string-db.org/
  • Name: Gene Enrichment (DAVID)
    Description: 基于DAVID的基因功能富集分析 Gene function enrichment analysis based on DAVID
    Tags: undefined
    Author: DAVID
    Release: 2024-06-21 00:00:00
    Reference: B.T. Sherman, M. Hao, J. Qiu, X. Jiao, M.W. Baseler, H.C. Lane, T. Imamichi and W. Chang. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Research. 23 March 2022. 50(W1):W216-W221. doi:10.1093/nar/gkac194.

    Gene Enrichment (DAVID)

    简介

    Gene Enrichment (DAVID)是基于DAVID的基因功能富集分析模块,DAVID是一个生物信息数据库,整合了生物学数据和分析工具,为大规模的基因或蛋白列表提供系统综合的生物功能注释信息。

    参数说明

    Gene List

    基因列表文件,TXT格式,一行一个基因/蛋白。

    Gene Identifier

    基因名称类型,支持多种数据库基因名称。

    P-value

    P-value,基因富集中统计差异检验使用的p值的截断值,只保留低于该截断值的富集条目。

    Gene Count

    基因数目截断值,只保留大于该截断值的富集条目。

    Category

    基因富集的类别,包括细胞组分(Cellular Component CC),分子功能(Molecular Function MF),生物学过程(Biological Process BP)。

    Report File

    输出基因富集的结果文件,TSV格式。

    结果说明

    结果输出chartReport.tsv文件,文件中每一列代表说明如下:

    列名 说明
    Category 注释类别,例如GOTERM_BP_DIRECT(生物过程)、GOTERM_MF_DIRECT(分子功能)、GOTERM_CC_DIRECT(细胞组分)、KEGG_PATHWAY(KEGG通路)等。
    Term 具体的注释术语或通路名称。
    Count 输入基因集中注释到该术语的基因数目。
    % 输入基因集中注释到该术语的基因占总输入基因的百分比。
    PValue 富集分析的p值,表示注释到该术语的基因数目与随机情况下的期望数目之间的显著性差异。
    Benjamini Benjamini-Hochberg校正后的p值,用于控制假发现率(FDR)。
    FDR 假发现率,表示在所有显著结果中,预期的错误发现比例。
    Genes 注释到该术语的输入基因的列表,通常以逗号分隔。
    List Total 输入基因集中总的基因数目。
    Pop Hits 背景基因集中注释到该术语的基因数目。
    Pop Total 背景基因集的总基因数目。
    Fold Enrichment 富集倍数,表示输入基因集中注释到该术语的基因比例(Count/List Total)相对于背景基因集中注释到该术语的基因比例(Pop Hits/Pop Total)的比值。

    参考文献

    • B.T. Sherman, M. Hao, J. Qiu, X. Jiao, M.W. Baseler, H.C. Lane, T. Imamichi and W. Chang. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Research. 23 March 2022. 50(W1):W216-W221. doi:10.1093/nar/gkac194.
    • https://david.ncifcrf.gov

    Gene Enrichment (DAVID)

    Introduction

    Gene Enrichment (DAVID) is a gene functional enrichment analysis module based on DAVID. DAVID is a bioinformatics database that integrates biological data and analytical tools to provide systematic and comprehensive biological functional annotation information for large-scale gene or protein lists.

    Parameter Description

    Gene List

    A file containing the gene list in TXT format, with one gene/protein per line.

    Gene Identifier

    The gene identifier type. Identifiers from multiple databases are supported.

    P-value

    The p-value cutoff for the statistical test used in gene enrichment. Only enrichment entries with a p-value below this cutoff are retained.

    Gene Count

    The cutoff value of the number of genes, retaining only enrichment entries with a gene count greater than this cutoff value.

    Category

    The category of gene enrichment, including Cellular Component (CC), Molecular Function (MF), and Biological Process (BP).

    Report File

    The output file of gene enrichment results, in TSV format.

    Result Description

    The results are written to the chartReport.tsv file. Each column is described as follows:

    Column Name Description
    Category Annotation category, such as GOTERM_BP_DIRECT (Biological Process), GOTERM_MF_DIRECT (Molecular Function), GOTERM_CC_DIRECT (Cellular Component), KEGG_PATHWAY (KEGG Pathway), etc.
    Term Specific annotation term or pathway name.
    Count The number of genes in the input gene set annotated to this term.
    % The percentage of genes in the input gene set annotated to this term.
    PValue The p-value of the enrichment analysis, indicating the significance of the difference between the number of genes annotated to this term and the expected number under random conditions.
    Benjamini The p-value after Benjamini-Hochberg correction, used to control the false discovery rate (FDR).
    FDR False discovery rate, indicating the expected proportion of false discoveries among all significant results.
    Genes The list of input genes annotated to this term, usually separated by commas.
    List Total The total number of genes in the input gene set.
    Pop Hits The number of genes in the background gene set annotated to this term.
    Pop Total The total number of genes in the background gene set.
    Fold Enrichment The fold enrichment: the proportion of input genes annotated to this term (Count / List Total) divided by the proportion of background genes annotated to this term (Pop Hits / Pop Total).
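
    As a worked example of how the columns relate, fold enrichment is commonly computed as a ratio of two proportions, (Count / List Total) / (Pop Hits / Pop Total). The sketch below assumes that definition and also mimics the P-value and Gene Count cutoffs described above; the numbers and the helper names are illustrative only.

```python
def fold_enrichment(count, list_total, pop_hits, pop_total):
    """Proportion of input genes on the term over the background proportion."""
    return (count / list_total) / (pop_hits / pop_total)


def keep_term(p_value, count, max_p_value, min_count):
    """Mirror the module's cutoffs: p-value below, gene count above."""
    return p_value < max_p_value and count > min_count


# Illustrative numbers: 12 of 300 input genes vs. 400 of 20000 background genes.
enrichment = fold_enrichment(count=12, list_total=300, pop_hits=400, pop_total=20000)
print(round(enrichment, 3))
print(keep_term(p_value=0.003, count=12, max_p_value=0.05, min_count=2))
```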

    References

    • B.T. Sherman, M. Hao, J. Qiu, X. Jiao, M.W. Baseler, H.C. Lane, T. Imamichi, and W. Chang. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Research. 23 March 2022. 50(W1):W216-W221. doi:10.1093/nar/gkac194.
    • https://david.ncifcrf.gov
  • Name: TCM Chemical Ingredients
    Description: 基于中药名称提取中药化学成分 Extracting chemical structures of Chinese herbs
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-06-20 00:00:00
    Reference: Liu Z, Cai C, Du J. et al. TCMIO: a comprehensive database of traditional Chinese medicine on Immuno-oncology. Front Pharmacol 2020;11:439.

    TCM Chemical Ingredients

    简介

    TCM Chemical Ingredients用于提取中药的化学成分的结构信息。

    参数说明

    TCM Name

    中药的名称,支持中文名、英文名、拼音名,支持多个名称,英文逗号分割。比如:人参,黄芪

    Remove Duplicates

    是否对成分的结构进行去重处理

    结果说明

    输出文件 描述
    ingredients.sdf 化学成分的结构文件,SDF格式
    ingredients.csv 化学成分的结构文件,CSV格式,里面包含SMILES等结构信息

    参考文献

    Liu Z, Cai C, Du J. et al. TCMIO: a comprehensive database of traditional Chinese medicine on Immuno-oncology . Front Pharmacol 2020;11:439.

    TCM Chemical Ingredients

    Introduction

    The TCM Chemical Ingredients module is used to extract structural information of chemical ingredients from traditional Chinese medicines (TCM).

    Parameter Description

    TCM Name

    The name(s) of the traditional Chinese medicine(s), supporting Chinese, English, or Pinyin names. Multiple names can be separated by commas. For example: 人参,黄芪.

    Remove Duplicates

    Whether to remove duplicate structures of the ingredients:

    • Yes: Remove duplicates
    • No: Do not remove duplicates

    Result Description

    The output includes the following files:

    Output File Description
    ingredients.sdf Structural file of the chemical ingredients in SDF format
    ingredients.csv Structural file of the chemical ingredients in CSV format, containing SMILES and other structural information
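
    When Remove Duplicates is set to No, duplicates can still be dropped afterwards. The following sketch assumes the CSV has a column named SMILES (the exact header is not documented here) and compares the strings literally, without any chemical canonicalization; the sample rows are made up.

```python
import csv
import io


def unique_by_smiles(csv_text, column="SMILES"):
    """Keep the first row for each distinct SMILES string; skip empty ones."""
    seen = set()
    unique = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[column].strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows; the "name" column is hypothetical.
sample = "name,SMILES\ningredient_a,CCO\ningredient_b,CCO\ningredient_c,c1ccccc1\n"

rows = unique_by_smiles(sample)
print([row["name"] for row in rows])
```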

    References

    Liu Z, Cai C, Du J. et al. TCMIO: a comprehensive database of traditional Chinese medicine on Immuno-oncology . Front Pharmacol 2020;11:439.

  • Name: Target Prioritization (OpenTargets)
    Description: 提取疾病相关靶点蛋白,基于OpenTarget数据库及其疾病-靶点相关性打分方法。 A module for extracting disease-related target proteins, based on the OpenTarget database and its disease-target association scoring method.
    Tags: undefined
    Author: Open Targets
    Release: 2024-06-20 00:00:00
    Reference: Ochoa, D et al. The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Research, 2023, DOI: 10.1093/nar/gkac1046

    Target Prioritization (OpenTargets)

    简介

    Target Prioritization (OpenTargets) 是提取疾病相关靶点蛋白的模块,基于OpenTarget数据库及其疾病-靶点相关性打分方法。

    参数说明

    Disease Name

    疾病的英文名称,如rheumatoid arthritis

    Data Type

    数据类型,包括直接关联和全部关联的数据。
    direct:直接关联数据,指有直接证据表明该疾病和靶点存在关联。
    all:全部关联数据,包括了间接关联数据,间接关联是基于本体论推断出来的疾病靶点关系。
    详细可参考:https://platform-docs.opentargets.org/associations

    Cutoff

    疾病-靶点关系打分的截断值,只输出大于截断值的靶点信息。

    Target Class

    靶点类型,默认为all 代表全部

    结果说明

    输出疾病及靶点相关的文件,包括:

    文件名称 文件说明
    disease_info.csv 疾病信息表
    target_info.csv 靶点信息表
    targets_by_data_source.csv 基于数据来源的疾病-靶点关系打分表
    targets_by_data_type.csv 基于数据类型的疾病-靶点关系打分表
    uniprot_ids.txt 靶点的蛋白UniProt ID列表
    genes.txt 靶点的基因名称列表

    参考文献

    https://platform-docs.opentargets.org/

    Target Prioritization (OpenTargets)

    Introduction

    The Target Prioritization (OpenTargets) module is used to extract disease-related target proteins based on the OpenTargets database and its disease-target association scoring method.

    Parameter Description

    Disease Name

    The English name of the disease, such as rheumatoid arthritis.

    Data Type

    The type of data, including directly associated and all associated data.

    • direct: Directly associated data, indicating there is direct evidence linking the disease to the target.
    • all: All associated data, including indirect associations, which are disease-target relationships inferred from the ontology.

    For more details, refer to: https://platform-docs.opentargets.org/associations

    Cutoff

    The cutoff value for the disease-target association score. Only target information with a score greater than this cutoff will be output.

    Target Class

    The target class. The default, all, includes every target class.

    Result Description

    The output includes files related to the disease and its targets:

    File Name Description
    disease_info.csv Disease information table
    target_info.csv Target information table
    targets_by_data_source.csv Disease-target association scores by data source
    targets_by_data_type.csv Disease-target association scores by data type
    uniprot_ids.txt List of target protein UniProt IDs
    genes.txt List of target gene names

    References

    OpenTargets Platform Documentation: https://platform-docs.opentargets.org/

  • Name: Structure Minimization (Protein)
    Description: 蛋白结构优化模块,支持氢原子优化、氨基酸侧链优化、整体优化三种方式。一般建议通过WeView三维结构可视化编辑器来使用该功能。 Structure optimization supporting three methods: hydrogen optimization, side chain optimization, and overall optimization. It is recommended to use this function through the WeView 3D structure editor.
    Tags: undefined
    Author:
    Release: 2024-05-29 14:41:20
    Reference:

    Structure Minimization (Protein)

    简介

    Structure Minimization是结构优化模块,支持氢原子优化、氨基酸侧链优化、整体优化三种方式。

    参数说明

    PDB File

    结构文件,PDB格式。

    Relax Type

    优化类型,支持以下几种:
    hydrogen:约束限制所有非氢原子,对结构上的氢原子进行优化。
    sidechain:约束蛋白骨架,优化蛋白氨基酸侧链,若存在小分子,整个小分子进行限制。
    all:系统整体优化,不做任何限制约束。
    可多选,进行多步优化。

    Cycle Number

    能量优化的步数。

    Force Field

    采用的分子力场,默认ff14SB。ff19SB, ff14SB适合蛋白和核酸的凝聚相模拟,也支持小分子。

    Restrain Force Constant

    约束力常数,单位为kcal/mol/Å^2,数值越大,约束能力越强。

    Output Name

    输出文件名称,默认minimized_structure.pdb。

    结果说明

    输出结果为优化后的结构文件minimized_structure.pdb,保留了输入文件中的链和氨基酸编号信息。

    Structure Minimization (Protein)

    Introduction

    The Structure Minimization module is used for structural optimization, supporting three types of optimizations: hydrogen atom optimization, amino acid side chain optimization, and overall optimization.

    Parameter Description

    PDB File

    The structure file in PDB format.

    Relax Type

    The type of optimization, supporting the following options:

    • hydrogen: Constrains all non-hydrogen atoms and optimizes the hydrogen atoms in the structure.
    • sidechain: Constrains the protein backbone and optimizes the amino acid side chains. If small molecules are present, the entire small molecule is constrained.
    • all: Performs overall system optimization without any constraints.

    Multiple types can be selected to run a multi-step optimization.

    Cycle Number

    The number of steps for energy optimization.

    Force Field

    The molecular force field used, default is ff14SB. ff19SB and ff14SB are suitable for condensed phase simulations of proteins and nucleic acids, and also support small molecules.

    Restrain Force Constant

    The restraint force constant, in units of kcal/mol/Ų. The larger the value, the more strongly the restrained atoms are held in place.
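
    To make the units concrete, a positional restraint is usually a harmonic penalty on the displacement from the starting coordinates. The sketch below assumes the form E = k · Σ|r − r0|² (no factor of 1/2, as in the Amber convention); the module's actual implementation is not documented here, and the coordinates are made up.

```python
def restraint_energy(coords, reference, k):
    """Harmonic restraint energy in kcal/mol for coordinates given in Å.

    Assumes E = k * sum(|r_i - r0_i|^2); k is in kcal/mol/Å^2.
    """
    total_sq = 0.0
    for (x, y, z), (x0, y0, z0) in zip(coords, reference):
        total_sq += (x - x0) ** 2 + (y - y0) ** 2 + (z - z0) ** 2
    return k * total_sq


# One atom displaced by 0.5 Å along x, restrained with k = 10 kcal/mol/Å^2.
energy = restraint_energy([(0.5, 0.0, 0.0)], [(0.0, 0.0, 0.0)], k=10.0)
print(energy)
```

    Doubling k doubles the penalty for the same displacement, which is why larger values keep the restrained atoms closer to their input positions.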

    Output Name

    The name of the output file, default is minimized_structure.pdb.

    Result Description

    The output is the optimized structure file minimized_structure.pdb, retaining the chain and amino acid numbering information from the input file.

  • Name: Replace Chain Name
    Description: Replace Chain Name用于替换PDB文件中的链名。 Performs in-place replacement of a chain identifier by another.
    Tags: undefined
    Author:
    Release: 2024-06-07 00:00:00
    Reference:
  • Name: Structure Preparation
    Description: 蛋白结构处理模块,用于补全缺失原子和残基,以及蛋白氨基酸残基的质子化判断以及加氢操作。一般建议通过WeView三维结构可视化编辑器来使用该功能。 Protein structure preparation module used for adding missing atoms and residues, as well as for protonation determination and hydrogenation of protein amino acid residues. It is recommended to use this function through the WeView 3D structure editor.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-06-07 00:00:00
    Reference: J. Chem. Theory Comput. 2011, 7 (2), 525–537.

    Structure Preparation

    简介

    蛋白结构处理模块,用于补全缺失原子和残基,以及蛋白氨基酸残基的质子化判断以及加氢操作。采用pdbfixer补全缺失,采用propka3进行质子化判断。

    参数说明

    Structure File

    蛋白的结构文件,PDB格式

    Chains

    提取指定链处理,默认all,代表选择全部链,输入链名,多条链用英文逗号隔开,如A,B表示从PDB文件中提取A,B链进行结构处理。注意链名之间不要用空格。

    Delete Heterogens

    删除非标准蛋白或核酸残基,如水、离子、以及其他PDB中HETATM记录。
    all:表示删除所有HETATM记录,包括水、离子、小分子等;
    water:表示仅删除水;
    ions:表示仅删除离子,默认为NA,CL;
    custom:表示需要删除其他定制的残基名称,由Custom Heterogens参数指定。
    Heterogens详细介绍可参考:https://www.wwpdb.org/documentation/file-format-content/format23/sect4.html

    Custom Heterogens

    自定义Heterogens的残基名称,多个用英文逗号分隔,如ZN,MG

    Delete Hydrogens

    删除氢原子,Yes表示删除,No表示不删除。

    Add

    添加缺失的重原子或者残基。
    heavy:表示添加缺失重原子
    residues:表示添加缺失残基,默认也会添加缺失的原子

    Protonation

    是否进行质子化判断并添加氢原子,采用propka方法进行蛋白残基的质子化判断。
    Yes:代表根据质子化判断结果进行加氢操作,
    No:代表不加氢处理

    pH

    用于蛋白质子化状态判断的pH值。

    Naming Scheme

    输出PDB文件中残基和原子的命名方式。
    PDB:标准氨基酸格式,如组氨酸为HIS;
    AMBER:AMBER格式,如组氨酸为HID/HIE/HIP;
    CHARMM:CHARMM格式,如组氨酸为HSE/HSD/HSP。

    Prepared Structure

    输出的处理后的蛋白结构文件,PDB格式。默认文件名为:prepared_structure.pdb。

    结果说明

    输出处理好的结构文件,PDB格式。文件中的原子和残基类型按照指定Naming Scheme方法。

    参考文献

    • Olsson, M. H. M.; Søndergaard, C. R.; Rostkowski, M.; Jensen, J. H. PROPKA3: Consistent Treatment of Internal and Surface Residues in Empirical pKa Predictions. J. Chem. Theory Comput. 2011, 7 (2), 525–537. https://doi.org/10.1021/ct100578z.
    • https://github.com/jensengroup/propka
    • https://github.com/openmm/pdbfixer

    Structure Preparation

    Introduction

    The Structure Preparation module is used for completing missing atoms and residues in protein structures, as well as determining the protonation states of amino acid residues and adding hydrogen atoms. It uses pdbfixer for completing missing parts and propka3 for protonation state determination.

    Parameter Description

    Structure File

    The protein structure file in PDB format.

    Chains

    Specify the chains to be processed. The default is all, which means all chains will be processed. To specify chains, input the chain names separated by commas without spaces, e.g., A,B to process chains A and B from the PDB file.

    Delete Heterogens

    Remove non-standard protein or nucleic acid residues such as water, ions, and other HETATM records in the PDB.

    • all: Remove all HETATM records, including water, ions, small molecules, etc.
    • water: Remove only water.
    • ions: Remove only ions, default is NA,CL.
    • custom: Remove other specified residues, indicated by the Custom Heterogens parameter.

    For more details on heterogens, refer to: https://www.wwpdb.org/documentation/file-format-content/format23/sect4.html

    Custom Heterogens

    Specify custom heterogens to be removed by their residue names, separated by commas, e.g., ZN,MG.
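
    The Chains and Custom Heterogens values are both comma-separated lists without spaces. The following sketch shows one way such values could be applied to PDB records, using the fixed PDB columns for residue name and chain ID. It is only an illustration: the module itself relies on pdbfixer, and the function and helper names here are hypothetical.

```python
def filter_pdb_lines(lines, chains="all", drop_resnames=""):
    """Keep records on the selected chains and drop the listed HETATM residues."""
    keep_chains = None if chains == "all" else set(chains.split(","))
    drop = {name for name in drop_resnames.split(",") if name}
    kept = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            res_name = line[17:20].strip()  # PDB columns 18-20
            chain_id = line[21]             # PDB column 22
            if keep_chains is not None and chain_id not in keep_chains:
                continue
            if line.startswith("HETATM") and res_name in drop:
                continue
        kept.append(line)
    return kept


def pdb_line(record, serial, name, res, chain, resseq):
    # Hypothetical helper: lays out only the leading PDB columns for this demo.
    return f"{record:<6}{serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}"


records = [
    pdb_line("ATOM", 1, "N", "ALA", "A", 1),
    pdb_line("ATOM", 2, "N", "GLY", "B", 1),
    pdb_line("HETATM", 3, "ZN", "ZN", "A", 201),
    pdb_line("HETATM", 4, "O", "HOH", "A", 301),
]

kept = filter_pdb_lines(records, chains="A", drop_resnames="ZN")
print([line[17:20].strip() for line in kept])
```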

    Delete Hydrogens

    Remove hydrogen atoms.

    • Yes: Delete hydrogen atoms.
    • No: Do not delete hydrogen atoms.

    Add

    Add missing heavy atoms or residues.

    • heavy: Add missing heavy atoms.
    • residues: Add missing residues, which also adds missing atoms by default.

    Protonation

    Determine protonation states and add hydrogen atoms using the propka method.

    • Yes: Add hydrogen atoms based on protonation state determination.
    • No: Do not add hydrogen atoms.

    pH

    The pH value used for determining the protonation states of the protein residues.
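
    The role of pH can be illustrated with the Henderson-Hasselbalch relation, which gives the fraction of a titratable group that is protonated once its pKa is known. This is a sketch of that textbook relation only; how the module turns the propka-predicted pKa values into discrete protonation states is not described here, and the pKa used below is an illustrative value.

```python
def protonated_fraction(pka, ph):
    """Fraction of a titratable group in its protonated form at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Illustrative pKa for a histidine side chain; real values come from the pKa prediction.
for ph in (5.0, 6.5, 7.4):
    print(ph, round(protonated_fraction(pka=6.5, ph=ph), 3))
```

    A group is half protonated when the pH equals its pKa, and mostly deprotonated once the pH is well above it.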

    Naming Scheme

    The naming convention for residues and atoms in the output PDB file.

    • PDB: Standard amino acid format, e.g., histidine as HIS.
    • AMBER: AMBER format, e.g., histidine as HID/HIE/HIP.
    • CHARMM: CHARMM format, e.g., histidine as HSE/HSD/HSP.

    Prepared Structure

    The name of the output processed protein structure file in PDB format. The default file name is prepared_structure.pdb.

    Result Description

    The output is a processed structure file in PDB format. The atoms and residue types in the file follow the specified naming scheme.

    References

    • Olsson, M. H. M.; Søndergaard, C. R.; Rostkowski, M.; Jensen, J. H. PROPKA3: Consistent Treatment of Internal and Surface Residues in Empirical pKa Predictions. J. Chem. Theory Comput. 2011, 7 (2), 525–537. https://doi.org/10.1021/ct100578z.
    • PROPKA GitHub Repository: https://github.com/jensengroup/propka
    • PDBFixer GitHub Repository: https://github.com/openmm/pdbfixer
  • Name: Antibody RMSD
    Description: 对参考抗体结构及其他CDR相同的抗体结构,进行基于Fv区域的结构叠合,并计算CDR区域的RMSD值。 Calculate the RMSD values of the CDR region through a Fv region-based structure superposition of the reference antibody and other CDR identical antibody structures
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-05-22 14:24:47
    Reference:

    Antibody RMSD

    简介

    Antibody RMSD模块对参考抗体结构及其他CDR相同的抗体结构,进行基于Fv区域的结构叠合,并计算CDR区域的RMSD值。支持普通抗体及纳米抗体。
    应用场景:人源化后的抗体序列,预测抗体结构后,比较各结构CDR区域的RMSD差异。

    参数说明

    Antibody Structures

    多个抗体结构PDB文件的压缩打包文件,TAR格式

    Reference Structure

    进行RMSD计算的参考抗体结构,PDB格式

    Aligned PDB

    抗体叠合结构输出名称,TAR.GZ格式

    结果说明

    • RMSD计算结果,CSV格式文件result.csv ,包含信息如下:
    列名 说明
    Reference Antibody 参考抗体结构的名称
    Target 用于计算RMSD的其他抗体结构名称
    H.CDR1 H链CDR1区域的RMSD值
    H.CDR2 H链CDR2区域的RMSD值
    H.CDR3 H链CDR3区域的RMSD值
    H.CDR H链CDR区域整体的RMSD值
    L.CDR1 L链CDR1区域的RMSD值
    L.CDR2 L链CDR2区域的RMSD值
    L.CDR3 L链CDR3区域的RMSD值
    L.CDR L链CDR区域整体的RMSD值
    CDR_ALL CDR区域整体的RMSD值

    注意:进行RMSD计算的两个抗体结构,其CDR区域序列应相同,如有差异会导致计算出错。

    Antibody RMSD

    Introduction

    The Antibody RMSD module superposes a reference antibody structure and other antibody structures with identical CDRs based on their Fv regions, then calculates the RMSD values of the CDR regions. Both conventional antibodies and nanobodies are supported.
    Application scenario: after humanizing antibody sequences and predicting their structures, compare the RMSD differences in the CDR regions across the structures.

    Parameters

    Antibody Structures

    Compressed TAR file containing multiple antibody structure PDB files.

    Reference Structure

    Reference antibody structure in PDB format for RMSD calculation.

    Aligned PDB

    Output name for the superposed antibody structures, in TAR.GZ format.

    Result Description

    • RMSD calculation results in a CSV format file result.csv, including the following information:
    Column Name Description
    Reference Antibody Name of the reference antibody structure
    Target Name of the other antibody structure used for RMSD calculation
    H.CDR1 RMSD value of the H-chain CDR1 region
    H.CDR2 RMSD value of the H-chain CDR2 region
    H.CDR3 RMSD value of the H-chain CDR3 region
    H.CDR Overall RMSD value of the H-chain CDR regions
    L.CDR1 RMSD value of the L-chain CDR1 region
    L.CDR2 RMSD value of the L-chain CDR2 region
    L.CDR3 RMSD value of the L-chain CDR3 region
    L.CDR Overall RMSD value of the L-chain CDR regions
    CDR_ALL Overall RMSD value of all CDR regions

    Note: The CDR region sequences of the two antibody structures used for RMSD calculation should be identical; any differences may lead to calculation errors.
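
    The note above follows from how RMSD is defined: atoms are compared pairwise, so both structures must contribute the same atoms in the same order. The sketch below computes RMSD for coordinates that are assumed to be already superposed (the Fv-based superposition step is not shown), and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized, pre-superposed atom lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("both structures must provide the same non-zero number of atoms")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Two atoms, each shifted by 1 Å along x.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
value = rmsd(reference, target)
print(value)
```

    If the CDR sequences differ, the atom lists no longer match one to one, which is the failure case the note warns about.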

  • Name: Target Prediction (FastTargetPred)
    Description: 基于二维相似度的小分子靶点预测模块,活性分子及靶点数据来源于ChEMBL数据库。 A small molecule target prediction module based on 2D similarity. Active molecules and target data are derived from ChEMBL database.
    Tags: undefined
    Author: Ludovic Chaput
    Release: 2024-04-25 14:16:17
    Reference: Chaput L, Guillaume V, Singh N, Deprez B, Villoutreix BO. FastTargetPred: a program enabling the fast prediction of putative protein targets for input chemical databases. Bioinformatics. 2020 Aug 15;36(14):4225-4226.

    Target Prediction (FastTargetPred)

    简介

    Target Prediction (FastTargetPred)是基于二维相似度的小分子靶点预测模块,活性分子及靶点数据来源于ChEMBL25数据库,相似度计算采用1024位ECFP4的分子指纹,特点是速度快,几小时完成数十万化合物的靶点预测。

    参数说明

    SDF File

    小分子结构文件,SDF格式

    Tanimoto Threshold

    相似度(Tanimoto)阈值。从ChEMBL中查找大于相似度阈值的化合物。

    Output File

    输出文件名称

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 靶点预测结果的csv文件
    result.html 靶点预测结果的html文件

    其中输出结果包含信息如下:

    字段名称 说明
    Query name 查询分子名称
    Database molecule id 在ChEMBL中找到的相似分子ID
    Target id 靶标分子ID
    Score 相似度数值
    Uniprot 蛋白Uniprot ID
    Uniprot name Uniprot分子名称
    Status 数据发表情况
    Protein names 蛋白名称
    Gene names 基因名称
    Organism 物种名称
    CHEMBL 靶点CHEMBL分子ID
    Involvement in disease 参与疾病类型
    Geneontology (biological process) 基因本体(生物过程)
    Cross-reference (Reactome) 交叉引用(Reactome)

    参考文献

    Chaput L, Guillaume V, Singh N, Deprez B, Villoutreix BO. FastTargetPred: a program enabling the fast prediction of putative protein targets for input chemical databases. Bioinformatics. 2020 Aug 15;36(14):4225-4226. https://doi.org/10.1093/bioinformatics/btaa494

    Target Prediction (FastTargetPred)

    Introduction

    Target Prediction (FastTargetPred) is a module for predicting small molecule targets based on 2D similarity. The active molecules and target data are sourced from the ChEMBL25 database. Similarity calculation uses 1024-bit ECFP4 molecular fingerprints. The main feature of this module is its speed, capable of predicting targets for hundreds of thousands of compounds within a few hours.

    Parameter Description

    SDF File

    The structure file of small molecules in SDF format.

    Tanimoto Threshold

    The similarity (Tanimoto) threshold. Compounds from ChEMBL with a similarity greater than this threshold will be considered.
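
    The Tanimoto coefficient behind this threshold is the number of fingerprint bits two molecules share divided by the number of bits set in either. The following is a minimal sketch on made-up sets of on-bit indices; in practice the 1024-bit ECFP4 fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Made-up on-bit indices for a query and a database molecule.
query_bits = {1, 5, 9, 12}
database_bits = {1, 5, 9, 40}

score = tanimoto(query_bits, database_bits)
print(score, score > 0.5)
```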

    Output File

    The name of the output file.

    Result Description

    The output results include:

    Output File Name Description
    result.csv CSV file containing the target prediction results
    result.html HTML file containing the target prediction results

    The output results contain the following information:

    Field Name Description
    Query name Name of the query molecule
    Database molecule id ID of the similar molecule found in ChEMBL
    Target id ID of the target molecule
    Score Similarity score
    Uniprot Uniprot ID of the protein
    Uniprot name UniProt entry name of the protein
    Status Publication status of the data
    Protein names Names of the proteins
    Gene names Names of the genes
    Organism Name of the organism
    CHEMBL CHEMBL molecule ID of the target
    Involvement in disease Types of diseases involved
    Geneontology (biological process) Gene ontology (biological process)
    Cross-reference (Reactome) Cross-reference (Reactome)

    References

    • Chaput L, Guillaume V, Singh N, Deprez B, Villoutreix BO. FastTargetPred: a program enabling the fast prediction of putative protein targets for input chemical databases. Bioinformatics. 2020 Aug 15;36(14):4225-4226. https://doi.org/10.1093/bioinformatics/btaa494
  • Name: Electrostatic Potential Calculation (APBS)
    Description: 基于APBS方法计算生物大分子结构的静电势能,并绘制表面图。 为了可视化显示表面图,请从结构编辑器WeView中执行该功能:Weview->Analysis->Electrostatics。 Calculate the electrostatic potential energy of biomolecular structures using the APBS method and generate surface plots. To visualize the surface maps, execute this function from the structure editor WeView: WeView->Analysis->Electrostatics.
    Tags: undefined
    Author: Elizabeth Jurrus
    Release: 2024-04-19 15:42:10
    Reference: Jurrus E, et al. Improvements to the APBS biomolecular solvation software suite. Protein Sci, 27 (1), 112-128, 2018.

    Electrostatic Potential Calculation (APBS)

    简介

    静电势(ESP,electrostatic potential)表面是指在分子周围某个曲面上静电势的分布,通过静电势对蛋白质表面着色有助于识别带电分子或极性分子的结合位点。正电位区域与负电荷互补,而负电位区域与正电荷互补。蛋白质静电势对于蛋白质的稳定性、折叠、酶催化、蛋白质间相互作用以及与其他分子的结合等方面起着关键作用。APBS(Adaptive Poisson-Boltzmann Solver )是业界著名的计算生物大分子结构静电势能的工具。

    参数说明

    PDB File

    蛋白结构文件,PDB格式

    Output Format

    输出文件格式,支持DX或者CUBE

    结果说明

    输出静电势能结果文件potential.dx或者potential.cube,用于将静电势能渲染到蛋白表面上。

    参考文献

    • Jurrus E, et al. Improvements to the APBS biomolecular solvation software suite. Protein Sci, 27 (1), 112-128, 2018. https://doi.org/10.1002/pro.3280
    • Vascon, F, et al. Protein Electrostatics: From Computational and Structural Analysis to Discovery of Functional Fingerprints and Biotechnological Design. Comput. Struct. Biotechnol. J. 2020, 18, 1774–1789. https://doi.org/10.1016/j.csbj.2020.06.029

    Electrostatic Potential Calculation (APBS)

    Introduction

    Electrostatic potential (ESP) surfaces represent the distribution of electrostatic potential around a molecule on a given surface. Coloring the protein surface based on electrostatic potential helps identify binding sites for charged or polar molecules. Regions with positive potential complement negatively charged molecules, while regions with negative potential complement positively charged molecules. Protein electrostatic potential plays a crucial role in protein stability, folding, enzymatic catalysis, protein-protein interactions, and binding with other molecules. APBS (Adaptive Poisson-Boltzmann Solver) is a renowned tool for calculating the electrostatic potential of biological macromolecules.

    Parameter Description

    PDB File

    The protein structure file in PDB format.

    Output Format

    The format of the output file, supporting DX or CUBE.

    Result Description

    The output electrostatic potential result file, named potential.dx or potential.cube, can be used to render the electrostatic potential on the protein surface.
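
    For readers who want to inspect potential.dx outside a viewer, the grid layout sits in a short text header. The sketch below assumes the usual OpenDX scalar-grid header (a gridpositions line with counts, an origin line, and three delta lines) and reads only that header; the sample values are made up.

```python
def read_dx_grid(lines):
    """Return (counts, origin, deltas) from the header of an OpenDX scalar grid."""
    counts = None
    origin = None
    deltas = []
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if parts[0] == "object" and "gridpositions" in parts:
            counts = tuple(int(v) for v in parts[-3:])
        elif parts[0] == "origin":
            origin = tuple(float(v) for v in parts[1:4])
        elif parts[0] == "delta":
            deltas.append(tuple(float(v) for v in parts[1:4]))
        elif parts[0] == "object" and "array" in parts:
            break  # the potential values start after this line
    return counts, origin, deltas


# Made-up header for illustration.
header = [
    "# Data from a sample calculation",
    "object 1 class gridpositions counts 65 65 65",
    "origin -16.0 -16.0 -16.0",
    "delta 0.5 0.0 0.0",
    "delta 0.0 0.5 0.0",
    "delta 0.0 0.0 0.5",
    "object 2 class gridconnections counts 65 65 65",
    "object 3 class array type double rank 0 items 274625 data follows",
]

counts, origin, deltas = read_dx_grid(header)
print(counts, origin, deltas[0])
```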

    References

    • Jurrus E, et al. Improvements to the APBS biomolecular solvation software suite. Protein Sci, 27 (1), 112-128, 2018. https://doi.org/10.1002/pro.3280
    • Vascon, F, et al. Protein Electrostatics: From Computational and Structural Analysis to Discovery of Functional Fingerprints and Biotechnological Design. Comput. Struct. Biotechnol. J. 2020, 18, 1774–1789. https://doi.org/10.1016/j.csbj.2020.06.029
  • Name: Absolute Folding Stability
    Description: 通过蛋白序列生成模型ESM-IF,预测蛋白质的绝对稳定性ΔG Predicts the absolute stability ΔG of proteins using the protein sequence generation model ESM-IF
    Tags: undefined
    Author: Sergey Ovchinnikov
    Release: 2024-05-16 10:11:19
    Reference: Predicting absolute protein folding stability using generative models Matteo Cagiada, Sergey Ovchinnikov, Kresten Lindorff-Larsen bioRxiv 2024.03.14.584940

    Absolute Folding Stability

    简介

    通过蛋白序列逆折叠模型ESM-IF,预测蛋白质的绝对稳定性ΔG。
    传统的物理方法(如FoldX、Rosetta等)预测蛋白稳定性ΔG,依赖于高置信度结构pdb,如果突变太多,结构置信度降低,预测结果较差。在ProteinGym的benchmark结果表明,生成模型ESM-IF在zero-shot预测DMS数据的蛋白突变稳定性ΔΔG达到同类最佳水平。该方法是在突变预测基础上的延伸,利用ESM-IF模型直接预测完整蛋白折叠稳定性的绝对ΔG值。
    经过测试,预测误差RMSE ≈ 1.5 kcal/mol,相关系数为0.7,是预测蛋白质的折叠稳定性ΔG的重大突破。

    原理:
    f3245508-826b-45b5-9f82-d92ca9ea15f6.webp

    • xk : 蛋白某位点为氨基酸k时,使用ESM-IF计算的log-likelihood库
    • xj : 蛋白遍历20种氨基酸时,在该位点为j时,使用ESM-IF计算的log-likelihood
    • Lk:Softmax得到蛋白某位点为氨基酸k时,对稳定性的贡献大小

    然后,将蛋白质所有氨基酸位点的Lk加和,得到蛋白整体的log-likelihood。
    最后,通过线性整体log-likelihood与实验稳定性ΔG拟合得到拟合参数,根据a/b就可以将log-likelihood转换成蛋白稳定性ΔG了。

    模型预测效果如下图所示:
    在两个不同数据集的 265 种蛋白质的预测稳定性值和实验稳定性值进行了比较。Spearman相关系数 (ρs) 为0.69,误差RMSE约为1.36 kcal/mol,相关性较好。
    image.png
    与其他基线模型比较结果如下图所示:
    image.png

    参数说明

    Protein Structure (PDB)

    蛋白结构文件,PDB格式

    Protein Structure (TAR)

    多个蛋白结构PDB的压缩文件,支持格式:.zip、.tar、.tar.gz、.tgz、.tar.bz2、.tbz2、.tar.xz、.txz
    当同时上传蛋白结构PDB和压缩包时会合并计算。

    结果说明

    • 绝对稳定性计算结果CSV格式文件默认为predicted_folding_energy.csv,包含信息如下:
    列名 说明
    Name 结构名称
    Absolute_Folding_Stability (kcal/mol) dG,越大越好,代表去折叠状态能量减去折叠状态能量,即去折叠需要的能量值,通常为正值,能量越大表示需要能量越多,折叠状态越稳定

    参考文献

    Predicting absolute protein folding stability using generative models Matteo Cagiada, Sergey Ovchinnikov, Kresten Lindorff-Larsen bioRxiv 2024.03.14.584940; https://doi.org/10.1101/2024.03.14.584940

    Absolute Folding Stability

    Introduction

    The absolute folding stability ($\Delta G$) of a protein can be predicted using the inverse folding model ESM-IF. Traditional physical methods (such as FoldX, Rosetta, etc.) for predicting protein stability $\Delta G$ rely on high-confidence structure PDB files. If mutations are numerous, the structural confidence decreases, leading to poor prediction results. Benchmark results from ProteinGym show that the generative model ESM-IF achieves state-of-the-art performance in zero-shot prediction of protein mutation stability $\Delta \Delta G$ on DMS data. This method extends mutation prediction by using the ESM-IF model to directly predict the absolute $\Delta G$ value of the complete protein folding stability.

    Testing shows a prediction error RMSE of approximately 1.5 kcal/mol and a correlation coefficient of 0.7, marking a significant breakthrough in predicting the folding stability $\Delta G$ of proteins.

    Principle
    f3245508-826b-45b5-9f82-d92ca9ea15f6.webp

    • $x_k$: Log-likelihood calculated by ESM-IF when the residue at a given site is amino acid $k$.
    • $x_j$: Log-likelihood calculated by ESM-IF when the residue at that site is amino acid $j$, with $j$ running over all 20 amino acids.
    • $L_k$: Contribution to stability of amino acid $k$ at that site, obtained by applying Softmax.

    The log-likelihood of the entire protein is obtained by summing the $L_k$ values over all amino acid sites. Finally, a linear fit of this overall log-likelihood against experimental stability $\Delta G$ yields the fitting parameters $a$ and $b$, which are then used to convert the log-likelihood into a protein stability $\Delta G$.
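
    The procedure above can be sketched in a few lines. This is an illustration under stated assumptions, not the published implementation: it takes $L_k$ to be the logarithm of the Softmax probability of the native amino acid, assumes the linear conversion has the form $\Delta G = a \cdot L + b$, and uses placeholder values for $a$ and $b$ and made-up scores in place of real ESM-IF output.

```python
import math


def site_log_likelihood(scores, k):
    """Log of the softmax probability of amino acid k among the 20 scores at one site."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(x - top) for x in scores))
    return scores[k] - log_norm


def predicted_dg(per_site_scores, native_indices, a, b):
    """Sum the per-site terms and map the total to a stability with a linear fit.

    The form dG = a * L + b and the values of a and b are assumptions.
    """
    total = sum(
        site_log_likelihood(scores, k)
        for scores, k in zip(per_site_scores, native_indices)
    )
    return a * total + b


# Made-up scores for a two-residue toy protein (20 values per site).
site_scores = [[0.0] * 20, [0.0] * 19 + [2.0]]
native = [3, 19]

print(predicted_dg(site_scores, native, a=1.0, b=0.0))
```

    Sites where the native amino acid scores well above the alternatives contribute terms close to zero, while poorly supported residues pull the total down.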

    Model Prediction Performance
    The predicted stability values and experimental stability values for 265 proteins in two different datasets were compared. The Spearman correlation coefficient ($\rho_s$) is 0.69, and the error RMSE is about 1.36 kcal/mol, indicating good correlation.
    image.png

    Comparison with Other Baseline Models
    image.png

    Parameters

    Protein Structure (PDB)

    The protein structure file in PDB format.

    Protein Structure (TAR)

    Compressed archive containing multiple protein structure PDB files. Supported formats: .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz. When both a single PDB file and a compressed archive are uploaded, they are merged and calculated together.

    Results

    • The absolute stability calculation result is provided in a CSV format file “predicted_folding_energy.csv”, containing the following information:
    Column Name Description
    Name Structure name
    Absolute_Folding_Stability (kcal/mol) Delta G, the higher the better, representing the energy difference between the unfolded and folded states. It is usually a positive value, with higher values indicating greater stability in the folded state.

    企业微信截图_17201609906097.png

    References

    • Cagiada, M., Ovchinnikov, S., Lindorff-Larsen, K. Predicting absolute protein folding stability using generative models. bioRxiv 2024.03.14.584940; https://doi.org/10.1101/2024.03.14.584940
  • Name: Small Molecule Generation (REINVENT4)
    Description: 基于REINVENT4的小分子生成。支持多种分子生成方式:Reinvent - 从头开始创造新分子,Libinvent - 修饰一个骨架,Linkinvent - 设计两个片段之间的linker,Mol2Mol - 在用户定义的相似度范围内优化分子。 Small molecule de novo generation based on REINVENT4. REINVENT 4 enables and facilitates de novo design, R-group replacement, library design, linker design, scaffold hopping and molecule optimization.
    Tags: undefined
    Author: Hannes H. Loeffler
    Release: 2024-05-16 14:52:00
    Reference: Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). https://doi.org/10.1186/s13321-024-00812-5

    Small Molecule Generation (REINVENT4)

    简介

    De novo Generation (REINVENT4)是基于阿斯利康开源的REINVENT4算法用于小分子全新生成的模块。支持多种分子生成方式:Reinvent - 从头开始创造新类药分子,Libinvent - 修饰一个骨架,Linkinvent - 设计两个片段之间的linker,Mol2Mol - 在用户定义的相似度范围内优化分子。
    image.png

    参数说明

    Reinvent模式

    从头生成新分子

    Number Molecules

    生成的分子个数

    Output CSV File

    输出CSV文件名称

    Output SDF File

    输出SDF文件名称

    LibInvent模式

    对已有骨架结构进行修饰,生成含有该骨架结构的新分子。

    Small Molecule Structure

    小分子的骨架结构文件,该模式需要输入带 * 的小分子,SMILES或SDF格式,可以通过内嵌的wedraw工具来获得。
    image.png
    image.png
    image.png

    Number Molecules

    生成的分子个数。程序会按照该大小进行采样,随后自动过滤掉不符合设定片段拼接规则或子结构匹配要求的结果。因此,最终输出的有效样本数可能少于设定值。

    Output CSV File

    输出CSV文件名称

    Output SDF File

    输出SDF文件名称

    LinkInvent模式

    对两个结构片段进行连接,生成linker结构,获得新分子。

    Small Molecule Structure

    小分子的骨架结构文件,该模式需要输入带 * 的两个小分子,SMILES或SDF格式,可以通过内嵌的wedraw工具来获得(同LibInvent模式)。

    Number Molecules

    生成的分子个数。程序会按照该大小进行采样,随后自动过滤掉不符合设定片段拼接规则或子结构匹配要求的结果。因此,最终输出的有效样本数可能少于设定值。

    Output CSV File

    输出CSV文件名称

    Output SDF File

    输出SDF文件名称

    Mol2Mol模式

    优化分子结构,在用户定义的相似度范围内优化分子。

    Structure

    小分子的骨架结构文件,SMILES或SDF格式,可以通过内嵌的wedraw工具来获得。

    Number Molecules

    生成的分子个数,程序会按照该大小进行采样,随后自动过滤掉不符合设定片段拼接规则或子结构匹配要求的结果。因此,最终输出的有效样本数可能少于设定值。注意:它乘以输入分子的个数为最终输出总分子数。

    Mol2Mol Priors

    有5种不同的优化策略:

    1. Low_similarity:Tanimoto similarity > 0.5;
    2. Medium_similarity:0.5 < Tanimoto similarity < 0.7,通常表示中等程度的结构相似性;
    3. High_similarity:Tanimoto similarity > 0.7,表示高度相似的分子;
    4. Scaffold:要求分子具有相同的Murcko骨架,Murcko骨架是一种用于描述分子结构的核心骨架;
    5. Generic_scaffold:要求分子具有相同的未标记的Murcko骨架,指在Murcko骨架中未标记特定原子或功能团的结构。

    Sample Strategy

    beamsearch或者multinomial

    Temperature

    多项抽样中的温度

    Output CSV File

    输出CSV文件名称

    Output SDF File

    输出SDF文件名称

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 全新生成的化合物CSV文件,包含了SMILES信息
    denovo.sdf 全新生成的化合物SDF文件

    参考文献

    • Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). DOI:10.1186/s13321-024-00812-5

    Small Molecule Generation (REINVENT4)

    Introduction

    De novo Generation (REINVENT4) is a module based on AstraZeneca’s open-source REINVENT4 algorithm for generating new small molecules. It supports various molecule generation methods: Reinvent - creating new drug-like molecules from scratch, Libinvent - modifying a scaffold, Linkinvent - designing a linker between two fragments, and Mol2Mol - optimizing molecules within a user-defined similarity range.

    image.png

    Parameters

    Reinvent Mode

    De novo generation of new molecules.

    Number Molecules

    Number of molecules to generate.

    Output CSV File

    Name of the output CSV file.

    Output SDF File

    Name of the output SDF file.

    LibInvent Mode

    Modify an existing scaffold to generate new molecules containing that scaffold.

    Small Molecule Structure

    The scaffold structure file of the small molecule. This mode requires a small molecule with * placeholders. Supported formats: SMILES or SDF. The embedded wedraw tool can be used to create such molecules.
    image.png
    image.png
    image.png

    Number Molecules

    Number of molecules to generate. The program will sample according to this value and then automatically filter out results that do not satisfy the defined fragment assembly rules or substructure matching requirements. Therefore, the final number of valid output samples may be smaller than the set value.

    Output CSV File

    Name of the output CSV file.

    Output SDF File

    Name of the output SDF file.

    LinkInvent Mode

    Connect two structural fragments to form a linker structure and generate new molecules.

    Small Molecule Structure

    The scaffold structure files of the two small molecules. This mode requires two small molecules with * placeholders. Supported formats: SMILES or SDF. The embedded wedraw tool can be used to create such molecules (same as in LibInvent mode).

    Number Molecules

    Number of molecules to generate. The program will sample according to this value and then automatically filter out results that do not satisfy the defined fragment assembly rules or substructure matching requirements. Therefore, the final number of valid output samples may be smaller than the set value.

    Output CSV File

    Name of the output CSV file.

    Output SDF File

    Name of the output SDF file.

    Mol2Mol Mode

    Optimize molecular structures within a user-defined similarity range.

    Structure

    The structure file of the input molecule(s) to optimize. Supported formats: SMILES or SDF. The embedded wedraw tool can be used to create such molecules.

    Number Molecules

    Number of molecules to generate. The program will sample according to this value and then automatically filter out results that do not satisfy the defined fragment assembly rules or substructure matching requirements. Therefore, the final number of valid output samples may be smaller than the set value. Note: the final total number of output molecules is equal to this value multiplied by the number of input molecules.

    Mol2Mol Priors

    There are five different optimization strategies:

    1. Low_similarity: Tanimoto similarity > 0.5
    2. Medium_similarity: 0.5 < Tanimoto similarity < 0.7, usually indicates a moderate level of structural similarity
    3. High_similarity: Tanimoto similarity > 0.7, indicates highly similar molecules
    4. Scaffold: requires molecules to share the same Murcko scaffold (the ring systems and the linkers connecting them, with side chains removed), a commonly used representation of the molecular core structure
    5. Generic_scaffold: requires molecules to share the same generic Murcko scaffold, in which atom types and bond orders are ignored (all atoms treated as carbon and all bonds as single bonds)
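
    To make the three similarity ranges above concrete, the sketch below computes a Tanimoto coefficient on two sets of fingerprint bits and reports which ranges a value falls into. It is an illustration only: representing fingerprints as Python sets, the `tanimoto` and `matching_priors` names, and returning 0.0 for two empty fingerprints are all assumptions, and the module itself may use a different fingerprint type.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(bits_a & bits_b) / len(union)


# Similarity ranges as listed in the text above.
SIMILARITY_RANGES = {
    "Low_similarity": lambda s: s > 0.5,
    "Medium_similarity": lambda s: 0.5 < s < 0.7,
    "High_similarity": lambda s: s > 0.7,
}


def matching_priors(similarity):
    """Return the names of all similarity ranges that contain the value."""
    return [name for name, in_range in SIMILARITY_RANGES.items()
            if in_range(similarity)]
```

    Because Low_similarity is only bounded from below, a value of 0.8 satisfies both the Low_similarity and the High_similarity ranges as written.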

    Sample Strategy

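    The sampling strategy: beamsearch (beam search) or multinomial (random sampling from the model's predicted token distribution).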

    Temperature

    Temperature for multinomial sampling.
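
    As a general illustration of what a sampling temperature does, the sketch below divides token scores by the temperature before applying a Softmax and then draws one token. It is a generic example rather than REINVENT4 code; the function names and the toy logits are hypothetical.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax of logits / temperature; lower temperatures sharpen the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

    With a temperature below 1 the most likely token is chosen more often, which tends to give more conservative outputs; a temperature above 1 flattens the distribution and increases diversity.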

    Output CSV File

    Name of the output CSV file.

    Output SDF File

    Name of the output SDF file.

    Results

    The output includes:

    Output File Name Description
    result.csv CSV file containing newly generated compounds, including SMILES information
    denovo.sdf SDF file containing newly generated compounds

    References

    • Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). DOI:10.1186/s13321-024-00812-5
  • Name: Structural Energy
    Description: 基于物理模型(分子力学经验力场)计算多个蛋白结构的能量,并与参考蛋白的结构能量进行比较。 Calculate the energy of multiple protein structures based on a physical model (molecular mechanics empirical force field) and compare it with the reference protein.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-04-28 11:26:05
    Reference:

    Structural Energy

    简介

    该模块基于物理模型(分子力学经验力场)计算多个蛋白结构的能量,并与参考蛋白结构的能量进行比较。

    参数说明

    Target Structure

    多个蛋白结构PDB文件的压缩打包文件,TAR格式

    Reference Structure

    进行能量比对的参考蛋白结构,PDB格式

    结果说明

    • 能量比对的结果CSV格式文件‘energy_rank.csv’,包含信息如下:
    列名 说明
    Name 结构名称
    Score 能量打分,数值负得越多表示能量越低

    Structural Energy

    Introduction

    This module calculates the energy of multiple protein structures based on a physical model (empirical molecular force field) and compares these energies with the energy of a reference structure.

    Parameter Description

    Target Structures

    Compressed TAR file containing multiple protein structure PDB files.

    Reference Structure

    Reference structure in PDB format for energy comparisons.

    Result Description

    • The result of energy comparison is stored in a CSV file named ‘energy_rank.csv’, which includes the following information:
    Column Name Description
    Name Structure name
    Score Energy score, where a more negative value indicates lower energy
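
    A result file of this shape can be ranked with a few lines of Python. The sketch below assumes a comma-separated file with the `Name` and `Score` columns listed above; the function names are hypothetical, and any sample values passed to them are made up for illustration.

```python
import csv
import io


def rank_by_energy(csv_text):
    """Return the rows sorted from lowest (most negative) to highest Score."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda row: float(row["Score"]))


def lower_than_reference(rows, reference_score):
    """Names of the structures whose Score is below the reference score."""
    return [row["Name"] for row in rows if float(row["Score"]) < reference_score]
```

    In practice the text would be read from ‘energy_rank.csv’, and the reference score would be the energy reported for the reference structure.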
  • Name: Sequence Embedding Generation
    Description: 基于ESMFold预训练蛋白语言模型的序列向量化特征信息(embeddings)的提取,可用于下游序列性质(如突变对应的亲和力变化、稳定性变化,抗体序列可开发性等)预测任务。 Extract pre-trained protein language model ESMFold based sequences embeddings to predict downstream sequence properties (such as affinity changes and stability changes corresponding to mutations, developability of antibody sequences, etc.)
    Tags: undefined
    Author: Zeming Lin
    Release: 2024-03-25 17:13:30
    Reference: Zeming Lin et al. ,Evolutionary-scale prediction of atomic-level protein structure with a language model.Science379,1123-1130(2023).DOI:10.1126/science.ade2574

    Sequence Embedding Generation

    简介

    该模块基于ESM大规模预训练蛋白语言模型实现。提取序列的向量化特征信息(embeddings),可用于下游序列性质(如:突变对应的亲和力变化、稳定性变化,抗体序列可开发性等)预测任务,为判别模型的训练提供序列特征。
    ESM模型是通用蛋白质语言模型,采用UniRef50/90等序列数据库(数千万条序列)进行模型训练,提供了不同参数量(800万,3500万,1.5亿,6.5亿,30亿,150亿)的各类模型,可用于直接从蛋白序列预测结构、功能和其他蛋白质性质。如在结构预测中,ESM避免了对外部进化数据库、MSA和模板的需求,计算精度与AlphaFold2(存在MSA信息时)接近,无可用MSA信息时,计算精度ESM要显著优于AlphaFold2。计算速度比AlphaFold2快数十倍。

    参数说明

    Protein Sequence

    蛋白的序列文件,FASTA格式
    注意:多条序列时,序列名称应避免重复,模块会对重复的序列名称进行重命名,格式为“原序列名_数字”

    Model

    选择用于提取序列特征的模型,可用模型及特征维度说明如下:

    模型名称 参数量 特征维度 模型层数
    ESM1b_650M 650M 1280 33
    ESM1v_650M 650M 1280 33
    ESM2_8M 8M 320 6
    ESM2_35M 35M 480 12
    ESM2_150M 150M 640 30
    ESM2_650M 650M 1280 33
    ESM2_3B 3B 2560 36
    ESM2_15B 15B 5120 48

    备注:“M”表示Million(百万),“B”表示Billion(十亿),ESM-2-15B模型需要的GPU卡显存大小约为32GB

    结果说明

    每条序列会输出一个特征信息文件“序列名.pt”,包含了该序列的向量化特征信息,该特征信息由模型最后一层产生。多条序列会输出多个pt文件,并压缩为feats.tar压缩文件。
    特征信息文件可通过torch加载,如下:
    embs = torch.load(“序列名.pt”)
    embs[‘mean_representations’][‘模型层数’]

    参考文献

    Zeming Lin et al. ,Evolutionary-scale prediction of atomic-level protein structure with a language model.Science379,1123-1130(2023).DOI:10.1126/science.ade2574

    Sequence Embedding Generation

    Introduction

    This module is based on the ESM (Evolutionary Scale Modeling) large-scale pre-trained protein language model. It extracts vectorized feature information (embeddings) from sequences, which can be used for downstream sequence property prediction tasks such as changes in affinity and stability corresponding to mutations, developability of antibody sequences, etc., providing sequence features for discriminative model training.
    The ESM model is a general-purpose protein language model trained on sequence databases such as UniRef50/90 (tens of millions of sequences). It is available in several parameter sizes (8 million, 35 million, 150 million, 650 million, 3 billion, 15 billion) and can be used to predict protein structures, functions, and other protein properties directly from protein sequences. In structure prediction, ESM eliminates the need for external evolutionary databases, multiple sequence alignments (MSA), and templates. Its prediction accuracy is comparable to that of AlphaFold2 when MSA information is available and significantly better than that of AlphaFold2 when no MSA information is available. ESM is also tens of times faster than AlphaFold2.

    Parameter Description

    Protein Sequence

    The sequence file of the protein in FASTA format.
    Note: When multiple sequences are provided, sequence names should be unique to avoid duplication. The module will rename duplicated sequence names in the format “original_sequence_name_number”.
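
    The renaming rule in the note above can be sketched as follows. The details are assumptions: the first occurrence is taken to keep its original name and the counter is taken to start at 1, neither of which is stated here, and `deduplicate_names` is a hypothetical helper.

```python
def deduplicate_names(names):
    """Rename repeated sequence names to "<original name>_<number>"."""
    seen = {}
    renamed = []
    for name in names:
        count = seen.get(name, 0)
        seen[name] = count + 1
        # First occurrence keeps its name; later ones get a numeric suffix.
        renamed.append(name if count == 0 else f"{name}_{count}")
    return renamed
```

    This simple version does not check whether a generated name such as "seq_1" already exists in the input, so using unique names in the FASTA file remains the safer option.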

    Model

    Select the model used to extract sequence features. The available models and their feature dimensions are as follows:

    Model Name Parameters Feature Dimension Number of Layers
    ESM1b_650M 650M 1280 33
    ESM1v_650M 650M 1280 33
    ESM2_8M 8M 320 6
    ESM2_35M 35M 480 12
    ESM2_150M 150M 640 30
    ESM2_650M 650M 1280 33
    ESM2_3B 3B 2560 36
    ESM2_15B 15B 5120 48

    Note: “M” stands for Million, “B” stands for Billion. The ESM-2-15B model requires approximately 32GB of GPU memory.

    Result Description

    Each sequence will output a feature information file named “sequence_name.pt,” which contains the vectorized feature information of that sequence generated by the last layer of the model. For multiple sequences, multiple pt files will be output and compressed into a feats.tar file.
    The feature information file can be loaded using torch as follows:
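    import torch
    embs = torch.load("sequence_name.pt")  # load the feature file of one sequence
    num_layers = 33  # layer count of the selected model (see the table above); the key is assumed to be an integer
    embs["mean_representations"][num_layers]  # embedding produced by the last layer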

    References

    Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574

  • Name: Antibody NGS Analysis
    Description: 用于抗体NGS测序的DNA序列分析,具体分析内容包括:IGV、IGD、IGJ基因型标注;DNA序列翻译为氨基酸序列(抗体),并进行CDR识别;基于蛋白(抗体)语言模型(ESM/IgLM),分析不常见残基及优势突变;PTM(翻译后修饰)风险位点分析,标记低、高风险位点;序列特征计算(等电点pI,分子量kDa,疏水性);序列聚类分析;体系超突变率分析等。 This module is used for DNA sequence (antibody) analysis after NGS sequencing: IGV, IGD, IGJ clonotype annotation; amino acid sequence translation; antibody numbering and CDR recognition; uncommon residues and high frequency mutations idenfication using protein (antibody) language models (ESM, IgLM); PTM hot-spot liability analysis; Sequence-based physico-chemical property calculation including pI (isoelectric point), molecular weight, hydrophobicity index; sequence clustering; SHM (somatic hyper-mutation) rate calculation, etc.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-03-26 09:19:24
    Reference: Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model.Science379,1123-1130(2023). DOI:10.1126/science.ade2574 Shuai RW, Ruffolo JA, Gray JJ. IgLM: Infilling language modeling for antibody sequence design. Cell Syst. 2023 Nov 15;14(11):979-989.e4. Milot Mirdita, Martin Steinegger, Johannes Söding, MMseqs2 desktop and local web server app for fast, interactive sequence searches, Bioinformatics, Volume 35, Issue 16, August 2019, Pages 2856–2858 Kyte, J. and Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J Mol Biol, 157(1):105-132

    Antibody NGS Analysis

    简介

    该模块用于NGS测序后的DNA序列(抗体)分析,具体分析内容包括:

    • IGV、IGD、IGJ基因标注(IgBlast)
    • DNA序列翻译为氨基酸序列(抗体),并进行CDR识别
    • 基于蛋白(抗体)语言模型,分析不常见残基及优势突变(ESM,IgLM)
    • PTM(翻译后修饰)风险位点分析,标记低、高风险位点
    • 序列特征计算(等电点pI,分子量kDa,疏水性)
    • 序列聚类分析(MMseq2)

    Antibody NGS Analysis操作指南

    参数说明

    DNA

    DNA Sequence

    NGS测序后的DNA序列,FASTA/AB1格式

    注意:当前限制输入序列为1000条。

    Species

    物种类型,支持2种:HUMAN, MOUSE。默认为HUMAN

    Numbering Scheme

    编号规则,支持imgt, chothia, kabat

    Cluster

    氨基酸序列聚类方案,支持2种:full, cdr。‘full’表示使用全长序列进行聚类,‘cdr’表示使用CDR序列进行聚类(具体CDR位置在参数‘CDRs’中设定),默认为‘cdr’

    CDRs

    指定用于聚类的CDR区域,在‘Cluster’参数为cdr时生效。可选区域为(支持多选):CDR1,CDR2,CDR3。默认选择CDR3。

    Identity

    聚类中采用的序列一致性数值,范围在0-1之间,默认值为0.5

    Vgene

    聚类前是否要求IGV基因名称一致的序列归为一组,默认为False

    Output

    输出结果文件名,默认为NGS_res.csv

    Protein

    Protein Sequence

    NGS测序后的蛋白序列,FASTA格式
    注意:当前限制输入序列为1000条。

    Species

    物种类型,支持2种:HUMAN, MOUSE。默认为HUMAN

    Numbering Scheme

    编号规则,支持imgt, chothia, kabat

    Cluster

    氨基酸序列聚类方案,支持2种:full, cdr。‘full’表示使用全长序列进行聚类,‘cdr’表示使用CDR序列进行聚类(具体CDR位置在参数‘CDRs’中设定),默认为‘cdr’

    CDRs

    指定用于聚类的CDR区域,在‘Cluster’参数为cdr时生效。可选区域为(支持多选):CDR1,CDR2,CDR3。默认选择CDR3。

    Identity

    聚类中采用的序列一致性数值,范围在0-1之间,默认值为0.5

    Vgene

    聚类前是否要求IGV基因名称一致的序列归为一组,默认为False

    Output

    输出结果文件名,默认为NGS_res.csv

    结果说明

    输出result.csv结果文件,包含以下信息:

    列名 说明 备注
    ID 序列名称
    DNA_Seq DNA序列
    Protein_Seq 翻译后的氨基酸序列
    Chain 链类型:IGH/IGK/IGL
    CDR1_AA CDR1的氨基酸序列
    CDR2_AA CDR2的氨基酸序列
    CDR3_AA CDR3的氨基酸序列
    CDR1_Length CDR1的氨基酸序列长度
    CDR2_Length CDR2的氨基酸序列长度
    CDR3_Length CDR3的氨基酸序列长度
    Unusual_Residue(ESM) 基于ESM模型的不常见残基及优势突变 如:'V11L’表示序列中第11位的V是模型判定的该位置不常见残基,L为模型判定的该位置优势突变残基
    Unusual_Residue(IgLM) 基于IgLM模型的不常见残基及优势突变 同上
    V_Gene_First 匹配的首个IGV基因名称。 IGV基因名称可能存在多个匹配,这里列出首个。注:输入为蛋白序列时,该字段忽略。
    V_Gene IGV基因名称 如同时匹配多个基因名,用‘;’分隔
    D_Gene IGD基因名称 同上,注:输入为蛋白序列时,该字段忽略。
    J_Gene IGJ基因名称 同上,注:输入为蛋白序列时,该字段忽略。
    CDR1_Highrisk_Hotspots CDR1中的PTM高风险位点 如:‘NG(1)’表示高风险位点‘NG’出现1次
    CDR2_Highrisk_Hotspots CDR2中的PTM高风险位点 同上
    CDR3_Highrisk_Hotspots CDR3中的PTM高风险位点 同上
    CDR1_Lowrisk_Hotspots CDR1中的PTM低风险位点 同上
    CDR2_Lowrisk_Hotspots CDR2中的PTM低风险位点 同上
    CDR3_Lowrisk_Hotspots CDR3中的PTM低风险位点 同上
    Mutations(AA) 与Germline序列比对所对应的突变,并标注了突变所在区域(FR或CDR),多个突变用分号分隔 如: 'V29I(CDR1)'表示编号29的残基存在突变,其中Germline序列中残基是V,当前抗体序列中残基为I,根据抗体编号规则所在的区域为CDR1
    SHM(AA) 基于氨基酸序列计算得到的体系超突变率 SHM: Somatic hypermutation,计算方式是将当前序列与Germline参考序列进行比对,序列突变总数量与序列长度的比值即为SHM
    SHM(NA) 基于DNA序列计算得到的体系超突变率 同上,注:输入为蛋白序列时,该字段忽略。
    pI 等电点
    kDa 分子量(千道尔顿)
    Hydrophobicity 疏水性指数 序列各氨基酸的Kyte-Doolittle疏水指数之和,主要用来快速粗略比较近似序列的相对疏水程度高低
    Pre_Cluster_Group 聚类分析中的组别名称 序列聚类前先进行序列分组,各组内序列再进行聚类分析。当选择CDR聚类时,CDR序列长度一致的序列归为一组。组别名称由各聚类参数组合而成,如:组名为‘8_8_18’,表示该组由CDR1,2,3长度分别为8,8,18的多条序列组成。如果分组参数设定要求IGV基因名称一致,则IGV基因名称也会出现在组别名称中,如:‘8_8_18_IGKV1-12*01’
    Cluster_ID 序列所属类别的名称 如:‘2_3’表示第2组第3个类别
    Cluster_Size 序列所属类别包含的序列数目 如:‘5’表示该类别含有5条序列
    Cluster_Center 序列是否为聚类中心 '1’表示是,‘0’表示不是
    Cluster_Ident 聚类后的类别中,成员序列与聚类中心序列的序列一致性 聚类时,如果选择全长序列聚类,这里即为全长序列的一致性;如选择CDR进行聚类,则为选中的CDR区域序列的整体一致性
    Cluster_CDR1_Ident 聚类后的类别中,成员序列与聚类中心序列的CDR1序列的一致性
    Cluster_CDR2_Ident 聚类后的类别中,成员序列与聚类中心序列的CDR2序列的一致性
    Cluster_CDR3_Ident 聚类后的类别中,成员序列与聚类中心序列的CDR3序列的一致性
    Unique_ID 唯一序列编号 从 1 开始按出现顺序递增,表示该序列所属的唯一序列簇。若 CDR3 区域差异 ≥ 1 个残基,则判定为不同序列;或 CDR1 + CDR2 + CDR3 区域的总差异 ≥ 3 个残基,也判定为不同序列;若上述条件均不满足,则判定为相同序列。
    Dup_Count Unique_ID 对应的序列在原始数据中出现的重复次数

    输出进化树信息,为打包文件tree.tar,包含多个进化树文件tree_clusterXXX.txt,每个进化树文件包含该聚类类别(cluster)中所有成员序列CDR区域的进化分析结果。

    风险位点说明:
    image.png
    其中打勾标记的位点NXS, NXT, NG, DHK, DG, DD和Cys共7个位点为默认的潜在PTM高风险位点,通常需重点关注,其余为低风险位点。

    参考文献

    1. Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model.Science379,1123-1130(2023).
    2. Shuai RW, Ruffolo JA, Gray JJ. IgLM: Infilling language modeling for antibody sequence design. Cell Syst. 2023 Nov 15;14(11):979-989.e4.
    3. Milot Mirdita, Martin Steinegger, Johannes Söding, MMseqs2 desktop and local web server app for fast, interactive sequence searches, Bioinformatics, Volume 35, Issue 16, August 2019, Pages 2856–2858.
    4. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982 May 5;157(1):105-32.

    Antibody NGS Analysis

    Introduction

    This module analyzes antibody DNA sequences obtained from NGS sequencing. The analysis includes:

    • IGV, IGD, and IGJ gene annotation (IgBLAST)
    • Translation of DNA sequences into amino acid (antibody) sequences, with CDR identification
    • Identification of uncommon residues and preferred mutations using protein (antibody) language models (ESM, IgLM)
    • PTM (post-translational modification) hotspot analysis, flagging low- and high-risk sites
    • Sequence property calculation (isoelectric point pI, molecular weight in kDa, hydrophobicity)
    • Sequence clustering (MMseqs2)

    Parameter

    DNA

    DNA Sequence

    DNA sequences from NGS sequencing, in FASTA or AB1 format.
    Note: input is currently limited to 1000 sequences.

    Species

    Species type. Two options are supported: HUMAN and MOUSE. The default is HUMAN.

    Numbering Scheme

    Numbering scheme. Supported values: imgt, chothia, kabat.

    Cluster

    Clustering scheme for the amino acid sequences. Two options are supported: full and cdr. ‘full’ clusters on the full-length sequence; ‘cdr’ clusters on the CDR sequences (the CDRs used are set in the ‘CDRs’ parameter). The default is ‘cdr’.

    CDRs

    The CDR regions used for clustering; takes effect when ‘Cluster’ is set to ‘cdr’. Multiple choices are supported: CDR1, CDR2, CDR3. The default is CDR3.

    Identity

    The sequence identity threshold used for clustering, in the range 0 to 1. The default is 0.5.

    Vgene

    Whether sequences must share the same IGV gene name to be placed in the same group before clustering. The default is False.

    Output

    Name of the result file. The default is NGS_res.csv.

    Protein

    Protein Sequence

    Protein sequences from NGS sequencing, in FASTA format.
    Note: input is currently limited to 1000 sequences.

    Species

    Species type. Two options are supported: HUMAN and MOUSE. The default is HUMAN.

    Numbering Scheme

    Numbering scheme. Supported values: imgt, chothia, kabat.

    Cluster

    Clustering scheme for the amino acid sequences. Two options are supported: full and cdr. ‘full’ clusters on the full-length sequence; ‘cdr’ clusters on the CDR sequences (the CDRs used are set in the ‘CDRs’ parameter). The default is ‘cdr’.

    CDRs

    The CDR regions used for clustering; takes effect when ‘Cluster’ is set to ‘cdr’. Multiple choices are supported: CDR1, CDR2, CDR3. The default is CDR3.

    Identity

    The sequence identity threshold used for clustering, in the range 0 to 1. The default is 0.5.

    Vgene

    Whether sequences must share the same IGV gene name to be placed in the same group before clustering. The default is False.

    Output

    Name of the result file. The default is NGS_res.csv.

    Result

    Export the result file result.csv, which includes the following information:

    Field Name Description Notes
    ID Sequence name
    DNA_Seq DNA sequence
    Protein_Seq Translated amino acid sequence
    Chain Chain type: IGH/IGK/IGL
    CDR1_AA Amino acid sequence of CDR1
    CDR2_AA Amino acid sequence of CDR2
    CDR3_AA Amino acid sequence of CDR3
    CDR1_Length Length of CDR1 amino acid sequence
    CDR2_Length Length of CDR2 amino acid sequence
    CDR3_Length Length of CDR3 amino acid sequence
    Unusual_Residue(ESM) Uncommon residues and dominant mutations based on the ESM model e.g., ‘V11L’ indicates that the V at position 11 in the sequence is determined by the model to be an uncommon residue, and L is determined by the model to be a dominant mutation residue at that position
    Unusual_Residue(IgLM) Uncommon residues and dominant mutations based on the IgLM model Same as above
    V_Gene_First Name of the first matching IGV gene There may be multiple matches for the IGV gene name; the first is listed here. Note: this field is ignored when the input is a protein sequence.
    V_Gene Name of the IGV gene If multiple gene names match, they are separated by ‘;’
    D_Gene Name of the IGD gene Same as above. Note: this field is ignored when the input is a protein sequence.
    J_Gene Name of the IGJ gene Same as above. Note: this field is ignored when the input is a protein sequence.
    CDR1_Highrisk_Hotspots PTM high-risk sites in CDR1 e.g., ‘NG(1)’ indicates that the high-risk site ‘NG’ appears once
    CDR2_Highrisk_Hotspots PTM high-risk sites in CDR2 Same as above
    CDR3_Highrisk_Hotspots PTM high-risk sites in CDR3 Same as above
    CDR1_Lowrisk_Hotspots PTM low-risk sites in CDR1 Same as above
    CDR2_Lowrisk_Hotspots PTM low-risk sites in CDR2 Same as above
    CDR3_Lowrisk_Hotspots PTM low-risk sites in CDR3 Same as above
    Mutations(AA) Mutations relative to the germline sequence, annotated with the region in which each occurs (FR or CDR); multiple mutations are separated by semicolons e.g., ‘V29I(CDR1)’ indicates a mutation at residue number 29, where the germline residue is V and the residue in the current antibody sequence is I; according to the antibody numbering scheme, the position lies in CDR1
    SHM(AA) Somatic hypermutation rate calculated from the amino acid sequence SHM: somatic hypermutation. The current sequence is aligned with the germline reference sequence; SHM is the ratio of the total number of mutations to the sequence length
    SHM(NA) Somatic hypermutation rate calculated from the DNA sequence Same as above. Note: this field is ignored when the input is a protein sequence.
    pI Isoelectric point
    kDa Molecular weight (kilodalton)
    Hydrophobicity Hydrophobicity index The sum of the Kyte-Doolittle hydrophobicity indices of all amino acids in the sequence, mainly used for a quick, rough comparison of the relative hydrophobicity of similar sequences
    Pre_Cluster_Group Group name in cluster analysis Before clustering, sequences are divided into groups, and clustering is then performed within each group. When CDR clustering is selected, sequences with the same CDR lengths are grouped together. The group name is composed of the clustering parameters, e.g., ‘8_8_18’ indicates that the group consists of sequences whose CDR1, CDR2, and CDR3 lengths are 8, 8, and 18, respectively. If grouping is set to require the same IGV gene name, the IGV gene name also appears in the group name, e.g., ‘8_8_18_IGKV1-12*01’
    Cluster_ID Name of the category to which the sequence belongs e.g., ‘2_3’ indicates the third category in the second group
    Cluster_Size Number of sequences contained in the category e.g., ‘5’ indicates that this category contains 5 sequences
    Cluster_Center Whether the sequence is a cluster center ‘1’ indicates yes, ‘0’ indicates no
    Cluster_Ident Sequence identity between the member sequence and the cluster center sequence If full-length clustering is selected, this is the identity of the full-length sequences; if CDR clustering is selected, it is the overall identity of the selected CDR regions
    Cluster_CDR1_Ident Sequence identity between the CDR1 of the member sequence and the CDR1 of the cluster center sequence
    Cluster_CDR2_Ident Sequence identity between the CDR2 of the member sequence and the CDR2 of the cluster center sequence
    Cluster_CDR3_Ident Sequence identity between the CDR3 of the member sequence and the CDR3 of the cluster center sequence
    Unique_ID Unique sequence ID A unique sequence identifier, starting from 1 and incremented in order of appearance, representing the cluster to which the sequence belongs. Sequences are considered different if the CDR3 region differs by ≥ 1 residue, or if the total number of differences across CDR1 + CDR2 + CDR3 is ≥ 3 residues. If neither condition is met, sequences are considered identical.
    Dup_Count The number of times the sequence associated with the same Unique_ID appears in the original dataset.
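
    The Unique_ID and Dup_Count rules in the last two rows of the table can be sketched as follows. Several details are assumptions: differences are counted position by position with any length difference added, and each new sequence is compared with the first sequence of every existing Unique_ID in order of appearance. The function names are hypothetical.

```python
from collections import Counter


def count_diffs(a, b):
    """Residue differences: positional mismatches plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def same_sequence(cdrs_a, cdrs_b):
    """Apply the rule from the table to two (CDR1, CDR2, CDR3) tuples."""
    d1, d2, d3 = (count_diffs(x, y) for x, y in zip(cdrs_a, cdrs_b))
    if d3 >= 1:            # any CDR3 difference means a different sequence
        return False
    if d1 + d2 + d3 >= 3:  # three or more differences across all CDRs
        return False
    return True


def assign_unique_ids(records):
    """Return (Unique_ID list, Dup_Count list), with IDs starting from 1."""
    representatives = []
    ids = []
    for cdrs in records:
        for uid, rep in enumerate(representatives, start=1):
            if same_sequence(cdrs, rep):
                ids.append(uid)
                break
        else:  # no existing ID matched, so open a new one
            representatives.append(cdrs)
            ids.append(len(representatives))
    counts = Counter(ids)
    return ids, [counts[uid] for uid in ids]
```

    Under this rule two sequences share a Unique_ID only when their CDR3 regions are identical and CDR1 and CDR2 together differ by at most two residues.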

    Phylogenetic tree information is also output as an archive named tree.tar, which contains multiple tree files named tree_clusterXXX.txt. Each tree file holds the phylogenetic analysis results for the CDR regions of all member sequences in that cluster.
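
    Two of the per-sequence columns in the result table, Hydrophobicity and SHM, follow simple formulas that can be sketched in a few lines. The Kyte-Doolittle values are the published scale; the function names are hypothetical, and the SHM sketch assumes the two sequences are already aligned, of equal length, and free of gaps.

```python
# Kyte-Doolittle hydropathy scale (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_index(sequence):
    """Sum of the Kyte-Doolittle indices of all residues in the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence)


def shm_rate(sequence, germline):
    """Number of mutated positions divided by the sequence length."""
    if len(sequence) != len(germline):
        raise ValueError("this sketch expects pre-aligned sequences of equal length")
    mutations = sum(a != b for a, b in zip(sequence, germline))
    return mutations / len(sequence)
```

    Because the hydrophobicity index is a plain sum, it grows with sequence length, which is why it is best suited to comparing sequences of similar length.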

    Risk Site Description:
    image.png
    The default potential PTM high-risk sites marked with check marks include NXS, NXT, NG, DHK, DG, DD, and Cys, totaling 7 sites. These sites typically require special attention, while the rest are considered low-risk sites.
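
    A scan for the seven default high-risk sites can be sketched with regular expressions. The motif interpretations are assumptions: X in NXS and NXT is treated as any residue, DHK is treated as a literal tripeptide, and overlapping occurrences are counted separately. The labels follow the ‘NG(1)’ format described in the result table, and `high_risk_hotspots` is a hypothetical helper.

```python
import re

# Label -> regular expression for each default high-risk site.
HIGH_RISK_MOTIFS = {
    "NXS": "N.S",
    "NXT": "N.T",
    "NG": "NG",
    "DHK": "DHK",
    "DG": "DG",
    "DD": "DD",
    "Cys": "C",
}


def high_risk_hotspots(cdr_sequence):
    """Return entries such as 'NG(1)' for every motif found in the sequence."""
    found = []
    for label, pattern in HIGH_RISK_MOTIFS.items():
        # A lookahead matches at every start position, so overlaps are counted.
        count = len(re.findall(f"(?={pattern})", cdr_sequence))
        if count:
            found.append(f"{label}({count})")
    return found
```

    For example, the sequence ANGSC contains NGS, which matches both the NXS and the NG patterns, plus one cysteine.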

    Reference

    1. Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023).
    2. Shuai RW, Ruffolo JA, Gray JJ. IgLM: Infilling language modeling for antibody sequence design. Cell Syst. 2023 Nov 15;14(11):979-989.e4.
    3. Milot Mirdita, Martin Steinegger, Johannes Söding, MMseqs2 desktop and local web server app for fast, interactive sequence searches, Bioinformatics, Volume 35, Issue 16, August 2019, Pages 2856–2858.
    4. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982 May 5;157(1):105-32.
  • Name: Human Fragment BLAST
    Description: 基于输入的九肽, 在人源片段库(Germline, TCR, NextProt, OAS)中搜索最相似的9肽。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Blast -> Human Fragment BLAST。 Searching the Germline, TCR, NextProt, OAS for the most similar 9 peptides based on inputs of 9 peptides. It is recommended to use in the WeSeq: WeSeq -> Blast -> Human Fragment BLAST.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-03-06 12:01:50
    Reference:

    Human Fragment BLAST

    简介

    基于输入的9肽片段, 在人源片段库中搜索最相似的9肽片段。

    人源片段库来源:

    • Germline
    • OAS (≥25% subjects)
    • TCR
    • NextProt

    参数说明

    Peptide Fragment

    九肽片段,多个肽段用逗号分隔,例如:
    NFFWHLHFP,GKGITLSVR,TPEALFVMT,GGIPIINCA,CVAIAEDRK

    Minimun

    相同氨基酸的最小数量(相同位置),默认为7。

    Output File

    输出文件名称

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    Query 原始9肽
    Identity 9肽中相同(保守)氨基酸的数目,越大越好,例如8代表有1个突变
    Target 匹配到的9肽
    DiffMask 以*号标记氨基酸差异的位置
    From 生成片段的来源数据库

    Human Fragment BLAST

    Introduction

    Human Fragment BLAST takes 9-mer peptide fragments as input and searches the human fragment libraries (Germline, TCR, NextProt, and OAS with ≥25% subjects) for the most similar 9-mer peptides.

    Parameter

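    Peptide Fragment

    9-mer peptide fragments. Separate multiple peptides with commas, for example:
    NFFWHLHFP,GKGITLSVR,TPEALFVMT,GGIPIINCA,CVAIAEDRK

    Minimun

    The minimum number of identical amino acids (at the same positions). The default is 7.

    Output File

    Name of the output file.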

    Result

    The output file is result.csv and contains the following information:

    Field Name Description
    Query Original 9-mer peptide
    Identity Number of identical (conserved) amino acids between the query and the matched 9-mer peptide. Higher is better; for example, 8 means there is 1 mutation.
    Target The matched 9-mer peptide
    DiffMask Positions where the amino acids differ, marked with *
    From The source database of the matched fragment
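
    The Identity and DiffMask columns can be illustrated with a short sketch. It is not the module's search code: the library passed to `search` is a hypothetical in-memory list, the default threshold of 7 mirrors the Minimun parameter, and the use of ‘-’ for matching positions in the mask is an assumption, since only the ‘*’ marker is documented.

```python
def compare_9mers(query, target):
    """Return (identity, diff_mask) for two 9-mer peptides."""
    if len(query) != 9 or len(target) != 9:
        raise ValueError("both peptides must have exactly 9 residues")
    identity = sum(q == t for q, t in zip(query, target))
    # '*' marks a differing position; '-' (assumed here) marks a match.
    diff_mask = "".join("-" if q == t else "*" for q, t in zip(query, target))
    return identity, diff_mask


def search(queries, library, minimum=7):
    """Collect hits with at least `minimum` identical positions."""
    hits = []
    for query in (q.strip() for q in queries.split(",")):
        for target in library:
            identity, diff_mask = compare_9mers(query, target)
            if identity >= minimum:
                hits.append((query, identity, target, diff_mask))
    return hits
```

    The `queries` argument uses the same comma-separated format as the Peptide Fragment parameter.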
  • Name: Protein Structure Prediction (RaptorX-Single)
    Description: 基于RaptorX-Single的单链蛋白结构预测,当预测的蛋白序列有大量同源序列时,RaptorX-Single的预测结果也优于AlphaFold2。 RaptorX-Single based single sequence protein structure prediction. RaptorX-Single also outperforms AlphaFold2 when predicting protein sequences with a large number of homologous sequences.
    Tags: undefined
    Author: Xiaoyang Jing
    Release: 2024-03-04 16:21:12
    Reference: RaptorX-Single: exploring the advantage of single sequence based protein structure prediction. Xiaoyang Jing, Fandi Wu, Jinbo Xu. bioRxiv 2023.04.24.538081

    Protein Structure Prediction (RaptorX-Single)

    简介

    该模块基于RaptorX-Single算法实现,RaptorX-Single是一种基于单一序列的蛋白质结构预测方法,无需multiple sequence alignment(MSA)信息。它集成了多个蛋白质语言模型和一个结构生成模块,研究结果表明,RaptorX-Single除了比AlphaFold2等基于MSA的方法运行得更快之外,在预测抗体结构、极少同源序列的蛋白和单突变效应方面也优于AlphaFold2和其他无MSA的方法。当预测的蛋白序列有大量同源序列时,RaptorX-Single的预测结果也优于AlphaFold2。
    RaptorX-Single的神经网络架构:
    image.png
    对抗体结构预测精度比较:
    image.png

    参数说明

    Sequence File

    普通蛋白或抗体序列文件(不超过1000个氨基酸),FASTA格式,如:
    >Nb21
    MAQVQLVESGGGLVQAGGSLRLSCAVSGLGAHRVGWFRRAPGKEREFVAAIGANGGNTNYLDSVKGRFTISRDNAKNTIYLQMNSLKPQDTAVYYCAARDIETAEYTYWGQGTQVTVS

    注意:

    只支持预测单链蛋白或抗体,如果FASTA文件有多条链,每条链会单独预测为一个PDB结构。

    Model for Prediction

    选择预测结构时使用的模型,有两个模型可供选择:
    protein表示蛋白模型,对应RaptorX-Single-ESM1b-ESM1v-ProtTrans.pt;
    antibody表示抗体模型,对应RaptorX-Single-ESM1b-ESM1v-ProtTrans-Ab.pt。
    如果预测蛋白,请选择前者,如果预测抗体,请选择后者

    结果说明

    输出结果包括:

    输出文件名称 说明
    first.pdb 默认输出第一条序列的预测结构。
    structs.tar 针对含有多条序列的fasta文件,压缩包中含所有的序列的预测结构。

    参考文献

    RaptorX-Single: exploring the advantage of single sequence based protein structure prediction. Xiaoyang Jing, Fandi Wu, Jinbo Xu. bioRxiv 2023.04.24.538081
    https://doi.org/10.1101/2023.04.24.538081

    Protein Structure Prediction (RaptorX-Single)

    Introduction

    The module is implemented based on the RaptorX-Single algorithm, a single-sequence-based protein structure prediction method that does not require multiple sequence alignment (MSA) information. It integrates multiple protein language models and a structure generation module. The reported results show that RaptorX-Single, in addition to running faster than MSA-based methods such as AlphaFold2, also outperforms AlphaFold2 and other MSA-free methods in predicting antibody structures, proteins with very few homologous sequences, and single mutation effects. RaptorX-Single also compares favorably with AlphaFold2 when the protein sequence under prediction has a large number of homologous sequences.
    [Figure: Network architecture of RaptorX-Single]
    [Figure: Comparison of antibody structure prediction accuracy]

    Parameter

    Sequence File

    Protein or antibody sequence file (not more than 1000 amino acids) in FASTA format, example:
    >Nb21
    MAQVQLVESGGGLVQAGGSLRLSCAVSGLGAHRVGWFRRAPGKEREFVAAIGANGGNTNYLDSVKGRFTISRDNAKNTIYLQMNSLKPQDTAVYYCAARDIETAEYTYWGQGTQVTVS

    Note:

    This module only supports single-chain proteins or antibodies. If the FASTA file contains multiple chains, each chain is predicted separately and written as its own PDB structure.
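
The input rules above (FASTA format, a 1000-residue limit, one predicted structure per record) can be checked before submission. The following is a minimal sketch, not part of the module itself: the helper names `read_fasta` and `too_long` and the second record `chain2` are hypothetical, and the 1000-residue limit is assumed to apply to each sequence individually.

```python
from typing import List, Tuple

MAX_LEN = 1000  # limit stated for the Sequence File parameter (assumed per sequence)


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (name, sequence) records."""
    records: List[Tuple[str, str]] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def too_long(records: List[Tuple[str, str]], max_len: int = MAX_LEN) -> List[str]:
    """Return the names of records longer than max_len residues."""
    return [name for name, seq in records if len(seq) > max_len]


example = """>Nb21
MAQVQLVESGGGLVQAGGSLRLSCAVSGLGAHRVGWFRRAPGKEREFVAAIGANGGNTNYLDSVKGRFTISRDNAKNTIYLQMNSLKPQDTAVYYCAARDIETAEYTYWGQGTQVTVS
>chain2
QVQLVESGGG
"""
records = read_fasta(example)
print([name for name, _ in records])  # each record would be predicted separately
print(too_long(records))              # names that exceed the length limit
```

Each name in the first list corresponds to one record that the module would predict as a separate PDB structure.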

    Model for Prediction

    Select the model used for structure prediction. Two models are available:
    ‘protein’ is the protein model, corresponding to RaptorX-Single-ESM1b-ESM1v-ProtTrans.pt;
    ‘antibody’ is the antibody model, corresponding to RaptorX-Single-ESM1b-ESM1v-ProtTrans-Ab.pt.
    Choose the former when predicting proteins and the latter when predicting antibodies.

    Result

    The output includes:

    File Name Description
    first.pdb Predicted structure of the first sequence (output by default).
    structs.tar For FASTA files with multiple sequences, an archive containing the predicted structures of all sequences.

    Reference

    RaptorX-Single: exploring the advantage of single sequence based protein structure prediction. Xiaoyang Jing, Fandi Wu, Jinbo Xu. bioRxiv 2023.04.24.538081
    https://doi.org/10.1101/2023.04.24.538081

  • Name: Germline AA Distribution Frequency
    Description: 输出抗体各位置的germline的氨基酸频率分布,可按指定的germline基因家族分别输出(通常关注与目标序列同家族germline基因的频率分布情况)。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Frequency -> Frequency/Likelihood Analysis -> Caculate -> Germline。 Outputs the amino acid frequency distribution of the germline at each position of the antibody. It can output the distribution separately according to the specified germline gene family (usually focusing on the frequency distribution of the germline genes in the same family as the target sequence). It is recommended to use in the WeSeq: WeSeq -> Frequency -> Frequency/Likelihood Analysis -> Caculate -> Germline.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-01-26 00:00:00
    Reference:

    Germline AA Distribution Frequency

    简介

    该模块输出指定的germline基因家族(部分或全部)的各位置的氨基酸频率分布,以供突变设计参考。

    输入方式1

    输入一条抗体序列(多条序列时只处理第一条序列)。
    程序根据输入序列进行BLAST,判断其对应的基因家族,如IGHV1。
    再输出对应家族的germline基因的AA频率分布。

    输入方式2

    不输入序列,则直接输出勾选的链类型(Group选项)或基因家族(Single选项)对应的germline的频率分布。

    其中:
    若勾选某Group,仅统计对应类型(kappa, lambda, heavy)的所有家族germline的频率分布。
    若勾选Single中的某个family(如IGHV1),只输出指定的germline基因家族的AA频率分布(因为通常仅关注与目标序列同家族germline基因的频率分布情况,与我们序列不同家族的其他germline的频率分布的参考意义不大)。

    输出

    抗体各位置的germline的氨基酸频率分布。

    Germline AA Distribution Frequency

    Introduction

    This module outputs the amino acid frequency distribution at each position of the specified germline gene family (partially or entirely) for reference in mutation design.

    Input Method 1

    Input an antibody sequence (if multiple sequences are provided, only the first sequence is processed).
    The program uses BLAST to determine the corresponding gene family of the input sequence, such as IGHV1.
    Then it outputs the amino acid frequency distribution of the corresponding germline genes in that family.

    Input Method 2

    If no sequence is provided, the module directly outputs the frequency distribution of the selected chain type (Group option) or gene family (Single option) of germline genes.

    Specifically:

    • If a Group is selected, it will only calculate the frequency distribution of all germline genes of the corresponding type (kappa, lambda, heavy).
    • If a specific family is selected in the Single option (e.g., IGHV1), it will only output the amino acid frequency distribution of the specified germline gene family (as typically only the frequency distribution of germline genes from the same family as the target sequence is of interest, and the frequency distribution of germline genes from different families has limited relevance to our sequence design).

    Output

    The amino acid frequency distribution of germline genes at each position in the antibody.
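
To illustrate what a per-position amino acid frequency distribution looks like, here is a minimal sketch. It assumes the germline sequences of a family are already aligned to equal length; the module's actual alignment and numbering procedure is not described here, and the function name `position_frequencies` and the toy sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def position_frequencies(seqs: List[str]) -> List[Dict[str, float]]:
    """Return, for each aligned position, the frequency of every residue seen there."""
    if not seqs:
        raise ValueError("no sequences given")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    freqs = []
    for column in zip(*seqs):  # iterate over alignment columns
        counts = Counter(column)
        freqs.append({aa: n / len(seqs) for aa, n in counts.items()})
    return freqs


# Toy stand-in for the aligned germline genes of one family.
family = ["QVQ", "QVK", "EVQ", "QVQ"]
for pos, dist in enumerate(position_frequencies(family), start=1):
    print(pos, dist)
```

Positions where one residue dominates are conserved in the family; positions with several residues at comparable frequency are candidates where a mutation stays close to germline.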

  • Name: AA Probability Prediction
    Description: 基于预训练的大规模蛋白质语言模型,预测序列中每个氨基酸(AA)位置处20种AA出现的概率。与进化上更保守的AA类似,语言模型预测的高概率AA,有利于提升结构的稳定性、改善蛋白的折叠、提升蛋白质的表达能力等、甚至提升亲和力,比随机盲目突变具有潜在的优势。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Frequency -> Frequency/Likelihood Analysis -> Caculate -> ESM2/ESM1B/IgLM/ESMIF/AntiFold/Nanobidy。 Leveraging pre-trained large-scale protein language models to predict the likelihood of each of the twenty amino acids appearing at any given position within a sequence. Comparable to the structurally conservative amino acids found in evolution, those with high probability predictions from the language model are beneficial in enhancing the protein's stability, fostering more efficient protein folding, augmenting its expression capacity, and potentially elevating its affinity. It is recommended to use in the WeSeq: WeSeq -> Frequency -> Frequency/Likelihood Analysis -> Caculate -> ESM2/ESM1B/IgLM/ESMIF/AntiFold/Nanobidy.
    Tags: undefined
    Author: WECOMPUT
    Release: 2024-01-23 20:07:02
    Reference: Zeming Lin et al. ,Evolutionary-scale prediction of atomic-level protein structure with a language model.Science379,1123-1130(2023).DOI:10.1126/science.ade2574 Shuai et al., 2023, Cell Systems 14, 979–989. doi:10.1016/j.cels.2023.10.001

    AA Probability Prediction

    简介

    基于预训练的大规模蛋白质语言模型(也叫做PLM或pLLM),预测序列中每个氨基酸(AA)位置处20种AA出现的概率。与进化上更保守的AA类似,语言模型预测的高概率AA,有利于提升结构的稳定性、改善蛋白的折叠、提升蛋白质的表达能力等、甚至提升亲和力,比随机盲目突变具有潜在的优势。相比于基于MSA序列统计的PSSM,语言模型的预测速度更快,更多地考虑了序列内AA之间的相互作用,自身的变化也更敏感。

    该模块基于ESM、IgLM等大规模预训练蛋白(抗体)语言模型实现。

    • ESM为基于序列的PLM,适用于蛋白包括抗体;
    • IgLM为基于序列的PLM,只适用于抗体,可以指定种属(比如人);
    • All in One同时使用ESM与IgLM进行计算;
    • ESMIF为结构感知的PLM,适用于蛋白包括抗体;
    • AntiFold为基于ESMIF使用抗体数据微调的模型,更适用于抗体或纳米抗体。
      没有结构的时候,可以使用ESM、IgLM等纯序列模型;有结构或者预测了结构,可以使用结构感知的模型,在稳定性、亲和力等跟局部结构相关性更强的任务上表现更好。

    蛋白质语言模型介绍

    目前WeMol中集成了多个PLM大模型,并基于PLM开发了多种应用,涉及的PLM模型如下:
    image.png

    ESM模型

    ESM模型是一个通用蛋白质语言模型,主要采用UniRef序列数据库进行模型训练,提供了不同参数量(800万,3500万,1.5亿,6.5亿,30亿,150亿)的各类模型,可用于直接从蛋白序列预测结构、功能和其他蛋白质性质。ESM在预测蛋白结构时避免了对外部进化数据库、MSA和模板的需求,计算精度与AlphaFold2(存在MSA信息时)接近(无可用MSA信息时,计算精度ESM要显著优于AlphaFold2),计算速度比AlphaFold2快数十倍。模块中采用150亿参数的ESM2模型。
    image.png

    IgLM模型

    IgLM是一种用于构建合成抗体库的深度生成语言模型。与利用单向上下文生成序列的方法相比,IgLM 基于自然语言中的文本输入进行抗体设计。因此它能利用双向上下文重新设计抗体序列。IgLM基于5.58亿条抗体重链和轻链可变序列进行训练,并根据每个序列的链类型和来源物种进行了调整。
    image.png

    ESMIF模型

    ESMIF逆折叠模型旨在根据蛋白质主链原子坐标预测蛋白序列。该模型使用AlphaFold2预测的1200万个蛋白质结构进行训练,包含不变几何输入处理层,随后是一个序列到序列的Transformer,对于在结构上保持不变的主干序列实现51%的本地序列恢复率,对于埋藏残基实现72%的恢复率。该模型还经过跨度屏蔽训练,能够容忍缺失的主链坐标,因此可以预测部分被屏蔽结构的序列。

    AntiFold模型

    AntiFold是使用抗体结构数据对ESMIF模型进行fine-tune微调得到,其在抗体CDR区序列恢复方面优于其他逆折叠工具,设计序列与已解析的序列具有高度结构相似性。此外,它在预测抗体-抗原结合亲和力时具有更强的相关性,同时在包括抗原信息的情况下性能会进一步增强。AntiFold为破坏与抗原结合的抗体残基突变给与低概率,并显示出在指导抗体优化的同时保留结构相关特性的前景。

    Nanobody模型

    该模型用于预测纳米抗体序列中每个残基位置的20种残基出现的概率。模型采用类似AntiBerta(基于BERT的抗体语言模型)的网络架构,使用纳米抗体的序列数据集,进行模型训练得到。序列数据集包含开源序列与商业序列(未开源)两部分,其中开源序列整合了来自专利、NCBI GenBank、Protein Data Bank(PDB)以及科学出版物中的纳米抗体序列(约2.1万条),商业序列是基于新一代测序(NGS)技术,对多个商业研发项目进行测序得到的序列(约1100万条)。

    参数说明

    ESM

    Protein Sequence

    蛋白序列,如:QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    如果是抗体,请将重链、轻链序列分开预测。

    Model

    模型类型,可选esm2模型或者esm1b模型。

    IgLM

    Protein Sequence

    蛋白序列,如:QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    如果是抗体,请将重链、轻链序列分开预测。

    Chain Type

    抗体链类型,H表示重链,L表示轻链

    Species

    物种类型,支持6种:HUMAN, CAMEL, MOUSE, RABBIT, RAT, RHESUS。

    ESMIF

    PDB File

    蛋白结构,pdb格式。

    Threshold

    残基概率的阈值,概率大于该阈值的突变残基会输出到突变列表文件。

    Regions

    定义的残基区域,区域内突变概率大于阈值的残基,其突变信息会输出到突变列表文件,残基区域的格式为链名:残基区域,残基区域即指定PDB文件中的残基编号(注意是PDB文件中带有的残基索引编号,起始编号可能不为1),多个残基用逗号分隔,指定残基范围用横杠符号,如A:24,28,32-40 表示残基区域为蛋白A链的24/28/32至40号残基。
    支持定义多个残基区域,每行定义一个,如:

    A:24,28,32-40
    B:12-24
    

    AntiFold

    PDB File

    抗体/纳米抗体,及与抗原的复合物结构文件,PDB格式。

    Antigen Chain

    填写输入pdb结构中的抗原链名。

    注意:如果文件中有多个抗体/纳米抗体,识别按顺序排的最后一个。

    Nanobody

    Nanobody Sequence

    纳米抗体序列(序列长度不超过198个残基),如:

    seq
    QLVSGPEVKKPGASVKVSCKASGYIFNNYGISWVRQAPGQGLEWMGWISTDNGNTNYAQKVQGRVTMTTDTSTSTAYMELRSLRYDDTAVYYCATNWGSYFEHWGQGTLVTVSS
    只能提交单链序列,且序列长度不得超过198个残基。

    All in One

    一次性调用所有可用模型。可接受结构或者序列作为输入,任选其一即可,有结构时,优先采用结构输入。

    Sequence

    蛋白/抗体序列,FASTA格式。

    Structure

    蛋白/抗体结构,pdb或cif格式。

    Numbering Type

    抗体编号规则,支持Kabat, Chothia和IMGT,默认为Kabat。

    Chain

    当输入结构时,指定输出特定链的预测结果,使用链名,如:A,支持多链,使用逗号分隔,如:A,B。

    Species

    物种类型,支持6种:HUMAN,CAMEL,MOUSE,RABBIT,RAT,RHESUS
    注意:该参数仅对 IgLM 模型生效

    结果说明

    ESM、IgLM以及Nanobody

    输出result.csv结果文件,包含以下信息:

    字段名称 说明
    WT 序列中的初始AA
    POS AA的位置索引(从1开始)
    Consensus 该位置出现概率最大的AA
    L,A,G,V… 该位置每种AA出现的概率

    输出chain_score.csv结果文件,包含以下信息:

    字段名称 说明
    Name 序列名称
    Chain_Score 序列打分,是序列中每个位置残基的预测概率的算术平均值

    ESMIF和AntiFold

    输出result.csv结果文件,包含以下信息:

    字段名称 说明
    Chain PDB结构中的链名称
    WT PDB结构中的初始AA
    Pos PDB文件中的AA位置索引
    Consensus 该位置出现概率最大的AA
    L,A,G,V… 该位置每种AA出现的概率

    输出突变列表文件mutation_list.txt,包含突变信息:
    每行一个突变信息,格式为GA1S,G表示野生型残基,A表示链名A,1表示PDB结构中的残基编号,S表示突变后的残基。

    All in One

    All in One模式中,输出所有可用模型的预测结果(每种模型的预测结果见上述描述)。
    输出所有结果的打包文件 all.tar.gz
    输出两个合并的CSV文件:
    AA_allinone_mutation.csv,包含信息如下:

    字段名称 说明
    Chain 链名称,输入为fasta时,按顺序对应A,B,C…,输入为结构时,对应链名
    Mutation 突变信息,格式为WT残基+顺序位置+突变残基
    dP(ESM)/dP(IgLM)/dP(AntiFold)/dP(ESMIF)/dP(Nanobody) 模型预测的该位置突变残基出现概率与WT残基出现概率的差值,即 P(突变残基)-P(WT残基),数值为正时,表示该位置,突变残基的出现概率大于WT残基的出现概率,为优势突变,数值越大优势越大。

    AA_allinone_pos.csv,包含信息如下:

    字段名称 说明
    Chain 链名称,输入为fasta时,按顺序对应A,B,C…,输入为结构时,对应链名
    Pos 残基的位置索引
    WT 该位置的初始AA
    dP(ESM)/dP(IgLM)/dP(AntiFold)/dP(ESMIF)/dP(Nanobody) 模型预测的该位置突变残基概率优于WT残基概率的所有残基类型和对应的概率值。

    参考文献

    1, Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model.Science379,1123-1130(2023).DOI:10.1126/science.ade2574
    https://www.science.org/doi/abs/10.1126/science.ade2574
    2, Shuai et al., 2023, Cell Systems 14, 979–989.
    https://doi.org/10.1016/j.cels.2023.10.001

    AA Probability Prediction

    Introduction

    Using pre-trained large-scale protein language models (PLMs, also written pLLMs), this module predicts the probability of each of the 20 amino acids (AAs) at every position in a sequence. Like evolutionarily conserved AAs, the high-probability AAs predicted by a language model tend to favor structural stability, protein folding, expression, and in some cases affinity, which gives them a potential advantage over random, blind mutation. Compared with PSSMs derived from MSA statistics, language models predict faster, take more account of interactions between AAs within the sequence, and are more sensitive to changes in the sequence itself.

    This module is based on large-scale pre-trained protein (antibody) language models such as ESM and IgLM.

    • ESM is a sequence-based PLM, applicable to proteins including antibodies;
    • IgLM is a sequence-based PLM, applicable to antibodies only; a species (e.g., human) can be specified;
    • All in One runs the available models (such as ESM and IgLM) in one job;
    • ESMIF is a structure-aware PLM, applicable to proteins including antibodies;
    • AntiFold is an ESMIF-based model fine-tuned on antibody data and better suited to antibodies and nanobodies.

    When no structure is available, use a pure sequence model such as ESM or IgLM. When an experimental or predicted structure is available, a structure-aware model can be used; these perform better on tasks that depend strongly on local structure, such as stability and affinity.

    Protein Language Model Overview

    WeMol integrates several large PLMs and offers a range of applications built on them. The PLMs involved are listed below:
    image.png

    ESM Model

    ESM is a general-purpose protein language model trained mainly on the UniRef sequence database. It is available in several sizes (8 million, 35 million, 150 million, 650 million, 3 billion, and 15 billion parameters) and can be used to predict structure, function, and other protein properties directly from sequence. When predicting protein structures, ESM needs no external evolutionary databases, MSAs, or templates. Its accuracy is close to that of AlphaFold2 when MSA information is available and significantly better than AlphaFold2 when no MSA is available, and it runs tens of times faster than AlphaFold2. This module uses the 15-billion-parameter ESM2 model.
    image.png

    IgLM Model

    IgLM is a deep generative language model for building synthetic antibody libraries. Unlike methods that generate sequences from unidirectional context only, IgLM formulates antibody design as a sequence generation task modeled on text infilling in natural language, which lets it use bidirectional context when redesigning spans of an antibody sequence. IgLM was trained on 558 million antibody heavy- and light-chain variable sequences, conditioned on the chain type and species of origin of each sequence.
    image.png

    ESMIF Model

    The ESMIF inverse folding model aims to predict protein sequences from their backbone atom coordinates. Trained on 12 million protein structures predicted by AlphaFold2, the ESMIF model consists of invariant geometric input processing layers followed by a sequence-to-sequence transformer. It achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues. The model is also trained with span masking to tolerate missing backbone coordinates and can predict sequences for partially masked structures.

    AntiFold Model

    AntiFold is obtained by fine-tuning ESMIF on antibody structure data. It outperforms other inverse folding tools in sequence recovery for antibody CDR regions, and its designed sequences show high structural similarity to the experimentally solved structures. It also correlates more strongly with antibody-antigen binding affinity, and its performance improves further when antigen information is included. AntiFold assigns low probabilities to mutations of antibody residues that disrupt antigen binding, and shows promise for guiding antibody optimization while retaining structure-related properties.

    Nanobody Model

    This model predicts the probability of each of the 20 residue types at every position in a nanobody sequence. It uses a network architecture similar to AntiBERTa (a BERT-based antibody language model) and is trained on nanobody sequence data. The dataset has two parts: open-source sequences (about 21,000, collected from patents, NCBI GenBank, the Protein Data Bank (PDB), and scientific publications) and commercial sequences that are not open source (about 11 million, obtained by next-generation sequencing (NGS) in multiple commercial R&D projects).

    Parameters

    ESM

    Protein Sequence

    Protein sequence, e.g., QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    If it is an antibody, predict heavy and light chain sequences separately.

    Model

    Model type: either esm2 or esm1b.

    IgLM

    Protein Sequence

    Protein sequence, e.g., QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS
    If it is an antibody, predict heavy and light chain sequences separately.

    Chain Type

    Antibody chain type, H for heavy chain, L for light chain.

    Species

    Species type, supports 6 types: HUMAN, CAMEL, MOUSE, RABBIT, RAT, RHESUS.

    ESMIF

    PDB File

    Protein structure, in pdb format.

    Threshold

    The threshold for residue probability. Mutated residues with probabilities exceeding this threshold will be output to the mutation list file.

    Regions

    Defined residue regions. Mutation information for residues within these regions, whose mutation probability exceeds the threshold, will be output to the mutation list file. The format for residue regions is Chain:ResidueRegion, where ResidueRegion specifies the residue indices in the PDB file (note that the indices are the residue indices as they appear in the PDB file, which may not start from 1). Multiple residues can be separated by commas, and residue ranges can be specified using a hyphen, e.g., A:24,28,32-40 represents residues 24, 28, and 32 to 40 of chain A in the protein.
    Multiple residue regions can be defined, with each region on a separate line, e.g.:

    A:24,28,32-40  
    B:12-24  
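
The Chain:ResidueRegion syntax described above can be expanded into explicit residue numbers as follows. This is a minimal sketch rather than the module's own parser: the function name `parse_regions` is hypothetical, and it assumes plain positive integer residue numbers (no insertion codes such as 100A and no negative numbers).

```python
from typing import Dict, List, Set


def parse_regions(text: str) -> Dict[str, List[int]]:
    """Expand lines such as 'A:24,28,32-40' into {chain: sorted residue numbers}."""
    regions: Dict[str, Set[int]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        chain, _, spec = line.partition(":")
        chain = chain.strip()
        if not chain or not spec.strip():
            raise ValueError(f"expected 'Chain:ResidueRegion', got {line!r}")
        residues = regions.setdefault(chain, set())
        for item in spec.split(","):
            item = item.strip()
            if "-" in item:
                # A hyphen denotes an inclusive range of residue numbers.
                start, end = (int(x) for x in item.split("-", 1))
                if start > end:
                    raise ValueError(f"descending range in {line!r}")
                residues.update(range(start, end + 1))
            else:
                residues.add(int(item))
    return {chain: sorted(nums) for chain, nums in regions.items()}


parsed = parse_regions("A:24,28,32-40\nB:12-24")
print(parsed["A"])
print(parsed["B"])
```

The numbers are taken as written in the PDB file, so a chain whose numbering starts at 20 is still addressed with its PDB numbers rather than with 1-based offsets.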
    

    AntiFold

    PDB File

    Structure files of antibodies/nanobodies and their complexes with antigens, in PDB format.

    Antigen Chain

    Enter the antigen chain name in the input PDB structure.

    Note: If the file contains multiple antibodies/nanobodies, only the last one in file order is recognized.

    Nanobody

    Nanobody Sequence

    Sequence of Nanobody, such as:

    seq
    QLVSGPEVKKPGASVKVSCKASGYIFNNYGISWVRQAPGQGLEWMGWISTDNGNTNYAQKVQGRVTMTTDTSTSTAYMELRSLRYDDTAVYYCATNWGSYFEHWGQGTLVTVSS
    Only single-chain sequences can be submitted, and the sequence length must not exceed 198 residues.

    All in One

    Runs all available models in a single job. Either a structure or a sequence can be provided as input; if both are available, the structure input takes priority.

    Sequence

    Protein/antibody sequence in FASTA format.

    Structure

    Protein/antibody structure in PDB or CIF format.

    Numbering Type

    Antibody numbering schemes, supporting Kabat, Chothia, and IMGT. The default scheme is Kabat.

    Chain

    When a structure is provided, specify the chain(s) for which prediction results should be generated.
    Use chain IDs such as A. Multiple chains are supported and should be separated by commas, e.g., A,B.

    Species

    Species type. Six options are supported: HUMAN, CAMEL, MOUSE, RABBIT, RAT, RHESUS.
    Note: This parameter is effective only for the IgLM model.

    Results

    ESM, IgLM and Nanobody

    Output result.csv file containing the following information:

    Field Name Description
    WT Initial AA in the sequence
    POS Position index of the AA (starting from 1)
    Consensus Most probable AA at that position
    L, A, G, V… Probability of each AA appearing at that position

    Output chain_score.csv file containing the following information:

    Field Name Description
    Name Sequence name
    Chain_Score Sequence score, the arithmetic mean of predicted probabilities of residues at each position in the sequence
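
The relationship between result.csv and Chain_Score can be sketched as below. This is an illustration under assumptions: it reads "the predicted probability of the residue at each position" as the probability assigned to the WT residue, and the comma-separated demo table with only three amino acid columns is made up for brevity.

```python
import csv
import io


def chain_score(result_csv: str) -> float:
    """Arithmetic mean, over all positions, of the probability predicted for the WT residue."""
    rows = list(csv.DictReader(io.StringIO(result_csv)))
    if not rows:
        raise ValueError("result table is empty")
    # For each row, look up the probability column named after that row's WT residue.
    probs = [float(row[row["WT"]]) for row in rows]
    return sum(probs) / len(probs)


# Hypothetical two-position table with a reduced set of amino acid columns.
demo = (
    "WT,POS,Consensus,A,G,S\n"
    "A,1,A,0.5,0.25,0.25\n"
    "G,2,S,0.25,0.25,0.5\n"
)
print(chain_score(demo))
```

In the demo, position 2 has Consensus S while WT is G, so the WT probability (0.25) rather than the consensus probability enters the mean.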

    ESMIF and AntiFold

    Output result.csv file containing the following information:

    Field Name Description
    Chain Chain name in the PDB structure
    WT Initial AA in the PDB structure
    Pos Position index of the AA in the PDB file
    Consensus Most probable AA at that position
    L, A, G, V… Probability of each AA appearing at that position

    Output mutation list file mutation_list.txt containing mutation information:
    One mutation per line in the format GA1S, where G is the wild-type residue, A is the chain name, 1 is the residue number in the PDB structure, and S is the mutant residue.

    All in One

    In All in One mode, prediction results from all available models are output
    (see descriptions above for each model’s output).

    Output files include a packaged archive all.tar.gz containing all results,
    and two merged CSV files:

    AA_allinone_mutation.csv contains the following fields:

    Field Description
    Chain Chain identifier; for FASTA input, chains are labeled A, B, C… in order; for structure input, corresponds to chain names in the PDB file
    Mutation Mutation information in format WT_residue+position+mutant_residue
    dP(ESM)/dP(IgLM)/dP(AntiFold)/dP(ESMIF)/dP(Nanobody) Difference between predicted probability of mutant residue and WT residue at this position, calculated as P(mutant) - P(WT). Positive values indicate the mutant residue has higher predicted probability than WT (advantageous mutation); larger values indicate greater advantage.

    AA_allinone_pos.csv contains the following fields:

    Field Description
    Chain Chain identifier; for FASTA input, chains are labeled A, B, C… in order; for structure input, corresponds to chain names in the PDB file
    Pos Residue position index
    WT Wild-type amino acid at this position
    dP(ESM)/dP(IgLM)/dP(AntiFold)/dP(ESMIF)/dP(Nanobody) For each model, lists all residue types with predicted probabilities superior to WT at this position, along with their corresponding probability values.
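
The dP value in both tables is a simple difference of probabilities. The sketch below shows the arithmetic for one position of one model; the function names `delta_p` and `advantageous` and the four-residue probability table are hypothetical, and the real files cover all 20 residue types and several models.

```python
from typing import Dict, List, Tuple


def delta_p(wt: str, pos: int, probs: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (mutation, dP) pairs with dP = P(mutant) - P(WT), largest dP first."""
    p_wt = probs[wt]
    out = [(f"{wt}{pos}{aa}", p - p_wt) for aa, p in probs.items() if aa != wt]
    return sorted(out, key=lambda item: item[1], reverse=True)


def advantageous(wt: str, pos: int, probs: Dict[str, float]) -> List[Tuple[str, float]]:
    """Keep only mutations whose predicted probability exceeds that of the WT residue."""
    return [(mut, d) for mut, d in delta_p(wt, pos, probs) if d > 0]


# Hypothetical probabilities at position 5, where the WT residue is A.
probs = {"A": 0.25, "G": 0.5, "S": 0.125, "V": 0.125}
print(delta_p("A", 5, probs))       # the kind of values listed per mutation
print(advantageous("A", 5, probs))  # the kind of entries listed per position
```

Mutation labels follow the WT_residue+position+mutant_residue format from the table, so A5G means alanine at position 5 replaced by glycine.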

    References

    1. Zeming Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574
      https://www.science.org/doi/abs/10.1126/science.ade2574
    2. Shuai et al., 2023, Cell Systems 14, 979–989.
      https://doi.org/10.1016/j.cels.2023.10.001
  • Name: Immune Protein Structure Prediction
    Description: 基于ImmuneBuilder深度学习模型,预测抗体(ABodyBuilder2)、纳米抗体(NanoBodyBuilder2)和T细胞受体(TCRBuilder2)的结构。精度高且比AF2快得多。 ImmuneBuilder is a set of deep learning models that accurately predict the structures of antibodies (ABodyBuilder2), nanobodies (NanoBodyBuilder2), and T-cell receptors (TCRBuilder2). ImmuneBuilder generates structures with state-of-the-art accuracy while being much faster than AlphaFold2.
    Tags: undefined
    Author: ImmuneBuilder
    Release: 2023-10-19 10:50:28
    Reference: Abanades, B., Wong, W.K., Boyles, F. et al. ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins. Commun Biol 6, 575 (2023).

    Immune Protein Structure Prediction (ImmuneBuilder2)

    简介

    Immune Protein Structure Prediction模块是基于ImmuneBuilder的免疫蛋白结构预测模块。ImmuneBuilder是一组深度学习模型,可以准确预测抗体(ABodyBuilder2)、纳米抗体(NanoBodyBuilder2)和T细胞受体(TCRBuilder2)的结构;ImmuneBuilder生成的结构精度高,同时比AlphaFold2快得多。

    参数说明

    Immune Protein Sequence File

    抗体、纳米抗体或者TCR的序列文件,FASTA格式。
    支持多条序列一次性计算,相应的序列顺序需满足以下要求:
    对于抗体序列,每个抗体的重、轻链为一组,相邻放置即可(先后顺序没有要求),示例如下:

    >seq1.H
    xxxxxxxxxxxx
    >seq1.L
    xxxxxxxxx
    >seq2.H
    xxxxxxxxxxxx
    >seq2.L
    xxxxxxxxx
    

    对于TCR序列,每个TCR的alpha、beta链为一组,相邻放置即可(先后顺序没有要求),示例如下

    >seq1.A
    xxxxxxx
    >seq1.B
    xxxxxxx
    >seq2.A
    xxxxxxx
    >seq2.B
    xxxxxxx
    

    对于纳米抗体没有特殊要求。

    Type

    预测蛋白结构类型:Antibody、Nanobody以及TCR。

    Numbering Scheme

    抗体编号类型,支持kabat、chothia、imgt、raw。
    注意:raw 并不是一种特定的抗体编号规则。选择 raw 时,输出的 PDB 文件将按照结构中残基在原始文件中的位置顺序进行编号,而不会应用任何其他抗体编号体系或重编号规则。

    Output File

    输出文件名称,默认结构名称为model.pdb。

    结果说明

    输出结果为预测的免疫蛋白pdb结构,默认名称为model.pdb。
    可以进行批量生成结构文件,所有文件在model.tar压缩文件中。

    参考文献

    • Abanades, B., Wong, W.K., Boyles, F. et al. ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins. Commun Biol 6, 575 (2023).DOI:10.1038/s42003-023-04927-7

    Immune Protein Structure Prediction (ImmuneBuilder2)

    Introduction

    The Immune Protein Structure Prediction module is based on ImmuneBuilder and predicts the structures of immune proteins. ImmuneBuilder is a set of deep learning models that accurately predict the structures of antibodies (ABodyBuilder2), nanobodies (NanoBodyBuilder2), and T cell receptors (TCRBuilder2). ImmuneBuilder generates highly accurate structures while running much faster than AlphaFold2.

    Parameter Description

    Immune Protein Sequence File

    Sequence file of the antibody, nanobody, or TCR in FASTA format.
    Multiple sequences can be processed in one run, provided the sequence order meets the following requirements:
    For antibodies, the heavy and light chains of each antibody form a pair and must be placed next to each other (their order within the pair does not matter), as shown below:

    >seq1.H
    xxxxxxxxxxxx
    >seq1.L
    xxxxxxxxx
    >seq2.H
    xxxxxxxxxxxx
    >seq2.L
    xxxxxxxxx
    

    For TCRs, the alpha and beta chains of each TCR form a pair and must be placed next to each other (their order within the pair does not matter), as shown below:

    >seq1.A
    xxxxxxx
    >seq1.B
    xxxxxxx
    >seq2.A
    xxxxxxx
    >seq2.B
    xxxxxxx
    

    There are no specific naming requirements for nanobody sequences.
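
The adjacency rule for antibody and TCR input can be checked with a few lines of code. The sketch below pairs consecutive FASTA records two by two, which is how the ordering requirement reads; the helpers `pair_adjacent` and `base_name` are hypothetical, and comparing the name prefix before the last dot is only a sanity check inspired by the example names (seq1.H, seq1.L), not a documented requirement.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (name, sequence)


def pair_adjacent(records: List[Record]) -> List[Tuple[Record, Record]]:
    """Group consecutive records two by two: (1st, 2nd), (3rd, 4th), ..."""
    if len(records) % 2:
        raise ValueError("expected an even number of records (two chains per molecule)")
    return [(records[i], records[i + 1]) for i in range(0, len(records), 2)]


def base_name(name: str) -> str:
    """Strip a trailing chain suffix such as '.H' or '.L' (naming taken from the example)."""
    return name.rsplit(".", 1)[0]


# Placeholder sequences; the order inside each pair does not matter.
records = [
    ("seq1.H", "EVQLVESGGG"),
    ("seq1.L", "DIQMTQSPSS"),
    ("seq2.L", "DIVMTQSPDS"),
    ("seq2.H", "QVQLQESGPG"),
]
pairs = pair_adjacent(records)
for first, second in pairs:
    # Hypothetical sanity check: both chains of a pair share the same molecule prefix.
    assert base_name(first[0]) == base_name(second[0])
print([(a[0], b[0]) for a, b in pairs])
```

An odd number of records, or a heavy chain followed by the heavy chain of another antibody, would indicate that the file does not meet the ordering requirement.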

    Type

    Type of protein structure to predict: Antibody, Nanobody, or TCR.

    Numbering Scheme

    Antibody numbering scheme, supporting Kabat, Chothia, IMGT, and raw.
    Note: raw is not a specific antibody numbering scheme. When raw is selected, residues in the output PDB file are numbered according to their positional order in the original file, without applying any antibody numbering or renumbering rules.

    Output File

    Name of the output file; the default is model.pdb.

    Results

    The output is the predicted immune protein structure in PDB format, named model.pdb by default.
    When multiple structures are generated in one batch, all files are packed into model.tar.

    References

    • Abanades, B., Wong, W.K., Boyles, F. et al. ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteins. Commun Biol 6, 575 (2023).DOI:10.1038/s42003-023-04927-7
  • Name: Nanobody Humanization (Llamanade)
    Description: Llamanade基于NGS数据库和高分辨率结构,系统分析了Nbs的序列和结构特性,进而确定了可能有助于提高溶解度、结构稳定性和抗原结合的保守残基,以促进Nbs的人源化的理性设计,已成功应用于一批结构多样、强效的SARS-CoV-2中和Nbs人源化工作。对给定的Nbs进行全面人源化分析只需不到一分钟时间。 Llamanade based on NGS databases and high-resolution structures, which systematically analyzes the sequence and structural properties of Nbs. A large amount of framework diversity was revealed and key differences between Nbs and human immunoglobulin G (IgG) antibodies were highlighted. Conserved residues that may contribute to improved solubility, structural stability, and antigen-binding were identified to facilitate the rational humanization of Nbs. It has been successfully applied to humanize a group of structurally diverse and potent SARS-CoV-2 neutralized Nbs. It takes less than a minute to perform a comprehensive humanization analysis of a given Nbs.
    Tags: undefined
    Author: Zhe Sang
    Release: 2024-01-11 00:00:00
    Reference: Sang Z, Xiang Y, Bahar I, Shi Y. Llamanade: An open-source computational pipeline for robust nanobody humanization. Structure. 2022, doi: 10.1016/j.str.2021.11.006

    Nanobody Humanization

    简介

    纳米抗体(Nanobody, Nbs)是最近出现的一类很有前景的生物医学和治疗应用抗体片段。尽管Nbs具有显著的理化特性,但它来自于驼科动物,可能需要 "人源化"才能提高临床试验的转化潜力。该模块基于Llamanade实现。Llamanade基于NGS(下一代测序)数据库和高分辨率结构,系统分析了Nbs的序列和结构特性。揭示了大量的框架多样性,并强调了Nbs与人类免疫球蛋白G(IgG)抗体之间的关键差异。确定了可能有助于提高溶解度、结构稳定性和抗原结合的保守残基,以促进Nbs的合理人源化。模块以Nbs序列为输入,提供序列特征、模型结构等信息,并优化Nbs人源化的解决方案。对给定的Nbs进行全面人源化分析只需不到一分钟时间。已成功应用于一批结构多样、强效的SARS-CoV-2中和Nbs人源化工作。
    image.png

    参数说明

    Nanobody Sequence

    纳米抗体的序列,fasta格式,如:

    Nb21
    MAQVQLVESGGGLVQAGGSLRLSCAVSGLGAHRVGWFRRAPGKEREFVAAIGANGGNTNYLDSVKGRFTISRDNAKNTIYLQMNSLKPQDTAVYYCAARDIETAEYTYWGQGTQVTVS

    结果说明

    输出humanized_data.csv结果文件,包含以下信息:
    Position:残基编号
    Original AA:原来残基
    Humanized?: 是否需要人源化,True表示需要,False表示不需要
    Humanized AA: 人源化后的残基
    备注:抗体编号方式采用Martin模式。

    参考文献

    Llamanade: An open-source computational pipeline for robust nanobody humanization
    Sang, Zhe et al. Structure, Volume 30, Issue 3, 418 - 429.e3
    https://doi.org/10.1016/j.str.2021.11.006

    Nanobody Humanization

    Introduction

    Nanobodies (Nbs) are a recently emerged and promising class of antibody fragments for biomedical and therapeutic applications. Despite their remarkable physicochemical properties, Nbs are derived from camelids and may need to be “humanized” to improve their translational potential in clinical trials. This module is based on Llamanade, which systematically analyzes the sequence and structural properties of Nbs using NGS (next-generation sequencing) databases and high-resolution structures. That analysis revealed a large amount of framework diversity and highlighted key differences between Nbs and human immunoglobulin G (IgG) antibodies. Conserved residues that may contribute to solubility, structural stability, and antigen binding were identified to support the rational humanization of Nbs. The module takes an Nb sequence as input, reports sequence features and a modeled structure, and proposes an optimized humanization solution. A comprehensive humanization analysis of a given Nb takes less than a minute. Llamanade has been successfully applied to humanize a group of structurally diverse and potent SARS-CoV-2-neutralizing Nbs.
    image.png

    Parameter

    Nanobody Sequence

    Nanobody sequence in FASTA format, such as:

    >Nb21
    MAQVQLVESGGGLVQAGGSLRLSCAVSGLGAHRVGWFRRAPGKEREFVAAIGANGGNTNYLDSVKGRFTISRDNAKNTIYLQMNSLKPQDTAVYYCAARDIETAEYTYWGQGTQVTVS

    Result

    The output file humanized_data.csv contains the following information:
    Position: residue number
    Original AA: original residue
    Humanized?: whether the residue needs to be humanized (0 means no, 1 means yes)
    Humanized AA: residue after humanization
    Note: Residues are numbered using the Martin scheme.
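
The four columns of humanized_data.csv are enough to rebuild the humanized sequence. The following is a minimal sketch under assumptions: the file is taken to be comma-separated with exactly the column names listed above, rows are taken to be in sequence order, and the flag is accepted either as 0/1 or as True/False because the exact encoding is not certain. The function name and demo rows are hypothetical.

```python
import csv
import io

TRUE_VALUES = {"1", "true", "yes"}  # accepted spellings of a positive flag (assumption)


def humanized_sequence(csv_text: str) -> str:
    """Concatenate the humanized residue where flagged, otherwise the original residue."""
    residues = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        flagged = row["Humanized?"].strip().lower() in TRUE_VALUES
        residues.append(row["Humanized AA"] if flagged else row["Original AA"])
    return "".join(residues)


# Hypothetical excerpt; positions follow the numbering scheme and need not be contiguous.
demo = (
    "Position,Original AA,Humanized?,Humanized AA\n"
    "1,Q,0,Q\n"
    "2,V,0,V\n"
    "5,Q,1,V\n"
    "11,S,True,L\n"
)
print(humanized_sequence(demo))
```

Falling back to Original AA for unflagged rows keeps the sketch independent of what the Humanized AA column contains when no change is proposed.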

    Reference

    Llamanade: An open-source computational pipeline for robust nanobody humanization
    Sang, Zhe et al. Structure, Volume 30, Issue 3, 418 - 429.e3
    https://doi.org/10.1016/j.str.2021.11.006

  • Name: mRNA 5'UTRs optimization
    Description: 是一种新颖的深度生成模型,设计用于在 mRNA 序列中创建 N1-甲基假尿苷 (m1Ψ) 5'UTR。Smart5UTR 利用多任务自动编码器框架,利用从大型数据集中学习到的潜在特征,有效地生成 5'UTR 序列。Smart5UTR设计的mRNA的性能已通过体外和体内实验得到验证。这个强大的工具简化了m1Ψ-5'UTRs的设计,有助于开发更有效的mRNA疗法。 A novel deep generative model designed to create N1-methyl-pseudouridine (m1Ψ) 5' UTRs in mRNA sequences. Smart5UTR utilizes a multi-task autoencoder framework to effectively generate 5' UTR sequences by leveraging latent features learned from large datasets. The performance of mRNAs designed by Smart5UTR has been validated through both in vitro and in vivo experiments. This powerful tool simplifies the design of m1Ψ-5' UTRs and contributes to the development of more effective mRNA therapies.
    Tags: undefined
    Author: Xiaoshan Tang
    Release: 2024-01-09 00:00:00
    Reference: Xiaoshan Tang, et al., A novel deep generative model for mRNA vaccine development: Designing 5ʹ UTRs with N1-methyl-pseudouridine modification, Acta Pharmaceutica Sinica B, 2023

    mRNA 5’UTRs optimization

    简介

    该模块基于Smart5UTR模型实现,Smart5UTR 是一种新颖的深度生成模型,设计用于在 mRNA 序列中创建 N1-甲基假尿苷 (m1Ψ) 5’ UTR。Smart5UTR 利用多任务自动编码器框架,利用从大型数据集中学习到的潜在特征,有效地生成 5’ UTR 序列。Smart5UTR设计的mRNA的性能已通过体外和体内实验得到验证。这个强大的工具简化了m1Ψ-5’UTRs的设计,有助于开发更有效的mRNA疗法。
    image.png

    参数说明

    Sequence of 5’UTR

    mRNA 5’UTR的序列,如:GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGGTT
    备注:输入序列长度不超过50碱基。

    结果说明

    输出result.csv结果文件,包含以下信息:
    Original Sequence: 初始序列
    Optimized Sequence: 优化后的序列
    Optimized MRL: 优化序列预测的MRL值

    MRL解释:
    mean ribosome load (MRL) 平均核糖体加载值,是反映mRNA序列翻译效率的指标,值越大表示翻译效率越高,一般大于5.0

    参考文献

    Xiaoshan Tang, et al., A novel deep generative model for mRNA vaccine development: Designing 5ʹ UTRs with N1-methyl-pseudouridine modification, Acta Pharmaceutica Sinica B, 2023
    https://doi.org/10.1016/j.apsb.2023.11.003

    mRNA 5’UTRs optimization

    Introduction

    This module is based on the Smart5UTR model. Smart5UTR is a novel deep generative model designed for creating N1-methyl-pseudouridine (m1Ψ) 5’ UTRs in mRNA sequences. Utilizing a multi-task autoencoder framework, Smart5UTR efficiently generates 5’ UTR sequences by leveraging the latent features learned from a large dataset. The performance of Smart5UTR-designed mRNA has been validated through in vitro and in vivo experiments. This powerful tool streamlines the design of m1Ψ-5’ UTRs, contributing to the development of more effective mRNA therapeutics.
    image.png

    Parameter

    Sequence of 5’UTR

    Sequence of mRNA 5’UTR, such as: GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGGTT
    Note: The input sequence must not exceed 50 nucleotides.
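
The length limit can be checked before submitting a sequence. This is a minimal sketch with a hypothetical helper `check_utr`; accepting both T and U in the alphabet is an assumption based on the example, which is written with T.

```python
MAX_UTR_LEN = 50         # limit stated in the note above
ALLOWED = set("ACGTU")   # assumption: DNA-style (T) and RNA-style (U) letters both accepted


def check_utr(seq: str) -> str:
    """Return the cleaned, upper-case 5'UTR sequence or raise ValueError."""
    cleaned = "".join(seq.split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    if len(cleaned) > MAX_UTR_LEN:
        raise ValueError(f"sequence has {len(cleaned)} bases, limit is {MAX_UTR_LEN}")
    bad = sorted(set(cleaned) - ALLOWED)
    if bad:
        raise ValueError(f"unexpected characters: {bad}")
    return cleaned


example = "GGAGCCGAGGCGGGCCGATTCACGATCGGTTCGCAAAAACTGTTTGGGTT"
print(len(check_utr(example)))
```

The example sequence from the parameter description is exactly at the limit, so appending a single base to it would be rejected.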

    Result

    The output file result.csv contains the following information:
    Original Sequence: the input sequence
    Optimized Sequence: the optimized sequence
    Optimized MRL: the MRL value predicted for the optimized sequence

    MRL (mean ribosome load) is the average number of ribosomes associated with a given mRNA and serves as a proxy for translation efficiency. Higher values indicate higher translation efficiency; values are generally greater than 5.0.

    Reference

    Xiaoshan Tang, et al., A novel deep generative model for mRNA vaccine development: Designing 5ʹ UTRs with N1-methyl-pseudouridine modification, Acta Pharmaceutica Sinica B, 2023
    https://doi.org/10.1016/j.apsb.2023.11.003

  • Name: Immunogenicity Prediction (AlphaMHC v3.0 beta)
    Description: AlphaMHC算法采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条与免疫原性相关的湿实验数据(包括亲和力数据、NGS数据、质谱数据等)进行训练,实现了从序列到临床免疫原性风险的端到端的预测,并通过上百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)进行验证。 该版本使用抗体为主的临床ADA数据进行测试精度达到90%,AUROC达0.91,性能优于v2.0版本。注:该版本非最新版本,推荐使用更新版本。 The AlphaMHC algorithm employs popular NLP (Natural Language Processing) techniques and a novel multi-modal fusion deep neural network architecture. It integrates nearly one billion wet lab data points related to immunogenicity (including affinity data, NGS data, mass spectrometry data, etc.) for training, achieving end-to-end prediction from sequence to clinical immunogenicity risk. It has been validated with hundreds of real clinical immunogenicity data points from the FDA and EMA (including mono- and multi-specific antibodies, recombinant proteins, etc.). This version was tested primarily with clinical ADA data dominated by antibodies, achieving an accuracy of 90% and an AUROC of 0.91, and it outperforms version 2.0. Note: this is not the latest version; a newer version is recommended.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-11-30 00:00:00
    Reference:

    Immunogenicity Prediction (AlphaMHC v3.0 beta)

    介绍

    AlphaMHC v3.0在多个方面相比v2.0进行了大幅优化,
    主要包括:
    1、风险评分优化,能更好的反映多重HLA激活的风险贡献;
    2、引入新的EL和TCR等更多来源的数据,提升了对可递呈表位的预测能力,对TCR分子的支持更好;
    3、全新的结果可视化面板(通过WeSeq运行);

    为了更好的交互体验和对结果进行可视化,推荐从WeSeq中使用本功能。

    测试数据:
    从FDA和EMA的临床试验中收集了已知免疫原性的分子及其ADA的分布,使用模型对ADA明显较高(ADA>20%)及较低(ADA<5%)的分子进行分类以测试其预测性能。
    image.png

    测试结果:
    AlphaMHC v3.0全面超越常见算法及v2.0,性能同类最佳(SOTA)
    image.png
    右图中:

    • ACC是准确度,代表所有分子中预测正确的比例;
    • PRECISION代表特异性,指预测为高风险的分子中,实际为高ADA分子的比例;
    • RECALL代表敏感性,指预测的高风险分子占全部高ADA分子的比例;
    • F1是综合了特异性和敏感性的指标;
      以上指标都是越高越好。

    参数

    Fasta File

    计算量消耗
    采用阶梯式动态机制,根据提交的序列数量,对应消耗如下:

    • ≤ 5 条序列:2000 计算量 / 条
    • 第 6–100 条序列:200 计算量 / 条
    • 超过 100 条的部分:20 计算量 / 条

    蛋白序列文件,FASTA格式。支持多条链以及多分子模式。
    对于多分子模式,序列名称规则为:分子名.链名,例如:

    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    >mol2.A
    XXXXXXX
    >mol2.B
    XXXXXXX
    

    结果说明

    • Molecule Score:
      预测的每个分子的免疫原性风险评分以及风险(同个分子的多条链的预测结果汇总后综合评估所得)。

    阈值说明:
    当目标蛋白与该分子的评分 ≥ 1 时,将被视为高风险;当评分 < 1 时,将被视为低风险。

    • TCE Score:
      预测出的T细胞表位(TCE)以及多个评分指标。

    Molecule Score 包含以下信息:

    指标 说明
    Protein ID 输入蛋白的名称,如果是多条序列组成的蛋白,会自动合并
    Score 预测的免疫原性风险评分,值越大,风险越高。为所预测短肽的TCE score的求和
    Risk 对应的免疫原性风险等级

    TCE Score 包含以下信息:

    指标 说明
    Protein ID 所在分子的名称,同个分子的多条序列组成的蛋白会自动合并
    Sequence ID 所在序列的名称
    Core_Pos 表位序列的起始位置
    Core 表位序列(TCE)
    Score 表位序列的风险评分,分数越高越可能引起免疫原性。其范围是0-不限
    MHC_Count 可激活的MHC亚型数,考虑了MHC-II的递呈
    Tolerance 免疫耐受的可能性
    Germline 是否存在于人胚系基因中
    NextProt 是否存在于人蛋白组中
    OAS 在NGS人源抗体中出现的频率
    TCR 是否存在于人TCR基因中
    LAC 是否存在于低ADA临床药物(Low ADA CST)中

    Immunogenicity Prediction (AlphaMHC v3.0 beta)

    Introduction

    AlphaMHC v3.0 has undergone significant optimizations compared to v2.0 in several aspects, including:

    1. Improved risk scoring that better reflects the risk contribution of multiple HLA activations.
    2. New data from additional sources, such as EL and TCR data, which improve the prediction of presentable epitopes and the support for TCR molecules.
    3. A brand-new result visualization panel (when run through WeSeq).

    For a better interactive experience and visualization of results, it is recommended to use this feature through WeSeq.

    Test Data:
    Molecules with known immunogenicity and their ADA distributions were collected from FDA and EMA clinical trial data. The model was tested on classifying molecules with clearly high ADA (ADA > 20%) versus low ADA (ADA < 5%).
    image.png

    Test Results:
    AlphaMHC v3.0 surpasses common algorithms and v2.0 comprehensively, achieving state-of-the-art performance (SOTA).
    image.png
    In the image on the right:

    • ACC is accuracy: the proportion of all molecules that are predicted correctly.
    • PRECISION is the proportion of molecules predicted as high risk that are actually high-ADA molecules.
    • RECALL is sensitivity: the proportion of all high-ADA molecules that are predicted as high risk.
    • F1 is a metric that combines precision and recall.
      Higher values are better for all of these metrics.
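    The four metrics listed above follow directly from the counts of a binary confusion matrix, with "high ADA" as the positive class. The sketch below illustrates the standard definitions; the function name and the example counts are made up for illustration and are not AlphaMHC benchmark results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, PRECISION, RECALL and F1 from confusion-matrix counts.

    tp: high-ADA molecules predicted as high risk
    fp: low-ADA molecules predicted as high risk
    tn: low-ADA molecules predicted as low risk
    fn: high-ADA molecules predicted as low risk
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one molecule is required")
    acc = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"ACC": acc, "PRECISION": precision, "RECALL": recall, "F1": f1}


# Illustrative counts only (not taken from the AlphaMHC test set).
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```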

    Parameters

    Fasta File

    AlphaMHC v3.0 beta Pricing Policy
    AlphaMHC v3.0 beta uses a tiered, dynamic pricing model, where charges are calculated based on the number of submitted sequences:

    • ≤ 5 sequences: 2000 compute units per sequence
    • Sequences 6–100: 200 compute units per sequence
    • Sequences beyond 100: 20 compute units per sequence
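    The tiers above can be turned into a small cost estimate. This sketch assumes the tiers are marginal, i.e. each rate applies only to the sequences that fall inside its tier (the first 5, the 6th to the 100th, and everything beyond the 100th); the function name is hypothetical and the platform's actual billing may differ.

```python
def estimate_compute_units(n_sequences):
    """Estimate the compute units consumed for n_sequences submitted sequences.

    Assumption: tiers are applied marginally, so a job with 101 sequences pays
    the first-tier rate for 5 sequences, the second-tier rate for 95 sequences
    and the third-tier rate for 1 sequence.
    """
    if n_sequences < 0:
        raise ValueError("n_sequences must not be negative")
    first_tier = min(n_sequences, 5)                   # sequences 1-5
    second_tier = max(0, min(n_sequences, 100) - 5)    # sequences 6-100
    third_tier = max(0, n_sequences - 100)             # sequences beyond 100
    return 2000 * first_tier + 200 * second_tier + 20 * third_tier


for n in (1, 5, 6, 100, 101):
    print(n, estimate_compute_units(n))
```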

    Protein sequence file in FASTA format. Supports multiple chains and multi-molecule mode.
    For multi-molecule mode, sequence names follow the convention molecule name.chain name, for example:
    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    >mol2.A
    XXXXXXX
    >mol2.B
    XXXXXXX
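    The naming convention above means that the chains of one molecule can be recovered by splitting each FASTA header at its last dot. The following sketch shows one way to group chains by molecule; the function name and the handling of headers without a dot are assumptions for illustration, not the tool's actual parser.

```python
from collections import defaultdict


def group_chains_by_molecule(fasta_text):
    """Group FASTA records named <molecule>.<chain> into {molecule: {chain: sequence}}."""
    molecules = defaultdict(dict)
    molecule = chain = None
    for raw_line in fasta_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            # Split at the last dot: "mol1.A" -> molecule "mol1", chain "A".
            molecule, _, chain = name.rpartition(".")
            if not molecule:
                # Assumption: a header without a dot is a single-chain molecule.
                molecule, chain = name, ""
            molecules[molecule][chain] = ""
        elif molecule is not None:
            # Sequence lines may be wrapped, so append them to the current record.
            molecules[molecule][chain] += line
    return dict(molecules)


# Placeholder sequences for illustration only.
example = """>mol1.A
EVQLVESGG
>mol1.B
DIQMTQSPS
>mol2.A
QVQLQESGP
"""
for name, chains in group_chains_by_molecule(example).items():
    print(name, sorted(chains))
```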
    

    Results

    • Molecule Score:
      The predicted immunogenicity risk score and risk level for each molecule, obtained by aggregating the predictions for all chains of the same molecule.
      Cutoff:
      A molecule with a score ≥ 1 is considered high risk; a molecule with a score < 1 is considered low risk.
    • TCE Score:
      Predicted T cell epitopes (TCE) and multiple scoring metrics.

    Molecule Score contains the following information:

    Indicator Description
    Protein ID Name of the input protein; if the protein is composed of multiple sequences, they will be automatically merged
    Score Predicted immunogenicity risk score; higher values indicate higher risk. It is the sum of the TCE scores of the predicted peptides
    Risk Corresponding immunogenicity risk level
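    The relationship between the two result tables can be sketched as follows: the molecule score is the sum of the TCE scores that share a Protein ID, and the risk level follows from the cutoff of 1 described above. The function name, the (protein_id, score) input layout, and the "high"/"low" labels are assumptions for illustration; the actual output files may use different labels.

```python
from collections import defaultdict

HIGH_RISK_CUTOFF = 1.0  # cutoff stated in the result description


def summarize_molecules(tce_rows):
    """Sum TCE scores per Protein ID and attach a risk level.

    tce_rows: iterable of (protein_id, tce_score) pairs, one per predicted epitope.
    Returns {protein_id: (molecule_score, risk)}.
    """
    totals = defaultdict(float)
    for protein_id, score in tce_rows:
        totals[protein_id] += score
    return {
        protein_id: (total, "high" if total >= HIGH_RISK_CUTOFF else "low")
        for protein_id, total in totals.items()
    }


# Illustrative epitope scores only.
rows = [("mol1", 0.75), ("mol1", 0.5), ("mol2", 0.25)]
for protein_id, (score, risk) in summarize_molecules(rows).items():
    print(protein_id, score, risk)
```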

    TCE Score contains the following information:

    Indicator Description
    Protein ID Name of the molecule it belongs to; proteins composed of multiple sequences within the same molecule will be automatically merged
    Sequence ID Name of the sequence it belongs to
    Core_Pos Starting position of the epitope sequence
    Core Epitope sequence (TCE)
    Score Risk score of the epitope sequence; the higher the score, the more likely the epitope is to cause immunogenicity. Values start at 0 and have no upper bound
    MHC_Count Number of activatable MHC subtypes, considering MHC-II presentation
    Tolerance Possibility of immunological tolerance
    Germline Whether it exists in human germline genes
    NextProt Whether it exists in the human proteome
    OAS Frequency of occurrence in NGS-derived human antibodies
    TCR Whether it exists in human TCR genes
    LAC Whether it exists in Low ADA CST (Low ADA Clinical Study Treatment) medications
  • Name: Ramachandran Plots
    Description: 对同源建模后模型质量的评估,仅仅考虑蛋白的构象是否合理,并不涉及能量问题。 Evaluate the quality of models after homology modeling, focusing on the reasonableness of the protein's conformation without considering energy issues.
    Tags: undefined
    Author: Manish Sud
    Release: 2023-11-20 10:25:37
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. Lovell SC, Davis IW, Arendall WB 3rd, de Bakker PI, Word JM, Prisant MG, Richardson JS, Richardson DC. Structure validation by Calpha geometry: phi,psi and Cbeta deviation. Proteins. 2003 Feb 15;50(3):437-50.

    Ramachandran Plots

    简介

    Ramachandran Plots模块是对同源建模后模型质量的评估,仅仅考虑蛋白的构象是否合理,并不涉及能量问题。Ramachandran Plot中φ(phi)表示一个肽单位中α碳左边C-N键的旋转角度, ψ(psi)表示α碳右边C-C键的旋转角度。一般来说落在允许区和最大允许区的氨基酸残基占整个蛋白质的比例高于90%的,可以认为该模型的构象符合立体化学的规则。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式。

    Chain ID

    选择作图链名称,不填默认为all。

    Figure Resolution

    图片分辨率(以每英寸点为单位)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    result_General.png 通常情况下的拉氏图
    result_Glycine.png 甘氨酸的拉氏图
    result_PreProline.png 脯氨酸前一个残基的拉氏图
    result_Proline.png 脯氨酸的拉氏图

    图中绿色为最大允许区,浅绿色为允许区,白色为不允许区,青色圆点代表在允许区域的氨基酸,红色圆点代表在不允许区域的氨基酸。在白色区域的氨基酸小于5%时,蛋白结构较为合理。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    Lovell SC, Davis IW, Arendall WB 3rd, de Bakker PI, Word JM, Prisant MG, Richardson JS, Richardson DC. Structure validation by Calpha geometry: phi,psi and Cbeta deviation. Proteins. 2003 Feb 15;50(3):437-50.

    Ramachandran Plots

    Introduction

    The Ramachandran Plots module evaluates the quality of models produced by homology modeling. It considers only whether the protein conformation is reasonable and does not consider energy. In a Ramachandran plot, φ (phi) is the rotation angle about the N–Cα bond of a residue (the bond on the N-terminal side of the alpha carbon), and ψ (psi) is the rotation angle about the Cα–C bond (the bond on the C-terminal side). Generally, if more than 90% of the amino acid residues fall within the allowed and most favored regions, the conformation of the model is considered to comply with the rules of stereochemistry.
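    Both angles are dihedral angles defined by four consecutive backbone atoms: φ by C(i-1), N(i), Cα(i), C(i) and ψ by N(i), Cα(i), C(i), N(i+1). The sketch below shows the generic dihedral calculation in plain Python; it is an illustration of the geometry, not the module's own implementation, and the test coordinates are artificial rather than taken from a real structure.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _scale(a, factor):
    return (a[0] * factor, a[1] * factor, a[2] * factor)


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees, in [-180, 180]) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    if length == 0.0:
        raise ValueError("p1 and p2 must be distinct points")
    axis = _scale(b1, 1.0 / length)
    # Remove the components along the central bond, leaving the two
    # vectors projected onto the plane perpendicular to that bond.
    v = _sub(b0, _scale(axis, _dot(b0, axis)))
    w = _sub(b2, _scale(axis, _dot(b2, axis)))
    x = _dot(v, w)
    y = _dot(_cross(axis, v), w)
    return math.degrees(math.atan2(y, x))


# Artificial points: the outer atoms are perpendicular when viewed along the bond.
print(dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))
```

    For φ, the four points would be the coordinates of C(i-1), N(i), Cα(i) and C(i) read from the PDB file; for ψ, those of N(i), Cα(i), C(i) and N(i+1).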

    Parameter Description

    • Structure PDB File: The structure file of the protein in PDB format.
    • Chain ID: Select the chain name for plotting. If left blank, it defaults to all.
    • Figure Resolution: Resolution of the image (in dots per inch).

    Result Description

    The output includes:

    Output File Name Description
    result_General.png Ramachandran plot for general residues
    result_Glycine.png Ramachandran plot for glycine residues
    result_PreProline.png Ramachandran plot for residues before proline
    result_Proline.png Ramachandran plot for proline residues

    In the plots, green represents the most favored regions, light green represents allowed regions, white represents disallowed regions, cyan dots represent amino acids in allowed regions, and red dots represent amino acids in disallowed regions. When the percentage of amino acids in the white region is less than 5%, the protein structure is considered reasonable.

    References

    • Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    • Lovell SC, Davis IW, Arendall WB 3rd, de Bakker PI, Word JM, Prisant MG, Richardson JS, Richardson DC. Structure validation by Calpha geometry: phi, psi and Cbeta deviation. Proteins. 2003 Feb 15;50(3):437-50.
  • Name: Therapeutic Antibody Profiler
    Description: 基于TAP方法,快速对抗体进行打分,评估抗体的成药性。基于抗体可变区的结构,计算CDR区域及其周围的表面疏水性程度、正电荷分布程度、负电荷分布程度、Fv区的重、轻链之间的净电荷失衡程度,也支持纳米抗体(即TNP)。 Based on the TAP method, rapidly score antibodies to evaluate their druggability. Based on the structure of antibody variable regions, calculate the surface hydrophobicity, positive charge distribution, negative charge distribution in CDR regions and their surroundings, as well as the net charge imbalance between heavy and light chains in the Fv region. Also supports nanobodies (i.e., TNP).
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-11-13 00:00:00
    Reference:

    Therapeutic Antibody Profiler

    简介

    Therapeutic Antibody Profiler (TAP) 基于抗体可变区的结构计算抗体的可开发性性质。TAP目前支持单抗与纳米抗体的性质计算。
    对于单抗计算以下5个性质,以确定输入单抗的可开发性指标是否与临床阶段的单抗的属性相匹配:

    • CDR区总长度:Total CDR Length
    • CDR区域及其周围的表面疏水性程度:Patches of Surface Hydrophobicity (PSH) metric across the CDR Vicinity
    • CDR区域及其周围的表面正电荷程度:Patches of Positive Charge (PPC) metric across the CDR Vicinity
    • CDR区域及其周围的表面负电荷程度:Patches of Negative Charge (PNC) metric across the CDR Vicinity
    • Fv区的重、轻链之间的净电荷失衡程度:Structural Fv Charge Symmetry Parameter (SFvCSP)

    针对851的治疗性单体(临床I期及之后)的Fv区计算的可开发性指标范围如下(最新更新日期为2025年2月24日):

    Property Amber Region Red Region
    Total CDR Length (L) 37 ≤ L ≤ 42 L < 37
    55 ≤ L ≤ 65 L > 65
    Patches of Surface Hydrophobicity (PSH) 95.77 ≤ PSH ≤ 111.40 PSH < 95.77
    167.64 ≤ PSH ≤ 211.65 PSH > 211.65
    Patches of Positive Charge (PPC) 1.34 ≤ PPC ≤ 4.20 PPC > 4.24
    Patches of Negative Charge (PNC) 1.99 ≤ PNC ≤ 4.43 PNC > 5.67
    Structural Fv Charge Symmetry Parameter (SFvCSP) -30.60 ≤ SFvCSP ≤ -6.00 SFvCSP < -30.60

    Amber Region: 指标在851个治疗性抗体(临床I期及之后)的Fv区计算的指标范围内,属于合理区域
    Red Region:指标不合理区域,需要调整
    Amber Region和Red Region的区域范围定义如下表所示。
    image.png

    对于纳米抗体,计算6个性质,以确定输入纳米抗体的可开发性指标是否与临床阶段的纳米抗体的属性相匹配:

    • CDR区总长度:Total CDR Length
    • CDR3区总长度:CDR3 Length
    • CDR3紧凑度:CDR3 Compactness
    • CDR区域及其周围的表面疏水性程度:Patches of Surface Hydrophobicity (PSH) metric across the CDR Vicinity
    • CDR区域及其周围的表面正电荷程度:Patches of Positive Charge (PPC) metric across the CDR Vicinity
    • CDR区域及其周围的表面负电荷程度:Patches of Negative Charge (PNC) metric across the CDR Vicinity

    针对36的治疗性纳米抗体(临床I期及之后)计算的可开发性指标范围如下(最新更新日期为2025年8月14日):

    Property Amber Region Red Region
    Total CDR Length (L) 20 ≤ L ≤ 24 L < 20
    38 ≤ L ≤ 39 L > 39
    CDR3 Length (L) 5 ≤ L ≤ 8 L < 5
    22 ≤ L ≤ 23 L > 23
    CDR3 Compactness (CC) 0.56 ≤ CC ≤ 0.81 CC < 0.56
    1.57 ≤ CC ≤ 1.61 CC > 1.61
    Patches of Surface Hydrophobicity (PSH) 73.40 ≤ PSH ≤ 79.59 PSH < 73.40
    126.83 ≤ PSH ≤ 155.47 PSH > 155.47
    Patches of Positive Charge (PPC) 0.39 ≤ PPC ≤ 1.18 PPC > 1.18
    Patches of Negative Charge (PNC) 1.47 ≤ PNC ≤ 1.88 PNC > 1.88

    Amber Region 与 Red Region的定义同上。

    参数说明

    Antibody Fv Structure (PDB)

    抗体结构文件,支持单抗或纳米抗体,PDB格式

    Antibody Fv Structure (TAR)

    多个单抗Fv结构或者多个纳米抗体结构(PDB格式)的压缩文件,压缩文件格式支持zip,tar或tar相关的压缩格式(.tar.gz, .bz2, .xz)

    当同时上传单一结构和压缩包时会合并计算。

    Nanobody

    当选择该选项时,进行纳米抗体的类TAP计算。默认情况下计算抗体的TAP。

    Score

    输出打分文件,CSV格式,默认为score.csv。

    Details

    输出每个残基的打分,CSV格式,默认为detail.csv。

    结果说明

    输出打分文件score.csv,输出以下信息:
    Total CDR Length:CDR区域氨基酸长度
    CDR3 Length:CDR3长度(纳米抗体时输出)
    CDR3 Compactness:CDR3紧凑度(纳米抗体时输出)
    CDR Vicinity PSH Score (Kyte & Doolittle):CDR区域及其周围的表面疏水性程度
    CDR Vicinity PPC Score:CDR区域及其周围的表面正电荷程度
    CDR Vicinity PNC Score:CDR区域及其周围的表面负电荷程度
    SFvCSP Score:Fv区的重、轻链之间的净电荷失衡程度(单抗时输出)

    输出每个残基的打分文件detail.csv,输出以下信息:
    PDBFile:结构文件名称
    ChainType:链名(单抗时输出)
    ResIndex:残基编号
    ResLabel:残基名称
    CDR Vicinity PSH Score (Kyte & Doolittle):残基的PSH分数
    CDR Vicinity PPC Score:残基的PPC分数
    CDR Vicinity PNC Score:残基的PNC分数

    参考文献

    • Matthew I. J. Raybould, Claire Marks, Konrad Krawczyk, Bruck Taddese, Jaroslaw Nowak, Alan P. Lewis, Alexander Bujotzek, Jiye Shi, and Charlotte M. Deane. Five computational developability guidelines for therapeutic antibody profiling. Proceedings of the National Academy of Sciences, 2019, 116 (10) 4025-4030. DOI: 10.1073/pnas.1810576116
    • Gemma L Gordon, Joao Gervasio, Colby Souders, Charlotte M Deane. The Therapeutic Nanobody Profiler: characterising and predicting nanobody developability to improve therapeutic design. DOI: 10.1101/2025.08.11.669635

    Therapeutic Antibody Profiler

    Introduction

    The Therapeutic Antibody Profiler (TAP) computes developability properties from the structure of the antibody variable domain and compares them against guidelines derived from clinical-stage therapeutics. TAP currently supports property calculations for both monoclonal antibodies and nanobodies.
    For monoclonal antibodies, the following five properties are calculated to check whether your antibody design is commensurate with clinical-stage therapeutics:

    • Total CDR Length
    • Patches of Surface Hydrophobicity (PSH) metric across the CDR Vicinity
    • Patches of Positive Charge (PPC) metric across the CDR Vicinity
    • Patches of Negative Charge (PNC) metric across the CDR Vicinity
    • Structural Fv Charge Symmetry Parameter (SFvCSP)

    The TAP Guidelines were last updated on 24th February 2025:

    Property | Amber Region | Red Region
    Total CDR Length (L) | 37 ≤ L ≤ 42 or 55 ≤ L ≤ 65 | L < 37 or L > 65
    Patches of Surface Hydrophobicity (PSH) | 95.77 ≤ PSH ≤ 111.40 or 167.64 ≤ PSH ≤ 211.65 | PSH < 95.77 or PSH > 211.65
    Patches of Positive Charge (PPC) | 1.34 ≤ PPC ≤ 4.20 | PPC > 4.24
    Patches of Negative Charge (PNC) | 1.99 ≤ PNC ≤ 4.43 | PNC > 5.67
    Structural Fv Charge Symmetry Parameter (SFvCSP) | -30.60 ≤ SFvCSP ≤ -6.00 | SFvCSP < -30.60

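    Amber Region: The value lies within the range computed from the Fv regions of 851 therapeutic antibodies (Phase I and beyond) and is considered reasonable.
    Red Region: The value lies outside the reasonable range, and the developability needs to be optimized.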
    The following table defines the scope of Amber Region and Red Region.
    WXWorkCapture_17008110438269.png
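    Checking a computed property against the guideline table amounts to a few interval comparisons. The sketch below uses the Total CDR Length thresholds for monoclonal antibodies from the table above; the function name and the "none" label for values that fall in neither listed region are assumptions for illustration, since the text names only the amber and red regions.

```python
def classify_property(value, amber_ranges, red_below=None, red_above=None):
    """Return "red", "amber" or "none" for a single TAP property value.

    amber_ranges: list of inclusive (low, high) intervals from the Amber Region column.
    red_below / red_above: strict bounds from the Red Region column (None if absent).
    "none" is an assumed label for values matching neither column.
    """
    if red_below is not None and value < red_below:
        return "red"
    if red_above is not None and value > red_above:
        return "red"
    for low, high in amber_ranges:
        if low <= value <= high:
            return "amber"
    return "none"


# Total CDR Length thresholds for monoclonal antibodies, copied from the table.
CDR_LENGTH_AMBER = [(37, 42), (55, 65)]

for length in (36, 40, 48, 60, 66):
    flag = classify_property(length, CDR_LENGTH_AMBER, red_below=37, red_above=65)
    print(length, flag)
```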

    For nanobodies, six properties are calculated to determine whether the developability profile of the input nanobody matches the attributes of clinical-stage nanobodies:

    • Total CDR Length
    • CDR3 Length
    • CDR3 Compactness
    • Patches of Surface Hydrophobicity (PSH) metric across the CDR vicinity
    • Patches of Positive Charge (PPC) metric across the CDR vicinity
    • Patches of Negative Charge (PNC) metric across the CDR vicinity

    The developability ranges derived from 36 therapeutic nanobodies (Phase I and beyond) are as follows (last updated: 14 August 2025):

    Property | Amber Region | Red Region
    Total CDR Length (L) | 20 ≤ L ≤ 24 or 38 ≤ L ≤ 39 | L < 20 or L > 39
    CDR3 Length (L) | 5 ≤ L ≤ 8 or 22 ≤ L ≤ 23 | L < 5 or L > 23
    CDR3 Compactness (CC) | 0.56 ≤ CC ≤ 0.81 or 1.57 ≤ CC ≤ 1.61 | CC < 0.56 or CC > 1.61
    Patches of Surface Hydrophobicity (PSH) | 73.40 ≤ PSH ≤ 79.59 or 126.83 ≤ PSH ≤ 155.47 | PSH < 73.40 or PSH > 155.47
    Patches of Positive Charge (PPC) | 0.39 ≤ PPC ≤ 1.18 | PPC > 1.18
    Patches of Negative Charge (PNC) | 1.47 ≤ PNC ≤ 1.88 | PNC > 1.88

    The definitions of Amber Region and Red Region are the same as above.

    Parameters

    Antibody Fv Structure (PDB)

    Antibody structure file in PDB format. Both monoclonal antibodies and nanobodies are supported.

    Antibody Fv Structure (TAR)

    A single compressed archive (zip, tar, or any tar-based format such as .tar.gz, .bz2, .xz) that contains multiple monoclonal-antibody Fv structures or multiple nanobody structures in PDB format.

    When a single structure file and an archive are uploaded simultaneously, the calculations will be merged.

    Nanobody

    When this option is selected, a TAP-like calculation is performed for nanobodies. By default, TAP is calculated for antibodies.

    Score

    Output score file in CSV format, default is score.csv.

    Details

    Output score file of each residue in CSV format, default is detail.csv.

    Result

    Outputs a summary file named score.csv containing:

    • Total CDR Length: Number of amino acids in the CDR regions
    • CDR3 Length: Length of the CDR3 loop (reported for nanobodies only)
    • CDR3 Compactness: Compactness score of the CDR3 loop (reported for nanobodies only)
    • CDR Vicinity PSH Score (Kyte & Doolittle): Surface hydrophobicity in and around the CDR regions
    • CDR Vicinity PPC Score: Surface positive-charge patches in and around the CDR regions
    • CDR Vicinity PNC Score: Surface negative-charge patches in and around the CDR regions
    • SFvCSP Score: Net charge imbalance between the heavy and light chains of the Fv region (reported for monoclonal antibodies only)

    Also outputs a per-residue file named detail.csv containing:

    • PDBFile: Name of the structure file
    • ChainType: Chain identifier (reported for monoclonal antibodies only)
    • ResIndex: Residue number
    • ResLabel: Residue name
    • CDR Vicinity PSH Score (Kyte & Doolittle): PSH score of the residue
    • CDR Vicinity PPC Score: PPC score of the residue
    • CDR Vicinity PNC Score: PNC score of the residue

    Reference

    • Matthew I. J. Raybould, Claire Marks, Konrad Krawczyk, Bruck Taddese, Jaroslaw Nowak, Alan P. Lewis, Alexander Bujotzek, Jiye Shi, and Charlotte M. Deane. Five computational developability guidelines for therapeutic antibody profiling. Proceedings of the National Academy of Sciences, 2019, 116 (10) 4025-4030. DOI: 10.1073/pnas.1810576116
    • Gemma L Gordon, Joao Gervasio, Colby Souders, Charlotte M Deane. The Therapeutic Nanobody Profiler: characterising and predicting nanobody developability to improve therapeutic design. DOI: 10.1101/2025.08.11.669635
  • Name: IgG Modeling
    Description: 对抗体全长序列进行建模,用于构建抗体IgG完整的三维结构,支持单特异性和双特异性抗体。 自动识别全长序列中的可变区(Fv)序列并通过SOTA的方法(目前为ESMFold)进行建模,IgG的其余部分包括Fc和linker以已知全长抗体的晶体结构为模板通过空间约束条件进行同源模建,效果比直接用AF2等方法预测完整IgG结构更优。 Perform modeling on the full-length sequence of antibodies to construct the complete three-dimensional structure of IgG, supporting both monospecific and bispecific antibodies. It automatically identifies the variable region (Fv) sequences within the full-length sequence and models them using state-of-the-art methods (currently ESMFold). The remaining parts of the IgG, including the Fc and linker, are modeled using homology modeling based on the crystal structures of known full-length antibodies as templates, with spatial constraints. This approach yields better results than directly predicting the complete IgG structure using methods like AF2.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-09-23 00:00:00
    Reference:

    IgG Modeling

    简介

    IgG Modeling对抗体全长序列进行建模,用于构建抗体IgG完整的三维结构,支持单特异性和双特异性抗体。
    自动识别全长序列中的可变区(Fv)序列并通过SOTA的方法(目前为ESMFold)进行建模,IgG的其余部分包括Fc和linker以已知全长抗体的晶体结构为模板通过空间约束条件进行同源模建,效果比直接用AF2等方法预测完整IgG结构更优。

    参数说明

    Heavy Chain 1 Sequence

    抗体的第一条重链的序列。

    Light Chain 1 Sequence

    抗体的第一条轻链的序列。

    Heavy Chain 2 Sequence

    抗体的第二条重链的序列,非必填,仅在双抗建模时输入。

    Light Chain 2 Sequence

    抗体的第二条轻链的序列,非必填,仅在双抗建模时输入。

    Isotype

    IgG亚型,目前支持IgG1和IgG4两种类型。
    注意:
    1)当待建模序列为单抗时,只需要写入H1与L1即可,H1与H2相同,L1与L2相同,最终模型包含2条相同的重链和2条相同的轻链。
    2)当待建模序列为双抗时,需要输入四条链的序列,最终模型包含2条不同重链和2条不同轻链。

    结果说明

    输出结果包括:

    输出文件名称 说明
    antibody_001.pdb-antibody_003.pdb 输出三个抗体全长的结构
    scores.csv 抗体全长结构打分,其中Spatial Restraint Penalty (SRP)是对结构构象约束的惩罚评分,数值越低代表违反的空间约束越少,越推荐使用。

    image.png

    参考文献

    • Webb B, Sali A. Comparative Protein Structure Modeling Using MODELLER. Curr Protoc Bioinformatics. 2016 Jun 20;54:5.6.1-5.6.37.
    • Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291-325.
    • Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993 Dec 5;234(3):779-815.
    • Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000 Sep;9(9):1753-73.
    • Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130.

    IgG Modeling

    Introduction

    IgG Modeling builds the complete three-dimensional structure of an IgG antibody from its full-length sequences, supporting both monospecific and bispecific antibodies. It automatically identifies the variable region (Fv) sequences within the full-length sequences and models them with a state-of-the-art method (currently ESMFold). The rest of the IgG, including the Fc and linker, is built by homology modeling under spatial restraints, using crystal structures of known full-length antibodies as templates. This approach yields better results than directly predicting the complete IgG structure with methods such as AF2.

    Parameter Description

    • Heavy Chain 1 Sequence: Sequence of the first heavy chain of the antibody.
    • Light Chain 1 Sequence: Sequence of the first light chain of the antibody.
    • Heavy Chain 2 Sequence: Sequence of the second heavy chain of the antibody, optional, only required for bispecific antibody modeling.
    • Light Chain 2 Sequence: Sequence of the second light chain of the antibody, optional, only required for bispecific antibody modeling.
    • Isotype: IgG subtype, currently supporting IgG1 and IgG4.

    Note:

    1. When modeling a monospecific antibody, only the H1 and L1 sequences need to be provided. H2 is taken to be the same as H1 and L2 the same as L1, so the final model contains two identical heavy chains and two identical light chains.
    2. When modeling a bispecific antibody, the sequences of all four chains need to be provided, and the final model contains two different heavy chains and two different light chains.

    Result

    The output includes:

    Output File Name Description
    antibody_001.pdb-antibody_003.pdb Structures of three full-length antibodies
    scores.csv Scores for the full-length antibody structures. The Spatial Restraint Penalty (SRP) is a penalty score for violations of the conformational restraints on the structure; lower values indicate fewer violated spatial restraints, and models with lower values are preferred.
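    Since a lower SRP indicates fewer violated restraints, choosing among the three output models reduces to taking the row with the smallest SRP. The sketch below assumes scores.csv has one row per model with columns named "Model" and "SRP"; these column names and the sample values are hypothetical and should be adapted to the actual file header.

```python
import csv
import io


def pick_best_model(csv_text, name_column="Model", srp_column="SRP"):
    """Return the model name with the lowest Spatial Restraint Penalty.

    The column names are assumptions; check the header of the real scores.csv.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no models found in the score table")
    best = min(rows, key=lambda row: float(row[srp_column]))
    return best[name_column]


# Made-up scores for illustration only.
sample = """Model,SRP
antibody_001.pdb,12.5
antibody_002.pdb,8.0
antibody_003.pdb,15.25
"""
print(pick_best_model(sample))
```

    To use it on a real result, read the file contents first, for example with open("scores.csv").read(), and pass the text to the function.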

    image.png

    References

    • Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291-325.
    • Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993 Dec 5;234(3):779-815.
    • Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000 Sep;9(9):1753-73.
    • Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130.
  • Name: Substructure Search
    Description: 小分子子结构搜索 Substructure search against a small molecule library
    Tags: undefined
    Author: Manish Sud
    Release: 2023-09-21 10:07:46
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Substructure Search

    简介

    Substructure Search模块是小分子子结构搜索模块,实现在化合物库中查询出含有特定子结构的分子并输出到SDF文件中。子结构搜索是化学信息学研究中的常用操作,也可以用于虚拟筛选,从小分子商业库中搜索出含有特定功能片段的分子用于后续实验验证。

    参数说明

    上传文件搜索子结构:File Search

    Substructure File

    搜索子结构文件,SDF或者SMI格式

    WeDraw画出搜索子结构:Draw

    Substructure File

    通过WeDraw界面画模板小分子,只允许单个小分子。

    通过SMILES字符搜索子结构:Smiles Search

    Substructure Smiles

    搜索子结构SMILES字符,例如
    c1ccccc1
    CC(N)=O

    Public Library

    选择用于相似性搜索的分子库,该模块提供17个公共分子数据库用于进行相似性搜索:

    1. Analyticon:~4万库存分子,源自德国的天然产物品牌,专注天然产物提取及类似物合成工作,产品质量稳定。
    2. Asinex:~52万库存分子,源自美国的品牌,20多年来致力于类先导化合物及分子砌块的研发供应,价格较贵。
    3. Bionet:~23万库存分子,源自英国的品牌,拥有20多年的有机合成经验。
    4. Chembridge:~156万库存分子,源自美国的化合物品牌,总部位于圣地亚哥,拥有多样性库、大环库等多种热门化合物库。
    5. Chemdiv:~160万库存分子,全球最大的化合物品牌之一,拥有5000多种化合物骨架结构和100多种化合物库,性价比高。
    6. Enamine:~273万库存分子,源自乌克兰的化合物品牌,具有较强的化合物研发能力,有高性价比化合物和高价值化合物两类产品。
    7. Eximed:~6万库存分子,源自乌克兰的化合物品牌,近20年来致力于提供高通量筛选化合物及相关服务。
    8. HTS_Biochemie_Innovationen:~6万库存分子,源自德国的化合物品牌,致力于为制药、农业和生物技术公司开发独特的化合物。
    9. IBScreen:~48万库存分子,源自俄罗斯的化合物品牌,拥有多种天然产物及衍生物。
    10. Life_Chemicals:~50万库存分子,源自加拿大的化合物品牌,拥有2900多种化合物骨架结构,化合物规格较齐全且有对应价格。
    11. Maybridge:~5万库存分子,源自英国的化合物品牌,Thermofisher旗下,产品数量少而专,每种产品均具有较大库存。
    12. Otava:~27万库存分子,源自加拿大的化合物品牌,专门从事特色化合物,生物化学药品和生物分析试剂的开发和生成。
    13. Princeton:~153万库存分子,源自美国的化合物品牌,20多年来设计独特的小分子化合物用于药物开发。
    14. Specs:~21万库存分子,源自荷兰的化合物品牌,价格优势明显。
    15. UORSY:~68万库存分子,源自乌克兰的化合物品牌,产品主要用于高通量筛选和药物发现,价格与Enamine接近。
    16. Vitas-m:~140万库存分子,源自美国的化合物品牌,在香港拥有发货中心,到货速度快,价格适中。

    提示说明:Public Library与Private Library选填其中一个。

    Private Library

    用于搜索的个人分子库,仅支持SDF格式。
    提示说明:Public Library与Private Library选填其中一个。

    Output File

    输出文件名称,默认matched_molecules.sdf。

    结果说明

    结果文件为分子库中含有子结构的化合物matched_molecules.sdf。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Substructure Search

    Introduction

    The Substructure Search module finds the molecules in a compound library that contain a specific substructure and writes them to an SDF file. Substructure searching is a common operation in cheminformatics research. It can also be used for virtual screening, to identify molecules in commercial small-molecule libraries that contain a specific functional fragment for subsequent experimental validation.

    Parameter Description

    File Search for Substructure Search

    Substructure File

    File containing the substructure to search for, in SDF or SMI format.

    Draw for Substructure Search

    Substructure File

    Draw the query substructure in the WeDraw interface. Only a single small molecule is allowed.

    Smiles Search for Substructure Search

    Substructure Smiles

    SMILES string of the substructure to search for, for example:
    c1ccccc1
    CC(N)=O

    Public Library

    Select the public compound library to search. The module provides 16 public compound databases for substructure searching:

    1. Analyticon: ~40,000 in-stock molecules. A German natural-product brand focused on natural product extraction and analog synthesis, with stable product quality.
    2. Asinex: ~520,000 in-stock molecules. A US brand that has developed and supplied lead-like compounds and molecular building blocks for over 20 years; relatively expensive.
    3. Bionet: ~230,000 in-stock molecules. A UK brand with over 20 years of organic synthesis experience.
    4. Chembridge: ~1.56 million in-stock molecules. A US compound brand headquartered in San Diego, offering diversity libraries, macrocycle libraries, and other popular compound libraries.
    5. Chemdiv: ~1.6 million in-stock molecules. One of the world's largest compound brands, with over 5,000 compound scaffolds and over 100 compound libraries; good value for money.
    6. Enamine: ~2.73 million in-stock molecules. A Ukrainian compound brand with strong compound R&D capabilities, offering both cost-effective compounds and high-value compounds.
    7. Eximed: ~60,000 in-stock molecules. A Ukrainian compound brand that has provided high-throughput screening compounds and related services for nearly 20 years.
    8. HTS_Biochemie_Innovationen: ~60,000 in-stock molecules. A German compound brand that develops unique compounds for pharmaceutical, agricultural, and biotechnology companies.
    9. IBScreen: ~480,000 in-stock molecules. A Russian compound brand offering a variety of natural products and derivatives.
    10. Life_Chemicals: ~500,000 in-stock molecules. A Canadian compound brand with over 2,900 compound scaffolds and a comprehensive range of compound specifications with corresponding prices.
    11. Maybridge: ~50,000 in-stock molecules. A UK compound brand under Thermo Fisher, with a small but specialized catalog and substantial stock of each product.
    12. Otava: ~270,000 in-stock molecules. A Canadian compound brand specializing in the development and production of specialty compounds, biochemicals, and bioanalytical reagents.
    13. Princeton: ~1.53 million in-stock molecules. A US compound brand that has designed unique small molecules for drug development for over 20 years.
    14. Specs: ~210,000 in-stock molecules. A Dutch compound brand with a clear price advantage.
    15. UORSY: ~680,000 in-stock molecules. A Ukrainian compound brand whose products are mainly used for high-throughput screening and drug discovery, priced close to Enamine.
    16. Vitas-m: ~1.4 million in-stock molecules. A US compound brand with a shipping center in Hong Kong, offering fast delivery and moderate prices.

    Note: Choose either Public Library or Private Library.

    Private Library

    Personal compound library to search. Only SDF format is supported.
    Note: Choose either Public Library or Private Library.

    Output File

    Name of the output file, default is matched_molecules.sdf.

    Result Description

    The result file contains compounds from the compound library that contain the specified substructure, saved as matched_molecules.sdf.

    References

    • Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
  • Name: Structure Minimization (Small)
    Description: 小分子结构能量最小化优化并得到优化后的3D结构。支持UFF和MMFF两种分子力场,支持SDG, ETDG, KDG, ETKDG四种构象采样方法,用于生成初始3D构象。 Small molecule energy minimization optimization tool that generates optimized 3D structure. UFF or MMFF molecular forcefields could be used for energy minimization. Conformation sampling methods, SDG, ETDG, KDG, and ETKDG could be used for generating initial 3D coordinates.
    Tags: undefined
    Author: Manish Sud
    Release: 2023-09-15 14:38:46
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. Riniker, S.; Landrum, G. A. Better informed distance geometry: Using what we know to improve conformation generation. JCIM. 2015, 55, 2562-2574. Rappe, A.K.; Casewit, C.J.; Colwell, K.S.; Goddard III, W.A.; Skiff, W.M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J. Am. Chem. Soc. 1992, 114, 10024-10035. Halgren, T.A.; Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of MMFF94. 1996, J. Comput. Chem., 17, 490-519. Halgren, T.A.; Merck molecular force field. V. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J. Comput. Chem. 1996, 17, 616-641.

    Structure Minimization (Small)

    简介

    Small Molecule Minimization是针对小分子结构进行能量最小化优化并得到优化后的3D结构。支持UFF和MMFF两种分子力场,支持SDG, ETDG, KDG, ETKDG四种构象采样方法,用于生成初始3D构象。注意,每个分子只输出一个能量最低构象,构象搜索推荐使用 3D Conf (AlphaConf)模块。

    参数说明

    Small Molecule File

    小分子文件,支持Mol (.mol), SD (.sdf, .sd), SMILES (.smi .csv, .tsv, .txt)。

    Output File

    输出文件名称,仅支持SDF格式,默认为minimized_struture.sdf。

    Conformer Generator

    3D构象方法:SDG, ETDG, KDG, ETKDG, None.

    1. SDG:Standard Distance Geometry (SDG)
    2. ETDG:Experimental Torsion-angle preference with Distance Geometry
    3. KDG:basic Knowledge-terms with Distance Geometry
    4. ETKDG:Experimental Torsion-angle preference along with basic Knowledge-terms with Distance Geometry
    5. None:代表不使用构象生成算法生成初始构象,直接基于输入文件中的3D构象进行力场优化。因此当输入文件为2D结构或者smiles格式不采用该参数。

    Forcefield Method

    用于能量最小化的力场方法,包括UFF(Universal Force Field)和MMFF(Merck Molecular Mechanics Force Field)。

    Multiprocessing

    使用并行计算。

    Maximum Number of Iterations

    在基于力场优化期间针对每个分子执行的最大迭代次数,默认500。

    Random Seed

    随机数,用于重现优化后的结构。

    结果说明

    得到能量最小化后的小分子3D结构文件minimized_struture.sdf。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    Riniker, S.; Landrum, G. A. Better informed distance geometry: Using what we know to improve conformation generation. JCIM. 2015, 55, 2562-2574.
    Rappe, A.K.; Casewit, C.J.; Colwell, K.S.; Goddard III, W.A.; Skiff, W.M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J. Am. Chem. Soc. 1992, 114, 10024-10035.
    Halgren, T.A.; Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of MMFF94. 1996, J. Comput. Chem., 17, 490-519.
    Halgren, T.A.; Merck molecular force field. V. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J. Comput. Chem. 1996, 17, 616-641.

    Structure Minimization (Small)

    Introduction

    Small Molecule Minimization is a tool module that energy-minimizes small-molecule structures and outputs the optimized 3D structures. It supports two molecular force fields, UFF and MMFF, and four conformation sampling methods (SDG, ETDG, KDG, and ETKDG) for generating the initial 3D conformation. Note that only one energy-minimized conformation is output per molecule; for conformational search, the 3D Conf (AlphaConf) module is recommended.

    Parameters

    Small Molecule File

    Input file for the small molecule, supporting Mol (.mol), SD (.sdf, .sd), and SMILES (.smi, .csv, .tsv, .txt) formats.

    Output File

    Name of the output file, only supports SDF format, default is minimized_structure.sdf.

    Conformer Generator

    3D conformation method: SDG, ETDG, KDG, ETKDG, None.

    1. SDG: Standard Distance Geometry (SDG)
    2. ETDG: Experimental Torsion-angle preference with Distance Geometry
    3. KDG: Basic Knowledge-terms with Distance Geometry
    4. ETKDG: Experimental Torsion-angle preference along with basic Knowledge-terms with Distance Geometry
    5. None: No conformation generation algorithm is used; the force-field optimization starts directly from the 3D conformation in the input file. Do not select this option when the input file contains 2D structures or is in SMILES format.

    Forcefield Method

    Force field method for energy minimization: UFF (Universal Force Field) or MMFF (Merck Molecular Force Field).

    Multiprocessing

    Utilize parallel computing.

    Maximum Number of Iterations

    Maximum number of iterations performed for each molecule during force field optimization, default is 500.

    Random Seed

    Random seed, used to make the optimized structure reproducible.

    Results

    The energy-minimized 3D structure of the small molecule is written to minimized_structure.sdf.

    References

    • Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
    • Riniker, S.; Landrum, G. A. Better informed distance geometry: Using what we know to improve conformation generation. JCIM. 2015, 55, 2562-2574.
    • Rappe, A.K.; Casewit, C.J.; Colwell, K.S.; Goddard III, W.A.; Skiff, W.M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J. Am. Chem. Soc. 1992, 114, 10024-10035.
    • Halgren, T.A.; Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of MMFF94. J. Comput. Chem. 1996, 17, 490-519.
    • Halgren, T.A.; Merck molecular force field. V. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J. Comput. Chem. 1996, 17, 616-641.
  • Name: PDB ReNumbering
    Description: 针对蛋白残基重新编号,同时支持抗体kabat,imgt以及chothia的重编号。输入蛋白结构PDB文件,输出重新编号后的PDB文件。建议通过WeView三维结构可视化编辑器来使用该功能,具体为WeView-> Number -> Renumber UID。 It is a tool module that renumbers protein residues and supports renumbering antibody structure with kabat, imgt, and chothia schemes. It takes a protein structure PDB file as input and outputs a renumbered PDB file. It is recommended to use in the WeView: WeView-> Number -> Renumber UID.
    Author: WECOMPUT
    Release: 2023-09-19 00:00:00
    Reference:

    PDB ReNumbering

    简介

    PDB ReNumbering是针对蛋白残基重新编号的工具模块,同时支持抗体kabat,imgt以及chothia的重编号。输入蛋白结构PDB文件,输出重新编号后的PDB文件。

    参数说明

    Protein Structure File

    输入蛋白结构文件,PDB格式。

    Renumbering Type

    重编号类型,支持指定链从指定数字开始编号,同时支持抗体结构重新编号。
    numeric:氨基酸序号重编号
    kabat:抗体kabat编号规则重编号
    imgt:抗体imgt编号规则重编号
    chothia:抗体chothia编号规则重编号

    Chain Name

    链名,指定具体的链名进行重编号操作。支持输入多条链名,链名之间用英文逗号“,”隔开,如“H,L”。

    Start

    针对氨基酸序号重编号,指定起始编号数字。

    Output File

    重编号后的文件名称。

    结果说明

    重编号后的结构文件名称,默认输出renumbering.pdb。
    注意:如果输入是抗体结构,输出结构中重链的链名会自动改为H,轻链链名会改为L。

    PDB ReNumbering

    Introduction

    PDB ReNumbering is a tool module for renumbering protein residues; it also supports renumbering antibodies according to the Kabat, IMGT, and Chothia numbering schemes. It takes a protein structure file in PDB format as input and outputs the renumbered PDB file.

    Parameter Description

    Protein Structure File

    Input protein structure file in PDB format.

    Renumbering Type

    Renumbering type. A specified chain can be renumbered starting from a specified number, and antibody structures can be renumbered according to an antibody numbering scheme.

    • numeric: Renumber amino acid residues numerically.
    • kabat: Renumber according to the kabat antibody numbering scheme.
    • imgt: Renumber according to the imgt antibody numbering scheme.
    • chothia: Renumber according to the chothia antibody numbering scheme.

    Chain Name

    Name of the chain to renumber. Multiple chain names can be given, separated by commas, e.g., “H,L”.

    Start

    Starting number, used when Renumbering Type is numeric.

    Output File

    Name of the renumbered file.

    Result Description

    The renumbered structure file; the default name is renumbering.pdb.
    Note: If the input is an antibody structure, the chain names in the output structure will be automatically changed to H for the heavy chain and L for the light chain.
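
    As an illustration of the numeric renumbering type, here is a minimal Python sketch that renumbers the residues of selected chains sequentially from a start value, using the fixed-column layout of PDB ATOM/HETATM records. The function name renumber_chains and the helper _atom are hypothetical and not part of the module; the sketch blanks insertion codes, leaves other record types untouched, and does not cover the kabat, imgt, or chothia schemes.

```python
def renumber_chains(pdb_lines, chain_ids, start=1):
    """Renumber residues of the selected chains sequentially from `start`.

    Only ATOM/HETATM records are rewritten; other records pass through unchanged.
    """
    counters = {}  # chain ID -> number given to the residue currently being read
    last_key = {}  # chain ID -> original (resSeq + insertion code) last seen
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[21] in chain_ids:
            chain = line[21]
            key = line[22:27]  # columns 23-26 (resSeq) plus column 27 (iCode)
            if last_key.get(chain) != key:
                counters[chain] = counters[chain] + 1 if chain in counters else start
                last_key[chain] = key
            # New number right-aligned in 4 columns; insertion code blanked.
            line = f"{line[:22]}{counters[chain]:>4d} {line[27:]}"
        out.append(line)
    return out


def _atom(serial, name, res, chain, resseq, icode=" "):
    """Build a minimal fixed-column ATOM record for the demo (hypothetical helper)."""
    return (f"ATOM  {serial:>5d} {name:<4} {res:>3} {chain}{resseq:>4d}{icode}"
            f"   {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


# Demo: chain H starts at residue 100 and has an insertion (100A); chain L is not selected.
demo = [
    _atom(1, "N", "GLY", "H", 100),
    _atom(2, "CA", "GLY", "H", 100),
    _atom(3, "N", "ALA", "H", 100, "A"),
    _atom(4, "N", "SER", "L", 5),
]
renumbered = renumber_chains(demo, {"H"}, start=1)
```

    In this sketch, residue 100 and its inserted neighbor 100A of chain H become residues 1 and 2, while chain L keeps its original numbering because it was not selected.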

  • Name: AC2SDF
    Description: 用于将AlphaConf模块生成的构象压缩二进制文件AC.GZ转为便于查看的SDF文件。 It is used to convert the compressed binary conformation file AC.GZ generated by the AlphaConf module into an SDF file for easier viewing.
    Author: WECOMPUT
    Release: 2023-08-23 00:00:00
    Reference:

    AC2SDF

    简介

    AC2SDF模块是一个格式转换工具,用于将AlphaConf模块生成的构象压缩二进制文件AC.GZ转为便于查看结构的SDF文件。

    参数说明

    Conformation Library (AC)

    输入构象文件,AC.GZ格式,由AlphaConf模块生成

    Fragment Library

    片段库文件,AUX.GZ格式,由AlphaConf模块生成

    SDF File

    转换生成的SDF文件名称

    结果说明

    输出文件名称 说明
    ligands_confs.sd 转换生成的SDF文件,可通过WeView直接查看构象

    AC2SDF

    Introduction

    The AC2SDF module is a format conversion tool used to convert the compressed binary conformation file AC.GZ generated by the AlphaConf module into an SDF file for easier visualization of the structure.

    Parameter Description

    Conformation Library (AC)

    Input conformation file in AC.GZ format generated by the AlphaConf module.

    Fragment Library

    Fragment library file in AUX.GZ format generated by the AlphaConf module.

    SDF File

    Name of the converted SDF file.

    Result Description

    Output File Name Description
    ligands_confs.sd Converted SDF file that can be viewed directly using WeView for conformation visualization.
  • Name: Sequence Mutation
    Description: Sequence Mutation是蛋白序列突变模块,用于针对特定位点批量生成突变序列。突变策略包括基于位置的突变,基于同源序列的突变,基于抗体CDR区的突变,以及基于抗体CDR区和同源性的突变。突变类型支持丙氨酸突变,组氨酸突变,以及饱和突变。 Sequence Mutation is a protein sequence mutation module used to generate mutated sequences in bulk for specific sites. Mutation strategies include position-based mutations, homology-based mutations, mutations based on antibody CDR regions, and mutations based on both antibody CDR regions and homology. The types of mutations supported include alanine scanning, histidine mutation, and saturation mutagenesis.
    Author: WECOMPUT
    Release: 2023-08-22 00:00:00
    Reference:

    Sequence Mutation

    简介

    Sequence Mutation是蛋白序列突变模块,用于针对特定位点批量生成突变序列,支持多样的突变策略,包括设定不同的突变位置及突变类型。

    突变策略包括:

    • 基于指定位置的突变
    • 基于同源序列的突变
    • 基于抗体CDR区的突变
    • 基于抗体CDR区和同源性的突变

    突变类型支持:

    • 丙氨酸突变
    • 组氨酸突变
    • 饱和突变
    • 同源突变(同源序列中的进化突变)

    参数说明:基于位置的突变

    Protein Sequence

    蛋白原始序列或者fasta格式的序列

    Mutation Location

    突变位点,支持多个位点,英文逗号分割,例如:2,3

    Mutation Type

    突变类型,支持三种类型:Ala 丙氨酸突变,His 组氨酸突变,Sat 饱和突变

    Chain Name

    链名,输出突变信息时加上指定链名

    Mutants Sequences

    生成突变序列的文件名称,FASTA格式

    Mutation Policy

    蛋白突变信息文件,TXT格式

    参数说明:基于同源序列的突变

    Protein Sequence

    蛋白原始序列或者fasta格式的序列

    Homologous Sequences

    同源序列,一般由序列比对产生的结果文件,FASTA 格式

    Alignment Methods

    序列比对的方法,mafft或者muscle

    Frequency Cutoff

    频数截断值,大于截断值的氨基酸才会选择作为突变目标

    Chain Name

    链名,输出突变信息时加上指定链名

    Mutants Sequences

    生成突变序列的文件名称,FASTA格式

    Mutation Policy

    蛋白突变信息文件,TXT格式

    参数说明:基于抗体CDR区的突变

    Antibody Sequence

    蛋白原始序列或者fasta格式的序列

    Antibody Numbering

    抗体CDR编号规则:kabat, imgt, chothia

    Mutation Type

    突变类型,支持三种类型:Ala 丙氨酸突变,His 组氨酸突变,Sat 饱和突变

    Chain Name

    链名,输出突变信息时加上指定链名

    Mutants Sequences

    生成的包含蛋白突变序列的文件名称,FASTA格式

    Mutation Policy

    生成的包含蛋白突变信息的文件名称,TXT格式

    参数说明:基于抗体CDR区及同源性的突变

    Antibody Sequence

    蛋白原始序列或者fasta格式的序列

    Antibody Numbering

    抗体CDR编号规则:kabat, imgt, chothia

    Homologous Sequences

    同源序列,一般由序列比对产生的结果文件,FASTA 格式

    Alignment Methods

    序列比对的方法,mafft或者muscle

    Frequency Cutoff

    频数截断值,大于截断值的氨基酸才会选择作为突变目标

    Chain Name

    链名,输出突变信息时加上指定链名

    Mutants Sequences

    生成的包含蛋白突变序列的文件名称,FASTA格式

    Mutation Policy

    生成的包含蛋白突变信息的文件名称,TXT格式

    结果说明

    输出文件名称 说明
    mutants.fasta 生成突变序列的文件名称,FASTA格式
    mutations.txt 蛋白突变信息文件,TXT格式,每行一个突变记录,例如:Q2A 代表第2位氨基酸Q突变为氨基酸A

    Sequence Mutation

    Introduction

    Sequence Mutation is a protein sequence mutation module that allows for batch generation of mutated sequences at specific positions, supporting various mutation strategies including setting different mutation positions and types.

    Mutation strategies include:

    • Position-based mutations
    • Homologous sequence-based mutations
    • Antibody CDR region mutations
    • Antibody CDR region and homology-based mutations

    Supported mutation types include:

    • Alanine mutations
    • Histidine mutations
    • Saturation mutations
    • Homologous mutations (evolutionary mutations from homologous sequences)

    Parameter Description: Position-based Mutations

    Protein Sequence

    Protein sequence, given either as a raw sequence string or in FASTA format.

    Mutation Location

    Mutation positions. Multiple positions are supported, separated by commas, e.g., 2,3.

    Mutation Type

    Mutation type. Three types are supported: Ala (alanine mutation), His (histidine mutation), Sat (saturation mutation).

    Chain Name

    Chain name to be included in the mutation information output.

    Mutants Sequences

    File name for generated mutated sequences in FASTA format.

    Mutation Policy

    Protein mutation information file in TXT format.
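
    The position-based strategy can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the module's code: the function name point_mutants is hypothetical, positions are taken as 1-based (consistent with records such as Q2A), saturation is assumed to mean the 20 standard amino acids, and mutants identical to the wild type are skipped.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def point_mutants(sequence, positions, mutation_type):
    """Return (record, mutated_sequence) pairs for single-point mutants.

    mutation_type is "Ala", "His", or "Sat"; positions are 1-based.
    """
    targets = {"Ala": "A", "His": "H", "Sat": AMINO_ACIDS}[mutation_type]
    mutants = []
    for pos in positions:
        wild = sequence[pos - 1]
        for new in targets:
            if new == wild:
                continue  # mutating a residue to itself is not a mutation
            record = f"{wild}{pos}{new}"  # e.g. Q2A: Q at position 2 becomes A
            mutants.append((record, sequence[:pos - 1] + new + sequence[pos:]))
    return mutants


# Alanine mutation at position 2 of a toy sequence.
ala = point_mutants("MQK", [2], "Ala")
# Saturation at positions 2 and 3: 19 substitutions per position.
sat = point_mutants("MQK", [2, 3], "Sat")
```

    Each record could then be written as one line of the mutation information file and each mutated sequence as one FASTA entry.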

    Parameter Description: Homologous Sequence-based Mutations

    Protein Sequence

    Protein sequence, given either as a raw sequence string or in FASTA format.

    Homologous Sequences

    Homologous sequences in FASTA format, typically the result file of a sequence alignment.

    Alignment Methods

    Alignment methods for sequence alignment: mafft or muscle.

    Frequency Cutoff

    Frequency cutoff value. Only amino acids whose frequency is greater than the cutoff are selected as mutation targets.

    Chain Name

    Chain name to be included in the mutation information output.

    Mutants Sequences

    File name for generated mutated sequences in FASTA format.

    Mutation Policy

    Protein mutation information file in TXT format.
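
    To make the role of the frequency cutoff concrete, the following Python sketch picks mutation targets from an existing alignment. It is a rough illustration only: the function name homology_mutations is hypothetical, the sequences are assumed to be already aligned (for example by mafft or muscle) with "-" as the gap character, and the cutoff is interpreted as a fraction of the homologous sequences, which is an assumption; if the module uses raw counts, the comparison would be on counts instead.

```python
from collections import Counter


def homology_mutations(query_aln, homolog_alns, cutoff):
    """List mutation records (e.g. Q2A) suggested by aligned homologs.

    query_aln and every entry of homolog_alns are rows of one alignment,
    so they all have the same length. Positions are counted on the
    ungapped query sequence, starting at 1.
    """
    records = []
    pos = 0
    for col, wild in enumerate(query_aln):
        if wild == "-":
            continue  # a gap in the query has no residue to mutate
        pos += 1
        counts = Counter(h[col] for h in homolog_alns)
        for aa, n in sorted(counts.items()):
            if aa in ("-", wild):
                continue
            if n / len(homolog_alns) > cutoff:
                records.append(f"{wild}{pos}{aa}")
    return records


# Toy alignment: the query has Q at position 2 where most homologs have A.
query = "MQ-K"
homologs = ["MA-K", "MAGK", "MQ-R", "MA-K"]
strict = homology_mutations(query, homologs, 0.5)
loose = homology_mutations(query, homologs, 0.2)
```

    With the stricter cutoff only the substitution seen in three of four homologs is kept; lowering the cutoff also admits the rarer K-to-R substitution.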

    Parameter Description: Antibody CDR region Mutations

    Antibody Sequence

    Antibody sequence, given either as a raw sequence string or in FASTA format.

    Antibody Numbering

    Antibody CDR numbering rule: kabat, imgt, chothia.

    Mutation Type

    Mutation type. Three types are supported: Ala (alanine mutation), His (histidine mutation), Sat (saturation mutation).

    Chain Name

    Chain name to be included in the mutation information output.

    Mutants Sequences

    File name for generated mutated protein sequences in FASTA format.

    Mutation Policy

    File name for generated protein mutation information in TXT format.

    Parameter Description: Antibody CDR region and Homology-based Mutations

    Antibody Sequence

    Antibody sequence, given either as a raw sequence string or in FASTA format.

    Antibody Numbering

    Antibody CDR numbering rule: kabat, imgt, chothia.

    Homologous Sequences

    Homologous sequences in FASTA format, typically the result file of a sequence alignment.

    Alignment Methods

    Alignment methods for sequence alignment: mafft or muscle.

    Frequency Cutoff

    Frequency cutoff value. Only amino acids whose frequency is greater than the cutoff are selected as mutation targets.

    Chain Name

    Chain name to be included in the mutation information output.

    Mutants Sequences

    File name for generated mutated protein sequences in FASTA format.

    Mutation Policy

    File name for generated protein mutation information in TXT format.

    Result Description

    Output File Name Description
    mutants.fasta File name for generated mutated sequences in FASTA format.
    mutations.txt Protein mutation information file in TXT format, with each line representing a mutation record, e.g., Q2A represents the mutation of amino acid Q at position 2 to amino acid A.
  • Name: MD Distance
    Description: 分子动力学轨迹的距离分析模块,输出分子动力学过程中两个组之间距离 (质心距离或几何中心距离) 随时间的变化。 MD distance analysis that outputs the distance changes between two groups (center of mass distance or geometric center distance) over time.
    Author: WECOMPUT
    Release: 2023-08-22 09:35:48
    Reference:

    MD Distance

    简介

    MD Distance是针对分子动力学轨迹的距离分析模块,输出两个组之间距离 (质心距离或几何中心距离) 随时间的变化情况。自定义组别时需要注意,如果只需测量两个原子之间的距离则填写Custom Atom1和Custom Atom2即可;当同时填写Custom Resid1和Custom Atom1时,组别1的原子数是Custom Atom1与Custom Resid1交集的原子。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group1

    选择需要计算的组别1:Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    System Group2

    选择需要计算的组别1:Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid1

    自定义需要计算的组1残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Atom1

    自定义需要计算的组1原子编号,连续参数可用“-”表示,不连续原子用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Resid2

    自定义需要计算的组2残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Custom Atom2

    自定义需要计算的组2原子编号,连续参数可用“-”表示,不连续原子用逗号隔开,例如:1-10,15。自定义组别之间是并集。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    dist.csv 距离分析CSV文件
    dist.xvg 距离分析XVG文件
    dist.png 距离分析PNG文件

    其中dist.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Distance (nm) 组别之间的距离

    MD Distance

    Introduction

    MD Distance is a distance analysis module for molecular dynamics trajectories. It outputs how the distance between two groups (center-of-mass distance or geometric-center distance) changes over time. When defining custom groups, note that to measure the distance between just two atoms it is enough to fill in Custom Atom1 and Custom Atom2. When both Custom Resid1 and Custom Atom1 are filled in, group 1 consists of the atoms in the intersection of Custom Atom1 and Custom Resid1.

    Parameter Description

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD (GMX2023) module.

    System Group1

    Select group 1 for the calculation: Protein, DNA, RNA.
    You can enter the group name based on the name of the small molecule in the PDB.

    System Group2

    Select group 2 for the calculation: Protein, DNA, RNA.
    You can enter the group name based on the name of the small molecule in the PDB.

    Custom Resid1

    Residue numbers of custom group 1. Write a contiguous range with “-” and separate non-contiguous residues with commas, for example: 1-10,15. The listed entries are combined as a union.

    Custom Atom1

    Atom numbers of custom group 1. Write a contiguous range with “-” and separate non-contiguous atoms with commas, for example: 1-10,15. The listed entries are combined as a union.

    Custom Resid2

    Residue numbers of custom group 2. Write a contiguous range with “-” and separate non-contiguous residues with commas, for example: 1-10,15. The listed entries are combined as a union.

    Custom Atom2

    Atom numbers of custom group 2. Write a contiguous range with “-” and separate non-contiguous atoms with commas, for example: 1-10,15. The listed entries are combined as a union.
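
    The selection syntax described above can be sketched in Python as follows. This is a minimal illustration rather than the module's parser: the names parse_ranges and select_atoms are hypothetical, atoms are represented as (atom number, residue number) pairs, and when both a residue and an atom specification are given the result is their intersection, as stated in the introduction.

```python
def parse_ranges(spec):
    """Expand a specification such as "1-10,15" into a set of integers."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            ids.update(range(int(low), int(high) + 1))  # inclusive range
        else:
            ids.add(int(part))
    return ids


def select_atoms(atoms, resid_spec="", atom_spec=""):
    """Return atom numbers matching the residue and/or atom specification.

    atoms is an iterable of (atom_number, residue_number) pairs. An empty
    specification places no restriction; two specifications intersect.
    """
    resids = parse_ranges(resid_spec) if resid_spec else None
    atom_ids = parse_ranges(atom_spec) if atom_spec else None
    return [
        atom for atom, resid in atoms
        if (resids is None or resid in resids)
        and (atom_ids is None or atom in atom_ids)
    ]


# Toy system: atoms 1-2 belong to residue 1, atom 3 to residue 2, atom 4 to residue 3.
toy_atoms = [(1, 1), (2, 1), (3, 2), (4, 3)]
group1 = select_atoms(toy_atoms, resid_spec="1-2", atom_spec="2-4")
```

    Here residues 1-2 cover atoms 1 to 3 and the atom range covers atoms 2 to 4, so the group contains atoms 2 and 3.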

    Skip Time (ns)

    Time interval between analyzed frames (in ns).

    Result Description

    The output includes:

    Output File Name Description
    dist.csv Distance analysis CSV file
    dist.xvg Distance analysis XVG file
    dist.png Distance analysis PNG file

    The dist.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Distance (nm) Distance between the groups
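
    The difference between the two distance definitions can be shown with a short Python sketch for a single frame. The names center and group_distance are hypothetical, coordinates are plain (x, y, z) tuples in nm, and periodic boundary conditions are ignored, which a real trajectory analysis would need to handle.

```python
import math


def center(coords, masses=None):
    """Geometric center if masses is None, otherwise the center of mass."""
    if masses is None:
        masses = [1.0] * len(coords)  # equal weights give the geometric center
    total = sum(masses)
    return tuple(
        sum(m * c[i] for m, c in zip(masses, coords)) / total
        for i in range(3)
    )


def group_distance(coords1, coords2, masses1=None, masses2=None):
    """Distance between the centers of two atom groups in one frame."""
    return math.dist(center(coords1, masses1), center(coords2, masses2))


# Group 1 has two atoms; group 2 has one atom.
g1 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
geometric = group_distance(g1, [(1.0, 3.0, 0.0)])
# Weighting the first atom three times heavier shifts the center toward it.
mass_weighted = group_distance(g1, [(0.5, 4.0, 0.0)], masses1=[3.0, 1.0])
```

    Repeating this per frame and pairing each value with the frame time would give the two columns of dist.csv.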
  • Name: Peptide VS
    Description: 集成了AutoDock Vina与AutoDock CrankPep进行蛋白-多肽的分子对接,从而预测蛋白-多肽的构象、得到分子对接的能量以及结合亲和力。 This module integrates AutoDock Vina and AutoDock CrankPep for protein-polypeptide docking, thereby predicting the conformation of protein-polypeptide, obtaining the energy of molecular docking and binding affinity.
    Author: WECOMPUT
    Release: 2023-08-24 14:37:51
    Reference: J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling. O. Trott, A. J. Olson, AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading, Journal of Computational Chemistry 31 (2010) 455-461 Podtelezhnikov, A.A., Wild, D.L. CRANKITE: A fast polypeptide backbone conformation sampler. Source Code Biol Med 3, 12 (2008).

    Peptide VS

    简介

    Peptide VS模块集成了AutoDock Vina与AutoDock CrankPep进行蛋白-多肽的分子对接,从而预测蛋白-多肽的构象、得到分子对接的能量以及结合亲和力。AutoDock CrankPep则是一个专门用于多肽对接工具,其基于蛋白折叠和刚性受体网格能量背景下,采用蒙特卡罗方法对多肽的折叠进行计算,产生多肽的对接构象。

    参数说明

    Receptor File

    受体结构文件,PDB格式。

    Peptide Sequence String

    多肽的氨基酸序列,可以成功对接长度达20个氨基酸的肽。一行一条序列,例如:

    AINMDSFHTWKVLECGRPQY
    HRIAQCSDKW
    IYSADCLPKG
    AAAAIS
    

    注意:最多支持多肽的氨基酸序列长度为35左右。

    Box Center

    对接口袋中心的三维坐标(XYZ),空格分割。例如:10 2 -11。

    Box Size

    对接口袋长方体盒子的大小,必须是整数,空格分割,例如 30 30 30。

    Out Pose

    每个多肽与蛋白对接后输出的构象数目,默认为10。

    结果说明

    输出结果包括:

    输出文件名称 说明
    Scores.csv 提交多肽与受体的打分文件。
    output_complex_top1.pdb 展示打分第一的多肽与受体的复合物构象。
    output_complex_topn.tar.gz TopN多肽“Out Pose”构象数与受体形成的复合物结构PDB文件压缩包。

    其中Scores.csv包括信息如下:

    字段名称 说明
    Name 对接多肽名称
    Score(kcal/mol) 对接打分,该值越低说明结合亲和力越高。
    Cluster RMSD 聚类后,构象之间的RMSD
    Average RMSD 平均RMSD
    Complex File Name 复合物文件名称

    参考文献

    • J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling. DOI: 10.1021/acs.jcim.1c00203
    • O. Trott, A. J. Olson, AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading, Journal of Computational Chemistry 31 (2010) 455-461. DOI: 10.1002/jcc.21334
    • Podtelezhnikov, A.A., Wild, D.L. CRANKITE: A fast polypeptide backbone conformation sampler. Source Code Biol Med 3, 12 (2008). DOI: 10.1186/1751-0473-3-12

    Peptide VS

    Introduction

    The Peptide VS module integrates AutoDock Vina and AutoDock CrankPep for protein-polypeptide molecular docking, predicting the conformation of the protein-polypeptide complex, the docking energy, and the binding affinity. AutoDock Vina is a molecular docking tool that compares the binding affinities of multiple molecules and is used for screening, designing, and optimizing drug molecules. AutoDock CrankPep is a dedicated peptide docking tool: it folds the peptide with a Monte Carlo method in the energy landscape of a rigid receptor represented by grids, generating docked conformations of the peptide. Redocking has been demonstrated for peptides of up to 20 amino acids in length.

    Parameters

    Receptor File

    Structure file of the receptor in PDB format.

    Peptide Sequence String

    Amino acid sequences of the peptides to dock, one sequence per line. Docking has been demonstrated for peptides of up to 20 amino acids in length. For example:

    AINMDSFHTWKVLECGRPQY  
    HRIAQCSDKW  
    IYSADCLPKG  
    AAAAIS  
    

    Note: The maximum supported peptide length is approximately 35 amino acids.

    Box Center

    Three-dimensional coordinates (XYZ) of the docking pocket center, separated by spaces. For example: -44.497 -22 -5.

    Box Size

    Size of the docking pocket rectangular box, must be integers, separated by spaces, for example 30 30 30.

    TopN

    Number of top-scoring peptides to write to the output files, default is 100.

    Out Pose

    Number of conformations output for each peptide-protein docking, default is 10.
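
    Before submitting a job, the text inputs described above can be checked with a small Python sketch such as the one below. It is only an illustration: the names parse_peptides and parse_box are hypothetical, the restriction to the 20 standard one-letter amino acid codes is an assumption, and the length limit of 35 is taken from the approximate figure in the note above.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # assumed alphabet: 20 standard residues


def parse_peptides(text, max_len=35):
    """Return the peptide sequences in `text`, one per non-empty line."""
    peptides = []
    for line in text.splitlines():
        seq = line.strip().upper()
        if not seq:
            continue
        unknown = set(seq) - STANDARD_AA
        if unknown:
            raise ValueError(f"unsupported residue(s) {sorted(unknown)} in {seq}")
        if len(seq) > max_len:
            raise ValueError(f"{seq} is longer than {max_len} residues")
        peptides.append(seq)
    return peptides


def parse_box(center_text, size_text):
    """Parse space-separated box center (floats) and box size (integers)."""
    center = tuple(float(x) for x in center_text.split())
    size = tuple(int(x) for x in size_text.split())  # int() rejects "30.5"
    if len(center) != 3 or len(size) != 3:
        raise ValueError("box center and box size each need three values")
    return center, size


peptides = parse_peptides("AINMDSFHTWKVLECGRPQY  \nHRIAQCSDKW\n\nAAAAIS")
box = parse_box("-44.497 -22 -5", "30 30 30")
```

    A non-integer box size such as 30.5 raises a ValueError in this sketch, mirroring the requirement that the box size be given as integers.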

    Results

    The output includes:

    Output File Name Description
    Scores.csv Scoring file for the docking of peptides with the receptor.
    output_complex_top1.pdb Conformation of the top scoring peptide-receptor complex.
    output_complex_topn.tar.gz Archive of PDB files for the top N peptides, each with its “Out Pose” conformations in complex with the receptor.

    The Scores.csv file includes the following information:

    Field Name Description
    Name Name of the docked peptide
    Score(kcal/mol) Docking score; lower values indicate higher binding affinity.
    Cluster RMSD RMSD between conformations after clustering
    Average RMSD Average RMSD
    Complex File Name Name of the complex file
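
    Since lower scores indicate higher binding affinity, ranking the peptides means sorting Scores.csv in ascending order of the score column. The Python sketch below shows one way to do this with the standard csv module; the function name rank_scores is hypothetical, and the rows in the example are invented purely for illustration and are not real docking results.

```python
import csv
import io


def rank_scores(csv_text, score_column="Score(kcal/mol)"):
    """Return the rows of a Scores.csv-like table, best (lowest) score first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda row: float(row[score_column]))


# Invented example rows (not real docking output).
example_csv = """Name,Score(kcal/mol),Complex File Name
HRIAQCSDKW,-7.2,complex_1.pdb
AAAAIS,-5.1,complex_2.pdb
IYSADCLPKG,-8.4,complex_3.pdb
"""
ranked = rank_scores(example_csv)
best = ranked[0]["Name"]
```

    For a real run, the text would be read from Scores.csv instead of the inline string.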

    References

    • J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling. DOI: 10.1021/acs.jcim.1c00203
    • O. Trott, A. J. Olson, AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading, Journal of Computational Chemistry 31 (2010) 455-461. DOI: 10.1002/jcc.21334
    • Podtelezhnikov, A.A., Wild, D.L. CRANKITE: A fast polypeptide backbone conformation sampler. Source Code Biol Med 3, 12 (2008). DOI: 10.1186/1751-0473-3-12
  • Name: Alanine Scan (MMPBSA)
    Description: 计算丙氨酸突变后的结合自由能 Alanine Scan (MMPBSA) calculates components of binding free energy after alanine mutation using the MM-PBSA method.
    Author: WECOMPUT
    Release: 2023-08-03 09:10:47
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMPBSA)

    简介

    Alanine Scan (MMPBSA)是计算丙氨酸突变后的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMPBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Mutation Residue

    突变扫描为丙氨酸(ALA)的氨基酸位置。格式为res1:res2:res3:res4,其中“res1-res4”数字为残基编号。

    Force File

    丙氨酸扫描时使用的力场。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。只能是AMBER力场下构建的膜体系才能进行计算。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMPBSA_result.txt MMPBSA丙氨酸突变结果汇总文件。
    MMPBSA_Residue.csv 丙氨酸突变能量分解数据CSV文件。
    MMPBSA.pdb 丙氨酸突变后,原子对应的MMPBSA能量放到PDB文件。可以做对应能量类别的表面图,从图的颜色深浅可以看出哪些区域对结合能贡献值较大。
    MMPBSA.tar.gz MMPBSA所有原始文件。包括_mmpbsa_residue_#.txt是每个残基对应的每个能量类别的数值,共包含7个能量类别:范德华能(VDW)、静电能(ELE)、溶剂化能极性部分(PB)、溶剂化能非极性部分(SA)、VDW+ELE=MM、PB+SA=PBSA、MM+PBSA=Binding/MMPBSA。_mmpbsa_residue.txt是对上述7个文件的总结,即为MMPBSA_Residue.csv对应的原始文件。_mmpbsa_atom#.pdb是将每个原子对应的能量类别放到pdb文件,与MMPBSA.pdb相似。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    Alanine Scan (MMPBSA)

    Introduction

    Alanine Scan (MMPBSA) calculates the binding free energy after alanine mutations and provides energy decomposition data, binding constants (Ka), and inhibitor constants (Ki). The entropy term is computed with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropy component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation; the calculation should be restricted to the part of the trajectory in which the structure is stable.
    This module provides four calculation methods. Trajectory calculates the receptor-ligand binding free energy from a molecular dynamics trajectory. One Structure calculates the receptor-ligand binding free energy for a single-frame PDB structure or a docked PDB structure; in the MMPBSA of One Structure workflow, a PDB file can be supplied directly.
    Index Name takes the group names of the receptor and ligand, while Custom Name takes the residue numbers of the receptor and ligand in the PDB.
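
    For orientation, a binding free energy can be converted to Ka and Ki with the standard thermodynamic relation ΔG = -RT ln Ka = RT ln Ki. The Python sketch below applies that relation; it is not taken from the module, the function name binding_constants is hypothetical, Ki is treated as a dissociation constant relative to a 1 M standard state, and the temperature of 298.15 K is an assumed default.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_constants(delta_g, temperature=298.15):
    """Return (Ka, Ki) for a binding free energy given in kcal/mol.

    Ka is in 1/M and Ki in M, relative to a 1 M standard state.
    """
    ki = math.exp(delta_g / (R_KCAL * temperature))  # from dG = RT ln Ki
    return 1.0 / ki, ki


# A binding free energy of -10 kcal/mol corresponds to a Ki in the tens of nM.
ka, ki = binding_constants(-10.0)
```

    More negative binding free energies give smaller Ki and larger Ka, so a destabilizing alanine mutation shows up as an increase in Ki.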

    Parameter Description

    Trajectory Method

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand, can be Protein, DNA, or RNA. For a small molecule, enter its name in the PDB. If everything in the system other than the protein is the ligand (including small molecules), Other can be used.

    Mutation Residue

    Residue positions to mutate to alanine (ALA) in the scan. The format is res1:res2:res3:res4, where res1 to res4 are residue numbers.
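
    The colon-separated format can be parsed with a few lines of Python. This sketch is illustrative only: the function name parse_mutation_residues is hypothetical, and dropping repeated residue numbers while keeping the input order is a design choice of the sketch rather than documented behavior.

```python
def parse_mutation_residues(spec):
    """Split a specification such as "12:45:101" into residue numbers."""
    residues = []
    for token in spec.split(":"):
        token = token.strip()
        if not token.isdigit():
            raise ValueError(f"not a residue number: {token!r}")
        number = int(token)
        if number not in residues:  # keep the first occurrence only
            residues.append(number)
    return residues


scan_positions = parse_mutation_residues("12:45:45:101")
```

    Each returned position would then be mutated to alanine in turn and the binding free energy recomputed.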

    Force File

    Force field used for alanine scanning.

    Start Time (ps)

    Start frame time in ps. It is best to choose a time window in which the RMSD is stable, to avoid an inflated overall entropy caused by structural instability.

    End Time (ps)

    End frame time in ps. It is best to choose a time window in which the RMSD is stable, to avoid an inflated overall entropy caused by structural instability.

    Skip Time (ps)

    Time interval in ps.

    Index File

    Index file in ndx format. When there is a membrane structure in the system, you can obtain the index.ndx file through the Membrane Solvation module. In membrane simulations, the receptor is ‘receptor’, the ligand is ‘ligand’, and the membrane is ‘membrane’. Only membrane systems built under the AMBER force field can be calculated.

    Custom Receptor

    Define the receptor group for the binding energy calculation. Enter the sequence numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. These numbers are counted from 1 and are independent of the residue numbering in the original PDB file.

    Custom Ligand

    Define the ligand group for the binding energy calculation. Enter the sequence numbers of the protein amino acids or nucleic acid bases in the group, for example 1-213 or 1-211,212-213. These numbers are counted from 1 and are independent of the residue numbering in the original PDB file.

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    Compressed system parameter file in tar.gz format, obtained from the MD Solvation module or Membrane Solvation module.

    Result Description

    The output includes:

    Output File Name Description
    MMPBSA_result.txt Summary file of MMPBSA alanine mutation results.
    MMPBSA_Residue.csv Energy decomposition data for alanine mutations in CSV format.
    MMPBSA.pdb MMPBSA energy corresponding to atoms after alanine mutations in a PDB file. This can be used to create a surface map of energy categories, where the color depth indicates the regions contributing significantly to the binding energy.
    MMPBSA.tar.gz All original MMPBSA files. It includes the mmpbsa_residue#.txt files, which hold the per-residue values of each energy category. There are 7 energy categories: van der Waals energy (VDW), electrostatic energy (ELE), polar part of the solvation energy (PB), nonpolar part of the solvation energy (SA), MM = VDW + ELE, PBSA = PB + SA, and Binding/MMPBSA = MM + PBSA. _mmpbsa_residue.txt summarizes these 7 files and is the original file behind MMPBSA_Residue.csv. _mmpbsa_atom#.pdb stores the energy categories per atom in a pdb file, similar to MMPBSA.pdb.

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: MMPBSA
    Description: MMPBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。 MMPBSA calculates the binding free energy between the receptor and ligand and provides energy decomposition data, binding constant (Ka), and inhibitor constant (Ki).
    Author: WECOMPUT
    Release: 2023-08-03 09:10:29
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001 Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8. https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMPBSA

    简介

    MMPBSA计算受体与配体之间的结合自由能,并且提供能量分解数据、结合常数(Ka)、抑制剂常数(Ki)。熵的计算采用的是张增辉教授的相互作用熵的方法,该方法直接从分子动力学模拟计算结合自由能的熵组分(相互作用熵或-TΔS),但是需要选取结构稳定部分进行计算。
    本模块提供四种计算方法,其中 Trajectory 是针对进行分子动力学模拟轨迹后计算其受配体之间的结合自由能;One Structure 是针对一帧pdb结构或者对接得到的PDB结构计算受配体之间的结合自由能,MMPBSA of One Structure 计算流程中可以直接输入PDB进行计算。
    Index Name 是输入受配体组别名称;Custom Name 则是输入受配体的在PDB中的残基编号。

    参数说明

    Trajectory方法

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD模块中获取。

    Receptor Name

    受体名称,可以为Protein、DNA、RNA。

    Ligand Name

    配体名称,可以为Protein、DNA、RNA。如果为小分子,填写其在PDB中的名称。如果体系中除了蛋白以外为配体(包括小分子)可用Other表示。

    Start Time (ps)

    起始帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    End Time (ps)

    结束帧时间,单位ps。最好选取RMSD稳定时区进行计算,以消除结构不稳定时导致的整体熵偏大。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。当体系中存在膜结构的时候可以通过Membrane Solvation模块获取index.ndx文件。膜模拟中的受体为receptor,配体为ligand,膜为membrane。

    Custom Receptor

    定义受体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    Custom Ligand

    定义配体组别进行结合能计算。组别中填写的为蛋白氨基酸或者核酸碱基的序号。例如1-213或者1-211,212-213。蛋白氨基酸或者核酸碱基序号从1开始重新编号,与初始pdb氨基酸编号无关。

    One Structure方法

    System Topology

    拓扑文件,由MD Solvation模块或者Membrane Solvation模块得到。

    System GRO

    结构文件,.gro格式,由MD Solvation模块或者Membrane Solvation模块得到。

    System ITP

    体系参数压缩文件,tar.gz格式。由MD Solvation模块或者Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    MMPBSA_result.txt MMPBSA结果汇总文件。
    MMPBSA_Residue.csv 能量分解数据CSV文件。
    MMPBSA.pdb 原子对应的MMPBSA能量放到PDB文件。可以做对应能量类别的表面图,从图的颜色深浅可以看出哪些区域对结合能贡献值较大。
    MMPBSA.tar.gz MMPBSA所有原始文件。包括_mmpbsa_residue_#.txt是每个残基对应的每个能量类别的数值,共包含7个能量类别:范德华能(VDW)、静电能(ELE)、溶剂化能极性部分(PB)、溶剂化能非极性部分(SA)、VDW+ELE=MM、PB+SA=PBSA、MM+PBSA=Binding/MMPBSA。_mmpbsa_residue.txt是对上述7个文件的总结,即为MMPBSA_Residue.csv对应的原始文件。_mmpbsa_atom#.pdb是将每个原子对应的能量类别放到pdb文件,与MMPBSA.pdb相似。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

    MMPBSA

    Introduction

    MMPBSA calculates the binding free energy between a receptor and a ligand and provides energy decomposition data, the binding constant (Ka), and the inhibitor constant (Ki). The entropy term is computed with the interaction entropy method proposed by Professor Zhang Zenghui, which obtains the entropic component of the binding free energy (the interaction entropy, -TΔS) directly from the molecular dynamics simulation; it requires that a structurally stable part of the trajectory be selected for the calculation.
    This module provides four calculation methods. Trajectory calculates the receptor-ligand binding free energy from a molecular dynamics trajectory. One Structure calculates it for a single-frame PDB structure or a docked PDB structure; in the MMPBSA of One Structure workflow, a PDB file can be input directly.
    Index Name takes the group names of the receptor and ligand, while Custom Name takes the residue numbers of the receptor and ligand in the PDB.
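    The interaction entropy term can be illustrated with a short sketch. It follows the formula of the cited Duan et al. paper, -TΔS = kT ln⟨exp(βΔE_int)⟩, where ΔE_int is the per-frame fluctuation of the receptor-ligand interaction energy around its average. The function names, the kcal/mol units, and the 300 K default are assumptions for illustration. The conversion to Ka and Ki uses the standard relation ΔG = -RT ln Ka with Ki taken as 1/Ka; the convention the module itself uses is not stated here.

```python
import numpy as np

KB = 0.0019872041  # Boltzmann (gas) constant in kcal/(mol*K)


def interaction_entropy(e_int, temperature=300.0):
    """Return -T*dS in kcal/mol from per-frame interaction energies (kcal/mol)."""
    e = np.asarray(e_int, dtype=float)
    kt = KB * temperature
    fluct = e - e.mean()  # fluctuation around the ensemble average
    return kt * np.log(np.mean(np.exp(fluct / kt)))


def binding_constants(delta_g, temperature=300.0):
    """Return (Ka, Ki) from a binding free energy in kcal/mol, assuming Ki = 1/Ka."""
    ka = np.exp(-delta_g / (KB * temperature))
    return ka, 1.0 / ka
```

    Because the exponential average is dominated by large fluctuations, frames from an unstable part of the trajectory inflate -TΔS, which is why a stable RMSD region is recommended.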

    Parameter Description

    Trajectory Method

    Path File

    Path file obtained after MD simulation, available in the GMX MD Run (GMX2023) module or AlphaAutoMD module.

    Receptor Name

    Name of the receptor, can be Protein, DNA, or RNA.

    Ligand Name

    Name of the ligand: Protein, DNA, or RNA. For a small molecule, enter its name in the PDB. Use ‘Other’ if everything in the system other than the protein (including small molecules) is to be treated as the ligand.

    Start Time (ps)

    Start frame time in ps. Preferably choose a region where the RMSD is stable, so that structural instability does not inflate the overall entropy.

    End Time (ps)

    End frame time in ps. Preferably choose a region where the RMSD is stable, so that structural instability does not inflate the overall entropy.

    Skip Time (ps)

    Time interval in ps.

    Index File

    Index file in ndx format. When there is a membrane structure in the system, you can obtain the index.ndx file through the Membrane Solvation module. In membrane simulations, the receptor is ‘receptor’, the ligand is ‘ligand’, and the membrane is ‘membrane’.

    Custom Receptor

    Define receptor groups for binding energy calculation. Enter the sequence numbers of protein amino acids or nucleic acid bases in the group. For example, 1-213 or 1-211,212-213. Protein amino acid or nucleic acid base numbers start from 1 and are independent of the initial pdb amino acid numbering.

    Custom Ligand

    Define ligand groups for binding energy calculation. Enter the sequence numbers of protein amino acids or nucleic acid bases in the group. For example, 1-213 or 1-211,212-213. Protein amino acid or nucleic acid base numbers start from 1 and are independent of the initial pdb amino acid numbering.

    One Structure Method

    System Topology

    Topology file obtained from the MD Solvation module or Membrane Solvation module.

    System GRO

    Structure file in .gro format obtained from the MD Solvation module or Membrane Solvation module.

    System ITP

    Compressed system parameter file in tar.gz format, obtained from the MD Solvation module or Membrane Solvation module.

    Result Description

    The output includes:

    Output File Name Description
    MMPBSA_result.txt Summary file of MMPBSA results.
    MMPBSA_Residue.csv Energy decomposition data in CSV format.
    MMPBSA.pdb MMPBSA energy corresponding to atoms in a PDB file. This can be used to create a surface map of energy categories, where the color depth indicates the regions contributing significantly to the binding energy.
    MMPBSA.tar.gz All original MMPBSA files. It includes the mmpbsa_residue#.txt files, which hold the per-residue values of each energy category. There are 7 energy categories: van der Waals energy (VDW), electrostatic energy (ELE), polar part of the solvation energy (PB), nonpolar part of the solvation energy (SA), MM = VDW + ELE, PBSA = PB + SA, and Binding/MMPBSA = MM + PBSA. _mmpbsa_residue.txt summarizes these 7 files and is the original file behind MMPBSA_Residue.csv. _mmpbsa_atom#.pdb stores the energy categories per atom in a pdb file, similar to MMPBSA.pdb.
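    The relations between the 7 energy categories can be written out as a small worked example. The function name and the numbers passed to it are made up for illustration and are not output of the module.

```python
def combine_energy_terms(vdw, ele, pb, sa):
    """Build the 7 energy categories from the four primary terms."""
    mm = vdw + ele        # MM = VDW + ELE
    pbsa = pb + sa        # PBSA = PB + SA
    return {
        "VDW": vdw,
        "ELE": ele,
        "PB": pb,
        "SA": sa,
        "MM": mm,
        "PBSA": pbsa,
        "MMPBSA": mm + pbsa,  # Binding = MM + PBSA
    }


# Illustrative values only.
terms = combine_energy_terms(vdw=-40.0, ele=-20.0, pb=35.0, sa=-5.0)
print(terms)
```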

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.
    Duan L, Liu X, Zhang JZ. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J Am Chem Soc. 2016 May 4;138(17):5722-8.
    https://github.com/Electrostatics/apbs/blob/main/docs/index.rst

  • Name: MD PCA
    Description: 从高维数据中分析出主要的影响因素 (本征向量) ,前几个本征向量(主成分,如前两个主成份则为 PC1,PC2) 一般可以描述分子运动的大部分信息。 Analyze the main influencing factors (eigenvectors) from the high-dimensional data. The first few eigenvectors (principal components, such as PC1 and PC2 for the first two principal components) can generally describe most of the information about molecular motion.
    Author: WECOMPUT
    Release: 2023-07-06 00:51:22
    Reference:

    MD PCA

    简介

    N个原子的柔性大体系如蛋白,其运动轨迹需要3N维笛卡尔坐标来描述,这样高维的数据很难理解和直观分析。MD PCA(Principal component analysis,PCA)模块可以从高维数据中分析出主要的影响因素 (本征向量) 。前几个本征向量(主成分,如前两个主成份则为 PC1,PC2) 一般可以描述分子运动的大部分信息。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    得到结果文件,每种类型的文件如果包含PNG、CSV以及XVG后缀,相同名称只是表现形式不同,数据一样

    输出文件名称 说明
    average.pdb 计算后的平均结构文件
    filtered.xtc 计算的降维过滤后的轨迹文件
    eigenvalues.xvg 本征值文件
    proj1.xvg 对应的主成分PC1文件
    proj2.xvg 对应的主成分PC2文件
    proj_all.xvg 计算的PC1到PC2的主成份合并文件
    Gibbs_2d.png/Gibbs_3d.png 只计算两个主成分时的二维和三维自由能景观图

    MD PCA

    Introduction

    For a large flexible system of N atoms, such as a protein, describing the trajectory requires 3N-dimensional Cartesian coordinates; data of such high dimensionality are hard to interpret and analyze intuitively. The MD PCA (principal component analysis, PCA) module extracts the main contributing factors (eigenvectors) from the high-dimensional data. The first few eigenvectors (principal components; the first two are PC1 and PC2) can generally describe most of the information about molecular motion.
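    The idea can be sketched with NumPy: flatten each frame to a 3N-vector, build the covariance matrix, and diagonalize it. This is a minimal illustration, not the module's implementation; the function name is hypothetical, and the frames are assumed to be already superimposed on a reference structure.

```python
import numpy as np


def trajectory_pca(coords):
    """PCA of a trajectory given as an array of shape (n_frames, n_atoms, 3)."""
    n_frames = coords.shape[0]
    x = coords.reshape(n_frames, -1)        # one 3N-dimensional vector per frame
    x = x - x.mean(axis=0)                  # subtract the average structure
    cov = x.T @ x / n_frames                # 3N x 3N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    projections = x @ eigvecs               # column 0 is PC1, column 1 is PC2, ...
    return eigvals, eigvecs, projections
```

    The variance of the trajectory along each principal component equals the corresponding eigenvalue, so the leading eigenvalues show how much of the motion PC1 and PC2 capture.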

    Parameter Description

    Path File

    Path file obtained after MD simulation; it is available from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.
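    The selection syntax shared by Custom Resid and Custom Atom (a hyphen for a continuous range, commas between separate entries) can be expanded as in the following sketch. The function name parse_selection is hypothetical and only illustrates the format.

```python
def parse_selection(text):
    """Expand a selection such as '1-10,15' into a list of integers."""
    numbers = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-"))
            if start > end:
                raise ValueError(f"range is reversed: {part!r}")
            numbers.extend(range(start, end + 1))  # ranges are inclusive
        else:
            numbers.append(int(part))
    return numbers


print(parse_selection("1-10,15"))
```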

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The result files are listed below. Files that share a name but have PNG, CSV, or XVG suffixes contain the same data in different formats.

    Output File Name Description
    average.pdb Computed average structure file
    filtered.xtc Filtered trajectory file after dimensionality reduction
    eigenvalues.xvg Eigenvalues file
    proj1.xvg Corresponding principal component PC1 file
    proj2.xvg Corresponding principal component PC2 file
    proj_all.xvg Combined file of principal components PC1 to PC2
    Gibbs_2d.png/Gibbs_3d.png 2D and 3D free energy landscape plots when only two principal components are considered
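    A free energy landscape over two principal components is commonly estimated from the probability density of the projections, G = -kT ln P(PC1, PC2). The sketch below shows that relation with NumPy; it is an assumption about the general technique, not a description of how the module produces Gibbs_2d.png and Gibbs_3d.png. The function name, bin count, and temperature are illustrative.

```python
import numpy as np

KB_KJ = 0.0083144626  # Boltzmann (gas) constant in kJ/(mol*K)


def free_energy_surface(pc1, pc2, bins=32, temperature=300.0):
    """Return a 2D free energy grid (kJ/mol) and the bin edges along PC1 and PC2."""
    hist, xedges, yedges = np.histogram2d(pc1, pc2, bins=bins)
    prob = hist / hist.sum()
    with np.errstate(divide="ignore"):
        g = -KB_KJ * temperature * np.log(prob)  # empty bins become +inf
    g -= g[np.isfinite(g)].min()                 # set the global minimum to zero
    return g, xedges, yedges
```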
  • Name: MD SASA
    Description: 计算指定组别的溶剂可及表面积 Calculates the solvent accessible surface area (SASA) for a specified group
    Author: WECOMPUT
    Release: 2023-07-06 00:29:36
    Reference:

    MD SASA

    简介

    MD SASA模块是计算指定组别的溶剂可及表面积(solvent accessible surface area,SASA)。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    输出结果包括:

    输出文件名称 说明
    area.csv 溶剂可及表面积CSV文件
    area.xvg 溶剂可及表面积XVG文件
    area.png 溶剂可及表面积PNG文件

    其中area.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Total Area (nm^2) 溶剂可及表面积
    Hydrophobic (nm^2) 疏水表面积
    Hydrophilic (nm^2) 亲水表面积

    MD SASA

    Introduction

    The MD SASA module calculates the solvent accessible surface area (SASA) of specified groups.

    Parameter Description

    Path File

    Path file obtained after MD simulation; it is available from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The output results include:

    Output File Name Description
    area.csv Solvent accessible surface area CSV file
    area.xvg Solvent accessible surface area XVG file
    area.png Solvent accessible surface area PNG file

    The area.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Total Area (nm^2) Total solvent accessible surface area
    Hydrophobic (nm^2) Hydrophobic surface area
    Hydrophilic (nm^2) Hydrophilic surface area
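    The following sketch averages the area columns of a file laid out like area.csv. It assumes the CSV header uses exactly the field names listed above; the two data rows are made-up numbers for illustration, and the function name is hypothetical.

```python
import csv
import io

AREA_COLUMNS = ["Total Area (nm^2)", "Hydrophobic (nm^2)", "Hydrophilic (nm^2)"]


def summarize_sasa(handle):
    """Return the time-averaged value of each area column."""
    rows = list(csv.DictReader(handle))
    return {
        column: sum(float(row[column]) for row in rows) / len(rows)
        for column in AREA_COLUMNS
    }


# Illustrative content only; a real run would open area.csv instead.
sample = io.StringIO(
    "Time (ns),Total Area (nm^2),Hydrophobic (nm^2),Hydrophilic (nm^2)\n"
    "0.0,100.0,60.0,40.0\n"
    "0.1,110.0,64.0,46.0\n"
)
means = summarize_sasa(sample)
print(means)
```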
  • Name: MD Hbond
    Description: 分子动力学氢键分析 Hydrogen bond analysis between specified groups
    Author: WECOMPUT
    Release: 2023-07-05 17:34:57
    Reference:

    MD Hbond

    简介

    MD Hbond模板对于指定组别之间的氢键分析。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group1

    选择需要计算的氢键组别1:Protein,DNA,RNA。如果两个组相同则分析的是组内氢键。
    可以根据PDB中小分子的名称填写组别名称。

    System Group2

    选择需要计算的氢键组别2:Protein,DNA,RNA。如果两个组相同则分析的是组内氢键。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid1

    自定义需要计算的组1残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom1

    自定义需要计算的组1原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Resid2

    自定义需要计算的组2残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom2

    自定义需要计算的组2原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    输出结果包括:

    输出文件名称 说明
    hbnum.csv 氢键分析CSV文件
    hbnum.xvg 氢键分析XVG文件
    hbnum.png 氢键分析PNG文件

    其中hbnum.csv包括信息如下:

    字段名称 说明
    Time (ns) 时间
    Hydrogen bonds 氢键数目
    Pairs within 0.35 nm 两个组相距0.35nm内的接触的原子数目

    MD Hbond

    Introduction

    The MD Hbond module analyzes hydrogen bonds between specified groups.

    Parameter Description

    Path File

    Path file obtained after MD simulation; it is available from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group1

    Select the hydrogen bond group 1 for calculation: Protein, DNA, RNA. If both groups are the same, it analyzes intra-group hydrogen bonds. You can specify group names based on the names of small molecules in the PDB file.

    System Group2

    Select the hydrogen bond group 2 for calculation: Protein, DNA, RNA. If both groups are the same, it analyzes intra-group hydrogen bonds. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid1

    Custom residue numbers for group 1 calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom1

    Custom atom numbers for group 1 calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Custom Resid2

    Custom residue numbers for group 2 calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom2

    Custom atom numbers for group 2 calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Result Description

    The output results include:

    Output File Name Description
    hbnum.csv Hydrogen bond analysis CSV file
    hbnum.xvg Hydrogen bond analysis XVG file
    hbnum.png Hydrogen bond analysis PNG file

    The hbnum.csv file includes the following information:

    Field Name Description
    Time (ns) Time
    Hydrogen bonds Number of hydrogen bonds
    Pairs within 0.35 nm Number of atom pairs between the two groups that are within 0.35 nm of each other
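    A distance-based pair count like the last field can be sketched as follows. This is a simplified illustration: it counts every atom pair between two coordinate sets within the cutoff and ignores periodic boundary conditions, donor/acceptor typing, and any angle criterion that a real hydrogen bond analysis applies. The function name is hypothetical.

```python
import numpy as np


def count_pairs_within(group1, group2, cutoff=0.35):
    """Count atom pairs (one atom from each group) closer than cutoff, in nm."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    diff = g1[:, None, :] - g2[None, :, :]  # all pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1)    # pairwise distances
    return int(np.count_nonzero(dist <= cutoff))


print(count_pairs_within([[0.0, 0.0, 0.0]], [[0.3, 0.0, 0.0], [0.5, 0.0, 0.0]]))
```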
  • Name: MD Gyration
    Description: 回旋半径分析,可用来衡量体系模拟时的质权平均半径 Radius of gyration analysis, which measures the mass-weighted average radius of the system during simulation
    Author: WECOMPUT
    Release: 2023-07-05 16:24:54
    Reference:

    MD Gyration

    简介

    MD Gyration回旋半径分析,可用来衡量体系模拟时的质权平均半径。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    Index File

    索引文件,格式为ndx

    结果说明

    输出结果包括:

    输出文件名称 说明
    gyrate.csv 回转半径CSV文件
    gyrate.xvg 回转半径XVG文件
    gyrate.png 回转半径PNG文件

    其中gyrate.csv包括信息如下:

    字段名称 说明
    Time (ps) 时间
    Rg 回旋半径
    Rg(X) 绕着x轴的回旋半径
    Rg(Y) 绕着y轴的回旋半径
    Rg(Z) 绕着z轴的回旋半径

    MD Gyration

    Introduction

    MD Gyration is a radius of gyration analysis used to measure the mass-weighted average radius of a system during simulation.

    Parameter Description

    Path File

    Path file obtained after MD simulation; it is available from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10, 15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10, 15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Index File

    Index file in ndx format.

    Result Description

    The output results include:

    Output File Name Description
    gyrate.csv Gyration radius CSV file
    gyrate.xvg Gyration radius XVG file
    gyrate.png Gyration radius PNG file

    The gyrate.csv file includes the following information:

    Field Name Description
    Time (ps) Time
    Rg Radius of gyration
    Rg(X) Radius of gyration around the x-axis
    Rg(Y) Radius of gyration around the y-axis
    Rg(Z) Radius of gyration around the z-axis
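    For reference, the mass-weighted radius of gyration and its per-axis components for a single frame can be computed as in the sketch below. The radius about an axis uses only the distance from that axis, so Rg(X) depends on the y and z coordinates. The function name is hypothetical; the module's own output comes from its trajectory analysis, not from this code.

```python
import numpy as np


def radius_of_gyration(coords, masses):
    """Return (Rg, Rg_x, Rg_y, Rg_z) for one frame; coords has shape (n_atoms, 3)."""
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    com = np.average(coords, axis=0, weights=masses)  # center of mass
    r = coords - com
    total = masses.sum()
    sq = r ** 2
    rg = np.sqrt(np.sum(masses * sq.sum(axis=1)) / total)
    # About each axis, only the two perpendicular coordinates contribute.
    rg_x = np.sqrt(np.sum(masses * (sq[:, 1] + sq[:, 2])) / total)
    rg_y = np.sqrt(np.sum(masses * (sq[:, 0] + sq[:, 2])) / total)
    rg_z = np.sqrt(np.sum(masses * (sq[:, 0] + sq[:, 1])) / total)
    return rg, rg_x, rg_y, rg_z
```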
  • Name: MD Clustering
    Description: 分子动力学轨迹进行归簇分析 Clustering analysis for dynamic trajectories.
    Author: WECOMPUT
    Release: 2023-07-04 11:40:38
    Reference:

    MD Clustering

    简介

    MD Clustering是对动力学轨迹进行归簇分析。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)模块中获取。

    Cutoff

    聚类时结构的RMSD截断值(nm)

    Cluster Method

    聚类算法:linkage, jarvis-patrick, monte-carlo, diagonalization, gromos, 默认使用gromos算法。

    System Group

    选择需要计算的结构组别:Backbone,Protein,DNA,RNA。
    可以根据PDB中小分子的名称填写组别名称。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Custom Atom

    自定义需要计算的原子编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15

    Skip Time (ns)

    每一帧的间隔时间(单位ns)

    结果说明

    输出结果包括:

    输出文件名称 说明
    clusters.pdb 差异较大的每个簇的代表性结构
    clust-size.xvg 各个簇的帧数
    cluster.xvg 各个簇和轨迹帧号的对应关系

    MD Clustering

    Introduction

    MD Clustering is a clustering analysis of molecular dynamics trajectories.

    Parameter Description

    Path File

    Path file obtained after MD simulation; it is available from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    Cutoff

    RMSD cutoff value for clustering (in nm).

    Cluster Method

    Clustering algorithm: linkage, jarvis-patrick, monte-carlo, diagonalization, gromos. The default method is gromos.
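    The default gromos method can be sketched as follows: count, for every structure, the neighbors within the RMSD cutoff; take the structure with the most neighbors, together with those neighbors, as a cluster; remove them from the pool and repeat. The sketch operates on a precomputed pairwise RMSD matrix; the function name and the example matrix are illustrative assumptions.

```python
import numpy as np


def gromos_cluster(rmsd, cutoff):
    """Cluster frames from a symmetric RMSD matrix; returns (center, members) pairs."""
    rmsd = np.asarray(rmsd, dtype=float)
    remaining = list(range(rmsd.shape[0]))
    clusters = []
    while remaining:
        sub = rmsd[np.ix_(remaining, remaining)]
        neighbours = sub <= cutoff              # each frame is its own neighbor
        best = int(np.argmax(neighbours.sum(axis=1)))
        members = [remaining[i] for i in np.flatnonzero(neighbours[best])]
        clusters.append((remaining[best], members))
        remaining = [i for i in remaining if i not in members]
    return clusters


# Illustrative RMSD matrix (nm): frames 0-2 are similar, frame 3 is an outlier.
example = [
    [0.00, 0.05, 0.05, 0.50],
    [0.05, 0.00, 0.05, 0.50],
    [0.05, 0.05, 0.00, 0.50],
    [0.50, 0.50, 0.50, 0.00],
]
print(gromos_cluster(example, cutoff=0.1))
```

    A larger Cutoff merges more frames into each cluster and yields fewer clusters; a smaller one does the opposite.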

    System Group

    Select the structural group for calculation: Backbone, Protein, DNA, RNA. You can specify group names based on the names of small molecules in the PDB file.

    Custom Resid

    Custom residue numbers for calculation. Use “-” for continuous residues and commas to separate non-continuous residues. For example: 1-10,15.

    Custom Atom

    Custom atom numbers for calculation. Use “-” for continuous atoms and commas to separate non-continuous atoms. For example: 1-10,15.

    Skip Time (ns)

    Time interval between frames (in ns).

    Result Description

    The output results include:

    Output File Name Description
    clusters.pdb Representative structures of each cluster with significant differences
    clust-size.xvg Number of frames in each cluster
    cluster.xvg Correspondence between clusters and trajectory frame numbers
  • Name: GMX MDP Generation (Auto)
    Description: 根据所选体系(膜,受体,配体)自动生成分子动力学模拟过程中所需的MDP文件,此文件是Gromacs分子动力学模拟需要用到输入文件,里面包含各种参数。若需要设置更细节的参数,请前往Minimize MDP Generation,NPT MDP Generation,MD MDP Generation模块。 Automatically generates the MDP files required for molecular dynamics simulation based on the selected system (membrane, receptor, ligand). The MDP file is the input file for GROMACS molecular dynamics simulation and contains the simulation parameters. To set parameters in more detail, use the Minimize MDP Generation, NPT MDP Generation, and MD MDP Generation modules.
    Author: WECOMPUT
    Release: 2023-06-26 10:33:46
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    GMX MDP Generation (Auto)

    简介

    GMX MDP Generation (Auto)模块主要是根据所选体系(膜,受体,配体)自动生成分子动力学模拟过程中所需的MDP文件,此文件是Gromacs分子动力学模拟需要用到输入文件,里面包含各种参数。

    参数说明

    Group Name

    选择体系中存在的结构类型:membrane代表膜结构,receptor代表大分子结构(蛋白或者核酸),ligand代表小分子结构。

    Simulation Time (ns)

    模拟时长,单位为ns

    Time Step

    时间步长,单位ps

    Coupling Reference Temperature

    参考温度,单位为K

    结果说明

    输出结果包括:

    输出文件名称 说明
    mini.mdp 最小化MDP文件
    npt.mdp/npt.tar.gz NPT MDP文件
    md.mdp/md.tar.gz MD MDP文件

    参考文献

    Abraham, Mark James et al. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1 (2015): 19-25.

    GMX MDP Generation (Auto)

    Introduction

    The GMX MDP Generation (Auto) module is designed to automatically generate the MDP files required for molecular dynamics simulations based on the selected system (membrane, receptor, ligand). The MDP file is an input file required for Gromacs molecular dynamics simulations, containing various parameters.

    Parameter Description

    Group Name

    Select the type of structure present in the system: membrane for membrane structure, receptor for macromolecular structure (protein or nucleic acid), ligand for small molecule structure.

    Simulation Time (ns)

    Duration of the simulation, in units of ns.

    Time Step

    Time step for the simulation, in units of ps.

    Coupling Reference Temperature

    Reference temperature for the temperature coupling, in units of K.
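    The three numeric parameters map onto a handful of MDP options. The sketch below shows the arithmetic, in particular that the number of steps is the simulation time divided by the time step. It is only a fragment under stated assumptions: a single temperature-coupling group named System, the md integrator, and a hypothetical function name; the files the module generates contain many more settings.

```python
def md_mdp_lines(sim_time_ns, dt_ps, ref_t_kelvin):
    """Return a few MDP lines derived from simulation time, time step, and temperature."""
    nsteps = round(sim_time_ns * 1000.0 / dt_ps)  # 1 ns = 1000 ps
    return [
        "integrator = md",
        f"dt = {dt_ps}",
        f"nsteps = {nsteps}",
        "tc_grps = System",  # assumption: one coupling group for the whole system
        f"ref_t = {ref_t_kelvin}",
    ]


print("\n".join(md_mdp_lines(sim_time_ns=100, dt_ps=0.002, ref_t_kelvin=300)))
```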

    Result Description

    The output results include:

    Output File Name Description
    mini.mdp MDP file for minimization
    npt.mdp/npt.tar.gz MDP file for NPT ensemble simulation
    md.mdp/md.tar.gz MDP file for MD simulation

    Reference

    Abraham, Mark James et al. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1 (2015): 19-25.

  • Name: siRNA Designer
    Description: 基于靶点基因序列,设计siRNA分子序列。 Designs siRNA molecular sequences based on target gene sequences.
    Author: WECOMPUT
    Release: 2023-06-25 23:18:05
    Reference:

    siRNA Designer

    简介

    siRNA Designer基于靶点基因序列,设计siRNA分子序列。该方法考虑了多条siRNA设计规则,如下:

    • 36% < GCpercent < 52%
    • no internal short repeats
    • no GC stretches (more than 10 GC contiguous repeats)
    • 5’ end of the guide RNA is A/U
    • 5’ end of the passenger RNA is G/C
    • at least 4 A/U residues in the last 7bp of the 5’ end of the guide
    • No G at position 13 of the passenger
    • A/U at position 19 of the passenger
    • G/C at position 19 in guide

    参数说明

    RNA FASTA File

    靶点基因序列,支持多条,FASTA格式。

    结果说明

    输出结果文件为siRNAcandidates_序列名称.csv,包含信息如下:

    字段名称 说明
    Target starting position 靶点基因序列的起始位置
    Target ending position 靶点基因序列的终止位置
    Target sequence(21nt target + 2nt overhang) 靶点序列
    Target score 靶点打分,越高越好
    Guide sequence(5’->3’) 结合靶点基因的序列,也称为antisense sequence
    Passenger sequence(5’->3’) 与Guide sequence配对的序列
    Guide Tm Guide sequence计算的Melting Temperature值,一般情况下Tm值越低,发生副作用的可能性越小
    Passenger Tm Passenger sequence计算的Melting Temperature值

    siRNA Designer

    Introduction

    siRNA Designer designs siRNA molecule sequences based on target gene sequences. This method considers multiple siRNA design rules as follows:

    • 36% < GCpercent < 52%
    • no internal short repeats
    • no GC stretches (more than 10 GC contiguous repeats)
    • 5’ end of the guide RNA is A/U
    • 5’ end of the passenger RNA is G/C
    • at least 4 A/U residues in the last 7bp of the 5’ end of the guide
    • No G at position 13 of the passenger
    • A/U at position 19 of the passenger
    • G/C at position 19 in guide
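    Most of the rules above can be expressed as simple checks on the guide and passenger strands. The sketch below is an interpretation, not the module's scoring code: positions are assumed to be 1-based from the 5’ end, the GC percentage is computed on the guide, a GC stretch is read as a run of more than 10 consecutive G/C bases, and the A/U-richness rule is applied to the first 7 nucleotides of the guide. The rule on internal short repeats is left out because its definition is not given. The example sequence is made up.

```python
import re

AU = set("AU")
GC = set("GC")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return rna.translate(str.maketrans("AUGC", "UACG"))[::-1]


def check_sirna_rules(guide, passenger):
    """Evaluate each design rule; both strands are given 5'->3', at least 19 nt long."""
    gc_percent = 100.0 * sum(base in GC for base in guide) / len(guide)
    return {
        "gc_between_36_and_52": 36.0 < gc_percent < 52.0,
        "no_long_gc_stretch": re.search(r"[GC]{11,}", guide) is None,
        "guide_5p_is_AU": guide[0] in AU,
        "passenger_5p_is_GC": passenger[0] in GC,
        "guide_5p_AU_rich": sum(base in AU for base in guide[:7]) >= 4,
        "passenger_13_not_G": passenger[12] != "G",
        "passenger_19_is_AU": passenger[18] in AU,
        "guide_19_is_GC": guide[18] in GC,
    }


guide = "UAUACGAUCGGAUCAGUCC"  # illustrative 19-nt guide
passenger = reverse_complement(guide)
print(passenger, check_sirna_rules(guide, passenger))
```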

    Parameter Description

    RNA FASTA File

    Target gene sequences, supports multiple sequences in FASTA format.

    Result Description

    The output file is named siRNAcandidates_{sequence name}.csv, where {sequence name} is the name of the input sequence. It contains the following information:

    Field Name Description
    Target starting position Starting position of the target gene sequence
    Target ending position Ending position of the target gene sequence
    Target sequence (21nt target + 2nt overhang) Target sequence
    Target score Score assigned to the target, higher scores are better
    Guide sequence (5’->3’) Sequence that binds to the target gene, also known as the antisense sequence
    Passenger sequence (5’->3’) Sequence that pairs with the Guide sequence
    Guide Tm Melting Temperature value calculated for the Guide sequence. In general, lower Tm values indicate a lower likelihood of side effects
    Passenger Tm Melting Temperature value calculated for the Passenger sequence
  • Name: Membrane Solvation
    Description: 对输入的膜,受体,配体文件加入水盒子和离子。 Adds water box and ions for the membrane, receptor, ligand.
    Author: WECOMPUT
    Release: 2023-06-21 16:33:21
    Reference:

    Membrane Solvation

    简介

    Membrane Solvation对输入的膜,受体,配体文件加入水盒子和离子。

    参数说明

    Membrane Topology

    膜拓扑文件,top格式,可由GMX Membrane Parameterization模块生成。

    Membrane GRO

    膜结构文件,gro格式,可由GMX Membrane Parameterization模块生成。

    Membrane ITP

    膜参数压缩文件,tar.gz格式,可由GMX Membrane Parameterization模块生成。

    Receptor Topology

    受体拓扑文件,top格式,可由GMX Receptor Parameterization模块生成。

    Receptor GRO

    受体结构文件,gro格式,可由GMX Receptor Parameterization模块生成。

    Receptor ITP

    受体参数压缩文件,tar.gz格式,可由GMX Receptor Parameterization模块生成。

    Ligand GRO

    配体结构文件,多配体输入压缩文件,gro格式,可由GMX Ligand Parameterization模块生成。

    Ligand ITP

    配体参数压缩文件,tar.gz格式,可由GMX Ligand Parameterization模块生成。

    Output Topology

    体系拓扑文件的输出名称

    Output GRO

    体系结构文件的输出名称

    Output ITP

    体系参数压缩文件的输出名称

    Output Index

    体系索引文件的输出名称

    结果说明

    输出结果包括:

    输出文件名称 说明
    system.gro 体系的分子坐标文件
    system_itp.tar.gz 体系平衡模拟时固定原子位置所施加的力
    system.top 体系的拓扑文件
    index.ndx GROMACS 生成的索引文件,定义体系中原子或残基的分组信息(index groups),用于后续分析或计算时选择特定原子集合

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    Membrane Solvation

    Introduction

    Membrane Solvation adds water boxes and ions to the input membrane, receptor, and ligand files.

    Parameters

    Membrane Topology

    Topology file of the membrane in .top format; it can be generated by the GMX Membrane Parameterization module.

    Membrane GRO

    Structure file of the membrane in .gro format; it can be generated by the GMX Membrane Parameterization module.

    Membrane ITP

    Compressed parameter file of the membrane in .tar.gz format; it can be generated by the GMX Membrane Parameterization module.

    Receptor Topology

    Topology file of the receptor in .top format; it can be generated by the GMX Receptor Parameterization module.

    Receptor GRO

    Structure file of the receptor in .gro format; it can be generated by the GMX Receptor Parameterization module.

    Receptor ITP

    Compressed parameter file of the receptor in .tar.gz format; it can be generated by the GMX Receptor Parameterization module.

    Ligand GRO

    Structure file of the ligand in .gro format (a compressed file when there are multiple ligands); it can be generated by the GMX Ligand Parameterization module.

    Ligand ITP

    Compressed parameter file of the ligand in .tar.gz format; it can be generated by the GMX Ligand Parameterization module.

    Output Topology

    Output name of the system topology file.

    Output GRO

    Output name of the system structure file.

    Output ITP

    Output name of the compressed system parameter file.

    Output Index

    Output name of the system index file.

    Result Description

    The output results include:

    Output File Name Description
    system.gro Molecular coordinate file of the system
    system_itp.tar.gz Compressed parameter file of the system
    system.top Topology file of the system
    index.ndx Index file of the system

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: GMX Membrane Parameterization
    Description: 根据Amber或者Charmm生成膜结构的GRO,ITP以及TOP文件。 Generates the membrane structure GRO, ITP and TOP file according to Amber or Charmm.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-06-21 16:31:29
    Reference:

    GMX Membrane Parameterization

    简介

    GMX Membrane Parameterization模块是根据Amber或者Charmm生成膜结构的GRO,ITP以及TOP文件。

    参数说明

    Membrane Structure File

    膜结构文件,PDB格式,必须是纯膜结构,并允许水和离子存在

    Force Field

    只支持“amber”力场和“charmm”力场。默认的“amber”力场。
    需要特别注意的是:

    1. 当选择“charmm”力场时,“GMX Receptor Parameterization”模块力场必须选择“charmm36-jul2020”版本。
    2. 当存在小分子时,有且只能选择“amber”力场进行计算。

    结果说明

    输出结果包括:

    输出文件名称 说明
    membrane.top 膜的拓扑文件
    membrane.gro 膜的结构文件
    membrane_itp.tar.gz 膜的参数压缩文件

    GMX Membrane Parameterization

    Introduction

    The GMX Membrane Parameterization module is used to generate GRO, ITP, and TOP files for membrane structures based on Amber or Charmm force fields.

    Parameter Description

    Membrane Structure File

    The membrane structure file in PDB format. It must contain only the membrane, although water and ions are allowed.

    Force Field

    Only the “amber” and “charmm” force fields are supported; the default is “amber”. Note that:

    1. When the “charmm” force field is selected, the force field in the “GMX Receptor Parameterization” module must be set to “charmm36-jul2020”.
    2. When small molecules are present, only the “amber” force field can be selected for calculations.
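    The two force-field rules above can be written as a small check to run before submitting a job. This is only a sketch: the function name `check_force_fields` and its arguments are hypothetical and not part of the module; only the values “amber”, “charmm”, and “charmm36-jul2020” come from the text.

```python
def check_force_fields(membrane_ff, receptor_ff=None, has_small_molecule=False):
    """Return a list of problems with a force-field selection (empty if none).

    membrane_ff: "amber" or "charmm", the two values the module accepts.
    receptor_ff: hypothetical name of the force field chosen in
        GMX Receptor Parameterization, or None if no receptor is used.
    has_small_molecule: True when the system contains a small molecule.
    """
    problems = []
    if membrane_ff not in ("amber", "charmm"):
        problems.append(f"unsupported membrane force field: {membrane_ff!r}")
    # Rule 1: a charmm membrane requires the charmm36-jul2020 receptor force field.
    if (membrane_ff == "charmm" and receptor_ff is not None
            and receptor_ff != "charmm36-jul2020"):
        problems.append("charmm membrane requires receptor force field charmm36-jul2020")
    # Rule 2: small molecules are only supported with the amber force field.
    if has_small_molecule and membrane_ff != "amber":
        problems.append("small molecules require the amber force field")
    return problems


print(check_force_fields("charmm", "charmm36-jul2020"))  # []
```

    An empty list means the selection does not violate either rule as stated here.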

    Result Description

    The output results include:

    Output File Name Description
    membrane.top Topology file for the membrane
    membrane.gro Structure file for the membrane
    membrane_itp.tar.gz Compressed parameter file for the membrane
  • Name: Membrane System Construction
    Description: 构建膜结构的PDB文件 Builds the PDB file of the membrane structure
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-06-21 16:29:32
    Reference:

    Membrane System Construction

    简介

    Membrane System Construction构建膜结构的PDB文件。
    需要注意的是:Amber参数涉及有大分子的AMBER力场、小分子的GAFF力场、糖的GLYCAM以及磷脂的LIPID力场,这四个力场是可以兼容的。Charmm也有自己一套力场,涉及有CHARMM力场(适用于大分子、糖、磷脂)和CGenFF力场(适用于小分子),这两个力场是相互兼容的。
    目前WEMOL上只支持GAFF力场的小分子计算,所以当存在小分子时,膜的成分必须为AMBER力场下的。

    参数说明

    Lipid Component

    必须遵循格式:lipid1:lipid2//lipid3,“//”用于区分上膜和下膜,没有“//”表示上膜和下膜中相同的脂质成分!
    注:在charmm力场作用下,支持以下38种脂质构建膜:

    CHL1 SITO ERG1 DLPA DLPC DLPE DLPG DLPS DMPA DMPC DMPE DMPG DMPS DOPA DOPC DOPE DOPG DOPS DPPA DPPC DPPE DPPG DPPS DSPA DSPC DSPE DSPG DSPS POPA POPC POPE POPG POPS SOPA SOPC SOPE SOPG SOPS
    

    注:在charmm力场作用下,还支持以下26种心磷脂膜:

    LACH LACL LBCH LBCL LCCH LCCL LDCH LDCL OACH OACL OCCH OCCL PACL PMCH PMCL POCH POCL PVCL TLCH TLCL TMCH TMCL TOCH TOCL TYCH TYCL
    

    注:在amber力场作用下,支持以下253种脂质构建膜:

    CHL1 AHPA AHPC AHPE AHPG AHPS ALPA ALPC ALPE ALPG ALPS AMPA AMPC AMPE AMPG AMPS AOPA AOPC AOPE AOPG AOPS APPA APPC APPE APPG APPS ASPA ASPC ASPE ASPG ASPS DAPA DAPC DAPE DAPG DAPS DHPA DHPC DHPE DHPG DHPS DLPA DLPC DLPE DLPG DLPS DMPA DMPC DMPE DMPG DMPS DOPA DOPC DOPE DOPG DOPS DPPA DPPC DPPE DPPG DPPS DSPA DSPC DSPE DSPG DSPS HAPA HAPC HAPE HAPG HAPS HLPA HLPC HLPE HLPG HLPS HMPA HMPC HMPE HMPG HMPS HOPA HOPC HOPE HOPG HOPS HPPA HPPC HPPE HPPG HPPS HSPA HSPC HSPE HSPG HSPS LAPA LAPC LAPE LAPG LAPS LHPA LHPC LHPE LHPG LHPS LMPA LMPC LMPE LMPG LMPS LOPA LOPC LOPE LOPG LOPS LPPA LPPC LPPE LPPG LPPS LSPA LSPC LSPE LSPG LSPS MAPA MAPC MAPE MAPG MAPS MHPA MHPC MHPE MHPG MHPS MLPA MLPC MLPE MLPG MLPS MOPA MOPC MOPE MOPG MOPS MPPA MPPC MPPE MPPG MPPS MSPA MSPC MSPE MSPG MSPS OAPA OAPC OAPE OAPG OAPS OHPA OHPC OHPE OHPG OHPS OLPA OLPC OLPE OLPG OLPS OMPA OMPC OMPE OMPG OMPS OPPA OPPC OPPE OPPG OPPS OSPA OSPC OSPE OSPG OSPS PAPA PAPC PAPE PAPG PAPS PHPA PHPC PHPE PHPG PHPS PLPA PLPC PLPE PLPG PLPS PMPA PMPC PMPE PMPG PMPS POPA POPC POPE POPG POPS PSPA PSPC PSPE PSPG PSPS SAPA SAPC SAPE SAPG SAPS SHPA SDPA SHPC SDPC SHPE SDPE SHPG SDPG SHPS SDPS SLPA SLPC SLPE SLPG SLPS SMPA SMPC SMPE SMPG SMPS SOPA SOPC SOPE SOPG SOPS SPPA SPPC SPPE SPPG SPPS PSM SSM
    

    Lipid Ratio

    膜成分比例,格式为ratio1:ratio2//ratio3

    Lipid Number

    膜成分数量比例,格式为number1:number2//number3

    Orientated Structure File

    定向结构文件,pdb格式

    Ions

    添加离子类型,格式为ion1:ion2//ion3,“//”用于区分上下膜,没有“//”表示上下膜中离子成分相同!支持以下5种离子:NA、K、CL、CA、MG。

    Ions Concentration

    离子成分比例,格式为conc1:conc2//conc3,与Ion参数顺序相同

    Ions Number

    离子成分数量比例,格式为number1:number2//number3,与Ion参数顺序相同

    Force Field

    只支持“amber”力场和“charmm”力场。默认的“amber”力场

    Length of XY

    膜的X轴和Y轴长度,默认为50 Å

    Length of Z

    膜的Z轴长度,默认为100 Å

    结果说明

    输出结果包括:

    输出文件名称 说明
    membrane_lipid.pdb 纯膜体系下生成的结构文件,当存在配体或者受体时不会生成该文件。
    membrane_orientation.pdb 膜与受体/配体/复合物的结构文件,纯膜时不生成该文件。
    orientation.pdb 受体/配体/复合物的取向结构,纯膜时不生成该文件。

    Membrane System Construction

    Introduction

    Membrane System Construction is used to build PDB files for membrane structures. It is important to note that the Amber parameters involve the AMBER force field for macromolecules, the GAFF force field for small molecules, the GLYCAM force field for sugars, and the LIPID force field for phospholipids. These four force fields are compatible. Charmm also has its own set of force fields, including the CHARMM force field (for macromolecules, sugars, and phospholipids) and the CGenFF force field (for small molecules), which are mutually compatible. Currently, WEMOL only supports calculations for small molecules using the GAFF force field, so when small molecules are present, the membrane components must be under the AMBER force field.

    Parameter Description

    Lipid Component

    Must follow the format lipid1:lipid2//lipid3. “//” separates the lipids of the upper leaflet from those of the lower leaflet; if “//” is omitted, both leaflets use the same lipid composition.
    Note: Under the Charmm force field, the membrane construction supports the following 38 lipid types:

    CHL1 SITO ERG1 DLPA DLPC DLPE DLPG DLPS DMPA DMPC DMPE DMPG DMPS DOPA DOPC DOPE DOPG DOPS DPPA DPPC DPPE DPPG DPPS DSPA DSPC DSPE DSPG DSPS POPA POPC POPE POPG POPS SOPA SOPC SOPE SOPG SOPS
    

    Under the Charmm force field, the following 26 cardiolipin types are also supported:

    LACH LACL LBCH LBCL LCCH LCCL LDCH LDCL OACH OACL OCCH OCCL PACL PMCH PMCL POCH POCL PVCL TLCH TLCL TMCH TMCL TOCH TOCL TYCH TYCL
    

    Under the Amber force field, the membrane construction supports 253 lipid types:

    CHL1 AHPA AHPC AHPE AHPG AHPS ALPA ALPC ALPE ALPG ALPS AMPA AMPC AMPE AMPG AMPS AOPA AOPC AOPE AOPG AOPS APPA APPC APPE APPG APPS ASPA ASPC ASPE ASPG ASPS DAPA DAPC DAPE DAPG DAPS DHPA DHPC DHPE DHPG DHPS DLPA DLPC DLPE DLPG DLPS DMPA DMPC DMPE DMPG DMPS DOPA DOPC DOPE DOPG DOPS DPPA DPPC DPPE DPPG DPPS DSPA DSPC DSPE DSPG DSPS HAPA HAPC HAPE HAPG HAPS HLPA HLPC HLPE HLPG HLPS HMPA HMPC HMPE HMPG HMPS HOPA HOPC HOPE HOPG HOPS HPPA HPPC HPPE HPPG HPPS HSPA HSPC HSPE HSPG HSPS LAPA LAPC LAPE LAPG LAPS LHPA LHPC LHPE LHPG LHPS LMPA LMPC LMPE LMPG LMPS LOPA LOPC LOPE LOPG LOPS LPPA LPPC LPPE LPPG LPPS LSPA LSPC LSPE LSPG LSPS MAPA MAPC MAPE MAPG MAPS MHPA MHPC MHPE MHPG MHPS MLPA MLPC MLPE MLPG MLPS MOPA MOPC MOPE MOPG MOPS MPPA MPPC MPPE MPPG MPPS MSPA MSPC MSPE MSPG MSPS OAPA OAPC OAPE OAPG OAPS OHPA OHPC OHPE OHPG OHPS OLPA OLPC OLPE OLPG OLPS OMPA OMPC OMPE OMPG OMPS OPPA OPPC OPPE OPPG OPPS OSPA OSPC OSPE OSPG OSPS PAPA PAPC PAPE PAPG PAPS PHPA PHPC PHPE PHPG PHPS PLPA PLPC PLPE PLPG PLPS PMPA PMPC PMPE PMPG PMPS POPA POPC POPE POPG POPS PSPA PSPC PSPE PSPG PSPS SAPA SAPC SAPE SAPG SAPS SHPA SDPA SHPC SDPC SHPE SDPE SHPG SDPG SHPS SDPS SLPA SLPC SLPE SLPG SLPS SMPA SMPC SMPE SMPG SMPS SOPA SOPC SOPE SOPG SOPS SPPA SPPC SPPE SPPG SPPS PSM SSM
    

    Lipid Ratio

    The ratio of membrane components, format is ratio1:ratio2//ratio3.
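    To make the “:” and “//” conventions concrete, the sketch below splits a Lipid Component string and a Lipid Ratio string into per-leaflet dictionaries. The function names are hypothetical, and the way the module itself parses these fields is not documented here, so treat this as an illustration of the format only.

```python
def parse_leaflets(spec):
    """Split an 'a:b//c' style string into (upper, lower) lists of fields.

    '//' separates the upper leaflet from the lower one; without '//'
    both leaflets share the same fields.
    """
    parts = spec.split("//")
    if len(parts) > 2:
        raise ValueError("at most one '//' separator is allowed")
    upper = parts[0].split(":")
    lower = parts[1].split(":") if len(parts) == 2 else list(upper)
    return upper, lower


def pair_components(component_spec, ratio_spec):
    """Pair each lipid name with its ratio, leaflet by leaflet."""
    names_up, names_low = parse_leaflets(component_spec)
    ratios_up, ratios_low = parse_leaflets(ratio_spec)
    if len(names_up) != len(ratios_up) or len(names_low) != len(ratios_low):
        raise ValueError("component and ratio fields do not line up")
    upper = {name: float(r) for name, r in zip(names_up, ratios_up)}
    lower = {name: float(r) for name, r in zip(names_low, ratios_low)}
    return upper, lower


upper, lower = pair_components("POPC:CHL1//POPE", "4:1//1")
print(upper)  # {'POPC': 4.0, 'CHL1': 1.0}
print(lower)  # {'POPE': 1.0}
```

    The same splitting applies to the other fields that use the number1:number2//number3 pattern.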

    Lipid Number

    The number of molecules of each membrane component, in the format number1:number2//number3.

    Orientated Structure File

    The oriented structure file in PDB format.

    Ions

    Types of ions to add, format is ion1:ion2//ion3. “//” is used to differentiate between the upper and lower membranes. If there is no “//”, it indicates the same ion component in the upper and lower membranes. It supports the following 5 types of ions: NA, K, CL, CA, MG.

    Ions Concentration

    The concentration ratio of the ions, in the format conc1:conc2//conc3, listed in the same order as the Ions parameter.

    Ions Number

    The number ratio of the ions, in the format number1:number2//number3, listed in the same order as the Ions parameter.

    Force Field

    Supports only the “amber” force field and the “charmm” force field. Default is the “amber” force field.

    Length of XY

    The length of the membrane along the X and Y axes, default is 50 Å.

    Length of Z

    The length of the membrane along the Z axis, default is 100 Å.

    Result Description

    The output results include:

    Output File Name Description
    membrane_lipid.pdb Generated structure file for the pure membrane system. This file is not generated when ligands or receptors are present.
    membrane_orientation.pdb Structure file of the membrane with the receptor/ligand/complex. This file is not generated for a pure membrane system.
    orientation.pdb Orientation structure of the receptor/ligand/complex. This file is not generated for a pure membrane system.
  • Name: Molecule In Membrane
    Description: 生成受体/配体/复合物取向位置的结构文件。 Generates receptor/ligand/complex orientation file.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-06-21 15:13:41
    Reference:

    Molecule In Membrane

    简介

    Molecule In Membrane模块是生成受体/配体/复合物取向位置与膜的结构文件。

    参数说明

    Receptor File

    受体结构,PDB格式。如果一个受体含有配体,可以把它们组合成一个受体结构。

    Receptor Position

    “center”, “upper”, or “lower”; the default is “upper”. This is the position of the receptor relative to the membrane.

    Receptor Orientation

    “inside”或“outside”,默认为“outside”,即n端相对于膜的取向,只有受体在“center”时才有效。

    Receptor Heteroatom

    “yes”或“no”,默认“no”,即当受体定向时是否考虑受体结构中的非受体分子,仅当受体位于“center”时有效。

    Receptor Z Shift

    受体结构的向Z轴位移距离,仅当受体处于“center”时有效。

    Ligand File

    配体结构,PDB格式。通常是指相对于受体的独立配体分子

    Ligand Position

    “center”、“upper”或“lower”,当受体不在“center”时默认为“center”,当受体在“center”时默认为“upper”,即配体相对于膜的位置

    Ligand Orientation

    “inside”或“outside”,默认为“outside”,即n端相对于膜的取向,只有配体在“center”时才有效。

    Ligand Z Shift

    配体结构的向Z轴位移距离,仅当配体处于“center”时有效。

    Ligand Number

    配体分子数,默认为1。只有配体在“upper”或“lower”时才有效

    Length of XY

    膜的X轴和Y轴长度,默认为50 Å

    Length of Z

    膜的Z轴长度,默认为100 Å

    结果说明

    输出结果包括:

    输出文件名称 说明
    orientation.pdb 受体/配体/复合物的结构文件
    orientation_dum.pdb 显示受体/配体/复合物与膜的相对位置的结构文件

    Molecule In Membrane

    Introduction

    The Molecule In Membrane module is used to generate structural files of the orientation of receptors/ligands/complexes relative to a membrane.

    Parameter Description

    Receptor File

    The structure of the receptor in PDB format. If a receptor contains a ligand, they can be combined into a single receptor structure.

    Receptor Position

    “center”, “upper”, or “lower”, default is “upper”, indicating the position of the receptor relative to the membrane.

    Receptor Orientation

    “inside” or “outside”, default is “outside”, indicating the orientation of the N-terminus of the receptor relative to the membrane. This parameter is only effective when the receptor is in the “center” position.

    Receptor Heteroatom

    “yes” or “no”, default is “no”, indicating whether non-receptor molecules in the receptor structure should be considered when orienting the receptor. This parameter is only effective when the receptor is in the “center” position.

    Receptor Z Shift

    The distance the receptor structure is shifted along the Z-axis. This parameter is only effective when the receptor is in the “center” position.

    Ligand File

    The structure of the ligand in PDB format. Typically, this is a ligand molecule that is separate from the receptor.

    Ligand Position

    “center”, “upper”, or “lower”, indicating the position of the ligand relative to the membrane. The default is “center” when the receptor is not in the “center” position, and “upper” when the receptor is in the “center” position.

    Ligand Orientation

    “inside” or “outside”, default is “outside”, indicating the orientation of the N-terminus of the ligand relative to the membrane. This parameter is only effective when the ligand is in the “center” position.

    Ligand Z Shift

    The distance the ligand structure is shifted along the Z-axis. This parameter is only effective when the ligand is in the “center” position.

    Ligand Number

    The number of ligand molecules, default is 1. This parameter is only effective when the ligand is in the “upper” or “lower” position.
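    The ligand parameters above depend on each other: the default position follows the receptor position, and some options only take effect for certain positions. The sketch below restates those rules; both function names are hypothetical and are not part of the module.

```python
POSITIONS = ("center", "upper", "lower")


def default_ligand_position(receptor_position):
    """Default Ligand Position as described in the text: 'upper' when the
    receptor is in the membrane center, otherwise 'center'."""
    if receptor_position not in POSITIONS:
        raise ValueError(f"unknown position: {receptor_position!r}")
    return "upper" if receptor_position == "center" else "center"


def effective_ligand_options(ligand_position):
    """Names of the ligand options that take effect for a given position."""
    if ligand_position not in POSITIONS:
        raise ValueError(f"unknown position: {ligand_position!r}")
    if ligand_position == "center":
        # Orientation and Z shift only apply to a ligand in the membrane center.
        return {"Ligand Orientation", "Ligand Z Shift"}
    # Ligand Number only applies to the "upper" and "lower" positions.
    return {"Ligand Number"}


print(default_ligand_position("center"))  # upper
print(default_ligand_position("upper"))   # center
```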

    Length of XY

    The length of the membrane along the X and Y axes, default is 50 Å.

    Length of Z

    The length of the membrane along the Z axis, default is 100 Å.

    Result Description

    The output results include:

    Output File Name Description
    orientation.pdb Structural file of the receptor/ligand/complex
    orientation_dum.pdb Structural file showing the relative position of the receptor/ligand/complex with respect to the membrane
  • Name: Solvent Exposure (SASA)
    Description: 基于蛋白质结构,计算各个残基的溶剂暴露程度(溶液可及化表面积,solvent accessible surface area, SASA)。 Calculates the solvent accessible surface area of residue based on structure PDB file.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-06-08 12:56:06
    Reference: NA

    Solvent Exposure (SASA)

    简介

    基于蛋白质结构(PDB文件),计算各个残基的溶剂暴露程度(溶液可及化表面积,solvent accessible surface area, SASA)。
    蛋白氨基酸残基的相对溶剂可及表面积(Relative SASA,RSASA)可以衡量残基在溶剂中的暴露程度,其计算公式如下:
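    \[ \mathrm{RSASA} = \frac{\mathrm{SASA}}{\mathrm{MaxSASA}} \]
    Here, SASA is the solvent accessible surface area of the residue and MaxSASA is the maximum solvent accessible surface area of that amino acid type, both in Ų.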
    为了测量氨基酸侧链的相对溶剂可及表面积,通常采用从Gly-X-Gly三肽中获得的MaxSASA值,其中X为需要计算的氨基酸残基。几种MaxSASA量表如下所示。

    Residue Tien et al. 2013 (theor.)[1] Tien et al. 2013 (emp.)[1] Miller et al. 1987[2] Rose et al. 1985[3]
    Alanine 129.0 121.0 113.0 118.1
    Arginine 274.0 265.0 241.0 256.0
    Asparagine 195.0 187.0 158.0 165.5
    Aspartate 193.0 187.0 151.0 158.7
    Cysteine 167.0 148.0 140.0 146.1
    Glutamate 223.0 214.0 183.0 186.2
    Glutamine 225.0 214.0 189.0 193.2
    Glycine 104.0 97.0 85.0 88.1
    Histidine 224.0 216.0 194.0 202.5
    Isoleucine 197.0 195.0 182.0 181.0
    Leucine 201.0 191.0 180.0 193.1
    Lysine 236.0 230.0 211.0 225.8
    Methionine 224.0 203.0 204.0 203.4
    Phenylalanine 240.0 228.0 218.0 222.8
    Proline 159.0 154.0 143.0 146.8
    Serine 155.0 143.0 122.0 129.8
    Threonine 172.0 163.0 146.0 152.5
    Tryptophan 285.0 264.0 259.0 266.3
    Tyrosine 263.0 255.0 229.0 236.8
    Valine 174.0 165.0 160.0 164.5

    判断溶液可及性的 rASA 阈值

    通常有以下标准:

    rASA >0.5(50%):残基被认为是暴露于溶液的(solvent-exposed)
    rASA < 0.2(20%):残基被认为是埋藏在蛋白质内部的(buried)
    0.2 ≤ rASA ≤ 0.5:残基处于部分暴露状态。
    

    具体阈值的选择可能取决于研究的目的。例如,某些分析可能使用更严格或宽松的标准来划分。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式。

    结果说明

    计算出来的各种溶剂可及表面积值,可根据需求选择需要的类型:

    字段名称 说明
    ResidueType 残基类型
    Chain ID 链名称
    Residue Number 残基编号
    total Total SASA of residue
    polar Polar SASA(极性)
    apolar Apolar SASA(非极性)
    mainChain Main chain SASA
    sideChain Side chain SASA
    relativeTotal* Relative total SASA
    relativePolar Relative polar SASA
    relativeApolar Relative Apolar SASA
    relativeMainChain Relative main chain SASA
    relativeSideChain* Relative side chain SASA
    bfactor 温度因子

    *常用的比如:

    • relativeSideChain,残基侧链的暴露程度(很多时候主链不需要考虑)
    • relativeTotal,残基的暴露程度(考虑了侧链+主链)

    参考文献

    https://en.wikipedia.org/wiki/Relative_accessible_surface_area
    Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilites of residues in proteins. PLoS One. 2013 Nov 21;8(11):e80635.
    Miller S, Lesk AM, Janin J, Chothia C. The accessible surface area and stability of oligomeric proteins. Nature. 1987 Aug 27-Sep 2;328(6133):834-6.
    Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH. Hydrophobicity of amino acid residues in globular proteins. Science. 1985 Aug 30;229(4716):834-8.
    https://freesasa.github.io/doxygen/Geometry.html

    Solvent Exposure (SASA)

    Introduction

    This module calculates the solvent exposure of each residue, expressed as its solvent accessible surface area (SASA), from a protein structure (PDB file). The relative solvent accessible surface area (RSASA, also written rASA) of a residue measures how exposed that residue is to the solvent. It is calculated as follows:
    \[ \mathrm{RSASA} = \frac{\mathrm{SASA}}{\mathrm{MaxSASA}} \]
    Here, SASA is the solvent accessible surface area of the residue and MaxSASA is the maximum solvent accessible surface area of that amino acid type, both in Ų. To measure the relative solvent accessible surface area of amino acid side chains, the MaxSASA value obtained from a Gly-X-Gly tripeptide is typically used, where X is the residue of interest. Several MaxSASA scales are shown below.

    Residue Tien et al. 2013 (theor.)[1] Tien et al. 2013 (emp.)[1] Miller et al. 1987[2] Rose et al. 1985[3]
    Alanine 129.0 121.0 113.0 118.1
    Arginine 274.0 265.0 241.0 256.0
    Asparagine 195.0 187.0 158.0 165.5
    Aspartate 193.0 187.0 151.0 158.7
    Cysteine 167.0 148.0 140.0 146.1
    Glutamate 223.0 214.0 183.0 186.2
    Glutamine 225.0 214.0 189.0 193.2
    Glycine 104.0 97.0 85.0 88.1
    Histidine 224.0 216.0 194.0 202.5
    Isoleucine 197.0 195.0 182.0 181.0
    Leucine 201.0 191.0 180.0 193.1
    Lysine 236.0 230.0 211.0 225.8
    Methionine 224.0 203.0 204.0 203.4
    Phenylalanine 240.0 228.0 218.0 222.8
    Proline 159.0 154.0 143.0 146.8
    Serine 155.0 143.0 122.0 129.8
    Threonine 172.0 163.0 146.0 152.5
    Tryptophan 285.0 264.0 259.0 266.3
    Tyrosine 263.0 255.0 229.0 236.8
    Valine 174.0 165.0 160.0 164.5
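
    As a worked example of the relative SASA calculation, the sketch below divides a residue's SASA by its MaxSASA from the Tien et al. 2013 (theor.) column of the table above. The three-letter residue keys and the function name are illustrative choices; which scale the module uses internally is not stated here.

```python
# MaxSASA values (in square angstroms) copied from the
# "Tien et al. 2013 (theor.)" column of the table.
MAX_SASA_TIEN_THEOR = {
    "ALA": 129.0, "ARG": 274.0, "ASN": 195.0, "ASP": 193.0, "CYS": 167.0,
    "GLU": 223.0, "GLN": 225.0, "GLY": 104.0, "HIS": 224.0, "ILE": 197.0,
    "LEU": 201.0, "LYS": 236.0, "MET": 224.0, "PHE": 240.0, "PRO": 159.0,
    "SER": 155.0, "THR": 172.0, "TRP": 285.0, "TYR": 263.0, "VAL": 174.0,
}


def relative_sasa(residue, sasa):
    """Relative SASA = SASA / MaxSASA for a three-letter residue name."""
    return sasa / MAX_SASA_TIEN_THEOR[residue.upper()]


# An alanine with 64.5 square angstroms of accessible surface is half exposed.
print(round(relative_sasa("ALA", 64.5), 2))  # 0.5
```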

    Parameters

    Structure PDB File

    Protein structure file in PDB format.

    Results

    The output contains several types of calculated solvent accessible surface area values; select the type that fits your needs:

    Field Name Description
    ResidueType Residue type
    Chain ID Chain name
    Residue Number Residue number
    total Total SASA of residue
    polar Polar SASA
    apolar Apolar SASA
    mainChain Main chain SASA
    sideChain Side chain SASA
    relativeTotal* Relative total SASA
    relativePolar Relative polar SASA
    relativeApolar Relative Apolar SASA
    relativeMainChain Relative main chain SASA
    relativeSideChain* Relative side chain SASA
    bfactor Temperature factor

    *Commonly used fields include:

    • relativeSideChain, exposure level of the residue side chain (often main chain is not considered)
    • relativeTotal, exposure level of the residue (considering both side chain and main chain)

    Determining Solvent Accessibility with rASA Thresholds

    Typically, the following criteria are used:

    rASA > 0.5 (50%): Residues are considered solvent-exposed.
    rASA < 0.2 (20%): Residues are considered buried within the protein.
    0.2 ≤ rASA ≤ 0.5: Residues are in a partially exposed state.
    

    The choice of specific thresholds may depend on the purpose of the study. For example, some analyses may use stricter or more lenient criteria for classification.
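
    The thresholds above translate directly into a three-way classification. In this sketch the cutoffs are keyword arguments so that stricter or more lenient criteria can be substituted; the function name and labels are illustrative.

```python
def classify_exposure(rasa, buried_below=0.2, exposed_above=0.5):
    """Classify a residue from its relative SASA (rASA).

    rASA > exposed_above  -> "exposed"
    rASA < buried_below   -> "buried"
    otherwise             -> "partially exposed" (both bounds inclusive)
    """
    if rasa > exposed_above:
        return "exposed"
    if rasa < buried_below:
        return "buried"
    return "partially exposed"


for value in (0.05, 0.2, 0.35, 0.5, 0.8):
    print(value, classify_exposure(value))
```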

    Reference

    Relative accessible surface area - Wikipedia: https://en.wikipedia.org/wiki/Relative_accessible_surface_area
    Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilities of residues in proteins. PLoS One. 2013 Nov 21;8(11):e80635.
    Miller S, Lesk AM, Janin J, Chothia C. The accessible surface area and stability of oligomeric proteins. Nature. 1987 Aug 27-Sep 2;328(6133):834-6.
    Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH. Hydrophobicity of amino acid residues in globular proteins. Science. 1985 Aug 30;229(4716):834-8.
    Geometry - FreeSASA Documentation: https://freesasa.github.io/doxygen/Geometry.html

  • Name: Multiple Sequence Alignment (MAFFT)
    Description: 基于MAFFT的多序列比对程序,支持蛋白和核酸序列的比对。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Align -> MAFFT。 mafft - Multiple alignment program for amino acid or nucleotide sequences. It is recommended to use in the WeSeq: WeSeq -> Align -> MAFFT.
    Tags: undefined
    Author: Kazutaka Katoh
    Release: 2023-06-06 00:00:00
    Reference: Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005 Jan 20;33(2):511-8.

    Multiple Sequence Alignment (MAFFT)

    简介

    基于MAFFT的多序列比对工具,支持蛋白和核酸序列的比对。

    参数说明

    Sequence File

    蛋白或者核酸的序列文件,FASTA格式

    结果说明

    输出结果为多序列比对后的结果文件:alignment.fasta

    参考文献

    Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005 Jan 20;33(2):511-8.
    https://mafft.cbrc.jp/alignment/software/manual/manual.html

    Multiple Sequence Alignment (MAFFT)

    Introduction

    A MAFFT-based tool for multiple sequence alignment that supports both protein and nucleic acid sequences.

    Parameter Description

    Sequence File

    Sequence file containing protein or nucleic acid sequences in FASTA format.

    Result Description

    The output result is the aligned sequences saved in the file: alignment.fasta.
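
    A quick way to inspect the result is to read alignment.fasta and confirm that all aligned sequences have the same length, gaps included. The sketch below is a minimal FASTA reader written for illustration; it is not part of MAFFT or of this module.

```python
def read_fasta(lines):
    """Parse FASTA text (an iterable of lines) into {name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def is_aligned(records):
    """True when every sequence has the same length (gaps included)."""
    return len({len(seq) for seq in records.values()}) <= 1


example = [">seq1", "MK-LV", ">seq2", "MKALV"]
records = read_fasta(example)
print(records)              # {'seq1': 'MK-LV', 'seq2': 'MKALV'}
print(is_aligned(records))  # True
```

    For a real run, pass an open file handle for alignment.fasta to `read_fasta` instead of the example list.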

    Reference

    Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005 Jan 20;33(2):511-8.
    MAFFT Manual: https://mafft.cbrc.jp/alignment/software/manual/manual.html

  • Name: Antibody Sequence Prediction (IgLM)
    Description: Antibody Sequence Prediction (IgLM)模块是抗体序列生成与优化,该方法从Observed Antibody Space (OAS) 收集抗体序列。OAS数据库包含六个物种的天然抗体序列:人类、小鼠、大鼠、兔子、恒河猴和骆驼。为了研究模型能力的影响,训练了两个版本的模型: IgLM和IgLM-S,分别有13M和1.4M的训练参数。两个IgLM模型都在558万非冗余序列上训练,这些序列基于95%相似性聚类。在训练过程中,随机屏蔽了抗体序列中10到20个残基,以便在推理过程中实现任意跨度的多样化。此外,还对序列中的链型(重链或轻链)和原产物种进行了限定,提供这样的背景能够控制物种特异性抗体序列的产生。该方法被证明可以从各种物种中产生全长的重链和轻链序列,以及具有改进可开发性的填充CDR环库。该方法是一个强大的抗体设计工具,可应用于各种抗体序列设计场景。 The Antibody Sequence Prediction (IgLM) module is designed for the generation and optimization of antibody sequences, utilizing the Observed Antibody Space (OAS) to collect antibody sequences. The OAS database contains natural antibody sequences from six species: humans, mice, rats, rabbits, rhesus monkeys, and camels. To investigate the impact of model capacity, two versions of the model were trained: IgLM and IgLM-S, with 13 million and 1.4 million training parameters, respectively. Both IgLM models were trained on 5.58 million non-redundant sequences, clustered based on 95% similarity. During training, 10 to 20 residues in the antibody sequences were randomly masked to enable diversification of arbitrary spans during inference. Additionally, constraints were applied to the chain type (heavy chain or light chain) and the originating species of the sequences, providing a framework to control the generation of species-specific antibody sequences. This method has been shown to produce full-length heavy and light chain sequences from various species, as well as improved developability for filling CDR loop libraries. The method serves as a powerful tool for antibody design and can be applied to various antibody sequence design scenarios.
    Tags: undefined
    Author: Richard W. Shuai
    Release: 2023-05-29 09:07:25
    Reference: Richard W. Shuai, Jeffrey A. Ruffolo, Jeffrey J. Gray. Generative language modeling for antibody design. bioRxiv 2021.12.13.472419.

    Antibody Sequence Prediction (IgLM)

    简介

    Antibody Sequence Prediction(IgLM)模块是抗体序列生成与优化,该方法从Observed Antibody Space (OAS) 收集抗体序列。OAS数据库包含六个物种的天然抗体序列:人类、小鼠、大鼠、兔子、恒河猴和骆驼。为了研究模型能力的影响,训练了两个版本的模型: IgLM和IgLM-S,分别有13M和1.4M的训练参数。两个IgLM模型都在558万非冗余序列上训练,这些序列基于95%相似性聚类。在训练过程中,随机屏蔽了抗体序列中10到20个残基,以便在推理过程中实现任意跨度的多样化。此外,还对序列中的链型(重链或轻链)和原产物种进行了限定,提供这样的背景能够控制物种特异性抗体序列的产生。该方法被证明可以从各种物种中产生全长的重链和轻链序列,以及具有改进可开发性的填充CDR环库。该方法是一个强大的抗体设计工具,可应用于各种抗体序列设计场景。

    参数说明

    Antibody Sequence File

    抗体序列,仅支持1条序列,FASTA格式。

    Chain Type

    设定为抗体重链或轻链,值为"H" 或 “L”。

    Start Index of AA

    指定序列中进行改造优化的氨基酸起始值,整数值,从1开始。需要说明的是,并不是说从优化起始值-终止值的氨基酸就会完全一对一的进行修改,模型里是指定开始到结束的残基作为1个MASK TOKEN提供给模型进行生成,具体生成多少个残基,是看模型学习的情况。

    End Index of AA

    指定序列中进行改造优化的氨基酸终止值,整数值。需要说明的是,并不是说从优化起始值-终止值的氨基酸就会完全一对一的进行修改,模型里是指定开始到结束的残基作为1个MASK TOKEN提供给模型进行生成,具体生成多少个残基,是看模型学习的情况。

    Species Type

    设定物种信息,默认是人源。

    Number of Designed Sequences

    设定设计的序列数量,默认100。

    结果说明

    输出结果文件为generated_seqs.fasta,包含生产的序列信息,fasta格式。

    参考文献

    Richard W. Shuai, Jeffrey A. Ruffolo, Jeffrey J. Gray. Generative language modeling for antibody design. bioRxiv 2021.12.13.472419.

    Antibody Sequence Prediction (IgLM)

    Introduction

    The Antibody Sequence Prediction (IgLM) module generates and optimizes antibody sequences. The method collects antibody sequences from the Observed Antibody Space (OAS) database, which includes natural antibody sequences from six species: human, mouse, rat, rabbit, rhesus monkey, and camel. To study the impact of model capacity, two versions of the model were trained: IgLM and IgLM-S, with 13M and 1.4M trainable parameters, respectively. Both models were trained on 5.58 million non-redundant sequences clustered at 95% similarity. During training, 10 to 20 residues in each antibody sequence were randomly masked so that arbitrary spans can be diversified during inference. In addition, each sequence is conditioned on its chain type (heavy or light) and species of origin, which makes it possible to control the generation of species-specific antibody sequences. The method has been shown to generate full-length heavy and light chain sequences from a variety of species, as well as infilled CDR loop libraries with improved developability. It is a powerful antibody design tool applicable to a range of antibody sequence design scenarios.

    Parameter Description

    Antibody Sequence File

    Antibody sequence in FASTA format; only a single sequence is supported.

    Chain Type

    Specify the antibody chain type as heavy (“H”) or light (“L”).

    Start Index of AA

    Specify the index of the first amino acid in the region to be redesigned, as an integer starting from 1. Note that the residues between the start and end indices are not modified one-to-one: the model receives the whole span as a single MASK TOKEN and generates a replacement for it, so the number of residues generated is decided by the model.

    End Index of AA

    Specify the index of the last amino acid in the region to be redesigned, as an integer. As with the start index, the residues in the span are not modified one-to-one.
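
    The idea of handing the model one mask token in place of the selected span can be illustrated as follows. This is a sketch under stated assumptions: the end index is treated as inclusive, and the token text "[MASK]" is a placeholder rather than the token IgLM actually uses.

```python
def mask_span(sequence, start, end, mask_token="[MASK]"):
    """Replace residues start..end (1-based, inclusive) with one mask token.

    The model would then generate a replacement for the token, which may
    contain a different number of residues than the original span.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("indices must satisfy 1 <= start <= end <= len(sequence)")
    return sequence[:start - 1] + mask_token + sequence[end:]


print(mask_span("EVQLVESGG", 3, 5))  # EV[MASK]ESGG
```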

    Species Type

    Set the species information, default is human.

    Number of Designed Sequences

    Set the number of sequences to be designed, default is 100.

    Result Description

    The output result file is named generated_seqs.fasta, containing the information of the generated sequences in FASTA format.

    Reference

    Richard W. Shuai, Jeffrey A. Ruffolo, Jeffrey J. Gray. Generative language modeling for antibody design. bioRxiv 2021.12.13.472419.

  • Name: PTM Hotspot by Structure
    Description: 基于结构预测蛋白中高风险的PTM位点,比基于序列的方式更精准。当前版本支持天冬氨酸(ASP)位点发生异构化的概率。 Prediction of isomerization probability of aspartic acid (ASP) site in protein Structure by PTM Hotspot by Structure.
    Tags: undefined
    Author: Sharma VK
    Release: 2023-05-19 12:40:06
    Reference: Sharma VK, Patapoff TW, Kabakoff B, Pai S, Hilario E, Zhang B, Li C, Borisov O, Kelley RF, Chorny I, Zhou JZ, Dill KA, Swartz TE. In silico selection of therapeutic antibodies for development: viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 30;111(52):18601-6.

    PTM Hotspot by Structure

    简介

    PTM Hotspot by Structure模块通过快速的蒙特卡罗模拟采样,获得蛋白的多样性构象,通过分析多构象的溶剂暴露情况和结构波动情况来预测天冬氨酸(ASP)的异构化的概率。

    参数说明

    Protein Structure File

    蛋白的结构文件,格式支持 .pdb 或 .cif。支持多个复合物结构打包进行批量预测,格式支持 .tar、.tar.gz 、 .zip等,最大支持10个结构。

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    Chain 蛋白链名称
    Residue Index 氨基酸索引(PDB文件中)
    Pred_Score 预测得到的ASP残基异构化评分,分数值在0-1之间,越大表示异构化的可能性越高
    Labile 最终判别异构化的值,1表示预测发生异构化,0表示预测无异构化
    sasa_asp ASP 残基侧链的 SASA(Solvent Accessible Surface Area,溶剂可及表面积)。数值越大表示该位点越暴露于溶剂,更容易发生化学修饰。单位通常为 Ų。
    rmsf 残基结构波动反映蒙特卡罗采样过程中该残基的构象柔性。数值越大表示局部结构越灵活。单位通常为 Å。
    sasa_n_1 前一个残基主链氮原子的溶剂暴露

    参考文献

    • Sharma VK, Patapoff TW, Kabakoff B, Pai S, Hilario E, Zhang B, Li C, Borisov O, Kelley RF, Chorny I, Zhou JZ, Dill KA, Swartz TE. In silico selection of therapeutic antibodies for development: viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 30;111(52):18601-6. DOI: 10.1073/pnas.1421779112

    PTM Hotspot by Structure

    Introduction

    The PTM Hotspot by Structure module uses rapid Monte Carlo simulation sampling to obtain diverse protein conformations. By analyzing the solvent exposure and structural fluctuations of multiple conformations, it predicts the probability of aspartic acid (ASP) isomerization.

    Parameters

    Protein Structure File

    Protein structure file. Supported formats: .pdb or .cif. Batch prediction is supported by packaging multiple complex structures into archives. Supported archive formats: .tar, .tar.gz, .zip, etc. Maximum 10 structures.

    Results

    The output result file is named result.csv, containing the following information:

    Field Name Description
    Chain Name of the protein chain
    Residue Index Amino acid index (in the PDB file)
    Pred_Score Predicted score for ASP residue isomerization, with values ranging from 0 to 1; higher values indicate a higher likelihood of isomerization
    Labile Final determination of isomerization; 1 indicates predicted isomerization, 0 indicates predicted non-isomerization
    sasa_asp SASA (Solvent Accessible Surface Area) of the ASP residue side chain. Higher values indicate greater solvent exposure, making the site more susceptible to chemical modification. Unit: typically Ų.
    rmsf Structural fluctuation of the residue, reflecting its conformational flexibility during Monte Carlo sampling. Higher values indicate greater local structural flexibility. Unit: typically Å.
    sasa_n_1 Solvent exposure of the backbone nitrogen atom of the preceding residue
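
    A typical follow-up is to list the sites flagged as labile, highest score first. The sketch below assumes that the header row of result.csv uses exactly the field names in the table above; the sample rows are invented for illustration and are not real predictions.

```python
import csv
import io


def labile_sites(csv_text, min_score=0.0):
    """Return (chain, residue index, score) for rows with Labile == 1,
    sorted by Pred_Score in descending order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = [
        row for row in reader
        if float(row["Labile"]) == 1 and float(row["Pred_Score"]) >= min_score
    ]
    hits.sort(key=lambda row: float(row["Pred_Score"]), reverse=True)
    return [(r["Chain"], r["Residue Index"], float(r["Pred_Score"])) for r in hits]


# Invented sample rows, for illustration only.
example = """Chain,Residue Index,Pred_Score,Labile,sasa_asp,rmsf,sasa_n_1
H,54,0.82,1,45.1,1.3,6.2
H,101,0.10,0,3.0,0.4,0.0
L,28,0.91,1,60.7,1.8,9.5
"""
print(labile_sites(example))
# [('L', '28', 0.91), ('H', '54', 0.82)]
```

    For a real run, read the contents of result.csv into a string and pass it to `labile_sites`.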

    References

    • Sharma VK, Patapoff TW, Kabakoff B, Pai S, Hilario E, Zhang B, Li C, Borisov O, Kelley RF, Chorny I, Zhou JZ, Dill KA, Swartz TE. In silico selection of therapeutic antibodies for development: viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 30;111(52):18601-6. DOI: 10.1073/pnas.1421779112
  • Name: Protein Isoelectric Point (pI)
    Description: Protein Isoelectric Point(pI),即分子不带净电荷的pH值,是影响分子理化性质甚至功能的关键参数。该模块使用多种不同的算法,基于序列计算分子的pI数值,并可以对多条链的结果进行合并计算。 基于唯信团队使用部分内部抗体实测pI数据的对比,Sillero算法的精度相对更高,推荐采用。 The Protein Isoelectric Point module calculates the isoelectric point (pI) of a protein, that is, the pH at which the molecule carries no net electrical charge. The pI is a critical parameter for many analytical biochemistry and proteomics techniques, especially 2D gel electrophoresis (2D-PAGE), capillary isoelectric focusing (cIEF), X-ray crystallography, and liquid chromatography–mass spectrometry (LC-MS).
    Tags: undefined
    Author:
    Release: 2023-05-15 18:01:25
    Reference:

    Protein Isoelectric Point

    简介

    Protein Isoelectric Point(pI),即分子不带净电荷的pH值,是影响分子理化性质甚至功能的关键参数。该模块使用多种不同的算法,基于序列计算分子的pI数值,并可以对多条链的结果进行合并计算。

    基于唯信团队使用部分内部抗体实测pI数据的对比,Sillero算法的精度相对更高,推荐采用。

    唯信测试用的抗体分子和对应的实测pI数值区间和均值如下图所示。

    image.png

    用不同算法计算的pI数值与实测均值的差值及相关性如下图所示。

    image.png

    基于R和RMSE等指标,Sillero的相关性略优于其他算法。

    464e925a1c78788da290f4691171545.png

    参数说明

    Protein Sequence File

    蛋白的序列文件,FASTA格式。

    pI Result File

    使用所选模型预测pI的输出文件,默认名称result.csv。

    Plot

    绘制二维散点图,默认False。

    Plot File

    二维散点图(分子量与等电点)表示为热图,默认名称result.png。

    Merge Chain

    根据链名,将来自同一序列的多条链的pI值进行合并计算。
    例如:mol1.chain1与mol1.chain2将被合并为mol1分子的结果。同名的链也会被视为同一个分子。

    Merge Output File

    仅当merge_chain=True时可用。默认值:merged.csv。

    Job Number

    并行任务数,默认为1。

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.png 当Plot=True时输出二维散点图(分子量与等电点),热图形式
    result.csv 使用所选模型预测pI的输出文件
    merged.csv 多条链的pI合并输出文件

    其中result.csv包括信息如下:

    字段名称 说明
    Protein ID 蛋白序列名称
    Molecular weight (Da) 蛋白分子量
    pI 蛋白等电点

    参考文献

    Kozlowski LP. IPC - Isoelectric Point Calculator. Biol Direct. 2016 Oct 21;11(1):55.

    Protein Isoelectric Point

    Introduction

    Protein Isoelectric Point (pI), the pH at which a molecule carries no net charge, is a key parameter that influences the physical and functional properties of a molecule. This module uses various algorithms to calculate the pI value of a molecule based on its sequence and can merge results for multiple chains.
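    As a minimal sketch of how a sequence-based pI calculation can work (not this module's implementation), the net charge can be written as a Henderson-Hasselbalch sum over the ionizable groups and the pI located by bisection. The pKa values below are illustrative assumptions, and the function names are hypothetical.

```python
# Illustrative pKa values (assumed for this sketch; not taken from the module).
PKA_POSITIVE = {"Nterm": 8.2, "K": 10.4, "R": 12.0, "H": 6.4}
PKA_NEGATIVE = {"Cterm": 3.2, "D": 4.0, "E": 4.5, "C": 9.0, "Y": 10.0}


def net_charge(sequence, ph):
    """Net charge of an upper-case one-letter sequence at the given pH."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE["Nterm"]))
    negative = 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["Cterm"] - ph))
    for residue in sequence:
        if residue in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return positive - negative


def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Bisect for the pH at which the net charge crosses zero."""
    sequence = sequence.upper()
    while high - low > tolerance:
        middle = (low + high) / 2.0
        # Net charge decreases monotonically with pH, so a positive charge
        # means the pI lies above the midpoint.
        if net_charge(sequence, middle) > 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0
```

    Different algorithms differ mainly in the pKa table they use, which is why the same sequence can yield slightly different pI values across methods.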

    Based on a comparison of experimentally measured pI data from a subset of internal antibodies by the WeiXin team, the Sillero algorithm demonstrates relatively higher accuracy and is recommended for use.

    The figure below shows the antibody molecules used in the WeiXin tests along with the corresponding ranges and averages of experimentally measured pI values.

    image.png

    The figure below illustrates the differences and correlations between the pI values calculated using different algorithms and the experimentally measured averages.

    image.png

    Based on metrics such as R and RMSE, the Sillero algorithm shows slightly better correlation compared to other algorithms.

    464e925a1c78788da290f4691171545.png

    Parameter Description

    Protein Sequence File

    File containing the protein sequence in FASTA format.

    pI Result File

    Output file for predicted pI values using the selected model, default name is result.csv.

    Plot

    Whether to plot a two-dimensional scatter plot, default is False.

    Plot File

    Graphical representation of the two-dimensional scatter plot (molecular weight vs. isoelectric point), default name is result.png.

    Merge Chain

    Merge the pI values of multiple chains that belong to the same molecule, based on chain names.
    For example, mol1.chain1 and mol1.chain2 are merged into a single result for the molecule mol1. Chains with the same name are also treated as belonging to the same molecule.
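    The grouping rule can be sketched as follows. This is an assumption-laden illustration: the helper name is hypothetical, the molecule name is taken to be everything before the last dot, and how the module combines the grouped chains into one pI value is not specified here.

```python
from collections import defaultdict


def group_chains(record_names):
    """Group FASTA record names by molecule name.

    Assumption: the molecule name is the text before the last '.';
    a name without '.' is used as the molecule name itself, so
    identical names fall into the same group.
    """
    groups = defaultdict(list)
    for name in record_names:
        molecule = name.rsplit(".", 1)[0]
        groups[molecule].append(name)
    return dict(groups)


# mol1.chain1 and mol1.chain2 end up under the key "mol1".
example = group_chains(["mol1.chain1", "mol1.chain2", "mol2", "mol2"])
```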

    Merge Output File

    Available only when merge_chain=True, default value is merged.csv.

    Job Number

    Number of parallel tasks, default is 1.

    Result Description

    The output includes:

    Output File Name Description
    result.png Two-dimensional scatter plot (molecular weight vs. isoelectric point) rendered as a heatmap; output only when Plot=True
    result.csv Output file for predicted pI values using the selected model
    merged.csv Merged output file for pI values of multiple chains

    The result.csv file includes the following information:

    Field Name Description
    Protein ID Protein sequence name
    Molecular weight (Da) Protein molecular weight
    pI Protein isoelectric point

    References

    Kozlowski LP. IPC - Isoelectric Point Calculator. Biol Direct. 2016 Oct 21;11(1):55.

  • Name: Protein Structure Prediction (AlphaFold2.3.2)
    Description: AlphaFold2 is a highly accurate protein structure prediction algorithm whose performance on part of the CASP14 targets approached experimental accuracy. It is mainly suited to proteins and complexes that have some homologous sequences. The model was entered in CASP14 and published in Nature. v2.3.2 was the latest version as of October 2023. AF3-like modules (such as Boltz-1, Chai-1, HelixFold3, and Protenix) are recommended instead.
    Tags: undefined
    Author: DeepMind, Jumper, J., Evans, R., Pritzel, A. et al.
    Release: 2021-11-09 08:00:00
    Reference: Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589.

    AlphaFold2(v2.3.2)

    简介

    AlphaFold2是目前业界优秀的蛋白质结构预测方法。由Deepmind 团队开发,在2020年的蛋白质结构预测大赛CASP14中, AlphaFold 2 得到了接近90分的成绩,排名第一,大幅度领先第二名,对大部分蛋白质结构的预测与真实结构只差一个原子的宽度,达到了人类利用冷冻电镜等复杂仪器观察预测的水平,这是蛋白质结构预测史无前例的巨大进步。后续更新版本支持复合物结构预测,包括蛋白-多肽复合物的预测。

    当前版本:v2.3.2, 是截止于2023年10月的最新版本。
    image.png
    image.png
    image.png
    上图:蛋白单体预测精度
    image.png
    上图:蛋白复合物预测精度

    参数说明

    Input File

    输入序列文件,fasta格式

    Type

    预测任务类型,monomer 或者 multimer
    monomer:单体蛋白,单条链
    multimer:复合物,多条链,最大可以6条链,超过6条系统不处理

    Relax

    优化结构模式
    all:优化所有的结构
    best:只优化打分最高的结构,这个模式只输出一个结构
    none:不做优化

    MSA Database

    多序列比对使用的数据库
    full_dbs:全库,更耗时,但相比reduced_db更精确
    reduced_dbs:精简库,速度更快,但是牺牲准确性

    结果说明

    输出结果包括:

    输出文件名称 说明
    ranking_debug.csv 预测模型可信度评估文件,其中包含用于执行模型排名的pLDDT, ipTM, pTM值,以及到原始模型名称的映射。
    ranked_*.pdb 预测最终蛋白结构文件。默认提供1个打分最高的优化后的结构
    PAE_0.csv 当预测为复合物结构时,生成最优模型的Predicted aligned error(PAE)热图CSV数据。
    PAE_Heatmap_0.png 当预测为复合物结构时,生成最优模型的Predicted aligned error(PAE)热图。
    PAE.tar.gz 当预测为复合物结构时,生成所有模型的Predicted aligned error(PAE)热图。

    其中评估结构预测可信度指标分为pLDDT和ipTM:

    • pLDDT是针对单体结构预测可信度指标,值范围是0-100,该值越大说明预测的结构越可靠。低于70被认为可靠性较低,低于50基本认为是可信度非常低,为无序预测。
    pLDDT > 90:Very high
    90 > pLDDT > 70:Confident
    70 > pLDDT > 50:Low
    pLDDT < 50:Very low
    
    • pTM和ipTM用于评估复合物预测的准确性。pTM和ipTM的加权组合是针对复合物预测可信度指标:model confidence = 0.8 · ipTM + 0.2 · pTM,值范围是0-1,该值越大说明预测的复合物结构越可靠。
      • pTM(the predicted template modelling)是AlphaFold-Multimer预测复合物整体结构的综合测量,该值高于0.5表示复合物的整体预测折叠可能类似于真实结构,其低于 0.5表示预测结构可能是错误的。
      • ipTM(the interface predicted template modelling)是不同链残基之间相互作用的评分,该值高于0.8表明高质量的预测结果,低于0.6表明预测结果可能失败,介于0.6-0.8之间是一个灰色地带,预测可能正确或者错误。
    ipTM >= 0.80:High quality 
    0.6 <=  ipTM <  0.80:Acceptable quality
    0.00 <=  ipTM <  0.6:Incorrect
    

    对结构准确性分析应该综合考虑所有指标,包括pTM、ipTM、pLDDT 和 PAE。

    参考文献

    • Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589.
    • Richard Evans, Michael O’Neill, Alexander Pritzel, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2021 Oct;463034.
    • https://github.com/deepmind/alphafold
    • https://github.com/deepmind/alphafold/releases/tag/v2.3.0

    AlphaFold2 (v2.3.2)

    Introduction

    AlphaFold2 is one of the leading protein structure prediction methods. Developed by DeepMind, it scored close to 90 in the 2020 CASP14 protein structure prediction competition, ranking first and far ahead of the second-place entry. For most proteins its predictions differed from the experimental structure by about the width of an atom, an accuracy comparable to that of experimental techniques such as cryo-electron microscopy. This was an unprecedented advance in protein structure prediction. Later releases also support complex structure prediction, including protein-peptide complexes.

    Current Version: v2.3.2, the latest version as of October 2023.
    image.png
    image.png
    image.png
    Above: Protein monomer prediction accuracy
    image.png
    Above: Protein complex prediction accuracy

    Parameter Description

    Input File

    Input sequence file in FASTA format.

    Type

    Prediction task type, either monomer or multimer.
    monomer: Single protein, single chain.
    multimer: Complex, multiple chains, with a maximum of 6 chains. Systems with more than 6 chains are not processed.

    Relax

    Structure optimization mode.
    all: Optimize all structures.
    best: Optimize only the highest-scoring structure; this mode outputs only one structure.
    none: No optimization.

    MSA Database

    Database used for multiple sequence alignment.
    full_dbs: Full database, more time-consuming but more accurate than reduced_dbs.
    reduced_dbs: Reduced database, faster but sacrifices accuracy.

    Result Description

    The output includes:

    Output File Name Description
    ranking_debug.csv Confidence evaluation file of the prediction model, containing pLDDT, ipTM, pTM values used for model ranking and mapping to the original model names.
    ranked_*.pdb Final predicted protein structure files. By default, the optimized highest-scoring structure is provided.
    PAE_0.csv For complex structure predictions, generates a Predicted Aligned Error (PAE) heatmap CSV data for the best model.
    PAE_Heatmap_0.png For complex structure predictions, generates a Predicted Aligned Error (PAE) heatmap for the best model.
    PAE.tar.gz For complex structure predictions, generates PAE heatmaps for all models.

    The confidence metrics for structure prediction are pLDDT, pTM and ipTM:

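    • pLDDT is a confidence metric for monomer structure prediction, ranging from 0 to 100. A higher value indicates a more reliable prediction. Values below 70 are considered low confidence, and values below 50 very low confidence, typically corresponding to disordered regions.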
    pLDDT > 90: Very high
    90 > pLDDT > 70: Confident
    70 > pLDDT > 50: Low
    pLDDT < 50: Very low
    
    • pTM and ipTM are used to evaluate the accuracy of complex predictions. The weighted combination of pTM and ipTM serves as a confidence metric for complex predictions: model confidence = 0.8 · ipTM + 0.2 · pTM. The value ranges from 0 to 1, with higher values indicating a more reliable predicted complex structure.
      • pTM (the predicted template modelling) is a comprehensive measure of the overall structure prediction by AlphaFold-Multimer. A value above 0.5 suggests that the overall predicted folding of the complex may be similar to the real structure, whereas a value below 0.5 suggests that the predicted structure may be incorrect.
      • ipTM (the interface predicted template modelling) scores the interactions between residues of different chains. A value above 0.8 indicates a high-quality prediction, a value below 0.6 indicates a likely failure of the prediction, and values between 0.6 and 0.8 represent a gray area where the prediction may be correct or incorrect.
    ipTM >= 0.80: High quality
    0.6 <= ipTM < 0.80: Acceptable quality
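    0.00 <= ipTM < 0.6: Incorrect

    Structural accuracy should be assessed by considering all metrics together, including pTM, ipTM, pLDDT and PAE.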
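
    The thresholds and the weighted score described above can be expressed as a small sketch. The function names are hypothetical, and the handling of values that fall exactly on a pLDDT boundary (90, 70, 50) is an assumption, since the bands above use strict inequalities.

```python
def model_confidence(iptm, ptm):
    """Weighted complex confidence: 0.8 * ipTM + 0.2 * pTM."""
    return 0.8 * iptm + 0.2 * ptm


def plddt_band(plddt):
    """Map a pLDDT value (0-100) to its confidence band.

    Assumption: a value exactly on a boundary goes to the lower band.
    """
    if plddt > 90:
        return "Very high"
    if plddt > 70:
        return "Confident"
    if plddt > 50:
        return "Low"
    return "Very low"


def iptm_band(iptm):
    """Map an ipTM value (0-1) to the quality bands listed above."""
    if iptm >= 0.8:
        return "High quality"
    if iptm >= 0.6:
        return "Acceptable quality"
    return "Incorrect"
```

    For example, a model with ipTM = 0.8 and pTM = 0.5 has a model confidence of 0.8 × 0.8 + 0.2 × 0.5 = 0.74.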
    

    References

    • Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589.
    • Richard Evans, Michael O’Neill, Alexander Pritzel, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2021 Oct;463034.
    • https://github.com/deepmind/alphafold
    • https://github.com/deepmind/alphafold/releases/tag/v2.3.0
  • Name: Antibody Viscosity Prediction
    Description: 基于序列预测抗体粘度 Sequence-based antibody viscosity prediction
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-05-05 00:00:00
    Reference: In silico selection of therapeutic antibodies for development: Viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 15;111(52):E18601-18606. doi: 10.1073/pnas.1421779112.

    Antibody Viscosity Prediction

    简介

    粘度是影响抗体药物开发的重要因素,临床上抗体往往需要静脉内或皮下给药,需要高浓度的抗体溶液(>100mg/mL)才能以小剂量注射获得与治疗相关的剂量,但是高浓度的抗体往往表现出高粘度,这对抗体药物的开发,制造和给药提出了挑战。研究发现,抗体序列是决定抗体粘度的关键因素,文献报道抗体粘度与Fv区域的电荷、VH和VL区域电荷的不对称性FvCSP和Fv区域的疏水指数HI存在相关性,基于抗体序列预测抗体粘度是一个有效方法。
    本模块集成了两种粘度预测方法:Sharma 与 DeepViscosity

    Sharma 粘度计算方法如下所示:
    η, cP (180 mg/mL, 25 °C) = 10^[0.15 + 1.26(0.60) * ϕ − 0.043(0.047) * q − 0.020(0.015) * qsym]
    其中,ϕ代表Fv区域的疏水指数HI,q代表Fv电荷,qsym代表VH和VL区域电荷的不对称性FvCSP。

    DeepViscosity模型是一个集成了102个人工神经网络模型的集成学习系统。该模型利用从抗体序列(特别是Fv区)提取的30种特征,对单抗进行粘度分类。分类标准基于150mg/mL浓度下的粘度值,区分低粘度(≤20 cP)和高粘度(>20 cP)的抗体。使用了包含 229 种不同单抗及其在150mg/mL浓度下实验测定粘度值的大型数据集来训练 DeepViscosity。该数据集是目前该领域公开报道的最大的同类数据集,为模型的稳健性提供了坚实基础。 在两个独立的测试集上进行的评估结果显示,DeepViscosity 表现出色。该模型在这两个测试集上分别达到了 87.5% 和 89.5% 的粘度分类准确率,其性能显著超越了以往依赖实验数据或复杂计算模拟的预测模型。

    参数说明

    Sharma

    Antibody Fasta File

    抗体的序列文件,FASTA格式,支持批量抗体,不支持纳米抗体序列。序列按要求使用分子名.链名的形式进行命名,示例如下:

    > 抗体A.H
    重链序列XXXXXX
    > 抗体A.L
    轻链序列XXXXXX
    > 抗体B.L
    轻链序列XXXXXX
    > 抗体B.H
    重链序列XXXXXX
    

    Output

    输出结果文件,默认为vis_pred_res_SM.csv

    DeepViscosity

    Antibody Fasta File

    抗体的序列文件,FASTA格式(格式要求同Sharma模式)

    Output

    输出结果文件,默认为vis_pred_res_DV.csv

    结果说明

    Sharma算法输出vis_pred_res_SM.csv文件,包含信息如下:

    字段名称 说明
    Sequence ID 抗体序列名称
    Fv Heavy Chain Charge 重链电荷
    Fv Light Chain Charge 轻链电荷
    Fv Charge Symmetry Parameter 电荷对称性指标
    Fv Hydrophobicity Index 疏水性指数
    Viscosity 抗体粘度

    DeepViscosity算法输出vis_pred_res_DV.csv文件,包含信息如下:

    字段名称 说明
    Sequence ID 抗体序列名称
    Viscosity Type 预测的抗体粘度类别,0表示低粘度(≤20 cP),1表示高粘度(>20 cP)
    Probability 预测的概率值,数值在0-1之间,大于0.5时Viscosity Type为1,反之为0

    参考文献

    • Sharma VK, Patapoff TW, Kabakoff B, et al. In silico selection of therapeutic antibodies for development: viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 30;111(52):18601-6. DOI:10.1073/pnas.1421779112
    • Kalejaye, L. A., Chu, J. M., Wu, I. E., Amofah, B., Lee, A., Hutchinson, M., … Lai, P. K. (2025). Accelerating high-concentration monoclonal antibody development with large-scale viscosity data and ensemble deep learning. mAbs, 17(1). DOI:10.1080/19420862.2025.2483944

    Antibody Viscosity Prediction

    Introduction

    Viscosity is an important factor in antibody drug development. Clinically, antibodies are often administered intravenously or subcutaneously, and a high-concentration solution (>100 mg/mL) is needed to deliver a therapeutically relevant dose in a small injection volume. However, highly concentrated antibody solutions often exhibit high viscosity, which poses challenges for development, manufacturing and administration. Antibody sequence has been found to be a key determinant of viscosity: published work reports that viscosity correlates with the charge of the Fv region, the charge asymmetry between the VH and VL regions (FvCSP), and the hydrophobicity index (HI) of the Fv region. Predicting viscosity from antibody sequence is therefore an effective approach.

    This module integrates two viscosity prediction methods: Sharma and DeepViscosity.

    Sharma method:
    η, cP (180 mg/mL, 25 °C) = 10^[0.15 + 1.26(0.60) * ϕ − 0.043(0.047) * q − 0.020(0.015) * qsym]
    where ϕ is the hydrophobicity index (HI) of the Fv region, q is the charge of the Fv region, and qsym is the charge asymmetry between the VH and VL regions (FvCSP).
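
    A minimal sketch of evaluating the Sharma expression follows. It assumes that the parenthesized numbers in the formula are uncertainties on the coefficients rather than multipliers, so only 0.15, 1.26, 0.043 and 0.020 enter the calculation; the function and argument names are hypothetical.

```python
def sharma_viscosity(hydrophobicity_index, fv_charge, fv_csp):
    """Predicted viscosity in cP at 180 mg/mL and 25 °C.

    Assumption: the values in parentheses in the published expression
    are coefficient uncertainties and are therefore not used here.
    """
    exponent = (
        0.15
        + 1.26 * hydrophobicity_index
        - 0.043 * fv_charge
        - 0.020 * fv_csp
    )
    return 10 ** exponent
```

    Under this reading, a higher hydrophobicity index raises the predicted viscosity, while a higher Fv charge or FvCSP lowers it.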

    The DeepViscosity model is an ensemble learning system that incorporates 102 artificial neural network models. It uses 30 features extracted from antibody sequences (especially the Fv region) to classify monoclonal antibodies based on their viscosity. The classification criterion is based on the viscosity value at a concentration of 150 mg/mL, distinguishing between low viscosity (≤20 cP) and high viscosity (>20 cP) antibodies. The model was trained using a large dataset containing 229 different monoclonal antibodies and their experimentally measured viscosity values at a concentration of 150 mg/mL. This dataset is the largest of its kind reported in the field to date, providing a solid foundation for the robustness of the model. Evaluation results on two independent test sets show that DeepViscosity performs remarkably well. The model achieved viscosity classification accuracies of 87.5% and 89.5% on the two test sets, respectively, significantly outperforming previous prediction models that relied on experimental data or complex computational simulations.

    Parameter Description

    Sharma

    Antibody Fasta File

    Antibody sequence file in FASTA format. Multiple antibodies are supported, but nanobody sequences are not. Each sequence must be named in the form MoleculeName.ChainName, as in the example below:

    > antibodyA.H
    XXXXXX(Heavy chain)
    > antibodyA.L
    XXXXXX(Light chain)
    > antibodyB.L
    XXXXXX(Light chain)
    > antibodyB.H
    XXXXXX(Heavy chain)
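
    The naming convention can be checked before submission with a small parser such as the sketch below. The function name is hypothetical, and splitting the header on its last dot is an assumption about how molecule and chain names are separated.

```python
def read_antibody_fasta(text):
    """Group FASTA records named MoleculeName.ChainName by molecule.

    Returns {molecule: {chain: sequence}}. Assumption: the chain name
    is the part of the header after the last '.'.
    """
    antibodies = {}
    molecule = chain = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            molecule, _, chain = header.rpartition(".")
            if not molecule:
                raise ValueError(f"header lacks MoleculeName.ChainName: {header!r}")
            antibodies.setdefault(molecule, {})[chain] = ""
        elif molecule is not None:
            antibodies[molecule][chain] += line
    return antibodies


# Placeholder sequences, for illustration only.
sample = """> antibodyA.H
EVQLVESGG
> antibodyA.L
DIQMTQSPS
> antibodyB.L
EIVLTQSPG
> antibodyB.H
QVQLQESGP
"""
parsed = read_antibody_fasta(sample)
```

    The record order within a molecule does not matter, which matches the example above where antibodyB lists its light chain first.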
    

    Output

    The output result file, default name is vis_pred_res_SM.csv

    DeepViscosity

    Antibody Fasta File

    The sequence file of the antibody, in FASTA format (the format requirements are the same as those in Sharma mode)

    Output

    The output result file, default name is vis_pred_res_DV.csv

    Result

    The output file of the Sharma algorithm is named vis_pred_res_SM.csv, which contains the following information:

    Field Name Description
    Sequence ID Antibody sequence name
    Fv Heavy Chain Charge Fv heavy chain charge
    Fv Light Chain Charge Fv light chain charge
    Fv Charge Symmetry Parameter Fv charge symmetry index
    Fv Hydrophobicity Index Fv hydrophobicity index
    Viscosity Antibody viscosity

    The output file of the DeepViscosity algorithm is named vis_pred_res_DV.csv, which contains the following information:

    Field Name Description
    Sequence ID Name of the antibody sequence
    Viscosity Type Predicted viscosity category of the antibody. 0 indicates low viscosity (≤20 cP), and 1 indicates high viscosity (>20 cP)
    Probability The predicted probability value ranges between 0 and 1. When it is greater than 0.5, the Viscosity Type is 1; otherwise, it is 0.

    References

    • Sharma VK, Patapoff TW, Kabakoff B, et al. In silico selection of therapeutic antibodies for development: viscosity, clearance, and chemical stability. Proc Natl Acad Sci U S A. 2014 Dec 30;111(52):18601-6. DOI:10.1073/pnas.1421779112
    • Kalejaye, L. A., Chu, J. M., Wu, I. E., Amofah, B., Lee, A., Hutchinson, M., … Lai, P. K. (2025). Accelerating high-concentration monoclonal antibody development with large-scale viscosity data and ensemble deep learning. mAbs, 17(1). DOI:10.1080/19420862.2025.2483944
  • Name: Molecular Docking (DiffDock)
    Description: 基于DiffDock的小分子对接工具 DiffDock-based small molecule docking tool
    Tags: undefined
    Author: Gabriele Corso
    Release: 2023-04-21 17:05:53
    Reference: Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. arXivLabs. 2022 Oct (v1).

    Molecular Docking (DiffDock)

    简介

    Molecular Docking (DiffDock)是一种扩散生成模型,主要用于小分子和蛋白对接。DiffDock在PDBBind上获得了38%的top-1成功率(RMSD<2A),大大超过了以前传统对接(23%)和深度学习(20%)方法的最先进水平。此外,以前的方法无法对接计算上的折叠结构(最大精度为10.4%),而DiffDock保持了明显更高的精度(21.7%)。最后,DiffDock具有快速的推理时间,并提供具有高选择性精度的置信度估计值。

    image.png

    参数说明

    Receptor File

    蛋白的结构文件,PDB格式。最多支持1022个氨基酸。

    Ligand File

    小分子结构文件,SDF格式

    Number of Poses

    每个配体与受体对接时得到的构象数,默认为10。

    结果说明

    输出结果包括:

    输出文件名称 说明
    Scores.csv 所有配体(≤2000)与受体的打分文件。
    output_ligand.sdf 对接后所有配体SDF文件。
    output_complex_topn.tar.gz TopN小分子中每个配体与受体打分最高的复合物构象PDB文件压缩包。
    display_complex.pdb 展示配体与受体的复合物构象文件。

    其中Scores.csv包含信息如下:

    字段名称 说明
    Ligand ID 配体编号ID
    Confidence 对接置信度打分,虽然解读和比较不同复合物或不同蛋白质构象的置信度分数可能会很困难,可以通过以下标准粗略比较(c是最佳构象的置信度分数):c > 0高置信度;-1.5 < c < 0中等置信度;c < -1.5低置信度
    Complex File Name 复合物名称

    参考文献

    • Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. arXivLabs. 2022 Oct (v1).DOI:10.48550/arXiv.2210.01776

    Molecular Docking (DiffDock)

    Introduction

    Molecular Docking (DiffDock) is a diffusion generative model for docking small molecules to proteins. On PDBBind, DiffDock achieves a top-1 success rate of 38% (RMSD < 2 Å), well above the previous state of the art for traditional docking (23%) and deep learning (20%) methods. Moreover, while previous methods struggle to dock to computationally folded structures (maximum accuracy 10.4%), DiffDock maintains significantly higher accuracy (21.7%). Finally, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.
    image.png

    Parameter Description

    Receptor File

    Structure file of the protein in PDB format. Supports up to 1022 amino acids.

    Ligand File

    Structure file of the small molecule in SDF format.

    Number of Poses

    The number of conformations obtained for each ligand docked with the receptor, default is 10.

    Result Description

    The output includes:

    Output File Name Description
    Scores.csv Scoring file for all ligands (≤2000) with the receptor.
    output_ligand.sdf SDF file containing all ligands after docking.
    output_complex_topn.tar.gz Archive of PDB files containing, for each ligand among the TopN small molecules, its highest-scoring complex conformation with the receptor.
    display_complex.pdb File displaying the complex conformation of the ligand and receptor.

    The Scores.csv contains the following information:

    Field Name Description
    Ligand ID Ligand identification ID.
    Confidence Docking confidence score. Although interpreting and comparing confidence scores of different complexes or different protein conformations can be challenging, a rough comparison can be made using the following criteria (c is the confidence score of the top pose): c > 0 indicates high confidence; -1.5 < c < 0 indicates moderate confidence; c < -1.5 indicates low confidence.
    Complex File Name Name of the complex.
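
    The rough confidence bands for the Confidence column can be applied with a helper like the sketch below. The function name is hypothetical, and assigning the exact boundary values 0 and -1.5 to the lower band is an assumption, since the criteria use strict inequalities.

```python
def confidence_level(c):
    """Classify the confidence score of the top pose.

    c > 0         -> high
    -1.5 < c < 0  -> moderate
    c < -1.5      -> low
    Assumption: c == 0 is reported as moderate and c == -1.5 as low.
    """
    if c > 0:
        return "high"
    if c > -1.5:
        return "moderate"
    return "low"
```

    As noted in the table, these bands only allow a rough comparison; scores for different complexes or protein conformations are hard to compare directly.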

    References

    • Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. arXiv. 2022 Oct (v1). DOI:10.48550/arXiv.2210.01776
  • Name: Synthetic Accessibility Score
    Description: 计算小分子化合物的合成可行性打分,反映了化合物是否容易合成。小分子合成难易程度用1到10区间数值进行评价,越靠近1表明越容易合成,越靠近10表明合成越困难。 Calculate SA score for evaluating the feasibility of compound synthesis, which indicates whether a compound is easy to synthesize. The synthesis difficulty of small molecules was evaluated with values ranging from 1 to 10. The closer to 1, the easier to synthesize, and the closer to 10, the more difficult to synthesize.
    Tags: undefined
    Author: Peter Ertl
    Release: 2023-04-21 16:46:22
    Reference: Ertl P, Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform. 2009 Jun 10;1(1):8.

    Synthetic Accessibility Score

    简介

    Synthetic Accessibility Score是一个化合物合成可行性评估指标,反映了化合物是否容易合成。其将小分子合成难易程度用1到10区间数值进行评价,越靠近1表明越容易合成,越靠近10表明合成越困难。SA Score基于片段贡献和复杂度惩罚从而评估化合物合成的难易程度,其中片段贡献值根据PubChem数据库中上百万分子计算共性进行计算,复杂度则考虑分子中非标准结构特征的占比,例如大环、非标准环的合并、立体异构和分子量大小等方面。SA Score方法已被验证,通过将40个化合物分别采用SA Score和经验丰富的药物化学家评估其合成难易程度,并且比较得到二者评分的相关性R2高达0.89,表明其在识别可合成难易程度上的可靠性较高。SA Score已成为一种普遍使用的指标,可用于预测新化合物的合成可行性,加速化合物筛选和药物发现过程。
    image.png

    参数说明

    File模式

    Input File

    小分子结构文件,支持SDF和SMILES格式。

    Smiles模式

    Smiles String

    小分子结构,SMILES格式,支持多个小分子,一行一个SMILES,例如:
    CSC1=C(c2ccc(C)s2)/C(=N/C(C)(C)C)C1
    CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5

    结果说明

    输出结果文件为sa_score.csv,包含信息如下:

    字段名称 说明
    smiles 小分子smiles结构
    Name 小分子名称
    sa_score 化合物合成可行性评估指标数值

    参考文献

    Ertl P, Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform. 2009 Jun 10;1(1):8.

    Synthetic Accessibility Score

    Introduction

    The Synthetic Accessibility (SA) Score estimates how easily a compound can be synthesized. It rates the synthesis difficulty of a small molecule on a scale of 1 to 10: values closer to 1 indicate easier synthesis, and values closer to 10 indicate more difficult synthesis. The score combines fragment contributions with a complexity penalty. Fragment contributions are derived from the fragments shared across millions of molecules in the PubChem database, while the complexity penalty accounts for non-standard structural features such as macrocycles, non-standard ring fusions, stereochemistry and molecular size. The method was validated on 40 compounds by comparing SA Scores with synthesis-difficulty estimates from experienced medicinal chemists; the agreement between the two (R² = 0.89) indicates that the SA Score is reliable at identifying how easy a compound is to synthesize. The SA Score has become a widely used metric for predicting the synthetic feasibility of new compounds and for accelerating compound screening and drug discovery.
    image.png

    Parameter Description

    File Mode

    Input File

    Small molecule structure file in SDF or SMILES format.

    Smiles Mode

    Smiles String

    SMILES format of small molecule structures, supports multiple small molecules with one SMILES string per line, for example:
    CSC1=C(c2ccc(C)s2)/C(=N/C(C)(C)C)C1
    CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5

    Result Description

    The output file is sa_score.csv, containing the following information:

    Field Name Description
    smiles SMILES structure of the small molecule
    Name Name of the small molecule
    sa_score Synthetic Accessibility Score value for the compound
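
    A downstream step might rank compounds from easiest to hardest to synthesize. The sketch below reads CSV text with the three columns listed above; the helper name is hypothetical, and the sample rows and scores are made up purely for illustration.

```python
import csv
import io


def rank_by_sa_score(csv_text):
    """Return rows sorted by ascending sa_score (easiest synthesis first)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda row: float(row["sa_score"]))


# Made-up example content in the shape of sa_score.csv.
sample_csv = (
    "smiles,Name,sa_score\n"
    "CCO,mol_a,1.8\n"
    "c1ccccc1,mol_b,1.0\n"
    "CC(C)C1CCC(C)CC1O,mol_c,3.4\n"
)
ranked = rank_by_sa_score(sample_csv)
```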

    References

    Ertl P, Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform. 2009 Jun 10;1(1):8.

  • Name: Cleavage Site Prediction (DeepDigest)
    Description: Predicts proteolytic cleavage sites for eight commonly used proteases (trypsin, ArgC, chymotrypsin, GluC, LysC, AspN, LysN, and LysargiNase).
    Tags: undefined
    Author: Yang, J
    Release: 2023-04-13 17:51:49
    Reference: Yang, J.; Gao, Z.; Ren, X.; Sheng, J.; Xu, P.; Chang, C.; Fu, Y. DeepDigest: Prediction of Protein Proteolytic Digestion with Deep Learning. Anal. Chem. 2021, 93 (15), 6094–6103.

    Cleavage Site Prediction (DeepDigest)

    简介

    Cleavage Site Prediction (DeepDigest) 模块基于深度学习,用于预测8种常用蛋白酶(trypsin, ArgC, chymotrypsin, GluC, LysC, AspN, LysN, and LysargiNase)的蛋白型裂解位点。它整合了卷积神经网络和长短时记忆网络,以实现高准确性和稳健性。与传统的机器学习算法(逻辑回归、随机森林和支持向量机)相比,对所有8种蛋白酶都有更准确的预测精度。
    以下是八种常用蛋白酶的蛋白型裂解位点预测:

    1. 胰蛋白酶(trypsin)是由胰腺分泌的一种组成蛋白质消化酶,可水解多肽和蛋白质的肽键。胰蛋白酶对于含有精氨酸、天冬酰胺等氨基酸残基的多肽和蛋白具有高度的特异性。
    2. 精氨酸C端肽段(ArgC)是由ArgC这种无钠胰蛋白酶切割产生的一种特异性肽段,它的切割位点是精氨酸残基(Arg)。
    3. 粒胰蛋白酶(chymotrypsin)是一种由胰腺分泌的消化酶,可水解含有芳香族氨基酸残基的多肽和蛋白质,具有高度的特异性。
    4. 谷氨酸C端蛋白酶(GluC)可以识别和水解蛋白质中的谷氨酸残基,通过水解蛋白质分子的内部肽键来催化蛋白质的降解。
    5. 赖氨酸C端肽段(LysC)是一种特定的氨基酸序列,通常由LysC这种胰蛋白酶采用的切割位点确定。LysC肽段包含了一个含有两个赖氨酸残基的肽段,这些赖氨酸残基是可以被氨基酸测序等分析技术识别的标志性序列。
    6. 天冬氨酸N端肽段(AspN)是由AspN这种蛋白酶切割蛋白质而产生的一种肽段,它的切割位点是氨基酸序列中的天冬氨酸残基(Asp)。
    7. 赖氨酸N端肽段(LysN)是溶葡萄球菌素的一个片段,它具有高度的特异性和活性,可针对金黄色葡萄球菌等细菌的细胞壁进行水解裂解。这一裂解是通过LysN肽段序列中的特定赖氨酸-甘氨酸(Lys-Gly)肽键实现的。
    8. L-精氨酸胺基肽酶(LysargiNase)是一种从放线菌属真菌(链霉菌属)分离出来的碱性蛋白酶,它主要作用是水解L-精氨酸的肽键,从而移除蛋白质序列中的精氨酸。
      image.png

    参数说明

    Protein Sequence File

    蛋白的序列文件,FASTA格式

    结果说明

    输出对应8个蛋白酶的csv文件,每个csv文件包括信息如下:

    字段名称 说明
    Protein id 蛋白名称
    Peptide sequence 蛋白的理论酶切肽段
    Digestibility of the N-terminal site N端肽键的裂解概率预测值
    Digestibility of the C-terminal site C端肽键的裂解概率预测值
    Digestibility of the missed site(s) 理论酶切肽段所有漏切(非N/C端)位点的酶切概率预测值,这里漏切位点指的是:符合蛋白酶特异性、理论上应被切割,但实际实验中未被切割的肽键位点。以最常见的胰蛋白酶(trypsin)为例:酶切规则是K/R 后(非 P)切开,一条蛋白质序列为… A K G R T … 理论完全酶切是在K 后切、R 后切,若实际得到肽段 AKGRT,说明 K 后没切、R 后没切, 这两个位点就是 missed sites

    *注:概率值区间为0-1,越接近1表示发生概率越大。

    参考文献

    Yang, J.; Gao, Z.; Ren, X.; Sheng, J.; Xu, P.; Chang, C.; Fu, Y. DeepDigest: Prediction of Protein Proteolytic Digestion with Deep Learning. Anal. Chem. 2021, 93 (15), 6094–6103.

    Cleavage Site Prediction (DeepDigest)

    Introduction

    The Cleavage Site Prediction (DeepDigest) module uses deep learning to predict the proteolytic cleavage sites of eight common proteases (trypsin, ArgC, chymotrypsin, GluC, LysC, AspN, LysN, and LysargiNase). It integrates convolutional neural networks and long short-term memory (LSTM) networks to achieve high accuracy and robustness. Compared with traditional machine learning algorithms (logistic regression, random forest and support vector machine), it predicts more accurately for all eight proteases.
    The eight proteases are described below:

    1. Trypsin is a digestive protease secreted by the pancreas that hydrolyzes peptide bonds in peptides and proteins. It cleaves with high specificity on the C-terminal side of lysine (Lys) and arginine (Arg) residues.
    2. ArgC is a protease that cleaves on the C-terminal side of arginine (Arg) residues.
    3. Chymotrypsin is a digestive enzyme secreted by the pancreas that preferentially cleaves on the C-terminal side of aromatic amino acid residues.
    4. GluC is a protease that recognizes glutamic acid (Glu) residues in proteins and cleaves on their C-terminal side.
    5. LysC is a protease that cleaves on the C-terminal side of lysine (Lys) residues.
    6. AspN is a protease that cleaves on the N-terminal side of aspartic acid (Asp) residues.
    7. LysN is a protease that cleaves on the N-terminal side of lysine (Lys) residues.
    8. LysargiNase is a protease that cleaves on the N-terminal side of lysine (Lys) and arginine (Arg) residues, mirroring the specificity of trypsin.
      image.png

    Parameter

    Protein Sequence File

    Protein sequence file in FASTA format

    Result

    One CSV file is output for each of the eight proteases. Each CSV file contains the following information:

    Field Name Description
    Protein id The identity of the protein from which the peptide is digested.
    Peptide sequence The sequence of the theoretical digested peptide.
    Digestibility of the N-terminal site The predicted cleavage probability of the cleavage site on the N-terminal of the peptide.
    Digestibility of the C-terminal site The predicted cleavage probability of the cleavage site on the C-terminal of the peptide.
    Digestibility of the missed site(s) The predicted cleavage probabilities of all missed cleavage sites (other than the N- and C-terminal sites) in the theoretical peptide. A missed site is a peptide bond that matches the protease specificity and should in theory be cleaved, but was not cleaved in the experiment. Taking trypsin, the most common protease, as an example: the rule is to cleave after K or R unless the next residue is P. For a protein sequence … A K G R T …, complete digestion would cleave after K and after R. If the peptide AKGRT is observed, no cleavage occurred after K or after R, and these two positions are the missed sites.

    *Note: probabilities range from 0 to 1; the closer to 1, the more likely the cleavage.
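
    The trypsin rule and the notion of missed sites can be made concrete with a short sketch. This only enumerates theoretical sites from the K/R (not before P) rule; it does not predict probabilities as DeepDigest does, and the function names are hypothetical.

```python
def tryptic_sites(sequence):
    """Indices i where trypsin would cleave after sequence[i].

    Rule: cleave after K or R unless the next residue is P.
    The last residue is excluded because no peptide bond follows it.
    """
    return [
        i
        for i in range(len(sequence) - 1)
        if sequence[i] in "KR" and sequence[i + 1] != "P"
    ]


def digest(sequence):
    """Peptides from a complete theoretical digestion."""
    peptides, start = [], 0
    for i in tryptic_sites(sequence):
        peptides.append(sequence[start : i + 1])
        start = i + 1
    peptides.append(sequence[start:])
    return peptides


def missed_sites(peptide):
    """Internal sites of an observed peptide that match the rule but were not cut."""
    return tryptic_sites(peptide)
```

    For the peptide AKGRT, missed_sites returns the positions of K and R (indices 1 and 3), while a complete digestion of the same stretch yields AK, GR and T.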

    Reference

    Yang, J.; Gao, Z.; Ren, X.; Sheng, J.; Xu, P.; Chang, C.; Fu, Y. DeepDigest: Prediction of Protein Proteolytic Digestion with Deep Learning. Anal. Chem. 2021, 93 (15), 6094–6103.

  • Name: Protein Design (RFDiffusion)
    Description: RFdiffusion可从头设计或填充蛋白质/多肽骨架,其RFpeptide模式可设计环肽。推荐通过WeView三维结构可视化编辑器来使用:WeView-> Design -> Protein Design (RFDiffusion)。 RFdiffusion enables the de novo design of proteins and cyclic peptide (RFpeptide mode). It is recommended to use in WeView-> Design -> Protein Design (RFDiffusion).
    Tags: undefined
    Author: Joseph L. Watson, David Baker
    Release: 2023-04-06 15:43:44
    Reference: Joseph L. Watson, David Juergens, Nathaniel R. Bennett, et al. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv 2022.12.09.519842.

    Protein Design (RFDiffusion)

    简介

    通过基于扩散概率模型,在蛋白质结构去噪任务上对RoseTTAFold结构预测网络进行微调,得到该蛋白质骨架生成模型,在无条件和拓扑约束的蛋白质单体设计、蛋白质结合物设计、对称低聚物设计、酶活性位点支架以及治疗性和金属结合蛋白设计的对称主题支架上取得了出色的性能。RFdiffusion能够从简单的分子规格中设计出多样的、复合的、功能性的蛋白质,也适用于环肽设计。
    模块功能为多场景蛋白设计,如:Motif Scaffolding,Unconditional protein generation,Symmetric unconditional generation (cyclic, dihedral and tetrahedral symmetries),Symmetric motif scaffolding,Binder design,Design diversification (“partial diffusion”)

    参数说明

    Custom模式

    Reference Protein Structure

    设计时的参考蛋白。

    Design Type

    设计类型,支持2种类型:‘Motif_Scaffold’与’Binder’,分别说明如下:
    ‘Motif_Scaffold’ 表示基于参考蛋白的骨架结构(由后续参数定义),进行设计。
    ‘Binder’ 表示基于受体结构进行其Binder蛋白设计。

    Number of Designs

    指定要设计的结构数量(目前最多支持 100 个)。

    Contigs

    定义蛋白的设计策略,指定蛋白中的哪部分被随机设计、保留等。
    如:该参数设置为 ‘5-15/A10-25/30-40/0 B1-100’ 时,
    ●’5-15’表示先设计长度为5到15之间(具体多长是随机的,如果要固定长度为10,可以设置为10-10)的motif
    ●‘/A10-25’表示紧接着从参考蛋白中取A链中编号为10至25的氨基酸,其N端连接到上一段’5-15’设计的motif的C端
    ●’/30-40’表示紧接着设计长度为30到40之间(具体多长是随机的)的motif,其N端连接到前面已经设计的motif的C端
    ●‘/0 ’表示链断开,前一条链结束,后续设计会是新的链,注意0后有一个空格!
    ●‘B1-100’表示从参考蛋白中取B链中编号为1至100的氨基酸,作为新的一条链

    注意:

    1. 输入的PDB文件中如果存在残基缺失,缺失残基的编号避免出现在Contigs参数中,如:A链缺失编号为45的残基,则A45或A10-50等涵盖45号残基的表示需要避免,A10-50可以修改为A10-44/A46-50;
    2. Binder设计时,需要把受体包含在Contigs中,通过’/0’链断开标识来分开受体和Binder,如:需要对含有150个氨基酸的单链受体设计相结合的Binder蛋白,受体链名为A,需要设计70-100个氨基酸长度的Binder蛋白,这里对应的Contigs的内容应填入’A1-150/0 70-100’,其中’A1-150’表示受体蛋白,'/0 '表示隔断受体与设计的Binder蛋白的直接肽键相连接,'70-100’表示设计的Binder蛋白长度为70-100个氨基酸。
    3. Contigs和Hotspot Residues中参数设定的残基序号需填写原始PDB文件中的序列编号。进行抗体计算时如果存在插入编号的情况,可以先用PDB ReNumbering进行PDB重编号。

    Hotspot Residues

    在binder模式下可以指定受体中的热点残基,格式为"链名称",“氨基酸残基”,如:‘A59,A83,A91’。

    Symmetry

    设计对称蛋白,参数值为C_N或D_N,其中C表示循环对称(Cyclic symmetry),D表示二面体对称(Dihedral symmetry),N表示单体的数量。如:C2表示设计包含2个单体的循环对称蛋白。
    注意:在进行对称蛋白设计时,Contigs参数的设置要与之匹配,如:Symmetry为C2时,Contigs参数的设置应该符合两条链。


    Binder模式

    Reference Protein Structure

    设计时的参考蛋白。

    Index Type

    为后续参数(Receptor, Initial Binder, Hotspot)中定义的氨基酸残基的索引设置类别。
    有两种选择:UID或者POS,UID表示PDB文件中自带的残基编号,该编号可能存在间断不连续,不从1开始等情况;POS表示位置编号或自然顺序编号,从1开始按顺序进行编号。
    该参数的默认值为UID。

    Receptor Range

    定义受体蛋白,从参考蛋白中选定哪部分作为受体蛋白,格式为“链名称+残基编号或范围”,多段残基用逗号分隔。例如:参数设置为A1-50,A70-100,A105,A108,B1-108时,表示:
    选取参考蛋白的A链中残基编号为1至50、70至100、105与108的残基,以及B链位置编号1至108的残基作为受体。
    注意:这里输入的残基编号应与参数Index Type中的编号类别一致。

    Length of Binder

    定义Binder蛋白的长度,可以是确定的长度,或长度范围,例如:设置为20或20-50时,
    20表示Binder蛋白的长度为20个残基;
    20-50表示Binder蛋白的长度范围为20至50个残基,具体长度视最终设计结果为准。

    Number of Designs

    指定要设计的结构数量(目前最多支持 100 个)。

    Initial Binder

    指定结构中初始的Binder,从参考蛋白中选定哪部分是初始的Binder蛋白,模型会在不改变初始Binder的前提下,进一步延长Binder。例如:参数设置为B1-10时,表示:
    指定参考蛋白中的B链残基编号为1至10的残基为初始Binder蛋白,模型会以此为基础进行延长设计。

    Hotspot Residues

    指定受体中的热点残基作为binder蛋白的结合位置,格式为“链名称+残基编号或范围”,多段残基用逗号分隔,例如:A59-61,A83,A91,表示:
    指定A链编号为59至61、83及91的残基为binder蛋白的结合位置。


    Scaffolding&Infilling模式

    Reference Protein Structure

    设计时的参考蛋白。

    Index Type

    为后续参数(Design Range)中定义的氨基酸残基的索引设置类别。
    有两种选择:UID或者POS,UID表示PDB文件中自带的残基编号,该编号可能存在间断不连续,不从1开始等情况;POS表示位置编号或自然顺序编号,从1开始按顺序进行编号。
    该参数的默认值为UID。

    Design Range

    定义需要设计的蛋白骨架范围,从参考蛋白中选定哪部分进行设计,格式为“链名称+残基编号或范围”,多段残基用逗号分隔。例如:参数设置为A1-50,A70-100,A105,A108,B1-108时,表示:
    选取参考蛋白的A链中残基编号为1至50、70至100、105与108的残基,以及B链编号1至108的残基进行骨架优化设计。
    注意:这里输入的残基编号应与参数Index Type中的编号类别一致。

    Number of Designs

    指定要设计的结构数量(目前最多支持 100 个)。

    Length

    为参数Design Range中的每段残基,定义其设计的长度,多个长度用逗号分隔。如不设置该参数,表示按Design Range中的原始长度进行设计。
    注意:长度的数量要与上述Range参数中残基段的数量一致,且顺序对应。长度可以有多种不同的取值:

    • 非负整数,其中0表示该段残基会被忽略掉,不进行设计;其他正整数表示该段残基区域设计的长度。
    • 字母N,表示该段残基区域设计时,长度不变。
    • 长度范围,如5-10,表示该段残基设计时,长度在5-10个残基的范围内变化,具体长度看最终设计结果。
      长度定义的示例如下:
      N,5-10,15表示定义了3个长度(对应的Design Range参数中的残基段应该也是3个),第1段残基设计时保持长度不变,第2段残基设计时的长度范围为5-10,第3段残基设计时的长度为15。

    Other Design Mode

    其他设计模式,可选为Fix,表示固定上述定义的Design Range不变,对结构中的所有其他区域进行设计。

    Fluctuation Length

    当其他设计模式设置为Fix时,会对其他区域进行设计,设计时会在其他区域的原长度基础上做长度变动,该参数即为长度变动的大小,默认为5,即在原长度的基础上减少或增加5个残基。


    RFPeptide模式

    Reference Protein Structure

    设计时的参考蛋白。

    Index Type

    为后续参数(Receptor, Hotspot)中定义的氨基酸残基的索引设置类别。
    有两种选择:UID或者POS,UID表示PDB文件中自带的残基编号,该编号可能存在间断不连续,不从1开始等情况;POS表示位置编号或自然顺序编号,从1开始按顺序进行编号。
    该参数的默认值为UID。

    Receptor Range

    定义受体蛋白,从参考蛋白中选定哪部分作为受体蛋白,格式为“链名称+残基编号或范围”,多段残基用逗号分隔。例如:参数设置为A1-50,A70-100,A105,A108,B1-108时,表示:
    选取参考蛋白的A链中残基编号为1至50、70至100、105与108的残基,以及B链位置编号1至108的残基作为受体。
    注意:
    1.这里输入的残基编号应与参数Index Type中的编号类别一致。
    2.同一链内的所有残基或范围必须按残基编号升序排列

    Length of Cyclic Peptide

    定义环肽的长度,可以是确定的长度,或长度范围,例如:设置为10或12-16时,
    10表示环肽蛋白的长度为10个残基;
    12-16表示环肽的长度范围为12至16个残基,具体长度视最终设计结果为准。

    Hotspot Residues

    指定受体中的热点残基作为binder蛋白的结合位置,格式为“链名称+残基编号或范围”,多段残基用逗号分隔,例如:A59-61,A83,A91,表示:
    指定A链编号为59至61、83及91的残基为binder蛋白的结合位置。


    结果说明

    设计得到的复合物结构pdb文件。

    注意:

    • 设计得到的为聚甘氨酸(poly-G)序列,这并不是错误。因为RFdiffusion是一种骨架生成模型,不会为设计的区域生成序列,因此必须使用另一种方法为Binder生成合适的序列。这里推荐采用ProteinMPNN进行序列设计(WeMol中已部署该模块,使用这里生成的整体复合物PDB进行序列设计即可)。
    • 输出的PDB文件从1开始重新编号。

    参考文献

    • Joseph L. Watson, David Juergens, Nathaniel R. Bennett, et al. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv 2022.12.09.519842. DOI: 10.1101/2022.12.09.519842

    Protein Design (RFDiffusion)

    Introduction

    RFdiffusion is a protein backbone generation model obtained by fine-tuning the RoseTTAFold structure prediction network on protein structure denoising tasks within a diffusion probabilistic framework. It achieves excellent performance in unconditional and topology-constrained protein monomer design, protein binder design, symmetric oligomer design, enzyme active site scaffolding, and symmetric motif scaffolding for therapeutic and metal-binding protein design. RFdiffusion can design diverse, complex, functional proteins from simple molecular specifications and is also suitable for cyclic peptide design.

    The module supports several design scenarios, such as: Motif Scaffolding, Unconditional protein generation, Symmetric unconditional generation (cyclic, dihedral and tetrahedral symmetries), Symmetric motif scaffolding, Binder design, and Design diversification (“partial diffusion”).

    Parameters

    Custom Mode

    Reference Protein Structure

    The reference protein for design.

    Design Type

    Design type, supporting two types: ‘Motif_Scaffold’ and ‘Binder’, explained as follows:

    • ‘Motif_Scaffold’: Design based on the backbone structure of the reference protein (defined by subsequent parameters).
    • ‘Binder’: Design binder proteins based on receptor structures.

    Number of Designs

    Specifies the number of structures to design (currently supports up to 100).

    Contigs

    Defines the protein design strategy, specifying which parts of the protein are randomly designed, retained, etc.
    For example, if this parameter is set to ‘5-15/A10-25/30-40/0 B1-100’:

    • ‘5-15’ designs a new segment 5 to 15 residues long (the exact length is chosen at random; for a fixed length of 10, set it to 10-10).
    • ‘/A10-25’ takes residues 10 to 25 from chain A of the reference protein and connects their N-terminus to the C-terminus of the preceding ‘5-15’ segment.
    • ‘/30-40’ designs a new segment 30 to 40 residues long (the exact length is chosen at random) and connects its N-terminus to the C-terminus of the preceding segment.
    • ‘/0 ’ marks a chain break: the previous chain ends and whatever follows starts a new chain. Note the space after the 0!
    • ‘B1-100’ takes residues 1 to 100 from chain B of the reference protein as a new chain.
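
    To make the Contigs syntax concrete, the following Python sketch splits a Contigs string into chains and segments according to the rules listed above. It is an illustrative parser written for this document, not the parser used by RFdiffusion or WeMol; the Segment class and parse_contigs function are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    kind: str             # "design" for new residues, "fixed" for residues taken from the reference
    chain: Optional[str]  # reference chain ID for fixed segments, None for designed ones
    start: int            # minimum length (design) or first residue number (fixed)
    end: int              # maximum length (design) or last residue number (fixed)

def parse_contigs(contigs: str) -> list[list[Segment]]:
    """Split a Contigs string into chains, each a list of segments."""
    chains = []
    for chain_text in contigs.split():       # the space after "/0" separates chains
        segments = []
        for token in chain_text.split("/"):
            if token in ("", "0"):           # "0" only marks the chain break
                continue
            m = re.fullmatch(r"([A-Za-z]?)(\d+)(?:-(\d+))?", token)
            if m is None:
                raise ValueError(f"unrecognized contig token: {token!r}")
            chain = m.group(1)
            first = int(m.group(2))
            last = int(m.group(3)) if m.group(3) else first
            if first > last:
                raise ValueError(f"descending range in token: {token!r}")
            segments.append(Segment("fixed" if chain else "design", chain or None, first, last))
        chains.append(segments)
    return chains

for chain in parse_contigs("5-15/A10-25/30-40/0 B1-100"):
    print(chain)
```

    The example string yields two chains: the first holds a designed segment, a fixed A10-25 segment and another designed segment; the second holds the fixed B1-100 segment.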

    Note:

    1. If residues are missing from the input PDB file, their numbers must not be covered by the Contigs parameter. For example, if chain A lacks residue 45, avoid A45 and any range that spans it, such as A10-50; A10-50 can be rewritten as A10-44/A46-50.
    2. In Binder design, the receptor must be included in Contigs and separated from the Binder by the ‘/0 ’ chain break. For example, to design a Binder of 70-100 amino acids that binds to a single-chain receptor of 150 amino acids named chain A, set Contigs to ‘A1-150/0 70-100’, where ‘A1-150’ is the receptor protein, ‘/0 ’ prevents a direct peptide bond between the receptor and the designed Binder, and ‘70-100’ is the length of the designed Binder.
    3. Residue numbers in the Contigs and Hotspot Residues parameters must be the numbers used in the original PDB file. For antibodies whose numbering contains insertion codes, renumber the PDB file with PDB ReNumbering first.
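
    Note 1 can be automated with a small helper. The sketch below rewrites a chain range so that it skips a given set of missing residue numbers; the function name and its arguments are hypothetical, and the set of missing residues is assumed to have been determined beforehand from the PDB file.

```python
def split_around_missing(chain: str, start: int, end: int, missing: set[int]) -> str:
    """Rewrite chain+range as '/'-joined contig pieces that skip missing residues."""
    pieces = []
    run_start = None                       # first residue of the current unbroken run
    for number in range(start, end + 1):
        if number in missing:
            if run_start is not None:      # close the run just before the gap
                pieces.append((run_start, number - 1))
                run_start = None
        elif run_start is None:
            run_start = number
    if run_start is not None:              # close the final run
        pieces.append((run_start, end))
    return "/".join(f"{chain}{a}-{b}" for a, b in pieces)

print(split_around_missing("A", 10, 50, {45}))  # A10-44/A46-50
```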

    Hotspot Residues

    In binder mode, hotspot residues in the receptor can be specified. Each entry is the chain name followed by the residue number, with entries separated by commas, e.g., ‘A59,A83,A91’.

    Symmetry

    Design symmetric proteins with parameter values C_N or D_N, where C indicates cyclic symmetry, D indicates dihedral symmetry, and N indicates the number of monomers. For example, C2 designs a cyclic symmetric protein with 2 monomers.
    Note: When designing symmetric proteins, the Contigs parameter settings should match, e.g., if Symmetry is C2, the Contigs parameter should correspond to two chains.


    Binder Mode

    Reference Protein Structure

    The reference protein for design.

    Index Type

    Sets the index type for amino acid residues defined in subsequent parameters (Receptor, Initial Binder, Hotspot). Two options are available: UID or POS. UID refers to the residue numbers provided in the PDB file, which may be discontinuous or not start from 1. POS refers to position numbering or natural sequential numbering starting from 1. The default value is UID.
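
    The difference between UID and POS can be illustrated with a short Python sketch. It assumes the residue numbers of one chain have already been read from the PDB file in file order; the function name and the example numbering are hypothetical.

```python
def uid_to_pos(residue_uids: list[int]) -> dict[int, int]:
    """Map PDB residue numbers (UID) of one chain to 1-based positions (POS)."""
    return {uid: pos for pos, uid in enumerate(residue_uids, start=1)}

# Hypothetical chain whose numbering starts at 5 and skips residues 8 and 9.
chain_a_uids = [5, 6, 7, 10, 11]
mapping = uid_to_pos(chain_a_uids)
print(mapping[10])  # 4: residue numbered 10 in the PDB file is the 4th residue of the chain
```

    With Index Type set to UID this residue would be entered as A10; with POS it would be entered as A4.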

    Receptor Range

    Defines the receptor protein, selecting which parts from the reference protein serve as the receptor, formatted as “chain name + residue number or range,” separated by commas for multiple segments. For example, if the parameter is set to A1-50,A70-100,A105,A108,B1-108, it means:
    Select residues numbered 1 to 50, 70 to 100, 105, and 108 from chain A, and residues numbered 1 to 108 from chain B of the reference protein as the receptor.
    Note: The residue numbers entered here should match the index type specified in Index Type.
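
    The “chain name + residue number or range” format can be expanded into explicit residue lists as follows. This is a minimal sketch for checking a selection before submitting it; the function name is hypothetical and single-letter chain names are assumed.

```python
import re
from collections import defaultdict

def parse_residue_ranges(spec: str) -> dict[str, list[int]]:
    """Expand e.g. 'A1-50,A105,B1-108' into {chain: [residue numbers]}."""
    selected = defaultdict(list)
    for item in spec.split(","):
        m = re.fullmatch(r"\s*([A-Za-z])(\d+)(?:-(\d+))?\s*", item)
        if m is None:
            raise ValueError(f"unrecognized segment: {item!r}")
        chain = m.group(1)
        first = int(m.group(2))
        last = int(m.group(3)) if m.group(3) else first   # single residue
        if last < first:
            raise ValueError(f"descending range: {item!r}")
        selected[chain].extend(range(first, last + 1))
    return dict(selected)

receptor = parse_residue_ranges("A1-50,A70-100,A105,A108,B1-108")
print({chain: len(numbers) for chain, numbers in receptor.items()})
```

    For the example above, 83 residues are selected on chain A (50 + 31 + 2) and 108 on chain B.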

    Length of Binder

    Defines the length of the Binder protein, which can be a specific length or a range. For example, setting it to 20 or 20-50 means:
    20 specifies the Binder protein length as 20 residues;
    20-50 specifies the Binder protein length range as 20 to 50 residues, with the exact length determined by the final design.

    Number of Designs

    Specifies the number of structures to design (currently supports up to 100).

    Initial Binder

    Specifies the initial Binder, i.e., the part of the reference protein that serves as the starting Binder. The model extends the Binder while leaving the initial Binder unchanged. For example, if the parameter is set to B1-10, it means:
    Residues 1 to 10 of chain B of the reference protein are used as the initial Binder, and the model extends the design from them.

    Hotspot Residues

    Specify hotspot residues in the receptor as binder protein binding sites, formatted as “chain name + residue number or range,” separated by commas for multiple segments. For example, A59-61,A83,A91 means:
    Specify residues numbered 59 to 61, 83, and 91 in chain A as binder protein binding sites.


    Scaffolding & Infilling Mode

    Reference Protein Structure

    The reference protein for design.

    Index Type

    Sets the index type for amino acid residues defined in subsequent parameters (Design Range). Two options are available: UID or POS. UID refers to the residue numbers provided in the PDB file, which may be discontinuous or not start from 1. POS refers to position numbering or natural sequential numbering starting from 1. The default value is UID.

    Design Range

    Defines the protein backbone range to design, selecting which parts from the reference protein to optimize, formatted as “chain name + residue number or range,” separated by commas for multiple segments. For example, if the parameter is set to A1-50,A70-100,A105,A108,B1-108, it means:
    Select residues numbered 1 to 50, 70 to 100, 105, and 108 from chain A, and residues numbered 1 to 108 from chain B of the reference protein for backbone optimization design.
    Note: The residue numbers entered here should match the index type specified in Index Type.

    Number of Designs

    Specifies the number of structures to design (currently supports up to 100).

    Length

    Defines the design length for each segment in the Design Range parameter, with multiple lengths separated by commas. If this parameter is not set, each segment is designed at its original length in the Design Range. Note: The number of lengths must match the number of residue segments in the Design Range parameter, and the order must correspond. A length can take any of the following forms:

    • Non-negative integers, where 0 indicates the segment will be ignored and not designed; other positive integers specify the design length for the segment area.
    • The letter N, indicating the segment length remains unchanged during design.
    • Length ranges, such as 5-10, indicating the segment design length varies between 5 and 10 residues, with the exact length determined by the final design.

    An example of length definition:
    N,5-10,15 defines three lengths (the corresponding Design Range parameter should also have three segments): the first segment keeps its length, the second segment is designed with a length of 5 to 10 residues, and the third segment is designed with a length of 15 residues.
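
    The three accepted forms of a length value can be resolved into explicit (minimum, maximum) pairs as sketched below. The function name is hypothetical, and the original segment lengths are assumed to be known from the Design Range.

```python
def parse_lengths(length_spec: str, segment_lengths: list[int]) -> list[tuple[int, int]]:
    """Return (min, max) design length per segment; (0, 0) means the segment is ignored."""
    items = [item.strip() for item in length_spec.split(",")]
    if len(items) != len(segment_lengths):
        raise ValueError("number of lengths must match number of Design Range segments")
    result = []
    for item, original in zip(items, segment_lengths):
        if item.upper() == "N":            # keep the original segment length
            result.append((original, original))
        elif "-" in item:                  # length range such as 5-10
            low, high = (int(x) for x in item.split("-"))
            if low > high:
                raise ValueError(f"descending length range: {item!r}")
            result.append((low, high))
        else:                              # fixed non-negative length
            n = int(item)
            result.append((n, n))
    return result

# Three hypothetical segments with original lengths 50, 31 and 1.
print(parse_lengths("N,5-10,15", [50, 31, 1]))  # [(50, 50), (5, 10), (15, 15)]
```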

    Other Design Mode

    An additional design mode. The available option is Fix, which keeps the Design Range defined above unchanged and designs all other regions of the structure instead.

    Fluctuation Length

    When Other Design Mode is set to Fix, the other regions are designed and their lengths may deviate from the original lengths. This parameter sets the size of that deviation. The default is 5, meaning that up to 5 residues are removed from or added to the original length.

    RFPeptide Mode

    Reference Protein Structure

    The reference protein used during design.

    Index Type

    Sets the type of residue indexing for subsequent parameters (Receptor, Hotspot).
    Two options are available: UID or POS.

    • UID refers to the residue numbers from the PDB file, which may be discontinuous or not start from 1.
    • POS refers to the positional or natural sequential numbering, starting from 1 in order.

    The default value is UID.

    Receptor Range

    Defines the receptor protein, specifying which part of the reference protein is selected as the receptor.
    Format: chain name + residue number or range, multiple segments separated by commas.
    Example: A1-50,A70-100,A105,A108,B1-108 selects the following residues of the reference protein as the receptor:

    • Residues 1–50, 70–100, 105, and 108 on chain A
    • Residues 1–108 on chain B

    Note:

    1. The residue numbers here should match the indexing type defined in Index Type.
    2. All residues or ranges within the same chain must be listed in ascending order of residue numbers.
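
    The ascending-order requirement can be checked before a job is submitted. The following Python sketch is a hypothetical validator, not part of the module; it assumes single-letter chain names and treats each chain independently.

```python
import re

def is_ascending(spec: str) -> bool:
    """Check that ranges of the same chain are listed in ascending residue order."""
    last_seen = {}                                   # chain -> last residue number so far
    for item in spec.split(","):
        m = re.fullmatch(r"([A-Za-z])(\d+)(?:-(\d+))?", item.strip())
        if m is None:
            raise ValueError(f"unrecognized segment: {item!r}")
        chain = m.group(1)
        first = int(m.group(2))
        last = int(m.group(3)) if m.group(3) else first
        previous = last_seen.get(chain)
        if first > last or (previous is not None and first <= previous):
            return False
        last_seen[chain] = last
    return True

print(is_ascending("A1-50,A70-100,A105,A108,B1-108"))  # True
print(is_ascending("A70-100,A1-50"))                   # False
```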

    Length of Cyclic Peptide

    Specifies the length of the cyclic peptide, either a fixed length or a range.
    Example:

    • 10 → cyclic peptide length is 10 residues
    • 12-16 → cyclic peptide length ranges from 12 to 16 residues; the exact length depends on the final design.

    Hotspot Residues

    Specifies the hotspot residues in the receptor where the cyclic peptide will bind.
    Format: chain name + residue number or range, multiple segments separated by commas.
    Example: A59-61,A83,A91 means:

    • Residues 59–61, 83, and 91 on chain A are designated as binding sites for the cyclic peptide.

    Results

    PDB files of the designed complex structures.

    Note:

    • The designed regions are output as poly-glycine (poly-G) sequences; this is not an error. RFdiffusion is a backbone generation model and does not generate sequences for the designed regions, so another method must be used to generate suitable sequences. ProteinMPNN is recommended for sequence design (the module is deployed in WeMol; use the complete complex PDB generated here as its input).
    • The output PDB files are renumbered starting from 1.

    References

    • Joseph L. Watson, David Juergens, Nathaniel R. Bennett, et al. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv 2022.12.09.519842. DOI: 10.1101/2022.12.09.519842
  • Name: Protein Physico-chemical Properties
    Description: 计算蛋白序列的理化性质,基本性质包括:分子质量、等电点、消光系数、不稳定系数、蛋白质的芳香值、总平均亲水性、二级结构占比,以及DeepSP计算的SCM(Spatial Charge Map,空间电荷图)和 SAP(Spatial Aggregation Propensity,空间聚集趋势)等。 Calculate the physicochemical properties of protein sequences. The computed basic properties include molecular weight, isoelectric point, extinction coefficient, instability index, aromaticity, grand average of hydropathicity (GRAVY), and secondary structure composition. The computed DeepSP properties include SCM (Spatial Charge Map) and SAP (Spatial Aggregation Propensity), etc.
    Tags: undefined
    Author: Wilkins MR, Gasteiger E, Bairoch A, Sanchez JC, Williams KL, Appel RD, Hochstrasser DF.
    Release: 2023-03-27 17:15:36
    Reference: Methods Mol Biol. 1999;112:531-52. doi: 10.1385/1-59259-584-7:531. PMID: 10027275. Comput. Struct. Biotechnol. J. 2024, 23, 2220–2229

    Protein Physico-chemical Properties

    简介

    对上传的蛋白FASTA序列分析其蛋白的理化性质,包括分子质量、等电点、消光系数、不稳定系数、蛋白质的芳香值、总平均亲水性以及二级结构占比。该功能采用Bjellqvist算法计算等电点。

    参数说明

    Protein Sequence File

    输入的蛋白FASTA文件,格式:FASTA。

    Output File

    输出文件名称,必须为CSV后缀。

    Merge Chain

    是否合并来自同一蛋白质链的信息。

    Merge Output File

    仅当merge_chain=True时可用。默认值:merged.csv。

    Job Number

    并行任务数,默认为1。

    DeepSP Output

    DeepSP数据输出文件

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 序列名称和蛋白质的信息一一对应的CSV文件
    merged.csv 合并来自同一蛋白质链的信息的CSV文件
    deepsp_descriptors.csv 当输入序列是抗体时输出对应的CSV文件

    其中result.csv和merged.csv,包含信息如下:

    字段名称 说明
    Sequence ID 蛋白序列名称
    Molecular Weight 蛋白序列分子量
    Isoelectric Point 蛋白序列等电点
    Molar Extinction Coefficient (without disulfide bond) 假设半胱氨酸被还原时的摩尔消光系数,单位为M-1·cm-1。
    Extinction Coefficient (without disulfide bond) 假设半胱氨酸被还原时的消光系数,单位为g·L-1。
    Molar Extinction Coefficient (with disulfide bond) 假设成对半胱氨酸形成的二硫键的摩尔消光系数,单位为M-1·cm-1。
    Extinction Coefficient (with disulfide bond) 假设成对半胱氨酸形成的二硫键的消光系数,单位为g·L-1。
    Instability Index 蛋白的不稳定指数,当该数值高于40时都表示蛋白质不稳定(半衰期很短)。
    Aromaticity 蛋白质的芳香值,即为Phe+Trp+Tyr的相对频率。
    Grand average of hydropathicity (GRAVY) 总平均亲水性,若此数值为负值则说明该蛋白为亲水性蛋白,反之为疏水性蛋白。
    Helix Fraction 计算Helix结构在蛋白上所占比例。Helix中的氨基酸:V,I,Y,F,W,L。
    Turn Fraction 计算Turn结构在蛋白上所占比例。Turn中的氨基酸:N,P,G,S。
    Sheet Fraction 计算Sheet结构在蛋白上所占比例。Sheet中氨基酸:E,M,A,L。

    其中deepsp_descriptors.csv包含信息如下:

    字段名称 说明
    SCM_neg_* SCM(Spatial Charge Map,空间电荷图),是一种用于量化抗体表面电荷分布的指标,一般来说,SCM 值越高,抗体溶液的黏度可能越大
    SAP_pos_* SAP(Spatial Aggregation Propensity,空间聚集趋势),一种评估抗体空间聚集趋势的指标,SAP数值越高,空间聚集趋势越大

    参考文献

    Wilkins MR, Gasteiger E, Bairoch A, Sanchez JC, Williams KL, Appel RD, Hochstrasser DF. Protein identification and analysis tools in the ExPASy server. Methods Mol Biol. 1999;112:531-52.
    Kalejaye, L.; Wu, I.-E.; Terry, T.; Lai, P.-K. DeepSP: Deep Learning-Based Spatial Properties to Predict Monoclonal Antibody Stability. Comput. Struct. Biotechnol. J. 2024, 23, 2220–2229

    Protein Physico-chemical Properties

    Introduction

    This module analyzes the physicochemical properties of the proteins in the uploaded FASTA file. The properties include molecular weight, isoelectric point, extinction coefficient, instability index, aromaticity, grand average of hydropathicity (GRAVY), and secondary structure composition. The isoelectric point (pI) is calculated with the Bjellqvist algorithm.

    Parameter Description

    Protein Sequence File

    Input protein sequence file in FASTA format.

    Output File

    Name of the output file, must have a CSV extension.

    Merge Chain

    Whether to merge information from the same protein chain.

    Merge Output File

    Only available when merge_chain=True. Default value: merged.csv.

    Job Number

    Number of parallel tasks, default is 1.

    DeepSP Output

    DeepSP data output file

    Result Description

    The output includes:

    Output File Name Description
    result.csv CSV file mapping sequence names to protein information
    merged.csv CSV file containing merged information from the same protein chain
    deepsp_descriptors.csv The corresponding CSV file output when the input sequence is an antibody

    Both result.csv and merged.csv contain the following information:

    Field Name Description
    Sequence ID Protein sequence name
    Molecular Weight Molecular weight of the protein sequence
    Isoelectric Point Isoelectric point of the protein sequence
    Molar Extinction Coefficient (without disulfide bond) Molar extinction coefficient assuming all cysteines are reduced, in M⁻¹·cm⁻¹
    Extinction Coefficient (without disulfide bond) Extinction coefficient assuming all cysteines are reduced, in g·L⁻¹
    Molar Extinction Coefficient (with disulfide bond) Molar extinction coefficient assuming paired cysteines form disulfide bonds, in M⁻¹·cm⁻¹
    Extinction Coefficient (with disulfide bond) Extinction coefficient assuming paired cysteines form disulfide bonds, in g·L⁻¹
    Instability Index Instability index of the protein; values above 40 indicate that the protein is unstable (short half-life)
    Aromaticity Aromaticity of the protein, i.e., the relative frequency of Phe+Trp+Tyr
    Grand average of hydropathicity (GRAVY) Grand average of hydropathicity; negative values indicate a hydrophilic protein, positive values a hydrophobic protein
    Helix Fraction Fraction of helical structure in the protein, amino acids considered: V, I, Y, F, W, L
    Turn Fraction Fraction of turn structure in the protein, amino acids considered: N, P, G, S
    Sheet Fraction Fraction of sheet structure in the protein, amino acids considered: E, M, A, L
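
    Several of the fields above are simple residue-composition statistics. The Python sketch below reproduces them from the definitions in the table; it is not the module's code. GRAVY is computed here with the Kyte-Doolittle hydropathy scale, which is an assumption since the table does not name the scale, and the example sequence is hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale for GRAVY).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def residue_fraction(sequence: str, residues: str) -> float:
    """Fraction of the sequence made up of the given residue types."""
    sequence = sequence.upper()
    return sum(sequence.count(r) for r in residues) / len(sequence)

def gravy(sequence: str) -> float:
    """Mean hydropathy over all residues; negative values suggest a hydrophilic protein."""
    sequence = sequence.upper()
    return sum(KYTE_DOOLITTLE[r] for r in sequence) / len(sequence)

seq = "MKWVTFISLLFLFSSAYS"                               # hypothetical example sequence
print("Aromaticity   ", residue_fraction(seq, "FWY"))
print("Helix Fraction", residue_fraction(seq, "VIYFWL"))
print("Turn Fraction ", residue_fraction(seq, "NPGS"))
print("Sheet Fraction", residue_fraction(seq, "EMAL"))
print("GRAVY         ", gravy(seq))
```

    Note that the helix, turn and sheet fractions are composition-based estimates: they count residue types associated with each structure and do not predict the actual fold.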

    The file deepsp_descriptors.csv contains the following information:

    Field Name Description
    SCM_neg_* SCM (Spatial Charge Map) is an index used to quantify the charge distribution on the antibody surface. Generally, a higher SCM value may indicate higher viscosity in the antibody solution.
    SAP_pos_* SAP (Spatial Aggregation Propensity) is an index used to evaluate the spatial aggregation tendency of an antibody. A higher SAP value indicates a greater tendency for spatial aggregation.

    Reference

    Wilkins MR, Gasteiger E, Bairoch A, Sanchez JC, Williams KL, Appel RD, Hochstrasser DF. Protein identification and analysis tools in the ExPASy server. Methods Mol Biol. 1999;112:531-52.
    Kalejaye, L.; Wu, I.-E.; Terry, T.; Lai, P.-K. DeepSP: Deep Learning-Based Spatial Properties to Predict Monoclonal Antibody Stability. Comput. Struct. Biotechnol. J. 2024, 23, 2220–2229

  • Name: Target-based Linear Peptide Design
    Description: 基于受体结构(目前支持单链)进行结合线性多肽设计。该模块算法基于AlphaFold2与Colabdesign实现。 Design linear peptides that bind to a receptor structure (currently single-chain only). The module is built upon AlphaFold2 and ColabDesign
    Tags: undefined
    Author: Wecomput
    Release: 2023-03-27 09:43:52
    Reference: NA

    Target-based Linear Peptide Design

    简介

    基于受体结构(目前支持单链)进行结合线性多肽设计。该模块算法基于AlphaFold2与Colabdesign实现。通过新型竞争结合策略进行线性肽设计。在同时存在两条肽段的情况下预测受体结构,对于单条肽段结构本身就能被准确预测的体系,该方法能以统计学显著性将亲和力更高的肽段捕获在结合态,而把另一条肽段留在游离态。在六种蛋白受体上进行了验证,这些受体已有与多条肽段的实验亲和力数据。结果表明,该方法最适用于识别中等至强亲和力、且在结合后能形成稳定二级结构的肽段。

    参数说明

    Receptor Structure

    PDB格式的受体结构。

    Binder Length

    设定肽binder的长度,如:10。

    Chain

    指定PDB文件中作为受体的链,如:“B”,如果结构中只有一条链,可以不用指定。
    注意:目前仅支持单链模式,且链的长度不超过500个氨基酸。

    Hotspot Residues

    指定受体中的热点残基,如:‘1-10,12,15’

    Binder Sequence

    指定多肽binder的起始序列,如设定,则会在此序列的基础上继续设计。

    Binder Chain

    如果已有多肽binder在参数1的PDB文件中,指定该多肽为哪条链,可以此为基础进行多肽binder的优化设计。

    Use Multimer

    默认False,是否使用Alphafold-Multimer进行设计

    Flexible

    是否设定受体的骨架为柔性。

    Output

    指定输出的结构评分文件名称,默认为“design_scores.csv”

    结果说明

    输出5个肽binder设计的PDB文件:result_0~4.pdb,为受体中选择的链结构与设计肽的复合物。5个设计结果为5次平行设计的不同结果。
    输出结构的评分指标:design_scores.csv,包含如下信息:

    字段名称 说明
    Name 预测结构的文件名
    pLDDT 局部结构的可信度指标,值范围是0-1.0,该值越大说明预测的结构越可靠。低于0.7被认为可靠性较低,低于0.5基本认为是可信度非常低,为无序预测
    pTM 预测的TM分数(the predicted template modeling score),衡量预测结构整体准确性,越大表示越准确,该分数大于0.5时,表示结构整体折叠可能与真实结构相似
    ipTM 预测的亚基接触面的TM分数(the interface predicted template modeling score),当预测结构为复合物时才有该评价指标,衡量复合物中各个亚基之间相对位置的预测准确性,越大表示越准确,大于0.8表示高质量预测,小于0.6表示预测可能失败,0.6-0.8为灰色地带,预测正确与否不确定

    参考文献

    L. Chang, A. Perez, Angew. Chem. Int. Ed. 2023, 62, e202213362; Angew. Chem. 2023, 135, e202213362.

    Target-based Linear Peptide Design

    Introduction

    Design linear peptides that bind to a receptor structure (currently single-chain only). The module is built upon AlphaFold2 and ColabDesign, employing a novel competitive-binding strategy for peptide design. It predicts the receptor structure in the presence of two peptides simultaneously; for systems in which each peptide is individually well modeled, the method captures the higher-affinity peptide in the bound state while leaving the other unbound, with statistical significance. Validation on six protein receptors with experimental affinities for multiple peptides shows that the approach is best suited for identifying medium- to high-affinity peptides that adopt stable secondary structures upon binding.

    Parameter

    Receptor Structure

    The receptor structure in PDB format.

    Binder Length

    Specifies the length of the peptide binder, e.g., 10.

    Chain

    Specifies the chain in the PDB file to be used as the receptor, e.g., “B”. If the structure contains only one chain, this parameter may not need to be specified. Note: Currently, only single-chain mode is supported, and the chain length should not exceed 500 amino acids.

    Hotspot Residues

    Specifies the hotspot residues in the receptor, e.g., ‘1-10,12,15’.

    Binder Sequence

    Specifies the starting sequence of the peptide binder. If provided, the design will be based on this sequence.

    Binder Chain

    If the PDB file given in Receptor Structure already contains a peptide binder, this parameter specifies the chain of that peptide, so that the binder can be optimized starting from it.

    Use Multimer

    Default is False. Specifies whether to use AlphaFold-Multimer for design.

    Flexible

    Specifies whether to set the receptor backbone as flexible.

    Output

    Specifies the name of the output structure scoring file. Default is “design_scores.csv”.

    Result

    Five PDB files of peptide binder designs are output: result_0.pdb to result_4.pdb. Each file is a complex of the selected receptor chain and the designed peptide; the five results come from five parallel design runs.
    The scoring metrics of the output structures are written to design_scores.csv, which contains the following information:

    Field Name Description
    Name The file name of the predicted structure.
    pLDDT The confidence score for local structure, with a range of 0-1.0. A higher value indicates a more reliable prediction. Values below 0.7 are considered low reliability, and values below 0.5 are generally regarded as very low confidence, indicating disordered predictions.
    pTM The predicted template modeling score, which measures the overall accuracy of the predicted structure. A higher score indicates greater accuracy. If the score is greater than 0.5, it suggests that the overall folding of the structure may be similar to the true structure.
    ipTM The interface predicted template modeling score, which measures the accuracy of the predicted relative positions of the subunits in the complex. A higher score indicates greater accuracy. Scores above 0.8 indicate high-quality predictions, while scores below 0.6 suggest potential failure of the prediction. Scores between 0.6 and 0.8 are in a gray area, where the accuracy of the prediction is uncertain.
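
    The thresholds in the table can be applied to design_scores.csv with a few lines of Python. This is a sketch only: the column names are taken from the table above, the sample rows are hypothetical, and pLDDT values above 1 are assumed to be on a 0-100 scale and are rescaled.

```python
import csv
import io

def interpret_scores(row: dict) -> dict:
    """Translate one row of design_scores.csv into qualitative labels."""
    plddt = float(row["pLDDT"])
    if plddt > 1.0:                      # assumption: such values are on a 0-100 scale
        plddt /= 100.0
    ptm, iptm = float(row["pTM"]), float(row["ipTM"])

    if plddt < 0.5:
        local = "very low"
    elif plddt < 0.7:
        local = "low"
    else:
        local = "reliable"
    fold = "likely similar" if ptm > 0.5 else "uncertain"
    if iptm > 0.8:
        interface = "high quality"
    elif iptm < 0.6:
        interface = "possible failure"
    else:
        interface = "gray area"
    return {"name": row["Name"], "local": local, "fold": fold, "interface": interface}

# Hypothetical file content; real values come from the module output.
sample = """Name,pLDDT,pTM,ipTM
result_0.pdb,0.85,0.62,0.83
result_1.pdb,0.64,0.48,0.55
"""
for row in csv.DictReader(io.StringIO(sample)):
    print(interpret_scores(row))
```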

    Reference

    L. Chang, A. Perez, Angew. Chem. Int. Ed. 2023, 62, e202213362; Angew. Chem. 2023, 135, e202213362.

  • Name: Antibody Paratope Prediction
    Description: 预测抗体上与抗原结合的氨基酸位点(称为Paratope),基于等变图神经网络的深度学习模型,使用抗体结构进行训练和预测,预测精度在现有方法中最佳。 Predict the amino acid sites on an antibody that bind to an antigen, known as the Paratope. The algorithm is based on a deep learning model using an equivariant graph neural network, trained and predicted on antibody structures, and has the highest prediction accuracy among existing methods.
    Tags: undefined
    Author: Lewis Chinery
    Release: 2023-03-23 10:19:16
    Reference: Lewis Chinery, Newton Wahome, Iain Moal, Charlotte M. Deane. Paragraph - Antibody Paratope prediction using Graph Neural Networks with minimal feature vectors. bioRxiv 2022.06.10.495640; doi: https://doi.org/10.1101/2022.06.10.495640

    Antibody Paratope Prediction

    简介

    Antibody Paratope Predictor模块的功能是预测抗体上与抗原结合的氨基酸位点,称为Paratope。其算法是基于等变图神经网络的深度学习模型,使用抗体结构进行训练和预测,预测精度在现有方法中最佳。

    参数说明

    Antibody PDB File

    需要预测的抗体结构,支持多个结构打包进行批量预测,格式支持 .tar、.tar.gz 或 .zip,链名称必须为H, L, H/L才能判断为抗体结构。
    阶梯计费方式:

    • 1–5 个 PDB 文件: 1000 计算量/每个
    • 6–100 个 PDB 文件:500 计算量/每个
    • 超过 100 个 PDB 文件:100 计算量/每个

    结果说明

    输出文件为result.csv,包含信息如下:

    字段名称 说明
    pdb 文件名
    chain_type 抗体链类型
    chain_id 抗体链标识
    IMGT 抗体氨基酸对应的IMGT编号
    AA 抗体氨基酸名称
    atom_num 抗体氨基酸的Alpha碳原子的原子编号(PDB文件中)。
    x,y,z 抗体氨基酸的Alpha碳原子的坐标。
    pred 该氨基酸为Paratope的预测概率(取值范围0-1),参考值为0.734,大于参考值时,为Paratope的可能性高,值越大可能性越高。

    参考文献

    Lewis Chinery, Newton Wahome, Iain Moal, Charlotte M. Deane. Paragraph - Antibody Paratope prediction using Graph Neural Networks with minimal feature vectors. bioRxiv 2022.06.10.495640.

    Antibody Paratope Prediction

    Introduction

    The Antibody Paratope Predictor module predicts the amino acid residues on an antibody that bind to the antigen, known as the Paratope. The algorithm is a deep learning model based on an equivariant graph neural network, trained and run on antibody structures. It achieves the highest prediction accuracy among existing methods.

    Parameters

    Antibody PDB File

    The antibody structures to be predicted can be provided in batches. Supported archive formats are .tar, .tar.gz, or .zip.
    Chain names must be H, L, or H/L for the structure to be recognized as an antibody.

    Tiered Pricing (Compute Cost)

    • 1–5 PDB files: 1000 compute units per file
    • 6–100 PDB files: 500 compute units per file
    • More than 100 PDB files: 100 compute units per file

    Results

    The output file is result.csv, containing the following information:

    Field Name Description
    pdb File name
    chain_type Antibody chain type
    chain_id Antibody chain identifier
    IMGT IMGT number corresponding to the antibody amino acid
    AA Antibody amino acid name
    atom_num Atom number of the alpha carbon of the antibody amino acid in the PDB file
    x, y, z Coordinates of the alpha carbon of the antibody amino acid
    pred Predicted probability (0 to 1) that the amino acid belongs to the paratope. The reference cutoff is 0.734: residues scoring above it are likely to be paratope residues, and the higher the value, the higher the likelihood.
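    As a minimal sketch of how result.csv might be post-processed, the snippet below keeps the residues whose pred exceeds the 0.734 reference value. It assumes the CSV header uses the field names listed above, with x, y, and z as separate columns; the example rows are made up for illustration and are not real module output.

```python
import csv
import io

PARATOPE_CUTOFF = 0.734  # reference value quoted in the field description


def likely_paratope(csv_text, cutoff=PARATOPE_CUTOFF):
    """Return (chain_id, IMGT, AA, pred) for rows whose pred exceeds cutoff."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pred = float(row["pred"])
        if pred > cutoff:
            hits.append((row["chain_id"], row["IMGT"], row["AA"], pred))
    # Highest predicted probability first
    return sorted(hits, key=lambda hit: hit[3], reverse=True)


# Made-up rows; the column layout is an assumption based on the field list.
example = """pdb,chain_type,chain_id,IMGT,AA,atom_num,x,y,z,pred
demo.pdb,H,H,105,ALA,801,1.0,2.0,3.0,0.12
demo.pdb,H,H,107,TYR,815,4.0,5.0,6.0,0.91
demo.pdb,L,L,36,SER,1502,7.0,8.0,9.0,0.80
"""

for hit in likely_paratope(example):
    print(hit)
```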

    Reference

    Lewis Chinery, Newton Wahome, Iain Moal, Charlotte M. Deane. Paragraph - Antibody Paratope prediction using Graph Neural Networks with minimal feature vectors. bioRxiv 2022.06.10.495640.

  • Name: Antibody Design (DiffAb)
    Description: 基于扩散概率模型和等价神经网络的抗体设计,可针对特定抗原结构生成抗体,也可基于抗体-抗原复合物结构进行抗体结构和序列的优化。 Luo et al. developed a deep generative model that jointly models sequences and structures of CDRs based on diffusion probabilistic models and equivariant neural networks. The model is capable of sequence-structure co-design, sequence design for given backbone structures, and antibody optimization.
    Tags: undefined
    Author: Shitong Luo
    Release: 2023-03-20 09:25:36
    Reference: Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, Jianzhu Ma. Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures. bioRxiv 2022.07.10.499510; doi: https://doi.org/10.1101/2022.07.10.499510

    Antibody Design (DiffAb)

    简介

    基于扩散概率模型和等价神经网络,进行抗体设计,可针对特定抗原结构生成抗体,也可基于抗体-抗原复合物结构进行抗体结构和序列的优化。
    抗体是免疫系统的蛋白质,通过与特定的抗原(如病毒和细菌)结合来保护宿主。抗体和抗原之间的结合主要是由抗体的互补性决定区域(CDR)决定的。该模块是基于扩散概率模型和等价神经网络的深度生成模型,对CDR的序列和结构共同建模。该方法可明确针对特定抗原结构生成抗体,是最早的蛋白质结构扩散概率模型之一。能进行序列-结构协同设计、给定骨架结构的序列设计和抗体优化。

    参数说明

    Antibody-Antigen Complex Structure

    抗体-抗原复合物结构文件,PDB格式

    Mode

    设计模式选择,对于抗原-抗体复合物有4种设计模式可选:

    1. Optimize:优化单个CDR的序列和结构。此模式需要抗体-抗原复合物结构和CDR标签。
    2. Fixbb:固定抗体的主干结构,仅逐个采样CDR的序列。此模式需要抗体-抗原复合物结构。
    3. Sample_one_CDR:逐个采样CDR的序列和结构。
    4. Sample_multi_CDRs:同时采样所有CDR的序列和结构。

    CDR Label

    只有在指定Optimize设计模式后,才需要选择该参数,默认值为H_CDR3,一共有6个选项:H_CDR1、H_CDR2、H_CDR3、L_CDR1、L_CDR2、L_CDR3。

    结果说明

    1.输出一个结构优化后或构建后的压缩包result.tar.gz。
    2.展示不同设计模式的第一个结构优化结果,输出结果分别如下:
    (1) Optimize模式,输出结果包括:

    输出文件名称 说明
    H_CDR1-O1_0000.pdb O1表示优化次数为1,对应的优化程度很低,序列变化很小
    H_CDR1-O2_0000.pdb O2表示优化次数为2,优化程度低,序列变化小
    H_CDR1-O4_0000.pdb 优化次数为4,优化程度较低,序列变化较小
    H_CDR1-O8_0000.pdb 优化次数为8,优化程度一般,序列变化一般
    H_CDR1-O16_0000.pdb 优化次数为16,优化程度较高,序列变化较大
    H_CDR1-O32_0000.pdb 优化次数为32,优化程度高,序列变化大
    H_CDR1-O64_0000.pdb 优化次数为64,优化程度很高,序列变化很大

    (2) Fixbb模式,输出结果包括:

    输出文件名称 说明
    H_CDR1_0000.pdb 重链CDR1区优化的结构文件
    H_CDR2_0000.pdb 重链CDR2区优化的结构文件
    H_CDR3_0000.pdb 重链CDR3区优化的结构文件
    L_CDR1_0000.pdb 轻链CDR1区优化的结构文件
    L_CDR2_0000.pdb 轻链CDR2区优化的结构文件
    L_CDR3_0000.pdb 轻链CDR3区优化的结构文件

    (3) Sample_one_CDR模式,输出文件名称与Fixbb 模式相同。
    (4) Sample_multi_CDRs模式,输出CDR区进行优化后的结构文件MultipleCDRs_0000.pdb。

    参考文献

    • Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, Jianzhu Ma. Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures. bioRxiv. 2022.07.10.499510

    Antibody Design (DiffAb)

    Introduction

    This module designs antibodies using diffusion probabilistic models and equivariant neural networks. It can generate antibodies that target a specific antigen structure, and it can optimize antibody structures and sequences starting from an antibody-antigen complex structure.
    Antibodies are immune-system proteins that protect the host by binding to specific antigens such as viruses and bacteria. Binding between antibody and antigen is determined mainly by the antibody's complementarity-determining regions (CDRs). The module is a deep generative model that jointly models the sequences and structures of CDRs. The method explicitly targets specific antigen structures and is one of the earliest diffusion probabilistic models for protein structures. It supports sequence-structure co-design, sequence design for a given backbone structure, and antibody optimization.

    Parameters

    Antibody-Antigen Complex Structure

    Structure file of the antibody-antigen complex in PDB format.

    Mode

    Selects the design mode. Four modes are available for an antibody-antigen complex:

    1. Optimize: Optimizes the sequence and structure of a single CDR. This mode requires the antibody-antigen complex structure and a CDR label.
    2. Fixbb: Fixes the backbone structure of the antibody and samples the sequence of each CDR individually. This mode requires the antibody-antigen complex structure.
    3. Sample_one_CDR: Samples the sequence and structure of each CDR individually.
    4. Sample_multi_CDRs: Simultaneously samples the sequences and structures of all CDRs.

    CDR Label

    This parameter is required only when the Optimize design mode is selected. The default is H_CDR3, and there are six options: H_CDR1, H_CDR2, H_CDR3, L_CDR1, L_CDR2, L_CDR3.

    Result Description

    1. Outputs a compressed file, result.tar.gz, containing the optimized or constructed structure.

    2. Displays the first optimized structure for each design mode. The outputs per mode are as follows:
      (1) For the Optimize mode, the output includes:

      Output File Name Description
      H_CDR1-O1_0000.pdb O1 means 1 optimization step: very low degree of optimization, very small sequence changes
      H_CDR1-O2_0000.pdb O2 means 2 optimization steps: low degree of optimization, small sequence changes
      H_CDR1-O4_0000.pdb 4 optimization steps: relatively low degree of optimization, relatively small sequence changes
      H_CDR1-O8_0000.pdb 8 optimization steps: moderate degree of optimization, moderate sequence changes
      H_CDR1-O16_0000.pdb 16 optimization steps: relatively high degree of optimization, relatively large sequence changes
      H_CDR1-O32_0000.pdb 32 optimization steps: high degree of optimization, large sequence changes
      H_CDR1-O64_0000.pdb 64 optimization steps: very high degree of optimization, very large sequence changes

      (2) For the Fixbb mode, the output includes:

      Output File Name Description
      H_CDR1_0000.pdb Optimized structure file for the heavy-chain CDR1 region
      H_CDR2_0000.pdb Optimized structure file for the heavy-chain CDR2 region
      H_CDR3_0000.pdb Optimized structure file for the heavy-chain CDR3 region
      L_CDR1_0000.pdb Optimized structure file for the light-chain CDR1 region
      L_CDR2_0000.pdb Optimized structure file for the light-chain CDR2 region
      L_CDR3_0000.pdb Optimized structure file for the light-chain CDR3 region

      (3) For the Sample_one_CDR mode, the output file names are the same as in the Fixbb mode.
      (4) For the Sample_multi_CDRs mode, the output is the structure file MultipleCDRs_0000.pdb, in which the CDR regions have been optimized.
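    The file names listed above follow a regular pattern, so a small helper can recover the CDR label and, for the Optimize mode, the number of optimization steps. This is a sketch inferred only from the example names in the tables; the regular expression and the `describe` function are illustrative assumptions, and real archives may contain other files.

```python
import re

# Matches names such as H_CDR1-O16_0000.pdb (Optimize) or L_CDR3_0000.pdb
NAME_RE = re.compile(
    r"^(?P<chain>[HL])_CDR(?P<cdr>[123])(?:-O(?P<steps>\d+))?_(?P<index>\d{4})\.pdb$"
)


def describe(filename):
    """Return the mode, CDR label, and optimization steps implied by a file name."""
    if filename.startswith("MultipleCDRs_"):
        return {"mode": "Sample_multi_CDRs", "label": "all CDRs", "steps": None}
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized output file name: {filename!r}")
    steps = match.group("steps")
    return {
        # Fixbb and Sample_one_CDR share the same names, so they cannot be told apart here.
        "mode": "Optimize" if steps else "Fixbb or Sample_one_CDR",
        "label": f"{match.group('chain')}_CDR{match.group('cdr')}",
        "steps": int(steps) if steps else None,
    }


for name in ("H_CDR1-O16_0000.pdb", "L_CDR3_0000.pdb", "MultipleCDRs_0000.pdb"):
    print(name, describe(name))
```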

    Reference

    • Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, Jianzhu Ma. Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures. bioRxiv. 2022.07.10.499510
  • Name: GMX MD Run (GMX2023)
    Description: 是利用已经准备好的体系拓扑文件以及参数文件进行基于GROMACS的分子动力学模拟。 Runs a Gromacs MD task using the prepared system topology and parameter files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 11:21:21
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    GMX MD Run (GMX2023)

    简介

    提交GROMACS对应文件,从而进行分子动力学模拟,得到平衡模拟后得到的轨迹文件。

    参数说明

    GRO File

    提交模拟体系的gro文件。该文件可以从MD Solvation或者Membrane Solvation模块获取。

    Topology File

    提交模拟体系的top文件。该文件可以从MD Solvation或者Membrane Solvation模块获取。

    ITP File

    提交模拟体系的itp文件。该文件可以从MD Solvation或者Membrane Solvation模块获取。

    Minimize MDP File

    提交进行最小化的参数化文件,文件格式为mdp。可以根据GMX MDP Generation模块的Minimization方法或者**GMX MDP Generation (Auto)**生成。

    NPT MDP File

    提交进行等压等温的参数化文件,文件格式为mdp。可以根据GMX MDP Generation模块的NPT方法或者**GMX MDP Generation (Auto)**生成。

    MD MDP File

    提交进行平衡模拟的参数化文件,文件格式为mdp。可以根据GMX MDP Generation模块的MD方法或者**GMX MDP Generation (Auto)**生成。

    结果说明

    输出结果包括:

    输出文件名称 说明
    md.cpt md模拟断点文件
    md.gro md的分子坐标文件
    md.log md记录文件
    md.tpr md模拟所需的所有初始化数据(分子拓扑、初始结构等)
    mini.gro mini运行的分子坐标文件
    mini.log mini运行记录文件
    mini.tpr mini模拟运行所需的所有初始化数据(分子拓扑、初始结构等)
    npt.gro npt的分子坐标文件
    npt.log npt记录文件
    npt.tpr npt模拟所需的所有初始化数据(分子拓扑、初始结构等)
    path.txt 模拟轨迹文件存储路径,可用于后续分析模块的Path File输入。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    GMX MD Run (GMX2023)

    Introduction

    Submit the prepared GROMACS input files to run a molecular dynamics simulation and obtain the trajectory files produced by the equilibrium simulation.

    Parameter Description

    GRO File

    Submit the gro file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    Topology File

    Submit the top file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    ITP File

    Submit the itp file of the simulated system. This file can be obtained from the MD Solvation or Membrane Solvation modules.

    Minimize MDP File

    Submit the parameter file for energy minimization, in mdp format. It can be generated with the Minimization method of the GMX MDP Generation module or with GMX MDP Generation (Auto).

    NPT MDP File

    Submit the parameter file for the NPT (isothermal-isobaric) simulation, in mdp format. It can be generated with the NPT method of the GMX MDP Generation module or with GMX MDP Generation (Auto).

    MD MDP File

    Submit the parameter file for the equilibrium simulation, in mdp format. It can be generated with the MD method of the GMX MDP Generation module or with GMX MDP Generation (Auto).

    Result Description

    The output results include:

    Output File Name Description
    md.cpt Checkpoint file for the MD simulation
    md.gro Molecular coordinate file for the MD simulation
    md.log Log file for the MD simulation
    md.tpr All initial data required for the MD simulation (molecular topology, initial structure, etc.)
    mini.gro Molecular coordinate file for the minimization run
    mini.log Log file for the minimization run
    mini.tpr All initial data required for the minimization run (molecular topology, initial structure, etc.)
    npt.gro Molecular coordinate file for the NPT simulation
    npt.log Log file for the NPT simulation
    npt.tpr All initial data required for the NPT simulation (molecular topology, initial structure, etc.)
    path.txt Storage path of the simulation trajectory files; it can be used as the Path File input of subsequent analysis modules.
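    The output names above (mini.*, npt.*, md.*) are consistent with a three-stage GROMACS run in which each stage starts from the coordinates written by the previous one. The sketch below only builds the corresponding `gmx grompp` and `gmx mdrun` command lines; the stage order, the input file names, and the `build_commands` helper are assumptions for illustration, and the module's actual invocation may differ (for example, extra restraint or checkpoint options).

```python
def build_commands(gro="system.gro", top="topol.top",
                   mini_mdp="mini.mdp", npt_mdp="npt.mdp", md_mdp="md.mdp"):
    """Return gmx command lines for minimization, NPT, and MD, in that order."""
    # (stage name, parameter file, starting coordinates)
    stages = [
        ("mini", mini_mdp, gro),
        ("npt", npt_mdp, "mini.gro"),
        ("md", md_mdp, "npt.gro"),
    ]
    commands = []
    for name, mdp, coords in stages:
        # grompp combines parameters, coordinates, and topology into a .tpr run input
        commands.append(["gmx", "grompp", "-f", mdp, "-c", coords,
                         "-p", top, "-o", f"{name}.tpr"])
        # mdrun -deffnm uses the stage name for all of its input and output files
        commands.append(["gmx", "mdrun", "-deffnm", name])
    return commands


for command in build_commands():
    print(" ".join(command))
```

    The itp file does not appear on the command line because GROMACS topologies pull it in through an #include directive in the top file.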

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: SDF File Split
    Description: 化合物库文件分割模块,可以将一个大的SDF文件分割为多个SDF文件,支持按文件个数或者分子数目分割,使得分割后的每个SD文件分子数目接近。 Splitting an SD File into multiple SD files. Each new SD File contains a compound subset of similar size from the initial file.
    Tags: undefined
    Author: Manish Sud
    Release: 2023-03-12 22:33:44
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    SDF File Split

    简介

    SDF File Split是个化合物库文件分割模块,可以将一个大的SDF文件分割为多个SDF文件,支持按文件个数或者分子数目分割,使得分割后的每个SD文件分子数目接近。

    参数说明

    Split by Files Number模式

    SDF File

    小分子库结构文件,SDF格式

    Files Number

    生成文件的数目

    Prefix

    新生成SDF文件的前缀,默认subset,生成的文件名为:subset1.sdf,subset2.sdf,以此类推。

    Split by Compounds Number模式

    SDF File

    小分子库结构文件,SDF格式

    Compounds Number

    每个新生成的SD文件包含的分子数目

    Prefix

    新生成SDF文件的前缀,默认subset,生成的文件名为:subset1.sdf,subset2.sdf,以此类推。

    结果说明

    拆分后的SDF文件列表文件。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    SDF File Split

    Introduction

    SDF File Split is a module for splitting compound library files: it divides one large SDF file into multiple SDF files. Splitting can be driven either by the number of output files or by the number of compounds per file, so that each resulting SDF file holds a similar number of molecules.

    Parameter Description

    Split by Files Number Mode

    SDF File

    Structure file of the small molecule library, in SDF format.

    Files Number

    Number of files to generate.

    Prefix

    Prefix for the newly generated SDF files, default is “subset”. The generated files will be named as: subset1.sdf, subset2.sdf, and so on.

    Split by Compounds Number Mode

    SDF File

    Structure file of the small molecule library, in SDF format.

    Compounds Number

    Number of compounds to include in each newly generated SDF file.

    Prefix

    Prefix for the newly generated SDF files, default is “subset”. The generated files will be named as: subset1.sdf, subset2.sdf, and so on.

    Result Description

    List of split SDF files.
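    The two splitting modes can be sketched in plain Python by treating each `$$$$` line as the end of a record. This is only an illustration of the idea, not the MayaChemTools implementation; in particular, how the real tool distributes the remainder across files is an assumption here, and the helper names are hypothetical.

```python
def read_records(sdf_text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def split_by_files(records, n_files):
    """Distribute records over n_files chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), n_files)
    chunks, start = [], 0
    for i in range(n_files):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def split_by_compounds(records, n_compounds):
    """Put n_compounds records in each chunk; the last chunk may be smaller."""
    return [records[i:i + n_compounds] for i in range(0, len(records), n_compounds)]


def chunk_names(chunks, prefix="subset"):
    """File names in the documented pattern: subset1.sdf, subset2.sdf, ..."""
    return [f"{prefix}{i}.sdf" for i in range(1, len(chunks) + 1)]


# Seven placeholder records (title line plus terminator only, not valid molecules)
text = "".join(f"mol{i}\n$$$$\n" for i in range(1, 8))
records = read_records(text)
print([len(c) for c in split_by_files(records, 3)])
print([len(c) for c in split_by_compounds(records, 3)])
print(chunk_names(split_by_files(records, 3)))
```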

    Reference

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

  • Name: Enumerate Stereoisomers
    Description: 枚举小分子立体异构体的工具,支持顺反异构体和对映异构体两种形式的枚举。 Combinatorial enumeration of stereoisomers for molecules around all or unassigned chiral atoms and bonds. Cis-trans isomers and optical isomers are supported.
    Tags: undefined
    Author: Manish Sud
    Release: 2023-03-12 20:08:04
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Enumerate Stereoisomers

    简介

    Enumerate Stereoisomers是枚举小分子立体异构体的工具,支持顺反异构体和对映异构体两种形式的枚举。立体异构(stereoisomerism)是在有相同分子式的化合物分子中,原子或原子团互相连接的次序相同,但在空间的排列方式不同,与构造异构同属有机化学范畴中的同分异构现象。对所有或未分配的手性原子和键周围的分子进行立体异构体的组合枚举。

    参数说明

    Enumerate Stereoisomers (File)模式

    Input File

    小分子结构文件,支持SMILES、MOL、SDF格式。

    Output File

    指定输出文件的名称,支持SDF(.sd)和SMILES格式(.smi)。

    Mode

    枚举模式,包括如下:
    UnassignedOnly:只枚举未分配手性原子和键的分子的构型异构体。所有原子和键都分配手性时,选择该选项得到该分子本身。
    All:枚举所有立体异构体,包括构型异构和构象异构。

    Number

    每个分子产生异构体的最大数目。

    Enumerate Stereoisomers (String)模式

    Smiles String

    小分子的smiles字符串,一行一个分子

    结果说明

    得到小分子构型异构体的组合SDF文件generated_isomers.sdf。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. doi: 10.1021/acs.jcim.6b00505.

    Enumerate Stereoisomers

    Introduction

    Enumerate Stereoisomers is a tool for enumerating the stereoisomers of small molecules; both cis-trans isomers and optical isomers are supported. Stereoisomerism occurs when compounds with the same molecular formula have their atoms or groups connected in the same order but arranged differently in space. Together with constitutional (structural) isomerism, it is one of the forms of isomerism in organic chemistry. The tool performs a combinatorial enumeration of stereoisomers around all, or only the unassigned, chiral atoms and bonds of a molecule.

    Parameter Description

    Enumerate Stereoisomers (File) Mode

    Input File

    The small molecule structure file, supporting SMILES, MOL, and SDF formats.

    Output File

    Specify the name of the output file, supporting SDF (.sd) and SMILES (.smi) formats.

    Mode

    Enumeration modes include:

    • UnassignedOnly: Enumerate configurational isomers only around chiral atoms and bonds that are unassigned. When every chiral atom and bond is already assigned, this option returns the molecule itself.
    • All: Enumerate stereoisomers around all chiral atoms and bonds, whether or not they are already assigned.

    Number

    Maximum number of isomers to generate for each molecule.
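    The interaction between Mode and Number can be illustrated with a toy combinatorial enumeration. The sketch below does not use a chemistry toolkit: each stereocenter is a hypothetical (name, kind, assigned) tuple, and every center is assumed to contribute two independent configurations. A real enumerator would additionally discard duplicates that arise from molecular symmetry.

```python
from itertools import islice, product


def enumerate_assignments(centers, mode="UnassignedOnly", max_isomers=None):
    """Enumerate label combinations for a list of (name, kind, assigned) centers.

    kind is 'atom' (labels R/S) or 'bond' (labels E/Z); assigned is a label or None.
    """
    choices = []
    for name, kind, assigned in centers:
        labels = ("R", "S") if kind == "atom" else ("E", "Z")
        if mode == "UnassignedOnly" and assigned is not None:
            choices.append((assigned,))  # keep the existing assignment fixed
        else:
            choices.append(labels)       # enumerate both configurations
    combos = product(*choices)
    if max_isomers is not None:
        combos = islice(combos, max_isomers)  # cap the number of isomers produced
    names = [center[0] for center in centers]
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical molecule: one assigned atom center, one unassigned atom, one unassigned bond
centers = [("C2", "atom", "R"), ("C5", "atom", None), ("C7=C8", "bond", None)]
print(len(enumerate_assignments(centers, mode="UnassignedOnly")))
print(len(enumerate_assignments(centers, mode="All")))
print(enumerate_assignments(centers, mode="All", max_isomers=3))
```

    With n enumerated centers the upper bound is 2^n combinations, which is why a cap on the number of isomers per molecule is useful for molecules with many stereocenters.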

    Enumerate Stereoisomers (String) Mode

    Smiles String

    SMILES string of the small molecule, one molecule per line.

    Result Description

    A combined SDF file (generated_isomers.sdf) containing the configurational isomers of the input molecules.

    Reference

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. doi: 10.1021/acs.jcim.6b00505.

  • Name: SDF Viewer
    Description: 小分子化合物库的可视化模块,可以针对一个SDF文件生成一个结构和属性可视、可交互、可检索的HTML页面,方便浏览化合物的结构和属性信息。 Visualization tool for the small molecular library. Generate an interactive HTML table with columns corresponding to molecules and available alphanumerical data in an input file.
    Tags: undefined
    Author: Manish Sud
    Release: 2023-03-10 00:00:00
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    SDF Viewer

    简介

    SDF Viewer是小分子化合物库的可视化模块,可以针对一个SDF文件生成一个结构和属性可视、可交互、可检索的HTML页面,方便浏览化合物的结构和属性信息。

    参数说明

    SDF File

    小分子结构文件,SDF格式

    HTML File

    输出HTML文件名,默认为library.html

    结果说明

    针对SDF文件生成一个结构和属性可视、可交互、可检索的HTML页面library.html。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. doi: 10.1021/acs.jcim.6b00505.

    SDF Viewer

    Introduction

    SDF Viewer is a visualization module for small-molecule compound libraries. From an SDF file it generates an interactive, searchable HTML page that displays the structure and properties of each compound, making the library easy to browse.

    Parameter Description

    SDF File

    The small molecule structure file in SDF format.

    HTML File

    The output HTML file name, defaulting to library.html.

    Result Description

    Generates an interactive and searchable HTML page (library.html) that visualizes the structures and properties of compounds in the SDF file.
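    To make the idea concrete, the sketch below turns the data fields of an SDF file into a plain HTML table using only the standard library. It is a simplified illustration, not the module's implementation: it does not draw structures or add search and interactivity, it keeps only the first line of each field value, and the demo records are placeholders rather than valid molecules.

```python
import html
import re

FIELD_RE = re.compile(r"^>\s*<([^>]+)>")  # data-field header such as "> <ID>"


def parse_sdf_fields(sdf_text):
    """Return one dict per record with the title line and its data fields."""
    rows, record = [], []
    for line in sdf_text.splitlines():
        if line.strip() != "$$$$":
            record.append(line)
            continue
        row = {"Name": record[0].strip() if record else ""}
        for i, text in enumerate(record[:-1]):
            match = FIELD_RE.match(text)
            if match:
                row[match.group(1)] = record[i + 1].strip()  # first value line only
        rows.append(row)
        record = []
    return rows


def to_html_table(rows):
    """Render rows as an HTML table; columns appear in first-seen order."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    head = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(row.get(c, ''))}</td>" for c in columns) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Placeholder records with empty atom blocks, for illustration only
demo = "\n".join([
    "aspirin", "  sketch", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    "> <ID>", "CPD-1", "", "> <MW>", "180.16", "", "$$$$",
    "caffeine", "  sketch", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    "> <ID>", "CPD-2", "", "$$$$",
]) + "\n"

print(to_html_table(parse_sdf_fields(demo)))
```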

    Reference

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297. doi: 10.1021/acs.jcim.6b00505.

  • Name: Antibody-Antigen Docking (HADDOCK)
    Description: 抗原抗体对接程序 Antibody-Antigen docking tool
    Tags: undefined
    Author: Cyril Dominguez
    Release: 2023-03-06 14:09:05
    Reference: Dominguez, C., Boelens, R. & Bonvin, A. M. J. J. HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J Am Chem Soc 125, 1731–1737 (2003).

    HADDOCK

    简介

    HADDOCK v3.0 是一个自下而上的对长期以来被证实的HADDOCK的重新构想,用于生物分子复合物的综合建模。旨在对HADDOCK的核心功能进行模块化和扩展。它能够充分利用模糊的相互作用约束(AIRs)来驱动对接过程。使用蛋白质-蛋白质对接基准5对它进行了评估,并与实时版本(v2.4)进行了比较。该评估是使用每个复合物的真实界面(3.9 Å)进行的,并以成功率表示;在按HADDOCK-score排名的特定解决方案子集中,至少有一个对接解决方案低于指定阈值的BM5目标数量。

    参数说明

    Antibody File

    用于进行对接的抗体PDB文件,当前仅支持普通双链抗体(需要含有重、轻链)

    Antigen File

    用于进行对接的抗原PDB文件
    注意:
    1.每次对接任务仅支持输入一个抗原结构。
    2.HADDOCK运行时长约为2-10小时,取决于抗原抗体的体系大小。

    结果说明

    输出结果包括:

    输出文件名称 说明
    score.csv 复合物构象的对接能量打分文件
    result.tar.gz 所有复合物构象PDB文件压缩包
    cluster_01_model.pdb-cluster_10_model.pdb 打分前十的复合物构象

    其中score.csv,包含信息如下:

    字段名称 说明
    RANK 打分排序
    Score 对接能量打分,其中打分值越低,结合能力越强。

    参考文献

    • Dominguez C, Boelens R, Bonvin AM. HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J Am Chem Soc. 2003 Feb 19;125(7):1731-7. doi: 10.1021/ja026939x

    HADDOCK

    Introduction

    HADDOCK v3.0 is a bottom-up reimagining of the long-established HADDOCK program for integrative modeling of biomolecular complexes. It aims to modularize and extend HADDOCK's core functionality, and it can make full use of ambiguous interaction restraints (AIRs) to drive the docking process. It was evaluated on the Protein-Protein Docking Benchmark 5 (BM5) and compared with the live version (v2.4). The evaluation used the true interface of each complex (3.9 Å) and is expressed as a success rate: the number of BM5 targets that have at least one docking solution below a specified threshold within a given subset of solutions ranked by HADDOCK score.

    Parameters

    Antibody File

    PDB file of the antibody to dock. Currently only conventional two-chain antibodies, containing both a heavy chain and a light chain, are supported.

    Antigen File

    PDB file of the antigen used for docking.
    Note:

    1. Each docking job supports only one antigen structure as input.
    2. The HADDOCK runtime is approximately 2–10 hours, depending on the size of the antigen–antibody system.

    Result Description

    The output results include:

    Output File Name Description
    score.csv Docking energy scoring file for complex conformations.
    result.tar.gz Compressed archive of all complex conformation PDB files.
    cluster_01_model.pdb-cluster_10_model.pdb The ten top-scoring complex conformations.

    In score.csv, the information is as follows:

    Field Name Description
    RANK Ranking based on scoring.
    Score Docking energy score, where lower scores indicate stronger binding capability.
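    Because lower scores indicate stronger predicted binding, the best models are the rows with the smallest Score. The sketch below reads score.csv and returns the top entries; it assumes the header contains exactly the RANK and Score fields listed above, and the example values are made up.

```python
import csv
import io


def best_models(csv_text, top_n=3):
    """Return (RANK, Score) pairs for the top_n lowest-scoring models."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row["Score"]))  # lowest score first
    return [(int(row["RANK"]), float(row["Score"])) for row in rows[:top_n]]


# Made-up scores for illustration only
example = """RANK,Score
2,-120.9
1,-132.4
3,-98.7
"""

print(best_models(example, top_n=2))
```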

    Reference

    • Dominguez C, Boelens R, Bonvin AM. HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J Am Chem Soc. 2003 Feb 19;125(7):1731-7. doi: 10.1021/ja026939x
  • Name: Cyclic Peptide Design
    Description: 基于环肽设计算法AfCycDesign实现基于环肽模板分子结构的骨架进行环肽设计,也可以全新环肽设计。 AfCycDesign based cyclic peptide design enables the design of cyclic peptides based on the scaffold of cyclic peptide template molecules, and it can also be used for de novo cyclic peptide design.
    Tags: undefined
    Author: Stephen A.
    Release: 2023-03-03 16:09:18
    Reference: Stephen A. Rettie, Katelyn V. Campbell, Asim K. Bera, Alex Kang, Simon Kozlov, Joshmyn De La Cruz, Victor Adebomi, Guangfeng Zhou, Frank DiMaio, Sergey Ovchinnikov, Gaurav Bhardwaj. Cyclic peptide structure prediction and design using AlphaFold. bioRxiv 2023.02.25.529956.

    Cyclic Peptide Design

    简介

    基于AfCycDesign算法,利用ColabDesign与AlphaFold2等技术,基于模板分子结构骨架的环肽设计,或进行全新环肽设计。测试表明,这种方法能够准确地预测来自单一序列的原生环状肽的结构,在49个案例中,有36个被预测为高置信度的环状肽,pLDDT>0.85,与原生结构相匹配,均方根偏差(RMSD)小于1.5 Å。

    参数说明

    本模块存在两种模式FixBB与Hallucination,其中前者表示进行基于模板蛋白(环肽)结构骨架的环肽设计;后者表示进行全新的环肽设计,不参考模板骨架,可设置环肽长度。

    FixBB模式参数

    Structural Template

    上传模板蛋白(环肽)结构。注意,环肽长度不能超过100个氨基酸。

    Chain

    指定模板蛋白中用于参考设计的蛋白链标识,如:“B”,如果结构中只有一条链,可以不用指定。

    Fix Position

    指定设计时固定模板蛋白中的某些位置的氨基酸不变化,如:‘1,5-10’ 将固定模板蛋白中的第1和5至10的氨基酸不变。

    Hallucination模式参数

    Peptide Length

    指定全新设计的环肽长度,如:20.

    Remove Residue

    指定设计时需要去除的氨基酸类型,如:“C,W”表示设计的环肽不会出现cysteine和Tryptophan。

    结果说明

    设计的环肽的三维结构文件result.pdb。

    参考文献

    • Stephen A. Rettie, Katelyn V. Campbell, Asim K. Bera, Alex Kang, Simon Kozlov, Joshmyn De La Cruz, Victor Adebomi, Guangfeng Zhou, Frank DiMaio, Sergey Ovchinnikov, Gaurav Bhardwaj. Cyclic peptide structure prediction and design using AlphaFold. bioRxiv 2023.02.25.529956.

    Cyclic Peptide Design

    Introduction

    The Cyclic Peptide Design module uses the AfCycDesign algorithm, together with ColabDesign and AlphaFold2, to design cyclic peptides either on the backbone of a template structure or de novo. In tests, the method accurately predicted the structures of native cyclic peptides from a single sequence: 36 of 49 cases were predicted with high confidence (pLDDT > 0.85) and matched the native structure with a root mean square deviation (RMSD) below 1.5 Å.


    Parameters

    This module has two modes: FixBB and Hallucination. The former involves designing cyclic peptides based on the template protein (cyclic peptide) structure, while the latter involves designing entirely new cyclic peptides without reference to a template backbone and allows for setting the length of the cyclic peptide.

    FixBB Mode Parameters

    Structural Template

    Upload the template protein (cyclic peptide) structure. Note that the length of the cyclic peptide cannot exceed 100 amino acids.

    Chain

    Specify the protein chain identifier used for reference design in the template protein, e.g., “B”. If there is only one chain in the structure, this can be left unspecified.

    Fix Position

    Specify the amino acids in the template protein that should remain fixed during design, e.g., ‘1,5-10’ will fix amino acids at positions 1 and 5 to 10 in the template protein.
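    The position syntax in the example (single positions and inclusive ranges separated by commas) can be expanded as in the sketch below. The `parse_positions` helper is hypothetical and only illustrates how a string such as 1,5-10 maps to residue positions; the module's own parser may accept other forms.

```python
def parse_positions(spec):
    """Expand a string such as '1,5-10' into a sorted list of positions."""
    positions = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(value) for value in part.split("-", 1))
            if start > end:
                raise ValueError(f"range start exceeds end: {part!r}")
            positions.update(range(start, end + 1))  # ranges are inclusive
        else:
            positions.add(int(part))
    return sorted(positions)


print(parse_positions("1,5-10"))
```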

    Hallucination Mode Parameters

    Peptide Length

    Specify the length of the newly designed cyclic peptide, e.g., 20.

    Remove Residue

    Specify the types of amino acids to be removed during design, e.g., “C,W” indicates that the designed cyclic peptide will not contain cysteine and tryptophan.

    Results

    The three-dimensional structure file of the designed cyclic peptide is stored in result.pdb.

    References

    • Stephen A. Rettie, Katelyn V. Campbell, Asim K. Bera, Alex Kang, Simon Kozlov, Joshmyn De La Cruz, Victor Adebomi, Guangfeng Zhou, Frank DiMaio, Sergey Ovchinnikov, Gaurav Bhardwaj. Cyclic peptide structure prediction and design using AlphaFold. bioRxiv 2023.02.25.529956.
  • Name: Mutation Energy of Binding (GeoPPI)
    Description: 基于深度学习的框架,使用蛋白质复合物的深度几何表征来模拟突变对结合亲和力的影响,从而预测氨基酸突变对蛋白质-蛋白质亲和力的影响。 Deep geometric representations for modeling effects of mutations on protein-protein binding affinity.
    Tags: undefined
    Author: GeoPPI
    Release: 2023-02-28 15:46:02
    Reference: Liu X, Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol. 2021 Aug 4;17(8):e1009284. doi: 10.1371/journal.pcbi.1009284.

    Mutation Energy of Binding (GeoPPI)

    简介

    基于深度学习技术预测氨基酸突变对蛋白质-蛋白质相互作用的影响。该模块是基于开源的GeoPPI方法开发的,使用蛋白质复合物的深度几何表征来模拟突变对结合亲和力的影响。为了实现几何结构的强大表达能力和预测的稳健性,模块依次采用了两个组件,即一个几何编码器(擅长提取图形特征)和一个梯度增强树(GBT,擅长避免过度拟合)。几何编码器是一个图形神经网络,在相邻的原子上执行神经信息传递,以更新中心原子的表征。它通过一个新的自我监督学习方案进行训练,以产生蛋白质结构的深度几何表示。基于这些对复合物及其突变体的学习表征,GBT从突变数据中学习,以预测相应的结合亲和力变化。

    参数说明

    PDB File

    野生型的复合物结构,PDB格式。

    Mutation File

    突变列表文件,TXT格式,每行包含突变信息,格式如下:

    TI17R,EI19R;E_I
    AI15R;E_I
    

    每行突变信息及一个相互作用链信息,用分号“;”分隔,其中:
    TI17R中的T表示野生型的氨基酸,I表示该氨基酸所在的链,17表示结构文件中该氨基酸的UID编号,R表示突变后的氨基酸。当存在多点突变时,突变信息用逗号(“,”)隔开,如TI17R,EI19R。E_I表示复合物中产生相互作用的蛋白链是E链与I链;相应的,如果是多条链与多条链产生相互作用,如:HL_WV,表示H、L链与W、V链产生相互作用。
    需要注意的是突变信息可以是多点或者单点,但是每一行的相互作用链信息只能是一个。

    结果说明

    输出结果文件为score.csv,包含信息如下:

    字段名称 说明
    Mutation 突变位点
    Chain 突变点所在的链
    Interaction_Chains 相互作用之间的链名称
    deltaEnergy 该突变引起的结合能量的变化(wildtype-mutant),值越小说明突变后结合越弱,该突变位点对受配体之间结合越重要,单位为kcal/mol。

    参考文献

    Liu X, Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol. 2021 Aug 4;17(8):e1009284.

    MIT License

    Copyright © 2021 LiuXianggen
    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
    THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    Mutation Energy of Binding (GeoPPI)

    Introduction

    The Mutation Energy of Binding (GeoPPI) module uses deep learning to predict the effect of amino acid mutations on protein-protein interactions. It is built on the open-source GeoPPI method, which uses deep geometric representations of protein complexes to model the impact of mutations on binding affinity. To combine expressive geometric representations with robust predictions, the module applies two components in sequence: a geometric encoder (good at extracting graph features) and a gradient-boosting tree (GBT, good at avoiding overfitting). The geometric encoder is a graph neural network that performs neural message passing over neighboring atoms to update the representation of the central atom. It is trained with a novel self-supervised learning scheme to produce deep geometric representations of protein structures. Based on these learned representations of a complex and its mutants, the GBT learns from mutation data to predict the corresponding change in binding affinity.


    Parameter Description

    PDB File

    The structure of the wild-type complex in PDB format.

    Mutation File

    A file listing mutations in TXT format, with each line containing mutation information in the following format:

    TI17R,EI19R;E_I
    AI15R;E_I
    

    Each line contains the mutation information followed by one interaction-chain entry, separated by a semicolon (“;”), where:

    • In TI17R, T is the wild-type amino acid, I is the chain containing it, 17 is the UID number of that residue in the structure file, and R is the amino acid after mutation. Multiple mutations are separated by commas (“,”), as in TI17R,EI19R.
    • E_I indicates that the interacting protein chains in the complex are chain E and chain I. When several chains interact with several chains, as in HL_WV, chains H and L interact with chains W and V.

    Note that a line may list a single point mutation or several, but it may carry only one interaction-chain entry.
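    The line format described above can be checked before submission with a small parser. The sketch below is an illustration under stated assumptions: chain identifiers are single letters, residue numbers are plain integers without insertion codes, and the `Mutation` type and `parse_line` function are hypothetical names, not part of GeoPPI.

```python
import re
from typing import NamedTuple

# wild-type residue, chain letter, residue number, mutant residue (e.g. TI17R)
MUTATION_RE = re.compile(r"^([A-Z])([A-Za-z])(\d+)([A-Z])$")


class Mutation(NamedTuple):
    wild_type: str
    chain: str
    position: int
    mutant: str


def parse_line(line):
    """Parse 'TI17R,EI19R;E_I' into a mutation list and two chain groups."""
    # Exactly one ';' is expected: one interaction-chain entry per line
    mutation_part, chain_part = line.strip().split(";")
    mutations = []
    for token in mutation_part.split(","):
        match = MUTATION_RE.match(token.strip())
        if match is None:
            raise ValueError(f"bad mutation token: {token!r}")
        wild_type, chain, position, mutant = match.groups()
        mutations.append(Mutation(wild_type, chain, int(position), mutant))
    left, right = chain_part.split("_")
    return mutations, (list(left), list(right))


print(parse_line("TI17R,EI19R;E_I"))
print(parse_line("AI15R;HL_WV"))
```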

    Result Description

    The output result file is score.csv, which includes the following information:

    Field Name Description
    Mutation The mutation site
    Chain The chain where the mutation occurs
    Interaction_Chains Names of the interacting chains
    deltaEnergy Change in binding energy caused by the mutation (wild type minus mutant), in kcal/mol. A smaller value indicates weaker binding after the mutation, meaning that the mutated site is more important for binding between receptor and ligand.

    Reference

    Liu X, Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol. 2021 Aug 4;17(8):e1009284.

    MIT License

    Copyright © 2021 LiuXianggen
    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
    THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  • Name: Protein Sequence Generation (ProGen)
    Description: ProGen是一种语言模型,可以在大型蛋白质家族中生成具有功能的蛋白质序列,类似于在各种话题上生成语法和语义正确的自然语言句子。该模型使用来自>19,000个家族的2.8亿个蛋白质序列进行训练,并附加了控制标签以指定蛋白质属性。可以进一步对ProGen进行微调,以改善来自具有足够同源样本家族的蛋白质生成性能。 ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples.
    Tags: undefined
    Author: Ali Madani
    Release: 2023-02-11 00:00:00
    Reference: Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL Jr, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023 Jan 26.

    Protein Sequence Generation (ProGen)

    简介

    ProGen是一种语言模型,可以在大型蛋白质家族中生成具有可预测功能的蛋白质序列,类似于在不同主题上生成语法和语义正确的自然语言句子。该模型基于来自> 19,000个家族的2.8亿个蛋白质序列进行训练,并增加了指定蛋白质属性的控制标签。基于Progen2模型实现,ProGen2模型可扩展到64亿个参数,并在不同的序列数据集上进行训练,这些数据集来自基因组、元基因组和免疫剧目数据库的10亿多个蛋白质。ProGen2模型在捕捉观察到的进化序列的分布、产生新的可行的序列,并预测蛋白质的适应性等方面显示出最先进的性能。
    Protein Sequence Generation (ProGen)目前主要功能是基于Reference序列,进行序列的增长(从Reference序列末端开始增长),后续开放其他场景的序列生成功能。

    参数说明

    Model

    模型类型有2种可选(progen2-large,progen2-xlarge)。
    模型信息:
    progen2-large,参数数量2.7 Billion,神经网络层数32。
    progen2-xlarge,模型参数数量6.4 Billion,神经网络层数32。

    Reference Sequence

    作为参考的序列(填序列信息)
    注意:不支持多条序列,多条序列会被合并为一条序列。

    Number of Samples

    生成序列的数目。
    注意:序列长度不超过1024个氨基酸。

    结果说明

    生成的蛋白序列文件result.fasta。

    参考文献

    Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL Jr, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023 Jan 26.

    Protein Sequence Generation (ProGen)

    Introduction

    ProGen is a language model that generates protein sequences with predictable function across large protein families, much as language models generate grammatically and semantically correct natural-language sentences on diverse topics. The model was trained on 280 million protein sequences from more than 19,000 families and is augmented with control tags that specify protein properties. This module is built on ProGen2, which scales up to 6.4 billion parameters and is trained on more than a billion proteins drawn from genomic, metagenomic, and immune repertoire databases. ProGen2 shows state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness.

    Currently, the main function of Protein Sequence Generation (ProGen) is to extend sequences based on a reference sequence (growing from the end of the reference sequence). Additional sequence generation functionalities for other scenarios will be made available in the future.

    Parameter Description

    Model

    There are two model options available: progen2-large and progen2-xlarge.
    Model details:

    • progen2-large: 2.7 Billion parameters, 32 neural network layers.
    • progen2-xlarge: 6.4 Billion parameters, 32 neural network layers.

    Reference Sequence

    The reference sequence to be extended (enter the sequence itself).
    Note: Multiple sequences are not supported; multiple sequences will be merged into one sequence.

    Number of Samples

    The number of sequences to generate.
    Note: The sequence length should not exceed 1024 amino acids.

    Result Description

    The generated protein sequence file is named result.fasta.

    Reference Literature

    Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL Jr, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023 Jan 26.

  • Name: Peptide Structure Generation
    Description: 基于多肽序列生成多肽结构:输入多肽的氨基酸序列,生成线性多肽的二维或者三维结构文件,一般用于小肽结构的创建。 A tool for generating peptide structures based on peptide sequences. Input the amino acid sequence of the peptide, and generate a two-dimensional or three-dimensional structure file of the linear peptide. This tool is generally used for creating small peptide structures.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-02-07 14:55:10
    Reference: Landrum, G. (2006). RDKit: Open-source cheminformatics.

    Peptide Structure Generation

    简介

    Peptide Structure Generation模块只需要输入多肽序列字符或者文件,就能生成多肽的三维或者二维结构的SDF文件。

    参数说明

    Peptide Sequence模式

    Peptide Sequence String

    输入氨基酸序列,每行表示一条多肽,支持同时生成多条多肽。

    Generated Structure (.sdf)

    输出文件名称。

    Structure Type

    输出多肽结构类型:3d或者2d。

    Peptide File模式

    Peptide Sequence File

    输入氨基酸序列txt文件,与“Peptide Sequence”相同。
    其他参数与Peptide Sequence模式相同。

    结果说明

    得到多肽三维结构的SDF文件output.sdf。

    参考文献

    • Landrum, G. (2006). RDKit: Open-source cheminformatics. DOI:10.5281/zenodo.591637

    Peptide Structure Generation

    Introduction

    The Peptide Structure Generation module can generate three-dimensional or two-dimensional structures of peptides in SDF format based on input peptide sequences.

    Parameters

    Peptide Sequence Mode

    Peptide Sequence String

    Input amino acid sequences, with each line representing a peptide. Multiple peptides can be generated simultaneously.

    Generated Structure (.sdf)

    Output file name.

    Structure Type

    Specify the type of peptide structure to generate: 3D or 2D.

    Peptide File Mode

    Peptide Sequence File

    Input a text file (.txt) containing amino acid sequences, in the same format as “Peptide Sequence String”.
    Other parameters are the same as in the Peptide Sequence mode.

    Result Description

    The output is an SDF file named output.sdf containing the three-dimensional structure of the peptide.

    Reference

    • Landrum, G. (2006). RDKit: Open-source cheminformatics. DOI:10.5281/zenodo.591637
  • Name: Protein FEP
    Description: 基于唯信计算自主研发的Protein FEP算法,实现了蛋白稳定性与蛋白复合物亲和力的相对结合自由能计算,能用于判断单点突变对蛋白稳定性、蛋白复合物结合亲和力的影响。 Based on the Protein FEP algorithm developed by WECOMPUT, the module is capable of computing the relative binding free energy of protein stability and protein-protein binding affinity, which can be used to determine the effect of single-point mutations on protein stability and protein complex binding affinity.
    Tags: undefined
    Author: WECOMPUT
    Release: 2023-01-23 00:00:00
    Reference:

    Protein FEP

    简介

    Protein FEP是基于唯信计算自主研发的基于蛋白的自由能微扰算法AlphaFEP,实现了更高效、更精确的蛋白稳定性与蛋白复合物亲和力的相对结合自由能计算,能用于判断单点突变对蛋白稳定性、蛋白复合物结合亲和力的影响。

    基准测试

    众多文献报道,FEP方法相比于半经验方法、机器学习方法及GB/PBSA等自由能计算方法,精度更高(例如 http://dx.doi.org/10.1016/j.jmb.2023.168187,见下图,其中PCC代表预测值与SPR实验值的相关性,越高越好)。

    image.png

    唯信开发的AlphaFEP算法媲美已知的FEP方法,例如Schrodinger的FEP+,并大幅超越其他经典的非FEP方法。下图:结合自由能的预测值与实测值的相关性。

    image.png

    AlphaFEP技术特点

    1. 独特的自适应混合采样方法,允许分子构象在不同计算窗口之间跳跃,且通过随机逼近法实现自由能调整,进而保证每个窗口采样数分布最优,可在有限模拟时间内实现更多构象采样,采样效率相较同类方法提升一个数量级,提高了计算的精度和重现性。
    2. 改进的自由能计算MBAR方法:DC-MBAR,以基于多态模拟采样的数据来预测自由能。首先计算任意两个炼金态之间的重叠,并将那些具有足够重叠的状态定义为相邻状态。与传统的MBAR方法(一次使用所有数据计算每个状态的自由能)不同,DC-MBAR专注于预测相邻状态之间的自由能变化。为了准确地估计自由能变化,MBAR方程中包括与两个相邻状态重叠且大于定义阈值的其他状态。在特定阈值下,DC-MBAR预测的自由能非常接近传统MBAR方法计算的自由能。此外,DC-MBAR方案可以减少计算和存储成本。DC-MBAR方法的一个重要特征是线性缩放,这意味着随着状态数的变化,CPU时间是一条直线关系。由于基于对的计算是相互独立且可并行的,因此可以利用HPC群集上所有可访问的CPU内核,这使DC-MBAR策略更加有效。

    参数说明

    Single-point Mutation模式

    PDB File

    蛋白的结构文件,PDB格式

    Mutation

    指定单点突变的位置(如:S52K,S代表野生型氨基酸,52表示该氨基酸在蛋白PDB文件中的索引值,K代表突变后的氨基酸)

    Type

    指定单点突变类型:稳定性(S)或者结合亲和力(B)

    Chain

    指定单点突变所在的链名称

    Multipoint Mutation模式

    PDB File

    蛋白的结构文件,PDB格式

    Mutation List

    多点突变列表文件(.txt),例如:

    L28E,H
    K30T,H
    

    其中,“L”和“K”是WT;“28”和“30”是PDB文件中的残基ID;“E”和“T”是突变;“H”代表残基的链名。
    注意:

    • 建议同链多点突变,异链时采样过程不稳定,易出错。
    • 多点突变只支持结合亲和力(B)类型的计算。
    • 当前突变残基数量不要超过3个,否则计算精度大幅降低

    结果说明

    输出结果文件为result.txt,包含信息如下:

    字段名称 说明
    ligand dG 配体自由能
    complex dG 复合物自由能
    final ddG 最终突变引起的自由能(结合自由能或折叠自由能)变化,单位为kcal/mol,负值表示蛋白更稳定或结合更强,反之亦然。

    参考文献

    Jia X, Ge H, Mei Y. Free energy change estimation: The Divide and Conquer MBAR method. J Comput Chem. 2021; 42: 1204–1211. https://doi.org/10.1002/jcc.26533

    Protein FEP

    Introduction

    Protein FEP is built on AlphaFEP, a protein free energy perturbation (FEP) algorithm developed in-house by WECOMPUT. It computes relative free energies for protein stability and protein-complex binding affinity more efficiently and accurately, and can be used to assess the impact of single-point mutations on protein stability and protein-complex binding affinity.

    Benchmark Testing

    Numerous studies have reported that FEP methods are more accurate than semi-empirical methods, machine learning methods, and other free energy methods such as GB/PBSA (e.g., http://dx.doi.org/10.1016/j.jmb.2023.168187; see the figure below, where PCC is the correlation between predicted values and SPR experimental values, and higher is better).

    image.png

    The AlphaFEP algorithm developed by WECOMPUT is comparable to established FEP methods such as Schrodinger’s FEP+ and substantially outperforms classical non-FEP methods. The figure below shows the correlation between predicted and measured binding free energies.

    image.png

    AlphaFEP Technical Features

    1. A unique adaptive hybrid sampling method lets molecular conformations jump between calculation windows, and free energy estimates are adjusted by stochastic approximation so that the number of samples per window is optimally distributed. More conformations can therefore be sampled within a limited simulation time: sampling efficiency improves by an order of magnitude over comparable methods, which improves the accuracy and reproducibility of the calculation.
    2. An improved MBAR free energy estimator, DC-MBAR, predicts free energies from multi-state simulation data. It first computes the overlap between every pair of alchemical states and defines pairs with sufficient overlap as adjacent states. Unlike conventional MBAR, which uses all data at once to compute the free energy of every state, DC-MBAR focuses on the free energy change between adjacent states. To estimate that change accurately, the MBAR equations also include the other states whose overlap with the two adjacent states exceeds a defined threshold. At a suitable threshold, the free energies predicted by DC-MBAR are very close to those computed by conventional MBAR. DC-MBAR also reduces computational and storage costs. A key feature of the method is linear scaling: CPU time grows linearly with the number of states. Because the pairwise calculations are independent and parallelizable, they can use all available CPU cores on an HPC cluster, which makes the DC-MBAR strategy more efficient still.

    Parameter Description

    Single-point Mutation Mode

    PDB File

    Structure file of the protein in PDB format.

    Mutation

    Specify the position of the single-point mutation (e.g., S52K, where S represents the wild-type amino acid, 52 is the index of the amino acid in the protein PDB file, and K represents the mutated amino acid).

    Type

    Specify the type of single-point mutation: stability (S) or binding affinity (B).

    Chain

    Specify the chain where the single-point mutation occurs.

    Multipoint Mutation Mode

    PDB File

    Structure file of the protein in PDB format.

    Mutation List

    File containing a list of multipoint mutations (.txt), for example:

    L28E,H
    K30T,H
    

    Here, “L” and “K” represent wild-type residues, “28” and “30” are residue IDs in the PDB file, “E” and “T” represent mutations, and “H” denotes the chain name of the residue.
    Notes:

    • Multi-point mutations should preferably be on the same chain. When the mutations span different chains, sampling is unstable and error-prone.
    • Multi-point mutations support only the binding affinity (B) calculation type.
    • At present, no more than 3 residues should be mutated; otherwise, calculation accuracy drops significantly.
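    The mutation list format and the notes above can be checked with a short script before a job is submitted. This is a minimal sketch under stated assumptions: the function names parse_mutation_list and check_mutation_list are hypothetical, one mutation is expected per line in the form L28E,H, and the limit of 3 residues is taken from the notes above.

```python
import re

MAX_MUTATIONS = 3  # limit stated in the notes above

# wild-type residue, residue ID, mutant residue, then the chain name (e.g. "L28E,H")
_ENTRY_RE = re.compile(r"^([A-Z])(-?\d+)([A-Z]),\s*(\S+)$")


def parse_mutation_list(text):
    """Parse the contents of a mutation list file into a list of dicts."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:  # tolerate blank lines
            continue
        match = _ENTRY_RE.match(line)
        if match is None:
            raise ValueError(f"malformed mutation entry {line!r}")
        wild_type, residue_id, mutant, chain = match.groups()
        entries.append({
            "wild_type": wild_type,
            "residue_id": int(residue_id),
            "mutant": mutant,
            "chain": chain,
        })
    return entries


def check_mutation_list(entries):
    """Return warnings for lists that go against the notes above."""
    warnings = []
    if len(entries) > MAX_MUTATIONS:
        warnings.append(f"more than {MAX_MUTATIONS} mutated residues")
    if len({entry["chain"] for entry in entries}) > 1:
        warnings.append("mutations span more than one chain")
    return warnings
```

    For the example file above, parse_mutation_list returns two entries on chain H and check_mutation_list returns an empty list.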

    Result Description

    The output result file is named result.txt and includes the following information:

    Field Name Description
    ligand dG Ligand free energy
    complex dG Complex free energy
    final ddG Final change in free energy (binding or folding) caused by the mutation, in kcal/mol. A negative value indicates that the protein is more stable or has stronger binding affinity, and vice versa.

    Reference Literature

    Jia X, Ge H, Mei Y. Free energy change estimation: The Divide and Conquer MBAR method. J Comput Chem. 2021; 42: 1204–1211. https://doi.org/10.1002/jcc.26533

  • Name: Antibody Sequence Prediction (AbLang2)
    Description: 模块基于Ablang2模型实现,该模型是抗体专用语言模型,为AbLang的升级版,旨在解决抗体序列中的种系偏差(germline bias)问题,从而更有效地支持抗体设计与优化。 The module is built on Ablang2, an antibody-specific language model that upgrades AbLang and is expressly designed to counteract germline bias in antibody sequences, thereby furnishing stronger support for antibody design and optimization.
    Tags: undefined
    Author: AbLang
    Release: 2023-01-16 00:00:00
    Reference: Olsen TH, Moal IH, Deane CM. AbLang: an antibody language model for completing antibody sequences. Bioinform Adv. 2022 Jun 17;2(1):vbac046

    Antibody Sequence Prediction (AbLang2)

    简介

    进行抗体序列突变优化,同时给出序列每个位置20种残基的出现概率值(基于原序列预测)。模块基于Ablang2模型实现,该模型是抗体专用语言模型,为AbLang的升级版,旨在解决抗体序列中的种系偏差(germline bias)问题,从而更有效地支持抗体设计与优化。

    抗体多样性主要来源于V(D)J重组、CDR区域的突变以及少量非CDR区域的突变。然而,天然抗体序列中仍有很大部分与种系基因(germline)保持一致,这导致传统语言模型在预训练过程中倾向于“记住”种系序列,而忽视了那些远离种系、但对结合能力至关重要的突变。AbLang2模型的核心目标就是缓解这种种系偏差,提升模型对非种系残基的预测能力,从而更有效地指导抗体工程中的关键突变设计。

    AbLang2基于Transformer架构,延续了前代模型AbLang的双组件设计。使用OAS数据库中的非配对(仅重链或轻链)和配对(重链+轻链)抗体序列数据进行训练和微调,提升模型对完整抗体结构的建模能力。

    模型预测抗体序列困惑度(perplexity,数值越小表示序列质量越高)的对比,Ablang2效果最佳:
    image.png

    输入参数

    Fasta File

    指定需要优化残基的抗体Fv区序列文件,FASTA格式。如果同时有重链Fv(VH)、轻链Fv(VL)序列,通过英文冒号:将序列进行分隔即可,不分先后。如下所示:

    >seq1
    EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS
    >seq2
    QVQLQQSGGELAKPGASVKVSCKASGYTFSSFWMHWVRQAPGQGLEWIGYINPRSGYTEYNEIFRDKATMTTDTSTSTAYMELSSLRSEDTAVYYCASFLGRGAMDYWGQGTTVTVSS:EIVLTQSPGTLSLSPGERATLSCRASQSVSSSFFAWYQQKPGQAPRLLIYGASSRATGIPDRLSGSGSGTDFTLTITRLEPEDFAVYYCQQYDSSAITFGQGTRLEIK
    
    

    重、轻链同时存在时,后续突变优化过程中,模型会同时考虑重、轻链,符合实际情况。

    Position

    指定需要突变优化的残基。使用残基位置编号(从1开始),多个残基用逗号分隔,指定残基范围用横杠符号。如:3,10,24-30表示序列中的第3、第10与第24至30号残基,进行突变优化。
    在序列中同时存在重、轻链时,需要在残基序号前加上重(H)、轻链(L)标签,如:H5,H8-10,L3表示序列中,重链的第5、第8-10,轻链的第3号残基进行突变优化。
    注意:这里定义的待优化残基,会同时应用到Fasta文件中的每条序列(如有不匹配的残基位置,会被自动过滤掉)。

    Output Fasta

    输出优化序列的文件名,Fasta格式,默认为restored.fasta,每条序列仅会产生一条优化的序列。

    Output Prob

    输出残基概率文件名,CSV格式,默认为restore_probs.csv,输出原序列对应的Positions位置20种残基出现的概率值,以及对应位置优化后的残基。

    结果说明

    优化后的序列文件restored.fasta
    残基概率文件restore_probs.csv,包含信息如下:

    字段名称 说明
    Name 原序列名称
    Chain 链类型,H或L
    WT 序列中的初始残基
    POS AA的位置索引(从1开始)
    Restored 序列优化后,该位置的残基
    Consensus 该位置出现概率最大的残基
    L,A,G,V… 该位置每种残基出现的概率

    注意:Restored的残基并不一定都是Consensus残基,因为概率计算是基于原序列整体计算的,而序列优化是对所有待优化残基进行掩码后(使用*代替原残基),计算可能的最优残基,出现概率会有差异。

    参考文献

    • Olsen TH, Moal IH, Deane CM. AbLang: an antibody language model for completing antibody sequences. Bioinform Adv. 2022 Jun 17;2(1):vbac046.DOI:10.1101/2022.01.20.477061
    • Addressing the antibody germline bias and its effect on language models for improved antibody design. Tobias H. Olsen, Iain H. Moal, Charlotte M. Deane. bioRxiv 2024.02.02.578678.DOI:10.1101/2024.02.02.578678

    Antibody Sequence Prediction (AbLang2)

    Introduction

    This module optimizes antibody sequences by mutation and reports, for every position in the sequence, the predicted probability of each of the 20 amino acid residues (predicted from the original sequence). The module is built on AbLang2, an antibody-specific language model that upgrades AbLang and is designed to counteract germline bias in antibody sequences, thereby giving stronger support for antibody design and optimization.

    Antibody diversity arises chiefly from V(D)J recombination, hypermutation in the CDRs, and a limited number of mutations outside the CDRs. Nevertheless, large tracts of natural antibody sequences remain identical to the germline genes. This causes conventional language models to “memorize” the germline during pre-training and to overlook mutations that deviate from it yet are critical for binding. The central goal of AbLang2 is to mitigate this germline bias and to enhance prediction accuracy for non-germline residues, thus guiding the design of pivotal mutations in antibody engineering.

    AbLang2 retains the dual-component architecture of its predecessor and is built on the Transformer framework. It is trained and fine-tuned on both unpaired (heavy- or light-chain-only) and paired (heavy + light) antibody sequences from the Observed Antibody Space (OAS) database, improving its capacity to model intact antibody structures.

    Comparison of predicted sequence perplexity (lower values indicate higher sequence quality) confirms that AbLang2 delivers the best performance.
    image.png

    Parameters

    Fasta File

    FASTA file of antibody Fv-region sequences whose residues are to be optimized.
    If both heavy-chain Fv (VH) and light-chain Fv (VL) sequences are provided, separate them with a colon (:); the order does not matter. Example:

    >seq1
    EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS
    >seq2
    QVQLQQSGGELAKPGASVKVSCKASGYTFSSFWMHWVRQAPGQGLEWIGYINPRSGYTEYNEIFRDKATMTTDTSTSTAYMELSSLRSEDTAVYYCASFLGRGAMDYWGQGTTVTVSS:EIVLTQSPGTLSLSPGERATLSCRASQSVSSSFFAWYQQKPGQAPRLLIYGASSRATGIPDRLSGSGSGTDFTLTITRLEPEDFAVYYCQQYDSSAITFGQGTRLEIK
    
    

    When both heavy and light chains are present, the model will consider them jointly during subsequent mutation optimization, mirroring real-world antibody behavior.

    Position

    Specify the residues to be optimized. Use residue indices starting at 1; separate individual positions with commas and ranges with a hyphen.
    Example: 3,10,24-30 optimizes positions 3, 10 and 24–30.
    If the FASTA contains both chains, prefix each index with H (heavy) or L (light).
    Example: H5,H8-10,L3 optimizes heavy-chain residues 5 and 8–10, plus light-chain residue 3.
    Note: The same Positions list is applied to every sequence in the FASTA; any non-existent positions are silently ignored.
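    The position syntax above can be expanded into explicit residue positions as follows. This is a minimal sketch, not the module's own parser: the function names parse_positions and keep_valid are hypothetical, and a range such as H8-10 is assumed to stay on the chain named by its prefix.

```python
import re

# optional H/L chain tag, start index, optional "-end" for a range
_TOKEN_RE = re.compile(r"^([HL]?)(\d+)(?:-(\d+))?$")


def parse_positions(spec):
    """Expand '3,10,24-30' or 'H5,H8-10,L3' into sorted (chain, position) pairs.

    The chain is '' when the token carries no H/L tag.
    """
    selected = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        match = _TOKEN_RE.match(token)
        if match is None:
            raise ValueError(f"malformed position token {token!r}")
        chain = match.group(1)
        start = int(match.group(2))
        end = int(match.group(3) or match.group(2))
        if start < 1 or end < start:
            raise ValueError(f"invalid range in {token!r}")
        selected.update((chain, pos) for pos in range(start, end + 1))
    return sorted(selected)


def keep_valid(positions, chain_lengths):
    """Drop positions beyond the end of their chain, as the note above describes."""
    return [(chain, pos) for chain, pos in positions
            if pos <= chain_lengths.get(chain, 0)]
```

    For a paired entry, chain_lengths can be built from the two halves of the colon-separated sequence, for example {"H": len(vh), "L": len(vl)}.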

    Output Fasta

    Name of the optimized-sequence file (FASTA format).
    Default: restored.fasta.
    Each input sequence produces exactly one optimized sequence.

    Output Prob

    Name of the residue-probability file (CSV format).
    Default: restore_probs.csv.
    For every position listed in Positions, the file contains the 20-amino-acid probabilities predicted from the original sequence and the residue finally chosen after optimization.

    Results

    Optimized sequence file: restored.fasta
    Residue-probability file: restore_probs.csv

    Column Description
    Name Original sequence identifier
    Chain Chain type, H or L
    WT Wild-type residue in the original sequence
    POS Amino-acid position index (1-based)
    Restored Residue after optimization
    Consensus Residue with the highest predicted probability
    A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y Probability of each amino acid at this position

    Note: The Restored residue is not necessarily the Consensus residue.
    Probabilities are computed from the original intact sequence, whereas optimization masks all requested positions simultaneously (replacing them with ‘*’) and then infers the globally optimal combination; hence the posterior probabilities can differ.

    Reference

    • Olsen TH, Moal IH, Deane CM. AbLang: an antibody language model for completing antibody sequences. Bioinform Adv. 2022 Jun 17;2(1):vbac046. DOI: 10.1101/2022.01.20.477061
    • Olsen TH, Moal IH, Deane CM. Addressing the antibody germline bias and its effect on language models for improved antibody design. bioRxiv 2024.02.02.578678. DOI: 10.1101/2024.02.02.578678
  • Name: Structure Clustering
    Description: 基于分子指纹的小分子结构聚类模块,其采用的聚类方法有Butina或任何其他可用的分层聚类方法。 Small molecule clustering based on a variety of 2D fingerprints using hierarchical clustering methodology.
    Tags: undefined
    Author: Butina, D
    Release: 2021-10-28 10:15:43
    Reference: Butina D. Unsupervised database clustering based on Daylight's fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. J. Chem. Inf. Model. 1999, 39, 747-750.

    Structure Clustering

    简介

    Structure Clustering是基于分子指纹的小分子结构聚类模块,其采用的聚类方法有Butina或任何其他可用的分层聚类方法。

    参数说明

    Input File

    小分子的结构文件,支持SDF、SMILES格式。

    Output File

    输出文件名称。

    Clustering Numbers

    在分层聚类过程中生成的聚类的数目。

    Similarity Cutoff

    Butina聚类算法中使用的相似度截断值。

    Clustering Method

    聚类算法,包括如下:

    • Butina
    • Centroid
    • CLink
    • Gower
    • McQuitty
    • SLink
    • UPGMA
    • Ward

    Fingerprints

    用于计算相似度或者距离的分子指纹类型,包括如下:

    • AtomPairs
    • MACCS166Keys
    • Morgan
    • MorganFeatures
    • PathLength
    • TopologicalTorsions

    Fingerprints Type

    分子指纹方式,包括如下:

    • IntVect
    • BitVect
    • auto

    Similarity Metric

    相似度计算指标,包括如下:

    • Tanimoto
    • Cosine
    • Dice

    结果说明

    在原有SDF文件中加入聚类编号,得到新的SDF文件output.sdf。

    参考文献

    Butina D. Unsupervised database clustering based on Daylight’s fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. J. Chem. Inf. Model. 1999, 39, 747-750.

    Structure Clustering

    Introduction

    Structure Clustering is a module for clustering small-molecule structures based on molecular fingerprints. It supports Butina clustering as well as the other available hierarchical clustering methods.

    Parameter Description

    Input File

    The structure file of the small molecules. Supported formats are SDF and SMILES.

    Output File

    Name of the output file.

    Clustering Numbers

    Number of clusters generated during the hierarchical clustering process.

    Similarity Cutoff

    Similarity cutoff value used in the Butina clustering algorithm.

    Clustering Method

    Clustering algorithms available include:

    • Butina
    • Centroid
    • CLink
    • Gower
    • McQuitty
    • SLink
    • UPGMA
    • Ward

    Fingerprints

    Types of molecular fingerprints used for similarity or distance calculation include:

    • AtomPairs
    • MACCS166Keys
    • Morgan
    • MorganFeatures
    • PathLength
    • TopologicalTorsions

    Fingerprints Type

    Types of molecular fingerprint representations include:

    • IntVect
    • BitVect
    • auto

    Similarity Metric

    Similarity metrics for calculation include:

    • Tanimoto
    • Cosine
    • Dice

    Result Description

    Cluster numbers are added to the records of the original SDF file, and the result is written to a new SDF file named output.sdf.
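    To illustrate how the Similarity Cutoff drives Butina clustering, the sketch below implements the basic algorithm in plain Python. It is a simplified illustration rather than the module's implementation: fingerprints are assumed to be sets of on-bit indices, the function names tanimoto and butina_cluster are hypothetical, and ties between candidate centroids are broken by input order.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_cluster(fingerprints, cutoff):
    """Cluster fingerprints with the Butina algorithm.

    Two molecules are neighbours when their similarity is >= cutoff.
    Returns a list of clusters; each cluster lists molecule indices,
    with the centroid first.
    """
    n = len(fingerprints)
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
                neighbours[i].add(j)
                neighbours[j].add(i)

    # Candidate centroids: molecules with the most neighbours come first.
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))

    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        # A cluster is the centroid plus its not-yet-assigned neighbours.
        members = [centroid] + sorted(neighbours[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters
```

    A higher cutoff gives more, tighter clusters and more singletons; a lower cutoff merges molecules into fewer, looser clusters.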

    Reference Literature

    Butina D. Unsupervised database clustering based on Daylight’s fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. J. Chem. Inf. Model. 1999, 39, 747-750.

  • Name: Sequence Clustering
    Description: Sequence Clustering使用DBSCAN算法对多序列比对(MSA)后的结果进行聚类分析,将多序列分为多个cluster类别,并通过可视化模块UMAP进行序列的embedding,并获取二维可视化信息。 Sequence clustering uses the DBSCAN algorithm to perform cluster analysis on the results of multiple sequence alignment (MSA), dividing multiple sequences into multiple cluster categories, and using the visualization module UMAP to embed sequences and obtain two-dimensional visualization information.
    Tags: undefined
    Author: Hannah K. Wayment-Steele
    Release: 2023-01-10 00:00:00
    Reference: Hannah K. Wayment-Steele, Sergey Ovchinnikov, Lucy Colwell, Dorothee Kern. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. bioRxiv 2022.10.17.512570.

    Sequence Clustering

    简介

    Sequence Clustering使用DBSCAN算法对多序列比对(MSA)后的结果进行聚类分析,将多序列分为多个cluster类别,并通过可视化模块UMAP进行序列的embedding,并获取二维可视化信息。
    image.png

    参数说明

    Input File

    需要聚类序列的多序列比对结果文件(fasta格式),可以由Multiple Sequence Alignment模块产生的alignmnet.fasta。

    结果说明

    输出结果文件为res_clustering_assignments.tsv,包含信息如下:

    字段名称 说明
    SequenceName 序列名称
    sequence 序列
    frac_gaps 后续序列与参考序列(第一条序列)氨基酸差异(填充‘-’)的比例
    dbscan_label 聚类后的类别标签(如果值为-1表示未分配类别)
    UMAP 1,UMAP 2 二维可视化坐标信息(UMAP 1,UMAP 2对应X,Y坐标)

    参考文献

    Hannah K. Wayment-Steele, Sergey Ovchinnikov, Lucy Colwell, Dorothee Kern. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. bioRxiv 2022.10.17.512570.

    Sequence Clustering

    Introduction

    Sequence Clustering uses the DBSCAN algorithm to perform cluster analysis on the results of multiple sequence alignment (MSA), dividing multiple sequences into different cluster categories. It utilizes the UMAP visualization module to embed sequences and obtain two-dimensional visualization information.
    image.png

    Parameter Description

    Input File

    The file containing the results of multiple sequence alignment (in FASTA format) that need to be clustered. This file can be generated by the Multiple Sequence Alignment module as alignmnet.fasta.

    Result Description

    The output result file is res_clustering_assignments.tsv, which includes the following information:

    Field Name Description
    SequenceName Name of the sequence
    sequence The sequence itself
    frac_gaps Proportion of gaps (‘-’) in the sequence compared to the reference sequence (the first sequence)
    dbscan_label Cluster label after clustering (if the value is -1, it means the sequence is unassigned to any cluster)
    UMAP 1, UMAP 2 Two-dimensional visualization coordinate information (UMAP 1 corresponds to the X-coordinate and UMAP 2 corresponds to the Y-coordinate)
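    The result table can be summarized per cluster with the standard library. The sketch below is illustrative only: the function name summarize_clusters is hypothetical, and it assumes the file is tab-separated with a header row that uses the field names listed above.

```python
import csv
from collections import Counter


def summarize_clusters(lines):
    """Count sequences per DBSCAN cluster.

    `lines` is any iterable of text lines, such as an open file handle.
    Returns (sizes, unassigned): sizes maps each cluster label to its
    sequence count, and unassigned counts the rows labelled -1.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    # int(float(...)) accepts labels written either as "3" or as "3.0".
    sizes = Counter(int(float(row["dbscan_label"])) for row in reader)
    unassigned = sizes.pop(-1, 0)
    return dict(sizes), unassigned
```

    With the output file in the working directory, the function can be called as summarize_clusters(open("res_clustering_assignments.tsv", newline="")).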

    Reference Literature

    Hannah K. Wayment-Steele, Sergey Ovchinnikov, Lucy Colwell, Dorothee Kern. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. bioRxiv 2022.10.17.512570.

  • Name: Extract Sequence from Structure (PDB2FASTA)
    Description: 从蛋白的PDB文件中将序列提取出来保存为FASTA文件。常规氨基酸序列用单字母表示,其他类型都标注为X。 Extracts the protein sequences in a PDB file to FASTA. Amino acids are represented by their one-letter code while all others are represented by 'X'.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-12-09 00:00:00
    Reference:

    Extract Sequence from Structure (PDB2FASTA)

    简介

    Extract Sequence from Structure (PDB2FASTA)模块是从蛋白的PDB文件中将序列提取出来保存为FASTA文件。常规氨基酸序列用单字母表示,其他类型都标注为X。

    参数说明

    Structure PDB File

    蛋白的结构文件,PDB格式。

    Chain Name

    将指定链的序列转存为fasta格式,默认all代表将所有链的序列输出。

    Missing Residue

    控制是否在输出中包含缺失残基。默认为 true 时跳过 SEQRES 记录中存在但结构文件(ATOM/HETATM)中缺失的残基;设置为 false 时将这些 SEQRES 缺失残基包含在输出结果中。

    Output Sequence

    输出序列文件名称,FASTA格式。

    结果说明

    得到蛋白的序列文件,默认为seq.fasta。

    Extract Sequence from Structure (PDB2FASTA)

    Introduction

    The Extract Sequence from Structure (PDB2FASTA) module extracts the sequences from a protein PDB file and saves them as a FASTA file. Standard amino acids are written with their one-letter codes, and all other residue types are written as X.

    Parameter Description

    Structure PDB File

    The protein’s structure file in PDB format.

    Chain Name

    Specify the chain whose sequence will be saved in FASTA format. The default, all, outputs the sequences of all chains.

    Missing Residue

    Controls whether missing residues are included in the output. When true (the default), residues that are listed in the SEQRES records but absent from the structure (ATOM/HETATM records) are skipped; when false, these missing residues are included in the output.

    Output Sequence

    Name of the output sequence file in FASTA format.

    Result Description

    The output is the protein sequence file, named seq.fasta by default.
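    The extraction can be approximated in a few lines of plain Python. The sketch below is a simplification and not the module's code: the function names pdb_to_sequences and to_fasta are hypothetical, only ATOM records of the first model are read (so HETATM residues and residues present only in SEQRES are not reported), and residues are taken in file order.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def pdb_to_sequences(pdb_text, chain="all"):
    """Return {chain_id: sequence} built from the ATOM records of a PDB file."""
    sequences = {}
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):  # keep only the first model
            break
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        chain_id = line[21]  # column 22 of the fixed-width PDB format
        if chain != "all" and chain_id != chain:
            continue
        # Residue number plus insertion code (columns 23-27) identify a residue.
        residue_key = (chain_id, line[22:27])
        if residue_key in seen:  # further atoms of a residue already counted
            continue
        seen.add(residue_key)
        residue_name = line[17:20].strip().upper()
        # Anything outside the 20 standard amino acids becomes X.
        letter = THREE_TO_ONE.get(residue_name, "X")
        sequences[chain_id] = sequences.get(chain_id, "") + letter
    return sequences


def to_fasta(sequences, name="seq"):
    """Format the per-chain sequences as FASTA text, one record per chain."""
    return "".join(f">{name}_{chain_id}\n{sequence}\n"
                   for chain_id, sequence in sequences.items())
```

    Passing chain="A" restricts the output to chain A, mirroring the Chain Name parameter.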

  • Name: 3-letter AA Conversion
    Description: 把三字母表示的氨基酸转换为单字母表示。"ASP ILE VAL ASN"转换为 "DIVN". Convert 3-letter amino acids to 1-letter amino acids. E.g., "ASP ILE VAL ASN" will be converted to "DIVN".
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-11-18 00:00:00
    Reference:

    3-letter AA Conversion

    简介

    把三字母表示的氨基酸转换为单字母表示。"ASP ILE VAL ASN"转换为 “DIVN”.

    参数说明

    File模式

    Input File

    包含三字符氨基酸序列的文本文件

    Output File

    指定输出序列文件的名称,FASTA格式

    Text模式

    Input String

    三字符代表的氨基酸序列,例如:
    ASP ILE VAL ASN

    Output File

    指定输出序列文件的名称,FASTA格式

    结果说明

    三字母表示的氨基酸转换为单字母,并以序列FASTA格式输出sequence.fasta。

    3-letter AA Conversion

    Introduction

    Converts three-letter amino acid representations to single-letter representations. For example, “ASP ILE VAL ASN” is converted to “DIVN”.
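    The conversion is a table lookup over the 20 standard amino acids. The sketch below shows the idea; the function name three_to_one is hypothetical, and writing unrecognized codes as X is an assumption rather than documented behaviour of the module.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(text, unknown="X"):
    """Convert whitespace-separated three-letter codes to a one-letter string."""
    return "".join(THREE_TO_ONE.get(code.upper(), unknown)
                   for code in text.split())
```

    Under the standard codes, ASN maps to N and GLN maps to Q, so three_to_one("ASP ILE VAL ASN") gives DIVN.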

    Parameter Description

    File Mode

    Input File

    Text file containing sequences of three-character amino acids.

    Output File

    Specify the name of the output sequence file in FASTA format.

    Text Mode

    Input String

    Sequence of three-character amino acids, for example:
    ASP ILE VAL ASN

    Output File

    Specify the name of the output sequence file in FASTA format.

    Result Description

    Converts three-letter amino acid representations to single-letter representations and outputs the sequence in FASTA format as sequence.fasta.

  • Name: Sequence Translation
    Description: DNA序列转换成RNA序列和蛋白序列的工具。 Translating DNA sequences into RNA and protein sequences.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-11-18 17:19:28
    Reference:

    Sequence Translation

    简介

    Sequence Translation是DNA序列转换成RNA序列和蛋白序列的工具。

    参数说明

    DNA Sequence File

    DNA序列文件,FASTA格式

    DNA Sequence String

    DNA序列,例如:

    TTATGACGTTATTCTACTTTGATTGTGCGAGACAATGCTACCTTACCGGTCGGAACTCGATCGGTTGAACTCTATCACGCCTGGTCTTCGAAGTTAGCAC
    

    结果说明

    输出结果包括:

    输出文件名称 说明
    prepared_dna.fasta 转换成DNA的FASTA文件
    protein.fasta 转换成蛋白的FASTA文件
    mrna.fasta 转换成mRNA的FASTA文件

    Sequence Translation

    Introduction

    Sequence Translation is a tool for converting DNA sequences into RNA sequences and protein sequences.

    Parameters

    DNA Sequence File

    DNA sequence file in FASTA format.

    DNA Sequence String

    DNA sequence, for example:

    TTATGACGTTATTCTACTTTGATTGTGCGAGACAATGCTACCTTACCGGTCGGAACTCGATCGGTTGAACTCTATCACGCCTGGTCTTCGAAGTTAGCAC
    

    Result Description

    The output includes:

    Output File Name Description
    prepared_dna.fasta The prepared DNA sequence, in FASTA format
    protein.fasta The translated protein sequence, in FASTA format
    mrna.fasta The transcribed mRNA sequence, in FASTA format
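    The two conversions behind these output files can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the input is treated as the coding strand, translation uses the standard genetic code in reading frame 1 starting at the first base, and the function names transcribe and translate are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(dna, to_stop=True):
    """Translate DNA in frame 1; '*' marks a stop codon, 'X' an unknown codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)
```

    With to_stop=False the stop codons are kept as * in the output instead of ending the translation.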
  • Name: Protein Structure Prediction (ESMFold)
    Description: ESMFold是Meta公司开发的蛋白结构预测模型,使用大型语言模型从主序列直接推断结构,预测的速度比AlphaFold方法快60倍,同时能够保持分辨率和准确性。 ESMFold is a protein structure prediction model developed by Meta company, which uses a large language model to directly infer structure from the primary sequence. It predicts structures 60 times faster than AlphaFold while maintaining resolution and accuracy.
    Tags: undefined
    Author: Meta
    Release: 2022-11-11 00:00:00
    Reference: Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130.

    Protein Structure Prediction (ESMFold)

    简介

    ESMFold使用大型语言模型从主序列直接推断结构,预测的速度比最先进的方法快60倍,同时能够保持分辨率和准确性。AlphaFold2和其他替代方法使用多序列比对(MSA)和类似蛋白质的模板来实现原子分辨率结构预测的最佳性能获突破性成功;而ESMFold通过利用语言模型的内部表征,只用一个序列作为输入就能生成结构预测。ESMFold与AlphaFold2和RoseTTAFold具有相似的准确性,但ESMFold在探索宏基因组蛋白质的结构空间方面速度更快。

    参数说明

    ESMFold Batch Mode模式

    Fasta File

    蛋白序列文件,FASTA格式,支持多条序列。
    预测复合物,多条链通过英文冒号(:)相连,举例:

    >complex
    MGITQTPYKVSISGLYLRARV:QVQLQQSGAELARPGASVKMSCKASGYTFTRYTMHWVKQR
    

    Max tokens per batch

    每个GPU前向传递中的最大令牌数。这将使较短的序列分组进行批量预测。如果在短序列上发生内存不足问题,降低此值可以有所帮助。

    Chunk Size

    较低的值将导致更低的内存使用,但会降低速度。推荐值:128、64、32。

    ESMFold Single Mode模式

    Fasta File

    蛋白序列文件,FASTA格式,多条序列时默认为复合物预测。

    结果说明

    输出结果包括:

    输出文件名称 说明
    seq1.pdb 默认输出第一条序列的预测结构。
    result.tar.gz 针对含有多条序列的fasta文件,压缩包中含所有的序列的预测结构。
    score.csv 预测结构的打分,包含结构可靠性指标pLDDT与pTM,pLDDT数值范围在0-100,数值越大表示结构可靠性越高,pTM数值范围在0-1,数值越大表示结构可靠性越高
    stdout.txt 模块的标准输出信息。

    参考文献

    • Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130. DOI: 10.1126/science.ade2574

    Protein Structure Prediction (ESMFold)

    Introduction

    ESMFold uses a large language model to infer structure directly from the primary sequence, predicting up to 60 times faster than state-of-the-art methods while maintaining resolution and accuracy. AlphaFold2 and other methods achieved breakthrough success in atomic-resolution structure prediction by using multiple sequence alignments (MSAs) and templates of similar proteins; ESMFold instead leverages the internal representations of a language model to generate a structure prediction from a single input sequence. ESMFold achieves accuracy similar to AlphaFold2 and RoseTTAFold, and its speed makes it practical for exploring the structural space of metagenomic proteins.

    Parameters

    ESMFold Batch Mode

    Fasta File

    Protein sequence file in FASTA format, supporting multiple sequences.
    For predicting complexes, multiple chains are connected by a colon (:) as shown below:

    >complex
    MGITQTPYKVSISGLYLRARV:QVQLQQSGAELARPGASVKMSCKASGYTFTRYTMHWVKQR
    

    Max tokens per batch

    Maximum number of tokens in each GPU forward pass. This allows grouping of shorter sequences for batch prediction. Lowering this value can help if memory issues occur with short sequences.
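
    The grouping behaviour described above can be pictured with the following sketch. It is an illustration under stated assumptions, not ESMFold's actual batching code: one token per residue is assumed, sequences are sorted by length, and the function name `batch_by_tokens` is hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (sequence name, amino acid sequence)


def batch_by_tokens(records: List[Record], max_tokens: int) -> List[List[Record]]:
    """Group sequences so that each batch holds at most max_tokens residues.

    A single sequence longer than max_tokens still gets a batch of its own.
    """
    batches: List[List[Record]] = []
    current: List[Record] = []
    current_tokens = 0
    for name, seq in sorted(records, key=lambda record: len(record[1])):
        if current and current_tokens + len(seq) > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append((name, seq))
        current_tokens += len(seq)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    records = [("a", "A" * 10), ("b", "A" * 30), ("c", "A" * 15)]
    for batch in batch_by_tokens(records, max_tokens=30):
        print([name for name, _ in batch])
```

    Lowering `max_tokens` produces more, smaller batches, which is why reducing the parameter can relieve out-of-memory errors on short sequences.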

    Chunk Size

    A lower value leads to lower memory usage but decreases speed. Recommended values: 128, 64, 32.

    ESMFold Single Mode

    Fasta File

    Protein sequence file in FASTA format. When the file contains multiple sequences, they are predicted as a complex by default.

    Results

    The output includes:

    Output File Name Description
    seq1.pdb Default output of the predicted structure for the first sequence.
    result.tar.gz For fasta files containing multiple sequences, the compressed file includes predicted structures for all sequences.
    score.csv Scores for the predicted structures, including the reliability metrics pLDDT and pTM. pLDDT ranges from 0 to 100 and pTM from 0 to 1; for both, higher values indicate a more reliable structure.
    stdout.txt Standard output.

    References

    • Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130. DOI: 10.1126/science.ade2574
  • Name: Retrosynthetic Prediction (AiZynthFinder)
    Description: 小分子的逆反应合成路线预测算法,基于蒙特卡罗树搜索最终得到可被购买的小分子,树搜索策略采用神经网络方法对已知的反应库进行训练得到。 Monte Carlo tree search based retrosynthetic planning that recursively breaks down a molecule to purchasable precursors. The tree search is guided by a policy that suggests possible precursors by utilizing a neural network trained on a library of known reaction templates.
    Tags: undefined
    Author: Samuel Genheden
    Release: 2022-10-27 00:00:00
    Reference: Genheden S, Thakkar A, Chadimová V, et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J Cheminform. 2020 Nov 17;12(1):70.

    Retrosynthetic Prediction (AiZynthFinder)

    简介

    Retrosynthetic Prediction (AiZynthFinder)是阿斯利康开发的针对小分子的逆反应合成路线预测算法。AiZynthFinder算法基于蒙特卡罗树搜索最终得到可被购买的小分子,用于合成输出分子。树搜索策略采用神经网络方法对已知的反应库进行训练得到。

    参数说明

    Smiles String

    目标小分子的结构文件,SMILES格式,如:
    Cc1cccc(c1N(CC(=O)Nc2ccc(cc2)c3ncon3)C(=O)C4CCS(=O)(=O)CC4)C

    结果说明

    输出结果包含逆合成分析结果的层级表示trees.json和逆合成分析的路线图route000.png-route010.png
    trees.json把目标分子、反应拆分路径、前体化合物、反应模板等信息都组织在了一棵“树”里面。以下是对trees.json的说明:

    1. 根节点(目标分子)
    {
      "type": "mol",
      "smiles": "Cc1cccc(C)c1N(CC(=O)Nc1ccc(-c2ncon2)cc1)C(=O)C1CCS(=O)(=O)CC1",
      "is_chemical": true,
      "in_stock": false,
      "children": [...]
    }
    
    • type: "mol" → 说明这是一个分子节点。
    • smiles → 目标化合物的 SMILES 表示。
    • in_stock: false → 表明目标分子不是库存可直接购买的,需要合成。
    • children → 存放对应的反应步骤(reaction)。
    2. 反应节点(reaction)
    {
      "type": "reaction",
      "smiles": "[C:1]...>>...",
      "is_reaction": true,
      "metadata": {...},
      "children": [...]
    }
    
    
    • type: "reaction" → 表明这是一个反应。
    • smiles → 带有反应中心标记的反应 SMILES,>> 左边是反应物,右边是产物。
    • metadata → 包含反应模板、来源库(uspto)、匹配次数、概率、反应类别等信息。
    • children → 反应的前体分子(pre-cursors)。
    3. 前体分子(pre-cursors)
    {
      "type": "mol",
      "smiles": "Nc1ccc(-c2ncon2)cc1",
      "is_chemical": true,
      "in_stock": true
    }
    
    
    • 每个子节点是一个反应前体分子。
    • in_stock:true→说明这个分子在库存中可以买到,不需要进一步分解。
    • 如果in_stock: false,则它继续有children,表示还能再分解为更基础的前体。
    4. 递归嵌套(多步反应)
    • 从目标分子开始,每个reaction→ 拆成前体分子。
    • 对于不在库中的前体分子,还会继续给出下一步反应(嵌套children)。
    • 最终直到所有前体都 in_stock: true为止,这条合成路线就闭合了。
    5. 总结信息(scores / metadata)
    "scores": {
      "state score": 0.994039853898894,
      "number of reactions": 2,
      "number of pre-cursors": 3,
      "number of pre-cursors in stock": 3
    },
    "metadata": {
      "created_at_iteration": 36,
      "is_solved": true
    }
    
    
    • state score → 预测模型对该路线的置信度。
    • number of reactions → 总共涉及几步反应。
    • number of pre-cursors → 需要多少前体分子。
    • number of pre-cursors in stock → 有多少前体能直接购买。
    • is_solved: true → 说明这条路线是完整可行的合成路径。

    参考文献

    • Genheden S, Thakkar A, Chadimová V, et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J Cheminform. 2020 Nov 17;12(1):70. DOI: 10.1186/s13321-020-00472-1

    Retrosynthetic Prediction (AiZynthFinder)

    Introduction

    AiZynthFinder is a tool for retrosynthetic planning. The algorithm is based on a Monte Carlo tree search that recursively breaks down a molecule to purchasable precursors. The tree search is guided by a policy that suggests possible precursors by utilizing a neural network trained on a library of known reaction templates.

    Parameters

    Smiles String

    Structure of the target molecule as a SMILES string. Example:
    Cc1cccc(c1N(CC(=O)Nc2ccc(cc2)c3ncon3)C(=O)C4CCS(=O)(=O)CC4)C

    Result

    The output of the retrosynthesis analysis includes a hierarchical representation trees.json and retrosynthesis route diagrams route000.png–route010.png.

    The trees.json file organizes the target molecule, reaction decomposition paths, precursor compounds, reaction templates, and related information into a “tree” structure. The explanation is as follows:

    1. Root Node (Target Molecule)

    {
      "type": "mol",
      "smiles": "Cc1cccc(C)c1N(CC(=O)Nc1ccc(-c2ncon2)cc1)C(=O)C1CCS(=O)(=O)CC1",
      "is_chemical": true,
      "in_stock": false,
      "children": [...]
    }
    
    • type: "mol" → Indicates this is a molecule node.
    • smiles → The SMILES representation of the target compound.
    • in_stock: false → The target molecule is not available in stock and must be synthesized.
    • children → Stores the corresponding reaction steps (reaction).
    2. Reaction Node
    {
      "type": "reaction",
      "smiles": "[C:1]...>>...",
      "is_reaction": true,
      "metadata": {...},
      "children": [...]
    }
    
    • type: "reaction" → Indicates this is a reaction node.
    • smiles → Reaction SMILES with mapped reaction centers; the left side of >> is reactants, the right side is products.
    • metadata → Contains reaction template, source database (e.g., uspto), occurrence count, probability, classification, and other information.
    • children → The precursor molecules for this reaction.

    3. Precursor Molecule

    {
      "type": "mol",
      "smiles": "Nc1ccc(-c2ncon2)cc1",
      "is_chemical": true,
      "in_stock": true
    }
    
    • Each child node is a precursor molecule.
    • in_stock: true → Indicates that this molecule is available in stock and does not need further decomposition.
    • If in_stock: false, it will continue to have children, representing further decomposition into more basic precursors.

    4. Recursive Nesting (Multi-step Reactions)

    • Starting from the target molecule, each reaction is decomposed into precursor molecules.
    • For precursors not in stock, the next reaction steps are provided recursively (children).
    • The tree continues until all precursors have in_stock: true, completing a feasible synthesis route.
    5. Summary Information (scores / metadata)
    "scores": {
      "state score": 0.994039853898894,
      "number of reactions": 2,
      "number of pre-cursors": 3,
      "number of pre-cursors in stock": 3
    },
    "metadata": {
      "created_at_iteration": 36,
      "is_solved": true
    }
    
    
    • state score → The model's confidence score for the predicted route.
    • number of reactions → Total number of reaction steps.
    • number of pre-cursors → Total number of precursor molecules needed.
    • number of pre-cursors in stock → Number of precursors that can be directly purchased.
    • is_solved: true → Indicates that this route is a complete and feasible synthesis path.
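
    To make the recursive structure concrete, the following sketch walks one route tree and recomputes the counts reported under scores. It relies only on the type, children, and in_stock fields described above. The toy route uses placeholder strings instead of real SMILES, and `summarize_route` is a hypothetical helper, not part of AiZynthFinder.

```python
from typing import Dict, List


def _collect(node: dict, reactions: List[str], leaves: List[dict]) -> None:
    """Depth-first walk: record reaction nodes and leaf molecule nodes."""
    children = node.get("children", [])
    if node["type"] == "reaction":
        reactions.append(node["smiles"])
    elif not children:
        leaves.append(node)  # molecule with no further decomposition
    for child in children:
        _collect(child, reactions, leaves)


def summarize_route(root: dict) -> Dict[str, int]:
    """Recompute reaction and precursor counts for one route tree."""
    reactions: List[str] = []
    leaves: List[dict] = []
    _collect(root, reactions, leaves)
    return {
        "number of reactions": len(reactions),
        "number of pre-cursors": len(leaves),
        "number of pre-cursors in stock": sum(1 for m in leaves if m.get("in_stock")),
    }


# Toy two-step route; all "smiles" values are placeholders, not real chemistry.
toy_route = {
    "type": "mol", "smiles": "TARGET", "in_stock": False,
    "children": [{
        "type": "reaction", "smiles": "...",
        "children": [
            {"type": "mol", "smiles": "A", "in_stock": True},
            {"type": "mol", "smiles": "INTERMEDIATE", "in_stock": False,
             "children": [{
                 "type": "reaction", "smiles": "...",
                 "children": [
                     {"type": "mol", "smiles": "B", "in_stock": True},
                     {"type": "mol", "smiles": "C", "in_stock": True},
                 ],
             }]},
        ],
    }],
}

if __name__ == "__main__":
    print(summarize_route(toy_route))
```

    For the toy route this gives 2 reactions and 3 precursors, all in stock, which mirrors the shape of the scores example above. When reading a real trees.json, check first whether the top level is a single tree or a list of trees.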

    Reference

    • Genheden S, Thakkar A, Chadimová V, et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J Cheminform. 2020 Nov 17;12(1):70. DOI: 10.1186/s13321-020-00472-1
  • Name: Antibody Fv Structure Prediction (IgFold)
    Description: 基于深度学习的快速预测抗体Fv结构的方法。 注意:输入的抗体Fv区抗体序列名称中必须包含重链标识符:H,Heavy,.H;轻链标识符:L,Light,.L。 已知问题:部分预测结构会比输入序列缺失个别氨基酸,请留意! Deep learning method for antibody structure prediction.
    Tags: undefined
    Author: Ruffolo JA
    Release: 2022-10-14 00:00:00
    Reference: Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun. 2023 Apr 25;14(1):2389.

    Antibody Structure Prediction (IgFold)

    简介

    IgFold是一种基于深度学习的快速预测抗体Fv结构的方法。IgFold由一个预先训练的语言模型和直接预测骨架原子坐标的图网络组成,该语言模型训练了558M个天然抗体序列。IgFold在显著更短的时间内(不到一分钟)预测出与其他方法(包括AlphaFold)相似或更好质量的结构。注:该模块只适合预测可变区构象,如果是全长抗体或者包含多个可变区的抗体等情况,需要使用Protein Structure Prediction (AlphaFold2.3.2)或者Protein Structure Prediction (ESMFold)进行结构预测。

    参数说明

    Fv Sequence (fasta)

    输入抗体Fv区重链和或轻链序列,其中抗体序列名称中必须包含重链标识符:H,Heavy,.H;轻链标识符:L,Light,.L。例如:

    >antibody.H
    XXXXXX
    >antibody.L
    XXXXXX
    

    结果说明

    输出文件为预测抗体的结构文件antibody_pred.pdb。
    【已知问题】部分预测结构会比输入序列缺失个别氨基酸,请留意!

    参考文献

    Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun. 2023 Apr 25;14(1):2389.

    Antibody Structure Prediction (IgFold)

    Introduction

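    IgFold is a fast deep learning method for predicting antibody Fv structures. It consists of a pre-trained language model, trained on 558M natural antibody sequences, followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under one minute). Note: this module is only suitable for predicting variable-region conformations. For full-length antibodies or antibodies containing multiple variable regions, use Protein Structure Prediction (AlphaFold2.3.2) or Protein Structure Prediction (ESMFold) instead.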

    Parameter

    Fv Sequence (fasta)

    Antibody Fv heavy and/or light chain sequences in FASTA format. Each heavy chain sequence name must contain one of the identifiers H, Heavy, or .H; each light chain sequence name must contain one of L, Light, or .L. For example:

    >antibody.H
    XXXXXX
    >antibody.L
    XXXXXX
    

    Result

    The output file is antibody_pred.pdb, the predicted antibody structure.
    Known issue: some predicted structures are missing a few residues compared with the input sequence, so check the output against the input.

    Reference

    Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun. 2023 Apr 25;14(1):2389.

  • Name: MHC-I Binding Prediction
    Description: MHC-I型亲和力预测模型。模型训练是利用亲和力(BA)和质谱洗脱配体(MS eluted ligand)的数据,基于NNAlign框架增加了预测特定MHC分子结合肽段的亲和力值和肽段的长度。该方法提高了在肿瘤新抗原,验证的洗脱配体(ELs),T细胞免疫表位的预测准确性。 A model for predicting MHC-I binding affinity. The model is trained using affinity (BA) and mass spectrometry eluted ligand (MS eluted ligand) data, and it incorporates the prediction of the affinity values and peptide lengths of specific MHC molecules using the NNAlign framework. This method improves the accuracy of predicting tumor neoantigens, validated eluted ligands (ELs), and T-cell epitopes.
    Tags: undefined
    Author: Morten Nielsen
    Release: 2022-10-14 00:00:00
    Reference: Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020 Jul 2;48(W1):W449-W454.

    MHC-I Binding Prediction

    简介

    基于神经网络的MHC-I型相互作用预测模型。模型训练是利用亲和力和质谱洗脱配体的数据,预测特定MHC分子结合肽段的亲和力值和肽段的长度,可用于肿瘤新抗原的预测。

    参数说明

    Protein Sequence File

    蛋白的序列文件,FASTA格式。

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    Seq_ID 蛋白序列名称
    Pos 肽段在蛋白质序列中的残基编号(从0开始)
    MHC MHC分子/等位基因名称
    Peptide 潜在配体的氨基酸序列
    Core 直接与MHC接触的最小的9个氨基酸结合核心
    Of 核心在肽段中的起始位置(如果>0,则该方法预测N-末端突出)
    Gp 如有删除,删除的位置
    Gl 如有删除,删除的长度
    Ip 如有插入,插入的位置
    Il 如有插入,插入的长度
    Icore 相互作用核心。这是包括插入和删除的结合核心序列
    Identity 蛋白质标识符,即FASTA条目的名称
    Score 原始预测得分。(EL:质谱洗脱配体,BA:亲和力)
    %Rank 预测结合得分与一组随机天然肽相比的排名。此测量不受某些分子固有偏向于更高或更低的预测亲和力的影响。强结合物被定义为具有%rank<0.5的物质,而弱结合物则具有%rank<2。我们建议基于%Rank而不是得分选择候选配体。(EL:质谱洗脱配体,BA:亲和力)
    Aff(nM) 亲和力大小
    BindLevel 如果%Rank低于强结合物的指定阈值(默认为0.5%),则将识别肽段为强结合物。如果%Rank高于强结合物的阈值但低于弱结合物的指定阈值(默认为2%),则将识别肽段为弱结合物。(SB:强结合物,WB:弱结合物)

    参考文献

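    Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020 Jul 2;48(W1):W449-W454.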

    MHC-I Binding Prediction

    Introduction

    A neural network-based model for predicting MHC-I interactions. The model is trained using affinity and mass spectrometry-eluted ligand data to forecast the affinity values and lengths of peptides binding to specific MHC molecules. This can be employed for predicting tumor neoantigens.

    Parameter

    Protein Sequence File

    Protein sequence file in FASTA format.

    Result

    The output file is result.csv and contains the following information:

    Field Name Description
    Seq_ID Protein sequence name
    Pos Residue number (starting from 0) of the peptide in the protein sequence.
    MHC Specified MHC molecule / Allele name.
    Peptide Amino acid sequence of the potential ligand.
    Core The minimal 9 amino acid binding core directly in contact with the MHC.
    Of The starting position of the Core within the Peptide (if > 0, the method predicts a N-terminal protrusion).
    Gp Position of the deletion, if any.
    Gl Length of the deletion, if any.
    Ip Position of the insertion, if any
    Il Length of the insertion, if any
    Icore Interaction core. This is the sequence of the binding core including any insertions or deletions.
    Identity Protein identifier, i.e. the name of the FASTA entry.
    Score The raw prediction score. (EL: MS eluted ligand, BA: Binding Affinity)
    %Rank Rank of the predicted binding score compared to a set of random natural peptides. This measure is not affected by the inherent bias of certain molecules towards higher or lower mean predicted affinities. Strong binders are defined as having %Rank < 0.5, and weak binders as having %Rank < 2. We advise selecting candidate binders based on %Rank rather than Score. (EL: MS eluted ligand, BA: Binding Affinity)
    Aff(nM) Predicted binding affinity in nM.
    BindLevel The peptide will be identified as a strong binder if the %Rank is below the specified threshold for the strong binders (by default, 0.5%). The peptide will be identified as a weak binder if the %Rank is above the threshold of the strong binders but below the specified threshold for the weak binders (by default, 2%). (SB: Strong Binder, WB: Weak Binder)
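
    The BindLevel rule can be written as a small function. This is a sketch of the thresholding logic only, using the default thresholds stated above (0.5% and 2%). The function name `bind_level` is hypothetical, and returning an empty string for non-binders is an assumption.

```python
def bind_level(rank_percent: float, strong: float = 0.5, weak: float = 2.0) -> str:
    """Classify a peptide by its %Rank value.

    Returns "SB" (strong binder), "WB" (weak binder), or "" for neither.
    """
    if rank_percent < strong:
        return "SB"
    if rank_percent < weak:
        return "WB"
    return ""  # assumed representation of a non-binder


if __name__ == "__main__":
    for rank in (0.1, 1.0, 5.0):
        print(rank, repr(bind_level(rank)))
```

    Because classification uses %Rank rather than the raw Score, the same thresholds can be applied across different MHC alleles.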

    Reference

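    Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020 Jul 2;48(W1):W449-W454.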

  • Name: NPT MDP Generation
    Description: 生成等温等压(NPT)的MDP文件,此文件是Gromacs分子动力学模拟需要用到输入文件,里面包含各种参数。 Generate Gromacs MD input file at constant temperature and pressure (NPT).
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 17:14:19
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    NPT MDP Generation

    简介

    NPT MDP Generation是生成等温等压(NPT)MDP文件的模块。

    参数说明

    Define

    Define用于传递预处理器的定义,可以使用任何定义来控制自定义拓扑文件(.top)中的选项。可选择的定义包括以下选项:

    1. DPOSRES用于实现位置约束。选择该项时必须填写Force Constant of POSRE,否则无效。
    2. none为无定义。

    Integrator

    模拟中积分方式的选择:md算法。
    md是蛙跳法,对符合牛顿公式的运动进行积分。

    Time Step

    时间步长,单位为ps。(默认为0.001)

    Simulation Time (ns)

    模拟时长,单位为ns。

    Group(s) for Center of Mass

    质心进行操作的组,可以是索引文件中的一个,或者多个组。默认为整个系统。

    Motion Mode

    系统或者系统中各个组质心的操作。(默认为None)

    • Linear:移去质心平移速度
    • Angular:去掉质心的平移和质心周围的旋转速度
    • Linear-acceleration-correction:去除质心平移速度。修正质心位置,假设在nstcomm步骤上有线性加速度。这对于期望质心上的加速度在mdp:nstcomm步长上几乎是恒定的情况是有用的。例如,当使用绝对引用拉入组时,就会发生这种情况。
    • None:对质心运动没有限制

    Coordinates Output Steps

    在轨迹文件中写入坐标的频率。(默认为0)

    Velocities Output Steps

    在轨迹文件中写入速度(v)的频率。(默认为0)

    Forces Output Steps

    在轨迹文件中写入力的频率。(默认为0)

    Log Output Steps

    在log文件中写入能量的频率。(默认为50)

    Energies Output Steps

    在记录能量的文件中写入能量的频率。(默认为100)

    Compressed Coordinates Steps

    在压缩轨迹文件中写入坐标的频率。(默认为50)

    Compressed Groups

    写入压缩轨迹文件的组。默认为整个系统。

    PBC

    周期化边界条件设置(默认为xyz)。

    • xyz:在所有方向上使用周期性边界条件。
    • no:不使用周期边界条件,忽略方框。要模拟没有截止,设置所有截止和nstlist为0。为了在没有截断的情况下获得最佳性能,请将nstlist设置为零并将ns-type =simple设置为简单。
    • xy:只在x和y方向上使用周期边界条件。这只适用于ns-type =grid,并且可以与墙壁结合使用。没有墙或只有一个墙,系统尺寸在z方向上是无限的。因此不能采用压力耦合法或埃瓦尔德求和法。当使用两面墙时,这些缺点就不适用了。

    Coulomb Type

    原子静电相互作用的计算方法,默认为PME。

    • Cut-off:使用配对列表半径rlist和库仑截断rcoulomb的普通截断,其中rlist >= rcoulomb。
    • Ewald:经典的Ewald sum静电学。实空间截止Coulomb Cutoff应等于rlist,使用例如rlist=0.9,rcoulomb=0.9。在reciprocal space中使用的波矢量的最高幅度由傅里叶间距控制。direct/reciprocal space 的相对精度由 ewald rtol 控制。
    • PME: 用于具体指静电相互作用或库仑力的Fast smooth Particle-Mesh Ewald(SPME)。Direct space类似于Ewald sum,而reciprocal space使用FFT执行。网格尺寸由傅里叶间距控制,插值顺序由pme-order控制。

    Coulomb Cutoff

    库仑力截止距离,单位nm(默认为1.2)

    VdW Type

    范德华相互作用的计算方法,默认为Cut-off。

    • Cut-off:用对列表半径rlist和VdW截断rvdw的普通截断,其中rlist >= rvdw。
    • PME:用于VdW相互作用的快速平滑粒子网格Ewald (SPME)。网格尺寸采用傅里叶间距控制,插值阶数采用pme-order控制。正/倒易空间的相对精度由ewald-rtol-lj控制,倒易例程使用的具体组合规则由lj-pme-comb-rule设置。

    VdW Cutoff

    LJ或Buckingham截止距离,单位nm(默认为1.2)

    Dispersion Correction

    能量和压力的长程色散校正方法(默认为EnerPres)。

    • no:不做任何修正
    • EnerPres:适用于能量和压力的长程分散校正
    • Ener:仅对能量应用长程色散修正

    Temperature Coupling

    温度耦合的方法(默认为V-rescale)。

    • V-rescale:使用随机项的速度重标度的温度耦合(JCP 126, 014101)。这个恒温器类似于Berendsen耦合,使用tau-t进行相同的缩放,但随机项确保生成适当的规范集合。随机种子用ld-seed设置。即使tau-t =0,这个恒温器也能正常工作。对于NVT模拟,保存的能量被写入能量和日志文件。
    • Berendsen:与Berendsen恒温器的温度耦合到温度为ref-t的浴槽,时间常数为tau-t。几个组可以单独耦合,它们在tc-grps字段中指定,并用空格分隔。
    • no:无温度耦合。

    Coupling Groups

    耦合到单独的温度浴的组别,多个组别用空格间隔。

    Time for Temperature Coupling

    温度耦合时间常数,单位为ps。(默认为0.2)

    Coupling Reference Temperature

    耦合的参考温度,即动力学模拟的温度,单位为K。(默认为300)

    Pressure Coupling

    压力耦合的方法(默认为Berendsen)。

    • Parrinello-Rahman:扩展系综压力耦合,其中盒向量服从运动方程。原子的运动方程和这个是耦合的。不会发生瞬时缩放。对于Nose-Hoover温度耦合,时间常数tau-p是压力在平衡状态下波动的周期。当您希望在数据收集期间应用压力缩放时,这可能是一种更好的方法,但要注意,如果您从不同的压力开始,您可能会得到非常大的振荡。对于NPT系综的精确波动很重要的模拟,或者如果压力耦合时间很短,则可能不合适,因为在GROMACS实现的某些步骤中使用了之前的时间步长压力来代替当前的时间步长压力。
    • Berendsen:指数弛豫压力与时间常数tau-p的耦合。这个盒子每隔几步就缩放一次。有人认为,这并不能产生正确的热力学集合,但这是在运行开始时缩放盒子的最有效方法。
    • no:无压力耦合。这意味着一个固定的盒子大小。

    Pressure Coupling Type

    压力耦合的各向同性类型。每种类型取一个或多个可压缩性(compressibility)和Coupling Reference Pressure。Time for Pressure Coupling仅允许一个值。(默认为isotropic)

    • isotropic:时间常数为Time for Pressure Coupling的各向同性压力耦合。可压缩性(compressibility)和Coupling Reference Pressure各需要一个值.
    • semiisotropic:在x和y方向上各向同性但在z方向上不同的压力耦合。这对于膜模拟是有用的。对于x/y和z方向,分别需要可压缩性(compressibility)和Coupling Reference Pressure的两个值。
    • anisotropic:与之前相同,但xx、yy、zz、xy/yx、xz/zx和yz/zy组件分别需要6个值。当非对角压缩性设置为零时,矩形盒子将保持矩形。请注意,各向异性缩放可能会导致模拟盒子发生极端变形。
    • surface-tension:平行于xy平面的表面的表面张力耦合。对Z方向使用法向压力耦合,而表面张力耦合到盒子的x/y尺度。第一个Coupling Reference Pressure是参考表面张力乘以表面数(单位bar*nm),第二个值是参考z-pressure(单位bar)。这两个可压缩性(compressibility)分别是x/y和z方向上的压缩率。z-compressibility的值应该相当精确,因为它会影响表面张力的收敛,也可以将其设置为零,使盒子具有恒定的高度。

    Time for Pressure Coupling

    压力耦合的时间常数(所有方向一个值),单位为ps。(默认为2)

    Coupling Reference Pressure

    耦合的参考压力,单位为bar。(默认为1)

    Compressibility

    可压缩性(注:这实际上是在bar^-1)对于水在1atm和300k的可压缩性是4.5e-5 bar^-1。所需值的数量由pcoupltype [bar^-1]暗示。

    Constraints

    限制类型。(默认为none)

    • none:除了拓扑文件中明确定义的外,没有限制。
    • hbonds:给含有氢原子的键添加限制。
    • all-bonds:给所有的键添加限制。
    • h-angles:给所有的键添加限制,同时给含有氢原子的角度添加限制。
    • all-angles:给所有的键和角度添加限制。

    Output File

    输出文件名称

    结果说明

    得到一个计算NPT的MDP文件npt.mdp。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    NPT MDP Generation

    Introduction

    The NPT MDP Generation module is used to generate the MDP file for an isothermal-isobaric (NPT) simulation.

    Parameter Description

    Define

    Define passes preprocessor definitions that control options in custom topology files (.top). Available options:

    1. DPOSRES: Enables position restraints. The Force Constant of POSRE must also be set; otherwise this option has no effect.
    2. none: No definitions.

    Integrator

    Integration method used in the simulation: md.
    md is a leap-frog algorithm for integrating Newton’s equations of motion.

    Time Step

    Time step size in ps. (Default is 0.001)

    Simulation Time (ns)

    Duration of the simulation in ns.

    Group(s) for Center of Mass

    Group(s) for center of mass operations, can be one or multiple groups from the index file. Default is the entire system.

    Motion Mode

    Operations for the system or center of mass of individual groups in the system. (Default is None)

    • Linear: Removes center of mass translational velocities.
    • Angular: Removes the center of mass translational velocity and the rotational velocity around the center of mass.
    • Linear-acceleration-correction: Removes the center of mass translational velocity and corrects the center of mass position, assuming linear acceleration over nstcomm steps. Useful when the acceleration of the center of mass is expected to be nearly constant over nstcomm steps, for example when pulling on a group using an absolute reference.
    • None: No restrictions on center of mass motion.

    Coordinates Output Steps

    Frequency of writing coordinates to the trajectory file. (Default is 0)

    Velocities Output Steps

    Frequency of writing velocities to the trajectory file. (Default is 0)

    Forces Output Steps

    Frequency of writing forces to the trajectory file. (Default is 0)

    Log Output Steps

    Frequency of writing energy to the log file. (Default is 50)

    Energies Output Steps

    Frequency of writing energy to the energy file. (Default is 100)

    Compressed Coordinates Steps

    Frequency of writing coordinates to the compressed trajectory file. (Default is 50)

    Compressed Groups

    Group(s) written to the compressed trajectory file. Default is the entire system.

    PBC

    Setting for periodic boundary conditions (Default is xyz).

    • xyz: Periodic boundary conditions in all directions.
    • no: No periodic boundary conditions, ignore the box. For simulating without cutoffs, set all cutoffs and nstlist to 0. For optimal performance without cutoffs on a single MPI rank, set nstlist to 0 and ns-type=simple.
    • xy: Periodic boundary conditions only in the x and y directions. This is only valid for ns-type=grid and can be used with walls. Without walls or with only one wall, the system size is infinite in the z direction, so pressure coupling or Ewald sum methods cannot be used. When using two walls, these limitations do not apply.

    Coulomb Type

    Method for calculating atomic electrostatic interactions, default is PME.

    • Cut-off: Plain cut-off with pair-list radius rlist and Coulomb cut-off rcoulomb, where rlist >= rcoulomb.
    • Ewald: Classical Ewald sum electrostatics. The real-space cut-off (Coulomb Cutoff, rcoulomb) should be equal to rlist, for example rlist=0.9, rcoulomb=0.9. The highest magnitude of wave vectors used in reciprocal space is controlled by the Fourier spacing. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol.
    • PME: Fast smooth Particle-Mesh Ewald (SPME) electrostatics. Direct space is similar to the Ewald sum, while the reciprocal part is performed with FFTs. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order.

    Coulomb Cutoff

    Coulomb force cut-off distance in nm. (Default is 1.2)

    VdW Type

    Method for calculating van der Waals interactions, default is Cut-off.

    • Cut-off: Plain cut-off with pair-list radius rlist and VdW cut-off rvdw, where rlist >= rvdw.
    • PME: Fast smooth Particle-Mesh Ewald (SPME) for VdW interactions. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol-lj, and the specific combination rule used in reciprocal space is set by lj-pme-comb-rule.

    VdW Cutoff

    LJ or Buckingham cut-off distance in nm. (Default is 1.2)

    Dispersion Correction

    Method for long-range dispersion correction for energy and pressure (Default is EnerPres).

    • no: No corrections are applied.
    • EnerPres: Long-range dispersion correction applied for both energy and pressure.
    • Ener: Long-range dispersion correction applied only for energy.

    Temperature Coupling

    Method for temperature coupling (Default is V-rescale).

    • V-rescale: Temperature coupling using velocity rescaling with a stochastic term (JCP 126, 014101). This thermostat is similar to Berendsen coupling, with the same scaling using tau-t, but the stochastic term ensures that a proper canonical ensemble is generated. The random seed is set with ld-seed. This thermostat works correctly even for tau-t = 0. For NVT simulations, the conserved energy quantity is written to the energy and log files.
    • Berendsen: Temperature coupling to a bath at temperature ref-t with an exponential relaxation time tau-t. Several groups can be coupled separately, specified in the tc-grps field and separated by spaces.
    • no: No temperature coupling.

    Coupling Groups

    Groups coupled to separate temperature baths; separate multiple groups with spaces.

    Time for Temperature Coupling

    Time constant for temperature coupling in ps. (Default is 0.2)

    Coupling Reference Temperature

    Reference temperature for coupling in K. (Default is 300)

    Pressure Coupling

    Method for pressure coupling (Default is Berendsen).

    • Parrinello-Rahman: Extended-ensemble pressure coupling in which the box vectors are subject to an equation of motion. The equation of motion for the atoms is coupled to this. No instantaneous scaling takes place. As with Nose-Hoover temperature coupling, the time constant tau-p is the period of pressure fluctuations at equilibrium. This is probably a better method when you want to apply pressure scaling during data collection, but beware that you can get very large oscillations if you start from a different pressure. For simulations where the exact fluctuations of the NPT ensemble are important, or if the pressure coupling time is very short, it may not be appropriate, because the GROMACS implementation uses the pressure from the previous time step in place of the current one in some steps.
    • Berendsen: Exponential relaxation pressure coupling with time constant tau-p. The box is scaled every few steps. It has been argued that this does not yield a correct thermodynamic ensemble, but it is the most efficient way to scale the box at the beginning of a run.
    • no: No pressure coupling. This means a fixed box size.

    Pressure Coupling Type

    Isotropy of the pressure coupling. Each type takes one or more Compressibility and Coupling Reference Pressure values. Time for Pressure Coupling takes only one value. (Default is isotropic)

    • isotropic: Isotropic pressure coupling with time constant Time for Pressure Coupling. Requires one value each for Compressibility and Coupling Reference Pressure.
    • semiisotropic: Pressure coupling that is isotropic in the x and y directions but different in the z direction. Useful for membrane simulations. Requires two values each for Compressibility and Coupling Reference Pressure, for the x/y and z directions respectively.
    • anisotropic: Same as above, but requires six values each, for the xx, yy, zz, xy/yx, xz/zx, and yz/zy components respectively. When the off-diagonal compressibilities are set to zero, a rectangular box stays rectangular. Note that anisotropic scaling can lead to extreme deformation of the simulation box.
    • surface-tension: Surface tension coupling for surfaces parallel to the xy plane. Uses normal pressure coupling in the z direction, while the surface tension is coupled to the x/y dimensions of the box. The first Coupling Reference Pressure value is the reference surface tension multiplied by the number of surfaces (units bar*nm); the second value is the reference z-pressure (units bar). The two Compressibility values are for the x/y and z directions respectively. The z-compressibility should be reasonably accurate because it influences the convergence of the surface tension; it can also be set to zero to keep the box height constant.

    Time for Pressure Coupling

    Time constant for pressure coupling (one value for all directions) in ps. (Default is 2)

    Coupling Reference Pressure

    Reference pressure for coupling in bar. (Default is 1)

    Compressibility

    Compressibility in bar^-1. For water at 1 atm and 300 K, the compressibility is 4.5e-5 bar^-1. The number of values required is determined by the Pressure Coupling Type (pcoupltype).
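
    The relationship between the pressure coupling type and the number of Compressibility and Coupling Reference Pressure values can be checked with a short sketch. The counts follow the descriptions above; the function name `check_pressure_values` is hypothetical, and the keyword is spelled `semiisotropic` as in GROMACS.

```python
from typing import Sequence

# Number of values each pressure coupling type expects.
REQUIRED_VALUES = {
    "isotropic": 1,
    "semiisotropic": 2,   # x/y and z
    "anisotropic": 6,     # xx, yy, zz, xy/yx, xz/zx, yz/zy
    "surface-tension": 2,
}


def check_pressure_values(pcoupltype: str,
                          compressibility: Sequence[float],
                          ref_p: Sequence[float]) -> None:
    """Raise ValueError if the value counts do not match the coupling type."""
    expected = REQUIRED_VALUES[pcoupltype]
    for label, values in (("compressibility", compressibility), ("ref-p", ref_p)):
        if len(values) != expected:
            raise ValueError(
                f"{pcoupltype} needs {expected} {label} value(s), got {len(values)}"
            )


if __name__ == "__main__":
    # Membrane-style setup: one value for x/y and one for z.
    check_pressure_values("semiisotropic", [4.5e-5, 4.5e-5], [1.0, 1.0])
    print("value counts are consistent")
```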

    Constraints

    Type of constraints. (Default is none)

    • none: No constraints other than those explicitly defined in the topology file.
    • hbonds: Adds constraints to bonds involving hydrogen atoms.
    • all-bonds: Adds constraints to all bonds.
    • h-angles: Adds constraints to all bonds and to angles involving hydrogen atoms.
    • all-angles: Adds constraints to all bonds and angles.

    Output File

    Output file name.

    Result Description

    Generates an MDP file named npt.mdp for the NPT calculation.

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: Minimize MDP Generation
    Description: 生成Gromacs分子动力学模拟需要用到体系能量优化(Minimization)的输入MDP文件。 Generate input MDP files that are required for Minimization of Gromacs molecular dynamics simulations.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 16:35:14
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    Minimize MDP Generation

    简介

    Minimize MDP Generation是生成能量优化(Minimization)MDP文件的模块。

    参数说明

    Integrator

    模拟中积分方式的选择:cg和steep算法。
    cg用于能量最小化的共轭梯度算法,在能量下降最陡峭时,比steep更加高效。
    steep用于能量最小化的最陡下降算法。一般在setup的能量最小化中使用。

    Simulation Time (ns)

    最小化的最大时间,-1没有最大值。

    Convergency Value of Minimization

    最大容许力,单位为kJ/(mol·nm)。当最大作用力小于此值,认为最小化过程收敛。(默认为100)

    Initial Step

    起始步长,单位为nm。(默认为0.01)

    Coordinates Output Steps

    在轨迹文件中写入坐标的频率。(默认为50)

    Log Output Steps

    在log文件中写入能量的频率。(默认为50)

    Energies Output Steps

    在记录能量的文件中写入能量的频率。(默认为50)

    PBC

    周期化边界条件设置:
    xyz: 在所有方向上使用周期性边界条件
    no: 不使用周期性边界条件,忽略box。若要模拟无截止,请将所有Cutoff相关选项和nstlist设置为0。若要在单个MPI rank上实现无截止的最佳性能,请将nstlist设置为0,并设置ns-type=simple。
    xy: 仅在x和y方向使用周期性边界条件。这仅适用于ns-type=grid,并可与墙(walls)结合使用。如果没有墙或只有一面墙,系统在z方向上的大小是无限的,因此不能使用压力耦合或Ewald求和方法。当使用两面墙时,这些缺点不存在。

    Coulomb Type

    原子静电相互作用的计算方法,默认为PME。

    1. Cut-off:使用对列表半径rlist和库仑截止rcoulomb的普通截断,其中rlist >= rcoulomb。
    2. Ewald:经典的Ewald求和静电计算。实空间截止Coulomb Cutoff应等于rlist,例如rlist=0.9,rcoulomb=0.9。倒易空间中使用的波矢量的最大幅值由傅里叶间距控制。正/倒易空间的相对精度由ewald-rtol控制。
    3. PME:用于静电相互作用的快速平滑粒子网格Ewald(SPME)方法。正空间部分类似于Ewald求和,而倒易空间部分使用FFT计算。网格尺寸由傅里叶间距控制,插值阶数由pme-order控制。

    Coulomb Cutoff

    指定库仑力阈值,单位为nm。(默认为1.2)

    VdW Type

    范德华相互作用的计算方法,默认为Cut-off。

    1. Cut-off:用对列表半径rlist和VdW截断rvdw的普通截断,其中rlist >= rvdw。
    2. PME:用于VdW相互作用的快速平滑粒子网格Ewald (SPME)。网格尺寸由傅里叶间距控制,插值阶数由pme-order控制。正/倒易空间的相对精度由ewald-rtol-lj控制,倒易空间计算使用的具体组合规则由lj-pme-comb-rule设置。

    VdW Cutoff

    LJ或Buckingham截止距离,单位nm。(默认为1.2)

    Constraints

    控制拓扑中被转换为刚性完整约束的键类型。典型的刚性水模型没有键,因此不受此关键字的影响。
    none:不将键转化为约束.
    h-bonds:将与氢原子的键合转换为约束
    all-bonds:将所有键转换为约束
    h-angles:将所有键转换为约束,并将涉及氢原子的角度转换为键约束
    all-angles:将所有键转换为约束,并将所有角度转换为键约束

    Output File

    输出文件名称

    结果说明

    得到一个计算最小化的MDP文件mini.mdp。

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    Minimize MDP Generation

    Introduction

    The Minimize MDP Generation module is used to generate the MDP file for energy minimization.

    Parameter Description

    Integrator

    Choice of minimization algorithm: cg or steep.
    cg is the conjugate gradient algorithm for energy minimization. It is more efficient when a steepest descent step is taken every once in a while during the run (in GROMACS, the nstcgsteep setting).
    steep is the steepest descent algorithm for energy minimization. It is generally used for the initial minimization during system setup.

    Simulation Time (ns)

    Maximum time for minimization, -1 means no maximum.

    Convergency Value of Minimization

    Maximum allowable force in kJ/(mol·nm). Minimization is considered converged when the maximum force is below this value. (Default is 100)

    Initial Step

    Initial step size in nm. (Default is 0.01)

    Coordinates Output Steps

    Frequency of writing coordinates in the trajectory file. (Default is 50)

    Log Output Steps

    Frequency of writing energy to the log file. (Default is 50)

    Energies Output Steps

    Frequency of writing energy to the energy file. (Default is 50)

    PBC

    Setting for periodic boundary conditions:

    • xyz: Periodic boundary conditions in all directions.
    • no: No periodic boundary conditions; the box is ignored. To simulate without cut-offs, set all cut-off options and nstlist to 0. For the best cut-off-free performance on a single MPI rank, set nstlist to 0 and ns-type=simple.
    • xy: Periodic boundary conditions only in the x and y directions. This is only valid for ns-type=grid and can be used with walls. If there are no walls or only one wall, the system is infinite in the z direction, so pressure coupling or Ewald sum methods cannot be used. When using two walls, these limitations do not exist.

    Coulomb Type

    Method for calculating atomic electrostatic interactions, default is PME.

    1. Cut-off: Plain cut-off with pair-list radius rlist and Coulomb cut-off rcoulomb, where rlist >= rcoulomb.
    2. Ewald: Classical Ewald sum electrostatics. The real-space cut-off (Coulomb Cutoff) should be equal to rlist, for example rlist=0.9, rcoulomb=0.9. The highest magnitude of the wave vectors used in reciprocal space is controlled by the Fourier spacing. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol.
    3. PME: Fast smooth Particle-Mesh Ewald (SPME) electrostatics. The direct-space part is similar to the Ewald sum, while the reciprocal-space part is computed with FFTs. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order.

    Coulomb Cutoff

    Coulomb cut-off distance in nm. (Default is 1.2)

    VdW Type

    Method for calculating van der Waals interactions, default is Cut-off.

    1. Cut-off: Plain cut-off with pair-list radius rlist and VdW cut-off rvdw, where rlist >= rvdw.
    2. PME: Fast smooth Particle-Mesh Ewald (SPME) for VdW interactions. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol-lj, and the specific combination rule used in reciprocal space is set by lj-pme-comb-rule.

    VdW Cutoff

    LJ or Buckingham cut-off distance in nm. (Default is 1.2)

    Constraints

    Controls which types of bonds in the topology are converted to rigid constraints. Typical rigid water models have no bonds, so they are not affected by this keyword.

    • none: No bonds are converted to constraints.
    • h-bonds: Bonds involving hydrogen atoms are converted to constraints.
    • all-bonds: All bonds are converted to constraints.
    • h-angles: All bonds are converted to constraints, and angles involving hydrogen atoms are converted to bond constraints.
    • all-angles: All bonds are converted to constraints, and all angles are converted to bond constraints.

    Output File

    Output file name.

    Result Description

    Generates an MDP file named mini.mdp for the energy minimization calculation.
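
To make the relationship between the parameters above and the resulting file more concrete, here is a minimal Python sketch that renders a mini.mdp from the documented defaults. The mapping from this module's field names to the GROMACS keywords (emtol, emstep, nstxout, and so on) is an assumption, as are the default integrator (steep) and step limit (-1), which the description does not state; render_mdp and write_mdp are hypothetical helpers, not the module's actual implementation.

```python
from pathlib import Path

# Documented defaults, keyed by the GROMACS .mdp keyword they are assumed to map to.
MINIM_DEFAULTS = {
    "integrator": "steep",   # assumption: default algorithm is not stated
    "nsteps": -1,            # assumption: -1 means no maximum
    "emtol": 100,            # convergence force, kJ/(mol*nm)
    "emstep": 0.01,          # initial step size, nm
    "nstxout": 50,           # coordinates output steps
    "nstlog": 50,            # log output steps
    "nstenergy": 50,         # energies output steps
    "pbc": "xyz",
    "coulombtype": "PME",
    "rcoulomb": 1.2,         # nm
    "vdwtype": "Cut-off",
    "rvdw": 1.2,             # nm
    "constraints": "none",
}


def render_mdp(params):
    """Render a dict of MDP keywords as 'key = value' lines."""
    lines = [f"{key:<14}= {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


def write_mdp(path, overrides=None):
    """Write an MDP file using the defaults plus any user overrides."""
    params = {**MINIM_DEFAULTS, **(overrides or {})}
    Path(path).write_text(render_mdp(params))
    return params
```

Calling write_mdp("mini.mdp", {"integrator": "cg"}) would produce the same file with only the algorithm changed.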

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: MD PDB Prepare
    Description: 在分子动力学模拟前处理PDB结构,结合PDBFixer工具对输入PDB文件中的蛋白结构进行修复,再分离出PDB文件中的蛋白结构、小分子结构以及核酸结构。 A structure preparation module to run before molecular dynamics. Missing residues in the PDB file are added using PDBFixer, and the protein, nucleic acid, and ligands are extracted and output individually.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: P. Eastman, M. S. Friedrichs, J. D. Chodera, R. J. Radmer, C. M. Bruns, J. P. Ku, K. A. Beauchamp, T. J. Lane, L.-P. Wang, D. Shukla, T. Tye, M. Houston, T. Stich, C. Klein, M. R. Shirts, and V. S. Pande. 2013. OpenMM 4: A Reusable, Extensible, Hardware Independent Library for High Performance Molecular Simulation. Journal of Chemical Theory and Computation. ACS Publications. 9(1): 461-469.

    MD PDB Prepare

    简介

    MD PDB Prepare是一个在分子动力学模拟前处理PDB结构的模块,结合PDBFixer工具对输入PDB文件中的蛋白结构进行修复,再分离出PDB文件中的蛋白结构、小分子结构以及核酸结构。

    参数说明

    PDB File

    结构文件,PDB格式。
    需要注意体系中若存在配体,其名称不能为*号且必须以HETATM开头。如下所示为正确的小分子结构文件:

    HETATM 3767  C1  GOL A 302      -4.671 -11.067  -0.429  1.00 43.56           C  
    HETATM 3768  O1  GOL A 302      -5.324  -9.793  -0.300  1.00 41.43           O  
    

    若体系中有特殊金属原子,只能选AMBER力场。离子需要按照特定书写格式,以下为一些常见原子书写格式:

      # Mg2+离子
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+离子
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+离子
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+离子
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+离子
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+离子
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
      # Cu2+离子
      HETATM 1431  CU  CU  A 301     -23.030  15.955  -4.315  1.00 47.40          CU
    

    其中atom type,residue需要是大写,atom name只需是标准金属离子(可以通过文本编辑器查看书写格式是否相同)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    protein.pdb 分离得到体系中蛋白文件
    ligand.pdb/ligand_pdb.tar.gz 分离得到体系中小分子文件或者压缩文件
    nucleic_acid.pdb 分离得到体系中核酸文件
    membrane.pdb/lipid_membrane.pdb 分离得到体系中膜结构

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    MD PDB Prepare

    Introduction

    MD PDB Prepare is a module for pre-processing PDB structures before molecular dynamics simulations. It uses the PDBFixer tool to repair protein structures in the input PDB file and separates the protein structure, small molecule structure, and nucleic acid structure from the PDB file.

    Parameter Description

    PDB File

    Structure file in PDB format.
    If the system contains a ligand, the ligand name must not be an asterisk (*), and its records must start with HETATM. The following is an example of correctly formatted small-molecule records:

    HETATM 3767  C1  GOL A 302      -4.671 -11.067  -0.429  1.00 43.56           C  
    HETATM 3768  O1  GOL A 302      -5.324  -9.793  -0.300  1.00 41.43           O  
    

    If the system contains special metal atoms, only the AMBER force field can be selected. Ions must be written in a specific format. Common ion records are shown below:

      # Mg2+ ion
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+ ion
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+ ion
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+ ion
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+ ion
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+ ion
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
      # Cu2+ ion
      HETATM 1431  CU  CU  A 301     -23.030  15.955  -4.315  1.00 47.40          CU
    

    The atom type and residue name must be in uppercase, and the atom name must be the standard name of the metal ion. You can open the file in a text editor to check that the format matches.
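
The formatting rules above can be checked before submission with a few lines of Python. The sketch below is only an illustration: check_hetatm_record is a hypothetical helper, it splits each record on whitespace rather than using the fixed PDB columns (so it assumes no fields run together), and it takes the last field on the line as the atom type. It is not the validation the module itself performs.

```python
def check_hetatm_record(line):
    """Return a list of problems found in one ligand or ion record.

    Assumes whitespace-separated fields: record name, serial, atom name,
    residue name, chain, ... with the atom type as the last field.
    """
    fields = line.split()
    if not fields or fields[0] != "HETATM":
        return ["record does not start with HETATM"]
    if len(fields) < 5:
        return ["record has too few fields"]

    problems = []
    res_name = fields[3]
    atom_type = fields[-1]
    if "*" in res_name:
        problems.append("residue name contains '*'")
    if res_name != res_name.upper():
        problems.append("residue name is not uppercase")
    if atom_type != atom_type.upper():
        problems.append("atom type is not uppercase")
    return problems
```

An empty list means the record passed these checks; anything else lists what to fix in a text editor.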

    Result Description

    The output results include:

    Output File Name                   Description
    protein.pdb                        Protein file separated from the system
    ligand.pdb/ligand_pdb.tar.gz       Small-molecule file, or compressed archive, separated from the system
    nucleic_acid.pdb                   Nucleic acid file separated from the system
    membrane.pdb/lipid_membrane.pdb    Membrane structure separated from the system

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: MD Trajectory
    Description: 可根据起始帧数、结束帧数以及间隔帧数对平衡模拟进行轨迹提取,从而将其转换为GRO或者PDB轨迹文件。 MD Trajectory converts Gromacs trajectory file (xtc) into GRO or PDB file for visualization.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD Trajectory

    简介

    可根据起始帧数、结束帧数以及间隔帧数对平衡模拟进行轨迹提取,从而将其转换为GRO或者PDB轨迹文件。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在GMX MD Run模块或者AlphaAutoMD模块中获取。

    Type

    文件输出类型:GRO或者PDB。

    Water

    输出文件是否保留水盒子。

    Start Time (ps)

    起始位置(单位ps)。

    End Time (ps)

    结束位置(单位ps)。

    Skip Time (ps)

    间隔时间,单位ps。

    Index File

    索引文件,ndx格式。对于膜体系的轨迹提取是必填项。

    结果说明

    输出结果包括:

    输出文件名称 说明
    md_finally.pdb 最后一帧结构文件
    md_center.pdb PDB格式轨迹文件
    md_center.gro GRO格式轨迹文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    MD Trajectory

    Introduction

    The MD Trajectory module extracts part of the trajectory from an equilibrium simulation, selected by a start point, an end point, and an interval (given in ps in the parameters below), and converts it into a GRO or PDB trajectory file.

    Parameter Description

    Path File

    Path file produced by the MD simulation. It is available from the GMX MD Run module or the AlphaAutoMD module.

    Type

    File output type: GRO or PDB.

    Water

    Whether to retain the water box in the output files.

    Start Time (ps)

    Starting time (in ps).

    End Time (ps)

    Ending time (in ps).

    Skip Time (ps)

    Time interval, in ps.

    Index File

    Index file in ndx format. This is a required parameter for extracting trajectories in membrane systems.

    Result Description

    The output results include:

    Output File Name    Description
    md_finally.pdb      Structure file of the final frame
    md_center.pdb       PDB format trajectory file
    md_center.gro       GRO format trajectory file
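
For readers who want to reproduce a similar extraction by hand, the sketch below assembles a gmx trjconv command line from the same kind of inputs. This is an assumption about how such an extraction could be done, not a description of what the module runs internally: trjconv_command is a hypothetical helper, and the input file names are placeholders. The options used (-f, -s, -o, -b, -e, -dt, -n) are standard gmx trjconv options, with times in ps.

```python
def trjconv_command(xtc, tpr, out_type="pdb", start_ps=None, end_ps=None,
                    skip_ps=None, index=None):
    """Build a 'gmx trjconv' argument list (hypothetical helper).

    start_ps, end_ps, skip_ps correspond to Start Time, End Time and
    Skip Time; index is an optional .ndx file.
    """
    if out_type not in ("gro", "pdb"):
        raise ValueError("out_type must be 'gro' or 'pdb'")
    cmd = ["gmx", "trjconv", "-f", xtc, "-s", tpr, "-o", f"md_center.{out_type}"]
    if start_ps is not None:
        cmd += ["-b", str(start_ps)]   # first time to read
    if end_ps is not None:
        cmd += ["-e", str(end_ps)]     # last time to read
    if skip_ps is not None:
        cmd += ["-dt", str(skip_ps)]   # keep frames at this time interval
    if index:
        cmd += ["-n", index]           # index file, e.g. for membrane systems
    return cmd
```

The returned list can be passed to subprocess.run; note that gmx trjconv normally asks interactively which group to write, so a group selection has to be supplied on standard input.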

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: Protein Protonation
    Description: 预测每个蛋白残基的pKa并根据指定的pH值判断每个残基的质子化状态。 Predict the pKa value of each protein residue using PROPKA3 and determine the protonation state based on the specified pH value.
    Tags: undefined
    Author: Jan H. Jensen
    Release: 2022-09-29 00:00:00
    Reference: Olsson, Mats HM, Chresten R. Sondergaard, Michal Rostkowski, and Jan H. Jensen. "PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions." Journal of Chemical Theory and Computation 7, no. 2 (2011): 525-537. doi:10.1021/ct100578z

    Protein Protonation

    简介

    Protein Protonation是蛋白质子化模块,主要用于预测每个蛋白残基的pKa,并根据指定的pH值判断每个残基的质子化状态。

    参数说明

    PDB File

    蛋白的结构文件,PDB格式,该文件可以MD PDB Prepare模块提取得到。

    pH

    pH值,默认为7。

    N terminal

    N端残基质子化状态,只有charge和neutral两个选项,默认charge。

    C Terminal

    C端残基质子化状态,只有charge和neutral两个选项,默认charge。

    Custom Residues

    自定义残基质子化状态。

    Output PDB File

    预测的含质子化状态的结构文件。

    结果说明

    输出结果包括:

    输出文件名称 说明
    protein_protonation.pdb 质子化状态的结构文件
    predict_pKa.txt 含pKa值输出文件

    参考文献

    Olsson, Mats HM, Chresten R. Sondergaard, Michal Rostkowski, and Jan H. Jensen. “PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions.” Journal of Chemical Theory and Computation 7, no. 2 (2011): 525-537.

    Protein Protonation

    Introduction

    The Protein Protonation module is primarily used to predict the pKa of each protein residue and determine the protonation state of each residue based on the specified pH value.
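
The link between a predicted pKa and a protonation state can be illustrated with the Henderson-Hasselbalch relation: a titratable group is predominantly protonated when the pH is below its pKa. The Python sketch below shows only this general rule; the function names are hypothetical, the pKa values in the usage note are illustrative, and the exact decision logic used by the module is not described in this document.

```python
def protonated_fraction(pka, ph=7.0):
    """Fraction of a titratable group that is protonated at the given pH.

    Henderson-Hasselbalch: fraction = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def is_protonated(pka, ph=7.0):
    """True when the protonated form dominates, i.e. pH < pKa."""
    return ph < pka
```

For example, with an illustrative pKa of 10.5 a lysine side chain is protonated at pH 7, while an aspartate with a pKa of 3.8 is deprotonated.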

    Parameter Description

    PDB File

    The structure file of the protein in PDB format, which can be obtained from the MD PDB Prepare module.

    pH

    pH value, default is 7.

    N terminal

    Protonation state of the N-terminal residue, with options of charge and neutral, default is charge.

    C Terminal

    Protonation state of the C-terminal residue, with options of charge and neutral, default is charge.

    Custom Residues

    Customize the protonation state of residues.

    Output PDB File

    Structure file with predicted protonation states.

    Result Description

    The output results include:

    Output File Name           Description
    protein_protonation.pdb    Structure file with protonation states
    predict_pKa.txt            Output file containing pKa values

    Reference

    Olsson, Mats HM, Chresten R. Sondergaard, Michal Rostkowski, and Jan H. Jensen. “PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions.” Journal of Chemical Theory and Computation 7, no. 2 (2011): 525-537.

  • Name: GMX Receptor Parameterization
    Description: 根据Gromacs生成受体(包括蛋白或者核酸)的GRO,ITP以及TOP文件。 Generate gro, itp, and top files for receptor (protein or nucleic acid) for molecular dynamics using Gromacs.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 12:49:42
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    GMX Receptor Parameterization

    简介

    GMX Receptor Parameterization模块根据Gromacs生成受体(包括蛋白或者核酸)的GRO,ITP以及TOP文件。

    参数说明

    Protein PDB

    蛋白结构文件。提交的蛋白质文件最好经过Protein Protonation模块的处理。
    若体系中有特殊金属原子,只能选AMBER力场。离子需要按照特定书写格式,以下为一些常见原子书写格式:

      # Mg2+离子
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+离子
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+离子
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+离子
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+离子
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+离子
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
      # Cu2+离子
      HETATM 1431  CU  CU  A 301     -23.030  15.955  -4.315  1.00 47.40          CU
    

    其中atom type,residue需要是大写,atom name只需是标准金属离子(可以通过文本编辑器查看书写格式是否相同)。

    Nucleic Acid PDB

    核酸结构文件。

    Force Field

    力场,默认amber14sb_parmbsc1。以下是各个力场适用的情况:
    amber03,amber99sb,amber14sb_parmbsc1适合蛋白和核酸的凝聚相模拟。
    amber14sb_parmbsc1,charmm36-jul2020适用于脂(膜)。
    注意:根据提交的pdb结构选取力场。

    Water Model

    水模型,默认spc。
    spc:最好用于GROMOS力场。
    spce:对纯水体系比SPC、TIP3P都好。
    tip3p:最好用于amber。
    tip4p:最好用于opls。

    结果说明

    输出结果包括:

    输出文件名称 说明
    receptor.gro 受体的分子坐标文件
    receptor_itp.tar.gz 受体平衡模拟时固定原子位置所施加的力
    receptor.top 受体的拓扑文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    GMX Receptor Parameterization

    Introduction

    The GMX Receptor Parameterization module generates GRO, ITP, and TOP files for receptors (including proteins or nucleic acids) based on Gromacs.

    Parameter Description

    Protein PDB

    Protein structure file. The submitted protein file should preferably have been processed by the Protein Protonation module.
    If the system contains special metal atoms, only the AMBER force field can be selected. Ions must be written in a specific format. Common ion records are shown below:

      # Mg2+ ion
      HETATM 1431 MG    MG A 301     -23.030  15.955  -4.315  1.00 47.40          MG
      # Mn2+ ion
      HETATM 1431 MN    MN A 301     -23.030  15.955  -4.315  1.00 47.40          MN
      # Zn2+ ion
      HETATM 1431 ZN    ZN A 301     -23.030  15.955  -4.315  1.00 47.40          ZN
      # Fe2+ ion
      HETATM 1431  FE2 FE2 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Fe3+ ion
      HETATM 1431  FE3 FE3 A 301     -23.030  15.955  -4.315  1.00 47.40          FE
      # Ca2+ ion
      HETATM 1431  CA  CA  A 301     -23.030  15.955  -4.315  1.00 47.40          C0
      # Cu2+ ion
      HETATM 1431  CU  CU  A 301     -23.030  15.955  -4.315  1.00 47.40          CU
    

    The atom type and residue name must be in uppercase, and the atom name must be the standard name of the metal ion. You can open the file in a text editor to check that the format matches.

    Nucleic Acid PDB

    Nucleic acid structure file.

    Force Field

    The default force field is amber14sb_parmbsc1. The applicability of each force field is listed below:
    amber03, amber99sb, amber14sb_parmbsc1: suitable for condensed-phase simulations of proteins and nucleic acids.
    amber14sb_parmbsc1, charmm36-jul2020: suitable for lipid (membrane) systems.
    Note: Select the force field according to the submitted PDB structure.

    Water Model

    Water model; the default is spc.
    spc: Best used with the GROMOS force field.
    spce: Better than SPC and TIP3P for pure water systems.
    tip3p: Best used with AMBER force fields.
    tip4p: Best used with OPLS force fields.

    Result Description

    The output results include:

    Output File Name       Description
    receptor.gro           Molecular coordinate file of the receptor
    receptor_itp.tar.gz    ITP files defining the forces that restrain atomic positions during receptor equilibration
    receptor.top           Topology file of the receptor
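
As a point of comparison, GRO and TOP files of this kind are typically produced with gmx pdb2gmx. The Python sketch below builds such a command from the Force Field and Water Model choices listed above. It is a hypothetical illustration (pdb2gmx_command is not part of this module, and the module's actual invocation is not documented here); the options -f, -o, -p, -ff, and -water are standard gmx pdb2gmx options, and the chosen force field must be installed where GROMACS can find it.

```python
# Choices listed in the Force Field and Water Model parameter descriptions.
FORCE_FIELDS = ("amber03", "amber99sb", "amber14sb_parmbsc1", "charmm36-jul2020")
WATER_MODELS = ("spc", "spce", "tip3p", "tip4p")


def pdb2gmx_command(pdb, force_field="amber14sb_parmbsc1", water="spc"):
    """Build a 'gmx pdb2gmx' argument list (hypothetical helper)."""
    if force_field not in FORCE_FIELDS:
        raise ValueError(f"unsupported force field: {force_field!r}")
    if water not in WATER_MODELS:
        raise ValueError(f"unsupported water model: {water!r}")
    return [
        "gmx", "pdb2gmx",
        "-f", pdb,              # input structure
        "-o", "receptor.gro",   # output coordinates
        "-p", "receptor.top",   # output topology
        "-ff", force_field,
        "-water", water,
    ]
```

Following the guidance above, an AMBER force field would usually be paired with tip3p, for example pdb2gmx_command("protein_protonation.pdb", water="tip3p").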

    Reference

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: GMX Ligand Parameterization
    Description: 根据小分子pdb文件生成分子动力学模拟(GROMACS)所需的MOL2,GRO以及ITP文件。 Generate mol2, gro, and itp files for ligand in molecular dynamics using Gromacs.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 10:40:45
    Reference: 1.O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33. doi: 10.1186/1758-2946-3-33. PMID: 21982300; PMCID: PMC3198950. 2.Abraham, M. J.; Murtola, T.; Schulz, R.; Páll, S.; Smith, J. C.; Hess, B.; Lindahl, E., GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 2015, 1-2, 19-25. 3.Case, D. A.; Darden, T. A.; Cheatham, I., T.E.; et al., AMBER 16, University of California, San Francisco, 2016. 4.Sousa da Silva, A.W., Vranken, W.F. ACPYPE - AnteChamber PYthon Parser interfacE. BMC Res Notes 5, 367 (2012). https://doi.org/10.1186/1756-0500-5-367. 5.Wang J, Wang W, Kollman PA, Case DA. Automatic atom type and bond type perception in molecular mechanical calculations. J Mol Graph Model. 2006 Oct;25(2):247-60. doi: 10.1016/j.jmgm.2005.12.005. Epub 2006 Feb 3. PMID: 16458552. 6.Wang J, Wolf RM, Caldwell JW, Kollman PA, Case DA. Development and testing of a general amber force field. J Comput Chem. 2004 Jul 15;25(9):1157-74. doi: 10.1002/jcc.20035. Erratum in: J Comput Chem. 2005 Jan 15;26(1):114. PMID: 15116359. 7.Lu T, Chen F. Multiwfn: a multifunctional wavefunction analyzer. J Comput Chem. 2012 Feb 15;33(5):580-92. doi: 10.1002/jcc.22885. Epub 2011 Dec 8. PMID: 22162017. 8.Neese F, Wennmohs F, Becker U, Riplinger C. The ORCA quantum chemistry program package. J Chem Phys. 2020 Jun 14;152(22):224108. doi: 10.1063/5.0004608. PMID: 32534543.

    GMX Ligand Parameterization

    简介

    基于obabel,Antechamber(Ambertool),ACPYPE以及ORCA对小分子进行处理。将小分子的PDB文件根据所需电荷,电荷类型和自旋多重度进行处理,从而生成Gromacs分子动力学模拟所需的GRO和ITP文件。

    参数说明

    Small Molecule PDB File

    支持pdb和tar.gz的文件格式。当单个配体时提交pdb文件,多个配体时提交含有pdb的tar.gz文件。该文件最好经过MD PDB Prepare模块处理。
    配体分子不能用*号,最好是重新命名成英文名称。

    Charge Type

    选取计算的电荷类型,默认为bcc电荷。

    pH

    如设置则配体在该pH环境下加氢;如不设置,按全氢加氢。注意:设置pH后,如果配体电荷不为0,自旋多重度不为1,则需要在Charge Multiplicity设置。

    Charge Multiplicity

    指明要计算的配体文件的电荷和自旋多重度,默认为电荷为0,自旋多重度为1。格式要求:配体文件名称(不包含后缀) 电荷值 自旋多重度,例如提交文件为ligand.pdb、电荷为0、自旋多重度为1,则该栏输入为“ligand 0 1”。

    结果说明

    输出结果包括:

    输出文件名称 说明
    ligand.gro 配体的分子坐标文件
    ligand_itp.tar.gz 配体平衡模拟时固定原子位置所施加的力
    ligand.mol2/ligand_mol2.tar.gz 分子结构的mol2文件,多个配体时为tar.gz文件

    参考文献

    • GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI:10.1016/j.softx.2015.06.001

    GMX Ligand Parameterization

    Introduction

    Small molecules are processed using Open Babel (obabel), Antechamber (AmberTools), ACPYPE, and ORCA. The PDB file of the small molecule is processed according to the specified charge, charge type, and spin multiplicity to generate the GRO and ITP files required for GROMACS molecular dynamics simulations.

    Parameter Description

    Small Molecule PDB File

    Supports file formats of pdb and tar.gz. Submit a pdb file when a single ligand is present, and submit a tar.gz file containing pdb when multiple ligands are present. It is recommended that the file has been processed through the MD PDB Prepare module.
    Ligand molecules should not contain asterisks (*), and it is preferable to rename them with English names.

    Charge Type

    Select the charge model used for the calculation; the default is bcc (AM1-BCC) charges.

    pH

    If set, hydrogens are added to the ligand as appropriate for the specified pH; if not set, all hydrogens are added. Note: when pH is set and the resulting ligand charge is not 0 or the spin multiplicity is not 1, the values must be specified in Charge Multiplicity.

    Charge Multiplicity

    Specifies the charge and spin multiplicity of the ligand file to be calculated, with the default charge being 0 and spin multiplicity being 1. Format requirement: ligand file name (excluding the extension) charge value spin multiplicity. For example, if the submitted file is ligand.pdb with a charge of 0 and a spin multiplicity of 1, the input in this field should be “ligand 0 1”.
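
The expected input format can be illustrated with a small parser. The sketch below is an assumption about how an entry such as "ligand 0 1" might be interpreted (name, then integer charge, then integer spin multiplicity, separated by spaces); parse_charge_multiplicity is a hypothetical helper, not the module's own code.

```python
def parse_charge_multiplicity(entry):
    """Parse 'name charge multiplicity', e.g. 'ligand 0 1'.

    Returns (name, charge, multiplicity); raises ValueError on bad input.
    """
    parts = entry.split()
    if len(parts) != 3:
        raise ValueError("expected: <ligand name> <charge> <spin multiplicity>")
    name = parts[0]
    charge = int(parts[1])          # may be negative, e.g. -1
    multiplicity = int(parts[2])    # 1 = singlet, 2 = doublet, ...
    if multiplicity < 1:
        raise ValueError("spin multiplicity must be at least 1")
    return name, charge, multiplicity
```

For a file named ligand.pdb with charge 0 and multiplicity 1, the entry "ligand 0 1" parses to ("ligand", 0, 1).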

    Result Description

    The output results include:

    Output File Name                  Description
    ligand.gro                        Molecular coordinate file of the ligand
    ligand_itp.tar.gz                 ITP files defining the forces that restrain atomic positions during ligand equilibration
    ligand.mol2/ligand_mol2.tar.gz    Mol2 file of the molecular structure; a tar.gz archive when there are multiple ligands

    Reference

    • GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI:10.1016/j.softx.2015.06.001
  • Name: MD MDP Generation
    Description: 生成平衡模拟(MD)的MDP文件,此文件是Gromacs分子动力学模拟需要用到输入文件,里面包含各种参数。 Generate final Gromacs MD production MDP file.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 14:08:30
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD MDP Generation

    简介

    MD MDP Generation是生成平衡模拟(MD)MDP文件的模块。

    参数说明

    Define

    Define用于传递预处理器的定义,可以使用任何定义来控制自定义拓扑文件(.top)中的选项。可选择的定义包括以下选项:

    • DPOSRES用于实现位置约束。选择该项时必须填写Force Constant of POSRE,否则无效。
    • DFLEXIBLE将使用柔性水而不是刚性水进入拓扑结构,这对正常模式分析很有用。

    Integrator

    模拟中积分方式的选择:md算法。
    md是蛙跳法,对符合牛顿公式的运动进行积分。

    Time Step

    时间步长,单位为ps。(默认为0.001)

    Simulation Time (ns)

    模拟时长,单位为ns。

    Group(s) for Center of Mass

    质心进行操作的组,可以是索引文件中的一个,或者多个组。默认为整个系统。

    Motion Mode

    系统或者系统中各个组质心的操作。(默认为None)

    • Linear:移去质心平移速度
    • Angular:去掉质心的平移和质心周围的旋转速度
    • Linear-acceleration-correction:去除质心平移速度。修正质心位置,假设在nstcomm步骤上有线性加速度。这对于期望质心上的加速度在mdp:nstcomm步长上几乎是恒定的情况是有用的。例如,当使用绝对引用拉入组时,就会发生这种情况。
    • None:对质心运动没有限制

    Coordinates Output Steps

    在轨迹文件中写入坐标的频率。(默认为0)

    Velocities Output Steps

    在轨迹文件中写入速度(v)的频率。(默认为0)

    Forces Output Steps

    在轨迹文件中写入力的频率。(默认为0)

    Log Output Steps

    在log文件中写入能量的频率。(默认为5000)

    Energies Output Steps

    在记录能量的文件中写入能量的频率。(默认为1000)

    Compressed Coordinates Steps

    输出压缩轨迹文件的频率。(默认为1000)

    Compressed Groups

    输出的压缩轨迹所包含的组。默认为整个系统。

    PBC

    周期化边界条件设置(默认为xyz)。

    • xyz:在所有方向上使用周期性边界条件。
    • no:不使用周期边界条件,忽略盒子。要模拟无截断的情况,请将所有截断值和nstlist设置为0。为了在无截断的情况下获得最佳性能,请将nstlist设置为0,并设置ns-type=simple。
    • xy:只在x和y方向上使用周期边界条件。这只适用于ns-type =grid,并且可以与墙壁结合使用。没有墙或只有一个墙,系统尺寸在z方向上是无限的。因此不能采用压力耦合法或埃瓦尔德求和法。当使用两面墙时,这些缺点就不适用了。

    Coulomb Type

    原子静电相互作用的计算方法,默认为PME。

    • Cut-off:使用对列表半径rlist和库仑截止rcoulomb的普通截断,其中rlist >= rcoulomb。
    • Ewald:经典的Ewald求和静电计算。实空间截止Coulomb Cutoff应等于rlist,例如rlist=0.9,rcoulomb=0.9。倒易空间中使用的波矢量的最大幅值由傅里叶间距控制。正/倒易空间的相对精度由ewald-rtol控制。
    • PME:用于静电相互作用的快速平滑粒子网格Ewald(SPME)方法。正空间部分类似于Ewald求和,而倒易空间部分使用FFT计算。网格尺寸由傅里叶间距控制,插值阶数由pme-order控制。

    Coulomb Cutoff

    库仑力截止距离,单位nm。(默认为1.2)

    VdW Type

    范德华相互作用的计算方法,默认为Cut-off。

    • Cut-off:用对列表半径rlist和VdW截断rvdw的普通截断,其中rlist >= rvdw。
    • PME:用于VdW相互作用的快速平滑粒子网格Ewald (SPME)。网格尺寸由傅里叶间距控制,插值阶数由pme-order控制。正/倒易空间的相对精度由ewald-rtol-lj控制,倒易空间计算使用的具体组合规则由lj-pme-comb-rule设置。

    VdW Cutoff

    LJ势或Buckingham的阈值,单位为nm。(默认为1.2)

    Dispersion Correction

    能量和压力的长程色散校正方法(默认为EnerPres)。

    • no:不做任何修正
    • EnerPres:对能量和压力应用长程色散校正
    • Ener:仅对能量应用长程色散修正

    Temperature Coupling

    温度耦合的方法(默认为V-rescale)。

    • V-rescale:使用随机项的速度重标度的温度耦合(JCP 126, 014101)。这个恒温器类似于Berendsen耦合,使用tau-t进行相同的缩放,但随机项确保生成适当的规范集合。随机种子用ld-seed设置。即使tau-t =0,这个恒温器也能正常工作。对于NVT模拟,保存的能量被写入能量和日志文件。
    • Berendsen:与Berendsen恒温器的温度耦合到温度为ref-t的浴槽,时间常数为tau-t。几个组可以单独耦合,它们在tc-grps字段中指定,并用空格分隔。
    • no:无温度耦合。

    Coupling Groups

    耦合到单独的温度浴的组别,多个组别用空格间隔。

    Time for Temperature Coupling

    耦合时间常数,每个组别都需要定义一个值,-1表示无温度耦合,单位为ps。(默认为0.2)

    Coupling Reference Temperature

    耦合的参考温度,即动力学模拟的温度,单位为K。(默认为300)

    Pressure Coupling

    压力耦合的方法(默认为Berendsen)。

    • Parrinello-Rahman:扩展系综压力耦合,其中盒向量服从运动方程,原子的运动方程与之耦合。不会发生瞬时缩放。与Nose-Hoover温度耦合一样,时间常数tau-p是平衡状态下压力波动的周期。当您希望在数据收集期间应用压力缩放时,这可能是一种更好的方法,但要注意,如果您从不同的压力开始,您可能会得到非常大的振荡。对于NPT系综的精确波动很重要的模拟,或者压力耦合时间很短的情况,该方法可能不合适,因为在GROMACS实现的某些步骤中使用了前一时间步的压力来代替当前时间步的压力。
    • Berendsen:指数弛豫压力与时间常数tau-p的耦合。这个盒子每隔几步就缩放一次。有人认为,这并不能产生正确的热力学集合,但这是在运行开始时缩放盒子的最有效方法。
    • no:无压力耦合。这意味着一个固定的盒子大小。

    Pressure Coupling Type

    压力耦合的各向同性类型。每种类型取一个或多个可压缩性(compressibility)和Coupling Reference Pressure。Time for Pressure Coupling仅允许一个值。(默认为isotropic)

    • isotropic:时间常数为Time for Pressure Coupling的各向同性压力耦合。可压缩性(compressibility)和Coupling Reference Pressure各需要一个值.
    • semiisotropic:在x和y方向上各向同性、但在z方向上不同的压力耦合。这对于膜模拟是有用的。对于x/y和z方向,可压缩性(compressibility)和Coupling Reference Pressure各需要两个值。
    • anisotropic:与之前相同,但xx、yy、zz、xy/yx、xz/zx和yz/zy分量分别需要6个值。当非对角压缩性设置为零时,矩形盒子将保持矩形。请注意,各向异性缩放可能会导致模拟盒子发生极端变形。
    • surface-tension:平行于xy平面的表面的表面张力耦合。对z方向使用法向压力耦合,而表面张力耦合到盒子的x/y尺度。第一个Coupling Reference Pressure是参考表面张力乘以表面数(单位bar*nm),第二个值是参考z-pressure(单位bar)。这两个可压缩性(compressibility)分别是xy和z方向上的压缩率。z-compressibility的值应该相当精确,因为它会影响表面张力的收敛,也可以将其设置为零,使盒子具有恒定的高度。

    Time for Pressure Coupling

    压力耦合的时间常数(所有方向一个值),单位为ps。(默认为2)

    Coupling Reference Pressure

    耦合的参考压力,单位为bar。(默认为1)

    Compressibility

    可压缩性(注:这实际上是在bar^-1)对于水在1atm和300k的可压缩性是4.5e-5 bar^-1。所需值的数量由pcoupltype [bar^-1]暗示。

    Constraints

    限制类型。(默认为none)

    • none:除了拓扑文件中明确定义的外,没有限制。
    • hbonds:给含有氢原子的键添加限制。
    • all-bonds:给所有的键添加限制。
    • h-angles:给所有的键添加限制,同时给含有氢原子的角度添加限制。
    • all-angles:给所有的键和角度添加限制。

    Force Constant of POSRE

    xyz方向的位置限制的力常数,三个数值之间用逗号分隔开,单位为kJ/(mol·nm^2)。例如:500,500,500。

    Disre Type

    MD运行中距离、角度、二面角限制是否生效:
    no表示忽略拓扑文件中的约束信息;
    simple表示简单的(每分子)的距离约束;
    ensemble表示一个模拟盒中分子系综的距离约束。

    Disre Weighting

    约束力权重类型:
    equal表示将约束力平分到约束中的所有原子对上;
    conservative表示约束力为约束势的导数, 将导致原子对的权重为r^-7.,当Time Constant for Restraints=0时,约束力为保守力。

    Disre Mixed

    Dirse mixed采用的方法:
    no表示计算约束力时使用时间平均的违反;
    yes表示计算约束力时使用时间平均违反与瞬时违反乘积的平方根。

    Force Constant

    约束的力常数,乘以拓扑文件中相互作用约束给出的Factor即为最终的约束力大小。

    Time Constant for Restraints

    限制约束的时间,设置为0时表示MD过程中一直进行约束,单位为ps。

    Dirse Output Steps

    将约束中所有原子对的运行距离和瞬时距离写入能量文件的间隔步数。间隔越小该文件越大。

    Output File

    输出文件名称

    结果说明

    生成跑MD的MDP文件md.mdp。

    MD MDP Generation

    Introduction

    MD MDP Generation is a module for generating the MDP file for equilibrium simulations (MD).

    Parameter Description

    Define

    Used to pass definitions to the preprocessor, which can be used to control options in custom topology files (.top). Available options include:

    • DPOSRES: enables position restraints. When selecting this option, you must also fill in Force Constant of POSRE; otherwise the option has no effect.
    • DFLEXIBLE: uses flexible water instead of rigid water in the topology, which is useful for normal mode analysis.

    Integrator

    Integration method used in the simulation: md.
    md is a leap-frog algorithm for integrating Newton’s equations of motion.

    Time Step

    Time step, in ps. (Default is 0.001)

    Simulation Time (ns)

    Simulation duration, in ns.
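    Because Time Step is given in ps and Simulation Time in ns, the number of integration steps follows from a unit conversion. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical and is not part of this module.

```python
def steps_for(sim_time_ns: float, dt_ps: float) -> int:
    """Number of integration steps needed to cover sim_time_ns with step dt_ps."""
    if dt_ps <= 0:
        raise ValueError("time step must be positive")
    # 1 ns = 1000 ps; round() absorbs floating-point noise in the division.
    return round(sim_time_ns * 1000.0 / dt_ps)
```

    For example, steps_for(10, 0.002) corresponds to a 10 ns run with a 2 fs time step.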

    Group(s) for Center of Mass

    Groups for which center of mass operations will be performed, can be one or multiple groups from an index file. Default is the entire system.

    Motion Mode

    How center-of-mass motion of the system, or of the selected groups, is handled. (Default is None)

    • Linear: Remove the center-of-mass translational velocity
    • Angular: Remove the center-of-mass translational velocity and the rotational velocity around the center of mass
    • Linear-acceleration-correction: Remove center of mass translation velocities. Correct center of mass positions assuming a linear acceleration over nstcomm steps. This is useful when you expect the acceleration on the center of mass to be nearly constant over nstcomm steps, for example when using absolute reference pulling groups.
    • None: No restrictions on center of mass motion

    Coordinates Output Steps

    Frequency of writing coordinates to the trajectory file. (Default is 0)

    Velocities Output Steps

    Frequency of writing velocities to the trajectory file. (Default is 0)

    Forces Output Steps

    Frequency of writing forces to the trajectory file. (Default is 0)

    Log Output Steps

    Frequency of writing energies to the log file. (Default is 5000)

    Energies Output Steps

    Frequency of writing energies to the energy file. (Default is 1000)

    Compressed Coordinates Steps

    Frequency of writing coordinates to the compressed trajectory file. (Default is 1000)

    Compressed Groups

    Groups written to the compressed trajectory file. Default is the entire system.

    PBC

    Periodic boundary conditions setting. (Default is xyz)

    • xyz: Use periodic boundary conditions in all directions.
    • no: Do not use periodic boundary conditions and ignore the box. To simulate without cut-offs, set all cut-offs and nstlist to 0. For best performance without cut-offs, set nstlist to 0 and ns-type=simple.
    • xy: Use periodic boundary conditions only in the x and y directions. This is only applicable with ns-type=grid and can be combined with walls. Without walls or with only one wall, the system size is infinite in the z direction. Therefore, pressure coupling or Ewald summation cannot be used. When using two walls, these drawbacks do not apply.

    Coulomb Type

    Method for calculating atomic electrostatic interactions, default is PME.

    • Cut-off: Plain cut-off with pair-list radius rlist and Coulomb cut-off rcoulomb, where rlist >= rcoulomb.
    • Ewald: Classical Ewald sum electrostatics. The real-space cut-off (Coulomb Cutoff) should be equal to rlist; use, for example, rlist=0.9, rcoulomb=0.9. The highest magnitude of the wave vectors used in reciprocal space is controlled by the Fourier spacing. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol.
    • PME: Fast smooth Particle-Mesh Ewald (SPME) electrostatics. Direct space is treated as in the Ewald sum, while the reciprocal part is computed with FFTs. The grid dimensions are controlled by the Fourier spacing, and the interpolation order by pme-order.

    Coulomb Cutoff

    Coulomb force cut-off distance, in nm. (Default is 1.2)

    VdW Type

    Method for calculating van der Waals interactions, default is Cut-off.

    • Cut-off: Plain cut-off with pair-list radius rlist and VdW cut-off rvdw, where rlist >= rvdw.
    • PME: Fast smooth Particle-Mesh Ewald (SPME) for van der Waals interactions. The grid size is controlled by the Fourier spacing, and the interpolation order is controlled by pme-order. The relative accuracy of direct/reciprocal space is controlled by ewald-rtol-lj, and the specific combination rule for the LJ-PME is set by lj-pme-comb-rule.

    VdW Cutoff

    Cut-off distance for the LJ or Buckingham potential, in nm. (Default is 1.2)

    Dispersion Correction

    Method for long-range dispersion correction for energy and pressure. (Default is EnerPres)

    • no: No correction is applied.
    • EnerPres: Long-range dispersion correction is applied for both energy and pressure.
    • Ener: Only the energy is corrected for long-range dispersion.

    Temperature Coupling

    Method for temperature coupling. (Default is V-rescale)

    • V-rescale: Temperature coupling using velocity rescaling with a stochastic term (JCP 126, 014101). This thermostat is similar to Berendsen coupling, with the same scaling using tau-t, but the stochastic term ensures that a proper canonical ensemble is generated. The random seed is set with ld-seed. This thermostat works correctly even for tau-t = 0. For NVT simulations, the conserved energy quantity is written to the energy and log files.
    • Berendsen: Temperature coupling with a Berendsen thermostat to a bath at temperature ref-t, with time constant tau-t. Several groups can be coupled separately; they are specified in the tc-grps field, separated by spaces.
    • no: No temperature coupling.

    Coupling Groups

    Groups coupled to separate temperature baths; multiple groups are separated by spaces.

    Time for Temperature Coupling

    Time constant for temperature coupling, in ps. One value is required for each coupling group; -1 means no temperature coupling. (Default is 0.2)

    Coupling Reference Temperature

    Reference temperature for coupling, the temperature of the dynamic simulation, in K. (Default is 300)

    Pressure Coupling

    Method for pressure coupling. (Default is Berendsen)

    • Parrinello-Rahman: Extended-ensemble pressure coupling in which the box vectors are subject to an equation of motion. The equation of motion for the atoms is coupled to this. No instantaneous scaling takes place. As with Nose-Hoover temperature coupling, the time constant tau-p is the period of pressure fluctuations at equilibrium. This is probably a better method when you want to apply pressure scaling during data collection, but beware that you can get very large oscillations if you start from a different pressure. For simulations where the exact fluctuations of the NPT ensemble are important, or if the pressure coupling time is very short, it may not be appropriate, because the pressure from the previous time step is used in some steps of the GROMACS implementation in place of the current one.
    • Berendsen: Exponential relaxation pressure coupling with time constant tau-p. The box is scaled every few steps. It has been argued that this does not yield a correct thermodynamic ensemble, but it is the most efficient way to scale a box at the beginning of a run.
    • no: No pressure coupling. This means a fixed box size.

    Pressure Coupling Type

    Isotropy type of the pressure coupling. Each type takes one or more values for Compressibility and Coupling Reference Pressure. Time for Pressure Coupling takes only a single value. (Default is isotropic)

    • isotropic: Isotropic pressure coupling with a time constant Time for Pressure Coupling. Requires a compressibility and Coupling Reference Pressure value each.
    • semisotropic: Pressure coupling isotropic in x and y directions but different in the z direction. Useful for membrane simulations. Requires two compressibility and Coupling Reference Pressure values for x/y and z directions, respectively.
    • anisotropic: Same as before but with six values for xx, yy, zz, xy/yx, xz/zx, and yz/zy components. When non-diagonal compressibilities are set to zero, the rectangular box will remain rectangular. Note that anisotropic scaling may cause extreme deformation of the simulation box.
    • surface-tension: Surface tension coupling for surfaces parallel to the xy plane. Normal pressure coupling is used for the z direction, while the surface tension is coupled to the x/y dimensions of the box. The first Coupling Reference Pressure value is the reference surface tension multiplied by the number of surfaces (in bar*nm), and the second value is the reference z-pressure (in bar). The two compressibility values are the compressibilities in the x/y and z directions, respectively. The z-compressibility should be reasonably accurate because it influences the convergence of the surface tension; alternatively, it can be set to zero to keep the box height constant.
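
    Each coupling type in the list above expects a specific number of Compressibility and Coupling Reference Pressure values. The check below is a minimal sketch of that rule; the function and dictionary names are hypothetical, and the module may validate its inputs differently.

```python
# Number of values each Pressure Coupling Type expects, as listed above.
REQUIRED_VALUES = {
    "isotropic": 1,
    "semisotropic": 2,      # spelled as in this module's option list
    "anisotropic": 6,       # xx, yy, zz, xy/yx, xz/zx, yz/zy
    "surface-tension": 2,
}


def check_pressure_values(coupling_type, compressibility, ref_pressure):
    """Return True if both value lists have the length coupling_type requires."""
    if coupling_type not in REQUIRED_VALUES:
        raise ValueError(f"unknown pressure coupling type: {coupling_type!r}")
    expected = REQUIRED_VALUES[coupling_type]
    for name, values in (("compressibility", compressibility),
                         ("reference pressure", ref_pressure)):
        if len(values) != expected:
            raise ValueError(
                f"{coupling_type} needs {expected} {name} value(s), got {len(values)}"
            )
    return True
```

    For a membrane system, check_pressure_values("semisotropic", [4.5e-5, 4.5e-5], [1.0, 1.0]) passes, while giving a single value would raise an error.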

    Time for Pressure Coupling

    Time constant for pressure coupling (one value for all directions), in ps. (Default is 2)

    Coupling Reference Pressure

    Reference pressure for coupling, in bar. (Default is 1)

    Compressibility

    Compressibility, in bar^-1. For water at 1 atm and 300 K, the compressibility is 4.5e-5 bar^-1. The number of values required is determined by Pressure Coupling Type.

    Constraints

    Type of constraints. (Default is none)

    • none: No constraints other than those explicitly defined in the topology file.
    • hbonds: Constraints added to bonds involving hydrogen atoms.
    • all-bonds: Constraints added to all bonds.
    • h-angles: Constraints added to all bonds and, in addition, to angles that involve hydrogen atoms.
    • all-angles: Constraints added to all bonds and angles.

    Force Constant of POSRE

    Force constant for position restraints in the xyz directions, separated by commas, in units of kJ/(mol·nm^2). For example: 500,500,500.

    Disre Type

    Whether distance, angle, and dihedral restraints are active during the MD run:
    no: ignore the restraint information in the topology file;
    simple: simple (per-molecule) distance restraints;
    ensemble: distance restraints over an ensemble of molecules in one simulation box.

    Disre Weighting

    Weighting of the restraint force:
    equal: divide the restraint force equally over all atom pairs in the restraint;
    conservative: the forces are the derivative of the restraint potential, which weights the atom pairs by r^-7. The forces are conservative when Time Constant for Restraints is 0.

    Disre Mixed

    Method used by Disre Mixed:
    no: the restraint force is calculated from the time-averaged violation;
    yes: the restraint force is calculated from the square root of the product of the time-averaged violation and the instantaneous violation.

    Force Constant

    Force constant for restraints. It is multiplied by the Factor given for each restraint in the topology file to obtain the final restraint force constant.

    Time Constant for Restraints

    Time constant for the running average of the distance restraints, in ps. A value of 0 turns off time averaging, so the instantaneous restraints are applied throughout the MD run.

    Dirse Output Steps

    Interval, in steps, at which the running time-averaged and instantaneous distances of all atom pairs involved in restraints are written to the energy file. A smaller interval produces a larger file.

    Output File

    Output file name.

    Result Description

    Generates the MDP file md.mdp for running MD.
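
    To illustrate what such a file contains, the sketch below renders the defaults listed above as "key = value" lines. The keys are standard GROMACS .mdp option names, but the mapping from this module's parameter labels to those keys, the tc-grps value, and the function name are assumptions; the file the module actually writes may differ.

```python
# Defaults as stated in the parameter descriptions above. The mapping of the
# module's parameter labels to GROMACS .mdp keys is an assumption.
DEFAULTS = {
    "integrator": "md",
    "dt": 0.001,                  # Time Step, ps
    "nstlog": 5000,               # Log Output Steps
    "nstenergy": 1000,            # Energies Output Steps
    "nstxout-compressed": 1000,   # Compressed Coordinates Steps
    "pbc": "xyz",
    "coulombtype": "PME",
    "rcoulomb": 1.2,              # Coulomb Cutoff, nm
    "vdwtype": "Cut-off",
    "rvdw": 1.2,                  # VdW Cutoff, nm
    "DispCorr": "EnerPres",
    "tcoupl": "V-rescale",
    "tc-grps": "System",          # hypothetical group; no default is stated above
    "tau-t": 0.2,                 # ps
    "ref-t": 300,                 # K
    "pcoupl": "Berendsen",
    "pcoupltype": "isotropic",
    "tau-p": 2,                   # ps
    "ref-p": 1,                   # bar
    "compressibility": 4.5e-5,    # bar^-1
    "constraints": "none",
}


def render_mdp(overrides=None):
    """Return .mdp text built from DEFAULTS updated with overrides."""
    params = dict(DEFAULTS)
    params.update(overrides or {})
    width = max(len(key) for key in params)
    return "\n".join(f"{key:<{width}} = {value}" for key, value in params.items()) + "\n"
```

    Calling render_mdp({"dt": 0.002}) returns the same text with only the time step changed, which can then be written to md.mdp.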

  • Name: MD Solvation
    Description: 对输入的受体配体文件加入水盒子和离子。 Adds water box and ions for the system.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-10-09 15:49:33
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    MD Solvation

    简介

    MD Solvation将原有的受配体结构中加入水分子和离子。

    参数说明

    Receptor Topology

    输入的受体拓扑文件,可由GMX Receptor Parameterization模块生成。

    Receptor GRO

    输入的受体结构文件,可由GMX Receptor Parameterization模块生成。

    Receptor ITP

    输入的受体参数(压缩)文件,可由GMX Receptor Parameterization模块生成。

    Ligand GRO

    输入的配体结构(压缩)文件,可由GMX Ligand Parameterization模块生成。

    Ligand ITP

    输入的配体参数(压缩)文件,可由GMX Ligand Parameterization模块生成。

    Output Topology

    输出的体系总的拓扑文件

    Output GRO

    输出的体系总的结构文件

    Output ITP

    输出的体系参数的(压缩)文件

    Distance Restraints

    距离限制,仅当Disre不为no时生效,格式如下所示:

    [AtomIndex1]  [AtomIndex2]  [Type]  [Index]  [Type]  [Low]  [Up1]  [Up2]  [Factor]
    

    其中,AtomIndex1和AtomIndex2为在system.gro的原子编号;Type为施加约束类型,通常设置为1,Type类型见表1;Index是计算顺序;Low、Up1、Up2为原子间限制距离,Low到Up1区间的原子距离是不受限制的,但是不能超过Up2,单位为nm;Factor为因子,将Factor乘以“Disre Force Constant”即为限制力的大小,单位为kJ/mol/nm2。
    例如:

    10     16      1       0       1      0.0     0.3     0.4     1.0
    10     46      1       1       1      0.0     0.3     0.4     1.0
    16     22      1       2       1      0.0     0.3     0.4     2.5
    

    表1:GROMACS中三种约束类型对原子对进行限制

    Type Code 约束类型 作用情况
    1 Complex NMR distance restraints 当Disre Type为ensemble时,即非键相互作用设置为1
    6 Simple harmonic restraints 当Disre Type为simple时,即分子内成键相互作用设定,可设为6或者10.
    10 Piecewise linear/harmonic restraints 当Disre Type为simple时,即分子内成键相互作用设定,可设为6或者10

    Angle Restraints

    角度限制是两对原子间角度的限制,仅当Disre不为no时生效,格式如下所示:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Theta0]  [Force Constant]  [Multiplicity]
    

    其中,AtomIndex1-AtomIndex2是第一对原子编号;AtomIndex3-AtomIndex4为第二对原子编号;Type在这里无用,定义为1即可;Theta0为约束的角度,单位为deg;Force Constant为约束力常数,单位为kJ/mol;Multiplicity为多重度。
    例如

    2642     2643     2635     2652     1     67.0     1500     1
    

    Dihedral Restraints

    二面角限制,仅当Disre不为no时生效,格式如下所示:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Label]  [Phi]  [dPhi]  [KFactor]  [Power]
    

    其中,AtomIndex1-AtomIndex4为组成二面角的原子编号;Type为约束类型函数,总是为1;Label无效;Phi为参考角,dPhi为超出参考角的角度值,单位为deg;KFactor为因子,将KFactor乘以“Disre Force Constant”即为限制力的大小,单位为 kJ/mol/rad2;Power无效。
    例如:

    2642      2643      2635      2652      1      67.0      1500      1
    

    约束势函数如下所示:
    image.png
    其中,Φ’为参考角Phi,ΔΦ为超出参考角的值dPhi,K_dihr为限制力的大小KFactor。

    结果说明

    输出结果包括:

    输出文件名称 说明
    system.gro 体系的分子坐标文件
    system_itp.tar.gz 体系平衡模拟时固定原子位置所施加的力
    system.top 体系的拓扑文件

    参考文献

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

    MD Solvation

    Introduction

    MD Solvation adds water molecules and ions to the input receptor-ligand structure.

    Parameter Description

    Receptor Topology

    Input receptor topology file, can be generated by the GMX Receptor Parameterization module.

    Receptor GRO

    Input receptor structure file, can be generated by the GMX Receptor Parameterization module.

    Receptor ITP

    Input receptor parameter (compressed) file, can be generated by the GMX Receptor Parameterization module.

    Ligand GRO

    Input ligand structure (compressed) file, can be generated by the GMX Ligand Parameterization module.

    Ligand ITP

    Input ligand parameter (compressed) file, can be generated by the GMX Ligand Parameterization module.

    Output Topology

    Output total system topology file.

    Output GRO

    Output total system structure file.

    Output ITP

    Output system parameter (compressed) file.

    Distance Restraints

    Distance restraints, effective only when Disre is not “no”, formatted as follows:

    [AtomIndex1]  [AtomIndex2]  [Type]  [Index]  [Type]  [Low]  [Up1]  [Up2]  [Factor]
    

    Where AtomIndex1 and AtomIndex2 are atom indices in system.gro; Type is the restraint type, typically set to 1 (see Table 1 for the Type codes); Index is the calculation order; Low, Up1, and Up2 are the distance bounds between the two atoms, in nm: a distance between Low and Up1 is unrestrained, and the distance should not exceed Up2; Factor is a multiplier, and Factor times the “Disre Force Constant” gives the restraint force constant, in kJ/(mol·nm^2).
    For example:

    10     16      1       0       1      0.0     0.3     0.4     1.0
    10     46      1       1       1      0.0     0.3     0.4     1.0
    16     22      1       2       1      0.0     0.3     0.4     2.5
    

    Table 1: Three restraint types in GROMACS for atom pairs

    Type Code Restraint Type Application
    1 Complex NMR distance restraints Set to 1 for non-bonded interactions when Disre Type is ensemble
    6 Simple harmonic restraints Set to 6 or 10 for intramolecular bonded interactions when Disre Type is simple
    10 Piecewise linear/harmonic restraints Set to 6 or 10 for intramolecular bonded interactions when Disre Type is simple
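
    The nine whitespace-separated fields of a distance restraint line can be read as in the sketch below. The class, field, and function names are hypothetical labels for the columns described above (the two Type columns are named restraint_type and second_type to tell them apart); this is not the module's own parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistanceRestraint:
    atom1: int            # AtomIndex1
    atom2: int            # AtomIndex2
    restraint_type: int   # first Type column
    index: int            # Index (calculation order)
    second_type: int      # second Type column
    low: float            # nm
    up1: float            # nm
    up2: float            # nm
    factor: float         # multiplier for the Disre Force Constant


def parse_distance_restraint(line: str) -> DistanceRestraint:
    """Parse one distance restraint line in the nine-column format above."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    ints = [int(value) for value in fields[:5]]
    floats = [float(value) for value in fields[5:]]
    return DistanceRestraint(*ints, *floats)


def restraint_force_constant(restraint: DistanceRestraint, disre_force_constant: float) -> float:
    """Factor multiplied by the Disre Force Constant, in kJ/(mol·nm^2)."""
    return restraint.factor * disre_force_constant
```

    Parsing the third example line gives a Factor of 2.5, so with a Disre Force Constant of 1000 the restraint force constant would be 2500.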

    Angle Restraints

    Angle restraints limit the angle between two pairs of atoms, effective only when Disre is not “no”, formatted as follows:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Theta0]  [Force Constant]  [Multiplicity]
    

    Where AtomIndex1-AtomIndex2 is the first pair of atom indices; AtomIndex3-AtomIndex4 is the second pair; Type is unused here and should be set to 1; Theta0 is the restrained angle, in degrees; Force Constant is the restraint force constant, in kJ/mol; Multiplicity is the multiplicity.
    For example:

    2642     2643     2635     2652     1     67.0     1500     1
    

    Dihedral Restraints

    Dihedral restraints, effective only when Disre is not “no”, formatted as follows:

    [AtomIndex1]  [AtomIndex2]  [AtomIndex3]  [AtomIndex4]  [Type]  [Label]  [Phi]  [dPhi]  [KFactor]  [Power]
    

    Where AtomIndex1-AtomIndex4 are the indices of the four atoms that define the dihedral; Type is the restraint function type and is always 1; Label is unused; Phi is the reference angle and dPhi is the angular deviation beyond the reference angle, both in degrees; KFactor is a multiplier, and KFactor times the “Disre Force Constant” gives the restraint force constant, in kJ/(mol·rad^2); Power is unused.
    For example:

    2642      2643      2635      2652      1      67.0      1500      1
    

    The restraint potential function is as follows:
    [Figure: dihedral restraint potential function]
    Where Φ’ is the reference angle Phi, ΔΦ is the value beyond the reference angle dPhi, and K_dihr is the size of the restraint force KFactor.
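
    Since the figure is not reproduced here, the sketch below assumes the flat-bottomed harmonic form that GROMACS uses for dihedral restraints: the energy is zero while the dihedral stays within dPhi of Phi and grows quadratically beyond that. The function name and arguments are hypothetical, and k_dihr stands for the final force constant in kJ/(mol·rad^2).

```python
import math


def dihedral_restraint_energy(phi_deg, phi0_deg, dphi_deg, k_dihr):
    """Flat-bottomed harmonic dihedral restraint energy in kJ/mol (assumed form)."""
    # Deviation from the reference angle, wrapped into [-180, 180) degrees.
    deviation = (phi_deg - phi0_deg + 180.0) % 360.0 - 180.0
    # Only the part of the deviation beyond the tolerance dPhi is penalized.
    excess_deg = max(abs(deviation) - dphi_deg, 0.0)
    return 0.5 * k_dihr * math.radians(excess_deg) ** 2
```

    With Phi = 67, dPhi = 5, and k_dihr = 1500, a dihedral of 70 degrees costs nothing, while 82 degrees is penalized for the 10 degrees that lie outside the tolerance.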

    Result Description

    The output results include:

    Output File Name Description
    system.gro Molecular coordinate file of the system
    system_itp.tar.gz Compressed parameter (ITP) files of the system, including the position-restraint forces applied to fix atoms during equilibration
    system.top Topology file of the system

    References

    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25.

  • Name: MD RMS
    Description: 计算平衡模拟轨迹的均方根偏差(RMSD)和均方根波动(RMSF),从而分析结构的稳定性和结构变化情况。 Calculates the RMSD or RMSF to analyze the structural stability of the system.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-09-29 00:00:00
    Reference: GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, SoftwareX, 1, (2015), 19-25. DOI: https://doi.org/10.1016/j.softx.2015.06.001

    RMS

    简介

    通过计算平衡模拟轨迹的均方根偏差(RMSD,Root Mean Square Deviation)和均方根波动(RMSF,Root Mean Square Fluctuation),从而分析结构的稳定性和结构变化情况。

    参数说明

    Path File

    MD模拟后得到的路径文件,可以在**GMX MD Run (GMX2023)模块或者AlphaAutoMD (GMX2023)**模块中获取。

    Analysis Type

    选择分析类型:RMSD或者RMSF(可多选)。

    System Group

    选择需要计算的组别。

    Custom Resid

    自定义需要计算的残基编号,连续参数可用“-”表示,不连续残基用逗号隔开,例如:1-10,15。

    Custom Atom

    自定义需要计算的原子编号,用逗号隔开,例如:CA,O,H。与Custom Resid是交集关系。

    Skip Time (ps)

    Index File

    索引文件,可由Membrane Solvation模块得到。

    结果说明

    输出结果包括:

    输出文件名称 说明
    rmsd_result.csv 所选组别的RMSD的CSV文件
    rmsd_result.png 所选组别的RMSD的PNG文件
    rmsd_result.xvg 所选组别的RMSD的XVG文件
    rmsf_*.csv 所选组别的RMSF的CSV文件
    rmsf_*.png 所选组别的RMSF的PNG文件
    rmsf_*xvg. 所选组别的RMSF的XVG文件
    bfac.pdb PDB中的B-Factor一列为原子RMSF值通过公式<Δr²> = 3B/(8π²)转换得到。

    RMS

    Introduction

    By calculating the Root Mean Square Deviation (RMSD) and Root Mean Square Fluctuation (RMSF) of equilibrium simulation trajectories, the stability and structural changes of the system can be analyzed.

    Parameter Description

    Path File

    The path file obtained after MD simulation, which can be obtained from the GMX MD Run (GMX2023) module or the AlphaAutoMD (GMX2023) module.

    Analysis Type

    Select the type of analysis: RMSD or RMSF (multiple selections possible).

    System Group

    Select the group to be calculated.

    Custom Resid

    Custom residue numbers to be calculated. A continuous range is written with “-”, and non-consecutive residues are separated by commas, for example: 1-10,15.
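
    The selection syntax above can be expanded into an explicit residue list as in this minimal sketch. The function name is hypothetical and the module's own parsing may differ, for example in how it treats whitespace or reversed ranges.

```python
def parse_residue_selection(text: str) -> list:
    """Expand a selection such as '1-10,15' into a list of residue numbers."""
    residues = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if "-" in token:
            start, end = (int(part) for part in token.split("-", 1))
            if start > end:
                raise ValueError(f"range start exceeds end: {token!r}")
            residues.extend(range(start, end + 1))  # ranges are inclusive
        else:
            residues.append(int(token))
    return residues
```

    The example from the text, "1-10,15", expands to residues 1 through 10 followed by 15.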

    Custom Atom

    Custom atom names to be calculated, separated by commas, for example: CA,O,H. The final selection is the intersection of this setting with Custom Resid.

    Skip Time (ps)

    Index File

    Index file; it can be obtained from the Membrane Solvation module.

    Result Description

    The output results include:

    Output File Name Description
    rmsd_result.csv CSV file of RMSD for the selected group
    rmsd_result.png PNG file of RMSD for the selected group
    rmsd_result.xvg XVG file of RMSD for the selected group
    rmsf_*.csv CSV file of RMSF for the selected group
    rmsf_*.png PNG file of RMSF for the selected group
    rmsf_*.xvg XVG file of RMSF for the selected group
    bfac.pdb PDB file whose B-factor column holds the atomic RMSF values converted with the formula <Δr^2> = 3B/(8π^2).
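
    Rearranging the formula above gives B = (8π^2/3)·<Δr^2>. The sketch below applies it under two assumptions that are not stated in the text: the RMSF is given in nm (the usual GROMACS unit) and the B-factor is wanted in Å^2, as in PDB files. The function name is hypothetical.

```python
import math


def rmsf_to_bfactor(rmsf_nm: float) -> float:
    """Convert an RMSF in nm to a B-factor in Å^2 via B = (8π^2/3)·<Δr^2>."""
    rmsf_angstrom = rmsf_nm * 10.0  # 1 nm = 10 Å
    return 8.0 * math.pi ** 2 / 3.0 * rmsf_angstrom ** 2
```

    Under these assumptions an RMSF of 0.1 nm (1 Å) corresponds to a B-factor of about 26.3 Å^2.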
  • Name: Scaffold Constrained Small Molecule Generation
    Description: 传统分子生成模型无法限制特定骨架,限制了分子生成在结构优化中的应用,该模块可以限制骨架,指定优化部位,特异性的生成全新分子库。 During the optimization of a lead series, it is common to have scaffold constraints imposed on the structure of the molecules designed. Without enforcing such constraints, the probability of generating molecules with the required scaffold is extremely low and hinders the practicality of generative models for de novo drug design.
    Tags: undefined
    Author: Maxime Langevin
    Release: 2022-08-20 00:00:00
    Reference: Langevin M, Minoux H, Levesque M, Bianciotto M. Scaffold-Constrained Molecular Generation. J Chem Inf Model. 2020 Dec 28;60(12):5637-5646.

    Scaffold Constrained Small Molecule Generation

    简介

    传统分子生成模型无法限制特定骨架,限制了分子生成在结构优化中的应用,Scaffold Constrained Generation是一种骨架限制的生成模型,可以限制骨架,指定优化部位,特异性的生成全新分子库。

    参数说明

    SDF File模式

    SDF File

    小分子骨架结构文件,SDF格式。结构中用星号*表示骨架结构上需要连接新结构片段的位置,如下图所示(可使用WeDraw进行结构编辑):
    image.png

    Draw模式

    SDF File

    使用WeDraw生成小分子结构文件,SDF格式。

    Smiles模式

    Smiles String

    输入带*的小分子SMILES,代表生成部分,其他部分固定不变,支持输入多个。例如:*c1cnc2ccccc2c1

    Number of Molecules

    期望生成的分子数目。

    Output File

    最终输出文件的文件名称,默认为scg_results.sdf。

    结果说明

    生成优化后的分子库的sdf文件scg_results.sdf。

    参考文献

    • Langevin M, Minoux H, Levesque M, Bianciotto M. Scaffold-Constrained Molecular Generation. J Chem Inf Model. 2020 Dec 28;60(12):5637-5646. DOI:10.1021/acs.jcim.0c01015

    Scaffold Constrained Small Molecule Generation

    Introduction

    Traditional molecular generation models cannot constrain generation to a specific scaffold, which limits their use in structure optimization. Scaffold Constrained Generation is a scaffold-constrained generative model that keeps the scaffold fixed, lets you specify the sites to optimize, and generates a novel, targeted molecular library.

    Parameters

    SDF File Mode

    SDF File

    Small molecule scaffold structure file in SDF format. An asterisk (*) in the structure marks each position on the scaffold where a new fragment is to be attached (WeDraw can be used for structure editing).

    Draw Mode

    SDF File

    Generate small molecule structure file using WeDraw, in SDF format.

    SMILES Mode

    SMILES String

    Input a small-molecule SMILES string that contains one or more asterisks (*). Each * indicates a position to be generated, while the rest of the structure remains fixed. Multiple asterisks are supported. Example: *c1cnc2ccccc2c1

    Number of Molecules

    The desired number of molecules to generate.

    Output File

    The file name for the final output file, default is scg_results.sdf.

    Results

    The optimized molecular library is saved in an SDF file named scg_results.sdf.

    References

    • Langevin M, Minoux H, Levesque M, Bianciotto M. Scaffold-Constrained Molecular Generation. J Chem Inf Model. 2020 Dec 28;60(12):5637-5646. DOI:10.1021/acs.jcim.0c01015
  • Name: Small Molecule Random Generation
    Description: 基于深度学习的分子生成模块,实现了多种主流的分子生成模型,包括字符级循环神经网络,变分自编码器,以及对抗自编码器。 A deep learning-based molecular generation module, which implements various mainstream molecular generation models, including character-level recurrent neural networks, variational autoencoders, and adversarial autoencoders.
    Tags: undefined
    Author: Daniil Polykovskiy
    Release: 2022-08-19 00:00:00
    Reference: Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front Pharmacol. 2020 Dec 18;11:565644.

    Small Molecule Random Generation

    简介

    De novo Generation (Moses)是基于深度学习的分子生成模块,实现了多种主流的分子生成模型,包括字符级循环神经网络,变分自编码器,以及对抗自编码器。

    参数说明

    Model

    分子生成模型,目前包含以下几种:
    char_rnn:Character-level Recurrent Neural Network(CharRNN)字符级循环神经网络。
    vae:Variational Autoencoder(VAE)变分自编码器。
    aae:Adversarial Autoencoder(AAE)对抗自编码器。

    Number of Molecules

    期望生成的分子数目。

    Seed

    采样随机数。

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.sdf 生成sdf格式分子库。
    result.csv 生成smiles格式分子库,写入csv文件中,首行列名smiles。

    参考文献

    Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front Pharmacol. 2020 Dec 18;11:565644.

    Small Molecule Random Generation

    Introduction

    De novo Generation (Moses) is a deep learning-based molecular generation module that implements various mainstream molecular generation models, including character-level recurrent neural networks, variational autoencoders, and adversarial autoencoders.

    Parameter Description

    Model

    Molecular generation model, currently includes the following:

    • char_rnn: Character-level Recurrent Neural Network (CharRNN).
    • vae: Variational Autoencoder (VAE).
    • aae: Adversarial Autoencoder (AAE).

    Number of Molecules

    The desired number of molecules to generate.

    Seed

    Random seed used for sampling.

    Result Description

    The output includes:

    Output File Name Description
    result.sdf Generated molecular library in SDF format.
    result.csv Generated molecular library in SMILES format, written to a CSV file with the column name “smiles”.

    Reference

    Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front Pharmacol. 2020 Dec 18;11:565644.

  • Name: Protein Design (ProteinMPNN)
    Description: 基于ProteinMPNN模型实现基于给定的蛋白骨架结构生成合理的序列。本模块也集成了基于ProteinMPNN使用抗体数据微调得到的AbMPNN模型,可更好地进行抗体设计。建议通过WeView三维结构可视化编辑器来使用该功能,具体为WeView-> Design -> Sequence Design (ProteinMPNN)。 ProteinMPNN model-based generating sequences based on a given backbone structure. This module also integrates the AbMPNN model, which is fine-tuned on antibody data based on ProteinMPNN, and can better facilitate antibody design. It is recommended to use in the WeView: WeView-> Design -> Sequence Design (ProteinMPNN).
    Tags: undefined
    Author: Dauparas J, Anishchenko I, Bennett N, et al.
    Release: 2022-08-17 23:23:03
    Reference: Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science. 2022 Oct 7;378(6615):49-56.

    Protein Design (ProteinMPNN)

    简介

    ProteinMPNN是一种基于深度学习的蛋白质序列设计方法,在天然蛋白质骨架上,ProteinMPNN的序列恢复率为52.4%,而Rosetta为32.9%。在训练过程中加入噪声可以提高蛋白质结构模型的序列恢复率,并且产生的序列可以更稳健地编码它们的结构。X射线晶体学、低温电镜和功能研究也证明了ProteinMPNN的广泛实用性和高准确性,它成功挽救了以前用Rosetta或AlphaFold设计失败的蛋白质单体、环状同源多聚体、四面体纳米颗粒和目标结合蛋白等。
    image.png

    在ProteinMPNN的基础上,Exscientia提出了一种针对抗体结构进行优化的微调逆折叠模型AbMPNN,该模型在抗体序列恢复和结构稳健性方面优于通用蛋白质模型,尤其在超可变区CDR-H3环上有显著改进。
    image.png
    image.png
    image.png

    参数说明

    PDB File

    蛋白的结构文件,PDB格式。

    Chain

    指定需要设计的链,多条链用逗号分割,例如:‘A,B’。

    Number of Sequences

    输出设计的序列数目。

    Sampling Temp

    氨基酸采样温度,T=0.0表示取argmax,T>>1.0表示随机采样。建议的取值为0.1、0.15、0.2、0.25、0.3。较高的值会导致更多的多样性。当需要设计的序列数目较大时,为了获取较多多样性(不重复)序列,建议增大该参数,如设置为0.25

    Position Type

    设计残基模式:固定(Fix,指定下一步Position中的残基在设计时保持不变)或者设计(Design,指定下一步Position中的残基可进行设计而其他未指定残基在设计时保持不变)。默认:Fix。

    Position

    可选参数,用于指定需要操作的氨基酸位置。根据 Position Type 的设置,对选定的氨基酸进行固定或设计。

    输入格式为:链名 + 残基编号范围,例如:

    A1-10,A30,B12-25
    

    注意:

    • 氨基酸编号从 1 开始计数,而非 PDB 文件中的原始编号;
    • 同一条链内的多个位置使用空格分隔;
    • 不同链之间使用逗号分隔;
    • Chain 与 Position 两个参数必须至少填写一个。

    Omit_AAS

    可选参数,指定在生成的结果序列中不许出现的氨基酸种类。

    Bias_AAS

    可选参数,通过数值控制生成结果中各类型残基的偏向性,文本文件格式,通过残基类型,数值来指定,支持多种残基,每行放置一类残基,如:

    H,1.5
    D,1.0
    C,-1.0
    

    残基偏向性数值意义:

    • 0,表示没有偏向性(默认)
    • 小于0,表示少出现
    • 大于0,表示多出现
    • 数值的绝对值越大,对应的偏向程度越高。推荐的绝对值,如:0.5,1.0,1.5

    Design Mode

    可选参数,可指定设计时参考的模式。具体含义如下:
    Homomer:基于同源多聚体进行序列设计;
    use_soluble_mode:基于可溶蛋白模型进行序列设计;
    antibody_design:基于抗体优化模型AbMPNN进行序列设计;
    ligandMPNN:升级版ProteinMPNN,专门用于模拟蛋白质与非蛋白质组分(如小分子、核苷酸和金属)之间的相互作用。
    cyclic:环肽的逆折叠序列设计

    Save Probablility

    MPNN预测的每个位置的概率:0为不进行预测,1为进行预测。

    结果说明

    输出结果文件result.fasta,包含最终设计的序列。
    序列名称中包含多个评价指标:

    1. Score:设计残基的概率评分,通常分值越小越好。概率评分是设计残基平均概率的负对数(-logP),因此评分越小意味着平均概率值越大。
    2. Global Score:序列中所有残基的整体概率评分,通常分值越小越好。概率评分是设计残基平均概率的负对数(-logP),因此评分越小意味着平均概率值越大。
    3. seq_recovery:序列恢复率(与原序列的相似程度),0-1之间,越高表示与原序列越相似
    • 输出最优(打分最佳)的复合物序列complex.fasta
    • 指定参数Save Probablility时,输出probs.tar.gz,包含预测的每个位置的概率。

    指定参数--ligandMPNN时,result.fasta序列名称包含指标:

    1. overall_confidence:设计序列的全序列的置信度评分,数值在0~1.0之间,数值越大表示序列置信度越高
    2. ligand_confidence:设计序列的所有已设计残基的置信度评分,数值在0~1.0之间,数值越大表示已设计部分序列的置信度越高
    3. seq_rec:序列恢复率(与原序列的相似程度),0-1之间,越高表示与原序列越相似
    • 指定参数--pack_side_chains时,输出设计后的结构打包文件packed_side_chains.tar.gz,包含最终设计的序列对应的复合物结构PDB文件。

    参考文献

    • Robust deep learning based protein sequence design using ProteinMPNN. bioRxiv 2022.06.03.494563; DOI: 10.1101/2022.06.03.494563
    • AbMPNN: https://arxiv.org/abs/2310.19513
    • Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, D. Baker. Atomic context-conditioned protein sequence design using LigandMPNN. bioRxiv 2023.12.22.573103;DOI:10.1101/2023.12.22.573103

    Protein Design (ProteinMPNN)

    Introduction

    ProteinMPNN is a deep learning-based protein sequence design method that achieves a sequence recovery rate of 52.4% on natural protein scaffolds, compared to 32.9% for Rosetta. Adding noise during the training process can improve the sequence recovery rate of the protein structural model, and the resulting sequences can more robustly encode their structures. X-ray crystallography, cryo-electron microscopy, and functional studies have also demonstrated the wide applicability and high accuracy of ProteinMPNN, which has successfully rescued previously failed protein monomers, cyclic homooligomers, tetrahedral nanoparticles, and target-binding proteins designed using Rosetta or AlphaFold.
    image.png
    Building on ProteinMPNN, Exscientia introduced AbMPNN, a fine-tuned inverse folding model optimized for antibody structures. It outperforms general protein models in antibody sequence recovery and structural robustness, with particularly large improvements on the hypervariable CDR-H3 loop.
    image.png

    Parameters

    PDB File

    Protein structure file in PDB format.

    Chain

    Specify the chain(s) to be designed. Separate multiple chains with commas, for example: ‘A,B’.

    Number of Sequences

    Number of designed sequences to output.

    Sampling Temp

    Amino acid sampling temperature, T=0.0 means argmax, T>>1.0 means random sampling. The suggested values are 0.1, 0.15, 0.2, 0.25, 0.3. Higher values result in more diversity. When the required number of designed sequences is large, increase this parameter—e.g., set it to 0.25—to obtain greater sequence diversity and reduce duplicates.
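
    The effect of the sampling temperature can be illustrated with a small, self-contained sketch. It is not the module's code: the logits below are hypothetical values for three amino acids, and the functions only show the general technique of dividing logits by T before a softmax, with T=0 treated as argmax.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng=random):
    """Pick an index: argmax when temperature is 0, otherwise sample from the softmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]                 # hypothetical scores for three amino acids
low_t = softmax_with_temperature(logits, 0.1)    # sharply peaked on the best residue
high_t = softmax_with_temperature(logits, 10.0)  # close to uniform
```

    At T=0.1 almost all probability mass sits on the top-scoring residue, while at T=10 the three options are nearly equally likely, which is why higher temperatures yield more diverse sequences.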

    Position Type

    How the residues listed in Position are treated: Fix (the listed residues are kept unchanged and all other residues are designed) or Design (the listed residues are designed and all other residues are kept unchanged). Default: Fix.

    Position

    An optional parameter specifying the amino acid positions to operate on. Depending on the Position Type setting, the selected residues will be either fixed or designed.

    Input format: chain name + residue number range, for example:

    A1-10,A30,B12-25
    

    Notes:

    • Residue numbering starts from 1, not the original index in the PDB file.
    • Multiple positions within the same chain are separated by spaces.
    • Positions across different chains are separated by commas.
    • At least one of Chain and Position must be provided.
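
    As an illustration of the input format, the sketch below expands a Position string into per-chain residue indices. It is a hypothetical helper, not part of the module: it assumes single-letter chain names and accepts both commas and whitespace as separators.

```python
import re
from collections import defaultdict

# One token is a chain letter, a residue index, and an optional "-end" part.
_TOKEN = re.compile(r"^([A-Za-z])(\d+)(?:-(\d+))?$")

def parse_positions(spec):
    """Expand a spec such as 'A1-10,A30,B12-25' into {chain: sorted residue indices}."""
    result = defaultdict(set)
    for token in re.split(r"[,\s]+", spec.strip()):
        if not token:
            continue
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"Cannot parse position token: {token!r}")
        chain, start, end = match.group(1), int(match.group(2)), match.group(3)
        end = int(end) if end is not None else start
        if start < 1 or end < start:     # numbering starts from 1
            raise ValueError(f"Invalid residue range in token: {token!r}")
        result[chain].update(range(start, end + 1))
    return {chain: sorted(indices) for chain, indices in result.items()}

positions = parse_positions("A1-10,A30,B12-25")
```

    For the example string, chain A maps to residues 1 to 10 plus residue 30, and chain B maps to residues 12 to 25.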

    Omit_AAS

    Optional parameter specifying the types of amino acids that are not allowed to appear in the generated sequence.

    Bias_AAS

    Optional parameter that biases the generated sequences toward or away from particular residue types. Provide a text file with one residue type per line in the form residue_type,value; multiple residue types are supported, for example:

    H,1.5
    D,1.0
    C,-1.0
    

    Meaning of residue bias values:

    • 0 indicates no bias (default)
    • <0 indicates less frequent appearance
    • >0 indicates more frequent appearance
    • The larger the absolute value, the stronger the bias. Recommended absolute values: 0.5, 1.0, 1.5
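
    A minimal sketch of how such a bias file could be read and applied. The helper names are hypothetical, and the assumption that each bias is simply added to the model's per-residue logits before sampling is ours, not something this page states.

```python
def parse_bias_lines(lines):
    """Parse 'residue,value' lines into a {residue: bias} dict; blank lines are skipped."""
    valid = set("ACDEFGHIKLMNPQRSTVWY")
    biases = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        residue, sep, value = line.partition(",")
        residue = residue.strip().upper()
        if not sep or residue not in valid:
            raise ValueError(f"Line {number}: expected 'residue,value', got {line!r}")
        biases[residue] = float(value)
    return biases

def apply_bias(logits, biases):
    """Assumed behaviour: add each residue's bias to its logit (0 if unlisted)."""
    return {aa: score + biases.get(aa, 0.0) for aa, score in logits.items()}

biases = parse_bias_lines(["H,1.5", "D,1.0", "C,-1.0", ""])
shifted = apply_bias({"H": 0.0, "C": 0.3, "A": 0.2}, biases)
```

    Under this assumption a positive value raises the chance of sampling that residue, a negative value lowers it, and unlisted residues are unaffected.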

    Design Mode

    Optional parameter specifying the reference mode for design. Specific meanings are as follows:
    Homomer: Sequence design for homo-oligomers;
    use_soluble_mode: Sequence design based on soluble protein models, namely SolMPNN, the MPNN model trained exclusively on soluble protein data.
    antibody_design: Sequence design based on the antibody optimization model AbMPNN, the model obtained by fine-tuning the ProteinMPNN model using antibody structure data.
    ligandMPNN: Enable small-molecule (ligand) interaction modeling.
    cyclic: Inverse folding sequence design for cyclic peptides.

    When none of the above options are selected, the default ProteinMPNN model will be used, which is trained on all protein structures from the PDB database.

    Save Probability

    Whether to save the per-position probabilities predicted by MPNN: 0 = do not save, 1 = save.

    Results

    The output file result.fasta contains the final designed sequences.
    The sequence names contain multiple evaluation metrics:

    1. Score: the probability score of the designed residues; lower is generally better. It is the average negative log-probability (-logP) of the designed residues, equivalently the negative log of their geometric-mean probability, so a lower score means a higher average probability.

    2. Global Score: the probability score over all residues in the sequence; lower is generally better. It is computed in the same way as Score, but averaged over every residue rather than only the designed ones.

    3. seq_recovery: the sequence recovery rate (similarity to the original sequence), between 0 and 1. A higher value means the designed sequence is closer to the original sequence.
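
    The metrics above can be reproduced on toy data. The sketch below is an illustration under our own assumptions (natural logarithm, hypothetical per-residue probabilities and sequences); it computes the mean of -ln(p), which equals the negative log of the geometric-mean probability, and the fraction of identical positions.

```python
import math

def mean_neg_log_prob(probs):
    """Average of -ln(p) over the given per-residue probabilities."""
    return sum(-math.log(p) for p in probs) / len(probs)

def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("Sequences must have the same length")
    matches = sum(1 for d, n in zip(designed, native) if d == n)
    return matches / len(native)

# Hypothetical probabilities assigned to the chosen residue at three positions.
confident = mean_neg_log_prob([0.9, 0.8, 0.95])   # high probabilities, low score
uncertain = mean_neg_log_prob([0.3, 0.2, 0.25])   # low probabilities, high score
recovery = sequence_recovery("MKTAYIAK", "MKTAYLAR")
```

    Passing only the designed positions gives a Score-like value, and passing every position gives a Global Score-like value.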

    • When the parameter Save Probability is specified, the output probs.tar.gz contains the predicted probability for each position.

    When the parameter --ligandMPNN is specified, the sequence names in result.fasta contain the following metrics:

    1. overall_confidence: Confidence score for the full designed sequence, ranging from 0 to 1.0. A higher value indicates higher sequence confidence.

    2. ligand_confidence: Confidence score for all designed residues of the sequence, ranging from 0 to 1.0. A higher value indicates higher confidence in the designed part of the sequence.

    3. seq_rec: Sequence recovery rate (similarity to the original sequence), ranging from 0 to 1. A higher value indicates greater similarity to the original sequence.

    • Outputs the best-scoring complex sequence as complex.fasta.

    • When specifying the parameter --pack_side_chains, outputs the side-chain-packed structure file packed_side_chains.tar.gz, which contains the PDB structure files of the final designed complex corresponding to the designed sequences.

    References

    • Robust deep learning based protein sequence design using ProteinMPNN. bioRxiv 2022.06.03.494563; DOI: 10.1101/2022.06.03.494563
    • AbMPNN: https://arxiv.org/abs/2310.19513
    • Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, D. Baker. Atomic context-conditioned protein sequence design using LigandMPNN. bioRxiv 2023.12.22.573103;DOI:10.1101/2023.12.22.573103
  • Name: FASTA File
    Description: FASTA文件是一个用于指定fasta文件的模块,可用于其他模块的输入。会对FASTA文件的有效性进行判断。 FASTA File is a module for specifying a FASTA file that can be used as input to other modules. The file is checked for validity.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-08-15 14:11:28
    Reference: NA

    FASTA File

    简介

    FASTA File是一个指定FASTA文件的模块,可以用于其他模块的输入。会对FASTA文件的有效性进行判断。

    参数说明

    FASTA File

    上传FASTA文件

    结果说明

    输出一个对应的FASTA文件,会对文件的有效性进行判断。

    FASTA File

    Introduction

    FASTA File is a module for specifying a FASTA file that can be used as input to other modules. The file is checked for validity.

    Parameter

    FASTA File

    Upload a FASTA file.

    Result

    Outputs the corresponding FASTA file after checking its validity.

  • Name: AlphaShape
    Description: 基于分子三维形状和药效团的虚拟筛选,算法在三维构象的基础上进行基于分子三维相似性的虚拟筛选。通过结合高斯函数与深度神经网络模型,计算精度领先同类型商业算法。 A molecular shape and pharmacophore-based virtual screening module. The AlphaShape algorithm performs virtual screening or protein structure search based on the three-dimensional similarity of molecules on the basis of three-dimensional conformation. By combining the Gaussian function and the deep neural network model, the calculation accuracy achieves SOTA.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-11-11 03:23:06
    Reference: X. Yan, J. Li, et al.. J. Chem. Inf. Model., 2013, 53(8), 1967–1978. X. Yan, J. Li, et al., J. Comput. Chem., 2014, 35(15), 1122-1130.

    AlphaShape

    简介

    AlphaShape(简称AlphaS)是一种构象表征与识别算法,可以基于分子的三维空间形状和药效团等药学特征比较进行高通量的虚拟筛选,可以最大化区分海量化合物中与已知活性分子相似的活性化合物(筛选的化合物库分子可使用AlphaConf进行构象生成)。也可用于蛋白质结构域匹配以指导蛋白质设计。

    通过创造性地在高斯函数表征方式之上融合深度学习技术,AlphaShape虚拟筛选的计算精度已经超越主流商业算法(例如Schrodinger的Phase,OpenEye的ROCS),在DUD-E标准数据集的测试中,虚拟筛选的AUC值达到了0.837(对比Phase与ROCS的0.663及0.696)。
    image.png
    通过采用高性能计算(HPC)技术,特别是NVIDIA的GPU加速技术,目前在搜索或筛选速度上都领先同领域商业软件。以小分子化合物筛选为例,使用一块GPU卡,数小时即可筛完全世界所有的现货商业化合物库的数千万分子,一天可高通量虚拟筛选上亿个化合物分子。

    目前已被多家合作药企用于虚拟筛选并成功发现生物活性分子。
    除了高精度之外,AlphaShape 还充分利用了GPU的能力。 一张GPU卡每天可以筛选大约 5000万种化合物。

    参数说明

    Private Library私有库筛选模式

    Query File

    输入查询分子文件,SDF格式

    Conformation Library

    小分子的构象库文件,由AlphaConf模块产生,AC.GZ格式

    Fragment Library

    小分子的片段库文件,由AlphaConf模块产生,AUX.GZ格式

    Top N

    输出和每个查询分子相似度排名前n个分子,默认100。

    Generate Query Conformation

    是否对输入的查询分子产生3D构象,True 表示生成,当输入分子是2D结构时可用,False表示不生成,直接使用输入分子的3D结构。

    Similarity Hits File

    输出最终相似度命中化合物的文件名称,SDF格式,默认文件名为hits.sdf

    Public Library系统公共库筛选模式

    Query File

    输入查询分子文件,SDF格式

    Public Library

    系统内置的小分子化合物数据库,可多选。

    Top N

    输出和每个查询分子相似度排名前n个分子,默认100。

    Generate Query Conformation

    是否对输入的查询分子产生3D构象,True 表示生成,当输入分子是2D结构时可用,False表示不生成,直接使用输入分子的3D结构。

    Similarity Hits File

    输出最终相似度命中化合物的文件名称,SDF格式,默认文件名为hits.sdf

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 相似度值信息,包含查询分子名称与库中分子名称。
    hits.sdf 筛选相似度最高的n个化合物。多个查询分子时,这个文件是多个查询分子命中化合物合并去重后的结果。
    result/AA-173-40757587.sdf 查询分子对应的命中化合物。每个查询分子都会生成一个对应的包含top n个命中化合物的文件

    其中result.csv,包含信息如下:

    字段名称 说明
    querymol 查询分子化合物名称
    confdb 化合物库名称
    molname 命中化合物名称
    Total Similarity 3D相似度值

    AlphaShape

    Introduction

    AlphaShape (AlphaS for short) is a conformation representation and recognition algorithm for high-throughput virtual screening based on the three-dimensional shape and pharmacophoric features of molecules. It is designed to pick out, from very large compound collections, the active compounds that resemble known active molecules (conformations for the molecules in the screening library can be generated with AlphaConf). It can also be used for protein domain matching to guide protein design.

    By integrating deep learning with a Gaussian function representation, AlphaShape achieves higher virtual screening accuracy than mainstream commercial algorithms (such as Schrodinger’s Phase and OpenEye’s ROCS). On the DUD-E benchmark dataset, its virtual screening AUC reached 0.837, compared with 0.663 for Phase and 0.696 for ROCS.
    image.png

    By employing high-performance computing (HPC), in particular NVIDIA GPU acceleration, AlphaShape also leads commercial software in the field in search and screening speed. For small-molecule screening, a single GPU card can screen the tens of millions of molecules in the world's in-stock commercial compound libraries within a few hours, and can perform high-throughput virtual screening of hundreds of millions of compounds in a day.

    It has been used by several partner pharmaceutical companies for virtual screening and has led to the successful discovery of bioactive molecules. In addition to high accuracy, AlphaShape fully leverages the capabilities of GPUs. A single GPU card can screen approximately 50 million compounds per day.

    Parameter Description

    Private Library Screening Mode

    Query File

    Input file of query molecules in SDF format.

    Conformation Library

    File of conformation libraries for small molecules, generated by the AlphaConf module, in AC.GZ format.

    Fragment Library

    File of fragment libraries for small molecules, generated by the AlphaConf module, in AUX.GZ format.

    Top N

    Output the top N molecules ranked by similarity to each query molecule, default is 100.

    Generate Query Conformation

    Whether to generate 3D conformations for the input query molecules. True for generation, useful when the input molecules are in 2D structure; False for direct use of the input molecules’ 3D structures.

    Similarity Hits File

    File name for the final hit compounds based on similarity, in SDF format, default file name is hits.sdf.

    Public Library Screening Mode

    Query File

    Input file of query molecules in SDF format.

    Public Library

    System’s built-in small molecule compound database, multiple selections allowed.

    Top N

    Output the top N molecules ranked by similarity to each query molecule, default is 100.

    Generate Query Conformation

    Whether to generate 3D conformations for the input query molecules. True for generation, useful when the input molecules are in 2D structure; False for direct use of the input molecules’ 3D structures.

    Similarity Hits File

    File name for the final hit compounds based on similarity, in SDF format, default file name is hits.sdf.

    Result Description

    The output includes:

    Output File Name Description
    result.csv Information on similarity values, including query molecule names and library molecule names.
    hits.sdf Top N screened compounds based on similarity. For multiple query molecules, this file is the merged and deduplicated result of top N hit compounds for each query molecule.
    result/AA-173-40757587.sdf Hit compounds corresponding to the query molecule. A file containing the top N hit compounds is generated for each query molecule.

    In result.csv, the information includes:

    Field Name Description
    querymol Query molecule name
    confdb Compound library name
    molname Hit compound name
    Total Similarity 3D similarity value
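
    A short sketch of post-processing result.csv, keeping the most similar hits for each query molecule. It assumes a comma-delimited file whose header row uses exactly the field names listed above; the rows in the example are invented for illustration.

```python
import csv
import io
from collections import defaultdict

def top_hits(csv_text, n=3):
    """Group rows by query molecule and keep the n most similar hits for each."""
    by_query = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_query[row["querymol"]].append((float(row["Total Similarity"]), row["molname"]))
    return {
        query: [name for _, name in sorted(hits, reverse=True)[:n]]
        for query, hits in by_query.items()
    }

# Hypothetical rows using the documented column names.
example = """querymol,confdb,molname,Total Similarity
q1,libA,mol_1,0.62
q1,libA,mol_2,0.91
q1,libA,mol_3,0.75
q2,libA,mol_1,0.40
"""
best = top_hits(example, n=2)
```

    For a real run, read the text with open("result.csv", newline="") and pass its contents to top_hits.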
  • Name: Format Conversion
    Description: 分子文件格式转换工具。支持的输入文件格式为:SD(.sdf、.sd)、SMILES(.smi、.csv、.tsv、.txt)、PDB(.pdb)、mol2。支持的输出文件格式为:SD(.sdf、.sd)、SMILES(.smi)、PDB(.pdb)。 A molecular file format conversion tool. Supported input file formats include: SD (.sdf, .sd), SMILES (.smi, .csv, .tsv, .txt),PDB(.pdb), mol2. Supported output file formats include: SD (.sdf, .sd), SMILES (.smi), PDB (.pdb).
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-28 02:46:13
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Format Conversion

    简介

    File Convert是基于RDKit对分子文件格式之间进行转换的模块。支持的输入文件格式为:SD(.sdf、.sd)、SMILES(.smi、.csv、.tsv、.txt)、PDB(.pdb)、mol2。支持的输出文件格式为:SD(.sdf、.sd)、SMILES(.smi)、PDB(.pdb)。

    参数说明

    Input File

    小分子结构文件,SD、SMILES、PDB或mol2格式。

    Output File

    输出文件名。更改文件扩展名。

    结果说明

    输入SDF文件转换成SMILES格式output.smi文件。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Format Conversion

    Introduction

    The File Convert module is designed to convert molecular file formats using RDKit. Supported input file formats include: SD (.sdf, .sd), SMILES (.smi, .csv, .tsv, .txt),PDB(.pdb), mol2. Supported output file formats include: SD (.sdf, .sd), SMILES (.smi), PDB (.pdb).

    Parameter Description

    Input File

    Input file containing the molecular structure, in SD, SMILES, PDB, or mol2 format.

    Output File

    Name of the output file. Set the file extension to the desired output format (.sdf, .sd, .smi, or .pdb).

    Result Description

    The converted file is written under the given output name. For example, an input SDF file converted to SMILES format is saved as output.smi.

    References

    • Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.
  • Name: Metabolism Site Prediction
    Description: 预测小分子被CYP450代谢的代谢位点。模型对小分子的每个原子进行评估被代谢的可能性,并通过打分排序。 Predict which sites in a molecule are most liable to be metabolised by Cytochrome P450.
    Tags: undefined
    Author: Rydberg P
    Release: 2022-05-27 08:27:00
    Reference: Bioinformatics. 2010 Dec 1;26(23):2988-9.

    Metabolism Site Prediction

    简介

    Metabolism Site Prediction模块为预测小分子被CYP450代谢的代谢位点。模型对小分子的每个原子进行评估被代谢的可能性,并通过打分排序。支持的小分子输入文件格式为:SD(.sdf、.sd)、SMILES(.smi)。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    结果说明

    输出结果包括:

    输出文件名称 说明
    molecule_1_atomNumbers.png 原子编号图片
    molecule_1_heteroAtoms.png P450代谢酶(CYP3A4)预测结果图
    molecule_1_heteroAtoms1A2.png P450代谢酶(CYP1A2)预测结果图
    molecule_1_heteroAtoms2C19.png P450代谢酶(CYP2C19)预测结果图
    molecule_1_heteroAtoms2C9.png P450代谢酶(CYP2C9)预测结果图
    molecule_1_heteroAtoms2D6.png P450代谢酶(CYP2D6)预测结果图
    results.csv 评估被代谢可能性的csv文件
    results.html 评估被代谢可能性的html文件

    其中results.html,包含如下信息:

    Field Name Description
    Rank 排序
    Atom 原子类型和序号
    Score 最终的打分,也是排序的标准,打分越低,排名越前,被代谢的可能性越高。
    Energy 能量值,基于DFT计算以及原子匹配得到的原子激活的能量值。是打分Score的重要参考项。
    Accessibility 原子到分子中心的相对拓扑距离。

    参考文献

    Rydberg P, Gloriam DE, Olsen L. The SMARTCyp cytochrome P450 metabolism prediction server. Bioinformatics. 2010 Dec 1;26(23):2988-9.

    Metabolism Site Prediction

    Introduction

    The Metabolism Site Prediction module predicts the sites at which small molecules are metabolized by cytochrome P450 (CYP450) enzymes. The model evaluates how likely each atom in the molecule is to be metabolized and ranks the atoms by score. Supported input file formats: SD (.sdf, .sd) and SMILES (.smi).

    Parameter Description

    Input File

    Input file containing the small molecule structure in SDF or SMILES format.

    Result Description

    The output includes:

    Output File Name Description
    molecule_1_atomNumbers.png Image showing atom numbering
    molecule_1_heteroAtoms.png Prediction results for P450 enzyme (CYP3A4)
    molecule_1_heteroAtoms1A2.png Prediction results for P450 enzyme (CYP1A2)
    molecule_1_heteroAtoms2C19.png Prediction results for P450 enzyme (CYP2C19)
    molecule_1_heteroAtoms2C9.png Prediction results for P450 enzyme (CYP2C9)
    molecule_1_heteroAtoms2D6.png Prediction results for P450 enzyme (CYP2D6)
    results.csv CSV file evaluating the likelihood of metabolism
    results.html HTML file evaluating the likelihood of metabolism

    The results in results.html include the following information:

    Field Name Description
    Rank Ranking
    Atom Atom type and number
    Score Final score, also the sorting criterion. The lower the score, the higher the ranking, indicating a higher likelihood of metabolism.
    Energy Activation energy of the atom, obtained from DFT calculations and atom matching. An important component of Score.
    Accessibility Relative topological distance of the atom to the molecular center.

    References

    • Rydberg P, Gloriam DE, Olsen L. The SMARTCyp cytochrome P450 metabolism prediction server. Bioinformatics. 2010 Dec 1;26(23):2988-9.
  • Name: Toxic Fragment Identification
    Description: 识别小分子结构中的毒效片段,从文献中收集了大量的毒效片段构成毒效片段库,利用子结构匹配方法,实现对化合物库中每个分子进行毒效片段匹配,并通过不同颜色区分。 Detect toxic fragments in small molecules. The toxic fragments were collected from the published literature.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-05-26 16:00:00
    Reference:

    Toxic Fragment Identification

    简介

    Toxic Fragment Identification模块用于识别小分子的毒效片段,从文献中收集了大量的毒效片段构成毒效片段库,利用子结构匹配方法,实现对化合物库中每个分子进行毒效片段匹配,并通过不同颜色区分。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    结果说明

    得到化合物库中与小分子毒效片段匹配的output.xlsx文件,并通过不同颜色区分毒性片段。
    output.xlsx包括如下信息:

    字段名称 说明
    Smiles 分子的smiles
    Image 分子的化学结构图片,包括毒效片段的匹配。
    MolName 分子名称
    Smarts 毒效片段的Smarts
    Bad_type 毒性类型
    BadNum 毒性数量
    Literature 参考文献
    Colors 毒效片段匹配颜色

    Bad_type毒性类型,包括如下:

    Potential_electrophilic_agents,Inpharmatica,Idiosyncratic_toxicity_(RM_formation),Non-genotoxic_carcinogenicity,Endocrine_disruption,MLSMR,AlphaScreen-HIS-FHs,AlphaScreen-FHs,Nonbiodegradable_compounds,Acute_Aquatic_Toxicity,AlphaScreen-GST-FHs,LINT,Promiscuity,LD50_mo_oral,Reactive,_unstable,_toxic,Skin_sensitization,Chelating_agents,Genotoxic_carcinogenicity,_mutagenicity,Developmental_and_mitochondrial_toxicity,PAINS,Hepatotoxicity_Nephrotoxicity,SMARTSfilter,Hepatotoxicity,Toxtree,Myelotoxicity
    

    Toxic Fragment Identification

    Introduction

    The Toxic Fragment Identification module is used to identify toxic fragments of small molecules. A large library of toxic fragments has been collected from the literature. Using a substructure matching method, this module matches toxic fragments in each molecule of the compound library and distinguishes them with different colors.

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Result Description

    The module produces output.xlsx, which lists the toxic fragments matched in each molecule of the input library; matched fragments are color-coded.

    The output.xlsx includes the following information:

    Field Name Description
    Smiles Molecular SMILES
    Image Chemical structure image of the molecule, including the matched toxic fragments.
    MolName Molecule name
    Smarts Toxic fragment SMARTS
    Bad_type Type of toxicity
    BadNum Number of toxicities
    Literature Literature reference
    Colors Colors for toxic fragment matches

    The Bad_type toxicity types include:

    Potential_electrophilic_agents, Inpharmatica, Idiosyncratic_toxicity_(RM_formation), Non-genotoxic_carcinogenicity, Endocrine_disruption, MLSMR, AlphaScreen-HIS-FHs, AlphaScreen-FHs, Nonbiodegradable_compounds, Acute_Aquatic_Toxicity, AlphaScreen-GST-FHs, LINT, Promiscuity, LD50_mo_oral, Reactive,_unstable,_toxic, Skin_sensitization, Chelating_agents, Genotoxic_carcinogenicity,_mutagenicity, Developmental_and_mitochondrial_toxicity, PAINS, Hepatotoxicity_Nephrotoxicity, SMARTSfilter, Hepatotoxicity, Toxtree, Myelotoxicity
    
  • Name: mRNA Optimization (AlphaRNA)
    Description: 优化mRNA序列以获得更好的密码子偏好性和更稳定的二级结构,以优化其表达量和半衰期、抗体滴度等。支持人、小鼠、大鼠、猪等种属。 Optimize mRNA sequences for better codon usage bias and more stable secondary structures, to enhance its expression level, half-life, antibody titer, etc.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-05-17 07:01:58
    Reference:

    mRNA Optimization (AlphaRNA)

    简介

    AlphaRNA是Wecomput开发的程序,可以有效地共同优化CAI(Codon Adaptation Index)和MFE(Minimum free energy)/AUP(Average unpaired probability)。AlphaRNA提供了一种基于DFA图进行Motif约束的方法,该方法在不明显增加计算量的同时,隐式地将约束加入到密码子优化的过程中,以获得更好的密码子偏好性和更稳定的二级结构,从而优化其表达量和半衰期、抗体滴度等。可以支持任意数量和长度的序列。
    image.png

    参数说明

    Amino acid sequence of CDS/ORF

    所需要优化的编码区氨基酸序列,例如:

    MDIDPYKEFGATVELLSFLPSDFFPSVRDLLDTASALYREALESPEHCSPHHTALRQAIL
    

    Enzyme restrictions

    要限制(避免出现在优化后序列中)的酶切位点,可多选。

    Motif restrictions

    需要限制的Motif序列,可指定多个,可手动输入不在列表中的新序列,使用空白符分隔。

    Weights of CAI

    CAI的lambda系数,正值越大能够调大结果中的CAI, 可设置多个,可为负值,负值越大表示越降低CAI。

    Weights of GCR

    GC碱基比例(GCR)的lambda系数,正值越大能够调大结果中的GCR, 可设置多个,可为负值,负值越大表示越降低GCR。

    结果说明

    输出结果文件为result.csv,包含信息如下:

    字段名称 说明
    lambda_cai CAI的lambda系数
    lambda_gcr GCR的lambda系数
    full_sequence 优化后的序列
    CAI 密码子适应指数
    AUP 平均未配对率
    GCR GC碱基比例
    MFE Structure 最小自由能二级结构
    dG(MFE)[kcal/mol] 最小自由能

    mRNA Optimization (AlphaRNA)

    Introduction

    AlphaRNA is a program developed by Wecomput that efficiently co-optimizes the Codon Adaptation Index (CAI) and the minimum free energy (MFE) / average unpaired probability (AUP). It provides a method for motif-constrained codon optimization based on DFA graphs. The constraints are incorporated implicitly into the codon optimization process, without significantly increasing the computational cost, to achieve better codon usage and more stable secondary structures, and thereby improve expression level, half-life, antibody titer, and so on. Any number of sequences of any length is supported.
    image.png

    Parameter

    Amino acid sequence of CDS/ORF

    The amino acid sequence of the coding region that needs to be optimized, for example:

    MDIDPYKEFGATVELLSFLPSDFFPSVRDLLDTASALYREALESPEHCSPHHTALRQAIL
    

    Enzyme restrictions

    Restriction enzyme cleavage sites to avoid in the optimized sequence. Multiple selections are allowed.

    Motif restrictions

    Motif sequences to avoid. Multiple motifs can be specified, and new sequences that are not in the list can be entered manually, separated by whitespace.
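
    To check a sequence against a list of restricted motifs after optimization, a simple scan is enough. The helper below is a hypothetical illustration, not AlphaRNA's DFA-based method; it treats U and T as equivalent and uses the EcoRI (GAATTC) and BamHI (GGATCC) recognition sites as example motifs.

```python
def find_motifs(sequence, motifs):
    """Return {motif: [0-based start positions]} for every motif present in sequence."""
    seq = sequence.upper().replace("U", "T")
    found = {}
    for motif in motifs:
        target = motif.upper().replace("U", "T")
        starts = [i for i in range(len(seq) - len(target) + 1)
                  if seq.startswith(target, i)]
        if starts:
            found[motif] = starts
    return found

motif_field = "GAATTC GGATCC"            # whitespace-separated, as in the parameter
hits = find_motifs("AUGGAAUUCGCC", motif_field.split())
```

    An empty result means none of the restricted motifs occurs in the sequence.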

    Weights of CAI

    The lambda coefficient of CAI. A larger positive value increases the CAI of the result; a negative value decreases it, and the more negative the value, the greater the reduction. Multiple values can be set.

    Weights of GCR

    The lambda coefficient of the GC ratio (GCR). A larger positive value increases the GCR of the result; a negative value decreases it, and the more negative the value, the greater the reduction. Multiple values can be set.

    Result

    The output file is result.csv and contains the following information:

    Field Name Description
    lambda_cai Lambda coefficients of CAI
    lambda_gcr Lambda coefficients of GCR
    full_sequence The optimized sequence
    CAI Codon adaptation index
    AUP Average unpaired probability
    GCR The proportion of GC bases
    MFE Structure The minimum free energy structure
    dG(MFE)[kcal/mol] The value of the minimum free energy
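
    Two of the reported fields, GCR and CAI, are simple enough to recompute by hand. The sketch below uses the standard definition of CAI as the geometric mean of the relative adaptiveness w of each codon; the weight table is a hypothetical toy example, since real tables are species specific, and this is not AlphaRNA's own code.

```python
import math

def gc_ratio(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq)

def cai(seq, weights):
    """Geometric mean of the relative adaptiveness w of each codon in seq."""
    seq = seq.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # Codons missing from the table are skipped in this sketch.
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("No codon of the sequence is present in the weight table")
    return math.exp(sum(logs) / len(logs))

# Hypothetical weights: w = codon frequency / frequency of the most used
# synonymous codon, so the preferred codon of each amino acid has w = 1.0.
toy_weights = {"GCC": 1.0, "GCU": 0.5, "AAG": 1.0, "AAA": 0.25}

optimal = cai("GCCAAG", toy_weights)     # only preferred codons
mixed = cai("GCUAAA", toy_weights)       # only rarer codons
```

    A sequence built only from preferred codons has a CAI of 1, and rarer codons pull the value toward 0.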
  • Name: Extract Fv Sequence
    Description: 从抗体全长序列中提取Fv区序列的工具。 Extract the Fv region sequence from antibody full-length sequence.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-05-16 11:18:14
    Reference:

    Extract Fv Sequence

    简介

    Extract Fv Sequence 是一个用于从抗体全长序列中提取 Fv 区域(可变区)和 非 Fv 区域 序列的工具。

    参数说明

    Antibody Sequence File

    输入抗体全长序列文件,格式为 FASTA。

    Output File

    指定输出的抗体 Fv 区域序列文件 名称,格式为 FASTA。

    结果说明

    工具将输出两个 FASTA 文件:

    • Fv.fasta:仅包含 Fv 区域序列;
    • nonFv.fasta:包含非 Fv 区域(包括可能存在的 linker)的序列。

    Extract Fv Sequence

    Introduction

    Extract Fv Sequence is a tool designed to extract the Fv region (variable domain) and non-Fv region sequences from a full-length antibody sequence.

    Parameters

    Antibody Sequence File

    Input full-length antibody sequence file in FASTA format.

    Output File

    Specify the output filename for the Fv region sequence, in FASTA format.

    Results

    The tool generates two FASTA files:

    • Fv.fasta: contains only the Fv region sequence;
    • nonFv.fasta: contains the non-Fv region sequence (including any linker regions, if present).
  • Name: RNA Secondary Structure Prediction
    Description: 使用动态编程算法预测单链RNA或DNA序列的二级结构,返回单一的最佳结构和最低自由能。 Predict secondary structures of single-stranded RNA or DNA sequences using dynamic programming algorithms which yield a single optimal structure and the minimum free energy.
    Tags: undefined
    Author: Zuker & Stiegler
    Release: 2022-04-29 08:00:00
    Reference: Lorenz R, Bernhart SH, Höner Zu Siederdissen C, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011 Nov 24;6:26.

    RNA Secondary Structure Prediction

    简介

    使用动态编程算法预测单链RNA或DNA序列的二级结构,返回单一的RNA最佳结构和最低自由能。

    RNA二级结构符号说明

    长度为n的序列上的结构由相等长度的括号和点组成的字符串表示。i和j之间的碱基对用“(”在i和“)”在j位置表示,未配对的碱基用“.”表示。如下为RNA二级结构表示方式。

      (((..((((...)))).))) 
    

    与之对应的RNA二级结构图为:
    image.png

    参数说明

    RNA Sequence File

    RNA序列文件,FASTA格式。

    Output File

    输出文件名称。

    结果说明

    输出结果包括:

    输出文件名称 说明
    output.txt RNA序列二级结构的文本文件,其中包括序列、最佳二级结构以及与其对应的最小自由能(kcal/mol)。
    SeqN_2D.png 第N条RNA序列对应的二级结构图

    参考文献

    Lorenz R, Bernhart SH, Höner Zu Siederdissen C, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011 Nov 24;6:26.

    RNA Secondary Structure Prediction

    Introduction

    A dynamic programming algorithm is used to predict the secondary structure of a single-stranded RNA or DNA sequence, returning a single optimal structure and its minimum free energy.

    RNA Secondary Structure Symbols

    The structure of a sequence of length n is represented by a string of the same length made up of parentheses and dots. A base pair between positions i and j is represented by “(” at position i and “)” at position j, while unpaired bases are represented by “.”. Below is an example of an RNA secondary structure in this notation.

    (((..((((...)))).))) 
    

    The corresponding RNA secondary structure diagram is shown below.
    image.png
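
    The dot-bracket notation can be decoded into explicit base pairs with a stack. The following is a minimal sketch with a hypothetical function name; positions are reported 1-based, matching the i and j convention above.

```python
def parse_dot_bracket(structure):
    """Return the 1-based (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for position, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(position)       # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"Unmatched ')' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"Unexpected symbol {symbol!r} at position {position}")
    if stack:
        raise ValueError(f"Unmatched '(' at position {stack[-1]}")
    return sorted(pairs)

pairs = parse_dot_bracket("(((..((((...)))).)))")
```

    For the example structure this yields seven pairs: an outer stem of three pairs starting at (1, 20) and an inner stem of four pairs starting at (6, 16).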

    Parameter Description

    RNA Sequence File

    RNA sequence file in FASTA format.

    Output File

    Name of the output file.

    Result Description

    The output results include:

    Output File Name Description
    output.txt Text file of the RNA sequence’s secondary structure, including the sequence, best secondary structure, and the corresponding minimum free energy (kcal/mol).
    SeqN_2D.png Secondary structure diagram for the Nth RNA sequence

    Reference

    Lorenz R, Bernhart SH, Höner Zu Siederdissen C, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011 Nov 24;6:26.

  • Name: RNA 3D Structure Prediction
    Description: 在给定二级结构和实验限制的情况下,从头预测RNA的三维结构模型(可长达约 300 nts )。除了要预测的 RNA 序列外,您还需要提供一个描述二级结构的文件:具有以圆点符号表示的二级结构的文本文件。 Build three-dimensional de novo models of RNAs of sizes up to ~300 nts, given secondary structure and experimental constraints. Besides the RNA sequence to predict, you also need to provide a secondary structure file: a text file with secondary structure described in the dot-parentheses notation.
    Tags: undefined
    Author: Cheng, C.Y., Chou, F.-C., and Das, R.
    Release: 2022-04-30 00:00:00
    Reference: Cheng CY, Chou FC, Das R. Modeling complex RNA tertiary folds with Rosetta. Methods Enzymol. 2015;553:35-64.

    RNA 3D Structure Prediction

    简介

    RNA 3D Structure Prediction基于Rosetta中的RNA结构建模算法。该算法基于现有RNA晶体结构的短片段(1到3个核苷酸)进行组装,这些片段的序列与目标RNA的子序列相匹配。RNA片段组装(Fragment Assembly of RNA, FARNA)算法是一个蒙特卡洛过程,由一个低分辨率的基于知识的能量函数指导。然后,这些模型可以在全原子力场下进一步完善,以产生更真实的结构。由此产生的能量也能更好地区分原生构象和非原生构象。该计算方法被称为FARFAR(RNA片段组装与全原子细化)。

    参数说明

    Input File

    从5’到3’的序列。通常用小写字母,但大写字母是可以接受的,并且会被转换。支持多条序列同时生成3D结构。

    Secstru File

    点括号表示RNA二级结构文件。可以通过模块“RNA Secondary Structure Prediction”获取。
    RNA二级结构文件,文本格式,例如:

    >a
    auauccccauauaucccauauauccccgcgcgucccgcgc
    ........((((((...))))))....(((((...))))) ( -6.60)
    >b
    aaauccccauauaucccauauauccccgcgcgucccgcgc
    ........((((((...))))))....(((((...))))) ( -6.60)
    

    结果说明

    得到RNA结构的PDB文件S_000001.pdb。

    参考文献

    Cheng CY, Chou FC, Das R. Modeling complex RNA tertiary folds with Rosetta. Methods Enzymol. 2015;553:35-64.

    RNA 3D Structure Prediction

    Introduction

    RNA 3D Structure Prediction uses the RNA structure modeling algorithm in Rosetta, which assembles short fragments (1 to 3 nucleotides) taken from existing RNA crystal structures whose sequences match subsequences of the target RNA. The Fragment Assembly of RNA (FARNA) algorithm is a Monte Carlo process guided by a low-resolution, knowledge-based energy function. The resulting models can then be refined in a full-atom force field to produce more realistic structures, and the resulting energies also discriminate better between native and non-native conformations. This computational method is known as FARFAR (Fragment Assembly of RNA with Full Atom Refinement).

    Parameter Description

    Input File

    Sequence(s) from 5’ to 3’. Typically in lowercase letters, but uppercase letters are acceptable and will be converted. Supports generating 3D structures for multiple sequences simultaneously.

    Secstru File

    RNA secondary structure file in dot-bracket notation. This can be obtained using the “RNA Secondary Structure Prediction” module.
    Example RNA secondary structure file in text format:

    >a
    auauccccauauaucccauauauccccgcgcgucccgcgc
    ........((((((...))))))....(((((...))))) ( -6.60)
    >b
    aaauccccauauaucccauauauccccgcgcgucccgcgc
    ........((((((...))))))....(((((...))))) ( -6.60)
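
    The record layout shown above (a ">name" header, the sequence, then a dot-bracket line optionally followed by an energy in parentheses) can be sanity-checked before submission. The following is a minimal sketch, not the module's own parser: the function names parse_secstru and base_pairs are hypothetical, and it assumes that only "(", ")" and "." occur in the structure string.

```python
def base_pairs(structure):
    """Return sorted (i, j) index pairs of matching brackets; raise if unbalanced."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def parse_secstru(text):
    """Parse records of three lines each: >name, sequence, dot-bracket [energy]."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per record")
    records = []
    for k in range(0, len(lines), 3):
        header, seq, struct_line = lines[k:k + 3]
        if not header.startswith(">"):
            raise ValueError(f"missing '>' header: {header!r}")
        # The first whitespace-separated token is the dot-bracket string;
        # anything after it (e.g. "( -6.60)") is treated as the energy and ignored.
        structure = struct_line.split()[0]
        if len(structure) != len(seq):
            raise ValueError(f"length mismatch in record {header[1:]!r}")
        records.append({
            "name": header[1:],
            "sequence": seq.lower(),
            "structure": structure,
            "pairs": base_pairs(structure),
        })
    return records
```

    A record passes this check only if its structure string has the same length as its sequence and every bracket is matched.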
    

    Result Description

    The predicted RNA structure is returned as a PDB file, S_000001.pdb.

    Reference

    Cheng CY, Chou FC, Das R. Modeling complex RNA tertiary folds with Rosetta. Methods Enzymol. 2015;553:35-64.

  • Name: Immunogenicity Prediction (AlphaMHC v2.0)
    Description: AlphaMHC算法采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条公开及私有的与免疫原性相关的湿实验数据(包括亲和力数据、NGS数据、质谱数据等)进行训练,成功实现了从序列到临床免疫原性风险的端到端的预测,并通过上百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)进行验证,AlphaMHC能够准确区分免疫原性的高低,ROC-AUC达0.87,准确性超过80%(部分测试集高达91%),表现出比现有方法显著更优的预测性能,是已知唯一一个可以得到高质量临床数据验证的算法。 注:推荐在WeSeq序列编辑器中调用此功能(Immunogenicity按钮),可以在序列中直观看到T细胞表位的位置。 The AlphaMHC algorithm utilizes popular NLP natural language processing technology and a novel multimodal fusion deep neural network architecture. It integrates nearly one billion publicly and privately available wet lab experimental data related to immunogenicity (including affinity data, NGS data, mass spectrometry data, etc.) for training. It successfully achieves end-to-end prediction of immunogenicity risk from sequence to clinical application and has been validated using over a hundred clinical real-world immunogenicity data from FDA and EMA (including mono-/multi-specific antibodies and recombinant proteins). AlphaMHC can accurately distinguish between high and low immunogenicity, with an ROC-AUC of 0.87 and an accuracy of over 80% (up to 91% for some test sets). It exhibits significantly superior predictive performance compared to existing methods and is the only algorithm known to have been validated with clinical data.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-05-03 13:53:09
    Reference:

    Immunogenicity Prediction (AlphaMHC v2.0)

    简介

    AlphaMHC是唯信计算为解决现有预测方法的已知问题而开发的下一代免疫原性预测算法,采用流行的NLP自然语言处理技术,全新的多模融合深度神经网络架构,整合了近10亿条公开及私有的与免疫原性相关的湿实验数据(包括亲和力数据、NGS数据、质谱数据等)进行训练,成功实现了从序列到临床免疫原性风险的端到端的预测,并通过上百条来自FDA、EMA的临床真实免疫原性数据(包括单/多特异性抗体和重组蛋白等)进行验证,AlphaMHC能够准确区分免疫原性的高低,ROC-AUC达0.87,准确性超过80%(部分测试集高达91%),表现出比现有方法显著更优的预测性能,是已知唯一一个可以得到临床数据验证的算法。

    算法特点:

    • 显着扩展的训练集空间。除了公开可用的数据集外,我们还从文献、专利和湿实验室合作者那里收集了更多数据。除了最常用的亲和力数据外,还考虑了更多的数据类型,例如T细胞激活数据、蛋白质组学数据、抗体测序数据等,它们贡献了超过10亿个数据条目/点。
    • 与仅预测MHC肽结合亲和力的大多数其他算法不同,AlphaMHC 预测临床水平的最终免疫原性,同时考虑除肽结合之外的其他重要影响因素,例如免疫呈递/耐受性、HLA等位基因频率等。
    • 针对上千个MHC-II型等位基因训练深度神经网络模型。在并行计算的支持下,所有支持的 MHC 等位基因都可以以高通量的方式同时计算。
    • 基于独家收集的高质量临床ADA数据集进行验证和优化

    参数说明

    Fasta File

    蛋白序列文件,FASTA格式。支持多条链以及多分子模式。

    请注意按下面的规则来书写序列名,因为目前免疫原性风险的评分是以整个分子为单位的,链名会影响到程序区分同个分子的多条链,并影响对于分子总的风险评级(risk per molecule),但不影响对链的TCE的识别。

    对于多条链的分子,序列名称应写为:分子名.链名,".“之前是分子名,”.“之后是链名,同个分子的不同链,只要”."之前的分子名保持一致就可以了,链名随意,顺序不限。

    例如,下面mol1是常见的单抗,mol2是多抗:

    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    
    >mol2.L1
    XXXXXXX
    >mol2.H1
    XXXXXXX
    >mol2.L2
    XXXXXXX
    >mol2.H2
    XXXXXXX
    

    HLA Allotypes

    预测HLA等位基因型。
    rep:32个代表性等位基因型,适用于一般人群。
    all:用于训练的所有非冗余人类等位基因型(1166个)。

    一般推荐使用默认的"rep",因为免疫原性的风险评分(risk)是基于rep的代表性HLA来确定的。

    Binding Affinity Profile

    导出每个 HLA 等位基因的结合亲和力曲线图,展示了与每条蛋白质链的 N 端到 C 端的所有15肽的结合亲和力。注意:即使“HLA Allotypes”选项设置为全部,也只会绘制代表性 HLA的曲线。

    结果说明

    输出结果包括:

    输出文件名称 说明
    score_immunogenicity_risk.csv 该结果展示了预测的每个分子的免疫原性风险(自动将同个分子的多条链的预测的潜在T细胞表位的结果进行汇总后综合评估所得)。
    detail_tce_of_chains.csv 该结果评估可以进行定向改造的HLA呈递表位,以降低免疫原性。
    BAProfile_of_mol.chain.png 不同HLA亚型与每条链的不同位置的亲和力的分布情况,更精细的展示了不同HLA的亲和力的差异。 从左到右的分布图表示从其中一条蛋白质链的N末端移动到C末端的15聚肽窗口的结合亲和力。 即使“HLA同种异型”选项设置为“全部”,也只会包括代表性的HLA等位基因。
    Heatmap_of_mol.chain.png 每个肽与代表性HLA之间结合亲和力的热图。Z-score是pAffinity,值越大(浅色)意味着预测结合越强。

    其中score_immunogenicity_risk.csv包括信息如下:

    字段名称 说明
    Protein_Id 蛋白序列名称
    Risk 预测的分子整体风险评估,高风险的分子为high,否则为low。
    Score 表位总长度,是整体风险评估的重要依据。
    TCE_Sequences 表位序列

    其中detail_tce_of_chains.csv包括信息如下:

    字段名称 说明
    Sequences 蛋白序列名称
    TCE 每条链的相对的高风险的T细胞表位
    Alleles_Number 递呈的HLA亚型数
    Alleles 递呈的HLA亚型
    Min_Affinity 亲和力最小值
    Median_Affinity 亲和力中位数
    Max_Affinity 亲和力最大值

    Immunogenicity Prediction (AlphaMHC v2.0)

    Introduction

    AlphaMHC is a next-generation immunogenicity prediction algorithm developed by Wecomput to address known issues with existing prediction methods. It combines natural language processing (NLP) techniques with a new multi-modal fusion deep neural network architecture, and it is trained on nearly one billion public and private wet-lab data points related to immunogenicity, including affinity, NGS, and mass spectrometry data. It achieves end-to-end prediction from sequence to clinical immunogenicity risk and has been validated on over a hundred real-world clinical immunogenicity records from the FDA and EMA, covering mono- and multi-specific antibodies and recombinant proteins. AlphaMHC distinguishes high from low immunogenicity with an ROC-AUC of 0.87 and an accuracy above 80% (up to 91% on some test sets), a significantly better predictive performance than existing methods. It is the only known algorithm that has been validated with clinical data.

    Feature highlights

    • Significantly expanded training set. Besides the publicly available data sets, we have collected more data from literature, patents, and wet-lab collaborators. In addition to the commonly used affinity data, further data types are considered, e.g., T cell activation data, proteomics data, and antibody sequencing data, which together contribute over 1 billion data entries/points.
    • Unlike most other algorithms, which predict only MHC-peptide binding affinity, AlphaMHC predicts the eventual immunogenicity at the clinical level, taking into account other important factors besides peptide binding, such as immune presentation/tolerance and HLA allele frequency.
    • A deep neural network model is trained for up to 5000+ MHC-II alleles. With parallel computing, all supported MHC alleles can be calculated simultaneously in a high-throughput manner, whereas similar methods can usually afford only a few representative alleles within a reasonable time.
    • Validated and optimized on an exclusively collected, high-quality clinical ADA data set.

    Parameter

    Fasta File

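    Protein sequence file in FASTA format. Multiple chains and multiple molecules are supported. The immunogenicity risk is scored per molecule, so sequence names are used to group the chains of one molecule: they affect the overall risk rating (risk per molecule) but not the identification of T cell epitopes (TCE) on each chain. For a molecule with several chains, write each sequence name as the molecule name, a period, then the chain name (molecule.chain). Chains that share the text before the period belong to the same molecule; chain names are arbitrary and may appear in any order. For example: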

    >mol1.A
    XXXXXXX
    >mol1.B
    XXXXXXX
    >mol2.A
    XXXXXXX
    >mol2.B
    XXXXXXX
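
    The naming rule above can be illustrated with a short grouping routine. This is only a sketch of how chains might be collected per molecule, not AlphaMHC's own code: the function name group_chains is hypothetical, and treating a header without a period as a single chain with an empty chain name is an assumption.

```python
from collections import defaultdict


def group_chains(fasta_text):
    """Map molecule name -> {chain name: sequence} from 'molecule.chain' headers."""
    molecules = defaultdict(dict)
    mol = chain = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            # Text before the first "." is the molecule name, the rest is the chain name.
            mol, _, chain = name.partition(".")
            molecules[mol][chain] = ""
        elif mol is not None:
            # Sequence lines are appended to the most recent header.
            molecules[mol][chain] += line
    return dict(molecules)
```

    Calling group_chains on the example above yields two molecules, mol1 and mol2, each with chains A and B.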
    

    HLA Allotypes

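    HLA allele set used for prediction. The default “rep” is generally recommended: it is faster, and the immunogenicity risk score is determined from the representative HLAs.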
    rep: 32 representative allelic types, applicable to the general population.
    all: all non-redundant human allele types used for training (1166).

    Binding Affinity Profile

    Exports binding affinity profile plots for each HLA allele, showing the binding affinity of every 15-mer peptide from the N-terminus to the C-terminus of each protein chain. Note: even if the “HLA Allotypes” option is set to “all”, profiles are plotted only for the representative HLAs.

    Result

    The output includes:

    Output File Name Description
    score_immunogenicity_risk.csv Predicted immunogenicity risk of each molecule, obtained by aggregating the predicted potential T cell epitopes from all chains of the same molecule and evaluating the overall risk.
    detail_tce_of_chains.csv HLA-presented epitopes that could be targeted for engineering to reduce immunogenicity.
    BAProfile_of_mol.chain.png Binding affinity profile between each chain and the 32 representative HLAs. From left to right, the profile shows the binding affinity of a 15-mer peptide window moving from the N-terminus to the C-terminus of one protein chain. Only representative HLA alleles are included, even if the “HLA Allotypes” option is set to “all”.
    Heatmap_of_mol.chain.png Heat map of the binding affinity between each peptide and the representative HLAs. The Z-score is pAffinity; a greater value (lighter color) means stronger predicted binding.

    score_immunogenicity_risk.csv contains the following information:

    Field Name Description
    Protein_Id Protein sequence name
    Risk The overall risk assessment for the predicted molecule, with “high” indicating high-risk molecules and “low” indicating low-risk molecules.
    Score The total length of the epitopes, which is an important basis for overall risk assessment.
    TCE_Sequences The epitope sequences
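
    The fields listed above can be read with any CSV reader. The sketch below assumes the file is comma-separated with a header row carrying exactly these field names, which is not confirmed here; the function name high_risk_molecules and the sample rows are hypothetical and for illustration only.

```python
import csv
import io


def high_risk_molecules(csv_text):
    """Return the Protein_Id of every row whose Risk column is 'high'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Protein_Id"] for row in reader
            if row["Risk"].strip().lower() == "high"]


# Hypothetical file content; real values come from score_immunogenicity_risk.csv.
sample = (
    "Protein_Id,Risk,Score,TCE_Sequences\n"
    "mol1,high,45,AAAAAAAAAAAAAAA\n"
    "mol2,low,0,\n"
)
print(high_risk_molecules(sample))
```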

    detail_tce_of_chains.csv contains the following information:

    Field Name Description
    Sequences Protein sequence name
    TCE Relatively high-risk T cell epitopes of each chain
    Alleles_Number Number of HLA subtypes presenting the epitope
    Alleles HLA subtypes presenting the epitope
    Min_Affinity Minimum affinity
    Median_Affinity Median affinity
    Max_Affinity Maximum affinity
  • Name: Codon Optimization
    Description: Codon Optimization可用于密码子优化(基于PCR的基因合成的自动寡核苷酸设计)。整个基因组序列的可用性极大地增加了蛋白质靶标的数量,其中许多需要在原始DNA来源以外的细胞中过度表达。合成基因可以针对表达进行优化,并构建为易于突变操作而无需考虑亲本基因组。然而,合成基因的设计和构建,尤其是那些编码大蛋白质的基因,可能是一个缓慢、困难和令人困惑的过程。该模块通过基于PCR的方法自动设计用于基因合成的寡核苷酸。 Codon optimization can be used for optimizing codons (i.e., the genetic code) for the automated oligonucleotide design of gene synthesis based on PCR. The availability of whole genome sequences has greatly increased the number of protein targets, many of which need to be overexpressed in cells outside of their native DNA source. Synthetic genes can be optimized for expression and constructed to be easily mutagenized without consideration for the parental genome. However, designing and constructing synthetic genes, particularly those encoding large proteins, can be a slow, difficult, and confusing process. This module automatically designs oligonucleotides for gene synthesis using a PCR-based approach.
    Tags: undefined
    Author: DNAWorks
    Release: 2022-04-15 11:52:22
    Reference: Nucleic Acids Res. 2002 May 15;30(10):e43.

    Codon Optimization

    简介

    基于知名的DNAWorks算法对氨基酸或DNA序列进行密码子优化(基于PCR的基因合成的自动寡核苷酸设计)。

    整个基因组序列的可用性极大地增加了蛋白质靶标的数量,其中许多需要在原始DNA来源以外的细胞中过度表达。合成基因可以针对表达进行优化,并构建为易于突变操作而无需考虑亲本基因组。然而,合成基因的设计和构建,尤其是那些编码大蛋白质的基因,可能是一个缓慢、困难和令人困惑的过程。该模块通过基于PCR的方法自动设计用于基因合成的寡核苷酸。

    参数说明

    Sequence File

    蛋白或者核酸的序列文件,FASTA格式。

    Sequence Type

    序列类型,蛋白或者核酸。

    Organism

    几种常用生物的密码子频率基于每个密码子在相应生物基因组的蛋白质编码区中出现的次数。大肠杆菌有两种选项:基于所有基因的标准频率(E. coli),或在指数增长期间以高水平表达的 II 类基因频率(ecoli2),通常建议用后者。

    Annealing Temperature

    退火温度参数为一组合成寡核苷酸设定了理想的退火温度。 可接受的退火温度范围在 58 至 70°C 之间。

    Oligo Length

    寡核苷酸长度参数限制了一组合成寡核苷酸中的任何一个可以达到的核苷酸长度。可接受的寡核苷酸长度范围在 30 到 999 nt 之间。

    Codon Frequency Threshold

    密码子频率阈值参数设置:密码子用于反向翻译蛋白质序列到DNA的截断值。

    Oligonucleotides Concentration

    寡核苷酸的浓度。寡核苷酸必须在100 uM (1E-4 M)和1 nM (1E-9 M)之间。

    Cations Concentration

    一价阳离子(Na+,K+)的浓度。单价阳离子必须在10到1000mM之间。

    Magnesium Concentration

    镁离子的浓度。镁离子浓度必须在0到200mM之间。

    Solution Number

    执行中生成的寡核苷酸的数量,每个作业的最大运行次数为999次。

    Thermodynamically Balanced Mode

    检查是否为热力学平衡由内而外合成法 (thermodynamically balanced inside-out, TBIO)输出模式。

    Restriction Site Screen

    要求被排除在合成基因的蛋白质编码区之外的位点,每个位点之间用逗号隔开,例如Aatll,Acc65I。
    支持非简并位点共117种:

    AatII,Acc65I,AclI,AcuI,AfeI,AflII,AgeI,AlwI,ApaI,ApaLI,AscI,AseI,AsiSI,AvrII,BamHI,BbsI,BbvCI,BbvI,BccI,BceAI,BciVI,BclI,BfrBI,BfuAI,BglII,BmgBI,BmrI,BmtI,BpmI,BpuEI,BsaI,BseRI,BseYI,BsgI,BsiWI,BsmAI,BsmBI,BsmFI,BsmI,BspCNI,BspDI,BspEI,BspHI,BspMI,BsrBI,BsrDI,BsrGI,BsrI,BssHII,BssSI,BstBI,BstZ17I,BtgZI,BtsI,ClaI,DraI,EagI,EarI,EciI,EcoRI,EcoRV,FauI,FokI,FseI,FspI,HgaI,HindIII,HpaI,HphI,KasI,KpnI,MboII,MfeI,MluI,MlyI,MscI,NaeI,NarI,NcoI,NdeI,NgoMIV,NheI,NotI,NruI,NsiI,PacI,PaeR7I,PciI,PleI,PmeI,PmlI,PsiI,PspOMI,PstI,PvuI,PvuII,SacI,SacII,SalI,SapI,SbfI,ScaI,SfaNI,SfoI,SmaI,SnaBI,SpeI,SphI,SspI,StuI,SwaI,TliI,TspRI,XbaI,XhoI,XmaI,ZraI
    

    支持简并位点共62种:

    AccI,AflIII,AhdI,AleI,AlwNI,ApoI,AvaI,BanI,BanII,BcgI,BglI,BlpI,Bme1580I,Bpu10I,BsaAI,BsaBI,BsaHI,BsaJI,BsaWI,BsaXI,BsiEI,BsiHKAI,BslI,BsoBI,Bsp1286I,BsrFI,BstAPI,BstEII,BstF5I,BstXI,BstYI,Bsu36I,BtgI,Cac8I,DraIII,DrdI,EaeI,EcoNI,EcoO109I,HaeII,HincII,Hpy188III,MmeI,MslI,MspA1I,MwoI,NlaIV,NspI,PflFI,PflMI,PpuMI,PshAI,RsrII,SexAI,SfcI,SfiI,SgrAI,SmlI,StyI,Tth111I,XcmI,XmnI
    

    Custom Site Screen

    自定义被排除在合成基因的蛋白质编码区之外的位点,自定义位点格式必须包含名称和序列,名称和序列之间用空格隔开,多个位点时用逗号隔开,例如:Aatll GACGTC,Acc65I GGTACC。

    Output File

    输出结果文件的名称。

    结果说明

    输出结果文件为result.txt,包含优化后的密码子序列以及序列相关信息。

    参考文献

    Hoover DM, Lubkowski J. DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 2002 May 15;30(10):e43.

    Codon Optimization

    Introduction

    Codon Optimization performs codon optimization of amino acid or DNA sequences with the DNAWorks algorithm, which automates oligonucleotide design for PCR-based gene synthesis. The availability of whole-genome sequences has greatly increased the number of protein targets, many of which need to be overexpressed in cells other than their original DNA source. Synthetic genes can be optimized for expression and constructed for easy mutational manipulation without regard to the parental genome. However, designing and constructing synthetic genes, particularly those encoding large proteins, can be a slow, difficult, and confusing process. This module automatically designs the oligonucleotides for gene synthesis using a PCR-based approach.

    Parameter

    Sequence File

    Protein or nucleotide sequences in FASTA format

    Sequence Type

    Sequence type: protein or nucleic acid.

    Organism

    Codon frequencies for several commonly used organisms, based on the number of times each codon appears in the protein-coding regions of the organism’s genome. For Escherichia coli there are two options: the standard frequencies based on all genes (E. coli), or the frequencies of Class II genes, which are expressed at high levels during exponential growth (ecoli2). The latter is usually recommended.

    Annealing Temperature

    The annealing temperature parameter sets the ideal annealing temperature for a set of synthetic oligonucleotides. Acceptable annealing temperatures range from 58 to 70°C.

    Oligo Length

    The oligonucleotide length parameter sets the maximum length that any oligonucleotide in the synthetic set may reach. Acceptable values range from 30 to 999 nt.

    Codon Frequency Threshold

    The codon frequency threshold sets the cutoff that determines which codons are used when back-translating a protein sequence to DNA.

    Oligonucleotides Concentration

    Concentration of oligonucleotides. The value must be between 1 nM (1E-9 M) and 100 µM (1E-4 M).

    Cations Concentration

    Concentration of monovalent cations (Na+, K+). The value must be between 10 and 1000 mM.

    Magnesium Concentration

    Concentration of magnesium ions. The value must be between 0 and 200 mM.

    Solution Number

    The number of oligos generated in an execution, with a maximum of 999 runs per job.
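
    The numeric limits stated for the parameters above can be checked before a job is submitted. The following sketch is one possible pre-submission check, not part of the module: the dictionary keys and the function name check_parameters are hypothetical, and the lower bound of 1 for the solution number is an assumption (only the maximum of 999 is documented).

```python
# Documented ranges; keys are illustrative names, not the module's real option names.
RANGES = {
    "annealing_temperature_c": (58, 70),
    "oligo_length_nt": (30, 999),
    "oligo_concentration_m": (1e-9, 1e-4),
    "cations_concentration_mm": (10, 1000),
    "magnesium_concentration_mm": (0, 200),
    "solution_number": (1, 999),  # lower bound of 1 is an assumption
}


def check_parameters(params):
    """Return a list of messages for values outside their documented range."""
    errors = []
    for key, value in params.items():
        if key not in RANGES:
            continue  # parameters without a documented range are not checked
        low, high = RANGES[key]
        if not low <= value <= high:
            errors.append(f"{key}={value} is outside [{low}, {high}]")
    return errors
```

    An empty list means every checked value lies inside its documented range.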

    Thermodynamically Balanced Mode

    Enables the thermodynamically balanced inside-out (TBIO) output mode when checked.

    Restriction Site Screen

    Restriction sites to exclude from the protein-coding region of the synthetic gene, separated by commas, for example: AatII,Acc65I.
    A total of 117 non-degenerate sites are supported:

    AatII,Acc65I,AclI,AcuI,AfeI,AflII,AgeI,AlwI,ApaI,ApaLI,AscI,AseI,AsiSI,AvrII,BamHI,BbsI,BbvCI,BbvI,BccI,BceAI,BciVI,BclI,BfrBI,BfuAI,BglII,BmgBI,BmrI,BmtI,BpmI,BpuEI,BsaI,BseRI,BseYI,BsgI,BsiWI,BsmAI,BsmBI,BsmFI,BsmI,BspCNI,BspDI,BspEI,BspHI,BspMI,BsrBI,BsrDI,BsrGI,BsrI,BssHII,BssSI,BstBI,BstZ17I,BtgZI,BtsI,ClaI,DraI,EagI,EarI,EciI,EcoRI,EcoRV,FauI,FokI,FseI,FspI,HgaI,HindIII,HpaI,HphI,KasI,KpnI,MboII,MfeI,MluI,MlyI,MscI,NaeI,NarI,NcoI,NdeI,NgoMIV,NheI,NotI,NruI,NsiI,PacI,PaeR7I,PciI,PleI,PmeI,PmlI,PsiI,PspOMI,PstI,PvuI,PvuII,SacI,SacII,SalI,SapI,SbfI,ScaI,SfaNI,SfoI,SmaI,SnaBI,SpeI,SphI,SspI,StuI,SwaI,TliI,TspRI,XbaI,XhoI,XmaI,ZraI
    

    A total of 62 degenerate sites are supported:

    AccI,AflIII,AhdI,AleI,AlwNI,ApoI,AvaI,BanI,BanII,BcgI,BglI,BlpI,Bme1580I,Bpu10I,BsaAI,BsaBI,BsaHI,BsaJI,BsaWI,BsaXI,BsiEI,BsiHKAI,BslI,BsoBI,Bsp1286I,BsrFI,BstAPI,BstEII,BstF5I,BstXI,BstYI,Bsu36I,BtgI,Cac8I,DraIII,DrdI,EaeI,EcoNI,EcoO109I,HaeII,HincII,Hpy188III,MmeI,MslI,MspA1I,MwoI,NlaIV,NspI,PflFI,PflMI,PpuMI,PshAI,RsrII,SexAI,SfcI,SfiI,SgrAI,SmlI,StyI,Tth111I,XcmI,XmnI
    

    Custom Site Screen

    Custom sites to exclude from the protein-coding region(s) of the synthetic gene. Each custom site must consist of a name and a sequence separated by a space; multiple sites are separated by commas. Example: AatII GACGTC,Acc65I GGTACC.
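
    The custom site format (name, space, sequence, with commas between sites) is easy to get wrong by hand, so a small parser can help validate the string first. This is a sketch only: the function name parse_custom_sites is hypothetical, and requiring the sequence to consist of letters is an assumption about what the module accepts.

```python
def parse_custom_sites(spec):
    """Parse 'Name SEQ,Name SEQ' into a {name: sequence} dictionary."""
    sites = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split()
        if len(parts) != 2:
            raise ValueError(f"expected 'name sequence', got {item!r}")
        name, seq = parts
        if not seq.isalpha():
            raise ValueError(f"sequence for {name!r} must contain letters only")
        sites[name] = seq.upper()
    return sites
```

    For the documented example string, the result maps AatII to GACGTC and Acc65I to GGTACC.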

    Output File

    Specify output file name

    Result

    The output file is result.txt, which contains the optimized codon sequence and sequence-related information.

    Reference

    Hoover DM, Lubkowski J. DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 2002 May 15;30(10):e43.

  • Name: PDB Mutation
    Description: 突变PDB格式的蛋白质结构并返回突变后的结构。一般建议通过WeView三维结构可视化编辑器来使用该功能。 Mutate a protein structure in PDB format and return mutated structure. It is recommended to use in the WeView.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-04-12 00:00:00
    Reference:

    PDB Mutation

    简介

    PDB Mutation是用于突变PDB格式的蛋白质结构并返回突变后的结构。

    参数说明

    PDB File

    蛋白的结构文件,PDB格式

    Index Type

    为后续突变文件中的残基索引设置类别。有两种选择:UID或者POS
    UID表示PDB文件中自带的残基编号,该编号可能存在间断不连续,不从1开始等情况;
    POS表示位置编号或自然顺序编号,从1开始按顺序进行编号。默认为POS。

    Mutation File

    突突变文本文件,包含突变信息,格式如下:

    KA100N,KA101T
    KA100T
    KA100BT
    

    每个突变定义为:
    第一字母代表的是原始残基,第二个字母代表PDB文件中待突变残基所在的链名,后面的数字代表残基位置编号(编号类型是POS还是UID,在上述参数Index Type中定义,默认为POS),最后一个字母代表突变后的残基。如:KA100N表示A链中位置编号(POS)100的残基K,突变为N。

    • 每一行可放置一组突变,用英文逗号分隔,该组突变将被应用于结构中,得到一个新的结构文件。
      每行对应一个新的结构文件。
    • UID编号支持插入码输入,如KA100BT,表示A链中UID编号为100B的残基K,突变为T。

    结果说明

    输出结果包括:

    输出文件名称 说明
    mutations.tar.gz 所有突变体PDB结构的压缩包文件
    第一组突变对应的PDB结构 如:KA100N_KA101T.pdb

    PDB Mutation

    Introduction

    PDB Mutation is a tool used to mutate protein structures in PDB format and return the mutated structures.

    Parameters

    PDB File

    Structure file of the protein in PDB format.

    Index Type

    This parameter sets the residue index convention used in the mutation file. Two options are available: UID or POS.

    • UID uses the residue numbers already present in the PDB file; these numbers may be discontinuous or may not start from 1.
    • POS indicates a position-based or sequential index, counting from 1 upward in order.
      The default is POS.

    Mutation File

    A plain-text file that lists the desired mutations. Format:

    KA100N,KA101T  
    KA100T 
    KA100BT 
    

    Each mutation is defined as follows:

    • The first letter represents the original residue.
    • The second letter represents the chain ID of the residue to be mutated in the PDB file.
    • The number represents the residue position index (the index type, POS or UID, is defined by the Index Type parameter above; the default is POS).
    • The last letter represents the mutated residue.

    For example, KA100N means that residue K at position 100 (POS) in chain A is mutated to N.

    • Each line may contain a set of mutations, separated by commas. All mutations in the same line are applied together to generate one new structure file.

    • Each line corresponds to one newly generated structure file.

    • UID indexing supports insertion codes. For example, KA100BT means that residue K with UID 100B in chain A is mutated to T.
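
    The mutation notation described above can be expressed as a regular expression. The sketch below is illustrative and not the tool's own parser: the names Mutation, parse_mutation, parse_mutation_file and output_name are hypothetical, and the assumptions are that residues are single uppercase letters, the chain ID is a single character, and output files are named by joining the mutations of a line with underscores (as in the KA100N_KA101T.pdb example).

```python
import re
from typing import NamedTuple

# original residue, chain ID, residue number, optional insertion code, mutated residue
MUTATION_RE = re.compile(r"^([A-Z])([A-Za-z0-9])(-?\d+)([A-Z]?)([A-Z])$")


class Mutation(NamedTuple):
    original: str
    chain: str
    number: int
    insertion_code: str
    mutated: str


def parse_mutation(token):
    """Parse one mutation such as 'KA100N' or 'KA100BT'."""
    match = MUTATION_RE.match(token)
    if match is None:
        raise ValueError(f"cannot parse mutation {token!r}")
    original, chain, number, icode, mutated = match.groups()
    return Mutation(original, chain, int(number), icode, mutated)


def parse_mutation_file(text):
    """Return one list of mutations per non-empty line (one output structure each)."""
    sets = []
    for line in text.splitlines():
        tokens = [t.strip() for t in line.split(",") if t.strip()]
        if tokens:
            sets.append([parse_mutation(t) for t in tokens])
    return sets


def output_name(mutations):
    """Build a file name such as 'KA100N_KA101T.pdb' for one mutation set."""
    labels = [f"{m.original}{m.chain}{m.number}{m.insertion_code}{m.mutated}"
              for m in mutations]
    return "_".join(labels) + ".pdb"
```

    Because the insertion code is optional, KA100N parses with an empty insertion code, while KA100BT parses with insertion code B and mutated residue T.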

    Results

    The output results include:

    Output File Name Description
    mutations.tar.gz Compressed file containing all mutated PDB structures
    The PDB structure that corresponds to the first set of mutations such as: KA100N_KA101T.pdb
  • Name: Patent Sequence Listing
    Description: 批量从专利文本文件中提取序列的工具。很多大分子专利会附带一个序列清单文件,里面存储了专利要求中的全部序列,但是人工很难高效读取,利用此模块可以一次性批量提取。其中Image(OCR)是基于图像的蛋白质序列转换为3个字母编码或1个字母编码的序列。 A tool for extracting sequences in bulk from patent text files. Many macromolecule patents come with a sequence listing file that contains all the sequences in the patent claims. However, it is difficult for humans to efficiently read and extract these sequences. With this module, all sequences can be extracted in bulk at once. The Image(OCR) is the conversion of image-based protein sequences into 3-letter coded or 1-letter coded sequences.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-03-22 14:36:49
    Reference: https://github.com/xinyu-dev/PatentSeq

    Patent Sequence Listing

    简介

    通过解析美国(https://patentcenter.uspto.gov/ )和国际(https://patentscope2.wipo.int/search/en/search.jsf )专利附带的序列清单(Sequence Listing)文件,里面存储了专利权利要求的序列,但是人工很难读取,该模块可以从中一次性批量提取专利中所有具有正式编号(SEQ ID NO.)的序列。

    1. Sequence Listing文件下载

    序列清单(Sequence Listing)文件内容示例:
    image.png

    用法:

    (1)从专利网站搜索专利:

    • WO专利从WIPO的网站PatentScope搜索:
      https://patentscope2.wipo.int/search/en/search.jsf
    • US专利从USPTO的网站搜索:
      https://patentcenter.uspto.gov/

    (2)在专利的页面中找到Sequence Listing文件并下载。

    从WIPO网站下载

    image.png
    从USPTO网站下载

    (3)使用该模块,提交下载到的文件即可。

    2. Image(OCR)
    将图片中的蛋白质序列转换为3个字母编码或1个字母编码的序列。
    注意:截图时请务必省略标题,类似下图。

    TXT(XML)方法

    参数说明

    Sequence Listing File

    专利文件,TXT或者XML格式。

    结果说明

    输出结果包括:

    输出文件名称 说明
    seq_list.csv 记录所有序列信息的csv文件
    seq_list.fasta 记录所有序列信息的fasta文件

    其中seq_list.csv包括信息如下:

    字段名称 说明
    idx 序列编号
    type 序列类型,DNA/蛋白
    sequence 序列信息

    Image(OCR)方法

    参数说明

    Image File

    专利图片文件,PNG或者JPG格式

    Format Option

    区分蛋白质序列“三字母”和“单字母”的输入,该选项用于指定识别模式:3L 表示 3-letter,1L 表示 1-letter。

    Output File

    输出文件名称,默认为result.fasta

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.fasta 专利图片转换成一个字母序列的FASTA文件
    result.txt 包含图片文件的字符,转换成一个字母和三个字母的序列

    Patent Sequence Listing

    Introduction

    This module parses the sequence listing files attached to U.S. (https://patentcenter.uspto.gov/ ) and international (https://patentscope2.wipo.int/search/en/search.jsf ) patents. These files store the sequences claimed in a patent but are difficult to read manually; the module extracts all sequences with official numbers (SEQ ID NO.) from a patent in bulk.

    1. Sequence Listing File Download

    Example content of a Sequence Listing file:
    image.png

    Usage:

    (1) Search for patents on patent websites:

    • For WO patents, search on WIPO’s PatentScope:
      https://patentscope2.wipo.int/search/en/search.jsf
    • For US patents, search on USPTO’s website:
      https://patentcenter.uspto.gov/

    (2) Find and download the Sequence Listing file on the patent page.

    [Figure: Download from the WIPO website]
    [Figure: Download from the USPTO website]

    (3) Submit the downloaded file to this module.

    2. Image(OCR)

    Image(OCR) is for converting protein sequences from images into three-letter or one-letter coded sequences.
    Note: When taking screenshots, please be sure to omit the headers, similar to the image below.
    Example_Seq1.png

    TXT(XML) Method

    Parameter Description

    Sequence Listing File

    Patent file in TXT or XML format.

    Result Description

    The output includes:

    Output File Name Description
    seq_list.csv CSV file recording all sequence information
    seq_list.fasta FASTA file recording all sequence information

    The seq_list.csv includes the following information:

    Field Name Description
    idx Sequence number
    type Sequence type, DNA/protein
    sequence Sequence information

    Image(OCR) Method

    Parameter Description

    Image File

    Patent image file in PNG or JPG format

    Format Option

    Distinguishes between protein sequence inputs in three-letter and one-letter formats. This option specifies the recognition mode: 3L for three-letter and 1L for one-letter.
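
    The relationship between the two formats is the standard amino acid code table. The sketch below shows the three-letter to one-letter conversion for text that has already been recognized; it does not perform OCR and is not the module's implementation. The function name three_to_one is hypothetical, and skipping purely numeric tokens (position numbers) and mapping unrecognized codes to X are assumptions.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(text, unknown="X"):
    """Convert whitespace-separated three-letter codes to a one-letter sequence."""
    letters = []
    for token in text.split():
        if token.isdigit():
            continue  # assumption: bare numbers are position markers, not residues
        letters.append(THREE_TO_ONE.get(token.capitalize(), unknown))
    return "".join(letters)
```

    For example, the text "Met Ala Lys Trp" followed by a position number converts to MAKW.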

    Output File

    Output file name, default is result.fasta

    Result Description

    The output includes:

    Output File Name Description
    result.fasta FASTA file of one-letter sequences converted from patent images
    result.txt Characters from image files converted into one-letter and three-letter sequences
  • Name: Tumor Gene Expression (TCGA)
    Description: 基于TCGA和GTEx等数据,检索指定基因在肿瘤和正常组织的表达情况,统计并绘制肿瘤细胞、肿瘤组织、正常组织等的基因表达差异,帮助药物靶点选择、研发立项和决策。 Plot gene expression in normal and tumor tissues, based on databases TCGA and GTEx. It retrieves the expression of specified genes in tumors and normal tissues, counts and maps the gene expression differences of tumor cells, tumor tissues, and normal tissues, etc., to help drug target selection and decision-making.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-03-22 14:35:06
    Reference:

    Tumor Gene Expression (TCGA)

    简介

    基于TCGA和GTEx等数据,检索指定基因在肿瘤和正常组织的表达情况,统计并绘制肿瘤细胞、肿瘤组织、正常组织等的基因表达差异,帮助药物靶点选择、研发立项和决策。

    参数说明

    Gene Name

    基因名称,输入的基因名须对应HGNC(https://www.genenames.org/)的"Approved Symbol"。例如:在HGNC搜索“PD-1”,得知“approved symbol”为“PDCD1”,后者“PDCD1”是该程序需要的输入。

    注意:HGNC网站会更新基因命名。如果输入的Approved Symbol报错,可尝试使用Previous Symbol。例如,“AARS1” 基因可输入为 “AARS”。
    image.png
    image.png

    结果说明

    输出结果包括:

    输出文件名称 说明
    tcga_expression.jpeg 不同疾病中该基因分别在肿瘤、正常、癌旁组织的表达量分布。
    tcga_tissue_expression.jpeg 不同组织中该基因分别在肿瘤、正常、癌旁组织的表达量分布。

    Tumor Gene Expression (TCGA)

    Introduction

    Plots gene expression in normal and tumor tissues based on the TCGA and GTEx databases. The module retrieves the expression of a specified gene in tumor and normal tissues, then summarizes and plots the expression differences among tumor cells, tumor tissues, normal tissues, etc., to support drug target selection, project initiation, and decision-making.

    Parameter

    Gene Name

    The gene name must be the “Approved Symbol” from HGNC (https://www.genenames.org/). For example, searching HGNC for “PD-1” shows that the approved symbol is “PDCD1”; “PDCD1” is the input this program requires.
    Note: Gene names on the HGNC website are subject to updates. If the Approved Symbol returns an error, try using a Previous Symbol. For example, the gene “AARS1” can be entered as “AARS”.

    Result

    The output includes:

    Output File Name Description
    tcga_expression.jpeg Expression distribution of the gene in tumor, normal, and adjacent tissues across different diseases.
    tcga_tissue_expression.jpeg Expression distribution of the gene in tumor, normal, and adjacent tissues across different tissues.
  • Name: Multiple Sequence Alignment
    Description: 基于渐进(progressive)比对算法进行多重序列比对,绘制进化树与序列对比图。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Align -> Tree。 Align multiple sequences using progressive alignment algorithm for evolutionary analysis, generating phylogenetic trees. It is recommended to use in the WeSeq: WeSeq -> Align -> Tree.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-03-21 11:41:36
    Reference:

    Multiple Sequence Alignment

    简介

    Multiple Sequence Alignment 是多重序列比对模块,用于进化分析,绘制进化树,帮助对候选序列进行聚类、分析多样性等。

    参数说明

    方法:msa

    Input File

    蛋白序列文件,FASTA格式。

    方法:antibody

    Input File

    蛋白序列文件,FASTA格式。

    Numbering Scheme

    抗体编号方法,支持imgt,kabat,chothia

    Full Sequence Identity

    输出抗体整体序列一致性文件名称,CSV格式

    CDR Sequence Identity

    输出抗体CDR序列一致性文件名称,CSV格式

    Identity Heatmap

    输出抗体序列一致性热图,HTML格式

    结果说明

    输出结果包括:

    输出文件名称 说明
    alignment.fasta 多重序列进行比对后的FASTA文件
    alignment.png 多重序列进行比对后的PNG文件
    newick.txt 多重序列进行多样性分析的结果文件
    tree.png 多重序列进化树图片
    out/full_identity.csv 针对抗体方法下,抗体整体序列一致性CSV文件
    out/cdr_identity.csv 针对抗体方法下,抗体CDR序列一致性CSV文件
    out/identity_heatmap.html 针对抗体方法下,抗体序列一致性热图HTML文件

    Multiple Sequence Alignment

    Introduction

    Multiple Sequence Alignment is a module for aligning multiple sequences, used for evolutionary analysis, drawing evolutionary trees, and aiding in clustering and analyzing diversity of candidate sequences.

    Parameter

    Method: msa

    Input File

    Protein sequence file in FASTA format

    Method: antibody

    Input File

    Antibody sequence file in FASTA format.

    Numbering Scheme

    Antibody numbering scheme, supporting imgt, kabat, and chothia

    Full Sequence Identity

    Export pairwise full identity matrix as CSV

    CDR Sequence Identity

    Export pairwise antibody CDR identity matrix as CSV

    Identity Heatmap

    Output antibody sequence identity heatmap in HTML format

    Result

    The output includes:

    Output File Name Description
    alignment.fasta Aligned sequences in FASTA format
    alignment.png Image of the multiple sequence alignment
    newick.txt Diversity (tree) analysis result for the aligned sequences
    tree.png Image of the phylogenetic tree
    out/full_identity.csv Antibody method only: pairwise full-sequence identity matrix (CSV)
    out/cdr_identity.csv Antibody method only: pairwise CDR identity matrix (CSV)
    out/identity_heatmap.html Antibody method only: sequence identity heatmap (HTML)
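
    The pairwise identity matrices listed above are derived from aligned sequences. How the module defines identity is not stated here, so the following is a sketch under one common convention: identical residues divided by the number of columns in which neither sequence has a gap. The function names pairwise_identity and identity_matrix are hypothetical.

```python
def pairwise_identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # assumption: gap columns are ignored
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def identity_matrix(aligned):
    """Return {(name1, name2): identity} for every ordered pair of aligned sequences."""
    names = list(aligned)
    return {(p, q): pairwise_identity(aligned[p], aligned[q])
            for p in names for q in names}
```

    Other conventions (for example, dividing by the full alignment length) give different values, so results from this sketch may not match the module's CSV output.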
  • Name: Structural Alignment
    Description: 基于序列的蛋白质三维结构叠合工具。使用BLOSUM62矩阵和Needleman-Wunsch算法在两个序列之间执行全局配对比对,返回叠合后的蛋白结构,同时输出RMSD值。 Sequence-based protein structural alignment tool. Performs a global pairwise alignment between two sequences using the BLOSUM62 matrix and the Needleman-Wunsch algorithm. Returns the alignment, the sequence identity, and the residue mapping between both original sequences.
    Tags: undefined
    Author: Biopython
    Release: 2022-03-17 14:43:33
    Reference:

    Structural Alignment

    简介

    Structural Alignment是对两个蛋白质的三维结构进行叠合的工具。使用BLOSUM62矩阵和Needleman-Wunsch算法在两个序列之间执行全局配对比对,返回叠合后的蛋白结构,同时输出RMSD值。

    参数说明

    Reference Structure

    参考蛋白的结构文件,PDB格式

    Sample Structure

    需要叠合蛋白的结构文件,PDB格式

    Reference Chain

    指定参考蛋白的链名,默认是A链

    Sample Chain

    指定需要叠合蛋白的链名,默认是A链

    Output File

    指定输出叠合后的结构文件,PDB格式

    结果说明

    输出结果包括:

    输出文件名称 说明
    result.csv 参考蛋白与样本蛋白之间的RMSD值记录文件
    alignment_renumbering_pred.pdb 叠合后的结构文件

    其中result.csv包含如下信息:

    字段名称 说明
    Reference 参考蛋白构象
    Sample 需要叠合的蛋白构象
    RMSD 叠合后的RMSD值

    Structural Alignment

    Introduction

    Structural Alignment is a tool for superimposing the 3D structures of two proteins. It performs a global pairwise alignment between the two sequences using the BLOSUM62 matrix and the Needleman-Wunsch algorithm, then returns the superimposed protein structure and outputs the RMSD value.
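
    The global alignment step can be illustrated with a compact Needleman-Wunsch implementation. This sketch deliberately swaps the BLOSUM62 matrix for a simple match/mismatch score with a linear gap penalty so that it stays self-contained; the function name needleman_wunsch and the default scores are assumptions, and the structural superposition and RMSD calculation are not shown.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring diagonal moves.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

    The aligned positions where neither sequence has a gap give the residue pairs that a superposition step would then use.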

    Parameter Description

    Reference Structure

    Structure file of the reference protein in PDB format.

    Sample Structure

    Structure file of the protein to be aligned in PDB format.

    Reference Chain

    Specify the chain name of the reference protein, default is chain A.

    Sample Chain

    Specify the chain name of the protein to be aligned, default is chain A.

    Output File

    Specify the output structure file after alignment in PDB format.

    Result Description

    The output results include:

    Output File Name Description
    result.csv RMSD value record file between the reference protein and the sample protein
    alignment_renumbering_pred.pdb Aligned structure file

    The result.csv file contains the following information:

    Field Name Description
    Reference Conformation of the reference protein
    Sample Conformation of the protein to be aligned
    RMSD RMSD value after alignment
  • Name: PDB Insertion Removal
    Description: 用于去掉抗体PDB文件中的插入序列,因为某些计算工具不支持PDB中的插入序列。比如,20A改成20。 Renumber the antibody PDB file to remove any insertion codes in UID, to make such PDB compatible with other tools.
    Tags: undefined
    Author: Rodrigues JPGLM
    Release: 2022-03-10 16:10:28
    Reference: Rodrigues JPGLM, Teixeira JMC, Trellet M, Bonvin AMJJ. pdb-tools: a swiss army knife for molecular structures. F1000Res. 2018 Dec 20;7:1961.

    PDB Insertion Removal

    简介

    PDB Insertion Removal模块用于去掉抗体PDB文件中的插入序列,因为某些计算工具不支持PDB中的插入序列。比如,20A改成20。

    参数说明

    Structure PDB File

    抗体结构文件,PDB格式。

    结果说明

    得到去掉抗体中的插入序列的PDB文件prepared_insert.pdb。

    参考文献

    • Rodrigues JPGLM, Teixeira JMC, Trellet M, Bonvin AMJJ. pdb-tools: a swiss army knife for molecular structures. F1000Res. 2018 Dec 20;7:1961. DOI: 10.12688/f1000research.17456.1

    PDB Insertion Removal

    Introduction

    The PDB Insertion Removal module removes insertion codes from antibody PDB files, because some computational tools do not support insertion codes. For example, residue 20A is renumbered to 20.
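
    A minimal sketch of how insertion codes can be removed is shown here. It is not the pdb-tools implementation. It assumes standard fixed-column ATOM/HETATM records (chain ID in column 22, residue number in columns 23-26, insertion code in column 27) and assumes that later residues in the same chain are shifted so that numbers stay unique. The function name `remove_insertion_codes` is hypothetical.

```python
def remove_insertion_codes(lines):
    """Return PDB lines with insertion codes removed and residues renumbered.

    Each inserted residue (e.g. 20A) increases the numbering offset of its
    chain by one, so 20, 20A, 21 becomes 20, 21, 22.
    """
    offset = {}  # chain ID -> number of inserted residues seen so far
    seen = {}    # chain ID -> (resSeq, iCode) of the residue being processed
    out = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 27:
            chain = line[21]
            res_seq = int(line[22:26])
            icode = line[26]
            key = (res_seq, icode)
            if seen.get(chain) != key:
                # First atom of a new residue in this chain.
                seen[chain] = key
                if icode != " ":
                    offset[chain] = offset.get(chain, 0) + 1
            new_seq = res_seq + offset.get(chain, 0)
            # Rewrite columns 23-26 and blank the insertion code column.
            line = f"{line[:22]}{new_seq:>4d} {line[27:]}"
        out.append(line)
    return out
```

    Records other than ATOM and HETATM are passed through unchanged in this sketch; a complete tool would also update TER and ANISOU records.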

    Parameter Description

    Structure PDB File

    Antibody structure file in PDB format.

    Result Description

    The output is prepared_insert.pdb, the antibody PDB file with insertion codes removed.

    References

    • Rodrigues JPGLM, Teixeira JMC, Trellet M, Bonvin AMJJ. pdb-tools: a swiss army knife for molecular structures. F1000Res. 2018 Dec 20;7:1961. DOI: 10.12688/f1000research.17456.1
  • Name: Aggregation Score
    Description: 预测蛋白质结构中的聚集倾向和蛋白质溶解度,通过考虑序列和结构来预测蛋白质中易聚集的位点,这对于球状蛋白质特别有用,其中容易聚集的位点可能埋藏在天然结构内并且序列不连续。通过考虑天然氨基酸的实验聚集倾向尺度,该方法可以准确预测蛋白质聚集倾向。 Design for the rational design of protein solubility and aggregation tendency in protein structures. It allows researchers to predict aggregation-prone sites in proteins by considering both sequence and structure. This is particularly useful for globular proteins, where aggregation-prone sites may be buried within the native structure and the sequence may be discontinuous. By considering experimental aggregation propensity scales of natural amino acids, this method can accurately predict protein aggregation tendency.
    Tags: undefined
    Author: Zambrano R
    Release: 2022-03-01 14:05:39
    Reference: Zambrano R, Jamroz M, Szczasiuk A, Pujols J, Kmiecik S, Ventura S. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures. Nucleic Acids Res. 2015 Jul 1;43(W1):W306-13.

    Aggregation Score

    简介

    该模块用于预测蛋白质结构中的聚集倾向和蛋白质溶解度,通过考虑序列和结构来预测蛋白质中易聚集的位点,这对于球状蛋白质特别有用,其中容易聚集的位点可能埋藏在天然结构内并且序列不连续。通过考虑天然氨基酸的实验聚集倾向尺度,该方法可以准确预测蛋白质聚集倾向,也可用于预测构象紊乱中家族性突变的致病作用。任何已知或预测的蛋白质结构都是适用的,它具备其他基于序列的算法未考虑的特性,例如蛋白质动态波动和蛋白质序列中距离较远的残基的空间聚类,这对于从初始折叠状态准确预测蛋白质聚集非常重要。
    底层算法Aggrescan3D(A3D)旨在预测蛋白质在其折叠状态下的聚集倾向。为了实现这个目标,A3D使用蛋白质的三维结构作为输入,这些结构可以通过X射线衍射、溶液NMR或建模方法得到,并以pdb格式表示。在分析之前,这些结构会经过能量最小化处理。该方法利用了实验得出的天然氨基酸内在聚集倾向尺度,并将这个尺度应用于蛋白质的三维结构中。在A3D方法中,结构中每个特定氨基酸的内在聚集倾向会受到其特定的结构环境的调节。聚集倾向是通过以每个残基Cα碳为中心的球形区域计算得出的。这为结构中每个氨基酸提供了一个独特的经过结构修正的聚集值(A3D分数),其公式如下:

    image.png

    其中:Aggi是球心处残基的内在聚集倾向;RSAi是其相对于溶剂暴露的表面积;Agge是包括在球体中的每个额外残基的内在聚集倾向,RSAe是其相对于溶剂暴露的表面积,dist是到中心残基i的距离。

    参数说明

    Structure PDB File

    蛋白质结构文件(PDB 格式)。
    支持上传包含多个结构文件的压缩包进行批量处理,包括 .zip、.tar、.tar.gz、.tgz、.tar.bz2、.tbz2、.tar.xz、.txz格式

    注意

    • 系统默认最多处理 500 个结构。
    • 如果没有已知结构,可以用结构预测模块预测。

    结果说明

    输出结果包括:

    名称 说明
    Aggregation Score (result_A3D.csv) 蛋白结构中每个氨基酸聚集倾向和蛋白质溶解度的打分文件
    Structure (output.pdb) 根据聚集倾向和蛋白质溶解度得到的结构文件,在PDB文件温度因子一栏填入计算得到的聚集度和溶解度数值
    all_results_AggS.tar.gz 当输入为压缩包格式并包含多个结构文件时,系统会将每个结构对应的计算结果汇总并打包为该压缩文件输出。
    result_A.png A链中每个氨基酸对应的聚集度和溶解度打分值的png格式图片
    result_A.svg A链中每个氨基酸对应的聚集度和溶解度打分值的svg格式图片

    其中result_A3D.csv包括信息如下:

    字段名称 说明
    protein 氨基酸残基折叠
    chain 蛋白链名称
    residue 氨基酸索引(PDB文件中)
    residue_name 氨基酸名称缩写(PDB文件中)
    score 聚集度和溶解度打分值,该数值为正代表氨基酸促进聚集,为负代表氨基酸促进溶解。

    参考文献

    • Zambrano R, Jamroz M, Szczasiuk A, Pujols J, Kmiecik S, Ventura S. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures. Nucleic Acids Res. 2015 Jul 1;43(W1):W306-13.
    • Aleksander Kuriata, Valentin Iglesias, Jordi Pujols, Mateusz Kurcinski, Sebastian Kmiecik, Salvador Ventura, Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W300–W307

    Aggregation Score

    Introduction

    This module is used to predict the aggregation propensity and protein solubility in protein structures. By considering both sequence and structure, it predicts sites in proteins that are prone to aggregation, which is particularly useful for globular proteins where aggregation-prone sites may be buried within the native structure and not contiguous in sequence. By considering experimentally derived aggregation propensity scales of natural amino acids, this method accurately predicts protein aggregation propensity and can be used to predict the pathogenic effects of familial mutations in conformational disorders. Any known or predicted protein structure is applicable. It incorporates features not considered by other sequence-based algorithms, such as protein dynamic fluctuations and spatial clustering of residues that are distant in the protein sequence, which is crucial for accurately predicting protein aggregation from the initial folding state.

    The underlying algorithm, Aggrescan3D (A3D), predicts the aggregation propensity of proteins in their folded state. A3D takes the protein's 3D structure as input, in PDB format; the structure can come from X-ray crystallography, solution NMR, or modeling. Structures are energy-minimized before analysis. The method applies an experimentally derived scale of intrinsic aggregation propensity for the natural amino acids to the 3D structure, so that the intrinsic propensity of each amino acid is modulated by its specific structural environment. Aggregation propensity is calculated within a spherical region centered on the Cα carbon of each residue, which yields a unique, structure-corrected aggregation value (the A3D score) for each amino acid in the structure. The calculation formula is as follows:

    [Formula image: A3D score]

    Where:

    • Aggi represents the intrinsic aggregation propensity of the residue at the center of the sphere.
    • RSAi is the relative solvent accessibility of the residue.
    • Agge is the intrinsic aggregation propensity of each additional residue included in the sphere.
    • RSAe is the relative solvent accessibility of each additional residue included in the sphere.
    • dist is the distance to the central residue i.

    Parameters

    Structure PDB File

    Protein structure file in PDB format. For batch processing, you may upload a compressed archive containing multiple structure files. Supported archive formats are .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, and .txz.

    Notes

    • The system is configured to process a maximum of 500 structures per submission by default.
    • In cases where known experimental structures are unavailable, please utilize the Structure Prediction module to generate the required models.

    Results

    The output results include:

    Name Description
    Aggregation Score (result_A3D.csv) A scoring file for the aggregation propensity and protein solubility of each amino acid in the protein structure.
    Structure (output.pdb) Structure file obtained based on the aggregation propensity and protein solubility, with the calculated aggregation and solubility values filled in the temperature factor column of the PDB file.
    all_results_AggS.tar.gz When the input is provided as a compressed archive containing multiple structure files, the calculation results for each structure are collected and packaged into this archive for download.
    result_A.png A PNG format image showing the aggregation and solubility scores for each amino acid in chain A.
    result_A.svg An SVG format image showing the aggregation and solubility scores for each amino acid in chain A.

    The result_A3D.csv file includes the following information:

    Field Name Description
    protein Fold of the amino acid residue.
    chain Protein chain name.
    residue Amino acid index in the PDB file.
    residue_name Amino acid name abbreviation in the PDB file.
    score Aggregation and solubility score, where a positive value indicates promotion of aggregation and a negative value indicates promotion of solubility.
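
    As an illustration of how the per-residue scores might be used, the sketch below lists residues with a positive score, most aggregation-prone first. It assumes that result_A3D.csv is comma-delimited with a header row matching the field names above; the function name `aggregation_prone` and the threshold parameter are hypothetical.

```python
import csv
import io


def aggregation_prone(csv_text, threshold=0.0):
    """Return (chain, residue, residue_name, score) tuples with score > threshold.

    Assumption: the CSV has the columns chain, residue, residue_name and score.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        score = float(row["score"])
        if score > threshold:
            hits.append((row["chain"], row["residue"], row["residue_name"], score))
    # Highest scores (strongest predicted aggregation tendency) first.
    return sorted(hits, key=lambda hit: hit[3], reverse=True)
```

    To apply it to a real file, read the file contents first, for example with `open("result_A3D.csv").read()`.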

    References

    • Zambrano R, Jamroz M, Szczasiuk A, Pujols J, Kmiecik S, Ventura S. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures. Nucleic Acids Res. 2015 Jul 1;43(W1):W306-13.
    • Aleksander Kuriata, Valentin Iglesias, Jordi Pujols, Mateusz Kurcinski, Sebastian Kmiecik, Salvador Ventura, Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W300–W307
  • Name: Sequence Mutagenesis (Saturated)
    Description: 枚举蛋白质序列指定位置饱和突变的所有可能性,生成所有对应突变的文本文件和突变体序列文件。 Enumerate all possible point mutations at specified positions in a protein sequence, and generate text files for all corresponding mutations and mutant sequence files.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-23 14:29:03
    Reference:

    Sequence Mutagenesis (Saturated)

    简介

    Sequence Mutagenesis (Saturated)是用于枚举蛋白质序列指定位置饱和突变的所有可能性,生成所有对应突变的文本文件和突变体序列文件。

    参数说明

    Input File

    蛋白序列文件,FASTA格式。

    Mutation Location

    突变位置,多个位置可以用逗号(,)隔开。

    Output File

    指定输出突变后的序列文件的名称,FASTA格式。

    Output Mutation Residue

    包含突变信息的文本文件的名称。

    Chain Name

    指定链名,生成带有链名的突变信息。

    结果说明

    输出结果包括:

    输出文件名称 说明
    mutated_seqs.fasta 突变后的序列文件
    individual.txt 突变文件信息,包含链信息
    mutated_polict.txt 突变文件信息,不包含链信息

    Sequence Mutagenesis (Saturated)

    Introduction

    Sequence Mutagenesis (Saturated) enumerates every possible substitution (saturation mutagenesis) at specified positions in a protein sequence, and generates a text file listing the mutations and a sequence file containing the mutants.
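
    The enumeration can be sketched as follows. This is an illustration only: positions are assumed to be 1-based, the mutation label format (for example K2A) is an assumption rather than the module's file format, and the names `saturate` and `parse_positions` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def parse_positions(text):
    """Parse a comma-separated position string such as '2,3' into integers."""
    return [int(part) for part in text.split(",") if part.strip()]


def saturate(sequence, positions):
    """Yield (label, mutant_sequence) for each single substitution.

    Every listed 1-based position is replaced by each of the other 19
    amino acids, one substitution per mutant.
    """
    for pos in positions:
        wild_type = sequence[pos - 1]
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue  # the wild-type residue is not a mutation
            mutant = sequence[: pos - 1] + aa + sequence[pos:]
            yield f"{wild_type}{pos}{aa}", mutant
```

    Each position contributes 19 single-point mutants, so two positions yield 38 sequences in this sketch.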

    Parameter Description

    Input File

    Protein sequence file in FASTA format.

    Mutation Location

    Positions to mutate. Separate multiple positions with commas (,).

    Output File

    Specify the name of the output file containing the mutated sequence in FASTA format.

    Output Mutation Residue

    Name of the text file containing mutation information.

    Chain Name

    Specify the chain name to generate mutation information with chain names.

    Result Description

    The output results include:

    Output File Name Description
    mutated_seqs.fasta Sequence file containing the mutated sequences.
    individual.txt Mutation file information with chain information.
    mutated_polict.txt Mutation file information without chain information.
  • Name: Structure Mutagenesis
    Description: 从蛋白结构文件得到蛋白的序列信息,然后对指定位点进行饱和突变或者丙氨酸突变,得到包含突变信息的突变文件和突变序列。用于后续其他模块进行结构突变。 The Structure Mutagenesis module obtains protein sequence information from protein structure files and performs saturation mutagenesis or alanine mutagenesis at specified sites to generate a mutation file and a mutated sequence file containing mutation information. This module is used for subsequent structural mutation analysis in other modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-17 22:36:02
    Reference:

    Structure Mutagenesis

    简介

    对复合物界面区域进行单点或者多点的虚拟饱和突变,从而获得不同格式的突变文件以及突变后的Fasta文件。这为后续复合物之间的亲和力以及对突变体之间的结合自由能计算提供基础。

    参数说明

    Input File

    蛋白结构文件,PDB格式。

    Mutation Site

    突变位点文件,JSON格式,一般由Complex Interface Analysis模块生成的json文件。

    Chain Name

    指定链名。

    Output Sequence

    指定输出突变后的序列文件的名称。

    Mutated Policy

    指定输出突变文件的名称,不包含链信息。

    Chain Mutated Policy

    指定输出突变文件的名称,包含指定链信息。

    Mode

    突变模式:

    • Saturation:饱和突变,突变为其他19种氨基酸。
    • AlaScan:丙氨酸突变,仅突变为丙氨酸。

    结果说明

    输出结果包括:

    输出文件名称 说明
    mutated_policy.txt 突变文件信息,不包含链信息
    mutated_policy_with_chain.txt 突变文件信息,包含链信息
    output_mutated_seqs.fasta 突变后的序列文件

    Structure Mutagenesis

    Introduction

    This module performs virtual saturation mutagenesis at one or more sites in the interface region of a complex, producing mutation files in different formats and a FASTA file of the mutated sequences. These outputs provide the basis for subsequent calculations of the affinity between complexes and of the binding free energy of the mutants.

    Parameter Description

    Input File

    Protein structure file in PDB format.

    Mutation Site

    Mutation site file in JSON format, typically generated by the Complex Interface Analysis module.

    Chain Name

    Specify the chain name.

    Output Sequence

    Specify the name of the output file containing the mutated sequence.

    Mutated Policy

    Specify the name of the output mutation file without chain information.

    Chain Mutated Policy

    Specify the name of the output mutation file with specified chain information.

    Mode

    Mutation mode:

    • Saturation: Saturation mutagenesis, mutating to the other 19 amino acids.
    • AlaScan: Alanine scanning mutagenesis, mutating only to alanine.
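
    The two modes can be sketched as follows, starting from a PDB chain. This is an illustration only: it assumes standard fixed-column ATOM records and reads one residue per Cα atom, the label format (for example K10A) is an assumption, and the names `chain_sequence` and `mutations` are hypothetical.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequence(pdb_lines, chain_id):
    """Map residue number -> one-letter code using the CA atoms of one chain."""
    residues = {}
    for line in pdb_lines:
        if (line.startswith("ATOM") and len(line) >= 26
                and line[12:16].strip() == "CA" and line[21] == chain_id):
            residues.setdefault(int(line[22:26]), THREE_TO_ONE[line[17:20]])
    return residues


def mutations(residues, sites, mode="Saturation"):
    """List mutation labels for the given sites.

    Saturation: each site is mutated to the other 19 amino acids.
    AlaScan: each site is mutated to alanine only (skipped if already alanine).
    """
    targets = "ACDEFGHIKLMNPQRSTVWY" if mode == "Saturation" else "A"
    labels = []
    for site in sites:
        wild_type = residues[site]
        labels.extend(f"{wild_type}{site}{aa}" for aa in targets if aa != wild_type)
    return labels
```

    The site list would normally come from the JSON file supplied as Mutation Site; its schema is not described here, so the sketch takes a plain list of residue numbers.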

    Result Description

    The output results include:

    Output File Name Description
    mutated_policy.txt Mutation file information without chain information.
    mutated_policy_with_chain.txt Mutation file information with chain information.
    output_mutated_seqs.fasta Sequence file containing the mutated sequences.
  • Name: Protein BLAST
    Description: 从蛋白数据库搜索同源序列,数据库序列整合了GenPept、Swissprot、PIR、PDF、PDB、RefSeq等序列数据库。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Blast -> Protein BLAST。 Search for homologous sequences in protein databases, which integrates sequences from various databases including GenPept, Swissprot, PIR, PDF, PDB, and RefSeq. It is recommended to use in the WeSeq: WeSeq -> Blast -> Protein BLAST.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-15 11:00:04
    Reference:

    Protein BLAST

    简介

    Protein BLAST是蛋白Blast数据库,该数据库序列整合了GenPept、Swissprot、PIR、PDF、PDB、RefSeq等序列数据库。

    参数说明

    Input File

    蛋白序列文件,FASTA格式。

    Type

    指定序列比对数据库类型:蛋白,抗体,或者CDR区域。
    nr:蛋白Blast数据库。
    oas:Observed Antibody Space,抗体Blast数据库。
    cdr:CDR区域数据库,专利保护抗体数据库 。

    结果说明

    输出结果文件为alignment.fasta,是系列对齐后的FASTA文件,可在WeSeq中查看。

    Protein BLAST

    Introduction

    Protein BLAST searches a protein BLAST database for homologous sequences. The database integrates sequences from GenPept, Swissprot, PIR, PRF, PDB, RefSeq, and other sequence databases.

    Parameter Description

    Input File

    Protein sequence file in FASTA format.

    Type

    Specifies the database to search: protein, antibody, or CDR region.

    • nr: Protein BLAST database.
    • oas: Observed Antibody Space, an antibody BLAST database.
    • cdr: CDR region database, a patent-protected antibody database.

    Result Description

    The output result file is alignment.fasta, which is a FASTA file of the aligned sequences that can be viewed in WeSeq.

  • Name: Sequence Mutagenesis (Directed) for Ab
    Description: 根据模板抗体序列和描述突变的突变文件 (json) 批量生成突变抗体序列,通常突变文件由 BLAST 和 MSA 自动生成。这对于高通量抗体工程设计很有用。 Generate sequences of mutated antibody sequences based on a template antibody sequence and a mutation file (json) listing all mutations (normally the mutation file is automatically generated by BLAST and MSA). This is useful for high-throughput antibody engineering design.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-10 10:22:35
    Reference:

    Sequence Mutagenesis (Directed) for Ab

    简介

    Sequence Mutagenesis (Directed) for Ab是根据模板抗体序列和描述突变的突变文件(json)批量生成突变抗体序列,通常突变文件由BLAST和MSA自动生成。这对于高通量抗体工程设计很有用。

    参数说明

    Input File

    抗体的序列文件,FASTA格式

    Mutation File

    突变文件,JSON格式

    Cutoff

    突变频率截断值,默认10,只针对突变频率超过截断值的氨基酸生成对应的突变信息。用于过滤掉低频率的突变氨基酸。

    Numbering Type

    抗体编号类型:kabat,chothia,imgt以及none

    结果说明

    输出结果包括:

    输出文件名称 说明
    gen.fr.fasta 骨架区(frameworkregion,FR)FASTA文件
    gen.fr.mutations.txt 骨架区(frameworkregion,FR)突变文件信息
    gen.cdr.fasta 互补决定区(complementarity-determining region, CDR)FASTA文件
    gen.cdr.mutations.txt 互补决定区(complementarity-determining region, CDR)突变文件信息

    Sequence Mutagenesis (Directed) for Ab

    Introduction

    Sequence Mutagenesis (Directed) for Ab generates mutated antibody sequences in batch from a template antibody sequence and a mutation file (JSON) that describes the mutations. The mutation file is typically generated automatically by BLAST and MSA. This is useful for high-throughput antibody engineering design.

    Parameter Description

    Input File

    Antibody sequence file in FASTA format.

    Mutation File

    Mutation file in JSON format.

    Cutoff

    Mutation frequency cutoff; the default is 10. Mutation information is generated only for amino acids whose mutation frequency exceeds the cutoff, which filters out low-frequency mutations.
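
    The cutoff filter can be sketched as follows. The JSON layout used here (position mapped to amino acid and frequency) is purely an assumption for illustration, since the schema of the mutation file is not described; the function name `filter_mutations` is hypothetical.

```python
import json


def filter_mutations(mutation_json, cutoff=10):
    """Keep only amino acids whose frequency is strictly above the cutoff.

    Assumed input layout: {"<position>": {"<amino acid>": <frequency>, ...}}.
    Positions with no amino acid above the cutoff are dropped.
    """
    data = json.loads(mutation_json)
    kept = {}
    for position, frequencies in data.items():
        frequent = {aa: f for aa, f in frequencies.items() if f > cutoff}
        if frequent:
            kept[position] = frequent
    return kept
```

    With the default cutoff of 10, an amino acid with a frequency of exactly 10 is excluded in this sketch, following the wording "exceeds the cutoff".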

    Numbering Type

    Antibody numbering type: kabat, chothia, imgt, or none.

    Result Description

    The output results include:

    Output File Name Description
    gen.fr.fasta FASTA file for the Framework Region (FR)
    gen.fr.mutations.txt Mutation file information for the Framework Region (FR)
    gen.cdr.fasta FASTA file for the Complementarity-Determining Region (CDR)
    gen.cdr.mutations.txt Mutation file information for the Complementarity-Determining Region (CDR)
  • Name: Mutation List Generation
    Description: 基于一个原始序列,从经过序列比对后得到的序列(例如BLAST得到的同源序列)中提取每个位点出现过的所有突变(同源突变/共识突变),生成一个突变列表,并按位点统计突变的频率。 Generate a list of mutations (aka. consensus mutations) from a set of aligned sequences (normally generated by the blast).
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-02-10 10:22:00
    Reference:

    Mutation List Generation

    简介

    Mutation List Generation是基于一个原始序列,从经过序列比对后得到的序列(例如BLAST得到的同源序列)中提取每个位点出现过的所有突变(同源突变/共识突变),生成一个突变列表,并按位点统计突变的频率。

    参数说明

    Reference Seq

    参考蛋白序列,FASTA格式

    Homologs

    同源序列文件,一般由参考序列BLAST数据库后得到,FASTA格式

    结果说明

    输出结果包括:

    输出文件名称 说明
    mutations.csv 突变统计文件,包含每个位点的突变的类型及其百分比,CSV格式
    output.json 突变统计文件,包含每个位点的突变类型及其频率,JSON格式
    mutations.txt 突变文件,根据前面的突变统计信息生成,包含了野生型氨基酸、位置以及突变后氨基酸

    其中mutations.csv包括信息如下:

    字段名称 说明
    WT 野生型氨基酸
    Position 突变位置
    Mutations and frequency 突变氨基酸及其频率

    Mutation List Generation

    Introduction

    Mutation List Generation takes an original (reference) sequence and a set of aligned sequences (for example, homologous sequences found by BLAST), and extracts every mutation observed at each position (homologous or consensus mutations). It generates a mutation list and reports the mutation frequencies for each position.
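
    The per-position statistics can be sketched as follows. This is an illustration under stated assumptions: the reference and homologs are already aligned to equal length with '-' marking gaps, frequencies are percentages of the total number of homologs, and the function name `mutation_frequencies` is hypothetical.

```python
from collections import Counter


def mutation_frequencies(reference, homologs):
    """Return {(wild_type, position): {amino_acid: percent}} for aligned input.

    Positions are 1-based and counted along the ungapped reference sequence.
    Gaps in a homolog and residues identical to the reference are ignored.
    """
    result = {}
    position = 0
    for column, wild_type in enumerate(reference):
        if wild_type == "-":
            continue  # insertion relative to the reference, no position number
        position += 1
        counts = Counter(
            seq[column] for seq in homologs if seq[column] not in ("-", wild_type)
        )
        if counts:
            total = len(homologs)
            result[(wild_type, position)] = {
                aa: 100.0 * n / total for aa, n in counts.items()
            }
    return result
```

    Each key pairs the wild-type amino acid with its position, matching the WT and Position fields described under Result Description.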

    Parameter Description

    Reference Seq

    Reference protein sequence in FASTA format.

    Homologs

    Homologous sequence file typically obtained by BLASTing the reference sequence against a database, in FASTA format.

    Result Description

    The output results include:

    Output File Name Description
    mutations.csv Mutation statistics file containing the type and percentage of mutations at each position, in CSV format
    output.json Mutation statistics file containing the type and frequency of mutations at each position, in JSON format
    mutations.txt Mutation file generated based on the mutation statistics information, containing the wild-type amino acid, position, and mutated amino acid

    The mutations.csv file includes the following information:

    Field Name Description
    WT Wild-type amino acid
    Position Mutation position
    Mutations and frequency Mutated amino acid and its frequency
  • Name: Solubility Score
    Description: 基于序列的蛋白溶解度预测。 Sequence-based protein solubility prediction.
    Tags: undefined
    Author: Hon J
    Release: 2022-01-24 11:53:25
    Reference: Bioinformatics. 2021 Apr 9;37(1):23-28. Bioinformatics. 2017 Oct 1;33(19):3098-3100. J Mol Biol. 2015 Jan 30;427(2):478-90.

    Solubility Score

    简介

    蛋白质溶解度不良阻碍了许多治疗和工业上有用的蛋白质的生产。通过实验手段增加溶解度的努力往往成功率低,并且通常会降低生物活性。使用序列信息来计算预测蛋白的溶解度,可以大大降低实验研究的成本。
    本模块使用CamSol、SoluProt和Protein-Sol算法进行溶解度预测。其中:

    • CamSol是利用最直接影响蛋白质溶解度的氨基酸的物理化学特性,包括疏水性、静电荷以及它们在空间的相互作用,通过对这些特性的组合来定义溶解度分数。该方法在预测突变对蛋白质溶解度的影响方面具有很高的准确性。与其他现有方法相比,如SOLpro和 PROSO II,在测试的56个变体中,该方法正确预测了54个突变体在突变后溶解度的变化,而SOLpro和PROSO II分别为40和32个。
    • SoluProt是一个基于序列信息预测溶解度的机器学习模型,使用了高质量的TargetTrack数据集进行训练,并使用NESG数据库的3100条序列进行了验证,准确度优于其他现有预测算法(评测结果见下表)。基于梯度增强机器模型并采用 96 个基于序列的特征,例如氨基酸含量、与 PDB 序列的序列同一性以及几种聚合的物理化学特性。 对溶解度的预测准确度为 58.5%,AUC 为 0.62,高于其他同类工具。
    • Protein-Sol提供了一种快速的基于序列的方法来预测蛋白质的溶解度,共采用了35个基于序列的特征进行模型构建。使用来自于大肠杆菌,酵母和人源的上万个蛋白数据进行了模型训练和验证测试。注意:要求输入序列长度大于20个氨基酸残基。

    结果说明

    输出结果包括:

    输出文件名称 说明
    protein-sol_score_show.png Protein–Sol方法下,针对Folding Propensity和Charge两个指标的分布图。横坐标Windows为每21个氨基酸为一个片段组别。
    result_per_chain.csv 三种方法下,每条链的预测溶解度结果。
    result_per_residue.csv Protein–Sol方法下,不同蛋白区域对应的溶解度情况(该结果仅针对第一条链)。

    其中result_per_chain.csv包括信息如下:

    字段名称 说明
    Protein ID 蛋白序列名称
    Solubility (CamSol) CamSol方法预测的溶解度。越大表示溶解性越好,大于1时,表示溶解性很好;当分数小于-1时,溶解性很差。
    Solubility (Soluprot) Soluprot方法预测的溶解度,值越大表示溶解性越好
    Solubility (Protein-Sol) Protein-Sol方法预测的溶解度,值越大表示溶解性越好
    pI 蛋白等电点

    其中result_per_residue.csv包括信息如下:

    字段名称 说明
    ID 蛋白序列名称
    Kyte-Doolittle Hydropathy 氨基酸亲水指数是一个描述其支链的亲水性或疏水性程度大小的值。亲水指数越小代表该氨基酸段的亲水性越强。
    Folding Propensity 该数值描述蛋白折叠程度,该数值越大,越不利于蛋白溶解。
    Entropy 熵是在某种分子折叠构象下能保证该分子最稳定(熵最大)。熵越大越不利于蛋白溶解。
    Charge 蛋白质表面带有的电荷值,带电蛋白均有利于溶解度,无论正负。
    Sequence 所分析的序列段。

    参考文献

    • Hon J, Marusiak M, Martinek T, Kunka A, Zendulka J, Bednar D, Damborsky J. SoluProt: prediction of soluble protein expression in Escherichia coli. Bioinformatics. 2021 Apr 9;37(1):23-28. DOI: 10.1093/bioinformatics/btaa1102
    • Hebditch M, Carballo-Amador MA, Charonis S, Curtis R, Warwicker J. Protein-Sol: a web tool for predicting protein solubility from sequence. Bioinformatics. 2017 Oct 1;33(19):3098-3100. DOI: 10.1093/bioinformatics/btx345
    • Sormanni P, Aprile FA, Vendruscolo M. The CamSol method of rational design of protein mutants with enhanced solubility. J Mol Biol. 2015 Jan 30;427(2):478-90. DOI: 10.1016/j.jmb.2014.09.026

    Solubility Score

    Introduction

    Poor protein solubility hinders the production of many therapeutically and industrially useful proteins. Efforts to increase solubility through experimental means often have low success rates and can compromise biological activity. Calculating protein solubility based on sequence information can significantly reduce the cost of experimental research.
    This module uses the CamSol, SoluProt, and Protein-Sol algorithms for solubility prediction. Specifically:

    • CamSol utilizes the physical and chemical properties of amino acids that most directly affect protein solubility, including hydrophobicity, electrostatic charges, and their spatial interactions, to define a solubility score based on a combination of these properties. This method demonstrates high accuracy in predicting the impact of mutations on protein solubility. In a test of 56 variants, it correctly predicted the solubility changes after mutation for 54 variants, compared to 40 and 32 for SOLpro and PROSO II, respectively.
    • SoluProt is a machine learning model that predicts solubility from sequence information. It was trained on a high-quality TargetTrack dataset and validated on 3100 sequences from the NESG database. It is a gradient boosting machine model that uses 96 sequence-based features, such as amino acid content, sequence identity to PDB sequences, and several aggregated physicochemical properties. Its solubility prediction accuracy is 58.5%, with an AUC of 0.62, which is reported to be higher than that of other comparable tools.
    • Protein-Sol provides a fast sequence-based method for predicting protein solubility, built on 35 sequence-based features. The model was trained and validated on data for tens of thousands of proteins from Escherichia coli, yeast, and human sources. Note: input sequences must be longer than 20 amino acid residues.

    Results

    The output results include:

    Output File Name Description
    protein-sol_score_show.png Distribution plot of the Folding Propensity and Charge metrics from the Protein-Sol method. The horizontal axis (Windows) groups the sequence into segments of 21 amino acids each.
    result_per_chain.csv Predicted solubility results for each chain under the three methods.
    result_per_residue.csv Solubility status corresponding to different protein regions under the Protein-Sol method (this result is only for the first chain).

    The result_per_chain.csv includes the following information:

    Field Name Description
    Protein ID Protein sequence name
    Solubility (CamSol) Predicted solubility by CamSol. A higher score indicates better solubility, with scores greater than 1 indicating good solubility and scores less than -1 indicating poor solubility.
    Solubility (SoluProt) Solubility predicted by SoluProt; a higher score indicates better solubility
    Solubility (Protein-Sol) Solubility predicted by Protein-Sol; a higher score indicates better solubility
    pI Isoelectric point of the protein
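
    The CamSol thresholds described above can be applied to result_per_chain.csv with a short sketch. It assumes a comma-delimited file whose header uses the field names listed above; the label "intermediate" for scores between -1 and 1 and the function names are hypothetical.

```python
import csv
import io


def classify_camsol(score):
    """Label a CamSol score: above 1 is good, below -1 is poor."""
    if score > 1:
        return "good"
    if score < -1:
        return "poor"
    return "intermediate"  # assumed label for scores from -1 to 1


def classify_chains(csv_text):
    """Return {Protein ID: solubility label} from per-chain CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["Protein ID"]: classify_camsol(float(row["Solubility (CamSol)"]))
        for row in reader
    }
```

    Only the CamSol column is classified here, because the other two scores are documented only as "higher is better" without thresholds.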

    The result_per_residue.csv includes the following information:

    Field Name Description
    ID Protein sequence name
    Kyte-Doolittle Hydropathy Hydropathy index of amino acids, describing the hydrophilicity or hydrophobicity of their side chains. A smaller hydropathy index indicates higher hydrophilicity of the amino acid segment.
    Folding Propensity This value describes the folding degree of the protein, with higher values being less favorable for protein solubility.
    Entropy Entropy of the analyzed segment. Higher entropy is less favorable for protein solubility.
    Charge The charge value on the protein surface, with charged proteins being favorable for solubility regardless of positive or negative charge.
    Sequence The analyzed sequence segment.
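
    A windowed Kyte-Doolittle hydropathy profile can be sketched as follows. The scale values are the published Kyte-Doolittle values; the use of a sliding window of 21 residues with a plain mean is an assumption about how the segments are formed, not a description of Protein-Sol itself, and the function name `windowed_hydropathy` is hypothetical.

```python
# Kyte-Doolittle hydropathy scale (more negative = more hydrophilic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def windowed_hydropathy(sequence, window=21):
    """Return the mean hydropathy of every full window along the sequence.

    A sequence shorter than the window yields an empty list.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]
```

    A sequence of length N gives N - 20 windows with the default window size, which is consistent with the requirement that inputs be longer than 20 residues.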

    References

    • Hon J, Marusiak M, Martinek T, Kunka A, Zendulka J, Bednar D, Damborsky J. SoluProt: prediction of soluble protein expression in Escherichia coli. Bioinformatics. 2021 Apr 9;37(1):23-28. DOI: 10.1093/bioinformatics/btaa1102
    • Hebditch M, Carballo-Amador MA, Charonis S, Curtis R, Warwicker J. Protein-Sol: a web tool for predicting protein solubility from sequence. Bioinformatics. 2017 Oct 1;33(19):3098-3100. DOI: 10.1093/bioinformatics/btx345
    • Sormanni P, Aprile FA, Vendruscolo M. The CamSol method of rational design of protein mutants with enhanced solubility. J Mol Biol. 2015 Jan 30;427(2):478-90. DOI: 10.1016/j.jmb.2014.09.026
  • Name: Humanization Report
    Description: 抗体人源化设计报告生成模块,用于生成最终的抗体人源化设计报告以及相应的专利实施例段落。 Generating the humanization design reports as well as patent example paragraphs.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-01-19 09:19:22
    Reference:

    Humanization Report

    简介

    Humanization Report是抗体人源化设计报告生成模块,用于生成最终的抗体人源化设计报告以及相应的专利实施例段落。

    参数说明

    Graft Policy

    Grafting模块生成的Graft Policy文件。

    Mutate Policy

    Back Mutation Grouping模块生成的Policy文件。

    结果说明

    输出结果包括:

    输出文件名称 说明
    BM.pptx 回复突变位点汇总文件
    batch_registration_template.xlsx 批量注册模板文件
    hotspot_summary.xlsx 风险位点总结
    patent_example_template.docx 人源化设计序列在相应的专利实施例段落
    humanized_variants.fasta 抗体人源化设计序列文件,FASTA格式
    Report.docx 抗体人源化设计报告,包括整个人源化设计过程涉及的序列、分组等信息

    其中batch_registration_template.xlsx包含如下信息:

    字段名称 说明
    Protein Sequence 蛋白序列
    Molecule Name 分子名称

    其中hotspot_summary.xlsx包含如下信息:

    字段名称 说明
    ID 抗体序列名称
    Sequence-CDR CDR序列区域
    Deamidation 脱酰胺位点
    Isomerization 异构化位点
    Cleavage 酶切位点
    Hydrolysis 水解位点
    Glycosylation 糖基化位点
    Cys 半胱氨酸数量
    Oxidation 氧化位点
    High risk 高风险率
    High risk sites 高风险位点

    Humanization Report

    Introduction

    Humanization Report is the report generation module for antibody humanization design. It generates the final antibody humanization design report and the corresponding patent example paragraphs.

    Parameter Description

    Graft Policy

    The Graft Policy file generated by the Grafting module.

    Mutate Policy

    The Policy file generated by the Back Mutation Grouping module.

    Result Description

    The output results include:

    Output File Name Description
    BM.pptx Summary file of back mutation sites
    batch_registration_template.xlsx Batch registration template file
    hotspot_summary.xlsx Summary of hotspot sites
    patent_example_template.docx Humanized design sequences written out as the corresponding patent example paragraphs
    humanized_variants.fasta Antibody humanization design sequence file in FASTA format
    Report.docx Antibody humanization design report, including information on sequences, grouping, etc., involved in the entire humanization design process

    The batch_registration_template.xlsx file contains the following information:

    Field Name Description
    Protein Sequence Protein sequence
    Molecule Name Molecule name

    The hotspot_summary.xlsx file contains the following information:

    Field Name Description
    ID Antibody sequence name
    Sequence-CDR CDR sequence region
    Deamidation Deamidation site
    Isomerization Isomerization site
    Cleavage Cleavage site
    Hydrolysis Hydrolysis site
    Glycosylation Glycosylation site
    Cys Number of cysteines
    Oxidation Oxidation site
    High risk High-risk rate
    High risk sites High-risk sites
  • Name: Protein Docking (FRODOCK)
    Description: 蛋白-蛋白对接程序 Protein-protein docking tool
    Tags: undefined
    Author: Ramírez-Aportela E
    Release: 2022-01-12 19:45:05
    Reference: Ramírez-Aportela E, López-Blanco JR, Chacón P. FRODOCK 2.0: fast protein-protein docking server. Bioinformatics. 2016 Aug 1;32(15):2386-8.

    Protein Docking (FRODOCK)

    简介

    FRODOCK是由西班牙Pablo Chacón教授开发的蛋白-蛋白对接软件。FRODOCK使用球谐函数(spherical harmonics)的旋转搜索提高对接效率。全局能量优化采用 6D(3D 旋转 + 3D平移)刚体详尽搜索(rigid-body exhaustive search)固定配体的构象。复合物的结合能考虑范德华力、静电和去溶剂化三个能量项。在抗原-抗体复合物、酶-底物、其他蛋白复合物的基准测试集中效果表现很好。具有以下技术特点:

    1. 采用球谐函数旋转搜索提高对接效率。
    2. 采用6D(3D 旋转 + 3D平移)进行详尽搜索采样。

    参数说明

    Receptor File

    受体结构文件,PDB格式。

    Ligand File

    配体结构文件,PDB格式。

    Interaction Type

    相互作用类型。

    Constraints File

    限制文件,文本格式如下:

    # RECEPT_____ LIGAND_____ D__
    # -------------------------------
    GLY A 269 SER A 81 5
    GLY A 269 LEU A 84 10
    

    其中"GLY A 269"代表受体部分的残基名称"GLY"、链名称"A"、残基编号"269";“SER A 81"代表配体部分的残基"SER”,链名称"A",残基编号"81";"5"代表受配体残基之间的距离在5Å。

    Clusters Number

    生成构象聚类最大数目。

    Output TopN

    保存的得分最高分子的PDB文件。

    Reference File

    参考结合配体分子(用于比较),格式:PDB。

    结果说明

    输出结果包括:

    输出文件名称 说明
    complex_01.pdb-complex_10.pdb 输出打分前十的复合物构象
    output_complex_TopN.tar.gz 输出所有复合物结构的压缩包文件
    TopN_score.csv 提供复合物构象的对接打分,其中打分值越大,结合能力越强。
    output_ligand_TopN.tar.gz 输出所有配体结构的压缩包文件

    其中TopN_score.csv包括信息如下:

    字段名称 说明
    NO 打分排序
    Euler1 配体旋转α角度(ZYZ顺序旋转的欧拉角)
    Euler2 配体旋转β角度(ZYZ顺序旋转的欧拉角)
    Euler3 配体旋转γ角度(ZYZ顺序旋转的欧拉角)
    posX 配体质心所在位置的X坐标
    posY 配体质心所在位置的Y坐标
    posZ 配体质心所在位置的Z坐标
    Absolute_Energy_Score 绝对能量分数用来评估复合物结合能力强弱。
    Ligand_File 配体文件名称
    complex_pdb 复合物文件名称

    参考文献

    Ramírez-Aportela E, López-Blanco JR, Chacón P. FRODOCK 2.0: fast protein-protein docking server. Bioinformatics. 2016 Aug 1;32(15):2386-8.

    Protein Docking (FRODOCK)

    Introduction

    FRODOCK is a protein-protein docking program developed by Professor Pablo Chacón in Spain. It uses a spherical-harmonics rotational search to improve docking efficiency. Global energy optimization is carried out as an exhaustive 6D (3D rotation + 3D translation) rigid-body search in which the ligand conformation is kept fixed. The binding energy of the complex includes van der Waals, electrostatic, and desolvation terms. FRODOCK performs well on benchmark sets of antigen-antibody, enzyme-substrate, and other protein complexes. It has the following technical features:

    1. Utilizes spherical harmonics for rotation search to enhance docking efficiency.
    2. Utilizes 6D (3D rotation + 3D translation) for exhaustive search sampling.

    Parameter Description

    Receptor File

    Structure file of the receptor in PDB format.

    Ligand File

    Structure file of the ligand in PDB format.

    Interaction Type

    Type of interaction.

    Constraints File

    Text file specifying constraints, with the format:

    # RECEPT_____ LIGAND_____ D__
    # -------------------------------
    GLY A 269 SER A 81 5
    GLY A 269 LEU A 84 10
    

    Where “GLY A 269” represents the residue name “GLY”, chain “A”, residue number “269” in the receptor part; “SER A 81” represents the residue “SER”, chain “A”, residue number “81” in the ligand part; and “5” represents a distance of 5Å between the receptor and ligand residues.
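    The constraint format above can be read with a few lines of Python. This is a minimal sketch and not part of FRODOCK: `Constraint` and `parse_constraints` are hypothetical names, and the code assumes that fields are separated by whitespace and that lines starting with `#` are comments, as in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """One receptor-ligand distance constraint (hypothetical container)."""
    rec_name: str
    rec_chain: str
    rec_num: int
    lig_name: str
    lig_chain: str
    lig_num: int
    distance: float  # in angstroms


def parse_constraints(text: str) -> list[Constraint]:
    """Parse constraint lines, skipping blank lines and '#' comment lines."""
    constraints = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 7:
            raise ValueError(f"line {line_no}: expected 7 fields, got {len(fields)}")
        rn, rc, ri, ln, lc, li, dist = fields
        constraints.append(Constraint(rn, rc, int(ri), ln, lc, int(li), float(dist)))
    return constraints


EXAMPLE = """\
# RECEPT_____ LIGAND_____ D__
# -------------------------------
GLY A 269 SER A 81 5
GLY A 269 LEU A 84 10
"""

constraints = parse_constraints(EXAMPLE)
for c in constraints:
    print(f"receptor {c.rec_name} {c.rec_chain} {c.rec_num} / "
          f"ligand {c.lig_name} {c.lig_chain} {c.lig_num}: {c.distance} A")
```

    Validating the field count up front makes a malformed line fail with its line number before the file is submitted to a docking job.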

    Clusters Number

    Maximum number of conformation clusters to generate.

    Output TopN

    Number of top-scoring molecules to save as PDB files.

    Reference File

    Reference ligand molecule for comparison, in PDB format.

    Result Description

    The output includes:

    Output File Name Description
    complex_01.pdb-complex_10.pdb Output of the top ten scored complex conformations
    output_complex_TopN.tar.gz Compressed file containing all complex structures
    TopN_score.csv Provides docking scores for complex conformations, where higher scores indicate stronger binding affinity
    output_ligand_TopN.tar.gz Compressed file containing all ligand structures

    The TopN_score.csv file includes the following information:

    Field Name Description
    NO Ranking based on scores
    Euler1 Ligand rotation angle α (Euler angles in ZYZ order)
    Euler2 Ligand rotation angle β (Euler angles in ZYZ order)
    Euler3 Ligand rotation angle γ (Euler angles in ZYZ order)
    posX X-coordinate of the ligand center of mass
    posY Y-coordinate of the ligand center of mass
    posZ Z-coordinate of the ligand center of mass
    Absolute_Energy_Score Absolute energy score for evaluating binding strength
    Ligand_File Ligand file name
    complex_pdb Complex file name
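    To illustrate how the three Euler angles and the position columns describe a ligand pose, the sketch below builds a ZYZ rotation matrix and applies it to a point. It rests on assumptions that the field list does not settle: angles are taken in radians, the rotation is composed as Rz(Euler1) · Ry(Euler2) · Rz(Euler3) about the ligand center, and all helper names are hypothetical. Check FRODOCK's own convention before applying this to real output.

```python
import math


def rot_z(angle):
    """Rotation matrix about the Z axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_y(angle):
    """Rotation matrix about the Y axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def zyz_matrix(alpha, beta, gamma):
    """Assumed convention: R = Rz(alpha) * Ry(beta) * Rz(gamma)."""
    return matmul(rot_z(alpha), matmul(rot_y(beta), rot_z(gamma)))


def place_point(point, euler, position):
    """Rotate a point given relative to the ligand center, then move it to position."""
    r = zyz_matrix(*euler)
    return [sum(r[i][k] * point[k] for k in range(3)) + position[i]
            for i in range(3)]


# A point 1 A along X from the ligand center, rotated a quarter turn about Z
# and placed at posX=10, posY=0, posZ=0.
moved = place_point([1.0, 0.0, 0.0], (math.pi / 2, 0.0, 0.0), (10.0, 0.0, 0.0))
print([round(v, 6) for v in moved])
```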

    Reference

    Ramírez-Aportela E, López-Blanco JR, Chacón P. FRODOCK 2.0: fast protein-protein docking server. Bioinformatics. 2016 Aug 1;32(15):2386-8.

  • Name: Human Antibody BLAST
    Description: 人类抗体数据库Blast模块,该数据库目前包含来自超过75项不同研究的超过10亿个序列,涵盖了来自人类的多种免疫状态和个体。提交抗体序列,将返回同源性最高的人源同源抗体序列,可用于高级抗体人源化设计、亲和力成熟、去免疫原性、抗体工程等。建议通过WeSeq序列编辑器来使用该功能,具体为WeSeq -> Blast -> Human Antibody BLAST。 BLAST search against a human antibody database for homologous sequences. The database currently contains over one billion sequences from more than 75 studies, covering diverse immune states and individuals. Submit an antibody sequence, and the most homologous human antibody sequences are returned; they can be used for advanced antibody humanization, affinity maturation, de-immunization, antibody engineering, etc. It is recommended to use this feature through the WeSeq sequence editor: WeSeq -> Blast -> Human Antibody BLAST.
    Tags: undefined
    Author: WECOMPUT
    Release: 2022-01-13 18:17:41
    Reference: Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol. 2018 Oct 15;201(8):2502-2509.

    Human Antibody BLAST

    简介

    Observed Antibody Space 数据库 (OAS) 是一个收集和注释免疫组库以用于大规模分析的项目。它目前包含来自超过75项不同研究的超过10亿个真实抗体序列。这些库涵盖了不同的免疫状态、生物体(主要是人类和小鼠)和个体。本功能从OAS库中搜索同源的人源抗体序列,通过序列比对,可以得到不同位点的进化信息,常用于对亲和力成熟或是对人源化过程中突变位点的选择提供参考依据,指导抗体设计。

    参数说明

    Input File

    抗体序列文件,FASTA格式。

    结果说明

    通过序列比对,可以得到不同位点的进化信息文件alignment.fasta。

    参考文献

    Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol. 2018 Oct 15;201(8):2502-2509.

    Human Antibody BLAST

    Introduction

    The Observed Antibody Space (OAS) database is a project that collects and annotates immune repertoires for large-scale analysis. It currently contains over 1 billion real antibody sequences from more than 75 different studies. These libraries cover different immune states, organisms (primarily humans and mice), and individuals. This feature searches for homologous human antibody sequences from the OAS database. By aligning sequences, evolutionary information at different sites can be obtained. This is commonly used to provide reference for the selection of mutation sites during affinity maturation or humanization processes, guiding antibody design.

    Parameter Description

    Input File

    Antibody sequence file in FASTA format.

    Result Description

    The sequence alignment, which carries the per-site evolutionary information, is saved as alignment.fasta.
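    One way to turn the alignment into per-site information is to count residues column by column. The sketch below is a minimal illustration, assuming alignment.fasta is an aligned FASTA file in which all sequences have the same gapped length; the sample records and the function names `read_fasta` and `column_frequencies` are hypothetical.

```python
from collections import Counter


def read_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def column_frequencies(records):
    """Count the residues (and gaps) observed in each alignment column."""
    seqs = [seq for _, seq in records]
    if not seqs or any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*seqs)]


# Invented example alignment; a real alignment.fasta would be read from disk.
SAMPLE = """\
>query
EVQL-VES
>hit1
QVQL-VES
>hit2
EVQLQVQS
"""

freqs = column_frequencies(read_fasta(SAMPLE))
consensus = "".join(c.most_common(1)[0][0] for c in freqs)
print(consensus)
print(dict(freqs[0]))
```

    Columns where the query residue is rare among the human hits are natural candidates to review during humanization or affinity maturation.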

    Reference

    Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J Immunol. 2018 Oct 15;201(8):2502-2509.

  • Name: Protein Docking (HDOCK)
    Description: 蛋白质-蛋白质对接程序,支持蛋白质-蛋白质和蛋白质- DNA/RNA 对接。 Protein-protein docking program supporting protein-protein and protein-DNA/RNA docking.
    Tags: undefined
    Author: Yan Y; Huang S-Y
    Release: 2022-01-12 15:21:06
    Reference: Yan Y, Tao H, He J, Huang S-Y.* The HDOCK server for integrated protein-protein docking. Nature Protocols, 2020; doi: https://doi.org/10.1038/s41596-020-0312-x. Yan Y, Zhang D, Zhou P, Li B, Huang S-Y. HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy. Nucleic Acids Res. 2017;45(W1):W365-W373. Yan Y, Wen Z, Wang X, Huang S-Y. Addressing recent docking challenges: A hybrid strategy to integrate template-based and free protein-protein docking. Proteins 2017;85:497-512. Huang S-Y, Zou X. A knowledge-based scoring function for protein-RNA interactions derived from a statistical mechanics-based iterative method. Nucleic Acids Res. 2014;42:e55. Huang S-Y, Zou X. An iterative knowledge-based scoring function for protein-protein recognition. Proteins 2008;72:557-579.

    Protein Docking (HDOCK)

    简介

    HDOCK是由华中科技大学物理学院黄胜友教授团队开发的一个集成了同源搜索、基于模板建模、结构预测、大分子对接、生物信息整合的快速蛋白质-蛋白质对接程序。HDOCK使用基于快速傅里叶变换 (FFT) 的对接算法对所有结合模式进行全局采样,然后通过迭代导出的基于知识的评分函数对结合模式进行打分。在多个基准测试中显示很好的预测效果。具有以下技术特点:

    1. 支持氨基酸序列作为输入和混合对接策略
    2. 支持蛋白-DNA/RNA对接
    3. 计算速度快,几分钟内完成对接

    参数说明

    Receptor File

    受体的结构文件,PDB格式

    Ligand File

    配体的结构文件,PDB格式

    Output TopN

    输出打分最高的复合物PDB文件个数

    Grid Space

    平动网格间距

    Angle Interval

    转动角间距

    Receptor Binding Site

    受体的结合位点残基。
    结合位点残基可以作为一个文件(.txt)提交,格式如下:

    195:A
    203-206:A
    108:B
    

    表示A链的195号、203-206号残基以及B链的108号残基。请注意,文件中的残基应该放在不同的行上。

    Ligand Binding Site

    配体的结合位点残基。
    结合位点残基可以作为一个文件(.txt)提交,格式如下:

    195:A
    203-206:A
    108:B
    

    表示A链的195号、203-206号残基以及B链的108号残基。请注意,文件中的残基应该放在不同的行上。

    Restraints

    相互作用氨基酸之间的距离约束。
    距离约束可以作为一个文件(.txt)提供,格式如下:

    195:A 236:B 8
    215-218:A 306:B 6
    

    其中,受体上的A链195号残基和配体上的B链236号残基的距离将在8埃之内。受体上的A链215-218号残基和配体上的B链306号残基的距离将在6埃之内。
    注意:对于每个约束,第一个字段是受体,第二个字段是配体,第三个字段是约束距离。残基表示必须采用num:chainID或num1-num2:chainID格式,其中残基编号和链ID指的是输入结构(如果输入是结构)或模型结构(如果输入是序列)。

    Cluster Cutoff

    聚类RMSD截断值

    Keep Receptor Heterogens

    是否保留受体中非标准氨基酸:都保留(all),只保留水(water),指定保留非标准氨基酸(specify),去除所有非标准氨基酸(none)。

    Receptor Specify Heterogens

    多个残基用逗号(,)分隔开。例如:“X:UNL-1”,其中X为链名,UNL为非标准氨基酸残基名称,1为残基编号。

    Keep Ligand Heterogens

    是否保留配体中非标准氨基酸:都保留(all),只保留水(water),指定保留非标准氨基酸(specify),去除所有非标准氨基酸(none)。

    Ligand Specify Heterogens

    指定配体中需要保留非标准氨基酸,多个残基用逗号(,)分隔开。例如:“X:UNL-1”,其中X为链名,UNL为非标准氨基酸残基名称,1为残基编号。

    结果说明

    输出结果包括:

    输出文件名称 说明
    complex_01.pdb-complex_10.pdb 打分前十的复合物构象
    score.csv 提供复合物构象的对接打分,其中打分值越低,结合能力越强。
    TopNComplex.tar.gz 输出所有复合物结构的压缩包文件

    其中score.csv包括如下信息:

    字段名称 说明
    Number 打分排序
    RMSD 复合物构象的RMSD
    Score 对接能量打分,其中打分值越低,结合能力越强。

    参考文献

    • Yan Y, Tao H, He J, Huang S-Y.* The HDOCK server for integrated protein-protein docking. Nature Protocols, 2020. DOI: 10.1038/s41596-020-0312-x.
    • Yan Y, Zhang D, Zhou P, Li B, Huang S-Y. HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy. Nucleic Acids Res. 2017;45(W1):W365-W373. DOI: 10.1093/nar/gkx407.

    Protein Docking (HDOCK)

    Introduction

    HDOCK is a fast protein-protein docking program developed by the team of Professor Shengyou Huang at the School of Physics, Huazhong University of Science and Technology. It integrates homology search, template-based modeling, structure prediction, macromolecular docking, and bioinformatics integration. HDOCK uses a docking algorithm based on Fast Fourier Transform (FFT) to globally sample all binding modes and then scores the binding modes using an iteratively derived knowledge-based scoring function. It has shown good predictive performance in multiple benchmark tests. Its technical features include:

    1. Support for amino acid sequences as input and hybrid docking strategies.
    2. Support for protein-DNA/RNA docking.
    3. Fast computation speed, completing docking in minutes.

    Parameters

    Receptor File

    Structure file of the receptor in PDB format.

    Ligand File

    Structure file of the ligand in PDB format.

    Output TopN

    Number of top-scoring complex PDB files to output.

    Grid Space

    Translation grid spacing.

    Angle Interval

    Rotation angle interval.

    Receptor Binding Site

    Residues of the receptor’s binding site.
    Binding site residues can be submitted as a file (.txt) with the following format:

    195:A
    203-206:A
    108:B
    

    This indicates residue 195 of chain A, residues 203-206 of chain A, and residue 108 of chain B. Note that residues in the file should be on separate lines.
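    The binding-site file can be expanded into explicit residue lists per chain. The following is a minimal sketch, not HDOCK code: `parse_binding_site` is a hypothetical helper, and it assumes positive residue numbers and exactly one `num:chainID` or `num1-num2:chainID` entry per line.

```python
def parse_binding_site(text):
    """Map each chain ID to the sorted residue numbers selected in the file."""
    site = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        numbers, sep, chain = line.partition(":")
        if not sep or not chain:
            raise ValueError(f"line {line_no}: expected num:chainID, got {line!r}")
        first, dash, last = numbers.partition("-")
        start = int(first)
        end = int(last) if dash else start
        if start > end:
            raise ValueError(f"line {line_no}: range {numbers!r} is reversed")
        site.setdefault(chain, set()).update(range(start, end + 1))
    return {chain: sorted(residues) for chain, residues in site.items()}


EXAMPLE = """\
195:A
203-206:A
108:B
"""

site = parse_binding_site(EXAMPLE)
print(site)
```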

    Ligand Binding Site

    Residues of the ligand’s binding site.
    Binding site residues can be submitted as a file (.txt) with the same format as above.

    195:A
    203-206:A
    108:B
    

    Restraints

    Distance constraints between interacting amino acids.
    Distance constraints can be provided as a file (.txt) with the following format:

    195:A 236:B 8
    215-218:A 306:B 6
    

    Here, the distance between residue 195 of chain A in the receptor and residue 236 of chain B in the ligand is within 8 angstroms. The distance between residues 215-218 of chain A in the receptor and residue 306 of chain B in the ligand is within 6 angstroms.
    Note: For each constraint, the first field is the receptor, the second field is the ligand, and the third field is the constraint distance. Residues should be in the format num:chainID or num1-num2:chainID, where residue number and chain ID refer to the input structure (if the input is a structure) or model structure (if the input is a sequence).
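    The three-field restraint lines can be parsed in the same spirit. This sketch is an illustration under stated assumptions rather than HDOCK's own parser: `Restraint`, `expand`, and `parse_restraints` are hypothetical names, fields are assumed to be whitespace-separated, and residue numbers are assumed to be positive.

```python
from typing import NamedTuple


class Restraint(NamedTuple):
    receptor: list       # [(chain, residue_number), ...]
    ligand: list         # [(chain, residue_number), ...]
    max_distance: float  # in angstroms


def expand(selection):
    """Expand 'num:chainID' or 'num1-num2:chainID' into (chain, number) pairs."""
    numbers, sep, chain = selection.partition(":")
    if not sep or not chain:
        raise ValueError(f"expected num:chainID, got {selection!r}")
    first, dash, last = numbers.partition("-")
    start = int(first)
    end = int(last) if dash else start
    if start > end:
        raise ValueError(f"range {numbers!r} is reversed")
    return [(chain, n) for n in range(start, end + 1)]


def parse_restraints(text):
    """Parse lines of the form '<receptor> <ligand> <distance>'."""
    restraints = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {line!r}")
        receptor, ligand, distance = fields
        restraints.append(Restraint(expand(receptor), expand(ligand), float(distance)))
    return restraints


restraints = parse_restraints("195:A 236:B 8\n215-218:A 306:B 6\n")
for r in restraints:
    print(r)
```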

    Cluster Cutoff

    RMSD cutoff value for clustering.

    Keep Receptor Heterogens

    Which heterogens (non-standard residues) in the receptor to retain: all of them (all), water only (water), only the ones you specify (specify), or none (none).

    Receptor Specify Heterogens

    Specify which non-standard residues in the receptor need to be retained, with multiple residues separated by commas (,). For example: “X:UNL-1”, where X is the chain name, UNL is the name of the non-standard residue, and 1 is the residue number.

    Keep Ligand Heterogens

    Which heterogens (non-standard residues) in the ligand to retain: all of them (all), water only (water), only the ones you specify (specify), or none (none).

    Ligand Specify Heterogens

    Specify which non-standard amino acids in the ligand need to be retained, with multiple residues separated by commas (,). For example: “X:UNL-1”, where X is the chain name, UNL is the name of the non-standard amino acid residue, and 1 is the residue number.
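    A value such as “X:UNL-1” can be split into chain, residue name, and residue number as sketched below. `parse_heterogens` is a hypothetical helper, the second entry in the example is invented, and the code assumes the residue number is the part after the last hyphen, so negative residue numbers are not handled.

```python
def parse_heterogens(spec):
    """Parse comma-separated 'chain:NAME-number' entries into tuples."""
    residues = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        chain, sep, rest = item.partition(":")
        name, dash, number = rest.rpartition("-")
        if not (sep and dash and chain and name):
            raise ValueError(f"malformed heterogen entry: {item!r}")
        residues.append((chain, name, int(number)))
    return residues


# "X:UNL-1" comes from the parameter description; "A:HOH-301" is made up.
kept = parse_heterogens("X:UNL-1, A:HOH-301")
print(kept)
```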

    Result

    The output includes:

    Output File Name Description
    complex_01.pdb-complex_10.pdb Top ten scoring complex conformations
    score.csv Provides docking scores for complex conformations, where lower scores indicate stronger binding
    TopNComplex.tar.gz Compressed file containing all complex structures

    The score.csv file includes the following information:

    Field Name Description
    Number Score ranking
    RMSD RMSD of complex conformations
    Score Docking energy score, where lower scores indicate stronger binding
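    Because lower scores indicate stronger binding, picking the best poses from score.csv is a sort in ascending order. The sketch below assumes a comma-separated file whose header row uses exactly the field names listed above; the sample values are invented for illustration, and `load_scores` and `best_poses` are hypothetical names.

```python
import csv
import io

# Hypothetical score.csv content; the numbers are invented for illustration.
SAMPLE = """\
Number,RMSD,Score
1,0.00,-250.0
2,15.20,-231.5
3,42.70,-228.9
"""


def load_scores(handle):
    """Read score rows and convert the numeric columns."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "Number": int(row["Number"]),
            "RMSD": float(row["RMSD"]),
            "Score": float(row["Score"]),
        })
    return rows


def best_poses(rows, n=1):
    """Return the n rows with the lowest (most favorable) Score."""
    return sorted(rows, key=lambda r: r["Score"])[:n]


rows = load_scores(io.StringIO(SAMPLE))
top = best_poses(rows, n=2)
for row in top:
    print(row["Number"], row["Score"])
```

    With a real result, `io.StringIO(SAMPLE)` would be replaced by `open("score.csv", newline="")`.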

    References

    • Yan Y, Tao H, He J, Huang S-Y.* The HDOCK server for integrated protein-protein docking. Nature Protocols, 2020. DOI: 10.1038/s41596-020-0312-x.
    • Yan Y, Zhang D, Zhou P, Li B, Huang S-Y. HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy. Nucleic Acids Res. 2017;45(W1):W365-W373. DOI: 10.1093/nar/gkx407.
  • Name: SeqKit
    Description: SeqKit模块是一款超快速、全面的FASTA/Q处理工具包,能够快速完成常见的FASTA/Q文件操作。 Ultrafast comprehensive toolkit for FASTA/Q processing, rapidly accomplishing common FASTA/Q file manipulations.
    Tags: undefined
    Author: Shen W
    Release: 2022-01-04 17:15:54
    Reference: Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962.

    SeqKit

    简介

    Seqkit是一款专门处理fsata/q序列文件的软件,由go语言编写,功能比较完善,软件使用也很稳定。
    该模块主要提供的功能有:

    1. 编辑序列(点突,插入,删除)
    2. 通过名称/序列来去除重复的序列、保存数量的文件并列出重复的seqs、保存重复seqs的文件
    3. 对序列进行转换(颠倒,互补,提取ID等)

    参数说明

    Clean模式

    FASTA File

    序列文件,FASTA格式。

    GAP

    指定序列中需要清理掉的间隔字符。

    Output File

    指定输出序列文件名称,FASTA格式。

    Edit模式

    FASTA File

    序列文件,FASTA格式。

    Output File

    指定输出序列文件名称,FASTA格式。

    Point Mutation

    对FASTA文件进行单独突变:在给定位置改变碱基。例如:“2:C”为将第二位碱基变为胞嘧啶(C);“-1:A”为将最后一位碱基变为腺嘌呤(A)。

    Deletion Mutation

    删除突变:删除指定范围内的子序列,例如,“1:2”表示删除前两个碱基,“-3:-1”表示删除最后三个碱基。

    Insertion Mutation

    插入突变:在给定位置后插入碱基,例如,“0:ACGT”表示在开头插入ACGT,“-1:”表示在末尾添加。

    Threads

    CPUs数目。

    Remove Duplicates模式

    FASTA File

    序列文件,FASTA格式。

    Output File

    指定输出序列文件名称,FASTA格式。

    Duplicated Type

    按name (-n)或按seq (-s)删除重复序列。

    Save Data

    保存重复序列数和列表的文件(-D)或保存重复序列的文件(-d)。

    Threads

    CPUs数目。

    Transform模式

    FASTA File

    序列文件,FASTA格式。

    Output File

    指定输出序列文件名称,FASTA格式。

    Transform Sequences

    转换类型,包括如下几种:
    --complement:互补序列
    --dna2rna:DNA转RNA
    --rna2dna:RNA转DNA
    --lower-case:以小写形式打印序列
    --upper-case:以大写形式打印序列

    Threads

    CPUs数目。

    FASTA2Seq模式

    FASTA File

    序列文件,FASTA格式。

    Output File

    指定输出序列文件名称,FASTA格式。

    结果说明

    按照指定要求得到FASTA文件。

    参考文献

    Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962.

    SeqKit

    Introduction

    SeqKit is a toolkit for processing FASTA/Q sequence files. It is written in Go, covers a broad range of operations, and runs stably. The module provides the following main features:

    1. Edit sequences (point mutations, insertions, deletions).
    2. Remove duplicate sequences by name or by sequence, optionally saving the number and list of duplicated sequences to a file, or saving the duplicated sequences themselves to a file.
    3. Transform sequences (reverse, complement, extract IDs, etc.).

    Parameter Description

    Clean Mode

    FASTA File

    Sequence file in FASTA format.

    GAP

    Specify the gap characters to be cleaned from the sequence.

    Output File

    Specify the output sequence file name in FASTA format.

    Edit Mode

    FASTA File

    Sequence file in FASTA format.

    Output File

    Specify the output sequence file name in FASTA format.

    Point Mutation

    Point mutation: change the base at a given position. For example, “2:C” changes the second base to cytosine (C); “-1:A” changes the last base to adenine (A).

    Deletion Mutation

    Deletion mutation: delete a subsequence within a specified range. For example, “1:2” deletes the first two bases, “-3:-1” deletes the last three bases.

    Insertion Mutation

    Insertion mutation: insert bases after the specified position. For example, “0:ACGT” inserts ACGT at the beginning, “-1:*” appends * at the end.
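    The position syntax of the three Edit-mode parameters can be made concrete with a small pure-Python model. This is only an illustration of the semantics described above (1-based positions, negative positions counted from the end); it is not SeqKit's implementation, and the function names are hypothetical.

```python
def point_mutation(seq, spec):
    """'2:C' replaces the 2nd base with C; '-1:A' replaces the last base with A."""
    position, _, base = spec.partition(":")
    pos = int(position)
    if pos == 0 or abs(pos) > len(seq):
        raise ValueError(f"position {pos} is outside the sequence")
    index = pos - 1 if pos > 0 else len(seq) + pos
    return seq[:index] + base + seq[index + 1:]


def deletion(seq, spec):
    """'1:2' deletes the first two bases; '-3:-1' deletes the last three."""
    start_text, _, end_text = spec.partition(":")
    start, end = int(start_text), int(end_text)
    # Convert negative positions to 1-based positions counted from the start.
    if start < 0:
        start += len(seq) + 1
    if end < 0:
        end += len(seq) + 1
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"range {spec!r} is outside the sequence")
    return seq[:start - 1] + seq[end:]


def insertion(seq, spec):
    """'0:ACGT' inserts at the beginning; '-1:*' appends '*' at the end."""
    position, _, bases = spec.partition(":")
    pos = int(position)
    index = pos if pos >= 0 else len(seq) + pos + 1
    if not 0 <= index <= len(seq):
        raise ValueError(f"position {pos} is outside the sequence")
    return seq[:index] + bases + seq[index:]


print(point_mutation("ACGTACGT", "-1:A"))
print(deletion("ACGTACGT", "-3:-1"))
print(insertion("ACGT", "0:TT"))
```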

    Threads

    Number of CPUs.

    Remove Duplicates Mode

    FASTA File

    Sequence file in FASTA format.

    Output File

    Specify the output sequence file name in FASTA format.

    Duplicated Type

    Delete duplicate sequences by name (-n) or by sequence (-s).

    Save Data

    Save a file with the count and list of duplicate sequences (-D) or save a file with duplicate sequences (-d).
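    The effect of the two duplicate types can be sketched as follows. This is a simplified model and not SeqKit's code: it keeps the first occurrence of each key, compares keys exactly (case-sensitive), and returns the duplicates and their counts that the Save Data options would write to files. The name `remove_duplicates` is hypothetical.

```python
def remove_duplicates(records, by="seq"):
    """Split (name, sequence) records into unique records and duplicates.

    by="name" compares record names; by="seq" compares sequences.
    Returns (unique, duplicates, counts_of_duplicated_keys).
    """
    if by not in ("name", "seq"):
        raise ValueError("by must be 'name' or 'seq'")
    counts = {}
    unique, duplicates = [], []
    for name, seq in records:
        key = name if by == "name" else seq
        if key in counts:
            counts[key] += 1
            duplicates.append((name, seq))
        else:
            counts[key] = 1
            unique.append((name, seq))
    duplicated = {key: n for key, n in counts.items() if n > 1}
    return unique, duplicates, duplicated


records = [("a", "ACGT"), ("b", "ACGT"), ("a", "TTTT")]

unique, dups, counts = remove_duplicates(records, by="seq")
print(unique, dups, counts)

unique_n, dups_n, counts_n = remove_duplicates(records, by="name")
print(unique_n, dups_n, counts_n)
```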

    Threads

    Number of CPUs.

    Transform Mode

    FASTA File

    Sequence file in FASTA format.

    Output File

    Specify the output sequence file name in FASTA format.

    Transform Sequences

    Transformation types include:
    --complement: Complementary sequences
    --dna2rna: DNA to RNA conversion
    --rna2dna: RNA to DNA conversion
    --lower-case: Print sequences in lowercase
    --upper-case: Print sequences in uppercase
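    What each transformation does to a sequence can be shown with a short pure-Python model, keyed by the option names written with two ASCII hyphens. It is a sketch for plain A/C/G/T/U sequences only; SeqKit's own handling of ambiguity codes and other alphabets is not reproduced, and `transform` is a hypothetical function.

```python
# Complement table for unambiguous DNA bases, preserving case.
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

TRANSFORMS = {
    "--complement": lambda s: s.translate(_DNA_COMPLEMENT),
    "--dna2rna": lambda s: s.replace("T", "U").replace("t", "u"),
    "--rna2dna": lambda s: s.replace("U", "T").replace("u", "t"),
    "--lower-case": str.lower,
    "--upper-case": str.upper,
}


def transform(seq, option):
    """Apply one of the listed transformations to a sequence string."""
    try:
        func = TRANSFORMS[option]
    except KeyError:
        raise ValueError(f"unknown transformation: {option!r}") from None
    return func(seq)


for option in TRANSFORMS:
    print(option, transform("ACGTtg", option))
```

    Note that the complement keeps the base order; it is not a reverse complement.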

    Threads

    Number of CPUs.

    FASTA2Seq Mode

    FASTA File

    Sequence file in FASTA format.

    Output File

    Specify the output sequence file name in FASTA format.

    Result Description

    A FASTA file processed according to the selected mode and parameters.

    References

    Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962.

  • Name: Property Filter
    Description: 基于导入的分子属性(例如从SDF文件导入)或在运行时对分子进行计算来选择分子的子集。支持的输入文件格式为:SD(.sdf,.sd)。支持的输出文件格式为:SD(.sdf,.sd)。 It is very versatile and can select a subset of molecules based either on properties imported with the molecule (as from a SDF file) or from calculations on the molecule on the fly. The supported input file formats are: SD (.sdf, .sd). The supported output file formats are: SD (.sdf, .sd).
    Tags: undefined
    Author: Open Babel
    Release: 2021-12-28 06:06:09
    Reference: O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

    Property Filter

    简介

    Property Filter模块可以基于导入的分子属性(例如从SDF文件导入)或在运行时对分子进行计算来选择分子的子集。支持的输入文件格式为:SD(.sdf,.sd)。支持的输出文件格式为:SD(.sdf,.sd)。

    参数说明

    Input File

    小分子结构文件,SDF格式。

    Property

    过滤属性,相关的描述符含义分别如下:

    L5 (Lipinski rule of five):类药物五原则,指的是一组用于评估化合物作为口服药物潜力的规则,包括的规则为HBD<5、HBA1<10、MW<500以及logP<5。
    HBA1 (Number of hydrogen bond acceptors 1 [JoelLib]):用于识别化合物中符合此模式的氢键受体,其匹配的SMARTS格式为[$([!#6;+0]);!$([F,Cl,Br,I]);!$([o,s,nX3]);!$([Nv5,Pv5,Sv4,Sv6])]
    HBA2 (Number of hydrogen bond acceptors 2 [JoelLib]):用于识别另一种模式的氢键受体,其匹配的SMARTS格式为[$([$([#8,#16]);!$(*=N~O);!$(*~N=O);X1,X2]),$([#7;v3;!$([nH]);!$(*(-a)-a)])]
    HBD (Number of hydrogen bond donors [JoelLib]):其匹配的SMARTS格式为[!#6;!H0],用于识别化合物中符合此模式的氢键供体。
    logP (Octanol/water partition coefficient):辛醇/水分配系数,是衡量化合物在辛醇与水之间分配的比例,通常用于预测化合物的疏水性。
    MW (Molecular weight):分子量。
    abonds (Number of aromatic bonds):芳香键的数量,SMARTS格式为*:*。
    atoms (Number of atoms):原子数量,通过添加或去除氢原子来计算总原子或重原子数量,SMARTS格式为*。
    bonds (Number of bonds):键的数量,通过添加或去除氢原子来计算总键或重原子之间的键,SMARTS格式为*~*。
    cansmi (Canonical SMILES):规范化的SMILES(简化分子线性输入规范),用于唯一表示化合物的线性结构。
    cansmiNS (Canonical SMILES without isotopes or stereo):不含同位素或立体化学信息的规范化SMILES。
    dbonds (Number of double bonds):双键的数量,SMARTS格式为*=*。
    formula (Chemical formula):化学式。
    InChI (IUPAC InChI identifier):国际化学标识符。
    InChIKey (InChIKey):InChI的简化版,固定长度的字符串,用于快速查找和识别化合物。
    MP (Melting point):熔点,是由Andy Lang开发的熔点描述符,用于预测化合物的熔点。
    MR (Molar refractivity):摩尔折射率,是化合物体积和极化率的量度,通常用于评估分子间相互作用。
    nF (Number of fluorine atoms):氟原子的数量,SMARTS格式为F,用于识别化合物中的氟原子数量。
    s/smarts  (SMARTS filter):SMARTS过滤器,用于根据特定模式筛选化合物。
    sbonds (Number of single bonds):单键的数量,SMARTS格式为*-*。
    tbonds (Number of triple bonds):三键的数量,SMARTS格式为*#*。
    title (For comparing a molecule's title):用于比较分子标题的信息。
    TPSA (Topological polar surface area):拓扑极性表面积,是分子中极性区域的表面积总和,通常用于预测药物的吸收性和透过性。
    

    Relation

    选择属性的名称和所需的关系(如>、<、=、>=、<=、!=),多个符号用逗号(,)分隔。当筛选性质为L5时,该栏填None。

    Value

    属性过滤器的截止值。当筛选性质为L5时,该栏填None。

    Logic Operator

    前后条件的逻辑关系连接符(&&或者||),多个用逗号分隔

    Output File

    输出文件名称。

    结果说明

    得到筛选后的SDF结构文件output.sdf。

    参考文献

    O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

    Property Filter

    Introduction

    The Property Filter module allows for the selection of a subset of molecules based on imported molecular properties (e.g., imported from an SDF file) or calculated at runtime. Supported input file formats include: SD (.sdf, .sd). Supported output file formats include: SD (.sdf, .sd).

    Parameter Description

    Input File

    Small molecule structure file in SDF format.

    Property

    Filter properties, with the meanings of related descriptors as follows:

    L5 (Lipinski rule of five): A set of rules used to evaluate the potential of compounds as oral drugs, including the following criteria: HBD<5, HBA1<10, MW<500, and logP<5.
    HBA1 (Number of hydrogen bond acceptors 1 [JoelLib]): Used to identify hydrogen bond acceptors in compounds that match this pattern, with the SMARTS format: [$([!#6;+0]);!$([F,Cl,Br,I]);!$([o,s,nX3]);!$([Nv5,Pv5,Sv4,Sv6])]
    HBA2 (Number of hydrogen bond acceptors 2 [JoelLib]): Used to identify another pattern of hydrogen bond acceptors, with the SMARTS format: [$([$([#8,#16]);!$(*=N~O);!$(*~N=O);X1,X2]),$([#7;v3;!$([nH]);!$(*(-a)-a)])]
    HBD (Number of hydrogen bond donors [JoelLib]): Matches the SMARTS format [!#6;!H0], used to identify hydrogen bond donors in compounds that match this pattern.
    logP (Octanol/water partition coefficient): The octanol/water partition coefficient, which measures the ratio of a compound's distribution between octanol and water, typically used to predict compound hydrophobicity.
    MW (Molecular weight): The molecular weight.
    abonds (Number of aromatic bonds): The number of aromatic bonds, SMARTS format: *:*.
    atoms (Number of atoms): The number of atoms; hydrogens can be added or removed to count all atoms or heavy atoms only. SMARTS format: *.
    bonds (Number of bonds): The number of bonds; hydrogens can be added or removed to count all bonds or only bonds between heavy atoms. SMARTS format: *~*.
    cansmi (Canonical SMILES): Canonical SMILES (Simplified Molecular Input Line Entry System), used to uniquely represent the linear structure of a compound.
    cansmiNS (Canonical SMILES without isotopes or stereo): Canonical SMILES without isotope or stereochemistry information.
    dbonds (Number of double bonds): The number of double bonds, SMARTS format: *=*.
    formula (Chemical formula): The chemical formula.
    InChI (IUPAC InChI identifier): The International Chemical Identifier, a standardized text string to represent the structure of a compound.
    InChIKey (InChIKey): A fixed-length hashed form of the InChI, used for quick lookup and identification of compounds.
    MP (Melting point): The melting point, a descriptor developed by Andy Lang, used to predict the melting point of compounds.
    MR (Molar refractivity): Molar refractivity, a measure of the compound's volume and polarizability, typically used to assess intermolecular interactions.
    nF (Number of fluorine atoms): The number of fluorine atoms, SMARTS format: F, used to identify the number of fluorine atoms in a compound.
    s/smarts (SMARTS filter): A SMARTS filter used to filter compounds based on specific patterns.
    sbonds (Number of single bonds): The number of single bonds, SMARTS format: *-*.
    tbonds (Number of triple bonds): The number of triple bonds, SMARTS format: *#*.
    title (For comparing a molecule's title): Used for comparing the titles of molecules.
    TPSA (Topological polar surface area): The topological polar surface area, the total surface area of polar regions in a molecule, typically used to predict drug absorption and permeability.
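    As an illustration of the L5 entry, the four thresholds can be checked against precomputed descriptor values. This sketch does not call Open Babel: the descriptor values are supplied by hand, the two example molecules are invented, and `passes_l5` and `l5_violations` are hypothetical helpers.

```python
# Thresholds as listed for L5: HBD<5, HBA1<10, MW<500, logP<5.
L5_LIMITS = {"HBD": 5, "HBA1": 10, "MW": 500, "logP": 5}


def l5_violations(descriptors):
    """Return the names of the L5 criteria that a molecule fails."""
    return [name for name, limit in L5_LIMITS.items()
            if descriptors[name] >= limit]


def passes_l5(descriptors):
    """True when all four criteria are satisfied."""
    return not l5_violations(descriptors)


# Invented descriptor values for two hypothetical molecules.
mol_a = {"HBD": 1, "HBA1": 4, "MW": 320.4, "logP": 2.1}
mol_b = {"HBD": 6, "HBA1": 12, "MW": 480.0, "logP": 3.3}

print(passes_l5(mol_a), l5_violations(mol_a))
print(passes_l5(mol_b), l5_violations(mol_b))
```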
    

    Relation

    The relation to apply to each selected property (>, <, =, >=, <=, or !=); separate multiple relations with commas. When filtering by L5, fill in None for this field.

    Value

    The cutoff value for the property filter. When filtering by L5, fill in None for this field.

    Logic Operator

    Logical operators (&& or ||) connecting the conditions, separated by commas.
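    The way the Property, Relation, Value, and Logic Operator fields line up can be sketched as below. This is a hedged model, not the module's implementation: it handles numeric properties only, takes descriptor values from a plain dictionary, and evaluates the logic operators strictly left to right, which is an assumption about precedence. `parse_filter` and `matches` are hypothetical names.

```python
import operator

RELATIONS = {
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
}


def parse_filter(properties, relations, values, logic_ops=""):
    """Turn the comma-separated field values into conditions and operators."""
    props = [p.strip() for p in properties.split(",")]
    rels = [r.strip() for r in relations.split(",")]
    vals = [float(v) for v in values.split(",")]
    ops = [o.strip() for o in logic_ops.split(",")] if logic_ops else []
    if not (len(props) == len(rels) == len(vals)):
        raise ValueError("properties, relations and values must have equal length")
    if len(ops) != len(props) - 1:
        raise ValueError("need exactly one logic operator between conditions")
    if any(r not in RELATIONS for r in rels):
        raise ValueError("unsupported relation")
    if any(o not in ("&&", "||") for o in ops):
        raise ValueError("logic operators must be && or ||")
    return list(zip(props, rels, vals)), ops


def matches(molecule, conditions, ops):
    """Evaluate the conditions on one molecule, combining them left to right."""
    results = [RELATIONS[rel](molecule[name], value)
               for name, rel, value in conditions]
    outcome = results[0]
    for op, result in zip(ops, results[1:]):
        outcome = (outcome and result) if op == "&&" else (outcome or result)
    return outcome


conditions, ops = parse_filter("MW,logP", "<,<=", "500,5", "&&")
print(matches({"MW": 320.4, "logP": 2.1}, conditions, ops))
print(matches({"MW": 612.0, "logP": 2.1}, conditions, ops))
```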

    Output File

    The name of the output file.

    Result Description

    The filtered SDF structure file output.sdf is obtained.

    References

    O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. J Cheminform. 2011 Oct 7;3:33.

  • Name: Homology Modeling (Protein)
    Description: 蛋白质三维结构的同源性或比较建模。用户提供一个待建模的序列与已知相关结构的比对。通过满足空间约束条件进行比较蛋白质结构建模,以及许多其他任务,包括蛋白质结构中循环的全新建模、针对灵活定义的目标函数优化各种蛋白质结构模型、蛋白质序列和/或结构的多重比对、聚类、搜索序列数据库、比较蛋白质结构等。 Homology or comparative modeling of protein three-dimensional structures. Users provide an alignment of the sequence to be modeled with known related structures. Comparative protein structure modeling is performed by satisfaction of spatial restraints, and many other tasks are supported, including de novo modeling of loops in protein structures, optimization of protein structure models against a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching sequence databases, and comparing protein structures.
    Tags: undefined
    Author: B. Webb*; M.A. Marti-Renom*; A. Sali*; A. Fiser, R.K. Do*.
    Release: 2021-12-21 17:39:18
    Reference: (1) B. Webb, A. Sali. Comparative Protein Structure Modeling Using Modeller. Current Protocols in Bioinformatics 54, John Wiley & Sons, Inc., 5.6.1-5.6.37, 2016. M.A. Marti-Renom, A. Stuart, A. Fiser, R. Sánchez, F. Melo, A. Sali. Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291-325, 2000. (2) A. Sali & T.L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815, 1993. (3) A. Fiser, R.K. Do, & A. Sali. Modeling of loops in protein structures, Protein Science 9. 1753-1773, 2000.

    Homology Modeling (Protein)

    简介

    Homology Modeling (Protein)采用老牌蛋白质同源模建算法Modeller,可以对蛋白质三维结构的同源性或比较建模。用户提供一个待建模的序列与已知相关结构的比对。通过满足空间约束条件进行比较蛋白质结构建模,以及许多其他任务,包括蛋白质结构中循环的全新建模、针对灵活定义的目标函数优化各种蛋白质结构模型、蛋白质序列和/或结构的多重比对、聚类、搜索序列数据库、比较蛋白质结构等。

    参数说明

    Protein Sequence File

    蛋白的序列文件,FASTA格式。

    Models

    输出预测结构数目。

    Template PDB File

    构建PDB结构的模板文件。

    结果说明

    输出结果包括:

    输出文件名称 说明
    output.log 输出记录文件
    score.csv 预测结构对应的打分文件
    Top0001.pdb-Top0005.pdb 打分前五的结构文件

    其中score.csv包括信息如下:

    字段名称 说明
    name 预测结构名称
    molpdf 评估预测结构与模板结构的一致性,其值越低越好。
    DOPE score 评估预测结构与真实结构相似的可能性,其值越低越好。
    Template 构建结构所使用的模板PDB ID和链名称。

    参考文献

    Webb B, Sali A. Comparative Protein Structure Modeling Using MODELLER. Curr Protoc Bioinformatics. 2016 Jun 20;54:5.6.1-5.6.37.
    Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291-325.
    Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993 Dec 5;234(3):779-815.
    Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000 Sep;9(9):1753-73.

    Homology Modeling (Protein)

    Introduction

    Homology Modeling (Protein) uses the long-established homology modeling program Modeller for homology or comparative modeling of protein three-dimensional structures. Users provide an alignment of the sequence to be modeled with known related structures. Comparative modeling is carried out by satisfaction of spatial restraints. Modeller also supports many other tasks, including de novo modeling of loops in protein structures, optimization of protein structure models against a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching sequence databases, and comparing protein structures.

    Parameter

    Protein Sequence File

    Protein sequence file in FASTA format.

    Models

    Number of predicted structures.

    Template PDB File

    Template PDB file used to build the structure.

    Log File

    Name of log file

    Result

    The output includes:

    Output File Name Description
    output.log Output record file
    score.csv Scores of the predicted structures
    Top0001.pdb-Top0005.pdb The five top-scoring structure files

    score.csv contains the following information:

    Field Name Description
    name Name of the predicted structure
    molpdf Measures how well the model satisfies the restraints derived from the alignment; lower is better.
    DOPE score Estimates how likely the model is to resemble a real structure; lower is better.
    Template The template PDB ID and chain name used to build the structure.

    Reference

    Webb B, Sali A. Comparative Protein Structure Modeling Using MODELLER. Curr Protoc Bioinformatics. 2016 Jun 20;54:5.6.1-5.6.37.
    Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291-325.
    Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993 Dec 5;234(3):779-815.
    Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000 Sep;9(9):1753-73.

  • Name: PTM Hotspot by Sequence
    Description: 扫描抗体序列发现潜在的翻译后修饰(PTM)风险位点, PTM 位点是生物制剂开发的常见风险。 通常建议使用WeSeq中的PTM功能进行可视化的分析,本模块更常用于组装自动化流程。 Scan antibody sequences for potential PTM (post-translational modification) hotspots (liabilities). PTM hotspot is a common risk for biologics development. It is generally recommended to use the PTM function in WeSeq for visual analysis. This module is more commonly used for assembling automated workflows.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-12-20 16:13:18
    Reference: NA

    PTM Hotspot by Sequence

    简介

    扫描抗体序列发现潜在的翻译后修饰(PTM)风险位点,PTM位点是生物制剂开发的常见风险。主要包括:氧化位点Oxidation、糖基化位点Glycosylation、水解位点Hydrolysis、脱酰胺基位点Deamidation、裂解位点Cleavage、天冬氨酸异构化位点Isomerization、半胱氨酸位点Cysteine。

    参数说明

    FASTA File

    抗体的序列文件,FASTA格式

    结果说明

    输出结果包括:

    输出文件名称 说明
    hotspots.md 风险位点信息,Mardown格式
    Hotspots.json 风险位点信息,JSON格式

    针对抗体序列,会自动识别CDR区域,并输出CDR区和全部序列区域的风险位点。

    风险位点说明:
    图片.png
    其中打勾的位点默认视为高风险位点(NXS, NXT, NG, DG, DHK, DD, Cys),修饰发生率相对较高,通常需要重点关注。也可基于经验自行判断。

    FAQ

    1、SXN/TXN位点是什么?

    这两个位点是非典型的N糖基化位点,可见于Amgen发表的文献:
    Glutamine-linked and Non-consensus Asparagine-linked Oligosaccharides Present in Human Recombinant Antibodies Define Novel Protein Glycosylation Motifs, Journal of Biological Chemistry, Volume 285, Issue 21, 16012 - 16022

    PTM Hotspot by Sequence

    Introduction

    This module scans antibody sequences to identify potential post-translational modification (PTM) hotspot sites. PTM sites are common risks in biologics development and include Oxidation, Glycosylation, Hydrolysis, Deamidation, Cleavage, Isomerization, and Cysteine sites.

    Parameter Description

    FASTA File

    Antibody sequence file in FASTA format.

    Result Description

    The output includes:

    Output File Name Description
    hotspots.md Information on hotspot sites in Markdown format
    Hotspots.json Information on hotspot sites in JSON format

    For antibody sequences, the module automatically identifies the CDR regions and outputs hotspot sites for both the CDR and the entire sequence regions.

    Explanation of Hotspot Sites:
    Image.png

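    The checked sites (NXS, NXT, NG, DG, DHK, DD, and Cys) are treated as high-risk PTM hotspots by default. Their modification rates are relatively high, so they usually require special attention. Users can also make their own judgment based on experience.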
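
    As an illustration of what a sequence-based scan for such motifs can look like, here is a minimal Python sketch. It is not the module's implementation. It assumes that "X" means any single residue, it covers only the motifs whose reading is unambiguous in the text (DHK is left out), and the names `HIGH_RISK_PATTERNS` and `scan_hotspots` are hypothetical.

```python
import re

# Regular expressions for a subset of the high-risk motifs named in the text.
# Assumption: "X" stands for any single residue. DHK is omitted because the
# text does not spell out how that motif should be read.
HIGH_RISK_PATTERNS = {
    "NXS": r"N[A-Z]S",
    "NXT": r"N[A-Z]T",
    "NG": r"NG",
    "DG": r"DG",
    "DD": r"DD",
    "Cys": r"C",
}


def scan_hotspots(sequence):
    """Return (motif, 1-based start position, matched text) tuples, sorted by position."""
    sequence = sequence.upper()
    hits = []
    for name, pattern in HIGH_RISK_PATTERNS.items():
        # The lookahead reports overlapping matches too (e.g. both DD pairs in "DDD").
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


if __name__ == "__main__":
    for hit in scan_hotspots("QVNGSLDGC"):
        print(hit)
```

    A real scan would also need CDR annotation to separate CDR hits from framework hits, which this sketch does not attempt.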

    FAQ

    1. What are the SXN/TXN sites?

    These are non-canonical N-glycosylation sites, described in a paper published by Amgen:
    Glutamine-linked and Non-consensus Asparagine-linked Oligosaccharides Present in Human Recombinant Antibodies Define Novel Protein Glycosylation Motifs, Journal of Biological Chemistry, Volume 285, Issue 21, 16012 - 16022

  • Name: 2D Similarity Search
    Description: 基于分子指纹进行二维相似度搜索。根据不同指纹类型(Maccs Key、pharmacophore fingerprints、extended connectivity fingerprints)计算得到的指纹向量或者向量字符串进行相似性搜索,从分子数据库中筛选出与模板分子相似(不相似)的化合物。 It is a tool based on molecular fingerprints for 2D similarity search. Firstly, the fingerprint bit-vector or vector string of the template small molecule is calculated based on the fingerprint types (Maccs Key, pharmacophore fingerprints, extended connectivity fingerprints). Then, the fingerprint bit-vector or vector string is used for molecular similarity search in the selected public library or private library, and the small molecules that are similar (or dissimilar) to the template molecule are obtained.
    Tags: undefined
    Author: Kier LB; Filimonov D; Venkatraman V
    Release: 2021-12-15 07:40:57
    Reference: Kier LB, Hall LH, The E-State as the basis for molecular structure space definition and structure similarity. J. Chem. Inf. Comput. Sci. 2000, 40, 784-791. Filimonov D, Poroikov V, Borodina Y, Gloriozova T, Chemical similarity assessment through multilevel neighborhoods of atoms: Definition and comparison with the other Descriptors. J. Chem. Inf. Comput. Sci., 1999, 39, 666-670. Venkatraman V, Pérez-Nueno VI, Mavridis L, Ritchie DW, Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J. Chem. Inf. Model., 2010, 50, 2079-2093. Durant, J.L.; Leland, B.A.; Henry, D.H.; Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273-1280. Willett, P.; Similarity searching using 2D structural fingerprints. Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology. 2011, 672, 133-58.

    2D Similarity Search

    简介

    2D Similarity Search模块是基于分子指纹进行二维相似度搜索的工具。根据不同指纹类型(Maccs Key、pharmacophore fingerprints、extended connectivity fingerprints)计算得到的指纹向量或者向量字符串进行相似性搜索,从分子数据库中筛选出与模板分子相似(不相似)的化合物。相似性评估方法采用的是常用的Tanimoto系数,用于比较两个化合物之间的相似性。它是基于化合物指纹或描述符的重叠程度计算得出的,数值范围从0到1,值越大表示两个化合物越相似。其主要功能如下所示:

    1. 从提供的化合物数据库中,筛选出与查询分子二维相似、符合特定相似度阈值的化合物结构。
    2. 从提供的化合物数据库中,筛选出与查询分子二维不相似、符合特定距离阈值的化合物结构。
    3. 支持多个查询分子模式。
      支持的输入文件格式为:SD(.sdf, .sd)。支持的输出文件格式为:SD(.sdf,.sd)、CSV(.csv)。

    参数说明

    Template SDF File

    小分子结构文件,SDF格式。

    Template Smiles

    小分子结构,SMILES格式,支持多个小分子,一行一个SMILES,例如:

    CSC1=C(c2ccc(C)s2)/C(=N/C(C)(C)C)C1
    CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5
    

    Public Library

    选择用于相似性搜索的分子库,该模块提供17个公共分子数据库用于进行相似性搜索:

    1. Alinda :~77万库存分子,源自中国香港的Alinda Chemical公司,致力于分子砌块和新颖筛选化合物的研发供应。
    2. Analyticon :~4万库存分子,源自德国的天然产物品牌,专注天然产物提取及类似物合成工作,产品质量稳定。
    3. Asinex :~57万库存分子,源自美国的品牌,多年来致力于类先导化合物及分子砌块的研发供应,价格较贵。
    4. Bionet :~30万库存分子,源自英国的品牌,拥有多年的有机合成经验。
    5. Chembridge :~137万库存分子,源自美国的化合物品牌,总部位于圣地亚哥,拥有多样性库、大环库等多种热门化合物库。
    6. Chemdiv :~156万库存分子,全球最大的化合物品牌之一,拥有5000多种化合物骨架结构和100多种化合物库,性价比高。
    7. Enamine :~407万库存分子,源自乌克兰的化合物品牌,具有较强的化合物研发能力,有高性价比化合物和高价值化合物两类产品。
    8. Eximed :~6万库存分子,源自乌克兰的化合物品牌,近20年来致力于提供高通量筛选化合物及相关服务。
    9. HTS :~6万库存分子,源自德国的HTS Biochemie Innovationen化合物品牌,致力于为制药、农业和生物技术公司开发独特的化合物。
    10. IBS :~55万库存分子,源自俄罗斯的InterBioScreen化合物品牌,拥有多种天然产物及衍生物。
    11. Life_Chemicals :~54万库存分子,源自加拿大的化合物品牌,拥有2900多种化合物骨架结构,化合物规格较齐全且有对应价格。
    12. Maybridge :~5万库存分子,源自英国的化合物品牌,Thermofisher旗下,产品数量少而专,每种产品均具有较大库存。
    13. Otava :~29万库存分子,源自加拿大的化合物品牌,专门从事特色化合物、生物化学药品和生物分析试剂的开发和生产。
    14. Princeton :~153万库存分子,源自美国的化合物品牌,20多年来设计独特的小分子化合物用于药物开发。
    15. Specs :~20万库存分子,源自荷兰的化合物品牌,价格优势明显。
    16. UORSY :~68万库存分子,源自乌克兰的化合物品牌,产品主要用于高通量筛选和药物发现,价格与Enamine接近。
    17. Vitas-m :~140万库存分子,源自美国的化合物品牌,在香港拥有发货中心,到货速度快,价格适中。

    Public Library与Private Library选填其中一个。

    Private Library

    上传用于进行相似度搜索的个人分子数据库,格式为SDF。
    Public Library与Private Library选填其中一个。

    Fingerprint

    分子指纹类型:maccskey、phar、ecfp

    1. maccskey指纹是基于分子的结构和功能团片段生成的二进制指纹,可以用于进行药物相似性和虚拟筛选。
    2. phar(Pharmacophore fingerprints)识别分子中的药效团特征指纹,如氢键供体、氢键受体、疏水中心等,适合药物设计。
    3. ecfp(Extended Connectivity Fingerprints)是基于圆形子结构的分子指纹,适合相似性搜索和定量结构-活性关系(QSAR)建模。

    Cutoff

    当搜索模式为SimilaritySearch时,表示搜索相似度≥截断值的分子;当搜索模式为DissimilaritySearch时,表示搜索相似度≤截断值的分子。计算值取值范围是0~1。Cutoff默认为0.75。

    Search Mode

    指定搜索模式:SimilaritySearch是查找相似分子,DissimilaritySearch是查找不相似分子。

    结果说明

    输出结果包括:

    输出文件名称 说明
    hits_values.csv 添加数据库与模板分子相似度值。
    hits.sdf 数据库中筛选出与模板分子相似在截断值以内的化合物。

    其中hits_values.csv包括信息如下:

    字段名称 说明
    ReferenceCompoundID 模板分子库中分子的名称,无名称则表示为“Cmpd”前缀+“分子编号”。
    DatabaseCompoundID 搜索库中符合条件的分子的名称,无名称同上。
    ComparisonValue 模板分子与分子库的相似度值。

    其余参数为所提供的分子数据库包含的描述。

    参考文献

    Kier LB, Hall LH, The E-State as the basis for molecular structure space definition and structure similarity. J. Chem. Inf. Comput. Sci. 2000, 40, 784-791.
    Filimonov D, Poroikov V, Borodina Y, Gloriozova T, Chemical similarity assessment through multilevel neighborhoods of atoms: Definition and comparison with the other Descriptors. J. Chem. Inf. Comput. Sci., 1999, 39, 666-670.
    Venkatraman V, Pérez-Nueno VI, Mavridis L, Ritchie DW, Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J. Chem. Inf. Model., 2010, 50, 2079-2093.
    Durant, J.L.; Leland, B.A.; Henry, D.H.; Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273-1280.
    Willett, P.; Similarity searching using 2D structural fingerprints. Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology. 2011, 672, 133-58.

    2D Similarity Search

    Introduction

    The 2D Similarity Search module performs 2D similarity searches based on molecular fingerprints. A fingerprint bit vector or vector string is computed for each molecule using the selected fingerprint type (MACCS keys, pharmacophore fingerprints, or extended connectivity fingerprints), and compounds similar (or dissimilar) to the template molecule are then selected from the molecular database. Similarity is measured with the widely used Tanimoto coefficient, which is computed from the overlap between the fingerprints or descriptors of two compounds and ranges from 0 to 1; the larger the value, the more similar the two compounds. Its main functions are as follows:

    1. Select compounds from the provided compound database that are similar in 2D to the query molecule and meet a specified similarity threshold.
    2. Select compounds from the provided compound database that are dissimilar in 2D to the query molecule and meet a specified distance threshold.
    3. Support multiple query molecules.
      Supported input file formats: SD (.sdf, .sd). Supported output file formats: SD (.sdf, .sd), CSV (.csv).

    Parameter

    Template SDF File

    Small molecule structure file in SDF format.

    Template Smiles

    Small molecule structures in SMILES format. Multiple molecules are supported, one SMILES per line. Example:

    CSC1=C(c2ccc(C)s2)/C(=N/C(C)(C)C)C1
    CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5
    

    Public Library

    Select the molecular database for similarity search. This module provides 17 public molecular databases for conducting similarity search:

    1. Alinda : ~770,000 stock molecules, sourced from Alinda Chemical in Hong Kong, dedicated to the development and supply of molecular building blocks and novel screening compounds.
    2. Analyticon : ~40,000 stock molecules, a German brand specializing in natural product extraction and analogue synthesis, known for stable product quality.
    3. Asinex : ~570,000 stock molecules, an American brand focused on the development and supply of lead-like compounds and molecular building blocks for many years, relatively expensive.
    4. Bionet : ~300,000 stock molecules, a UK brand with many years of experience in organic synthesis.
    5. Chembridge : ~1,370,000 stock molecules, an American compound brand headquartered in San Diego, offering diverse libraries, macrocyclic libraries, and other popular compound libraries.
    6. Chemdiv : ~1,560,000 stock molecules, one of the world’s largest compound brands, with over 5,000 compound scaffolds and more than 100 compound libraries, offering high cost-effectiveness.
    7. Enamine : ~4,070,000 stock molecules, a Ukrainian compound brand with strong compound development capabilities, offering both high cost-effectiveness compounds and high-value compounds.
    8. Eximed : ~60,000 stock molecules, a Ukrainian compound brand dedicated to providing high-throughput screening compounds and related services for nearly 20 years.
    9. HTS : ~60,000 stock molecules, a German compound brand HTS Biochemie Innovationen, dedicated to developing unique compounds for pharmaceutical, agricultural, and biotechnology companies.
    10. IBS : ~550,000 stock molecules, a Russian compound brand InterBioScreen, offering a variety of natural products and derivatives.
    11. Life Chemicals : ~540,000 stock molecules, a Canadian compound brand with over 2,900 compound scaffolds, offering a wide range of compound specifications at corresponding prices.
    12. Maybridge : ~50,000 stock molecules, a UK compound brand under Thermo Fisher, known for a small but specialized product range with large inventories for each product.
    13. Otava : ~290,000 stock molecules, a Canadian compound brand specializing in the development and production of specialty compounds, biochemical drugs, and bioanalytical reagents.
    14. Princeton : ~1,530,000 stock molecules, an American compound brand that has been designing unique small molecules for drug development for over 20 years.
    15. Specs : ~200,000 stock molecules, a Dutch compound brand with a clear price advantage.
    16. UORSY : ~680,000 stock molecules, a Ukrainian compound brand, mainly used for high-throughput screening and drug discovery, with prices similar to Enamine.
    17. Vitas-M : ~1,400,000 stock molecules, an American compound brand with a shipping center in Hong Kong, offering fast delivery and moderate prices.

    Specify either Public Library or Private Library.

    Private Library

    Upload a personal molecular database in SDF format for similarity search.

    Specify either Public Library or Private Library.

    Fingerprint

    Types of Molecular Fingerprints: maccskey, phar, ecfp.

    1. maccskey fingerprints are binary fingerprints generated based on the structure and functional group fragments of a molecule, and can be used for drug similarity and virtual screening.
    2. phar (Pharmacophore fingerprints) recognize pharmacophore features in molecules, such as hydrogen bond donors, hydrogen bond acceptors, hydrophobic centers, etc., and are suitable for drug design.
    3. ecfp (Extended Connectivity Fingerprints) are circular substructure-based molecular fingerprints, suitable for similarity searching and quantitative structure-activity relationship (QSAR) modeling.

    Cutoff

    When the search mode is set to SimilaritySearch, it means that molecules with a similarity ≥ the cutoff value will be searched. When the search mode is set to DissimilaritySearch, it means that molecules with a similarity ≤ the cutoff value will be searched. The calculated values range from 0 to 1, with a default cutoff value of 0.75.

    Search Mode

    Specify the search mode: SimilaritySearch finds similar molecules, and DissimilaritySearch finds dissimilar molecules.
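
    How Cutoff and Search Mode interact can be sketched in a few lines of Python. This is a minimal illustration, not the module's code. It assumes fingerprints are represented as sets of on-bit indices, and `tanimoto` and `search` are hypothetical helper names.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def search(template, library, cutoff=0.75, mode="SimilaritySearch"):
    """Return (name, similarity) pairs from `library` that pass the cutoff."""
    if mode not in ("SimilaritySearch", "DissimilaritySearch"):
        raise ValueError(f"unknown search mode: {mode}")
    hits = []
    for name, bits in library.items():
        value = tanimoto(template, bits)
        # SimilaritySearch keeps values >= cutoff; DissimilaritySearch keeps values <= cutoff.
        keep = value >= cutoff if mode == "SimilaritySearch" else value <= cutoff
        if keep:
            hits.append((name, value))
    return hits


if __name__ == "__main__":
    template = {1, 2, 3, 4}
    library = {"cmpd_a": {1, 2, 3, 4}, "cmpd_b": {1, 2, 3, 5}, "cmpd_c": {7, 8}}
    print(search(template, library))
    print(search(template, library, mode="DissimilaritySearch"))
```

    With the default cutoff of 0.75, only the identical fingerprint passes in SimilaritySearch mode, while the other two entries pass in DissimilaritySearch mode.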

    Result

    The output includes:

    Output File Name Description
    hits_values.csv Similarity values between the database molecules and the template molecules.
    hits.sdf Compounds from the database whose similarity to the template molecules satisfies the cutoff.

    The hits_values.csv contains the following information:

    Field Name Description
    ReferenceCompoundID The name of the molecule in the template library, or denoted as “Cmpd” prefix + “molecule number” if it has no name.
    DatabaseCompoundID The name of the compound in the search library that meets the conditions, or denoted as above if it has no name.
    ComparisonValue The similarity value between the template molecule and the compound in the database.

    The remaining fields are the properties carried over from the provided molecular database.

    Reference

    Kier LB, Hall LH, The E-State as the basis for molecular structure space definition and structure similarity. J. Chem. Inf. Comput. Sci. 2000, 40, 784-791.
    Filimonov D, Poroikov V, Borodina Y, Gloriozova T, Chemical similarity assessment through multilevel neighborhoods of atoms: Definition and comparison with the other Descriptors. J. Chem. Inf. Comput. Sci., 1999, 39, 666-670.
    Venkatraman V, Pérez-Nueno VI, Mavridis L, Ritchie DW, Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J. Chem. Inf. Model., 2010, 50, 2079-2093.
    Durant, J.L.; Leland, B.A.; Henry, D.H.; Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273-1280.
    Willett, P.; Similarity searching using 2D structural fingerprints. Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology. 2011, 672, 133-58.

  • Name: Molecular Docking (SMINA)
    Description: 基于SMINA的小分子对接工具 SMINA-based small molecule docking tool
    Tags: undefined
    Author: David Ryan Koes
    Release: 2022-03-17 09:56:09
    Reference: Koes DR, Baumgartner MP, Camacho CJ. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J Chem Inf Model. 2013 Aug 26;53(8):1893-904.

    Molecular Docking (SMINA)

    简介

    Molecular Docking (SMINA)是基于SMINA的分子对接工具(背景介绍链接 )。SMINA作为Autodock Vina(http://vina.scripps.edu/ )的分支,其主要功能是预测分子之间的结合模式和相互作用,得到分子对接的能量和结合亲和力等信息。它还可以计算和比较多个分子之间的结合能力,用于药物分子的筛选、设计和优化。与Autodock Vina(version 1.1.2)相比,SMINA支持:
    1.配体SDF分子格式进行计算;
    2.多配体文件(SDF)进行对接;
    3.超过20个对接POSE输出;
    4.更易于定义受体柔性残基;
    5.极大地改进了最小化算法(最小化趋于收敛)。
    image.png

    参数说明

    Private Ligand Library (Comp<2000)

    Binding Mode

    对接模式为刚性配体对接(rigid)或者柔性配体对接(flex),
    刚性配体对接:配体自身保持刚性,经平移、旋转,在口袋内寻找合适的结合取向。
    柔性配体对接:配体在固定某些非关键部位的键长、键角的前提下允许其构象发生一定程度的变化。

    Receptor

    受体结构文件,PDB格式

    Private Ligand

    配体结构文件,支持SDF、PDB、MOL格式。只会计算前2000的分子。

    Box Center

    对接口袋中心的三维坐标(XYZ),空格分割。例如:0 0 0。

    Box Size

    对接口袋长方体盒子的大小,必须是整数,空格分割,例如 24 22 32。

    Number of Poses

    每个分子保留的最大结合模式数量

    TopN

    虚拟筛选中保留打分排名前n个分子。

    Keep Heterogens

    保留非标准氨基酸,格式为[链名]:[残基名称]-[残基编号],如A:UNL-311。不能包含特殊离子的小分子结构。

    结果说明

    输出结果包括:

    输出文件名称 说明
    TopNScores.csv 分子对接得到的打分csv文件。输出小分子最多为10,000。
    complex_001.pdb 展示配体与受体的复合物构象文件。
    output_ligand_topn.sdf 筛选后配体的SDF文件。根据指定的topN数生成,最多为10,000。
    output_complex_topn.tar.bz2 小分子与受体对接后的复合物构象PDB文件压缩包,最多生成前1000小分子的复合物结构。

    参考文献

    Koes DR, Baumgartner MP, Camacho CJ. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J Chem Inf Model. 2013 Aug 26;53(8):1893-904.

    Molecular Docking (SMINA)

    Introduction

    Molecular Docking (SMINA) is a molecular docking tool based on SMINA, a fork of AutoDock Vina (http://vina.scripps.edu/ ). SMINA predicts binding modes and interactions between molecules and reports docking energies and estimated binding affinities. It can also score and compare the binding of multiple molecules, which is useful for screening, designing, and optimizing drug molecules. Compared with AutoDock Vina (version 1.1.2), SMINA supports:

    1. Ligands supplied in SDF format.
    2. Docking of multi-ligand files (SDF).
    3. Output of more than 20 docking poses.
    4. Easier definition of flexible receptor residues.
    5. A greatly improved minimization algorithm (minimization tends to converge).
      image.png

    Parameter Description

    Rigid Docking Mode

    Receptor File

    Protein receptor structure file in PDB or PDBQT format. The receptor protein is set as rigid.

    Ligand File

    Small molecule structure file in SDF format.

    Configure File

    Binding pocket information file in TXT format, obtainable from Weview. The file content is as follows:

    center_x = -44.497
    center_y = -22.273
    center_z = -4.922
    size_x = 40
    size_y = 40
    size_z = 40
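
    For automated workflows it can help to validate such a file before submitting a job. The following Python sketch reads the `key = value` layout shown above into a box center and size. The helper names `parse_box_config` and `box_from_config` are hypothetical, and skipping `#` comment lines is an assumption not stated in the text.

```python
def parse_box_config(text):
    """Parse 'key = value' lines into a dict of floats, skipping blank lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: lines starting with '#' are comments and can be ignored.
        if not line or line.startswith("#"):
            continue
        key, _, raw = line.partition("=")
        values[key.strip()] = float(raw)  # raises ValueError on malformed lines
    return values


def box_from_config(text):
    """Return ((cx, cy, cz), (sx, sy, sz)); raises KeyError if a key is missing."""
    values = parse_box_config(text)
    center = tuple(values[f"center_{axis}"] for axis in "xyz")
    size = tuple(values[f"size_{axis}"] for axis in "xyz")
    if any(edge <= 0 for edge in size):
        raise ValueError("box sizes must be positive")
    return center, size


if __name__ == "__main__":
    example = """
    center_x = -44.497
    center_y = -22.273
    center_z = -4.922
    size_x = 40
    size_y = 40
    size_z = 40
    """
    print(box_from_config(example))
```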
    

    TopN

    Specify the top N small molecules for output, default is 100.

    Out Pose

    Number of conformations output for each ligand-protein docking, default is 10. This value should be ≤ “Run Pose”.

    Flexible Docking Mode

    Flexible Residue

    Define flexible residues in the format “chain name”:“amino acid number”, with each amino acid separated by a comma, e.g., “A:48,A:90,A:110”. Flexible amino acids must be near the pocket.

    Flexible Distance (Å)

    Set all side chains within a specified distance from the ligand as flexible, unit is Å.
    Other parameters are the same as in Rigid Docking Mode.

    Result Description

    The output includes:

    Output File Name Description
    Complex_Top1-10.pdb Files showing the top ten complex conformations with the highest scores for each ligand-protein docking
    score.csv File containing scores for all ligand-protein dockings
    TopNscore.csv Scores file sorted by the highest docking scores for each ligand-protein docking
    output.TopNComplex.tar.gz Compressed file containing PDBQT files of the top complex conformations for each ligand-protein docking in the top N small molecules
    output.TopNLigand.sdf SDF file of the top N ligands based on docking scores

    Reference

    Koes DR, Baumgartner MP, Camacho CJ. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J Chem Inf Model. 2013 Aug 26;53(8):1893-904.

  • Name: Batch Renaming
    Description: 用于小分子化学库的分子批量重命名。用户可以使用前缀和定义的长度来规范分子名称。 Batch molecule rename for chemical library. User could standardize the molecule name using prefix and defined length.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-11-18 09:38:17
    Reference: NA

    Batch Renaming

    Batch Renaming模块设计用于化学库的分子重命名。用户可以使用前缀和定义的长度来规范分子名称。例如,将一个从WCP0001开始的库重命名为WCP9999,用户可以输入WCP前缀,长度为4。用户还可以使用--keeptitle参数保存以前的名称,以保存名称之间的关系。该模块可用于大型从头库或用户私有化学库中的自定义分子命名。支持的输入文件格式为:SD(.sdf,.sd)。支持的输出文件格式为:SD(.sdf,.sd)。

    参数说明

    Input File

    小分子结构文件,SDF格式。

    Output File

    输出SDF文件名称。

    Prefix

    自定义前缀,如C表示从C001生成名称,并结合长度为3。

    Length

    固定名称长度,如4表示生成名C0001, 1表示生成C1, C2……。

    Location

    新生成名称的位置:

    1. field表示添加新字段以保存新名称。
    2. title表示替换之前的分子标题。
    3. all表示以上两种操作。

    Field Name

    字段名作为新生成的名称,仅当Location为field或all时有效。

    Keep Name

    保留以前的分子标题名称。

    结果说明

    得到重命名后的sdf文件output.sdf。

    Batch Renaming

    The Batch Renaming module is designed for renaming molecules in chemical libraries. Users can standardize molecule names using a prefix and a defined length. For example, to rename a library starting from WCP0001 to WCP9999, users can input the prefix WCP and a length of 4. Users can also use the --keeptitle parameter to preserve previous names, maintaining relationships between names. This module can be used for custom molecule naming in large de novo libraries or user-private chemical libraries. Supported input file formats: SD (.sdf, .sd). Supported output file formats: SD (.sdf, .sd).

    Parameter Description

    Input File

    Small molecule structure file in SDF format.

    Output File

    Name of the output SDF file.

    Prefix

    Custom prefix. For example, the prefix C combined with a length of 3 generates names starting from C001.

    Length

    Fixed width of the numeric part of the name. For example, 4 generates names like C0001, and 1 generates C1, C2, and so on.
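
    The way Prefix and Length combine can be illustrated with a short Python sketch. It is only an illustration of the naming scheme described above: `generate_names` and its `start` parameter are hypothetical, and letting the counter grow beyond the requested width is an assumption about overflow behavior.

```python
def generate_names(prefix, length, count, start=1):
    """Return `count` names made of `prefix` plus a zero-padded counter.

    `length` is the minimum number of digits; a counter that needs more
    digits is written in full rather than truncated.
    """
    return [f"{prefix}{index:0{length}d}" for index in range(start, start + count)]


if __name__ == "__main__":
    print(generate_names("WCP", 4, 3))
    print(generate_names("C", 1, 2))
```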

    Location

    Position for the newly generated names:

    1. field: Add a new field to save the new name.
    2. title: Replace the previous molecule title.
    3. all: Perform both of the above operations.

    Field Name

    Name of the field that stores the newly generated name; only used when Location is field or all.

    Keep Name

    Keep the previous molecule title name.

    Result Description

    The renamed molecules are written to the SDF file output.sdf.

  • Name: 3D Conf Generation (AlphaConf)
    Description: 小分子三维构象搜索模块。三维构象搜索与生成技术主要用于对蛋白质结构域或者化合物结构进行高效的搜索,以用于结构设计或筛选。唯信通过采用一种全新的限制性结构片段定义方式进行分子三维构象的生成,精度优于同类算法。通过采用非重复构象生成方法,节省大量计算时间,计算速度远超同类算法。独特高效的构象压缩技术,较同类算法的存储空间降低400~800倍,适用于超大规模三维构象库的构建和超高通量虚拟筛选。 It is a super fast 3D conformation search and generation engine. Machine learning models for bond lengths/angles based on millions of high-quality data in PubChemQC. A new way of defining restriction structure fragments is developed to generate the three-dimensional conformation of molecules, and the accuracy is better than similar algorithms. By adopting the non-repetitive conformation generation method, a lot of computation time is saved, and the computation speed is much faster than similar algorithms. The unique and efficient conformation compression technology reduces the storage space by 400-800 times compared with similar algorithms and is suitable for the construction of ultra-large-scale 3D conformation libraries and ultra-high-throughput virtual screening.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-11-11 03:20:54
    Reference:

    3D Conf Generation (AlphaConf)

    简介

    3D Conf Generation (AlphaConf)采用唯信计算自研的分子三维构象生成算法,超快速生成分子三维构象库,比OpenEye的OMEGA至少快一个数量级,后者被认为是目前最高效的商业产品。它也比薛定谔的ConfGenX快一个数量级以上。其优异的构象多样性和质量已被下游应用证明。AlphaConf非常适合用于药物分子发现的超高通量虚拟筛选。其技术特点如下:

    1. 通过采用限制性结构片段定义,构象生成精度已媲美Schrodinger的ConfGenX算法,明显优于同类开源算法,如:RDKit。
    2. 通过采用非重复构象生成方法,节省大量计算时间,计算速度远超同类算法。
    3. 专利数据格式(AC 格式),用于高效的数据存储和检索。例如,与主流的SD格式相比,数据压缩率约为400-800倍。这也意味着我们可以通过多核并行化在大约一周内为数十亿个药物分子生成构象异构体,并将它们存储在具有几TB存储容量的磁盘上。构象检索也非常令人印象深刻:每秒从磁盘获取1-2百万个3D构象(使用中等的8核机器)。
      AlphaConf与其他构象生成工具的对比情况。
      image.png

    参数说明

    Input File

    小分子结构文件,SDF格式或者压缩的SDF格式(.gz文件)。

    Max Confs

    每个分子的最大构象数,默认100。

    Energy Window

    构象能量截断值(单位:kcal/mol),默认20kcal/mol。

    Output File

    指定输出文件名称,后缀是.sd,.ac,.ac.gz或者.aux.gz。除了构象文件外,当输出文件后缀为.ac.gz或者.aux.gz还会输出片段库文件(文件后缀为.aux,其文件名根据构象文件名称自动命名,如:构象文件名设置为conf.ac.gz,片段文件名自动命名为conf.aux.gz)。

    结果说明

    输出结果包括:

    输出文件名称 说明
    SelfConf.ac.gz 构象压缩文件,AC格式,用于AlphaShape模块的构象库输入
    SelfConf.aux.gz 片段库文件(其文件名根据构象文件名称自动命名,如:构象文件名设置为conf.ac.gz或者conf.aux.gz,片段文件名自动命名为conf.aux),AUX格式,用于AlphaShape模块的片段库输入

    3D Conf Generation (AlphaConf)

    Introduction

    3D Conf Generation (AlphaConf) uses a proprietary molecular conformation generation algorithm developed by WECOMPUT to rapidly generate libraries of 3D molecular conformations. It is at least an order of magnitude faster than OpenEye’s OMEGA, which is regarded as the most efficient commercial product, and more than an order of magnitude faster than Schrödinger’s ConfGenX. Its conformational diversity and quality have been demonstrated in downstream applications, making AlphaConf well suited to ultra-high-throughput virtual screening in drug discovery. Its technical features are as follows:

    1. The precision of the conformation generation, achieved through the use of restrictive structural fragments, is comparable to Schrodinger’s ConfGenX algorithm, and significantly better than similar open-source algorithms such as RDKit.
    2. The use of a non-redundant conformation generation method saves a significant amount of computation time, making it much faster than similar algorithms.
    3. The proprietary AC format provides efficient data storage and retrieval. Compared with the mainstream SD format, it compresses the data by a factor of roughly 400-800. As a result, conformers for billions of drug molecules can be generated in about a week with multi-core parallelization and stored on a disk of a few terabytes. Retrieval is also fast: 1-2 million 3D conformations per second can be read from disk on a mid-range 8-core machine.
      The comparison of AlphaConf with other conformation generation tools.
      image.png

    Parameter

    Input File

    Small molecule structure file in SDF format, optionally gzip-compressed (.gz).

    Max Confs

    The maximum number of conformations per molecule, the default value is 100.

    Energy Window

    Energy cutoff for conformations, in kcal/mol. The default value is 20 kcal/mol.
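
    One common way to combine a maximum conformer count with an energy window is sketched below in Python. This is an assumption-laden illustration, not AlphaConf's algorithm: it assumes the window is measured relative to the lowest-energy conformer of each molecule, and `select_conformers` is a hypothetical name.

```python
def select_conformers(energies, max_confs=100, energy_window=20.0):
    """Return indices of the conformers to keep, ordered from lowest to highest energy.

    Assumption: a conformer is kept when its energy (kcal/mol) lies within
    `energy_window` of the lowest-energy conformer; at most `max_confs` are kept.
    """
    if not energies:
        return []
    lowest = min(energies)
    ranked = sorted(range(len(energies)), key=lambda i: energies[i])
    within = [i for i in ranked if energies[i] - lowest <= energy_window]
    return within[:max_confs]


if __name__ == "__main__":
    print(select_conformers([5.0, 0.0, 30.0, 12.5], max_confs=2))
```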

    Output File

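    Name of the output file; the suffix can be .sd, .ac, .ac.gz, or .aux.gz. When the suffix is .ac.gz or .aux.gz, a fragment library file (.aux) is written in addition to the conformation file. Its name is derived automatically from the conformation file name; for example, conf.ac.gz yields conf.aux.gz.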

    Result

    The output includes:

    Output File Name Description
    SelfConf.ac.gz Conformation compressed file in AC format, used as input for the conformation library in the AlphaShape module.
    SelfConf.aux.gz Fragment library file in AUX format, used as input for the fragment library in the AlphaShape module.
  • Name: Salts Removal
    Description: 从分子中去除盐或者简单地计算含盐分子的数量。 Remove salts from molecules or simply count the number of molecules containing salts.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-28 06:37:44
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Salts Removal

    简介

    该模块可以去除或者统计分子含有的盐,从而获得去盐后分子结构或者分子结构含有的盐数量。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    Output File

    输出文件名称。

    Mode

    选择去除(remove)或者统计(count)盐离子。

    结果说明

    得到无盐离子的分子结构文件outfile.sdf。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Salts Removal

    Introduction

    The Salts Removal module can remove or count the salts present in molecules, providing the option to obtain the molecular structures without salts or the count of salts in the molecular structures.

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    Name of the output file.

    Mode

    Select whether to remove (remove) or count (count) salt ions.

    Result Description

    The salt-free molecular structures are written to outfile.sdf.

    Reference

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

  • Name: Duplicates Removal
    Description: 基于规范SMILES字符串识别和删除重复分子,或者仅统计重复分子数量。 Remove duplicate molecules based on canonical SMILES strings or simply count the number of duplicate molecules.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-28 06:27:43
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Duplicates Removal

    简介

    基于规范SMILES字符串识别和删除重复分子,或者仅统计重复分子数量。支持的输入文件格式为:MOL(.mol)、SD(.sdf、.sd)、SMILES(.smi、.csv、.tsv、.txt)。支持的输出文件格式为:SD(.sdf、.sd)、SMILES(.smi、.csv、.tsv、.txt)。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    Output File

    输出文件名称。

    Mode

    选择是去除重复分子(remove)还是对重复分子进行计数(count),默认为remove。

    结果说明

    得到删除重复分子的sdf文件outfile.sdf。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Duplicates Removal

    Introduction

    The Duplicates Removal module identifies and removes duplicate molecules based on canonical SMILES strings, or it can simply count the number of duplicate molecules. Supported input file formats are: MOL (.mol), SD (.sdf, .sd), SMILES (.smi, .csv, .tsv, .txt). Supported output file formats are: SD (.sdf, .sd), SMILES (.smi, .csv, .tsv, .txt).

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    Name of the output file.

    Mode

    Select whether to remove duplicate molecules (remove) or count duplicate molecules (count), default is remove.
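
    The two modes can be sketched as follows in Python. This is a minimal illustration rather than the module's code. The `canonicalize` argument stands in for a canonical SMILES generator from a cheminformatics toolkit; the lookup table in the demo is a placeholder, not real canonicalization, and `process_duplicates` is a hypothetical name.

```python
def process_duplicates(smiles_list, canonicalize, mode="remove"):
    """Remove or count duplicates, comparing molecules by `canonicalize(smiles)`.

    mode="remove" returns the first occurrence of each molecule, in input order.
    mode="count" returns the number of entries that duplicate an earlier one.
    """
    if mode not in ("remove", "count"):
        raise ValueError(f"unknown mode: {mode}")
    seen = set()
    unique = []
    duplicates = 0
    for smiles in smiles_list:
        key = canonicalize(smiles)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
            unique.append(smiles)
    return unique if mode == "remove" else duplicates


if __name__ == "__main__":
    # Placeholder canonicalizer: a hand-written table mapping two spellings of
    # ethanol to one key. A real workflow would call a toolkit function here.
    table = {"CCO": "CCO", "OCC": "CCO", "c1ccccc1": "c1ccccc1"}
    molecules = ["CCO", "c1ccccc1", "OCC"]
    print(process_duplicates(molecules, table.get))
    print(process_duplicates(molecules, table.get, mode="count"))
```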

    Result Description

    Obtain an SDF file named outfile.sdf after removing duplicate molecules.

    Reference

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

  • Name: Diverse Subset
    Description: 基于多种2D指纹以及使用最大最小距离(MaxMin)或分层聚类方法(Hierarchical Clustering)选择分子子集。 Pick a subset of diverse molecules based on a variety of 2D fingerprints using MaxMin or an available hierarchical clustering methodology.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-22 08:41:36
    Reference: Ashton, M., Barnard, J., Casset, F. et al. Identification of Diverse Database Subsets using Property-Based and Fragment-Based Molecular Descriptions. Quantitative Structure-Activity Relationships. 2002 Dec 27; 21:598-604.

    Diverse Subset

    简介

    基于多种2D指纹选择分子子集,使用MaxMin或可用的分层聚类方法,并将它们写入文件。RDKit中可用的Dice和Tanimoto相似性函数能够处理对应于IntVect和BitVect的指纹。然而,所有其他相似性函数都期望使用BitVect指纹来计算成对相似性。因此,对于AtomPairs、Morgan、MorganFeatures和TopologicalTorsions的相似性计算,使用ExplicitBitVect指纹代替默认的IntVect指纹。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    Output File

    输出文件名称。

    Diverse Numbers

    指定划分数量。

    Mode

    利用最大最小距离(MaxMin)或分层聚类方法(Hierarchical Clustering)进行聚类,从而选择不同的分子子集类型。

    Similarity Metric

    用于计算分子间相似性的方法,有Tanimoto、Cosine以及Dice。

    • 谷本系数——Tanimoto:只关心个体间共同具有的特征是否一致这个问题,用于比较有限样本集之间的相似性与差异性。计算公式如下:
      image.png
    • 余弦相似度——Cosine:通过n维空间中两个n维向量之间角度的余弦来判断相似程度。计算公式如下:
      image.png
    • Dice相似度:是一种集合相似度度量指标。计算公式如下所示:
      image.png

    Fingerprints

    用于计算分子间相似性/距离的指纹。

    • Morgan通过设定一个从特定原子出发的半径,来统计这个半径以内的部分分子结构的数量来组成一个分子指纹。
    • AtomPairs是分子中每个原子对基于原子环境和最短路径分离。
    • MACCS166Keys是一种基于SMARTS的,长度为167的分子指纹,每一位所代表的含义可见https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt 。
    • PathLength搜索分子中特定长度的所有路径。
    • TopologicalTorsions是基于拓扑两面角描述符。

    结果说明

    按划分数量得到聚类结果,输出每个聚类中的第一个分子文件diverse_set.sdf。

    参考文献

    Ashton, M., Barnard, J., Casset, F. et al. Identification of Diverse Database Subsets using Property-Based and Fragment-Based Molecular Descriptions. Quantitative Structure-Activity Relationships. 2002 Dec 27; 21:598-604.

    Diverse Subset

    Introduction

    The Diverse Subset module selects a subset of molecules based on multiple 2D fingerprints, using MaxMin or available hierarchical clustering methods, and writes them to a file. The Dice and Tanimoto similarity functions available in RDKit can handle fingerprints corresponding to IntVect and BitVect. However, all other similarity functions expect to use BitVect fingerprints to compute pairwise similarities. Therefore, for similarity calculations of AtomPairs, Morgan, MorganFeatures, and TopologicalTorsions, ExplicitBitVect fingerprints are used instead of the default IntVect fingerprints.

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    Name of the output file.

    Diverse Numbers

    Specify the number of partitions.

    Mode

    Selection method: MaxMin (maximum-minimum distance) picking or hierarchical clustering.
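
    The idea behind MaxMin selection can be shown with a small greedy sketch in Python. It illustrates the general technique under simple assumptions and is not the module's implementation: fingerprints are sets of on-bit indices, the first pick is fixed by the hypothetical `first` parameter (implementations often seed it randomly), and `maxmin_pick` and `tanimoto_distance` are hypothetical names.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - Tanimoto similarity for fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return 0.0 if union == 0 else 1.0 - len(bits_a & bits_b) / union


def maxmin_pick(items, distance, n_pick, first=0):
    """Greedy MaxMin: repeatedly add the item farthest from everything picked so far."""
    if not 0 < n_pick <= len(items):
        raise ValueError("n_pick must be between 1 and len(items)")
    picked = [first]
    # nearest[i] is the distance from item i to its closest already-picked item.
    nearest = [distance(items[first], item) for item in items]
    while len(picked) < n_pick:
        candidate = max(
            (i for i in range(len(items)) if i not in picked),
            key=lambda i: nearest[i],
        )
        picked.append(candidate)
        for i, item in enumerate(items):
            nearest[i] = min(nearest[i], distance(items[candidate], item))
    return picked


if __name__ == "__main__":
    fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
    print(maxmin_pick(fingerprints, tanimoto_distance, 3))
```

    Each step picks the molecule whose closest already-selected neighbor is as far away as possible, so the subset spreads across the fingerprint space instead of clustering around the seed.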

    Similarity Metric

    Methods used to calculate molecular similarity, including Tanimoto, Cosine, and Dice.

    • Tanimoto Coefficient: The ratio of the features shared by two finite sets to the features present in either set. For two fingerprints with on-bit sets A and B, the calculation formula is as follows:
      $T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$
    • Cosine Similarity: The cosine of the angle between two n-dimensional vectors x and y. The calculation formula is as follows:
      $\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
    • Dice Similarity: A measure of set similarity. For two sets A and B, the calculation formula is as follows:
      $\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$
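
    The three metrics can be sketched for binary fingerprints as follows. This is a minimal illustration rather than the module's code: fingerprints are represented as Python sets of on-bit indices, the function names are hypothetical, and returning 0.0 for two empty fingerprints is an assumed convention. For binary vectors the dot product equals the number of shared on bits, so the cosine reduces to the set form used here.

```python
import math
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared on bits divided by on bits present in either fingerprint."""
    common = len(a & b)
    denom = len(a) + len(b) - common
    return common / denom if denom else 0.0


def cosine(a: Set[int], b: Set[int]) -> float:
    """Cosine of the angle between two binary vectors."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def dice(a: Set[int], b: Set[int]) -> float:
    """Twice the shared on bits divided by the total on bits."""
    denom = len(a) + len(b)
    return 2 * len(a & b) / denom if denom else 0.0


# Toy fingerprints sharing two of their four on bits.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}
print(tanimoto(fp_a, fp_b), cosine(fp_a, fp_b), dice(fp_a, fp_b))
```

    For the same pair of fingerprints, Dice is never smaller than Tanimoto, so a fixed cutoff means different things under different metrics.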

    Fingerprints

    Fingerprints used to calculate molecular similarity/distance.

    • Morgan encodes the substructures (atom environments) found within a given radius around each atom to form a molecular fingerprint.
    • AtomPairs encodes every pair of atoms in a molecule by the two atomic environments and the shortest-path separation between them.
    • MACCS166Keys is a 167-bit SMARTS-based molecular fingerprint (bit 0 is unused, leaving 166 keys); the meaning of each bit is listed at https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt .
    • PathLength enumerates all bond paths in a molecule within a specified range of lengths.
    • TopologicalTorsions encodes sequences of four consecutively bonded atoms (topological torsions).
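
    To make the atom-pair idea concrete, the sketch below counts (element, element, shortest-path distance) triples on a small molecular graph. It is a deliberately simplified assumption: real atom-pair fingerprints also encode each atom's environment (for example, its number of neighbours and pi electrons), and the function names and the graph representation are hypothetical.

```python
from collections import Counter, deque
from typing import List, Sequence, Tuple


def shortest_path_lengths(
    n_atoms: int, bonds: Sequence[Tuple[int, int]]
) -> List[List[int]]:
    """All-pairs topological distances (in bonds) via breadth-first search."""
    neighbours = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    dist = [[-1] * n_atoms for _ in range(n_atoms)]
    for start in range(n_atoms):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbours[cur]:
                if dist[start][nxt] < 0:  # not visited yet
                    dist[start][nxt] = dist[start][cur] + 1
                    queue.append(nxt)
    return dist


def atom_pair_counts(
    elements: Sequence[str], bonds: Sequence[Tuple[int, int]]
) -> Counter:
    """Count simplified atom-pair features (element, element, distance)."""
    dist = shortest_path_lengths(len(elements), bonds)
    counts: Counter = Counter()
    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            if dist[i][j] > 0:  # skip pairs in disconnected fragments
                first, second = sorted((elements[i], elements[j]))
                counts[(first, second, dist[i][j])] += 1
    return counts


# Heavy atoms of ethanol: C-C-O
ethanol = atom_pair_counts(["C", "C", "O"], [(0, 1), (1, 2)])
print(ethanol)
```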

    Result Description

    The molecules are divided into the specified number of partitions, and the first molecule in each cluster is written to the file diverse_set.sdf.
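
    The hierarchical clustering route can be sketched in the same spirit: merge the closest clusters until the requested number of partitions remains, then keep the first molecule of each cluster. Single linkage and Tanimoto distance are assumptions chosen for brevity, the module may use a different linkage, and all names below are hypothetical.

```python
from typing import List, Sequence, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - Tanimoto similarity for two sets of on-bit indices."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def single_linkage_clusters(
    fps: Sequence[Set[int]], n_clusters: int
) -> List[List[int]]:
    """Agglomerative clustering: merge the two closest clusters until
    only n_clusters remain (single linkage, assumed for illustration)."""
    clusters = [[i] for i in range(len(fps))]
    while len(clusters) > max(n_clusters, 1):
        best = None  # (distance, cluster index x, cluster index y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = min(
                    tanimoto_distance(fps[i], fps[j])
                    for i in clusters[x]
                    for j in clusters[y]
                )
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


def first_member_per_cluster(clusters: List[List[int]]) -> List[int]:
    """Index of the first (lowest-numbered) molecule in each cluster."""
    return sorted(min(cluster) for cluster in clusters)


# Toy fingerprints (hypothetical on-bit indices, not real molecules).
library = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8}]
clusters = single_linkage_clusters(library, n_clusters=2)
representatives = first_member_per_cluster(clusters)
print(clusters, representatives)
```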

    References

    Ashton, M., Barnard, J., Casset, F. et al. Identification of Diverse Database Subsets using Property-Based and Fragment-Based Molecular Descriptions. Quantitative Structure-Activity Relationships. 2002 Dec 27; 21:598-604.

  • Name: Descriptors (RDKit)
    Description: 基于RDKit计算小分子的2D和3D描述符 Calculate small molecule 2D/3D descriptors in RDKit
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-22 09:00:29
    Reference: Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Descriptors (RDKit)

    简介

    Descriptors (RDKit)模块是计算分子的2D/3D描述符并将其写入SD或CSV/TSV文本文件中。2D描述符:Autocorr2D、MolWt、Ipc、NumRotatableBonds、qed等;3D描述符:Autocorr3D、RadiusOfGyration、Eccentricity等;以及FragmentCountOnly描述符:fr_Al_COO、fr_Al_OH、fr_Al_OH_noTert等。支持的输入文件格式为:Mol(.mol)、SD(.sdf、.sd)、SMILES(.smi、.txt、.csv、.tsv)。支持的输出文件格式为:SD文件(.sdf、.sd)、CSV/TSV(.csv、.tsv、.txt)。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    Output File

    输出文件以保存计算的描述符。

    Multiprocessing

    使用多进程处理(默认:yes)。

    Type

    计算分子描述符的类型,可选值有2D、3D、FragmentCountOnly和Specify。
    2D描述符包括以下:

    Autocorr2D, BalabanJ, BertzCT, Chi0, Chi1, Chi0n - Chi4n, Chi0v - Chi4v, EState_VSA1 - EState_VSA11, ExactMolWt, FpDensityMorgan1, FpDensityMorgan2, FpDensityMorgan3, FractionCSP3, HallKierAlpha, HeavyAtomCount, HeavyAtomMolWt, Ipc, Kappa1 - Kappa3, LabuteASA, MaxAbsEStateIndex, MaxAbsPartialCharge, MaxEStateIndex, MaxPartialCharge, MinAbsEStateIndex, MinAbsPartialCharge, MinEStateIndex, MinPartialCharge, MolLogP, MolMR, MolWt, NHOHCount, NOCount, NumAliphaticCarbocycles, NumAliphaticHeterocycles, NumAliphaticRings, NumAromaticCarbocycles, NumAromaticHeterocycles, NumAromaticRings, NumHAcceptors, NumHDonors, NumHeteroatoms, NumRadicalElectrons, NumRotatableBonds, NumSaturatedCarbocycles, NumSaturatedHeterocycles, NumSaturatedRings, NumValenceElectrons, PEOE_VSA1 - PEOE_VSA14, RingCount, SMR_VSA1 - SMR_VSA10, SlogP_VSA1 - SlogP_VSA12, TPSA, VSA_EState1 - VSA_EState10, qed
    

    FragmentCountOnly描述符包括以下:

    fr_Al_COO, fr_Al_OH, fr_Al_OH_noTert, fr_ArN, fr_Ar_COO, fr_Ar_N, fr_Ar_NH, fr_Ar_OH, fr_COO, fr_COO2, fr_C_O, fr_C_O_noCOO, fr_C_S, fr_HOCCN, fr_Imine, fr_NH0, fr_NH1, fr_NH2, fr_N_O, fr_Ndealkylation1, fr_Ndealkylation2, fr_Nhpyrrole, fr_SH, fr_aldehyde, fr_alkyl_carbamate, fr_alkyl_halide, fr_allylic_oxid, fr_amide, fr_amidine, fr_aniline, fr_aryl_methyl, fr_azide, fr_azo, fr_barbitur, fr_benzene, fr_benzodiazepine, fr_bicyclic, fr_diazo, fr_dihydropyridine, fr_epoxide, fr_ester, fr_ether, fr_furan, fr_guanido, fr_halogen, fr_hdrzine, fr_hdrzone, fr_imidazole, fr_imide, fr_isocyan, fr_isothiocyan, fr_ketone, fr_ketone_Topliss, fr_lactam, fr_lactone, fr_methoxy, fr_morpholine, fr_nitrile, fr_nitro, fr_nitro_arom, fr_nitro_arom_nonortho, fr_nitroso, fr_oxazole, fr_oxime, fr_para_hydroxylation, fr_phenol, fr_phenol_noOrthoHbond, fr_phos_acid, fr_phos_ester, fr_piperdine, fr_piperzine, fr_priamide, fr_prisulfonamd, fr_pyridine, fr_quatN, fr_sulfide, fr_sulfonamd, fr_sulfone, fr_term_acetylene, fr_tetrazole, fr_thiazole, fr_thiocyan, fr_thiophene, fr_unbrch_alkane, fr_urea
    

    3D描述符包括以下:

    Asphericity, Autocorr3D, Eccentricity, GETAWAY, InertialShapeFactor, MORSE, NPR1, NPR2, PMI1, PMI2, PMI3, RDF, RadiusOfGyration, SpherocityIndex, WHIM
    

    Descriptor Names

    此选项仅在Type为“Specify”时使用。当应用多个描述符时,由逗号分隔描述符,如MolWt, qed。

    结果说明

    得到各个分子指定描述符的数值在descriptors.csv文件中。

    参考文献

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

    Descriptors (RDKit)

    Introduction

    The Descriptors (RDKit) module calculates 2D/3D descriptors of molecules and writes them to an SD or CSV/TSV text file. 2D descriptors include Autocorr2D, MolWt, Ipc, NumRotatableBonds, qed, etc.; 3D descriptors include Autocorr3D, RadiusOfGyration, Eccentricity, etc.; and FragmentCountOnly descriptors include fr_Al_COO, fr_Al_OH, fr_Al_OH_noTert, etc. Supported input file formats are: Mol (.mol), SD (.sdf, .sd), SMILES (.smi, .txt, .csv, .tsv). Supported output file formats are: SD files (.sdf, .sd), CSV/TSV (.csv, .tsv, .txt).

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    File to save the calculated descriptors.

    Multiprocessing

    Use multiprocessing for computation (default: yes).

    Type

    Type of molecular descriptors to compute. Options are 2D, 3D, FragmentCountOnly, and Specify.
    2D descriptors include the following:

    Autocorr2D, BalabanJ, BertzCT, Chi0, Chi1, Chi0n - Chi4n, Chi0v - Chi4v, EState_VSA1 - EState_VSA11, ExactMolWt, FpDensityMorgan1, FpDensityMorgan2, FpDensityMorgan3, FractionCSP3, HallKierAlpha, HeavyAtomCount, HeavyAtomMolWt, Ipc, Kappa1 - Kappa3, LabuteASA, MaxAbsEStateIndex, MaxAbsPartialCharge, MaxEStateIndex, MaxPartialCharge, MinAbsEStateIndex, MinAbsPartialCharge, MinEStateIndex, MinPartialCharge, MolLogP, MolMR, MolWt, NHOHCount, NOCount, NumAliphaticCarbocycles, NumAliphaticHeterocycles, NumAliphaticRings, NumAromaticCarbocycles, NumAromaticHeterocycles, NumAromaticRings, NumHAcceptors, NumHDonors, NumHeteroatoms, NumRadicalElectrons, NumRotatableBonds, NumSaturatedCarbocycles, NumSaturatedHeterocycles, NumSaturatedRings, NumValenceElectrons, PEOE_VSA1 - PEOE_VSA14, RingCount, SMR_VSA1 - SMR_VSA10, SlogP_VSA1 - SlogP_VSA12, TPSA, VSA_EState1 - VSA_EState10, qed
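
    Entries such as "Chi0n - Chi4n" or "PEOE_VSA1 - PEOE_VSA14" in the list above are shorthand for a numbered series of descriptor names. The sketch below expands that shorthand into individual names; the helper name and the assumption that the range runs over the last number in each name are illustrative only.

```python
import re
from typing import List

# Splits a name into prefix, last run of digits, and non-digit suffix,
# e.g. "Chi0n" -> ("Chi", "0", "n").
_NUMBERED = re.compile(r"^(.*?)(\d+)(\D*)$")


def expand_descriptor_list(text: str) -> List[str]:
    """Expand 'NameA - NameB' ranges in a comma-separated descriptor list."""
    names: List[str] = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        if " - " in item:
            start, end = (part.strip() for part in item.split(" - ", 1))
            m_start, m_end = _NUMBERED.match(start), _NUMBERED.match(end)
            if (
                m_start
                and m_end
                and m_start.group(1) == m_end.group(1)
                and m_start.group(3) == m_end.group(3)
            ):
                prefix, suffix = m_start.group(1), m_start.group(3)
                low, high = int(m_start.group(2)), int(m_end.group(2))
                names.extend(f"{prefix}{n}{suffix}" for n in range(low, high + 1))
                continue
        names.append(item)  # plain name, or a range that could not be parsed
    return names


expanded = expand_descriptor_list("Chi0, Chi1, Chi0n - Chi4n, Kappa1 - Kappa3, qed")
print(expanded)
```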
    

    FragmentCountOnly descriptors include the following:

    fr_Al_COO, fr_Al_OH, fr_Al_OH_noTert, fr_ArN, fr_Ar_COO, fr_Ar_N, fr_Ar_NH, fr_Ar_OH, fr_COO, fr_COO2, fr_C_O, fr_C_O_noCOO, fr_C_S, fr_HOCCN, fr_Imine, fr_NH0, fr_NH1, fr_NH2, fr_N_O, fr_Ndealkylation1, fr_Ndealkylation2, fr_Nhpyrrole, fr_SH, fr_aldehyde, fr_alkyl_carbamate, fr_alkyl_halide, fr_allylic_oxid, fr_amide, fr_amidine, fr_aniline, fr_aryl_methyl, fr_azide, fr_azo, fr_barbitur, fr_benzene, fr_benzodiazepine, fr_bicyclic, fr_diazo, fr_dihydropyridine, fr_epoxide, fr_ester, fr_ether, fr_furan, fr_guanido, fr_halogen, fr_hdrzine, fr_hdrzone, fr_imidazole, fr_imide, fr_isocyan, fr_isothiocyan, fr_ketone, fr_ketone_Topliss, fr_lactam, fr_lactone, fr_methoxy, fr_morpholine, fr_nitrile, fr_nitro, fr_nitro_arom, fr_nitro_arom_nonortho, fr_nitroso, fr_oxazole, fr_oxime, fr_para_hydroxylation, fr_phenol, fr_phenol_noOrthoHbond, fr_phos_acid, fr_phos_ester, fr_piperdine, fr_piperzine, fr_priamide, fr_prisulfonamd, fr_pyridine, fr_quatN, fr_sulfide, fr_sulfonamd, fr_sulfone, fr_term_acetylene, fr_tetrazole, fr_thiazole, fr_thiocyan, fr_thiophene, fr_unbrch_alkane, fr_urea
    

    3D descriptors include the following:

    Asphericity, Autocorr3D, Eccentricity, GETAWAY, InertialShapeFactor, MORSE, NPR1, NPR2, PMI1, PMI2, PMI3, RDF, RadiusOfGyration, SpherocityIndex, WHIM
    

    Descriptor Names

    This option is only used when Type is “Specify.” To compute several descriptors, separate their names with commas, e.g., MolWt, qed.

    Result Description

    The numerical values of the specified descriptors for each molecule are stored in the descriptors.csv file.

    References

    Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016 Dec 27;56(12):2292-2297.

  • Name: PAINS Filter
    Description: 通过使用SMARTS模式进行子结构搜索,从输入文件中过滤Pan-assay Interference molecules (PAINS),并将适当的分子写入输出文件或仅计算过滤分子的数量。 Filter Pan-assay Interference molecules (PAINS) from an input file by performing a substructure search using SMARTS patterns, and write the appropriate molecules to an output file or simply count the number of filtered molecules.
    Tags: undefined
    Author: Manish Sud
    Release: 2021-10-22 03:29:53
    Reference: Baell JB, Holloway GA. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J Med Chem. 2010 Apr 8;53(7):2719-40.

    PAINS Filter

    简介

    PAINS Filter模块通过SMARTS子结构规则来搜索输入文件中假阳性化合物(Pan-assay Interference molecules,PAINS),并将符合条件的分子输出或者统计过滤分子的数量。

    参数说明

    Input File

    小分子结构文件,SDF或者SMILES格式。

    Output File

    输出文件名称。

    Multiprocessing

    是否使用多进程进行计算,可选:yes或者no,默认为yes。

    Output PAINS

    输出文件包含与PAINS匹配的分子,可选:yes或者no,默认为no。

    结果说明

    输出结果包括:

    输出文件名称           说明
    output.sdf             筛选出不匹配PAINS规则的化合物
    output_Filtered.sdf    筛选出匹配PAINS规则的化合物

    参考文献

    Baell JB, Holloway GA. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J Med Chem. 2010 Apr 8;53(7):2719-40.

    PAINS Filter

    Introduction

    The PAINS Filter module uses SMARTS substructure rules to identify pan-assay interference compounds (PAINS), which tend to give false-positive results in assays, in the input file. It then either writes the appropriate molecules to the output file or only counts the filtered molecules.

    Parameter Description

    Input File

    Small molecule structure file in SDF or SMILES format.

    Output File

    Name of the output file.

    Multiprocessing

    Whether to use multiprocessing for computation, options: yes or no, default is yes.

    Output PAINS

    Whether the output file includes molecules that match PAINS, options: yes or no, default is no.

    Result Description

    The output includes:

    Output File Name       Description
    output.sdf             Compounds that do not match the PAINS rules
    output_Filtered.sdf    Compounds that match the PAINS rules
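
    How molecules end up in one of the two files can be sketched as a simple partition by match status. This is only an outline under stated assumptions: the function name is hypothetical, and the match predicate is a stand-in for a real SMARTS substructure search, which would need a cheminformatics toolkit such as RDKit.

```python
from typing import Callable, Dict, Iterable, List


def split_by_pains(
    names: Iterable[str],
    matches_pains: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Route each molecule to one of the two result files by match status."""
    outputs: Dict[str, List[str]] = {"output.sdf": [], "output_Filtered.sdf": []}
    for name in names:
        target = "output_Filtered.sdf" if matches_pains(name) else "output.sdf"
        outputs[target].append(name)
    return outputs


# Hypothetical match results standing in for a real SMARTS substructure search.
flagged = {"mol_2"}
result = split_by_pains(["mol_1", "mol_2", "mol_3"], lambda name: name in flagged)
n_filtered = len(result["output_Filtered.sdf"])  # the count-only use case
print(result, n_filtered)
```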

    References

    Baell JB, Holloway GA. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J Med Chem. 2010 Apr 8;53(7):2719-40.

  • Name: File
    Description: File是用于指定输入文件的模块,可用于多个模块的统一输入。 File is a module for specifying an input file that can be shared as input by multiple modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 10:35:43
    Reference: NA

    File

    简介

    File是用于指定输入文件的模块,可用于多个模块的统一输入。

    参数说明

    Input File

    上传小分子结构文件(SDF格式)或者蛋白的结构文件(PDB格式)

    结果说明

    输出重命名后的文件。

    File

    Introduction

    The File module is used to specify input files and can be used for unified input across multiple modules.

    Parameter Description

    Input File

    Upload a small molecule structure file (SDF format) or a protein structure file (PDB format).

    Result Description

    The renamed file is output.

  • Name: PDB File
    Description: PDB文件是一个用于指定PDB文件的模块,可用于其他模块的输入。 PDB File is a module for specifying a PDB file that can be used as input for other modules.
    Tags: undefined
    Author: WECOMPUT
    Release: 2021-10-22 17:16:59
    Reference: NA

    PDB File

    简介

    PDB文件是一个用于指定PDB文件的模块,可用于其他模块的输入。

    参数说明

    PDB File

    Protein structure file in PDB format

    结果说明

    得到PDB文件

    PDB File

    Introduction

    The PDB File module is used to specify a PDB file that can be used as input for other modules.

    Parameter Description

    PDB File

    Protein structure file in PDB format.

    Result Description

    The specified PDB file is output.