PICRUSt2 placement in a phylogenetic tree and estimation of gene copy number

Context

frogsfunc_placeseqs.py is a sequence placement tool that inserts study sequences (ASVs) into a reference phylogenetic tree. It supports several marker types (16S, ITS, or 18S) and two placement tools: epa-ng (the default) and SEPP, which cannot be used for ITS or 18S analyses. The tool outputs the tree with the inserted sequences, BIOM and FASTA files restricted to the inserted sequences, and marker gene copy number predictions for downstream functional analysis.

How it works

The program takes unaligned ASV sequences (FASTA) and their abundance table (BIOM) as input. If the marker studied is not 16S, a reference directory containing marker-specific reference sequences is required (--ref-dir). Sequences are aligned to the reference sequences and placed into the reference tree with epa-ng or SEPP. Sequences whose aligned proportion falls below the minimum alignment threshold (--min-align, default 0.8) are excluded from placement and from all subsequent steps. The selected hidden-state prediction (HSP) method (--hsp-method, default mp) is then used for trait prediction along the tree. The program outputs the updated tree, the filtered FASTA and BIOM files, the list of excluded sequences, and the predicted marker gene copy numbers, and it logs the commands it executes.
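
The minimum alignment filter described above can be illustrated with a short Python sketch. It is not the tool's own code: it assumes that the aligned proportion is the number of non-gap characters in the aligned query divided by the original, unaligned query length, and the function names aligned_fraction and split_by_min_align are hypothetical.

```python
def aligned_fraction(aligned_seq, original_length):
    """Fraction of the original query length still present (non-gap) after alignment.

    Assumption: gaps are written as '-' or '.' in the aligned sequence.
    """
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    aligned_bases = sum(1 for char in aligned_seq if char not in "-.")
    return aligned_bases / original_length


def split_by_min_align(aligned, original_lengths, min_align=0.8):
    """Split sequence IDs into (kept, excluded) according to the threshold."""
    kept, excluded = [], []
    for seq_id, aligned_seq in aligned.items():
        fraction = aligned_fraction(aligned_seq, original_lengths[seq_id])
        if fraction >= min_align:
            kept.append(seq_id)
        else:
            excluded.append(seq_id)
    return kept, excluded


# Toy data: ASV_1 aligns over its full length, ASV_2 over 3 of its 10 bases.
aligned = {"ASV_1": "--ACGTACGTAC--", "ASV_2": "----ACG-------"}
lengths = {"ASV_1": 10, "ASV_2": 10}
kept, excluded = split_by_min_align(aligned, lengths, min_align=0.8)
print(kept, excluded)
```

With the default threshold of 0.8, ASV_1 is kept and ASV_2 would end up in the excluded list.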

Command lines


      usage: frogsfunc_placeseqs.py [-h] [--version] [--debug] --input-fasta
                                    INPUT_FASTA --input-biom INPUT_BIOM
                                    [--ref-dir REF_DIR]
                                    [--placement-tool {epa-ng,sepp}]
                                    [--min-align MIN_ALIGN]
                                    [--input-marker-table INPUT_MARKER_TABLE]
                                    [--hsp-method {mp,emp_prob,pic,scp,subtree_average}]
                                    [--output-tree OUTPUT_TREE]
                                    [--excluded EXCLUDED]
                                    [--output-fasta OUTPUT_FASTA]
                                    [--output-biom OUTPUT_BIOM]
                                    [--closests-ref CLOSESTS_REF] [--html HTML]
                                    [--output-marker OUTPUT_MARKER]
                                    [--log-file LOG_FILE]

      Place study sequences (i.e. ASVs) into a reference tree.

      optional arguments:
        -h, --help            show this help message and exit
        --version             show program's version number and exit
        --debug               Keep temporary files to debug program. [Default:
                              False]

      Inputs:
        --input-fasta INPUT_FASTA
                              Input fasta file of unaligned study sequences.
        --input-biom INPUT_BIOM
                              Input biom file of unaligned study sequences.
        --ref-dir REF_DIR     If marker studied is not 16S, this is the directory
                              containing reference sequence files (for ITS, see:
                              $PICRUST2_PATH/default_files/fungi/fungi_ITS).
        --placement-tool {epa-ng,sepp}
                              Tool to place sequences into reference tree. Note that
                              epa-ng is more sensitive but very memory and computing
                              power intensive. Warning: sepp is not usable for ITS
                              and 18S analysis. [Default: epa-ng]
        --min-align MIN_ALIGN
                              Proportion of the total length of an input query
                              sequence that must align with reference sequences. Any
                              sequences with lengths below this value after making
                              an alignment with reference sequences will be excluded
                              from the placement and all subsequent steps. [Default:
                              0.8].
        --input-marker-table INPUT_MARKER_TABLE
                              The input marker table describing directly observed
                              traits (e.g. sequenced genomes) in tab-delimited
                              format. (ex
                              $PICRUSt2_PATH/default_files/fungi/ITS_counts.txt.gz).
        --hsp-method {mp,emp_prob,pic,scp,subtree_average}
                              HSP method to use. mp: predict discrete traits using
                              max parsimony. emp_prob: predict discrete traits based
                              on empirical state probabilities across tips.
                              subtree_average: predict continuous traits using
                              subtree averaging. pic: predict continuous traits with
                              phylogenetic independent contrast. scp: reconstruct
                              continuous traits using squared-change parsimony
                              [Default: mp].

      Outputs:
        --output-tree OUTPUT_TREE
                              Reference tree output with inserted sequences (format:
                              newick). [Default: frogsfunc_placeseqs_tree.nwk]
        --excluded EXCLUDED   List of sequences not inserted in the tree. [Default:
                              frogsfunc_placeseqs_excluded.txt]
        --output-fasta OUTPUT_FASTA
                              Fasta file without non-inserted sequences. (format:
                              FASTA). [Default: frogsfunc_placeseqs.fasta]
        --output-biom OUTPUT_BIOM
                              Biom file without non-inserted sequences. (format: BIOM)
                              [Default: frogsfunc_placeseqs.biom]
        --closests-ref CLOSESTS_REF
                              Information about clusters (i.e. ASVs) and their
                              closest PICRUSt2 reference sequences (identifiers,
                              taxonomies, phylogenetic distance from reference,
                              nucleotide sequences). [Default:
                              frogsfunc_placeseqs_closests_ref_sequences.txt]
        --html HTML           Path to store resulting html file. [Default:
                              frogsfunc_placeseqs_summary.html]
        --output-marker OUTPUT_MARKER
                              Output table of predicted marker gene copy numbers per
                              studied sequence in input tree. If the extension ".gz"
                              is added the table will automatically be gzipped.
                              [Default: frogsfunc_marker.tsv]
        --log-file LOG_FILE   List of commands executed. [Default: stdout]

        

Example of command line:

frogsfunc_placeseqs.py \
--input-fasta input_sequences.fasta --input-biom input_sequences.biom \
--ref-dir $PICRUSt2_PATH/default_files/fungi/fungi_ITS \
--placement-tool epa-ng \
--min-align 0.8 \
--hsp-method mp \
--output-tree frogsfunc_placeseqs_tree.nwk \
--excluded frogsfunc_placeseqs_excluded.txt \
--output-fasta frogsfunc_placeseqs.fasta \
--output-biom frogsfunc_placeseqs.biom \
--closests-ref frogsfunc_placeseqs_closests_ref_sequences.txt \
--output-marker frogsfunc_marker.tsv \
--html frogsfunc_placeseqs_summary.html \
--log-file frogsfunc_placeseqs.log
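
When the tool is called from a Python pipeline, the same command can be assembled as an argument list and checked before it is launched. The sketch below uses only options and choices listed in the usage above; the helper name build_placeseqs_command is hypothetical, and the validation it performs is an illustration rather than the tool's own argument checking.

```python
import shlex

PLACEMENT_TOOLS = ("epa-ng", "sepp")
HSP_METHODS = ("mp", "emp_prob", "pic", "scp", "subtree_average")


def build_placeseqs_command(input_fasta, input_biom, ref_dir=None,
                            placement_tool="epa-ng", min_align=0.8,
                            hsp_method="mp", log_file=None):
    """Return the frogsfunc_placeseqs.py call as a list of arguments."""
    if placement_tool not in PLACEMENT_TOOLS:
        raise ValueError(f"unknown placement tool: {placement_tool}")
    if hsp_method not in HSP_METHODS:
        raise ValueError(f"unknown HSP method: {hsp_method}")
    if not 0 < min_align <= 1:
        raise ValueError("min_align is a proportion and must be in (0, 1]")

    cmd = [
        "frogsfunc_placeseqs.py",
        "--input-fasta", input_fasta,
        "--input-biom", input_biom,
        "--placement-tool", placement_tool,
        "--min-align", str(min_align),
        "--hsp-method", hsp_method,
    ]
    # --ref-dir is only needed when the marker studied is not 16S.
    if ref_dir is not None:
        cmd += ["--ref-dir", ref_dir]
    if log_file is not None:
        cmd += ["--log-file", log_file]
    return cmd


cmd = build_placeseqs_command(
    "input_sequences.fasta",
    "input_sequences.biom",
    log_file="frogsfunc_placeseqs.log",
)
# Print a copy-pasteable shell command; to execute it, pass cmd to
# subprocess.run(cmd, check=True) on a machine where FROGS is installed.
print(shlex.join(cmd))
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.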
        

Outputs

Reference tree (--output-tree): Newick tree with inserted sequences.
Excluded sequences (--excluded): list of sequences that failed placement.
Filtered FASTA (--output-fasta): FASTA file excluding non-inserted sequences.
Filtered BIOM (--output-biom): BIOM file excluding non-inserted sequences.
Closest reference info (--closests-ref): details of ASV clusters and their closest reference sequences.
Marker gene copy numbers (--output-marker): predicted marker table per ASV, optionally gzipped.
HTML report (--html): summary of sequence placement.
Log file (--log-file): records executed commands and processing steps.
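
The tabular outputs can be inspected with a few lines of Python. The sketch below reads the marker table, treating it as gzip-compressed when its name ends in ".gz", and the list of excluded sequences. It assumes that the marker table has a header row and that the excluded file holds one sequence identifier per line; the column names "sequence" and "copy_number" in the demonstration are hypothetical placeholders, not the tool's documented header.

```python
import csv
import gzip
import os
import tempfile


def read_marker_table(path):
    """Read a tab-delimited table into a list of dicts keyed by header names."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def read_excluded_ids(path):
    """Read one identifier per line, skipping blank lines (assumed layout)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration on a temporary gzipped table with hypothetical columns.
with tempfile.TemporaryDirectory() as tmp_dir:
    table_path = os.path.join(tmp_dir, "frogsfunc_marker.tsv.gz")
    with gzip.open(table_path, "wt", newline="") as handle:
        handle.write("sequence\tcopy_number\nASV_1\t2\nASV_2\t1\n")
    rows = read_marker_table(table_path)
    print(rows[0]["sequence"], rows[0]["copy_number"])
```

Values are returned as strings, so numeric columns need an explicit conversion before any arithmetic.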