PICRUSt2 placement in a phylogenetic tree and estimation of gene copy number
Context
frogsfunc_placeseqs.py is a sequence placement tool that inserts study sequences (ASVs) into a reference phylogenetic tree. It supports multiple marker types, such as 16S, ITS, or 18S, and allows users to place sequences using different placement algorithms (epa-ng or SEPP). The tool prepares the sequences and outputs a tree with inserted sequences, along with filtered BIOM and FASTA files, and marker gene copy number predictions for downstream functional analysis.
How it works
The program takes unaligned ASV sequences (FASTA) and their abundance table (BIOM) as input. If a marker other than 16S is analyzed, a reference directory containing marker-specific reference sequences is required (--ref-dir). Sequences are aligned to the reference sequences and placed into the reference tree using epa-ng or SEPP. Query sequences whose aligned proportion falls below the --min-align threshold (default 0.8) are excluded from placement and from all subsequent steps. Marker gene copy numbers are then predicted along the tree with the selected hidden-state prediction (HSP) method (--hsp-method). The program outputs the tree with the inserted sequences, the filtered FASTA and BIOM files, and the predicted marker gene copy numbers, and it logs the commands executed.
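The --min-align filter described above can be illustrated with a minimal sketch. It assumes the aligned proportion is the number of non-gap characters in the aligned query divided by the original query length, and that a sequence exactly at the threshold is kept. The function names and the gap characters are hypothetical and are not taken from the tool itself, which may count aligned positions differently.

```python
def aligned_fraction(aligned_seq, original_length):
    """Return the share of the original query that is aligned.

    Assumption: '-' and '.' mark gaps; every other character is an aligned base.
    """
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    aligned_bases = sum(1 for char in aligned_seq if char not in "-.")
    return aligned_bases / original_length


def split_by_min_align(aligned, original_lengths, min_align=0.8):
    """Split sequence IDs into (kept, excluded) according to min_align.

    aligned: dict mapping sequence ID to its aligned sequence string.
    original_lengths: dict mapping sequence ID to the unaligned query length.
    """
    kept, excluded = [], []
    for seq_id, aligned_seq in aligned.items():
        fraction = aligned_fraction(aligned_seq, original_lengths[seq_id])
        # Sequences below the threshold are dropped from all later steps.
        if fraction >= min_align:
            kept.append(seq_id)
        else:
            excluded.append(seq_id)
    return kept, excluded
```

With this sketch, a query of length 10 that has only 4 aligned bases has an aligned proportion of 0.4 and would be listed as excluded at the default threshold of 0.8.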
Command lines
usage: frogsfunc_placeseqs.py [-h] [--version] [--debug] --input-fasta
INPUT_FASTA --input-biom INPUT_BIOM
[--ref-dir REF_DIR]
[--placement-tool {epa-ng,sepp}]
[--min-align MIN_ALIGN]
[--input-marker-table INPUT_MARKER_TABLE]
[--hsp-method {mp,emp_prob,pic,scp,subtree_average}]
[--output-tree OUTPUT_TREE]
[--excluded EXCLUDED]
[--output-fasta OUTPUT_FASTA]
[--output-biom OUTPUT_BIOM]
[--closests-ref CLOSESTS_REF] [--html HTML]
[--output-marker OUTPUT_MARKER]
[--log-file LOG_FILE]
Place study sequences (i.e. ASVs) into a reference tree.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--debug Keep temporary files to debug program. [Default:
False]
Inputs:
--input-fasta INPUT_FASTA
Input FASTA file of unaligned study sequences.
--input-biom INPUT_BIOM
Input BIOM file (abundance table) of the study sequences.
--ref-dir REF_DIR If the marker studied is not 16S, this is the directory
containing reference sequence files (for ITS, see:
$PICRUST2_PATH/default_files/fungi/fungi_ITS).
--placement-tool {epa-ng,sepp}
Tool to place sequences into the reference tree. Note that
epa-ng is more sensitive but very memory and computing
power intensive. Warning: sepp is not usable for ITS
and 18S analysis. [Default: epa-ng]
--min-align MIN_ALIGN
Proportion of the total length of an input query
sequence that must align with reference sequences. Any
sequences with lengths below this value after making
an alignment with reference sequences will be excluded
from the placement and all subsequent steps. [Default:
0.8].
--input-marker-table INPUT_MARKER_TABLE
The input marker table describing directly observed
traits (e.g. sequenced genomes) in tab-delimited
format (e.g.
$PICRUSt2_PATH/default_files/fungi/ITS_counts.txt.gz).
--hsp-method {mp,emp_prob,pic,scp,subtree_average}
HSP method to use. mp: predict discrete traits using
max parsimony. emp_prob: predict discrete traits based
on empirical state probabilities across tips.
subtree_average: predict continuous traits using
subtree averaging. pic: predict continuous traits with
phylogenetic independent contrasts. scp: reconstruct
continuous traits using squared-change parsimony
[Default: mp].
Outputs:
--output-tree OUTPUT_TREE
Reference tree output with inserted sequences (format:
newick). [Default: frogsfunc_placeseqs_tree.nwk]
--excluded EXCLUDED List of sequences not inserted in the tree. [Default:
frogsfunc_placeseqs_excluded.txt]
--output-fasta OUTPUT_FASTA
FASTA file without the sequences that were not inserted
(format: FASTA). [Default: frogsfunc_placeseqs.fasta]
--output-biom OUTPUT_BIOM
BIOM file without the sequences that were not inserted
(format: BIOM). [Default: frogsfunc_placeseqs.biom]
--closests-ref CLOSESTS_REF
Information about clusters (i.e. ASVs) and their
closest PICRUSt2 reference sequences (identifiers,
taxonomies, phylogenetic distance from the reference,
nucleotide sequences). [Default:
frogsfunc_placeseqs_closests_ref_sequences.txt]
--html HTML Path to store resulting html file. [Default:
frogsfunc_placeseqs_summary.html]
--output-marker OUTPUT_MARKER
Output table of predicted marker gene copy numbers per
studied sequence in input tree. If the extension ".gz"
is added the table will automatically be gzipped.
[Default: frogsfunc_marker.tsv]
--log-file LOG_FILE List of commands executed. [Default: stdout]
Example command line:
frogsfunc_placeseqs.py \
--input-fasta input_sequences.fasta --input-biom input_sequences.biom \
--ref-dir $PICRUSt2_PATH/default_files/fungi/fungi_ITS \
--placement-tool epa-ng \
--min-align 0.8 \
--hsp-method mp \
--output-tree frogsfunc_placeseqs_tree.nwk \
--excluded frogsfunc_placeseqs_excluded.txt \
--output-fasta frogsfunc_placeseqs.fasta \
--output-biom frogsfunc_placeseqs.biom \
--closests-ref frogsfunc_placeseqs_closests_ref_sequences.txt \
--output-marker frogsfunc_marker.tsv \
--html frogsfunc_placeseqs_summary.html \
--log-file frogsfunc_placeseqs.log
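When the tool is called from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of FROGS: the function build_placeseqs_command is hypothetical, and it uses only options listed in the usage above. As a precaution it refuses to combine --ref-dir with sepp, on the assumption that a reference directory implies a non-16S marker, for which the help text says sepp is not usable.

```python
import shlex

PLACEMENT_TOOLS = ("epa-ng", "sepp")
HSP_METHODS = ("mp", "emp_prob", "pic", "scp", "subtree_average")


def build_placeseqs_command(input_fasta, input_biom, ref_dir=None,
                            placement_tool="epa-ng", min_align=0.8,
                            hsp_method="mp", log_file=None):
    """Return the frogsfunc_placeseqs.py call as a list of arguments."""
    if placement_tool not in PLACEMENT_TOOLS:
        raise ValueError("placement_tool must be one of %s" % (PLACEMENT_TOOLS,))
    if hsp_method not in HSP_METHODS:
        raise ValueError("hsp_method must be one of %s" % (HSP_METHODS,))
    if not 0 <= min_align <= 1:
        raise ValueError("min_align is a proportion between 0 and 1")
    # Assumption: --ref-dir is only given for non-16S markers (ITS, 18S),
    # and the help text warns that sepp is not usable for those markers.
    if ref_dir is not None and placement_tool == "sepp":
        raise ValueError("sepp is not usable for ITS and 18S analysis")

    command = [
        "frogsfunc_placeseqs.py",
        "--input-fasta", input_fasta,
        "--input-biom", input_biom,
        "--placement-tool", placement_tool,
        "--min-align", str(min_align),
        "--hsp-method", hsp_method,
    ]
    if ref_dir is not None:
        command += ["--ref-dir", ref_dir]
    if log_file is not None:
        command += ["--log-file", log_file]
    return command


if __name__ == "__main__":
    # Print a copy-pastable command line for a default 16S run.
    example = build_placeseqs_command("input_sequences.fasta",
                                      "input_sequences.biom",
                                      log_file="frogsfunc_placeseqs.log")
    print(shlex.join(example))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces. Output options that are omitted fall back to the defaults listed in the help.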
Outputs
Reference tree (--output-tree): Newick tree with inserted sequences.
Excluded sequences (--excluded): list of sequences that failed placement.
Filtered FASTA (--output-fasta): FASTA file excluding non-inserted sequences.
Filtered BIOM (--output-biom): BIOM file excluding non-inserted sequences.
Closest reference info (--closests-ref): details of ASV clusters and their closest reference sequences.
Marker gene copy numbers (--output-marker): predicted marker table per ASV, optionally gzipped.
HTML report (--html): summary of sequence placement.
Log file (--log-file): records executed commands and processing steps.
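Because the marker table may be written either as plain text or gzipped depending on the file extension, downstream scripts need to handle both cases. The sketch below is one possible way to read it. It assumes only that the table is tab-delimited with a header row; the function name read_marker_table is hypothetical, and no specific column names are assumed.

```python
import csv
import gzip


def read_marker_table(path):
    """Read a tab-delimited marker table, gzipped or not, as a list of dicts."""
    # Choose the opener from the extension, mirroring the ".gz" convention
    # described for --output-marker.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        # Each row becomes a dict keyed by the header line of the table.
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, read_marker_table("frogsfunc_marker.tsv") and read_marker_table("frogsfunc_marker.tsv.gz") return the same rows if the two files hold the same table.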