Cluster/ASV filters

Context

Once the clusters have been reconstructed, it is essential to filter them. Most software does this internally without the user being aware of it, but in FROGS it is a user-controlled step.

How it works

This tool removes clusters according to the criteria set by the user. A cluster that meets at least one removal criterion is deleted.
The clusters in an abundance table can be filtered according to:
  • Filter on prevalence
    The minimum number of samples in which a cluster must be present.
  • Filter on abundance
    A cluster whose abundance is below a given proportion or count is removed.
  • Filter on the most abundant
    Only the N largest clusters are kept.
  • Filter on contaminant
    A cluster is removed if its sequence matches one of the proposed contaminants, phiX (a control added in Illumina sequencing technologies) or the chloroplastic/mitochondrial 16S of A. thaliana,
    or one of your own contaminant sequences (a FASTA file containing the contaminants of your choice).
Once the filters of your choice have been set, the clusters kept are those in the input BIOM file that satisfy the specified thresholds. The BIOM abundance table and the FASTA file are rewritten to contain only the kept clusters, and the discarded clusters are listed in the excluded file.
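The filtering logic described above can be sketched in a few lines of Python. This is an illustrative sketch on a toy abundance table, not the FROGS implementation: the function name `filter_clusters`, its parameters, the reason labels and the toy data are all assumptions made for the example. The contaminant filter is left out because it requires sequence alignment.

```python
def filter_clusters(table, min_prevalence=1, min_proportion=0.0, top_n=None):
    """Return (kept, excluded) for a toy table {cluster: {sample: count}}.

    A cluster is excluded as soon as it fails at least one filter.
    """
    sizes = {cluster: sum(counts.values()) for cluster, counts in table.items()}
    total = sum(sizes.values())
    # Rank clusters by size to find the N largest ones (all of them if top_n is None).
    ranked = sorted(sizes, key=sizes.get, reverse=True)
    largest = set(ranked[:top_n]) if top_n is not None else set(ranked)

    excluded = {}
    for cluster, counts in table.items():
        reasons = []
        # Prevalence: number of samples in which the cluster is observed.
        if sum(1 for c in counts.values() if c > 0) < min_prevalence:
            reasons.append("prevalence")
        # Abundance: share of the cluster in the whole table.
        if sizes[cluster] / total < min_proportion:
            reasons.append("abundance")
        # Most abundant: the cluster must be among the N largest.
        if cluster not in largest:
            reasons.append("most_abundant")
        if reasons:
            excluded[cluster] = reasons
    kept = [cluster for cluster in table if cluster not in excluded]
    return kept, excluded


# Hypothetical toy data: three clusters observed in two samples.
table = {
    "Cluster_1": {"A": 900, "B": 800},
    "Cluster_2": {"A": 150, "B": 0},
    "Cluster_3": {"A": 1, "B": 0},
}
kept, excluded = filter_clusters(table, min_prevalence=2, min_proportion=0.001)
print(kept)
print(excluded)
```

In this toy run, only Cluster_1 is kept: Cluster_2 fails the prevalence filter, and Cluster_3 fails both the prevalence and the abundance filters, which mirrors the content of an excluded file listing each discarded cluster.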

Configuration: Short reads (16S V3V4 use cases)

We now apply filters to remove low-abundance clusters, which are likely to be chimeras or artifacts. We also check whether any phiX sequences are still present. Low-abundance clusters are difficult to estimate reliably. Following FROGS guidelines, we choose a threshold of 0.005% of the overall abundance (--min-abundance 0.00005). More stringent filters, including filters based on prevalence across samples, can be applied later if needed.

sbatch -J filters -o LOGS/filters.out -e LOGS/filters.err -c 8 --export=ALL --wrap="module load devel/Miniforge/Miniforge3 && module load bioinfo/FROGS/FROGS-v5.0.2 && cluster_filters.py --input-fasta FROGS/SWARM/remove_chimera.fasta --input-biom FROGS/SWARM/remove_chimera.biom --output-fasta FROGS/SWARM/filters.fasta --nb-cpus 8 --log-file FROGS/SWARM/filters.log --output-biom FROGS/SWARM/filters.biom --html FROGS/SWARM/filters.html --excluded FROGS/SWARM/filters_excluded.tsv --contaminant /save/user/frogs/galaxy_databanks/phiX_genome/phi.fa --min-sample-presence 1 --min-abundance 0.00005 && module unload bioinfo/FROGS/FROGS-v5.0.2"
(to see all settings: cluster_filters.py --help)
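The abundance threshold in the command is written as a proportion, not as a percentage, which is why 0.005% becomes 0.00005. The short Python sketch below shows the conversion and the cluster size it corresponds to. The total of 1,000,000 sequences is a hypothetical figure chosen to keep the arithmetic simple, not the size of this dataset, and how the tool treats a cluster sitting exactly on the threshold is not covered here.

```python
from fractions import Fraction

# Exact arithmetic avoids floating-point rounding on such small values.
percent = Fraction("0.005")          # threshold expressed as a percentage
proportion = percent / 100           # value written after --min-abundance
total_sequences = 1_000_000          # hypothetical table size, for illustration only

# Cluster size that corresponds to the threshold for this table size.
threshold_count = proportion * total_sequences

print(f"proportion = {float(proportion):.5f}")
print(f"threshold  = {threshold_count} sequences out of {total_sequences}")
```

With these assumed numbers, clusters of fewer than about 50 sequences fall below the threshold.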



Interpretation: Short reads (16S V3V4 use cases)

Let's look at the HTML file to see the result of the cluster filters.
You have four panels: Filters by ASVs, Filters by samples, ASV distribution, and Sample distribution. Here, we will focus primarily on the Summary panel.

Since the ASV distribution and Sample distribution panels are common to multiple tools, a more detailed interpretation, with visualisations of these panels, can be found in the Cluster Stat section.


Cluster filters typically remove a large proportion of ASVs. However, these ASVs account for only a small share of the sequences.
  • 99.6% of ASVs are removed and ~7% of sequences are lost, but these mostly correspond to low-abundance clusters
  • 213 clusters are kept!
  • 962,265 sequences remain


You can see how many sequences were removed by each filter (1 and 2) separately, or by both filters together, by clicking on Venn diagram (3). This displays a Venn diagram similar to the one below.
Here, the abundance filter removed all of these sequences; there do not appear to be any contaminants in our sequences.
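A Venn diagram of this kind is a view of overlapping sets: each filter flags a set of clusters, and a given cluster may be flagged by one filter or by several. The minimal Python sketch below illustrates the idea with made-up cluster names; none of them come from this dataset, and the variable names are assumptions for the example.

```python
# Hypothetical sets of clusters flagged by each filter.
flagged_by_abundance = {"Cluster_7", "Cluster_8", "Cluster_9"}
flagged_by_contaminant = {"Cluster_9"}

# The three regions of a two-set Venn diagram.
only_abundance = flagged_by_abundance - flagged_by_contaminant
only_contaminant = flagged_by_contaminant - flagged_by_abundance
both = flagged_by_abundance & flagged_by_contaminant

# Every cluster flagged by at least one filter is removed.
removed = flagged_by_abundance | flagged_by_contaminant

print(len(only_abundance), len(only_contaminant), len(both), len(removed))
```

When no contaminant is detected, as in the run described here, the contaminant set is empty and every removed cluster sits in the abundance-only region.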

Filters by samples


You can see details on the filtered clusters within each sample. You can also sort the table by column.
For example, if there had been contamination, this table would have allowed us to identify the contaminated samples.

Configuration: Short reads (ITS use cases)

We now apply filters to remove low-abundance clusters, which are likely to be chimeras or artifacts. We also check whether any phiX sequences are still present. Low-abundance clusters are difficult to estimate reliably. Following FROGS guidelines, we choose a threshold of 0.005% of the overall abundance (--min-abundance 0.00005). We also set the replicate prevalence filter (--min-replicate-presence) to 0.5, using the replicate groups defined in ITS_fast_replicates.tsv.

sbatch -J filters -o LOGS/filters.out -e LOGS/filters.err -c 8 --export=ALL --wrap="module load devel/Miniforge/Miniforge3 && module load bioinfo/FROGS/FROGS-v5.0.2 && cluster_filters.py --input-fasta FROGS/ITS/remove_chimera.fasta --input-biom FROGS/ITS/remove_chimera.biom --output-fasta FROGS/ITS/filters.fasta --nb-cpus 8 --log-file FROGS/ITS/filters.log --output-biom FROGS/ITS/filters.biom --html FROGS/ITS/filters.html --excluded FROGS/ITS/filters_excluded.tsv --contaminant /save/user/frogs/galaxy_databanks/phiX_genome/phi.fa --min-abundance 0.00005 --min-replicate-presence 0.5 --replicate_file ITS_fast_replicates.tsv && module unload bioinfo/FROGS/FROGS-v5.0.2"
(to see all settings: cluster_filters.py --help)
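As a rough illustration of a replicate-based prevalence filter, the Python sketch below keeps a cluster when it is present in at least half of the replicates of at least one group. This reading of --min-replicate-presence, the sample-to-group mapping standing in for the replicate file, the function name `passes_replicate_presence` and all sample and cluster names are assumptions made for the example; check cluster_filters.py --help for the exact behaviour.

```python
from collections import defaultdict

# Hypothetical replicate description: sample name -> group name.
replicates = {
    "S1": "soil", "S2": "soil", "S3": "soil",
    "S4": "root", "S5": "root", "S6": "root",
}

# Toy abundance table: cluster -> {sample: count}.
table = {
    "Cluster_1": {"S1": 12, "S2": 5, "S3": 0, "S4": 0, "S5": 0, "S6": 0},
    "Cluster_2": {"S1": 3, "S2": 0, "S3": 0, "S4": 7, "S5": 0, "S6": 0},
}


def passes_replicate_presence(counts, replicates, min_presence):
    """True if the cluster is present in at least `min_presence` (a proportion)
    of the replicates of at least one group."""
    present = defaultdict(int)
    size = defaultdict(int)
    for sample, group in replicates.items():
        size[group] += 1
        if counts.get(sample, 0) > 0:
            present[group] += 1
    return any(present[group] / size[group] >= min_presence for group in size)


kept = [
    cluster for cluster, counts in table.items()
    if passes_replicate_presence(counts, replicates, 0.5)
]
print(kept)
```

Cluster_1 is found in two of the three soil replicates and is kept, whereas Cluster_2 is found in only one replicate of each group and is discarded.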



Interpretation: Short reads (ITS use cases)

This report shows the impact of our filters: