FROGS Core
Reads processing
Context
Denoising refers to the process of correcting sequencing errors that occur during high-throughput sequencing of amplicons.
These errors can originate from PCR amplification or from the sequencing platform itself (e.g., Illumina, long-read technologies,
or older 454 systems). They typically appear as spurious singletons or low-abundance variants that differ by only one or a few
bases from the actual biological sequences. Tools such as denoising.py in FROGS use clustering (e.g., with Swarm) or statistical inference (e.g., with DADA2) to distinguish real variants (ASVs) from sequencing noise.
How it works
This tool cleans raw sequencing data before denoising.
To achieve this, it relies on well-established preprocessing tools:
- Cutadapt is used to detect and remove sequencing primers or adapters from the reads.
- Delete sequences without correctly detected primers. In the context of amplicon data, primers must be trimmed because they can interfere with downstream clustering/denoising and taxonomic assignment.
Cutadapt also allows mismatches, making it robust for real-world data where primer binding is not always perfect.
- Merging of R1 and R2 reads with vsearch, flash or pear (pear is only available on the command line) into single continuous sequences when their overlap is sufficient.
- Delete sequences with unexpected lengths
- Delete sequences with ambiguous bases (N)
- Dereplication
- + removal of homopolymers (size = 8) for 454 data
- + quality filtering for 454 data
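As a rough illustration, the filtering logic above can be sketched in plain Python. This is a toy version, not the actual FROGS implementation: `preprocess` and `has_homopolymer` are hypothetical names, and the length window here matches the 420-470 bp example used later in this tutorial.

```python
from collections import Counter

def has_homopolymer(s, size):
    """True if the sequence contains a run of `size` (or more) identical bases."""
    run, prev = 1, ""
    for c in s:
        run = run + 1 if c == prev else 1
        if run >= size:
            return True
        prev = c
    return False

def preprocess(seqs, min_len=420, max_len=470, max_homopolymer=8):
    """Toy version of the cleanup filters: length window, ambiguous
    bases, homopolymers (454 data), then dereplication."""
    kept = []
    for s in seqs:
        if not (min_len <= len(s) <= max_len):
            continue  # unexpected amplicon length
        if "N" in s:
            continue  # ambiguous bases
        if has_homopolymer(s, max_homopolymer):
            continue  # 454-style homopolymer filter
        kept.append(s)
    # dereplication: unique sequences with their abundances
    return Counter(kept)
```

The `Counter` returned by the dereplication step plays the role of the abundance information that Swarm or DADA2 consume downstream.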
Then the tool can cluster sequences using either:
- Swarm, which groups sequences based on local connectivity and a user-defined distance (e.g., 1 nucleotide difference), producing robust clusters without relying on arbitrary global thresholds.
- DADA2, which applies a statistical error model to distinguish true biological sequences from sequencing errors, allowing inference of exact sequence variants with very high resolution.
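To give an intuition of the Swarm approach, here is a toy single-linkage clustering at distance d = 1, limited to substitutions. The real Swarm also handles indels, orders seeds by abundance, and offers the fastidious refinement; the function names here are illustrative only.

```python
def hamming1(a, b):
    """True if two equal-length sequences differ by exactly one base."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def swarm_like(seqs, link=hamming1):
    """Greedy single-linkage clustering: grow each cluster by
    repeatedly absorbing any unassigned sequence one difference
    away from a sequence already in the cluster."""
    pool, clusters = list(dict.fromkeys(seqs)), []
    while pool:
        seed, cluster = pool.pop(0), []
        frontier = [seed]
        while frontier:
            cur = frontier.pop()
            cluster.append(cur)
            neighbors = [s for s in pool if link(cur, s)]
            for n in neighbors:
                pool.remove(n)
            frontier.extend(neighbors)
        clusters.append(cluster)
    return clusters
```

Note how chains of single differences end up in one cluster: this is why Swarm does not need an arbitrary global identity threshold.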
Together, these steps ensure that downstream analyses are based on accurate, high-quality ASVs, reducing the impact of technical artifacts such as primer sequences, incomplete overlaps, or sequencing errors. It is possible to perform the cleanup only, with the --process preprocess-only option of denoising.py.
N.B.: Long reads
DADA2 needs to see the same sequences multiple times to model sequencing errors. If almost all sequences are unique (as is often the case with very long PacBio reads), DADA2 cannot function properly because it does not have enough information to distinguish true errors from true biological variations.
If the duplication rate is less than 10% (i.e., if the number of unique sequences > 0.9 × the total number of reads), then DADA2 is not appropriate.
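The 10% rule of thumb above is easy to check on your own data; here is a minimal sketch (assuming reads are already loaded as plain strings; the function name is hypothetical):

```python
def dada2_appropriate(seqs, min_duplication=0.10):
    """DADA2 needs repeated observations of the same sequence to model
    errors. Returns False when unique sequences exceed 90% of all
    reads, i.e. when the duplication rate is below 10%."""
    total = len(seqs)
    unique = len(set(seqs))
    duplication_rate = 1 - unique / total
    return duplication_rate >= min_duplication
```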
The knowledge of your data is essential. You have to answer the following questions to choose the parameters:
- Sequencing technology?
- Targeted region and the expected amplicon length?
- Have reads already been merged?
- Have primers already been deleted?
- What are the primer sequences?
Configuration: Short reads (16S V3V4 use cases)
Here are the answers for this dataset:
- Sequencing technology
- Illumina
- 454
- long reads
- Type of data
- R1 and R2 files for one sample
- One file by sample (R1 and R2 already merged or single-end technology data)
- Amplicon expected length
- Reads are mergeable: the V3-V4 region is ~450 bp, so reads should overlap
- Reads are not mergeable
- Reads are both mergeable and unmergeable
- Primers sequences
- Primers are still present: V3F (5’-ACGGRAGGCWGCAGT-3’) and V4R (5’-TACCAGGGTATCTAATCCT-3’) have been used for the first amplification
- Primers have already been removed
- Reads size
- 250 bp as seen previously
First, select the Main 1.a. Denoising of short reads tool.
The dataset has R1 and R2 files for one sample. Select Paired-end reads. The input file is an archive. For your own data, it will be important to know how to create an archive.
Select your file after uploading your data. It can be automatically detected.
R1 and R2 are both 250 bp. We choose to allow only a 0.1 mismatch rate in the overlap region between R1 and R2 reads.
You can choose either Flash or Vsearch as the read merging tool. In our case, we want to use Vsearch and discard any unmerged reads.
We know that our amplicons are approximately 450 bp long. After trying different minimum and maximum values, we settled on a minimum amplicon length of 420 bp and a maximum length of 470 bp.
We have the primers that were used, so we select Yes and enter them. Make sure you put the primers in the correct orientation. You can use this tool to obtain the reverse complement.
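As a sanity check, the reverse complement can also be computed with a few lines of Python. This is a standalone sketch covering the IUPAC degenerate codes that appear in these primers (such as R and W); `reverse_complement` is an illustrative name, not a FROGS function.

```python
# Complement table including the IUPAC degenerate codes
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(primer):
    """Complement every base, then reverse the sequence."""
    return primer.translate(COMPLEMENT)[::-1]

# For example, the V4R primer used above:
# reverse_complement("TACCAGGGTATCTAATCCT") -> "AGGATTAGATACCCTGGTA"
```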
Swarm
We will show you an example of the output from Swarm clustering with a distance of 1 between neighbouring sequences. We are using the fastidious option.
or DADA2
But also an example with DADA2, choosing the "Pseudo pooling: samples will be pseudo-pooled prior to sample inference" method.
Don't forget to click on the button to launch the tool.
Interpretation: Short reads (16S V3V4 use cases)
Swarm Results
Let's look at the HTML file to see the result of denoising. You have three panels: Denoising, Cluster distribution, and Sample distribution. Here, we will focus primarily on the Denoising panel.
Since the Cluster distribution and Sample distribution panels are common to multiple tools, a more detailed interpretation with visualisation of these panels can be found in the Cluster Stat section.
This bar plot shows the number of sequences that do not pass the filters. At the end of the filtering process, we have 1,156,905 sequences.
Below you will find a table showing details on merged sequences. Display all sequences (1) and select all samples (2). You can then click on Display amplicon lengths (3) and Display preprocessed amplicon lengths (4).
We observe that the majority of sequences are ~425 nt long.
The other tabs give information about clusters. They show classical characteristics of clusters built with swarm:
- A lot of clusters: 142,515
- ~81.63% of them are composed of only 1 sequence
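Figures like these can be recomputed from Swarm's output file, which writes one cluster per line with member identifiers separated by spaces; a minimal sketch (the function name is illustrative):

```python
def singleton_fraction(swarm_lines):
    """From Swarm output lines (one cluster per line, members separated
    by spaces), return (number of clusters, fraction of singletons)."""
    sizes = [len(line.split()) for line in swarm_lines if line.strip()]
    singletons = sum(1 for n in sizes if n == 1)
    return len(sizes), singletons / len(sizes)
```

A high singleton fraction, as seen here, is typical of Swarm output before the cluster/ASV filtering step.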
DADA2 Results
Let's look at the HTML file to check what happened.
- 87.39% of raw reads are kept (1,200,783 sequences out of 1,374,011)
- 8,580 clusters, ~23.24% of them composed of only 1 sequence
Configuration: Short reads (ITS use cases)
Here are the answers for this dataset:
- Sequencing technology
- Illumina
- 454
- long reads
- Type of data
- R1 and R2 files for one sample
- One file by sample (R1 and R2 already merged or single-end technology data)
- Amplicon expected length
- Reads are mergeable.
- Reads are not mergeable
- Reads are both mergeable and unmergeable; unmerged reads will be artificially combined with 100 N to allow further processing.
- Primers sequences
- Primers are still present: ITS1F (5’-CTTGGTCATTTAGAGGAAGTAA-3’) and ITS2 (5’-GCATCGATGAAGAACGCAGC-3’) have been used for the first amplification
- Primers have already been removed
- Reads size
- 250 bp
First, select the Main 1.a. Denoising of short reads tool.
The dataset has R1 and R2 files for one sample. Select Paired-end reads. Select your file after uploading your data. It can be automatically detected.
Paste/Fetch data:
- https://web-genobioinfo.toulouse.inrae.fr/~formation/15_FROGS/current/ITS_fast.tar.gz
- https://web-genobioinfo.toulouse.inrae.fr/~formation/15_FROGS/current/ITS_fast_replicates.tsv
R1 and R2 are both 250 bp. We choose to allow only a 0.1 mismatch rate in the overlap region between R1 and R2 reads.
You can choose either Flash or Vsearch as the read merging tool. In our case, we want to use Vsearch and click on Yes to retain the unmerged reads, which will be artificially combined with 100 N for further processing.
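The artificial combination of unmerged pairs can be pictured as follows. This is a simplified sketch of the idea (R1, a spacer of 100 N, then the reverse complement of R2), not the exact FROGS code:

```python
# Simple complement table; N stays N in the spacer and read bodies
DNA_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def artificial_combine(r1, r2, spacer_size=100):
    """Join an unmerged pair: R1, then a run of N's, then revcomp(R2)."""
    return r1 + "N" * spacer_size + r2.translate(DNA_COMPLEMENT)[::-1]
```

The N spacer lets downstream steps keep both reads of the pair in a single record while marking the unknown middle region.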
We know that our amplicons are approximately 450 bp long. After trying different minimum and maximum values, we settled on a minimum amplicon length of 420 bp and a maximum length of 470 bp.
We have the primers that were used, so we select Yes and enter them. Make sure you put the primers in the correct orientation. You can use this tool to obtain the reverse complement.
Swarm
We will show you an example of the output from Swarm clustering with a distance of 1 between neighbouring sequences. We are using the fastidious option.
Don't forget to click on the button to launch the tool.
Interpretation: Short reads (ITS use cases)
Swarm Results
Let's look at the HTML file to see the result of denoising. You have three panels: Denoising, Cluster distribution, and Sample distribution. Here, we will focus primarily on the Denoising panel.
Since the Cluster distribution and Sample distribution panels are common to multiple tools, a more detailed interpretation with visualisation of these panels can be found in the Cluster Stat section.
This bar plot shows the number of sequences that do not pass the filters. At the end of the filtering process, we have 198,191 sequences. We also see the number of artificially combined reads: 47,594.
Below you will find a table showing details on merged sequences. Display all sequences (1) and select all samples (2). You can then click on Display amplicon lengths (3) and Display preprocessed amplicon lengths (4).
ITS sequence sizes vary greatly.
Details on artificially combined sequences
This table provides information on artificially combined sequences by sample.
The other tabs give information about clusters. They show classical characteristics of clusters built with swarm:
- A lot of clusters: 50,883
- 97.9% of them are composed of only 1 sequence