[PDF] fastp: an ultra-fast all-in-one FASTQ preprocessor | Semantic Scholar (2024)

@article{Chen2018fastpAU, title={fastp: an ultra-fast all-in-one FASTQ preprocessor}, author={Shifu Chen and Yanqing Zhou and Yaru Chen and Jia Gu}, journal={Bioinformatics}, year={2018}, volume={34}, pages={i884 - i890}, url={https://api.semanticscholar.org/CorpusID:52196534}}
  • Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu
  • Published in bioRxiv 1 March 2018
  • Computer Science

Fastp is developed as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features that can perform quality control, adapter trimming, quality filtering, per-read quality cutting, and many other operations with a single scan of the FastQ data.

10,599 Citations

  • Highly Influential Citations: 1,379
  • Background Citations: 562
  • Methods Citations: 2,831
  • Results Citations: 8

Topics

Fastp, Adapter Trimming, SOAPnuke, AfterQC, Cutadapt, Adapter Trimmer, FastQC, Adapter Contamination, Base Correction, Adapter Sequences

10,599 Citations

Atria: an ultra-fast and accurate trimmer for adapter and quality trimming
    Jiacheng Chuan, Aiguo Zhou, L. Hale, Miao He, Xiang Li

    Computer Science, Biology

    bioRxiv

  • 2021

Atria matches the adapters in paired reads and finds possible overlapped regions with a super-fast and carefully designed byte-based matching algorithm (O(n) time with O(1) space) that can be used in a broad range of short-sequence matching applications.

Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data
    Kun Sun

    Computer Science, Biology

    Bioinform.

  • 2020

Ktrim was ∼2-18 times faster than current tools and also showed high accuracy when applied on the testing datasets and could serve as a valuable and efficient tool for short-read NGS data preprocessing.

  • 24
RabbitFX: Efficient Framework for FASTA/Q File Parsing on Modern Multi-Core Platforms
    Hao Zhang, Honglei Song, Weiguo Liu

    Computer Science, Biology

    IEEE/ACM Transactions on Computational Biology…

  • 2023

RabbitFX is a fast, efficient, and easy-to-use framework for processing biological sequencing data on modern multi-core platforms that can efficiently read FASTA and FASTQ files by combining a lightweight parsing method by means of an optimized formatting implementation.

  • 2
  • Highly Influenced
RabbitQCPlus 2.0: More Efficient and Versatile Quality Control for Sequencing Data.
    Lifeng Yan, Zekun Yin, Weiguo Liu

    Computer Science, Biology

    Methods

  • 2023
FastProNGS: fast preprocessing of next-generation sequencing reads
    Xiaoshuang Liu, Zhenhe Yan, Chao Wu, Yang Yang, Xiaoming Li, Guangxin Zhang

    Computer Science

    BMC Bioinformatics

  • 2019

FastProNGS is a rapid, standardized, and user-friendly tool for preprocessing next-generation sequencing data within minutes and is an all-in-one software that is convenient for bulk data analysis.

  • 13
RabbitQCPlus: More Efficient Quality Control for Sequencing Data
    Lifeng Yan, Zekun Yin, Weiguo Liu

    Computer Science

    2022 IEEE International Conference on…

  • 2022

RabbitQCPlus is an ultra-efficient quality control tool for modern multi-core systems that uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains.

SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files
    Andrea Telatin, P. Fariselli, G. Birolo

    Computer Science, Biology

    Bioengineering

  • 2021

A suite of tools, called SeqFu (Sequence Fastx utilities), that provides a broad range of commands to perform both common and specialist operations with ease and is designed to be easily implemented in high-performance analytical pipelines.

FAST: FPGA-based Acceleration of Genomic Sequence Trimming
    Behnam Khaleghi, Tianqi Zhang, Tajana Rosing

    Computer Science, Biology

    2022 IEEE Biomedical Circuits and Systems…

  • 2022

This work proposes the first FPGA-based framework dubbed FAST to accelerate the stages that deal with sequence trimming, in particular adapter and primer removal, which supports a comprehensive set of functionalities and is convenient to use by operating on standard genomics data formats.

  • 1
  • Highly Influenced
EARRINGS: an efficient and accurate adapter trimmer entails no a priori adapter sequences
    Ting-Hsuan Wang, Cheng-Ching Huang, Jui-Hung Hung

    Computer Science, Biology

    Bioinform.

  • 2021

A set of fast and accurate adapter detection and trimming algorithms that require no a priori adapter sequences is introduced; they are particularly useful in meta-analyses of large batches of datasets and can be incorporated into sequence analysis pipelines at any scale.

  • 3
Falco: high-speed FastQC emulation for quality control of sequencing data
    Guilherme de Sena Brandine, Andrew D. Smith

    Computer Science, Biology

    F1000Research

  • 2019

Falco is presented, an emulation of the popular FastQC tool that runs on average three times faster while generating equivalent results and requires less memory to run and provides more flexible visualization of HTML reports.

...

...

20 References

SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data
    Yuxin Chen, Yongsheng Chen, Qiang Chen

    Computer Science, Biology

    GigaScience

  • 2018

SOAPnuke is demonstrated as a tool with abundant functions for a “QC-Preprocess-QC” workflow and MapReduce acceleration framework that enables large scalability to distribute all the processing works to an entire compute cluster.

Trimmomatic: a flexible trimmer for Illumina sequence data
    Anthony M. Bolger, M. Lohse, B. Usadel

    Computer Science, Biology

    Bioinform.

  • 2014

Trimmomatic is developed as a more flexible and efficient preprocessing tool, which could correctly handle paired-end data and is shown to produce output that is at least competitive with, and in many cases superior to, that produced by other tools, in all scenarios tested.

AfterQC: automatic filtering, trimming, error removing and quality control for fastq data
    Shifu Chen, Tanxiao Huang, Yanqing Zhou, Yue Han, Mingyan Xu, Jia Gu

    Computer Science

    BMC Bioinformatics

  • 2017

Experimental results show that AfterQC can help to eliminate the sequencing errors for pair-end sequencing data to provide much cleaner outputs, and consequently help to reduce the false-positive variants, especially for the low-frequency somatic mutations.

  • 258
Cutadapt removes adapter sequences from high-throughput sequencing reads
    Marcel Martin

    Computer Science, Biology

  • 2011

The command-line tool cutadapt is developed, which supports 454, Illumina and SOLiD (color space) data, offers two adapter trimming algorithms, and has other useful features.

  • 22,238
Fast gapped-read alignment with Bowtie 2
    Ben Langmead, S. Salzberg

    Computer Science, Biology

    Nature Methods

  • 2012

Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

  • 39,888
SpeedSeq: Ultra-fast personal genome analysis and interpretation
    Colby Chiang, Ryan M. Layer, Ira M. Hall

    Computer Science, Biology

    Nature Methods

  • 2015

The SpeedSeq platform accomplishes alignment, variant detection and functional annotation of a 50× human genome in 13 h on a low-cost server and alleviates a bioinformatics bottleneck that typically demands weeks of computation with extensive hands-on expert involvement.

  • 452
The Sequence Alignment/Map format and SAMtools
    Heng Li, R. Handsaker, R. Durbin

    Computer Science, Biology

    Bioinform.

  • 2009

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by

UMI-tools: Modelling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy
    Tom S. Smith, A. Heger, I. Sudbery

    Computer Science, Biology

    bioRxiv

  • 2016

It is shown that errors in the UMI sequence are common and network-based methods to account for these errors when identifying PCR duplicates are introduced, demonstrating the value of properly accounting for errors in UMIs.

  • 1,219
Detecting ultralow-frequency mutations by Duplex Sequencing
    Scott R. Kennedy, Michael W. Schmitt, L. Loeb

    Biology

    Nature Protocols

  • 2014

A detailed protocol for efficient DS adapter synthesis, library preparation and target enrichment, as well as an overview of the data analysis workflow are provided.

  • 360
Theoretical and practical advances in genome halving
    F. Collyn, L. Guy, M. Marceau, M. Simonet, Claude-Alain H. Roten

    Biology

  • 2004

The authors' tighter bounds on genome halving distance yield a new algorithm for reconstructing an ancestral duplicated genome, and a software package GenomeHalving is created based on this new algorithm, identifying a sequence of translocations for halving the yeast genome that is shorter than previously conjectured possible.

  • 28,326

...

...


    FAQs

    Does fastp remove duplicates?

    Duplication rate and deduplication

    For both SE and PE data, fastp supports evaluating the duplication rate and removing duplicated reads/pairs. fastp considers a read duplicated only if all of its base pairs are identical to those of another read.
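The exact-match rule described above can be illustrated with a small sketch. This is not fastp's implementation; the helper names `dedup_exact` and `duplication_rate` are hypothetical, and reads are assumed to be simple `(name, sequence)` tuples held in memory.

```python
def dedup_exact(reads):
    """Keep the first read for each distinct sequence; drop later exact copies.

    `reads` is an iterable of (name, sequence) tuples. Two reads count as
    duplicates only when their sequences are identical base for base.
    """
    seen = set()
    unique = []
    for name, seq in reads:
        if seq in seen:
            continue  # exact copy of a sequence we already kept
        seen.add(seq)
        unique.append((name, seq))
    return unique


def duplication_rate(reads):
    """Fraction of reads that are exact copies of an earlier read."""
    reads = list(reads)
    if not reads:
        return 0.0
    return 1 - len(dedup_exact(reads)) / len(reads)


# Toy single-end data: r2 is an exact copy of r1; r3 differs by one base.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "ACGA"), ("r4", "TTTT")]
unique_reads = dedup_exact(reads)
rate = duplication_rate(reads)
```

For paired-end data, the same idea would apply to the pair as a unit, for example by using the tuple of both mate sequences as the key.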

    What is the difference between FastQC and fastp?

    The second-fastest tool is FastQC, which takes about 2x the time of fastp. However, FastQC only performs quality control, while fastp performs quality control (on both pre-filtering and post-filtering data), data filtering, and other operations. The other tools take 3x to 5x the time of fastp.

    What is fastp used for?

    fastp is a FASTQ data preprocessing tool. It provides quality control, adapter trimming, quality filtering, and per-read quality pruning. It also supports multi-threading.

    What is the difference between Trimmomatic and fastp?

    The fastp-filtered data contains no suspected adapters when four or fewer mismatches are allowed. Compared with fastp-filtered data, Trimmomatic-filtered data contains fewer suspected adapters when five or more mismatches are allowed, but more when four mismatches are allowed.

    How do I remove duplicates in data preprocessing?

    To delete duplicates, use the drop_duplicates function in pandas. Its keep argument controls which record survives: keep='first' keeps the first record and deletes the other duplicates, keep='last' keeps the last record and deletes the rest, and keep=False deletes every record that has a duplicate.

    What is the easiest way to remove duplicates?

    Remove duplicate values
    1. Select the range of cells that has duplicate values you want to remove. ...
    2. Select Data > Remove Duplicates, and then under Columns, check or uncheck the columns where you want to remove the duplicates. ...
    3. Select OK.

    What does the Q in FASTQ stand for?

    Quality Score Encoding

    In FASTQ files, quality scores are encoded into a compact form, which uses only 1 byte per quality value. In this encoding, the quality score is represented as the character with an ASCII code equal to its value + 33.
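The "+33" encoding described above can be sketched as follows. The function names are hypothetical, and the error-probability formula P = 10^(-Q/10) is the standard Phred definition rather than something stated in the text above.

```python
def decode_phred33(quality_string):
    """Turn a FASTQ quality string into integer scores (ASCII code minus 33)."""
    return [ord(ch) - 33 for ch in quality_string]


def encode_phred33(scores):
    """Inverse of decode_phred33: one ASCII character per quality score."""
    return "".join(chr(q + 33) for q in scores)


def error_probability(q):
    """Standard Phred relation: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# '!' has ASCII code 33, '5' has 53, 'I' has 73.
scores = decode_phred33("!5I")
round_trip = encode_phred33(scores)
```

Here `scores` is `[0, 20, 40]`, and a score of 20 corresponds to an error probability of about 1 in 100.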

    What are overrepresented sequences in FastQC?

    Overrepresented Sequences – List of sequences which appear more than expected in the file. Only the first 50bp are considered. A sequence is considered overrepresented if it accounts for ≥ 0.1% of the total reads.
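The rule above (truncate to 50 bp, flag anything at or above 0.1% of reads) can be approximated in a few lines. This is a simplified sketch, not FastQC's own implementation; `find_overrepresented`, `int_to_dna`, and the toy data are all hypothetical.

```python
from collections import Counter


def find_overrepresented(sequences, prefix_len=50, min_fraction=0.001):
    """Return (prefix, count, fraction) for prefixes at or above min_fraction.

    Only the first `prefix_len` bases of each read are compared, mirroring
    the 50 bp rule described in the text.
    """
    sequences = list(sequences)
    total = len(sequences)
    if total == 0:
        return []
    counts = Counter(seq[:prefix_len] for seq in sequences)
    return [
        (prefix, count, count / total)
        for prefix, count in counts.most_common()
        if count / total >= min_fraction
    ]


def int_to_dna(n, length=12):
    """Encode an integer as a base-4 DNA string (used only to build toy data)."""
    bases = "ACGT"
    out = []
    for _ in range(length):
        out.append(bases[n % 4])
        n //= 4
    return "".join(out)


# 2000 reads: 5 copies of one sequence (0.25%) plus 1995 distinct reads (0.05% each).
adapter_like = "AGATCGGAAGAGC"
reads = [adapter_like] * 5 + [int_to_dna(i) for i in range(1995)]
hits = find_overrepresented(reads)
```

Only the repeated sequence crosses the threshold in this toy data set, so `hits` has a single entry.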

    What is the alternative to Trimmomatic?

    I also recommend FastX, CutAdapt and Prinseq as alternatives to Trimmomatic. Additionally, if you have no experience with the Linux command line, you can run some of the previously mentioned software (FastX, FastQC, Trimmomatic, CutAdapt) on the Galaxy platform.

    What are the advantages of FastQ Screen?

    FastQ Screen has a number of advantages over these tools, including directly reporting the proportion of multi-mapping reads, thereby helping identify DNA populations rich in low-complexity sequences. Another benefit of our program is the capability to create filtered FASTQ files.

    What is the FASTQ file format?

    FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.
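A minimal reader makes the format concrete. The sketch below assumes the common four-line layout (header starting with `@`, sequence, separator starting with `+`, quality string) with no line wrapping; `parse_fastq` is a hypothetical helper, not part of any library.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            # Each base must have exactly one quality character.
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
records = list(parse_fastq(io.StringIO(example)))
```

The length check reflects the one-character-per-base rule: a quality string that is shorter or longer than its sequence signals a corrupt record.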

    How to speed up FastQC?

    How do we speed this up? FastQC can make use of multiple cores. To do this, specify the additional argument -t with the number of threads; FastQC uses them to process several files at the same time. If the current interactive session was started with only 1 core, exit it and start a new session with more cores first.

    What is the difference between fastp and FastQC?

    fastp supports duplication level evaluation for both single-end and paired-end data. Unlike FastQC, which uses a hash table to store the duplication keys, fastp stores them in a duplication array D and a counting array C to provide much faster access.

    What is the purpose of FastQC?

    FastQC is used to run quality control checks on raw sequence data coming from high-throughput sequencing pipelines.

    What is the purpose of Trimmomatic?

    Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data. The selection of trimming steps and their associated parameters are supplied on the command line.

    How to remove duplicates in FASTA?

    Remove duplicates from a FASTA file and manipulate names:
    1. Detect and remove duplicated IDs.
    2. Detect and remove duplicated sequences.
    3. Detect and remove duplicated sequences & generate a new ID by pasting the sequence IDs that have the same sequence.
    4. Manipulate the sequences names (eliminate a certain string)
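Step 3 in the list above (collapse identical sequences and build a new ID from the IDs that shared them) can be sketched as follows. Both helpers and the `|` separator are hypothetical choices for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence), joining wrapped lines."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def merge_duplicate_sequences(records, sep="|"):
    """Collapse identical sequences; the new ID pastes together all original IDs."""
    ids_by_seq = {}
    for name, seq in records:
        ids_by_seq.setdefault(seq, []).append(name)
    # dicts preserve insertion order, so output follows first appearance.
    return [(sep.join(names), seq) for seq, names in ids_by_seq.items()]


# Record "c" is wrapped over two lines but has the same sequence as "a".
fasta = ">a\nACGT\n>b\nTTGA\n>c\nAC\nGT\n"
merged = merge_duplicate_sequences(parse_fasta(fasta))
```

Because wrapped lines are joined before comparison, records that differ only in line wrapping are still treated as duplicates.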

    What tool removes duplicates from a list?

    List Deduper: an online tool for removing duplicates and sorting lists
    • Grab your list into the buffer.
    • Paste it into the "source" field below.
    • Optional: choose what you want to delimit on (default is line feed).
    • Press the "dedupe" button.
    • Grab the deduped, sorted list from the "target" field below.
    • Go about your business!

    Do sets remove duplicates?

    Sets in Python are unordered collections of unique elements. By their nature, duplicates aren't allowed. Therefore, converting a list into a set removes the duplicates.

    Does Python remove duplicates?

    If the order of the elements is not critical, we can remove duplicates using a set or the NumPy unique() function. To keep the order of elements, we can use pandas functions, OrderedDict, the reduce() function, a set combined with sort(), or iterative approaches.
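The difference between the unordered and order-preserving approaches can be shown with the standard library alone. This is a small sketch; the variable names are arbitrary.

```python
from collections import OrderedDict

items = [3, 1, 3, 2, 1]

# A set drops duplicates but makes no promise about element order.
unique_unordered = set(items)

# dict keys are unique and keep insertion order (Python 3.7+),
# so this keeps the first occurrence of each element.
unique_ordered = list(dict.fromkeys(items))

# OrderedDict gives the same result and also works on older Python versions.
unique_ordered_od = list(OrderedDict.fromkeys(items))

# Iterative approach: track what has been seen while walking the list.
seen = set()
unique_iterative = []
for item in items:
    if item not in seen:
        seen.add(item)
        unique_iterative.append(item)
```

All three order-preserving variants give `[3, 1, 2]` here; they require the elements to be hashable.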
