Manual_SnoWhite_2.0.3

SnoWhite
A cleaning pipeline for next-generation DNA sequences
by Katrina M Dlugosch

Manual ~ version 2.0.x

CONTENTS
1. Installation
2. Overview
    2.1 Cleaning step details
    2.2 Changing the order of cleaning steps
    2.3 Paired end or Mate-pair data
3. Usage
    3.1 Input Files
    3.2 Parameter Options
    3.3 Output Settings
4. Output Files
5. Citing SnoWhite
6. License and Disclaimer

SnoWhite 2.0.x Manual

1. INSTALLATION

SnoWhite is written in Perl for Linux/Unix. Download and extract the current SnoWhite archive, navigate into this directory on the command line, and you should be ready to go. See the Usage section below for details regarding input files and commands.

2. OVERVIEW

SnoWhite is a pipeline of existing programs and custom scripts designed to flexibly and aggressively clean sequence data prior to assembly.
The pipeline was originally written for cleaning normalized cDNA sequenced on the Roche 454 platform, one of the trickier cleaning tasks around. It has since been expanded to handle a broader range of data types and tasks. The pipeline employs several steps that can be turned on and off as desired. Briefly, these are:

1) File splitting or multi-plexed Barcode parsing (-B)
2) Conversion of FASTQ to FASTA (required)
3) Quality trimming (-Q)
4) End clipping (-E)
5) TagDust filtering (-D)
6) SeqClean trimming (-L)
7) PolyA trimming (-Y)
8) SeqClean repeated, if used
9) Conversion to FASTQ (-R)

2.1 More Detail Regarding Cleaning Steps

1) File Splitting or Multi-plexed Barcode Parsing [FASTQ format only] (-B)
For data with multiplexing barcodes incorporated into the 5' or 3' ends of the reads, reads can be parsed into separate files by barcode. Alternatively, your input file can be split into a user-specified number of files, to reduce RAM requirements at subsequent steps. Splitting via barcode uses a modified version of the splitBC.pl (v. 2009) script from the FASTX-toolkit from the Hannon Lab at Cold Spring Harbor Laboratory (http://hannonlab.cshl.edu); this script removes the barcodes during splitting (and has been modified NOT to assume an extra ligation base following the barcode).

2) Conversion to FASTA format
SnoWhite can read in a FASTQ file, but converts to FASTA at this step. Output files can be converted to FASTQ at the end of the pipeline, if desired (option -R). File conversions are performed using the 'convert_project' script from the MIRA assembler v3.4.0 (http://chevreux.org/projects_mira.html). Note that convert_project is set to automatically detect the ASCII offset of the FASTQ format (see http://en.wikipedia.org/wiki/FASTQ_format), and this works best if there are at least five or so reads within a single file (not usually a problem with next-gen data!).

3) Quality Trimming (-Q)
Low quality bases can be trimmed from the 3' ends of all reads.
SnoWhite defines low quality as the point at which base quality dips below a user-specified minimum and does not return above this threshold. Users may prefer to skip this step in SnoWhite, as some assemblers and other downstream analyses handle low quality regions themselves.

4) End Clipping (-E)
A user-specified number of bases can be clipped from the end of each sequence. Alternatively, SnoWhite can search from one end of the read and find the last (most internal) exact match to a provided adapter, and then clip through that point. These types of clipping are useful when sequence data is delivered with uniform short adapter sequences (<12 bp) remaining at the beginning or end of reads. For longer contaminants, see the SeqClean step (option -L).

5) TagDust (-D)
TagDust (Lassmann et al. 2009 Bioinformatics 25:2839-2840) is designed to find sequences that are composed almost entirely of primer/adapter fragments. These primer 'multimers' or 'concatemers' are a persistent low-abundance feature of many datasets, and are extremely difficult to remove using traditional contaminant searches. Note that this can be the slowest step of the process, and that only the first 999 bp of longer sequences are evaluated (if the sequence passes TagDust, SnoWhite automatically restores the sequence to its full length).

6) SeqClean Contaminant Removal and PolyA/T Trimming (-L)
SeqClean (64 bit version, http://compbio.dfci.harvard.edu/tgi/software/) is a relatively old but still excellent tool for trimming polyA/T tails, primer contaminants, and uninformative sequences (Ns). See Chen et al 2007 BMC Genomics 8:416 for performance comparisons to some other programs. Note that primer/adapter contaminants in this step are identified by BLAST, and so sequences must be long enough to be detected in this way.
SeqClean is configured to find contaminants as small as 12bp within 30% of the ends (92% identity), and as small as 60bp internally (94% identity). SnoWhite provides options for executing SeqClean with or without contaminant searches, and for searching terminal contaminants only, internal contaminants only, or both.

7) PolyA/T Trimming (-Y)
Due to the nature of error and primer contamination in many next-gen transcriptome datasets, SeqClean is typically not sufficient for complete polyA/T trimming. SnoWhite provides additional trimming governed by many tunable parameters described below. In short, users can set tolerances for what constitutes a polyA/T, where to look for it in the sequence, and how much error to allow. The default parameters are likely to do a good job in most situations, but users are encouraged to page through their own data and get a sense of whether adjustments are needed.
**NOTE: As of v1.1.4, this step also trims away terminal 'X' characters that may be left behind when users pre-process their data with cross_match or similar masking tools.

8) SeqClean is repeated

9) Conversion to FASTQ (-R)

2.2. Changing the Order of Steps

SnoWhite is set up to accommodate a reasonable workflow, but next-gen sequencing applications are now so varied that it may be best for your data to run some steps in a different sequence. Because all steps of SnoWhite are optional, any order other than the default can be achieved by re-running the program. In this case, it is recommended that you keep your output files in FASTA format, to avoid time wasted re-converting to FASTA from FASTQ during the runs.

2.3 Paired-end or Mate-pair Data

Currently, SnoWhite does not identify and re-associate paired reads. You will need to find (or create) an appropriate script for the format of your data (there are several formats) to do this task.

3. USAGE

Navigate to the SnoWhite directory at the command line, and run:

    ~/SnoWhite_2.0.x$ perl snowhite_2.0.x.pl [OPTIONS]

Where [OPTIONS] refers to a list of any of the switches described in the sections below. To return a list of these options at the command line, type:

    ~/SnoWhite_2.0.x$ perl snowhite_2.0.x.pl -help

3.1. Input Files

● -f: Sequences: Specify path if needed. FASTQ or FASTA file formats only. The file type will be automatically detected. FASTQ files will be converted to FASTA for processing using the "convert_project" script from the MIRA v3.4.0 assembler (http://chevreux.org/projects_mira.html). This script automatically detects the quality score offset, which has varied among sequencing platforms (see http://en.wikipedia.org/wiki/FASTQ_format). This file conversion can take time for a large file, but should not be more than a couple of hours.

● -q: FASTA formatted quality file (optional, if using FASTA sequences).

● -v: FASTA formatted vector/primer/adapter (>12 bp, < 60bp) file (optional).

● -s: FASTA formatted internal/long (>60 bp) contaminants file (optional, SeqClean step only).

● -E: Short adapter (<12 bp) clipping file. See further description of (-E) under Parameter Options below.
SnoWhite can search from one (specified) end of the read and find the last (most internal) exact match to a provided adapter, and then clip back the read up through that point. This is ideal ONLY for very short adapters (<12 bases) that are expected at the ends of sequences, without errors. One adapter sequence can be specified for the 5' end and another for the 3' end. The sequences must be in FASTA format and have headers that begin ">5_" for the 5' end and ">3_" for the 3' end. Example file format:

    >5_myAdapter
    AATTG
    >3_myAdapter
    TCGAA

Only one for each end will be used. To use the same adapter sequence for both ends, list it twice with the appropriate different headers. Alternatively, a fixed number of bases can be trimmed from all reads instead (see -c under Parameter Options below).
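The short-adapter clipping described for -E can be illustrated with a minimal Python sketch. This is not SnoWhite's own Perl code: the function names `parse_adapter_file` and `clip_read` are hypothetical, the search is assumed to cover the whole read, and keeping only the first adapter listed for each end is an assumption about how "only one for each end will be used" is resolved.

```python
def parse_adapter_file(text):
    """Return {'5': seq, '3': seq} from FASTA text with '>5_' / '>3_' headers."""
    adapters = {}
    end = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The header prefix decides which read end the adapter belongs to.
            end = line[1] if line[1:3] in ("5_", "3_") else None
        elif end is not None and end not in adapters:
            # Assumption: keep only the first adapter listed for each end.
            adapters[end] = line.upper()
    return adapters


def clip_read(read, adapters):
    """Clip through the most internal exact adapter match at each end."""
    a5 = adapters.get("5")
    if a5:
        # Searching from the 5' end, the most internal match is the right-most one.
        i = read.rfind(a5)
        if i != -1:
            read = read[i + len(a5):]
    a3 = adapters.get("3")
    if a3:
        # Searching from the 3' end, the most internal match is the left-most one.
        j = read.find(a3)
        if j != -1:
            read = read[:j]
    return read


adapters = parse_adapter_file(">5_myAdapter\nAATTG\n>3_myAdapter\nTCGAA\n")
clipped = clip_read("AATTGCCGGTTAACCTCGAA", adapters)
print(adapters, clipped)
```

Because matching is exact, a single sequencing error inside the adapter leaves the read unclipped, which is why the manual restricts this mode to very short, error-free adapters.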
● -B: Barcode file. See further description of (-B) under Parameter Options below.
Reads can be parsed into separate files by barcodes that have been integrated into the 5' or 3' ends. This script removes the barcodes during splitting and takes options for the position of the barcode (-j) and number of mismatches allowed (-z). See Parameter Options below. [Note: An offset in the barcode position (missing bases) is currently not allowed, but see lines 383-4 in SnoWhite 2.0.2.pl to modify this according to parameters outlined in Programs/splitBC_Sno.pl.] Barcodes should be given in a tab-delimited list of barcode names and sequences as in:

    #a comment line will be ignored if it starts with '#'
    code1	ATGCG
    code2	CTGAG

3.2 Parameter Options

File splitting (FASTQ only):
• -B: Split file into a given number of subfiles, or according to barcodes in a barcode file
    Default = no splitting
• -j: <5/3> Look for barcodes at 5' or 3' ends
    Default = 5
• -z: Number of mismatches allowed when matching barcode
    Default = 0

Quality trimming:
• -Q: Minimum phred score under which to trim 3' ends. A value of '10' is suggested.
    Default = no trimming

End clipping:
• -E: <5/3/B/FILENAME> Clip at 5' <5>, 3' <3>, Both <B>, or according to sequences in <FILENAME>
    Default = 5
• -c: Number of bases to clip from all sequences (unless -E is a <FILENAME>)
    Default = 0

TagDust read filtering (for multimers of sequences in -v):
• -D: Execute TagDust, assuming primer/adapter (-v) file is provided
    Default = F
• -d: False discovery rate
    Default = 0.01

SeqClean trimming (for uninformative bases and matches to sequences in -v and -s):
• -L: Execute SeqClean
    Default = F
• -p: Processor number
    Default = 1

Terminal poly trimming (e.g. 3'AAAAAAAAAACGATTAG...):
• -Y: Execute poly trimming as defined by all following parameters
    Default = F
• -l: Minimum length of terminal A/T repeat
    Default = 6
• -a: <3/5/B> Poly A at 3', 5', or Both ends
    Default = 3
• -t: <3/5/B> Poly T at 3', 5', or Both ends
    Default = 5

Terminal poly trimming inside of cap (e.g. 3'CGAAAAAAAAAAAACGATTAG...):
• -b: Number of terminal bases to look beyond for start of terminal poly A/T
    Default = 0
• -r: Minimum length of A/T repeat inside of -b to consider as poly A/T (min = 2)
    Default = 10

Internal poly trimming (e.g. 3'...CCGTATAGGAAAAAAAAAAAAAAAAAAAACGATTAGGG...5'):
• -i: Minimum length of internal poly A/T sequence to consider as poly A/T
    Default = 100bp (extreme case)
• -k: Keep the longer end of sequence broken by a single internal polyA/T
    Default = F

General poly trimming settings:
• -n: Interpret Ns within A/T repeats as As or Ts
    Default = T
• -w: Allow wobble: i.e. ignore single alternative bases within A/T repeats
    Default = T

3.3. Output settings

• -o: Output folder and file prefixes
    Default = sequence input filename
• -m: Minimum sequence length for cleaned reads. Applies to -Q, -E, -L, -Y trimming steps, but NOT to -B, -D, -R
    Default = 50
• -g: Delete all temporary (garbage) files. This deletes all copies of the input file that are generated as it is processed through different cleaning steps, which can amount to substantial duplication of the data. It is recommended that cleaning settings are first verified on a test file containing a small amount of data, and then -g set to T for the full dataset.
    Default = F
• -R: Convert final output to FASTQ format. Converting back to FASTQ will convert to standard Sanger FASTQ, with its standard offset of 33. If you originally input Illumina FASTQ from older versions of their software, the encoding may look different between input and output (Illumina has since switched to providing standard Sanger FASTQ offsets).
    Default = F (i.e. FASTA format)

Example:

    perl snowhite_2.0.x.pl -f myfasta -q myfasta.qual -v myprimers -Q T -D T -L T -Y T -i 20 -o mytestrun1

4. OUTPUT FILES

SnoWhite will create a folder (given by -o), where you will find:

● A 'FinalOutput' directory with files *.white (and *.white.qual, as appropriate)
These are your final cleaned sequence files.

● *.log
Second in importance only to the *.white files, this is the log file for the entire run. Here you will find summaries of the sequences trimmed or removed by each step. Note that TagDust appears to have a bug that reports both the total number of sequences and the number trashed as 1 less than the true number evaluated and removed.

● *.seqclean_report (rounds 1 and 2)
For those familiar with SeqClean, this is the *.cln file. This is a detailed log file that gives the following fields for each input sequence:
    1. the name of the input sequence
    2. the percentage of undetermined bases in the clear range
    3. 5' coordinate after cleaning
    4. 3' coordinate after cleaning
    5. initial length of the sequence
    6. trash code (see SeqClean's README for detailed information about these)
    7. trimming comments (contaminant names, reasons for trimming/trashing)

● *.polytrim_report
This is a detailed log file that gives the same information as the *.seqclean_report (minus field 2). Trimming/Trashing codes are as follows:
    S = sequence shorter than user specified minimum
    I = internal polyA/T found
    A5 or T5 = type of repeat found on 5' end
    A3 or T3 = type of repeat found on 3' end
    X3 or X5 = 'X' character found on indicated end

● *.tagdust_trash
This is a list of sequences removed by TagDust.

● Temporary files *.clipped / .tagdusted / .clean / .nopoly
If -g is not set to T, then temporary (intermediate) cleaning files will not be deleted, and you will see files with some combination of the above extensions indicating the steps that they have gone through.
All -Q and -E clipping is done simultaneously and appears in the *.clipped file(s). Files ending in *.clean are the output of SeqClean steps. The *.tagdusted and *.nopoly files indicate the output of TagDust and PolyA/T trimming, respectively.

5. CITING SNOWHITE

Please cite:
Dlugosch KM, Lai Z, Bonin A, Hierro J, Rieseberg LH. 2013. Allele identification for transcriptome-based population genomics in the invasive plant Centaurea solstitialis. G3 3: 359-367.

For use of TagDust, please cite:
Lassmann et al. 2009 Bioinformatics 25:2839-2840

6. LICENSE AND DISCLAIMER

SnoWhite: A cleaning pipeline for next-generation DNA sequences
Copyright (C) 2011 Katrina M Dlugosch

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details, available as License.txt in your NU-IN download, and at .

At this time, this program is distributed without any guarantee of technical support. For contact information and regular updates, see the SnoWhite website .
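
As a supplementary illustration of the quality trimming rule from step 3 of section 2.1 (trim from the point where base quality dips below the minimum and does not return above it), here is a minimal Python sketch. It is not SnoWhite's own Perl code: the function name `trim_3prime` is hypothetical, and treating a score equal to the minimum as acceptable is an assumption.

```python
def trim_3prime(seq, quals, min_q):
    """Trim the 3' tail whose qualities stay below min_q.

    seq   -- read sequence as a string
    quals -- list of phred scores, one per base
    min_q -- minimum acceptable phred score (the role played by -Q)
    """
    keep = len(seq)
    # Walk in from the 3' end until a base meets the threshold.
    # Internal dips are kept, because quality recovers after them.
    while keep > 0 and quals[keep - 1] < min_q:
        keep -= 1
    return seq[:keep], quals[:keep]


trimmed_seq, trimmed_quals = trim_3prime(
    "ACGTACGT", [30, 30, 5, 30, 30, 9, 8, 2], 10
)
print(trimmed_seq, trimmed_quals)
```

In this sketch the low score at the third base survives because later bases rise above the threshold again, while the final three bases are removed. A read trimmed this way would then still need to be checked against the minimum length set with -m.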
