Dataset Title:  [Transcriptome statistics] - Transcriptome statistics from samples obtained on
LMG1411 collected on the Gould (LMG1411) in the Western Antarctica Peninsula in
2014. (Polar Transcriptomes project) (Iron and Light Limitation in Ecologically
Important Polar Diatoms: Comparative Transcriptomics and Development of
Molecular Indicators)
Institution:  BCO-DMO   (Dataset ID: bcodmo_dataset_665311)
Attributes

    Acquisition Description:
    String acquisition_description 
"Nine species of diatoms were isolated from the Western Antarctic Peninsula
along the PalmerLTER sampling grid in 2013 and 2014. Isolations were performed
using an Olympus CKX41 inverted microscope by single cell isolation with a
micropipette (Anderson 2005). Diatom species were identified by morphological
characterization and 18S rRNA gene (rDNA) sequencing. DNA was extracted with
the DNeasy Plant Mini Kit according to the manufacturer's protocols
(Qiagen). Amplification of the nuclear 18S rDNA region was achieved with
standard PCR protocols using eukaryotic-specific, universal 18S forward and
reverse primers. Primer sequences were obtained from Medlin et al. (1982). The
length of the region amplified is approximately 1800 base pairs (bp).
Pseudo-nitzschia species are often difficult to identify by their 18S rDNA
sequence; therefore, additional support for the taxonomic identification of
P. subcurvata was provided by sequencing the 18S-ITS1-5.8S region.
Amplification of this region was performed with
the 18SF-euk and 5.8SR_euk primers of Hubbard et al. (2008). PCR products were
purified using either QIAquick PCR Purification Kit (Qiagen) or ExoSAP-IT
(Affymetrix) and sequenced by Sanger DNA sequencing (Genewiz). Sequences were
edited using Geneious Pro software
(http://www.geneious.com; Kearse et al.,
2012) and BLASTn sequence homology searches were performed against the NCBI
nucleotide non-redundant (nr) database to determine species with a cutoff
identity of 98%.
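As a minimal sketch of the 98% identity cutoff described above, the Python below assumes the BLASTn results were saved in the standard 12-column tabular format (-outfmt 6). The function name and the sample rows are hypothetical and are not part of the original workflow.

```python
# Sketch only: keep, for each query, the highest-bitscore hit whose percent
# identity meets the cutoff. Assumes BLAST tabular output (-outfmt 6) with the
# default columns: qseqid sseqid pident length mismatch gapopen qstart qend
# sstart send evalue bitscore.

def best_hits_above_identity(lines, min_identity=98.0):
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment lines
        fields = line.split('\t')
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        bitscore = float(fields[11])
        if identity < min_identity:
            continue  # below the identity cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, identity, bitscore)
    return {q: (s, i) for q, (s, i, _) in best.items()}


# Hypothetical rows, for illustration only.
example_rows = [
    'isolate1\tsubjA\t99.20\t1750\t14\t0\t1\t1750\t1\t1750\t0.0\t3100',
    'isolate1\tsubjB\t98.10\t1750\t33\t0\t1\t1750\t1\t1750\t0.0\t2900',
    'isolate2\tsubjC\t96.50\t1700\t60\t0\t1\t1700\t1\t1700\t0.0\t2700',
]
hits = best_hits_above_identity(example_rows)
```

Queries with no hit at or above the cutoff (isolate2 in the hypothetical rows) are simply absent from the result.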
Diatom phylogenetic analysis was performed with Geneious Pro and included 71
additional diatom 18S rDNA sequences from publicly available genomes and
transcriptomes, including those in the MMETSP database. Diatom sequences were
trimmed to the same length and aligned with MUSCLE (Edgar 2004). A
phylogenetic tree was created in MEGA with the maximum-likelihood method of
tree reconstruction, the Jukes-Cantor genetic distance model (Jukes and Cantor
1969), and 100 bootstrap replicates.
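For readers unfamiliar with the Jukes-Cantor model, the sketch below shows the corresponding pairwise distance correction, d = -(3/4) ln(1 - (4/3) p), where p is the proportion of differing sites between two aligned sequences. It illustrates the substitution model only; it is not the maximum-likelihood tree inference performed in MEGA, and the function names are hypothetical.

```python
import math

# Sketch only: Jukes-Cantor corrected distance between two aligned sequences.

def p_distance(seq_a, seq_b):
    # Proportion of differing sites, ignoring gaps and ambiguous bases.
    if len(seq_a) != len(seq_b):
        raise ValueError('sequences must be aligned to the same length')
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in 'ACGT' and b in 'ACGT']
    if not pairs:
        raise ValueError('no comparable sites')
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    # The correction is undefined once p reaches 0.75 (saturation).
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance('ACGTACGT', 'ACGTACGA')  # one difference in eight sites
d = jukes_cantor_distance(p)
```

The corrected distance is always at least as large as p because it accounts for multiple substitutions at the same site.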
Illumina TruSeq adapters and poly-A tails were trimmed from raw reads using
the Fastx_toolkit clipper function. Fastq_quality_filter was used to remove
poor quality sequences, such that remaining sequences had a minimum quality
score of 20 with a minimum of 80% of bases within a read meeting
this quality score requirement. Any remaining raw sequences less than 50 base
pairs in length were also removed. Merged files were assembled de
novo using Trinity (Grabherr et al. 2011). The resulting assembly was
filtered to remove contigs less than 200 bp in length. Trinity-assembled
contigs which exhibited sequence overlap were grouped into isogroups which
were then used for sequence homology searches (BLASTx E-value ≤ 10^-4)
against the Kyoto Encyclopedia of Genes and Genomes (KEGG) databases (Kanehisa
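The read-filtering criteria stated earlier in this paragraph (quality score of at least 20 in at least 80% of bases, and a minimum read length of 50 bp) can be sketched in Python as follows. This is an illustration of the criteria, not the Fastx_toolkit implementation; it assumes Phred+33 quality encoding, combines the quality and length checks in one function for brevity, and uses hypothetical function names.

```python
# Sketch only: apply the stated read-level criteria to one read's qualities.

def phred33_to_scores(quality_string):
    # Convert a FASTQ quality string to integer scores (Phred+33 assumed).
    return [ord(c) - 33 for c in quality_string]


def passes_quality_filter(scores, min_quality=20, min_percent=80, min_length=50):
    if len(scores) < min_length:
        return False  # read shorter than the minimum length
    good = sum(q >= min_quality for q in scores)
    # Integer comparison avoids floating-point rounding at the 80% boundary.
    return good * 100 >= min_percent * len(scores)


keep = passes_quality_filter([30] * 40 + [10] * 10)    # exactly 80% of bases pass
reject = passes_quality_filter([30] * 39 + [10] * 11)  # only 78% of bases pass
```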
BUSCO (Benchmarking Universal Single-Copy Orthologs) was used to assess the
completeness of genomes and transcriptomes based on sets of single-copy
orthologous groups derived from OrthoDB that are highly conserved
within multiple lineages (Felipe et al. 2015). Complete, duplicated and
fragmented orthologs were determined by meeting an 'expected score'
and having aligned sequences within two standard deviations of the BUSCO
gene's length. A second metric of completeness was obtained by
evaluating conserved pathways, such as the ribosome and spliceosome, using the
single-directional best-hit method in the KEGG Automatic Annotation
Server (KAAS) (Moriya et al. 2007). Finally, contiguity was
calculated at the 0.75 threshold according to Martin and Wang (2011) with
custom scripts.
For each transcriptome, unassembled sequence reads were aligned to the final
Trinity assembly using Bowtie 2 (Langmead 2012). Mapped reads were normalized
by the Reads per Kilobase per Million reads method (RPKM) (Mortazavi et al.
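As a worked illustration of RPKM normalization, the sketch below uses the standard definition: reads mapped to a contig, divided by the contig length in kilobases and by the total mapped reads in millions. The function name and the numbers in the usage line are hypothetical and are not values from this dataset.

```python
# Sketch only: RPKM = reads * 1e9 / (length_bp * total_mapped_reads).

def rpkm(read_count, length_bp, total_mapped_reads):
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError('length and total mapped reads must be positive')
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2000 bp contig, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
```

Dividing by length makes long and short contigs comparable, and dividing by total mapped reads makes libraries of different sequencing depth comparable.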
Gene biogeographical distributions - 20 genes of interest were selected
in the study to investigate the molecular basis of iron and light limitation
in polar diatoms. Reference sequences for each of these genes were obtained
from the F. cylindrus and P. tricornutum JGI genome portals and the
T. pseudonana and T. oceanica NCBI and
GenBank repositories. Reference sequences were identified in the
transcriptomes by translated nucleotide homology searches (tBLASTn) with an
e-value cutoff of <10^-5. A reciprocal tBLASTn homology search was performed
for each transcriptome against the KEGG GENES database, using the
single-directional best-hit method in the KAAS online tool to ensure
consistent gene annotations (Moriya et al. 2007).
Subsequently, reference sequences were identified in the MMETSP protein
database by BLASTp (e-value <10^-5) homology searches among the diatom
transcriptomes. The transcriptomes and their associated latitude and longitude
were obtained from iMicrobe Data Commons (Project Code CAM_P_0001000) and the
National Center for Marine Algae and Microbiota (NCMA). Custom Matlab scripts
allowed global biogeographical distribution of key genes of interest to be
