With researchers in the omics space now generating terabytes of data in a single experiment, it’s important to understand how meaningful results can be derived from what looks like a string of random letters. Bioinformatics is a term that we have all heard, but it’s a field that many are scared to dip their toes into.
In this resource, we take a look at some common bioinformatics file types and their purposes, outlining the journey from the lab to the results we read in a paper. Whether you’re a junior bioinformatician looking to understand the science of transforming your data, a wet lab scientist wondering what happens to your work once you’ve generated it, or simply someone looking to refresh your understanding, we hope this article helps to clear up some of the complexities of the field.
Exploring the genome
Genomic data, typically obtained through various sequencing practices, is by far the most widely used form of biological data. It consists primarily of long strings of nucleotide sequence, the raw output of genomic sequencing experiments, although closely related files can also hold amino acid sequences. In the journey from bench to results, assessing genomic data computationally is often the first stop. But what does genomic data actually look like, and how do we obtain real results from millions of letters on the page?
The genomic alphabet
The first step is often the generation of a FASTA (.fasta, .fa) file. These are text-based files that store nucleotide or amino acid sequences in their most basic form: long strings of A’s, T’s, G’s and C’s for DNA, or single-letter amino acid codes for proteins. The format was created in 1985, a long time before large-scale sequencing was commonplace, and is now near universal in the bioinformatics world. Each record begins with a description line, marked by a leading ‘>’, which precedes the sequence itself and allows for better annotation. The sequence follows on one or more lines, by convention wrapped at no more than around 80 characters per line, and the same pattern is repeated for every record in the file. Due to their simple text-based format, FASTA files can be analysed using a wide variety of tools, or simply opened in a text editor. The FASTA format is also the basis of queries in the famous BLAST server, hosted by the National Center for Biotechnology Information (NCBI) at the US National Institutes of Health.
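To make the record layout concrete, here is a minimal Python sketch of reading FASTA records. The function name `parse_fasta` and the two example sequences are invented for illustration; real projects would more likely use an established library.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence} from the lines of a FASTA file."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            # A sequence may be wrapped over several lines; join them up.
            records[header] += line
    return records


# Hypothetical two-record file, made up for this example.
example = """>seq1 example gene
ATGCGTAC
GGTTAA
>seq2 another example
TTGACC
"""

records = parse_fasta(example.splitlines())
for description, sequence in records.items():
    print(description, len(sequence))
```

The wrapped lines of `seq1` are joined into a single string, which is why the line length used in the file has no effect on the sequence itself.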
FASTQ (.fastq) files are similar to FASTA files. Created in 2000, FASTQ files also store sequence data in text format, in practice the nucleotide reads produced by a sequencer. However, they contain more information than FASTA files: each base in a read is paired with a quality score, encoded as a single ASCII character, that reflects how confident the sequencer was in that base call.
Figure 1: Example of a FASTA file. Generated by ChatGPT.
Figure 2: Example of a FASTQ file, where H denotes the ASCII quality code. Generated by ChatGPT.
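As a rough sketch of how the quality line is decoded, the Python below reads four-line FASTQ records and converts each quality character to a number. It assumes the common Phred+33 encoding (score = ASCII code minus 33); older data may use a different offset. The function name `parse_fastq` and the read shown are invented for illustration.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (l.strip() for l in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumption: Phred+33 encoding, so 'H' (ASCII 72) decodes to 39.
        scores = [ord(char) - 33 for char in qual]
        yield header[1:], seq, scores


# Hypothetical single read, made up for this example.
example_lines = [
    "@read1",
    "GATTACA",
    "+",
    "HHHHHHH",
]

fastq_records = list(parse_fastq(example_lines))
for read_id, seq, scores in fastq_records:
    print(read_id, seq, scores)
```

Under that assumption, a quality string made up entirely of the letter H corresponds to a score of 39 for every base.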
So, FASTA and FASTQ files contain the raw data. But what does that actually tell us? Of course, we have to do more analysis, and that involves transforming our data into something a little more informative.
This is where SAM/BAM (.sam/.bam) files come in. Standing for ‘Sequence Alignment/Map’ and ‘Binary Alignment/Map’, respectively, SAM and BAM files contain information relating to the alignment of sequencing reads to a reference genome. By comparing to a reference, you can see the differences between your sample and the baseline genome. Each alignment record includes the reference sequence the read mapped to, its position, its mapping quality and the position of any mate read. SAM files are tab-delimited text, whereas BAM files contain the same information in a compressed binary format. The files can be edited and analysed via the SAMtools suite, a software package that can clean, sort and index alignments and that, together with its companion BCFtools, supports further functions such as variant calling.
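The sketch below shows how one alignment line of a SAM file splits into its eleven mandatory tab-separated fields. The record class, the function name and the read itself are illustrative assumptions, not part of any particular tool.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flags describing the alignment
    rname: str  # reference sequence the read aligned to
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # how the read lines up against the reference
    rnext: str  # reference of the mate read ('=' means the same one)
    pnext: int  # position of the mate read
    tlen: int   # observed template length
    seq: str    # read sequence
    qual: str   # ASCII-encoded base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the eleven mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


# Hypothetical alignment, made up for this example.
sam_line = "read1\t99\tchr1\t100\t60\t7M\t=\t250\t157\tGATTACA\tHHHHHHH"
record = parse_sam_line(sam_line)

is_unmapped = bool(record.flag & 0x4)  # bit 0x4 marks an unmapped read
print(record.rname, record.pos, record.mapq, is_unmapped)
```

Header lines in a SAM file start with ‘@’ and would need to be skipped before calling a parser like this one.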
A functional analysis
Speaking of variant calling, let’s take a look at VCF (.vcf) files. Variant Call Format files are the result of the above variant calling process, although they can be generated through a number of different avenues. They are used to store genetic variations such as SNPs and indels compared to a reference genome. Because they record only the positions that differ from the reference, the files are usually much smaller than the alignments they were derived from, and they are compatible with a number of computational tools.
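As an illustration, the following Python sketch reads the first five columns of each VCF data line (chromosome, position, ID, reference allele and alternate allele) and sorts variants into SNPs and indels. The classification rule is deliberately coarse and is an assumption of this example: it ignores symbolic alleles and other special cases. The names and variants are invented.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    kind: str  # "SNP" or "indel" (coarse, illustrative classification)


def parse_vcf(lines: List[str]) -> List[Variant]:
    """Collect variants from VCF lines, skipping header lines."""
    variants = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # '##' meta lines and the '#CHROM' column header
        chrom, pos, _id, ref, alt_field = line.split("\t")[:5]
        for alt in alt_field.split(","):  # several ALT alleles may be listed
            kind = "SNP" if len(ref) == 1 and len(alt) == 1 else "indel"
            variants.append(Variant(chrom, int(pos), ref, alt, kind))
    return variants


# Hypothetical file contents, made up for this example.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t101\t.\tA\tG\t50\tPASS\t.",
    "chr1\t205\t.\tAT\tA\t42\tPASS\t.",
]

variants = parse_vcf(vcf_lines)
for v in variants:
    print(v.chrom, v.pos, v.ref, v.alt, v.kind)
```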
In contrast, GFF (.gff) files, short for General Feature Format, can be much larger. They contain more detailed information about your sequence and the features within it. A GFF file generally contains the name of the sequence, the type of feature described (e.g., a whole gene, an exon or an intron), the coordinates of the feature within the overall sequence and other notes such as the parent gene that an exon belongs to. In the example below, the gene is defined as gene1, and the three exons reside within this gene.
Figure 3: Example of a GFF file. Generated by ChatGPT.
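A gene with three exons, as described above, could be read with a sketch like the one below. It assumes the nine tab-separated columns and `key=value` attributes of GFF3; the coordinates, the source name `example` and the function names are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def parse_gff_line(line: str) -> Dict[str, object]:
    """Split one GFF3 feature line into its nine columns."""
    (seqid, source, ftype, start, end,
     score, strand, phase, attrs) = line.rstrip("\n").split("\t")
    # The ninth column is a ';'-separated list of key=value pairs.
    attributes = dict(p.split("=", 1) for p in attrs.split(";") if "=" in p)
    return {"seqid": seqid, "type": ftype, "start": int(start),
            "end": int(end), "strand": strand, "attributes": attributes}


# Hypothetical annotation, made up for this example.
gff_lines = [
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene1",
    "chr1\texample\texon\t1000\t1500\t.\t+\t.\tID=exon1;Parent=gene1",
    "chr1\texample\texon\t2000\t2600\t.\t+\t.\tID=exon2;Parent=gene1",
    "chr1\texample\texon\t4200\t5000\t.\t+\t.\tID=exon3;Parent=gene1",
]

features = [parse_gff_line(l) for l in gff_lines if not l.startswith("#")]

# Group exon IDs under the gene named in their Parent attribute.
exons_by_gene: Dict[str, List[str]] = defaultdict(list)
for feature in features:
    if feature["type"] == "exon":
        attrs = feature["attributes"]
        exons_by_gene[attrs["Parent"]].append(attrs["ID"])

print(dict(exons_by_gene))
```

The `Parent` attribute is what ties each exon back to its gene, so the hierarchy lives in the last column rather than in the order of the lines.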
Figure 4: Diagram explaining the workflow from raw sequence data to functional analysis.
The omics world
But what about different ‘omics’? Luckily, most transcriptomic data can be analysed similarly to genomic data using the same tools and software, by simply replacing DNA sequences with RNA. However, given the vast amount of transcriptomic data that’s now being generated, many experiments require something a little more bespoke. An example of this is the Loom file, a file type built on the Hierarchical Data Format (HDF5) that is designed to store large amounts of omics data and metadata. A notable example of the use of the Loom file is in the Human Cell Atlas initiative.
Let’s go even further on our data journey. In the multi-omics world, we have a plethora of different data sources, including proteomics and metabolomics experiments. What does a bioinformatician do with these results?
Mascot Generic Format (MGF, .mgf) files are used in both proteomics and metabolomics for storing mass spectrometry fragmentation data for peptide identification. The files contain information about mass, charge and abundance and allow for efficient computational analysis of data that was once assessed entirely by humans, which is no small feat! Similarly, mzXML files exist to hold mass spectrometry data in an XML-based format.
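To show how mass, charge and abundance are laid out in an MGF file, here is a minimal Python sketch. It assumes the usual layout of `BEGIN IONS` / `END IONS` blocks containing `KEY=value` parameters followed by pairs of m/z and intensity values; the spectrum and the function name are invented for illustration.

```python
from typing import Dict, List


def parse_mgf(text: str) -> List[Dict[str, object]]:
    """Return one dict of parameters and peaks per spectrum in an MGF file."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line:
                key, value = line.split("=", 1)  # e.g. CHARGE=2+
                current["params"][key] = value
            else:
                mz, intensity = line.split()[:2]  # one fragment peak
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


# Hypothetical spectrum, made up for this example.
mgf_text = """BEGIN IONS
TITLE=example_spectrum_1
PEPMASS=445.12
CHARGE=2+
100.5 1200.0
200.25 340.5
END IONS
"""

spectra = parse_mgf(mgf_text)
for spectrum in spectra:
    print(spectrum["params"]["TITLE"], len(spectrum["peaks"]), "peaks")
```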
Additionally, PDB (.pdb) files contain the atomic coordinates of protein structures deposited in the Protein Data Bank at the Research Collaboratory for Structural Bioinformatics. They can be used alongside software like PyMOL to visualise protein structures and assess the impact of mutations, their relationship with other proteins and much more.
Figure 5: Example of a PDB file showing the coordinates of each atom in a protein. Generated by ChatGPT.
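Unlike the tab-separated formats above, PDB files place each value in fixed character columns. The Python sketch below builds two ATOM lines with those column widths and reads the coordinates back by slicing. The helper names are invented, and the coordinates are illustrative values rather than a reference to a specific deposited structure.

```python
import math
from typing import Dict


def format_atom(serial, name, res_name, chain, res_seq, x, y, z, element):
    """Build an ATOM line with each field padded to its fixed column width."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}"
        f"{'':4}{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}"
    )


def parse_atom(line: str) -> Dict[str, object]:
    """Read selected fields of an ATOM line by column position."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "element": line[76:78].strip(),   # columns 77-78
    }


# Illustrative backbone atoms of a methionine residue (values made up).
n_line = format_atom(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614, "N")
ca_line = format_atom(2, " CA", "MET", "A", 1, 26.266, 25.413, 2.842, "C")

n_atom, ca_atom = parse_atom(n_line), parse_atom(ca_line)

# Distance between the two atoms, in the same units as the coordinates.
distance = math.dist(
    (n_atom["x"], n_atom["y"], n_atom["z"]),
    (ca_atom["x"], ca_atom["y"], ca_atom["z"]),
)
print(n_line)
print(ca_line)
print(round(distance, 2))
```

Because the columns are fixed, splitting on whitespace is not reliable for PDB files: neighbouring fields can run together when values are wide.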
That brings us to epigenetics. Whilst the following formats are also widely used for genomic and transcriptomic analysis, they have a special role in the analysis of epigenetic data.
BED (.bed): Browser Extensible Data format. These files are used to represent genomic intervals and are also stored in a simple text format. Each line stores one region of the genome as coordinates: a chromosome name, a start position and an end position. For example, you could simply describe the extent of each chromosome in your assembly. But generally, these are used to provide much more intricate data, including DNA methylation regions and other areas containing epigenetic markers.
Wiggle (.wig): These files store genome-wide signal data, such as DNA methylation, GC percentage or histone modification levels. Data is stored in ‘bins’ in a wiggle file, and they can get quite large! For easier storage, the binary equivalent, BigWig (.bw), can be used. The data from a wiggle file can be plotted using specialised software, allowing for visualisation of biological features.
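The two formats count positions differently, which is a common source of off-by-one errors: BED intervals are zero-based with an exclusive end, while wiggle positions are one-based. The sketch below reads a BED line and converts the bins of a `fixedStep` wiggle block into BED-style intervals. The function names, region name and signal values are invented for illustration.

```python
from typing import List, Tuple


def parse_bed_line(line: str) -> Tuple[str, int, int]:
    """Return (chrom, start, end) from the first three BED columns."""
    chrom, start, end = line.split()[:3]
    return chrom, int(start), int(end)


def fixed_step_to_intervals(lines: List[str]) -> List[Tuple[str, int, int, float]]:
    """Convert fixedStep wiggle bins to (chrom, start, end, value) intervals."""
    intervals = []
    chrom, pos, step, span = None, 0, 1, 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("track"):
            continue
        if line.startswith("fixedStep"):
            params = dict(p.split("=", 1) for p in line.split()[1:])
            chrom = params["chrom"]
            pos = int(params["start"]) - 1  # one-based to zero-based
            step = int(params["step"])
            span = int(params.get("span", 1))
        elif chrom is not None:
            intervals.append((chrom, pos, pos + span, float(line)))
            pos += step  # move on to the next bin
    return intervals


# Hypothetical region and signal, made up for this example.
chrom, start, end = parse_bed_line("chr1\t1000\t1500\tCpG_island_1")
region_length = end - start  # no +1 needed: the end is exclusive

wig_lines = [
    "fixedStep chrom=chr1 start=101 step=10 span=10",
    "0.25",
    "0.80",
    "0.55",
]
intervals = fixed_step_to_intervals(wig_lines)
print(region_length)
print(intervals)
```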
Back to basics
But forget these fancy terms. As a wet lab scientist, you might be partaking in bioinformatics work without even realising it! A simple Excel, text, TSV or CSV file containing transcriptomic information or even phenotypic information is a valuable resource in the bioinformatics world. And these ‘simpler’ files are also widely used for assessing proteomic and metabolomic data. Additionally, assessing image files generated from microscopes or band-sizing software can require a certain amount of computational know-how.
Ultimately, the files you use and what you choose to do with them depend on the technology used in the lab, the software available to you and your overall goals. It would be impossible to cover everything in this feature; these days, even some R packages generate their own unique file types! But we hope that you’ve enjoyed this whistle-stop tour of the bioinformatics world, and that it seems a little less daunting to those who may want to pursue a new skill.
Want a more exhaustive list of file types? Click here for a more comprehensive list of biological file types and see here for a run-down of how best to use the most common files in the bioinformatics world.