Installation
Instalation instructions are found in the README, here is a quick video of how to install the package on Ubuntu with cmake installed and gcc version 4.7:
News
- October 15, 2018 Release v1.0.8
glactools can now export acf to GRoSS or BetaScan for selections analyses.
- June 1, 2018 Release v1.0.7
glactools can now import BEAGLE data into glf. This glf can be transformed into acf.
- May 7, 2018 Release v1.0.6
glactools has a new filter called nosingle which removes private mutations.
- Apr 25, 2018 Release v1.0.6
glactools now supports importing from EIGENSTRAT/PLINK. This can be done with or without EPO files.
- Nov 27, 2017 Release v1.0.5
glactools was published in Bioinformatics! Also see our bioRxiv preprint here.
Motivation
Analysis of nuclear genomes often involves combining genotyping information from various individuals or populations to compute summary statistics or exporting them to various file formats (EIGENSTRAT, PLINK, etc..). This is achieved by parsing of multiple genotyping files (ex: VCF/BCF) to combined sets which are later exported to various formats or are used to compute summary statistics.
Research groups usually do this by coding custom scripts tailored to the task at hand. New group members often cannot reproduce the analyses that were performed and often code their own scripts from scratch. This leads to time being wasted and an overall lack of reproducibility.
glactools are a set or programs coded in C++ designed to extract allele counts from VCF files (or the raw base count from BAM files) and store it as an intermediate representation, compute summary statistics and export it to various file formats used by population genetics software. We consider 2 types of intermediate representations:
- GLF: for storing genotype likelihoods, suffix: glf.gz
- ACF: for storing allele counts, suffix acf.gz
both file formats are compressed binary files. For instance, a line in an ACF:
#chr coord REF,ALT IndividualA
7 35190 G,N 2,0:0
The line above that "IndividualA" had two reference alleles (G) and since no alternative alleles were found, the alternative is an N. The first column is the chromosome name, the second the coordinate, the third the bases for the reference and alternative. We do not consider tri-allelic sites. The fourth column is the allele count for the individual in the following format :
REF,ALT:CPG
Where REF is the reference allele count and the ALT is the alternative allele count. The CPG field is a flag for CpG (0 is no CpG, 1 otherwise). In the presence of an alternative allele, the following representation is used:
#chr coord REF,ALT IndividualA
7 517798 T,C 1,1:0
The line above says that IndividualA had one T and one C at that position hence a heterozygous site.
For GLF files, instead of having allele counts, there are raw genotypes likelihoods for three genotypes: reference/reference, reference/alternative, alternative/alternative in that order:
#chr coord REF,ALT IndividualA
7 517798 T,C 255,0,100:0
glactools allows users to transform genotyping data into this intermediate allele count or genotype likelihood. Data from multiple sources can be combined, transformed and exported to numerous formats
Test data
Either refer to the Quick Start or the section about downloading test data in the README. Here are some examples of downloading publicly available VCFs and converting them to ACF:
File format
The glactools file can be divided into two parts, the header that details the operations that were performed and the allele count (or genotype likelihood) per se. Here is an example:
#ACF
#PG:vcf2acf --fai human_g1k_v37.fasta.fai --epo all.epo.gz input.vcf.gz IndividualB
#GITVERSION: 262ef9596751f819a9846d59a88f059e85847b8c
#DATE: 2017-08-09
#VCF2ACF:
#chr coord REF,ALT root anc IndividualA
4 72045 A,N 1,0:0 1,0:0 2,0:0
4 72046 T,N 1,0:0 1,0:0 2,0:0
4 72047 T,N 1,0:0 1,0:0 2,0:0
4 72048 T,G 0,1:0 0,1:0 1,1:0
4 72049 G,N 1,0:0 1,0:0 2,0:0
4 72050 T,N 1,0:0 1,0:0 2,0:0
4 72051 A,N 1,0:0 1,0:0 2,0:0
4 72052 G,N 1,0:0 1,0:0 2,0:0
The header lines started with a # sign. Each operation made on the files from creation to transformation leaves a "watermark" that allows users to know precisely which operation was done and which version of glactools was used. The remaining lines contain the allele count for chromosome 4 for 8 sites. Three samples are present, the root allele, the most recent common ancestor and an individual named "IndividualA". For hominin samples, the root could be the chimp, the most recent common ancestor can be the chimp/human ancestor. If multiple individuals are combined, the allele count lines will appear as:
#chr coord REF,ALT root anc IndividualA IndividualB IndividualC
4 72045 A,N 1,0:0 1,0:0 2,0:0 2,0:0 2,0:0
4 72046 T,N 1,0:0 1,0:0 2,0:0 2,0:0 2,0:0
4 72047 T,N 1,0:0 1,0:0 2,0:0 2,0:0 2,0:0
4 72048 T,G 0,1:0 0,1:0 1,1:0 1,1:0 2,0:0
4 72049 G,N 1,0:0 1,0:0 2,0:0 2,0:0 2,0:0
4 72050 T,N 1,0:0 1,0:0 2,0:0 2,0:0 2,0:0
4 72051 A,N 1,0:0 1,0:0 2,0:0 2,0:0 2,0:0
4 72052 G,N 1,0:0 1,0:0 2,0:0 2,0:0 2,0:0
We recommend using bgzip to compress the files and index and query them using tabix from htslib.
Input data
Data can be transformed into glactools internal format from
- VCF
- BAM
- 23andme
- AXT alignments
Operations on data
Using the data in the internal format, here are examples of operations that can be performed:
- unions or intersections
- meld two or mre populations as one
- force two populations to share one or all alleles
- add/remove populations
- transform individual VCF files into ACF files using "vcf2acf"
- unite each individual files using "intersection"
- meld all individuals into a single population using "meld"
- disable sharing of alleles with the ancestor using "snosharing"
Exporting data
Data can be exported to the following formats:
- Treemix
- binary PLINK
- EIGENSTRAT
- NEXUS
- FASTA
Documentation
Please refer to the documentation manual available here.
Questions ? Feature requests ?
Contact Gabriel Renaud (@grenaud) :