glactools: A set of programs for the management of allele counts and genotype likelihoods

Download as .zip Download as .tar.gz View on GitHub

Installation

Instalation instructions are found in the README, here is a quick video of how to install the package on Ubuntu with cmake installed and gcc version 4.7:

News

Motivation

Analysis of nuclear genomes often involves combining genotyping information from various individuals or populations to compute summary statistics or exporting them to various file formats (EIGENSTRAT, PLINK, etc..). This is achieved by parsing of multiple genotyping files (ex: VCF/BCF) to combined sets which are later exported to various formats or are used to compute summary statistics.

Research groups usually do this by coding custom scripts tailored to the task at hand. New group members often cannot reproduce the analyses that were performed and often code their own scripts from scratch. This leads to time being wasted and an overall lack of reproducibility.

glactools are a set or programs coded in C++ designed to extract allele counts from VCF files (or the raw base count from BAM files) and store it as an intermediate representation, compute summary statistics and export it to various file formats used by population genetics software. We consider 2 types of intermediate representations:

both file formats are compressed binary files. For instance, a line in an ACF:


#chr	coord	REF,ALT	IndividualA
7	35190	G,N	2,0:0

The line above that "IndividualA" had two reference alleles (G) and since no alternative alleles were found, the alternative is an N. The first column is the chromosome name, the second the coordinate, the third the bases for the reference and alternative. We do not consider tri-allelic sites. The fourth column is the allele count for the individual in the following format :


REF,ALT:CPG

Where REF is the reference allele count and the ALT is the alternative allele count. The CPG field is a flag for CpG (0 is no CpG, 1 otherwise). In the presence of an alternative allele, the following representation is used:


#chr	coord	REF,ALT	IndividualA
7	517798	T,C	1,1:0

The line above says that IndividualA had one T and one C at that position hence a heterozygous site.

For GLF files, instead of having allele counts, there are raw genotypes likelihoods for three genotypes: reference/reference, reference/alternative, alternative/alternative in that order:


#chr	coord	REF,ALT	IndividualA
7	517798	T,C	255,0,100:0

glactools allows users to transform genotyping data into this intermediate allele count or genotype likelihood. Data from multiple sources can be combined, transformed and exported to numerous formats

Test data

Either refer to the Quick Start or the section about downloading test data in the README. Here are some examples of downloading publicly available VCFs and converting them to ACF:

File format

The glactools file can be divided into two parts, the header that details the operations that were performed and the allele count (or genotype likelihood) per se. Here is an example:


#ACF
#PG:vcf2acf --fai human_g1k_v37.fasta.fai --epo all.epo.gz input.vcf.gz IndividualB 
#GITVERSION: 262ef9596751f819a9846d59a88f059e85847b8c
#DATE: 2017-08-09
#VCF2ACF:
#chr	coord	REF,ALT	root	anc	IndividualA
4	72045	A,N	1,0:0	1,0:0	2,0:0
4	72046	T,N	1,0:0	1,0:0	2,0:0
4	72047	T,N	1,0:0	1,0:0	2,0:0
4	72048	T,G	0,1:0	0,1:0	1,1:0
4	72049	G,N	1,0:0	1,0:0	2,0:0
4	72050	T,N	1,0:0	1,0:0	2,0:0
4	72051	A,N	1,0:0	1,0:0	2,0:0
4	72052	G,N	1,0:0	1,0:0	2,0:0

The header lines started with a # sign. Each operation made on the files from creation to transformation leaves a "watermark" that allows users to know precisely which operation was done and which version of glactools was used. The remaining lines contain the allele count for chromosome 4 for 8 sites. Three samples are present, the root allele, the most recent common ancestor and an individual named "IndividualA". For hominin samples, the root could be the chimp, the most recent common ancestor can be the chimp/human ancestor. If multiple individuals are combined, the allele count lines will appear as:


#chr	coord	REF,ALT	root	anc	IndividualA	IndividualB	IndividualC
4	72045	A,N	1,0:0	1,0:0	2,0:0	2,0:0	2,0:0
4	72046	T,N	1,0:0	1,0:0	2,0:0	2,0:0	2,0:0
4	72047	T,N	1,0:0	1,0:0	2,0:0	2,0:0	2,0:0
4	72048	T,G	0,1:0	0,1:0	1,1:0	1,1:0	2,0:0
4	72049	G,N	1,0:0	1,0:0	2,0:0	2,0:0	2,0:0
4	72050	T,N	1,0:0	1,0:0	2,0:0	2,0:0	2,0:0
4	72051	A,N	1,0:0	1,0:0	2,0:0	2,0:0	2,0:0
4	72052	G,N	1,0:0	1,0:0	2,0:0	2,0:0	2,0:0

We recommend using bgzip to compress the files and index and query them using tabix from htslib.

Input data

Data can be transformed into glactools internal format from

Operations on data

Using the data in the internal format, here are examples of operations that can be performed:

For instance, for obtaining sites where all individuals in specific population are fixed derived the following workflow can be used:
  1. transform individual VCF files into ACF files using "vcf2acf"
  2. unite each individual files using "intersection"
  3. meld all individuals into a single population using "meld"
  4. disable sharing of alleles with the ancestor using "snosharing"

Exporting data

Data can be exported to the following formats:

Documentation

Please refer to the documentation manual available here.




Questions ? Feature requests ?

Contact Gabriel Renaud (@grenaud) :