genedc - genetic data converter
genedc --input name --output name --legend file [ options ]
genedc converts textual genetic data files.
Reads a user data file with genotypes coded as bases and writes files in hapmixmap format. Developed for use with hapmixmap, but can be extended to handle other data formats.
Genedc also helps with conducting candidate gene studies with hapmixmap; it contains tools to download and prepare hapmap.org data.
Please visit http://genedc.sf.net/ for more information.
Display help message
Input files base name (required). Use name as the base name for input files.
Output file base name. The default base name is ‘user’. With hapmixmap output format, file names are user_genotypes.txt, user_loci.txt and user_outcome.txt.
Legend file (required). Legend file stores a list of SNP identifiers, their coding and position.
Accept an incomplete legend. Loci that are in the genotypes file, but are not in the legend, will be dropped from the output files.
Verbose mode.
Print program version and exit.
Display debug messages.
The tabular format resembles a spreadsheet.
    idno    dncase  rs01  rs02  (...)
    indiv0  1       G:C   T:T   (...)
    indiv1  0       G:G   T:A   (...)

Columns are separated by white space; tabs and spaces are treated the same way.
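The whitespace rule can be illustrated with a short Python sketch. It is not genedc's own code. It assumes that the first column is the individual id, the second is the outcome, and every remaining column is a SNP; the function name parse_tabular is hypothetical.

```python
def parse_tabular(text):
    """Parse tabular genotype data whose columns are separated by any whitespace.

    Assumption (not confirmed by genedc): column 1 is the individual id,
    column 2 is the outcome, and all remaining columns are SNP genotypes.
    """
    # str.split() with no argument treats runs of tabs and spaces alike.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    snp_ids = header[2:]
    outcomes = {}
    genotypes = {}
    for fields in records:
        idno, outcome = fields[0], fields[1]
        outcomes[idno] = outcome
        genotypes[idno] = dict(zip(snp_ids, fields[2:]))
    return snp_ids, outcomes, genotypes


sample = "idno\tdncase rs01   rs02\nindiv0 1 G:C\tT:T\nindiv1 0 G:G T:A\n"
snp_ids, outcomes, genotypes = parse_tabular(sample)
```

The sample deliberately mixes tabs and spaces to show that both are handled the same way.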
A dataset in this format consists of three files:
name_genotypes.txt
name_loci.txt
name_outcome.txt
The genotypes file is similar to the tabular format. The differences are that it does not have the outcome (dncase) column and the genotypes are coded as numbers (as opposed to ACGT bases). The values are taken from the legend file and incremented by one. Zero indicates a missing genotype. Genotype pairs are surrounded by quotes. Genotypes file:
    idno    rs01   rs02   (...)
    indiv0  "1,2"  "1,1"  (...)
    indiv1  "1,1"  "1,2"  (...)
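The numeric coding can be sketched in Python as follows. This is an illustration under assumptions, not genedc's implementation: the legend is taken to list two alleles per SNP, an allele's code is its position in that list plus one, and any base not found in the legend is treated as missing (0). The function name encode_genotype and the allele orders shown are hypothetical.

```python
def encode_genotype(genotype, alleles):
    """Encode a base-coded genotype such as 'G:C' as a quoted numeric pair.

    alleles is the pair of alleles listed for the SNP in the legend.
    Assumption: a base that is not in the legend is coded as 0 (missing).
    """
    codes = []
    for base in genotype.split(":"):
        if base in alleles:
            # Legend index (0 or 1) incremented by one.
            codes.append(alleles.index(base) + 1)
        else:
            codes.append(0)
    return '"' + ",".join(str(code) for code in codes) + '"'


# Hypothetical legend entries chosen to match the example files.
legend_alleles = {"rs01": ("G", "C"), "rs02": ("T", "A")}
encoded = encode_genotype("G:C", legend_alleles["rs01"])
```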
The locus file is similar to the legend file, but contains distances between loci instead of loci positions. The first distance is missing, indicated by the # character.

    "SNPid"     "NumAlleles"  "DistanceinMb"
    rs4732057   2             #
    rs3807337   2             0.001775
    rs17168032  2             0.000701
    rs2347896   2             0.000871
The outcome file contains the outcome (dncase) column from the source file.
    dncase
    1
    0
    (...)
The legend file contains information about SNP ids, their position, encoding and chromosome.

    rs1000000   125415860  A  G  chr12
    rs10000007  114910857  A  C  chr4
    rs10000009  71229713   A  G  chr4
    rs10000010  21294943   C  T  chr4
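A legend in this layout could be read with a few lines of Python. The sketch assumes five whitespace-separated columns in the order shown in the example (SNP id, position, first allele, second allele, chromosome) and no header line; parse_legend is a hypothetical name, not part of genedc.

```python
def parse_legend(lines):
    """Map each SNP id to (position, (allele1, allele2), chromosome).

    Assumption: five whitespace-separated columns and no header line.
    Lines with a different number of fields are skipped.
    """
    legend = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue
        snp_id, position, allele1, allele2, chromosome = fields
        legend[snp_id] = (int(position), (allele1, allele2), chromosome)
    return legend


legend = parse_legend([
    "rs1000000 125415860 A G chr12",
    "rs10000007 114910857 A C chr4",
])
```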
Genedc takes about 30 seconds to start.
Currently, loading legend file into memory is the most time-consuming operation for genedc. A workaround is to shorten this file; for example by leaving only those chromosomes which are actually in the data file.
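The workaround can be sketched in Python. The sketch assumes that the chromosome is the last whitespace-separated field of each legend line, as in the legend example; the function name filter_legend is hypothetical.

```python
def filter_legend(lines, chromosomes):
    """Keep only the legend lines whose chromosome is in chromosomes.

    Assumption: the chromosome is the last whitespace-separated field.
    """
    keep = set(chromosomes)
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[-1] in keep:
            kept.append(line)
    return kept


legend_lines = [
    "rs1000000 125415860 A G chr12",
    "rs10000007 114910857 A C chr4",
    "rs10000009 71229713 A G chr4",
]
shortened = filter_legend(legend_lines, ["chr4"])
```

Writing the shortened list to a new file and passing that file as the legend should reduce the start-up time, since less data has to be loaded.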
hapmixmap, modeling hapmap haplotypes using tag SNP genotype data
hapmap.org, training data for hapmixmap
Maciej Blizinski, maciej, point, blizinski `a in a circle' gmail, point, com