SYNOPSIS

genedc --input name --output name --legend file [ options ]

DESCRIPTION

genedc converts textual genetic data files.

genedc reads a user data file with genotypes coded as bases and writes files in hapmixmap format. It was developed for use with hapmixmap, but can be extended to handle other data formats.

Genedc is also a tool to help conduct candidate gene studies with hapmixmap; it contains tools to download and prepare hapmap.org data.

WEBSITE

Please visit http://genedc.sf.net/ for more information.

OPTIONS

--help

Display a help message.

-i|--input name

Input files base name (required). Use name as the base name for input files.

-o|--output name

Output file base name. The default base name is ‘user’. With hapmixmap output format, file names are user_genotypes.txt, user_loci.txt and user_outcome.txt.
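
The naming rule above can be sketched in a few lines of Python. This is only an illustration of how the three hapmixmap file names follow from the base name; the function name hapmixmap_file_names is hypothetical and is not part of genedc.

```python
def hapmixmap_file_names(base="user"):
    """Return the three hapmixmap output file names for a base name.

    Hypothetical helper for illustration; 'user' is the documented default.
    """
    parts = ("genotypes", "loci", "outcome")
    return ["%s_%s.txt" % (base, part) for part in parts]


print(hapmixmap_file_names())
print(hapmixmap_file_names("study1"))
```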

-l|--legend file

Legend file (required). Legend file stores a list of SNP identifiers, their coding and position.

--incomplete-legend

Accept an incomplete legend. Loci that are in the genotypes file but not in the legend are dropped from the output files.
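
As a rough illustration of what dropping loci means, the Python sketch below splits the SNP ids found in a genotypes file into those present in the legend and those absent from it. The function name and the sample ids are hypothetical; this is not genedc's own code.

```python
def split_loci(snp_ids, legend_ids):
    """Split SNP ids into (kept, dropped) according to legend membership.

    Hypothetical helper: kept ids appear in the legend, dropped ids do not.
    The original column order is preserved in both lists.
    """
    legend_ids = set(legend_ids)
    kept = [s for s in snp_ids if s in legend_ids]
    dropped = [s for s in snp_ids if s not in legend_ids]
    return kept, dropped


kept, dropped = split_loci(["rs01", "rs02", "rs03"], ["rs01", "rs03"])
print("kept:", kept)
print("dropped:", dropped)
```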

-v|--verbose

Verbose mode.

-V|--version

Print program version and exit.

--debug

Display debug messages.

FILE FORMATS

Tabular

The tabular format resembles a spreadsheet.

idno    dncase  rs01    rs02    (...)
indiv0  1       G:C     T:T     (...)
indiv1  0       G:G     T:A     (...)

Columns are separated by white space; tabs and spaces are treated the same way.
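
A minimal Python sketch of reading this format follows. It assumes that the first two columns are always idno and dncase and that every remaining column is a SNP with a colon-separated pair of bases. The function name is hypothetical and missing genotypes are not handled.

```python
def parse_tabular(text):
    """Parse whitespace-separated tabular genotype data.

    Returns (snp_ids, rows). Assumes columns 1 and 2 are idno and dncase.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()      # split() treats tabs and spaces alike
    snp_ids = header[2:]
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        genotypes = {
            snp: tuple(value.split(":"))
            for snp, value in zip(snp_ids, fields[2:])
        }
        rows.append({"idno": fields[0], "dncase": fields[1],
                     "genotypes": genotypes})
    return snp_ids, rows


# Mixed tabs and spaces on purpose.
sample = "idno\tdncase  rs01\trs02\nindiv0  1  G:C  T:T\nindiv1\t0\tG:G\tT:A\n"
snp_ids, rows = parse_tabular(sample)
print(snp_ids)
print(rows[0]["genotypes"]["rs01"])
```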

Hapmixmap diploid

A dataset in this format consists of three files:

The genotypes file is similar to the tabular format, with two differences: it has no outcome (dncase) column, and genotypes are coded as numbers rather than ACGT bases. The values are taken from the legend file and incremented by one; zero indicates a missing genotype. Each genotype pair is surrounded by quotes. Example genotypes file:

idno    rs01    rs02    (...)
indiv0  "1,2"   "1,1"   (...)
indiv1  "1,1"   "1,2"   (...)
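
The coding rule can be sketched in Python as follows. The sketch assumes that the legend value of an allele is its index (0 or 1) among the two alleles listed for that SNP, and that any base not listed in the legend is treated as missing; both points are assumptions, and the function name is hypothetical.

```python
def encode_genotype(pair, alleles):
    """Encode a pair of bases as a quoted hapmixmap genotype.

    pair:    two bases, e.g. ("G", "C")
    alleles: the two alleles of the SNP in legend order (assumed).
    """
    codes = []
    for base in pair:
        if base in alleles:
            codes.append(alleles.index(base) + 1)  # legend value plus one
        else:
            codes.append(0)                        # assumed: unknown base is missing
    return '"%d,%d"' % tuple(codes)


print(encode_genotype(("G", "C"), ("G", "C")))
print(encode_genotype(("G", "G"), ("G", "C")))
print(encode_genotype(("N", "N"), ("G", "C")))
```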

The locus file is similar to the legend file, but contains distances between consecutive loci instead of locus positions. The first locus has no preceding locus, so its distance is missing, which is indicated by the # character.

"SNPid" "NumAlleles"    "DistanceinMb"
rs4732057       2       #
rs3807337       2       0.001775
rs17168032      2       0.000701
rs2347896       2       0.000871
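
The following Python sketch shows one way such a file could be derived from positions. It assumes positions are given in base pairs, that all loci are on one chromosome and already sorted, and that every locus has two alleles. The SNP ids and positions in the example are made up, and the function name is hypothetical.

```python
def locus_lines(loci):
    """Build locus file lines from (snp_id, position_bp) tuples.

    Assumes loci are sorted by position on a single chromosome.
    """
    out = ['"SNPid"\t"NumAlleles"\t"DistanceinMb"']
    prev = None
    for snp_id, pos in loci:
        if prev is None:
            dist = "#"                              # first distance is missing
        else:
            dist = "%.6f" % ((pos - prev) / 1e6)    # base pairs to Mb
        out.append("%s\t2\t%s" % (snp_id, dist))
        prev = pos
    return out


# Hypothetical ids and positions, chosen only for illustration.
example = [("snpA", 1000000), ("snpB", 1001775), ("snpC", 1002476)]
for line in locus_lines(example):
    print(line)
```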

The outcome file contains the outcome (dncase) column from the source file.

dncase
1
0
(...)

Legend file

Contains one line per SNP, listing the SNP id, its position, its two alleles (the coding) and its chromosome.

rs1000000       125415860       A       G       chr12
rs10000007      114910857       A       C       chr4
rs10000009      71229713        A       G       chr4
rs10000010      21294943        C       T       chr4
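
A minimal Python sketch of reading a legend in this layout is given below. The five-column order (id, position, first allele, second allele, chromosome) is inferred from the example above, and skipping lines with a different column count is an assumption. The function name is hypothetical.

```python
def parse_legend(text):
    """Parse legend lines into a dict keyed by SNP id.

    Assumed column order: id, position, allele 1, allele 2, chromosome.
    """
    legend = {}
    for ln in text.splitlines():
        fields = ln.split()
        if len(fields) != 5:
            continue  # assumed: ignore blank or malformed lines
        snp_id, pos, a1, a2, chrom = fields
        legend[snp_id] = {
            "position": int(pos),
            "alleles": (a1, a2),
            "chromosome": chrom,
        }
    return legend


legend_text = (
    "rs1000000       125415860       A       G       chr12\n"
    "rs10000007      114910857       A       C       chr4\n"
)
legend = parse_legend(legend_text)
print(legend["rs10000007"])
```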

KNOWN PROBLEMS

Genedc takes about 30 seconds to start.

Currently, loading the legend file into memory is the most time-consuming operation for genedc. A workaround is to shorten this file, for example by keeping only the chromosomes that actually appear in the data file.
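
The workaround can be sketched in Python as a simple filter. It assumes the chromosome is the last whitespace-separated column of each legend line, as in the legend example in this document. The function name and the file names in the comment are hypothetical.

```python
def filter_legend(lines, keep):
    """Yield only legend lines whose last column is in keep."""
    keep = set(keep)
    for ln in lines:
        fields = ln.split()
        if fields and fields[-1] in keep:
            yield ln


legend_lines = [
    "rs1000000       125415860       A       G       chr12",
    "rs10000007      114910857       A       C       chr4",
    "rs10000009      71229713        A       G       chr4",
]
short_legend = list(filter_legend(legend_lines, {"chr4"}))
print(len(short_legend))

# With real files (hypothetical names), the same filter could be used as:
# with open("legend.txt") as src, open("legend_chr4.txt", "w") as dst:
#     dst.writelines(filter_legend(src, {"chr4"}))
```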

SEE ALSO

AUTHOR

Maciej Blizinski, maciej, point, blizinski `a in a circle' gmail, point, com