Repository: freeseek/gtc2vcf
Branch: master
Commit: cc4898976c11
Files: 11
Total size: 607.0 KB

Directory structure:
gitextract_37oi3chf/

├── BAFregress.c
├── HapMap.md
├── Illumina.md
├── LICENSE
├── README.md
├── affy2vcf.c
├── gtc2vcf.c
├── gtc2vcf.h
├── gtc2vcf_plot.R
├── idat2gtc.c
└── nearest_neighbor.c

================================================
FILE CONTENTS
================================================

================================================
FILE: BAFregress.c
================================================
/* The MIT License

   Copyright (C) 2024-2025 Giulio Genovese

   Author: Giulio Genovese <giulio.genovese@gmail.com>

   Permission is hereby granted, free of charge, to any person obtaining a copy
   of this software and associated documentation files (the "Software"), to deal
   in the Software without restriction, including without limitation the rights
   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
   copies of the Software, and to permit persons to whom the Software is
   furnished to do so, subject to the following conditions:

   The above copyright notice and this permission notice shall be included in
   all copies or substantial portions of the Software.

   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
   THE SOFTWARE.

 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <math.h>
#include <unistd.h>
#include <getopt.h>
#include <errno.h>
#include <htslib/vcf.h>
#include <htslib/synced_bcf_reader.h>
#include <htslib/vcfutils.h>
#include <htslib/ksort.h>
#include "bcftools.h"

#define BAFREGRESS_VERSION "2025-08-19"

#define GT_NC 0
#define GT_AA 1
#define GT_AB 2
#define GT_BB 3

KSORT_INIT_GENERIC(float)

/******************************************
 * PLUGIN                                 *
 ******************************************/

inline static double sqr(double x) { return x * x; }

const char *about(void) { return "Detects and estimates sample contamination using BAF intensity data.\n"; }

static const char *usage_text(void) {
    return "\n"
           "About: Detects and estimates sample contamination. (version " BAFREGRESS_VERSION
           " http://github.com/freeseek/gtc2vcf)\n"
           "[ Jun, G. et al. Detecting and Estimating Contamination of Human DNA Samples in Sequencing\n"
           "and Array-Based Genotype Data. AJHG 91, 839-848 (2012) http://doi.org/10.1016/j.ajhg.2012.09.004 ]\n"
           "\n"
           "Usage: bcftools +BAFregress [options] <in.vcf.gz>\n"
           "\n"
           "Plugin options:\n"
           "        --threshold <float>         minimum allele frequency for BAF regression [0.1]\n"
           "    -a, --af <file>                 file with allele frequency information\n"
           "        --tag <string>              allele frequency INFO tag [AC/AN]\n"
           "        --adjust-BAF                minimum number of genotypes for a cluster to median adjust BAF (-1 for "
           "no adjustment) [5]\n"
           "        --truncate-BAF              truncates BAF values between 0 and 1 and turns off adjustment to "
           "recover original behavior\n"
           "        --use-MAF                   uses minor allele frequency rather than A/B allele frequency to "
           "recover original behavior\n"
           "    -e, --estimates <file>          write BAF regression estimates to a file [standard output]\n"
           "    -o, --output <file>             write VCF output to a file\n"
           "    -O, --output-type u|b|v|z[0-9]  u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level "
           "[v]\n"
           "    -r, --regions <region>          restrict to comma-separated list of regions\n"
           "    -R, --regions-file <file>       restrict to regions listed in a file\n"
           "        --regions-overlap 0|1|2     Include if POS in the region (0), record overlaps (1), variant "
           "overlaps (2) [1]\n"
           "    -t, --targets [^]<region>       similar to -r but streams rather than index-jumps. Exclude regions "
           "with \"^\" prefix\n"
           "    -T, --targets-file [^]<file>    similar to -R but streams rather than index-jumps. Exclude regions "
           "with \"^\" prefix\n"
           "        --targets-overlap 0|1|2     Include if POS in the region (0), record overlaps (1), variant "
           "overlaps (2) [0]\n"
           "        --threads <int>             number of extra output compression threads [0]\n"
           "    -s, --samples [^]<list>         comma separated list of samples to include (or exclude with \"^\" "
           "prefix)\n"
           "    -S, --samples-file [^]<file>    file of samples to include (or exclude with \"^\" prefix)\n"
           "        --force-samples             only warn about unknown subset samples\n"
           "    -W, --write-index[=FMT]         Automatically index the output files [off]\n"
           "\n"
           "Example:\n"
           "    bcftools +BAFregress file.bcf\n"
           "    bcftools +BAFregress --tag AF file.bcf\n"
           "    bcftools +BAFregress --af 1kGP_high_coverage_Illumina.sites.bcf file.bcf\n"
           "    bcftools +BAFregress --af 1kGP_high_coverage_Illumina.sites.bcf --truncate-BAF --use-MAF file.bcf\n"
           "\n";
}

int run(int argc, char **argv) {
    float af_threshold = 0.1;
    char *af_fname = NULL;
    char *af_tag = NULL;
    int adj_baf = 5;
    int truncate_baf = 0;
    int use_maf = 0;
    char *estimate_fname = "-";
    char *output_fname = NULL;
    int output_type = FT_VCF;
    int clevel = -1;
    int regions_overlap = 1;
    int targets_overlap = 0;
    int n_threads = 0;
    char *targets_list = NULL;
    int targets_is_file = 0;
    char *regions_list = NULL;
    int regions_is_file = 0;
    char *sample_names = NULL;
    int sample_is_file = 0;
    int force_samples = 0;
    int write_index = 0;
    char *index_fname;
    htsFile *out_fh = NULL;

    static struct option loptions[] = {{"threshold", required_argument, NULL, 1},
                                       {"af", required_argument, NULL, 'a'},
                                       {"tag", required_argument, NULL, 2},
                                       {"adjust-BAF", required_argument, NULL, 3},
                                       {"truncate-BAF", no_argument, NULL, 4},
                                       {"use-MAF", no_argument, NULL, 5},
                                       {"estimates", required_argument, NULL, 'e'},
                                       {"output", required_argument, NULL, 'o'},
                                       {"output-type", required_argument, NULL, 'O'},
                                       {"threads", required_argument, NULL, 6},
                                       {"regions", required_argument, NULL, 'r'},
                                       {"regions-file", required_argument, NULL, 'R'},
                                       {"regions-overlap", required_argument, NULL, 7},
                                       {"targets", required_argument, NULL, 't'},
                                       {"targets-file", required_argument, NULL, 'T'},
                                       {"targets-overlap", required_argument, NULL, 8},
                                       {"samples", required_argument, NULL, 's'},
                                       {"samples-file", required_argument, NULL, 'S'},
                                       {"force-samples", no_argument, NULL, 9},
                                       {"write-index", optional_argument, NULL, 'W'},
                                       {0, 0, 0, 0}};
    int c;
    char *tmp;
    while ((c = getopt_long(argc, argv, "h?a:e:o:O:r:R:t:T:s:S:", loptions, NULL)) >= 0) {
        switch (c) {
        case 1:
            af_threshold = strtof(optarg, &tmp);
            if (*tmp) error("Could not parse: --threshold %s\n", optarg);
            if (af_threshold <= 0.0 || af_threshold >= 1.0) error("--threshold must be strictly between 0 and 1\n");
            break;
        case 'a':
            af_fname = optarg;
            break;
        case 2:
            af_tag = optarg;
            break;
        case 3:
            adj_baf = (int)strtol(optarg, &tmp, 0);
            if (*tmp) error("Could not parse: --adjust-BAF %s\n", optarg);
            break;
        case 4:
            truncate_baf = 1;
            break;
        case 5:
            use_maf = 1;
            break;
        case 'e':
            estimate_fname = optarg;
            break;
        case 'o':
            output_fname = optarg;
            break;
        case 'O':
            switch (optarg[0]) {
            case 'b':
                output_type = FT_BCF_GZ;
                break;
            case 'u':
                output_type = FT_BCF;
                break;
            case 'z':
                output_type = FT_VCF_GZ;
                break;
            case 'v':
                output_type = FT_VCF;
                break;
            default: {
                clevel = strtol(optarg, &tmp, 10);
                if (*tmp || clevel < 0 || clevel > 9) error("The output type \"%s\" not recognised\n", optarg);
            }
            };
            if (optarg[1]) {
                clevel = strtol(optarg + 1, &tmp, 10);
                if (*tmp || clevel < 0 || clevel > 9)
                    error("Could not parse argument: --compression-level %s\n", optarg + 1);
            }
            break;
        case 6:
            n_threads = strtol(optarg, &tmp, 0);
            if (*tmp) error("Could not parse argument: --threads %s\n", optarg);
            break;
        case 'r':
            regions_list = optarg;
            break;
        case 'R':
            regions_list = optarg;
            regions_is_file = 1;
            break;
        case 7:
            if (!strcasecmp(optarg, "0"))
                regions_overlap = 0;
            else if (!strcasecmp(optarg, "1"))
                regions_overlap = 1;
            else if (!strcasecmp(optarg, "2"))
                regions_overlap = 2;
            else
                error("Could not parse: --regions-overlap %s\n", optarg);
            break;
        case 't':
            targets_list = optarg;
            break;
        case 'T':
            targets_list = optarg;
            targets_is_file = 1;
            break;
        case 8:
            if (!strcasecmp(optarg, "0"))
                targets_overlap = 0;
            else if (!strcasecmp(optarg, "1"))
                targets_overlap = 1;
            else if (!strcasecmp(optarg, "2"))
                targets_overlap = 2;
            else
                error("Could not parse: --targets-overlap %s\n", optarg);
            break;
        case 's':
            sample_names = optarg;
            break;
        case 'S':
            sample_names = optarg;
            sample_is_file = 1;
            break;
        case 9:
            force_samples = 1;
            break;
        case 'W':
            if (!(write_index = write_index_parse(optarg))) error("Unsupported index format '%s'\n", optarg);
            break;
        case 'h':
        case '?':
        default:
            error("%s", usage_text());
            break;
        }
    }

    if (truncate_baf) adj_baf = -1;

    char *input_fname = NULL;
    if (optind == argc) {
        if (!isatty(fileno((FILE *)stdin))) {
            input_fname = "-"; // reading from stdin
        } else {
            error("%s", usage_text());
        }
    } else if (optind + 1 != argc) {
        error("%s", usage_text());
    } else {
        input_fname = argv[optind];
    }

    bcf_srs_t *srs = bcf_sr_init();
    if (af_fname) {
        bcf_sr_set_opt(srs, BCF_SR_REQUIRE_IDX);
        bcf_sr_set_opt(srs, BCF_SR_PAIR_LOGIC, BCF_SR_PAIR_EXACT);
    }

    if (regions_list) {
        bcf_sr_set_opt(srs, BCF_SR_REGIONS_OVERLAP, regions_overlap);
        if (bcf_sr_set_regions(srs, regions_list, regions_is_file) < 0)
            error("Failed to read the regions: %s\n", regions_list);
    }
    if (targets_list) {
        bcf_sr_set_opt(srs, BCF_SR_TARGETS_OVERLAP, targets_overlap);
        if (bcf_sr_set_targets(srs, targets_list, targets_is_file, 0) < 0)
            error("Failed to read the targets: %s\n", targets_list);
    }
    if (bcf_sr_set_threads(srs, n_threads) < 0) error("Failed to create threads\n");
    if (!bcf_sr_add_reader(srs, input_fname))
        error("Failed to open %s: %s\n", input_fname, bcf_sr_strerror(srs->errnum));
    if (af_fname && !bcf_sr_add_reader(srs, af_fname))
        error("Failed to open %s: %s\n", af_fname, bcf_sr_strerror(srs->errnum));

    bcf_hdr_t *hdr = bcf_sr_get_header(srs, 0);
    bcf_hdr_t *af_hdr = af_fname ? bcf_sr_get_header(srs, 1) : NULL;

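    if (sample_names) {
        int ret = bcf_hdr_set_samples(hdr, sample_names, sample_is_file);
        if (ret < 0) {
            error("Error parsing the list of samples: %s\n", sample_names);
        } else if (ret > 0) {
            // --force-samples downgrades an unknown sample from an error to a warning
            if (!force_samples) error("Sample name mismatch: sample #%d not found in the header\n", ret);
            fprintf(stderr, "Warning: sample #%d not found in the header\n", ret);
        }
    }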

    // look up header IDs for the FORMAT tags (GT, BAF) and INFO tags (ALLELE_A, ALLELE_B, allele frequency)
    int gt_id = bcf_hdr_id2int(hdr, BCF_DT_ID, "GT");
    if (gt_id < 0) error("Format GT was not found in the input header\n");
    int baf_id = bcf_hdr_id2int(hdr, BCF_DT_ID, "BAF");
    if (baf_id < 0) error("Format BAF was not found in the input header\n");
    int allele_a_id = bcf_hdr_id2int(hdr, BCF_DT_ID, "ALLELE_A");
    if (allele_a_id < 0) error("Format ALLELE_A was not found in the input header\n");
    int allele_b_id = bcf_hdr_id2int(hdr, BCF_DT_ID, "ALLELE_B");
    if (allele_b_id < 0) error("Format ALLELE_B was not found in the input header\n");
    int af_id = -1;
    if (af_tag) {
        af_id = bcf_hdr_id2int(af_hdr ? af_hdr : hdr, BCF_DT_ID, af_tag);
        if (af_id < 0) error("Format %s was not found in the allele frequency header\n", af_tag);
    }

    FILE *est_fh = strcmp("-", estimate_fname) ? fopen(estimate_fname, "w") : stdout;
    if (!est_fh) error("Error: cannot write to %s\n", estimate_fname);

    // output VCF
    if (output_fname) {
        char wmode[8];
        set_wmode(wmode, output_type, output_fname, clevel);
        out_fh = hts_open(output_fname, wmode);
        if (out_fh == NULL) error("[%s] Error: cannot write to \"%s\": %s\n", __func__, output_fname, strerror(errno));
        if (n_threads) hts_set_opt(out_fh, HTS_OPT_THREAD_POOL, srs->p);
        if (bcf_hdr_write(out_fh, hdr) < 0) error("Unable to write to output VCF file\n");
        if (init_index2(out_fh, hdr, output_fname, &index_fname, write_index) < 0)
            error("Error: failed to initialise index for %s\n", output_fname);
    }

    int n_smpls = bcf_hdr_nsamples(hdr);
    if (!af_hdr && !af_tag && n_smpls < 30)
        fprintf(
            stderr,
            "Input VCF only includes %d samples. We recommend using a separate VCF to infer marker allele frequency\n",
            n_smpls);

    int *arr = NULL;
    int marr = 0;
    float *baf_arr = NULL;
    int nbaf_arr = 0;
    int8_t *gts = (int8_t *)calloc(n_smpls, sizeof(int8_t));
    float *tmp_arr = (float *)calloc(n_smpls, sizeof(float));
    float *sumx2 = (float *)calloc(n_smpls, sizeof(float));
    float *sumxy = (float *)calloc(n_smpls, sizeof(float));
    float *sumx = (float *)calloc(n_smpls, sizeof(float));
    float *sumy = (float *)calloc(n_smpls, sizeof(float));
    int *n = (int *)calloc(n_smpls, sizeof(int));

    // run through each record present in both VCFs
    int i, j;
    while (bcf_sr_next_line(srs)) {
        bcf1_t *line = bcf_sr_get_line(srs, 0);
        if (!line) continue;
        if (out_fh && bcf_write1(out_fh, hdr, line) != 0)
            error("[%s] Error: cannot write to %s\n", __func__, output_fname);

        bcf1_t *af_line = af_hdr ? bcf_sr_get_line(srs, 1) : line;
        if (line->n_allele != 2 || !af_line || af_line->n_allele != 2) continue;

        // skip lines where the allele frequency is below --threshold (or above 1 minus --threshold)
        double af;
        if (af_tag) {
            bcf_info_t *af_info = bcf_get_info_id(af_line, af_id);
            af = af_info ? (double)af_info->v1.f : NAN;
        } else {
            hts_expand(int, af_line->n_allele, marr, arr);
            int ret = bcf_calc_ac(af_hdr ? af_hdr : hdr, af_line, arr, BCF_UN_INFO | BCF_UN_FMT);
            if (ret <= 0) continue;
            int an = 0;
            for (i = 0; i < af_line->n_allele; i++) an += arr[i];
            af = (double)arr[1] / (double)an;
        }
        if (isnan(af) || af < af_threshold || af > 1.0 - af_threshold) continue;
        if (use_maf && af > 0.5) af = 1.0 - af; // uses MAF instead of AF to avoid problems with flipped Illumina probes

        // skip lines where ALLELE_A and ALLELE_B refer to alleles missing from the record (it should not happen)
        bcf_info_t *allele_a_info = bcf_get_info_id(line, allele_a_id);
        int8_t allele_a = allele_a_info ? (int8_t)allele_a_info->v1.i : bcf_int8_missing;
        bcf_info_t *allele_b_info = bcf_get_info_id(line, allele_b_id);
        int8_t allele_b = allele_b_info ? (int8_t)allele_b_info->v1.i : bcf_int8_missing;
        if (allele_a < 0 || allele_a >= line->n_allele || allele_b < 0 || allele_b >= line->n_allele) continue;
        if (allele_b == 0) af = 1.0 - af; // flip the allele frequency if ALLELE_B is the reference

        // skip lines missing genotypes (e.g. intensity only sites) or with ploidy other than 2
        int n_aa = 0, n_ab = 0, n_bb = 0;
        bcf_fmt_t *gt_fmt = bcf_get_fmt_id(line, gt_id);
        if (!gt_fmt || gt_fmt->n != 2) continue;
#define BRANCH(type_t, bcf_type_vector_end)                                                                            \
    {                                                                                                                  \
        type_t *p = (type_t *)gt_fmt->p;                                                                               \
        for (i = 0; i < n_smpls; i++, p += 2) {                                                                        \
            gts[i] = GT_NC;                                                                                            \
            if (p[0] == bcf_type_vector_end || bcf_gt_is_missing(p[0]) || p[1] == bcf_type_vector_end                  \
                || bcf_gt_is_missing(p[1]))                                                                            \
                continue;                                                                                              \
            type_t allele_0 = bcf_gt_allele(p[0]);                                                                     \
            type_t allele_1 = bcf_gt_allele(p[1]);                                                                     \
            if (allele_0 == allele_a && allele_1 == allele_a) {                                                        \
                gts[i] = GT_AA;                                                                                        \
                n_aa++;                                                                                                \
            } else if ((allele_0 == allele_a && allele_1 == allele_b)                                                  \
                       || (allele_0 == allele_b && allele_1 == allele_a)) {                                            \
                gts[i] = GT_AB;                                                                                        \
                n_ab++;                                                                                                \
            } else if (allele_0 == allele_b && allele_1 == allele_b) {                                                 \
                gts[i] = GT_BB;                                                                                        \
                n_bb++;                                                                                                \
            }                                                                                                          \
        }                                                                                                              \
    }
        switch (gt_fmt->type) {
        case BCF_BT_INT8:
            BRANCH(int8_t, bcf_int8_vector_end);
            break;
        case BCF_BT_INT16:
            BRANCH(int16_t, bcf_int16_vector_end);
            break;
        case BCF_BT_INT32:
            BRANCH(int32_t, bcf_int32_vector_end);
            break;
        default:
            error("Unexpected type %d\n", gt_fmt->type);
        }
#undef BRANCH

        int nbaf = bcf_get_format_float(hdr, line, "BAF", &baf_arr, &nbaf_arr);
        if (nbaf != n_smpls) continue; // wrong number of BAF values

        // adjust BAF
        float adj_baf_aa = 0.0;
        float adj_baf_bb = 0.0;
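        if (adj_baf != -1) {
            // subtract the median BAF of each homozygous cluster; for an even count the two middle
            // values are selected separately, as ks_ksmall only fixes the position of the k-th element
            j = 0;
            if (n_aa > 0 && n_aa >= adj_baf) {
                for (i = 0; i < n_smpls; i++)
                    if (gts[i] == GT_AA) tmp_arr[j++] = baf_arr[i];
                adj_baf_aa = ks_ksmall_float((size_t)j, tmp_arr, (size_t)j / 2);
                if (j % 2 == 0)
                    adj_baf_aa = (adj_baf_aa + ks_ksmall_float((size_t)j, tmp_arr, (size_t)j / 2 - 1)) * 0.5f;
            }
            j = 0;
            if (n_bb > 0 && n_bb >= adj_baf) {
                for (i = 0; i < n_smpls; i++)
                    if (gts[i] == GT_BB) tmp_arr[j++] = baf_arr[i];
                adj_baf_bb = ks_ksmall_float((size_t)j, tmp_arr, (size_t)j / 2);
                if (j % 2 == 0)
                    adj_baf_bb = (adj_baf_bb + ks_ksmall_float((size_t)j, tmp_arr, (size_t)j / 2 - 1)) * 0.5f;
                adj_baf_bb -= 1.0;
            }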
        } else if (truncate_baf) { // truncates the BAF between 0.0 and 1.0 like Illumina does
            for (i = 0; i < n_smpls; i++) {
                if (baf_arr[i] < 0.0)
                    baf_arr[i] = 0.0;
                else if (baf_arr[i] > 1.0)
                    baf_arr[i] = 1.0;
            }
        }

        for (i = 0; i < n_smpls; i++) {
            double baf;
            if (gts[i] == GT_AA) {
                baf = (double)(baf_arr[i] - adj_baf_aa);
                sumx2[i] += sqr(af);
                sumxy[i] += af * baf;
                sumx[i] += af;
                sumy[i] += baf;
            } else if (gts[i] == GT_BB) {
                baf = (double)(baf_arr[i] - adj_baf_bb);
                sumx2[i] += sqr(1.0 - af);
                sumxy[i] += (1.0 - af) * (1.0 - baf);
                sumx[i] += 1.0 - af;
                sumy[i] += 1.0 - baf;
            } else
                continue;
            n[i]++;
        }
    }

    fprintf(est_fh, "sample_id\tbaf_regress\tNhom\n");
    for (i = 0; i < n_smpls; i++) {
        double denom = (double)n[i] * sumx2[i] - sqr(sumx[i]);
        double m = denom ? (n[i] * sumxy[i] - sumx[i] * sumy[i]) / denom : NAN;
        // double b = denom ? (sumy[i] * sumx2[i] - sumx[i] * sumxy[i]) / denom : NAN;
        fprintf(est_fh, "%s\t%.4f\t%d\n", hdr->samples[i], m, n[i]);
    }

    if (est_fh != stdout && est_fh != stderr) fclose(est_fh);

    // close output VCF
    if (output_fname) {
        if (write_index) {
            if (bcf_idx_save(out_fh) < 0) {
                if (hts_close(out_fh) != 0)
                    error("Close failed %s\n", strcmp(output_fname, "-") ? output_fname : "stdout");
                error("Error: cannot write to index %s\n", index_fname);
            }
            free(index_fname);
        }
        if (hts_close(out_fh) != 0) error("Close failed %s\n", strcmp(output_fname, "-") ? output_fname : "stdout");
    }

    free(arr);
    free(baf_arr);
    free(gts);
    free(tmp_arr);
    free(sumx2);
    free(sumxy);
    free(sumx);
    free(sumy);
    free(n);
    bcf_sr_destroy(srs);

    return 0;
}


================================================
FILE: HapMap.md
================================================
HapMap
======

A tutorial on how to convert HapMap data from Illumina and Affymetrix arrays to a GRCh38 VCF using gtc2vcf

<!--ts-->
   * [Download manifest files](#download-manifest-files)
   * [Download and unpack IDAT and CEL files](#download-and-unpack-idat-and-cel-files)
   * [Create sample maps](#create-sample-maps)
   * [Convert IDATs to GTCs](#convert-idats-to-gtcs)
   * [Convert GTCs to VCF](#convert-gtcs-to-vcf)
   * [Convert CELs to CHPs](#convert-cels-to-chps)
   * [Convert CHPs to VCF](#convert-chps-to-vcf)
<!--te-->

Download manifest files
=======================

Download HumanCNV370v1 manifest and cluster files from [Illumina](http://support.illumina.com/downloads/humancnv370-duo_v10_product_files.html) and [GEO](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6986)
```
wget ftp://webdata2:webdata2@ftp.illumina.com/downloads/ProductFiles/HumanCNV370/HumanCNV370-Duo/humancnv370v1_c.bpm
wget ftp://webdata2:webdata2@ftp.illumina.com/downloads/ProductFiles/HumanCNV370/HumanCNV370-Duo/HumanCNV370v1_C.egt
wget http://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL6nnn/GPL6986/suppl/GPL6986_HumanCNV370v1_C.csv.gz
gunzip GPL6986_HumanCNV370v1_C.csv.gz
/bin/mv GPL6986_HumanCNV370v1_C.csv HumanCNV370v1_C.csv
```

Download HumanOmni2.5-4v1 manifest and cluster files from [Illumina](http://support.illumina.com/downloads/humanomni2-5-quad_product_files.html)
```
wget ftp://webdata2:webdata2@ftp.illumina.com/MyIllumina/94afb35e-7c11-45cc-8a65-d868af527c54/HumanOmni2.5-4v1_H.bpm
wget ftp://webdata2:webdata2@ftp.illumina.com/MyIllumina/f003e017-1761-4348-958f-03997a30cf67/HumanOmni2.5-4v1_H.egt
wget ftp://webdata2:webdata2@ftp.illumina.com/MyIllumina/d5578cf6-bb3b-4b4b-98d3-21edc5bcbd45/HumanOmni2.5-4v1_H.csv
```

Download HumanOmni25M-8v1-1 manifest and cluster files from [Illumina](ftp://webdata2:webdata2@ftp.illumina.com/downloads/productfiles/humanomni25) and [GEO](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL20641)
```
wget http://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL20nnn/GPL20641/suppl/GPL20641_HumanOmni2.5M-8v1-1_B.bpm.gz
wget ftp://webdata2:webdata2@ftp.illumina.com/downloads/productfiles/humanomni25/humanomni2-5m-8v1-1_b.egt
wget http://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL20nnn/GPL20641/suppl/GPL20641_HumanOmni25M-8v1-1_B.csv.gz
gunzip GPL20641_HumanOmni2.5M-8v1-1_B.bpm.gz
gunzip GPL20641_HumanOmni25M-8v1-1_B.csv.gz
/bin/mv GPL20641_HumanOmni2.5M-8v1-1_B.bpm HumanOmni25M-8v1-1_B.bpm
/bin/mv GPL20641_HumanOmni25M-8v1-1_B.csv HumanOmni25M-8v1-1_B.csv
```

Download GenomeWideEx_6 and GenomeWideSNP_6 library and annotation files from [Affymetrix](http://www.affymetrix.com/support/technical/byproduct.affx?product=genomewidesnp_6)
```
wget http://tools.thermofisher.com/content/sfs/supportfiles/genomewidesnp6_libraryfile.zip
wget http://www.affymetrix.com/Auth/analysis/downloads/lf/genotyping/GenomeWideSNP_6/SNP6_supplemental_axiom_analysis_files.zip
wget http://www.affymetrix.com/Auth/analysis/downloads/na35/genotyping/GenomeWideSNP_6.na35.annot.csv.zip
unzip -oj genomewidesnp6_libraryfile.zip CD_GenomeWideSNP_6_rev3/Full/GenomeWideSNP_6/LibFiles/GenomeWideSNP_6.{cdf,chr{X,Y}probes,specialSNPs}
unzip -o SNP6_supplemental_axiom_analysis_files.zip GenomeWideSNP_6.{generic_prior.txt,apt-probeset-genotype.AxiomGT1.xml,AxiomGT1.sketch}
unzip -o GenomeWideSNP_6.na35.annot.csv.zip GenomeWideSNP_6.na35.annot.csv
/bin/rm genomewidesnp6_libraryfile.zip SNP6_supplemental_axiom_analysis_files.zip GenomeWideSNP_6.na35.annot.csv.zip
```

Re-align flanking sequences to GRCh38
```
for chip in HumanCNV370v1_C humanomni25m-8v1-1_b HumanOmni2.5-4v1_H; do
  bcftools +gtc2vcf --csv $chip.csv --fasta-flank | \
    bwa mem -M $HOME/res/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna - | \
    samtools view -bS -o $chip.bam
done
bcftools +affy2vcf --csv GenomeWideSNP_6.na35.annot.csv --fasta-flank | \
  bwa mem -M $HOME/res/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna - | \
  samtools view -bS -o GenomeWideSNP_6.na35.annot.bam
```

Download and unpack IDAT and CEL files
======================================

```
wget http://bioconductor.org/packages/release/data/annotation/src/contrib/hapmap370k_1.0.1.tar.gz
wget -nH --cut-dirs 2 -r ftp://ftp.ncbi.nlm.nih.gov/hapmap/raw_data/hapmap3_affy6.0/
wget -nH --cut-dirs 5 -r ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/

mkdir -p idats
tar xzvf hapmap370k_1.0.1.tar.gz -C idats hapmap370k/inst/idatFiles
tar xzvf hd_genotype_chip/broad_intensities/Omni25_idats_gtcs_2141_samples.tgz -C idats
tar xzvf hd_genotype_chip/sanger_intensities/ALL.wgs.sanger_omni_2_5_8.20130805.snps.genotypes.idats.tar.gz -C idats

mkdir -p cels
for tgz in hapmap3_affy6.0/*.tgz; do tar xzvf $tgz -C cels; done
tar xzvf hd_genotype_chip/coriell_affy6_intensities/Affy60_Coriell_CEL_files.tar.gz -C cels

# one sample is mapped to HG03171 but should be mapped to HG01171, most likely a typo here
/bin/mv "cels/affy6/1000 Genomes phase 1 and 2 cel files/NA18489 .CEL" "cels/affy6/1000 Genomes phase 1 and 2 cel files/NA18489.CEL"
/bin/mv "cels/affy6/1000 Genomes phase 1 and 2 cel files/HG03616.CEL" "cels/affy6/1000 Genomes phase 1 and 2 cel files/HG03616-1.CEL"
/bin/mv "cels/affy6/1000 Genomes phase 1 and 2 cel files/HG03660.CEL" "cels/affy6/1000 Genomes phase 1 and 2 cel files/HG03660-1.CEL"
/bin/mv "cels/affy6/1000 Genomes phase 1 and 2 cel files/HG04149.CEL" "cels/affy6/1000 Genomes phase 1 and 2 cel files/HG04149-1.CEL"
/bin/mv "cels/affy6/1000 Genomes phase 1 and 2 cel files/HG01171.CEL" "cels/affy6/1000 Genomes phase 1 and 2 cel files/HG01171-1.CEL"
/bin/mv "cels/affy6/1000 Genomes phase 3 cel files/HG03616.CEL" "cels/affy6/1000 Genomes phase 3 cel files/HG03616-C1.CEL"
/bin/mv "cels/affy6/1000 Genomes phase 3 cel files/HG03660.CEL" "cels/affy6/1000 Genomes phase 3 cel files/HG03660-C1.CEL"
/bin/mv "cels/affy6/1000 Genomes phase 3 cel files/HG04149.CEL" "cels/affy6/1000 Genomes phase 3 cel files/HG04149-C1.CEL"
/bin/mv "cels/affy6/1000 Genomes phase 3 cel files/HG03171.CEL" "cels/affy6/1000 Genomes phase 3 cel files/HG01171-C1.CEL"
```
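
The first `mv` above removes a stray space before the `.CEL` extension. As a rough sketch (the helper name `list_spaced_cels` is made up for illustration and is not part of gtc2vcf), a `find` pattern can list any other files with the same problem before the sample maps are built:

```shell
# list_spaced_cels DIR: print CEL files whose name has a space right before ".CEL"
# (hypothetical helper, shown only to illustrate the check)
list_spaced_cels() {
  find "$1" -type f -name '* .CEL'
}

# Example: list_spaced_cels cels
```

An empty result suggests no further renames of this kind are needed.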

Create sample maps
==================

```
awk -F, 'NR>1 {print $5"\t"$1".HumanCNV370v1"}' idats/hapmap370k/inst/idatFiles/samples370k.csv > HapMap.HumanCNV370v1.tsv

awk -F, 'NR>15 {print $2"_"$3"\t"$6".HumanOmni2.5-4v1"}' idats/SampleSheet.csv > HapMap.HumanOmni2.5-4v1.tsv

awk 'NR==FNR {x[$2]=$1} NR>FNR {print $2"\t"x[substr($1,12)]".HumanOmni25M-8v1-1"}' \
  hd_genotype_chip/sanger_intensities/sanger_omni_chip.20130805.internal_to_coriell_id.map \
  idats/omni2.5-8_otgeno_20130805.idats/log.txt > HapMap.HumanOmni25M-8v1-1.tsv

# one sample is mapped to NA19787 but should be mapped to NA19730, most likely a sample swap
# samples mapped to NA21742 and NA21743 are the same individual, most likely a collection issue
cat hapmap3_affy6.0/{passing,excluded}_cels_sample_map.txt | sed 's/\.CEL$//' | \
  sed 's/NA19787\tCHEAP_p_HapMapP3Redo2_GenomeWideSNP_6_B09_235604/NA19730\tCHEAP_p_HapMapP3Redo2_GenomeWideSNP_6_B09_235604/' | \
  awk '{sm=$1; if (sm in x) sm=sm"-"x[sm]; print $2"\t"sm".GenomeWideEx_6"; x[$1]++}' > HapMap.GenomeWideEx_6.tsv

ls cels/affy6/1000\ Genomes\ phase\ {1\ and\ 2,3}\ cel\ files/*.CEL | sed 's/\.CEL$//' | \
  awk -F/ '{print $4"\t"$4".GenomeWideSNP_6"}' > HapMap.GenomeWideSNP_6.tsv
```
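
The awk step that builds `HapMap.GenomeWideEx_6.tsv` appends a numeric suffix to any sample ID that has already been seen, so repeated samples get distinct names. A minimal sketch on made-up input shows the effect; the sample IDs and CEL names below are hypothetical, and the wrapper function `dedup_sample_map` exists only for this illustration:

```shell
# dedup_sample_map: read "sample<TAB>cel" lines on stdin and print "cel<TAB>sample.GenomeWideEx_6",
# adding -1, -2, ... to a sample ID each time it repeats (same awk program as in the tutorial)
dedup_sample_map() {
  awk '{sm=$1; if (sm in x) sm=sm"-"x[sm]; print $2"\t"sm".GenomeWideEx_6"; x[$1]++}'
}

# Hypothetical input with one repeated sample
printf 'NA12878\tcel_a\nNA12878\tcel_b\nNA12891\tcel_c\n' | dedup_sample_map
# prints:
# cel_a	NA12878.GenomeWideEx_6
# cel_b	NA12878-1.GenomeWideEx_6
# cel_c	NA12891.GenomeWideEx_6
```

The first occurrence keeps the plain ID, and only later occurrences receive a suffix.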

Convert IDATs to GTCs
=====================

```
declare -A bpm=( ["HumanCNV370v1"]="humancnv370v1_c.bpm"
                 ["HumanOmni2.5-4v1"]="HumanOmni2.5-4v1_H.bpm"
                 ["HumanOmni25M-8v1-1"]="HumanOmni25M-8v1-1_B.bpm" )
declare -A egt=( ["HumanCNV370v1"]="HumanCNV370v1_C.egt"
                 ["HumanOmni2.5-4v1"]="HumanOmni2.5-4v1_H.egt"
                 ["HumanOmni25M-8v1-1"]="humanomni2-5m-8v1-1_b.egt" )
bcftools +gtc2vcf -i $(find idats -iname '*.idat') -o gtc2vcf.idat.tsv
mkdir -p HumanCNV370v1 HumanOmni25M-8v1-1 HumanOmni2.5-4v1
for idat in $(cut -f1 gtc2vcf.idat.tsv | grep _Grn.idat$); do
  chip=$(grep ^$idat gtc2vcf.idat.tsv | cut -f16)
  mono $HOME/bin/autoconvert/AutoConvert.exe $(find idats -iname $idat) $chip ${bpm[$chip]} ${egt[$chip]}
done
bcftools +gtc2vcf {HumanCNV370v1,HumanOmni25M-8v1-1,HumanOmni2.5-4v1}/*.gtc -o gtc2vcf.gtc.tsv
```
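
The loop above finds the chip for each IDAT file with `grep ^$idat`, which matches by prefix and treats `.` as a wildcard. As a hedged alternative, the sketch below does an exact comparison on column 1 with awk. The table is made up (file names and chip labels are hypothetical, and the second file name is contrived to force a prefix collision); the only thing taken from the loop above is that the chip guess sits in column 16
```bash
tmp=$(mktemp -d)
# 14 filler columns so that the chip guess lands in column 16
pad=$(printf '\t.%.0s' 1 2 3 4 5 6 7 8 9 10 11 12 13 14)
printf '%s%s\t%s\n' "1234_R01C01_Grn.idat" "$pad" "HumanCNV370v1" > "$tmp/toy.idat.tsv"
printf '%s%s\t%s\n' "1234_R01C01_Grn.idat.bak" "$pad" "HumanOmni2.5-4v1" >> "$tmp/toy.idat.tsv"

idat="1234_R01C01_Grn.idat"
# Count how many rows a prefix match would return
grep -c "^$idat" "$tmp/toy.idat.tsv"
# Exact comparison on column 1 keeps only the intended row
chip=$(awk -F'\t' -v f="$idat" '$1 == f {print $16}' "$tmp/toy.idat.tsv")
echo "$chip"
```
Whether this matters in practice depends on how the IDAT files are named; with the usual barcode-style names the prefix match is unlikely to collide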

Convert GTCs to VCF
===================

```
declare -A bpm=( ["HumanCNV370v1"]="humancnv370v1_c.bpm"
                 ["HumanOmni2.5-4v1"]="HumanOmni2.5-4v1_H.bpm"
                 ["HumanOmni25M-8v1-1"]="HumanOmni25M-8v1-1_B.bpm" )
declare -A egt=( ["HumanCNV370v1"]="HumanCNV370v1_C.egt"
                 ["HumanOmni2.5-4v1"]="HumanOmni2.5-4v1_H.egt"
                 ["HumanOmni25M-8v1-1"]="humanomni2-5m-8v1-1_b.egt" )
declare -A csv=( ["HumanCNV370v1"]="HumanCNV370v1_C.csv"
                 ["HumanOmni2.5-4v1"]="HumanOmni2.5-4v1_H.csv"
                 ["HumanOmni25M-8v1-1"]="humanomni25m-8v1-1_b.csv" )
declare -A sam=( ["HumanCNV370v1"]="HumanCNV370v1_C.bam"
                 ["HumanOmni2.5-4v1"]="HumanOmni2.5-4v1_H.bam"
                 ["HumanOmni25M-8v1-1"]="humanomni25m-8v1-1_b.bam" )
for chip in HumanCNV370v1 HumanOmni25M-8v1-1 HumanOmni2.5-4v1; do
  bcftools +gtc2vcf \
    --no-version -Ou \
    --fasta-ref $HOME/res/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
    --bpm ${bpm[$chip]} \
    --egt ${egt[$chip]} \
    --csv ${csv[$chip]} \
    --sam ${sam[$chip]} \
    --gtcs $chip \
    --extra HapMap.$chip.sex \
    --do-not-check-bpm | \
    bcftools sort -Ou -T ./bcftools. | \
    bcftools norm --no-version -Ob -o HapMap.$chip.bcf -c x -f $ref && \
    bcftools index -f HapMap.$chip.bcf
done
```

Convert CELs to CHPs
====================

```
(echo cel_files; ls cels/{,Broad_hapmap3_r2_Affy6_cels_excluded/}*.CEL) > cels.GenomeWideEx_6.lst
(echo cel_files; ls cels/affy6/1000\ Genomes\ phase\ {1\ and\ 2,3}\ cel\ files/*.CEL) > cels.GenomeWideSNP_6.lst
for chip in GenomeWideEx_6 GenomeWideSNP_6; do
  mkdir -p $chip
  apt-probeset-genotype \
    --out-dir $chip \
    --special-snps GenomeWideSNP_6.specialSNPs \
    --read-models-brlmmp GenomeWideSNP_6.generic_prior.txt \
    --chip-type $chip \
    --xml-file GenomeWideSNP_6.apt-probeset-genotype.AxiomGT1.xml \
    --cel-files cels.$chip.lst \
    --table-output false \
    --cc-chp-output \
    --cc-chp-out-dir $chip \
    --write-models
done
```
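
The list files passed to `--cel-files` above consist of a `cel_files` header line followed by one path per line, and some of the paths contain spaces. A minimal sketch of a sanity check before launching the genotyping run could look as follows (the scratch directory and the CEL names are hypothetical)
```bash
tmp=$(mktemp -d)
# Hypothetical CEL files, one of them in a directory whose name contains spaces
mkdir -p "$tmp/cels/phase 3 cel files"
touch "$tmp/cels/a.CEL" "$tmp/cels/phase 3 cel files/b.CEL"
(echo cel_files; ls "$tmp"/cels/*.CEL "$tmp"/cels/*/*.CEL) > "$tmp/cels.toy.lst"

missing=0
{
  # First line is the header; every other line must name an existing file
  IFS= read -r header
  while IFS= read -r cel; do
    [ -f "$cel" ] || { echo "missing: $cel"; missing=$((missing + 1)); }
  done
} < "$tmp/cels.toy.lst"
echo "header=$header missing=$missing"
```
Reading whole lines with `IFS= read -r` keeps paths with spaces intact, which a plain `for cel in $(cat ...)` loop would split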

Convert CHPs to VCF
===================

```
for chip in GenomeWideEx_6 GenomeWideSNP_6; do
  bcftools +affy2vcf \
    --no-version -Ou \
    --fasta-ref $HOME/res/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
    --csv GenomeWideSNP_6.na35.annot.csv \
    --sam GenomeWideSNP_6.na35.annot.bam \
    --models $chip/AxiomGT1.snp-posteriors.txt \
    --report $chip/AxiomGT1.report.txt \
    --chps $chip \
    --extra HapMap.$chip.sex | \
    bcftools sort -Ou -T ./bcftools. | \
    bcftools norm --no-version -Ob -o HapMap.$chip.bcf -c x -f $ref && \
    bcftools index -f HapMap.$chip.bcf
done
```


================================================
FILE: Illumina.md
================================================

Archived Human Products
-----------------------

| array                                                                                                                                                                   | date       | bpm                                       | egt                                       | csv                                       |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|
| [Human-1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/Human-1_product_files>)                             | 12/21/2004 | Exon-Centric_100K_(v1.2.1).bpm            | Exon-Centric_100K_(v1.2.1).egt            | NA                                        |
| [HumanHap240S](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanHap240S_product_files>)                   | 03/13/2006 | BDCHP-1X10-HUMANHAP240S_11216501_B.bpm    | BDCHP-1X10-HUMANHAP240S_11216501_B.egt    | BDCHP-1X10-HUMANHAP240S_11216501_B.csv    |
| [HumanHap300_v1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanHap300_v1_product_files>)               | 03/30/2006 | BDCHP-1x10-HUMANHAP300v1-1_11219278_C.bpm | BDCHP-1x10-HUMANHAP300v1-1_11219278_C.egt | BDCHP-1x10-HUMANHAP300v1-1_11219278_C.csv |
| [Human1M](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/Human1M_product_files>)                             | 4/24/2006  | Human1Mv1_C.bpm                           | Human1Mv1_C.egt                           | Human1Mv1_C.csv                           |
| [HumanExon510S-2](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanExon510S-2_product_files>)             | 4/24/2006  | HumanExon510Sv1_D.bpm                     | HumanExon510Sv1_D.egt                     | Human510Sv1_A.csv                         |
| [HumanHap550_v1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanHap550_v1_product_files>)               | 05/01/2006 | BDCHP-1X10-HUMANHAP550_11218540_C.bpm     | BDCHP-1X10-HUMANHAP550_11218540_C.egt     | BDCHP-1X10-HUMANHAP550_11218540_C_csv     |
| [HumanNS-12](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanNS-12_product_files>)                       | 11/7/2006  | HumanNS-12.bpm                            | HumanNS-12.egt                            | HumanNS-12.csv                            |
| [HumanHap300-Duo_v2](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanHap300-Duo_v2 product files>)       | 12/21/2006 | HumanHap300v2_A.bpm                       | HumanHap300v2_A.egt                       | HumanHap300v2_A.csv                       |
| [HumanHap550-Duo_v3](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanHap550-Duo_v3_product_files>)       | 12/21/2006 | HumanHap550-2v3_B.bpm                     | HumanHap550-2v3_B.egt                     | HumanHap550-2v3_B.csv                     |
| [HumanHap550_v3](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanHap550_v3_product_files>)               | 12/21/2006 | HumanHap550v3_A.bpm                       | HumanHap550v3_A.egt                       | HumanHap550v3_A.csv                       |
| [HumanHap650Y_v3](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanHap650Y_v3_product_files>)             | 12/21/2006 | HumanHap650Y_v3.bpm                       | HumanHap650Yv3_A.egt                      | HumanHap650Yv3_A.csv                      |
| [HumanCNV-12_v1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanCNV-12_v1_product_files>)               | 5/15/2007  | HumanCNV12v1_C.bpm                        | HumanCNV12v1_C.egt                        | NA                                        |
| [HumanCNV370-Duo_v1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanCNV370-Duo_v1_product_files>)       | 5/15/2007  | HumanCNV370v1_C.bpm                       | HumanCNV370v1_C.egt                       | HumanCNV370v1_C.csv                       |
| [HumanLinkage-12](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanLinkage-12>)                           | 7/10/2007  | HumanLinkage-12 _E.bpm                    | HumanLinkage-12 _E.egt                    | NA                                        |
| [HumanCVDSNP55](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanCVDSNP55>)                               | 3/31/2008  | CVDSNP55v1_A.bpm                          | Human CVD.egt                             | HumanCVDv1_A.csv                          |
| [HumanCNV370-Quad_v3](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanCNV370-Quad_v3_product_files>)     | 3/17/2008  | HumanCNV370-Quadv3_C.bpm                  | HumanCNV370-Quadv3_C.egt                  | HumanCNV370-Quadv3_C.csv                  |
| [HumanCNV-12_v2](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanCNV-12_v2_product_files>)               | 4/3/2008   | HumanCNV12v2_B.bpm                        | NA                                        | NA                                        |
| [Human1M-Duo_v3](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/Human1M-Duo_v3_product_files>)               | 4/4/2008   | Human1M-Duov3_B.bpm                       | NA                                        | Human1M-Duov3_B.csv                       |
| [HumanLinkage-24](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanLinkage-24>)                           | 02/02/2010 | InfiniumLinkage-24_11419173_A.bpm         | NA                                        | InfiniumLinkage-24_11419173_A.csv         |
| [Human610-Quad_v1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/Human610-Quad_v1_product_files>)           | 10/13/2010 | Human610-Quadv1_C.bpm                     | Human610-Quadv1_C.egt                     | Human610-Quadv1_C.csv                     |
| [HumanOmniExpress-12v1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/HumanOmniExpress-12v1_Product_Files>) | 10/14/2010 | HumanOmniExpress-12v1_C.bpm               | HumanOmniExpress-12v1_C.egt               | HumanOmniExpress-12v1_C.csv               |
| [Human660W-Quad_v1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_Human_Products/Human660W-Quad_v1_H_product_files>)       | 4/21/2011  | Human660W-Quad_v1_H.bpm                   | Human660W-Quad_v1_H.egt                   | Human660W-Quad_v1_H.csv                   |

Archived non-Human Products
---------------------------

| array                                                                                                                                                                   | date      | bpm                  | egt                                 | csv                                 |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|----------------------|-------------------------------------|-------------------------------------|
| [CanineSNP20](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_non-Human_Products/CanineSNP20_ProductFiles>)                  | 7/10/2007 | CanineSNP20_A.bpm    | CanineSNP20_A.egt                   | NA                                  | 
| [BovineSNP50VERSION1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_non-Human_Products/BOVINESNP50VERSION1_product files>) | 8/10/2007 | BovineSNP50_B.bpm    | BovineSNP50_A.egt/BovineSNP50_B.egt | BovineSNP50_A.csv/BovineSNP50_B.csv | 
| [EquineSNP50](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_non-Human_Products/EquineSNP50_product_files>)                 | 6/9/2008  | EquineSNP50_C.bpm    | EquineSNP50_C.egt                   | EquineSNP50_C.csv                   |
| [PorcineSNP60](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_non-Human_Products/PorcineSNP60_product_files>)               | 1/7/2009  | PorcineSNP60_B.bpm   | PorcineSNP60_A.egt                  | PorcineSNP60_B.csv                  |
| [OvineSNP50](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_non-Human_Products/OvineSNP50_product_files>)                   | 1/7/2009  | OvineSNP50_B.bpm     | OvineSNP50_A.egt                    | OvineSNP50_B.csv                    | 
| [CanineHD](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_non-Human_Products/CanineHD_Product_files>)                       | 9/2/2009  | CanineHD_A.bpm       | CanineHD-A.egt                      | CanineHD_A.csv                      |
| [Maize_SNP50](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_non-Human_Products/Maize_SNP50>)                               | 2/3/2010  | MaizeSNP50_A.bpm     | MaizeSNP50_B.egt                    | MaizeSNP50_A.csv                    |
| [BovineSNP50VERSION2](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_non-Human_Products/BovineSNP50VERSION2_product_files>) | 5/20/2010 | BovineSNP50_v2_C.bpm | BovineSNP50v2_A.egt                 | BovineSNP50_v2_C.csv                |
| [BOVINEHD](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/Archived_non-Human_Products/BOVINEHD_Product_Files>)                       | 6/18/2010 | BovineHD_B.bpm       | BovineHD_A.egt                      | BovineHD_B.csv                      |

Old Products
------------

| array                                                                                                                                                                             | date       | bpm                                   | egt                                     | csv                           |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|---------------------------------------|-----------------------------------------|-------------------------------|
| [HumanOmni5Exome-4v1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/GT_Call_Files_Current_Products/HumanOmni5Exome v1.0>)                     | 2/10/2012  | HumanOmni5Exome-4v1_A.bpm             | NA                                      | NA                            |
| [HumanOmniExpress-12v1-1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/GT_Call_Files_Current_Products/HumanOmniExpress-12v1.1>)              | 10/30/2012 | HumanOmniExpress-12v1-1_A.bpm         | NA                                      | NA                            |
| [OmniExpressExome-8v1-1_15036758](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/GT_Call_Files_Current_Products/HumanOmniExpressExome-12v1.1>) | 12/17/2012 | OmniExpressExome-8v1-1_15036758_A.bpm | HumanOmniExpressExome-8v1-1_2012.12.egt | NA                            |
| [HumanOmni25M-8v1-1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/GT_Call_Files_Current_Products/HumanOmni-2.5-8-v1.1>)                      | 2/13/2013  | HumanOmni25M-8v1-1_B.bpm              | HumanOmni2-5M-8v1-1_B.egt               | HumanOmni25M-8v1-1_B.csv      |
| [OmniExpressExome-8v1-1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/GT_Call_Files_Current_Products/OmniExpressExome-8v1-1_B>)              | 2/5/2013   | OmniExpressExome-8v1-1_B.bpm          | HumanOmniExpressExome-8v1-1_B.egt       | OmniExpressExome-8v1-1_B.csv  |
| [OmniExpressExome-8v1-1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/GT_Call_Files_Current_Products/HumanOmniExpressExome-8v1-1_B>)         | 2/5/2013   | OmniExpressExome-8v1-1_B.bpm          | HumanOmniExpressExome-8v1-1_B.egt       | OmniExpressExome-8v1-1_B.csv  |
| [HumanCoreExome-12v1-0](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/GT_Call_Files_Current_Products/HumanCoreExomev1-0_A>)                   | 2/6/2013   | HumanCoreExome-12v1-0_A.bpm           | HumanCoreExome-12v1-0_A.egt             | HumanCoreExome-12v1-0_A.csv   |
| [HumanOmniExpress-12v1-1](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/GT_Call_Files_Current_Products/OmniExpress-12v1.1_B>)                 | 2/6/2013   | HumanOmniExpress-12v1-1_B.bpm         | HumanOmniExpress-12v1-1_B.egt           | HumanOmniExpress-12v1-1_B.csv |
| [PsychChip_15048346](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Whole Genome Genotyping Files/GT_Call_Files_Current_Products/HumanPsychChipv-1-0>)                       | 10/23/2013 | PsychChip_15048346_A.bpm              | NA                                      | PsychChip_15048346_A.csv      |

Consortium Products
-------------------

| array                                                                    | date       | bpm                                      | egt | csv                                      |
|--------------------------------------------------------------------------|------------|------------------------------------------|-----|------------------------------------------|
| [ASA-24v1-0-Consort_20022506](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Genotyping_Array_Support_Files/Consortium Asian Screening Array>)        | 1/23/2018  | ASA-24v1-0-Consort_20022506_A2.bpm       | NA  | ASA-24v1-0-Consort_20022506_A2.csv       |
| [CGCA-24v1-0_20034773](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Genotyping_Array_Support_Files/Consortium Chinese Genotyping Array>)            | 5/13/2020  | CGCA-24v1-0_20034773_A1.bpm              | NA  | CGCA-24v1-0_20034773_A1.csv              |
| [DrugDevConsortium-24v1-2_20024394](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Genotyping_Array_Support_Files/Consortium Drug Dev Array>)         | 3/14/2018  | DrugDevConsortium-24v1-2_20024394_A1.bpm | NA  | DrugDevConsortium-24v1-2_20024394_A1.csv |
| [GDAConfluence_20032938X375356](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Genotyping_Array_Support_Files/Global Diversity Array/GDA-Confluence>) | 3/11/2021  | GDAConfluence_20032938X375356_A2.bpm     | NA  | GDAConfluence_20032938X375356_A2.csv     |
| [NeuroBooster_20042459](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Genotyping_Array_Support_Files/Global Diversity Array/GDA-Neuro Booster>)      | 7/16/2020  | NeuroBooster_20042459_A2.bpm             | NA  | NeuroBooster_20042459_A2.csv             |
| [H3Africa_2017_20021485](<ftp://webdata:webdata@ftp.illumina.com/Public_Docs/Genotyping_Array_Support_Files/H3Africa/v1>)                                  | 10/27/2017 | H3Africa_2017_20021485_A2.bpm            | NA  | H3Africa_2017_20021485_A2.csv            |


================================================
FILE: LICENSE
================================================
The MIT License

Copyright (C) 2018-2025 Giulio Genovese

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.


================================================
FILE: README.md
================================================
gtc2vcf
=======

A set of tools to convert Illumina and Affymetrix DNA microarray intensity data files into VCF files <b>without</b> using Microsoft Windows. You can use the final output to run the pipeline to detect [mosaic chromosomal alterations](http://github.com/freeseek/mocha). If you use this tool in your publication, please cite this website. For any feedback or questions, contact the [author](mailto:giulio.genovese@gmail.com)

![](gtc2vcf.png)

<!--ts-->
   * [Usage](#usage)
   * [Installation](#installation)
   * [Software Installation](#software-installation)
   * [Identifying chip type for IDAT and CEL files](#identifying-chip-type-for-idat-and-cel-files)
   * [Convert Illumina IDAT files to GTC files](#convert-illumina-idat-files-to-gtc-files)
   * [Convert Illumina GTC files to VCF](#convert-illumina-gtc-files-to-vcf)
   * [Convert Affymetrix CEL files to CHP files](#convert-affymetrix-cel-files-to-chp-files)
   * [Convert Affymetrix CHP files to VCF](#convert-affymetrix-chp-files-to-vcf)
   * [Using an alternative genome reference](#using-an-alternative-genome-reference)
   * [Detect contamination](#detect-contamination)
   * [Plot variants](#plot-variants)
   * [Illumina GenCall](#illumina-gencall)
      * [Illumina AutoConvert](#illumina-autoconvert)
      * [Illumina AutoConvert 2.0](#illumina-autoconvert-2-0)
      * [Illumina Array Analysis Platform Genotyping Command Line Interface](#illumina-array-analysis-platform-genotyping-command-line-interface)
      * [Illumina Microarray Analytics Array Analysis Command Line Interface](#illumina-microarray-analytics-array-analysis-command-line-interface)
   * [Acknowledgements](#acknowledgements)
<!--te-->

Usage
=====

Illumina data tool:
```
Usage: bcftools +gtc2vcf [options] [<A.gtc> ...]

Plugin options:
    -l, --list-tags                   list available FORMAT tags with description for VCF output
    -t, --tags LIST                   list of output FORMAT tags [GT,GQ,IGC,BAF,LRR,NORMX,NORMY,R,THETA,X,Y]
    -b, --bpm <file>                  BPM manifest file
    -c, --csv <file>                  CSV manifest file (can be gzip compressed)
    -e, --egt <file>                  EGT cluster file
    -f, --fasta-ref <file>            reference sequence in fasta format
        --set-cache-size <int>        select fasta cache size in bytes
        --gc-window-size <int>        window size in bp used to compute the GC content (-1 for no estimate) [200]
    -g, --gtcs <dir|file>             GTC genotype files from directory or list from file
    -i, --idat                        input IDAT files rather than GTC files
        --capacity <int>              number of variants to read from intensity files per I/O operation [32768]
        --adjust-clusters             adjust cluster centers in (Theta, R) space (requires --bpm and --egt)
        --use-gtc-sample-names        use sample name in GTC files rather than GTC file name
        --do-not-check-bpm            do not check whether BPM and GTC files match manifest file name
        --do-not-check-eof            do not check whether the BPM and EGT readers reach the end of the file
        --genome-studio <file>        input a GenomeStudio final report file (in matrix format)
        --no-version                  do not append version and command line to the header
    -o, --output <file>               write output to a file [standard output]
    -O, --output-type u|b|v|z|t[0-9]  u/b: un/compressed BCF, v/z: un/compressed VCF
                                      t: GenomeStudio tab-delimited text output, 0-9: compression level [v]
        --threads <int>               number of extra output compression threads [0]
    -x, --extra <file>                write GTC metadata to a file
    -v, --verbose                     print verbose information
    -W, --write-index[=FMT]           Automatically index the output files [off]

Manifest options:
        --beadset-order               output BeadSetID normalization order (requires --bpm and --csv)
        --fasta-flank                 output flank sequence in FASTA format (requires --csv)
    -s, --sam-flank <file>            input flank sequence alignment in SAM/BAM format (requires --csv)
        --genome-build <assembly>     genome build ID used to update the manifest file [GRCh38]

Examples:
    bcftools +gtc2vcf -i 5434246082_R03C01_Grn.idat
    bcftools +gtc2vcf 5434246082_R03C01.gtc
    bcftools +gtc2vcf -b HumanOmni2.5-4v1_H.bpm -c HumanOmni2.5-4v1_H.csv
    bcftools +gtc2vcf -e HumanOmni2.5-4v1_H.egt
    bcftools +gtc2vcf -c GSA-24v3-0_A1.csv -e GSA-24v3-0_A1_ClusterFile.egt -f human_g1k_v37.fasta -o GSA-24v3-0_A1.vcf
    bcftools +gtc2vcf -c HumanOmni2.5-4v1_H.csv -f human_g1k_v37.fasta 5434246082_R03C01.gtc -o 5434246082_R03C01.vcf
    bcftools +gtc2vcf -f human_g1k_v37.fasta --genome-studio GenotypeReport.txt -o GenotypeReport.vcf

Examples of manifest file options:
    bcftools +gtc2vcf -b GSA-24v3-0_A1.bpm -c GSA-24v3-0_A1.csv --beadset-order
    bcftools +gtc2vcf -c GSA-24v3-0_A1.csv --fasta-flank -o GSA-24v3-0_A1.fasta
    bwa mem -M GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GSA-24v3-0_A1.fasta -o GSA-24v3-0_A1.sam
    bcftools +gtc2vcf -c GSA-24v3-0_A1.csv --sam-flank GSA-24v3-0_A1.sam -o GSA-24v3-0_A1.GRCh38.csv
```

Affymetrix data tool:
```
Usage: bcftools +affy2vcf [options] --csv <file> --fasta-ref <file> [<A.chp> ...]

Plugin options:
    -l, --list-tags                 list available FORMAT tags with description for VCF output
    -t, --tags LIST                 list of output FORMAT tags [GT,CONF,BAF,LRR,NORMX,NORMY,DELTA,SIZE]
    -c, --csv <file>                CSV manifest file (can be gzip compressed)
    -f, --fasta-ref <file>          reference sequence in fasta format
        --set-cache-size <int>      select fasta cache size in bytes
        --gc-window-size <int>      window size in bp used to compute the GC content (-1 for no estimate) [200]
        --probeset-ids              tab delimited file with column 'probeset_id' specifying probesets to convert
        --calls <file>              apt-probeset-genotype calls output (can be gzip compressed)
        --confidences <file>        apt-probeset-genotype confidences output (can be gzip compressed)
        --summary <file>            apt-probeset-genotype summary output (can be gzip compressed)
        --snp <file>                apt-probeset-genotype SNP posteriors output (can be gzip compressed)
        --chps <dir|file>           input CHP files rather than tab delimited files
        --cel <file>                input CEL files rather than CHP files
        --adjust-clusters           adjust cluster centers in (Contrast, Size) space (requires --snp)
        --no-version                do not append version and command line to the header
    -o, --output <file>             write output to a file [standard output]
    -O, --output-type u|b|v|z[0-9]  u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
        --threads <int>             number of extra output compression threads [0]
    -x, --extra <file>              write CHP metadata to a file (requires CHP files)
    -v, --verbose                   print verbose information
    -W, --write-index[=FMT]         Automatically index the output files [off]

Manifest options:
        --fasta-flank               output flank sequence in FASTA format (requires --csv)
    -s, --sam-flank <file>          input flank sequence alignment in SAM/BAM format (requires --csv)

Examples:
    bcftools +affy2vcf \
        --csv GenomeWideSNP_6.na35.annot.csv \
        --fasta-ref human_g1k_v37.fasta \
        --chps cc-chp/ \
        --snp AxiomGT1.snp-posteriors.txt \
        --output AxiomGT1.vcf \
        --extra report.tsv
    bcftools +affy2vcf \
        --csv GenomeWideSNP_6.na35.annot.csv \
        --fasta-ref human_g1k_v37.fasta \
        --calls AxiomGT1.calls.txt \
        --confidences AxiomGT1.confidences.txt \
        --summary AxiomGT1.summary.txt \
        --snp AxiomGT1.snp-posteriors.txt \
        --output AxiomGT1.vcf

Examples of manifest file options:
    bcftools +affy2vcf -c GenomeWideSNP_6.na35.annot.csv --fasta-flank -o GenomeWideSNP_6.fasta
    bwa mem -M GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GenomeWideSNP_6.fasta -o GenomeWideSNP_6.sam
    bcftools +affy2vcf -c GenomeWideSNP_6.na35.annot.csv -s GenomeWideSNP_6.sam -o GenomeWideSNP_6.na35.annot.GRCh38.csv
```

Installation
============

Install basic tools (Debian/Ubuntu specific if you have admin privileges)
```
sudo apt install wget unzip git g++ zlib1g-dev bwa samtools msitools cabextract mono-devel libgdiplus icu-devtools bcftools
```

Optionally, you can install these libraries to activate further HTSlib features
```
sudo apt install libbz2-dev libssl-dev liblzma-dev libgsl0-dev
```

Preparation steps
```
mkdir -p $HOME/bin $HOME/GRCh3{7,8} && cd /tmp
```

We recommend compiling the source code. Where this is not possible, Linux x86_64 pre-compiled binaries are available for download [here](http://software.broadinstitute.org/software/gtc2vcf). Note, however, that you will need BCFtools version 1.20 or newer. A previous version of the plugin is also available through [bioconda](http://anaconda.org/bioconda/bcftools-gtc2vcf-plugin)

Download latest version of [HTSlib](http://github.com/samtools/htslib) and [BCFtools](http://github.com/samtools/bcftools) (if not downloaded already)
```
wget http://github.com/samtools/bcftools/releases/download/1.20/bcftools-1.20.tar.bz2
tar xjvf bcftools-1.20.tar.bz2
```

Download and compile plugins code (make sure you are using gcc version 5 or newer)
```
cd bcftools-1.20/
/bin/rm -f plugins/{idat2gtc.c,gtc2vcf.{c,h},affy2vcf.c}
wget -P plugins http://raw.githubusercontent.com/freeseek/gtc2vcf/master/{idat2gtc.c,gtc2vcf.{c,h},affy2vcf.c,BAFregress.c}
make
/bin/cp bcftools plugins/{idat2gtc,gtc2vcf,affy2vcf,BAFregress}.so $HOME/bin/
```

Make sure the directory with the plugins is available to BCFtools
```
export PATH="$HOME/bin:$PATH"
export BCFTOOLS_PLUGINS="$HOME/bin"
```
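
Before running any conversion it can help to confirm that all four plugin files were actually copied. The following is only a sketch: a scratch directory stands in for `$HOME/bin`, and one plugin is deliberately left out to show what a failed check looks like
```bash
# Assumption for this sketch: a scratch directory stands in for $HOME/bin
plugin_dir=$(mktemp -d)
touch "$plugin_dir/idat2gtc.so" "$plugin_dir/gtc2vcf.so" "$plugin_dir/affy2vcf.so"

# Collect the names of plugins whose shared object is not in the directory
missing=""
for plugin in idat2gtc gtc2vcf affy2vcf BAFregress; do
  [ -f "$plugin_dir/$plugin.so" ] || missing="$missing $plugin"
done
echo "missing plugins:${missing:- none}"
```
With a real installation you would set `plugin_dir="$BCFTOOLS_PLUGINS"` instead. Running `bcftools plugin -l` should also list the plugins that BCFtools can load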

Install the GRCh37 human genome reference
```
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | \
  gzip -d > $HOME/GRCh37/human_g1k_v37.fasta
samtools faidx $HOME/GRCh37/human_g1k_v37.fasta
bwa index $HOME/GRCh37/human_g1k_v37.fasta
```

Install the GRCh38 human genome reference (following the suggestion from [Heng Li](http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use))
```
wget -O- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \
  gzip -d > $HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
samtools faidx $HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
bwa index $HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
```

Affymetrix provides the [Analysis Power Tools (APT)](http://www.thermofisher.com/us/en/home/life-science/microarray-analysis/microarray-analysis-partners-programs/affymetrix-developers-network/affymetrix-power-tools.html) for free, which allow you to call genotypes from raw intensity data using an algorithm derived from [BRLMM-P](http://tools.thermofisher.com/content/sfs/brochures/brlmmp_whitepaper.pdf)
```
mkdir -p $HOME/bin && cd /tmp
wget http://downloads.thermofisher.com/APT/APT_2.11.8/apt_2.11.8_linux_64_x86_binaries.zip
unzip -ojd $HOME/bin apt_2.11.8_linux_64_x86_binaries.zip apt_2.11.8_linux_64_x86_binaries/bin/apt-probeset-genotype
chmod a+x $HOME/bin/apt-probeset-genotype
```

Identifying chip type for IDAT and CEL files
============================================

To convert a pair of green and red IDAT files with raw Illumina intensities into a GTC file with genotype calls you need to provide both a BPM manifest file with the location of the probes and an EGT cluster file with the expected intensities of each genotype cluster. It is important to provide the correct BPM and EGT files otherwise the calling will fail possibly generating a GTC file with meaningless calls. Unfortunately newer IDAT files do not contain information about which BPM manifest file to use. The gtc2vcf bcftools plugin can be used to guess which files to use
```
path_to_idat_folder="..."
bcftools +gtc2vcf \
  -i -g $path_to_idat_folder
```
This will generate a spreadsheet table with information about each IDAT file including a guess for what manifest and cluster files you should use. If a guess is not provided, contact the [author](mailto:giulio.genovese@gmail.com) for troubleshooting

Similarly, you can use the affy2vcf bcftools plugin to extract chip type information from CEL files
```
path_to_cel_folder="..."
bcftools +affy2vcf \
  --cel --chps $path_to_cel_folder
```

Convert Illumina IDAT files to GTC files
========================================

The idat2gtc bcftools plugin can be used to convert Illumina IDAT files to GTC files
```
bpm_manifest_file="..."
egt_cluster_file="..."
bcftools +idat2gtc \
  --bpm $bpm_manifest_file \
  --egt $egt_cluster_file \
  --idats $path_to_idat_folder \
  --output $path_to_gtc_folder
```
The output is equivalent to the output of the Illumina GenCall algorithm while being significantly faster

If you do not have the manifest and cluster files for the Illumina IDAT files you are trying to convert, make sure to check the links [here](Illumina.md)

If you run the command with the option `--autocall-date ""` then the output should be deterministic and using the `--preset` option you can generate output equivalent to the output you obtain with any of the following:

* [Illumina AutoConvert](#autoconvert)
* [Illumina AutoConvert 2.0](#autoconvert-2-0)
* [Illumina Array Analysis Platform Genotyping Command Line Interface](#iaap-cli)
* [Illumina Microarray Analytics Array Analysis Command Line Interface](#array-analysis-cli)

If you similarly patch those tools to make them generate deterministic output, you should be able to verify that you get the same md5sum
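
A minimal sketch of such a comparison is shown below. The function name `same_md5` and the file paths in the usage comment are hypothetical, chosen only for illustration
```bash
# Hypothetical helper: succeed only if two files have identical MD5 checksums.
same_md5() {
  sum_a="$(md5sum < "$1" | cut -d' ' -f1)"
  sum_b="$(md5sum < "$2" | cut -d' ' -f1)"
  [ "$sum_a" = "$sum_b" ]
}

# Example (both paths are placeholders):
# same_md5 "$path_to_gtc_folder/sample.gtc" "$path_to_output_folder/sample.gtc" && echo "identical"
```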

Convert Illumina GTC files to VCF
=================================

Specifications for Illumina BPM, EGT, and GTC files were obtained through Illumina's [BeadArrayFiles](http://github.com/Illumina/BeadArrayFiles) library and [GTCtoVCF](http://github.com/Illumina/GTCtoVCF) script. Specifications for IDAT files were obtained through Henrik Bengtsson's [illuminaio](http://github.com/HenrikBengtsson/illuminaio) package
```
bpm_manifest_file="..."
csv_manifest_file="..."
egt_cluster_file="..."
path_to_gtc_folder="..."
ref="$HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" # or ref="$HOME/GRCh37/human_g1k_v37.fasta"
out_prefix="..."
bcftools +gtc2vcf \
  --no-version -Ou \
  --bpm $bpm_manifest_file \
  --csv $csv_manifest_file \
  --egt $egt_cluster_file \
  --gtcs $path_to_gtc_folder \
  --fasta-ref $ref \
  --extra $out_prefix.tsv | \
  bcftools sort -Ou -T ./bcftools. | \
  bcftools norm --no-version -o $out_prefix.bcf -Ob -c x -f $ref --write-index
```
Heavy random access to the reference will be needed, so it is important that enough extra memory be available for the operating system to cache the reference or else the task can run excruciatingly slowly. Notice that the gtc2vcf bcftools plugin will drop unlocalized variants. The final VCF might contain duplicates. If this is an issue, `bcftools norm -d exact` can be used to remove such variants. At least one of the BPM or the CSV manifest files has to be provided. Normalized intensities cannot be computed without the BPM manifest file. Indel alleles cannot be inferred and will be skipped without the CSV manifest file. Information about genotype cluster centers will be included in the VCF if the EGT cluster file is provided. You can use gtc2vcf to convert one GTC file at a time, but we strongly advise converting multiple files at once as single-sample VCF files will consume a lot of storage space. If you convert hundreds of GTC files at once, you can use the `--adjust-clusters` option which will recenter the genotype clusters rather than using those provided in the EGT cluster file and will compute less noisy LRR values. If you use the `--adjust-clusters` option and you are using the output for calling [mosaic chromosomal alterations](http://github.com/freeseek/mocha), then it is safe to turn the median BAF/LRR adjustments off during that step (i.e. use `--adjust-BAF-LRR -1`)
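
As a rough way to judge whether the reference can be cached, the sketch below compares the size of the FASTA file with the memory the Linux kernel reports as available. This is only a heuristic under the assumption that `/proc/meminfo` provides a `MemAvailable` field, and the function name `ref_fits_in_memory` is hypothetical
```bash
# Hypothetical helper: compare the size of a reference FASTA with the memory
# the kernel reports as available (Linux only, reads /proc/meminfo).
# An optional second argument overrides the available memory in KiB.
ref_fits_in_memory() {
  ref_kib=$(( $(wc -c < "$1") / 1024 ))
  avail_kib="${2:-$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)}"
  [ "$ref_kib" -lt "$avail_kib" ]
}

# Example: ref_fits_in_memory "$ref" || echo "reference may not fit in the page cache"
```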

Optionally, between the conversion and the sorting step you can include a `bcftools reheader --samples <file>` command to assign new names to the samples where `<file>` contains `old_name new_name\n` pairs separated by whitespace, each on a separate line, with `old_name` being the GTC file name without the `.gtc` extension in this case
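
The sample renaming file can be generated from the GTC folder, as in the following sketch. The function name `make_rename_table` and the prefix-based naming scheme are assumptions made for illustration; any mapping from old to new names would work
```bash
# Hypothetical helper: print "old_name new_name" pairs for every GTC file in a
# folder, where old_name is the file name without the .gtc extension and
# new_name is old_name with a user-chosen prefix (an arbitrary naming scheme).
make_rename_table() {
  folder="$1"
  prefix="$2"
  for gtc in "$folder"/*.gtc; do
    [ -e "$gtc" ] || continue
    old_name="$(basename "$gtc" .gtc)"
    printf '%s %s\n' "$old_name" "$prefix$old_name"
  done
}

# Example: make_rename_table "$path_to_gtc_folder" "study1_" > rename.txt
# and then add `bcftools reheader --samples rename.txt` to the pipeline.
```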

When running the conversion, the gtc2vcf plugin will double check that the SNP manifest metadata information in the GTC file matches the descriptor file name in the BPM file to make sure you are using the correct manifest file. Sometimes, due to discrepancies between the BPM file name provided by Illumina and the internal descriptor file name, this safety check fails. To turn off this feature in these cases, you can use option `--do-not-check-bpm`

Convert Affymetrix CEL files to CHP files
=========================================

Affymetrix provides a best practice workflow for genotyping data generated using [SNP6](http://www.affymetrix.com/support/developer/powertools/changelog/VIGNETTE-snp6-on-axiom.html) and [Axiom](http://www.affymetrix.com/support/developer/powertools/changelog/VIGNETTE-Axiom-probeset-genotype.html) arrays. As an example, the following command will run the genotyping for the Affymetrix SNP6 array:
```
path_to_output_folder="..."
cel_list_file="..."
apt-probeset-genotype \
  --analysis-files-path . \
  --xml-file GenomeWideSNP_6.apt-probeset-genotype.AxiomGT1.xml \
  --out-dir $path_to_output_folder \
  --cel-files $cel_list_file \
  --special-snps GenomeWideSNP_6.specialSNPs \
  --chip-type GenomeWideEx_6 \
  --chip-type GenomeWideSNP_6 \
  --table-output false \
  --cc-chp-output \
  --write-models \
  --read-models-brlmmp GenomeWideSNP_6.generic_prior.txt
```
Affymetrix provides Library and NetAffx Annotation files for their arrays ([here](http://www.affymetrix.com/support/technical/byproduct.affx?cat=dnaarrays), [here](http://media.affymetrix.com/analysis/downloads/lf/genotyping), and [here](http://www.thermofisher.com/us/en/home/life-science/microarray-analysis/microarray-data-analysis/genechip-array-annotation-files.html))

As an example, the following commands will obtain the files necessary to run the genotyping for the Affymetrix SNP6 array:
```
wget http://tools.thermofisher.com/content/sfs/supportfiles/genomewidesnp6_libraryfile.zip
wget http://tools.thermofisher.com/content/sfs/supportfiles/SNP6_supplemental_axiom_analysis_files.zip
wget http://tools.thermofisher.com/content/sfs/supportfiles/GenomeWideSNP_6-na35-annot-csv.zip
unzip -oj genomewidesnp6_libraryfile.zip CD_GenomeWideSNP_6_rev3/Full/GenomeWideSNP_6/LibFiles/GenomeWideSNP_6.{cdf,chrXprobes,chrYprobes,specialSNPs}
unzip -o SNP6_supplemental_axiom_analysis_files.zip GenomeWideSNP_6.{generic_prior.txt,apt-probeset-genotype.AxiomGT1.xml,AxiomGT1.sketch}
unzip -o GenomeWideSNP_6-na35-annot-csv.zip GenomeWideSNP_6.na35.annot.csv
```

Note: If the program exits due to different chip types or probe counts with an error message such as `Wrong CEL ChipType: expecting: 'GenomeWideSNP_6' and #######.CEL is: 'GenomeWideEx_6'`, then make sure you include either the options `--chip-type GenomeWideEx_6 --chip-type GenomeWideSNP_6` or the option `--force` in the command line to solve the problem
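
The file passed to `--cel-files` can be generated as sketched below. This assumes the layout that APT is understood to expect, a `cel_files` header line followed by one CEL path per line; the function name `make_cel_list` is hypothetical
```bash
# Hypothetical helper: write a list of CEL files in the layout that
# apt-probeset-genotype is understood to expect for --cel-files
# (a "cel_files" header line followed by one path per line).
make_cel_list() {
  folder="$1"
  echo "cel_files"
  for cel in "$folder"/*.CEL "$folder"/*.cel; do
    [ -e "$cel" ] || continue
    echo "$cel"
  done
}

# Example: make_cel_list "$path_to_cel_folder" > "$cel_list_file"
```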

Convert Affymetrix CHP files to VCF
===================================

The affy2vcf bcftools plugin can be used to convert Affymetrix CHP files to VCF
```
csv_manifest_file="..." # for example csv_manifest_file="GenomeWideSNP_6.na35.annot.csv"
ref="$HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" # or ref="$HOME/GRCh37/human_g1k_v37.fasta"
path_to_chp_folder="cc-chp"
path_to_txt_folder="..."
out_prefix="..."
bcftools +affy2vcf \
  --no-version -Ou \
  --csv $csv_manifest_file \
  --fasta-ref $ref \
  --chps $path_to_chp_folder \
  --snp $path_to_txt_folder/AxiomGT1.snp-posteriors.txt \
  --extra $out_prefix.tsv | \
  bcftools sort -Ou -T ./bcftools. | \
  bcftools norm --no-version -o $out_prefix.bcf -Ob -c x -f $ref --write-index
```
Heavy random access to the reference will be needed, so it is important that enough extra memory be available for the operating system to cache the reference or else the task can run excruciatingly slowly. The final VCF might contain duplicates. If this is an issue `bcftools norm -d exact` can be used to remove such variants. There is often no need to use the `--adjust-clusters` option for Affymetrix data as the cluster posteriors are already adjusted using the data processed by the genotype caller

Optionally, between the conversion and the sorting step you can include a `bcftools reheader --samples <file>` command to assign new names to the samples where `<file>` contains `old_name new_name\n` pairs separated by whitespace, each on a separate line, with `old_name` being the CHP file name without the `.chp` extension

Using an alternative genome reference
=====================================

Illumina provides [GRCh38/hg38](http://support.illumina.com/bulletins/2017/04/infinium-human-genotyping-manifests-and-support-files--with-anno.html) manifests for many of its genotyping arrays. However, if your genotyping array is not supported for the newer reference by Illumina, you can use the `--fasta-flank` and `--sam-flank` options to realign the flank sequences from the manifest files you have and recompute the marker positions. This approach uses [flank sequence](http://support.illumina.com/bulletins/2016/05/infinium-genotyping-manifest-column-headings.html) and [strand](http://support.illumina.com/bulletins/2017/06/how-to-interpret-dna-strand-and-allele-information-for-infinium-.html) information to identify the marker [coordinates](http://support.illumina.com/bulletins/2016/06/-infinium-genotyping-array-manifest-files-what-does-chr-or-mapinfo---mean.html). It will need a sequence aligner such as `bwa` to realign the sequences and it seems to reproduce the coordinates provided by Illumina more than 99.9% of the time. Mapping information will follow the [implicit dbSNP standard](http://github.com/Illumina/GTCtoVCF#manifests). Occasionally the flank sequence provided by Illumina is incorrect and it is impossible to recover the correct marker coordinate from the flank sequence alone

You first have to generate an alignment file for the flank sequences from a CSV manifest file
```
csv_manifest_file="..."
ref="$HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" # or ref="$HOME/GRCh37/human_g1k_v37.fasta"
bam_alignment_file="..."
bcftools +gtc2vcf \
  -c $csv_manifest_file \
  --fasta-flank | \
  bwa mem -M $ref - | \
  samtools view -bS \
  -o $bam_alignment_file
```
Notice that you need to use the `-M` option to mark shorter split hits as secondary and you should not sort the output BAM file, as gtc2vcf expects it to have the sequences in the same order as in the CSV file. Then you load the alignment file while converting your GTC files to VCF by including the `-s $bam_alignment_file` option

Some older manifest files from Illumina have thousands of markers with incorrect RefStrand annotations that will lead to incorrect genotypes. While Illumina has not explained why this is the case, it still distributes incorrect manifests. If you are using one of the following manifests
```
Human1M-Duov3_H
Human610-Quadv1_H
Human660W-Quad_v1_H
HumanCytoSNP-12v2-1_Anova
HumanOmni1-Quad_v1-0-Multi_H
HumanOmni1-Quad_v1-0_H
```
we advise either contacting Illumina to demand a fixed version or using gtc2vcf to realign the flank sequences

Also, Illumina assigns chromosomal positions to indels by first left aligning the flank sequences in an incoherent way (see [here](http://github.com/Illumina/GTCtoVCF/blob/develop/BPMRecord.py)). Apparently this is incoherent enough that Illumina also cannot get the coordinates of homopolymer indels right. For example, chromosome 13 ClinVar indel [rs80359507](http://www.ncbi.nlm.nih.gov/clinvar/variation/37959) is assigned to position 32913838 in the manifest file for the GSA-24v2-0 array, but it is assigned to position 32913837 in the manifest file for the GSA-24v3-0 array (GRCh37 coordinates). If you want to trust genotypes at homopolymer indels, we advise using gtc2vcf to realign the flank sequences

We also found numerous examples of markers from Illumina manifest files that are mapped to the wrong chromosome, such as markers rs10465468, rs12401272, rs185597746, rs188145685 which are localized over XY in the Illumina manifest files for the GSA-24v2-0 array and the GSA-24v3-0 array but their flank sequences map to chromosome Y. If you trust the flank sequences more than the coordinates from the Illumina manifest files, we advise using gtc2vcf to realign the flank sequences

The same functionality exists for the affy2vcf tool to convert Affymetrix data

Detect contamination
====================

To detect contamination we use a model similar to that employed by [BAFRegress](http://genome.sph.umich.edu/wiki/BAFRegress) and described in [Jun et al. 2012](http://doi.org/10.1016/j.ajhg.2012.09.004), which estimates BAF deviations at homozygous sites towards reference population means. The model needs allele frequencies which can be inferred from the BCFtools/gtc2vcf output:
```
bcftools +BAFregress $out_prefix.bcf
```
or they can be inferred from a separate resource:
```
bcftools +BAFregress --af 1kGP_high_coverage_Illumina.sites.bcf --tag AF $out_prefix.bcf
```

Plot variants
=============

Install basic tools (Debian/Ubuntu specific if you have admin privileges):
```
sudo apt install r-cran-optparse r-cran-ggplot2 r-cran-data.table r-cran-gridextra
```

Download R scripts
```
/bin/rm -f $HOME/bin/gtc2vcf_plot.R
wget -P $HOME/bin http://raw.githubusercontent.com/freeseek/gtc2vcf/master/gtc2vcf_plot.R
chmod a+x $HOME/bin/gtc2vcf_plot.R
```

Plot variant (for Illumina data)
```
gtc2vcf_plot.R \
  --illumina \
  --vcf input.vcf \
  --chrom 11 \
  --pos 66328095 \
  --png rs1815739.png
```

![](rs1815739.png)

Plot variant (for Affymetrix data)
```
gtc2vcf_plot.R \
  --affymetrix \
  --vcf input.vcf \
  --chrom 1 \
  --pos 196642233 \
  --png rs800292.png
```

![](rs800292.png)

Illumina GenCall
================

To genotype raw Illumina IDAT intensity files using Illumina GenCall algorithms, Illumina has provided over the years several command line interfaces written in the .NET language:
- [AutoConvert](http://support.illumina.com/array/array_software/beeline/downloads.html) (2011)
- [AutoConvert 2.0](http://support.illumina.com/array/array_software/beeline/downloads.html) (2017)
- [IAAP CLI](http://support.illumina.com/array/array_software/illumina-array-analysis-platform.html) (2019)
- [Array Analysis CLI](http://support.illumina.com/array/array_software/ima-array-analysis-cli/downloads.html) (2023)

We provide instructions to install and run these interfaces. The `sed -i -e ':a' -e 'N' -e '$!ba'` installation commands are used to prevent the interfaces from timestamping the output GTC files by removing the [System.DateTime](http://learn.microsoft.com/en-us/dotnet/api/system.datetime) calls and accesses to the [CreationTime](http://learn.microsoft.com/en-us/dotnet/api/system.io.filesysteminfo.creationtime) property from the binaries, with the goal of making each execution completely reproducible. AutoConvert 2.0, IAAP-CLI, and Array Analysis CLI binaries will all perform version 1.2.0 of the normalization step and seem to produce the exact same results while AutoConvert will only perform version 1.1.2 of the normalization step yielding somewhat different results. If you want to run these binaries but fail to download them, contact the [author](mailto:giulio.genovese@gmail.com) for troubleshooting
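
The `sed` idiom used in these commands loads the whole file into the pattern space so that a substitution can match across newline bytes. The toy example below illustrates this on a small text file; it assumes GNU sed and uses made-up content rather than any of the real byte patterns. Keeping the replacement the same length as the pattern leaves the file size unchanged, which appears to be what the installation commands rely on when patching binaries
```bash
# Toy demonstration on a text file; the byte patterns used for the real
# binaries are unrelated to this example. Requires GNU sed.
demo_file="$(mktemp)"
printf 'AB\nCD\n' > "$demo_file"
size_before="$(wc -c < "$demo_file")"
# ':a' defines a label, 'N' appends the next line to the pattern space, and
# '$!ba' jumps back to the label unless the last line has been reached, so
# the final substitution sees the whole file, newlines included.
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/B\nC/X\nY/' "$demo_file"
size_after="$(wc -c < "$demo_file")"
cat "$demo_file"
```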

Illumina also provides the [Beeline](http://support.illumina.com/array/array_software/beeline.html) software for free and this includes the AutoConvert.exe command line executable which allows calling genotypes from raw intensity data using Illumina's proprietary GenCall algorithm. AutoConvert is almost entirely written in Mono/.Net language, except for one small mathematical function (findClosestSitesToPointsAlongAxis) which is included within a Windows PE32+ library (MathRoutines.dll). As this is [unmanaged code](http://www.mono-project.com/docs/advanced/embedding/), to be run on Linux with [Mono](http://www.mono-project.com/) it needs to be embedded in an equivalent Linux ELF64 library (libMathRoutines.dll.so) as shown below. This function is run as part of the [normalization](http://doi.org/10.1093/bioinformatics/btm443) of the raw intensities when sampling [400 candidate homozygotes](http://dnatech.genomecenter.ucdavis.edu/wp-content/uploads/2013/06/illumina_gt_normalization.pdf) before calling genotypes.

Illumina AutoConvert
--------------------

To run Illumina AutoConvert (version 1.6.3.1) you will need to fix the hardcoded Windows [backslashes](http://en.wikipedia.org/wiki/Backslash) into UNIX [slashes](http://en.wikipedia.org/wiki/Slash_(punctuation)), as shown below
```
mkdir -p $HOME/bin && cd /tmp
wget http://support.illumina.com/content/dam/illumina-support/documents/downloads/software/beeline/autoconvert-software-v1-6-3-installer.zip
wget http://raw.githubusercontent.com/freeseek/gtc2vcf/master/nearest_neighbor.c
unzip -o autoconvert-software-v1-6-3-installer.zip 
msiextract -C Illumina/AutoConvert SetupAutoConvert64_1.6.3.1.msi
msiextract -l SetupAutoConvert64_1.6.3.1.msi | grep DLL$ | while read dll; do mv Illumina/AutoConvert/$dll Illumina/AutoConvert/${dll%DLL}dll; done
gcc -fPIC -shared -O2 -o Illumina/AutoConvert/libMathRoutines.dll.so nearest_neighbor.c
sed -i 's/\x00\x03\\\x00/\x00\x03\/\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i 's/G\x00R\x00N\x00.\x00i\x00d\x00a\x00t\x00/G\x00r\x00n\x00.\x00i\x00d\x00a\x00t\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i 's/R\x00E\x00D\x00.\x00i\x00d\x00a\x00t\x00/R\x00e\x00d\x00.\x00i\x00d\x00a\x00t\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i 's/\\\x00M\x00o\x00d\x00u\x00l\x00e\x00s\x00\\\x00B\x00S\x00G\x00T\x00\\\x00C\x00l\x00u\x00s\x00t\x00e\x00r\x00A\x00l\x00g\x00o\x00r\x00i\x00t\x00h\x00m\x00s\x00\\\x00/\/\x00M\x00o\x00d\x00u\x00l\x00e\x00s\x00\/\x00B\x00S\x00G\x00T\x00\/\x00C\x00l\x00u\x00s\x00t\x00e\x00r\x00A\x00l\x00g\x00o\x00r\x00i\x00t\x00h\x00m\x00s\x00\/\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i 's/\\\x00M\x00o\x00d\x00u\x00l\x00e\x00s\x00\\\x00B\x00S\x00G\x00T\x00/\/\x00M\x00o\x00d\x00u\x00l\x00e\x00s\x00\/\x00B\x00S\x00G\x00T\x00/' Illumina/AutoConvert/Modules/BSGT/ClusterAlgorithms/{GoldenGate/GGCA,InfiniumII/I2CA,GenTrain/ILCA}.dll
sed -i 's/\\\x00d\x00a\x00t\x00.\x00b\x00i\x00n\x00/\/\x00d\x00a\x00t\x00.\x00b\x00i\x00n\x00/' Illumina/AutoConvert/Modules/BSGT/ClusterAlgorithms/{GoldenGate/GGCA,InfiniumII/I2CA,GenTrain/ILCA}.dll
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x28\xa6\x00\x00\x0a\x13\x40\x12\x40\x28\xa7\x00\x00\x0a\x72\xad\x12\x00\x70\x28\xa6\x00\x00\x0a\x13\x40\x12\x40\x28\xa8\x00\x00\x0a\x28\x23\x00\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7e\x16\x00\x00\x0a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/' Illumina/AutoConvert/AutoCallLib.dll
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x11\x0e\x6f\xe5\x00\x00\x0a\x13\x11\x12\x11\x28\xe6\x00\x00\x0a\x72\xad\x12\x00\x70\x11\x0e\x6f\xe5\x00\x00\x0a\x13\x12\x12\x12\x28\xe7\x00\x00\x0a\x28\x23\x00\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7e\x16\x00\x00\x0a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/' Illumina/AutoConvert/AutoCallLib.dll
rm autoconvert-software-v1-6-3-installer.zip SetupAutoConvert64_1.6.3.1.msi nearest_neighbor.c
mv Illumina/AutoConvert $HOME/bin/
rmdir Illumina
```

You can run Illumina's proprietary GenCall algorithm on a single IDAT file pair
```
mono $HOME/bin/AutoConvert/AutoConvert.exe \
  $idat_green_file \
  $path_to_output_folder \
  $bpm_manifest_file \
  $egt_cluster_file
```
Make sure that the red IDAT file is in the same folder as the green IDAT file. Alternatively you can run on multiple IDAT file pairs
```
mono $HOME/bin/AutoConvert/AutoConvert.exe \
  $path_to_idat_folder \
  $path_to_output_folder \
  $bpm_manifest_file \
  $egt_cluster_file
```
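
Before running on a folder, it can help to confirm that every green IDAT file has its red counterpart next to it. The sketch below assumes the `_Grn.idat` and `_Red.idat` file name suffixes; the function name `check_idat_pairs` is hypothetical
```bash
# Hypothetical helper: list green IDAT files whose red counterpart is missing
# from the same folder; returns non-zero if any pair is incomplete.
check_idat_pairs() {
  folder="$1"
  status=0
  for grn in "$folder"/*_Grn.idat; do
    [ -e "$grn" ] || continue
    red="${grn%_Grn.idat}_Red.idat"
    if [ ! -e "$red" ]; then
      echo "missing red IDAT for: $grn" >&2
      status=1
    fi
  done
  return $status
}

# Example: check_idat_pairs "$path_to_idat_folder"
```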

Illumina AutoConvert 2.0
------------------------

To run Illumina AutoConvert 2.0 (version 2.0.1.179) you will need to separately download an additional Mono/.Net library (Heatmap.dll) from [GenomeStudio](http://support.illumina.com/array/array_software/genomestudio.html) or the [polyploid clustering module](http://support.illumina.com/downloads/genomestudio_polyploid_clustering_module_v1-0_software.html) and include it in your binary directory, most likely due to differences in how Mono and .Net resolve library dependencies, as shown below
```
mkdir -p $HOME/bin && cd /tmp
wget http://support.illumina.com/content/dam/illumina-support/documents/downloads/software/beeline/autoconvert-software-v2-0-1-installer.zip
wget http://support.illumina.com/content/dam/illumina-support/documents/downloads/software/genomestudio/genomestudiopolyploidclusteringv1-0.msi
wget http://raw.githubusercontent.com/freeseek/gtc2vcf/master/nearest_neighbor.c
unzip -o autoconvert-software-v2-0-1-installer.zip
msiextract AutoConvertInstaller.msi
msiextract genomestudiopolyploidclusteringv1-0.msi
mv Heatmap.DLL Illumina/AutoConvert\ 2.0/
gcc -fPIC -shared -O2 -o Illumina/AutoConvert\ 2.0/libMathRoutines.dll.so nearest_neighbor.c
sed -i 's/^\(     <AutosomalCallRateThreshold>\)0.97\(<\/AutosomalCallRateThreshold>\r\)$/\10.0\2/' Illumina/AutoConvert\ 2.0/AutoCallConfig.xml
sed -i 's/\\\x00d\x00a\x00t\x00.\x00b\x00i\x00n\x00/\/\x00d\x00a\x00t\x00.\x00b\x00i\x00n\x00/' Illumina/AutoConvert\ 2.0/{GGCA,I2CA,HDCA,ILCA,ILCA3}.dll
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x28\xc7\x00\x00\x0a\x13\x3f\x12\x3f\x28\xc8\x00\x00\x0a\x72\xa8\x15\x00\x70\x28\xc7\x00\x00\x0a\x13\x3f\x12\x3f\x28\xc9\x00\x00\x0a\x28\x1f\x00\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7e\x12\x00\x00\x0a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/' Illumina/AutoConvert\ 2.0/AutoCallLib.dll
msiextract -l genomestudiopolyploidclusteringv1-0.msi | grep -v Heatmap.DLL | xargs rm
rmdir Modules/BSPC/clusteralgorithms/*
rmdir -p Modules/BSPC/clusteralgorithms
rm autoconvert-software-v2-0-1-installer.zip AutoConvertInstaller.msi genomestudiopolyploidclusteringv1-0.msi nearest_neighbor.c
mv Illumina/AutoConvert\ 2.0 $HOME/bin/
rmdir Illumina
```
We change the autosomal call rate threshold to 0.0 to more aggressively call gender in lower quality samples

If you need to get the Heatmap.dll library from GenomeStudio instead, you can use the following code
```
wget ftp://webdata2:webdata2@ftp.illumina.com/downloads/software/genomestudio/genomestudio-software-v2-0-4-5-installer.zip
unzip -oj genomestudio-software-v2-0-4-5-installer.zip
cabextract GenomeStudioInstaller.exe
msiextract a0
mv Illumina/GenomeStudio\ 2.0/Heatmap.dll Illumina/AutoConvert\ 2.0/
rm genomestudio-software-v2-0-4-5-installer.zip GenomeStudioInstaller.exe {,a}0 u{0..5} Illumina/GenomeStudio\ 2.0 -r
```

You can run Illumina's proprietary GenCall algorithm on a single IDAT file pair
```
mono $HOME/bin/AutoConvert\ 2.0/AutoConvert.exe \
  $idat_green_file \
  $path_to_output_folder \
  $bpm_manifest_file \
  $egt_cluster_file
```
Make sure that the red IDAT file is in the same folder as the green IDAT file. Alternatively you can run on multiple IDAT file pairs
```
mono $HOME/bin/AutoConvert\ 2.0/AutoConvert.exe \
  $path_to_idat_folder \
  $path_to_output_folder \
  $bpm_manifest_file \
  $egt_cluster_file
```

Make sure that the IDAT files have the same name prefix as the IDAT folder name. The software might require up to 8GB of RAM to run. Illumina provides manifest (BPM) and cluster (EGT) files for their arrays [here](http://support.illumina.com/array/downloads.html). Notice that if you provide the wrong BPM file, you will get an error such as: `Normalization failed!  Unable to normalize!` and if you provide the wrong EGT file, you will get an error such as `System.Exception: Unrecoverable Error...Exiting! Unable to find manifest entry ######## in the cluster file!`
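
If the output of a run is saved to a log file, the two error messages quoted above can be used to point at the likely culprit. The following is a minimal sketch; the function name `diagnose_gencall_log` and the log file name are hypothetical, and the matched strings are taken from the messages above
```bash
# Hypothetical helper: scan a saved log for the two error messages quoted
# above and print a hint about which input file is likely wrong.
diagnose_gencall_log() {
  log="$1"
  if grep -q 'Normalization failed' "$log"; then
    echo "check BPM"
  elif grep -q 'Unable to find manifest entry' "$log"; then
    echo "check EGT"
  else
    echo "no known error"
  fi
}

# Example (the log path is a placeholder): diagnose_gencall_log autoconvert.log
```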

Illumina Array Analysis Platform Genotyping Command Line Interface
------------------------------------------------------------------

Illumina provides the [Illumina Array Analysis Platform Genotyping Command Line Interface](http://support.illumina.com/array/array_software/illumina-array-analysis-platform.html) software for free for research use and this includes the iaap-cli 1.1.0 which runs natively on Linux
```
mkdir -p $HOME/bin && cd /tmp
wget ftp://webdata2:webdata2@ftp.illumina.com/downloads/software/iaap/iaap-cli-linux-x64-1.1.0.tar.gz
tar xzvf iaap-cli-linux-x64-1.1.0.tar.gz -C $HOME/bin/ iaap-cli-linux-x64-1.1.0/iaap-cli --strip-components=1
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x28\x17\x01\x00\x0a\x13\x07\x12\x07\x72\xdd\x23\x00\x70\x28\x18\x01\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7e\x92\x00\x00\x0a\x00\x00\x00\x00\x00/' $HOME/bin/iaap-cli/ArrayAnalysis.NormToGenCall.Services.dll
rm iaap-cli-linux-x64-1.1.0.tar.gz
```

Once iaap-cli is properly installed in your system, run Illumina's proprietary GenCall algorithm on multiple IDAT file pairs
```
CLR_ICU_VERSION_OVERRIDE="$(uconv -V | sed 's/.* //g')" LANG="en_US.UTF-8" $HOME/bin/iaap-cli/iaap-cli \
  gencall \
  $bpm_manifest_file \
  $egt_cluster_file \
  $path_to_output_folder \
  --idat-folder $path_to_idat_folder \
  --output-gtc \
  --gender-estimate-call-rate-threshold 0.0
```
It is important to set the `LANG` environment variable to `en_US.UTF-8`, as a bug in `iaap-cli` causes malformed GTC files to be generated when it is set to other values. Due to another bug in `iaap-cli`, IDAT filenames cannot include more than two `_` characters and should be formatted as `BARCODE_POSITION_(Red|Grn).idat`. When using `iaap-cli` you cannot process old array manifest files with loci data encoded as version 5 or older, such as `HumanHap650Yv3_A.bpm`, as the corresponding code was not carried over and you will get the error `Error in reading file.  Unknown Manifest version`. The AutoConvert command line tool can read older manifest files. We change the autosomal call rate threshold to 0.0 both to more aggressively call gender in lower quality samples and to deal with an implementation issue that causes loci with null cluster scores to be included in the determination of the autosomal call rate threshold
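
The file name constraint can be checked ahead of time with a sketch like the one below, which accepts only names of the form `BARCODE_POSITION_(Red|Grn).idat` with exactly two `_` characters. The function name `valid_idat_name` is hypothetical
```bash
# Hypothetical helper: succeed if an IDAT file name has the form
# BARCODE_POSITION_(Red|Grn).idat with exactly two underscores.
valid_idat_name() {
  name="$(basename "$1")"
  case "$name" in
    *_*_*_*) return 1 ;;                       # more than two underscores
    ?*_?*_Red.idat | ?*_?*_Grn.idat) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
# for f in "$path_to_idat_folder"/*.idat; do valid_idat_name "$f" || echo "rename: $f"; done
```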

Illumina Microarray Analytics Array Analysis Command Line Interface
-------------------------------------------------------------------

Illumina provides the [Illumina Microarray Analytics Array Analysis Command Line Interface](http://support.illumina.com/array/array_software/ima-array-analysis-cli/downloads.html) software for free for research use and this includes the array-analysis-cli 2.1.0 which runs natively on Linux
```
mkdir -p $HOME/bin && cd /tmp
wget -O array-analysis-cli-linux-x64-v2.1.0.tar.gz "http://support.illumina.com/softwaredownload.html?assetId=72f8a34f-0933-4256-bad6-73d830436c74&assetDetails=IlluminaMicroarrayAnalyticsArrayAnalysisCLIv2.1LinuxInstaller-2.1-array-analysis-cli-linux-x64-v2.1.0.tar.gz"
tar xzvf array-analysis-cli-linux-x64-v2.1.0.tar.gz -C $HOME/bin/ --strip-components=1
sed -i -e ':a' -e 'N' -e '$!ba' -e 's/\x28\x89\x00\x00\x0a\x0A\x12\x00\x72\xa3\x15\x00\x70\x28\x8a\x00\x00\x0a/\x00\x00\x00\x00\x00\x00\x00\x00\x72\xfc\x0d\x00\x70\x00\x00\x00\x00\x00/' $HOME/bin/array-analysis-cli//ArrayAnalysis.Core.dll
rm array-analysis-cli-linux-x64-v2.1.0.tar.gz
```

Once array-analysis-cli is properly installed in your system, run Illumina's proprietary GenCall algorithm on multiple IDAT file pairs
```
$HOME/bin/array-analysis-cli/array-analysis-cli \
  genotype call \
  --bpm-manifest $bpm_manifest_file \
  --cluster-file $egt_cluster_file \
  --idat-folder .
```
With this interface we cannot change the autosomal call rate threshold to 0.0 to more aggressively call gender in lower quality samples, as the default 0.97 value is hardcoded

Acknowledgements
================

This work is supported by NIH grant [R01 HG006855](http://grantome.com/grant/NIH/R01-HG006855), NIH grant [R01 MH104964](http://grantome.com/grant/NIH/R01-MH104964), NIH grant [R01MH123451](http://grantome.com/grant/NIH/R01-MH123451), US Department of Defense Breast Cancer Research Breakthrough Award W81XWH-16-1-0316 (project BC151244), and the Stanley Center for Psychiatric Research


================================================
FILE: affy2vcf.c
================================================
/* The MIT License

   Copyright (c) 2018-2025 Giulio Genovese

   Author: Giulio Genovese <giulio.genovese@gmail.com>

   Permission is hereby granted, free of charge, to any person obtaining a copy
   of this software and associated documentation files (the "Software"), to deal
   in the Software without restriction, including without limitation the rights
   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
   copies of the Software, and to permit persons to whom the Software is
   furnished to do so, subject to the following conditions:

   The above copyright notice and this permission notice shall be included in
   all copies or substantial portions of the Software.

   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
   THE SOFTWARE.

 */

#include <getopt.h>
#include <errno.h>
#include <wchar.h>
#include <sys/resource.h>
#include <arpa/inet.h>
#include <htslib/vcf.h>
#include <htslib/kseq.h>
#include <htslib/khash_str2int.h>
#include "bcftools.h"
#include "gtc2vcf.h"

#define AFFY2VCF_VERSION "2025-10-08"

#define TAG_LIST_DFLT "GT,CONF,BAF,LRR,NORMX,NORMY,DELTA,SIZE"
#define GC_WIN_DFLT "200"

#define VERBOSE (1 << 0)
#define LOAD_CEL (1 << 1)
#define PROBESET_IDS_LOADED (1 << 2)
#define CALLS_LOADED (1 << 3)
#define CONFIDENCES_LOADED (1 << 4)
#define SUMMARY_LOADED (1 << 5)
#define SNP_LOADED (1 << 6)
#define ADJUST_CLUSTERS (1 << 7)
#define NO_INFO_GC (1 << 8)
#define FORMAT_GT (1 << 9)
#define FORMAT_CONF (1 << 10)
#define FORMAT_BAF (1 << 11)
#define FORMAT_LRR (1 << 12)
#define FORMAT_NORMX (1 << 13)
#define FORMAT_NORMY (1 << 14)
#define FORMAT_DELTA (1 << 15)
#define FORMAT_SIZE (1 << 16)

// #%affymetrix-algorithm-param-apt-opt-use-copynumber-call-codes=0
// #%call-code-1=NoCall:-1:2
// #%call-code-2=AA:0:2
// #%call-code-3=AB:1:2
// #%call-code-4=BB:2:2
#define GT_NC -1
#define GT_AA 0
#define GT_AB 1
#define GT_BB 2

// #%max-alleles=4
// #%max-cn-states=2
// #%call-code-1=OTV_1:-4:1
// #%call-code-2=NoCall_1:-3:1
// #%call-code-3=OTV:-2:2
// #%call-code-4=NoCall:-1:2
// #%call-code-5=AA:0:2
// #%call-code-6=AB:1:2
// #%call-code-7=BB:2:2
// #%call-code-8=ZeroCN:3:0
// #%call-code-9=A:4:1
// #%call-code-10=B:5:1
// #%call-code-11=C:6:1
// #%call-code-12=AC:7:2
// #%call-code-13=BC:8:2
// #%call-code-14=CC:9:2
// #%call-code-15=D:10:1
// #%call-code-16=AD:11:2
// #%call-code-17=BD:12:2
// #%call-code-18=CD:13:2
// #%call-code-19=DD:14:2
// #%call-code-20=E:15:1
// #%call-code-21=AE:16:2
// #%call-code-22=BE:17:2
// #%call-code-23=CE:18:2
// #%call-code-24=DE:19:2
// #%call-code-25=EE:20:2
// #%call-code-26=F:21:1
// #%call-code-27=AF:22:2
// #%call-code-28=BF:23:2
// #%call-code-29=CF:24:2
// #%call-code-30=DF:25:2
// #%call-code-31=EF:26:2
// #%call-code-32=FF:27:2
static const int txt_gt[32] = {GT_NC, GT_NC, GT_NC, GT_NC, GT_AA, GT_AB, GT_BB, GT_NC,
                               GT_AA, GT_BB, GT_NC, GT_NC, GT_NC, GT_NC, GT_NC, GT_NC,
                               GT_NC, GT_NC, GT_NC, GT_NC, GT_NC, GT_NC, GT_NC, GT_NC,
                               GT_NC, GT_NC, GT_NC, GT_NC, GT_NC, GT_NC, GT_NC, GT_NC};
static const int chp_gt[16] = {-1, -1, -1, -1, -1, -1, GT_AA, GT_BB, GT_AB, -1, -1, GT_NC, -1, -1, -1, -1};

/****************************************
 * hFILE READING FUNCTIONS              *
 ****************************************/

// read long in network order
static inline uint32_t read_long(hFILE *hfile) {
    uint32_t value;
    read_bytes(hfile, (void *)&value, sizeof(uint32_t));
    value = ntohl(value);
    return value;
}

// read float in network order
static inline float read_float(hFILE *hfile) {
    union {
        uint32_t u;
        float f;
    } convert;
    read_bytes(hfile, (void *)&convert.u, sizeof(uint32_t));
    convert.u = ntohl(convert.u);
    return convert.f;
}

// read 8-bit string prefixed by its length in network order
static inline int32_t read_string8(hFILE *hfile, char **buffer) {
    int32_t len = (int32_t)read_long(hfile);
    if (len) {
        *buffer = (char *)malloc((1 + len) * sizeof(char));
        read_bytes(hfile, (void *)*buffer, len * sizeof(char));
        (*buffer)[len] = '\0';
    } else {
        *buffer = NULL;
    }
    return len;
}

// read wide-character string in network order
static inline int32_t read_string16(hFILE *hfile, wchar_t **buffer) {
    int32_t len = (int32_t)read_long(hfile);
    if (len) {
        *buffer = (wchar_t *)malloc((1 + len) * sizeof(wchar_t));
        int i;
        for (i = 0; i < len; i++) {
            uint16_t cvalue;
            read_bytes(hfile, (void *)&cvalue, sizeof(unsigned short));
            (*buffer)[i] = (wchar_t)ntohs(cvalue);
        }
        (*buffer)[len] = L'\0';
    } else {
        *buffer = NULL;
    }
    return len;
}

/****************************************
 * CEL FILE IMPLEMENTATION              *
 ****************************************/

// http://www.affymetrix.com/support/developer/powertools/changelog/gcos-agcc/index.html

typedef struct {
    float mean __attribute__((packed));
    float dev __attribute__((packed));
    int16_t N;
} Cell;

typedef struct {
    int16_t x;
    int16_t y;
} Entry;

typedef struct {
    int32_t row;
    int32_t col;
    float upper_left_x;
    float upper_left_y;
    float upper_right_x;
    float upper_right_y;
    float lower_left_x;
    float lower_left_y;
    float lower_right_x;
    float lower_right_y;
    int32_t left_cell;
    int32_t top_cell;
    int32_t right_cell;
    int32_t bottom_cell;
} SubGrid;

typedef struct {
    char *fn;
    hFILE *hfile;
    int32_t version;
    int32_t num_rows;
    int32_t num_cols;
    int32_t num_cells;
    int32_t n_header;
    char *header;
    int32_t n_algorithm;
    char *algorithm;
    int32_t n_parameters;
    char *parameters;
    int32_t cell_margin;
    uint32_t num_outlier_cells;
    uint32_t num_masked_cells;
    int32_t num_sub_grids;
    Cell *cells;
    Entry *masked_entries;
    Entry *outlier_entries;
    SubGrid *sub_grids;
} xda_cel_t;

static xda_cel_t *xda_cel_init(const char *fn, hFILE *hfile, int flags) {
    xda_cel_t *xda_cel = (xda_cel_t *)calloc(1, sizeof(xda_cel_t));
    xda_cel->fn = strdup(fn);
    xda_cel->hfile = hfile;

    int32_t magic;
    read_bytes(xda_cel->hfile, (void *)&magic, sizeof(int32_t));
    if (magic != 64) error("XDA CEL file %s magic number is %d while it should be 64\n", xda_cel->fn, magic);

    read_bytes(xda_cel->hfile, (void *)&xda_cel->version, sizeof(int32_t));
    if (xda_cel->version != 4)
        error("Cannot read XDA CEL file %s. Unsupported XDA CEL file format version: %d\n", xda_cel->fn,
              xda_cel->version);

    read_bytes(xda_cel->hfile, (void *)&xda_cel->num_rows, sizeof(int32_t));
    read_bytes(xda_cel->hfile, (void *)&xda_cel->num_cols, sizeof(int32_t));
    read_bytes(xda_cel->hfile, (void *)&xda_cel->num_cells, sizeof(int32_t));

    read_bytes(xda_cel->hfile, (void *)&xda_cel->n_header, sizeof(int32_t));
    xda_cel->header = (char *)malloc((1 + xda_cel->n_header) * sizeof(char));
    read_bytes(xda_cel->hfile, (void *)xda_cel->header, xda_cel->n_header * sizeof(char));
    xda_cel->header[xda_cel->n_header] = '\0';

    read_bytes(xda_cel->hfile, (void *)&xda_cel->n_algorithm, sizeof(int32_t));
    xda_cel->algorithm = (char *)malloc((1 + xda_cel->n_algorithm) * sizeof(char));
    read_bytes(xda_cel->hfile, (void *)xda_cel->algorithm, xda_cel->n_algorithm * sizeof(char));
    xda_cel->algorithm[xda_cel->n_algorithm] = '\0';

    read_bytes(xda_cel->hfile, (void *)&xda_cel->n_parameters, sizeof(int32_t));
    xda_cel->parameters = (char *)malloc((1 + xda_cel->n_parameters) * sizeof(char));
    read_bytes(xda_cel->hfile, (void *)xda_cel->parameters, xda_cel->n_parameters * sizeof(char));
    xda_cel->parameters[xda_cel->n_parameters] = '\0';

    read_bytes(xda_cel->hfile, (void *)&xda_cel->cell_margin, sizeof(int32_t));
    read_bytes(xda_cel->hfile, (void *)&xda_cel->num_outlier_cells, sizeof(uint32_t));
    read_bytes(xda_cel->hfile, (void *)&xda_cel->num_masked_cells, sizeof(uint32_t));
    read_bytes(xda_cel->hfile, (void *)&xda_cel->num_sub_grids, sizeof(int32_t));

    if (flags) return xda_cel;

    xda_cel->cells = (Cell *)malloc(xda_cel->num_cells * sizeof(Cell));
    read_bytes(xda_cel->hfile, (void *)xda_cel->cells, xda_cel->num_cells * sizeof(Cell));

    xda_cel->masked_entries = (Entry *)malloc(xda_cel->num_masked_cells * sizeof(Entry));
    read_bytes(xda_cel->hfile, (void *)xda_cel->masked_entries, xda_cel->num_masked_cells * sizeof(Entry));

    xda_cel->outlier_entries = (Entry *)malloc(xda_cel->num_outlier_cells * sizeof(Entry));
    read_bytes(xda_cel->hfile, (void *)xda_cel->outlier_entries, xda_cel->num_outlier_cells * sizeof(Entry));

    xda_cel->sub_grids = (SubGrid *)malloc(xda_cel->num_sub_grids * sizeof(SubGrid));
    read_bytes(xda_cel->hfile, (void *)xda_cel->sub_grids, xda_cel->num_sub_grids * sizeof(SubGrid));

    if (!heof(xda_cel->hfile))
        error("XDA CEL reader did not reach the end of file %s at position %ld\n", xda_cel->fn, htell(xda_cel->hfile));

    return xda_cel;
}

static void xda_cel_destroy(xda_cel_t *xda_cel) {
    if (!xda_cel) return;
    free(xda_cel->fn);
    if (hclose(xda_cel->hfile) < 0) error("Error closing XDA CEL file\n");
    free(xda_cel->header);
    free(xda_cel->algorithm);
    free(xda_cel->parameters);
    free(xda_cel->cells);
    free(xda_cel->masked_entries);
    free(xda_cel->outlier_entries);
    free(xda_cel->sub_grids);
    free(xda_cel);
}

static void xda_cel_print(const xda_cel_t *xda_cel, FILE *stream, int verbose) {
    fprintf(stream, "[CEL]\n");
    fprintf(stream, "Version=3\n");
    fprintf(stream, "\n[HEADER]\n");
    fprintf(stream, "%s", xda_cel->header);
    fprintf(stream, "\n[INTENSITY]\n");
    fprintf(stream, "NumberCells=%d\n", xda_cel->num_cells);
    fprintf(stream, "CellHeader=X\tY\tMEAN\tSTDV\tNPIXELS\n");
    int i;
    if (!verbose)
        fprintf(stream, "... use --verbose to visualize Cell Entries ...\n");
    else
        for (i = 0; i < xda_cel->num_cells; i++)
            fprintf(stream, "%3d\t%3d\t%.1f\t%.1f\t%3d\n", i % xda_cel->num_cols, i / xda_cel->num_cols,
                    xda_cel->cells[i].mean, xda_cel->cells[i].dev, xda_cel->cells[i].N);
    fprintf(stream, "\n[MASKS]\n");
    fprintf(stream, "NumberCells=%d\n", xda_cel->num_masked_cells);
    fprintf(stream, "CellHeader=X\tY\n");
    if (!verbose)
        fprintf(stream, "... use --verbose to visualize Masked Entries ...\n");
    else
        for (i = 0; i < xda_cel->num_masked_cells; i++)
            fprintf(stream, "%d\t%d\n", xda_cel->masked_entries[i].x, xda_cel->masked_entries[i].y);
    fprintf(stream, "\n[OUTLIERS]\n");
    fprintf(stream, "NumberCells=%d\n", xda_cel->num_outlier_cells);
    fprintf(stream, "CellHeader=X\tY\n");
    if (!verbose)
        fprintf(stream, "... use --verbose to visualize Outlier Entries ...\n");
    else
        for (i = 0; i < xda_cel->num_outlier_cells; i++)
            fprintf(stream, "%d\t%d\n", xda_cel->outlier_entries[i].x, xda_cel->outlier_entries[i].y);
    fprintf(stream, "\n[MODIFIED]\n");
    fprintf(stream, "NumberCells=0\n");
    fprintf(stream, "CellHeader=X\tY\tORIGMEAN\n");
}

/****************************************
 * CHP FILE IMPLEMENTATION              *
 ****************************************/

// http://www.affymetrix.com/support/developer/powertools/changelog/gcos-agcc/index.html

#define BYTE 0
#define UBYTE 1
#define SHORT 2
#define USHORT 3
#define INT 4
#define UINT 5
#define FLOAT 6
#define STRING 7
#define WSTRING 8

typedef struct {
    wchar_t *name;
    char *value;
    wchar_t *mime_type;
    int32_t n_value;
    int8_t type;
} Parameter;

typedef struct DataHeader DataHeader;

struct DataHeader {
    char *data_type_identifier;
    char *guid;
    wchar_t *datetime;
    wchar_t *locale;
    int32_t n_parameters;
    Parameter *parameters;
    int32_t n_parents;
    DataHeader *parents;
};

typedef struct {
    wchar_t *name;
    int8_t type;
    int32_t size;
} ColHeader;

typedef struct {
    uint32_t pos_first_element;
    uint32_t pos_next_data_set;
    wchar_t *name;
    int32_t n_parameters;
    Parameter *parameters;
    uint32_t n_cols;
    ColHeader *col_headers;
    uint32_t n_rows;
    hFILE *hfile; // this should not be destroyed
    uint32_t n_buffer;
    uint32_t *col_offsets;
    char *buffer;
} DataSet;

typedef struct {
    uint32_t pos_next_data_group;
    uint32_t pos_first_data_set;
    int32_t num_data_sets;
    wchar_t *name;
    DataSet *data_sets;
} DataGroup;

typedef struct {
    wchar_t *name;
    int8_t type;
    int32_t size;
} ColumnHeader;

typedef struct {
    char *fn;
    hFILE *hfile;
    uint8_t magic;
    uint8_t version;
    int32_t num_data_groups;
    uint32_t pos_first_data_group;
    DataHeader data_header;
    DataGroup *data_groups;
    off_t size;
    char *display_name;
} agcc_t;

static void agcc_read_parameters(Parameter *parameter, hFILE *hfile, int flags) {
    read_string16(hfile, &parameter->name);
    parameter->n_value = read_string8(hfile, &parameter->value);
    read_string16(hfile, &parameter->mime_type);
    if (wcscmp(parameter->mime_type, L"text/x-calvin-integer-8") == 0)
        parameter->type = BYTE;
    else if (wcscmp(parameter->mime_type, L"text/x-calvin-unsigned-integer-8") == 0)
        parameter->type = UBYTE;
    else if (wcscmp(parameter->mime_type, L"text/x-calvin-integer-16") == 0)
        parameter->type = SHORT;
    else if (wcscmp(parameter->mime_type, L"text/x-calvin-unsigned-integer-16") == 0)
        parameter->type = USHORT;
    else if (wcscmp(parameter->mime_type, L"text/x-calvin-integer-32") == 0)
        parameter->type = INT;
    else if (wcscmp(parameter->mime_type, L"text/x-calvin-unsigned-integer-32") == 0)
        parameter->type = UINT;
    else if (wcscmp(parameter->mime_type, L"text/x-calvin-float") == 0)
        parameter->type = FLOAT;
    else if (wcscmp(parameter->mime_type, L"text/ascii") == 0)
        parameter->type = STRING;
    else if (wcscmp(parameter->mime_type, L"text/plain") == 0)
        parameter->type = WSTRING;
    else
        error("MIME type %ls not allowed\n", parameter->mime_type);

    // drop parameters that can increase the size of the header dramatically
    if (flags && wcsncmp(parameter->name, L"affymetrix-algorithm-param-apt-opt-cel", 38) == 0) {
        free(parameter->name);
        parameter->name = NULL;
        parameter->n_value = 0;
        free(parameter->value);
        parameter->value = NULL;
        free(parameter->mime_type);
        parameter->mime_type = NULL;
    }
}

static void agcc_read_data_header(DataHeader *data_header, hFILE *hfile, int flags) {
    int i;
    read_string8(hfile, &data_header->data_type_identifier);
    read_string8(hfile, &data_header->guid);
    read_string16(hfile, &data_header->datetime);
    read_string16(hfile, &data_header->locale);

    data_header->n_parameters = (int32_t)read_long(hfile);
    data_header->parameters = (Parameter *)malloc(data_header->n_parameters * sizeof(Parameter));
    for (i = 0; i < data_header->n_parameters; i++) agcc_read_parameters(&data_header->parameters[i], hfile, flags);

    data_header->n_parents = (int32_t)read_long(hfile);
    data_header->parents = (DataHeader *)malloc(data_header->n_parents * sizeof(DataHeader));
    for (i = 0; i < data_header->n_parents; i++) agcc_read_data_header(&data_header->parents[i], hfile, flags);
}

static void agcc_read_data_set(DataSet *data_set, hFILE *hfile, int flags) {
    int i;
    data_set->pos_first_element = read_long(hfile);
    data_set->pos_next_data_set = read_long(hfile);
    read_string16(hfile, &data_set->name);

    data_set->n_parameters = (int32_t)read_long(hfile);
    data_set->parameters = (Parameter *)malloc(data_set->n_parameters * sizeof(Parameter));
    for (i = 0; i < data_set->n_parameters; i++) agcc_read_parameters(&data_set->parameters[i], hfile, flags);

    data_set->n_cols = read_long(hfile);
    data_set->col_headers = (ColHeader *)malloc(data_set->n_cols * sizeof(ColHeader));
    for (i = 0; i < data_set->n_cols; i++) {
        read_string16(hfile, &data_set->col_headers[i].name);
        read_bytes(hfile, (void *)&data_set->col_headers[i].type, sizeof(int8_t));
        data_set->col_headers[i].size = read_long(hfile);
    }
    data_set->n_rows = read_long(hfile);

    data_set->hfile = hfile;
    data_set->col_offsets = (uint32_t *)malloc(data_set->n_cols * sizeof(uint32_t));
    data_set->n_buffer = 0;
    for (i = 0; i < data_set->n_cols; i++) {
        data_set->col_offsets[i] = data_set->n_buffer;
        data_set->n_buffer += data_set->col_headers[i].size;
    }
    data_set->buffer = (char *)malloc(data_set->n_buffer * sizeof(char));

    if (data_set->pos_next_data_set)
        if (hseek(hfile, data_set->pos_next_data_set, SEEK_SET) < 0)
            error("Failed to seek to position %u in AGCC file\n", data_set->pos_next_data_set);
}

static void agcc_read_data_group(DataGroup *data_group, hFILE *hfile, int flags) {
    int i;
    data_group->pos_next_data_group = read_long(hfile);
    data_group->pos_first_data_set = read_long(hfile);
    data_group->num_data_sets = read_long(hfile);
    read_string16(hfile, &data_group->name);
    if (hseek(hfile, data_group->pos_first_data_set, SEEK_SET) < 0)
        error("Failed to seek to position %u in AGCC file\n", data_group->pos_first_data_set);
    data_group->data_sets = (DataSet *)malloc(data_group->num_data_sets * sizeof(DataSet));
    for (i = 0; i < data_group->num_data_sets; i++) agcc_read_data_set(&data_group->data_sets[i], hfile, flags);
    if (data_group->pos_next_data_group)
        if (hseek(hfile, data_group->pos_next_data_group, SEEK_SET) < 0)
            error("Failed to seek to position %u in AGCC file\n", data_group->pos_next_data_group);
}

static agcc_t *agcc_init(const char *fn, hFILE *hfile, int flags) {
    int i;
    agcc_t *agcc = (agcc_t *)calloc(1, sizeof(agcc_t));
    agcc->fn = strdup(fn);
    agcc->hfile = hfile;

    // read File Header
    read_bytes(agcc->hfile, (void *)&agcc->magic, sizeof(uint8_t));
    if (agcc->magic != 59) error("AGCC file %s magic number is %d while it should be 59\n", agcc->fn, agcc->magic);
    read_bytes(agcc->hfile, (void *)&agcc->version, sizeof(uint8_t));
    if (agcc->version != 1)
        error("Cannot read AGCC file %s. Unsupported AGCC file format version: %d\n", agcc->fn, agcc->version);
    agcc->num_data_groups = (int32_t)read_long(agcc->hfile);
    agcc->pos_first_data_group = read_long(agcc->hfile);

    // read Generic Data Header
    agcc_read_data_header(&agcc->data_header, agcc->hfile, flags);

    // read Data Groups
    if (hseek(agcc->hfile, agcc->pos_first_data_group, SEEK_SET) < 0)
        error("Failed to seek to position %u in AGCC file %s\n", agcc->pos_first_data_group, agcc->fn);
    agcc->data_groups = (DataGroup *)malloc(agcc->num_data_groups * sizeof(DataGroup));
    for (i = 0; i < agcc->num_data_groups; i++) agcc_read_data_group(&agcc->data_groups[i], agcc->hfile, flags);

    if (!heof(agcc->hfile))
        error("AGCC reader did not reach the end of file %s at position %ld\n", agcc->fn, htell(agcc->hfile));

    if (hseek(agcc->hfile, 0L, SEEK_END) < 0) error("Failed to seek to end of AGCC file %s\n", agcc->fn);
    agcc->size = htell(agcc->hfile);

    char *ptr = strrchr(agcc->fn, '/') ? strrchr(agcc->fn, '/') + 1 : agcc->fn;
    agcc->display_name = strdup(ptr);
    ptr = strrchr(agcc->display_name, '.');
    if (ptr && strcmp(ptr + 1, "chp") == 0) {
        *ptr = '\0';
        ptr = strrchr(agcc->display_name, '.');
        if (ptr && (strcmp(ptr + 1, "AxiomGT1") == 0 || strcmp(ptr + 1, "birdseed-v2") == 0)) *ptr = '\0';
    }

    return agcc;
}

static void agcc_destroy_parameters(Parameter *parameters, int32_t n_parameters) {
    int i;
    for (i = 0; i < n_parameters; i++) {
        free(parameters[i].name);
        free(parameters[i].value);
        free(parameters[i].mime_type);
    }
    free(parameters);
}

static void agcc_destroy_data_header(DataHeader *data_header) {
    int i;
    free(data_header->data_type_identifier);
    free(data_header->guid);
    free(data_header->datetime);
    free(data_header->locale);
    agcc_destroy_parameters(data_header->parameters, data_header->n_parameters);
    for (i = 0; i < data_header->n_parents; i++) agcc_destroy_data_header(&data_header->parents[i]);
    free(data_header->parents);
}

static void agcc_destroy_data_set(DataSet *data_set) {
    int i;
    free(data_set->name);
    agcc_destroy_parameters(data_set->parameters, data_set->n_parameters);
    for (i = 0; i < data_set->n_cols; i++) free(data_set->col_headers[i].name);
    free(data_set->col_headers);
    free(data_set->col_offsets);
    free(data_set->buffer);
}

static void agcc_destroy_data_group(DataGroup *data_group) {
    int i;
    free(data_group->name);
    for (i = 0; i < data_group->num_data_sets; i++) agcc_destroy_data_set(&data_group->data_sets[i]);
    free(data_group->data_sets);
}

static void agcc_destroy(agcc_t *agcc) {
    if (!agcc) return;
    int i;
    free(agcc->fn);
    if (hclose(agcc->hfile) < 0) error("Error closing AGCC file\n");
    agcc_destroy_data_header(&agcc->data_header);
    for (i = 0; i < agcc->num_data_groups; i++) agcc_destroy_data_group(&agcc->data_groups[i]);
    free(agcc->data_groups);
    free(agcc->display_name);
    free(agcc);
}

static void buffer_string16(const uint16_t *value, int32_t n_value, size_t *m_buffer, wchar_t **buffer) {
    int i;
    hts_expand(wchar_t, n_value / 2 + 1, *m_buffer, *buffer);
    for (i = 0; i < n_value / 2; i++) (*buffer)[i] = (wchar_t)ntohs(value[i]);
    (*buffer)[n_value / 2] = L'\0';
}

static void agcc_print_parameters(const Parameter *parameters, int32_t n_parameters, FILE *stream) {
    int i;
    union {
        uint32_t u;
        float f;
    } convert;
    wchar_t *buffer = NULL;
    size_t m_buffer = 0;
    for (i = 0; i < n_parameters; i++) {
        fprintf(stream, "#%%%ls=", parameters[i].name ? parameters[i].name : L"");
        switch (parameters[i].type) {
        case BYTE:
            fprintf(stream, "%d\n", (int8_t)ntohl(*(uint32_t *)parameters[i].value));
            break;
        case UBYTE:
            fprintf(stream, "%u\n", (uint8_t)ntohl(*(uint32_t *)parameters[i].value));
            break;
        case SHORT:
            fprintf(stream, "%d\n", (int16_t)ntohl(*(uint32_t *)parameters[i].value));
            break;
        case USHORT:
            fprintf(stream, "%u\n", (uint16_t)ntohl(*(uint32_t *)parameters[i].value));
            break;
        case INT:
            fprintf(stream, "%d\n", (int32_t)ntohl(*(uint32_t *)parameters[i].value));
            break;
        case UINT:
            fprintf(stream, "%u\n", ntohl(*(uint32_t *)parameters[i].value));
            break;
        case FLOAT:
            convert.u = ntohl(*(uint32_t *)parameters[i].value);
            fprintf(stream, "%f\n", convert.f);
            break;
        case STRING:
            fprintf(stream, "%s\n", parameters[i].value);
            break;
        case WSTRING:
            buffer_string16((uint16_t *)parameters[i].value, parameters[i].n_value, &m_buffer, &buffer);
            fprintf(stream, "%ls\n", buffer);
            break;
        default:
            break;
        }
    }
    free(buffer);
}

static void agcc_print_data_header(const DataHeader *data_header, FILE *stream) {
    int i;
    if (data_header->guid) fprintf(stream, "#%%FileIdentifier=%s\n", data_header->guid);
    fprintf(stream, "#%%FileTypeIdentifier=%s\n", data_header->data_type_identifier);
    fprintf(stream, "#%%FileLocale=%ls\n", data_header->locale);
    agcc_print_parameters(data_header->parameters, data_header->n_parameters, stream);
    for (i = 0; i < data_header->n_parents; i++) agcc_print_data_header(&data_header->parents[i], stream);
}

typedef void (*col_print_t)(const char *, FILE *stream);

void agcc_print_probe_set_name(const char *s, FILE *stream) {
    uint32_t size = ntohl(*(uint32_t *)s);
    fwrite(s + 4, 1, size, stream);
}

void agcc_print_call(const char *s, FILE *stream) {
    static const char a[16] = "......ABA..N....";
    static const char b[16] = "......ABB..C....";
    int c = s[0] & 0x0F;
    fputc(a[c], stream);
    fputc(b[c], stream);
}

void agcc_print_float(const char *s, FILE *stream) {
    union {
        uint32_t u;
        float f;
    } convert;
    convert.u = ntohl(*(uint32_t *)s);
    fprintf(stream, "%g", convert.f);
}

static void agcc_print_data_set(const DataSet *data_set, FILE *stream, int verbose) {
    fprintf(stream, "#%%SetName=%ls\n", data_set->name);
    fprintf(stream, "#%%Columns=%u\n", data_set->n_cols);
    fprintf(stream, "#%%Rows=%u\n", data_set->n_rows);
    int i, j;
    agcc_print_parameters(data_set->parameters, data_set->n_parameters, stream);
    for (i = 0; i < data_set->n_cols; i++)
        fprintf(stream, "%ls%c", data_set->col_headers[i].name, i + 1 < data_set->n_cols ? '\t' : '\n');
    if (data_set->n_rows == 0) return;

    if (!verbose) {
        fprintf(stream, "... use --verbose to visualize Data Set ...\n");
        return;
    }
    if (wcscmp(data_set->name, L"Genotype") != 0) {
        fprintf(stream, "... can only visualize Genotype Data Set ...\n");
        return;
    }

    char *col_ends = (char *)malloc(data_set->n_cols * sizeof(char));
    col_print_t *col_prints = (col_print_t *)malloc(data_set->n_cols * sizeof(col_print_t));
    for (i = 0; i < data_set->n_cols; i++) {
        col_ends[i] = i + 1 < data_set->n_cols ? '\t' : '\n';
        if (wcscmp(data_set->col_headers[i].name, L"ProbeSetName") == 0)
            col_prints[i] = agcc_print_probe_set_name;
        else if (wcscmp(data_set->col_headers[i].name, L"Call") == 0)
            col_prints[i] = agcc_print_call;
        else if (wcscmp(data_set->col_headers[i].name, L"Confidence") == 0)
            col_prints[i] = agcc_print_float;
        else if (wcscmp(data_set->col_headers[i].name, L"Contrast") == 0)
            col_prints[i] = agcc_print_float;
        else if (wcscmp(data_set->col_headers[i].name, L"Log Ratio") == 0)
            col_prints[i] = agcc_print_float;
        else if (wcscmp(data_set->col_headers[i].name, L"Strength") == 0)
            col_prints[i] = agcc_print_float;
        else if (wcscmp(data_set->col_headers[i].name, L"Signal A") == 0)
            col_prints[i] = agcc_print_float;
        else if (wcscmp(data_set->col_headers[i].name, L"Signal B") == 0)
            col_prints[i] = agcc_print_float;
        else if (wcscmp(data_set->col_headers[i].name, L"Forced Call") == 0)
            col_prints[i] = agcc_print_call;
        else
            error("Unknown column type %ls in AGCC file with type %d\n", data_set->col_headers[i].name,
                  data_set->col_headers[i].type);
    }
    if (hseek(data_set->hfile, data_set->pos_first_element, SEEK_SET) < 0)
        error("Failed to seek to position %u in AGCC file\n", data_set->pos_first_element);
    for (i = 0; i < data_set->n_rows; i++) {
        read_bytes(data_set->hfile, (void *)data_set->buffer, data_set->n_buffer);
        for (j = 0; j < data_set->n_cols; j++) {
            col_prints[j](data_set->buffer + data_set->col_offsets[j], stream);
            fputc(col_ends[j], stream);
        }
    }
    free(col_ends);
    free(col_prints);
}

static void agcc_print_data_group(const DataGroup *data_group, FILE *stream, int verbose) {
    fprintf(stream, "#%%GroupName=%ls\n", data_group->name);
    int i;
    for (i = 0; i < data_group->num_data_sets; i++) agcc_print_data_set(&data_group->data_sets[i], stream, verbose);
}

static void agcc_print(const agcc_t *agcc, FILE *stream, int verbose) {
    fprintf(stream, "#%%File=%s\n", agcc->fn);
    fprintf(stream, "#%%FileSize=%ld\n", agcc->size);
    fprintf(stream, "#%%Magic=%d\n", agcc->magic);
    fprintf(stream, "#%%Version=%d\n", agcc->version);
    int i;
    agcc_print_data_header(&agcc->data_header, stream);
    for (i = 0; i < agcc->num_data_groups; i++) agcc_print_data_group(&agcc->data_groups[i], stream, verbose);
}

static void chps_to_tsv(uint8_t *magic, agcc_t **agcc, int n, FILE *stream) {
    int i, j, k;
    // AxiomGT1 analyses also report cn-probe-chrXY-ratio_gender_meanX,
    // cn-probe-chrXY-ratio_gender_meanY, cn-probe-chrXY-ratio_gender_ratio, and
    // cn-probe-chrXY-ratio_gender, while BRLMM-P analyses also report
    // em-cluster-chrX-het-contrast_gender,
    // em-cluster-chrX-het-contrast_gender_chrX_het_rate
    // pm_mean
    static const wchar_t *chipsummary[] = {L"computed_gender",
                                           L"call_rate",
                                           L"total_call_rate",
                                           L"het_rate",
                                           L"total_het_rate",
                                           L"hom_rate",
                                           L"total_hom_rate",
                                           L"cluster_distance_mean",
                                           L"cluster_distance_stdev",
                                           L"allele_summarization_mean",
                                           L"allele_summarization_stdev",
                                           L"allele_deviation_mean",
                                           L"allele_deviation_stdev",
                                           L"allele_mad_residuals_mean",
                                           L"allele_mad_residuals_stdev"};
    fputs("chp", stream);
    for (j = 0; j < 15; j++) fprintf(stream, "\t%ls", chipsummary[j]);
    fputc('\n', stream);
    for (i = 0; i < n; i++) {
        if (magic[i] != 59) continue;
        if (strcmp(agcc[i]->data_header.data_type_identifier, "affymetrix-multi-data-type-analysis") != 0) {
            if (strcmp(agcc[i]->data_header.data_type_identifier, "affymetrix-calvin-intensity") == 0
                || strcmp(agcc[i]->data_header.data_type_identifier, "affymetrix-calvin-multi-intensity") == 0)
                error(
                    "AGCC file %s contains calvin intensities rather than multi data type analysis (use --cel to "
                    "extract metadata)\n",
                    agcc[i]->fn);
            else
                error("AGCC file %s does not contain multi data type analysis as data type identifier is %s\n",
                      agcc[i]->fn, agcc[i]->data_header.data_type_identifier);
        }
        fputs(strrchr(agcc[i]->fn, '/') ? strrchr(agcc[i]->fn, '/') + 1 : agcc[i]->fn, stream);
        DataHeader *data_header = &agcc[i]->data_header;
        for (j = 0, k = 0; j < 15; j++) {
            fputc('\t', stream);
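            // scan the parameters circularly, giving up after one full pass to avoid looping forever
            int n_tries = 0;
            while (!data_header->parameters[k].name
                   || wcsncmp(data_header->parameters[k].name, L"affymetrix-chipsummary-", 23) != 0
                   || wcscmp(&data_header->parameters[k].name[23], chipsummary[j]) != 0) {
                k++;
                k %= data_header->n_parameters;
                if (++n_tries >= data_header->n_parameters)
                    error("Parameter affymetrix-chipsummary-%ls is missing from AGCC file %s\n", chipsummary[j],
                          agcc[i]->fn);
            }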
            union {
                uint32_t u;
                float f;
            } convert;
            switch (data_header->parameters[k].type) {
            case FLOAT:
                convert.u = ntohl(*(uint32_t *)data_header->parameters[k].value);
                fprintf(stream, "%.5f", convert.f);
                break;
            case STRING:
                fputs(data_header->parameters[k].value, stream);
                break;
            default:
                error("Unable to print parameter of type %d from %s AGCC file\n", data_header->parameters[k].type,
                      agcc[i]->fn);
                break;
            }
        }
        fputc('\n', stream);
    }
}

/****************************************
 * PRINT CEL SUMMARY                    *
 ****************************************/

// this function extracts the following 12 fields from the DAT header, in this order, into str and n_str
// fusion-experiment-name
// pixel-cols
// pixel-rows
// XIN
// YIN
// VE
// temp
// power
// scan-date
// scanner-id
// scanner-type
// array-type
static void parse_dat_header(char *dat_header, char *str[12], int n_str[12]) {
    char *ss, *se;
    ss = strchr(dat_header, ' ');
    if (!ss) goto fail;
    ss += 2;

    se = strchr(ss, ':');
    if (!se) goto fail;
    str[0] = ss;
    n_str[0] = se - ss;

    ss = se + 5;
    for (se = ss + 4; isspace(*se) && se >= ss; se--);
    str[1] = ss;
    n_str[1] = se - ss + 1;

    ss = ss + 9;
    for (se = ss + 4; isspace(*se) && se >= ss; se--);
    str[2] = ss;
    n_str[2] = se - ss + 1;

    ss = ss + 9;
    for (se = ss + 2; isspace(*se) && se >= ss; se--);
    str[3] = ss;
    n_str[3] = se - ss + 1;

    ss = ss + 7;
    for (se = ss + 2; isspace(*se) && se >= ss; se--);
    str[4] = ss;
    n_str[4] = se - ss + 1;

    ss = ss + 6;
    for (se = ss + 2; isspace(*se) && se >= ss; se--);
    str[5] = ss;
    n_str[5] = se - ss + 1;

    ss = ss + 3;
    for (se = ss + 6; isspace(*se) && se >= ss; se--);
    str[6] = ss;
    n_str[6] = se - ss + 1;

    ss = ss + 7;
    for (se = ss + 3; isspace(*se) && se >= ss; se--);
    str[7] = ss;
    n_str[7] = se - ss + 1;

    ss = ss + 4;
    for (se = ss + 17; isspace(*se) && se >= ss; se--);
    str[8] = ss;
    n_str[8] = se - ss + 1;

    ss = ss + 18;
    se = strchr(ss, ' ');
    if (!se) goto fail;
    str[9] = ss;
    n_str[9] = se - ss;

    ss = se + 2;
    se = strstr(ss, "\x14 ");
    if (!se) goto fail;
    for (se--; isspace(*se) && se >= ss; se--);
    str[10] = ss;
    n_str[10] = se - ss + 1;

    se = strstr(ss, "\x14 ");
    if (!se) goto fail;
    ss = se + 2;
    se = strstr(ss, "\x14 ");
    if (!se) goto fail;
    ss = se + 2;
    se = strstr(ss, ".1sq");
    if (!se) goto fail;
    str[11] = ss;
    n_str[11] = se - ss;

    return;

fail:
    error("DAT header malformed\n");
}

// http://github.com/HenrikBengtsson/affxparser/blob/master/R/parseDatHeaderString.R
static void cels_to_tsv(uint8_t *magic, void **files, int n, FILE *stream) {
    int i, j;
    wchar_t *array_type = NULL;             // affymetrix-array-type
    wchar_t *scanner_type = NULL;           // affymetrix-scanner-type
    wchar_t *scanner_id = NULL;             // affymetrix-scanner-id
    wchar_t *scan_date = NULL;              // affymetrix-scan-date
    wchar_t *fusion_experiment_name = NULL; // affymetrix-fusion-experiment-name
    size_t m_array_type = 0, m_scanner_type = 0, m_scanner_id = 0, m_scan_date = 0, m_fusion_experiment_name = 0;
    int32_t pixel_rows = 0; // affymetrix-pixel-rows
    int32_t pixel_cols = 0; // affymetrix-pixel-cols

    char *str[12];
    int n_str[12];

    fprintf(stream,
            "cel\tarray_type\tscanner_type\tscanner_id\tscan_date\tfusion_experiment_name\tpixel_rows\tpixel_cols\n");
    for (i = 0; i < n; i++) {
        char *ss, *se;
        agcc_t *agcc = (agcc_t *)files[i];
        xda_cel_t *xda_cel = (xda_cel_t *)files[i];
        switch (magic[i]) {
        case 59:
            if (strcmp(agcc->data_header.data_type_identifier, "affymetrix-calvin-intensity") != 0
                && strcmp(agcc->data_header.data_type_identifier, "affymetrix-calvin-multi-intensity") != 0)
                error("AGCC file %s does not contain calvin intensities as data type identifier is %s\n", agcc->fn,
                      agcc->data_header.data_type_identifier);
            // check for a parent header first, as parents[0] is dereferenced below
            if (agcc->data_header.n_parents == 0)
                error("AGCC file %s is missing scan acquisition information as it has no parent header\n",
                      agcc->fn);
            if (strcmp(agcc->data_header.parents[0].data_type_identifier, "affymetrix-calvin-scan-acquisition") != 0
                && strcmp(agcc->data_header.parents[0].data_type_identifier,
                          "affymetrix-calvin-multi-scan-acquisition")
                       != 0)
                error("AGCC file %s is missing scan acquisition information as data type identifier is %s\n", agcc->fn,
                      agcc->data_header.parents[0].data_type_identifier);

            const Parameter *parameter;
            for (j = 0; j < agcc->data_header.parents[0].n_parameters; j++) {
                parameter = &agcc->data_header.parents[0].parameters[j];
                if (wcscmp(parameter->name, L"affymetrix-array-type") == 0 && parameter->type == WSTRING)
                    buffer_string16((uint16_t *)parameter->value, parameter->n_value, &m_array_type, &array_type);
                else if (wcscmp(parameter->name, L"affymetrix-scanner-type") == 0 && parameter->type == WSTRING)
                    buffer_string16((uint16_t *)parameter->value, parameter->n_value, &m_scanner_type, &scanner_type);
                else if (wcscmp(parameter->name, L"affymetrix-scanner-id") == 0 && parameter->type == WSTRING)
                    buffer_string16((uint16_t *)parameter->value, parameter->n_value, &m_scanner_id, &scanner_id);
                else if (wcscmp(parameter->name, L"affymetrix-scan-date") == 0 && parameter->type == WSTRING)
                    buffer_string16((uint16_t *)parameter->value, parameter->n_value, &m_scan_date, &scan_date);
                else if (wcscmp(parameter->name, L"affymetrix-fusion-experiment-name") == 0
                         && parameter->type == WSTRING)
                    buffer_string16((uint16_t *)parameter->value, parameter->n_value, &m_fusion_experiment_name,
                                    &fusion_experiment_name);
                if (wcscmp(parameter->name, L"affymetrix-pixel-rows") == 0 && parameter->type == INT)
                    pixel_rows = (int32_t)ntohl(*(uint32_t *)parameter->value);
                if (wcscmp(parameter->name, L"affymetrix-pixel-cols") == 0 && parameter->type == INT)
                    pixel_cols = (int32_t)ntohl(*(uint32_t *)parameter->value);
            }
            fputs(strrchr(agcc->fn, '/') ? strrchr(agcc->fn, '/') + 1 : agcc->fn, stream);
            fputc('\t', stream);
            if (array_type) {
                fprintf(stream, "%ls", array_type);
                array_type[0] = L'\0';
            }
            fputc('\t', stream);
            if (scanner_type) {
                fprintf(stream, "%ls", scanner_type);
                scanner_type[0] = L'\0';
            }
            fputc('\t', stream);
            if (scanner_id) {
                fprintf(stream, "%ls", scanner_id);
                scanner_id[0] = L'\0';
            }
            fputc('\t', stream);
            if (scan_date) {
                fprintf(stream, "%ls", scan_date);
                scan_date[0] = L'\0';
            }
            fputc('\t', stream);
            if (fusion_experiment_name) {
                fprintf(stream, "%ls", fusion_experiment_name);
                fusion_experiment_name[0] = L'\0';
            }
            fputc('\t', stream);
            if (pixel_rows) {
                fprintf(stream, "%d", pixel_rows);
                pixel_rows = 0;
            }
            fputc('\t', stream);
            if (pixel_cols) {
                fprintf(stream, "%d", pixel_cols);
                pixel_cols = 0;
            }
            fputc('\n', stream);
            break;
        case 64:
            ss = strstr(xda_cel->header, "\nDatHeader=[");
            if (!ss) error("XDA CEL file %s is missing DAT header\n", xda_cel->fn);
            ss = strchr(ss + 12, ']');
            if (!ss) error("XDA CEL file %s is missing DAT header\n", xda_cel->fn);
            ss++;
            se = strchr(ss, '\n');
            if (!se) error("XDA CEL file %s is missing DAT header\n", xda_cel->fn);
            *se = '\0';
            parse_dat_header(ss, str, n_str);
            *se = '\n';
            // str[1] holds the pixel columns and str[2] the pixel rows, while the
            // TSV header lists pixel_rows before pixel_cols
            fprintf(stream, "%s\t%.*s\t%.*s\t%.*s\t%.*s\t%.*s\t%.*s\t%.*s\n",
                    strrchr(xda_cel->fn, '/') ? strrchr(xda_cel->fn, '/') + 1 : xda_cel->fn, n_str[11], str[11],
                    n_str[10], str[10], n_str[9], str[9], n_str[8], str[8], n_str[0], str[0], n_str[2], str[2],
                    n_str[1], str[1]);
            break;
        default:
            break;
        }
    }
    free(array_type);
    free(scanner_type);
    free(scanner_id);
    free(scan_date);
    free(fusion_experiment_name);
}

/****************************************
 * htsFILE READING FUNCTIONS            *
 ****************************************/

static htsFile *unheader(const char *fn, kstring_t *str) {
    htsFile *fp = hts_open(fn, "r");
    if (fp == NULL) error("Could not open %s: %s\n", fn, strerror(errno));

    do // skip header
        if (hts_getline(fp, KS_SEP_LINE, str) <= 0) error("Empty file: %s\n", fn);
    while (str->s[0] == '#');

    return fp;
}

/************************************************
 * PROBESET IDS FILE IMPLEMENTATION             *
 ************************************************/

static void *probeset_ids_init(const char *fn) {
    void *probeset_ids = khash_str2int_init();
    kstring_t str = {0, 0, NULL};
    htsFile *fp = unheader(fn, &str);
    int moff = 0, *off = NULL, ncols;
    ncols = ksplit_core(str.s, '\t', &moff, &off);
    if (ncols < 1 || strcmp(&str.s[off[0]], "probeset_id"))
        error("Malformed first line from probeset IDs file: %s\n%s\n", fn, str.s);
    while (hts_getline(fp, KS_SEP_LINE, &str) > 0) {
        ncols = ksplit_core(str.s, '\t', &moff, &off);
        if (khash_str2int_has_key(probeset_ids, &str.s[off[0]]))
            error("Probe Set %s present multiple times in file %s\n", &str.s[off[0]], fn);
        khash_str2int_inc(probeset_ids, strdup(&str.s[off[0]]));
    }
    free(off);
    free(str.s);
    hts_close(fp);
    return probeset_ids;
}

/************************************************
 * SNP CLUSTER POSTERIORS FILE IMPLEMENTATION   *
 ************************************************/

// http://www.affymetrix.com/support/developer/powertools/changelog/SnpModelConverter_8cpp_source.html

typedef struct {
    float xm;   // delta mean of cluster
    float xss;  // delta variance of cluster
    float k;    // strength of mean (pseudo-observations)
    float v;    // strength of variance (pseudo-observations)
    float ym;   // size mean of cluster in other dimension
    float yss;  // size variance of cluster in other dimension
    float xyss; // covariance of cluster in both directions
} cluster_t;

typedef struct {
    char *probeset_id;
    int copynumber;
    cluster_t aa;
    cluster_t ab;
    cluster_t bb;
} snp_t;

typedef struct {
    int is_birdseed;
    void *probeset_id[2];
    snp_t *snps[2];
    int n_snps[2];
    int m_snps[2];
} snp_models_t;

static inline void brlmmp_cluster_init(const char *s, const int *off, cluster_t *cluster) {
    cluster->xm = strtof(&s[off[0]], NULL);
    cluster->xss = strtof(&s[off[1]], NULL);
    cluster->k = strtof(&s[off[2]], NULL);
    cluster->v = strtof(&s[off[3]], NULL);
    cluster->ym = strtof(&s[off[4]], NULL);
    cluster->yss = strtof(&s[off[5]], NULL);
    cluster->xyss = strtof(&s[off[6]], NULL);
}

static inline void birdseed_cluster_init(const char *s, const int *off, cluster_t *cluster) {
    cluster->xm = strtof(&s[off[0]], NULL);
    cluster->ym = strtof(&s[off[1]], NULL);
    cluster->xss = strtof(&s[off[2]], NULL);
    cluster->xyss = strtof(&s[off[3]], NULL);
    cluster->yss = strtof(&s[off[4]], NULL);
    // a Birdseed cluster has a single sixth field, which is used for both strengths
    cluster->k = strtof(&s[off[5]], NULL);
    cluster->v = strtof(&s[off[5]], NULL);
}

static snp_models_t *snp_models_init(const char *fn) {
    int i;
    snp_models_t *snp_models = (snp_models_t *)calloc(1, sizeof(snp_models_t));
    for (i = 0; i < 2; i++) {
        snp_models->probeset_id[i] = khash_str2int_init();
    }

    kstring_t str = {0, 0, NULL};
    htsFile *fp = unheader(fn, &str);

    int sep1, sep2, sep3, exp_cols;
    if (strcmp(str.s, "id\tBB\tAB\tAA\tCV") == 0 || strcmp(str.s, "id\tBB\tAB\tAA\tCV\tOTV") == 0) {
        if (hts_getline(fp, KS_SEP_LINE, &str) <= 0) error("Missing information in SNP posteriors file: %s\n", fn);
        sep1 = '\t';
        sep2 = ',';
        sep3 = ':';
        exp_cols = 7;
    } else if (!strchr(str.s, '\t')) {
        snp_models->is_birdseed = 1;
        sep1 = ';';
        sep2 = ' ';
        sep3 = '-';
        exp_cols = 6;
    } else {
        error("Malformed header line in SNP model file %s:\n%s\n", fn, str.s);
    }

    snp_t *snp;
    int moff1 = 0, *off1 = NULL, ncols1;
    int moff2 = 0, *off2 = NULL, ncols2;
    do {
        ncols1 = ksplit_core(str.s, sep1, &moff1, &off1);
        char *col_str = &str.s[off1[0]];

        int len = strlen(col_str);
        int copynumber;
        if (len >= 2 && col_str[len - 2] == sep3) {
            char *tmp;
            copynumber = strtol(&col_str[len - 1], &tmp, 0);
            if (*tmp) error("Could not parse copynumber %s from file: %s\n", &col_str[len - 1], fn);
            len -= 2;
            col_str[len] = '\0';
        } else {
            copynumber = 2;
        }

        int idx = copynumber == 2;
        hts_expand(snp_t, snp_models->n_snps[idx] + 1, snp_models->m_snps[idx], snp_models->snps[idx]);
        snp = &snp_models->snps[idx][snp_models->n_snps[idx]];
        snp->probeset_id = strdup(&str.s[off1[0]]);
        snp->copynumber = copynumber;
        if (khash_str2int_has_key(snp_models->probeset_id[idx], snp->probeset_id))
            error("Probe Set %s present multiple times in file %s\n", snp->probeset_id, fn);
        khash_str2int_inc(snp_models->probeset_id[idx], snp->probeset_id);

        if (ncols1 < 4 - (2 - copynumber) * snp_models->is_birdseed)
            error("Missing information for probeset %s in SNP posteriors file: %s\n", str.s, fn);
        col_str = &str.s[off1[1]];
        ncols2 = ksplit_core(col_str, sep2, &moff2, &off2);

        if (ncols2 < exp_cols) error("Missing information for probeset %s in SNP posteriors file: %s\n", str.s, fn);
        if (snp_models->is_birdseed)
            birdseed_cluster_init(col_str, off2, &snp->aa);
        else
            brlmmp_cluster_init(col_str, off2, &snp->bb);

        col_str = &str.s[off1[2]];
        if (snp_models->is_birdseed && copynumber == 1) {
            snp->ab.xm = NAN;
            snp->ab.xss = NAN;
            snp->ab.k = NAN;
            snp->ab.v = NAN;
            snp->ab.ym = NAN;
            snp->ab.yss = NAN;
            snp->ab.xyss = NAN;
        } else {
            ncols2 = ksplit_core(col_str, sep2, &moff2, &off2);
            if (ncols2 < exp_cols) error("Missing information for probeset %s in SNP posteriors file: %s\n", str.s, fn);
            if (snp_models->is_birdseed)
                birdseed_cluster_init(col_str, off2, &snp->ab);
            else
                brlmmp_cluster_init(col_str, off2, &snp->ab);
            col_str = &str.s[off1[3]];
        }

        ncols2 = ksplit_core(col_str, sep2, &moff2, &off2);
        if (ncols2 < exp_cols) error("Missing information for probeset %s in SNP posteriors file: %s\n", str.s, fn);
        if (snp_models->is_birdseed)
            birdseed_cluster_init(col_str, off2, &snp->bb);
        else
            brlmmp_cluster_init(col_str, off2, &snp->aa);

        snp_models->n_snps[idx]++;
    } while (hts_getline(fp, KS_SEP_LINE, &str) > 0);

    free(off2);
    free(off1);
    free(str.s);
    hts_close(fp);
    return snp_models;
}

static void snp_models_destroy(snp_models_t *snp_models) {
    int i, j;
    for (i = 0; i < 2; i++) {
        khash_str2int_destroy(snp_models->probeset_id[i]);
        for (j = 0; j < snp_models->n_snps[i]; j++) free(snp_models->snps[i][j].probeset_id);
        free(snp_models->snps[i]);
    }
    free(snp_models);
}
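
/* Illustrative sketch, not compiled as part of this file. snp_models_init()
   above recognises a copy-number suffix at the end of the probeset ID (a
   separator followed by one digit) and otherwise assumes copy number 2. The
   helper below isolates that step. The name split_copynumber() and the IDs
   used to exercise it are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Strips a trailing "<sep><digit>" suffix from id in place and returns the
// copy number it encodes, 2 when there is no suffix, or -1 when the character
// after the separator is not a number.
static int split_copynumber(char *id, char sep) {
    size_t len = strlen(id);
    if (len < 2 || id[len - 2] != sep) return 2;
    char *end;
    long cn = strtol(&id[len - 1], &end, 10);
    if (end == &id[len - 1] || *end != '\0') return -1;
    id[len - 2] = '\0';
    return (int)cn;
}
```

   Checking the length before indexing id[len - 2] keeps the lookup in bounds
   for empty or single-character IDs. */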

/****************************************
 * ANNOT.CSV FILE IMPLEMENTATION        *
 ****************************************/

typedef struct {
    char *probeset_id;
    char *affy_snp_id;
    char *dbsnp_rs_id;
    char *chromosome;
    int position;
    int strand;
    char *flank;
} record_t;

typedef struct {
    void *probeset_id;
    record_t *records;
    int n_records, m_records;
} annot_t;

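// strips the surrounding double quotes from a CSV field in place and returns
// NULL for the missing-value marker "---"
static inline char *unquote(char *str) {
    if (strcmp(str, "\"---\"") == 0) return NULL;
    if (str[0] != '"') return str; // field is not quoted, leave it untouched
    char *ptr = strrchr(str, '"');
    if (ptr > str) *ptr = '\0';
    return str + 1;
}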

static annot_t *annot_init(const char *fn, const char *sam_fn, const char *out_fn, int flags) {
    annot_t *annot = NULL;
    FILE *out_txt = get_file_handle(out_fn);
    htsFile *hts = NULL;
    sam_hdr_t *sam_hdr = NULL;
    bam1_t *b = NULL;
    if (sam_fn) {
        hts = hts_open(sam_fn, "r");
        if (hts == NULL || hts_get_format(hts)->category != sequence_data)
            error("File %s does not contain sequence data\n", sam_fn);
        sam_hdr = sam_hdr_read(hts);
        if (sam_hdr == NULL) error("Reading header from \"%s\" failed\n", sam_fn);
        b = bam_init1();
        if (b == NULL) error("Cannot create SAM record\n");
    }
    kstring_t str = {0, 0, NULL};

    htsFile *fp = hts_open(fn, "r");
    if (!fp) error("Could not read: %s\n", fn);
    if (hts_getline(fp, KS_SEP_LINE, &str) <= 0) error("Empty file: %s\n", fn);
    const char *null_strand = "---";
    while (str.s[0] == '#') {
        if (strcmp(str.s, "#%netaffx-annotation-tabular-format-version=1.0") == 0) null_strand = "---";
        if (strcmp(str.s, "#%netaffx-annotation-tabular-format-version=1.5") == 0) null_strand = "+";
        if (hts && out_txt) fprintf(out_txt, "%s\n", str.s);
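        // stop at end of file rather than testing a stale buffer again
        if (hts_getline(fp, KS_SEP_LINE, &str) < 0) error("Missing column header line in file: %s\n", fn);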
    }

    if (hts && out_txt) fprintf(out_txt, "%s\n", str.s);

    int probe_set_id_idx = -1;
    int affy_snp_id_idx = -1;
    int dbsnp_rs_id_idx = -1;
    int chromosome_idx = -1;
    int position_idx = -1;
    int position_end_idx = -1;
    int strand_idx = -1;
    int flank_idx = -1;
    int allele_a_idx = -1;
    int allele_b_idx = -1;

    int i, moff = 0, *off = NULL;
    int ncols = ksplit_core(str.s, ',', &moff, &off);
    for (i = 0; i < ncols; i++) {
        if (strcmp(&str.s[off[i]], "\"Probe Set ID\"") == 0)
            probe_set_id_idx = i;
        else if (strcmp(&str.s[off[i]], "\"Affy SNP ID\"") == 0)
            affy_snp_id_idx = i;
        else if (strcmp(&str.s[off[i]], "\"dbSNP RS ID\"") == 0)
            dbsnp_rs_id_idx = i;
        else if (strcmp(&str.s[off[i]], "\"Chromosome\"") == 0)
            chromosome_idx = i;
        else if (strcmp(&str.s[off[i]], "\"Physical Position\"") == 0)
            position_idx = i;
        else if (strcmp(&str.s[off[i]], "\"Position End\"") == 0)
            position_end_idx = i;
        else if (strcmp(&str.s[off[i]], "\"Strand\"") == 0)
            strand_idx = i;
        else if (strcmp(&str.s[off[i]], "\"Flank\"") == 0)
            flank_idx = i;
        else if (strcmp(&str.s[off[i]], "\"Allele A\"") == 0)
            allele_a_idx = i;
        else if (strcmp(&str.s[off[i]], "\"Allele B\"") == 0)
            allele_b_idx = i;
    }
    if (probe_set_id_idx != 0) error("Probe Set ID not the first column in file: %s\n", fn);
    if (flank_idx == -1) error("Flank missing from file: %s\n", fn);
    if (allele_a_idx == -1) error("Allele A missing from file: %s\n", fn);
    if (allele_b_idx == -1) error("Allele B missing from file: %s\n", fn);
    const char *probeset_id, *flank, *allele_a, *allele_b;

    if (!hts && out_txt) {
        while (hts_getline(fp, KS_SEP_LINE, &str) > 0) {
            ncols = ksplit_core(str.s, ',', &moff, &off);
            probeset_id = unquote(&str.s[off[probe_set_id_idx]]);
            flank = unquote(&str.s[off[flank_idx]]);
            if (flank) flank2fasta(probeset_id, flank, out_txt);
        }
    } else {
        if (dbsnp_rs_id_idx == -1) error("dbSNP RS ID missing from file: %s\n", fn);
        if (chromosome_idx == -1) error("Chromosome missing from file: %s\n", fn);
        if (position_idx == -1) error("Physical Position missing from file: %s\n", fn);
        if (strand_idx == -1) error("Strand missing from file: %s\n", fn);

        if (!out_txt) {
            annot = (annot_t *)calloc(1, sizeof(annot_t));
            annot->probeset_id = khash_str2int_init();
        }

        int n_total = 0, n_unmapped = 0;
        while (hts_getline(fp, KS_SEP_LINE, &str) > 0) {
            ncols = ksplit_core(str.s, ',', &moff, &off);
            probeset_id = unquote(&str.s[off[probe_set_id_idx]]);
            flank = unquote(&str.s[off[flank_idx]]);
            allele_a = unquote(&str.s[off[allele_a_idx]]);
            allele_b = unquote(&str.s[off[allele_b_idx]]);
            const char *chromosome = NULL;
            int strand = -1, position = 0, idx = -1;
            if (hts) {
                if (!flank) {
                    if (flags & VERBOSE) fprintf(stderr, "Missing flank sequence for marker %s\n", probeset_id);
                    n_unmapped++;
                } else {
                    idx = get_position(hts, sam_hdr, b, probeset_id, flank, 0, &chromosome, &position, &strand);
                    if (idx < 0)
                        error("Reading from %s failed\n", sam_fn);
                    else if (idx == 0) {
                        if (flags & VERBOSE)
                            fprintf(stderr, "Unable to determine position for marker %s\n", probeset_id);
                        n_unmapped++;
                    }
                }
                n_total++;
            } else {
                chromosome = unquote(&str.s[off[chromosome_idx]]);
                const char *ptr = unquote(&str.s[off[position_idx]]);
                char *tmp = NULL;
                if (ptr) {
                    position = strtol(ptr, &tmp, 0);
                    if (*tmp) error("Could not parse position %s from file: %s\n", ptr, fn);
                } else {
                    position = 0;
                }
                ptr = unquote(&str.s[off[strand_idx]]);
                if (!ptr)
                    strand = -1;
                else if (strcmp(ptr, "+") == 0)
                    strand = 0;
                else if (strcmp(ptr, "-") == 0)
                    strand = 1;
                else
                    strand = -1;
            }

            if (out_txt) {
                // "Ref Allele" and "Alt Allele" will not be updated
                fprintf(out_txt, "\"%s\"", probeset_id);
                for (i = 1; i < ncols; i++) {
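                    // unquote() returns NULL for "---", so restore the marker when printing
                    if (i == flank_idx) {
                        fprintf(out_txt, ",\"%s\"", flank ? flank : "---");
                    } else if (i == allele_a_idx) {
                        fprintf(out_txt, ",\"%s\"", allele_a ? allele_a : "---");
                    } else if (i == allele_b_idx) {
                        fprintf(out_txt, ",\"%s\"", allele_b ? allele_b : "---");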
                    } else if (i == chromosome_idx) {
                        if (chromosome)
                            fprintf(out_txt, ",\"%s\"", chromosome);
                        else
                            fprintf(out_txt, ",\"---\"");
                    } else if (i == position_idx) {
                        if (position)
                            fprintf(out_txt, ",\"%d\"", position);
                        else
                            fprintf(out_txt, ",\"---\"");
                    } else if (i == position_end_idx) {
                        if (flank && position && idx > 0) {
                            const char *left = strchr(flank, '[');
                            const char *middle = strchr(flank, '/');
                            const char *right = strchr(flank, ']');
                            if (!left || !middle || !right) error("Flank sequence is malformed: %s\n", flank);

                            fprintf(out_txt, ",\"%d\"",
                                    position + (int)(idx > 1 ? right - middle : middle - left + (*(left + 1) == '-'))
                                        - 2);
                        } else {
                            fprintf(out_txt, ",\"---\"");
                        }
                    } else if (i == strand_idx) {
                        fprintf(out_txt, ",\"%s\"", strand == 0 ? "+" : (strand == 1 ? "-" : null_strand));
                    } else {
                        fprintf(out_txt, ",%s", &str.s[off[i]]);
                    }
                }
                fprintf(out_txt, "\n");
            } else {
                hts_expand0(record_t, annot->n_records + 1, annot->m_records, annot->records);
                annot->records[annot->n_records].probeset_id = strdup(probeset_id);
                if (khash_str2int_has_key(annot->probeset_id, annot->records[annot->n_records].probeset_id))
                    error("Probe Set %s present multiple times in file %s\n",
                          annot->records[annot->n_records].probeset_id, fn);
                khash_str2int_inc(annot->probeset_id, annot->records[annot->n_records].probeset_id);
                const char *dbsnp_rs_id = unquote(&str.s[off[dbsnp_rs_id_idx]]);
                if (dbsnp_rs_id) annot->records[annot->n_records].dbsnp_rs_id = strdup(dbsnp_rs_id);
                if (affy_snp_id_idx >= 0) {
                    const char *affy_snp_id = unquote(&str.s[off[affy_snp_id_idx]]);
                    if (affy_snp_id) annot->records[annot->n_records].affy_snp_id = strdup(affy_snp_id);
                }
                if (chromosome) annot->records[annot->n_records].chromosome = strdup(chromosome);
                annot->records[annot->n_records].position = position;
                if (flank) {
                    annot->records[annot->n_records].flank = strdup(flank);
                    // check whether alleles A and B need to be flipped in
                    // the flank sequence (happens with T/C and T/G SNPs
                    // only)
                    char *left = strchr(annot->records[annot->n_records].flank, '[');
                    char *middle = strchr(annot->records[annot->n_records].flank, '/');
                    char *right = strchr(annot->records[annot->n_records].flank, ']');
                    if (!left || !middle || !right) error("Flank sequence is malformed: %s\n", flank);
                    if (allele_a && allele_b && strncmp(left + 1, allele_b, middle - left - 1) == 0
                        && strncmp(middle + 1, allele_a, right - middle - 1) == 0) {
                        memcpy(left + 1, allele_a, right - middle - 1);
                        *(left + (right - middle)) = '/';
                        memcpy(left + (right - middle) + 1, allele_b, middle - left - 1);
                    }
                }
                annot->records[annot->n_records].strand = strand;
                annot->n_records++;
            }
        }
        if (hts) fprintf(stderr, "Lines   total/unmapped:\t%d/%d\n", n_total, n_unmapped);

        bam_destroy1(b);
        sam_hdr_destroy(sam_hdr);
        if (hts && hts_close(hts) < 0) error("closing \"%s\" failed\n", sam_fn);
    }

    free(off);
    free(str.s);
    hts_close(fp);

    if (out_txt && out_txt != stdout && out_txt != stderr) fclose(out_txt);
    return annot;
}

static void annot_destroy(annot_t *annot) {
    int i;
    khash_str2int_destroy(annot->probeset_id);
    for (i = 0; i < annot->n_records; i++) {
        free(annot->records[i].probeset_id);
        free(annot->records[i].affy_snp_id);
        free(annot->records[i].dbsnp_rs_id);
        free(annot->records[i].chromosome);
        free(annot->records[i].flank);
    }
    free(annot->records);
    free(annot);
}

/****************************************
 * READER ITERATORS                     *
 ****************************************/

#define MAX_LENGTH_PROBE_SET_ID 17
typedef struct {
    int nsmpl;

    DataSet **data_sets;
    int *nrows;
    int *is_brlmm_p;
    htsFile *calls_fp;
    htsFile *confidences_fp;
    htsFile *summary_fp;
    char probeset_id[MAX_LENGTH_PROBE_SET_ID + 1];

    int *gts;
    float *conf_arr;
    float *norm_x_arr;
    float *norm_y_arr;
    float *delta_arr;
    float *size_arr;
    float *baf_arr;
    float *lrr_arr;
} varitr_t;

static void varitr_init_common(varitr_t *varitr) {
    varitr->gts = (int *)malloc(varitr->nsmpl * sizeof(int));
    varitr->conf_arr = (float *)malloc(varitr->nsmpl * sizeof(float));
    varitr->norm_x_arr = (float *)malloc(varitr->nsmpl * sizeof(float));
    varitr->norm_y_arr = (float *)malloc(varitr->nsmpl * sizeof(float));
    varitr->delta_arr = (float *)malloc(varitr->nsmpl * sizeof(float));
    varitr->size_arr = (float *)malloc(varitr->nsmpl * sizeof(float));
    varitr->baf_arr = (float *)malloc(varitr->nsmpl * sizeof(float));
    varitr->lrr_arr = (float *)malloc(varitr->nsmpl * sizeof(float));
}

static varitr_t *varitr_init_cc(bcf_hdr_t *hdr, agcc_t **agcc, int n) {
    int i;
    varitr_t *varitr = (varitr_t *)calloc(1, sizeof(varitr_t));
    varitr->nsmpl = n;
    varitr->data_sets = (DataSet **)malloc(n * sizeof(DataSet *));
    varitr->nrows = (int *)calloc(n, sizeof(int));
    varitr->is_brlmm_p = (int *)malloc(n * sizeof(int));
    for (i = 0; i < n; i++) {
        if (strcmp(agcc[i]->data_header.data_type_identifier, "affymetrix-multi-data-type-analysis") != 0)
            error("AGCC file %s does not contain multi data type analysis as data type identifier is %s\n",
                  agcc[i]->fn, agcc[i]->data_header.data_type_identifier);
        if (agcc[i]->num_data_groups == 0 || wcscmp(agcc[i]->data_groups[0].name, L"MultiData") != 0)
            error("AGCC file %s does not contain multi data\n", agcc[i]->fn);
        if (agcc[i]->data_groups[0].num_data_sets == 0
            || wcscmp(agcc[i]->data_groups[0].data_sets[0].name, L"Genotype") != 0)
            error("AGCC file %s does not contain genotype data\n", agcc[i]->fn);
        DataSet *data_set = &agcc[i]->data_groups[0].data_sets[0];
        if (wcscmp(data_set->col_headers[0].name, L"ProbeSetName") != 0
            || wcscmp(data_set->col_headers[1].name, L"Call") != 0
            || wcscmp(data_set->col_headers[2].name, L"Confidence") != 0
            || wcscmp(data_set->col_headers[5].name, L"Forced Call") != 0)
            error("AGCC file %s does not contain genotype data in the expected format\n", agcc[i]->fn);
        if (wcscmp(data_set->col_headers[3].name, L"Contrast") == 0
            || wcscmp(data_set->col_headers[3].name, L"Log Ratio") == 0
            || wcscmp(data_set->col_headers[4].name, L"Strength") == 0)
            varitr->is_brlmm_p[i] = 1; // ProbeSetName / Call / Confidence / Contrast/Log Ratio
                                       // / Strength / Forced Call
        else if (wcscmp(data_set->col_headers[3].name, L"Signal A") == 0
                 || wcscmp(data_set->col_headers[4].name, L"Signal B") == 0)
            varitr->is_brlmm_p[i] = 0; // ProbeSetName / Call / Confidence / Signal A
                                       // / Signal B / Forced Call
        else
            error("AGCC file %s does not contain intensities data in the expected format\n", agcc[i]->fn);
        if (hseek(data_set->hfile, data_set->pos_first_element, SEEK_SET) < 0)
            error("Failed to seek to position %d in AGCC file %s\n", data_set->pos_first_element, agcc[i]->fn);
        bcf_hdr_add_sample(hdr, agcc[i]->display_name);
        varitr->data_sets[i] = data_set;
    }
    varitr_init_common(varitr);
    return varitr;
}

static varitr_t *varitr_init_txt(bcf_hdr_t *hdr, const char *calls_fn, const char *confidences_fn,
                                 const char *summary_fn) {
    varitr_t *varitr = (varitr_t *)calloc(1, sizeof(varitr_t));

    kstring_t str = {0, 0, NULL};
    int i, moff = 0, *off = NULL, ncols;

    if (calls_fn) {
        fprintf(stderr, "Reading genotype calls file %s\n", calls_fn);
        varitr->calls_fp = unheader(calls_fn, &str);
        ncols = ksplit_core(str.s, '\t', &moff, &off);
        if (strcmp(&str.s[off[0]], "probeset_id"))
            error("Malformed first line from calls file: %s\n%s\n", calls_fn, str.s);
        varitr->nsmpl = ncols - 1;
        for (i = 1; i < ncols; i++) {
            char *ptr = strrchr(&str.s[off[i]], '.');
            if (ptr && strcmp(ptr + 1, "CEL") == 0) *ptr = '\0';
            bcf_hdr_add_sample(hdr, &str.s[off[i]]);
        }
    }

    if (confidences_fn) {
        fprintf(stderr, "Reading genotype confidences file %s\n", confidences_fn);
        varitr->confidences_fp = unheader(confidences_fn, &str);
        ncols = ksplit_core(str.s, '\t', &moff, &off);
        if (strcmp(&str.s[off[0]], "probeset_id"))
            error("Malformed first line from confidences file: %s\n%s\n", confidences_fn, str.s);
        if (!varitr->calls_fp) {
            varitr->nsmpl = ncols - 1;
            for (i = 1; i < ncols; i++) {
                char *ptr = strrchr(&str.s[off[i]], '.');
                if (ptr && strcmp(ptr + 1, "CEL") == 0) *ptr = '\0';
                bcf_hdr_add_sample(hdr, &str.s[off[i]]);
            }
        }
    }

    if (summary_fn) {
        fprintf(stderr, "Reading allelic intensities file %s\n", summary_fn);
        varitr->summary_fp = unheader(summary_fn, &str);
        ncols = ksplit_core(str.s, '\t', &moff, &off);
        if (strcmp(&str.s[off[0]], "probeset_id"))
            error("Malformed first line from summary file: %s\n%s\n", summary_fn, str.s);
        if (!varitr->calls_fp && !varitr->confidences_fp) {
            varitr->nsmpl = ncols - 1;
            for (i = 1; i < ncols; i++) {
                char *ptr = strrchr(&str.s[off[i]], '.');
                if (ptr && strcmp(ptr + 1, "CEL") == 0) *ptr = '\0';
                bcf_hdr_add_sample(hdr, &str.s[off[i]]);
            }
        }
    }

    free(str.s);
    free(off);

    varitr_init_common(varitr);
    return varitr;
}

static inline void check_probe_set_id(char *dest, const char *src) {
    if (dest[0] == '\0') {
        if (strlen(src) > MAX_LENGTH_PROBE_SET_ID) error("Probe Set Name %s is too long\n", src);
        strcpy(dest, src);
    } else {
        if (strcmp(dest, src) != 0) error("Probe Set Name mismatch: %s %s\n", dest, src);
    }
}

static int varitr_loop(varitr_t *varitr, void *probeset_ids) {
    int i, ret = 0;
    varitr->probeset_id[0] = '\0';
    if (varitr->data_sets) {
        for (i = 0; i < varitr->nsmpl; i++) {
            DataSet *data_set = varitr->data_sets[i];
            uint32_t n;
            char probeset_id[MAX_LENGTH_PROBE_SET_ID + 1];
            do {
                varitr->nrows[i]++;
                // check whether you have arrived at the last element
                if (varitr->nrows[i] > data_set->n_rows) return -1;
                read_bytes(data_set->hfile, (void *)data_set->buffer, data_set->n_buffer);
                n = ntohl(*(uint32_t *)&data_set->buffer[data_set->col_offsets[0]]);
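                if (n > MAX_LENGTH_PROBE_SET_ID)
                    // the precision of %.*s must be an int; cap it so an oversized length cannot overrun the buffer
                    error("Probe Set Name %.*s is too long\n", (int)MAX_LENGTH_PROBE_SET_ID,
                          &data_set->buffer[data_set->col_offsets[0] + 4]);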
                strncpy(probeset_id, &data_set->buffer[data_set->col_offsets[0] + 4], (size_t)n);
                probeset_id[n] = '\0';
            } while (probeset_ids && !khash_str2int_has_key(probeset_ids, probeset_id));
            check_probe_set_id(varitr->probeset_id, probeset_id);
            varitr->gts[i] = chp_gt[data_set->buffer[data_set->col_offsets[1]] & 0x0F];
            union {
                uint32_t u;
                float f;
            } convert;
            convert.u = ntohl(*(uint32_t *)&data_set->buffer[data_set->col_offsets[2]]);
            varitr->conf_arr[i] = convert.f;
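            // BRLMM-P data sets store delta and size, so X/Y are recovered from them;
            // other data sets store X/Y, so delta and size are derived instead
            if (varitr->is_brlmm_p[i]) {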
                convert.u = ntohl(*(uint32_t *)&data_set->buffer[data_set->col_offsets[3]]);
                varitr->delta_arr[i] = convert.f;
                convert.u = ntohl(*(uint32_t *)&data_set->buffer[data_set->col_offsets[4]]);
                varitr->size_arr[i] = convert.f;
                varitr->norm_x_arr[i] = expf((varitr->size_arr[i] + varitr->delta_arr[i] * 0.5f) * (float)M_LN2);
                varitr->norm_y_arr[i] = expf((varitr->size_arr[i] - varitr->delta_arr[i] * 0.5f) * (float)M_LN2);
            } else {
                convert.u = ntohl(*(uint32_t *)&data_set->buffer[data_set->col_offsets[3]]);
                varitr->norm_x_arr[i] = convert.f;
                convert.u = ntohl(*(uint32_t *)&data_set->buffer[data_set->col_offsets[4]]);
                varitr->norm_y_arr[i] = convert.f;
                float log2x = logf(varitr->norm_x_arr[i]) * (float)M_LOG2E;
                float log2y = logf(varitr->norm_y_arr[i]) * (float)M_LOG2E;
                varitr->delta_arr[i] = log2x - log2y;
                varitr->size_arr[i] = (log2x + log2y) * 0.5f;
            }
        }
    } else {
        kstring_t str = {0, 0, NULL};
        int moff = 0, *off = NULL, ncols, len;
        kstring_t str_b = {0, 0, NULL};
        int moff_b = 0, *off_b = NULL, ncols_b, len_b;
        char *tmp;

        // read genotypes
        if (varitr->calls_fp) {
            do {
                if ((ret = hts_getline(varitr->calls_fp, KS_SEP_LINE, &str)) < 0) goto exit;
                ncols = ksplit_core(str.s, '\t', &moff, &off);
                if (ncols != 1 + varitr->nsmpl)
                    error("Expected %d columns but %d columns found in the calls file\n", 1 + varitr->nsmpl, ncols);
                for (i = 1; i < 1 + varitr->nsmpl; i++) {
                    int gt = strtol(&str.s[off[i]], &tmp, 0);
                    if (*tmp || gt < -4 || gt > 27)
                        error("Could not parse genotype %s found in the calls file\n", &str.s[off[i]]);
                    varitr->gts[i - 1] = txt_gt[4 + gt];
                }
            } while (probeset_ids && !khash_str2int_has_key(probeset_ids, &str.s[off[0]]));
            check_probe_set_id(varitr->probeset_id, &str.s[off[0]]);
        }

        // read confidences
        if (varitr->confidences_fp) {
            do {
                if ((ret = hts_getline(varitr->confidences_fp, KS_SEP_LINE, &str)) < 0) goto exit;
                ncols = ksplit_core(str.s, '\t', &moff, &off);
                if (ncols != 1 + varitr->nsmpl)
                    error("Expected %d columns but %d columns found in the confidences file\n", 1 + varitr->nsmpl,
                          ncols);
                for (i = 1; i < 1 + varitr->nsmpl; i++) varitr->conf_arr[i - 1] = strtof(&str.s[off[i]], &tmp);
            } while (probeset_ids && !khash_str2int_has_key(probeset_ids, &str.s[off[0]]));
            check_probe_set_id(varitr->probeset_id, &str.s[off[0]]);
        }

        // read intensities
        if (varitr->summary_fp) {
            do {
                // skips -C/-D/-E/-F/-G summary statistics
                do {
                    if ((ret = hts_getline(varitr->summary_fp, KS_SEP_LINE, &str)) < 0) goto exit;
                    ncols = ksplit_core(str.s, '\t', &moff, &off);
                    if (ncols != 1 + varitr->nsmpl)
                        error("Expected %d columns but %d columns found in the summary file\n", 1 + varitr->nsmpl,
                              ncols);
                    len = strlen(&str.s[off[0]]);
                } while (str.s[off[0] + len - 2] != '-' || str.s[off[0] + len - 1] != 'A');
                // skips probes with -A summary statistics only
                do {
                    // check whether the next line contains the expected -B probeset_id
                    if ((ret = hts_getline(varitr->summary_fp, KS_SEP_LINE, &str_b)) < 0) goto exit;
                    ncols_b = ksplit_core(str_b.s, '\t', &moff_b, &off_b);
                    if (ncols_b != 1 + varitr->nsmpl)
                        error("Expected %d columns but %d columns found in the summary file\n", 1 + varitr->nsmpl,
                              ncols_b);
                    len_b = strlen(&str_b.s[off_b[0]]);
                    if (str_b.s[off_b[0] + len_b - 2] == '-' && str_b.s[off_b[0] + len_b - 1] == 'B') break;

                    // not a -B line: swap the two buffers so the line just read becomes the candidate -A line
                    kstring_t str_tmp = str;
                    str = str_b;
                    str_b = str_tmp;
                    int len_tmp = len;
                    len = len_b;
                    len_b = len_tmp;
                    int moff_tmp = moff;
                    moff = moff_b;
                    moff_b = moff_tmp;
                    int *off_tmp = off;
                    off = off_b;
                    off_b = off_tmp;
                    int ncols_tmp = ncols;
                    ncols = ncols_b;
                    ncols_b = ncols_tmp;
                } while (1);

                if (len != len_b || strncmp(&str.s[off[0]], &str_b.s[off_b[0]], len - 2) != 0)
                    error("Mismatching %s and %s Probe Set IDs found in the summary file\n", &str.s[off[0]],
                          &str_b.s[off_b[0]]);
                for (i = 1; i < 1 + varitr->nsmpl; i++) {
                    varitr->norm_x_arr[i - 1] = strtof(&str.s[off[i]], &tmp);
                    if (*tmp) error("Could not parse intensity value %s found in the summary file\n", &str.s[off[i]]);
                    varitr->norm_y_arr[i - 1] = strtof(&str_b.s[off_b[i]], &tmp);
                    if (*tmp)
                        error("Could not parse intensity value %s found in the summary file\n", &str_b.s[off_b[i]]);
                    float log2x = logf(varitr->norm_x_arr[i - 1]) * (float)M_LOG2E;
                    float log2y = logf(varitr->norm_y_arr[i - 1]) * (float)M_LOG2E;
                    varitr->delta_arr[i - 1] = log2x - log2y;
                    varitr->size_arr[i - 1] = (log2x + log2y) * 0.5f;
                }
                str.s[off[0] + len - 2] = '\0';
            } while (probeset_ids && !khash_str2int_has_key(probeset_ids, &str.s[off[0]]));
            check_probe_set_id(varitr->probeset_id, &str.s[off[0]]);
        }
    exit:
        free(str_b.s);
        free(off_b);
        free(str.s);
        free(off);
    }
    return ret;
}

static void varitr_destroy(varitr_t *varitr) {
    free(varitr->data_sets);
    free(varitr->nrows);
    free(varitr->is_brlmm_p);
    if (varitr->calls_fp) hts_close(varitr->calls_fp);
    if (varitr->confidences_fp) hts_close(varitr->confidences_fp);
    if (varitr->summary_fp) hts_close(varitr->summary_fp);
    free(varitr->gts);
    free(varitr->conf_arr);
    free(varitr->norm_x_arr);
    free(varitr->norm_y_arr);
    free(varitr->delta_arr);
    free(varitr->size_arr);
    free(varitr->baf_arr);
    free(varitr->lrr_arr);
    free(varitr);
}

/****************************************
 * OUTPUT FUNCTIONS                     *
 ****************************************/

static bcf_hdr_t *hdr_init(const faidx_t *fai, int flags) {
    bcf_hdr_t *hdr = bcf_hdr_init("w");
    int i, n = faidx_nseq(fai);
    for (i = 0; i < n; i++) {
        const char *seq = faidx_iseq(fai, i);
        int len = faidx_seq_len(fai, seq);
        bcf_hdr_printf(hdr, "##contig=<ID=%s,length=%d>", seq, len);
    }
    bcf_hdr_append(hdr, "##INFO=<ID=ALLELE_A,Number=1,Type=Integer,Description=\"A allele\">");
    bcf_hdr_append(hdr, "##INFO=<ID=ALLELE_B,Number=1,Type=Integer,Description=\"B allele\">");
    bcf_hdr_append(hdr, "##INFO=<ID=DBSNP_RS_ID,Number=1,Type=String,Description=\"dbSNP RS ID\">");
    bcf_hdr_append(hdr, "##INFO=<ID=AFFY_SNP_ID,Number=1,Type=String,Description=\"Affymetrix SNP ID\">");
    if (flags & SNP_LOADED) {
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanX_AA,Number=1,Type=Float,Description=\"Mean of "
                       "normalized DELTA for AA diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanX_AB,Number=1,Type=Float,Description=\"Mean of "
                       "normalized DELTA for AB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanX_BB,Number=1,Type=Float,Description=\"Mean of "
                       "normalized DELTA for BB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varX_AA,Number=1,Type=Float,Description=\"Variance of "
                       "normalized DELTA for AA diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varX_AB,Number=1,Type=Float,Description=\"Variance of "
                       "normalized DELTA for AB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varX_BB,Number=1,Type=Float,Description=\"Variance of "
                       "normalized DELTA for BB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsMean_AA,Number=1,Type=Float,Description=\"Number of AA "
                       "calls in training set for diploid mean\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsMean_AB,Number=1,Type=Float,Description=\"Number of AB "
                       "calls in training set for diploid mean\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsMean_BB,Number=1,Type=Float,Description=\"Number of BB "
                       "calls in training set for diploid mean\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsVar_AA,Number=1,Type=Float,Description=\"Number of AA "
                       "calls in training set for diploid variance\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsVar_AB,Number=1,Type=Float,Description=\"Number of AB "
                       "calls in training set for diploid variance\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsVar_BB,Number=1,Type=Float,Description=\"Number of BB "
                       "calls in training set for diploid variance\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanY_AA,Number=1,Type=Float,Description=\"Mean of "
                       "normalized SIZE for AA diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanY_AB,Number=1,Type=Float,Description=\"Mean of "
                       "normalized SIZE for AB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanY_BB,Number=1,Type=Float,Description=\"Mean of "
                       "normalized SIZE for BB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varY_AA,Number=1,Type=Float,Description=\"Variance of "
                       "normalized SIZE for AA diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varY_AB,Number=1,Type=Float,Description=\"Variance of "
                       "normalized SIZE for AB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varY_BB,Number=1,Type=Float,Description=\"Variance of "
                       "normalized SIZE for BB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=covarXY_AA,Number=1,Type=Float,Description=\"Covariance for "
                       "AA diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=covarXY_AB,Number=1,Type=Float,Description=\"Covariance for "
                       "AB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=covarXY_BB,Number=1,Type=Float,Description=\"Covariance for "
                       "BB diploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanX_AA.1,Number=1,Type=Float,Description=\"Mean of "
                       "normalized DELTA for AA haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanX_AB.1,Number=1,Type=Float,Description=\"Mean of "
                       "normalized DELTA for AB haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanX_BB.1,Number=1,Type=Float,Description=\"Mean of "
                       "normalized DELTA for BB haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varX_AA.1,Number=1,Type=Float,Description=\"Variance of "
                       "normalized DELTA for AA haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varX_AB.1,Number=1,Type=Float,Description=\"Variance of "
                       "normalized DELTA for AB haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varX_BB.1,Number=1,Type=Float,Description=\"Variance of "
                       "normalized DELTA for BB haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsMean_AA.1,Number=1,Type=Float,Description=\"Number of "
                       "AA calls in training set for haploid mean\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsMean_AB.1,Number=1,Type=Float,Description=\"Number of "
                       "AB calls in training set for haploid mean\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsMean_BB.1,Number=1,Type=Float,Description=\"Number of "
                       "BB calls in training set for haploid mean\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsVar_AA.1,Number=1,Type=Float,Description=\"Number of AA "
                       "calls in training set for haploid variance\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsVar_AB.1,Number=1,Type=Float,Description=\"Number of AB "
                       "calls in training set for haploid variance\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=nObsVar_BB.1,Number=1,Type=Float,Description=\"Number of BB "
                       "calls in training set for haploid variance\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanY_AA.1,Number=1,Type=Float,Description=\"Mean of "
                       "normalized SIZE for AA haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanY_AB.1,Number=1,Type=Float,Description=\"Mean of "
                       "normalized SIZE for AB haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=meanY_BB.1,Number=1,Type=Float,Description=\"Mean of "
                       "normalized SIZE for BB haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varY_AA.1,Number=1,Type=Float,Description=\"Variance of "
                       "normalized SIZE for AA haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varY_AB.1,Number=1,Type=Float,Description=\"Variance of "
                       "normalized SIZE for AB haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=varY_BB.1,Number=1,Type=Float,Description=\"Variance of "
                       "normalized SIZE for BB haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=covarXY_AA.1,Number=1,Type=Float,Description=\"Covariance "
                       "for AA haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=covarXY_AB.1,Number=1,Type=Float,Description=\"Covariance "
                       "for AB haploid cluster\">");
        bcf_hdr_append(hdr,
                       "##INFO=<ID=covarXY_BB.1,Number=1,Type=Float,Description=\"Covariance "
                       "for BB haploid cluster\">");
    }
    if (!(flags & NO_INFO_GC))
        bcf_hdr_append(hdr,
                       "##INFO=<ID=GC,Number=1,Type=Float,Description=\"GC ratio content "
                       "around the variant\">");
    if ((flags & CALLS_LOADED) && (flags & FORMAT_GT))
        bcf_hdr_append(hdr, "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">");
    if ((flags & CONFIDENCES_LOADED) && (flags & FORMAT_CONF))
        bcf_hdr_append(hdr, "##FORMAT=<ID=CONF,Number=1,Type=Float,Description=\"Genotype confidence\">");
    if (flags & SUMMARY_LOADED) {
        if (flags & FORMAT_NORMX)
            bcf_hdr_append(hdr,
                           "##FORMAT=<ID=NORMX,Number=1,Type=Float,Description=\"Normalized X "
                           "intensity\">");
        if (flags & FORMAT_NORMY)
            bcf_hdr_append(hdr,
                           "##FORMAT=<ID=NORMY,Number=1,Type=Float,Description=\"Normalized Y "
                           "intensity\">");
        if (flags & FORMAT_DELTA)
            bcf_hdr_append(hdr,
                           "##FORMAT=<ID=DELTA,Number=1,Type=Float,Description=\"Normalized "
                           "contrast value\">");
        if (flags & FORMAT_SIZE)
            bcf_hdr_append(hdr, "##FORMAT=<ID=SIZE,Number=1,Type=Float,Description=\"Normalized size value\">");
    }
    if ((flags & SUMMARY_LOADED) && (flags & SNP_LOADED)) {
        if (flags & FORMAT_BAF)
            bcf_hdr_append(hdr, "##FORMAT=<ID=BAF,Number=1,Type=Float,Description=\"B Allele Frequency\">");
        if (flags & FORMAT_LRR)
            bcf_hdr_append(hdr, "##FORMAT=<ID=LRR,Number=1,Type=Float,Description=\"Log R Ratio\">");
    }
    return hdr;
}

// adjust cluster centers (using apt-probeset-genotype posteriors as priors)
// similar to
// http://github.com/WGLab/PennCNV/blob/master/affy/bin/generate_affy_geno_cluster.pl
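//
// Worked form of the update (a reading of the code below, not taken from PennCNV): each prior
// cluster mean m0 acts as a pseudo-observation of weight 0.2, so for the n_g samples called
// with genotype g the adjusted mean is
//   m = (0.2 * m0 + sum_{i : gts[i] == g} x[i]) / (0.2 + n_g)
// e.g. m0 = 1.0 with two samples at 2.0 and 3.0 gives (0.2 + 5.0) / 2.2 ~= 2.36.
// Only the means (xm, ym) and the counts k are updated; variances and covariances are left as loaded.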
static void adjust_clusters(const int *gts, const float *x, const float *y, int n, snp_t *snp) {
    snp->aa.xm *= 0.2f;
    snp->ab.xm *= 0.2f;
    snp->bb.xm *= 0.2f;
    snp->aa.ym *= 0.2f;
    snp->ab.ym *= 0.2f;
    snp->bb.ym *= 0.2f;
    snp->aa.k = 0.2f;
    snp->ab.k = 0.2f;
    snp->bb.k = 0.2f;

    int i;
    for (i = 0; i < n; i++) {
        switch (gts[i]) {
        case GT_AA:
            snp->aa.k++;
            snp->aa.xm += x[i];
            snp->aa.ym += y[i];
            break;
        case GT_AB:
            snp->ab.k++;
            snp->ab.xm += x[i];
            snp->ab.ym += y[i];
            break;
        case GT_BB:
            snp->bb.k++;
            snp->bb.xm += x[i];
            snp->bb.ym += y[i];
            break;
        default:
            break;
        }
    }

    snp->aa.xm /= snp->aa.k;
    snp->ab.xm /= snp->ab.k;
    snp->bb.xm /= snp->bb.k;
    snp->aa.ym /= snp->aa.k;
    snp->ab.ym /= snp->ab.k;
    snp->bb.ym /= snp->bb.k;
}

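// info_str must list the 21 INFO IDs in the order meanX, varX, nObsMean, nObsVar, meanY, varY, covarXY
// with AA, AB, BB within each group
static void update_info_cluster(const bcf_hdr_t *hdr, bcf1_t *rec, const char **info_str, const snp_t *snp) {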
    bcf_update_info_float(hdr, rec, info_str[0], &snp->aa.xm, 1);
    bcf_update_info_float(hdr, rec, info_str[1], &snp->ab.xm, 1);
    bcf_update_info_float(hdr, rec, info_str[2], &snp->bb.xm, 1);
    bcf_update_info_float(hdr, rec, info_str[3], &snp->aa.xss, 1);
    bcf_update_info_float(hdr, rec, info_str[4], &snp->ab.xss, 1);
    bcf_update_info_float(hdr, rec, info_str[5], &snp->bb.xss, 1);
    bcf_update_info_float(hdr, rec, info_str[6], &snp->aa.k, 1);
    bcf_update_info_float(hdr, rec, info_str[7], &snp->ab.k, 1);
    bcf_update_info_float(hdr, rec, info_str[8], &snp->bb.k, 1);
    bcf_update_info_float(hdr, rec, info_str[9], &snp->aa.v, 1);
    bcf_update_info_float(hdr, rec, info_str[10], &snp->ab.v, 1);
    bcf_update_info_float(hdr, rec, info_str[11], &snp->bb.v, 1);
    bcf_update_info_float(hdr, rec, info_str[12], &snp->aa.ym, 1);
    bcf_update_info_float(hdr, rec, info_str[13], &snp->ab.ym, 1);
    bcf_update_info_float(hdr, rec, info_str[14], &snp->bb.ym, 1);
    bcf_update_info_float(hdr, rec, info_str[15], &snp->aa.yss, 1);
    bcf_update_info_float(hdr, rec, info_str[16], &snp->ab.yss, 1);
    bcf_update_info_float(hdr, rec, info_str[17], &snp->bb.yss, 1);
    bcf_update_info_float(hdr, rec, info_str[18], &snp->aa.xyss, 1);
    bcf_update_info_float(hdr, rec, info_str[19], &snp->ab.xyss, 1);
    bcf_update_info_float(hdr, rec, info_str[20], &snp->bb.xyss, 1);
}

// compute LRR and BAF
// similar to
// http://github.com/WGLab/PennCNV/blob/master/affy/bin/normalize_affy_geno_cluster.pl
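//
// Coordinates used below (derived from the conversions in varitr_loop): with
//   delta = log2(x) - log2(y)   and   size = (log2(x) + log2(y)) / 2
// the intensities are x = 2^(size + delta/2) and y = 2^(size - delta/2), hence
//   theta = (2/pi) * atan(y / x) = (2/pi) * atan(2^(-delta))
//   r     = x + y = 2^size * (2^(delta/2) + 2^(-delta/2)) = 2^size * 2 * cosh(delta * ln(2) / 2)
// e.g. delta = 0, size = 10 gives theta = 0.5 and r = 2048. Birdseed models store cluster
// centers directly as (x, y), so theta and r are computed from them without this conversion.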
static void compute_baf_lrr(const float *norm_x, const float *norm_y, int n, const snp_t *snp, int is_birdseed,
                            float *baf, float *lrr) {
    float aa_theta, ab_theta, bb_theta, aa_r, ab_r, bb_r;

    if (is_birdseed) {
        aa_theta = atan2f(snp->aa.ym, snp->aa.xm) * (float)M_2_PI;
        ab_theta = atan2f(snp->ab.ym, snp->ab.xm) * (float)M_2_PI;
        bb_theta = atan2f(snp->bb.ym, snp->bb.xm) * (float)M_2_PI;
        aa_r = snp->aa.xm + snp->aa.ym;
        ab_r = snp->ab.xm + snp->ab.ym;
        bb_r = snp->bb.xm + snp->bb.ym;
    } else {
        aa_theta = atanf(expf(-snp->aa.xm * (float)M_LN2)) * (float)M_2_PI;
        ab_theta = atanf(expf(-snp->ab.xm * (float)M_LN2)) * (float)M_2_PI;
        bb_theta = atanf(expf(-snp->bb.xm * (float)M_LN2)) * (float)M_2_PI;
        aa_r = expf(snp->aa.ym * (float)M_LN2) * 2.0f * coshf(snp->aa.xm * 0.5f * (float)M_LN2);
        ab_r = expf(snp->ab.ym * (float)M_LN2) * 2.0f * coshf(snp->ab.xm * 0.5f * (float)M_LN2);
        bb_r = expf(snp->bb.ym * (float)M_LN2) * 2.0f * coshf(snp->bb.xm * 0.5f * (float)M_LN2);
    }

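    // handles chromosome Y SNPs: single-copy models have no usable AB cluster,
    // so place it midway between the AA and BB clusters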
    if (snp->copynumber == 1) {
        ab_theta = (aa_theta + bb_theta) * 0.5f;
        ab_r = (aa_r + bb_r) * 0.5f;
    }

    int i;
    for (i = 0; i < n; i++) {
        float ilmn_theta = atan2f(norm_y[i], norm_x[i]) * (float)M_2_PI;
        float ilmn_r = norm_x[i] + norm_y[i];
        get_baf_lrr(ilmn_theta, ilmn_r, aa_theta, ab_theta, bb_theta, aa_r, ab_r, bb_r, NAN, &baf[i], &lrr[i]);
    }
}

static void process(faidx_t *fai, const annot_t *annot, void *probeset_ids, snp_models_t *snp_models, varitr_t *varitr,
                    htsFile *out_fh, bcf_hdr_t *hdr, int flags, int gc_win) {
    int i, nsmpl = bcf_hdr_nsamples(hdr);
    if ((flags & ADJUST_CLUSTERS) && (nsmpl < 100))
        fprintf(stderr, "Warning: adjusting clusters with %d sample(s) is not recommended\n", nsmpl);

    bcf1_t *rec = bcf_init();
    char ref_base[] = {'\0', '\0'};
    kstring_t allele_a = {0, 0, NULL};
    kstring_t allele_b = {0, 0, NULL};
    kstring_t flank = {0, 0, NULL};

    int32_t *gt_arr = (int32_t *)malloc(nsmpl * 2 * sizeof(int32_t));
    float *baf_arr = (float *)malloc(nsmpl * sizeof(float));
    float *lrr_arr = (float *)malloc(nsmpl * sizeof(float));

    int n_missing = 0, n_no_snp_models = 0, n_skipped = 0;
    for (i = 0; i < annot->n_records; i++) {
        // identify variants to use for next VCF record
        int idx;
        if (varitr) {
            if (varitr_loop(varitr, probeset_ids) < 0) break;
            int ret = khash_str2int_get(annot->probeset_id, varitr->probeset_id, &idx);
            if (ret < 0) error("Probe Set %s not found in manifest file\n", varitr->probeset_id);
        } else {
            if (probeset_ids && !khash_str2int_has_key(probeset_ids, annot->records[i].probeset_id)) {
                n_skipped++;
                continue;
            }
            idx = i;
        }
        record_t *record = &annot->records[idx];

        bcf_clear(rec);
        rec->n_sample = nsmpl;
        rec->rid = bcf_hdr_name2id_flexible(hdr, record->chromosome);
        rec->pos = record->position - 1;
        if (rec->rid < 0 || rec->pos < 0 || record->strand < 0 || !record->flank) {
            if (flags & VERBOSE) fprintf(stderr, "Skipping unlocalized marker %s\n", record->probeset_id);
            n_skipped++;
            continue;
        }
        bcf_update_id(hdr, rec, record->probeset_id);

        flank.l = 0;
        kputs(record->flank, &flank);
        strupper(flank.s);
        if (record->strand) flank_reverse_complement(flank.s);

        int len, win = min(max(max(gc_win, (int)strlen(flank.s)), 100), rec->pos);
        char *ref = faidx_fetch_seq(fai, bcf_seqname(hdr, rec), rec->pos - win, rec->pos + win, &len);
        if (!ref || len == 1)
            error("faidx_fetch_seq failed at %s:%" PRId64 " (are you using the correct reference genome?)\n",
                  bcf_seqname(hdr, rec), rec->pos + 1);
        strupper(ref);
        if (!(flags & NO_INFO_GC)) {
            float gc_ratio = get_gc_ratio(&ref[max(win - gc_win, 0)], &ref[min(win + gc_win, len)]);
            bcf_update_info_float(hdr, rec, "GC", &gc_ratio, 1);
        }
        ref_base[0] = ref[win];
        int32_t allele_b_idx;
        allele_a.l = allele_b.l = 0;
        if (strchr(flank.s, '-')) {
            kputc('D', &allele_a);
            kputc('I', &allele_b);
            int ref_is_del = get_indel_alleles(&allele_a, &allele_b, flank.s, ref, win, len, 0);
            if (ref_is_del < 0) {
                if (flags & VERBOSE) fprintf(stderr, "Unable to determine alleles for indel %s\n", record->probeset_id);
                n_missing++;
            }
            if (ref_is_del == 0) {
                rec->pos--;
                ref_base[0] = ref[win - 1];
            }
            allele_b_idx = ref_is_del < 0 ? 1 : ref_is_del;
        } else {
            const char *left = strchr(flank.s, '[');
            const char *middle = strchr(flank.s, '/');
            const char *right = strchr(flank.s, ']');
            if (!left || !middle || !right) error("Flank sequence is malformed: %s\n", flank.s);
            kputsn(left + 1, middle - left - 1, &allele_a);
            kputsn(middle + 1, right - middle - 1, &allele_b);

            if (middle - left == 2 && right - middle == 2) {
                allele_b_idx = get_allele_b_idx(ref_base[0], allele_a.s, allele_b.s);
            } else {
                int allele_a_match = strncmp(left + 1, &ref[win], middle - left - 1) == 0;
                int allele_b_match = strncmp(middle + 1, &ref[win], right - middle - 1) == 0;
                if (allele_a_match && !allele_b_match) {
                    allele_b_idx = 1;
                } else if (!allele_a_match && allele_b_match) {
                    allele_b_idx = 0;
                } else if (allele_a_match && allele_b_match) {
                    int allele_a_right =
                        len_common_prefix(right + 1, &ref[win] + (middle - left) - 1, strlen(right + 1));
                    int allele_b_right =
                        len_common_prefix(right + 1, &ref[win] + (right - middle) - 1, strlen(right + 1));
                    allele_b_idx = allele_a_right > allele_b_right;
                } else {
                    allele_b_idx = -1;
                }
            }
        }
        free(ref);

        int32_t allele_a_idx = get_allele_a_idx(allele_b_idx);
        const char *alleles[3];
        int nals = alleles_ab_to_vcf(alleles, ref_base, allele_a.s, allele_b.s, allele_b_idx);
        if (nals < 0) error("Unable to process Probe Set %s\n", record->probeset_id);
        bcf_update_alleles(hdr, rec, alleles, nals);
        bcf_update_info_int32(hdr, rec, "ALLELE_A", &allele_a_idx, 1);
        bcf_update_info_int32(hdr, rec, "ALLELE_B", &allele_b_idx, 1);
        if (record->dbsnp_rs_id) bcf_update_info_string(hdr, rec, "DBSNP_RS_ID", record->dbsnp_rs_id);
        if (record->affy_snp_id) bcf_update_info_string(hdr, rec, "AFFY_SNP_ID", record->affy_snp_id);

        if (varitr) {
            if ((varitr->data_sets || varitr->calls_fp) && flags & FORMAT_GT) {
                // use a dedicated index: i is the record counter of the enclosing loop
                for (int j = 0; j < nsmpl; j++) {
                    switch (varitr->gts[j]) {
                    case GT_AA:
                        gt_arr[2 * j] = bcf_gt_unphased(allele_a_idx);
                        gt_arr[2 * j + 1] = bcf_gt_unphased(allele_a_idx);
                        break;
                    case GT_AB:
                        gt_arr[2 * j] = bcf_gt_unphased(min(allele_a_idx, allele_b_idx));
                        gt_arr[2 * j + 1] = bcf_gt_unphased(max(allele_a_idx, allele_b_idx));
                        break;
                    case GT_BB:
                        gt_arr[2 * j] = bcf_gt_unphased(allele_b_idx);
                        gt_arr[2 * j + 1] = bcf_gt_unphased(allele_b_idx);
                        break;
                    case GT_NC:
                        gt_arr[2 * j] = bcf_gt_missing;
                        gt_arr[2 * j + 1] = bcf_gt_missing;
                        break;
                    default:
                        error("Genotype for Probe Set ID %s is malformed: %d\n", record->probeset_id, varitr->gts[j]);
                        break;
                    }
                }
                bcf_update_genotypes(hdr, rec, gt_arr, nsmpl * 2);
            }

            if ((varitr->data_sets || varitr->confidences_fp) && flags & FORMAT_CONF)
                bcf_update_format_float(hdr, rec, "CONF", varitr->conf_arr, nsmpl);

            if (varitr->data_sets || varitr->summary_fp) {
                if (flags & FORMAT_NORMX) bcf_update_format_float(hdr, rec, "NORMX", varitr->norm_x_arr, nsmpl);
                if (flags & FORMAT_NORMY) bcf_update_format_float(hdr, rec, "NORMY", varitr->norm_y_arr, nsmpl);
                if (flags & FORMAT_DELTA) bcf_update_format_float(hdr, rec, "DELTA", varitr->delta_arr, nsmpl);
                if (flags & FORMAT_SIZE) bcf_update_format_float(hdr, rec, "SIZE", varitr->size_arr, nsmpl);
            }
        }

        if (snp_models) {
            int rets[2], idxs[2];
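            // index 0 holds the haploid models and index 1 the diploid models;
            // a dedicated index keeps the record counter i of the enclosing loop intact
            for (int j = 0; j < 2; j++) {
                rets[j] = khash_str2int_get(snp_models->probeset_id[j], record->probeset_id, &idxs[j]);
            }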
            static const char *hap_info_str[] = {
                "meanX_AA.1",    "meanX_AB.1",    "meanX_BB.1",    "varX_AA.1",    "varX_AB.1",    "varX_BB.1",
                "nObsMean_AA.1", "nObsMean_AB.1", "nObsMean_BB.1", "nObsVar_AA.1", "nObsVar_AB.1", "nObsVar_BB.1",
                "meanY_AA.1",    "meanY_AB.1",    "meanY_BB.1",    "varY_AA.1",    "varY_AB.1",    "varY_BB.1",
                "covarXY_AA.1",  "covarXY_AB.1",  "covarXY_BB.1"};
            static const char *dip_info_str[] = {
                "meanX_AA",    "meanX_AB",    "meanX_BB",   "varX_AA",    "varX_AB",    "varX_BB",    "nObsMean_AA",
                "nObsMean_AB", "nObsMean_BB", "nObsVar_AA", "nObsVar_AB", "nObsVar_BB", "meanY_AA",   "meanY_AB",
                "meanY_BB",    "varY_AA",     "varY_AB",    "varY_BB",    "covarXY_AA", "covarXY_AB", "covarXY_BB"};
            if (rets[0] >= 0) update_info_cluster(hdr, rec, hap_info_str, &snp_models->snps[0][idxs[0]]);
            if (rets[1] >= 0) update_info_cluster(hdr, rec, dip_info_str, &snp_models->snps[1][idxs[1]]);
            snp_t *snp =
                rets[1] >= 0 ? &snp_models->snps[1][idxs[1]] : (rets[0] >= 0 ? &snp_models->snps[0][idxs[0]] : NULL);
            if (!snp) {
                n_no_snp_models++;
                if (flags & VERBOSE)
                    fprintf(stderr, "Warning: SNP model for Probe Set ID %s was not found\n", record->probeset_id);
            } else {
                if (flags & ADJUST_CLUSTERS)
                    adjust_clusters(varitr->gts, snp_models->is_birdseed ? varitr->norm_x_arr : varitr->delta_arr,
                                    snp_models->is_birdseed ? varitr->norm_y_arr : varitr->size_arr, nsmpl, snp);
                if (flags & SUMMARY_LOADED) {
                    compute_baf_lrr(varitr->norm_x_arr, varitr->norm_y_arr, nsmpl, snp, snp_models->is_birdseed,
                                    baf_arr, lrr_arr);
                    if (flags & FORMAT_BAF) bcf_update_format_float(hdr, rec, "BAF", baf_arr, nsmpl);
                    if (flags & FORMAT_LRR) bcf_update_format_float(hdr, rec, "LRR", lrr_arr, nsmpl);
                }
            }
        }

        if (bcf_write(out_fh, hdr, rec) < 0) error("Unable to write to output VCF file\n");
    }
    if (snp_models)
        fprintf(stderr, "Lines   total/missing-reference/missing-snp-posteriors/skipped:\t%d/%d/%d/%d\n", i, n_missing,
                n_no_snp_models, n_skipped);
    else
        fprintf(stderr, "Lines   total/missing-reference/skipped:\t%d/%d/%d\n", i, n_missing, n_skipped);

    free(gt_arr);
    free(baf_arr);
    free(lrr_arr);

    free(allele_a.s);
    free(allele_b.s);
    free(flank.s);

    bcf_destroy(rec);
    return;
}

/****************************************
 * PLUGIN                               *
 ****************************************/

const char *about(void) { return "convert Affymetrix files to VCF.\n"; }

static const char *usage_text(void) {
    return "\n"
           "About: convert Affymetrix apt-probeset-genotype output files to VCF. "
           "(version " AFFY2VCF_VERSION
           " http://github.com/freeseek/gtc2vcf)\n"
           "Usage: bcftools +affy2vcf [options] --csv <file> --fasta-ref <file> [<A.chp> ...]\n"
           "\n"
           "Plugin options:\n"
           "    -l, --list-tags                 list available FORMAT tags with description for VCF output\n"
           "    -t, --tags LIST                 list of output FORMAT tags [" TAG_LIST_DFLT
           "]\n"
           "    -c, --csv <file>                CSV manifest file (can be gzip compressed)\n"
           "    -f, --fasta-ref <file>          reference sequence in fasta format\n"
           "        --set-cache-size <int>      select fasta cache size in bytes\n"
           "        --gc-window-size <int>      window size in bp used to compute the GC content (-1 for no estimate) "
           "[" GC_WIN_DFLT
           "]\n"
           "        --probeset-ids <file>       tab delimited file with column 'probeset_id' specifying probesets to "
           "convert\n"
           "        --calls <file>              apt-probeset-genotype calls output (can be gzip compressed)\n"
           "        --confidences <file>        apt-probeset-genotype confidences output (can be gzip compressed)\n"
           "        --summary <file>            apt-probeset-genotype summary output (can be gzip compressed)\n"
           "        --snp <file>                apt-probeset-genotype SNP posteriors output (can be gzip compressed)\n"
           "        --chps <dir|file>           input CHP files rather than tab delimited files\n"
           "        --cel <file>                input CEL files rather than CHP files\n"
           "        --adjust-clusters           adjust cluster centers in (Contrast, Size) space (requires --snp)\n"
           "        --no-version                do not append version and command line to the header\n"
           "    -o, --output <file>             write output to a file [standard output]\n"
           "    -O, --output-type u|b|v|z[0-9]  u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level "
           "[v]\n"
           "        --threads <int>             number of extra output compression threads [0]\n"
           "    -x, --extra <file>              write CHP metadata to a file (requires CHP files)\n"
           "    -v, --verbose                   print verbose information\n"
           "    -W, --write-index[=FMT]         Automatically index the output files [off]\n"
           "\n"
           "Manifest options:\n"
           "        --fasta-flank               output flank sequence in FASTA format (requires --csv)\n"
           "    -s, --sam-flank <file>          input flank sequence alignment in SAM/BAM format (requires --csv)\n"
           "\n"
           "Examples:\n"
           "    bcftools +affy2vcf \\\n"
           "        --csv GenomeWideSNP_6.na35.annot.csv \\\n"
           "        --fasta-ref human_g1k_v37.fasta \\\n"
           "        --chps cc-chp/ \\\n"
           "        --snp AxiomGT1.snp-posteriors.txt \\\n"
           "        --output AxiomGT1.vcf \\\n"
           "        --extra report.tsv\n"
           "    bcftools +affy2vcf \\\n"
           "        --csv GenomeWideSNP_6.na35.annot.csv \\\n"
           "        --fasta-ref human_g1k_v37.fasta \\\n"
           "        --calls AxiomGT1.calls.txt \\\n"
           "        --confidences AxiomGT1.confidences.txt \\\n"
           "        --summary AxiomGT1.summary.txt \\\n"
           "        --snp AxiomGT1.snp-posteriors.txt \\\n"
           "        --output AxiomGT1.vcf\n"
           "\n"
           "Examples of manifest file options:\n"
           "    bcftools +affy2vcf -c GenomeWideSNP_6.na35.annot.csv --fasta-flank -o  GenomeWideSNP_6.fasta\n"
           "    bwa mem -M Homo_sapiens_assembly38.fasta GenomeWideSNP_6.fasta -o "
           "GenomeWideSNP_6.sam\n"
           "    bcftools +affy2vcf -c GenomeWideSNP_6.na35.annot.csv -s GenomeWideSNP_6.sam -o "
           "GenomeWideSNP_6.na35.annot.GRCh38.csv\n"
           "\n";
}

// parse a comma-separated list of FORMAT tags (case-insensitive) into a bitmask of FORMAT_* flags
static int parse_tags(const char *str) {
    int i, flags = 0, n;
    char **tags = hts_readlist(str, 0, &n);
    for (i = 0; i < n; i++) {
        if (!strcasecmp(tags[i], "GT"))
            flags |= FORMAT_GT;
        else if (!strcasecmp(tags[i], "CONF"))
            flags |= FORMAT_CONF;
        else if (!strcasecmp(tags[i], "NORMX"))
            flags |= FORMAT_NORMX;
        else if (!strcasecmp(tags[i], "NORMY"))
            flags |= FORMAT_NORMY;
        else if (!strcasecmp(tags[i], "DELTA"))
            flags |= FORMAT_DELTA;
        else if (!strcasecmp(tags[i], "SIZE"))
            flags |= FORMAT_SIZE;
        else if (!strcasecmp(tags[i], "LRR"))
            flags |= FORMAT_LRR;
        else if (!strcasecmp(tags[i], "BAF"))
            flags |= FORMAT_BAF;
        else
            error("Error parsing \"--tags %s\": the tag \"%s\" is not supported\n", str, tags[i]);
        free(tags[i]);
    }
    if (n) free(tags);
    return flags;
}

// print the supported FORMAT tags through error()
static void list_tags(void) {
    error(
        "FORMAT/GT       Number:1  Type:String   ..  Genotype\n"
        "FORMAT/CONF     Number:1  Type:Float    ..  Genotype confidence\n"
        "FORMAT/BAF      Number:1  Type:Float    ..  B Allele Frequency\n"
        "FORMAT/LRR      Number:1  Type:Float    ..  Log R Ratio\n"
        "FORMAT/NORMX    Number:1  Type:Float    ..  Normalized X intensity\n"
        "FORMAT/NORMY    Number:1  Type:Float    ..  Normalized Y intensity\n"
        "FORMAT/DELTA    Number:1  Type:Float    ..  Normalized Delta value\n"
        "FORMAT/SIZE     Number:1  Type:Float    ..  Normalized Size value\n");
}

int run(int argc, char *argv[]) {
    const char *tag_list = TAG_LIST_DFLT;
    const char *ref_fname = NULL;
    const char *extra_fname = NULL;
    const char *csv_fname = NULL;
    const char *probeset_ids_fname = NULL;
    const char *calls_fname = NULL;
    const char *confidences_fname = NULL;
    const char *summary_fname = NULL;
    const char *snp_fname = NULL;
    const char *pathname = NULL;
    const char *output_fname = "-";
    const char *sam_fname = NULL;
    char *index_fname;
    char *tmp;
    int i;
    int flags = 0;
    int output_type = FT_VCF;
    int clevel = -1;
    int cache_size = 0;
    int gc_win = (int)strtol(GC_WIN_DFLT, NU
SYMBOL INDEX (264 symbols across 6 files)

FILE: BAFregress.c
  function sqr (line 44) | KSORT_INIT_GENERIC(float)
  function run (line 102) | int run(int argc, char **argv) {

FILE: affy2vcf.c
  function read_long (line 116) | static inline uint32_t read_long(hFILE *hfile) {
  function read_float (line 124) | static inline float read_float(hFILE *hfile) {
  function read_string8 (line 135) | static inline int32_t read_string8(hFILE *hfile, char **buffer) {
  function read_string16 (line 148) | static inline int32_t read_string16(hFILE *hfile, wchar_t **buffer) {
  type Cell (line 171) | typedef struct {
  type Entry (line 177) | typedef struct {
  type SubGrid (line 182) | typedef struct {
  type xda_cel_t (line 199) | typedef struct {
  function xda_cel_t (line 222) | static xda_cel_t *xda_cel_init(const char *fn, hFILE *hfile, int flags) {
  function xda_cel_destroy (line 280) | static void xda_cel_destroy(xda_cel_t *xda_cel) {
  function xda_cel_print (line 294) | static void xda_cel_print(const xda_cel_t *xda_cel, FILE *stream, int ve...
  type Parameter (line 346) | typedef struct {
  type DataHeader (line 354) | typedef struct DataHeader DataHeader;
  type DataHeader (line 356) | struct DataHeader {
  type ColHeader (line 367) | typedef struct {
  type DataSet (line 373) | typedef struct {
  type DataGroup (line 388) | typedef struct {
  type ColumnHeader (line 396) | typedef struct {
  type agcc_t (line 402) | typedef struct {
  function agcc_read_parameters (line 415) | static void agcc_read_parameters(Parameter *parameter, hFILE *hfile, int...
  function agcc_read_data_header (line 452) | static void agcc_read_data_header(DataHeader *data_header, hFILE *hfile,...
  function agcc_read_data_set (line 468) | static void agcc_read_data_set(DataSet *data_set, hFILE *hfile, int flag...
  function agcc_read_data_group (line 501) | static void agcc_read_data_group(DataGroup *data_group, hFILE *hfile, in...
  function agcc_t (line 516) | static agcc_t *agcc_init(const char *fn, hFILE *hfile, int flags) {
  function agcc_destroy_parameters (line 558) | static void agcc_destroy_parameters(Parameter *parameters, int32_t n_par...
  function agcc_destroy_data_header (line 568) | static void agcc_destroy_data_header(DataHeader *data_header) {
  function agcc_destroy_data_set (line 579) | static void agcc_destroy_data_set(DataSet *data_set) {
  function agcc_destroy_data_group (line 589) | static void agcc_destroy_data_group(DataGroup *data_group) {
  function agcc_destroy (line 596) | static void agcc_destroy(agcc_t *agcc) {
  function buffer_string16 (line 608) | static void buffer_string16(const uint16_t *value, int32_t n_value, size...
  function agcc_print_parameters (line 615) | static void agcc_print_parameters(const Parameter *parameters, int32_t n...
  function agcc_print_data_header (line 662) | static void agcc_print_data_header(const DataHeader *data_header, FILE *...
  function agcc_print_probe_set_name (line 673) | void agcc_print_probe_set_name(const char *s, FILE *stream) {
  function agcc_print_call (line 678) | void agcc_print_call(const char *s, FILE *stream) {
  function agcc_print_float (line 686) | void agcc_print_float(const char *s, FILE *stream) {
  function agcc_print_data_set (line 695) | static void agcc_print_data_set(const DataSet *data_set, FILE *stream, i...
  function agcc_print_data_group (line 753) | static void agcc_print_data_group(const DataGroup *data_group, FILE *str...
  function agcc_print (line 759) | static void agcc_print(const agcc_t *agcc, FILE *stream, int verbose) {
  function chps_to_tsv (line 769) | static void chps_to_tsv(uint8_t *magic, agcc_t **agcc, int n, FILE *stre...
  function parse_dat_header (line 856) | static void parse_dat_header(char *dat_header, char *str[12], int n_str[...
  function cels_to_tsv (line 937) | static void cels_to_tsv(uint8_t *magic, void **files, int n, FILE *strea...
  function htsFile (line 1060) | static htsFile *unheader(const char *fn, kstring_t *str) {
  type cluster_t (line 1101) | typedef struct {
  type snp_t (line 1111) | typedef struct {
  type snp_models_t (line 1119) | typedef struct {
  function brlmmp_cluster_init (line 1127) | static inline void brlmmp_cluster_init(const char *s, const int *off, cl...
  function birdseed_cluster_init (line 1137) | static inline void birdseed_cluster_init(const char *s, const int *off, ...
  function snp_models_t (line 1147) | static snp_models_t *snp_models_init(const char *fn) {
  function snp_models_destroy (line 1249) | static void snp_models_destroy(snp_models_t *snp_models) {
  type record_t (line 1263) | typedef struct {
  type annot_t (line 1273) | typedef struct {
  function annot_t (line 1286) | static annot_t *annot_init(const char *fn, const char *sam_fn, const cha...
  function annot_destroy (line 1509) | static void annot_destroy(annot_t *annot) {
  type varitr_t (line 1528) | typedef struct {
  function varitr_init_common (line 1549) | static void varitr_init_common(varitr_t *varitr) {
  function varitr_t (line 1560) | static varitr_t *varitr_init_cc(bcf_hdr_t *hdr, agcc_t **agcc, int n) {
  function varitr_t (line 1601) | static varitr_t *varitr_init_txt(bcf_hdr_t *hdr, const char *calls_fn, c...
  function check_probe_set_id (line 1661) | static inline void check_probe_set_id(char *dest, const char *src) {
  function varitr_loop (line 1670) | static int varitr_loop(varitr_t *varitr, void *probeset_ids) {
  function varitr_destroy (line 1819) | static void varitr_destroy(varitr_t *varitr) {
  function bcf_hdr_t (line 1841) | static bcf_hdr_t *hdr_init(const faidx_t *fai, int flags) {
  function adjust_clusters (line 2017) | static void adjust_clusters(const int *gts, const float *x, const float ...
  function update_info_cluster (line 2059) | static void update_info_cluster(const bcf_hdr_t *hdr, bcf1_t *rec, const...
  function compute_baf_lrr (line 2086) | static void compute_baf_lrr(const float *norm_x, const float *norm_y, in...
  function process (line 2120) | static void process(faidx_t *fai, const annot_t *annot, void *probeset_i...
  function parse_tags (line 2400) | static int parse_tags(const char *str) {
  function list_tags (line 2428) | static void list_tags(void) {
  function run (line 2440) | int run(int argc, char *argv[]) {

FILE: gtc2vcf.c
  function read_array (line 74) | static void read_array(hFILE *hfile, void **arr, size_t *m_arr, size_t n...
  function read_pfx_array (line 96) | static void read_pfx_array(hFILE *hfile, void **arr, size_t *m_arr, size...
  function read_pfx_string (line 106) | static void read_pfx_string(hFILE *hfile, char **str, size_t *m_str) {
  function is_gzip (line 124) | static int is_gzip(hFILE *hfile) {
  type buffer_array_t (line 134) | typedef struct {
  function buffer_array_t (line 144) | static buffer_array_t *buffer_array_init(hFILE *hfile, size_t capacity, ...
  function get_element (line 158) | static int get_element(buffer_array_t *arr, void *dst, size_t item_idx) {
  function buffer_array_destroy (line 177) | static void buffer_array_destroy(buffer_array_t *arr) {
  type LocusEntry (line 190) | typedef struct {
  function get_assay_type (line 235) | static uint8_t get_assay_type(const char *allele_a_probe_seq, const char...
  function locusentry_read (line 267) | static void locusentry_read(LocusEntry *locus_entry, hFILE *hfile) {
  type bpm_t (line 330) | typedef struct {
  function bpm_t (line 365) | static bpm_t *bpm_init(const char *fn, int eof_check, int make_dict) {
  function bpm_destroy (line 436) | static void bpm_destroy(bpm_t *bpm) {
  function bpm_to_csv (line 479) | static void bpm_to_csv(const bpm_t *bpm, FILE *stream, int flags) {
  function tsv_read_uint8 (line 548) | static int tsv_read_uint8(tsv_t *tsv, bcf1_t *rec, void *usr) {
  function tsv_read_int32 (line 558) | static int tsv_read_int32(tsv_t *tsv, bcf1_t *rec, void *usr) {
  function tsv_read_float (line 568) | static int tsv_read_float(tsv_t *tsv, bcf1_t *rec, void *usr) {
  function tsv_read_string (line 578) | static int tsv_read_string(tsv_t *tsv, bcf1_t *rec, void *usr) {
  function csv_parse (line 592) | static int csv_parse(tsv_t *tsv, bcf1_t *rec, char *str) {
  function locus_merge (line 610) | static void locus_merge(LocusEntry *dest, LocusEntry *src) {
  function bpm_t (line 703) | static bpm_t *bpm_csv_init(const char *fn, bpm_t *bpm, int make_dict) {
  type ClusterStats (line 854) | typedef struct {
  type ClusterScore (line 862) | typedef struct {
  type ClusterRecord (line 869) | typedef struct {
  type egt_t (line 879) | typedef struct {
  function clusterscore_read (line 898) | static void clusterscore_read(ClusterScore *clusterscore, hFILE *hfile) {
  function clusterrecord_read (line 905) | static void clusterrecord_read(ClusterRecord *clusterrecord, hFILE *hfil...
  function egt_t (line 929) | static egt_t *egt_init(const char *fn, int eof_check) {
  function egt_destroy (line 1003) | static void egt_destroy(egt_t *egt) {
  function egt_to_csv (line 1022) | static void egt_to_csv(const egt_t *egt, FILE *stream, int verbose) {
  type chip_type_t (line 1094) | typedef struct {
  type RunInfo (line 1191) | typedef struct {
  type idat_t (line 1199) | typedef struct {
  function idat_read (line 1236) | static int idat_read(idat_t *idat, uint16_t id) {
  function idat_t (line 1334) | static idat_t *idat_init(const char *fn, int load_arrays) {
  function idat_destroy (line 1388) | static void idat_destroy(idat_t *idat) {
  function idat_to_csv (line 1422) | static void idat_to_csv(const idat_t *idat, FILE *stream, int verbose) {
  function idats_to_tsv (line 1464) | static void idats_to_tsv(idat_t **idats, int n, FILE *stream) {
  type XForm (line 1536) | typedef struct {
  type ScannerData (line 1554) | typedef struct {
  type SampleData (line 1562) | typedef struct {
  type gtc_t (line 1571) | typedef struct {
  function gtc_read (line 1620) | static int gtc_read(gtc_t *gtc, uint16_t id) {
  function gtc_t (line 1735) | static gtc_t *gtc_init(const char *fn, size_t capacity) {
  function gtc_destroy (line 1773) | static void gtc_destroy(gtc_t *gtc) {
  function gtc_to_csv (line 1809) | static void gtc_to_csv(const gtc_t *gtc, FILE *stream, int verbose) {
  function gtcs_to_tsv (line 1901) | static void gtcs_to_tsv(gtc_t **gtcs, int n, FILE *stream) {
  function bpm_t (line 1940) | static bpm_t *sam_csv_init(const char *fn, bpm_t *bpm, const char *genom...
  function raw_x_y2norm_x_y (line 1990) | static inline void raw_x_y2norm_x_y(uint16_t raw_x, uint16_t raw_y, floa...
  function norm_x_y2ilmn_theta_r (line 2003) | static inline void norm_x_y2ilmn_theta_r(float norm_x, float norm_y, flo...
  function adjust_clusters (line 2008) | static void adjust_clusters(const uint8_t *gts, const float *ilmn_theta,...
  function rev_allele (line 2055) | static inline char rev_allele(char allele) {
  function gtcs_to_gs (line 2065) | static void gtcs_to_gs(gtc_t **gtc, int n, const bpm_t *bpm, const egt_t...
  function bcf_hdr_t (line 2186) | static bcf_hdr_t *hdr_init(const faidx_t *fai, int flags) {
  function gts_to_gt_arr (line 2325) | static int gts_to_gt_arr(int32_t *gt_arr, const uint8_t *gts, int n, int...
  function locus2bcf (line 2352) | static int locus2bcf(const LocusEntry *locus_entry, const ClusterRecord ...
  function gtcs_to_vcf (line 2485) | static void gtcs_to_vcf(faidx_t *fai, const bpm_t *bpm, const egt_t *egt...
  type gs_col_t (line 2629) | typedef struct {
  function tsv_setter_gs_col (line 2635) | static int tsv_setter_gs_col(tsv_t *tsv, bcf1_t *rec, void *usr) {
  function tsv_setter_chrom_flexible (line 2680) | static int tsv_setter_chrom_flexible(tsv_t *tsv, bcf1_t *rec, void *usr) {
  function tsv_setter_ilmn_strand (line 2688) | static int tsv_setter_ilmn_strand(tsv_t *tsv, bcf1_t *rec, void *usr) {
  function tsv_setter_snp (line 2694) | static int tsv_setter_snp(tsv_t *tsv, bcf1_t *rec, void *usr) {
  function tsv_register_all (line 2703) | static int tsv_register_all(tsv_t *tsv, const char *id, tsv_setter_t set...
  function tsv_parse_delimiter (line 2715) | static int tsv_parse_delimiter(tsv_t *tsv, bcf1_t *rec, char *str, int d...
  function gs_to_vcf (line 2739) | static void gs_to_vcf(faidx_t *fai, const bpm_t *bpm, const egt_t *egt, ...
  function parse_tags (line 3242) | static int parse_tags(const char *str) {
  function list_tags (line 3276) | static void list_tags(void) {
  function run (line 3291) | int run(int argc, char *argv[]) {

FILE: gtc2vcf.h
  function heof (line 47) | static inline int heof(hFILE *hfile) {
  function read_bytes (line 54) | static inline void read_bytes(hFILE *hfile, void *buffer, size_t nbytes) {
  type dirent (line 70) | struct dirent
  function FILE (line 94) | static inline FILE *get_file_handle(const char *str) {
  function flank2fasta (line 106) | static inline void flank2fasta(const char *name, const char *flank, FILE...
  function bcf_hdr_name2id_flexible (line 128) | static inline int bcf_hdr_name2id_flexible(const bcf_hdr_t *hdr, char *c...
  function rev_nt (line 155) | static inline char rev_nt(char iupac) {
  function mask_nt (line 169) | static inline char mask_nt(char iupac) {
  function flank_reverse_complement (line 180) | static inline void flank_reverse_complement(char *flank) {
  function flank_left_shift (line 205) | static inline int flank_left_shift(char *flank) {
  function get_position (line 240) | static inline int get_position(htsFile *hts, sam_hdr_t *sam_hdr, bam1_t ...
  function strupper (line 345) | static inline void strupper(char *str) {
  function get_gc_ratio (line 353) | static inline float get_gc_ratio(const char *beg, const char *end) {
  function len_common_suffix (line 364) | static inline int len_common_suffix(const char *s1, const char *s2, size...
  function len_common_prefix (line 374) | static inline int len_common_prefix(const char *s1, const char *s2, size...
  function get_indel_alleles (line 388) | static inline int get_indel_alleles(kstring_t *allele_a, kstring_t *alle...
  function get_allele_b_idx (line 417) | static inline int get_allele_b_idx(char ref_base, char *allele_a, char *...
  function get_allele_a_idx (line 437) | static inline int get_allele_a_idx(int allele_b_idx) {
  function alleles_ab_to_vcf (line 450) | static inline int alleles_ab_to_vcf(const char **alleles, const char *re...
  function get_strand_from_top_alleles (line 478) | static inline int get_strand_from_top_alleles(char *allele_a, char *alle...
  function get_baf_lrr (line 524) | static inline void get_baf_lrr(float ilmn_theta, float ilmn_r, float aa_...

FILE: idat2gtc.c
  function KSORT_INIT_GENERIC (line 248) | KSORT_INIT_GENERIC(float)
  function md5_hgetc (line 327) | static inline int md5_hgetc(hFILE *fp, hts_md5_context *md5) {
  function read_bytes (line 334) | static void read_bytes(hFILE *hfile, void *buffer, size_t nbytes, hts_md...
  function heof (line 347) | static int heof(hFILE *hfile) {
  function read_array (line 354) | static void read_array(hFILE *hfile, void **arr, size_t *m_arr, size_t n...
  function read_pfx_string (line 378) | static void read_pfx_string(hFILE *hfile, char **str, size_t *m_str, hts...
  function is_gzip (line 396) | static int is_gzip(hFILE *hfile) {
  function hwrite_uint16 (line 402) | static inline int hwrite_uint16(hFILE *hfile, uint16_t num) { return hwr...
  function hwrite_int32 (line 404) | static inline int hwrite_int32(hFILE *hfile, int32_t num) { return hwrit...
  function hwrite_pfx_string (line 407) | static int hwrite_pfx_string(hFILE *hfile, const char *str) {
  type chip_type_t (line 460) | typedef struct {
  type RunInfo (line 557) | typedef struct {
  type idat_t (line 565) | typedef struct {
  function idat_read (line 602) | static int idat_read(idat_t *idat, uint16_t id) {
  function idat_t (line 700) | static idat_t *idat_init(const char *fn, int load_arrays) {
  function idat_destroy (line 754) | static void idat_destroy(idat_t *idat) {
  function idat_to_csv (line 788) | static void idat_to_csv(const idat_t *idat, FILE *stream, int verbose) {
  function idats_to_tsv (line 830) | static void idats_to_tsv(idat_t **idats, int n, FILE *stream) {
  type XForm (line 902) | typedef struct {
  type ScannerData (line 920) | typedef struct {
  type SampleData (line 928) | typedef struct {
  type gtc_t (line 937) | typedef struct {
  function leb128_strlen (line 993) | static int leb128_strlen(const char *s) {
  function gtc_write (line 1001) | static int gtc_write(const gtc_t *gtc, const char *fn, int gtc_file_vers...
  function gtc_destroy (line 1154) | static void gtc_destroy(gtc_t *gtc) {
  type LocusEntry (line 1197) | typedef struct {
  function get_assay_type (line 1242) | static uint8_t get_assay_type(const char *allele_a_probe_seq, const char...
  function locusentry_read (line 1274) | static void locusentry_read(LocusEntry *locus_entry, hFILE *hfile, hts_m...
  type bpm_t (line 1337) | typedef struct {
  function bpm_t (line 1373) | static bpm_t *bpm_init(const char *fn, int eof_check, int make_dict, int...
  function bpm_destroy (line 1455) | static void bpm_destroy(bpm_t *bpm) {
  type ClusterStats (line 1505) | typedef struct {
  type ClusterScore (line 1513) | typedef struct {
  type ClusterRecord (line 1520) | typedef struct {
  type egt_t (line 1530) | typedef struct {
  function clusterscore_read (line 1550) | static void clusterscore_read(ClusterScore *clusterscore, hFILE *hfile, ...
  function clusterrecord_read (line 1557) | static void clusterrecord_read(ClusterRecord *clusterrecord, hFILE *hfil...
  function egt_t (line 1582) | static egt_t *egt_init(const char *fn, int eof_check, int checksum) {
  function egt_destroy (line 1672) | static void egt_destroy(egt_t *egt) {
  function sqr (line 1746) | inline static double sqr(double x) { return x * x; }
  function sqrf (line 1748) | inline static float sqrf(float x) { return x * x; }
  function matlab_linsolve0 (line 1752) | static int matlab_linsolve0(int n, const float *x, const float *y, doubl...
  function matlab_linsolve1 (line 1767) | static int matlab_linsolve1(int n, const float *x, const float *y, doubl...
  function matlab_wfit0 (line 1788) | static int matlab_wfit0(int n, const float *y, const float *x, const dou...
  function matlab_wfit1 (line 1803) | static int matlab_wfit1(int n, const float *y, const float *x, const dou...
  function matlab_nanmean (line 1825) | static float matlab_nanmean(int n, const float *vals) {
  function matlab_mean (line 1839) | static float matlab_mean(int n, const float *vals) {
  function matlab_median (line 1849) | static float matlab_median(int n, float *vals) {
  function matlab_madsigma_new (line 1861) | static double matlab_madsigma_new(int n, const double *r, int p) {
  function matlab_madsigma_old (line 1885) | static double matlab_madsigma_old(int n, const double *r, int p) {
  function matlab_robustfit0 (line 1906) | static void matlab_robustfit0(int n, const float *x, const float *y, dou...
  function matlab_robustfit1 (line 1947) | static void matlab_robustfit1(int n, const float *x, const float *y, dou...
  function findClosestSitesToPointsAlongAxis (line 2018) | int findClosestSitesToPointsAlongAxis(int n_raw, float *raw_x, float *ra...
  function percentile (line 2206) | static float percentile(int n, const float *vals, int percentile) {
  function matlab_iqr (line 2229) | static float matlab_iqr(int n, const float *vals) {
  function matlab_trimmean (line 2240) | static float matlab_trimmean(int n, float *vals, int percent) {
  function matlab_unique (line 2265) | static int matlab_unique(int n, int *indices) {
  function matlab_min (line 2286) | static float matlab_min(int n, const float *vals) {
  function matlab_max (line 2297) | static float matlab_max(int n, const float *vals) {
  function remove_outliers (line 2365) | static void remove_outliers(int *n, float *x, float *y) {
  function remove_offset (line 2436) | static void remove_offset(int n, float *x, float *y, int *naa, int **iaa...
  function handle_rotation (line 2509) | static void handle_rotation(int n, float *x, float *y, int *naa, int **i...
  function handle_shear (line 2565) | static void handle_shear(int n, float *x, float *y, int *nbb, int **ibb,...
  function base_handle_scale (line 2606) | static void base_handle_scale(int n, float *x, float *y, int gentrain_ve...
  function handle_scale (line 2683) | static void handle_scale(int n, float *x, float *y, int gentrain_version...
  function get_nn12_rr12 (line 2730) | static void get_nn12_rr12(int n, const float *x, const float *y, float *...
  function normalize_single_bin (line 2766) | static void normalize_single_bin(int n, float *x, float *y, int gentrain...
  function mirror_data (line 2784) | static void mirror_data(int n, float *x, float *y) {
  function rect_to_polar (line 2807) | static void rect_to_polar(int n, float *x, float *y) {
  function normalize_single_bin_single_channel (line 2849) | static void normalize_single_bin_single_channel(int n, float *x, float *...
  function XForm (line 2890) | static XForm *normalize(int n, const uint16_t *xin, const uint16_t *yin,...
  function matlab_zmf (line 2947) | static float matlab_zmf(float x, float a, float b) {
  function matlab_smf (line 2956) | static float matlab_smf(float x, float a, float b) {
  function matlab_normpdf_vleft (line 2965) | static double matlab_normpdf_vleft(float x, float mu, float sigma) {
  function matlab_normpdf_vright (line 2972) | static double matlab_normpdf_vright(float x, float mu, float sigma) {
  function matlab_normpdf (line 2979) | static double matlab_normpdf(float x, float mu, float sigma) {
  function raw_x_y2norm_x_y (line 2990) | static inline void raw_x_y2norm_x_y(uint16_t raw_x, uint16_t raw_y, floa...
  function norm_x_y2ilmn_theta_r (line 3003) | static inline void norm_x_y2ilmn_theta_r(float norm_x, float norm_y, flo...
  function median3 (line 3018) | static inline float median3(float a, float b, float c) { return fmaxf(fm...
  function ClusterRecord (line 3022) | static ClusterRecord *gen_std_flair(const ClusterRecord *cluster_record) {
  function modilik (line 3077) | static void modilik(ClusterRecord *c, float t, float r, double *Laa, dou...
  function compute_score_call_prelim (line 3162) | static float compute_score_call_prelim(float r, float t, const ClusterRe...
  function matlab_gbellmf (line 3294) | static double matlab_gbellmf(double x, double a, double b, double c) {
  function gencall_score_map (line 3302) | static double gencall_score_map(double x) { return pow(x, 0.35) * matlab...
  function rev_allele (line 3304) | static inline char rev_allele(char allele) {
  function get_base_call (line 3315) | static void get_base_call(const char *snp, const char *ilmn_strand, uint...
  function make_calls (line 3337) | static void make_calls(gtc_t *gtc, const bpm_t *bpm, const egt_t *egt, f...
  type gender_t (line 3402) | typedef struct {
  function estimate_gender (line 3416) | static void estimate_gender(gtc_t *gtc, const bpm_t *bpm, const egt_t *e...
  function get_baf_lrr (line 3489) | static inline void get_baf_lrr(float ilmn_theta, float ilmn_r, float aa_...
  function calculate_baf_lrr (line 3511) | static void calculate_baf_lrr(gtc_t *gtc, const bpm_t *bpm, const egt_t ...
  function calculate_intensity_percentiles (line 3552) | static void calculate_intensity_percentiles(gtc_t *gtc) {
  function compute_sample_stats (line 3592) | static void compute_sample_stats(gtc_t *gtc, const bpm_t *bpm, float gen...
  function get_int32_parameter (line 3674) | static int32_t get_int32_parameter(const char *str, const char *id) {
  function load_sample_section (line 3688) | static void load_sample_section(gtc_t *gtc, const idat_t *idat, int imag...
  function get32_index (line 3703) | static int32_t get32_index(void *dict, int32_t key) {
  function fill_array (line 3711) | static void fill_array(const idat_t *grn_idat, const idat_t *red_idat, c...
  function fill_controls_array (line 3747) | static void fill_controls_array(const idat_t *grn_idat, const idat_t *re...
  function gtc_init (line 3766) | static gtc_t *gtc_init(const idat_t *grn_idat, const idat_t *red_idat, c...
  function get_file_handle (line 3921) | static inline FILE *get_file_handle(const char *str) {
  function round_adjust (line 3937) | static double round_adjust(double x) {
  function snp_map_write (line 3960) | static void snp_map_write(const bpm_t *bpm, const egt_t *egt, const char...
  function run (line 3985) | int run(int argc, char *argv[]) {

FILE: nearest_neighbor.c
  function findClosestSitesToPointsAlongAxis (line 34) | int findClosestSitesToPointsAlongAxis(int n_raw, float *raw_x, float *ra...
Condensed preview: 11 files, each showing path, character count, and a content snippet.
[
  {
    "path": "BAFregress.c",
    "chars": 23662,
    "preview": "/* The MIT License\n\n   Copyright (C) 2024-2025 Giulio Genovese\n\n   Author: Giulio Genovese <giulio.genovese@gmail.com>\n\n"
  },
  {
    "path": "HapMap.md",
    "chars": 10851,
    "preview": "HapMap\n======\n\nA tutorial for how to convert HapMap data from Illumina and Affymetrix arrays to a GRCh38 VCF using gtc2v"
  },
  {
    "path": "Illumina.md",
    "chars": 16249,
    "preview": "\nArchived Human Products\n-----------------------\n\n| array                                                               "
  },
  {
    "path": "LICENSE",
    "chars": 1081,
    "preview": "The MIT License\n\nCopyright (C) 2018-2025 Giulio Genovese\n\nPermission is hereby granted, free of charge, to any person ob"
  },
  {
    "path": "README.md",
    "chars": 40943,
    "preview": "gtc2vcf\n=======\n\nA set of tools to convert Illumina and Affymetrix DNA microarray intensity data files into VCF files <b"
  },
  {
    "path": "affy2vcf.c",
    "chars": 119690,
    "preview": "/* The MIT License\n\n   Copyright (c) 2018-2025 Giulio Genovese\n\n   Author: Giulio Genovese <giulio.genovese@gmail.com>\n\n"
  },
  {
    "path": "gtc2vcf.c",
    "chars": 173450,
    "preview": "/* The MIT License\n\n   Copyright (c) 2018-2026 Giulio Genovese\n\n   Author: Giulio Genovese <giulio.genovese@gmail.com>\n\n"
  },
  {
    "path": "gtc2vcf.h",
    "chars": 22733,
    "preview": "/* The MIT License\n\n   Copyright (c) 2018-2025 Giulio Genovese\n\n   Author: Giulio Genovese <giulio.genovese@gmail.com>\n\n"
  },
  {
    "path": "gtc2vcf_plot.R",
    "chars": 10641,
    "preview": "#!/usr/bin/env Rscript\n###\n#  The MIT License\n#\n#  Copyright (C) 2019-2025 Giulio Genovese\n#\n#  Author: Giulio Genovese "
  },
  {
    "path": "idat2gtc.c",
    "chars": 196544,
    "preview": "/* The MIT License\n\n   Copyright (c) 2024-2026 Giulio Genovese\n\n   Author: Giulio Genovese <giulio.genovese@gmail.com>\n\n"
  },
  {
    "path": "nearest_neighbor.c",
    "chars": 5733,
    "preview": "/* The MIT License\n\n   Copyright (c) 2018 Giulio Genovese\n\n   Author: Giulio Genovese <giulio.genovese@gmail.com>\n\n   Pe"
  }
]

