[
  {
    "path": "README.md",
    "content": "### Overview\nSEED is a software for clustering large sets of Next Generation Sequences (NGS) with hundreds of millions of reads in a time and memory efficient manner. Its algorithm joins highly similar sequences into clusters that can differ by up to three mismatches and three overhanging residues.\n\n### Copy right\nSEED is under the [Artistic License 2.0](http://opensource.org/licenses/Artistic-2.0).\n\n### How to cite SEED?\nIf you use SEED, please cite the following paper:  \nBao E, Jiang T, Kaloshian I, Girke T (2011) SEED: Efficient Clustering of Next Generation Sequences. Bioinformatics: [epub](http://www.hubmed.org/display.cgi?uids=21810899).\n\n### Short manual\n1. System requirements\n\n   SEED is suitable for 32-bit or 64-bit machines with Windows, OS X or Linux operating systems. At least 4GB of system memory is recommended for clustering larger data sets.\n\n2. Installation\n\n   The downloaded .cpp file can be compiled as follows:  \n   * On Mac/UNIX/Linux systems, execute on the command line: `g++ -o SEED SEED.cpp`\n   * On Windows systems, the code can be compiled under the Visual C++ environment.\n\n3. Input\n\n   Only FASTQ format is supported in the current version. The sequence length should be between 21 bp and 1000 bp with the max variation of 5 bp.\n\n4. Using SEED\n\n   ```\n   SEED --input input.fastq --output output.txt [--mismatch M] [--shift S] [--QV1 L] [--QV2 U] [--fast/short] [--reverse] [--input2 input2.fastq]\n   ```\n\n   --mismatch is the maximum number of mismatches allowed from the center sequence in each cluster (0 - 3, default 3).  \n   --shift is the maximum number of shifts allowed from the center sequence in each cluster (0 - 6, default 3).  \n   --QV1 is the threshold for the base call quality values (QV) that are provided in the FASTQ files as Phred scores. SEED ignores those mismatches where the sum of the Phred scores of the mismatching bases is lower than the specified QV1 threshold value (0 - 2 * 93). 
The default value for QV1 is 0.  \n   --QV2 is another QV threshold. It prevents co-clustering of sequences where the sum of the Phred scores over all mismatched positions is higher than the threshold value (0 - 6 * 93). The default value for QV2 is 6 * 93.  \n   --fast uses a bigger spaced seed weight to save running time. It is only applicable for sequences longer than 58 bp and may need more memory.  \n   --short uses a smaller spaced seed weight for sequences as short as 21 bp. This setting often results in longer compute times.  \n   --reverse co-clusters sequences in sense and antisense orientation (reverse complement).  \n   --input2 specifies the paired sequences so that a paired-end library can be clustered. In the current implementation, no shift is allowed with this option, and if the --reverse option is specified, the minimum sequence lengths of both pairs should be the same.\n\n5. Output\n\n   SEED outputs two files: a SEED file and a FASTQ file. The output FASTQ file has the same format as the input FASTQ file, but it contains only the center sequences and their quality scores for each cluster with one or more members. In other words, it is the filtered version of the input FASTQ file where the redundant sequences have been removed. The SEED file has a tabular format that is explained in the following table. 
The third column in this table is only available if the --reverse argument has been specified.\n\n   |Cluster ID                   |Sequence ID                  | Is Reversed                 |\n   |:----------------------------|:----------------------------|:----------------------------|\n   |Center sequence for cluster 0|                             |                             |\n   |0                            |Sequence  ID from input file |1                            |\n   |0                            |Sequence  ID from input file |0                            |\n   |Center sequence for cluster 1|                             |                             |\n   |1                            |Sequence  ID from input file |1                            |\n   |1                            |Sequence  ID from input file |0                            |\n"
  },
  {
    "path": "SEED/SEED.cpp",
    "content": "//**********************************************************************************\n//* Title: SEED: Efficient Clustering Software of Next Generation Sequences\n//* Platform: 32-Bit/64-Bit Windows/Linux/Mac\n//* Author: Ergude Bao\n//* Affliation: Department of Computer Science & Engineering\n//* University of California, Riverside\n//* Date: 08/08/2011\n//* Version: 1.4.1\n//* Copy Right: For Purpose of Study Only\n//**********************************************************************************\n//In this updated version with pre-sorting, N base detection in function Hash::build() and read sequence ID recording \n//in fucntion Hash::seqInsert should have been rolled back. However, for possible usage in future versions, they stay\n//the same.\n\n#include <iostream>\n#include <fstream>\n#include <math.h>\n#include <time.h>\n#include <cstdlib>\n#include <cstring>\nusing namespace std;\n\nint QV = 0;\nint reversed = 0;\nint paired = 0;\nint seedsCount = 10;\nint seedsWeight = 16 * 1024;\n\n#define OFFSET 33\n#define RANGE 94 \n\nstatic int seeds[10][30] = \n{\n\t1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,\n\t1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,\n\t0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,\n\t0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,\n\t0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,\n\t0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,\n\t0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1\n};\n\nstatic int fastSeeds[4][52] =\n{\n\t1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1\n};\n\nstatic int shortSeeds[10][15] =\n{\n\t1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,\n\t1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,\n\t1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,\n\t0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,\n\t0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,\n\t0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,\n\t0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,\n\t0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,\n\t0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1\t\n};\n\nclass Hash\n{\n\tifstream in;\n\tint lowerSizeInBit, lowerSizeInChar, upperSizeInBit, upperSizeInChar, num, mismatchAllowed;\n\tchar * seqHead;\n\tunsigned int ** indexHead;\n//\tunsigned int * offsetCount;\npublic:\n\tunsigned int * offsetCount;\n\tHash(char [], int, int, int);\n\tvoid build();\n\tvoid seqInsert(char [], int, char *, unsigned int);\n\tvoid indexInsert(char [], unsigned int);\n\tchar change(char);\n\tchar changeBack(char);\n\tunsigned int calOffset(int, char []);\n\tint searchByIndex(int, unsigned int, int, char []);\n\tint searchBySeq(unsigned int, char []);\n\tvoid deleteByIndex(int, unsigned int, int);\n\tvoid deleteBySeq(unsigned int);\n\tint calSeqID(int, unsigned int, int);\n\tvoid QVInsert(char [], int, char *, unsigned int);\n\tint 
searchByIndex(int, unsigned int, int, char [], char [], unsigned int &);\n\tint searchBySeq(unsigned int, char [], char [], unsigned int &);\n\tvoid seqInsert(char [], int, char *, unsigned int, unsigned int);\n\tint searchByIndex(int, unsigned int, int, char [], unsigned int &);\n\tint searchBySeq(unsigned int, char [], unsigned int &);\n\tvoid adjust();\n\tvoid tmpDeleteByIndex(int, unsigned int, int);\n\tvoid recoverByIndex(int, unsigned int, int);\n\tunsigned int getOffsetCount(unsigned int, int);\n};\n\nclass FastqGenerator\n{\n\tifstream in;\n\tchar addiInput[100];\n\tifstream addiIn;\n\tchar outputq[100];\n\tofstream out;\n\tint num;\n\tchar * seq;\npublic:\n\tFastqGenerator(char [], char [], int);\n\tFastqGenerator(char [], char [], int, int);\n\tvoid record();\n\tvoid generateFastq();\n};\n\nclass Cluster\n{\n\tofstream out;\n//\tofstream dis;\n\tint lowerSizeInChar;\n\tint upperSizeInChar;\n\tint lowerSizeInBit;\n\tint upperSizeInBit;\n\tint num;\n\tint mismatchAllowed;\n\tint shiftAllowed;\n\tint CLID;\n//\tint numInCL;\n//\tHash * h;\n\tint lowerQV;\n\tint upperQV;\n\tofstream addiOut;\n\tchar addiOutput[100];\n\tunsigned long seqNum;\n\tunsigned long adjustNum;\n\tunsigned int ** mappingTable;\n\tunsigned int * mappingNum;\n\tchar midInput[100];\npublic:\n\tHash * h;\n\tCluster(char [], char [], int, int, int, int, int, unsigned int **, unsigned int *);\n\tvoid cluster();\n\tint compare(char [], char [], int, int, int &);\n\tvoid clusterWithMismatches(char []);\n\tvoid clusterWithShifts(char []);\n\tchar max(int, int, int, int);\n\tvoid clusterByConsensus();\n\tvoid calConsensus(char [], unsigned int, int &);\n\tvoid calConsensus(char [], char [], unsigned int, int &);\n\tvoid preprocess();\n\tCluster(char [], char [], int, int, int, int, int, int, int, unsigned int **, unsigned int *);\n\tint compare(char [], char [], char [], char [], int, int, int &);\n\tvoid clusterWithMismatches(char [], char []);\n\tvoid clusterWithShifts(char [], char 
[]);\n\tchar reverseChange(char);\n};\n\nclass FileAnalyzer\n{\npublic:\n\tvoid inputAnalyze(char [], int &, int &, int &, int &);\n\tvoid outputAnalyze(int, int);\n\tvoid PECombine(char [], int, char [], int, char *, int &, int &, int &);\n};\n\nclass Sorter\n{\n\tifstream in;\n\tofstream midOut;\n\tchar midOutput[100];\n\tint num;\n\tunsigned int ** mappingTable;\n\tunsigned int * mappingNum;\n\tint lowerSizeInChar;\n\tint realNum;\n\ttypedef struct \n\t{\n\t\tint ID;\n\t\tint realID;\n\t} Order;\npublic:\n\tSorter(char [], int, int);\n\tvoid sort();\n\tvoid suffixSort(int, int, int, char [], Order []);\n\tint getRealNum();\n\tunsigned int ** getMappingTable();\n\tunsigned int * getMappingNum();\n};\n\nHash::Hash(char input[], int num, int lowerSizeInChar, int upperSizeInChar)\n{\n\tlong int i;\n\n\tif(upperSizeInChar % 4)\n\t\tthis->upperSizeInBit = upperSizeInChar / 4 + 1;\n\telse\n\t\tthis->upperSizeInBit = upperSizeInChar / 4;\n\tif(lowerSizeInChar % 4)\n\t\tthis->lowerSizeInBit = lowerSizeInChar / 4 + 1;\n\telse\n\t\tthis->lowerSizeInBit = lowerSizeInChar / 4;\n\tthis->lowerSizeInChar = lowerSizeInChar;\n\tthis->upperSizeInChar = upperSizeInChar;\n\tthis->num = num;\n\tif(QV)\n\t{\n\t\tseqHead = new char[(long int)num * (upperSizeInBit + 5 + upperSizeInChar)];\n\t\tfor(i = 0; i < (long int)num * (upperSizeInBit + 5 + upperSizeInChar); i ++)\n\t\t\tseqHead[i] = 0;\n\t}\n\telse\n\t{\n\t\tseqHead = new char[(long int)num * (upperSizeInBit + 5)];\n\t\tfor(i = 0; i < (long int)num * (upperSizeInBit + 5); i ++)\n\t\t\tseqHead[i] = 0;\n\t}\n//\tindexHead = new unsigned int * [1024 * 1024 * 16 * 10];\n//\tindexHead = new unsigned int * [1024 * 1024 * 64 * 4];\n\tindexHead = new unsigned int * [1024 * seedsWeight * seedsCount];\n\toffsetCount = new unsigned int [1024 * seedsWeight * seedsCount];\n\tfor(i = 0; i < (long int)1024 * seedsWeight * seedsCount; i ++)\n\t\toffsetCount[i] = 0;\n\tin.open(input);\n}\n\nvoid Hash::build()\n{\n\tchar buf[1001];\n\tint i, count 
= 0, tag = 1;\n\tunsigned int seqOffset = 0, seqID;\n\n//\tint seqNext;\n//\tchar base[4];\n//\tofstream out;\n\n\tint totalLength, j;\n\n\tif(in.is_open())\n\t{\n\t\tseqID = 0;\ncont:\n\t\twhile(seqID < num * 4)\n\t\t{\n\t\t\tin.getline(buf, 1001);\n\t\t\tif(seqID % 4 == 1)\n\t\t\t{\n\t\t\t\tfor(i = 0; i < in.gcount() - 1; i ++)\n\t\t\t\t{\n\t\t\t\t\tif(buf[i] == 'N')\n\t\t\t\t\t{\n\t\t\t\t\t\ttag = 0;\n\t\t\t\t\t\tseqID ++;\n\t\t\t\t\t\tgoto cont;\n\t\t\t\t\t}//to deal with N\n\t\t\t\t\telse\n\t\t\t\t\t{\n\t\t\t\t\t\ttag = 1;\n\t\t\t\t\t\tbuf[i] = change(buf[i]);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tseqInsert(buf, in.gcount() - 1, seqHead, seqOffset, seqID/4);\n\t\t\t\tindexInsert(buf, seqOffset);\n\t\t\t\tseqOffset = seqOffset + (upperSizeInBit + 5);\n\t\t\t}\n\t\t\tif(QV)\n\t\t\t{\n\t\t\t\tif(seqID % 4 == 3 && tag == 1)\n\t\t\t\t{\n\t\t\t\t\tif(buf[in.gcount()] == '\\n')\n\t\t\t\t\t\tQVInsert(buf, in.gcount() - 1, seqHead, seqOffset);\n\t\t\t\t\telse\n\t\t\t\t\t\tQVInsert(buf, in.gcount(), seqHead, seqOffset);//for the last line in Windows OS\n\t\t\t\t\tseqOffset = seqOffset + upperSizeInChar;\n\t\t\t\t}\n\t\t\t\tfor(i = 0; i < 1001; i ++)\n\t\t\t\t\tbuf[i] = 0;\n\t\t\t}\n\t\t\tseqID ++;\n\t\t}\n\t}\n\telse\n\t{\n\t\tcout << \"CANNOT OPEN INPUT FILE!\" << endl;\n\t\texit(-1);\n\t}\n\n//verification\n//\tcout << endl;\n//#ifdef QV\n//#ifdef REALID\n//\tfor(seqOffset = 0; seqOffset < num * (upperSizeInBit + 5 + upperSizeInChar); seqOffset = seqOffset + upperSizeInBit + 5 + upperSizeInChar)\n//#else\n//\tfor(seqOffset = 0; seqOffset < num * (upperSizeInBit + 1 + upperSizeInChar); seqOffset = seqOffset + upperSizeInBit + 1 + upperSizeInChar)\n//#endif\n//#else\n//#ifdef REALID\n//\tfor(seqOffset = 0; seqOffset < num * (upperSizeInBit + 5); seqOffset = seqOffset + upperSizeInBit + 5)\n//#else\n//\tfor(seqOffset = 0; seqOffset < num * (upperSizeInBit + 1); seqOffset = seqOffset + upperSizeInBit + 1)\n//#endif\n//#endif\n//\t{\n//\t\tcout << (int)*(seqHead + seqOffset) << 
\": \" << endl;\n//\t\tfor(seqNext = 1; seqNext < 5; seqNext ++)\n//\t\t{\n//\t\t\tcout << (int)*(seqHead + seqOffset + seqNext) << \" \";\n//\t\t}\n//\t\tcout << endl;\n//#ifdef REALID\n//\t\tfor(seqNext = 5; seqNext < upperSizeInBit + 5; seqNext ++)\n//#else\n//\t\tfor(seqNext = 1; seqNext < upperSizeInBit + 1; seqNext ++)\n//#endif\n//\t\t{\n//\t\t\tbase[0] = (*(seqHead + seqOffset + seqNext) >> 6) & 0x03;\n//\t\t\tbase[1] = (*(seqHead + seqOffset + seqNext) >> 4) & 0x03;\n//\t\t\tbase[2] = (*(seqHead + seqOffset + seqNext) >> 2) & 0x03;\n//\t\t\tbase[3] = (*(seqHead + seqOffset + seqNext)) & 0x03;\n//\t\t\tcout << (unsigned int)base[0] << \" \" << (unsigned int)base[1] << \" \" << (unsigned int)base[2] << \" \" << (unsigned int)base[3] << \" \";\n//\t\t}\n//#ifdef QV\n//\t\tcout << endl;\n//#ifdef REALID\n//\t\tfor(; seqNext < upperSizeInBit + 1 + upperSizeInChar; seqNext ++)\n//#else\n//\t\tfor(; seqNext < upperSizeInBit + 5 + upperSizeInChar; seqNext ++)\n//#endif\n//\t\t\tcout << *(seqHead + seqOffset + seqNext) << \" \";\n//#endif\n//\t\tcout << endl;\n//\t}\n//verification\n/*\n        totalLength = 0;\n        count = 0;\n        for(i = 0; i < seedsCount; i ++)\n        {\n                for(j = 0; j < 1024 * seedsWeight; j ++)\n                        if(c.h->offsetCount[j * seedsCount + i] > 1000)\n                        {\n                                totalLength = totalLength + c.h->offsetCount[j * seedsCount + i];\n                                count ++;\n                        }\n        }\n        cout << \"#buckets longer than 1000: \" << count << endl;\n\tif(count != 0)\n\t        cout << \"average length of the buckets: \" << totalLength / count << endl;\n*/\n}\n\nvoid Hash::adjust()\n{\n\tint i, j, k ,p;\n\n\tfor(i = 0; i < seedsCount; i ++)\n\t\tfor(j = 0; j < 1024 * seedsWeight; j ++)\n\t\t\tif(offsetCount[j * seedsCount + i] != 0)\n\t\t\t{\n\t\t\t\tp = offsetCount[j * seedsCount + i] - 1;\n\t\t\t\tfor(k = 0; k < offsetCount[j * 
seedsCount + i] && p > k; k ++)\n\t\t\t\t{\n\t\t\t\t\tif(seqHead[indexHead[j * seedsCount + i][k]] == 0)\n\t\t\t\t\t{\n\t\t\t\t\t\twhile(seqHead[indexHead[j * seedsCount + i][p]] == 0 && p > k)\n\t\t\t\t\t\t{\n\t\t\t\t\t\t\tp --;\n\t\t\t\t\t\t\toffsetCount[j * seedsCount + i] --;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(p > k)\n\t\t\t\t\t\t{\n\t\t\t\t\t\t\tindexHead[j * seedsCount + i][k] = indexHead[j * seedsCount + i][p];\n\t\t\t\t\t\t\tp --;\n\t\t\t\t\t\t\toffsetCount[j * seedsCount + i] --;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(seqHead[indexHead[j * seedsCount + i][k]] == 0)\n\t\t\t\t\toffsetCount[j * seedsCount + i] --;\n\t\t\t}\n}\n\nvoid Hash::seqInsert(char buf[], int realSize, char * seqHead, unsigned int seqOffset, unsigned int seqID)\n{\n\tint i, seqNext;\n\tchar bitBuf = 0x00;\n\n\t*(seqHead + seqOffset) = 0x01;// the first byte was intended to record size of the seq, but is now used for existence of the seq. If it is changed for the recording purpose in the future, more bytes would be required.\n\t*(seqHead + seqOffset + 1) = (char) ((seqID & 0xff000000) >> 24);\n\t*(seqHead + seqOffset + 2) = (char) ((seqID & 0x00ff0000) >> 16);\n\t*(seqHead + seqOffset + 3) = (char) ((seqID & 0x0000ff00) >> 8);\n\t*(seqHead + seqOffset + 4) = (char) (seqID & 0x000000ff);\n\tfor(i = 0, seqNext = 5; i < realSize; i ++)\n\t{\n\t\tif((i + 1) % 4 == 0)\n\t\t{\n\t\t\tbitBuf = (bitBuf | buf[i]);\n\t\t\t*(seqHead + seqOffset + seqNext) = bitBuf;\n\t\t\tseqNext ++;\n\t\t\tbitBuf = 0x00;\n\t\t}\n\t\telse if(i == realSize - 1)\n\t\t{\n\t\t\tbitBuf = (bitBuf | buf[i]) << (4 - realSize % 4) * 2;\n\t\t\t*(seqHead + seqOffset + seqNext) = bitBuf;\n\t\t\tseqNext ++;\n\t\t\tbitBuf = 0x00;\n\t\t}\n\t\telse\n\t\t\tbitBuf = (bitBuf | buf[i]) << 2;\n\t}\n}\n\nvoid Hash::QVInsert(char buf[], int realSize, char * seqHead, unsigned int seqOffset)\n{\n\tint i, seqNext = 0;\n\n\tfor(i = 0; i < realSize; i ++)\n\t{\n\t\t*(seqHead + seqOffset + seqNext) = buf[i];\n\t\tseqNext 
++;\n\t}\n}\n\nvoid Hash::indexInsert(char buf[], unsigned int seqOffset)\n{\n\tint indexNext;\n\tunsigned int indexOffset = 0;\n\n\tfor(indexNext = 0; indexNext < seedsCount; indexNext ++)\n\t{\n\t\tindexOffset = calOffset(indexNext, buf);\n\t\tif(offsetCount[indexOffset + indexNext] == 0)\n\t\t{\n\t\t\tindexHead[indexOffset + indexNext] = (unsigned int *) malloc((++ offsetCount[indexOffset + indexNext]) * sizeof(unsigned int));\n\t\t\tif(indexHead[indexOffset + indexNext] == NULL)\n\t\t\t{\n\t\t\t\tcout << \"CANNOT ALLOCATE MEMORY!\" << endl;\n\t\t\t\texit(-1);\n\t\t\t}\n\t\t}\n\t\telse\n\t\t{\n\t\t\tindexHead[indexOffset + indexNext] = (unsigned int *) realloc(indexHead[indexOffset + indexNext], (++ offsetCount[indexOffset + indexNext]) * sizeof(unsigned int));\n\t\t\tif(indexHead[indexOffset + indexNext] == NULL)\n\t\t\t{\n\t\t\t\tcout << \"CANNOT ALLOCATE MEMORY!\" << endl;\n\t\t\t\texit(-1);\n\t\t\t}\n\t\t}\n\t\tindexHead[indexOffset + indexNext][offsetCount[indexOffset + indexNext] - 1] = seqOffset;\n\t}\n}\n\nchar Hash::change(char base)\n{\n\tswitch(base)\n\t{\n\t\tcase 'A': return 0x00;\n\t\tcase 'C': return 0x01;\n\t\tcase 'G': return 0x02;\n\t\tcase 'T': return 0x03;\n\t\tdefault: cout << \"INPUT ERROR!\" << endl; exit(-1);\n\t}\n}\n\nchar Hash::changeBack(char base)\n{\n\tswitch(base)\n\t{\n\t\tcase 0x00: return 'A';\n\t\tcase 0x01: return 'C';\n\t\tcase 0x02: return 'G';\n\t\tcase 0x03: return 'T';\n\t\tdefault: cout << \"MEMORY ERROR!\" << endl; exit(-1);\n\t}\n}\n\nunsigned int Hash::calOffset(int indexNext, char buf[])\n{\n\tunsigned int indexOffset = 0;\n\tint i, j = 0;\n\n\tif(seedsWeight == 1024 * 16)\n\t{\n\t\tfor(i = 0; i < 30; i ++)\n\t\t\tif(seeds[indexNext][i] == 1)\n\t\t\t\tindexOffset = indexOffset + buf[i + lowerSizeInChar - 33] * (unsigned int)pow(4, j ++);\n\t}\n\telse if(seedsWeight == 1024 * 64)\n\t{\n\t\tfor(i = 0; i < 52; i ++)\n\t\t\tif(fastSeeds[indexNext][i] == 1)\n\t\t\t\tindexOffset = indexOffset + buf[i + lowerSizeInChar - 
55] * (unsigned int)pow(4, j ++);\n\t}\n\telse\n\t{\n\t\tfor(i = 0; i < 15; i ++)\n                        if(shortSeeds[indexNext][i] == 1)\n                                indexOffset = indexOffset + buf[i + lowerSizeInChar - 18] * (unsigned int)pow(4, j ++);\n\t}\n\treturn indexOffset * seedsCount;\n}\n\nint Hash::searchByIndex(int indexNext, unsigned int indexOffset, int no, char buf[], unsigned int & seqID)\n{\n\tint seqNext, p = 0;\n\tint i;\n\n\tseqID = 0;\n\tfor(i = 1; i < 5; i ++)\n\t{\n\t\tseqID = seqID | (((unsigned int) seqHead[indexHead[indexOffset + indexNext][no] + i]) & 0x000000ff);\n\t\tif(i < 4)\n\t\t\tseqID = seqID << 8;\n\t}\n\tfor(seqNext = 5; seqNext < upperSizeInBit + 5; seqNext ++)\n\t{\n\t\tbuf[p ++] = (seqHead[indexHead[indexOffset + indexNext][no] + seqNext] >> 6) & 0x03;\n\t\tbuf[p ++] = (seqHead[indexHead[indexOffset + indexNext][no] + seqNext] >> 4) & 0x03;\n\t\tbuf[p ++] = (seqHead[indexHead[indexOffset + indexNext][no] + seqNext] >> 2) & 0x03;\n\t\tbuf[p ++] = (seqHead[indexHead[indexOffset + indexNext][no] + seqNext]) & 0x03;\n\t}\n\treturn (int)seqHead[indexHead[indexOffset + indexNext][no]];\n}\n\nint Hash::searchBySeq(unsigned int seqOffset, char buf[], unsigned int & seqID)\n{\n\tint seqNext, p = 0;\n\tint i;\n\n\tseqID = 0;\n\tfor(i = 1; i < 5; i ++)\n\t{\n\t\tseqID = seqID | (((unsigned int) seqHead[seqOffset + i]) & 0x000000ff);\n\t\tif(i < 4)\n\t\t\tseqID = seqID << 8;\n\t}\n\tfor(seqNext = 5; seqNext < upperSizeInBit + 5; seqNext ++)\n\t{\n\t\tbuf[p ++] = (seqHead[seqOffset + seqNext] >> 6) & 0x03;\n\t\tbuf[p ++] = (seqHead[seqOffset + seqNext] >> 4) & 0x03;\n\t\tbuf[p ++] = (seqHead[seqOffset + seqNext] >> 2) & 0x03;\n\t\tbuf[p ++] = (seqHead[seqOffset + seqNext]) & 0x03;\n\t}\n\treturn (int)seqHead[seqOffset];\n}\n\nint Hash::searchByIndex(int indexNext, unsigned int indexOffset, int no, char buf[], char QVBuf[], unsigned int & seqID)\n{\n\tint seqNext, p = 0;\n\tint i;\n\n\tseqID = 0;\n\tfor(i = 1; i < 5; i 
++)\n\t{\n\t\tseqID = seqID | (((unsigned int) seqHead[indexHead[indexOffset + indexNext][no] + i]) & 0x000000ff);\n\t\tif(i < 4)\n\t\t\tseqID = seqID << 8;\n\t}\n\tfor(seqNext = 5; seqNext < upperSizeInBit + 5; seqNext ++)\n\t{\n\t\tbuf[p ++] = (seqHead[indexHead[indexOffset + indexNext][no] + seqNext] >> 6) & 0x03;\n\t\tbuf[p ++] = (seqHead[indexHead[indexOffset + indexNext][no] + seqNext] >> 4) & 0x03;\n\t\tbuf[p ++] = (seqHead[indexHead[indexOffset + indexNext][no] + seqNext] >> 2) & 0x03;\n\t\tbuf[p ++] = (seqHead[indexHead[indexOffset + indexNext][no] + seqNext]) & 0x03;\n\t}\n\tp = 0;\n\tfor(; seqNext < upperSizeInBit + 5 + upperSizeInChar; seqNext ++)\n\t\tQVBuf[p ++] = seqHead[indexHead[indexOffset + indexNext][no] + seqNext];\n\treturn (int)seqHead[indexHead[indexOffset + indexNext][no]];\n}\n\nint Hash::searchBySeq(unsigned int seqOffset, char buf[], char QVBuf[], unsigned int & seqID)\n{\n\tint seqNext, p = 0;\n\tint i;\n\n\tseqID = 0;\n\tfor(i = 1; i < 5; i ++)\n\t{\n\t\tseqID = seqID | (((unsigned int) *(seqHead + seqOffset + i)) & 0x000000ff);\n\t\tif(i < 4)\n\t\t\tseqID = seqID << 8;\n\t}\n\tfor(seqNext = 5; seqNext < upperSizeInBit + 5; seqNext ++)\n\t{\n\t\tbuf[p ++] = (seqHead[seqOffset + seqNext] >> 6) & 0x03;\n\t\tbuf[p ++] = (seqHead[seqOffset + seqNext] >> 4) & 0x03;\n\t\tbuf[p ++] = (seqHead[seqOffset + seqNext] >> 2) & 0x03;\n\t\tbuf[p ++] = (seqHead[seqOffset + seqNext]) & 0x03;\n\t}\n\tp = 0;\n\tfor(; seqNext < upperSizeInBit + 5 + upperSizeInChar; seqNext ++)\n\t\tQVBuf[p ++] = seqHead[seqOffset + seqNext];\n\treturn (int)seqHead[seqOffset];\n}\n\nvoid Hash::deleteByIndex(int indexNext, unsigned int indexOffset, int no)\n{\n\tseqHead[indexHead[indexOffset + indexNext][no]] = 0;\n}\n\nvoid Hash::tmpDeleteByIndex(int indexNext, unsigned int indexOffset, int no)\n{\n\tseqHead[indexHead[indexOffset + indexNext][no]] = seqHead[indexHead[indexOffset + indexNext][no]] | 0x80;\n}\n\nvoid Hash::recoverByIndex(int indexNext, unsigned int 
indexOffset, int no)\n{\n\tseqHead[indexHead[indexOffset + indexNext][no]] = seqHead[indexHead[indexOffset + indexNext][no]] & 0x7f;\n}\n\nvoid Hash::deleteBySeq(unsigned int seqOffset)\n{\n\tseqHead[seqOffset] = 0;\n}\n\nint Hash::calSeqID(int indexNext, unsigned int indexOffset, int no)\n{\n\tif(QV)\n\t\treturn indexHead[indexOffset + indexNext][no]/(upperSizeInBit + 1 + upperSizeInChar);\n\telse\n\t\treturn indexHead[indexOffset + indexNext][no]/(upperSizeInBit + 1);\n}\n\nunsigned int Hash::getOffsetCount(unsigned int indexOffset, int indexNext)\n{\n\treturn offsetCount[indexOffset + indexNext];\n}\n\nCluster::Cluster(char input[], char output[], int num, int lowerSizeInChar, int upperSizeInChar, int mismatchAllowed, int shiftAllowed, int lowerQV, int upperQV, unsigned int ** mappingTable, unsigned int * mappingNum)\n{\n\tif(lowerSizeInChar % 4)\n\t\tthis->lowerSizeInBit = lowerSizeInChar / 4 + 1;\n\telse\n\t\tthis->lowerSizeInBit = lowerSizeInChar / 4;\n\tif(upperSizeInChar % 4)\n\t\tthis->upperSizeInBit = upperSizeInChar / 4 + 1;\n\telse\n\t\tthis->upperSizeInBit = upperSizeInChar / 4;\n\tthis->lowerSizeInChar = lowerSizeInChar;\n\tthis->upperSizeInChar = upperSizeInChar;\n\tthis->num = num;\n\tthis->mismatchAllowed = mismatchAllowed;\n\tthis->shiftAllowed = shiftAllowed;\n\tthis->CLID = 0;\n\tthis->lowerQV = lowerQV;\n\tthis->upperQV = upperQV;\n\tthis->seqNum = 0;\n\tthis->adjustNum = num / 20;\n\tthis->mappingTable = mappingTable;\n\tthis->mappingNum = mappingNum;\n\tstrcpy(midInput, input);\n\tstrcat(midInput, \".mid.fastq\");\n\tout.open(output);\n\tstrcpy(addiOutput, output);\n\tstrcat(addiOutput, \".fasta\");\n\taddiOut.open(addiOutput);\n//\tdis.open(\"distribution.txt\");\n\tout << \"CLID\tSeqID\" << endl;\n//\tdis << \"CLID\tNo\" << endl;\n\th = new Hash(midInput, num, lowerSizeInChar, upperSizeInChar);\n\th->build();\n}\n\nCluster::Cluster(char input[], char output[], int num, int lowerSizeInChar, int upperSizeInChar, int mismatchAllowed, int 
shiftAllowed, unsigned int ** mappingTable, unsigned int * mappingNum)\n{\n\tif(lowerSizeInChar % 4)\n\t\tthis->lowerSizeInBit = lowerSizeInChar / 4 + 1;\n\telse\n\t\tthis->lowerSizeInBit = lowerSizeInChar / 4;\n\tif(upperSizeInChar % 4)\n\t\tthis->upperSizeInBit = upperSizeInChar / 4 + 1;\n\telse\n\t\tthis->upperSizeInBit = upperSizeInChar / 4;\n\tthis->lowerSizeInChar = lowerSizeInChar;\n\tthis->upperSizeInChar = upperSizeInChar;\n\tthis->num = num;\n\tthis->mismatchAllowed = mismatchAllowed;\n\tthis->shiftAllowed = shiftAllowed;\n\tthis->CLID = 0;\n\tthis->seqNum = 0;\n\tthis->adjustNum = num / 20;\n\tthis->mappingTable = mappingTable;\n\tthis->mappingNum = mappingNum;\n\tstrcpy(midInput, input);\n\tstrcat(midInput, \".mid.fastq\");\n\tout.open(output);\n\tstrcpy(addiOutput, output);\n\tstrcat(addiOutput, \".fasta\");\n\taddiOut.open(addiOutput);\n//\tdis.open(\"distribution.txt\");\n\tout << \"CLID\tSeqID\" << endl;\n//\tdis << \"CLID\tNo\" << endl;\n\th = new Hash(midInput, num, lowerSizeInChar, upperSizeInChar);\n\th->build();\n}\n\nchar Cluster::max(int a, int b, int c, int d)\n{\n\tif(b >= a && b >= c && b >= d) return 0x01;\n\tif(c >= a && c >= b && c >= d) return 0x02;\n\tif(d >= a && d >= b && d >= c) return 0x03;\n\tif(a >= b && a >= c && a >= d) return 0x00;\n\tcout << \"UNKNOWN ERROR!\" << endl; exit(-1);\n}\n\nvoid Cluster::calConsensus(char sBuf[], unsigned int sSeqID, int & tagReverse)\n{\n\tlong int A[1000] = {0}, C[1000] = {0}, G[1000] = {0}, T[1000] = {0};\n\tint indexNext, no, i, realSize, similarity = 1000, diff, j;\n\tchar sBufBak[1000], tBuf[1000], buf[1000];\n\tunsigned int indexOffset;\n\tunsigned int seqID, centerSeqID;\n\n\tfor(indexNext = 0; indexNext < seedsCount; indexNext ++)\n\t{\n\t\tindexOffset = h->calOffset(indexNext, sBuf);\n\t\tfor(no = 0; no < h->getOffsetCount(indexOffset, indexNext); no ++)\n\t\t{\n\t\t\trealSize = h->searchByIndex(indexNext, indexOffset, no, tBuf, seqID);\n\t\t\tif(realSize > 0 && compare(sBuf, tBuf, 0, 
lowerSizeInChar, tagReverse) <= mismatchAllowed * 2)\n\t\t\t{\n\t\t\t\tif(reversed && tagReverse)\n\t\t\t\t\tfor(i = 0, j = lowerSizeInChar - 1; i < lowerSizeInChar; i ++, j --)\n\t\t\t\t\t\tswitch(tBuf[j])\n\t\t\t\t\t\t{\n\t\t\t\t\t\t\tcase 0x00: T[i] = T[i] + mappingNum[seqID]; break;\n\t\t\t\t\t\t\tcase 0x01: G[i] = G[i] + mappingNum[seqID]; break;\n\t\t\t\t\t\t\tcase 0x02: C[i] = C[i] + mappingNum[seqID]; break;\n\t\t\t\t\t\t\tcase 0x03: A[i] = A[i] + mappingNum[seqID]; break;\n\t\t\t\t\t\t\tdefault: cout << \"MEMORY ERROR!\" << endl; exit(-1);\n\t\t\t\t\t\t}\t\n\t\t\t\telse\n\t\t\t\t\tfor(i = 0; i < lowerSizeInChar; i ++)\n\t\t\t\t\t\tswitch(tBuf[i])\n\t\t\t\t\t\t{\n\t\t\t\t\t\t\tcase 0x00: A[i] = A[i] + mappingNum[seqID]; break;\n\t\t\t\t\t\t\tcase 0x01: C[i] = C[i] + mappingNum[seqID]; break;\n\t\t\t\t\t\t\tcase 0x02: G[i] = G[i] + mappingNum[seqID]; break;\n\t\t\t\t\t\t\tcase 0x03: T[i] = T[i] + mappingNum[seqID]; break;\n\t\t\t\t\t\t\tdefault: cout << \"MEMORY ERROR!\" << endl; exit(-1);\n\t\t\t\t\t\t}\n\t\t\t\th->tmpDeleteByIndex(indexNext, indexOffset, no);\n\t\t\t}\n\t\t}\n\t}\n\tfor(i = 0; i < lowerSizeInChar; i ++)\n\t{\n\t\tsBufBak[i] = sBuf[i];\n\t\tsBuf[i] = max(A[i], C[i], G[i], T[i]);\n\t}\n//find the most similar seq to the consensus and write to the output\n\tfor(indexNext = 0; indexNext < seedsCount; indexNext ++)\n\t{\n\t\tindexOffset = h->calOffset(indexNext, sBufBak);\n\t\tfor(no = 0; no < h->getOffsetCount(indexOffset, indexNext); no ++)\n\t\t{\n\t\t\trealSize = h->searchByIndex(indexNext, indexOffset, no, tBuf, seqID);\n\t\t\tif(realSize < 0)\n\t\t\t{\n\t\t\t\th->recoverByIndex(indexNext, indexOffset, no);\n\t\t\t\tdiff = compare(sBuf, tBuf, 0, lowerSizeInChar, tagReverse);\n\t\t\t\tif(diff < similarity)\n\t\t\t\t{\n\t\t\t\t\tsimilarity = diff;\n\t\t\t\t\tfor(i = 0; i < lowerSizeInChar; i ++)\n\t\t\t\t\t\tbuf[i] = tBuf[i];\n\t\t\t\t\tcenterSeqID = seqID;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n//if the virtual center and the center are too 
far\n\tif(compare(sBuf, buf, 0, lowerSizeInChar, tagReverse) <= mismatchAllowed)// bug exists here: the same center sequence is found for many clusters since the center sequence is not deleted\n\t{\n\t\tif(paired == 0)\n\t\t{\n\t\t\taddiOut << \">\" << mappingTable[centerSeqID][0] << endl;\n\t\t\tfor(i = 0; i < lowerSizeInChar; i ++)\n\t\t\t{\n\t\t\t\tout << h->changeBack(buf[i]);\n\t\t\t\taddiOut << h->changeBack(buf[i]);\n\t\t\t}\n\t\t\tout << endl;\n\t\t\taddiOut << endl;\n\t\t}\n\t\telse\n\t\t{\n\t\t\taddiOut << \">\" << mappingTable[centerSeqID][0] << endl;\n\t\t\tfor(i = 0; i < paired; i ++)//paired == lower\n\t\t\t{\n\t\t\t\tout << h->changeBack(buf[i]);\n\t\t\t\taddiOut << h->changeBack(buf[i]);\n\t\t\t}\n\t\t\tout << endl;\n\t\t\taddiOut << endl;\n\t\t\taddiOut << \">\" << mappingTable[centerSeqID][0] << endl;\n\t\t\tfor(; i < lowerSizeInChar; i ++)\n\t\t\t{\n\t\t\t\tout << h->changeBack(buf[i]);\n\t\t\t\taddiOut << h->changeBack(buf[i]);\n\t\t\t}\n\t\t\tout << endl;\n\t\t\taddiOut << endl;\n\t\t}\n\t}\n\telse\n\t{\n\t\tif(paired == 0)\n\t\t{\n\t\t\taddiOut << \">\" << mappingTable[sSeqID][0] << endl;\n\t\t\tfor(i = 0; i < lowerSizeInChar; i ++)\n\t\t\t{\n\t\t\t\tout << h->changeBack(sBufBak[i]); //out << h->changeBack(sBuf[i]);\n\t\t\t\taddiOut << h->changeBack(sBufBak[i]); //addiOut << h->changeBack(sBuf[i]);\n\t\t\t}\n\t\t\tout << endl;\n\t\t\taddiOut << endl;\n\t\t}\n\t\telse\n\t\t{\n\t\t\taddiOut << \">\" << mappingTable[sSeqID][0] << endl;\n\t\t\tfor(i = 0; i < paired; i ++)//paired == lower\n\t\t\t{\n\t\t\t\tout << h->changeBack(sBufBak[i]);\n\t\t\t\taddiOut << h->changeBack(sBufBak[i]);\n\t\t\t}\n\t\t\tout << endl;\n\t\t\taddiOut << endl;\n\t\t\taddiOut << \">\" << mappingTable[sSeqID][0] << endl;\n\t\t\tfor(; i < lowerSizeInChar; i ++)\n\t\t\t{\n\t\t\t\tout << h->changeBack(sBufBak[i]);\n\t\t\t\taddiOut << h->changeBack(sBufBak[i]);\n\t\t\t}\n\t\t\tout << endl;\n\t\t\taddiOut << endl;\n\t\t}\n//still need to decide if the source sequence is 
reverse complementary to the virtual center\n\t\tcompare(sBuf, sBufBak, 0, lowerSizeInChar, tagReverse);\n\t}\n}\n\nvoid Cluster::calConsensus(char sBuf[], char sQVBuf[], unsigned int sSeqID, int & tagReverse)\n{\n        long int A[1000] = {0}, C[1000] = {0}, G[1000] = {0}, T[1000] = {0}, QV[1000] = {0};\n        int indexNext, no, i, realSize, similarity = 1000, diff, j;\n        char sBufBak[1000], tBuf[1000], tQVBuf[1000], buf[1000], QVBuf[1000];\n        unsigned int indexOffset;\n        unsigned int seqID, centerSeqID;\n\n        for(indexNext = 0; indexNext < seedsCount; indexNext ++)\n        {\n                indexOffset = h->calOffset(indexNext, sBuf);\n                for(no = 0; no < h->getOffsetCount(indexOffset, indexNext); no ++)\n                {\n                        realSize = h->searchByIndex(indexNext, indexOffset, no, tBuf, tQVBuf, seqID);\n                        if(realSize > 0 && compare(sBuf, sQVBuf, tBuf, tQVBuf, 0, lowerSizeInChar, tagReverse) <= mismatchAllowed * 2)\n                        {\n                                if(reversed && tagReverse)\n                                        for(i = 0, j = lowerSizeInChar - 1; i < lowerSizeInChar; i ++, j --)\n\t\t\t\t\t{\n                                                switch(tBuf[j])\n                                                {\n                                                        case 0x00: T[i] = T[i] + mappingNum[seqID]; break;\n                                                        case 0x01: G[i] = G[i] + mappingNum[seqID]; break;\n                                                        case 0x02: C[i] = C[i] + mappingNum[seqID]; break;\n                                                        case 0x03: A[i] = A[i] + mappingNum[seqID]; break;\n                                                        default: cout << \"MEMORY ERROR!\" << endl; exit(-1);\n                                                }\n\t\t\t\t\t\tQV[i] = QV[i] + mappingNum[seqID] * 
tQVBuf[j];\n\t\t\t\t\t}\n                                else\n                                        for(i = 0; i < lowerSizeInChar; i ++)\n\t\t\t\t\t{\n                                                switch(tBuf[i])\n                                                {\n                                                        case 0x00: A[i] = A[i] + mappingNum[seqID]; break;\n                                                        case 0x01: C[i] = C[i] + mappingNum[seqID]; break;\n                                                        case 0x02: G[i] = G[i] + mappingNum[seqID]; break;\n                                                        case 0x03: T[i] = T[i] + mappingNum[seqID]; break;\n                                                        default: cout << \"MEMORY ERROR!\" << endl; exit(-1);\n                                                }\n\t\t\t\t\t\tQV[i] = QV[i] + mappingNum[seqID] * tQVBuf[i];\n\t\t\t\t\t}\n                                h->tmpDeleteByIndex(indexNext, indexOffset, no);\n                        }\n                }\n        }\n        for(i = 0; i < lowerSizeInChar; i ++)\n        {\n                sBufBak[i] = sBuf[i];\n                sBuf[i] = max(A[i], C[i], G[i], T[i]);\n\t\tsQVBuf[i] = QV[i] / (A[i] + C[i] + G[i] + T[i]);\n        }\n//find the most similar seq to the consensus and write to the output\n        for(indexNext = 0; indexNext < seedsCount; indexNext ++)\n        {\n                indexOffset = h->calOffset(indexNext, sBufBak);\n                for(no = 0; no < h->getOffsetCount(indexOffset, indexNext); no ++)\n                {\n                        realSize = h->searchByIndex(indexNext, indexOffset, no, tBuf, tQVBuf, seqID);\n                        if(realSize < 0)\n                        {\n                                h->recoverByIndex(indexNext, indexOffset, no);\n                                diff = compare(sBuf, sQVBuf, tBuf, tQVBuf, 0, lowerSizeInChar, tagReverse);\n                             
   if(diff < similarity)\n                                {\n                                        similarity = diff;\n                                        for(i = 0; i < lowerSizeInChar; i ++)\n\t\t\t\t\t{\n                                                buf[i] = tBuf[i];\n\t\t\t\t\t\tQVBuf[i] = tQVBuf[i];\n\t\t\t\t\t}\n                                        centerSeqID = seqID;\n                                }\n                        }\n                }\n        }\n//if the virtual center and the center are too far\n        if(compare(sBuf, sQVBuf, buf, QVBuf, 0, lowerSizeInChar, tagReverse) <= mismatchAllowed)\n        {\n                if(paired == 0)\n                {\n                        addiOut << \">\" << mappingTable[centerSeqID][0] << endl;\n                        for(i = 0; i < lowerSizeInChar; i ++)\n                        {\n                                out << h->changeBack(buf[i]);\n                                addiOut << h->changeBack(buf[i]);\n                        }\n                        out << endl;\n                        addiOut << endl;\n                }\n                else\n                {\n                        addiOut << \">\" << mappingTable[centerSeqID][0] << endl;\n                        for(i = 0; i < paired; i ++)//paired == lower\n                        {\n                                out << h->changeBack(buf[i]);\n                                addiOut << h->changeBack(buf[i]);\n                        }\n                        out << endl;\n                        addiOut << endl;\n                        addiOut << \">\" << mappingTable[centerSeqID][0] << endl;\n                        for(; i < lowerSizeInChar; i ++)\n                        {\n                                out << h->changeBack(buf[i]);\n                                addiOut << h->changeBack(buf[i]);\n                        }\n                        out << endl;\n                        addiOut << endl;\n            
    }\n        }\n        else\n        {\n                if(paired == 0)\n                {\n                        addiOut << \">\" << mappingTable[sSeqID][0] << endl;\n                        for(i = 0; i < lowerSizeInChar; i ++)\n                        {\n                                out << h->changeBack(sBufBak[i]); //out << h->changeBack(sBuf[i]);\n                                addiOut << h->changeBack(sBufBak[i]); //addiOut << h->changeBack(sBuf[i]);\n                        }\n                        out << endl;\n                        addiOut << endl;\n                }\n                else\n                {\n                        addiOut << \">\" << mappingTable[sSeqID][0] << endl;\n                        for(i = 0; i < paired; i ++)//paired == lower\n                        {\n                                out << h->changeBack(sBufBak[i]);\n                                addiOut << h->changeBack(sBufBak[i]);\n                        }\n                        out << endl;\n                        addiOut << endl;\n                        addiOut << \">\" << mappingTable[sSeqID][0] << endl;\n                        for(; i < lowerSizeInChar; i ++)\n                        {\n                                out << h->changeBack(sBufBak[i]);\n                                addiOut << h->changeBack(sBufBak[i]);\n                        }\n                        out << endl;\n                        addiOut << endl;\n                }\n//still need to decide if the source sequence is reverse complementary to the virtual center\n                compare(sBuf, sBufBak, 0, lowerSizeInChar, tagReverse);\n        }\n}\n\nvoid Cluster::cluster()\n{\n\tchar sBuf[1000];\n\tint realSize, tagReverse, i;\n\tunsigned int seqOffset;\n\tchar sQVBuf[1000];\n\tunsigned int seqID;\n//\tlong int big = 0, small = 0;\n\n\tif(QV)\n\t{\n\t\tfor(seqOffset = 0; seqOffset < num * (upperSizeInBit + 5 + upperSizeInChar); seqOffset = seqOffset + (upperSizeInBit + 5 + 
upperSizeInChar))\n\t\t{\n\t\t\trealSize = h->searchBySeq(seqOffset, sBuf, sQVBuf, seqID);\n\t\t\tif(realSize == 0) continue;\n\t\t\telse realSize = 0;\n\t\t\tcalConsensus(sBuf, sQVBuf, seqID, tagReverse);\n\t\t\tif(out.is_open())\n\t\t\t\tif(reversed)\n\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << \"\t\" << tagReverse << endl;\n\t\t\t\telse\n\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << endl;\n\t\t\telse\n\t\t\t{\n\t\t\t\tcout << \"CANNOT OPEN OUTPUT FILE!\" << endl;\n\t\t\t\texit(-1);\n\t\t\t}\n\t\t\th->deleteBySeq(seqOffset);\n\t\t\tseqNum ++;\n\t\t\tif(seqNum > adjustNum)\n\t\t\t{\n\t\t\t\th->adjust();\n\t\t\t\tadjustNum = adjustNum + num / 20;\n\t\t\t}\n//forcefully write this seq to avoid lost of it\n//\t\t\tnumInCL = 1;\n\n\t\t\tclusterWithMismatches(sBuf, sQVBuf);\n\t\t\tclusterWithShifts(sBuf, sQVBuf);\n\n//\t\t\tif(numInCL > 100)\n//\t\t\t{\n//\t\t\t\tdis << CLID << \" \" << numInCL << endl;\n//\t\t\t\tbig ++;\n//\t\t\t}\n//\t\t\telse if(numInCL == 1)\n//\t\t\t{\n//\t\t\t\tsmall ++;\n//\t\t\t}\n\t\t\tCLID ++;\n\t\t}\n//\t\tcout << \"#clusters of more than 100 seqs is \" << big <<\tendl;\n//\t\tcout << \"#singleton clusters is \" << small << endl;\n\t}\n\telse\n\t{\n\t\tfor(seqOffset = 0; seqOffset < num * (upperSizeInBit + 5); seqOffset = seqOffset + (upperSizeInBit + 5))\n\t\t{\n\t\t\trealSize = h->searchBySeq(seqOffset, sBuf, seqID);\n\t\t\tif(realSize == 0) continue;\n\t\t\telse realSize = 0;\n\t\t\tcalConsensus(sBuf, seqID, tagReverse);\n\t\t\tif(out.is_open())\n\t\t\t\tif(reversed)\n\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << \"\t\" << tagReverse << endl;\n\t\t\t\telse\n\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << endl;\n\t\t\telse\n\t\t\t{\n\t\t\t\tcout << \"CANNOT OPEN OUTPUT 
FILE!\" << endl;\n\t\t\t\texit(-1);\n\t\t\t}\n\t\t\th->deleteBySeq(seqOffset);\n\t\t\tseqNum ++;\n\t\t\tif(seqNum > adjustNum)\n\t\t\t{\n\t\t\t\th->adjust();\n\t\t\t\tadjustNum = adjustNum + num / 20;\n\t\t\t}\n\t\t\tclusterWithMismatches(sBuf);\n\t\t\tclusterWithShifts(sBuf);\n\t\t\tCLID ++;\n\t\t}\n\t}\n\tdelete h;\n\taddiOut.close();\n}\n\nvoid Cluster::clusterWithMismatches(char sBuf[])\n{\n\tchar tBuf[1000];\n\tint indexNext, no, realSize, tagReverse, i;\n\tunsigned int indexOffset;\n\tunsigned int seqID;\n\n\tfor(indexNext = 0; indexNext < seedsCount; indexNext ++)\n\t{\n\t\tindexOffset = h->calOffset(indexNext, sBuf);\n\t\tfor(no = 0; no < h->getOffsetCount(indexOffset, indexNext); no ++)\n\t\t{\n\t\t\trealSize = h->searchByIndex(indexNext, indexOffset, no, tBuf, seqID);\n\t\t\tif(realSize && compare(sBuf, tBuf, 0, lowerSizeInChar, tagReverse) <= mismatchAllowed)\n\t\t\t{\n\t\t\t\th->deleteByIndex(indexNext, indexOffset, no);\n\t\t\t\tif(out.is_open())\n\t\t\t\t\tif(reversed)\n\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << \"\t\" << tagReverse << endl;\n\t\t\t\t\telse\n\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << endl;\n\t\t\t\telse\n\t\t\t\t{\n\t\t\t\t\tcout << \"CANNOT OPEN OUTPUT FILE!\" << endl;\n\t\t\t\t\texit(-1);\n\t\t\t\t}\n\t\t\t\tseqNum ++;\n//\t\t\t\tnumInCL ++;\n\t\t\t}\n\t\t}\n\t}\n}\n\nvoid Cluster::clusterWithShifts(char sBuf[])\n{\n\tchar tBuf[1000], buf[1000];\n\tint no, i, lShift, rShift, realSize, indexNext, tagReverse;\n\tunsigned int indexOffset;\n\tunsigned int seqID;\n\n\tfor(lShift = 1; lShift <= shiftAllowed; lShift ++)\n\t{\n\t\tfor(i = 0; i < lowerSizeInChar - lShift; i ++)\n\t\t\tbuf[i] = sBuf[i + lShift];\n\t\tfor(indexNext = 0; indexNext < seedsCount; indexNext ++)\n\t\t{\n\t\t\tindexOffset = h->calOffset(indexNext, buf);\n\t\t\tfor(no = 0; no < h->getOffsetCount(indexOffset, indexNext); 
no ++)\n\t\t\t{\n\t\t\t\trealSize = (int)h->searchByIndex(indexNext, indexOffset, no, tBuf, seqID);\n\t\t\t\tif(realSize && compare(buf, tBuf, 0, lowerSizeInChar - lShift, tagReverse) <= mismatchAllowed)//from 0 to 3 - lShift\n\t\t\t\t{\n\t\t\t\t\th->deleteByIndex(indexNext, indexOffset, no);\n\t\t\t\t\tif(out.is_open())\n\t\t\t\t\t\tif(reversed)\n\t\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << \"\t\" << tagReverse << endl;\n\t\t\t\t\t\telse\n\t\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << endl;\n\t\t\t\t\telse\n\t\t\t\t\t{\n\t\t\t\t\t\tcout << \"CANNOT OPEN OUTPUT FILE!\" << endl;\n\t\t\t\t\t\texit(-1);\n\t\t\t\t\t}\n//\t\t\t\t\tnumInCL ++;\n\t\t\t\t\tseqNum ++;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\tfor(rShift = 1; rShift <= shiftAllowed; rShift ++)\n\t{\n\t\tfor(i = rShift; i < lowerSizeInChar; i ++)\n\t\t\tbuf[i] = sBuf[i - rShift];\n\t\tfor(indexNext = 0; indexNext < seedsCount; indexNext ++)\n\t\t{\n\t\t\tindexOffset = h->calOffset(indexNext, buf);\n\t\t\tfor(no = 0; no < h->getOffsetCount(indexOffset, indexNext); no ++)\n\t\t\t{\n\t\t\t\trealSize = (int)h->searchByIndex(indexNext, indexOffset, no, tBuf, seqID);\n\t\t\t\tif(realSize && compare(buf, tBuf, rShift, lowerSizeInChar, tagReverse) <= mismatchAllowed)//from rShift to 3\n\t\t\t\t{\n\t\t\t\t\th->deleteByIndex(indexNext, indexOffset, no);\n\t\t\t\t\tif(out.is_open())\n\t\t\t\t\t\tif(reversed)\n\t\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << \"\t\" << tagReverse << endl;\n\t\t\t\t\t\telse\n\t\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << endl;\n\t\t\t\t\telse\n\t\t\t\t\t{\n\t\t\t\t\t\tcout << \"CANNOT OPEN OUTPUT FILE!\" << endl;\n\t\t\t\t\t\texit(-1);\n\t\t\t\t\t}\n//\t\t\t\t\tnumInCL ++;\n\t\t\t\t\tseqNum 
++;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n}\n\nint Cluster::compare(char sBuf[], char tBuf[], int start, int end, int & tagReverse)\n{\n\tint i, j, count = 0, reverseCount = 0;\n\n\tfor(i = start; i < end; i ++)\n\t\tif(sBuf[i] != tBuf[i])\n\t\t\tcount ++;\n\n\tif(reversed)\n\t{\n\t\tfor(i = start, j = end - 1; i < end; i ++, j --)\n\t\t\tif(sBuf[i] != reverseChange(tBuf[j]))\n\t\t\t\treverseCount ++;\n\t\tif(count <= reverseCount) \n\t\t{\n\t\t\ttagReverse = 0;\n\t\t\treturn count;\n\t\t}\n\t\telse\n\t\t{\n\t\t\ttagReverse = 1;\n\t\t\treturn reverseCount;\n\t\t}\n\t}\n\telse\n\t\treturn count;\n}\n\nchar Cluster::reverseChange(char base)\n{\n\tswitch(base)\n\t{\n\t\tcase 0x00: return 0x03;\n\t\tcase 0x01: return 0x02;\n\t\tcase 0x02: return 0x01;\n\t\tcase 0x03: return 0x00;\n\t\tdefault: cout << \"MEMORY ERROR!!\" << endl; exit(-1); \n\t}\n}\n\nvoid Cluster::clusterWithMismatches(char sBuf[], char sQVBuf[])\n{\n\tchar tBuf[1000], tQVBuf[1000];\n\tint indexNext, no, realSize, tagReverse, i;\n\tunsigned int indexOffset;\n\tunsigned int seqID;\n\n\tfor(indexNext = 0; indexNext < seedsCount; indexNext ++)\n\t{\n\t\tindexOffset = h->calOffset(indexNext, sBuf);\n\t\tfor(no = 0; no < h->getOffsetCount(indexOffset, indexNext); no ++)\n\t\t{\n\t\t\trealSize = h->searchByIndex(indexNext, indexOffset, no, tBuf, tQVBuf, seqID);\n\t\t\tif(realSize && compare(sBuf, sQVBuf, tBuf, tQVBuf, 0, lowerSizeInChar, tagReverse) <= mismatchAllowed)\n\t\t\t{\n\t\t\t\th->deleteByIndex(indexNext, indexOffset, no);\n\t\t\t\tif(out.is_open())\n\t\t\t\t\tif(reversed)\n\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << \"\t\" << tagReverse << endl;\n\t\t\t\t\telse\n\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << endl;\n\t\t\t\telse\n\t\t\t\t{\n\t\t\t\t\tcout << \"CANNOT OPEN OUTPUT FILE!\" << endl;\n\t\t\t\t\texit(-1);\n\t\t\t\t}\n//\t\t\t\tnumInCL 
++;\n\t\t\t\tseqNum ++;\n\t\t\t}\n\t\t}\n\t}\n}\n\nvoid Cluster::clusterWithShifts(char sBuf[], char sQVBuf[])\n{\n\tchar tBuf[1000], buf[1000], tQVBuf[1000], QVBuf[1000];\n\tint no, i, lShift, rShift, realSize, indexNext, tagReverse;\n\tunsigned int indexOffset;\n\tunsigned int seqID;\n\n\tfor(lShift = 1; lShift <= shiftAllowed; lShift ++)\n\t{\n\t\tfor(i = 0; i < lowerSizeInChar - lShift; i ++)\n\t\t{\n\t\t\tbuf[i] = sBuf[i + lShift];\n\t\t\tQVBuf[i] = sQVBuf[i + lShift];\n\t\t}\n\t\tfor(indexNext = 0; indexNext < seedsCount; indexNext ++)\n\t\t{\n\t\t\tindexOffset = h->calOffset(indexNext, buf);\n\t\t\tfor(no = 0; no < h->getOffsetCount(indexOffset, indexNext); no ++)\n\t\t\t{\n\t\t\t\trealSize = (int)h->searchByIndex(indexNext, indexOffset, no, tBuf, tQVBuf, seqID);\n\t\t\t\tif(realSize && compare(buf, QVBuf, tBuf, tQVBuf, 0, lowerSizeInChar - lShift, tagReverse) <= mismatchAllowed)//from 0 to 3 - lShift\n\t\t\t\t{\n\t\t\t\t\th->deleteByIndex(indexNext, indexOffset, no);\n\t\t\t\t\tif(out.is_open())\n\t\t\t\t\t\tif(reversed)\n\t\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << \"\t\" << tagReverse << endl;\n\t\t\t\t\t\telse\n\t\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << endl;\n\t\t\t\t\telse\n\t\t\t\t\t{\n\t\t\t\t\t\tcout << \"CANNOT OPEN OUTPUT FILE!\" << endl;\n\t\t\t\t\t\texit(-1);\n\t\t\t\t\t}\n//\t\t\t\t\tnumInCL ++;\n\t\t\t\t\tseqNum ++;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\tfor(rShift = 1; rShift <= shiftAllowed; rShift ++)\n\t{\n\t\tfor(i = rShift; i < lowerSizeInChar; i ++)\n\t\t{\n\t\t\tbuf[i] = sBuf[i - rShift];\n\t\t\tQVBuf[i] = sQVBuf[i - rShift];\n\t\t}\n\t\tfor(indexNext = 0; indexNext < seedsCount; indexNext ++)\n\t\t{\n\t\t\tindexOffset = h->calOffset(indexNext, buf);\n\t\t\tfor(no = 0; no < h->getOffsetCount(indexOffset, indexNext); no ++)\n\t\t\t{\n\t\t\t\trealSize = (int)h->searchByIndex(indexNext, 
indexOffset, no, tBuf, tQVBuf, seqID);\n\t\t\t\tif(realSize && compare(buf, QVBuf, tBuf, tQVBuf, rShift, lowerSizeInChar, tagReverse) <= mismatchAllowed)//from rShift to 3\n\t\t\t\t{\n\t\t\t\t\th->deleteByIndex(indexNext, indexOffset, no);\n\t\t\t\t\tif(out.is_open())\n\t\t\t\t\t\tif(reversed)\n\t\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << \"\t\" << tagReverse << endl;\n\t\t\t\t\t\telse\n\t\t\t\t\t\t\tfor(i = 0; i < mappingNum[seqID]; i ++)\n\t\t\t\t\t\t\t\tout << CLID << \"\t\" << mappingTable[seqID][i] << endl;\n\t\t\t\t\telse\n\t\t\t\t\t{\n\t\t\t\t\t\tcout << \"CANNOT OPEN OUTPUT FILE!\" << endl;\n\t\t\t\t\t\texit(-1);\n\t\t\t\t\t}\n//\t\t\t\t\tnumInCL ++;\n\t\t\t\t\tseqNum ++;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n}\n\nint Cluster::compare(char sBuf[], char sQVBuf[], char tBuf[], char tQVBuf[], int start, int end, int & tagReverse)\n{\n\tint i, j, count = 0, qv = 0, reverseCount = 0, reverseQv = 0;\n\n\tfor(i = start; i < end; i ++)\n\t\tif(sBuf[i] != tBuf[i] && !((sQVBuf[i] - OFFSET) + (tQVBuf[i] - OFFSET) < lowerQV))\n\t\t{\n\t\t\tcount ++;\n\t\t\tqv = qv + (sQVBuf[i] - OFFSET) + (tQVBuf[i] - OFFSET);\n\t\t}\n\n\tif(reversed)\n\t{\n\t\tfor(i = start, j = end - 1; i < end; i ++, j --)\n\t\t\tif(sBuf[i] != reverseChange(tBuf[j]) && !((sQVBuf[i] - OFFSET) + (tQVBuf[j] - OFFSET) < lowerQV))\n\t\t\t{\n\t\t\t\treverseCount ++;\n\t\t\t\treverseQv = reverseQv + (sQVBuf[i] - OFFSET) + (tQVBuf[j] - OFFSET);\n\t\t\t}\n\t\tif(count <= reverseCount)\n\t\t{\n\t\t\ttagReverse = 0;\n\t\t\treturn qv > upperQV ? 10 : count;\n\t\t}\n\t\telse\n\t\t{\n\t\t\ttagReverse = 1;\n\t\t\treturn reverseQv > upperQV ? 10 : reverseCount;\n\t\t}\n\t}\n\telse\n\t{\n\t\treturn qv > upperQV ? 
10 : count;\n\t}\n}\n\nvoid FileAnalyzer::outputAnalyze(int num, int correctCluster)\n{\n\tifstream in;\n\tchar buf[20] = {0}, CLID[10] = {0}, seqID[10] = {0};\n\tint i = 0, j = 0, clusterID = -1, correctLower, correctUpper, subCluster = 0, multiCluster = 0;\n\n\tin.open(\"output.txt\");\n\tif(in.is_open())\n\t{\n\t\tif(in.good())\n\t\t{\n\t\t\tin.getline(buf, 20);\n\t\t\tfor(i = 0; i < 20; i ++)\n\t\t\t\tbuf[i] = 0;\n\t\t\ti = 0;\n\t\t}\n\t\twhile(in.good())\n\t\t{\n\t\t\tin.getline(buf, 20);\n\t\t\tif(buf[0] == 0)\n\t\t\t\tbreak;\n\t\t\twhile(buf[i] != ' ')\n\t\t\t\tCLID[i ++] = buf[i];\n\t\t\ti ++;\n\t\t\twhile(buf[i] != 0)\n\t\t\t\tseqID[j ++] = buf[i ++];\n\n\t\t\tif(atoi(CLID) > clusterID)\n\t\t\t{\n\t\t\t\tclusterID = atoi(CLID);\n\t\t\t\tcorrectLower = atoi(seqID) / (num / correctCluster) * (num / correctCluster);\n\t\t\t\tcorrectUpper = correctLower + num / correctCluster - 1;\n\t\t\t\tif(multiCluster == 0)\n\t\t\t\t\tsubCluster ++;\n\t\t\t\tmultiCluster = 0;\n\t\t\t}\n\t\t\telse\n\t\t\t{\n\t\t\t\tif(atoi(seqID) < correctLower || atoi(seqID) > correctUpper)\n\t\t\t\t\tmultiCluster = 1;\n\t\t\t}\n\t\t\tfor(i = 0; i < 10; i ++)\n\t\t\t\tCLID[i] = seqID[i] = 0;\n\t\t\tfor(i = 0; i < 20; i ++)\n\t\t\t\tbuf[i] = 0;\n\t\t\ti = j = 0;\n\t\t}\n\t}\n\telse\n\t{\n\t\tcout << \"CANNOT OPEN INPUT FILE!\" << endl;\n\t\texit(-1);\n\t}\n\tcout << \"(4) output analysis finished\" << endl;\n\tcout << \" - \" << clusterID + 1 << \" clusters in total\" << endl;\n\tcout << \" - \" << subCluster << \" sub clusters of degree \" << (double)subCluster / (clusterID + 1) << endl;\n}\n\nvoid FileAnalyzer::inputAnalyze(char input[], int & num, int & tNum, int & lower, int & upper)\n{\n\tifstream in;\n\tint seqID = 0, i;\n\tchar buf[1001];\n\n\tnum = upper = 0;\n\tlower = 1000;\n\tin.open(input);\n\n\tif(in.is_open())\n\t{\n//cont:\n\t\twhile(in.good())\n\t\t{\n\t\t\tin.getline(buf, 1001);\n\t\t\tif(seqID % 4 == 1)\n\t\t\t{\n//\t\t\t\tfor(i = 0; i < in.gcount() - 1; i 
++)\n//\t\t\t\t\tif(buf[i] == 'N') \n//\t\t\t\t\t{\n//\t\t\t\t\t\tseqID ++;\n//\t\t\t\t\t\tgoto cont;\n//\t\t\t\t\t}\n\t\t\t\tif(in.gcount() - 1 < lower) lower = in.gcount() - 1;\n\t\t\t\tif(in.gcount() - 1 > upper) upper = in.gcount() - 1;\n//\t\t\t\tnum ++;\n\t\t\t}\n\t\t\tseqID ++;\n\t\t}\n\t}\n\telse\n\t{\n\t\tcout << \"CANNOT OPEN INPUT FILE!\" << endl;\n\t\texit(-1);\n\t}\n//\ttNum = seqID;\n\n//Filter reads based on the first \"lower\" bases rather than all bases to avoid underestimating the number of valid reads\n\tin.clear(); in.seekg(0); seqID = 0; num = 0;\n\tif(in.is_open())\n\t{\nconti:\n\t\twhile(in.good())\n\t\t{\n\t\t\tin.getline(buf, 1001);\n\t\t\tif(seqID % 4 == 1)\n\t\t\t{\n\t\t\t\tfor(i = 0; i < lower; i ++)\n\t\t\t\t\tif(buf[i] == 'N')\n\t\t\t\t\t{\n\t\t\t\t\t\tseqID ++;\n\t\t\t\t\t\tgoto conti;\n\t\t\t\t\t}\n\t\t\t\tnum ++;\n\t\t\t}\n\t\t\tseqID ++;\n\t\t}\n\t}\n\telse\n\t{\n\t\tcout << \"CANNOT OPEN INPUT FILE!\" << endl;\n\t\texit(-1);\n\t}\n\ttNum = seqID;\n}\n\nvoid FileAnalyzer::PECombine(char input1[], int lower1, char input2[], int lower2, \nchar * input, int & num, int & lower, int & upper)\n{\n\tifstream in1, in2;\n\tofstream out;\n\tchar buf1[1000], buf2[1000];\n\tint seqID = 0, i, NBase;\n\n\tin1.open(input1);\n\tin2.open(input2);\n\tout.open(\"combined.fastq\");\n\n\tstrcpy(input, \"combined.fastq\");\n\tnum = 0;\n\tlower = 1000;\n\tupper = 0;\n\n\tif(in1.is_open() && in2.is_open())\n\t{\n\t\twhile(in1.good() && in2.good())\n\t\t{\n\t\t\tin1.getline(buf1, 1000);\n\t\t\tin2.getline(buf2, 1000);\n\t\t\tif(buf1[0] == 0 || buf2[0] == 0) break;\n\n\t\t\tif(seqID % 4 == 0)\n\t\t\t\tout << \"@\" << seqID / 4 << endl;\n\t\t\telse if(seqID % 4 == 2)\n\t\t\t\tout << \"+\" << seqID / 4 << endl;\n\t\t\telse\n\t\t\t{\n\t\t\t\tNBase = 0;\n\t\t\t\tfor(i = 0; i < lower1; i ++)\n\t\t\t\t{\n\t\t\t\t\tout << buf1[i];\n\t\t\t\t\tif(buf1[i] == 'N') NBase = 1;\n\t\t\t\t}\n\t\t\t\tfor(i = 0; i < lower2; i ++)\n\t\t\t\t{\n\t\t\t\t\tout << 
buf2[i];\n\t\t\t\t\tif(buf2[i] == 'N') NBase = 1;\n\t\t\t\t}\n\t\t\t\tout << endl;\n\t\t\t\tif(NBase == 0) num ++;\n\t\t\t\tif(in1.gcount() - 1 + in2.gcount() - 1 < lower) lower = in1.gcount() - 1 + in2.gcount() - 1;\n\t\t\t\tif(in1.gcount() - 1 + in2.gcount() - 1 > upper) upper = in1.gcount() - 1 + in2.gcount() - 1;\n\t\t\t}\n\t\t\tseqID ++;\n\t\t}\n\t}\n\telse\n\t{\n\t\tcout << \"CANNOT OPEN INPUT FILE!\" << endl;\n\t\texit(-1);\n\t}\n}\n\nFastqGenerator::FastqGenerator(char input[], char output[], int num)\n{\n        long i;\n\n        in.open(input);\n        strcpy(addiInput, output);\n        strcat(addiInput, \".fasta\");\n        addiIn.open(addiInput);\n        strcpy(outputq, output);\n        strcat(outputq, \".fastq\");\n        out.open(outputq);\n        this->num = num;\n        seq = new char [num];\n        for(i = 0; i < num; i ++)\n                seq[i] = 0;\n}\n\nFastqGenerator::FastqGenerator(char input[], char output[], int num, int pair)\n{\n\tlong i;\n\n\tin.open(input);\n\tstrcpy(addiInput, output);\n\tstrcat(addiInput, \".fasta\");\n\taddiIn.open(addiInput);\n\tstrcpy(outputq, output);\n\tstrcat(outputq, \".\"); \n\tif(pair == 1) strcat(outputq, \"1\"); else strcat(outputq, \"2\");\n\tstrcat(outputq, \".fastq\");\n\tout.open(outputq);\n\tthis->num = num;\n\tseq = new char [num];\n\tfor(i = 0; i < num; i ++)\n\t\tseq[i] = 0;\n}\n\nvoid FastqGenerator::record()\n{\n\tlong i;\n\tchar buf[1001];\n\n\tif(addiIn.is_open())\n\t{\n\t\twhile(addiIn.good())\n\t\t{\n\t\t\taddiIn.getline(buf, 1001);\n\t\t\tif(buf[0] == 0)\n\t\t\t\tbreak;\n\t\t\tfor(i = 1; i < addiIn.gcount() - 1; i ++)\n\t\t\t\tbuf[i - 1] = buf[i];\n\t\t\tbuf[i - 1] = '\\0';\n\t\t\tseq[atoi(buf)] = 1;\n\t\t\taddiIn.getline(buf, 1001);\n\t\t}\n\t}\n\telse\n\t{\n\t\tcout << \"CANNOT OPEN INPUT FILE!\" << endl;\n\t\texit(-1);\n\t}\n\n//\tfor(i = 0; i < num; i ++)\n//\t\tif(seq[i])\n//\t\t\tcout << i << endl;\n}\n\nvoid FastqGenerator::generateFastq()\n{\n\tunsigned long seqID, 
i;\n\tchar buf[1001];\n\n\trecord();\n\tif(in.is_open())\n\t{\n\t\tfor(seqID = 0; seqID < num * 4; seqID ++)\n\t\t{\n\t\t\tin.getline(buf, 1001);\n\t\t\tif(seq[seqID / 4])\n\t\t\t{\n\t\t\t\tfor(i = 0; i < in.gcount() - 1; i ++)\n\t\t\t\t\tout << buf[i];\n\t\t\t\tout << endl;\n\t\t\t}\n\t\t}\n\t}\n\telse\n\t{\n\t\tcout << \"CANNOT OPEN INPUT FILE!\" << endl;\n\t\texit(-1);\n\t}\n}\n\nSorter::Sorter(char input[], int num, int lower)\n{\n\tin.open(input);\n\tstrcpy(midOutput, input);\n\tstrcat(midOutput, \".mid.fastq\");\n\tmidOut.open(midOutput);\n\tthis->num = num;\n\tmappingTable = new unsigned int * [num];\n\tmappingNum = new unsigned int [num];\n\tthis->lowerSizeInChar = lower;\n\tthis->realNum = 0;\n}\n\nint Sorter::getRealNum()\n{\n\treturn realNum;\n}\n\nunsigned int ** Sorter::getMappingTable()\n{\n\treturn mappingTable;\n}\n\nunsigned int * Sorter::getMappingNum()\n{\n\treturn mappingNum;\n}\n\nvoid Sorter::sort()\n{\n\tchar * seqs;\n\tOrder * order;\n\tint i, j, seqID = 0, tag = 1, localNum = 0;\n\tchar buf[1001];\n\n\tseqs = new char [(long int)num * lowerSizeInChar * 2];//2 means both bases and QVs\n\torder = new Order [num];\n\n\tif(in.is_open())\n\t{\ncont:\n\t\twhile(in.good())\n\t\t{\n\t\t\tin.getline(buf, 1001);\n\t\t\tif(seqID % 4 == 1)\n\t\t\t\tfor(i = 0; i < lowerSizeInChar; i ++)\n\t\t\t\t{\n\t\t\t\t\tif(buf[i] == 'N')\n\t\t\t\t\t{\n\t\t\t\t\t\tseqID ++;\n\t\t\t\t\t\ttag = 0;\n\t\t\t\t\t\tgoto cont;\n\t\t\t\t\t}\n\t\t\t\t\telse\n\t\t\t\t\t\ttag = 1;\n\t\t\t\t\tseqs[(long int)localNum * lowerSizeInChar * 2 + i] = buf[i];\n\t\t\t\t}\n\t\t\tif(seqID % 4 == 3 && tag == 1)\n\t\t\t{\n\t\t\t\tfor(i = lowerSizeInChar; i < lowerSizeInChar * 2; i ++)\n\t\t\t\t\tseqs[(long int)localNum * lowerSizeInChar * 2 + i] = buf[i - lowerSizeInChar];\n\t\t\t\torder[localNum ++].realID = seqID / 4;\n\t\t\t}\n\t\t\tseqID ++;\n\t\t\tif(localNum == num) break;\n//Must finish to avoid crash. 
Otherwise:\n//if there are reads with N in the end, they are not counted to initialize the seqs array but are put in seqs array\n\t\t}\n\t}\n\telse\n\t{\n\t\tcout << \"CANNOT OPEN INPUT FILE!\" << endl;\n\t\texit(-1);\n\t}\n\tfor(i = 0; i < localNum; i ++)\n\t\torder[i].ID = i;\n//verification\n//\tcout << \"-------------------------------------\" << endl;\n//\tfor(i = 0; i < localNum; i ++)\n//\t\tcout << order[i].ID << \"|\" << order[i].realID << \" \";\n//\tcout << endl;\n//\tcout << \"-------------------------------------\" << endl;\n//\tfor(i = 0; i < localNum; i ++)\n//\t{\n//\t\tfor(j = 0; j < lowerSizeInChar * 2; j ++)\n//\t\t\tcout << seqs[i * lowerSizeInChar * 2 + j];\n//\t\tcout << endl;\n//\t}\n//\tcout << \"-------------------------------------\" << endl;\n//verification\n\n\tsuffixSort(0, localNum - 1, 0, seqs, order);\n//verification\n//\tcout << \"-------------------------------------\" << endl;\n//\tfor(i = 0; i < localNum; i ++)\n//\t\tcout << order[i].ID << \"|\" << order[i].realID << \" \";\n//\tcout << endl;\n//\tcout << \"-------------------------------------\" << endl;\n//\tfor(i = 0; i < realNum; i ++)\n//\t{\n//\t\tcout << mappingNum[i] << \": \";\n//\t\tfor(j = 0; j < mappingNum[i]; j ++)\n//\t\t\tcout << mappingTable[i][j] << \" \";\n//\t\tcout << endl;\n//\t}\n//\tcout << \"-------------------------------------\" << endl;\n//verification\n//seqs and order were allocated with new [], so they must be released with delete []\n\tdelete [] seqs;\n\tdelete [] order;\n}\n\nvoid Sorter::suffixSort(int start, int end, int depth, char seqs[], Order order[])\n{\n\tint i, seqID = 0, tag;\n\tOrder * buf;\n\tint s[4], e[4];\n\tint startBuf;\n\tint j;\n\n\tif(start == -1 && end == -1)\n\t\treturn;\n\n\tif(start == end || depth == lowerSizeInChar - 1)\n\t{\n\t\tmappingTable[realNum] = new unsigned int [end - start + 1];\n\t\tfor(i = start; i <= end; i ++)\n\t\t\tmappingTable[realNum][i - start] = order[i].realID;\n\t\tmappingNum[realNum] = end - start + 1;\n\n\t\tif(midOut.is_open())\n\t\t{\n\t\t\tmidOut << \"@\" << realNum << 
endl;\n\t\t\tfor(i = 0; i < lowerSizeInChar; i ++)\n\t\t\t\tmidOut << seqs[(long int)order[start].ID * lowerSizeInChar * 2 + i];\n\t\t\tmidOut << endl;\n\t\t\tmidOut << \"+\" << realNum << endl;\n\t\t\tfor(i = lowerSizeInChar; i < lowerSizeInChar * 2; i ++)\n\t\t\t\tmidOut << seqs[(long int)order[start].ID * lowerSizeInChar * 2 + i];\n\t\t\tmidOut << endl;\n\t\t}\n\t\telse\n\t\t{\n\t\t\tcout << \"CANNOT OPEN OUTPUT FILE!\" << endl;\n\t\t\texit(-1);\n\t\t}\n\n\t\trealNum ++;\n\t\treturn;\n\t}\n\n\tbuf = new Order [end - start + 1];\n\n\tstartBuf = start;\n\ttag = 0;\n\tfor(i = start; i <= end; i ++)\n\t\tif(seqs[(long int)order[i].ID * lowerSizeInChar * 2 + depth] == 'A')\n\t\t{\n\t\t\ttag = 1;\n\t\t\tbuf[seqID].ID = order[i].ID;\n\t\t\tbuf[seqID ++].realID = order[i].realID;\n\t\t}\n\tif(tag == 1)\n\t{\n\t\ts[0] = startBuf;\n\t\te[0] = start + seqID - 1;\n\t\tstartBuf = start + seqID;\n\t}\n\telse\n\t\ts[0] = e[0] = -1;\n\n\ttag = 0;\n\tfor(i = start; i <= end; i ++)\n\t\tif(seqs[(long int)order[i].ID * lowerSizeInChar * 2 + depth] == 'C')\n\t\t{\n\t\t\ttag = 1;\n\t\t\tbuf[seqID].ID = order[i].ID;\n\t\t\tbuf[seqID ++].realID = order[i].realID;\n\t\t}\n\tif(tag == 1)\n\t{\n\t\ts[1] = startBuf;\n\t\te[1] = start + seqID - 1;\n\t\tstartBuf = start + seqID;\n\t}\n\telse\n\t\ts[1] = e[1] = -1;\n\n\ttag = 0;\n\tfor(i = start; i <= end; i ++)\n\t\tif(seqs[(long int)order[i].ID * lowerSizeInChar * 2 + depth] == 'G')\n\t\t{\n\t\t\ttag = 1;\n\t\t\tbuf[seqID].ID = order[i].ID;\n\t\t\tbuf[seqID ++].realID = order[i].realID;\n\t\t}\n\tif(tag == 1)\n\t{\n\t\ts[2] = startBuf;\n\t\te[2] = start + seqID - 1;\n\t\tstartBuf = start + seqID;\n\t}\n\telse\n\t\ts[2] = e[2] = -1;\n\n\ttag = 0;\n\tfor(i = start; i <= end; i ++)\n\t\tif(seqs[(long int)order[i].ID * lowerSizeInChar * 2 + depth] == 'T')\n\t\t{\n\t\t\ttag = 1;\n\t\t\tbuf[seqID].ID = order[i].ID;\n\t\t\tbuf[seqID ++].realID = order[i].realID;\n\t\t}\n\tif(tag == 1)\n\t{\n\t\ts[3] = startBuf;\n\t\te[3] = start + seqID - 
1;\n\t\tstartBuf = start + seqID;\n\t}\n\telse\n\t\ts[3] = e[3] = -1;\n\n\tfor(i = start, seqID = 0; i <= end; i ++, seqID ++)\n\t{\n\t\torder[i].ID = buf[seqID].ID;\n\t\torder[i].realID = buf[seqID].realID;\n\t}\n//buf has been copied back into order, so release it before recursing to avoid leaking it\n\tdelete [] buf;\n\n//\tcout << s[0] << \", \" << s[1] << \", \" << s[2] << \", \" << s[3] << endl;\n//\tcout << e[0] << \", \" << e[1] << \", \" << e[2] << \", \" << e[3] << endl;\n\n\tsuffixSort(s[0], e[0], depth + 1, seqs, order);\n\tsuffixSort(s[1], e[1], depth + 1, seqs, order);\n\tsuffixSort(s[2], e[2], depth + 1, seqs, order);\n\tsuffixSort(s[3], e[3], depth + 1, seqs, order);\n}\n\nchar change(char base)\n{\n\tswitch(base)\n\t{\n\t\tcase 0x00: return 'A';\n\t\tcase 0x01: return 'C';\n\t\tcase 0x02: return 'G';\n\t\tcase 0x03: return 'T';\n\t\tdefault: cout << \"UNKNOWN ERROR!\" << endl; exit(-1);\n\t}\n}\n\nbool within(int p, int pos[], int size)\n{\n\tint i;\n\tfor(i = 0; i < size; i ++)\n\t\tif(p == pos[i]) return true;\n\treturn false;\n}\n\nvoid introduceMismatches(char sBuf[], char buf[], int mismatch, int size)\n{\n\tint i, j;\n\tint pos[1000], mBuf[1000] = {0};\n\n\tfor(i = 0; i < mismatch; i ++)\n\t{\n\t\tdo\n\t\t\tpos[i] = rand() % size;\n\t\twhile(mBuf[pos[i]] == 1);\n\t\tmBuf[pos[i]] = 1;\n\t}\n\tfor(i = 0, j = 0; i < size; i ++)\n\t{\n\t\tif(within(i, pos, mismatch) && j < mismatch)\n\t\t{\n\t\t\tdo\n\t\t\t\tbuf[i] = change((char)(rand() % 4));\n\t\t\twhile(sBuf[i] == buf[i]);\n\t\t\tj ++;\n\t\t}\n\t\telse\n\t\t\tbuf[i] = sBuf[i];\n\t}\n}\n\nvoid introduceShifts(char sBuf[], char buf[], int shift, int size)\n{\n\tint i;\n\n\tif(shift < 0)\n\t{\n\t\tfor(i = 0; i < size - abs(shift); i ++)\n\t\t\tbuf[i] = sBuf[i + abs(shift)];\n\t\tfor(i = size - abs(shift); i < size; i ++)\n\t\t\tbuf[i] = change((char)(rand() % 4));\n\t}\n\telse\n\t{\n\t\tfor(i = 0; i < shift; i ++)\n\t\t\tbuf[i] = change((char)(rand() % 4));\n\t\tfor(i = shift; i < size; i ++)\n\t\t\tbuf[i] = sBuf[i - shift];\n\t}\n}\n\n#ifdef WITHSIMILARITY\nvoid generateClusteredSeq(int num, int 
lower, int upper, int mismatchAllowed, int shiftAllowed, int correctCluster, int distance)\n#else\nvoid generateClusteredSeq(int num, int lower, int upper, int mismatchAllowed, int shiftAllowed, int correctCluster)\n#endif\n{\n\tofstream out;\n\tint size, mismatch, shift, i, j, k;\n\tchar s[1000], sBuf[1000], mismatchBuf[1000], shiftBuf[1000], buf[1000];\n\tint mBuf[1000] = {0};\n\n#ifdef WITHSIMILARITY\n\tif(distance < 2) \n\t{\n\t\tcout << \"incorrect distance\" << endl;\n\t\treturn;\n\t}\n\tfor(i = 0; i < 1000; i ++)\n\t\ts[i] = change((char)(rand() % 4));\n#endif\n\tout.open(\"input.txt\");\n\n\tfor(k = 0; k < correctCluster; k ++)\n\t{\n#ifdef WITHSIMILARITY\n\t\tgenerateCenter(s, sBuf, distance, mBuf);\n#else\n\t\tfor(i = 0; i < 1000; i ++)\n\t\t\tsBuf[i] = change((char)(rand() % 4));\n#endif\n\t\tfor(i = 0; i < num / correctCluster; i ++)\n\t\t{\n\t\t\tsize = lower + rand() % (upper - lower + 1);\n\t\t\tmismatch = rand() % (mismatchAllowed + 1);\n\t\t\tintroduceMismatches(sBuf, mismatchBuf, mismatch, size);\n\t\t\tshift = rand() % (shiftAllowed * 2 + 1) - shiftAllowed;\n\t\t\tintroduceShifts(mismatchBuf, shiftBuf, shift, size);\n//\t\t\tif(rand() % 2 == 1)\n//\t\t\t\tintroduceMismatches(sBuf, buf, mismatch, size);\n//\t\t\telse\n//\t\t\t\tintroduceShifts(sBuf, buf, shift, size);\n\t\t\tout << \"@title \" << k * num / correctCluster + i << \" size = \" << size << \" mismatches = \" << mismatch << \" shifts = \" << shift << endl;\n\t\t\tout.write(shiftBuf, size);\n//\t\t\tout.write(buf, size);\n\t\t\tout << endl;\n\t\t\tout << \"+title \" << k * num / correctCluster + i << \" size = \" << size << \" mismatches = \" << mismatch << \" shifts = \" << shift << endl;\n\t\t\tfor(j = 0; j < size; j ++)\n\t\t\t\tshiftBuf[j] = (char)(rand() % RANGE) + OFFSET;\n\t\t\tout.write(shiftBuf, size);\n\t\t\tout << endl;\n\t\t}\n\t}\n}\n\nvoid itoa(char buf[], unsigned int v)\n{\n\tif(v / 1000)\n\t{\n\t\tbuf[0] = v / 1000 + 48;\n\t\tbuf[1] = (v % 1000) / 100 + 48;\n\t\tbuf[2] = 
(v % 100) / 10 + 48;\n\t\tbuf[3] = v % 10 + 48;\n\t\tbuf[4] = '\\0';\n\t}\n\telse if(v / 100)\n\t{\n\t\tbuf[0] = v / 100 + 48;\n\t\tbuf[1] = (v % 100) / 10 + 48;\n\t\tbuf[2] = v % 10 + 48;\n\t\tbuf[3] = '\\0';\n\t}\n\telse if(v / 10)\n\t{\n\t\tbuf[0] = v / 10 + 48;\n\t\tbuf[1] = v % 10 + 48;\n\t\tbuf[2] = '\\0';\n\t}\n\telse\n\t{\n\t\tbuf[0] = v + 48;\n\t\tbuf[1] = '\\0';\n\t}\n}\n\nvoid print()\n{\n\tcout << \"SEED --input input.fastq --output output.txt [--mismatch M] [--shift S] [--QV1 L] [--QV2 U] [--fast/short] [--reverse] [--input2 input2.fastq]\" << endl;\n\tcout << \"--mismatch is the maximum number of mismatches allowed from the center sequence in each cluster (0 - 3, default 3)\" << endl;\n\tcout << \"--shift is the maximum number of shifts allowed from the center sequence in each cluster (0 - 6, default 3)\" << endl;\n\tcout << \"--QV1 is the threshold for the base call quality values (QV) that are provided in the FASTQ files as Phred scores. SEED ignores those mismatches where the sum of the Phred scores of the mismatching bases is lower than the specified QV1 threshold value (0 - 2 * 93). The default value for QV1 is 0\" << endl;\n\tcout << \"--QV2 is another QV threshold. It prevents co-clustering of sequences where the sum of all mismatched positions is higher than the threshold value (0 - 6 * 93). The default value for QV2 is 6 * 93\" << endl;\n\tcout << \"--fast uses a bigger spaced seed weight to save running time. It is only applicable for sequences longer than 58 bp and may need more memory\" << endl;\n\tcout << \"--short is to use a smaller spaced seeds weight for sequences as short as 21 bp. This setting often results in longer compute times\" << endl;\n\tcout << \"--reverse is to co-cluster sequences in sense and anti-sense orientation (reverse and complement)\" << endl;\n\tcout << \"--input2 specifies the paired sequences so that paired-end library can be clustered. 
In current implementation, no shift is allowed for this option, and if --reverse option is specified minimum sequence lengths of both pairs should be the same\" << endl;\n}\n\nint main(int argc, char * argv[])\n{\n\ttime_t start, end;\n\tint num, num1, num2, tNum, tNum1, tNum2, lower, lower1, lower2, upper, upper1, upper2, mismatch = 3, shift = 3, lowerQV = 0, upperQV = 6 * 93, i, tagMismatch = 0, tagShift = 0, tagInput = 0, tagInput2 = 0, tagOutput = 0, tagFast = 0, tagShort = 0, tagReverse = 0;\n\tint tagQV1 = 0, tagQV2 = 0;\n\tchar buf[5], input[100], input1[100], input2[100], output[100], midOutput[100];\n\tifstream in, in2;\n\tint io = 0;\n\tint totalLength, count, j;\n\n\tfor(i = 1; i < argc; i ++)\n\t\tif(strcmp(argv[i], \"--input\") == 0)\n\t\t{\n\t\t\tif(tagInput == 1 || i == argc - 1)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tin.open(argv[++ i]);\n\t\t\tif(!in.is_open())\n\t\t\t{\n\t\t\t\tcout << \"CANNOT OPEN INPUT FILE!\" << endl;\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tin.close();\n\t\t\tstrcpy(input, argv[i]);\n\t\t\ttagInput = 1;\n\t\t}\n                else if(strcmp(argv[i], \"--input2\") == 0)\n                {\n                        if(tagInput2 == 1 || i == argc - 1)\n                        {\n                                print();\n                                return 0;\n                        }\n                        in2.open(argv[++ i]);\n                        if(!in2.is_open())\n                        {\n                                cout << \"CANNOT OPEN PAIRED INPUT FILE!\" << endl;\n                                print();\n                                return 0;\n                        }\n                        in2.close();\n                        strcpy(input2, argv[i]);\n                        tagInput2 = 1;\n\t\t\tpaired = 1;\n                }\n\t\telse if(strcmp(argv[i], \"--output\") == 0)\n\t\t{\n\t\t\tif(tagOutput == 1 || i == argc - 
1)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tstrcpy(output, argv[++ i]);\n\t\t\ttagOutput = 1;\n\t\t}\n\t\telse if(strcmp(argv[i], \"--mismatch\") == 0)\n\t\t{\n\t\t\tif(tagMismatch == 1 || i == argc - 1)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tmismatch = atoi(argv[++ i]);\n\t\t\titoa(buf, mismatch);\n\t\t\tif(strcmp(argv[i], buf) != 0)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\ttagMismatch = 1;\n\t\t}\n\t\telse if(strcmp(argv[i], \"--shift\") == 0)\n\t\t{\n\t\t\tif(tagShift == 1 || i == argc - 1)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tshift = atoi(argv[++ i]);\n\t\t\titoa(buf, shift);\n\t\t\tif(strcmp(argv[i], buf) != 0)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\ttagShift = 1;\n\t\t}\n\t\telse if(strcmp(argv[i], \"--QV1\") == 0)\n\t\t{\n\t\t\tif(tagQV1 == 1 || i == argc - 1)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tlowerQV = atoi(argv[++ i]);\n\t\t\titoa(buf, lowerQV);\n\t\t\tif(strcmp(argv[i], buf) != 0)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\ttagQV1 = 1;\n\t\t\tQV = 1;\n\t\t}\n\t\telse if(strcmp(argv[i], \"--QV2\") == 0)\n\t\t{\n\t\t\tif(tagQV2 == 1 || i == argc - 1)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tupperQV = atoi(argv[++ i]);\n\t\t\titoa(buf, upperQV);\n\t\t\tif(strcmp(argv[i], buf) != 0)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\ttagQV2 = 1;\n\t\t\tQV = 1;\n\t\t}\n\t\telse if(strcmp(argv[i], \"--fast\") == 0)\n\t\t{\n\t\t\tif(tagFast == 1 || tagShort == 1)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\ttagFast = 1;\n\t\t\tseedsCount = 4;\n\t\t\tseedsWeight = 64 * 1024;\n\t\t}\n\t\telse if(strcmp(argv[i], \"--short\") == 0)\n                {\n                        if(tagFast == 1 || tagShort == 1)\n                        {\n                                print();\n                                return 0;\n                        }\n                   
     tagShort = 1;\n                        seedsWeight = 4;\n                }\n\t\telse if(strcmp(argv[i], \"--reverse\") == 0)\n\t\t{\n\t\t\tif(tagReverse == 1)\n\t\t\t{\n\t\t\t\tprint();\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\ttagReverse = 1;\n\t\t\treversed = 1;\n\t\t}\n\t\telse\n\t\t{\n\t\t\tprint();\n\t\t\treturn 0;\n\t\t}\n\n\tif(tagInput == 0 || tagOutput == 0 || mismatch < 0 || mismatch > 3 || shift < 0 || shift > 6 || lowerQV < 0 || lowerQV > 2 * 93 || upperQV < 0 || upperQV > 6 * 93)\n\t{\n\t\tprint();\n\t\treturn 0;\n\t}\n\n\tif(QV)\n\t\tcout << \"#mismatch = \" << mismatch << \"; #shift = \" << shift << \"; QV1 = \" << lowerQV << \"; QV2 = \" << upperQV << endl;\n\telse\n\t\tcout << \"#mismatch = \" << mismatch << \"; #shift = \" << shift << endl;\n\n//\tgenerateClusteredSeq(1000, 95, 100, 0, 0, 100);\n//\treturn 0;\n\n\tFileAnalyzer fa;\n\tstart = time(NULL);\n\tif(paired == 0)\n\t{\n\t\tfa.inputAnalyze(input, num, tNum, lower, upper);\n\n\t\tcout << \"(1) input analysis finished\" << endl;\n\t\tcout << \" - \" << num << \" valid seqs with lengths between \" << lower << \" and \" << upper << endl;\n\n\t\tif(num == 0)\n\t\t{\n\t\t\tcout << \"INSUFFICIENT VALID READS!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(upper - lower > 5)\n\t\t{\n\t\t\tcout << \"INVALID READ LENGTH DIFFERENCE (ABOVE 5)!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(lower < 36 && seedsWeight == 1024 * 16)\n\t\t{\n\t\t\tcout << \"INVALID READ LENGTH (BELOW 36) IN ORDINARY MODE!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(lower < 58 && seedsWeight == 1024 * 64)\n\t\t{\n\t\t\tcout << \"INVALID READ LENGTH (BELOW 58) IN FAST MODE!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(lower < 21 && seedsWeight == 4)\n\t\t{\n\t\t\tcout << \"INVALID READ LENGTH (BELOW 21) IN SHORT MODE!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(upper > 1000)\n\t\t{\n\t\t\tcout << \"INVALID READ LENGTH (ABOVE 1000)!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t}\n\telse\n\t{\n\t\tstrcpy(input1, 
input);\n\t\tfa.inputAnalyze(input1, num1, tNum1, lower1, upper1);\n\t\tpaired = lower1;//keep lower1 in paired to separate read pairs\n\t\tfa.inputAnalyze(input2, num2, tNum2, lower2, upper2);\n\t\tfa.PECombine(input1, lower1, input2, lower2, input, num, lower, upper);\n\t\t//combine both pairs and trim to keep reads in the same pair same length\n\n\t\tcout << \"(1) input analysis finished\" << endl;\n//\t\tcout << \" - \" << num1 << \" valid seqs with lengths between \" << lower1 << \" and \" << upper1 << \" in left pair\" << endl;\n//\t\tcout << \" - \" << num2 << \" valid seqs with lengths between \" << lower2 << \" and \" << upper2 << \" in right pair\" << endl;\n\t\tcout << \" - \" << num << \" valid seqs with combined lengths between \" << lower << \" and \" << upper << endl;\n\n\t\tif(num1 == 0)\n\t\t{\n\t\t\tcout << \"INSUFFICIENT VALID READS IN LEFT PAIR!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(num2 == 0)\n\t\t{\n\t\t\tcout << \"INSUFFICIENT VALID READS IN RIGHT PAIR!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(tNum1 != tNum2)\n\t\t{\n\t\t\tcout << \"DIFFERENT NUMBER OF READS IN LEFT AND RIGHT PAIRS!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(upper1 - lower1 > 5)\n\t\t{\n\t\t\tcout << \"INVALID READ LENGTH DIFFERENCE (ABOVE 5) IN LEFT PAIR!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(upper2 - lower2 > 5)\n\t\t{\n\t\t\tcout << \"INVALID READ LENGTH DIFFERENCE (ABOVE 5) IN RIGHT PAIR!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(lower < 36 && seedsWeight == 1024 * 16)\n        \t{\n\t\t\tcout << \"INVALID COMBINED READ LENGTH (BELOW 36) IN ORDINARY MODE!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(lower < 58 && seedsWeight == 1024 * 64)\n\t\t{\n\t\t\tcout << \"INVALID COMBINED READ LENGTH (BELOW 58) IN FAST MODE!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(lower < 21 && seedsWeight == 4)\n\t\t{\n\t\t\tcout << \"INVALID COMBINED READ LENGTH (BELOW 21) IN SHORT MODE!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(upper > 1000)\n\t\t{\n\t\t\tcout << \"INVALID 
COMBINED READ LENGTH (ABOVE 1000)!\" << endl;\n\t\t\treturn 0;\n\t\t}\n\n\t\tif(shift > 0)\n\t\t{\n\t\t\tcout << \"In current implementation, #shift must be 0 for paired-end clustering. Please wait for SEED2 to solve this issue.\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\tif(reversed == 1 && lower1 != lower2)\n\t\t{\n\t\t\tcout << \"In current implementation, if reverse complementary is considered in clustering, lower bounds of both pairs should be the same. Please wait for SEED2 to solve this issue.\" << endl;\n\t\t\treturn 0;\n\t\t}\n\n\t\tupper = lower;\n\t}\n\n//\tproduce realNum, mappingTable and mappingNum here, and the intermediate file is produced/opened by protocol\n\tSorter s(input, num, lower);\n\ts.sort();// if seqs x and y are the same but with different QVs, then y's QV will be represented by x's QV and not be considered in clustering\n\tcout << \"(2) sorting finished\" << endl;\n\n\tCluster c(input, output, s.getRealNum(), lower, upper, mismatch, shift, lowerQV, upperQV, s.getMappingTable(), s.getMappingNum());\n\tcout << \"(3) init finished\" << endl;\n\n\tc.cluster();\n\tend = time(NULL);\n\tcout << \"(4) clustering finished\" << endl;\n\n\tif(paired == 0)\n\t{\n\t\tFastqGenerator f(input, output, tNum);\n\t\tf.generateFastq();\n\t}\n\telse\n\t{\n\t\tFastqGenerator f1(input1, output, tNum1, 1);\n\t\tf1.generateFastq();\n\t\tFastqGenerator f2(input2, output, tNum2, 2);\n\t\tf2.generateFastq();\n\t}\n\tcout << \"(5) fastq file generated\" << endl;\n\n\tcout << \" - \" << end - start << \" seconds\" << endl;\n\n\treturn 1;\n}\n"
  }
]