[
  {
    "path": ".gitattributes",
    "content": "# Auto detect text files and perform LF normalization\n* text=auto\n\n# Custom for Visual Studio\n*.cs     diff=csharp\n*.sln    merge=union\n*.csproj merge=union\n*.vbproj merge=union\n*.fsproj merge=union\n*.dbproj merge=union\n\n# Standard to msysgit\n*.doc\t diff=astextplain\n*.DOC\t diff=astextplain\n*.docx diff=astextplain\n*.DOCX diff=astextplain\n*.dot  diff=astextplain\n*.DOT  diff=astextplain\n*.pdf  diff=astextplain\n*.PDF\t diff=astextplain\n*.rtf\t diff=astextplain\n*.RTF\t diff=astextplain\n"
  },
  {
    "path": ".gitignore",
    "content": "#################\n## Eclipse\n#################\n\n*.pydevproject\n.project\n.metadata\nbin/\ntmp/\n*.tmp\n*.bak\n*.swp\n*~.nib\nlocal.properties\n.classpath\n.settings/\n.loadpath\n\n# External tool builders\n.externalToolBuilders/\n\n# Locally stored \"Eclipse launch configurations\"\n*.launch\n\n# CDT-specific\n.cproject\n\n# PDT-specific\n.buildpath\n\n\n#################\n## Visual Studio\n#################\n\n## Ignore Visual Studio temporary files, build results, and\n## files generated by popular Visual Studio add-ons.\n\n# User-specific files\n*.suo\n*.user\n*.sln.docstates\n\n# Build results\n\n[Dd]ebug/\n[Rr]elease/\nx64/\nbuild/\n[Bb]in/\n[Oo]bj/\n\n# MSTest test Results\n[Tt]est[Rr]esult*/\n[Bb]uild[Ll]og.*\n\n*_i.c\n*_p.c\n*.ilk\n*.meta\n*.obj\n*.pch\n*.pdb\n*.pgc\n*.pgd\n*.rsp\n*.sbr\n*.tlb\n*.tli\n*.tlh\n*.tmp\n*.tmp_proj\n*.log\n*.vspscc\n*.vssscc\n.builds\n*.pidb\n*.log\n*.scc\n\n# Visual C++ cache files\nipch/\n*.aps\n*.ncb\n*.opensdf\n*.sdf\n*.cachefile\n\n# Visual Studio profiler\n*.psess\n*.vsp\n*.vspx\n\n# Guidance Automation Toolkit\n*.gpState\n\n# ReSharper is a .NET coding add-in\n_ReSharper*/\n*.[Rr]e[Ss]harper\n\n# TeamCity is a build add-in\n_TeamCity*\n\n# DotCover is a Code Coverage Tool\n*.dotCover\n\n# NCrunch\n*.ncrunch*\n.*crunch*.local.xml\n\n# Installshield output folder\n[Ee]xpress/\n\n# DocProject is a documentation generator add-in\nDocProject/buildhelp/\nDocProject/Help/*.HxT\nDocProject/Help/*.HxC\nDocProject/Help/*.hhc\nDocProject/Help/*.hhk\nDocProject/Help/*.hhp\nDocProject/Help/Html2\nDocProject/Help/html\n\n# Click-Once directory\npublish/\n\n# Publish Web Output\n*.Publish.xml\n*.pubxml\n\n# NuGet Packages Directory\n## TODO: If you have NuGet Package Restore enabled, uncomment the next line\n#packages/\n\n# Windows Azure Build Output\ncsx\n*.build.csdef\n\n# Windows Store app package directory\nAppPackages/\n\n# 
Others\nsql/\n*.Cache\nClientBin/\n[Ss]tyle[Cc]op.*\n~$*\n*~\n*.dbmdl\n*.[Pp]ublish.xml\n*.pfx\n*.publishsettings\n\n# RIA/Silverlight projects\nGenerated_Code/\n\n# Backup & report files from converting an old project file to a newer\n# Visual Studio version. Backup files are not needed, because we have git ;-)\n_UpgradeReport_Files/\nBackup*/\nUpgradeLog*.XML\nUpgradeLog*.htm\n\n# SQL Server files\nApp_Data/*.mdf\nApp_Data/*.ldf\n\n#############\n## Windows detritus\n#############\n\n# Windows image file caches\nThumbs.db\nehthumbs.db\n\n# Folder config file\nDesktop.ini\n\n# Recycle Bin used on file shares\n$RECYCLE.BIN/\n\n# Mac crap\n.DS_Store\n\n\n#############\n## Python\n#############\n\n*.py[co]\n\n# Packages\n*.egg\n*.egg-info\ndist/\nbuild/\neggs/\nparts/\nvar/\nsdist/\ndevelop-eggs/\n.installed.cfg\n\n# Installer logs\npip-log.txt\n\n# Unit test / coverage reports\n.coverage\n.tox\n\n#Translations\n*.mo\n\n#Mr Developer\n.mr.developer.cfg\n"
  },
  {
    "path": "LICENSE",
    "content": "Copyright (c) 2013 Zygmunt Zając\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are met:\n\n* Redistributions of source code must retain the above copyright notice, this\n  list of conditions and the following disclaimer.\n\n* Redistributions in binary form must reproduce the above copyright notice,\n  this list of conditions and the following disclaimer in the documentation\n  and/or other materials provided with the distribution.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\nAND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\nIMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\nDISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE\nFOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\nDAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\nSERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER\nCAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,\nOR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\nOF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n"
  },
  {
    "path": "README.md",
    "content": "phraug\n======\n\nA set of simple Python scripts for pre-processing large files, things like splitting and format conversion. The names _phraug_ comes from a great book, _Made to Stick_, by Chip and Dan Heath.\n\nSee [http://fastml.com/processing-large-files-line-by-line/](http://fastml.com/processing-large-files-line-by-line/) for the basic idea.\n\nThere's always at least one input file and usually one or more output files. An input file always stays unchanged.\n\n__[phraug2](https://github.com/zygmuntz/phraug2) is available. It offers improved handling of command line arguments.__ Check it out.\n\nFormat conversion\n-----------------\n\n`[...]` means that the parameter is optional.\n\n`csv2libsvm.py <input file> <output file> [<label index = 0>] [<skip headers = 0>]`\n\nConvert CSV to the LIBSVM format. If there are no labels in the input file, specify _label index_ = -1. If there are headers in the input file, specify _skip headers_ = 1.\n\n`pivotedcsv2libsvm.py <input file> <output file> [<skip headers = 0>]`\n\nConvert pivoted CSV (each line contains sample id, feature index and feature value) to the LIBSVM format. If there are headers in the input file, specify _skip headers_ = 1.\n\n\n`csv2vw.py <input file> <output file> [<label index = 0>] [<skip headers = 0>]`\n\nConvert CSV to VW format. Arguments as above.\n\n\n`libsvm2csv.py <input file> <output file> <input file dimensionality>`\n\nConvert LIBSVM to CSV. You need to specify dimensionality, that is a number of columns (not counting a label).\n\n\n`libsvm2vw.py <input file> <output file>`\n\nConvert LIBSVM to VW.\n\n\n`tsv2csv.py <input file> <output file>`\n\nConvert tab-separated file to comma-separated file.\n\n\nColumn means, standard deviations and standardization\n--------------------------------------------------\n\nHow do you standardize (or _shift and scale_) your data if it doesn't fit into memory? With these two scripts. 
\n\n`colstats.py <input file> <output file> [<label index>]`\n\nCompute column means and standard deviations from data in a CSV file. Can skip the label column if present. Numbers only. The first line of the output file contains the means, the second one the standard deviations.\n\nThis script uses the f_is_headers module, which contains the is_headers() function. The function detects whether the first line of a file contains headers.\n\n`standardize.py <stats file> <input file> <output file> [<label index>]`\n\nStandardize (shift and scale to zero mean and unit standard deviation) data from a CSV file. Meant to be used with the column stats file produced by colstats.py. Numbers only.\n\n\nOther operations\n----------------\n\n`chunk.py <input file> <number of output files> [<random seed>]`\n\nSplit a file randomly line by line into a number of smaller files. Might be useful for preparing cross-validation. Output files will have the base name suffixed with a chunk number, for example `data.csv` will be chunked into `data_0.csv`, `data_1.csv` etc.\n\n`count.py <input file>`\n\nCount lines in a file. On Unix you can do it with `wc -l`.\n\n`delete_cols.py <input file> <output_file> <indices of columns to delete>`\n`delete_cols.py train.csv train_del.csv 0 2 3`\n\nDelete some columns from a CSV file. Indexes start with 0. Separate them with whitespace.\n\n`sample.py <input file> <output file> [<P = 0.5>]`\n\nSample lines from an input file with probability P. Similar to `split.py`, but there's only one output file. The first line is treated as a header and is always copied. Useful for sampling large datasets.\n\n`shuffle.py <input file> <output file> [<preserve headers = 0>] [<max. lines in memory = 25000>] [<random seed>]`\n\nShuffle (randomize order of) lines in a [big] file. Similar to Unix `shuf`. Useful for files that don't fit in memory. Set _preserve headers_ to 1 to keep the first line at the top of the output. For fastest operation, set _max. 
lines in memory_ as big as possible - this will result in fewer passes over the input file.\n\n`split.py <input file> <output file 1> <output file 2> [<P = 0.9>] [<random seed>]`\n\nSplit a file randomly into two. Default P (probability of writing to the first file) is 0.9. You can specify any string as a seed for the random number generator.\n\n\n`subset.py <input file> <output file> [<offset = 0>] [<lines = 100>]`\n\nSave a subset of lines from an input file to an output file. Start at _offset_ (default 0), save _lines_ (default 100).\n\n`unshuffle.py <input file> <output file> <max. lines in memory> <random seed>`\n\nUnshuffle a previously shuffled file (or any file) to the original order. Pass the same _random seed_ that was used for shuffling. The seed is mandatory, and since arguments are positional, _max. lines in memory_ is mandatory too.\n\n"
  },
  {
    "path": "chunk.py",
    "content": "'''\nsplit a file into a given number of chunks randomly, line by line. \nUsage: chunk.py <input file> <number of chunks> [<seed>]'\n'''\n\nimport sys, random, os\n\ninput_file = sys.argv[1]\nnum_chunks = int( sys.argv[2] )\n\ntry:\n\tseed = sys.argv[3]\nexcept IndexError:\n\tseed = None\n\t\nif seed:\n\tprint \"seeding: %s\" % ( seed )\n\trandom.seed( seed )\n\nbasename = os.path.basename( input_file )\nbasename, ext = os.path.splitext( basename )\n\ni = open( input_file )\n\nos = {}\nfor n in range( num_chunks ):\n\toutput_file = \"%s_%s%s\" % ( basename, n, ext )\n\tos[n] = open( output_file, 'wb' )\n\ncounter = 0\n\nfor line in i:\n\tn = random.randint( 0, num_chunks - 1 )\n\tos[n].write( line )\n\t\n\tcounter += 1\n\tif counter % 100000 == 0:\n\t\tprint counter\n\t\n\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t"
  },
  {
    "path": "colstats.py",
    "content": "\"\"\"\ncompute column means and standard deviations from data in csv file\ncolstats.py <input file> <output file> [<label index>]\n\"\"\"\n\nimport sys, csv\nimport numpy as np\nfrom f_is_headers import *\n\nprint __doc__\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\n\ntry:\n\tlabel_index = int( sys.argv[3] )\nexcept IndexError:\n\tlabel_index = False\n\t\ni = open( input_file )\nreader = csv.reader( i )\nwriter = csv.writer( open( output_file, 'wb' ))\n\n# check headers\n\nfirst_line = reader.next()\nif not is_headers( first_line ):\n\t# rewind\n\ti.seek( 0 )\t\t\n\t\nn = 0\nsums_x = 0\t\t# will be a np array\nsums_x2 = 0\t\t# will be a np array\n\nfor line in reader:\n\tn += 1\n\t\n\tif not label_index is False:\n\t\tline.pop( label_index )\n\t\n\tx = np.array( map( float, line ))\n\tx2 = np.square( x )\n\n\tsums_x += x\n\tsums_x2 += x2\n\n\n# preparation\n\nprint n\t\nprint sums_x\nprint sums_x2\n\t\nmeans = sums_x / n\t\nsums2_x = np.square( sums_x )\n\n#print means\n#print sums2_x\n\nvariances = sums_x2 / n - sums2_x / ( n ** 2 )\nstandard_deviations = np.sqrt( variances )\n\n#print variances\n#print standard_deviations\n\n# save stats\n\nwriter.writerow( means )\nwriter.writerow( standard_deviations )\n"
  },
  {
    "path": "count.py",
    "content": "'Count lines in a file'\n\nimport sys\n\nfile_path = sys.argv[1]\nf = open( file_path )\n\ncount =  0\nfor line in f:\n\tcount += 1\n\t\n\tif count % 100000 == 0:\n\t\tprint count\n\t\nprint count\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t"
  },
  {
    "path": "csv2libsvm.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nConvert CSV file to libsvm format. Works only with numeric variables.\nPut -1 as label index (argv[3]) if there are no labels in your file.\nExpecting no headers. If present, headers can be skipped with argv[4] == 1.\n\n\"\"\"\n\nimport sys\nimport csv\nfrom collections import defaultdict\n\ndef construct_line( label, line ):\n\tnew_line = []\n\tif float( label ) == 0.0:\n\t\tlabel = \"0\"\n\tnew_line.append( label )\n\n\tfor i, item in enumerate( line ):\n\t\tif item == '' or float( item ) == 0.0:\n\t\t\tcontinue\n\t\tnew_item = \"%s:%s\" % ( i + 1, item )\n\t\tnew_line.append( new_item )\n\tnew_line = \" \".join( new_line )\n\tnew_line += \"\\n\"\n\treturn new_line\n\n# ---\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\n\ntry:\n\tlabel_index = int( sys.argv[3] )\nexcept IndexError:\n\tlabel_index = 0\n\ntry:\n\tskip_headers = sys.argv[4]\nexcept IndexError:\n\tskip_headers = 0\n\ni = open( input_file, 'rb' )\no = open( output_file, 'wb' )\n\nreader = csv.reader( i )\n\nif skip_headers:\n\theaders = reader.next()\n\nfor line in reader:\n\tif label_index == -1:\n\t\tlabel = '1'\n\telse:\n\t\tlabel = line.pop( label_index )\n\n\tnew_line = construct_line( label, line )\n\to.write( new_line )\n\n"
  },
  {
    "path": "csv2vw.py",
    "content": "\"\"\"\nConvert CSV file to vw format. Headers can be skipped with argv[4] == true.\nUse -1 for label index if there no labels in the input file\n\nphraug2 version has an option to ignore columns:\nhttps://github.com/zygmuntz/phraug2/blob/master/csv2vw.py\n\"\"\"\n\nimport sys\nimport csv\n\ndef construct_line( label, line ):\n\tnew_line = []\n\tif float( label ) == 0.0:\n\t\tlabel = \"0\"\n\tnew_line.append( \"%s |n \" % ( label ))\n\t\n\tfor i, item in enumerate( line ):\n\t\tif float( item ) == 0.0:\n\t\t\tcontinue\t# sparse!!!\n\t\tnew_item = \"%s:%s\" % ( i + 1, item )\n\t\tnew_line.append( new_item )\n\tnew_line = \" \".join( new_line )\n\tnew_line += \"\\n\"\n\treturn new_line\n\n# ---\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\n\ntry:\n\tlabel_index = int( sys.argv[3] )\nexcept IndexError:\n\tlabel_index = 0\n\t\ntry:\n\tskip_headers = sys.argv[4]\nexcept IndexError:\n\tskip_headers = 0\t\n\ni = open( input_file )\no = open( output_file, 'w' )\n\nreader = csv.reader( i )\nif skip_headers:\n\theaders = reader.next()\n\nn = 0\n\nfor line in reader:\n\tif label_index == -1:\n\t\tlabel = 1\n\telse:\n\t\tlabel = line.pop( label_index )\n\t\t\n\tnew_line = construct_line( label, line )\n\to.write( new_line )\n\t\n\tn += 1\n\tif n % 10000 == 0:\n\t\tprint n\n\t\t\n\t\t"
  },
  {
    "path": "delete_cols.py",
    "content": "'delete some columns from file, given by their indexes'\n\nimport csv\nimport sys\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\nheaders = sys.argv[3:]\n\nheaders = map( int, headers )\nheaders.sort( reverse = True )\n\nprint \"%s ---> %s\" % ( input_file, output_file )\nprint \"header indices: %s\" % ( headers )\n\ni = open( input_file )\no = open( output_file, 'wb' )\n\nreader = csv.reader( i )\nwriter = csv.writer( o )\n\ncounter = 0\nfor line in reader:\n\n\tfor h in headers:\n\t\tdel line[h]\n\n\twriter.writerow( line )\n\t\n\tcounter += 1\n\tif counter % 10000 == 0:\n\t\tprint counter\n\n\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t"
  },
  {
    "path": "f_is_headers.py",
    "content": "import re\n\ndef is_headers( line ):\n\tline = ''.join( line )\n\tif re.match( '.*[a-df-zA-Z].*', line ):\n\t\treturn True\n\t\t\n\t\t"
  },
  {
    "path": "libsvm2csv.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nconvert libsvm file to csv'\nlibsvm2csv.py <input file> <output file> <X dimensionality>\n\"\"\"\n\nimport sys\nimport csv\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\n\nd = int( sys.argv[3] )\nassert ( d > 0 )\n\nreader = csv.reader( open( input_file ), delimiter = \" \" )\nwriter = csv.writer( open( output_file, 'wb' ))\n\nfor line in reader:\n\tlabel = line.pop( 0 )\n\tif line[-1].strip() == '':\n\t\tline.pop( -1 )\n\t\t\n\t# print line\n\t\n\tline = map( lambda x: tuple( x.split( \":\" )), line )\n\t#print line\n\t# ('1', '0.194035105364'), ('2', '0.186042408882'), ('3', '-0.148706067206'), ...\n\t\n\tnew_line = [ label ] + [ 0 ] * d\n\tfor i, v in line:\n\t\ti = int( i )\n\t\tif i <= d:\n\t\t\tnew_line[i] = v\n\t\t\n\twriter.writerow( new_line )\n"
  },
  {
    "path": "libsvm2vw.py",
    "content": "\"convert a libsvm file to VW format\"\n\"skip malformed lines\"\n\"in case of binary classification with 0/1 labels set the third argument to True\"\n\"this will convert labels to -1/1\"\n\nimport sys\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\ntry:\n\tconvert_zero_to_negative_one = bool( sys.argv[3] )\nexcept IndexError:\n\tconvert_zero_to_negative_one = False\n\ni = open( input_file )\no = open( output_file, 'wb' )\n\nfor line in i:\n\ttry:\n\t\ty, x = line.split( \" \", 1 )\n\t# ValueError: need more than 1 value to unpack\n\texcept ValueError:\n\t\tprint \"line with ValueError (skipping):\"\n\t\tprint line\n\t\tcontinue\n\t\t\n\tif convert_zero_to_negative_one and y == '0':\n\t\ty = '-1'\n\tnew_line = y + \" |n \" + x\n\to.write( new_line )\n\t\n"
  },
  {
    "path": "pivotedcsv2libsvm.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nConvert pivoted CSV file to libsvm format. Works only with numeric variables.\nExpecting no labels and no headers. If present, headers can be skipped with argv[3] == 1.\n\nExample call: python pivotedlibsvm2csv.py pivoted.csv output.txt\nformat: row_id, zero_based_feature_index [, value = 1]\n\ninput example:\nid1,1\nid1,2\nid1,3\nid2,2,0.5\nid2,3,0.6\nid2,4,0.7\n\noutput example:\n1 2:1 3:1 4:1\n1 3:0.5 4:0.6 5:0.7\n\n\"\"\"\n\nimport sys\nimport csv\n\ndef construct_line( label, line ):\n\tnew_line = []\n\tif float( label ) == 0.0:\n\t\tlabel = \"0\"\n\tnew_line.append( label )\n\n\tfor i_item in line:\n\t\ti, item = i_item\n\t\tif item == '' or float( item ) == 0.0:\n\t\t\tcontinue\n\t\tnew_item = \"%s:%s\" % ( i + 1, item )\n\t\tnew_line.append( new_item )\n\t\t\n\tnew_line = \" \".join( new_line )\n\tnew_line += \"\\n\"\n\treturn new_line\n\n# ---\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\n\ntry:\n\tskip_headers = sys.argv[3]\nexcept IndexError:\n\tskip_headers = 0\n\ni = open( input_file, 'rb' )\no = open( output_file, 'wb' )\n\nreader = csv.reader( i )\n\nif skip_headers:\n\theaders = reader.next()\n\n\nline = reader.next()\ncurrent_row = line[0].strip()\ncurrent_feature = int( line[1].strip())\ncurrent_value = line[2].strip() if len( line ) > 2 else '1'\ncurrent_line = [ ( current_feature, current_value ) ]\n\nfor line in reader:\n\trow_id = line[0]\n\tfeature_index = int( line[1].strip())\n\tfeature_value = line[2].strip() if len( line ) > 2 else '1'\n\t\n\tif row_id != current_row:\n\t\tnew_line = construct_line( '1', current_line )\n\t\to.write( new_line )\t\n\t\t\n\t\tcurrent_row = row_id\n\t\tcurrent_line = [( feature_index, feature_value )]\n\telse:\n\t\tcurrent_line.append(( feature_index, feature_value ))\n\n# the last row\nnew_line = construct_line( '1', current_line )\no.write( new_line )\t\n"
  },
  {
    "path": "sample.py",
    "content": "'sample lines from input file with probability P, save them to output file'\n\nimport csv\nimport sys\nimport random\n\ntry:\n\tP = float( sys.argv[3] )\nexcept IndexError:\n\tP = 0.5\n\t\nprint \"P = %s\" % ( P )\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\n\ni = open( input_file )\no = open( output_file, 'w' )\n\nreader = csv.reader( i )\nwriter = csv.writer( o )\n\nheaders = reader.next()\nwriter.writerow( headers )\n\nfor line in reader:\n\tr = random.random()\n\tif r > P:\n\t\tcontinue\n\n\twriter.writerow( line )\n\t\n\n\t\n\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t"
  },
  {
    "path": "shuffle.py",
    "content": "\"\"\"\nShuffle lines in a [big] file\nshuffle.py <input_file> <output_file> [<preserve headers?>] [<max. lines in memory>] [<random seed>]\n\n\"\"\"\n\nimport sys\nimport random\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\n\ntry:\n\tpreserve_headers = int( sys.argv[3] )\nexcept IndexError:\n\tpreserve_headers = 0\n\ntry:\n\tlines_in_memory = int( sys.argv[4] )\nexcept IndexError:\n\tlines_in_memory = 25000\n\t\nprint \"caching %s lines at a time...\" % ( lines_in_memory )\n\t\ntry:\n\trandom_seed = sys.argv[5]\n\trandom.seed( random_seed )\n\tprint \"random seed: %s\" % ( random_seed )\nexcept IndexError:\n\tpass\n\t\n# first count\n\nprint \"counting lines...\"\n\ni_f = open( input_file )\no_f = open( output_file, 'wb' )\n\nif preserve_headers:\n\theaders = i_f.readline()\n\to_f.write( headers )\n\ncounter =  0\nfor line in i_f:\n\tcounter += 1\n\t\n\tif counter % 100000 == 0:\n\t\tprint counter\n\t\nprint counter\n\t\t\nprint \"shuffling...\"\n\norder = range( counter )\nrandom.shuffle( order )\n\nepoch = 0\n\t\nwhile order:\n\n\tcurrent_lines = {}\n\tcurrent_lines_count = 0\n\n\tcurrent_chunk = order[:lines_in_memory]\n\tcurrent_chunk_dict = { x: 1 for x in current_chunk }\t\t# faster \"in\"\n\tcurrent_chunk_length = len( current_chunk )\n\t\n\torder = order[lines_in_memory:]\n\t\n\ti_f.seek( 0 )\n\tif preserve_headers:\n\t\ti_f.readline()\n\t\t\n\tcount = 0\n\t\t\n\tfor line in i_f:\n\t\tif count in current_chunk_dict:\n\t\t\tcurrent_lines[count] = line\n\t\t\tcurrent_lines_count += 1\n\t\t\tif current_lines_count == current_chunk_length:\n\t\t\t\tbreak\n\t\tcount += 1\t\n\t\tif count % 100000 == 0:\n\t\t\tprint count\t\t\n\t\n\tprint \"writing...\"\n\t\n\tfor l in current_chunk:\n\t\to_f.write( current_lines[l] )\n\t\n\tlines_saved = current_chunk_length + epoch * lines_in_memory\n\tepoch += 1\n\tprint \"pass %s complete (%s lines saved)\" % ( epoch, lines_saved )\n\t\t"
  },
  {
    "path": "split.py",
    "content": "'''\nsplit a file into two randomly, line by line. \nUsage: split.py <input file> <output file 1> <output file 2> [<probability of writing to the first file>]'\n'''\n\nimport csv\nimport sys\nimport random\n\ninput_file = sys.argv[1]\noutput_file1 = sys.argv[2]\noutput_file2 = sys.argv[3]\n\ntry:\n\tP = float( sys.argv[4] )\nexcept IndexError:\n\tP = 0.9\n\t\ntry:\n\tseed = sys.argv[5]\nexcept IndexError:\n\tseed = None\n\ntry:\n\tskip_headers = sys.argv[6]\nexcept IndexError:\n\tskip_headers = False\n\t\ntry:\n\tskip_headers = sys.argv[6]\nexcept IndexError:\n\tskip_headers = False\t\n\t\nprint \"P = %s\" % ( P )\n\nif seed:\n\trandom.seed( seed )\n\ni = open( input_file )\no1 = open( output_file1, 'wb' )\no2 = open( output_file2, 'wb' )\n\nif skip_headers:\n\ti.readline()\n\ncounter = 0\n\nfor line in i:\n\tr = random.random()\n\tif r > P:\n\t\to2.write( line )\n\telse:\n\t\to1.write( line )\n\t\n\tcounter += 1\n\tif counter % 100000 == 0:\n\t\tprint counter\n\t\n\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n"
  },
  {
    "path": "standardize.py",
    "content": "'standardize (shift and scale to zero mean and unit standard deviation) data from csv file'\n'meant to be used together with colstats.py'\n'standardize.py <stats file> <input file> <output file> [<label index>]'\n\nimport sys, csv\nimport numpy as np\nfrom f_is_headers import *\n\nstats_file = sys.argv[1]\ninput_file = sys.argv[2]\noutput_file = sys.argv[3]\n\ntry:\n\tlabel_index = int( sys.argv[4] )\nexcept IndexError:\n\tlabel_index = False\n\t\ni = open( input_file )\t\nstats_reader = csv.reader( open( stats_file ))\t\nreader = csv.reader( i )\nwriter = csv.writer( open( output_file, 'wb' ))\n\n# get stats\n\nmeans = stats_reader.next()\nmeans = np.array( map( float, means ))\n\nstandard_deviations = stats_reader.next()\nstandard_deviations = np.array( map( float, standard_deviations ))\n\n# check headers\n\nfirst_line = reader.next()\nif is_headers( first_line ):\n\theaders = first_line\nelse:\n\theaders = False\n\ti.seek( 0 )\n\t\n# go\n\nfor line in reader:\n\t\n\tif not label_index is False:\n\t\tl = line.pop( label_index )\t\n\t\tprint l\n\t\t\n\tx = np.array( map( float, line ))\n\t\n\t# shift and scale\n\tx = x - means\n\tx = x / standard_deviations\n\t\n\tif not label_index is False:\n\t\t# -1.0,...\n\t\t#x = np.insert( x, 0, l )\n\t\tline = list( x )\n\t\tline.insert( 0, l )\n\t\n\twriter.writerow( line )\n\t\n"
  },
  {
    "path": "subset.py",
    "content": "'Save a subset of lines from an input file; start at offset and count n lines'\n'default 100 lines starting from 0'\n\nimport sys\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\ntry:\n\toffset = int( sys.argv[3] )\nexcept IndexError:\n\toffset = 0\n\t\ntry:\n\tlines = int( sys.argv[4] )\nexcept IndexError:\n\tlines = 100\t\n\n\ni = open( input_file )\no = open( output_file, 'wb' )\n\ncount =  0\nfor line in i:\n\n\tif offset > 0:\n\t\toffset -= 1\n\t\tcontinue\n\n\to.write( line )\n\tcount += 1\n\t\n\tif count >= lines:\n\t\tbreak\n\t\n\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\t"
  },
  {
    "path": "tsv2csv.py",
    "content": "import csv\nimport sys\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\n\ni = open( input_file )\no = open( output_file, 'wb' )\n\nreader = csv.reader( i, delimiter = '\\t' )\nwriter = csv.writer( o )\n\nfor line in reader:\n\twriter.writerow( line )\n"
  },
  {
    "path": "unshuffle.py",
    "content": "\"\"\"\nUnshuffle previously shuffled file\nunshuffle.py input_file.csv output_file.csv <max. lines in memory> <random seed>\n\n\"\"\"\n\nimport sys\nimport random\n\ninput_file = sys.argv[1]\noutput_file = sys.argv[2]\n\ntry:\n\tlines_in_memory = int( sys.argv[3] )\nexcept IndexError:\n\tlines_in_memory = 100000\n\t\nprint \"caching %s lines at a time...\" % ( lines_in_memory )\n\t\ntry:\n\trandom_seed = sys.argv[4]\n\trandom.seed( random_seed )\n\tprint \"random seed: %s\" % ( random_seed )\nexcept IndexError:\n\tprint \"need a seed...\"\n\tsys.exit( 1 )\n\t\n# first count\n\nprint \"counting lines...\"\n\nf = open( input_file )\n\ncount =  0\nfor line in f:\n\tcount += 1\n\t\n\tif count % 100000 == 0:\n\t\tprint count\n\t\nprint count\n\t\t\n# then shuffle\t\t\n\nprint \"(un)shuffling...\"\n\no_f = open( output_file, 'wb' )\n\t\norder = range( count )\nrandom.shuffle( order )\n\n# un-shuffle\n\norder_dict = { shuf_i: orig_i for shuf_i, orig_i in enumerate( order ) }\n# sort by original key asc, will get shuffled keys in the right order to unshuffle\norder = sorted( order_dict, key = order_dict.get )\n\nepoch = 0\n\t\nwhile order:\n\n\tcurrent_lines = {}\n\tcurrent_lines_count = 0\n\n\tcurrent_chunk = order[:lines_in_memory]\n\tcurrent_chunk_dict = { x: 1 for x in current_chunk }\t\t# faster \"in\"\n\tcurrent_chunk_length = len( current_chunk )\n\t\n\torder = order[lines_in_memory:]\n\t\n\tf.seek( 0 )\n\tcount = 0\n\t\t\n\tfor line in f:\n\t\tif count in current_chunk_dict:\n\t\t\tcurrent_lines[count] = line\n\t\t\tcurrent_lines_count += 1\n\t\t\tif current_lines_count == current_chunk_length:\n\t\t\t\tbreak\n\t\tcount += 1\t\n\t\tif count % 100000 == 0:\n\t\t\tprint count\t\t\n\t\n\tprint \"writing...\"\n\t\n\tfor l in current_chunk:\n\t\to_f.write( current_lines[l] )\n\t\n\tlines_saved = current_chunk_length + epoch * lines_in_memory\n\tepoch += 1\n\tprint \"pass %s complete (%s lines saved)\" % ( epoch, lines_saved )\n\t\t"
  }
]