Repository: zygmuntz/phraug
Branch: master
Commit: 06a7e54fb5ad
Files: 21
Total size: 23.6 KB
Directory structure:
gitextract_5og_t45i/
├── .gitattributes
├── .gitignore
├── LICENSE
├── README.md
├── chunk.py
├── colstats.py
├── count.py
├── csv2libsvm.py
├── csv2vw.py
├── delete_cols.py
├── f_is_headers.py
├── libsvm2csv.py
├── libsvm2vw.py
├── pivotedcsv2libsvm.py
├── sample.py
├── shuffle.py
├── split.py
├── standardize.py
├── subset.py
├── tsv2csv.py
└── unshuffle.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitattributes
================================================
# Auto detect text files and perform LF normalization
* text=auto
# Custom for Visual Studio
*.cs diff=csharp
*.sln merge=union
*.csproj merge=union
*.vbproj merge=union
*.fsproj merge=union
*.dbproj merge=union
# Standard to msysgit
*.doc diff=astextplain
*.DOC diff=astextplain
*.docx diff=astextplain
*.DOCX diff=astextplain
*.dot diff=astextplain
*.DOT diff=astextplain
*.pdf diff=astextplain
*.PDF diff=astextplain
*.rtf diff=astextplain
*.RTF diff=astextplain
================================================
FILE: .gitignore
================================================
#################
## Eclipse
#################
*.pydevproject
.project
.metadata
bin/
tmp/
*.tmp
*.bak
*.swp
*~.nib
local.properties
.classpath
.settings/
.loadpath
# External tool builders
.externalToolBuilders/
# Locally stored "Eclipse launch configurations"
*.launch
# CDT-specific
.cproject
# PDT-specific
.buildpath
#################
## Visual Studio
#################
## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.
# User-specific files
*.suo
*.user
*.sln.docstates
# Build results
[Dd]ebug/
[Rr]elease/
x64/
build/
[Bb]in/
[Oo]bj/
# MSTest test Results
[Tt]est[Rr]esult*/
[Bb]uild[Ll]og.*
*_i.c
*_p.c
*.ilk
*.meta
*.obj
*.pch
*.pdb
*.pgc
*.pgd
*.rsp
*.sbr
*.tlb
*.tli
*.tlh
*.tmp
*.tmp_proj
*.log
*.vspscc
*.vssscc
.builds
*.pidb
*.log
*.scc
# Visual C++ cache files
ipch/
*.aps
*.ncb
*.opensdf
*.sdf
*.cachefile
# Visual Studio profiler
*.psess
*.vsp
*.vspx
# Guidance Automation Toolkit
*.gpState
# ReSharper is a .NET coding add-in
_ReSharper*/
*.[Rr]e[Ss]harper
# TeamCity is a build add-in
_TeamCity*
# DotCover is a Code Coverage Tool
*.dotCover
# NCrunch
*.ncrunch*
.*crunch*.local.xml
# Installshield output folder
[Ee]xpress/
# DocProject is a documentation generator add-in
DocProject/buildhelp/
DocProject/Help/*.HxT
DocProject/Help/*.HxC
DocProject/Help/*.hhc
DocProject/Help/*.hhk
DocProject/Help/*.hhp
DocProject/Help/Html2
DocProject/Help/html
# Click-Once directory
publish/
# Publish Web Output
*.Publish.xml
*.pubxml
# NuGet Packages Directory
## TODO: If you have NuGet Package Restore enabled, uncomment the next line
#packages/
# Windows Azure Build Output
csx
*.build.csdef
# Windows Store app package directory
AppPackages/
# Others
sql/
*.Cache
ClientBin/
[Ss]tyle[Cc]op.*
~$*
*~
*.dbmdl
*.[Pp]ublish.xml
*.pfx
*.publishsettings
# RIA/Silverlight projects
Generated_Code/
# Backup & report files from converting an old project file to a newer
# Visual Studio version. Backup files are not needed, because we have git ;-)
_UpgradeReport_Files/
Backup*/
UpgradeLog*.XML
UpgradeLog*.htm
# SQL Server files
App_Data/*.mdf
App_Data/*.ldf
#############
## Windows detritus
#############
# Windows image file caches
Thumbs.db
ehthumbs.db
# Folder config file
Desktop.ini
# Recycle Bin used on file shares
$RECYCLE.BIN/
# Mac crap
.DS_Store
#############
## Python
#############
*.py[co]
# Packages
*.egg
*.egg-info
dist/
build/
eggs/
parts/
var/
sdist/
develop-eggs/
.installed.cfg
# Installer logs
pip-log.txt
# Unit test / coverage reports
.coverage
.tox
#Translations
*.mo
#Mr Developer
.mr.developer.cfg
================================================
FILE: LICENSE
================================================
Copyright (c) 2013 Zygmunt Zając
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: README.md
================================================
phraug
======
A set of simple Python scripts for pre-processing large files, things like splitting and format conversion. The name _phraug_ comes from a great book, _Made to Stick_, by Chip and Dan Heath.
See [http://fastml.com/processing-large-files-line-by-line/](http://fastml.com/processing-large-files-line-by-line/) for the basic idea.
There's always at least one input file and usually one or more output files. An input file always stays unchanged.
__[phraug2](https://github.com/zygmuntz/phraug2) is available. It offers improved handling of command line arguments.__ Check it out.
Format conversion
-----------------
`[...]` means that the parameter is optional.
`csv2libsvm.py <input file> <output file> [<label index = 0>] [<skip headers = 0>]`
Convert CSV to the LIBSVM format. If there are no labels in the input file, specify _label index_ = -1. If there are headers in the input file, specify _skip headers_ = 1.
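The core of the conversion can be sketched like this (a Python 3 model of the script's logic with a hypothetical helper name, not the Python 2 script itself): features become 1-based `index:value` pairs, and empty or zero values are dropped, which is what makes the libsvm format sparse.

```python
def csv_row_to_libsvm(row, label_index=0):
    # A label index of -1 means the file has no labels; emit a dummy '1'.
    row = list(row)
    label = '1' if label_index == -1 else row.pop(label_index)
    # libsvm uses 1-based feature indices and omits zeros (sparse format).
    feats = ["%d:%s" % (i + 1, v) for i, v in enumerate(row)
             if v != '' and float(v) != 0.0]
    return " ".join([label] + feats)

print(csv_row_to_libsvm(['1', '0.5', '0', '2.5']))  # 1 1:0.5 3:2.5
```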
`pivotedcsv2libsvm.py <input file> <output file> [<skip headers = 0>]`
Convert pivoted CSV (each line contains sample id, feature index and feature value) to the LIBSVM format. If there are headers in the input file, specify _skip headers_ = 1.
`csv2vw.py <input file> <output file> [<label index = 0>] [<skip headers = 0>]`
Convert CSV to VW format. Arguments as above.
`libsvm2csv.py <input file> <output file> <input file dimensionality>`
Convert LIBSVM to CSV. You need to specify the dimensionality, that is, the number of columns (not counting the label).
`libsvm2vw.py <input file> <output file>`
Convert LIBSVM to VW.
`tsv2csv.py <input file> <output file>`
Convert tab-separated file to comma-separated file.
Column means, standard deviations and standardization
--------------------------------------------------
How do you standardize (or _shift and scale_) your data if it doesn't fit into memory? With these two scripts.
`colstats.py <input file> <output file> [<label index>]`
Compute column means and standard deviations from data in a CSV file. Can skip the label column if present. Numbers only. The first line of the output file contains the means, the second one the standard deviations.
This script uses the f_is_headers module, which contains the is_headers() function. Its purpose is to automatically detect whether the first line of the file contains headers.
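The streaming computation can be modeled as follows (a Python 3 sketch with a hypothetical function name; the script itself uses numpy and Python 2): accumulate per-column sums of x and x² in a single pass, then use the identity var = E[x²] − E[x]².

```python
import math

def streaming_mean_std(rows):
    # One pass: accumulate sum(x) and sum(x^2) per column,
    # so the whole file never has to fit in memory.
    n, s, s2 = 0, None, None
    for row in rows:
        x = [float(v) for v in row]
        if s is None:
            s, s2 = [0.0] * len(x), [0.0] * len(x)
        n += 1
        for j, v in enumerate(x):
            s[j] += v
            s2[j] += v * v
    means = [v / n for v in s]
    # population variance: E[x^2] - E[x]^2
    stds = [math.sqrt(v2 / n - m * m) for v2, m in zip(s2, means)]
    return means, stds

print(streaming_mean_std([['1', '2'], ['3', '4']]))  # ([2.0, 3.0], [1.0, 1.0])
```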
`standardize.py <stats file> <input file> <output file> [<label index>]`
Standardize (shift and scale to zero mean and unit standard deviation) data from a CSV file. Meant to be used with the column stats file produced by colstats.py. Numbers only.
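Applying the stats is then a per-row operation; a minimal Python 3 sketch (hypothetical helper name, label handling omitted for brevity):

```python
def standardize_row(row, means, stds):
    # shift by the column mean, scale by the column standard deviation
    return [(float(v) - m) / s for v, m, s in zip(row, means, stds)]

print(standardize_row(['3', '5'], [1.0, 1.0], [2.0, 2.0]))  # [1.0, 2.0]
```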
Other operations
----------------
`chunk.py <input file> <number of output files> [<random seed>]`
Split a file randomly, line by line, into a number of smaller files. Might be useful for preparing cross-validation. Output files will have the base name suffixed with a chunk number; for example, `data.csv` will be chunked into `data_0.csv`, `data_1.csv`, etc.
`count.py <input file>`
Count lines in a file. On Unix you can do the same with `wc -l`.
`delete_cols.py <input file> <output_file> <indices of columns to delete>`
`delete_cols.py train.csv train_del.csv 0 2 3`
Delete some columns from a CSV file. Indices start at 0. Separate them with whitespace.
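The detail worth noting in the script is that it deletes columns in descending index order: deleting left to right would shift the positions of the columns still to be deleted. A Python 3 sketch of the idea (hypothetical helper name):

```python
def delete_cols(row, indices):
    # delete the highest index first so earlier deletions
    # don't shift the positions still to be deleted
    row = list(row)
    for h in sorted(indices, reverse=True):
        del row[h]
    return row

print(delete_cols(['a', 'b', 'c', 'd'], [0, 2]))  # ['b', 'd']
```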
`sample.py <input file> <output file> [<P = 0.5>]`
Sample lines from an input file with probability P. Similar to `split.py`, but there's only one output file. Useful for sampling large datasets. Note that the first line is always treated as a header and copied to the output.
`shuffle.py <input file> <output file> [<preserve headers = 0>] [<max. lines in memory = 25000>] [<random seed>]`
Shuffle (randomize the order of) lines in a [big] file. Similar to Unix `shuf`. Useful for files that don't fit in memory. For fastest operation, set _max. lines in memory_ as high as possible; this results in fewer passes over the input file.
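The script shuffles line numbers, not lines: it builds a random permutation of line indices, then makes repeated passes over the input, each pass caching only the next _max. lines in memory_ lines of the permutation. An in-memory Python 3 model of the scheme (the real script re-reads the file each pass instead of indexing a list):

```python
import random

def multipass_shuffle(lines, chunk_size, seed=None):
    # Build a random permutation of line numbers up front.
    rng = random.Random(seed)
    order = list(range(len(lines)))
    rng.shuffle(order)
    out = []
    # Each iteration models one pass over the input file:
    # only chunk_size lines are cached at a time.
    for start in range(0, len(order), chunk_size):
        chunk = order[start:start + chunk_size]
        cached = {i: lines[i] for i in chunk}
        out.extend(cached[i] for i in chunk)
    return out

shuffled = multipass_shuffle(['a', 'b', 'c', 'd'], chunk_size=2, seed=0)
print(sorted(shuffled))  # ['a', 'b', 'c', 'd'] - a permutation of the input
```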
`split.py <input file> <output file 1> <output file 2> [<P = 0.9>] [<random seed>] [<skip headers = 0>]`
Split a file into two randomly, line by line. Default P (the probability of writing to the first file) is 0.9. You can specify any string as a seed for the random number generator.
`subset.py <input file> <output file> [<offset = 0>] [<lines = 100>]`
Save a subset of lines from an input file to an output file. Start at _offset_ (default 0), save _lines_ (default 100).
`unshuffle.py <input file> <output file> <max. lines in memory> <random seed>`
Unshuffle a previously shuffled file (or any file) back to the original order. Syntax is the same as for `shuffle.py`, but the seed is mandatory, so _max. lines in memory_ must be given as well.
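Unshuffling works because the seed fully determines the permutation: regenerate the same permutation, invert it, and write the lines back out in the inverted order. A Python 3 demonstration of the round trip (the RNG here is Python 3's, so this models the idea rather than reproducing the Python 2 script's exact ordering; helper names are hypothetical):

```python
import random

def make_order(n, seed):
    # the same seed always produces the same permutation
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order

def unshuffle(shuffled, seed):
    # After shuffling, output position i holds original line order[i];
    # regenerate the permutation from the seed and invert that mapping.
    order = make_order(len(shuffled), seed)
    original = [None] * len(shuffled)
    for i, orig_idx in enumerate(order):
        original[orig_idx] = shuffled[i]
    return original

lines = ['a', 'b', 'c', 'd']
shuffled = [lines[j] for j in make_order(len(lines), seed=42)]
print(unshuffle(shuffled, seed=42))  # ['a', 'b', 'c', 'd']
```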
================================================
FILE: chunk.py
================================================
'''
split a file into a given number of chunks randomly, line by line.
Usage: chunk.py <input file> <number of chunks> [<seed>]
'''
import sys, random, os
input_file = sys.argv[1]
num_chunks = int( sys.argv[2] )
try:
seed = sys.argv[3]
except IndexError:
seed = None
if seed:
print "seeding: %s" % ( seed )
random.seed( seed )
basename = os.path.basename( input_file )
basename, ext = os.path.splitext( basename )
i = open( input_file )
# output file handles, keyed by chunk number (don't shadow the os module)
outputs = {}
for n in range( num_chunks ):
	output_file = "%s_%s%s" % ( basename, n, ext )
	outputs[n] = open( output_file, 'wb' )
counter = 0
for line in i:
	n = random.randint( 0, num_chunks - 1 )
	outputs[n].write( line )
counter += 1
if counter % 100000 == 0:
print counter
================================================
FILE: colstats.py
================================================
"""
compute column means and standard deviations from data in csv file
colstats.py <input file> <output file> [<label index>]
"""
import sys, csv
import numpy as np
from f_is_headers import *
print __doc__
input_file = sys.argv[1]
output_file = sys.argv[2]
try:
label_index = int( sys.argv[3] )
except IndexError:
label_index = False
i = open( input_file )
reader = csv.reader( i )
writer = csv.writer( open( output_file, 'wb' ))
# check headers
first_line = reader.next()
if not is_headers( first_line ):
# rewind
i.seek( 0 )
n = 0
sums_x = 0 # will be a np array
sums_x2 = 0 # will be a np array
for line in reader:
n += 1
	if label_index is not False:
line.pop( label_index )
x = np.array( map( float, line ))
x2 = np.square( x )
sums_x += x
sums_x2 += x2
# preparation
print n
print sums_x
print sums_x2
means = sums_x / n
sums2_x = np.square( sums_x )
#print means
#print sums2_x
variances = sums_x2 / n - sums2_x / ( n ** 2 )
standard_deviations = np.sqrt( variances )
#print variances
#print standard_deviations
# save stats
writer.writerow( means )
writer.writerow( standard_deviations )
================================================
FILE: count.py
================================================
'Count lines in a file'
import sys
file_path = sys.argv[1]
f = open( file_path )
count = 0
for line in f:
count += 1
if count % 100000 == 0:
print count
print count
================================================
FILE: csv2libsvm.py
================================================
#!/usr/bin/env python
"""
Convert CSV file to libsvm format. Works only with numeric variables.
Put -1 as label index (argv[3]) if there are no labels in your file.
Expecting no headers. If present, headers can be skipped with argv[4] == 1.
"""
import sys
import csv
from collections import defaultdict
def construct_line( label, line ):
new_line = []
if float( label ) == 0.0:
label = "0"
new_line.append( label )
for i, item in enumerate( line ):
if item == '' or float( item ) == 0.0:
continue
new_item = "%s:%s" % ( i + 1, item )
new_line.append( new_item )
new_line = " ".join( new_line )
new_line += "\n"
return new_line
# ---
input_file = sys.argv[1]
output_file = sys.argv[2]
try:
label_index = int( sys.argv[3] )
except IndexError:
label_index = 0
try:
	skip_headers = int( sys.argv[4] )
except IndexError:
skip_headers = 0
i = open( input_file, 'rb' )
o = open( output_file, 'wb' )
reader = csv.reader( i )
if skip_headers:
headers = reader.next()
for line in reader:
if label_index == -1:
label = '1'
else:
label = line.pop( label_index )
new_line = construct_line( label, line )
o.write( new_line )
================================================
FILE: csv2vw.py
================================================
"""
Convert CSV file to VW format. Headers can be skipped with argv[4] == 1.
Use -1 for label index if there are no labels in the input file.
phraug2 version has an option to ignore columns:
https://github.com/zygmuntz/phraug2/blob/master/csv2vw.py
"""
import sys
import csv
def construct_line( label, line ):
new_line = []
if float( label ) == 0.0:
label = "0"
new_line.append( "%s |n " % ( label ))
for i, item in enumerate( line ):
		if item == '' or float( item ) == 0.0:
			continue # sparse format: skip empty and zero values
new_item = "%s:%s" % ( i + 1, item )
new_line.append( new_item )
new_line = " ".join( new_line )
new_line += "\n"
return new_line
# ---
input_file = sys.argv[1]
output_file = sys.argv[2]
try:
label_index = int( sys.argv[3] )
except IndexError:
label_index = 0
try:
	skip_headers = int( sys.argv[4] )
except IndexError:
skip_headers = 0
i = open( input_file )
o = open( output_file, 'w' )
reader = csv.reader( i )
if skip_headers:
headers = reader.next()
n = 0
for line in reader:
if label_index == -1:
label = 1
else:
label = line.pop( label_index )
new_line = construct_line( label, line )
o.write( new_line )
n += 1
if n % 10000 == 0:
print n
================================================
FILE: delete_cols.py
================================================
'delete some columns from file, given by their indexes'
import csv
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
headers = sys.argv[3:]
headers = map( int, headers )
headers.sort( reverse = True )
print "%s ---> %s" % ( input_file, output_file )
print "header indices: %s" % ( headers )
i = open( input_file )
o = open( output_file, 'wb' )
reader = csv.reader( i )
writer = csv.writer( o )
counter = 0
for line in reader:
for h in headers:
del line[h]
writer.writerow( line )
counter += 1
if counter % 10000 == 0:
print counter
================================================
FILE: f_is_headers.py
================================================
import re
def is_headers( line ):
line = ''.join( line )
	# lowercase 'e' is deliberately excluded from the letter class so numbers
	# in exponent notation (e.g. 1e5) are not mistaken for a header line
	if re.match( '.*[a-df-zA-Z].*', line ):
return True
================================================
FILE: libsvm2csv.py
================================================
#!/usr/bin/env python
"""
convert libsvm file to csv
libsvm2csv.py <input file> <output file> <X dimensionality>
"""
import sys
import csv
input_file = sys.argv[1]
output_file = sys.argv[2]
d = int( sys.argv[3] )
assert ( d > 0 )
reader = csv.reader( open( input_file ), delimiter = " " )
writer = csv.writer( open( output_file, 'wb' ))
for line in reader:
label = line.pop( 0 )
if line[-1].strip() == '':
line.pop( -1 )
# print line
line = map( lambda x: tuple( x.split( ":" )), line )
#print line
# ('1', '0.194035105364'), ('2', '0.186042408882'), ('3', '-0.148706067206'), ...
new_line = [ label ] + [ 0 ] * d
for i, v in line:
i = int( i )
if i <= d:
new_line[i] = v
writer.writerow( new_line )
================================================
FILE: libsvm2vw.py
================================================
"convert a libsvm file to VW format"
"skip malformed lines"
"in case of binary classification with 0/1 labels set the third argument to True"
"this will convert labels to -1/1"
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
try:
	# any non-empty string would be truthy with bool(), so parse explicitly
	convert_zero_to_negative_one = sys.argv[3].lower() in ( '1', 'true' )
except IndexError:
convert_zero_to_negative_one = False
i = open( input_file )
o = open( output_file, 'wb' )
for line in i:
try:
y, x = line.split( " ", 1 )
# ValueError: need more than 1 value to unpack
except ValueError:
print "line with ValueError (skipping):"
print line
continue
if convert_zero_to_negative_one and y == '0':
y = '-1'
new_line = y + " |n " + x
o.write( new_line )
================================================
FILE: pivotedcsv2libsvm.py
================================================
#!/usr/bin/env python
"""
Convert pivoted CSV file to libsvm format. Works only with numeric variables.
Expecting no labels and no headers. If present, headers can be skipped with argv[3] == 1.
Example call: python pivotedcsv2libsvm.py pivoted.csv output.txt
format: row_id, zero_based_feature_index [, value = 1]
input example:
id1,1
id1,2
id1,3
id2,2,0.5
id2,3,0.6
id2,4,0.7
output example:
1 2:1 3:1 4:1
1 3:0.5 4:0.6 5:0.7
"""
import sys
import csv
def construct_line( label, line ):
new_line = []
if float( label ) == 0.0:
label = "0"
new_line.append( label )
for i_item in line:
i, item = i_item
if item == '' or float( item ) == 0.0:
continue
new_item = "%s:%s" % ( i + 1, item )
new_line.append( new_item )
new_line = " ".join( new_line )
new_line += "\n"
return new_line
# ---
input_file = sys.argv[1]
output_file = sys.argv[2]
try:
	skip_headers = int( sys.argv[3] )
except IndexError:
skip_headers = 0
i = open( input_file, 'rb' )
o = open( output_file, 'wb' )
reader = csv.reader( i )
if skip_headers:
headers = reader.next()
line = reader.next()
current_row = line[0].strip()
current_feature = int( line[1].strip())
current_value = line[2].strip() if len( line ) > 2 else '1'
current_line = [ ( current_feature, current_value ) ]
for line in reader:
row_id = line[0]
feature_index = int( line[1].strip())
feature_value = line[2].strip() if len( line ) > 2 else '1'
if row_id != current_row:
new_line = construct_line( '1', current_line )
o.write( new_line )
current_row = row_id
current_line = [( feature_index, feature_value )]
else:
current_line.append(( feature_index, feature_value ))
# the last row
new_line = construct_line( '1', current_line )
o.write( new_line )
================================================
FILE: sample.py
================================================
'sample lines from input file with probability P, save them to output file'
import csv
import sys
import random
try:
P = float( sys.argv[3] )
except IndexError:
P = 0.5
print "P = %s" % ( P )
input_file = sys.argv[1]
output_file = sys.argv[2]
i = open( input_file )
o = open( output_file, 'w' )
reader = csv.reader( i )
writer = csv.writer( o )
# the first line is always treated as a header and copied to the output
headers = reader.next()
writer.writerow( headers )
for line in reader:
r = random.random()
if r > P:
continue
writer.writerow( line )
================================================
FILE: shuffle.py
================================================
"""
Shuffle lines in a [big] file
shuffle.py <input_file> <output_file> [<preserve headers?>] [<max. lines in memory>] [<random seed>]
"""
import sys
import random
input_file = sys.argv[1]
output_file = sys.argv[2]
try:
preserve_headers = int( sys.argv[3] )
except IndexError:
preserve_headers = 0
try:
lines_in_memory = int( sys.argv[4] )
except IndexError:
lines_in_memory = 25000
print "caching %s lines at a time..." % ( lines_in_memory )
try:
random_seed = sys.argv[5]
random.seed( random_seed )
print "random seed: %s" % ( random_seed )
except IndexError:
pass
# first count
print "counting lines..."
i_f = open( input_file )
o_f = open( output_file, 'wb' )
if preserve_headers:
headers = i_f.readline()
o_f.write( headers )
counter = 0
for line in i_f:
counter += 1
if counter % 100000 == 0:
print counter
print counter
print "shuffling..."
order = range( counter )
random.shuffle( order )
epoch = 0
while order:
current_lines = {}
current_lines_count = 0
current_chunk = order[:lines_in_memory]
current_chunk_dict = { x: 1 for x in current_chunk } # faster "in"
current_chunk_length = len( current_chunk )
order = order[lines_in_memory:]
i_f.seek( 0 )
if preserve_headers:
i_f.readline()
count = 0
for line in i_f:
if count in current_chunk_dict:
current_lines[count] = line
current_lines_count += 1
if current_lines_count == current_chunk_length:
break
count += 1
if count % 100000 == 0:
print count
print "writing..."
for l in current_chunk:
o_f.write( current_lines[l] )
lines_saved = current_chunk_length + epoch * lines_in_memory
epoch += 1
print "pass %s complete (%s lines saved)" % ( epoch, lines_saved )
================================================
FILE: split.py
================================================
'''
split a file into two randomly, line by line.
Usage: split.py <input file> <output file 1> <output file 2> [<probability of writing to the first file>]
'''
import csv
import sys
import random
input_file = sys.argv[1]
output_file1 = sys.argv[2]
output_file2 = sys.argv[3]
try:
P = float( sys.argv[4] )
except IndexError:
P = 0.9
try:
seed = sys.argv[5]
except IndexError:
seed = None
try:
	skip_headers = int( sys.argv[6] )
except IndexError:
	skip_headers = 0
print "P = %s" % ( P )
if seed:
random.seed( seed )
i = open( input_file )
o1 = open( output_file1, 'wb' )
o2 = open( output_file2, 'wb' )
if skip_headers:
i.readline()
counter = 0
for line in i:
r = random.random()
if r > P:
o2.write( line )
else:
o1.write( line )
counter += 1
if counter % 100000 == 0:
print counter
================================================
FILE: standardize.py
================================================
'standardize (shift and scale to zero mean and unit standard deviation) data from csv file'
'meant to be used together with colstats.py'
'standardize.py <stats file> <input file> <output file> [<label index>]'
import sys, csv
import numpy as np
from f_is_headers import *
stats_file = sys.argv[1]
input_file = sys.argv[2]
output_file = sys.argv[3]
try:
label_index = int( sys.argv[4] )
except IndexError:
label_index = False
i = open( input_file )
stats_reader = csv.reader( open( stats_file ))
reader = csv.reader( i )
writer = csv.writer( open( output_file, 'wb' ))
# get stats
means = stats_reader.next()
means = np.array( map( float, means ))
standard_deviations = stats_reader.next()
standard_deviations = np.array( map( float, standard_deviations ))
# check headers
first_line = reader.next()
if is_headers( first_line ):
headers = first_line
else:
headers = False
i.seek( 0 )
# go
for line in reader:
	if label_index is not False:
l = line.pop( label_index )
print l
x = np.array( map( float, line ))
# shift and scale
x = x - means
x = x / standard_deviations
	if label_index is not False:
# -1.0,...
#x = np.insert( x, 0, l )
line = list( x )
line.insert( 0, l )
writer.writerow( line )
================================================
FILE: subset.py
================================================
'Save a subset of lines from an input file; start at offset and count n lines'
'default 100 lines starting from 0'
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
try:
offset = int( sys.argv[3] )
except IndexError:
offset = 0
try:
lines = int( sys.argv[4] )
except IndexError:
lines = 100
i = open( input_file )
o = open( output_file, 'wb' )
count = 0
for line in i:
if offset > 0:
offset -= 1
continue
o.write( line )
count += 1
if count >= lines:
break
================================================
FILE: tsv2csv.py
================================================
import csv
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
i = open( input_file )
o = open( output_file, 'wb' )
reader = csv.reader( i, delimiter = '\t' )
writer = csv.writer( o )
for line in reader:
writer.writerow( line )
================================================
FILE: unshuffle.py
================================================
"""
Unshuffle previously shuffled file
unshuffle.py input_file.csv output_file.csv <max. lines in memory> <random seed>
"""
import sys
import random
input_file = sys.argv[1]
output_file = sys.argv[2]
try:
lines_in_memory = int( sys.argv[3] )
except IndexError:
lines_in_memory = 100000
print "caching %s lines at a time..." % ( lines_in_memory )
try:
random_seed = sys.argv[4]
random.seed( random_seed )
print "random seed: %s" % ( random_seed )
except IndexError:
print "need a seed..."
sys.exit( 1 )
# first count
print "counting lines..."
f = open( input_file )
count = 0
for line in f:
count += 1
if count % 100000 == 0:
print count
print count
# then shuffle
print "(un)shuffling..."
o_f = open( output_file, 'wb' )
order = range( count )
random.shuffle( order )
# un-shuffle
order_dict = { shuf_i: orig_i for shuf_i, orig_i in enumerate( order ) }
# sort by original key asc, will get shuffled keys in the right order to unshuffle
order = sorted( order_dict, key = order_dict.get )
epoch = 0
while order:
current_lines = {}
current_lines_count = 0
current_chunk = order[:lines_in_memory]
current_chunk_dict = { x: 1 for x in current_chunk } # faster "in"
current_chunk_length = len( current_chunk )
order = order[lines_in_memory:]
f.seek( 0 )
count = 0
for line in f:
if count in current_chunk_dict:
current_lines[count] = line
current_lines_count += 1
if current_lines_count == current_chunk_length:
break
count += 1
if count % 100000 == 0:
print count
print "writing..."
for l in current_chunk:
o_f.write( current_lines[l] )
lines_saved = current_chunk_length + epoch * lines_in_memory
epoch += 1
print "pass %s complete (%s lines saved)" % ( epoch, lines_saved )