Repository: qiyuangong/Mondrian
Branch: main
Commit: 49d49cfb3348
Files: 12
Total size: 37.9 KB
Directory structure:
gitextract_5yp6udp6/
├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── anonymizer.py
├── mondrian.py
├── mondrian_test.py
└── utils/
    ├── __init__.py
    ├── read_adult_data.py
    ├── read_file.py
    ├── read_informs_data.py
    └── utility.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
*.py[cod]
.DS_Store
output/*
data/*
*.bak
.vs
# C extensions
*.so
ftp
# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
__pycache__
# Unit test / coverage reports
.coverage
.tox
nosetests.xml
# Translations
*.mo
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
*.sublime-*
*.csv
================================================
FILE: .travis.yml
================================================
language: python
python:
- "2.7"
script: python -m unittest discover . "*_test.py"
branches:
only:
- master
================================================
FILE: LICENSE
================================================
The MIT License (MIT)
Copyright (c) [2014] [Mondrian]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
Mondrian [build status](https://travis-ci.org/qiyuangong/Mondrian)
===========================
Mondrian is a top-down greedy data anonymization algorithm for relational datasets, proposed by Kristen LeFevre et al. in [1]. To our knowledge, Mondrian is the fastest local recoding algorithm that still preserves good data utility. Although LeFevre gave the pseudocode in her papers, the original source code is not available. You can find a third-party Java implementation in the Anonymization Toolbox [2].
This repository is an **open source Python implementation of Mondrian**.
### Motivation
Research on data privacy has been going on for more than ten years, and lots of great papers have been published. However, only a few open source projects are available on the Internet [2-3], and most of them use algorithms proposed before 2004! Even fewer projects have been used in real life. Worse still, most people have never even heard of them. Such a tragedy!
I decided to make some effort, hoping these open source repositories can help researchers and developers working on data privacy (privacy preserving data publishing, data anonymization).
### Attention
This is the earliest Mondrian, proposed in [1], which imposes an intuitive ordering on each attribute. So there are no generalization hierarchies for categorical attributes. This brings lower information loss, but worse semantic results. **If you want the Mondrian based on generalization hierarchies, please turn to [Basic_Mondrian](https://github.com/qiyuangong/Basic_Mondrian).**
I used **both the adult and INFORMS** datasets in this implementation. For clarity, **we transform NCP (Normalized Certainty Penalty) to a percentage**. This NCP percentage is computed by dividing the NCP value by the number of QI values in the dataset, that is, the number of records times the number of QI attributes (also called GCP (Global Certainty Penalty) [4]). The NCP percentage ranges from 0 to 1 (printed by the program as 0% to 100%), where 0 means no information loss and 1 means all information is lost. This is more meaningful than raw NCP, which is sensitive to the size of the dataset.
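To make the NCP percentage concrete, here is a minimal sketch for purely numeric attributes. The function name `ncp_percentage` and the toy data are assumptions made for illustration and are not part of this repository; the sketch only mirrors the idea described above (sum of normalized widths per group, weighted by group size, divided by the number of QI values).

```python
def ncp_percentage(partitions, qi_ranges):
    """Return the NCP percentage of a list of groups.

    partitions: list of groups; each group is a list of numeric QI tuples.
    qi_ranges:  global (max - min) of every QI attribute, assumed non-zero.
    """
    total_records = sum(len(group) for group in partitions)
    qi_len = len(qi_ranges)
    penalty = 0.0
    for group in partitions:
        group_penalty = 0.0
        for dim in range(qi_len):
            values = [record[dim] for record in group]
            width = max(values) - min(values)
            # Normalized width: 0 = not generalized, 1 = whole domain.
            group_penalty += width / float(qi_ranges[dim])
        # Every record in the group pays the same penalty.
        penalty += group_penalty * len(group)
    return 100.0 * penalty / (qi_len * total_records)


# Toy data: two QI attributes, two groups of two records each.
groups = [
    [(20, 10), (30, 10)],   # first attribute generalized to 20~30
    [(40, 20), (40, 30)],   # second attribute generalized to 20~30
]
ranges = [20, 20]           # each attribute spans 20 units over the dataset
print(ncp_percentage(groups, ranges))  # 25.0
```

Each group loses half of one attribute's domain and nothing on the other, so the average loss per QI value is a quarter, i.e. 25%.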
One more thing!!! Mondrian has strict and relaxed models. (Most online implementations use the strict model.) Both models split a partition with a binary split (let lhs and rhs denote the left part and the right part). In strict Mondrian, lhs has no intersection with rhs. But in relaxed Mondrian, the points in the middle are divided between lhs and rhs to ensure `|lhs| = |rhs|` (+1 where `|partition|` is odd). So in the relaxed model, the generalized results of lhs and rhs may intersect.
The final NCP of Mondrian on the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult) is about 24.91% (relaxed) and 12.19% (strict), and 12.26% (relaxed) and 10.21% (strict) on the [INFORMS data](https://sites.google.com/site/informsdataminingcontest/) (with K=10).
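The difference between the strict and relaxed splits can be illustrated on a single attribute. The helpers `strict_split` and `relaxed_split` below are hypothetical names used only for this sketch; they work on plain numbers rather than on the `Partition` objects used in `mondrian.py`.

```python
def strict_split(values, median):
    """Strict model: lhs = values <= median, rhs = values > median."""
    lhs = [v for v in values if v <= median]
    rhs = [v for v in values if v > median]
    return lhs, rhs


def relaxed_split(values, median):
    """Relaxed model: values equal to the median are shared out so that
    lhs holds half of the records and rhs holds the rest."""
    lhs = [v for v in values if v < median]
    rhs = [v for v in values if v > median]
    middle = [v for v in values if v == median]
    half = len(values) // 2
    # Move median-valued records to lhs until it holds half of the records.
    while len(lhs) < half and middle:
        lhs.append(middle.pop())
    # Whatever is left in the middle goes to rhs.
    return lhs, middle + rhs


values = [1, 2, 4, 4, 4, 4, 5]
print(strict_split(values, 4))   # ([1, 2, 4, 4, 4, 4], [5])
print(relaxed_split(values, 4))  # ([1, 2, 4], [4, 4, 4, 5])
```

With k=2 the strict split is rejected because rhs would hold a single record, while the relaxed split succeeds at the cost of overlapping ranges (1~4 and 4~5).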
### Basic idea of Mondrian
#### First, what is k-anonymity?
Assume your record is in this format: [QID, SA]. QID means quasi-identifier, such as age and birthday; SA means sensitive attribute, such as disease information. The basic idea of k-anonymity is `safety in group` (or safety in numbers [5]), which means that you are safe if you are in a group of people whose QIDs are the same, because nobody can tell which sensitive value (SA) in this group is yours using the QID alone, as shown in Fig. 1 (k=3 in 1(b) and 1(c)). If each of these groups has at least k people, then the dataset satisfies k-anonymity.
<p align="center">
<img src=https://cloud.githubusercontent.com/assets/3848789/25949050/c6a7e8ec-3688-11e7-933d-d5a991e6ef30.png width=750>
</p>
<p align="center">
Figure 1. Anonymity, Privacy and Generalization
</p>
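The definition above can be checked mechanically. The following is a minimal sketch, assuming records are lists in the `[QID..., SA]` layout; `is_k_anonymous` and the sample rows are hypothetical and not part of this repository.

```python
from collections import Counter


def is_k_anonymous(records, qi_len, k):
    """Check whether every QID combination occurs at least k times.

    records: list of rows in [QID..., SA] layout.
    qi_len:  number of leading columns that form the QID.
    """
    group_sizes = Counter(tuple(row[:qi_len]) for row in records)
    return all(size >= k for size in group_sizes.values())


raw = [
    [25, "M", "flu"],
    [27, "M", "cold"],
    [29, "M", "flu"],
]
generalized = [
    ["25~29", "M", "flu"],
    ["25~29", "M", "cold"],
    ["25~29", "M", "flu"],
]
print(is_k_anonymous(raw, 2, 3))          # False
print(is_k_anonymous(generalized, 2, 3))  # True
```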
**But in practice, the raw datasets usually don't satisfy k-anonymity, as shown in Fig. 1(a).** So we need an anonymization algorithm to transform the raw datasets into anonymized datasets. Mondrian is one of them, and it is based on generalization. I don't want to talk too much about generalization. In a word, generalization is a kind of transformation that finds a result QID* covering all QIDs in a group (QID1~QID3 in Fig. 1(b)). It also brings information loss (distortion).
#### How does Mondrian anonymize a dataset?
Here is the basic workflow of Mondrian:
1. Partition the raw dataset into k-groups using a kd-tree. A k-group is a group that contains at least k records.
2. Generalize each k-group (Fig. 1(b)), such that all records in a group have the same QID*.
Why use a kd-tree? Because it is fast, straightforward and sufficient.
<p align="center">
<img src=https://cloud.githubusercontent.com/assets/3848789/25949051/c6a87622-3688-11e7-8bd0-726f07245570.png width=750>
</p>
<p align="center">
Figure 2. Basic workflow of Mondrian
</p>
<p align="center">
<img src=https://cloud.githubusercontent.com/assets/3848789/25949052/c6ab3fce-3688-11e7-99ea-cde7bccd8684.png width=450>
</p>
<p align="center">
Figure 3. kd-tree
</p>
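The two-step workflow (partition, then generalize) can be sketched in a few lines. This is a simplified illustration of the strict model for numeric attributes only, not the code in `mondrian.py`: the names `partition` and `generalize` are assumptions, and the median is picked from the sorted values rather than from a frequency set.

```python
def partition(records, qi_len, k, ranges):
    """Recursively split records (kd-tree style) into groups of size >= k."""
    def norm_width(dim):
        vals = [r[dim] for r in records]
        return (max(vals) - min(vals)) / float(ranges[dim]) if ranges[dim] else 0.0

    # Try the attribute with the widest normalized range first.
    for dim in sorted(range(qi_len), key=norm_width, reverse=True):
        values = sorted(r[dim] for r in records)
        median = values[(len(values) - 1) // 2]
        lhs = [r for r in records if r[dim] <= median]
        rhs = [r for r in records if r[dim] > median]
        if len(lhs) >= k and len(rhs) >= k:
            return (partition(lhs, qi_len, k, ranges) +
                    partition(rhs, qi_len, k, ranges))
    return [records]  # no allowable cut: this is a final k-group


def generalize(group, qi_len):
    """Replace each QI value with the group's 'low~high' range (QID*)."""
    bounds = []
    for dim in range(qi_len):
        vals = [r[dim] for r in group]
        low, high = min(vals), max(vals)
        bounds.append(str(low) if low == high else "%s~%s" % (low, high))
    return [bounds + list(r[qi_len:]) for r in group]


data = [[6, 1, 'haha'], [6, 1, 'test'], [8, 2, 'haha'], [8, 2, 'test'],
        [4, 1, 'hha'], [4, 2, 'hha'], [4, 3, 'hha'], [4, 4, 'hha']]
qi_len, k = 2, 2
ranges = [max(r[d] for r in data) - min(r[d] for r in data)
          for d in range(qi_len)]
groups = partition(data, qi_len, k, ranges)
anonymized = [row for g in groups for row in generalize(g, qi_len)]
for row in anonymized:
    print(row)
```

On this toy input the sketch produces four groups of two records each, and only the groups with first attribute 4 need a generalized second attribute (`1~2` and `3~4`).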
### Usage and Parameters:
The implementation is based on Python 3 and is compatible with Python 2.7. You can run Mondrian in the following steps:
1) Download (or clone) the whole project.
2) Run `anonymizer.py` in the root directory from the command line.
3) Get the anonymized dataset from `data/anonymized.data`, if you didn't add `[k | qi | data]`.
Parameters:

    # Usage: python anonymizer.py [r|s] [a | i] [k | qi | data]
    # r: relaxed mondrian, s: strict mondrian
    # a: adult dataset, i: INFORMS dataset
    # k: varying k, qi: varying qi numbers, data: varying size of dataset
    # run Mondrian with adult data and default K (K=10)
    python anonymizer.py
    # run Strict Mondrian with adult data K=20
    python anonymizer.py s a 20
    # run Relaxed Mondrian with INFORMS data K=11
    python anonymizer.py r i 11
    # Evaluating Strict Mondrian with varying k on adult data
    python anonymizer.py s a k

### For more information:
[1] K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE '06: Proceedings of the 22nd International Conference on Data Engineering, IEEE Computer Society, 2006, 25
[2] [UTD Anonymization Toolbox](http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php?go=home)
[3] [ARX- Powerful Data Anonymization](https://github.com/arx-deidentifier/arx)
[4] G. Ghinita, P. Karras, P. Kalnis, N. Mamoulis. Fast data anonymization with low information loss. Proceedings of the 33rd international conference on Very large data bases, VLDB Endowment, 2007, 758-769
[5] Y. He, J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. Proceedings of VLDB, 2009, 2, 934-945
### Support
- You can post bug reports and feature requests at the [Issue Page](https://github.com/qiyuangong/Mondrian/issues).
- Contributions via [Pull request](https://github.com/qiyuangong/Mondrian/pulls) are welcome.
- Also, you can contact me via [email](mailto:qiyuangong@gmail.com).
==========================
by [Qiyuan Gong](mailto:qiyuangong@gmail.com)
2017-5-23
### Contributor List 🏆
* [Qiyuan Gong](mailto:qiyuangong@gmail.com)
* [Liu Kun](https://github.com/build2last)
================================================
FILE: anonymizer.py
================================================
"""
run mondrian with given parameters
"""
# !/usr/bin/env python
# coding=utf-8
from mondrian import mondrian
from utils.read_adult_data import read_data as read_adult
from utils.read_informs_data import read_data as read_informs
import sys, copy, random
DATA_SELECT = 'a'
RELAX = False
INTUITIVE_ORDER = None
def write_to_file(result):
"""
write the anonymized result to anonymized.data
"""
with open("data/anonymized.data", "w") as output:
for r in result:
output.write(';'.join(r) + '\n')
def get_result_one(data, k=10):
"""
run mondrian for one time, with k=10
"""
print("K=%d" % k)
data_back = copy.deepcopy(data)
result, eval_result = mondrian(data, k, RELAX)
# Convert numerical values back to categorical values if necessary
if DATA_SELECT == 'a':
result = covert_to_raw(result)
else:
for r in result:
r[-1] = ','.join(r[-1])
    # write to data/anonymized.data
write_to_file(result)
data = copy.deepcopy(data_back)
print("NCP %0.2f" % eval_result[0] + "%")
print("Running time %0.2f" % eval_result[1] + " seconds")
def get_result_k(data):
"""
    change k, while fixing QI and size of data set
"""
data_back = copy.deepcopy(data)
for k in range(5, 105, 5):
print('#' * 30)
print("K=%d" % k)
result, eval_result = mondrian(data, k, RELAX)
if DATA_SELECT == 'a':
result = covert_to_raw(result)
data = copy.deepcopy(data_back)
print("NCP %0.2f" % eval_result[0] + "%")
print("Running time %0.2f" % eval_result[1] + " seconds")
def get_result_dataset(data, k=10, num_test=10):
"""
fix k and QI, while changing size of data set
num_test is the test number.
"""
data_back = copy.deepcopy(data)
length = len(data_back)
joint = 5000
datasets = []
    # integer division, so that range(check_time) also works on Python 3
    check_time = length // joint
if length % joint == 0:
check_time -= 1
for i in range(check_time):
datasets.append(joint * (i + 1))
datasets.append(length)
ncp = 0
rtime = 0
for pos in datasets:
print('#' * 30)
print("size of dataset %d" % pos)
for j in range(num_test):
temp = random.sample(data, pos)
result, eval_result = mondrian(temp, k, RELAX)
if DATA_SELECT == 'a':
result = covert_to_raw(result)
ncp += eval_result[0]
rtime += eval_result[1]
data = copy.deepcopy(data_back)
ncp /= num_test
rtime /= num_test
print("Average NCP %0.2f" % ncp + "%")
print("Running time %0.2f" % rtime + " seconds")
print('#' * 30)
def get_result_qi(data, k=10):
"""
change number of QI, while fixing k and size of data set
"""
data_back = copy.deepcopy(data)
num_data = len(data[0])
for i in reversed(list(range(1, num_data))):
print('#' * 30)
print("Number of QI=%d" % i)
result, eval_result = mondrian(data, k, RELAX, i)
if DATA_SELECT == 'a':
result = covert_to_raw(result)
data = copy.deepcopy(data_back)
print("NCP %0.2f" % eval_result[0] + "%")
print("Running time %0.2f" % eval_result[1] + " seconds")
def covert_to_raw(result, connect_str='~'):
"""
    During preprocessing, categorical attributes are converted to
    numeric attributes using intuitive order. This function converts
    these values back to their raw values. For example, Female and Male
    may be converted to 0 and 1 during anonymization. Then we need to transform
    them back to original values after anonymization.
"""
covert_result = []
qi_len = len(INTUITIVE_ORDER)
for record in result:
covert_record = []
for i in range(qi_len):
if len(INTUITIVE_ORDER[i]) > 0:
vtemp = ''
if connect_str in record[i]:
temp = record[i].split(connect_str)
raw_list = []
for j in range(int(temp[0]), int(temp[1]) + 1):
raw_list.append(INTUITIVE_ORDER[i][j])
vtemp = connect_str.join(raw_list)
else:
vtemp = INTUITIVE_ORDER[i][int(record[i])]
covert_record.append(vtemp)
else:
covert_record.append(record[i])
if isinstance(record[-1], str):
covert_result.append(covert_record + [record[-1]])
else:
covert_result.append(covert_record + [connect_str.join(record[-1])])
return covert_result
if __name__ == '__main__':
FLAG = ''
LEN_ARGV = len(sys.argv)
try:
MODEL = sys.argv[1]
DATA_SELECT = sys.argv[2]
except IndexError:
MODEL = 's'
DATA_SELECT = 'a'
INPUT_K = 10
# read record
if MODEL == 's':
RELAX = False
else:
RELAX = True
if RELAX:
print("Relax Mondrian")
else:
print("Strict Mondrian")
if DATA_SELECT == 'i':
print("INFORMS data")
DATA = read_informs()
else:
print("Adult data")
# INTUITIVE_ORDER is an intuitive order for
# categorical attributes. This order is produced
# by the reading (from data set) order.
DATA, INTUITIVE_ORDER = read_adult()
print(INTUITIVE_ORDER)
if LEN_ARGV > 3:
FLAG = sys.argv[3]
if FLAG == 'k':
get_result_k(DATA)
elif FLAG == 'qi':
get_result_qi(DATA)
elif FLAG == 'data':
get_result_dataset(DATA)
elif FLAG == '':
get_result_one(DATA)
else:
try:
INPUT_K = int(FLAG)
get_result_one(DATA, INPUT_K)
except ValueError:
            print("Usage: python anonymizer.py [r|s] [a | i] [k | qi | data]")
            print("r: relax mondrian, s: strict mondrian")
            print("a: adult dataset, i: INFORMS dataset")
            print("k: varying k")
            print("qi: varying qi numbers")
            print("data: varying size of dataset")
            print("example: python anonymizer.py s a 10")
            print("example: python anonymizer.py s a k")
# anonymized dataset is stored in result
print("Finish Mondrian!!")
================================================
FILE: mondrian.py
================================================
# coding:utf-8
"""
main module of mondrian
"""
# Implemented by Qiyuan Gong
# qiyuangong@gmail.com
# 2014-09-11
# @InProceedings{LeFevre2006,
# Title = {Mondrian Multidimensional K-Anonymity},
# Author = {LeFevre, Kristen and DeWitt, David J. and Ramakrishnan, Raghu},
# Booktitle = {ICDE '06: Proceedings of the 22nd International Conference on Data Engineering},
# Year = {2006},
# Address = {Washington, DC, USA},
# Pages = {25},
# Publisher = {IEEE Computer Society},
# Doi = {http://dx.doi.org/10.1109/ICDE.2006.101},
# ISBN = {0-7695-2570-9},
# }
# !/usr/bin/env python
# coding=utf-8
import pdb
import time
from utils.utility import cmp_value, value, merge_qi_value
from functools import cmp_to_key
# warning all these variables should be re-inited, if
# you want to run mondrian with different parameters
__DEBUG = False
QI_LEN = 10
GL_K = 0
RESULT = []
QI_RANGE = []
QI_DICT = []
QI_ORDER = []
class Partition(object):
"""
Class for Group (or EC), which is used to keep records
self.member: records in group
self.low: lower point, use index to avoid negative values
self.high: higher point, use index to avoid negative values
self.allow: show if partition can be split on this QI
"""
def __init__(self, data, low, high):
"""
split_tuple = (index, low, high)
"""
self.low = list(low)
self.high = list(high)
self.member = data[:]
self.allow = [1] * QI_LEN
def add_record(self, record, dim):
"""
add one record to member
"""
self.member.append(record)
def add_multiple_record(self, records, dim):
"""
add multiple records (list) to partition
"""
for record in records:
self.add_record(record, dim)
def __len__(self):
"""
return number of records
"""
return len(self.member)
def get_normalized_width(partition, index):
"""
return Normalized width of partition
similar to NCP
"""
d_order = QI_ORDER[index]
width = value(d_order[partition.high[index]]) - value(d_order[partition.low[index]])
if width == QI_RANGE[index]:
return 1
return width * 1.0 / QI_RANGE[index]
def choose_dimension(partition):
"""
choose dim with largest norm_width from all attributes.
This function can be upgraded with other distance function.
"""
max_width = -1
max_dim = -1
for dim in range(QI_LEN):
if partition.allow[dim] == 0:
continue
norm_width = get_normalized_width(partition, dim)
if norm_width > max_width:
max_width = norm_width
max_dim = dim
if max_width > 1:
pdb.set_trace()
return max_dim
def frequency_set(partition, dim):
"""
get the frequency_set of partition on dim
"""
frequency = {}
for record in partition.member:
try:
frequency[record[dim]] += 1
except KeyError:
frequency[record[dim]] = 1
return frequency
def find_median(partition, dim):
"""
find the middle of the partition, return split_val
"""
# use frequency set to get median
frequency = frequency_set(partition, dim)
split_val = ''
next_val = ''
value_list = list(frequency.keys())
value_list.sort(key=cmp_to_key(cmp_value))
total = sum(frequency.values())
middle = total // 2
if middle < GL_K or len(value_list) <= 1:
try:
return '', '', value_list[0], value_list[-1]
except IndexError:
return '', '', '', ''
index = 0
split_index = 0
for i, qi_value in enumerate(value_list):
index += frequency[qi_value]
if index >= middle:
split_val = qi_value
split_index = i
break
else:
print("Error: cannot find split_val")
try:
next_val = value_list[split_index + 1]
except IndexError:
        # the split value is the last (highly frequent) value in the
        # partition, which can be handled by mid_set
        # e.g.[1, 2, 3, 4, 4, 4, 4]
next_val = split_val
return (split_val, next_val, value_list[0], value_list[-1])
def anonymize_strict(partition):
"""
recursively partition groups until not allowable
"""
allow_count = sum(partition.allow)
# only run allow_count times
if allow_count == 0:
RESULT.append(partition)
return
for index in range(allow_count):
        # choose attribute from domain
dim = choose_dimension(partition)
if dim == -1:
print("Error: dim=-1")
pdb.set_trace()
(split_val, next_val, low, high) = find_median(partition, dim)
# Update parent low and high
        if low != '':
partition.low[dim] = QI_DICT[dim][low]
partition.high[dim] = QI_DICT[dim][high]
if split_val == '' or split_val == next_val:
# cannot split
partition.allow[dim] = 0
continue
# split the group from median
mean = QI_DICT[dim][split_val]
lhs_high = partition.high[:]
rhs_low = partition.low[:]
lhs_high[dim] = mean
rhs_low[dim] = QI_DICT[dim][next_val]
lhs = Partition([], partition.low, lhs_high)
rhs = Partition([], rhs_low, partition.high)
for record in partition.member:
pos = QI_DICT[dim][record[dim]]
if pos <= mean:
# lhs = [low, mean]
lhs.add_record(record, dim)
else:
# rhs = (mean, high]
rhs.add_record(record, dim)
        # check if lhs and rhs satisfy k-anonymity
if len(lhs) < GL_K or len(rhs) < GL_K:
partition.allow[dim] = 0
continue
# anonymize sub-partition
anonymize_strict(lhs)
anonymize_strict(rhs)
return
RESULT.append(partition)
def anonymize_relaxed(partition):
"""
recursively partition groups until not allowable
"""
if sum(partition.allow) == 0:
# can not split
RESULT.append(partition)
return
# choose attribute from domain
dim = choose_dimension(partition)
if dim == -1:
print("Error: dim=-1")
pdb.set_trace()
# use frequency set to get median
(split_val, next_val, low, high) = find_median(partition, dim)
# Update parent low and high
    if low != '':
partition.low[dim] = QI_DICT[dim][low]
partition.high[dim] = QI_DICT[dim][high]
if split_val == '':
# cannot split
partition.allow[dim] = 0
anonymize_relaxed(partition)
return
# split the group from median
mean = QI_DICT[dim][split_val]
lhs_high = partition.high[:]
rhs_low = partition.low[:]
lhs_high[dim] = mean
rhs_low[dim] = QI_DICT[dim][next_val]
lhs = Partition([], partition.low, lhs_high)
rhs = Partition([], rhs_low, partition.high)
mid_set = []
for record in partition.member:
pos = QI_DICT[dim][record[dim]]
if pos < mean:
# lhs = [low, mean)
lhs.add_record(record, dim)
elif pos > mean:
# rhs = (mean, high]
rhs.add_record(record, dim)
else:
# mid_set keep the means
mid_set.append(record)
# handle records in the middle
# these records will be divided evenly
# between lhs and rhs, such that
# |lhs| = |rhs| (+1 if total size is odd)
half_size = len(partition) // 2
for i in range(half_size - len(lhs)):
record = mid_set.pop()
lhs.add_record(record, dim)
if len(mid_set) > 0:
rhs.low[dim] = mean
rhs.add_multiple_record(mid_set, dim)
# It's not necessary now.
# if len(lhs) < GL_K or len(rhs) < GL_K:
# print "Error: split failure"
# anonymize sub-partition
anonymize_relaxed(lhs)
anonymize_relaxed(rhs)
def init(data, k, QI_num=-1):
"""
reset global variables
"""
global GL_K, RESULT, QI_LEN, QI_DICT, QI_RANGE, QI_ORDER
if QI_num <= 0:
QI_LEN = len(data[0]) - 1
else:
QI_LEN = QI_num
GL_K = k
RESULT = []
# static values
QI_DICT = []
QI_ORDER = []
QI_RANGE = []
att_values = []
for i in range(QI_LEN):
att_values.append(set())
QI_DICT.append(dict())
for record in data:
for i in range(QI_LEN):
att_values[i].add(record[i])
for i in range(QI_LEN):
value_list = list(att_values[i])
value_list.sort(key=cmp_to_key(cmp_value))
QI_RANGE.append(value(value_list[-1]) - value(value_list[0]))
QI_ORDER.append(list(value_list))
for index, qi_value in enumerate(value_list):
QI_DICT[i][qi_value] = index
def mondrian(data, k, relax=False, QI_num=-1):
"""
Main function of mondrian, return result in tuple (result, (ncp, rtime)).
data: dataset in 2-dimensional array.
k: k parameter for k-anonymity
    QI_num: Default -1, which excludes only the last column. Otherwise, columns
    [0, 1,..., QI_num - 1] will be anonymized and [QI_num,...] will be excluded.
    relax: determines whether strict or relaxed mondrian is used.
    Both mondrians split a partition with a binary split.
    In strict mondrian, lhs and rhs have no intersection.
    But in relaxed mondrian, lhs may intersect with rhs.
"""
init(data, k, QI_num)
result = []
data_size = len(data)
low = [0] * QI_LEN
high = [(len(t) - 1) for t in QI_ORDER]
whole_partition = Partition(data, low, high)
# begin mondrian
start_time = time.time()
if relax:
# relax model
anonymize_relaxed(whole_partition)
else:
# strict model
anonymize_strict(whole_partition)
rtime = float(time.time() - start_time)
# generalization result and
# evaluation information loss
ncp = 0.0
dp = 0.0
for partition in RESULT:
rncp = 0.0
for index in range(QI_LEN):
rncp += get_normalized_width(partition, index)
rncp *= len(partition)
ncp += rncp
dp += len(partition) ** 2
for record in partition.member[:]:
for index in range(QI_LEN):
record[index] = merge_qi_value(QI_ORDER[index][partition.low[index]],
QI_ORDER[index][partition.high[index]])
result.append(record)
# If you want to get NCP values instead of percentage
# please remove next three lines
ncp /= QI_LEN
ncp /= data_size
ncp *= 100
if __DEBUG:
from decimal import Decimal
print("Discernability Penalty=%.2E" % Decimal(str(dp)))
print("size of partitions=%d" % len(RESULT))
print("K=%d" % k)
print("NCP = %.2f %%" % ncp)
return (result, (ncp, rtime))
================================================
FILE: mondrian_test.py
================================================
# coding:utf-8
from datetime import datetime
import unittest
from mondrian import mondrian
from utils.read_file import read_csv
class functionTest(unittest.TestCase):
def test1_mondrian_strict(self):
data = [[6, 1, 'haha'],
[6, 1, 'test'],
[8, 2, 'haha'],
[8, 2, 'test'],
[4, 1, 'hha'],
[4, 2, 'hha'],
[4, 3, 'hha'],
[4, 4, 'hha']]
result, eval_r = mondrian(data, 2, False)
self.assertTrue(abs(eval_r[0] - 100.0 / 12) < 0.05)
def test1_mondrian_relax(self):
data = [[6, 1, 'haha'],
[6, 1, 'test'],
[8, 2, 'haha'],
[8, 2, 'test'],
[4, 1, 'hha'],
[4, 2, 'hha'],
[4, 3, 'hha'],
[4, 4, 'hha']]
result, eval_r = mondrian(data, 2, True)
self.assertTrue(abs(eval_r[0] - 100.0 / 12) < 0.05)
def test2_mondrian_strict(self):
data = [[6, 1, 'haha'],
[8, 1, 'haha'],
[8, 1, 'test'],
[8, 1, 'haha'],
[8, 1, 'test'],
[4, 1, 'hha'],
[4, 2, 'hha'],
[4, 3, 'hha'],
[4, 4, 'hha']]
result, eval_r = mondrian(data, 2, False)
self.assertTrue(abs(eval_r[0] - 2300.0 / 108) < 0.05)
def test2_mondrian_relax(self):
data = [[6, 1, 'haha'],
[8, 1, 'haha'],
[8, 1, 'test'],
[8, 1, 'haha'],
[8, 1, 'test'],
[4, 1, 'hha'],
[4, 2, 'hha'],
[4, 3, 'hha'],
[4, 4, 'hha']]
result, eval_r = mondrian(data, 2, True)
self.assertTrue(abs(eval_r[0] - 700.0 / 54) < 0.05)
def test_mondrian_datetime(self):
d1 = datetime.strptime("2007-03-04 21:08:12", "%Y-%m-%d %H:%M:%S")
d2 = datetime.strptime("2008-03-04 21:08:12", "%Y-%m-%d %H:%M:%S")
d3 = datetime.strptime("2009-03-04 21:08:12", "%Y-%m-%d %H:%M:%S")
d4 = datetime.strptime("2007-03-05 21:08:12", "%Y-%m-%d %H:%M:%S")
data = [[6, d1, 'haha'],
[8, d1, 'haha'],
[8, d1, 'test'],
[8, d1, 'haha'],
[8, d1, 'test'],
[4, d1, 'hha'],
[4, d2, 'hha'],
[4, d3, 'hha'],
[4, d4, 'hha']]
result, eval_r = mondrian(data, 2, False)
print(eval_r)
def test_read_csv_and_anonymise(self):
from utils.read_adult_data import read_data as read_adult
DATA, INTUITIVE_ORDER = read_adult()
result, eval_result = mondrian(DATA, 40, False)
print(result)
if __name__ == '__main__':
unittest.main()
================================================
FILE: utils/__init__.py
================================================
================================================
FILE: utils/read_adult_data.py
================================================
"""
read adult data set
"""
# !/usr/bin/env python
# coding=utf-8
# Read data functions for adult data
# attributes ['age', 'work_class', 'final_weight', 'education', 'education_num',
# 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain',
# 'capital_loss', 'hours_per_week', 'native_country', 'class']
# QID (see QI_INDEX) ['age', 'work_class', 'education_num', 'marital_status',
# 'occupation', 'race', 'sex', 'native_country']
# SA (see SA_INDEX) ['class']
ATT_NAME = ['age', 'work_class', 'final_weight', 'education',
'education_num', 'marital_status', 'occupation', 'relationship',
'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
'native_country', 'class']
QI_INDEX = [0, 1, 4, 5, 6, 8, 9, 13]
IS_CAT = [False, True, False, True, True, True, True, True]
SA_INDEX = -1
__DEBUG = False
def read_data():
"""
read microdata for *.txt and return read data
# Note that Mondrian can only handle numeric attribute
# So, categorical attributes should be transformed to numeric attributes
# before anonymization. For example, Male and Female should be transformed
# to 0, 1 during pre-processing. Then, after anonymization, 0 and 1 should
# be transformed to Male and Female.
"""
QI_num = len(QI_INDEX)
data = []
    # order categorical attributes in intuitive order
# here, we use the appear number
intuitive_dict = []
intuitive_order = []
intuitive_number = []
for i in range(QI_num):
intuitive_dict.append(dict())
intuitive_number.append(0)
intuitive_order.append(list())
    # the 'U' mode flag was removed in Python 3.11, plain 'r' works everywhere
    data_file = open('data/adult.data', 'r')
for line in data_file:
line = line.strip()
# remove empty and incomplete lines
# only 30162 records will be kept
if len(line) == 0 or '?' in line:
continue
        # remove all spaces
line = line.replace(' ', '')
temp = line.split(',')
ltemp = []
for i in range(QI_num):
index = QI_INDEX[i]
if IS_CAT[i]:
try:
ltemp.append(intuitive_dict[i][temp[index]])
except KeyError:
intuitive_dict[i][temp[index]] = intuitive_number[i]
ltemp.append(intuitive_number[i])
intuitive_number[i] += 1
intuitive_order[i].append(temp[index])
else:
ltemp.append(int(temp[index]))
ltemp.append(temp[SA_INDEX])
data.append(ltemp)
return data, intuitive_order
================================================
FILE: utils/read_file.py
================================================
# !/usr/bin/env python
'''
read csv data,
support numeric, category, time date
author : Liu Kun
date : 2018-10
'''
from datetime import datetime
__DEBUG = False
def read_csv(file_path,
QI_INDEX,
IS_CAT,
IS_DATETIME,
SA_INDEX,
header=False, delimiter=',', encoding="utf-8",
TIME_FORMAT_STR="%Y-%m-%d %H:%M:%S"
):
"""
read microdata for *.txt and return read data
# Note that Mondrian can only handle numeric attribute
    # So, categorical attributes should be transformed to numeric attributes
    # before anonymization. For example, Male and Female should be transformed
# to 0, 1 during pre-processing. Then, after anonymization, 0 and 1 should
# be transformed to Male and Female.
"""
QI_num = len(QI_INDEX)
data = []
    # order categorical attributes in intuitive order
# here, we use the appear number
intuitive_dict = []
intuitive_order = []
intuitive_number = []
for i in range(QI_num):
intuitive_dict.append(dict())
intuitive_number.append(0)
intuitive_order.append(list())
with open(file_path, 'r', encoding=encoding) as data_file:
if header:
headers = data_file.readline()
for line in data_file:
if len(line) == 0 or '?' in line:
continue
temp = [item.strip() for item in line.split(delimiter)]
ltemp = []
if not all(temp):
continue
for i in range(QI_num):
index = QI_INDEX[i]
if IS_DATETIME[i]:
t = datetime.strptime(temp[index], TIME_FORMAT_STR)
ltemp.append(t)
elif IS_CAT[i]:
try:
ltemp.append(intuitive_dict[i][temp[index]])
except KeyError:
intuitive_dict[i][temp[index]] = intuitive_number[i]
ltemp.append(intuitive_number[i])
intuitive_number[i] += 1
intuitive_order[i].append(temp[index])
else:
ltemp.append(float(temp[index]))
ltemp.append(temp[SA_INDEX])
data.append(ltemp)
return data, intuitive_order
================================================
FILE: utils/read_informs_data.py
================================================
"""
read informs dataset
"""
# !/usr/bin/env python
# coding=utf-8
# Read data functions for INFORMS data
# user att ['DUID','PID','DUPERSID','DOBMM','DOBYY','SEX','RACEX','RACEAX','RACEBX','RACEWX','RACETHNX','HISPANX','HISPCAT','EDUCYEAR','Year','marry','income','poverty']
# condition att ['DUID','DUPERSID','ICD9CODX','year']
__DEBUG = False
USER_ATT = ['DUID', 'PID', 'DUPERSID', 'DOBMM', 'DOBYY', 'SEX', 'RACEX', 'RACEAX',
'RACEBX', 'RACEWX', 'RACETHNX', 'HISPANX', 'HISPCAT', 'EDUCYEAR',
'Year', 'marry', 'income', 'poverty']
CONDITION_ATT = ['DUID', 'DUPERSID', 'ICD9CODX', 'year']
# Only 5 relational attributes and 1 transaction attribute are selected (according to Poulis's paper)
QI_INDEX = [3, 4, 6, 13, 16]
def read_data():
"""
    read microdata from *.csv files and return read data
"""
data = []
    # the 'U' mode flag was removed in Python 3.11, plain 'r' works everywhere
    userfile = open('data/demographics.csv', 'r')
    conditionfile = open('data/conditions.csv', 'r')
userdata = {}
    # We select the attributes listed in QI_INDEX (3, 4, 6, 13, 16) from
    # demographics, and attribute 2 (ICD9CODX) from conditions
# print "Reading Data..."
for i, line in enumerate(userfile):
line = line.strip()
# ignore first line of csv
if i == 0:
continue
row = line.split(',')
row[2] = row[2][1:-1]
try:
userdata[row[2]].append(row)
except:
userdata[row[2]] = row
conditiondata = {}
for i, line in enumerate(conditionfile):
line = line.strip()
# ignore first line of csv
if i == 0:
continue
row = line.split(',')
row[1] = row[1][1:-1]
row[2] = row[2][1:-1]
try:
conditiondata[row[1]].append(row)
except KeyError:
conditiondata[row[1]] = [row]
hashdata = {}
for k, v in list(userdata.items()):
if k in conditiondata:
temp = []
for t in conditiondata[k]:
temp.append(t[2])
hashdata[k] = []
for i in range(len(QI_INDEX)):
index = QI_INDEX[i]
hashdata[k].append(v[index])
hashdata[k].append(temp)
for k, v in list(hashdata.items()):
data.append(v)
userfile.close()
conditionfile.close()
return data
================================================
FILE: utils/utility.py
================================================
# !/usr/bin/env python
# coding:utf-8
"""
public functions
"""
from datetime import datetime
import time
def cmp(x, y):
    """Three-way comparison: return 1, 0 or -1."""
    if x > y:
        return 1
    elif x == y:
        return 0
    else:
        return -1


def cmp_str(element1, element2):
    """
    compare numbers in str format correctly
    """
    try:
        return cmp(int(element1), int(element2))
    except ValueError:
        return cmp(element1, element2)


def cmp_value(element1, element2):
    """Compare two values, treating str operands as integers when possible."""
    if isinstance(element1, str):
        return cmp_str(element1, element2)
    else:
        return cmp(element1, element2)
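# Usage sketch (not part of the original module): Python 3 has no built-in
# cmp and no cmp= argument for sorted(), so a comparator such as cmp_str is
# usually adapted with functools.cmp_to_key. The compact cmp below is a
# stand-in that assumes totally ordered operands; the sample list is made up.

```python
from functools import cmp_to_key


def cmp(x, y):
    # Three-way comparison: 1, 0 or -1 for totally ordered operands.
    return (x > y) - (x < y)


def cmp_str(element1, element2):
    # Compare as integers when both strings parse, otherwise as strings.
    try:
        return cmp(int(element1), int(element2))
    except ValueError:
        return cmp(element1, element2)


ages = ['10', '9', '2']
lexicographic = sorted(ages)                     # plain string ordering
numeric = sorted(ages, key=cmp_to_key(cmp_str))  # numeric-aware ordering
print(lexicographic)
print(numeric)
```

# Plain string sorting places '10' before '2'; the comparator restores the
# numeric order.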
def value(x):
    '''Return a numeric value that supports addition and subtraction'''
    if isinstance(x, (int, float)):
        return float(x)
    elif isinstance(x, datetime):
        return time.mktime(x.timetuple())
        # return x.timestamp() # not supported by python 2.7
    else:
        try:
            return float(x)
        except Exception:
            # not convertible to a number; return the value unchanged
            return x
def merge_qi_value(x_left, x_right, connect_str='~'):
    '''Join the interval boundaries into a generalized interval and return it as a string
    return:
        result:string
    '''
    if isinstance(x_left, (int, float)):
        # note: '%d' truncates float boundaries to integers
        if x_left == x_right:
            result = '%d' % (x_left)
        else:
            result = '%d%s%d' % (x_left, connect_str, x_right)
    elif isinstance(x_left, str):
        if x_left == x_right:
            result = x_left
        else:
            result = x_left + connect_str + x_right
    elif isinstance(x_left, datetime):
        # Generalize the datetime type value
        begin_date = x_left.strftime("%Y-%m-%d %H:%M:%S")
        end_date = x_right.strftime("%Y-%m-%d %H:%M:%S")
        result = begin_date + connect_str + end_date
    else:
        # previously fell through and failed with an unbound 'result'
        raise TypeError('unsupported boundary type: %r' % type(x_left))
    return result
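# Formatting sketch (not part of the original module): the expressions below
# repeat the formatting used by the numeric and datetime branches of
# merge_qi_value on made-up boundary values, mainly to show that '%d' drops
# the fractional part of float boundaries.

```python
from datetime import datetime

# Numeric boundaries: '%d' truncates floats to integers.
int_interval = '%d%s%d' % (20, '~', 30)
float_interval = '%d%s%d' % (1.5, '~', 2.7)

# datetime boundaries use the same strftime pattern as merge_qi_value.
fmt = "%Y-%m-%d %H:%M:%S"
date_interval = '~'.join([datetime(2018, 10, 1).strftime(fmt),
                          datetime(2018, 10, 31, 12, 30).strftime(fmt)])

print(int_interval)
print(float_interval)
print(date_interval)
```

# If fractional boundaries matter for a dataset, a float-aware format would
# be needed instead; that would be a change to the function, not current
# behavior.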
def covert_to_raw(result, intuitive_order, delimiter='~'):
    """
    During preprocessing, categorical attributes are converted to
    numeric attributes using intuitive order. This function converts
    these values back to their raw values. For example, Female and Male
    may be converted to 0 and 1 during anonymization. Then we need to transform
    them back to the original values after anonymization.
    """
    covert_result = []
    qi_len = len(intuitive_order)
    for record in result:
        covert_record = []
        for i in range(qi_len):
            # an empty order list marks a numeric attribute
            if len(intuitive_order[i]) > 0:
                vtemp = ''
                if delimiter in record[i]:
                    # generalized range of category indexes, e.g. '0~1'
                    temp = record[i].split(delimiter)
                    raw_list = []
                    for j in range(int(temp[0]), int(temp[1]) + 1):
                        raw_list.append(intuitive_order[i][j])
                    vtemp = delimiter.join(raw_list)
                else:
                    vtemp = intuitive_order[i][int(record[i])]
                covert_record.append(vtemp)
            else:
                covert_record.append(record[i])
        # the last field is the sensitive value (a str or a list of str)
        if isinstance(record[-1], str):
            covert_result.append(covert_record + [record[-1]])
        else:
            covert_result.append(covert_record + [delimiter.join(record[-1])])
    return covert_result
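# Decoding sketch (not part of the original module): the per-cell step of
# covert_to_raw, reduced to one hypothetical helper, decode_value. The
# category list SEX_ORDER is an assumed intuitive order, following the
# Female/Male example in the docstring above.

```python
def decode_value(cell, order, delimiter='~'):
    """Map an encoded cell ('1' or '0~1') back to its category labels."""
    if delimiter in cell:
        # A generalized cell holds an inclusive range of category indexes.
        low, high = cell.split(delimiter)
        return delimiter.join(order[int(low):int(high) + 1])
    return order[int(cell)]


# Hypothetical intuitive order for one categorical QI attribute.
SEX_ORDER = ['Female', 'Male']

single = decode_value('1', SEX_ORDER)
generalized = decode_value('0~1', SEX_ORDER)
print(single)
print(generalized)
```

# A single index decodes to one label, while a generalized range decodes to
# every label in the range joined by the delimiter.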
SYMBOL INDEX (35 symbols across 7 files)
FILE: anonymizer.py
function write_to_file (line 17) | def write_to_file(result):
function get_result_one (line 26) | def get_result_one(data, k=10):
function get_result_k (line 46) | def get_result_k(data):
function get_result_dataset (line 62) | def get_result_dataset(data, k=10, num_test=10):
function get_result_qi (line 97) | def get_result_qi(data, k=10):
function covert_to_raw (line 114) | def covert_to_raw(result, connect_str='~'):
FILE: mondrian.py
class Partition (line 41) | class Partition(object):
method __init__ (line 51) | def __init__(self, data, low, high):
method add_record (line 60) | def add_record(self, record, dim):
method add_multiple_record (line 66) | def add_multiple_record(self, records, dim):
method __len__ (line 73) | def __len__(self):
function get_normalized_width (line 80) | def get_normalized_width(partition, index):
function choose_dimension (line 92) | def choose_dimension(partition):
function frequency_set (line 111) | def frequency_set(partition, dim):
function find_median (line 124) | def find_median(partition, dim):
function anonymize_strict (line 161) | def anonymize_strict(partition):
function anonymize_relaxed (line 212) | def anonymize_relaxed(partition):
function init (line 275) | def init(data, k, QI_num=-1):
function mondrian (line 306) | def mondrian(data, k, relax=False, QI_num=-1):
FILE: mondrian_test.py
class functionTest (line 8) | class functionTest(unittest.TestCase):
method test1_mondrian_strict (line 9) | def test1_mondrian_strict(self):
method test1_mondrian_relax (line 21) | def test1_mondrian_relax(self):
method test2_mondrian_strict (line 33) | def test2_mondrian_strict(self):
method test2_mondrian_relax (line 46) | def test2_mondrian_relax(self):
method test_mondrian_datetime (line 59) | def test_mondrian_datetime(self):
method test_read_csv_and_anonymise (line 76) | def test_read_csv_and_anonymise(self):
FILE: utils/read_adult_data.py
function read_data (line 26) | def read_data():
FILE: utils/read_file.py
function read_csv (line 15) | def read_csv(file_path,
FILE: utils/read_informs_data.py
function read_data (line 23) | def read_data():
FILE: utils/utility.py
function cmp (line 10) | def cmp(x, y):
function cmp_str (line 19) | def cmp_str(element1, element2):
function cmp_value (line 28) | def cmp_value(element1, element2):
function value (line 35) | def value(x):
function merge_qi_value (line 49) | def merge_qi_value(x_left, x_right, connect_str='~'):
function covert_to_raw (line 72) | def covert_to_raw(result, intuitive_order, delimiter='~'):