Repository: qiyuangong/Mondrian Branch: main Commit: 49d49cfb3348 Files: 12 Total size: 37.9 KB Directory structure: gitextract_5yp6udp6/ ├── .gitignore ├── .travis.yml ├── LICENSE ├── README.md ├── anonymizer.py ├── mondrian.py ├── mondrian_test.py └── utils/ ├── __init__.py ├── read_adult_data.py ├── read_file.py ├── read_informs_data.py └── utility.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ *.py[cod] .DS_Store output/* data/* *.bak .vs # C extensions *.so ftp # Packages *.egg *.egg-info dist build eggs parts bin var sdist develop-eggs .installed.cfg lib lib64 __pycache__ # Unit test / coverage reports .coverage .tox nosetests.xml # Translations *.mo # Mr Developer .mr.developer.cfg .project .pydevproject *.sublime-* *.csv ================================================ FILE: .travis.yml ================================================ language: python python: - "2.7" script: python -m unittest discover . "*_test.py" branches: only: - master ================================================ FILE: LICENSE ================================================ The MIT License (MIT) Copyright (c) [2014] [Mondrian] Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
Mondrian [![Build Status](https://travis-ci.org/qiyuangong/Mondrian.svg?branch=master)](https://travis-ci.org/qiyuangong/Mondrian)
===========================
Mondrian is a top-down greedy data anonymization algorithm for relational datasets, proposed by Kristen LeFevre et al. [1]. To our knowledge, Mondrian is the fastest local recoding algorithm that still preserves good data utility. Although the paper gives the pseudocode, the original source code is not available. A third-party Java implementation can be found in the Anonymization Toolbox [2].

This repository is an **open source Python implementation of Mondrian**.

### Motivation
Research on data privacy has been going on for more than ten years, and lots of great papers have been published. However, only a few open source projects are available on the Internet [2-3], and most of them use algorithms proposed before 2004! Even fewer projects have been used in real life. Worse still, most people have never heard of them. Such a tragedy!

I decided to make some effort, hoping these open source repositories can help researchers and developers working on data privacy (privacy-preserving data publishing, data anonymization).

### Attention
This Mondrian is the earliest Mondrian proposed in [1], which imposes an intuitive ordering on each attribute. So, there are no generalization hierarchies for categorical attributes.
This operation brings lower information loss, but worse semantic results. **If you want the Mondrian based on generalization hierarchies, please turn to [Basic_Mondrian](https://github.com/qiyuangong/Basic_Mondrian).**

I used **both the adult and INFORMS** datasets in this implementation. For clarity, **we transform NCP (Normalized Certainty Penalty) into a percentage**. This NCP percentage is computed by dividing the NCP value by the number of values in the dataset (also called GCP (Global Certainty Penalty) [4]). The NCP percentage ranges from 0% to 100%, where 0% means no information loss and 100% means all information is lost (this is more meaningful than raw NCP, which is sensitive to the size of the dataset).

One more thing!!! Mondrian has strict and relaxed models. (Most online implementations use the strict model.) Both models split a partition with a binary split (let lhs and rhs denote the left part and the right part). In strict Mondrian, lhs and rhs do not intersect. But in relaxed Mondrian, the points in the middle are evenly divided between lhs and rhs to ensure `|lhs| = |rhs|` (+1 where `|partition|` is odd). So in the relaxed model, the generalized results of lhs and rhs may intersect.

The final NCP of Mondrian on the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult) is about 24.91% (relax) and 12.19% (strict), while it is 12.26% (relax) and 10.21% (strict) on the [INFORMS data](https://sites.google.com/site/informsdataminingcontest/) (with K=10).

### Basic idea of Mondrian
#### First, what is k-anonymity?
Assume your record is in this format: [QID, SA]. QID means quasi-identifier, such as age and birthday; SA means sensitive information, such as disease information. The basic idea of k-anonymity is `safety in group` (or safety in numbers [5]), which means that you are safe if you are in a group of people whose QIDs are the same. Note that nobody can infer your sensitive information (SA) from this group using QID, as shown in Fig. 1 (k=3 in 1(b) and 1(c)).
If each of these groups has at least k people, then the dataset satisfies k-anonymity.

Figure 1. Anonymity, Privacy and Generalization
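As a minimal sketch of the definition above (not part of this repository), the check below groups records by their QID values and verifies that every group has at least k members. The function name `is_k_anonymous` and the toy records are hypothetical; records are assumed to follow the [QID..., SA] layout, with the last column holding SA.

```python
from collections import Counter


def is_k_anonymous(records, k, qi_len=None):
    """Return True if every QID combination occurs at least k times.

    Assumes each record is a list laid out as [QID..., SA]; by default
    every column except the last one is treated as a quasi-identifier.
    """
    if qi_len is None:
        qi_len = len(records[0]) - 1
    # count how many records share each combination of QID values
    group_sizes = Counter(tuple(r[:qi_len]) for r in records)
    return all(size >= k for size in group_sizes.values())


# toy data (hypothetical): raw records with distinct QIDs,
# and the same records after generalizing age to a range
raw = [[25, 'M', 'flu'], [27, 'M', 'cold'], [29, 'M', 'flu']]
generalized = [['25~29', 'M', 'flu'], ['25~29', 'M', 'cold'],
               ['25~29', 'M', 'flu']]

print(is_k_anonymous(raw, 3))
print(is_k_anonymous(generalized, 3))
```

The raw records fail the check for k=3 because every QID combination is unique, while the generalized records form a single group of three.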

**But in practice, raw datasets usually don't satisfy k-anonymity, as shown in Fig. 1(a).** So, we need some help from an anonymization algorithm to transform the raw dataset into an anonymized dataset. Mondrian is one of them, and it is based on generalization. I don't want to talk too much about generalization. In a word, generalization is a kind of transformation that finds a result QID* covering all QIDs (QID1~QID3 in Fig. 1(b)). It also brings information loss (distortion).

#### How does Mondrian anonymize a dataset?
Here is the basic workflow of Mondrian:

1. Partition the raw dataset into k-groups using a kd-tree. A k-group is a group that contains at least k records.
2. Generalize each k-group (Fig. 1(b)), such that all records in a group have the same QID*.

Why use a kd-tree? Because it is fast, straightforward and sufficient.

Figure 2. Basic workflow of Mondrian

Figure 3. kd-tree
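The two workflow steps can be sketched in a few lines of Python. This is a simplified illustration, not the code in `mondrian.py`: the helper names `split_strict` and `generalize` are hypothetical, only numeric QIDs are handled, the split is the strict one, and only the attribute with the widest raw range is tried (the repository normalizes widths and falls back to other attributes). The sample records are taken from `mondrian_test.py`.

```python
def split_strict(records, k, qi_len):
    """Step 1: recursively split at the median, kd-tree style.

    Simplification: assumes k >= 1 and numeric QIDs, and stops as soon
    as the widest attribute cannot be split without breaking k-anonymity.
    """
    # choose the attribute with the widest range in this partition
    widths = [max(r[d] for r in records) - min(r[d] for r in records)
              for d in range(qi_len)]
    dim = widths.index(max(widths))
    values = sorted(r[dim] for r in records)
    median = values[(len(values) - 1) // 2]
    lhs = [r for r in records if r[dim] <= median]  # [low, median]
    rhs = [r for r in records if r[dim] > median]   # (median, high]
    # stop when a split would leave a group with fewer than k records
    if len(lhs) < k or len(rhs) < k:
        return [records]
    return split_strict(lhs, k, qi_len) + split_strict(rhs, k, qi_len)


def generalize(group, qi_len):
    """Step 2: replace each QID value with the group's 'low~high' range."""
    out = []
    for record in group:
        new = []
        for d in range(qi_len):
            low = min(r[d] for r in group)
            high = max(r[d] for r in group)
            new.append(str(low) if low == high else '%s~%s' % (low, high))
        # keep the sensitive attribute (last column) unchanged
        out.append(new + [record[-1]])
    return out


data = [[6, 1, 'haha'], [6, 1, 'test'], [8, 2, 'haha'], [8, 2, 'test'],
        [4, 1, 'hha'], [4, 2, 'hha'], [4, 3, 'hha'], [4, 4, 'hha']]
groups = split_strict(data, k=2, qi_len=2)
anonymized = [row for g in groups for row in generalize(g, 2)]
for row in anonymized:
    print(row)
```

With k=2, the eight sample records end up in four groups of two records each, and every record in a group carries the same generalized QID*.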

### Usage and Parameters:
The implementation is based on Python 3 and is compatible with Python 2.7. You can run Mondrian in the following steps:

1) Download (or clone) the whole project.
2) Run `anonymizer.py` in the root dir from the CLI.
3) Get the anonymized dataset from `data/anonymized.data`, if you didn't add `[k | qi | data]`.

Parameters:

    # Usage: python anonymizer.py [r|s] [a | i] [k | qi | data]
    # r: relax mondrian, s: strict mondrian
    # a: adult dataset, i: INFORMS dataset
    # k: varying k, qi: varying qi numbers, data: varying size of dataset

    # run Mondrian with adult data and default K (K=10)
    python anonymizer.py

    # run Strict Mondrian with adult data K=20
    python anonymizer.py s a 20

    # run Relax Mondrian with INFORMS data K=11
    python anonymizer.py r i 11

    # Evaluating Strict Mondrian with varying k on adult data
    python anonymizer.py s a k

### For more information:
[1] K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE '06: Proceedings of the 22nd International Conference on Data Engineering, IEEE Computer Society, 2006, 25

[2] [UTD Anonymization Toolbox](http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php?go=home)

[3] [ARX- Powerful Data Anonymization](https://github.com/arx-deidentifier/arx)

[4] G. Ghinita, P. Karras, P. Kalnis, N. Mamoulis. Fast data anonymization with low information loss. Proceedings of the 33rd international conference on Very large data bases, VLDB Endowment, 2007, 758-769

[5] Y. He, J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. Proceedings of VLDB, 2009, 2, 934-945

### Support
- You can post bug reports and feature requests at the [Issue Page](https://github.com/qiyuangong/Mondrian/issues).
- Contributions via [Pull request](https://github.com/qiyuangong/Mondrian/pulls) are welcome.
- Also, you can contact me via [email](mailto:qiyuangong@gmail.com).
========================== by [Qiyuan Gong](mailto:qiyuangong@gmail.com) 2017-5-23 ### Contributor List 🏆 * [Qiyuan Gong](mailto:qiyuangong@gmail.com) * [Liu Kun](https://github.com/build2last) ================================================ FILE: anonymizer.py ================================================ """ run mondrian with given parameters """ # !/usr/bin/env python # coding=utf-8 from mondrian import mondrian from utils.read_adult_data import read_data as read_adult from utils.read_informs_data import read_data as read_informs import sys, copy, random DATA_SELECT = 'a' RELAX = False INTUITIVE_ORDER = None def write_to_file(result): """ write the anonymized result to anonymized.data """ with open("data/anonymized.data", "w") as output: for r in result: output.write(';'.join(r) + '\n') def get_result_one(data, k=10): """ run mondrian for one time, with k=10 """ print("K=%d" % k) data_back = copy.deepcopy(data) result, eval_result = mondrian(data, k, RELAX) # Convert numerical values back to categorical values if necessary if DATA_SELECT == 'a': result = covert_to_raw(result) else: for r in result: r[-1] = ','.join(r[-1]) # write to anonymized.out write_to_file(result) data = copy.deepcopy(data_back) print("NCP %0.2f" % eval_result[0] + "%") print("Running time %0.2f" % eval_result[1] + " seconds") def get_result_k(data): """ change k, while fixing QD and size of data set """ data_back = copy.deepcopy(data) for k in range(5, 105, 5): print('#' * 30) print("K=%d" % k) result, eval_result = mondrian(data, k, RELAX) if DATA_SELECT == 'a': result = covert_to_raw(result) data = copy.deepcopy(data_back) print("NCP %0.2f" % eval_result[0] + "%") print("Running time %0.2f" % eval_result[1] + " seconds") def get_result_dataset(data, k=10, num_test=10): """ fix k and QI, while changing size of data set num_test is the test number. 
""" data_back = copy.deepcopy(data) length = len(data_back) joint = 5000 datasets = [] check_time = length / joint if length % joint == 0: check_time -= 1 for i in range(check_time): datasets.append(joint * (i + 1)) datasets.append(length) ncp = 0 rtime = 0 for pos in datasets: print('#' * 30) print("size of dataset %d" % pos) for j in range(num_test): temp = random.sample(data, pos) result, eval_result = mondrian(temp, k, RELAX) if DATA_SELECT == 'a': result = covert_to_raw(result) ncp += eval_result[0] rtime += eval_result[1] data = copy.deepcopy(data_back) ncp /= num_test rtime /= num_test print("Average NCP %0.2f" % ncp + "%") print("Running time %0.2f" % rtime + " seconds") print('#' * 30) def get_result_qi(data, k=10): """ change number of QI, while fixing k and size of data set """ data_back = copy.deepcopy(data) num_data = len(data[0]) for i in reversed(list(range(1, num_data))): print('#' * 30) print("Number of QI=%d" % i) result, eval_result = mondrian(data, k, RELAX, i) if DATA_SELECT == 'a': result = covert_to_raw(result) data = copy.deepcopy(data_back) print("NCP %0.2f" % eval_result[0] + "%") print("Running time %0.2f" % eval_result[1] + " seconds") def covert_to_raw(result, connect_str='~'): """ During preprocessing, categorical attributes are covert to numeric attribute using intuitive order. This function will covert these values back to they raw values. For example, Female and Male may be converted to 0 and 1 during anonymizaiton. Then we need to transform them back to original values after anonymization. 
""" covert_result = [] qi_len = len(INTUITIVE_ORDER) for record in result: covert_record = [] for i in range(qi_len): if len(INTUITIVE_ORDER[i]) > 0: vtemp = '' if connect_str in record[i]: temp = record[i].split(connect_str) raw_list = [] for j in range(int(temp[0]), int(temp[1]) + 1): raw_list.append(INTUITIVE_ORDER[i][j]) vtemp = connect_str.join(raw_list) else: vtemp = INTUITIVE_ORDER[i][int(record[i])] covert_record.append(vtemp) else: covert_record.append(record[i]) if isinstance(record[-1], str): covert_result.append(covert_record + [record[-1]]) else: covert_result.append(covert_record + [connect_str.join(record[-1])]) return covert_result if __name__ == '__main__': FLAG = '' LEN_ARGV = len(sys.argv) try: MODEL = sys.argv[1] DATA_SELECT = sys.argv[2] except IndexError: MODEL = 's' DATA_SELECT = 'a' INPUT_K = 10 # read record if MODEL == 's': RELAX = False else: RELAX = True if RELAX: print("Relax Mondrian") else: print("Strict Mondrian") if DATA_SELECT == 'i': print("INFORMS data") DATA = read_informs() else: print("Adult data") # INTUITIVE_ORDER is an intuitive order for # categorical attributes. This order is produced # by the reading (from data set) order. 
DATA, INTUITIVE_ORDER = read_adult() print(INTUITIVE_ORDER) if LEN_ARGV > 3: FLAG = sys.argv[3] if FLAG == 'k': get_result_k(DATA) elif FLAG == 'qi': get_result_qi(DATA) elif FLAG == 'data': get_result_dataset(DATA) elif FLAG == '': get_result_one(DATA) else: try: INPUT_K = int(FLAG) get_result_one(DATA, INPUT_K) except ValueError: print("Usage: python anonymizer [r|s] [a | i] [k | qi | data]") print("r: relax mondrian, s: strict mondrian") print("a: adult dataset, i: INFORMS dataset") print("k: varying k") print("qi: varying qi numbers") print("data: varying size of dataset") print("example: python anonymizer s a 10") print("example: python anonymizer s a k") # anonymized dataset is stored in result print("Finish Mondrian!!") ================================================ FILE: mondrian.py ================================================ # coding:utf-8 """ main module of mondrian """ # Implemented by Qiyuan Gong # qiyuangong@gmail.com # 2014-09-11 # @InProceedings{LeFevre2006, # Title = {Mondrian Multidimensional K-Anonymity}, # Author = {LeFevre, Kristen and DeWitt, David J. 
and Ramakrishnan, Raghu}, # Booktitle = {ICDE '06: Proceedings of the 22nd International Conference on Data Engineering}, # Year = {2006}, # Address = {Washington, DC, USA}, # Pages = {25}, # Publisher = {IEEE Computer Society}, # Doi = {http://dx.doi.org/10.1109/ICDE.2006.101}, # ISBN = {0-7695-2570-9}, # } # !/usr/bin/env python # coding=utf-8 import pdb import time from utils.utility import cmp_value, value, merge_qi_value from functools import cmp_to_key # warning all these variables should be re-inited, if # you want to run mondrian with different parameters __DEBUG = False QI_LEN = 10 GL_K = 0 RESULT = [] QI_RANGE = [] QI_DICT = [] QI_ORDER = [] class Partition(object): """ Class for Group (or EC), which is used to keep records self.member: records in group self.low: lower point, use index to avoid negative values self.high: higher point, use index to avoid negative values self.allow: show if partition can be split on this QI """ def __init__(self, data, low, high): """ split_tuple = (index, low, high) """ self.low = list(low) self.high = list(high) self.member = data[:] self.allow = [1] * QI_LEN def add_record(self, record, dim): """ add one record to member """ self.member.append(record) def add_multiple_record(self, records, dim): """ add multiple records (list) to partition """ for record in records: self.add_record(record, dim) def __len__(self): """ return number of records """ return len(self.member) def get_normalized_width(partition, index): """ return Normalized width of partition similar to NCP """ d_order = QI_ORDER[index] width = value(d_order[partition.high[index]]) - value(d_order[partition.low[index]]) if width == QI_RANGE[index]: return 1 return width * 1.0 / QI_RANGE[index] def choose_dimension(partition): """ choose dim with largest norm_width from all attributes. This function can be upgraded with other distance function. 
""" max_width = -1 max_dim = -1 for dim in range(QI_LEN): if partition.allow[dim] == 0: continue norm_width = get_normalized_width(partition, dim) if norm_width > max_width: max_width = norm_width max_dim = dim if max_width > 1: pdb.set_trace() return max_dim def frequency_set(partition, dim): """ get the frequency_set of partition on dim """ frequency = {} for record in partition.member: try: frequency[record[dim]] += 1 except KeyError: frequency[record[dim]] = 1 return frequency def find_median(partition, dim): """ find the middle of the partition, return split_val """ # use frequency set to get median frequency = frequency_set(partition, dim) split_val = '' next_val = '' value_list = list(frequency.keys()) value_list.sort(key=cmp_to_key(cmp_value)) total = sum(frequency.values()) middle = total // 2 if middle < GL_K or len(value_list) <= 1: try: return '', '', value_list[0], value_list[-1] except IndexError: return '', '', '', '' index = 0 split_index = 0 for i, qi_value in enumerate(value_list): index += frequency[qi_value] if index >= middle: split_val = qi_value split_index = i break else: print("Error: cannot find split_val") try: next_val = value_list[split_index + 1] except IndexError: # there is a frequency value in partition # which can be handle by mid_set # e.g.[1, 2, 3, 4, 4, 4, 4] next_val = split_val return (split_val, next_val, value_list[0], value_list[-1]) def anonymize_strict(partition): """ recursively partition groups until not allowable """ allow_count = sum(partition.allow) # only run allow_count times if allow_count == 0: RESULT.append(partition) return for index in range(allow_count): # choose attrubite from domain dim = choose_dimension(partition) if dim == -1: print("Error: dim=-1") pdb.set_trace() (split_val, next_val, low, high) = find_median(partition, dim) # Update parent low and high if low is not '': partition.low[dim] = QI_DICT[dim][low] partition.high[dim] = QI_DICT[dim][high] if split_val == '' or split_val == next_val: # cannot 
split partition.allow[dim] = 0 continue # split the group from median mean = QI_DICT[dim][split_val] lhs_high = partition.high[:] rhs_low = partition.low[:] lhs_high[dim] = mean rhs_low[dim] = QI_DICT[dim][next_val] lhs = Partition([], partition.low, lhs_high) rhs = Partition([], rhs_low, partition.high) for record in partition.member: pos = QI_DICT[dim][record[dim]] if pos <= mean: # lhs = [low, mean] lhs.add_record(record, dim) else: # rhs = (mean, high] rhs.add_record(record, dim) # check is lhs and rhs satisfy k-anonymity if len(lhs) < GL_K or len(rhs) < GL_K: partition.allow[dim] = 0 continue # anonymize sub-partition anonymize_strict(lhs) anonymize_strict(rhs) return RESULT.append(partition) def anonymize_relaxed(partition): """ recursively partition groups until not allowable """ if sum(partition.allow) == 0: # can not split RESULT.append(partition) return # choose attribute from domain dim = choose_dimension(partition) if dim == -1: print("Error: dim=-1") pdb.set_trace() # use frequency set to get median (split_val, next_val, low, high) = find_median(partition, dim) # Update parent low and high if low is not '': partition.low[dim] = QI_DICT[dim][low] partition.high[dim] = QI_DICT[dim][high] if split_val == '': # cannot split partition.allow[dim] = 0 anonymize_relaxed(partition) return # split the group from median mean = QI_DICT[dim][split_val] lhs_high = partition.high[:] rhs_low = partition.low[:] lhs_high[dim] = mean rhs_low[dim] = QI_DICT[dim][next_val] lhs = Partition([], partition.low, lhs_high) rhs = Partition([], rhs_low, partition.high) mid_set = [] for record in partition.member: pos = QI_DICT[dim][record[dim]] if pos < mean: # lhs = [low, mean) lhs.add_record(record, dim) elif pos > mean: # rhs = (mean, high] rhs.add_record(record, dim) else: # mid_set keep the means mid_set.append(record) # handle records in the middle # these records will be divided evenly # between lhs and rhs, such that # |lhs| = |rhs| (+1 if total size is odd) half_size = 
len(partition) // 2 for i in range(half_size - len(lhs)): record = mid_set.pop() lhs.add_record(record, dim) if len(mid_set) > 0: rhs.low[dim] = mean rhs.add_multiple_record(mid_set, dim) # It's not necessary now. # if len(lhs) < GL_K or len(rhs) < GL_K: # print "Error: split failure" # anonymize sub-partition anonymize_relaxed(lhs) anonymize_relaxed(rhs) def init(data, k, QI_num=-1): """ reset global variables """ global GL_K, RESULT, QI_LEN, QI_DICT, QI_RANGE, QI_ORDER if QI_num <= 0: QI_LEN = len(data[0]) - 1 else: QI_LEN = QI_num GL_K = k RESULT = [] # static values QI_DICT = [] QI_ORDER = [] QI_RANGE = [] att_values = [] for i in range(QI_LEN): att_values.append(set()) QI_DICT.append(dict()) for record in data: for i in range(QI_LEN): att_values[i].add(record[i]) for i in range(QI_LEN): value_list = list(att_values[i]) value_list.sort(key=cmp_to_key(cmp_value)) QI_RANGE.append(value(value_list[-1]) - value(value_list[0])) QI_ORDER.append(list(value_list)) for index, qi_value in enumerate(value_list): QI_DICT[i][qi_value] = index def mondrian(data, k, relax=False, QI_num=-1): """ Main function of mondrian, return result in tuple (result, (ncp, rtime)). data: dataset in 2-dimensional array. k: k parameter for k-anonymity QI_num: Default -1, which exclude the last column. Othewise, [0, 1,..., QI_num - 1] will be anonymized, [QI_num,...] will be excluded. relax: determine use strict or relaxed mondrian, Both mondrians split partition with binary split. In strict mondrian, lhs and rhs have not intersection. But in relaxed mondrian, lhs may be have intersection with rhs. 
""" init(data, k, QI_num) result = [] data_size = len(data) low = [0] * QI_LEN high = [(len(t) - 1) for t in QI_ORDER] whole_partition = Partition(data, low, high) # begin mondrian start_time = time.time() if relax: # relax model anonymize_relaxed(whole_partition) else: # strict model anonymize_strict(whole_partition) rtime = float(time.time() - start_time) # generalization result and # evaluation information loss ncp = 0.0 dp = 0.0 for partition in RESULT: rncp = 0.0 for index in range(QI_LEN): rncp += get_normalized_width(partition, index) rncp *= len(partition) ncp += rncp dp += len(partition) ** 2 for record in partition.member[:]: for index in range(QI_LEN): record[index] = merge_qi_value(QI_ORDER[index][partition.low[index]], QI_ORDER[index][partition.high[index]]) result.append(record) # If you want to get NCP values instead of percentage # please remove next three lines ncp /= QI_LEN ncp /= data_size ncp *= 100 if __DEBUG: from decimal import Decimal print("Discernability Penalty=%.2E" % Decimal(str(dp))) print("size of partitions=%d" % len(RESULT)) print("K=%d" % k) print("NCP = %.2f %%" % ncp) return (result, (ncp, rtime)) ================================================ FILE: mondrian_test.py ================================================ # coding:utf-8 from datetime import datetime import unittest from mondrian import mondrian from utils.read_file import read_csv class functionTest(unittest.TestCase): def test1_mondrian_strict(self): data = [[6, 1, 'haha'], [6, 1, 'test'], [8, 2, 'haha'], [8, 2, 'test'], [4, 1, 'hha'], [4, 2, 'hha'], [4, 3, 'hha'], [4, 4, 'hha']] result, eval_r = mondrian(data, 2, False) self.assertTrue(abs(eval_r[0] - 100.0 / 12) < 0.05) def test1_mondrian_relax(self): data = [[6, 1, 'haha'], [6, 1, 'test'], [8, 2, 'haha'], [8, 2, 'test'], [4, 1, 'hha'], [4, 2, 'hha'], [4, 3, 'hha'], [4, 4, 'hha']] result, eval_r = mondrian(data, 2, True) self.assertTrue(abs(eval_r[0] - 100.0 / 12) < 0.05) def test2_mondrian_strict(self): data = [[6, 
1, 'haha'], [8, 1, 'haha'], [8, 1, 'test'], [8, 1, 'haha'], [8, 1, 'test'], [4, 1, 'hha'], [4, 2, 'hha'], [4, 3, 'hha'], [4, 4, 'hha']] result, eval_r = mondrian(data, 2, False) self.assertTrue(abs(eval_r[0] - 2300.0 / 108) < 0.05) def test2_mondrian_relax(self): data = [[6, 1, 'haha'], [8, 1, 'haha'], [8, 1, 'test'], [8, 1, 'haha'], [8, 1, 'test'], [4, 1, 'hha'], [4, 2, 'hha'], [4, 3, 'hha'], [4, 4, 'hha']] result, eval_r = mondrian(data, 2, True) self.assertTrue(abs(eval_r[0] - 700.0 / 54) < 0.05) def test_mondrian_datetime(self): d1 = datetime.strptime("2007-03-04 21:08:12", "%Y-%m-%d %H:%M:%S") d2 = datetime.strptime("2008-03-04 21:08:12", "%Y-%m-%d %H:%M:%S") d3 = datetime.strptime("2009-03-04 21:08:12", "%Y-%m-%d %H:%M:%S") d4 = datetime.strptime("2007-03-05 21:08:12", "%Y-%m-%d %H:%M:%S") data = [[6, d1, 'haha'], [8, d1, 'haha'], [8, d1, 'test'], [8, d1, 'haha'], [8, d1, 'test'], [4, d1, 'hha'], [4, d2, 'hha'], [4, d3, 'hha'], [4, d4, 'hha']] result, eval_r = mondrian(data, 2, False) print(eval_r) def test_read_csv_and_anonymise(self): from utils.read_adult_data import read_data as read_adult DATA, INTUITIVE_ORDER = read_adult() result, eval_result = mondrian(DATA, 40, False) print(result) if __name__ == '__main__': unittest.main() ================================================ FILE: utils/__init__.py ================================================ ================================================ FILE: utils/read_adult_data.py ================================================ """ read adult data set """ # !/usr/bin/env python # coding=utf-8 # Read data and read tree functions for INFORMS data # attributes ['age', 'work_class', 'final_weight', 'education', 'education_num', # 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', # 'capital_loss', 'hours_per_week', 'native_country', 'class'] # QID ['age', 'work_class', 'education', 'marital_status', 'race', 'sex', 'native_country'] # SA ['occupation'] ATT_NAME = ['age', 'work_class', 
            'final_weight', 'education', 'education_num', 'marital_status',
            'occupation', 'relationship', 'race', 'sex', 'capital_gain',
            'capital_loss', 'hours_per_week', 'native_country', 'class']
QI_INDEX = [0, 1, 4, 5, 6, 8, 9, 13]
IS_CAT = [False, True, False, True, True, True, True, True]
SA_INDEX = -1
__DEBUG = False


def read_data():
    """
    read microdata for *.txt and return read data

    # Note that Mondrian can only handle numeric attributes
    # So, categorical attributes should be transformed to numeric attributes
    # before anonymization. For example, Male and Female should be transformed
    # to 0, 1 during pre-processing. Then, after anonymization, 0 and 1 should
    # be transformed to Male and Female.
    """
    QI_num = len(QI_INDEX)
    data = []
    # order categorical attributes in intuitive order
    # here, we use the order of appearance
    intuitive_dict = []
    intuitive_order = []
    intuitive_number = []
    for i in range(QI_num):
        intuitive_dict.append(dict())
        intuitive_number.append(0)
        intuitive_order.append(list())
    # plain 'r' mode: the 'U' flag is deprecated in Python 3
    # and was removed in Python 3.11
    data_file = open('data/adult.data', 'r')
    for line in data_file:
        line = line.strip()
        # remove empty and incomplete lines
        # only 30162 records will be kept
        if len(line) == 0 or '?'
in line: continue # remove double spaces line = line.replace(' ', '') temp = line.split(',') ltemp = [] for i in range(QI_num): index = QI_INDEX[i] if IS_CAT[i]: try: ltemp.append(intuitive_dict[i][temp[index]]) except KeyError: intuitive_dict[i][temp[index]] = intuitive_number[i] ltemp.append(intuitive_number[i]) intuitive_number[i] += 1 intuitive_order[i].append(temp[index]) else: ltemp.append(int(temp[index])) ltemp.append(temp[SA_INDEX]) data.append(ltemp) return data, intuitive_order ================================================ FILE: utils/read_file.py ================================================ # !/usr/bin/env python ''' read csv data, support numeric, category, time date author : Liu Kun date : 2018-10 ''' from datetime import datetime __DEBUG = False def read_csv(file_path, QI_INDEX, IS_CAT, IS_DATETIME, SA_INDEX, header=False, delimiter=',', encoding="utf-8", TIME_FORMAT_STR="%Y-%m-%d %H:%M:%S" ): """ read microdata for *.txt and return read data # Note that Mondrian can only handle numeric attribute # So, categorical attributes should be transformed to numberic attributes # before anonymization. For example, Male and Female shold be transformed # to 0, 1 during pre-processing. Then, after anonymization, 0 and 1 should # be transformed to Male and Female. """ QI_num = len(QI_INDEX) data = [] # oder categorical attributes in intuitive order # here, we use the appear number intuitive_dict = [] intuitive_order = [] intuitive_number = [] for i in range(QI_num): intuitive_dict.append(dict()) intuitive_number.append(0) intuitive_order.append(list()) with open(file_path, 'r', encoding=encoding) as data_file: if header: headers = data_file.readline() for line in data_file: if len(line) == 0 or '?' 
in line: continue temp = [item.strip() for item in line.split(delimiter)] ltemp = [] if not all(temp): continue for i in range(QI_num): index = QI_INDEX[i] if IS_DATETIME[i]: t = datetime.strptime(temp[index], TIME_FORMAT_STR) ltemp.append(t) elif IS_CAT[i]: try: ltemp.append(intuitive_dict[i][temp[index]]) except KeyError: intuitive_dict[i][temp[index]] = intuitive_number[i] ltemp.append(intuitive_number[i]) intuitive_number[i] += 1 intuitive_order[i].append(temp[index]) else: ltemp.append(float(temp[index])) ltemp.append(temp[SA_INDEX]) data.append(ltemp) return data, intuitive_order ================================================ FILE: utils/read_informs_data.py ================================================ """ read informs dataset """ # !/usr/bin/env python # coding=utf-8 # Read data and read tree fuctions for INFORMS data # user att ['DUID','PID','DUPERSID','DOBMM','DOBYY','SEX','RACEX','RACEAX','RACEBX','RACEWX','RACETHNX','HISPANX','HISPCAT','EDUCYEAR','Year','marry','income','poverty'] # condition att ['DUID','DUPERSID','ICD9CODX','year'] __DEBUG = False USER_ATT = ['DUID', 'PID', 'DUPERSID', 'DOBMM', 'DOBYY', 'SEX', 'RACEX', 'RACEAX', 'RACEBX', 'RACEWX', 'RACETHNX', 'HISPANX', 'HISPCAT', 'EDUCYEAR', 'Year', 'marry', 'income', 'poverty'] CONDITION_ATT = ['DUID', 'DUPERSID', 'ICD9CODX', 'year'] # Only 5 relational attributes and 1 transaction attribute are selected (according to Poulis's paper) QI_INDEX = [3, 4, 6, 13, 16] __DEBUG = False def read_data(): """ read microda for *.txt and return read data """ data = [] userfile = open('data/demographics.csv', 'rU') conditionfile = open('data/conditions.csv', 'rU') userdata = {} # We selet 3,4,5,6,13,15,15 att from demographics05, and 2 from condition05 # print "Reading Data..." 
for i, line in enumerate(userfile): line = line.strip() # ignore first line of csv if i == 0: continue row = line.split(',') row[2] = row[2][1:-1] try: userdata[row[2]].append(row) except: userdata[row[2]] = row conditiondata = {} for i, line in enumerate(conditionfile): line = line.strip() # ignore first line of csv if i == 0: continue row = line.split(',') row[1] = row[1][1:-1] row[2] = row[2][1:-1] try: conditiondata[row[1]].append(row) except KeyError: conditiondata[row[1]] = [row] hashdata = {} for k, v in list(userdata.items()): if k in conditiondata: temp = [] for t in conditiondata[k]: temp.append(t[2]) hashdata[k] = [] for i in range(len(QI_INDEX)): index = QI_INDEX[i] hashdata[k].append(v[index]) hashdata[k].append(temp) for k, v in list(hashdata.items()): data.append(v) userfile.close() conditionfile.close() return data ================================================ FILE: utils/utility.py ================================================ # !/usr/bin/env python # coding:utf-8 """ public functions """ from datetime import datetime import time def cmp(x, y): if x > y: return 1 elif x==y: return 0 else: return -1 def cmp_str(element1, element2): """ compare number in str format correctley """ try: return cmp(int(element1), int(element2)) except ValueError: return cmp(element1, element2) def cmp_value(element1, element2): if isinstance(element1, str): return cmp_str(element1, element2) else: return cmp(element1, element2) def value(x): '''Return the numeric type that supports addition and subtraction''' if isinstance(x, (int, float)): return float(x) elif isinstance(x, datetime): return time.mktime(x.timetuple()) # return x.timestamp() # not supported by python 2.7 else: try: return float(x) except Exception as e: return x def merge_qi_value(x_left, x_right, connect_str='~'): '''Connect the interval boundary value as a generalized interval and return the result as a string return: result:string ''' if isinstance(x_left, (int, float)): if x_left == x_right: 
result = '%d' % (x_left) else: result = '%d%s%d' % (x_left, connect_str, x_right) elif isinstance(x_left, str): if x_left == x_right: result = x_left else: result = x_left + connect_str + x_right elif isinstance(x_left, datetime): # Generalize the datetime type value begin_date = x_left.strftime("%Y-%m-%d %H:%M:%S") end_date = x_right.strftime("%Y-%m-%d %H:%M:%S") result = begin_date + connect_str + end_date return result def covert_to_raw(result, intuitive_order, delimiter='~'): """ During preprocessing, categorical attrbutes are covert to numeric attrbute using intutive order. This function will covert these values back to they raw values. For example, Female and Male may be coverted to 0 and 1 during anonymizaiton. Then we need to transform them back to original values after anonymization. """ covert_result = [] qi_len = len(intuitive_order) for record in result: covert_record = [] for i in range(qi_len): if len(intuitive_order[i]) > 0: vtemp = '' if delimiter in record[i]: temp = record[i].split(delimiter) raw_list = [] for j in range(int(temp[0]), int(temp[1]) + 1): raw_list.append(intuitive_order[i][j]) vtemp = delimiter.join(raw_list) else: vtemp = intuitive_order[i][int(record[i])] covert_record.append(vtemp) else: covert_record.append(record[i]) if isinstance(record[-1], str): covert_result.append(covert_record + [record[-1]]) else: covert_result.append(covert_record + [delimiter.join(record[-1])]) return covert_result