[
  {
    "path": ".gitignore",
    "content": "*.py[cod]\n.DS_Store\noutput/*\n\ndata/*\n\n*.bak\n.vs\n\n# C extensions\n*.so\nftp\n# Packages\n*.egg\n*.egg-info\ndist\nbuild\neggs\nparts\nbin\nvar\nsdist\ndevelop-eggs\n.installed.cfg\nlib\nlib64\n__pycache__\n\n# Unit test / coverage reports\n.coverage\n.tox\nnosetests.xml\n\n# Translations\n*.mo\n\n# Mr Developer\n.mr.developer.cfg\n.project\n.pydevproject\n\n*.sublime-*\n\n*.csv\n"
  },
  {
    "path": ".travis.yml",
    "content": "language: python\npython:\n  - \"2.7\"\nscript: python -m unittest discover . \"*_test.py\"\nbranches:\n  only:\n    - master\n"
  },
  {
    "path": "LICENSE",
    "content": "The MIT License (MIT)\n\nCopyright (c) [2014] [Mondrian]\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "Mondrian [![Build Status](https://travis-ci.org/qiyuangong/Mondrian.svg?branch=master)](https://travis-ci.org/qiyuangong/Mondrian)\n===========================\nMondrian is a Top-down greedy data anonymization algorithm for relational dataset, proposed by Kristen LeFevre in his papers[1]. To our knowledge, Mondrian is the fastest local recording algorithm, which preserve good data utility at the same time. Although LeFevre gave the pseudocode in his papers, the original source code is not available. You can find the third part Java implementation in Anonymization Toolbox[2].\n\nThis repository is an **open source python implementation for Mondrian**.\n\n### Motivation\nResearches on data privacy have lasted for more than ten years, lots of great papers have been published. However, only a few open source projects are available on Internet [2-3], most open source projects are using algorithms proposed before 2004! Fewer projects have been used in real life. Worse more, most people even don't hear about it. Such a tragedy!\n\nI decided to make some effort. Hoping these open source repositories can help researchers and developers on data privacy (privacy preserving data publishing, data anonymization).\n\n### Attention\n\nThis Mondrian is the earliest Mondrian proposed in [1], which imposes an intuitive ordering on each attribute. So, there is no generalization hierarchies for categorical attributes. This operation brings lower information loss, but worse semantic results. **If you want the Mondrian based on generalization hierarchies, please turn to [Basic_Mondrian](https://github.com/qiyuangong/Basic_Mondrian).**\n\nI used **both adult and INFORMS** dataset in this implementation. For clarification, **we transform NCP (Normalized Certainty Penalty) to percentage**. This NCP percentage is computed by dividing NCP value with the number of values in dataset (also called GCP (Global Certainty Penalty) [4]). 
The NCP percentage ranges from 0% to 100%, where 0% means no information loss and 100% means all information is lost (this is more meaningful than raw NCP, which is sensitive to the size of the dataset).\n\nOne more thing! Mondrian has strict and relaxed models. (Most online implementations use the strict model.) Both models split a partition with a binary split (let lhs and rhs denote the left part and the right part). In strict Mondrian, lhs has no intersection with rhs. In relaxed Mondrian, the points in the middle are evenly divided between lhs and rhs to ensure `|lhs| = |rhs|` (+1 where `|partition|` is odd). So in the relaxed model, the generalized results of lhs and rhs may intersect.\n\nThe final NCP of Mondrian on the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult) is about 24.91% (relaxed) and 12.19% (strict), and 12.26% (relaxed) and 10.21% (strict) on the [INFORMS data](https://sites.google.com/site/informsdataminingcontest/) (with K=10).\n\n### Basic idea of Mondrian\n#### First, what is k-anonymity?\nAssume your record is in this format: [QID, SA]. QID means quasi-identifiers such as age and birthday; SA means sensitive information such as disease information. The basic idea of k-anonymity is `safety in group` (or safety in numbers [5]), which means that you are safe if you are in a group of people whose QIDs are the same. Note that nobody can infer your sensitive information (SA) from this group using QID, as shown in Fig. 1 (k=3 in 1(b) and 1(c)). If each of these groups has at least k people, then the dataset satisfies k-anonymity.\n\n<p align=\"center\">\n<img src=https://cloud.githubusercontent.com/assets/3848789/25949050/c6a7e8ec-3688-11e7-933d-d5a991e6ef30.png width=750>\n</p>\n<p align=\"center\">\nFigure 1. Anonymity, Privacy and Generalization\n</p>\n\n**But in practice, raw datasets usually don't satisfy k-anonymity, as shown in Fig. 1(a).** So, we need some help from an anonymization algorithm to transform the raw datasets into anonymized datasets. 
Mondrian is one of them, and it is based on generalization. I don't want to talk too much about generalization. In a word, generalization is a kind of transformation that finds a result QID* covering all QIDs in a group (QID1~QID3 in Fig. 1 (b)). It also brings information loss (distortion).\n\n#### How does Mondrian anonymize a dataset?\nHere is the basic workflow of Mondrian:\n\n1. Partition the raw dataset into k-groups using a kd-tree. k-groups means that each group contains at least k records.\n2. Generalize each k-group (Fig. 1(b)), such that all records in a group have the same QID*.\n\nWhy use a kd-tree? Because it is fast, straightforward and sufficient.\n\n<p align=\"center\">\n<img src=https://cloud.githubusercontent.com/assets/3848789/25949051/c6a87622-3688-11e7-8bd0-726f07245570.png width=750>\n</p>\n<p align=\"center\">\nFigure 2. Basic workflow of Mondrian\n</p>\n\n<p align=\"center\">\n<img src=https://cloud.githubusercontent.com/assets/3848789/25949052/c6ab3fce-3688-11e7-99ea-cde7bccd8684.png width=450>\n</p>\n<p align=\"center\">\nFigure 3. kd-tree\n</p>\n\n\n### Usage and Parameters:\nThe implementation is based on Python 3 and is compatible with Python 2.7. 
You can run Mondrian with the following steps:\n\n1) Download (or clone) the whole project.\n\n2) Run `anonymizer.py` in the root directory from the command line.\n\n3) Get the anonymized dataset from `data/anonymized.data`. This file is only written for a single run, i.e. when you don't pass `k`, `qi`, or `data` as the third argument.\n\nParameters:\n\n\t# Usage: python anonymizer.py [r|s] [a | i] [k | qi | data]\n\t# r: relaxed mondrian, s: strict mondrian\n\t# a: adult dataset, i: INFORMS dataset\n\t# k: varying k, qi: varying qi numbers, data: varying size of dataset\n\t# run strict Mondrian (the default) with adult data and default K (K=10)\n\tpython anonymizer.py\n\n\t# run strict Mondrian with adult data and K=20\n\tpython anonymizer.py s a 20\n\n\t# run relaxed Mondrian with INFORMS data and K=11\n\tpython anonymizer.py r i 11\n\n\n\t# Evaluate strict Mondrian with varying k on adult data\n\tpython anonymizer.py s a k\n\n\n### For more information:\n[1] K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE '06: Proceedings of the 22nd International Conference on Data Engineering, IEEE Computer Society, 2006, 25\n\n[2] [UTD Anonymization Toolbox](http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php?go=home)\n\n[3] [ARX - Powerful Data Anonymization](https://github.com/arx-deidentifier/arx)\n\n[4] G. Ghinita, P. Karras, P. Kalnis, N. Mamoulis. Fast data anonymization with low information loss. Proceedings of the 33rd international conference on Very large data bases, VLDB Endowment, 2007, 758-769\n\n[5] Y. He, J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. 
Proceedings of VLDB, 2009, 2, 934-945\n\n### Support\n\n- You can post bug reports and feature requests at the [Issue Page](https://github.com/qiyuangong/Mondrian/issues).\n- Contributions via [Pull request](https://github.com/qiyuangong/Mondrian/pulls) are welcome.\n- Also, you can contact me via [email](mailto:qiyuangong@gmail.com).\n\n==========================\n\nby [Qiyuan Gong](mailto:qiyuangong@gmail.com)\n\n2017-5-23\n\n\n### Contributor List 🏆\n* [Qiyuan Gong](mailto:qiyuangong@gmail.com)\n* [Liu Kun](https://github.com/build2last)\n"
  },
  {
    "path": "anonymizer.py",
    "content": "\"\"\"\nrun mondrian with given parameters\n\"\"\"\n\n# !/usr/bin/env python\n# coding=utf-8\nfrom mondrian import mondrian\nfrom utils.read_adult_data import read_data as read_adult\nfrom utils.read_informs_data import read_data as read_informs\nimport sys, copy, random\n\nDATA_SELECT = 'a'\nRELAX = False\nINTUITIVE_ORDER = None\n\n\ndef write_to_file(result):\n    \"\"\"\n    write the anonymized result to anonymized.data\n    \"\"\"\n    with open(\"data/anonymized.data\", \"w\") as output:\n        for r in result:\n            output.write(';'.join(r) + '\\n')\n\n\ndef get_result_one(data, k=10):\n    \"\"\"\n    run mondrian for one time, with k=10\n    \"\"\"\n    print(\"K=%d\" % k)\n    data_back = copy.deepcopy(data)\n    result, eval_result = mondrian(data, k, RELAX)\n    # Convert numerical values back to categorical values if necessary\n    if DATA_SELECT == 'a':\n        result = covert_to_raw(result)\n    else:\n        for r in result:\n            r[-1] = ','.join(r[-1])\n    # write to anonymized.out\n    write_to_file(result)\n    data = copy.deepcopy(data_back)\n    print(\"NCP %0.2f\" % eval_result[0] + \"%\")\n    print(\"Running time %0.2f\" % eval_result[1] + \" seconds\")\n\n\ndef get_result_k(data):\n    \"\"\"\n    change k, while fixing QD and size of data set\n    \"\"\"\n    data_back = copy.deepcopy(data)\n    for k in range(5, 105, 5):\n        print('#' * 30)\n        print(\"K=%d\" % k)\n        result, eval_result = mondrian(data, k, RELAX)\n        if DATA_SELECT == 'a':\n            result = covert_to_raw(result)\n        data = copy.deepcopy(data_back)\n        print(\"NCP %0.2f\" % eval_result[0] + \"%\")\n        print(\"Running time %0.2f\" % eval_result[1] + \" seconds\")\n\n\ndef get_result_dataset(data, k=10, num_test=10):\n    \"\"\"\n    fix k and QI, while changing size of data set\n    num_test is the test number.\n    \"\"\"\n    data_back = copy.deepcopy(data)\n    length = len(data_back)\n    joint 
= 5000\n    datasets = []\n    # integer division: range() needs an int under Python 3\n    check_time = length // joint\n    if length % joint == 0:\n        check_time -= 1\n    for i in range(check_time):\n        datasets.append(joint * (i + 1))\n    datasets.append(length)\n    for pos in datasets:\n        # reset the accumulators so each size gets its own average\n        ncp = 0\n        rtime = 0\n        print('#' * 30)\n        print(\"size of dataset %d\" % pos)\n        for j in range(num_test):\n            temp = random.sample(data, pos)\n            result, eval_result = mondrian(temp, k, RELAX)\n            if DATA_SELECT == 'a':\n                result = covert_to_raw(result)\n            ncp += eval_result[0]\n            rtime += eval_result[1]\n            data = copy.deepcopy(data_back)\n        ncp /= num_test\n        rtime /= num_test\n        print(\"Average NCP %0.2f\" % ncp + \"%\")\n        print(\"Running time %0.2f\" % rtime + \" seconds\")\n        print('#' * 30)\n\n\ndef get_result_qi(data, k=10):\n    \"\"\"\n    change number of QI, while fixing k and size of data set\n    \"\"\"\n    data_back = copy.deepcopy(data)\n    num_data = len(data[0])\n    for i in reversed(list(range(1, num_data))):\n        print('#' * 30)\n        print(\"Number of QI=%d\" % i)\n        result, eval_result = mondrian(data, k, RELAX, i)\n        if DATA_SELECT == 'a':\n            result = covert_to_raw(result)\n        data = copy.deepcopy(data_back)\n        print(\"NCP %0.2f\" % eval_result[0] + \"%\")\n        print(\"Running time %0.2f\" % eval_result[1] + \" seconds\")\n\n\ndef covert_to_raw(result, connect_str='~'):\n    \"\"\"\n    During preprocessing, categorical attributes are converted to\n    numeric attributes using intuitive order. This function converts\n    these values back to their raw values. For example, Female and Male\n    may be converted to 0 and 1 during anonymization. 
Then we need to transform\n    them back to original values after anonymization.\n    \"\"\"\n    covert_result = []\n    qi_len = len(INTUITIVE_ORDER)\n    for record in result:\n        covert_record = []\n        for i in range(qi_len):\n            if len(INTUITIVE_ORDER[i]) > 0:\n                vtemp = ''\n                if connect_str in record[i]:\n                    temp = record[i].split(connect_str)\n                    raw_list = []\n                    for j in range(int(temp[0]), int(temp[1]) + 1):\n                        raw_list.append(INTUITIVE_ORDER[i][j])\n                    vtemp = connect_str.join(raw_list)\n                else:\n                    vtemp = INTUITIVE_ORDER[i][int(record[i])]\n                covert_record.append(vtemp)\n            else:\n                covert_record.append(record[i])\n        if isinstance(record[-1], str):\n            covert_result.append(covert_record + [record[-1]])\n        else:\n            covert_result.append(covert_record + [connect_str.join(record[-1])])\n    return covert_result\n\n\nif __name__ == '__main__':\n    FLAG = ''\n    LEN_ARGV = len(sys.argv)\n    try:\n        MODEL = sys.argv[1]\n        DATA_SELECT = sys.argv[2]\n    except IndexError:\n        MODEL = 's'\n        DATA_SELECT = 'a'\n    INPUT_K = 10\n    # read record\n    if MODEL == 's':\n        RELAX = False\n    else:\n        RELAX = True\n    if RELAX:\n        print(\"Relax Mondrian\")\n    else:\n        print(\"Strict Mondrian\")\n    if DATA_SELECT == 'i':\n        print(\"INFORMS data\")\n        DATA = read_informs()\n    else:\n        print(\"Adult data\")\n        # INTUITIVE_ORDER is an intuitive order for\n        # categorical attributes. 
This order is produced\n        # by the reading (from data set) order.\n        DATA, INTUITIVE_ORDER = read_adult()\n        print(INTUITIVE_ORDER)\n    if LEN_ARGV > 3:\n        FLAG = sys.argv[3]\n    if FLAG == 'k':\n        get_result_k(DATA)\n    elif FLAG == 'qi':\n        get_result_qi(DATA)\n    elif FLAG == 'data':\n        get_result_dataset(DATA)\n    elif FLAG == '':\n        get_result_one(DATA)\n    else:\n        try:\n            INPUT_K = int(FLAG)\n            get_result_one(DATA, INPUT_K)\n        except ValueError:\n            print(\"Usage: python anonymizer.py [r|s] [a | i] [k | qi | data]\")\n            print(\"r: relaxed mondrian, s: strict mondrian\")\n            print(\"a: adult dataset, i: INFORMS dataset\")\n            print(\"k: varying k\")\n            print(\"qi: varying qi numbers\")\n            print(\"data: varying size of dataset\")\n            print(\"example: python anonymizer.py s a 10\")\n            print(\"example: python anonymizer.py s a k\")\n    # a single run writes the anonymized dataset to data/anonymized.data\n    print(\"Finish Mondrian!!\")\n"
  },
  {
    "path": "mondrian.py",
    "content": "# coding:utf-8\n\"\"\"\nmain module of mondrian\n\"\"\"\n\n# Implemented by Qiyuan Gong\n# qiyuangong@gmail.com\n# 2014-09-11\n\n# @InProceedings{LeFevre2006,\n#   Title = {Mondrian Multidimensional K-Anonymity},\n#   Author = {LeFevre, Kristen and DeWitt, David J. and Ramakrishnan, Raghu},\n#   Booktitle = {ICDE '06: Proceedings of the 22nd International Conference on Data Engineering},\n#   Year = {2006},\n#   Address = {Washington, DC, USA},\n#   Pages = {25},\n#   Publisher = {IEEE Computer Society},\n#   Doi = {http://dx.doi.org/10.1109/ICDE.2006.101},\n#   ISBN = {0-7695-2570-9},\n# }\n\n# !/usr/bin/env python\n# coding=utf-8\n\nimport pdb\nimport time\nfrom utils.utility import cmp_value, value, merge_qi_value\nfrom functools import cmp_to_key\n\n# warning all these variables should be re-inited, if\n# you want to run mondrian with different parameters\n__DEBUG = False\nQI_LEN = 10\nGL_K = 0\nRESULT = []\nQI_RANGE = []\nQI_DICT = []\nQI_ORDER = []\n\n\nclass Partition(object):\n\n    \"\"\"\n    Class for Group (or EC), which is used to keep records\n    self.member: records in group\n    self.low: lower point, use index to avoid negative values\n    self.high: higher point, use index to avoid negative values\n    self.allow: show if partition can be split on this QI\n    \"\"\"\n\n    def __init__(self, data, low, high):\n        \"\"\"\n        split_tuple = (index, low, high)\n        \"\"\"\n        self.low = list(low)\n        self.high = list(high)\n        self.member = data[:]\n        self.allow = [1] * QI_LEN\n\n    def add_record(self, record, dim):\n        \"\"\"\n        add one record to member\n        \"\"\"\n        self.member.append(record)\n\n    def add_multiple_record(self, records, dim):\n        \"\"\"\n        add multiple records (list) to partition\n        \"\"\"\n        for record in records:\n            self.add_record(record, dim)\n\n    def __len__(self):\n        \"\"\"\n        return number of records\n  
      \"\"\"\n        return len(self.member)\n\n\ndef get_normalized_width(partition, index):\n    \"\"\"\n    return Normalized width of partition\n    similar to NCP\n    \"\"\"\n    d_order = QI_ORDER[index]\n    width = value(d_order[partition.high[index]]) - value(d_order[partition.low[index]])\n    if width == QI_RANGE[index]:\n        return 1\n    return width * 1.0 / QI_RANGE[index]\n\n\ndef choose_dimension(partition):\n    \"\"\"\n    choose dim with largest norm_width from all attributes.\n    This function can be upgraded with other distance function.\n    \"\"\"\n    max_width = -1\n    max_dim = -1\n    for dim in range(QI_LEN):\n        if partition.allow[dim] == 0:\n            continue\n        norm_width = get_normalized_width(partition, dim)\n        if norm_width > max_width:\n            max_width = norm_width\n            max_dim = dim\n    if max_width > 1:\n        pdb.set_trace()\n    return max_dim\n\n\ndef frequency_set(partition, dim):\n    \"\"\"\n    get the frequency_set of partition on dim\n    \"\"\"\n    frequency = {}\n    for record in partition.member:\n        try:\n            frequency[record[dim]] += 1\n        except KeyError:\n            frequency[record[dim]] = 1\n    return frequency\n\n\ndef find_median(partition, dim):\n    \"\"\"\n    find the middle of the partition, return split_val\n    \"\"\"\n    # use frequency set to get median\n    frequency = frequency_set(partition, dim)\n    split_val = ''\n    next_val = ''\n    value_list = list(frequency.keys())\n    value_list.sort(key=cmp_to_key(cmp_value))\n    total = sum(frequency.values())\n    middle = total // 2\n    if middle < GL_K or len(value_list) <= 1:\n        try:\n            return '', '', value_list[0], value_list[-1]\n        except IndexError:\n            return '', '', '', ''\n    index = 0\n    split_index = 0\n    for i, qi_value in enumerate(value_list):\n        index += frequency[qi_value]\n        if index >= middle:\n            split_val 
= qi_value\n            split_index = i\n            break\n    else:\n        print(\"Error: cannot find split_val\")\n    try:\n        next_val = value_list[split_index + 1]\n    except IndexError:\n        # split_val is the largest value in the partition;\n        # relaxed mode handles this case with mid_set\n        # e.g.[1, 2, 3, 4, 4, 4, 4]\n        next_val = split_val\n    return (split_val, next_val, value_list[0], value_list[-1])\n\n\ndef anonymize_strict(partition):\n    \"\"\"\n    recursively partition groups until not allowable\n    \"\"\"\n    allow_count = sum(partition.allow)\n    # only run allow_count times\n    if allow_count == 0:\n        RESULT.append(partition)\n        return\n    for index in range(allow_count):\n        # choose attribute from domain\n        dim = choose_dimension(partition)\n        if dim == -1:\n            print(\"Error: dim=-1\")\n            pdb.set_trace()\n        (split_val, next_val, low, high) = find_median(partition, dim)\n        # Update parent low and high\n        # compare by value: identity checks against a literal are unreliable\n        if low != '':\n            partition.low[dim] = QI_DICT[dim][low]\n            partition.high[dim] = QI_DICT[dim][high]\n        if split_val == '' or split_val == next_val:\n            # cannot split\n            partition.allow[dim] = 0\n            continue\n        # split the group from median\n        mean = QI_DICT[dim][split_val]\n        lhs_high = partition.high[:]\n        rhs_low = partition.low[:]\n        lhs_high[dim] = mean\n        rhs_low[dim] = QI_DICT[dim][next_val]\n        lhs = Partition([], partition.low, lhs_high)\n        rhs = Partition([], rhs_low, partition.high)\n        for record in partition.member:\n            pos = QI_DICT[dim][record[dim]]\n            if pos <= mean:\n                # lhs = [low, mean]\n                lhs.add_record(record, dim)\n            else:\n                # rhs = (mean, high]\n                rhs.add_record(record, dim)\n        # check if lhs and rhs satisfy k-anonymity\n        if len(lhs) < 
GL_K or len(rhs) < GL_K:\n            partition.allow[dim] = 0\n            continue\n        # anonymize sub-partition\n        anonymize_strict(lhs)\n        anonymize_strict(rhs)\n        return\n    RESULT.append(partition)\n\n\ndef anonymize_relaxed(partition):\n    \"\"\"\n    recursively partition groups until not allowable\n    \"\"\"\n    if sum(partition.allow) == 0:\n        # can not split\n        RESULT.append(partition)\n        return\n    # choose attribute from domain\n    dim = choose_dimension(partition)\n    if dim == -1:\n        print(\"Error: dim=-1\")\n        pdb.set_trace()\n    # use frequency set to get median\n    (split_val, next_val, low, high) = find_median(partition, dim)\n    # Update parent low and high\n    if low is not '':\n        partition.low[dim] = QI_DICT[dim][low]\n        partition.high[dim] = QI_DICT[dim][high]\n    if split_val == '':\n        # cannot split\n        partition.allow[dim] = 0\n        anonymize_relaxed(partition)\n        return\n    # split the group from median\n    mean = QI_DICT[dim][split_val]\n    lhs_high = partition.high[:]\n    rhs_low = partition.low[:]\n    lhs_high[dim] = mean\n    rhs_low[dim] = QI_DICT[dim][next_val]\n    lhs = Partition([], partition.low, lhs_high)\n    rhs = Partition([], rhs_low, partition.high)\n    mid_set = []\n    for record in partition.member:\n        pos = QI_DICT[dim][record[dim]]\n        if pos < mean:\n            # lhs = [low, mean)\n            lhs.add_record(record, dim)\n        elif pos > mean:\n            # rhs = (mean, high]\n            rhs.add_record(record, dim)\n        else:\n            # mid_set keep the means\n            mid_set.append(record)\n    # handle records in the middle\n    # these records will be divided evenly\n    # between lhs and rhs, such that\n    # |lhs| = |rhs| (+1 if total size is odd)\n    half_size = len(partition) // 2\n    for i in range(half_size - len(lhs)):\n        record = mid_set.pop()\n        
lhs.add_record(record, dim)\n    if len(mid_set) > 0:\n        rhs.low[dim] = mean\n        rhs.add_multiple_record(mid_set, dim)\n    # It's not necessary now.\n    # if len(lhs) < GL_K or len(rhs) < GL_K:\n    #     print(\"Error: split failure\")\n    # anonymize sub-partition\n    anonymize_relaxed(lhs)\n    anonymize_relaxed(rhs)\n\n\ndef init(data, k, QI_num=-1):\n    \"\"\"\n    reset global variables\n    \"\"\"\n    global GL_K, RESULT, QI_LEN, QI_DICT, QI_RANGE, QI_ORDER\n    if QI_num <= 0:\n        QI_LEN = len(data[0]) - 1\n    else:\n        QI_LEN = QI_num\n    GL_K = k\n    RESULT = []\n    # static values\n    QI_DICT = []\n    QI_ORDER = []\n    QI_RANGE = []\n    att_values = []\n    for i in range(QI_LEN):\n        att_values.append(set())\n        QI_DICT.append(dict())\n    for record in data:\n        for i in range(QI_LEN):\n            att_values[i].add(record[i])\n    for i in range(QI_LEN):\n        value_list = list(att_values[i])\n        value_list.sort(key=cmp_to_key(cmp_value))\n        QI_RANGE.append(value(value_list[-1]) - value(value_list[0]))\n        QI_ORDER.append(list(value_list))\n        for index, qi_value in enumerate(value_list):\n            QI_DICT[i][qi_value] = index\n\n\ndef mondrian(data, k, relax=False, QI_num=-1):\n    \"\"\"\n    Main function of mondrian, return result in tuple (result, (ncp, rtime)).\n    data: dataset in 2-dimensional array.\n    k: k parameter for k-anonymity\n    QI_num: Default -1, which excludes only the last column. Otherwise, [0, 1,..., QI_num - 1]\n            will be anonymized, [QI_num,...] 
will be excluded.\n    relax: determines whether to use strict or relaxed mondrian.\n    Both mondrians split a partition with a binary split.\n    In strict mondrian, lhs and rhs have no intersection.\n    But in relaxed mondrian, lhs may intersect with rhs.\n    \"\"\"\n    init(data, k, QI_num)\n    result = []\n    data_size = len(data)\n    low = [0] * QI_LEN\n    high = [(len(t) - 1) for t in QI_ORDER]\n    whole_partition = Partition(data, low, high)\n    # begin mondrian\n    start_time = time.time()\n    if relax:\n        # relaxed model\n        anonymize_relaxed(whole_partition)\n    else:\n        # strict model\n        anonymize_strict(whole_partition)\n    rtime = float(time.time() - start_time)\n    # generalize the result and\n    # evaluate information loss\n    ncp = 0.0\n    dp = 0.0\n    for partition in RESULT:\n        rncp = 0.0\n        for index in range(QI_LEN):\n            rncp += get_normalized_width(partition, index)\n        rncp *= len(partition)\n        ncp += rncp\n        dp += len(partition) ** 2\n        for record in partition.member[:]:\n            for index in range(QI_LEN):\n                record[index] = merge_qi_value(QI_ORDER[index][partition.low[index]],\n                                QI_ORDER[index][partition.high[index]])\n            result.append(record)\n    # If you want to get NCP values instead of percentage\n    # please remove next three lines\n    ncp /= QI_LEN\n    ncp /= data_size\n    ncp *= 100\n    if __DEBUG:\n        from decimal import Decimal\n        print(\"Discernability Penalty=%.2E\" % Decimal(str(dp)))\n        print(\"size of partitions=%d\" % len(RESULT))\n        print(\"K=%d\" % k)\n        print(\"NCP = %.2f %%\" % ncp)\n    return (result, (ncp, rtime))\n"
  },
  {
    "path": "mondrian_test.py",
    "content": "# coding:utf-8\nfrom datetime import datetime\nimport unittest\n\nfrom mondrian import mondrian\nfrom utils.read_file import read_csv\n\nclass functionTest(unittest.TestCase):\n    def test1_mondrian_strict(self):\n        data = [[6, 1, 'haha'],\n                [6, 1, 'test'],\n                [8, 2, 'haha'],\n                [8, 2, 'test'],\n                [4, 1, 'hha'],\n                [4, 2, 'hha'],\n                [4, 3, 'hha'],\n                [4, 4, 'hha']]\n        result, eval_r = mondrian(data, 2, False)\n        self.assertTrue(abs(eval_r[0] - 100.0 / 12) < 0.05)\n\n    def test1_mondrian_relax(self):\n        data = [[6, 1, 'haha'],\n                [6, 1, 'test'],\n                [8, 2, 'haha'],\n                [8, 2, 'test'],\n                [4, 1, 'hha'],\n                [4, 2, 'hha'],\n                [4, 3, 'hha'],\n                [4, 4, 'hha']]\n        result, eval_r = mondrian(data, 2, True)\n        self.assertTrue(abs(eval_r[0] - 100.0 / 12) < 0.05)\n\n    def test2_mondrian_strict(self):\n        data = [[6, 1, 'haha'],\n                [8, 1, 'haha'],\n                [8, 1, 'test'],\n                [8, 1, 'haha'],\n                [8, 1, 'test'],\n                [4, 1, 'hha'],\n                [4, 2, 'hha'],\n                [4, 3, 'hha'],\n                [4, 4, 'hha']]\n        result, eval_r = mondrian(data, 2, False)\n        self.assertTrue(abs(eval_r[0] - 2300.0 / 108) < 0.05)\n\n    def test2_mondrian_relax(self):\n        data = [[6, 1, 'haha'],\n                [8, 1, 'haha'],\n                [8, 1, 'test'],\n                [8, 1, 'haha'],\n                [8, 1, 'test'],\n                [4, 1, 'hha'],\n                [4, 2, 'hha'],\n                [4, 3, 'hha'],\n                [4, 4, 'hha']]\n        result, eval_r = mondrian(data, 2, True)\n        self.assertTrue(abs(eval_r[0] - 700.0 / 54) < 0.05)\n\n    def test_mondrian_datetime(self):\n        d1 = datetime.strptime(\"2007-03-04 21:08:12\", 
\"%Y-%m-%d %H:%M:%S\")\n        d2 = datetime.strptime(\"2008-03-04 21:08:12\", \"%Y-%m-%d %H:%M:%S\")\n        d3 = datetime.strptime(\"2009-03-04 21:08:12\", \"%Y-%m-%d %H:%M:%S\")\n        d4 = datetime.strptime(\"2007-03-05 21:08:12\", \"%Y-%m-%d %H:%M:%S\")\n        data = [[6, d1, 'haha'],\n                [8, d1, 'haha'],\n                [8, d1, 'test'],\n                [8, d1, 'haha'],\n                [8, d1, 'test'],\n                [4, d1, 'hha'],\n                [4, d2, 'hha'],\n                [4, d3, 'hha'],\n                [4, d4, 'hha']]\n        result, eval_r = mondrian(data, 2, False)\n        print(eval_r)\n\n    def test_read_csv_and_anonymise(self):\n        from utils.read_adult_data import read_data as read_adult\n        DATA, INTUITIVE_ORDER = read_adult() \n        result, eval_result = mondrian(DATA, 40, False)\n        print(result)\n\nif __name__ == '__main__':\n    unittest.main()\n"
  },
  {
    "path": "utils/__init__.py",
    "content": ""
  },
  {
    "path": "utils/read_adult_data.py",
    "content": "\"\"\"\nread adult data set\n\"\"\"\n\n# !/usr/bin/env python\n# coding=utf-8\n\n# Read data and read tree functions for INFORMS data\n# attributes ['age', 'work_class', 'final_weight', 'education', 'education_num',\n# 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain',\n# 'capital_loss', 'hours_per_week', 'native_country', 'class']\n# QID ['age', 'work_class', 'education', 'marital_status', 'race', 'sex', 'native_country']\n# SA ['occupation']\n\n\nATT_NAME = ['age', 'work_class', 'final_weight', 'education',\n            'education_num', 'marital_status', 'occupation', 'relationship',\n            'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week',\n            'native_country', 'class']\nQI_INDEX = [0, 1, 4, 5, 6, 8, 9, 13]\nIS_CAT = [False, True, False, True, True, True, True, True]\nSA_INDEX = -1\n__DEBUG = False\n\n\ndef read_data():\n    \"\"\"\n    read microdata for *.txt and return read data\n\n    # Note that Mondrian can only handle numeric attribute\n    # So, categorical attributes should be transformed to numeric attributes\n    # before anonymization. For example, Male and Female should be transformed\n    # to 0, 1 during pre-processing. Then, after anonymization, 0 and 1 should\n    # be transformed to Male and Female.\n    \"\"\"\n    QI_num = len(QI_INDEX)\n    data = []\n    # oder categorical attributes in intuitive order\n    # here, we use the appear number\n    intuitive_dict = []\n    intuitive_order = []\n    intuitive_number = []\n    for i in range(QI_num):\n        intuitive_dict.append(dict())\n        intuitive_number.append(0)\n        intuitive_order.append(list())\n    data_file = open('data/adult.data', 'rU')\n    for line in data_file:\n        line = line.strip()\n        # remove empty and incomplete lines\n        # only 30162 records will be kept\n        if len(line) == 0 or '?' 
in line:\n            continue\n        # remove all spaces\n        line = line.replace(' ', '')\n        temp = line.split(',')\n        ltemp = []\n        for i in range(QI_num):\n            index = QI_INDEX[i]\n            if IS_CAT[i]:\n                try:\n                    ltemp.append(intuitive_dict[i][temp[index]])\n                except KeyError:\n                    intuitive_dict[i][temp[index]] = intuitive_number[i]\n                    ltemp.append(intuitive_number[i])\n                    intuitive_number[i] += 1\n                    intuitive_order[i].append(temp[index])\n            else:\n                ltemp.append(int(temp[index]))\n        ltemp.append(temp[SA_INDEX])\n        data.append(ltemp)\n    data_file.close()\n    return data, intuitive_order\n"
  },
  {
    "path": "utils/read_file.py",
    "content": "# !/usr/bin/env python\n'''\nread csv data, \nsupport numeric, category, time date\n\nauthor : Liu Kun\ndate   : 2018-10\n'''\n\nfrom datetime import datetime\n\n\n__DEBUG = False\n\ndef read_csv(file_path, \n        QI_INDEX,\n        IS_CAT,\n        IS_DATETIME,\n        SA_INDEX, \n        header=False, delimiter=',', encoding=\"utf-8\",\n        TIME_FORMAT_STR=\"%Y-%m-%d %H:%M:%S\"\n    ):\n    \"\"\"\n    read microdata for *.txt and return read data\n\n    # Note that Mondrian can only handle numeric attribute\n    # So, categorical attributes should be transformed to numberic attributes\n    # before anonymization. For example, Male and Female shold be transformed\n    # to 0, 1 during pre-processing. Then, after anonymization, 0 and 1 should\n    # be transformed to Male and Female.\n    \"\"\"\n    QI_num = len(QI_INDEX)\n    data = []\n    # oder categorical attributes in intuitive order\n    # here, we use the appear number\n    intuitive_dict = []\n    intuitive_order = []\n    intuitive_number = []\n    for i in range(QI_num):\n        intuitive_dict.append(dict())\n        intuitive_number.append(0)\n        intuitive_order.append(list())\n    with open(file_path, 'r', encoding=encoding) as data_file:\n        if header:\n            headers = data_file.readline()\n        for line in data_file:\n            if len(line) == 0 or '?' 
in line:\n                continue\n            temp = [item.strip() for item in line.split(delimiter)]\n            ltemp = []\n            if not all(temp):\n                continue\n            for i in range(QI_num):\n                index = QI_INDEX[i]\n                if IS_DATETIME[i]:\n                    t = datetime.strptime(temp[index], TIME_FORMAT_STR)\n                    ltemp.append(t)\n                elif IS_CAT[i]:\n                    try:\n                        ltemp.append(intuitive_dict[i][temp[index]])\n                    except KeyError:\n                        intuitive_dict[i][temp[index]] = intuitive_number[i]\n                        ltemp.append(intuitive_number[i])\n                        intuitive_number[i] += 1\n                        intuitive_order[i].append(temp[index])\n                else:\n                    ltemp.append(float(temp[index]))\n            ltemp.append(temp[SA_INDEX])\n            data.append(ltemp)\n        return data, intuitive_order\n\n"
  },
  {
    "path": "utils/read_informs_data.py",
    "content": "\"\"\"\nread informs dataset\n\"\"\"\n\n# !/usr/bin/env python\n# coding=utf-8\n\n# Read data and read tree fuctions for INFORMS data\n# user att ['DUID','PID','DUPERSID','DOBMM','DOBYY','SEX','RACEX','RACEAX','RACEBX','RACEWX','RACETHNX','HISPANX','HISPCAT','EDUCYEAR','Year','marry','income','poverty']\n# condition att ['DUID','DUPERSID','ICD9CODX','year']\n\n\n__DEBUG = False\nUSER_ATT = ['DUID', 'PID', 'DUPERSID', 'DOBMM', 'DOBYY', 'SEX', 'RACEX', 'RACEAX',\n            'RACEBX', 'RACEWX', 'RACETHNX', 'HISPANX', 'HISPCAT', 'EDUCYEAR',\n            'Year', 'marry', 'income', 'poverty']\nCONDITION_ATT = ['DUID', 'DUPERSID', 'ICD9CODX', 'year']\n# Only 5 relational attributes and 1 transaction attribute are selected (according to Poulis's paper)\nQI_INDEX = [3, 4, 6, 13, 16]\n__DEBUG = False\n\n\ndef read_data():\n    \"\"\"\n    read microda for *.txt and return read data\n    \"\"\"\n    data = []\n    userfile = open('data/demographics.csv', 'rU')\n    conditionfile = open('data/conditions.csv', 'rU')\n    userdata = {}\n    # We selet 3,4,5,6,13,15,15 att from demographics05, and 2 from condition05\n    # print \"Reading Data...\"\n    for i, line in enumerate(userfile):\n        line = line.strip()\n        # ignore first line of csv\n        if i == 0:\n            continue\n        row = line.split(',')\n        row[2] = row[2][1:-1]\n        try:\n            userdata[row[2]].append(row)\n        except:\n            userdata[row[2]] = row\n    conditiondata = {}\n    for i, line in enumerate(conditionfile):\n        line = line.strip()\n        # ignore first line of csv\n        if i == 0:\n            continue\n        row = line.split(',')\n        row[1] = row[1][1:-1]\n        row[2] = row[2][1:-1]\n        try:\n            conditiondata[row[1]].append(row)\n        except KeyError:\n            conditiondata[row[1]] = [row]\n    hashdata = {}\n    for k, v in list(userdata.items()):\n        if k in conditiondata:\n            temp = 
[]\n            for t in conditiondata[k]:\n                temp.append(t[2])\n            hashdata[k] = []\n            for i in range(len(QI_INDEX)):\n                index = QI_INDEX[i]\n                hashdata[k].append(v[index])\n            hashdata[k].append(temp)\n    for k, v in list(hashdata.items()):\n        data.append(v)\n    userfile.close()\n    conditionfile.close()\n    return data\n"
  },
  {
    "path": "utils/utility.py",
    "content": "# !/usr/bin/env python\n# coding:utf-8\n\"\"\"\npublic functions\n\"\"\"\n\nfrom datetime import datetime\nimport time\n\ndef cmp(x, y):\n    if x > y:\n        return 1\n    elif x==y:\n        return 0\n    else:\n        return -1\n\n\ndef cmp_str(element1, element2):\n    \"\"\"\n    compare number in str format correctley\n    \"\"\"\n    try:\n        return cmp(int(element1), int(element2))\n    except ValueError:\n        return cmp(element1, element2)\n\ndef cmp_value(element1, element2):\n    if isinstance(element1, str):\n        return cmp_str(element1, element2)\n    else:\n        return cmp(element1, element2)\n\n\ndef value(x):\n    '''Return the numeric type that supports addition and subtraction'''\n    if isinstance(x, (int, float)):\n        return float(x)\n    elif isinstance(x, datetime):\n        return time.mktime(x.timetuple())\n        # return x.timestamp() # not supported by python 2.7\n    else:\n        try:\n            return float(x)\n        except Exception as e:\n            return x\n\n\ndef merge_qi_value(x_left, x_right, connect_str='~'):\n    '''Connect the interval boundary value as a generalized interval and return the result as a string\n    return:\n        result:string\n    '''\n    if isinstance(x_left, (int, float)):\n        if x_left == x_right:\n            result = '%d' % (x_left)\n        else:\n            result = '%d%s%d' % (x_left, connect_str, x_right)\n    elif isinstance(x_left, str):\n        if x_left == x_right:\n            result = x_left\n        else:\n            result = x_left + connect_str + x_right\n    elif isinstance(x_left, datetime):\n        # Generalize the datetime type value\n        begin_date = x_left.strftime(\"%Y-%m-%d %H:%M:%S\")\n        end_date = x_right.strftime(\"%Y-%m-%d %H:%M:%S\")\n        result = begin_date + connect_str + end_date\n    return result\n\n\ndef covert_to_raw(result, intuitive_order, delimiter='~'):\n    \"\"\"\n    During preprocessing, 
categorical attributes are converted to\n    numeric attributes using intuitive order. This function will convert\n    these values back to their raw values. For example, Female and Male\n    may be converted to 0 and 1 during anonymization. Then we need to transform\n    them back to original values after anonymization.\n    \"\"\"\n    covert_result = []\n    qi_len = len(intuitive_order)\n    for record in result:\n        covert_record = []\n        for i in range(qi_len):\n            if len(intuitive_order[i]) > 0:\n                vtemp = ''\n                if delimiter in record[i]:\n                    temp = record[i].split(delimiter)\n                    raw_list = []\n                    for j in range(int(temp[0]), int(temp[1]) + 1):\n                        raw_list.append(intuitive_order[i][j])\n                    vtemp = delimiter.join(raw_list)\n                else:\n                    vtemp = intuitive_order[i][int(record[i])]\n                covert_record.append(vtemp)\n            else:\n                covert_record.append(record[i])\n        if isinstance(record[-1], str):\n            covert_result.append(covert_record + [record[-1]])\n        else:\n            covert_result.append(covert_record + [delimiter.join(record[-1])])\n    return covert_result\n\n"
  }
]