[
  {
    "path": "LICENSE",
    "content": "SOFTWARE LICENSE AGREEMENT FOR EVALUATION\n\nThis SOFTWARE EVALUATION LICENSE AGREEMENT (this \"Agreement\") is a legal contract between a person who uses or otherwise accesses or installs the Software (“User(s)”), and Nippon Telegraph and Telephone corporation (\"NTT\").\nREAD THE TERMS AND CONDITIONS OF THIS AGREEMENT CAREFULLY BEFORE INSTALLING OR OTHERWISE ACCESSING OR USING NTT'S PROPRIETARY SOFTWARE ACCOMPANIED BY THIS AGREEMENT (the \"SOFTWARE\"). THE SOFTWARE IS COPYRIGHTED AND IT IS LICENSED TO USER UNDER THIS AGREEMENT, NOT SOLD TO USER. BY INSTALLING OR OTHERWISE ACCESSING OR USING THE SOFTWARE, USER ACKNOWLEDGES THAT USER HAS READ THIS AGREEMENT, THAT USER UNDERSTANDS IT, AND THAT USER ACCEPTS AND AGREES TO BE BOUND BY ITS TERMS. IF AT ANY TIME USER IS NOT WILLING TO BE BOUND BY THE TERMS OF THIS AGREEMENT, USER SHOULD  TERMINATE THE INSTALLATION PROCESS, IMMEDIATELY CEASE AND REFRAIN FROM ACCESSING OR USING THE SOFTWARE AND DELETE ANY COPIES USER MAY HAVE. THIS AGREEMENT REPRESENTS THE ENTIRE AGREEMENT BETWEEN USER AND NTT CONCERNING THE SOFTWARE.\n\n \nBACKGROUND\nA.\tNTT is the owner of all rights, including all patent rights, copyrights and trade secret rights, in and to the Software and related documentation listed in Exhibit A to this Agreement.\nB.\tUser wishes to obtain a royalty free license to use the Software to enable User to evaluate, and NTT wishes to grant such a license to User, pursuant and subject to the terms and conditions of this Agreement.\nC.\tAs a condition to NTT's provision of the Software to User, NTT has required User to execute this Agreement.\nIn consideration of these premises, and the mutual promises and conditions in this Agreement, the parties hereby agree as follows:\n1. Grant of Evaluation License.     
NTT hereby grants to User, and User hereby accepts, under the terms and conditions of this Agreement, a royalty free, nontransferable and nonexclusive license to use the Software internally for the purposes of testing, analyzing, and evaluating the methods or mechanisms as shown in the research paper submitted by NTT to a certain academy. User may make a reasonable number of backup copies of the Software solely for User's internal use pursuant to the license granted in this Section 1.\n2.　Shipment and Installation.  NTT will ship or deliver the Software by any method that NTT deems appropriate. User shall be solely responsible for proper installation of the Software.\n3.　Term.  This Agreement is effective whichever is earlier (i) upon User’s acceptance of the Agreement, or (ii) upon User’s installing, accessing, and using the Software, even if User has not expressly accepted this Agreement. Without prejudice to any other rights, NTT may terminate this Agreement without notice to User (i) if User breaches or fails to comply with any of the limitations or other requirements described herein, and (ii) if NTT receives a notice from the academy stating that the research paper would not be published, and in any such case User agrees that NTT may, in addition to any other remedies it may have at law or in equity, remotely disable the Software. User may terminate this Agreement at any time by User’s decision to terminate the Agreement to NTT and ceasing use of the Software. Upon any termination or expiration of this Agreement for any reason, User agrees to uninstall the Software and either return to NTT the Software and all copies thereof, or to destroy all such materials and provide written verification of such destruction to NTT.\n4.\t   Proprietary Rights\n(a)\t   The Software is the valuable, confidential, and proprietary property of NTT, and NTT shall retain exclusive title to this property both during the term and after the termination of this Agreement.  
Without limitation, User acknowledges that all patent rights, copyrights and trade secret rights in the Software shall remain the exclusive property of NTT at all times. User shall use not less than reasonable care in safeguarding the confidentiality of the Software. \n(b)\t   USER SHALL NOT, IN WHOLE OR IN PART, AT ANY TIME DURING THE TERM OF OR AFTER THE TERMINATION OF THIS AGREEMENT: (i) SELL, ASSIGN, LEASE, DISTRIBUTE, OR OTHERWISE TRANSFER THE SOFTWARE TO ANY THIRD PARTY; (ii) EXCEPT AS OTHERWISE PROVIDED HEREIN, COPY OR REPRODUCE THE SOFTWARE IN ANY MANNER; (iii) DISCLOSE THE SOFTWARE TO ANY THIRD PARTY, EXCEPT TO USER'S EMPLOYEES WHO REQUIRE ACCESS TO THE SOFTWARE FOR THE PURPOSES OF THIS AGREEMENT; (iv) MODIFY, DISASSEMBLE, DECOMPILE, REVERSE ENGINEER OR TRANSLATE THE SOFTWARE; OR (v) ALLOW ANY PERSON OR ENTITY TO COMMIT ANY OF THE ACTIONS DESCRIBED IN (i) THROUGH (iv) ABOVE.\n(c)\t   User shall take appropriate action, by instruction, agreement, or otherwise, with respect to its employees permitted under this Agreement to have access to the Software to ensure that all of User's obligations under this Section 4 shall be satisfied.  \n5.　\t   Indemnity.  User shall defend, indemnify and hold harmless NTT, its agents and employees, from any loss, damage, or liability arising in connection with User's improper or unauthorized use of the Software. NTT SHALL HAVE THE SOLE RIGHT TO CONDUCT DEFEND ANY ACTTION RELATING TO THE SOFTWARE.\n6.\t   Disclaimer.  THE SOFTWARE IS LICENSED TO USER \"AS IS,\" WITHOUT ANY TRAINING, MAINTENANCE, OR SERVICE OBLIGATIONS WHATSOEVER ON THE PART OF NTT. NTT MAKES NO EXPRESS OR IMPLIED WARRANTIES OF ANY TYPE WHATSOEVER, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, OF FITNESS FOR A PARTICULAR PURPOSE AND OF NON-INFRINGEMENT ON COPYRIGHT OR ANY OTHER RIGHT OF THIRD PARTIES.  
USER ASSUMES ALL RISKS ASSOCIATED WITH ITS USE OF THE SOFTWARE, INCLUDING WITHOUT LIMITATION RISKS RELATING TO QUALITY, PERFORMANCE, DATA LOSS, AND UTILITY IN A PRODUCTION ENVIRONMENT. \n7.\t   Limitation of Liability.  IN NO EVENT SHALL NTT BE LIABLE TO USER OR TO ANY THIRD PARTY FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO DAMAGES FOR PERSONAL INJURY, PROPERTY DAMAGE, LOST PROFITS, OR OTHER ECONOMIC LOSS, ARISING IN CONNECTION WITH USER'S USE OF OR INABILITY TO USE THE SOFTWARE, IN CONNECTION WITH NTT'S PROVISION OF OR FAILURE TO PROVIDE SERVICES PERTAINING TO THE SOFTWARE, OR AS A RESULT OF ANY DEFECT IN THE SOFTWARE.  THIS DISCLAIMER OF LIABILITY SHALL APPLY REGARD¬LESS OF THE FORM OF ACTION THAT MAY BE BROUGHT AGAINST NTT, WHETHER IN CONTRACT OR TORT, INCLUDING WITHOUT LIMITATION ANY ACTION FOR NEGLIGENCE.  USER'S SOLE REMEDY IN THE EVENT OF ANY BREACH OF THIS AGREEMENT BY NTT SHALL BE TERMINATION PURSUANT TO SECTION 3.\n8.\t   No Assignment or Sublicense.  Neither this Agreement nor any right or license under this Agreement, nor the Software, may be sublicensed, assigned, or otherwise transferred by User without NTT's prior written consent.\n9.\t   General\n(a)\t   If any provision, or part of a provision, of this Agreement is or becomes illegal, unenforceable, or invalidated, by operation of law or otherwise, that provision or part shall to that extent be deemed omitted, and the remainder of this Agreement shall remain in full force and effect.\n(b)\t   This Agreement is the complete and exclusive statement of the agreement between the parties with respect to the subject matter hereof, and supersedes all written and oral contracts, proposals, and other communications between the parties relating to that subject matter.  \n(c)\t   Subject to Section 8, this Agreement shall be binding on, and shall inure to the benefit of, the respective successors and assigns of NTT and User.  
\n(d)\t   If either party to this Agreement initiates a legal action or proceeding to enforce or interpret any part of this Agreement, the prevailing party in such action shall be entitled to recover, as an element of the costs of such action and not as damages, its attorneys' fees and other costs associated with such action or proceeding.\n(e)\t   This Agreement shall be governed by and interpreted under the laws of Japan, without reference to conflicts of law principles. All disputes arising out of or in connection with this Agreement shall be finally settled by arbitration in Tokyo in accordance with the Commercial Arbitration Rules of the Japan Commercial Arbitration Association.  The arbitration shall be conducted by three (3) arbitrators and in Japanese. The award rendered by the arbitrators shall be final and binding upon the parties. Judgment upon the award may be entered in any court having jurisdiction thereof.\n(f)　　\t   NTT shall not be liable to the User or to any third party for any delay or failure to perform NTT’s obligation set forth under this Agreement due to any cause beyond NTT’s reasonable control.\n \nEXHIBIT A\nThe software and related data include the following files,\n- data_preprocessors\n- download_scripts\n- download.sh\n- process_data.sh\n- merge_datasets.py\n- instructdoc_instructions.xlsx\n- README\n"
  },
  {
    "path": "README.md",
    "content": "# InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions\nThis repository includes the InstructDoc dataset introduced by the following paper: Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. \"InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions\". In Proc. of AAAI. 2024.\n\n> We introduce InstructDoc, the first large-scale visual instruction tuning dataset that covers a wide range of VDU tasks and datasets.\n\n![Figure 1 from paper](example.png)\n\n\n# Get Started\n## 1. Download datasets\n```\nsh download.sh\n```\nThis script helps you to download most of the datasets automatically. For some datasets, due to the license issue and downloading restrictions, you need to manually download them by following the instructions in [download_scripts/README.md](download_scripts)\n\n## 2. Preprocess datasets\n```\nsh process_data.sh API_KEY\n```\nThis script helps you to process all the datasets. To extract OCR information from document images, we used Google Vision API and set the variables \"API_KEY\" to the API key obtained from [Google Cloud Platform](https://cloud.google.com/). To get one visit the [link](https://cloud.google.com/vision/docs/quickstart). <br><br>\nIf you encounter the FileNotFoundError while processing the datasets, please set the variable --input_data_dir in [data_processors](data_processors) to your dataset directory name correctly.\n\n## 3. Merge preprocessed datasets\n```\npython merge_datasets --max_samples 5000 --input_data_dir processed_data --save_dir ./\n```\nWe randomly sampled a maximum of 5000 instances for each held-in dataset. After processing datasets, you can obtain JSON files with the following format.\nIf the dataset provides multiple images per instance (e.g., SlideVQA), we add \"_list\" into the fields, including \"image\", \"ocr\", and \"bboxes\". 
\n\n<pre>\n   {\n      \"dataset_name\": dataset name,\n      \"id\": identifier of the instance,\n      \"image\" or \"image_list\": image path,\n      \"ocr\" or \"ocr_list\": OCR text,\n      \"bboxes\" or \"bboxes_list\": [x1, y1, x2, y2, w, h],\n      \"conversations\": [\n        {'user': 'human', 'value': randomly sampled instruction},\n        {'user': 'gpt', 'value': answer}\n      ]\n   }\n</pre>\n\n# Citation\n\nYou can cite it as follows:\n```bibtex\n@inproceedings{InstructDoc2024,\n  author    = {Ryota Tanaka and\n               Taichi Iki and\n               Kyosuke Nishida and\n               Kuniko Saito and\n               Jun Suzuki},\n  title     = {InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions},\n  booktitle = {AAAI},\n  year      = {2024}\n}\n```\n\nIf you have any questions about the paper or the repository, feel free to contact Ryota Tanaka (ryota.tanaka[at]ntt.com) or open an issue!\n"
  },
  {
    "path": "data_preprocessors/ai2d.py",
    "content": "import json\nimport os\nimport random\nimport glob\nfrom PIL import Image, ImageDraw, ImageFont\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nfrom transformers import BertTokenizer\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.question_dir = os.path.join(args.input_data_dir, f'questions')\n        self.ann_dir = os.path.join(args.input_data_dir, f'annotations')\n        self.img_dir = os.path.join(args.input_data_dir, f'images')\n        self.font = ImageFont.truetype(args.font_file, size=40)\n        self.dataset_name = 'ai2d'\n        self.split = ['train', 'test']\n\n    def sort_coordinate(self, bboxes):\n        return sorted(bboxes, key=lambda k: [k[1][1], k[1][0]])    \n\n    def create_data(self):\n        train = []\n        test = []\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        with open(os.path.join(self.data_dir, 'ai2d_test_ids.csv')) as f:\n            test_ids = f.read().splitlines()\n        for i, file in enumerate(tqdm(sorted(os.listdir(self.question_dir)))):\n            file_path = os.path.join(self.question_dir, file)\n            with open(file_path, 'r', encoding='utf-8') as f:\n                data = json.load(f)\n            annotation_path = os.path.join(self.ann_dir, file)\n            with open(annotation_path, 'r') as f:\n                ann = json.load(f) \n\n            index = file.replace('.png.json', '')\n            split = 'test' if str(index) in test_ids else 'train'\n\n            image_path = os.path.join(self.img_dir, file)\n            image_path = image_path.replace('.json', '')\n            img = Image.open(image_path)\n            draw = ImageDraw.Draw(img)\n\n            for index, text in 
ann['text'].items():\n                replacement_text = text['replacementText']\n                bbox = text['rectangle']\n                bbox = [bbox[0][0], bbox[0][1], bbox[1][0], bbox[1][1]]\n                text = text['value']\n                x1, y1, x2, y2 = bbox\n                draw.rectangle((x1, y1, x2, y2), outline=\"lime\", width=4)\n                draw.text((x1, y1-30), replacement_text, font=self.font, fill=\"blue\", align=\"center\")\n\n            image_path = os.path.join(self.out_data_dir, 'draw_images', f'{file.replace(\".json\", \"\")}')\n            os.makedirs(os.path.dirname(image_path), exist_ok=True)\n            img.save(image_path)\n            \n            for question, item in data['questions'].items():\n                options = item['answerTexts']\n                answer_index = item['correctAnswer']\n                value = options[answer_index]\n\n                instruction = random.choice(instructions)\n                instruction = instruction.replace('<key>', question).replace('<options>', str(options))\n                file_name = os.path.abspath(image_path)\n                metadata = {\n                    \"image\": file_name,\n                    \"conversations\": [\n                        {'from': 'human', 'value': instruction},\n                        {'from': 'gpt', 'value': f\"{value}\"},\n                    ],\n                }\n                if split == 'train':\n                    train.append(metadata)\n                elif split == 'test':\n                    test.append(metadata)\n\n        for split, results in [('train', train), ('test', test)]:\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(results)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(results, f)\n\nif __name__ == '__main__':\n    parser = 
argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/ai2d', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/ai2d', type=str)\n    parser.add_argument('--font_file', default='GoNotoCurrent.ttf', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/chartqa.py",
    "content": "import json\nimport os\nimport random\nimport argparse\n\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import load_instructions\nfrom google_vision_ocr import Google_OCR\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')\n        self.dataset_name = 'chartqa'\n        self.google_ocr = Google_OCR(args.api_key)\n        self.split = ['train', 'val', 'test']\n        os.makedirs(self.ocr_dir, exist_ok=True)\n            \n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            for qa_type in ['human', 'augmented']:\n                file_name = os.path.join(self.data_dir, f'{split}/{split}_{qa_type}.json')\n                with open(file_name, 'r') as f:\n                    data = json.load(f)\n                for d in tqdm(data):\n                    image_name = d['imgname']\n                    image_path = os.path.join(self.data_dir, f'{split}/png/{image_name}')\n                    ocr_path = os.path.join(self.ocr_dir, f'{image_name.replace(\".png\", \".json\")}')        \n                    try:\n                        img = Image.open(image_path)\n                        img_w, img_h = img.size\n                        if not os.path.exists(ocr_path):\n                            items = self.google_ocr.recognize_image(img)\n                            if items == \"error\":\n                                print('OCR error: ', image_path)\n                                continue\n                            with open(ocr_path, 'w') as f:\n                                json.dump(items, f)\n                        else:\n                       
     with open(ocr_path, 'r') as f:\n                                items = json.load(f)\n                        words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)\n                    except:\n                        words, bboxes = [], []\n                    \n                    question = d['query']\n                    value = d['label']\n                    instruction = random.choice(instructions)\n                    instruction = instruction.replace('<key>', question)\n                    ocr = ' '.join(words)\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr, \n                        \"bboxes\": bboxes, \n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': value},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/chartqa', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/chartqa', type=str)\n    parser.add_argument('--api_key', type=str, help='google vision api key')\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()\n"
  },
  {
    "path": "data_preprocessors/cord.py",
    "content": "import json\nimport os\nimport random\n\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, sort_coordinate, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'cord'\n        self.split = ['train', 'dev', 'test']\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            ann_dir = os.path.join(self.data_dir, f'{split}/json')\n            img_dir = os.path.join(self.data_dir, f'{split}/image')\n            for file in tqdm(sorted(os.listdir(ann_dir))):\n                file_path = os.path.join(ann_dir, file)\n                with open(file_path, 'r', encoding='utf-8') as f:\n                    data = json.load(f)\n\n                image_path = os.path.join(img_dir, file)\n                image_path = image_path.replace('.json', '.png')\n                image = Image.open(image_path)\n                w, h = image.size\n\n                items = []\n                labels = {}\n                for item in data[\"valid_line\"]:\n                    words, label = item[\"words\"], item[\"category\"]\n                    words = [w for w in words if w[\"text\"].strip() != \"\"]\n                    if len(words) == 0:\n                        continue\n                    text = \" \".join([word[\"text\"] for word in words])\n                    bbox = [words[0][\"quad\"][\"x1\"], words[0][\"quad\"][\"y1\"], words[-1][\"quad\"][\"x3\"], words[-1][\"quad\"][\"y3\"]]\n                    bbox = normalize_bbox(bbox, w, h)\n                    items.append((text, label, bbox))\n\n                items = sort_coordinate(items)\n                
ocr = []\n                bboxes = []\n                for item in items:\n                    words, label, bbox = item\n                    labels[words] = label\n                    ocr.append(words)\n                    bbox = [bbox] * len(words.split())\n                    bboxes += bbox\n                ocr = ' '.join(ocr)\n\n                for key in labels:\n                    instruction = random.choice(instructions)\n                    instruction = instruction.replace('<key>', key)\n                    value = labels[key]\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': value},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/cord/CORD', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/cord', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/deepform.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'deepform'\n        self.split = ['train', 'dev']\n\n    def create_ocr_data(self, split):\n        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')\n        with open(file_name, 'r') as f:\n            data = f.readlines()\n        ocrs = {}\n        for d in data:\n            d = json.loads(d)\n            image_name = d['name'].replace('.pdf', '')\n            try:\n                content = d['contents'][1] # microsoft cv\n            except:\n                content = d['contents'][0] # tesseract\n\n            bboxes = []\n            tokens = []\n            try:\n                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]\n                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):\n                    bbox = normalize_bbox(bbox, w, h)\n                    bboxes.append(bbox)\n                    tokens.append(token)\n            except:\n                pass\n            ocrs[image_name] = (' '.join(tokens), bboxes)\n            break\n        return ocrs\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, split, 'document.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.readlines()\n\n            ocrs = self.create_ocr_data(split)\n            target_format = []\n            for d in tqdm(data):\n                d = 
json.loads(d)\n                image_name = d['name'].replace('.pdf', '')\n                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')\n                file_name = os.path.abspath(file_name)\n                for ann in d['annotations']:\n                    instruction = random.choice(instructions)\n                    if 'children' in ann['values'][0]:\n                        for v in ann['values']:\n                            for child in v['children']:\n                                value = child['key']\n                                key = child['values'][0]['value']\n                                instruction = instruction.replace('<key>', key)\n                                ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1]\n\n                                target_format.append({\n                                    \"image\": file_name,\n                                    \"ocr\": ocr,\n                                    \"bboxes\": bboxes,\n                                    \"conversations\": [\n                                        {'from': 'human', 'value': instruction},\n                                        {'from': 'gpt', 'value': value},\n                                    ],\n                                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/DeepForm', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/deepform', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/docbank.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, sort_coordinate, load_instructions, convert_wh\nfrom transformers import BertTokenizer\nfrom collections import defaultdict\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'docbank'\n        self.split = ['train', 'valid', 'test']\n\n    def sort_coordinate(self, bboxes):\n        return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]])    \n\n    def create_ocr_data(self, data):\n        ocr_info = {}\n        for image_info in tqdm(data['images']):\n            file_name = image_info['file_name']\n            image_id = image_info['id']\n            width, height = image_info['width'], image_info['height']\n\n            image_path = os.path.join(self.data_dir,  f'DocBank_500K_ori_img/{file_name}')\n            txt_path = os.path.join(self.data_dir,  f'DocBank_500K_txt/{file_name.replace(\"_ori.jpg\", \".txt\")}')\n            with open(txt_path, 'r') as f:\n                txt_data = f.read().splitlines()\n\n            words = []\n            bboxes = []\n            for d in txt_data:      \n                d = d.split('\\t')\n                word = d[0]\n                word_position = convert_wh([int(d[1]), int(d[2]), int(d[3]), int(d[4])])\n                if word_position[0] >= word_position[2] or word_position[1] >= word_position[3]:\n                    continue\n                words.append(word)\n                bboxes.append(word_position)\n            \n            text_sequence = ' '.join(words)\n            ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height}\n        return ocr_info\n    \n    def 
create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            with open(os.path.join(self.data_dir, f'500K_{split}.json'), \"r\") as f:\n                data = json.load(f)\n\n            ocr_info = self.create_ocr_data(data)\n            categories = data['categories']\n\n            target_format = []\n            annotations = defaultdict(list)\n            for ann_info in data['annotations']:\n                image_id = ann_info['image_id']\n                annotations[image_id].append(ann_info)\n\n            for image_id in tqdm(annotations):\n                image_info = ocr_info[image_id]\n                image_path = image_info['image_path']\n                text_sequence = image_info['text_sequence']\n                bboxes = image_info['bboxes']\n                width, height = image_info['width'], image_info['height']\n\n                items = []\n                for ann in annotations[image_id]:\n                    category_id = ann['category_id']\n                    category_name = categories[category_id-1]['name']\n                    bbox = ann['bbox']\n                    bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]]\n                    bbox = normalize_bbox(bbox, width, height)\n                    items.append((category_name, bbox))\n                items = self.sort_coordinate(items)\n\n                dla = []\n                for item in items:\n                    category_name, bbox = item\n                    dla.append(f'{category_name} {bbox}')\n                value = ' '.join(dla)\n\n                instruction = random.choice(instructions)        \n                file_name = os.path.abspath(image_path)\n\n                target_format.append({\n                    \"image\": file_name,\n                    \"ocr\": text_sequence,\n                    \"bboxes\": bboxes,\n                    \"conversations\": [\n                   
     {'from': 'human','value': instruction},\n                        {'from': 'gpt', 'value': value},\n                    ],\n                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/docbank', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/docbank', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/docile.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import sort_coordinate, load_instructions, normalize_bbox\nimport argparse\nfrom collections import defaultdict\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'docile'\n        self.ann_dir = os.path.join(args.input_data_dir, f'annotations')\n        self.img_dir = os.path.join(args.input_data_dir, f'images')\n        self.ocr_dir = os.path.join(args.input_data_dir, f'ocr')\n        self.split = ['train', 'val']\n\n    def extract_ocr_info(self, ocr_data):\n        tokens = []\n        bboxes = []\n        for page in ocr_data['pages']:\n            for block in page['blocks']:\n                for line in block['lines']:\n                    for word in line['words']:\n                        left_top, right_bottom = word['geometry']\n                        bbox = normalize_bbox([left_top[0], left_top[1], right_bottom[0], right_bottom[1]])\n                        bboxes.append(bbox)\n                        tokens.append(word['value'])\n        return tokens, bboxes\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, f'{split}.json')\n            with open(file_name, 'r') as f:\n                ann_filenames = json.load(f)\n\n            target_format = []\n            for id, file in enumerate(tqdm(ann_filenames)):\n                image_path = os.path.join(self.img_dir, file + '0001-1.jpg')\n                with open(os.path.join(self.ocr_dir, f'{file}.json'), 'r', encoding='utf-8') as f:\n                    ocr_data = json.load(f)\n                with 
open(os.path.join(self.ann_dir, f'{file}.json'), 'r', encoding='utf-8') as f:\n                    d = json.load(f)\n\n                items = []\n                for item in d[\"field_extractions\"]:\n                    if item[\"page\"] == 0:                \n                        text, label = item[\"text\"], item[\"fieldtype\"]\n                        bbox = item[\"bbox\"]\n                        items.append((text, label, bbox))\n                if len(items) == 0:\n                    continue\n                items = sort_coordinate(items)\n\n                labels = {}\n                for item in items:\n                    tokens, label, bbox = item\n                    labels[tokens] = label\n\n                tokens, bboxes = self.extract_ocr_info(ocr_data)\n                ocr = ' '.join(tokens)\n\n                for key in labels:\n                    instruction = random.choice(instructions)\n                    instruction = instruction.replace('<key>', key)\n                    value = labels[key]\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': value},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/docile/data/docile', type=str)\n    
parser.add_argument('--out_data_dir', default='processed_data/docile', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/doclaynet.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, sort_coordinate, load_instructions, convert_wh\nfrom collections import defaultdict\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'doclaynet'\n        self.split = ['train', 'val']\n\n    def sort_coordinate(self, bboxes):\n        return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]])    \n\n    def create_ocr_data(self, data):\n        ocr_info = {}\n        for image_info in data['images']:\n            file_name = image_info['file_name']\n            image_id = image_info['id']\n            image_path = os.path.join(self.data_dir,  f'PNG/{file_name}')\n            json_path = os.path.join(self.data_dir,  f'JSON/{file_name.replace(\".png\", \".json\")}')\n            width, height = image_info['width'], image_info['height']\n            with open(json_path, 'r') as f:\n                json_data = json.load(f)\n            items = []\n\n            for cell in json_data['cells']:\n                text = cell['text']\n                bbox = cell['bbox']\n                bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]]\n                bbox = convert_wh(normalize_bbox(bbox, width, height))\n                items.append((text, bbox))\n\n            items = self.sort_coordinate(items)\n            words = []\n            bboxes = []\n            for text, bbox in items:\n                words.append(text)\n                bboxes += bbox\n            text_sequence = ' '.join(words)\n            ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height}\n        return ocr_info\n    \n    def 
create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            with open(os.path.join(self.data_dir, f'COCO/{split}.json'), \"r\") as f:\n                data = json.load(f)\n            ocr_info = self.create_ocr_data(data)\n            categories = data['categories']\n\n            target_format = []\n            annotations = defaultdict(list)\n            for ann_info in data['annotations']:\n                image_id = ann_info['image_id']\n                annotations[image_id].append(ann_info)\n\n            for image_id in tqdm(annotations):\n                image_info = ocr_info[image_id]\n                image_path = image_info['image_path']\n                text_sequence = image_info['text_sequence']\n                bboxes = image_info['bboxes']\n                width, height = image_info['width'], image_info['height']\n\n                items = []\n                for ann in annotations[image_id]:\n                    category_id = ann['category_id']\n                    category_name = categories[category_id-1]['name']\n                    bbox = ann['bbox']\n                    bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3], bbox[2], bbox[3]]\n                    bbox = normalize_bbox(bbox, width, height)\n                    items.append((category_name, bbox))\n                items = self.sort_coordinate(items)\n\n                dla = []\n                for item in items:\n                    category_name, bbox = item\n                    dla.append(f'{category_name} {bbox}')\n                value = ' '.join(dla)\n\n                instruction = random.choice(instructions)        \n                file_name = os.path.abspath(image_path)\n\n                target_format.append({\n                    \"image\": file_name,\n                    \"ocr\": text_sequence,\n                    \"bboxes\": bboxes,\n                    \"conversations\": [\n   
                     {'from': 'human','value': instruction},\n                        {'from': 'gpt', 'value': value},\n                    ],\n                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/doclaynet', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/doclaynet', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/docvqa.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'docvqa'\n        self.split = ['train', 'dev']\n\n    def create_ocr_data(self, split):\n        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')\n        with open(file_name, 'r') as f:\n            data = f.readlines()\n        ocrs = {}\n        for d in data:\n            d = json.loads(d)\n            image_name = d['name'].replace('.pdf', '')\n            try:\n                content = d['contents'][1] # microsoft cv\n            except:\n                content = d['contents'][0] # tesseract\n\n            bboxes = []\n            tokens = []\n            try:\n                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]\n                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):\n                    bbox = normalize_bbox(bbox, w, h)\n                    bboxes.append(bbox)\n                    tokens.append(token)\n            except:\n                pass\n            ocrs[image_name] = (' '.join(tokens), bboxes)\n        return ocrs\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, split, 'document.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.readlines()\n\n            ocrs = self.create_ocr_data(split)\n            target_format = []\n            for d in tqdm(data):\n                d = json.loads(d)\n        
        image_name = d['name'].replace('.pdf', '')\n                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')\n                file_name = os.path.abspath(file_name)\n                for ann in d['annotations']:\n                    instruction = random.choice(instructions)\n                    question = ann['key']\n                    instruction = instruction.replace('<key>', question)\n                    bboxes = []\n                    ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1]\n                    value = ann['values'][0]['value']\n                    values = ann['values'][0]['value_variants']\n\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr, \n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'instruction': instruction},\n                            {'from': 'gpt', 'value': value, 'values': values},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/docvqa', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/docvqa', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/docvqa_iq.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'docvqa_iq'\n        self.split = ['train', 'dev']\n\n    def create_ocr_data(self, split):\n        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')\n        with open(file_name, 'r') as f:\n            data = f.readlines()\n        ocrs = {}\n        for d in data:\n            d = json.loads(d)\n            image_name = d['name'].replace('.pdf', '')\n            try:\n                content = d['contents'][1] # microsoft cv\n            except:\n                content = d['contents'][0] # tesseract\n\n            bboxes = []\n            tokens = []\n            try:\n                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]\n                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):\n                    bbox = normalize_bbox(bbox, w, h)\n                    bboxes.append(bbox)\n                    tokens.append(token)\n            except:\n                pass\n            ocrs[image_name] = (' '.join(tokens), bboxes)\n        return ocrs\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, split, 'document.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.readlines()\n\n            ocrs = self.create_ocr_data(split)\n            target_format = []\n            questions = []\n            for d in data:\n                d 
= json.loads(d)\n                for ann in d['annotations']:\n                    question = ann['key']\n                    questions.append(question)\n\n            for d in tqdm(data):\n                d = json.loads(d)\n                image_name = d['name'].replace('.pdf', '')\n                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')\n                file_name = os.path.abspath(file_name)\n                for ann in d['annotations']:\n                    instruction = random.choice(instructions)\n                    if random.random() > 0.5:\n                        question = random.choice(questions)\n                        value = 'no'\n                    else:\n                        question = ann['key']\n                        value = 'yes'\n\n                    instruction = instruction.replace('<key>', question)\n                    bboxes = []\n                    ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1]\n\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr, \n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'instruction': instruction},\n                            {'from': 'gpt', 'value': value},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/docvqa', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/docvqa_iq', type=str)\n    args = parser.parse_args()\n    \n    dataset = 
InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/dude.py",
    "content": "import json\nimport os\nimport random\nimport argparse\nimport glob\n\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.image_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/images')\n        self.ocr_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/OCR')\n        self.dataset_name = 'dude'\n        self.split = ['train', 'val']\n            \n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        file_name =  os.path.join(self.data_dir, '2023-03-23_DUDE_gt_test_PUBLIC.json')\n        with open(file_name, 'r') as f:\n            data = json.load(f)\n        train, validation = [],[]\n        for d in tqdm(data['data']):\n            docid = d['docId']\n            question = d['question']\n            split = d['data_split']          \n            if split in self.split:\n                image_paths = []\n                pages = len(glob.glob(os.path.join(self.image_dir, split, f'{docid}_*.jpg')))\n                for i in range(pages):\n                    image_path = os.path.join(self.image_dir, split, f'{docid}_{i}.jpg')\n                    image_path = os.path.abspath(image_path)                \n                    image_paths.append(image_path)\n\n                ocr_path = os.path.join(self.ocr_dir, f'Azure/{docid}_due.json')\n                try:\n                    with open(ocr_path, 'r') as f:\n                        ocr_info = json.load(f)\n                except:\n                    continue\n\n                structure_value = ocr_info['structures']['pages']['structure_value']\n                image_sizes = 
ocr_info['structures']['pages']['positions']\n                text_sequences = []\n                bboxes = []\n                for page_split, image_size in zip(structure_value, image_sizes):\n                    start = page_split[0]\n                    end = page_split[1]\n                    page_tokens = ' '.join(ocr_info['tokens'][start:end])\n                    page_bboxes = []\n                    for bbox in ocr_info['positions'][start:end]:\n                        # Pass width and height separately, matching the other preprocessors' calls.\n                        bbox = normalize_bbox(bbox, image_size[2], image_size[3])\n                        page_bboxes.append(bbox)\n                    text_sequences.append(page_tokens)\n                    bboxes.append(page_bboxes)\n                                \n                if len(text_sequences) != len(image_paths):\n                    continue\n\n                instruction = random.choice(instructions)\n                instruction = instruction.replace('<key>', question)\n                if 'answers' in d:\n                    # Unanswerable questions get the literal answer 'none'.\n                    if d['answer_type'] == 'not-answerable':\n                        value = 'none'\n                    else:\n                        value = d['answers'][0]\n                else:\n                    value = ''\n\n                sample = {\n                    \"image_list\": image_paths,\n                    \"ocr_list\": text_sequences, \n                    \"bboxes_list\": bboxes, \n                    \"conversations\": [\n                        {'from': 'human', 'value': instruction},\n                        {'from': 'gpt', 'value': value},\n                    ],\n                }\n\n                if split == 'train':\n                    train.append(sample)\n                elif split == 'val':\n                    validation.append(sample)\n\n        for split, target_format in [('train', train), ('validation', validation)]:\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            
os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/dude', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/dude', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/funsd.py",
    "content": "import json\nimport os\nimport random\nimport cv2\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, sort_coordinate, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'funsd'\n        self.split = ['training', 'testing']\n        self.label_mapping = {'header': 'title',\n                              'question': 'key',\n                              'answer': 'value',\n                              'other': 'other'}\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            ann_dir = os.path.join(self.data_dir, f'{split}_data/annotations')\n            img_dir = os.path.join(self.data_dir, f'{split}_data/images')\n            for i, file in enumerate(tqdm(sorted(os.listdir(ann_dir)))):\n                file_path = os.path.join(ann_dir, file)\n                with open(file_path, 'r', encoding='utf-8') as f:\n                    data = json.load(f)\n\n                image_path = os.path.join(img_dir, file)\n                image_path = image_path.replace('.json', '.png') \n                image = cv2.imread(image_path)\n                h, w, _ = image.shape\n\n                items = []\n                for item in data[\"form\"]:\n                    text = item['text']\n                    words, label = item[\"words\"], item[\"label\"]\n                    label = self.label_mapping[label]\n                    words = [w for w in words if w[\"text\"].strip() != \"\"]\n                    if len(words) == 0:\n                        continue\n                    start_bbox, end_bbox = words[0]['box'], 
words[-1]['box']\n                    bbox = [start_bbox[0], start_bbox[1], end_bbox[2], start_bbox[3]]\n                    bbox = normalize_bbox(bbox, w, h)\n                    items.append((text, label, bbox))\n                items = sort_coordinate(items)\n\n                text_sequence = []\n                bboxes = []\n                labels = {}\n                for item in items:\n                    text, label, bbox = item\n                    labels[text] = label\n                    text_sequence.append(text)\n                    bbox = [bbox] * len(text)\n                    bboxes += bbox\n\n                ocr = ' '.join(text_sequence)\n                for key in labels:\n                    instruction = random.choice(instructions)\n                    instruction = instruction.replace('<key>', key)\n                    value = labels[key]\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': value},\n                        ],\n                    })\n\n            split = split.replace('ing', '')\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/funsd/dataset', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/funsd', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/google_vision_ocr.py",
    "content": "import base64\nimport json\nfrom requests import Request, Session\nfrom io import BytesIO\nfrom utils import normalize_bbox\n\nclass Google_OCR:\n    def __init__(self, api_key):\n        self.api_key = api_key\n\n    def pil_image_to_base64(self, pil_image):\n        buffered = BytesIO()\n        pil_image.save(buffered, format=\"PNG\")\n        str_encode_file = base64.b64encode(buffered.getvalue()).decode(\"utf-8\")\n        return str_encode_file\n\n    def recognize_image(self, pil_image):\n        str_encode_file = self.pil_image_to_base64(pil_image)\n        str_url = \"https://vision.googleapis.com/v1/images:annotate?key=\"\n        str_headers = {'Content-Type': 'application/json'}\n        str_json_data = {\n            'requests': [\n                {\n                    'image': {\n                        'content': str_encode_file\n                    },\n                    'features': [\n                        {\n                            'type': \"TEXT_DETECTION\",\n                        }\n                    ]\n                }\n            ]\n        }\n\n        obj_session = Session()\n        obj_request = Request(\"POST\",\n                              str_url + self.api_key,\n                              data=json.dumps(str_json_data),\n                              headers=str_headers\n                              )\n        obj_prepped = obj_session.prepare_request(obj_request)\n        obj_response = obj_session.send(obj_prepped,\n                                        verify=True,\n                                        timeout=60\n                                        )\n\n        if obj_response.status_code == 200:\n            return obj_response.json()\n\n        else:\n            return \"error\"\n\n    def extract_info(self, items, img_w, img_h):\n        words = []\n        bboxes = []\n        for page_ocrs in items['responses'][0]['fullTextAnnotation']['pages']:\n            for block_ocrs in 
page_ocrs['blocks']:\n                for para_ocrs in block_ocrs['paragraphs']:\n                    for word_ocrs in para_ocrs['words']:\n                        char_bboxes = []\n                        word = ''\n                        for sym_ocrs in word_ocrs['symbols']:\n                            try:\n                                bbox = sym_ocrs['boundingBox']\n                                xmin = max(0, bbox['vertices'][0]['x'])\n                                ymin = max(0, bbox['vertices'][0]['y'])\n                                xmax = max(0, bbox['vertices'][2]['x'])\n                                ymax = max(0, bbox['vertices'][2]['y'])\n                                bbox = [xmin, ymin, xmax, ymax]\n                            except:\n                                continue\n                            word += sym_ocrs['text']\n                            char_bboxes.append(bbox)\n                        if len(char_bboxes) > 0:\n                            x1 = [w_p[0] for w_p in char_bboxes]\n                            y1 = [w_p[1] for w_p in char_bboxes]\n                            x2 = [w_p[2] for w_p in char_bboxes]\n                            y2 = [w_p[3] for w_p in char_bboxes]\n                            word_bbox = [min(x1), min(y1), max(x2), max(y2)]\n                            if word_bbox[0] >= word_bbox[2] or word_bbox[1] >= word_bbox[3]:\n                                continue\n                            word_bbox = normalize_bbox(word_bbox, img_w, img_h)\n                            words.append(word)\n                            bboxes.append(word_bbox)\n        return words, bboxes\n"
  },
  {
    "path": "data_preprocessors/hwsquad.py",
    "content": "import json\nimport os\nimport random\nimport argparse\nimport csv\n\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nfrom collections import defaultdict\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'hwsquad'\n        self.split = ['train', 'val', 'test']\n    \n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            filename = os.path.join(self.data_dir, f\"HW-SQuAD_{split}_1.0.json\")\n            with open(filename, \"r\") as f:\n                annotations = json.load(f)\n\n            target_format = []\n            for ann in tqdm(annotations[\"data\"]):\n                qas = ann[\"qas\"]\n                image_path = ann[\"document_image\"][\"document_image\"]\n                h, w = ann[\"document_image\"][\"image_height\"], ann[\"document_image\"][\"image_width\"]\n\n                words = []\n                bboxes = []\n                for item in ann[\"document_image\"][\"gold_standard_transcription\"]:\n                    word = item[\"text\"]\n                    words.append(word)\n                    bbox = [item[\"xmin\"], item[\"ymin\"], item[\"xmax\"], item[\"ymax\"]]\n                    bbox = normalize_bbox(bbox, w, h)\n                    bboxes.append(bbox)\n                \n                for qa in qas:\n                    question = qa[\"question\"]\n                    start_index, end_index = qa[\"answers\"][0][\"answer_start_word_no\"], qa[\"answers\"][0][\"answer_end_word_no\"]+1    \n                    answer = words[start_index:end_index]\n                    answer = \" \".join(answer)\n\n                    instruction = 
random.choice(instructions)        \n                    instruction = instruction.replace('<key>', question)\n                    ocr = ' '.join(words)\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': answer},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/HW-SQuAD/HW-SQuAD_annotations', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/hwsquad', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/iconqa.py",
    "content": "import json\nimport os\nimport random\nimport glob\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'iconqa'\n        self.split = ['train', 'val']\n\n    def create_data(self):\n        for split in self.split:\n            for answer_style in ['fill_in_blank', 'choose_txt']:\n                target_format = []\n                dataset_name = f'{self.dataset_name}_{answer_style}'\n                instructions = load_instructions(self.instruction_path)[dataset_name]\n\n                data_dir = os.path.join(self.data_dir, f'{split}/{answer_style}/*')\n                for file_path in glob.glob(data_dir):\n                    data_path = os.path.join(file_path, 'data.json')\n                    image_path = os.path.join(file_path, 'image.png')\n                    with open(data_path, 'r') as f:\n                        data = json.load(f)\n                    question = data['question']\n                    instruction = random.choice(instructions)\n                    instruction = instruction.replace('<key>', question)\n                    if answer_style == 'fill_in_blank':\n                        value = data['answer']\n                    else:\n                        options = data['choices']\n                        answer_index = data['answer']\n                        value = str(options[answer_index])\n                        # str.replace() requires a string, so serialize the option list first\n                        instruction = instruction.replace('<options>', str(options))\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"conversations\": [\n                          
  {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': f\"{value}\"},\n                        ],\n                    })\n            \n                out_data_dir = f'{self.out_data_dir}_{answer_style}'\n                out_filepath = os.path.join(out_data_dir, f'{split}.json')        \n                os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n                print(f'{split}: {len(target_format)}')\n                with open(out_filepath, \"w\") as f:\n                    json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/iconqa/iconqa_data', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/iconqa', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/infographicvqa.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'infographicvqa'\n        self.split = ['train', 'dev']\n\n    def create_ocr_data(self, split):\n        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')\n        with open(file_name, 'r') as f:\n            data = f.readlines()\n        ocrs = {}\n        for d in data:\n            d = json.loads(d)\n            image_name = d['name'].replace('.pdf', '')\n            try:\n                content = d['contents'][1] # microsoft cv\n            except:\n                content = d['contents'][0] # tesseract\n\n            bboxes = []\n            tokens = []\n            try:\n                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]\n                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):\n                    bbox = normalize_bbox(bbox, w, h)\n                    bboxes.append(bbox)\n                    tokens.append(token)\n            except:\n                pass\n            ocrs[image_name] = (' '.join(tokens), bboxes)\n        return ocrs\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, split, 'document.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.readlines()\n\n            ocrs = self.create_ocr_data(split)\n            target_format = []\n            for d in tqdm(data):\n                d = 
json.loads(d)\n                image_name = d['name'].replace('.pdf', '')\n                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')\n                file_name = os.path.abspath(file_name)\n                for ann in d['annotations']:\n                    instruction = random.choice(instructions)\n                    question = ann['key']\n                    instruction = instruction.replace('<key>', question)\n                    bboxes = []\n                    ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1]\n                    value = ann['values'][0]['value']\n                    values = ann['values'][0]['value_variants']\n\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr, \n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'instruction': instruction},\n                            {'from': 'gpt', 'value': value, 'values': values},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/infographics_vqa', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/infographicvqa', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/klc.py",
    "content": "import json\nimport os\nimport random\n\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'klc'\n        self.split = ['train', 'dev']\n\n    def create_ocr_data(self, split):\n        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')\n        with open(file_name, 'r') as f:\n            data = f.readlines()\n        ocrs = {}\n        for d in data:\n            d = json.loads(d)\n            image_name = d['name'].replace('.pdf', '')\n            try:\n                content = d['contents'][1] # microsoft cv\n            except:\n                content = d['contents'][0] # tesseract\n\n            bboxes = []\n            tokens = []\n            try:\n                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]\n                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):\n                    bbox = normalize_bbox(bbox, w, h)\n                    bboxes.append(bbox)\n                    tokens.append(token)\n            except:\n                pass\n            ocrs[image_name] = (' '.join(tokens), bboxes)\n        return ocrs\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, split, 'document.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.readlines()\n\n            ocrs = self.create_ocr_data(split)\n            target_format = []\n            for d in tqdm(data):\n                d = json.loads(d)\n                image_name 
= d['name'].replace('.pdf', '')\n                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')\n                file_name = os.path.abspath(file_name)\n                for ann in d['annotations']:\n                    template = random.choice(instructions)\n                    if 'children' in ann['values'][0]:\n                        for v in ann['values']:\n                            for child in v['children']:\n                                value = child['key']\n                                key = child['values'][0]['value']\n                                # Fill the untouched template per child so each sample gets its own key\n                                instruction = template.replace('<key>', key)\n                                ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] \n\n                                target_format.append({\n                                    \"image\": file_name,\n                                    \"ocr\": ocr,\n                                    \"bboxes\": bboxes,\n                                    \"conversations\": [\n                                        {'from': 'human', 'value': instruction},\n                                        {'from': 'gpt', 'value': value},\n                                    ],\n                                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/kleister-charity', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/klc', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/llavar.py",
    "content": "import json\nimport os\nimport random\nimport argparse\n\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nfrom collections import defaultdict\nfrom google_vision_ocr import Google_OCR\n\nclass InstructData:\n    def __init__(self, args):\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')\n        self.image_dir = os.path.join(args.input_data_dir, 'images')\n        self.google_ocr = Google_OCR(args.api_key)\n        os.makedirs(self.ocr_dir, exist_ok=True)\n\n    def create_data(self):\n        file_name = os.path.join(self.data_dir, 'llava_instruct_150k_llavar_20k.json')\n        with open(file_name, 'r') as f:\n            data = json.load(f)\n        target_format = []\n        for d in data:\n            image_name = d[\"image\"]\n            image_path = os.path.join(self.image_dir, image_name)\n            if not os.path.exists(image_path):\n                continue\n\n            ocr_path = os.path.join(self.ocr_dir, f\"{image_name.replace('.jpg', '.json')}\")\n            try:\n                img = Image.open(image_path)\n                img_w, img_h = img.size\n                if not os.path.exists(ocr_path):\n                    items = self.google_ocr.recognize_image(img)\n                    if items == 'error':\n                        print('OCR error: ', image_path)\n                        continue\n                    with open(ocr_path, 'w') as f:\n                        json.dump(items, f)\n                else:\n                    with open(ocr_path, 'r') as f:\n                        items = json.load(f)\n                words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)\n            except:\n                words, bboxes = [], []\n\n            ocr = ' '.join(words)\n            file_name = 
os.path.abspath(image_path)\n            d[\"image\"] = file_name\n            d[\"ocr\"] = ocr\n            d[\"bboxes\"] = bboxes\n            target_format.append(d)\n\n        out_filepath = os.path.join(self.out_data_dir, 'train.json')        \n        os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n        print(f'train: {len(target_format)}')\n        with open(out_filepath, \"w\") as f:\n            json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/llavar', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/llavar', type=str)\n    parser.add_argument('--api_key', default='API_KEY', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/ocrvqa.py",
    "content": "import json\nimport os\nimport random\nimport argparse\nimport csv\n\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nfrom collections import defaultdict\nfrom google_vision_ocr import Google_OCR\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')\n        self.image_dir = os.path.join(args.input_data_dir, 'images')\n        self.dataset_name = 'ocrvqa'\n        self.google_ocr = Google_OCR(args.api_key)\n        self.split = ['train', 'dev', 'test']\n        self.split_dict = {1: 'train', 2: 'dev', 3: 'test'}\n        os.makedirs(self.ocr_dir, exist_ok=True)\n        \n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            file_name = os.path.join(self.data_dir, 'dataset.json')\n            with open(file_name, 'r') as f:\n                data = json.load(f)\n            for image_id in tqdm(data):\n                d = data[image_id]\n                split_id = d['split']\n                if split != self.split_dict[split_id]:\n                    continue\n                image_path = os.path.join(self.image_dir, f'{image_id}.jpg')\n                if not os.path.exists(image_path):\n                    continue\n\n                ocr_path = os.path.join(self.ocr_dir, f\"{image_id}.json\")\n                try:\n                    img = Image.open(image_path)\n                    img_w, img_h = img.size\n                    if not os.path.exists(ocr_path):\n                        items = self.google_ocr.recognize_image(img)\n                        if items == \"error\":\n                        
    print('error: ', image_path)\n                            continue\n                        with open(ocr_path, 'w') as f:\n                            json.dump(items, f)\n                    else:\n                        with open(ocr_path, 'r') as f:\n                            items = json.load(f)\n                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)\n                except:\n                    words, bboxes = [], []\n\n                ocr = ' '.join(words)\n                file_name = os.path.abspath(image_path)\n                for question, answer in zip(d['questions'], d['answers']):\n                    instruction = random.choice(instructions)        \n                    instruction = instruction.replace('<key>', question)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': answer},\n                        ],\n                    }) \n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/OCR-VQA-200K', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/ocrvqa', type=str)\n    parser.add_argument('--api_key', default='API_KEY', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/pwc.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'pwc'\n        self.split = ['train', 'dev']\n\n    def create_ocr_data(self, split):\n        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')\n        with open(file_name, 'r') as f:\n            data = f.readlines()\n        ocrs = {}\n        for d in data:\n            d = json.loads(d)\n            image_name = d['name'].replace('.pdf', '')\n            try:\n                content = d['contents'][1] # microsoft cv\n            except:\n                content = d['contents'][0] # tesseract\n\n            bboxes = []\n            tokens = []\n            try:\n                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]\n                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):\n                    bbox = normalize_bbox(bbox, w, h)\n                    bboxes.append(bbox)\n                    tokens.append(token)\n            except:\n                pass\n            ocrs[image_name] = (' '.join(tokens), bboxes)\n        return ocrs\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, split, 'document.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.readlines()\n\n            ocrs = self.create_ocr_data(split)\n            target_format = []\n            for d in tqdm(data):\n                d = 
json.loads(d)\n                image_name = d['name'].replace('.pdf', '')\n                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')\n                file_name = os.path.abspath(file_name)\n                for ann in d['annotations']:\n                    template = random.choice(instructions)\n                    if 'children' in ann['values'][0]:\n                        for v in ann['values']:\n                            for child in v['children']:\n                                value = child['key']\n                                key = child['values'][0]['value']\n                                # Fill the untouched template per child so each sample gets its own key\n                                instruction = template.replace('<key>', key)\n                                ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] \n\n                                target_format.append({\n                                    \"image\": file_name,\n                                    \"ocr\": ocr,\n                                    \"bboxes\": bboxes,\n                                    \"conversations\": [\n                                        {'from': 'human', 'value': instruction},\n                                        {'from': 'gpt', 'value': value},\n                                    ],\n                                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/AxCell', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/pwc', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/rvlcdip.py",
    "content": "import json\nimport os\nimport random\nimport argparse\n\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import load_instructions\nfrom google_vision_ocr import Google_OCR\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')\n        self.image_dir = os.path.join(args.input_data_dir, 'images')\n        self.dataset_name = 'rvlcdip'\n        self.google_ocr = Google_OCR(args.api_key)\n        self.split = ['train', 'val', 'test']\n        self.class_dict = {'4': \"advertisement\", '10': \"budget\", '2': \"email\", \n                          '8': \"file_folder\", '1': \"form\", '3': \"handwritten\", \n                          '11': \"invoice\", '0': \"letter\", '15': \"memo\", \n                          '9': \"news_article\", '12': \"presentation\", '13': \"questionnaire\", \n                          '14': \"resume\", '6': \"scientific_publication\", '5':\"scientific_report\", \n                          '7': \"specification\"}\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f:\n                labels = f.read().splitlines()\n            for label in labels:\n                filename, label = label.split(' ')\n                value = self.class_dict[label]\n                image_path = os.path.join(self.image_dir, filename)\n                # Derive the OCR cache path from the label-file name (file_name is only assigned later in the loop)\n                ocr_path = os.path.join(self.ocr_dir, f'{filename.replace(\".tif\", \".json\")}')\n                try:\n                    img = Image.open(image_path)\n                    img_w, img_h = img.size\n                    if not 
os.path.exists(ocr_path):\n                        items = self.google_ocr.recognize_image(img)\n                        if items == \"error\":\n                            print('OCR error: ', image_path)\n                            continue\n                        os.makedirs(os.path.dirname(ocr_path), exist_ok=True)\n                        with open(ocr_path, 'w') as f:\n                            json.dump(items, f)\n                    else:\n                        with open(ocr_path, 'r') as f:\n                            items = json.load(f)\n                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)\n                except:\n                    words, bboxes = [], []\n\n                instruction = random.choice(instructions)\n                ocr = ' '.join(words)\n\n                file_name = os.path.abspath(image_path)\n                target_format.append({\n                    \"image\": file_name,\n                    \"ocr\": ocr, \n                    \"bboxes\": bboxes,\n                    \"conversations\": [\n                        {'from': 'human', 'value': instruction},\n                        {'from': 'gpt', 'value': value},\n                    ],\n                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/rvlcdip', type=str)\n    parser.add_argument('--api_key', default='API_KEY', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/rvlcdip_io.py",
    "content": "import json\nimport os\nimport random\nimport argparse\n\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import load_instructions\nfrom google_vision_ocr import Google_OCR\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')\n        self.image_dir = os.path.join(args.input_data_dir, 'images')\n        self.dataset_name = 'rvlcdip_io'\n        self.google_ocr = Google_OCR(args.api_key)\n        self.split = ['train', 'val', 'test']\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            ocrs = []\n            with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f:\n                labels = f.read().splitlines()\n            for label in labels:\n                # The class label is not needed here: answers are 'yes'/'no', not class names\n                filename, label = label.split(' ')\n                image_path = os.path.join(self.image_dir, filename)\n                ocr_path = os.path.join(self.ocr_dir, f'{filename.replace(\".tif\", \".json\")}')\n                try:\n                    img = Image.open(image_path)\n                    img_w, img_h = img.size\n                    if not os.path.exists(ocr_path):\n                        items = self.google_ocr.recognize_image(img)\n                        if items == \"error\":\n                            print('OCR error: ', image_path)\n                            continue\n                        os.makedirs(os.path.dirname(ocr_path), exist_ok=True)\n                        with open(ocr_path, 'w') as f:\n                            json.dump(items, f)\n                    else:\n                        with open(ocr_path, 
'r') as f:\n                            items = json.load(f)\n                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)\n                except:\n                    words, bboxes = [], []\n\n                ocr = ' '.join(words)\n                # Keep the image path with its OCR so the second pass pairs them correctly\n                ocrs.append((image_path, ocr, bboxes))\n\n            for image_path, ocr, bboxes in ocrs:\n                instruction = random.choice(instructions)\n                if random.random() > 0.5:\n                    # Negative sample: swap in the OCR of a randomly drawn image\n                    _, ocr, bboxes = random.choice(ocrs)\n                    value = 'no'\n                else:\n                    value = 'yes'\n                \n                file_name = os.path.abspath(image_path)\n                target_format.append({\n                    \"image\": file_name,\n                    \"ocr\": ocr, \n                    \"bboxes\": bboxes, \n                    \"conversations\": [\n                        {'from': 'human', 'value': instruction},\n                        {'from': 'gpt', 'value': value},\n                    ],\n                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/rvlcdip_io', type=str)\n    parser.add_argument('--api_key', default='API_KEY', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/scicap.py",
    "content": "import json\nimport os\nimport random\nimport argparse\n\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import load_instructions\nfrom google_vision_ocr import Google_OCR\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.image_dir = os.path.join(args.input_data_dir, 'SciCap-No-Subfig-Img')\n        self.caption_dir = os.path.join(args.input_data_dir, 'SciCap-Caption-All')\n        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')\n        self.dataset_name = 'scicap'\n        self.google_ocr = Google_OCR(args.api_key)\n        self.split = ['train', 'val', 'test']\n        os.makedirs(self.ocr_dir, exist_ok=True)\n        \n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        ocr_info = {}\n        for split in self.split:\n            with open(os.path.join(self.data_dir, f'List-of-Files-for-Each-Experiments/Single-Sentence-Caption/No-Subfig/{split}/file_idx.json'), \"r\") as f:\n                split_info = json.load(f)\n            target_format = []\n            for file_name in tqdm(split_info):\n                image_path = os.path.join(self.image_dir, split, file_name)\n                caption_path = os.path.join(self.caption_dir,  split, f'{file_name.replace(\".png\", \".json\")}')\n                ocr_path = os.path.join(self.ocr_dir, f'{file_name.replace(\".png\", \".json\")}')\n\n                with open(caption_path, \"r\") as f:\n                    annotation = json.load(f)\n                try:\n                    img = Image.open(image_path)\n                    img_w, img_h = img.size\n                    if not os.path.exists(ocr_path):\n                        items = self.google_ocr.recognize_image(img)\n                        if items == 
\"error\":\n                            print('OCR error: ', image_path)\n                            continue\n                        with open(ocr_path, 'w') as f:\n                            json.dump(items, f)\n                    else:\n                        with open(ocr_path, 'r') as f:\n                            items = json.load(f)\n                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)\n                except:\n                    words, bboxes = [], []\n\n                value = annotation['1-lowercase-and-token-and-remove-figure-index']['caption']\n                instruction = random.choice(instructions)\n                ocr = ' '.join(words)\n\n                file_name = os.path.abspath(image_path)\n                target_format.append({\n                    \"image\": file_name,\n                    \"ocr\": ocr, \n                    \"bboxes\": bboxes, \n                    \"conversations\": [\n                        {'from': 'human', 'value': instruction},\n                        {'from': 'gpt', 'value': value},\n                    ],\n                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/scicap', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/scicap', type=str)\n    parser.add_argument('--api_key', default='API_KEY', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/scienceqa.py",
    "content": "import json\nimport os\nimport random\nimport glob\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, sort_coordinate, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'scienceqa'\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        train, val, test = [],[],[]\n        target_format = []\n        ann_filename = os.path.join(self.data_dir, 'data/scienceqa/problems.json')\n        with open(ann_filename, 'r') as f:\n            anns = json.load(f)\n        for questionId, ann in tqdm(anns.items()):\n            question = ann['question']\n            choices = ann['choices']\n            value = choices[ann['answer']]\n            split = ann['split']\n            image_name = ann['image']\n            if str(image_name) == 'null':\n                continue\n\n            image_path = os.path.join(self.data_dir, split, questionId, image_name)\n            instruction = random.choice(instructions)\n            instruction = instruction.replace('<key>', question).replace('<options>', str(choices))\n\n            file_name = os.path.abspath(image_path)\n            sample = {\n                \"image\": file_name,\n                \"conversations\": [\n                    {'from': 'human', 'value': instruction},\n                    {'from': 'gpt', 'value': f\"{value}\"},\n                ],\n            }\n            if split == 'train':\n                train.append(sample)\n            elif split == 'val':\n                val.append(sample)\n            elif split == 'train':\n                test.append(sample)\n        \n        for split, target_format in [('train', train), ('val', val), 
('test', test)]:\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/scienceqa', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/scienceqa', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/screen2words.py",
    "content": "import json\nimport os\nimport random\nimport argparse\n\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import load_instructions\nfrom google_vision_ocr import Google_OCR\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')\n        self.image_dir = os.path.join(args.input_data_dir, 'combined')\n        self.dataset_name = 'screen2words'\n        self.google_ocr = Google_OCR(args.api_key)\n        self.split = ['train', 'dev']\n        os.makedirs(self.ocr_dir, exist_ok=True)\n    \n    def load_captions(self):\n        with open(os.path.join(self.data_dir, 'screen_summaries.csv'), \"r\") as f:\n            lines = f.read().splitlines()\n        captions = {}\n        for i, line in enumerate(lines):\n            if i != 0:\n                items = line.split(',')\n                if len(items) > 2:\n                    screenId = items[0]\n                    summary = line[len(screenId)+1:]\n                else:\n                    screenId, summary = items\n                captions[screenId] = summary\n        return captions\n        \n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        captions = self.load_captions()\n        for split in self.split:\n            target_format = []\n            with open(os.path.join(self.data_dir, f'split/{split}_screens.txt'), \"r\") as f:\n                split_info = f.read().splitlines()\n            for split_id in tqdm(split_info):\n                image_path = os.path.join(self.image_dir,  f'{split_id}.jpg')\n                ocr_path = os.path.join(self.ocr_dir, f'{split_id}.json')\n                try:\n                    img = Image.open(image_path)\n                   
 img_w, img_h = img.size\n                    if not os.path.exists(ocr_path):\n                        items = self.google_ocr.recognize_image(img)\n                        if items == \"error\":\n                            print('OCR error: ', image_path)\n                            continue\n                        with open(ocr_path, 'w') as f:\n                            json.dump(items, f)\n                    else:\n                        with open(ocr_path, 'r') as f:\n                            items = json.load(f)\n                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)\n                except:\n                    words, bboxes = [], []\n\n                value = captions[split_id]\n                instruction = random.choice(instructions)\n                ocr = ' '.join(words)\n\n                file_name = os.path.abspath(image_path)\n                target_format.append({\n                    \"image\": file_name,\n                    \"ocr\": ocr, \n                    \"bboxes\": bboxes, \n                    \"conversations\": [\n                        {'from': 'human', 'value': instruction},\n                        {'from': 'gpt', 'value': value},\n                    ],\n                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/screen2words', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/screen2words', type=str)\n    parser.add_argument('--api_key', default='API_KEY', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/slidevqa.py",
    "content": "import json\nimport os\nimport random\nimport argparse\nimport glob\n\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import load_instructions\nfrom google_vision_ocr import Google_OCR\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')\n        self.image_dir = os.path.join(args.input_data_dir, 'images')\n        self.dataset_name = 'slidevqa'\n        self.google_ocr = Google_OCR(args.api_key)\n        self.split = ['train', 'val', 'test']\n        os.makedirs(self.ocr_dir, exist_ok=True)\n            \n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            file_name =  os.path.join(self.data_dir, 'annotations/qa', f'{split}.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.read().splitlines()\n            for d in tqdm(data):\n                question = d['question']\n                deck_name = d['deck_name']\n                value = d['answer']\n                image_paths = []\n                text_sequences = []\n                bboxes = []\n                for image_path in glob.glob(os.path.join(self.image_dir, deck_name, f'slide_*_1024.jpg')):\n                    image_path = os.path.abspath(image_path)\n                    image_name = os.path.basename(image_path)                \n                    image_paths.append(image_path)\n                    ocr_path = os.path.join(self.ocr_dir, f'{deck_name}_{image_name.replace(\".jpg\", \".json\")}')\n                    try:\n                        img = Image.open(image_path)\n                        img_w, img_h = img.size\n                        if not 
os.path.exists(ocr_path):\n                            items = self.google_ocr.recognize_image(img)\n                            if items == 'error':\n                                print('OCR error: ', image_path)\n                                continue\n                            with open(ocr_path, 'w') as f:\n                                json.dump(items, f)\n                        else:\n                            with open(ocr_path, 'r') as f:\n                                items = json.load(f)\n                        words, page_bboxes = self.google_ocr.extract_info(items, img_w, img_h)\n                    except:\n                        words, page_bboxes = [], []                        \n                    text_sequences.append(' '.join(words))\n                    bboxes.append(page_bboxes)\n                \n                instruction = random.choice(instructions)\n                instruction = instruction.replace('<key>', question)\n\n                file_name = os.path.abspath(image_path)\n                target_format.append({\n                    \"image_list\": image_paths,\n                    \"ocr_list\": text_sequences, \n                    \"bboxes_list\": bboxes, \n                    \"conversations\": [\n                        {'from': 'human', 'value': instruction},\n                        {'from': 'gpt', 'value': value},\n                    ],\n                })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/slidevqa', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/slidevqa', type=str)\n    
parser.add_argument('--api_key', type=str, help='google vision api key')\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/sroie.py",
    "content": "import json\nimport os\nimport random\nimport cv2\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'sroie'\n        self.split = ['train', 'test']\n\n    def sort_coordinate(self, bboxes):\n        return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]])    \n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            ann_dir = os.path.join(self.data_dir, f'{split}/entities')\n            img_dir = os.path.join(self.data_dir, f'{split}/img')\n            for file in tqdm(sorted(os.listdir(ann_dir))):\n                file_path = os.path.join(ann_dir, file)\n                with open(file_path, 'r', encoding='utf-8') as f:\n                    labels = json.load(f)\n                image_path = os.path.join(img_dir, file)\n                image_path = image_path.replace('.txt', '.jpg')\n                image = cv2.imread(image_path)\n                h, w, _ = image.shape\n                    \n                file_path = os.path.join(ann_dir.replace('entities', 'box'), file)\n                text_sequence = []\n                bboxes = []\n                with open(file_path, 'r', encoding='utf-8') as f:\n                    items = []\n                    for item in f.read().splitlines():\n                        bbox = item.split(',')[:8]\n                        text = item[len(','.join(bbox))+1:]\n                        bbox = [int(bbox[0]), int(bbox[1]), int(bbox[4]), int(bbox[5])]\n                        bbox = normalize_bbox(bbox, w, h)\n                        
items.append((text, bbox))\n                items = self.sort_coordinate(items)\n                for item in items:\n                    words, bbox = item\n                    text_sequence.append(words)\n                    bbox = [bbox] * len(words.split())\n                    bboxes += bbox\n                \n                ocr = ' '.join(text_sequence)\n                for label in labels:\n                    instruction = random.choice(instructions)\n                    # The entity name fills <key>; the annotated entity text is the answer.\n                    instruction = instruction.replace('<key>', label)\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': labels[label]},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/SROIE2019', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/sroie', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/tabfact.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'tabfact'\n        self.split = ['train', 'dev']\n        self.options = ['no', 'yes']\n\n    def create_ocr_data(self, split):\n        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')\n        with open(file_name, 'r') as f:\n            data = f.readlines()\n        ocrs = {}\n        for d in data:\n            d = json.loads(d)\n            image_name = d['name'].replace('.pdf', '')\n            try:\n                content = d['contents'][1] # microsoft cv\n            except:\n                content = d['contents'][0] # tesseract\n\n            bboxes = []\n            tokens = []\n            try:\n                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]\n                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):\n                    bbox = normalize_bbox(bbox, w, h)\n                    bboxes.append(bbox)\n                    tokens.append(token)\n            except:\n                pass\n            ocrs[image_name] = (' '.join(tokens), bboxes)\n        return ocrs\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, split, 'document.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.readlines()\n\n            ocrs = self.create_ocr_data(split)\n            target_format = []\n            for d in tqdm(data):\n    
            d = json.loads(d)\n                image_name = d['name'].replace('.pdf', '')\n                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')\n                file_name = os.path.abspath(file_name)\n                for ann in d['annotations']:\n                    instruction = random.choice(instructions)\n                    question = ann['key']\n                    instruction = instruction.replace('<key>', question)\n                    bboxes = []\n                    ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1]\n                    value = ann['values'][0]['value']\n                    value = self.options[int(value)]\n\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr, \n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': value},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/TabFact', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/tabfact', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/tatdqa.py",
    "content": "import json\nimport os\nimport random\nimport argparse\nimport csv\n\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'tatdqa'\n        self.split = ['train', 'dev', 'test']\n    \n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            file_name = os.path.join(self.data_dir, f'tatdqa_dataset_{split}.json')\n            with open(file_name, 'r') as f:\n                data = json.load(f)\n            for d in tqdm(data):\n                uid = d['doc']['uid']\n                page_num = d['doc']['page']\n                image_path = f'{split}/{uid}_{page_num}.png'\n                ocr_file_name = os.path.join(self.data_dir, f'{split}/{uid}.json')\n                with open(ocr_file_name, 'r') as f:\n                    ocrs = json.load(f)\n\n                text = []\n                bboxes = []\n                _, _, w, h = ocrs['pages'][page_num-1]['bbox']\n                for page in ocrs['pages']:\n                    for block in page['blocks']:\n                        text.append(block['text'])\n                        for bbox in block['words']['bbox_list']:\n                            bbox = normalize_bbox(bbox, w, h)\n                            bboxes.append(bbox)\n\n                for qa in d['questions']:\n                    question =qa['question']\n                    if 'answer' in qa:\n                        answer = qa['answer']\n                        if type(qa['answer']) == list:\n                            if len(qa['answer']) > 1:\n                           
     answer = ', '.join(answer)\n                            else:\n                                answer = answer[0]\n                    else:\n                        answer = \"\"\n\n                    instruction = random.choice(instructions)        \n                    instruction = instruction.replace('<key>', question)\n                    ocr = ' '.join(text)\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': answer},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/TAT-DQA', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/tatdqa', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/textbookqa.py",
    "content": "import json\nimport os\nimport random\nimport glob\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, sort_coordinate, load_instructions\nfrom transformers import BertTokenizer\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'textbookqa'\n        self.split = ['train', 'val', 'test']\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            ann_filename = f'{split}/tqa_v1_{split}.json' if split != 'test' else f'{split}/tqa_v2_{split}.json'\n            ann_filename = os.path.join(self.data_dir, ann_filename)\n            with open(ann_filename, 'r') as f:\n                anns = json.load(f)\n            for ann in tqdm(anns):\n                questions = ann['questions']\n                diagram_questions = questions['diagramQuestions']\n                if len(diagram_questions) == 0:\n                    continue\n\n                diagram_annotations = ann['diagramAnnotations']\n\n                for global_id, data in diagram_questions.items():\n                    options = []\n                    for option_id, choice in data['answerChoices'].items():\n                        choice = choice['processedText']\n                        options.append(choice)\n                    question = data['beingAsked']['processedText']\n                    value = data['correctAnswer']['rawText']\n                    image_path = data['imagePath']\n                    image_name = data['imageName']\n                    image_path = os.path.join(self.data_dir, f'{split}/{image_path}')\n                    if image_name in diagram_annotations:\n                
        annotation = diagram_annotations[image_name]\n                        bboxes = []\n                        ocr = []\n                        for item in annotation:\n                            text, bbox = item[\"text\"], item[\"rectangle\"]\n                            try:\n                                bbox = [bbox[0][0], bbox[0][1], bbox[1][0], bbox[1][1]]\n                            except:\n                                continue\n                            if len(text) > 0:\n                                bboxes.append(bbox)\n                                ocr.append(text)\n                        ocr = \" \".join(ocr)\n                    else:\n                        # No diagram annotation: emit empty OCR text and boxes instead of reusing stale boxes.\n                        ocr = \"\"\n                        bboxes = []\n                    instruction = random.choice(instructions)\n                    instruction = instruction.replace('<key>', question).replace('<options>', str(options))\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': f\"{value}\"},\n                        ],\n                    })\n            \n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/textbookqa', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/textbookqa', type=str)\n    args = parser.parse_args()\n    \n    dataset = 
InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/utils.py",
    "content": "import pandas as pd\n\ndef normalize_bbox(bbox, w=-1, h=-1):\n    if w > 0 and h > 0:\n        normalized_bbox = [\n            int(1000 * bbox[0] / w),\n            int(1000 * bbox[1] / h),\n            int(1000 * bbox[2] / w),\n            int(1000 * bbox[3] / h),\n        ]\n    else:\n        normalized_bbox = [\n            int(1000 * bbox[0]),\n            int(1000 * bbox[1]),\n            int(1000 * bbox[2]),\n            int(1000 * bbox[3]),\n        ]\n        \n    if len(bbox) == 4:\n        return convert_wh(normalized_bbox)\n    elif len(bbox) == 6:\n        return normalized_bbox\n\ndef convert_wh(bbox):\n    return [bbox[0], bbox[1], bbox[2], bbox[3], abs(bbox[2]-bbox[0]), abs(bbox[3]-bbox[1])]\n\ndef sort_coordinate(bboxes):\n    return sorted(bboxes , key=lambda k: [k[2][1], k[2][0]])    \n\ndef load_instructions(instruction_path):\n    instructions = {}\n    data = pd.read_excel(instruction_path)\n    for d in data.values:\n        dataset_name = d[0]\n        insts = []\n        for prompt in d[3:]:\n            if pd.isna(prompt):\n                break\n            insts.append(prompt)\n        instructions[dataset_name] = insts\n    return instructions\n"
  },
  {
    "path": "data_preprocessors/visualmrc.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'visualmrc'\n        self.split = ['train', 'dev', 'test']\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, f'data/{split}.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.readlines()\n            target_format = []\n            for d in tqdm(data):\n                d = json.loads(d)\n                file_name = os.path.join(self.data_dir, d['image_filename'])\n                file_name = os.path.abspath(file_name)\n                image = Image.open(file_name)\n                w, h = image.size\n\n                words = []\n                bboxes = []\n                for bbox in d['bounding_boxes']:\n                    if 'ocr_info' in bbox:\n                        for ocr in bbox['ocr_info']:\n                            word = ocr['word']\n                            bbox = ocr['bbox']\n                            bbox = [bbox['x'], bbox['y'], bbox['x']+bbox['width'], bbox['y']+bbox['height']]\n                            bbox = normalize_bbox(bbox, w, h)\n                            bboxes.append(bbox)\n                            words.append(word)\n\n                ocr = \" \".join(words)\n                for qa in d['qa_data']:\n                    question = qa['question']['text']\n                    value = qa['answer']['text']\n                    instruction = random.choice(instructions)\n                    instruction = 
instruction.replace('<key>', question)\n\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr, \n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': value},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/VisualMRC_official', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/visualmrc', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/websrc.py",
    "content": "import json\nimport os\nimport random\nimport argparse\nimport csv\n\nfrom PIL import Image\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nfrom collections import defaultdict\nfrom google_vision_ocr import Google_OCR\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')\n        self.dataset_name = 'websrc'\n        self.google_ocr = Google_OCR(args.api_key)\n        self.split = ['train', 'dev']\n        os.makedirs(self.ocr_dir, exist_ok=True)\n    \n    def load_split_info(self):\n        file_name = os.path.join(self.data_dir, 'dataset_split.csv')\n        with open(file_name) as f:\n            reader = csv.reader(f)\n            split_info = defaultdict(list)\n            for i, row in enumerate(reader):\n                if i == 0:\n                    continue\n                number = '0' + row[1] if int(row[1]) < 10 else  row[1]\n                split = row[3]\n                data_path = os.path.join(self.data_dir, f'{row[0]}/{number}/dataset.csv')\n                split_info[split].append(data_path)\n        return split_info\n        \n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        split_info = self.load_split_info()\n        for split in self.split:\n            target_format = []\n            for data_path in tqdm(split_info[split]):\n                with open(data_path) as f:\n                    data_dir = os.path.dirname(data_path)\n                    reader = csv.reader(f)\n                    for i, row in enumerate(reader):\n                        if i == 0:\n                            for index, element in enumerate(row):\n                                if 'question' == 
element:\n                                    question_index = index\n                                elif 'id' == element:\n                                    id_index = index\n                                elif 'answer' == element:\n                                    answer_index = index\n                            continue   \n                        questionId = row[id_index]\n                        image_path = os.path.join(data_dir, f'processed_data/{questionId[2:9]}.png')\n                        img = Image.open(image_path)\n                        img_w, img_h = img.size\n\n                        ocr_path = os.path.join(self.ocr_dir, f'{questionId[2:9]}.json')\n                        try:\n                            if not os.path.exists(ocr_path):\n                                items = self.google_ocr.recognize_image(img)\n                                if items == \"error\":\n                                    print('OCR error: ', image_path)\n                                    continue\n                                with open(ocr_path, 'w') as f:\n                                    json.dump(items, f)\n                            else:\n                                with open(ocr_path, 'r') as f:\n                                    items = json.load(f)\n                            words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)\n                        except:\n                            words, bboxes = [], []\n\n                        question = row[question_index]\n                        instruction = random.choice(instructions)        \n                        instruction = instruction.replace('<key>', question)\n                        ocr = ' '.join(words)\n                        value = row[answer_index]\n\n                        file_name = os.path.abspath(image_path)\n                        target_format.append({\n                            \"image\": file_name,\n                            \"ocr\": 
ocr,\n                            \"bboxes\": bboxes,\n                            \"conversations\": [\n                                {'from': 'human', 'value': instruction},\n                                {'from': 'gpt', 'value': value},\n                            ],\n                        })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/websrc', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/websrc', type=str)\n    parser.add_argument('--ocr_dir', default='raw_datasets/websrc/ocrs', type=str)\n    parser.add_argument('--api_key', default='API_KEY', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/wildreceipt.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, sort_coordinate, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'wildreceipt'\n        self.split = ['train', 'test']\n        self.classes = {}\n        for items in open(os.path.join(args.input_data_dir, 'class_list.txt')):\n            index, label = items.split()\n            self.classes[index] = label\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            target_format = []\n            with open(os.path.join(self.data_dir, f'{split}.txt')) as f:\n                samples = f.readlines()\n            for sample in tqdm(samples):\n                data = json.loads(sample)\n                file_name = data['file_name']\n                image_path = os.path.join(self.data_dir, file_name)\n                image = Image.open(image_path)\n                w, h = image.size\n\n                items = []\n                labels = {}\n                for item in data[\"annotations\"]:\n                    text, label_index = item[\"text\"], item[\"label\"]\n                    label = self.classes[str(label_index)]\n                    if label_index == 0:\n                        continue\n                    bbox = item[\"box\"]\n                    bbox = [bbox[0], bbox[1], bbox[4], bbox[5]]\n                    bbox = normalize_bbox(bbox, w, h)\n                    items.append((text, label, bbox))\n\n                items = sort_coordinate(items)\n\n                ocr = []\n                bboxes = []\n                for item in items:\n                    
words, label, bbox = item\n                    labels[words] = label\n                    ocr.append(words)\n                    bbox = [bbox] * len(words.split())\n                    bboxes += bbox\n                ocr = ' '.join(ocr)\n\n                for key in labels:\n                    instruction = random.choice(instructions)\n                    instruction = instruction.replace('<key>', key)\n                    value = labels[key]\n\n                    file_name = os.path.abspath(image_path)\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human', 'value': instruction},\n                            {'from': 'gpt', 'value': value},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/wildreceipt/wildreceipt', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/wildreceipt', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "data_preprocessors/wtq.py",
    "content": "import json\nimport os\nimport random\nfrom PIL import Image, ImageSequence\nfrom tqdm import tqdm \nfrom pathlib import Path\nfrom utils import normalize_bbox, load_instructions\nimport argparse\n\nclass InstructData:\n    def __init__(self, args):\n        self.instruction_path = Path('instructdoc_instructions.xlsx')\n        self.data_dir = args.input_data_dir\n        self.out_data_dir = args.out_data_dir\n        self.dataset_name = 'wtq'\n        self.split = ['train', 'dev']\n\n    def create_ocr_data(self, split):\n        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')\n        with open(file_name, 'r') as f:\n            data = f.readlines()\n        ocrs = {}\n        for d in data:\n            d = json.loads(d)\n            image_name = d['name'].replace('.pdf', '')\n            try:\n                content = d['contents'][1] # microsoft cv\n            except:\n                content = d['contents'][0] # tesseract\n\n            bboxes = []\n            tokens = []\n            try:\n                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]\n                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):\n                    bbox = normalize_bbox(bbox, w, h)\n                    bboxes.append(bbox)\n                    tokens.append(token)\n            except:\n                continue\n            ocrs[image_name] = (' '.join(tokens), bboxes)\n        return ocrs\n\n    def create_data(self):\n        instructions = load_instructions(self.instruction_path)[self.dataset_name]\n        for split in self.split:\n            file_name = os.path.join(self.data_dir, split, 'document.jsonl')\n            with open(file_name, 'r') as f:\n                data = f.readlines()\n\n            ocrs = self.create_ocr_data(split)\n            target_format = []\n            for d in tqdm(data):\n                d = json.loads(d)\n       
         image_name = d['name'].replace('.pdf', '')\n                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')\n                file_name = os.path.abspath(file_name)\n                for ann in d['annotations']:\n                    instruction = random.choice(instructions)\n                    question = ann['key']\n                    instruction = instruction.replace('<key>', question)\n                    ocr, bboxes = ocrs[image_name]\n                    value = ann['values'][0]['value']\n\n                    target_format.append({\n                        \"image\": file_name,\n                        \"ocr\": ocr,\n                        \"bboxes\": bboxes,\n                        \"conversations\": [\n                            {'from': 'human',  'value': instruction},\n                            {'from': 'gpt', 'value': value},\n                        ],\n                    })\n\n            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        \n            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n\n            print(f'{split}: {len(target_format)}')\n            with open(out_filepath, \"w\") as f:\n                json.dump(target_format, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/WikiTableQuestions', type=str)\n    parser.add_argument('--out_data_dir', default='processed_data/wtq', type=str)\n    args = parser.parse_args()\n    \n    dataset = InstructData(args)\n    dataset.create_data()"
  },
  {
    "path": "download.sh",
    "content": "#!/bin/bash\nexport DATASET_DIR=raw_datasets\n\nmkdir raw_datasets  \n\nsh ./download_scripts/due.sh\nsh ./download_scripts/websrc.sh\nsh ./download_scripts/funsd.sh\nsh ./download_scripts/iconqa.sh\nsh ./download_scripts/textbookqa.sh\nsh ./download_scripts/screen2words.shsh \nsh ./download_scripts/doclaynet.sh\nsh ./download_scripts/ai2d.sh\nsh ./download_scripts/wildreceipt.sh\n\n# font file for rendering text in AI2D dataset\nwget https://huggingface.co/Team-PIXEL/pixel-base-finetuned-masakhaner-swa/resolve/main/GoNotoCurrent.ttf\n"
  },
  {
    "path": "download_scripts/README.md",
    "content": "Beolow are the list for downloading datasets used in InstructDoc.\n### Automatically download datasets\n- DocVQA ([due.sh](download_scripts/due.sh))\n- InfographicVQA ([due.sh](download_scripts/due.sh))\n- PWC ([due.sh](download_scripts/due.sh))\n- KLC ([due.sh](download_scripts/due.sh))\n- DeepForm ([due.sh](download_scripts/due.sh))\n- TabFact ([due.sh](download_scripts/due.sh))\n- WebSRC ([websrc.sh](download_scripts/websrc.sh))\n- FUNSD ([funsd.sh](download_scripts/funsd.sh))\n- IconQA ([iconqa.sh](download_scripts/iconqa.sh))\n- TextbookQA ([textbookqa.sh](download_scripts/textbookqa.sh))\n- Screen2Words ([screen2words.sh](download_scripts/screen2words.sh))\n- DocLaynet ([doclaynet.sh](download_scripts/doclaynet.sh))\n- LLaVAR ([llavar.sh](download_scripts/llavar.sh))\n\n### Manually download \nAfter downloading below datasets, please place them under the directory \"raw_datasets\".\n- SROIE ([kaggle](https://www.kaggle.com/datasets/urbikn/sroie-datasetv2))\n- CORD ([google drive](https://drive.google.com/drive/folders/14OEWr86qotVBMAsWk7lymMytxn5u-kM6))\n- OCRVQA ([google drive](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing))\n- TAT-DQA ([google drive](https://drive.google.com/drive/folders/1SGpZyRWqycMd_dZim1ygvWhl5KdJYDR2))\n- ScienceQA ([google drive](https://drive.google.com/drive/folders/1w8imCXWYn2LxajmGeGH_g5DaL2rabHev))\n- ChartQA ([google drive](https://drive.google.com/file/d/17-aqtiq_KJ16PIGOp30W0y6OJNax6SVT/view))\n- RVL-CDIP ([goole docs](https://docs.google.com/uc?id=0Bz1dfcnrpXM-MUt4cHNzUEFXcmc&export=download))\n- HW-SQuAD ([onedrive](https://www.docvqa.org/datasets/benthamqa-and-hw-squad))\n- SciCap ([dropbox](https://www.dropbox.com/s/t1sjqesl0pynaxo/scicap_data.zip?dl=0))\n- DUDE ([project page](https://rrc.cvc.uab.es/?ch=23&com=introduction))\n- DocBank ([project page](https://doc-analysis.github.io/docbank-page/index.html))\n- DocILE ([projct page](https://docile.rossum.ai/))\n- 
VisualMRC ([project page](https://github.com/nttmdlab-nlp/VisualMRC), request authors via e-mail ryota.tanaka@ntt.com)\n- SlideVQA ([project page](https://github.com/nttmdlab-nlp/SlideVQA), request authors via e-mail ryota.tanaka@ntt.com)\n\n\n"
  },
  {
    "path": "download_scripts/ai2d.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading AI2D dataset...\"\nmkdir ai2d\ncd ai2d\nwget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip\nwget https://s3-us-east-2.amazonaws.com/prior-datasets/ai2d_test_ids.csv\nunzip ai2d-all.zip && rm ai2d-all.zip\n"
  },
  {
    "path": "download_scripts/doclaynet.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading DocLaynet dataset...\"\nmkdir doclaynet\ncd doclaynet\nwget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip\nwget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip\nunzip DocLayNet_core.zip && rm DocLayNet_core.zip\nunzip DocLayNet_extra.zip && rm DocLayNet_extra.zip\n"
  },
  {
    "path": "download_scripts/due.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading DocVQA dataset...\"\nmkdir docvqa\ncd docvqa\nwget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DocVQA.tar.gz\ntar xvf DocVQA.tar.gz && rm DocVQA.tar.gz\ncd ..\n\necho \"Donwloading InfoVQA dataset...\"\nmkdir infovqa\ncd infovqa\nwget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/InfographicsVQA.tar.gz\ntar xvf InfographicsVQA.tar.gz && rm InfographicsVQA.tar.gz\ncd ..\n\necho \"Donwloading TabFact dataset...\"\nmkdir tabfact\ncd tabfact\nwget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/TabFact.tar.gz\ntar xvf TabFact.tar.gz && rm TabFact.tar.gz\ncd ..\n\necho \"Donwloading WTQ dataset...\"\nmkdir wtq\ncd wtq\nwget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/WikiTableQuestions.tar.gz\ntar xvf WikiTableQuestions.tar.gz && rm WikiTableQuestions.tar.gz\ncd ..\n\necho \"Donwloading KLC dataset...\"\nmkdir klc\ncd klc\nwget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/KleisterCharity.tar.gz\ntar xvf KleisterCharity.tar.gz && rm KleisterCharity.tar.gz\ncd ..\n\necho \"Donwloading DeepForm dataset...\"\nmkdir deepform\ncd deepform\nwget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DeepForm.tar.gz\ntar xvf DeepForm.tar.gz && rm DeepForm.tar.gz\ncd ..\n\necho \"Donwloading PWC dataset...\"\nmkdir pwc\ncd pwc\nwget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/PWC.tar.gz\ntar xvf PWC.tar.gz && rm PWC.tar.gz\n"
  },
  {
    "path": "download_scripts/funsd.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading FUNSD dataset...\"\nmkdir funsd\ncd funsd\nwget https://guillaumejaume.github.io/FUNSD/dataset.zip\nunzip dataset.zip && rm dataset.zip\n"
  },
  {
    "path": "download_scripts/iconqa.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading IconQA dataset...\"\nmkdir iconqa\ncd iconqa\nwget https://iconqa2021.s3.us-west-1.amazonaws.com/iconqa_data.zip\nunzip iconqa_data.zip && rm iconqa_data.zip\n"
  },
  {
    "path": "download_scripts/llavar.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading LLaVAR dataset...\"\nmkdir llavar\ncd llavar\nwget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/llava_instruct_150k_llavar_20k.json\nmkdir images\ncd images\nwget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/finetune.zip\nunzip finetune.zip && rm finetune.zip\n"
  },
  {
    "path": "download_scripts/screen2words.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading Screen2Words dataset...\"\ngit clone https://github.com/google-research-datasets/screen2words.git\ncd screen2words\nwget https://storage.googleapis.com/crowdstf-rico-uiuc-4540/rico_dataset_v0.1/unique_uis.tar.gz\ntar xvf unique_uis.tar.gz && rm unique_uis.tar.gz\n"
  },
  {
    "path": "download_scripts/textbookqa.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading TextbookQA dataset...\"\nmkdir textbookqa\ncd textbookqa\nwget https://ai2-public-datasets.s3.amazonaws.com/tqa/tqa_train_val_test.zip\nunzip tqa_train_val_test.zip && rm tqa_train_val_test.zip\n"
  },
  {
    "path": "download_scripts/websrc.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading WebSRC dataset...\"\nmkdir websrc\ncd websrc\nwget https://websrc-data.s3.amazonaws.com/release.zip\nunzip release.zip && rm release.zip\n"
  },
  {
    "path": "download_scripts/wildreceipt.sh",
    "content": "cd $DATASET_DIR\n\necho \"Donwloading WildReceipt dataset...\"\nmkdir wildreceipt\ncd wildreceipt\nwget https://download.openmmlab.com/mmocr/data/wildreceipt.tar\ntar xvf wildreceipt.tar && rm wildreceipt.tar\n"
  },
  {
    "path": "merge_datasets.py",
    "content": "import os\nimport json\nimport random\nimport argparse\n\ntrain_val_datasets = ['klc', 'pwc', 'deepform', 'sroie', 'docile', 'wildreceipt', 'websrc', 'hwsquad',\n                      'visualmrc', 'iconqa_fill_in_blank', 'iconqa_choose_txt', 'scienceqa',\n                      'ai2d', 'docvqa', 'rvlcdip', 'textbookqa', 'wtq', 'tatdqa','scicap', 'llavar',\n                      'screen2words', 'doclaynet', 'docbank', 'docvqa_iq', 'rvlcdip_io', 'ocrvqa']\n\ndef merge_datasets(input_data_dir='./processed_data', save_dir='./', max_samples=5000):\n    questionId = 0\n    for split in [('train'), ('dev', 'val')]:\n        merge = []\n        for dataset_name in train_val_datasets:\n            for s in split:\n                dataset_path = os.path.join(input_data_dir, dataset_name, f'{s}.json')\n                if os.path.exists(dataset_path):\n                    with open(dataset_path, 'r') as f:\n                        data = json.load(f)\n            if len(data) == 0:\n                continue\n            random.shuffle(data)[:max_samples]\n            for d in data:\n                d[\"dataset_name\"] = dataset_name\n                d[\"id\"] = questionId\n                merge.append(d)\n        random.shuffle(merge)\n\n        out_filepath = os.path.join(save_dir, f'{split[0]}.json')\n        os.makedirs(os.path.dirname(out_filepath), exist_ok=True)\n        print(f'{split}: {len(merge)}')\n        with open(out_filepath, \"w\") as f:\n            json.dump(merge, f)\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--input_data_dir', default='processed_data', type=str)\n    parser.add_argument('--save_dir', default='./', type=str)\n    parser.add_argument('--max_samples', default=5000, type=int)\n    args = parser.parse_args()\n\n    merge_datasets(args.input_data_dir, args.save_dir, args.max_samples)"
  },
  {
    "path": "process_data.sh",
    "content": "#!/bin/bash\nAPI_KEY=$1\n\n# ===== KIE =====\npython data_preprocessors/docile.py\npython data_preprocessors/klc.py\npython data_preprocessors/deepform.py\npython data_preprocessors/funsd.py\npython data_preprocessors/pwc.py\npython data_preprocessors/wildreceipt.py\npython data_preprocessors/cord.py\npython data_preprocessors/sroi.py\n\n# ===== Single-page QA =====\npython data_preprocessors/visualmrc.py\npython data_preprocessors/websrc.py --api_key $API_KEY\npython data_preprocessors/ocrvqa.py --api_key $API_KEY\npython data_preprocessors/docvqa.py\npython data_preprocessors/hwsquad.py\n\n# ===== Single-page QA w/ Discrete Reasoning =====\npython data_preprocessors/tatdqa.py\npython data_preprocessors/wtq.py\n\n# ===== Single-page QA w/ Visual Reasoning =====\npython data_preprocessors/iconqa.py\npython data_preprocessors/ai2d.py\npython data_preprocessors/scienceqa.py\npython data_preprocessors/textbook.py\n\n# ===== Single-page QA w/ Discrete and Visual Reasoning =====\npython data_preprocessors/infographicvqa.py\npython data_preprocessors/chartqa.py --api_key $API_KEY\n\n# ===== Multi-page QA w/ Multi-hop, Discrete, and Visual Reasoning =====\npython data_preprocessors/slidevqa.py --api_key $API_KEY\npython data_preprocessors/dude.py\n\n# ===== Document NLI =====\npython data_preprocessors/tabfact.py\n\n# ===== Dialogue =====\npython data_preprocessors/llavar.py --api_key $API_KEY\n\n# ===== Captioning =====\npython data_preprocessors/scicap.py --api_key $API_KEY\npython data_preprocessors/screen2words.py --api_key $API_KEY\n\n# ===== Classification =====\npython data_preprocessors/rvlcdip.py --api_key $API_KEY\n\n# ===== ITM =====\npython data_preprocessors/rvlcdip_io.py --api_key $API_KEY\npython data_preprocessors/docvqa_iq.py\n\n# ===== DLA =====\npython data_preprocessors/docbank.py\npython data_preprocessors/doclaynet.py\n"
  }
]