Repository: nttmdlab-nlp/InstructDoc
Branch: main
Commit: fadcdabc1d07
Files: 51
Total size: 141.9 KB

Directory structure:
gitextract_pfacmpi3/
├── LICENSE
├── README.md
├── data_preprocessors/
│   ├── ai2d.py
│   ├── chartqa.py
│   ├── cord.py
│   ├── deepform.py
│   ├── docbank.py
│   ├── docile.py
│   ├── doclaynet.py
│   ├── docvqa.py
│   ├── docvqa_iq.py
│   ├── dude.py
│   ├── funsd.py
│   ├── google_vision_ocr.py
│   ├── hwsquad.py
│   ├── iconqa.py
│   ├── infographicvqa.py
│   ├── klc.py
│   ├── llavar.py
│   ├── ocrvqa.py
│   ├── pwc.py
│   ├── rvlcdip.py
│   ├── rvlcdip_io.py
│   ├── scicap.py
│   ├── scienceqa.py
│   ├── screen2words.py
│   ├── slidevqa.py
│   ├── sroie.py
│   ├── tabfact.py
│   ├── tatdqa.py
│   ├── textbookqa.py
│   ├── utils.py
│   ├── visualmrc.py
│   ├── websrc.py
│   ├── wildreceipt.py
│   └── wtq.py
├── download.sh
├── download_scripts/
│   ├── README.md
│   ├── ai2d.sh
│   ├── doclaynet.sh
│   ├── due.sh
│   ├── funsd.sh
│   ├── iconqa.sh
│   ├── llavar.sh
│   ├── screen2words.sh
│   ├── textbookqa.sh
│   ├── websrc.sh
│   └── wildreceipt.sh
├── instructdoc_instructions.xlsx
├── merge_datasets.py
└── process_data.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
SOFTWARE LICENSE AGREEMENT FOR EVALUATION

This SOFTWARE EVALUATION LICENSE AGREEMENT (this "Agreement") is a legal contract between a person who uses or otherwise accesses or installs the Software (“User(s)”), and Nippon Telegraph and Telephone corporation ("NTT"). READ THE TERMS AND CONDITIONS OF THIS AGREEMENT CAREFULLY BEFORE INSTALLING OR OTHERWISE ACCESSING OR USING NTT'S PROPRIETARY SOFTWARE ACCOMPANIED BY THIS AGREEMENT (the "SOFTWARE"). THE SOFTWARE IS COPYRIGHTED AND IT IS LICENSED TO USER UNDER THIS AGREEMENT, NOT SOLD TO USER.
BY INSTALLING OR OTHERWISE ACCESSING OR USING THE SOFTWARE, USER ACKNOWLEDGES THAT USER HAS READ THIS AGREEMENT, THAT USER UNDERSTANDS IT, AND THAT USER ACCEPTS AND AGREES TO BE BOUND BY ITS TERMS. IF AT ANY TIME USER IS NOT WILLING TO BE BOUND BY THE TERMS OF THIS AGREEMENT, USER SHOULD TERMINATE THE INSTALLATION PROCESS, IMMEDIATELY CEASE AND REFRAIN FROM ACCESSING OR USING THE SOFTWARE AND DELETE ANY COPIES USER MAY HAVE. THIS AGREEMENT REPRESENTS THE ENTIRE AGREEMENT BETWEEN USER AND NTT CONCERNING THE SOFTWARE. BACKGROUND A. NTT is the owner of all rights, including all patent rights, copyrights and trade secret rights, in and to the Software and related documentation listed in Exhibit A to this Agreement. B. User wishes to obtain a royalty free license to use the Software to enable User to evaluate, and NTT wishes to grant such a license to User, pursuant and subject to the terms and conditions of this Agreement. C. As a condition to NTT's provision of the Software to User, NTT has required User to execute this Agreement. In consideration of these premises, and the mutual promises and conditions in this Agreement, the parties hereby agree as follows: 1. Grant of Evaluation License. NTT hereby grants to User, and User hereby accepts, under the terms and conditions of this Agreement, a royalty free, nontransferable and nonexclusive license to use the Software internally for the purposes of testing, analyzing, and evaluating the methods or mechanisms as shown in the research paper submitted by NTT to a certain academy. User may make a reasonable number of backup copies of the Software solely for User's internal use pursuant to the license granted in this Section 1. 2. Shipment and Installation. NTT will ship or deliver the Software by any method that NTT deems appropriate. User shall be solely responsible for proper installation of the Software. 3. Term. 
This Agreement is effective whichever is earlier (i) upon User’s acceptance of the Agreement, or (ii) upon User’s installing, accessing, and using the Software, even if User has not expressly accepted this Agreement. Without prejudice to any other rights, NTT may terminate this Agreement without notice to User (i) if User breaches or fails to comply with any of the limitations or other requirements described herein, and (ii) if NTT receives a notice from the academy stating that the research paper would not be published, and in any such case User agrees that NTT may, in addition to any other remedies it may have at law or in equity, remotely disable the Software. User may terminate this Agreement at any time by User’s decision to terminate the Agreement to NTT and ceasing use of the Software. Upon any termination or expiration of this Agreement for any reason, User agrees to uninstall the Software and either return to NTT the Software and all copies thereof, or to destroy all such materials and provide written verification of such destruction to NTT. 4. Proprietary Rights (a) The Software is the valuable, confidential, and proprietary property of NTT, and NTT shall retain exclusive title to this property both during the term and after the termination of this Agreement. Without limitation, User acknowledges that all patent rights, copyrights and trade secret rights in the Software shall remain the exclusive property of NTT at all times. User shall use not less than reasonable care in safeguarding the confidentiality of the Software. 
(b) USER SHALL NOT, IN WHOLE OR IN PART, AT ANY TIME DURING THE TERM OF OR AFTER THE TERMINATION OF THIS AGREEMENT: (i) SELL, ASSIGN, LEASE, DISTRIBUTE, OR OTHERWISE TRANSFER THE SOFTWARE TO ANY THIRD PARTY; (ii) EXCEPT AS OTHERWISE PROVIDED HEREIN, COPY OR REPRODUCE THE SOFTWARE IN ANY MANNER; (iii) DISCLOSE THE SOFTWARE TO ANY THIRD PARTY, EXCEPT TO USER'S EMPLOYEES WHO REQUIRE ACCESS TO THE SOFTWARE FOR THE PURPOSES OF THIS AGREEMENT; (iv) MODIFY, DISASSEMBLE, DECOMPILE, REVERSE ENGINEER OR TRANSLATE THE SOFTWARE; OR (v) ALLOW ANY PERSON OR ENTITY TO COMMIT ANY OF THE ACTIONS DESCRIBED IN (i) THROUGH (iv) ABOVE. (c) User shall take appropriate action, by instruction, agreement, or otherwise, with respect to its employees permitted under this Agreement to have access to the Software to ensure that all of User's obligations under this Section 4 shall be satisfied. 5.  Indemnity. User shall defend, indemnify and hold harmless NTT, its agents and employees, from any loss, damage, or liability arising in connection with User's improper or unauthorized use of the Software. NTT SHALL HAVE THE SOLE RIGHT TO CONDUCT DEFEND ANY ACTTION RELATING TO THE SOFTWARE. 6. Disclaimer. THE SOFTWARE IS LICENSED TO USER "AS IS," WITHOUT ANY TRAINING, MAINTENANCE, OR SERVICE OBLIGATIONS WHATSOEVER ON THE PART OF NTT. NTT MAKES NO EXPRESS OR IMPLIED WARRANTIES OF ANY TYPE WHATSOEVER, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, OF FITNESS FOR A PARTICULAR PURPOSE AND OF NON-INFRINGEMENT ON COPYRIGHT OR ANY OTHER RIGHT OF THIRD PARTIES. USER ASSUMES ALL RISKS ASSOCIATED WITH ITS USE OF THE SOFTWARE, INCLUDING WITHOUT LIMITATION RISKS RELATING TO QUALITY, PERFORMANCE, DATA LOSS, AND UTILITY IN A PRODUCTION ENVIRONMENT. 7. Limitation of Liability. 
IN NO EVENT SHALL NTT BE LIABLE TO USER OR TO ANY THIRD PARTY FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO DAMAGES FOR PERSONAL INJURY, PROPERTY DAMAGE, LOST PROFITS, OR OTHER ECONOMIC LOSS, ARISING IN CONNECTION WITH USER'S USE OF OR INABILITY TO USE THE SOFTWARE, IN CONNECTION WITH NTT'S PROVISION OF OR FAILURE TO PROVIDE SERVICES PERTAINING TO THE SOFTWARE, OR AS A RESULT OF ANY DEFECT IN THE SOFTWARE. THIS DISCLAIMER OF LIABILITY SHALL APPLY REGARDLESS OF THE FORM OF ACTION THAT MAY BE BROUGHT AGAINST NTT, WHETHER IN CONTRACT OR TORT, INCLUDING WITHOUT LIMITATION ANY ACTION FOR NEGLIGENCE. USER'S SOLE REMEDY IN THE EVENT OF ANY BREACH OF THIS AGREEMENT BY NTT SHALL BE TERMINATION PURSUANT TO SECTION 3. 8. No Assignment or Sublicense. Neither this Agreement nor any right or license under this Agreement, nor the Software, may be sublicensed, assigned, or otherwise transferred by User without NTT's prior written consent. 9. General (a) If any provision, or part of a provision, of this Agreement is or becomes illegal, unenforceable, or invalidated, by operation of law or otherwise, that provision or part shall to that extent be deemed omitted, and the remainder of this Agreement shall remain in full force and effect. (b) This Agreement is the complete and exclusive statement of the agreement between the parties with respect to the subject matter hereof, and supersedes all written and oral contracts, proposals, and other communications between the parties relating to that subject matter. (c) Subject to Section 8, this Agreement shall be binding on, and shall inure to the benefit of, the respective successors and assigns of NTT and User.
(d) If either party to this Agreement initiates a legal action or proceeding to enforce or interpret any part of this Agreement, the prevailing party in such action shall be entitled to recover, as an element of the costs of such action and not as damages, its attorneys' fees and other costs associated with such action or proceeding. (e) This Agreement shall be governed by and interpreted under the laws of Japan, without reference to conflicts of law principles. All disputes arising out of or in connection with this Agreement shall be finally settled by arbitration in Tokyo in accordance with the Commercial Arbitration Rules of the Japan Commercial Arbitration Association. The arbitration shall be conducted by three (3) arbitrators and in Japanese. The award rendered by the arbitrators shall be final and binding upon the parties. Judgment upon the award may be entered in any court having jurisdiction thereof. (f) NTT shall not be liable to the User or to any third party for any delay or failure to perform NTT’s obligation set forth under this Agreement due to any cause beyond NTT’s reasonable control.

EXHIBIT A

The software and related data include the following files,
- data_preprocessors
- download_scripts
- download.sh
- process_data.sh
- merge_datasets.py
- instructdoc_instructions.xlsx
- README

================================================
FILE: README.md
================================================
# InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
This repository includes the InstructDoc dataset introduced by the following paper: Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. "InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions". In Proc. of AAAI. 2024.

> We introduce InstructDoc, the first large-scale visual instruction tuning dataset that covers a wide range of VDU tasks and datasets.

![Figure 1 from paper](example.png)

# Get Started
## 1. Download datasets
```
sh download.sh
```
This script downloads most of the datasets automatically. Some datasets must be downloaded manually because of license and download restrictions; follow the instructions in [download_scripts/README.md](download_scripts).

## 2. Preprocess datasets
```
sh process_data.sh API_KEY
```
This script processes all the datasets. We used the Google Vision API to extract OCR information from document images; replace `API_KEY` with an API key obtained from [Google Cloud Platform](https://cloud.google.com/). To get a key, follow the [quickstart](https://cloud.google.com/vision/docs/quickstart).

If you encounter a `FileNotFoundError` while processing the datasets, set `--input_data_dir` for the affected script in [data_preprocessors](data_preprocessors) to your dataset directory.

## 3. Merge preprocessed datasets
```
python merge_datasets.py --max_samples 5000 --input_data_dir processed_data --save_dir ./
```
We randomly sampled a maximum of 5000 instances for each held-in dataset. After processing the datasets, you obtain JSON files in the following format. If a dataset provides multiple images per instance (e.g., SlideVQA), the "image", "ocr", and "bboxes" fields take a "_list" suffix.
   {
      "dataset_name": dataset name,
      "id": identification of the instance,
      "image" or "image_list": image path,
      "ocr" or "ocr_list": ocr text,
      "bboxes" or "bboxes_list": [x1, y1, x2, y2, w, h],
      "conversations": [
        {'user': 'human', 'value': randomly sampled instruction},
        {'user': 'gpt', 'value': answer}
      ]
   }
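
Code that reads these files has to cope with both the single-image and the multi-image field names. The sketch below shows one way to do that in Python. The helper names `as_page_lists` and `cap_samples` are hypothetical and not part of this repository, and `cap_samples` only illustrates per-dataset capping in the spirit of `--max_samples`; it does not claim to reproduce what `merge_datasets.py` does internally.

```python
import random


def as_page_lists(record):
    """Return (images, ocrs, bboxes) as per-image lists for either field layout."""
    if "image_list" in record:
        # Multi-image instance: the fields already hold one entry per image.
        return record["image_list"], record["ocr_list"], record["bboxes_list"]
    # Single-image instance: wrap each field so callers always see lists.
    return [record["image"]], [record["ocr"]], [record["bboxes"]]


def cap_samples(instances, max_samples=5000, seed=0):
    """Keep at most max_samples instances, chosen uniformly at random."""
    instances = list(instances)
    if len(instances) <= max_samples:
        return instances
    # A seeded generator keeps the subset reproducible across runs.
    return random.Random(seed).sample(instances, max_samples)


# Toy records that follow the field names described above (values are made up).
single = {"image": "a.png", "ocr": "hello", "bboxes": [[0, 0, 10, 10, 10, 10]]}
multi = {"image_list": ["a.png", "b.png"], "ocr_list": ["x", "y"], "bboxes_list": [[], []]}

images, ocrs, bboxes = as_page_lists(single)
print(len(images), len(as_page_lists(multi)[0]))
print(len(cap_samples(range(12000))))
```

Wrapping the single-image case in one-element lists lets downstream code iterate over images uniformly instead of branching on the field name at every use.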

# Citation
You can cite it as follows:
```bibtex
@inproceedings{InstructDoc2024,
  author    = {Ryota Tanaka and Taichi Iki and Kyosuke Nishida and Kuniko Saito and Jun Suzuki},
  title     = {InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions},
  booktitle = {AAAI},
  year      = {2024}
}
```

If you have any questions about the paper and repository, feel free to contact Ryota Tanaka (ryota.tanaka[at]ntt.com) or open an issue!

================================================
FILE: data_preprocessors/ai2d.py
================================================
import json
import os
import random
import glob
from PIL import Image, ImageDraw, ImageFont
from tqdm import tqdm
from pathlib import Path
from utils import normalize_bbox, load_instructions
from transformers import BertTokenizer
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.question_dir = os.path.join(args.input_data_dir, f'questions')
        self.ann_dir = os.path.join(args.input_data_dir, f'annotations')
        self.img_dir = os.path.join(args.input_data_dir, f'images')
        self.font = ImageFont.truetype(args.font_file, size=40)
        self.dataset_name = 'ai2d'
        self.split = ['train', 'test']

    def sort_coordinate(self, bboxes):
        return sorted(bboxes, key=lambda k: [k[1][1], k[1][0]])

    def create_data(self):
        train = []
        test = []
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        with open(os.path.join(self.data_dir, 'ai2d_test_ids.csv')) as f:
            test_ids = f.read().splitlines()

        for i, file in enumerate(tqdm(sorted(os.listdir(self.question_dir)))):
            file_path = os.path.join(self.question_dir, file)
            with open(file_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            annotation_path = os.path.join(self.ann_dir, file)
            with open(annotation_path, 'r') as f:
                ann = json.load(f)
            index = file.replace('.png.json', '')
            split = 'test' if str(index) in test_ids else 'train'

            image_path = os.path.join(self.img_dir, file)
            image_path = image_path.replace('.json', '')
            img = Image.open(image_path)
            draw = ImageDraw.Draw(img)
            # Draw each annotated text box and its replacement label onto the image.
            for index, text in ann['text'].items():
                replacement_text = text['replacementText']
                bbox = text['rectangle']
                bbox = [bbox[0][0], bbox[0][1], bbox[1][0], bbox[1][1]]
                text = text['value']
                x1, y1, x2, y2 = bbox
                draw.rectangle((x1, y1, x2, y2), outline="lime", width=4)
                draw.text((x1, y1-30), replacement_text, font=self.font, fill="blue", align="center")

            image_path = os.path.join(self.out_data_dir, 'draw_images', f'{file.replace(".json", "")}')
            os.makedirs(os.path.dirname(image_path), exist_ok=True)
            img.save(image_path)

            for question, item in data['questions'].items():
                options = item['answerTexts']
                answer_index = item['correctAnswer']
                value = options[answer_index]
                instruction = random.choice(instructions)
                instruction = instruction.replace('', question).replace('', str(options))
                file_name = os.path.abspath(image_path)
                metadata = {
                    "image": file_name,
                    "conversations": [
                        {'from': 'human', 'value': instruction},
                        {'from': 'gpt', 'value': f"{value}"},
                    ],
                }
                if split == 'train':
                    train.append(metadata)
                elif split == 'test':
                    test.append(metadata)

        for split, results in [('train', train), ('test', test)]:
            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
            print(f'{split}: {len(results)}')
            with open(out_filepath, "w") as f:
                json.dump(results, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/ai2d', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/ai2d', type=str)
    parser.add_argument('--font_file', default='GoNotoCurrent.ttf', type=str)
    args = parser.parse_args()

    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/chartqa.py
================================================
import json
import os
import random
import argparse
from PIL import Image
from tqdm import tqdm
from pathlib import Path
from utils import load_instructions
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.dataset_name = 'chartqa'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'val', 'test']
        os.makedirs(self.ocr_dir, exist_ok=True)

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            for qa_type in ['human', 'augmented']:
                file_name = os.path.join(self.data_dir, f'{split}/{split}_{qa_type}.json')
                with open(file_name, 'r') as f:
                    data = json.load(f)
                for d in tqdm(data):
                    image_name = d['imgname']
                    image_path = os.path.join(self.data_dir, f'{split}/png/{image_name}')
                    ocr_path = os.path.join(self.ocr_dir, f'{image_name.replace(".png", ".json")}')
                    try:
                        img = Image.open(image_path)
                        img_w, img_h = img.size
                        # Run OCR only when no cached result exists for this image.
                        if not os.path.exists(ocr_path):
                            items = self.google_ocr.recognize_image(img)
                            if items == "error":
                                print('OCR error: ', image_path)
                                continue
                            with open(ocr_path, 'w') as f:
                                json.dump(items, f)
                        else:
                            with open(ocr_path, 'r') as f:
                                items = json.load(f)
                        words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                    except:
                        words, bboxes = [], []

                    question = d['query']
                    value = d['label']
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('', question)
                    ocr = ' '.join(words)
                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/chartqa', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/chartqa', type=str)
    parser.add_argument('--api_key', type=str, help='google vision api key')
    args = parser.parse_args()

    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/cord.py
================================================
import json
import os
import random
from PIL import Image
from tqdm import tqdm
from pathlib import Path
from utils import normalize_bbox, sort_coordinate, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'cord'
        self.split = ['train', 'dev', 'test']

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            ann_dir = os.path.join(self.data_dir, f'{split}/json')
            img_dir = os.path.join(self.data_dir, f'{split}/image')
            for file in tqdm(sorted(os.listdir(ann_dir))):
                file_path = os.path.join(ann_dir, file)
                with open(file_path, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                image_path = os.path.join(img_dir, file)
                image_path = image_path.replace('.json', '.png')
                image = Image.open(image_path)
                w, h = image.size

                items = []
                labels = {}
                for item in data["valid_line"]:
                    words, label = item["words"], item["category"]
                    words = [w for w in words if w["text"].strip() != ""]
                    if len(words) == 0:
                        continue
                    text = " ".join([word["text"] for word in words])
                    # Line box: top-left of the first word to bottom-right of the last word.
                    bbox = [words[0]["quad"]["x1"], words[0]["quad"]["y1"], words[-1]["quad"]["x3"], words[-1]["quad"]["y3"]]
                    bbox = normalize_bbox(bbox, w, h)
                    items.append((text, label, bbox))
                items = sort_coordinate(items)

                ocr = []
                bboxes = []
                for item in items:
                    words, label, bbox = item
                    labels[words] = label
                    ocr.append(words)
                    # Repeat the line box once per whitespace-separated token.
                    bbox = [bbox] * len(words.split())
                    bboxes += bbox
                ocr = ' '.join(ocr)

                for key in labels:
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('', key)
                    value = labels[key]
                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/cord/CORD', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/cord', type=str)
    args = parser.parse_args()

    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/deepform.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'deepform'
        self.split = ['train', 'dev']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()
        ocrs = {}
        # Collect OCR for every document; create_data looks each one up by name.
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1] # microsoft cv
            except:
                content = d['contents'][0] # tesseract
            bboxes = []
            tokens = []
            try:
                _, _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except:
                pass
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()
            ocrs = self.create_ocr_data(split)

            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
                image_name = d['name'].replace('.pdf', '')
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    instruction = random.choice(instructions)
                    if 'children' in ann['values'][0]:
                        for v in ann['values']:
                            for child in v['children']:
                                value = child['key']
                                key = child['values'][0]['value']
                                instruction = instruction.replace('', key)
                                ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1]
                                target_format.append({
                                    "image": file_name,
                                    "ocr": ocr,
                                    "bboxes": bboxes,
                                    "conversations": [
                                        {'from': 'human', 'value': instruction},
                                        {'from': 'gpt', 'value': value},
                                    ],
                                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/DeepForm', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/deepform', type=str)
    args = parser.parse_args()

    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/docbank.py
================================================ import json import os import random from PIL import Image from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, sort_coordinate, load_instructions, convert_wh from transformers import BertTokenizer from collections import defaultdict import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'docbank' self.split = ['train', 'valid', 'test'] def sort_coordinate(self, bboxes): return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]]) def create_ocr_data(self, data): ocr_info = {} for image_info in tqdm(data['images']): file_name = image_info['file_name'] image_id = image_info['id'] width, height = image_info['width'], image_info['height'] image_path = os.path.join(self.data_dir, f'DocBank_500K_ori_img/{file_name}') txt_path = os.path.join(self.data_dir, f'DocBank_500K_txt/{file_name.replace("_ori.jpg", ".txt")}') with open(txt_path, 'r') as f: txt_data = f.read().splitlines() words = [] bboxes = [] for d in txt_data: d = d.split('\t') word = d[0] word_position = convert_wh([int(d[1]), int(d[2]), int(d[3]), int(d[4])]) if word_position[0] >= word_position[2] or word_position[1] >= word_position[3]: continue words.append(word) bboxes.append(word_position) text_sequence = ' '.join(words) ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height} return ocr_info def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: with open(os.path.join(self.data_dir, f'500K_{split}.json'), "r") as f: data = json.load(f) ocr_info = self.create_ocr_data(data) categories = data['categories'] target_format = [] annotations = defaultdict(list) for ann_info in data['annotations']: image_id = ann_info['image_id'] 
annotations[image_id].append(ann_info) for image_id in tqdm(annotations): image_info = ocr_info[image_id] image_path = image_info['image_path'] text_sequence = image_info['text_sequence'] bboxes = image_info['bboxes'] width, height = image_info['width'], image_info['height'] items = [] for ann in annotations[image_id]: category_id = ann['category_id'] category_name = categories[category_id-1]['name'] bbox = ann['bbox'] bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]] bbox = normalize_bbox(bbox, width, height) items.append((category_name, bbox)) items = self.sort_coordinate(items) dla = [] for item in items: category_name, bbox = item dla.append(f'{category_name} {bbox}') value = ' '.join(dla) instruction = random.choice(instructions) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": text_sequence, "bboxes": bboxes, "conversations": [ {'from': 'human','value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/docbank', type=str) parser.add_argument('--out_data_dir', default='processed_data/docbank', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/docile.py ================================================ import json import os import random from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import sort_coordinate, load_instructions, normalize_bbox import argparse from collections import defaultdict class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') 
self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'docile' self.ann_dir = os.path.join(args.input_data_dir, f'annotations') self.img_dir = os.path.join(args.input_data_dir, f'images') self.ocr_dir = os.path.join(args.input_data_dir, f'ocr') self.split = ['train', 'val'] def extract_ocr_info(self, ocr_data): tokens = [] bboxes = [] for page in ocr_data['pages']: for block in page['blocks']: for line in block['lines']: for word in line['words']: left_top, right_bottom = word['geometry'] bbox = normalize_bbox([left_top[0], left_top[1], right_bottom[0], right_bottom[1]]) bboxes.append(bbox) tokens.append(word['value']) return tokens, bboxes def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: file_name = os.path.join(self.data_dir, f'{split}.json') with open(file_name, 'r') as f: ann_filenames = json.load(f) target_format = [] for id, file in enumerate(tqdm(ann_filenames)): image_path = os.path.join(self.img_dir, file + '0001-1.jpg') with open(os.path.join(self.ocr_dir, f'{file}.json'), 'r', encoding='utf-8') as f: ocr_data = json.load(f) with open(os.path.join(self.ann_dir, f'{file}.json'), 'r', encoding='utf-8') as f: d = json.load(f) items = [] for item in d["field_extractions"]: if item["page"] == 0: text, label = item["text"], item["fieldtype"] bbox = item["bbox"] items.append((text, label, bbox)) if len(items) == 0: continue items = sort_coordinate(items) labels = {} for item in items: tokens, label, bbox = item labels[tokens] = label tokens, bboxes = self.extract_ocr_info(ocr_data) ocr = ' '.join(tokens) for key in labels: instruction = random.choice(instructions) instruction = instruction.replace('', key) value = labels[key] file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = 
os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/docile/data/docile', type=str) parser.add_argument('--out_data_dir', default='processed_data/docile', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/doclaynet.py ================================================ import json import os import random from PIL import Image from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, sort_coordinate, load_instructions, convert_wh from collections import defaultdict import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'doclaynet' self.split = ['train', 'val'] def sort_coordinate(self, bboxes): return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]]) def create_ocr_data(self, data): ocr_info = {} for image_info in data['images']: file_name = image_info['file_name'] image_id = image_info['id'] image_path = os.path.join(self.data_dir, f'PNG/{file_name}') json_path = os.path.join(self.data_dir, f'JSON/{file_name.replace(".png", ".json")}') width, height = image_info['width'], image_info['height'] with open(json_path, 'r') as f: json_data = json.load(f) items = [] for cell in json_data['cells']: text = cell['text'] bbox = cell['bbox'] bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]] bbox = convert_wh(normalize_bbox(bbox, width, height)) items.append((text, bbox)) items = self.sort_coordinate(items) words = [] bboxes = [] for text, bbox in items: words.append(text) bboxes += bbox text_sequence = 
' '.join(words) ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height} return ocr_info def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: with open(os.path.join(self.data_dir, f'COCO/{split}.json'), "r") as f: data = json.load(f) ocr_info = self.create_ocr_data(data) categories = data['categories'] target_format = [] annotations = defaultdict(list) for ann_info in data['annotations']: image_id = ann_info['image_id'] annotations[image_id].append(ann_info) for image_id in tqdm(annotations): image_info = ocr_info[image_id] image_path = image_info['image_path'] text_sequence = image_info['text_sequence'] bboxes = image_info['bboxes'] width, height = image_info['width'], image_info['height'] items = [] for ann in annotations[image_id]: category_id = ann['category_id'] category_name = categories[category_id-1]['name'] bbox = ann['bbox'] bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3], bbox[2], bbox[3]] bbox = normalize_bbox(bbox, width, height) items.append((category_name, bbox)) items = self.sort_coordinate(items) dla = [] for item in items: category_name, bbox = item dla.append(f'{category_name} {bbox}') value = ' '.join(dla) instruction = random.choice(instructions) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": text_sequence, "bboxes": bboxes, "conversations": [ {'from': 'human','value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/doclaynet', type=str) parser.add_argument('--out_data_dir',
default='processed_data/doclaynet', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/docvqa.py ================================================ import json import os import random from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'docvqa' self.split = ['train', 'dev'] def create_ocr_data(self, split): file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = {} for d in data: d = json.loads(d) image_name = d['name'].replace('.pdf', '') try: content = d['contents'][1] # microsoft cv except: content = d['contents'][0] # tesseract bboxes = [] tokens = [] try: _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): bbox = normalize_bbox(bbox, w, h) bboxes.append(bbox) tokens.append(token) except: pass ocrs[image_name] = (' '.join(tokens), bboxes) return ocrs def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: file_name = os.path.join(self.data_dir, split, 'document.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = self.create_ocr_data(split) target_format = [] for d in tqdm(data): d = json.loads(d) image_name = d['name'].replace('.pdf', '') file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') file_name = os.path.abspath(file_name) for ann in d['annotations']: instruction = random.choice(instructions) question = ann['key'] instruction = instruction.replace('', 
question) bboxes = [] ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] value = ann['values'][0]['value'] values = ann['values'][0]['value_variants'] target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'instruction': instruction}, {'from': 'gpt', 'value': value, 'values': values}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/docvqa', type=str) parser.add_argument('--out_data_dir', default='processed_data/docvqa', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/docvqa_iq.py ================================================ import json import os import random from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'docvqa_iq' self.split = ['train', 'dev'] def create_ocr_data(self, split): file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = {} for d in data: d = json.loads(d) image_name = d['name'].replace('.pdf', '') try: content = d['contents'][1] # microsoft cv except: content = d['contents'][0] # tesseract bboxes = [] tokens = [] try: _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] for token, bbox in zip(content['common_format']['tokens'], 
content['common_format']['positions']): bbox = normalize_bbox(bbox, w, h) bboxes.append(bbox) tokens.append(token) except: pass ocrs[image_name] = (' '.join(tokens), bboxes) return ocrs def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: file_name = os.path.join(self.data_dir, split, 'document.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = self.create_ocr_data(split) target_format = [] questions = [] for d in data: d = json.loads(d) for ann in d['annotations']: question = ann['key'] questions.append(question) for d in tqdm(data): d = json.loads(d) image_name = d['name'].replace('.pdf', '') file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') file_name = os.path.abspath(file_name) for ann in d['annotations']: instruction = random.choice(instructions) if random.random() > 0.5: question = random.choice(questions) value = 'no' else: question = ann['key'] value = 'yes' instruction = instruction.replace('', question) bboxes = [] ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'instruction': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/docvqa', type=str) parser.add_argument('--out_data_dir', default='processed_data/docvqa_iq', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/dude.py ================================================ import json import os import random import argparse import 
glob from PIL import Image from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.image_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/images') self.ocr_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/OCR') self.dataset_name = 'dude' self.split = ['train', 'val'] def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] file_name = os.path.join(self.data_dir, '2023-03-23_DUDE_gt_test_PUBLIC.json') with open(file_name, 'r') as f: data = json.load(f) train, validation = [],[] for d in tqdm(data['data']): docid = d['docId'] question = d['question'] split = d['data_split'] if split in self.split: image_paths = [] pages = len(glob.glob(os.path.join(self.image_dir, split, f'{docid}_*.jpg'))) for i in range(pages): image_path = os.path.join(self.image_dir, split, f'{docid}_{i}.jpg') image_path = os.path.abspath(image_path) image_paths.append(image_path) ocr_path = os.path.join(self.ocr_dir, f'Azure/{docid}_due.json') try: with open(ocr_path, 'r') as f: ocr_info = json.load(f) except: continue structure_value = ocr_info['structures']['pages']['structure_value'] image_sizes = ocr_info['structures']['pages']['positions'] text_sequences = [] bboxes = [] for page_split, image_size in zip(structure_value, image_sizes): start = page_split[0] end = page_split[1] page_tokens = ' '.join(ocr_info['tokens'][start:end]) page_bboxes = [] for bbox in ocr_info['positions'][start:end]: bbox = normalize_bbox(bbox, image_size[2], image_size[3]) page_bboxes.append(bbox) text_sequences.append(page_tokens) bboxes.append(page_bboxes) if len(text_sequences) != len(image_paths): continue instruction = random.choice(instructions) instruction = instruction.replace('', question) if
'answers' in d: value = d['answers'][0] if d['answer_type'] == 'not-answerable': value = 'none' else: value = '' file_name = os.path.abspath(image_path) sample = { "image_list": image_paths, "ocr_list": text_sequences, "bboxes_list": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], } if split == 'train': train.append(sample) elif split == 'val': validation.append(sample) for split, target_format in [('train', train), ('validation', validation)]: out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/dude', type=str) parser.add_argument('--out_data_dir', default='processed_data/dude', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/funsd.py ================================================ import json import os import random import cv2 from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, sort_coordinate, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'funsd' self.split = ['training', 'testing'] self.label_mapping = {'header': 'title', 'question': 'key', 'answer': 'value', 'other': 'other'} def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: target_format = [] ann_dir = os.path.join(self.data_dir, f'{split}_data/annotations') img_dir = os.path.join(self.data_dir, f'{split}_data/images')
for i, file in enumerate(tqdm(sorted(os.listdir(ann_dir)))): file_path = os.path.join(ann_dir, file) with open(file_path, 'r', encoding='utf-8') as f: data = json.load(f) image_path = os.path.join(img_dir, file) image_path = image_path.replace('.json', '.png') image = cv2.imread(image_path) h, w, _ = image.shape items = [] for item in data["form"]: text = item['text'] words, label = item["words"], item["label"] label = self.label_mapping[label] words = [w for w in words if w["text"].strip() != ""] if len(words) == 0: continue start_bbox, end_bbox = words[0]['box'], words[-1]['box'] bbox = [start_bbox[0], start_bbox[1], end_bbox[2], start_bbox[3]] bbox = normalize_bbox(bbox, w, h) items.append((text, label, bbox)) items = sort_coordinate(items) text_sequence = [] bboxes = [] labels = {} for item in items: text, label, bbox = item labels[text] = label text_sequence.append(text) bbox = [bbox] * len(text) bboxes += bbox ocr = ' '.join(text_sequence) for key in labels: instruction = random.choice(instructions) instruction = instruction.replace('', key) value = labels[key] file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) split = split.replace('ing', '') out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/funsd/dataset', type=str) parser.add_argument('--out_data_dir', default='processed_data/funsd', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/google_vision_ocr.py ================================================ import base64 import json from 
requests import Request, Session from io import BytesIO from utils import normalize_bbox class Google_OCR: def __init__(self, api_key): self.api_key = api_key def pil_image_to_base64(self, pil_image): buffered = BytesIO() pil_image.save(buffered, format="PNG") str_encode_file = base64.b64encode(buffered.getvalue()).decode("utf-8") return str_encode_file def recognize_image(self, pil_image): str_encode_file = self.pil_image_to_base64(pil_image) str_url = "https://vision.googleapis.com/v1/images:annotate?key=" str_headers = {'Content-Type': 'application/json'} str_json_data = { 'requests': [ { 'image': { 'content': str_encode_file }, 'features': [ { 'type': "TEXT_DETECTION", } ] } ] } obj_session = Session() obj_request = Request("POST", str_url + self.api_key, data=json.dumps(str_json_data), headers=str_headers ) obj_prepped = obj_session.prepare_request(obj_request) obj_response = obj_session.send(obj_prepped, verify=True, timeout=60 ) if obj_response.status_code == 200: return obj_response.json() else: return "error" def extract_info(self, items, img_w, img_h): words = [] bboxes = [] for page_ocrs in items['responses'][0]['fullTextAnnotation']['pages']: for block_ocrs in page_ocrs['blocks']: for para_ocrs in block_ocrs['paragraphs']: for word_ocrs in para_ocrs['words']: char_bboxes = [] word = '' for sym_ocrs in word_ocrs['symbols']: try: bbox = sym_ocrs['boundingBox'] xmin = max(0, bbox['vertices'][0]['x']) ymin = max(0, bbox['vertices'][0]['y']) xmax = max(0, bbox['vertices'][2]['x']) ymax = max(0, bbox['vertices'][2]['y']) bbox = [xmin, ymin, xmax, ymax] except: continue word += sym_ocrs['text'] char_bboxes.append(bbox) if len(char_bboxes) > 0: x1 = [w_p[0] for w_p in char_bboxes] y1 = [w_p[1] for w_p in char_bboxes] x2 = [w_p[2] for w_p in char_bboxes] y2 = [w_p[3] for w_p in char_bboxes] word_bbox = [min(x1), min(y1), max(x2), max(y2)] if word_bbox[0] >= word_bbox[2] or word_bbox[1] >= word_bbox[3]: continue word_bbox = normalize_bbox(word_bbox, img_w, img_h) 
words.append(word) bboxes.append(word_bbox) return words, bboxes ================================================ FILE: data_preprocessors/hwsquad.py ================================================ import json import os import random import argparse import csv from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions from collections import defaultdict class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'hwsquad' self.split = ['train', 'val', 'test'] def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: filename = os.path.join(self.data_dir, f"HW-SQuAD_{split}_1.0.json") with open(filename, "r") as f: annotations = json.load(f) target_format = [] for ann in tqdm(annotations["data"]): qas = ann["qas"] image_path = ann["document_image"]["document_image"] h, w = ann["document_image"]["image_height"], ann["document_image"]["image_width"] words = [] bboxes = [] for item in ann["document_image"]["gold_standard_transcription"]: word = item["text"] words.append(word) bbox = [item["xmin"], item["ymin"], item["xmax"], item["ymax"]] bbox = normalize_bbox(bbox, w, h) bboxes.append(bbox) for qa in qas: question = qa["question"] start_index, end_index = qa["answers"][0]["answer_start_word_no"], qa["answers"][0]["answer_end_word_no"]+1 answer = words[start_index:end_index] answer = " ".join(answer) instruction = random.choice(instructions) instruction = instruction.replace('', question) ocr = ' '.join(words) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': answer}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 
os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/HW-SQuAD/HW-SQuAD_annotations', type=str) parser.add_argument('--out_data_dir', default='processed_data/hwsquad', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/iconqa.py ================================================ import json import os import random import glob from PIL import Image from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'iconqa' self.split = ['train', 'val'] def create_data(self): for split in self.split: for answer_style in ['fill_in_blank', 'choose_txt']: target_format = [] dataset_name = f'{self.dataset_name}_{answer_style}' instructions = load_instructions(self.instruction_path)[dataset_name] data_dir = os.path.join(self.data_dir, f'{split}/{answer_style}/*') for file_path in glob.glob(data_dir): data_path = os.path.join(file_path, 'data.json') image_path = os.path.join(file_path, 'image.png') with open(data_path, 'r') as f: data = json.load(f) question = data['question'] instruction = random.choice(instructions) instruction = instruction.replace('', question) if answer_style == 'fill_in_blank': value = data['answer'] else: options = data['choices'] answer_index = data['answer'] value = str(options[answer_index]) instruction = instruction.replace('', options) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "conversations": [ {'from': 'human', 'value': 
instruction}, {'from': 'gpt', 'value': f"{value}"}, ], }) out_data_dir = f'{self.out_data_dir}_{answer_style}' out_filepath = os.path.join(out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/iconqa/iconqa_data', type=str) parser.add_argument('--out_data_dir', default='processed_data/iconqa', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/infographicvqa.py ================================================ import json import os import random from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'infographicvqa' self.split = ['train', 'dev'] def create_ocr_data(self, split): file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = {} for d in data: d = json.loads(d) image_name = d['name'].replace('.pdf', '') try: content = d['contents'][1] # microsoft cv except: content = d['contents'][0] # tesseract bboxes = [] tokens = [] try: _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): bbox = normalize_bbox(bbox, w, h) bboxes.append(bbox) tokens.append(token) except: pass ocrs[image_name] = (' '.join(tokens), bboxes) return ocrs def create_data(self): instructions = 
load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: file_name = os.path.join(self.data_dir, split, 'document.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = self.create_ocr_data(split) target_format = [] for d in tqdm(data): d = json.loads(d) image_name = d['name'].replace('.pdf', '') file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') file_name = os.path.abspath(file_name) for ann in d['annotations']: instruction = random.choice(instructions) question = ann['key'] instruction = instruction.replace('', question) bboxes = [] ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] value = ann['values'][0]['value'] values = ann['values'][0]['value_variants'] target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'instruction': instruction}, {'from': 'gpt', 'value': value, 'values': values}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/infographics_vqa', type=str) parser.add_argument('--out_data_dir', default='processed_data/infographicvqa', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/klc.py ================================================ import json import os import random from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'klc' self.split = 
['train', 'dev'] def create_ocr_data(self, split): file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = {} for d in data: d = json.loads(d) image_name = d['name'].replace('.pdf', '') try: content = d['contents'][1] # microsoft cv except: content = d['contents'][0] # tesseract bboxes = [] tokens = [] try: _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): bbox = normalize_bbox(bbox, w, h) bboxes.append(bbox) tokens.append(token) except: pass ocrs[image_name] = (' '.join(tokens), bboxes) return ocrs def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: file_name = os.path.join(self.data_dir, split, 'document.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = self.create_ocr_data(split) target_format = [] for d in tqdm(data): d = json.loads(d) image_name = d['name'].replace('.pdf', '') file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') file_name = os.path.abspath(file_name) for ann in d['annotations']: instruction = random.choice(instructions) if 'children' in ann['values'][0]: for v in ann['values']: for child in v['children']: value = child['key'] key = child['values'][0]['value'] instruction = instruction.replace('', key) ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser()
parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/kleister-charity', type=str) parser.add_argument('--out_data_dir', default='processed_data/klc', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/llavar.py ================================================ import json import os import random import argparse from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions from collections import defaultdict from google_vision_ocr import Google_OCR class InstructData: def __init__(self, args): self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') self.image_dir = os.path.join(args.input_data_dir, 'images') self.google_ocr = Google_OCR(args.api_key) os.makedirs(self.ocr_dir, exist_ok=True) def create_data(self): file_name = os.path.join(self.data_dir, 'llava_instruct_150k_llavar_20k.json') with open(file_name, 'r') as f: data = json.load(f) target_format = [] for d in data: image_name = d["image"] image_path = os.path.join(self.image_dir, image_name) if not os.path.exists(image_path): continue ocr_path = os.path.join(self.ocr_dir, f"{image_name.replace('.jpg', '.json')}") try: img = Image.open(image_path) img_w, img_h = img.size if not os.path.exists(ocr_path): items = self.google_ocr.recognize_image(img) if items == 'error': print('OCR error: ', image_path) continue with open(ocr_path, 'w') as f: json.dump(items, f) else: with open(ocr_path, 'r') as f: items = json.load(f) words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) except: words, bboxes = [], [] ocr = ' '.join(words) file_name = os.path.abspath(image_path) d["image"] = file_name d["ocr"] = ocr d["bboxes"] = bboxes target_format.append(d) out_filepath = os.path.join(self.out_data_dir, 'train.json') 
os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'train: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/llavar', type=str) parser.add_argument('--out_data_dir', default='processed_data/llavar', type=str) parser.add_argument('--api_key', default='API_KEY', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/ocrvqa.py ================================================ import json import os import random import argparse import csv from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions from collections import defaultdict from google_vision_ocr import Google_OCR class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') self.image_dir = os.path.join(args.input_data_dir, 'images') self.dataset_name = 'ocrvqa' self.google_ocr = Google_OCR(args.api_key) self.split = ['train', 'dev', 'test'] self.split_dict = {1: 'train', 2: 'dev', 3: 'test'} os.makedirs(self.ocr_dir, exist_ok=True) def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: target_format = [] file_name = os.path.join(self.data_dir, 'dataset.json') with open(file_name, 'r') as f: data = json.load(f) for image_id in tqdm(data): d = data[image_id] split_id = d['split'] if split != self.split_dict[split_id]: continue image_path = os.path.join(self.image_dir, f'{image_id}.jpg') if not os.path.exists(image_path): continue ocr_path = os.path.join(self.ocr_dir, f"{image_id}.json") try: img = Image.open(image_path) 
img_w, img_h = img.size if not os.path.exists(ocr_path): items = self.google_ocr.recognize_image(img) if items == "error": print('error: ', image_path) continue with open(ocr_path, 'w') as f: json.dump(items, f) else: with open(ocr_path, 'r') as f: items = json.load(f) words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) except: words, bboxes = [], [] ocr = ' '.join(words) file_name = os.path.abspath(image_path) for question, answer in zip(d['questions'], d['answers']): instruction = random.choice(instructions) instruction = instruction.replace('', question) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': answer}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/OCR-VQA-200K', type=str) parser.add_argument('--out_data_dir', default='processed_data/ocrvqa', type=str) parser.add_argument('--api_key', default='API_KEY', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/pwc.py ================================================ import json import os import random from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'pwc' self.split = ['train', 'dev'] def create_ocr_data(self, split): file_name = os.path.join(self.data_dir, split, 
'documents_content.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = {} for d in data: d = json.loads(d) image_name = d['name'].replace('.pdf', '') try: content = d['contents'][1] # microsoft cv except: content = d['contents'][0] # tesseract bboxes = [] tokens = [] try: _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): bbox = normalize_bbox(bbox, w, h) bboxes.append(bbox) tokens.append(token) except: pass ocrs[image_name] = (' '.join(tokens), bboxes) return ocrs def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: file_name = os.path.join(self.data_dir, split, 'document.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = self.create_ocr_data(split) target_format = [] for d in tqdm(data): d = json.loads(d) image_name = d['name'].replace('.pdf', '') file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') file_name = os.path.abspath(file_name) for ann in d['annotations']: instruction = random.choice(instructions) if 'children' in ann['values'][0]: for v in ann['values']: for child in v['children']: value = child['key'] key = child['values'][0]['value'] instruction = instruction.replace('', key) ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/AxCell', type=str)
parser.add_argument('--out_data_dir', default='processed_data/pwc', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/rvlcdip.py ================================================ import json import os import random import argparse from PIL import Image from tqdm import tqdm from pathlib import Path from utils import load_instructions from google_vision_ocr import Google_OCR class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') self.image_dir = os.path.join(args.input_data_dir, 'images') self.dataset_name = 'rvlcdip' self.google_ocr = Google_OCR(args.api_key) self.split = ['train', 'val', 'test'] self.class_dict = {'4': "advertisement", '10': "budget", '2': "email", '8': "file_folder", '1': "form", '3': "handwritten", '11': "invoice", '0': "letter", '15': "memo", '9': "news_article", '12': "presentation", '13': "questionnaire", '14': "resume", '6': "scientific_publication", '5':"scientific_report", '7': "specification"} def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: target_format = [] with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f: labels = f.read().splitlines() for label in labels: filename, label = label.split(' ') value = self.class_dict[label] image_path = os.path.join(self.image_dir, filename) ocr_path = os.path.join(self.ocr_dir, f'{filename.replace(".tif", ".json")}') try: img = Image.open(image_path) img_w, img_h = img.size if not os.path.exists(ocr_path): items = self.google_ocr.recognize_image(img) if items == "error": print('OCR error: ', image_path) continue os.makedirs(os.path.dirname(ocr_path), exist_ok=True) with open(ocr_path, 'w') as f: json.dump(items, f) else:
with open(ocr_path, 'r') as f: items = json.load(f) words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) except: words, bboxes = [], [] instruction = random.choice(instructions) ocr = ' '.join(words) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str) parser.add_argument('--out_data_dir', default='processed_data/rvlcdip', type=str) parser.add_argument('--api_key', default='API_KEY', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/rvlcdip_io.py ================================================ import json import os import random import argparse from PIL import Image from tqdm import tqdm from pathlib import Path from utils import load_instructions from google_vision_ocr import Google_OCR class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') self.image_dir = os.path.join(args.input_data_dir, 'images') self.dataset_name = 'rvlcdip_io' self.google_ocr = Google_OCR(args.api_key) self.split = ['train', 'val', 'test'] def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: target_format = [] ocrs = [] with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f: labels = 
f.read().splitlines() for label in labels: filename, _ = label.split(' ') image_path = os.path.join(self.image_dir, filename) ocr_path = os.path.join(self.ocr_dir, f'{filename.replace(".tif", ".json")}') try: img = Image.open(image_path) img_w, img_h = img.size if not os.path.exists(ocr_path): items = self.google_ocr.recognize_image(img) if items == "error": print('OCR error: ', image_path) continue os.makedirs(os.path.dirname(ocr_path), exist_ok=True) with open(ocr_path, 'w') as f: json.dump(items, f) else: with open(ocr_path, 'r') as f: items = json.load(f) words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) except: words, bboxes = [], [] ocr = ' '.join(words) ocrs.append((image_path, ocr, bboxes)) for image_path, ocr, bboxes in ocrs: instruction = random.choice(instructions) if random.random() > 0.5: _, ocr, bboxes = random.choice(ocrs) value = 'no' else: value = 'yes' file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str) parser.add_argument('--out_data_dir', default='processed_data/rvlcdip_io', type=str) parser.add_argument('--api_key', default='API_KEY', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/scicap.py ================================================ import json import os import random import argparse from PIL import Image from tqdm import tqdm from pathlib import Path from utils import load_instructions from
google_vision_ocr import Google_OCR class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.image_dir = os.path.join(args.input_data_dir, 'SciCap-No-Subfig-Img') self.caption_dir = os.path.join(args.input_data_dir, 'SciCap-Caption-All') self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') self.dataset_name = 'scicap' self.google_ocr = Google_OCR(args.api_key) self.split = ['train', 'val', 'test'] os.makedirs(self.ocr_dir, exist_ok=True) def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] ocr_info = {} for split in self.split: with open(os.path.join(self.data_dir, f'List-of-Files-for-Each-Experiments/Single-Sentence-Caption/No-Subfig/{split}/file_idx.json'), "r") as f: split_info = json.load(f) target_format = [] for file_name in tqdm(split_info): image_path = os.path.join(self.image_dir, split, file_name) caption_path = os.path.join(self.caption_dir, split, f'{file_name.replace(".png", ".json")}') ocr_path = os.path.join(self.ocr_dir, f'{file_name.replace(".png", ".json")}') with open(caption_path, "r") as f: annotation = json.load(f) try: img = Image.open(image_path) img_w, img_h = img.size if not os.path.exists(ocr_path): items = self.google_ocr.recognize_image(img) if items == "error": print('OCR error: ', image_path) continue with open(ocr_path, 'w') as f: json.dump(items, f) else: with open(ocr_path, 'r') as f: items = json.load(f) words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) except: words, bboxes = [], [] value = annotation['1-lowercase-and-token-and-remove-figure-index']['caption'] instruction = random.choice(instructions) ocr = ' '.join(words) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) 
out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/scicap', type=str) parser.add_argument('--out_data_dir', default='processed_data/scicap', type=str) parser.add_argument('--api_key', default='API_KEY', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/scienceqa.py ================================================ import json import os import random import glob from PIL import Image from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, sort_coordinate, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'scienceqa' def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] train, val, test = [],[],[] target_format = [] ann_filename = os.path.join(self.data_dir, 'data/scienceqa/problems.json') with open(ann_filename, 'r') as f: anns = json.load(f) for questionId, ann in tqdm(anns.items()): question = ann['question'] choices = ann['choices'] value = choices[ann['answer']] split = ann['split'] image_name = ann['image'] if image_name is None or str(image_name) == 'null': continue image_path = os.path.join(self.data_dir, split, questionId, image_name) instruction = random.choice(instructions) instruction = instruction.replace('', question).replace('', str(choices)) file_name = os.path.abspath(image_path) sample = { "image": file_name, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': f"{value}"}, ], } if split
== 'train': train.append(sample) elif split == 'val': val.append(sample) elif split == 'test': test.append(sample) for split, target_format in [('train', train), ('val', val), ('test', test)]: out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/scienceqa', type=str) parser.add_argument('--out_data_dir', default='processed_data/scienceqa', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/screen2words.py ================================================ import json import os import random import argparse from PIL import Image from tqdm import tqdm from pathlib import Path from utils import load_instructions from google_vision_ocr import Google_OCR class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') self.image_dir = os.path.join(args.input_data_dir, 'combined') self.dataset_name = 'screen2words' self.google_ocr = Google_OCR(args.api_key) self.split = ['train', 'dev'] os.makedirs(self.ocr_dir, exist_ok=True) def load_captions(self): with open(os.path.join(self.data_dir, 'screen_summaries.csv'), "r") as f: lines = f.read().splitlines() captions = {} for i, line in enumerate(lines): if i != 0: items = line.split(',') if len(items) > 2: screenId = items[0] summary = line[len(screenId)+1:] else: screenId, summary = items captions[screenId] = summary return captions def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] captions =
self.load_captions() for split in self.split: target_format = [] with open(os.path.join(self.data_dir, f'split/{split}_screens.txt'), "r") as f: split_info = f.read().splitlines() for split_id in tqdm(split_info): image_path = os.path.join(self.image_dir, f'{split_id}.jpg') ocr_path = os.path.join(self.ocr_dir, f'{split_id}.json') try: img = Image.open(image_path) img_w, img_h = img.size if not os.path.exists(ocr_path): items = self.google_ocr.recognize_image(img) if items == "error": print('OCR error: ', image_path) continue with open(ocr_path, 'w') as f: json.dump(items, f) else: with open(ocr_path, 'r') as f: items = json.load(f) words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) except: words, bboxes = [], [] value = captions[split_id] instruction = random.choice(instructions) ocr = ' '.join(words) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/screen2words', type=str) parser.add_argument('--out_data_dir', default='processed_data/screen2words', type=str) parser.add_argument('--api_key', default='API_KEY', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/slidevqa.py ================================================ import json import os import random import argparse import glob from PIL import Image from tqdm import tqdm from pathlib import Path from utils import load_instructions from google_vision_ocr import Google_OCR class 
InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') self.image_dir = os.path.join(args.input_data_dir, 'images') self.dataset_name = 'slidevqa' self.google_ocr = Google_OCR(args.api_key) self.split = ['train', 'val', 'test'] os.makedirs(self.ocr_dir, exist_ok=True) def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: target_format = [] file_name = os.path.join(self.data_dir, 'annotations/qa', f'{split}.jsonl') with open(file_name, 'r') as f: data = f.read().splitlines() for d in tqdm(data): d = json.loads(d) question = d['question'] deck_name = d['deck_name'] value = d['answer'] image_paths = [] text_sequences = [] bboxes = [] for image_path in glob.glob(os.path.join(self.image_dir, deck_name, f'slide_*_1024.jpg')): image_path = os.path.abspath(image_path) image_name = os.path.basename(image_path) image_paths.append(image_path) ocr_path = os.path.join(self.ocr_dir, f'{deck_name}_{image_name.replace(".jpg", ".json")}') try: img = Image.open(image_path) img_w, img_h = img.size if not os.path.exists(ocr_path): items = self.google_ocr.recognize_image(img) if items == 'error': print('OCR error: ', image_path) continue with open(ocr_path, 'w') as f: json.dump(items, f) else: with open(ocr_path, 'r') as f: items = json.load(f) words, page_bboxes = self.google_ocr.extract_info(items, img_w, img_h) except: words, page_bboxes = [], [] text_sequences.append(' '.join(words)) bboxes.append(page_bboxes) instruction = random.choice(instructions) instruction = instruction.replace('', question) file_name = os.path.abspath(image_path) target_format.append({ "image_list": image_paths, "ocr_list": text_sequences, "bboxes_list": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath =
os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/slidevqa', type=str) parser.add_argument('--out_data_dir', default='processed_data/slidevqa', type=str) parser.add_argument('--api_key', type=str, help='google vision api key') args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/sroie.py ================================================ import json import os import random import cv2 from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'sroie' self.split = ['train', 'test'] def sort_coordinate(self, bboxes): return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]]) def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: target_format = [] ann_dir = os.path.join(self.data_dir, f'{split}/entities') img_dir = os.path.join(self.data_dir, f'{split}/img') for file in tqdm(sorted(os.listdir(ann_dir))): file_path = os.path.join(ann_dir, file) with open(file_path, 'r', encoding='utf-8') as f: labels = json.load(f) image_path = os.path.join(img_dir, file) image_path = image_path.replace('.txt', '.jpg') image = cv2.imread(image_path) h, w, _ = image.shape file_path = os.path.join(ann_dir.replace('entities', 'box'), file) text_sequence = [] bboxes = [] with open(file_path, 'r', encoding='utf-8') as f: items = [] for item in 
f.read().splitlines(): bbox = item.split(',')[:8] text = item[len(','.join(bbox))+1:] bbox = [int(bbox[0]), int(bbox[1]), int(bbox[4]), int(bbox[5])] bbox = normalize_bbox(bbox, w, h) items.append((text, bbox)) items = self.sort_coordinate(items) for item in items: words, bbox = item text_sequence.append(words) bbox = [bbox] * len(words.split()) bboxes += bbox ocr = ' '.join(text_sequence) for label in labels: instruction = random.choice(instructions) instruction = instruction.replace('', labels[label]) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': label}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/SROIE2019', type=str) parser.add_argument('--out_data_dir', default='processed_data/sroie', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/tabfact.py ================================================ import json import os import random from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'tabfact' self.split = ['train', 'dev'] self.options = ['no', 'yes'] def create_ocr_data(self, split): file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = {} for d in data: d = 
json.loads(d) image_name = d['name'].replace('.pdf', '') try: content = d['contents'][1] # microsoft cv except: content = d['contents'][0] # tesseract bboxes = [] tokens = [] try: _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): bbox = normalize_bbox(bbox, w, h) bboxes.append(bbox) tokens.append(token) except: pass ocrs[image_name] = (' '.join(tokens), bboxes) return ocrs def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: file_name = os.path.join(self.data_dir, split, 'document.jsonl') with open(file_name, 'r') as f: data = f.readlines() ocrs = self.create_ocr_data(split) target_format = [] for d in tqdm(data): d = json.loads(d) image_name = d['name'].replace('.pdf', '') file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') file_name = os.path.abspath(file_name) for ann in d['annotations']: instruction = random.choice(instructions) question = ann['key'] instruction = instruction.replace('', question) bboxes = [] ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] value = ann['values'][0]['value'] value = self.options[int(value)] target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/TabFact', type=str) parser.add_argument('--out_data_dir', default='processed_data/tabfact', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data()
================================================ FILE: data_preprocessors/tatdqa.py ================================================ import json import os import random import argparse import csv from PIL import Image, ImageSequence from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'tatdqa' self.split = ['train', 'dev', 'test'] def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: target_format = [] file_name = os.path.join(self.data_dir, f'tatdqa_dataset_{split}.json') with open(file_name, 'r') as f: data = json.load(f) for d in tqdm(data): uid = d['doc']['uid'] page_num = d['doc']['page'] image_path = f'{split}/{uid}_{page_num}.png' ocr_file_name = os.path.join(self.data_dir, f'{split}/{uid}.json') with open(ocr_file_name, 'r') as f: ocrs = json.load(f) text = [] bboxes = [] _, _, w, h = ocrs['pages'][page_num-1]['bbox'] for page in ocrs['pages']: for block in page['blocks']: text.append(block['text']) for bbox in block['words']['bbox_list']: bbox = normalize_bbox(bbox, w, h) bboxes.append(bbox) for qa in d['questions']: question =qa['question'] if 'answer' in qa: answer = qa['answer'] if type(qa['answer']) == list: if len(qa['answer']) > 1: answer = ', '.join(answer) else: answer = answer[0] else: answer = "" instruction = random.choice(instructions) instruction = instruction.replace('', question) ocr = ' '.join(text) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': answer}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 
print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/TAT-DQA', type=str) parser.add_argument('--out_data_dir', default='processed_data/tatdqa', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/textbookqa.py ================================================ import json import os import random import glob from PIL import Image from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, sort_coordinate, load_instructions from transformers import BertTokenizer import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'textbookqa' self.split = ['train', 'val', 'test'] def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: target_format = [] ann_filename = f'{split}/tqa_v1_{split}.json' if split != 'test' else f'{split}/tqa_v2_{split}.json' ann_filename = os.path.join(self.data_dir, ann_filename) with open(ann_filename, 'r') as f: anns = json.load(f) for ann in tqdm(anns): questions = ann['questions'] diagram_questions = questions['diagramQuestions'] if len(diagram_questions) == 0: continue diagram_annotations = ann['diagramAnnotations'] for global_id, data in diagram_questions.items(): options = [] for option_id, choice in data['answerChoices'].items(): choice = choice['processedText'] options.append(choice) question = data['beingAsked']['processedText'] value = data['correctAnswer']['rawText'] image_path = data['imagePath'] image_name = data['imageName'] image_path = os.path.join(self.data_dir, f'{split}/{image_path}') if image_name in 
diagram_annotations: annotation = diagram_annotations[image_name] bboxes = [] ocr = [] for item in annotation: text, bbox = item["text"], item["rectangle"] try: bbox = [bbox[0][0], bbox[0][1], bbox[1][0], bbox[1][1]] except: continue if len(text) > 0: bboxes.append(bbox) ocr.append(text) ocr = " ".join(ocr) else: ocr = "" bboxes = [] instruction = random.choice(instructions) instruction = instruction.replace('', question).replace('', str(options)) file_name = os.path.abspath(image_path) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': f"{value}"}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath), exist_ok=True) print(f'{split}: {len(target_format)}') with open(out_filepath, "w") as f: json.dump(target_format, f) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--input_data_dir', default='raw_datasets/textbookqa', type=str) parser.add_argument('--out_data_dir', default='processed_data/textbookqa', type=str) args = parser.parse_args() dataset = InstructData(args) dataset.create_data() ================================================ FILE: data_preprocessors/utils.py ================================================ import pandas as pd def normalize_bbox(bbox, w=-1, h=-1): if w > 0 and h > 0: normalized_bbox = [ int(1000 * bbox[0] / w), int(1000 * bbox[1] / h), int(1000 * bbox[2] / w), int(1000 * bbox[3] / h), ] else: normalized_bbox = [ int(1000 * bbox[0]), int(1000 * bbox[1]), int(1000 * bbox[2]), int(1000 * bbox[3]), ] if len(bbox) == 4: return convert_wh(normalized_bbox) elif len(bbox) == 6: return normalized_bbox def convert_wh(bbox): return [bbox[0], bbox[1], bbox[2], bbox[3], abs(bbox[2]-bbox[0]), abs(bbox[3]-bbox[1])] def sort_coordinate(bboxes): return sorted(bboxes , key=lambda k: [k[2][1], k[2][0]]) def load_instructions(instruction_path): instructions = {} data =
pd.read_excel(instruction_path) for d in data.values: dataset_name = d[0] insts = [] for prompt in d[3:]: if pd.isna(prompt): break insts.append(prompt) instructions[dataset_name] = insts return instructions ================================================ FILE: data_preprocessors/visualmrc.py ================================================ import json import os import random from PIL import Image from tqdm import tqdm from pathlib import Path from utils import normalize_bbox, load_instructions import argparse class InstructData: def __init__(self, args): self.instruction_path = Path('instructdoc_instructions.xlsx') self.data_dir = args.input_data_dir self.out_data_dir = args.out_data_dir self.dataset_name = 'visualmrc' self.split = ['train', 'dev', 'test'] def create_data(self): instructions = load_instructions(self.instruction_path)[self.dataset_name] for split in self.split: file_name = os.path.join(self.data_dir, f'data/{split}.jsonl') with open(file_name, 'r') as f: data = f.readlines() target_format = [] for d in tqdm(data): d = json.loads(d) file_name = os.path.join(self.data_dir, d['image_filename']) file_name = os.path.abspath(file_name) image = Image.open(file_name) w, h = image.size words = [] bboxes = [] for bbox in d['bounding_boxes']: if 'ocr_info' in bbox: for ocr in bbox['ocr_info']: word = ocr['word'] bbox = ocr['bbox'] bbox = [bbox['x'], bbox['y'], bbox['x']+bbox['width'], bbox['y']+bbox['height']] bbox = normalize_bbox(bbox, w, h) bboxes.append(bbox) words.append(word) ocr = " ".join(words) for qa in d['qa_data']: question = qa['question']['text'] value = qa['answer']['text'] instruction = random.choice(instructions) instruction = instruction.replace('', question) target_format.append({ "image": file_name, "ocr": ocr, "bboxes": bboxes, "conversations": [ {'from': 'human', 'value': instruction}, {'from': 'gpt', 'value': value}, ], }) out_filepath = os.path.join(self.out_data_dir, f'{split}.json') os.makedirs(os.path.dirname(out_filepath),
exist_ok=True)
            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/VisualMRC_official', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/visualmrc', type=str)
    args = parser.parse_args()

    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/websrc.py
================================================
import json
import os
import random
import argparse
import csv
from PIL import Image
from tqdm import tqdm
from pathlib import Path
from utils import normalize_bbox, load_instructions
from collections import defaultdict
from google_vision_ocr import Google_OCR


class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.dataset_name = 'websrc'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'dev']
        os.makedirs(self.ocr_dir, exist_ok=True)

    def load_split_info(self):
        # Map each split name to the list of per-site dataset.csv files.
        file_name = os.path.join(self.data_dir, 'dataset_split.csv')
        with open(file_name) as f:
            reader = csv.reader(f)
            split_info = defaultdict(list)
            for i, row in enumerate(reader):
                if i == 0:
                    continue  # skip the header row
                number = '0' + row[1] if int(row[1]) < 10 else row[1]
                split = row[3]
                data_path = os.path.join(self.data_dir, f'{row[0]}/{number}/dataset.csv')
                split_info[split].append(data_path)
        return split_info

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        split_info = self.load_split_info()
        for split in self.split:
            target_format = []
            for data_path in tqdm(split_info[split]):
                with open(data_path) as f:
                    data_dir = os.path.dirname(data_path)
                    reader = csv.reader(f)
                    for i, row in enumerate(reader):
                        if i == 0:
                            # Locate the columns of interest from the header row.
                            for index, element in enumerate(row):
                                if 'question' == element:
                                    question_index = index
                                elif 'id' == element:
                                    id_index = index
                                elif 'answer' == element:
                                    answer_index = index
                            continue

                        questionId = row[id_index]
                        image_path = os.path.join(data_dir, f'processed_data/{questionId[2:9]}.png')
                        img = Image.open(image_path)
                        img_w, img_h = img.size

                        # Reuse cached OCR results when available; otherwise call the OCR API.
                        ocr_path = os.path.join(self.ocr_dir, f'{questionId[2:9]}.json')
                        try:
                            if not os.path.exists(ocr_path):
                                items = self.google_ocr.recognize_image(img)
                                if items == "error":
                                    print('OCR error: ', image_path)
                                    continue
                                with open(ocr_path, 'w') as ocr_f:
                                    json.dump(items, ocr_f)
                            else:
                                with open(ocr_path, 'r') as ocr_f:
                                    items = json.load(ocr_f)
                            words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                        except Exception:
                            words, bboxes = [], []

                        question = row[question_index]
                        instruction = random.choice(instructions)
                        instruction = instruction.replace('', question)
                        ocr = ' '.join(words)
                        value = row[answer_index]

                        file_name = os.path.abspath(image_path)
                        target_format.append({
                            "image": file_name,
                            "ocr": ocr,
                            "bboxes": bboxes,
                            "conversations": [
                                {'from': 'human', 'value': instruction},
                                {'from': 'gpt', 'value': value},
                            ],
                        })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/websrc', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/websrc', type=str)
    parser.add_argument('--ocr_dir', default='raw_datasets/websrc/ocrs', type=str)
    parser.add_argument('--api_key', default='API_KEY', type=str)
    args = parser.parse_args()

    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/wildreceipt.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm
from pathlib import Path
from utils import normalize_bbox, sort_coordinate, load_instructions
import argparse


class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'wildreceipt'
        self.split = ['train', 'test']
        # class_list.txt holds one "<index> <label>" pair per line.
        self.classes = {}
        with open(os.path.join(args.input_data_dir, 'class_list.txt')) as f:
            for items in f:
                index, label = items.split()
                self.classes[index] = label

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            with open(os.path.join(self.data_dir, f'{split}.txt')) as f:
                samples = f.readlines()
            for sample in tqdm(samples):
                data = json.loads(sample)
                file_name = data['file_name']
                image_path = os.path.join(self.data_dir, file_name)
                image = Image.open(image_path)
                w, h = image.size

                items = []
                labels = {}
                for item in data["annotations"]:
                    text, label_index = item["text"], item["label"]
                    label = self.classes[str(label_index)]
                    if label_index == 0:
                        continue
                    bbox = item["box"]
                    # Keep two opposite corners of the 8-value box.
                    bbox = [bbox[0], bbox[1], bbox[4], bbox[5]]
                    bbox = normalize_bbox(bbox, w, h)
                    items.append((text, label, bbox))
                items = sort_coordinate(items)

                ocr = []
                bboxes = []
                for item in items:
                    words, label, bbox = item
                    labels[words] = label
                    ocr.append(words)
                    # Repeat the box once per whitespace-separated word.
                    bbox = [bbox] * len(words.split())
                    bboxes += bbox
                ocr = ' '.join(ocr)

                for key in labels:
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('', key)
                    value = labels[key]
                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/wildreceipt/wildreceipt', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/wildreceipt', type=str)
    args = parser.parse_args()

    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/wtq.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse


class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'wtq'
        self.split = ['train', 'dev']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()

        ocrs = {}
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1]  # microsoft cv
            except Exception:
                content = d['contents'][0]  # tesseract

            bboxes = []
            tokens = []
            try:
                _, _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except Exception:
                continue
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()
            ocrs = self.create_ocr_data(split)

            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
                image_name = d['name'].replace('.pdf', '')
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    instruction = random.choice(instructions)
                    question = ann['key']
                    instruction = instruction.replace('', question)
                    ocr, bboxes = ocrs[image_name]
                    value = ann['values'][0]['value']
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/WikiTableQuestions', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/wtq', type=str)
    args = parser.parse_args()

    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: download.sh
================================================
#!/bin/bash
export DATASET_DIR=raw_datasets
mkdir raw_datasets

sh ./download_scripts/due.sh
sh ./download_scripts/websrc.sh
sh ./download_scripts/funsd.sh
sh ./download_scripts/iconqa.sh
sh ./download_scripts/textbookqa.sh
sh ./download_scripts/screen2words.sh
sh ./download_scripts/doclaynet.sh
sh ./download_scripts/ai2d.sh
sh ./download_scripts/wildreceipt.sh

# font file for rendering text in AI2D dataset
wget https://huggingface.co/Team-PIXEL/pixel-base-finetuned-masakhaner-swa/resolve/main/GoNotoCurrent.ttf

================================================
FILE: download_scripts/README.md
================================================
Below is the list of datasets used in InstructDoc and how to download them.

### Automatically download datasets
- DocVQA ([due.sh](download_scripts/due.sh))
- InfographicVQA ([due.sh](download_scripts/due.sh))
- PWC ([due.sh](download_scripts/due.sh))
- KLC ([due.sh](download_scripts/due.sh))
- DeepForm ([due.sh](download_scripts/due.sh))
- TabFact ([due.sh](download_scripts/due.sh))
- WebSRC ([websrc.sh](download_scripts/websrc.sh))
- FUNSD ([funsd.sh](download_scripts/funsd.sh))
- IconQA ([iconqa.sh](download_scripts/iconqa.sh))
- TextbookQA ([textbookqa.sh](download_scripts/textbookqa.sh))
- Screen2Words ([screen2words.sh](download_scripts/screen2words.sh))
- DocLayNet ([doclaynet.sh](download_scripts/doclaynet.sh))
- LLaVAR ([llavar.sh](download_scripts/llavar.sh))

### Manually download
After downloading the datasets below, please place them under the directory "raw_datasets".
- SROIE ([kaggle](https://www.kaggle.com/datasets/urbikn/sroie-datasetv2))
- CORD ([google drive](https://drive.google.com/drive/folders/14OEWr86qotVBMAsWk7lymMytxn5u-kM6))
- OCRVQA ([google drive](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing))
- TAT-DQA ([google drive](https://drive.google.com/drive/folders/1SGpZyRWqycMd_dZim1ygvWhl5KdJYDR2))
- ScienceQA ([google drive](https://drive.google.com/drive/folders/1w8imCXWYn2LxajmGeGH_g5DaL2rabHev))
- ChartQA ([google drive](https://drive.google.com/file/d/17-aqtiq_KJ16PIGOp30W0y6OJNax6SVT/view))
- RVL-CDIP ([google docs](https://docs.google.com/uc?id=0Bz1dfcnrpXM-MUt4cHNzUEFXcmc&export=download))
- HW-SQuAD ([onedrive](https://www.docvqa.org/datasets/benthamqa-and-hw-squad))
- SciCap ([dropbox](https://www.dropbox.com/s/t1sjqesl0pynaxo/scicap_data.zip?dl=0))
- DUDE ([project page](https://rrc.cvc.uab.es/?ch=23&com=introduction))
- DocBank ([project page](https://doc-analysis.github.io/docbank-page/index.html))
- DocILE ([project page](https://docile.rossum.ai/))
- VisualMRC ([project page](https://github.com/nttmdlab-nlp/VisualMRC), request authors via e-mail ryota.tanaka@ntt.com)
- SlideVQA ([project page](https://github.com/nttmdlab-nlp/SlideVQA), request authors via e-mail ryota.tanaka@ntt.com)

================================================
FILE: download_scripts/ai2d.sh
================================================
cd $DATASET_DIR

echo "Downloading AI2D dataset..."
mkdir ai2d
cd ai2d
wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip
wget https://s3-us-east-2.amazonaws.com/prior-datasets/ai2d_test_ids.csv
unzip ai2d-all.zip && rm ai2d-all.zip

================================================
FILE: download_scripts/doclaynet.sh
================================================
cd $DATASET_DIR

echo "Downloading DocLayNet dataset..."
mkdir doclaynet
cd doclaynet
wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip
wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip
unzip DocLayNet_core.zip && rm DocLayNet_core.zip
unzip DocLayNet_extra.zip && rm DocLayNet_extra.zip

================================================
FILE: download_scripts/due.sh
================================================
cd $DATASET_DIR

echo "Downloading DocVQA dataset..."
mkdir docvqa
cd docvqa
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DocVQA.tar.gz
tar xvf DocVQA.tar.gz && rm DocVQA.tar.gz
cd ..

echo "Downloading InfoVQA dataset..."
mkdir infovqa
cd infovqa
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/InfographicsVQA.tar.gz
tar xvf InfographicsVQA.tar.gz && rm InfographicsVQA.tar.gz
cd ..

echo "Downloading TabFact dataset..."
mkdir tabfact
cd tabfact
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/TabFact.tar.gz
tar xvf TabFact.tar.gz && rm TabFact.tar.gz
cd ..

echo "Downloading WTQ dataset..."
mkdir wtq
cd wtq
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/WikiTableQuestions.tar.gz
tar xvf WikiTableQuestions.tar.gz && rm WikiTableQuestions.tar.gz
cd ..

echo "Downloading KLC dataset..."
mkdir klc
cd klc
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/KleisterCharity.tar.gz
tar xvf KleisterCharity.tar.gz && rm KleisterCharity.tar.gz
cd ..

echo "Downloading DeepForm dataset..."
mkdir deepform
cd deepform
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DeepForm.tar.gz
tar xvf DeepForm.tar.gz && rm DeepForm.tar.gz
cd ..

echo "Downloading PWC dataset..."
mkdir pwc
cd pwc
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/PWC.tar.gz
tar xvf PWC.tar.gz && rm PWC.tar.gz

================================================
FILE: download_scripts/funsd.sh
================================================
cd $DATASET_DIR

echo "Downloading FUNSD dataset..."
mkdir funsd
cd funsd
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip dataset.zip && rm dataset.zip

================================================
FILE: download_scripts/iconqa.sh
================================================
cd $DATASET_DIR

echo "Downloading IconQA dataset..."
mkdir iconqa
cd iconqa
wget https://iconqa2021.s3.us-west-1.amazonaws.com/iconqa_data.zip
unzip iconqa_data.zip && rm iconqa_data.zip

================================================
FILE: download_scripts/llavar.sh
================================================
cd $DATASET_DIR

echo "Downloading LLaVAR dataset..."
mkdir llavar
cd llavar
wget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/llava_instruct_150k_llavar_20k.json
mkdir images
cd images
wget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/finetune.zip
unzip finetune.zip && rm finetune.zip

================================================
FILE: download_scripts/screen2words.sh
================================================
cd $DATASET_DIR

echo "Downloading Screen2Words dataset..."
git clone https://github.com/google-research-datasets/screen2words.git
cd screen2words
wget https://storage.googleapis.com/crowdstf-rico-uiuc-4540/rico_dataset_v0.1/unique_uis.tar.gz
tar xvf unique_uis.tar.gz && rm unique_uis.tar.gz

================================================
FILE: download_scripts/textbookqa.sh
================================================
cd $DATASET_DIR

echo "Downloading TextbookQA dataset..."
mkdir textbookqa
cd textbookqa
wget https://ai2-public-datasets.s3.amazonaws.com/tqa/tqa_train_val_test.zip
unzip tqa_train_val_test.zip && rm tqa_train_val_test.zip

================================================
FILE: download_scripts/websrc.sh
================================================
cd $DATASET_DIR

echo "Downloading WebSRC dataset..."
mkdir websrc
cd websrc
wget https://websrc-data.s3.amazonaws.com/release.zip
unzip release.zip && rm release.zip

================================================
FILE: download_scripts/wildreceipt.sh
================================================
cd $DATASET_DIR

echo "Downloading WildReceipt dataset..."
mkdir wildreceipt
cd wildreceipt
wget https://download.openmmlab.com/mmocr/data/wildreceipt.tar
tar xvf wildreceipt.tar && rm wildreceipt.tar

================================================
FILE: merge_datasets.py
================================================
import os
import json
import random
import argparse

train_val_datasets = ['klc', 'pwc', 'deepform', 'sroie', 'docile', 'wildreceipt', 'websrc', 'hwsquad',
                      'visualmrc', 'iconqa_fill_in_blank', 'iconqa_choose_txt', 'scienceqa', 'ai2d', 'docvqa',
                      'rvlcdip', 'textbookqa', 'wtq', 'tatdqa', 'scicap', 'llavar', 'screen2words', 'doclaynet',
                      'docbank', 'docvqa_iq', 'rvlcdip_io', 'ocrvqa']


def merge_datasets(input_data_dir='./processed_data', save_dir='./', max_samples=5000):
    questionId = 0
    # Each entry is a tuple of split-file names; the first name is used for the output file.
    for split in [('train',), ('dev', 'val')]:
        merge = []
        for dataset_name in train_val_datasets:
            for s in split:
                dataset_path = os.path.join(input_data_dir, dataset_name, f'{s}.json')
                if os.path.exists(dataset_path):
                    with open(dataset_path, 'r') as f:
                        data = json.load(f)
                    if len(data) == 0:
                        continue
                    # random.shuffle works in place and returns None, so truncate separately.
                    random.shuffle(data)
                    data = data[:max_samples]
                    for d in data:
                        d["dataset_name"] = dataset_name
                        d["id"] = questionId
                        questionId += 1  # give every merged sample a distinct id
                        merge.append(d)
        random.shuffle(merge)

        out_filepath = os.path.join(save_dir, f'{split[0]}.json')
        os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
        print(f'{split[0]}: {len(merge)}')
        with open(out_filepath, "w") as f:
            json.dump(merge, f)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='processed_data', type=str)
    parser.add_argument('--save_dir', default='./', type=str)
    parser.add_argument('--max_samples', default=5000, type=int)
    args = parser.parse_args()

    merge_datasets(args.input_data_dir, args.save_dir, args.max_samples)

================================================
FILE: process_data.sh
================================================
#!/bin/bash
API_KEY=$1

# ===== KIE =====
python data_preprocessors/docile.py
python data_preprocessors/klc.py
python data_preprocessors/deepform.py
python data_preprocessors/funsd.py
python data_preprocessors/pwc.py
python data_preprocessors/wildreceipt.py
python data_preprocessors/cord.py
python data_preprocessors/sroie.py

# ===== Single-page QA =====
python data_preprocessors/visualmrc.py
python data_preprocessors/websrc.py --api_key $API_KEY
python data_preprocessors/ocrvqa.py --api_key $API_KEY
python data_preprocessors/docvqa.py
python data_preprocessors/hwsquad.py

# ===== Single-page QA w/ Discrete Reasoning =====
python data_preprocessors/tatdqa.py
python data_preprocessors/wtq.py

# ===== Single-page QA w/ Visual Reasoning =====
python data_preprocessors/iconqa.py
python data_preprocessors/ai2d.py
python data_preprocessors/scienceqa.py
python data_preprocessors/textbookqa.py

# ===== Single-page QA w/ Discrete and Visual Reasoning =====
python data_preprocessors/infographicvqa.py
python data_preprocessors/chartqa.py --api_key $API_KEY

# ===== Multi-page QA w/ Multi-hop, Discrete, and Visual Reasoning =====
python data_preprocessors/slidevqa.py --api_key $API_KEY
python data_preprocessors/dude.py

# ===== Document NLI =====
python data_preprocessors/tabfact.py

# ===== Dialogue =====
python data_preprocessors/llavar.py --api_key $API_KEY

# ===== Captioning =====
python data_preprocessors/scicap.py --api_key $API_KEY
python data_preprocessors/screen2words.py --api_key $API_KEY

# ===== Classification =====
python data_preprocessors/rvlcdip.py --api_key $API_KEY

# ===== ITM =====
python data_preprocessors/rvlcdip_io.py --api_key $API_KEY
python data_preprocessors/docvqa_iq.py

# ===== DLA =====
python data_preprocessors/docbank.py
python data_preprocessors/doclaynet.py