Repository: nttmdlab-nlp/InstructDoc
Branch: main
Commit: fadcdabc1d07
Files: 51
Total size: 141.9 KB

Directory structure:
gitextract_pfacmpi3/

├── LICENSE
├── README.md
├── data_preprocessors/
│   ├── ai2d.py
│   ├── chartqa.py
│   ├── cord.py
│   ├── deepform.py
│   ├── docbank.py
│   ├── docile.py
│   ├── doclaynet.py
│   ├── docvqa.py
│   ├── docvqa_iq.py
│   ├── dude.py
│   ├── funsd.py
│   ├── google_vision_ocr.py
│   ├── hwsquad.py
│   ├── iconqa.py
│   ├── infographicvqa.py
│   ├── klc.py
│   ├── llavar.py
│   ├── ocrvqa.py
│   ├── pwc.py
│   ├── rvlcdip.py
│   ├── rvlcdip_io.py
│   ├── scicap.py
│   ├── scienceqa.py
│   ├── screen2words.py
│   ├── slidevqa.py
│   ├── sroie.py
│   ├── tabfact.py
│   ├── tatdqa.py
│   ├── textbookqa.py
│   ├── utils.py
│   ├── visualmrc.py
│   ├── websrc.py
│   ├── wildreceipt.py
│   └── wtq.py
├── download.sh
├── download_scripts/
│   ├── README.md
│   ├── ai2d.sh
│   ├── doclaynet.sh
│   ├── due.sh
│   ├── funsd.sh
│   ├── iconqa.sh
│   ├── llavar.sh
│   ├── screen2words.sh
│   ├── textbookqa.sh
│   ├── websrc.sh
│   └── wildreceipt.sh
├── instructdoc_instructions.xlsx
├── merge_datasets.py
└── process_data.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
SOFTWARE LICENSE AGREEMENT FOR EVALUATION

This SOFTWARE EVALUATION LICENSE AGREEMENT (this "Agreement") is a legal contract between a person who uses or otherwise accesses or installs the Software (“User(s)”), and Nippon Telegraph and Telephone corporation ("NTT").
READ THE TERMS AND CONDITIONS OF THIS AGREEMENT CAREFULLY BEFORE INSTALLING OR OTHERWISE ACCESSING OR USING NTT'S PROPRIETARY SOFTWARE ACCOMPANIED BY THIS AGREEMENT (the "SOFTWARE"). THE SOFTWARE IS COPYRIGHTED AND IT IS LICENSED TO USER UNDER THIS AGREEMENT, NOT SOLD TO USER. BY INSTALLING OR OTHERWISE ACCESSING OR USING THE SOFTWARE, USER ACKNOWLEDGES THAT USER HAS READ THIS AGREEMENT, THAT USER UNDERSTANDS IT, AND THAT USER ACCEPTS AND AGREES TO BE BOUND BY ITS TERMS. IF AT ANY TIME USER IS NOT WILLING TO BE BOUND BY THE TERMS OF THIS AGREEMENT, USER SHOULD  TERMINATE THE INSTALLATION PROCESS, IMMEDIATELY CEASE AND REFRAIN FROM ACCESSING OR USING THE SOFTWARE AND DELETE ANY COPIES USER MAY HAVE. THIS AGREEMENT REPRESENTS THE ENTIRE AGREEMENT BETWEEN USER AND NTT CONCERNING THE SOFTWARE.

 
BACKGROUND
A.	NTT is the owner of all rights, including all patent rights, copyrights and trade secret rights, in and to the Software and related documentation listed in Exhibit A to this Agreement.
B.	User wishes to obtain a royalty free license to use the Software to enable User to evaluate, and NTT wishes to grant such a license to User, pursuant and subject to the terms and conditions of this Agreement.
C.	As a condition to NTT's provision of the Software to User, NTT has required User to execute this Agreement.
In consideration of these premises, and the mutual promises and conditions in this Agreement, the parties hereby agree as follows:
1. Grant of Evaluation License.     NTT hereby grants to User, and User hereby accepts, under the terms and conditions of this Agreement, a royalty free, nontransferable and nonexclusive license to use the Software internally for the purposes of testing, analyzing, and evaluating the methods or mechanisms as shown in the research paper submitted by NTT to a certain academy. User may make a reasonable number of backup copies of the Software solely for User's internal use pursuant to the license granted in this Section 1.
2.　Shipment and Installation.  NTT will ship or deliver the Software by any method that NTT deems appropriate. User shall be solely responsible for proper installation of the Software.
3.　Term.  This Agreement is effective whichever is earlier (i) upon User’s acceptance of the Agreement, or (ii) upon User’s installing, accessing, and using the Software, even if User has not expressly accepted this Agreement. Without prejudice to any other rights, NTT may terminate this Agreement without notice to User (i) if User breaches or fails to comply with any of the limitations or other requirements described herein, and (ii) if NTT receives a notice from the academy stating that the research paper would not be published, and in any such case User agrees that NTT may, in addition to any other remedies it may have at law or in equity, remotely disable the Software. User may terminate this Agreement at any time by User’s decision to terminate the Agreement to NTT and ceasing use of the Software. Upon any termination or expiration of this Agreement for any reason, User agrees to uninstall the Software and either return to NTT the Software and all copies thereof, or to destroy all such materials and provide written verification of such destruction to NTT.
4.	   Proprietary Rights
(a)	   The Software is the valuable, confidential, and proprietary property of NTT, and NTT shall retain exclusive title to this property both during the term and after the termination of this Agreement.  Without limitation, User acknowledges that all patent rights, copyrights and trade secret rights in the Software shall remain the exclusive property of NTT at all times. User shall use not less than reasonable care in safeguarding the confidentiality of the Software. 
(b)	   USER SHALL NOT, IN WHOLE OR IN PART, AT ANY TIME DURING THE TERM OF OR AFTER THE TERMINATION OF THIS AGREEMENT: (i) SELL, ASSIGN, LEASE, DISTRIBUTE, OR OTHERWISE TRANSFER THE SOFTWARE TO ANY THIRD PARTY; (ii) EXCEPT AS OTHERWISE PROVIDED HEREIN, COPY OR REPRODUCE THE SOFTWARE IN ANY MANNER; (iii) DISCLOSE THE SOFTWARE TO ANY THIRD PARTY, EXCEPT TO USER'S EMPLOYEES WHO REQUIRE ACCESS TO THE SOFTWARE FOR THE PURPOSES OF THIS AGREEMENT; (iv) MODIFY, DISASSEMBLE, DECOMPILE, REVERSE ENGINEER OR TRANSLATE THE SOFTWARE; OR (v) ALLOW ANY PERSON OR ENTITY TO COMMIT ANY OF THE ACTIONS DESCRIBED IN (i) THROUGH (iv) ABOVE.
(c)	   User shall take appropriate action, by instruction, agreement, or otherwise, with respect to its employees permitted under this Agreement to have access to the Software to ensure that all of User's obligations under this Section 4 shall be satisfied.  
5.　	   Indemnity.  User shall defend, indemnify and hold harmless NTT, its agents and employees, from any loss, damage, or liability arising in connection with User's improper or unauthorized use of the Software. NTT SHALL HAVE THE SOLE RIGHT TO CONDUCT DEFEND ANY ACTTION RELATING TO THE SOFTWARE.
6.	   Disclaimer.  THE SOFTWARE IS LICENSED TO USER "AS IS," WITHOUT ANY TRAINING, MAINTENANCE, OR SERVICE OBLIGATIONS WHATSOEVER ON THE PART OF NTT. NTT MAKES NO EXPRESS OR IMPLIED WARRANTIES OF ANY TYPE WHATSOEVER, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, OF FITNESS FOR A PARTICULAR PURPOSE AND OF NON-INFRINGEMENT ON COPYRIGHT OR ANY OTHER RIGHT OF THIRD PARTIES.  USER ASSUMES ALL RISKS ASSOCIATED WITH ITS USE OF THE SOFTWARE, INCLUDING WITHOUT LIMITATION RISKS RELATING TO QUALITY, PERFORMANCE, DATA LOSS, AND UTILITY IN A PRODUCTION ENVIRONMENT. 
7.	   Limitation of Liability.  IN NO EVENT SHALL NTT BE LIABLE TO USER OR TO ANY THIRD PARTY FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO DAMAGES FOR PERSONAL INJURY, PROPERTY DAMAGE, LOST PROFITS, OR OTHER ECONOMIC LOSS, ARISING IN CONNECTION WITH USER'S USE OF OR INABILITY TO USE THE SOFTWARE, IN CONNECTION WITH NTT'S PROVISION OF OR FAILURE TO PROVIDE SERVICES PERTAINING TO THE SOFTWARE, OR AS A RESULT OF ANY DEFECT IN THE SOFTWARE.  THIS DISCLAIMER OF LIABILITY SHALL APPLY REGARDLESS OF THE FORM OF ACTION THAT MAY BE BROUGHT AGAINST NTT, WHETHER IN CONTRACT OR TORT, INCLUDING WITHOUT LIMITATION ANY ACTION FOR NEGLIGENCE.  USER'S SOLE REMEDY IN THE EVENT OF ANY BREACH OF THIS AGREEMENT BY NTT SHALL BE TERMINATION PURSUANT TO SECTION 3.
8.	   No Assignment or Sublicense.  Neither this Agreement nor any right or license under this Agreement, nor the Software, may be sublicensed, assigned, or otherwise transferred by User without NTT's prior written consent.
9.	   General
(a)	   If any provision, or part of a provision, of this Agreement is or becomes illegal, unenforceable, or invalidated, by operation of law or otherwise, that provision or part shall to that extent be deemed omitted, and the remainder of this Agreement shall remain in full force and effect.
(b)	   This Agreement is the complete and exclusive statement of the agreement between the parties with respect to the subject matter hereof, and supersedes all written and oral contracts, proposals, and other communications between the parties relating to that subject matter.  
(c)	   Subject to Section 8, this Agreement shall be binding on, and shall inure to the benefit of, the respective successors and assigns of NTT and User.  
(d)	   If either party to this Agreement initiates a legal action or proceeding to enforce or interpret any part of this Agreement, the prevailing party in such action shall be entitled to recover, as an element of the costs of such action and not as damages, its attorneys' fees and other costs associated with such action or proceeding.
(e)	   This Agreement shall be governed by and interpreted under the laws of Japan, without reference to conflicts of law principles. All disputes arising out of or in connection with this Agreement shall be finally settled by arbitration in Tokyo in accordance with the Commercial Arbitration Rules of the Japan Commercial Arbitration Association.  The arbitration shall be conducted by three (3) arbitrators and in Japanese. The award rendered by the arbitrators shall be final and binding upon the parties. Judgment upon the award may be entered in any court having jurisdiction thereof.
(f)　　	   NTT shall not be liable to the User or to any third party for any delay or failure to perform NTT’s obligation set forth under this Agreement due to any cause beyond NTT’s reasonable control.
 
EXHIBIT A
The software and related data include the following files,
- data_preprocessors
- download_scripts
- download.sh
- process_data.sh
- merge_datasets.py
- instructdoc_instructions.xlsx
- README


================================================
FILE: README.md
================================================
# InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
This repository includes the InstructDoc dataset introduced by the following paper: Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. "InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions". In Proc. of AAAI. 2024.

> We introduce InstructDoc, the first large-scale visual instruction tuning dataset that covers a wide range of VDU tasks and datasets.

![Figure 1 from paper](example.png)


# Get Started
## 1. Download datasets
```
sh download.sh
```
This script downloads most of the datasets automatically. Some datasets must be downloaded manually because of license and download restrictions; follow the instructions in [download_scripts/README.md](download_scripts).

## 2. Preprocess datasets
```
sh process_data.sh API_KEY
```
This script processes all the datasets. We use the Google Vision API to extract OCR information from document images; replace `API_KEY` with an API key obtained from [Google Cloud Platform](https://cloud.google.com/). To get a key, follow the [quickstart](https://cloud.google.com/vision/docs/quickstart). <br><br>
If you encounter a `FileNotFoundError` while processing the datasets, set the `--input_data_dir` argument of the corresponding script in [data_preprocessors](data_preprocessors) to your dataset directory.
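
As a rough illustration of the OCR step, the sketch below builds the JSON request that the Vision API's public REST endpoint (`images:annotate`) accepts for a single image. It is an assumption about how such a call can be made, not the code in `data_preprocessors/google_vision_ocr.py`; the helper name `build_ocr_request` and the choice of the `DOCUMENT_TEXT_DETECTION` feature are illustrative.

```python
import base64
import json

# Public REST endpoint of the Cloud Vision API.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes, api_key):
    """Return (url, payload) for one OCR request; hypothetical helper, not from the repository."""
    url = f"{VISION_ENDPOINT}?key={api_key}"
    body = {
        "requests": [
            {
                # The image is sent inline as base64-encoded bytes.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                # Assumed feature type; TEXT_DETECTION is the other OCR option.
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }
    return url, json.dumps(body)


url, payload = build_ocr_request(b"abc", "API_KEY")
print(url)
```

Sending `payload` to `url` with an HTTP POST and `Content-Type: application/json` returns the recognized text together with bounding boxes in the response.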

## 3. Merge preprocessed datasets
```
python merge_datasets.py --max_samples 5000 --input_data_dir processed_data --save_dir ./
```
We randomly sample at most 5000 instances from each held-in dataset. After processing the datasets, you obtain JSON files in the following format.
If a dataset provides multiple images per instance (e.g., SlideVQA), the "image", "ocr", and "bboxes" fields are named with a "_list" suffix instead.

<pre>
    {
      "dataset_name": dataset name,
      "id": identification of the instance,
      "image" or "image_list": image path,
      "ocr" or "ocr_list": ocr text,
      "bboxes" or "bboxes_list": [x1, y1, x2, y2, w, h],
      "conversations": [
        {'user': 'human', 'value': randomly sampled instruction},
        {'user': 'gpt', 'value': answer}
      ]
    }
</pre>
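
A minimal sketch of the two rules above: capped random sampling and the `_list` field naming. The helper names `cap_samples` and `field_names` and the fixed seed are hypothetical and are not taken from `merge_datasets.py`.

```python
import random


def cap_samples(records, max_samples=5000, seed=0):
    """Return at most max_samples records, drawn uniformly at random without replacement."""
    if len(records) <= max_samples:
        return list(records)
    return random.Random(seed).sample(records, max_samples)


def field_names(multi_image):
    """Return the image/OCR/bbox field names, with "_list" appended for multi-image datasets."""
    suffix = "_list" if multi_image else ""
    return [name + suffix for name in ("image", "ocr", "bboxes")]


records = [{"id": i} for i in range(12000)]
subset = cap_samples(records)
print(len(subset))         # 5000
print(field_names(True))   # ['image_list', 'ocr_list', 'bboxes_list']
print(field_names(False))  # ['image', 'ocr', 'bboxes']
```

Datasets with fewer instances than the cap are kept whole, so only the larger datasets are downsampled.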

# Citation

You can cite it as follows:
```bibtex
@inproceedings{InstructDoc2024,
  author    = {Ryota Tanaka and
               Taichi Iki and
               Kyosuke Nishida and
               Kuniko Saito and
               Jun Suzuki},
  title     = {InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions},
  booktitle = {AAAI},
  year      = {2024}
}
```

If you have any questions about the paper and repository, feel free to contact Ryota Tanaka (ryota.tanaka[at]ntt.com) or open an issue!


================================================
FILE: data_preprocessors/ai2d.py
================================================
import argparse
import json
import os
import random
from pathlib import Path

from PIL import Image, ImageDraw, ImageFont
from tqdm import tqdm

from utils import load_instructions

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.question_dir = os.path.join(args.input_data_dir, 'questions')
        self.ann_dir = os.path.join(args.input_data_dir, 'annotations')
        self.img_dir = os.path.join(args.input_data_dir, 'images')
        self.font = ImageFont.truetype(args.font_file, size=40)
        self.dataset_name = 'ai2d'
        self.split = ['train', 'test']

    def sort_coordinate(self, bboxes):
        return sorted(bboxes, key=lambda k: [k[1][1], k[1][0]])    

    def create_data(self):
        train = []
        test = []
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        with open(os.path.join(self.data_dir, 'ai2d_test_ids.csv')) as f:
            test_ids = f.read().splitlines()
        for i, file in enumerate(tqdm(sorted(os.listdir(self.question_dir)))):
            file_path = os.path.join(self.question_dir, file)
            with open(file_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            annotation_path = os.path.join(self.ann_dir, file)
            with open(annotation_path, 'r') as f:
                ann = json.load(f) 

            index = file.replace('.png.json', '')
            split = 'test' if str(index) in test_ids else 'train'

            image_path = os.path.join(self.img_dir, file)
            image_path = image_path.replace('.json', '')
            img = Image.open(image_path)
            draw = ImageDraw.Draw(img)

            # Draw each annotated text box and its replacement label onto the image.
            for text_ann in ann['text'].values():
                replacement_text = text_ann['replacementText']
                rect = text_ann['rectangle']
                x1, y1, x2, y2 = rect[0][0], rect[0][1], rect[1][0], rect[1][1]
                draw.rectangle((x1, y1, x2, y2), outline="lime", width=4)
                draw.text((x1, y1-30), replacement_text, font=self.font, fill="blue", align="center")

            image_path = os.path.join(self.out_data_dir, 'draw_images', f'{file.replace(".json", "")}')
            os.makedirs(os.path.dirname(image_path), exist_ok=True)
            img.save(image_path)
            
            for question, item in data['questions'].items():
                options = item['answerTexts']
                answer_index = item['correctAnswer']
                value = options[answer_index]

                instruction = random.choice(instructions)
                instruction = instruction.replace('<key>', question).replace('<options>', str(options))
                file_name = os.path.abspath(image_path)
                metadata = {
                    "image": file_name,
                    "conversations": [
                        {'from': 'human', 'value': instruction},
                        {'from': 'gpt', 'value': f"{value}"},
                    ],
                }
                if split == 'train':
                    train.append(metadata)
                elif split == 'test':
                    test.append(metadata)

        for split, results in [('train', train), ('test', test)]:
            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(results)}')
            with open(out_filepath, "w") as f:
                json.dump(results, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/ai2d', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/ai2d', type=str)
    parser.add_argument('--font_file', default='GoNotoCurrent.ttf', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/chartqa.py
================================================
import json
import os
import random
import argparse

from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import load_instructions
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.dataset_name = 'chartqa'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'val', 'test']
        os.makedirs(self.ocr_dir, exist_ok=True)
            
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            for qa_type in ['human', 'augmented']:
                file_name = os.path.join(self.data_dir, f'{split}/{split}_{qa_type}.json')
                with open(file_name, 'r') as f:
                    data = json.load(f)
                for d in tqdm(data):
                    image_name = d['imgname']
                    image_path = os.path.join(self.data_dir, f'{split}/png/{image_name}')
                    ocr_path = os.path.join(self.ocr_dir, f'{image_name.replace(".png", ".json")}')        
                    try:
                        img = Image.open(image_path)
                        img_w, img_h = img.size
                        if not os.path.exists(ocr_path):
                            items = self.google_ocr.recognize_image(img)
                            if items == "error":
                                print('OCR error: ', image_path)
                                continue
                            with open(ocr_path, 'w') as f:
                                json.dump(items, f)
                        else:
                            with open(ocr_path, 'r') as f:
                                items = json.load(f)
                        words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                    except Exception:  # fall back to empty OCR if the image or OCR result cannot be read
                        words, bboxes = [], []
                    
                    question = d['query']
                    value = d['label']
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('<key>', question)
                    ocr = ' '.join(words)

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr, 
                        "bboxes": bboxes, 
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/chartqa', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/chartqa', type=str)
    parser.add_argument('--api_key', type=str, help='google vision api key')
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()


================================================
FILE: data_preprocessors/cord.py
================================================
import json
import os
import random

from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, sort_coordinate, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'cord'
        self.split = ['train', 'dev', 'test']

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            ann_dir = os.path.join(self.data_dir, f'{split}/json')
            img_dir = os.path.join(self.data_dir, f'{split}/image')
            for file in tqdm(sorted(os.listdir(ann_dir))):
                file_path = os.path.join(ann_dir, file)
                with open(file_path, 'r', encoding='utf-8') as f:
                    data = json.load(f)

                image_path = os.path.join(img_dir, file)
                image_path = image_path.replace('.json', '.png')
                image = Image.open(image_path)
                w, h = image.size

                items = []
                labels = {}
                for item in data["valid_line"]:
                    words, label = item["words"], item["category"]
                    words = [word for word in words if word["text"].strip() != ""]
                    if len(words) == 0:
                        continue
                    text = " ".join([word["text"] for word in words])
                    bbox = [words[0]["quad"]["x1"], words[0]["quad"]["y1"], words[-1]["quad"]["x3"], words[-1]["quad"]["y3"]]
                    bbox = normalize_bbox(bbox, w, h)
                    items.append((text, label, bbox))

                items = sort_coordinate(items)
                ocr = []
                bboxes = []
                for item in items:
                    words, label, bbox = item
                    labels[words] = label
                    ocr.append(words)
                    bbox = [bbox] * len(words.split())
                    bboxes += bbox
                ocr = ' '.join(ocr)

                for key in labels:
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('<key>', key)
                    value = labels[key]

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/cord/CORD', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/cord', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/deepform.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'deepform'
        self.split = ['train', 'dev']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()
        ocrs = {}
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1] # microsoft cv
            except Exception:
                content = d['contents'][0] # tesseract

            bboxes = []
            tokens = []
            try:
                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except Exception:
                # Keep whatever tokens were read before the malformed entry.
                pass
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()

            ocrs = self.create_ocr_data(split)
            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
                image_name = d['name'].replace('.pdf', '')
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    template = random.choice(instructions)
                    if 'children' in ann['values'][0]:
                        for v in ann['values']:
                            for child in v['children']:
                                value = child['key']
                                key = child['values'][0]['value']
                                # Fill the unmodified template so '<key>' is replaced for every child.
                                instruction = template.replace('<key>', key)
                                ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1]

                                target_format.append({
                                    "image": file_name,
                                    "ocr": ocr,
                                    "bboxes": bboxes,
                                    "conversations": [
                                        {'from': 'human', 'value': instruction},
                                        {'from': 'gpt', 'value': value},
                                    ],
                                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/DeepForm', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/deepform', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/docbank.py
================================================
import argparse
import json
import os
import random
from collections import defaultdict
from pathlib import Path

from tqdm import tqdm

from utils import normalize_bbox, load_instructions, convert_wh

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'docbank'
        self.split = ['train', 'valid', 'test']

    def sort_coordinate(self, bboxes):
        return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]])    

    def create_ocr_data(self, data):
        ocr_info = {}
        for image_info in tqdm(data['images']):
            file_name = image_info['file_name']
            image_id = image_info['id']
            width, height = image_info['width'], image_info['height']

            image_path = os.path.join(self.data_dir,  f'DocBank_500K_ori_img/{file_name}')
            txt_path = os.path.join(self.data_dir,  f'DocBank_500K_txt/{file_name.replace("_ori.jpg", ".txt")}')
            with open(txt_path, 'r') as f:
                txt_data = f.read().splitlines()

            words = []
            bboxes = []
            for d in txt_data:      
                d = d.split('\t')
                word = d[0]
                word_position = convert_wh([int(d[1]), int(d[2]), int(d[3]), int(d[4])])
                if word_position[0] >= word_position[2] or word_position[1] >= word_position[3]:
                    continue
                words.append(word)
                bboxes.append(word_position)
            
            text_sequence = ' '.join(words)
            ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height}
        return ocr_info
    
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            with open(os.path.join(self.data_dir, f'500K_{split}.json'), "r") as f:
                data = json.load(f)

            ocr_info = self.create_ocr_data(data)
            categories = data['categories']

            target_format = []
            annotations = defaultdict(list)
            for ann_info in data['annotations']:
                image_id = ann_info['image_id']
                annotations[image_id].append(ann_info)

            for image_id in tqdm(annotations):
                image_info = ocr_info[image_id]
                image_path = image_info['image_path']
                text_sequence = image_info['text_sequence']
                bboxes = image_info['bboxes']
                width, height = image_info['width'], image_info['height']

                items = []
                for ann in annotations[image_id]:
                    category_id = ann['category_id']
                    category_name = categories[category_id-1]['name']
                    bbox = ann['bbox']
                    bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]]
                    bbox = normalize_bbox(bbox, width, height)
                    items.append((category_name, bbox))
                items = self.sort_coordinate(items)

                dla = []
                for item in items:
                    category_name, bbox = item
                    dla.append(f'{category_name} {bbox}')
                value = ' '.join(dla)

                instruction = random.choice(instructions)        
                file_name = os.path.abspath(image_path)

                target_format.append({
                    "image": file_name,
                    "ocr": text_sequence,
                    "bboxes": bboxes,
                    "conversations": [
                        {'from': 'human','value': instruction},
                        {'from': 'gpt', 'value': value},
                    ],
                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/docbank', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/docbank', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/docile.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import sort_coordinate, load_instructions, normalize_bbox
import argparse
from collections import defaultdict

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'docile'
        self.ann_dir = os.path.join(args.input_data_dir, 'annotations')
        self.img_dir = os.path.join(args.input_data_dir, 'images')
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocr')
        self.split = ['train', 'val']

    def extract_ocr_info(self, ocr_data):
        tokens = []
        bboxes = []
        for page in ocr_data['pages']:
            for block in page['blocks']:
                for line in block['lines']:
                    for word in line['words']:
                        left_top, right_bottom = word['geometry']
                        # Assumption: the OCR geometry is already relative (0-1), so width = height = 1.
                        bbox = normalize_bbox([left_top[0], left_top[1], right_bottom[0], right_bottom[1]], 1, 1)
                        bboxes.append(bbox)
                        tokens.append(word['value'])
        return tokens, bboxes

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, f'{split}.json')
            with open(file_name, 'r') as f:
                ann_filenames = json.load(f)

            target_format = []
            for file in tqdm(ann_filenames):
                image_path = os.path.join(self.img_dir, file + '0001-1.jpg')
                with open(os.path.join(self.ocr_dir, f'{file}.json'), 'r', encoding='utf-8') as f:
                    ocr_data = json.load(f)
                with open(os.path.join(self.ann_dir, f'{file}.json'), 'r', encoding='utf-8') as f:
                    d = json.load(f)

                items = []
                for item in d["field_extractions"]:
                    if item["page"] == 0:                
                        text, label = item["text"], item["fieldtype"]
                        bbox = item["bbox"]
                        items.append((text, label, bbox))
                if len(items) == 0:
                    continue
                items = sort_coordinate(items)

                labels = {}
                for item in items:
                    tokens, label, bbox = item
                    labels[tokens] = label

                tokens, bboxes = self.extract_ocr_info(ocr_data)
                ocr = ' '.join(tokens)

                for key in labels:
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('<key>', key)
                    value = labels[key]

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/docile/data/docile', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/docile', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/doclaynet.py
================================================
import json
import os
import random
from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, sort_coordinate, load_instructions, convert_wh
from collections import defaultdict
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'doclaynet'
        self.split = ['train', 'val']

    def sort_coordinate(self, bboxes):
        return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]])    

    def create_ocr_data(self, data):
        ocr_info = {}
        for image_info in data['images']:
            file_name = image_info['file_name']
            image_id = image_info['id']
            image_path = os.path.join(self.data_dir,  f'PNG/{file_name}')
            json_path = os.path.join(self.data_dir,  f'JSON/{file_name.replace(".png", ".json")}')
            width, height = image_info['width'], image_info['height']
            with open(json_path, 'r') as f:
                json_data = json.load(f)
            items = []

            for cell in json_data['cells']:
                text = cell['text']
                bbox = cell['bbox']
                bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]]
                bbox = convert_wh(normalize_bbox(bbox, width, height))
                items.append((text, bbox))

            items = self.sort_coordinate(items)
            words = []
            bboxes = []
            for text, bbox in items:
                words.append(text)
                bboxes.append(bbox)
            text_sequence = ' '.join(words)
            ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height}
        return ocr_info
    
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            with open(os.path.join(self.data_dir, f'COCO/{split}.json'), "r") as f:
                data = json.load(f)
            ocr_info = self.create_ocr_data(data)
            categories = data['categories']

            target_format = []
            annotations = defaultdict(list)
            for ann_info in data['annotations']:
                image_id = ann_info['image_id']
                annotations[image_id].append(ann_info)

            for image_id in tqdm(annotations):
                image_info = ocr_info[image_id]
                image_path = image_info['image_path']
                text_sequence = image_info['text_sequence']
                bboxes = image_info['bboxes']
                width, height = image_info['width'], image_info['height']

                items = []
                for ann in annotations[image_id]:
                    category_id = ann['category_id']
                    category_name = categories[category_id-1]['name']
                    bbox = ann['bbox']
                    bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]]
                    bbox = normalize_bbox(bbox, width, height)
                    items.append((category_name, bbox))
                items = self.sort_coordinate(items)

                dla = []
                for item in items:
                    category_name, bbox = item
                    dla.append(f'{category_name} {bbox}')
                value = ' '.join(dla)

                instruction = random.choice(instructions)        
                file_name = os.path.abspath(image_path)

                target_format.append({
                    "image": file_name,
                    "ocr": text_sequence,
                    "bboxes": bboxes,
                    "conversations": [
                        {'from': 'human','value': instruction},
                        {'from': 'gpt', 'value': value},
                    ],
                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/doclaynet', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/doclaynet', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/docvqa.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'docvqa'
        self.split = ['train', 'dev']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()
        ocrs = {}
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1] # microsoft cv
            except (IndexError, KeyError):
                content = d['contents'][0] # tesseract

            bboxes = []
            tokens = []
            try:
                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except Exception:
                # Keep the OCR empty when the expected fields are missing or malformed.
                pass
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()

            ocrs = self.create_ocr_data(split)
            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
                image_name = d['name'].replace('.pdf', '')
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    instruction = random.choice(instructions)
                    question = ann['key']
                    instruction = instruction.replace('<key>', question)
                    ocr, bboxes = ocrs[image_name]
                    value = ann['values'][0]['value']
                    values = ann['values'][0]['value_variants']

                    target_format.append({
                        "image": file_name,
                        "ocr": ocr, 
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value, 'values': values},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/docvqa', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/docvqa', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/docvqa_iq.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'docvqa_iq'
        self.split = ['train', 'dev']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()
        ocrs = {}
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1] # microsoft cv
            except (IndexError, KeyError):
                content = d['contents'][0] # tesseract

            bboxes = []
            tokens = []
            try:
                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except Exception:
                # Keep the OCR empty when the expected fields are missing or malformed.
                pass
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()

            ocrs = self.create_ocr_data(split)
            target_format = []
            questions = []
            for d in data:
                d = json.loads(d)
                for ann in d['annotations']:
                    question = ann['key']
                    questions.append(question)

            for d in tqdm(data):
                d = json.loads(d)
                image_name = d['name'].replace('.pdf', '')
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    instruction = random.choice(instructions)
                    if random.random() > 0.5:
                        question = random.choice(questions)
                        # A randomly drawn question can coincide with the real one.
                        value = 'yes' if question == ann['key'] else 'no'
                    else:
                        question = ann['key']
                        value = 'yes'

                    instruction = instruction.replace('<key>', question)
                    ocr, bboxes = ocrs[image_name]

                    target_format.append({
                        "image": file_name,
                        "ocr": ocr, 
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/docvqa', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/docvqa_iq', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/dude.py
================================================
import json
import os
import random
import argparse
import glob

from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.image_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/images')
        self.ocr_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/OCR')
        self.dataset_name = 'dude'
        self.split = ['train', 'val']
            
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        file_name =  os.path.join(self.data_dir, '2023-03-23_DUDE_gt_test_PUBLIC.json')
        with open(file_name, 'r') as f:
            data = json.load(f)
        train, validation = [],[]
        for d in tqdm(data['data']):
            docid = d['docId']
            question = d['question']
            split = d['data_split']          
            if split in self.split:
                image_paths = []
                pages = len(glob.glob(os.path.join(self.image_dir, split, f'{docid}_*.jpg')))
                for i in range(pages):
                    image_path = os.path.join(self.image_dir, split, f'{docid}_{i}.jpg')
                    image_path = os.path.abspath(image_path)                
                    image_paths.append(image_path)

                ocr_path = os.path.join(self.ocr_dir, f'Azure/{docid}_due.json')
                try:
                    with open(ocr_path, 'r') as f:
                        ocr_info = json.load(f)
                except (OSError, json.JSONDecodeError):
                    # Skip documents whose OCR file is missing or unreadable.
                    continue

                structure_value = ocr_info['structures']['pages']['structure_value']
                image_sizes = ocr_info['structures']['pages']['positions']
                text_sequences = []
                bboxes = []
                for page_split, image_size in zip(structure_value, image_sizes):
                    start = page_split[0]
                    end = page_split[1]
                    page_tokens = ' '.join(ocr_info['tokens'][start:end])
                    page_bboxes = []
                    for bbox in ocr_info['positions'][start:end]:
                        bbox = normalize_bbox(bbox, image_size[2], image_size[3])
                        page_bboxes.append(bbox)
                    text_sequences.append(page_tokens)
                    bboxes.append(page_bboxes)
                                
                if len(text_sequences) != len(image_paths):
                    continue

                instruction = random.choice(instructions)
                instruction = instruction.replace('<key>', question)
                if 'answers' in d:
                    if d['answer_type'] == 'not-answerable':
                        # Unanswerable questions get the literal answer 'none'.
                        value = 'none'
                    else:
                        value = d['answers'][0]
                else:
                    value = ''

                sample = {
                    "image_list": image_paths,
                    "ocr_list": text_sequences, 
                    "bboxes_list": bboxes, 
                    "conversations": [
                        {'from': 'human', 'value': instruction},
                        {'from': 'gpt', 'value': value},
                    ],
                }

                if split == 'train':
                    train.append(sample)
                elif split == 'val':
                    validation.append(sample)

        for split, target_format in [('train', train), ('validation', validation)]:
            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/dude', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/dude', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/funsd.py
================================================
import json
import os
import random
import cv2
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, sort_coordinate, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'funsd'
        self.split = ['training', 'testing']
        self.label_mapping = {'header': 'title',
                              'question': 'key',
                              'answer': 'value',
                              'other': 'other'}

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            ann_dir = os.path.join(self.data_dir, f'{split}_data/annotations')
            img_dir = os.path.join(self.data_dir, f'{split}_data/images')
            for i, file in enumerate(tqdm(sorted(os.listdir(ann_dir)))):
                file_path = os.path.join(ann_dir, file)
                with open(file_path, 'r', encoding='utf-8') as f:
                    data = json.load(f)

                image_path = os.path.join(img_dir, file)
                image_path = image_path.replace('.json', '.png') 
                image = cv2.imread(image_path)
                h, w, _ = image.shape

                items = []
                for item in data["form"]:
                    text = item['text']
                    words, label = item["words"], item["label"]
                    label = self.label_mapping[label]
                    words = [w for w in words if w["text"].strip() != ""]
                    if len(words) == 0:
                        continue
                    start_bbox, end_bbox = words[0]['box'], words[-1]['box']
                    bbox = [start_bbox[0], start_bbox[1], end_bbox[2], end_bbox[3]]
                    bbox = normalize_bbox(bbox, w, h)
                    items.append((text, label, bbox))
                items = sort_coordinate(items)

                text_sequence = []
                bboxes = []
                labels = {}
                for item in items:
                    text, label, bbox = item
                    labels[text] = label
                    text_sequence.append(text)
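                    # Repeat the entity box once per whitespace-separated token so boxes align with the OCR text.
                    bbox = [bbox] * len(text.split())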
                    bboxes += bbox

                ocr = ' '.join(text_sequence)
                for key in labels:
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('<key>', key)
                    value = labels[key]

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            split = split.replace('ing', '')
            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/funsd/dataset', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/funsd', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/google_vision_ocr.py
================================================
import base64
import json
from requests import Request, Session
from io import BytesIO
from utils import normalize_bbox

class Google_OCR:
    def __init__(self, api_key):
        self.api_key = api_key

    def pil_image_to_base64(self, pil_image):
        buffered = BytesIO()
        pil_image.save(buffered, format="PNG")
        str_encode_file = base64.b64encode(buffered.getvalue()).decode("utf-8")
        return str_encode_file

    def recognize_image(self, pil_image):
        str_encode_file = self.pil_image_to_base64(pil_image)
        str_url = "https://vision.googleapis.com/v1/images:annotate?key="
        str_headers = {'Content-Type': 'application/json'}
        str_json_data = {
            'requests': [
                {
                    'image': {
                        'content': str_encode_file
                    },
                    'features': [
                        {
                            'type': "TEXT_DETECTION",
                        }
                    ]
                }
            ]
        }

        obj_session = Session()
        obj_request = Request("POST",
                              str_url + self.api_key,
                              data=json.dumps(str_json_data),
                              headers=str_headers
                              )
        obj_prepped = obj_session.prepare_request(obj_request)
        obj_response = obj_session.send(obj_prepped,
                                        verify=True,
                                        timeout=60
                                        )

        if obj_response.status_code == 200:
            return obj_response.json()

        else:
            return "error"

    def extract_info(self, items, img_w, img_h):
        words = []
        bboxes = []
        for page_ocrs in items['responses'][0]['fullTextAnnotation']['pages']:
            for block_ocrs in page_ocrs['blocks']:
                for para_ocrs in block_ocrs['paragraphs']:
                    for word_ocrs in para_ocrs['words']:
                        char_bboxes = []
                        word = ''
                        for sym_ocrs in word_ocrs['symbols']:
                            try:
                                bbox = sym_ocrs['boundingBox']
                                xmin = max(0, bbox['vertices'][0]['x'])
                                ymin = max(0, bbox['vertices'][0]['y'])
                                xmax = max(0, bbox['vertices'][2]['x'])
                                ymax = max(0, bbox['vertices'][2]['y'])
                                bbox = [xmin, ymin, xmax, ymax]
                            except (KeyError, IndexError):
                                # Skip symbols whose bounding box is missing or incomplete.
                                continue
                            word += sym_ocrs['text']
                            char_bboxes.append(bbox)
                        if len(char_bboxes) > 0:
                            x1 = [w_p[0] for w_p in char_bboxes]
                            y1 = [w_p[1] for w_p in char_bboxes]
                            x2 = [w_p[2] for w_p in char_bboxes]
                            y2 = [w_p[3] for w_p in char_bboxes]
                            word_bbox = [min(x1), min(y1), max(x2), max(y2)]
                            if word_bbox[0] >= word_bbox[2] or word_bbox[1] >= word_bbox[3]:
                                continue
                            word_bbox = normalize_bbox(word_bbox, img_w, img_h)
                            words.append(word)
                            bboxes.append(word_bbox)
        return words, bboxes


================================================
FILE: data_preprocessors/hwsquad.py
================================================
import json
import os
import random
import argparse
import csv

from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
from collections import defaultdict

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'hwsquad'
        self.split = ['train', 'val', 'test']
    
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            filename = os.path.join(self.data_dir, f"HW-SQuAD_{split}_1.0.json")
            with open(filename, "r") as f:
                annotations = json.load(f)

            target_format = []
            for ann in tqdm(annotations["data"]):
                qas = ann["qas"]
                image_path = ann["document_image"]["document_image"]
                h, w = ann["document_image"]["image_height"], ann["document_image"]["image_width"]

                words = []
                bboxes = []
                for item in ann["document_image"]["gold_standard_transcription"]:
                    word = item["text"]
                    words.append(word)
                    bbox = [item["xmin"], item["ymin"], item["xmax"], item["ymax"]]
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                
                for qa in qas:
                    question = qa["question"]
                    first_answer = qa["answers"][0]
                    # The annotated end index is inclusive, so add 1 for slicing.
                    start_index = first_answer["answer_start_word_no"]
                    end_index = first_answer["answer_end_word_no"] + 1
                    answer = words[start_index:end_index]
                    answer = " ".join(answer)

                    instruction = random.choice(instructions)        
                    instruction = instruction.replace('<key>', question)
                    ocr = ' '.join(words)

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': answer},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/HW-SQuAD/HW-SQuAD_annotations', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/hwsquad', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/iconqa.py
================================================
import json
import os
import random
import glob
from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'iconqa'
        self.split = ['train', 'val']

    def create_data(self):
        for split in self.split:
            for answer_style in ['fill_in_blank', 'choose_txt']:
                target_format = []
                dataset_name = f'{self.dataset_name}_{answer_style}'
                instructions = load_instructions(self.instruction_path)[dataset_name]

                data_dir = os.path.join(self.data_dir, f'{split}/{answer_style}/*')
                for file_path in glob.glob(data_dir):
                    data_path = os.path.join(file_path, 'data.json')
                    image_path = os.path.join(file_path, 'image.png')
                    with open(data_path, 'r') as f:
                        data = json.load(f)
                    question = data['question']
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('<key>', question)
                    if answer_style == 'fill_in_blank':
                        value = data['answer']
                    else:
                        options = data['choices']
                        answer_index = data['answer']
                        value = str(options[answer_index])
                        # str.replace() needs a string, so render the list of choices as text.
                        instruction = instruction.replace('<options>', str(options))

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': f"{value}"},
                        ],
                    })
            
                out_data_dir = f'{self.out_data_dir}_{answer_style}'
                out_filepath = os.path.join(out_data_dir, f'{split}.json')        
                os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

                print(f'{split}: {len(target_format)}')
                with open(out_filepath, "w") as f:
                    json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/iconqa/iconqa_data', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/iconqa', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/infographicvqa.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'infographicvqa'
        self.split = ['train', 'dev']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()
        ocrs = {}
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1] # microsoft cv
            except:
                content = d['contents'][0] # tesseract

            bboxes = []
            tokens = []
            try:
                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except:
                pass
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()

            ocrs = self.create_ocr_data(split)
            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
                image_name = d['name'].replace('.pdf', '')
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    instruction = random.choice(instructions)
                    question = ann['key']
                    instruction = instruction.replace('<key>', question)
                    ocr, bboxes = ocrs[image_name]
                    value = ann['values'][0]['value']
                    values = ann['values'][0]['value_variants']

                    target_format.append({
                        "image": file_name,
                        "ocr": ocr, 
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value, 'values': values},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/infographics_vqa', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/infographicvqa', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/klc.py
================================================
import json
import os
import random

from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'klc'
        self.split = ['train', 'dev']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()
        ocrs = {}
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1] # microsoft cv
            except:
                content = d['contents'][0] # tesseract

            bboxes = []
            tokens = []
            try:
                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except:
                pass
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()

            ocrs = self.create_ocr_data(split)
            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
                image_name = d['name'].replace('.pdf', '')
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    if 'children' in ann['values'][0]:
                        for v in ann['values']:
                            for child in v['children']:
                                value = child['key']
                                key = child['values'][0]['value']
                                # Draw a fresh template per child; an already-filled template
                                # has no '<key>' placeholder left to replace.
                                instruction = random.choice(instructions)
                                instruction = instruction.replace('<key>', key)
                                ocr, bboxes = ocrs[image_name]

                                target_format.append({
                                    "image": file_name,
                                    "ocr": ocr,
                                    "bboxes": bboxes,
                                    "conversations": [
                                        {'from': 'human', 'value': instruction},
                                        {'from': 'gpt', 'value': value},
                                    ],
                                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/kleister-charity', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/klc', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/llavar.py
================================================
import json
import os
import random
import argparse

from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
from collections import defaultdict
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.image_dir = os.path.join(args.input_data_dir, 'images')
        self.google_ocr = Google_OCR(args.api_key)
        os.makedirs(self.ocr_dir, exist_ok=True)

    def create_data(self):
        file_name = os.path.join(self.data_dir, 'llava_instruct_150k_llavar_20k.json')
        with open(file_name, 'r') as f:
            data = json.load(f)
        target_format = []
        for d in data:
            image_name = d["image"]
            image_path = os.path.join(self.image_dir, image_name)
            if not os.path.exists(image_path):
                continue

            ocr_path = os.path.join(self.ocr_dir, f"{image_name.replace('.jpg', '.json')}")
            try:
                img = Image.open(image_path)
                img_w, img_h = img.size
                if not os.path.exists(ocr_path):
                    items = self.google_ocr.recognize_image(img)
                    if items == 'error':
                        print('OCR error: ', image_path)
                        continue
                    with open(ocr_path, 'w') as f:
                        json.dump(items, f)
                else:
                    with open(ocr_path, 'r') as f:
                        items = json.load(f)
                words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
            except:
                words, bboxes = [], []

            ocr = ' '.join(words)
            file_name = os.path.abspath(image_path)
            d["image"] = file_name
            d["ocr"] = ocr
            d["bboxes"] = bboxes
            target_format.append(d)

        out_filepath = os.path.join(self.out_data_dir, 'train.json')        
        os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

        print(f'train: {len(target_format)}')
        with open(out_filepath, "w") as f:
            json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/llavar', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/llavar', type=str)
    parser.add_argument('--api_key', default='API_KEY', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/ocrvqa.py
================================================
import json
import os
import random
import argparse
import csv

from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
from collections import defaultdict
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.image_dir = os.path.join(args.input_data_dir, 'images')
        self.dataset_name = 'ocrvqa'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'dev', 'test']
        self.split_dict = {1: 'train', 2: 'dev', 3: 'test'}
        os.makedirs(self.ocr_dir, exist_ok=True)
        
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            file_name = os.path.join(self.data_dir, 'dataset.json')
            with open(file_name, 'r') as f:
                data = json.load(f)
            for image_id in tqdm(data):
                d = data[image_id]
                split_id = d['split']
                if split != self.split_dict[split_id]:
                    continue
                image_path = os.path.join(self.image_dir, f'{image_id}.jpg')
                if not os.path.exists(image_path):
                    continue

                ocr_path = os.path.join(self.ocr_dir, f"{image_id}.json")
                try:
                    img = Image.open(image_path)
                    img_w, img_h = img.size
                    if not os.path.exists(ocr_path):
                        items = self.google_ocr.recognize_image(img)
                        if items == "error":
                            print('error: ', image_path)
                            continue
                        with open(ocr_path, 'w') as f:
                            json.dump(items, f)
                    else:
                        with open(ocr_path, 'r') as f:
                            items = json.load(f)
                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                except:
                    words, bboxes = [], []

                ocr = ' '.join(words)
                file_name = os.path.abspath(image_path)
                for question, answer in zip(d['questions'], d['answers']):
                    instruction = random.choice(instructions)        
                    instruction = instruction.replace('<key>', question)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': answer},
                        ],
                    }) 

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/OCR-VQA-200K', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/ocrvqa', type=str)
    parser.add_argument('--api_key', default='API_KEY', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/pwc.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'pwc'
        self.split = ['train', 'dev']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()
        ocrs = {}
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1] # microsoft cv
            except:
                content = d['contents'][0] # tesseract

            bboxes = []
            tokens = []
            try:
                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except:
                pass
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()

            ocrs = self.create_ocr_data(split)
            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
                image_name = d['name'].replace('.pdf', '')
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    if 'children' in ann['values'][0]:
                        for v in ann['values']:
                            for child in v['children']:
                                value = child['key']
                                key = child['values'][0]['value']
                                # Draw a fresh template per child; an already-filled template
                                # has no '<key>' placeholder left to replace.
                                instruction = random.choice(instructions)
                                instruction = instruction.replace('<key>', key)
                                ocr, bboxes = ocrs[image_name]

                                target_format.append({
                                    "image": file_name,
                                    "ocr": ocr,
                                    "bboxes": bboxes,
                                    "conversations": [
                                        {'from': 'human', 'value': instruction},
                                        {'from': 'gpt', 'value': value},
                                    ],
                                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/AxCell', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/pwc', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/rvlcdip.py
================================================
import json
import os
import random
import argparse

from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import load_instructions
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.image_dir = os.path.join(args.input_data_dir, 'images')
        self.dataset_name = 'rvlcdip'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'val', 'test']
        self.class_dict = {'4': "advertisement", '10': "budget", '2': "email", 
                          '8': "file_folder", '1': "form", '3': "handwritten", 
                          '11': "invoice", '0': "letter", '15': "memo", 
                          '9': "news_article", '12': "presentation", '13': "questionnaire", 
                          '14': "resume", '6': "scientific_publication", '5':"scientific_report", 
                          '7': "specification"}

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f:
                labels = f.read().splitlines()
            for label in labels:
                filename, label = label.split(' ')
                value = self.class_dict[label]
                image_path = os.path.join(self.image_dir, filename)
                # Use the relative name parsed from the label line; file_name is not assigned until later.
                ocr_path = os.path.join(self.ocr_dir, filename.replace(".tif", ".json"))
                try:
                    img = Image.open(image_path)
                    img_w, img_h = img.size
                    if not os.path.exists(ocr_path):
                        items = self.google_ocr.recognize_image(img)
                        if items == "error":
                            print('OCR error: ', image_path)
                            continue
                        os.makedirs(os.path.dirname(ocr_path), exist_ok=True)
                        with open(ocr_path, 'w') as f:
                            json.dump(items, f)
                    else:
                        with open(ocr_path, 'r') as f:
                            items = json.load(f)
                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                except:
                    words, bboxes = [], []

                instruction = random.choice(instructions)
                ocr = ' '.join(words)

                file_name = os.path.abspath(image_path)
                target_format.append({
                    "image": file_name,
                    "ocr": ocr, 
                    "bboxes": bboxes,
                    "conversations": [
                        {'from': 'human', 'value': instruction},
                        {'from': 'gpt', 'value': value},
                    ],
                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/rvlcdip', type=str)
    parser.add_argument('--api_key', default='API_KEY', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/rvlcdip_io.py
================================================
import json
import os
import random
import argparse

from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import load_instructions
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.image_dir = os.path.join(args.input_data_dir, 'images')
        self.dataset_name = 'rvlcdip_io'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'val', 'test']

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            ocrs = []
            with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f:
                labels = f.read().splitlines()
            for label in labels:
                # The class id is unused in this yes/no task, and this class defines no class_dict.
                filename, _ = label.split(' ')
                image_path = os.path.join(self.image_dir, filename)
                ocr_path = os.path.join(self.ocr_dir, filename.replace(".tif", ".json"))
                try:
                    img = Image.open(image_path)
                    img_w, img_h = img.size
                    if not os.path.exists(ocr_path):
                        items = self.google_ocr.recognize_image(img)
                        if items == "error":
                            print('OCR error: ', image_path)
                            continue
                        os.makedirs(os.path.dirname(ocr_path), exist_ok=True)
                        with open(ocr_path, 'w') as f:
                            json.dump(items, f)
                    else:
                        with open(ocr_path, 'r') as f:
                            items = json.load(f)
                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                except:
                    words, bboxes = [], []

                ocr = ' '.join(words)
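                ocrs.append((image_path, ocr, bboxes))

            # Iterate over the images that were actually processed so that each
            # example keeps its own image path and OCR.
            for image_path, ocr, bboxes in ocrs:
                instruction = random.choice(instructions)
                if random.random() > 0.5:
                    # Negative example: swap in OCR drawn at random from this split.
                    _, ocr, bboxes = random.choice(ocrs)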
                    value = 'no'
                else:
                    value = 'yes'
                
                file_name = os.path.abspath(image_path)
                target_format.append({
                    "image": file_name,
                    "ocr": ocr, 
                    "bboxes": bboxes, 
                    "conversations": [
                        {'from': 'human', 'value': instruction},
                        {'from': 'gpt', 'value': value},
                    ],
                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/rvlcdip_io', type=str)
    parser.add_argument('--api_key', default='API_KEY', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/scicap.py
================================================
import json
import os
import random
import argparse

from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import load_instructions
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.image_dir = os.path.join(args.input_data_dir, 'SciCap-No-Subfig-Img')
        self.caption_dir = os.path.join(args.input_data_dir, 'SciCap-Caption-All')
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.dataset_name = 'scicap'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'val', 'test']
        os.makedirs(self.ocr_dir, exist_ok=True)
        
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        ocr_info = {}
        for split in self.split:
            with open(os.path.join(self.data_dir, f'List-of-Files-for-Each-Experiments/Single-Sentence-Caption/No-Subfig/{split}/file_idx.json'), "r") as f:
                split_info = json.load(f)
            target_format = []
            for file_name in tqdm(split_info):
                image_path = os.path.join(self.image_dir, split, file_name)
                caption_path = os.path.join(self.caption_dir,  split, f'{file_name.replace(".png", ".json")}')
                ocr_path = os.path.join(self.ocr_dir, f'{file_name.replace(".png", ".json")}')

                with open(caption_path, "r") as f:
                    annotation = json.load(f)
                try:
                    img = Image.open(image_path)
                    img_w, img_h = img.size
                    if not os.path.exists(ocr_path):
                        items = self.google_ocr.recognize_image(img)
                        if items == "error":
                            print('OCR error: ', image_path)
                            continue
                        with open(ocr_path, 'w') as f:
                            json.dump(items, f)
                    else:
                        with open(ocr_path, 'r') as f:
                            items = json.load(f)
                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                except Exception:  # fall back to empty OCR, but let Ctrl-C propagate
                    words, bboxes = [], []

                value = annotation['1-lowercase-and-token-and-remove-figure-index']['caption']
                instruction = random.choice(instructions)
                ocr = ' '.join(words)

                file_name = os.path.abspath(image_path)
                target_format.append({
                    "image": file_name,
                    "ocr": ocr, 
                    "bboxes": bboxes, 
                    "conversations": [
                        {'from': 'human', 'value': instruction},
                        {'from': 'gpt', 'value': value},
                    ],
                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/scicap', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/scicap', type=str)
    parser.add_argument('--api_key', default='API_KEY', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/scienceqa.py
================================================
import json
import os
import random
import glob
from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, sort_coordinate, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'scienceqa'

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        train, val, test = [],[],[]
        target_format = []
        ann_filename = os.path.join(self.data_dir, 'data/scienceqa/problems.json')
        with open(ann_filename, 'r') as f:
            anns = json.load(f)
        for questionId, ann in tqdm(anns.items()):
            question = ann['question']
            choices = ann['choices']
            value = choices[ann['answer']]
            split = ann['split']
            image_name = ann['image']
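            # JSON null is decoded as None, so str(image_name) would be 'None', never 'null'.
            if image_name is None or str(image_name) == 'null':
                continue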

            image_path = os.path.join(self.data_dir, split, questionId, image_name)
            instruction = random.choice(instructions)
            instruction = instruction.replace('<key>', question).replace('<options>', str(choices))

            file_name = os.path.abspath(image_path)
            sample = {
                "image": file_name,
                "conversations": [
                    {'from': 'human', 'value': instruction},
                    {'from': 'gpt', 'value': f"{value}"},
                ],
            }
            if split == 'train':
                train.append(sample)
            elif split == 'val':
                val.append(sample)
            elif split == 'test':
                test.append(sample)
        
        for split, target_format in [('train', train), ('val', val), ('test', test)]:
            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/scienceqa', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/scienceqa', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/screen2words.py
================================================
import json
import os
import random
import argparse

from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import load_instructions
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.image_dir = os.path.join(args.input_data_dir, 'combined')
        self.dataset_name = 'screen2words'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'dev']
        os.makedirs(self.ocr_dir, exist_ok=True)
    
    def load_captions(self):
        with open(os.path.join(self.data_dir, 'screen_summaries.csv'), "r") as f:
            lines = f.read().splitlines()
        captions = {}
        for i, line in enumerate(lines):
            if i != 0:
                items = line.split(',')
                if len(items) > 2:
                    screenId = items[0]
                    summary = line[len(screenId)+1:]
                else:
                    screenId, summary = items
                captions[screenId] = summary
        return captions
        
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        captions = self.load_captions()
        for split in self.split:
            target_format = []
            with open(os.path.join(self.data_dir, f'split/{split}_screens.txt'), "r") as f:
                split_info = f.read().splitlines()
            for split_id in tqdm(split_info):
                image_path = os.path.join(self.image_dir,  f'{split_id}.jpg')
                ocr_path = os.path.join(self.ocr_dir, f'{split_id}.json')
                try:
                    img = Image.open(image_path)
                    img_w, img_h = img.size
                    if not os.path.exists(ocr_path):
                        items = self.google_ocr.recognize_image(img)
                        if items == "error":
                            print('OCR error: ', image_path)
                            continue
                        with open(ocr_path, 'w') as f:
                            json.dump(items, f)
                    else:
                        with open(ocr_path, 'r') as f:
                            items = json.load(f)
                    words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                except Exception:  # fall back to empty OCR, but let Ctrl-C propagate
                    words, bboxes = [], []

                value = captions[split_id]
                instruction = random.choice(instructions)
                ocr = ' '.join(words)

                file_name = os.path.abspath(image_path)
                target_format.append({
                    "image": file_name,
                    "ocr": ocr, 
                    "bboxes": bboxes, 
                    "conversations": [
                        {'from': 'human', 'value': instruction},
                        {'from': 'gpt', 'value': value},
                    ],
                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/screen2words', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/screen2words', type=str)
    parser.add_argument('--api_key', default='API_KEY', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/slidevqa.py
================================================
import json
import os
import random
import argparse
import glob

from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import load_instructions
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs')
        self.image_dir = os.path.join(args.input_data_dir, 'images')
        self.dataset_name = 'slidevqa'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'val', 'test']
        os.makedirs(self.ocr_dir, exist_ok=True)
            
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            file_name =  os.path.join(self.data_dir, 'annotations/qa', f'{split}.jsonl')
            with open(file_name, 'r') as f:
                data = f.read().splitlines()
            for d in tqdm(data):
                d = json.loads(d)  # each line of the .jsonl file is one JSON object
                question = d['question']
                deck_name = d['deck_name']
                value = d['answer']
                image_paths = []
                text_sequences = []
                bboxes = []
                for image_path in glob.glob(os.path.join(self.image_dir, deck_name, f'slide_*_1024.jpg')):
                    image_path = os.path.abspath(image_path)
                    image_name = os.path.basename(image_path)                
                    image_paths.append(image_path)
                    ocr_path = os.path.join(self.ocr_dir, f'{deck_name}_{image_name.replace(".jpg", ".json")}')
                    try:
                        img = Image.open(image_path)
                        img_w, img_h = img.size
                        if not os.path.exists(ocr_path):
                            items = self.google_ocr.recognize_image(img)
                            if items == 'error':
                                print('OCR error: ', image_path)
                                continue
                            with open(ocr_path, 'w') as f:
                                json.dump(items, f)
                        else:
                            with open(ocr_path, 'r') as f:
                                items = json.load(f)
                        words, page_bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                    except Exception:  # fall back to empty OCR, but let Ctrl-C propagate
                        words, page_bboxes = [], []                        
                    text_sequences.append(' '.join(words))
                    bboxes.append(page_bboxes)
                
                instruction = random.choice(instructions)
                instruction = instruction.replace('<key>', question)

                target_format.append({
                    "image_list": image_paths,
                    "ocr_list": text_sequences, 
                    "bboxes_list": bboxes, 
                    "conversations": [
                        {'from': 'human', 'value': instruction},
                        {'from': 'gpt', 'value': value},
                    ],
                })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/slidevqa', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/slidevqa', type=str)
    parser.add_argument('--api_key', type=str, help='google vision api key')
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/sroie.py
================================================
import json
import os
import random
import cv2
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'sroie'
        self.split = ['train', 'test']

    def sort_coordinate(self, bboxes):
        return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]])    

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            ann_dir = os.path.join(self.data_dir, f'{split}/entities')
            img_dir = os.path.join(self.data_dir, f'{split}/img')
            for file in tqdm(sorted(os.listdir(ann_dir))):
                file_path = os.path.join(ann_dir, file)
                with open(file_path, 'r', encoding='utf-8') as f:
                    labels = json.load(f)
                image_path = os.path.join(img_dir, file)
                image_path = image_path.replace('.txt', '.jpg')
                image = cv2.imread(image_path)
                h, w, _ = image.shape
                    
                file_path = os.path.join(ann_dir.replace('entities', 'box'), file)
                text_sequence = []
                bboxes = []
                with open(file_path, 'r', encoding='utf-8') as f:
                    items = []
                    for item in f.read().splitlines():
                        bbox = item.split(',')[:8]
                        text = item[len(','.join(bbox))+1:]
                        bbox = [int(bbox[0]), int(bbox[1]), int(bbox[4]), int(bbox[5])]
                        bbox = normalize_bbox(bbox, w, h)
                        items.append((text, bbox))
                items = self.sort_coordinate(items)
                for item in items:
                    words, bbox = item
                    text_sequence.append(words)
                    bbox = [bbox] * len(words.split())
                    bboxes += bbox
                
                ocr = ' '.join(text_sequence)
                for label in labels:
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('<key>', labels[label])

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': label},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/SROIE2019', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/sroie', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/tabfact.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'tabfact'
        self.split = ['train', 'dev']
        self.options = ['no', 'yes']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()
        ocrs = {}
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1] # microsoft cv
            except:
                content = d['contents'][0] # tesseract

            bboxes = []
            tokens = []
            try:
                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except:
                pass
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()

            ocrs = self.create_ocr_data(split)
            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
                image_name = d['name'].replace('.pdf', '')
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    instruction = random.choice(instructions)
                    question = ann['key']
                    instruction = instruction.replace('<key>', question)
                    bboxes = []
                    ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1]
                    value = ann['values'][0]['value']
                    value = self.options[int(value)]

                    target_format.append({
                        "image": file_name,
                        "ocr": ocr, 
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/TabFact', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/tabfact', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/tatdqa.py
================================================
import json
import os
import random
import argparse
import csv

from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'tatdqa'
        self.split = ['train', 'dev', 'test']
    
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            file_name = os.path.join(self.data_dir, f'tatdqa_dataset_{split}.json')
            with open(file_name, 'r') as f:
                data = json.load(f)
            for d in tqdm(data):
                uid = d['doc']['uid']
                page_num = d['doc']['page']
                image_path = os.path.join(self.data_dir, f'{split}/{uid}_{page_num}.png')
                ocr_file_name = os.path.join(self.data_dir, f'{split}/{uid}.json')
                with open(ocr_file_name, 'r') as f:
                    ocrs = json.load(f)

                text = []
                bboxes = []
                _, _, w, h = ocrs['pages'][page_num-1]['bbox']
                for page in ocrs['pages']:
                    for block in page['blocks']:
                        text.append(block['text'])
                        for bbox in block['words']['bbox_list']:
                            bbox = normalize_bbox(bbox, w, h)
                            bboxes.append(bbox)

                for qa in d['questions']:
                    question =qa['question']
                    if 'answer' in qa:
                        answer = qa['answer']
                        if isinstance(answer, list):
                            if len(qa['answer']) > 1:
                                answer = ', '.join(answer)
                            else:
                                answer = answer[0]
                    else:
                        answer = ""

                    instruction = random.choice(instructions)        
                    instruction = instruction.replace('<key>', question)
                    ocr = ' '.join(text)

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': answer},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/TAT-DQA', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/tatdqa', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/textbookqa.py
================================================
import json
import os
import random
import glob
from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, sort_coordinate, load_instructions
from transformers import BertTokenizer
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'textbookqa'
        self.split = ['train', 'val', 'test']

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            ann_filename = f'{split}/tqa_v1_{split}.json' if split != 'test' else f'{split}/tqa_v2_{split}.json'
            ann_filename = os.path.join(self.data_dir, ann_filename)
            with open(ann_filename, 'r') as f:
                anns = json.load(f)
            for ann in tqdm(anns):
                questions = ann['questions']
                diagram_questions = questions['diagramQuestions']
                if len(diagram_questions) == 0:
                    continue

                diagram_annotations = ann['diagramAnnotations']

                for global_id, data in diagram_questions.items():
                    options = []
                    for option_id, choice in data['answerChoices'].items():
                        choice = choice['processedText']
                        options.append(choice)
                    question = data['beingAsked']['processedText']
                    value = data['correctAnswer']['rawText']
                    image_path = data['imagePath']
                    image_name = data['imageName']
                    image_path = os.path.join(self.data_dir, f'{split}/{image_path}')
                    if image_name in diagram_annotations:
                        annotation = diagram_annotations[image_name]
                        bboxes = []
                        ocr = []
                        for item in annotation:
                            text, bbox = item["text"], item["rectangle"]
                            try:
                                bbox = [bbox[0][0], bbox[0][1], bbox[1][0], bbox[1][1]]
                            except:
                                continue
                            if len(text) > 0:
                                bboxes.append(bbox)
                                ocr.append(text)
                        ocr = " ".join(ocr)
                    else:
                        ocr = ""
                        bboxes = []  # reset so boxes from a previous question are not reused
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('<key>', question).replace('<options>', str(options))

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': f"{value}"},
                        ],
                    })
            
            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/textbookqa', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/textbookqa', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/utils.py
================================================
import pandas as pd

def normalize_bbox(bbox, w=-1, h=-1):
    if w > 0 and h > 0:
        normalized_bbox = [
            int(1000 * bbox[0] / w),
            int(1000 * bbox[1] / h),
            int(1000 * bbox[2] / w),
            int(1000 * bbox[3] / h),
        ]
    else:
        normalized_bbox = [
            int(1000 * bbox[0]),
            int(1000 * bbox[1]),
            int(1000 * bbox[2]),
            int(1000 * bbox[3]),
        ]
        
    if len(bbox) == 4:
        return convert_wh(normalized_bbox)
    elif len(bbox) == 6:
        return normalized_bbox

def convert_wh(bbox):
    return [bbox[0], bbox[1], bbox[2], bbox[3], abs(bbox[2]-bbox[0]), abs(bbox[3]-bbox[1])]

def sort_coordinate(bboxes):
    return sorted(bboxes , key=lambda k: [k[2][1], k[2][0]])    

def load_instructions(instruction_path):
    instructions = {}
    data = pd.read_excel(instruction_path)
    for d in data.values:
        dataset_name = d[0]
        insts = []
        for prompt in d[3:]:
            if pd.isna(prompt):
                break
            insts.append(prompt)
        instructions[dataset_name] = insts
    return instructions


================================================
FILE: data_preprocessors/visualmrc.py
================================================
import json
import os
import random
from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'visualmrc'
        self.split = ['train', 'dev', 'test']

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, f'data/{split}.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()
            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
                file_name = os.path.join(self.data_dir, d['image_filename'])
                file_name = os.path.abspath(file_name)
                image = Image.open(file_name)
                w, h = image.size

                words = []
                bboxes = []
                for region in d['bounding_boxes']:
                    if 'ocr_info' in region:
                        for ocr_item in region['ocr_info']:
                            word = ocr_item['word']
                            bbox = ocr_item['bbox']
                            # convert (x, y, width, height) to (x1, y1, x2, y2)
                            bbox = [bbox['x'], bbox['y'], bbox['x']+bbox['width'], bbox['y']+bbox['height']]
                            bbox = normalize_bbox(bbox, w, h)
                            bboxes.append(bbox)
                            words.append(word)

                ocr = " ".join(words)
                for qa in d['qa_data']:
                    question = qa['question']['text']
                    value = qa['answer']['text']
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('<key>', question)

                    target_format.append({
                        "image": file_name,
                        "ocr": ocr, 
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/VisualMRC_official', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/visualmrc', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/websrc.py
================================================
import json
import os
import random
import argparse
import csv

from PIL import Image
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
from collections import defaultdict
from google_vision_ocr import Google_OCR

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.ocr_dir = args.ocr_dir  # honor the --ocr_dir argument instead of ignoring it
        self.dataset_name = 'websrc'
        self.google_ocr = Google_OCR(args.api_key)
        self.split = ['train', 'dev']
        os.makedirs(self.ocr_dir, exist_ok=True)
    
    def load_split_info(self):
        file_name = os.path.join(self.data_dir, 'dataset_split.csv')
        with open(file_name) as f:
            reader = csv.reader(f)
            split_info = defaultdict(list)
            for i, row in enumerate(reader):
                if i == 0:
                    continue
                number = '0' + row[1] if int(row[1]) < 10 else  row[1]
                split = row[3]
                data_path = os.path.join(self.data_dir, f'{row[0]}/{number}/dataset.csv')
                split_info[split].append(data_path)
        return split_info
        
    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        split_info = self.load_split_info()
        for split in self.split:
            target_format = []
            for data_path in tqdm(split_info[split]):
                with open(data_path) as f:
                    data_dir = os.path.dirname(data_path)
                    reader = csv.reader(f)
                    for i, row in enumerate(reader):
                        if i == 0:
                            for index, element in enumerate(row):
                                if 'question' == element:
                                    question_index = index
                                elif 'id' == element:
                                    id_index = index
                                elif 'answer' == element:
                                    answer_index = index
                            continue   
                        questionId = row[id_index]
                        image_path = os.path.join(data_dir, f'processed_data/{questionId[2:9]}.png')
                        img = Image.open(image_path)
                        img_w, img_h = img.size

                        ocr_path = os.path.join(self.ocr_dir, f'{questionId[2:9]}.json')
                        try:
                            if not os.path.exists(ocr_path):
                                items = self.google_ocr.recognize_image(img)
                                if items == "error":
                                    print('OCR error: ', image_path)
                                    continue
                                # cache the OCR result; a separate name avoids shadowing the CSV handle `f`
                                with open(ocr_path, 'w') as ocr_f:
                                    json.dump(items, ocr_f)
                            else:
                                with open(ocr_path, 'r') as ocr_f:
                                    items = json.load(ocr_f)
                            words, bboxes = self.google_ocr.extract_info(items, img_w, img_h)
                        except Exception:
                            # fall back to empty OCR if recognition or parsing fails
                            words, bboxes = [], []

                        question = row[question_index]
                        instruction = random.choice(instructions)        
                        instruction = instruction.replace('<key>', question)
                        ocr = ' '.join(words)
                        value = row[answer_index]

                        file_name = os.path.abspath(image_path)
                        target_format.append({
                            "image": file_name,
                            "ocr": ocr,
                            "bboxes": bboxes,
                            "conversations": [
                                {'from': 'human', 'value': instruction},
                                {'from': 'gpt', 'value': value},
                            ],
                        })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/websrc', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/websrc', type=str)
    parser.add_argument('--ocr_dir', default='raw_datasets/websrc/ocrs', type=str)
    parser.add_argument('--api_key', default='API_KEY', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/wildreceipt.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, sort_coordinate, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'wildreceipt'
        self.split = ['train', 'test']
        self.classes = {}
        # read the "<index> <label>" pairs and close the file afterwards
        with open(os.path.join(args.input_data_dir, 'class_list.txt')) as f:
            for items in f:
                index, label = items.split()
                self.classes[index] = label

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            target_format = []
            with open(os.path.join(self.data_dir, f'{split}.txt')) as f:
                samples = f.readlines()
            for sample in tqdm(samples):
                data = json.loads(sample)
                file_name = data['file_name']
                image_path = os.path.join(self.data_dir, file_name)
                image = Image.open(image_path)
                w, h = image.size

                items = []
                labels = {}
                for item in data["annotations"]:
                    text, label_index = item["text"], item["label"]
                    label = self.classes[str(label_index)]
                    if label_index == 0:
                        continue
                    bbox = item["box"]
                    bbox = [bbox[0], bbox[1], bbox[4], bbox[5]]
                    bbox = normalize_bbox(bbox, w, h)
                    items.append((text, label, bbox))

                items = sort_coordinate(items)

                ocr = []
                bboxes = []
                for item in items:
                    words, label, bbox = item
                    labels[words] = label
                    ocr.append(words)
                    bbox = [bbox] * len(words.split())
                    bboxes += bbox
                ocr = ' '.join(ocr)

                for key in labels:
                    instruction = random.choice(instructions)
                    instruction = instruction.replace('<key>', key)
                    value = labels[key]

                    file_name = os.path.abspath(image_path)
                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human', 'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/wildreceipt/wildreceipt', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/wildreceipt', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: data_preprocessors/wtq.py
================================================
import json
import os
import random
from PIL import Image, ImageSequence
from tqdm import tqdm 
from pathlib import Path
from utils import normalize_bbox, load_instructions
import argparse

class InstructData:
    def __init__(self, args):
        self.instruction_path = Path('instructdoc_instructions.xlsx')
        self.data_dir = args.input_data_dir
        self.out_data_dir = args.out_data_dir
        self.dataset_name = 'wtq'
        self.split = ['train', 'dev']

    def create_ocr_data(self, split):
        file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl')
        with open(file_name, 'r') as f:
            data = f.readlines()
        ocrs = {}
        for d in data:
            d = json.loads(d)
            image_name = d['name'].replace('.pdf', '')
            try:
                content = d['contents'][1] # microsoft cv
            except Exception:
                content = d['contents'][0] # tesseract

            bboxes = []
            tokens = []
            try:
                _ , _, w, h = content['common_format']['structures']['pages']['positions'][0]
                for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']):
                    bbox = normalize_bbox(bbox, w, h)
                    bboxes.append(bbox)
                    tokens.append(token)
            except Exception:
                continue
            ocrs[image_name] = (' '.join(tokens), bboxes)
        return ocrs

    def create_data(self):
        instructions = load_instructions(self.instruction_path)[self.dataset_name]
        for split in self.split:
            file_name = os.path.join(self.data_dir, split, 'document.jsonl')
            with open(file_name, 'r') as f:
                data = f.readlines()

            ocrs = self.create_ocr_data(split)
            target_format = []
            for d in tqdm(data):
                d = json.loads(d)
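                image_name = d['name'].replace('.pdf', '')
                if image_name not in ocrs:
                    # create_ocr_data skips documents whose OCR could not be parsed
                    continue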
                file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg')
                file_name = os.path.abspath(file_name)
                for ann in d['annotations']:
                    instruction = random.choice(instructions)
                    question = ann['key']
                    instruction = instruction.replace('<key>', question)
                    ocr, bboxes = ocrs[image_name]
                    value = ann['values'][0]['value']

                    target_format.append({
                        "image": file_name,
                        "ocr": ocr,
                        "bboxes": bboxes,
                        "conversations": [
                            {'from': 'human',  'value': instruction},
                            {'from': 'gpt', 'value': value},
                        ],
                    })

            out_filepath = os.path.join(self.out_data_dir, f'{split}.json')        
            os.makedirs(os.path.dirname(out_filepath), exist_ok=True)

            print(f'{split}: {len(target_format)}')
            with open(out_filepath, "w") as f:
                json.dump(target_format, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/WikiTableQuestions', type=str)
    parser.add_argument('--out_data_dir', default='processed_data/wtq', type=str)
    args = parser.parse_args()
    
    dataset = InstructData(args)
    dataset.create_data()

================================================
FILE: download.sh
================================================
#!/bin/bash
export DATASET_DIR=raw_datasets

mkdir -p "$DATASET_DIR"

sh ./download_scripts/due.sh
sh ./download_scripts/websrc.sh
sh ./download_scripts/funsd.sh
sh ./download_scripts/iconqa.sh
sh ./download_scripts/textbookqa.sh
sh ./download_scripts/screen2words.sh
sh ./download_scripts/doclaynet.sh
sh ./download_scripts/ai2d.sh
sh ./download_scripts/wildreceipt.sh
sh ./download_scripts/llavar.sh

# font file for rendering text in AI2D dataset
wget https://huggingface.co/Team-PIXEL/pixel-base-finetuned-masakhaner-swa/resolve/main/GoNotoCurrent.ttf


================================================
FILE: download_scripts/README.md
================================================
Below is the list of datasets used in InstructDoc and how to download each of them.
### Automatically downloaded datasets
- DocVQA ([due.sh](download_scripts/due.sh))
- InfographicVQA ([due.sh](download_scripts/due.sh))
- PWC ([due.sh](download_scripts/due.sh))
- KLC ([due.sh](download_scripts/due.sh))
- DeepForm ([due.sh](download_scripts/due.sh))
- TabFact ([due.sh](download_scripts/due.sh))
- WTQ ([due.sh](download_scripts/due.sh))
- WebSRC ([websrc.sh](download_scripts/websrc.sh))
- FUNSD ([funsd.sh](download_scripts/funsd.sh))
- IconQA ([iconqa.sh](download_scripts/iconqa.sh))
- TextbookQA ([textbookqa.sh](download_scripts/textbookqa.sh))
- Screen2Words ([screen2words.sh](download_scripts/screen2words.sh))
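- DocLayNet ([doclaynet.sh](download_scripts/doclaynet.sh))
- LLaVAR ([llavar.sh](download_scripts/llavar.sh))
- AI2D ([ai2d.sh](download_scripts/ai2d.sh))
- WildReceipt ([wildreceipt.sh](download_scripts/wildreceipt.sh))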

### Manually downloaded datasets
After downloading the datasets below, place them under the `raw_datasets` directory.
- SROIE ([kaggle](https://www.kaggle.com/datasets/urbikn/sroie-datasetv2))
- CORD ([google drive](https://drive.google.com/drive/folders/14OEWr86qotVBMAsWk7lymMytxn5u-kM6))
- OCRVQA ([google drive](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing))
- TAT-DQA ([google drive](https://drive.google.com/drive/folders/1SGpZyRWqycMd_dZim1ygvWhl5KdJYDR2))
- ScienceQA ([google drive](https://drive.google.com/drive/folders/1w8imCXWYn2LxajmGeGH_g5DaL2rabHev))
- ChartQA ([google drive](https://drive.google.com/file/d/17-aqtiq_KJ16PIGOp30W0y6OJNax6SVT/view))
- RVL-CDIP ([google docs](https://docs.google.com/uc?id=0Bz1dfcnrpXM-MUt4cHNzUEFXcmc&export=download))
- HW-SQuAD ([onedrive](https://www.docvqa.org/datasets/benthamqa-and-hw-squad))
- SciCap ([dropbox](https://www.dropbox.com/s/t1sjqesl0pynaxo/scicap_data.zip?dl=0))
- DUDE ([project page](https://rrc.cvc.uab.es/?ch=23&com=introduction))
- DocBank ([project page](https://doc-analysis.github.io/docbank-page/index.html))
- DocILE ([project page](https://docile.rossum.ai/))
- VisualMRC ([project page](https://github.com/nttmdlab-nlp/VisualMRC), request authors via e-mail ryota.tanaka@ntt.com)
- SlideVQA ([project page](https://github.com/nttmdlab-nlp/SlideVQA), request authors via e-mail ryota.tanaka@ntt.com)
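
Once the manually downloaded datasets are in place, a small check like the one below may help confirm that the expected directories exist before preprocessing. This is only a sketch: `missing_datasets` is a hypothetical helper, and the directory names in `expected` are assumptions based on some of the preprocessors' `--input_data_dir` defaults, so adjust them to your local layout.

```python
from pathlib import Path


def missing_datasets(root, expected):
    """Return the entries of `expected` that are not directories under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumed directory names; they are not guaranteed to match every dataset.
    expected = ["VisualMRC_official", "websrc", "wildreceipt"]
    for name in missing_datasets("raw_datasets", expected):
        print(f"missing: raw_datasets/{name}")
```

Run it from the repository root so that the relative `raw_datasets` path resolves correctly.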


================================================
FILE: download_scripts/ai2d.sh
================================================
cd $DATASET_DIR

echo "Downloading AI2D dataset..."
mkdir ai2d
cd ai2d
wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip
wget https://s3-us-east-2.amazonaws.com/prior-datasets/ai2d_test_ids.csv
unzip ai2d-all.zip && rm ai2d-all.zip


================================================
FILE: download_scripts/doclaynet.sh
================================================
cd $DATASET_DIR

echo "Downloading DocLayNet dataset..."
mkdir doclaynet
cd doclaynet
wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip
wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip
unzip DocLayNet_core.zip && rm DocLayNet_core.zip
unzip DocLayNet_extra.zip && rm DocLayNet_extra.zip


================================================
FILE: download_scripts/due.sh
================================================
cd $DATASET_DIR

echo "Downloading DocVQA dataset..."
mkdir docvqa
cd docvqa
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DocVQA.tar.gz
tar xvf DocVQA.tar.gz && rm DocVQA.tar.gz
cd ..

echo "Downloading InfoVQA dataset..."
mkdir infovqa
cd infovqa
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/InfographicsVQA.tar.gz
tar xvf InfographicsVQA.tar.gz && rm InfographicsVQA.tar.gz
cd ..

echo "Downloading TabFact dataset..."
mkdir tabfact
cd tabfact
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/TabFact.tar.gz
tar xvf TabFact.tar.gz && rm TabFact.tar.gz
cd ..

echo "Downloading WTQ dataset..."
mkdir wtq
cd wtq
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/WikiTableQuestions.tar.gz
tar xvf WikiTableQuestions.tar.gz && rm WikiTableQuestions.tar.gz
cd ..

echo "Downloading KLC dataset..."
mkdir klc
cd klc
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/KleisterCharity.tar.gz
tar xvf KleisterCharity.tar.gz && rm KleisterCharity.tar.gz
cd ..

echo "Downloading DeepForm dataset..."
mkdir deepform
cd deepform
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DeepForm.tar.gz
tar xvf DeepForm.tar.gz && rm DeepForm.tar.gz
cd ..

echo "Downloading PWC dataset..."
mkdir pwc
cd pwc
wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/PWC.tar.gz
tar xvf PWC.tar.gz && rm PWC.tar.gz


================================================
FILE: download_scripts/funsd.sh
================================================
cd $DATASET_DIR

echo "Downloading FUNSD dataset..."
mkdir funsd
cd funsd
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip dataset.zip && rm dataset.zip


================================================
FILE: download_scripts/iconqa.sh
================================================
cd $DATASET_DIR

echo "Downloading IconQA dataset..."
mkdir iconqa
cd iconqa
wget https://iconqa2021.s3.us-west-1.amazonaws.com/iconqa_data.zip
unzip iconqa_data.zip && rm iconqa_data.zip


================================================
FILE: download_scripts/llavar.sh
================================================
cd $DATASET_DIR

echo "Downloading LLaVAR dataset..."
mkdir llavar
cd llavar
wget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/llava_instruct_150k_llavar_20k.json
mkdir images
cd images
wget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/finetune.zip
unzip finetune.zip && rm finetune.zip


================================================
FILE: download_scripts/screen2words.sh
================================================
cd $DATASET_DIR

echo "Downloading Screen2Words dataset..."
git clone https://github.com/google-research-datasets/screen2words.git
cd screen2words
wget https://storage.googleapis.com/crowdstf-rico-uiuc-4540/rico_dataset_v0.1/unique_uis.tar.gz
tar xvf unique_uis.tar.gz && rm unique_uis.tar.gz


================================================
FILE: download_scripts/textbookqa.sh
================================================
cd $DATASET_DIR

echo "Downloading TextbookQA dataset..."
mkdir textbookqa
cd textbookqa
wget https://ai2-public-datasets.s3.amazonaws.com/tqa/tqa_train_val_test.zip
unzip tqa_train_val_test.zip && rm tqa_train_val_test.zip


================================================
FILE: download_scripts/websrc.sh
================================================
cd $DATASET_DIR

echo "Downloading WebSRC dataset..."
mkdir websrc
cd websrc
wget https://websrc-data.s3.amazonaws.com/release.zip
unzip release.zip && rm release.zip


================================================
FILE: download_scripts/wildreceipt.sh
================================================
cd $DATASET_DIR

echo "Downloading WildReceipt dataset..."
mkdir wildreceipt
cd wildreceipt
wget https://download.openmmlab.com/mmocr/data/wildreceipt.tar
tar xvf wildreceipt.tar && rm wildreceipt.tar


================================================
FILE: merge_datasets.py
================================================
import os
import json
import random
import argparse

train_val_datasets = ['klc', 'pwc', 'deepform', 'sroie', 'docile', 'wildreceipt', 'websrc', 'hwsquad',
                      'visualmrc', 'iconqa_fill_in_blank', 'iconqa_choose_txt', 'scienceqa',
                      'ai2d', 'docvqa', 'rvlcdip', 'textbookqa', 'wtq', 'tatdqa','scicap', 'llavar',
                      'screen2words', 'doclaynet', 'docbank', 'docvqa_iq', 'rvlcdip_io', 'ocrvqa']

def merge_datasets(input_data_dir='./processed_data', save_dir='./', max_samples=5000):
    questionId = 0
    # a one-element tuple needs a trailing comma; ('train') is just the string 'train'
    for split in [('train',), ('dev', 'val')]:
        merge = []
        for dataset_name in train_val_datasets:
            # reset per dataset so a missing file never reuses the previous dataset's samples
            data = []
            for s in split:
                dataset_path = os.path.join(input_data_dir, dataset_name, f'{s}.json')
                if os.path.exists(dataset_path):
                    with open(dataset_path, 'r') as f:
                        data = json.load(f)
            if len(data) == 0:
                continue
            # random.shuffle works in place and returns None, so truncate in a separate step
            random.shuffle(data)
            data = data[:max_samples]
            for d in data:
                d["dataset_name"] = dataset_name
                d["id"] = questionId
                questionId += 1  # keep ids unique across datasets and splits
                merge.append(d)
        random.shuffle(merge)

        out_filepath = os.path.join(save_dir, f'{split[0]}.json')
        os.makedirs(os.path.dirname(out_filepath), exist_ok=True)
        print(f'{split[0]}: {len(merge)}')
        with open(out_filepath, "w") as f:
            json.dump(merge, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_data_dir', default='processed_data', type=str)
    parser.add_argument('--save_dir', default='./', type=str)
    parser.add_argument('--max_samples', default=5000, type=int)
    args = parser.parse_args()

    merge_datasets(args.input_data_dir, args.save_dir, args.max_samples)

================================================
FILE: process_data.sh
================================================
#!/bin/bash
API_KEY=$1

# ===== KIE =====
python data_preprocessors/docile.py
python data_preprocessors/klc.py
python data_preprocessors/deepform.py
python data_preprocessors/funsd.py
python data_preprocessors/pwc.py
python data_preprocessors/wildreceipt.py
python data_preprocessors/cord.py
python data_preprocessors/sroie.py

# ===== Single-page QA =====
python data_preprocessors/visualmrc.py
python data_preprocessors/websrc.py --api_key $API_KEY
python data_preprocessors/ocrvqa.py --api_key $API_KEY
python data_preprocessors/docvqa.py
python data_preprocessors/hwsquad.py

# ===== Single-page QA w/ Discrete Reasoning =====
python data_preprocessors/tatdqa.py
python data_preprocessors/wtq.py

# ===== Single-page QA w/ Visual Reasoning =====
python data_preprocessors/iconqa.py
python data_preprocessors/ai2d.py
python data_preprocessors/scienceqa.py
python data_preprocessors/textbookqa.py

# ===== Single-page QA w/ Discrete and Visual Reasoning =====
python data_preprocessors/infographicvqa.py
python data_preprocessors/chartqa.py --api_key $API_KEY

# ===== Multi-page QA w/ Multi-hop, Discrete, and Visual Reasoning =====
python data_preprocessors/slidevqa.py --api_key $API_KEY
python data_preprocessors/dude.py

# ===== Document NLI =====
python data_preprocessors/tabfact.py

# ===== Dialogue =====
python data_preprocessors/llavar.py --api_key $API_KEY

# ===== Captioning =====
python data_preprocessors/scicap.py --api_key $API_KEY
python data_preprocessors/screen2words.py --api_key $API_KEY

# ===== Classification =====
python data_preprocessors/rvlcdip.py --api_key $API_KEY

# ===== ITM =====
python data_preprocessors/rvlcdip_io.py --api_key $API_KEY
python data_preprocessors/docvqa_iq.py

# ===== DLA =====
python data_preprocessors/docbank.py
python data_preprocessors/doclaynet.py