Repository: coastalcph/lex-glue
Branch: main
Commit: 419a49d0fe82
Files: 30
Total size: 197.0 KB

Directory structure:
gitextract_ybq2rcc9/
├── .gitignore
├── README.md
├── experiments/
│   ├── case_hold.py
│   ├── casehold_helpers.py
│   ├── ecthr.py
│   ├── eurlex.py
│   ├── ledgar.py
│   ├── scotus.py
│   ├── trainer.py
│   └── unfair_tos.py
├── models/
│   ├── deberta.py
│   ├── hierbert.py
│   └── tfidf_svm.py
├── requirements.txt
├── scripts/
│   ├── run_case_hold.sh
│   ├── run_ecthr.sh
│   ├── run_eurlex.sh
│   ├── run_ledgar.sh
│   ├── run_scotus.sh
│   ├── run_tfidf_svm.sh
│   └── run_unfair_tos.sh
├── statistics/
│   ├── compute_avg_lexglue_scores.py
│   ├── compute_avg_scores.py
│   ├── compute_lexglue_scores.py
│   ├── report_model_results.py
│   └── report_train_time.py
└── utils/
    ├── fix_casehold.py
    ├── load_hierbert.py
    ├── preprocess_unfair_tos.py
    └── subsample_ledgar.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# PyCharm files
.idea

# Log files
logs/


================================================
FILE: README.md
================================================
# LexGLUE: A Benchmark Dataset for Legal Language Understanding in English :balance_scale: :trophy: :student: :woman_judge:

![LexGLUE Graphic](https://repository-images.githubusercontent.com/411072132/5c49b313-ab36-4391-b785-40d9478d0f73)

## Dataset Summary

Inspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset ([Wang et al., 2018](https://aclanthology.org/W18-5446/)), the subsequent more difficult SuperGLUE ([Wang et al., 2019](https://openreview.net/forum?id=rJ4km2R5t7)), other previous multi-task NLP benchmarks ([Conneau and Kiela, 2018](https://aclanthology.org/L18-1269/); [McCann et al., 2018](https://arxiv.org/abs/1806.08730)), and similar initiatives in other domains ([Peng et al., 2019](https://arxiv.org/abs/1906.05474)), we introduce LexGLUE, a benchmark dataset to evaluate the performance of NLP methods in legal tasks. LexGLUE is based on seven existing legal NLP datasets, selected using criteria largely from SuperGLUE. We anticipate that more datasets, tasks, and languages will be added in later versions of LexGLUE. As more legal NLP datasets become available, we also plan to favor datasets checked thoroughly for validity (scores reflecting real-life performance), annotation quality, statistical power, and social bias ([Bowman and Dahl, 2021](https://aclanthology.org/2021.naacl-main.385/)).

As in GLUE and SuperGLUE ([Wang et al., 2019](https://openreview.net/forum?id=rJ4km2R5t7)), one of our goals is to push towards generic (or *foundation*) models that can cope with multiple NLP tasks, in our case legal NLP tasks, possibly with limited task-specific fine-tuning. Another goal is to provide a convenient and informative entry point for NLP researchers and practitioners wishing to explore or develop methods for legal NLP. Having these goals in mind, the datasets we include in LexGLUE and the tasks they address have been simplified in several ways, discussed below, to make it easier for newcomers and generic models to address all tasks. We provide Python APIs integrated with Hugging Face (Wolf et al., 2020; Lhoest et al., 2021) to easily import all the datasets, experiment with them, and evaluate model performance.

By unifying and facilitating access to a set of law-related datasets and tasks, we hope to attract not only more NLP experts, but also more interdisciplinary researchers (e.g., law doctoral students willing to take NLP courses). More broadly, we hope LexGLUE will speed up the adoption and transparent evaluation of new legal NLP methods and approaches in the commercial sector too. Indeed, there have been many commercial press releases in the legal-tech industry, but almost no independent evaluation of the performance claims of various machine learning and NLP-based offerings. A standard, publicly available benchmark would also allay concerns of undue influence in predictive models, including the use of metadata which the relevant law expressly disregards.
If you participate, use the LexGLUE benchmark, or our experimentation library, please cite:

[*Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras.* *LexGLUE: A Benchmark Dataset for Legal Language Understanding in English.* *2022. In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland.*](https://aclanthology.org/2022.acl-long.297/)

```
@inproceedings{chalkidis-etal-2022-lexglue,
    title = "{L}ex{GLUE}: A Benchmark Dataset for Legal Language Understanding in {E}nglish",
    author = "Chalkidis, Ilias and Jana, Abhik and Hartung, Dirk and Bommarito, Michael and Androutsopoulos, Ion and Katz, Daniel and Aletras, Nikolaos",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.297",
    pages = "4310--4330",
}
```

## Supported Tasks
| Dataset | Source | Sub-domain | Task Type | Train/Dev/Test Instances | Classes |
|---------|--------|------------|-----------|--------------------------|---------|
| ECtHR (Task A) | Chalkidis et al. (2019) | ECHR | Multi-label classification | 9,000/1,000/1,000 | 10+1 |
| ECtHR (Task B) | Chalkidis et al. (2021a) | ECHR | Multi-label classification | 9,000/1,000/1,000 | 10+1 |
| SCOTUS | Spaeth et al. (2020) | US Law | Multi-class classification | 5,000/1,400/1,400 | 14 |
| EUR-LEX | Chalkidis et al. (2021b) | EU Law | Multi-label classification | 55,000/5,000/5,000 | 100 |
| LEDGAR | Tuggener et al. (2020) | Contracts | Multi-class classification | 60,000/10,000/10,000 | 100 |
| UNFAIR-ToS | Lippi et al. (2019) | Contracts | Multi-label classification | 5,532/2,275/1,607 | 8+1 |
| CaseHOLD | Zheng et al. (2021) | US Law | Multiple choice QA | 45,000/3,900/3,900 | n/a |
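All of the tasks above are served through Hugging Face Datasets (see the FAQ below). As a quick, hedged sketch of what a task looks like, the snippet below loads EUR-LEX and binarizes its label lists the way the experiment scripts do; the split names and the `text`/`labels` columns follow their usage in `experiments/eurlex.py`:

```python
from datasets import load_dataset

# Each task above is a configuration of the same Hub dataset.
dataset = load_dataset("coastalcph/lex_glue", "eurlex")
print(dataset)  # train / validation / test splits

# The experiment scripts turn each document's label list into a binary vector
# over the 100 EuroVoc classes (cf. label_list = list(range(100)) in experiments/eurlex.py).
label_list = list(range(100))
example = dataset["train"][0]
label_vector = [1 if label in example["labels"] else 0 for label in label_list]
print(sum(label_vector), "gold labels for the first training document")
```

The other tasks have their own column layouts; CaseHOLD, for instance, provides a `context` and five candidate `endings` per example (see `experiments/casehold_helpers.py`).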
### ECtHR (Task A)

The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to articles of the ECHR that were violated (if any).

### ECtHR (Task B)

The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to articles of the ECHR that were allegedly violated (considered by the court).

### SCOTUS

The US Supreme Court (SCOTUS) is the highest federal court in the United States of America and generally hears only the most controversial or otherwise complex cases that have not been sufficiently resolved by lower courts. This is a single-label multi-class classification task, where given a document (court opinion), the task is to predict the relevant issue area. The 14 issue areas cluster 278 issues whose focus is on the subject matter of the controversy (dispute).

### EUR-LEX

European Union (EU) legislation is published on the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The current version of EuroVoc contains more than 7k concepts referring to various activities of the EU and its Member States (e.g., economics, health care, trade). Given a document, the task is to predict its EuroVoc labels (concepts).

### LEDGAR

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

### UNFAIR-ToS

The UNFAIR-ToS dataset contains 50 Terms of Service (ToS) from online platforms (e.g., YouTube, eBay, Facebook). The dataset has been annotated at the sentence level with 8 types of unfair contractual terms, i.e., terms (sentences) that potentially violate user rights under European consumer law.

### CaseHOLD

The CaseHOLD (Case Holdings on Legal Decisions) dataset includes multiple choice questions about holdings of US court cases from the Harvard Law Library case law corpus. Holdings are short summaries of legal rulings that accompany referenced decisions relevant to the present case. The input consists of an excerpt (or prompt) from a court decision, containing a reference to a particular case, in which the holding statement is masked out. The model must identify the correct (masked) holding statement from a selection of five choices.

## Leaderboard

### Averaged LexGLUE Scores

We report the arithmetic, harmonic, and geometric mean across tasks, following [Shavrina and Malykh (2021)](https://openreview.net/pdf?id=PPGfoNJnLKd). We acknowledge that scores aggregated over tasks have been criticized in general NLU benchmarks (e.g., GLUE), since the tasks differ in number of training samples, complexity, and evaluation metrics. We believe that using a standard common metric (F1) across tasks and averaging with the harmonic mean alleviates this issue.
| Model | Arithmetic (μ-F1 / m-F1) | Harmonic (μ-F1 / m-F1) | Geometric (μ-F1 / m-F1) |
|-------|--------------------------|------------------------|-------------------------|
| BERT | 77.8 / 69.5 | 76.7 / 68.2 | 77.2 / 68.8 |
| RoBERTa | 77.8 / 68.7 | 76.8 / 67.5 | 77.3 / 68.1 |
| RoBERTa (Large) | 79.4 / 70.8 | 78.4 / 69.1 | 78.9 / 70.0 |
| DeBERTa | 78.3 / 69.7 | 77.4 / 68.5 | 77.8 / 69.1 |
| Longformer | 78.5 / 70.5 | 77.5 / 69.5 | 78.0 / 70.0 |
| BigBird | 78.2 / 69.6 | 77.2 / 68.5 | 77.7 / 69.0 |
| Legal-BERT | 79.8 / 72.0 | 78.9 / 70.8 | 79.3 / 71.4 |
| CaseLaw-BERT | 79.4 / 70.9 | 78.5 / 69.7 | 78.9 / 70.3 |
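The averaged scores are simply the three means of each model's per-task scores; the repository's own aggregation scripts live under `statistics/` (e.g., `compute_avg_lexglue_scores.py`). As a minimal illustrative sketch (not the repository's script), the snippet below averages BERT's per-task μ-F1 scores from the medium-sized table further down and reproduces the 77.8 / 76.7 / 77.2 μ-F1 entries above, using `scipy` from the listed requirements:

```python
import numpy as np
from scipy.stats import gmean, hmean

# BERT's per-task micro-F1 scores (ECtHR A, ECtHR B, SCOTUS, EUR-LEX, LEDGAR,
# UNFAIR-ToS, CaseHOLD), taken from the medium-sized models table below.
task_scores = np.array([71.2, 79.7, 68.3, 71.4, 87.6, 95.6, 70.8])

print(f"arithmetic: {task_scores.mean():.1f}")  # 77.8
print(f"harmonic:   {hmean(task_scores):.1f}")  # 76.7
print(f"geometric:  {gmean(task_scores):.1f}")  # 77.2
```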
### Task-wise LexGLUE scores

#### Large-sized (:older_man:) Models [1]
All scores are μ-F1 / m-F1.

| Model | ECtHR A | ECtHR B | SCOTUS | EUR-LEX | LEDGAR | UNFAIR-ToS | CaseHOLD |
|-------|---------|---------|--------|---------|--------|------------|----------|
| RoBERTa | 73.8 / 67.6 | 79.8 / 71.6 | 75.5 / 66.3 | 67.9 / 50.3 | 88.6 / 83.6 | 95.8 / 81.6 | 74.4 |
[1] Results reported by [Chalkidis et al. (2021)](https://arxiv.org/abs/2110.00976). All large-sized transformer-based models follow the same specifications (L=24, H=1024, A=16).

#### Medium-sized (:man:) Models [2]
All scores are μ-F1 / m-F1.

| Model | ECtHR A | ECtHR B | SCOTUS | EUR-LEX | LEDGAR | UNFAIR-ToS | CaseHOLD |
|-------|---------|---------|--------|---------|--------|------------|----------|
| TFIDF+SVM | 62.6 / 48.9 | 73.0 / 63.8 | 74.0 / 64.4 | 63.4 / 47.9 | 87.0 / 81.4 | 94.7 / 75.0 | 22.4 |
| BERT | 71.2 / 63.6 | 79.7 / 73.4 | 68.3 / 58.3 | 71.4 / 57.2 | 87.6 / 81.8 | 95.6 / 81.3 | 70.8 |
| RoBERTa | 69.2 / 59.0 | 77.3 / 68.9 | 71.6 / 62.0 | 71.9 / 57.9 | 87.9 / 82.3 | 95.2 / 79.2 | 71.4 |
| DeBERTa | 70.0 / 60.8 | 78.8 / 71.0 | 71.1 / 62.7 | 72.1 / 57.4 | 88.2 / 83.1 | 95.5 / 80.3 | 72.6 |
| Longformer | 69.9 / 64.7 | 79.4 / 71.7 | 72.9 / 64.0 | 71.6 / 57.7 | 88.2 / 83.0 | 95.5 / 80.9 | 71.9 |
| BigBird | 70.0 / 62.9 | 78.8 / 70.9 | 72.8 / 62.0 | 71.5 / 56.8 | 87.8 / 82.6 | 95.7 / 81.3 | 70.8 |
| Legal-BERT | 70.0 / 64.0 | 80.4 / 74.7 | 76.4 / 66.5 | 72.1 / 57.4 | 88.2 / 83.0 | 96.0 / 83.0 | 75.3 |
| CaseLaw-BERT | 69.8 / 62.9 | 78.8 / 70.3 | 76.6 / 65.9 | 70.7 / 56.6 | 88.3 / 83.0 | 96.0 / 82.3 | 75.4 |
[2] Results reported by [Chalkidis et al. (2021)](https://arxiv.org/abs/2110.00976). All medium-sized transformer-based models follow the same specifications (L=12, H=768, A=12).

#### Small-sized (:baby:) Models [3]
All scores are μ-F1 / m-F1.

| Model | ECtHR A | ECtHR B | SCOTUS | EUR-LEX | LEDGAR | UNFAIR-ToS | CaseHOLD |
|-------|---------|---------|--------|---------|--------|------------|----------|
| BERT-Tiny | n/a | n/a | 62.8 / 40.9 | 65.5 / 27.5 | 83.9 / 74.7 | 94.3 / 11.1 | 68.3 |
| Mini-LM (v2) | n/a | n/a | 60.8 / 45.5 | 62.2 / 35.6 | 86.7 / 79.6 | 93.9 / 13.2 | 71.3 |
| Distil-BERT | n/a | n/a | 67.0 / 55.9 | 66.0 / 51.5 | 87.5 / 81.5 | 97.1 / 79.4 | 68.6 |
| Legal-BERT | n/a | n/a | 75.6 / 68.5 | 73.4 / 54.4 | 87.8 / 81.4 | 97.1 / 76.3 | 74.7 |
[3] Results reported by Atreya Shankar ([@atreyasha](https://github.com/atreyasha)) :hugs: :partying_face:. More details (e.g., validation scores, log files) are provided [here](https://github.com/coastalcph/lex-glue/discussions/categories/new-results). The small-sized models' specifications are:

* BERT-Tiny (L=2, H=128, A=2) by [Turc et al. (2020)](https://arxiv.org/abs/1908.08962)
* Mini-LM (v2) (L=12, H=384, A=12) by [Wang et al. (2020)](https://arxiv.org/abs/2002.10957)
* Distil-BERT (L=6, H=768, A=12) by [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108)
* Legal-BERT (L=6, H=512, A=8) by [Chalkidis et al. (2020)](https://arxiv.org/abs/2010.02559)

## Frequently Asked Questions (FAQ)

### Where are the datasets?

We provide access to LexGLUE on [Hugging Face Datasets](https://huggingface.co/datasets) (Lhoest et al., 2021) at https://huggingface.co/datasets/coastalcph/lex_glue. For example, to load the SCOTUS [Spaeth et al. (2020)](http://scdb.wustl.edu) dataset, first install the `datasets` Python library and then make the following call:

```python
from datasets import load_dataset
dataset = load_dataset("coastalcph/lex_glue", "scotus")
```

### How to run experiments?

To make reproducing the results for the already examined models, or for future models, even easier, we release our code in this repository. The folder `/experiments` contains Python scripts, relying on the [Hugging Face Transformers](https://huggingface.co/transformers/) library, to run and evaluate any Transformer-based model (e.g., BERT, RoBERTa, Legal-BERT, and their hierarchical variants, as well as Longformer and BigBird). We also provide bash scripts in the folder `/scripts` to replicate the experiments for each dataset with 5 random seeds, as we did for the results reported on the original leaderboard. Make sure that all required packages are installed:

```
torch>=1.9.0
transformers>=4.9.0
scikit-learn>=0.24.1
tqdm>=4.61.1
numpy>=1.20.1
datasets>=1.12.1
nltk>=3.5
scipy>=1.6.3
```

For example, to replicate the results for RoBERTa ([Liu et al., 2019](https://arxiv.org/abs/1907.11692)) on UNFAIR-ToS [Lippi et al. (2019)](https://arxiv.org/abs/1805.01217), configure the relevant bash script (`run_unfair_tos.sh`):

```
> nano run_unfair_tos.sh
GPU_NUMBER=1
MODEL_NAME='roberta-base'
LOWER_CASE='False'
BATCH_SIZE=8
ACCUMULATION_STEPS=1
TASK='unfair_tos'
```

and then run it:

```
> sh run_unfair_tos.sh
```

**Note:** The bash scripts use two HF arguments (`--fp16`, `--fp16_full_eval`) that only work when NVIDIA GPUs are available (and correctly configured) on the machine (server or cluster) and `torch` is set up to use them. If you don't have such resources, simply delete these two arguments from the scripts to train models with standard `fp32` precision. If you do have such resources, make sure the NVIDIA CUDA drivers are correctly installed and that `torch` is installed so that it can detect them (see https://pytorch.org/get-started/locally/ for the appropriate steps).

### I don't have the resources to run all these Muppets. What can I do?

You can use Google Colab with GPU acceleration for free online (https://colab.research.google.com).

- Set up a new notebook (https://colab.research.google.com) and git clone the project.
- Navigate to Edit → Notebook Settings and select GPU from the Hardware Accelerator drop-down.
  You will probably be assigned an NVIDIA Tesla K80 (12GB).
- You will also have to decrease the batch size and increase the accumulation steps for hierarchical models.

But this is an interesting open problem (efficient NLP); please consider using lighter (smaller/faster) pre-trained models, like:

- the smaller [Legal-BERT](https://huggingface.co/nlpaueb/legal-bert-small-uncased) of [Chalkidis et al. (2020)](https://arxiv.org/abs/2010.02559),
- the smaller [BERT](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) models of [Turc et al. (2020)](https://arxiv.org/abs/1908.08962),
- [Mini-LM](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased) of [Wang et al. (2020)](https://arxiv.org/abs/2002.10957),

or non-Transformer-based neural models, like:

- LSTM-based [(Hochreiter and Schmidhuber, 1997)](https://dl.acm.org/doi/10.1162/neco.1997.9.8.1735) models, like the Hierarchical Attention Network (HAN) of [Yang et al. (2016)](https://aclanthology.org/N16-1174/),
- graph-based models, like the Graph Attention Network (GAT) of [Veličković et al. (2017)](https://arxiv.org/abs/1710.10903),

or even non-neural models, like:

- Bag-of-Words (BoW) models using TF-IDF representations (e.g., SVM, Random Forest),
- the eXtreme Gradient Boosting (XGBoost) of [Chen and Guestrin (2016)](http://arxiv.org/abs/1603.02754),

and report back the results. We are curious!

### How to participate?

We currently still lack some technical infrastructure, e.g., an integrated submission environment comprising automated evaluation and an automatically updated leaderboard. We plan to develop the necessary publicly available web infrastructure to extend LexGLUE in the near future. In the meantime, we ask participants to re-use and expand our code to submit new results, if possible, and to open a new discussion (submission) in our repository (https://github.com/coastalcph/lex-glue/discussions/new?category=new-results) presenting their results, providing the auto-generated result logs and the relevant publication (or pre-print), if available, accompanied by a pull request with the code amendments needed to reproduce their experiments. Upon reviewing your results, we'll update the public leaderboard accordingly.

### I want to re-load fine-tuned HierBERT models. How can I do this?

You can re-load fine-tuned HierBERT models following our example Python script ["Re-load HierBERT models"](https://github.com/coastalcph/lex-glue/blob/main/utils/load_hierbert.py).

### I still have open questions...

Please post your question in the [Discussions](https://github.com/coastalcph/lex-glue/discussions) section or contact the corresponding author via e-mail.

## Credits

Thanks to [@JamesLYC88](https://github.com/JamesLYC88) and [@danigoju](https://github.com/danigoju) for digging up :bug:s!

================================================ FILE: experiments/case_hold.py ================================================ #!/usr/bin/env python # coding=utf-8 """ Finetuning models on CaseHOLD (e.g.
Bert, RoBERTa, LEGAL-BERT).""" import logging import os from dataclasses import dataclass, field from typing import Optional import numpy as np import random import shutil import glob import transformers from transformers import ( AutoConfig, AutoModelForMultipleChoice, AutoTokenizer, EvalPrediction, HfArgumentParser, Trainer, TrainingArguments, set_seed, ) from transformers.trainer_utils import is_main_process from transformers import EarlyStoppingCallback from casehold_helpers import MultipleChoiceDataset, Split from sklearn.metrics import f1_score from models.deberta import DebertaForMultipleChoice logger = logging.getLogger(__name__) @dataclass class ModelArguments: """ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. """ model_name_or_path: str = field( metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} ) config_name: Optional[str] = field( default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} ) tokenizer_name: Optional[str] = field( default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} ) cache_dir: Optional[str] = field( default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, ) @dataclass class DataTrainingArguments: """ Arguments pertaining to what data we are going to input our model for training and eval. """ task_name: str = field(default="case_hold", metadata={"help": "The name of the task to train on"}) max_seq_length: int = field( default=256, metadata={ "help": "The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded." }, ) pad_to_max_length: bool = field( default=True, metadata={ "help": "Whether to pad all samples to `max_seq_length`. " "If False, will pad the samples dynamically when batching to the maximum length in the batch." }, ) max_train_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of training examples to this " "value if set." }, ) max_eval_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this " "value if set." }, ) max_predict_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this " "value if set." }, ) overwrite_cache: bool = field( default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} ) def main(): # See all possible arguments in src/transformers/training_args.py # or by passing the --help flag to this script. # We now keep distinct sets of args, for a cleaner separation of concerns. parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) # Add custom arguments for computing pre-train loss parser.add_argument("--ptl", type=bool, default=False) model_args, data_args, training_args, custom_args = parser.parse_args_into_dataclasses() if ( os.path.exists(training_args.output_dir) and os.listdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir ): raise ValueError( f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome." 
) # Setup logging logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN, ) logger.warning( "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", training_args.local_rank, training_args.device, training_args.n_gpu, bool(training_args.local_rank != -1), training_args.fp16, ) # Set the verbosity to info of the Transformers logger (on main process only): if is_main_process(training_args.local_rank): transformers.utils.logging.set_verbosity_info() transformers.utils.logging.enable_default_handler() transformers.utils.logging.enable_explicit_format() logger.info("Training/evaluation parameters %s", training_args) # Set seed set_seed(training_args.seed) # Load pretrained model and tokenizer config = AutoConfig.from_pretrained( model_args.config_name if model_args.config_name else model_args.model_name_or_path, num_labels=5, finetuning_task=data_args.task_name, cache_dir=model_args.cache_dir, ) if config.model_type == 'big_bird': config.attention_type = 'original_full' elif config.model_type == 'longformer': config.attention_window = [data_args.max_seq_length] * config.num_hidden_layers tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, # Default fast tokenizer is buggy on CaseHOLD task, switch to legacy tokenizer use_fast=True, ) if config.model_type != 'deberta': model = AutoModelForMultipleChoice.from_pretrained( model_args.model_name_or_path, from_tf=bool(".ckpt" in model_args.model_name_or_path), config=config, cache_dir=model_args.cache_dir, ) else: model = DebertaForMultipleChoice.from_pretrained( model_args.model_name_or_path, from_tf=bool(".ckpt" in model_args.model_name_or_path), config=config, cache_dir=model_args.cache_dir, ) train_dataset = None eval_dataset = None # If do_train passed, train_dataset by default loads train split from file named train.csv in data directory if training_args.do_train: train_dataset = \ MultipleChoiceDataset( tokenizer=tokenizer, task=data_args.task_name, max_seq_length=data_args.max_seq_length, overwrite_cache=data_args.overwrite_cache, mode=Split.train, ) # If do_eval or do_predict passed, eval_dataset by default loads dev split from file named dev.csv in data directory if training_args.do_eval: eval_dataset = \ MultipleChoiceDataset( tokenizer=tokenizer, task=data_args.task_name, max_seq_length=data_args.max_seq_length, overwrite_cache=data_args.overwrite_cache, mode=Split.dev, ) if training_args.do_predict: predict_dataset = \ MultipleChoiceDataset( tokenizer=tokenizer, task=data_args.task_name, max_seq_length=data_args.max_seq_length, overwrite_cache=data_args.overwrite_cache, mode=Split.test, ) if training_args.do_train: if data_args.max_train_samples is not None: train_dataset = train_dataset[:data_args.max_train_samples] # Log a few random samples from the training set: for index in random.sample(range(len(train_dataset)), 3): logger.info(f"Sample {index} of the training set: {train_dataset[index]}.") if training_args.do_eval: if data_args.max_eval_samples is not None: eval_dataset = eval_dataset[:data_args.max_eval_samples] if training_args.do_predict: if data_args.max_predict_samples is not None: predict_dataset = predict_dataset[:data_args.max_predict_samples] # Define custom compute_metrics function, returns macro F1 metric for CaseHOLD task def 
compute_metrics(p: EvalPrediction): preds = np.argmax(p.predictions, axis=1) # Compute macro and micro F1 for 5-class CaseHOLD task macro_f1 = f1_score(y_true=p.label_ids, y_pred=preds, average='macro', zero_division=0) micro_f1 = f1_score(y_true=p.label_ids, y_pred=preds, average='micro', zero_division=0) return {'macro-f1': macro_f1, 'micro-f1': micro_f1} # Initialize our Trainer trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset, compute_metrics=compute_metrics, callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] ) # Training if training_args.do_train: trainer.train( model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None ) trainer.save_model() # Re-save the tokenizer for model sharing if trainer.is_world_process_zero(): tokenizer.save_pretrained(training_args.output_dir) # Evaluation on eval_dataset if training_args.do_eval: logger.info("*** Evaluate ***") metrics = trainer.evaluate(eval_dataset=eval_dataset) max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) # Predict on eval_dataset if training_args.do_predict: logger.info("*** Predict ***") predictions, labels, metrics = trainer.predict(predict_dataset, metric_key_prefix="predict") max_predict_samples = ( data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset) ) metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset)) trainer.log_metrics("predict", metrics) trainer.save_metrics("predict", metrics) output_predict_file = os.path.join(training_args.output_dir, "test_predictions.csv") if trainer.is_world_process_zero(): with open(output_predict_file, "w") as writer: for index, pred_list in enumerate(predictions): pred_line = '\t'.join([f'{pred:.5f}' for pred in pred_list]) writer.write(f"{index}\t{pred_line}\n") # Clean up checkpoints checkpoints = [filepath for filepath in glob.glob(f'{training_args.output_dir}/*/') if '/checkpoint' in filepath] for checkpoint in checkpoints: shutil.rmtree(checkpoint) def _mp_fn(index): # For xla_spawn (TPUs) main() if __name__ == "__main__": main() ================================================ FILE: experiments/casehold_helpers.py ================================================ import logging import os from dataclasses import dataclass from enum import Enum from typing import List, Optional import tqdm import re from filelock import FileLock from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available import datasets logger = logging.getLogger(__name__) @dataclass(frozen=True) class InputFeatures: """ A single set of features of data. Property names are the same names as the corresponding inputs to a model. 
""" input_ids: List[List[int]] attention_mask: Optional[List[List[int]]] token_type_ids: Optional[List[List[int]]] label: Optional[int] class Split(Enum): train = "train" dev = "dev" test = "test" if is_torch_available(): import torch from torch.utils.data.dataset import Dataset class MultipleChoiceDataset(Dataset): """ PyTorch multiple choice dataset class """ features: List[InputFeatures] def __init__( self, tokenizer: PreTrainedTokenizer, task: str, max_seq_length: Optional[int] = None, overwrite_cache=False, mode: Split = Split.train, ): dataset = datasets.load_dataset('lex_glue', task) tokenizer_name = re.sub('[^a-z]+', ' ', tokenizer.name_or_path).title().replace(' ', '') cached_features_file = os.path.join( '.cache', task, "cached_{}_{}_{}_{}".format( mode.value, tokenizer_name, str(max_seq_length), task, ), ) # Make sure only the first process in distributed training processes the dataset, # and the others will use the cache. lock_path = cached_features_file + ".lock" if not os.path.exists(os.path.join('.cache', task)): if not os.path.exists('.cache'): os.mkdir('.cache') os.mkdir(os.path.join('.cache', task)) with FileLock(lock_path): if os.path.exists(cached_features_file) and not overwrite_cache: logger.info(f"Loading features from cached file {cached_features_file}") self.features = torch.load(cached_features_file) else: logger.info(f"Creating features from dataset file at {task}") if mode == Split.dev: examples = dataset['validation'] elif mode == Split.test: examples = dataset['test'] elif mode == Split.train: examples = dataset['train'] logger.info("Training examples: %s", len(examples)) self.features = convert_examples_to_features( examples, max_seq_length, tokenizer, ) logger.info("Saving features into cached file %s", cached_features_file) torch.save(self.features, cached_features_file) def __len__(self): return len(self.features) def __getitem__(self, i) -> InputFeatures: return self.features[i] if is_tf_available(): import tensorflow as tf class TFMultipleChoiceDataset: """ TensorFlow multiple choice dataset class """ features: List[InputFeatures] def __init__( self, tokenizer: PreTrainedTokenizer, task: str, max_seq_length: Optional[int] = 256, overwrite_cache=False, mode: Split = Split.train, ): dataset = datasets.load_dataset('lex_glue') logger.info(f"Creating features from dataset file at {task}") if mode == Split.dev: examples = dataset['validation'] elif mode == Split.test: examples = dataset['test'] else: examples = dataset['train'] logger.info(f"{mode.name.title()} examples: %s", len(examples)) self.features = convert_examples_to_features( examples, max_seq_length, tokenizer, ) def gen(): for (ex_index, ex) in tqdm.tqdm(enumerate(self.features), desc="convert examples to features"): if ex_index % 10000 == 0: logger.info("Writing example %d of %d" % (ex_index, len(examples))) yield ( { "input_ids": ex.input_ids, "attention_mask": ex.attention_mask, "token_type_ids": ex.token_type_ids, }, ex.label, ) self.dataset = tf.data.Dataset.from_generator( gen, ( { "input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32, }, tf.int64, ), ( { "input_ids": tf.TensorShape([None, None]), "attention_mask": tf.TensorShape([None, None]), "token_type_ids": tf.TensorShape([None, None]), }, tf.TensorShape([]), ), ) def get_dataset(self): self.dataset = self.dataset.apply(tf.data.experimental.assert_cardinality(len(self.features))) return self.dataset def __len__(self): return len(self.features) def __getitem__(self, i) -> InputFeatures: return self.features[i] def 
convert_examples_to_features( examples: datasets.Dataset, max_length: int, tokenizer: PreTrainedTokenizer, ) -> List[InputFeatures]: """ Loads a data file into a list of `InputFeatures` """ features = [] for (ex_index, example) in tqdm.tqdm(enumerate(examples), desc="convert examples to features"): if ex_index % 10000 == 0: logger.info("Writing example %d of %d" % (ex_index, len(examples))) choices_inputs = [] for ending_idx, ending in enumerate(example['endings']): context = example['context'] inputs = tokenizer( context, ending, add_special_tokens=True, max_length=max_length, padding="max_length", truncation=True, ) choices_inputs.append(inputs) label = example['label'] input_ids = [x["input_ids"] for x in choices_inputs] attention_mask = ( [x["attention_mask"] for x in choices_inputs] if "attention_mask" in choices_inputs[0] else None ) token_type_ids = ( [x["token_type_ids"] for x in choices_inputs] if "token_type_ids" in choices_inputs[0] else None ) features.append( InputFeatures( input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=label, ) ) for f in features[:2]: logger.info("*** Example ***") logger.info("feature: %s" % f) return features ================================================ FILE: experiments/ecthr.py ================================================ #!/usr/bin/env python # coding=utf-8 """ Finetuning models on the ECtHR dataset (e.g. Bert, RoBERTa, LEGAL-BERT).""" import logging import os import random import sys from dataclasses import dataclass, field from typing import Optional import datasets import numpy as np from datasets import load_dataset from sklearn.metrics import f1_score from trainer import MultilabelTrainer from scipy.special import expit from torch import nn import glob import shutil import transformers from transformers import ( AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, EvalPrediction, HfArgumentParser, TrainingArguments, default_data_collator, set_seed, EarlyStoppingCallback, ) from transformers.trainer_utils import get_last_checkpoint from transformers.utils import check_min_version from transformers.utils.versions import require_version from models.hierbert import HierarchicalBert from models.deberta import DebertaForSequenceClassification # Will error if the minimal version of Transformers is not installed. Remove at your own risks. check_min_version("4.9.0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt") logger = logging.getLogger(__name__) @dataclass class DataTrainingArguments: """ Arguments pertaining to what data we are going to input our model for training and eval. Using `HfArgumentParser` we can turn this class into argparse arguments to be able to specify them on the command line. """ max_seq_length: Optional[int] = field( default=4096, metadata={ "help": "The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded." }, ) max_segments: Optional[int] = field( default=64, metadata={ "help": "The maximum number of segments (paragraphs) to be considered. Sequences longer " "than this will be truncated, sequences shorter will be padded." }, ) max_seg_length: Optional[int] = field( default=128, metadata={ "help": "The maximum segment (paragraph) length to be considered. Segments longer " "than this will be truncated, sequences shorter will be padded." 
}, ) overwrite_cache: bool = field( default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."} ) pad_to_max_length: bool = field( default=True, metadata={ "help": "Whether to pad all samples to `max_seq_length`. " "If False, will pad the samples dynamically when batching to the maximum length in the batch." }, ) max_train_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of training examples to this " "value if set." }, ) max_eval_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this " "value if set." }, ) max_predict_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this " "value if set." }, ) task: Optional[str] = field( default='ecthr_a', metadata={ "help": "Define downstream task" }, ) server_ip: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) server_port: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) @dataclass class ModelArguments: """ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. """ model_name_or_path: str = field( default=None, metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} ) hierarchical: bool = field( default=True, metadata={"help": "Whether to use a hierarchical variant or not"} ) config_name: Optional[str] = field( default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} ) tokenizer_name: Optional[str] = field( default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} ) cache_dir: Optional[str] = field( default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, ) do_lower_case: Optional[bool] = field( default=True, metadata={"help": "arg to indicate if tokenizer should do lower case in AutoTokenizer.from_pretrained()"}, ) use_fast_tokenizer: bool = field( default=True, metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, ) model_revision: str = field( default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) use_auth_token: bool = field( default=False, metadata={ "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " "with private models)." }, ) def main(): # See all possible arguments in src/transformers/training_args.py # or by passing the --help flag to this script. # We now keep distinct sets of args, for a cleaner separation of concerns. 
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) model_args, data_args, training_args = parser.parse_args_into_dataclasses() # Fix boolean parameter if model_args.do_lower_case == 'False' or not model_args.do_lower_case: model_args.do_lower_case = False else: model_args.do_lower_case = True if model_args.hierarchical == 'False' or not model_args.hierarchical: model_args.hierarchical = False else: model_args.hierarchical = True # Setup distant debugging if needed if data_args.server_ip and data_args.server_port: # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script import ptvsd print("Waiting for debugger attach") ptvsd.enable_attach(address=(data_args.server_ip, data_args.server_port), redirect_output=True) ptvsd.wait_for_attach() # Setup logging logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) transformers.utils.logging.set_verbosity(log_level) transformers.utils.logging.enable_default_handler() transformers.utils.logging.enable_explicit_format() # Log on each process the small summary: logger.warning( f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") # Detecting last checkpoint. last_checkpoint = None if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: last_checkpoint = get_last_checkpoint(training_args.output_dir) if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: raise ValueError( f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) elif last_checkpoint is not None: logger.info( f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." ) # Set seed before initializing model. set_seed(training_args.seed) # In distributed training, the load_dataset function guarantees that only one local process can concurrently # download the dataset. # Downloading and loading eurlex dataset from the hub. if training_args.do_train: train_dataset = load_dataset("lex_glue", name=data_args.task, split="train", data_dir='data', cache_dir=model_args.cache_dir) if training_args.do_eval: eval_dataset = load_dataset("lex_glue", name=data_args.task, split="validation", data_dir='data', cache_dir=model_args.cache_dir) if training_args.do_predict: predict_dataset = load_dataset("lex_glue", name=data_args.task, split="test", data_dir='data', cache_dir=model_args.cache_dir) # Labels label_list = list(range(10)) num_labels = len(label_list) # Load pretrained model and tokenizer # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. 
config = AutoConfig.from_pretrained( model_args.config_name if model_args.config_name else model_args.model_name_or_path, num_labels=num_labels, finetuning_task=f"{data_args.task}", cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, do_lower_case=model_args.do_lower_case, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) if config.model_type == 'deberta' and model_args.hierarchical: model = DebertaForSequenceClassification.from_pretrained( model_args.model_name_or_path, from_tf=bool(".ckpt" in model_args.model_name_or_path), config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) else: model = AutoModelForSequenceClassification.from_pretrained( model_args.model_name_or_path, from_tf=bool(".ckpt" in model_args.model_name_or_path), config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) if model_args.hierarchical: # Hack the classifier encoder to use hierarchical BERT if config.model_type in ['bert', 'deberta']: if config.model_type == 'bert': segment_encoder = model.bert else: segment_encoder = model.deberta model_encoder = HierarchicalBert(encoder=segment_encoder, max_segments=data_args.max_segments, max_segment_length=data_args.max_seg_length) if config.model_type == 'bert': model.bert = model_encoder elif config.model_type == 'deberta': model.deberta = model_encoder else: raise NotImplementedError(f"{config.model_type} is no supported yet!") elif config.model_type == 'roberta': model_encoder = HierarchicalBert(encoder=model.roberta, max_segments=data_args.max_segments, max_segment_length=data_args.max_seg_length) model.roberta = model_encoder # Build a new classification layer, as well dense = nn.Linear(config.hidden_size, config.hidden_size) dense.load_state_dict(model.classifier.dense.state_dict()) # load weights dropout = nn.Dropout(config.hidden_dropout_prob).to(model.device) out_proj = nn.Linear(config.hidden_size, config.num_labels).to(model.device) out_proj.load_state_dict(model.classifier.out_proj.state_dict()) # load weights model.classifier = nn.Sequential(dense, dropout, out_proj).to(model.device) elif config.model_type in ['longformer', 'big_bird']: pass else: raise NotImplementedError(f"{config.model_type} is no supported yet!") # Preprocessing the datasets # Padding strategy if data_args.pad_to_max_length: padding = "max_length" else: # We will pad later, dynamically at batch creation, to the max sequence length in each batch padding = False def preprocess_function(examples): # Tokenize the texts if model_args.hierarchical: case_template = [[0] * data_args.max_seg_length] if config.model_type == 'roberta': batch = {'input_ids': [], 'attention_mask': []} for case in examples['text']: case_encodings = tokenizer(case[:data_args.max_segments], padding=padding, max_length=data_args.max_seg_length, truncation=True) batch['input_ids'].append(case_encodings['input_ids'] + case_template * ( data_args.max_segments - len(case_encodings['input_ids']))) batch['attention_mask'].append(case_encodings['attention_mask'] + case_template * ( data_args.max_segments - 
len(case_encodings['attention_mask']))) else: batch = {'input_ids': [], 'attention_mask': [], 'token_type_ids': []} for case in examples['text']: case_encodings = tokenizer(case[:data_args.max_segments], padding=padding, max_length=data_args.max_seg_length, truncation=True) batch['input_ids'].append(case_encodings['input_ids'] + case_template * ( data_args.max_segments - len(case_encodings['input_ids']))) batch['attention_mask'].append(case_encodings['attention_mask'] + case_template * ( data_args.max_segments - len(case_encodings['attention_mask']))) batch['token_type_ids'].append(case_encodings['token_type_ids'] + case_template * ( data_args.max_segments - len(case_encodings['token_type_ids']))) elif config.model_type in ['longformer', 'big_bird']: cases = [] max_position_embeddings = config.max_position_embeddings - 2 if config.model_type == 'longformer' \ else config.max_position_embeddings for case in examples['text']: cases.append(f' {tokenizer.sep_token} '.join( [' '.join(fact.split()[:data_args.max_seg_length]) for fact in case[:data_args.max_segments]])) batch = tokenizer(cases, padding=padding, max_length=max_position_embeddings, truncation=True) if config.model_type == 'longformer': global_attention_mask = np.zeros((len(cases), max_position_embeddings), dtype=np.int32) # global attention on cls token global_attention_mask[:, 0] = 1 batch['global_attention_mask'] = list(global_attention_mask) else: cases = [] for case in examples['text']: cases.append(f'\n'.join(case)) batch = tokenizer(cases, padding=padding, max_length=512, truncation=True) batch["labels"] = [[1 if label in labels else 0 for label in label_list] for labels in examples["labels"]] return batch if training_args.do_train: if data_args.max_train_samples is not None: train_dataset = train_dataset.select(range(data_args.max_train_samples)) with training_args.main_process_first(desc="train dataset map pre-processing"): train_dataset = train_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on train dataset", ) # Log a few random samples from the training set: for index in random.sample(range(len(train_dataset)), 3): logger.info(f"Sample {index} of the training set: {train_dataset[index]}.") if training_args.do_eval: if data_args.max_eval_samples is not None: eval_dataset = eval_dataset.select(range(data_args.max_eval_samples)) with training_args.main_process_first(desc="validation dataset map pre-processing"): eval_dataset = eval_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on validation dataset", ) if training_args.do_predict: if data_args.max_predict_samples is not None: predict_dataset = predict_dataset.select(range(data_args.max_predict_samples)) with training_args.main_process_first(desc="prediction dataset map pre-processing"): predict_dataset = predict_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on prediction dataset", ) # You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. 
def compute_metrics(p: EvalPrediction): # Fix gold labels y_true = np.zeros((p.label_ids.shape[0], p.label_ids.shape[1] + 1), dtype=np.int32) y_true[:, :-1] = p.label_ids y_true[:, -1] = (np.sum(p.label_ids, axis=1) == 0).astype('int32') # Fix predictions logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions preds = (expit(logits) > 0.5).astype('int32') y_pred = np.zeros((p.label_ids.shape[0], p.label_ids.shape[1] + 1), dtype=np.int32) y_pred[:, :-1] = preds y_pred[:, -1] = (np.sum(preds, axis=1) == 0).astype('int32') # Compute scores macro_f1 = f1_score(y_true=y_true, y_pred=y_pred, average='macro', zero_division=0) micro_f1 = f1_score(y_true=y_true, y_pred=y_pred, average='micro', zero_division=0) return {'macro-f1': macro_f1, 'micro-f1': micro_f1} # Data collator will default to DataCollatorWithPadding, so we change it if we already did the padding. if data_args.pad_to_max_length: data_collator = default_data_collator elif training_args.fp16: data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) else: data_collator = None # Initialize our Trainer trainer = MultilabelTrainer( model=model, args=training_args, train_dataset=train_dataset if training_args.do_train else None, eval_dataset=eval_dataset if training_args.do_eval else None, compute_metrics=compute_metrics, tokenizer=tokenizer, data_collator=data_collator, callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] ) # Training if training_args.do_train: checkpoint = None if training_args.resume_from_checkpoint is not None: checkpoint = training_args.resume_from_checkpoint elif last_checkpoint is not None: checkpoint = last_checkpoint train_result = trainer.train(resume_from_checkpoint=checkpoint) metrics = train_result.metrics max_train_samples = ( data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) ) metrics["train_samples"] = min(max_train_samples, len(train_dataset)) trainer.save_model() # Saves the tokenizer too for easy upload trainer.log_metrics("train", metrics) trainer.save_metrics("train", metrics) trainer.save_state() # Evaluation if training_args.do_eval: logger.info("*** Evaluate ***") metrics = trainer.evaluate(eval_dataset=eval_dataset) max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) # Prediction if training_args.do_predict: logger.info("*** Predict ***") predictions, labels, metrics = trainer.predict(predict_dataset, metric_key_prefix="predict") max_predict_samples = ( data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset) ) metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset)) trainer.log_metrics("predict", metrics) trainer.save_metrics("predict", metrics) output_predict_file = os.path.join(training_args.output_dir, "test_predictions.csv") if trainer.is_world_process_zero(): with open(output_predict_file, "w") as writer: for index, pred_list in enumerate(predictions[0]): pred_line = '\t'.join([f'{pred:.5f}' for pred in pred_list]) writer.write(f"{index}\t{pred_line}\n") # Clean up checkpoints checkpoints = [filepath for filepath in glob.glob(f'{training_args.output_dir}/*/') if '/checkpoint' in filepath] for checkpoint in checkpoints: shutil.rmtree(checkpoint) if __name__ == "__main__": main() ================================================ FILE: 
experiments/eurlex.py ================================================ #!/usr/bin/env python # coding=utf-8 """ Finetuning models on EUR-LEX (e.g. Bert, RoBERTa, LEGAL-BERT).""" import logging import os import random import sys from dataclasses import dataclass, field from typing import Optional import datasets from datasets import load_dataset from sklearn.metrics import f1_score from trainer import MultilabelTrainer from scipy.special import expit import glob import shutil import transformers from transformers import ( AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, EvalPrediction, HfArgumentParser, TrainingArguments, default_data_collator, set_seed, EarlyStoppingCallback, ) from transformers.trainer_utils import get_last_checkpoint from transformers.utils import check_min_version from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. check_min_version("4.9.0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt") logger = logging.getLogger(__name__) @dataclass class DataTrainingArguments: """ Arguments pertaining to what data we are going to input our model for training and eval. Using `HfArgumentParser` we can turn this class into argparse arguments to be able to specify them on the command line. """ max_seq_length: Optional[int] = field( default=512, metadata={ "help": "The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded." }, ) overwrite_cache: bool = field( default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."} ) pad_to_max_length: bool = field( default=True, metadata={ "help": "Whether to pad all samples to `max_seq_length`. " "If False, will pad the samples dynamically when batching to the maximum length in the batch." }, ) max_train_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of training examples to this " "value if set." }, ) max_eval_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this " "value if set." }, ) max_predict_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this " "value if set." }, ) server_ip: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) server_port: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) @dataclass class ModelArguments: """ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
""" model_name_or_path: str = field( default=None, metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} ) config_name: Optional[str] = field( default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} ) tokenizer_name: Optional[str] = field( default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} ) cache_dir: Optional[str] = field( default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, ) do_lower_case: Optional[bool] = field( default=True, metadata={"help": "arg to indicate if tokenizer should do lower case in AutoTokenizer.from_pretrained()"}, ) use_fast_tokenizer: bool = field( default=True, metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, ) model_revision: str = field( default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) use_auth_token: bool = field( default=False, metadata={ "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " "with private models)." }, ) def main(): # See all possible arguments in src/transformers/training_args.py # or by passing the --help flag to this script. # We now keep distinct sets of args, for a cleaner separation of concerns. parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) model_args, data_args, training_args = parser.parse_args_into_dataclasses() # Setup distant debugging if needed if data_args.server_ip and data_args.server_port: # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script import ptvsd print("Waiting for debugger attach") ptvsd.enable_attach(address=(data_args.server_ip, data_args.server_port), redirect_output=True) ptvsd.wait_for_attach() # Fix boolean parameter if model_args.do_lower_case == 'False' or not model_args.do_lower_case: model_args.do_lower_case = False 'Tokenizer do_lower_case False' else: model_args.do_lower_case = True # Setup logging logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) transformers.utils.logging.set_verbosity(log_level) transformers.utils.logging.enable_default_handler() transformers.utils.logging.enable_explicit_format() # Log on each process the small summary: logger.warning( f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") # Detecting last checkpoint. last_checkpoint = None if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: last_checkpoint = get_last_checkpoint(training_args.output_dir) if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: raise ValueError( f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) elif last_checkpoint is not None: logger.info( f"Checkpoint detected, resuming training at {last_checkpoint}. 
To avoid this behavior, change " "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." ) # Set seed before initializing model. set_seed(training_args.seed) # In distributed training, the load_dataset function guarantees that only one local process can concurrently # download the dataset. # Downloading and loading eurlex dataset from the hub. if training_args.do_train: train_dataset = load_dataset("lex_glue", "eurlex", split="train", cache_dir=model_args.cache_dir) if training_args.do_eval: eval_dataset = load_dataset("lex_glue", "eurlex", split="validation", cache_dir=model_args.cache_dir) if training_args.do_predict: predict_dataset = load_dataset("lex_glue", "eurlex", split="test", cache_dir=model_args.cache_dir) # Labels label_list = list(range(100)) num_labels = len(label_list) # Load pretrained model and tokenizer # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. config = AutoConfig.from_pretrained( model_args.config_name if model_args.config_name else model_args.model_name_or_path, num_labels=num_labels, finetuning_task="eurlex", cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) if config.model_type == 'big_bird': config.attention_type = 'original_full' tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, do_lower_case=model_args.do_lower_case, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) model = AutoModelForSequenceClassification.from_pretrained( model_args.model_name_or_path, from_tf=bool(".ckpt" in model_args.model_name_or_path), config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) # Preprocessing the datasets # Padding strategy if data_args.pad_to_max_length: padding = "max_length" else: # We will pad later, dynamically at batch creation, to the max sequence length in each batch padding = False def preprocess_function(examples): # Tokenize the texts batch = tokenizer( examples["text"], padding=padding, max_length=data_args.max_seq_length, truncation=True, ) batch["labels"] = [[1 if label in labels else 0 for label in label_list] for labels in examples["labels"]] return batch if training_args.do_train: if data_args.max_train_samples is not None: train_dataset = train_dataset.select(range(data_args.max_train_samples)) with training_args.main_process_first(desc="train dataset map pre-processing"): train_dataset = train_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on train dataset", ) # Log a few random samples from the training set: for index in random.sample(range(len(train_dataset)), 3): logger.info(f"Sample {index} of the training set: {train_dataset[index]}.") if training_args.do_eval: if data_args.max_eval_samples is not None: eval_dataset = eval_dataset.select(range(data_args.max_eval_samples)) with training_args.main_process_first(desc="validation dataset map pre-processing"): eval_dataset = eval_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on validation dataset", ) if training_args.do_predict: if data_args.max_predict_samples is not None: 
predict_dataset = predict_dataset.select(range(data_args.max_predict_samples)) with training_args.main_process_first(desc="prediction dataset map pre-processing"): predict_dataset = predict_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on prediction dataset", ) # You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. def compute_metrics(p: EvalPrediction): logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions preds = (expit(logits) > 0.5).astype('int32') macro_f1 = f1_score(y_true=p.label_ids, y_pred=preds, average='macro', zero_division=0) micro_f1 = f1_score(y_true=p.label_ids, y_pred=preds, average='micro', zero_division=0) return {'macro-f1': macro_f1, 'micro-f1': micro_f1} # Data collator will default to DataCollatorWithPadding, so we change it if we already did the padding. if data_args.pad_to_max_length: data_collator = default_data_collator elif training_args.fp16: data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) else: data_collator = None # Initialize our Trainer trainer = MultilabelTrainer( model=model, args=training_args, train_dataset=train_dataset if training_args.do_train else None, eval_dataset=eval_dataset if training_args.do_eval else None, compute_metrics=compute_metrics, tokenizer=tokenizer, data_collator=data_collator, callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] ) # Training if training_args.do_train: checkpoint = None if training_args.resume_from_checkpoint is not None: checkpoint = training_args.resume_from_checkpoint elif last_checkpoint is not None: checkpoint = last_checkpoint train_result = trainer.train(resume_from_checkpoint=checkpoint) metrics = train_result.metrics max_train_samples = ( data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) ) metrics["train_samples"] = min(max_train_samples, len(train_dataset)) trainer.save_model() # Saves the tokenizer too for easy upload trainer.log_metrics("train", metrics) trainer.save_metrics("train", metrics) trainer.save_state() # Evaluation if training_args.do_eval: logger.info("*** Evaluate ***") metrics = trainer.evaluate(eval_dataset=eval_dataset) max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) # Prediction if training_args.do_predict: logger.info("*** Predict ***") predictions, labels, metrics = trainer.predict(predict_dataset, metric_key_prefix="predict") max_predict_samples = ( data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset) ) metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset)) trainer.log_metrics("predict", metrics) trainer.save_metrics("predict", metrics) output_predict_file = os.path.join(training_args.output_dir, "test_predictions.csv") if trainer.is_world_process_zero(): with open(output_predict_file, "w") as writer: for index, pred_list in enumerate(predictions): pred_line = '\t'.join([f'{pred:.5f}' for pred in pred_list]) writer.write(f"{index}\t{pred_line}\n") # Clean up checkpoints checkpoints = [filepath for filepath in glob.glob(f'{training_args.output_dir}/*/') if '/checkpoint' in filepath] for checkpoint 
in checkpoints: shutil.rmtree(checkpoint) if __name__ == "__main__": main() ================================================ FILE: experiments/ledgar.py ================================================ #!/usr/bin/env python # coding=utf-8 """ Finetuning models on LEDGAR (e.g. Bert, RoBERTa, LEGAL-BERT).""" import logging import os import random import sys from dataclasses import dataclass, field from typing import Optional import datasets from datasets import load_dataset from sklearn.metrics import f1_score import numpy as np import glob import shutil import transformers from transformers import ( AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, EvalPrediction, HfArgumentParser, TrainingArguments, default_data_collator, set_seed, EarlyStoppingCallback, Trainer ) from transformers.trainer_utils import get_last_checkpoint from transformers.utils import check_min_version from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. check_min_version("4.9.0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt") logger = logging.getLogger(__name__) @dataclass class DataTrainingArguments: """ Arguments pertaining to what data we are going to input our model for training and eval. Using `HfArgumentParser` we can turn this class into argparse arguments to be able to specify them on the command line. """ max_seq_length: Optional[int] = field( default=512, metadata={ "help": "The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded." }, ) overwrite_cache: bool = field( default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."} ) pad_to_max_length: bool = field( default=True, metadata={ "help": "Whether to pad all samples to `max_seq_length`. " "If False, will pad the samples dynamically when batching to the maximum length in the batch." }, ) max_train_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of training examples to this " "value if set." }, ) max_eval_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this " "value if set." }, ) max_predict_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this " "value if set." }, ) server_ip: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) server_port: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) @dataclass class ModelArguments: """ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
""" model_name_or_path: str = field( default=None, metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} ) config_name: Optional[str] = field( default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} ) tokenizer_name: Optional[str] = field( default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} ) cache_dir: Optional[str] = field( default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, ) do_lower_case: Optional[bool] = field( default=True, metadata={"help": "arg to indicate if tokenizer should do lower case in AutoTokenizer.from_pretrained()"}, ) use_fast_tokenizer: bool = field( default=True, metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, ) model_revision: str = field( default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) use_auth_token: bool = field( default=False, metadata={ "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " "with private models)." }, ) def main(): # See all possible arguments in src/transformers/training_args.py # or by passing the --help flag to this script. # We now keep distinct sets of args, for a cleaner separation of concerns. parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) model_args, data_args, training_args = parser.parse_args_into_dataclasses() # Setup distant debugging if needed if data_args.server_ip and data_args.server_port: # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script import ptvsd print("Waiting for debugger attach") ptvsd.enable_attach(address=(data_args.server_ip, data_args.server_port), redirect_output=True) ptvsd.wait_for_attach() # Fix boolean parameter if model_args.do_lower_case == 'False' or not model_args.do_lower_case: model_args.do_lower_case = False 'Tokenizer do_lower_case False' else: model_args.do_lower_case = True # Setup logging logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) transformers.utils.logging.set_verbosity(log_level) transformers.utils.logging.enable_default_handler() transformers.utils.logging.enable_explicit_format() # Log on each process the small summary: logger.warning( f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") # Detecting last checkpoint. last_checkpoint = None if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: last_checkpoint = get_last_checkpoint(training_args.output_dir) if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: raise ValueError( f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) elif last_checkpoint is not None: logger.info( f"Checkpoint detected, resuming training at {last_checkpoint}. 
To avoid this behavior, change " "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." ) # Set seed before initializing model. set_seed(training_args.seed) # In distributed training, the load_dataset function guarantees that only one local process can concurrently # download the dataset. # Downloading and loading eurlex dataset from the hub. if training_args.do_train: train_dataset = load_dataset("lex_glue", "ledgar", split="train", cache_dir=model_args.cache_dir) if training_args.do_eval: eval_dataset = load_dataset("lex_glue", "ledgar", split="validation", cache_dir=model_args.cache_dir) if training_args.do_predict: predict_dataset = load_dataset("lex_glue", "ledgar", split="test", cache_dir=model_args.cache_dir) # Labels label_list = list(range(100)) num_labels = len(label_list) # Load pretrained model and tokenizer # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. config = AutoConfig.from_pretrained( model_args.config_name if model_args.config_name else model_args.model_name_or_path, num_labels=num_labels, finetuning_task="eurlex", cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) if config.model_type == 'big_bird': config.attention_type = 'original_full' tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, do_lower_case=model_args.do_lower_case, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) model = AutoModelForSequenceClassification.from_pretrained( model_args.model_name_or_path, from_tf=bool(".ckpt" in model_args.model_name_or_path), config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) # Preprocessing the datasets # Padding strategy if data_args.pad_to_max_length: padding = "max_length" else: # We will pad later, dynamically at batch creation, to the max sequence length in each batch padding = False def preprocess_function(examples): # Tokenize the texts batch = tokenizer( examples["text"], padding=padding, max_length=data_args.max_seq_length, truncation=True, ) batch["label"] = [label_list.index(label) for label in examples["label"]] return batch if training_args.do_train: if data_args.max_train_samples is not None: train_dataset = train_dataset.select(range(data_args.max_train_samples)) with training_args.main_process_first(desc="train dataset map pre-processing"): train_dataset = train_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on train dataset", ) # Log a few random samples from the training set: for index in random.sample(range(len(train_dataset)), 3): logger.info(f"Sample {index} of the training set: {train_dataset[index]}.") if training_args.do_eval: if data_args.max_eval_samples is not None: eval_dataset = eval_dataset.select(range(data_args.max_eval_samples)) with training_args.main_process_first(desc="validation dataset map pre-processing"): eval_dataset = eval_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on validation dataset", ) if training_args.do_predict: if data_args.max_predict_samples is not None: predict_dataset = 
predict_dataset.select(range(data_args.max_predict_samples)) with training_args.main_process_first(desc="prediction dataset map pre-processing"): predict_dataset = predict_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on prediction dataset", ) # You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. def compute_metrics(p: EvalPrediction): logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions preds = np.argmax(logits, axis=1) macro_f1 = f1_score(y_true=p.label_ids, y_pred=preds, average='macro', zero_division=0) micro_f1 = f1_score(y_true=p.label_ids, y_pred=preds, average='micro', zero_division=0) return {'macro-f1': macro_f1, 'micro-f1': micro_f1} # Data collator will default to DataCollatorWithPadding, so we change it if we already did the padding. if data_args.pad_to_max_length: data_collator = default_data_collator elif training_args.fp16: data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) else: data_collator = None # Initialize our Trainer trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset if training_args.do_train else None, eval_dataset=eval_dataset if training_args.do_eval else None, compute_metrics=compute_metrics, tokenizer=tokenizer, data_collator=data_collator, callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] ) # Training if training_args.do_train: checkpoint = None if training_args.resume_from_checkpoint is not None: checkpoint = training_args.resume_from_checkpoint elif last_checkpoint is not None: checkpoint = last_checkpoint train_result = trainer.train(resume_from_checkpoint=checkpoint) metrics = train_result.metrics max_train_samples = ( data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) ) metrics["train_samples"] = min(max_train_samples, len(train_dataset)) trainer.save_model() # Saves the tokenizer too for easy upload trainer.log_metrics("train", metrics) trainer.save_metrics("train", metrics) trainer.save_state() # Evaluation if training_args.do_eval: logger.info("*** Evaluate ***") metrics = trainer.evaluate(eval_dataset=eval_dataset) max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) # Prediction if training_args.do_predict: logger.info("*** Predict ***") predictions, labels, metrics = trainer.predict(predict_dataset, metric_key_prefix="predict") max_predict_samples = ( data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset) ) metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset)) trainer.log_metrics("predict", metrics) trainer.save_metrics("predict", metrics) output_predict_file = os.path.join(training_args.output_dir, "test_predictions.csv") if trainer.is_world_process_zero(): with open(output_predict_file, "w") as writer: for index, pred_list in enumerate(predictions): pred_line = '\t'.join([f'{pred:.5f}' for pred in pred_list]) writer.write(f"{index}\t{pred_line}\n") # Clean up checkpoints checkpoints = [filepath for filepath in glob.glob(f'{training_args.output_dir}/*/') if '/checkpoint' in filepath] for checkpoint in checkpoints: shutil.rmtree(checkpoint) 
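# ---------------------------------------------------------------------------
# Editorial usage sketch (not part of the original script): LEDGAR is a
# single-label task over 100 contract-provision classes, so this script uses
# the stock `Trainer` with argmax decoding instead of `MultilabelTrainer`.
# A hypothetical manual invocation (flag values are illustrative only, not the
# tuned settings from scripts/run_ledgar.sh) could look like:
#
#   python experiments/ledgar.py \
#       --model_name_or_path bert-base-uncased \
#       --output_dir logs/ledgar/bert-base-uncased/seed_1 \
#       --do_train --do_eval --do_predict \
#       --max_seq_length 512 --evaluation_strategy epoch --save_strategy epoch \
#       --load_best_model_at_end --metric_for_best_model micro-f1
# ---------------------------------------------------------------------------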
if __name__ == "__main__": main() ================================================ FILE: experiments/scotus.py ================================================ #!/usr/bin/env python # coding=utf-8 """ Finetuning models on SCOTUS (e.g. Bert, RoBERTa, LEGAL-BERT).""" import logging import os import random import re import sys from dataclasses import dataclass, field from typing import Optional import datasets from datasets import load_dataset from sklearn.metrics import f1_score from models.hierbert import HierarchicalBert import numpy as np from torch import nn import glob import shutil import transformers from transformers import ( Trainer, AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, EvalPrediction, HfArgumentParser, TrainingArguments, default_data_collator, set_seed, EarlyStoppingCallback, ) from transformers.trainer_utils import get_last_checkpoint from transformers.utils import check_min_version from transformers.utils.versions import require_version from models.deberta import DebertaForSequenceClassification # Will error if the minimal version of Transformers is not installed. Remove at your own risks. check_min_version("4.9.0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt") logger = logging.getLogger(__name__) @dataclass class DataTrainingArguments: """ Arguments pertaining to what data we are going to input our model for training and eval. Using `HfArgumentParser` we can turn this class into argparse arguments to be able to specify them on the command line. """ max_seq_length: Optional[int] = field( default=128, metadata={ "help": "The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded." }, ) max_segments: Optional[int] = field( default=64, metadata={ "help": "The maximum number of segments (paragraphs) to be considered. Sequences longer " "than this will be truncated, sequences shorter will be padded." }, ) max_seg_length: Optional[int] = field( default=128, metadata={ "help": "The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded." }, ) overwrite_cache: bool = field( default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."} ) pad_to_max_length: bool = field( default=True, metadata={ "help": "Whether to pad all samples to `max_seq_length`. " "If False, will pad the samples dynamically when batching to the maximum length in the batch." }, ) max_train_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of training examples to this " "value if set." }, ) max_eval_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this " "value if set." }, ) max_predict_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this " "value if set." }, ) server_ip: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) server_port: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) @dataclass class ModelArguments: """ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
""" model_name_or_path: str = field( default=None, metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} ) hierarchical: bool = field( default=True, metadata={"help": "Whether to use a hierarchical variant or not"} ) config_name: Optional[str] = field( default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} ) tokenizer_name: Optional[str] = field( default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} ) cache_dir: Optional[str] = field( default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, ) do_lower_case: Optional[bool] = field( default=True, metadata={"help": "arg to indicate if tokenizer should do lower case in AutoTokenizer.from_pretrained()"}, ) use_fast_tokenizer: bool = field( default=True, metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, ) model_revision: str = field( default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) use_auth_token: bool = field( default=False, metadata={ "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " "with private models)." }, ) def main(): # See all possible arguments in src/transformers/training_args.py # or by passing the --help flag to this script. # We now keep distinct sets of args, for a cleaner separation of concerns. parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) model_args, data_args, training_args = parser.parse_args_into_dataclasses() # Setup distant debugging if needed if data_args.server_ip and data_args.server_port: # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script import ptvsd print("Waiting for debugger attach") ptvsd.enable_attach(address=(data_args.server_ip, data_args.server_port), redirect_output=True) ptvsd.wait_for_attach() # Fix boolean parameter if model_args.do_lower_case == 'False' or not model_args.do_lower_case: model_args.do_lower_case = False else: model_args.do_lower_case = True if model_args.hierarchical == 'False' or not model_args.hierarchical: model_args.hierarchical = False else: model_args.hierarchical = True # Setup logging logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) transformers.utils.logging.set_verbosity(log_level) transformers.utils.logging.enable_default_handler() transformers.utils.logging.enable_explicit_format() # Log on each process the small summary: logger.warning( f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") # Detecting last checkpoint. 
last_checkpoint = None if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: last_checkpoint = get_last_checkpoint(training_args.output_dir) if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: raise ValueError( f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) elif last_checkpoint is not None: logger.info( f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." ) # Set seed before initializing model. set_seed(training_args.seed) # In distributed training, the load_dataset function guarantees that only one local process can concurrently # download the dataset. # Downloading and loading eurlex dataset from the hub. if training_args.do_train: train_dataset = load_dataset("lex_glue", "scotus", split="train", cache_dir=model_args.cache_dir) if training_args.do_eval: eval_dataset = load_dataset("lex_glue", "scotus", split="validation", cache_dir=model_args.cache_dir) if training_args.do_predict: predict_dataset = load_dataset("lex_glue", "scotus", split="test", cache_dir=model_args.cache_dir) # Labels label_list = list(range(14)) num_labels = len(label_list) # Load pretrained model and tokenizer # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. config = AutoConfig.from_pretrained( model_args.config_name if model_args.config_name else model_args.model_name_or_path, num_labels=num_labels, finetuning_task="scotus", cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, do_lower_case=model_args.do_lower_case, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) if config.model_type == 'deberta' and model_args.hierarchical: model = DebertaForSequenceClassification.from_pretrained( model_args.model_name_or_path, from_tf=bool(".ckpt" in model_args.model_name_or_path), config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) else: model = AutoModelForSequenceClassification.from_pretrained( model_args.model_name_or_path, from_tf=bool(".ckpt" in model_args.model_name_or_path), config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) if model_args.hierarchical: # Hack the classifier encoder to use hierarchical BERT if config.model_type in ['bert', 'deberta']: if config.model_type == 'bert': segment_encoder = model.bert else: segment_encoder = model.deberta model_encoder = HierarchicalBert(encoder=segment_encoder, max_segments=data_args.max_segments, max_segment_length=data_args.max_seg_length) if config.model_type == 'bert': model.bert = model_encoder elif config.model_type == 'deberta': model.deberta = model_encoder else: raise NotImplementedError(f"{config.model_type} is no supported yet!") elif config.model_type == 'roberta': model_encoder = HierarchicalBert(encoder=model.roberta, max_segments=data_args.max_segments, 
max_segment_length=data_args.max_seg_length) model.roberta = model_encoder # Build a new classification layer, as well dense = nn.Linear(config.hidden_size, config.hidden_size) dense.load_state_dict(model.classifier.dense.state_dict()) # load weights dropout = nn.Dropout(config.hidden_dropout_prob).to(model.device) out_proj = nn.Linear(config.hidden_size, config.num_labels).to(model.device) out_proj.load_state_dict(model.classifier.out_proj.state_dict()) # load weights model.classifier = nn.Sequential(dense, dropout, out_proj).to(model.device) elif config.model_type in ['longformer', 'big_bird']: pass else: raise NotImplementedError(f"{config.model_type} is no supported yet!") # Preprocessing the datasets # Padding strategy if data_args.pad_to_max_length: padding = "max_length" else: # We will pad later, dynamically at batch creation, to the max sequence length in each batch padding = False def preprocess_function(examples): # Tokenize the texts if model_args.hierarchical: case_template = [[0] * data_args.max_seq_length] if config.model_type == 'roberta': batch = {'input_ids': [], 'attention_mask': []} for doc in examples['text']: doc = re.split('\n{2,}', doc) doc_encodings = tokenizer(doc[:data_args.max_segments], padding=padding, max_length=data_args.max_seg_length, truncation=True) batch['input_ids'].append(doc_encodings['input_ids'] + case_template * ( data_args.max_segments - len(doc_encodings['input_ids']))) batch['attention_mask'].append(doc_encodings['attention_mask'] + case_template * ( data_args.max_segments - len(doc_encodings['attention_mask']))) else: batch = {'input_ids': [], 'attention_mask': [], 'token_type_ids': []} for doc in examples['text']: doc = re.split('\n{2,}', doc) doc_encodings = tokenizer(doc[:data_args.max_segments], padding=padding, max_length=data_args.max_seg_length, truncation=True) batch['input_ids'].append(doc_encodings['input_ids'] + case_template * ( data_args.max_segments - len(doc_encodings['input_ids']))) batch['attention_mask'].append(doc_encodings['attention_mask'] + case_template * ( data_args.max_segments - len(doc_encodings['attention_mask']))) batch['token_type_ids'].append(doc_encodings['token_type_ids'] + case_template * ( data_args.max_segments - len(doc_encodings['token_type_ids']))) elif config.model_type in ['longformer', 'big_bird']: cases = [] max_position_embeddings = config.max_position_embeddings - 2 if config.model_type == 'longformer' \ else config.max_position_embeddings for doc in examples['text']: doc = re.split('\n{2,}', doc) cases.append(f' {tokenizer.sep_token} '.join([' '.join(paragraph.split()[:data_args.max_seg_length]) for paragraph in doc[:data_args.max_segments]])) batch = tokenizer(cases, padding=padding, max_length=max_position_embeddings, truncation=True) if config.model_type == 'longformer': global_attention_mask = np.zeros((len(cases), max_position_embeddings), dtype=np.int32) # global attention on cls token global_attention_mask[:, 0] = 1 batch['global_attention_mask'] = list(global_attention_mask) else: batch = tokenizer(examples['text'], padding=padding, max_length=512, truncation=True) batch["label"] = [label_list.index(labels) for labels in examples["label"]] return batch if training_args.do_train: if data_args.max_train_samples is not None: train_dataset = train_dataset.select(range(data_args.max_train_samples)) with training_args.main_process_first(desc="train dataset map pre-processing"): train_dataset = train_dataset.map( preprocess_function, batched=True, load_from_cache_file=not 
data_args.overwrite_cache, desc="Running tokenizer on train dataset", ) # Log a few random samples from the training set: for index in random.sample(range(len(train_dataset)), 3): logger.info(f"Sample {index} of the training set: {train_dataset[index]}.") if training_args.do_eval: if data_args.max_eval_samples is not None: eval_dataset = eval_dataset.select(range(data_args.max_eval_samples)) with training_args.main_process_first(desc="validation dataset map pre-processing"): eval_dataset = eval_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on validation dataset", ) if training_args.do_predict: if data_args.max_predict_samples is not None: predict_dataset = predict_dataset.select(range(data_args.max_predict_samples)) with training_args.main_process_first(desc="prediction dataset map pre-processing"): predict_dataset = predict_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on prediction dataset", ) # You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. def compute_metrics(p: EvalPrediction): logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions preds = np.argmax(logits, axis=1) macro_f1 = f1_score(y_true=p.label_ids, y_pred=preds, average='macro', zero_division=0) micro_f1 = f1_score(y_true=p.label_ids, y_pred=preds, average='micro', zero_division=0) return {'macro-f1': macro_f1, 'micro-f1': micro_f1} # Data collator will default to DataCollatorWithPadding, so we change it if we already did the padding. if data_args.pad_to_max_length: data_collator = default_data_collator elif training_args.fp16: data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) else: data_collator = None # Initialize our Trainer trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset if training_args.do_train else None, eval_dataset=eval_dataset if training_args.do_eval else None, compute_metrics=compute_metrics, tokenizer=tokenizer, data_collator=data_collator, callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] ) # Training if training_args.do_train: checkpoint = None if training_args.resume_from_checkpoint is not None: checkpoint = training_args.resume_from_checkpoint elif last_checkpoint is not None: checkpoint = last_checkpoint train_result = trainer.train(resume_from_checkpoint=checkpoint) metrics = train_result.metrics max_train_samples = ( data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) ) metrics["train_samples"] = min(max_train_samples, len(train_dataset)) trainer.save_model() # Saves the tokenizer too for easy upload trainer.log_metrics("train", metrics) trainer.save_metrics("train", metrics) trainer.save_state() # Evaluation if training_args.do_eval: logger.info("*** Evaluate ***") metrics = trainer.evaluate(eval_dataset=eval_dataset) max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) # Prediction if training_args.do_predict: logger.info("*** Predict ***") predictions, labels, metrics = trainer.predict(predict_dataset, metric_key_prefix="predict") max_predict_samples = ( 
data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset) ) metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset)) trainer.log_metrics("predict", metrics) trainer.save_metrics("predict", metrics) output_predict_file = os.path.join(training_args.output_dir, "test_predictions.csv") if trainer.is_world_process_zero(): with open(output_predict_file, "w") as writer: for index, pred_list in enumerate(predictions[0]): pred_line = '\t'.join([f'{pred:.5f}' for pred in pred_list]) writer.write(f"{index}\t{pred_line}\n") # Clean up checkpoints checkpoints = [filepath for filepath in glob.glob(f'{training_args.output_dir}/*/') if '/checkpoint' in filepath] for checkpoint in checkpoints: shutil.rmtree(checkpoint) if __name__ == "__main__": main() ================================================ FILE: experiments/trainer.py ================================================ from torch import nn from transformers import Trainer class MultilabelTrainer(Trainer): def compute_loss(self, model, inputs, return_outputs=False): labels = inputs.pop("labels") outputs = model(**inputs) logits = outputs.logits loss_fct = nn.BCEWithLogitsLoss() loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.float().view(-1, self.model.config.num_labels)) return (loss, outputs) if return_outputs else loss ================================================ FILE: experiments/unfair_tos.py ================================================ #!/usr/bin/env python # coding=utf-8 """ Finetuning models on UNFAIR-ToC (e.g. Bert, RoBERTa, LEGAL-BERT).""" import logging import os import random import sys from dataclasses import dataclass, field from typing import Optional import datasets from datasets import load_dataset from sklearn.metrics import f1_score from trainer import MultilabelTrainer from scipy.special import expit import glob import shutil import numpy as np import transformers from transformers import ( AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, EvalPrediction, HfArgumentParser, TrainingArguments, default_data_collator, set_seed, EarlyStoppingCallback, ) from transformers.trainer_utils import get_last_checkpoint from transformers.utils import check_min_version from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. check_min_version("4.9.0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt") logger = logging.getLogger(__name__) @dataclass class DataTrainingArguments: """ Arguments pertaining to what data we are going to input our model for training and eval. Using `HfArgumentParser` we can turn this class into argparse arguments to be able to specify them on the command line. """ max_seq_length: Optional[int] = field( default=128, metadata={ "help": "The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded." }, ) overwrite_cache: bool = field( default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."} ) pad_to_max_length: bool = field( default=True, metadata={ "help": "Whether to pad all samples to `max_seq_length`. " "If False, will pad the samples dynamically when batching to the maximum length in the batch." 
}, ) max_train_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of training examples to this " "value if set." }, ) max_eval_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this " "value if set." }, ) max_predict_samples: Optional[int] = field( default=None, metadata={ "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this " "value if set." }, ) server_ip: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) server_port: Optional[str] = field(default=None, metadata={"help": "For distant debugging."}) @dataclass class ModelArguments: """ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. """ model_name_or_path: str = field( default=None, metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} ) config_name: Optional[str] = field( default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} ) tokenizer_name: Optional[str] = field( default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} ) cache_dir: Optional[str] = field( default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, ) do_lower_case: Optional[bool] = field( default=True, metadata={"help": "arg to indicate if tokenizer should do lower case in AutoTokenizer.from_pretrained()"}, ) use_fast_tokenizer: bool = field( default=True, metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, ) model_revision: str = field( default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) use_auth_token: bool = field( default=False, metadata={ "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " "with private models)." }, ) def main(): # See all possible arguments in src/transformers/training_args.py # or by passing the --help flag to this script. # We now keep distinct sets of args, for a cleaner separation of concerns. 
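# Editorial sketch (illustrative, not part of the original file): HfArgumentParser
# below maps CLI flags onto the three dataclasses, e.g. a hypothetical call
#   python experiments/unfair_tos.py --model_name_or_path nlpaueb/legal-bert-base-uncased \
#       --max_seq_length 128 --output_dir logs/unfair_tos --do_train --do_eval
# populates ModelArguments.model_name_or_path, DataTrainingArguments.max_seq_length,
# and TrainingArguments.output_dir / do_train / do_eval respectively.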
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) model_args, data_args, training_args = parser.parse_args_into_dataclasses() # Setup distant debugging if needed if data_args.server_ip and data_args.server_port: # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script import ptvsd print("Waiting for debugger attach") ptvsd.enable_attach(address=(data_args.server_ip, data_args.server_port), redirect_output=True) ptvsd.wait_for_attach() # Fix boolean parameter if model_args.do_lower_case == 'False' or not model_args.do_lower_case: model_args.do_lower_case = False 'Tokenizer do_lower_case False' else: model_args.do_lower_case = True # Setup logging logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) transformers.utils.logging.set_verbosity(log_level) transformers.utils.logging.enable_default_handler() transformers.utils.logging.enable_explicit_format() # Log on each process the small summary: logger.warning( f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") # Detecting last checkpoint. last_checkpoint = None if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: last_checkpoint = get_last_checkpoint(training_args.output_dir) if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: raise ValueError( f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) elif last_checkpoint is not None: logger.info( f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." ) # Set seed before initializing model. set_seed(training_args.seed) # In distributed training, the load_dataset function guarantees that only one local process can concurrently # download the dataset. # Downloading and loading eurlex dataset from the hub. if training_args.do_train: train_dataset = load_dataset("lex_glue", "unfair_tos", split="train", data_dir='data', cache_dir=model_args.cache_dir) if training_args.do_eval: eval_dataset = load_dataset("lex_glue", "unfair_tos", split="validation", data_dir='data', cache_dir=model_args.cache_dir) if training_args.do_predict: predict_dataset = load_dataset("lex_glue", "unfair_tos", split="test", data_dir='data', cache_dir=model_args.cache_dir) # Labels label_list = list(range(8)) num_labels = len(label_list) # Load pretrained model and tokenizer # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. 
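# Editorial note (worked example, not part of the original file): UNFAIR-ToS is a
# multi-label task over 8 unfair-clause categories, so each example carries a
# (possibly empty) list of label indices. preprocess_function below turns that
# list into an 8-dim multi-hot vector, e.g.
#   >>> [1 if label in [1, 5] else 0 for label in list(range(8))]
#   [0, 1, 0, 0, 0, 1, 0, 0]
# and MultilabelTrainer (experiments/trainer.py) trains against these vectors
# with BCEWithLogitsLoss.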
config = AutoConfig.from_pretrained( model_args.config_name if model_args.config_name else model_args.model_name_or_path, num_labels=num_labels, finetuning_task="unfair_toc", cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) if config.model_type == 'big_bird': config.attention_type = 'original_full' if config.model_type == 'longformer': config.attention_window = [128] * config.num_hidden_layers tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, do_lower_case=model_args.do_lower_case, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) model = AutoModelForSequenceClassification.from_pretrained( model_args.model_name_or_path, from_tf=bool(".ckpt" in model_args.model_name_or_path), config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) # Preprocessing the datasets # Padding strategy if data_args.pad_to_max_length: padding = "max_length" else: # We will pad later, dynamically at batch creation, to the max sequence length in each batch padding = False def preprocess_function(examples): # Tokenize the texts batch = tokenizer( examples["text"], padding=padding, max_length=data_args.max_seq_length, truncation=True, ) batch["labels"] = [[1 if label in labels else 0 for label in label_list] for labels in examples["labels"]] return batch if training_args.do_train: if data_args.max_train_samples is not None: train_dataset = train_dataset.select(range(data_args.max_train_samples)) with training_args.main_process_first(desc="train dataset map pre-processing"): train_dataset = train_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on train dataset", ) # Log a few random samples from the training set: for index in random.sample(range(len(train_dataset)), 3): logger.info(f"Sample {index} of the training set: {train_dataset[index]}.") if training_args.do_eval: if data_args.max_eval_samples is not None: eval_dataset = eval_dataset.select(range(data_args.max_eval_samples)) with training_args.main_process_first(desc="validation dataset map pre-processing"): eval_dataset = eval_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on validation dataset", ) if training_args.do_predict: if data_args.max_predict_samples is not None: predict_dataset = predict_dataset.select(range(data_args.max_predict_samples)) with training_args.main_process_first(desc="prediction dataset map pre-processing"): predict_dataset = predict_dataset.map( preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on prediction dataset", ) # You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. 
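# Editorial note (behaviour sketch of the metric defined below): because some
# UNFAIR-ToS examples contain no unfair clause at all, compute_metrics appends a
# virtual ninth "no violation" column that is 1 exactly when all 8 real labels
# (or all 8 sigmoid-thresholded predictions) are 0; micro/macro-F1 are then
# computed over the 9 columns. Illustrative rows (hypothetical data):
#   gold [0,1,0,0,0,1,0,0]  ->  y_true row [0,1,0,0,0,1,0,0,0]
#   pred [0,0,0,0,0,0,0,0]  ->  y_pred row [0,0,0,0,0,0,0,0,1]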
def compute_metrics(p: EvalPrediction): # Fix gold labels y_true = np.zeros((p.label_ids.shape[0], p.label_ids.shape[1] + 1), dtype=np.int32) y_true[:, :-1] = p.label_ids y_true[:, -1] = (np.sum(p.label_ids, axis=1) == 0).astype('int32') # Fix predictions logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions preds = (expit(logits) > 0.5).astype('int32') y_pred = np.zeros((p.label_ids.shape[0], p.label_ids.shape[1] + 1), dtype=np.int32) y_pred[:, :-1] = preds y_pred[:, -1] = (np.sum(preds, axis=1) == 0).astype('int32') # Compute scores macro_f1 = f1_score(y_true=y_true, y_pred=y_pred, average='macro', zero_division=0) micro_f1 = f1_score(y_true=y_true, y_pred=y_pred, average='micro', zero_division=0) return {'macro-f1': macro_f1, 'micro-f1': micro_f1} # Data collator will default to DataCollatorWithPadding, so we change it if we already did the padding. if data_args.pad_to_max_length: data_collator = default_data_collator elif training_args.fp16: data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) else: data_collator = None # Initialize our Trainer trainer = MultilabelTrainer( model=model, args=training_args, train_dataset=train_dataset if training_args.do_train else None, eval_dataset=eval_dataset if training_args.do_eval else None, compute_metrics=compute_metrics, tokenizer=tokenizer, data_collator=data_collator, callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] ) # Training if training_args.do_train: checkpoint = None if training_args.resume_from_checkpoint is not None: checkpoint = training_args.resume_from_checkpoint elif last_checkpoint is not None: checkpoint = last_checkpoint train_result = trainer.train(resume_from_checkpoint=checkpoint) metrics = train_result.metrics max_train_samples = ( data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) ) metrics["train_samples"] = min(max_train_samples, len(train_dataset)) trainer.save_model() # Saves the tokenizer too for easy upload trainer.log_metrics("train", metrics) trainer.save_metrics("train", metrics) trainer.save_state() # Evaluation if training_args.do_eval: logger.info("*** Evaluate ***") metrics = trainer.evaluate(eval_dataset=eval_dataset) max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) # Prediction if training_args.do_predict: logger.info("*** Predict ***") predictions, labels, metrics = trainer.predict(predict_dataset, metric_key_prefix="predict") max_predict_samples = ( data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset) ) metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset)) trainer.log_metrics("predict", metrics) trainer.save_metrics("predict", metrics) output_predict_file = os.path.join(training_args.output_dir, "test_predictions.csv") if trainer.is_world_process_zero(): with open(output_predict_file, "w") as writer: for index, pred_list in enumerate(predictions[0]): pred_line = '\t'.join([f'{pred:.5f}' for pred in pred_list]) writer.write(f"{index}\t{pred_line}\n") # Clean up checkpoints checkpoints = [filepath for filepath in glob.glob(f'{training_args.output_dir}/*/') if '/checkpoint' in filepath] for checkpoint in checkpoints: shutil.rmtree(checkpoint) if __name__ == "__main__": main() ================================================ FILE: 
models/deberta.py ================================================ import torch from torch import nn from transformers import DebertaPreTrainedModel, DebertaModel from transformers.modeling_outputs import SequenceClassifierOutput, MultipleChoiceModelOutput from transformers.activations import ACT2FN class ContextPooler(nn.Module): def __init__(self, config): super().__init__() self.dense = nn.Linear(config.pooler_hidden_size, config.pooler_hidden_size) self.dropout = StableDropout(config.pooler_dropout) self.config = config def forward(self, hidden_states): # We "pool" the model by simply taking the hidden state corresponding # to the first token. context_token = hidden_states[:, 0] context_token = self.dropout(context_token) pooled_output = self.dense(context_token) pooled_output = ACT2FN[self.config.pooler_hidden_act](pooled_output) return pooled_output @property def output_dim(self): return self.config.hidden_size class DropoutContext(object): def __init__(self): self.dropout = 0 self.mask = None self.scale = 1 self.reuse_mask = True def get_mask(input, local_context): if not isinstance(local_context, DropoutContext): dropout = local_context mask = None else: dropout = local_context.dropout dropout *= local_context.scale mask = local_context.mask if local_context.reuse_mask else None if dropout > 0 and mask is None: mask = (1 - torch.empty_like(input).bernoulli_(1 - dropout)).bool() if isinstance(local_context, DropoutContext): if local_context.mask is None: local_context.mask = mask return mask, dropout class XDropout(torch.autograd.Function): """Optimized dropout function to save computation and memory by using mask operation instead of multiplication.""" @staticmethod def forward(ctx, input, local_ctx): mask, dropout = get_mask(input, local_ctx) ctx.scale = 1.0 / (1 - dropout) if dropout > 0: ctx.save_for_backward(mask) return input.masked_fill(mask, 0) * ctx.scale else: return input @staticmethod def backward(ctx, grad_output): if ctx.scale > 1: (mask,) = ctx.saved_tensors return grad_output.masked_fill(mask, 0) * ctx.scale, None else: return grad_output, None class StableDropout(nn.Module): """ Optimized dropout module for stabilizing the training Args: drop_prob (float): the dropout probabilities """ def __init__(self, drop_prob): super().__init__() self.drop_prob = drop_prob self.count = 0 self.context_stack = None def forward(self, x): """ Call the module Args: x (:obj:`torch.tensor`): The input tensor to apply dropout """ if self.training and self.drop_prob > 0: return XDropout.apply(x, self.get_context()) return x def clear_context(self): self.count = 0 self.context_stack = None def init_context(self, reuse_mask=True, scale=1): if self.context_stack is None: self.context_stack = [] self.count = 0 for c in self.context_stack: c.reuse_mask = reuse_mask c.scale = scale def get_context(self): if self.context_stack is not None: if self.count >= len(self.context_stack): self.context_stack.append(DropoutContext()) ctx = self.context_stack[self.count] ctx.dropout = self.drop_prob self.count += 1 return ctx else: return self.drop_prob class DebertaForSequenceClassification(DebertaPreTrainedModel): def __init__(self, config): super().__init__(config) num_labels = getattr(config, "num_labels", 2) self.num_labels = num_labels self.deberta = DebertaModel(config) self.classifier = nn.Linear(config.hidden_size, num_labels) drop_out = getattr(config, "cls_dropout", None) drop_out = self.config.hidden_dropout_prob if drop_out is None else drop_out self.dropout = nn.Dropout(drop_out) 
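        # Editorial note (context, not part of the original file): this local
        # DebertaForSequenceClassification mirrors the Hugging Face implementation
        # but keeps `self.deberta` as a plain attribute with a dropout + Linear
        # head on top, so experiments/scotus.py can replace `model.deberta` with
        # the HierarchicalBert wrapper (models/hierbert.py) when the
        # --hierarchical flag is enabled.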
self.init_weights() def get_input_embeddings(self): return self.deberta.get_input_embeddings() def set_input_embeddings(self, new_embeddings): self.deberta.set_input_embeddings(new_embeddings) def forward( self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None, ): r""" labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ..., config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). """ return_dict = return_dict if return_dict is not None else self.config.use_return_dict outputs = self.deberta( input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, position_ids=position_ids, inputs_embeds=inputs_embeds, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, ) pooled_output = self.dropout(outputs[1]) logits = self.classifier(pooled_output) loss = None if labels is not None: if self.num_labels == 1: # regression task loss_fn = nn.MSELoss() logits = logits.view(-1).to(labels.dtype) loss = loss_fn(logits, labels.view(-1)) elif labels.dim() == 1 or labels.size(-1) == 1: label_index = (labels >= 0).nonzero() labels = labels.long() if label_index.size(0) > 0: labeled_logits = torch.gather(logits, 0, label_index.expand(label_index.size(0), logits.size(1))) labels = torch.gather(labels, 0, label_index.view(-1)) loss_fct = nn.CrossEntropyLoss() loss = loss_fct(labeled_logits.view(-1, self.num_labels).float(), labels.view(-1)) else: loss = torch.tensor(0).to(logits) else: log_softmax = nn.LogSoftmax(-1) loss = -((log_softmax(logits) * labels).sum(-1)).mean() if not return_dict: output = (logits,) + outputs[1:] return ((loss,) + output) if loss is not None else output else: return SequenceClassifierOutput( loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions, ) class DebertaForMultipleChoice(DebertaPreTrainedModel): def __init__(self, config): super().__init__(config) self.deberta = DebertaModel(config) self.pooler = ContextPooler(config) output_dim = self.pooler.output_dim drop_out = getattr(config, "cls_dropout", None) drop_out = self.config.hidden_dropout_prob if drop_out is None else drop_out self.dropout = StableDropout(drop_out) self.classifier = nn.Linear(output_dim, 1) self.init_weights() def forward( self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None, ): r""" labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): Labels for computing the multiple choice classification loss. Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. 
(See :obj:`input_ids` above) """ return_dict = return_dict if return_dict is not None else self.config.use_return_dict num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None inputs_embeds = ( inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1)) if inputs_embeds is not None else None ) outputs = self.deberta( input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, position_ids=position_ids, inputs_embeds=inputs_embeds, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, ) encoder_layer = outputs[0] pooled_output = self.pooler(encoder_layer) pooled_output = self.dropout(pooled_output) logits = self.classifier(pooled_output) reshaped_logits = logits.view(-1, num_choices) loss = None if labels is not None: loss_fct = nn.CrossEntropyLoss() loss = loss_fct(reshaped_logits, labels) if not return_dict: output = (reshaped_logits,) + outputs[2:] return ((loss,) + output) if loss is not None else output return MultipleChoiceModelOutput( loss=loss, logits=reshaped_logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions, ) ================================================ FILE: models/hierbert.py ================================================ from dataclasses import dataclass from typing import Optional, Tuple import torch import numpy as np from torch import nn from transformers.file_utils import ModelOutput @dataclass class SimpleOutput(ModelOutput): last_hidden_state: torch.FloatTensor = None past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None hidden_states: Optional[Tuple[torch.FloatTensor]] = None attentions: Optional[Tuple[torch.FloatTensor]] = None cross_attentions: Optional[Tuple[torch.FloatTensor]] = None def sinusoidal_init(num_embeddings: int, embedding_dim: int): # keep dim 0 for padding token position encoding zero vector position_enc = np.array([ [pos / np.power(10000, 2 * i / embedding_dim) for i in range(embedding_dim)] if pos != 0 else np.zeros(embedding_dim) for pos in range(num_embeddings)]) position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2]) # dim 2i position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2]) # dim 2i+1 return torch.from_numpy(position_enc).type(torch.FloatTensor) class HierarchicalBert(nn.Module): def __init__(self, encoder, max_segments=64, max_segment_length=128): super(HierarchicalBert, self).__init__() supported_models = ['bert', 'roberta', 'deberta'] assert encoder.config.model_type in supported_models # other model types are not supported so far # Pre-trained segment (token-wise) encoder, e.g., BERT self.encoder = encoder # Specs for the segment-wise encoder self.hidden_size = encoder.config.hidden_size self.max_segments = max_segments self.max_segment_length = max_segment_length # Init sinusoidal positional embeddings self.seg_pos_embeddings = nn.Embedding(max_segments + 1, encoder.config.hidden_size, padding_idx=0, _weight=sinusoidal_init(max_segments + 1, encoder.config.hidden_size)) # Init segment-wise transformer-based encoder self.seg_encoder = nn.Transformer(d_model=encoder.config.hidden_size, 
nhead=encoder.config.num_attention_heads, batch_first=True, dim_feedforward=encoder.config.intermediate_size, activation=encoder.config.hidden_act, dropout=encoder.config.hidden_dropout_prob, layer_norm_eps=encoder.config.layer_norm_eps, num_encoder_layers=2, num_decoder_layers=0).encoder def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None, ): # Hypothetical Example # Batch of 4 documents: (batch_size, n_segments, max_segment_length) --> (4, 64, 128) # BERT-BASE encoder: 768 hidden units # Squash samples and segments into a single axis (batch_size * n_segments, max_segment_length) --> (256, 128) input_ids_reshape = input_ids.contiguous().view(-1, input_ids.size(-1)) attention_mask_reshape = attention_mask.contiguous().view(-1, attention_mask.size(-1)) if token_type_ids is not None: token_type_ids_reshape = token_type_ids.contiguous().view(-1, token_type_ids.size(-1)) else: token_type_ids_reshape = None # Encode segments with BERT --> (256, 128, 768) encoder_outputs = self.encoder(input_ids=input_ids_reshape, attention_mask=attention_mask_reshape, token_type_ids=token_type_ids_reshape)[0] # Reshape back to (batch_size, n_segments, max_segment_length, output_size) --> (4, 64, 128, 768) encoder_outputs = encoder_outputs.contiguous().view(input_ids.size(0), self.max_segments, self.max_segment_length, self.hidden_size) # Gather CLS outputs per segment --> (4, 64, 768) encoder_outputs = encoder_outputs[:, :, 0] # Infer real segments, i.e., mask paddings seg_mask = (torch.sum(input_ids, 2) != 0).to(input_ids.dtype) # Infer and collect segment positional embeddings seg_positions = torch.arange(1, self.max_segments + 1).to(input_ids.device) * seg_mask # Add segment positional embeddings to segment inputs encoder_outputs += self.seg_pos_embeddings(seg_positions) # Encode segments with segment-wise transformer seg_encoder_outputs = self.seg_encoder(encoder_outputs) # Collect document representation outputs, _ = torch.max(seg_encoder_outputs, 1) return SimpleOutput(last_hidden_state=outputs, hidden_states=outputs) if __name__ == "__main__": from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') # Use as a stand-alone encoder bert = AutoModel.from_pretrained('bert-base-uncased') model = HierarchicalBert(encoder=bert, max_segments=64, max_segment_length=128) fake_inputs = {'input_ids': [], 'attention_mask': [], 'token_type_ids': []} for i in range(4): # Tokenize segment temp_inputs = tokenizer(['dog ' * 126] * 64) fake_inputs['input_ids'].append(temp_inputs['input_ids']) fake_inputs['attention_mask'].append(temp_inputs['attention_mask']) fake_inputs['token_type_ids'].append(temp_inputs['token_type_ids']) fake_inputs['input_ids'] = torch.as_tensor(fake_inputs['input_ids']) fake_inputs['attention_mask'] = torch.as_tensor(fake_inputs['attention_mask']) fake_inputs['token_type_ids'] = torch.as_tensor(fake_inputs['token_type_ids']) output = model(fake_inputs['input_ids'], fake_inputs['attention_mask'], fake_inputs['token_type_ids']) # 4 document representations of 768 features are expected assert output[0].shape == torch.Size([4, 768]) # Use with HuggingFace AutoModelForSequenceClassification and Trainer API # Init Classifier model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=10) # Replace flat BERT encoder with 
hierarchical BERT encoder
    model.bert = HierarchicalBert(encoder=model.bert, max_segments=64, max_segment_length=128)
    output = model(fake_inputs['input_ids'], fake_inputs['attention_mask'], fake_inputs['token_type_ids'])
    # 4 document outputs with 10 (num_labels) logits are expected
    assert output.logits.shape == torch.Size([4, 10])


================================================
FILE: models/tfidf_svm.py
================================================
import pandas
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.model_selection import PredefinedSplit
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from datasets import load_dataset
import logging
import os
import argparse

dataset_n_classes = {'ecthr_a': 10, 'ecthr_b': 10, 'scotus': 14, 'eurlex': 100,
                     'ledgar': 100, 'unfair_tos': 8, 'case_hold': 5}


def main():
    parser = argparse.ArgumentParser()

    # Required arguments
    parser.add_argument('--dataset', default='case_hold', type=str)
    parser.add_argument('--task_type', default='multi_class', type=str)
    parser.add_argument('--text_limit', default=-1, type=int)
    config = parser.parse_args()

    n_classes = dataset_n_classes[config.dataset]

    if not os.path.exists(f'logs/{config.dataset}'):
        if not os.path.exists(f'logs'):
            os.mkdir(f'logs')
        os.mkdir(f'logs/{config.dataset}')

    handlers = [logging.FileHandler(f'logs/{config.dataset}_svm.txt'), logging.StreamHandler()]
    logging.basicConfig(handlers=handlers, level=logging.INFO)

    def get_text(dataset):
        if 'ecthr' in config.dataset:
            texts = [' '.join(text) for text in dataset['text']]
            return [' '.join(text.split()[:config.text_limit]) for text in texts]
        elif config.dataset == 'case_hold':
            data = [[context] + endings for context, endings in zip(dataset['context'], dataset['endings'])]
            # Column names must match the option fields extracted by the FeatureUnion below
            return pd.DataFrame(data=data,
                                columns=['context', 'option_1', 'option_2', 'option_3', 'option_4', 'option_5'])
        else:
            return [' '.join(text.split()[:config.text_limit]) for text in dataset['text']]

    def get_labels(dataset, mlb=None):
        if config.task_type == 'multi_class':
            return dataset['label']
        else:
            return mlb.transform(dataset['labels']).tolist()

    def add_zero_class(labels):
        augmented_labels = np.zeros((len(labels), len(labels[0]) + 1), dtype=np.int32)
        augmented_labels[:, :-1] = labels
        augmented_labels[:, -1] = (np.sum(labels, axis=1) == 0).astype('int32')
        return augmented_labels

    scores = {'micro-f1': [], 'macro-f1': []}
    dataset = load_dataset('lex_glue', config.dataset)
    for seed in range(1, 6):
        if config.task_type == 'multi_label':
            classifier = OneVsRestClassifier(LinearSVC(random_state=seed, max_iter=50000))
            parameters = {
                'vect__max_features': [10000, 20000, 40000],
                'clf__estimator__C': [0.1, 1, 10],
                'clf__estimator__loss': ('hinge', 'squared_hinge')
            }
        elif config.dataset == 'case_hold':
            classifier = LinearSVC(random_state=seed, max_iter=50000)
            parameters = {
                'clf__C': [0.1, 1, 10],
                'clf__loss': ('hinge', 'squared_hinge')
            }
        else:
            classifier = LinearSVC(random_state=seed, max_iter=50000)
            parameters = {
                'vect__max_features': [10000, 20000, 40000],
                'clf__C': [0.1, 1, 10],
                'clf__loss': ('hinge', 'squared_hinge')
            }

        # Init Pipeline (TF-IDF, SVM)
        if config.dataset == 'case_hold':
            text_clf = Pipeline([
('union', FeatureUnion([('context_tfidf', Pipeline([('extract_field', FunctionTransformer(lambda x: x['context'], validate=False)), ('vect', CountVectorizer(stop_words=stopwords.words('english'), ngram_range=(1, 3), min_df=5, max_features=40000)), ('tfidf', TfidfTransformer())]))] + [(f'option_{idx}_tfidf', Pipeline([('extract_field', FunctionTransformer(lambda x: x[f'option_{idx}'], validate=False)), ('vect', CountVectorizer(stop_words=stopwords.words('english'), ngram_range=(1, 3), min_df=5, max_features=40000)), ('tfidf', TfidfTransformer())])) for idx in range(1, 6)] )), ('clf', classifier) ]) else: text_clf = Pipeline([('vect', CountVectorizer(stop_words=stopwords.words('english'), ngram_range=(1, 3), min_df=5)), ('tfidf', TfidfTransformer()), ('clf', classifier), ]) # Fixate Validation Split split_index = [-1] * len(dataset['train']) + [0] * len(dataset['validation']) val_split = PredefinedSplit(test_fold=split_index) gs_clf = GridSearchCV(text_clf, parameters, cv=val_split, n_jobs=32, verbose=4, refit = False) # Pre-process inputs, outputs x_train = get_text(dataset['train']) x_val = get_text(dataset['validation']) x_train_val = pd.concat([x_train, x_val]) if config.task_type == 'multi_label': mlb = MultiLabelBinarizer(classes=range(n_classes)) mlb.fit(dataset['train']['labels']) else: mlb = None y_train = get_labels(dataset['train'], mlb) y_val = get_labels(dataset['validation'], mlb) y_train_val = y_train + y_val # Train classifier gs_clf = gs_clf.fit(x_train_val, y_train_val) # Print best hyper-parameters logging.info('Best Parameters:') for param_name in sorted(parameters.keys()): logging.info("%s: %r" % (param_name, gs_clf.best_params_[param_name])) # Retrain model with best CV parameters only with train data text_clf.set_params(**gs_clf.best_params_) gs_clf = text_clf.fit(x_train, y_train) # Report results logging.info('VALIDATION RESULTS:') y_pred = gs_clf.predict(get_text(dataset['validation'])) y_true = get_labels(dataset["validation"], mlb) if config.task_type == 'multi_label' and config.dataset != 'eurlex': y_true = add_zero_class(y_true) y_pred = add_zero_class(y_pred) logging.info(f'Micro-F1: {metrics.f1_score(y_true, y_pred, average="micro")*100:.1f}') logging.info(f'Macro-F1: {metrics.f1_score(y_true, y_pred, average="macro")*100:.1f}') logging.info('TEST RESULTS:') y_pred = gs_clf.predict(get_text(dataset['test'])) y_true = get_labels(dataset["test"], mlb) if config.task_type == 'multi_label' and config.dataset != 'eurlex': y_true = add_zero_class(y_true) y_pred = add_zero_class(y_pred) logging.info(f'Micro-F1: {metrics.f1_score(y_true, y_pred, average="micro")*100:.1f}') logging.info(f'Macro-F1: {metrics.f1_score(y_true, y_pred, average="macro")*100:.1f}') scores['micro-f1'].append(metrics.f1_score(y_true, y_pred, average="micro")) scores['macro-f1'].append(metrics.f1_score(y_true, y_pred, average="macro")) # Report averaged results across runs logging.info('-' * 100) logging.info(f'Micro-F1: {np.mean(scores["micro-f1"])*100:.1f} +/- {np.std(scores["micro-f1"])*100:.1f}\t' f'Macro-F1: {np.mean(scores["macro-f1"])*100:.1f} +/- {np.std(scores["macro-f1"])*100:.1f}') if __name__ == '__main__': main() ================================================ FILE: requirements.txt ================================================ torch>=1.9.0 transformers>=4.9.0 scikit-learn>=0.24.1 tqdm>=4.61.1 numpy>=1.20.1 datasets>=1.18.1 nltk>=3.5 scipy>=1.6.3 ================================================ FILE: scripts/run_case_hold.sh ================================================ 
GPU_NUMBER=0 MODEL_NAME='bert-base-uncased' BATCH_SIZE=8 ACCUMULATION_STEPS=1 TASK='case_hold' CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/case_hold.py --task_name ${TASK} --model_name_or_path ${MODEL_NAME} --output_dir logs/${TASK}/${MODEL_NAME}/seed_1 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 1 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/case_hold.py --task_name ${TASK} --model_name_or_path ${MODEL_NAME} --output_dir logs/${TASK}/${MODEL_NAME}/seed_2 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 2 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/case_hold.py --task_name ${TASK} --model_name_or_path ${MODEL_NAME} --output_dir logs/${TASK}/${MODEL_NAME}/seed_3 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 3 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/case_hold.py --task_name ${TASK} --model_name_or_path ${MODEL_NAME} --output_dir logs/${TASK}/${MODEL_NAME}/seed_4 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 4 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/case_hold.py --task_name ${TASK} --model_name_or_path ${MODEL_NAME} --output_dir logs/${TASK}/${MODEL_NAME}/seed_5 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 5 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} python statistics/compute_avg_scores.py --dataset ${TASK} ================================================ FILE: scripts/run_ecthr.sh ================================================ GPU_NUMBER=0 MODEL_NAME='bert-base-uncased' LOWER_CASE='True' BATCH_SIZE=2 
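# Effective train batch size = BATCH_SIZE * ACCUMULATION_STEPS = 2 * 4 = 8 (ACCUMULATION_STEPS is set below), i.e. the same effective batch size (8) used across the other task scripts.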
ACCUMULATION_STEPS=4 TASK='ecthr_a' CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ecthr.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --task ${TASK} --output_dir logs/${TASK}/${MODEL_NAME}/seed_1 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 1 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ecthr.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --task ${TASK} --output_dir logs/${TASK}/${MODEL_NAME}/seed_2 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 2 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ecthr.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --task ${TASK} --output_dir logs/${TASK}/${MODEL_NAME}/seed_3 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 3 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ecthr.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --task ${TASK} --output_dir logs/${TASK}/${MODEL_NAME}/seed_4 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 4 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ecthr.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --task ${TASK} --output_dir logs/${TASK}/${MODEL_NAME}/seed_5 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 5 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} ================================================ FILE: scripts/run_eurlex.sh ================================================ GPU_NUMBER=6 MODEL_NAME='bert-base-uncased' LOWER_CASE='True' BATCH_SIZE=8 
ACCUMULATION_STEPS=1 TASK='eurlex' CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/eurlex.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_1 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 2 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 1 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/eurlex.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_2 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 2 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 2 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/eurlex.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_3 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 2 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 3 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/eurlex.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_4 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 2 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 4 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/eurlex.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_5 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 2 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 5 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} python statistics/compute_avg_scores.py --dataset ${TASK} ================================================ FILE: scripts/run_ledgar.sh ================================================ GPU_NUMBER=0 MODEL_NAME='bert-base-uncased' LOWER_CASE='True' BATCH_SIZE=8 ACCUMULATION_STEPS=1 
TASK='ledgar' CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ledgar.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_1 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 1 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ledgar.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_2 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 2 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ledgar.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_3 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 3 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ledgar.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_4 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 4 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/ledgar.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_5 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 5 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} python statistics/compute_avg_scores.py --dataset ${TASK} ================================================ FILE: scripts/run_scotus.sh ================================================ GPU_NUMBER=0 MODEL_NAME='bert-base-uncased' LOWER_CASE='True' BATCH_SIZE=2 ACCUMULATION_STEPS=4 TASK='scotus' 
CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/scotus.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_1 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 1 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/scotus.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_2 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 2 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/scotus.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_3 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 3 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/scotus.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_4 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 4 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/scotus.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_5 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 5 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS} python statistics/compute_avg_scores.py --dataset ${TASK} ================================================ FILE: scripts/run_tfidf_svm.sh ================================================ DATASET='eurlex' TASK_TYPE='multi_label' N_CLASSES=100 python models/tfidf_svm.py --dataset ${DATASET} --task_type ${TASK_TYPE} 
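# NOTE: models/tfidf_svm.py defines only --dataset, --task_type and --text_limit; the number of classes is taken from its internal dataset_n_classes map, so no class-count flag is passed (N_CLASSES above is kept for reference only).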
================================================
FILE: scripts/run_unfair_tos.sh
================================================
GPU_NUMBER=0
MODEL_NAME='bert-base-uncased'
LOWER_CASE='True'
BATCH_SIZE=8
ACCUMULATION_STEPS=1
TASK='unfair_tos'

CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/unfair_tos.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_1 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 1 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS}

CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/unfair_tos.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_2 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 2 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS}

CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/unfair_tos.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_3 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 3 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS}

CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/unfair_tos.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_4 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 4 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS}

CUDA_VISIBLE_DEVICES=${GPU_NUMBER} python experiments/unfair_tos.py --model_name_or_path ${MODEL_NAME} --do_lower_case ${LOWER_CASE} --output_dir logs/${TASK}/${MODEL_NAME}/seed_5 --do_train --do_eval --do_pred --overwrite_output_dir --load_best_model_at_end --metric_for_best_model micro-f1 --greater_is_better True --evaluation_strategy epoch --save_strategy epoch --save_total_limit 5 --num_train_epochs 20 --learning_rate 3e-5 --per_device_train_batch_size ${BATCH_SIZE} --per_device_eval_batch_size ${BATCH_SIZE} --seed 5 --fp16 --fp16_full_eval --gradient_accumulation_steps ${ACCUMULATION_STEPS} --eval_accumulation_steps ${ACCUMULATION_STEPS}

python
statistics/compute_avg_scores.py --dataset ${TASK} ================================================ FILE: statistics/compute_avg_lexglue_scores.py ================================================ import copy import json import os import argparse import numpy as np from scipy.stats import hmean, gmean def main(): ''' set default hyperparams in default_hyperparams.py ''' parser = argparse.ArgumentParser() # Required arguments parser.add_argument('--filter_outliers', default=True) parser.add_argument('--top_k', default=1) config = parser.parse_args() MODELS = ['bert-base-uncased', 'roberta-base', 'microsoft/deberta-base', 'allenai/longformer-base-4096', 'google/bigbird-roberta-base', 'nlpaueb/legal-bert-base-uncased', 'zlucia/custom-legalbert', 'roberta-large'] DATASETS = ['ecthr_a', 'ecthr_b', 'eurlex', 'scotus', 'ledgar', 'unfair_tos', 'casehold'] MODEL_NAMES = ['BERT', 'RoBERTa', 'DeBERTa', 'Longformer', 'BigBird', 'Legal-BERT', 'CaseLaw-BERT', 'RoBERTa'] score_dicts = {model: {'dev': {'micro': [], 'macro': []}, 'test': {'micro': [], 'macro': []}} for model in MODELS} for model in MODELS: for dataset in DATASETS: BASE_DIR = f'/Users/rwg642/Desktop/LEXGLUE/RESULTS/{dataset}' score_dict = {'dev': {'micro': [], 'macro': []}, 'test': {'micro': [], 'macro': []}} for seed in range(1, 6): try: seed = f'seed_{seed}' with open(os.path.join(BASE_DIR, model, seed, 'all_results.json')) as json_file: json_data = json.load(json_file) score_dict['dev']['micro'].append(float(json_data['eval_micro-f1'])) score_dict['dev']['macro'].append(float(json_data['eval_macro-f1'])) score_dict['test']['micro'].append(float(json_data['predict_micro-f1'])) score_dict['test']['macro'].append(float(json_data['predict_macro-f1'])) except: continue temp_stats = copy.deepcopy(score_dict) if config.filter_outliers: seed_scores = [(idx, score) for (idx, score) in enumerate(score_dict['dev']['macro'])] sorted_scores = sorted(seed_scores, key=lambda tup: tup[1], reverse=True) top_k_ids = [idx for idx, score in sorted_scores[:config.top_k]] for subset in ['dev', 'test']: temp_stats[subset]['micro'] = [score for idx, score in enumerate(score_dict[subset]['micro']) if idx in top_k_ids] temp_stats[subset]['macro'] = [score for idx, score in enumerate(score_dict[subset]['macro']) if idx in top_k_ids] for subset in ['dev', 'test']: for avg in ['micro', 'macro']: score_dicts[model][subset][avg].append(np.mean(temp_stats[subset][avg])) print('-' * 253) print(f'{"MODEL NAME":>35} | {"A-MEAN":<33} | {"H-MEAN":<33} | {"G-MEAN":<33} |') print('-' * 253) for idx, (method, stats) in enumerate(score_dicts.items()): algo_means = {'dev': {'micro': [0.0, 0.0, 0.0], 'macro': [0.0, 0.0, 0.0]}, 'test': {'micro': [0.0, 0.0, 0.0], 'macro': [0.0, 0.0, 0.0]}} for subset in ['dev', 'test']: for avg in ['micro', 'macro']: algo_means[subset][avg][0] = np.mean(stats[subset][avg]) algo_means[subset][avg][1] = hmean(stats[subset][avg]) algo_means[subset][avg][2] = gmean(stats[subset][avg]) report_line = f'{MODEL_NAMES[idx]}' for task_idx in range(3): report_line += f' {algo_means["test"]["micro"][task_idx] * 100:.1f} / ' report_line += f' {algo_means["test"]["macro"][task_idx] * 100:.1f} ' report_line += '' print(report_line) if __name__ == '__main__': main() ================================================ FILE: statistics/compute_avg_scores.py ================================================ import copy import json import os import argparse import numpy as np import warnings warnings.filterwarnings('ignore') def main(): ''' set default hyperparams in 
default_hyperparams.py ''' parser = argparse.ArgumentParser() # Required arguments parser.add_argument('--dataset', default='scotus') parser.add_argument('--filter_outliers', default=True) parser.add_argument('--top_k', default=3) config = parser.parse_args() BASE_DIR = f'logs/{config.dataset}' if os.path.exists(BASE_DIR): print(f'{BASE_DIR} exists!') score_dicts = {} MODELS = ['bert-base-uncased', 'roberta-base', 'microsoft/deberta-base', 'allenai/longformer-base-4096', 'google/bigbird-roberta-base', 'nlpaueb/legal-bert-base-uncased', 'zlucia/custom-legalbert', 'roberta-large'] for model in MODELS: score_dict = {'dev': {'micro': [], 'macro': []}, 'test': {'micro': [], 'macro': []}} for seed in range(1, 6): seed = f'seed_{seed}' try: with open(os.path.join(BASE_DIR, model, seed, 'all_results.json')) as json_file: json_data = json.load(json_file) score_dict['dev']['micro'].append(float(json_data['eval_micro-f1'])) score_dict['dev']['macro'].append(float(json_data['eval_macro-f1'])) score_dict['test']['micro'].append(float(json_data['predict_micro-f1'])) score_dict['test']['macro'].append(float(json_data['predict_macro-f1'])) except: continue score_dicts[model] = score_dict print(f'{" " * 36} {"VALIDATION":<47} | {"TEST"}') print('-' * 200) for algo, stats in score_dicts.items(): temp_stats = copy.deepcopy(stats) if config.filter_outliers: seed_scores = [(idx, score) for (idx, score) in enumerate(stats['dev']['macro'])] sorted_scores = sorted(seed_scores, key=lambda tup: tup[1], reverse=True) top_k_ids = [idx for idx, score in sorted_scores[:config.top_k]] temp_stats['dev']['micro'] = [score for idx, score in enumerate(stats['dev']['micro']) if idx in top_k_ids] temp_stats['dev']['macro'] = [score for idx, score in enumerate(stats['dev']['macro']) if idx in top_k_ids] temp_stats['test']['micro'] = [score for idx, score in enumerate(stats['test']['micro']) if idx in top_k_ids[:1]] temp_stats['test']['macro'] = [score for idx, score in enumerate(stats['test']['macro']) if idx in top_k_ids[:1]] report_line = f'{algo:>35}: MICRO-F1: {np.mean(temp_stats["dev"]["micro"])*100:.1f}\t ± {np.std(temp_stats["dev"]["micro"])*100:.1f}\t' report_line += f'MACRO-F1: {np.mean(temp_stats["dev"]["macro"])*100:.1f}\t ± {np.std(temp_stats["dev"]["macro"])*100:.1f}\t' report_line += ' | ' report_line += f'MICRO-F1: {np.mean(temp_stats["test"]["micro"])*100:.1f}\t' report_line += f'MACRO-F1: {np.mean(temp_stats["test"]["macro"])*100:.1f}\t' print(report_line) if __name__ == '__main__': main() ================================================ FILE: statistics/compute_lexglue_scores.py ================================================ import copy import json import os import argparse import numpy as np def main(): ''' set default hyperparams in default_hyperparams.py ''' parser = argparse.ArgumentParser() # Required arguments parser.add_argument('--filter_outliers', default=True) parser.add_argument('--top_k', default=1) config = parser.parse_args() MODELS = ['bert-base-uncased', 'roberta-base', 'microsoft/deberta-base', 'allenai/longformer-base-4096', 'google/bigbird-roberta-base', 'nlpaueb/legal-bert-base-uncased', 'zlucia/custom-legalbert'] DATASETS = ['ecthr_a', 'ecthr_b', 'eurlex', 'scotus', 'ledgar', 'unfair_tos', 'casehold'] MODEL_NAMES = ['BERT', 'RoBERTa', 'DeBERTa', 'Longformer', 'BigBird', 'Legal-BERT', 'CaseLaw-BERT'] score_dicts = {model: {'dev': {'micro': [], 'macro': []}, 'test': {'micro': [], 'macro': []}} for model in MODELS} for model in MODELS: for dataset in DATASETS: BASE_DIR = f'logs/{dataset}' 
score_dict = {'dev': {'micro': [], 'macro': []}, 'test': {'micro': [], 'macro': []}} for seed in range(1, 6): try: seed = f'seed_{seed}' with open(os.path.join(BASE_DIR, model, seed, 'all_results.json')) as json_file: json_data = json.load(json_file) score_dict['dev']['micro'].append(float(json_data['eval_micro-f1'])) score_dict['dev']['macro'].append(float(json_data['eval_macro-f1'])) score_dict['test']['micro'].append(float(json_data['predict_micro-f1'])) score_dict['test']['macro'].append(float(json_data['predict_macro-f1'])) except: continue temp_stats = copy.deepcopy(score_dict) if config.filter_outliers: seed_scores = [(idx, score) for (idx, score) in enumerate(score_dict['dev']['macro'])] sorted_scores = sorted(seed_scores, key=lambda tup: tup[1], reverse=True) top_k_ids = [idx for idx, score in sorted_scores[:config.top_k]] for subset in ['dev', 'test']: temp_stats[subset]['micro'] = [score for idx, score in enumerate(score_dict[subset]['micro']) if idx in top_k_ids] temp_stats[subset]['macro'] = [score for idx, score in enumerate(score_dict[subset]['macro']) if idx in top_k_ids] for subset in ['dev', 'test']: for avg in ['micro', 'macro']: score_dicts[model][subset][avg].append(np.mean(temp_stats[subset][avg])) print('-' * 253) print(f'{"DATASET":>35} & ', ' & '.join([f"{dataset}" for dataset in DATASETS]).upper(), ' \\\\') print('-' * 253) for idx, (method, stats) in enumerate(score_dicts.items()): report_line = f'{MODEL_NAMES[idx]} ' for task_idx in range(len(DATASETS)): report_line += f' {stats["test"]["micro"][task_idx] * 100:.1f} / ' report_line += f' {stats["test"]["macro"][task_idx] * 100:.1f} ' # report_line += '' if task_idx == len(DATASETS) - 1 else '&' report_line += '' print(report_line) # print('-' * 253) if __name__ == '__main__': main() ================================================ FILE: statistics/report_model_results.py ================================================ import json import os import argparse def main(): ''' set default hyperparams in default_hyperparams.py ''' parser = argparse.ArgumentParser() # Required arguments parser.add_argument('--model', default='roberta-base') config = parser.parse_args() MODEL = config.model TASKS = ['ecthr_a', 'ecthr_b', 'scotus', 'eurlex', 'ledgar', 'unfair_tos', 'case_hold'] for task in TASKS: print('-' * 100) print(task.upper()) print('-' * 100) BASE_DIR = f'logs/{task}' print(f'{" " * 10} | {"VALIDATION":<40} | {"TEST":<40}') print('-' * 100) for seed in range(1, 6): seed = f'seed_{seed}' try: with open(os.path.join(BASE_DIR, MODEL, seed, 'all_results.json')) as json_file: json_data = json.load(json_file) dev_micro_f1 = float(json_data['eval_micro-f1']) dev_macro_f1 = float(json_data['eval_macro-f1']) test_micro_f1 = float(json_data['predict_micro-f1']) test_macro_f1 = float(json_data['predict_macro-f1']) epoch = float(json_data['epoch']) report_line = f'EPOCH: {epoch: 2.1f} | ' report_line += f'MICRO-F1: {dev_micro_f1 * 100:.1f}\t' report_line += f'MACRO-F1: {dev_macro_f1 * 100:.1f}\t' report_line += ' | ' report_line += f'MICRO-F1: {test_micro_f1 * 100:.1f}\t' report_line += f'MACRO-F1: {test_macro_f1 * 100:.1f}\t' print(report_line) except: continue if __name__ == '__main__': main() ================================================ FILE: statistics/report_train_time.py ================================================ import json import os import numpy as np import datetime def main(): for dataset in ['ecthr_a', 'ecthr_b', 'scotus', 'eurlex', 'ledgar', 'unfair_tos']: print(f'{dataset.upper()}') print('-'*100) 
BASE_DIR = f'logs/{dataset}' score_dicts = {} MODELS = ['bert-base-uncased', 'roberta-base', 'microsoft/deberta-base', 'nlpaueb/legal-bert-base-uncased', 'zlucia/custom-legalbert', 'allenai/longformer-base-4096', 'google/bigbird-roberta-base'] for model in MODELS: score_dict = {'time': [], 'epochs': [], 'time/epoch': []} for seed in range(1, 6): seed = f'seed_{seed}' try: with open(os.path.join(BASE_DIR, model, seed, 'trainer_state.json')) as json_file: json_data = json.load(json_file) score_dict['time'].append(json_data['log_history'][-1]['train_runtime']) score_dict['epochs'].append(json_data['log_history'][-1]['epoch']) score_dict['time/epoch'].append(json_data['log_history'][-1]['train_runtime']/json_data['log_history'][-1]['epoch']) except: continue score_dicts[model] = score_dict for algo, stats in score_dicts.items(): total_time = np.mean(stats["time"]) time_epoch = np.mean(stats["time/epoch"]) print(f'{algo:>35}: TRAIN TIME: {str(datetime.timedelta(seconds=total_time)).split(".")[0]}\t ' f'TIME/EPOCH: {str(datetime.timedelta(seconds=time_epoch)).split(".")[0]}\t' f' EPOCHS: {np.mean(stats["epochs"]):.1f} ± {np.std(stats["epochs"]):.1f}') if __name__ == '__main__': main() ================================================ FILE: utils/fix_casehold.py ================================================ import re import csv import numpy as np prompts = [] texts = [] with open('casehold_fixed.csv', "w", encoding="utf-8") as out_f: with open('casehold.csv', "r", encoding="utf-8") as f: for line in f.readlines(): # Eliminate broken records if not re.match('\d', line) or not re.match('.+\d\n$', line): continue else: # Discard samples that are extremely long if len(line) < 5000: out_f.write(line) # Reload cleansed data and count text with open('casehold_fixed.csv', "r", encoding="utf-8") as f: data = list(csv.reader(f))[1:] for idx, sample in enumerate(data): for choice in sample[2:7]: texts.append(sample[1] + ' ' + choice) # Compute approximate length per sample t_lengths = [len(text.split()) for text in texts] print(np.mean(t_lengths)) print(np.median(t_lengths)) ================================================ FILE: utils/load_hierbert.py ================================================ from models.hierbert import HierarchicalBert from transformers import AutoTokenizer, AutoModelForSequenceClassification import torch MODEL_PATH = '...' # Load Tokenizer tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH) # Load BERT base model model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH) # Transform BERT base model to Hierarchical BERT segment_encoder = model.bert model_encoder = HierarchicalBert(encoder=segment_encoder, max_segments=64, max_segment_length=128) model.bert = model_encoder # Load Hierarchical BERT model model_state_dict = torch.load(f'{MODEL_PATH}/pytorch_model.bin', map_location=torch.device('cpu')) model.load_state_dict(model_state_dict) # Pre-process text following the hierarchical 3D pre-processing # as described either in experiments/ecthr.py, or experiments/scotus.py inputs = ... 
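# Sketch (assumed, not from this repo): the hierarchical "3D" inputs expected here have shape
# (batch_size, max_segments, max_segment_length), e.g. (n_docs, 64, 128), as in the example at
# the bottom of models/hierbert.py: tokenize each document's segments with
# padding='max_length', truncation=True, max_length=128, pad the segment axis to 64 segments,
# and stack documents with torch.as_tensor(...).
# Note: a plain Hugging Face model exposes no .predict(); logits are usually obtained with
# `model(**inputs).logits` before applying a sigmoid or argmax.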
# Inference soft_predictions = model.predict(inputs) # Post-process predictions, e.g., sigmoid or argmax hard_predictions = torch.argmax(soft_predictions) ================================================ FILE: utils/preprocess_unfair_tos.py ================================================ import glob import json import re filenames = glob.glob('/Users/rwg642/Downloads/ToS/sentences/*.txt') data = {} total_sentence_count = 0 companies = [] for filename in filenames: with open(filename) as file: company = filename.split('/')[-1].split('.')[0] data[f'{company}'] = [] text = '' for line in file.readlines(): total_sentence_count += 1 data[f'{company}'].append( {'company': company, 'release_year': '-', 'labels': [], 'text': line.replace('-lrb-', '(').replace('-rrb-', ')')}) text += line + ' ' matches = re.findall('20[0-2][0-9]', text) if matches: date = matches[0] else: date = '-' companies.append((company, date)) print('All sentences: ', total_sentence_count) annotated_sentences = 0 for label_type, label_name in zip( ['Labels_A', 'Labels_CH', 'Labels_CR', 'Labels_J', 'Labels_LAW', 'Labels_LTD', 'Labels_TER', 'Labels_USE'], ['Arbitration', 'Unilateral change', 'Content removal', 'Jurisdiction', 'Choice of law', 'Limitation of liability', 'Unilateral termination', 'Contract by using']): filenames = glob.glob(f'/Users/rwg642/Downloads/ToS/{label_type}/*.txt') sentence_count = 0 for filename in filenames: company = filename.split('/')[-1].split('.')[0] with open(filename) as file: for idx, line in enumerate(file.readlines()): if line == '1\n': data[f'{company}'][idx]['labels'].append(label_name) sentence_count += 1 annotated_sentences += 1 print(f'{label_type}: ', sentence_count) print('Unannotated: ', total_sentence_count - annotated_sentences) companies = [('Tinder', '-'), ('Betterpoints_UK', '-'), ('Deliveroo', '-'), ('9gag', '-'), ('Booking', '-'), ('YouTube', '-'), ('Yahoo', '-'), ('TrueCaller', '-'), ('Skype', '2006'), ('WorldOfWarcraft', '2012'), ('Viber', '2013'), ('Microsoft', '2013'), ('Instagram', '2013'), ('Rovio', '2013'), ('Onavo', '2013'), ('Moves-app', '2014'), ('Syncme', '2014'), ('Google', '2014'), ('Facebook', '2015'), ('Vivino', '2015'), ('Atlas', '2015'), ('Dropbox', '2016'), ('musically', '2016'), ('Spotify', '2016'), ('Endomondo', '2016'), ('WhatsApp', '2016'), ('Zynga', '2016'), ('PokemonGo', '2016'), ('Masquerade', '2016'), ('Skyscanner', '2016'), ('Nintendo', '2017'), ('Airbnb', '2017'), ('Crowdtangle', '2017'), ('TripAdvisor', '2017'), ('Supercell', '2017'), ('Headspace', '2017'), ('Fitbit', '2017'), ('Vimeo', '2017'), ('Oculus', '2017'), ('LindenLab', '2017'), ('Academia', '2017'), ('Amazon', '2017'), ('Netflix', '2017'), ('Snap', '2017'), ('Twitter', '2017'), ('LinkedIn', '2017'), ('Duolingo', '2017'), ('Uber', '2017'), ('Evernote', '2017'), ('eBay', '2017')] with open('/Users/rwg642/PycharmProjects/LexGLUE/dataloaders/unfair_toc/unfair_toc.jsonl', 'w') as out_file: for company, year in companies[:30]: for record in data[f'{company}']: record['data_type'] = 'train' record['release_year'] = year out_file.write(json.dumps(record) + '\n') for company, year in companies[30:40]: for record in data[f'{company}']: record['data_type'] = 'val' record['release_year'] = year out_file.write(json.dumps(record) + '\n') for company, year in companies[40:]: for record in data[f'{company}']: record['data_type'] = 'test' record['release_year'] = year out_file.write(json.dumps(record) + '\n') print() # import numpy as np # import matplotlib.pyplot as plt # ecthr = [71.5, 17.0, 15.5, 18.5, 
14.7] # ledgar = [0.1, 81.1, 0.1, 81.1] # ecthr_mean = np.mean(ecthr) # ledgar_mean = np.mean(ledgar) # # ecthr_max = np.max(ecthr) # ledgar_max= np.max(ledgar) # # ecthr_std = np.std(ecthr) # ledgar_std = np.std(ledgar) # # # fig, ax = plt.subplots() # ax.bar(np.arange(2), [ecthr_max, ledgar_max], align='center', alpha=0.3, capsize=10) # ax.bar(np.arange(2), [ecthr_mean, ledgar_mean], yerr=[ecthr_std, ledgar_std], align='center', alpha=0.6, ecolor='black', capsize=10) # ax.set_ylabel('Macro-F1') # ax.set_xticks(np.arange(2)) # ax.set_xticklabels(['ECtHR (Task A)', 'LEDGAR']) # ax.yaxis.grid(True) # # # Save the figure and show # plt.tight_layout() # plt.show() ================================================ FILE: utils/subsample_ledgar.py ================================================ import json import random import tqdm from collections import Counter # NOTE: The dataset has been first enriched with metadata from SEC-EDGAR # to figure out the year of submission for the original filings. This # part is missing from the script. # Parse original (augmented) dataset categories = [] with open('ledgar.jsonl') as file: for line in tqdm.tqdm(file.readlines()): data = json.loads(line) categories.extend(data['labels']) # Find the top-100 labels. categories = set([label for label, count in Counter(categories).most_common()[:100]]) # Subsample examples labeled with one of the top-100 labels. with open('ledgar_small.jsonl', 'w') as out_file: with open('ledgar.jsonl') as file: for line in tqdm.tqdm(file.readlines()): data = json.loads(line) if set(data['labels']).intersection(categories): labels = set(data['labels']).intersection(categories) if len(labels) == 1: data['labels'] = sorted(list(labels)) data.pop('clause_types', None) out_file.write(json.dumps(data)+'\n') # Organize examples in clusters by year years = [] samples = {year: [] for year in ['2016', '2017', '2018', '2019']} with open('ledgar_small.jsonl') as file: for line in tqdm.tqdm(file.readlines()): data = json.loads(line) years.append(data['year']) data.pop('filer_cik', None) data.pop('filer_name', None) data.pop('filer_state', None) data.pop('filer_industry', None) samples[data['year']].append(data) # Write final dataset 60k/10k/10k random.seed(1) with open('ledgar.jsonl', 'w') as file: final_samples = random.sample(samples['2016'], 30000) final_samples += random.sample(samples['2017'], 30000) for sample in final_samples: sample['data_type'] = 'train' file.write(json.dumps(sample) + '\n') final_samples = random.sample(samples['2018'], 10000) for sample in final_samples: sample['data_type'] = 'dev' file.write(json.dumps(sample) + '\n') final_samples = random.sample(samples['2019'], 10000) for sample in final_samples: sample['data_type'] = 'test' file.write(json.dumps(sample) + '\n')
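# Optional sanity check (illustrative addition, not part of the original script): confirm the
# 60k/10k/10k split sizes written above; reuses the json import from the top of this file.
split_counts = {}
with open('ledgar.jsonl') as check_file:
    for line in check_file:
        split = json.loads(line)['data_type']
        split_counts[split] = split_counts.get(split, 0) + 1
print(split_counts)  # expected: {'train': 60000, 'dev': 10000, 'test': 10000}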