[
  {
    "path": ".gitignore",
    "content": "figures/\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\npip-wheel-metadata/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n.python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n\n# Slurm\n*.out\n\n# VSCODE\n.vscode\n"
  },
  {
    "path": "LICENSE",
    "content": "The Clear BSD License\n\nCopyright (c) 2023\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted (subject to the limitations in the disclaimer\nbelow) provided that the following conditions are met:\n\n     * Redistributions of source code must retain the above copyright notice,\n     this list of conditions and the following disclaimer.\n\n     * Redistributions in binary form must reproduce the above copyright\n     notice, this list of conditions and the following disclaimer in the\n     documentation and/or other materials provided with the distribution.\n\n     * Neither the name of the copyright holder nor the names of its\n     contributors may be used to endorse or promote products derived from this\n     software without specific prior written permission.\n\nNO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY\nTHIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND\nCONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT\nLIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A\nPARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR\nCONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,\nEXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,\nPROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR\nBUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER\nIN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)\nARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE\nPOSSIBILITY OF SUCH DAMAGE.\n"
  },
  {
    "path": "README.md",
    "content": "# Detecting Hallucinations in Large Language Models Using Semantic Entropy\n\nThis repository contains the code necessary to reproduce the short-phrase and sentence-length experiments of the Nature submission 'Detecting Hallucinations in Large Language Models Using Semantic Entropy'.\n\nThis repository builds on the original, now deprecated codebase for semantic uncertainty at [https://github.com/lorenzkuhn/semantic_uncertainty](https://github.com/lorenzkuhn/semantic_uncertainty).\n\n## System Requirements\n\nWe here discuss hardware and software system requirements.\n\n### Hardware Dependencies\n\nGenerally speaking, our experiments require modern computer hardware which is suited for usage with large language models (LLMs).\n\nRequirements regarding the system's CPU and RAM size are relatively modest: any reasonably modern system should suffice, e.g. a system with an Intel 10th generation CPU and 16 GB of system memory or better.\n\nMore importantly, all our experiments make use of one or more Graphics Processor Units (GPUs) to speed up LLM inference.\nWithout a GPU, it is not feasible to reproduce our results in a reasonable amount of time.\nThe particular GPU necessary depends on the choice of LLM: LLMs with more parameters require GPUs with more memory.\nFor smaller models (7B parameters), desktop GPUs such as the Nvidia TitanRTX (24 GB) are sufficient.\nFor larger models (13B), GPUs with more memory, such as the Nvidia A100 server GPU, are required.\nOur largest models with 70B parameters require the use of two Nvidia A100 GPUs (2x80GB) simultaneously.\n\nOne can reduce the precision to float16 or int8 to reduce memory requirements without significantly affecting model predictions and their accuracy.\nWe use float16 for 70B models by default, and int8 mode can be enabled for any model by suffixing the model name with `-int8`.\n\n\n### Software Dependencies\n\nOur code relies on Python 3.11 with PyTorch 2.1.\n\nOur systems run the Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-89-generic x86_64) operating system.\n\nIn [environment_export.yaml](environment_export.yaml) we list the exact versions for all Python packages used in our experiments.\nWe generally advise against trying to install from this exact export of our conda environment.\nPlease see below for installation instructions.\n\nAlthough we have not tested this, we would expect our code to be compatible with other operating systems, Python versions, and versions of the Python libraries that we use.\n\n\n## Installation Guide\n\n\nTo install Python with all necessary dependencies, we recommend the use of conda, and we refer to [https://conda.io/](https://conda.io/) for an installation guide.\n\n\nAfter installing conda, you can set up and activate a new conda environment with all required packages by executing the following commands from the root folder of this repository in a shell:\n\n\n```\nconda-env update -f environment.yaml\nconda activate semantic_uncertainty\n```\n\nThe installation should take around 15 minutes.\n\nOur experiments rely on [Weights & Biases (wandb)](https://wandb.ai/) to log and save individual runs.\nWhile wandb will be installed automatically with the above conda script, you may need to log in with your wandb API key upon initial execution.\n\nOur experiments rely on Hugging Face for all LLM models and most of the datasets.\nIt may be necessary to set the environment variable `HUGGING_FACE_HUB_TOKEN` to the token associated with your Hugging Face account.\nFurther, it may be necessary to 
[apply for access](https://huggingface.co/meta-llama) to use the official repository of Meta's LLaMa-2 models.\nWe further recommend setting the `XDG_CACHE_HOME` environment variable to a directory on a device with sufficient space, as models and datasets will be downloaded to this folder.\n\n\nOur experiments with sentence-length generation use GPT models from the OpenAI API.\nPlease set the environment variable `OPENAI_API_KEY` to your OpenAI API key in order to use these models.\nNote that OpenAI charges a cost per input token and per generated token.\nCosts for reproducing our results vary depending on experiment configuration, but, without any guarantee, should lie somewhere between 5 and 30 USD per run.\n\n\nFor almost all tasks, the dataset is downloaded automatically from the Hugging Face Datasets library upon first execution.\nThe only exception is BioASQ (task b, BioASQ11, 2023), for which the data needs to be [downloaded](http://participants-area.bioasq.org/datasets) manually and stored at `$SCRATCH_DIR/$USER/semantic_uncertainty/data/bioasq/training11b.json`, where `$SCRATCH_DIR` defaults to `.`.\n\n\n\n## Demo\n\nExecute\n\n```\npython semantic_uncertainty/generate_answers.py --model_name=Llama-2-7b-chat --dataset=trivia_qa\n```\n\nto reproduce results for short-phrase generation with LLaMa-2 Chat (7B) on the BioASQ dataset.\n\nThe expected runtime of this demo is 1 hour using an Nvidia A100 GPU (80 GB), 24 cores of a Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz, and 192 GB of RAM.\nRuntime may be longer upon first execution, as the LLM needs to be downloaded from Hugging Face first.\n\nTo evaluate the run and obtain a barplot similar to those of the paper, open the Jupyter notebook in [notebooks/example_evaluation.ipynb](notebooks/example_evaluation.ipynb), populate the `wandb_id` variable in the first cell with the id assigned to your demo run, and execute all cells of the notebook.\n\n\nWe refer to [https://jupyter.org/](https://jupyter.org/) for more information on how to start the Jupter notebook server.\n\n\n## Further Instructions\n\n\n### Repository Structure\n\nWe here give an overview of the various components of the code.\n\nBy default, a standard run executes the following three scripts in order:\n\n* `generate_answers.py`: Sample responses (and their likelihods/hidden states) from the models for a set of input questions.\n* `compute_uncertainty_measures.py`: Compute uncertainty metrics given responses.\n* `analyze_results.py`: Compute aggregate performance metrics given uncertainties.\n\nIt is possible to run these scripts individually, e.g. 
when recomputing results, and we are happy to provide guidance on how to do so upon request.\n\n\n### Reproducing the Experiments\n\nTo reproduce the experiments of the paper, one needs to execute\n\n```\npython generate_answers.py --model_name=$MODEL --dataset=$DATASET $EXTRA_CFG\n```\n\nfor all combinations of models and datasets, and where `$EXTRA_CFG` is defined to either activate short-phrase or sentence-length generations and their associated hyperparameters.\n\nConcretely,\n\n* `$MODEL` is one of: `[Llama-2-7b, Llama-2-13b, Llama-2-70b, Llama-2-7b-chat, Llama-2-13b-chat, Llama-2-70b-chat, falcon-7b, falcon-40b, falcon-7b-instruct, falcon-40b-instruct, Mistral-7B-v0.1, Mistral-7B-Instruct-v0.1]`,\n* `$DATASET` is one of `[trivia_qa, squad, bioasq, nq, svamp]`,\n* and `$EXTRA_CFG=''` is empty for short-phrase generations and `EXTRA_CFG=--num_few_shot=0 --model_max_new_tokens=100 --brief_prompt=chat --metric=llm_gpt-4 --entailment_model=gpt-3.5 --no-compute_accuracy_at_all_temps` for sentence-length generations.\n\n\nThe results for any run can be obtained by passing the associated `wandb_id` to an evaluation notebook identical to the demo in [notebooks/example_evaluation.ipynb](notebooks/example_evaluation.ipynb).\n"
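  {
    "path": "docs/pipeline_sketch.py",
    "content": "\"\"\"Illustrative sketch only (not part of the original codebase): shows how the three\npipeline scripts described in the README could be chained from the shell for a single\nmodel/dataset combination. The script names and flags `--model_name` and `--dataset`\nare taken from the README; the idea of reusing one wandb run id across stages is an\nassumption about how the stages are usually connected, not a documented interface.\"\"\"\nimport subprocess\n\n# Hypothetical choices for illustration; any values listed in the README would work.\nmodel = 'Llama-2-7b-chat'\ndataset = 'trivia_qa'\n\n# Stage 1: sample responses (and likelihoods/hidden states) for the input questions.\nsubprocess.run(\n    ['python', 'semantic_uncertainty/generate_answers.py',\n     f'--model_name={model}', f'--dataset={dataset}'],\n    check=True)\n\n# Stages 2 and 3 consume the artifacts logged by the previous stage. In practice one\n# would pass the wandb run id produced by stage 1; here it is a hypothetical placeholder.\nwandb_runid = 'YOUR_RUN_ID'\n\n# Stage 2: compute uncertainty measures (e.g. semantic entropy) from the sampled responses.\nsubprocess.run(\n    ['python', 'semantic_uncertainty/compute_uncertainty_measures.py',\n     f'--eval_wandb_runid={wandb_runid}'],  # flag name is an assumption\n    check=True)\n\n# Stage 3: aggregate uncertainties into performance metrics such as AUROC.\nsubprocess.run(\n    ['python', 'semantic_uncertainty/analyze_results.py',\n     f'--wandb_runid={wandb_runid}'],  # flag name is an assumption\n    check=True)\n"
  },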
  },
  {
    "path": "environment.yaml",
    "content": "name: semantic_uncertainty\nchannels:\n  - pytorch\n  - nvidia\n  - defaults\ndependencies:\n  - python=3.11\n  - pip\n  - pytorch\n  - torchvision\n  - torchaudio\n  - pytorch-cuda=11.8\n  - numpy\n  - transformers\n  - evaluate\n  - datasets\n  - pip:\n    - transformers>=4.31\n    - scikit-learn\n    - pandas\n    - flake8\n    - omegaconf\n    - jupyterlab\n    - notebook\n    - matplotlib\n    - seaborn\n    - tqdm\n    - ipywidgets\n    - scipy\n    - wandb\n    - tokenizers>=0.13.3\n    - accelerate\n    - ml_collections\n    - torchmetrics\n    - openai\n    - tiktoken\n    - einops\n    - bitsandbytes\n    - nltk\n    - tenacity\n    - sentencepiece\n    - safetensors\n"
  },
  {
    "path": "environment_export.yaml",
    "content": "name: semantic_uncertainty_export\nchannels:\n  - pytorch\n  - nvidia\n  - defaults\ndependencies:\n  - _libgcc_mutex=0.1=main\n  - _openmp_mutex=5.1=1_gnu\n  - abseil-cpp=20211102.0=hd4dd3e8_0\n  - aiohttp=3.8.5=py311h5eee18b_0\n  - aiosignal=1.2.0=pyhd3eb1b0_0\n  - arrow=1.2.3=py311h06a4308_1\n  - arrow-cpp=11.0.0=h374c478_2\n  - async-timeout=4.0.2=py311h06a4308_0\n  - attrs=23.1.0=py311h06a4308_0\n  - aws-c-common=0.6.8=h5eee18b_1\n  - aws-c-event-stream=0.1.6=h6a678d5_6\n  - aws-checksums=0.1.11=h5eee18b_2\n  - aws-sdk-cpp=1.8.185=h721c034_1\n  - binaryornot=0.4.4=pyhd3eb1b0_1\n  - blas=1.0=mkl\n  - boost-cpp=1.82.0=hdb19cb5_2\n  - bottleneck=1.3.5=py311hbed6279_0\n  - brotlipy=0.7.0=py311h5eee18b_1002\n  - bzip2=1.0.8=h7b6447c_0\n  - c-ares=1.19.1=h5eee18b_0\n  - ca-certificates=2023.08.22=h06a4308_0\n  - certifi=2023.11.17=py311h06a4308_0\n  - cffi=1.15.1=py311h5eee18b_3\n  - chardet=4.0.0=py311h06a4308_1003\n  - charset-normalizer=2.0.4=pyhd3eb1b0_0\n  - click=8.0.4=py311h06a4308_0\n  - cookiecutter=1.7.3=pyhd3eb1b0_0\n  - cryptography=41.0.3=py311hdda0065_0\n  - cuda-cudart=11.8.89=0\n  - cuda-cupti=11.8.87=0\n  - cuda-libraries=11.8.0=0\n  - cuda-nvrtc=11.8.89=0\n  - cuda-nvtx=11.8.86=0\n  - cuda-runtime=11.8.0=0\n  - datasets=2.12.0=py311h06a4308_0\n  - dill=0.3.6=py311h06a4308_0\n  - evaluate=0.4.0=py311h06a4308_0\n  - ffmpeg=4.3=hf484d3e_0\n  - filelock=3.9.0=py311h06a4308_0\n  - freetype=2.12.1=h4a9f257_0\n  - frozenlist=1.3.3=py311h5eee18b_0\n  - fsspec=2023.9.2=py311h06a4308_0\n  - gflags=2.2.2=he6710b0_0\n  - giflib=5.2.1=h5eee18b_3\n  - glog=0.5.0=h2531618_0\n  - gmp=6.2.1=h295c915_3\n  - gmpy2=2.1.2=py311hc9b5ff0_0\n  - gnutls=3.6.15=he1e5248_0\n  - grpc-cpp=1.48.2=he1ff14a_1\n  - huggingface_hub=0.17.3=py311h06a4308_0\n  - icu=73.1=h6a678d5_0\n  - idna=3.4=py311h06a4308_0\n  - intel-openmp=2023.1.0=hdb19cb5_46305\n  - jinja2=3.1.2=py311h06a4308_0\n  - jinja2-time=0.2.0=pyhd3eb1b0_3\n  - jpeg=9e=h5eee18b_1\n  - krb5=1.20.1=h143b758_1\n  - lame=3.100=h7b6447c_0\n  - lcms2=2.12=h3be6417_0\n  - ld_impl_linux-64=2.38=h1181459_1\n  - lerc=3.0=h295c915_0\n  - libboost=1.82.0=h109eef0_2\n  - libbrotlicommon=1.0.9=h5eee18b_7\n  - libbrotlidec=1.0.9=h5eee18b_7\n  - libbrotlienc=1.0.9=h5eee18b_7\n  - libcublas=11.11.3.6=0\n  - libcufft=10.9.0.58=0\n  - libcufile=1.8.0.34=0\n  - libcurand=10.3.4.52=0\n  - libcurl=8.4.0=h251f7ec_0\n  - libcusolver=11.4.1.48=0\n  - libcusparse=11.7.5.86=0\n  - libdeflate=1.17=h5eee18b_1\n  - libedit=3.1.20221030=h5eee18b_0\n  - libev=4.33=h7f8727e_1\n  - libevent=2.1.12=hdbd6064_1\n  - libffi=3.4.4=h6a678d5_0\n  - libgcc-ng=11.2.0=h1234567_1\n  - libgomp=11.2.0=h1234567_1\n  - libiconv=1.16=h7f8727e_2\n  - libidn2=2.3.4=h5eee18b_0\n  - libjpeg-turbo=2.0.0=h9bf148f_0\n  - libnghttp2=1.57.0=h2d74bed_0\n  - libnpp=11.8.0.86=0\n  - libnvjpeg=11.9.0.86=0\n  - libpng=1.6.39=h5eee18b_0\n  - libprotobuf=3.20.3=he621ea3_0\n  - libssh2=1.10.0=hdbd6064_2\n  - libstdcxx-ng=11.2.0=h1234567_1\n  - libtasn1=4.19.0=h5eee18b_0\n  - libthrift=0.15.0=h1795dd8_2\n  - libtiff=4.5.1=h6a678d5_0\n  - libunistring=0.9.10=h27cfd23_0\n  - libuuid=1.41.5=h5eee18b_0\n  - libwebp=1.3.2=h11a3e52_0\n  - libwebp-base=1.3.2=h5eee18b_0\n  - llvm-openmp=14.0.6=h9e868ea_0\n  - lz4-c=1.9.4=h6a678d5_0\n  - markupsafe=2.1.1=py311h5eee18b_0\n  - mkl=2023.1.0=h213fc3f_46343\n  - mkl-service=2.4.0=py311h5eee18b_1\n  - mkl_fft=1.3.8=py311h5eee18b_0\n  - mkl_random=1.2.4=py311hdb19cb5_0\n  - mpc=1.1.0=h10f8cd9_1\n  - mpfr=4.0.2=hb69a4c5_1\n  - mpmath=1.3.0=py311h06a4308_0\n  - 
multidict=6.0.2=py311h5eee18b_0\n  - multiprocess=0.70.14=py311h06a4308_0\n  - ncurses=6.4=h6a678d5_0\n  - nettle=3.7.3=hbbd107a_1\n  - networkx=3.1=py311h06a4308_0\n  - numexpr=2.8.7=py311h65dcdc2_0\n  - numpy=1.26.0=py311h08b1b3b_0\n  - numpy-base=1.26.0=py311hf175353_0\n  - openh264=2.1.1=h4ff587b_0\n  - openjpeg=2.4.0=h3ad879b_0\n  - openssl=3.0.12=h7f8727e_0\n  - orc=1.7.4=hb3bc3d3_1\n  - packaging=23.1=py311h06a4308_0\n  - pillow=10.0.1=py311ha6cbd5a_0\n  - pip=23.3.1=py311h06a4308_0\n  - poyo=0.5.0=pyhd3eb1b0_0\n  - pyarrow=11.0.0=py311hd8e8d9b_1\n  - pycparser=2.21=pyhd3eb1b0_0\n  - pyopenssl=23.2.0=py311h06a4308_0\n  - pysocks=1.7.1=py311h06a4308_0\n  - python=3.11.5=h955ad1f_0\n  - python-dateutil=2.8.2=pyhd3eb1b0_0\n  - python-slugify=5.0.2=pyhd3eb1b0_0\n  - python-tzdata=2023.3=pyhd3eb1b0_0\n  - python-xxhash=2.0.2=py311h5eee18b_1\n  - pytorch=2.1.1=py3.11_cuda11.8_cudnn8.7.0_0\n  - pytorch-cuda=11.8=h7e8668a_5\n  - pytorch-mutex=1.0=cuda\n  - pytz=2023.3.post1=py311h06a4308_0\n  - pyyaml=6.0=py311h5eee18b_1\n  - re2=2022.04.01=h295c915_0\n  - readline=8.2=h5eee18b_0\n  - requests=2.31.0=py311h06a4308_0\n  - responses=0.13.3=pyhd3eb1b0_0\n  - setuptools=68.0.0=py311h06a4308_0\n  - six=1.16.0=pyhd3eb1b0_1\n  - snappy=1.1.9=h295c915_0\n  - sqlite=3.41.2=h5eee18b_0\n  - sympy=1.11.1=py311h06a4308_0\n  - tbb=2021.8.0=hdb19cb5_0\n  - text-unidecode=1.3=pyhd3eb1b0_0\n  - tk=8.6.12=h1ccaba5_0\n  - torchaudio=2.1.1=py311_cu118\n  - torchtriton=2.1.0=py311\n  - torchvision=0.16.1=py311_cu118\n  - typing-extensions=4.7.1=py311h06a4308_0\n  - typing_extensions=4.7.1=py311h06a4308_0\n  - tzdata=2023c=h04d1e81_0\n  - unidecode=1.2.0=pyhd3eb1b0_0\n  - urllib3=1.26.16=py311h06a4308_0\n  - utf8proc=2.6.1=h27cfd23_0\n  - wheel=0.41.2=py311h06a4308_0\n  - xxhash=0.8.0=h7f8727e_3\n  - xz=5.4.2=h5eee18b_0\n  - yaml=0.2.5=h7b6447c_0\n  - yarl=1.8.1=py311h5eee18b_0\n  - zlib=1.2.13=h5eee18b_0\n  - zstd=1.5.5=hc292b87_0\n  - pip:\n    - absl-py==2.0.0\n    - accelerate==0.25.0\n    - annotated-types==0.6.0\n    - antlr4-python3-runtime==4.9.3\n    - anyio==3.7.1\n    - appdirs==1.4.4\n    - argon2-cffi==23.1.0\n    - argon2-cffi-bindings==21.2.0\n    - asttokens==2.4.0\n    - async-lru==2.0.4\n    - babel==2.13.0\n    - backcall==0.2.0\n    - beautifulsoup4==4.12.2\n    - bitsandbytes==0.41.2.post2\n    - bleach==6.1.0\n    - comm==0.1.4\n    - contextlib2==21.6.0\n    - contourpy==1.1.1\n    - cycler==0.12.1\n    - debugpy==1.8.0\n    - decorator==5.1.1\n    - defusedxml==0.7.1\n    - distro==1.8.0\n    - docker-pycreds==0.4.0\n    - einops==0.7.0\n    - executing==2.0.0\n    - fastjsonschema==2.18.1\n    - flake8==6.1.0\n    - fonttools==4.43.1\n    - fqdn==1.5.1\n    - gitdb==4.0.11\n    - gitpython==3.1.40\n    - h11==0.14.0\n    - httpcore==1.0.1\n    - httpx==0.25.1\n    - ipykernel==6.25.2\n    - ipython==8.16.1\n    - ipywidgets==8.1.1\n    - isoduration==20.11.0\n    - jedi==0.19.1\n    - joblib==1.3.2\n    - json5==0.9.14\n    - jsonpointer==2.4\n    - jsonschema==4.19.1\n    - jsonschema-specifications==2023.7.1\n    - jupyter-client==8.4.0\n    - jupyter-core==5.4.0\n    - jupyter-events==0.8.0\n    - jupyter-lsp==2.2.0\n    - jupyter-server==2.8.0\n    - jupyter-server-terminals==0.4.4\n    - jupyterlab==4.0.9\n    - jupyterlab-pygments==0.2.2\n    - jupyterlab-server==2.25.0\n    - jupyterlab-widgets==3.0.9\n    - kiwisolver==1.4.5\n    - lightning-utilities==0.9.0\n    - matplotlib==3.8.2\n    - matplotlib-inline==0.1.6\n    - mccabe==0.7.0\n    - mistune==3.0.2\n    - 
ml-collections==0.1.1\n    - nbclient==0.8.0\n    - nbconvert==7.9.2\n    - nbformat==5.9.2\n    - nest-asyncio==1.5.8\n    - nltk==3.8.1\n    - notebook==7.0.6\n    - notebook-shim==0.2.3\n    - omegaconf==2.3.0\n    - openai==1.3.7\n    - overrides==7.4.0\n    - pandas==2.1.3\n    - pandocfilters==1.5.0\n    - parso==0.8.3\n    - pathtools==0.1.2\n    - pexpect==4.8.0\n    - pickleshare==0.7.5\n    - platformdirs==3.11.0\n    - prometheus-client==0.17.1\n    - prompt-toolkit==3.0.39\n    - protobuf==4.24.4\n    - psutil==5.9.6\n    - ptyprocess==0.7.0\n    - pure-eval==0.2.2\n    - pycodestyle==2.11.1\n    - pydantic==2.4.2\n    - pydantic-core==2.10.1\n    - pyflakes==3.1.0\n    - pygments==2.16.1\n    - pyparsing==3.1.1\n    - python-json-logger==2.0.7\n    - pyzmq==25.1.1\n    - referencing==0.30.2\n    - regex==2023.10.3\n    - rfc3339-validator==0.1.4\n    - rfc3986-validator==0.1.1\n    - rpds-py==0.10.6\n    - safetensors==0.4.1\n    - scikit-learn==1.3.2\n    - scipy==1.11.4\n    - seaborn==0.13.0\n    - send2trash==1.8.2\n    - sentencepiece==0.1.99\n    - sentry-sdk==1.32.0\n    - setproctitle==1.3.3\n    - smmap==5.0.1\n    - sniffio==1.3.0\n    - soupsieve==2.5\n    - stack-data==0.6.3\n    - tenacity==8.2.3\n    - terminado==0.17.1\n    - threadpoolctl==3.2.0\n    - tiktoken==0.5.2\n    - tinycss2==1.2.1\n    - tokenizers==0.15.0\n    - torchmetrics==1.2.1\n    - tornado==6.3.3\n    - tqdm==4.66.1\n    - traitlets==5.11.2\n    - transformers==4.35.2\n    - uri-template==1.3.0\n    - wandb==0.16.0\n    - wcwidth==0.2.8\n    - webcolors==1.13\n    - webencodings==0.5.1\n    - websocket-client==1.6.4\n    - widgetsnbextension==4.0.9\n"
  },
  {
    "path": "notebooks/example_evaluation.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"e9698d3d-f63c-4d8c-af6e-4e9996b3fe28\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Fill in the wandb_id assigned to your demo run!\\n\",\n    \"\\n\",\n    \"wandb_id = 'YOUR_ID'\\n\",\n    \"if wandb_id == 'YOUR_ID':\\n\",\n    \"    raise ValueError('Need to provide wandb_id of demo run!')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"0ff8faff-38fa-49cb-be73-624dc88fcd13\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext autoreload\\n\",\n    \"%autoreload 2\\n\",\n    \"\\n\",\n    \"import os\\n\",\n    \"import json\\n\",\n    \"import wandb\\n\",\n    \"import pandas as pd\\n\",\n    \"from matplotlib import pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"32bfe743-5d8c-4ae6-804d-c5800a4209f1\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Helper Functions\\n\",\n    \"\\n\",\n    \"def restore_file(wandb_id, filename='wandb-summary.json'):\\n\",\n    \"    files_dir = 'notebooks/restored_files'    \\n\",\n    \"    os.system(f'mkdir -p {files_dir}')\\n\",\n    \"\\n\",\n    \"    api = wandb.Api()\\n\",\n    \"    run = api.run(f'semantic_uncertainty/{wandb_id}')\\n\",\n    \"\\n\",\n    \"    path = f'{files_dir}/{filename}'\\n\",\n    \"    os.system(f'rm -rf {path}')\\n\",\n    \"    run.file(filename).download(root=files_dir, replace=True, exist_ok=False)\\n\",\n    \"    with open(path, 'r') as f:\\n\",\n    \"        out = json.load(f)\\n\",\n    \"    return out\\n\",\n    \"\\n\",\n    \"def get_uncertainty_df(metrics):\\n\",\n    \"    data = []\\n\",\n    \"    for method in metrics['uncertainty']:\\n\",\n    \"        for metric in metrics['uncertainty'][method]:\\n\",\n    \"            mean = metrics['uncertainty'][method][metric]['mean']\\n\",\n    \"            data.append([method, metric, mean])\\n\",\n    \"    df = pd.DataFrame(data, columns=['method', 'metric', 'means'])\\n\",\n    \"    main_methods = ['semantic_entropy', 'cluster_assignment_entropy', 'regular_entropy', 'p_false', 'p_ik']\\n\",\n    \"    df = df.set_index('method').loc[main_methods].reset_index()\\n\",\n    \"    main_names = ['Semantic entropy', 'Discrete Semantic Entropy', 'Naive Entropy', 'p(True)', 'Embedding Regression']\\n\",\n    \"    conversion = dict(zip(main_methods, main_names))\\n\",\n    \"    df['method'] = df.method.map(lambda x: conversion[x])\\n\",\n    \"    return df\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"id\": \"56409203-8734-4b9e-8618-638a26be7ccb\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"results = restore_file(wandb_id)\\n\",\n    \"unc_df = get_uncertainty_df(results)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"id\": \"6acb4184-0c9a-4ec5-b860-e050bf7544a2\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"(0.6, 0.8)\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    },\n    {\n     \"data\": {\n      \"image/png\": 
\"iVBORw0KGgoAAAANSUhEUgAAAkgAAAJhCAYAAAC6vU9RAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8g+/7EAAAACXBIWXMAAA9hAAAPYQGoP6dpAABsVElEQVR4nO3deXhMZ/8/8PdksklkI7IKidorsQQR1N7YvlSpnRBLlVinWkmL2CqqT1Hl4UFsbYny0FoTxE5IxRolCBFLFpEmkaSyzfz+8DNP50xCtJk5k5z367rmusx97jnzmbmb5p1z7nMfmUqlUoGIiIiI1IzELoCIiIjI0DAgEREREQkwIBEREREJMCARERERCTAgEREREQkwIBEREREJMCARERERCTAgEREREQkwIBEREREJMCARERERCYgekFavXg13d3eYm5vDx8cHMTExr+2/YsUKNGjQAFWqVIGbmxtmzJiBFy9evNU+X7x4gcDAQFSvXh1Vq1bFgAEDkJqaWu6fjYiIiComUQPSjh07oFAoEBISgkuXLqFp06bo3r070tLSSuy/bds2BAUFISQkBDdv3kRYWBh27NiBL7744q32OWPGDOzbtw87d+7EyZMn8eTJE/Tv31/nn5eIiIgqBpmYN6v18fFBq1atsGrVKgCAUqmEm5sbpkyZgqCgIK3+kydPxs2bNxEVFaVu+/TTT3HhwgWcOXOmTPvMyspCjRo1sG3bNnz00UcAgFu3bqFRo0aIjo5GmzZtdP2xiYiIyMAZi/XGBQUFiI2NRXBwsLrNyMgI3bp1Q3R0dImvadu2LX788UfExMSgdevWuHfvHg4ePIiRI0eWeZ+xsbEoLCxEt27d1H0aNmyIWrVqvTYg5efnIz8/X/1cqVQiIyMD1atXh0wm+/tfBBEREemNSqXC8+fP4eLiAiOj0k+kiRaQ0tPTUVxcDEdHR412R0dH3Lp1q8TXDBs2DOnp6Wjfvj1UKhWKiorwySefqE+xlWWfKSkpMDU1ha2trVaflJSUUusNDQ3F/Pnz3/ZjEhERkQF6+PAhatasWep20QLS33HixAksXrwY//73v+Hj44O7d+9i2rRpWLhwIebMmaPT9w4ODoZCoVA/z8rKQq1atXD//n1YWVnp9L2JiIiofDx//hweHh5v/N0tWkCyt7eHXC7XunosNTUVTk5OJb5mzpw5GDlyJMaNGwcA8PT0RG5uLj7++GN8+eWXZdqnk5MTCgoKkJmZqXEU6XXvCwBmZmYwMzPTaq9WrRqsra3L9JmJiIhIXCYmJgDwxukxol3FZmpqCm9vb40J10qlElFRUfD19S3xNXl5eVrnC+VyOYCX5xTLsk9vb2+YmJho9ImPj0dSUlKp70tERETSIuopNoVCgVGjRqFly5Zo3bo1VqxYgdzcXAQEBAAA/P394erqitDQUABAnz59sGzZMjRv3lx9im3OnDno06ePOii9aZ82NjYYO3YsFAqF+ujPlClT4OvryyvYiIiICIDIAWnw4MF4+vQp5s6di5SUFDRr1gwRERHqSdZJSUkaR4xmz54NmUyG2bNn4/Hjx6hRowb69OmDr776qsz7BIDly5fDyMgIAwYMQH5+Prp3745///vf+vvgREREZNBEXQepIsvOzoaNjQ2ysrI4B4mIiLS8utq6uLhY7FIkRS6Xw9jYuNQ5RmX9/V2hrmIjIiKqCAoKCpCcnIy8vDyxS5EkCwsLODs7w9TU9G/vgwGJiIioHCmVSty/fx9yuRwuLi4wNTXlgsJ6olKpUFBQgKdPn+L+/fuoV6/eaxeDfB0GJCIionJUUFCgvs2VhYWF2OVITpUqVWBiYoIHDx6goKAA5ubmf2s/ot6sloiIqLL6u0cu6J8rj++eo0dEREQkwIBEREREJMCARERERCTASdpERER64h50QK/vl7ikt17frzLhESQiIiIiAQYkIiIiAgB06tQJU6ZMwfTp02FnZwdHR0esX79efU9TKysr1K1bF4cOHVK/Ji4uDj179kTVqlXh6OiIkSNHIj09Xb09IiIC7du3h62tLapXr47/+7//Q0JCgnp7YmIiZDIZdu/ejc6dO8PCwgJNmzZFdHS0us+DBw/Qp08f2NnZwdLSEu+++y4OHjyo0++CAYmIiIjUtmzZAnt7e8TExGDKlCmYOHEiBg4ciLZt2+LSpUvw8/PDyJEjkZeXh8zMTHTp0gXNmzfHxYsXERERgdTUVAwaNEi9v9zcXCgUCly8eBFRUVEwMjLChx9+CKVSqfG+X375JWbOnIkrV66gfv36GDp0KIqKigAAgYGByM/Px6lTp3D9+nV8/fXXqFq1qk6/B96L7W/ivdiIiKgkL168wP379+Hh4aG1SKGhz0Hq1KkTiouLcfr0aQBAcXExbGxs0L9/f2zduhUAkJKSAmdnZ0RHR+Po0aM4ffo0IiMj1ft49OgR3NzcEB8fj/r162u9R3p6OmrUqIHr16+jSZMmSExMhIeHBzZs2ICxY8cCAH7//Xe8++67uHnzJho2bAgvLy8MGDAAISEhZfocrxuDsv7+5hEkIiIiUvPy8lL/Wy6Xo3r16vD09FS3OTo6AgDS0tJw9epVHD9+HFWrVlU/GjZsCADq02h37tzB0KFDUadOHVhbW8Pd3R0AkJSUVOr7Ojs7q98DAKZOnYpFixahXbt2CAkJwbVr18r5U2tjQCIiIiI1ExMTjecymUyj7dV95ZRKJXJyctCnTx9cuXJF43Hnzh106NABANCnTx9kZGRg/fr1uHDhAi5cuADg5S1ZSnvfv74HAIwbNw737t3DyJEjcf36dbRs2RLff/99OX9yTbzMn4iIiP6WFi1a4L///S/c3d1hbKwdKZ49e4b4+HisX78e7733HgDgzJkzf+u93Nzc8Mknn+CTTz5BcHAw1q9fjylTpvyj+l+HR5CIiIjobwkMDERGRgaGDh2K3377DQkJCYiMjERAQACKi4thZ2eH6tWrY926dbh79y6OHTsGhULx1u8zffp0REZG4v79+7h06RKOHz+ORo0a6eAT/Q+PIBEREelJZVu40cXFBWfPnsWsWbPg5+eH/Px81K5dGz169ICRkRFkMhnCw8MxdepUNGnSBA0aNMDKlSvRqVOnt3qf4uJiBAYG4tGjR7C2tkaPHj2wfPly3Xyo/49Xsf1NvIqNiIhK8rorqEg/eBUbERERkQ4wIBEREREJMCARERERCTAgEREREQkwIBEREekAr4EST3l89wxIRERE5ejVitB5eXkiVyJdr7574argb4PrIBEREZUjuVwOW1tb9X3ELCws1LfOIN1SqVTIy8tDWloabG1tIZfL//a+GJCIiIjKmZOTE4D/3WyV9MvW1lY9Bn8XAxIREVE5k8lkcHZ2hoODAwoLC8UuR1JMTEz+0ZGjVxiQiIiIdEQul5fLL2vSP07SJiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhIwiIC0evVquLu7w9zcHD4+PoiJiSm1b6dOnSCTybQevXv3VvcpabtMJsM333yj7
uPu7q61fcmSJTr9nERERFQxiL5Q5I4dO6BQKLB27Vr4+PhgxYoV6N69O+Lj4+Hg4KDVf/fu3SgoKFA/f/bsGZo2bYqBAweq25KTkzVec+jQIYwdOxYDBgzQaF+wYAHGjx+vfm5lZVVeH4uIiIgqMNED0rJlyzB+/HgEBAQAANauXYsDBw5g48aNCAoK0upfrVo1jefh4eGwsLDQCEjC+6/8+uuv6Ny5M+rUqaPRbmVl9Y/v1UJERESVj6gBqaCgALGxsQgODla3GRkZoVu3boiOji7TPsLCwjBkyBBYWlqWuD01NRUHDhzAli1btLYtWbIECxcuRK1atTBs2DDMmDEDxsYlfyX5+fnIz89XP8/OzgYAFBYW8j47REREFURZf2eLGpDS09NRXFwMR0dHjXZHR0fcunXrja+PiYlBXFwcwsLCSu2zZcsWWFlZoX///hrtU6dORYsWLVCtWjWcO3cOwcHBSE5OxrJly0rcT2hoKObPn6/VfvjwYVhYWLyxViIiIhJfXl5emfqJfortnwgLC4Onpydat25dap+NGzdi+PDhMDc312hXKBTqf3t5ecHU1BQTJkxAaGgozMzMtPYTHBys8Zrs7Gy4ubnBz88P1tbW5fBpiIiISNdenQF6E1EDkr29PeRyOVJTUzXaU1NT3zg3KDc3F+Hh4ViwYEGpfU6fPo34+Hjs2LHjjbX4+PigqKgIiYmJaNCggdZ2MzOzEoOTiYkJTExM3rh/IiIiEl9Zf2eLepm/qakpvL29ERUVpW5TKpWIioqCr6/va1+7c+dO5OfnY8SIEaX2CQsLg7e3N5o2bfrGWq5cuQIjI6MSr5wjIiIiaRH9FJtCocCoUaPQsmVLtG7dGitWrEBubq76qjZ/f3+4uroiNDRU43VhYWHo168fqlevXuJ+s7OzsXPnTnz77bda26Kjo3HhwgV07twZVlZWiI6OxowZMzBixAjY2dmV/4ckIiKiCkX0gDR48GA8ffoUc+fORUpKCpo1a4aIiAj1xO2kpCQYGWke6IqPj8eZM2dw+PDhUvcbHh4OlUqFoUOHam0zMzNDeHg45s2bh/z8fHh4eGDGjBkac4yIiIhIumQqlUoldhEVUXZ2NmxsbJCVlcVJ2kRERBVEWX9/G8StRoiIiIgMCQMSERERkQADEhEREZEAAxIRERGRAAMSERERkQADEhEREZEAAxIRERGRAAMSERERkQADEhEREZEAAxIRERGRgOj3YqPXcw86IHYJ/1jikt5il0BERPRWeASJiIiISIABiYiIiEiAAYmIiIhIgAGJiIiISIABiYiIiEiAAYmIiIhIgJf5E5URl1wgIpIOHkEiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhIwiIC0evVquLu7w9zcHD4+PoiJiSm1b6dOnSCTybQevXv3VvcZPXq01vYePXpo7CcjIwPDhw+HtbU1bG1tMXbsWOTk5OjsMxIREVHFIXpA2rFjBxQKBUJCQnDp0iU0bdoU3bt3R1paWon9d+/ejeTkZPUjLi4OcrkcAwcO1OjXo0cPjX7bt2/X2D58+HDcuHEDR44cwf79+3Hq1Cl8/PHHOvucREREVHGIHpCWLVuG8ePHIyAgAI0bN8batWthYWGBjRs3lti/WrVqcHJyUj+OHDkCCwsLrYBkZmam0c/Ozk697ebNm4iIiMCGDRvg4+OD9u3b4/vvv0d4eDiePHmi089LREREhs9YzDcvKChAbGwsgoOD1W1GRkbo1q0boqOjy7SPsLAwDBkyBJaWlhrtJ06cgIODA+zs7NClSxcsWrQI1atXBwBER0fD1tYWLVu2VPfv1q0bjIyMcOHCBXz44Yda75Ofn4/8/Hz18+zsbABAYWEhCgsLy/6h35KZXKWzfeuLLr8ffeJYEBFVfGX9/6CoASk9PR3FxcVwdHTUaHd0dMStW7fe+PqYmBjExcUhLCxMo71Hjx7o378/PDw8kJCQgC+++AI9e/ZEdHQ05HI5UlJS4ODgoPEaY2NjVKtWDSkpKSW+V2hoKObPn6/VfvjwYVhYWLyx1r9raWud7VpvDh48KHYJ5YJjQURU8eXl5ZWpn6gB6Z8KCwuDp6cnWrfW/M01ZMgQ9b89PT3h5eWFd955BydOnEDXrl3/1nsFBwdDoVCon2dnZ8PNzQ1+fn6wtrb+ex+gDJrMi9TZvvUlbl53sUsoFxwLIqKK79UZoDcRNSDZ29tDLpcjNTVVoz01NRVOTk6vfW1ubi7Cw8OxYMGCN75PnTp1YG9vj7t376Jr165wcnLSmgReVFSEjIyMUt/XzMwMZmZmWu0mJiYwMTF5Yw1/V36xTGf71hddfj/6xLEgIqr4yvr/QVEnaZuamsLb2xtRUVHqNqVSiaioKPj6+r72tTt37kR+fj5GjBjxxvd59OgRnj17BmdnZwCAr68vMjMzERsbq+5z7NgxKJVK+Pj4/M1PQ0RERJWF6FexKRQKrF+/Hlu2bMHNmzcxceJE5ObmIiAgAADg7++vMYn7lbCwMPTr10898fqVnJwcfPbZZzh//jwSExMRFRWFDz74AHXr1kX37i9PLzRq1Ag9evTA+PHjERMTg7Nnz2Ly5MkYMmQIXFxcdP+hiYiIyKCJPgdp8ODBePr0KebOnYuUlBQ0a9YMERER6onbSUlJMDLSzHHx8fE4c+YMDh8+rLU/uVyOa9euYcuWLcjMzISLiwv8/PywcOFCjVNkP/30EyZPnoyuXbvCyMgIAwYMwMqVK3X7YYmIiKhCkKlUqop/7bIIsrOzYWNjg6ysLJ1O0nYPOqCzfetL4pLeb+5UAXAsiIgqvrL+/hb9FBsRERGRoWFAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISMBa7ACKit+UedEDsEspF4pLeYpdARKXgESQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgGDCEirV6+Gu7s7zM3N4ePjg5iYmFL7durUCTKZTOvRu/fLq0EKCwsxa9YseHp6wtLSEi4uLvD398eTJ0809uPu7q61jyVLluj0cxIREVHFIHpA2rFjBxQKBUJCQnDp0iU0bdoU3bt3R1paWon9d+/ejeTkZPUjLi4OcrkcAwcOBADk5eXh0qVLmDNnDi5duoTdu3cjPj4effv21drXggULNPY1ZcoUnX5WIiIiqhhEXwdp2bJlGD9+PAIC
AgAAa9euxYEDB7Bx40YEBQVp9a9WrZrG8/DwcFhYWKgDko2NDY4cOaLRZ9WqVWjdujWSkpJQq1YtdbuVlRWcnJzKVGd+fj7y8/PVz7OzswG8PGJVWFhYpn38HWZylc72rS+6/H70iWNhOCrDWACVZzyIKpKy/tzJVCqVaP+nKSgogIWFBXbt2oV+/fqp20eNGoXMzEz8+uuvb9yHp6cnfH19sW7dulL7HD16FH5+fsjMzIS1tTWAl6fYXrx4gcLCQtSqVQvDhg3DjBkzYGxccmacN28e5s+fr9W+bds2WFhYvLFOIiIiEl9eXh6GDRuGrKwsdSYoiahHkNLT01FcXAxHR0eNdkdHR9y6deuNr4+JiUFcXBzCwsJK7fPixQvMmjULQ4cO1fgipk6dihYtWqBatWo4d+4cgoODkZycjGXLlpW4n+DgYCgUCvXz7OxsuLm5wc/P77Vf8D/VZF6kzvatL3HzuotdQrngWBiOyjAWQOUZD6KK5NUZoDcR/RTbPxEWFgZPT0+0bt26xO2FhYUYNGgQVCoV1qxZo7Htr2HHy8sLpqammDBhAkJDQ2FmZqa1LzMzsxLbTUxMYGJi8g8/Senyi2U627e+6PL70SeOheGoDGMBVJ7xIKpIyvpzJ+okbXt7e8jlcqSmpmq0p6amvnFuUG5uLsLDwzF27NgSt78KRw8ePMCRI0feeJTHx8cHRUVFSExMfKvPQERERJWPqAHJ1NQU3t7eiIqKUrcplUpERUXB19f3ta/duXMn8vPzMWLECK1tr8LRnTt3cPToUVSvXv2NtVy5cgVGRkZwcHB4+w9CRERElYrop9gUCgVGjRqFli1bonXr1lixYgVyc3PVV7X5+/vD1dUVoaGhGq8LCwtDv379tMJPYWEhPvroI1y6dAn79+9HcXExUlJSALy8As7U1BTR0dG4cOECOnfuDCsrK0RHR2PGjBkYMWIE7Ozs9PPBiYiIyGCJHpAGDx6Mp0+fYu7cuUhJSUGzZs0QERGhnridlJQEIyPNA13x8fE4c+YMDh8+rLW/x48fY+/evQCAZs2aaWw7fvw4OnXqBDMzM4SHh2PevHnIz8+Hh4cHZsyYoTEviYiIiKRL9IAEAJMnT8bkyZNL3HbixAmttgYNGqC01Qnc3d1L3fZKixYtcP78+beuk4iIiKRB9JW0iYiIiAwNAxIRERGRAAMSERERkQADEhEREZEAAxIRERGRAAMSERERkQADEhEREZEAAxIRERGRAAMSERERkQADEhEREZEAAxIRERGRAAMSERERkQADEhEREZEAAxIRERGRAAMSERERkQADEhEREZHAWwWk7OxsKJVKrfbi4mJkZ2eXW1FEREREYipzQNqzZw9atmyJFy9eaG178eIFWrVqhX379pVrcURERERiKHNAWrNmDT7//HNYWFhobbO0tMSsWbOwatWqci2OiIiISAxlDkhxcXHo1KlTqds7dOiA69evl0dNRERERKIqc0D6448/UFRUVOr2wsJC/PHHH+VSFBEREZGYyhyQ3N3dcfHixVK3X7x4EbVr1y6XooiIiIjEVOaA1L9/f3z55ZdITU3V2paSkoLZs2djwIAB5VocERERkRiMy9oxKCgIv/76K+rVq4cRI0agQYMGAIBbt27hp59+gpubG4KCgnRWKBEREZG+lDkgWVlZ4ezZswgODsaOHTvU841sbW0xYsQIfPXVV7CystJZoURERET6UuaABAA2Njb497//jdWrVyM9PR0qlQo1atSATCbTVX1EREREevdWAemV69ev4/bt2wCABg0awNPTs1yLIiIiIhLTWwWkmJgYjB07Fr///jtUKhUAQCaT4d1330VYWBhatWqlkyKJiIiI9KnMV7H9/vvv6Nq1K6pUqYIff/wRly5dwqVLl/DDDz/AzMwMXbt2xe+//67LWomIiIj0osxHkObNm4f3338f//3vfzXmHDVr1gxDhw5F//79MW/ePPz88886KZSIiIhIX8ockI4fP45Dhw6VOCFbJpPhiy++QK9evcq1OCIiIiIxlPkU2/Pnz+Ho6FjqdicnJzx//rxciiIiIiISU5kDUu3atRETE1Pq9gsXLvBWI0RERFQplDkgDRkyBAqFAnFxcVrbrl+/jpkzZ2Lw4MHlWhwRERGRGMo8Byk4OBhHjx5Fs2bN8P7776NRo0ZQqVS4efMmjh49itatW+OLL77QZa1EREREelHmI0jm5uY4fvw4vvrqKyQnJ2Pt2rX4z3/+g5SUFCxatAjHjx+Hubn53ypi9erVcHd3h7m5OXx8fF57Kq9Tp06QyWRaj969e6v7qFQqzJ07F87OzqhSpQq6deuGO3fuaOwnIyMDw4cPh7W1NWxtbTF27Fjk5OT8rfqJiIiocilzQAIAU1NTzJo1C1euXEFeXh7y8vJw5coVBAUFwczM7G8VsGPHDigUCoSEhODSpUto2rQpunfvjrS0tBL77969G8nJyepHXFwc5HI5Bg4cqO6zdOlSrFy5EmvXrsWFCxdgaWmJ7t2748WLF+o+w4cPx40bN3DkyBHs378fp06dwscff/y3PgMRERFVLm8VkF4nOTkZkydPfuvXLVu2DOPHj0dAQAAaN26MtWvXwsLCAhs3biyxf7Vq1eDk5KR+HDlyBBYWFuqApFKpsGLFCsyePRsffPABvLy8sHXrVjx58gS//PILAODmzZuIiIjAhg0b4OPjg/bt2+P7779HeHg4njx58re/AyIiIqoc3upWIzdu3MDx48dhamqKQYMGwdbWFunp6Vi0aBH+85//oE6dOm/15gUFBYiNjUVwcLC6zcjICN26dUN0dHSZ9hEWFoYhQ4bA0tISAHD//n2kpKSgW7du6j42Njbw8fFBdHQ0hgwZgujoaNja2qJly5bqPt26dYORkREuXLiADz/8UOt98vPzkZ+fr36enZ0NACgsLERhYeFbfe63YSZX6Wzf+qLL70efOBaGozKMBVB5xoOoIinrz12ZA9LevXvx0UcfoaioCMDL01jr16/HoEGD4O3tjT179qBHjx5vVWR6ejqKi4u11ldydHTErVu33vj6mJgYxMXFISwsTN2WkpKi3odwn6+2paSkwMHBQWO7sbExqlWrpu4jFBoaivnz52u1Hz58GBYWFm+s9e9a2lpnu9abgwcPil1CueBYGI7KMBZA5RkPoookLy+vTP3KHJAWLVqEwMBALFy4EBs2bIBCocDUqVNx8OBB0W5SGxYWBk9PT7Rurfv/WwYHB0OhUKifZ2dnw83NDX5+frC2ttbZ+zaZF6mzfetL3LzuYpdQLjgWhqMyjAVQecaDqCJ5dQboTcockOLj47Ft2zZUrVoVU6ZMwcyZM7F8+fJ/FI7s7e0hl8uRmpqq0Z6amgonJ6fXvjY3Nxfh4eFYsGCBRvur16WmpsLZ2Vljn82aNVP3EU4CLyoqQkZGRqnva2ZmVuJEdBMTE5iYmLy21n8iv1j71i4VjS6/H33iWBiOyjAWQOUZD6KKpKw/d291q5FXR0rkcjmqVKny1nOOhExNTeHt7Y2oqCh1m1KpRFRUFHx
9fV/72p07dyI/Px8jRozQaPfw8ICTk5PGPrOzs3HhwgX1Pn19fZGZmYnY2Fh1n2PHjkGpVMLHx+cffSYiIiKq+N5qknZkZCRsbGwA/C/ICFfW7tu371sVoFAoMGrUKLRs2RKtW7fGihUrkJubi4CAAACAv78/XF1dERoaqvG6sLAw9OvXD9WrV9dol8lkmD59OhYtWoR69erBw8MDc+bMgYuLC/r16wcAaNSoEXr06IHx48dj7dq1KCwsxOTJkzFkyBC4uLi8Vf1ERERU+bxVQBo1apTG8wkTJmg8l8lkKC4ufqsCBg8ejKdPn2Lu3LlISUlBs2bNEBERoZ5knZSUBCMjzQNd8fHxOHPmDA4fPlziPj///HPk5ubi448/RmZmJtq3b4+IiAiNhSx/+uknTJ48GV27doWRkREGDBiAlStXvlXtREREVDnJVCpV5bheVs+ys7NhY2ODrKwsnU7Sdg86oLN960vikt5v7lQBcCwMR2UYC6DyjAdRRVLW39/ltlAkERERUWVR5lNspZ1+srGxQf369d84qZqIiIiooihzQFq+fHmJ7ZmZmcjKykLbtm2xd+9eVKtWrdyKIyIiIhJDmU+x3b9/v8THH3/8gbt370KpVGL27Nm6rJWIiIhIL8plDlKdOnWwZMmSUq8qIyIiIqpIym2Sdq1atUq9jxkRERFRRVJuAen69euoXbt2ee2OiIiISDRlnqRd2s3dsrKyEBsbi08//VRrIUkiIiKiiqjMAcnW1hYyWck3iJTJZBg3bhyCgoLKrTAiIiIisZQ5IB0/frzEdmtra9SrVw9Vq1ZFXFwcmjRpUm7FEREREYmhzAGpY8eOJbY/f/4c27ZtQ1hYGC5evPjW92IjIiIiMjR/e5L2qVOnMGrUKDg7O+Nf//oXOnfujPPnz5dnbURERESiKPMRJABISUnB5s2bERYWhuzsbAwaNAj5+fn45Zdf0LhxY13VSERERKRXZT6C1KdPHzRo0ADXrl3DihUr8OTJE3z//fe6rI2IiIhIFGU+gnTo0CFMnToVEydORL169XRZExEREZGoynwE6cyZM3j+/Dm8vb3h4+ODVatWIT09XZe1EREREYmizAGpTZs2WL9+PZKTkzFhwgSEh4fDxcUFSqUSR44cwfPnz3VZJxEREZHevPVVbJaWlhgzZgzOnDmD69ev49NPP8WSJUvg4OCAvn376qJGIiIiIr36R/dia9CgAZYuXYpHjx5h+/bt5VUTERERkajK5Wa1crkc/fr1w969e8tjd0RERESiKpeARERERFSZMCARERERCTAgEREREQkwIBEREREJMCARERERCTAgEREREQkwIBEREREJMCARERERCTAgEREREQkwIBEREREJMCARERERCTAgEREREQkwIBEREREJMCARERERCTAgEREREQmIHpBWr14Nd3d3mJubw8fHBzExMa/tn5mZicDAQDg7O8PMzAz169fHwYMH1dvd3d0hk8m0HoGBgeo+nTp10tr+ySef6OwzEhERUcViLOab79ixAwqFAmvXroWPjw9WrFiB7t27Iz4+Hg4ODlr9CwoK8P7778PBwQG7du2Cq6srHjx4AFtbW3Wf3377DcXFxerncXFxeP/99zFw4ECNfY0fPx4LFixQP7ewsCj/D0hEREQVkqgBadmyZRg/fjwCAgIAAGvXrsWBAwewceNGBAUFafXfuHEjMjIycO7cOZiYmAB4ecTor2rUqKHxfMmSJXjnnXfQsWNHjXYLCws4OTmV46chIiKiykK0gFRQUIDY2FgEBwer24yMjNCtWzdER0eX+Jq9e/fC19cXgYGB+PXXX1GjRg0MGzYMs2bNglwuL/E9fvzxRygUCshkMo1tP/30E3788Uc4OTmhT58+mDNnzmuPIuXn5yM/P1/9PDs7GwBQWFiIwsLCt/rsb8NMrtLZvvVFl9+PPnEsDEdlGAug8owHUUVS1p870QJSeno6iouL4ejoqNHu6OiIW7dulfiae/fu4dixYxg+fDgOHjyIu3fvYtKkSSgsLERISIhW/19++QWZmZkYPXq0RvuwYcNQu3ZtuLi44Nq1a5g1axbi4+Oxe/fuUusNDQ3F/PnztdoPHz6s09NzS1vrbNd689c5YhUZx8JwVIaxACrPeBBVJHl5eWXqJ1OpVKL8KfbkyRO4urri3Llz8PX1Vbd//vnnOHnyJC5cuKD1mvr16+PFixe4f/+++ojRsmXL8M033yA5OVmrf/fu3WFqaop9+/a9tpZjx46ha9euuHv3Lt55550S+5R0BMnNzQ3p6emwtrYu02f+O5rMi9TZvvUlbl53sUsoFxwLw1EZxgKoPONBVJFkZ2fD3t4eWVlZr/39LdoRJHt7e8jlcqSmpmq0p6amljo3yNnZGSYmJhqn0xo1aoSUlBQUFBTA1NRU3f7gwQMcPXr0tUeFXvHx8QGA1wYkMzMzmJmZabWbmJio50PpQn6x7M2dDJwuvx994lgYjsowFkDlGQ+iiqSsP3eiXeZvamoKb29vREVFqduUSiWioqI0jij9Vbt27XD37l0olUp12+3bt+Hs7KwRjgBg06ZNcHBwQO/evd9Yy5UrVwC8DGBEREREoq6DpFAosH79emzZsgU3b97ExIkTkZubq76qzd/fX2MS98SJE5GRkYFp06bh9u3bOHDgABYvXqyxxhHwMmht2rQJo0aNgrGx5kGyhIQELFy4ELGxsUhMTMTevXvh7++PDh06wMvLS/cfmoiIiAyeqJf5Dx48GE+fPsXcuXORkpKCZs2aISIiQj1xOykpCUZG/8twbm5uiIyMxIwZM+Dl5QVXV1dMmzYNs2bN0tjv0aNHkZSUhDFjxmi9p6mpKY4ePYoVK1YgNzcXbm5uGDBgAGbPnq3bD0tEREQVhqgBCQAmT56MyZMnl7jtxIkTWm2+vr44f/78a/fp5+eH0uaeu7m54eTJk29dJxEREUmH6LcaISIiIjI0DEhEREREAgxIRERERAIMSEREREQCDEhEREREAgxIRERERAIMSEREREQCoq+DREREFZd70AGxSygXiUvefFsqkhYeQSIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhJgQCIiIiISYEAiIiIiEmBAIiIiIhIQPSCtXr0a7u7uMDc3h4+PD2JiYl7bPzMzE4GBgXB2doaZmRnq16+PgwcPqrfPmzcPMplM49GwYUONfbx48QKBgYGoXr06qlatigEDBiA1NVUnn4+IiIgqHlED0o4dO6BQKBASEoJLly6hadOm6N69O9LS0krsX1BQgPfffx+JiYnYtWsX4uPjsX79eri6umr0e/fdd5GcnKx+nDlzRmP7jBkzsG/fPuzcuRMnT57EkydP0L
9/f519TiIiIqpYjMV882XLlmH8+PEICAgAAKxduxYHDhzAxo0bERQUpNV/48aNyMjIwLlz52BiYgIAcHd31+pnbGwMJyenEt8zKysLYWFh2LZtG7p06QIA2LRpExo1aoTz58+jTZs25fTpiIiIqKISLSAVFBQgNjYWwcHB6jYjIyN069YN0dHRJb5m79698PX1RWBgIH799VfUqFEDw4YNw6xZsyCXy9X97ty5AxcXF5ibm8PX1xehoaGoVasWACA2NhaFhYXo1q2bun/Dhg1Rq1YtREdHlxqQ8vPzkZ+fr36enZ0NACgsLERhYeHf/yLewEyu0tm+9UWX348+cSwMR2UYC6ByjAfHgiqaso61aAEpPT0dxcXFcHR01Gh3dHTErVu3SnzNvXv3cOzYMQwfPhwHDx7E3bt3MWnSJBQWFiIkJAQA4OPjg82bN6NBgwZITk7G/Pnz8d577yEuLg5WVlZISUmBqakpbG1ttd43JSWl1HpDQ0Mxf/58rfbDhw/DwsLiLT992S1trbNd681f54hVZBwLw1EZxgKoHOPBsaCKJi8vr0z9RD3F9raUSiUcHBywbt06yOVyeHt74/Hjx/jmm2/UAalnz57q/l5eXvDx8UHt2rXx888/Y+zYsX/7vYODg6FQKNTPs7Oz4ebmBj8/P1hbW//9D/UGTeZF6mzf+hI3r7vYJZQLjoXhqAxjAVSO8eBYUEXz6gzQm4gWkOzt7SGXy7WuHktNTS11/pCzszNMTEw0Tqc1atQIKSkpKCgogKmpqdZrbG1tUb9+fdy9excA4OTkhIKCAmRmZmocRXrd+wKAmZkZzMzMtNpNTEzU86F0Ib9YprN964suvx994lgYjsowFkDlGA+OBVU0ZR1r0a5iMzU1hbe3N6KiotRtSqUSUVFR8PX1LfE17dq1w927d6FUKtVtt2/fhrOzc4nhCABycnKQkJAAZ2dnAIC3tzdMTEw03jc+Ph5JSUmlvi8RERFJi6iX+SsUCqxfvx5btmzBzZs3MXHiROTm5qqvavP399eYxD1x4kRkZGRg2rRpuH37Ng4cOIDFixcjMDBQ3WfmzJk4efIkEhMTce7cOXz44YeQy+UYOnQoAMDGxgZjx46FQqHA8ePHERsbi4CAAPj6+vIKNiIiIgIg8hykwYMH4+nTp5g7dy5SUlLQrFkzREREqCduJyUlwcjofxnOzc0NkZGRmDFjBry8vODq6opp06Zh1qxZ6j6PHj3C0KFD8ezZM9SoUQPt27fH+fPnUaNGDXWf5cuXw8jICAMGDEB+fj66d++Of//73/r74ERERGTQRJ+kPXnyZEyePLnEbSdOnNBq8/X1xfnz50vdX3h4+Bvf09zcHKtXr8bq1avLXCcRERFJh+i3GiEiIiIyNAxIRERERAIMSEREREQCDEhEREREAgxIRERERAIMSEREREQCDEhEREREAgxIRERERAIMSEREREQCDEhEREREAgxIRERERAIMSEREREQCDEhEREREAgxIRERERAIMSEREREQCDEhEREREAgxIRERERAIMSEREREQCDEhEREREAgxIRERERAIMSEREREQCDEhEREREAsZiF0BERET/nHvQAbFLKBeJS3qLXQIAHkEiIiIi0sKARERERCTAgEREREQkwIBEREREJMCARERERCTAgEREREQkwIBEREREJMCARERERCTAgEREREQkwIBEREREJMCARERERCTAgEREREQkIHpAWr16Ndzd3WFubg4fHx/ExMS8tn9mZiYCAwPh7OwMMzMz1K9fHwcPHlRvDw0NRatWrWBlZQUHBwf069cP8fHxGvvo1KkTZDKZxuOTTz7RyecjIiKiikfUgLRjxw4oFAqEhITg0qVLaNq0Kbp37460tLQS+xcUFOD9999HYmIidu3ahfj4eKxfvx6urq7qPidPnkRgYCDOnz+PI0eOoLCwEH5+fsjNzdXY1/jx45GcnKx+LF26VKeflYiIiCoOYzHffNmyZRg/fjwCAgIAAGvXrsWBAwewceNGBAUFafXfuHEjMjIycO7cOZiYmAAA3N3dNfpERERoPN+8eTMcHBwQGxuLDh06qNstLCzg5ORUzp+IiIiIKgPRAlJBQQFiY2MRHBysbjMyMkK3bt0QHR1d4mv27t0LX19fBAYG4tdff0WNGjUwbNgwzJo1C3K5vMTXZGVlAQCqVaum0f7TTz/hxx9/hJOTE/r06YM5c+bAwsKi1Hrz8/ORn5+vfp6dnQ0AKCwsRGFhYdk+9N9gJlfpbN/6osvvR584FoajMowFUDnGg2NhODgW5bt/mUqlEuUbffLkCVxdXXHu3Dn4+vqq2z///HOcPHkSFy5c0HpNw4YNkZiYiOHDh2PSpEm4e/cuJk2ahKlTpyIkJESrv1KpRN++fZGZmYkzZ86o29etW4fatWvDxcUF165dw6xZs9C6dWvs3r271HrnzZuH+fPna7Vv27bttcGKiIiIDEdeXh6GDRuGrKwsWFtbl9pP1FNsb0upVMLBwQHr1q2DXC6Ht7c3Hj9+jG+++abEgBQYGIi4uDiNcAQAH3/8sfrfnp6ecHZ2RteuXZGQkIB33nmnxPcODg6GQqFQP8/Ozoabmxv8/Pxe+wX/U03mReps3/oSN6+72CWUC46F4agMYwFUjvHgWBgOjkXZvDoD9CaiBSR7e3vI5XKkpqZqtKemppY6N8jZ2RkmJiYap9MaNWqElJQUFBQUwNTUVN0+efJk7N+/H6dOnULNmjVfW4uPjw8A4O7du6UGJDMzM5iZmWm1m5iYqOdD6UJ+sUxn+9YXXX4/+sSxMByVYSyAyjEeHAvDwbEo3/2LdhWbqakpvL29ERUVpW5TKpWIiorSOOX2V+3atcPdu3ehVCrVbbdv34azs7M6HKlUKkyePBl79uzBsWPH4OHh8cZarly5AuBlACMiIiIS9TJ/hUKB9evXY8uWLbh58yYmTpyI3Nxc9VVt/v7+GpO4J06ciIyMDEybNg23b9/GgQMHsHjxYgQGBqr7BAYG4scff8S2bdtgZWWFlJQUpKSk4M8//wQAJCQkYOHChYiNjUViYiL27t0Lf39/dOjQAV5eXvr9AoiIiMggiToHafDgwXj69Cnmzp2LlJQUNGvWDBEREXB0dAQAJCUlwcjofxnOzc0NkZGRmDFjBry8vODq6opp06Zh1qxZ6j5r1qwB8HIxyL/atGkTRo8eDVNTUxw9ehQrVqxAbm4u3NzcMGDAAMyePVv3H5iIiIgqBNEnaU+ePBmTJ08ucduJEye02nx9fXH+/PlS9/emi/Lc3Nxw8uTJt6qRiIiIpEX0W40QERERGRoGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJ
CIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkIiIiIgEGJCIiIiIBBiQiIiIiAdED0urVq+Hu7g5zc3P4+PggJibmtf0zMzMRGBgIZ2dnmJmZoX79+jh48OBb7fPFixcIDAxE9erVUbVqVQwYMACpqanl/tmIiIioYhI1IO3YsQMKhQIhISG4dOkSmjZtiu7duyMtLa3E/gUFBXj//feRmJiIXbt2IT4+HuvXr4erq+tb7XPGjBnYt28fdu7ciZMnT+LJkyfo37+/zj8vERERVQzGYr75smXLMH78eAQEBAAA1q5diwMHDmDjxo0ICgrS6r9x40ZkZGTg3LlzMDExAQC4u7u/1T6zsrIQFhaGbdu2oUuXLgCATZs2oVGjRjh//jzatGlTYq35+fnIz89XP8/KygIAZGRkoLCw8J99Ea9hXJSrs33ry7Nnz8QuoVxwLAxHZRgLoHKMB8fCcHAsyub58+cAAJVK9fqOKpHk5+er5HK5as+ePRrt/v7+qr59+5b4mp49e6qGDx+uGj9+vMrBwUH17rvvqr766itVUVFRmfcZFRWlAqD6448/NPrUqlVLtWzZslLrDQkJUQHggw8++OCDDz4qwePhw4evzSmiHUFKT09HcXExHB0dNdodHR1x69atEl9z7949HDt2DMOHD8fBgwdx9+5dTJo0CYWFhQgJCSnTPlNSUmBqagpbW1utPikpKaXWGxwcDIVCoX6uVCqRkZGB6tWrQyaTvc1HNxjZ2dlwc3PDw4cPYW1tLXY5ksfxMBwcC8PBsTAclWUsVCoVnj9/DhcXl9f2E/UU29tSKpVwcHDAunXrIJfL4e3tjcePH+Obb75BSEiITt/bzMwMZmZmGm3CkFVRWVtbV+j/2Csbjofh4FgYDo6F4agMY2FjY/PGPqIFJHt7e8jlcq2rx1JTU+Hk5FTia5ydnWFiYgK5XK5ua9SoEVJSUlBQUFCmfTo5OaGgoACZmZkaAed170tERETSItpVbKampvD29kZUVJS6TalUIioqCr6+viW+pl27drh79y6USqW67fbt23B2doapqWmZ9unt7Q0TExONPvHx8UhKSir1fYmIiEhaRL3MX6FQYP369diyZQtu3ryJiRMnIjc3V30Fmr+/P4KDg9X9J06ciIyMDEybNg23b9/GgQMHsHjxYgQGBpZ5nzY2Nhg7diwUCgWOHz+O2NhYBAQEwNfXt9Qr2CorMzMzhISEaJ06JHFwPAwHx8JwcCwMh+TG4rVTuPXg+++/V9WqVUtlamqqat26ter8+fPqbR07dlSNGjVKo/+5c+dUPj4+KjMzM1WdOnU0rmIryz5VKpXqzz//VE2aNEllZ2ensrCwUH344Yeq5ORknX1GIiIiqlhkKtWbFgIgIiIikhbRbzVCREREZGgYkIiIiIgEGJCIiIiIBBiQiIiIiAQYkCSkY8eO2Lp1K/7880+xSyFwPAxJSEgIHjx4IHYZRGRAGJAkpHnz5pg5cyacnJwwfvx4nD9/XuySJI3jYTh+/fVXvPPOO+jatSu2bduG/Px8sUsiEl1ubi7mzJmDtm3bom7duqhTp47Go7LjZf4SU1RUhL1792LLli04dOgQ6tatizFjxmDkyJFaN/kl3eN4GI7Lly9j06ZN2L59O4qKijBkyBCMGTMGrVq1Ers0ScvPz5fOwoQGZujQoTh58iRGjhwJZ2dnrRuzT5s2TaTK9IMBScLS0tKwbt06fPXVVyguLkavXr0wdepUdOnSRezSJInjYRgKCwuxb98+bNq0CZGRkWjYsCHGjh2L0aNHl+kGl/TPHDp0COHh4Th9+jQePnwIpVIJS0tLNG/eHH5+fggICHjjXdipfNja2uLAgQNo166d2KWIgqfYJComJgYhISH49ttv4eDggODgYNjb2+P//u//MHPmTLHLkxyOh+FQqVQoLCxEQUEBVCoV7OzssGrVKri5uWHHjh1il1dp7dmzB/Xr18eYMWNgbGyMWbNmYffu3YiMjMSGDRvQsWNHHD16FHXq1MEnn3yCp0+fil1ypWdnZ4dq1aqJXYZ4xFvEm/QtNTVV9a9//Uv17rvvqkxNTVUDBgxQHTp0SKVUKtV9Tp8+rbK0tBSxSungeBiWixcvqgIDA1XVqlVTOTs7q2bNmqW6c+eOevvKlStVDg4OIlZYubVp00a1f/9+VXFx8Wv7PXr0SDVr1izVsmXL9FSZdP3www+qjz76SJWbmyt2KaLgKTYJMTU1xTvvvIMxY8Zg9OjRqFGjhlaf7OxsfPDBBzh+/LgIFUoLx8NweHp64tatW/Dz88P48ePRp08fyOVyjT7p6elwcHCAUqkUqUoi/WrevDkSEhKgUqng7u4OExMTje2XLl0SqTL9MBa7ANKfqKgovPfee6/tY21tzV/GesLxMByDBg3CmDFj4OrqWmofe3t7hiM9KygowP379/HOO+/A2Ji/rvStX79+YpcgKh5BkqC0tDTEx8cDABo0aAAHBweRK5I2jodhefW/ROEVO6Q/eXl5mDJlCrZs2QIAuH37NurUqYMpU6bA1dUVQUFBIldIUsBJ2hLy/PlzjBw5Eq6urujYsSM6duwIV1dXjBgxAllZWWKXJzkcD8MSFhaGJk2awNzcHObm5mjSpAk2bNggdlmSFBwcjKtXr+LEiRMwNzdXt3fr1o0T5UUQGxuLH3/8ET/++CMuX74sdjl6w4AkIePGjcOFCxewf/9+ZGZmIjMzE/v378fFixcxYcIEscuTHI6H4Zg7dy6mTZuGPn36YOfOndi5cyf69OmDGTNmYO7cuWKXJzm//PILVq1ahfbt22scyXv33XeRkJAgYmXSkpaWhi5duqBVq1aYOnUqpk6dCm9vb3Tt2lUaVxGKOUOc9MvCwkJ1+vRprfZTp06pLCwsRKhI2jgehsPe3l61bds2rfZt27apqlevLkJF0lalShVVQkKCSqVSqapWrar+95UrV1TW1tZiliYpgwYNUrVs2VL1+++/q9tu3LihatmypWrIkCEiVqYfPIIkIdWrVy9xoTsbGxvY2dmJUJG0cTwMR2FhIVq2bKnV7u3tjaKiIhEqkraWLVviwIED6uevjiJt2LABvr6+YpUlOREREfj3v/+NRo0aqdsaN26M1atX49ChQyJWph8MSBIye/ZsKBQKpKSkqNtSUlLw2WefYc6cOSJWJk0cD8MxcuRIrFmzRqt93bp1GD58uAgVSdvixYvxxRdfYOLEiSgqKsJ3330HPz8/bNq0CV999ZXY5UmGUqnUurQfAExMTCRxRSevYpOQ5s2b4+7du8jPz0etWrUAAElJSTAzM0O9evU0+lb29S0MAcfDcEyZMgVbt26Fm5sb2rRpAwC4cOECkpKS4O/vr/FLYtmyZWKVKSkJCQlYsmQJrl69ipycHLRo0QKzZs2Cp6en2KVJxgcffIDMzExs375dfXuXx48fY/jw4bCzs8OePXtErlC3uLCEhEh9TQtDw/EwHHFxcWjRogUAqCcB29vbw97eHnFxcep+vPRf
f9555x2sX79e7DIkbdWqVejbty/c3d3h5uYGAHj48CGaNGmCH3/8UeTqdI9HkIiIyKAkJSW9dvurI66keyqVCkePHsWtW7cAAI0aNUK3bt1Erko/GJAkKDY2Fjdv3gTw8rLZ5s2bi1yRtHE8DMujR48AADVr1hS5EukyMjJ67dG64uJiPVZDUsVTbBKSlpaGIUOG4MSJE7C1tQUAZGZmonPnzggPDy/xXmCkOxwPw6FUKrFo0SJ8++23yMnJAQBYWVnh008/xZdffgkjI17Pok/CxQgLCwtx+fJlLFu2jJO0dWzlypX4+OOPYW5ujpUrV76279SpU/VUlTh4BElCBg8ejHv37mHr1q3qyzZ///13jBo1CnXr1sX27dtFrlBaOB6GIzg4GGFhYZg/fz7atWsHADhz5gzmzZuH8ePH85eygThw4AC++eYbnDhxQuxSKi0PDw9cvHgR1atXh4eHR6n9ZDIZ7t27p8fK9I8BSUJsbGxw9OhRtGrVSqM9JiYGfn5+yMzMFKcwieJ4GA4XFxesXbsWffv21Wj/9ddfMWnSJDx+/Fikyuiv7t69i6ZNmyI3N1fsUkgCeIpNQqS+poWh4XgYjoyMDDRs2FCrvWHDhsjIyBChImnLzs7WeK5SqZCcnIx58+ZpLYFB+lNcXIzr16+jdu3akljMlifWJaRLly6YNm0anjx5om57/PgxZsyYga5du4pYmTRxPAxH06ZNsWrVKq32VatWoWnTpiJUJG22traws7NTP6pVq4bGjRsjOjq6xAU9STemT5+OsLAwAC/DUYcOHdCiRQu4ublJ4jQnT7FJyMOHD9G3b1/cuHFDa02LvXv38qodPeN4GI6TJ0+id+/eqFWrlvpWFtHR0Xj48CEOHjyI9957T+QKpeXkyZMaz42MjFCjRg3UrVsXxsY88aEvNWvWxC+//IKWLVvil19+QWBgII4fP44ffvgBx44dw9mzZ8UuUacYkCRGymtaGCKOh+F48uQJVq9erTEWkyZNUq8gTPpRWFiICRMmYM6cOa+dJEy6Z25ujrt376JmzZr4+OOPYWFhgRUrVuD+/fto2rSp1qnQyoYBSSIKCwtRpUoVXLlyBU2aNBG7HMnjeBiOwsJC9OjRA2vXruX8FgNhY2ODK1euMCCJrHbt2li/fj26du0KDw8PrFmzBr1798aNGzfQvn17/PHHH2KXqFOcgyQRJiYmqFWrFhdYMxAcD8NhYmKCa9euiV0G/UW/fv3wyy+/iF2G5AUEBGDQoEFo0qQJZDKZ+uj2hQsXSryoobLhESQJCQsLw+7du/HDDz+gWrVqYpcjeRwPwzFjxgyYmZlhyZIlYpdCgHrRzq5du8Lb2xuWlpYa2yv7AoWGZNeuXXj48CEGDhyonhe5ZcsW2Nra4oMPPhC5Ot1iQJKQV3ePLywsRO3atbX+p8M7xusXx8NwTJkyBVu3bkW9evVK/IW8bNkykSqTljp16uC3335Dy5YtS+0jhQUKDVlmZqZ65f/KjpcDSMgHH3zAu5EbEI6H4YiLi0OLFi0AALdv3xa5GulKTExEcXEx7t+/L3YpBODrr7+Gu7s7Bg8eDAAYNGgQ/vvf/8LZ2RkHDx6El5eXyBXqFo8gERGRQTAyMkJKSgocHBzELoXw8rYjP/30E9q2bYsjR45g0KBB2LFjB37++WckJSXh8OHDYpeoUzyCJCGvDl9Xr15doz0zMxMtWrTgYWs943gYjjFjxuC7776DlZWVRntubi6mTJmCjRs3ilSZ9ERGRsLGxua1fYS3hCHdSElJUa/Rtn//fgwaNAh+fn5wd3eHj4+PyNXpHo8gSUhpf52lpqbCzc0NBQUFIlUmTRwPwyGXy5GcnKw1Funp6XByckJRUZFIlUmLkdGbL6yWyWS8+lNPXFxcsGvXLrRt2xYNGjTAokWLMHDgQMTHx6NVq1aVfh0kHkGSgL1796r/LfzrrLi4GFFRUVxvRI84HoYjOzsbKpUKKpUKz58/h7m5uXpbcXExDh48yNM9esZTbIajf//+GDZsGOrVq4dnz56hZ8+eAIDLly+jbt26IlenewxIEtCvXz8AL//yGjVqlMY2ExMTuLu749tvvxWhMmnieBgOW1tbyGQyyGQy1K9fX2u7TCbD/PnzRahMmnjRgmFZvnw53N3d8fDhQyxduhRVq1YFACQnJ2PSpEkiV6d7PMUmIR4eHvjtt99gb28vdikEjochOHnyJFQqFbp06YL//ve/GutRmZqaonbt2rzViB5xkjYZEgYkIpK8Bw8ewM3NrUxzYEh3AgICsHLlSq3J8iSeH374Af/5z39w7949REdHo3bt2lixYgU8PDwq/UKRPMUmMVFRUYiKikJaWhqUSqXGNl6po38cD8NQu3ZtZGZmIiYmpsSx8Pf3F6ky6cjNzcWmTZveqr9wQU8qX2vWrMHcuXMxffp0fPXVV+rJ8ba2tlixYkWlD0g8giQh8+fPx4IFC9CyZUs4Oztrne/fs2ePSJVJE8fDcOzbtw/Dhw9HTk4OrK2tNcZCJpMhIyNDxOqkwdnZGdOmTcOoUaPg7OxcYh+VSoWjR49i2bJl6NChA4KDg/VcpbQ0btwYixcvRr9+/WBlZYWrV6+iTp06iIuLQ6dOnZCeni52iTrFgCQhzs7OWLp0KUaOHCl2KQSOhyGpX78+evXqhcWLF8PCwkLsciQpPj4eX3zxBfbv349mzZqhZcuWcHFxgbm5Of744w/8/vvviI6OhrGxMYKDgzFhwgTI5XKxy67UqlSpglu3bqF27doaAenOnTvw8vLCn3/+KXaJOsVTbBJSUFCAtm3bil0G/X8cD8Px+PFjTJ06leFIRA0aNMB///tfJCUlYefOnTh9+jTOnTuHP//8E/b29mjevDnWr1+Pnj17MhjpiYeHB65cuYLatWtrtEdERKBRo0YiVaU/DEgSMm7cOGzbtg1z5swRuxQCx8OQdO/eHRcvXkSdOnXELkXyatWqhU8//RSffvopgJen1QAuASAGhUKBwMBAvHjxAiqVCjExMdi+fTtCQ0OxYcMGscvTOQYkCXnx4gXWrVuHo0ePwsvLCyYmJhrbecdy/eJ4GI7evXvjs88+w++//w5PT0+tseCtLfQvLCwMy5cvx507dwAA9erVw/Tp0zFu3DiRK5OOcePGoUqVKpg9ezby8vIwbNgwuLi44LvvvsOQIUPELk/nOAdJQjp37lzqNplMhmPHjumxGuJ4GI7XXd7PW1vo39y5c7Fs2TJMmTIFvr6+AIDo6GisWrUKM2bMwIIFC0SusPIrKirCtm3b0L17dzg6OiIvLw85OTmSWqOKAYmIiAxKjRo1sHLlSgwdOlSjffv27ZgyZUqlv3rKUFhYWODmzZtac5CkgquiSdDdu3cRGRmpvgKBGdnwpKWliV0CkWgKCwvRsmVLrXZvb2/eOFiPWrdujcuXL4tdhmgYkCTk2bNn6Nq1q/qS5uTkZADA2LFj1RMiSfcsLCzw9OlT9fPevXurxwIAUlNTS10HhspXr169kJWVpX6+ZMkSZGZmqp8/e/YMjRs3FqEyaRs
5ciTWrFmj1b5u3ToMHz5chIqkadKkSfj000+xatUqREdH49q1axqPyo6n2CTE398faWlp2LBhAxo1aqRe0yIyMhIKhQI3btwQu0RJEN5v6q/riwD/C0jC1Zyp/MnlciQnJ6vHwtraGleuXNEYCxcXF85B0rMpU6Zg69atcHNzQ5s2bQAAFy5cQFJSEvz9/TUm0fNiBt0paW6eTCaDSqWSxNw8XsUmIYcPH0ZkZCRq1qyp0V6vXj08ePBApKqoJLykWT+Efx/y70XDEBcXhxYtWgAAEhISAAD29vawt7dHXFycuh9/TnTr/v37YpcgKgYkCcnNzS1xIbyMjAyYmZmJUBERkbbjx4+LXQIBkp2c/QoDkoS899572Lp1KxYuXAjg5V9fSqUSS5cufe0l51S+ZDKZ1r2++JewOEr67jkWRC/t3bu3xHaZTAZzc3PUrVsXHh4eeq5KfzgHSULi4uLQtWtXtGjRAseOHUPfvn1x48YNZGRk4OzZs3jnnXfELlESjIyMYGNjo/5FnJmZCWtra/X5fpVKhezs7Ep/ft8QGBkZoWfPnuojqPv27UOXLl3Ud4nPz89HREQEx4IkycjISD3n6K/+Og+pffv2+OWXX2BnZydSlbrDgCQxWVlZWLVqFa5evYqcnBy0aNECgYGBvGpKj7Zs2VKmfqNGjdJxJRQQEFCmfps2bdJxJUSGJyoqCl9++SW++uortG7dGgAQExODOXPmYPbs2bCxscGECRPg4+ODsLAwkastfwxIREREpKVJkyZYt26d1k21z549i48//hg3btzA0aNHMWbMGCQlJYlUpe5wHSQiIiLSkpCQAGtra612a2tr3Lt3D8DLq6Ar68rmDEhERESkxdvbG5999pnGwrZPnz7F559/jlatWgEA7ty5Azc3N7FK1ClexUZERERawsLC8MEHH6BmzZrqEPTw4UPUqVMHv/76KwAgJycHs2fPFrNMneEcJCIiIiqRUqnE4cOHcfv2bQBAgwYN8P7775e4ynZlw4AkIffv30dRURHq1aun0X7nzh2YmJjA3d1dnMKIiMigvXjxAmZmZpJaJ6zyR0BSGz16NM6dO6fVfuHCBYwePVr/BUncgAED8PXXX2u1L126FAMHDhShImn74Ycf0K5dO7i4uKhvvbNixQr1qQQiqVEqlVi4cCFcXV1RtWpV9a1H5syZUykv6xdiQJKQy5cvo127dlrtbdq0wZUrV/RfkMSdOnUKvXr10mrv2bMnTp06JUJF0rVmzRooFAr06tULmZmZ6oUhbW1tsWLFCnGLIxLJokWLsHnzZixduhSmpqbq9iZNmmDDhg0iVqYfDEgSIpPJ8Pz5c632rKwsrhQsgpycHI3/6bxiYmKC7OxsESqSru+//x7r16/Hl19+Cblcrm5v2bIlrl+/LmJlROLZunUr1q1bh+HDh2v8XDRt2hS3bt0SsTL9YECSkA4dOiA0NFQjDBUXFyM0NBTt27cXsTJp8vT0xI4dO7Taw8PD0bhxYxEqkq779++jefPmWu1mZmbIzc0VoSIi8T1+/Bh169bValcqlSgsLBShIv3iZf4S8vXXX6NDhw5o0KAB3nvvPQDA6dOnkZ2djWPHjolcnfTMmTMH/fv3R0JCArp06QLg5dL+27dvx86dO0WuTlo8PDxw5coVrbuXR0REoFGjRiJVRSSuxo0b4/Tp01o/F7t27SrxD4rKhgFJQho3boxr166p78VWpUoV+Pv7Y/LkyahWrZrY5UlOnz598Msvv2Dx4sXYtWsXqlSpAi8vLxw9ehQdO3YUuzxJUSgUCAwMxIsXL6BSqRATE4Pt27cjNDRUEnMtiEoyd+5cjBo1Co8fP4ZSqcTu3bsRHx+PrVu3Yv/+/WKXp3O8zJ+ICMBPP/2EefPmISEhAQDg4uKC+fPnY+zYsSJXRiSe06dPY8GCBRo3OJ87dy78/PzELk3nGJAquWvXrqFJkyYwMjLCtWvXXtvXy8tLT1URGa68vDzk5OTAwcFB7FKIDNbFixfRsmVLscvQKQakSs7IyAgpKSlwcHCAkZERZDIZShpymUzGK9n0oFq1arh9+zbs7e1hZ2f32kXXMjIy9FiZtC1atAjDhw+Hh4eH2KUQGYycnBzI5XJUqVJF3XblyhXMmTMHBw8erPS/MzgHqZK7f/8+atSoof43iWv58uWwsrJS/1tKq9Iasp07dyIkJAQ+Pj4YMWIEBg0aBHt7e7HLIhLFw4cPMWjQIMTExEAul2Py5MlYtGgRPvnkE+zYsQMffvhhiYsOVzY8giQhp06dQtu2bWFsrJmLi4qKcO7cOXTo0EGkyojEd+PGDfz0008IDw/Ho0eP8P7772P48OHo168fLCwsxC6PSG+GDBmC+Ph4jB07Frt378bJkyfRokUL+Pj4ICgoCDVr1hS7RL1gQJIQuVyO5ORkrbkVz549g4ODQ6U/XGpoOB6G6+zZs9i2bRt27tyJFy9ecOFOkhQXFxfs3r0bbdq0QVpaGpycnLBs2TJMnz5d7NL0igtFSohKpSrxlM6zZ89gaWkpQkXSVtrfJvn5+SWusE36Y2lpiSpVqsDU1FQSC+IR/VVqaqp6Pp6DgwMsLCzQs2dPkavSP85BkoD+/fsDeDkRe/To0TAzM1NvKy4uxrVr19C2bVuxypOclStXAng5Hhs2bEDVqlXV24qLi3Hq1Ck0bNhQrPIk6/79+9i2bRu2bduG+Ph4dOzYEfPnz8dHH30kdmlEemdkZKTxbyn+0caAJAE2NjYAXh6xsLKy0rgiwdTUFG3atMH48ePFKk9yli9fDuDleKxdu1bjHkempqZwd3fH2rVrxSpPktq0aYPffvsNXl5eCAgIwNChQ+Hq6ip2WUSiUKlUqF+/vvqMQ05ODpo3b64RmoDKf6UtA5IEbNq0CQDg7u6OmTNn8nSayF5dTdi5c2fs3r0bdnZ2IldEXbt2xcaNG3kPPCL873eG1HGSNhEREZEAjyBJSGpqKmbOnImoqCikpaVpTRLmVVP6VVxcjM2bN6vHQ6lUamznDYR1S6FQYOHChbC0tIRCoXht32XLlumpKiIyFAxIEjJ69GgkJSVhzpw5cHZ25iKFIps2bRo2b96M3r17o0mTJhwPPbt8+bL6CrXLly+X2o/jQiRNPMUmIVZWVjh9+jSaNWsmdikEwN7eHlu3bkWvXr3ELoWIiAS4DpKEuLm5lbr2Dumfqakp6tatK3YZRERUAh5BkpDDhw/j22+/xX/+8x+4u7uLXY7kffvtt7h37x5WrVrF0zgG4OLFi/j555+RlJSEgoICjW27d+8WqSoiEgsDkoTY2dkhLy8PRUVFsLCwgImJicb2yr6mhaH58MMPcfz4cVSrVg3vvvuu1njwl7L+hIeHw9/fH927d8fhw4fh5+eH27dvIzU1FR9++CEveyZJKu3iBZlMBnNzc9StWxcffPABqlWrpufK9IMBSUK2bNny2u2jRo3SUyUEAAEBAa/dzl/K+uPl5YUJEyYgMDAQVlZWuHr1Kjw8PDBhwgQ4Oztj/vz5YpdIpHedO3fGpUuXUFxcjAYNGg
AAbt++DblcjoYNGyI+Ph4ymQxnzpyplGuIMSARkeRZWlrixo0bcHd3R/Xq1XHixAl4enri5s2b6NKlC5KTk8UukUjvVqxYgdOnT2PTpk2wtrYGAGRlZWHcuHFo3749xo8fj2HDhuHPP/9EZGSkyNWWP07SlqhXdyj/64NIquzs7PD8+XMAgKurK+Li4gAAmZmZyMvLE7M0ItF88803WLhwoTocAS9vXTVv3jwsXboUFhYWmDt3LmJjY0WsUne4DpKE5ObmYtasWfj555/x7Nkzre1cKFL/du3aVerE4EuXLolUlfR06NABR44cgaenJwYOHIhp06bh2LFjOHLkCLp27Sp2eUSiyMrKQlpamtbps6dPn6r/qLa1tdX6f1dlwSNIEvL555/j2LFjWLNmDczMzLBhwwbMnz8fLi4u2Lp1q9jlSc7KlSsREBAAR0dHXL58Ga1bt0b16tVx79499OzZU+zyJGXVqlUYMmQIAODLL7+EQqFAamoqBgwYgLCwMJGrIxLHBx98gDFjxmDPnj149OgRHj16hD179mDs2LHo168fACAmJgb169cXt1Ad4RwkCalVqxa2bt2KTp06wdraGpcuXULdunXxww8/YPv27Th48KDYJUpKw4YNERISgqFDh6onBtepUwdz585FRkYGVq1aJXaJRCRhOTk5mDFjBrZu3YqioiIAgLGxMUaNGoXly5fD0tISV65cAYBKuQAxA5KEVK1aFb///jtq1aqFmjVrYvfu3WjdujXu378PT09P5OTkiF2ipFhYWODmzZuoXbs2HBwccOTIETRt2hR37txBmzZtSjwNSkSkbzk5Obh37x4AoE6dOqhatarIFekHT7FJSJ06dXD//n0AL49e/PzzzwCAffv2wdbWVsTKpMnJyUm99lStWrVw/vx5AMD9+/e54rmeGBkZQS6Xv/ZhbMypmiRtVatWhZeXF7y8vCQTjgBO0paUgIAAXL16FR07dkRQUBD69OmDVatWobCwkHcrF0GXLl2wd+9eNG/eHAEBAZgxYwZ27dqFixcvon///mKXJwl79uwpdVt0dDRWrlwJpVKpx4qIDEdubi6WLFmCqKgopKWlaf0svDqqVFnxFJuEPXjwALGxsahbty68vLzELkdylEollEql+ghFeHg4zp07h3r16mHChAkwNTUVuUJpio+PR1BQEPbt24fhw4djwYIFqF27tthlEend0KFDcfLkSYwcORLOzs5at0SaNm2aSJXpBwMSERGAJ0+eICQkBFu2bEH37t0RGhqKJk2aiF0WkWhsbW1x4MABtGvXTuxSRMFTbBLz22+/4fjx4yUeLuVpNv178eIFrl27VuJ49O3bV6SqpCUrKwuLFy/G999/j2bNmiEqKgrvvfee2GURic7Ozq7S3metLBiQJGTx4sWYPXs2GjRoAEdHR43DpbybvP5FRETA398f6enpWttkMhkX7tSDpUuX4uuvv4aTkxO2b9+ODz74QOySiAzGwoULMXfuXGzZsgUWFhZil6N3PMUmIY6Ojvj6668xevRosUshAPXq1YOfnx/mzp0LR0dHscuRJCMjI1SpUgXdunWDXC4vtd/u3bv1WBWRYWjevDkSEhKgUqng7u4OExMTje2VfbV/HkGSECMjI8meSzZEqampUCgUDEci8vf359FTolK8Wi1bqngESUKWLl2KJ0+eYMWKFWKXQgDGjBmDdu3aYezYsWKXQkREAgxIEqJUKtG7d2/cvn0bjRs31jpcytMI+pWXl4eBAweiRo0a8PT01BqPqVOnilQZERHxFJuETJ06FcePH0fnzp1RvXp1nloQ2fbt23H48GGYm5vjxIkTWpPmGZCISN+qVauG27dvw97eHnZ2dq/9PfHqTgCVFY8gSYiVlRXCw8PRu3dvsUshvLzVyNSpUxEUFAQjI971h4jEt2XLFgwZMgRmZmbYsmXLa/uOGjVKT1WJgwFJQmrXro3IyEg0bNhQ7FIIL/9S++233/DOO++IXQoREQkwIEnIpk2bEBERgU2bNklyTQtDM2PGDNSoUQNffPGF2KUQEQEAsrOzy9zX2tpah5WIjwFJQqS+poWhmTp1KrZu3YqmTZvCy8tLazy4sjkR6ZuRkVGZ56dW9sVsOUlbQqS+poWhuX79Opo3bw4AiIuL09jGCfREJIbjx4+r/52YmIigoCCMHj0avr6+AIDo6Ghs2bIFoaGhYpWoNzyCRERERFq6du2KcePGYejQoRrt27Ztw7p163DixAlxCtMTXjojMZmZmdiwYQOCg4PVl2heunQJjx8/Frky6bp79y4iIyPx559/AgD4NwsRGYLo6Gi0bNlSq71ly5aIiYkRoSL9YkCSkGvXrqF+/fr4+uuv8a9//QuZmZkAXi4QGRwcLG5xEvTs2TN07doV9evXR69evZCcnAwAGDt2LD799FORqyMiqXNzc8P69eu12jds2AA3NzcRKtIvBiQJUSgUGD16NO7cuQNzc3N1e69evXDq1CkRK5OmGTNmwMTEBElJSRpXFQ4ePBgREREiVkZEBCxfvhzff/89PD09MW7cOIwbNw5eXl74/vvvsXz5crHL0zkGJAn57bffMGHCBK12V1dXpKSkiFCRtB0+fBhff/01atasqdFer149PHjwQKSqiIhe6tWrF27fvo0+ffogIyMDGRkZ6NOnD27fvo1evXqJXZ7O8So2CTEzMytxjYvbt2+jRo0aIlQkbbm5uSWuR5WRkQEzMzMRKiIi0uTm5obFixeLXYYoGJAkpG/fvliwYAF+/vlnAC8vJU9KSsKsWbMwYMAAkauTnvfeew9bt27FwoULAbwcD6VSiaVLl6Jz584iV0dEUnTt2rUy9/Xy8tJhJeLjZf4SkpWVhY8++ggXL17E8+fP4eLigpSUFPj6+uLgwYOwtLQUu0RJiYuLQ9euXdGiRQscO3YMffv2xY0bN5CRkYGzZ8/yFiREpHevFopUqVQa67G9igp/bavsC0UyIEnQ2bNncfXqVeTk5KBFixbo1q2b2CVJVlZWFlatWqUxHoGBgXB2dha7NCKSoL/Of7x8+TJmzpyJzz77TGOhyG+//RZLly6t9IsPMyARERGRltatW2PevHlaE7IPHjyIOXPmIDY2VqTK9INXsUlAdHQ09u/fr9G2detWeHh4wMHBAR9//DHy8/NFqk560tPTta5Su3HjBgICAjBo0CBs27ZNpMqIiP7n+vXr8PDw0Gr38PDA77//LkJF+sWAJAELFizAjRs31M+vX7+OsWPHolu3bggKCsK+ffskcV8dQzFlyhSsXLlS/TwtLQ3vvfcefvvtN+Tn52P06NH44YcfRKyQiAho1KgRQkNDUVBQoG4rKChAaGgoGjVqJGJl+sGr2CTgypUr6iulACA8PBw+Pj7qFVLd3NwQEhKCefPmiVShtJw/fx6bN29WP9+6dSuqVauGK1euwNjYGP/617+wevVqjBw5UrwiiUjy1q5diz59+qBmzZrqK9auXbsGmUyGffv2iVyd7jEgScAff/wBR0dH9fOTJ0+iZ8+e6uetWrXCw4cPxShNklJSU
uDu7q5+fuzYMfTv3x/Gxi9/HPv27csjekQkutatW+PevXv46aefcOvWLQAvV/ofNmyYJK56ZkCSAEdHR9y/fx9ubm4oKCjApUuXMH/+fPX258+fw8TERMQKpcXa2hqZmZmoXbs2ACAmJgZjx45Vb5fJZJwTRkQGwdLSEh9//LHYZYiCc5AkoFevXggKCsLp06cRHBwMCwsLvPfee+rt165d45o7etSmTRusXLkSSqUSu3btwvPnz9GlSxf19tu3b0viRpBEZPh++OEHtG/fHi4uLuqLS5YvX45ff/1V5Mp0jwFJAhYuXAhjY2N07NgR69evx/r162FqaqrevnHjRvj5+YlYobQsXLgQe/fuRZUqVTB48GB8/vnnsLOzU28PDw9Hx44dRayQiAhYs2YNFAoFevbsiT/++EO9MKSdnR1WrFghbnF6wHWQJCQrKwtVq1aFXC7XaM/IyEDVqlU1QhPpVnp6Os6ePQsnJyf4+PhobDtw4AAaN25c4uW1RET60rhxYyxevBj9+vWDlZUVrl69ijp16iAuLg6dOnVCenq62CXqFAMSERERaalSpQpu3bqF2rVrawSkO3fuwMvLC3/++afYJeoUT7ERERGRFg8PD1y5ckWrPSIigusgERERkTQpFAoEBgbixYsXUKlUiImJwfbt2xEaGooNGzaIXZ7O8RQbERERleinn37CvHnzkJCQAABwcXHB/PnzNZYmqawYkIiIiOi18vLykJOTAwcHB7FL0RvOQSISUUJCAmbPno2hQ4ciLS0NAHDo0CGNe+cREYkpLS0NsbGxiI+Px9OnT8UuR28YkIhEcvLkSXh6euLChQvYvXs3cnJyAABXr15FSEiIyNURkdQ9f/4cI0eOhIuLCzp27IiOHTvCxcUFI0aMQFZWltjl6RwDEpFIgoKCsGjRIhw5ckRjDaouXbrg/PnzIlZGRASMGzcOFy5cwIEDB5CZmYnMzEzs378fFy9exIQJE8QuT+c4B4lIJFWrVsX169fh4eGhscZIYmIiGjZsiBcvXohdIhFJmKWlJSIjI9G+fXuN9tOnT6NHjx7Izc0VqTL94BEkIpHY2toiOTlZq/3y5ctwdXUVoSIiov+pXr06bGxstNptbGw0bo9UWTEgEYlkyJAhmDVrFlJSUiCTyaBUKnH27FnMnDkT/v7+YpdHRBI3e/ZsKBQKpKSkqNtSUlLw2WefYc6cOSJWph88xUYkkoKCAgQGBmLz5s0oLi6GsbExiouLMWzYMGzevFnrnnlERLrWvHlzyGQy9fM7d+4gPz8ftWrVAgAkJSXBzMwM9erVw6VLl8QqUy8YkIhE9vDhQ1y/fh05OTlo3rw56tWrJ3ZJRCRR8+fPL3Pfyn61LQMSkUgWLFiAmTNnwsLCQqP9zz//xDfffIO5c+eKVBkRETEgEYlELpcjOTlZa2XaZ8+ewcHBAcXFxSJVRkSkKScnB0qlUqPN2tpapGr0g5O0iUSiUqk0zvW/cvXqVVSrVk2EioiI/uf+/fvo3bs3LC0t1Veu2dnZwdbWVhJXsRmLXQCR1NjZ2UEmk0Emk6F+/foaIam4uBg5OTn45JNPRKyQiAgYMWIEVCoVNm7cCEdHxxL/oKvMeIqNSM+2bNkClUqFMWPGYMWKFRrrjJiamsLd3R2+vr4iVkhE9HIx29jYWDRo0EDsUkTBI0hEejZq1CgAgIeHB9q1awdjY/4YEpHhadWqFR4+fCjZgMQjSEQiSkhIwKZNm5CQkIDvvvsODg4OOHToEGrVqoV3331X7PKISMISEhLwySefYMSIEWjSpAlMTEw0tnt5eYlUmX4wIBGJ5OTJk+jZsyfatWuHU6dO4ebNm6hTpw6WLFmCixcvYteuXWKXSEQSdv78eQwbNgyJiYnqNplMpr7ApLJfacuARCQSX19fDBw4EAqFQuNmtTExMejfvz8ePXokdolEJGGNGzdGo0aN8Pnnn5c4Sbt27doiVaYfnPxAJJLr169j27ZtWu0ODg5IT08XoSIiov958OAB9u7di7p164pdiii4DhKRSGxtbZGcnKzVfvnyZbi6uopQERHR/3Tp0gVXr14VuwzR8AgSkUiGDBmCWbNmYefOnZDJZFAqlTh79ixmzpwJf39/scsjIonr06cPZsyYgevXr8PT01Nrknbfvn1Fqkw/OAeJSCQFBQUIDAzE5s2bUVxcDGNjYxQXF2PYsGHYvHkz5HK52CUSkYQZGZV+komTtIlIJ1QqFR4+fIgaNWogPT0d169fR05ODpo3b4569eqJXR4RkeQxIBGJQKlUwtzcHDdu3GAgIiIyQJykTSQCIyMj1KtXD8+ePRO7FCIiDb169UJWVpb6+ZIlS5CZmal+/uzZMzRu3FiEyvSLAYlIJEuWLMFnn32GuLg4sUshIlKLjIxEfn6++vnixYuRkZGhfl5UVIT4+HgxStMrXsVGJBJ/f3/k5eWhadOmMDU1RZUqVTS2//V/SERE+iKceSPVmTgMSEQiWbFihdglEBFRKRiQiEQyatQosUsgItIik8m0bisifC4FDEhEIjl48CDkcjm6d++u0X748GEUFxejZ8+eIlVGRFKmUqkwevRomJmZAQBevHiBTz75BJaWlgCgMT+pMuMkbSKRBAUFlbjQmlKpRFBQkAgVERG9PLrt4OAAGxsb2NjYYMSIEXBxcVE/d3BwkMRq/1wHiUgkVapUwc2bN+Hu7q7RnpiYiHfffRe5ubniFEZERDyCRCQWGxsb3Lt3T6v97t276kPZREQkDgYkIpF88MEHmD59OhISEtRtd+/exaefflrpbwJJRGToeIqNSCRZWVno0aMHLl68iJo1awIAHj16hPfeew+7d++Gra2tuAUSEUkYAxKRiFQqFY4cOYKrV6+iSpUq8PLyQocOHcQui4hI8hiQiAxIZmYmjxwRERkAzkEiEsnXX3+NHTt2qJ8PGjQI1atXh6urK65evSpiZURExIBEJJK1a9fCzc0NAHDkyBEcOXIEhw4dQs+ePfHZZ5+JXB0RkbRxJW0ikaSkpKgD0v79+zFo0CD4+fnB3d0dPj4+IldHRCRtPIJEJBI7Ozs8fPgQABAREYFu3boBeDlxu6QVtomISH94BIlIJP3798ewYcNQr149PHv2TH3vtcuXL6Nu3boiV0dEJG0MSEQiWb58Odzd3fHw4UMsXboUVatWBQAkJydj0qRJIldHRCRtvMyfiIiISIBHkIj0aO/evejZsydMTEywd+/e1/bl7UaIiMTDI0hEemRkZISUlBQ4ODjAyKj0ayRkMhknahMRiYgBiYiIiEiAl/kTERERCXAOEpEIlEolNm/ejN27dyMxMREymQweHh746KOPMHLkSMhkMrFLJCKSNJ5iI9IzlUqFPn364ODBg2jatCkaNmwIlUqFmzdv4vr16+jbty9++eUXscskIpI0HkEi0rPNmzfj1KlTiIqKQufOnTW2HTt2DP369cPWrVvh7+8vUoVERMQjSER65ufnhy5duiAoKKjE7YsXL8bJkycRGRmp58qIiOgVTtIm0rNr166hR48epW7v2bMnrl69
qseKiIhIiAGJSM8yMjLg6OhY6nZHR0f88ccfeqyIiIiEGJCI9Ky4uBjGxqVP/5PL5SgqKtJjRUREJMRJ2kR6plKpMHr0aJiZmZW4PT8/X88VERGREAMSkZ6NGjXqjX14BRsRkbh4FRsRERGRAOcgEREREQkwIBEREREJMCARERERCTAgEREREQkwIBEREREJMCAREZXgxIkTkMlkyMzMLPd9y2Qy/PLLL+W+XyIqPwxIRCR5nTp1wvTp08Uug4gMCAMSERERkQADEhFVKJ06dcKUKVMwffp02NnZwdHREevXr0dubi4CAgJgZWWFunXr4tChQ+rXxMXFoWfPnqhatSocHR0xcuRIpKenAwBGjx6NkydP4rvvvoNMJoNMJkNiYqL6tbGxsWjZsiUsLCzQtm1bxMfHa9SzZs0avPPOOzA1NUWDBg3www8/aGy/c+cOOnToAHNzczRu3BhHjhzR3ZdDROWGAYmIKpwtW7bA3t4eMTExmDJlCiZOnIiBAweibdu2uHTpEvz8/DBy5Ejk5eUhMzMTXbp0QfPmzXHx4kVEREQgNTUVgwYNAgB899138PX1xfjx45GcnIzk5GS4ubmp3+vLL7/Et99+i4sXL8LY2BhjxoxRb9uzZw+mTZuGTz/9FHFxcZgwYQICAgJw/PhxAIBSqUT//v1hamqKCxcuYO3atZg1a5Z+vywi+ntUREQVSMeOHVXt27dXPy8qKlJZWlqqRo4cqW5LTk5WAVBFR0erFi5cqPLz89PYx8OHD1UAVPHx8ep9Tps2TaPP8ePHVQBUR48eVbcdOHBABUD1559/qlQqlapt27aq8ePHa7xu4MCBql69eqlUKpUqMjJSZWxsrHr8+LF6+6FDh1QAVHv27Pn7XwIR6RyPIBFRhePl5aX+t1wuR/Xq1eHp6aluc3R0BACkpaXh6tWrOH78OKpWrap+NGzYEACQkJDwVu/l7Oys3i8A3Lx5E+3atdPo365dO9y8eVO93c3NDS4uLurtvr6+b/VZiUgcxmIXQET0tkxMTDSey2QyjTaZTAbg5SmunJwc9OnTB19//bXWfl4FnrK+11/3S0SVG48gEVGl1qJFC9y4cQPu7u6oW7euxsPS0hIAYGpqiuLi4rfed6NGjXD27FmNtrNnz6Jx48bq7Q8fPkRycrJ6+/nz5//BpyEifWFAIqJKLTAwEBkZGRg6dCh+++03JCQkIDIyEgEBAepQ5O7ujgsXLiAxMRHp6ellPkL02WefYfPmzVizZg3u3LmDZcuWYffu3Zg5cyYAoFu3bqhfvz5GjRqFq1ev4vTp0/jyyy919lmJqPwwIBFRpebi4oKzZ8+iuLgYfn5+8PT0xPTp02Frawsjo5f/C5w5cybkcjkaN26MGjVqICkpqUz77tevH7777jv861//wrvvvov//Oc/2LRpEzp16gQAMDIywp49e/Dnn3+idevWGDduHL766itdfVQiKkcylUqlErsIIiIiIkPCI0hEREREAgxIRERERAIMSEREREQCDEhEREREAgxIRERERAIMSEREREQCDEhEREREAgxIRERERAIMSEREREQCDEhEREREAgxIRERERAL/D4fJzXnMM118AAAAAElFTkSuQmCC\",\n      \"text/plain\": [\n       \"<Figure size 640x480 with 1 Axes>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"metric = 'AUROC'\\n\",\n    \"unc_df.set_index('metric').loc[metric].plot.bar(x='method', y='means')\\n\",\n    \"plt.gca().set_ylabel(metric)\\n\",\n    \"plt.gca().grid(axis='y')\\n\",\n    \"plt.gca().set_ylim(0.6, 0.8)\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.11.7\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "semantic_uncertainty/analyze_results.py",
    "content": "\"\"\"Compute overall performance metrics from predicted uncertainties.\"\"\"\nimport argparse\nimport functools\nimport logging\nimport os\nimport pickle\n\nimport numpy as np\nimport wandb\n\nfrom uncertainty.utils import utils\nfrom uncertainty.utils.eval_utils import (\n    bootstrap, compatible_bootstrap, auroc, accuracy_at_quantile,\n    area_under_thresholded_accuracy)\n\n\nutils.setup_logger()\n\nresult_dict = {}\n\nUNC_MEAS = 'uncertainty_measures.pkl'\n\n\ndef init_wandb(wandb_runid, assign_new_wandb_id, experiment_lot, entity):\n    \"\"\"Initialize wandb session.\"\"\"\n    user = os.environ['USER']\n    slurm_jobid = os.getenv('SLURM_JOB_ID')\n    scratch_dir = os.getenv('SCRATCH_DIR', '.')\n    kwargs = dict(\n        entity=entity,\n        project='semantic_uncertainty',\n        dir=f'{scratch_dir}/{user}/uncertainty',\n        notes=f'slurm_id: {slurm_jobid}, experiment_lot: {experiment_lot}',\n    )\n    if not assign_new_wandb_id:\n        # Restore wandb session.\n        wandb.init(\n            id=wandb_runid,\n            resume=True,\n            **kwargs)\n        wandb.restore(UNC_MEAS)\n    else:\n        api = wandb.Api()\n        wandb.init(**kwargs)\n\n        old_run = api.run(f'{entity}/semantic_uncertainty/{wandb_runid}')\n        old_run.file(UNC_MEAS).download(\n            replace=True, exist_ok=False, root=wandb.run.dir)\n\n\ndef analyze_run(\n        wandb_runid, assign_new_wandb_id=False, answer_fractions_mode='default',\n        experiment_lot=None, entity=None):\n    \"\"\"Analyze the uncertainty measures for a given wandb run id.\"\"\"\n    logging.info('Analyzing wandb_runid `%s`.', wandb_runid)\n\n    # Set up evaluation metrics.\n    if answer_fractions_mode == 'default':\n        answer_fractions = [0.8, 0.9, 0.95, 1.0]\n    elif answer_fractions_mode == 'finegrained':\n        answer_fractions = [round(i, 3) for i in np.linspace(0, 1, 20+1)]\n    else:\n        raise ValueError\n\n    rng = np.random.default_rng(41)\n    eval_metrics = dict(zip(\n        ['AUROC', 'area_under_thresholded_accuracy', 'mean_uncertainty'],\n        list(zip(\n            [auroc, area_under_thresholded_accuracy, np.mean],\n            [compatible_bootstrap, compatible_bootstrap, bootstrap]\n        )),\n    ))\n    for answer_fraction in answer_fractions:\n        key = f'accuracy_at_{answer_fraction}_answer_fraction'\n        eval_metrics[key] = [\n            functools.partial(accuracy_at_quantile, quantile=answer_fraction),\n            compatible_bootstrap]\n\n    if wandb.run is None:\n        init_wandb(\n            wandb_runid, assign_new_wandb_id=assign_new_wandb_id,\n            experiment_lot=experiment_lot, entity=entity)\n\n    elif wandb.run.id != wandb_runid:\n        raise ValueError\n\n    # Load the results dictionary from a pickle file.\n    with open(f'{wandb.run.dir}/{UNC_MEAS}', 'rb') as file:\n        results_old = pickle.load(file)\n\n    result_dict = {'performance': {}, 'uncertainty': {}}\n\n    # First: Compute simple accuracy metrics for model predictions.\n    all_accuracies = dict()\n    all_accuracies['accuracy'] = 1 - np.array(results_old['validation_is_false'])\n\n    for name, target in all_accuracies.items():\n        result_dict['performance'][name] = {}\n        result_dict['performance'][name]['mean'] = np.mean(target)\n        result_dict['performance'][name]['bootstrap'] = bootstrap(np.mean, rng)(target)\n\n    rum = results_old['uncertainty_measures']\n    if 'p_false' in rum and 'p_false_fixed' not in 
rum:\n        # Restore log probs true: y = 1 - x --> x = 1 - y.\n        # Convert to probs --> np.exp(1 - y).\n        # Convert to p_false --> 1 - np.exp(1 - y).\n        rum['p_false_fixed'] = [1 - np.exp(1 - x) for x in rum['p_false']]\n\n    # Next: Uncertainty Measures.\n    # Iterate through the dictionary and compute additional metrics for each measure.\n    for measure_name, measure_values in rum.items():\n        logging.info('Computing for uncertainty measure `%s`.', measure_name)\n\n        # Validation accuracy.\n        validation_is_falses = [\n            results_old['validation_is_false'],\n            results_old['validation_unanswerable']\n        ]\n\n        logging_names = ['', '_UNANSWERABLE']\n\n        # Iterate over predictions of 'falseness' or 'answerability'.\n        for validation_is_false, logging_name in zip(validation_is_falses, logging_names):\n            name = measure_name + logging_name\n            result_dict['uncertainty'][name] = {}\n\n            validation_is_false = np.array(validation_is_false)\n            validation_accuracy = 1 - validation_is_false\n            if len(measure_values) > len(validation_is_false):\n                # This can happen, but only for p_false.\n                if 'p_false' not in measure_name:\n                    raise ValueError\n                logging.warning(\n                    'More measure values for %s than in validation_is_false. Len(measure values): %d, Len(validation_is_false): %d',\n                    measure_name, len(measure_values), len(validation_is_false))\n                measure_values = measure_values[:len(validation_is_false)]\n\n            fargs = {\n                'AUROC': [validation_is_false, measure_values],\n                'area_under_thresholded_accuracy': [validation_accuracy, measure_values],\n                'mean_uncertainty': [measure_values]}\n\n            for answer_fraction in answer_fractions:\n                fargs[f'accuracy_at_{answer_fraction}_answer_fraction'] = [validation_accuracy, measure_values]\n\n            for fname, (function, bs_function) in eval_metrics.items():\n                metric_i = function(*fargs[fname])\n                result_dict['uncertainty'][name][fname] = {}\n                result_dict['uncertainty'][name][fname]['mean'] = metric_i\n                logging.info(\"%s for measure name `%s`: %f\", fname, name, metric_i)\n                result_dict['uncertainty'][name][fname]['bootstrap'] = bs_function(\n                    function, rng)(*fargs[fname])\n\n    wandb.log(result_dict)\n    logging.info(\n        'Analysis for wandb_runid `%s` finished. 
Full results dict: %s',\n        wandb_runid, result_dict\n    )\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--wandb_runids', nargs='+', type=str,\n                        help='Wandb run ids of the datasets to evaluate on.')\n    parser.add_argument('--assign_new_wandb_id', default=True,\n                        action=argparse.BooleanOptionalAction)\n    parser.add_argument('--answer_fractions_mode', type=str, default='default')\n    parser.add_argument(\n        \"--experiment_lot\", type=str, default='Unnamed Experiment',\n        help=\"Keep default wandb clean.\")\n    parser.add_argument(\n        \"--entity\", type=str, help=\"Wandb entity.\")\n\n    args, unknown = parser.parse_known_args()\n    if unknown:\n        raise ValueError(f'Unknown args: {unknown}')\n\n    wandb_runids = args.wandb_runids\n    for wid in wandb_runids:\n        logging.info('Evaluating wandb_runid `%s`.', wid)\n        analyze_run(\n            wid, args.assign_new_wandb_id, args.answer_fractions_mode,\n            experiment_lot=args.experiment_lot, entity=args.entity)\n"
  },
  {
    "path": "semantic_uncertainty/compute_uncertainty_measures.py",
    "content": "\"\"\"Compute uncertainty measures after generating answers.\"\"\"\nfrom collections import defaultdict\nimport logging\nimport os\nimport pickle\nimport numpy as np\nimport wandb\n\nfrom analyze_results import analyze_run\nfrom uncertainty.data.data_utils import load_ds\nfrom uncertainty.uncertainty_measures.p_ik import get_p_ik\nfrom uncertainty.uncertainty_measures.semantic_entropy import get_semantic_ids\nfrom uncertainty.uncertainty_measures.semantic_entropy import logsumexp_by_id\nfrom uncertainty.uncertainty_measures.semantic_entropy import predictive_entropy\nfrom uncertainty.uncertainty_measures.semantic_entropy import predictive_entropy_rao\nfrom uncertainty.uncertainty_measures.semantic_entropy import cluster_assignment_entropy\nfrom uncertainty.uncertainty_measures.semantic_entropy import context_entails_response\nfrom uncertainty.uncertainty_measures.semantic_entropy import EntailmentDeberta\nfrom uncertainty.uncertainty_measures.semantic_entropy import EntailmentGPT4\nfrom uncertainty.uncertainty_measures.semantic_entropy import EntailmentGPT35\nfrom uncertainty.uncertainty_measures.semantic_entropy import EntailmentGPT4Turbo\nfrom uncertainty.uncertainty_measures.semantic_entropy import EntailmentLlama\nfrom uncertainty.uncertainty_measures import p_true as p_true_utils\nfrom uncertainty.utils import utils\n\n\nutils.setup_logger()\n\nEXP_DETAILS = 'experiment_details.pkl'\n\n\ndef main(args):\n\n    if args.train_wandb_runid is None:\n        args.train_wandb_runid = args.eval_wandb_runid\n\n    user = os.environ['USER']\n    scratch_dir = os.getenv('SCRATCH_DIR', '.')\n    wandb_dir = f'{scratch_dir}/{user}/uncertainty'\n    slurm_jobid = os.getenv('SLURM_JOB_ID', None)\n    project = \"semantic_uncertainty\" if not args.debug else \"semantic_uncertainty_debug\"\n    if args.assign_new_wandb_id:\n        logging.info('Assign new wandb_id.')\n        api = wandb.Api()\n        old_run = api.run(f'{args.restore_entity_eval}/{project}/{args.eval_wandb_runid}')\n        wandb.init(\n            entity=args.entity,\n            project=project,\n            dir=wandb_dir,\n            notes=f'slurm_id: {slurm_jobid}, experiment_lot: {args.experiment_lot}',\n            # For convenience, keep any 'generate_answers' configs from old run,\n            # but overwrite the rest!\n            # NOTE: This means any special configs affecting this script must be\n            # called again when calling this script!\n            config={**old_run.config, **args.__dict__},\n        )\n\n        def restore(filename):\n            old_run.file(filename).download(\n                replace=True, exist_ok=False, root=wandb.run.dir)\n\n            class Restored:\n                name = f'{wandb.run.dir}/{filename}'\n\n            return Restored\n    else:\n        logging.info('Reuse active wandb id.')\n\n        def restore(filename):\n            class Restored:\n                name = f'{wandb.run.dir}/{filename}'\n            return Restored\n\n    if args.train_wandb_runid != args.eval_wandb_runid:\n        logging.info(\n            \"Distribution shift for p_ik. 
Training on embeddings from run %s but evaluating on run %s\",\n            args.train_wandb_runid, args.eval_wandb_runid)\n\n        is_ood_eval = True  # pylint: disable=invalid-name\n        api = wandb.Api()\n        old_run_train = api.run(f'{args.restore_entity_train}/semantic_uncertainty/{args.train_wandb_runid}')\n        filename = 'train_generations.pkl'\n        old_run_train.file(filename).download(\n            replace=True, exist_ok=False, root=wandb.run.dir)\n        with open(f'{wandb.run.dir}/{filename}', \"rb\") as infile:\n            train_generations = pickle.load(infile)\n        wandb.config.update(\n            {\"ood_training_set\": old_run_train.config['dataset']}, allow_val_change=True)\n    else:\n        is_ood_eval = False  # pylint: disable=invalid-name\n        if args.compute_p_ik or args.compute_p_ik_answerable:\n            train_generations_pickle = restore('train_generations.pkl')\n            with open(train_generations_pickle.name, 'rb') as infile:\n                train_generations = pickle.load(infile)\n\n    wandb.config.update({\"is_ood_eval\": is_ood_eval}, allow_val_change=True)\n\n    # Load entailment model.\n    if args.compute_predictive_entropy:\n        logging.info('Beginning loading for entailment model.')\n        if args.entailment_model == 'deberta':\n            entailment_model = EntailmentDeberta()\n        elif args.entailment_model == 'gpt-4':\n            entailment_model = EntailmentGPT4(args.entailment_cache_id, args.entailment_cache_only)\n        elif args.entailment_model == 'gpt-3.5':\n            entailment_model = EntailmentGPT35(args.entailment_cache_id, args.entailment_cache_only)\n        elif args.entailment_model == 'gpt-4-turbo':\n            entailment_model = EntailmentGPT4Turbo(args.entailment_cache_id, args.entailment_cache_only)\n        elif 'llama' in args.entailment_model.lower():\n            entailment_model = EntailmentLlama(args.entailment_cache_id, args.entailment_cache_only, args.entailment_model)\n        else:\n            raise ValueError\n        logging.info('Entailment model loading complete.')\n\n    if args.compute_p_true_in_compute_stage:\n        # This is usually not called.\n        old_exp = restore(EXP_DETAILS)\n        with open(old_exp.name, \"rb\") as infile:\n            old_exp = pickle.load(infile)\n\n        if args.reuse_entailment_model:\n            pt_model = entailment_model.model\n        else:\n            pt_model = utils.init_model(old_exp['args'])\n\n        pt_train_dataset, pt_validation_dataset = load_ds(\n            old_exp['args'].dataset, add_options=old_exp['args'].use_mc_options,\n            seed=args.random_seed)\n        del pt_validation_dataset\n\n        # Reduce num generations used in p_true if needed!\n        if not args.use_all_generations:\n            if args.use_num_generations == -1:\n                raise ValueError\n            num_gen = args.use_num_generations\n        else:\n            num_gen = args.num_generations\n\n        p_true_few_shot_prompt, p_true_responses, len_p_true = p_true_utils.construct_few_shot_prompt(\n            model=pt_model,\n            dataset=pt_train_dataset,\n            indices=old_exp['p_true_indices'],\n            prompt=old_exp['prompt'],\n            brief=old_exp['BRIEF'],\n            brief_always=old_exp['args'].brief_always and old_exp['args'].enable_brief,\n            make_prompt=utils.get_make_prompt(old_exp['args']),\n            num_generations=num_gen,\n            
metric=utils.get_metric(old_exp['args'].metric))\n        del p_true_responses\n        wandb.config.update(\n            {'p_true_num_fewshot': len_p_true}, allow_val_change=True)\n        wandb.log(dict(len_p_true=len_p_true))\n\n        logging.info('Generated few-shot prompt for p_true.')\n        logging.info(80*'#')\n        logging.info('p_true_few_shot_prompt: %s', p_true_few_shot_prompt)\n        logging.info(80*'#')\n\n    if args.recompute_accuracy:\n        # This is usually not enabled.\n        logging.warning('Recompute accuracy enabled. This does not apply to precomputed p_true!')\n        metric = utils.get_metric(args.metric)\n\n    # Restore outputs from `generate_answrs.py` run.\n    result_dict_pickle = restore('uncertainty_measures.pkl')\n    with open(result_dict_pickle.name, \"rb\") as infile:\n        result_dict = pickle.load(infile)\n    result_dict['semantic_ids'] = []\n\n    validation_generations_pickle = restore('validation_generations.pkl')\n    with open(validation_generations_pickle.name, 'rb') as infile:\n        validation_generations = pickle.load(infile)\n\n    entropies = defaultdict(list)\n    validation_embeddings, validation_is_true, validation_answerable = [], [], []\n    p_trues = []\n    count = 0  # pylint: disable=invalid-name\n\n    def is_answerable(generation):\n        return len(generation['reference']['answers']['text']) > 0\n\n    # Loop over datapoints and compute validation embeddings and entropies.\n    for idx, tid in enumerate(validation_generations):\n\n        example = validation_generations[tid]\n        question = example['question']\n        context = example['context']\n        full_responses = example[\"responses\"]\n        most_likely_answer = example['most_likely_answer']\n\n        if not args.use_all_generations:\n            if args.use_num_generations == -1:\n                raise ValueError\n            responses = [fr[0] for fr in full_responses[:args.use_num_generations]]\n        else:\n            responses = [fr[0] for fr in full_responses]\n\n        if args.recompute_accuracy:\n            logging.info('Recomputing accuracy!')\n            if is_answerable(example):\n                acc = metric(most_likely_answer['response'], example, None)\n            else:\n                acc = 0.0  # pylint: disable=invalid-name\n            validation_is_true.append(acc)\n            logging.info('Recomputed accuracy!')\n\n        else:\n            validation_is_true.append(most_likely_answer['accuracy'])\n\n        validation_answerable.append(is_answerable(example))\n        validation_embeddings.append(most_likely_answer['embedding'])\n        logging.info('validation_is_true: %f', validation_is_true[-1])\n\n        if args.compute_predictive_entropy:\n            # Token log likelihoods. 
Shape = (n_sample, n_tokens)\n            if not args.use_all_generations:\n                log_liks = [r[1] for r in full_responses[:args.use_num_generations]]\n            else:\n                log_liks = [r[1] for r in full_responses]\n\n            for i in log_liks:\n                assert i\n\n            if args.compute_context_entails_response:\n                # Compute context entails answer baseline.\n                entropies['context_entails_response'].append(context_entails_response(\n                    context, responses, entailment_model))\n\n            if args.condition_on_question and args.entailment_model == 'deberta':\n                responses = [f'{question} {r}' for r in responses]\n\n            # Compute semantic ids.\n            semantic_ids = get_semantic_ids(\n                responses, model=entailment_model,\n                strict_entailment=args.strict_entailment, example=example)\n\n            result_dict['semantic_ids'].append(semantic_ids)\n\n            # Compute entropy from frequencies of cluster assignments.\n            entropies['cluster_assignment_entropy'].append(cluster_assignment_entropy(semantic_ids))\n\n            # Length normalization of generation probabilities.\n            log_liks_agg = [np.mean(log_lik) for log_lik in log_liks]\n\n            # Compute naive entropy.\n            entropies['regular_entropy'].append(predictive_entropy(log_liks_agg))\n\n            # Compute semantic entropy.\n            log_likelihood_per_semantic_id = logsumexp_by_id(semantic_ids, log_liks_agg, agg='sum_normalized')\n            pe = predictive_entropy_rao(log_likelihood_per_semantic_id)\n            entropies['semantic_entropy'].append(pe)\n\n            # pylint: disable=invalid-name\n            log_str = 'semantic_ids: %s, avg_token_log_likelihoods: %s, entropies: %s'\n            entropies_fmt = ', '.join([f'{i}:{j[-1]:.2f}' for i, j in entropies.items()])\n            # pylint: enable=invalid-name\n            logging.info(80*'#')\n            logging.info('NEW ITEM %d at id=`%s`.', idx, tid)\n            logging.info('Context:')\n            logging.info(example['context'])\n            logging.info('Question:')\n            logging.info(question)\n            logging.info('True Answers:')\n            logging.info(example['reference'])\n            logging.info('Low Temperature Generation:')\n            logging.info(most_likely_answer['response'])\n            logging.info('Low Temperature Generation Accuracy:')\n            logging.info(most_likely_answer['accuracy'])\n            logging.info('High Temp Generation:')\n            logging.info([r[0] for r in full_responses])\n            logging.info('High Temp Generation:')\n            logging.info(log_str, semantic_ids, log_liks_agg, entropies_fmt)\n\n        if args.compute_p_true_in_compute_stage:\n            p_true = p_true_utils.calculate_p_true(\n                pt_model, question, most_likely_answer['response'],\n                responses, p_true_few_shot_prompt,\n                hint=old_exp['args'].p_true_hint)\n            p_trues.append(p_true)\n            logging.info('p_true: %s', np.exp(p_true))\n\n        count += 1\n        if count >= args.num_eval_samples:\n            logging.info('Breaking out of main loop.')\n            break\n\n    logging.info('Accuracy on original task: %f', np.mean(validation_is_true))\n    validation_is_false = [1.0 - is_t for is_t in validation_is_true]\n    result_dict['validation_is_false'] = validation_is_false\n\n    
validation_unanswerable = [1.0 - is_a for is_a in validation_answerable]\n    result_dict['validation_unanswerable'] = validation_unanswerable\n    logging.info('Unanswerable prop on validation: %f', np.mean(validation_unanswerable))\n\n    if 'uncertainty_measures' not in result_dict:\n        result_dict['uncertainty_measures'] = dict()\n\n    if args.compute_predictive_entropy:\n        result_dict['uncertainty_measures'].update(entropies)\n\n    if args.compute_p_ik or args.compute_p_ik_answerable:\n        # Assemble training data for embedding classification.\n        train_is_true, train_embeddings, train_answerable = [], [], []\n        for tid in train_generations:\n            most_likely_answer = train_generations[tid]['most_likely_answer']\n            train_embeddings.append(most_likely_answer['embedding'])\n            train_is_true.append(most_likely_answer['accuracy'])\n            train_answerable.append(is_answerable(train_generations[tid]))\n        train_is_false = [0.0 if is_t else 1.0 for is_t in train_is_true]\n        train_unanswerable = [0.0 if is_t else 1.0 for is_t in train_answerable]\n        logging.info('Unanswerable prop on p_ik training: %f', np.mean(train_unanswerable))\n\n    if args.compute_p_ik:\n        logging.info('Starting training p_ik on train embeddings.')\n        # Train classifier of correct/incorrect from embeddings.\n        p_ik_predictions = get_p_ik(\n            train_embeddings=train_embeddings, is_false=train_is_false,\n            eval_embeddings=validation_embeddings, eval_is_false=validation_is_false)\n        result_dict['uncertainty_measures']['p_ik'] = p_ik_predictions\n        logging.info('Finished training p_ik on train embeddings.')\n\n    if args.compute_p_ik_answerable:\n        # Train classifier of answerable/unanswerable.\n        p_ik_predictions = get_p_ik(\n            train_embeddings=train_embeddings, is_false=train_unanswerable,\n            eval_embeddings=validation_embeddings, eval_is_false=validation_unanswerable)\n        result_dict['uncertainty_measures']['p_ik_unanswerable'] = p_ik_predictions\n\n    if args.compute_p_true_in_compute_stage:\n        result_dict['uncertainty_measures']['p_false'] = [1 - p for p in p_trues]\n        result_dict['uncertainty_measures']['p_false_fixed'] = [1 - np.exp(p) for p in p_trues]\n\n    utils.save(result_dict, 'uncertainty_measures.pkl')\n\n    if args.compute_predictive_entropy:\n        entailment_model.save_prediction_cache()\n\n    if args.analyze_run:\n        # Follow up with computation of aggregate performance metrics.\n        logging.info(50 * '#X')\n        logging.info('STARTING `analyze_run`!')\n        analyze_run(wandb.run.id)\n        logging.info(50 * '#X')\n        logging.info('FINISHED `analyze_run`!')\n\n\nif __name__ == '__main__':\n    parser = utils.get_parser(stages=['compute'])\n    args, unknown = parser.parse_known_args()  # pylint: disable=invalid-name\n    if unknown:\n        raise ValueError(f'Unknown args: {unknown}')\n\n    logging.info(\"Args: %s\", args)\n\n    main(args)\n"
  },
  {
    "path": "semantic_uncertainty/generate_answers.py",
    "content": "\"\"\"Sample answers from LLMs on QA task.\"\"\"\nimport gc\nimport os\nimport logging\nimport random\nfrom tqdm import tqdm\n\nimport numpy as np\nimport torch\nimport wandb\n\nfrom uncertainty.data.data_utils import load_ds\nfrom uncertainty.utils import utils\nfrom uncertainty.uncertainty_measures import p_true as p_true_utils\nfrom compute_uncertainty_measures import main as main_compute\n\n\nutils.setup_logger()\n\n\ndef main(args):\n\n    # Setup run.\n    if args.dataset == 'svamp':\n        if not args.use_context:\n            logging.info('Forcing `use_context=True` for svamp dataset.')\n            args.use_context = True\n    elif args.dataset == 'squad':\n        if not args.answerable_only:\n            logging.info('Forcing `answerable_only=True` for squad dataset.')\n            args.answerable_only = True\n\n    experiment_details = {'args': args}\n    random.seed(args.random_seed)\n    user = os.environ['USER']\n    slurm_jobid = os.getenv('SLURM_JOB_ID', None)\n    scratch_dir = os.getenv('SCRATCH_DIR', '.')\n    if not os.path.exists(f\"{scratch_dir}/{user}/uncertainty\"):\n        os.makedirs(f\"{scratch_dir}/{user}/uncertainty\")\n\n    wandb.init(\n        entity=args.entity,\n        project=\"semantic_uncertainty\" if not args.debug else \"semantic_uncertainty_debug\",\n        dir=f\"{scratch_dir}/{user}/uncertainty\",\n        config=args,\n        notes=f'slurm_id: {slurm_jobid}, experiment_lot: {args.experiment_lot}',\n    )\n    logging.info('Finished wandb init.')\n\n    # Get accuracy metric.\n    metric = utils.get_metric(args.metric)\n\n    # Load dataset.\n    train_dataset, validation_dataset = load_ds(\n        args.dataset, add_options=args.use_mc_options, seed=args.random_seed)\n    if args.ood_train_dataset is not None:\n        logging.warning(\n            'Using OOD dataset %s to construct few-shot prompts and train p_ik.',\n            args.ood_train_dataset)\n        # Get indices of answerable and unanswerable questions and construct prompt.\n        train_dataset, _ = load_ds(\n            args.ood_train_dataset, add_options=args.use_mc_options, seed=args.random_seed)\n    if not isinstance(train_dataset, list):\n        logging.info('Train dataset: %s', train_dataset)\n\n    # Get indices of answerable and unanswerable questions and construct prompt.\n    answerable_indices, unanswerable_indices = utils.split_dataset(train_dataset)\n\n    if args.answerable_only:\n        unanswerable_indices = []\n        val_answerable, val_unanswerable = utils.split_dataset(validation_dataset)\n        del val_unanswerable\n        validation_dataset = [validation_dataset[i] for i in val_answerable]\n\n    prompt_indices = random.sample(answerable_indices, args.num_few_shot)\n    experiment_details['prompt_indices'] = prompt_indices\n    remaining_answerable = list(set(answerable_indices) - set(prompt_indices))\n\n    # Create Few-Shot prompt.\n    make_prompt = utils.get_make_prompt(args)\n    BRIEF = utils.BRIEF_PROMPTS[args.brief_prompt]\n    arg = args.brief_always if args.enable_brief else True\n    prompt = utils.construct_fewshot_prompt_from_indices(\n        train_dataset, prompt_indices, BRIEF, arg, make_prompt)\n    experiment_details['prompt'] = prompt\n    experiment_details['BRIEF'] = BRIEF\n    logging.info('Prompt is: %s', prompt)\n\n    # Initialize model.\n    model = utils.init_model(args)\n\n    # Initialize prompt for p_true baseline.\n    if args.compute_p_true:\n        logging.info(80*'#')\n        logging.info('Constructing few-shot prompt for 
p_true.')\n\n        p_true_indices = random.sample(answerable_indices, args.p_true_num_fewshot)\n        remaining_answerable = list(set(remaining_answerable) - set(p_true_indices))\n        p_true_few_shot_prompt, p_true_responses, len_p_true = p_true_utils.construct_few_shot_prompt(\n            model=model, dataset=train_dataset, indices=p_true_indices,\n            prompt=prompt, brief=BRIEF,\n            brief_always=args.brief_always and args.enable_brief,\n            make_prompt=make_prompt, num_generations=args.num_generations,\n            metric=metric)\n        wandb.config.update(\n            {'p_true_num_fewshot': len_p_true}, allow_val_change=True)\n        wandb.log(dict(len_p_true=len_p_true))\n        experiment_details['p_true_indices'] = p_true_indices\n        experiment_details['p_true_responses'] = p_true_responses\n        experiment_details['p_true_few_shot_prompt'] = p_true_few_shot_prompt\n        logging.info('Finished constructing few-shot prompt for p_true.')\n        logging.info(80*'#')\n        logging.info('p_true_few_shot_prompt: %s', p_true_few_shot_prompt)\n        logging.info(80*'#')\n\n    # Start answer generation.\n    logging.info(80 * '=')\n    logging.info('Generating answers: ')\n    logging.info(80 * '=')\n    for dataset_split in ['train', 'validation']:\n        logging.info(80 * 'x')\n        logging.info('Starting with dataset_split %s.', dataset_split)\n        logging.info(80 * 'x')\n\n        # This will store all input data and model predictions.\n        accuracies, generations, results_dict, p_trues = [], {}, {}, []\n\n        if dataset_split == 'train':\n            if not args.get_training_set_generations:\n                logging.info('Skip training data.')\n                continue\n            dataset = train_dataset\n            possible_indices = list(set(remaining_answerable) | set(unanswerable_indices))\n\n        else:\n            dataset = validation_dataset\n            possible_indices = range(0, len(dataset))\n\n        # Evaluate over random subset of the datasets.\n        indices = random.sample(possible_indices, min(args.num_samples, len(dataset)))\n        experiment_details[dataset_split] = {'indices': indices}\n\n        if args.num_samples > len(dataset):\n            logging.warning('Not enough samples in dataset. 
Using all %d samples.', len(dataset))\n\n        it = 0\n        for index in tqdm(indices):\n            if (it + 1) % 10 == 0:\n                gc.collect()\n                torch.cuda.empty_cache()\n            it += 1\n\n            # Grab example at index.\n            example = dataset[index]\n            question, context = example[\"question\"], example['context']\n            generations[example['id']] = {'question': question, 'context': context}\n            correct_answer = example['answers']['text']\n\n            current_input = make_prompt(\n                context, question, None, BRIEF, args.brief_always and args.enable_brief)\n            local_prompt = prompt + current_input\n\n            logging.info('Current input: '.ljust(15) + current_input)\n\n            full_responses = []\n\n            # We sample one low temperature answer on which we will compute the\n            # accuracy and args.num_generations high temperature answers which will\n            # be used to estimate the entropy variants.\n\n            if dataset_split == 'train' and args.get_training_set_generations_most_likely_only:\n                num_generations = 1\n            else:\n                num_generations = args.num_generations + 1\n\n            for i in range(num_generations):\n\n                # Temperature for first generation is always `0.1`.\n                temperature = 0.1 if i == 0 else args.temperature\n\n                predicted_answer, token_log_likelihoods, embedding = model.predict(\n                    local_prompt, temperature)\n                embedding = embedding.cpu() if embedding is not None else None\n\n                # Only compute accuracy if question is answerable.\n                compute_acc = args.compute_accuracy_at_all_temps or (i == 0)\n                if correct_answer and compute_acc:\n                    acc = metric(predicted_answer, example, model)\n                else:\n                    acc = 0.0  # pylint: disable=invalid-name\n\n                if i == 0:\n                    logging.info('Iteration ' + str(it) + ':  ' + 80*'#')\n                    if args.use_context:\n                        logging.info('context: '.ljust(15) + str(context))\n                    logging.info('question: '.ljust(15) + question)\n                    logging.info('low-t prediction: '.ljust(15) + predicted_answer)\n                    logging.info('correct answer: '.ljust(15) + str(correct_answer))\n                    logging.info('accuracy: '.ljust(15) + str(acc))\n\n                    accuracies.append(acc)\n                    most_likely_answer_dict = {\n                        'response': predicted_answer,\n                        'token_log_likelihoods': token_log_likelihoods,\n                        'embedding': embedding,\n                        'accuracy': acc}\n                    generations[example['id']].update({\n                        'most_likely_answer': most_likely_answer_dict,\n                        'reference': utils.get_reference(example)})\n\n                else:\n                    logging.info('high-t prediction '.ljust(15) + str(i) + ' : ' + predicted_answer)\n                    # Aggregate predictions over num_generations.\n                    full_responses.append(\n                        (predicted_answer, token_log_likelihoods, embedding, acc))\n\n            # Append all predictions for this example to `generations`.\n            generations[example['id']]['responses'] = full_responses\n\n            if args.compute_p_true and dataset_split == 'validation':\n                # Already compute p_true here. Avoid cost of generations in compute_uncertainty script.\n                p_true = p_true_utils.calculate_p_true(\n                    model, question, most_likely_answer_dict['response'],\n                    [r[0] for r in full_responses], p_true_few_shot_prompt,\n                    hint=args.p_true_hint)\n                p_trues.append(p_true)\n                logging.info('p_true: %s', p_true)\n\n        # Save generations for that split.\n        utils.save(generations, f'{dataset_split}_generations.pkl')\n\n        # Log overall accuracy.\n        accuracy = np.mean(accuracies)\n        print(f\"Overall {dataset_split} split accuracy: {accuracy}\")\n        wandb.log({f\"{dataset_split}_accuracy\": accuracy})\n\n        if dataset_split == 'validation':\n            if args.compute_p_true:\n                results_dict['uncertainty_measures'] = {\n                    'p_false':  [1 - p for p in p_trues],\n                    'p_false_fixed':  [1 - np.exp(p) for p in p_trues],\n                }\n            utils.save(results_dict, 'uncertainty_measures.pkl')\n\n    utils.save(experiment_details, 'experiment_details.pkl')\n    logging.info('Run complete.')\n    del model\n\n\nif __name__ == '__main__':\n\n    parser = utils.get_parser()\n    args, unknown = parser.parse_known_args()\n    logging.info('Starting new run with args: %s', args)\n\n    if unknown:\n        raise ValueError(f'Unknown args: {unknown}')\n\n    if args.compute_uncertainties:\n        args.assign_new_wandb_id = False\n\n    # First sample generations from LLM.\n    logging.info('STARTING `generate_answers`!')\n    main(args)\n    logging.info('FINISHED `generate_answers`!')\n\n    if args.compute_uncertainties:\n        # Follow with uncertainty calculation script by default.\n        args.assign_new_wandb_id = False\n        gc.collect()\n        torch.cuda.empty_cache()\n        logging.info(50 * '#X')\n        logging.info('STARTING `compute_uncertainty_measures`!')\n        main_compute(args)\n        logging.info('FINISHED `compute_uncertainty_measures`!')\n"
  },
  {
    "path": "semantic_uncertainty/uncertainty/__init__.py",
    "content": ""
  },
  {
    "path": "semantic_uncertainty/uncertainty/data/data_utils.py",
    "content": "\"\"\"Data Loading Utilities.\"\"\"\nimport os\nimport json\nimport hashlib\nimport datasets\n\n\ndef load_ds(dataset_name, seed, add_options=None):\n    \"\"\"Load dataset.\"\"\"\n    user = os.environ['USER']\n\n    train_dataset, validation_dataset = None, None\n    if dataset_name == \"squad\":\n        dataset = datasets.load_dataset(\"squad_v2\")\n        train_dataset = dataset[\"train\"]\n        validation_dataset = dataset[\"validation\"]\n\n    elif dataset_name == 'svamp':\n        dataset = datasets.load_dataset('ChilleD/SVAMP')\n        train_dataset = dataset[\"train\"]\n        validation_dataset = dataset[\"test\"]\n\n        reformat = lambda x: {\n            'question': x['Question'], 'context': x['Body'], 'type': x['Type'],\n            'equation': x['Equation'], 'id': x['ID'],\n            'answers': {'text': [str(x['Answer'])]}}\n\n        train_dataset = [reformat(d) for d in train_dataset]\n        validation_dataset = [reformat(d) for d in validation_dataset]\n\n    elif dataset_name == 'nq':\n        dataset = datasets.load_dataset(\"nq_open\")\n        train_dataset = dataset[\"train\"]\n        validation_dataset = dataset[\"validation\"]\n        md5hash = lambda s: str(int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16))\n\n        reformat = lambda x: {\n            'question': x['question']+'?',\n            'answers': {'text': x['answer']},\n            'context': '',\n            'id': md5hash(str(x['question'])),\n        }\n\n        train_dataset = [reformat(d) for d in train_dataset]\n        validation_dataset = [reformat(d) for d in validation_dataset]\n\n    elif dataset_name == \"trivia_qa\":\n        dataset = datasets.load_dataset('TimoImhof/TriviaQA-in-SQuAD-format')['unmodified']\n        dataset = dataset.train_test_split(test_size=0.2, seed=seed)\n        train_dataset = dataset['train']\n        validation_dataset = dataset['test']\n\n    elif dataset_name == \"bioasq\":\n        # http://participants-area.bioasq.org/datasets/ we are using training 11b\n        # could also download from here https://zenodo.org/records/7655130\n        scratch_dir = os.getenv('SCRATCH_DIR', '.')\n        path = f\"{scratch_dir}/{user}/semantic_uncertainty/data/bioasq/training11b.json\"\n        with open(path, \"rb\") as file:\n            data = json.load(file)\n\n        questions = data[\"questions\"]\n        dataset_dict = {\n            \"question\": [],\n            \"answers\": [],\n            \"id\": []\n        }\n\n        for question in questions:\n            if \"exact_answer\" not in question:\n                continue\n            dataset_dict[\"question\"].append(question[\"body\"])\n            if \"exact_answer\" in question:\n\n                if isinstance(question['exact_answer'], list):\n                    exact_answers = [\n                        ans[0] if isinstance(ans, list) else ans\n                        for ans in question['exact_answer']\n                    ]\n                else:\n                    exact_answers = [question['exact_answer']]\n\n                dataset_dict[\"answers\"].append({\n                    \"text\": exact_answers,\n                    \"answer_start\": [0] * len(question[\"exact_answer\"])\n                })\n            else:\n                dataset_dict[\"answers\"].append({\n                    \"text\": question[\"ideal_answer\"],\n                    \"answer_start\": [0]\n                })\n            dataset_dict[\"id\"].append(question[\"id\"])\n\n            
dataset_dict[\"context\"] = [None] * len(dataset_dict[\"id\"])\n\n        dataset = datasets.Dataset.from_dict(dataset_dict)\n\n        # Split into training and validation set.\n        dataset = dataset.train_test_split(test_size=0.8, seed=seed)\n        train_dataset = dataset['train']\n        validation_dataset = dataset['test']\n\n    else:\n        raise ValueError\n\n    return train_dataset, validation_dataset\n"
  },
  {
    "path": "semantic_uncertainty/uncertainty/models/__init__.py",
    "content": ""
  },
  {
    "path": "semantic_uncertainty/uncertainty/models/base_model.py",
    "content": "from abc import ABC, abstractmethod\nfrom typing import List, Text\n\n\nSTOP_SEQUENCES = ['\\n\\n\\n\\n', '\\n\\n\\n', '\\n\\n', '\\n', 'Question:', 'Context:']\n\n\nclass BaseModel(ABC):\n\n    stop_sequences: List[Text]\n\n    @abstractmethod\n    def predict(self, input_data, temperature):\n        pass\n\n    @abstractmethod\n    def get_p_true(self, input_data):\n        pass\n"
  },
  {
    "path": "semantic_uncertainty/uncertainty/models/huggingface_models.py",
    "content": "\"\"\"Implement HuggingfaceModel models.\"\"\"\nimport copy\nimport logging\nfrom collections import Counter\nimport torch\n\nimport accelerate\n\nfrom transformers import AutoTokenizer\nfrom transformers import AutoConfig\nfrom transformers import AutoModelForCausalLM\nfrom transformers import BitsAndBytesConfig\nfrom transformers import StoppingCriteria\nfrom transformers import StoppingCriteriaList\nfrom huggingface_hub import snapshot_download\n\n\nfrom uncertainty.models.base_model import BaseModel\nfrom uncertainty.models.base_model import STOP_SEQUENCES\n\n\nclass StoppingCriteriaSub(StoppingCriteria):\n    \"\"\"Stop generations when they match a particular text or token.\"\"\"\n    def __init__(self, stops, tokenizer, match_on='text', initial_length=None):\n        super().__init__()\n        self.stops = stops\n        self.initial_length = initial_length\n        self.tokenizer = tokenizer\n        self.match_on = match_on\n        if self.match_on == 'tokens':\n            self.stops = [torch.tensor(self.tokenizer.encode(i)).to('cuda') for i in self.stops]\n            print(self.stops)\n\n    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):\n        del scores  # `scores` arg is required by StoppingCriteria but unused by us.\n        for stop in self.stops:\n            if self.match_on == 'text':\n                generation = self.tokenizer.decode(input_ids[0][self.initial_length:], skip_special_tokens=False)\n                match = stop in generation\n            elif self.match_on == 'tokens':\n                # Can be dangerous due to tokenizer ambiguities.\n                match = stop in input_ids[0][-len(stop):]\n            else:\n                raise\n            if match:\n                return True\n        return False\n\n\ndef remove_split_layer(device_map_in):\n    \"\"\"Modify device maps s.t. 
individual layers are not spread across devices.\"\"\"\n\n    device_map = copy.deepcopy(device_map_in)\n    destinations = list(device_map.keys())\n\n    counts = Counter(['.'.join(i.split('.')[:2]) for i in destinations])\n\n    found_split = False\n    for layer, count in counts.items():\n        if count == 1:\n            continue\n\n        if found_split:\n            # Only triggers if we find more than one split layer.\n            raise ValueError(\n                'More than one split layer.\\n'\n                f'Currently at layer {layer}.\\n'\n                f'In map: {device_map_in}\\n'\n                f'Out map: {device_map}\\n')\n\n        logging.info(f'Split layer is {layer}.')\n\n        # Remove split for that layer.\n        for name in list(device_map.keys()):\n            if name.startswith(layer):\n                print(f'pop {name}')\n                device = device_map.pop(name)\n\n        device_map[layer] = device\n        found_split = True\n\n    return device_map\n\n\nclass HuggingfaceModel(BaseModel):\n    \"\"\"Hugging Face Model.\"\"\"\n\n    def __init__(self, model_name, stop_sequences=None, max_new_tokens=None):\n        if max_new_tokens is None:\n            raise\n        self.max_new_tokens = max_new_tokens\n\n        if stop_sequences == 'default':\n            stop_sequences = STOP_SEQUENCES\n\n        if 'llama' in model_name.lower():\n\n            if model_name.endswith('-8bit'):\n                kwargs = {'quantization_config': BitsAndBytesConfig(\n                    load_in_8bit=True,)}\n                model_name = model_name[:-len('-8bit')]\n                eightbit = True\n            else:\n                kwargs = {}\n                eightbit = False\n\n            if 'Llama-2' in model_name:\n                base = 'meta-llama'\n                model_name = model_name + '-hf'\n            else:\n                base = 'huggyllama'\n\n            self.tokenizer = AutoTokenizer.from_pretrained(\n                f\"{base}/{model_name}\", device_map=\"auto\",\n                token_type_ids=None)\n\n            llama65b = '65b' in model_name and base == 'huggyllama'\n            llama2_70b = '70b' in model_name and base == 'meta-llama'\n\n            if ('7b' in model_name or '13b' in model_name) or eightbit:\n                self.model = AutoModelForCausalLM.from_pretrained(\n                    f\"{base}/{model_name}\", device_map=\"auto\",\n                    max_memory={0: '80GIB'}, **kwargs,)\n\n            elif llama2_70b or llama65b:\n                path = snapshot_download(\n                    repo_id=f'{base}/{model_name}',\n                    allow_patterns=['*.json', '*.model', '*.safetensors'],\n                    ignore_patterns=['pytorch_model.bin.index.json']\n                )\n                config = AutoConfig.from_pretrained(f\"{base}/{model_name}\")\n                with accelerate.init_empty_weights():\n                    self.model = AutoModelForCausalLM.from_config(config)\n                self.model.tie_weights()\n                max_mem = 15 * 4686198491\n\n                device_map = accelerate.infer_auto_device_map(\n                    self.model.model,\n                    max_memory={0: max_mem, 1: max_mem},\n                    dtype='float16'\n                )\n                device_map = remove_split_layer(device_map)\n                full_model_device_map = {f\"model.{k}\": v for k, v in device_map.items()}\n                full_model_device_map[\"lm_head\"] = 0\n\n                self.model = 
accelerate.load_checkpoint_and_dispatch(\n                    self.model, path, device_map=full_model_device_map,\n                    dtype='float16', skip_keys='past_key_values')\n            else:\n                raise ValueError\n\n        elif 'mistral' in model_name.lower():\n\n            if model_name.endswith('-8bit'):\n                kwargs = {'quantization_config': BitsAndBytesConfig(\n                    load_in_8bit=True,)}\n                model_name = model_name[:-len('-8bit')]\n            elif model_name.endswith('-4bit'):\n                kwargs = {'quantization_config': BitsAndBytesConfig(\n                    load_in_4bit=True,)}\n                model_name = model_name[:-len('-4bit')]\n            else:\n                kwargs = {}\n\n            model_id = f'mistralai/{model_name}'\n            self.tokenizer = AutoTokenizer.from_pretrained(\n                model_id, device_map='auto', token_type_ids=None,\n                clean_up_tokenization_spaces=False)\n\n            self.model = AutoModelForCausalLM.from_pretrained(\n                model_id,\n                device_map='auto',\n                max_memory={0: '80GIB'},\n                **kwargs,\n            )\n\n        elif 'falcon' in model_name:\n            model_id = f'tiiuae/{model_name}'\n            self.tokenizer = AutoTokenizer.from_pretrained(\n                model_id, device_map='auto', token_type_ids=None,\n                clean_up_tokenization_spaces=False)\n\n            kwargs = {'quantization_config': BitsAndBytesConfig(\n                load_in_8bit=True,)}\n\n            self.model = AutoModelForCausalLM.from_pretrained(\n                model_id,\n                trust_remote_code=True,\n                device_map='auto',\n                **kwargs,\n            )\n        else:\n            raise ValueError\n\n        self.model_name = model_name\n        self.stop_sequences = stop_sequences + [self.tokenizer.eos_token]\n        self.token_limit = 4096 if 'Llama-2' in model_name else 2048\n\n    def predict(self, input_data, temperature, return_full=False):\n\n        # Tokenize the input prompt and move it to the GPU.\n        inputs = self.tokenizer(input_data, return_tensors=\"pt\").to(\"cuda\")\n\n        if 'llama' in self.model_name.lower() or 'falcon' in self.model_name or 'mistral' in self.model_name.lower():\n            if 'token_type_ids' in inputs:  # Some HF models have changed.\n                del inputs['token_type_ids']\n            pad_token_id = self.tokenizer.eos_token_id\n        else:\n            pad_token_id = None\n\n        if self.stop_sequences is not None:\n            stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(\n                stops=self.stop_sequences,\n                initial_length=len(inputs['input_ids'][0]),\n                tokenizer=self.tokenizer)])\n        else:\n            stopping_criteria = None\n\n        logging.debug('temperature: %f', temperature)\n        with torch.no_grad():\n            outputs = self.model.generate(\n                **inputs,\n                max_new_tokens=self.max_new_tokens,\n                return_dict_in_generate=True,\n                output_scores=True,\n                output_hidden_states=True,\n                temperature=temperature,\n                do_sample=True,\n                stopping_criteria=stopping_criteria,\n                pad_token_id=pad_token_id,\n            )\n\n        if len(outputs.sequences[0]) > self.token_limit:\n            raise ValueError(\n                'Generation exceeding token limit 
%d > %d',\n                len(outputs.sequences[0]), self.token_limit)\n\n        full_answer = self.tokenizer.decode(\n            outputs.sequences[0], skip_special_tokens=True)\n\n        if return_full:\n            return full_answer\n\n        # For some models, we need to remove the input_data from the answer.\n        if full_answer.startswith(input_data):\n            input_data_offset = len(input_data)\n        else:\n            raise ValueError('Have not tested this in a while.')\n\n        # Remove input from answer.\n        answer = full_answer[input_data_offset:]\n\n        # Remove stop_words from answer.\n        stop_at = len(answer)\n        sliced_answer = answer\n        if self.stop_sequences is not None:\n            for stop in self.stop_sequences:\n                if answer.endswith(stop):\n                    stop_at = len(answer) - len(stop)\n                    sliced_answer = answer[:stop_at]\n                    break\n            if not all([stop not in sliced_answer for stop in self.stop_sequences]):\n                error_msg = 'Error: Stop words not removed successfully!'\n                error_msg += f'Answer: >{answer}< '\n                error_msg += f'Sliced Answer: >{sliced_answer}<'\n                if 'falcon' not in self.model_name.lower():\n                    raise ValueError(error_msg)\n                else:\n                    logging.error(error_msg)\n\n        # Remove whitespaces from answer (in particular from beginning.)\n        sliced_answer = sliced_answer.strip()\n\n        # Get the number of tokens until the stop word comes up.\n        # Note: Indexing with `stop_at` already excludes the stop_token.\n        # Note: It's important we do this with full answer, since there might be\n        # non-trivial interactions between the input_data and generated part\n        # in tokenization (particularly around whitespaces.)\n        token_stop_index = self.tokenizer(full_answer[:input_data_offset + stop_at], return_tensors=\"pt\")['input_ids'].shape[1]\n        n_input_token = len(inputs['input_ids'][0])\n        n_generated = token_stop_index - n_input_token\n\n        if n_generated == 0:\n            logging.warning('Only stop_words were generated. For likelihoods and embeddings, taking stop word instead.')\n            n_generated = 1\n\n        # Get the last hidden state (last layer) and the last token's embedding of the answer.\n        # Note: We do not want this to be the stop token.\n\n        # outputs.hidden_state is a tuple of len = n_generated_tokens.\n        # The first hidden state is for the input tokens and is of shape\n        #     (n_layers) x (batch_size, input_size, hidden_size).\n        # (Note this includes the first generated token!)\n        # The remaining hidden states are for the remaining generated tokens and is of shape\n        #    (n_layers) x (batch_size, 1, hidden_size).\n\n        # Note: The output embeddings have the shape (batch_size, generated_length, hidden_size).\n        # We do not get embeddings for input_data! We thus subtract the n_tokens_in_input from\n        # token_stop_index to arrive at the right output.\n\n        if 'decoder_hidden_states' in outputs.keys():\n            hidden = outputs.decoder_hidden_states\n        else:\n            hidden = outputs.hidden_states\n\n        if len(hidden) == 1:\n            logging.warning(\n                'Taking first and only generation for hidden! 
'\n                'n_generated: %d, n_input_token: %d, token_stop_index %d, '\n                'last_token: %s, generation was: %s',\n                n_generated, n_input_token, token_stop_index,\n                self.tokenizer.decode(outputs['sequences'][0][-1]),\n                full_answer,\n                )\n            last_input = hidden[0]\n        elif ((n_generated - 1) >= len(hidden)):\n            # If access idx is larger/equal.\n            logging.error(\n                'Taking last state because n_generated is too large'\n                'n_generated: %d, n_input_token: %d, token_stop_index %d, '\n                'last_token: %s, generation was: %s, slice_answer: %s',\n                n_generated, n_input_token, token_stop_index,\n                self.tokenizer.decode(outputs['sequences'][0][-1]),\n                full_answer, sliced_answer\n                )\n            last_input = hidden[-1]\n        else:\n            last_input = hidden[n_generated - 1]\n\n        # Then access last layer for input\n        last_layer = last_input[-1]\n        # Then access last token in input.\n        last_token_embedding = last_layer[:, -1, :].cpu()\n\n        # Get log_likelihoods.\n        # outputs.scores are the logits for the generated token.\n        # outputs.scores is a tuple of len = n_generated_tokens.\n        # Each entry is shape (bs, vocabulary size).\n        # outputs.sequences is the sequence of all tokens: input and generated.\n        transition_scores = self.model.compute_transition_scores(\n            outputs.sequences, outputs.scores, normalize_logits=True)\n        # Transition_scores[0] only contains the scores for the first generated tokens.\n\n        log_likelihoods = [score.item() for score in transition_scores[0]]\n        if len(log_likelihoods) == 1:\n            logging.warning('Taking first and only generation for log likelihood!')\n            log_likelihoods = log_likelihoods\n        else:\n            log_likelihoods = log_likelihoods[:n_generated]\n\n        if len(log_likelihoods) == self.max_new_tokens:\n            logging.warning('Generation interrupted by max_token limit.')\n\n        if len(log_likelihoods) == 0:\n            raise ValueError\n\n        return sliced_answer, log_likelihoods, last_token_embedding\n\n    def get_p_true(self, input_data):\n        \"\"\"Get the probability of the model anwering A (True) for the given input.\"\"\"\n\n        input_data += ' A'\n        tokenized_prompt_true = self.tokenizer(input_data, return_tensors='pt').to('cuda')['input_ids']\n        # The computation of the negative log likelihoods follows:\n        # https://huggingface.co/docs/transformers/perplexity.\n\n        target_ids_true = tokenized_prompt_true.clone()\n        # Set all target_ids except the last one to -100.\n        target_ids_true[0, :-1] = -100\n\n        with torch.no_grad():\n            model_output_true = self.model(tokenized_prompt_true, labels=target_ids_true)\n\n        loss_true = model_output_true.loss\n\n        return -loss_true.item()\n"
  },
  {
    "path": "semantic_uncertainty/uncertainty/uncertainty_measures/p_ik.py",
    "content": "\"\"\"Predict model correctness from linear classifier.\"\"\"\nimport logging\nimport torch\nimport wandb\n\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.metrics import roc_auc_score\nfrom sklearn.model_selection import train_test_split\n\n\ndef get_p_ik(train_embeddings, is_false, eval_embeddings=None, eval_is_false=None):\n    \"\"\"Fit linear classifier to embeddings to predict model correctness.\"\"\"\n\n    logging.info('Accuracy of model on Task: %f.', 1 - torch.tensor(is_false).mean())  # pylint: disable=no-member\n\n    # Convert the list of tensors to a 2D tensor.\n    train_embeddings_tensor = torch.cat(train_embeddings, dim=0)  # pylint: disable=no-member\n    # Convert the tensor to a numpy array.\n    embeddings_array = train_embeddings_tensor.cpu().numpy()\n\n    # Split the data into training and test sets.\n    X_train, X_test, y_train, y_test = train_test_split(  # pylint: disable=invalid-name\n        embeddings_array, is_false, test_size=0.2, random_state=42)  # pylint: disable=invalid-name\n\n    # Fit a logistic regression model.\n    model = LogisticRegression()\n    model.fit(X_train, y_train)\n\n    # Predict deterministically and probabilistically and compute accuracy and auroc for all splits.\n    X_eval = torch.cat(eval_embeddings, dim=0).cpu().numpy()  # pylint: disable=no-member,invalid-name\n    y_eval = eval_is_false\n\n    Xs = [X_train, X_test, X_eval]  # pylint: disable=invalid-name\n    ys = [y_train, y_test, y_eval]  # pylint: disable=invalid-name\n    suffixes = ['train_train', 'train_test', 'eval']\n\n    metrics, y_preds_proba = {}, {}\n\n    for suffix, X, y_true in zip(suffixes, Xs, ys):  # pylint: disable=invalid-name\n\n        # If suffix is eval, we fit a new model on the entire training data set\n        # rather than just a split of the training data set.\n        if suffix == 'eval':\n            model = LogisticRegression()\n            model.fit(embeddings_array, is_false)\n            convergence = {\n                'n_iter': model.n_iter_[0],\n                'converged': (model.n_iter_ < model.max_iter)[0]}\n\n        y_pred = model.predict(X)\n        y_pred_proba = model.predict_proba(X)\n        y_preds_proba[suffix] = y_pred_proba\n        acc_p_ik_train = accuracy_score(y_true, y_pred)\n        auroc_p_ik_train = roc_auc_score(y_true, y_pred_proba[:, 1])\n        split_metrics = {\n            f'acc_p_ik_{suffix}': acc_p_ik_train,\n            f'auroc_p_ik_{suffix}': auroc_p_ik_train}\n        metrics.update(split_metrics)\n\n    logging.info('Metrics for p_ik classifier: %s.', metrics)\n    wandb.log({**metrics, **convergence})\n\n    # Return model predictions on the eval set.\n    return y_preds_proba['eval'][:, 1]\n"
  },
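  {
    "path": "semantic_uncertainty/examples/p_ik_sketch.py",
    "content": "\"\"\"Toy sanity check for the p_ik classifier.\n\nMinimal sketch under assumed shapes: the real pipeline feeds `get_p_ik` the\nlast-token hidden states returned by HuggingfaceModel.predict; here we use\nrandom vectors and labels only to exercise the function end to end. Run from\nthe `semantic_uncertainty` directory so that `uncertainty` is importable.\n\"\"\"\nimport numpy as np\nimport torch\nimport wandb\n\nfrom uncertainty.uncertainty_measures.p_ik import get_p_ik\n\n\ndef main():\n    # wandb is only used for logging inside get_p_ik; run it in disabled mode.\n    wandb.init(mode='disabled')\n\n    rng = np.random.default_rng(0)\n    hidden_size = 32\n\n    def fake_split(n):\n        # One (1, hidden_size) tensor per example, matching the shape of\n        # `last_token_embedding`, plus a float 0/1 label for `is_false`.\n        embeddings = [\n            torch.tensor(rng.normal(size=(1, hidden_size)), dtype=torch.float32)\n            for _ in range(n)]\n        labels = [float(rng.integers(0, 2)) for _ in range(n)]\n        return embeddings, labels\n\n    train_embeddings, train_is_false = fake_split(200)\n    eval_embeddings, eval_is_false = fake_split(50)\n\n    p_false = get_p_ik(train_embeddings, train_is_false, eval_embeddings, eval_is_false)\n    print('Predicted p(false) on eval set:', p_false[:5])\n\n\nif __name__ == '__main__':\n    main()\n"
  },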
  {
    "path": "semantic_uncertainty/uncertainty/uncertainty_measures/p_true.py",
    "content": "\"\"\"Compute p_true uncertainty metric.\"\"\"\nimport logging\n\n\ndef construct_few_shot_prompt(\n        *, model, dataset, indices, prompt, brief, brief_always, make_prompt,\n        num_generations, metric):\n    \"\"\"Construct few shot prompt for p_true uncertainty metric.\"\"\"\n\n    # Call model n_shots many times.\n    few_shot_prompt = []\n    all_responses = dict()\n    for it, i in enumerate(indices):\n        prompt_candidate = []\n        example = dataset[i]\n        question = example[\"question\"]\n        context = example[\"context\"]\n        if it != 0:\n            prompt_candidate += ['\\n']\n        prompt_candidate += ['Question: ' + question]\n        prompt_candidate += ['\\nBrainstormed Answers: ']\n        current_question = make_prompt(context, question, None, brief, brief_always)\n        local_prompt = prompt + current_question\n        logging.info('P_TRUE >> Current Question: '.ljust(25) + current_question)\n\n        responses = []\n        for j in range(num_generations + 1):\n\n            if j == 0:\n                temperature = 0.1\n            else:\n                temperature = 1.0\n\n            response, _, _ = model.predict(local_prompt, temperature)\n            logging.info('P_TRUE >> Current Response: '.ljust(25) + response)\n\n            responses.append(response)\n            prompt_candidate += [f'{response.strip()} \\n']\n            if j == 0:\n                # Save most likely response and compute correctness metric for it.\n                most_likely_response = response\n                is_correct = metric(response, example, model)\n                answers = [answer for answer in example['answers']['text']]\n                logging.info('P_TRUE >> LOW-T >> true answer: '.ljust(35) + str(answers))\n                logging.info('P_TRUE >> LOW-T >> acc: '.ljust(35) + str(is_correct))\n\n        all_responses[i] = dict(\n            responses=responses, most_likely_response=most_likely_response,\n            is_correct=is_correct)\n\n        prompt_candidate += ['Possible answer: ' + most_likely_response + '\\n']\n        prompt_candidate += ['Is the possible answer:\\n']\n        prompt_candidate += ['A) True\\n']\n        prompt_candidate += ['B) False\\n']\n        prompt_candidate += ['The possible answer is:']\n        prompt_candidate += [' A' if is_correct else ' B']\n\n        prompt_len = len(model.tokenizer.encode(''.join(few_shot_prompt + prompt_candidate)))\n        # At test time, get a maximum of `num_generations * model.token_limit` extra tokens\n        # 200 buffer for question and 'Possible Answer'.\n        max_input_len = prompt_len + num_generations * model.max_new_tokens + 200\n\n        if max_input_len < model.token_limit:\n            few_shot_prompt.extend(prompt_candidate)\n        else:\n            logging.warning('Cutting of p_true prompt at length %d.', it)\n            break\n\n    return ''.join(few_shot_prompt), all_responses, it\n\n\ndef calculate_p_true(\n        model, question, most_probable_answer, brainstormed_answers,\n        few_shot_prompt, hint=False):\n    \"\"\"Calculate p_true uncertainty metric.\"\"\"\n\n    if few_shot_prompt:\n        prompt = few_shot_prompt + '\\n'\n    else:\n        prompt = ''\n\n    prompt += 'Question: ' + question\n    prompt += '\\nBrainstormed Answers: '\n    for answer in brainstormed_answers + [most_probable_answer]:\n        prompt += answer.strip() + '\\n'\n    prompt += 'Possible answer: ' + most_probable_answer + '\\n'\n    if not 
hint:\n        prompt += 'Is the possible answer:\\n'\n        prompt += 'A) True\\n'\n        prompt += 'B) False\\n'\n        prompt += 'The possible answer is:'\n    else:\n        prompt += 'Do the brainstormed answers match the possible answer? Respond with A if they do, if they do not respond with B. Answer:'\n\n    log_prob = model.get_p_true(prompt)\n\n    return log_prob\n"
  },
  {
    "path": "semantic_uncertainty/uncertainty/uncertainty_measures/semantic_entropy.py",
    "content": "\"\"\"Implement semantic entropy.\"\"\"\nimport os\nimport pickle\nimport logging\n\nimport numpy as np\nimport wandb\nimport torch\nimport torch.nn.functional as F\n\nfrom transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nfrom uncertainty.models.huggingface_models import HuggingfaceModel\nfrom uncertainty.utils import openai as oai\nfrom uncertainty.utils import utils\n\n\nDEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n\n\nclass BaseEntailment:\n    def save_prediction_cache(self):\n        pass\n\n\nclass EntailmentDeberta(BaseEntailment):\n    def __init__(self):\n        self.tokenizer = AutoTokenizer.from_pretrained(\"microsoft/deberta-v2-xlarge-mnli\")\n        self.model = AutoModelForSequenceClassification.from_pretrained(\n            \"microsoft/deberta-v2-xlarge-mnli\").to(DEVICE)\n\n    def check_implication(self, text1, text2, *args, **kwargs):\n        inputs = self.tokenizer(text1, text2, return_tensors=\"pt\").to(DEVICE)\n        # The model checks if text1 -> text2, i.e. if text2 follows from text1.\n        # check_implication('The weather is good', 'The weather is good and I like you') --> 1\n        # check_implication('The weather is good and I like you', 'The weather is good') --> 2\n        outputs = self.model(**inputs)\n        logits = outputs.logits\n        # Deberta-mnli returns `neutral` and `entailment` classes at indices 1 and 2.\n        largest_index = torch.argmax(F.softmax(logits, dim=1))  # pylint: disable=no-member\n        prediction = largest_index.cpu().item()\n        if os.environ.get('DEBERTA_FULL_LOG', False):\n            logging.info('Deberta Input: %s -> %s', text1, text2)\n            logging.info('Deberta Prediction: %s', prediction)\n\n        return prediction\n\n\nclass EntailmentLLM(BaseEntailment):\n\n    entailment_file = 'entailment_cache.pkl'\n\n    def __init__(self, entailment_cache_id, entailment_cache_only):\n        self.prediction_cache = self.init_prediction_cache(entailment_cache_id)\n        self.entailment_cache_only = entailment_cache_only\n\n    def init_prediction_cache(self, entailment_cache_id):\n        if entailment_cache_id is None:\n            return dict()\n\n        logging.info('Restoring prediction cache from %s', entailment_cache_id)\n\n        api = wandb.Api()\n        run = api.run(entailment_cache_id)\n        run.file(self.entailment_file).download(\n            replace=True, exist_ok=False, root=wandb.run.dir)\n\n        with open(f'{wandb.run.dir}/{self.entailment_file}', \"rb\") as infile:\n            return pickle.load(infile)\n\n    def save_prediction_cache(self):\n        # Write the dictionary to a pickle file.\n        utils.save(self.prediction_cache, self.entailment_file)\n\n    def check_implication(self, text1, text2, example=None):\n        if example is None:\n            raise ValueError\n        prompt = self.equivalence_prompt(text1, text2, example['question'])\n\n        logging.info('%s input: %s', self.name, prompt)\n\n        hashed = oai.md5hash(prompt)\n        if hashed in self.prediction_cache:\n            logging.info('Restoring hashed instead of predicting with model.')\n            response = self.prediction_cache[hashed]\n        else:\n            if self.entailment_cache_only:\n                raise ValueError\n            response = self.predict(prompt, temperature=0.02)\n            self.prediction_cache[hashed] = response\n\n        logging.info('%s prediction: %s', self.name, response)\n\n        binary_response = 
response.lower()[:30]\n        if 'entailment' in binary_response:\n            return 2\n        elif 'neutral' in binary_response:\n            return 1\n        elif 'contradiction' in binary_response:\n            return 0\n        else:\n            logging.warning('MANUAL NEUTRAL!')\n            return 1\n\n\nclass EntailmentGPT4(EntailmentLLM):\n\n    def __init__(self, entailment_cache_id, entailment_cache_only):\n        super().__init__(entailment_cache_id, entailment_cache_only)\n        self.name = 'gpt-4'\n\n    def equivalence_prompt(self, text1, text2, question):\n\n        prompt = f\"\"\"We are evaluating answers to the question \\\"{question}\\\"\\n\"\"\"\n        prompt += \"Here are two possible answers:\\n\"\n        prompt += f\"Possible Answer 1: {text1}\\nPossible Answer 2: {text2}\\n\"\n        prompt += \"Does Possible Answer 1 semantically entail Possible Answer 2? Respond with entailment, contradiction, or neutral.\"\"\"\n\n        return prompt\n\n    def predict(self, prompt, temperature):\n        return oai.predict(prompt, temperature, model=self.name)\n\n\nclass EntailmentGPT35(EntailmentGPT4):\n\n    def __init__(self, entailment_cache_id, entailment_cache_only):\n        super().__init__(entailment_cache_id, entailment_cache_only)\n        self.name = 'gpt-3.5'\n\n\nclass EntailmentGPT4Turbo(EntailmentGPT4):\n\n    def __init__(self, entailment_cache_id, entailment_cache_only):\n        super().__init__(entailment_cache_id, entailment_cache_only)\n        self.name = 'gpt-4-turbo'\n\n\nclass EntailmentLlama(EntailmentLLM):\n\n    def __init__(self, entailment_cache_id, entailment_cache_only, name):\n        super().__init__(entailment_cache_id, entailment_cache_only)\n        self.name = name\n        self.model = HuggingfaceModel(\n            name, stop_sequences='default', max_new_tokens=30)\n\n    def equivalence_prompt(self, text1, text2, question):\n\n        prompt = f\"\"\"We are evaluating answers to the question \\\"{question}\\\"\\n\"\"\"\n        prompt += \"Here are two possible answers:\\n\"\n        prompt += f\"Possible Answer 1: {text1}\\nPossible Answer 2: {text2}\\n\"\n        prompt += \"Does Possible Answer 1 semantically entail Possible Answer 2? 
Respond only with entailment, contradiction, or neutral.\\n\"\"\"\n        prompt += \"Response:\"\"\"\n\n        return prompt\n\n    def predict(self, prompt, temperature):\n        predicted_answer, _, _ = self.model.predict(prompt, temperature)\n        return predicted_answer\n\n\ndef context_entails_response(context, responses, model):\n    votes = []\n    for response in responses:\n        votes.append(model.check_implication(context, response))\n    return 2 - np.mean(votes)\n\n\ndef get_semantic_ids(strings_list, model, strict_entailment=False, example=None):\n    \"\"\"Group list of predictions into semantic meaning.\"\"\"\n\n    def are_equivalent(text1, text2):\n\n        implication_1 = model.check_implication(text1, text2, example=example)\n        implication_2 = model.check_implication(text2, text1, example=example)  # pylint: disable=arguments-out-of-order\n        assert (implication_1 in [0, 1, 2]) and (implication_2 in [0, 1, 2])\n\n        if strict_entailment:\n            semantically_equivalent = (implication_1 == 2) and (implication_2 == 2)\n\n        else:\n            implications = [implication_1, implication_2]\n            # Check if none of the implications are 0 (contradiction) and not both of them are neutral.\n            semantically_equivalent = (0 not in implications) and ([1, 1] != implications)\n\n        return semantically_equivalent\n\n    # Initialise all ids with -1.\n    semantic_set_ids = [-1] * len(strings_list)\n    # Keep track of current id.\n    next_id = 0\n    for i, string1 in enumerate(strings_list):\n        # Check if string1 already has an id assigned.\n        if semantic_set_ids[i] == -1:\n            # If string1 has not been assigned an id, assign it next_id.\n            semantic_set_ids[i] = next_id\n            for j in range(i+1, len(strings_list)):\n                # Search through all remaining strings. If they are equivalent to string1, assign them the same id.\n                if are_equivalent(string1, strings_list[j]):\n                    semantic_set_ids[j] = next_id\n            next_id += 1\n\n    assert -1 not in semantic_set_ids\n\n    return semantic_set_ids\n\n\ndef logsumexp_by_id(semantic_ids, log_likelihoods, agg='sum_normalized'):\n    \"\"\"Sum probabilities with the same semantic id.\n\n    Log-Sum-Exp because input and output probabilities in log space.\n    \"\"\"\n    unique_ids = sorted(list(set(semantic_ids)))\n    assert unique_ids == list(range(len(unique_ids)))\n    log_likelihood_per_semantic_id = []\n\n    for uid in unique_ids:\n        # Find positions in `semantic_ids` which belong to the active `uid`.\n        id_indices = [pos for pos, x in enumerate(semantic_ids) if x == uid]\n        # Gather log likelihoods at these indices.\n        id_log_likelihoods = [log_likelihoods[i] for i in id_indices]\n        if agg == 'sum_normalized':\n            # log_lik_norm = id_log_likelihoods - np.prod(log_likelihoods)\n            log_lik_norm = id_log_likelihoods - np.log(np.sum(np.exp(log_likelihoods)))\n            logsumexp_value = np.log(np.sum(np.exp(log_lik_norm)))\n        else:\n            raise ValueError\n        log_likelihood_per_semantic_id.append(logsumexp_value)\n\n    return log_likelihood_per_semantic_id\n\n\ndef predictive_entropy(log_probs):\n    \"\"\"Compute MC estimate of entropy.\n\n    `E[-log p(x)] ~= -1/N sum_i log p(x_i)`, i.e. 
the negative of the average log likelihood.\n    \"\"\"\n\n    entropy = -np.sum(log_probs) / len(log_probs)\n\n    return entropy\n\n\ndef predictive_entropy_rao(log_probs):\n    entropy = -np.sum(np.exp(log_probs) * log_probs)\n    return entropy\n\n\ndef cluster_assignment_entropy(semantic_ids):\n    \"\"\"Estimate semantic uncertainty from how often different clusters get assigned.\n\n    We estimate the categorical distribution over cluster assignments from the\n    semantic ids. The uncertainty is then given by the entropy of that\n    distribution. This estimate does not use token likelihoods, it relies solely\n    on the cluster assignments. If probability mass is spread out between many\n    clusters, entropy is larger. If probability mass is concentrated on a few\n    clusters, entropy is small.\n\n    Input:\n        semantic_ids: List of semantic ids, e.g. [0, 1, 2, 1].\n    Output:\n        cluster_entropy: Entropy, e.g. (-p log p).sum() for p = [1/4, 2/4, 1/4].\n    \"\"\"\n\n    n_generations = len(semantic_ids)\n    counts = np.bincount(semantic_ids)\n    probabilities = counts/n_generations\n    assert np.isclose(probabilities.sum(), 1)\n    entropy = - (probabilities * np.log(probabilities)).sum()\n    return entropy\n"
  },
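  {
    "path": "semantic_uncertainty/examples/semantic_entropy_sketch.py",
    "content": "\"\"\"Minimal end-to-end sketch of the semantic entropy computation.\n\nAssumes a handful of pre-generated answers with (illustrative) log likelihoods;\nin the real pipeline both come from HuggingfaceModel.predict. Uses the DeBERTa\nentailment model to cluster the answers. Run from the `semantic_uncertainty`\ndirectory so that `uncertainty` is importable.\n\"\"\"\nfrom uncertainty.uncertainty_measures.semantic_entropy import (\n    EntailmentDeberta,\n    cluster_assignment_entropy,\n    get_semantic_ids,\n    logsumexp_by_id,\n    predictive_entropy_rao,\n)\n\n\ndef main():\n    example = {'question': 'Where is the Eiffel Tower?'}\n    responses = ['Paris.', 'It is in Paris.', 'Rome.']\n    # Illustrative log likelihoods of each full response.\n    log_liks = [-0.2, -0.4, -2.5]\n\n    entailment_model = EntailmentDeberta()\n\n    # 1) Cluster responses into semantic ids via bidirectional entailment.\n    semantic_ids = get_semantic_ids(\n        responses, model=entailment_model, strict_entailment=True, example=example)\n    print('Semantic ids:', semantic_ids)\n\n    # 2) Aggregate log likelihoods within each cluster (normalised sum).\n    log_lik_per_id = logsumexp_by_id(semantic_ids, log_liks, agg='sum_normalized')\n\n    # 3) Semantic entropy over clusters (Rao estimator), plus the\n    #    likelihood-free cluster assignment entropy.\n    print('Semantic entropy:', predictive_entropy_rao(log_lik_per_id))\n    print('Cluster assignment entropy:', cluster_assignment_entropy(semantic_ids))\n\n\nif __name__ == '__main__':\n    main()\n"
  },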
  {
    "path": "semantic_uncertainty/uncertainty/utils/eval_utils.py",
    "content": "\"\"\"Functions for performance evaluation, mainly used in analyze_results.py.\"\"\"\nimport numpy as np\nimport scipy\nfrom sklearn import metrics\n\n\n# pylint: disable=missing-function-docstring\n\n\ndef bootstrap(function, rng, n_resamples=1000):\n    def inner(data):\n        bs = scipy.stats.bootstrap(\n            (data, ), function, n_resamples=n_resamples, confidence_level=0.9,\n            random_state=rng)\n        return {\n            'std_err': bs.standard_error,\n            'low': bs.confidence_interval.low,\n            'high': bs.confidence_interval.high\n        }\n    return inner\n\n\ndef auroc(y_true, y_score):\n    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)\n    del thresholds\n    return metrics.auc(fpr, tpr)\n\n\ndef accuracy_at_quantile(accuracies, uncertainties, quantile):\n    cutoff = np.quantile(uncertainties, quantile)\n    select = uncertainties <= cutoff\n    return np.mean(accuracies[select])\n\n\ndef area_under_thresholded_accuracy(accuracies, uncertainties):\n    quantiles = np.linspace(0.1, 1, 20)\n    select_accuracies = np.array([accuracy_at_quantile(accuracies, uncertainties, q) for q in quantiles])\n    dx = quantiles[1] - quantiles[0]\n    area = (select_accuracies * dx).sum()\n    return area\n\n\n# Need wrappers because scipy expects 1D data.\ndef compatible_bootstrap(func, rng):\n    def helper(y_true_y_score):\n        # this function is called in the bootstrap\n        y_true = np.array([i['y_true'] for i in y_true_y_score])\n        y_score = np.array([i['y_score'] for i in y_true_y_score])\n        out = func(y_true, y_score)\n        return out\n\n    def wrap_inputs(y_true, y_score):\n        return [{'y_true': i, 'y_score': j} for i, j in zip(y_true, y_score)]\n\n    def converted_func(y_true, y_score):\n        y_true_y_score = wrap_inputs(y_true, y_score)\n        return bootstrap(helper, rng=rng)(y_true_y_score)\n    return converted_func\n"
  },
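  {
    "path": "semantic_uncertainty/examples/bootstrap_auroc_sketch.py",
    "content": "\"\"\"Small sketch of the bootstrapped AUROC helpers in eval_utils.\n\nUses synthetic labels and scores only to show how `compatible_bootstrap` wraps\n`auroc` so that scipy's bootstrap can resample (y_true, y_score) pairs jointly.\nThe interval is 90% because `bootstrap` hard-codes confidence_level=0.9.\n\"\"\"\nimport numpy as np\n\nfrom uncertainty.utils.eval_utils import auroc, compatible_bootstrap\n\n\ndef main():\n    rng = np.random.default_rng(42)\n\n    # Synthetic binary labels and scores that are mildly informative about them.\n    y_true = rng.integers(0, 2, size=500)\n    y_score = y_true + rng.normal(scale=1.5, size=500)\n\n    point_estimate = auroc(y_true, y_score)\n    interval = compatible_bootstrap(auroc, rng)(y_true, y_score)\n\n    print('AUROC:', point_estimate)\n    print('90% interval:', interval['low'], '-', interval['high'])\n\n\nif __name__ == '__main__':\n    main()\n"
  },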
  {
    "path": "semantic_uncertainty/uncertainty/utils/openai.py",
    "content": "import os\nimport hashlib\nfrom tenacity import retry, wait_random_exponential, retry_if_not_exception_type\n\nfrom openai import OpenAI\n\n\nCLIENT = OpenAI(api_key=os.environ.get('OPENAI_API_KEY', False))\n\n\nclass KeyError(Exception):\n    \"\"\"OpenAIKey not provided in environment variable.\"\"\"\n    pass\n\n\n@retry(retry=retry_if_not_exception_type(KeyError), wait=wait_random_exponential(min=1, max=10))\ndef predict(prompt, temperature=1.0, model='gpt-4'):\n    \"\"\"Predict with GPT models.\"\"\"\n\n    if not CLIENT.api_key:\n        raise KeyError('Need to provide OpenAI API key in environment variable `OPENAI_API_KEY`.')\n\n    if isinstance(prompt, str):\n        messages = [\n            {'role': 'user', 'content': prompt},\n        ]\n    else:\n        messages = prompt\n\n    if model == 'gpt-4':\n        model = 'gpt-4-0613'\n    elif model == 'gpt-4-turbo':\n        model = 'gpt-4-1106-preview'\n    elif model == 'gpt-3.5':\n        model = 'gpt-3.5-turbo-1106'\n\n    output = CLIENT.chat.completions.create(\n        model=model,\n        messages=messages,\n        max_tokens=200,\n        temperature=temperature,\n    )\n    response = output.choices[0].message.content\n    return response\n\n\ndef md5hash(string):\n    return int(hashlib.md5(string.encode('utf-8')).hexdigest(), 16)\n"
  },
  {
    "path": "semantic_uncertainty/uncertainty/utils/utils.py",
    "content": "\"\"\"Utility functions.\"\"\"\nimport os\nimport logging\nimport argparse\nimport pickle\n\nimport wandb\n\nfrom evaluate import load\n\nfrom uncertainty.models.huggingface_models import HuggingfaceModel\nfrom uncertainty.utils import openai as oai\n\nBRIEF_PROMPTS = {\n    'default': \"Answer the following question as briefly as possible.\\n\",\n    'chat': 'Answer the following question in a single brief but complete sentence.\\n'}\n\n\ndef get_parser(stages=['generate', 'compute']):\n    entity = os.getenv('WANDB_SEM_UNC_ENTITY', None)\n\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--debug\", action=argparse.BooleanOptionalAction, default=False,\n        help=\"Keep default wandb clean.\")\n    parser.add_argument('--entity', type=str, default=entity)\n    parser.add_argument('--random_seed', type=int, default=10)\n    parser.add_argument(\n        \"--metric\", type=str, default=\"squad\",\n        choices=['squad', 'llm', 'llm_gpt-3.5', 'llm_gpt-4'],\n        help=\"Metric to assign accuracy to generations.\")\n    parser.add_argument(\n        \"--compute_accuracy_at_all_temps\",\n        action=argparse.BooleanOptionalAction, default=True,\n        help=\"Compute accuracy at all temperatures or only t<<1.\")\n    parser.add_argument(\n        \"--experiment_lot\", type=str, default='Unnamed Experiment',\n        help=\"Keep default wandb clean.\")\n    if 'generate' in stages:\n        parser.add_argument(\n            \"--model_name\", type=str, default=\"Llama-2-7b-chat\", help=\"Model name\",\n        )\n        parser.add_argument(\n            \"--model_max_new_tokens\", type=int, default=50,\n            help=\"Max number of tokens generated.\",\n        )\n        parser.add_argument(\n            \"--dataset\", type=str, default=\"trivia_qa\",\n            choices=['trivia_qa', 'squad', 'bioasq', 'nq', 'svamp'],\n            help=\"Dataset to use\")\n        parser.add_argument(\n            \"--ood_train_dataset\", type=str, default=None,\n            choices=['trivia_qa', 'squad', 'bioasq', 'nq', 'svamp'],\n            help=\"Dataset to use to assemble few-shot prompt, p_true prompt, and train p_ik.\")\n        parser.add_argument(\n            \"--num_samples\", type=int, default=400,\n            help=\"Number of samples to use\")\n        parser.add_argument(\n            \"--num_few_shot\", type=int, default=5,\n            help=\"Number of few shot examples to use\")\n        parser.add_argument(\n            \"--p_true_num_fewshot\", type=int, default=20,\n            help=\"Number of few shot examples to use\")\n        parser.add_argument(\n            \"--p_true_hint\", default=False,\n            action=argparse.BooleanOptionalAction,\n            help=\"Get generations for training set?\")\n        parser.add_argument(\n            \"--num_generations\", type=int, default=10,\n            help=\"Number of generations to use\")\n        parser.add_argument(\n            \"--temperature\", type=float, default=1.0,\n            help=\"Temperature\")\n        parser.add_argument(\n            \"--use_mc_options\", type=bool, default=True,\n            help=\"Include MC options question?\")\n        parser.add_argument(\n            \"--get_training_set_generations\", default=True,\n            action=argparse.BooleanOptionalAction,\n            help=\"Get generations for training set?\")\n        parser.add_argument(\n            \"--use_context\", default=False,\n            action=argparse.BooleanOptionalAction,\n   
         help=\"Get generations for training set?\")\n        parser.add_argument(\n            \"--get_training_set_generations_most_likely_only\", default=True,\n            action=argparse.BooleanOptionalAction,\n            help=(\n                \"Only get embedding of most likely answer for training set. \"\n                \"This is all that's needed for p_true.\"))\n        parser.add_argument('--compute_p_true', default=True,\n                            action=argparse.BooleanOptionalAction)\n        parser.add_argument(\n            \"--brief_always\", default=False, action=argparse.BooleanOptionalAction)\n        parser.add_argument(\n            \"--enable_brief\", default=True, action=argparse.BooleanOptionalAction)\n        parser.add_argument(\n            \"--brief_prompt\", default='default', type=str)\n        parser.add_argument(\n            \"--prompt_type\", default='default', type=str)\n        parser.add_argument(\n            \"--compute_uncertainties\", default=True,\n            action=argparse.BooleanOptionalAction,\n            help='Trigger compute_uncertainty_measures.py')\n        parser.add_argument(\n            \"--answerable_only\", default=False,\n            action=argparse.BooleanOptionalAction,\n            help='Exclude unanswerable questions.')\n\n    if 'compute' in stages:\n        parser.add_argument('--recompute_accuracy',\n                            default=False, action=argparse.BooleanOptionalAction)\n        parser.add_argument('--eval_wandb_runid', type=str,\n                            help='wandb run id of the dataset to evaluate on')\n        parser.add_argument('--train_wandb_runid', type=str, default=None,\n                            help='wandb run id of the dataset from which training embeddings and p_true samples will be taken')\n        parser.add_argument('--num_eval_samples', type=int, default=int(1e19))\n        parser.add_argument('--compute_predictive_entropy',\n                            default=True, action=argparse.BooleanOptionalAction)\n        parser.add_argument('--compute_p_ik', default=True,\n                            action=argparse.BooleanOptionalAction)\n        parser.add_argument('--compute_p_ik_answerable', default=False,\n                            action=argparse.BooleanOptionalAction)\n        parser.add_argument('--compute_context_entails_response', default=False,\n                            action=argparse.BooleanOptionalAction)\n        parser.add_argument('--analyze_run', default=True,\n                            action=argparse.BooleanOptionalAction)\n        parser.add_argument('--assign_new_wandb_id', default=True,\n                            action=argparse.BooleanOptionalAction)\n        parser.add_argument('--restore_entity_eval', type=str, default=entity)\n        parser.add_argument('--restore_entity_train', type=str, default=entity)\n        parser.add_argument('--condition_on_question',\n                            default=True, action=argparse.BooleanOptionalAction)\n        parser.add_argument('--strict_entailment',\n                            default=True, action=argparse.BooleanOptionalAction)\n        parser.add_argument('--use_all_generations', default=True, action=argparse.BooleanOptionalAction)\n        parser.add_argument('--use_num_generations', type=int, default=-1)\n        parser.add_argument(\"--entailment_model\", default='deberta', type=str)\n        parser.add_argument(\n            \"--entailment_cache_id\", default=None, type=str,\n            help='Restore 
entailment predictions from previous run for GPT-4/LLaMa-Entailment.')\n        parser.add_argument('--entailment_cache_only', default=False, action=argparse.BooleanOptionalAction)\n        parser.add_argument('--compute_p_true_in_compute_stage',\n                            default=False, action=argparse.BooleanOptionalAction)\n        parser.add_argument('--reuse_entailment_model',\n                            default=False, action=argparse.BooleanOptionalAction,\n                            help='Use entailment model as p_true model.')\n    return parser\n\n\ndef setup_logger():\n    \"\"\"Setup logger to always print time and level.\"\"\"\n    logging.basicConfig(\n        format='%(asctime)s %(levelname)-8s %(message)s',\n        level=logging.INFO,\n        datefmt='%Y-%m-%d %H:%M:%S')\n    logging.getLogger().setLevel(logging.INFO)  # logging.DEBUG\n\n\ndef construct_fewshot_prompt_from_indices(dataset, example_indices, brief, brief_always, make_prompt):\n    \"\"\"Given a dataset and indices, construct a fewshot prompt.\"\"\"\n    if not brief_always:\n        prompt = brief\n    else:\n        prompt = ''\n\n    for example_index in example_indices:\n\n        example = dataset[example_index]\n        context = example[\"context\"]\n        question = example[\"question\"]\n        answer = example[\"answers\"][\"text\"][0]\n\n        prompt = prompt + make_prompt(context, question, answer, brief, brief_always)\n\n    return prompt\n\n\ndef split_dataset(dataset):\n    \"\"\"Get indices of answerable and unanswerable questions.\"\"\"\n\n    def clen(ex):\n        return len(ex[\"answers\"][\"text\"])\n\n    answerable_indices = [i for i, ex in enumerate(dataset) if clen(ex) > 0]\n    unanswerable_indices = [i for i, ex in enumerate(dataset) if clen(ex) == 0]\n\n    # union == full dataset\n    assert set(answerable_indices) | set(\n        unanswerable_indices) == set(range(len(dataset)))\n    # no overlap\n    assert set(answerable_indices) - \\\n        set(unanswerable_indices) == set(answerable_indices)\n\n    return answerable_indices, unanswerable_indices\n\n\ndef model_based_metric(predicted_answer, example, model):\n    if 'answers' in example:\n        correct_answers = example['answers']['text']\n    elif 'reference' in example:\n        correct_answers = example['reference']['answers']['text']\n    else:\n        raise ValueError\n\n    prompt = f'We are assessing the quality of answers to the following question: {example[\"question\"]}\\n'\n    if len(correct_answers) == 1:\n        prompt += f\"The expected answer is: {correct_answers[0]}.\\n\"\n    else:\n        prompt += f\"The following are expected answers to this question: {correct_answers}.\\n\"\n\n    prompt += f\"The proposed answer is: {predicted_answer}\\n\"\n\n    if len(correct_answers) == 1:\n        prompt += \"Within the context of the question, does the proposed answer mean the same as the expected answer?\"\n    else:\n        prompt += \"Within the context of the question, does the proposed answer mean the same as any of the expected answers?\"\n\n    prompt += \" Respond only with yes or no.\\nResponse:\"\n\n    if 'gpt' in model.model_name.lower():\n        predicted_answer = model.predict(prompt, 0.01)\n    else:\n        predicted_answer, _, _ = model.predict(prompt, 0.01)\n\n    if 'yes' in predicted_answer.lower():\n        return 1.0\n    elif 'no' in predicted_answer.lower():\n        return 0.0\n    else:\n        logging.warning('Redo llm check.')\n        predicted_answer, _, _ = 
model.predict(prompt, 1)\n        if 'yes' in predicted_answer.lower():\n            return 1.0\n        elif 'no' in predicted_answer.lower():\n            return 0.0\n\n        logging.warning('Answer neither no nor yes. Defaulting to no!')\n        return 0.0\n\n\ndef llm_metric(predicted_answer, example, model):\n    return model_based_metric(predicted_answer, example, model)\n\n\ndef get_gpt_metric(metric_name):\n\n    model_name = '_'.join(metric_name.split('_')[1:])\n\n    class EntailmentGPT():\n        def __init__(self, model_name):\n            self.model_name = model_name\n\n        def predict(self, prompt, temperature):\n            return oai.predict(prompt, temperature, model=self.model_name)\n\n    gpt_model = EntailmentGPT(model_name)\n\n    def gpt_metric(predicted_answer, example, model):\n        del model\n        return model_based_metric(predicted_answer, example, gpt_model)\n\n    return gpt_metric\n\n\ndef get_reference(example):\n    if 'answers' not in example:\n        example = example['reference']\n    answers = example['answers']\n    answer_starts = answers.get('answer_start', [])\n    reference = {'answers': {'answer_start': answer_starts, 'text': answers['text']}, 'id': example['id']}\n    return reference\n\n\ndef init_model(args):\n    mn = args.model_name\n    if 'llama' in mn.lower() or 'falcon' in mn or 'mistral' in mn.lower():\n        model = HuggingfaceModel(\n            mn, stop_sequences='default',\n            max_new_tokens=args.model_max_new_tokens)\n    else:\n        raise ValueError(f'Unknown model_name `{mn}`.')\n    return model\n\n\ndef get_make_prompt(args):\n    if args.prompt_type == 'default':\n        def make_prompt(context, question, answer, brief, brief_always):\n            prompt = ''\n            if brief_always:\n                prompt += brief\n            if args.use_context and (context is not None):\n                prompt += f\"Context: {context}\\n\"\n            prompt += f\"Question: {question}\\n\"\n            if answer:\n                prompt += f\"Answer: {answer}\\n\\n\"\n            else:\n                prompt += 'Answer:'\n            return prompt\n    else:\n        raise ValueError\n\n    return make_prompt\n\n\ndef get_metric(metric):\n    if metric == 'squad':\n\n        squad_metric = load(\"squad_v2\")\n\n        def metric(response, example, *args, **kwargs):\n            # Compatibility with recomputation.\n            if 'id' in example:\n                exid = example['id']\n            elif 'id' in example['reference']:\n                exid = example['reference']['id']\n            else:\n                raise ValueError\n\n            prediction = {'prediction_text': response, 'no_answer_probability': 0.0, 'id': exid}\n            results = squad_metric.compute(\n                predictions=[prediction],\n                references=[get_reference(example)])\n            return 1.0 if (results['f1'] >= 50.0) else 0.0\n\n    # Reuses the globally active model for these.\n    elif metric == 'llm':\n        metric = llm_metric\n    elif metric == 'llm_gpt-3.5':\n        metric = get_gpt_metric(metric)\n    elif metric == 'llm_gpt-4':\n        metric = get_gpt_metric(metric)\n    else:\n        raise ValueError\n\n    return metric\n\n\ndef save(object, file):\n    with open(f'{wandb.run.dir}/{file}', 'wb') as f:\n        pickle.dump(object, f)\n    wandb.save(f'{wandb.run.dir}/{file}')\n"
  }
]