[
  {
    "path": ".gitignore",
    "content": "*.pkl\n*.jpg\n*.mp4\n*.pth\n*.pyc\n__pycache__\n*.h5\n*.avi\n*.wav\nfilelists/*.txt\nevaluation/test_filelists/lr*.txt\n*.pyc\n*.mkv\n*.gif\n*.webm\n*.mp3\n"
  },
  {
    "path": "README.md",
    "content": "# **Wav2Lip**: *Accurately Lip-syncing Videos In The Wild* \n\n# Commercial Version\n\nCreate your first lipsync generation in minutes. Please note, the commercial version is of a much higher quality than the old open source model!\n\n## Create your API Key\n\nCreate your API key from the [Dashboard](https://sync.so/keys). You will use this key to securely access the Sync API.\n\n## Make your first generation\n\nThe following example shows how to make a lipsync generation using the Sync API.\n\n### Python\n\n#### Step 1: Install Sync SDK\n\n```bash\npip install syncsdk\n```\n\n#### Step 2: Make your first generation\n\nCopy the following code into a file `quickstart.py` and replace `YOUR_API_KEY_HERE` with your generated API key.\n\n```python\n# quickstart.py\nimport time\nfrom sync import Sync\nfrom sync.common import Audio, GenerationOptions, Video\nfrom sync.core.api_error import ApiError\n\n# ---------- UPDATE API KEY ----------\n# Replace with your Sync.so API key\napi_key = \"YOUR_API_KEY_HERE\" \n\n# ----------[OPTIONAL] UPDATE INPUT VIDEO AND AUDIO URL ----------\n# URL to your source video\nvideo_url = \"https://assets.sync.so/docs/example-video.mp4\"\n# URL to your audio file\naudio_url = \"https://assets.sync.so/docs/example-audio.wav\"\n# ----------------------------------------\n\nclient = Sync(\n    base_url=\"https://api.sync.so\", \n    api_key=api_key\n).generations\n\nprint(\"Starting lip sync generation job...\")\n\ntry:\n    response = client.create(\n        input=[Video(url=video_url),Audio(url=audio_url)],\n        model=\"lipsync-2\",\n        options=GenerationOptions(sync_mode=\"cut_off\"),\n        outputFileName=\"quickstart\"\n    )\nexcept ApiError as e:\n    print(f'create generation request failed with status code {e.status_code} and error {e.body}')\n    exit()\n\njob_id = response.id\nprint(f\"Generation submitted successfully, job id: {job_id}\")\n\ngeneration = client.get(job_id)\nstatus = generation.status\nwhile 
status not in ['COMPLETED', 'FAILED']:\n    print('polling status for generation', job_id)\n    time.sleep(10)\n    generation = client.get(job_id)\n    status = generation.status\n\nif status == 'COMPLETED':\n    print('generation', job_id, 'completed successfully, output url:', generation.output_url)\nelse:\n    print('generation', job_id, 'failed')\n```\n\nRun the script:\n\n```bash\npython quickstart.py\n```\n\n#### Step 3: Done!\n\nIt may take a few minutes for the generation to complete. You should see the generated video URL in the terminal post completion.\n\n---\n\n### TypeScript\n\n#### Step 1: Install dependencies\n\n```bash\nnpm i @sync.so/sdk\n```\n\n#### Step 2: Make your first generation\n\nCopy the following code into a file `quickstart.ts` and replace `YOUR_API_KEY_HERE` with your generated API key.\n\n```typescript\n// quickstart.ts\nimport { SyncClient, SyncError } from \"@sync.so/sdk\";\n\n// ---------- UPDATE API KEY ----------\n// Replace with your Sync.so API key\nconst apiKey = \"YOUR_API_KEY_HERE\";\n\n// ----------[OPTIONAL] UPDATE INPUT VIDEO AND AUDIO URL ----------\n// URL to your source video\nconst videoUrl = \"https://assets.sync.so/docs/example-video.mp4\";\n// URL to your audio file\nconst audioUrl = \"https://assets.sync.so/docs/example-audio.wav\";\n// ----------------------------------------\n\nconst client = new SyncClient({ apiKey });\n\nasync function main() {\n    console.log(\"Starting lip sync generation job...\");\n\n    let jobId: string;\n    try {\n        const response = await client.generations.create({\n            input: [\n                {\n                    type: \"video\",\n                    url: videoUrl,\n                },\n                {\n                    type: \"audio\",\n                    url: audioUrl,\n                },\n            ],\n            model: \"lipsync-2\",\n            options: {\n                sync_mode: \"cut_off\",\n            },\n            outputFileName: 
\"quickstart\"\n        });\n        jobId = response.id;\n        console.log(`Generation submitted successfully, job id: ${jobId}`);\n    } catch (err) {\n        if (err instanceof SyncError) {\n            console.error(`create generation request failed with status code ${err.statusCode} and error ${JSON.stringify(err.body)}`);\n        } else {\n            console.error('An unexpected error occurred:', err);\n        }\n        return;\n    }\n\n    let generation;\n    let status;\n    while (status !== 'COMPLETED' && status !== 'FAILED') {\n        console.log(`polling status for generation ${jobId}...`);\n        try {\n            await new Promise(resolve => setTimeout(resolve, 10000));\n            generation = await client.generations.get(jobId);\n            status = generation.status;\n        } catch (err) {\n            if (err instanceof SyncError) {\n                console.error(`polling failed with status code ${err.statusCode} and error ${JSON.stringify(err.body)}`);\n            } else {\n                console.error('An unexpected error occurred during polling:', err);\n            }\n            status = 'FAILED';\n        }\n    }\n\n    if (status === 'COMPLETED') {\n        console.log(`generation ${jobId} completed successfully, output url: ${generation?.outputUrl}`);\n    } else {\n        console.log(`generation ${jobId} failed`);\n    }\n}\n\nmain();\n```\n\nRun the script:\n\n```bash\nnpx tsx quickstart.ts -y\n```\n\n#### Step 3: Done!\n\nYou should see the generated video URL in the terminal.\n\n---\n\n## Next Steps\n\nWell done! You've just made your first lipsync generation with sync.so!\n\nReady to unlock the full potential of lipsync? 
Dive into our interactive [Studio](https://sync.so/login) to experiment with all available models, or explore our [API Documentation](/api-reference) to take your lip-sync generations to the next level!\n\n## Contact\n- prady@sync.so\n- pavan@sync.so\n- sanjit@sync.so\n\n\n\n# Non Commercial Open-source Version\n\nThis code is part of the paper: _A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild_ published at ACM Multimedia 2020. \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-lip-sync-expert-is-all-you-need-for-speech/lip-sync-on-lrs2)](https://paperswithcode.com/sota/lip-sync-on-lrs2?p=a-lip-sync-expert-is-all-you-need-for-speech)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-lip-sync-expert-is-all-you-need-for-speech/lip-sync-on-lrs3)](https://paperswithcode.com/sota/lip-sync-on-lrs3?p=a-lip-sync-expert-is-all-you-need-for-speech)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-lip-sync-expert-is-all-you-need-for-speech/lip-sync-on-lrw)](https://paperswithcode.com/sota/lip-sync-on-lrw?p=a-lip-sync-expert-is-all-you-need-for-speech)\n|📑 Original Paper|📰 Project Page|🌀 Demo|⚡ Live Testing|📔 Colab Notebook\n|:-:|:-:|:-:|:-:|:-:|\n[Paper](http://arxiv.org/abs/2008.10010) | [Project Page](http://cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild/) | [Demo Video](https://youtu.be/0fXaDCZNOJc) | [Interactive Demo](https://synclabs.so/) | [Colab Notebook](https://colab.research.google.com/drive/1tZpDWXz49W6wDcTprANRGLo2D_EbD5J8?usp=sharing) /[Updated Collab Notebook](https://colab.research.google.com/drive/1IjFW1cLevs6Ouyu4Yht4mnR4yeuMqO7Y#scrollTo=MH1m608OymLH)\n \n![Logo](https://drive.google.com/uc?export=view&id=1Wn0hPmpo4GRbCIJR8Tf20Akzdi1qjjG9)\n----------\n**Highlights**\n----------\n - Weights of the visual quality disc has been updated in readme!\n - 
Lip-sync videos to any target speech with high accuracy :100:. Try our [interactive demo](https://sync.so/).\n - :sparkles: Works for any identity, voice, and language. Also works for CGI faces and synthetic voices.\n - Complete training code, inference code, and pretrained models are available :boom:\n - Or, quick-start with the Google Colab Notebook: [Link](https://colab.research.google.com/drive/1tZpDWXz49W6wDcTprANRGLo2D_EbD5J8?usp=sharing). Checkpoints and samples are available in a Google Drive [folder](https://drive.google.com/drive/folders/1I-0dNLfFOSFwrfqjNa-SXuwaURHE5K4k?usp=sharing) as well. There is also a [tutorial video](https://www.youtube.com/watch?v=Ic0TBhfuOrA) on this, courtesy of [What Make Art](https://www.youtube.com/channel/UCmGXH-jy0o2CuhqtpxbaQgA). Also, thanks to [Eyal Gruss](https://eyalgruss.com), there is a more accessible [Google Colab notebook](https://j.mp/wav2lip) with more useful features. A tutorial collab notebook is present at this [link](https://colab.research.google.com/drive/1IjFW1cLevs6Ouyu4Yht4mnR4yeuMqO7Y#scrollTo=MH1m608OymLH).  \n - :fire: :fire: Several new, reliable evaluation benchmarks and metrics [[`evaluation/` folder of this repo]](https://github.com/Rudrabha/Wav2Lip/tree/master/evaluation) released. Instructions to calculate the metrics reported in the paper are also present.\n--------\n**Disclaimer**\n--------\nAll results from this open-source code or our [demo website](https://bhaasha.iiit.ac.in/lipsync) should only be used for research/academic/personal purposes only. As the models are trained on the <a href=\"http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html\">LRS2 dataset</a>, any form of commercial use is strictly prohibited. For commercial requests please contact us directly!\nPrerequisites\n-------------\n- `Python 3.6` \n- ffmpeg: `sudo apt-get install ffmpeg`\n- Install necessary packages using `pip install -r requirements.txt`. 
Alternatively, instructions for using a docker image is provided [here](https://gist.github.com/xenogenesi/e62d3d13dadbc164124c830e9c453668). Have a look at [this comment](https://github.com/Rudrabha/Wav2Lip/issues/131#issuecomment-725478562) and comment on [the gist](https://gist.github.com/xenogenesi/e62d3d13dadbc164124c830e9c453668) if you encounter any issues. \n- Face detection [pre-trained model](https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth) should be downloaded to `face_detection/detection/sfd/s3fd.pth`. Alternative [link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/prajwal_k_research_iiit_ac_in/EZsy6qWuivtDnANIG73iHjIBjMSoojcIV0NULXV-yiuiIg?e=qTasa8) if the above does not work.\nGetting the weights\n----------\n| Model  | Description |  Link to the model | \n| :-------------: | :---------------: | :---------------: |\n| Wav2Lip  | Highly accurate lip-sync | [Link](https://drive.google.com/drive/folders/153HLrqlBNxzZcHi17PEvP09kkAfzRshM?usp=share_link)  |\n| Wav2Lip + GAN  | Slightly inferior lip-sync, but better visual quality | [Link](https://drive.google.com/file/d/15G3U08c8xsCkOqQxE38Z2XXDnPcOptNk/view?usp=share_link) |\n\n\nLip-syncing videos using the pre-trained models (Inference)\n-------\nYou can lip-sync any video to any audio:\n```bash\npython inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> \n```\nThe result is saved (by default) in `results/result_voice.mp4`. You can specify it as an argument,  similar to several other available options. The audio source can be any file supported by `FFMPEG` containing audio data: `*.wav`, `*.mp3` or even a video file, from which the code will automatically extract the audio.\n##### Tips for better results:\n- Experiment with the `--pads` argument to adjust the detected face bounding box. Often leads to improved results. You might need to increase the bottom padding to include the chin region. E.g. 
`--pads 0 20 0 0`.\n- If you see the mouth position dislocated or some weird artifacts such as two mouths, this may be caused by over-smoothing the face detections. Use the `--nosmooth` argument and give it another try. \n- Experiment with the `--resize_factor` argument to get a lower-resolution video. Why? The models are trained on faces that were at a lower resolution. You might get better, visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too). \n- The Wav2Lip model without GAN usually needs more experimentation with the above two to get the best results, and it can sometimes give you a better result as well.\nPreparing LRS2 for training\n----------\nOur models are trained on LRS2. See [here](#training-on-datasets-other-than-lrs2) for a few suggestions regarding training on other datasets.\n##### LRS2 dataset folder structure\n```\ndata_root (mvlrs_v1)\n├── main, pretrain (we use only main folder in this work)\n|\t├── list of folders\n|\t│   ├── five-digit numbered video IDs ending with (.mp4)\n```\nPlace the LRS2 filelist `.txt` files (train, val, test) in the `filelists/` folder.\n##### Preprocess the dataset for fast training\n```bash\npython preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/\n```\nAdditional options, such as `batch_size` and the number of GPUs to use in parallel, can also be set.\n##### Preprocessed LRS2 folder structure\n```\npreprocessed_root (lrs2_preprocessed)\n├── list of folders\n|\t├── Folders with five-digit numbered video IDs\n|\t│   ├── *.jpg\n|\t│   ├── audio.wav\n```\nTrain!\n----------\nThere are two major steps: (i) Train the expert lip-sync discriminator, (ii) Train the Wav2Lip model(s).\n##### Training the expert discriminator\nYou can download [the pre-trained weights](#getting-the-weights) if you want to skip this step. 
To train it:\n```bash\npython color_syncnet_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints>\n```\n##### Training the Wav2Lip models\nYou can either train the model without the additional visual quality discriminator (< 1 day of training) or use the discriminator (~2 days). For the former, run: \n```bash\npython wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint>\n```\nTo train with the visual quality discriminator, you should run `hq_wav2lip_train.py` instead. The arguments for both files are similar. In both cases, you can resume training as well. Run `python wav2lip_train.py --help` for more details. You can also set additional, less commonly used hyper-parameters at the bottom of the `hparams.py` file.\nTraining on datasets other than LRS2\n------------------------------------\nTraining on other datasets might require modifications to the code. Please read the following before you raise an issue:\n- You might not get good results by training/fine-tuning on a few minutes of a single speaker. This is a separate research problem, to which we do not have a solution yet. Thus, we would most likely not be able to resolve your issue. \n- You must train the expert discriminator for your own dataset before training Wav2Lip.\n- If you use your own dataset downloaded from the web, it will, in most cases, need to be sync-corrected.\n- Be mindful of the FPS of the videos in your dataset. Changes to FPS would need significant code changes. \n- The expert discriminator's eval loss should go down to ~0.25 and the Wav2Lip eval sync loss should go down to ~0.2 to get good results. \nWhen raising an issue on this topic, please let us know that you are aware of all these points.\nWe have an HD model trained on a dataset allowing commercial usage. 
The size of the generated face will be 192 x 288 in our new model.\nEvaluation\n----------\nPlease check the `evaluation/` folder for the instructions.\nLicense and Citation\n----------\nThis repository can only be used for personal/research/non-commercial purposes. For commercial requests, please contact us directly at rudrabha@synclabs.so or prajwal@synclabs.so. We have a turn-key hosted API with new and improved lip-syncing models here: https://synclabs.so/\nPlease cite the following paper if you use this repository:\n```\n@inproceedings{10.1145/3394171.3413532,\nauthor = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},\ntitle = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild},\nyear = {2020},\nisbn = {9781450379885},\npublisher = {Association for Computing Machinery},\naddress = {New York, NY, USA},\nurl = {https://doi.org/10.1145/3394171.3413532},\ndoi = {10.1145/3394171.3413532},\nbooktitle = {Proceedings of the 28th ACM International Conference on Multimedia},\npages = {484–492},\nnumpages = {9},\nkeywords = {lip sync, talking face generation, video generation},\nlocation = {Seattle, WA, USA},\nseries = {MM '20}\n}\n```\nAcknowledgments\n----------\nParts of the code structure are inspired by this [TTS repository](https://github.com/r9y9/deepvoice3_pytorch). We thank the author for this wonderful code. The code for Face Detection has been taken from the [face_alignment](https://github.com/1adrianb/face-alignment) repository. We thank the authors for releasing their code and models. 
We thank [zabique](https://github.com/zabique) for the tutorial Colab notebook.\n## Acknowledgements\n - [Awesome Readme Templates](https://awesomeopensource.com/project/elangosundar/awesome-README-templates)\n - [Awesome README](https://github.com/matiassingers/awesome-readme)\n - [How to write a Good readme](https://bulldogjob.com/news/449-how-to-write-a-good-readme-for-your-github-project)\n"
  },
  {
    "path": "audio.py",
    "content": "import librosa\nimport librosa.filters\nimport numpy as np\n# import tensorflow as tf\nfrom scipy import signal\nfrom scipy.io import wavfile\nfrom hparams import hparams as hp\n\ndef load_wav(path, sr):\n    return librosa.core.load(path, sr=sr)[0]\n\ndef save_wav(wav, path, sr):\n    wav *= 32767 / max(0.01, np.max(np.abs(wav)))\n    #proposed by @dsmiller\n    wavfile.write(path, sr, wav.astype(np.int16))\n\ndef save_wavenet_wav(wav, path, sr):\n    librosa.output.write_wav(path, wav, sr=sr)\n\ndef preemphasis(wav, k, preemphasize=True):\n    if preemphasize:\n        return signal.lfilter([1, -k], [1], wav)\n    return wav\n\ndef inv_preemphasis(wav, k, inv_preemphasize=True):\n    if inv_preemphasize:\n        return signal.lfilter([1], [1, -k], wav)\n    return wav\n\ndef get_hop_size():\n    hop_size = hp.hop_size\n    if hop_size is None:\n        assert hp.frame_shift_ms is not None\n        hop_size = int(hp.frame_shift_ms / 1000 * hp.sample_rate)\n    return hop_size\n\ndef linearspectrogram(wav):\n    D = _stft(preemphasis(wav, hp.preemphasis, hp.preemphasize))\n    S = _amp_to_db(np.abs(D)) - hp.ref_level_db\n    \n    if hp.signal_normalization:\n        return _normalize(S)\n    return S\n\ndef melspectrogram(wav):\n    D = _stft(preemphasis(wav, hp.preemphasis, hp.preemphasize))\n    S = _amp_to_db(_linear_to_mel(np.abs(D))) - hp.ref_level_db\n    \n    if hp.signal_normalization:\n        return _normalize(S)\n    return S\n\ndef _lws_processor():\n    import lws\n    return lws.lws(hp.n_fft, get_hop_size(), fftsize=hp.win_size, mode=\"speech\")\n\ndef _stft(y):\n    if hp.use_lws:\n        return _lws_processor(hp).stft(y).T\n    else:\n        return librosa.stft(y=y, n_fft=hp.n_fft, hop_length=get_hop_size(), win_length=hp.win_size)\n\n##########################################################\n#Those are only correct when using lws!!! 
(This was messing with Wavenet quality for a long time!)\ndef num_frames(length, fsize, fshift):\n    \"\"\"Compute number of time frames of spectrogram\n    \"\"\"\n    pad = (fsize - fshift)\n    if length % fshift == 0:\n        M = (length + pad * 2 - fsize) // fshift + 1\n    else:\n        M = (length + pad * 2 - fsize) // fshift + 2\n    return M\n\n\ndef pad_lr(x, fsize, fshift):\n    \"\"\"Compute left and right padding\n    \"\"\"\n    M = num_frames(len(x), fsize, fshift)\n    pad = (fsize - fshift)\n    T = len(x) + 2 * pad\n    r = (M - 1) * fshift + fsize - T\n    return pad, pad + r\n##########################################################\n#Librosa correct padding\ndef librosa_pad_lr(x, fsize, fshift):\n    return 0, (x.shape[0] // fshift + 1) * fshift - x.shape[0]\n\n# Conversions\n_mel_basis = None\n\ndef _linear_to_mel(spectrogram):\n    # Build the mel filterbank once and cache it at module level\n    global _mel_basis\n    if _mel_basis is None:\n        _mel_basis = _build_mel_basis()\n    return np.dot(_mel_basis, spectrogram)\n\ndef _build_mel_basis():\n    assert hp.fmax <= hp.sample_rate // 2\n    # Keyword arguments: newer librosa releases no longer accept sr and n_fft positionally\n    return librosa.filters.mel(sr=hp.sample_rate, n_fft=hp.n_fft, n_mels=hp.num_mels,\n                               fmin=hp.fmin, fmax=hp.fmax)\n\ndef _amp_to_db(x):\n    min_level = np.exp(hp.min_level_db / 20 * np.log(10))\n    return 20 * np.log10(np.maximum(min_level, x))\n\ndef _db_to_amp(x):\n    return np.power(10.0, (x) * 0.05)\n\ndef _normalize(S):\n    if hp.allow_clipping_in_normalization:\n        if hp.symmetric_mels:\n            return np.clip((2 * hp.max_abs_value) * ((S - hp.min_level_db) / (-hp.min_level_db)) - hp.max_abs_value,\n                           -hp.max_abs_value, hp.max_abs_value)\n        else:\n            return np.clip(hp.max_abs_value * ((S - hp.min_level_db) / (-hp.min_level_db)), 0, hp.max_abs_value)\n    \n    assert S.max() <= 0 and S.min() - hp.min_level_db >= 0\n    if hp.symmetric_mels:\n        return (2 * hp.max_abs_value) * ((S - hp.min_level_db) / (-hp.min_level_db)) - 
hp.max_abs_value\n    else:\n        return hp.max_abs_value * ((S - hp.min_level_db) / (-hp.min_level_db))\n\ndef _denormalize(D):\n    if hp.allow_clipping_in_normalization:\n        if hp.symmetric_mels:\n            return (((np.clip(D, -hp.max_abs_value,\n                              hp.max_abs_value) + hp.max_abs_value) * -hp.min_level_db / (2 * hp.max_abs_value))\n                    + hp.min_level_db)\n        else:\n            return ((np.clip(D, 0, hp.max_abs_value) * -hp.min_level_db / hp.max_abs_value) + hp.min_level_db)\n    \n    if hp.symmetric_mels:\n        return (((D + hp.max_abs_value) * -hp.min_level_db / (2 * hp.max_abs_value)) + hp.min_level_db)\n    else:\n        return ((D * -hp.min_level_db / hp.max_abs_value) + hp.min_level_db)\n"
  },
  {
    "path": "checkpoints/README.md",
    "content": "Place all your checkpoints (.pth files) here. "
  },
  {
    "path": "color_syncnet_train.py",
    "content": "from os.path import dirname, join, basename, isfile\nfrom tqdm import tqdm\n\nfrom models import SyncNet_color as SyncNet\nimport audio\n\nimport torch\nfrom torch import nn\nfrom torch import optim\nimport torch.backends.cudnn as cudnn\nfrom torch.utils import data as data_utils\nimport numpy as np\n\nfrom glob import glob\n\nimport os, random, cv2, argparse\nfrom hparams import hparams, get_image_list\n\nparser = argparse.ArgumentParser(description='Code to train the expert lip-sync discriminator')\n\nparser.add_argument(\"--data_root\", help=\"Root folder of the preprocessed LRS2 dataset\", required=True)\n\nparser.add_argument('--checkpoint_dir', help='Save checkpoints to this directory', required=True, type=str)\nparser.add_argument('--checkpoint_path', help='Resumed from this checkpoint', default=None, type=str)\n\nargs = parser.parse_args()\n\n\nglobal_step = 0\nglobal_epoch = 0\nuse_cuda = torch.cuda.is_available()\nprint('use_cuda: {}'.format(use_cuda))\n\nsyncnet_T = 5\nsyncnet_mel_step_size = 16\n\nclass Dataset(object):\n    def __init__(self, split):\n        self.all_videos = get_image_list(args.data_root, split)\n\n    def get_frame_id(self, frame):\n        return int(basename(frame).split('.')[0])\n\n    def get_window(self, start_frame):\n        start_id = self.get_frame_id(start_frame)\n        vidname = dirname(start_frame)\n\n        window_fnames = []\n        for frame_id in range(start_id, start_id + syncnet_T):\n            frame = join(vidname, '{}.jpg'.format(frame_id))\n            if not isfile(frame):\n                return None\n            window_fnames.append(frame)\n        return window_fnames\n\n    def crop_audio_window(self, spec, start_frame):\n        # num_frames = (T x hop_size * fps) / sample_rate\n        start_frame_num = self.get_frame_id(start_frame)\n        start_idx = int(80. 
* (start_frame_num / float(hparams.fps)))\n\n        end_idx = start_idx + syncnet_mel_step_size\n\n        return spec[start_idx : end_idx, :]\n\n\n    def __len__(self):\n        return len(self.all_videos)\n\n    def __getitem__(self, idx):\n        while 1:\n            idx = random.randint(0, len(self.all_videos) - 1)\n            vidname = self.all_videos[idx]\n\n            img_names = list(glob(join(vidname, '*.jpg')))\n            if len(img_names) <= 3 * syncnet_T:\n                continue\n            img_name = random.choice(img_names)\n            wrong_img_name = random.choice(img_names)\n            while wrong_img_name == img_name:\n                wrong_img_name = random.choice(img_names)\n\n            if random.choice([True, False]):\n                y = torch.ones(1).float()\n                chosen = img_name\n            else:\n                y = torch.zeros(1).float()\n                chosen = wrong_img_name\n\n            window_fnames = self.get_window(chosen)\n            if window_fnames is None:\n                continue\n\n            window = []\n            all_read = True\n            for fname in window_fnames:\n                img = cv2.imread(fname)\n                if img is None:\n                    all_read = False\n                    break\n                try:\n                    img = cv2.resize(img, (hparams.img_size, hparams.img_size))\n                except Exception as e:\n                    all_read = False\n                    break\n\n                window.append(img)\n\n            if not all_read: continue\n\n            try:\n                wavpath = join(vidname, \"audio.wav\")\n                wav = audio.load_wav(wavpath, hparams.sample_rate)\n\n                orig_mel = audio.melspectrogram(wav).T\n            except Exception as e:\n                continue\n\n            mel = self.crop_audio_window(orig_mel.copy(), img_name)\n\n            if (mel.shape[0] != syncnet_mel_step_size):\n                
continue\n\n            # H x W x 3 * T\n            x = np.concatenate(window, axis=2) / 255.\n            x = x.transpose(2, 0, 1)\n            x = x[:, x.shape[1]//2:]\n\n            x = torch.FloatTensor(x)\n            mel = torch.FloatTensor(mel.T).unsqueeze(0)\n\n            return x, mel, y\n\nlogloss = nn.BCELoss()\ndef cosine_loss(a, v, y):\n    d = nn.functional.cosine_similarity(a, v)\n    loss = logloss(d.unsqueeze(1), y)\n\n    return loss\n\ndef train(device, model, train_data_loader, test_data_loader, optimizer,\n          checkpoint_dir=None, checkpoint_interval=None, nepochs=None):\n\n    global global_step, global_epoch\n    resumed_step = global_step\n    \n    while global_epoch < nepochs:\n        running_loss = 0.\n        prog_bar = tqdm(enumerate(train_data_loader))\n        for step, (x, mel, y) in prog_bar:\n            model.train()\n            optimizer.zero_grad()\n\n            # Transform data to CUDA device\n            x = x.to(device)\n\n            mel = mel.to(device)\n\n            a, v = model(mel, x)\n            y = y.to(device)\n\n            loss = cosine_loss(a, v, y)\n            loss.backward()\n            optimizer.step()\n\n            global_step += 1\n            cur_session_steps = global_step - resumed_step\n            running_loss += loss.item()\n\n            if global_step == 1 or global_step % checkpoint_interval == 0:\n                save_checkpoint(\n                    model, optimizer, global_step, checkpoint_dir, global_epoch)\n\n            if global_step % hparams.syncnet_eval_interval == 0:\n                with torch.no_grad():\n                    eval_model(test_data_loader, global_step, device, model, checkpoint_dir)\n\n            prog_bar.set_description('Loss: {}'.format(running_loss / (step + 1)))\n\n        global_epoch += 1\n\ndef eval_model(test_data_loader, global_step, device, model, checkpoint_dir):\n    eval_steps = 1400\n    print('Evaluating for {} steps'.format(eval_steps))\n    
losses = []\n    while 1:\n        for step, (x, mel, y) in enumerate(test_data_loader):\n\n            model.eval()\n\n            # Transform data to CUDA device\n            x = x.to(device)\n\n            mel = mel.to(device)\n\n            a, v = model(mel, x)\n            y = y.to(device)\n\n            loss = cosine_loss(a, v, y)\n            losses.append(loss.item())\n\n            if step > eval_steps: break\n\n        averaged_loss = sum(losses) / len(losses)\n        print(averaged_loss)\n\n        return\n\ndef save_checkpoint(model, optimizer, step, checkpoint_dir, epoch):\n\n    checkpoint_path = join(\n        checkpoint_dir, \"checkpoint_step{:09d}.pth\".format(global_step))\n    optimizer_state = optimizer.state_dict() if hparams.save_optimizer_state else None\n    torch.save({\n        \"state_dict\": model.state_dict(),\n        \"optimizer\": optimizer_state,\n        \"global_step\": step,\n        \"global_epoch\": epoch,\n    }, checkpoint_path)\n    print(\"Saved checkpoint:\", checkpoint_path)\n\ndef _load(checkpoint_path):\n    if use_cuda:\n        checkpoint = torch.load(checkpoint_path)\n    else:\n        checkpoint = torch.load(checkpoint_path,\n                                map_location=lambda storage, loc: storage)\n    return checkpoint\n\ndef load_checkpoint(path, model, optimizer, reset_optimizer=False):\n    global global_step\n    global global_epoch\n\n    print(\"Load checkpoint from: {}\".format(path))\n    checkpoint = _load(path)\n    model.load_state_dict(checkpoint[\"state_dict\"])\n    if not reset_optimizer:\n        optimizer_state = checkpoint[\"optimizer\"]\n        if optimizer_state is not None:\n            print(\"Load optimizer state from {}\".format(path))\n            optimizer.load_state_dict(checkpoint[\"optimizer\"])\n    global_step = checkpoint[\"global_step\"]\n    global_epoch = checkpoint[\"global_epoch\"]\n\n    return model\n\nif __name__ == \"__main__\":\n    checkpoint_dir = 
args.checkpoint_dir\n    checkpoint_path = args.checkpoint_path\n\n    if not os.path.exists(checkpoint_dir): os.mkdir(checkpoint_dir)\n\n    # Dataset and Dataloader setup\n    train_dataset = Dataset('train')\n    test_dataset = Dataset('val')\n\n    train_data_loader = data_utils.DataLoader(\n        train_dataset, batch_size=hparams.syncnet_batch_size, shuffle=True,\n        num_workers=hparams.num_workers)\n\n    test_data_loader = data_utils.DataLoader(\n        test_dataset, batch_size=hparams.syncnet_batch_size,\n        num_workers=8)\n\n    device = torch.device(\"cuda\" if use_cuda else \"cpu\")\n\n    # Model\n    model = SyncNet().to(device)\n    print('total trainable params {}'.format(sum(p.numel() for p in model.parameters() if p.requires_grad)))\n\n    optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad],\n                           lr=hparams.syncnet_lr)\n\n    if checkpoint_path is not None:\n        load_checkpoint(checkpoint_path, model, optimizer, reset_optimizer=False)\n\n    train(device, model, train_data_loader, test_data_loader, optimizer,\n          checkpoint_dir=checkpoint_dir,\n          checkpoint_interval=hparams.syncnet_checkpoint_interval,\n          nepochs=hparams.nepochs)\n"
  },
  {
    "path": "evaluation/README.md",
    "content": "# Novel Evaluation Framework, new filelists, and using the LSE-D and LSE-C metric.\n\nOur paper also proposes a novel evaluation framework (Section 4). To evaluate on LRS2, LRS3, and LRW, the filelists are present in the `test_filelists` folder. Please use `gen_videos_from_filelist.py` script to generate the videos. After that, you can calculate the LSE-D and LSE-C scores using the instructions below. Please see [this thread](https://github.com/Rudrabha/Wav2Lip/issues/22#issuecomment-712825380) on how to calculate the FID scores. \n\nThe videos of the ReSyncED benchmark for real-world evaluation will be released soon. \n\n### Steps to set-up the evaluation repository for LSE-D and LSE-C metric:\nWe use the pre-trained syncnet model available in this [repository](https://github.com/joonson/syncnet_python). \n\n* Clone the SyncNet repository.\n``` \ngit clone https://github.com/joonson/syncnet_python.git \n```\n* Follow the procedure given in the above linked [repository](https://github.com/joonson/syncnet_python) to download the pretrained models and set up the dependencies. \n    * **Note: Please install a separate virtual environment for the evaluation scripts. The versions used by Wav2Lip and the publicly released code of SyncNet is different and can cause version mis-match issues. 
To avoid this, we suggest that users install a separate virtual environment for the evaluation scripts.**\n```\ncd syncnet_python\npip install -r requirements.txt\nsh download_model.sh\n```\n* The above step should ensure that all the dependencies required by the repository are installed and the pre-trained models are downloaded.\n\n### Running the evaluation scripts:\n* Copy our evaluation scripts given in this folder to the cloned repository.\n```  \n    cd Wav2Lip/evaluation/scores_LSE/\n    cp *.py syncnet_python/\n    cp *.sh syncnet_python/ \n```\n**Note: We will release the test filelists for LRW, LRS2 and LRS3 shortly once we receive permission from the dataset creators. We will also release the Real World Dataset we have collected shortly.**\n\n* Our evaluation technique does not require ground truth of any sort. Given lip-synced videos, the scores are calculated directly from the generated videos alone. Please store the generated videos (from our test sets or your own generated videos) in the following folder structure.\n```\nvideo data root (Folder containing all videos)\n├── All .mp4 files\n```\n* Change directory back to the cloned repository. \n```\ncd syncnet_python\n```\n* To run evaluation on the LRW, LRS2 and LRS3 test files, please run the following command:\n```\npython calculate_scores_LRS.py --data_root /path/to/video/data/root --tmp_dir tmp_dir/\n```\n\n* To run evaluation on the ReSyncED dataset or your own generated videos, please run the following command:\n```\nsh calculate_scores_real_videos.sh /path/to/video/data/root\n```\n* The generated scores will be written to `all_scores.txt` in the ```syncnet_python/``` folder.\n\n# Evaluation of image quality using FID metric.\nWe use the [pytorch-fid](https://github.com/mseitzer/pytorch-fid) repository for calculating the FID metrics. We dump all the frames in both ground-truth and generated videos and calculate the FID score. 
 \n\n\n# Opening issues related to evaluation scripts\n* Please open an issue with the \"Evaluation\" label if you face any problems with the evaluation scripts. \n\n# Acknowledgements\nOur evaluation pipeline is based on two existing repositories. LSE metrics are based on the [syncnet_python](https://github.com/joonson/syncnet_python) repository and the FID score is based on the [pytorch-fid](https://github.com/mseitzer/pytorch-fid) repository. We thank the authors of both repositories for releasing their wonderful code.\n\n\n\n"
  },
  {
    "path": "evaluation/gen_videos_from_filelist.py",
    "content": "from os import listdir, path\nimport numpy as np\nimport scipy, cv2, os, sys, argparse\nimport dlib, json, subprocess\nfrom tqdm import tqdm\nfrom glob import glob\nimport torch\n\nsys.path.append('../')\nimport audio\nimport face_detection\nfrom models import Wav2Lip\n\nparser = argparse.ArgumentParser(description='Code to generate results for test filelists')\n\nparser.add_argument('--filelist', type=str, \n\t\t\t\t\thelp='Filepath of filelist file to read', required=True)\nparser.add_argument('--results_dir', type=str, help='Folder to save all results into', \n\t\t\t\t\t\t\t\t\trequired=True)\nparser.add_argument('--data_root', type=str, required=True)\nparser.add_argument('--checkpoint_path', type=str, \n\t\t\t\t\thelp='Name of saved checkpoint to load weights from', required=True)\n\nparser.add_argument('--pads', nargs='+', type=int, default=[0, 0, 0, 0], \n\t\t\t\t\thelp='Padding (top, bottom, left, right)')\nparser.add_argument('--face_det_batch_size', type=int, \n\t\t\t\t\thelp='Single GPU batch size for face detection', default=64)\nparser.add_argument('--wav2lip_batch_size', type=int, help='Batch size for Wav2Lip', default=128)\n\n# parser.add_argument('--resize_factor', default=1, type=int)\n\nargs = parser.parse_args()\nargs.img_size = 96\n\ndef get_smoothened_boxes(boxes, T):\n\tfor i in range(len(boxes)):\n\t\tif i + T > len(boxes):\n\t\t\twindow = boxes[len(boxes) - T:]\n\t\telse:\n\t\t\twindow = boxes[i : i + T]\n\t\tboxes[i] = np.mean(window, axis=0)\n\treturn boxes\n\ndef face_detect(images):\n\tbatch_size = args.face_det_batch_size\n\t\n\twhile 1:\n\t\tpredictions = []\n\t\ttry:\n\t\t\tfor i in range(0, len(images), batch_size):\n\t\t\t\tpredictions.extend(detector.get_detections_for_batch(np.array(images[i:i + batch_size])))\n\t\texcept RuntimeError:\n\t\t\tif batch_size == 1:\n\t\t\t\traise RuntimeError('Image too big to run face detection on GPU')\n\t\t\tbatch_size //= 2\n\t\t\targs.face_det_batch_size = 
batch_size\n\t\t\tprint('Recovering from OOM error; New batch size: {}'.format(batch_size))\n\t\t\tcontinue\n\t\tbreak\n\n\tresults = []\n\tpady1, pady2, padx1, padx2 = args.pads\n\tfor rect, image in zip(predictions, images):\n\t\tif rect is None:\n\t\t\traise ValueError('Face not detected!')\n\n\t\ty1 = max(0, rect[1] - pady1)\n\t\ty2 = min(image.shape[0], rect[3] + pady2)\n\t\tx1 = max(0, rect[0] - padx1)\n\t\tx2 = min(image.shape[1], rect[2] + padx2)\n\t\t\n\t\tresults.append([x1, y1, x2, y2])\n\n\tboxes = get_smoothened_boxes(np.array(results), T=5)\n\tresults = [[image[y1: y2, x1:x2], (y1, y2, x1, x2), True] for image, (x1, y1, x2, y2) in zip(images, boxes)]\n\n\treturn results \n\ndef datagen(frames, face_det_results, mels):\n\timg_batch, mel_batch, frame_batch, coords_batch = [], [], [], []\n\n\tfor i, m in enumerate(mels):\n\t\tif i >= len(frames): raise ValueError('Equal or less lengths only')\n\n\t\tframe_to_save = frames[i].copy()\n\t\tface, coords, valid_frame = face_det_results[i].copy()\n\t\tif not valid_frame:\n\t\t\tcontinue\n\n\t\tface = cv2.resize(face, (args.img_size, args.img_size))\n\t\t\t\n\t\timg_batch.append(face)\n\t\tmel_batch.append(m)\n\t\tframe_batch.append(frame_to_save)\n\t\tcoords_batch.append(coords)\n\n\t\tif len(img_batch) >= args.wav2lip_batch_size:\n\t\t\timg_batch, mel_batch = np.asarray(img_batch), np.asarray(mel_batch)\n\n\t\t\timg_masked = img_batch.copy()\n\t\t\timg_masked[:, args.img_size//2:] = 0\n\n\t\t\timg_batch = np.concatenate((img_masked, img_batch), axis=3) / 255.\n\t\t\tmel_batch = np.reshape(mel_batch, [len(mel_batch), mel_batch.shape[1], mel_batch.shape[2], 1])\n\n\t\t\tyield img_batch, mel_batch, frame_batch, coords_batch\n\t\t\timg_batch, mel_batch, frame_batch, coords_batch = [], [], [], []\n\n\tif len(img_batch) > 0:\n\t\timg_batch, mel_batch = np.asarray(img_batch), np.asarray(mel_batch)\n\n\t\timg_masked = img_batch.copy()\n\t\timg_masked[:, args.img_size//2:] = 0\n\n\t\timg_batch = 
np.concatenate((img_masked, img_batch), axis=3) / 255.\n\t\tmel_batch = np.reshape(mel_batch, [len(mel_batch), mel_batch.shape[1], mel_batch.shape[2], 1])\n\n\t\tyield img_batch, mel_batch, frame_batch, coords_batch\n\nfps = 25\nmel_step_size = 16\nmel_idx_multiplier = 80./fps\ndevice = 'cuda' if torch.cuda.is_available() else 'cpu'\nprint('Using {} for inference.'.format(device))\n\ndetector = face_detection.FaceAlignment(face_detection.LandmarksType._2D, \n\t\t\t\t\t\t\t\t\t\t\tflip_input=False, device=device)\n\ndef _load(checkpoint_path):\n\tif device == 'cuda':\n\t\tcheckpoint = torch.load(checkpoint_path)\n\telse:\n\t\tcheckpoint = torch.load(checkpoint_path,\n\t\t\t\t\t\t\t\tmap_location=lambda storage, loc: storage)\n\treturn checkpoint\n\ndef load_model(path):\n\tmodel = Wav2Lip()\n\tprint(\"Load checkpoint from: {}\".format(path))\n\tcheckpoint = _load(path)\n\ts = checkpoint[\"state_dict\"]\n\tnew_s = {}\n\tfor k, v in s.items():\n\t\tnew_s[k.replace('module.', '')] = v\n\tmodel.load_state_dict(new_s)\n\n\tmodel = model.to(device)\n\treturn model.eval()\n\nmodel = load_model(args.checkpoint_path)\n\ndef main():\n\tassert args.data_root is not None\n\tdata_root = args.data_root\n\n\tif not os.path.isdir(args.results_dir): os.makedirs(args.results_dir)\n\n\twith open(args.filelist, 'r') as filelist:\n\t\tlines = filelist.readlines()\n\n\tfor idx, line in enumerate(tqdm(lines)):\n\t\taudio_src, video = line.strip().split()\n\n\t\taudio_src = os.path.join(data_root, audio_src) + '.mp4'\n\t\tvideo = os.path.join(data_root, video) + '.mp4'\n\n\t\tcommand = 'ffmpeg -loglevel panic -y -i {} -strict -2 {}'.format(audio_src, '../temp/temp.wav')\n\t\tsubprocess.call(command, shell=True)\n\t\ttemp_audio = '../temp/temp.wav'\n\n\t\twav = audio.load_wav(temp_audio, 16000)\n\t\tmel = audio.melspectrogram(wav)\n\t\tif np.isnan(mel.reshape(-1)).sum() > 0:\n\t\t\tcontinue\n\n\t\tmel_chunks = []\n\t\ti = 0\n\t\twhile 1:\n\t\t\tstart_idx = int(i * 
mel_idx_multiplier)\n\t\t\tif start_idx + mel_step_size > len(mel[0]):\n\t\t\t\tbreak\n\t\t\tmel_chunks.append(mel[:, start_idx : start_idx + mel_step_size])\n\t\t\ti += 1\n\n\t\tvideo_stream = cv2.VideoCapture(video)\n\t\t\t\n\t\tfull_frames = []\n\t\twhile 1:\n\t\t\tstill_reading, frame = video_stream.read()\n\t\t\tif not still_reading or len(full_frames) > len(mel_chunks):\n\t\t\t\tvideo_stream.release()\n\t\t\t\tbreak\n\t\t\tfull_frames.append(frame)\n\n\t\tif len(full_frames) < len(mel_chunks):\n\t\t\tcontinue\n\n\t\tfull_frames = full_frames[:len(mel_chunks)]\n\n\t\ttry:\n\t\t\tface_det_results = face_detect(full_frames.copy())\n\t\texcept ValueError as e:\n\t\t\tcontinue\n\n\t\tbatch_size = args.wav2lip_batch_size\n\t\tgen = datagen(full_frames.copy(), face_det_results, mel_chunks)\n\n\t\tfor i, (img_batch, mel_batch, frames, coords) in enumerate(gen):\n\t\t\tif i == 0:\n\t\t\t\tframe_h, frame_w = full_frames[0].shape[:-1]\n\t\t\t\tout = cv2.VideoWriter('../temp/result.avi', \n\t\t\t\t\t\t\t\tcv2.VideoWriter_fourcc(*'DIVX'), fps, (frame_w, frame_h))\n\n\t\t\timg_batch = torch.FloatTensor(np.transpose(img_batch, (0, 3, 1, 2))).to(device)\n\t\t\tmel_batch = torch.FloatTensor(np.transpose(mel_batch, (0, 3, 1, 2))).to(device)\n\n\t\t\twith torch.no_grad():\n\t\t\t\tpred = model(mel_batch, img_batch)\n\t\t\t\t\t\n\n\t\t\tpred = pred.cpu().numpy().transpose(0, 2, 3, 1) * 255.\n\t\t\t\n\t\t\tfor pl, f, c in zip(pred, frames, coords):\n\t\t\t\ty1, y2, x1, x2 = c\n\t\t\t\tpl = cv2.resize(pl.astype(np.uint8), (x2 - x1, y2 - y1))\n\t\t\t\tf[y1:y2, x1:x2] = pl\n\t\t\t\tout.write(f)\n\n\t\tout.release()\n\n\t\tvid = os.path.join(args.results_dir, '{}.mp4'.format(idx))\n\n\t\tcommand = 'ffmpeg -loglevel panic -y -i {} -i {} -strict -2 -q:v 1 {}'.format(temp_audio, \n\t\t\t\t\t\t\t\t'../temp/result.avi', vid)\n\t\tsubprocess.call(command, shell=True)\n\nif __name__ == '__main__':\n\tmain()\n"
  },
  {
    "path": "evaluation/real_videos_inference.py",
    "content": "from os import listdir, path\nimport numpy as np\nimport scipy, cv2, os, sys, argparse\nimport dlib, json, subprocess\nfrom tqdm import tqdm\nfrom glob import glob\nimport torch\n\nsys.path.append('../')\nimport audio\nimport face_detection\nfrom models import Wav2Lip\n\nparser = argparse.ArgumentParser(description='Code to generate results on ReSyncED evaluation set')\n\nparser.add_argument('--mode', type=str, \n\t\t\t\t\thelp='random | dubbed | tts', required=True)\n\nparser.add_argument('--filelist', type=str, \n\t\t\t\t\thelp='Filepath of filelist file to read', default=None)\n\nparser.add_argument('--results_dir', type=str, help='Folder to save all results into', \n\t\t\t\t\t\t\t\t\trequired=True)\nparser.add_argument('--data_root', type=str, required=True)\nparser.add_argument('--checkpoint_path', type=str, \n\t\t\t\t\thelp='Name of saved checkpoint to load weights from', required=True)\nparser.add_argument('--pads', nargs='+', type=int, default=[0, 10, 0, 0], \n\t\t\t\t\thelp='Padding (top, bottom, left, right)')\n\nparser.add_argument('--face_det_batch_size', type=int, \n\t\t\t\t\thelp='Single GPU batch size for face detection', default=16)\n\nparser.add_argument('--wav2lip_batch_size', type=int, help='Batch size for Wav2Lip', default=128)\nparser.add_argument('--face_res', help='Approximate resolution of the face at which to test', default=180)\nparser.add_argument('--min_frame_res', help='Do not downsample further below this frame resolution', default=480)\nparser.add_argument('--max_frame_res', help='Downsample to at least this frame resolution', default=720)\n# parser.add_argument('--resize_factor', default=1, type=int)\n\nargs = parser.parse_args()\nargs.img_size = 96\n\ndef get_smoothened_boxes(boxes, T):\n\tfor i in range(len(boxes)):\n\t\tif i + T > len(boxes):\n\t\t\twindow = boxes[len(boxes) - T:]\n\t\telse:\n\t\t\twindow = boxes[i : i + T]\n\t\tboxes[i] = np.mean(window, axis=0)\n\treturn boxes\n\ndef 
rescale_frames(images):\n\trect = detector.get_detections_for_batch(np.array([images[0]]))[0]\n\tif rect is None:\n\t\traise ValueError('Face not detected!')\n\th, w = images[0].shape[:-1]\n\n\tx1, y1, x2, y2 = rect\n\n\tface_size = max(np.abs(y1 - y2), np.abs(x1 - x2))\n\n\tdiff = np.abs(face_size - args.face_res)\n\tfor factor in range(2, 16):\n\t\tdownsampled_res = face_size // factor\n\t\tif min(h//factor, w//factor) < args.min_frame_res: break \n\t\tif np.abs(downsampled_res - args.face_res) >= diff: break\n\n\tfactor -= 1\n\tif factor == 1: return images\n\n\treturn [cv2.resize(im, (im.shape[1]//(factor), im.shape[0]//(factor))) for im in images]\n\n\ndef face_detect(images):\n\tbatch_size = args.face_det_batch_size\n\timages = rescale_frames(images)\n\n\twhile 1:\n\t\tpredictions = []\n\t\ttry:\n\t\t\tfor i in range(0, len(images), batch_size):\n\t\t\t\tpredictions.extend(detector.get_detections_for_batch(np.array(images[i:i + batch_size])))\n\t\texcept RuntimeError:\n\t\t\tif batch_size == 1:\n\t\t\t\traise RuntimeError('Image too big to run face detection on GPU')\n\t\t\tbatch_size //= 2\n\t\t\tprint('Recovering from OOM error; New batch size: {}'.format(batch_size))\n\t\t\tcontinue\n\t\tbreak\n\n\tresults = []\n\tpady1, pady2, padx1, padx2 = args.pads\n\tfor rect, image in zip(predictions, images):\n\t\tif rect is None:\n\t\t\traise ValueError('Face not detected!')\n\n\t\ty1 = max(0, rect[1] - pady1)\n\t\ty2 = min(image.shape[0], rect[3] + pady2)\n\t\tx1 = max(0, rect[0] - padx1)\n\t\tx2 = min(image.shape[1], rect[2] + padx2)\n\t\t\n\t\tresults.append([x1, y1, x2, y2])\n\n\tboxes = get_smoothened_boxes(np.array(results), T=5)\n\tresults = [[image[y1: y2, x1:x2], (y1, y2, x1, x2), True] for image, (x1, y1, x2, y2) in zip(images, boxes)]\n\n\treturn results, images \n\ndef datagen(frames, face_det_results, mels):\n\timg_batch, mel_batch, frame_batch, coords_batch = [], [], [], []\n\n\tfor i, m in enumerate(mels):\n\t\tif i >= len(frames): raise 
ValueError('Equal or less lengths only')\n\n\t\tframe_to_save = frames[i].copy()\n\t\tface, coords, valid_frame = face_det_results[i].copy()\n\t\tif not valid_frame:\n\t\t\tcontinue\n\n\t\tface = cv2.resize(face, (args.img_size, args.img_size))\n\t\t\t\n\t\timg_batch.append(face)\n\t\tmel_batch.append(m)\n\t\tframe_batch.append(frame_to_save)\n\t\tcoords_batch.append(coords)\n\n\t\tif len(img_batch) >= args.wav2lip_batch_size:\n\t\t\timg_batch, mel_batch = np.asarray(img_batch), np.asarray(mel_batch)\n\n\t\t\timg_masked = img_batch.copy()\n\t\t\timg_masked[:, args.img_size//2:] = 0\n\n\t\t\timg_batch = np.concatenate((img_masked, img_batch), axis=3) / 255.\n\t\t\tmel_batch = np.reshape(mel_batch, [len(mel_batch), mel_batch.shape[1], mel_batch.shape[2], 1])\n\n\t\t\tyield img_batch, mel_batch, frame_batch, coords_batch\n\t\t\timg_batch, mel_batch, frame_batch, coords_batch = [], [], [], []\n\n\tif len(img_batch) > 0:\n\t\timg_batch, mel_batch = np.asarray(img_batch), np.asarray(mel_batch)\n\n\t\timg_masked = img_batch.copy()\n\t\timg_masked[:, args.img_size//2:] = 0\n\n\t\timg_batch = np.concatenate((img_masked, img_batch), axis=3) / 255.\n\t\tmel_batch = np.reshape(mel_batch, [len(mel_batch), mel_batch.shape[1], mel_batch.shape[2], 1])\n\n\t\tyield img_batch, mel_batch, frame_batch, coords_batch\n\ndef increase_frames(frames, l):\n\t## evenly duplicating frames to increase length of video\n\twhile len(frames) < l:\n\t\tdup_every = float(l) / len(frames)\n\n\t\tfinal_frames = []\n\t\tnext_duplicate = 0.\n\n\t\tfor i, f in enumerate(frames):\n\t\t\tfinal_frames.append(f)\n\n\t\t\tif int(np.ceil(next_duplicate)) == i:\n\t\t\t\tfinal_frames.append(f)\n\n\t\t\tnext_duplicate += dup_every\n\n\t\tframes = final_frames\n\n\treturn frames[:l]\n\nmel_step_size = 16\ndevice = 'cuda' if torch.cuda.is_available() else 'cpu'\nprint('Using {} for inference.'.format(device))\n\ndetector = face_detection.FaceAlignment(face_detection.LandmarksType._2D, 
\n\t\t\t\t\t\t\t\t\t\t\tflip_input=False, device=device)\n\ndef _load(checkpoint_path):\n\tif device == 'cuda':\n\t\tcheckpoint = torch.load(checkpoint_path)\n\telse:\n\t\tcheckpoint = torch.load(checkpoint_path,\n\t\t\t\t\t\t\t\tmap_location=lambda storage, loc: storage)\n\treturn checkpoint\n\ndef load_model(path):\n\tmodel = Wav2Lip()\n\tprint(\"Load checkpoint from: {}\".format(path))\n\tcheckpoint = _load(path)\n\ts = checkpoint[\"state_dict\"]\n\tnew_s = {}\n\tfor k, v in s.items():\n\t\tnew_s[k.replace('module.', '')] = v\n\tmodel.load_state_dict(new_s)\n\n\tmodel = model.to(device)\n\treturn model.eval()\n\nmodel = load_model(args.checkpoint_path)\n\ndef main():\n\tif not os.path.isdir(args.results_dir): os.makedirs(args.results_dir)\n\n\tif args.mode == 'dubbed':\n\t\tfiles = listdir(args.data_root)\n\t\tlines = ['{} {}'.format(f, f) for f in files]\n\n\telse:\n\t\tassert args.filelist is not None\n\t\twith open(args.filelist, 'r') as filelist:\n\t\t\tlines = filelist.readlines()\n\n\tfor idx, line in enumerate(tqdm(lines)):\n\t\tvideo, audio_src = line.strip().split()\n\n\t\taudio_src = os.path.join(args.data_root, audio_src)\n\t\tvideo = os.path.join(args.data_root, video)\n\n\t\tcommand = 'ffmpeg -loglevel panic -y -i {} -strict -2 {}'.format(audio_src, '../temp/temp.wav')\n\t\tsubprocess.call(command, shell=True)\n\t\ttemp_audio = '../temp/temp.wav'\n\n\t\twav = audio.load_wav(temp_audio, 16000)\n\t\tmel = audio.melspectrogram(wav)\n\n\t\tif np.isnan(mel.reshape(-1)).sum() > 0:\n\t\t\traise ValueError('Mel contains nan!')\n\n\t\tvideo_stream = cv2.VideoCapture(video)\n\n\t\tfps = video_stream.get(cv2.CAP_PROP_FPS)\n\t\tmel_idx_multiplier = 80./fps\n\n\t\tfull_frames = []\n\t\twhile 1:\n\t\t\tstill_reading, frame = video_stream.read()\n\t\t\tif not still_reading:\n\t\t\t\tvideo_stream.release()\n\t\t\t\tbreak\n\n\t\t\tif min(frame.shape[:-1]) > args.max_frame_res:\n\t\t\t\th, w = frame.shape[:-1]\n\t\t\t\tscale_factor = min(h, w) / 
float(args.max_frame_res)\n\t\t\t\th = int(h/scale_factor)\n\t\t\t\tw = int(w/scale_factor)\n\n\t\t\t\tframe = cv2.resize(frame, (w, h))\n\t\t\tfull_frames.append(frame)\n\n\t\tmel_chunks = []\n\t\ti = 0\n\t\twhile 1:\n\t\t\tstart_idx = int(i * mel_idx_multiplier)\n\t\t\tif start_idx + mel_step_size > len(mel[0]):\n\t\t\t\tbreak\n\t\t\tmel_chunks.append(mel[:, start_idx : start_idx + mel_step_size])\n\t\t\ti += 1\n\n\t\tif len(full_frames) < len(mel_chunks):\n\t\t\tif args.mode == 'tts':\n\t\t\t\tfull_frames = increase_frames(full_frames, len(mel_chunks))\n\t\t\telse:\n\t\t\t\traise ValueError('#Frames, audio length mismatch')\n\n\t\telse:\n\t\t\tfull_frames = full_frames[:len(mel_chunks)]\n\n\t\ttry:\n\t\t\tface_det_results, full_frames = face_detect(full_frames.copy())\n\t\texcept ValueError as e:\n\t\t\tcontinue\n\n\t\tbatch_size = args.wav2lip_batch_size\n\t\tgen = datagen(full_frames.copy(), face_det_results, mel_chunks)\n\n\t\tfor i, (img_batch, mel_batch, frames, coords) in enumerate(gen):\n\t\t\tif i == 0:\n\t\t\t\tframe_h, frame_w = full_frames[0].shape[:-1]\n\n\t\t\t\tout = cv2.VideoWriter('../temp/result.avi', \n\t\t\t\t\t\t\t\tcv2.VideoWriter_fourcc(*'DIVX'), fps, (frame_w, frame_h))\n\n\t\t\timg_batch = torch.FloatTensor(np.transpose(img_batch, (0, 3, 1, 2))).to(device)\n\t\t\tmel_batch = torch.FloatTensor(np.transpose(mel_batch, (0, 3, 1, 2))).to(device)\n\n\t\t\twith torch.no_grad():\n\t\t\t\tpred = model(mel_batch, img_batch)\n\t\t\t\t\t\n\n\t\t\tpred = pred.cpu().numpy().transpose(0, 2, 3, 1) * 255.\n\t\t\t\n\t\t\tfor pl, f, c in zip(pred, frames, coords):\n\t\t\t\ty1, y2, x1, x2 = c\n\t\t\t\tpl = cv2.resize(pl.astype(np.uint8), (x2 - x1, y2 - y1))\n\t\t\t\tf[y1:y2, x1:x2] = pl\n\t\t\t\tout.write(f)\n\n\t\tout.release()\n\n\t\tvid = os.path.join(args.results_dir, '{}.mp4'.format(idx))\n\t\tcommand = 'ffmpeg -loglevel panic -y -i {} -i {} -strict -2 -q:v 1 {}'.format('../temp/temp.wav', \n\t\t\t\t\t\t\t\t'../temp/result.avi', 
vid)\n\t\tsubprocess.call(command, shell=True)\n\n\nif __name__ == '__main__':\n\tmain()\n"
  },
  {
    "path": "evaluation/scores_LSE/SyncNetInstance_calc_scores.py",
    "content": "#!/usr/bin/python\n#-*- coding: utf-8 -*-\n# Video 25 FPS, Audio 16000HZ\n\nimport torch\nimport numpy\nimport time, pdb, argparse, subprocess, os, math, glob\nimport cv2\nimport python_speech_features\n\nfrom scipy import signal\nfrom scipy.io import wavfile\nfrom SyncNetModel import *\nfrom shutil import rmtree\n\n\n# ==================== Get OFFSET ====================\n\ndef calc_pdist(feat1, feat2, vshift=10):\n    \n    win_size = vshift*2+1\n\n    feat2p = torch.nn.functional.pad(feat2,(0,0,vshift,vshift))\n\n    dists = []\n\n    for i in range(0,len(feat1)):\n\n        dists.append(torch.nn.functional.pairwise_distance(feat1[[i],:].repeat(win_size, 1), feat2p[i:i+win_size,:]))\n\n    return dists\n\n# ==================== MAIN DEF ====================\n\nclass SyncNetInstance(torch.nn.Module):\n\n    def __init__(self, dropout = 0, num_layers_in_fc_layers = 1024):\n        super(SyncNetInstance, self).__init__();\n\n        self.__S__ = S(num_layers_in_fc_layers = num_layers_in_fc_layers).cuda();\n\n    def evaluate(self, opt, videofile):\n\n        self.__S__.eval();\n\n        # ========== ==========\n        # Convert files\n        # ========== ==========\n\n        if os.path.exists(os.path.join(opt.tmp_dir,opt.reference)):\n          rmtree(os.path.join(opt.tmp_dir,opt.reference))\n\n        os.makedirs(os.path.join(opt.tmp_dir,opt.reference))\n\n        command = (\"ffmpeg -loglevel error -y -i %s -threads 1 -f image2 %s\" % (videofile,os.path.join(opt.tmp_dir,opt.reference,'%06d.jpg'))) \n        output = subprocess.call(command, shell=True, stdout=None)\n\n        command = (\"ffmpeg -loglevel error -y -i %s -async 1 -ac 1 -vn -acodec pcm_s16le -ar 16000 %s\" % (videofile,os.path.join(opt.tmp_dir,opt.reference,'audio.wav'))) \n        output = subprocess.call(command, shell=True, stdout=None)\n        \n        # ========== ==========\n        # Load video \n        # ========== ==========\n\n        images = []\n        \n        
flist = glob.glob(os.path.join(opt.tmp_dir,opt.reference,'*.jpg'))\n        flist.sort()\n\n        for fname in flist:\n            img_input = cv2.imread(fname)\n            img_input = cv2.resize(img_input, (224,224)) #HARD CODED, CHANGE BEFORE RELEASE\n            images.append(img_input)\n\n        im = numpy.stack(images,axis=3)\n        im = numpy.expand_dims(im,axis=0)\n        im = numpy.transpose(im,(0,3,4,1,2))\n\n        imtv = torch.autograd.Variable(torch.from_numpy(im.astype(float)).float())\n\n        # ========== ==========\n        # Load audio\n        # ========== ==========\n\n        sample_rate, audio = wavfile.read(os.path.join(opt.tmp_dir,opt.reference,'audio.wav'))\n        mfcc = zip(*python_speech_features.mfcc(audio,sample_rate))\n        mfcc = numpy.stack([numpy.array(i) for i in mfcc])\n\n        cc = numpy.expand_dims(numpy.expand_dims(mfcc,axis=0),axis=0)\n        cct = torch.autograd.Variable(torch.from_numpy(cc.astype(float)).float())\n\n        # ========== ==========\n        # Check audio and video input length\n        # ========== ==========\n\n        #if (float(len(audio))/16000) != (float(len(images))/25) :\n        #    print(\"WARNING: Audio (%.4fs) and video (%.4fs) lengths are different.\"%(float(len(audio))/16000,float(len(images))/25))\n\n        min_length = min(len(images),math.floor(len(audio)/640))\n        \n        # ========== ==========\n        # Generate video and audio feats\n        # ========== ==========\n\n        lastframe = min_length-5\n        im_feat = []\n        cc_feat = []\n\n        tS = time.time()\n        for i in range(0,lastframe,opt.batch_size):\n            \n            im_batch = [ imtv[:,:,vframe:vframe+5,:,:] for vframe in range(i,min(lastframe,i+opt.batch_size)) ]\n            im_in = torch.cat(im_batch,0)\n            im_out  = self.__S__.forward_lip(im_in.cuda());\n            im_feat.append(im_out.data.cpu())\n\n            cc_batch = [ cct[:,:,:,vframe*4:vframe*4+20] for 
vframe in range(i,min(lastframe,i+opt.batch_size)) ]\n            cc_in = torch.cat(cc_batch,0)\n            cc_out  = self.__S__.forward_aud(cc_in.cuda())\n            cc_feat.append(cc_out.data.cpu())\n\n        im_feat = torch.cat(im_feat,0)\n        cc_feat = torch.cat(cc_feat,0)\n\n        # ========== ==========\n        # Compute offset\n        # ========== ==========\n            \n        #print('Compute time %.3f sec.' % (time.time()-tS))\n\n        dists = calc_pdist(im_feat,cc_feat,vshift=opt.vshift)\n        mdist = torch.mean(torch.stack(dists,1),1)\n\n        minval, minidx = torch.min(mdist,0)\n\n        offset = opt.vshift-minidx\n        conf   = torch.median(mdist) - minval\n\n        fdist   = numpy.stack([dist[minidx].numpy() for dist in dists])\n        # fdist   = numpy.pad(fdist, (3,3), 'constant', constant_values=15)\n        fconf   = torch.median(mdist).numpy() - fdist\n        fconfm  = signal.medfilt(fconf,kernel_size=9)\n        \n        numpy.set_printoptions(formatter={'float': '{: 0.3f}'.format})\n        #print('Framewise conf: ')\n        #print(fconfm)\n        #print('AV offset: \\t%d \\nMin dist: \\t%.3f\\nConfidence: \\t%.3f' % (offset,minval,conf))\n\n        dists_npy = numpy.array([ dist.numpy() for dist in dists ])\n        return offset.numpy(), conf.numpy(), minval.numpy()\n\n    def extract_feature(self, opt, videofile):\n\n        self.__S__.eval();\n        \n        # ========== ==========\n        # Load video \n        # ========== ==========\n        cap = cv2.VideoCapture(videofile)\n\n        frame_num = 1;\n        images = []\n        while frame_num:\n            frame_num += 1\n            ret, image = cap.read()\n            if ret == 0:\n                break\n\n            images.append(image)\n\n        im = numpy.stack(images,axis=3)\n        im = numpy.expand_dims(im,axis=0)\n        im = numpy.transpose(im,(0,3,4,1,2))\n\n        imtv = 
torch.autograd.Variable(torch.from_numpy(im.astype(float)).float())\n        \n        # ========== ==========\n        # Generate video feats\n        # ========== ==========\n\n        lastframe = len(images)-4\n        im_feat = []\n\n        tS = time.time()\n        for i in range(0,lastframe,opt.batch_size):\n            \n            im_batch = [ imtv[:,:,vframe:vframe+5,:,:] for vframe in range(i,min(lastframe,i+opt.batch_size)) ]\n            im_in = torch.cat(im_batch,0)\n            im_out  = self.__S__.forward_lipfeat(im_in.cuda());\n            im_feat.append(im_out.data.cpu())\n\n        im_feat = torch.cat(im_feat,0)\n\n        # ========== ==========\n        # Compute offset\n        # ========== ==========\n            \n        print('Compute time %.3f sec.' % (time.time()-tS))\n\n        return im_feat\n\n\n    def loadParameters(self, path):\n        loaded_state = torch.load(path, map_location=lambda storage, loc: storage);\n\n        self_state = self.__S__.state_dict();\n\n        for name, param in loaded_state.items():\n\n            self_state[name].copy_(param);\n"
  },
  {
    "path": "evaluation/scores_LSE/calculate_scores_LRS.py",
    "content": "#!/usr/bin/python\n#-*- coding: utf-8 -*-\n\nimport time, pdb, argparse, subprocess\nimport glob\nimport os\nfrom tqdm import tqdm\n\nfrom SyncNetInstance_calc_scores import *\n\n# ==================== LOAD PARAMS ====================\n\n\nparser = argparse.ArgumentParser(description = \"SyncNet\");\n\nparser.add_argument('--initial_model', type=str, default=\"data/syncnet_v2.model\", help='');\nparser.add_argument('--batch_size', type=int, default='20', help='');\nparser.add_argument('--vshift', type=int, default='15', help='');\nparser.add_argument('--data_root', type=str, required=True, help='');\nparser.add_argument('--tmp_dir', type=str, default=\"data/work/pytmp\", help='');\nparser.add_argument('--reference', type=str, default=\"demo\", help='');\n\nopt = parser.parse_args();\n\n\n# ==================== RUN EVALUATION ====================\n\ns = SyncNetInstance();\n\ns.loadParameters(opt.initial_model);\n#print(\"Model %s loaded.\"%opt.initial_model);\npath = os.path.join(opt.data_root, \"*.mp4\")\n\nall_videos = glob.glob(path)\n\nprog_bar = tqdm(range(len(all_videos)))\navg_confidence = 0.\navg_min_distance = 0.\n\n\nfor videofile_idx in prog_bar:\n\tvideofile = all_videos[videofile_idx]\n\toffset, confidence, min_distance = s.evaluate(opt, videofile=videofile)\n\tavg_confidence += confidence\n\tavg_min_distance += min_distance\n\tprog_bar.set_description('Avg Confidence: {}, Avg Minimum Dist: {}'.format(round(avg_confidence / (videofile_idx + 1), 3), round(avg_min_distance / (videofile_idx + 1), 3)))\n\tprog_bar.refresh()\n\nprint ('Average Confidence: {}'.format(avg_confidence/len(all_videos)))\nprint ('Average Minimum Distance: {}'.format(avg_min_distance/len(all_videos)))\n\n\n\n"
  },
  {
    "path": "evaluation/scores_LSE/calculate_scores_real_videos.py",
    "content": "#!/usr/bin/python\n#-*- coding: utf-8 -*-\n\nimport time, pdb, argparse, subprocess, pickle, os, gzip, glob\n\nfrom SyncNetInstance_calc_scores import *\n\n# ==================== PARSE ARGUMENT ====================\n\nparser = argparse.ArgumentParser(description = \"SyncNet\");\nparser.add_argument('--initial_model', type=str, default=\"data/syncnet_v2.model\", help='');\nparser.add_argument('--batch_size', type=int, default='20', help='');\nparser.add_argument('--vshift', type=int, default='15', help='');\nparser.add_argument('--data_dir', type=str, default='data/work', help='');\nparser.add_argument('--videofile', type=str, default='', help='');\nparser.add_argument('--reference', type=str, default='', help='');\nopt = parser.parse_args();\n\nsetattr(opt,'avi_dir',os.path.join(opt.data_dir,'pyavi'))\nsetattr(opt,'tmp_dir',os.path.join(opt.data_dir,'pytmp'))\nsetattr(opt,'work_dir',os.path.join(opt.data_dir,'pywork'))\nsetattr(opt,'crop_dir',os.path.join(opt.data_dir,'pycrop'))\n\n\n# ==================== LOAD MODEL AND FILE LIST ====================\n\ns = SyncNetInstance();\n\ns.loadParameters(opt.initial_model);\n#print(\"Model %s loaded.\"%opt.initial_model);\n\nflist = glob.glob(os.path.join(opt.crop_dir,opt.reference,'0*.avi'))\nflist.sort()\n\n# ==================== GET OFFSETS ====================\n\ndists = []\nfor idx, fname in enumerate(flist):\n    offset, conf, dist = s.evaluate(opt,videofile=fname)\n    print (str(dist)+\" \"+str(conf))\n      \n# ==================== PRINT RESULTS TO FILE ====================\n\n#with open(os.path.join(opt.work_dir,opt.reference,'activesd.pckl'), 'wb') as fil:\n#    pickle.dump(dists, fil)\n"
  },
  {
    "path": "evaluation/scores_LSE/calculate_scores_real_videos.sh",
    "content": "rm -f all_scores.txt\nyourfilenames=`ls $1`\n\nfor eachfile in $yourfilenames\ndo\n   python run_pipeline.py --videofile $1/$eachfile --reference wav2lip --data_dir tmp_dir\n   python calculate_scores_real_videos.py --videofile $1/$eachfile --reference wav2lip --data_dir tmp_dir >> all_scores.txt\ndone\n"
  },
  {
    "path": "evaluation/test_filelists/README.md",
    "content": "This folder contains the filelists for the new evaluation framework proposed in the paper. \n\n## Test filelists for LRS2, LRS3, and LRW.\n\nThis folder contains three filelists, each containing a list of names of audio-video pairs from the test sets of LRS2, LRS3, and LRW. The LRS2 and LRW filelists are strictly \"Copyright BBC\" and can only be used for “non-commercial research by applicants who have an agreement with the BBC to access the Lip Reading in the Wild and/or Lip Reading Sentences in the Wild datasets”. Please follow this link for more details: [https://www.bbc.co.uk/rd/projects/lip-reading-datasets](https://www.bbc.co.uk/rd/projects/lip-reading-datasets). \n\n\n## ReSyncED benchmark\n\nThe sub-folder `ReSyncED` contains filelists for our own Real-world lip-Sync Evaluation Dataset (ReSyncED).\n\n\n#### Instructions on how to use the above two filelists are available in the README of the parent folder. \n"
  },
  {
    "path": "evaluation/test_filelists/ReSyncED/random_pairs.txt",
    "content": "sachin.mp4 emma_cropped.mp4\nsachin.mp4 mourinho.mp4\nsachin.mp4 elon.mp4\nsachin.mp4 messi2.mp4\nsachin.mp4 cr1.mp4\nsachin.mp4 sachin.mp4\nsachin.mp4 sg.mp4\nsachin.mp4 fergi.mp4\nsachin.mp4 spanish_lec1.mp4\nsachin.mp4 bush_small.mp4\nsachin.mp4 macca_cut.mp4\nsachin.mp4 ca_cropped.mp4\nsachin.mp4 lecun.mp4\nsachin.mp4 spanish_lec0.mp4\nsrk.mp4 emma_cropped.mp4\nsrk.mp4 mourinho.mp4\nsrk.mp4 elon.mp4\nsrk.mp4 messi2.mp4\nsrk.mp4 cr1.mp4\nsrk.mp4 srk.mp4\nsrk.mp4 sachin.mp4\nsrk.mp4 sg.mp4\nsrk.mp4 fergi.mp4\nsrk.mp4 spanish_lec1.mp4\nsrk.mp4 bush_small.mp4\nsrk.mp4 macca_cut.mp4\nsrk.mp4 ca_cropped.mp4\nsrk.mp4 guardiola.mp4\nsrk.mp4 lecun.mp4\nsrk.mp4 spanish_lec0.mp4\ncr1.mp4 emma_cropped.mp4\ncr1.mp4 elon.mp4\ncr1.mp4 messi2.mp4\ncr1.mp4 cr1.mp4\ncr1.mp4 spanish_lec1.mp4\ncr1.mp4 bush_small.mp4\ncr1.mp4 macca_cut.mp4\ncr1.mp4 ca_cropped.mp4\ncr1.mp4 lecun.mp4\ncr1.mp4 spanish_lec0.mp4\nmacca_cut.mp4 emma_cropped.mp4\nmacca_cut.mp4 elon.mp4\nmacca_cut.mp4 messi2.mp4\nmacca_cut.mp4 spanish_lec1.mp4\nmacca_cut.mp4 macca_cut.mp4\nmacca_cut.mp4 ca_cropped.mp4\nmacca_cut.mp4 spanish_lec0.mp4\nlecun.mp4 emma_cropped.mp4\nlecun.mp4 elon.mp4\nlecun.mp4 messi2.mp4\nlecun.mp4 spanish_lec1.mp4\nlecun.mp4 macca_cut.mp4\nlecun.mp4 ca_cropped.mp4\nlecun.mp4 lecun.mp4\nlecun.mp4 spanish_lec0.mp4\nmessi2.mp4 emma_cropped.mp4\nmessi2.mp4 elon.mp4\nmessi2.mp4 messi2.mp4\nmessi2.mp4 spanish_lec1.mp4\nmessi2.mp4 macca_cut.mp4\nmessi2.mp4 ca_cropped.mp4\nmessi2.mp4 spanish_lec0.mp4\nca_cropped.mp4 emma_cropped.mp4\nca_cropped.mp4 elon.mp4\nca_cropped.mp4 spanish_lec1.mp4\nca_cropped.mp4 ca_cropped.mp4\nca_cropped.mp4 spanish_lec0.mp4\nspanish_lec1.mp4 spanish_lec1.mp4\nspanish_lec1.mp4 spanish_lec0.mp4\nelon.mp4 elon.mp4\nelon.mp4 spanish_lec1.mp4\nelon.mp4 spanish_lec0.mp4\nguardiola.mp4 emma_cropped.mp4\nguardiola.mp4 mourinho.mp4\nguardiola.mp4 elon.mp4\nguardiola.mp4 messi2.mp4\nguardiola.mp4 cr1.mp4\nguardiola.mp4 sachin.mp4\nguardiola.mp4 
sg.mp4\nguardiola.mp4 fergi.mp4\nguardiola.mp4 spanish_lec1.mp4\nguardiola.mp4 bush_small.mp4\nguardiola.mp4 macca_cut.mp4\nguardiola.mp4 ca_cropped.mp4\nguardiola.mp4 guardiola.mp4\nguardiola.mp4 lecun.mp4\nguardiola.mp4 spanish_lec0.mp4\nfergi.mp4 emma_cropped.mp4\nfergi.mp4 mourinho.mp4\nfergi.mp4 elon.mp4\nfergi.mp4 messi2.mp4\nfergi.mp4 cr1.mp4\nfergi.mp4 sachin.mp4\nfergi.mp4 sg.mp4\nfergi.mp4 fergi.mp4\nfergi.mp4 spanish_lec1.mp4\nfergi.mp4 bush_small.mp4\nfergi.mp4 macca_cut.mp4\nfergi.mp4 ca_cropped.mp4\nfergi.mp4 lecun.mp4\nfergi.mp4 spanish_lec0.mp4\nspanish.mp4 emma_cropped.mp4\nspanish.mp4 spanish.mp4\nspanish.mp4 mourinho.mp4\nspanish.mp4 elon.mp4\nspanish.mp4 messi2.mp4\nspanish.mp4 cr1.mp4\nspanish.mp4 srk.mp4\nspanish.mp4 sachin.mp4\nspanish.mp4 sg.mp4\nspanish.mp4 fergi.mp4\nspanish.mp4 spanish_lec1.mp4\nspanish.mp4 bush_small.mp4\nspanish.mp4 macca_cut.mp4\nspanish.mp4 ca_cropped.mp4\nspanish.mp4 guardiola.mp4\nspanish.mp4 lecun.mp4\nspanish.mp4 spanish_lec0.mp4\nbush_small.mp4 emma_cropped.mp4\nbush_small.mp4 elon.mp4\nbush_small.mp4 messi2.mp4\nbush_small.mp4 spanish_lec1.mp4\nbush_small.mp4 bush_small.mp4\nbush_small.mp4 macca_cut.mp4\nbush_small.mp4 ca_cropped.mp4\nbush_small.mp4 lecun.mp4\nbush_small.mp4 spanish_lec0.mp4\nemma_cropped.mp4 emma_cropped.mp4\nemma_cropped.mp4 elon.mp4\nemma_cropped.mp4 spanish_lec1.mp4\nemma_cropped.mp4 spanish_lec0.mp4\nsg.mp4 emma_cropped.mp4\nsg.mp4 mourinho.mp4\nsg.mp4 elon.mp4\nsg.mp4 messi2.mp4\nsg.mp4 cr1.mp4\nsg.mp4 sachin.mp4\nsg.mp4 sg.mp4\nsg.mp4 fergi.mp4\nsg.mp4 spanish_lec1.mp4\nsg.mp4 bush_small.mp4\nsg.mp4 macca_cut.mp4\nsg.mp4 ca_cropped.mp4\nsg.mp4 lecun.mp4\nsg.mp4 spanish_lec0.mp4\nspanish_lec0.mp4 spanish_lec0.mp4\nmourinho.mp4 emma_cropped.mp4\nmourinho.mp4 mourinho.mp4\nmourinho.mp4 elon.mp4\nmourinho.mp4 messi2.mp4\nmourinho.mp4 cr1.mp4\nmourinho.mp4 sachin.mp4\nmourinho.mp4 sg.mp4\nmourinho.mp4 fergi.mp4\nmourinho.mp4 spanish_lec1.mp4\nmourinho.mp4 bush_small.mp4\nmourinho.mp4 
macca_cut.mp4\nmourinho.mp4 ca_cropped.mp4\nmourinho.mp4 lecun.mp4\nmourinho.mp4 spanish_lec0.mp4\n"
  },
  {
    "path": "evaluation/test_filelists/ReSyncED/tts_pairs.txt",
    "content": "adam_1.mp4 andreng_optimization.wav\nagad_2.mp4 agad_2.wav\nagad_1.mp4 agad_1.wav\nagad_3.mp4 agad_3.wav\nrms_prop_1.mp4 rms_prop_tts.wav\ntf_1.mp4 tf_1.wav\ntf_2.mp4 tf_2.wav\nandrew_ng_ai_business.mp4 andrewng_business_tts.wav\ncovid_autopsy_1.mp4 autopsy_tts.wav\nnews_1.mp4 news_tts.wav\nandrew_ng_fund_1.mp4 andrewng_ai_fund.wav\ncovid_treatments_1.mp4 covid_tts.wav\npytorch_v_tf.mp4 pytorch_vs_tf_eng.wav\npytorch_1.mp4 pytorch.wav\npkb_1.mp4 pkb_1.wav\nss_1.mp4 ss_1.wav\ncarlsen_1.mp4 carlsen_eng.wav\nfrench.mp4 french.wav"
  },
  {
    "path": "face_detection/README.md",
    "content": "The code for Face Detection in this folder has been taken from the wonderful [face_alignment](https://github.com/1adrianb/face-alignment) repository. This has been modified to take batches of faces at a time. "
  },
  {
    "path": "face_detection/__init__.py",
    "content": "# -*- coding: utf-8 -*-\n\n__author__ = \"\"\"Adrian Bulat\"\"\"\n__email__ = 'adrian.bulat@nottingham.ac.uk'\n__version__ = '1.0.1'\n\nfrom .api import FaceAlignment, LandmarksType, NetworkSize\n"
  },
  {
    "path": "face_detection/api.py",
    "content": "from __future__ import print_function\nimport os\nimport torch\nfrom torch.utils.model_zoo import load_url\nfrom enum import Enum\nimport numpy as np\nimport cv2\ntry:\n    import urllib.request as request_file\nexcept BaseException:\n    import urllib as request_file\n\nfrom .models import FAN, ResNetDepth\nfrom .utils import *\n\n\nclass LandmarksType(Enum):\n    \"\"\"Enum class defining the type of landmarks to detect.\n\n    ``_2D`` - the detected points ``(x,y)`` are detected in a 2D space and follow the visible contour of the face\n    ``_2halfD`` - this points represent the projection of the 3D points into 3D\n    ``_3D`` - detect the points ``(x,y,z)``` in a 3D space\n\n    \"\"\"\n    _2D = 1\n    _2halfD = 2\n    _3D = 3\n\n\nclass NetworkSize(Enum):\n    # TINY = 1\n    # SMALL = 2\n    # MEDIUM = 3\n    LARGE = 4\n\n    def __new__(cls, value):\n        member = object.__new__(cls)\n        member._value_ = value\n        return member\n\n    def __int__(self):\n        return self.value\n\nROOT = os.path.dirname(os.path.abspath(__file__))\n\nclass FaceAlignment:\n    def __init__(self, landmarks_type, network_size=NetworkSize.LARGE,\n                 device='cuda', flip_input=False, face_detector='sfd', verbose=False):\n        self.device = device\n        self.flip_input = flip_input\n        self.landmarks_type = landmarks_type\n        self.verbose = verbose\n\n        network_size = int(network_size)\n\n        if 'cuda' in device:\n            torch.backends.cudnn.benchmark = True\n\n        # Get the face detector\n        face_detector_module = __import__('face_detection.detection.' 
+ face_detector,\n                                          globals(), locals(), [face_detector], 0)\n        self.face_detector = face_detector_module.FaceDetector(device=device, verbose=verbose)\n\n    def get_detections_for_batch(self, images):\n        images = images[..., ::-1]\n        detected_faces = self.face_detector.detect_from_batch(images.copy())\n        results = []\n\n        for i, d in enumerate(detected_faces):\n            if len(d) == 0:\n                results.append(None)\n                continue\n            d = d[0]\n            d = np.clip(d, 0, None)\n            \n            x1, y1, x2, y2 = map(int, d[:-1])\n            results.append((x1, y1, x2, y2))\n\n        return results"
  },
  {
    "path": "face_detection/detection/__init__.py",
    "content": "from .core import FaceDetector"
  },
  {
    "path": "face_detection/detection/core.py",
    "content": "import logging\nimport glob\nfrom tqdm import tqdm\nimport numpy as np\nimport torch\nimport cv2\n\n\nclass FaceDetector(object):\n    \"\"\"An abstract class representing a face detector.\n\n    Any other face detection implementation must subclass it. All subclasses\n    must implement ``detect_from_image``, that return a list of detected\n    bounding boxes. Optionally, for speed considerations detect from path is\n    recommended.\n    \"\"\"\n\n    def __init__(self, device, verbose):\n        self.device = device\n        self.verbose = verbose\n\n        if verbose:\n            if 'cpu' in device:\n                logger = logging.getLogger(__name__)\n                logger.warning(\"Detection running on CPU, this may be potentially slow.\")\n\n        if 'cpu' not in device and 'cuda' not in device:\n            if verbose:\n                logger.error(\"Expected values for device are: {cpu, cuda} but got: %s\", device)\n            raise ValueError\n\n    def detect_from_image(self, tensor_or_path):\n        \"\"\"Detects faces in a given image.\n\n        This function detects the faces present in a provided BGR(usually)\n        image. The input can be either the image itself or the path to it.\n\n        Arguments:\n            tensor_or_path {numpy.ndarray, torch.tensor or string} -- the path\n            to an image or the image itself.\n\n        Example::\n\n            >>> path_to_image = 'data/image_01.jpg'\n            ...   detected_faces = detect_from_image(path_to_image)\n            [A list of bounding boxes (x1, y1, x2, y2)]\n            >>> image = cv2.imread(path_to_image)\n            ...   
detected_faces = detect_from_image(image)\n            [A list of bounding boxes (x1, y1, x2, y2)]\n\n        \"\"\"\n        raise NotImplementedError\n\n    def detect_from_directory(self, path, extensions=['.jpg', '.png'], recursive=False, show_progress_bar=True):\n        \"\"\"Detects faces from all the images present in a given directory.\n\n        Arguments:\n            path {string} -- a string containing a path that points to the folder containing the images\n\n        Keyword Arguments:\n            extensions {list} -- list of strings containing the extensions to be\n            considered, in the following format: ``.extension_name`` (default:\n            {['.jpg', '.png']})\n            recursive {bool} -- whether to scan the folder recursively\n            (default: {False})\n            show_progress_bar {bool} -- display a progress bar (default: {True})\n\n        Example:\n        >>> directory = 'data'\n        ...   detected_faces = detect_from_directory(directory)\n        {A dictionary of [lists containing bounding boxes(x1, y1, x2, y2)]}\n\n        \"\"\"\n        if self.verbose:\n            logger = logging.getLogger(__name__)\n\n        if len(extensions) == 0:\n            if self.verbose:\n                logger.error(\"Expected at least one extension, but none was received.\")\n            raise ValueError\n\n        if self.verbose:\n            logger.info(\"Constructing the list of images.\")\n        additional_pattern = '/**/*' if recursive else '/*'\n        files = []\n        for extension in extensions:\n            files.extend(glob.glob(path + additional_pattern + extension, recursive=recursive))\n\n        if self.verbose:\n            logger.info(\"Finished searching for images. 
%s images found\", len(files))\n            logger.info(\"Preparing to run the detection.\")\n\n        predictions = {}\n        for image_path in tqdm(files, disable=not show_progress_bar):\n            if self.verbose:\n                logger.info(\"Running the face detector on image: %s\", image_path)\n            predictions[image_path] = self.detect_from_image(image_path)\n\n        if self.verbose:\n            logger.info(\"The detector was successfully run on all %s images\", len(files))\n\n        return predictions\n\n    @property\n    def reference_scale(self):\n        raise NotImplementedError\n\n    @property\n    def reference_x_shift(self):\n        raise NotImplementedError\n\n    @property\n    def reference_y_shift(self):\n        raise NotImplementedError\n\n    @staticmethod\n    def tensor_or_path_to_ndarray(tensor_or_path, rgb=True):\n        \"\"\"Convert path (represented as a string) or torch.tensor to a numpy.ndarray\n\n        Arguments:\n            tensor_or_path {numpy.ndarray, torch.tensor or string} -- path to the image, or the image itself\n        \"\"\"\n        if isinstance(tensor_or_path, str):\n            return cv2.imread(tensor_or_path) if not rgb else cv2.imread(tensor_or_path)[..., ::-1]\n        elif torch.is_tensor(tensor_or_path):\n            # Call cpu in case its coming from cuda\n            return tensor_or_path.cpu().numpy()[..., ::-1].copy() if not rgb else tensor_or_path.cpu().numpy()\n        elif isinstance(tensor_or_path, np.ndarray):\n            return tensor_or_path[..., ::-1].copy() if not rgb else tensor_or_path\n        else:\n            raise TypeError\n"
  },
  {
    "path": "face_detection/detection/sfd/__init__.py",
    "content": "from .sfd_detector import SFDDetector as FaceDetector"
  },
  {
    "path": "face_detection/detection/sfd/bbox.py",
    "content": "from __future__ import print_function\nimport os\nimport sys\nimport cv2\nimport random\nimport datetime\nimport time\nimport math\nimport argparse\nimport numpy as np\nimport torch\n\ntry:\n    from iou import IOU\nexcept BaseException:\n    # IOU cython speedup 10x\n    def IOU(ax1, ay1, ax2, ay2, bx1, by1, bx2, by2):\n        sa = abs((ax2 - ax1) * (ay2 - ay1))\n        sb = abs((bx2 - bx1) * (by2 - by1))\n        x1, y1 = max(ax1, bx1), max(ay1, by1)\n        x2, y2 = min(ax2, bx2), min(ay2, by2)\n        w = x2 - x1\n        h = y2 - y1\n        if w < 0 or h < 0:\n            return 0.0\n        else:\n            return 1.0 * w * h / (sa + sb - w * h)\n\n\ndef bboxlog(x1, y1, x2, y2, axc, ayc, aww, ahh):\n    xc, yc, ww, hh = (x2 + x1) / 2, (y2 + y1) / 2, x2 - x1, y2 - y1\n    dx, dy = (xc - axc) / aww, (yc - ayc) / ahh\n    dw, dh = math.log(ww / aww), math.log(hh / ahh)\n    return dx, dy, dw, dh\n\n\ndef bboxloginv(dx, dy, dw, dh, axc, ayc, aww, ahh):\n    xc, yc = dx * aww + axc, dy * ahh + ayc\n    ww, hh = math.exp(dw) * aww, math.exp(dh) * ahh\n    x1, x2, y1, y2 = xc - ww / 2, xc + ww / 2, yc - hh / 2, yc + hh / 2\n    return x1, y1, x2, y2\n\n\ndef nms(dets, thresh):\n    if 0 == len(dets):\n        return []\n    x1, y1, x2, y2, scores = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3], dets[:, 4]\n    areas = (x2 - x1 + 1) * (y2 - y1 + 1)\n    order = scores.argsort()[::-1]\n\n    keep = []\n    while order.size > 0:\n        i = order[0]\n        keep.append(i)\n        xx1, yy1 = np.maximum(x1[i], x1[order[1:]]), np.maximum(y1[i], y1[order[1:]])\n        xx2, yy2 = np.minimum(x2[i], x2[order[1:]]), np.minimum(y2[i], y2[order[1:]])\n\n        w, h = np.maximum(0.0, xx2 - xx1 + 1), np.maximum(0.0, yy2 - yy1 + 1)\n        ovr = w * h / (areas[i] + areas[order[1:]] - w * h)\n\n        inds = np.where(ovr <= thresh)[0]\n        order = order[inds + 1]\n\n    return keep\n\n\ndef encode(matched, priors, variances):\n    \"\"\"Encode the 
variances from the priorbox layers into the ground truth boxes\n    we have matched (based on jaccard overlap) with the prior boxes.\n    Args:\n        matched: (tensor) Coords of ground truth for each prior in point-form\n            Shape: [num_priors, 4].\n        priors: (tensor) Prior boxes in center-offset form\n            Shape: [num_priors,4].\n        variances: (list[float]) Variances of priorboxes\n    Return:\n        encoded boxes (tensor), Shape: [num_priors, 4]\n    \"\"\"\n\n    # dist b/t match center and prior's center\n    g_cxcy = (matched[:, :2] + matched[:, 2:]) / 2 - priors[:, :2]\n    # encode variance\n    g_cxcy /= (variances[0] * priors[:, 2:])\n    # match wh / prior wh\n    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]\n    g_wh = torch.log(g_wh) / variances[1]\n    # return target for smooth_l1_loss\n    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]\n\n\ndef decode(loc, priors, variances):\n    \"\"\"Decode locations from predictions using priors to undo\n    the encoding we did for offset regression at train time.\n    Args:\n        loc (tensor): location predictions for loc layers,\n            Shape: [num_priors,4]\n        priors (tensor): Prior boxes in center-offset form.\n            Shape: [num_priors,4].\n        variances: (list[float]) Variances of priorboxes\n    Return:\n        decoded bounding box predictions\n    \"\"\"\n\n    boxes = torch.cat((\n        priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],\n        priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])), 1)\n    boxes[:, :2] -= boxes[:, 2:] / 2\n    boxes[:, 2:] += boxes[:, :2]\n    return boxes\n\ndef batch_decode(loc, priors, variances):\n    \"\"\"Decode locations from predictions using priors to undo\n    the encoding we did for offset regression at train time.\n    Args:\n        loc (tensor): location predictions for loc layers,\n            Shape: [num_priors,4]\n        priors (tensor): Prior boxes in center-offset 
form.\n            Shape: [num_priors,4].\n        variances: (list[float]) Variances of priorboxes\n    Return:\n        decoded bounding box predictions\n    \"\"\"\n\n    boxes = torch.cat((\n        priors[:, :, :2] + loc[:, :, :2] * variances[0] * priors[:, :, 2:],\n        priors[:, :, 2:] * torch.exp(loc[:, :, 2:] * variances[1])), 2)\n    boxes[:, :, :2] -= boxes[:, :, 2:] / 2\n    boxes[:, :, 2:] += boxes[:, :, :2]\n    return boxes\n"
  },
  {
    "path": "face_detection/detection/sfd/detect.py",
    "content": "import torch\nimport torch.nn.functional as F\n\nimport os\nimport sys\nimport cv2\nimport random\nimport datetime\nimport math\nimport argparse\nimport numpy as np\n\nimport scipy.io as sio\nimport zipfile\nfrom .net_s3fd import s3fd\nfrom .bbox import *\n\n\ndef detect(net, img, device):\n    img = img - np.array([104, 117, 123])\n    img = img.transpose(2, 0, 1)\n    img = img.reshape((1,) + img.shape)\n\n    if 'cuda' in device:\n        torch.backends.cudnn.benchmark = True\n\n    img = torch.from_numpy(img).float().to(device)\n    BB, CC, HH, WW = img.size()\n    with torch.no_grad():\n        olist = net(img)\n\n    bboxlist = []\n    for i in range(len(olist) // 2):\n        olist[i * 2] = F.softmax(olist[i * 2], dim=1)\n    olist = [oelem.data.cpu() for oelem in olist]\n    for i in range(len(olist) // 2):\n        ocls, oreg = olist[i * 2], olist[i * 2 + 1]\n        FB, FC, FH, FW = ocls.size()  # feature map size\n        stride = 2**(i + 2)    # 4,8,16,32,64,128\n        anchor = stride * 4\n        poss = zip(*np.where(ocls[:, 1, :, :] > 0.05))\n        for Iindex, hindex, windex in poss:\n            axc, ayc = stride / 2 + windex * stride, stride / 2 + hindex * stride\n            score = ocls[0, 1, hindex, windex]\n            loc = oreg[0, :, hindex, windex].contiguous().view(1, 4)\n            priors = torch.Tensor([[axc / 1.0, ayc / 1.0, stride * 4 / 1.0, stride * 4 / 1.0]])\n            variances = [0.1, 0.2]\n            box = decode(loc, priors, variances)\n            x1, y1, x2, y2 = box[0] * 1.0\n            # cv2.rectangle(imgshow,(int(x1),int(y1)),(int(x2),int(y2)),(0,0,255),1)\n            bboxlist.append([x1, y1, x2, y2, score])\n    bboxlist = np.array(bboxlist)\n    if 0 == len(bboxlist):\n        bboxlist = np.zeros((1, 5))\n\n    return bboxlist\n\ndef batch_detect(net, imgs, device):\n    imgs = imgs - np.array([104, 117, 123])\n    imgs = imgs.transpose(0, 3, 1, 2)\n\n    if 'cuda' in device:\n        
torch.backends.cudnn.benchmark = True\n\n    imgs = torch.from_numpy(imgs).float().to(device)\n    BB, CC, HH, WW = imgs.size()\n    with torch.no_grad():\n        olist = net(imgs)\n\n    bboxlist = []\n    for i in range(len(olist) // 2):\n        olist[i * 2] = F.softmax(olist[i * 2], dim=1)\n    olist = [oelem.data.cpu() for oelem in olist]\n    for i in range(len(olist) // 2):\n        ocls, oreg = olist[i * 2], olist[i * 2 + 1]\n        FB, FC, FH, FW = ocls.size()  # feature map size\n        stride = 2**(i + 2)    # 4,8,16,32,64,128\n        anchor = stride * 4\n        poss = zip(*np.where(ocls[:, 1, :, :] > 0.05))\n        for Iindex, hindex, windex in poss:\n            axc, ayc = stride / 2 + windex * stride, stride / 2 + hindex * stride\n            score = ocls[:, 1, hindex, windex]\n            loc = oreg[:, :, hindex, windex].contiguous().view(BB, 1, 4)\n            priors = torch.Tensor([[axc / 1.0, ayc / 1.0, stride * 4 / 1.0, stride * 4 / 1.0]]).view(1, 1, 4)\n            variances = [0.1, 0.2]\n            box = batch_decode(loc, priors, variances)\n            box = box[:, 0] * 1.0\n            # cv2.rectangle(imgshow,(int(x1),int(y1)),(int(x2),int(y2)),(0,0,255),1)\n            bboxlist.append(torch.cat([box, score.unsqueeze(1)], 1).cpu().numpy())\n    bboxlist = np.array(bboxlist)\n    if 0 == len(bboxlist):\n        bboxlist = np.zeros((1, BB, 5))\n\n    return bboxlist\n\ndef flip_detect(net, img, device):\n    img = cv2.flip(img, 1)\n    b = detect(net, img, device)\n\n    bboxlist = np.zeros(b.shape)\n    bboxlist[:, 0] = img.shape[1] - b[:, 2]\n    bboxlist[:, 1] = b[:, 1]\n    bboxlist[:, 2] = img.shape[1] - b[:, 0]\n    bboxlist[:, 3] = b[:, 3]\n    bboxlist[:, 4] = b[:, 4]\n    return bboxlist\n\n\ndef pts_to_bb(pts):\n    min_x, min_y = np.min(pts, axis=0)\n    max_x, max_y = np.max(pts, axis=0)\n    return np.array([min_x, min_y, max_x, max_y])\n"
  },
  {
    "path": "face_detection/detection/sfd/net_s3fd.py",
    "content": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass L2Norm(nn.Module):\n    def __init__(self, n_channels, scale=1.0):\n        super(L2Norm, self).__init__()\n        self.n_channels = n_channels\n        self.scale = scale\n        self.eps = 1e-10\n        self.weight = nn.Parameter(torch.Tensor(self.n_channels))\n        self.weight.data *= 0.0\n        self.weight.data += self.scale\n\n    def forward(self, x):\n        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt() + self.eps\n        x = x / norm * self.weight.view(1, -1, 1, 1)\n        return x\n\n\nclass s3fd(nn.Module):\n    def __init__(self):\n        super(s3fd, self).__init__()\n        self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)\n        self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)\n\n        self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)\n        self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)\n\n        self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1)\n        self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)\n        self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)\n\n        self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1)\n        self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)\n        self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)\n\n        self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)\n        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)\n        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)\n\n        self.fc6 = nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=3)\n        self.fc7 = nn.Conv2d(1024, 1024, kernel_size=1, stride=1, padding=0)\n\n        self.conv6_1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)\n        self.conv6_2 
= nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)\n\n        self.conv7_1 = nn.Conv2d(512, 128, kernel_size=1, stride=1, padding=0)\n        self.conv7_2 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)\n\n        self.conv3_3_norm = L2Norm(256, scale=10)\n        self.conv4_3_norm = L2Norm(512, scale=8)\n        self.conv5_3_norm = L2Norm(512, scale=5)\n\n        self.conv3_3_norm_mbox_conf = nn.Conv2d(256, 4, kernel_size=3, stride=1, padding=1)\n        self.conv3_3_norm_mbox_loc = nn.Conv2d(256, 4, kernel_size=3, stride=1, padding=1)\n        self.conv4_3_norm_mbox_conf = nn.Conv2d(512, 2, kernel_size=3, stride=1, padding=1)\n        self.conv4_3_norm_mbox_loc = nn.Conv2d(512, 4, kernel_size=3, stride=1, padding=1)\n        self.conv5_3_norm_mbox_conf = nn.Conv2d(512, 2, kernel_size=3, stride=1, padding=1)\n        self.conv5_3_norm_mbox_loc = nn.Conv2d(512, 4, kernel_size=3, stride=1, padding=1)\n\n        self.fc7_mbox_conf = nn.Conv2d(1024, 2, kernel_size=3, stride=1, padding=1)\n        self.fc7_mbox_loc = nn.Conv2d(1024, 4, kernel_size=3, stride=1, padding=1)\n        self.conv6_2_mbox_conf = nn.Conv2d(512, 2, kernel_size=3, stride=1, padding=1)\n        self.conv6_2_mbox_loc = nn.Conv2d(512, 4, kernel_size=3, stride=1, padding=1)\n        self.conv7_2_mbox_conf = nn.Conv2d(256, 2, kernel_size=3, stride=1, padding=1)\n        self.conv7_2_mbox_loc = nn.Conv2d(256, 4, kernel_size=3, stride=1, padding=1)\n\n    def forward(self, x):\n        h = F.relu(self.conv1_1(x))\n        h = F.relu(self.conv1_2(h))\n        h = F.max_pool2d(h, 2, 2)\n\n        h = F.relu(self.conv2_1(h))\n        h = F.relu(self.conv2_2(h))\n        h = F.max_pool2d(h, 2, 2)\n\n        h = F.relu(self.conv3_1(h))\n        h = F.relu(self.conv3_2(h))\n        h = F.relu(self.conv3_3(h))\n        f3_3 = h\n        h = F.max_pool2d(h, 2, 2)\n\n        h = F.relu(self.conv4_1(h))\n        h = F.relu(self.conv4_2(h))\n        h = F.relu(self.conv4_3(h))\n        f4_3 = h\n  
      h = F.max_pool2d(h, 2, 2)\n\n        h = F.relu(self.conv5_1(h))\n        h = F.relu(self.conv5_2(h))\n        h = F.relu(self.conv5_3(h))\n        f5_3 = h\n        h = F.max_pool2d(h, 2, 2)\n\n        h = F.relu(self.fc6(h))\n        h = F.relu(self.fc7(h))\n        ffc7 = h\n        h = F.relu(self.conv6_1(h))\n        h = F.relu(self.conv6_2(h))\n        f6_2 = h\n        h = F.relu(self.conv7_1(h))\n        h = F.relu(self.conv7_2(h))\n        f7_2 = h\n\n        f3_3 = self.conv3_3_norm(f3_3)\n        f4_3 = self.conv4_3_norm(f4_3)\n        f5_3 = self.conv5_3_norm(f5_3)\n\n        cls1 = self.conv3_3_norm_mbox_conf(f3_3)\n        reg1 = self.conv3_3_norm_mbox_loc(f3_3)\n        cls2 = self.conv4_3_norm_mbox_conf(f4_3)\n        reg2 = self.conv4_3_norm_mbox_loc(f4_3)\n        cls3 = self.conv5_3_norm_mbox_conf(f5_3)\n        reg3 = self.conv5_3_norm_mbox_loc(f5_3)\n        cls4 = self.fc7_mbox_conf(ffc7)\n        reg4 = self.fc7_mbox_loc(ffc7)\n        cls5 = self.conv6_2_mbox_conf(f6_2)\n        reg5 = self.conv6_2_mbox_loc(f6_2)\n        cls6 = self.conv7_2_mbox_conf(f7_2)\n        reg6 = self.conv7_2_mbox_loc(f7_2)\n\n        # max-out background label\n        chunk = torch.chunk(cls1, 4, 1)\n        bmax = torch.max(torch.max(chunk[0], chunk[1]), chunk[2])\n        cls1 = torch.cat([bmax, chunk[3]], dim=1)\n\n        return [cls1, reg1, cls2, reg2, cls3, reg3, cls4, reg4, cls5, reg5, cls6, reg6]\n"
  },
  {
    "path": "face_detection/detection/sfd/sfd_detector.py",
    "content": "import os\nimport cv2\nfrom torch.utils.model_zoo import load_url\n\nfrom ..core import FaceDetector\n\nfrom .net_s3fd import s3fd\nfrom .bbox import *\nfrom .detect import *\n\nmodels_urls = {\n    's3fd': 'https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth',\n}\n\n\nclass SFDDetector(FaceDetector):\n    def __init__(self, device, path_to_detector=os.path.join(os.path.dirname(os.path.abspath(__file__)), 's3fd.pth'), verbose=False):\n        super(SFDDetector, self).__init__(device, verbose)\n\n        # Initialise the face detector\n        if not os.path.isfile(path_to_detector):\n            model_weights = load_url(models_urls['s3fd'])\n        else:\n            model_weights = torch.load(path_to_detector)\n\n        self.face_detector = s3fd()\n        self.face_detector.load_state_dict(model_weights)\n        self.face_detector.to(device)\n        self.face_detector.eval()\n\n    def detect_from_image(self, tensor_or_path):\n        image = self.tensor_or_path_to_ndarray(tensor_or_path)\n\n        bboxlist = detect(self.face_detector, image, device=self.device)\n        keep = nms(bboxlist, 0.3)\n        bboxlist = bboxlist[keep, :]\n        bboxlist = [x for x in bboxlist if x[-1] > 0.5]\n\n        return bboxlist\n\n    def detect_from_batch(self, images):\n        bboxlists = batch_detect(self.face_detector, images, device=self.device)\n        keeps = [nms(bboxlists[:, i, :], 0.3) for i in range(bboxlists.shape[1])]\n        bboxlists = [bboxlists[keep, i, :] for i, keep in enumerate(keeps)]\n        bboxlists = [[x for x in bboxlist if x[-1] > 0.5] for bboxlist in bboxlists]\n\n        return bboxlists\n\n    @property\n    def reference_scale(self):\n        return 195\n\n    @property\n    def reference_x_shift(self):\n        return 0\n\n    @property\n    def reference_y_shift(self):\n        return 0\n"
  },
  {
    "path": "face_detection/models.py",
    "content": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport math\n\n\ndef conv3x3(in_planes, out_planes, strd=1, padding=1, bias=False):\n    \"3x3 convolution with padding\"\n    return nn.Conv2d(in_planes, out_planes, kernel_size=3,\n                     stride=strd, padding=padding, bias=bias)\n\n\nclass ConvBlock(nn.Module):\n    def __init__(self, in_planes, out_planes):\n        super(ConvBlock, self).__init__()\n        self.bn1 = nn.BatchNorm2d(in_planes)\n        self.conv1 = conv3x3(in_planes, int(out_planes / 2))\n        self.bn2 = nn.BatchNorm2d(int(out_planes / 2))\n        self.conv2 = conv3x3(int(out_planes / 2), int(out_planes / 4))\n        self.bn3 = nn.BatchNorm2d(int(out_planes / 4))\n        self.conv3 = conv3x3(int(out_planes / 4), int(out_planes / 4))\n\n        if in_planes != out_planes:\n            self.downsample = nn.Sequential(\n                nn.BatchNorm2d(in_planes),\n                nn.ReLU(True),\n                nn.Conv2d(in_planes, out_planes,\n                          kernel_size=1, stride=1, bias=False),\n            )\n        else:\n            self.downsample = None\n\n    def forward(self, x):\n        residual = x\n\n        out1 = self.bn1(x)\n        out1 = F.relu(out1, True)\n        out1 = self.conv1(out1)\n\n        out2 = self.bn2(out1)\n        out2 = F.relu(out2, True)\n        out2 = self.conv2(out2)\n\n        out3 = self.bn3(out2)\n        out3 = F.relu(out3, True)\n        out3 = self.conv3(out3)\n\n        out3 = torch.cat((out1, out2, out3), 1)\n\n        if self.downsample is not None:\n            residual = self.downsample(residual)\n\n        out3 += residual\n\n        return out3\n\n\nclass Bottleneck(nn.Module):\n\n    expansion = 4\n\n    def __init__(self, inplanes, planes, stride=1, downsample=None):\n        super(Bottleneck, self).__init__()\n        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)\n        self.bn1 = nn.BatchNorm2d(planes)\n 
       self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,\n                               padding=1, bias=False)\n        self.bn2 = nn.BatchNorm2d(planes)\n        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)\n        self.bn3 = nn.BatchNorm2d(planes * 4)\n        self.relu = nn.ReLU(inplace=True)\n        self.downsample = downsample\n        self.stride = stride\n\n    def forward(self, x):\n        residual = x\n\n        out = self.conv1(x)\n        out = self.bn1(out)\n        out = self.relu(out)\n\n        out = self.conv2(out)\n        out = self.bn2(out)\n        out = self.relu(out)\n\n        out = self.conv3(out)\n        out = self.bn3(out)\n\n        if self.downsample is not None:\n            residual = self.downsample(x)\n\n        out += residual\n        out = self.relu(out)\n\n        return out\n\n\nclass HourGlass(nn.Module):\n    def __init__(self, num_modules, depth, num_features):\n        super(HourGlass, self).__init__()\n        self.num_modules = num_modules\n        self.depth = depth\n        self.features = num_features\n\n        self._generate_network(self.depth)\n\n    def _generate_network(self, level):\n        self.add_module('b1_' + str(level), ConvBlock(self.features, self.features))\n\n        self.add_module('b2_' + str(level), ConvBlock(self.features, self.features))\n\n        if level > 1:\n            self._generate_network(level - 1)\n        else:\n            self.add_module('b2_plus_' + str(level), ConvBlock(self.features, self.features))\n\n        self.add_module('b3_' + str(level), ConvBlock(self.features, self.features))\n\n    def _forward(self, level, inp):\n        # Upper branch\n        up1 = inp\n        up1 = self._modules['b1_' + str(level)](up1)\n\n        # Lower branch\n        low1 = F.avg_pool2d(inp, 2, stride=2)\n        low1 = self._modules['b2_' + str(level)](low1)\n\n        if level > 1:\n            low2 = self._forward(level - 1, low1)\n        
else:\n            low2 = low1\n            low2 = self._modules['b2_plus_' + str(level)](low2)\n\n        low3 = low2\n        low3 = self._modules['b3_' + str(level)](low3)\n\n        up2 = F.interpolate(low3, scale_factor=2, mode='nearest')\n\n        return up1 + up2\n\n    def forward(self, x):\n        return self._forward(self.depth, x)\n\n\nclass FAN(nn.Module):\n\n    def __init__(self, num_modules=1):\n        super(FAN, self).__init__()\n        self.num_modules = num_modules\n\n        # Base part\n        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)\n        self.bn1 = nn.BatchNorm2d(64)\n        self.conv2 = ConvBlock(64, 128)\n        self.conv3 = ConvBlock(128, 128)\n        self.conv4 = ConvBlock(128, 256)\n\n        # Stacking part\n        for hg_module in range(self.num_modules):\n            self.add_module('m' + str(hg_module), HourGlass(1, 4, 256))\n            self.add_module('top_m_' + str(hg_module), ConvBlock(256, 256))\n            self.add_module('conv_last' + str(hg_module),\n                            nn.Conv2d(256, 256, kernel_size=1, stride=1, padding=0))\n            self.add_module('bn_end' + str(hg_module), nn.BatchNorm2d(256))\n            self.add_module('l' + str(hg_module), nn.Conv2d(256,\n                                                            68, kernel_size=1, stride=1, padding=0))\n\n            if hg_module < self.num_modules - 1:\n                self.add_module(\n                    'bl' + str(hg_module), nn.Conv2d(256, 256, kernel_size=1, stride=1, padding=0))\n                self.add_module('al' + str(hg_module), nn.Conv2d(68,\n                                                                 256, kernel_size=1, stride=1, padding=0))\n\n    def forward(self, x):\n        x = F.relu(self.bn1(self.conv1(x)), True)\n        x = F.avg_pool2d(self.conv2(x), 2, stride=2)\n        x = self.conv3(x)\n        x = self.conv4(x)\n\n        previous = x\n\n        outputs = []\n        for i in 
range(self.num_modules):\n            hg = self._modules['m' + str(i)](previous)\n\n            ll = hg\n            ll = self._modules['top_m_' + str(i)](ll)\n\n            ll = F.relu(self._modules['bn_end' + str(i)]\n                        (self._modules['conv_last' + str(i)](ll)), True)\n\n            # Predict heatmaps\n            tmp_out = self._modules['l' + str(i)](ll)\n            outputs.append(tmp_out)\n\n            if i < self.num_modules - 1:\n                ll = self._modules['bl' + str(i)](ll)\n                tmp_out_ = self._modules['al' + str(i)](tmp_out)\n                previous = previous + ll + tmp_out_\n\n        return outputs\n\n\nclass ResNetDepth(nn.Module):\n\n    def __init__(self, block=Bottleneck, layers=[3, 8, 36, 3], num_classes=68):\n        self.inplanes = 64\n        super(ResNetDepth, self).__init__()\n        self.conv1 = nn.Conv2d(3 + 68, 64, kernel_size=7, stride=2, padding=3,\n                               bias=False)\n        self.bn1 = nn.BatchNorm2d(64)\n        self.relu = nn.ReLU(inplace=True)\n        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)\n        self.layer1 = self._make_layer(block, 64, layers[0])\n        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)\n        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)\n        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)\n        self.avgpool = nn.AvgPool2d(7)\n        self.fc = nn.Linear(512 * block.expansion, num_classes)\n\n        for m in self.modules():\n            if isinstance(m, nn.Conv2d):\n                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels\n                m.weight.data.normal_(0, math.sqrt(2. 
/ n))\n            elif isinstance(m, nn.BatchNorm2d):\n                m.weight.data.fill_(1)\n                m.bias.data.zero_()\n\n    def _make_layer(self, block, planes, blocks, stride=1):\n        downsample = None\n        if stride != 1 or self.inplanes != planes * block.expansion:\n            downsample = nn.Sequential(\n                nn.Conv2d(self.inplanes, planes * block.expansion,\n                          kernel_size=1, stride=stride, bias=False),\n                nn.BatchNorm2d(planes * block.expansion),\n            )\n\n        layers = []\n        layers.append(block(self.inplanes, planes, stride, downsample))\n        self.inplanes = planes * block.expansion\n        for i in range(1, blocks):\n            layers.append(block(self.inplanes, planes))\n\n        return nn.Sequential(*layers)\n\n    def forward(self, x):\n        x = self.conv1(x)\n        x = self.bn1(x)\n        x = self.relu(x)\n        x = self.maxpool(x)\n\n        x = self.layer1(x)\n        x = self.layer2(x)\n        x = self.layer3(x)\n        x = self.layer4(x)\n\n        x = self.avgpool(x)\n        x = x.view(x.size(0), -1)\n        x = self.fc(x)\n\n        return x\n"
  },
  {
    "path": "face_detection/utils.py",
    "content": "from __future__ import print_function\nimport os\nimport sys\nimport time\nimport torch\nimport math\nimport numpy as np\nimport cv2\n\n\ndef _gaussian(\n        size=3, sigma=0.25, amplitude=1, normalize=False, width=None,\n        height=None, sigma_horz=None, sigma_vert=None, mean_horz=0.5,\n        mean_vert=0.5):\n    # handle some defaults\n    if width is None:\n        width = size\n    if height is None:\n        height = size\n    if sigma_horz is None:\n        sigma_horz = sigma\n    if sigma_vert is None:\n        sigma_vert = sigma\n    center_x = mean_horz * width + 0.5\n    center_y = mean_vert * height + 0.5\n    gauss = np.empty((height, width), dtype=np.float32)\n    # generate kernel\n    for i in range(height):\n        for j in range(width):\n            gauss[i][j] = amplitude * math.exp(-(math.pow((j + 1 - center_x) / (\n                sigma_horz * width), 2) / 2.0 + math.pow((i + 1 - center_y) / (sigma_vert * height), 2) / 2.0))\n    if normalize:\n        gauss = gauss / np.sum(gauss)\n    return gauss\n\n\ndef draw_gaussian(image, point, sigma):\n    # Check if the gaussian is inside\n    ul = [math.floor(point[0] - 3 * sigma), math.floor(point[1] - 3 * sigma)]\n    br = [math.floor(point[0] + 3 * sigma), math.floor(point[1] + 3 * sigma)]\n    if (ul[0] > image.shape[1] or ul[1] > image.shape[0] or br[0] < 1 or br[1] < 1):\n        return image\n    size = 6 * sigma + 1\n    g = _gaussian(size)\n    g_x = [int(max(1, -ul[0])), int(min(br[0], image.shape[1])) - int(max(1, ul[0])) + int(max(1, -ul[0]))]\n    g_y = [int(max(1, -ul[1])), int(min(br[1], image.shape[0])) - int(max(1, ul[1])) + int(max(1, -ul[1]))]\n    img_x = [int(max(1, ul[0])), int(min(br[0], image.shape[1]))]\n    img_y = [int(max(1, ul[1])), int(min(br[1], image.shape[0]))]\n    assert (g_x[0] > 0 and g_y[1] > 0)\n    image[img_y[0] - 1:img_y[1], img_x[0] - 1:img_x[1]\n          ] = image[img_y[0] - 1:img_y[1], img_x[0] - 1:img_x[1]] + g[g_y[0] - 1:g_y[1], 
g_x[0] - 1:g_x[1]]\n    image[image > 1] = 1\n    return image\n\n\ndef transform(point, center, scale, resolution, invert=False):\n    \"\"\"Generate an affine transformation matrix.\n\n    Given a set of points, a center, a scale and a target resolution, the\n    function generates an affine transformation matrix. If invert is ``True``\n    it will produce the inverse transformation.\n\n    Arguments:\n        point {torch.tensor} -- the input 2D point\n        center {torch.tensor or numpy.array} -- the center around which to perform the transformations\n        scale {float} -- the scale of the face/object\n        resolution {float} -- the output resolution\n\n    Keyword Arguments:\n        invert {bool} -- define whether the function should produce the direct or the\n        inverse transformation matrix (default: {False})\n    \"\"\"\n    _pt = torch.ones(3)\n    _pt[0] = point[0]\n    _pt[1] = point[1]\n\n    h = 200.0 * scale\n    t = torch.eye(3)\n    t[0, 0] = resolution / h\n    t[1, 1] = resolution / h\n    t[0, 2] = resolution * (-center[0] / h + 0.5)\n    t[1, 2] = resolution * (-center[1] / h + 0.5)\n\n    if invert:\n        t = torch.inverse(t)\n\n    new_point = (torch.matmul(t, _pt))[0:2]\n\n    return new_point.int()\n\n\ndef crop(image, center, scale, resolution=256.0):\n    \"\"\"Center crops an image or set of heatmaps\n\n    Arguments:\n        image {numpy.array} -- an rgb image\n        center {numpy.array} -- the center of the object, usually the same as of the bounding box\n        scale {float} -- scale of the face\n\n    Keyword Arguments:\n        resolution {float} -- the size of the output cropped image (default: {256.0})\n\n    Returns:\n        numpy.array -- the cropped image, resized to resolution x resolution\n    \"\"\"  # Crop around the center point\n    \"\"\" Crops the image around the center. 
Input is expected to be an np.ndarray \"\"\"\n    ul = transform([1, 1], center, scale, resolution, True)\n    br = transform([resolution, resolution], center, scale, resolution, True)\n    # pad = math.ceil(torch.norm((ul - br).float()) / 2.0 - (br[0] - ul[0]) / 2.0)\n    if image.ndim > 2:\n        newDim = np.array([br[1] - ul[1], br[0] - ul[0],\n                           image.shape[2]], dtype=np.int32)\n        newImg = np.zeros(newDim, dtype=np.uint8)\n    else:\n        newDim = np.array([br[1] - ul[1], br[0] - ul[0]], dtype=np.int32)\n        newImg = np.zeros(newDim, dtype=np.uint8)\n    ht = image.shape[0]\n    wd = image.shape[1]\n    newX = np.array(\n        [max(1, -ul[0] + 1), min(br[0], wd) - ul[0]], dtype=np.int32)\n    newY = np.array(\n        [max(1, -ul[1] + 1), min(br[1], ht) - ul[1]], dtype=np.int32)\n    oldX = np.array([max(1, ul[0] + 1), min(br[0], wd)], dtype=np.int32)\n    oldY = np.array([max(1, ul[1] + 1), min(br[1], ht)], dtype=np.int32)\n    newImg[newY[0] - 1:newY[1], newX[0] - 1:newX[1]\n           ] = image[oldY[0] - 1:oldY[1], oldX[0] - 1:oldX[1], :]\n    newImg = cv2.resize(newImg, dsize=(int(resolution), int(resolution)),\n                        interpolation=cv2.INTER_LINEAR)\n    return newImg\n\n\ndef get_preds_fromhm(hm, center=None, scale=None):\n    \"\"\"Obtain (x,y) coordinates given a set of N heatmaps. 
If the center\n    and the scale is provided the function will return the points also in\n    the original coordinate frame.\n\n    Arguments:\n        hm {torch.tensor} -- the predicted heatmaps, of shape [B, N, W, H]\n\n    Keyword Arguments:\n        center {torch.tensor} -- the center of the bounding box (default: {None})\n        scale {float} -- face scale (default: {None})\n    \"\"\"\n    max, idx = torch.max(\n        hm.view(hm.size(0), hm.size(1), hm.size(2) * hm.size(3)), 2)\n    idx += 1\n    preds = idx.view(idx.size(0), idx.size(1), 1).repeat(1, 1, 2).float()\n    preds[..., 0].apply_(lambda x: (x - 1) % hm.size(3) + 1)\n    preds[..., 1].add_(-1).div_(hm.size(2)).floor_().add_(1)\n\n    for i in range(preds.size(0)):\n        for j in range(preds.size(1)):\n            hm_ = hm[i, j, :]\n            pX, pY = int(preds[i, j, 0]) - 1, int(preds[i, j, 1]) - 1\n            if pX > 0 and pX < 63 and pY > 0 and pY < 63:\n                diff = torch.FloatTensor(\n                    [hm_[pY, pX + 1] - hm_[pY, pX - 1],\n                     hm_[pY + 1, pX] - hm_[pY - 1, pX]])\n                preds[i, j].add_(diff.sign_().mul_(.25))\n\n    preds.add_(-.5)\n\n    preds_orig = torch.zeros(preds.size())\n    if center is not None and scale is not None:\n        for i in range(hm.size(0)):\n            for j in range(hm.size(1)):\n                preds_orig[i, j] = transform(\n                    preds[i, j], center, scale, hm.size(2), True)\n\n    return preds, preds_orig\n\ndef get_preds_fromhm_batch(hm, centers=None, scales=None):\n    \"\"\"Obtain (x,y) coordinates given a set of N heatmaps. 
If the centers\n    and the scales is provided the function will return the points also in\n    the original coordinate frame.\n\n    Arguments:\n        hm {torch.tensor} -- the predicted heatmaps, of shape [B, N, W, H]\n\n    Keyword Arguments:\n        centers {torch.tensor} -- the centers of the bounding box (default: {None})\n        scales {float} -- face scales (default: {None})\n    \"\"\"\n    max, idx = torch.max(\n        hm.view(hm.size(0), hm.size(1), hm.size(2) * hm.size(3)), 2)\n    idx += 1\n    preds = idx.view(idx.size(0), idx.size(1), 1).repeat(1, 1, 2).float()\n    preds[..., 0].apply_(lambda x: (x - 1) % hm.size(3) + 1)\n    preds[..., 1].add_(-1).div_(hm.size(2)).floor_().add_(1)\n\n    for i in range(preds.size(0)):\n        for j in range(preds.size(1)):\n            hm_ = hm[i, j, :]\n            pX, pY = int(preds[i, j, 0]) - 1, int(preds[i, j, 1]) - 1\n            if pX > 0 and pX < 63 and pY > 0 and pY < 63:\n                diff = torch.FloatTensor(\n                    [hm_[pY, pX + 1] - hm_[pY, pX - 1],\n                     hm_[pY + 1, pX] - hm_[pY - 1, pX]])\n                preds[i, j].add_(diff.sign_().mul_(.25))\n\n    preds.add_(-.5)\n\n    preds_orig = torch.zeros(preds.size())\n    if centers is not None and scales is not None:\n        for i in range(hm.size(0)):\n            for j in range(hm.size(1)):\n                preds_orig[i, j] = transform(\n                    preds[i, j], centers[i], scales[i], hm.size(2), True)\n\n    return preds, preds_orig\n\ndef shuffle_lr(parts, pairs=None):\n    \"\"\"Shuffle the points left-right according to the axis of symmetry\n    of the object.\n\n    Arguments:\n        parts {torch.tensor} -- a 3D or 4D object containing the\n        heatmaps.\n\n    Keyword Arguments:\n        pairs {list of integers} -- [order of the flipped points] (default: {None})\n    \"\"\"\n    if pairs is None:\n        pairs = [16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,\n                 26, 
25, 24, 23, 22, 21, 20, 19, 18, 17, 27, 28, 29, 30, 35,\n                 34, 33, 32, 31, 45, 44, 43, 42, 47, 46, 39, 38, 37, 36, 41,\n                 40, 54, 53, 52, 51, 50, 49, 48, 59, 58, 57, 56, 55, 64, 63,\n                 62, 61, 60, 67, 66, 65]\n    if parts.ndimension() == 3:\n        parts = parts[pairs, ...]\n    else:\n        parts = parts[:, pairs, ...]\n\n    return parts\n\n\ndef flip(tensor, is_label=False):\n    \"\"\"Flip an image or a set of heatmaps left-right\n\n    Arguments:\n        tensor {numpy.array or torch.tensor} -- [the input image or heatmaps]\n\n    Keyword Arguments:\n        is_label {bool} -- [denote wherever the input is an image or a set of heatmaps ] (default: {False})\n    \"\"\"\n    if not torch.is_tensor(tensor):\n        tensor = torch.from_numpy(tensor)\n\n    if is_label:\n        tensor = shuffle_lr(tensor).flip(tensor.ndimension() - 1)\n    else:\n        tensor = tensor.flip(tensor.ndimension() - 1)\n\n    return tensor\n\n# From pyzolib/paths.py (https://bitbucket.org/pyzo/pyzolib/src/tip/paths.py)\n\n\ndef appdata_dir(appname=None, roaming=False):\n    \"\"\" appdata_dir(appname=None, roaming=False)\n\n    Get the path to the application directory, where applications are allowed\n    to write user specific files (e.g. configurations). 
For non-user specific\n    data, consider using common_appdata_dir().\n    If appname is given, a subdir is appended (and created if necessary).\n    If roaming is True, will prefer a roaming directory (Windows Vista/7).\n    \"\"\"\n\n    # Define default user directory\n    userDir = os.getenv('FACEALIGNMENT_USERDIR', None)\n    if userDir is None:\n        userDir = os.path.expanduser('~')\n        if not os.path.isdir(userDir):  # pragma: no cover\n            userDir = '/var/tmp'  # issue #54\n\n    # Get system app data dir\n    path = None\n    if sys.platform.startswith('win'):\n        path1, path2 = os.getenv('LOCALAPPDATA'), os.getenv('APPDATA')\n        path = (path2 or path1) if roaming else (path1 or path2)\n    elif sys.platform.startswith('darwin'):\n        path = os.path.join(userDir, 'Library', 'Application Support')\n    # On Linux and as fallback\n    if not (path and os.path.isdir(path)):\n        path = userDir\n\n    # Maybe we should store things local to the executable (in case of a\n    # portable distro or a frozen application that wants to be portable)\n    prefix = sys.prefix\n    if getattr(sys, 'frozen', None):\n        prefix = os.path.abspath(os.path.dirname(sys.executable))\n    for reldir in ('settings', '../settings'):\n        localpath = os.path.abspath(os.path.join(prefix, reldir))\n        if os.path.isdir(localpath):  # pragma: no cover\n            try:\n                open(os.path.join(localpath, 'test.write'), 'wb').close()\n                os.remove(os.path.join(localpath, 'test.write'))\n            except IOError:\n                pass  # We cannot write in this directory\n            else:\n                path = localpath\n                break\n\n    # Get path specific for this app\n    if appname:\n        if path == userDir:\n            appname = '.' 
+ appname.lstrip('.')  # Make it a hidden directory\n        path = os.path.join(path, appname)\n        if not os.path.isdir(path):  # pragma: no cover\n            os.mkdir(path)\n\n    # Done\n    return path\n"
  },
  {
    "path": "filelists/README.md",
    "content": "Place LRS2 (and any other) filelists here for training."
  },
  {
    "path": "hparams.py",
    "content": "from glob import glob\nimport os\n\ndef get_image_list(data_root, split):\n\tfilelist = []\n\n\twith open('filelists/{}.txt'.format(split)) as f:\n\t\tfor line in f:\n\t\t\tline = line.strip()\n\t\t\tif ' ' in line: line = line.split()[0]\n\t\t\tfilelist.append(os.path.join(data_root, line))\n\n\treturn filelist\n\nclass HParams:\n\tdef __init__(self, **kwargs):\n\t\tself.data = {}\n\n\t\tfor key, value in kwargs.items():\n\t\t\tself.data[key] = value\n\n\tdef __getattr__(self, key):\n\t\tif key not in self.data:\n\t\t\traise AttributeError(\"'HParams' object has no attribute %s\" % key)\n\t\treturn self.data[key]\n\n\tdef set_hparam(self, key, value):\n\t\tself.data[key] = value\n\n\n# Default hyperparameters\nhparams = HParams(\n\tnum_mels=80,  # Number of mel-spectrogram channels and local conditioning dimensionality\n\t#  network\n\trescale=True,  # Whether to rescale audio prior to preprocessing\n\trescaling_max=0.9,  # Rescaling value\n\t\n\t# Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction\n\t# It\"s preferred to set True to use with https://github.com/r9y9/wavenet_vocoder\n\t# Does not work if n_ffit is not multiple of hop_size!!\n\tuse_lws=False,\n\t\n\tn_fft=800,  # Extra window size is filled with 0 paddings to match this parameter\n\thop_size=200,  # For 16000Hz, 200 = 12.5 ms (0.0125 * sample_rate)\n\twin_size=800,  # For 16000Hz, 800 = 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)\n\tsample_rate=16000,  # 16000Hz (corresponding to librispeech) (sox --i <filename>)\n\t\n\tframe_shift_ms=None,  # Can replace hop_size parameter. 
(Recommended: 12.5)\n\t\n\t# Mel and Linear spectrograms normalization/scaling and clipping\n\tsignal_normalization=True,\n\t# Whether to normalize mel spectrograms to some predefined range (following below parameters)\n\tallow_clipping_in_normalization=True,  # Only relevant if signal_normalization = True\n\tsymmetric_mels=True,\n\t# Whether to scale the data to be symmetric around 0. (Also multiplies the output range by 2, \n\t# faster and cleaner convergence)\n\tmax_abs_value=4.,\n\t# max absolute value of data. If symmetric, data will be [-max, max] else [0, max] (Must not \n\t# be too big to avoid gradient explosion, \n\t# not too small for fast convergence)\n\t# Contribution by @begeekmyfriend\n\t# Spectrogram Pre-Emphasis (Lfilter: Reduce spectrogram noise and helps model certitude \n\t# levels. Also allows for better G&L phase reconstruction)\n\tpreemphasize=True,  # whether to apply filter\n\tpreemphasis=0.97,  # filter coefficient.\n\t\n\t# Limits\n\tmin_level_db=-100,\n\tref_level_db=20,\n\tfmin=55,\n\t# Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To \n\t# test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])\n\tfmax=7600,  # To be increased/reduced depending on data.\n\n\t###################### Our training parameters #################################\n\timg_size=96,\n\tfps=25,\n\t\n\tbatch_size=16,\n\tinitial_learning_rate=1e-4,\n\tnepochs=200000000000000000,  ### ctrl + c, stop whenever eval loss is consistently greater than train loss for ~10 epochs\n\tnum_workers=16,\n\tcheckpoint_interval=3000,\n\teval_interval=3000,\n\tsave_optimizer_state=True,\n\n\tsyncnet_wt=0.0, # is initially zero, will be set automatically to 0.03 later. Leads to faster convergence. 
\n\tsyncnet_batch_size=64,\n\tsyncnet_lr=1e-4,\n\tsyncnet_eval_interval=10000,\n\tsyncnet_checkpoint_interval=10000,\n\n\tdisc_wt=0.07,\n\tdisc_initial_learning_rate=1e-4,\n)\n\n\ndef hparams_debug_string():\n\t# HParams keeps its settings in the `data` dict; it defines no values() method\n\tvalues = hparams.data\n\thp = [\"  %s: %s\" % (name, values[name]) for name in sorted(values) if name != \"sentences\"]\n\treturn \"Hyperparameters:\\n\" + \"\\n\".join(hp)\n"
  },
  {
    "path": "hq_wav2lip_train.py",
    "content": "from os.path import dirname, join, basename, isfile\nfrom tqdm import tqdm\n\nfrom models import SyncNet_color as SyncNet\nfrom models import Wav2Lip, Wav2Lip_disc_qual\nimport audio\n\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\nfrom torch import optim\nimport torch.backends.cudnn as cudnn\nfrom torch.utils import data as data_utils\nimport numpy as np\n\nfrom glob import glob\n\nimport os, random, cv2, argparse\nfrom hparams import hparams, get_image_list\n\nparser = argparse.ArgumentParser(description='Code to train the Wav2Lip model WITH the visual quality discriminator')\n\nparser.add_argument(\"--data_root\", help=\"Root folder of the preprocessed LRS2 dataset\", required=True, type=str)\n\nparser.add_argument('--checkpoint_dir', help='Save checkpoints to this directory', required=True, type=str)\nparser.add_argument('--syncnet_checkpoint_path', help='Load the pre-trained Expert discriminator', required=True, type=str)\n\nparser.add_argument('--checkpoint_path', help='Resume generator from this checkpoint', default=None, type=str)\nparser.add_argument('--disc_checkpoint_path', help='Resume quality disc from this checkpoint', default=None, type=str)\n\nargs = parser.parse_args()\n\n\nglobal_step = 0\nglobal_epoch = 0\nuse_cuda = torch.cuda.is_available()\nprint('use_cuda: {}'.format(use_cuda))\n\nsyncnet_T = 5\nsyncnet_mel_step_size = 16\n\nclass Dataset(object):\n    def __init__(self, split):\n        self.all_videos = get_image_list(args.data_root, split)\n\n    def get_frame_id(self, frame):\n        return int(basename(frame).split('.')[0])\n\n    def get_window(self, start_frame):\n        start_id = self.get_frame_id(start_frame)\n        vidname = dirname(start_frame)\n\n        window_fnames = []\n        for frame_id in range(start_id, start_id + syncnet_T):\n            frame = join(vidname, '{}.jpg'.format(frame_id))\n            if not isfile(frame):\n                return None\n            
window_fnames.append(frame)\n        return window_fnames\n\n    def read_window(self, window_fnames):\n        if window_fnames is None: return None\n        window = []\n        for fname in window_fnames:\n            img = cv2.imread(fname)\n            if img is None:\n                return None\n            try:\n                img = cv2.resize(img, (hparams.img_size, hparams.img_size))\n            except Exception as e:\n                return None\n\n            window.append(img)\n\n        return window\n\n    def crop_audio_window(self, spec, start_frame):\n        if type(start_frame) == int:\n            start_frame_num = start_frame\n        else:\n            start_frame_num = self.get_frame_id(start_frame)\n        start_idx = int(80. * (start_frame_num / float(hparams.fps)))\n        \n        end_idx = start_idx + syncnet_mel_step_size\n\n        return spec[start_idx : end_idx, :]\n\n    def get_segmented_mels(self, spec, start_frame):\n        mels = []\n        assert syncnet_T == 5\n        start_frame_num = self.get_frame_id(start_frame) + 1 # 0-indexing ---> 1-indexing\n        if start_frame_num - 2 < 0: return None\n        for i in range(start_frame_num, start_frame_num + syncnet_T):\n            m = self.crop_audio_window(spec, i - 2)\n            if m.shape[0] != syncnet_mel_step_size:\n                return None\n            mels.append(m.T)\n\n        mels = np.asarray(mels)\n\n        return mels\n\n    def prepare_window(self, window):\n        # 3 x T x H x W\n        x = np.asarray(window) / 255.\n        x = np.transpose(x, (3, 0, 1, 2))\n\n        return x\n\n    def __len__(self):\n        return len(self.all_videos)\n\n    def __getitem__(self, idx):\n        while 1:\n            idx = random.randint(0, len(self.all_videos) - 1)\n            vidname = self.all_videos[idx]\n            img_names = list(glob(join(vidname, '*.jpg')))\n            if len(img_names) <= 3 * syncnet_T:\n                continue\n            \n   
         img_name = random.choice(img_names)\n            wrong_img_name = random.choice(img_names)\n            while wrong_img_name == img_name:\n                wrong_img_name = random.choice(img_names)\n\n            window_fnames = self.get_window(img_name)\n            wrong_window_fnames = self.get_window(wrong_img_name)\n            if window_fnames is None or wrong_window_fnames is None:\n                continue\n\n            window = self.read_window(window_fnames)\n            if window is None:\n                continue\n\n            wrong_window = self.read_window(wrong_window_fnames)\n            if wrong_window is None:\n                continue\n\n            try:\n                wavpath = join(vidname, \"audio.wav\")\n                wav = audio.load_wav(wavpath, hparams.sample_rate)\n\n                orig_mel = audio.melspectrogram(wav).T\n            except Exception as e:\n                continue\n\n            mel = self.crop_audio_window(orig_mel.copy(), img_name)\n            \n            if (mel.shape[0] != syncnet_mel_step_size):\n                continue\n\n            indiv_mels = self.get_segmented_mels(orig_mel.copy(), img_name)\n            if indiv_mels is None: continue\n\n            window = self.prepare_window(window)\n            y = window.copy()\n            window[:, :, window.shape[2]//2:] = 0.\n\n            wrong_window = self.prepare_window(wrong_window)\n            x = np.concatenate([window, wrong_window], axis=0)\n\n            x = torch.FloatTensor(x)\n            mel = torch.FloatTensor(mel.T).unsqueeze(0)\n            indiv_mels = torch.FloatTensor(indiv_mels).unsqueeze(1)\n            y = torch.FloatTensor(y)\n            return x, indiv_mels, mel, y\n\ndef save_sample_images(x, g, gt, global_step, checkpoint_dir):\n    x = (x.detach().cpu().numpy().transpose(0, 2, 3, 4, 1) * 255.).astype(np.uint8)\n    g = (g.detach().cpu().numpy().transpose(0, 2, 3, 4, 1) * 255.).astype(np.uint8)\n    gt = 
(gt.detach().cpu().numpy().transpose(0, 2, 3, 4, 1) * 255.).astype(np.uint8)\n\n    refs, inps = x[..., 3:], x[..., :3]\n    folder = join(checkpoint_dir, \"samples_step{:09d}\".format(global_step))\n    if not os.path.exists(folder): os.mkdir(folder)\n    collage = np.concatenate((refs, inps, g, gt), axis=-2)\n    for batch_idx, c in enumerate(collage):\n        for t in range(len(c)):\n            cv2.imwrite('{}/{}_{}.jpg'.format(folder, batch_idx, t), c[t])\n\nlogloss = nn.BCELoss()\ndef cosine_loss(a, v, y):\n    d = nn.functional.cosine_similarity(a, v)\n    loss = logloss(d.unsqueeze(1), y)\n\n    return loss\n\ndevice = torch.device(\"cuda\" if use_cuda else \"cpu\")\nsyncnet = SyncNet().to(device)\nfor p in syncnet.parameters():\n    p.requires_grad = False\n\nrecon_loss = nn.L1Loss()\ndef get_sync_loss(mel, g):\n    g = g[:, :, :, g.size(3)//2:]\n    g = torch.cat([g[:, :, i] for i in range(syncnet_T)], dim=1)\n    # B, 3 * T, H//2, W\n    a, v = syncnet(mel, g)\n    y = torch.ones(g.size(0), 1).float().to(device)\n    return cosine_loss(a, v, y)\n\ndef train(device, model, disc, train_data_loader, test_data_loader, optimizer, disc_optimizer,\n          checkpoint_dir=None, checkpoint_interval=None, nepochs=None):\n    global global_step, global_epoch\n    resumed_step = global_step\n\n    while global_epoch < nepochs:\n        print('Starting Epoch: {}'.format(global_epoch))\n        running_sync_loss, running_l1_loss, disc_loss, running_perceptual_loss = 0., 0., 0., 0.\n        running_disc_real_loss, running_disc_fake_loss = 0., 0.\n        prog_bar = tqdm(enumerate(train_data_loader))\n        for step, (x, indiv_mels, mel, gt) in prog_bar:\n            disc.train()\n            model.train()\n\n            x = x.to(device)\n            mel = mel.to(device)\n            indiv_mels = indiv_mels.to(device)\n            gt = gt.to(device)\n\n            ### Train generator now. Remove ALL grads. 
\n            optimizer.zero_grad()\n            disc_optimizer.zero_grad()\n\n            g = model(indiv_mels, x)\n\n            if hparams.syncnet_wt > 0.:\n                sync_loss = get_sync_loss(mel, g)\n            else:\n                sync_loss = 0.\n\n            if hparams.disc_wt > 0.:\n                perceptual_loss = disc.perceptual_forward(g)\n            else:\n                perceptual_loss = 0.\n\n            l1loss = recon_loss(g, gt)\n\n            loss = hparams.syncnet_wt * sync_loss + hparams.disc_wt * perceptual_loss + \\\n                                    (1. - hparams.syncnet_wt - hparams.disc_wt) * l1loss\n\n            loss.backward()\n            optimizer.step()\n\n            ### Remove all gradients before Training disc\n            disc_optimizer.zero_grad()\n\n            pred = disc(gt)\n            disc_real_loss = F.binary_cross_entropy(pred, torch.ones((len(pred), 1)).to(device))\n            disc_real_loss.backward()\n\n            pred = disc(g.detach())\n            disc_fake_loss = F.binary_cross_entropy(pred, torch.zeros((len(pred), 1)).to(device))\n            disc_fake_loss.backward()\n\n            disc_optimizer.step()\n\n            running_disc_real_loss += disc_real_loss.item()\n            running_disc_fake_loss += disc_fake_loss.item()\n\n            if global_step % checkpoint_interval == 0:\n                save_sample_images(x, g, gt, global_step, checkpoint_dir)\n\n            # Logs\n            global_step += 1\n            cur_session_steps = global_step - resumed_step\n\n            running_l1_loss += l1loss.item()\n            if hparams.syncnet_wt > 0.:\n                running_sync_loss += sync_loss.item()\n            else:\n                running_sync_loss += 0.\n\n            if hparams.disc_wt > 0.:\n                running_perceptual_loss += perceptual_loss.item()\n            else:\n                running_perceptual_loss += 0.\n\n            if global_step == 1 or global_step % 
checkpoint_interval == 0:\n                save_checkpoint(\n                    model, optimizer, global_step, checkpoint_dir, global_epoch)\n                save_checkpoint(disc, disc_optimizer, global_step, checkpoint_dir, global_epoch, prefix='disc_')\n\n\n            if global_step % hparams.eval_interval == 0:\n                with torch.no_grad():\n                    average_sync_loss = eval_model(test_data_loader, global_step, device, model, disc)\n\n                    if average_sync_loss < .75:\n                        hparams.set_hparam('syncnet_wt', 0.03)\n\n            prog_bar.set_description('L1: {}, Sync: {}, Percep: {} | Fake: {}, Real: {}'.format(running_l1_loss / (step + 1),\n                                                                                        running_sync_loss / (step + 1),\n                                                                                        running_perceptual_loss / (step + 1),\n                                                                                        running_disc_fake_loss / (step + 1),\n                                                                                        running_disc_real_loss / (step + 1)))\n\n        global_epoch += 1\n\ndef eval_model(test_data_loader, global_step, device, model, disc):\n    eval_steps = 300\n    print('Evaluating for {} steps'.format(eval_steps))\n    running_sync_loss, running_l1_loss, running_disc_real_loss, running_disc_fake_loss, running_perceptual_loss = [], [], [], [], []\n    while 1:\n        for step, (x, indiv_mels, mel, gt) in enumerate((test_data_loader)):\n            model.eval()\n            disc.eval()\n\n            x = x.to(device)\n            mel = mel.to(device)\n            indiv_mels = indiv_mels.to(device)\n            gt = gt.to(device)\n\n            pred = disc(gt)\n            disc_real_loss = F.binary_cross_entropy(pred, torch.ones((len(pred), 1)).to(device))\n\n            g = model(indiv_mels, x)\n            pred = 
disc(g)\n            disc_fake_loss = F.binary_cross_entropy(pred, torch.zeros((len(pred), 1)).to(device))\n\n            running_disc_real_loss.append(disc_real_loss.item())\n            running_disc_fake_loss.append(disc_fake_loss.item())\n\n            sync_loss = get_sync_loss(mel, g)\n            \n            if hparams.disc_wt > 0.:\n                perceptual_loss = disc.perceptual_forward(g)\n            else:\n                perceptual_loss = 0.\n\n            l1loss = recon_loss(g, gt)\n\n            loss = hparams.syncnet_wt * sync_loss + hparams.disc_wt * perceptual_loss + \\\n                                    (1. - hparams.syncnet_wt - hparams.disc_wt) * l1loss\n\n            running_l1_loss.append(l1loss.item())\n            running_sync_loss.append(sync_loss.item())\n            \n            if hparams.disc_wt > 0.:\n                running_perceptual_loss.append(perceptual_loss.item())\n            else:\n                running_perceptual_loss.append(0.)\n\n            if step > eval_steps: break\n\n        print('L1: {}, Sync: {}, Percep: {} | Fake: {}, Real: {}'.format(sum(running_l1_loss) / len(running_l1_loss),\n                                                            sum(running_sync_loss) / len(running_sync_loss),\n                                                            sum(running_perceptual_loss) / len(running_perceptual_loss),\n                                                            sum(running_disc_fake_loss) / len(running_disc_fake_loss),\n                                                             sum(running_disc_real_loss) / len(running_disc_real_loss)))\n        return sum(running_sync_loss) / len(running_sync_loss)\n\n\ndef save_checkpoint(model, optimizer, step, checkpoint_dir, epoch, prefix=''):\n    # Name the file after the step passed in, not the module-level global_step.\n    checkpoint_path = join(\n        checkpoint_dir, \"{}checkpoint_step{:09d}.pth\".format(prefix, step))\n    optimizer_state = optimizer.state_dict() if hparams.save_optimizer_state else None\n    torch.save({\n 
       \"state_dict\": model.state_dict(),\n        \"optimizer\": optimizer_state,\n        \"global_step\": step,\n        \"global_epoch\": epoch,\n    }, checkpoint_path)\n    print(\"Saved checkpoint:\", checkpoint_path)\n\ndef _load(checkpoint_path):\n    if use_cuda:\n        checkpoint = torch.load(checkpoint_path)\n    else:\n        checkpoint = torch.load(checkpoint_path,\n                                map_location=lambda storage, loc: storage)\n    return checkpoint\n\n\ndef load_checkpoint(path, model, optimizer, reset_optimizer=False, overwrite_global_states=True):\n    global global_step\n    global global_epoch\n\n    print(\"Load checkpoint from: {}\".format(path))\n    checkpoint = _load(path)\n    s = checkpoint[\"state_dict\"]\n    new_s = {}\n    for k, v in s.items():\n        new_s[k.replace('module.', '')] = v\n    model.load_state_dict(new_s)\n    if not reset_optimizer:\n        optimizer_state = checkpoint[\"optimizer\"]\n        if optimizer_state is not None:\n            print(\"Load optimizer state from {}\".format(path))\n            optimizer.load_state_dict(checkpoint[\"optimizer\"])\n    if overwrite_global_states:\n        global_step = checkpoint[\"global_step\"]\n        global_epoch = checkpoint[\"global_epoch\"]\n\n    return model\n\nif __name__ == \"__main__\":\n    checkpoint_dir = args.checkpoint_dir\n\n    # Dataset and Dataloader setup\n    train_dataset = Dataset('train')\n    test_dataset = Dataset('val')\n\n    train_data_loader = data_utils.DataLoader(\n        train_dataset, batch_size=hparams.batch_size, shuffle=True,\n        num_workers=hparams.num_workers)\n\n    test_data_loader = data_utils.DataLoader(\n        test_dataset, batch_size=hparams.batch_size,\n        num_workers=4)\n\n    device = torch.device(\"cuda\" if use_cuda else \"cpu\")\n\n     # Model\n    model = Wav2Lip().to(device)\n    disc = Wav2Lip_disc_qual().to(device)\n\n    print('total trainable params {}'.format(sum(p.numel() for p in 
model.parameters() if p.requires_grad)))\n    print('total DISC trainable params {}'.format(sum(p.numel() for p in disc.parameters() if p.requires_grad)))\n\n    optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad],\n                           lr=hparams.initial_learning_rate, betas=(0.5, 0.999))\n    disc_optimizer = optim.Adam([p for p in disc.parameters() if p.requires_grad],\n                           lr=hparams.disc_initial_learning_rate, betas=(0.5, 0.999))\n\n    if args.checkpoint_path is not None:\n        load_checkpoint(args.checkpoint_path, model, optimizer, reset_optimizer=False)\n\n    if args.disc_checkpoint_path is not None:\n        load_checkpoint(args.disc_checkpoint_path, disc, disc_optimizer, \n                                reset_optimizer=False, overwrite_global_states=False)\n        \n    load_checkpoint(args.syncnet_checkpoint_path, syncnet, None, reset_optimizer=True, \n                                overwrite_global_states=False)\n\n    if not os.path.exists(checkpoint_dir):\n        os.mkdir(checkpoint_dir)\n\n    # Train!\n    train(device, model, disc, train_data_loader, test_data_loader, optimizer, disc_optimizer,\n              checkpoint_dir=checkpoint_dir,\n              checkpoint_interval=hparams.checkpoint_interval,\n              nepochs=hparams.nepochs)\n"
  },
  {
    "path": "inference.py",
    "content": "from os import listdir, path\nimport numpy as np\nimport scipy, cv2, os, sys, argparse, audio\nimport json, subprocess, random, string\nfrom tqdm import tqdm\nfrom glob import glob\nimport torch, face_detection\nfrom models import Wav2Lip\nimport platform\n\nparser = argparse.ArgumentParser(description='Inference code to lip-sync videos in the wild using Wav2Lip models')\n\nparser.add_argument('--checkpoint_path', type=str, \n\t\t\t\t\thelp='Name of saved checkpoint to load weights from', required=True)\n\nparser.add_argument('--face', type=str, \n\t\t\t\t\thelp='Filepath of video/image that contains faces to use', required=True)\nparser.add_argument('--audio', type=str, \n\t\t\t\t\thelp='Filepath of video/audio file to use as raw audio source', required=True)\nparser.add_argument('--outfile', type=str, help='Video path to save result. See default for an e.g.', \n\t\t\t\t\t\t\t\tdefault='results/result_voice.mp4')\n\nparser.add_argument('--static', type=bool, \n\t\t\t\t\thelp='If True, then use only first video frame for inference', default=False)\nparser.add_argument('--fps', type=float, help='Can be specified only if input is a static image (default: 25)', \n\t\t\t\t\tdefault=25., required=False)\n\nparser.add_argument('--pads', nargs='+', type=int, default=[0, 10, 0, 0], \n\t\t\t\t\thelp='Padding (top, bottom, left, right). Please adjust to include chin at least')\n\nparser.add_argument('--face_det_batch_size', type=int, \n\t\t\t\t\thelp='Batch size for face detection', default=16)\nparser.add_argument('--wav2lip_batch_size', type=int, help='Batch size for Wav2Lip model(s)', default=128)\n\nparser.add_argument('--resize_factor', default=1, type=int, \n\t\t\thelp='Reduce the resolution by this factor. Sometimes, best results are obtained at 480p or 720p')\n\nparser.add_argument('--crop', nargs='+', type=int, default=[0, -1, 0, -1], \n\t\t\t\t\thelp='Crop video to a smaller region (top, bottom, left, right). 
Applied after resize_factor and rotate arg. ' \n\t\t\t\t\t'Useful if multiple face present. -1 implies the value will be auto-inferred based on height, width')\n\nparser.add_argument('--box', nargs='+', type=int, default=[-1, -1, -1, -1], \n\t\t\t\t\thelp='Specify a constant bounding box for the face. Use only as a last resort if the face is not detected.'\n\t\t\t\t\t'Also, might work only if the face is not moving around much. Syntax: (top, bottom, left, right).')\n\nparser.add_argument('--rotate', default=False, action='store_true',\n\t\t\t\t\thelp='Sometimes videos taken from a phone can be flipped 90deg. If true, will flip video right by 90deg.'\n\t\t\t\t\t'Use if you get a flipped result, despite feeding a normal looking video')\n\nparser.add_argument('--nosmooth', default=False, action='store_true',\n\t\t\t\t\thelp='Prevent smoothing face detections over a short temporal window')\n\nargs = parser.parse_args()\nargs.img_size = 96\n\nif os.path.isfile(args.face) and args.face.split('.')[1] in ['jpg', 'png', 'jpeg']:\n\targs.static = True\n\ndef get_smoothened_boxes(boxes, T):\n\tfor i in range(len(boxes)):\n\t\tif i + T > len(boxes):\n\t\t\twindow = boxes[len(boxes) - T:]\n\t\telse:\n\t\t\twindow = boxes[i : i + T]\n\t\tboxes[i] = np.mean(window, axis=0)\n\treturn boxes\n\ndef face_detect(images):\n\tdetector = face_detection.FaceAlignment(face_detection.LandmarksType._2D, \n\t\t\t\t\t\t\t\t\t\t\tflip_input=False, device=device)\n\n\tbatch_size = args.face_det_batch_size\n\t\n\twhile 1:\n\t\tpredictions = []\n\t\ttry:\n\t\t\tfor i in tqdm(range(0, len(images), batch_size)):\n\t\t\t\tpredictions.extend(detector.get_detections_for_batch(np.array(images[i:i + batch_size])))\n\t\texcept RuntimeError:\n\t\t\tif batch_size == 1: \n\t\t\t\traise RuntimeError('Image too big to run face detection on GPU. 
Please use the --resize_factor argument')\n\t\t\tbatch_size //= 2\n\t\t\tprint('Recovering from OOM error; New batch size: {}'.format(batch_size))\n\t\t\tcontinue\n\t\tbreak\n\n\tresults = []\n\tpady1, pady2, padx1, padx2 = args.pads\n\tfor rect, image in zip(predictions, images):\n\t\tif rect is None:\n\t\t\tcv2.imwrite('temp/faulty_frame.jpg', image) # check this frame where the face was not detected.\n\t\t\traise ValueError('Face not detected! Ensure the video contains a face in all the frames.')\n\n\t\ty1 = max(0, rect[1] - pady1)\n\t\ty2 = min(image.shape[0], rect[3] + pady2)\n\t\tx1 = max(0, rect[0] - padx1)\n\t\tx2 = min(image.shape[1], rect[2] + padx2)\n\t\t\n\t\tresults.append([x1, y1, x2, y2])\n\n\tboxes = np.array(results)\n\tif not args.nosmooth: boxes = get_smoothened_boxes(boxes, T=5)\n\tresults = [[image[y1: y2, x1:x2], (y1, y2, x1, x2)] for image, (x1, y1, x2, y2) in zip(images, boxes)]\n\n\tdel detector\n\treturn results \n\ndef datagen(frames, mels):\n\timg_batch, mel_batch, frame_batch, coords_batch = [], [], [], []\n\n\tif args.box[0] == -1:\n\t\tif not args.static:\n\t\t\tface_det_results = face_detect(frames) # BGR2RGB for CNN face detection\n\t\telse:\n\t\t\tface_det_results = face_detect([frames[0]])\n\telse:\n\t\tprint('Using the specified bounding box instead of face detection...')\n\t\ty1, y2, x1, x2 = args.box\n\t\tface_det_results = [[f[y1: y2, x1:x2], (y1, y2, x1, x2)] for f in frames]\n\n\tfor i, m in enumerate(mels):\n\t\tidx = 0 if args.static else i%len(frames)\n\t\tframe_to_save = frames[idx].copy()\n\t\tface, coords = face_det_results[idx].copy()\n\n\t\tface = cv2.resize(face, (args.img_size, args.img_size))\n\t\t\t\n\t\timg_batch.append(face)\n\t\tmel_batch.append(m)\n\t\tframe_batch.append(frame_to_save)\n\t\tcoords_batch.append(coords)\n\n\t\tif len(img_batch) >= args.wav2lip_batch_size:\n\t\t\timg_batch, mel_batch = np.asarray(img_batch), np.asarray(mel_batch)\n\n\t\t\timg_masked = img_batch.copy()\n\t\t\timg_masked[:, 
args.img_size//2:] = 0\n\n\t\t\timg_batch = np.concatenate((img_masked, img_batch), axis=3) / 255.\n\t\t\tmel_batch = np.reshape(mel_batch, [len(mel_batch), mel_batch.shape[1], mel_batch.shape[2], 1])\n\n\t\t\tyield img_batch, mel_batch, frame_batch, coords_batch\n\t\t\timg_batch, mel_batch, frame_batch, coords_batch = [], [], [], []\n\n\tif len(img_batch) > 0:\n\t\timg_batch, mel_batch = np.asarray(img_batch), np.asarray(mel_batch)\n\n\t\timg_masked = img_batch.copy()\n\t\timg_masked[:, args.img_size//2:] = 0\n\n\t\timg_batch = np.concatenate((img_masked, img_batch), axis=3) / 255.\n\t\tmel_batch = np.reshape(mel_batch, [len(mel_batch), mel_batch.shape[1], mel_batch.shape[2], 1])\n\n\t\tyield img_batch, mel_batch, frame_batch, coords_batch\n\nmel_step_size = 16\ndevice = 'cuda' if torch.cuda.is_available() else 'cpu'\nprint('Using {} for inference.'.format(device))\n\ndef _load(checkpoint_path):\n\tif device == 'cuda':\n\t\tcheckpoint = torch.load(checkpoint_path)\n\telse:\n\t\tcheckpoint = torch.load(checkpoint_path,\n\t\t\t\t\t\t\t\tmap_location=lambda storage, loc: storage)\n\treturn checkpoint\n\ndef load_model(path):\n\tmodel = Wav2Lip()\n\tprint(\"Load checkpoint from: {}\".format(path))\n\tcheckpoint = _load(path)\n\ts = checkpoint[\"state_dict\"]\n\tnew_s = {}\n\tfor k, v in s.items():\n\t\tnew_s[k.replace('module.', '')] = v\n\tmodel.load_state_dict(new_s)\n\n\tmodel = model.to(device)\n\treturn model.eval()\n\ndef main():\n\tif not os.path.isfile(args.face):\n\t\traise ValueError('--face argument must be a valid path to video/image file')\n\n\telif args.face.split('.')[1] in ['jpg', 'png', 'jpeg']:\n\t\tfull_frames = [cv2.imread(args.face)]\n\t\tfps = args.fps\n\n\telse:\n\t\tvideo_stream = cv2.VideoCapture(args.face)\n\t\tfps = video_stream.get(cv2.CAP_PROP_FPS)\n\n\t\tprint('Reading video frames...')\n\n\t\tfull_frames = []\n\t\twhile 1:\n\t\t\tstill_reading, frame = video_stream.read()\n\t\t\tif not 
still_reading:\n\t\t\t\tvideo_stream.release()\n\t\t\t\tbreak\n\t\t\tif args.resize_factor > 1:\n\t\t\t\tframe = cv2.resize(frame, (frame.shape[1]//args.resize_factor, frame.shape[0]//args.resize_factor))\n\n\t\t\tif args.rotate:\n\t\t\t\t# cv2.cv2 is not available in newer OpenCV builds; the constant lives on cv2 itself.\n\t\t\t\tframe = cv2.rotate(frame, cv2.ROTATE_90_CLOCKWISE)\n\n\t\t\ty1, y2, x1, x2 = args.crop\n\t\t\tif x2 == -1: x2 = frame.shape[1]\n\t\t\tif y2 == -1: y2 = frame.shape[0]\n\n\t\t\tframe = frame[y1:y2, x1:x2]\n\n\t\t\tfull_frames.append(frame)\n\n\tprint (\"Number of frames available for inference: \"+str(len(full_frames)))\n\n\tif not args.audio.endswith('.wav'):\n\t\tprint('Extracting raw audio...')\n\t\tcommand = 'ffmpeg -y -i {} -strict -2 {}'.format(args.audio, 'temp/temp.wav')\n\n\t\tsubprocess.call(command, shell=True)\n\t\targs.audio = 'temp/temp.wav'\n\n\twav = audio.load_wav(args.audio, 16000)\n\tmel = audio.melspectrogram(wav)\n\tprint(mel.shape)\n\n\tif np.isnan(mel.reshape(-1)).sum() > 0:\n\t\traise ValueError('Mel contains nan! Using a TTS voice? 
Add a small epsilon noise to the wav file and try again')\n\n\tmel_chunks = []\n\tmel_idx_multiplier = 80./fps \n\ti = 0\n\twhile 1:\n\t\tstart_idx = int(i * mel_idx_multiplier)\n\t\tif start_idx + mel_step_size > len(mel[0]):\n\t\t\tmel_chunks.append(mel[:, len(mel[0]) - mel_step_size:])\n\t\t\tbreak\n\t\tmel_chunks.append(mel[:, start_idx : start_idx + mel_step_size])\n\t\ti += 1\n\n\tprint(\"Length of mel chunks: {}\".format(len(mel_chunks)))\n\n\tfull_frames = full_frames[:len(mel_chunks)]\n\n\tbatch_size = args.wav2lip_batch_size\n\tgen = datagen(full_frames.copy(), mel_chunks)\n\n\tfor i, (img_batch, mel_batch, frames, coords) in enumerate(tqdm(gen, \n\t\t\t\t\t\t\t\t\t\t\ttotal=int(np.ceil(float(len(mel_chunks))/batch_size)))):\n\t\tif i == 0:\n\t\t\tmodel = load_model(args.checkpoint_path)\n\t\t\tprint (\"Model loaded\")\n\n\t\t\tframe_h, frame_w = full_frames[0].shape[:-1]\n\t\t\tout = cv2.VideoWriter('temp/result.avi', \n\t\t\t\t\t\t\t\t\tcv2.VideoWriter_fourcc(*'DIVX'), fps, (frame_w, frame_h))\n\n\t\timg_batch = torch.FloatTensor(np.transpose(img_batch, (0, 3, 1, 2))).to(device)\n\t\tmel_batch = torch.FloatTensor(np.transpose(mel_batch, (0, 3, 1, 2))).to(device)\n\n\t\twith torch.no_grad():\n\t\t\tpred = model(mel_batch, img_batch)\n\n\t\tpred = pred.cpu().numpy().transpose(0, 2, 3, 1) * 255.\n\t\t\n\t\tfor p, f, c in zip(pred, frames, coords):\n\t\t\ty1, y2, x1, x2 = c\n\t\t\tp = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))\n\n\t\t\tf[y1:y2, x1:x2] = p\n\t\t\tout.write(f)\n\n\tout.release()\n\n\tcommand = 'ffmpeg -y -i {} -i {} -strict -2 -q:v 1 {}'.format(args.audio, 'temp/result.avi', args.outfile)\n\tsubprocess.call(command, shell=platform.system() != 'Windows')\n\nif __name__ == '__main__':\n\tmain()\n"
  },
  {
    "path": "models/__init__.py",
    "content": "from .wav2lip import Wav2Lip, Wav2Lip_disc_qual\nfrom .syncnet import SyncNet_color"
  },
  {
    "path": "models/conv.py",
    "content": "import torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nclass Conv2d(nn.Module):\n    def __init__(self, cin, cout, kernel_size, stride, padding, residual=False, *args, **kwargs):\n        super().__init__(*args, **kwargs)\n        self.conv_block = nn.Sequential(\n                            nn.Conv2d(cin, cout, kernel_size, stride, padding),\n                            nn.BatchNorm2d(cout)\n                            )\n        self.act = nn.ReLU()\n        self.residual = residual\n\n    def forward(self, x):\n        out = self.conv_block(x)\n        if self.residual:\n            out += x\n        return self.act(out)\n\nclass nonorm_Conv2d(nn.Module):\n    def __init__(self, cin, cout, kernel_size, stride, padding, residual=False, *args, **kwargs):\n        super().__init__(*args, **kwargs)\n        self.conv_block = nn.Sequential(\n                            nn.Conv2d(cin, cout, kernel_size, stride, padding),\n                            )\n        self.act = nn.LeakyReLU(0.01, inplace=True)\n\n    def forward(self, x):\n        out = self.conv_block(x)\n        return self.act(out)\n\nclass Conv2dTranspose(nn.Module):\n    def __init__(self, cin, cout, kernel_size, stride, padding, output_padding=0, *args, **kwargs):\n        super().__init__(*args, **kwargs)\n        self.conv_block = nn.Sequential(\n                            nn.ConvTranspose2d(cin, cout, kernel_size, stride, padding, output_padding),\n                            nn.BatchNorm2d(cout)\n                            )\n        self.act = nn.ReLU()\n\n    def forward(self, x):\n        out = self.conv_block(x)\n        return self.act(out)\n"
  },
  {
    "path": "models/syncnet.py",
    "content": "import torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom .conv import Conv2d\n\nclass SyncNet_color(nn.Module):\n    def __init__(self):\n        super(SyncNet_color, self).__init__()\n\n        self.face_encoder = nn.Sequential(\n            Conv2d(15, 32, kernel_size=(7, 7), stride=1, padding=3),\n\n            Conv2d(32, 64, kernel_size=5, stride=(1, 2), padding=1),\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(64, 128, kernel_size=3, stride=2, padding=1),\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(128, 256, kernel_size=3, stride=2, padding=1),\n            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(256, 512, kernel_size=3, stride=2, padding=1),\n            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(512, 512, kernel_size=3, stride=2, padding=1),\n            Conv2d(512, 512, kernel_size=3, stride=1, padding=0),\n            Conv2d(512, 512, kernel_size=1, stride=1, padding=0),)\n\n        self.audio_encoder = nn.Sequential(\n            Conv2d(1, 32, kernel_size=3, stride=1, padding=1),\n            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(32, 64, kernel_size=3, stride=(3, 1), padding=1),\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),\n            
Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(64, 128, kernel_size=3, stride=3, padding=1),\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(128, 256, kernel_size=3, stride=(3, 2), padding=1),\n            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(256, 512, kernel_size=3, stride=1, padding=0),\n            Conv2d(512, 512, kernel_size=1, stride=1, padding=0),)\n\n    def forward(self, audio_sequences, face_sequences): # audio_sequences := (B, dim, T)\n        face_embedding = self.face_encoder(face_sequences)\n        audio_embedding = self.audio_encoder(audio_sequences)\n\n        audio_embedding = audio_embedding.view(audio_embedding.size(0), -1)\n        face_embedding = face_embedding.view(face_embedding.size(0), -1)\n\n        audio_embedding = F.normalize(audio_embedding, p=2, dim=1)\n        face_embedding = F.normalize(face_embedding, p=2, dim=1)\n\n\n        return audio_embedding, face_embedding\n"
  },
  {
    "path": "models/wav2lip.py",
    "content": "import torch\nfrom torch import nn\nfrom torch.nn import functional as F\nimport math\n\nfrom .conv import Conv2dTranspose, Conv2d, nonorm_Conv2d\n\nclass Wav2Lip(nn.Module):\n    def __init__(self):\n        super(Wav2Lip, self).__init__()\n\n        self.face_encoder_blocks = nn.ModuleList([\n            nn.Sequential(Conv2d(6, 16, kernel_size=7, stride=1, padding=3)), # 96,96\n\n            nn.Sequential(Conv2d(16, 32, kernel_size=3, stride=2, padding=1), # 48,48\n            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True)),\n\n            nn.Sequential(Conv2d(32, 64, kernel_size=3, stride=2, padding=1),    # 24,24\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True)),\n\n            nn.Sequential(Conv2d(64, 128, kernel_size=3, stride=2, padding=1),   # 12,12\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True)),\n\n            nn.Sequential(Conv2d(128, 256, kernel_size=3, stride=2, padding=1),       # 6,6\n            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True)),\n\n            nn.Sequential(Conv2d(256, 512, kernel_size=3, stride=2, padding=1),     # 3,3\n            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),),\n            \n            nn.Sequential(Conv2d(512, 512, kernel_size=3, stride=1, padding=0),     # 1, 1\n            Conv2d(512, 512, kernel_size=1, stride=1, padding=0)),])\n\n        self.audio_encoder = nn.Sequential(\n            Conv2d(1, 32, kernel_size=3, stride=1, padding=1),\n            Conv2d(32, 32, 
kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(32, 64, kernel_size=3, stride=(3, 1), padding=1),\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(64, 128, kernel_size=3, stride=3, padding=1),\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(128, 256, kernel_size=3, stride=(3, 2), padding=1),\n            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),\n\n            Conv2d(256, 512, kernel_size=3, stride=1, padding=0),\n            Conv2d(512, 512, kernel_size=1, stride=1, padding=0),)\n\n        self.face_decoder_blocks = nn.ModuleList([\n            nn.Sequential(Conv2d(512, 512, kernel_size=1, stride=1, padding=0),),\n\n            nn.Sequential(Conv2dTranspose(1024, 512, kernel_size=3, stride=1, padding=0), # 3,3\n            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),),\n\n            nn.Sequential(Conv2dTranspose(1024, 512, kernel_size=3, stride=2, padding=1, output_padding=1),\n            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),), # 6, 6\n\n            nn.Sequential(Conv2dTranspose(768, 384, kernel_size=3, stride=2, padding=1, output_padding=1),\n            Conv2d(384, 384, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(384, 384, kernel_size=3, stride=1, padding=1, residual=True),), # 12, 12\n\n            nn.Sequential(Conv2dTranspose(512, 256, kernel_size=3, stride=2, padding=1, output_padding=1),\n            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(256, 256, 
kernel_size=3, stride=1, padding=1, residual=True),), # 24, 24\n\n            nn.Sequential(Conv2dTranspose(320, 128, kernel_size=3, stride=2, padding=1, output_padding=1), \n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),), # 48, 48\n\n            nn.Sequential(Conv2dTranspose(160, 64, kernel_size=3, stride=2, padding=1, output_padding=1),\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),\n            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),),]) # 96,96\n\n        self.output_block = nn.Sequential(Conv2d(80, 32, kernel_size=3, stride=1, padding=1),\n            nn.Conv2d(32, 3, kernel_size=1, stride=1, padding=0),\n            nn.Sigmoid()) \n\n    def forward(self, audio_sequences, face_sequences):\n        # audio_sequences = (B, T, 1, 80, 16)\n        B = audio_sequences.size(0)\n\n        input_dim_size = len(face_sequences.size())\n        if input_dim_size > 4:\n            audio_sequences = torch.cat([audio_sequences[:, i] for i in range(audio_sequences.size(1))], dim=0)\n            face_sequences = torch.cat([face_sequences[:, :, i] for i in range(face_sequences.size(2))], dim=0)\n\n        audio_embedding = self.audio_encoder(audio_sequences) # B, 512, 1, 1\n\n        feats = []\n        x = face_sequences\n        for f in self.face_encoder_blocks:\n            x = f(x)\n            feats.append(x)\n\n        x = audio_embedding\n        for f in self.face_decoder_blocks:\n            x = f(x)\n            try:\n                x = torch.cat((x, feats[-1]), dim=1)\n            except Exception as e:\n                print(x.size())\n                print(feats[-1].size())\n                raise e\n            \n            feats.pop()\n\n        x = self.output_block(x)\n\n        if input_dim_size > 4:\n            x = torch.split(x, B, dim=0) # [(B, C, H, W)]\n            outputs = 
torch.stack(x, dim=2) # (B, C, T, H, W)\n\n        else:\n            outputs = x\n            \n        return outputs\n\nclass Wav2Lip_disc_qual(nn.Module):\n    def __init__(self):\n        super(Wav2Lip_disc_qual, self).__init__()\n\n        self.face_encoder_blocks = nn.ModuleList([\n            nn.Sequential(nonorm_Conv2d(3, 32, kernel_size=7, stride=1, padding=3)), # 48,96\n\n            nn.Sequential(nonorm_Conv2d(32, 64, kernel_size=5, stride=(1, 2), padding=2), # 48,48\n            nonorm_Conv2d(64, 64, kernel_size=5, stride=1, padding=2)),\n\n            nn.Sequential(nonorm_Conv2d(64, 128, kernel_size=5, stride=2, padding=2),    # 24,24\n            nonorm_Conv2d(128, 128, kernel_size=5, stride=1, padding=2)),\n\n            nn.Sequential(nonorm_Conv2d(128, 256, kernel_size=5, stride=2, padding=2),   # 12,12\n            nonorm_Conv2d(256, 256, kernel_size=5, stride=1, padding=2)),\n\n            nn.Sequential(nonorm_Conv2d(256, 512, kernel_size=3, stride=2, padding=1),       # 6,6\n            nonorm_Conv2d(512, 512, kernel_size=3, stride=1, padding=1)),\n\n            nn.Sequential(nonorm_Conv2d(512, 512, kernel_size=3, stride=2, padding=1),     # 3,3\n            nonorm_Conv2d(512, 512, kernel_size=3, stride=1, padding=1),),\n            \n            nn.Sequential(nonorm_Conv2d(512, 512, kernel_size=3, stride=1, padding=0),     # 1, 1\n            nonorm_Conv2d(512, 512, kernel_size=1, stride=1, padding=0)),])\n\n        self.binary_pred = nn.Sequential(nn.Conv2d(512, 1, kernel_size=1, stride=1, padding=0), nn.Sigmoid())\n        self.label_noise = .0\n\n    def get_lower_half(self, face_sequences):\n        return face_sequences[:, :, face_sequences.size(2)//2:]\n\n    def to_2d(self, face_sequences):\n        B = face_sequences.size(0)\n        face_sequences = torch.cat([face_sequences[:, :, i] for i in range(face_sequences.size(2))], dim=0)\n        return face_sequences\n\n    def perceptual_forward(self, false_face_sequences):\n        
false_face_sequences = self.to_2d(false_face_sequences)\n        false_face_sequences = self.get_lower_half(false_face_sequences)\n\n        false_feats = false_face_sequences\n        for f in self.face_encoder_blocks:\n            false_feats = f(false_feats)\n\n        # Build the target on the same device as the features so this also runs on CPU.\n        false_pred_loss = F.binary_cross_entropy(self.binary_pred(false_feats).view(len(false_feats), -1), \n                                        torch.ones((len(false_feats), 1)).to(false_feats.device))\n\n        return false_pred_loss\n\n    def forward(self, face_sequences):\n        face_sequences = self.to_2d(face_sequences)\n        face_sequences = self.get_lower_half(face_sequences)\n\n        x = face_sequences\n        for f in self.face_encoder_blocks:\n            x = f(x)\n\n        return self.binary_pred(x).view(len(x), -1)\n"
  },
  {
    "path": "preprocess.py",
    "content": "import sys\n\nif sys.version_info[0] < 3 and sys.version_info[1] < 2:\n\traise Exception(\"Must be using >= Python 3.2\")\n\nfrom os import listdir, path\n\nif not path.isfile('face_detection/detection/sfd/s3fd.pth'):\n\traise FileNotFoundError('Save the s3fd model to face_detection/detection/sfd/s3fd.pth \\\n\t\t\t\t\t\t\tbefore running this script!')\n\nimport multiprocessing as mp\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\nimport numpy as np\nimport argparse, os, cv2, traceback, subprocess\nfrom tqdm import tqdm\nfrom glob import glob\nimport audio\nfrom hparams import hparams as hp\n\nimport face_detection\n\nparser = argparse.ArgumentParser()\n\nparser.add_argument('--ngpu', help='Number of GPUs across which to run in parallel', default=1, type=int)\nparser.add_argument('--batch_size', help='Single GPU Face detection batch size', default=32, type=int)\nparser.add_argument(\"--data_root\", help=\"Root folder of the LRS2 dataset\", required=True)\nparser.add_argument(\"--preprocessed_root\", help=\"Root folder of the preprocessed dataset\", required=True)\n\nargs = parser.parse_args()\n\nfa = [face_detection.FaceAlignment(face_detection.LandmarksType._2D, flip_input=False, \n\t\t\t\t\t\t\t\t\tdevice='cuda:{}'.format(id)) for id in range(args.ngpu)]\n\ntemplate = 'ffmpeg -loglevel panic -y -i {} -strict -2 {}'\n# template2 = 'ffmpeg -hide_banner -loglevel panic -threads 1 -y -i {} -async 1 -ac 1 -vn -acodec pcm_s16le -ar 16000 {}'\n\ndef process_video_file(vfile, args, gpu_id):\n\tvideo_stream = cv2.VideoCapture(vfile)\n\t\n\tframes = []\n\twhile 1:\n\t\tstill_reading, frame = video_stream.read()\n\t\tif not still_reading:\n\t\t\tvideo_stream.release()\n\t\t\tbreak\n\t\tframes.append(frame)\n\t\n\tvidname = os.path.basename(vfile).split('.')[0]\n\tdirname = vfile.split('/')[-2]\n\n\tfulldir = path.join(args.preprocessed_root, dirname, vidname)\n\tos.makedirs(fulldir, exist_ok=True)\n\n\tbatches = [frames[i:i + 
args.batch_size] for i in range(0, len(frames), args.batch_size)]\n\n\ti = -1\n\tfor fb in batches:\n\t\tpreds = fa[gpu_id].get_detections_for_batch(np.asarray(fb))\n\n\t\tfor j, f in enumerate(preds):\n\t\t\ti += 1\n\t\t\tif f is None:\n\t\t\t\tcontinue\n\n\t\t\tx1, y1, x2, y2 = f\n\t\t\tcv2.imwrite(path.join(fulldir, '{}.jpg'.format(i)), fb[j][y1:y2, x1:x2])\n\ndef process_audio_file(vfile, args):\n\tvidname = os.path.basename(vfile).split('.')[0]\n\tdirname = vfile.split('/')[-2]\n\n\tfulldir = path.join(args.preprocessed_root, dirname, vidname)\n\tos.makedirs(fulldir, exist_ok=True)\n\n\twavpath = path.join(fulldir, 'audio.wav')\n\n\tcommand = template.format(vfile, wavpath)\n\tsubprocess.call(command, shell=True)\n\n\t\ndef mp_handler(job):\n\tvfile, args, gpu_id = job\n\ttry:\n\t\tprocess_video_file(vfile, args, gpu_id)\n\texcept KeyboardInterrupt:\n\t\texit(0)\n\texcept:\n\t\ttraceback.print_exc()\n\t\t\ndef main(args):\n\tprint('Started processing for {} with {} GPUs'.format(args.data_root, args.ngpu))\n\n\tfilelist = glob(path.join(args.data_root, '*/*.mp4'))\n\n\tjobs = [(vfile, args, i%args.ngpu) for i, vfile in enumerate(filelist)]\n\tp = ThreadPoolExecutor(args.ngpu)\n\tfutures = [p.submit(mp_handler, j) for j in jobs]\n\t_ = [r.result() for r in tqdm(as_completed(futures), total=len(futures))]\n\n\tprint('Dumping audios...')\n\n\tfor vfile in tqdm(filelist):\n\t\ttry:\n\t\t\tprocess_audio_file(vfile, args)\n\t\texcept KeyboardInterrupt:\n\t\t\texit(0)\n\t\texcept:\n\t\t\ttraceback.print_exc()\n\t\t\tcontinue\n\nif __name__ == '__main__':\n\tmain(args)"
  },
  {
    "path": "requirements.txt",
    "content": "librosa==0.7.0\nnumpy==1.17.1\nopencv-contrib-python>=4.2.0.34\nopencv-python==4.1.0.25\ntorch==1.1.0\ntorchvision==0.3.0\ntqdm==4.45.0\nnumba==0.48\n"
  },
  {
    "path": "results/README.md",
    "content": "Generated results will be placed in this folder by default."
  },
  {
    "path": "temp/README.md",
    "content": "Temporary files created during inference/testing are saved here. You can ignore them."
  },
  {
    "path": "wav2lip_train.py",
    "content": "from os.path import dirname, join, basename, isfile\nfrom tqdm import tqdm\n\nfrom models import SyncNet_color as SyncNet\nfrom models import Wav2Lip as Wav2Lip\nimport audio\n\nimport torch\nfrom torch import nn\nfrom torch import optim\nimport torch.backends.cudnn as cudnn\nfrom torch.utils import data as data_utils\nimport numpy as np\n\nfrom glob import glob\n\nimport os, random, cv2, argparse\nfrom hparams import hparams, get_image_list\n\nparser = argparse.ArgumentParser(description='Code to train the Wav2Lip model without the visual quality discriminator')\n\nparser.add_argument(\"--data_root\", help=\"Root folder of the preprocessed LRS2 dataset\", required=True, type=str)\n\nparser.add_argument('--checkpoint_dir', help='Save checkpoints to this directory', required=True, type=str)\nparser.add_argument('--syncnet_checkpoint_path', help='Load the pre-trained Expert discriminator', required=True, type=str)\n\nparser.add_argument('--checkpoint_path', help='Resume from this checkpoint', default=None, type=str)\n\nargs = parser.parse_args()\n\n\nglobal_step = 0\nglobal_epoch = 0\nuse_cuda = torch.cuda.is_available()\nprint('use_cuda: {}'.format(use_cuda))\n\nsyncnet_T = 5\nsyncnet_mel_step_size = 16\n\nclass Dataset(object):\n    def __init__(self, split):\n        self.all_videos = get_image_list(args.data_root, split)\n\n    def get_frame_id(self, frame):\n        return int(basename(frame).split('.')[0])\n\n    def get_window(self, start_frame):\n        start_id = self.get_frame_id(start_frame)\n        vidname = dirname(start_frame)\n\n        window_fnames = []\n        for frame_id in range(start_id, start_id + syncnet_T):\n            frame = join(vidname, '{}.jpg'.format(frame_id))\n            if not isfile(frame):\n                return None\n            window_fnames.append(frame)\n        return window_fnames\n\n    def read_window(self, window_fnames):\n        if window_fnames is None: return None\n        window = []\n        for 
fname in window_fnames:\n            img = cv2.imread(fname)\n            if img is None:\n                return None\n            try:\n                img = cv2.resize(img, (hparams.img_size, hparams.img_size))\n            except Exception as e:\n                return None\n\n            window.append(img)\n\n        return window\n\n    def crop_audio_window(self, spec, start_frame):\n        if type(start_frame) == int:\n            start_frame_num = start_frame\n        else:\n            start_frame_num = self.get_frame_id(start_frame) # 0-indexing ---> 1-indexing\n        start_idx = int(80. * (start_frame_num / float(hparams.fps)))\n        \n        end_idx = start_idx + syncnet_mel_step_size\n\n        return spec[start_idx : end_idx, :]\n\n    def get_segmented_mels(self, spec, start_frame):\n        mels = []\n        assert syncnet_T == 5\n        start_frame_num = self.get_frame_id(start_frame) + 1 # 0-indexing ---> 1-indexing\n        if start_frame_num - 2 < 0: return None\n        for i in range(start_frame_num, start_frame_num + syncnet_T):\n            m = self.crop_audio_window(spec, i - 2)\n            if m.shape[0] != syncnet_mel_step_size:\n                return None\n            mels.append(m.T)\n\n        mels = np.asarray(mels)\n\n        return mels\n\n    def prepare_window(self, window):\n        # 3 x T x H x W\n        x = np.asarray(window) / 255.\n        x = np.transpose(x, (3, 0, 1, 2))\n\n        return x\n\n    def __len__(self):\n        return len(self.all_videos)\n\n    def __getitem__(self, idx):\n        while 1:\n            idx = random.randint(0, len(self.all_videos) - 1)\n            vidname = self.all_videos[idx]\n            img_names = list(glob(join(vidname, '*.jpg')))\n            if len(img_names) <= 3 * syncnet_T:\n                continue\n            \n            img_name = random.choice(img_names)\n            wrong_img_name = random.choice(img_names)\n            while wrong_img_name == img_name:\n       
         wrong_img_name = random.choice(img_names)\n\n            window_fnames = self.get_window(img_name)\n            wrong_window_fnames = self.get_window(wrong_img_name)\n            if window_fnames is None or wrong_window_fnames is None:\n                continue\n\n            window = self.read_window(window_fnames)\n            if window is None:\n                continue\n\n            wrong_window = self.read_window(wrong_window_fnames)\n            if wrong_window is None:\n                continue\n\n            try:\n                wavpath = join(vidname, \"audio.wav\")\n                wav = audio.load_wav(wavpath, hparams.sample_rate)\n\n                orig_mel = audio.melspectrogram(wav).T\n            except Exception as e:\n                continue\n\n            mel = self.crop_audio_window(orig_mel.copy(), img_name)\n            \n            if (mel.shape[0] != syncnet_mel_step_size):\n                continue\n\n            indiv_mels = self.get_segmented_mels(orig_mel.copy(), img_name)\n            if indiv_mels is None: continue\n\n            window = self.prepare_window(window)\n            y = window.copy()\n            window[:, :, window.shape[2]//2:] = 0.\n\n            wrong_window = self.prepare_window(wrong_window)\n            x = np.concatenate([window, wrong_window], axis=0)\n\n            x = torch.FloatTensor(x)\n            mel = torch.FloatTensor(mel.T).unsqueeze(0)\n            indiv_mels = torch.FloatTensor(indiv_mels).unsqueeze(1)\n            y = torch.FloatTensor(y)\n            return x, indiv_mels, mel, y\n\ndef save_sample_images(x, g, gt, global_step, checkpoint_dir):\n    x = (x.detach().cpu().numpy().transpose(0, 2, 3, 4, 1) * 255.).astype(np.uint8)\n    g = (g.detach().cpu().numpy().transpose(0, 2, 3, 4, 1) * 255.).astype(np.uint8)\n    gt = (gt.detach().cpu().numpy().transpose(0, 2, 3, 4, 1) * 255.).astype(np.uint8)\n\n    refs, inps = x[..., 3:], x[..., :3]\n    folder = join(checkpoint_dir, 
\"samples_step{:09d}\".format(global_step))\n    if not os.path.exists(folder): os.mkdir(folder)\n    collage = np.concatenate((refs, inps, g, gt), axis=-2)\n    for batch_idx, c in enumerate(collage):\n        for t in range(len(c)):\n            cv2.imwrite('{}/{}_{}.jpg'.format(folder, batch_idx, t), c[t])\n\nlogloss = nn.BCELoss()\ndef cosine_loss(a, v, y):\n    d = nn.functional.cosine_similarity(a, v)\n    loss = logloss(d.unsqueeze(1), y)\n\n    return loss\n\ndevice = torch.device(\"cuda\" if use_cuda else \"cpu\")\nsyncnet = SyncNet().to(device)\nfor p in syncnet.parameters():\n    p.requires_grad = False\n\nrecon_loss = nn.L1Loss()\ndef get_sync_loss(mel, g):\n    g = g[:, :, :, g.size(3)//2:]\n    g = torch.cat([g[:, :, i] for i in range(syncnet_T)], dim=1)\n    # B, 3 * T, H//2, W\n    a, v = syncnet(mel, g)\n    y = torch.ones(g.size(0), 1).float().to(device)\n    return cosine_loss(a, v, y)\n\ndef train(device, model, train_data_loader, test_data_loader, optimizer,\n          checkpoint_dir=None, checkpoint_interval=None, nepochs=None):\n\n    global global_step, global_epoch\n    resumed_step = global_step\n \n    while global_epoch < nepochs:\n        print('Starting Epoch: {}'.format(global_epoch))\n        running_sync_loss, running_l1_loss = 0., 0.\n        prog_bar = tqdm(enumerate(train_data_loader))\n        for step, (x, indiv_mels, mel, gt) in prog_bar:\n            model.train()\n            optimizer.zero_grad()\n\n            # Move data to CUDA device\n            x = x.to(device)\n            mel = mel.to(device)\n            indiv_mels = indiv_mels.to(device)\n            gt = gt.to(device)\n\n            g = model(indiv_mels, x)\n\n            if hparams.syncnet_wt > 0.:\n                sync_loss = get_sync_loss(mel, g)\n            else:\n                sync_loss = 0.\n\n            l1loss = recon_loss(g, gt)\n\n            loss = hparams.syncnet_wt * sync_loss + (1 - hparams.syncnet_wt) * l1loss\n            loss.backward()\n      
      optimizer.step()\n\n            if global_step % checkpoint_interval == 0:\n                save_sample_images(x, g, gt, global_step, checkpoint_dir)\n\n            global_step += 1\n            cur_session_steps = global_step - resumed_step\n\n            running_l1_loss += l1loss.item()\n            if hparams.syncnet_wt > 0.:\n                running_sync_loss += sync_loss.item()\n            else:\n                running_sync_loss += 0.\n\n            if global_step == 1 or global_step % checkpoint_interval == 0:\n                save_checkpoint(\n                    model, optimizer, global_step, checkpoint_dir, global_epoch)\n\n            if global_step == 1 or global_step % hparams.eval_interval == 0:\n                with torch.no_grad():\n                    average_sync_loss = eval_model(test_data_loader, global_step, device, model, checkpoint_dir)\n\n                    if average_sync_loss < .75:\n                        hparams.set_hparam('syncnet_wt', 0.01) # without image GAN a lesser weight is sufficient\n\n            prog_bar.set_description('L1: {}, Sync Loss: {}'.format(running_l1_loss / (step + 1),\n                                                                    running_sync_loss / (step + 1)))\n\n        global_epoch += 1\n        \n\ndef eval_model(test_data_loader, global_step, device, model, checkpoint_dir):\n    eval_steps = 700\n    print('Evaluating for {} steps'.format(eval_steps))\n    sync_losses, recon_losses = [], []\n    step = 0\n    while 1:\n        for x, indiv_mels, mel, gt in test_data_loader:\n            step += 1\n            model.eval()\n\n            # Move data to CUDA device\n            x = x.to(device)\n            gt = gt.to(device)\n            indiv_mels = indiv_mels.to(device)\n            mel = mel.to(device)\n\n            g = model(indiv_mels, x)\n\n            sync_loss = get_sync_loss(mel, g)\n            l1loss = recon_loss(g, gt)\n\n            sync_losses.append(sync_loss.item())\n            
recon_losses.append(l1loss.item())\n\n            if step > eval_steps: \n                averaged_sync_loss = sum(sync_losses) / len(sync_losses)\n                averaged_recon_loss = sum(recon_losses) / len(recon_losses)\n\n                print('L1: {}, Sync loss: {}'.format(averaged_recon_loss, averaged_sync_loss))\n\n                return averaged_sync_loss\n\ndef save_checkpoint(model, optimizer, step, checkpoint_dir, epoch):\n\n    checkpoint_path = join(\n        checkpoint_dir, \"checkpoint_step{:09d}.pth\".format(step))\n    optimizer_state = optimizer.state_dict() if hparams.save_optimizer_state else None\n    torch.save({\n        \"state_dict\": model.state_dict(),\n        \"optimizer\": optimizer_state,\n        \"global_step\": step,\n        \"global_epoch\": epoch,\n    }, checkpoint_path)\n    print(\"Saved checkpoint:\", checkpoint_path)\n\n\ndef _load(checkpoint_path):\n    if use_cuda:\n        checkpoint = torch.load(checkpoint_path)\n    else:\n        checkpoint = torch.load(checkpoint_path,\n                                map_location=lambda storage, loc: storage)\n    return checkpoint\n\ndef load_checkpoint(path, model, optimizer, reset_optimizer=False, overwrite_global_states=True):\n    global global_step\n    global global_epoch\n\n    print(\"Load checkpoint from: {}\".format(path))\n    checkpoint = _load(path)\n    s = checkpoint[\"state_dict\"]\n    new_s = {}\n    for k, v in s.items():\n        new_s[k.replace('module.', '')] = v\n    model.load_state_dict(new_s)\n    if not reset_optimizer:\n        optimizer_state = checkpoint[\"optimizer\"]\n        if optimizer_state is not None:\n            print(\"Load optimizer state from {}\".format(path))\n            optimizer.load_state_dict(checkpoint[\"optimizer\"])\n    if overwrite_global_states:\n        global_step = checkpoint[\"global_step\"]\n        global_epoch = checkpoint[\"global_epoch\"]\n\n    return model\n\nif __name__ == \"__main__\":\n    checkpoint_dir 
= args.checkpoint_dir\n\n    # Dataset and Dataloader setup\n    train_dataset = Dataset('train')\n    test_dataset = Dataset('val')\n\n    train_data_loader = data_utils.DataLoader(\n        train_dataset, batch_size=hparams.batch_size, shuffle=True,\n        num_workers=hparams.num_workers)\n\n    test_data_loader = data_utils.DataLoader(\n        test_dataset, batch_size=hparams.batch_size,\n        num_workers=4)\n\n    device = torch.device(\"cuda\" if use_cuda else \"cpu\")\n\n    # Model\n    model = Wav2Lip().to(device)\n    print('total trainable params {}'.format(sum(p.numel() for p in model.parameters() if p.requires_grad)))\n\n    optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad],\n                           lr=hparams.initial_learning_rate)\n\n    if args.checkpoint_path is not None:\n        load_checkpoint(args.checkpoint_path, model, optimizer, reset_optimizer=False)\n        \n    load_checkpoint(args.syncnet_checkpoint_path, syncnet, None, reset_optimizer=True, overwrite_global_states=False)\n\n    if not os.path.exists(checkpoint_dir):\n        os.mkdir(checkpoint_dir)\n\n    # Train!\n    train(device, model, train_data_loader, test_data_loader, optimizer,\n              checkpoint_dir=checkpoint_dir,\n              checkpoint_interval=hparams.checkpoint_interval,\n              nepochs=hparams.nepochs)\n"
  }
]