Full Code of lieff/minih264 for AI

Repository: lieff/minih264
Branch: master
Commit: b0baea7a80ef
Files: 25
Total size: 44.1 MB

Directory structure:
gitextract_ua023m8r/

├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── asm/
│   ├── minih264e_asm.h
│   └── neon/
│       ├── h264e_cavlc_arm11.s
│       ├── h264e_deblock_neon.s
│       ├── h264e_denoise_neon.s
│       ├── h264e_intra_neon.s
│       ├── h264e_qpel_neon.s
│       ├── h264e_sad_neon.s
│       └── h264e_transform_neon.s
├── minih264e.h
├── minih264e_test.c
├── scripts/
│   ├── build_arm.sh
│   ├── build_arm_clang.sh
│   ├── build_x86.sh
│   ├── build_x86_clang.sh
│   ├── profile.sh
│   └── test.sh
├── system.c
├── system.h
└── vectors/
    ├── foreman.cif
    ├── out_ref.264
    └── x264.264

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
h264enc_*
qemu-prof
*.gcda
*.gcno
*.gcov

================================================
FILE: .travis.yml
================================================
language: c
addons:
  apt:
    packages:
      - build-essential
      - libc6-dev-i386
      - linux-libc-dev:i386
      - gcc-arm-none-eabi
      - gcc-arm-linux-gnueabihf
      - libnewlib-arm-none-eabi
      - clang
      - gcc-5-multilib
      - gcc-arm-linux-gnueabihf
      - gcc-aarch64-linux-gnu
      - gcc-powerpc-linux-gnu
      - gcc-5-arm-linux-gnueabihf
      - gcc-5-aarch64-linux-gnu
      - gcc-5-powerpc-linux-gnu
      - libc6-armhf-cross
      - libc6-arm64-cross
      - libc6-powerpc-cross
      - libc6-dev-armhf-cross
      - libc6-dev-arm64-cross
      - libc6-dev-powerpc-cross
      - qemu

os:
    - linux

compiler:
    - gcc

script:
    - scripts/build_x86.sh
    - scripts/build_arm.sh
    - scripts/test.sh


================================================
FILE: LICENSE
================================================
CC0 1.0 Universal

Statement of Purpose

The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator and
subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").

Certain owners wish to permanently relinquish those rights to a Work for the
purpose of contributing to a commons of creative, cultural and scientific
works ("Commons") that the public can reliably and without fear of later
claims of infringement build upon, modify, incorporate in other works, reuse
and redistribute as freely as possible in any form whatsoever and for any
purposes, including without limitation commercial purposes. These owners may
contribute to the Commons to promote the ideal of a free culture and the
further production of creative, cultural and scientific works, or to gain
reputation or greater distribution for their Work in part through the use and
efforts of others.

For these and/or other purposes and motivations, and without any expectation
of additional consideration or compensation, the person associating CC0 with a
Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
and publicly distribute the Work under its terms, with knowledge of his or her
Copyright and Related Rights in the Work and the meaning and intended legal
effect of CC0 on those rights.

1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not limited
to, the following:

  i. the right to reproduce, adapt, distribute, perform, display, communicate,
  and translate a Work;

  ii. moral rights retained by the original author(s) and/or performer(s);

  iii. publicity and privacy rights pertaining to a person's image or likeness
  depicted in a Work;

  iv. rights protecting against unfair competition in regards to a Work,
  subject to the limitations in paragraph 4(a), below;

  v. rights protecting the extraction, dissemination, use and reuse of data in
  a Work;

  vi. database rights (such as those arising under Directive 96/9/EC of the
  European Parliament and of the Council of 11 March 1996 on the legal
  protection of databases, and under any national implementation thereof,
  including any amended or successor version of such directive); and

  vii. other similar, equivalent or corresponding rights throughout the world
  based on applicable law or treaty, and any national implementations thereof.

2. Waiver. To the greatest extent permitted by, but not in contravention of,
applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
and Related Rights and associated claims and causes of action, whether now
known or unknown (including existing as well as future claims and causes of
action), in the Work (i) in all territories worldwide, (ii) for the maximum
duration provided by applicable law or treaty (including future time
extensions), (iii) in any current or future medium and for any number of
copies, and (iv) for any purpose whatsoever, including without limitation
commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
the Waiver for the benefit of each member of the public at large and to the
detriment of Affirmer's heirs and successors, fully intending that such Waiver
shall not be subject to revocation, rescission, cancellation, termination, or
any other legal or equitable action to disrupt the quiet enjoyment of the Work
by the public as contemplated by Affirmer's express Statement of Purpose.

3. Public License Fallback. Should any part of the Waiver for any reason be
judged legally invalid or ineffective under applicable law, then the Waiver
shall be preserved to the maximum extent permitted taking into account
Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
is so judged Affirmer hereby grants to each affected person a royalty-free,
non transferable, non sublicensable, non exclusive, irrevocable and
unconditional license to exercise Affirmer's Copyright and Related Rights in
the Work (i) in all territories worldwide, (ii) for the maximum duration
provided by applicable law or treaty (including future time extensions), (iii)
in any current or future medium and for any number of copies, and (iv) for any
purpose whatsoever, including without limitation commercial, advertising or
promotional purposes (the "License"). The License shall be deemed effective as
of the date CC0 was applied by Affirmer to the Work. Should any part of the
License for any reason be judged legally invalid or ineffective under
applicable law, such partial invalidity or ineffectiveness shall not
invalidate the remainder of the License, and in such case Affirmer hereby
affirms that he or she will not (i) exercise any of his or her remaining
Copyright and Related Rights in the Work or (ii) assert any associated claims
and causes of action with respect to the Work, in either case contrary to
Affirmer's express Statement of Purpose.

4. Limitations and Disclaimers.

  a. No trademark or patent rights held by Affirmer are waived, abandoned,
  surrendered, licensed or otherwise affected by this document.

  b. Affirmer offers the Work as-is and makes no representations or warranties
  of any kind concerning the Work, express, implied, statutory or otherwise,
  including without limitation warranties of title, merchantability, fitness
  for a particular purpose, non infringement, or the absence of latent or
  other defects, accuracy, or the present or absence of errors, whether or not
  discoverable, all to the greatest extent permissible under applicable law.

  c. Affirmer disclaims responsibility for clearing rights of other persons
  that may apply to the Work or any use thereof, including without limitation
  any person's Copyright and Related Rights in the Work. Further, Affirmer
  disclaims responsibility for obtaining any necessary consents, permissions
  or other rights required for any use of the Work.

  d. Affirmer understands and acknowledges that Creative Commons is not a
  party to this document and has no duty or obligation with respect to this
  CC0 or use of the Work.

For more information, please see
<http://creativecommons.org/publicdomain/zero/1.0/>



================================================
FILE: README.md
================================================
minih264
==========

[![Build Status](https://travis-ci.org/lieff/minih264.svg)](https://travis-ci.org/lieff/minih264)

A small but reasonably fast single-header H.264/SVC encoder library with SSE/NEON optimizations.
A decoder may be added in the future.

Disclaimer: this code is highly experimental.

## Comparison with [x264](https://www.videolan.org/developers/x264.html)

Rough comparison with x264 on an i7-6700K:

`x264 -I 30 --profile baseline --preset veryfast --tune zerolatency -b 0 -r 1 --qp 33 --ipratio 1.0 --qcomp 1.0 -o x264.264 --fps 30 vectors/foreman.cif --input-res 352x288 --slices 1 --threads 1`

vs

`./h264enc_x64 vectors/foreman.cif`

| x264            | minih264 |
| --------------- | -------- |
| source: ~4.6 MB | 409 KB   |
| binary: 1.2 MB  | 100 KB   |
| time: 0.282 s   | 0.503 s  |
| out size: 320 KB | 391 KB  |

PSNR:
```
x264:     PSNR y:32.774824 u:38.874450 v:39.926132 average:34.084281 min:31.842667 max:36.630286
minih264: PSNR y:33.321686 u:38.858879 v:39.955914 average:34.574459 min:32.389171 max:37.174073
```

First intra frame screenshot (left-to-right: original 152064, minih264 5067, x264 5297 bytes):

![Intra screenshot](images/intra.png?raw=true)

You can compare the results in motion with the ffplay/mpv players on vectors/out_ref.264 and vectors/x264.264.

## Usage

TBD

## SVC

Minih264 supports both spatial and temporal layers. Spatial layers are encoded almost like two independent AVC streams, except for intra-frame prediction.
The following diagram shows the minih264 SVC scheme for two spatial layers:

![SVC diagram](images/svc.png?raw=true)

This scheme is used because spatial prediction for P frames is almost useless in practice, while for intra frames it cuts the full-resolution frame size by roughly 20%.
Note that the decoder must have both the base-layer I frame _and_ the full-resolution SVC I frame to decode the whole sequence of subsequent P frames at full resolution.

## Limitations

The following major features are not supported compared to x264 (baseline):

 * Trellis quantization.
 * Prediction-mode selection using the Sum of Absolute Transform Differences (SATD).
 * 4x4 motion compensation.

## Interesting links

 * https://www.videolan.org/developers/x264.html
 * https://www.openh264.org/
 * https://github.com/cisco/openh264
 * http://iphome.hhi.de/suehring/tml/
 * https://github.com/oneam/h264bsd
 * https://github.com/fhunleth/hollowcore-h264
 * https://github.com/digetx/h264_decoder
 * https://github.com/lspbeyond/p264decoder
 * https://github.com/jcasal-homer/HomerHEVC
 * https://github.com/ultravideo/kvazaar
 * https://github.com/neocoretechs/h264j
 * https://github.com/jcodec/jcodec


================================================
FILE: asm/minih264e_asm.h
================================================
#define H264E_API(type, name, args) type name args; \
type name##_sse2 args;  \
type name##_arm11 args; \
type name##_neon args;
// h264e_qpel
H264E_API(void, h264e_qpel_interpolate_chroma, (const uint8_t *src,int src_stride, uint8_t *h264e_restrict dst,point_t wh, point_t dxdy))
H264E_API(void, h264e_qpel_interpolate_luma, (const uint8_t *src,int src_stride, uint8_t *h264e_restrict dst,point_t wh, point_t dxdy))
H264E_API(void, h264e_qpel_average_wh_align, (const uint8_t *p0, const uint8_t *p1, uint8_t *h264e_restrict d, point_t wh))
// h264e_deblock
H264E_API(void, h264e_deblock_chroma, (uint8_t *pSrcDst, int32_t srcdstStep, const deblock_params_t *par))
H264E_API(void, h264e_deblock_luma, (uint8_t *pSrcDst, int32_t srcdstStep, const deblock_params_t *par))
// h264e_intra
H264E_API(void, h264e_intra_predict_chroma,  (pix_t *predict, const pix_t *left, const pix_t *top, int mode))
H264E_API(void, h264e_intra_predict_16x16, (pix_t *predict, const pix_t *left, const pix_t *top, int mode))
H264E_API(int,  h264e_intra_choose_4x4, (const pix_t *blockin, pix_t *blockpred, int avail, const pix_t *edge, int mpred, int penalty))
// h264e_cavlc
H264E_API(void,     h264e_bs_put_bits, (bs_t *bs, unsigned n, unsigned val))
H264E_API(void,     h264e_bs_flush, (bs_t *bs))
H264E_API(unsigned, h264e_bs_get_pos_bits, (const bs_t *bs))
H264E_API(unsigned, h264e_bs_byte_align, (bs_t *bs))
H264E_API(void,     h264e_bs_put_golomb, (bs_t *bs, unsigned val))
H264E_API(void,     h264e_bs_put_sgolomb, (bs_t *bs, int val))
H264E_API(void,     h264e_bs_init_bits, (bs_t *bs, void *data))
H264E_API(void,     h264e_vlc_encode, (bs_t *bs, int16_t *quant, int maxNumCoeff, uint8_t *nz_ctx))
// h264e_sad
H264E_API(int,  h264e_sad_mb_unlaign_8x8, (const pix_t *a, int a_stride, const pix_t *b, int sad[4]))
H264E_API(int,  h264e_sad_mb_unlaign_wh, (const pix_t *a, int a_stride, const pix_t *b, point_t wh))
H264E_API(void, h264e_copy_8x8, (pix_t *d, int d_stride, const pix_t *s))
H264E_API(void, h264e_copy_16x16, (pix_t *d, int d_stride, const pix_t *s, int s_stride))
H264E_API(void, h264e_copy_borders, (unsigned char *pic, int w, int h, int guard))
// h264e_transform
H264E_API(void, h264e_transform_add, (pix_t *out, int out_stride, const pix_t *pred, quant_t *q, int side, int32_t mask))
H264E_API(int,  h264e_transform_sub_quant_dequant, (const pix_t *inp, const pix_t *pred, int inp_stride, int mode, quant_t *q, const uint16_t *qdat))
H264E_API(void, h264e_quant_luma_dc, (quant_t *q, int16_t *deq, const uint16_t *qdat))
H264E_API(int,  h264e_quant_chroma_dc, (quant_t *q, int16_t *deq, const uint16_t *qdat))
// h264e_denoise
H264E_API(void, h264e_denoise_run, (unsigned char *frm, unsigned char *frmprev, int w, int h, int stride_frm, int stride_frmprev))
#undef H264E_API


================================================
FILE: asm/neon/h264e_cavlc_arm11.s
================================================
        .arm
        .text
        .align 2
        .type  h264e_bs_put_sgolomb_arm11, %function
h264e_bs_put_sgolomb_arm11:
        MVN             r2,     #0
        ADD             r1,     r2,     r1,     lsl #1
        EOR             r1,     r1,     r1,     asr #31
        .size  h264e_bs_put_sgolomb_arm11, .-h264e_bs_put_sgolomb_arm11

        .type  h264e_bs_put_golomb_arm11, %function
h264e_bs_put_golomb_arm11:
        ADD             r2,     r1,     #1
        CLZ             r1,     r2
        MOV             r3,     #63
        SUB             r1,     r3,     r1,     lsl #1
        .size  h264e_bs_put_golomb_arm11, .-h264e_bs_put_golomb_arm11

        .type  h264e_bs_put_bits_arm11, %function
h264e_bs_put_bits_arm11:
        LDMIA           r0,     {r3,    r12}
        SUBS            r3,     r3,     r1
        BMI             local_cavlc_1_0
        ORR             r12,    r12,    r2,     lsl r3
        STMIA           r0,     {r3,    r12}
        BX              lr
local_cavlc_1_0:
        RSB             r1,     r3,     #0
        ORR             r12,    r12,    r2,     lsr r1
        LDR             r1,     [r0,    #8]
        REV             r12,    r12
        ADD             r3,     r3,     #32
        STR             r12,    [r1],   #4
        MOV             r12,    r2,     lsl r3
        STMIA           r0,     {r3,    r12}
        STR             r1,     [r0,    #8]
        BX              lr
        .size  h264e_bs_put_bits_arm11, .-h264e_bs_put_bits_arm11

        .type  h264e_bs_flush_arm11, %function
h264e_bs_flush_arm11:
        LDMIB           r0,     {r0,    r1}
        REV             r0,     r0
        STR             r0,     [r1]
        BX              lr
        .size  h264e_bs_flush_arm11, .-h264e_bs_flush_arm11

        .type  h264e_bs_get_pos_bits_arm11, %function
h264e_bs_get_pos_bits_arm11:
        LDMIA           r0,     {r0-r3}
        SUB             r2,     r2,     r3
        RSB             r0,     r0,     #0x20
        ADD             r0,     r0,     r2,     lsl #3
        BX              lr
        .size  h264e_bs_get_pos_bits_arm11, .-h264e_bs_get_pos_bits_arm11

        .type  h264e_bs_byte_align_arm11, %function
h264e_bs_byte_align_arm11:
        PUSH            {r0,    lr}
        BL              h264e_bs_get_pos_bits_arm11
        RSB             r1,     r0,     #0
        AND             r1,     r1,     #7
        ADD             r3,     r0,     r1
        MOV             r2,     #0
        LDR             r0,     [sp]
        STR             r3,     [sp]
        BL              h264e_bs_put_bits_arm11
        POP             {r0,    pc}
        .size  h264e_bs_byte_align_arm11, .-h264e_bs_byte_align_arm11

        .type  h264e_bs_init_bits_arm11, %function
h264e_bs_init_bits_arm11:
        MOV             r12,    r1
        MOV             r3,     r1
        MOV             r2,     #0
        MOV             r1,     #32
        STMIA           r0,     {r1-r3, r12}
        BX              lr
        .size  h264e_bs_init_bits_arm11, .-h264e_bs_init_bits_arm11

        .type  h264e_vlc_encode_arm11, %function
h264e_vlc_encode_arm11:
        PUSH            {r4-r11,        lr}
        CMP             r2,     #4
        MOVNE           r4,     #0x10
        MOVEQ           r4,     #4
        LDMIA           r0,     {r10-r12}
        SUB             sp,     sp,     #0x10
        MOV             r8,     #0
        ADD             r4,     r1,     r4,     lsl #1
        MOV             r9,     r8
        MOV             r5,     sp
        MOV             r1,     r4
        MOV             lr,     r2
local_cavlc_1_1:
        LDRSH           r7,     [r4,    #-2]!
        MOVS            r7,     r7,     lsl #1
        STRNEH          r7,     [r1,    #-2]!
        STRNEB          lr,     [r5],   #1
        SUBS            lr,     lr,     #1
        BNE             local_cavlc_1_1
        ADD             r4,     r4,     r2,     lsl #1
        SUB             r5,     r4,     r1
        MOVS            r5,     r5,     asr #1
        BEQ             no_nz1
        CMP             r5,     #3
        MOVLE           r6,     r5
        MOVGT           r6,     #3
        SUB             r1,     r4,     #2
local_cavlc_1_2:
        LDRSH           r4,     [r1,    #0]
        ADD             r7,     r4,     #2
        CMP             r7,     #4
        BHI             no_nz1
        MOV             r7,     r9,     lsl #1
        SUBS            r6,     r6,     #1
        ORR             r9,     r7,     r4,     lsr #31
        SUB             r1,     r1,     #2
        ADD             r8,     r8,     #1
        BNE             local_cavlc_1_2
no_nz1:
        LDRB            r4,     [r3,    #-1]
        LDRB            r7,     [r3,    #1]
        STRB            r5,     [r3,    #0]
        SUB             r6,     r5,     r8
        ADD             r3,     r4,     r7
        CMP             r3,     #0x22
        ADDLE           r3,     r3,     #1
        LDR             r4,     =h264e_g_coeff_token
        MOVLE           r3,     r3,     asr #1
        AND             r3,     r3,     #0x1f
        MOV             r7,     #6
        LDRB            r3,     [r4,    r3]
        ADD             lr,     r3,     r8
        ADD             lr,     lr,     r6,     lsl #2
        CMP             r3,     #0xe6
        LDRB            r4,     [r4,    lr]
        ANDNE           r3,     r4,     #0xf
        ADDNE           r7,     r3,     #1
        MOVNE           r4,     r4,     lsr #4
        SUBS            r10,    r10,    r7
        BLMI            bs_flush_sub
        ORR             r11,    r11,    r4,     lsl r10
        CMP             r5,     #0
        BEQ             l1.1272
        CMP             r8,     #0
        BEQ             l1.864
        SUBS            r10,    r10,    r8
        MOV             r4,     r9
        BLMI            bs_flush_sub
        ORR             r11,    r11,    r4,     lsl r10
l1.864:
        CMP             r6,     #0
        BEQ             l1.1120
        LDRSH           r7,     [r1,    #0]
        SUB             lr,     r1,     #2
        MVN             r4,     #2
        SUBS            r1,     r7,     #2
        SUBMI           r1,     r4,     r1
        CMP             r1,     #6
        MOV             r9,     #1
        MOVGE           r9,     #2
        CMP             r8,     #3
        BGE             l1.952
        CMP             r5,     #0xa
        SUB             r1,     r1,     #2
        BLE             l1.952
        MOV             r7,     r1,     asr #1
        CMP             r7,     #0xf
        MOVGE           r7,     #0xf
        MOV             r8,     #1
        MOVGE           r8,     #0xc
        SUB             r1,     r1,     r7,     lsl #1
        RSB             r7,     #2
        B               loop_enter
l1.952:
        CMP             r1,     #0xe
        MOVLT           r7,     r1
        MOVLT           r1,     #0
        MOVLT           r8,     r1
        RSBLT           r7,     #2
        BLT             loop_enter
        CMP             r1,     #0x1e
        MOVGE           r9,     #1
        BGE             escape
        MOV             r7,     #0xe
        MOV             r8,     #4
        SUB             r1,     r1,     #0xe
        RSB             r7,     #2
        B               loop_enter
local_cavlc_1_3:
        SUBS            r1,     r1,     #2
        SUBMI           r1,     r4,     r1
        MOV             r7,     r1,     asr r9
        CMP             r7,     #0xf
        MOV             r8,     r9
escape:
        MOVGE           r7,     #0xf
        MOVGE           r8,     #0xc
        SUB             r1,     r1,     r7,     lsl r9
        RSBS            r7,     #2
        CMPLT           r9,     #6
        ADDLT           r9,     r9,     #1
loop_enter:
        MOV             r3,     #1
        ORR             r1,     r1,     r3,     lsl r8
        RSB             r7,     r7,     #3
        ADD             r7,     r7,     r8
        SUBS            r10,    r10,    r7
        BMI             bs_flush_1
bs_flush_1_return:
        ORR             r11,    r11,    r1,     lsl r10
        SUBS            r6,     r6,     #1
        LDRNESH         r1,     [lr],   #-2
        BNE             local_cavlc_1_3
l1.1120:
        CMP             r5,     r2
        BGE             l1.1272
        LDRB            r8,     [sp,    #0]
        CMP             r2,     #4
        ADD             r6,     sp,     #1
        SUB             r1,     r8,     r5
        SUB             r9,     r5,     #1
        LDRNE           r7,     =h264e_g_total_zeros
        LDREQ           r7,     =h264e_g_total_zeros_cr_2x2
        ADD             r5,     r5,     r6
        MVN             r2,     #0
        MOV             lr,     #0x10
        ADD             r2,     r2,     r1,     lsl #1
        STRB            lr,     [r5,    #-1]
l1.1176:
        LDRB            r5,     [r7,    r9]
        ADD             r7,     r7,     r1
        LDRB            r5,     [r5,    r7]
        AND             r7,     r5,     #0xf
        SUBS            r10,    r10,    r7
        MOV             r4,     r5,     lsr #4
        BLMI            bs_flush_sub
        ORR             r11,    r11,    r4,     lsl r10
        SUBS            r2,     r2,     r1
        BMI             l1.1272
        LDRB            r1,     [r6],   #1
        MOV             r5,     r8
        MOV             r8,     r1
        SUB             r1,     r5,     r1
        SUBS            r1,     r1,     #1
        LDRPL           r7,     =h264e_g_run_before
        MOVPL           r9,     r2
        BPL             l1.1176
l1.1272:
        STMIA           r0,     {r10,   r11,    r12}
        ADD             sp,     sp,     #0x10
        POP             {r4-r11,        pc}
bs_flush_sub:
        RSB             r7,     r10,    #0
        ADD             r10,    r10,    #0x20
        ORR             r11,    r11,    r4,     asr r7
        REV             r11,    r11
        STR             r11,    [r12],  #4
        MOV             r11,    #0
        BX              lr
bs_flush_1:
        RSB             r7,     r10,    #0
        ADD             r10,    r10,    #0x20
        ORR             r11,    r11,    r1,     asr r7
        REV             r11,    r11
        STR             r11,    [r12],  #4
        MOV             r11,    #0
        B               bs_flush_1_return
        .size  h264e_vlc_encode_arm11, .-h264e_vlc_encode_arm11

        .global         h264e_bs_put_bits_arm11
        .global         h264e_bs_flush_arm11
        .global         h264e_bs_get_pos_bits_arm11
        .global         h264e_bs_byte_align_arm11
        .global         h264e_bs_put_golomb_arm11
        .global         h264e_bs_put_sgolomb_arm11
        .global         h264e_bs_init_bits_arm11
        .global         h264e_vlc_encode_arm11


================================================
FILE: asm/neon/h264e_deblock_neon.s
================================================
        .arm
        .text
        .align 2

        .type  deblock_luma_h_s4, %function
deblock_luma_h_s4:
        VPUSH           {q4-q7}
        SUB             r0,     r0,     r1,     lsl #2
        VLD1.8          {q8},   [r0],   r1
        VLD1.8          {q9},   [r0],   r1
        VLD1.8          {q10},  [r0],   r1
        VLD1.8          {q11},  [r0],   r1
        VLD1.8          {q12},  [r0],   r1
        VLD1.8          {q13},  [r0],   r1
        VLD1.8          {q14},  [r0],   r1
        VLD1.8          {q15},  [r0],   r1
        VDUP.8          q3,     r2
        VABD.U8         q0,     q11,    q12
        VCLT.U8         q2,     q0,     q3
        VDUP.8          q3,     r3
        VABD.U8         q1,     q11,    q10
        VCLT.U8         q1,     q1,     q3
        VAND            q2,     q2,     q1
        VABD.U8         q1,     q12,    q13
        VCLT.U8         q1,     q1,     q3
        VAND            q2,     q2,     q1
        MOV             r12,    r2,     lsr #2
        ADD             r12,    r12,    #2
        VDUP.8          q4,     r12
        VCLT.U8         q1,     q0,     q4
        VAND            q1,     q1,     q2
        VABD.U8         q0,     q9,     q11
        VCLT.U8         q0,     q0,     q3
        VAND            q0,     q0,     q1
        VABD.U8         q7,     q14,    q12
        VCLT.U8         q3,     q7,     q3
        VAND            q3,     q3,     q1
        VHADD.U8                q4,     q9,     q10
        VHADD.U8                q5,     q11,    q12
        VRHADD.U8               q6,     q9,     q10
        VRHADD.U8               q7,     q11,    q12
        VSUB.I8         q6,     q6,     q4
        VSUB.I8         q7,     q7,     q5
        VHADD.U8                q6,     q6,     q7
        VRHADD.U8               q7,     q4,     q8
        VHADD.U8                q4,     q4,     q8
        VSUB.I8         q7,     q7,     q4
        VADD.I8         q6,     q6,     q7
        VRHADD.U8               q7,     q5,     q9
        VHADD.U8                q5,     q5,     q9
        VSUB.I8         q7,     q7,     q5
        VHADD.U8                q6,     q6,     q7
        VRHADD.U8               q7,     q4,     q5
        VHADD.U8                q4,     q4,     q5
        VSUB.I8         q7,     q7,     q4
        VRHADD.U8               q6,     q6,     q7
        VADD.I8         q4,     q4,     q6
        VMOV            q6,     q9
        VBIT            q6,     q4,     q0
        VPUSH           {q6}
        VHADD.U8                q4,     q14,    q13
        VHADD.U8                q5,     q12,    q11
        VRHADD.U8               q6,     q14,    q13
        VRHADD.U8               q7,     q12,    q11
        VSUB.I8         q6,     q6,     q4
        VSUB.I8         q7,     q7,     q5
        VHADD.U8                q6,     q6,     q7
        VRHADD.U8               q7,     q4,     q15
        VHADD.U8                q4,     q4,     q15
        VSUB.I8         q7,     q7,     q4
        VADD.I8         q6,     q6,     q7
        VRHADD.U8               q7,     q5,     q14
        VHADD.U8                q5,     q5,     q14
        VSUB.I8         q7,     q7,     q5
        VHADD.U8                q6,     q6,     q7
        VRHADD.U8               q7,     q4,     q5
        VHADD.U8                q4,     q4,     q5
        VSUB.I8         q7,     q7,     q4
        VRHADD.U8               q6,     q6,     q7
        VADD.I8         q4,     q4,     q6
        VMOV            q6,     q14
        VBIT            q6,     q4,     q3
        VPUSH           {q6}
        VHADD.U8                q1,     q9,     q13
        VRHADD.U8               q4,     q1,     q10
        VRHADD.U8               q5,     q11,    q12
        VHADD.U8                q6,     q1,     q10
        VHADD.U8                q7,     q11,    q12
        VHADD.U8                q4,     q4,     q5
        VRHADD.U8               q6,     q6,     q7
        VRHADD.U8               q1,     q4,     q6
        VRHADD.U8               q4,     q9,     q10
        VRHADD.U8               q5,     q11,    q12
        VHADD.U8                q6,     q9,     q10
        VHADD.U8                q7,     q11,    q12
        VHADD.U8                q4,     q4,     q5
        VRHADD.U8               q6,     q6,     q7
        VRHADD.U8               q4,     q4,     q6
        VHADD.U8                q5,     q11,    q13
        VRHADD.U8               q5,     q5,     q10
        VBIF            q1,     q5,     q0
        VBSL            q0,     q4,     q10
        VHADD.U8                q7,     q14,    q10
        VRHADD.U8               q4,     q7,     q13
        VRHADD.U8               q5,     q11,    q12
        VHADD.U8                q6,     q7,     q13
        VHADD.U8                q7,     q11,    q12
        VHADD.U8                q4,     q4,     q5
        VRHADD.U8               q6,     q6,     q7
        VRHADD.U8               q4,     q4,     q6
        VRHADD.U8               q6,     q14,    q13
        VRHADD.U8               q5,     q11,    q12
        VHADD.U8                q5,     q6,     q5
        VHADD.U8                q6,     q14,    q13
        VHADD.U8                q7,     q11,    q12
        VRHADD.U8               q6,     q6,     q7
        VRHADD.U8               q5,     q5,     q6
        VHADD.U8                q6,     q12,    q10
        VRHADD.U8               q6,     q6,     q13
        VBIF            q4,     q6,     q3
        VBSL            q3,     q5,     q13
        VPOP            {q14}
        VPOP            {q9}
        VBIT            q10,    q0,     q2
        VBIT            q11,    q1,     q2
        VBIT            q12,    q4,     q2
        VBIT            q13,    q3,     q2
        SUB             r0,     r0,     r1,     lsl #3
        VST1.8          {q8},   [r0],   r1
        VST1.8          {q9},   [r0],   r1
        VST1.8          {q10},  [r0],   r1
        VST1.8          {q11},  [r0],   r1
        VST1.8          {q12},  [r0],   r1
        VST1.8          {q13},  [r0],   r1
        VST1.8          {q14},  [r0],   r1
        VST1.8          {q15},  [r0],   r1
        VPOP            {q4-q7}
        BX              lr
        .size  deblock_luma_h_s4, .-deblock_luma_h_s4

        .type  deblock_luma_v_s4, %function
deblock_luma_v_s4:
        VPUSH           {q4-q7}
        SUB             r0,     r0,     #4
        VLD1.8          {d16},  [r0],   r1
        VLD1.8          {d18},  [r0],   r1
        VLD1.8          {d20},  [r0],   r1
        VLD1.8          {d22},  [r0],   r1
        VLD1.8          {d24},  [r0],   r1
        VLD1.8          {d26},  [r0],   r1
        VLD1.8          {d28},  [r0],   r1
        VLD1.8          {d30},  [r0],   r1
        VLD1.8          {d17},  [r0],   r1
        VLD1.8          {d19},  [r0],   r1
        VLD1.8          {d21},  [r0],   r1
        VLD1.8          {d23},  [r0],   r1
        VLD1.8          {d25},  [r0],   r1
        VLD1.8          {d27},  [r0],   r1
        VLD1.8          {d29},  [r0],   r1
        VLD1.8          {d31},  [r0],   r1
        VTRN.32         q8,     q12
        VTRN.32         q9,     q13
        VTRN.32         q10,    q14
        VTRN.32         q11,    q15
        VTRN.16         q8,     q10
        VTRN.16         q9,     q11
        VTRN.16         q12,    q14
        VTRN.16         q13,    q15
        VTRN.8          q8,     q9
        VTRN.8          q10,    q11
        VTRN.8          q12,    q13
        VTRN.8          q14,    q15
        VDUP.8          q3,     r2
        VABD.U8         q0,     q11,    q12
        VCLT.U8         q2,     q0,     q3
        VDUP.8          q3,     r3
        VABD.U8         q1,     q11,    q10
        VCLT.U8         q1,     q1,     q3
        VAND            q2,     q2,     q1
        VABD.U8         q1,     q12,    q13
        VCLT.U8         q1,     q1,     q3
        VAND            q2,     q2,     q1
        MOV             r12,    r2,     lsr #2
        ADD             r12,    r12,    #2
        VDUP.8          q4,     r12
        VCLT.U8         q1,     q0,     q4
        VAND            q1,     q1,     q2
        VABD.U8         q0,     q9,     q11
        VCLT.U8         q0,     q0,     q3
        VAND            q0,     q0,     q1
        VABD.U8         q7,     q14,    q12
        VCLT.U8         q3,     q7,     q3
        VAND            q3,     q3,     q1
        VHADD.U8                q4,     q9,     q10
        VHADD.U8                q5,     q11,    q12
        VRHADD.U8               q6,     q9,     q10
        VRHADD.U8               q7,     q11,    q12
        VSUB.I8         q6,     q6,     q4
        VSUB.I8         q7,     q7,     q5
        VHADD.U8                q6,     q6,     q7
        VRHADD.U8               q7,     q4,     q8
        VHADD.U8                q4,     q4,     q8
        VSUB.I8         q7,     q7,     q4
        VADD.I8         q6,     q6,     q7
        VRHADD.U8               q7,     q5,     q9
        VHADD.U8                q5,     q5,     q9
        VSUB.I8         q7,     q7,     q5
        VHADD.U8                q6,     q6,     q7
        VRHADD.U8               q7,     q4,     q5
        VHADD.U8                q4,     q4,     q5
        VSUB.I8         q7,     q7,     q4
        VRHADD.U8               q6,     q6,     q7
        VADD.I8         q4,     q4,     q6
        VMOV            q6,     q9
        VBIT            q6,     q4,     q0
        VPUSH           {q6}
        VHADD.U8                q4,     q14,    q13
        VHADD.U8                q5,     q12,    q11
        VRHADD.U8               q6,     q14,    q13
        VRHADD.U8               q7,     q12,    q11
        VSUB.I8         q6,     q6,     q4
        VSUB.I8         q7,     q7,     q5
        VHADD.U8                q6,     q6,     q7
        VRHADD.U8               q7,     q4,     q15
        VHADD.U8                q4,     q4,     q15
        VSUB.I8         q7,     q7,     q4
        VADD.I8         q6,     q6,     q7
        VRHADD.U8               q7,     q5,     q14
        VHADD.U8                q5,     q5,     q14
        VSUB.I8         q7,     q7,     q5
        VHADD.U8                q6,     q6,     q7
        VRHADD.U8               q7,     q4,     q5
        VHADD.U8                q4,     q4,     q5
        VSUB.I8         q7,     q7,     q4
        VRHADD.U8               q6,     q6,     q7
        VADD.I8         q4,     q4,     q6
        VMOV            q6,     q14
        VBIT            q6,     q4,     q3
        VPUSH           {q6}
        VHADD.U8                q1,     q9,     q13
        VRHADD.U8               q4,     q1,     q10
        VRHADD.U8               q5,     q11,    q12
        VHADD.U8                q6,     q1,     q10
        VHADD.U8                q7,     q11,    q12
        VHADD.U8                q4,     q4,     q5
        VRHADD.U8               q6,     q6,     q7
        VRHADD.U8               q1,     q4,     q6
        VRHADD.U8               q4,     q9,     q10
        VRHADD.U8               q5,     q11,    q12
        VHADD.U8                q6,     q9,     q10
        VHADD.U8                q7,     q11,    q12
        VHADD.U8                q4,     q4,     q5
        VRHADD.U8               q6,     q6,     q7
        VRHADD.U8               q4,     q4,     q6
        VHADD.U8                q5,     q11,    q13
        VRHADD.U8               q5,     q5,     q10
        VBIF            q1,     q5,     q0
        VBSL            q0,     q4,     q10
        VHADD.U8                q7,     q14,    q10
        VRHADD.U8               q4,     q7,     q13
        VRHADD.U8               q5,     q11,    q12
        VHADD.U8                q6,     q7,     q13
        VHADD.U8                q7,     q11,    q12
        VHADD.U8                q4,     q4,     q5
        VRHADD.U8               q6,     q6,     q7
        VRHADD.U8               q4,     q4,     q6
        VRHADD.U8               q6,     q14,    q13
        VRHADD.U8               q5,     q11,    q12
        VHADD.U8                q5,     q6,     q5
        VHADD.U8                q6,     q14,    q13
        VHADD.U8                q7,     q11,    q12
        VRHADD.U8               q6,     q6,     q7
        VRHADD.U8               q5,     q5,     q6
        VHADD.U8                q6,     q12,    q10
        VRHADD.U8               q6,     q6,     q13
        VBIF            q4,     q6,     q3
        VBSL            q3,     q5,     q13
        VPOP            {q14}
        VPOP            {q9}
        VBIT            q10,    q0,     q2
        VBIT            q11,    q1,     q2
        VBIT            q12,    q4,     q2
        VBIT            q13,    q3,     q2
        VTRN.8          q8,     q9
        VTRN.8          q10,    q11
        VTRN.8          q12,    q13
        VTRN.8          q14,    q15
        VTRN.16         q8,     q10
        VTRN.16         q9,     q11
        VTRN.16         q12,    q14
        VTRN.16         q13,    q15
        VTRN.32         q8,     q12
        VTRN.32         q9,     q13
        VTRN.32         q10,    q14
        VTRN.32         q11,    q15
        SUB             r0,     r0,     r1,     lsl #4
        VST1.8          {d16},  [r0],   r1
        VST1.8          {d18},  [r0],   r1
        VST1.8          {d20},  [r0],   r1
        VST1.8          {d22},  [r0],   r1
        VST1.8          {d24},  [r0],   r1
        VST1.8          {d26},  [r0],   r1
        VST1.8          {d28},  [r0],   r1
        VST1.8          {d30},  [r0],   r1
        VST1.8          {d17},  [r0],   r1
        VST1.8          {d19},  [r0],   r1
        VST1.8          {d21},  [r0],   r1
        VST1.8          {d23},  [r0],   r1
        VST1.8          {d25},  [r0],   r1
        VST1.8          {d27},  [r0],   r1
        VST1.8          {d29},  [r0],   r1
        VST1.8          {d31},  [r0],   r1
        VPOP            {q4-q7}
        BX              lr
        .size  deblock_luma_v_s4, .-deblock_luma_v_s4
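The VHADD/VRHADD cascades in the two _s4 routines above compute the H.264 strong-filter averages while staying within 8-bit lanes. For reference, a scalar C sketch of the standard bS == 4 luma filter they correspond to (illustrative names, not part of this repository):

```c
#include <assert.h>

/* Standard H.264 strong (bS == 4) luma filter for one side of an edge,
 * written with plain widened arithmetic. The NEON code reaches the same
 * results via halving/rounding adds instead of 16-bit intermediates. */
static void strong_filter_side(int p3, int p2, int p1, int p0,
                               int q0, int q1, int inner,
                               int *p0_out, int *p1_out, int *p2_out)
{
    if (inner) { /* |p2 - p0| < beta and |p0 - q0| < (alpha >> 2) + 2 */
        *p0_out = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3;
        *p1_out = (p2 + p1 + p0 + q0 + 2) >> 2;
        *p2_out = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3;
    } else {     /* fallback: only p0 is replaced */
        *p0_out = (2*p1 + p0 + q1 + 2) >> 2;
        *p1_out = p1;
        *p2_out = p2;
    }
}
```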

@ Normal-strength deblocking of one vertical luma edge (transposed
@ load/store). r0 = pointer to the edge column, r1 = stride, r2 = alpha,
@ r3 = beta. Stack args: [sp+0] = pointer to 4 tc0 clip values,
@ [sp+4] = pointer to 4 boundary-strength bytes; each byte is expanded
@ to 4 lanes via the g_unzip2 table and VTBL.
        .type  deblock_luma_v, %function
deblock_luma_v:
        VPUSH           {q4-q7}
        SUB             r0,     r0,     #4
        VLD1.8          {d16},  [r0],   r1
        VLD1.8          {d18},  [r0],   r1
        VLD1.8          {d20},  [r0],   r1
        VLD1.8          {d22},  [r0],   r1
        VLD1.8          {d24},  [r0],   r1
        VLD1.8          {d26},  [r0],   r1
        VLD1.8          {d28},  [r0],   r1
        VLD1.8          {d30},  [r0],   r1
        VLD1.8          {d17},  [r0],   r1
        VLD1.8          {d19},  [r0],   r1
        VLD1.8          {d21},  [r0],   r1
        VLD1.8          {d23},  [r0],   r1
        VLD1.8          {d25},  [r0],   r1
        VLD1.8          {d27},  [r0],   r1
        VLD1.8          {d29},  [r0],   r1
        VLD1.8          {d31},  [r0],   r1
        VTRN.32         q8,     q12
        VTRN.32         q9,     q13
        VTRN.32         q10,    q14
        VTRN.32         q11,    q15
        VTRN.16         q8,     q10
        VTRN.16         q9,     q11
        VTRN.16         q12,    q14
        VTRN.16         q13,    q15
        VTRN.8          q8,     q9
        VTRN.8          q10,    q11
        VTRN.8          q12,    q13
        VTRN.8          q14,    q15
        ADR             r12,    g_unzip2
        VDUP.8          q3,     r2
        VABD.U8         q1,     q11,    q12
        VLD1.8          {q4},   [r12]
        VCLT.U8         q2,     q1,     q3
        VDUP.8          q3,     r3
        LDR             r12,    [sp,    #4+16*4]
        VABD.U8         q1,     q11,    q10
        VABD.U8         q5,     q12,    q13
        VMAX.U8         q1,     q1,     q5
        LDR             r12,    [r12]
        VCLT.U8         q1,     q1,     q3
        VAND            q2,     q2,     q1
        VMOV.32         d2[0],  r12
        VTBL.8          d3,     {d2},   d9
        VTBL.8          d2,     {d2},   d8
        VCGT.S8         q1,     q1,     #0
        VAND            q2,     q2,     q1
        VMOV.I8         q6,     #1
        LDR             r12,    [sp,    #0+16*4]
        VHSUB.U8                q7,     q10,    q13
        VSHR.S8         q7,     q7,     #1
        VEOR            q0,     q12,    q11
        VAND            q6,     q6,     q0
        VHSUB.U8                q0,     q12,    q11
        LDR             r12,    [r12]
        VRHADD.S8               q7,     q7,     q6
        VQADD.S8                q7,     q0,     q7
        VAND            q7,     q7,     q2
        VMOV.32         d2[0],  r12
        VTBL.8          d3,     {d2},   d9
        VTBL.8          d2,     {d2},   d8
        VAND            q1,     q1,     q2
        VABD.U8         q0,     q9,     q11
        VCLT.U8         q0,     q0,     q3
        VAND            q4,     q0,     q2
        VABD.U8         q0,     q14,    q12
        VCLT.U8         q0,     q0,     q3
        VAND            q3,     q0,     q2
        VRHADD.U8               q0,     q11,    q12
        VHADD.U8                q0,     q0,     q9
        VAND            q5,     q1,     q4
        VQADD.U8                q6,     q10,    q5
        VMIN.U8         q0,     q0,     q6
        VQSUB.U8                q6,     q10,    q5
        VMAX.U8         q10,    q0,     q6
        VRHADD.U8               q0,     q11,    q12
        VHADD.U8                q0,     q0,     q14
        VAND            q5,     q1,     q3
        VQADD.U8                q6,     q13,    q5
        VMIN.U8         q0,     q0,     q6
        VQSUB.U8                q6,     q13,    q5
        VMAX.U8         q13,    q0,     q6
        VSUB.I8         q1,     q1,     q3
        VSUB.I8         q1,     q1,     q4
        VAND            q1,     q1,     q2
        VEOR            q6,     q6,     q6
        VMAX.S8         q5,     q6,     q7
        VSUB.S8         q7,     q6,     q7
        VMAX.S8         q6,     q6,     q7
        VMIN.U8         q5,     q1,     q5
        VMIN.U8         q6,     q1,     q6
        VQADD.U8                q11,    q11,    q5
        VQSUB.U8                q11,    q11,    q6
        VQSUB.U8                q12,    q12,    q5
        VQADD.U8                q12,    q12,    q6
        VTRN.8          q8,     q9
        VTRN.8          q10,    q11
        VTRN.8          q12,    q13
        VTRN.8          q14,    q15
        VTRN.16         q8,     q10
        VTRN.16         q9,     q11
        VTRN.16         q12,    q14
        VTRN.16         q13,    q15
        VTRN.32         q8,     q12
        VTRN.32         q9,     q13
        VTRN.32         q10,    q14
        VTRN.32         q11,    q15
        SUB             r0,     r0,     r1,     lsl #4
        VST1.8          {d16},  [r0],   r1
        VST1.8          {d18},  [r0],   r1
        VST1.8          {d20},  [r0],   r1
        VST1.8          {d22},  [r0],   r1
        VST1.8          {d24},  [r0],   r1
        VST1.8          {d26},  [r0],   r1
        VST1.8          {d28},  [r0],   r1
        VST1.8          {d30},  [r0],   r1
        VST1.8          {d17},  [r0],   r1
        VST1.8          {d19},  [r0],   r1
        VST1.8          {d21},  [r0],   r1
        VST1.8          {d23},  [r0],   r1
        VST1.8          {d25},  [r0],   r1
        VST1.8          {d27},  [r0],   r1
        VST1.8          {d29},  [r0],   r1
        VST1.8          {d31},  [r0],   r1
        VPOP            {q4-q7}
        BX              lr
g_unzip2:
        .quad           0x0101010100000000
        .quad           0x0303030302020202
        .size  deblock_luma_v, .-deblock_luma_v

@ Normal-strength deblocking of one horizontal luma edge, 16 pixels wide.
@ r0 = pointer to the first row below the edge, r1 = stride, r2 = alpha,
@ r3 = beta; stack args as in deblock_luma_v. Rows p2..q2 (q9..q14) are
@ read and the filtered p1,p0,q0,q1 rows written back.
        .type  deblock_luma_h, %function
deblock_luma_h:
        VPUSH           {q4-q7}
        SUB             r0,     r0,     r1
        SUB             r0,     r0,     r1,     lsl #1
        VLD1.8          {q9 },  [r0],   r1
        VLD1.8          {q10},  [r0],   r1
        VLD1.8          {q11},  [r0],   r1
        VLD1.8          {q12},  [r0],   r1
        VLD1.8          {q13},  [r0],   r1
        VLD1.8          {q14},  [r0]
        ADR             r12,    g_unzip2
        VDUP.8          q3,     r2
        VABD.U8         q1,     q11,    q12
        VLD1.8          {q4},   [r12]
        VCLT.U8         q2,     q1,     q3
        VDUP.8          q3,     r3
        LDR             r12,    [sp,    #4+16*4]
        VABD.U8         q1,     q11,    q10
        VABD.U8         q5,     q12,    q13
        VMAX.U8         q1,     q1,     q5
        LDR             r12,    [r12]
        VCLT.U8         q1,     q1,     q3
        VAND            q2,     q2,     q1
        VMOV.32         d2[0],  r12
        VTBL.8          d3,     {d2},   d9
        VTBL.8          d2,     {d2},   d8
        VCGT.S8         q1,     q1,     #0
        VAND            q2,     q2,     q1
        VMOV.I8         q6,     #1
        LDR             r12,    [sp,    #0+16*4]
        VHSUB.U8                q7,     q10,    q13
        VSHR.S8         q7,     q7,     #1
        VEOR            q0,     q12,    q11
        VAND            q6,     q6,     q0
        VHSUB.U8                q0,     q12,    q11
        LDR             r12,    [r12]
        VRHADD.S8               q7,     q7,     q6
        VQADD.S8                q7,     q0,     q7
        VAND            q7,     q7,     q2
        VMOV.32         d2[0],  r12
        VTBL.8          d3,     {d2},   d9
        VTBL.8          d2,     {d2},   d8
        VAND            q1,     q1,     q2
        VABD.U8         q0,     q9,     q11
        VCLT.U8         q0,     q0,     q3
        VAND            q4,     q0,     q2
        VABD.U8         q0,     q14,    q12
        VCLT.U8         q0,     q0,     q3
        VAND            q3,     q0,     q2
        VRHADD.U8               q0,     q11,    q12
        VHADD.U8                q0,     q0,     q9
        VAND            q5,     q1,     q4
        VQADD.U8                q6,     q10,    q5
        VMIN.U8         q0,     q0,     q6
        VQSUB.U8                q6,     q10,    q5
        VMAX.U8         q10,    q0,     q6
        VRHADD.U8               q0,     q11,    q12
        VHADD.U8                q0,     q0,     q14
        VAND            q5,     q1,     q3
        VQADD.U8                q6,     q13,    q5
        VMIN.U8         q0,     q0,     q6
        VQSUB.U8                q6,     q13,    q5
        VMAX.U8         q13,    q0,     q6
        VSUB.I8         q1,     q1,     q3
        VSUB.I8         q1,     q1,     q4
        VAND            q1,     q1,     q2
        VEOR            q6,     q6,     q6
        VMAX.S8         q5,     q6,     q7
        VSUB.S8         q7,     q6,     q7
        VMAX.S8         q6,     q6,     q7
        VMIN.U8         q5,     q1,     q5
        VMIN.U8         q6,     q1,     q6
        VQADD.U8                q11,    q11,    q5
        VQSUB.U8                q11,    q11,    q6
        VQSUB.U8                q12,    q12,    q5
        VQADD.U8                q12,    q12,    q6
        SUB             r0,     r0,     r1,     lsl #2
        VST1.8          {q10},  [r0],   r1
        VST1.8          {q11},  [r0],   r1
        VST1.8          {q12},  [r0],   r1
        VST1.8          {q13},  [r0],   r1
        VPOP            {q4-q7}
        BX              lr
        .size  deblock_luma_h, .-deblock_luma_h

@ Deblocking of one vertical chroma edge: 8 rows are loaded and
@ transposed so q8..q11 hold the p1,p0,q0,q1 pixel lines. r0 = edge
@ pointer, r1 = stride, r2 = alpha, r3 = beta; stack args carry the
@ per-edge tc0 and strength pointers (strength 4 selects the strong path).
        .type  deblock_chroma_v, %function
deblock_chroma_v:
        VPUSH           {q4-q7}
        SUB             r0,     r0,     #2
        VLD1.8          {d16},  [r0],   r1
        VLD1.8          {d18},  [r0],   r1
        VLD1.8          {d20},  [r0],   r1
        VLD1.8          {d22},  [r0],   r1
        VLD1.8          {d17},  [r0],   r1
        VLD1.8          {d19},  [r0],   r1
        VLD1.8          {d21},  [r0],   r1
        VLD1.8          {d23},  [r0],   r1
        VTRN.32         d16,    d17
        VTRN.32         d18,    d19
        VTRN.32         d20,    d21
        VTRN.32         d22,    d23
        VTRN.16         q8,     q10
        VTRN.16         q9,     q11
        VTRN.8          q8,     q9
        VTRN.8          q10,    q11
        LDR             r12,    [sp,    #4+16*4]
        VDUP.8          q3,     r2
        VABD.U8         q1,     q10,    q9
        VCLT.U8         q2,     q1,     q3
        VDUP.8          q3,     r3
        VABD.U8         q1,     q8,     q9
        VABD.U8         q4,     q10,    q11
        VMAX.U8         q4,     q1,     q4
        VLD1.8          {d2 },  [r12]
        VCLT.U8         q4,     q4,     q3
        VAND            q2,     q2,     q4
        LDR             r12,    [sp,    #0+16*4]
        VMOV            d0,     d2
        VZIP.8          q1,     q0
        VLD1.8          {d0 },  [r12]
        VCGT.S8         q3,     q1,     #0
        VSHR.U8         q1,     q1,     #2
        VCGT.S8         q1,     q1,     #0
        VAND            q2,     q2,     q3
        VMOV            d8,     d0
        VMOV.I8         q6,     #1
        VZIP.8          q0,     q4
        VADD.I8         q0,     q0,     q6
        VAND            q0,     q0,     q2
        VHSUB.U8                q7,     q8,     q11
        VSHR.S8         q7,     q7,     #1
        VEOR            q4,     q10,    q9
        VAND            q6,     q6,     q4
        VHSUB.U8                q4,     q10,    q9
        VRHADD.S8               q7,     q7,     q6
        VQADD.S8                q7,     q4,     q7
        VEOR            q4,     q4,     q4
        VMAX.S8         q5,     q4,     q7
        VSUB.S8         q7,     q4,     q7
        VMAX.S8         q4,     q4,     q7
        VMIN.U8         q5,     q0,     q5
        VMIN.U8         q4,     q0,     q4
        VQADD.U8                q0,     q9,     q5
        VQSUB.U8                q0,     q0,     q4
        VQSUB.U8                q3,     q10,    q5
        VQADD.U8                q3,     q3,     q4
        VHADD.U8                q6,     q9,     q11
        VRHADD.U8               q6,     q6,     q8
        VHADD.U8                q7,     q8,     q10
        VRHADD.U8               q7,     q7,     q11
        VBIT            q0,     q6,     q1
        VBIT            q3,     q7,     q1
        VBIT            q9,     q0,     q2
        VBIT            q10,    q3,     q2
        VTRN.8          q8,     q9
        VTRN.8          q10,    q11
        VTRN.16         q8,     q10
        VTRN.16         q9,     q11
        VTRN.32         d16,    d17
        VTRN.32         d18,    d19
        VTRN.32         d20,    d21
        VTRN.32         d22,    d23
        SUB             r0,     r0,     r1,     lsl #3
        VMOV.32         r12,    d16[0]
        STR             r12,    [r0],   r1
        VMOV.32         r12,    d18[0]
        STR             r12,    [r0],   r1
        VMOV.32         r12,    d20[0]
        STR             r12,    [r0],   r1
        VMOV.32         r12,    d22[0]
        STR             r12,    [r0],   r1
        VMOV.32         r12,    d17[0]
        STR             r12,    [r0],   r1
        VMOV.32         r12,    d19[0]
        STR             r12,    [r0],   r1
        VMOV.32         r12,    d21[0]
        STR             r12,    [r0],   r1
        VMOV.32         r12,    d23[0]
        STR             r12,    [r0],   r1
        VPOP            {q4-q7}
        BX              lr
        .size  deblock_chroma_v, .-deblock_chroma_v

@ Deblocking of one horizontal chroma edge. r0 = pointer to the first
@ row below the edge; rows p1,p0,q0,q1 are held in q8..q11. Same
@ threshold arguments and stack args as deblock_chroma_v.
        .type  deblock_chroma_h, %function
deblock_chroma_h:
        VPUSH           {q4-q7}
        SUB             r0,     r0,     r1,     lsl #1
        VLD1.8          {q8 },  [r0],   r1
        VLD1.8          {q9 },  [r0],   r1
        VLD1.8          {q10},  [r0],   r1
        VLD1.8          {q11},  [r0]
        LDR             r12,    [sp,    #4+16*4]
        VDUP.8          q3,     r2
        VABD.U8         q1,     q10,    q9
        VCLT.U8         q2,     q1,     q3
        VDUP.8          q3,     r3
        VABD.U8         q1,     q8,     q9
        VABD.U8         q4,     q10,    q11
        VMAX.U8         q4,     q1,     q4
        VLD1.8          {d2 },  [r12]
        VCLT.U8         q4,     q4,     q3
        VAND            q2,     q2,     q4
        LDR             r12,    [sp,    #0+16*4]
        VMOV            d0,     d2
        VZIP.8          q1,     q0
        VLD1.8          {d0 },  [r12]
        VCGT.S8         q3,     q1,     #0
        VSHR.U8         q1,     q1,     #2
        VCGT.S8         q1,     q1,     #0
        VAND            q2,     q2,     q3
        VMOV            d8,     d0
        VMOV.I8         q6,     #1
        VZIP.8          q0,     q4
        VADD.I8         q0,     q0,     q6
        VAND            q0,     q0,     q2
        VHSUB.U8                q7,     q8,     q11
        VSHR.S8         q7,     q7,     #1
        VEOR            q4,     q10,    q9
        VAND            q6,     q6,     q4
        VHSUB.U8                q4,     q10,    q9
        VRHADD.S8               q7,     q7,     q6
        VQADD.S8                q7,     q4,     q7
        VEOR            q4,     q4,     q4
        VMAX.S8         q5,     q4,     q7
        VSUB.S8         q7,     q4,     q7
        VMAX.S8         q4,     q4,     q7
        VMIN.U8         q5,     q0,     q5
        VMIN.U8         q4,     q0,     q4
        VQADD.U8                q0,     q9,     q5
        VQSUB.U8                q0,     q0,     q4
        VQSUB.U8                q3,     q10,    q5
        VQADD.U8                q3,     q3,     q4
        VHADD.U8                q6,     q9,     q11
        VRHADD.U8               q6,     q6,     q8
        VHADD.U8                q7,     q8,     q10
        VRHADD.U8               q7,     q7,     q11
        VBIT            q0,     q6,     q1
        VBIT            q3,     q7,     q1
        VBIT            q9,     q0,     q2
        VBIT            q10,    q3,     q2
        SUB             r0,     r0,     r1,     lsl #1
        VST1.8          {d18 }, [r0],   r1
        VST1.8          {d20},  [r0],   r1
        VPOP            {q4-q7}
        BX              lr
        .size  deblock_chroma_h, .-deblock_chroma_h
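The two chroma routines above blend a normal path (the same delta clip as luma, with tc widened by one) and a strong path selected per lane where the boundary strength is 4. A scalar C sketch of the standard H.264 chroma filter they implement (illustrative names, not from this repository):

```c
#include <assert.h>

static int clip(int x, int lo, int hi) { return x < lo ? lo : (x > hi ? hi : x); }

/* H.264 chroma edge filter for one pixel line p1 p0 | q0 q1.
 * Normal mode: luma-style delta with tc = tc0 + 1, p1/q1 untouched.
 * Strong mode (bS == 4): p0/q0 replaced by fixed averages. */
static void chroma_filter(int p1, int *p0, int *q0, int q1,
                          int tc0, int strong)
{
    if (strong) {
        int np0 = (2*p1 + *p0 + q1 + 2) >> 2;
        int nq0 = (2*q1 + *q0 + p1 + 2) >> 2;
        *p0 = np0;
        *q0 = nq0;
    } else {
        int tc = tc0 + 1;
        int delta = clip(((*q0 - *p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
        *p0 = clip(*p0 + delta, 0, 255);
        *q0 = clip(*q0 - delta, 0, 255);
    }
}
```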

@ h264e_deblock_chroma_neon(pix r0, stride r1, deblock params r2):
@ filters the left and middle vertical chroma edges, then the top and
@ middle horizontal edges. Alpha/beta are read from params+0x40/0x44
@ (one byte per edge), tc0 values from params+0x20 and boundary
@ strengths from params; zero-strength edges are skipped.
        .type  h264e_deblock_chroma_neon, %function
h264e_deblock_chroma_neon:
        PUSH            {r2-r10,        lr}
        MOV             r8,     r0
        LDRB            r0,     [r2,    #0x40]
        MOV             r9,     r1
        LDRB            r1,     [r2,    #0x44]
        ADD             r5,     r2,     #0x40
        ADD             r6,     r2,     #0x44
        ADD             r10,    r2,     #0x20
        MOV             r7,     r2
        MOV             r4,     #0
l1.2056:
        LDR             r2,     [r7,    r4]
        CMP             r2,     #0
        CMPNE           r0,     #0
        BEQ             l1.2108
        ADD             r3,     r7,     r4
        ADD             r2,     r10,    r4
        ADD             r12,    r8,     r4,     asr #1
        STRD            r2,     r3,     [sp,    #0]
        MOV             r3,     r1
        MOV             r2,     r0
        MOV             r1,     r9
        MOV             r0,     r12
        BL              deblock_chroma_v
l1.2108:
        LDRB            r0,     [r5,    #1]
        ADD             r4,     r4,     #8
        LDRB            r1,     [r6,    #1]
        CMP             r4,     #0x10
        BLT             l1.2056
        LDRB            r0,     [r5,    #2]
        LDRB            r1,     [r6,    #2]
        ADD             r10,    r10,    #0x10
        ADD             r7,     r7,     #0x10
        MOV             r4,     #0
l1.2148:
        LDR             r2,     [r7,    r4]
        CMP             r2,     #0
        CMPNE           r0,     #0
        BEQ             l1.2196
        ADD             r3,     r7,     r4
        ADD             r2,     r10,    r4
        STRD            r2,     r3,     [sp,    #0]
        MOV             r3,     r1
        MOV             r2,     r0
        MOV             r1,     r9
        MOV             r0,     r8
        BL              deblock_chroma_h
l1.2196:
        LDRB            r0,     [r5,    #3]
        ADD             r4,     r4,     #8
        LDRB            r1,     [r6,    #3]
        CMP             r4,     #0x10
        ADD             r8,     r8,     r9,     lsl #2
        BLT             l1.2148
        POP             {r2-r10,        pc}
        .size  h264e_deblock_chroma_neon, .-h264e_deblock_chroma_neon

@ h264e_deblock_luma_neon(pix r0, stride r1, deblock params r2): filters
@ the four vertical then four horizontal edges of a luma macroblock,
@ fetching thresholds as in the chroma driver above. A boundary-strength
@ byte of 4 dispatches to the strong-filter variants
@ (deblock_luma_v_s4 / deblock_luma_h_s4); zero-strength edges are skipped.
        .type  h264e_deblock_luma_neon, %function
h264e_deblock_luma_neon:
        PUSH            {r2-r10,        lr}
        MOV             r7,     r0
        LDRB            r0,     [r2,    #0x40]
        MOV             r9,     r1
        LDRB            r1,     [r2,    #0x44]
        ADD             r5,     r2,     #0x40
        ADD             r6,     r2,     #0x44
        ADD             r10,    r2,     #0x20
        MOV             r8,     r2
        MOV             r4,     #0
l1.2264:
        LDR             r2,     [r8,    r4]
        AND             r3,     r2,     #0xff
        CMP             r3,     #4
        BEQ             l1.2456
        CMP             r2,     #0
        CMPNE           r0,     #0
        BEQ             l1.2328
        ADD             r3,     r8,     r4
        ADD             r2,     r10,    r4
        ADD             r12,    r7,     r4
        STRD            r2,     r3,     [sp,    #0]
        MOV             r3,     r1
        MOV             r2,     r0
        MOV             r1,     r9
        MOV             r0,     r12
        BL              deblock_luma_v
l1.2328:
        LDRB            r0,     [r5,    #1]
        ADD             r4,     r4,     #4
        LDRB            r1,     [r6,    #1]
        CMP             r4,     #0x10
        BLT             l1.2264
        LDRB            r0,     [r5,    #2]
        LDRB            r1,     [r6,    #2]
        ADD             r10,    r10,    #0x10
        ADD             r8,     r8,     #0x10
        MOV             r4,     #0
l1.2368:
        LDR             r2,     [r8,    r4]
        AND             r3,     r2,     #0xff
        CMP             r3,     #4
        BEQ             l1.2484
        CMP             r2,     #0
        CMPNE           r0,     #0
        BEQ             l1.2428
        ADD             r3,     r8,     r4
        ADD             r2,     r10,    r4
        STRD            r2,     r3,     [sp,    #0]
        MOV             r3,     r1
        MOV             r2,     r0
        MOV             r1,     r9
        MOV             r0,     r7
        BL              deblock_luma_h
l1.2428:
        LDRB            r0,     [r5,    #3]
        ADD             r4,     r4,     #4
        LDRB            r1,     [r6,    #3]
        CMP             r4,     #0x10
        ADD             r7,     r7,     r9,     lsl #2
        BLT             l1.2368
        POP             {r2-r10,        pc}
l1.2456:
        ADD             r12,    r7,     r4
        MOV             r3,     r1
        MOV             r2,     r0
        MOV             r1,     r9
        MOV             r0,     r12
        BL              deblock_luma_v_s4
        B               l1.2328
l1.2484:
        MOV             r3,     r1
        MOV             r2,     r0
        MOV             r1,     r9
        MOV             r0,     r7
        BL              deblock_luma_h_s4
        B               l1.2428
        .size  h264e_deblock_luma_neon, .-h264e_deblock_luma_neon

        .global         deblock_luma_h_s4
        .global         h264e_deblock_chroma_neon
        .global         h264e_deblock_luma_neon


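For reference, the normal-strength luma filter that deblock_luma_v and deblock_luma_h vectorize, as a scalar C sketch (illustrative names, not part of this repository):

```c
#include <assert.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int x) { return x < lo ? lo : (x > hi ? hi : x); }

/* One pixel line across the edge: p2 p1 p0 | q0 q1 q2.
 * tc0 is the table clip threshold; the ap/aq activity tests gate the
 * p1/q1 updates and widen the p0/q0 clip range, as in the NEON code. */
static void luma_normal_filter(int p2, int *p1, int *p0,
                               int *q0, int *q1, int q2,
                               int tc0, int beta)
{
    int ap = abs(p2 - *p0) < beta;  /* p side is flat enough: also fix p1 */
    int aq = abs(q2 - *q0) < beta;  /* q side is flat enough: also fix q1 */
    int tc = tc0 + ap + aq;
    int delta = clip3(-tc, tc, ((*q0 - *p0) * 4 + (*p1 - *q1) + 4) >> 3);
    int np0 = clip3(0, 255, *p0 + delta);
    int nq0 = clip3(0, 255, *q0 - delta);
    if (ap) *p1 += clip3(-tc0, tc0, (p2 + ((*p0 + *q0 + 1) >> 1) - 2 * *p1) >> 1);
    if (aq) *q1 += clip3(-tc0, tc0, (q2 + ((*p0 + *q0 + 1) >> 1) - 2 * *q1) >> 1);
    *p0 = np0;
    *q0 = nq0;
}
```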
================================================
FILE: asm/neon/h264e_denoise_neon.s
================================================
        .arm
        .text
        .align 2

@ Word-copy core of my_memcpy (below). my_memcpy branches into this code
@ with "sub pc, r12, r3, lsl #5", i.e. at an offset that depends on the
@ source alignment, so the instruction layout here is load-bearing --
@ do not reorder or pad.
__rt_memcpy_w:
        subs            r2,     r2,     #0x10-4
local_denoise_1_3:
        ldmcsia         r1!,    {r3,    r12}
        stmcsia         r0!,    {r3,    r12}
        ldmcsia         r1!,    {r3,    r12}
        stmcsia         r0!,    {r3,    r12}
        subcss          r2,     r2,     #0x10
        bcs             local_denoise_1_3
        movs            r12,    r2,     lsl #29
        ldmcsia         r1!,    {r3,    r12}
        stmcsia         r0!,    {r3,    r12}
        ldrmi           r3,     [r1],   #4
        strmi           r3,     [r0],   #4
        moveq           pc,     lr
        sub             r1,     r1,     #3
_memcpy_lastbytes_skip3:
        add             r1,     r1,     #1
_memcpy_lastbytes_skip2:
        add             r1,     r1,     #1
_memcpy_lastbytes_skip1:
        add             r1,     r1,     #1

_memcpy_lastbytes:
        movs            r2,     r2,     lsl #31
        ldrmib          r2,     [r1],   #1
        ldrcsb          r3,     [r1],   #1
        ldrcsb          r12,    [r1],   #1
        strmib          r2,     [r0],   #1
        strcsb          r3,     [r0],   #1
        strcsb          r12,    [r0],   #1
        bx              lr
@ my_memcpy(dst r0, src r1, n r2): local hand-rolled memcpy; copies the
@ leading bytes to align the destination, then dispatches into
@ __rt_memcpy_w above based on the remaining source misalignment.
my_memcpy:
        cmp             r2,     #3
        bls             _memcpy_lastbytes
        rsb             r12,    r0,     #0
        movs            r12,    r12,    lsl #31
        ldrcsb          r3,     [r1],   #1
        ldrcsb          r12,    [r1],   #1
        strcsb          r3,     [r0],   #1
        strcsb          r12,    [r0],   #1
        ldrmib          r3,     [r1],   #1
        subcs           r2,     r2,     #2
        submi           r2,     r2,     #1
        strmib          r3,     [r0],   #1
_memcpy_dest_aligned:
        subs            r2,     r2,     #4
        bcc             _memcpy_lastbytes
        adr             r12,    __rt_memcpy_w
        and             r3,     r1,     #3
        sub             pc,     r12,    r3,     lsl #5

@ Temporal denoise over the interior (w-2)x(h-2) region of the frame.
@ r0/r1 = the two frame pointers, r2/r3 = width/height (assumed from
@ register use); the two strides are taken from the stack. Returns
@ immediately unless both dimensions exceed 2.
        .global h264e_denoise_run_neon
        .type  h264e_denoise_run_neon, %function
h264e_denoise_run_neon:
        CMP             r2,     #2
        CMPGT           r3,     #2
        BXLE            lr
        PUSH            {r0-r11,        lr}
        SUB             sp,     sp,     #0xc
        SUB             r1,     r2,     #2
        SUB             r0,     r3,     #2
        STR             r0,     [sp,    #0+4+4]
        LDR             r4,     [sp,    #0+4+4+4+4+4+4+4+4*9+4]
        LDR             r5,     [sp,    #0+4+4+4+4+4+4+4+4*9]
        STR             r1,     [sp,    #0+4+4+4+4+4]
local_denoise_2_0:
        LDR             r0,     [sp,    #0+4+4+4]
        LDR             r1,     [sp,    #0+4+4+4+4]
        ADD             r0,     r0,     r5
        ADD             r1,     r1,     r4
        STR             r0,     [sp,    #0+4+4+4]
        STR             r1,     [sp,    #0+4+4+4+4]
        LDRB            r3,     [r0],   #1
        SUB             r12,    r1,     r4
        STRB            r3,     [r12,   #0]
        ADD             r1,     r1,     #1
        LDR             r12,    [sp,    #0+4+4+4+4+4]
        MOVS            r12,    r12,    lsr #3
        BEQ             local_denoise_10_0
local_denoise_1_4:
        VLD1.U8         {d16},  [r0]
        VLD1.U8         {d17},  [r1]
        SUB             lr,     r0,     #1
        VLD1.U8         {d18},  [lr]
        SUB             lr,     r1,     #1
        VLD1.U8         {d19},  [lr]
        SUB             lr,     r0,     r5
        VLD1.U8         {d20},  [lr]
        SUB             lr,     r1,     r4
        VLD1.U8         {d21},  [lr]
        ADD             lr,     r0,     #1
        VLD1.U8         {d22},  [lr]
        ADD             lr,     r1,     #1
        VLD1.U8         {d23},  [lr]
        ADD             lr,     r0,     r5
        VLD1.U8         {d24},  [lr]
        ADD             lr,     r1,     r4
        VLD1.U8         {d25},  [lr]
        VABDL.U8                q0,     d16,    d17
        VADDL.U8                q1,     d18,    d20
        VADDW.U8                q1,     q1,     d22
        VADDW.U8                q1,     q1,     d24
        VADDL.U8                q2,     d19,    d21
        VADDW.U8                q2,     q2,     d23
        VADDW.U8                q2,     q2,     d25
        VABD.U16                q1,     q1,     q2
        VSHR.U16                q1,     q1,     #2
        VMOV.I16                q2,     #1
        VADD.S16                q0,     q0,     q2
        VADD.S16                q1,     q1,     q2
        VQSHL.S16               q0,     q0,     #7
        VCLS.S16                q2,     q0
        VSHL.S16                q0,     q0,     q2
        VQDMULH.S16             q0,     q0,     q0
        VCLS.S16                q15,    q0
        VSHL.S16                q0,     q0,     q15
        VADD.S16                q2,     q2,     q2
        VADD.S16                q2,     q2,     q15
        VQDMULH.S16             q0,     q0,     q0
        VCLS.S16                q15,    q0
        VSHL.S16                q0,     q0,     q15
        VADD.S16                q2,     q2,     q2
        VADD.S16                q2,     q2,     q15
        VQDMULH.S16             q0,     q0,     q0
        VCLS.S16                q15,    q0
        VSHL.S16                q0,     q0,     q15
        VADD.S16                q2,     q2,     q2
        VADD.S16                q2,     q2,     q15
        VQDMULH.S16             q0,     q0,     q0
        VCLS.S16                q15,    q0
        VADD.S16                q2,     q2,     q2
        VADD.S16                q2,     q2,     q15
        VMOV.I16                q15,    #127
        VSUB.S16                q2,     q15,    q2
        VQSHL.S16               q1,     q1,     #7
        VCLS.S16                q3,     q1
        VSHL.S16                q1,     q1,     q3
        VQDMULH.S16             q1,     q1,     q1
        VCLS.S16                q15,    q1
        VSHL.S16                q1,     q1,     q15
        VADD.S16                q3,     q3,     q3
        VADD.S16                q3,     q3,     q15
        VQDMULH.S16             q1,     q1,     q1
        VCLS.S16                q15,    q1
        VSHL.S16                q1,     q1,     q15
        VADD.S16                q3,     q3,     q3
        VADD.S16                q3,     q3,     q15
        VQDMULH.S16             q1,     q1,     q1
        VCLS.S16                q15,    q1
        VSHL.S16                q1,     q1,     q15
        VADD.S16                q3,     q3,     q3
        VADD.S16                q3,     q3,     q15
        VQDMULH.S16             q1,     q1,     q1
        VCLS.S16                q15,    q1
        VADD.S16                q3,     q3,     q3
        VADD.S16                q3,     q3,     q15
        VMOV.I16                q15,    #127
        VSUB.S16                q3,     q15,    q3
        VQSHL.U16               q3,     q3,     #10
        VSHR.U16                q3,     q3,     #8
        VMOV.I16                q15,    #255
        VSUB.S16                q2,     q15,    q2
        VSUB.S16                q3,     q15,    q3
        VMUL.U16                q2,     q2,     q3
        VMOVL.U8                q0,     d17
        VMULL.U16               q10,    d0,     d4
        VMULL.U16               q11,    d1,     d5
        VMOV.I8         q15,    #255
        VSUB.S16                q2,     q15,    q2
        VMOVL.U8                q0,     d16
        VMLAL.U16               q10,    d0,     d4
        VMLAL.U16               q11,    d1,     d5
        VRSHRN.I32              d0,     q10,    #16
        VRSHRN.I32              d1,     q11,    #16
        VMOVN.I16               d0,     q0
        SUB             r3,     r1,     r4
        VST1.U8         {d0},   [r3]
        ADD             r0,     r0,     #8
        ADD             r1,     r1,     #8
        SUBS            r12,    r12,    #1
        BNE             local_denoise_1_4
local_denoise_10_0:
        LDR             r12,    [sp,    #0+4+4+4+4+4]
        ANDS            r12,    r12,    #7
        BNE             tail
tail_ret:
        LDRB            r0,     [r0,    #0]
        SUB             r1,     r1,     r4
        STRB            r0,     [r1,    #0]
        LDR             r0,     [sp,    #0+4+4]
        SUBS            r0,     r0,     #1
        STR             r0,     [sp,    #0+4+4]
        BNE             local_denoise_2_0
        LDR             r0,     [sp,    #0+4+4+4]
        LDR             r2,     [sp,    #0+4+4+4+4+4]
        ADD             r1,     r0,     r5
        LDR             r0,     [sp,    #0+4+4+4+4]
        ADD             r2,     r2,     #2
        ADD             r0,     r0,     r4
        BL              my_memcpy
        LDR             r11,    [sp,    #0+4+4+4+4+4+4]
        SUB             r11,    r11,    #2
local_denoise_1_5:
        LDR             r0,     [sp,    #0+4+4+4+4]
        SUB             r7,     r0,     r4
        LDR             r0,     [sp,    #0+4+4+4+4+4]
        MOV             r1,     r7
        ADD             r2,     r0,     #2
        LDR             r0,     [sp,    #0+4+4+4+4]
        BL              my_memcpy
        STR             r7,     [sp,    #0+4+4+4+4]
        SUBS            r11,    r11,    #1
        BNE             local_denoise_1_5
        LDR             r0,     [sp,    #0+4+4+4+4+4+4]
        RSB             r1,     r0,     #2
        LDR             r0,     [sp,    #0+4+4+4]
        MLA             r1,     r5,     r1,     r0
        LDR             r0,     [sp,    #0+4+4+4+4+4]
        ADD             r2,     r0,     #2
        LDR             r0,     [sp,    #0+4+4+4+4]
        ADD             sp,     sp,     #0x1c
        POP             {r4-r11,        lr}
        B               my_memcpy
tail:
local_denoise_1_6:
        LDRB            r3,     [r0,    #-1]
        LDRB            r9,     [r1,    #-1]
        LDRB            r6,     [r0,    #1]
        LDRB            r10,    [r1,    #1]
        SUB             r3,     r3,     r9
        SUB             r9,     r0,     r5
        SUB             r6,     r6,     r10
        ADD             r3,     r3,     r6
        SUB             r6,     r1,     r4
        LDRB            r9,     [r9,    #0]
        LDRB            r10,    [r6,    #0]
        LDRB            r7,     [r0,    #0]
        LDRB            r8,     [r1,    #0]
        LDRB            r11,    [r0,    r5]
        LDRB            lr,     [r1,    r4]
        SUB             r9,     r9,     r10
        SUBS            r2,     r7,     r8
        RSBLT           r2,     r2,     #0
        ADD             r3,     r3,     r9
        SUB             r9,     r11,    lr
        ADDS            r3,     r3,     r9
        RSBLT           r3,     r3,     #0
        MOV             r10,    r3,     asr #2
        LDR             r3,     =g_diff_to_gainQ8
        LDRB            r9,     [r3,    r2]
        LDRB            r2,     [r3,    r10]
        ADD             r0,     r0,     #1
        ADD             r1,     r1,     #1
        MOV             r2,     r2,     lsl #2
        CMP             r2,     #0xff
        MOVHI           r2,     #0xff
        RSB             r3,     r2,     #0xff
        RSB             r2,     r9,     #0xff
        MUL             r2,     r3,     r2
        RSB             r3,     r2,     #0x00010000
        SUB             r3,     r3,     #1
        MUL             r3,     r7,     r3
        MLA             r3,     r8,     r2,     r3
        ADD             r3,     r3,     #0x00008000
        MOV             r3,     r3,     lsr     #16
        STRB            r3,     [r6,    #0]
        SUBS            r12,    r12,    #1
        BNE             local_denoise_1_6
        B               tail_ret
        .size  h264e_denoise_run_neon, .-h264e_denoise_run_neon
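As a reading aid (not part of the encoder), the per-pixel blend done by the scalar `tail` loop above can be sketched in C. The function name `denoise_blend` is illustrative; the two gains are the values the assembly looks up in `g_diff_to_gainQ8` (defined in the C sources), abstracted here as parameters so that no table values are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the scalar denoise blend (the `tail` loop above):
 * g1 = gain looked up from the centre-pixel difference,
 * g2 = gain looked up from the 4-neighbour gradient difference
 * (both 0..255). The result moves the current pixel x toward the
 * previous pixel y with a Q16 weight. */
static uint8_t denoise_blend(uint8_t x, uint8_t y, unsigned g1, unsigned g2)
{
    unsigned t = g2 << 2;                 /* MOV r2, r2, LSL #2          */
    if (t > 255) t = 255;                 /* CMP #0xff / MOVHI clamp     */
    unsigned w = (255 - t) * (255 - g1);  /* weight applied to y (Q16)   */
    unsigned v = 0x10000 - w - 1;         /* weight applied to x         */
    return (uint8_t)((x * v + y * w + 0x8000) >> 16);
}
```

With `g1 == 255` the y-weight collapses to zero and the pixel passes through unchanged, which matches the assembly's behaviour for large pixel differences.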


================================================
FILE: asm/neon/h264e_intra_neon.s
================================================
        .arm
        .text
        .align 2

        .type  intra_predict_dc4_neon, %function
@ DC prediction over a pair of 4-sample edges: r0 = left, r1 = top
@ (a pointer value below 0x20 marks that edge as unavailable). Returns
@ the rounded mean of the available samples replicated into all four
@ bytes of r0, or 0x80808080 when neither edge is available.
intra_predict_dc4_neon:
        MOV             r3,     #0
        VEOR            q1,     q1,     q1
        CMP             r0,     #0x20
        BCC             local_intra_10_0
        VLD1.8          {d0},   [r0]
        ADD             r3,     r3,     #2
        VPADAL.U8               q1,     q0
local_intra_10_0:
        CMP             r1,     #0x20
        BCC             local_intra_10_1
        VLD1.8          {d0},   [r1]
        ADD             r3,     r3,     #2
        VPADAL.U8               q1,     q0
local_intra_10_1:
        VPADDL.U16              q1,     q1
        VMOV.32         r12,    d2[0]
        ADD             r0,     r12,    r3
        CMP             r3,     #4
        MOVEQ           r0,     r0,     lsr #1
        MOV             r0,     r0,     lsr #2
        CMP             r3,     #0
        MOVEQ           r0,     #0x80
        ADD             r0,     r0,     r0,     lsl #16
        ADD             r0,     r0,     r0,     lsl #8
        BX              lr
        .size  intra_predict_dc4_neon, .-intra_predict_dc4_neon
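The same DC rule, sketched in C as a reading aid. The name `predict_dc4` is illustrative, and a null pointer stands in for the assembly's "pointer < 0x20 means unavailable" convention; the rounding and the byte replication match the shifts and multiplies above.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of intra_predict_dc4_neon: average up to 4 left and 4 top
 * samples with rounding; replicate the DC value into all 4 bytes. */
static uint32_t predict_dc4(const uint8_t *left, const uint8_t *top)
{
    unsigned sum = 0, n = 0;
    for (int i = 0; left && i < 4; i++) { sum += left[i]; n++; }
    for (int i = 0; top  && i < 4; i++) { sum += top[i];  n++; }
    unsigned dc = n ? (sum + n / 2) / n : 0x80;  /* 0x80 when no edge */
    return dc * 0x01010101u;  /* same trick as the two ADD ... LSL above */
}
```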

        .type  h264e_intra_predict_16x16_neon, %function
@ 16x16 intra prediction: r0 = dst, r1 = left, r2 = top,
@ r3 = mode (0 = vertical, 1 = horizontal, otherwise DC over the
@ available edges, with pointers below 0x20 meaning "unavailable").
h264e_intra_predict_16x16_neon:
        CMP             r3,     #1
        BEQ             h_pred_16x16
        BLT             v_pred_16x16
        MOV             r3,     #0
        VEOR            q1,     q1,     q1
        CMP             r1,     #0x20
        BCC             local_intra_10_2
        VLD1.8          {q2},   [r1]
        ADD             r3,     r3,     #8
        VPADAL.U8               q1,     q2
local_intra_10_2:
        CMP             r2,     #0x20
        BCC             local_intra_10_3
        VLD1.8          {q0},   [r2]
        ADD             r3,     r3,     #8
        VPADAL.U8               q1,     q0
local_intra_10_3:
        VPADDL.U16              q1,     q1
        VPADDL.U32              q1,     q1
        VADD.I64                d2,     d2,     d3
        VMOV.32         r12,    d2[0]
        ADD             r2,     r12,    r3
        CMP             r3,     #16
        MOVEQ           r2,     r2,     lsr #1
        MOV             r2,     r2,     lsr #4
        CMP             r3,     #0
        MOVEQ           r2,     #0x80
        VDUP.I8         q0,     r2
save_q0:
        VMOV            q1,     q0
        VMOV            q2,     q0
        VMOV            q3,     q0
        VSTMIA          r0!,    {q0-q3}
        VSTMIA          r0!,    {q0-q3}
        VSTMIA          r0!,    {q0-q3}
        VSTMIA          r0!,    {q0-q3}
        BX              lr
v_pred_16x16:
        VLD1.8          {q0},   [r2]
        B               save_q0
h_pred_16x16:
        MOV             r2,     #16
local_intra_1_0:
        LDRB            r3,     [r1],   #1
        VDUP.I8         q0,     r3
        SUBS            r2,     r2,     #1
        VSTMIA          r0!,    {q0}
        BNE             local_intra_1_0
        BX              lr
        .size  h264e_intra_predict_16x16_neon, .-h264e_intra_predict_16x16_neon
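The three branches of the 16x16 predictor can be sketched in C. `predict_16x16` is an illustrative name, null pointers again stand in for the `< 0x20` availability checks, and the DC rounding mirrors the `MOVEQ ... LSR` sequence above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of h264e_intra_predict_16x16_neon:
 * mode 0 = vertical (copy the top row down every row),
 * mode 1 = horizontal (spread each left sample across its row),
 * any other mode = DC over the available edges. */
static void predict_16x16(uint8_t dst[16 * 16], const uint8_t *left,
                          const uint8_t *top, int mode)
{
    if (mode == 0) {                       /* v_pred_16x16 */
        for (int y = 0; y < 16; y++) memcpy(dst + 16 * y, top, 16);
    } else if (mode == 1) {                /* h_pred_16x16 */
        for (int y = 0; y < 16; y++) memset(dst + 16 * y, left[y], 16);
    } else {                               /* DC branch     */
        unsigned sum = 0, n = 0;
        for (int i = 0; left && i < 16; i++) { sum += left[i]; n++; }
        for (int i = 0; top  && i < 16; i++) { sum += top[i];  n++; }
        memset(dst, (int)(n ? (sum + n / 2) / n : 0x80), 16 * 16);
    }
}
```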

        .type  h264e_intra_predict_chroma_neon, %function
@ 8x8 chroma prediction for both planes: r0 = dst, r1 = left samples,
@ r2 = top samples, r3 selects the branch (< 1 copies the top row down,
@ == 1 spreads the left samples, > 1 computes DC per 4x4 quadrant via
@ intra_predict_dc4_neon).
h264e_intra_predict_chroma_neon:
        PUSH            {r4-r8, lr}
        MOV             r6,     r2
        CMP             r3,     #1
        LDMLT           r6,     {r2,    r3,     r12,    lr}
        MOV             r4,     r0
        MOVGT           r7,     #2
        MOV             r5,     r1
        MOV             r0,     #8
        MOVGT           r8,     r7
        BEQ             h_pred_chroma
        BGT             dc_pred_chroma
v_pred_chroma:
        SUBS            r0,     r0,     #1
        STMIA           r4!,    {r2,    r3,     r12,    lr}
        BNE             v_pred_chroma
        POP             {r4-r8, pc}
h_pred_chroma:
        LDRB            r12,    [r5,    #8]
        LDRB            r2,     [r5],   #1
        SUBS            r0,     r0,     #1
        ADD             r12,    r12,    r12,    lsl #16
        ADD             r2,     r2,     r2,     lsl #16
        ADD             r12,    r12,    r12,    lsl #8
        ADD             r2,     r2,     r2,     lsl #8
        MOV             lr,     r12
        MOV             r3,     r2
        STMIA           r4!,    {r2,    r3,     r12,    lr}
        BNE             h_pred_chroma
        POP             {r4-r8, pc}
dc_pred_chroma:
        MOV             r1,     r6
        MOV             r0,     r5
        BL              intra_predict_dc4_neon
        STR             r0,     [r4,    #0x40]
        STR             r0,     [r4,    #4]
        STR             r0,     [r4,    #0]
        ADD             r1,     r6,     #4
        ADD             r0,     r5,     #4
        BL              intra_predict_dc4_neon
        CMP             r6,     #0x20
        STR             r0,     [r4,    #0x44]
        BCC             local_intra_10_4
        ADD             r1,     r6,     #4
        MOV             r0,     #0
        BL              intra_predict_dc4_neon
        STR             r0,     [r4,    #4]
local_intra_10_4:
        CMP             r5,     #0x20
        BCC             local_intra_11_0
        ADD             r1,     r5,     #4
        MOV             r0,     #0
        BL              intra_predict_dc4_neon
        STR             r0,     [r4,    #0x40]
local_intra_11_0:
        SUBS            r8,     r8,     #1
        ADD             r4,     r4,     #8
        ADD             r5,     r5,     #8
        ADD             r6,     r6,     #8
        BNE             dc_pred_chroma
        LDMDB           r4,     {r0-r3}
        STMIA           r4!,    {r0-r3}
        STMIA           r4!,    {r0-r3}
        STMIA           r4!,    {r0-r3}
        LDMIA           r4!,    {r0-r3}
        STMIA           r4!,    {r0-r3}
        STMIA           r4!,    {r0-r3}
        STMIA           r4!,    {r0-r3}
        POP             {r4-r8, pc}
@ save_best: cost = SAD(candidate q1, source block q15), plus the mode
@ penalty r11 when the candidate mode r1 differs from the predicted mode
@ r10. Keeps the best candidate in q3, its mode on the stack, and its
@ cost in r9.
save_best:
        CMP             r1,     r10
        MOVNE           r0,     r11
        MOVEQ           r0,     #0
        VABD.U8         q2,     q1,     q15
        VPADDL.U8               q2,     q2
        VPADDL.U16              q2,     q2
        VPADDL.U32              q2,     q2
        VADD.I64                d4,     d4,     d5
        VMOV.32         d5[0],  r0
        VADD.U32                d4,     d4,     d5
        VMOV.32         r0,     d4[0]
        CMP             r0,     r9
        BXGE            lr
        VMOV            q3,     q1
        STR             r1,     [sp,    #0+4+4+4]
        MOV             r9,     r0
        BX              lr
        .size  h264e_intra_predict_chroma_neon, .-h264e_intra_predict_chroma_neon

        .type  h264e_intra_choose_4x4_neon, %function
@ 4x4 intra mode search: evaluates DC, vertical, horizontal, and
@ diagonal candidates gated by the availability flags in r2, scores each
@ via save_best, writes the winning 4x4 prediction out with stride 16,
@ and folds the best cost into the return value.
h264e_intra_choose_4x4_neon:
        PUSH            {r0-r11,        lr}
        SUB             sp,     sp,     #5*4
        LDR             r9,     [r0],   #0x10
        LDR             r10,    [r0],   #0x10
        LDR             r11,    [r0],   #0x10
        LDR             r12,    [r0],   #0x10
        VMOV            d30,    r9,     r10
        VMOV            d31,    r11,    r12
        LDR             r10,    [sp,    #0+4+4+4+4+4+4+4+4+4+4*8+4]
        LDR             r11,    [sp,    #0+4+4+4+4+4+4+4+4+4+4*8+4+4]
        MOV             r9,     #0x10000000
        TST             r2,     #1
        MOVNE           r1,     r3
        MOVEQ           r1,     #0
        TST             r2,     #2
        SUBNE           r0,     r3,     #5
        MOVEQ           r0,     #0
        BL              intra_predict_dc4_neon
        VDUP.8          q1,     r0
        MOV             r1,     #2
        BL              save_best
        LDR             r2,     [sp,    #0+4+4+4+4+4+4+4+4]
        SUB             r12,    r2,     #5
        VLD1.8          {q0},   [r12]
        LDR             r0,     [sp,    #0+4+4+4+4+4+4+4]
        VMOV.U8         lr,     d1[4]
        ORR             lr,     lr,     lr,     lsl #8
        ORR             lr,     lr,     lr,     lsl #16
        VMOV.32         d1[1],  lr
        TST             r0,     #1
        BEQ             not_avail_t
        TST             r0,     #8
        BNE             local_intra_10_5
        VDUP.8          d1,     d1[0]
local_intra_10_5:
        VEXT.8          q1,     q0,     q0,     #5
        VMOV            q2,     q1
        VZIP.32         q1,     q2
        VMOV            q2,     q1
        VZIP.32         q1,     q2
        MOV             r1,     #0
        BL              save_best
        VEXT.8          q10,    q0,     q0,     #5
        VEXT.8          q11,    q0,     q0,     #6
        VEXT.8          q12,    q0,     q0,     #7
        VHADD.U8                q1,     q10,    q12
        VRHADD.U8               q1,     q1,     q11
        VEXT.8          q10,    q1,     q1,     #1
        VEXT.8          d3,     d2,     d2,     #2
        VEXT.8          d4,     d2,     d2,     #3
        VZIP.32         d2,     d20
        VZIP.32         d3,     d4
        VMOV            d24,    d2
        MOV             r1,     #3
        BL              save_best
        VEXT.8          q10,    q0,     q0,     #5
        VEXT.8          q11,    q0,     q0,     #6
        VRHADD.U8               q1,     q10,    q11
        VEXT.8          q10,    q1,     q1,     #1
        VZIP.32         q1,     q10
        VZIP.32         q1,     q12
        MOV             r1,     #7
        BL              save_best
        LDR             r0,     [sp,    #0+4+4+4+4+4+4+4]
local_intra_10_6:
not_avail_t:
        TST             r0,     #2
        BEQ             not_avail_l
        VREV32.8                q8,     q0
        VREV32.8                q1,     q0
        VZIP.8          q8,     q1
        VMOV            q1,     q8
        VZIP.8          q1,     q8
        MOV             r1,     #1
        BL              save_best
        VREV32.8                q2,     q0
        VREV32.8                q1,     q0
        VREV32.8                q8,     q0
        VZIP.8          q8,     q1
        VMOV.U16                lr,     d16[3]
        VMOV.16         d4[2],  lr
        VMOV.16         d17[0], lr
        VEXT.8          q9,     q2,     q2,     #14
        VHADD.U8                q10,    q9,     q2
        VZIP.8          q9,     q10
        VEXT.8          q11,    q8,     q8,     #14
        VRHADD.U8               q10,    q9,     q11
        ADD             lr,     lr,     lr,     lsl #16
        VEXT.8          q1,     q10,    q10,    #4
        VEXT.8          q9,     q10,    q10,    #6
        VZIP.32         q1,     q9
        VMOV.32         d3[1],  lr
        MOV             r1,     #8
        BL              save_best
        LDR             r0,     [sp,    #0+4+4+4+4+4+4+4]
not_avail_l:
        AND             r0,     r0,     #7
        CMP             r0,     #7
        BNE             not_avail_diag
        VEXT.8          q10,    q0,     q0,     #1
        VEXT.8          q11,    q0,     q0,     #2
        VHADD.U8                q1,     q0,     q11
        VRHADD.U8               q2,     q1,     q10
        VMOV            q11,    q2
        VEXT.8          d3,     d4,     d4,     #1
        VEXT.8          d5,     d4,     d4,     #2
        VEXT.8          d2,     d4,     d4,     #3
        VZIP.32         d3,     d4
        VZIP.32         d2,     d5
        MOV             r1,     #4
        BL              save_best
        VRHADD.U8               q1,     q0,     q10
        VMOV            q12,    q1
        VMOV            q2,     q11
        VZIP.8          q1,     q2
        VEXT.8          q2,     q1,     q1,     #2
        VZIP.32         q1,     q2
        VREV64.32               q1,     q1
        VSWP            d2,     d3
        VMOV.U16                lr,     d22[2]
        VMOV.16         d2[1],  lr
        MOV             r1,     #6
        BL              save_best
        VEXT.8          q11,    q11,    q11,    #1
        VEXT.8          q1,     q12,    q12,    #4
        VEXT.8          q2,     q11,    q11,    #2
        VZIP.32         q1,     q2
        VMOV.U16                lr,     d22[0]
        VMOV.16         d24[1], lr
        MOV             lr,     lr,     lsl #8
        VMOV.16         d22[0], lr
        VEXT.8          d3,     d24,    d24,    #3
        VEXT.8          d22,    d22,    d22,    #1
        VZIP.32         d3,     d22
        MOV             r1,     #5
        BL              save_best
not_avail_diag:
        LDR             r0,     [sp,    #0+4+4+4]
        MOV             r3,     r9
        LDR             r4,     [sp,    #0+4+4+4+4+4+4]
        VMOV            r5,     r6,     d6
        STR             r5,     [r4]
        STR             r6,     [r4,    #0x10]
        VMOV            r5,     r6,     d7
        STR             r5,     [r4,    #0x20]
        STR             r6,     [r4,    #0x30]
        ADD             sp,     sp,     #4*9
        ADD             r0,     r0,     r3,     lsl #4
        POP             {r4-r11,        pc}
        .size  h264e_intra_choose_4x4_neon, .-h264e_intra_choose_4x4_neon
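The cost rule applied by `save_best` during the 4x4 mode search can be sketched in C. `mode_cost` is an illustrative name; it shows the SAD-plus-penalty scoring visible in the assembly (registers r10 = predicted mode, r11 = penalty), not the NEON reduction itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the scoring in save_best: 16-pixel SAD against the source
 * block, plus a fixed penalty whenever the candidate mode is not the
 * predicted (most probable) mode. The chooser keeps the minimum. */
static unsigned mode_cost(const uint8_t pred[16], const uint8_t src[16],
                          int mode, int predicted_mode, unsigned penalty)
{
    unsigned sad = 0;
    for (int i = 0; i < 16; i++)
        sad += (unsigned)abs((int)pred[i] - (int)src[i]);
    return sad + (mode != predicted_mode ? penalty : 0u);
}
```

Biasing the search toward the predicted mode this way trades a little distortion for the cheaper mode signalling in the bitstream.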

        .global         h264e_intra_predict_16x16_neon
        .global         h264e_intra_predict_chroma_neon
        .global         h264e_intra_choose_4x4_neon


================================================
FILE: asm/neon/h264e_qpel_neon.s
================================================
        .arm
        .text
        .align 2

        .global h264e_qpel_average_wh_align_neon
        .type  h264e_qpel_average_wh_align_neon, %function
@ Rounding average of two prediction buffers, (a + b + 1) >> 1 per pel
@ (VRHADD.U8); r3 encodes the block width and height, selecting the
@ 64-byte or 8-byte-per-iteration loop.
h264e_qpel_average_wh_align_neon:
        MOVS            r3,     r3,     lsr #5
        BCC             local_qpel_20_0
local_qpel_1_0:
        VLDMIA          r0!,    {q0-q3}
        VLDMIA          r1!,    {q8-q11}
        SUBS            r3,     r3,     #4<<11
        VRHADD.U8               q0,     q0,     q8
        VRHADD.U8               q1,     q1,     q9
        VRHADD.U8               q2,     q2,     q10
        VRHADD.U8               q3,     q3,     q11
        VSTMIA          r2!,    {q0-q3}
        BNE             local_qpel_1_0
        BX              lr
local_qpel_20_0:
        MOV             r12,    #16
local_qpel_1_1:
        VLD1.8          {d0},   [r0],   r12
        VLD1.8          {d1},   [r1],   r12
        SUBS            r3,     r3,     #1<<11
        VRHADD.U8               d0,     d0,     d1
        VST1.8          {d0},   [r2],   r12
        BNE             local_qpel_1_1
        BX              lr
copy_w8or4:
        MOVS            r12,    r3,     lsr #4
        MOV             r3,     r3,     asr #16
        BCS             copy_w8
copy_w4:
local_qpel_1_2:
        LDR             r12,    [r0],   r1
        SUBS            r3,     r3,     #1
        STR             r12,    [r2],   #16
        BNE             local_qpel_1_2
        BX              lr
copy_w16or8:
        MOVS            r12,    r3,     lsr #5
        MOV             r3,     r3,     asr #16
        BCC             copy_w8
copy_w16:
        VLD1.8          {q0},   [r0],   r1
        VLD1.8          {q1},   [r0],   r1
        VLD1.8          {q2},   [r0],   r1
        VLD1.8          {q3},   [r0],   r1
        SUBS            r3,     r3,     #4
        VSTMIA          r2!,    {q0-q3}
        BNE             copy_w16
        BX              lr
copy_w8:
        MOV             r12,    #16
local_qpel_1_3:
        VLD1.8          {d0},   [r0],   r1
        VLD1.8          {d1},   [r0],   r1
        SUBS            r3,     r3,     #2
        VST1.8          {d0},   [r2],   r12
        VST1.8          {d1},   [r2],   r12
        BNE             local_qpel_1_3
        BX              lr
        .size  h264e_qpel_average_wh_align_neon, .-h264e_qpel_average_wh_align_neon
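The per-byte operation of the averaging routine is a one-liner in C; `average_pels` is an illustrative name for the rounding-average semantics of `VRHADD.U8`.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the VRHADD.U8 core of h264e_qpel_average_wh_align_neon:
 * rounding average of two pel arrays, (a + b + 1) >> 1 per byte. */
static void average_pels(uint8_t *dst, const uint8_t *a,
                         const uint8_t *b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```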

        .global h264e_qpel_interpolate_chroma_neon
        .type  h264e_qpel_interpolate_chroma_neon, %function
@ Chroma motion interpolation: bilinear blend with eighth-pel weights
@ (8-dx)(8-dy), dx(8-dy), (8-dx)dy, dx*dy (precomputed in d28-d31 from
@ the packed dx/dy stack argument); falls back to a plain copy when the
@ fractional offsets are both zero.
h264e_qpel_interpolate_chroma_neon:
        LDR             r12,    [sp]
        VMOV.I8         d5,     #8
        CMP             r12,    #0
        BEQ             copy_w8or4
        VDUP.8          d0,     r12
        MOV             r12,    r12,    asr #16
        VDUP.8          d1,     r12
        VSUB.I8         d2,     d5,     d0
        VSUB.I8         d3,     d5,     d1
        VMUL.I8         d28,    d2,     d3
        VMUL.I8         d29,    d0,     d3
        VMUL.I8         d30,    d2,     d1
        VMUL.I8         d31,    d0,     d1
        MOVS            r12,    r3,     lsr #4
        MOV             r3,     r3,     asr #16
        BCS             interpolate_chroma_w8
interpolate_chroma_w4:
        VLD1.8          {d0},   [r0],   r1
        VEXT.8          d1,     d0,     d0,     #1
local_qpel_1_4:
        VLD1.8          {d2},   [r0],   r1
        SUBS            r3,     r3,     #1
        VEXT.8          d3,     d2,     d2,     #1
        VMULL.U8                q2,     d0,     d28
        VMLAL.U8                q2,     d1,     d29
        VMLAL.U8                q2,     d2,     d30
        VMLAL.U8                q2,     d3,     d31
        VQRSHRUN.S16            d4,     q2,     #6
        VMOV            r12,    d4[0]
        STR             r12,    [r2],   #16
        VMOV            q0,     q1
        BNE             local_qpel_1_4
        BX              lr
interpolate_chroma_w8:
        VLD1.8          {q0},   [r0],   r1
        MOV             r12,    #16
        VEXT.8          d1,     d0,     d1,     #1
local_qpel_1_5:
        VLD1.8          {q1},   [r0],   r1
        SUBS            r3,     r3,     #1
        VEXT.8          d3,     d2,     d3,     #1
        VMULL.U8                q2,     d0,     d28
        VMLAL.U8                q2,     d1,     d29
        VMLAL.U8                q2,     d2,     d30
        VMLAL.U8                q2,     d3,     d31
        VQRSHRUN.S16            d4,     q2,     #6
        VST1.8          {d4},   [r2],   r12
        VMOV            q0,     q1
        BNE             local_qpel_1_5
        BX              lr
        .size  h264e_qpel_interpolate_chroma_neon, .-h264e_qpel_interpolate_chroma_neon
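The per-pixel arithmetic of the chroma interpolator, sketched in C as a reading aid. `interp_chroma` is an illustrative name; the weights and the rounding shift match the `VMULL/VMLAL` chain and `VQRSHRUN.S16 #6` above (the weighted sum never exceeds 64*255, so no saturation occurs).

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the H.264 chroma interpolation done above: bilinear blend
 * of the four neighbouring pels with eighth-pel offsets dx, dy in 0..7,
 * rounded as in VQRSHRUN.S16 #6. */
static uint8_t interp_chroma(uint8_t p00, uint8_t p01,
                             uint8_t p10, uint8_t p11, int dx, int dy)
{
    int v = (8 - dx) * (8 - dy) * p00 + dx * (8 - dy) * p01
          + (8 - dx) * dy * p10      + dx * dy * p11;
    return (uint8_t)((v + 32) >> 6);
}
```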

        .global h264e_qpel_interpolate_luma_neon
        .type  h264e_qpel_interpolate_luma_neon, %function
@ Luma motion interpolation: H.264 six-tap (1, -5, 20, 20, -5, 1) filter
@ (taps 5 and 20 kept in d0/d1), applied horizontally and/or vertically
@ as selected by the quarter-pel offsets in the stack argument; plain
@ copy when both offsets are zero.
h264e_qpel_interpolate_luma_neon:
        LDR             r12,    [sp]
        VMOV.I8         d0,     #5
        CMP             r12,    #0
        BEQ             copy_w16or8
        PUSH            {r4,    r7,     r10,    r11,    lr}
        MOV             lr,     #16
        MOV             r4,     sp
        SUB             sp,     sp,     #16*16
        MOV             r7,     sp
        BIC             r7,     r7,     #15
        MOV             sp,     r7
        PUSH            {r2,    r4}
        MOV             r11,    #1
        ADD             r10,    r12,    #0x00010000
        ADD             r10,    r10,    r11
        ADD             r12,    r12,    r12,    lsr #14
        MOV             r11,    r11,    lsl r12
        LDR             r12,    =0xbbb0e0ee
        MOV             r7,     r0
        TST             r12,    r11
        BEQ             local_qpel_10_0
        TST             r10,    #0x00040000
        ADDNE           r0,     r0,     r1
        MOVS            r4,     r3,     lsr #5
        MOV             r4,     r3,     asr #16
        VSHL.I8         d1,     d0,     #2
        SUB             r0,     r0,     #2
        BCC             flt_luma_hor_w8
local_qpel_1_6:
        VLD1.8          {q8,    q9},    [r0],   r1
        SUBS            r4,     r4,     #1
        VEXT.8          q11,    q8,     q9,     #1
        VEXT.8          q12,    q8,     q9,     #2
        VEXT.8          q13,    q8,     q9,     #3
        VEXT.8          q14,    q8,     q9,     #4
        VEXT.8          q15,    q8,     q9,     #5
        VADDL.U8                q1,     d16,    d30
        VADDL.U8                q2,     d17,    d31
        VMLSL.U8                q1,     d22,    d0
        VMLSL.U8                q2,     d23,    d0
        VMLAL.U8                q1,     d24,    d1
        VMLAL.U8                q2,     d25,    d1
        VMLAL.U8                q1,     d26,    d1
        VMLAL.U8                q2,     d27,    d1
        VMLSL.U8                q1,     d28,    d0
        VMLSL.U8                q2,     d29,    d0
        VQRSHRUN.S16            d2,     q1,     #5
        VQRSHRUN.S16            d3,     q2,     #5
        VSTMIA          r2!,    {q1}
        BNE             local_qpel_1_6
        B               flt_luma_hor_end
flt_luma_hor_w8:
local_qpel_1_7:
        VLD1.8          {q8},   [r0],   r1
        SUBS            r4,     r4,     #1
        VEXT.8          d22,    d16,    d17,    #1
        VEXT.8          d24,    d16,    d17,    #2
        VEXT.8          d26,    d16,    d17,    #3
        VEXT.8          d28,    d16,    d17,    #4
        VEXT.8          d30,    d16,    d17,    #5
        VADDL.U8                q1,     d16,    d30
        VMLSL.U8                q1,     d22,    d0
        VMLAL.U8                q1,     d24,    d1
        VMLAL.U8                q1,     d26,    d1
        VMLSL.U8                q1,     d28,    d0
        VQRSHRUN.S16            d2,     q1,     #5
        VST1.8          {d2},   [r2],   lr
        BNE             local_qpel_1_7
flt_luma_hor_end:
        SUB             r2,     r3,     asr #12
        MOV             r0,     r7
        ADD             r2,     sp,     #4*2
local_qpel_10_0:
        TST             r11,    r12,    lsr #16
        BEQ             local_qpel_10_1
        MOV             r0,     r7
        TST             r10,    #0x0004
        ADDNE           r0,     r0,     #1
        MOVS            r4,     r3,     lsr #5
        MOV             r4,     r3,     asr #16
        VMOV.I8         d0,     #5
        VSHL.I8         d1,     d0,     #2
        SUB             r0,     r0,     r1,     lsl #1
        BCC             flt_luma_ver_w8
        VLD1.8          {q10},  [r0],   r1
        VLD1.8          {q11},  [r0],   r1
        VLD1.8          {q12},  [r0],   r1
        VLD1.8          {q13},  [r0],   r1
        VLD1.8          {q14},  [r0],   r1
local_qpel_1_8:
        VLD1.8          {q15},  [r0],   r1
        VADDL.U8                q1,     d20,    d30
        VADDL.U8                q2,     d21,    d31
        VMLSL.U8                q1,     d22,    d0
        VMLSL.U8                q2,     d23,    d0
        VMLAL.U8                q1,     d24,    d1
        VMLAL.U8                q2,     d25,    d1
        VMLAL.U8                q1,     d26,    d1
        VMLAL.U8                q2,     d27,    d1
        VMLSL.U8                q1,     d28,    d0
        VMLSL.U8                q2,     d29,    d0
        VQRSHRUN.S16            d2,     q1,     #5
        VQRSHRUN.S16            d3,     q2,     #5
        VSTMIA          r2!,    {q1}
        VMOV            q10,    q11
        VMOV            q11,    q12
        VMOV            q12,    q13
        VMOV            q13,    q14
        VMOV            q14,    q15
        SUBS            r4,     r4,     #1
        BNE             local_qpel_1_8
        B               flt_luma_ver_end
flt_luma_ver_w8:
        VLD1.8          {d20},  [r0],   r1
        VLD1.8          {d22},  [r0],   r1
        VLD1.8          {d24},  [r0],   r1
        VLD1.8          {d26},  [r0],   r1
        VLD1.8          {d28},  [r0],   r1
local_qpel_1_9:
        VLD1.8          {d30},  [r0],   r1
        VADDL.U8                q1,     d20,    d30
        VMLSL.U8                q1,     d22,    d0
        VMLAL.U8                q1,     d24,    d1
        VMLAL.U8                q1,     d26,    d1
        VMLSL.U8                q1,     d28,    d0
        VQRSHRUN.S16            d2,     q1,     #5
        VST1.8          {d2},   [r2],   lr
        VMOV            d20,    d22
        VMOV            d22,    d24
        VMOV            d24,    d26
        VMOV            d26,    d28
        VMOV            d28,    d30
        SUBS            r4,     r4,     #1
        BNE             local_qpel_1_9
flt_luma_ver_end:
        SUB             r2,     r3,     asr #12
        MOV             r0,     r7
        ADD             r2,     sp,     #4*2
local_qpel_10_1:
        LDR             r12,    =0xfafa4e40
        TST             r12,    r11
        BEQ             local_qpel_10_2
        MOV             r0,     r7
        SUB             sp,     sp,     #(8)
        VPUSH           {q4-q7}
        MOVS            r4,     r3,     lsr #5
        MOV             r4,     r3,     asr #16
        VMOV.I8         d0,     #5
        VSHL.I8         d1,     d0,     #2
        SUB             r0,     r0,     #2
        SUB             r0,     r0,     r1,     lsl #1
        ADD             r2,     r2,     r4,     lsl #4
        ADD             r4,     r4,     #5
        BCC             flt_luma_diag_w8
local_qpel_1_10:
        VLD1.8          {q8,    q9},    [r0],   r1
        VMOV            q10,    q8
        VEXT.8          q11,    q8,     q9,     #1
        VEXT.8          q12,    q8,     q9,     #2
        VEXT.8          q13,    q8,     q9,     #3
        VEXT.8          q14,    q8,     q9,     #4
        VEXT.8          q15,    q8,     q9,     #5
        VADDL.U8                q1,     d20,    d30
        VADDL.U8                q2,     d21,    d31
        VMLSL.U8                q1,     d22,    d0
        VMLSL.U8                q2,     d23,    d0
        VMLAL.U8                q1,     d24,    d1
        VMLAL.U8                q2,     d25,    d1
        VMLAL.U8                q1,     d26,    d1
        VMLAL.U8                q2,     d27,    d1
        VMLSL.U8                q1,     d28,    d0
        VMLSL.U8                q2,     d29,    d0
        VPUSH           {q1,    q2}
        SUBS            r4,     r4,     #1
        BNE             local_qpel_1_10
        MOV             r4,     r3,     asr #16
        VPOP            {q4-q9}
        VPOP            {q10-q15}
local_qpel_1_11:
        SUBS            r4,     r4,     #1
        SUB             r2,     r2,     #16
        VADD.S16                q4,     q4,     q14
        VADD.S16                q5,     q5,     q15
        VADD.S16                q2,     q6,     q12
        VADD.S16                q3,     q7,     q13
        VADD.S16                q0,     q8,     q10
        VADD.S16                q1,     q9,     q11
        VSUB.S16                q4,     q4,     q2
        VSUB.S16                q5,     q5,     q3
        VSUB.S16                q2,     q2,     q0
        VSUB.S16                q3,     q3,     q1
        VSHR.S16                q4,     q4,     #2
        VSHR.S16                q5,     q5,     #2
        VSUB.S16                q4,     q4,     q2
        VSUB.S16                q5,     q5,     q3
        VSHR.S16                q4,     q4,     #2
        VSHR.S16                q5,     q5,     #2
        VADD.S16                q4,     q4,     q0
        VADD.S16                q5,     q5,     q1
        VQRSHRUN.S16            d2,     q4,     #6
        VQRSHRUN.S16            d3,     q5,     #6
        VST1.8          {q1},   [r2]
        VMOV            q4,     q6
        VMOV            q5,     q7
        VMOV            q6,     q8
        VMOV            q7,     q9
        VMOV            q8,     q10
        VMOV            q9,     q11
        VMOV            q10,    q12
        VMOV            q11,    q13
        VMOV            q12,    q14
        VMOV            q13,    q15
        VPOPNE          {q14,   q15}
        BNE             local_qpel_1_11
        B               flt_luma_diag_end
flt_luma_diag_w8:
local_qpel_1_12:
        VLD1.8          {q8},   [r0],   r1
        VMOV            d20,    d16
        VEXT.8          d22,    d16,    d17,    #1
        VEXT.8          d24,    d16,    d17,    #2
        VEXT.8          d26,    d16,    d17,    #3
        VEXT.8          d28,    d16,    d17,    #4
        VEXT.8          d30,    d16,    d17,    #5
        VADDL.U8                q1,     d20,    d30
        VMLSL.U8                q1,     d22,    d0
        VMLAL.U8                q1,     d24,    d1
        VMLAL.U8                q1,     d26,    d1
        VMLSL.U8                q1,     d28,    d0
        VPUSH           {q1}
        SUBS            r4,     r4,     #1
        BNE             local_qpel_1_12
        MOV             r4,     r3,     asr #16
        VPOP            {q4}
        VPOP            {q6}
        VPOP            {q8}
        VPOP            {q10}
        VPOP            {q12}
local_qpel_1_13:
        VPOP            {q14}
        SUBS            r4,     r4,     #1
        SUB             r2,     r2,     #16
        VADD.S16                q4,     q4,     q14
        VADD.S16                q2,     q6,     q12
        VADD.S16                q0,     q8,     q10
        VSUB.S16                q4,     q4,     q2
        VSUB.S16                q2,     q2,     q0
        VSHR.S16                q4,     q4,     #2
        VSUB.S16                q4,     q4,     q2
        VSHR.S16                q4,     q4,     #2
        VADD.S16                q4,     q4,     q0
        VQRSHRUN.S16            d2,     q4,     #6
        VST1.8          {d2},   [r2]
        VMOV            q4,     q6
        VMOV            q6,     q8
        VMOV            q8,     q10
        VMOV            q10,    q12
        VMOV            q12,    q14
        BNE             local_qpel_1_13
flt_luma_diag_end:
        VPOP            {q4-q7}
        ADD             sp,     sp,     #(8)
        ADD             r2,     sp,     #4*2
local_qpel_10_2:
        TST             r11,    r12,    lsr #16
        BEQ             local_qpel_10_3
        LDR             r12,    =0xeae0
        TST             r12,    r11
        LDR             r2,     [sp]
        BEQ             local_qpel_20_1
        ADD             r0,     sp,     #4*2
        LDR             r3,     =0x00100010
        MOV             r1,     r2
        BL              h264e_qpel_average_wh_align_neon
        B               local_qpel_10_3
local_qpel_20_1:
        MOV             r0,     r7
        TST             r10,    #0x0004
        ADDNE           r0,     r0,     #1
        TST             r10,    #0x00040000
        ADDNE           r0,     r0,     r1
        LDR             r2,     [sp]
        MOV             r12,    #4
local_qpel_1_14:
        VLD1.8          {q8},   [r0],   r1
        VLD1.8          {q9},   [r0],   r1
        VLD1.8          {q10},  [r0],   r1
        VLD1.8          {q11},  [r0],   r1
        SUBS            r12,    r12,    #1
        VLDMIA          r2,     {q0-q3}
        VRHADD.U8               q0,     q8
        VRHADD.U8               q1,     q9
        VRHADD.U8               q2,     q10
        VRHADD.U8               q3,     q11
        VSTMIA          r2!,    {q0-q3}
        BNE             local_qpel_1_14
local_qpel_10_3:
        LDR             sp,     [sp,    #4]
        POP             {r4,    r7,     r10,    r11,    pc}
        .size  h264e_qpel_interpolate_luma_neon, .-h264e_qpel_interpolate_luma_neon
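
Reference note (not part of the library source): the filter loops above — `VADDL.U8` on the outer taps, `VMLSL.U8` with constant 5, `VMLAL.U8` with constant 20 (`d1 = d0 << 2` on a base of 5), then `VQRSHRUN.S16 ..., #5` — compute the standard H.264 six-tap half-sample luma filter with rounding and unsigned saturation. A minimal scalar sketch of one output sample (the function name is invented for illustration):

```c
#include <stdint.h>

/* Scalar model of one half-pel output sample as the NEON loops above
   compute it: out = clip255((p0 - 5*p1 + 20*p2 + 20*p3 - 5*p4 + p5 + 16) >> 5).
   The +16 and >>5 correspond to the rounding narrowing shift VQRSHRUN #5;
   the clip to 0..255 corresponds to its unsigned saturation. */
static uint8_t halfpel_6tap(const uint8_t p[6])
{
    int v = p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5];
    v = (v + 16) >> 5;          /* rounding shift, matches VQRSHRUN.S16 #5 */
    if (v < 0)   v = 0;         /* unsigned saturation */
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```

The coefficients sum to 32, so a flat input reproduces itself: six samples of 128 yield 128.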


================================================
FILE: asm/neon/h264e_sad_neon.s
================================================
        .arm
        .text
        .align 2

        .type  h264e_sad_mb_unlaign_wh_neon, %function
h264e_sad_mb_unlaign_wh_neon:
        TST             r3,     #0x008
        BNE             local_sad_2_0
        VLDMIA          r2!,    {q8-q15}
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABDL.U8                q0,     d16,    d4
        VABAL.U8                q0,     d17,    d5
        VABAL.U8                q0,     d18,    d6
        VABAL.U8                q0,     d19,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d20,    d4
        VABAL.U8                q0,     d21,    d5
        VABAL.U8                q0,     d22,    d6
        VABAL.U8                q0,     d23,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d24,    d4
        VABAL.U8                q0,     d25,    d5
        VABAL.U8                q0,     d26,    d6
        VABAL.U8                q0,     d27,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d28,    d4
        VABAL.U8                q0,     d29,    d5
        VABAL.U8                q0,     d30,    d6
        VABAL.U8                q0,     d31,    d7
        TST             r3,     #0x00100000
        BEQ             local_sad_1_0
        VLDMIA          r2!,    {q8-q15}
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d16,    d4
        VABAL.U8                q0,     d17,    d5
        VABAL.U8                q0,     d18,    d6
        VABAL.U8                q0,     d19,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d20,    d4
        VABAL.U8                q0,     d21,    d5
        VABAL.U8                q0,     d22,    d6
        VABAL.U8                q0,     d23,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d24,    d4
        VABAL.U8                q0,     d25,    d5
        VABAL.U8                q0,     d26,    d6
        VABAL.U8                q0,     d27,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d28,    d4
        VABAL.U8                q0,     d29,    d5
        VABAL.U8                q0,     d30,    d6
        VABAL.U8                q0,     d31,    d7
local_sad_1_0:
        VPADDL.U16              q0,     q0
        VPADDL.U32              q0,     q0
        VADD.U64                d0,     d1
        VMOV            r0,     r1,     d0
        BX              lr
local_sad_2_0:
        VLDMIA          r2!,    {q8-q15}
        VLD1.8          {d4},   [r0],   r1
        VLD1.8          {d5},   [r0],   r1
        VABDL.U8                q0,     d16,    d4
        VABAL.U8                q0,     d18,    d5
        VLD1.8          {d4},   [r0],   r1
        VLD1.8          {d5},   [r0],   r1
        VABAL.U8                q0,     d20,    d4
        VABAL.U8                q0,     d22,    d5
        VLD1.8          {d4},   [r0],   r1
        VLD1.8          {d5},   [r0],   r1
        VABAL.U8                q0,     d24,    d4
        VABAL.U8                q0,     d26,    d5
        VLD1.8          {d4},   [r0],   r1
        VLD1.8          {d5},   [r0],   r1
        VABAL.U8                q0,     d28,    d4
        VABAL.U8                q0,     d30,    d5
        TST             r3,     #0x00100000
        BEQ             local_sad_1_1
        VLDMIA          r2!,    {q8-q15}
        VLD1.8          {d4},   [r0],   r1
        VLD1.8          {d5},   [r0],   r1
        VABAL.U8                q0,     d16,    d4
        VABAL.U8                q0,     d18,    d5
        VLD1.8          {d4},   [r0],   r1
        VLD1.8          {d5},   [r0],   r1
        VABAL.U8                q0,     d20,    d4
        VABAL.U8                q0,     d22,    d5
        VLD1.8          {d4},   [r0],   r1
        VLD1.8          {d5},   [r0],   r1
        VABAL.U8                q0,     d24,    d4
        VABAL.U8                q0,     d26,    d5
        VLD1.8          {d4},   [r0],   r1
        VLD1.8          {d5},   [r0],   r1
        VABAL.U8                q0,     d28,    d4
        VABAL.U8                q0,     d30,    d5
local_sad_1_1:
        VPADDL.U16              q0,     q0
        VPADDL.U32              q0,     q0
        VADD.U64                d0,     d1
        VMOV            r0,     r1,     d0
        BX              lr
        .size  h264e_sad_mb_unlaign_wh_neon, .-h264e_sad_mb_unlaign_wh_neon

        .type  h264e_sad_mb_unlaign_8x8_neon, %function
h264e_sad_mb_unlaign_8x8_neon:
        VLDMIA          r2!,    {q8-q15}
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABDL.U8                q0,     d16,    d4
        VABDL.U8                q1,     d17,    d5
        VABAL.U8                q0,     d18,    d6
        VABAL.U8                q1,     d19,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d20,    d4
        VABAL.U8                q1,     d21,    d5
        VABAL.U8                q0,     d22,    d6
        VABAL.U8                q1,     d23,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d24,    d4
        VABAL.U8                q1,     d25,    d5
        VABAL.U8                q0,     d26,    d6
        VABAL.U8                q1,     d27,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d28,    d4
        VABAL.U8                q1,     d29,    d5
        VABAL.U8                q0,     d30,    d6
        VABAL.U8                q1,     d31,    d7
        VLDMIA          r2!,    {q8-q15}
        VPADDL.U16              q0,     q0
        VPADDL.U16              q1,     q1
        VPADDL.U32              q0,     q0
        VPADDL.U32              q1,     q1
        VADD.U64                d0,     d1
        VADD.U64                d2,     d3
        VTRN.32         d0,     d2
        VSTMIA          r3!,    {d0}
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABDL.U8                q0,     d16,    d4
        VABDL.U8                q1,     d17,    d5
        VABAL.U8                q0,     d18,    d6
        VABAL.U8                q1,     d19,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d20,    d4
        VABAL.U8                q1,     d21,    d5
        VABAL.U8                q0,     d22,    d6
        VABAL.U8                q1,     d23,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d24,    d4
        VABAL.U8                q1,     d25,    d5
        VABAL.U8                q0,     d26,    d6
        VABAL.U8                q1,     d27,    d7
        VLD1.8          {d4,    d5},    [r0],   r1
        VLD1.8          {d6,    d7},    [r0],   r1
        VABAL.U8                q0,     d28,    d4
        VABAL.U8                q1,     d29,    d5
        VABAL.U8                q0,     d30,    d6
        VABAL.U8                q1,     d31,    d7
        VPADDL.U16              q0,     q0
        VPADDL.U16              q1,     q1
        VPADDL.U32              q0,     q0
        VPADDL.U32              q1,     q1
        VADD.U64                d0,     d1
        VADD.U64                d2,     d3
        VTRN.32         d0,     d2
        VSTMIA          r3!,    {d0}
        LDMDB           r3,     {r0-r3}
        ADD             r0,     r0,     r1
        ADD             r0,     r0,     r2
        ADD             r0,     r0,     r3
        BX              lr
        .size  h264e_sad_mb_unlaign_8x8_neon, .-h264e_sad_mb_unlaign_8x8_neon

        .type  h264e_copy_8x8_neon, %function
h264e_copy_8x8_neon:
        VLDR.64         d0,     [r2,    #0*16]
        VLDR.64         d1,     [r2,    #1*16]
        VLDR.64         d2,     [r2,    #2*16]
        VLDR.64         d3,     [r2,    #3*16]
        VLDR.64         d4,     [r2,    #4*16]
        VLDR.64         d5,     [r2,    #5*16]
        VLDR.64         d6,     [r2,    #6*16]
        VLDR.64         d7,     [r2,    #7*16]
        VST1.32         {d0},   [r0:64],        r1
        VST1.32         {d1},   [r0:64],        r1
        VST1.32         {d2},   [r0:64],        r1
        VST1.32         {d3},   [r0:64],        r1
        VST1.32         {d4},   [r0:64],        r1
        VST1.32         {d5},   [r0:64],        r1
        VST1.32         {d6},   [r0:64],        r1
        VST1.32         {d7},   [r0:64],        r1
        BX              lr
        .size  h264e_copy_8x8_neon, .-h264e_copy_8x8_neon

        .type  h264e_copy_16x16_neon, %function
h264e_copy_16x16_neon:
        MOV             r12,    #4
local_sad_1_2:
        VLD2.32         {d0-d1},        [r2:64],        r3
        VLD2.32         {d2-d3},        [r2:64],        r3
        VLD2.32         {d4-d5},        [r2:64],        r3
        VLD2.32         {d6-d7},        [r2:64],        r3
        SUBS            r12,    r12,    #1
        VST2.32         {d0-d1},        [r0:64],        r1
        VST2.32         {d2-d3},        [r0:64],        r1
        VST2.32         {d4-d5},        [r0:64],        r1
        VST2.32         {d6-d7},        [r0:64],        r1
        BNE             local_sad_1_2
        BX              lr
        .size  h264e_copy_16x16_neon, .-h264e_copy_16x16_neon

        .type  h264e_copy_borders_neon, %function
h264e_copy_borders_neon:
        PUSH            {r4-r12,        lr}
        ADD             r4,     r1,     r3,     lsl #1
        MUL             r5,     r3,     r4
        MLA             r6,     r2,     r4,     r0
        SUB             r8,     r1,     #4
        MOV             lr,     r5
        ADD             r12,    lr,     #4
        SUB             r7,     r6,     r4
        SUB             r5,     r0,     r5
        ADD             r5,     r5,     r8
        ADD             r6,     r6,     r8
local_sad_2_1:
        LDR             r10,    [r0,    r8]
        LDR             r11,    [r7,    r8]
        MOV             r9,     r3
local_sad_1_3:
        SUBS            r9,     r9,     #1
        STR             r10,    [r5],   r4
        STR             r11,    [r6],   r4
        BGT             local_sad_1_3
        SUBS            r8,     r8,     #4
        SUB             r5,     r5,     r12
        SUB             r6,     r6,     r12
        BGE             local_sad_2_1
        SUB             r0,     r0,     lr
        SUB             r5,     r0,     r3
        ADD             r6,     r0,     r1
        SUB             r7,     r6,     #1
        ADD             r9,     r2,     r3,     lsl #1
        LDR             r1,     =0x1010101
        RSB             r12,    r3,     r4,     lsl #1
local_sad_2_2:
        LDRB            lr,     [r0,    r4]
        LDRB            r2,     [r7,    r4]
        LDRB            r10,    [r0],   r4,     lsl #1
        LDRB            r11,    [r7],   r4,     lsl #1
        SUB             r8,     r3,     #4
        MUL             lr,     lr,     r1
        MUL             r2,     r2,     r1
        MUL             r10,    r10,    r1
        MUL             r11,    r11,    r1
local_sad_1_4:
        SUBS            r8,     r8,     #4
        STR             lr,     [r5,    r4]
        STR             r2,     [r6,    r4]
        STR             r10,    [r5],   #4
        STR             r11,    [r6],   #4
        BGE             local_sad_1_4
        SUBS            r9,     r9,     #2
        ADD             r5,     r5,     r12
        ADD             r6,     r6,     r12
        BGT             local_sad_2_2
        POP             {r4-r12,        pc}
        .size  h264e_copy_borders_neon, .-h264e_copy_borders_neon

        .global         h264e_sad_mb_unlaign_8x8_neon
        .global         h264e_sad_mb_unlaign_wh_neon
        .global         h264e_copy_borders_neon
        .global         h264e_copy_8x8_neon
        .global         h264e_copy_16x16_neon
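
Reference note (not part of the library source): the `VABDL.U8`/`VABAL.U8` chains in the SAD routines above accumulate widening absolute differences into 16-bit lanes, then reduce with `VPADDL.U16`/`VPADDL.U32` and `VADD.U64`. A rough scalar equivalent for one 16x16 macroblock, written for this dump — the function name is invented, and the packed 16-byte reference stride is an assumption inferred from the `VLDMIA r2!, {q8-q15}` loads, not a documented API:

```c
#include <stdint.h>

/* Scalar sketch of a 16x16 sum of absolute differences.
   'pix' is a strided source plane; 'ref' is assumed packed row-major
   with a fixed stride of 16, matching the sequential VLDMIA loads. */
static unsigned sad_16x16(const uint8_t *pix, int stride, const uint8_t *ref)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++, pix += stride, ref += 16)
        for (int x = 0; x < 16; x++)
        {
            int d = pix[x] - ref[x];        /* widening difference (VABDL/VABAL) */
            sad += (unsigned)(d < 0 ? -d : d);
        }
    return sad;                             /* final reduction (VPADDL/VADD.U64) */
}
```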


================================================
FILE: asm/neon/h264e_transform_neon.s
================================================
        .arm
        .text
        .align 2

        .type  hadamar4_2d_neon, %function
hadamar4_2d_neon:
        VLD4.16         {d0,    d1,     d2,     d3},    [r0]
        VADD.S16                q2,     q0,     q1
        VSUB.S16                q3,     q0,     q1
        VSWP            d5,     d6
        VADD.S16                q0,     q2,     q3
        VSUB.S16                q1,     q2,     q3
        VSWP            d2,     d3
        VTRN.S16                d0,     d1
        VTRN.S16                d2,     d3
        VTRN.S32                q0,     q1
        VADD.S16                q2,     q0,     q1
        VSUB.S16                q3,     q0,     q1
        VSWP            d5,     d6
        VADD.S16                q0,     q2,     q3
        VSUB.S16                q1,     q2,     q3
        VSWP            d2,     d3
        VSTMIA          r0,     {q0-q1}
        BX              lr
        .size  hadamar4_2d_neon, .-hadamar4_2d_neon

        .type  hadamar2_2d_neon, %function
hadamar2_2d_neon:
        LDMIA           r0,     {r1,    r2}
        SADDSUBX                r1,     r1,     r1
        SADDSUBX                r2,     r2,     r2
        SSUB16          r3,     r1,     r2
        SADD16          r2,     r1,     r2
        MOV             r2,     r2,     ror #16
        MOV             r3,     r3,     ror #16
        STMIA           r0,     {r2,    r3}
        BX              lr
        .size  hadamar2_2d_neon, .-hadamar2_2d_neon

        .type  h264e_quant_luma_dc_neon, %function
h264e_quant_luma_dc_neon:
        PUSH            {r4-r6, lr}
        SUB             sp,     sp,     #0x28
        MOV             r6,     r1
        MOV             r4,     r2
        MOV             r5,     r0
        SUB             r0,     r5,     #16*2
        BL              hadamar4_2d_neon
        MOV             r3,     #0x20000
        STR             r3,     [sp,    #0]
        LDRSH           r2,     [r4,    #0]
        MOV             r3,     #0x10
        MOV             r1,     r6
        SUB             r0,     r5,     #16*2
        BL              quant_dc
        SUB             r0,     r5,     #16*2
        BL              hadamar4_2d_neon
        LDRH            r0,     [r4,    #2]
        MOV             r3,     #0x10
        SUB             r1,     r5,     #16*2
        MOV             r2,     r0,     lsr #2
        MOV             r0,     r5
        BL              dequant_dc
        ADD             sp,     sp,     #0x28
        POP             {r4-r6, pc}
        .size  h264e_quant_luma_dc_neon, .-h264e_quant_luma_dc_neon

        .type  h264e_quant_chroma_dc_neon, %function
h264e_quant_chroma_dc_neon:
        PUSH            {r3-r7, lr}
        MOV             r6,     r1
        MOV             r4,     r2
        MOV             r5,     r0
        SUB             r0,     r5,     #16*2
        BL              hadamar2_2d_neon
        LDR             r3,     =0x0000aaaa
        MOV             r1,     r6
        STR             r3,     [sp,    #0]
        LDRH            r0,     [r4,    #0]
        MOV             r3,     #4
        MOV             r2,     r0,     lsl #17
        MOV             r2,     r2,     asr #16
        SUB             r0,     r5,     #16*2
        BL              quant_dc
        SUB             r0,     r5,     #16*2
        BL              hadamar2_2d_neon
        LDRH            r0,     [r4,    #2]
        MOV             r3,     #4
        SUB             r1,     r5,     #16*2
        MOV             r2,     r0,     lsr #1
        MOV             r0,     r5
        BL              dequant_dc
        SUB             r1,     r5,     #16*2
        LDMIA           r1,     {r2,    r3}
        ORRS            r0,     r2,     r3
        MOVNE           r0,     #1
        POP             {r3-r7, pc}
        .size  h264e_quant_chroma_dc_neon, .-h264e_quant_chroma_dc_neon

        .type  is_zero4_neon, %function
is_zero4_neon:
        PUSH            {r4-r6, lr}
        MOV             r4,     r0
        MOV             r5,     r1
        MOV             r6,     r2
        ADD             r0,     r0,     #(0+16*2)
        BL              is_zero_neon
        POPNE           {r4-r6, pc}
        MOV             r2,     r6
        MOV             r1,     r5
        ADD             r0,     r4,     #(0+16*2)+((0+16*2)+16*2)
        BL              is_zero_neon
        POPNE           {r4-r6, pc}
        MOV             r2,     r6
        MOV             r1,     r5
        ADD             r0,     r4,     #(0+16*2)+4*((0+16*2)+16*2)
        BL              is_zero_neon
        POPNE           {r4-r6, pc}
        MOV             r2,     r6
        MOV             r1,     r5
        ADD             r0,     r4,     #(0+16*2)+5*((0+16*2)+16*2)
        BL              is_zero_neon
        POP             {r4-r6, pc}
        .size  is_zero4_neon, .-is_zero4_neon

        .type  h264e_transform_sub_quant_dequant_neon, %function
h264e_transform_sub_quant_dequant_neon:
        PUSH            {r0-r12,        lr}
        MOV             r6,     r1
        MOV             r5,     r0
        MOV             r8,     r3,     asr #1
        LDR             r1,     [sp,    #8]
        MOV             r0,     r3,     asr #1
        LDR             r4,     [sp,    #0x38]
        MOV             r9,     r3
        MOV             r7,     r8
        SUB             r10,    r1,     r0
        RSB             r11,    r0,     #0x10
l0.660:
        LDR             r2,     [sp,    #8]
        ADD             r3,     r4,     #0x20
        MOV             r1,     r6
        MOV             r0,     r5
        BL              fwdtransformresidual4x42_neon
        SUBS            r7,     r7,     #1
        ADD             r5,     r5,     #4
        ADD             r6,     r6,     #4
        ADD             r4,     r4,     #((0+16*2)+16*2)
        BNE             l0.660
        SUBS            r8,     r8,     #1
        MOV             r7,     r9,     asr #1
        ADD             r5,     r5,     r10,    lsl #2
        ADD             r6,     r6,     r11,    lsl #2
        BNE             l0.660
        MOVS            r7,     r9,     lsr #1
        BCC             local_transform_10_0
        MUL             r7,     r7,     r7
        LDR             r5,     [sp,    #0x38]
        SUB             r0,     r5,     #16*2
        ADD             r1,     r5,     #(0+16*2)
local_transform_1_0:
        LDRH            r2,     [r1],   #((0+16*2)+16*2)
        SUBS            r7,     r7,     #1
        STRH            r2,     [r0],   #2
        BNE             local_transform_1_0
local_transform_10_0:
        ADD             r3,     sp,     #0x38
        MOV             r1,     r9
        LDMIA           r3,     {r0,    r2}
        BL              zero_smallq_neon
        ADD             r4,     sp,     #0x38
        MOV             r3,     r0
        MOV             r1,     r9
        LDMIA           r4,     {r0,    r2}
        ADD             sp,     sp,     #0x10
        POP             {r4-r12,        lr}
        B               quantize_neon
        .size  h264e_transform_sub_quant_dequant_neon, .-h264e_transform_sub_quant_dequant_neon

        .type  h264e_transform_add_neon, %function
h264e_transform_add_neon:
        LDR             r12,    [sp]
        SUB             r12,    r12,    r12,    lsl #16
        ADD             r3,     r3,     #(0+16*2)
        PUSH            {r0-r12,        lr}
local_transform_1_1:
        LDR             r12,    [sp,    #0+4+4+4+4+4*8+4+4+4]
        MOV             lr,     #0
        MOVS            r12,    r12,    lsl #1
        STR             r12,    [sp,    #0+4+4+4+4+4*8+4+4+4]
        BCC             copy_block
        VLD1.16         {d0,    d1,     d2,     d3},    [r3]
        ADD             r3,     r3,     #((0+16*2)+16*2)
        VTRN.16         d0,     d1
        VTRN.16         d2,     d3
        VTRN.32         q0,     q1
        VADD.S16                d4,     d0,     d2
        VSUB.S16                d5,     d0,     d2
        VSHR.S16                d31,    d1,     #1
        VSHR.S16                d30,    d3,     #1
        VSUB.S16                d6,     d31,    d3
        VADD.S16                d7,     d1,     d30
        VADD.S16                d0,     d4,     d7
        VADD.S16                d1,     d5,     d6
        VSUB.S16                d2,     d5,     d6
        VSUB.S16                d3,     d4,     d7
        VTRN.16         d0,     d1
        VTRN.16         d2,     d3
        VTRN.32         q0,     q1
        VADD.S16                d4,     d0,     d2
        VSUB.S16                d5,     d0,     d2
        VSHR.S16                d31,    d1,     #1
        VSHR.S16                d30,    d3,     #1
        VSUB.S16                d6,     d31,    d3
        VADD.S16                d7,     d1,     d30
        VADD.S16                d0,     d4,     d7
        VADD.S16                d1,     d5,     d6
        VSUB.S16                d2,     d5,     d6
        VSUB.S16                d3,     d4,     d7
        LDR             r4,     [r2],   #16
        LDR             r5,     [r2],   #16
        VMOV            d20,    r4,     r5
        LDR             r4,     [r2],   #16
        LDR             r5,     [r2],   #4-16*3
        VMOV            d21,    r4,     r5
        VSHLL.U8                q2,     d20,    #6
        VADD.S16                q0,     q0,     q2
        VSHLL.U8                q3,     d21,    #6
        VADD.S16                q1,     q1,     q3
        VQRSHRUN.S16            d0,     q0,     #6
        VQRSHRUN.S16            d1,     q1,     #6
        VMOV            r4,     r5,     d0
        STR             r4,     [r0],   r1
        STR             r5,     [r0],   r1
        VMOV            r4,     r5,     d1
        STR             r4,     [r0],   r1
        STR             r5,     [r0],   r1
copy_block_ret:
        LDR             lr,     [sp,    #0+4+4+4+4+4*8]
        SUB             r0,     r0,     r1,     lsl #2
        ADD             r0,     r0,     #4
        ADDS            lr,     lr,     #0x10000
        STRMI           lr,     [sp,    #0+4+4+4+4+4*8]
        BMI             local_transform_1_1
        SUBS            lr,     lr,     #1
        POPEQ           {r0-r12,        pc}
        LDR             r4,     [sp,    #0+4+4+4+4+4*8+4+4]
        SUB             lr,     lr,     r4,     lsl #16
        STR             lr,     [sp,    #0+4+4+4+4+4*8]
        ADD             r0,     r0,     r1,     lsl #2
        SUB             r0,     r0,     r4,     lsl #2
        ADD             r2,     r2,     #16*4
        SUB             r2,     r2,     r4,     lsl #2
        B               local_transform_1_1
copy_block:
        LDR             r4,     [r2],   #16
        LDR             r5,     [r2],   #16
        LDR             r6,     [r2],   #16
        LDR             r7,     [r2],   #4-16*3
        ADD             r3,     r3,     #((0+16*2)+16*2)
        STR             r4,     [r0],   r1
        STR             r5,     [r0],   r1
        STR             r6,     [r0],   r1
        STR             r7,     [r0],   r1
        B               copy_block_ret
dequant_dc:
        PUSH            {lr}
        ADD             r0,     r0,     #(0+16*2)
local_transform_1_2:
        LDR             lr,     [r1],   #4
        SUBS            r3,     r3,     #2
        SMULBB          r12,    r2,     lr
        SMULBT          lr,     r2,     lr
        STRH            r12,    [r0],   #((0+16*2)+16*2)
        STRH            lr,     [r0],   #((0+16*2)+16*2)
        BNE             local_transform_1_2
        POP             {pc}
quant_dc:
        PUSH            {r4-r6, lr}
        CMP             r3,     #4
        LDR             r5,     [sp,    #0x10]
        LDRNE           r12,    =iscan16
        LDREQ           r12,    =iscan4
        RSB             r6,     r5,     #0x40000
local_transform_1_3:
        LDRSH           lr,     [r0]
        CMP             lr,     #0
        MOVGE           r4,     r5
        MOVLT           r4,     r6
        MLA             lr,     r2,     lr,     r4
        MOV             lr,     lr,     asr #18
        STRH            lr,     [r0],   #2
        LDRB            r4,     [r12],  #1
        SUBS            r3,     r3,     #1
        ADD             r4,     r1,     r4,     lsl #1
        STRH            lr,     [r4,    #0]
        BNE             local_transform_1_3
        POP             {r4-r6, pc}
        .size  h264e_transform_add_neon, .-h264e_transform_add_neon

        .type  fwdtransformresidual4x42_neon, %function
fwdtransformresidual4x42_neon:
        PUSH            {lr}
        LDR             r12,    [r0],   r2
        LDR             lr,     [r0],   r2
        VMOV            d16,    r12,    lr
        LDR             r12,    [r0],   r2
        LDR             lr,     [r0],   r2
        VMOV            d17,    r12,    lr
        LDR             r12,    [r1],   #16
        LDR             lr,     [r1],   #16
        VMOV            d20,    r12,    lr
        LDR             r12,    [r1],   #16
        LDR             lr,     [r1],   #16
        VMOV            d21,    r12,    lr
        VSUBL.U8                q0,     d16,    d20
        VSUBL.U8                q1,     d17,    d21
        VTRN.16         d0,     d1
        VTRN.16         d2,     d3
        VTRN.32         q0,     q1
        VADD.S16                d4,     d0,     d3
        VSUB.S16                d5,     d0,     d3
        VADD.S16                d6,     d1,     d2
        VSUB.S16                d7,     d1,     d2
        VADD.S16                q0,     q2,     q3
        VADD.S16                d1,     d1,     d5
        VSUB.S16                q1,     q2,     q3
        VSUB.S16                d3,     d3,     d7
        VTRN.16         d0,     d1
        VTRN.16         d2,     d3
        VTRN.32         q0,     q1
        VADD.S16                d4,     d0,     d3
        VSUB.S16                d5,     d0,     d3
        VADD.S16                d6,     d1,     d2
        VSUB.S16                d7,     d1,     d2
        VADD.S16                q0,     q2,     q3
        VADD.S16                d1,     d1,     d5
        VSUB.S16                q1,     q2,     q3
        VSUB.S16                d3,     d3,     d7
        VST1.16         {q0,    q1},    [r3]
        POP             {pc}
        .size  fwdtransformresidual4x42_neon, .-fwdtransformresidual4x42_neon

        .type  is_zero_neon, %function
is_zero_neon:
        VLD1.16         {d0-d3},        [r0]
        VABS.S16                q0,     q0
        VABS.S16                q1,     q1
        VCGT.U16                q0,     q0,     q15
        VCGT.U16                q1,     q1,     q15
        VBIC            d0,     d0,     d29
        VORR            q0,     q0,     q1
        VORR            d0,     d0,     d1
        VMOV            r0,     r1,     d0
        ORRS            r0,     r0,     r1
        BX              lr
        .size  is_zero_neon, .-is_zero_neon

        .type  zero_smallq_neon, %function
zero_smallq_neon:
        PUSH            {r4-r12,        lr}
        TST             r1,     #1
        VMOV.I64                d29,    #0xffff
        BNE             local_transform_10_1
        VMOV.I64                d29,    #0
local_transform_10_1:
        CMP             r1,     #8
        MOV             r8,     r0
        MOV             r6,     r1
        MOV             r0,     r1,     asr #1
        CMPNE           r6,     #5
        MOV             r7,     r2
        ADD             r2,     r2,     #0x14
        VLD1.16         {q15},  [r2]
        MOV             r4,     #0
        MULEQ           r9,     r0,     r0
        AND             r10,    r1,     #1
        MOVEQ           r5,     #0
        MOVEQ           r11,    #1
        BNE             l0.1964
        MOV             r12,    #((0+16*2)+16*2)
        MLA             r8,     r12,    r9,     r8
        ADD             r8,     r8,     #(0+16*2)
local_transform_1_4:
        SUB             r8,     r8,     #(((0+16*2)+16*2))
        VLD1.16         {d0-d3},        [r8]
        VABS.S16                q0,     q0
        VABS.S16                q1,     q1
        VCGT.U16                q0,     q0,     q15
        VCGT.U16                q1,     q1,     q15
        VBIC            d0,     d0,     d29
        VORR            q0,     q0,     q1
        VORR            d0,     d0,     d1
        VMOV            r0,     r1,     d0
        ORRS            r0,     r0,     r1
        ADD             r4,     r4,     r4
        ORREQ           r4,     r4,     #1
        SUBS            r9,     r9,     #1
        BNE             local_transform_1_4
        SUB             r8,     r8,     #(0+16*2)
        ADD             r2,     r2,     #0x10
        VLD1.16         {q15},  [r2]
        CMP             r6,     #8
        BNE             l0.1964
        MOV             r0,     #0x33
        BICS            r0,     r0,     r4
        BEQ             l0.1856
        ADD             r2,     r7,     #0x24
        MOV             r1,     r10
        MOV             r0,     r8
        BL              is_zero4_neon
        ORREQ           r4,     r4,     #0x33
l0.1856:
        MOV             r0,     #0xcc
        BICS            r0,     r0,     r4
        BEQ             l0.1892
        ADD             r2,     r7,     #0x24
        MOV             r1,     r10
        ADD             r0,     r8,     #2*((0+16*2)+16*2)
        BL              is_zero4_neon
        ORREQ           r4,     r4,     #0xcc
l0.1892:
        MOV             r0,     #0x3300
        BICS            r0,     r0,     r4
        BEQ             l0.1928
        ADD             r2,     r7,     #0x24
        MOV             r1,     r10
        ADD             r0,     r8,     #8*((0+16*2)+16*2)
        BL              is_zero4_neon
        ORREQ           r4,     r4,     #0x3300
l0.1928:
        MOV             r0,     #0xcc00
        BICS            r0,     r0,     r4
        BEQ             l0.1964
        ADD             r2,     r7,     #0x24
        MOV             r1,     r10
        ADD             r0,     r8,     #10*((0+16*2)+16*2)
        BL              is_zero4_neon
        ORREQ           r4,     r4,     #0xcc00
l0.1964:
        MOV             r0,     r4
        POP             {r4-r12,        pc}
        .size  zero_smallq_neon, .-zero_smallq_neon

        .type  quantize_neon, %function
quantize_neon:
        PUSH            {r3-r11,        lr}
        AND             r4,     r1,     #1
        MOV             r5,     r1,     asr #1
        MOV             r7,     #0
        MOV             lr,     r5
        STR             r4,     [sp,    #0]
local_transform_1_5:
        TST             r3,     #1
        MOV             r6,     #0
        BEQ             nonzero
        VMOV.U8         q0,     #0
        VMOV.U8         q1,     #0
        VST1.16         {q0,    q1},    [r0]
qloop_next:
        CMP             r6,     #0
        MOV             r7,     r7,     lsl #1
        ORRNE           r7,     r7,     #1
        SUBS            r5,     r5,     #1
        MOVEQ           r5,     r1,     asr #1
        SUBEQS          lr,     lr,     #1
        MOV             r3,     r3,     asr #1
        ADD             r0,     r0,     #((0+16*2)+16*2)
        MOVEQ           r0,     r7
        BNE             local_transform_1_5
        POP             {r3-r11,        pc}
nonzero:
        LDR             r4,     [sp,    #0]
        LDRH            r12,    [r2,    #0xc]
        CMP             r4,     #0
        ADD             r4,     r0,     #(0+16*2)
        VLD1.16         {q0,    q1},    [r4]
        VDUP.16         q15,    r12
        VCLT.S16                q8,     q0,     #0
        VCLT.S16                q9,     q1,     #0
        VEOR            q8,     q15,    q8
        VEOR            q9,     q15,    q9
        LDR             r12,    [r2,    #4]
        VDUP.16         d4,     r12
        VDUP.16         d6,     r12
        MOV             r12,    r12,    asr #16
        VDUP.16         d5,     r12
        VDUP.16         d7,     r12
        LDR             r12,    [r2,    #0]
        VMOV.16         d4[0],  r12
        VMOV.16         d4[2],  r12
        MOV             r12,    r12,    asr #16
        VMOV.16         d5[0],  r12
        VMOV.16         d5[2],  r12
        LDR             r12,    [r2,    #8]
        VMOV.16         d6[1],  r12
        VMOV.16         d6[3],  r12
        MOV             r12,    r12,    asr #16
        VMOV.16         d7[1],  r12
        VMOV.16         d7[3],  r12
        VMULL.S16               q10,    d0,     d4
        VADDW.U16               q10,    d16
        VQSHRN.S32              d22,    q10,    #16
        VMUL.S16                d26,    d22,    d5
        VMULL.S16               q10,    d1,     d6
        VADDW.U16               q10,    d17
        VQSHRN.S32              d23,    q10,    #16
        VMUL.S16                d27,    d23,    d7
        VMULL.S16               q10,    d2,     d4
        VADDW.U16               q10,    d18
        VQSHRN.S32              d24,    q10,    #16
        VMUL.S16                d28,    d24,    d5
        VMULL.S16               q10,    d3,     d6
        VADDW.U16               q10,    d19
        VQSHRN.S32              d25,    q10,    #16
        VMUL.S16                d29,    d25,    d7
        ADD             r4,     r0,     #(0+16*2)
        LDRNEH          r12,    [r4]
        VST1.16         {d26-d29},      [r4]
        STRNEH          r12,    [r4]
        LDR             r4,     [sp,    #0]
        CMP             r4,     #0
        LDR             r12,    =iscan16_neon
        VLD1.8          {q8,    q9},    [r12]
        VTBL.8          d0,     {d22-d25},      d16
        VTBL.8          d1,     {d22-d25},      d17
        VTBL.8          d2,     {d22-d25},      d18
        VTBL.8          d3,     {d22-d25},      d19
        LDRNEH          r4,     [r0]
        VST1.16         {d0-d3},        [r0]
        STRNEH          r4,     [r0]
        LDR             r12,    =imask16_neon
        VLD1.8          {q8,    q9},    [r12]
        VCEQ.I16                q0,     q0,     #0
        VCEQ.I16                q1,     q1,     #0
        VAND            q0,     q0,     q8
        VAND            q1,     q1,     q9
        VORR            q0,     q0,     q1
        VORR            d0,     d0,     d1
        VPADD.U16               d0,     d0,     d0
        VPADD.U16               d0,     d0,     d0
        VMOV.U16                r12,    d0[0]
        MVN             r6,     r12,    lsl #16
        MOV             r6,     r6,     lsr #16
        BICNE           r6,     r6,     #1
        B               qloop_next
        .size  quantize_neon, .-quantize_neon

        .section        .rodata
        .align 2
iscan4:
        .byte           0x00,   0x01,   0x02,   0x03
iscan16:
        .byte           0x00,   0x01,   0x05,   0x06
        .byte           0x02,   0x04,   0x07,   0x0c
        .byte           0x03,   0x08,   0x0b,   0x0d
        .byte           0x09,   0x0a,   0x0e,   0x0f
imask16_neon:
        .short          0x0001, 0x0002, 0x0004, 0x0008
        .short          0x0010, 0x0020, 0x0040, 0x0080
        .short          0x0100, 0x0200, 0x0400, 0x0800
        .short          0x1000, 0x2000, 0x4000, 0x8000
iscan16_neon:
        .byte           0x00,   0x01,   0x02,   0x03,   0x08,   0x09,   0x10,   0x11
        .byte           0x0a,   0x0b,   0x04,   0x05,   0x06,   0x07,   0x0c,   0x0d
        .byte           0x12,   0x13,   0x18,   0x19,   0x1a,   0x1b,   0x14,   0x15
        .byte           0x0e,   0x0f,   0x16,   0x17,   0x1c,   0x1d,   0x1e,   0x1f
        .global         h264e_quant_luma_dc_neon
        .global         h264e_quant_chroma_dc_neon
        .global         h264e_transform_sub_quant_dequant_neon
        .global         h264e_transform_add_neon


================================================
FILE: minih264e.h
================================================
#ifndef MINIH264_H
#define MINIH264_H
/*
    https://github.com/lieff/minih264
    To the extent possible under law, the author(s) have dedicated all copyright and related and neighboring rights to this software to the public domain worldwide.
    This software is distributed without any warranty.
    See <http://creativecommons.org/publicdomain/zero/1.0/>.
*/

#ifdef __cplusplus
extern "C" {
#endif

#ifndef H264E_SVC_API
#   define H264E_SVC_API 1
#endif

#ifndef H264E_MAX_THREADS
#   define H264E_MAX_THREADS 4
#endif

/**
*   API return error codes
*/
#define H264E_STATUS_SUCCESS                0
#define H264E_STATUS_BAD_ARGUMENT           1
#define H264E_STATUS_BAD_PARAMETER          2
#define H264E_STATUS_BAD_FRAME_TYPE         3
#define H264E_STATUS_SIZE_NOT_MULTIPLE_16   4
#define H264E_STATUS_SIZE_NOT_MULTIPLE_2    5
#define H264E_STATUS_BAD_LUMA_ALIGN         6
#define H264E_STATUS_BAD_LUMA_STRIDE        7
#define H264E_STATUS_BAD_CHROMA_ALIGN       8
#define H264E_STATUS_BAD_CHROMA_STRIDE      9

/**
*   Frame type definitions
*   - Sequence must start with key (IDR) frame.
*   - P (Predicted) frames are the most efficiently coded
*   - Droppable frames may be safely removed from the bitstream, and used
*     for frame rate scalability
*   - Golden and Recovery frames are used for error recovery. These
*     frames use a "long-term reference" for prediction, and
*     can be decoded even if the P-frame sequence is interrupted.
*     They act similarly to a key frame, but are coded more efficiently.
*
*   Type        Refers to   Saved as long-term  Saved as short-term
*   ---------------------------------------------------------------
*   Key (IDR) : N/A         Yes                 Yes                |
*   Golden    : long-term   Yes                 Yes                |
*   Recovery  : long-term   No                  Yes                |
*   P         : short-term  No                  Yes                |
*   Droppable : short-term  No                  No                 |
*                                                                  |
*   Example sequence:        K   P   P   G   D   P   R   D   K     |
*   long-term reference       1K  1K  1K  4G  4G  4G  4G  4G  9K   |
*                             /         \ /         \         /    |
*   coded frame             1K  2P  3P  4G  5D  6P  7R  8D  9K     |
*                             \ / \ / \   \ /   / \   \ /     \    |
*   short-term reference      1K  2P  3P  4G  4G  6P  7R  7R  9K   |
*
*/
#define H264E_FRAME_TYPE_DEFAULT    0       // Frame type set according to GOP size
#define H264E_FRAME_TYPE_KEY        6       // Random access point: SPS+PPS+Intra frame
#define H264E_FRAME_TYPE_I          5       // Intra frame: updates long & short-term reference
#define H264E_FRAME_TYPE_GOLDEN     4       // Use and update long-term reference
#define H264E_FRAME_TYPE_RECOVERY   3       // Use long-term reference, updates short-term reference
#define H264E_FRAME_TYPE_P          2       // Use and update short-term reference
#define H264E_FRAME_TYPE_DROPPABLE  1       // Use short-term reference, don't update anything
#define H264E_FRAME_TYPE_CUSTOM     99      // Application specifies reference frame

/**
*   Speed preset index.
*   Currently used values are 0, 1, 8 and 9
*/
#define H264E_SPEED_SLOWEST         0       // All coding tools enabled, including denoise filter
#define H264E_SPEED_BALANCED        5
#define H264E_SPEED_FASTEST         10      // Minimum tools enabled

/**
*   Creation parameters
*/
typedef struct H264E_create_param_tag
{
    // Frame width: must be multiple of 16
    int width;

    // Frame height: must be multiple of 16
    int height;

    // GOP size == key frame period
    // If 0: no key frames generated except 1st frame (infinite GOP)
    // If 1: Only intra-frames produced
    int gop;

    // Video Buffer Verifier size, bits
    // If 0: VBV model would be disabled
    // Note that this value defines the H.264 Level
    int vbv_size_bytes;

    // If set: transparent frames produced on VBV overflow
    // If not set: VBV overflow ignored, produce bitrate bigger than specified
    int vbv_overflow_empty_frame_flag;

    // If set: keep minimum bitrate using stuffing, prevent VBV underflow
    // If not set: ignore VBV underflow, produce bitrate smaller than specified
    int vbv_underflow_stuffing_flag;

    // If set: control bitrate at macroblock-level (better bitrate precision)
    // If not set: control bitrate at frame-level (better quality)
    int fine_rate_control_flag;

    // If set: don't change input, but allocate additional frame buffer
    // If not set: use input as a scratch
    int const_input_flag;

    // If 0: golden, recovery, and custom frames are disabled
    // If >0: Specifies the number of persistent frame buffers used
    int max_long_term_reference_frames;

    int enableNEON;

    // If set: enable temporal noise suppression
    int temporal_denoise_flag;

    int sps_id;

#if H264E_SVC_API
    //          SVC extension
    // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    // Number of SVC layers:
    // 1 = AVC
    // 2 = SVC with 2 layers of spatial scalability
    int num_layers;

    // If set, SVC extension layer will use predictors from base layer
    // (sometimes can slightly increase efficiency)
    int inter_layer_pred_flag;
#endif

#if H264E_MAX_THREADS
    //           Multi-thread extension
    // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    // Maximum threads, supported by the callback
    int max_threads;

    // Opaque token, passed to callback
    void *token;

    // Application-supplied callback function.
    // This callback runs given jobs, by calling provided job_func(), passing
    // job_data[i] to each one.
    //
    // The h264e_thread_pool_run() can be used here, example:
    //
    //      int max_threads = 4;
    //      void *thread_pool = h264e_thread_pool_init(max_threads);
    //
    //      H264E_create_param_t par;
    //      par.max_threads = max_threads;
    //      par.token = thread_pool;
    //      par.run_func_in_thread = h264e_thread_pool_run;
    //
    // The reason to use double callbacks is to avoid mixing portable and
    // system-dependent code, and to avoid close() function in the encoder API.
    //
    void (*run_func_in_thread)(void *token, void (*job_func)(void*), void *job_data[], int njobs);
#endif

} H264E_create_param_t;

/**
*   Run-time parameters
*/
typedef struct H264E_run_param_tag
{
    // Variable, indicating speed/quality tradeoff
    // 0 means best quality
    int encode_speed;

    // Frame type override: one of H264E_FRAME_TYPE_* values
    // if 0: GOP pattern defined by create_param::gop value
    int frame_type;

    // Used only if frame_type == H264E_FRAME_TYPE_CUSTOM
    // Reference long-term frame index [1..max_long_term_reference_frames]
    // 0 = use previous frame (short-term)
    // -1 = IDR frame, kill all long-term frames
    int long_term_idx_use;

    // Used only if frame_type == H264E_FRAME_TYPE_CUSTOM
    // Store decoded frame in long-term buffer with given index in the
    // range [1..max_long_term_reference_frames]
    // 0 = save to short-term buffer
    // -1 = Don't save frame (droppable)
    int long_term_idx_update;

    // Target frame size. Typically = bitrate/framerate
    int desired_frame_bytes;

    // Minimum quantizer value, 10 indicates good quality
    // range: [10; qp_max]
    int qp_min;

    // Maximum quantizer value, 51 indicates very bad quality
    // range: [qp_min; 51]
    int qp_max;

    // Desired NALU size. A NALU is produced as soon as its size exceeds this value
    // if 0: frame would be coded with a single NALU
    int desired_nalu_bytes;

    // Optional NALU notification callback, called by the encoder
    // as soon as NALU encoding is complete.
    void (*nalu_callback)(
        const unsigned char *nalu_data, // Coded NALU data, w/o start code
        int sizeof_nalu_data,           // Size of NALU data
        void *token                     // optional transparent token
        );

    // token to pass to NALU callback
    void *nalu_callback_token;

} H264E_run_param_t;

/**
*    Planar YUV420 descriptor
*/
typedef struct H264E_io_yuv_tag
{
    // Pointers to 3 pixel planes of YUV image
    unsigned char *yuv[3];
    // Stride for each image plane
    int stride[3];
} H264E_io_yuv_t;

typedef struct H264E_persist_tag H264E_persist_t;
typedef struct H264E_scratch_tag H264E_scratch_t;

/**
*   Return persistent and scratch memory requirements
*   for given encoding options.
*
*   Return value:
*       -zero in case of success
*       -error code (H264E_STATUS_*), if fails
*
*   example:
*
*   int sizeof_persist, sizeof_scratch, error;
*   H264E_persist_t * enc;
*   H264E_scratch_t * scratch;
*
*   error = H264E_sizeof(param, &sizeof_persist, &sizeof_scratch);
*   if (!error)
*   {
*       enc     = malloc(sizeof_persist);
*       scratch = malloc(sizeof_scratch);
*       error = H264E_init(enc, param);
*   }
*/
int H264E_sizeof(
    const H264E_create_param_t *param,  ///< Encoder creation parameters
    int *sizeof_persist,                ///< [OUT] Size of persistent RAM
    int *sizeof_scratch                 ///< [OUT] Size of scratch RAM
);

/**
*   Initialize encoding session
*
*   Return value:
*       -zero in case of success
*       -error code (H264E_STATUS_*), if fails
*/
int H264E_init(
    H264E_persist_t *enc,               ///< Encoder object
    const H264E_create_param_t *param   ///< Encoder creation parameters
);

/**
*   Encode single video frame
*
*   Output buffer is in the scratch RAM
*
*   Return value:
*       -zero in case of success
*       -error code (H264E_STATUS_*), if fails
*/
int H264E_encode(
    H264E_persist_t *enc,               ///< Encoder object
    H264E_scratch_t *scratch,           ///< Scratch memory
    const H264E_run_param_t *run_param, ///< run-time parameters
    H264E_io_yuv_t *frame,              ///< Input video frame
    unsigned char **coded_data,         ///< [OUT] Pointer to coded data
    int *sizeof_coded_data              ///< [OUT] Size of coded data
);

/**
*   This is a "hack" function to set the internal rate-control state.
*   Note that the encoder allows the application to completely override rate-control,
*   so this function should be used only by lazy coders who just want to change the
*   VBV size without implementing custom rate-control.
*
*   Note that the H.264 level is defined by the VBV size at initialization.
*/
void H264E_set_vbv_state(
    H264E_persist_t *enc,               ///< Encoder object
    int vbv_size_bytes,                 ///< New VBV size
    int vbv_fullness_bytes              ///< New VBV fullness, -1 = no change
);

#ifdef __cplusplus
}
#endif

#endif //MINIH264_H

#if defined(MINIH264_IMPLEMENTATION) && !defined(MINIH264_IMPLEMENTATION_GUARD)
#define MINIH264_IMPLEMENTATION_GUARD

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/************************************************************************/
/*      Build configuration                                             */
/************************************************************************/
#ifndef H264E_ENABLE_DENOISE
#define H264E_ENABLE_DENOISE 1 // Built-in noise suppressor
#endif

#ifndef MAX_LONG_TERM_FRAMES
#define MAX_LONG_TERM_FRAMES 8 // Max long-term frames count
#endif

#if !defined(MINIH264_ONLY_SIMD) && (defined(_M_X64) || defined(_M_ARM64) || defined(__x86_64__) || defined(__aarch64__))
/* x64 always has SSE2 and arm64 always has NEON, so no generic code is needed */
#define MINIH264_ONLY_SIMD
#endif /* SIMD checks... */

#if (defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))) || ((defined(__i386__) || defined(__x86_64__)) && defined(__SSE2__))
#define H264E_ENABLE_SSE2 1
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <emmintrin.h>
#endif
#elif defined(__ARM_NEON) || defined(__aarch64__)
#define H264E_ENABLE_NEON 1
#include <arm_neon.h>
#else
#ifdef MINIH264_ONLY_SIMD
#error MINIH264_ONLY_SIMD used, but SSE/NEON not enabled
#endif
#endif

#ifndef MINIH264_ONLY_SIMD
#define H264E_ENABLE_PLAIN_C 1
#endif

#define H264E_CONFIGS_COUNT ((H264E_ENABLE_SSE2) + (H264E_ENABLE_PLAIN_C) + (H264E_ENABLE_NEON))

#if defined(__ARMCC_VERSION) || defined(_WIN32) || defined(__EMSCRIPTEN__)
#define __BYTE_ORDER 0
#define __BIG_ENDIAN 1
#elif defined(__linux__) || defined(__CYGWIN__)
#include <endian.h>
#elif defined(__APPLE__)
#include <libkern/OSByteOrder.h>
#define __BYTE_ORDER BYTE_ORDER
#define __BIG_ENDIAN BIG_ENDIAN
#elif defined(__OpenBSD__) || defined(__NetBSD__) || defined(__FreeBSD__) || defined(__DragonFly__)
#include <sys/endian.h>
#else
#error platform not supported
#endif

#if defined(__aarch64__) && defined(__clang__)
// uintptr_t broken with aarch64 clang on ubuntu 18
#define uintptr_t unsigned long
#endif
#if defined(__arm__) && defined(__clang__)
#include <arm_acle.h>
#elif defined(__arm__) && defined(__GNUC__) && !defined(__ARMCC_VERSION)
static inline unsigned int __usad8(unsigned int val1, unsigned int val2)
{
    unsigned int result;
    __asm__ volatile ("usad8 %0, %1, %2\n\t"
                      : "=r" (result)
                      : "r" (val1), "r" (val2));
    return result;
}

static inline unsigned int __usada8(unsigned int val1, unsigned int val2, unsigned int val3)
{
    unsigned int result;
    __asm__ volatile ("usada8 %0, %1, %2, %3\n\t"
                      : "=r" (result)
                      : "r" (val1), "r" (val2), "r" (val3));
    return result;
}

static inline unsigned int __sadd16(unsigned int val1, unsigned int val2)
{
    unsigned int result;
    __asm__ volatile ("sadd16 %0, %1, %2\n\t"
                      : "=r" (result)
                      : "r" (val1), "r" (val2));
    return result;
}

static inline unsigned int __ssub16(unsigned int val1, unsigned int val2)
{
    unsigned int result;
    __asm__ volatile ("ssub16 %0, %1, %2\n\t"
                      : "=r" (result)
                      : "r" (val1), "r" (val2));
    return result;
}

static inline unsigned int __clz(unsigned int val1)
{
    unsigned int result;
    __asm__ volatile ("clz %0, %1\n\t"
                      : "=r" (result)
                      : "r" (val1));
    return result;
}
#endif

#ifdef __cplusplus
extern "C" {
#endif  //__cplusplus

#if defined(_MSC_VER) && _MSC_VER >= 1400
#   define h264e_restrict __restrict
#elif defined(__arm__)
#   define h264e_restrict __restrict
#else
#   define h264e_restrict
#endif
#if defined(_MSC_VER)
#   define ALIGN(n) __declspec(align(n))
#   define ALIGN2(n)
#else
#   define ALIGN(n)
#   define ALIGN2(n) __attribute__((aligned(n)))
#endif

#if __GNUC__ || __clang__
typedef int int_u __attribute__ ((__aligned__ (1)));
#else
typedef int int_u;
#endif

#ifndef MAX
#   define MAX(x, y) ((x) > (y) ? (x) : (y))
#endif

#ifndef MIN
#   define MIN(x, y) ((x) < (y) ? (x) : (y))
#endif

#ifndef ABS
#   define ABS(x)    ((x) >= 0 ? (x) : -(x))
#endif

#define IS_ALIGNED(p, n) (!((uintptr_t)(p) & (uintptr_t)((n) - 1)))

// bit-stream
#if __BYTE_ORDER == __BIG_ENDIAN
#   define SWAP32(x) (uint32_t)(x)
#else
#ifdef _MSC_VER
#   define SWAP32(x) _byteswap_ulong(x)
#elif defined(__GNUC__) || defined(__clang__)
#   define SWAP32(x) __builtin_bswap32(x)
#else
#   define SWAP32(x) (uint32_t)((((x) >> 24) & 0xFF) | (((x) >> 8) & 0xFF00) | (((x) << 8) & 0xFF0000) | ((x & 0xFF) << 24))
#endif
#endif

#define BS_OPEN(bs) uint32_t cache = bs->cache; int shift = bs->shift; uint32_t *buf = bs->buf;
#define BS_CLOSE(bs) bs->cache = cache; bs->shift = shift; bs->buf = buf;
#define BS_PUT(n, val)      \
if ((shift -= n) < 0)       \
{                           \
    cache |= val >> -shift; \
    *buf++ = SWAP32(cache); \
    shift += 32;            \
    cache = 0;              \
}                           \
cache |= (uint32_t)val << shift;

// Quantizer-dequantizer modes
#define QDQ_MODE_INTRA_4   2       // intra 4x4
#define QDQ_MODE_INTER     8       // inter
#define QDQ_MODE_INTRA_16  (8 + 1) // intra 16x16
#define QDQ_MODE_CHROMA    (4 + 1) // chroma

// put most frequently used bits to lsb, to use these as look-up tables
#define AVAIL_TR    8
#define AVAIL_TL    4
#define AVAIL_L     2
#define AVAIL_T     1

typedef uint8_t     pix_t;
typedef uint32_t    bs_item_t;

/**
*   Output bitstream
*/
typedef struct
{
    int         shift;  // bit position in the cache
    uint32_t    cache;  // bit cache
    bs_item_t    *buf;  // current position
    bs_item_t  *origin; // initial position
} bs_t;

/**
*   Tuple for motion vector, or height/width representation
*/
typedef union
{
    struct
    {
        int16_t x;      // horizontal or width
        int16_t y;      // vertical or height
    } s;
    int32_t u32;        // packed representation
} point_t;

/**
*   Rectangle
*/
typedef struct
{
    point_t tl;         // top-left corner
    point_t br;         // bottom-right corner
} rectangle_t;

/**
*   Quantized/dequantized representation for 4x4 block
*/
typedef struct
{
    int16_t qv[16];     // quantized coefficient
    int16_t dq[16];     // dequantized
} quant_t;

/**
*   Scratch RAM, used only for current MB encoding
*/
typedef struct H264E_scratch_tag
{
    pix_t mb_pix_inp[256];          // Input MB (cached)
    pix_t mb_pix_store[4*256];      // Prediction variants

    // Quantized/dequantized
    int16_t dcy[16];                // Y DC
    quant_t qy[16];                 // Y 16x4x4 blocks

    int16_t dcu[16];                // U DC: 4 used + align
    quant_t qu[4];                  // U 4x4x4 blocks

    int16_t dcv[16];                // V DC: 4 used + align
    quant_t qv[4];                  // V 4x4x4 blocks

    // Quantized DC:
    int16_t quant_dc[16];           // Y
    int16_t quant_dc_u[4];          // U
    int16_t quant_dc_v[4];          // V

    uint16_t nz_mask;               // Bit flags for non-zero 4x4 blocks
} scratch_t;

/**
*   Deblock filter frame context
*/
typedef struct
{
    // Motion vectors for 4x4 MB internal sub-blocks, top and left border,
    // 5x5 array without top-left cell:
    //     T0 T1 T2 T3
    //  L0 i0 i1 i2 i3
    //  L1 ...
    //  ......
    //
    point_t df_mv[5*5 - 1];         // MV for current macroblock and neighbors
    uint8_t *df_qp;                 // QP for current row of macroblocks
    int8_t *mb_type;                // Macroblock type for current row of macroblocks
    uint32_t nzflag;                // Bit flags for non-zero 4x4 blocks (left neighbors)

    // Huffman and deblock use different nnz...
    uint8_t *df_nzflag;             // Bit flags for non-zero 4x4 blocks (top neighbors), only 4 bits used
} deblock_filter_t;

/**
*    Deblock filter parameters for current MB
*/
typedef struct
{
    uint32_t strength32[4*2];       // Strength for 4 columns and 4 rows
    uint8_t tc0[16*2];              // TC0 parameter for 4 columns and 4 rows
    uint8_t alpha[2*2];             // alpha for border/internals
    uint8_t beta[2*2];              // beta for border/internals
} deblock_params_t;

/**
*   Persistent RAM
*/
typedef struct H264E_persist_tag
{
    H264E_create_param_t param;     // Copy of create parameters
    H264E_io_yuv_t inp;             // Input picture

    struct
    {
        int pic_init_qp;            // Initial QP
    } sps;

    struct
    {
        int num;                    // Frame number
        int nmbx;                   // Frame width, macroblocks
        int nmby;                   // Frame height, macroblocks
        int nmb;                    // Number of macroblocks in frame
        int w;                      // Frame width, pixels
        int h;                      // Frame height, pixels
        rectangle_t mv_limit;       // Frame MV limits = frame + border extension
        rectangle_t mv_qpel_limit;  // Reduced MV limits for qpel interpolation filter
        int cropping_flag;          // Cropping indicator
    } frame;

    struct
    {
        int type;                   // Current slice type (I/P)
        int start_mb_num;           // # of 1st MB in the current slice
    } slice;

    struct
    {
        int x;                      // MB x position (in MB's)
        int y;                      // MB y position (in MB's)
        int num;                    // MB number
        int skip_run;               // Skip run count

        // according to table 7-13
        // -1 = skip, 0 = P16x16, 1 = P16x8, 2 = P8x16, 3 = P8x8, 5 = I4x4, >= 6 = I16x16
        int type;                   // MB type

        struct
        {
            int pred_mode_luma;     // Intra 16x16 prediction mode
        } i16;

        int8_t i4x4_mode[16];       // Intra 4x4 prediction modes

        int cost;                   // Best coding cost
        int avail;                  // Neighbor availability flags
        point_t mvd[16];            // Delta-MV for each 4x4 sub-part
        point_t mv[16];             // MV for each 4x4 sub-part

        point_t mv_skip_pred;       // Skip MV predictor
    } mb;

    H264E_io_yuv_t ref;             // Current reference picture
    H264E_io_yuv_t dec;             // Reconstructed current macroblock
#if H264E_ENABLE_DENOISE
    H264E_io_yuv_t denoise;         // Noise suppression filter
#endif

    unsigned char *lt_yuv[MAX_LONG_TERM_FRAMES][3]; // Long-term reference pictures
    unsigned char lt_used[MAX_LONG_TERM_FRAMES];    // Long-term "used" flags

    struct
    {
        int qp;                     // Current QP
        int vbv_bits;               // Current VBV fullness, bits
        int qp_smooth;              // Averaged QP
        int dqp_smooth;             // Adaptive QP adjustment, account for "compressibility"
        int max_dqp;                // Worst-case DQP, for long-term reference QP adjustment

        int bit_budget;             // Frame bit budget
        int prev_qp;                // Previous MB QP
        int prev_err;               // Accumulated coded size error
        int stable_count;           // Stable/not stable state machine

        int vbv_target_level;       // Desired VBV fullness after frame encode

        // Quantizer data, passed to low-level functions
        // layout:
        // multiplier_quant0, multiplier_dequant0,
        // multiplier_quant2, multiplier_dequant2,
        // multiplier_quant1, multiplier_dequant1,
        // rounding_factor_pos,
        // zero_thr_inter
        // zero_thr_inter2
        // ... and same data for chroma
        //uint16_t qdat[2][(6 + 4)];
#define OFFS_RND_INTER 6
#define OFFS_RND_INTRA 7
#define OFFS_THR_INTER 8
#define OFFS_THR2_INTER 9
#define OFFS_THR_1_OFF 10
#define OFFS_THR_2_OFF 18
#define OFFS_QUANT_VECT 26
#define OFFS_DEQUANT_VECT 34
        //struct
        //{
        //    uint16_t qdq[6];
        //    uint16_t rnd[2]; // inter/intra
        //    uint16_t thr[2]; // thresholds
        //    uint16_t zero_thr[2][8];
        //    uint16_t qfull[8];
        //    uint16_t dqfull[8];
        //} qdat[2];
        uint16_t qdat[2][6 + 2 + 2 + 8 + 8 + 8 + 8];
    } rc;

    deblock_filter_t df;            // Deblock filter

    // Speed/quality trade-off
    struct
    {
        int disable_deblock;        // Disable deblock filter flags
    } speed;

    int most_recent_ref_frame_idx;  // Last updated long-term reference

    // Predictor contexts
    point_t *mv_pred;               // MV for left&top 4x4 blocks
    uint8_t *nnz;                   // Number of non-zero coeffs per 4x4 block for left&top
    int32_t *i4x4mode;              // Intra 4x4 mode for left&top
    pix_t *top_line;                // left&top neighbor pixels

    // output data
    uint8_t *out;                   // Output data storage (pointer to scratch RAM!)
    unsigned int out_pos;           // Output byte position
    bs_t bs[1];                     // Output bitbuffer

    scratch_t *scratch;             // Pointer to scratch RAM
#if H264E_MAX_THREADS > 1
    scratch_t *scratch_store[H264E_MAX_THREADS];   // Per-thread scratch RAM pointers
    int sizeof_scaratch;                           // Size of one scratch area
#endif
    H264E_run_param_t run_param;    // Copy of run-time parameters

    // Consecutive IDRs must have different idr_pic_id values,
    // unless P-frames occur between them
    uint8_t next_idr_pic_id;

    pix_t *pbest;                   // Macroblock best predictor
    pix_t *ptest;                   // Macroblock predictor under test

    point_t mv_clusters[2];         // MV clusterization for prediction

    // Flag to track short-term reference buffer, for MMCO 1 command
    int short_term_used;

#if H264E_SVC_API
    // SVC extension
    int   current_layer;
    int   adaptive_base_mode_flag;
    void *enc_next;
#endif

} h264e_enc_t;

#ifdef __cplusplus
}
#endif //__cplusplus
/************************************************************************/
/*      Constants                                                       */
/************************************************************************/

// Tunable constants can be adjusted by the "training" application
#ifndef ADJUSTABLE
#   define ADJUSTABLE static const
#endif

// Huffman encode tables
#define CODE8(val, len) (uint8_t)((val << 4) + len)
#define CODE(val, len) (uint8_t)((val << 4) + (len - 1))

const uint8_t h264e_g_run_before[57] =
{
    15, 17, 20, 24, 29, 35, 42, 42, 42, 42, 42, 42, 42, 42, 42,
    /**** Table #  0 size  2 ****/
    CODE8(1, 1), CODE8(0, 1),
    /**** Table #  1 size  3 ****/
    CODE8(1, 1), CODE8(1, 2), CODE8(0, 2),
    /**** Table #  2 size  4 ****/
    CODE8(3, 2), CODE8(2, 2), CODE8(1, 2), CODE8(0, 2),
    /**** Table #  3 size  5 ****/
    CODE8(3, 2), CODE8(2, 2), CODE8(1, 2), CODE8(1, 3), CODE8(0, 3),
    /**** Table #  4 size  6 ****/
    CODE8(3, 2), CODE8(2, 2), CODE8(3, 3), CODE8(2, 3), CODE8(1, 3), CODE8(0, 3),
    /**** Table #  5 size  7 ****/
    CODE8(3, 2), CODE8(0, 3), CODE8(1, 3), CODE8(3, 3), CODE8(2, 3), CODE8(5, 3), CODE8(4, 3),
    /**** Table #  6 size 15 ****/
    CODE8(7, 3), CODE8(6, 3), CODE8(5, 3), CODE8(4, 3), CODE8(3, 3), CODE8(2,  3), CODE8(1,  3), CODE8(1, 4),
    CODE8(1, 5), CODE8(1, 6), CODE8(1, 7), CODE8(1, 8), CODE8(1, 9), CODE8(1, 10), CODE8(1, 11),
};

const uint8_t h264e_g_total_zeros_cr_2x2[12] =
{
    3, 7, 10,
    /**** Table #  0 size  4 ****/
    CODE8(1, 1), CODE8(1, 2), CODE8(1, 3), CODE8(0, 3),
    /**** Table #  1 size  3 ****/
    CODE8(1, 1), CODE8(1, 2), CODE8(0, 2),
    /**** Table #  2 size  2 ****/
    CODE8(1, 1), CODE8(0, 1),
};

const uint8_t h264e_g_total_zeros[150] =
{
    15, 31, 46, 60, 73, 85, 96, 106, 115, 123, 130, 136, 141, 145, 148,
    /**** Table #  0 size 16 ****/
    CODE8(1, 1), CODE8(3, 3), CODE8(2, 3), CODE8(3, 4), CODE8(2, 4), CODE8(3, 5), CODE8(2, 5), CODE8(3, 6),
    CODE8(2, 6), CODE8(3, 7), CODE8(2, 7), CODE8(3, 8), CODE8(2, 8), CODE8(3, 9), CODE8(2, 9), CODE8(1, 9),
    /**** Table #  1 size 15 ****/
    CODE8(7, 3), CODE8(6, 3), CODE8(5, 3), CODE8(4, 3), CODE8(3, 3), CODE8(5, 4), CODE8(4, 4), CODE8(3, 4),
    CODE8(2, 4), CODE8(3, 5), CODE8(2, 5), CODE8(3, 6), CODE8(2, 6), CODE8(1, 6), CODE8(0, 6),
    /**** Table #  2 size 14 ****/
    CODE8(5, 4), CODE8(7, 3), CODE8(6, 3), CODE8(5, 3), CODE8(4, 4), CODE8(3, 4), CODE8(4, 3), CODE8(3, 3),
    CODE8(2, 4), CODE8(3, 5), CODE8(2, 5), CODE8(1, 6), CODE8(1, 5), CODE8(0, 6),
    /**** Table #  3 size 13 ****/
    CODE8(3, 5), CODE8(7, 3), CODE8(5, 4), CODE8(4, 4), CODE8(6, 3), CODE8(5, 3), CODE8(4, 3), CODE8(3, 4),
    CODE8(3, 3), CODE8(2, 4), CODE8(2, 5), CODE8(1, 5), CODE8(0, 5),
    /**** Table #  4 size 12 ****/
    CODE8(5, 4), CODE8(4, 4), CODE8(3, 4), CODE8(7, 3), CODE8(6, 3), CODE8(5, 3), CODE8(4, 3), CODE8(3, 3),
    CODE8(2, 4), CODE8(1, 5), CODE8(1, 4), CODE8(0, 5),
    /**** Table #  5 size 11 ****/
    CODE8(1, 6), CODE8(1, 5), CODE8(7, 3), CODE8(6, 3), CODE8(5, 3), CODE8(4, 3), CODE8(3, 3), CODE8(2, 3),
    CODE8(1, 4), CODE8(1, 3), CODE8(0, 6),
    /**** Table #  6 size 10 ****/
    CODE8(1, 6), CODE8(1, 5), CODE8(5, 3), CODE8(4, 3), CODE8(3, 3), CODE8(3, 2), CODE8(2, 3), CODE8(1, 4),
    CODE8(1, 3), CODE8(0, 6),
    /**** Table #  7 size  9 ****/
    CODE8(1, 6), CODE8(1, 4), CODE8(1, 5), CODE8(3, 3), CODE8(3, 2), CODE8(2, 2), CODE8(2, 3), CODE8(1, 3),
    CODE8(0, 6),
    /**** Table #  8 size  8 ****/
    CODE8(1, 6), CODE8(0, 6), CODE8(1, 4), CODE8(3, 2), CODE8(2, 2), CODE8(1, 3), CODE8(1, 2), CODE8(1, 5),
    /**** Table #  9 size  7 ****/
    CODE8(1, 5), CODE8(0, 5), CODE8(1, 3), CODE8(3, 2), CODE8(2, 2), CODE8(1, 2), CODE8(1, 4),
    /**** Table # 10 size  6 ****/
    CODE8(0, 4), CODE8(1, 4), CODE8(1, 3), CODE8(2, 3), CODE8(1, 1), CODE8(3, 3),
    /**** Table # 11 size  5 ****/
    CODE8(0, 4), CODE8(1, 4), CODE8(1, 2), CODE8(1, 1), CODE8(1, 3),
    /**** Table # 12 size  4 ****/
    CODE8(0, 3), CODE8(1, 3), CODE8(1, 1), CODE8(1, 2),
    /**** Table # 13 size  3 ****/
    CODE8(0, 2), CODE8(1, 2), CODE8(1, 1),
    /**** Table # 14 size  2 ****/
    CODE8(0, 1), CODE8(1, 1),
};

const uint8_t h264e_g_coeff_token[277 + 18] =
{
    17 + 18, 17 + 18,
    82 + 18, 82 + 18,
    147 + 18, 147 + 18, 147 + 18, 147 + 18,
    212 + 18, 212 + 18, 212 + 18, 212 + 18, 212 + 18, 212 + 18, 212 + 18, 212 + 18, 212 + 18,
    0 + 18,
    /**** Table #  4 size 17 ****/     // offs: 0
    CODE(1, 2), CODE(1, 1), CODE(1, 3), CODE(5, 6), CODE(7, 6), CODE(6, 6), CODE(2, 7), CODE(0, 7), CODE(4, 6),
    CODE(3, 7), CODE(2, 8), CODE(0, 0), CODE(3, 6), CODE(3, 8), CODE(0, 0), CODE(0, 0), CODE(2, 6),
    /**** Table #  0 size 65 ****/     // offs: 17
    CODE( 1,  1), CODE( 1,  2), CODE( 1,  3), CODE( 3,  5), CODE( 5,  6), CODE( 4,  6), CODE( 5,  7), CODE( 3,  6),
    CODE( 7,  8), CODE( 6,  8), CODE( 5,  8), CODE( 4,  7), CODE( 7,  9), CODE( 6,  9), CODE( 5,  9), CODE( 4,  8),
    CODE( 7, 10), CODE( 6, 10), CODE( 5, 10), CODE( 4,  9), CODE( 7, 11), CODE( 6, 11), CODE( 5, 11), CODE( 4, 10),
    CODE(15, 13), CODE(14, 13), CODE(13, 13), CODE( 4, 11), CODE(11, 13), CODE(10, 13), CODE( 9, 13), CODE(12, 13),
    CODE( 8, 13), CODE(14, 14), CODE(13, 14), CODE(12, 14), CODE(15, 14), CODE(10, 14), CODE( 9, 14), CODE( 8, 14),
    CODE(11, 14), CODE(14, 15), CODE(13, 15), CODE(12, 15), CODE(15, 15), CODE(10, 15), CODE( 9, 15), CODE( 8, 15),
    CODE(11, 15), CODE( 1, 15), CODE(13, 16), CODE(12, 16), CODE(15, 16), CODE(14, 16), CODE( 9, 16), CODE( 8, 16),
    CODE(11, 16), CODE(10, 16), CODE( 5, 16), CODE( 0,  0), CODE( 7, 16), CODE( 6, 16), CODE( 0,  0), CODE( 0,  0), CODE( 4, 16),
    /**** Table #  1 size 65 ****/     // offs: 82
    CODE( 3,  2), CODE( 2,  2), CODE( 3,  3), CODE( 5,  4), CODE(11,  6), CODE( 7,  5), CODE( 9,  6), CODE( 4,  4),
    CODE( 7,  6), CODE(10,  6), CODE( 5,  6), CODE( 6,  5), CODE( 7,  7), CODE( 6,  6), CODE( 5,  7), CODE( 8,  6),
    CODE( 7,  8), CODE( 6,  7), CODE( 5,  8), CODE( 4,  6), CODE( 4,  8), CODE( 6,  8), CODE( 5,  9), CODE( 4,  7),
    CODE( 7,  9), CODE( 6,  9), CODE(13, 11), CODE( 4,  9), CODE(15, 11), CODE(14, 11), CODE( 9, 11), CODE(12, 11),
    CODE(11, 11), CODE(10, 11), CODE(13, 12), CODE( 8, 11), CODE(15, 12), CODE(14, 12), CODE( 9, 12), CODE(12, 12),
    CODE(11, 12), CODE(10, 12), CODE(13, 13), CODE(12, 13), CODE( 8, 12), CODE(14, 13), CODE( 9, 13), CODE( 8, 13),
    CODE(15, 13), CODE(10, 13), CODE( 6, 13), CODE( 1, 13), CODE(11, 13), CODE(11, 14), CODE(10, 14), CODE( 4, 14),
    CODE( 7, 13), CODE( 8, 14), CODE( 5, 14), CODE( 0,  0), CODE( 9, 14), CODE( 6, 14), CODE( 0,  0), CODE( 0,  0), CODE( 7, 14),
    /**** Table #  2 size 65 ****/     // offs: 147
    CODE(15,  4), CODE(14,  4), CODE(13,  4), CODE(12,  4), CODE(15,  6), CODE(15,  5), CODE(14,  5), CODE(11,  4),
    CODE(11,  6), CODE(12,  5), CODE(11,  5), CODE(10,  4), CODE( 8,  6), CODE(10,  5), CODE( 9,  5), CODE( 9,  4),
    CODE(15,  7), CODE( 8,  5), CODE(13,  6), CODE( 8,  4), CODE(11,  7), CODE(14,  6), CODE( 9,  6), CODE(13,  5),
    CODE( 9,  7), CODE(10,  6), CODE(13,  7), CODE(12,  6), CODE( 8,  7), CODE(14,  7), CODE(10,  7), CODE(12,  7),
    CODE(15,  8), CODE(14,  8), CODE(13,  8), CODE(12,  8), CODE(11,  8), CODE(10,  8), CODE( 9,  8), CODE( 8,  8),
    CODE(15,  9), CODE(14,  9), CODE(13,  9), CODE(12,  9), CODE(11,  9), CODE(10,  9), CODE( 9,  9), CODE(10, 10),
    CODE( 8,  9), CODE( 7,  9), CODE(11, 10), CODE( 6, 10), CODE(13, 10), CODE(12, 10), CODE( 7, 10), CODE( 2, 10),
    CODE( 9, 10), CODE( 8, 10), CODE( 3, 10), CODE( 0,  0), CODE( 5, 10), CODE( 4, 10), CODE( 0,  0), CODE( 0,  0), CODE( 1, 10),
    /**** Table #  3 size 65 ****/     // offs: 212
     3,  1,  6, 11,  0,  5, 10, 15,  4,  9, 14, 19,  8, 13, 18, 23, 12, 17, 22, 27, 16, 21, 26, 31, 20, 25, 30, 35,
    24, 29, 34, 39, 28, 33, 38, 43, 32, 37, 42, 47, 36, 41, 46, 51, 40, 45, 50, 55, 44, 49, 54, 59, 48, 53, 58, 63,
    52, 57, 62,  0, 56, 61,  0,  0, 60
};

/*
    Block scan order
    0 1 4 5
    2 3 6 7
    8 9 C D
    A B E F
*/
static const uint8_t decode_block_scan[16] = { 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15 };

static const uint8_t qpy2qpc[52] = {  // todo: [0 - 9] not used
    0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,
   13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
   26, 27, 28, 29, 29, 30, 31, 32, 32, 33, 34, 34, 35,
   35, 36, 36, 37, 37, 37, 38, 38, 38, 39, 39, 39, 39,
};

/**
*   Rate-control LUT for intra/inter macroblocks: number of bits per macroblock for given QP
*   Estimated experimentally
*/
static const uint16_t bits_per_mb[2][42 - 1] =
{
    // 10                                                          20                                                          30                                                          40                                                          50
    { 664,  597,  530,  484,  432,  384,  341,  297,  262,  235,  198,  173,  153,  131,  114,  102,   84,   74,   64,   54,   47,   42,   35,   31,   26,   22,   20,   17,   15,   13,   12,   10,    9,    9,    7,    7,    6,    5,    4,    1,    1}, // P
    {1057,  975,  925,  868,  803,  740,  694,  630,  586,  547,  496,  457,  420,  378,  345,  318,  284,  258,  234,  210,  190,  178,  155,  141,  129,  115,  102,   95,   82,   75,   69,   60,   55,   51,   45,   41,   40,   35,   31,   28,   24}  // I
};

/**
*   Deblock filter constants:
*   <alpha> <thr[1]> <thr[2]> <thr[3]> <beta>
*/
static const uint8_t g_a_tc0_b[52 - 10][5] = {
    {  0,  0,  0,  0,  0},  // 10
    {  0,  0,  0,  0,  0},  // 11
    {  0,  0,  0,  0,  0},  // 12
    {  0,  0,  0,  0,  0},  // 13
    {  0,  0,  0,  0,  0},  // 14
    {  0,  0,  0,  0,  0},  // 15
    {  4,  0,  0,  0,  2},
    {  4,  0,  0,  1,  2},
    {  5,  0,  0,  1,  2},
    {  6,  0,  0,  1,  3},
    {  7,  0,  0,  1,  3},
    {  8,  0,  1,  1,  3},
    {  9,  0,  1,  1,  3},
    { 10,  1,  1,  1,  4},
    { 12,  1,  1,  1,  4},
    { 13,  1,  1,  1,  4},
    { 15,  1,  1,  1,  6},
    { 17,  1,  1,  2,  6},
    { 20,  1,  1,  2,  7},
    { 22,  1,  1,  2,  7},
    { 25,  1,  1,  2,  8},
    { 28,  1,  2,  3,  8},
    { 32,  1,  2,  3,  9},
    { 36,  2,  2,  3,  9},
    { 40,  2,  2,  4, 10},
    { 45,  2,  3,  4, 10},
    { 50,  2,  3,  4, 11},
    { 56,  3,  3,  5, 11},
    { 63,  3,  4,  6, 12},
    { 71,  3,  4,  6, 12},
    { 80,  4,  5,  7, 13},
    { 90,  4,  5,  8, 13},
    {101,  4,  6,  9, 14},
    {113,  5,  7, 10, 14},
    {127,  6,  8, 11, 15},
    {144,  6,  8, 13, 15},
    {162,  7, 10, 14, 16},
    {182,  8, 11, 16, 16},
    {203,  9, 12, 18, 17},
    {226, 10, 13, 20, 17},
    {255, 11, 15, 23, 18},
    {255, 13, 17, 25, 18},
};

/************************************************************************/
/*  Adjustable encoder parameters. Initial MIN_QP values never used     */
/************************************************************************/

ADJUSTABLE uint16_t g_rnd_inter[] = {
    11665, 11665, 11665, 11665, 11665, 11665, 11665, 11665, 11665, 11665,
    11665, 12868, 14071, 15273, 16476,
    17679, 17740, 17801, 17863, 17924,
    17985, 17445, 16904, 16364, 15823,
    15283, 15198, 15113, 15027, 14942,
    14857, 15667, 16478, 17288, 18099,
    18909, 19213, 19517, 19822, 20126,
    20430, 16344, 12259, 8173, 4088,
    4088, 4088, 4088, 4088, 4088,
    4088, 4088,
};

ADJUSTABLE uint16_t g_thr_inter[] = {
    31878, 31878, 31878, 31878, 31878, 31878, 31878, 31878, 31878, 31878,
    31878, 33578, 35278, 36978, 38678,
    40378, 41471, 42563, 43656, 44748,
    45841, 46432, 47024, 47615, 48207,
    48798, 49354, 49911, 50467, 51024,
    51580, 51580, 51580, 51580, 51580,
    51580, 52222, 52864, 53506, 54148,
    54790, 45955, 37120, 28286, 19451,
    10616, 9326, 8036, 6745, 5455,
    4165, 4165,
};

ADJUSTABLE uint16_t g_thr_inter2[] = {
    45352, 45352, 45352, 45352, 45352, 45352, 45352, 45352, 45352, 45352,
    45352, 41100, 36848, 32597, 28345,
    24093, 25904, 27715, 29525, 31336,
    33147, 33429, 33711, 33994, 34276,
    34558, 32902, 31246, 29590, 27934,
    26278, 26989, 27700, 28412, 29123,
    29834, 29038, 28242, 27445, 26649,
    25853, 23440, 21028, 18615, 16203,
    13790, 11137, 8484, 5832, 3179,
    526, 526,
};

ADJUSTABLE uint16_t g_skip_thr_inter[52] =
{
    45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
    45, 45, 45, 44, 44,
    44, 40, 37, 33, 30,
    26, 32, 38, 45, 51,
    57, 58, 58, 59, 59,
    60, 66, 73, 79, 86,
    92, 95, 98, 100, 103,
    106, 200, 300, 400, 500,
    600, 700, 800, 900, 1000,
    1377, 1377,
};

ADJUSTABLE uint16_t g_lambda_q4[52] =
{
    14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
    14, 13, 11, 10, 8,
    7, 11, 15, 20, 24,
    28, 30, 31, 33, 34,
    36, 48, 60, 71, 83,
    95, 95, 95, 96, 96,
    96, 113, 130, 147, 164,
    181, 401, 620, 840, 1059,
    1279, 1262, 1246, 1229, 1213,
    1196, 1196,
};
ADJUSTABLE uint16_t g_lambda_mv_q4[52] =
{
    13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
    13, 14, 15, 15, 16,
    17, 18, 20, 21, 23,
    24, 28, 32, 37, 41,
    45, 53, 62, 70, 79,
    87, 105, 123, 140, 158,
    176, 195, 214, 234, 253,
    272, 406, 541, 675, 810,
    944, 895, 845, 796, 746,
    697, 697,
};

ADJUSTABLE uint16_t g_skip_thr_i4x4[52] =
{
    0,1,2,3,4,5,6,7,8,9,
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
    44, 44, 44, 44, 44, 44, 44, 44, 44, 44,
    68, 68, 68, 68, 68, 68, 68, 68, 68, 68,
    100, 100,
};

ADJUSTABLE uint16_t g_deadzonei[] = {
    3419, 3419, 3419, 3419, 3419, 3419, 3419, 3419, 3419, 3419,
    30550, 8845, 14271, 19698, 25124,
    30550, 29556, 28562, 27569, 26575,
    25581, 25284, 24988, 24691, 24395,
    24098, 24116, 24134, 24153, 24171,
    24189, 24010, 23832, 23653, 23475,
    23296, 23569, 23842, 24115, 24388,
    24661, 19729, 14797, 9865, 4933,
    24661, 3499, 6997, 10495, 13993,
    17491, 17491,
};

ADJUSTABLE uint16_t g_lambda_i4_q4[] = {
    27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
    27, 31, 34, 38, 41,
    45, 76, 106, 137, 167,
    198, 220, 243, 265, 288,
    310, 347, 384, 421, 458,
    495, 584, 673, 763, 852,
    941, 1053, 1165, 1276, 1388,
    1500, 1205, 910, 614, 319,
    5000, 1448, 2872, 4296, 5720,
    7144, 7144,
};

ADJUSTABLE uint16_t g_lambda_i16_q4[] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0,
    0, 3, 7, 10, 14,
    17, 14, 10, 7, 3,
    50, 20, 39, 59, 78,
    98, 94, 89, 85, 80,
    76, 118, 161, 203, 246,
    288, 349, 410, 470, 531,
    592, 575, 558, 540, 523,
    506, 506,
};

const uint8_t g_diff_to_gainQ8[256] =
{
    0, 16, 25, 32, 37, 41, 44, 48, 50, 53, 55, 57, 59, 60, 62, 64, 65,
    66, 67, 69, 70, 71, 72, 73, 74, 75, 76, 76, 77, 78, 79, 80, 80,
    81, 82, 82, 83, 83, 84, 85, 85, 86, 86, 87, 87, 88, 88, 89, 89,
    90, 90, 91, 91, 92, 92, 92, 93, 93, 94, 94, 94, 95, 95, 96, 96,
    96, 97, 97, 97, 98, 98, 98, 99, 99, 99, 99, 100, 100, 100, 101, 101,
    101, 102, 102, 102, 102, 103, 103, 103, 103, 104, 104, 104, 104, 105, 105, 105,
    105, 106, 106, 106, 106, 106, 107, 107, 107, 107, 108, 108, 108, 108, 108, 109,
    109, 109, 109, 109, 110, 110, 110, 110, 110, 111, 111, 111, 111, 111, 112, 112,
    112, 112, 112, 112, 113, 113, 113, 113, 113, 113, 114, 114, 114, 114, 114, 114,
    115, 115, 115, 115, 115, 115, 115, 116, 116, 116, 116, 116, 116, 117, 117, 117,
    117, 117, 117, 117, 118, 118, 118, 118, 118, 118, 118, 118, 119, 119, 119, 119,
    119, 119, 119, 119, 120, 120, 120, 120, 120, 120, 120, 120, 121, 121, 121, 121,
    121, 121, 121, 121, 122, 122, 122, 122, 122, 122, 122, 122, 122, 123, 123, 123,
    123, 123, 123, 123, 123, 123, 124, 124, 124, 124, 124, 124, 124, 124, 124, 125,
    125, 125, 125, 125, 125, 125, 125, 125, 125, 126, 126, 126, 126, 126, 126, 126,
    126, 126, 126, 126, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 128,
};

#if H264E_ENABLE_SSE2 && !defined(MINIH264_ASM)
#define BS_BITS 32

static void h264e_bs_put_bits_sse2(bs_t *bs, unsigned n, unsigned val)
{
    assert(!(val >> n));
    bs->shift -= n;
    assert((unsigned)n <= 32);
    if (bs->shift < 0)
    {
        assert(-bs->shift < 32);
        bs->cache |= val >> -bs->shift;
        *bs->buf++ = SWAP32(bs->cache);
        bs->shift = 32 + bs->shift;
        bs->cache = 0;
    }
    bs->cache |= val << bs->shift;
}

static void h264e_bs_flush_sse2(bs_t *bs)
{
    *bs->buf = SWAP32(bs->cache);
}

static unsigned h264e_bs_get_pos_bits_sse2(const bs_t *bs)
{
    unsigned pos_bits = (unsigned)((bs->buf - bs->origin)*BS_BITS);
    pos_bits += BS_BITS - bs->shift;
    assert((int)pos_bits >= 0);
    return pos_bits;
}

static unsigned h264e_bs_byte_align_sse2(bs_t *bs)
{
    int pos = h264e_bs_get_pos_bits_sse2(bs);
    h264e_bs_put_bits_sse2(bs, -pos & 7, 0);
    return pos + (-pos & 7);
}

/**
*   Golomb code
*   0 => 1
*   1 => 01 0
*   2 => 01 1
*   3 => 001 00
*   4 => 001 01
*
*   [0]     => 1
*   [1..2]  => 01x
*   [3..6]  => 001xx
*   [7..14] => 0001xxx
*
*/
static void h264e_bs_put_golomb_sse2(bs_t *bs, unsigned val)
{
    int size;
#if defined(_MSC_VER)
    unsigned long nbit;
    _BitScanReverse(&nbit, val + 1);
    size = 1 + nbit;
#else
    size = 32 - __builtin_clz(val + 1);
#endif
    h264e_bs_put_bits_sse2(bs, 2*size - 1, val + 1);
}

/**
*   signed Golomb code.
*   mapping to unsigned code:
*       0 => 0
*       1 => 1
*      -1 => 2
*       2 => 3
*      -2 => 4
*       3 => 5
*      -3 => 6
*/
static void h264e_bs_put_sgolomb_sse2(bs_t *bs, int val)
{
    val = 2*val - 1;
    val ^= val >> 31;
    h264e_bs_put_golomb_sse2(bs, val);
}

static void h264e_bs_init_bits_sse2(bs_t *bs, void *data)
{
    bs->origin = data;
    bs->buf = bs->origin;
    bs->shift = BS_BITS;
    bs->cache = 0;
}

static unsigned __clz_cavlc(unsigned v)
{
#if defined(_MSC_VER)
    unsigned long nbit;
    _BitScanReverse(&nbit, v);
    return 31 - nbit;
#else
    return __builtin_clz(v);
#endif
}

static void h264e_vlc_encode_sse2(bs_t *bs, int16_t *quant, int maxNumCoeff, uint8_t *nz_ctx)
{
    int nnz_context, nlevels, nnz; // nnz = nlevels + trailing_ones
    unsigned trailing_ones = 0;
    unsigned trailing_ones_sign = 0;
    uint8_t runs[16];
    uint8_t *prun = runs;
    int16_t *levels;
    int cloop = maxNumCoeff;
    int v, drun;
    unsigned zmask;
    BS_OPEN(bs)

    ALIGN(16) int16_t zzquant[16] ALIGN2(16);
    levels = zzquant + ((maxNumCoeff == 4) ? 4 : 16);
    if (maxNumCoeff != 4)
    {
        __m128i y0, y1;
        __m128i x0 = _mm_load_si128((__m128i *)quant);
        __m128i x1 = _mm_load_si128((__m128i *)(quant + 8));
#define SWAP_XMM(x, i, j)     { int t0 = _mm_extract_epi16(x, i); int t1 = _mm_extract_epi16(x, j); x = _mm_insert_epi16(x, t0, j); x = _mm_insert_epi16(x, t1, i); }
#define SWAP_XMM2(x, y, i, j) { int t0 = _mm_extract_epi16(x, i); int t1 = _mm_extract_epi16(y, j); y = _mm_insert_epi16(y, t0, j); x = _mm_insert_epi16(x, t1, i); }
        SWAP_XMM(x0, 3, 4);
        SWAP_XMM(x1, 3, 4);
        SWAP_XMM2(x0, x1, 5, 2);
        x0 = _mm_shufflelo_epi16(x0, 0 + (3 << 2) + (1 << 4) + (2 << 6));
        x0 = _mm_shufflehi_epi16(x0, 2 + (0 << 2) + (3 << 4) + (1 << 6));
        x1 = _mm_shufflelo_epi16(x1, 2 + (0 << 2) + (3 << 4) + (1 << 6));
        x1 = _mm_shufflehi_epi16(x1, 1 + (2 << 2) + (0 << 4) + (3 << 6));
        y0 = _mm_unpacklo_epi64(x0, x1);
        y1 = _mm_unpackhi_epi64(x0, x1);
        y0 = _mm_slli_epi16(y0, 1);
        y1 = _mm_slli_epi16(y1, 1);
        zmask = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_packs_epi16(y0, y1), _mm_setzero_si128()));
        _mm_store_si128((__m128i *)zzquant, y0);
        _mm_store_si128((__m128i *)(zzquant + 8), y1);

        if (maxNumCoeff == 15)
            zmask |= 1;
        zmask = (~zmask) << 16;

        v = 15;
        drun = (maxNumCoeff == 16) ? 1 : 0;
    } else
    {
        __m128i x0 = _mm_loadl_epi64((__m128i *)quant);
        x0 = _mm_slli_epi16(x0, 1);
        zmask = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_packs_epi16(x0, x0), _mm_setzero_si128()));
        _mm_storel_epi64((__m128i *)zzquant, x0);
        zmask = (~zmask) << 28;
        drun = 1;
        v = 3;
    }

    if (zmask)
    {
        do
        {
            int i = __clz_cavlc(zmask);
            *--levels = zzquant[v -= i];
            *prun++ = (uint8_t)(v + drun);
            zmask <<= (i + 1);
            v--;
        } while(zmask);
        quant = zzquant + ((maxNumCoeff == 4) ? 4 : 16);
        nnz = (int)(quant - levels);

        cloop = MIN(3, nnz);
        levels = quant - 1;
        do
        {
            if ((unsigned)(*levels + 2) > 4u)
            {
                break;
            }
            trailing_ones_sign = (trailing_ones_sign << 1) | (*levels-- < 0);
            trailing_ones++;
        } while (--cloop);
    } else
    {
        nnz = trailing_ones = 0;
    }
    nlevels = nnz - trailing_ones;

    nnz_context = nz_ctx[-1] + nz_ctx[1];

    nz_ctx[0] = (uint8_t)nnz;
    if (nnz_context <= 34)
    {
        nnz_context = (nnz_context + 1) >> 1;
    }
    nnz_context &= 31;

    // 9.2.1 Parsing process for total number of transform coefficient levels and trailing ones
    {
        int off = h264e_g_coeff_token[nnz_context];
        unsigned n = 6, val = h264e_g_coeff_token[off + trailing_ones + 4*nlevels];
        if (off != 230)
        {
            n = (val & 15) + 1;
            val >>= 4;
        }
        BS_PUT(n, val);
    }

    if (nnz)
    {
        if (trailing_ones)
        {
            BS_PUT(trailing_ones, trailing_ones_sign);
        }
        if (nlevels)
        {
            int vlcnum = 1;
            int sym_len, prefix_len;

            int sym = *levels-- - 2;
            if (sym < 0) sym = -3 - sym;
            if (sym >= 6) vlcnum++;
            if (trailing_ones < 3)
            {
                sym -= 2;
                if (nnz > 10)
                {
                    sym_len = 1;
                    prefix_len = sym >> 1;
                    if (prefix_len >= 15)
                    {
                        // or vlcnum = 1;  goto escape;
                        prefix_len = 15;
                        sym_len = 12;
                    }
                    sym -= prefix_len << 1;
                    // bypass vlcnum advance due to sym -= 2; above
                    goto loop_enter;
                }
            }

            if (sym < 14)
            {
                prefix_len = sym;
                sym = 0; // to avoid side effect in bitbuf
                sym_len = 0;
            } else if (sym < 30)
            {
                prefix_len = 14;
                sym_len = 4;
                sym -= 14;
            } else
            {
                vlcnum = 1;
                goto escape;
            }
            goto loop_enter;

            for (;;)
            {
                sym_len = vlcnum;
                prefix_len = sym >> vlcnum;
                if (prefix_len >= 15)
                {
escape:
                    prefix_len = 15;
                    sym_len = 12;
                }
                sym -= prefix_len << vlcnum;

                if (prefix_len >= 3 && vlcnum < 6) vlcnum++;
loop_enter:
                sym |= 1 << sym_len;
                sym_len += prefix_len+1;
                BS_PUT(sym_len, (unsigned)sym);
                if (!--nlevels) break;
                sym = *levels-- - 2;
                if (sym < 0) sym = -3 - sym;
            }
        }

        if (nnz < maxNumCoeff)
        {
            const uint8_t *vlc = (maxNumCoeff == 4) ? h264e_g_total_zeros_cr_2x2 : h264e_g_total_zeros;
            uint8_t *run = runs;
            int run_prev = *run++;
            int nzeros = run_prev - nnz;
            int zeros_left = 2*nzeros - 1;
            int ctx = nnz - 1;
            run[nnz - 1] = (uint8_t)maxNumCoeff; // terminator
            for(;;)
            {
                int t;
                //encode_huff8(bs, vlc, ctx, nzeros);

                unsigned val = vlc[vlc[ctx] + nzeros];
                unsigned n = val & 15;
                val >>= 4;
                BS_PUT(n, val);

                zeros_left -= nzeros;
                if (zeros_left < 0)
                {
                    break;
                }

                t = *run++;
                nzeros = run_prev - t - 1;
                if (nzeros < 0)
                {
                    break;
                }
                run_prev = t;
                assert(zeros_left < 14);
                vlc = h264e_g_run_before;
                ctx = zeros_left;
            }
        }
    }
    BS_CLOSE(bs);
}

#define MM_LOAD_8TO16_2(p) _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i*)(p)), _mm_setzero_si128())
static __inline __m128i subabs128_16(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_subs_epu16(a, b), _mm_subs_epu16(b, a));
}
static __inline __m128i clone2x16(const void *p)
{
    __m128i tmp = MM_LOAD_8TO16_2(p);
    return _mm_unpacklo_epi16(tmp, tmp);
}
static __inline __m128i subabs128(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));
}

static void transpose8x8_sse(uint8_t *dst, int dst_stride, uint8_t *src, int src_stride)
{
    __m128i a = _mm_loadl_epi64((__m128i *)(src));
    __m128i b = _mm_loadl_epi64((__m128i *)(src += src_stride));
    __m128i c = _mm_loadl_epi64((__m128i *)(src += src_stride));
    __m128i d = _mm_loadl_epi64((__m128i *)(src += src_stride));
    __m128i e = _mm_loadl_epi64((__m128i *)(src += src_stride));
    __m128i f = _mm_loadl_epi64((__m128i *)(src += src_stride));
    __m128i g = _mm_loadl_epi64((__m128i *)(src += src_stride));
    __m128i h = _mm_loadl_epi64((__m128i *)(src += src_stride));

    __m128i p0 = _mm_unpacklo_epi8(a,b);  // b7 a7 b6 a6 ... b0 a0
    __m128i p1 = _mm_unpacklo_epi8(c,d);  // d7 c7 d6 c6 ... d0 c0
    __m128i p2 = _mm_unpacklo_epi8(e,f);  // f7 e7 f6 e6 ... f0 e0
    __m128i p3 = _mm_unpacklo_epi8(g,h);  // h7 g7 h6 g6 ... h0 g0

    __m128i q0 = _mm_unpacklo_epi16(p0, p1);  // d3c3 b3a3 ... d0c0 b0a0
    __m128i q1 = _mm_unpackhi_epi16(p0, p1);  // d7c7 b7a7 ... d4c4 b4a4
    __m128i q2 = _mm_unpacklo_epi16(p2, p3);  // h3g3 f3e3 ... h0g0 f0e0
    __m128i q3 = _mm_unpackhi_epi16(p2, p3);  // h7g7 f7e7 ... h4g4 f4e4

    __m128i r0 = _mm_unpacklo_epi32(q0, q2);  // h1g1f1e1 d1c1b1a1 h0g0f0e0 d0c0b0a0
    __m128i r1 = _mm_unpackhi_epi32(q0, q2);  // h3g3f3e3 d3c3b3a3 h2g2f2e2 d2c2b2a2
    __m128i r2 = _mm_unpacklo_epi32(q1, q3);
    __m128i r3 = _mm_unpackhi_epi32(q1, q3);
    _mm_storel_epi64((__m128i *)(dst), r0); dst += dst_stride; _mm_storel_epi64((__m128i *)(dst), _mm_unpackhi_epi64(r0, r0)); dst += dst_stride;
    _mm_storel_epi64((__m128i *)(dst), r1); dst += dst_stride; _mm_storel_epi64((__m128i *)(dst), _mm_unpackhi_epi64(r1, r1)); dst += dst_stride;
    _mm_storel_epi64((__m128i *)(dst), r2); dst += dst_stride; _mm_storel_epi64((__m128i *)(dst), _mm_unpackhi_epi64(r2, r2)); dst += dst_stride;
    _mm_storel_epi64((__m128i *)(dst), r3); dst += dst_stride; _mm_storel_epi64((__m128i *)(dst), _mm_unpackhi_epi64(r3, r3)); dst += dst_stride;
}
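The unpack cascade above is the classic SSE2 byte-matrix transpose: three rounds of interleaving (8-, 16-, then 32-bit granularity) move each source row into a column of the result. A plain scalar reference for the mapping it produces (illustrative only, not a minih264 symbol):

```c
#include <stdint.h>

/* dst[i][j] = src[j][i] for an 8x8 byte block, strides in bytes --
 * the same mapping transpose8x8_sse realizes with unpack instructions. */
void transpose8x8_ref(uint8_t *dst, int dst_stride,
                      const uint8_t *src, int src_stride)
{
    int i, j;
    for (i = 0; i < 8; i++)
        for (j = 0; j < 8; j++)
            dst[i * dst_stride + j] = src[j * src_stride + i];
}
```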

static void deblock_chroma_h_s4_sse(uint8_t *pq0, int stride, const void* threshold, int alpha, int beta, uint32_t argstr)
{
    __m128i thr, str, d;
    __m128i p1 = MM_LOAD_8TO16_2(pq0 - 2*stride);
    __m128i p0 = MM_LOAD_8TO16_2(pq0 - stride);
    __m128i q0 = MM_LOAD_8TO16_2(pq0);
    __m128i q1 = MM_LOAD_8TO16_2(pq0 + stride);
    __m128i zero = _mm_setzero_si128();
    __m128i _alpha = _mm_set1_epi16((short)alpha);
    __m128i _beta = _mm_set1_epi16((short)beta);
    __m128i tmp;

    str =                    _mm_cmplt_epi16(subabs128_16(p0, q0), _alpha);
    str = _mm_and_si128(str, _mm_cmplt_epi16(_mm_max_epi16(subabs128_16(p1, p0), subabs128_16(q1, q0)), _beta));

    if ((uint8_t)argstr != 4)
    {
        d = _mm_srai_epi16(_mm_add_epi16(_mm_sub_epi16(_mm_add_epi16(_mm_slli_epi16(_mm_sub_epi16(q0, p0), 2), p1), q1),_mm_set1_epi16(4)), 3);
        thr = _mm_add_epi16(clone2x16(threshold), _mm_set1_epi16(1));
        d = _mm_min_epi16(_mm_max_epi16(d, _mm_sub_epi16(zero, thr)), thr);

        tmp = _mm_unpacklo_epi8(_mm_cvtsi32_si128(argstr), _mm_setzero_si128());
        tmp = _mm_unpacklo_epi16(tmp, tmp);

        str = _mm_and_si128(str, _mm_cmpgt_epi16(tmp, zero));
        d = _mm_and_si128(str, d);
        p0 = _mm_add_epi16(p0, d);
        q0 = _mm_sub_epi16(q0, d);
    } else
    {
        __m128i pq = _mm_add_epi16(p1, q1);
        __m128i newp = _mm_srai_epi16(_mm_add_epi16(_mm_add_epi16(pq, p1), p0), 1);
        __m128i newq = _mm_srai_epi16(_mm_add_epi16(_mm_add_epi16(pq, q1), q0), 1);
        p0 = _mm_xor_si128(_mm_and_si128(_mm_xor_si128(_mm_avg_epu16(newp,zero), p0), str), p0);
        q0 = _mm_xor_si128(_mm_and_si128(_mm_xor_si128(_mm_avg_epu16(newq,zero), q0), str), q0);
    }
    _mm_storel_epi64((__m128i*)(pq0 - stride), _mm_packus_epi16(p0, zero));
    _mm_storel_epi64((__m128i*)(pq0         ), _mm_packus_epi16(q0, zero));
}
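The bS&lt;4 branch above vectorizes the standard H.264 chroma deblocking delta: delta = Clip3(-tc, tc, ((q0 - p0)*4 + p1 - q1 + 4) >> 3), with tc = tc0 + 1 for chroma; the final _mm_packus_epi16 supplies the 0..255 clamp. A one-pixel scalar sketch of that branch (helper names are hypothetical):

```c
/* Clip3 as written in the H.264 spec. */
int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* One-pixel scalar model of the bS<4 chroma filter above; the SIMD
 * code applies this to 8 lanes at once and clamps via _mm_packus_epi16. */
void chroma_filter_scalar(int *p0, int *q0, int p1, int q1, int tc0)
{
    int tc = tc0 + 1;  /* chroma uses tc0 + 1 */
    int d  = clip3(-tc, tc, ((*q0 - *p0) * 4 + p1 - q1 + 4) >> 3);
    *p0 = clip3(0, 255, *p0 + d);
    *q0 = clip3(0, 255, *q0 - d);
}
```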

static void deblock_chroma_v_s4_sse(uint8_t *pix, int stride, const void* threshold, int alpha, int beta, uint32_t str)
{
    uint8_t t8x4[8*4];
    int i;
    uint8_t *p = pix - 2;
    __m128i t0 =_mm_unpacklo_epi16(
        _mm_unpacklo_epi8(_mm_cvtsi32_si128(*(int_u*)p),              _mm_cvtsi32_si128(*(int_u*)(p + stride))),
        _mm_unpacklo_epi8(_mm_cvtsi32_si128(*(int_u*)(p + 2*stride)), _mm_cvtsi32_si128(*(int_u*)(p + 3*stride)))
        );
    __m128i t1 =_mm_unpacklo_epi16(
        _mm_unpacklo_epi8(_mm_cvtsi32_si128(*(int_u*)(p + 4*stride)), _mm_cvtsi32_si128(*(int_u*)(p + 5*stride))),
        _mm_unpacklo_epi8(_mm_cvtsi32_si128(*(int_u*)(p + 6*stride)), _mm_cvtsi32_si128(*(int_u*)(p + 7*stride)))
        );
    __m128i p1 = _mm_unpacklo_epi32(t0, t1);
    __m128i p0 = _mm_shuffle_epi32 (p1, 0x4E); // 01001110b
    __m128i q0 = _mm_unpackhi_epi32(t0, t1);
    __m128i q1 = _mm_shuffle_epi32 (q0, 0x4E);
    _mm_storel_epi64((__m128i*)(t8x4), p1);
    _mm_storel_epi64((__m128i*)(t8x4 + 8), p0);
    _mm_storel_epi64((__m128i*)(t8x4 + 16), q0);
    _mm_storel_epi64((__m128i*)(t8x4 + 24), q1);
    deblock_chroma_h_s4_sse(t8x4 + 16, 8, threshold, alpha, beta, str);

    for (i = 0; i < 8; i++)
    {
        pix[-1] = t8x4[8  + i];
        pix[ 0] = t8x4[16 + i];
        pix += stride;
    }
}

#define CMP_BETA(p, q, beta)   _mm_cmpeq_epi8(_mm_subs_epu8(_mm_subs_epu8(p, q), beta), _mm_subs_epu8(_mm_subs_epu8(q, p), beta))
#define CMP_1(p, q, beta)     (_mm_subs_epu8(subabs128(p, q), beta))
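CMP_BETA yields an all-ones byte lane exactly when |p - q| &lt;= beta: both saturating differences collapse to zero in that case, so _mm_cmpeq_epi8 compares equal; otherwise exactly one side stays nonzero. A scalar sketch of the predicate (hypothetical helper name):

```c
#include <stdint.h>

/* Scalar model of CMP_BETA: returns 1 where the macro yields a 0xFF
 * lane, i.e. exactly when |p - q| <= beta. At most one of pd/qd is
 * nonzero, so l == r can only hold when both sides saturate to zero. */
int cmp_beta_scalar(uint8_t p, uint8_t q, uint8_t beta)
{
    int pd = p > q ? p - q : 0;          /* sat_sub(p, q)      */
    int qd = q > p ? q - p : 0;          /* sat_sub(q, p)      */
    int l  = pd > beta ? pd - beta : 0;  /* sat_sub(pd, beta)  */
    int r  = qd > beta ? qd - beta : 0;  /* sat_sub(qd, beta)  */
    return l == r;                       /* _mm_cmpeq_epi8 lane */
}
```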

static void deblock_luma_h_s4_sse(uint8_t *pix, int stride, int alpha, int beta)
{
    int ccloop = 2;
    do
    {
        __m128i p3 = MM_LOAD_8TO16_2(pix - 4*stride);
        __m128i p2 = MM_LOAD_8TO16_2(pix - 3*stride);
        __m128i p1 = MM_LOAD_8TO16_2(pix - 2*stride);
        __m128i p0 = MM_LOAD_8TO16_2(pix - stride);
        __m128i q0 = MM_LOAD_8TO16_2(pix);
        __m128i q1 = MM_LOAD_8TO16_2(pix + stride);
        __m128i q2 = MM_LOAD_8TO16_2(pix + 2*stride);
        __m128i q3 = MM_LOAD_8TO16_2(pix + 3*stride);
        __m128i zero = _mm_setzero_si128();
        __m128i _alpha = _mm_set1_epi16((short)alpha);
        __m128i _quarteralpha = _mm_set1_epi16((short)((alpha >> 2) + 2));
        __m128i _beta = _mm_set1_epi16((short)beta);
        __m128i ap_less_beta;
        __m128i aq_less_beta;
        __m128i str;
        __m128i pq;
        __m128i short_p;
        __m128i short_q;
        __m128i long_p;
        __m128i long_q;
        __m128i t;
        __m128i p0q0_less__quarteralpha;

        __m128i absdif_p0_q0 = subabs128_16(p0, q0);
        __m128i p0_plus_q0 = _mm_add_epi16(_mm_add_epi16(p0, q0), _mm_set1_epi16(2));

        // if (abs_p0_q0 < alpha && abs_p1_p0 < beta && abs_q1_q0 < beta)
        str = _mm_cmplt_epi16(absdif_p0_q0, _alpha);
        //str = _mm_and_si128(str, _mm_cmplt_epi16(subabs128_16(p1, p0), _beta));
        //str = _mm_and_si128(str, _mm_cmplt_epi16(subabs128_16(q1, q0), _beta));
        str = _mm_and_si128(str, _mm_cmplt_epi16(_mm_max_epi16(subabs128_16(p1, p0), subabs128_16(q1, q0)), _beta));
        p0q0_less__quarteralpha = _mm_and_si128(_mm_cmplt_epi16(absdif_p0_q0, _quarteralpha), str);

        //int short_p = (2*p1 + p0 + q1 + 2);
        //int short_q = (2*q1 + q0 + p1 + 2);
        pq = _mm_add_epi16(_mm_add_epi16(p1, q1), _mm_set1_epi16(2));
        short_p = _mm_add_epi16(_mm_add_epi16(pq, p1), p0);
        short_q = _mm_add_epi16(_mm_add_epi16(pq, q1), q0);

        ap_less_beta = _mm_and_si128(_mm_cmplt_epi16(subabs128_16(p2, p0), _beta), p0q0_less__quarteralpha);
        t = _mm_add_epi16(_mm_add_epi16(p2, p1), p0_plus_q0);
        // short_p += t - p1 + q0;
        long_p = _mm_srai_epi16(_mm_add_epi16(_mm_sub_epi16(_mm_add_epi16(short_p, t), p1), q0), 1);

        _mm_storel_epi64((__m128i*)(pix - 2*stride), _mm_packus_epi16(_mm_or_si128(_mm_and_si128(ap_less_beta, _mm_srai_epi16(t, 2)), _mm_andnot_si128(ap_less_beta, p1)), zero));
        t = _mm_add_epi16(_mm_add_epi16(_mm_slli_epi16(_mm_add_epi16(p3, p2), 1), t), _mm_set1_epi16(2));
        _mm_storel_epi64((__m128i*)(pix - 3*stride), _mm_packus_epi16(_mm_or_si128(_mm_and_si128(ap_less_beta, _mm_srai_epi16(t, 3)), _mm_andnot_si128(ap_less_beta, p2)), zero));

        aq_less_beta = _mm_and_si128(_mm_cmplt_epi16(subabs128_16(q2, q0), _beta), p0q0_less__quarteralpha);
        t = _mm_add_epi16(_mm_add_epi16(q2, q1), p0_plus_q0);
        long_q = _mm_srai_epi16(_mm_add_epi16(_mm_sub_epi16(_mm_add_epi16(short_q, t), q1), p0), 1);
        _mm_storel_epi64((__m128i*)(pix + 1*stride), _mm_packus_epi16(_mm_or_si128(_mm_and_si128(aq_less_beta, _mm_srai_epi16(t, 2)), _mm_andnot_si128(aq_less_beta, q1)), zero));

        t = _mm_add_epi16(_mm_add_epi16(_mm_slli_epi16(_mm_add_epi16(q3, q2), 1), t), _mm_set1_epi16(2));
        _mm_storel_epi64((__m128i*)(pix + 2*stride), _mm_packus_epi16(_mm_or_si128(_mm_and_si128(aq_less_beta, _mm_srai_epi16(t, 3)), _mm_andnot_si128(aq_less_beta, q2)), zero));

        short_p = _mm_srai_epi16(_mm_or_si128(_mm_and_si128(ap_less_beta, long_p), _mm_andnot_si128(ap_less_beta, short_p)), 2);
        short_q = _mm_srai_epi16(_mm_or_si128(_mm_and_si128(aq_less_beta, long_q), _mm_andnot_si128(aq_less_beta, short_q)), 2);

        _mm_storel_epi64((__m128i*)(pix - stride), _mm_packus_epi16(_mm_or_si128(_mm_and_si128(str, short_p), _mm_andnot_si128(str, p0)), zero));
        _mm_storel_epi64((__m128i*)(pix         ), _mm_packus_epi16(_mm_or_si128(_mm_and_si128(str, short_q), _mm_andnot_si128(str, q0)), zero));

        /* second 8-pixel half of the 16-pixel luma edge; this closing is
           reconstructed here because the extract is truncated mid-function */
        pix += 8;
    } while (--ccloop);
}
SYMBOL INDEX (243 symbols across 4 files)

FILE: minih264e.h
  type H264E_create_param_t (line 83) | typedef struct H264E_create_param_tag
  type H264E_run_param_t (line 177) | typedef struct H264E_run_param_tag
  type H264E_io_yuv_t (line 231) | typedef struct H264E_io_yuv_tag
  type H264E_persist_t (line 239) | typedef struct H264E_persist_tag H264E_persist_t;
  type H264E_scratch_t (line 240) | typedef struct H264E_scratch_tag H264E_scratch_t;
  function __usad8 (line 389) | static inline unsigned int __usad8(unsigned int val1, unsigned int val2)
  function __usada8 (line 398) | static inline unsigned int __usada8(unsigned int val1, unsigned int val2...
  function __sadd16 (line 407) | static inline unsigned int __sadd16(unsigned int val1, unsigned int val2)
  function __ssub16 (line 416) | static inline unsigned int __ssub16(unsigned int val1, unsigned int val2)
  function __clz (line 425) | static inline unsigned int __clz(unsigned int val1)
  type int_u (line 455) | typedef int int_u __attribute__ ((__aligned__ (1)));
  type int_u (line 457) | typedef int int_u;
  type pix_t (line 511) | typedef uint8_t     pix_t;
  type bs_item_t (line 512) | typedef uint32_t    bs_item_t;
  type bs_t (line 517) | typedef struct
  type point_t (line 528) | typedef union
  type rectangle_t (line 541) | typedef struct
  type quant_t (line 550) | typedef struct
  type scratch_t (line 559) | typedef struct H264E_scratch_tag
  type deblock_filter_t (line 585) | typedef struct
  type deblock_params_t (line 606) | typedef struct
  type h264e_enc_t (line 617) | typedef struct H264E_persist_tag
  function h264e_bs_put_bits_sse2 (line 1140) | static void h264e_bs_put_bits_sse2(bs_t *bs, unsigned n, unsigned val)
  function h264e_bs_flush_sse2 (line 1156) | static void h264e_bs_flush_sse2(bs_t *bs)
  function h264e_bs_get_pos_bits_sse2 (line 1161) | static unsigned h264e_bs_get_pos_bits_sse2(const bs_t *bs)
  function h264e_bs_byte_align_sse2 (line 1169) | static unsigned h264e_bs_byte_align_sse2(bs_t *bs)
  function h264e_bs_put_golomb_sse2 (line 1190) | static void h264e_bs_put_golomb_sse2(bs_t *bs, unsigned val)
  function h264e_bs_put_sgolomb_sse2 (line 1214) | static void h264e_bs_put_sgolomb_sse2(bs_t *bs, int val)
  function h264e_bs_init_bits_sse2 (line 1221) | static void h264e_bs_init_bits_sse2(bs_t *bs, void *data)
  function __clz_cavlc (line 1229) | static unsigned __clz_cavlc(unsigned v)
  function h264e_vlc_encode_sse2 (line 1240) | static void h264e_vlc_encode_sse2(bs_t *bs, int16_t *quant, int maxNumCo...
  function subabs128_16 (line 1460) | static __inline __m128i subabs128_16(__m128i a, __m128i b)
  function clone2x16 (line 1464) | static __inline __m128i clone2x16(const void *p)
  function subabs128 (line 1469) | static __inline __m128i subabs128(__m128i a, __m128i b)
  function transpose8x8_sse (line 1474) | static void transpose8x8_sse(uint8_t *dst, int dst_stride, uint8_t *src,...
  function deblock_chroma_h_s4_sse (line 1505) | static void deblock_chroma_h_s4_sse(uint8_t *pq0, int stride, const void...
  function deblock_chroma_v_s4_sse (line 1546) | static void deblock_chroma_v_s4_sse(uint8_t *pix, int stride, const void...
  function deblock_luma_h_s4_sse (line 1580) | static void deblock_luma_h_s4_sse(uint8_t *pix, int stride, int alpha, i...
  function deblock_luma_v_s4_sse (line 1652) | static void deblock_luma_v_s4_sse(uint8_t *pix, int stride, int alpha, i...
  function deblock_luma_h_s3_sse (line 1685) | static void deblock_luma_h_s3_sse(uint8_t *h264e_restrict pix, int strid...
  function deblock_luma_v_s3_sse (line 1749) | static void deblock_luma_v_s3_sse(uint8_t *pix, int stride, int alpha, i...
  function h264e_deblock_chroma_sse2 (line 1774) | static void h264e_deblock_chroma_sse2(uint8_t *pix, int32_t stride, cons...
  function h264e_deblock_luma_sse2 (line 1810) | static void h264e_deblock_luma_sse2(uint8_t *pix, int32_t stride, const ...
  function h264e_denoise_run_sse2 (line 1852) | static void h264e_denoise_run_sse2(unsigned char *frm, unsigned char *fr...
  function intra_predict_dc_sse (line 2018) | static uint32_t intra_predict_dc_sse(const pix_t *left, const pix_t *top...
  function h264e_intra_predict_16x16_sse2 (line 2074) | static void h264e_intra_predict_16x16_sse2(pix_t *predict,  const pix_t ...
  function h264e_intra_predict_chroma_sse2 (line 2106) | static void h264e_intra_predict_chroma_sse2(pix_t *predict, const pix_t ...
  function copy_wh_sse (line 2325) | static __inline void copy_wh_sse(const uint8_t *src, int src_stride, uin...
  function hpel_lpf_diag_sse (line 2353) | static __inline void hpel_lpf_diag_sse(const uint8_t *src, int src_strid...
  function hpel_lpf_hor_sse (line 2475) | static __inline void hpel_lpf_hor_sse(const uint8_t *src, int src_stride...
  function hpel_lpf_ver_sse (line 2535) | static __inline void hpel_lpf_ver_sse(const uint8_t *src, int src_stride...
  function average_16x16_unalign_sse (line 2567) | static void average_16x16_unalign_sse(uint8_t *dst, const uint8_t *src, ...
  function h264e_qpel_average_wh_align_sse2 (line 2588) | static void h264e_qpel_average_wh_align_sse2(const uint8_t *src0, const ...
  function h264e_qpel_interpolate_luma_sse2 (line 2624) | static void h264e_qpel_interpolate_luma_sse2(const uint8_t *src, int src...
  function h264e_qpel_interpolate_chroma_sse2 (line 2673) | static void h264e_qpel_interpolate_chroma_sse2(const uint8_t *src, int s...
  function h264e_sad_mb_unlaign_8x8_sse2 (line 2822) | static int h264e_sad_mb_unlaign_8x8_sse2(const pix_t *a, int a_stride, c...
  function h264e_sad_mb_unlaign_wh_sse2 (line 2852) | static int h264e_sad_mb_unlaign_wh_sse2(const pix_t *a, int a_stride, co...
  function h264e_copy_8x8_sse2 (line 2910) | static void h264e_copy_8x8_sse2(pix_t *d, int d_stride, const pix_t *s)
  function h264e_copy_16x16_sse2 (line 2924) | static void h264e_copy_16x16_sse2(pix_t *d, int d_stride, const pix_t *s...
  function h264e_copy_borders_sse2 (line 2946) | static void h264e_copy_borders_sse2(unsigned char *pic, int w, int h, in...
  function hadamar4_2d_sse (line 3007) | static void hadamar4_2d_sse(int16_t *x)
  function dequant_dc_sse (line 3045) | static void dequant_dc_sse(quant_t *q, int16_t *qval, int dequant, int n)
  function quant_dc_sse (line 3050) | static void quant_dc_sse(int16_t *qval, int16_t *deq, int16_t quant, int...
  function hadamar2_2d_sse (line 3061) | static void hadamar2_2d_sse(int16_t *x)
  function h264e_quant_luma_dc_sse2 (line 3073) | static void h264e_quant_luma_dc_sse2(quant_t *q, int16_t *deq, const uin...
  function h264e_quant_chroma_dc_sse2 (line 3084) | static int h264e_quant_chroma_dc_sse2(quant_t *q, int16_t *deq, const ui...
  function is_zero_sse (line 3095) | static int is_zero_sse(const int16_t *dat, int i0, const uint16_t *thr)
  function is_zero4_sse (line 3114) | static int is_zero4_sse(const quant_t *q, int i0, const uint16_t *thr)
  function h264e_transform_sub_quant_dequant_sse2 (line 3122) | static int h264e_transform_sub_quant_dequant_sse2(const pix_t *inp, cons...
  function h264e_transform_add_sse2 (line 3313) | static void h264e_transform_add_sse2(pix_t *out, int out_stride, const p...
  function deblock_luma_v_neon (line 3409) | static void deblock_luma_v_neon(uint8_t *pix, int stride, int alpha, int...
  function deblock_luma_h_s4_neon (line 3540) | static void deblock_luma_h_s4_neon(uint8_t *pix, int stride, int alpha, ...
  function deblock_luma_v_s4_neon (line 3656) | static void deblock_luma_v_s4_neon(uint8_t *pix, int stride, int alpha, ...
  function deblock_luma_h_neon (line 3816) | static void deblock_luma_h_neon(uint8_t *pix, int stride, int alpha, int...
  function deblock_chroma_v_neon (line 3903) | static void deblock_chroma_v_neon(uint8_t *pix, int32_t stride, int a, i...
  function deblock_chroma_h_neon (line 4020) | static void deblock_chroma_h_neon(uint8_t *pix, int32_t stride, int a, i...
  function h264e_deblock_chroma_neon (line 4080) | static void h264e_deblock_chroma_neon(uint8_t *pix, int32_t stride, cons...
  function h264e_deblock_luma_neon (line 4116) | static void h264e_deblock_luma_neon(uint8_t *pix, int32_t stride, const ...
  function h264e_denoise_run_neon (line 4158) | static void h264e_denoise_run_neon(unsigned char *frm, unsigned char *fr...
  function intra_predict_dc4_neon (line 4335) | static uint32_t intra_predict_dc4_neon(const pix_t *left, const pix_t *top)
  function intra_predict_dc16_neon (line 4359) | static uint8x16_t intra_predict_dc16_neon(const pix_t *left, const pix_t...
  function h264e_intra_predict_16x16_neon (line 4410) | static void h264e_intra_predict_16x16_neon(pix_t *predict, const pix_t *...
  function h264e_intra_predict_chroma_neon (line 4445) | static void h264e_intra_predict_chroma_neon(pix_t *predict, const pix_t ...
  function vsad_neon (line 4503) | static __inline int vsad_neon(uint8x16_t a, uint8x16_t b)
  function h264e_intra_choose_4x4_neon (line 4510) | static int h264e_intra_choose_4x4_neon(const pix_t *blockin, pix_t *bloc...
  function copy_wh_neon (line 4644) | static void copy_wh_neon(const uint8_t *src, int src_stride, uint8_t *h2...
  function hpel_lpf_hor_neon (line 4682) | static void hpel_lpf_hor_neon(const uint8_t *src, int src_stride, uint8_...
  function hpel_lpf_hor16_neon (line 4746) | static void hpel_lpf_hor16_neon(const uint8_t *src, int src_stride, int1...
  function hpel_lpf_ver_neon (line 4809) | static void hpel_lpf_ver_neon(const uint8_t *src, int src_stride, uint8_...
  function hpel_lpf_ver16_neon (line 4875) | static void hpel_lpf_ver16_neon(const int16_t *src, uint8_t *h264e_restr...
  function hpel_lpf_diag_neon (line 4914) | static void hpel_lpf_diag_neon(const uint8_t *src, int src_stride, uint8...
  function average_16x16_unalign_neon (line 4927) | static void average_16x16_unalign_neon(uint8_t *dst, const uint8_t *src,...
  function h264e_qpel_average_wh_align_neon (line 4947) | static void h264e_qpel_average_wh_align_neon(const uint8_t *src0, const ...
  function h264e_qpel_interpolate_luma_neon (line 4973) | static void h264e_qpel_interpolate_luma_neon(const uint8_t *src, int src...
  function h264e_qpel_interpolate_chroma_neon (line 5027) | static void h264e_qpel_interpolate_chroma_neon(const uint8_t *src, int s...
  function h264e_sad_mb_unlaign_8x8_neon (line 5087) | static int h264e_sad_mb_unlaign_8x8_neon(const pix_t *a, int a_stride, c...
  function h264e_sad_mb_unlaign_wh_neon (line 5123) | static int h264e_sad_mb_unlaign_wh_neon(const pix_t *a, int a_stride, co...
  function h264e_copy_8x8_neon (line 5186) | static void h264e_copy_8x8_neon(pix_t *d, int d_stride, const pix_t *s)
  function h264e_copy_16x16_neon (line 5200) | static void h264e_copy_16x16_neon(pix_t *d, int d_stride, const pix_t *s...
  function hadamar4_2d_neon (line 5233) | static void hadamar4_2d_neon(int16_t *x)
  function dequant_dc_neon (line 5264) | static void dequant_dc_neon(quant_t *q, int16_t *qval, int dequant, int n)
  function quant_dc_neon (line 5269) | static void quant_dc_neon(int16_t *qval, int16_t *deq, int16_t quant, in...
  function hadamar2_2d_neon (line 5293) | static void hadamar2_2d_neon(int16_t *x)
  function h264e_quant_luma_dc_neon (line 5305) | static void h264e_quant_luma_dc_neon(quant_t *q, int16_t *deq, const uin...
  function h264e_quant_chroma_dc_neon (line 5316) | static int h264e_quant_chroma_dc_neon(quant_t *q, int16_t *deq, const ui...
  function FwdTransformResidual4x42_neon (line 5338) | static void FwdTransformResidual4x42_neon(const uint8_t *inp, const uint...
  function TransformResidual4x4_neon (line 5405) | static void TransformResidual4x4_neon(const int16_t *pSrc, const pix_t *...
  function is_zero_neon (line 5463) | static int is_zero_neon(const int16_t *dat, int i0, const uint16_t *thr)
  function is_zero4_neon (line 5478) | static int is_zero4_neon(const quant_t *q, int i0, const uint16_t *thr)
  function zero_smallq_neon (line 5486) | static int zero_smallq_neon(quant_t *q, int mode, const uint16_t *qdat)
  function quantize_neon (line 5510) | static int quantize_neon(quant_t *q, int mode, const uint16_t *qdat, int...
  function transform_neon (line 5651) | static void transform_neon(const pix_t *inp, const pix_t *pred, int inp_...
  function h264e_transform_sub_quant_dequant_neon (line 5671) | static int h264e_transform_sub_quant_dequant_neon(const pix_t *inp, cons...
  function h264e_transform_add_neon (line 5690) | static void h264e_transform_add_neon(pix_t *out, int out_stride, const p...
  function byteclip_deblock (line 5728) | static uint8_t byteclip_deblock(int x)
  function clip_range (line 5741) | static int clip_range(int range, int src)
  function deblock_chroma (line 5754) | static void deblock_chroma(uint8_t *pix, int stride, int alpha, int beta...
  function deblock_luma_v (line 5788) | static void deblock_luma_v(uint8_t *pix, int stride, int alpha, int beta...
  function deblock_luma_h_s4 (line 5839) | static void deblock_luma_h_s4(uint8_t *pix, int stride, int alpha, int b...
  function deblock_luma_v_s4 (line 5890) | static void deblock_luma_v_s4(uint8_t *pix, int stride, int alpha, int b...
  function deblock_luma_h (line 5933) | static void deblock_luma_h(uint8_t *pix, int stride, int alpha, int beta...
  function deblock_chroma_v (line 5986) | static void deblock_chroma_v(uint8_t *pix, int32_t stride, int a, int b,...
  function deblock_chroma_h (line 5996) | static void deblock_chroma_h(uint8_t *pix, int32_t stride, int a, int b,...
  function h264e_deblock_chroma (line 6006) | static void h264e_deblock_chroma(uint8_t *pix, int32_t stride, const deb...
  function h264e_deblock_luma (line 6042) | static void h264e_deblock_luma(uint8_t *pix, int32_t stride, const deblo...
  function h264e_denoise_run (line 6084) | static void h264e_denoise_run(unsigned char *frm, unsigned char *frmprev...
  function intra_predict_dc (line 6162) | static uint32_t intra_predict_dc(const pix_t *left, const pix_t *top, in...
  function h264e_intra_predict_16x16 (line 6214) | static void h264e_intra_predict_16x16(pix_t *predict,  const pix_t *left...
  function h264e_intra_predict_chroma (line 6253) | static void h264e_intra_predict_chroma(pix_t *predict, const pix_t *left...
  function pix_sad_4 (line 6320) | static int pix_sad_4(uint32_t r0, uint32_t r1, uint32_t r2, uint32_t r3,
  function h264e_intra_choose_4x4 (line 6355) | static int h264e_intra_choose_4x4(const pix_t *blockin, pix_t *blockpred...
  function byteclip (line 6509) | static uint8_t byteclip(int x)
  function hpel_lpf (line 6516) | static int hpel_lpf(const uint8_t *p, int s)
  function copy_wh (line 6521) | static void copy_wh(const uint8_t *src, int src_stride, uint8_t *dst, in...
  function hpel_lpf_diag (line 6535) | static void hpel_lpf_diag(const uint8_t *src, int src_stride, uint8_t *h...
  function hpel_lpf_hor (line 6574) | static void hpel_lpf_hor(const uint8_t *src, int src_stride, uint8_t *h2...
  function hpel_lpf_ver (line 6586) | static void hpel_lpf_ver(const uint8_t *src, int src_stride, uint8_t *h2...
  function average_16x16_unalign (line 6598) | static void average_16x16_unalign(uint8_t *dst, const uint8_t *src1, int...
  function h264e_qpel_average_wh_align (line 6610) | static void h264e_qpel_average_wh_align(const uint8_t *src0, const uint8...
  function h264e_qpel_interpolate_luma (line 6624) | static void h264e_qpel_interpolate_luma(const uint8_t *src, int src_stri...
  function h264e_qpel_interpolate_chroma (line 6678) | static void h264e_qpel_interpolate_chroma(const uint8_t *src, int src_st...
  function sad_block (line 6707) | static int sad_block(const pix_t *a, int a_stride, const pix_t *b, int b...
  function h264e_sad_mb_unlaign_8x8 (line 6723) | static int h264e_sad_mb_unlaign_8x8(const pix_t *a, int a_stride, const ...
  function h264e_sad_mb_unlaign_wh (line 6734) | static int h264e_sad_mb_unlaign_wh(const pix_t *a, int a_stride, const p...
  function h264e_copy_8x8 (line 6739) | static void h264e_copy_8x8(pix_t *d, int d_stride, const pix_t *s)
  function h264e_copy_16x16 (line 6755) | static void h264e_copy_16x16(pix_t *d, int d_stride, const pix_t *s, int...
  function h264e_copy_borders (line 6777) | static void h264e_copy_borders(unsigned char *pic, int w, int h, int guard)
  function clip_byte (line 6802) | static int clip_byte(int x)
  function hadamar4_2d (line 6814) | static void hadamar4_2d(int16_t *x)
  function dequant_dc (line 6848) | static void dequant_dc(quant_t *q, int16_t *qval, int dequant, int n)
  function quant_dc (line 6853) | static void quant_dc(int16_t *qval, int16_t *deq, int16_t quant, int n, ...
  function hadamar2_2d (line 6877) | static void hadamar2_2d(int16_t *x)
  function h264e_quant_luma_dc (line 6889) | static void h264e_quant_luma_dc(quant_t *q, int16_t *deq, const uint16_t...
  function h264e_quant_chroma_dc (line 6900) | static int h264e_quant_chroma_dc(quant_t *q, int16_t *deq, const uint16_...
  function FwdTransformResidual4x42 (line 6930) | static void FwdTransformResidual4x42(const uint8_t *inp, const uint8_t *...
  function TransformResidual4x4 (line 6981) | static void TransformResidual4x4(int16_t *pSrc)
  function is_zero (line 7036) | static int is_zero(const int16_t *dat, int i0, const uint16_t *thr)
  function is_zero4 (line 7049) | static int is_zero4(const quant_t *q, int i0, const uint16_t *thr)
  function zero_smallq (line 7057) | static int zero_smallq(quant_t *q, int mode, const uint16_t *qdat)
  function quantize (line 7081) | static int quantize(quant_t *q, int mode, const uint16_t *qdat, int zmask)
  function transform (line 7144) | static void transform(const pix_t *inp, const pix_t *pred, int inp_strid...
  function h264e_transform_sub_quant_dequant (line 7164) | static int h264e_transform_sub_quant_dequant(const pix_t *inp, const pix...
  function h264e_transform_add (line 7183) | static void h264e_transform_add(pix_t *out, int out_stride, const pix_t ...
  function h264e_bs_put_bits (line 7233) | static void h264e_bs_put_bits(bs_t *bs, unsigned n, unsigned val)
  function h264e_bs_flush (line 7249) | static void h264e_bs_flush(bs_t *bs)
  function h264e_bs_get_pos_bits (line 7254) | static unsigned h264e_bs_get_pos_bits(const bs_t *bs)
  function h264e_bs_byte_align (line 7262) | static unsigned h264e_bs_byte_align(bs_t *bs)
  function h264e_bs_put_golomb (line 7283) | static void h264e_bs_put_golomb(bs_t *bs, unsigned val)
  function h264e_bs_put_sgolomb (line 7309) | static void h264e_bs_put_sgolomb(bs_t *bs, int val)
  function h264e_bs_init_bits (line 7316) | static void h264e_bs_init_bits(bs_t *bs, void *data)
  function h264e_vlc_encode (line 7324) | static void h264e_vlc_encode(bs_t *bs, int16_t *quant, int maxNumCoeff, ...
  function udiv32 (line 7557) | static uint32_t udiv32(uint32_t n, uint32_t d)
  function h264e_copy_8x8_s (line 7572) | static void h264e_copy_8x8_s(pix_t *d, int d_stride, const pix_t *s, int...
  function h264e_frame_downsampling (line 7588) | static void h264e_frame_downsampling(uint8_t *out, int wo, int ho,
  function clip (line 7655) | static int clip(int val, int max)
  function h264e_intra_upsampling (line 7682) | static void h264e_intra_upsampling(int srcw, int srch, int dstw, int dst...
  type vft_t (line 7884) | typedef struct
  function minih264_cpuid (line 8056) | static __inline__ __attribute__((always_inline)) void minih264_cpuid(int...
  function rc_frame_end (line 10872) | static void rc_frame_end(h264e_enc_t *enc, int intra_flag, int skip_flag...
  function rc_mb_end (line 10943) | static void rc_mb_end(h264e_enc_t *enc)
  function enc_alloc (line 10988) | static int enc_alloc(h264e_enc_t *enc, const H264E_create_param_t *par, ...
  function enc_alloc_scratch (line 11011) | static int enc_alloc_scratch(h264e_enc_t *enc, const H264E_create_param_...
  function io_yuv_set_pointers (line 11032) | static pix_t *io_yuv_set_pointers(pix_t *base, H264E_io_yuv_t *frm, int ...
  function enc_check_create_params (line 11049) | static int enc_check_create_params(const H264E_create_param_t *par)
  function H264E_sizeof_one (line 11085) | static int H264E_sizeof_one(const H264E_create_param_t *par, int *sizeof...
  function H264E_init_one (line 11106) | static int H264E_init_one(h264e_enc_t *enc, const H264E_create_param_t *...
  function H264E_init (line 11172) | int H264E_init(h264e_enc_t *enc, const H264E_create_param_t *opt)
  function encode_slice (line 11206) | static void encode_slice(h264e_enc_t *enc, int frame_type, int long_term...
  type h264_enc_slice_thread_params_t (line 11261) | typedef struct
  function encode_slice_thread_simple (line 11267) | static void encode_slice_thread_simple(void *arg)
  function H264E_encode_one (line 11274) | static int H264E_encode_one(H264E_persist_t *enc, const H264E_run_param_...
  function check_parameters_align (line 11425) | static int check_parameters_align(const H264E_create_param_t *opt, const...
  function H264E_encode (line 11454) | int H264E_encode(H264E_persist_t *enc, H264E_scratch_t *scratch, const H...
  function H264E_sizeof (line 11668) | int H264E_sizeof(const H264E_create_param_t *par, int *sizeof_persist, i...
  function H264E_set_vbv_state (line 11698) | void H264E_set_vbv_state(

FILE: minih264e_test.c
  type h264e_thread_t (line 36) | typedef struct
  function minih264_thread_func (line 46) | static THREAD_RET THRAPI minih264_thread_func(void *arg)
  function h264e_thread_pool_close (line 77) | void h264e_thread_pool_close(void *pool, int max_threads)
  function h264e_thread_pool_run (line 94) | void h264e_thread_pool_run(void *pool, void (*callback)(void*), void *ca...
  function str_equal (line 120) | static int str_equal(const char *pattern, char **p)
  function read_cmdline_options (line 132) | static int read_cmdline_options(int argc, char *argv[])
  type frame_size_descriptor_t (line 217) | typedef struct
  function guess_format_from_name (line 256) | static int guess_format_from_name(const char *file_name, int *w, int *h)
  type rd_t (line 300) | typedef struct
  function psnr_init (line 318) | static void psnr_init()
  function psnr_add (line 323) | static void psnr_add(unsigned char *p0, unsigned char *p1, int w, int h,...
  function psnr_get (line 344) | static rd_t psnr_get()
  function psnr_print (line 361) | static void psnr_print(rd_t rd)
  function pixel_of_chessboard (line 375) | static int pixel_of_chessboard(double x, double y)
  function gen_chessboard_rot (line 403) | static void gen_chessboard_rot(unsigned char *p, int w, int h, int frm)
  function main (line 422) | int main(int argc, char *argv[])

FILE: system.c
  type Event (line 15) | typedef struct Event Event;
  type Event (line 17) | typedef struct Event
  function InitEvent (line 26) | static bool InitEvent(Event *e)
  function GetAbsTimeInNanoseconds (line 58) | static inline uint64_t GetAbsTimeInNanoseconds()
  function GetAbsTime (line 67) | static inline void GetAbsTime(struct timespec *ts, uint32_t timeout)
  function CondTimedWait (line 85) | static inline int CondTimedWait(pthread_cond_t *cond, pthread_mutex_t *m...
  function WaitForEvent (line 107) | static bool WaitForEvent(Event *e, uint32_t timeout, bool *signaled)
  function WaitForMultipleEvents (line 135) | static bool WaitForMultipleEvents(Event **e, uint32_t count, uint32_t ti...
  function event_create (line 211) | HANDLE event_create(bool manualReset, bool initialState)
  function event_destroy (line 226) | bool event_destroy(HANDLE event)
  function event_set (line 239) | bool event_set(HANDLE event)
  function event_reset (line 265) | bool event_reset(HANDLE event)
  function event_wait (line 276) | int event_wait(HANDLE event, uint32_t milliseconds)
  function event_wait_multiple (line 284) | int event_wait_multiple(uint32_t count, const HANDLE *events, bool waitA...
  function InitializeCriticalSection (line 294) | bool InitializeCriticalSection(LPCRITICAL_SECTION lpCriticalSection)
  function DeleteCriticalSection (line 314) | bool DeleteCriticalSection(LPCRITICAL_SECTION lpCriticalSection)
  function EnterCriticalSection (line 321) | bool EnterCriticalSection(LPCRITICAL_SECTION lpCriticalSection)
  function LeaveCriticalSection (line 328) | bool LeaveCriticalSection(LPCRITICAL_SECTION lpCriticalSection)
  function thread_create (line 335) | HANDLE thread_create(LPTHREAD_START_ROUTINE lpStartAddress, void *lpPara...
  function thread_close (line 350) | bool thread_close(HANDLE thread)
  type timespec (line 386) | struct timespec
  function thread_create (line 399) | HANDLE thread_create(LPTHREAD_START_ROUTINE lpStartAddress, void *lpPara...
  function event_create (line 405) | HANDLE event_create(bool manualReset, bool initialState)
  function event_destroy (line 410) | bool event_destroy(HANDLE event)
  function thread_close (line 416) | bool thread_close(HANDLE thread)
  function thread_name (line 435) | bool thread_name(const char *name)
  function thread_sleep (line 463) | void thread_sleep(uint32_t milliseconds)
  function GetTime (line 472) | uint64_t GetTime()

FILE: system.h
  type THREAD_RET (line 8) | typedef DWORD THREAD_RET;
  type PTHREAD_START_ROUTINE (line 18) | typedef THREAD_RET (*PTHREAD_START_ROUTINE)(void *lpThreadParameter);
  type LPTHREAD_START_ROUTINE (line 19) | typedef PTHREAD_START_ROUTINE LPTHREAD_START_ROUTINE;
  type CRITICAL_SECTION (line 21) | typedef pthread_mutex_t CRITICAL_SECTION, *PCRITICAL_SECTION, *LPCRITICA...
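
  The system.c/system.h symbols above (event_create, event_set, event_wait, WaitForEvent, CondTimedWait) indicate a Win32-style event API emulated on top of pthreads for non-Windows builds. A minimal self-contained sketch of the same idea, using a mutex plus condition variable; all names here are illustrative, not the repository's:

  ```c
  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* Manual/auto-reset event built on a mutex + condition variable,
     mirroring the Event wrapper system.c layers over pthreads. */
  typedef struct
  {
      pthread_mutex_t mutex;
      pthread_cond_t cond;
      bool signaled;
      bool manual_reset;
  } event_t;

  static void event_init(event_t *e, bool manual_reset, bool initial_state)
  {
      pthread_mutex_init(&e->mutex, NULL);
      pthread_cond_init(&e->cond, NULL);
      e->signaled = initial_state;
      e->manual_reset = manual_reset;
  }

  static void event_signal(event_t *e)
  {
      pthread_mutex_lock(&e->mutex);
      e->signaled = true;
      pthread_cond_broadcast(&e->cond);  /* wake all waiters */
      pthread_mutex_unlock(&e->mutex);
  }

  static void event_await(event_t *e)
  {
      pthread_mutex_lock(&e->mutex);
      while (!e->signaled)               /* loop guards against spurious wakeups */
          pthread_cond_wait(&e->cond, &e->mutex);
      if (!e->manual_reset)
          e->signaled = false;           /* auto-reset: consume the signal */
      pthread_mutex_unlock(&e->mutex);
  }

  static void *worker(void *arg)
  {
      event_signal((event_t *)arg);      /* tell the main thread we ran */
      return NULL;
  }

  int main(void)
  {
      event_t done;
      pthread_t t;
      event_init(&done, false, false);   /* auto-reset, initially unsignaled */
      pthread_create(&t, NULL, worker, &done);
      event_await(&done);                /* blocks until the worker signals */
      pthread_join(t, NULL);
      printf("event signaled\n");
      return 0;
  }
  ```

  The real wrapper adds timeouts (CondTimedWait over an absolute deadline from GetAbsTime) and multi-event waits, but the signal/consume discipline is the same.
  
  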
Condensed preview — 25 files, each showing path, character count, and a content snippet (605K chars of structured content in total).
[
  {
    "path": ".gitignore",
    "chars": 40,
    "preview": "h264enc_*\nqemu-prof\n*.gcda\n*.gcno\n*.gcov"
  },
  {
    "path": ".travis.yml",
    "chars": 741,
    "preview": "language: c\naddons:\n  apt:\n    packages:\n      - build-essential\n      - libc6-dev-i386\n      - linux-libc-dev:i386\n    "
  },
  {
    "path": "LICENSE",
    "chars": 6556,
    "preview": "CC0 1.0 Universal\n\nStatement of Purpose\n\nThe laws of most jurisdictions throughout the world automatically confer\nexclus"
  },
  {
    "path": "README.md",
    "chars": 2596,
    "preview": "minih264\n==========\n\n[![Build Status](https://travis-ci.org/lieff/minih264.svg)](https://travis-ci.org/lieff/minih264)\n\n"
  },
  {
    "path": "asm/minih264e_asm.h",
    "chars": 2782,
    "preview": "#define H264E_API(type, name, args) type name args; \\\ntype name##_sse2 args;  \\\ntype name##_arm11 args; \\\ntype name##_ne"
  },
  {
    "path": "asm/neon/h264e_cavlc_arm11.s",
    "chars": 11152,
    "preview": "        .arm\r\n        .text\r\n        .align 2\r\n        .type  h264e_bs_put_sgolomb_arm11, %function\r\nh264e_bs_put_sgolom"
  },
  {
    "path": "asm/neon/h264e_deblock_neon.s",
    "chars": 35470,
    "preview": "        .arm\n        .text\n        .align 2\n\n        .type  deblock_luma_h_s4, %function\ndeblock_luma_h_s4:\n        VPUS"
  },
  {
    "path": "asm/neon/h264e_denoise_neon.s",
    "chars": 11922,
    "preview": "        .arm\n        .text\n        .align 2\n\n__rt_memcpy_w:\n        subs            r2,     r2,     #0x10-4\nlocal_denois"
  },
  {
    "path": "asm/neon/h264e_intra_neon.s",
    "chars": 13244,
    "preview": "        .arm\n        .text\n        .align 2\n\n        .type  intra_predict_dc4_neon, %function\nintra_predict_dc4_neon:\n  "
  },
  {
    "path": "asm/neon/h264e_qpel_neon.s",
    "chars": 17636,
    "preview": "        .arm\n        .text\n        .align 2\n\n        .global h264e_qpel_average_wh_align_neon\n        .type  h264e_qpel_"
  },
  {
    "path": "asm/neon/h264e_sad_neon.s",
    "chars": 12858,
    "preview": "        .arm\n        .text\n        .align 2\n\n        .type  h264e_sad_mb_unlaign_wh_neon, %function\nh264e_sad_mb_unlaign"
  },
  {
    "path": "asm/neon/h264e_transform_neon.s",
    "chars": 23944,
    "preview": "        .arm\n        .text\n        .align 2\n\n        .type  hadamar4_2d_neon, %function\nhadamar4_2d_neon:\n        VLD4.1"
  },
  {
    "path": "minih264e.h",
    "chars": 410420,
    "preview": "#ifndef MINIH264_H\n#define MINIH264_H\n/*\n    https://github.com/lieff/minih264\n    To the extent possible under law, the"
  },
  {
    "path": "minih264e_test.c",
    "chars": 18479,
    "preview": "#include <stdio.h>\n#include <stdlib.h>\n#include <assert.h>\n#include <string.h>\n#include <math.h>\n#define MINIH264_IMPLEM"
  },
  {
    "path": "scripts/build_arm.sh",
    "chars": 1242,
    "preview": "_FILENAME=${0##*/}\nCUR_DIR=${0/${_FILENAME}}\nCUR_DIR=$(cd $(dirname ${CUR_DIR}); pwd)/$(basename ${CUR_DIR})/\n\npushd $CU"
  },
  {
    "path": "scripts/build_arm_clang.sh",
    "chars": 1397,
    "preview": "_FILENAME=${0##*/}\nCUR_DIR=${0/${_FILENAME}}\nCUR_DIR=$(cd $(dirname ${CUR_DIR}); pwd)/$(basename ${CUR_DIR})/\n\npushd $CU"
  },
  {
    "path": "scripts/build_x86.sh",
    "chars": 1065,
    "preview": "_FILENAME=${0##*/}\nCUR_DIR=${0/${_FILENAME}}\nCUR_DIR=$(cd $(dirname ${CUR_DIR}); pwd)/$(basename ${CUR_DIR})/\n\npushd $CU"
  },
  {
    "path": "scripts/build_x86_clang.sh",
    "chars": 412,
    "preview": "_FILENAME=${0##*/}\nCUR_DIR=${0/${_FILENAME}}\nCUR_DIR=$(cd $(dirname ${CUR_DIR}); pwd)/$(basename ${CUR_DIR})/\n\npushd $CU"
  },
  {
    "path": "scripts/profile.sh",
    "chars": 220,
    "preview": "_FILENAME=${0##*/}\nCUR_DIR=${0/${_FILENAME}}\nCUR_DIR=$(cd $(dirname ${CUR_DIR}); pwd)/$(basename ${CUR_DIR})/\n\npushd $CU"
  },
  {
    "path": "scripts/test.sh",
    "chars": 1047,
    "preview": "_FILENAME=${0##*/}\nCUR_DIR=${0/${_FILENAME}}\nCUR_DIR=$(cd $(dirname ${CUR_DIR}); pwd)/$(basename ${CUR_DIR})/\n\npushd $CU"
  },
  {
    "path": "system.c",
    "chars": 11967,
    "preview": "#include \"system.h\"\n\n#ifndef _WIN32\n\n#include <stdlib.h>\n#include <time.h>\n#include <errno.h>\n#include <unistd.h>\n#if de"
  },
  {
    "path": "system.h",
    "chars": 2118,
    "preview": "#pragma once\n#ifndef __LGE_SYSTEM_H__\n#define __LGE_SYSTEM_H__\n\n#ifdef _WIN32\n\n#include <windows.h>\ntypedef DWORD THREAD"
  }
]

// ... and 3 more files (download for full content)

About this extraction

This page contains the full source code of the lieff/minih264 GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 25 files (44.1 MB), approximately 205.4k tokens, and a symbol index with 243 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
