Full Code of alexrozanski/llama.swift for AI

Repository: alexrozanski/llama.swift
Branch: master
Commit: 74eb4678ad42
Files: 33
Total size: 466.5 KB

Directory structure:
gitextract_4y5f8lx7/

├── .github/
│   └── workflows/
│       └── build.yml
├── .gitignore
├── CODE_OF_CONDUCT.md
├── LICENSE
├── Package.swift
├── README.md
├── Sources/
│   ├── cpp/
│   │   ├── ggml.c
│   │   ├── ggml.h
│   │   ├── quantize.cpp
│   │   ├── utils.cpp
│   │   └── utils.h
│   ├── llama/
│   │   └── LlamaRunner.swift
│   └── llamaObjCxx/
│       ├── LlamaError.m
│       ├── bridge/
│       │   ├── LlamaEvent.mm
│       │   ├── LlamaPredictOperation.hh
│       │   ├── LlamaPredictOperation.mm
│       │   ├── LlamaRunnerBridge.mm
│       │   └── LlamaRunnerBridgeConfig.m
│       ├── headers/
│       │   ├── LlamaError.h
│       │   ├── LlamaEvent.h
│       │   ├── LlamaRunnerBridge.h
│       │   └── LlamaRunnerBridgeConfig.h
│       └── module.modulemap
├── llama.xcodeproj/
│   ├── project.pbxproj
│   └── project.xcworkspace/
│       ├── contents.xcworkspacedata
│       └── xcshareddata/
│           └── IDEWorkspaceChecks.plist
├── llamaTest/
│   ├── Info.plist
│   ├── LlamaTest.xcconfig
│   └── main.swift
└── tools/
    ├── .gitignore
    ├── Makefile
    ├── convert-pth-to-ggml.py
    └── quantize.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/build.yml
================================================
name: CI

on:
  workflow_dispatch: # allows manual triggering
    inputs:
      create_release:
        description: "Create new release"
        required: true
        type: boolean
  push:
    paths: [".github/workflows/**", "Sources/**/*", "Package.swift", "llama.xcodeproj/**/*"]
  pull_request:
    paths: [".github/workflows/**", "Sources/**/*", "Package.swift", "llama.xcodeproj/**/*"]

jobs:
  swift-build:
    runs-on: macos-latest
    steps:
      - name: Clone repo
        id: checkout
        uses: actions/checkout@v1
      - name: Update dependencies
        id: depends
        run: |
          brew update
      - name: Swift build
        id: swift_build
        run: |
          swift build


================================================
FILE: .gitignore
================================================
*.o
*.a
.cache/
.vs/
.vscode/
.DS_Store

build/
build-em/
build-debug/
build-release/
build-static/
build-no-accel/
build-sanitize-addr/
build-sanitize-thread/

models/*

/main
/quantize

arm_neon.h
compile_commands.json
# Xcode
#
# gitignore contributors: remember to update Global/Xcode.gitignore, Objective-C.gitignore & Swift.gitignore

## User settings
xcuserdata/

## compatibility with Xcode 8 and earlier (ignoring not required starting Xcode 9)
*.xcscmblueprint
*.xccheckout

## compatibility with Xcode 3 and earlier (ignoring not required starting Xcode 4)
build/
DerivedData/
*.moved-aside
*.pbxuser
!default.pbxuser
*.mode1v3
!default.mode1v3
*.mode2v3
!default.mode2v3
*.perspectivev3
!default.perspectivev3

## Obj-C/Swift specific
*.hmap

## App packaging
*.ipa
*.dSYM.zip
*.dSYM

## Playgrounds
timeline.xctimeline
playground.xcworkspace

# Swift Package Manager
#
# Add this line if you want to avoid checking in source code from Swift Package Manager dependencies.
# Packages/
# Package.pins
# Package.resolved
# *.xcodeproj
#
# Xcode automatically generates this directory with a .xcworkspacedata file and xcuserdata
# hence it is not needed unless you have added a package configuration file to your project
# .swiftpm

.build/

# CocoaPods
#
# We recommend against adding the Pods directory to your .gitignore. However
# you should judge for yourself, the pros and cons are mentioned at:
# https://guides.cocoapods.org/using/using-cocoapods.html#should-i-check-the-pods-directory-into-source-control
#
# Pods/
#
# Add this line if you want to avoid checking in source code from the Xcode workspace
# *.xcworkspace

# Carthage
#
# Add this line if you want to avoid checking in source code from Carthage dependencies.
# Carthage/Checkouts

Carthage/Build/

# Accio dependency management
Dependencies/
.accio/

# fastlane
#
# It is recommended to not store the screenshots in the git repo.
# Instead, use fastlane to re-generate the screenshots whenever they are needed.
# For more information about the recommended setup visit:
# https://docs.fastlane.tools/best-practices/source-control/#source-control

fastlane/report.xml
fastlane/Preview.html
fastlane/screenshots/**/*.png
fastlane/test_output

# Code Injection
#
# After new code Injection tools there's a generated folder /iOSInjectionProject
# https://github.com/johnno1962/injectionforxcode

iOSInjectionProject/


================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
  overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
  advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
  address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
alex@rozanski.me.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series
of actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within
the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2023 Georgi Gerganov, Alex Rozanski and others

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: Package.swift
================================================
// swift-tools-version:5.5

import PackageDescription

let package = Package(
  name: "llama.swift",
  platforms: [
    .macOS(.v10_15),
    .iOS(.v13),
  ],
  products: [
    .library(name: "llama", targets: ["llama"]),
  ],
  targets: [
    .target(
      name: "llama",
      dependencies: ["llamaObjCxx"],
      path: "Sources/llama"),
    .target(
      name: "llamaObjCxx",
      dependencies: [],
      path: "Sources/llamaObjCxx",
      exclude: [
        "cpp/quantize.cpp"
      ],
      publicHeadersPath: "headers",
      cxxSettings: [
        .headerSearchPath("cpp")
      ])
  ],
  cLanguageStandard: .gnu11,
  cxxLanguageStandard: .gnucxx20
)


================================================
FILE: README.md
================================================
# 🦙 llama.swift

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)

A fork of [@ggerganov](https://github.com/ggerganov)'s [llama.cpp](https://github.com/ggerganov/llama.cpp) to use [Facebook's LLaMA](https://github.com/facebookresearch/llama) models in Swift.

See the [llama.cpp repository](https://github.com/ggerganov/llama.cpp/) for info about the original goals of the project and implementation.

## 🚀 llama.swift → future

Version 1 of llama.swift provides a simple, clean wrapper around the original LLaMA models and some of their early derivatives.

The future of llama.swift is [CameLLM](https://github.com/CameLLM/), which provides clean Swift interfaces to run LLMs locally on macOS (and hopefully, in the future, iOS too). CameLLM is still in development, and you can star or watch the [main repository](https://github.com/CameLLM/CameLLM) for updates.

<hr/>

## 🔨 Setup

Clone the repo:

```bash
git clone https://github.com/alexrozanski/llama.swift.git
cd llama.swift
```

Grab the LLaMA model weights and place them in `./models`. `ls` should print something like:

```bash
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
```

To convert the LLaMA-7B model to ggml format and quantize it:

```bash
# install Python dependencies
python3 -m pip install torch numpy sentencepiece

# the command-line tools are in `./tools` instead of the repo root like in llama.cpp
cd tools

# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py ../models/7B/ 1

# quantize the model to 4 bits
make
./quantize.sh 7B
```

When running the larger models, make sure you have enough disk space to store all of the intermediate files.

## ⬇️ Installation

### Swift Package Manager

Add `llama.swift` to your project using Xcode (File > Add Packages...) or by adding it to your project's `Package.swift` file:

```swift
dependencies: [
  .package(url: "https://github.com/alexrozanski/llama.swift.git", .upToNextMajor(from: "1.0.0"))
]
```

## 👩‍💻 Usage

### Swift library

To generate output from a prompt, first instantiate a `LlamaRunner` instance with the URL to your LLaMA model file:

```swift
import llama

let url = ... // URL to the ggml-model-q4_0.bin model file
let runner = LlamaRunner(modelURL: url)
```

To generate output, call `run()` on the `LlamaRunner` instance with your prompt. Tokens are generated asynchronously, so `run()` returns an `AsyncThrowingStream` that you can iterate over to process tokens as they arrive:

```swift
do {
  for try await token in runner.run(with: "Building a website can be done in 10 simple steps:") {
    print(token, terminator: "")
  }
} catch let error {
  // Handle error
}
```

Note that tokens don't necessarily correspond to a single word, and also include any whitespace and newlines.
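
Because tokens carry their own whitespace, the full completion can be rebuilt by plain concatenation, with no separator inserted between tokens. The following is a minimal sketch rather than part of the library: `makeTokenStream` and `collect` are hypothetical helpers, and `makeTokenStream` replays a fixed token list in place of `runner.run(with:)` so the snippet runs without a model file.

```swift
// Hypothetical stand-in for `runner.run(with:)`: replays a fixed list of
// tokens so the example can run without a model file.
func makeTokenStream(_ tokens: [String]) -> AsyncThrowingStream<String, Error> {
  AsyncThrowingStream { continuation in
    for token in tokens {
      continuation.yield(token)
    }
    continuation.finish()
  }
}

// Tokens already include any leading spaces and newlines, so they are
// appended as-is instead of being joined with a separator.
func collect(_ stream: AsyncThrowingStream<String, Error>) async throws -> String {
  var output = ""
  for try await token in stream {
    output += token
  }
  return output
}

let text = try await collect(makeTokenStream(["Build", "ing", " a", " website", "\n"]))
print(text, terminator: "")
```

With a real runner, the same `collect` helper could presumably be passed the stream returned by `runner.run(with:)`, assuming its element type is `String`.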

#### Configuration

`LlamaRunner.run()` takes an optional `LlamaRunner.Config` instance which lets you control the number of threads inference is run on (default: `8`), the maximum number of tokens returned (default: `512`) and an optional reverse/negative prompt:

```swift
let prompt = "..."
let config = LlamaRunner.Config(numThreads: 8, numTokens: 20, reversePrompt: "...")
let tokenStream = runner.run(with: prompt, config: config)

do {
  for try await token in tokenStream {
    ...
  }
} catch let error {
  ...
}
```

#### State Changes

`LlamaRunner.run()` also takes an optional `stateChangeHandler` closure, which is invoked whenever the run state changes:

```swift
let prompt = "..."
let tokenStream = runner.run(
  with: prompt,
  config: .init(numThreads: 8, numTokens: 20),
  stateChangeHandler: { state in
    switch state {
      case .notStarted:
        // Initial state
        break
      case .initializing:
        // Loading the model and initializing
        break
      case .generatingOutput:
        // Generating tokens
        break
      case .completed:
        // Completed successfully
        break
      case .failed:
        // Failed. This is also the error thrown by the `AsyncThrowingStream` returned from `LlamaRunner.run()`
        break
    }
  })
```

#### Closure-based API

If you don't want to use Swift concurrency there is an alternative version of `run()` which returns tokens via a `tokenHandler` closure instead:

```swift
let prompt = "..."
runner.run(
  with: prompt,
  config: ...,
  tokenHandler: { token in
    ...
  },
  stateChangeHandler: ...
)
```
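
One way to see how the closure-based and stream-based styles relate is to wrap a callback producer in an `AsyncThrowingStream`. The sketch below is illustrative only and is not taken from the library's source: `produceTokens` is a hypothetical stand-in for a closure-based generator, with an assumed `completion` callback, and `tokenStream` is a hypothetical wrapper.

```swift
// Hypothetical closure-based producer standing in for a callback-style API.
// It reports each token through `tokenHandler`, then signals completion
// (with an optional error) through `completion`.
func produceTokens(tokenHandler: (String) -> Void, completion: (Error?) -> Void) {
  for token in ["10", " simple", " steps"] {
    tokenHandler(token)
  }
  completion(nil)
}

// Wrap the callbacks in an AsyncThrowingStream so that callers can consume
// tokens with `for try await`.
func tokenStream() -> AsyncThrowingStream<String, Error> {
  AsyncThrowingStream { continuation in
    produceTokens(
      tokenHandler: { token in continuation.yield(token) },
      // Passing nil finishes normally; a non-nil error is rethrown to the consumer.
      completion: { error in continuation.finish(throwing: error) }
    )
  }
}

var received: [String] = []
for try await token in tokenStream() {
  received.append(token)
}
print(received)
```

The closure form avoids Swift concurrency entirely, while the stream form lets the consumer use structured `do`/`catch` error handling around a single loop.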

#### Other notes

- Build for Release if you want token generation to be snappy, since `llama` will generate tokens slowly in Debug builds.
- Because of the way the Swift package is structured (and some gaps in my knowledge around exported symbols from modules), including `llama.swift` also leaks the name of the internal module containing the Objective-C/C++ implementation, `llamaObjCxx`, as well as some internal classes prefixed with `_Llama`. Pull requests welcome if you have any ideas on fixing this!


### `llamaTest` app

The repo contains a barebones command-line tool, `llamaTest`, which uses the `llama` framework in a simple input loop to run inference on a given input prompt.

- Make sure to set `MODEL_PATH` in `LlamaTest.xcconfig` to point to your `path/to/ggml-model-q4_0.bin` (without quotes or spaces after `MODEL_PATH=`), for example:

```
MODEL_PATH=/path/to/ggml-model-q4_0.bin
```

## 📃 Misc

- License: MIT
- Other matters: See the [llama.cpp repo](https://github.com/ggerganov/llama.cpp/).


================================================
FILE: Sources/cpp/ggml.c
================================================
#include "ggml.h"

#if defined(_MSC_VER) || defined(__MINGW32__)
#include <malloc.h> // using malloc.h with MSC/MINGW
#elif !defined(__FreeBSD__) && !defined(__NetBSD__)
#include <alloca.h>
#endif

#include <assert.h>
#include <time.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <float.h>

// if C99 - static_assert is noop
// ref: https://stackoverflow.com/a/53923785/4039976
#ifndef static_assert
#define static_assert(cond, msg) struct global_scope_noop_trick
#endif

#if defined _MSC_VER || defined(__MINGW32__)

#if !defined(__MINGW32__)
#include <Windows.h>
#else
// ref: https://github.com/ggerganov/whisper.cpp/issues/168
#include <windows.h>
#include <errno.h>
#endif

typedef volatile LONG atomic_int;
typedef atomic_int atomic_bool;

static void atomic_store(atomic_int* ptr, LONG val) {
    InterlockedExchange(ptr, val);
}
static LONG atomic_load(atomic_int* ptr) {
    return InterlockedCompareExchange(ptr, 0, 0);
}
static LONG atomic_fetch_add(atomic_int* ptr, LONG inc) {
    return InterlockedExchangeAdd(ptr, inc);
}
static LONG atomic_fetch_sub(atomic_int* ptr, LONG dec) {
    return atomic_fetch_add(ptr, -(dec));
}

typedef HANDLE pthread_t;

typedef DWORD thread_ret_t;
static int pthread_create(pthread_t* out, void* unused, thread_ret_t(*func)(void*), void* arg) {
    HANDLE handle = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE) func, arg, 0, NULL);
    if (handle == NULL)
    {
        return EAGAIN;
    }

    *out = handle;
    return 0;
}

static int pthread_join(pthread_t thread, void* unused) {
    return (int) WaitForSingleObject(thread, INFINITE);
}

static int sched_yield (void) {
    Sleep (0);
    return 0;
}
#else
#include <pthread.h>
#include <stdatomic.h>

typedef void* thread_ret_t;
#endif

#ifdef __HAIKU__
#define static_assert(cond, msg) _Static_assert(cond, msg)
#endif

/*#define GGML_PERF*/
#define GGML_DEBUG 0
#define GGML_GELU_FP16
#define GGML_SILU_FP16

#define GGML_SOFT_MAX_UNROLL 4
#define GGML_VEC_DOT_UNROLL  2

#ifdef GGML_USE_ACCELERATE
// uncomment to use vDSP for soft max computation
// note: not sure if it is actually faster
//#define GGML_SOFT_MAX_ACCELERATE
#endif

#if UINTPTR_MAX == 0xFFFFFFFF
    #define GGML_MEM_ALIGN 4
#else
    #define GGML_MEM_ALIGN 16
#endif

#define UNUSED(x) (void)(x)
#define SWAP(x, y, T) do { T SWAP = x; x = y; y = SWAP; } while (0)

#define GGML_ASSERT(x) \
    do { \
        if (!(x)) { \
            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
            abort(); \
        } \
    } while (0)

#ifdef GGML_USE_ACCELERATE
#include <Accelerate/Accelerate.h>
#elif GGML_USE_OPENBLAS
#include <cblas.h>
#endif

#undef MIN
#undef MAX
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

// floating point type used to accumulate sums
typedef double ggml_float;

// 16-bit float
// on Arm, we use __fp16
// on x86, we use uint16_t
#ifdef __ARM_NEON

// if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
//
//   $ ln -sfn /Library/Developer/CommandLineTools/usr/lib/clang/13.1.6/include/arm_neon.h ./src/
//
#include <arm_neon.h>

#define GGML_COMPUTE_FP16_TO_FP32(x) (x)
#define GGML_COMPUTE_FP32_TO_FP16(x) (x)

#define GGML_FP16_TO_FP32(x) (x)
#define GGML_FP32_TO_FP16(x) (x)

#else

#ifdef __wasm_simd128__
#include <wasm_simd128.h>
#else
#ifdef __POWER9_VECTOR__
#include <altivec.h>
#undef bool
#define bool _Bool
#else
#include <immintrin.h>
#endif
#endif

#ifdef __F16C__

#define GGML_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)

#else

// FP16 <-> FP32
// ref: https://github.com/Maratyszcza/FP16

static inline float fp32_from_bits(uint32_t w) {
    union {
        uint32_t as_bits;
        float as_value;
    } fp32;
    fp32.as_bits = w;
    return fp32.as_value;
}

static inline uint32_t fp32_to_bits(float f) {
	union {
		float as_value;
		uint32_t as_bits;
	} fp32;
	fp32.as_value = f;
	return fp32.as_bits;
}

static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
    const uint32_t w = (uint32_t) h << 16;
    const uint32_t sign = w & UINT32_C(0x80000000);
    const uint32_t two_w = w + w;

    const uint32_t exp_offset = UINT32_C(0xE0) << 23;
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)
    const float exp_scale = 0x1.0p-112f;
#else
    const float exp_scale = fp32_from_bits(UINT32_C(0x7800000));
#endif
    const float normalized_value = fp32_from_bits((two_w >> 4) + exp_offset) * exp_scale;

    const uint32_t magic_mask = UINT32_C(126) << 23;
    const float magic_bias = 0.5f;
    const float denormalized_value = fp32_from_bits((two_w >> 17) | magic_mask) - magic_bias;

    const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
    const uint32_t result = sign |
        (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value) : fp32_to_bits(normalized_value));
    return fp32_from_bits(result);
}

static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)
    const float scale_to_inf = 0x1.0p+112f;
    const float scale_to_zero = 0x1.0p-110f;
#else
    const float scale_to_inf = fp32_from_bits(UINT32_C(0x77800000));
    const float scale_to_zero = fp32_from_bits(UINT32_C(0x08800000));
#endif
    float base = (fabsf(f) * scale_to_inf) * scale_to_zero;

    const uint32_t w = fp32_to_bits(f);
    const uint32_t shl1_w = w + w;
    const uint32_t sign = w & UINT32_C(0x80000000);
    uint32_t bias = shl1_w & UINT32_C(0xFF000000);
    if (bias < UINT32_C(0x71000000)) {
        bias = UINT32_C(0x71000000);
    }

    base = fp32_from_bits((bias >> 1) + UINT32_C(0x07800000)) + base;
    const uint32_t bits = fp32_to_bits(base);
    const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
    const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF);
    const uint32_t nonsign = exp_bits + mantissa_bits;
    return (sign >> 16) | (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign);
}

#define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)

#endif // __F16C__

#endif // __ARM_NEON

//
// global data
//

// precomputed gelu table for f16 (128 KB)
static ggml_fp16_t table_gelu_f16[1 << 16];

// precomputed silu table for f16 (128 KB)
static ggml_fp16_t table_silu_f16[1 << 16];

// precomputed exp table for f16 (128 KB)
static ggml_fp16_t table_exp_f16[1 << 16];

// precomputed f32 table for f16 (256 KB)
static float table_f32_f16[1 << 16];

// On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
// so we define GGML_FP16_TO_FP32 and GGML_FP32_TO_FP16 elsewhere for NEON.
#if !defined(GGML_FP16_TO_FP32) || !defined(GGML_FP32_TO_FP16)

inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t));
    return table_f32_f16[s];
}

#define GGML_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)
#define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)

#endif

// note: do not use these inside ggml.c
// these are meant to be used via the ggml.h API
float ggml_fp16_to_fp32(ggml_fp16_t x) {
    return GGML_FP16_TO_FP32(x);
}

ggml_fp16_t ggml_fp32_to_fp16(float x) {
    return GGML_FP32_TO_FP16(x);
}

//
// timing
//

#if defined(_MSC_VER) || defined(__MINGW32__)
static int64_t timer_freq;
void ggml_time_init(void) {
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency(&frequency);
    timer_freq = frequency.QuadPart;
}
int64_t ggml_time_ms(void) {
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return (t.QuadPart * 1000) / timer_freq;
}
int64_t ggml_time_us(void) {
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return (t.QuadPart * 1000000) / timer_freq;
}
#else
void ggml_time_init(void) {}
int64_t ggml_time_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec*1000 + (int64_t)ts.tv_nsec/1000000;
}

int64_t ggml_time_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec*1000000 + (int64_t)ts.tv_nsec/1000;
}
#endif

int64_t ggml_cycles(void) {
    return clock();
}

int64_t ggml_cycles_per_ms(void) {
    return CLOCKS_PER_SEC/1000;
}

#ifdef GGML_PERF
#define ggml_perf_time_ms()       ggml_time_ms()
#define ggml_perf_time_us()       ggml_time_us()
#define ggml_perf_cycles()        ggml_cycles()
#define ggml_perf_cycles_per_ms() ggml_cycles_per_ms()
#else
#define ggml_perf_time_ms()       0
#define ggml_perf_time_us()       0
#define ggml_perf_cycles()        0
#define ggml_perf_cycles_per_ms() 0
#endif

//
// cache line
//

#if defined(__cpp_lib_hardware_interference_size)
#define CACHE_LINE_SIZE hardware_destructive_interference_size
#else
#if defined(__POWER9_VECTOR__)
#define CACHE_LINE_SIZE 128
#else
#define CACHE_LINE_SIZE 64
#endif
#endif

static const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);

//
// quantization
//

#define QK 32

// AVX routines provided by GH user Const-me
// ref: https://github.com/ggerganov/ggml/pull/27#issuecomment-1464934600
#if __AVX2__
// Unpack 32 4-bit fields into 32 bytes
// The output vector contains 32 bytes, each one in [ 0 .. 15 ] interval
inline __m256i bytesFromNibbles( const uint8_t* rsi )
{
    // Load 16 bytes from memory
    __m128i tmp = _mm_loadu_si128( ( const __m128i* )rsi );

    // Expand bytes into uint16_t values
    __m256i bytes = _mm256_cvtepu8_epi16( tmp );

    // Unpack values into individual bytes
    const __m256i lowMask = _mm256_set1_epi8( 0xF );
    __m256i high = _mm256_andnot_si256( lowMask, bytes );
    __m256i low = _mm256_and_si256( lowMask, bytes );
    high = _mm256_slli_epi16( high, 4 );
    bytes = _mm256_or_si256( low, high );
    return bytes;
}

inline __m128i packNibbles( __m256i bytes )
{
    // Move bits within 16-bit lanes from 0000_abcd_0000_efgh into 0000_0000_abcd_efgh
    const __m256i lowByte = _mm256_set1_epi16( 0xFF );
    __m256i high = _mm256_andnot_si256( lowByte, bytes );
    __m256i low = _mm256_and_si256( lowByte, bytes );
    high = _mm256_srli_epi16( high, 4 );
    bytes = _mm256_or_si256( low, high );

    // Compress uint16_t lanes into bytes
    __m128i r0 = _mm256_castsi256_si128( bytes );
    __m128i r1 = _mm256_extracti128_si256( bytes, 1 );
    return _mm_packus_epi16( r0, r1 );
}
#endif


// method 5
// blocks of QK elements
// represented with a single float (delta) and QK/2 8-bit ints (i.e. QK 4-bit signed integer factors)
void quantize_row_q4_0(const float * restrict x, void * restrict y, int k) {
    assert(k % QK == 0);

    const int nb = k / QK;
    const size_t bs = sizeof(float) + QK/2;

    uint8_t * restrict pd = ((uint8_t *)y + 0*bs);
    uint8_t * restrict pb = ((uint8_t *)y + 0*bs + sizeof(float));

    uint8_t pp[QK/2];

#if __ARM_NEON
#if QK == 32
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max

        float32x4_t srcv [8];
        float32x4_t asrcv[8];
        float32x4_t amaxv[8];

        for (int l = 0; l < 8; l++) srcv[l]  = vld1q_f32(x + i*32 + 4*l);
        for (int l = 0; l < 8; l++) asrcv[l] = vabsq_f32(srcv[l]);

        for (int l = 0; l < 4; l++) amaxv[2*l] = vmaxq_f32(asrcv[2*l], asrcv[2*l+1]);
        for (int l = 0; l < 2; l++) amaxv[4*l] = vmaxq_f32(amaxv[4*l], amaxv[4*l+2]);
        for (int l = 0; l < 1; l++) amaxv[8*l] = vmaxq_f32(amaxv[8*l], amaxv[8*l+4]);

        amax = MAX(
                MAX(vgetq_lane_f32(amaxv[0], 0), vgetq_lane_f32(amaxv[0], 1)),
                MAX(vgetq_lane_f32(amaxv[0], 2), vgetq_lane_f32(amaxv[0], 3)));

        const float d = amax / ((1 << 3) - 1);
        const float id = d ? 1.0/d : 0.0;

        *(float *)pd = d;
        pd += bs;

        for (int l = 0; l < 8; l++) {
            const float32x4_t v  = vmulq_n_f32(srcv[l], id);
            const float32x4_t vf = vaddq_f32(v, vdupq_n_f32(8.5f));
            const int32x4_t   vi = vcvtq_s32_f32(vf);

            pp[2*l + 0] = vgetq_lane_s32(vi, 0) | (vgetq_lane_s32(vi, 1) << 4);
            pp[2*l + 1] = vgetq_lane_s32(vi, 2) | (vgetq_lane_s32(vi, 3) << 4);
        }

        memcpy(pb, pp, sizeof(pp));
        pb += bs;
    }
#else
#error "not implemented for QK"
#endif
#elif defined(__AVX2__)
#if QK == 32
    for (int i = 0; i < nb; i++) {
        // Load elements into 4 AVX vectors
        __m256 v0 = _mm256_loadu_ps( x );
        __m256 v1 = _mm256_loadu_ps( x + 8 );
        __m256 v2 = _mm256_loadu_ps( x + 16 );
        __m256 v3 = _mm256_loadu_ps( x + 24 );
        x += 32;

        // Compute max(abs(e)) for the block
        const __m256 signBit = _mm256_set1_ps( -0.0f );
        __m256 maxAbs = _mm256_andnot_ps( signBit, v0 );
        maxAbs = _mm256_max_ps( maxAbs, _mm256_andnot_ps( signBit, v1 ) );
        maxAbs = _mm256_max_ps( maxAbs, _mm256_andnot_ps( signBit, v2 ) );
        maxAbs = _mm256_max_ps( maxAbs, _mm256_andnot_ps( signBit, v3 ) );

        __m128 max4 = _mm_max_ps( _mm256_extractf128_ps( maxAbs, 1 ), _mm256_castps256_ps128( maxAbs ) );
        max4 = _mm_max_ps( max4, _mm_movehl_ps( max4, max4 ) );
        max4 = _mm_max_ss( max4, _mm_movehdup_ps( max4 ) );
        const float maxScalar = _mm_cvtss_f32( max4 );

        // Quantize these floats
        const float d = maxScalar / 7.0f;
        *(float *)pd = d;
        pd += bs;
        const float id = ( maxScalar != 0.0f ) ? 7.0f / maxScalar : 0.0f;
        const __m256 mul = _mm256_set1_ps( id );

        // Apply the multiplier
        v0 = _mm256_mul_ps( v0, mul );
        v1 = _mm256_mul_ps( v1, mul );
        v2 = _mm256_mul_ps( v2, mul );
        v3 = _mm256_mul_ps( v3, mul );

        // Round to nearest integer
        v0 = _mm256_round_ps( v0, _MM_FROUND_TO_NEAREST_INT );
        v1 = _mm256_round_ps( v1, _MM_FROUND_TO_NEAREST_INT );
        v2 = _mm256_round_ps( v2, _MM_FROUND_TO_NEAREST_INT );
        v3 = _mm256_round_ps( v3, _MM_FROUND_TO_NEAREST_INT );

        // Convert floats to integers
        __m256i i0 = _mm256_cvtps_epi32( v0 );
        __m256i i1 = _mm256_cvtps_epi32( v1 );
        __m256i i2 = _mm256_cvtps_epi32( v2 );
        __m256i i3 = _mm256_cvtps_epi32( v3 );

        // Convert int32 to int16
        i0 = _mm256_packs_epi32( i0, i1 );	// 0, 1, 2, 3,  8, 9, 10, 11,  4, 5, 6, 7, 12, 13, 14, 15
        i2 = _mm256_packs_epi32( i2, i3 );	// 16, 17, 18, 19,  24, 25, 26, 27,  20, 21, 22, 23, 28, 29, 30, 31

        // Convert int16 to int8
        i0 = _mm256_packs_epi16( i0, i2 );	// 0, 1, 2, 3,  8, 9, 10, 11,  16, 17, 18, 19,  24, 25, 26, 27,  4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31

        // The signed bytes are now correct, but their order is not:
        // the AVX2 pack instructions process each 128-bit lane independently.
        // The permutation below restores the original element order.
        const __m256i perm = _mm256_setr_epi32( 0, 4, 1, 5, 2, 6, 3, 7 );
        i0 = _mm256_permutevar8x32_epi32( i0, perm );

        // Apply offset to translate the range from [ -7 .. +7 ] into [ +1 .. +15 ]
        const __m256i off = _mm256_set1_epi8( 8 );
        i0 = _mm256_add_epi8( i0, off );

        // Compress the vector into 4 bit/value, and store
        __m128i res = packNibbles( i0 );
        _mm_storeu_si128( ( __m128i* )pb, res );
        pb += bs;
    }
#else
#error "not implemented for QK"
#endif
#elif defined(__wasm_simd128__)
#if QK == 32
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max

        v128_t srcv [8];
        v128_t asrcv[8];
        v128_t amaxv[8];

        for (int l = 0; l < 8; l++) srcv[l]  = wasm_v128_load(x + i*32 + 4*l);
        for (int l = 0; l < 8; l++) asrcv[l] = wasm_f32x4_abs(srcv[l]);

        for (int l = 0; l < 4; l++) amaxv[2*l] = wasm_f32x4_max(asrcv[2*l], asrcv[2*l+1]);
        for (int l = 0; l < 2; l++) amaxv[4*l] = wasm_f32x4_max(amaxv[4*l], amaxv[4*l+2]);
        for (int l = 0; l < 1; l++) amaxv[8*l] = wasm_f32x4_max(amaxv[8*l], amaxv[8*l+4]);

        amax = MAX(
                MAX(wasm_f32x4_extract_lane(amaxv[0], 0), wasm_f32x4_extract_lane(amaxv[0], 1)),
                MAX(wasm_f32x4_extract_lane(amaxv[0], 2), wasm_f32x4_extract_lane(amaxv[0], 3)));

        const float d = amax / ((1 << 3) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        *(float *)pd = d;
        pd += bs;

        for (int l = 0; l < 8; l++) {
            const v128_t v  = wasm_f32x4_mul(srcv[l], wasm_f32x4_splat(id));
            const v128_t vf = wasm_f32x4_add(v, wasm_f32x4_splat(8.5f));
            const v128_t vi = wasm_i32x4_trunc_sat_f32x4(vf);

            pp[2*l + 0] = wasm_i32x4_extract_lane(vi, 0) | (wasm_i32x4_extract_lane(vi, 1) << 4);
            pp[2*l + 1] = wasm_i32x4_extract_lane(vi, 2) | (wasm_i32x4_extract_lane(vi, 3) << 4);
        }

        memcpy(pb, pp, sizeof(pp));
        pb += bs;
    }
#else
#error "not implemented for QK"
#endif
#else
    // scalar
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max

        for (int l = 0; l < QK; l++) {
            const float v = x[i*QK + l];
            amax = MAX(amax, fabsf(v));
        }

        const float d = amax / ((1 << 3) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        *(float *)pd = d;
        pd += bs;

        for (int l = 0; l < QK; l += 2) {
            const float v0 = x[i*QK + l + 0]*id;
            const float v1 = x[i*QK + l + 1]*id;

            const uint8_t vi0 = ((int8_t) (roundf(v0))) + 8;
            const uint8_t vi1 = ((int8_t) (roundf(v1))) + 8;

            // vi0 and vi1 are unsigned, so only the upper bound needs checking
            assert(vi0 < 16);
            assert(vi1 < 16);

            pp[l/2] = vi0 | (vi1 << 4);
        }

        memcpy(pb, pp, sizeof(pp));
        pb += bs;
    }
#endif
}
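
/*
The q4_0 layout written by the function above (one float scale followed by
QK/2 packed bytes per block) can be checked on a single block with a small
standalone sketch. It is illustrative only and not part of ggml: EX_QK,
ex_round_to_int and example_quantize_block_q4_0 are hypothetical names, the
block size of 32 is an assumption matching the QK == 32 paths, and the
rounding helper stands in for round() so that the sketch does not need libm.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_QK 32 // assumed block size, mirroring the QK == 32 paths

// Round half away from zero, like round(), without calling into libm.
static int ex_round_to_int(float v) {
    return (int) (v + (v >= 0.0f ? 0.5f : -0.5f));
}

// Quantizes one block of EX_QK floats into [float d][EX_QK/2 packed bytes].
// `out` must hold sizeof(float) + EX_QK/2 bytes.
static void example_quantize_block_q4_0(const float *x, uint8_t *out) {
    float amax = 0.0f; // absolute max of the block
    for (int l = 0; l < EX_QK; l++) {
        const float a = x[l] < 0.0f ? -x[l] : x[l];
        if (a > amax) amax = a;
    }

    const float d  = amax/7.0f;          // maps [-amax, amax] onto [-7, 7]
    const float id = d ? 1.0f/d : 0.0f;  // guard against an all-zero block

    memcpy(out, &d, sizeof(d));          // the scale comes first
    uint8_t *q = out + sizeof(d);        // followed by the packed codes

    for (int l = 0; l < EX_QK; l += 2) {
        // the +8 offset shifts [-7, 7] to the unsigned codes [1, 15]
        const uint8_t q0 = (uint8_t) (ex_round_to_int(x[l + 0]*id) + 8);
        const uint8_t q1 = (uint8_t) (ex_round_to_int(x[l + 1]*id) + 8);

        // even element in the low nibble, odd element in the high nibble
        q[l/2] = (uint8_t) (q0 | (q1 << 4));
    }
}
```

With x[l] = (l % 15) - 7 the scale is exactly 1.0f, so x[0] = -7 and x[1] = -6
become the codes 1 and 2 and the first packed byte is 0x21.
*/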

// method 4
// blocks of QK elements
// represented with 2 floats (min + delta) and QK/2 8-bit ints (i.e. QK 4-bit unsigned integer factors)
void quantize_row_q4_1(const float * restrict x, void * restrict y, int k) {
    assert(k % QK == 0);

    const int nb = k / QK;

    float   * restrict pm = (float *)   (y);
    float   * restrict pd = (float *)   (pm + nb);
    uint8_t * restrict pb = (uint8_t *) (pd + nb);

    uint8_t pp[QK/2];

    for (int i = 0; i < nb; i++) {
        float min = FLT_MAX;
        float max = -FLT_MAX;

        for (int l = 0; l < QK; l++) {
            const float v = x[i*QK + l];
            if (v < min) min = v;
            if (v > max) max = v;
        }

        const float d = (max - min) / ((1 << 4) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        pm[i] = min;
        pd[i] = d;

        for (int l = 0; l < QK; l += 2) {
            const float v0 = (x[i*QK + l + 0] - min)*id;
            const float v1 = (x[i*QK + l + 1] - min)*id;

            const uint8_t vi0 = roundf(v0);
            const uint8_t vi1 = roundf(v1);

            // vi0 and vi1 are unsigned, so only the upper bound needs checking
            assert(vi0 < 16);
            assert(vi1 < 16);

            pp[l/2] = vi0 | (vi1 << 4);
        }

        memcpy(pb + i*QK/2, pp, sizeof(pp));
    }
}

// TODO: vectorize
void dequantize_row_q4_0(const void * restrict x, float * restrict y, int k) {
    assert(k % QK == 0);

    const int nb = k / QK;
    const size_t bs = sizeof(float) + QK/2;

    const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs);
    const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + sizeof(float));

    // scalar
    for (int i = 0; i < nb; i++) {
        const float d = *(const float *) (pd + i*bs);

        const uint8_t * restrict pp = pb + i*bs;

        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = pp[l/2];

            const int8_t vi0 = vi & 0xf;
            const int8_t vi1 = vi >> 4;

            const float v0 = (vi0 - 8)*d;
            const float v1 = (vi1 - 8)*d;

            //printf("d = %f, vi = %d, vi0 = %d, vi1 = %d, v0 = %f, v1 = %f\n", d, vi, vi0, vi1, v0, v1);

            y[i*QK + l + 0] = v0;
            y[i*QK + l + 1] = v1;

            assert(!isnan(y[i*QK + l + 0]));
            assert(!isnan(y[i*QK + l + 1]));
        }
    }
}
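
/*
The nibble unpacking in dequantize_row_q4_0 can be exercised on a hand-built
block. This is a minimal sketch, not ggml code: EX_QK and
example_dequantize_block_q4_0 are hypothetical names, and memcpy is used for
the scale purely to keep the sketch free of pointer casts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_QK 32 // assumed block size

// Expands one q4_0 block ([float d][EX_QK/2 packed bytes]) into EX_QK floats.
static void example_dequantize_block_q4_0(const uint8_t *in, float *y) {
    float d;
    memcpy(&d, in, sizeof(d));

    const uint8_t *q = in + sizeof(d);

    for (int l = 0; l < EX_QK; l += 2) {
        const int q0 = q[l/2] & 0xf; // low nibble:  even element
        const int q1 = q[l/2] >> 4;  // high nibble: odd element

        // undo the +8 offset applied during quantization, then rescale
        y[l + 0] = (float) (q0 - 8)*d;
        y[l + 1] = (float) (q1 - 8)*d;
    }
}
```

For a block with d = 0.5f whose packed bytes are all 0x21, every even element
decodes to (1 - 8)*0.5 = -3.5 and every odd element to (2 - 8)*0.5 = -3.0.
*/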

void dequantize_row_q4_1(const void * restrict x, float * restrict y, int k) {
    assert(k % QK == 0);

    const int nb = k / QK;

    const float   * restrict pm = (const float *)   (x);
    const float   * restrict pd = (const float *)   (pm + nb);
    const uint8_t * restrict pb = (const uint8_t *) (pd + nb);

    for (int i = 0; i < nb; i++) {
        const float m = pm[i];
        const float d = pd[i];

        const uint8_t * restrict pp = pb + i*QK/2;

        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = pp[l/2];

            const int8_t vi0 = vi & 0xf;
            const int8_t vi1 = vi >> 4;

            const float v0 = vi0*d + m;
            const float v1 = vi1*d + m;

            y[i*QK + l + 0] = v0;
            y[i*QK + l + 1] = v1;

            assert(!isnan(y[i*QK + l + 0]));
            assert(!isnan(y[i*QK + l + 1]));
        }
    }
}
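
/*
The q4_1 scheme (min + delta per block, unsigned 4-bit codes) can be checked
with a round trip on one block. The sketch below is illustrative and makes
some assumptions: EX_QK, example_quantize_block_q4_1 and
example_roundtrip_error_q4_1 are hypothetical names, the min, delta and codes
are kept in separate buffers rather than in the row layout used above, and
rounding is done by adding 0.5f and truncating, which is valid here because
the scaled values are non-negative.

```c
#include <assert.h>
#include <stdint.h>

#define EX_QK 32 // assumed block size

// Quantizes one block: stores min and delta, packs two 4-bit codes per byte.
static void example_quantize_block_q4_1(const float *x, float *m, float *d, uint8_t *q) {
    float min = x[0];
    float max = x[0];
    for (int l = 1; l < EX_QK; l++) {
        if (x[l] < min) min = x[l];
        if (x[l] > max) max = x[l];
    }

    const float delta = (max - min)/15.0f;         // 15 steps cover [min, max]
    const float id    = delta ? 1.0f/delta : 0.0f; // guard against a constant block

    *m = min;
    *d = delta;

    for (int l = 0; l < EX_QK; l += 2) {
        const uint8_t q0 = (uint8_t) ((x[l + 0] - min)*id + 0.5f);
        const uint8_t q1 = (uint8_t) ((x[l + 1] - min)*id + 0.5f);
        q[l/2] = (uint8_t) (q0 | (q1 << 4));
    }
}

// Returns the largest absolute difference between x[l] and min + code*delta.
static float example_roundtrip_error_q4_1(const float *x) {
    float m;
    float d;
    uint8_t q[EX_QK/2];
    example_quantize_block_q4_1(x, &m, &d, q);

    float err = 0.0f;
    for (int l = 0; l < EX_QK; l++) {
        const int   code = (l % 2 == 0) ? (q[l/2] & 0xf) : (q[l/2] >> 4);
        const float y    = (float) code*d + m;
        const float e    = x[l] > y ? x[l] - y : y - x[l];
        if (e > err) err = e;
    }
    return err;
}
```

For x[l] = l the delta is 31/15, and the reconstruction error stays within
about half a delta. A constant block has delta 0 and is reproduced exactly.
*/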

//
// simd mappings
//

// we define a common set of C macros which map to specific intrinsics based on the current architecture
// we then implement the fundamental computation operations below using only these macros
// adding support for a new architecture requires defining the corresponding SIMD macros
//
// GGML_F32_STEP / GGML_F16_STEP
//   number of elements to process in a single step
//
// GGML_F32_EPR / GGML_F16_EPR
//   number of elements that fit in a single register
//

#if defined(__ARM_NEON) && defined(__ARM_FEATURE_FMA)

#define GGML_SIMD

// F32 NEON

#define GGML_F32_STEP 16
#define GGML_F32_EPR  4

#define GGML_F32x4              float32x4_t
#define GGML_F32x4_ZERO         vdupq_n_f32(0.0f)
#define GGML_F32x4_SET1(x)      vdupq_n_f32(x)
#define GGML_F32x4_LOAD         vld1q_f32
#define GGML_F32x4_STORE        vst1q_f32
#define GGML_F32x4_FMA(a, b, c) vfmaq_f32(a, b, c)
#define GGML_F32x4_ADD          vaddq_f32
#define GGML_F32x4_MUL          vmulq_f32
#if defined(__aarch64__) // vaddvq_f32 is an A64-only intrinsic
    #define GGML_F32x4_REDUCE_ONE(x) vaddvq_f32(x)
#else
    #define GGML_F32x4_REDUCE_ONE(x) \
    (vgetq_lane_f32(x, 0) +          \
     vgetq_lane_f32(x, 1) +          \
     vgetq_lane_f32(x, 2) +          \
     vgetq_lane_f32(x, 3))
#endif
#define GGML_F32x4_REDUCE(res, x)              \
{                                              \
    for (int i = 0; i < GGML_F32_ARR/2; ++i) { \
        x[2*i] = vaddq_f32(x[2*i], x[2*i+1]);  \
    }                                          \
    for (int i = 0; i < GGML_F32_ARR/4; ++i) { \
        x[4*i] = vaddq_f32(x[4*i], x[4*i+2]);  \
    }                                          \
    for (int i = 0; i < GGML_F32_ARR/8; ++i) { \
        x[8*i] = vaddq_f32(x[8*i], x[8*i+4]);  \
    }                                          \
    res = GGML_F32x4_REDUCE_ONE(x[0]);         \
}

#define GGML_F32_VEC        GGML_F32x4
#define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
#define GGML_F32_VEC_SET1   GGML_F32x4_SET1
#define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
#define GGML_F32_VEC_STORE  GGML_F32x4_STORE
#define GGML_F32_VEC_FMA    GGML_F32x4_FMA
#define GGML_F32_VEC_ADD    GGML_F32x4_ADD
#define GGML_F32_VEC_MUL    GGML_F32x4_MUL
#define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE

// F16 NEON

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    #define GGML_F16_STEP 32
    #define GGML_F16_EPR  8

    #define GGML_F16x8              float16x8_t
    #define GGML_F16x8_ZERO         vdupq_n_f16(0.0f)
    #define GGML_F16x8_SET1(x)      vdupq_n_f16(x)
    #define GGML_F16x8_LOAD         vld1q_f16
    #define GGML_F16x8_STORE        vst1q_f16
    #define GGML_F16x8_FMA(a, b, c) vfmaq_f16(a, b, c)
    #define GGML_F16x8_ADD          vaddq_f16
    #define GGML_F16x8_MUL          vmulq_f16
    #define GGML_F16x8_REDUCE(res, x)                             \
    {                                                             \
        for (int i = 0; i < GGML_F16_ARR/2; ++i) {                \
            x[2*i] = vaddq_f16(x[2*i], x[2*i+1]);                 \
        }                                                         \
        for (int i = 0; i < GGML_F16_ARR/4; ++i) {                \
            x[4*i] = vaddq_f16(x[4*i], x[4*i+2]);                 \
        }                                                         \
        for (int i = 0; i < GGML_F16_ARR/8; ++i) {                \
            x[8*i] = vaddq_f16(x[8*i], x[8*i+4]);                 \
        }                                                         \
        const float32x4_t t0 = vcvt_f32_f16(vget_low_f16 (x[0])); \
        const float32x4_t t1 = vcvt_f32_f16(vget_high_f16(x[0])); \
        res = vaddvq_f32(vaddq_f32(t0, t1));                      \
    }

    #define GGML_F16_VEC                GGML_F16x8
    #define GGML_F16_VEC_ZERO           GGML_F16x8_ZERO
    #define GGML_F16_VEC_SET1           GGML_F16x8_SET1
    #define GGML_F16_VEC_LOAD(p, i)     GGML_F16x8_LOAD(p)
    #define GGML_F16_VEC_STORE(p, r, i) GGML_F16x8_STORE(p, r[i])
    #define GGML_F16_VEC_FMA            GGML_F16x8_FMA
    #define GGML_F16_VEC_ADD            GGML_F16x8_ADD
    #define GGML_F16_VEC_MUL            GGML_F16x8_MUL
    #define GGML_F16_VEC_REDUCE         GGML_F16x8_REDUCE
#else
    // if FP16 vector arithmetic is not supported, we use FP32 instead
    // and take advantage of the vcvt_ functions to convert to/from FP16

    #define GGML_F16_STEP 16
    #define GGML_F16_EPR  4

    #define GGML_F32Cx4              float32x4_t
    #define GGML_F32Cx4_ZERO         vdupq_n_f32(0.0f)
    #define GGML_F32Cx4_SET1(x)      vdupq_n_f32(x)
    #define GGML_F32Cx4_LOAD(x)      vcvt_f32_f16(vld1_f16(x))
    #define GGML_F32Cx4_STORE(x, y)  vst1_f16(x, vcvt_f16_f32(y))
    #define GGML_F32Cx4_FMA(a, b, c) vfmaq_f32(a, b, c)
    #define GGML_F32Cx4_ADD          vaddq_f32
    #define GGML_F32Cx4_MUL          vmulq_f32
    #define GGML_F32Cx4_REDUCE       GGML_F32x4_REDUCE

    #define GGML_F16_VEC                GGML_F32Cx4
    #define GGML_F16_VEC_ZERO           GGML_F32Cx4_ZERO
    #define GGML_F16_VEC_SET1           GGML_F32Cx4_SET1
    #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx4_LOAD(p)
    #define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx4_STORE(p, r[i])
    #define GGML_F16_VEC_FMA            GGML_F32Cx4_FMA
    #define GGML_F16_VEC_ADD            GGML_F32Cx4_ADD
    #define GGML_F16_VEC_MUL            GGML_F32Cx4_MUL
    #define GGML_F16_VEC_REDUCE         GGML_F32Cx4_REDUCE
#endif

#elif defined(__AVX__)

#define GGML_SIMD

// F32 AVX

#define GGML_F32_STEP 32
#define GGML_F32_EPR  8

#define GGML_F32x8         __m256
#define GGML_F32x8_ZERO    _mm256_setzero_ps()
#define GGML_F32x8_SET1(x) _mm256_set1_ps(x)
#define GGML_F32x8_LOAD    _mm256_loadu_ps
#define GGML_F32x8_STORE   _mm256_storeu_ps
#if defined(__FMA__)
    #define GGML_F32x8_FMA(a, b, c) _mm256_fmadd_ps(b, c, a)
#else
    #define GGML_F32x8_FMA(a, b, c) _mm256_add_ps(_mm256_mul_ps(b, c), a)
#endif
#define GGML_F32x8_ADD     _mm256_add_ps
#define GGML_F32x8_MUL     _mm256_mul_ps
#define GGML_F32x8_REDUCE(res, x)                                 \
{                                                                 \
    for (int i = 0; i < GGML_F32_ARR/2; ++i) {                    \
        x[2*i] = _mm256_add_ps(x[2*i], x[2*i+1]);                 \
    }                                                             \
    for (int i = 0; i < GGML_F32_ARR/4; ++i) {                    \
        x[4*i] = _mm256_add_ps(x[4*i], x[4*i+2]);                 \
    }                                                             \
    for (int i = 0; i < GGML_F32_ARR/8; ++i) {                    \
        x[8*i] = _mm256_add_ps(x[8*i], x[8*i+4]);                 \
    }                                                             \
    const __m128 t0 = _mm_add_ps(_mm256_castps256_ps128(x[0]),    \
                                 _mm256_extractf128_ps(x[0], 1)); \
    const __m128 t1 = _mm_hadd_ps(t0, t0);                        \
    res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1));                     \
}
// TODO: is this optimal ?

#define GGML_F32_VEC        GGML_F32x8
#define GGML_F32_VEC_ZERO   GGML_F32x8_ZERO
#define GGML_F32_VEC_SET1   GGML_F32x8_SET1
#define GGML_F32_VEC_LOAD   GGML_F32x8_LOAD
#define GGML_F32_VEC_STORE  GGML_F32x8_STORE
#define GGML_F32_VEC_FMA    GGML_F32x8_FMA
#define GGML_F32_VEC_ADD    GGML_F32x8_ADD
#define GGML_F32_VEC_MUL    GGML_F32x8_MUL
#define GGML_F32_VEC_REDUCE GGML_F32x8_REDUCE

// F16 AVX

#define GGML_F16_STEP 32
#define GGML_F16_EPR  8

// F16 arithmetic is not supported by AVX, so we use F32 instead
// we take advantage of the _mm256_cvt intrinsics to convert F16 <-> F32

#define GGML_F32Cx8             __m256
#define GGML_F32Cx8_ZERO        _mm256_setzero_ps()
#define GGML_F32Cx8_SET1(x)     _mm256_set1_ps(x)
#define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
#define GGML_F32Cx8_STORE(x, y) _mm_storeu_si128((__m128i *)(x), _mm256_cvtps_ph(y, 0))
#define GGML_F32Cx8_FMA         GGML_F32x8_FMA
#define GGML_F32Cx8_ADD         _mm256_add_ps
#define GGML_F32Cx8_MUL         _mm256_mul_ps
#define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE

#define GGML_F16_VEC                GGML_F32Cx8
#define GGML_F16_VEC_ZERO           GGML_F32Cx8_ZERO
#define GGML_F16_VEC_SET1           GGML_F32Cx8_SET1
#define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
#define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx8_STORE(p, r[i])
#define GGML_F16_VEC_FMA            GGML_F32Cx8_FMA
#define GGML_F16_VEC_ADD            GGML_F32Cx8_ADD
#define GGML_F16_VEC_MUL            GGML_F32Cx8_MUL
#define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE

#elif defined(__POWER9_VECTOR__)

#define GGML_SIMD

// F32 POWER9

#define GGML_F32_STEP 32
#define GGML_F32_EPR  4

#define GGML_F32x4              vector float
#define GGML_F32x4_ZERO         0.0f
#define GGML_F32x4_SET1         vec_splats
#define GGML_F32x4_LOAD(p)      vec_xl(0, p)
#define GGML_F32x4_STORE(p, r)  vec_xst(r, 0, p)
#define GGML_F32x4_FMA(a, b, c) vec_madd(b, c, a)
#define GGML_F32x4_ADD          vec_add
#define GGML_F32x4_MUL          vec_mul
#define GGML_F32x4_REDUCE(res, x)              \
{                                              \
    for (int i = 0; i < GGML_F32_ARR/2; ++i) { \
        x[2*i] = vec_add(x[2*i], x[2*i+1]);    \
    }                                          \
    for (int i = 0; i < GGML_F32_ARR/4; ++i) { \
        x[4*i] = vec_add(x[4*i], x[4*i+2]);    \
    }                                          \
    for (int i = 0; i < GGML_F32_ARR/8; ++i) { \
        x[8*i] = vec_add(x[8*i], x[8*i+4]);    \
    }                                          \
    res = vec_extract(x[0], 0) +               \
          vec_extract(x[0], 1) +               \
          vec_extract(x[0], 2) +               \
          vec_extract(x[0], 3);                \
}

#define GGML_F32_VEC        GGML_F32x4
#define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
#define GGML_F32_VEC_SET1   GGML_F32x4_SET1
#define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
#define GGML_F32_VEC_STORE  GGML_F32x4_STORE
#define GGML_F32_VEC_FMA    GGML_F32x4_FMA
#define GGML_F32_VEC_ADD    GGML_F32x4_ADD
#define GGML_F32_VEC_MUL    GGML_F32x4_MUL
#define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE

// F16 POWER9
#define GGML_F16_STEP       GGML_F32_STEP
#define GGML_F16_EPR        GGML_F32_EPR
#define GGML_F16_VEC        GGML_F32x4
#define GGML_F16_VEC_ZERO   GGML_F32x4_ZERO
#define GGML_F16_VEC_SET1   GGML_F32x4_SET1
#define GGML_F16_VEC_FMA    GGML_F32x4_FMA
#define GGML_F16_VEC_REDUCE GGML_F32x4_REDUCE
// Use vec_xl, not vec_ld, in case the load address is not aligned.
#define GGML_F16_VEC_LOAD(p, i) (i & 0x1) ?                   \
  vec_extract_fp32_from_shorth(vec_xl(0, p - GGML_F16_EPR)) : \
  vec_extract_fp32_from_shortl(vec_xl(0, p))
#define GGML_ENDIAN_BYTE(i) ((unsigned char *)&(uint16_t){1})[i]
#define GGML_F16_VEC_STORE(p, r, i)                             \
  if (i & 0x1)                                                  \
    vec_xst(vec_pack_to_short_fp32(r[i - GGML_ENDIAN_BYTE(1)],  \
                                   r[i - GGML_ENDIAN_BYTE(0)]), \
            0, p - GGML_F16_EPR)

#elif defined(__wasm_simd128__)

#define GGML_SIMD

// F32 WASM

#define GGML_F32_STEP 16
#define GGML_F32_EPR  4

#define GGML_F32x4              v128_t
#define GGML_F32x4_ZERO         wasm_f32x4_splat(0.0f)
#define GGML_F32x4_SET1(x)      wasm_f32x4_splat(x)
#define GGML_F32x4_LOAD         wasm_v128_load
#define GGML_F32x4_STORE        wasm_v128_store
#define GGML_F32x4_FMA(a, b, c) wasm_f32x4_add(wasm_f32x4_mul(b, c), a)
#define GGML_F32x4_ADD          wasm_f32x4_add
#define GGML_F32x4_MUL          wasm_f32x4_mul
#define GGML_F32x4_REDUCE(res, x)                  \
{                                                  \
    for (int i = 0; i < GGML_F32_ARR/2; ++i) {     \
        x[2*i] = wasm_f32x4_add(x[2*i], x[2*i+1]); \
    }                                              \
    for (int i = 0; i < GGML_F32_ARR/4; ++i) {     \
        x[4*i] = wasm_f32x4_add(x[4*i], x[4*i+2]); \
    }                                              \
    for (int i = 0; i < GGML_F32_ARR/8; ++i) {     \
        x[8*i] = wasm_f32x4_add(x[8*i], x[8*i+4]); \
    }                                              \
    res = wasm_f32x4_extract_lane(x[0], 0) +       \
          wasm_f32x4_extract_lane(x[0], 1) +       \
          wasm_f32x4_extract_lane(x[0], 2) +       \
          wasm_f32x4_extract_lane(x[0], 3);        \
}

#define GGML_F32_VEC        GGML_F32x4
#define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
#define GGML_F32_VEC_SET1   GGML_F32x4_SET1
#define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
#define GGML_F32_VEC_STORE  GGML_F32x4_STORE
#define GGML_F32_VEC_FMA    GGML_F32x4_FMA
#define GGML_F32_VEC_ADD    GGML_F32x4_ADD
#define GGML_F32_VEC_MUL    GGML_F32x4_MUL
#define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE

// F16 WASM

#define GGML_F16_STEP 16
#define GGML_F16_EPR  4

inline static v128_t __wasm_f16x4_load(const ggml_fp16_t * p) {
    float tmp[4];

    tmp[0] = GGML_FP16_TO_FP32(p[0]);
    tmp[1] = GGML_FP16_TO_FP32(p[1]);
    tmp[2] = GGML_FP16_TO_FP32(p[2]);
    tmp[3] = GGML_FP16_TO_FP32(p[3]);

    return wasm_v128_load(tmp);
}

inline static void __wasm_f16x4_store(ggml_fp16_t * p, v128_t x) {
    float tmp[4];

    wasm_v128_store(tmp, x);

    p[0] = GGML_FP32_TO_FP16(tmp[0]);
    p[1] = GGML_FP32_TO_FP16(tmp[1]);
    p[2] = GGML_FP32_TO_FP16(tmp[2]);
    p[3] = GGML_FP32_TO_FP16(tmp[3]);
}

#define GGML_F16x4             v128_t
#define GGML_F16x4_ZERO        wasm_f32x4_splat(0.0f)
#define GGML_F16x4_SET1(x)     wasm_f32x4_splat(x)
#define GGML_F16x4_LOAD(x)     __wasm_f16x4_load(x)
#define GGML_F16x4_STORE(x, y) __wasm_f16x4_store(x, y)
#define GGML_F16x4_FMA         GGML_F32x4_FMA
#define GGML_F16x4_ADD         wasm_f32x4_add
#define GGML_F16x4_MUL         wasm_f32x4_mul
#define GGML_F16x4_REDUCE(res, x)                  \
{                                                  \
    for (int i = 0; i < GGML_F16_ARR/2; ++i) {     \
        x[2*i] = wasm_f32x4_add(x[2*i], x[2*i+1]); \
    }                                              \
    for (int i = 0; i < GGML_F16_ARR/4; ++i) {     \
        x[4*i] = wasm_f32x4_add(x[4*i], x[4*i+2]); \
    }                                              \
    for (int i = 0; i < GGML_F16_ARR/8; ++i) {     \
        x[8*i] = wasm_f32x4_add(x[8*i], x[8*i+4]); \
    }                                              \
    res = wasm_f32x4_extract_lane(x[0], 0) +       \
          wasm_f32x4_extract_lane(x[0], 1) +       \
          wasm_f32x4_extract_lane(x[0], 2) +       \
          wasm_f32x4_extract_lane(x[0], 3);        \
}

#define GGML_F16_VEC                GGML_F16x4
#define GGML_F16_VEC_ZERO           GGML_F16x4_ZERO
#define GGML_F16_VEC_SET1           GGML_F16x4_SET1
#define GGML_F16_VEC_LOAD(p, i)     GGML_F16x4_LOAD(p)
#define GGML_F16_VEC_STORE(p, r, i) GGML_F16x4_STORE(p, r[i])
#define GGML_F16_VEC_FMA            GGML_F16x4_FMA
#define GGML_F16_VEC_ADD            GGML_F16x4_ADD
#define GGML_F16_VEC_MUL            GGML_F16x4_MUL
#define GGML_F16_VEC_REDUCE         GGML_F16x4_REDUCE

#elif defined(__SSE3__)

#define GGML_SIMD

// F32 SSE

#define GGML_F32_STEP 32
#define GGML_F32_EPR  4

#define GGML_F32x4         __m128
#define GGML_F32x4_ZERO    _mm_setzero_ps()
#define GGML_F32x4_SET1(x) _mm_set1_ps(x)
#define GGML_F32x4_LOAD    _mm_loadu_ps
#define GGML_F32x4_STORE   _mm_storeu_ps
#if defined(__FMA__)
    // TODO: Does this work?
    #define GGML_F32x4_FMA(a, b, c) _mm_fmadd_ps(b, c, a)
#else
    #define GGML_F32x4_FMA(a, b, c) _mm_add_ps(_mm_mul_ps(b, c), a)
#endif
#define GGML_F32x4_ADD     _mm_add_ps
#define GGML_F32x4_MUL     _mm_mul_ps
#define GGML_F32x4_REDUCE(res, x)                                 \
{                                                                 \
    for (int i = 0; i < GGML_F32_ARR/2; ++i) {                    \
        x[2*i] = _mm_add_ps(x[2*i], x[2*i+1]);                    \
    }                                                             \
    for (int i = 0; i < GGML_F32_ARR/4; ++i) {                    \
        x[4*i] = _mm_add_ps(x[4*i], x[4*i+2]);                    \
    }                                                             \
    for (int i = 0; i < GGML_F32_ARR/8; ++i) {                    \
        x[8*i] = _mm_add_ps(x[8*i], x[8*i+4]);                    \
    }                                                             \
    const __m128 t0 = _mm_hadd_ps(x[0], x[0]);                    \
    res = _mm_cvtss_f32(_mm_hadd_ps(t0, t0));                     \
}
// TODO: is this optimal ?

#define GGML_F32_VEC        GGML_F32x4
#define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
#define GGML_F32_VEC_SET1   GGML_F32x4_SET1
#define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
#define GGML_F32_VEC_STORE  GGML_F32x4_STORE
#define GGML_F32_VEC_FMA    GGML_F32x4_FMA
#define GGML_F32_VEC_ADD    GGML_F32x4_ADD
#define GGML_F32_VEC_MUL    GGML_F32x4_MUL
#define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE

// F16 SSE

#define GGML_F16_STEP 32
#define GGML_F16_EPR  4

static inline __m128 __sse_f16x4_load(ggml_fp16_t *x) {
    float tmp[4];

    tmp[0] = GGML_FP16_TO_FP32(x[0]);
    tmp[1] = GGML_FP16_TO_FP32(x[1]);
    tmp[2] = GGML_FP16_TO_FP32(x[2]);
    tmp[3] = GGML_FP16_TO_FP32(x[3]);

    return _mm_loadu_ps(tmp);
}

static inline void __sse_f16x4_store(ggml_fp16_t *x, __m128 y) {
    float arr[4];

    _mm_storeu_ps(arr, y);

    x[0] = GGML_FP32_TO_FP16(arr[0]);
    x[1] = GGML_FP32_TO_FP16(arr[1]);
    x[2] = GGML_FP32_TO_FP16(arr[2]);
    x[3] = GGML_FP32_TO_FP16(arr[3]);
}

#define GGML_F32Cx4             __m128
#define GGML_F32Cx4_ZERO        _mm_setzero_ps()
#define GGML_F32Cx4_SET1(x)     _mm_set1_ps(x)
#define GGML_F32Cx4_LOAD(x)     __sse_f16x4_load(x)
#define GGML_F32Cx4_STORE(x, y) __sse_f16x4_store(x, y)
#define GGML_F32Cx4_FMA         GGML_F32x4_FMA
#define GGML_F32Cx4_ADD         _mm_add_ps
#define GGML_F32Cx4_MUL         _mm_mul_ps
#define GGML_F32Cx4_REDUCE      GGML_F32x4_REDUCE

#define GGML_F16_VEC                 GGML_F32Cx4
#define GGML_F16_VEC_ZERO            GGML_F32Cx4_ZERO
#define GGML_F16_VEC_SET1            GGML_F32Cx4_SET1
#define GGML_F16_VEC_LOAD(p, i)      GGML_F32Cx4_LOAD(p)
#define GGML_F16_VEC_STORE(p, r, i)  GGML_F32Cx4_STORE(p, r[i])
#define GGML_F16_VEC_FMA             GGML_F32Cx4_FMA
#define GGML_F16_VEC_ADD             GGML_F32Cx4_ADD
#define GGML_F16_VEC_MUL             GGML_F32Cx4_MUL
#define GGML_F16_VEC_REDUCE          GGML_F32Cx4_REDUCE

#endif

// GGML_F32_ARR / GGML_F16_ARR
//   number of registers to use per step
#ifdef GGML_SIMD
#define GGML_F32_ARR (GGML_F32_STEP/GGML_F32_EPR)
#define GGML_F16_ARR (GGML_F16_STEP/GGML_F16_EPR)
#endif
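
/*
The *_REDUCE macros above all share one pairwise pattern for collapsing the
GGML_F32_ARR (or GGML_F16_ARR) accumulator registers into x[0]. A scalar
stand-in makes that pattern easier to follow. This is a sketch only:
example_reduce is a hypothetical name and each "register" is modelled as a
single float. Note that the three stages fully reduce at most 8 accumulators,
which covers the STEP/EPR ratios defined above (4 or 8).

```c
#include <assert.h>

// Collapses `arr` partial sums (arr = 2, 4 or 8) into x[0] and returns it.
static float example_reduce(float *x, int arr) {
    for (int i = 0; i < arr/2; ++i) {
        x[2*i] += x[2*i+1]; // neighbours:      0+1, 2+3, 4+5, 6+7
    }
    for (int i = 0; i < arr/4; ++i) {
        x[4*i] += x[4*i+2]; // pairs of pairs:  0+2, 4+6
    }
    for (int i = 0; i < arr/8; ++i) {
        x[8*i] += x[8*i+4]; // halves:          0+4 (no-op when arr < 8)
    }
    return x[0];
}
```

With 8 accumulators holding 1..8 the result is 36, and with 4 accumulators
holding 1..4 the last stage runs zero times and the result is 10.
*/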

//
// fundamental operations
//

inline static void ggml_vec_set_i8(const int n, int8_t * x, const int8_t v) { for (int i = 0; i < n; ++i) x[i] = v; }

inline static void ggml_vec_set_i16(const int n, int16_t * x, const int16_t v) { for (int i = 0; i < n; ++i) x[i] = v; }

inline static void ggml_vec_set_i32(const int n, int32_t * x, const int32_t v) { for (int i = 0; i < n; ++i) x[i] = v; }

inline static void ggml_vec_set_f16(const int n, ggml_fp16_t * x, const int32_t v) { for (int i = 0; i < n; ++i) x[i] = v; }

inline static void ggml_vec_add_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i]  = x[i] + y[i]; }
inline static void ggml_vec_acc_f32 (const int n, float * y, const float * x)                  { for (int i = 0; i < n; ++i) y[i] += x[i];        }
inline static void ggml_vec_acc1_f32(const int n, float * y, const float   v)                  { for (int i = 0; i < n; ++i) y[i] += v;           }
inline static void ggml_vec_sub_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i]  = x[i] - y[i]; }
inline static void ggml_vec_set_f32 (const int n, float * x, const float   v)                  { for (int i = 0; i < n; ++i) x[i]  = v;           }
inline static void ggml_vec_cpy_f32 (const int n, float * y, const float * x)                  { for (int i = 0; i < n; ++i) y[i]  = x[i];        }
inline static void ggml_vec_neg_f32 (const int n, float * y, const float * x)                  { for (int i = 0; i < n; ++i) y[i]  = -x[i];       }
inline static void ggml_vec_mul_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i]  = x[i]*y[i];   }
inline static void ggml_vec_div_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i]  = x[i]/y[i];   }

inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;

#ifdef GGML_SIMD
    const int np = (n & ~(GGML_F32_STEP - 1));

    GGML_F32_VEC sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };

    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];

    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ax[j] = GGML_F32_VEC_LOAD(x + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);

            sum[j] = GGML_F32_VEC_FMA(sum[j], ax[j], ay[j]);
        }
    }

    // reduce the GGML_F32_ARR partial sums in sum[] into sumf
    GGML_F32_VEC_REDUCE(sumf, sum);

    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }
#else
    // scalar
    for (int i = 0; i < n; ++i) {
        sumf += x[i]*y[i];
    }
#endif

    *s = sumf;
}
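
/*
The blocking scheme in ggml_vec_dot_f32 (several independent accumulators per
step, then a scalar loop for the leftovers) can be sketched without any
intrinsics. The names EX_STEP, EX_EPR, EX_ARR and example_dot_f32 are
hypothetical, and the values 16 and 4 are assumptions that mirror the NEON F32
settings above. Each "register" is modelled as a small float array.

```c
#include <assert.h>

#define EX_STEP 16                 // elements consumed per outer iteration
#define EX_EPR  4                  // elements per simulated register
#define EX_ARR  (EX_STEP/EX_EPR)   // simulated registers per step

static float example_dot_f32(int n, const float *x, const float *y) {
    float sum[EX_ARR][EX_EPR] = { { 0.0f } };

    // largest multiple of EX_STEP that is <= n (EX_STEP is a power of two)
    const int np = n & ~(EX_STEP - 1);

    for (int i = 0; i < np; i += EX_STEP) {
        for (int j = 0; j < EX_ARR; j++) {
            for (int l = 0; l < EX_EPR; l++) {
                const int idx = i + j*EX_EPR + l;
                sum[j][l] += x[idx]*y[idx]; // one lane of the FMA
            }
        }
    }

    // collapse all lanes of all accumulators
    float sumf = 0.0f;
    for (int j = 0; j < EX_ARR; j++) {
        for (int l = 0; l < EX_EPR; l++) {
            sumf += sum[j][l];
        }
    }

    // leftovers that do not fill a whole step
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }

    return sumf;
}
```

Because the additions happen in a different order than in a plain loop, the
result can differ in the last bits for general inputs. For n = 37 with
x[i] = 1 and y[i] = i all intermediate values are small integers, so the
result is exactly 666.
*/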

inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
    ggml_float sumf = 0.0;

#if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F16_STEP - 1));

    GGML_F16_VEC sum[GGML_F16_ARR] = { GGML_F16_VEC_ZERO };

    GGML_F16_VEC ax[GGML_F16_ARR];
    GGML_F16_VEC ay[GGML_F16_ARR];

    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
            ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);

            sum[j] = GGML_F16_VEC_FMA(sum[j], ax[j], ay[j]);
        }
    }

    // reduce the GGML_F16_ARR partial sums in sum[] into sumf
    GGML_F16_VEC_REDUCE(sumf, sum);

    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += GGML_FP16_TO_FP32(x[i])*GGML_FP16_TO_FP32(y[i]);
    }
#else
    for (int i = 0; i < n; ++i) {
        sumf += GGML_FP16_TO_FP32(x[i])*GGML_FP16_TO_FP32(y[i]);
    }
#endif

    *s = sumf;
}

inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void * restrict x, const void * restrict y) {
    const int nb = n / QK;

    assert(n % QK == 0);
    assert(nb % 2 == 0);

    const size_t bs = sizeof(float) + QK/2;

    const uint8_t * restrict pd0 = ((const uint8_t *)x + 0*bs);
    const uint8_t * restrict pd1 = ((const uint8_t *)y + 0*bs);

    const uint8_t * restrict pb0 = ((const uint8_t *)x + 0*bs + sizeof(float));
    const uint8_t * restrict pb1 = ((const uint8_t *)y + 0*bs + sizeof(float));

    float sumf = 0.0;

#ifdef __ARM_NEON
#if QK == 32
    float sum0 = 0.0f;
    float sum1 = 0.0f;

    for (int i = 0; i < nb; i += 2) {
        const float d0_0 = *(const float *) (pd0 + i*bs);
        const float d1_0 = *(const float *) (pd1 + i*bs);
        const float d0_1 = *(const float *) (pd0 + (i + 1)*bs);
        const float d1_1 = *(const float *) (pd1 + (i + 1)*bs);

        //printf("d0_0: %f, d1_0: %f, d0_1: %f, d1_1: %f\n", d0_0, d1_0, d0_1, d1_1);

        const uint8_t * restrict p0 = pb0 + i*bs;
        const uint8_t * restrict p1 = pb1 + i*bs;

        const uint8x16_t m4b = vdupq_n_u8(0xf);
        const int8x16_t  s8b = vdupq_n_s8(0x8);

        const uint8x16_t v0_0 = vld1q_u8(p0);
        const uint8x16_t v1_0 = vld1q_u8(p1);
        const uint8x16_t v0_1 = vld1q_u8(p0 + bs);
        const uint8x16_t v1_1 = vld1q_u8(p1 + bs);

        // 4-bit -> 8-bit
        const int8x16_t v0_0l = vreinterpretq_s8_u8(vandq_u8(v0_0, m4b));
        const int8x16_t v1_0l = vreinterpretq_s8_u8(vandq_u8(v1_0, m4b));

        const int8x16_t v0_0h = vreinterpretq_s8_u8(vshrq_n_u8(v0_0, 4));
        const int8x16_t v1_0h = vreinterpretq_s8_u8(vshrq_n_u8(v1_0, 4));

        const int8x16_t v0_1l = vreinterpretq_s8_u8(vandq_u8(v0_1, m4b));
        const int8x16_t v1_1l = vreinterpretq_s8_u8(vandq_u8(v1_1, m4b));

        const int8x16_t v0_1h = vreinterpretq_s8_u8(vshrq_n_u8(v0_1, 4));
        const int8x16_t v1_1h = vreinterpretq_s8_u8(vshrq_n_u8(v1_1, 4));

        // sub 8
        const int8x16_t v0_0ls = vsubq_s8(v0_0l, s8b);
        const int8x16_t v1_0ls = vsubq_s8(v1_0l, s8b);

        const int8x16_t v0_0hs = vsubq_s8(v0_0h, s8b);
        const int8x16_t v1_0hs = vsubq_s8(v1_0h, s8b);

        const int8x16_t v0_1ls = vsubq_s8(v0_1l, s8b);
        const int8x16_t v1_1ls = vsubq_s8(v1_1l, s8b);

        const int8x16_t v0_1hs = vsubq_s8(v0_1h, s8b);
        const int8x16_t v1_1hs = vsubq_s8(v1_1h, s8b);

#if defined(__ARM_FEATURE_DOTPROD)
        // dot product into int32x4_t
        int32x4_t p_0 = vdotq_s32(vdupq_n_s32(0), v0_0ls, v1_0ls);
        int32x4_t p_1 = vdotq_s32(vdupq_n_s32(0), v0_1ls, v1_1ls);

        p_0 = vdotq_s32(p_0, v0_0hs, v1_0hs);
        p_1 = vdotq_s32(p_1, v0_1hs, v1_1hs);

        // scalar
#if defined(__ARM_FEATURE_QRDMX)
        sum0 += d0_0*d1_0*vaddvq_s32(p_0);
        sum1 += d0_1*d1_1*vaddvq_s32(p_1);
#else
        sum0 += d0_0*d1_0*(vgetq_lane_s32(p_0, 0) + vgetq_lane_s32(p_0, 1) + vgetq_lane_s32(p_0, 2) + vgetq_lane_s32(p_0, 3));
        sum1 += d0_1*d1_1*(vgetq_lane_s32(p_1, 0) + vgetq_lane_s32(p_1, 1) + vgetq_lane_s32(p_1, 2) + vgetq_lane_s32(p_1, 3));
#endif
#else
        const int16x8_t pl0l = vmull_s8(vget_low_s8 (v0_0ls), vget_low_s8 (v1_0ls));
        const int16x8_t pl0h = vmull_s8(vget_high_s8(v0_0ls), vget_high_s8(v1_0ls));

        const int16x8_t ph0l = vmull_s8(vget_low_s8 (v0_0hs), vget_low_s8 (v1_0hs));
        const int16x8_t ph0h = vmull_s8(vget_high_s8(v0_0hs), vget_high_s8(v1_0hs));

        const int16x8_t pl1l = vmull_s8(vget_low_s8 (v0_1ls), vget_low_s8 (v1_1ls));
        const int16x8_t pl1h = vmull_s8(vget_high_s8(v0_1ls), vget_high_s8(v1_1ls));

        const int16x8_t ph1l = vmull_s8(vget_low_s8 (v0_1hs), vget_low_s8 (v1_1hs));
        const int16x8_t ph1h = vmull_s8(vget_high_s8(v0_1hs), vget_high_s8(v1_1hs));

        const int16x8_t pl_0 = vaddq_s16(pl0l, pl0h);
        const int16x8_t ph_0 = vaddq_s16(ph0l, ph0h);

        const int16x8_t pl_1 = vaddq_s16(pl1l, pl1h);
        const int16x8_t ph_1 = vaddq_s16(ph1l, ph1h);

        const int16x8_t p_0 = vaddq_s16(pl_0, ph_0);
        const int16x8_t p_1 = vaddq_s16(pl_1, ph_1);

        // scalar
#if defined(__ARM_FEATURE_QRDMX)
        sum0 += d0_0*d1_0*vaddvq_s16(p_0);
        sum1 += d0_1*d1_1*vaddvq_s16(p_1);
#else
        sum0 += d0_0*d1_0*(vgetq_lane_s16(p_0, 0) + vgetq_lane_s16(p_0, 1) + vgetq_lane_s16(p_0, 2) + vgetq_lane_s16(p_0, 3) + vgetq_lane_s16(p_0, 4) + vgetq_lane_s16(p_0, 5) + vgetq_lane_s16(p_0, 6) + vgetq_lane_s16(p_0, 7));
        sum1 += d0_1*d1_1*(vgetq_lane_s16(p_1, 0) + vgetq_lane_s16(p_1, 1) + vgetq_lane_s16(p_1, 2) + vgetq_lane_s16(p_1, 3) + vgetq_lane_s16(p_1, 4) + vgetq_lane_s16(p_1, 5) + vgetq_lane_s16(p_1, 6) + vgetq_lane_s16(p_1, 7));
#endif
#endif
    }

    sumf = sum0 + sum1;
#else
#error "not implemented for QK"
#endif
#elif defined(__AVX2__)
#if QK == 32
    const size_t countBlocks = nb;

    // Initialize accumulator with zeros
    __m256 acc = _mm256_setzero_ps();

    // Main loop
    for (int i = 0; i < nb; ++i) {
        const float * d0_0 = (const float *) (pd0 + i*bs);
        const float * d1_0 = (const float *) (pd1 + i*bs);

        const uint8_t * restrict p0 = pb0 + i*bs;
        const uint8_t * restrict p1 = pb1 + i*bs;

        // Compute combined scale for the block
        const __m256 scale = _mm256_mul_ps( _mm256_broadcast_ss( d0_0 ), _mm256_broadcast_ss( d1_0 ) );

        // Load 16 bytes, and unpack 4 bit fields into bytes, making 32 bytes
        __m256i bx = bytesFromNibbles( p0 );
        __m256i by = bytesFromNibbles( p1 );

        // Now we have a vector with bytes in [ 0 .. 15 ] interval. Offset them into [ -8 .. +7 ] interval.
        const __m256i off = _mm256_set1_epi8( 8 );
        bx = _mm256_sub_epi8( bx, off );
        by = _mm256_sub_epi8( by, off );

        // Sign-extend first 16 signed bytes into int16_t
        __m256i x16 = _mm256_cvtepi8_epi16( _mm256_castsi256_si128( bx ) );
        __m256i y16 = _mm256_cvtepi8_epi16( _mm256_castsi256_si128( by ) );
        // Compute products of int16_t integers, add pairwise
        __m256i i32 = _mm256_madd_epi16( x16, y16 );

        // Sign-extend last 16 signed bytes into int16_t vectors
        x16 = _mm256_cvtepi8_epi16( _mm256_extracti128_si256( bx, 1 ) );
        y16 = _mm256_cvtepi8_epi16( _mm256_extracti128_si256( by, 1 ) );
        // Accumulate products of int16_t integers
        i32 = _mm256_add_epi32( i32, _mm256_madd_epi16( x16, y16 ) );

        // Convert int32_t to float
        __m256 p = _mm256_cvtepi32_ps( i32 );
        // Apply the scale, and accumulate
        acc = _mm256_fmadd_ps( scale, p, acc );
    }

    // Return horizontal sum of the acc vector
    __m128 res = _mm256_extractf128_ps( acc, 1 );
    res = _mm_add_ps( res, _mm256_castps256_ps128( acc ) );
    res = _mm_add_ps( res, _mm_movehl_ps( res, res ) );
    res = _mm_add_ss( res, _mm_movehdup_ps( res ) );

    sumf = _mm_cvtss_f32( res );
#else
#error "not implemented for QK"
#endif
#elif defined(__wasm_simd128__)
#if QK == 32
    // wasm simd
    float sum0 = 0.0f;
    float sum1 = 0.0f;

    for (int i = 0; i < nb; i += 2) {
        const float d0_0 = *(const float *) (pd0 + i*bs);
        const float d1_0 = *(const float *) (pd1 + i*bs);
        const float d0_1 = *(const float *) (pd0 + (i + 1)*bs);
        const float d1_1 = *(const float *) (pd1 + (i + 1)*bs);

        const uint8_t * restrict p0 = pb0 + i*bs;
        const uint8_t * restrict p1 = pb1 + i*bs;

        const v128_t m4b = wasm_u8x16_splat(0xf);
        const v128_t s8b = wasm_i8x16_splat(0x8);

        const v128_t v0_0 = wasm_v128_load(p0);
        const v128_t v0_1 = wasm_v128_load(p0 + bs);
        const v128_t v1_0 = wasm_v128_load(p1);
        const v128_t v1_1 = wasm_v128_load(p1 + bs);

        // 4-bit -> 8-bit
        const v128_t v0_0l = wasm_v128_and(v0_0, m4b);
        const v128_t v1_0l = wasm_v128_and(v1_0, m4b);

        const v128_t v0_0h = wasm_u8x16_shr(v0_0, 4);
        const v128_t v1_0h = wasm_u8x16_shr(v1_0, 4);

        const v128_t v0_1l = wasm_v128_and(v0_1, m4b);
        const v128_t v1_1l = wasm_v128_and(v1_1, m4b);

        const v128_t v0_1h = wasm_u8x16_shr(v0_1, 4);
        const v128_t v1_1h = wasm_u8x16_shr(v1_1, 4);

        // sub 8
        const v128_t v0_0ls = wasm_i8x16_sub(v0_0l, s8b);
        const v128_t v1_0ls = wasm_i8x16_sub(v1_0l, s8b);

        const v128_t v0_0hs = wasm_i8x16_sub(v0_0h, s8b);
        const v128_t v1_0hs = wasm_i8x16_sub(v1_0h, s8b);

        const v128_t v0_1ls = wasm_i8x16_sub(v0_1l, s8b);
        const v128_t v1_1ls = wasm_i8x16_sub(v1_1l, s8b);

        const v128_t v0_1hs = wasm_i8x16_sub(v0_1h, s8b);
        const v128_t v1_1hs = wasm_i8x16_sub(v1_1h, s8b);

        // dot product into int16x8_t
        const v128_t pl0l = wasm_i16x8_mul(wasm_i16x8_extend_low_i8x16(v0_0ls), wasm_i16x8_extend_low_i8x16(v1_0ls));
        const v128_t pl0h = wasm_i16x8_mul(wasm_i16x8_extend_high_i8x16(v0_0ls), wasm_i16x8_extend_high_i8x16(v1_0ls));

        const v128_t ph0l = wasm_i16x8_mul(wasm_i16x8_extend_low_i8x16(v0_0hs), wasm_i16x8_extend_low_i8x16(v1_0hs));
        const v128_t ph0h = wasm_i16x8_mul(wasm_i16x8_extend_high_i8x16(v0_0hs), wasm_i16x8_extend_high_i8x16(v1_0hs));

        const v128_t pl1l = wasm_i16x8_mul(wasm_i16x8_extend_low_i8x16(v0_1ls), wasm_i16x8_extend_low_i8x16(v1_1ls));
        const v128_t pl1h = wasm_i16x8_mul(wasm_i16x8_extend_high_i8x16(v0_1ls), wasm_i16x8_extend_high_i8x16(v1_1ls));

        const v128_t ph1l = wasm_i16x8_mul(wasm_i16x8_extend_low_i8x16(v0_1hs), wasm_i16x8_extend_low_i8x16(v1_1hs));
        const v128_t ph1h = wasm_i16x8_mul(wasm_i16x8_extend_high_i8x16(v0_1hs), wasm_i16x8_extend_high_i8x16(v1_1hs));

        const v128_t pl_0 = wasm_i16x8_add(pl0l, pl0h);
        const v128_t ph_0 = wasm_i16x8_add(ph0l, ph0h);

        const v128_t pl_1 = wasm_i16x8_add(pl1l, pl1h);
        const v128_t ph_1 = wasm_i16x8_add(ph1l, ph1h);

        const v128_t p_0 = wasm_i16x8_add(pl_0, ph_0);
        const v128_t p_1 = wasm_i16x8_add(pl_1, ph_1);

        sum0 += d0_0*d1_0*(
                wasm_i16x8_extract_lane(p_0, 0) + wasm_i16x8_extract_lane(p_0, 1) +
                wasm_i16x8_extract_lane(p_0, 2) + wasm_i16x8_extract_lane(p_0, 3) +
                wasm_i16x8_extract_lane(p_0, 4) + wasm_i16x8_extract_lane(p_0, 5) +
                wasm_i16x8_extract_lane(p_0, 6) + wasm_i16x8_extract_lane(p_0, 7));
        sum1 += d0_1*d1_1*(
                wasm_i16x8_extract_lane(p_1, 0) + wasm_i16x8_extract_lane(p_1, 1) +
                wasm_i16x8_extract_lane(p_1, 2) + wasm_i16x8_extract_lane(p_1, 3) +
                wasm_i16x8_extract_lane(p_1, 4) + wasm_i16x8_extract_lane(p_1, 5) +
                wasm_i16x8_extract_lane(p_1, 6) + wasm_i16x8_extract_lane(p_1, 7));
    }

    sumf = sum0 + sum1;
#else
#error "not implemented for QK"
#endif
#else
    // scalar
    for (int i = 0; i < nb; i++) {
        const float d0 = *(const float *) (pd0 + i*bs);
        const float d1 = *(const float *) (pd1 + i*bs);

        const uint8_t * restrict p0 = pb0 + i*bs;
        const uint8_t * restrict p1 = pb1 + i*bs;

        for (int j = 0; j < QK/2; j++) {
            const uint8_t v0 = p0[j];
            const uint8_t v1 = p1[j];

            const float f0 = d0*((int8_t) (v0 & 0xf) - 8);
            const float f1 = d0*((int8_t) (v0 >> 4)  - 8);

            const float f2 = d1*((int8_t) (v1 & 0xf) - 8);
            const float f3 = d1*((int8_t) (v1 >> 4)  - 8);

            sumf += f0*f2 + f1*f3;
        }
    }
#endif

    *s = sumf;
}
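
// Illustrative sketch (not part of ggml): a scalar decoder for one Q4_0 block.
// It assumes the layout read by ggml_vec_dot_q4_0 above: one float scale `d`
// followed by QK/2 bytes, each packing two 4-bit values stored with a +8 offset.
// It also assumes QK == 32 and that the low nibble holds the even element, as in
// the scalar path of ggml_vec_mad_q4_0. All example_q4_0_* names are hypothetical.
#include <stdint.h>
#include <string.h>

enum { EXAMPLE_Q4_0_QK = 32 };

inline static void example_q4_0_dequantize_block(const uint8_t * block, float * out) {
    float d;
    memcpy(&d, block, sizeof(d)); // the scale may be unaligned, so copy it out

    const uint8_t * nibbles = block + sizeof(float);

    for (int j = 0; j < EXAMPLE_Q4_0_QK/2; j++) {
        out[2*j + 0] = d*((int) (nibbles[j] & 0xf) - 8); // low nibble  -> even element
        out[2*j + 1] = d*((int) (nibbles[j] >> 4)  - 8); // high nibble -> odd element
    }
}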

inline static void ggml_vec_dot_q4_1(const int n, float * restrict s, const void * restrict x, const void * restrict y) {
    const int nb = n / QK;

    const float * restrict pm0 = (const float *) x;
    const float * restrict pm1 = (const float *) y;

    const float * restrict pd0 = (const float *) (pm0 + nb);
    const float * restrict pd1 = (const float *) (pm1 + nb);

    const uint8_t * restrict pb0 = (const uint8_t *) (pd0 + nb);
    const uint8_t * restrict pb1 = (const uint8_t *) (pd1 + nb);

    float sumf = 0.0;

#if 1
    // scalar
    for (int i = 0; i < nb; i++) {
        const float m0 = pm0[i];
        const float m1 = pm1[i];

        const float d0 = pd0[i];
        const float d1 = pd1[i];

        const uint8_t * restrict p0 = pb0 + i*QK/2;
        const uint8_t * restrict p1 = pb1 + i*QK/2;

        for (int j = 0; j < QK/2; j++) {
            const uint8_t v0 = p0[j];
            const uint8_t v1 = p1[j];

            const float f0 = d0*(v0 & 0xf) + m0;
            const float f1 = d0*(v0 >> 4)  + m0;

            const float f2 = d1*(v1 & 0xf) + m1;
            const float f3 = d1*(v1 >> 4)  + m1;

            sumf += f0*f2 + f1*f3;
        }
    }
#endif

    *s = sumf;
}
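
// Illustrative sketch (not part of ggml): a scalar decoder for a Q4_1 row.
// It assumes the layout read by ggml_vec_dot_q4_1 above: nb floats of per-block
// minimums, then nb floats of per-block scales, then nb*QK/2 bytes of packed
// 4-bit values, with each element decoding to d*q + m. QK == 32 is assumed and
// all example_q4_1_* names are hypothetical.
#include <stdint.h>
#include <string.h>

enum { EXAMPLE_Q4_1_QK = 32 };

inline static void example_q4_1_dequantize_row(const void * x, int n, float * out) {
    const int nb = n / EXAMPLE_Q4_1_QK;

    const uint8_t * pm = (const uint8_t *) x;      // per-block minimums
    const uint8_t * pd = pm + nb*sizeof(float);    // per-block scales
    const uint8_t * pb = pd + nb*sizeof(float);    // packed 4-bit values

    for (int i = 0; i < nb; i++) {
        float m;
        float d;
        memcpy(&m, pm + i*sizeof(float), sizeof(m));
        memcpy(&d, pd + i*sizeof(float), sizeof(d));

        const uint8_t * pp = pb + i*EXAMPLE_Q4_1_QK/2;

        for (int j = 0; j < EXAMPLE_Q4_1_QK/2; j++) {
            out[i*EXAMPLE_Q4_1_QK + 2*j + 0] = d*(pp[j] & 0xf) + m; // low nibble
            out[i*EXAMPLE_Q4_1_QK + 2*j + 1] = d*(pp[j] >> 4)  + m; // high nibble
        }
    }
}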

// compute GGML_VEC_DOT_UNROLL dot products at once
// xs - x row stride in bytes
inline static void ggml_vec_dot_f16_unroll(const int n, const int xs, float * restrict s, void * restrict xv, ggml_fp16_t * restrict y) {
    ggml_float sumf[GGML_VEC_DOT_UNROLL] = { 0.0 };

    ggml_fp16_t * restrict x[GGML_VEC_DOT_UNROLL];

    for (int i = 0; i < GGML_VEC_DOT_UNROLL; ++i) {
        x[i] = (ggml_fp16_t *) ((char *) xv + i*xs);
    }

#if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F16_STEP - 1));

    GGML_F16_VEC sum[GGML_VEC_DOT_UNROLL][GGML_F16_ARR] = { { GGML_F16_VEC_ZERO } };

    GGML_F16_VEC ax[GGML_F16_ARR];
    GGML_F16_VEC ay[GGML_F16_ARR];

    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);

            for (int k = 0; k < GGML_VEC_DOT_UNROLL; ++k) {
                ax[j] = GGML_F16_VEC_LOAD(x[k] + i + j*GGML_F16_EPR, j);

                sum[k][j] = GGML_F16_VEC_FMA(sum[k][j], ax[j], ay[j]);
            }
        }
    }

    // reduce the partial vector sums of each unrolled row into sumf[k]
    for (int k = 0; k < GGML_VEC_DOT_UNROLL; ++k) {
        GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
    }

    // leftovers
    for (int i = np; i < n; ++i) {
        for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
            sumf[j] += GGML_FP16_TO_FP32(x[j][i])*GGML_FP16_TO_FP32(y[i]);
        }
    }
#else
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
            sumf[j] += GGML_FP16_TO_FP32(x[j][i])*GGML_FP16_TO_FP32(y[i]);
        }
    }
#endif

    for (int i = 0; i < GGML_VEC_DOT_UNROLL; ++i) {
        s[i] = sumf[i];
    }
}

inline static void ggml_vec_mad_f32(const int n, float * restrict y, const float * restrict x, const float v) {
#if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F32_STEP - 1));

    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);

    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];

    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ax[j] = GGML_F32_VEC_LOAD(x + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_FMA(ay[j], ax[j], vx);

            GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
        }
    }

    // leftovers
    for (int i = np; i < n; ++i) {
        y[i] += x[i]*v;
    }
#else
    // scalar
    for (int i = 0; i < n; ++i) {
        y[i] += x[i]*v;
    }
#endif
}

inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * restrict y, ggml_fp16_t * restrict x, const float v) {
#if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F16_STEP - 1));

    GGML_F16_VEC vx = GGML_F16_VEC_SET1(v);

    GGML_F16_VEC ax[GGML_F16_ARR];
    GGML_F16_VEC ay[GGML_F16_ARR];

    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
            ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
            ay[j] = GGML_F16_VEC_FMA(ay[j], ax[j], vx);

            GGML_F16_VEC_STORE(y + i + j*GGML_F16_EPR, ay, j);
        }
    }

    // leftovers
    for (int i = np; i < n; ++i) {
        GGML_ASSERT(false);
        y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i]) + GGML_FP16_TO_FP32(x[i])*v);
    }
#else
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i]) + GGML_FP16_TO_FP32(x[i])*v);
    }
#endif
}

inline static void ggml_vec_mad_q4_0(const int n, float * restrict y, void * restrict x, const float v) {
    assert(n % QK == 0);

    const int nb = n / QK;
    const size_t bs = sizeof(float) + QK/2;

    const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs);
    const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + sizeof(float));

#if __ARM_NEON
#if QK == 32
    for (int i = 0; i < nb; ++i) {
        const float d0 = v*(*(const float *) (pd + i*bs));

        const uint8_t * restrict pp = pb + i*bs;

        const uint8x8_t m4b = vdup_n_u8(0xf);
        const int8x8_t  s8b = vdup_n_s8(0x8);

        const float32x4_t vd = vdupq_n_f32(d0);

        for (int j = 0; j < 2; j++) {
            const uint8x8_t vx = vld1_u8(pp + j*8);

            const int8x8_t vxl = vreinterpret_s8_u8(vand_u8(vx, m4b));
            const int8x8_t vxh = vreinterpret_s8_u8(vshr_n_u8(vx, 4));

            // sub 8
            const int8x8_t vxls = vsub_s8(vxl, s8b);
            const int8x8_t vxhs = vsub_s8(vxh, s8b);

            //const int8x8_t vxlt = vzip_s8(vxls, vxhs)[0];
            //const int8x8_t vxht = vzip_s8(vxls, vxhs)[1];
            const int8x8_t vxlt = vzip1_s8(vxls, vxhs);
            const int8x8_t vxht = vzip2_s8(vxls, vxhs);

            const int8x16_t vxq = vcombine_s8(vxlt, vxht);

            // convert to 2x int16x8_t
            const int16x8_t vxq0 = vmovl_s8(vget_low_s8 (vxq));
            const int16x8_t vxq1 = vmovl_s8(vget_high_s8(vxq));

            // convert to 4x float32x4_t
            const float32x4_t vx0 = vcvtq_f32_s32(vmovl_s16(vget_low_s16 (vxq0)));
            const float32x4_t vx1 = vcvtq_f32_s32(vmovl_s16(vget_high_s16(vxq0)));
            const float32x4_t vx2 = vcvtq_f32_s32(vmovl_s16(vget_low_s16 (vxq1)));
            const float32x4_t vx3 = vcvtq_f32_s32(vmovl_s16(vget_high_s16(vxq1)));

            const float32x4_t vy0 = vld1q_f32(y + i*32 + j*16 + 0);
            const float32x4_t vy1 = vld1q_f32(y + i*32 + j*16 + 4);
            const float32x4_t vy2 = vld1q_f32(y + i*32 + j*16 + 8);
            const float32x4_t vy3 = vld1q_f32(y + i*32 + j*16 + 12);

            const float32x4_t vr0 = vfmaq_f32(vy0, vx0, vd);
            const float32x4_t vr1 = vfmaq_f32(vy1, vx1, vd);
            const float32x4_t vr2 = vfmaq_f32(vy2, vx2, vd);
            const float32x4_t vr3 = vfmaq_f32(vy3, vx3, vd);

            vst1q_f32(y + i*32 + j*16 + 0,  vr0);
            vst1q_f32(y + i*32 + j*16 + 4,  vr1);
            vst1q_f32(y + i*32 + j*16 + 8,  vr2);
            vst1q_f32(y + i*32 + j*16 + 12, vr3);
        }
    }
#else
#error "not implemented for QK"
#endif
#else
    // scalar
    for (int i = 0; i < nb; i++) {
        const float d = *(const float *) (pd + i*bs);

        const uint8_t * restrict pp = pb + i*bs;

        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = pp[l/2];

            const int8_t vi0 = vi & 0xf;
            const int8_t vi1 = vi >> 4;

            const float v0 = (vi0 - 8)*d;
            const float v1 = (vi1 - 8)*d;

            y[i*QK + l + 0] += v0*v;
            y[i*QK + l + 1] += v1*v;

            assert(!isnan(y[i*QK + l + 0]));
            assert(!isnan(y[i*QK + l + 1]));
            assert(!isinf(y[i*QK + l + 0]));
            assert(!isinf(y[i*QK + l + 1]));
        }
    }
#endif
}

inline static void ggml_vec_mad_q4_1(const int n, float * restrict y, void * restrict x, const float v) {
    assert(n % QK == 0);

    const int nb = n / QK;

    const float   * restrict pm = (const float *)   (x);
    const float   * restrict pd = (const float *)   (pm + nb);
    const uint8_t * restrict pb = (const uint8_t *) (pd + nb);

    for (int i = 0; i < nb; i++) {
        const float m = pm[i];
        const float d = pd[i];

        const uint8_t * restrict pp = pb + i*QK/2;

        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = pp[l/2];

            const uint8_t vi0 = vi & 0xf;
            const uint8_t vi1 = vi >> 4;

            const float v0 = d*vi0 + m;
            const float v1 = d*vi1 + m;

            y[i*QK + l + 0] += v0*v;
            y[i*QK + l + 1] += v1*v;

            assert(!isnan(y[i*QK + l + 0]));
            assert(!isnan(y[i*QK + l + 1]));
            assert(!isinf(y[i*QK + l + 0]));
            assert(!isinf(y[i*QK + l + 1]));
            //printf("mad: v0 %f v1 %f, i = %d, l = %d, d = %f, vi = %d, vi0 = %d, vi1 = %d\n", v0, v1, i, l, d, vi, vi0, vi1);
        }
    }
}

//inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) { for (int i = 0; i < n; ++i) y[i] *= v;          }
inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) {
#if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F32_STEP - 1));

    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);

    GGML_F32_VEC ay[GGML_F32_ARR];

    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_MUL(ay[j], vx);

            GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
        }
    }

    // leftovers
    for (int i = np; i < n; ++i) {
        y[i] *= v;
    }
#else
    // scalar
    for (int i = 0; i < n; ++i) {
        y[i] *= v;
    }
#endif
}

inline static void ggml_vec_norm_f32 (const int n, float * s, const float * x) { ggml_vec_dot_f32(n, s, x, x); *s = sqrt(*s);   }
inline static void ggml_vec_sqr_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i];   }
inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrt(x[i]); }
inline static void ggml_vec_abs_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fabsf(x[i]); }
inline static void ggml_vec_sgn_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : ((x[i] < 0.f) ? -1.f : 0.f); }
inline static void ggml_vec_step_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : 0.f; }
inline static void ggml_vec_relu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : 0.f; }

static const ggml_float GELU_COEF_A    = 0.044715;
static const ggml_float SQRT_2_OVER_PI = 0.79788456080286535587989211986876;

inline static float ggml_gelu_f32(float x) {
    return 0.5*x*(1.0 + tanh(SQRT_2_OVER_PI*x*(1.0 + GELU_COEF_A*x*x)));
}

inline static void ggml_vec_gelu_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    const uint16_t * i16 = (const uint16_t *) x;
    for (int i = 0; i < n; ++i) {
        y[i] = table_gelu_f16[i16[i]];
    }
}

#ifdef GGML_GELU_FP16
inline static void ggml_vec_gelu_f32(const int n, float * y, const float * x) {
    uint16_t t;
    for (int i = 0; i < n; ++i) {
        ggml_fp16_t fp16 = GGML_FP32_TO_FP16(x[i]);
        memcpy(&t, &fp16, sizeof(uint16_t));
        y[i] = GGML_FP16_TO_FP32(table_gelu_f16[t]);
    }
}
#else
inline static void ggml_vec_gelu_f32(const int n, float * y, const float * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = ggml_gelu_f32(x[i]);
    }
}
#endif

// Sigmoid Linear Unit (SiLU) function
inline static float ggml_silu_f32(float x) {
    return x/(1.0 + exp(-x));
}

inline static void ggml_vec_silu_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    const uint16_t * i16 = (const uint16_t *) x;
    for (int i = 0; i < n; ++i) {
        y[i] = table_silu_f16[i16[i]];
    }
}

#ifdef GGML_SILU_FP16
inline static void ggml_vec_silu_f32(const int n, float * y, const float * x) {
    uint16_t t;
    for (int i = 0; i < n; ++i) {
        ggml_fp16_t fp16 = GGML_FP32_TO_FP16(x[i]);
        memcpy(&t, &fp16, sizeof(uint16_t));
        y[i] = GGML_FP16_TO_FP32(table_silu_f16[t]);
    }
}
#else
inline static void ggml_vec_silu_f32(const int n, float * y, const float * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = ggml_silu_f32(x[i]);
    }
}
#endif

inline static void ggml_vec_sum_f32(const int n, float * s, const float * x) {
#ifndef GGML_USE_ACCELERATE
    ggml_float sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += x[i];
    }
    *s = sum;
#else
    vDSP_sve(x, 1, s, n);
#endif
}

inline static void ggml_vec_max_f32(const int n, float * s, const float * x) {
#ifndef GGML_USE_ACCELERATE
    ggml_float max = -INFINITY;
    for (int i = 0; i < n; ++i) {
        max = MAX(max, x[i]);
    }
    *s = max;
#else
    vDSP_maxv(x, 1, s, n);
#endif
}

inline static void ggml_vec_norm_inv_f32(const int n, float * s, const float * x) { ggml_vec_norm_f32(n, s, x); *s = 1./(*s); }

//
// logging
//

#if (GGML_DEBUG >= 1)
#define GGML_PRINT_DEBUG(...) printf(__VA_ARGS__)
#else
#define GGML_PRINT_DEBUG(...)
#endif

#if (GGML_DEBUG >= 5)
#define GGML_PRINT_DEBUG_5(...) printf(__VA_ARGS__)
#else
#define GGML_PRINT_DEBUG_5(...)
#endif

#if (GGML_DEBUG >= 10)
#define GGML_PRINT_DEBUG_10(...) printf(__VA_ARGS__)
#else
#define GGML_PRINT_DEBUG_10(...)
#endif

#define GGML_PRINT(...) printf(__VA_ARGS__)

//
// data types
//

static const int GGML_BLCK_SIZE[GGML_TYPE_COUNT] = {
    QK,
    QK,
    1,
    1,
    1,
    1,
    1,
};

static_assert(GGML_TYPE_COUNT == 7, "GGML_TYPE_COUNT != 7");

static const size_t GGML_TYPE_SIZE[GGML_TYPE_COUNT] = {
    sizeof(float  )   + QK/2,
    sizeof(float  )*2 + QK/2,
    sizeof(int8_t ),
    sizeof(int16_t),
    sizeof(int32_t),
    sizeof(ggml_fp16_t),
    sizeof(float  ),
};

// don't forget to update the array above when adding new types
static_assert(GGML_TYPE_COUNT == 7, "GGML_TYPE_COUNT != 7");

static const char * GGML_OP_LABEL[GGML_OP_COUNT] = {
    "NONE",

    "DUP",
    "ADD",
    "SUB",
    "MUL",
    "DIV",
    "SQR",
    "SQRT",
    "SUM",
    "MEAN",
    "REPEAT",
    "ABS",
    "SGN",
    "NEG",
    "STEP",
    "RELU",
    "GELU",
    "SILU",
    "NORM",

    "MUL_MAT",

    "SCALE",
    "CPY",
    "RESHAPE",
    "VIEW",
    "PERMUTE",
    "TRANSPOSE",
    "GET_ROWS",
    "DIAG_MASK_INF",
    "SOFT_MAX",
    "ROPE",
    "CONV_1D_1S",
    "CONV_1D_2S",

    "FLASH_ATTN",
    "FLASH_FF",
};

static_assert(GGML_OP_COUNT == 34, "GGML_OP_COUNT != 34");

static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
    "none",

    "x",
    "x+y",
    "x-y",
    "x*y",
    "x/y",
    "x^2",
    "√x",
    "Σx",
    "Σx/n",
    "repeat(x)",
    "abs(x)",
    "sgn(x)",
    "-x",
    "step(x)",
    "relu(x)",
    "gelu(x)",
    "silu(x)",
    "norm(x)",

    "X*Y",

    "x*v",
    "x-\\>y",
    "reshape(x)",
    "view(x)",
    "permute(x)",
    "transpose(x)",
    "get_rows(x)",
    "diag_mask_inf(x)",
    "soft_max(x)",
    "rope(x)",
    "conv_1d_1s(x)",
    "conv_1d_2s(x)",

    "flash_attn(x)",
    "flash_ff(x)",
};

static_assert(GGML_OP_COUNT == 34, "GGML_OP_COUNT != 34");

//
// ggml object
//

struct ggml_object {
    size_t offs;
    size_t size;

    struct ggml_object * next;

    char padding[8];
};

static const size_t GGML_OBJECT_SIZE = sizeof(struct ggml_object);

static_assert(sizeof(struct ggml_object)%GGML_MEM_ALIGN == 0, "ggml_object size must be a multiple of GGML_MEM_ALIGN");
static_assert(sizeof(struct ggml_tensor)%GGML_MEM_ALIGN == 0, "ggml_tensor size must be a multiple of GGML_MEM_ALIGN");

//
// ggml context
//

struct ggml_context {
    size_t mem_size;
    void * mem_buffer;
    bool   mem_buffer_owned;

    int n_objects;

    struct ggml_object * objects_begin;
    struct ggml_object * objects_end;

    struct ggml_scratch scratch;
    struct ggml_scratch scratch_save;
};

struct ggml_context_container {
    bool used;

    struct ggml_context context;
};

//
// compute types
//

enum ggml_task_type {
    GGML_TASK_INIT = 0,
    GGML_TASK_COMPUTE,
    GGML_TASK_FINALIZE,
};

struct ggml_compute_params {
    enum ggml_task_type type;

    int ith, nth;

    // work buffer for all threads
    size_t wsize;
    void * wdata;
};

//
// ggml state
//

struct ggml_state {
    struct ggml_context_container contexts[GGML_MAX_CONTEXTS];
};

// global state
static struct ggml_state g_state;
static atomic_int g_state_barrier = 0;

// barrier via spin lock
inline static void ggml_critical_section_start(void) {
    int processing = atomic_fetch_add(&g_state_barrier, 1);

    while (processing > 0) {
        // wait for other threads to finish
        atomic_fetch_sub(&g_state_barrier, 1);
        sched_yield(); // TODO: reconsider this
        processing = atomic_fetch_add(&g_state_barrier, 1);
    }
}

// TODO: make this somehow automatically executed
//       some sort of "sentry" mechanism
inline static void ggml_critical_section_end(void) {
    atomic_fetch_sub(&g_state_barrier, 1);
}

////////////////////////////////////////////////////////////////////////////////

void ggml_print_object(const struct ggml_object * obj) {
    GGML_PRINT(" - ggml_object: offset = %zu, size = %zu, next = %p\n",
            obj->offs, obj->size, (const void *) obj->next);
}

void ggml_print_objects(const struct ggml_context * ctx) {
    struct ggml_object * obj = ctx->objects_begin;

    GGML_PRINT("%s: objects in context %p:\n", __func__, (const void *) ctx);

    while (obj != NULL) {
        ggml_print_object(obj);
        obj = obj->next;
    }

    GGML_PRINT("%s: --- end ---\n", __func__);
}

int ggml_nelements(const struct ggml_tensor * tensor) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return tensor->ne[0]*tensor->ne[1]*tensor->ne[2]*tensor->ne[3];
}

int ggml_nrows(const struct ggml_tensor * tensor) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return tensor->ne[1]*tensor->ne[2]*tensor->ne[3];
}

size_t ggml_nbytes(const struct ggml_tensor * tensor) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return (ggml_nelements(tensor)*GGML_TYPE_SIZE[tensor->type])/GGML_BLCK_SIZE[tensor->type];
}
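
// Illustrative sketch (not part of ggml): the size arithmetic of ggml_nbytes
// above, worked for a Q4_0 tensor. It assumes QK == 32, so every block of 32
// elements occupies sizeof(float) + 16 bytes; with a 4-byte float that is
// 20 bytes per block, and 4096 elements take 4096*20/32 = 2560 bytes.
// The name example_q4_0_nbytes is hypothetical.
#include <stddef.h>

inline static size_t example_q4_0_nbytes(size_t nelements) {
    const size_t block_elems = 32;                            // elements per block (QK)
    const size_t block_bytes = sizeof(float) + block_elems/2; // scale + packed nibbles

    return nelements*block_bytes/block_elems;
}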

int ggml_blck_size(enum ggml_type type) {
    return GGML_BLCK_SIZE[type];
}

size_t ggml_type_size(enum ggml_type type) {
    return GGML_TYPE_SIZE[type];
}

float ggml_type_sizef(enum ggml_type type) {
    return ((float)(GGML_TYPE_SIZE[type]))/GGML_BLCK_SIZE[type];
}

size_t ggml_element_size(const struct ggml_tensor * tensor) {
    return GGML_TYPE_SIZE[tensor->type];
}

static inline bool ggml_is_scalar(const struct ggml_tensor * tensor) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return tensor->ne[0] == 1 && tensor->ne[1] == 1 && tensor->ne[2] == 1 && tensor->ne[3] == 1;
}

static inline bool ggml_is_vector(const struct ggml_tensor * tensor) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return tensor->ne[1] == 1 && tensor->ne[2] == 1 && tensor->ne[3] == 1;
}

static inline bool ggml_is_matrix(const struct ggml_tensor * tensor) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return tensor->ne[2] == 1 && tensor->ne[3] == 1;
}

static inline bool ggml_can_mul_mat(const struct ggml_tensor * t0, const struct ggml_tensor * t1) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return
        (t0->ne[0]  == t1->ne[0])  &&
        (t0->ne[2]  == t1->ne[2])  &&
        (t0->ne[3]  == t1->ne[3]);
}

static inline bool ggml_is_contiguous(const struct ggml_tensor * tensor) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return
        tensor->nb[0] == GGML_TYPE_SIZE[tensor->type] &&
        tensor->nb[1] == (tensor->nb[0]*tensor->ne[0])/GGML_BLCK_SIZE[tensor->type] &&
        tensor->nb[2] == tensor->nb[1]*tensor->ne[1] &&
        tensor->nb[3] == tensor->nb[2]*tensor->ne[2];
}
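
// Illustrative sketch (not part of ggml): computing the byte strides that
// ggml_is_contiguous above accepts. nb[0] is the size of one block, nb[1] is
// the size of one row (ne[0] elements grouped into blocks), and each higher
// stride is the previous stride times the previous dimension. The name and
// signature of example_contiguous_strides are hypothetical.
#include <stddef.h>

inline static void example_contiguous_strides(const int ne[4], size_t type_size, int blck_size, size_t nb[4]) {
    nb[0] = type_size;
    nb[1] = nb[0]*(size_t) ne[0]/(size_t) blck_size;
    nb[2] = nb[1]*(size_t) ne[1];
    nb[3] = nb[2]*(size_t) ne[2];
}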

static inline bool ggml_is_padded_1d(const struct ggml_tensor * tensor) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return
        tensor->nb[0] == GGML_TYPE_SIZE[tensor->type] &&
        tensor->nb[2] == tensor->nb[1]*tensor->ne[1] &&
        tensor->nb[3] == tensor->nb[2]*tensor->ne[2];
}

static inline bool ggml_are_same_shape(const struct ggml_tensor * t0, const struct ggml_tensor * t1) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return
        (t0->ne[0] == t1->ne[0] ) &&
        (t0->ne[1] == t1->ne[1] ) &&
        (t0->ne[2] == t1->ne[2] ) &&
        (t0->ne[3] == t1->ne[3] );
}

// check if t1 can be represented as a repetition of t0
static inline bool ggml_can_repeat(const struct ggml_tensor * t0, const struct ggml_tensor * t1) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");

    return
        (t1->ne[0]%t0->ne[0] == 0) &&
        (t1->ne[1]%t0->ne[1] == 0) &&
        (t1->ne[2]%t0->ne[2] == 0) &&
        (t1->ne[3]%t0->ne[3] == 0);
}

static inline int ggml_up32(int n) {
    return (n + 31) & ~31;
}

static inline int ggml_up64(int n) {
    return (n + 63) & ~63;
}

static inline int ggml_up(int n, int m) {
    // assert m is a power of 2
    GGML_ASSERT((m & (m - 1)) == 0);
    return (n + m - 1) & ~(m - 1);
}

// assert that pointer is aligned to GGML_MEM_ALIGN
#define ggml_assert_aligned(ptr) \
    assert(((uintptr_t) (ptr))%GGML_MEM_ALIGN == 0)

////////////////////////////////////////////////////////////////////////////////

struct ggml_context * ggml_init(struct ggml_init_params params) {
    // make this function thread safe
    ggml_critical_section_start();

    static bool is_first_call = true;

    if (is_first_call) {
        // initialize the F16 -> F32 conversion table and the GELU, SILU and EXP F16 tables
        {
            const uint64_t t_start = ggml_time_us(); UNUSED(t_start);

            ggml_fp16_t ii;
            for (int i = 0; i < (1 << 16); ++i) {
                uint16_t ui = i;
                memcpy(&ii, &ui, sizeof(ii));
                const float f = table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(ii);
                table_gelu_f16[i] = GGML_FP32_TO_FP16(ggml_gelu_f32(f));
                table_silu_f16[i] = GGML_FP32_TO_FP16(ggml_silu_f32(f));
                table_exp_f16[i]  = GGML_FP32_TO_FP16(exp(f));
            }

            const uint64_t t_end = ggml_time_us(); UNUSED(t_end);

            GGML_PRINT_DEBUG("%s: GELU, SILU and EXP tables initialized in %f ms\n", __func__, (t_end - t_start)/1000.0f);
        }

        // initialize g_state
        {
            const uint64_t t_start = ggml_time_us(); UNUSED(t_start);

            g_state = (struct ggml_state) {
                /*.contexts =*/ { { 0 } },
            };

            for (int i = 0; i < GGML_MAX_CONTEXTS; ++i) {
                g_state.contexts[i].used = false;
            }

            const uint64_t t_end = ggml_time_us(); UNUSED(t_end);

            GGML_PRINT_DEBUG("%s: g_state initialized in %f ms\n", __func__, (t_end - t_start)/1000.0f);
        }

        is_first_call = false;
    }

    // find non-used context in g_state
    struct ggml_context * ctx = NULL;

    for (int i = 0; i < GGML_MAX_CONTEXTS; i++) {
        if (!g_state.contexts[i].used) {
            g_state.contexts[i].used = true;
            ctx = &g_state.contexts[i].context;

            GGML_PRINT_DEBUG("%s: found unused context %d\n", __func__, i);
            break;
        }
    }

    if (ctx == NULL) {
        GGML_PRINT_DEBUG("%s: no unused context found\n", __func__);

        ggml_critical_section_end();

        return NULL;
    }

    *ctx = (struct ggml_context) {
        /*.mem_size         =*/ params.mem_size,
        /*.mem_buffer       =*/ params.mem_buffer ? params.mem_buffer : malloc(params.mem_size),
        /*.mem_buffer_owned =*/ params.mem_buffer ? false : true,
        /*.n_objects        =*/ 0,
        /*.objects_begin    =*/ NULL,
        /*.objects_end      =*/ NULL,
        /*.scratch          =*/ { 0, 0, NULL, },
        /*.scratch_save     =*/ { 0, 0, NULL, },
    };

    ggml_assert_aligned(ctx->mem_buffer);

    GGML_PRINT_DEBUG("%s: context initialized\n", __func__);

    ggml_critical_section_end();

    return ctx;
}

void ggml_free(struct ggml_context * ctx) {
    // make this function thread safe
    ggml_critical_section_start();

    bool found = false;

    for (int i = 0; i < GGML_MAX_CONTEXTS; i++) {
        if (&g_state.contexts[i].context == ctx) {
            g_state.contexts[i].used = false;

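            // objects_end is NULL when no object was ever allocated in this context
            GGML_PRINT_DEBUG("%s: context %d with %d objects has been freed. memory used = %zu\n",
                    __func__, i, ctx->n_objects,
                    ctx->objects_end ? ctx->objects_end->offs + ctx->objects_end->size : (size_t) 0);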

            if (ctx->mem_buffer_owned) {
                free(ctx->mem_buffer);
            }

            found = true;
            break;
        }
    }

    if (!found) {
        GGML_PRINT_DEBUG("%s: context not found\n", __func__);
    }

    ggml_critical_section_end();
}

size_t ggml_used_mem(const struct ggml_context * ctx) {
    // an empty context has no objects yet, so objects_end is NULL
    return ctx->objects_end == NULL ? 0 : ctx->objects_end->offs + ctx->objects_end->size;
}

size_t ggml_set_scratch(struct ggml_context * ctx, struct ggml_scratch scratch) {
    const size_t result = ctx->scratch.data ? ctx->scratch.offs : 0;

    ctx->scratch = scratch;

    return result;
}

////////////////////////////////////////////////////////////////////////////////

struct ggml_tensor * ggml_new_tensor_impl(
        struct ggml_context * ctx,
        enum   ggml_type type,
        int    n_dims,
        const int* ne,
        void*  data) {
    // always insert objects at the end of the context's memory pool
    struct ggml_object * obj_cur = ctx->objects_end;

    const size_t cur_offs = obj_cur == NULL ? 0 : obj_cur->offs;
    const size_t cur_size = obj_cur == NULL ? 0 : obj_cur->size;
    const size_t cur_end  = cur_offs + cur_size;

    size_t size_needed = 0;

    if (data == NULL) {
        size_needed += GGML_TYPE_SIZE[type]*(ne[0]/GGML_BLCK_SIZE[type]);
        for (int i = 1; i < n_dims; i++) {
            size_needed *= ne[i];
        }
        // align to GGML_MEM_ALIGN
        size_needed = ((size_needed + GGML_MEM_ALIGN - 1)/GGML_MEM_ALIGN)*GGML_MEM_ALIGN;
    }

    char * const mem_buffer = ctx->mem_buffer;
    struct ggml_object * const obj_new = (struct ggml_object *)(mem_buffer + cur_end);

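    // no scratch buffer is active, or the caller supplied the data (a view):
    // the tensor header (and the data, if not supplied) goes into the context's memory pool
    if (ctx->scratch.data == NULL || data != NULL) {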
        size_needed += sizeof(struct ggml_tensor);

        if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
            GGML_PRINT("%s: not enough space in the context's memory pool (needed %zu, available %zu)\n",
                    __func__, cur_end + size_needed + GGML_OBJECT_SIZE, ctx->mem_size);
            assert(false);
            return NULL;
        }

        *obj_new = (struct ggml_object) {
            .offs = cur_end + GGML_OBJECT_SIZE,
            .size = size_needed,
            .next = NULL,
        };
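    } else {
        // a scratch buffer is active: only the tensor header goes into the pool,
        // the tensor data is carved out of the scratch buffer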
        if (ctx->scratch.offs + size_needed > ctx->scratch.size) {
            GGML_PRINT("%s: not enough space in the scratch memory\n", __func__);
            assert(false);
            return NULL;
        }

        if (cur_end + sizeof(struct ggml_tensor) + GGML_OBJECT_SIZE > ctx->mem_size) {
            GGML_PRINT("%s: not enough space in the context's memory pool (needed %zu, available %zu)\n",
                    __func__, cur_end + sizeof(struct ggml_tensor) + GGML_OBJECT_SIZE, ctx->mem_size);
            assert(false);
            return NULL;
        }

        data = (char * const) ctx->scratch.data + ctx->scratch.offs;

        *obj_new = (struct ggml_object) {
            .offs = cur_end + GGML_OBJECT_SIZE,
            .size = sizeof(struct ggml_tensor),
            .next = NULL,
        };

        //printf("scratch offs = %zu, size_needed = %zu\n", ctx->scratch.offs, size_needed);

        ctx->scratch.offs += size_needed;
    }

    if (obj_cur != NULL) {
        obj_cur->next = obj_new;
    } else {
        // this is the first object in this context
        ctx->objects_begin = obj_new;
    }

    ctx->objects_end = obj_new;

    //printf("%s: inserted new object at %zu, size = %zu\n", __func__, cur_end, obj_new->size);

    struct ggml_tensor * const result = (struct ggml_tensor *)(mem_buffer + obj_new->offs);

    ggml_assert_aligned(result);

    *result = (struct ggml_tensor) {
        /*.type         =*/ type,
        /*.n_dims       =*/ n_dims,
        /*.ne           =*/ { 1, 1, 1, 1 },
        /*.nb           =*/ { 0, 0, 0, 0 },
        /*.op           =*/ GGML_OP_NONE,
        /*.is_param     =*/ false,
        /*.grad         =*/ NULL,
        /*.src0         =*/ NULL,
        /*.src1         =*/ NULL,
        /*.opt          =*/ { NULL },
        /*.n_tasks      =*/ 0,
        /*.perf_runs    =*/ 0,
        /*.perf_cycles  =*/ 0,
        /*.perf_time_us =*/ 0,
        /*.data         =*/ data == NULL ? (void *)(result + 1) : data,
        /*.pad          =*/ { 0 },
    };

    ggml_assert_aligned(result->data);

    for (int i = 0; i < n_dims; i++) {
        result->ne[i] = ne[i];
    }

    result->nb[0] = GGML_TYPE_SIZE[type];
    result->nb[1] = result->nb[0]*(result->ne[0]/GGML_BLCK_SIZE[type]);
    for (int i = 2; i < GGML_MAX_DIMS; i++) {
        result->nb[i] = result->nb[i - 1]*result->ne[i - 1];
    }

    ctx->n_objects++;

    return result;
}
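
/*
 * Illustrative sketch, not part of ggml: for a non-quantized type (block size
 * of 1), the strides set at the end of ggml_new_tensor_impl follow a simple
 * rule. nb[0] is the element size and each further stride is the previous
 * stride times the previous extent. The helper name sketch_stride is
 * hypothetical and it assumes a contiguous layout.
 *
```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: byte stride of dimension `dim` for a contiguous
// tensor whose elements are `type_size` bytes each (block size 1).
static size_t sketch_stride(size_t type_size, const int * ne, int dim) {
    assert(dim >= 0);
    size_t nb = type_size;            // stride of dimension 0: one element
    for (int i = 1; i <= dim; i++) {
        nb *= (size_t) ne[i - 1];     // each stride spans the previous dimension
    }
    return nb;
}
```
 *
 * With 4-byte elements and extents {3, 2, 1, 1}, one row is 12 bytes and one
 * 2D plane is 24 bytes.
 */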

struct ggml_tensor * ggml_new_tensor(
        struct ggml_context * ctx,
        enum   ggml_type type,
        int    n_dims,
        const int * ne) {
    return ggml_new_tensor_impl(ctx, type, n_dims, ne, NULL);
}

struct ggml_tensor * ggml_new_tensor_1d(
        struct ggml_context * ctx,
        enum   ggml_type type,
        int    ne0) {
    return ggml_new_tensor(ctx, type, 1, &ne0);
}

struct ggml_tensor * ggml_new_tensor_2d(
        struct ggml_context * ctx,
        enum   ggml_type type,
        int    ne0,
        int    ne1) {
    const int ne[2] = { ne0, ne1 };
    return ggml_new_tensor(ctx, type, 2, ne);
}

struct ggml_tensor * ggml_new_tensor_3d(
        struct ggml_context * ctx,
        enum   ggml_type type,
        int    ne0,
        int    ne1,
        int    ne2) {
    const int ne[3] = { ne0, ne1, ne2 };
    return ggml_new_tensor(ctx, type, 3, ne);
}

struct ggml_tensor * ggml_new_tensor_4d(
        struct ggml_context * ctx,
        enum   ggml_type type,
        int    ne0,
        int    ne1,
        int    ne2,
        int    ne3) {
    const int ne[4] = { ne0, ne1, ne2, ne3 };
    return ggml_new_tensor(ctx, type, 4, ne);
}

struct ggml_tensor * ggml_new_i32(struct ggml_context * ctx, int32_t value) {
    // temporarily disable the scratch buffer so the scalar lives in the context's memory pool
    ctx->scratch_save = ctx->scratch;
    ctx->scratch.data = NULL;

    struct ggml_tensor * result = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 1);

    ctx->scratch = ctx->scratch_save;

    ggml_set_i32(result, value);

    return result;
}

struct ggml_tensor * ggml_new_f32(struct ggml_context * ctx, float value) {
    // temporarily disable the scratch buffer so the scalar lives in the context's memory pool
    ctx->scratch_save = ctx->scratch;
    ctx->scratch.data = NULL;

    struct ggml_tensor * result = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

    ctx->scratch = ctx->scratch_save;

    ggml_set_f32(result, value);

    return result;
}

struct ggml_tensor * ggml_dup_tensor(struct ggml_context * ctx, const struct ggml_tensor * src) {
    return ggml_new_tensor_impl(ctx, src->type, src->n_dims, src->ne, NULL);
}

struct ggml_tensor * ggml_set_zero(struct ggml_tensor * tensor) {
    memset(tensor->data, 0, ggml_nbytes(tensor));
    return tensor;
}

struct ggml_tensor * ggml_set_i32 (struct ggml_tensor * tensor, int32_t value) {
    const int n     = ggml_nrows(tensor);
    const int nc    = tensor->ne[0];
    const size_t n1 = tensor->nb[1];

    char * const data = tensor->data;

    switch (tensor->type) {
        case GGML_TYPE_Q4_0:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_Q4_1:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_I8:
            {
                assert(tensor->nb[0] == sizeof(int8_t));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_i8(nc, (int8_t *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_I16:
            {
                assert(tensor->nb[0] == sizeof(int16_t));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_i16(nc, (int16_t *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_I32:
            {
                assert(tensor->nb[0] == sizeof(int32_t));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_i32(nc, (int32_t *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_F16:
            {
                assert(tensor->nb[0] == sizeof(ggml_fp16_t));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_F32:
            {
                assert(tensor->nb[0] == sizeof(float));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_f32(nc, (float *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }

    return tensor;
}

struct ggml_tensor * ggml_set_f32(struct ggml_tensor * tensor, float value) {
    const int n     = ggml_nrows(tensor);
    const int nc    = tensor->ne[0];
    const size_t n1 = tensor->nb[1];

    char * const data = tensor->data;

    switch (tensor->type) {
        case GGML_TYPE_Q4_0:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_Q4_1:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_I8:
            {
                assert(tensor->nb[0] == sizeof(int8_t));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_i8(nc, (int8_t *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_I16:
            {
                assert(tensor->nb[0] == sizeof(int16_t));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_i16(nc, (int16_t *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_I32:
            {
                assert(tensor->nb[0] == sizeof(int32_t));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_i32(nc, (int32_t *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_F16:
            {
                assert(tensor->nb[0] == sizeof(ggml_fp16_t));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_F32:
            {
                assert(tensor->nb[0] == sizeof(float));
                for (int i = 0; i < n; i++) {
                    ggml_vec_set_f32(nc, (float *)(data + i*n1), value);
                }
            } break;
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }

    return tensor;
}

int32_t ggml_get_i32_1d(const struct ggml_tensor * tensor, int i) {
    switch (tensor->type) {
        case GGML_TYPE_Q4_0:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_Q4_1:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_I8:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int8_t));
                return ((int8_t *)(tensor->data))[i];
            } break;
        case GGML_TYPE_I16:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int16_t));
                return ((int16_t *)(tensor->data))[i];
            } break;
        case GGML_TYPE_I32:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int32_t));
                return ((int32_t *)(tensor->data))[i];
            } break;
        case GGML_TYPE_F16:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(ggml_fp16_t));
                return GGML_FP16_TO_FP32(((ggml_fp16_t *)(tensor->data))[i]);
            } break;
        case GGML_TYPE_F32:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(float));
                return ((float *)(tensor->data))[i];
            } break;
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }

    return 0;
}

void ggml_set_i32_1d(const struct ggml_tensor * tensor, int i, int32_t value) {
    switch (tensor->type) {
        case GGML_TYPE_Q4_0:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_Q4_1:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_I8:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int8_t));
                ((int8_t *)(tensor->data))[i] = value;
            } break;
        case GGML_TYPE_I16:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int16_t));
                ((int16_t *)(tensor->data))[i] = value;
            } break;
        case GGML_TYPE_I32:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int32_t));
                ((int32_t *)(tensor->data))[i] = value;
            } break;
        case GGML_TYPE_F16:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(ggml_fp16_t));
                ((ggml_fp16_t *)(tensor->data))[i] = GGML_FP32_TO_FP16(value);
            } break;
        case GGML_TYPE_F32:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(float));
                ((float *)(tensor->data))[i] = value;
            } break;
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

float ggml_get_f32_1d(const struct ggml_tensor * tensor, int i) {
    switch (tensor->type) {
        case GGML_TYPE_Q4_0:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_Q4_1:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_I8:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int8_t));
                return ((int8_t *)(tensor->data))[i];
            } break;
        case GGML_TYPE_I16:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int16_t));
                return ((int16_t *)(tensor->data))[i];
            } break;
        case GGML_TYPE_I32:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int32_t));
                return ((int32_t *)(tensor->data))[i];
            } break;
        case GGML_TYPE_F16:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(ggml_fp16_t));
                return GGML_FP16_TO_FP32(((ggml_fp16_t *)(tensor->data))[i]);
            } break;
        case GGML_TYPE_F32:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(float));
                return ((float *)(tensor->data))[i];
            } break;
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }

    return 0.0f;
}

void ggml_set_f32_1d(const struct ggml_tensor * tensor, int i, float value) {
    switch (tensor->type) {
        case GGML_TYPE_Q4_0:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_Q4_1:
            {
                GGML_ASSERT(false);
            } break;
        case GGML_TYPE_I8:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int8_t));
                ((int8_t *)(tensor->data))[i] = value;
            } break;
        case GGML_TYPE_I16:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int16_t));
                ((int16_t *)(tensor->data))[i] = value;
            } break;
        case GGML_TYPE_I32:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(int32_t));
                ((int32_t *)(tensor->data))[i] = value;
            } break;
        case GGML_TYPE_F16:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(ggml_fp16_t));
                ((ggml_fp16_t *)(tensor->data))[i] = GGML_FP32_TO_FP16(value);
            } break;
        case GGML_TYPE_F32:
            {
                GGML_ASSERT(tensor->nb[0] == sizeof(float));
                ((float *)(tensor->data))[i] = value;
            } break;
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

void * ggml_get_data(const struct ggml_tensor * tensor) {
    return tensor->data;
}

float * ggml_get_data_f32(const struct ggml_tensor * tensor) {
    assert(tensor->type == GGML_TYPE_F32);
    return (float *)(tensor->data);
}

struct ggml_tensor * ggml_view_tensor(
        struct ggml_context * ctx,
        const struct ggml_tensor * src) {
    return ggml_new_tensor_impl(ctx, src->type, src->n_dims, src->ne, src->data);
}

////////////////////////////////////////////////////////////////////////////////

// ggml_dup

struct ggml_tensor * ggml_dup_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_DUP;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_dup(
        struct ggml_context * ctx,
        struct ggml_tensor * a) {
    return ggml_dup_impl(ctx, a, false);
}

struct ggml_tensor * ggml_dup_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor * a) {
    return ggml_dup_impl(ctx, a, true);
}

// ggml_add

struct ggml_tensor * ggml_add_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b,
        bool inplace) {
    GGML_ASSERT(ggml_are_same_shape(a, b));

    bool is_node = false;

    if (!inplace && (a->grad || b->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_ADD;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

struct ggml_tensor * ggml_add(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_add_impl(ctx, a, b, false);
}

struct ggml_tensor * ggml_add_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_add_impl(ctx, a, b, true);
}

// ggml_sub

struct ggml_tensor * ggml_sub_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b,
        bool inplace) {
    GGML_ASSERT(ggml_are_same_shape(a, b));

    bool is_node = false;

    if (!inplace && (a->grad || b->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_SUB;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

struct ggml_tensor * ggml_sub(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_sub_impl(ctx, a, b, false);
}

struct ggml_tensor * ggml_sub_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_sub_impl(ctx, a, b, true);
}

// ggml_mul

struct ggml_tensor * ggml_mul_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b,
        bool inplace) {
    GGML_ASSERT(ggml_are_same_shape(a, b));

    bool is_node = false;

    if (!inplace && (a->grad || b->grad)) {
        is_node = true;
    }

    if (inplace) {
        GGML_ASSERT(is_node == false);
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_MUL;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

struct ggml_tensor * ggml_mul(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    return ggml_mul_impl(ctx, a, b, false);
}

struct ggml_tensor * ggml_mul_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    return ggml_mul_impl(ctx, a, b, true);
}

// ggml_div

struct ggml_tensor * ggml_div_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b,
        bool inplace) {
    GGML_ASSERT(ggml_are_same_shape(a, b));

    bool is_node = false;

    if (!inplace && (a->grad || b->grad)) {
        is_node = true;
    }

    if (inplace) {
        GGML_ASSERT(is_node == false);
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_DIV;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

struct ggml_tensor * ggml_div(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    return ggml_div_impl(ctx, a, b, false);
}

struct ggml_tensor * ggml_div_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    return ggml_div_impl(ctx, a, b, true);
}

// ggml_sqr

struct ggml_tensor * ggml_sqr_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_SQR;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_sqr(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_sqr_impl(ctx, a, false);
}

struct ggml_tensor * ggml_sqr_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_sqr_impl(ctx, a, true);
}

// ggml_sqrt

struct ggml_tensor * ggml_sqrt_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_SQRT;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_sqrt(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_sqrt_impl(ctx, a, false);
}

struct ggml_tensor * ggml_sqrt_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_sqrt_impl(ctx, a, true);
}

// ggml_sum

struct ggml_tensor * ggml_sum(
        struct ggml_context * ctx,
        struct ggml_tensor * a) {
    bool is_node = false;

    if (a->grad) {
        is_node = true;
    }

    struct ggml_tensor * result = ggml_new_tensor_1d(ctx, a->type, 1);

    result->op   = GGML_OP_SUM;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

// ggml_mean

struct ggml_tensor * ggml_mean(
        struct ggml_context * ctx,
        struct ggml_tensor * a) {
    bool is_node = false;

    if (a->grad) {
        GGML_ASSERT(false); // TODO: implement
        is_node = true;
    }

    int ne[GGML_MAX_DIMS] = { 1, a->ne[1], a->ne[2], a->ne[3] };
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, a->n_dims, ne);

    result->op   = GGML_OP_MEAN;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

// ggml_repeat

struct ggml_tensor * ggml_repeat(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    GGML_ASSERT(ggml_can_repeat(a, b));

    bool is_node = false;

    if (a->grad) {
        is_node = true;
    }

    if (ggml_are_same_shape(a, b) && !is_node) {
        return a;
    }

    struct ggml_tensor * result = ggml_new_tensor(ctx, a->type, b->n_dims, b->ne);

    result->op   = GGML_OP_REPEAT;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

// ggml_abs

struct ggml_tensor * ggml_abs_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_ABS;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_abs(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_abs_impl(ctx, a, false);
}

struct ggml_tensor * ggml_abs_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_abs_impl(ctx, a, true);
}


// ggml_sgn

struct ggml_tensor * ggml_sgn_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_SGN;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_sgn(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_sgn_impl(ctx, a, false);
}

struct ggml_tensor * ggml_sgn_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_sgn_impl(ctx, a, true);
}

// ggml_neg

struct ggml_tensor * ggml_neg_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_NEG;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_neg(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_neg_impl(ctx, a, false);
}

struct ggml_tensor * ggml_neg_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_neg_impl(ctx, a, true);
}

// ggml_step

struct ggml_tensor * ggml_step_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_STEP;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_step(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_step_impl(ctx, a, false);
}

struct ggml_tensor * ggml_step_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_step_impl(ctx, a, true);
}

// ggml_relu

struct ggml_tensor * ggml_relu_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_RELU;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_relu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_relu_impl(ctx, a, false);
}

struct ggml_tensor * ggml_relu_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_relu_impl(ctx, a, true);
}

// ggml_gelu

struct ggml_tensor * ggml_gelu_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_GELU;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_gelu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_gelu_impl(ctx, a, false);
}

struct ggml_tensor * ggml_gelu_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_gelu_impl(ctx, a, true);
}

// ggml_silu

struct ggml_tensor * ggml_silu_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_SILU;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_silu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_silu_impl(ctx, a, false);
}

struct ggml_tensor * ggml_silu_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_silu_impl(ctx, a, true);
}

// ggml_norm

struct ggml_tensor * ggml_norm_impl(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        bool inplace) {
    bool is_node = false;

    if (!inplace && (a->grad)) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    result->op   = GGML_OP_NORM;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL; // TODO: maybe store epsilon here?

    return result;
}

struct ggml_tensor * ggml_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_norm_impl(ctx, a, false);
}

struct ggml_tensor * ggml_norm_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_norm_impl(ctx, a, true);
}

// ggml_mul_mat

struct ggml_tensor * ggml_mul_mat(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    GGML_ASSERT(ggml_can_mul_mat(a, b));

    bool is_node = false;

    if (a->grad || b->grad) {
        is_node = true;
    }

    const int ne[4] = { a->ne[1], b->ne[1], a->ne[2], b->ne[3] };
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, MIN(a->n_dims, b->n_dims), ne);

    result->op   = GGML_OP_MUL_MAT;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}
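
// Shape example for ggml_mul_mat (a sketch, not taken from the original source).
// It assumes ggml_can_mul_mat(), which is defined outside this section, requires
// a->ne[0] == b->ne[0]; the tensor sizes below are made up for illustration.
//
//     a:                        ne = { 64, 32, 1, 1 }
//     b:                        ne = { 64,  8, 1, 1 }
//     ggml_mul_mat(ctx, a, b):  ne = { 32,  8, 1, 1 }, type GGML_TYPE_F32
//
// The result takes ne[0] from a->ne[1] and ne[1] from b->ne[1], and it is
// allocated as F32 whatever the types of a and b are.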

// ggml_scale

struct ggml_tensor * ggml_scale_impl(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        bool inplace) {
    GGML_ASSERT(ggml_is_scalar(b));
    GGML_ASSERT(ggml_is_padded_1d(a));

    bool is_node = false;

    if (!inplace && (a->grad || b->grad)) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    // TODO: when implementing backward, fix this:
    //struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);
    struct ggml_tensor * result = ggml_view_tensor(ctx, a);

    result->op   = GGML_OP_SCALE;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

struct ggml_tensor * ggml_scale(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_scale_impl(ctx, a, b, false);
}

struct ggml_tensor * ggml_scale_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_scale_impl(ctx, a, b, true);
}

// ggml_cpy
//
// copies a into b; the result is a view of the destination b

struct ggml_tensor * ggml_cpy_impl(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        bool inplace) {
    GGML_ASSERT(ggml_nelements(a) == ggml_nelements(b));

    bool is_node = false;

    if (!inplace && (a->grad || b->grad)) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    // make a view of the destination
    struct ggml_tensor * result = ggml_view_tensor(ctx, b);

    result->op   = GGML_OP_CPY;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

struct ggml_tensor * ggml_cpy(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_cpy_impl(ctx, a, b, false);
}

struct ggml_tensor * ggml_cpy_inplace(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_cpy_impl(ctx, a, b, true);
}

// ggml_reshape
//
// the result shares a's data; only the shape of b (n_dims, ne) is used

struct ggml_tensor * ggml_reshape(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    GGML_ASSERT(ggml_is_contiguous(a));
    GGML_ASSERT(ggml_is_contiguous(b));
    GGML_ASSERT(ggml_nelements(a) == ggml_nelements(b));

    bool is_node = false;

    if (a->grad || b->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    struct ggml_tensor * result = ggml_new_tensor_impl(ctx, a->type, b->n_dims, b->ne, a->data);

    result->op   = GGML_OP_RESHAPE;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_reshape_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int                   ne0,
        int                   ne1) {
    GGML_ASSERT(ggml_is_contiguous(a));
    GGML_ASSERT(ggml_nelements(a) == ne0*ne1);

    bool is_node = false;

    if (a->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    const int ne[2] = { ne0, ne1 };
    struct ggml_tensor * result = ggml_new_tensor_impl(ctx, a->type, 2, ne, a->data);

    result->op   = GGML_OP_RESHAPE;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

struct ggml_tensor * ggml_reshape_3d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int                   ne0,
        int                   ne1,
        int                   ne2) {
    GGML_ASSERT(ggml_is_contiguous(a));
    GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2);

    bool is_node = false;

    if (a->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    const int ne[3] = { ne0, ne1, ne2 };
    struct ggml_tensor * result = ggml_new_tensor_impl(ctx, a->type, 3, ne, a->data);

    result->op   = GGML_OP_RESHAPE;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

// ggml_view_1d
//
// offset is a byte offset into a->data (not an element index)

struct ggml_tensor * ggml_view_1d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int                   ne0,
        size_t                offset) {
    if (a->grad) {
        GGML_ASSERT(false); // gradient propagation is not supported
    }

    struct ggml_tensor * result = ggml_new_tensor_impl(ctx, a->type, 1, &ne0, (char *) a->data + offset);

    result->op   = GGML_OP_VIEW;
    result->grad = NULL;
    result->src0 = a;
    result->src1 = NULL; // TODO: maybe store the offset here?

    return result;
}

// ggml_view_2d
//
// nb1 is the row stride of the view in bytes; offset is a byte offset into a->data

struct ggml_tensor * ggml_view_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int                   ne0,
        int                   ne1,
        size_t                nb1,
        size_t                offset) {
    if (a->grad) {
        GGML_ASSERT(false); // gradient propagation is not supported
    }

    const int ne[GGML_MAX_DIMS] = { ne0, ne1, 1, 1 };

    struct ggml_tensor * result = ggml_new_tensor_impl(ctx, a->type, 2, ne, (char *) a->data + offset);

    result->nb[1] = nb1;
    result->nb[2] = result->nb[1]*ne1;
    result->nb[3] = result->nb[2];

    result->op   = GGML_OP_VIEW;
    result->grad = NULL;
    result->src0 = a;
    result->src1 = NULL; // TODO: maybe store the offset here?

    return result;
}
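
// Usage sketch for ggml_view_2d (illustrative only; `m`, `r0` and `nr` are
// hypothetical names, and the caller is assumed to have checked that the rows
// exist, since this function does no bounds checking):
//
//     // view rows [r0, r0 + nr) of the 2-D tensor m without copying any data
//     struct ggml_tensor * rows = ggml_view_2d(ctx, m,
//             m->ne[0], nr,       // row length and number of rows in the view
//             m->nb[1],           // reuse the parent's row stride (bytes)
//             r0*m->nb[1]);       // byte offset of the first viewed row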

// ggml_permute

struct ggml_tensor * ggml_permute(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int                   axis0,
        int                   axis1,
        int                   axis2,
        int                   axis3) {
    GGML_ASSERT(axis0 >= 0 && axis0 < GGML_MAX_DIMS);
    GGML_ASSERT(axis1 >= 0 && axis1 < GGML_MAX_DIMS);
    GGML_ASSERT(axis2 >= 0 && axis2 < GGML_MAX_DIMS);
    GGML_ASSERT(axis3 >= 0 && axis3 < GGML_MAX_DIMS);

    GGML_ASSERT(axis0 != axis1);
    GGML_ASSERT(axis0 != axis2);
    GGML_ASSERT(axis0 != axis3);
    GGML_ASSERT(axis1 != axis2);
    GGML_ASSERT(axis1 != axis3);
    GGML_ASSERT(axis2 != axis3);

    bool is_node = false;

    if (a->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    struct ggml_tensor * result = ggml_view_tensor(ctx, a);

    int ne[GGML_MAX_DIMS];
    size_t nb[GGML_MAX_DIMS]; // strides are size_t in ggml_tensor; int could truncate them

    ne[axis0] = a->ne[0];
    ne[axis1] = a->ne[1];
    ne[axis2] = a->ne[2];
    ne[axis3] = a->ne[3];

    nb[axis0] = a->nb[0];
    nb[axis1] = a->nb[1];
    nb[axis2] = a->nb[2];
    nb[axis3] = a->nb[3];

    result->ne[0] = ne[0];
    result->ne[1] = ne[1];
    result->ne[2] = ne[2];
    result->ne[3] = ne[3];

    result->nb[0] = nb[0];
    result->nb[1] = nb[1];
    result->nb[2] = nb[2];
    result->nb[3] = nb[3];

    result->op   = GGML_OP_PERMUTE;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL; // TODO: maybe store the permutation here?

    return result;
}
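
// Worked example of the axis mapping in ggml_permute (a sketch with made-up
// sizes). Each axisN argument says where source dimension N ends up in the result:
//
//     a:                                ne = { 2, 3, 4, 1 }
//     ggml_permute(ctx, a, 2, 0, 1, 3): ne[2] = a->ne[0] = 2
//                                       ne[0] = a->ne[1] = 3
//                                       ne[1] = a->ne[2] = 4
//                                       ne[3] = a->ne[3] = 1
//     result:                           ne = { 3, 4, 2, 1 }
//
// The strides nb[] are moved the same way, so the result is a view over the
// same data and no elements are copied.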

// ggml_transpose

struct ggml_tensor * ggml_transpose(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    bool is_node = false;

    if (a->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    struct ggml_tensor * result = ggml_view_tensor(ctx, a);

    result->ne[0] = a->ne[1];
    result->ne[1] = a->ne[0];

    result->nb[0] = a->nb[1];
    result->nb[1] = a->nb[0];

    result->op   = GGML_OP_TRANSPOSE;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

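// ggml_get_rows
//
// a is a matrix and b is an I32 vector with one entry per selected row;
// the result is always F32 with ne = { a->ne[0], b->ne[0] }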

struct ggml_tensor * ggml_get_rows(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    GGML_ASSERT(ggml_is_matrix(a) && ggml_is_vector(b) && b->type == GGML_TYPE_I32);

    bool is_node = false;

    if (a->grad || b->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    // TODO: implement non F32 return
    //struct ggml_tensor * result = ggml_new_tensor_2d(ctx, a->type, a->ne[0], b->ne[0]);
    struct ggml_tensor * result = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, a->ne[0], b->ne[0]);

    result->op   = GGML_OP_GET_ROWS;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

// ggml_diag_mask_inf

struct ggml_tensor * ggml_diag_mask_inf(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int                   n_past) {
    bool is_node = false;

    if (a->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    // TODO: when implementing backward, fix this:
    //struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);
    struct ggml_tensor * result = ggml_view_tensor(ctx, a);
    struct ggml_tensor * b = ggml_new_i32(ctx, n_past);

    result->op   = GGML_OP_DIAG_MASK_INF;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

// ggml_soft_max

struct ggml_tensor * ggml_soft_max(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    bool is_node = false;

    if (a->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    // TODO: when implementing backward, fix this:
    //struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);
    struct ggml_tensor * result = ggml_view_tensor(ctx, a);

    result->op   = GGML_OP_SOFT_MAX;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = NULL;

    return result;
}

// ggml_rope

struct ggml_tensor * ggml_rope(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int                   n_past,
        int                   n_dims,
        int                   mode) {
    GGML_ASSERT(n_past >= 0);
    bool is_node = false;

    if (a->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    // TODO: when implementing backward, fix this:
    //struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);
    struct ggml_tensor * result = ggml_view_tensor(ctx, a);

    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 3);
    ((int32_t *) b->data)[0] = n_past;
    ((int32_t *) b->data)[1] = n_dims;
    ((int32_t *) b->data)[2] = mode;

    result->op   = GGML_OP_ROPE;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

// ggml_conv_1d_1s

struct ggml_tensor * ggml_conv_1d_1s(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    GGML_ASSERT(ggml_is_matrix(b));
    GGML_ASSERT(a->ne[1] == b->ne[1]);
    GGML_ASSERT(a->ne[3] == 1);
    bool is_node = false;

    if (a->grad || b->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    const int ne[4] = { b->ne[0], a->ne[2], 1, 1, };
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 2, ne);

    result->op   = GGML_OP_CONV_1D_1S;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

// ggml_conv_1d_2s

struct ggml_tensor * ggml_conv_1d_2s(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    GGML_ASSERT(ggml_is_matrix(b));
    GGML_ASSERT(a->ne[1] == b->ne[1]);
    GGML_ASSERT(a->ne[3] == 1);
    bool is_node = false;

    if (a->grad || b->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    const int ne[4] = { b->ne[0]/2, a->ne[2], 1, 1, };
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 2, ne);

    result->op   = GGML_OP_CONV_1D_2S;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

// ggml_flash_attn

struct ggml_tensor * ggml_flash_attn(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        bool                  masked) {
    GGML_ASSERT(ggml_can_mul_mat(k, q));
    // TODO: check if vT can be multiplied by (k*qT)

    bool is_node = false;

    if (q->grad || k->grad || v->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    //struct ggml_tensor * result = ggml_dup_tensor(ctx, q);
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, q->ne);

    result->op   = GGML_OP_FLASH_ATTN;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = q;
    result->src1 = k;
    result->opt[0] = v;
    result->opt[1] = ggml_new_i32(ctx, masked ? 1 : 0);

    return result;
}

// ggml_flash_ff

struct ggml_tensor * ggml_flash_ff(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b0,
        struct ggml_tensor  * b1,
        struct ggml_tensor  * c0,
        struct ggml_tensor  * c1) {
    GGML_ASSERT(ggml_can_mul_mat(b0, a));
    // TODO: more checks

    bool is_node = false;

    if (a->grad || b0->grad || b1->grad || c0->grad || c1->grad) {
        GGML_ASSERT(false); // TODO: implement backward
        is_node = true;
    }

    //struct ggml_tensor * result = ggml_dup_tensor(ctx, a);
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, a->ne);

    result->op   = GGML_OP_FLASH_FF;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b0;
    result->opt[0] = b1;
    result->opt[1] = c0;
    result->opt[2] = c1;

    return result;
}

////////////////////////////////////////////////////////////////////////////////

void ggml_set_param(
        struct ggml_context * ctx,
        struct ggml_tensor * tensor) {
    tensor->is_param = true;

    GGML_ASSERT(tensor->grad == NULL);
    tensor->grad = ggml_dup_tensor(ctx, tensor);
}

// ggml_compute_forward_dup

static void ggml_compute_forward_dup_f16(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    GGML_ASSERT(params->ith == 0);
    GGML_ASSERT(ggml_is_contiguous(dst));
    GGML_ASSERT(ggml_nelements(dst) == ggml_nelements(src0));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int ne00 = src0->ne[0];
    const int ne01 = src0->ne[1];
    const int ne02 = src0->ne[2];
    const int ne03 = src0->ne[3];

    const size_t nb00 = src0->nb[0];
    const size_t nb01 = src0->nb[1];
    const size_t nb02 = src0->nb[2];
    const size_t nb03 = src0->nb[3];

    if (ggml_is_contiguous(src0) && src0->type == dst->type) {
        memcpy(dst->data, src0->data, ggml_nelements(dst) * GGML_TYPE_SIZE[src0->type]);
        return;
    }

    if (src0->nb[0] == sizeof(ggml_fp16_t)) {
        if (dst->type == GGML_TYPE_F16) {
            int id = 0;
            const size_t rs = ne00*nb00;

            for (int i03 = 0; i03 < ne03; i03++) {
                for (int i02 = 0; i02 < ne02; i02++) {
                    for (int i01 = 0; i01 < ne01; i01++) {
                        const char * src0_ptr = (char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03;
                        char * dst_ptr = (char *) dst->data + id*rs;

                        memcpy(dst_ptr, src0_ptr, rs);

                        id++;
                    }
                }
            }
        } else if (dst->type == GGML_TYPE_F32) {
            int id = 0;
            float * dst_ptr = (float *) dst->data;

            for (int i03 = 0; i03 < ne03; i03++) {
                for (int i02 = 0; i02 < ne02; i02++) {
                    for (int i01 = 0; i01 < ne01; i01++) {
                        for (int i00 = 0; i00 < ne00; i00++) {
                            const ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

                            dst_ptr[id] = GGML_FP16_TO_FP32(*src0_ptr);
                            id++;
                        }
                    }
                }
            }
        } else {
            GGML_ASSERT(false); // TODO: implement
        }
    } else {
        //printf("%s: this is not optimal - fix me\n", __func__);

        if (dst->type == GGML_TYPE_F32) {
            int id = 0;
            float * dst_ptr = (float *) dst->data;

            for (int i03 = 0; i03 < ne03; i03++) {
                for (int i02 = 0; i02 < ne02; i02++) {
                    for (int i01 = 0; i01 < ne01; i01++) {
                        for (int i00 = 0; i00 < ne00; i00++) {
                            const ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

                            dst_ptr[id] = GGML_FP16_TO_FP32(*src0_ptr);
                            id++;
                        }
                    }
                }
            }
        } else if (dst->type == GGML_TYPE_F16) {
            int id = 0;
            ggml_fp16_t * dst_ptr = (ggml_fp16_t *) dst->data;

            for (int i03 = 0; i03 < ne03; i03++) {
                for (int i02 = 0; i02 < ne02; i02++) {
                    for (int i01 = 0; i01 < ne01; i01++) {
                        for (int i00 = 0; i00 < ne00; i00++) {
                            const ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

                            dst_ptr[id] = *src0_ptr;
                            id++;
                        }
                    }
                }
            }
        } else {
            GGML_ASSERT(false); // TODO: implement
        }
    }
}

static void ggml_compute_forward_dup_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    GGML_ASSERT(params->ith == 0);
    GGML_ASSERT(ggml_is_contiguous(dst));
    GGML_ASSERT(ggml_nelements(dst) == ggml_nelements(src0));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int ne00 = src0->ne[0];
    const int ne01 = src0->ne[1];
    const int ne02 = src0->ne[2];
    const int ne03 = src0->ne[3];

    const size_t nb00 = src0->nb[0];
    const size_t nb01 = src0->nb[1];
    const size_t nb02 = src0->nb[2];
    const size_t nb03 = src0->nb[3];

    if (ggml_is_contiguous(src0) && src0->type == dst->type) {
        memcpy(dst->data, src0->data, ggml_nelements(dst) * GGML_TYPE_SIZE[src0->type]);
        return;
    }

    if (src0->nb[0] == sizeof(float)) {
        if (dst->type == GGML_TYPE_F32) {
            int id = 0;
            const size_t rs = ne00*nb00;

            for (int i03 = 0; i03 < ne03; i03++) {
                for (int i02 = 0; i02 < ne02; i02++) {
                    for (int i01 = 0; i01 < ne01; i01++) {
                        const char * src0_ptr = (char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03;
                        char * dst_ptr = (char *) dst->data + id*rs;

                        memcpy(dst_ptr, src0_ptr, rs);

                        id++;
                    }
                }
            }
        } else if (dst->type == GGML_TYPE_F16) {
            int id = 0;
            ggml_fp16_t * dst_ptr = (ggml_fp16_t *) dst->data;

            for (int i03 = 0; i03 < ne03; i03++) {
                for (int i02 = 0; i02 < ne02; i02++) {
                    for (int i01 = 0; i01 < ne01; i01++) {
                        for (int i00 = 0; i00 < ne00; i00++) {
                            const float * src0_ptr = (float *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

                            dst_ptr[id] = GGML_FP32_TO_FP16(*src0_ptr);
                            id++;
                        }
                    }
                }
            }
        } else {
            GGML_ASSERT(false); // TODO: implement
        }
    } else {
        //printf("%s: this is not optimal - fix me\n", __func__);

        if (dst->type == GGML_TYPE_F32) {
            int id = 0;
            float * dst_ptr = (float *) dst->data;

            for (int i03 = 0; i03 < ne03; i03++) {
                for (int i02 = 0; i02 < ne02; i02++) {
                    for (int i01 = 0; i01 < ne01; i01++) {
                        for (int i00 = 0; i00 < ne00; i00++) {
                            const float * src0_ptr = (float *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

                            dst_ptr[id] = *src0_ptr;
                            id++;
                        }
                    }
                }
            }
        } else if (dst->type == GGML_TYPE_F16) {
            int id = 0;
            ggml_fp16_t * dst_ptr = (ggml_fp16_t *) dst->data;

            for (int i03 = 0; i03 < ne03; i03++) {
                for (int i02 = 0; i02 < ne02; i02++) {
                    for (int i01 = 0; i01 < ne01; i01++) {
                        for (int i00 = 0; i00 < ne00; i00++) {
                            const float * src0_ptr = (float *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

                            dst_ptr[id] = GGML_FP32_TO_FP16(*src0_ptr);
                            id++;
                        }
                    }
                }
            }
        } else {
            GGML_ASSERT(false); // TODO: implement
        }
    }
}

static void ggml_compute_forward_dup(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F16:
            {
                ggml_compute_forward_dup_f16(params, src0, dst);
            } break;
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_dup_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_add

static void ggml_compute_forward_add_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    GGML_ASSERT(ggml_are_same_shape(src0, src1) && ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int ith = params->ith;
    const int nth = params->nth;

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    const size_t nb00 = src0->nb[0];
    const size_t nb01 = src0->nb[1];

    const size_t nb10 = src1->nb[0];
    const size_t nb11 = src1->nb[1];

    const size_t nb0 = dst->nb[0];
    const size_t nb1 = dst->nb[1];

    GGML_ASSERT( nb0 == sizeof(float));
    GGML_ASSERT(nb00 == sizeof(float));

    if (nb10 == sizeof(float)) {
        const int j0 = (n/nth)*ith;
        const int j1 = ith == nth - 1 ? n : (n/nth)*(ith + 1);

        for (int j = j0; j < j1; j++) {
            ggml_vec_add_f32(nc,
                    (float *) ((char *) dst->data  + j*nb1),
                    (float *) ((char *) src0->data + j*nb01),
                    (float *) ((char *) src1->data + j*nb11));
        }
    } else {
        // src1 is not contiguous
        for (int j = ith; j < n; j += nth) {
            float * dst_ptr  = (float *) ((char *) dst->data  + j*nb1);
            float * src0_ptr = (float *) ((char *) src0->data + j*nb01);
            for (int i = 0; i < nc; i++) {
                float * src1_ptr = (float *) ((char *) src1->data + j*nb11 + i*nb10);

                dst_ptr[i] = src0_ptr[i] + *src1_ptr;
            }
        }
    }
}
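
// Worked example of the row split in ggml_compute_forward_add_f32 when src1 is
// contiguous (the numbers are illustrative). With n = 10 rows and nth = 4
// threads, n/nth = 2, and the last thread also takes the remainder:
//
//     ith = 0: rows [0, 2)
//     ith = 1: rows [2, 4)
//     ith = 2: rows [4, 6)
//     ith = 3: rows [6, 10)
//
// The non-contiguous branch interleaves rows instead: thread ith handles
// j = ith, ith + nth, ith + 2*nth, ...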

static void ggml_compute_forward_add(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_add_f32(params, src0, src1, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_sub

static void ggml_compute_forward_sub_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, src1) && ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert( dst->nb[0] == sizeof(float));
    assert(src0->nb[0] == sizeof(float));
    assert(src1->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_sub_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])),
                (float *) ((char *) src1->data + i*(src1->nb[1])));
    }
}

static void ggml_compute_forward_sub(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_sub_f32(params, src0, src1, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_mul

static void ggml_compute_forward_mul_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, src1) && ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert( dst->nb[0] == sizeof(float));
    assert(src0->nb[0] == sizeof(float));
    assert(src1->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_mul_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])),
                (float *) ((char *) src1->data + i*(src1->nb[1])));
    }
}

static void ggml_compute_forward_mul(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_mul_f32(params, src0, src1, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_div

static void ggml_compute_forward_div_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, src1) && ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert( dst->nb[0] == sizeof(float));
    assert(src0->nb[0] == sizeof(float));
    assert(src1->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_div_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])),
                (float *) ((char *) src1->data + i*(src1->nb[1])));
    }
}

static void ggml_compute_forward_div(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_div_f32(params, src0, src1, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_sqr

static void ggml_compute_forward_sqr_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n     = ggml_nrows(src0);
    const int nc    = src0->ne[0];

    assert( dst->nb[0] == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_sqr_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])));
    }
}

static void ggml_compute_forward_sqr(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_sqr_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_sqrt

static void ggml_compute_forward_sqrt_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert( dst->nb[0] == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_sqrt_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])));
    }
}

static void ggml_compute_forward_sqrt(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_sqrt_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_sum

static void ggml_compute_forward_sum_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_is_scalar(dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    assert(src0->nb[0] == sizeof(float));

    const int ne00 = src0->ne[0];
    const int ne01 = src0->ne[1];
    const int ne02 = src0->ne[2];
    const int ne03 = src0->ne[3];

    const size_t nb01 = src0->nb[1];
    const size_t nb02 = src0->nb[2];
    const size_t nb03 = src0->nb[3];

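    // ggml_vec_sum_f32 stores its result rather than accumulating into it
    // (ggml_compute_forward_mean_f32 relies on this), so add up the per-row sums here;
    // writing each row sum straight to dst would keep only the last row
    ggml_float sum     = 0.0;
    float      row_sum = 0.0f;

    for (int i03 = 0; i03 < ne03; i03++) {
        for (int i02 = 0; i02 < ne02; i02++) {
            for (int i01 = 0; i01 < ne01; i01++) {
                ggml_vec_sum_f32(ne00,
                        &row_sum,
                        (float *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03));
                sum += row_sum;
            }
        }
    }

    ((float *) dst->data)[0] = (float) sum;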
}

static void ggml_compute_forward_sum(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_sum_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_mean

static void ggml_compute_forward_mean_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    assert(src0->nb[0] == sizeof(float));

    const int ne00 = src0->ne[0];
    const int ne01 = src0->ne[1];
    const int ne02 = src0->ne[2];
    const int ne03 = src0->ne[3];

    const size_t nb01 = src0->nb[1];
    const size_t nb02 = src0->nb[2];
    const size_t nb03 = src0->nb[3];

    const int ne0 = dst->ne[0];
    const int ne1 = dst->ne[1];
    const int ne2 = dst->ne[2];
    const int ne3 = dst->ne[3];

    assert(ne0 == 1);
    assert(ne1 == ne01);
    assert(ne2 == ne02);
    assert(ne3 == ne03);

    UNUSED(ne0);
    UNUSED(ne1);
    UNUSED(ne2);
    UNUSED(ne3);

    const size_t nb1 = dst->nb[1];
    const size_t nb2 = dst->nb[2];
    const size_t nb3 = dst->nb[3];

    for (int i03 = 0; i03 < ne03; i03++) {
        for (int i02 = 0; i02 < ne02; i02++) {
            for (int i01 = 0; i01 < ne01; i01++) {
                ggml_vec_sum_f32(ne00,
                        (float *) ((char *)  dst->data + i01*nb1  + i02*nb2  + i03*nb3),
                        (float *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03));

                *(float *) ((char *) dst->data + i01*nb1 + i02*nb2 + i03*nb3) /= (float) ne00;
            }
        }
    }
}

static void ggml_compute_forward_mean(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_mean_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_repeat

static void ggml_compute_forward_repeat_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_can_repeat(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    // TODO: implement support for rank > 2 tensors
    assert(src0->ne[2] == 1);
    assert(src0->ne[3] == 1);
    assert( dst->ne[2] == 1);
    assert( dst->ne[3] == 1);

    const int nc  = dst->ne[0];
    const int nr  = dst->ne[1];
    const int nc0 = src0->ne[0];
    const int nr0 = src0->ne[1];
    const int ncr = nc/nc0; // guaranteed to be an integer due to the check in ggml_can_repeat
    const int nrr = nr/nr0; // guaranteed to be an integer due to the check in ggml_can_repeat
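    //
    // illustrative example (shapes chosen here, not taken from a caller):
    //   src0 has nc0 = 3 columns and nr0 = 2 rows, dst has nc = 6 columns and nr = 4 rows
    //   => ncr = 2, nrr = 2, so dst is a 2 x 2 grid of copies of src0
    //   tile (i, j) places src0 row k into dst row i*nr0 + k, starting at column j*nc0
    //   e.g. i = 1, j = 1, k = 1 copies src0 row 1 into dst row 3, columns 3..5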

    // TODO: support for transposed / permuted tensors
    assert( dst->nb[0] == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    // TODO: maybe this is not optimal?
    for (int i = 0; i < nrr; i++) {
        for (int j = 0; j < ncr; j++) {
            for (int k = 0; k < nr0; k++) {
                ggml_vec_cpy_f32(nc0,
                        (float *) ((char *)  dst->data + (i*nr0 + k)*( dst->nb[1]) + j*nc0*( dst->nb[0])),
                        (float *) ((char *) src0->data + (        k)*(src0->nb[1])));
            }
        }
    }
}

static void ggml_compute_forward_repeat(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_repeat_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_abs

static void ggml_compute_forward_abs_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert(dst->nb[0]  == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_abs_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])));
    }
}

static void ggml_compute_forward_abs(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_abs_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_sgn

static void ggml_compute_forward_sgn_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert(dst->nb[0]  == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_sgn_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])));
    }
}

static void ggml_compute_forward_sgn(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_sgn_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_neg

static void ggml_compute_forward_neg_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert(dst->nb[0]  == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_neg_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])));
    }
}

static void ggml_compute_forward_neg(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_neg_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_step

static void ggml_compute_forward_step_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert(dst->nb[0]  == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_step_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])));
    }
}

static void ggml_compute_forward_step(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_step_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_relu

static void ggml_compute_forward_relu_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert(dst->nb[0]  == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        ggml_vec_relu_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])));
    }
}

static void ggml_compute_forward_relu(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_relu_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_gelu

static void ggml_compute_forward_gelu_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    GGML_ASSERT(ggml_is_contiguous(src0));
    GGML_ASSERT(ggml_is_contiguous(dst));
    GGML_ASSERT(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int ith = params->ith;
    const int nth = params->nth;

    const int nc = src0->ne[0];
    const int nr = ggml_nrows(src0);

    // rows per thread
    const int dr = (nr + nth - 1)/nth;

    // row range for this thread
    const int ir0 = dr*ith;
    const int ir1 = MIN(ir0 + dr, nr);
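    //
    // illustrative example of the split above (numbers chosen here):
    //   nr = 10, nth = 4 => dr = (10 + 3)/4 = 3
    //   ith = 0 -> rows [0, 3), ith = 1 -> [3, 6), ith = 2 -> [6, 9), ith = 3 -> [9, 10)
    // when nr is small a trailing thread can get ir0 >= ir1 (e.g. nr = 5, nth = 4, ith = 3
    // gives ir0 = 6, ir1 = 5), in which case the loop below simply does no work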

    for (int i1 = ir0; i1 < ir1; i1++) {
        ggml_vec_gelu_f32(nc,
                (float *) ((char *) dst->data  + i1*( dst->nb[1])),
                (float *) ((char *) src0->data + i1*(src0->nb[1])));

#ifndef NDEBUG
        for (int k = 0; k < nc; k++) {
            const float x = ((float *) ((char *) dst->data + i1*( dst->nb[1])))[k];
            UNUSED(x);
            assert(!isnan(x));
            assert(!isinf(x));
        }
#endif
    }
}

static void ggml_compute_forward_gelu(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_gelu_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }

    //printf("XXXXXXXX gelu\n");
}

// ggml_compute_forward_silu

static void ggml_compute_forward_silu_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    GGML_ASSERT(ggml_is_contiguous(src0));
    GGML_ASSERT(ggml_is_contiguous(dst));
    GGML_ASSERT(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int ith = params->ith;
    const int nth = params->nth;

    const int nc = src0->ne[0];
    const int nr = ggml_nrows(src0);

    // rows per thread
    const int dr = (nr + nth - 1)/nth;

    // row range for this thread
    const int ir0 = dr*ith;
    const int ir1 = MIN(ir0 + dr, nr);

    for (int i1 = ir0; i1 < ir1; i1++) {
        ggml_vec_silu_f32(nc,
                (float *) ((char *) dst->data  + i1*( dst->nb[1])),
                (float *) ((char *) src0->data + i1*(src0->nb[1])));

#ifndef NDEBUG
        for (int k = 0; k < nc; k++) {
            const float x = ((float *) ((char *) dst->data + i1*( dst->nb[1])))[k];
            UNUSED(x);
            assert(!isnan(x));
            assert(!isinf(x));
        }
#endif
    }
}

static void ggml_compute_forward_silu(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_silu_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}


// ggml_compute_forward_norm
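//
// sketch of the per-row computation below (x is one row of ne00 floats, eps = 1e-5):
//   mean = (1/ne00) * sum_i x[i]
//   y[i] = (x[i] - mean) / sqrt((1/ne00) * sum_j (x[j] - mean)^2 + eps)
// illustrative example (values chosen here, not taken from the code): x = {1, 2, 3}
// gives mean = 2, sum2 = 2 and scale = 1/sqrt(2/3 + eps), about 1.2247,
// so y is about {-1.2247, 0, 1.2247}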

static void ggml_compute_forward_norm_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    GGML_ASSERT(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    GGML_ASSERT(src0->nb[0] == sizeof(float));

    const int ith = params->ith;
    const int nth = params->nth;

    const int ne00 = src0->ne[0];
    const int ne01 = src0->ne[1];
    const int ne02 = src0->ne[2];
    const int ne03 = src0->ne[3];

    const size_t nb01 = src0->nb[1];
    const size_t nb02 = src0->nb[2];
    const size_t nb03 = src0->nb[3];

    const size_t nb1 = dst->nb[1];
    const size_t nb2 = dst->nb[2];
    const size_t nb3 = dst->nb[3];

    const ggml_float eps = 1e-5f; // TODO: make this a parameter

    // TODO: optimize
    for (int i03 = 0; i03 < ne03; i03++) {
        for (int i02 = 0; i02 < ne02; i02++) {
            for (int i01 = ith; i01 < ne01; i01 += nth) {
                const float * x = (float *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);

                ggml_float mean = 0.0;
                for (int i00 = 0; i00 < ne00; i00++) {
                    mean += x[i00];
                }

                mean /= ne00;

                float * y = (float *) ((char *) dst->data + i01*nb1 + i02*nb2 + i03*nb3);

                ggml_float sum2 = 0.0;
                for (int i00 = 0; i00 < ne00; i00++) {
                    ggml_float v = x[i00] - mean;
                    y[i00] = v;
                    sum2 += v*v;
                }

                const float scale = 1.0/sqrt(sum2/ne00 + eps);

                ggml_vec_scale_f32(ne00, y, scale);
            }
        }
    }
}

static void ggml_compute_forward_norm(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_norm_f32(params, src0, dst);
            } break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                GGML_ASSERT(false);
            } break;
    }
}

// ggml_compute_forward_mul_mat

#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS)
// helper function to decide whether to hand the multiplication to BLAS
// BLAS is used only when src0 and src1 are contiguous and ne0, ne1 and ne10 are all >= 32
static bool ggml_compute_forward_mul_mat_use_blas(
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
              struct ggml_tensor * dst) {
    UNUSED(src0);

    const int ne10 = src1->ne[0];

    const int ne0 = dst->ne[0];
    const int ne1 = dst->ne[1];

    // TODO: find the optimal values for these
    if (ggml_is_contiguous(src0) &&
        ggml_is_contiguous(src1) && ((ne0 >= 32 && ne1 >= 32 && ne10 >= 32))) {
        //printf("BLAS: %d %d %d\n", ne0, ne1, ne10);
        return true;
    }

    return false;
}
#endif

static void ggml_compute_forward_mul_mat_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
              struct ggml_tensor * dst) {
    int64_t t0 = ggml_perf_time_us();
    UNUSED(t0);

    const int ne00 = src0->ne[0];
    const int ne01 = src0->ne[1];
    const int ne02 = src0->ne[2];
    const int ne03 = src0->ne[3];

    const int ne10 = src1->ne[0];
    const int ne11 = src1->ne[1];
    const int ne12 = src1->ne[2];
    const int ne13 = src1->ne[3];

    const int ne0  = dst->ne[0];
    const int ne1  = dst->ne[1];
    const int ne2  = dst->ne[2];
    const int ne3  = dst->ne[3];
    const int ne   = ne0*ne1*ne2*ne3;

    const int nb00 = src0->nb[0];
    const int nb01 = src0->nb[1];
    const int nb02 = src0->nb[2];
    const int nb03 = src0->nb[3];

    const int nb10 = src1->nb[0];
    const int nb11 = src1->nb[1];
    const int nb12 = src1->nb[2];
    const int nb13 = src1->nb[3];

    const int nb0  = dst->nb[0];
    const int nb1  = dst->nb[1];
    const int nb2  = dst->nb[2];
    const int nb3  = dst->nb[3];

    const int ith = params->ith;
    const int nth = params->nth;

    assert(ne02 == ne12);
    assert(ne03 == ne13);
    assert(ne2  == ne12);
    assert(ne3  == ne13);

    // TODO: we don't support permuted src0
    assert(nb00 == sizeof(float) || nb01 == sizeof(float));

    // dst cannot be transposed or permuted
    assert(nb0 == sizeof(float));
    assert(nb0 <= nb1);
    assert(nb1 <= nb2);
    assert(nb2 <= nb3);

    assert(ne0 == ne01);
    assert(ne1 == ne11);
    assert(ne2 == ne02);
    assert(ne3 == ne03);

    // nb01 >= nb00 - src0 is not transposed
    //   compute by src0 rows
    //
    // nb01 <  nb00 - src0 is transposed
    //   compute by src0 columns
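    //
    // in index form, for each 2D slice (i2, i3), the row path below computes
    //   dst[i1][i0] = sum over k < ne00 of src0[i0][k] * src1[i1][k]
    // i.e. dst = src1 * src0^T when the slices are read as row-major matrices
    // (dst is ne11 x ne01, src1 is ne11 x ne10, src0 is ne01 x ne00; this reading
    //  assumes ne00 == ne10, which the code relies on but does not assert here)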

#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS)
    if (ggml_compute_forward_mul_mat_use_blas(src0, src1, dst)) {
        GGML_ASSERT(nb10 == sizeof(float));

        if (params->ith != 0) {
            return;
        }

        if (params->type == GGML_TASK_INIT) {
            return;
        }

        if (params->type == GGML_TASK_FINALIZE) {
            return;
        }

        for (int i03 = 0; i03 < ne03; i03++) {
            for (int i02 = 0; i02 < ne02; i02++) {
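                // offset src0 by the current (i02, i03) slice, as is done for src1 and dst
                const float * x = (float *) ((char *) src0->data + i02*nb02 + i03*nb03);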
                const float * y = (float *) ((char *) src1->data + i02*nb12 + i03*nb13);

                float * d = (float *) ((char *) dst->data + i02*nb2 + i03*nb3);

                // zT = y * xT
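                //
                // how the arguments map onto cblas_sgemm(order, transA, transB, M, N, K, ...):
                //   M = ne11, N = ne01, K = ne10
                //   A = y, read as M x K with lda = ne10
                //   B = x, stored as N x K with ldb = ne10 and used transposed (CblasTrans)
                //   C = d, written as M x N with ldc = ne01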
                {
                    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                            ne11, ne01, ne10,
                            1.0f,    y, ne10,
                                     x, ne10,
                            0.0f,    d, ne01);
                }
            }
        }

        //printf("CBLAS F32 = %f ms, %d x %d x %d x %d\n", (ggml_perf_time_us() - t0)/1000.0, ne0, ne1, ne2, ne3);

        return;
    }
#endif

    if (params->type == GGML_TASK_INIT) {
        if (nb01 >= nb00) {
            return;
        }

        // TODO: fix this memset (wsize is overestimated)
        memset(params->wdata, 0, params->wsize);
        return;
    }

    if (params->type == GGML_TASK_FINALIZE) {
        if (nb01 >= nb00) {
            return;
        }

        // TODO: re-enable this check once wsize is no longer overestimated
        //assert(params->wsize == (ggml_nbytes(dst) + CACHE_LINE_SIZE)*nth);

        float * const wdata = params->wdata;

        // dst elements per thread
        const int dc = (ne + nth - 1)/nth;

        // dst element range for this thread
        const int ic0 = dc*ith;
        const int ic1 = MIN(ic0 + dc, ne);

        ggml_vec_cpy_f32(ic1 - ic0, (float *) dst->data + ic0, wdata + ic0);

        for (int k = 1; k < nth; k++) {
            ggml_vec_acc_f32(ic1 - ic0, (float *) dst->data + ic0, wdata + (ne + CACHE_LINE_SIZE_F32)*k + ic0);
        }

        return;
    }

    if (nb01 >= nb00) {
        // TODO: we don't support transposed src1
        assert(nb10 == sizeof(float));

        // parallelize by src0 rows using ggml_vec_dot_f32

        // total rows in src0
        const int nr = ne01*ne02*ne03;

        // rows per thread
        const int dr = (nr + nth - 1)/nth;

        // row range for this thread
        const int ir0 = dr*ith;
        const int ir1 = MIN(ir0 + dr, nr);

        for (int ir = ir0; ir < ir1; ++ir) {
            // src0 indices
            const int i03 = ir/(ne02*ne01);
            const int i02 = (ir - i03*ne02*ne01)/ne01;
            const int i01 = (ir - i03*ne02*ne01 - i02*ne01);

            for (int ic = 0; ic < ne11; ++ic) {
                // src1 indices
                const int i13 = i03;
                const int i12 = i02;
                const int i11 = ic;

                // dst indices
                const int i0 = i01;
                const int i1 = i11;
                const int i2 = i02;
                const int i3 = i03;

                ggml_vec_dot_f32(ne00,
                        (float *) ((char *)  dst->data + (i0*nb0 + i1*nb1 + i2*nb2 + i3*nb3)),
                        (float *) ((char *) src0->data + (i01*nb01 + i02*nb02 + i03*nb03)),
                        (float *) ((char *) src1->data + (i11*nb11 + i12*nb12 + i13*nb13)));
            }
        }
    } else {
        // parallelize by src1 columns using ggml_vec_mad_f32
        // each thread has its own work data
        // during FINALIZE we accumulate all work data into dst
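        //
        // sketch of the work buffer layout this path assumes (offsets in floats):
        //   thread ith accumulates into the ne floats starting at wdata + (ne + CACHE_LINE_SIZE_F32)*ith
        //   the CACHE_LINE_SIZE_F32 gap between slices presumably avoids false sharing
        //   FINALIZE then computes dst[e] = sum over k < nth of wdata[(ne + CACHE_LINE_SIZE_F32)*k + e]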

        // total columns in src1
        const int nc = ne10;

        // columns per thread
        const int dc = (nc + nth - 1)/nth;

        // column range for this thread
        const int ic0 = dc*ith;
        const int ic1 = MIN(ic0 + dc, nc);

        // work data for thread
        const int wo = (ne + CACHE_LINE_SIZE_F32)*ith;
        float * const wdata = params->wdata;

        for (int i13 = 0; i13 < ne13; ++i13) {
            for (int i12 = 0; i12 < ne12; ++i12) {
                for (int i11 = 0; i11 < ne11; ++i11) {
                    for (int ic = ic0; ic < ic1; ++ic) {
                        // src1 indices
                        const int i10 = ic;

                        // src0 indices
                        const int i03 = i13;
                        const int i02 = i12;
                        const int i00 = ic;

                        // dst indices
                        const int i1 = i11;
                        const int i2 = i12;
                        const int i3 = i13;

                        assert(sizeof(float)*(wo + i3*ne2*ne1*ne0 + i2*ne1*ne0 + i1*ne0 + ne01) <= params->wsize);

                        ggml_vec_mad_f32(ne01,
                                (float *) (wdata + wo + i3*ne2*ne1*ne0 + i2*ne1*ne0 + i1*ne0),
                                (float *) ((char *) src0->data + (i00*nb00 + i02*nb02 + i03*nb03)),
                               *(float *) ((char *) src1->data + (i10*nb10 + i11*nb11 + i12*nb12 + i13*nb13)));
                    }
                }
            }
        }
    }

    //int64_t t1 = ggml_perf_time_us();
    //static int64_t acc = 0;
    //acc += t1 - t0;
    //if (t1 - t0 > 10) {
    //    printf("\n");
    //    printf("ne00 = %5d, ne01 = %5d, ne02 = %5d, ne03 = %5d\n", ne00, ne01, ne02, ne03);
    //    printf("nb00 = %5d, nb01 = %5d, nb02 = %5d, nb03 = %5d\n", nb00, nb01, nb02, nb03);
    //    printf("ne10 = %5d, ne11 = %5d, ne12 = %5d, ne13 = %5d\n", ne10, ne11, ne12, ne13);
    //    printf("nb10 = %5d, nb11 = %5d, nb12 = %5d, nb13 = %5d\n", nb10, nb11, nb12, nb13);

    //    printf("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX task %d/%d: %d us, acc = %d\n", ith, nth, (int) (t1 - t0), (int) acc);
    //}
}

static void ggml_compute_forward_mul_mat_f16_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
              struct ggml_tensor * dst) {
    int64_t t0 = ggml_perf_time_us();
    UNUSED(t0);

    const int ne00 = src0->ne[0];
    const int ne01 = src0->ne[1];
    const int ne02 = src0->ne[2];
    const int ne03 = src0->ne[3];

    const int ne10 = src1->ne[0];
    const int ne11 = src1->ne[1];
    const int ne12 = src1->ne[2];
    const int ne13 = src1->ne[3];

    const int ne0  = dst->ne[0];
    const int ne1  = dst->ne[1];
    const int ne2  = dst->ne[2];
    const int ne3  = dst->ne[3];
    const int ne   = ne0*ne1*ne2*ne3;

    const int nb00 = src0->nb[0];
    const int nb01 = src0->nb[1];
    const int nb02 = src0->nb[2];
    const int nb03 = src0->nb[3];

    const int nb10 = src1->nb[0];
    const int nb11 = src1->nb[1];
    const int nb12 = src1->nb[2];
    const int nb13 = src1->nb[3];

    const int nb0  = dst->nb[0];
    const int nb1  = dst->nb[1];
    const int nb2  = dst->nb[2];
    const int nb3  = dst->nb[3];

    const int ith = params->ith;
    const int nth = params->nth;

    GGML_ASSERT(ne02 == ne12);
    GGML_ASSERT(ne03 == ne13);
    GGML_ASSERT(ne2  == ne12);
    GGML_ASSERT(ne3  == ne13);

    // TODO: we don't support permuted src0
    GGML_ASSERT(nb00 == sizeof(ggml_fp16_t) || nb01 == sizeof(ggml_fp16_t));

    // dst cannot be transposed or permuted
    GGML_ASSERT(nb0 == sizeof(float));
    GGML_ASSERT(nb0 <= nb1);
    GGML_ASSERT(nb1 <= nb2);
    GGML_ASSERT(nb2 <= nb3);

    GGML_ASSERT(ne0 == ne01);
    GGML_ASSERT(ne1 == ne11);
    GGML_ASSERT(ne2 == ne02);
    GGML_ASSERT(ne3 == ne03);

    // nb01 >= nb00 - src0 is not transposed
    //   compute by src0 rows
    //
    // nb01 <  nb00 - src0 is transposed
    //   compute by src0 columns

#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS)
    if (ggml_compute_forward_mul_mat_use_blas(src0, src1, dst)) {
        GGML_ASSERT(nb10 == sizeof(float));

        if (params->ith != 0) {
            return;
        }

        if (params->type == GGML_TASK_INIT) {
            return;
        }

        if (params->type == GGML_TASK_FINALIZE) {
            return;
        }

        float * const wdata = params->wdata;

        for (int i03 = 0; i03 < ne03; i03++) {
            for (int i02 = 0; i02 < ne02; i02++) {
                {
                    int id = 0;
                    for (int i01 = 0; i01 < ne01; ++i01) {
                        for (int i00 = 0; i00 < ne00; ++i00) {
                            wdata[id++] = GGML_FP16_TO_FP32(*(ggml_fp16_t *) ((char *) src0->data + i03*nb03 + i02*nb02 + i01*nb01 + i00*nb00));
                        }
                    }
                }

                const float * x = wdata;
                const float * y = (float *) ((char *) src1->data + i02*nb12 + i03*nb13);

                //      float * z =                          wdata + ne00*ne01;

                // z = x * yT
                //{
                //    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                //            ne01, ne11, ne00,
                //            1.0f, x, ne00,
                //                  y, ne00,
                //            0.0f, z, ne11);
                //}

                float * d = (float *) ((char *) dst->data + i02*nb2 + i03*nb3);

                // transpose z
                //for (int j = 0; j < ne11; ++j) {
                //    for (int i = 0; i < ne01; ++i) {
                //        d[j*ne01 + i] = z[i*ne11 + j];
                //    }
                //}

                {
#if 1
                    // zT = y * xT
                    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                            ne11, ne01, ne10,
                            1.0f,    y, ne00,
                                     x, ne00,
                            0.0f,    d, ne01);
#else
                    // zT = (xT * y)T
                    cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                            ne01, ne11, ne10,
                            1.0f,    x, ne00,
                                     y, ne00,
                            0.0f,    d, ne01);
#endif
                }
            }
        }

        /*printf("CBLAS F16 = %f ms, %d x %d x %d x %d\n", (ggml_perf_time_us() - t0)/1000.0, ne0, ne1, ne2, ne3);*/

        return;
    }
#endif

    if (params->type == GGML_TASK_INIT) {
        if (nb01 >= nb00) {
            ggml_fp16_t * const wdata = params->wdata;

            int id = 0;
            for (int i13 = 0; i13 < ne13; ++i13) {
                for (int i12 = 0; i12 < ne12; ++i12) {
                    for (int i11 = 0; i11 < ne11; ++i11) {
                        for (int i10 = 0; i10 < ne10; ++i10) {
                            wdata[id++] = GGML_FP32_TO_FP16(*(float *)((char *) src1->data + i13*nb13 + i12*nb12 + i11*nb11 + i10*nb10));
                        }
                    }
                }
            }

            GGML_ASSERT(id*sizeof(ggml_fp16_t) <= params->wsize);

            return;
        }

        // TODO: fix this memset (wsize is overestimated)
        memset(params->wdata, 0, params->wsize);
        return;
    }

    if (params->type == GGML_TASK_FINALIZE) {
        if (nb01 >= nb00) {
            return;
        }

        // TODO: wsize is overestimated, so the exact-size check below stays disabled
        //assert(params->wsize == (ggml_nbytes(dst) + CACHE_LINE_SIZE)*nth);

        ggml_fp16_t * const wdata = params->wdata;

        // dst elements per thread (ceiling division)
        const int dc = (ne + nth - 1)/nth;

        // dst element range for this thread
        const int ic0 = dc*ith;
        const int ic1 = MIN(ic0 + dc, ne);

        for (int i = ic0; i < ic1; ++i) {
            ((float *) dst->data)[i] = GGML_FP16_TO_FP32(wdata[i]);
        }

        for (int k = 1; k < nth; k++) {
            for (int i = ic0; i < ic1; ++i) {
                ((float *) dst->data)[i] += GGML_FP16_TO_FP32(wdata[(ne + CACHE_LINE_SIZE_F32)*k + i]);
            }
        }

        return;
    }

    if (nb01 >= nb00) {
        // src1 is f32 and was converted to fp16 in wdata during INIT,
        // so nb10/2 must equal the fp16 element size
        // NOTE: transposed src1 is not supported
        assert(nb10/2 == sizeof(ggml_fp16_t));

        // parallelize by src0 rows using ggml_vec_dot_f16

        // total rows in src0
        const int nr = ne01*ne02*ne03;

        // rows per thread (ceiling division, so trailing threads may get fewer rows)
        const int dr = (nr + nth - 1)/nth;

        // row range for this thread
        const int ir0 = dr*ith;
        const int ir1 = MIN(ir0 + dr, nr);

        ggml_fp16_t * wdata = params->wdata;

        for (int ir = ir0; ir < ir1; ++ir) {
            // src0 indices
            const int i03 = ir/(ne02*ne01);
            const int i02 = (ir - i03*ne02*ne01)/ne01;
            const int i01 = (ir - i03*ne02*ne01 - i02*ne01);

            const int i13 = i03;
            const int i12 = i02;

            const int i0 = i01;
            const int i2 = i02;
            const int i3 = i03;

            ggml_fp16_t * src0_row = (ggml_fp16_t *) ((char *) src0->data + (i01*nb01 + i02*nb02 + i03*nb03));
            ggml_fp16_t * src1_col =                                wdata + (       0 + i12*ne11 + i13*ne12*ne11)*ne00;

            float * dst_col = (float *) ((char *) dst->data + (i0*nb0 + 0*nb1 + i2*nb2 + i3*nb3));

            assert(ne00 % 32 == 0);

            for (int ic = 0; ic < ne11; ++ic) {
                ggml_vec_dot_f16(ne00, &dst_col[ic*ne0], src0_row, src1_col + ic*ne00);
            }
        }
    } else {
        // parallelize by src1 columns using ggml_vec_mad_f16
        // each thread has its own work data
        // during FINALIZE we accumulate all work data into dst

        // total columns in src1
        const int nc = ne10;

        // columns per thread
        const int dc = (nc + nth - 1)/nth;

        // column range for this thread
        const int ic0 = dc*ith;
        const int ic1 = MIN(ic0 + dc, nc);

        // work data for thread
        const int wo = (ne + CACHE_LINE_SIZE_F32)*ith;
        ggml_fp16_t * const wdata = params->wdata;

        for (int i13 = 0; i13 < ne13; ++i13) {
            for (int i12 = 0; i12 < ne12; ++i12) {
                for (int i11 = 0; i11 < ne11; ++i11) {
                    // dst indices
                    const int i1 = i11;
                    const int i2 = i12;
                    const int i3 = i13;

                    ggml_fp16_t * dst_row = wdata + wo + i3*ne2*ne1*ne0 + i2*ne1*ne0 + i1*ne0;

                    for (int ic = ic0; ic < ic1; ++ic) {
                        // src1 indices
                        const int i10 = ic;

                        // src0 indices
                        const int i03 = i13;
                        const int i02 = i12;
                        const int i00 = ic;

                        assert(sizeof(ggml_fp16_t)*(wo + i3*ne2*ne1*ne0 + i2*ne1*ne0 + i1*ne0 + ne01) <= params->wsize);

                        ggml_fp16_t * src0_col =  (ggml_fp16_t *) ((char *) src0->data + (i00*nb00 + i02*nb02 + i03*nb03));
                        float         src1_val = *      (float *) ((char *) src1->data + (i10*nb10 + i11*nb11 + i12*nb12 + i13*nb13));

                        ggml_vec_mad_f16(ne01, dst_row, src0_col, src1_val);
                    }
                }
            }
        }
    }

    //int64_t t1 = ggml_time_us();
    //static int64_t acc = 0;
    //acc += t1 - t0;
    //if (t1 - t0 > 10) {
    //    printf("\n");
    //    printf("ne00 = %5d, ne01 = %5d, ne02 = %5d, ne03 = %5d\n", ne00, ne01, ne02, ne03);
    //    printf("nb00 = %5d, nb01 = %5d, nb02 = %5d, nb03 = %5d\n", nb00, nb01, nb02, nb03);
    //    printf("ne10 = %5d, ne11 = %5d, ne12 = %5d, ne13 = %5d\n", ne10, ne11, ne12, ne13);

    //    printf("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX task %d/%d: %d us, acc = %d\n", ith, nth, (int) (t1 - t0), (int) acc);
    //}
}

static void ggml_compute_forward_mul_mat_q4_0_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
              struct ggml_tensor * dst) {
    int64_t t0 = ggml_perf_time_us();
    UNUSED(t0);

    const int ne00 = src0->ne[0];
    const int ne01 = src0->ne[1];
    const int ne02 = src0->ne[2];
    const int ne03 = src0->ne[3];

    const int ne10 = src1->ne[0];
    const int ne11 = src1->ne[1];
    const int ne12 = src1->ne[2];
    const int ne13 = src1->ne[3];

    const int ne0  = dst->ne[0];
    const int ne1  = dst->ne[1];
    const int ne2  = dst->ne[2];
    const int ne3  = dst->ne[3];
    const int ne   = ne0*ne1*ne2*ne3;

    const int nb00 = src0->nb[0];
    const int nb01 = src0->nb[1];
    const int nb02 = src0->nb[2];
    const int nb03 = src0->nb[3];

    const int nb10 = src1->nb[0];
    const int nb11 = src1->nb[1];
    const int nb12 = src1->nb[2];
    const int nb13 = src1->nb[3];

    const int nb0  = dst->nb[0];
    const int nb1  = dst->nb[1];
    const int nb2  = dst->nb[2];
    const int nb3  = dst->nb[3];

    const int ith = params->ith;
    const int nth = params->nth;

    GGML_ASSERT(ne02 == ne12);
    GGML_ASSERT(ne03 == ne13);
    GGML_ASSERT(ne2  == ne12);
    GGML_ASSERT(ne3  == ne13);

    // TODO: we don't support permuted src0
    GGML_ASSERT(nb00 == (int) GGML_TYPE_SIZE[GGML_TYPE_Q4_0] || nb01 == (int) GGML_TYPE_SIZE[GGML_TYPE_Q4_0]);

    // dst cannot be transposed or permuted
    GGML_ASSERT(nb0 == sizeof(float));
    GGML_ASSERT(nb0 <= nb1);
    GGML_ASSERT(nb1 <= nb2);
    GGML_ASSERT(nb2 <= nb3);

    GGML_ASSERT(ne0 == ne01);
    GGML_ASSERT(ne1 == ne11);
    GGML_ASSERT(ne2 == ne02);
    GGML_ASSERT(ne3 == ne03);

    // nb01 >= nb00 - src0 is not transposed
    //   compute by src0 rows
    //
    // nb01 <  nb00 - src0 is transposed

SYMBOL INDEX (797 symbols across 6 files)

FILE: Sources/cpp/ggml.c
  type LONG (line 34) | typedef volatile LONG atomic_int;
  type atomic_int (line 35) | typedef atomic_int atomic_bool;
  function atomic_store (line 37) | static void atomic_store(atomic_int* ptr, LONG val) {
  function LONG (line 40) | static LONG atomic_load(atomic_int* ptr) {
  function LONG (line 43) | static LONG atomic_fetch_add(atomic_int* ptr, LONG inc) {
  function LONG (line 46) | static LONG atomic_fetch_sub(atomic_int* ptr, LONG dec) {
  type HANDLE (line 50) | typedef HANDLE pthread_t;
  type DWORD (line 52) | typedef DWORD thread_ret_t;
  function pthread_create (line 53) | static int pthread_create(pthread_t* out, void* unused, thread_ret_t(*fu...
  function pthread_join (line 64) | static int pthread_join(pthread_t thread, void* unused) {
  function sched_yield (line 68) | static int sched_yield (void) {
  type ggml_float (line 126) | typedef double ggml_float;
  function fp32_from_bits (line 169) | static inline float fp32_from_bits(uint32_t w) {
  function fp32_to_bits (line 178) | static inline uint32_t fp32_to_bits(float f) {
  function ggml_compute_fp16_to_fp32 (line 187) | static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
  function ggml_fp16_t (line 210) | static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
  function ggml_lookup_fp16_to_fp32 (line 263) | inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
  function ggml_fp16_to_fp32 (line 276) | float ggml_fp16_to_fp32(ggml_fp16_t x) {
  function ggml_fp16_t (line 280) | ggml_fp16_t ggml_fp32_to_fp16(float x) {
  function ggml_time_init (line 290) | void ggml_time_init(void) {
  function ggml_time_ms (line 295) | int64_t ggml_time_ms(void) {
  function ggml_time_us (line 300) | int64_t ggml_time_us(void) {
  function ggml_time_init (line 306) | void ggml_time_init(void) {}
  function ggml_time_ms (line 307) | int64_t ggml_time_ms(void) {
  function ggml_time_us (line 313) | int64_t ggml_time_us(void) {
  function ggml_cycles (line 320) | int64_t ggml_cycles(void) {
  function ggml_cycles_per_ms (line 324) | int64_t ggml_cycles_per_ms(void) {
  function __m256i (line 367) | inline __m256i bytesFromNibbles( const uint8_t* rsi )
  function __m128i (line 384) | inline __m128i packNibbles( __m256i bytes )
  function quantize_row_q4_0 (line 404) | void quantize_row_q4_0(const float * restrict x, void * restrict y, int ...
  function quantize_row_q4_1 (line 606) | void quantize_row_q4_1(const float * restrict x, void * restrict y, int ...
  function dequantize_row_q4_0 (line 651) | void dequantize_row_q4_0(const void * restrict x, float * restrict y, in...
  function dequantize_row_q4_1 (line 686) | void dequantize_row_q4_1(const void * restrict x, float * restrict y, in...
  function v128_t (line 1040) | inline static v128_t __wasm_f16x4_load(const ggml_fp16_t * p) {
  function __wasm_f16x4_store (line 1051) | inline static void __wasm_f16x4_store(ggml_fp16_t * p, v128_t x) {
  function __m128 (line 1150) | static inline __m128 __sse_f16x4_load(ggml_fp16_t *x) {
  function __sse_f16x4_store (line 1161) | static inline void __sse_f16x4_store(ggml_fp16_t *x, __m128 y) {
  function ggml_vec_set_i8 (line 1205) | inline static void ggml_vec_set_i8(const int n, int8_t * x, const int8_t...
  function ggml_vec_set_i16 (line 1207) | inline static void ggml_vec_set_i16(const int n, int16_t * x, const int1...
  function ggml_vec_set_i32 (line 1209) | inline static void ggml_vec_set_i32(const int n, int32_t * x, const int3...
  function ggml_vec_set_f16 (line 1211) | inline static void ggml_vec_set_f16(const int n, ggml_fp16_t * x, const ...
  function ggml_vec_add_f32 (line 1213) | inline static void ggml_vec_add_f32 (const int n, float * z, const float...
  function ggml_vec_acc_f32 (line 1214) | inline static void ggml_vec_acc_f32 (const int n, float * y, const float...
  function ggml_vec_acc1_f32 (line 1215) | inline static void ggml_vec_acc1_f32(const int n, float * y, const float...
  function ggml_vec_sub_f32 (line 1216) | inline static void ggml_vec_sub_f32 (const int n, float * z, const float...
  function ggml_vec_set_f32 (line 1217) | inline static void ggml_vec_set_f32 (const int n, float * x, const float...
  function ggml_vec_cpy_f32 (line 1218) | inline static void ggml_vec_cpy_f32 (const int n, float * y, const float...
  function ggml_vec_neg_f32 (line 1219) | inline static void ggml_vec_neg_f32 (const int n, float * y, const float...
  function ggml_vec_mul_f32 (line 1220) | inline static void ggml_vec_mul_f32 (const int n, float * z, const float...
  function ggml_vec_div_f32 (line 1221) | inline static void ggml_vec_div_f32 (const int n, float * z, const float...
  function ggml_vec_dot_f32 (line 1223) | inline static void ggml_vec_dot_f32(const int n, float * restrict s, con...
  function ggml_vec_dot_f16 (line 1260) | inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggm...
  function ggml_vec_dot_q4_0 (line 1296) | inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, co...
  function ggml_vec_dot_q4_1 (line 1584) | inline static void ggml_vec_dot_q4_1(const int n, float * restrict s, co...
  function ggml_vec_dot_f16_unroll (line 1630) | inline static void ggml_vec_dot_f16_unroll(const int n, const int xs, fl...
  function ggml_vec_mad_f32 (line 1683) | inline static void ggml_vec_mad_f32(const int n, float * restrict y, con...
  function ggml_vec_mad_f16 (line 1714) | inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * restrict ...
  function ggml_vec_mad_q4_0 (line 1745) | inline static void ggml_vec_mad_q4_0(const int n, float * restrict y, vo...
  function ggml_vec_mad_q4_1 (line 1838) | inline static void ggml_vec_mad_q4_1(const int n, float * restrict y, vo...
  function ggml_vec_scale_f32 (line 1875) | inline static void ggml_vec_scale_f32(const int n, float * y, const floa...
  function ggml_vec_norm_f32 (line 1904) | inline static void ggml_vec_norm_f32 (const int n, float * s, const floa...
  function ggml_vec_sqr_f32 (line 1905) | inline static void ggml_vec_sqr_f32  (const int n, float * y, const floa...
  function ggml_vec_sqrt_f32 (line 1906) | inline static void ggml_vec_sqrt_f32 (const int n, float * y, const floa...
  function ggml_vec_abs_f32 (line 1907) | inline static void ggml_vec_abs_f32  (const int n, float * y, const floa...
  function ggml_vec_sgn_f32 (line 1908) | inline static void ggml_vec_sgn_f32  (const int n, float * y, const floa...
  function ggml_vec_step_f32 (line 1909) | inline static void ggml_vec_step_f32 (const int n, float * y, const floa...
  function ggml_vec_relu_f32 (line 1910) | inline static void ggml_vec_relu_f32 (const int n, float * y, const floa...
  function ggml_gelu_f32 (line 1915) | inline static float ggml_gelu_f32(float x) {
  function ggml_vec_gelu_f16 (line 1919) | inline static void ggml_vec_gelu_f16(const int n, ggml_fp16_t * y, const...
  function ggml_vec_gelu_f32 (line 1927) | inline static void ggml_vec_gelu_f32(const int n, float * y, const float...
  function ggml_vec_gelu_f32 (line 1936) | inline static void ggml_vec_gelu_f32(const int n, float * y, const float...
  function ggml_silu_f32 (line 1944) | inline static float ggml_silu_f32(float x) {
  function ggml_vec_silu_f16 (line 1948) | inline static void ggml_vec_silu_f16(const int n, ggml_fp16_t * y, const...
  function ggml_vec_silu_f32 (line 1956) | inline static void ggml_vec_silu_f32(const int n, float * y, const float...
  function ggml_vec_silu_f32 (line 1965) | inline static void ggml_vec_silu_f32(const int n, float * y, const float...
  function ggml_vec_sum_f32 (line 1972) | inline static void ggml_vec_sum_f32(const int n, float * s, const float ...
  function ggml_vec_max_f32 (line 1984) | inline static void ggml_vec_max_f32(const int n, float * s, const float ...
  function ggml_vec_norm_inv_f32 (line 1996) | inline static void ggml_vec_norm_inv_f32(const int n, float * s, const f...
  type ggml_object (line 2141) | struct ggml_object {
  type ggml_object (line 2150) | struct ggml_object
  type ggml_object (line 2152) | struct ggml_object
  type ggml_tensor (line 2153) | struct ggml_tensor
  type ggml_context (line 2159) | struct ggml_context {
  type ggml_context_container (line 2173) | struct ggml_context_container {
  type ggml_task_type (line 2183) | enum ggml_task_type {
  type ggml_compute_params (line 2189) | struct ggml_compute_params {
  type ggml_state (line 2203) | struct ggml_state {
  type ggml_state (line 2208) | struct ggml_state
  function ggml_critical_section_start (line 2212) | inline static void ggml_critical_section_start(void) {
  function ggml_critical_section_end (line 2225) | inline static void ggml_critical_section_end(void) {
  function ggml_print_object (line 2231) | void ggml_print_object(const struct ggml_object * obj) {
  function ggml_print_objects (line 2236) | void ggml_print_objects(const struct ggml_context * ctx) {
  function ggml_nelements (line 2249) | int ggml_nelements(const struct ggml_tensor * tensor) {
  function ggml_nrows (line 2255) | int ggml_nrows(const struct ggml_tensor * tensor) {
  function ggml_nbytes (line 2261) | size_t ggml_nbytes(const struct ggml_tensor * tensor) {
  function ggml_blck_size (line 2267) | int ggml_blck_size(enum ggml_type type) {
  function ggml_type_size (line 2271) | size_t ggml_type_size(enum ggml_type type) {
  function ggml_type_sizef (line 2275) | float ggml_type_sizef(enum ggml_type type) {
  function ggml_element_size (line 2279) | size_t ggml_element_size(const struct ggml_tensor * tensor) {
  function ggml_is_scalar (line 2283) | static inline bool ggml_is_scalar(const struct ggml_tensor * tensor) {
  function ggml_is_vector (line 2289) | static inline bool ggml_is_vector(const struct ggml_tensor * tensor) {
  function ggml_is_matrix (line 2295) | static inline bool ggml_is_matrix(const struct ggml_tensor * tensor) {
  function ggml_can_mul_mat (line 2301) | static inline bool ggml_can_mul_mat(const struct ggml_tensor * t0, const...
  function ggml_is_contiguous (line 2310) | static inline bool ggml_is_contiguous(const struct ggml_tensor * tensor) {
  function ggml_is_padded_1d (line 2320) | static inline bool ggml_is_padded_1d(const struct ggml_tensor * tensor) {
  function ggml_are_same_shape (line 2329) | static inline bool ggml_are_same_shape(const struct ggml_tensor * t0, co...
  function ggml_can_repeat (line 2340) | static inline bool ggml_can_repeat(const struct ggml_tensor * t0, const ...
  function ggml_up32 (line 2350) | static inline int ggml_up32(int n) {
  function ggml_up64 (line 2354) | static inline int ggml_up64(int n) {
  function ggml_up (line 2358) | static inline int ggml_up(int n, int m) {
  type ggml_context (line 2370) | struct ggml_context
  type ggml_init_params (line 2370) | struct ggml_init_params
  type ggml_state (line 2400) | struct ggml_state
  type ggml_context (line 2417) | struct ggml_context
  type ggml_context (line 2437) | struct ggml_context
  function ggml_free (line 2457) | void ggml_free(struct ggml_context * ctx) {
  function ggml_used_mem (line 2486) | size_t ggml_used_mem(const struct ggml_context * ctx) {
  function ggml_set_scratch (line 2490) | size_t ggml_set_scratch(struct ggml_context * ctx, struct ggml_scratch s...
  type ggml_tensor (line 2500) | struct ggml_tensor
  type ggml_context (line 2501) | struct ggml_context
  type ggml_type (line 2502) | enum   ggml_type
  type ggml_object (line 2507) | struct ggml_object
  type ggml_object (line 2525) | struct ggml_object
  type ggml_object (line 2525) | struct ggml_object
  type ggml_tensor (line 2528) | struct ggml_tensor
  type ggml_object (line 2537) | struct ggml_object
  type ggml_tensor (line 2549) | struct ggml_tensor
  type ggml_tensor (line 2551) | struct ggml_tensor
  type ggml_object (line 2558) | struct ggml_object
  type ggml_tensor (line 2560) | struct ggml_tensor
  type ggml_tensor (line 2580) | struct ggml_tensor
  type ggml_tensor (line 2580) | struct ggml_tensor
  type ggml_tensor (line 2584) | struct ggml_tensor
  type ggml_tensor (line 2620) | struct ggml_tensor
  type ggml_context (line 2621) | struct ggml_context
  type ggml_type (line 2622) | enum   ggml_type
  type ggml_tensor (line 2628) | struct ggml_tensor
  type ggml_context (line 2629) | struct ggml_context
  type ggml_type (line 2630) | enum   ggml_type
  type ggml_tensor (line 2635) | struct ggml_tensor
  type ggml_context (line 2636) | struct ggml_context
  type ggml_type (line 2637) | enum   ggml_type
  type ggml_tensor (line 2644) | struct ggml_tensor
  type ggml_context (line 2645) | struct ggml_context
  type ggml_type (line 2646) | enum   ggml_type
  type ggml_tensor (line 2654) | struct ggml_tensor
  type ggml_context (line 2655) | struct ggml_context
  type ggml_type (line 2656) | enum   ggml_type
  type ggml_tensor (line 2665) | struct ggml_tensor
  type ggml_context (line 2665) | struct ggml_context
  type ggml_tensor (line 2669) | struct ggml_tensor
  type ggml_tensor (line 2678) | struct ggml_tensor
  type ggml_context (line 2678) | struct ggml_context
  type ggml_tensor (line 2682) | struct ggml_tensor
  type ggml_tensor (line 2691) | struct ggml_tensor
  type ggml_context (line 2691) | struct ggml_context
  type ggml_tensor (line 2691) | struct ggml_tensor
  type ggml_tensor (line 2695) | struct ggml_tensor
  type ggml_tensor (line 2695) | struct ggml_tensor
  type ggml_tensor (line 2700) | struct ggml_tensor
  type ggml_tensor (line 2700) | struct ggml_tensor
  type ggml_tensor (line 2760) | struct ggml_tensor
  type ggml_tensor (line 2760) | struct ggml_tensor
  function ggml_get_i32_1d (line 2820) | int32_t ggml_get_i32_1d(const struct ggml_tensor * tensor, int i) {
  function ggml_set_i32_1d (line 2864) | void ggml_set_i32_1d(const struct ggml_tensor * tensor, int i, int32_t v...
  function ggml_get_f32_1d (line 2906) | float ggml_get_f32_1d(const struct ggml_tensor * tensor, int i) {
  function ggml_set_f32_1d (line 2950) | void ggml_set_f32_1d(const struct ggml_tensor * tensor, int i, float val...
  type ggml_tensor (line 2992) | struct ggml_tensor
  type ggml_tensor (line 2996) | struct ggml_tensor
  type ggml_tensor (line 3001) | struct ggml_tensor
  type ggml_context (line 3002) | struct ggml_context
  type ggml_tensor (line 3003) | struct ggml_tensor
  type ggml_tensor (line 3011) | struct ggml_tensor
  type ggml_context (line 3012) | struct ggml_context
  type ggml_tensor (line 3013) | struct ggml_tensor
  type ggml_tensor (line 3021) | struct ggml_tensor
  type ggml_tensor (line 3031) | struct ggml_tensor
  type ggml_context (line 3032) | struct ggml_context
  type ggml_tensor (line 3033) | struct ggml_tensor
  type ggml_tensor (line 3037) | struct ggml_tensor
  type ggml_context (line 3038) | struct ggml_context
  type ggml_tensor (line 3039) | struct ggml_tensor
  type ggml_tensor (line 3045) | struct ggml_tensor
  type ggml_context (line 3046) | struct ggml_context
  type ggml_tensor (line 3047) | struct ggml_tensor
  type ggml_tensor (line 3048) | struct ggml_tensor
  type ggml_tensor (line 3058) | struct ggml_tensor
  type ggml_tensor (line 3068) | struct ggml_tensor
  type ggml_context (line 3069) | struct ggml_context
  type ggml_tensor (line 3070) | struct ggml_tensor
  type ggml_tensor (line 3071) | struct ggml_tensor
  type ggml_tensor (line 3075) | struct ggml_tensor
  type ggml_context (line 3076) | struct ggml_context
  type ggml_tensor (line 3077) | struct ggml_tensor
  type ggml_tensor (line 3078) | struct ggml_tensor
  type ggml_tensor (line 3084) | struct ggml_tensor
  type ggml_context (line 3085) | struct ggml_context
  type ggml_tensor (line 3086) | struct ggml_tensor
  type ggml_tensor (line 3087) | struct ggml_tensor
  type ggml_tensor (line 3097) | struct ggml_tensor
  type ggml_tensor (line 3107) | struct ggml_tensor
  type ggml_context (line 3108) | struct ggml_context
  type ggml_tensor (line 3109) | struct ggml_tensor
  type ggml_tensor (line 3110) | struct ggml_tensor
  type ggml_tensor (line 3114) | struct ggml_tensor
  type ggml_context (line 3115) | struct ggml_context
  type ggml_tensor (line 3116) | struct ggml_tensor
  type ggml_tensor (line 3117) | struct ggml_tensor
  type ggml_tensor (line 3123) | struct ggml_tensor
  type ggml_context (line 3124) | struct ggml_context
  type ggml_tensor (line 3125) | struct ggml_tensor
  type ggml_tensor (line 3126) | struct ggml_tensor
  type ggml_tensor (line 3140) | struct ggml_tensor
  type ggml_tensor (line 3150) | struct ggml_tensor
  type ggml_context (line 3151) | struct ggml_context
  type ggml_tensor (line 3152) | struct ggml_tensor
  type ggml_tensor (line 3153) | struct ggml_tensor
  type ggml_tensor (line 3157) | struct ggml_tensor
  type ggml_context (line 3158) | struct ggml_context
  type ggml_tensor (line 3159) | struct ggml_tensor
  type ggml_tensor (line 3160) | struct ggml_tensor
  type ggml_tensor (line 3166) | struct ggml_tensor
  type ggml_context (line 3167) | struct ggml_context
  type ggml_tensor (line 3168) | struct ggml_tensor
  type ggml_tensor (line 3169) | struct ggml_tensor
  type ggml_tensor (line 3183) | struct ggml_tensor
  type ggml_tensor (line 3193) | struct ggml_tensor
  type ggml_context (line 3194) | struct ggml_context
  type ggml_tensor (line 3195) | struct ggml_tensor
  type ggml_tensor (line 3196) | struct ggml_tensor
  type ggml_tensor (line 3200) | struct ggml_tensor
  type ggml_context (line 3201) | struct ggml_context
  type ggml_tensor (line 3202) | struct ggml_tensor
  type ggml_tensor (line 3203) | struct ggml_tensor
  type ggml_tensor (line 3209) | struct ggml_tensor
  type ggml_context (line 3210) | struct ggml_context
  type ggml_tensor (line 3211) | struct ggml_tensor
  type ggml_tensor (line 3219) | struct ggml_tensor
  type ggml_tensor (line 3229) | struct ggml_tensor
  type ggml_context (line 3230) | struct ggml_context
  type ggml_tensor (line 3231) | struct ggml_tensor
  type ggml_tensor (line 3235) | struct ggml_tensor
  type ggml_context (line 3236) | struct ggml_context
  type ggml_tensor (line 3237) | struct ggml_tensor
  type ggml_tensor (line 3243) | struct ggml_tensor
  type ggml_context (line 3244) | struct ggml_context
  type ggml_tensor (line 3245) | struct ggml_tensor
  type ggml_tensor (line 3253) | struct ggml_tensor
  type ggml_tensor (line 3263) | struct ggml_tensor
  type ggml_context (line 3264) | struct ggml_context
  type ggml_tensor (line 3265) | struct ggml_tensor
  type ggml_tensor (line 3269) | struct ggml_tensor
  type ggml_context (line 3270) | struct ggml_context
  type ggml_tensor (line 3271) | struct ggml_tensor
  type ggml_tensor (line 3277) | struct ggml_tensor
  type ggml_context (line 3278) | struct ggml_context
  type ggml_tensor (line 3279) | struct ggml_tensor
  type ggml_tensor (line 3286) | struct ggml_tensor
  type ggml_tensor (line 3298) | struct ggml_tensor
  type ggml_context (line 3299) | struct ggml_context
  type ggml_tensor (line 3300) | struct ggml_tensor
  type ggml_tensor (line 3309) | struct ggml_tensor
  type ggml_tensor (line 3321) | struct ggml_tensor
  type ggml_context (line 3322) | struct ggml_context
  type ggml_tensor (line 3323) | struct ggml_tensor
  type ggml_tensor (line 3324) | struct ggml_tensor
  type ggml_tensor (line 3337) | struct ggml_tensor
  type ggml_tensor (line 3349) | struct ggml_tensor
  type ggml_context (line 3350) | struct ggml_context
  type ggml_tensor (line 3351) | struct ggml_tensor
  type ggml_tensor (line 3359) | struct ggml_tensor
  type ggml_tensor (line 3369) | struct ggml_tensor
  type ggml_context (line 3370) | struct ggml_context
  type ggml_tensor (line 3371) | struct ggml_tensor
  type ggml_tensor (line 3375) | struct ggml_tensor
  type ggml_context (line 3376) | struct ggml_context
  type ggml_tensor (line 3377) | struct ggml_tensor
  type ggml_tensor (line 3384) | struct ggml_tensor
  type ggml_context (line 3385) | struct ggml_context
  type ggml_tensor (line 3386) | struct ggml_tensor
  type ggml_tensor (line 3394) | struct ggml_tensor
  type ggml_tensor (line 3404) | struct ggml_tensor
  type ggml_context (line 3405) | struct ggml_context
  type ggml_tensor (line 3406) | struct ggml_tensor
  type ggml_tensor (line 3410) | struct ggml_tensor
  type ggml_context (line 3411) | struct ggml_context
  type ggml_tensor (line 3412) | struct ggml_tensor
  type ggml_tensor (line 3418) | struct ggml_tensor
  type ggml_context (line 3419) | struct ggml_context
  type ggml_tensor (line 3420) | struct ggml_tensor
  type ggml_tensor (line 3428) | struct ggml_tensor
  type ggml_tensor (line 3438) | struct ggml_tensor
  type ggml_context (line 3439) | struct ggml_context
  type ggml_tensor (line 3440) | struct ggml_tensor
  type ggml_tensor (line 3444) | struct ggml_tensor
  type ggml_context (line 3445) | struct ggml_context
  type ggml_tensor (line 3446) | struct ggml_tensor
  type ggml_tensor (line 3452) | struct ggml_tensor
  type ggml_context (line 3453) | struct ggml_context
  type ggml_tensor (line 3454) | struct ggml_tensor
  type ggml_tensor (line 3462) | struct ggml_tensor
  type ggml_tensor (line 3472) | struct ggml_tensor
  type ggml_context (line 3473) | struct ggml_context
  type ggml_tensor (line 3474) | struct ggml_tensor
  type ggml_tensor (line 3478) | struct ggml_tensor
  type ggml_context (line 3479) | struct ggml_context
  type ggml_tensor (line 3480) | struct ggml_tensor
  type ggml_tensor (line 3486) | struct ggml_tensor
  type ggml_context (line 3487) | struct ggml_context
  type ggml_tensor (line 3488) | struct ggml_tensor
  type ggml_tensor (line 3496) | struct ggml_tensor
  type ggml_tensor (line 3506) | struct ggml_tensor
  type ggml_context (line 3507) | struct ggml_context
  type ggml_tensor (line 3508) | struct ggml_tensor
  type ggml_tensor (line 3512) | struct ggml_tensor
  type ggml_context (line 3513) | struct ggml_context
  type ggml_tensor (line 3514) | struct ggml_tensor
  type ggml_tensor (line 3520) | struct ggml_tensor
  type ggml_context (line 3521) | struct ggml_context
  type ggml_tensor (line 3522) | struct ggml_tensor
  type ggml_tensor (line 3530) | struct ggml_tensor
  type ggml_tensor (line 3540) | struct ggml_tensor
  type ggml_context (line 3541) | struct ggml_context
  type ggml_tensor (line 3542) | struct ggml_tensor
  type ggml_tensor (line 3546) | struct ggml_tensor
  type ggml_context (line 3547) | struct ggml_context
  type ggml_tensor (line 3548) | struct ggml_tensor
  type ggml_tensor (line 3554) | struct ggml_tensor
  type ggml_context (line 3555) | struct ggml_context
  type ggml_tensor (line 3556) | struct ggml_tensor
  type ggml_tensor (line 3564) | struct ggml_tensor
  type ggml_tensor (line 3574) | struct ggml_tensor
  type ggml_context (line 3575) | struct ggml_context
  type ggml_tensor (line 3576) | struct ggml_tensor
  type ggml_tensor (line 3580) | struct ggml_tensor
  type ggml_context (line 3581) | struct ggml_context
  type ggml_tensor (line 3582) | struct ggml_tensor
  type ggml_tensor (line 3588) | struct ggml_tensor
  type ggml_context (line 3589) | struct ggml_context
  type ggml_tensor (line 3590) | struct ggml_tensor
  type ggml_tensor (line 3599) | struct ggml_tensor
  type ggml_tensor (line 3609) | struct ggml_tensor
  type ggml_context (line 3610) | struct ggml_context
  type ggml_tensor (line 3611) | struct ggml_tensor
  type ggml_tensor (line 3615) | struct ggml_tensor
  type ggml_context (line 3616) | struct ggml_context
  type ggml_tensor (line 3617) | struct ggml_tensor
  type ggml_tensor (line 3623) | struct ggml_tensor
  type ggml_context (line 3624) | struct ggml_context
  type ggml_tensor (line 3625) | struct ggml_tensor
  type ggml_tensor (line 3626) | struct ggml_tensor
  type ggml_tensor (line 3636) | struct ggml_tensor
  type ggml_tensor (line 3648) | struct ggml_tensor
  type ggml_context (line 3649) | struct ggml_context
  type ggml_tensor (line 3650) | struct ggml_tensor
  type ggml_tensor (line 3651) | struct ggml_tensor
  type ggml_tensor (line 3665) | struct ggml_tensor
  type ggml_tensor (line 3675) | struct ggml_tensor
  type ggml_context (line 3676) | struct ggml_context
  type ggml_tensor (line 3677) | struct ggml_tensor
  type ggml_tensor (line 3678) | struct ggml_tensor
  type ggml_tensor (line 3682) | struct ggml_tensor
  type ggml_context (line 3683) | struct ggml_context
  type ggml_tensor (line 3684) | struct ggml_tensor
  type ggml_tensor (line 3685) | struct ggml_tensor
  type ggml_tensor (line 3691) | struct ggml_tensor
  type ggml_context (line 3692) | struct ggml_context
  type ggml_tensor (line 3693) | struct ggml_tensor
  type ggml_tensor (line 3694) | struct ggml_tensor
  type ggml_tensor (line 3706) | struct ggml_tensor
  type ggml_tensor (line 3716) | struct ggml_tensor
  type ggml_context (line 3717) | struct ggml_context
  type ggml_tensor (line 3718) | struct ggml_tensor
  type ggml_tensor (line 3719) | struct ggml_tensor
  type ggml_tensor (line 3723) | struct ggml_tensor
  type ggml_context (line 3724) | struct ggml_context
  type ggml_tensor (line 3725) | struct ggml_tensor
  type ggml_tensor (line 3726) | struct ggml_tensor
  type ggml_tensor (line 3732) | struct ggml_tensor
  type ggml_context (line 3733) | struct ggml_context
  type ggml_tensor (line 3734) | struct ggml_tensor
  type ggml_tensor (line 3735) | struct ggml_tensor
  type ggml_tensor (line 3747) | struct ggml_tensor
  type ggml_tensor (line 3757) | struct ggml_tensor
  type ggml_context (line 3758) | struct ggml_context
  type ggml_tensor (line 3759) | struct ggml_tensor
  type ggml_tensor (line 3773) | struct ggml_tensor
  type ggml_tensor (line 3783) | struct ggml_tensor
  type ggml_context (line 3784) | struct ggml_context
  type ggml_tensor (line 3785) | struct ggml_tensor
  type ggml_tensor (line 3800) | struct ggml_tensor
  type ggml_tensor (line 3812) | struct ggml_tensor
  type ggml_context (line 3813) | struct ggml_context
  type ggml_tensor (line 3814) | struct ggml_tensor
  type ggml_tensor (line 3821) | struct ggml_tensor
  type ggml_tensor (line 3833) | struct ggml_tensor
  type ggml_context (line 3834) | struct ggml_context
  type ggml_tensor (line 3835) | struct ggml_tensor
  type ggml_tensor (line 3846) | struct ggml_tensor
  type ggml_tensor (line 3862) | struct ggml_tensor
  type ggml_context (line 3863) | struct ggml_context
  type ggml_tensor (line 3864) | struct ggml_tensor
  type ggml_tensor (line 3888) | struct ggml_tensor
  type ggml_tensor (line 3923) | struct ggml_tensor
  type ggml_context (line 3924) | struct ggml_context
  type ggml_tensor (line 3925) | struct ggml_tensor
  type ggml_tensor (line 3933) | struct ggml_tensor
  type ggml_tensor (line 3951) | struct ggml_tensor
  type ggml_context (line 3952) | struct ggml_context
  type ggml_tensor (line 3953) | struct ggml_tensor
  type ggml_tensor (line 3954) | struct ggml_tensor
  type ggml_tensor (line 3966) | struct ggml_tensor
  type ggml_tensor (line 3978) | struct ggml_tensor
  type ggml_context (line 3979) | struct ggml_context
  type ggml_tensor (line 3980) | struct ggml_tensor
  type ggml_tensor (line 3991) | struct ggml_tensor
  type ggml_tensor (line 3992) | struct ggml_tensor
  type ggml_tensor (line 4004) | struct ggml_tensor
  type ggml_context (line 4005) | struct ggml_context
  type ggml_tensor (line 4006) | struct ggml_tensor
  type ggml_tensor (line 4016) | struct ggml_tensor
  type ggml_tensor (line 4028) | struct ggml_tensor
  type ggml_context (line 4029) | struct ggml_context
  type ggml_tensor (line 4030) | struct ggml_tensor
  type ggml_tensor (line 4044) | struct ggml_tensor
  type ggml_tensor (line 4046) | struct ggml_tensor
  type ggml_tensor (line 4061) | struct ggml_tensor
  type ggml_context (line 4062) | struct ggml_context
  type ggml_tensor (line 4063) | struct ggml_tensor
  type ggml_tensor (line 4064) | struct ggml_tensor
  type ggml_tensor (line 4076) | struct ggml_tensor
  type ggml_tensor (line 4088) | struct ggml_tensor
  type ggml_context (line 4089) | struct ggml_context
  type ggml_tensor (line 4090) | struct ggml_tensor
  type ggml_tensor (line 4091) | struct ggml_tensor
  type ggml_tensor (line 4103) | struct ggml_tensor
  type ggml_tensor (line 4115) | struct ggml_tensor
  type ggml_context (line 4116) | struct ggml_context
  type ggml_tensor (line 4117) | struct ggml_tensor
  type ggml_tensor (line 4118) | struct ggml_tensor
  type ggml_tensor (line 4119) | struct ggml_tensor
  type ggml_tensor (line 4132) | struct ggml_tensor
  type ggml_tensor (line 4146) | struct ggml_tensor
  type ggml_context (line 4147) | struct ggml_context
  type ggml_tensor (line 4148) | struct ggml_tensor
  type ggml_tensor (line 4149) | struct ggml_tensor
  type ggml_tensor (line 4150) | struct ggml_tensor
  type ggml_tensor (line 4151) | struct ggml_tensor
  type ggml_tensor (line 4152) | struct ggml_tensor
  type ggml_tensor (line 4164) | struct ggml_tensor
  function ggml_set_param (line 4179) | void ggml_set_param(
  function ggml_compute_forward_dup_f16 (line 4190) | static void ggml_compute_forward_dup_f16(
  function ggml_compute_forward_dup_f32 (line 4294) | static void ggml_compute_forward_dup_f32(
  function ggml_compute_forward_dup (line 4398) | static void ggml_compute_forward_dup(
  function ggml_compute_forward_add_f32 (line 4425) | static void ggml_compute_forward_add_f32(
  function ggml_compute_forward_add (line 4478) | static void ggml_compute_forward_add(
  function ggml_compute_forward_sub_f32 (line 4503) | static void ggml_compute_forward_sub_f32(
  function ggml_compute_forward_sub (line 4530) | static void ggml_compute_forward_sub(
  function ggml_compute_forward_mul_f32 (line 4555) | static void ggml_compute_forward_mul_f32(
  function ggml_compute_forward_mul (line 4582) | static void ggml_compute_forward_mul(
  function ggml_compute_forward_div_f32 (line 4607) | static void ggml_compute_forward_div_f32(
  function ggml_compute_forward_div (line 4634) | static void ggml_compute_forward_div(
  function ggml_compute_forward_sqr_f32 (line 4659) | static void ggml_compute_forward_sqr_f32(
  function ggml_compute_forward_sqr (line 4683) | static void ggml_compute_forward_sqr(
  function ggml_compute_forward_sqrt_f32 (line 4707) | static void ggml_compute_forward_sqrt_f32(
  function ggml_compute_forward_sqrt (line 4731) | static void ggml_compute_forward_sqrt(
  function ggml_compute_forward_sum_f32 (line 4755) | static void ggml_compute_forward_sum_f32(
  function ggml_compute_forward_sum (line 4789) | static void ggml_compute_forward_sum(
  function ggml_compute_forward_mean_f32 (line 4813) | static void ggml_compute_forward_mean_f32(
  function ggml_compute_forward_mean (line 4866) | static void ggml_compute_forward_mean(
  function ggml_compute_forward_repeat_f32 (line 4890) | static void ggml_compute_forward_repeat_f32(
  function ggml_compute_forward_repeat (line 4930) | static void ggml_compute_forward_repeat(
  function ggml_compute_forward_abs_f32 (line 4954) | static void ggml_compute_forward_abs_f32(
  function ggml_compute_forward_abs (line 4978) | static void ggml_compute_forward_abs(
  function ggml_compute_forward_sgn_f32 (line 5002) | static void ggml_compute_forward_sgn_f32(
  function ggml_compute_forward_sgn (line 5026) | static void ggml_compute_forward_sgn(
  function ggml_compute_forward_neg_f32 (line 5050) | static void ggml_compute_forward_neg_f32(
  function ggml_compute_forward_neg (line 5074) | static void ggml_compute_forward_neg(
  function ggml_compute_forward_step_f32 (line 5098) | static void ggml_compute_forward_step_f32(
  function ggml_compute_forward_step (line 5122) | static void ggml_compute_forward_step(
  function ggml_compute_forward_relu_f32 (line 5146) | static void ggml_compute_forward_relu_f32(
  function ggml_compute_forward_relu (line 5170) | static void ggml_compute_forward_relu(
  function ggml_compute_forward_gelu_f32 (line 5194) | static void ggml_compute_forward_gelu_f32(
  function ggml_compute_forward_gelu (line 5235) | static void ggml_compute_forward_gelu(
  function ggml_compute_forward_silu_f32 (line 5261) | static void ggml_compute_forward_silu_f32(
  function ggml_compute_forward_silu (line 5302) | static void ggml_compute_forward_silu(
  function ggml_compute_forward_norm_f32 (line 5327) | static void ggml_compute_forward_norm_f32(
  function ggml_compute_forward_norm (line 5387) | static void ggml_compute_forward_norm(
  function ggml_compute_forward_mul_mat_use_blas (line 5414) | static bool ggml_compute_forward_mul_mat_use_blas(
  function ggml_compute_forward_mul_mat_f32 (line 5436) | static void ggml_compute_forward_mul_mat_f32(
  function ggml_compute_forward_mul_mat_f16_f32 (line 5681) | static void ggml_compute_forward_mul_mat_f16_f32(
  function ggml_compute_forward_mul_mat_q4_0_f32 (line 5987) | static void ggml_compute_forward_mul_mat_q4_0_f32(
  function ggml_compute_forward_mul_mat_q4_1_f32 (line 6287) | static void ggml_compute_forward_mul_mat_q4_1_f32(
  function ggml_compute_forward_mul_mat (line 6587) | static void ggml_compute_forward_mul_mat(
  function ggml_compute_forward_scale_f32 (line 6649) | static void ggml_compute_forward_scale_f32(
  function ggml_compute_forward_scale (line 6684) | static void ggml_compute_forward_scale(
  function ggml_compute_forward_cpy (line 6709) | static void ggml_compute_forward_cpy(
  function ggml_compute_forward_reshape (line 6718) | static void ggml_compute_forward_reshape(
  function ggml_compute_forward_view (line 6730) | static void ggml_compute_forward_view(
  function ggml_compute_forward_permute (line 6740) | static void ggml_compute_forward_permute(
  function ggml_compute_forward_transpose (line 6750) | static void ggml_compute_forward_transpose(
  function ggml_compute_forward_get_rows_q4_0 (line 6760) | static void ggml_compute_forward_get_rows_q4_0(
  function ggml_compute_forward_get_rows_q4_1 (line 6787) | static void ggml_compute_forward_get_rows_q4_1(
  function ggml_compute_forward_get_rows_f16 (line 6814) | static void ggml_compute_forward_get_rows_f16(
  function ggml_compute_forward_get_rows_f32 (line 6842) | static void ggml_compute_forward_get_rows_f32(
  function ggml_compute_forward_get_rows (line 6869) | static void ggml_compute_forward_get_rows(
  function ggml_compute_forward_diag_mask_inf_f32 (line 6921) | static void ggml_compute_forward_diag_mask_inf_f32(
  function ggml_compute_forward_diag_mask_inf (line 6957) | static void ggml_compute_forward_diag_mask_inf(
  function ggml_compute_forward_soft_max_f32 (line 6982) | static void ggml_compute_forward_soft_max_f32(
  function ggml_compute_forward_soft_max (line 7052) | static void ggml_compute_forward_soft_max(
  function ggml_compute_forward_rope_f32 (line 7076) | static void ggml_compute_forward_rope_f32(
  function ggml_compute_forward_rope_f16 (line 7133) | static void ggml_compute_forward_rope_f16(
  function ggml_compute_forward_rope (line 7189) | static void ggml_compute_forward_rope(
  function ggml_compute_forward_conv_1d_1s_f16_f32 (line 7217) | static void ggml_compute_forward_conv_1d_1s_f16_f32(
  function ggml_compute_forward_conv_1d_1s_f32 (line 7337) | static void ggml_compute_forward_conv_1d_1s_f32(
  function ggml_compute_forward_conv_1d_1s (line 7457) | static void ggml_compute_forward_conv_1d_1s(
  function ggml_compute_forward_conv_1d_2s_f16_f32 (line 7485) | static void ggml_compute_forward_conv_1d_2s_f16_f32(
  function ggml_compute_forward_conv_1d_2s_f32 (line 7605) | static void ggml_compute_forward_conv_1d_2s_f32(
  function ggml_compute_forward_conv_1d_2s (line 7725) | static void ggml_compute_forward_conv_1d_2s(
  function ggml_compute_forward_flash_attn_f32 (line 7753) | static void ggml_compute_forward_flash_attn_f32(
  function ggml_compute_forward_flash_attn_f16 (line 7962) | static void ggml_compute_forward_flash_attn_f16(
  function ggml_compute_forward_flash_attn (line 8208) | static void ggml_compute_forward_flash_attn(
  function ggml_compute_forward_flash_ff_f16 (line 8238) | static void ggml_compute_forward_flash_ff_f16(
  function ggml_compute_forward_flash_ff (line 8418) | static void ggml_compute_forward_flash_ff(
  function ggml_compute_forward (line 8449) | static void ggml_compute_forward(struct ggml_compute_params * params, st...
  function ggml_compute_backward (line 8601) | static void ggml_compute_backward(struct ggml_context * ctx, struct ggml...
  function ggml_visit_parents (line 8849) | static void ggml_visit_parents(struct ggml_cgraph * cgraph, struct ggml_...
  function ggml_build_forward_impl (line 8900) | static void ggml_build_forward_impl(struct ggml_cgraph * cgraph, struct ...
  function ggml_build_forward_expand (line 8920) | void ggml_build_forward_expand(struct ggml_cgraph * cgraph, struct ggml_...
  function ggml_build_forward (line 8924) | struct ggml_cgraph ggml_build_forward(struct ggml_tensor * tensor) {
  function ggml_build_backward (line 8944) | struct ggml_cgraph ggml_build_backward(struct ggml_context * ctx, struct...
  type ggml_lock_t (line 9002) | typedef int ggml_lock_t;
  type ggml_thread_t (line 9011) | typedef pthread_t ggml_thread_t;
  type ggml_lock_t (line 9025) | typedef int ggml_lock_t;
  type ggml_thread_t (line 9034) | typedef pthread_t ggml_thread_t;
  type ggml_compute_state_shared (line 9041) | struct ggml_compute_state_shared {
  type ggml_compute_state (line 9052) | struct ggml_compute_state {
  function ggml_graph_compute_thread (line 9061) | static thread_ret_t ggml_graph_compute_thread(void * data) {
  function ggml_graph_compute (line 9109) | void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * ...
  function ggml_graph_reset (line 9557) | void ggml_graph_reset(struct ggml_cgraph * cgraph) {
  function ggml_graph_print (line 9567) | void ggml_graph_print(const struct ggml_cgraph * cgraph) {
  function ggml_graph_find (line 9609) | static bool ggml_graph_find(const struct ggml_cgraph * cgraph, const str...
  type ggml_tensor (line 9623) | struct ggml_tensor
  type ggml_cgraph (line 9623) | struct ggml_cgraph
  type ggml_tensor (line 9623) | struct ggml_tensor
  type ggml_tensor (line 9625) | struct ggml_tensor
  function ggml_graph_dump_dot (line 9635) | void ggml_graph_dump_dot(const struct ggml_cgraph * gb, const struct ggm...
  function ggml_opt_set_params (line 9752) | static void ggml_opt_set_params(int np, struct ggml_tensor * const ps[],...
  function ggml_opt_get_params (line 9763) | static void ggml_opt_get_params(int np, struct ggml_tensor * const ps[],...
  function ggml_opt_get_grad (line 9774) | static void ggml_opt_get_grad(int np, struct ggml_tensor * const ps[], f...
  function ggml_opt_adam (line 9791) | static enum ggml_opt_result ggml_opt_adam(
  type ggml_lbfgs_iteration_data (line 9971) | struct ggml_lbfgs_iteration_data {
  function linesearch_backtracking (line 9978) | static enum ggml_opt_result linesearch_backtracking(
  function ggml_opt_lbfgs (line 10084) | static enum ggml_opt_result ggml_opt_lbfgs(
  function ggml_opt_default_params (line 10307) | struct ggml_opt_params ggml_opt_default_params(enum ggml_opt_type type) {
  function ggml_opt (line 10368) | enum ggml_opt_result ggml_opt(
  function ggml_cpu_has_avx (line 10423) | int ggml_cpu_has_avx(void) {
  function ggml_cpu_has_avx2 (line 10431) | int ggml_cpu_has_avx2(void) {
  function ggml_cpu_has_avx512 (line 10439) | int ggml_cpu_has_avx512(void) {
  function ggml_cpu_has_fma (line 10447) | int ggml_cpu_has_fma(void) {
  function ggml_cpu_has_neon (line 10455) | int ggml_cpu_has_neon(void) {
  function ggml_cpu_has_arm_fma (line 10463) | int ggml_cpu_has_arm_fma(void) {
  function ggml_cpu_has_f16c (line 10471) | int ggml_cpu_has_f16c(void) {
  function ggml_cpu_has_fp16_va (line 10479) | int ggml_cpu_has_fp16_va(void) {
  function ggml_cpu_has_wasm_simd (line 10487) | int ggml_cpu_has_wasm_simd(void) {
  function ggml_cpu_has_blas (line 10495) | int ggml_cpu_has_blas(void) {
  function ggml_cpu_has_sse3 (line 10503) | int ggml_cpu_has_sse3(void) {
  function ggml_cpu_has_vsx (line 10511) | int ggml_cpu_has_vsx(void) {

FILE: Sources/cpp/ggml.h
  type ggml_fp16_t (line 188) | typedef __fp16 ggml_fp16_t;
  type ggml_fp16_t (line 190) | typedef uint16_t ggml_fp16_t;
  type ggml_object (line 197) | struct ggml_object
  type ggml_context (line 198) | struct ggml_context
  type ggml_type (line 200) | enum ggml_type {
  type ggml_op (line 212) | enum ggml_op {
  type ggml_tensor (line 256) | struct ggml_tensor {
  type ggml_cgraph (line 289) | struct ggml_cgraph {
  type ggml_scratch (line 308) | struct ggml_scratch {
  type ggml_init_params (line 314) | struct ggml_init_params {
  type ggml_object (line 326) | struct ggml_object
  type ggml_context (line 327) | struct ggml_context
  type ggml_tensor (line 329) | struct ggml_tensor
  type ggml_tensor (line 330) | struct ggml_tensor
  type ggml_type (line 332) | enum ggml_type
  type ggml_type (line 333) | enum ggml_type
  type ggml_type (line 334) | enum ggml_type
  type ggml_tensor (line 336) | struct ggml_tensor
  type ggml_context (line 338) | struct ggml_context
  type ggml_init_params (line 338) | struct ggml_init_params
  type ggml_context (line 339) | struct ggml_context
  type ggml_context (line 341) | struct ggml_context
  type ggml_context (line 343) | struct ggml_context
  type ggml_scratch (line 343) | struct ggml_scratch
  type ggml_tensor (line 345) | struct ggml_tensor
  type ggml_context (line 346) | struct ggml_context
  type ggml_type (line 347) | enum   ggml_type
  type ggml_tensor (line 351) | struct ggml_tensor
  type ggml_context (line 352) | struct ggml_context
  type ggml_type (line 353) | enum   ggml_type
  type ggml_tensor (line 356) | struct ggml_tensor
  type ggml_context (line 357) | struct ggml_context
  type ggml_type (line 358) | enum   ggml_type
  type ggml_tensor (line 362) | struct ggml_tensor
  type ggml_context (line 363) | struct ggml_context
  type ggml_type (line 364) | enum   ggml_type
  type ggml_tensor (line 369) | struct ggml_tensor
  type ggml_context (line 370) | struct ggml_context
  type ggml_type (line 371) | enum   ggml_type
  type ggml_tensor (line 377) | struct ggml_tensor
  type ggml_context (line 377) | struct ggml_context
  type ggml_tensor (line 378) | struct ggml_tensor
  type ggml_context (line 378) | struct ggml_context
  type ggml_tensor (line 380) | struct ggml_tensor
  type ggml_context (line 380) | struct ggml_context
  type ggml_tensor (line 380) | struct ggml_tensor
  type ggml_tensor (line 381) | struct ggml_tensor
  type ggml_context (line 381) | struct ggml_context
  type ggml_tensor (line 381) | struct ggml_tensor
  type ggml_tensor (line 383) | struct ggml_tensor
  type ggml_tensor (line 383) | struct ggml_tensor
  type ggml_tensor (line 384) | struct ggml_tensor
  type ggml_tensor (line 384) | struct ggml_tensor
  type ggml_tensor (line 385) | struct ggml_tensor
  type ggml_tensor (line 385) | struct ggml_tensor
  type ggml_tensor (line 387) | struct ggml_tensor
  type ggml_tensor (line 388) | struct ggml_tensor
  type ggml_tensor (line 390) | struct ggml_tensor
  type ggml_tensor (line 391) | struct ggml_tensor
  type ggml_tensor (line 393) | struct ggml_tensor
  type ggml_tensor (line 394) | struct ggml_tensor
  type ggml_tensor (line 400) | struct ggml_tensor
  type ggml_context (line 401) | struct ggml_context
  type ggml_tensor (line 402) | struct ggml_tensor
  type ggml_tensor (line 404) | struct ggml_tensor
  type ggml_context (line 405) | struct ggml_context
  type ggml_tensor (line 406) | struct ggml_tensor
  type ggml_tensor (line 407) | struct ggml_tensor
  type ggml_tensor (line 409) | struct ggml_tensor
  type ggml_context (line 410) | struct ggml_context
  type ggml_tensor (line 411) | struct ggml_tensor
  type ggml_tensor (line 412) | struct ggml_tensor
  type ggml_tensor (line 414) | struct ggml_tensor
  type ggml_context (line 415) | struct ggml_context
  type ggml_tensor (line 416) | struct ggml_tensor
  type ggml_tensor (line 417) | struct ggml_tensor
  type ggml_tensor (line 419) | struct ggml_tensor
  type ggml_context (line 420) | struct ggml_context
  type ggml_tensor (line 421) | struct ggml_tensor
  type ggml_tensor (line 422) | struct ggml_tensor
  type ggml_tensor (line 424) | struct ggml_tensor
  type ggml_context (line 425) | struct ggml_context
  type ggml_tensor (line 426) | struct ggml_tensor
  type ggml_tensor (line 428) | struct ggml_tensor
  type ggml_context (line 429) | struct ggml_context
  type ggml_tensor (line 430) | struct ggml_tensor
  type ggml_tensor (line 434) | struct ggml_tensor
  type ggml_context (line 435) | struct ggml_context
  type ggml_tensor (line 436) | struct ggml_tensor
  type ggml_tensor (line 439) | struct ggml_tensor
  type ggml_context (line 440) | struct ggml_context
  type ggml_tensor (line 441) | struct ggml_tensor
  type ggml_tensor (line 445) | struct ggml_tensor
  type ggml_context (line 446) | struct ggml_context
  type ggml_tensor (line 447) | struct ggml_tensor
  type ggml_tensor (line 448) | struct ggml_tensor
  type ggml_tensor (line 450) | struct ggml_tensor
  type ggml_context (line 451) | struct ggml_context
  type ggml_tensor (line 452) | struct ggml_tensor
  type ggml_tensor (line 454) | struct ggml_tensor
  type ggml_context (line 455) | struct ggml_context
  type ggml_tensor (line 456) | struct ggml_tensor
  type ggml_tensor (line 458) | struct ggml_tensor
  type ggml_context (line 459) | struct ggml_context
  type ggml_tensor (line 460) | struct ggml_tensor
  type ggml_tensor (line 462) | struct ggml_tensor
  type ggml_context (line 463) | struct ggml_context
  type ggml_tensor (line 464) | struct ggml_tensor
  type ggml_tensor (line 466) | struct ggml_tensor
  type ggml_context (line 467) | struct ggml_context
  type ggml_tensor (line 468) | struct ggml_tensor
  type ggml_tensor (line 471) | struct ggml_tensor
  type ggml_context (line 472) | struct ggml_context
  type ggml_tensor (line 473) | struct ggml_tensor
  type ggml_tensor (line 475) | struct ggml_tensor
  type ggml_context (line 476) | struct ggml_context
  type ggml_tensor (line 477) | struct ggml_tensor
  type ggml_tensor (line 481) | struct ggml_tensor
  type ggml_context (line 482) | struct ggml_context
  type ggml_tensor (line 483) | struct ggml_tensor
  type ggml_tensor (line 488) | struct ggml_tensor
  type ggml_context (line 489) | struct ggml_context
  type ggml_tensor (line 490) | struct ggml_tensor
  type ggml_tensor (line 491) | struct ggml_tensor
  type ggml_tensor (line 498) | struct ggml_tensor
  type ggml_context (line 499) | struct ggml_context
  type ggml_tensor (line 500) | struct ggml_tensor
  type ggml_tensor (line 501) | struct ggml_tensor
  type ggml_tensor (line 504) | struct ggml_tensor
  type ggml_context (line 505) | struct ggml_context
  type ggml_tensor (line 506) | struct ggml_tensor
  type ggml_tensor (line 507) | struct ggml_tensor
  type ggml_tensor (line 511) | struct ggml_tensor
  type ggml_context (line 512) | struct ggml_context
  type ggml_tensor (line 513) | struct ggml_tensor
  type ggml_tensor (line 514) | struct ggml_tensor
  type ggml_tensor (line 518) | struct ggml_tensor
  type ggml_context (line 519) | struct ggml_context
  type ggml_tensor (line 520) | struct ggml_tensor
  type ggml_tensor (line 526) | struct ggml_tensor
  type ggml_context (line 527) | struct ggml_context
  type ggml_tensor (line 528) | struct ggml_tensor
  type ggml_tensor (line 534) | struct ggml_tensor
  type ggml_context (line 535) | struct ggml_context
  type ggml_tensor (line 536) | struct ggml_tensor
  type ggml_tensor (line 540) | struct ggml_tensor
  type ggml_context (line 541) | struct ggml_context
  type ggml_tensor (line 542) | struct ggml_tensor
  type ggml_tensor (line 548) | struct ggml_tensor
  type ggml_context (line 549) | struct ggml_context
  type ggml_tensor (line 550) | struct ggml_tensor
  type ggml_tensor (line 557) | struct ggml_tensor
  type ggml_context (line 558) | struct ggml_context
  type ggml_tensor (line 559) | struct ggml_tensor
  type ggml_tensor (line 561) | struct ggml_tensor
  type ggml_context (line 562) | struct ggml_context
  type ggml_tensor (line 563) | struct ggml_tensor
  type ggml_tensor (line 564) | struct ggml_tensor
  type ggml_tensor (line 568) | struct ggml_tensor
  type ggml_context (line 569) | struct ggml_context
  type ggml_tensor (line 570) | struct ggml_tensor
  type ggml_tensor (line 574) | struct ggml_tensor
  type ggml_context (line 575) | struct ggml_context
  type ggml_tensor (line 576) | struct ggml_tensor
  type ggml_tensor (line 582) | struct ggml_tensor
  type ggml_context (line 583) | struct ggml_context
  type ggml_tensor (line 584) | struct ggml_tensor
  type ggml_tensor (line 593) | struct ggml_tensor
  type ggml_context (line 594) | struct ggml_context
  type ggml_tensor (line 595) | struct ggml_tensor
  type ggml_tensor (line 596) | struct ggml_tensor
  type ggml_tensor (line 598) | struct ggml_tensor
  type ggml_context (line 599) | struct ggml_context
  type ggml_tensor (line 600) | struct ggml_tensor
  type ggml_tensor (line 601) | struct ggml_tensor
  type ggml_tensor (line 603) | struct ggml_tensor
  type ggml_context (line 604) | struct ggml_context
  type ggml_tensor (line 605) | struct ggml_tensor
  type ggml_tensor (line 606) | struct ggml_tensor
  type ggml_tensor (line 607) | struct ggml_tensor
  type ggml_tensor (line 610) | struct ggml_tensor
  type ggml_context (line 611) | struct ggml_context
  type ggml_tensor (line 612) | struct ggml_tensor
  type ggml_tensor (line 613) | struct ggml_tensor
  type ggml_tensor (line 614) | struct ggml_tensor
  type ggml_tensor (line 615) | struct ggml_tensor
  type ggml_tensor (line 616) | struct ggml_tensor
  type ggml_context (line 623) | struct ggml_context
  type ggml_tensor (line 624) | struct ggml_tensor
  type ggml_cgraph (line 626) | struct ggml_cgraph
  type ggml_tensor (line 626) | struct ggml_tensor
  type ggml_cgraph (line 628) | struct ggml_cgraph
  type ggml_tensor (line 628) | struct ggml_tensor
  type ggml_cgraph (line 629) | struct ggml_cgraph
  type ggml_context (line 629) | struct ggml_context
  type ggml_cgraph (line 629) | struct ggml_cgraph
  type ggml_context (line 631) | struct ggml_context
  type ggml_cgraph (line 631) | struct ggml_cgraph
  type ggml_cgraph (line 632) | struct ggml_cgraph
  type ggml_cgraph (line 635) | struct ggml_cgraph
  type ggml_cgraph (line 638) | struct ggml_cgraph
  type ggml_cgraph (line 638) | struct ggml_cgraph
  type ggml_opt_type (line 645) | enum ggml_opt_type {
  type ggml_linesearch (line 651) | enum ggml_linesearch {
  type ggml_opt_result (line 660) | enum ggml_opt_result {
  type ggml_opt_params (line 678) | struct ggml_opt_params {
  type ggml_opt_params (line 731) | struct ggml_opt_params
  type ggml_opt_type (line 731) | enum ggml_opt_type
  type ggml_opt_result (line 734) | enum ggml_opt_result
  type ggml_context (line 735) | struct ggml_context
  type ggml_opt_params (line 736) | struct ggml_opt_params
  type ggml_tensor (line 737) | struct ggml_tensor

FILE: Sources/cpp/quantize.cpp
  type llama_hparams (line 19) | struct llama_hparams {
  function llama_model_quantize (line 32) | bool llama_model_quantize(const std::string & fname_inp, const std::stri...
  function main (line 291) | int main(int argc, char ** argv) {

FILE: Sources/cpp/utils.cpp
  function gpt_params_parse (line 18) | bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
  function gpt_print_usage (line 74) | void gpt_print_usage(int argc, char ** argv, const gpt_params & params) {
  function gpt_random_prompt (line 102) | std::string gpt_random_prompt(std::mt19937 & rng) {
  function replace (line 121) | void replace(std::string & str, const std::string & needle, const std::s...
  function json_parse (line 129) | std::map<std::string, int32_t> json_parse(const std::string & fname) {
  function gpt_tokenize (line 220) | std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const s...
  function llama_tokenize (line 275) | std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, const...
  function gpt_vocab_init (line 313) | bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab) {
  function sample_top_k (line 333) | void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits...
  function llama_sample_top_p_top_k (line 345) | gpt_vocab::id llama_sample_top_p_top_k(
  function ggml_quantize_q4_0 (line 431) | size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk,...
  function ggml_quantize_q4_1 (line 487) | size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk,...

FILE: Sources/llamaObjCxx/bridge/LlamaPredictOperation.hh
  class _LlamaEvent (line 11) | class _LlamaEvent

FILE: tools/convert-pth-to-ggml.py
  function get_n_parts (line 39) | def get_n_parts(dim):

Condensed preview: 33 files, each showing path, character count, and a content snippet (full structured content: 499K chars).

[
  {
    "path": ".github/workflows/build.yml",
    "chars": 710,
    "preview": "name: CI\n\non:\n  workflow_dispatch: # allows manual triggering\n    inputs:\n      create_release:\n        description: \"Cr"
  },
  {
    "path": ".gitignore",
    "chars": 2392,
    "preview": "*.o\n*.a\n.cache/\n.vs/\n.vscode/\n.DS_Store\n\nbuild/\nbuild-em/\nbuild-debug/\nbuild-release/\nbuild-static/\nbuild-no-accel/\nbuil"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 5218,
    "preview": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participa"
  },
  {
    "path": "LICENSE",
    "chars": 1098,
    "preview": "MIT License\n\nCopyright (c) 2023 Georgi Gerganov, Alex Rozanski and others\n\nPermission is hereby granted, free of charge,"
  },
  {
    "path": "Package.swift",
    "chars": 660,
    "preview": "// swift-tools-version:5.5\n\nimport PackageDescription\n\nlet package = Package(\n  name: \"llama.swift\",\n  platforms: [\n    "
  },
  {
    "path": "README.md",
    "chars": 5470,
    "preview": "# 🦙 llama.swift\n\n[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MI"
  },
  {
    "path": "Sources/cpp/ggml.c",
    "chars": 324639,
    "preview": "#include \"ggml.h\"\n\n#if defined(_MSC_VER) || defined(__MINGW32__)\n#include <malloc.h> // using malloc.h with MSC/MINGW\n#e"
  },
  {
    "path": "Sources/cpp/ggml.h",
    "chars": 22025,
    "preview": "#pragma once\n\n//\n// GGML Tensor Library\n//\n// This documentation is still a work in progress.\n// If you wish some specif"
  },
  {
    "path": "Sources/cpp/quantize.cpp",
    "chars": 11425,
    "preview": "#include \"ggml.h\"\n\n#include \"utils.h\"\n\n#include <cassert>\n#include <cmath>\n#include <cstdio>\n#include <cstring>\n#include"
  },
  {
    "path": "Sources/cpp/utils.cpp",
    "chars": 18000,
    "preview": "#include \"utils.h\"\n\n#include <cassert>\n#include <cstring>\n#include <fstream>\n#include <regex>\n#include <iostream>\n#inclu"
  },
  {
    "path": "Sources/cpp/utils.h",
    "chars": 3195,
    "preview": "// Various helper functions and utilities\n\n#pragma once\n\n#include <string>\n#include <map>\n#include <vector>\n#include <ra"
  },
  {
    "path": "Sources/llama/LlamaRunner.swift",
    "chars": 3254,
    "preview": "//\n//  LlamaRunner.swift\n//  llama\n//\n//  Created by Alex Rozanski on 12/03/2023.\n//\n\nimport Foundation\nimport llamaObjC"
  },
  {
    "path": "Sources/llamaObjCxx/LlamaError.m",
    "chars": 173,
    "preview": "//\n//  LlamaError.m\n//  llama\n//\n//  Created by Alex Rozanski on 14/03/2023.\n//\n\n#import \"LlamaError.h\"\n\nNSString *const"
  },
  {
    "path": "Sources/llamaObjCxx/bridge/LlamaEvent.mm",
    "chars": 3014,
    "preview": "//\n//  LlamaEvent.mm\n//  llama\n//\n//  Created by Alex Rozanski on 14/03/2023.\n//\n\n#include \"LlamaEvent.h\"\n\ntypedef NS_EN"
  },
  {
    "path": "Sources/llamaObjCxx/bridge/LlamaPredictOperation.hh",
    "chars": 538,
    "preview": "//\n//  LlamaPredictOperation.h\n//  llama\n//\n//  Created by Alex Rozanski on 13/03/2023.\n//\n\n#import <Foundation/NSOperat"
  },
  {
    "path": "Sources/llamaObjCxx/bridge/LlamaPredictOperation.mm",
    "chars": 29035,
    "preview": "//\n//  LlamaPredictOperation.m\n//  llama\n//\n//  Created by Alex Rozanski on 13/03/2023.\n//\n\n#import \"LlamaPredictOperati"
  },
  {
    "path": "Sources/llamaObjCxx/bridge/LlamaRunnerBridge.mm",
    "chars": 1585,
    "preview": "//\n//  LlamaRunnerBridge.mm\n//  llama\n//\n//  Created by Alex Rozanski on 12/03/2023.\n//\n\n#import \"LlamaRunnerBridge.h\"\n#"
  },
  {
    "path": "Sources/llamaObjCxx/bridge/LlamaRunnerBridgeConfig.m",
    "chars": 317,
    "preview": "//\n//  LlamaRunnerBridgeConfig.m\n//  llama\n//\n//  Created by Alex Rozanski on 13/03/2023.\n//\n\n#import \"LlamaRunnerBridge"
  },
  {
    "path": "Sources/llamaObjCxx/headers/LlamaError.h",
    "chars": 370,
    "preview": "//\n//  LlamaError.h\n//  llama\n//\n//  Created by Alex Rozanski on 14/03/2023.\n//\n\n#import <Foundation/Foundation.h>\n\nNS_A"
  },
  {
    "path": "Sources/llamaObjCxx/headers/LlamaEvent.h",
    "chars": 936,
    "preview": "//\n//  LlamaEvent.h\n//  llama\n//\n//  Created by Alex Rozanski on 14/03/2023.\n//\n\n#import <Foundation/Foundation.h>\n\nNS_A"
  },
  {
    "path": "Sources/llamaObjCxx/headers/LlamaRunnerBridge.h",
    "chars": 720,
    "preview": "//\n//  LlamaRunnerBridge.h\n//  llama\n//\n//  Created by Alex Rozanski on 12/03/2023.\n//\n\n#import <Foundation/Foundation.h"
  },
  {
    "path": "Sources/llamaObjCxx/headers/LlamaRunnerBridgeConfig.h",
    "chars": 399,
    "preview": "//\n//  LlamaRunnerBridgeConfig.h\n//  llama\n//\n//  Created by Alex Rozanski on 13/03/2023.\n//\n\n#import <Foundation/Founda"
  },
  {
    "path": "Sources/llamaObjCxx/module.modulemap",
    "chars": 59,
    "preview": "module llamaObjCxx {\n    umbrella \"headers\"\n    export *\n}\n"
  },
  {
    "path": "llama.xcodeproj/project.pbxproj",
    "chars": 29141,
    "preview": "// !$*UTF8*$!\n{\n\tarchiveVersion = 1;\n\tclasses = {\n\t};\n\tobjectVersion = 56;\n\tobjects = {\n\n/* Begin PBXBuildFile section *"
  },
  {
    "path": "llama.xcodeproj/project.xcworkspace/contents.xcworkspacedata",
    "chars": 135,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Workspace\n   version = \"1.0\">\n   <FileRef\n      location = \"self:\">\n   </FileRef"
  },
  {
    "path": "llama.xcodeproj/project.xcworkspace/xcshareddata/IDEWorkspaceChecks.plist",
    "chars": 238,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/P"
  },
  {
    "path": "llamaTest/Info.plist",
    "chars": 247,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/P"
  },
  {
    "path": "llamaTest/LlamaTest.xcconfig",
    "chars": 12,
    "preview": "MODEL_PATH=\n"
  },
  {
    "path": "llamaTest/main.swift",
    "chars": 1794,
    "preview": "//\n//  main.swift\n//  llamaTest\n//\n//  Created by Alex Rozanski on 12/03/2023.\n//\n\nimport Foundation\nimport llama\n\nguard"
  },
  {
    "path": "tools/.gitignore",
    "chars": 9,
    "preview": "quantize\n"
  },
  {
    "path": "tools/Makefile",
    "chars": 5173,
    "preview": "ifndef UNAME_S\nUNAME_S := $(shell uname -s)\nendif\n\nifndef UNAME_P\nUNAME_P := $(shell uname -p)\nendif\n\nifndef UNAME_M\nUNA"
  },
  {
    "path": "tools/convert-pth-to-ggml.py",
    "chars": 5401,
    "preview": "# Convert a LLaMA model checkpoint to a ggml compatible file\n#\n# Load the model using Torch\n# Iterate over all variables"
  },
  {
    "path": "tools/quantize.sh",
    "chars": 312,
    "preview": "#!/usr/bin/env bash\n\nif ! [[ \"$1\" =~ ^[0-9]{1,2}B$ ]]; then\n    echo\n    echo \"Usage: quantize.sh 7B|13B|30B|65B [--remo"
  }
]
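
Each entry in the preview array above carries a `path`, a `chars` count, and a `preview` snippet. As a minimal sketch of how such a listing could be consumed, the Python below parses an array of that shape and reports the entry count, the total character count, and the largest file. The function name `summarize_preview` and the two-entry `sample` string are illustrative assumptions, not part of the repository or of the extraction tool; the sample values are copied from two entries in the listing.

```python
import json
from typing import Tuple


def summarize_preview(text: str) -> Tuple[int, int, str]:
    """Return (entry count, total chars, path of the largest entry).

    Assumes `text` is a JSON array of objects with "path" and "chars" keys,
    matching the shape of the condensed preview listing.
    """
    entries = json.loads(text)
    total_chars = sum(entry["chars"] for entry in entries)
    largest = max(entries, key=lambda entry: entry["chars"])
    return len(entries), total_chars, largest["path"]


# Hypothetical two-entry sample in the same shape as the listing.
sample = r'''
[
  {"path": "tools/.gitignore", "chars": 9, "preview": "quantize\n"},
  {"path": "llamaTest/LlamaTest.xcconfig", "chars": 12, "preview": "MODEL_PATH=\n"}
]
'''

count, total, largest_path = summarize_preview(sample)
print(count, total, largest_path)
```

Run against the full array, the same function would be expected to show `Sources/cpp/ggml.c` as the largest entry, since its `chars` value (324639) is far above the others in the listing.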

About this extraction

This page contains the full source code of the alexrozanski/llama.swift GitHub repository, extracted and formatted as plain text. The extraction includes 33 files (466.5 KB), approximately 142.5k tokens, and a symbol index with 797 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
