Repository: nadavrot/memset_benchmark
Branch: main
Commit: ed6cd499f55a
Files: 21
Total size: 53.9 KB
Directory structure:
gitextract_j6tepqkv/
├── .gitignore
├── CMakeLists.txt
├── README.md
├── docs/
│ └── annotated_glibc.txt
├── include/
│ ├── decl.h
│ ├── types.h
│ └── utils.h
└── src/
├── memcpy/
│ ├── CMakeLists.txt
│ ├── bench_memcpy.cc
│ ├── folly.S
│ ├── impl.S
│ ├── impl.c
│ └── test_memcpy.cc
├── memset/
│ ├── CMakeLists.txt
│ ├── bench_memset.cc
│ ├── impl.S
│ ├── impl.c
│ ├── shims.c
│ └── test_memset.cc
└── utils/
├── CMakeLists.txt
└── hist_tool.c
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
*.o
*.swn
*.swo
*.swp
*~
.DS_Store
*.so
*.dylib
GPATH
GRTAGS
GTAGS
tags
compile_commands.json
toolchain/
llvm-project/
gcc-project/
build*/
.vscode/
.vim/
.idea/
================================================
FILE: CMakeLists.txt
================================================
cmake_minimum_required(VERSION 3.7)
project(memset_benchmark VERSION 1.0.0 DESCRIPTION "Memset benchmarks")
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)
# Export a JSON file with the compilation commands that external tools can use
# to analyze the source code of the project.
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
enable_language(C ASM)
# Disable RTTI.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-rtti")
if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
message(STATUS "No build type selected, default to Release")
set(CMAKE_BUILD_TYPE "RelWithDebInfo" CACHE STRING "Build type (default RelWithDebInfo)" FORCE)
endif()
add_compile_options(-Wall -g3 -O3 -march=native)
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -Wall -march=native")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Wall -fno-omit-frame-pointer -O0")
# Place all of the binaries in the build directory.
set (CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
set (CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
include_directories(BEFORE
${CMAKE_CURRENT_BINARY_DIR}/include
${CMAKE_CURRENT_SOURCE_DIR}/include
)
add_subdirectory(src/memset/)
add_subdirectory(src/memcpy/)
add_subdirectory(src/utils/)
================================================
FILE: README.md
================================================
# Fast Memset and Memcpy implementations
*UPDATE*: Ilya Albrecht landed the memset implementation from this repo into [Folly](https://github.com/facebook/folly/blob/main/folly/memset.S).
This repository contains high-performance implementations of memset and memcpy.
These implementations outperform the folly and glibc implementations. This
repository contains several reference implementations in C and assembly. The
high-performance implementations are found in the files called "impl.S".
Before reading the source code in this repository you probably want to read an
excellent blog [post](https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines/)
by Joe Bialek about his work to optimize memset for Windows.
The charts below compare the code in this repo with other implementations:
folly, musl, and glibc. The glibc implementations are measured with and without
the elf indirection, as suggested by Dave Zarzycki.
## Memset

## Memcpy

The chart below compares the performance of different memset implementations on
buffers of varying sizes and offsets. Unlike a hot loop that hammers a single
value, this benchmark is more realistic because it takes into account
mispredicted branches and the performance of the CPU decoder. The buffers are in
the size range 0 to 256. The random sequence is made of pre-computed values,
which lowers the overhead of random-number generation. This was suggested by
Yann Collet. The 'nop' function is used to measure the benchmark setup and call
overhead. The numbers below represent the implementation execution time minus
the nop function time.
 
The size of the buffers that memset and memcpy mutate is typically small. The
picture below presents the buffer-length distribution in Google Chrome. Vim,
Python, and even server workloads have a similar distribution. The values in the
chart represent the power-of-two bucket of the buffer size (10 represents the
values between 512 and 1024).

The chart below presents a histogram of pointer alignment (from the game
Minecraft). Most of the pointers passed to memset and memcpy are aligned to
8 bytes. Some programs have histograms that are not as sharp, meaning that more
of their pointers are not aligned to a 4- or 8-byte boundary.

Memset and memcpy are frequently called by low-level high-performance libraries.
Here is an example of a stack trace from the Firefox codebase:
```
(gdb) bt
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:225
#1 in memcpy (__dest=, __src=, __len=40) at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:34
#2 mozilla::BufferList<InfallibleAllocPolicy>::ReadBytes(mozilla::BufferList<InfallibleAllocPolicy>::IterImpl&, char*, unsigned long) const
#3 Pickle::ReadBytesInto(PickleIterator*, void*, unsigned int) const (this=, iter=, data=, length=<optimized out>)
#4 in IPC::Message::ReadFooter(void*, unsigned int, bool) (this=, buffer=, buffer_len=40, truncate=true)
#5 in mozilla::ipc::NodeController::DeserializeEventMessage(mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >) (this=, aMessage=...)
#6 in mozilla::ipc::NodeController::OnEventMessage(mojo::core::ports::NodeName const&, mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >)
#7 in mozilla::ipc::NodeChannel::OnMessageReceived(IPC::Message&&) (this=<optimized out>, aMessage=...)
#8 in IPC::Channel::ChannelImpl::ProcessIncomingMessages() (this=<optimized out>)
#9 in IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking(int) (this=, fd=)
#10 in base::MessagePumpLibevent::OnLibeventNotification(int, short, void*) (fd=, flags=, context=)
#11 in event_persist_closure (base=, ev=) at /build/firefox-HSiFn6/firefox-94.0+build3/ipc/chromium/src/third_party/libevent/event.c:1580
#12 event_process_active_single_queue (base=, activeq=, max_to_process=, endtime=)
```
The repository contains a few utilities for testing and measuring the
performance and correctness of memset and memcpy.
## Test tool
This is a small test harness that verifies the correctness of the
implementations. It is easy to make off-by-one mistakes and run into alignment
issues; the exhaustive tester catches these problems.
This is a sample output:
```
OOOOOOOOOOOXX
^
Filling a buffer of length 13. Expected "O" at index 11
```
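The core of the exhaustive check can be sketched like this. This is an assumed shape for illustration, not the repo's exact harness (`check_memset` is a hypothetical name): fill the middle of a guarded buffer and verify both the filled region and the untouched guards.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Run one memset implementation on a region of length `len` (up to 512)
 * and verify that it wrote exactly the requested bytes: the region holds
 * 'O' and the '#' guard bytes on both sides are untouched. */
bool check_memset(void *(*impl)(void *, int, size_t), size_t len) {
  char buf[512 + 64];
  memset(buf, '#', sizeof buf); /* '#' guards surround the region */
  impl(buf + 32, 'O', len);     /* fill the middle */
  for (size_t i = 0; i < sizeof buf; i++) {
    char expect = (i >= 32 && i < 32 + len) ? 'O' : '#';
    if (buf[i] != expect)
      return false;
  }
  return true;
}
```

Running this for every length and every small pointer offset is what catches the off-by-one and alignment bugs mentioned above.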
## Benchmark tool
The benchmark tool measures the performance of the system libc and the local
implementation. The benchmarking tool runs each of the implementations in a loop
millions of times. It runs the benchmark several times and picks the least noisy
results. It's a good idea to run the benchmark tool and compare an
implementation to itself to assess the noise level in the system. The
benchmarking tool uses a trampoline to prevent the compiler from inlining and
expanding the memset call.
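The trampoline idea can be sketched as follows (a minimal illustration, not the repo's exact code; `fill_via_trampoline` is a hypothetical name): calling through a volatile function pointer forces a real indirect call, so the compiler cannot inline memset or specialize it for a known size.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memset_fn)(void *, int, size_t);

/* volatile forces the compiler to reload the pointer and emit a real
 * indirect call on every iteration, instead of inlining memset. */
static memset_fn volatile memset_trampoline = memset;

/* Fill a buffer through the trampoline and return the last byte, so a
 * caller can verify that the store actually happened. */
char fill_via_trampoline(char *buf, size_t n, int c) {
  memset_trampoline(buf, c, n);
  return buf[n - 1];
}
```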
## Histogram tool
The histogram tool is a shared object that records calls to memset and
memcpy and creates a histogram of the length parameter. It prints the histogram
when the program exits cleanly. The shared object can be loaded using
LD\_PRELOAD (on Linux) or DYLD\_INSERT\_LIBRARIES (on Mac). Each bucket in the
output represents the log2 size of the buffer, and each value represents the
number of hits for the bucket.
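The log2 bucketing can be sketched like this (an assumed convention for illustration, not necessarily the exact rounding used by hist_tool.c): bucket 10 collects lengths between 512 and 1024.

```c
#include <stddef.h>

/* Map a buffer length to its log2 bucket: the smallest b such that
 * 2^b >= n. For example, lengths 513..1024 land in bucket 10. */
unsigned log2_bucket(size_t n) {
  unsigned b = 0;
  while (((size_t)1 << b) < n)
    b++;
  return b;
}
```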
## Proxy tool
This is a small utility that replaces calls to the built-in memset and memcpy
with the local implementations from this project. The shared object can be loaded
using LD\_PRELOAD (on Linux) or DYLD\_INSERT\_LIBRARIES (on Mac).
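The interposition idea can be sketched as follows. This is a hedged illustration, not the repo's shims.c: the real shim exports the function under the standard name `memset`, so that a preloaded shared object shadows libc; it is named `proxy_memset` here only so the sketch can be tested without shadowing libc.

```c
#include <stddef.h>

/* Stand-in for the project's optimized implementation. In the real
 * shim this would be exported as "memset", and the dynamic linker
 * would bind the program's memset calls here when the .so is
 * preloaded via LD_PRELOAD (Linux) or DYLD_INSERT_LIBRARIES (Mac). */
void *proxy_memset(void *s, int c, size_t n) {
  unsigned char *p = (unsigned char *)s;
  for (size_t i = 0; i < n; i++)
    p[i] = (unsigned char)c;
  return s; /* memset returns its first argument */
}
```

A typical (assumed) invocation would be to build the shim with `cc -shared -fPIC` and run the target program with `LD_PRELOAD=./libmem_shim.so ./program`.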
================================================
FILE: docs/annotated_glibc.txt
================================================
<+0>: endbr64
<+4>: vmovd %esi, %xmm0
<+8>: movq %rdi, %rax
<+11>: vpbroadcastb %xmm0, %ymm0
<+16>: cmpq $0x20, %rdx
<+20>: jb 0xBELOW_32____ ; <+190>
<+26>: cmpq $0x40, %rdx
<+30>: ja 0xABOVE_64____ ; <+46>
<+32>: vmovdqu %ymm0, -0x20(%rdi,%rdx)
<+38>: vmovdqu %ymm0, (%rdi)
<+42>: vzeroupper
<+45>: retq
0xABOVE_64____ <+46>: cmpq $0x800, %rdx ; imm = 0x800
<+53>: ja 0xABOVE_2048__ ; ___lldb_unnamed_symbol1097$$libc.so.6 + 4
<+55>: cmpq $0x80, %rdx
<+62>: ja 0xABOVE_128___ ; <+89>
0xSZ_64_TO_128 <+64>: vmovdqu %ymm0, (%rdi)
<+68>: vmovdqu %ymm0, 0x20(%rdi)
<+73>: vmovdqu %ymm0, -0x20(%rdi,%rdx)
<+79>: vmovdqu %ymm0, -0x40(%rdi,%rdx)
0xEXIT_EXIT___ <+85>: vzeroupper
<+88>: retq
0xABOVE_128___ <+89>: leaq 0x80(%rdi), %rcx
<+96>: vmovdqu %ymm0, (%rdi)
<+100>: andq $-0x80, %rcx
<+104>: vmovdqu %ymm0, -0x20(%rdi,%rdx)
<+110>: vmovdqu %ymm0, 0x20(%rdi)
<+115>: vmovdqu %ymm0, -0x40(%rdi,%rdx)
<+121>: vmovdqu %ymm0, 0x40(%rdi)
<+126>: vmovdqu %ymm0, -0x60(%rdi,%rdx)
<+132>: vmovdqu %ymm0, 0x60(%rdi)
<+137>: vmovdqu %ymm0, -0x80(%rdi,%rdx)
<+143>: addq %rdi, %rdx
<+146>: andq $-0x80, %rdx
<+150>: cmpq %rdx, %rcx
<+153>: je 0xEXIT_EXIT___ ; <+85>
0xLOOP_4x32B__ <+155>: vmovdqa %ymm0, (%rcx)
<+159>: vmovdqa %ymm0, 0x20(%rcx)
<+164>: vmovdqa %ymm0, 0x40(%rcx)
<+169>: vmovdqa %ymm0, 0x60(%rcx)
<+174>: addq $0x80, %rcx
<+181>: cmpq %rcx, %rdx
<+184>: jne 0xLOOP_4x32B__ ; <+155>
<+186>: vzeroupper
<+189>: retq
0xBELOW_32____ <+190>: cmpb $0x10, %dl
<+193>: jae 0xBELOW_16____ ; <+223>
<+195>: vmovq %xmm0, %rcx
<+200>: cmpb $0x8, %dl
<+203>: jae 0xABOVE_8_____ ; <+237>
<+205>: cmpb $0x4, %dl
<+208>: jae 0xABOVE_4_____ ; <+249>
<+210>: cmpb $0x1, %dl
<+213>: ja 0xABOVE_1_____ ; <+259>
<+215>: jb 0xIS_ZERO_CASE ; <+219>
<+217>: movb %cl, (%rdi)
0xIS_ZERO_CASE <+219>: vzeroupper
<+222>: retq
0xBELOW_16____ <+223>: vmovdqu %xmm0, -0x10(%rdi,%rdx)
<+229>: vmovdqu %xmm0, (%rdi)
<+233>: vzeroupper
<+236>: retq
0xABOVE_8_____ <+237>: movq %rcx, -0x8(%rdi,%rdx)
<+242>: movq %rcx, (%rdi)
<+245>: vzeroupper
<+248>: retq
0xABOVE_4_____ <+249>: movl %ecx, -0x4(%rdi,%rdx)
<+253>: movl %ecx, (%rdi)
<+255>: vzeroupper
<+258>: retq
0xABOVE_1_____ <+259>: movw %cx, -0x2(%rdi,%rdx)
<+264>: movw %cx, (%rdi)
<+267>: vzeroupper
<+270>: retq
<+271>: nop
================================================
FILE: include/decl.h
================================================
#ifndef DECLS
#define DECLS
#include <stddef.h>
#ifdef __cplusplus
using memset_ty = void *(void *s, int c, size_t n);
using memcpy_ty = void *(void *dest, const void *src, size_t n);
extern "C" {
#endif
void *memcpy(void *dest, const void *src, size_t n);
void *__folly_memcpy(void *dest, const void *src, size_t n);
void *libc_memcpy(void *dest, const void *src, size_t n);
void *local_memcpy(void *dest, const void *src, size_t n);
void *asm_memcpy(void *dest, const void *src, size_t n);
void *memset(void *s, int c, size_t n);
void *libc_memset(void *s, int c, size_t n);
void *local_memset(void *s, int c, size_t n);
void *asm_memset(void *s, int c, size_t n);
void *musl_memset(void *s, int c, size_t n);
#ifdef __cplusplus
}
#endif
#endif // DECLS
================================================
FILE: include/types.h
================================================
#ifndef TYPES
#define TYPES
#include <stdint.h>
#define NO_INLINE __attribute__((noinline))
#ifdef __clang__
typedef char char8 __attribute__((ext_vector_type(8), aligned(1)));
typedef char char16 __attribute__((ext_vector_type(16), aligned(1)));
typedef char char32 __attribute__((ext_vector_type(32), aligned(1)));
typedef char char32a __attribute__((ext_vector_type(32), aligned(32)));
#else
// __GNUC__
typedef char char8 __attribute__((vector_size(8), aligned(1)));
typedef char char16 __attribute__((vector_size(16), aligned(1)));
typedef char char32 __attribute__((vector_size(32), aligned(1)));
typedef char char32a __attribute__((vector_size(32), aligned(32)));
#endif
typedef uint32_t __attribute__((aligned(1))) u32;
typedef uint64_t __attribute__((aligned(1))) u64;
#endif // TYPES
================================================
FILE: include/utils.h
================================================
#ifndef UTILS_H
#define UTILS_H
#include <algorithm>
#include <chrono>
#include <string>
#include <vector>
#include "types.h"
/// Aligns the pointer \p ptr up to \p alignment, then offsets it by \p offset
/// bytes within the word.
void *align_pointer(void *ptr, unsigned alignment, unsigned offset) {
size_t p = (size_t)ptr;
while (p % alignment)
++p;
return (void *)(p + (size_t)offset);
}
using time_point = std::chrono::steady_clock::time_point;
class Stopwatch {
/// The time of the last sample.
time_point begin_;
/// A list of recorded intervals.
std::vector<uint64_t> intervals_;
public:
NO_INLINE
Stopwatch() : begin_() {}
NO_INLINE
void start() { begin_ = std::chrono::steady_clock::now(); }
NO_INLINE
void stop() {
time_point end = std::chrono::steady_clock::now();
uint64_t interval =
std::chrono::duration_cast<std::chrono::microseconds>(end - begin_)
.count();
intervals_.push_back(interval);
}
NO_INLINE
uint64_t get_median() {
std::sort(intervals_.begin(), intervals_.end());
return intervals_[intervals_.size() / 2];
}
};
uint8_t random_bytes[320] = {
227, 138, 244, 198, 73, 247, 185, 248, 229, 75, 24, 215, 159, 230, 136,
246, 200, 144, 65, 67, 109, 86, 118, 61, 209, 103, 188, 213, 187, 8,
210, 121, 214, 178, 232, 59, 153, 92, 209, 239, 44, 85, 156, 172, 237,
41, 150, 195, 247, 202, 249, 142, 208, 133, 21, 204, 114, 38, 51, 150,
194, 46, 184, 138, 50, 250, 190, 180, 161, 5, 211, 191, 62, 137, 142,
122, 63, 72, 233, 125, 189, 51, 238, 51, 116, 10, 44, 18, 240, 41,
157, 81, 183, 252, 214, 17, 81, 12, 44, 119, 77, 97, 101, 80, 106,
128, 190, 89, 160, 104, 244, 192, 46, 69, 73, 255, 45, 213, 190, 86,
18, 89, 34, 46, 134, 145, 166, 128, 87, 97, 192, 71, 105, 94, 51,
30, 7, 9, 0, 40, 0, 187, 205, 189, 151, 159, 107, 105, 180, 182,
233, 52, 209, 108, 186, 31, 184, 254, 170, 71, 162, 31, 80, 226, 75,
125, 214, 125, 247, 197, 149, 132, 247, 157, 253, 101, 107, 1, 127, 236,
249, 242, 152, 169, 123, 240, 129, 230, 135, 25, 57, 227, 130, 189, 76,
254, 33, 193, 39, 82, 177, 143, 31, 17, 20, 195, 219, 165, 171, 198,
125, 119, 216, 143, 55, 210, 17, 88, 150, 126, 38, 160, 71, 214, 10,
162, 158, 6, 234, 233, 119, 221, 167, 62, 146, 50, 150, 176, 142, 167,
201, 250, 195, 26, 156, 96, 36, 177, 95, 23, 7, 63, 55, 142, 80,
227, 73, 124, 93, 211, 231, 166, 182, 57, 145, 55, 242, 213, 246, 30,
146, 247, 19, 229, 34, 210, 37, 147, 242, 103, 125, 91, 171, 51, 22,
126, 248, 149, 19, 60, 89, 5, 241, 132, 72, 217, 195, 11, 173, 247,
47, 144, 222, 94, 51, 166, 192, 50, 109, 62, 42, 126, 111, 204, 141,
66,
};
/// Implements a doom-style random number generator.
struct DoomRNG {
// Points to the current random number.
unsigned rand_curr = 0;
void rand_reset() { rand_curr = 0; }
uint8_t next_u8_random() { return random_bytes[rand_curr++ % 320]; }
};
#endif // UTILS_H
================================================
FILE: src/memcpy/CMakeLists.txt
================================================
add_executable(test_memcpy
test_memcpy.cc
folly.S
impl.S
impl.c
)
target_link_libraries(test_memcpy PUBLIC)
add_executable(bench_memcpy
bench_memcpy.cc
folly.S
impl.S
impl.c
)
install(TARGETS bench_memcpy DESTINATION bin)
install(TARGETS test_memcpy DESTINATION bin)
================================================
FILE: src/memcpy/bench_memcpy.cc
================================================
#include <algorithm>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <unistd.h>
#include <vector>
#include "decl.h"
#include "utils.h"
////////////////////////////////////////////////////////////////////////////////
// This is a small program that compares two memcpy implementations and records
// the output in a csv file.
////////////////////////////////////////////////////////////////////////////////
#define ITER (1000L * 1000L * 10L)
#define SAMPLES (20)
DoomRNG RNG;
/// Measure a single implementation \p handle.
uint64_t measure(memcpy_ty handle, void *dest, void *src, unsigned size) {
Stopwatch T;
for (unsigned i = 0; i < SAMPLES; i++) {
T.start();
for (size_t j = 0; j < ITER; j++) {
(handle)(dest, src, size);
}
T.stop();
}
return T.get_median();
}
// Allocate memory and benchmark each implementation at a specific size \p size.
void bench_impl(const std::vector<memcpy_ty *> &toTest, unsigned size,
unsigned align, unsigned offset) {
std::vector<char> dest(size + 256, 0);
std::vector<char> src(size + 256, 0);
char *src_ptr = (char *)align_pointer(&src[0], align, offset);
char *dest_ptr = (char *)align_pointer(&dest[0], align, offset);
std::cout << size << ", ";
for (auto handle : toTest) {
uint64_t res = measure(handle, dest_ptr, src_ptr, size);
std::cout << res << ", ";
}
std::cout << std::endl;
}
/// Allocate and copy buffers at random offsets and in random sizes.
/// The sizes and the offsets are in the range 0..256.
void bench_rand_range(const std::vector<memcpy_ty *> &toTest) {
std::vector<char> dest(4096, 1);
std::vector<char> src(4096, 0);
const char *src_p = &src[0];
char *dest_p = &dest[0];
for (auto handle : toTest) {
Stopwatch T;
sleep(1);
for (unsigned i = 0; i < SAMPLES; i++) {
RNG.rand_reset();
T.start();
for (size_t j = 0; j < ITER; j++) {
char *to = dest_p + RNG.next_u8_random();
const char *from = src_p + RNG.next_u8_random();
(handle)(to, from, RNG.next_u8_random());
}
T.stop();
}
std::cout << T.get_median() << ", ";
}
std::cout << std::endl;
}
// To measure the call overhead.
void *nop(void *dest, const void *src, size_t n) { return dest; }
int main(int argc, char **argv) {
std::cout << std::setprecision(3);
std::cout << std::fixed;
std::vector<memcpy_ty *> toTest = {
&libc_memcpy, &memcpy, &__folly_memcpy, &local_memcpy, &asm_memcpy, &nop};
std::cout << "Batches of random sizes:\n";
std::cout << "libc@plt, libc, folly, c_memcpy, asm_memcpy, nop,\n";
bench_rand_range(toTest);
std::cout << "\nFixed size:\n";
std::cout << "size, libc@plt, libc, folly, c_memcpy, asm_memcpy, nop,\n";
for (int i = 0; i < 512; i++) {
bench_impl(toTest, i, 16, 0);
}
return 0;
}
================================================
FILE: src/memcpy/folly.S
================================================
/*
* Copyright (c) Facebook, Inc. and its affiliates.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* __folly_memcpy: An optimized memcpy implementation that uses prefetch and
* AVX2 instructions.
*
* This implementation of memcpy acts as a memmove, but it is not optimized for
* this purpose. While overlapping copies are undefined in memcpy, this
* implementation acts like memmove for sizes up through 256 bytes and will
* detect overlapping copies and call memmove for overlapping copies of 257 or
* more bytes.
*
* This implementation uses prefetch to avoid dtlb misses. This can
* substantially reduce dtlb store misses in cases where the destination
* location is absent from L1 cache and where the copy size is small enough
* that the hardware prefetcher doesn't have a large impact.
*
* The number of branches is limited by the use of overlapping copies. This
* helps with copies where the source and destination cache lines are already
* present in L1 because there are fewer instructions to execute and fewer
* branches to potentially mispredict.
*
* Vector operations up to 32-bytes are used (avx2 instruction set). Larger
* mov operations (avx512) are not used.
*
* Large copies make use of aligned store operations. This operation is
* observed to always be faster than rep movsb, so the rep movsb instruction
* is not used.
*
* If the copy size is humongous and the source and destination are both
* aligned, this memcpy will use non-temporal operations. This can have
* a substantial speedup for copies where data is absent from L1, but it
* is significantly slower if the source and destination data were already
* in L1. The use of non-temporal operations also has the effect that after
* the copy is complete, the data will be moved out of L1, even if the data was
* present before the copy started.
*
* @author Logan Evans <lpe@fb.com>
*/
#if defined(__AVX2__)
// This threshold is half of L1 cache on a Skylake machine, which means that
// potentially all of L1 will be populated by this copy once it is executed
// (dst and src are cached for temporal copies).
#define NON_TEMPORAL_STORE_THRESHOLD $32768
.file "memcpy.S"
.section .text,"ax"
.type __folly_memcpy_short, @function
__folly_memcpy_short:
.cfi_startproc
.L_GE1_LE7:
cmp $1, %rdx
je .L_EQ1
cmp $4, %rdx
jae .L_GE4_LE7
.L_GE2_LE3:
movw (%rsi), %r8w
movw -2(%rsi,%rdx), %r9w
movw %r8w, (%rdi)
movw %r9w, -2(%rdi,%rdx)
ret
.align 2
.L_EQ1:
movb (%rsi), %r8b
movb %r8b, (%rdi)
ret
// Aligning the target of a jump to an even address has a measurable
// speedup in microbenchmarks.
.align 2
.L_GE4_LE7:
movl (%rsi), %r8d
movl -4(%rsi,%rdx), %r9d
movl %r8d, (%rdi)
movl %r9d, -4(%rdi,%rdx)
ret
.cfi_endproc
.size __folly_memcpy_short, .-__folly_memcpy_short
// memcpy is an alternative entrypoint into the function named __folly_memcpy.
// The compiler is able to call memcpy since the name is global while
// stacktraces will show __folly_memcpy since that is the name of the function.
// This is intended to aid in debugging by making it obvious which version of
// memcpy is being used.
.align 64
.globl __folly_memcpy
.type __folly_memcpy, @function
__folly_memcpy:
.cfi_startproc
mov %rdi, %rax
test %rdx, %rdx
je .L_EQ0
prefetchw (%rdi)
prefetchw -1(%rdi,%rdx)
cmp $8, %rdx
jb .L_GE1_LE7
.L_GE8:
cmp $32, %rdx
ja .L_GE33
.L_GE8_LE32:
cmp $16, %rdx
ja .L_GE17_LE32
.L_GE8_LE16:
mov (%rsi), %r8
mov -8(%rsi,%rdx), %r9
mov %r8, (%rdi)
mov %r9, -8(%rdi,%rdx)
.L_EQ0:
ret
.align 2
.L_GE17_LE32:
movdqu (%rsi), %xmm0
movdqu -16(%rsi,%rdx), %xmm1
movdqu %xmm0, (%rdi)
movdqu %xmm1, -16(%rdi,%rdx)
ret
.align 2
.L_GE193_LE256:
vmovdqu %ymm3, 96(%rdi)
vmovdqu %ymm4, -128(%rdi,%rdx)
.L_GE129_LE192:
vmovdqu %ymm2, 64(%rdi)
vmovdqu %ymm5, -96(%rdi,%rdx)
.L_GE65_LE128:
vmovdqu %ymm1, 32(%rdi)
vmovdqu %ymm6, -64(%rdi,%rdx)
.L_GE33_LE64:
vmovdqu %ymm0, (%rdi)
vmovdqu %ymm7, -32(%rdi,%rdx)
vzeroupper
ret
.align 2
.L_GE33:
vmovdqu (%rsi), %ymm0
vmovdqu -32(%rsi,%rdx), %ymm7
cmp $64, %rdx
jbe .L_GE33_LE64
prefetchw 64(%rdi)
vmovdqu 32(%rsi), %ymm1
vmovdqu -64(%rsi,%rdx), %ymm6
cmp $128, %rdx
jbe .L_GE65_LE128
prefetchw 128(%rdi)
vmovdqu 64(%rsi), %ymm2
vmovdqu -96(%rsi,%rdx), %ymm5
cmp $192, %rdx
jbe .L_GE129_LE192
prefetchw 192(%rdi)
vmovdqu 96(%rsi), %ymm3
vmovdqu -128(%rsi,%rdx), %ymm4
cmp $256, %rdx
jbe .L_GE193_LE256
.L_GE257:
prefetchw 256(%rdi)
// Check if there is an overlap. If there is an overlap then the caller
// has a bug since this is undefined behavior. However, for legacy
// reasons this behavior is expected by some callers.
//
// All copies through 256 bytes will operate as a memmove since for
// those sizes all reads are performed before any writes.
//
// This check uses the idea that there is an overlap if
// (%rdi < (%rsi + %rdx)) && (%rsi < (%rdi + %rdx)),
// or equivalently, there is no overlap if
// ((%rsi + %rdx) <= %rdi) || ((%rdi + %rdx) <= %rsi).
//
// %r9 will be used after .L_ALIGNED_DST_LOOP to calculate how many
// bytes remain to be copied.
lea (%rsi,%rdx), %r9
cmp %rdi, %r9
jbe .L_NO_OVERLAP
lea (%rdi,%rdx), %r8
cmp %rsi, %r8
// This is a forward jump so that the branch predictor will not predict
// a memmove.
ja .L_MEMMOVE
.align 2
.L_NO_OVERLAP:
vmovdqu %ymm0, (%rdi)
vmovdqu %ymm1, 32(%rdi)
vmovdqu %ymm2, 64(%rdi)
vmovdqu %ymm3, 96(%rdi)
// Align %rdi to a 32 byte boundary.
// %rcx = 128 - (%rdi & 31)
mov $128, %rcx
and $31, %rdi
sub %rdi, %rcx
lea (%rsi,%rcx), %rsi
lea (%rax,%rcx), %rdi
sub %rcx, %rdx
// %r8 is the end condition for the loop.
lea -128(%rsi,%rdx), %r8
cmp NON_TEMPORAL_STORE_THRESHOLD, %rdx
jae .L_NON_TEMPORAL_LOOP
.align 2
.L_ALIGNED_DST_LOOP:
prefetchw 128(%rdi)
prefetchw 192(%rdi)
vmovdqu (%rsi), %ymm0
vmovdqu 32(%rsi), %ymm1
vmovdqu 64(%rsi), %ymm2
vmovdqu 96(%rsi), %ymm3
add $128, %rsi
vmovdqa %ymm0, (%rdi)
vmovdqa %ymm1, 32(%rdi)
vmovdqa %ymm2, 64(%rdi)
vmovdqa %ymm3, 96(%rdi)
add $128, %rdi
cmp %r8, %rsi
jb .L_ALIGNED_DST_LOOP
.L_ALIGNED_DST_LOOP_END:
sub %rsi, %r9
mov %r9, %rdx
vmovdqu %ymm4, -128(%rdi,%rdx)
vmovdqu %ymm5, -96(%rdi,%rdx)
vmovdqu %ymm6, -64(%rdi,%rdx)
vmovdqu %ymm7, -32(%rdi,%rdx)
vzeroupper
ret
.align 2
.L_NON_TEMPORAL_LOOP:
testb $31, %sil
jne .L_ALIGNED_DST_LOOP
// This is prefetching the source data unlike ALIGNED_DST_LOOP which
// prefetches the destination data. This choice is again informed by
// benchmarks. With a non-temporal store the entirety of the cache line
// is being written so the previous data can be discarded without being
// fetched.
prefetchnta 128(%rsi)
prefetchnta 196(%rsi)
vmovntdqa (%rsi), %ymm0
vmovntdqa 32(%rsi), %ymm1
vmovntdqa 64(%rsi), %ymm2
vmovntdqa 96(%rsi), %ymm3
add $128, %rsi
vmovntdq %ymm0, (%rdi)
vmovntdq %ymm1, 32(%rdi)
vmovntdq %ymm2, 64(%rdi)
vmovntdq %ymm3, 96(%rdi)
add $128, %rdi
cmp %r8, %rsi
jb .L_NON_TEMPORAL_LOOP
sfence
jmp .L_ALIGNED_DST_LOOP_END
.L_MEMMOVE:
call memmove
ret
.cfi_endproc
.size __folly_memcpy, .-__folly_memcpy
#ifdef FOLLY_MEMCPY_IS_MEMCPY
.weak memcpy
memcpy = __folly_memcpy
#endif
.ident "GCC: (GNU) 4.8.2"
#ifdef __linux__
.section .note.GNU-stack,"",@progbits
#endif
#endif
================================================
FILE: src/memcpy/impl.S
================================================
#if defined(__APPLE__)
.text
.global _libc_memcpy
.p2align 4, 0x90
_libc_memcpy:
jmp _memcpy
#else
.text
.global libc_memcpy
.p2align 4, 0x90
libc_memcpy:
jmp memcpy
#endif
#define LABEL(x) .L##x
#if defined(__APPLE__)
.text
.global _asm_memcpy
.p2align 5, 0x90
_asm_memcpy:
#else
.text
.global asm_memcpy
.p2align 5, 0x90
asm_memcpy:
#endif
// RDI is the dest
// RSI is the src
// RDX is length
mov %rdi, %rax
cmp $64,%rdx
ja LABEL(over_64)
cmp $16,%rdx
jae LABEL(16_to_64)
LABEL(below_16):
cmp $4,%rdx
jbe LABEL(0_to_4)
cmp $8,%rdx
jbe LABEL(in_4_to_8)
LABEL(8_to_16):
movq (%rsi), %rcx
movq %rcx, (%rax)
movq -8(%rsi,%rdx), %rcx
movq %rcx, -8(%rax,%rdx)
retq
LABEL(0_to_4):
// Copy the first and last bytes:
cmp $0,%rdx
je LABEL(exit)
movb (%rsi), %cl
movb %cl, (%rdi)
movb -1(%rsi,%rdx), %cl
movb %cl, -1(%rdi,%rdx)
cmp $2,%rdx
jbe LABEL(exit)
// Copy bytes 1 and 2, if n > 2.
movb 1(%rsi), %cl
movb %cl, 1(%rdi)
movb 2(%rsi), %cl
movb %cl, 2(%rdi)
retq
LABEL(in_4_to_8):
movl (%rsi), %ecx
movl %ecx, (%rdi)
movl -4(%rsi,%rdx), %ecx
movl %ecx, -4(%rdi,%rdx)
LABEL(exit):
retq
LABEL(16_to_64):
cmp $32, %rdx
jbe LABEL(16_to_32)
LABEL(32_to_64):
vmovdqu (%rsi), %ymm0
vmovdqu %ymm0, (%rdi)
vmovdqu -32(%rsi,%rdx), %ymm0
vmovdqu %ymm0, -32(%rdi,%rdx)
vzeroupper
retq
LABEL(16_to_32):
movups (%rsi), %xmm0
movups %xmm0, (%rdi)
movups -16(%rsi,%rdx), %xmm0
movups %xmm0, -16(%rdi,%rdx)
retq
// Handle buffers over 64 bytes:
LABEL(over_64):
cmp $128, %rdx
ja LABEL(over_128)
// Copy the last wide word.
vmovups -32(%rsi,%rdx), %ymm0
// Handle cases in the range 64 to 128. This is two unconditional
// stores (64), 1 conditional store (32), and the one 32 byte store at
// the end.
vmovups (%rsi), %ymm1
vmovups 32(%rsi), %ymm2
cmp $96, %rdx
jbe LABEL(64_to_128_done)
vmovups 64(%rsi), %ymm3
vmovups %ymm3, 64(%rax)
.align 4
LABEL(64_to_128_done):
vmovups %ymm1, (%rax)
vmovups %ymm2, 32(%rax)
// Store the last wide word.
vmovups %ymm0, -32(%rax,%rdx)
vzeroupper
retq
LABEL(over_128):
// Compute the last writeable destination.
lea -128(%rdx), %rcx
xor %r8, %r8
.align 16
LABEL(over_128_copy_loop):
vmovdqu (%rsi, %r8), %ymm0
vmovdqu 32(%rsi, %r8), %ymm1
vmovdqu 64(%rsi, %r8), %ymm2
vmovdqu 96(%rsi, %r8), %ymm3
vmovdqu %ymm0, (%rdi, %r8)
vmovdqu %ymm1, 32(%rdi, %r8)
vmovdqu %ymm2, 64(%rdi, %r8)
vmovdqu %ymm3, 96(%rdi, %r8)
add $128, %r8
cmp %rcx, %r8
jb LABEL(over_128_copy_loop)
// Handle the tail:
lea -32(%rdx), %rcx
cmp %r8, %rcx
jb LABEL(over_128_done)
vmovdqu (%rsi, %r8), %ymm0
vmovdqu %ymm0, (%rdi, %r8)
add $32, %r8
cmp %r8, %rcx
jb LABEL(over_128_done)
vmovdqu (%rsi, %r8), %ymm0
vmovdqu %ymm0, (%rdi, %r8)
add $32, %r8
cmp %r8, %rcx
jb LABEL(over_128_done)
vmovdqu (%rsi, %r8), %ymm0
vmovdqu %ymm0, (%rdi, %r8)
LABEL(over_128_done):
// Copy the last 32 bytes
vmovdqu -32(%rsi, %rdx), %ymm0
vmovdqu %ymm0, -32(%rdi, %rdx)
vzeroupper
retq
================================================
FILE: src/memcpy/impl.c
================================================
#include "types.h"
#include <stddef.h>
#include <stdint.h>
void *local_memcpy(void *dest, const void *src, size_t n) {
char *d = (char *)dest;
const char *s = (const char *)src;
if (n < 5) {
if (n == 0)
return dest;
d[0] = s[0];
d[n - 1] = s[n - 1];
if (n <= 2)
return dest;
d[1] = s[1];
d[2] = s[2];
return dest;
}
if (n <= 16) {
if (n >= 8) {
const char *first_s = s;
const char *last_s = s + n - 8;
char *first_d = d;
char *last_d = d + n - 8;
*((u64 *)first_d) = *((u64 *)first_s);
*((u64 *)last_d) = *((u64 *)last_s);
return dest;
}
const char *first_s = s;
const char *last_s = s + n - 4;
char *first_d = d;
char *last_d = d + n - 4;
*((u32 *)first_d) = *((u32 *)first_s);
*((u32 *)last_d) = *((u32 *)last_s);
return dest;
}
if (n <= 32) {
const char *first_s = s;
const char *last_s = s + n - 16;
char *first_d = d;
char *last_d = d + n - 16;
*((char16 *)first_d) = *((char16 *)first_s);
*((char16 *)last_d) = *((char16 *)last_s);
return dest;
}
const char *last_word_s = s + n - 32;
char *last_word_d = d + n - 32;
// Stamp the 32-byte chunks.
do {
*((char32 *)d) = *((char32 *)s);
d += 32;
s += 32;
} while (d < last_word_d);
// Stamp the last unaligned 32 bytes of the buffer.
*((char32 *)last_word_d) = *((char32 *)last_word_s);
return dest;
}
================================================
FILE: src/memcpy/test_memcpy.cc
================================================
#include <cstring>
#include <iostream>
#include <vector>
#include "decl.h"
#include "utils.h"
////////////////////////////////////////////////////////////////////////////////
// This is a small program that checks if some memcpy implementation is correct.
////////////////////////////////////////////////////////////////////////////////
#define MAGIC_VALUE0 '#'
#define MAGIC_VALUE1 '='
void print_buffer(const char *start, const char *end, char val,
const char *ptr) {
const char *it = start;
while (it != end) {
std::cout << *it;
it++;
}
std::cout << "\n";
it = start;
while (it != ptr) {
std::cout << " ";
it++;
}
std::cout << "^\n";
std::cout << "Filling a buffer of length " << end - start << ".";
std::cout << " Expected \"" << val << "\" at index " << ptr - start
<< std::endl;
}
void print_buffer_match(const char *start0, const char *start1, size_t len,
size_t error_at) {
for (size_t i = 0; i < len; i++) {
std::cout << start0[i];
}
std::cout << "\n";
for (size_t i = 0; i < len; i++) {
std::cout << start1[i];
}
std::cout << "\n";
for (size_t i = 0; i < error_at; i++) {
std::cout << " ";
}
std::cout << "^\n";
std::cout << "Comparing buffers of length " << len << ".";
std::cout << " Invalid value at index " << error_at << "." << std::endl;
}
// Make sure that the whole buffer, from \p start to \p end, is set to \p val.
void assert_uniform_value(const char *start, const char *end, char val) {
const char *ptr = start;
while (ptr != end) {
if (val != *ptr) {
print_buffer(start, end, val, ptr);
abort();
}
ptr++;
}
}
// Make sure that two buffers contain the same memory content.
void assert_buffers_match(const char *buff1, const char *buff2, size_t len) {
for (size_t i = 0; i < len; i++) {
if (buff1[i] != buff2[i]) {
print_buffer_match(buff1, buff2, len, i);
abort();
}
}
}
void test_impl(memcpy_ty handle, const std::string &name, unsigned chunk_size) {
std::vector<char> src(chunk_size + 512);
std::vector<char> dest(chunk_size + 512, MAGIC_VALUE0);
// Fill the buffer with a running counter of printable chars.
for (unsigned i = 0; i < src.size(); i++) {
src[i] = 'A' + (i % 26);
}
// Start copying memory at different offsets.
for (int src_offset = 0; src_offset < 32; src_offset++) {
for (int dest_offset = 0; dest_offset < 32; dest_offset++) {
const char *dest_start = dest.data();
const char *dest_end = dest.data() + dest.size();
const char *src_region_start = &src[src_offset];
char *dest_region_start = &dest[dest_offset];
char *dest_region_end = &dest[dest_offset + chunk_size];
void *res =
(handle)((void *)dest_region_start, src_region_start, chunk_size);
if (res != dest_region_start) {
std::cout << "Invalid return value." << std::endl;
abort();
}
// Check the chunk.
assert_buffers_match(dest_region_start, src_region_start, chunk_size);
// Check before chunk.
assert_uniform_value(dest_start, dest_region_start, MAGIC_VALUE0);
// Check after chunk.
assert_uniform_value(dest_region_end, dest_end, MAGIC_VALUE0);
// Reset the dest buffer:
std::fill(dest.begin(), dest.end(), MAGIC_VALUE0);
}
}
}
int main(int argc, char **argv) {
std::cout << "Testing memcpy... \n";
#define TEST(FUNC, SIZE) test_impl(FUNC, #FUNC, SIZE);
for (int i = 0; i < 1024; i++) {
TEST(&memcpy, i);
TEST(&__folly_memcpy, i);
TEST(&local_memcpy, i);
TEST(&asm_memcpy, i);
}
std::cout << "Done.\n";
return 0;
}
================================================
FILE: src/memset/CMakeLists.txt
================================================
add_library(mem_shim SHARED
shims.c
impl.S
impl.c
)
set_target_properties(mem_shim PROPERTIES
VERSION ${PROJECT_VERSION}
SOVERSION 1
)
add_executable(bench_memset
bench_memset.cc
impl.S
impl.c
)
add_executable(test_memset
test_memset.cc
impl.S
impl.c
)
target_link_libraries(bench_memset PUBLIC)
target_link_libraries(test_memset PUBLIC)
install(TARGETS bench_memset DESTINATION bin)
install(TARGETS test_memset DESTINATION bin)
install(TARGETS mem_shim LIBRARY DESTINATION bin)
================================================
FILE: src/memset/bench_memset.cc
================================================
#include <algorithm>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <unistd.h>
#include <vector>
#include "decl.h"
#include "utils.h"
////////////////////////////////////////////////////////////////////////////////
// This is a small program that compares several memset implementations and
// writes the results as CSV text to stdout.
////////////////////////////////////////////////////////////////////////////////
#define ITER (1000L * 1000L * 10L)
#define SAMPLES (20)
DoomRNG RNG;
/// Measure a single implementation \p handle.
uint64_t measure(memset_ty handle, unsigned size, unsigned align,
unsigned offset, void *ptr) {
Stopwatch T;
for (unsigned i = 0; i < SAMPLES; i++) {
T.start();
for (size_t j = 0; j < ITER; j++) {
(handle)(ptr, 0, size);
}
T.stop();
}
return T.get_median();
}
// Allocate memory and benchmark each implementation at a specific size \p size.
void bench_impl(const std::vector<memset_ty *> &toTest, unsigned size,
unsigned align, unsigned offset) {
std::vector<char> memory(size + 256, 0);
void *ptr = align_pointer(&memory[0], align, offset);
std::cout << size << ", ";
for (auto handle : toTest) {
uint64_t res = measure(handle, size, align, offset, ptr);
std::cout << res << ", ";
}
std::cout << std::endl;
}
/// Call each implementation with random offsets and random sizes.
/// The sizes and the offsets are in the range 0..255.
void bench_rand_range(const std::vector<memset_ty *> &toTest) {
std::vector<char> memory(1024, 0);
void *ptr = &memory[0];
for (auto handle : toTest) {
Stopwatch T;
sleep(1);
for (unsigned i = 0; i < SAMPLES; i++) {
RNG.rand_reset();
T.start();
for (size_t j = 0; j < ITER; j++) {
(handle)((char *)ptr + RNG.next_u8_random(), 0, RNG.next_u8_random());
}
T.stop();
}
std::cout << T.get_median() << ", ";
}
std::cout << std::endl;
}
// To measure the call overhead.
void *nop(void *s, int c, size_t n) { return s; }
int main(int argc, char **argv) {
std::cout << std::setprecision(3);
std::cout << std::fixed;
std::vector<memset_ty *> toTest = {musl_memset, libc_memset, &memset,
local_memset, asm_memset, &nop};
std::cout << "Batches of random sizes:\n";
std::cout << " musl, libc@plt, libc, c_memset, asm_memset, nop,\n";
bench_rand_range(toTest);
std::cout << "\nFixed size:\n";
std::cout << "size, musl, libc@plt, libc, c_memset, asm_memset, nop,\n";
for (int i = 0; i < 512; i++) {
bench_impl(toTest, i, 16, 0);
}
return 0;
}
================================================
FILE: src/memset/impl.S
================================================
#if defined(__APPLE__)
.text
.global _libc_memset
.p2align 4, 0x90
_libc_memset:
jmp _memset
#else
.text
.global libc_memset
.p2align 4, 0x90
libc_memset:
jmp memset
#endif
#define LABEL(x) .L##x
#if defined(__APPLE__)
.text
.global _asm_memset
.p2align 5, 0x90
_asm_memset:
#else
.text
.global asm_memset
.p2align 5, 0x90
asm_memset:
#endif
// RDI is the buffer
// RSI is the value
// RDX is length
vmovd %esi, %xmm0
vpbroadcastb %xmm0,%ymm0
mov %rdi,%rax
cmp $0x40,%rdx
jae LABEL(above_64)
LABEL(below_64):
cmp $0x20, %rdx
jb LABEL(below_32)
vmovdqu %ymm0,(%rdi)
vmovdqu %ymm0,-0x20(%rdi,%rdx)
vzeroupper
retq
LABEL(below_32):
cmp $0x10, %rdx
jae LABEL(in_16_to_32)
LABEL(below_16):
cmp $0x4, %rdx
jbe LABEL(below_4)
LABEL(in_4_to_16):
// Scalar stores from this point.
vmovq %xmm0, %rsi
cmp $0x7, %rdx
jbe LABEL(in_4_to_8)
// two 8-wide stores, up to 16 bytes.
mov %rsi, -0x8(%rdi, %rdx)
mov %rsi,(%rdi)
vzeroupper
retq
.align 4
LABEL(below_4):
test %rdx, %rdx
je LABEL(exit)
mov %sil, (%rdi)
mov %sil, -0x1(%rdi,%rdx)
cmp $0x2, %rdx
jbe LABEL(exit)
mov %sil, 0x1(%rdi)
mov %sil, 0x2(%rdi)
mov %rdi,%rax
.align 4
LABEL(exit):
vzeroupper
retq
LABEL(in_4_to_8):
// two 4-wide stores, up to 8 bytes.
mov %esi,-0x4(%rdi,%rdx)
mov %esi,(%rdi)
vzeroupper
retq
LABEL(in_16_to_32):
vmovups %xmm0,(%rdi)
vmovups %xmm0,-0x10(%rdi,%rdx)
vzeroupper
retq
LABEL(above_64):
cmp $0xb0, %rdx
ja LABEL(above_192)
cmp $0x80, %rdx
jbe LABEL(in_64_to_128)
// Do some work filling unaligned 32-byte words.
// last_word -> rsi
lea -0x20(%rdi,%rdx),%rsi
// rdi is the fill pointer.
// We have at least 128 bytes to store.
vmovdqu %ymm0,(%rdi)
vmovdqu %ymm0, 0x20(%rdi)
vmovdqu %ymm0, 0x40(%rdi)
add $0x60,%rdi
.align 8
LABEL(fill_32):
vmovdqu %ymm0,(%rdi)
add $0x20,%rdi
cmp %rdi,%rsi
ja LABEL(fill_32)
// Stamp the last unaligned store.
vmovdqu %ymm0,(%rsi)
vzeroupper
retq
LABEL(in_64_to_128):
// last_word -> rsi
vmovdqu %ymm0,(%rdi)
vmovdqu %ymm0, 0x20(%rdi)
vmovdqu %ymm0,-0x40(%rdi,%rdx)
vmovdqu %ymm0,-0x20(%rdi,%rdx)
vzeroupper
retq
LABEL(above_192):
// rdi is the buffer address
// rsi is the value
// rdx is length
// Store the first unaligned 32 bytes.
vmovdqu %ymm0,(%rdi)
// Compute the address of the first aligned 32-byte word into %rsi.
mov %rdi,%rsi
and $0xffffffffffffffe0,%rsi
lea 0x20(%rsi),%rsi
// Compute the address of the last unaligned word into rdi.
lea -0x20(%rdx), %rdx
add %rdx, %rdi
// Check if we can do a full 5x32B stamp.
lea 0xa0(%rsi),%rcx
cmp %rcx, %rdi
jb LABEL(stamp_4)
.align 8
LABEL(fill_192):
vmovdqa %ymm0,(%rsi)
vmovdqa %ymm0,0x20(%rsi)
vmovdqa %ymm0,0x40(%rsi)
vmovdqa %ymm0,0x60(%rsi)
vmovdqa %ymm0,0x80(%rsi)
add $0xa0, %rsi
lea 0xa0(%rsi),%rcx
cmp %rcx, %rdi
ja LABEL(fill_192)
LABEL(fill_192_tail):
cmp %rsi, %rdi
jb LABEL(fill_192_done)
vmovdqa %ymm0, (%rsi)
lea 0x20(%rsi),%rcx
cmp %rcx, %rdi
jb LABEL(fill_192_done)
vmovdqa %ymm0, 0x20(%rsi)
lea 0x40(%rsi),%rcx
cmp %rcx, %rdi
jb LABEL(fill_192_done)
vmovdqa %ymm0, 0x40(%rsi)
lea 0x60(%rsi),%rcx
cmp %rcx, %rdi
jb LABEL(fill_192_done)
vmovdqa %ymm0, 0x60(%rsi)
LABEL(last_wide_store):
lea 0x80(%rsi),%rcx
cmp %rcx, %rdi
jb LABEL(fill_192_done)
vmovdqa %ymm0, 0x80(%rsi)
LABEL(fill_192_done):
// Stamp the last word.
vmovdqu %ymm0,(%rdi)
vzeroupper
ret
LABEL(stamp_4):
vmovdqa %ymm0,(%rsi)
vmovdqa %ymm0,0x20(%rsi)
vmovdqa %ymm0,0x40(%rsi)
vmovdqa %ymm0,0x60(%rsi)
jmp LABEL(last_wide_store)
================================================
FILE: src/memset/impl.c
================================================
#include "types.h"
#include <stddef.h>
#include <stdint.h>
// Handle memsets of sizes 0..31.
static inline void *small_memset(void *s, int c, size_t n) {
if (n < 5) {
if (n == 0)
return s;
char *p = s;
p[0] = c;
p[n - 1] = c;
if (n <= 2)
return s;
p[1] = c;
p[2] = c;
return s;
}
if (n <= 16) {
uint64_t val8 = ((uint64_t)0x0101010101010101L * ((uint8_t)c));
if (n >= 8) {
char *first = s;
char *last = (char *)s + n - 8;
*((u64 *)first) = val8;
*((u64 *)last) = val8;
return s;
}
uint32_t val4 = val8;
char *first = s;
char *last = (char *)s + n - 4;
*((u32 *)first) = val4;
*((u32 *)last) = val4;
return s;
}
char X = c;
char *p = s;
char16 val16 = {X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X};
char *last = (char *)s + n - 16;
*((char16 *)last) = val16;
*((char16 *)p) = val16;
return s;
}
static inline void *huge_memset(void *s, int c, size_t n) {
char *p = s;
char X = c;
char32 val32 = {X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X,
X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X};
// Stamp the first 32-byte store.
*((char32 *)p) = val32;
char *first_aligned = p + 32 - ((uint64_t)p % 32);
char *buffer_end = p + n;
char *last_word = buffer_end - 32;
// Align the next stores.
p = first_aligned;
// Unroll the body of the loop to increase parallelism.
while (p + (32 * 5) < buffer_end) {
*((char32a *)p) = val32;
p += 32;
*((char32a *)p) = val32;
p += 32;
*((char32a *)p) = val32;
p += 32;
*((char32a *)p) = val32;
p += 32;
*((char32a *)p) = val32;
p += 32;
}
// Complete the last few iterations:
#define TRY_STAMP_32_BYTES \
if (p < last_word) { \
*((char32a *)p) = val32; \
p += 32; \
}
TRY_STAMP_32_BYTES
TRY_STAMP_32_BYTES
TRY_STAMP_32_BYTES
TRY_STAMP_32_BYTES
// Stamp the last unaligned word.
*((char32 *)last_word) = val32;
return s;
}
void *local_memset(void *s, int c, size_t n) {
char *p = s;
char X = c;
if (n < 32) {
return small_memset(s, c, n);
}
if (n > 160) {
return huge_memset(s, c, n);
}
char32 val32 = {X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X,
X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X};
char *last_word = (char *)s + n - 32;
// Stamp the 32-byte chunks.
do {
*((char32 *)p) = val32;
p += 32;
} while (p < last_word);
// Stamp the last unaligned 32 bytes of the buffer.
*((char32 *)last_word) = val32;
return s;
}
/// This is a memset implementation that was copied from musl. We only use it
/// for benchmarking.
void *musl_memset(void *dest, int c, size_t n) {
unsigned char *s = dest;
size_t k;
/* Fill head and tail with minimal branching. Each
* conditional ensures that all the subsequently used
* offsets are well-defined and in the dest region. */
if (!n)
return dest;
s[0] = c;
s[n - 1] = c;
if (n <= 2)
return dest;
s[1] = c;
s[2] = c;
s[n - 2] = c;
s[n - 3] = c;
if (n <= 6)
return dest;
s[3] = c;
s[n - 4] = c;
if (n <= 8)
return dest;
/* Advance pointer to align it at a 4-byte boundary,
* and truncate n to a multiple of 4. The previous code
* already took care of any head/tail that get cut off
* by the alignment. */
k = -(uintptr_t)s & 3;
s += k;
n -= k;
n &= -4;
#ifdef __GNUC__
typedef uint32_t __attribute__((__may_alias__)) u32;
typedef uint64_t __attribute__((__may_alias__)) u64;
u32 c32 = ((u32)-1) / 255 * (unsigned char)c;
/* In preparation to copy 32 bytes at a time, aligned on
* an 8-byte boundary, fill head/tail up to 28 bytes each.
* As in the initial byte-based head/tail fill, each
* conditional below ensures that the subsequent offsets
* are valid (e.g. !(n<=24) implies n>=28). */
*(u32 *)(s + 0) = c32;
*(u32 *)(s + n - 4) = c32;
if (n <= 8)
return dest;
*(u32 *)(s + 4) = c32;
*(u32 *)(s + 8) = c32;
*(u32 *)(s + n - 12) = c32;
*(u32 *)(s + n - 8) = c32;
if (n <= 24)
return dest;
*(u32 *)(s + 12) = c32;
*(u32 *)(s + 16) = c32;
*(u32 *)(s + 20) = c32;
*(u32 *)(s + 24) = c32;
*(u32 *)(s + n - 28) = c32;
*(u32 *)(s + n - 24) = c32;
*(u32 *)(s + n - 20) = c32;
*(u32 *)(s + n - 16) = c32;
/* Align to a multiple of 8 so we can fill 64 bits at a time,
* and avoid writing the same bytes twice as much as is
* practical without introducing additional branching. */
k = 24 + ((uintptr_t)s & 4);
s += k;
n -= k;
/* If this loop is reached, 28 tail bytes have already been
* filled, so any remainder when n drops below 32 can be
* safely ignored. */
u64 c64 = c32 | ((u64)c32 << 32);
for (; n >= 32; n -= 32, s += 32) {
*(u64 *)(s + 0) = c64;
*(u64 *)(s + 8) = c64;
*(u64 *)(s + 16) = c64;
*(u64 *)(s + 24) = c64;
}
#else
/* Pure C fallback with no aliasing violations. */
for (; n; n--, s++)
*s = c;
#endif
return dest;
}
================================================
FILE: src/memset/shims.c
================================================
#include "decl.h"
////////////////////////////////////////////////////////////////////////////////
/// This is a small utility that swaps the builtin call to memset with the
/// local implementation of memset, implemented in this project.
/// The shared object can be loaded using LD_PRELOAD (on Linux) or
/// DYLD_INSERT_LIBRARIES (on Mac).
////////////////////////////////////////////////////////////////////////////////
void *memset(void *s, int c, size_t n) { return local_memset(s, c, n); }
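The shim can be exercised by preloading it into an unmodified binary. A hypothetical session (library paths depend on your build tree and platform; the names below follow CMake's defaults for the mem_shim target):

```shell
# Build the project, then preload the shim so that every memset in the
# target process is served by local_memset. Paths are illustrative.
cmake -B build && cmake --build build

# Linux: interpose via the dynamic loader.
LD_PRELOAD=$PWD/build/src/memset/libmem_shim.so ls -l

# macOS: the equivalent mechanism (SIP may block system binaries).
# DYLD_INSERT_LIBRARIES=$PWD/build/src/memset/libmem_shim.dylib ls -l
```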
================================================
FILE: src/memset/test_memset.cc
================================================
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <vector>
#include "decl.h"
#include "utils.h"
////////////////////////////////////////////////////////////////////////////////
// This is a small program that checks if some memset implementation is correct.
// The tool currently checks libc, musl and the local implementation.
////////////////////////////////////////////////////////////////////////////////
#define MAGIC_VALUE0 'X'
#define MAGIC_VALUE1 'O'
void print_buffer(const char *start, const char *end, char val,
const char *ptr) {
const char *it = start;
while (it != end) {
std::cout << *it;
it++;
}
std::cout << "\n";
it = start;
while (it != ptr) {
std::cout << " ";
it++;
}
std::cout << "^\n";
std::cout << "Filling a buffer of length " << end - start << ".";
std::cout << " Expected \"" << val << "\" at index " << ptr - start << "\n";
}
void assert_uniform_value(const char *start, const char *end, char val) {
const char *ptr = start;
while (ptr != end) {
if (val != *ptr) {
print_buffer(start, end, val, ptr);
std::cout.flush();
abort();
}
ptr++;
}
}
void test_impl(memset_ty handle, const std::string &name, unsigned chunk_size) {
std::vector<char> memory(chunk_size + 512, MAGIC_VALUE0);
// Start mem-setting the array at different offsets.
for (int offset = 0; offset < 128; offset++) {
const char *buffer_start = memory.data();
const char *buffer_end = memory.data() + memory.size();
const char *region_start = &memory[offset];
const char *region_end = region_start + chunk_size;
assert_uniform_value(buffer_start, buffer_end, MAGIC_VALUE0);
(handle)((void *)region_start, MAGIC_VALUE1, chunk_size);
// Check the chunk.
assert_uniform_value(region_start, region_end, MAGIC_VALUE1);
// Check before chunk.
assert_uniform_value(buffer_start, region_start, MAGIC_VALUE0);
// Check after chunk.
assert_uniform_value(region_end, buffer_end, MAGIC_VALUE0);
// Reset the buffer:
std::fill(memory.begin(), memory.end(), MAGIC_VALUE0);
assert_uniform_value(buffer_start, buffer_end, MAGIC_VALUE0);
}
}
int main(int argc, char **argv) {
std::cout << "Testing memset... \n";
#define TEST(FUNC, SIZE) test_impl(FUNC, #FUNC, SIZE);
for (int i = 0; i < 1024; i++) {
TEST(libc_memset, i);
TEST(local_memset, i);
TEST(musl_memset, i);
TEST(asm_memset, i);
}
std::cout << "Done.\n";
return 0;
}
================================================
FILE: src/utils/CMakeLists.txt
================================================
add_library(hist_tool SHARED
hist_tool.c
)
set_target_properties(hist_tool PROPERTIES
VERSION ${PROJECT_VERSION}
SOVERSION 1
)
target_compile_options(hist_tool PRIVATE "-fno-builtin")
install(TARGETS hist_tool LIBRARY DESTINATION bin)
================================================
FILE: src/utils/hist_tool.c
================================================
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
////////////////////////////////////////////////////////////////////////////////
/// This is a small utility that intercepts calls to memset and memcpy and
/// records histograms of the call lengths and pointer alignments. It prints
/// the histograms when the program terminates. The shared object can be
/// loaded using LD_PRELOAD (on Linux) or DYLD_INSERT_LIBRARIES (on Mac).
////////////////////////////////////////////////////////////////////////////////
uint32_t memset_len_dist[32] = {
0,
};
uint32_t memcpy_len_dist[32] = {
0,
};
uint32_t align_dist[32] = {
0,
};
const int tab32[32] = {0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16,
18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17,
24, 7, 19, 27, 23, 6, 26, 5, 4, 31};
int log2_32(uint32_t value) {
value |= value >> 1;
value |= value >> 2;
value |= value >> 4;
value |= value >> 8;
value |= value >> 16;
return tab32[(uint32_t)(value * 0x07C4ACDD) >> 27];
}
void __attribute__((destructor)) print_histograms() {
FILE *ff = fopen("/tmp/hist.txt", "a+");
if (!ff) {
return;
}
pid_t pid = getpid();
fprintf(ff, "Histogram for (%d):\n", pid);
fprintf(ff, "size, memset, memcpy, alignment:\n");
for (int i = 0; i < 32; i++) {
fprintf(ff, "%d, %d, %d, %d,\n", i, memset_len_dist[i], memcpy_len_dist[i], align_dist[i]);
}
fclose(ff);
}
void *memcpy(void *dest, const void *src, size_t len) {
memcpy_len_dist[log2_32(len)]++;
align_dist[(unsigned long)dest % 32]++;
align_dist[(unsigned long)src % 32]++;
char *d = (char *)dest;
const char *s = (const char *)src;
for (size_t i = 0; i < len; i++) {
d[i] = s[i];
}
return dest;
}
void *memset(void *s, int c, size_t len) {
memset_len_dist[log2_32(len)]++;
align_dist[(unsigned long)s % 32]++;
char *p = s;
for (size_t i = 0; i < len; i++) {
p[i] = c;
}
return s;
}