[
  {
    "path": ".gitignore",
    "content": "*.o\n*.swn\n*.swo\n*.swp\n*~\n.DS_Store\n*.so\n*.dylib\n\nGPATH\nGRTAGS\nGTAGS\ntags\n\ncompile_commands.json\n\ntoolchain/\nllvm-project/\ngcc-project/\nbuild*/\n.vscode/\n.vim/\n.idea/\n"
  },
  {
    "path": "CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.7)\n\nproject(bpf_tracer VERSION 1.0.0 DESCRIPTION \"Memset benchmarks\")\n\nset(CMAKE_CXX_STANDARD 14)\nset(CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CXX_EXTENSIONS OFF)\n\n# Export a JSON file with the compilation commands that external tools can use\n# to analyze the source code of the project.\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\n\nenable_language(C ASM)\n\n# Disable exceptions\nSET (CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} \"-fno-rtti \")\n\nif(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)\n  message(STATUS \"No build type selected, default to Release\")\n  set(CMAKE_BUILD_TYPE \"RelWithDebInfo\" CACHE STRING \"Build type (default RelWithDebInfo)\" FORCE)\nendif()\n\nadd_compile_options(-Wall -g3 -O3 -march=native)\n\nset(CMAKE_CXX_FLAGS_RELEASE \"${CMAKE_CXX_FLAGS_RELEASE} -Wall -march=native\")\nset(CMAKE_CXX_FLAGS_DEBUG \"${CMAKE_CXX_FLAGS_DEBUG} -Wall -fno-omit-frame-pointer -O0\")\n\n# Place all of the binaries in the build directory.\nset (CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})\nset (CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})\n\ninclude_directories(BEFORE\n  ${CMAKE_CURRENT_BINARY_DIR}/include\n  ${CMAKE_CURRENT_SOURCE_DIR}/include\n  )\n\nadd_subdirectory(src/memset/)\nadd_subdirectory(src/memcpy/)\nadd_subdirectory(src/utils/)\n"
  },
  {
    "path": "README.md",
    "content": "# Fast Memset and Memcpy implementations\n\n*UPDATE*: Ilya Albrecht landed the memset implementation from this repo into [Folly](https://github.com/facebook/folly/blob/main/folly/memset.S).\n\nThis repository contains high-performance implementations of memset and memcpy.\nThese implementations outperform the folly and glibc implementations.  This\nrepository contains several reference implementations in C and assembly.  The\nhigh-performance implementations are found in the files called \"impl.S\".\n\nBefore reading the source code in this repository you probably want to read an\nexcellent blog [post](https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines/)\nby Joe Bialek about his work to optimize memset for windows.\n\nThe charts below compare the code in this repo with other implementations:\nfolly, musl, and glibc.  The glibc implementations are measured with and without\nthe elf indirection, as suggested by Dave Zarzycki.\n\n## Memset\n![Memset](docs/memset_bench.png)\n\n## Memcpy\n![Memcpy](docs/memcpy_bench.png)\n\nThe chart below compares the performance of different memset implementations on\nbuffers of varying sizes and offsets. Unlike the hot loop that hammers a single\nvalue, this benchmark is more realistic and takes into account mispredicted\nbranches and the performance of the cpu decoder. The buffers are in the size\nrange 0 to 256. The random function is made of pre-computed random values, to\nlower the overhead of the random function.  This was suggested by Yann Collet.\nThe 'nop' function is used to compute the benchmark setup and call overhead. The\nnumbers below represent the implementation execution time minus the nop function\ntime.\n\n![memset](docs/memset_r.png) ![memcpy](docs/memcpy_r.png)\n\nThe size of the buffer that memset and memcpy mutates is typically small. The\npicture below presents the buffer length distribution in google-chrome. 
Vim,\nPython, and even server workloads have a similar distribution. The values in the\nchart represent the power-of-two buffer size (10 represents sizes between\n512 and 1024).\n\n![Histogram](docs/hist.png)\n\n\nThe chart below presents a histogram of pointer alignment (from the game\nMinecraft). Most of the pointers passed to memset and memcpy are aligned to an\n8-byte boundary. Some programs have histograms that are not as sharp, meaning\nthat more values are not aligned to a 4- or 8-byte boundary.\n\n![Pointer Alignment](docs/align.png)\n\n\nMemcpy and memset are frequently called by low-level high-performance libraries.\nHere is an example of a stack trace from the Firefox codebase:\n\n```\n  (gdb) bt\n  #0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:225\n  #1  in memcpy (__dest=, __src=, __len=40) at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:34\n  #2  mozilla::BufferList<InfallibleAllocPolicy>::ReadBytes(mozilla::BufferList<InfallibleAllocPolicy>::IterImpl&, char*, unsigned long) const\n  #3  Pickle::ReadBytesInto(PickleIterator*, void*, unsigned int) const (this=, iter=, data=, length=<optimized out>)\n  #4  in IPC::Message::ReadFooter(void*, unsigned int, bool) (this=, buffer=, buffer_len=40, truncate=true)\n  #5  in mozilla::ipc::NodeController::DeserializeEventMessage(mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >) (this=, aMessage=...)\n  #6  in mozilla::ipc::NodeController::OnEventMessage(mojo::core::ports::NodeName const&, mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >)\n  #7  in mozilla::ipc::NodeChannel::OnMessageReceived(IPC::Message&&) (this=<optimized out>, aMessage=...)\n  #8  in IPC::Channel::ChannelImpl::ProcessIncomingMessages() (this=<optimized out>)\n  #9  in IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking(int) (this=, fd=)\n  #10 in base::MessagePumpLibevent::OnLibeventNotification(int, short, 
void*) (fd=, flags=, context=)\n  #11 in event_persist_closure (base=, ev=) at /build/firefox-HSiFn6/firefox-94.0+build3/ipc/chromium/src/third_party/libevent/event.c:1580\n  #12 event_process_active_single_queue (base=, activeq=, max_to_process=, endtime=)\n\n```\n\nThe repository contains a few utilities for testing and measuring the\nperformance and correctness of memset and memcpy.\n\n## Test tool\n\nThis is a small test harness that verifies the correctness of the\nimplementations. It is easy to make off-by-one mistakes and run into alignment\nissues. The exhaustive tester catches these issues.\n\nThis is a sample output:\n```\nOOOOOOOOOOOXX\n           ^\nFilling a buffer of length 13. Expected \"O\" at index 11\n```\n\n## Benchmark tool\n\nThe benchmark tool measures the performance of the system libc and the local\nimplementation. The benchmarking tool runs each of the implementations in a loop\nmillions of times. It runs the benchmark several times and picks the least noisy\nresults. It's a good idea to run the benchmark tool and compare an\nimplementation to itself to assess the noise level in the system. The\nbenchmarking tool uses a trampoline to prevent the compiler from inlining and\nexpanding the calls.\n\n## Histogram tool\n\nThe histogram tool is a shared object that records calls to memset and\nmemcpy and creates a histogram of the length parameter. It prints the histogram\nwhen the program exits cleanly. The shared object can be loaded using\nLD\\_PRELOAD (on Linux) or DYLD\\_INSERT\\_LIBRARIES (on Mac). Each bucket in the\noutput represents the log2 size of the buffer, and each value represents the\nnumber of hits for the bucket.\n\n## Proxy tool\n\nThis is a small utility that swaps the built-in memset and memcpy calls with\nthe local implementations from this project. The shared object can be loaded\nusing LD\\_PRELOAD (on Linux) or DYLD\\_INSERT\\_LIBRARIES (on Mac).\n\n"
  },
  {
    "path": "docs/annotated_glibc.txt",
    "content": "               <+0>:   endbr64 \n               <+4>:   vmovd  %esi, %xmm0\n               <+8>:   movq   %rdi, %rax\n               <+11>:  vpbroadcastb %xmm0, %ymm0\n               <+16>:  cmpq   $0x20, %rdx\n               <+20>:  jb     0xBELOW_32____            ; <+190>\n               <+26>:  cmpq   $0x40, %rdx\n               <+30>:  ja     0xABOVE_64____            ; <+46>\n               <+32>:  vmovdqu %ymm0, -0x20(%rdi,%rdx)\n               <+38>:  vmovdqu %ymm0, (%rdi)\n               <+42>:  vzeroupper \n               <+45>:  retq   \n0xABOVE_64____ <+46>:  cmpq   $0x800, %rdx              ; imm = 0x800 \n               <+53>:  ja     0xABOVE_2048__            ; ___lldb_unnamed_symbol1097$$libc.so.6 + 4\n               <+55>:  cmpq   $0x80, %rdx\n               <+62>:  ja     0xABOVE_128___            ; <+89>\n0xSZ_64_TO_128 <+64>:  vmovdqu %ymm0, (%rdi)\n               <+68>:  vmovdqu %ymm0, 0x20(%rdi)\n               <+73>:  vmovdqu %ymm0, -0x20(%rdi,%rdx)\n               <+79>:  vmovdqu %ymm0, -0x40(%rdi,%rdx)\n0xEXIT_EXIT___ <+85>:  vzeroupper \n               <+88>:  retq   \n0xABOVE_128___ <+89>:  leaq   0x80(%rdi), %rcx\n               <+96>:  vmovdqu %ymm0, (%rdi)\n               <+100>: andq   $-0x80, %rcx\n               <+104>: vmovdqu %ymm0, -0x20(%rdi,%rdx)\n               <+110>: vmovdqu %ymm0, 0x20(%rdi)\n               <+115>: vmovdqu %ymm0, -0x40(%rdi,%rdx)\n               <+121>: vmovdqu %ymm0, 0x40(%rdi)\n               <+126>: vmovdqu %ymm0, -0x60(%rdi,%rdx)\n               <+132>: vmovdqu %ymm0, 0x60(%rdi)\n               <+137>: vmovdqu %ymm0, -0x80(%rdi,%rdx)\n               <+143>: addq   %rdi, %rdx\n               <+146>: andq   $-0x80, %rdx\n               <+150>: cmpq   %rdx, %rcx\n               <+153>: je     0xEXIT_EXIT___            ; <+85>\n0xLOOP_4x32B__ <+155>: vmovdqa %ymm0, (%rcx)\n               <+159>: vmovdqa %ymm0, 0x20(%rcx)\n               <+164>: vmovdqa %ymm0, 0x40(%rcx)\n               
<+169>: vmovdqa %ymm0, 0x60(%rcx)\n               <+174>: addq   $0x80, %rcx\n               <+181>: cmpq   %rcx, %rdx\n               <+184>: jne    0xLOOP_4x32B__            ; <+155>\n               <+186>: vzeroupper \n               <+189>: retq   \n0xBELOW_32____ <+190>: cmpb   $0x10, %dl\n               <+193>: jae    0xBELOW_16____            ; <+223>\n               <+195>: vmovq  %xmm0, %rcx\n               <+200>: cmpb   $0x8, %dl\n               <+203>: jae    0xABOVE_8_____            ; <+237>\n               <+205>: cmpb   $0x4, %dl\n               <+208>: jae    0xABOVE_4_____            ; <+249>\n               <+210>: cmpb   $0x1, %dl\n               <+213>: ja     0xABOVE_1_____            ; <+259>\n               <+215>: jb     0xIS_ZERO_CASE            ; <+219>\n               <+217>: movb   %cl, (%rdi)\n0xIS_ZERO_CASE <+219>: vzeroupper \n               <+222>: retq   \n0xBELOW_16____ <+223>: vmovdqu %xmm0, -0x10(%rdi,%rdx)\n               <+229>: vmovdqu %xmm0, (%rdi)\n               <+233>: vzeroupper \n               <+236>: retq   \n0xABOVE_8_____ <+237>: movq   %rcx, -0x8(%rdi,%rdx)\n               <+242>: movq   %rcx, (%rdi)\n               <+245>: vzeroupper \n               <+248>: retq   \n0xABOVE_4____ <+249>: movl   %ecx, -0x4(%rdi,%rdx)\n               <+253>: movl   %ecx, (%rdi)\n               <+255>: vzeroupper \n               <+258>: retq   \n0xABOVE_1_____ <+259>: movw   %cx, -0x2(%rdi,%rdx)\n               <+264>: movw   %cx, (%rdi)\n               <+267>: vzeroupper \n               <+270>: retq   \n               <+271>: nop    \n\n"
  },
  {
    "path": "include/decl.h",
    "content": "#ifndef DECLS\n#define DECLS\n\n#include <stddef.h>\n\n#ifdef __cplusplus\n\nusing memset_ty = void *(void *s, int c, size_t n);\nusing memcpy_ty = void *(void *dest, const void *src, size_t n);\n\nextern \"C\" {\n#endif\n\nvoid *memcpy(void *dest, const void *src, size_t n);\nvoid *__folly_memcpy(void *dest, const void *src, size_t n);\nvoid *libc_memcpy(void *dest, const void *src, size_t n);\nvoid *local_memcpy(void *dest, const void *src, size_t n);\nvoid *asm_memcpy(void *dest, const void *src, size_t n);\n\nvoid *memset(void *s, int c, size_t n);\nvoid *libc_memset(void *s, int c, size_t n);\nvoid *local_memset(void *s, int c, size_t n);\nvoid *asm_memset(void *s, int c, size_t n);\nvoid *musl_memset(void *s, int c, size_t n);\n\n#ifdef __cplusplus\n}\n#endif\n\n#endif // DECLS\n"
  },
  {
    "path": "include/types.h",
    "content": "#ifndef TYPES\n#define TYPES\n\n#include <stdint.h>\n\n#define NO_INLINE __attribute__((noinline))\n\n#ifdef __clang__\ntypedef char char8 __attribute__((ext_vector_type(8), aligned(1)));\ntypedef char char16 __attribute__((ext_vector_type(16), aligned(1)));\ntypedef char char32 __attribute__((ext_vector_type(32), aligned(1)));\ntypedef char char32a __attribute__((ext_vector_type(32), aligned(32)));\n\n#else\n// __GNUC__\ntypedef char char8 __attribute__((vector_size(8), aligned(1)));\ntypedef char char16 __attribute__((vector_size(16), aligned(1)));\ntypedef char char32 __attribute__((vector_size(32), aligned(1)));\ntypedef char char32a __attribute__((vector_size(32), aligned(32)));\n#endif\n\ntypedef uint32_t __attribute__((aligned(1))) u32;\ntypedef uint64_t __attribute__((aligned(1))) u64;\n\n#endif // TYPES\n"
  },
  {
    "path": "include/utils.h",
    "content": "#ifndef UTILS_H\n#define UTILS_H\n\n#include <algorithm>\n#include <chrono>\n#include <string>\n\n#include \"types.h\"\n\n/// Aligns the pointer \\p ptr, to alignment \\p alignment and offset \\p offset\n/// within the word.\nvoid *align_pointer(void *ptr, unsigned alignment, unsigned offset) {\n  size_t p = (size_t)ptr;\n  while (p % alignment)\n    ++p;\n  return (void *)(p + (size_t)offset);\n}\n\nusing time_point = std::chrono::steady_clock::time_point;\n\nclass Stopwatch {\n  /// The time of the last sample;\n  time_point begin_;\n  /// A list of recorded intervals.\n  std::vector<uint64_t> intervals_;\n\npublic:\n  NO_INLINE\n  Stopwatch() : begin_() {}\n\n  NO_INLINE\n  void start() { begin_ = std::chrono::steady_clock::now(); }\n\n  NO_INLINE\n  void stop() {\n    time_point end = std::chrono::steady_clock::now();\n    uint64_t interval =\n        std::chrono::duration_cast<std::chrono::microseconds>(end - begin_)\n            .count();\n    intervals_.push_back(interval);\n  }\n\n  NO_INLINE\n  uint64_t get_median() {\n    std::sort(intervals_.begin(), intervals_.end());\n    return intervals_[intervals_.size() / 2];\n  }\n};\n\nuint8_t random_bytes[320] = {\n    227, 138, 244, 198, 73,  247, 185, 248, 229, 75,  24,  215, 159, 230, 136,\n    246, 200, 144, 65,  67,  109, 86,  118, 61,  209, 103, 188, 213, 187, 8,\n    210, 121, 214, 178, 232, 59,  153, 92,  209, 239, 44,  85,  156, 172, 237,\n    41,  150, 195, 247, 202, 249, 142, 208, 133, 21,  204, 114, 38,  51,  150,\n    194, 46,  184, 138, 50,  250, 190, 180, 161, 5,   211, 191, 62,  137, 142,\n    122, 63,  72,  233, 125, 189, 51,  238, 51,  116, 10,  44,  18,  240, 41,\n    157, 81,  183, 252, 214, 17,  81,  12,  44,  119, 77,  97,  101, 80,  106,\n    128, 190, 89,  160, 104, 244, 192, 46,  69,  73,  255, 45,  213, 190, 86,\n    18,  89,  34,  46,  134, 145, 166, 128, 87,  97,  192, 71,  105, 94,  51,\n    30,  7,   9,   0,   40,  0,   187, 205, 189, 151, 159, 107, 105, 180, 182,\n  
  233, 52,  209, 108, 186, 31,  184, 254, 170, 71,  162, 31,  80,  226, 75,\n    125, 214, 125, 247, 197, 149, 132, 247, 157, 253, 101, 107, 1,   127, 236,\n    249, 242, 152, 169, 123, 240, 129, 230, 135, 25,  57,  227, 130, 189, 76,\n    254, 33,  193, 39,  82,  177, 143, 31,  17,  20,  195, 219, 165, 171, 198,\n    125, 119, 216, 143, 55,  210, 17,  88,  150, 126, 38,  160, 71,  214, 10,\n    162, 158, 6,   234, 233, 119, 221, 167, 62,  146, 50,  150, 176, 142, 167,\n    201, 250, 195, 26,  156, 96,  36,  177, 95,  23,  7,   63,  55,  142, 80,\n    227, 73,  124, 93,  211, 231, 166, 182, 57,  145, 55,  242, 213, 246, 30,\n    146, 247, 19,  229, 34,  210, 37,  147, 242, 103, 125, 91,  171, 51,  22,\n    126, 248, 149, 19,  60,  89,  5,   241, 132, 72,  217, 195, 11,  173, 247,\n    47,  144, 222, 94,  51,  166, 192, 50,  109, 62,  42,  126, 111, 204, 141,\n    66,\n};\n\n/// Implements a doom-style random number generator.\nstruct DoomRNG {\n  // Points to the current random number.\n  unsigned rand_curr = 0;\n\n  void rand_reset() { rand_curr = 0; }\n\n  uint8_t next_u8_random() { return random_bytes[rand_curr++ % 320]; }\n};\n\n#endif // UTILS_H\n"
  },
  {
    "path": "src/memcpy/CMakeLists.txt",
    "content": "add_executable(test_memcpy\n                 test_memcpy.cc\n                 folly.S\n                 impl.S\n                 impl.c\n                 )\n\ntarget_link_libraries(test_memcpy PUBLIC)\n\nadd_executable(bench_memcpy\n                 bench_memcpy.cc\n                 folly.S\n                 impl.S\n                 impl.c\n                 )\n\ninstall(TARGETS bench_memcpy DESTINATION bin)\ninstall(TARGETS test_memcpy DESTINATION bin)\n\n"
  },
  {
    "path": "src/memcpy/bench_memcpy.cc",
    "content": "#include <algorithm>\n#include <cstring>\n#include <iomanip>\n#include <iostream>\n#include <unistd.h>\n#include <vector>\n\n#include \"decl.h\"\n#include \"utils.h\"\n\n////////////////////////////////////////////////////////////////////////////////\n// This is a small program that compares two memcpy implementations and records\n// the output in a csv file.\n////////////////////////////////////////////////////////////////////////////////\n\n#define ITER (1000L * 1000L * 10L)\n#define SAMPLES (20)\n\nDoomRNG RNG;\n\n/// Measure a single implementation \\p handle.\nuint64_t measure(memcpy_ty handle, void *dest, void *src, unsigned size) {\n  Stopwatch T;\n  for (unsigned i = 0; i < SAMPLES; i++) {\n    T.start();\n    for (size_t j = 0; j < ITER; j++) {\n      (handle)(dest, src, size);\n    }\n    T.stop();\n  }\n  return T.get_median();\n}\n\n// Allocate memory and benchmark each implementation at a specific size \\p size.\nvoid bench_impl(const std::vector<memcpy_ty *> &toTest, unsigned size,\n                unsigned align, unsigned offset) {\n  std::vector<char> dest(size + 256, 0);\n  std::vector<char> src(size + 256, 0);\n\n  char *src_ptr = (char *)align_pointer(&src[0], align, offset);\n  char *dest_ptr = (char *)align_pointer(&dest[0], align, offset);\n\n  std::cout << size << \", \";\n  for (auto handle : toTest) {\n    u_int64_t res = measure(handle, dest_ptr, src_ptr, size);\n    std::cout << res << \", \";\n  }\n  std::cout << std::endl;\n}\n\n/// Allocate and copy buffers at random offsets and in random sizes.\n/// The sizes and the offsets are in the range 0..256.\nvoid bench_rand_range(const std::vector<memcpy_ty *> &toTest) {\n  std::vector<char> dest(4096, 1);\n  std::vector<char> src(4096, 0);\n  const char *src_p = &src[0];\n  char *dest_p = &dest[0];\n\n  for (auto handle : toTest) {\n    Stopwatch T;\n    sleep(1);\n    for (unsigned i = 0; i < SAMPLES; i++) {\n      RNG.rand_reset();\n      T.start();\n      for (size_t j = 
0; j < ITER; j++) {\n        char *to = dest_p + RNG.next_u8_random();\n        const char *from = src_p + RNG.next_u8_random();\n        (handle)(to, from, RNG.next_u8_random());\n      }\n      T.stop();\n    }\n\n    std::cout << T.get_median() << \", \";\n  }\n  std::cout << std::endl;\n}\n\n// To measure the call overhead.\nvoid *nop(void *dest, const void *src, size_t n) { return dest; }\n\nint main(int argc, char **argv) {\n  std::cout << std::setprecision(3);\n  std::cout << std::fixed;\n\n  std::vector<memcpy_ty *> toTest = {\n      &libc_memcpy, &memcpy, &__folly_memcpy, &local_memcpy, &asm_memcpy, &nop};\n\n  std::cout << \"Batches of random sizes:\\n\";\n  std::cout << \"libc@plt, libc, folly, c_memcpy, asm_memcpy, nop,\\n\";\n\n  bench_rand_range(toTest);\n\n  std::cout << \"\\nFixed size:\\n\";\n  std::cout << \"size, libc@plt, libc, folly, c_memcpy, asm_memcpy, nop,\\n\";\n\n  for (int i = 0; i < 512; i++) {\n    bench_impl(toTest, i, 16, 0);\n  }\n\n  return 0;\n}\n"
  },
  {
    "path": "src/memcpy/folly.S",
    "content": "/*\n * Copyright (c) Facebook, Inc. and its affiliates.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n/*\n * __folly_memcpy: An optimized memcpy implementation that uses prefetch and\n * AVX2 instructions.\n *\n * This implementation of memcpy acts as a memmove, but it is not optimized for\n * this purpose. While overlapping copies are undefined in memcpy, this\n * implementation acts like memmove for sizes up through 256 bytes and will\n * detect overlapping copies and call memmove for overlapping copies of 257 or\n * more bytes.\n *\n * This implementation uses prefetch to avoid dtlb misses. This can\n * substantially reduce dtlb store misses in cases where the destination\n * location is absent from L1 cache and where the copy size is small enough\n * that the hardware prefetcher doesn't have a large impact.\n *\n * The number of branches is limited by the use of overlapping copies. This\n * helps with copies where the source and destination cache lines are already\n * present in L1 because there are fewer instructions to execute and fewer\n * branches to potentially mispredict.\n *\n * Vector operations up to 32-bytes are used (avx2 instruction set). Larger\n * mov operations (avx512) are not used.\n *\n * Large copies make use of aligned store operations. 
This operation is\n * observed to always be faster than rep movsb, so the rep movsb instruction\n * is not used.\n *\n * If the copy size is humongous and the source and destination are both\n * aligned, this memcpy will use non-temporal operations. This can have\n * a substantial speedup for copies where data is absent from L1, but it\n * is significantly slower if the source and destination data were already\n * in L1. The use of non-temporal operations also has the effect that after\n * the copy is complete, the data will be moved out of L1, even if the data was\n * present before the copy started.\n *\n * @author Logan Evans <lpe@fb.com>\n */\n\n#if defined(__AVX2__)\n\n// This threshold is half of L1 cache on a Skylake machine, which means that\n// potentially all of L1 will be populated by this copy once it is executed\n// (dst and src are cached for temporal copies).\n#define NON_TEMPORAL_STORE_THRESHOLD $32768\n\n        .file       \"memcpy.S\"\n        .section    .text,\"ax\"\n\n        .type       __folly_memcpy_short, @function\n__folly_memcpy_short:\n        .cfi_startproc\n\n.L_GE1_LE7:\n        cmp         $1, %rdx\n        je          .L_EQ1\n\n        cmp         $4, %rdx\n        jae         .L_GE4_LE7\n\n.L_GE2_LE3:\n        movw        (%rsi), %r8w\n        movw        -2(%rsi,%rdx), %r9w\n        movw        %r8w, (%rdi)\n        movw        %r9w, -2(%rdi,%rdx)\n        ret\n\n        .align      2\n.L_EQ1:\n        movb        (%rsi), %r8b\n        movb        %r8b, (%rdi)\n        ret\n\n        // Aligning the target of a jump to an even address has a measurable\n        // speedup in microbenchmarks.\n        .align      2\n.L_GE4_LE7:\n        movl        (%rsi), %r8d\n        movl        -4(%rsi,%rdx), %r9d\n        movl        %r8d, (%rdi)\n        movl        %r9d, -4(%rdi,%rdx)\n        ret\n\n        .cfi_endproc\n        .size       __folly_memcpy_short, .-__folly_memcpy_short\n\n// memcpy is an alternative entrypoint into the 
function named __folly_memcpy.\n// The compiler is able to call memcpy since the name is global while\n// stacktraces will show __folly_memcpy since that is the name of the function.\n// This is intended to aid in debugging by making it obvious which version of\n// memcpy is being used.\n        .align      64\n        .globl      __folly_memcpy\n        .type       __folly_memcpy, @function\n\n__folly_memcpy:\n        .cfi_startproc\n\n        mov         %rdi, %rax\n\n        test        %rdx, %rdx\n        je          .L_EQ0\n\n        prefetchw   (%rdi)\n        prefetchw   -1(%rdi,%rdx)\n\n        cmp         $8, %rdx\n        jb          .L_GE1_LE7\n\n.L_GE8:\n        cmp         $32, %rdx\n        ja          .L_GE33\n\n.L_GE8_LE32:\n        cmp         $16, %rdx\n        ja          .L_GE17_LE32\n\n.L_GE8_LE16:\n        mov         (%rsi), %r8\n        mov         -8(%rsi,%rdx), %r9\n        mov         %r8, (%rdi)\n        mov         %r9, -8(%rdi,%rdx)\n.L_EQ0:\n        ret\n\n        .align      2\n.L_GE17_LE32:\n        movdqu      (%rsi), %xmm0\n        movdqu      -16(%rsi,%rdx), %xmm1\n        movdqu      %xmm0, (%rdi)\n        movdqu      %xmm1, -16(%rdi,%rdx)\n        ret\n\n        .align      2\n.L_GE193_LE256:\n        vmovdqu     %ymm3, 96(%rdi)\n        vmovdqu     %ymm4, -128(%rdi,%rdx)\n\n.L_GE129_LE192:\n        vmovdqu     %ymm2, 64(%rdi)\n        vmovdqu     %ymm5, -96(%rdi,%rdx)\n\n.L_GE65_LE128:\n        vmovdqu     %ymm1, 32(%rdi)\n        vmovdqu     %ymm6, -64(%rdi,%rdx)\n\n.L_GE33_LE64:\n        vmovdqu     %ymm0, (%rdi)\n        vmovdqu     %ymm7, -32(%rdi,%rdx)\n\n        vzeroupper\n        ret\n\n        .align      2\n.L_GE33:\n        vmovdqu     (%rsi), %ymm0\n        vmovdqu     -32(%rsi,%rdx), %ymm7\n\n        cmp         $64, %rdx\n        jbe         .L_GE33_LE64\n\n        prefetchw   64(%rdi)\n\n        vmovdqu     32(%rsi), %ymm1\n        vmovdqu     -64(%rsi,%rdx), %ymm6\n\n        cmp         $128, %rdx\n        jbe  
       .L_GE65_LE128\n\n        prefetchw   128(%rdi)\n\n        vmovdqu     64(%rsi), %ymm2\n        vmovdqu     -96(%rsi,%rdx), %ymm5\n\n        cmp         $192, %rdx\n        jbe         .L_GE129_LE192\n\n        prefetchw   192(%rdi)\n\n        vmovdqu     96(%rsi), %ymm3\n        vmovdqu     -128(%rsi,%rdx), %ymm4\n\n        cmp         $256, %rdx\n        jbe         .L_GE193_LE256\n\n.L_GE257:\n        prefetchw   256(%rdi)\n\n        // Check if there is an overlap. If there is an overlap then the caller\n        // has a bug since this is undefined behavior. However, for legacy\n        // reasons this behavior is expected by some callers.\n        //\n        // All copies through 256 bytes will operate as a memmove since for\n        // those sizes all reads are performed before any writes.\n        //\n        // This check uses the idea that there is an overlap if\n        // (%rdi < (%rsi + %rdx)) && (%rsi < (%rdi + %rdx)),\n        // or equivalently, there is no overlap if\n        // ((%rsi + %rdx) <= %rdi) || ((%rdi + %rdx) <= %rsi).\n        //\n        // %r9 will be used after .L_ALIGNED_DST_LOOP to calculate how many\n        // bytes remain to be copied.\n        lea         (%rsi,%rdx), %r9\n        cmp         %rdi, %r9\n        jbe         .L_NO_OVERLAP\n        lea         (%rdi,%rdx), %r8\n        cmp         %rsi, %r8\n        // This is a forward jump so that the branch predictor will not predict\n        // a memmove.\n        ja          .L_MEMMOVE\n\n        .align      2\n.L_NO_OVERLAP:\n        vmovdqu     %ymm0, (%rdi)\n        vmovdqu     %ymm1, 32(%rdi)\n        vmovdqu     %ymm2, 64(%rdi)\n        vmovdqu     %ymm3, 96(%rdi)\n\n        // Align %rdi to a 32 byte boundary.\n        // %rcx = 128 - 31 & %rdi\n        mov         $128, %rcx\n        and         $31, %rdi\n        sub         %rdi, %rcx\n\n        lea         (%rsi,%rcx), %rsi\n        lea         (%rax,%rcx), %rdi\n        sub         %rcx, %rdx\n\n        // 
%r8 is the end condition for the loop.\n        lea         -128(%rsi,%rdx), %r8\n\n        cmp         NON_TEMPORAL_STORE_THRESHOLD, %rdx\n        jae         .L_NON_TEMPORAL_LOOP\n\n        .align      2\n.L_ALIGNED_DST_LOOP:\n        prefetchw   128(%rdi)\n        prefetchw   192(%rdi)\n\n        vmovdqu     (%rsi), %ymm0\n        vmovdqu     32(%rsi), %ymm1\n        vmovdqu     64(%rsi), %ymm2\n        vmovdqu     96(%rsi), %ymm3\n        add         $128, %rsi\n\n        vmovdqa     %ymm0, (%rdi)\n        vmovdqa     %ymm1, 32(%rdi)\n        vmovdqa     %ymm2, 64(%rdi)\n        vmovdqa     %ymm3, 96(%rdi)\n        add         $128, %rdi\n\n        cmp         %r8, %rsi\n        jb          .L_ALIGNED_DST_LOOP\n\n.L_ALIGNED_DST_LOOP_END:\n        sub         %rsi, %r9\n        mov         %r9, %rdx\n\n        vmovdqu     %ymm4, -128(%rdi,%rdx)\n        vmovdqu     %ymm5, -96(%rdi,%rdx)\n        vmovdqu     %ymm6, -64(%rdi,%rdx)\n        vmovdqu     %ymm7, -32(%rdi,%rdx)\n\n        vzeroupper\n        ret\n\n        .align      2\n.L_NON_TEMPORAL_LOOP:\n        testb       $31, %sil\n        jne         .L_ALIGNED_DST_LOOP\n        // This is prefetching the source data unlike ALIGNED_DST_LOOP which\n        // prefetches the destination data. This choice is again informed by\n        // benchmarks. 
With a non-temporal store the entirety of the cache line\n        // is being written so the previous data can be discarded without being\n        // fetched.\n        prefetchnta 128(%rsi)\n        prefetchnta 196(%rsi)\n\n        vmovntdqa   (%rsi), %ymm0\n        vmovntdqa   32(%rsi), %ymm1\n        vmovntdqa   64(%rsi), %ymm2\n        vmovntdqa   96(%rsi), %ymm3\n        add         $128, %rsi\n\n        vmovntdq    %ymm0, (%rdi)\n        vmovntdq    %ymm1, 32(%rdi)\n        vmovntdq    %ymm2, 64(%rdi)\n        vmovntdq    %ymm3, 96(%rdi)\n        add         $128, %rdi\n\n        cmp         %r8, %rsi\n        jb          .L_NON_TEMPORAL_LOOP\n\n        sfence\n        jmp         .L_ALIGNED_DST_LOOP_END\n\n.L_MEMMOVE:\n        call        memmove\n        ret\n\n        .cfi_endproc\n        .size       __folly_memcpy, .-__folly_memcpy\n\n#ifdef FOLLY_MEMCPY_IS_MEMCPY\n        .weak       memcpy\n        memcpy = __folly_memcpy\n#endif\n\n        .ident \"GCC: (GNU) 4.8.2\"\n#ifdef __linux__\n        .section .note.GNU-stack,\"\",@progbits\n#endif\n\n#endif\n"
  },
  {
    "path": "src/memcpy/impl.S",
    "content": "#if defined(__APPLE__)\n.text\n.global _libc_memcpy\n.p2align  4, 0x90\n_libc_memcpy:\n        jmp _memcpy\n\n#else\n\n.text\n.global libc_memcpy\n.p2align  4, 0x90\nlibc_memcpy:\n        jmp memcpy\n#endif\n\n#define LABEL(x)     .L##x\n#if defined(__APPLE__)\n.text\n.global _asm_memcpy\n.p2align  5, 0x90\n_asm_memcpy:\n#else\n.text\n.global asm_memcpy\n.p2align  5, 0x90\nasm_memcpy:\n#endif\n\n// RDI is the dest\n// RSI is the src\n// RDX is length\n  mov  %rdi, %rax\n  cmp    $64,%rdx\n  ja LABEL(over_64)\n  cmp    $16,%rdx\n  jae LABEL(16_to_64)\n\nLABEL(below_16):\n  cmp    $4,%rdx\n  jbe LABEL(0_to_4)\n  cmp    $8,%rdx\n  jbe LABEL(in_4_to_8)\nLABEL(8_to_16):\n  movq  (%rsi), %rcx\n  movq  %rcx, (%rax)\n  movq  -8(%rsi,%rdx), %rcx\n  movq  %rcx, -8(%rax,%rdx)\n  retq\n\nLABEL(0_to_4):\n  // Copy the first two bytes:\n  cmp    $0,%rdx\n  je      LABEL(exit)\n  movb  (%rsi), %cl\n  movb  %cl, (%rdi)\n  movb  -1(%rsi,%rdx), %cl\n  movb  %cl, -1(%rdi,%rdx)\n  cmp   $2,%rdx\n  jbe   LABEL(exit)\n  // Copy the second two bytes, if n > 2.\n  movb  1(%rsi), %cl\n  movb  %cl, 1(%rdi)\n  movb  2(%rsi), %cl\n  movb  %cl, 2(%rdi)\n  retq\nLABEL(in_4_to_8):\n  movl  (%rsi), %ecx\n  movl  %ecx, (%rdi)\n  movl  -4(%rsi,%rdx), %ecx\n  movl  %ecx, -4(%rdi,%rdx)\nLABEL(exit):\n  retq\n\nLABEL(16_to_64):\n  cmp    $32, %rdx\n  jbe LABEL(16_to_32)\n\nLABEL(32_to_64):\n  vmovdqu  (%rsi), %ymm0\n  vmovdqu  %ymm0, (%rdi)\n  vmovdqu  -32(%rsi,%rdx), %ymm0\n  vmovdqu  %ymm0, -32(%rdi,%rdx)\n  vzeroupper\n  retq\n\nLABEL(16_to_32):\n  movups  (%rsi), %xmm0\n  movups  %xmm0, (%rdi)\n  movups  -16(%rsi,%rdx), %xmm0\n  movups  %xmm0, -16(%rdi,%rdx)\n  retq\n\n  // Handle buffers over 64 bytes:\nLABEL(over_64):\n  cmp    $128, %rdx\n  ja LABEL(over_128)\n\n  // Copy the last wide word.\n  vmovups  -32(%rsi,%rdx), %ymm0\n\n  // Handle cases in the range 64 to 128. 
This is two unconditional\n  // stores (64), 1 conditional store (32), and the one 32 byte store at\n  // the end.\n  vmovups  (%rsi), %ymm1\n  vmovups  32(%rsi), %ymm2\n\n  cmp    $96, %rdx\n  jbe    LABEL(64_to_128_done)\n  vmovups  64(%rsi), %ymm3\n  vmovups  %ymm3, 64(%rax)\n\n.align 4\nLABEL(64_to_128_done):\n  vmovups  %ymm1, (%rax)\n  vmovups  %ymm2, 32(%rax)\n  // Store the last wide word.\n  vmovups  %ymm0, -32(%rax,%rdx)\n  vzeroupper\n  retq\n\nLABEL(over_128):\n  // Compute the last writeable destination.\n  lea -128(%rdx), %rcx\n  xor %r8, %r8\n.align 16\nLABEL(over_128_copy_loop):\n  vmovdqu       (%rsi, %r8), %ymm0\n  vmovdqu     32(%rsi, %r8), %ymm1\n  vmovdqu     64(%rsi, %r8), %ymm2\n  vmovdqu     96(%rsi, %r8), %ymm3\n  vmovdqu     %ymm0,   (%rdi, %r8)\n  vmovdqu     %ymm1, 32(%rdi, %r8)\n  vmovdqu     %ymm2, 64(%rdi, %r8)\n  vmovdqu     %ymm3, 96(%rdi, %r8)\n  add         $128, %r8\n  cmp         %rcx, %r8\n  jb LABEL(over_128_copy_loop)\n\n// Handle the tail:\n  lea    -32(%rdx), %rcx\n  cmp    %r8, %rcx\n  jb     LABEL(over_128_done)\n  vmovdqu     (%rsi, %r8), %ymm0\n  vmovdqu     %ymm0,   (%rdi, %r8)\n  add         $32, %r8\n\n  cmp         %r8, %rcx\n  jb          LABEL(over_128_done)\n  vmovdqu     (%rsi, %r8), %ymm0\n  vmovdqu     %ymm0,   (%rdi, %r8)\n  add         $32, %r8\n\n  cmp         %r8, %rcx\n  jb          LABEL(over_128_done)\n  vmovdqu     (%rsi, %r8), %ymm0\n  vmovdqu     %ymm0,   (%rdi, %r8)\n\nLABEL(over_128_done):\n  // Copy the last 32 bytes\n  vmovdqu   -32(%rsi, %rdx), %ymm0\n  vmovdqu   %ymm0,   -32(%rdi, %rdx)\n\n  vzeroupper\n  retq\n"
  },
  {
    "path": "src/memcpy/impl.c",
    "content": "#include \"types.h\"\n\n#include <stddef.h>\n#include <stdint.h>\n\nvoid *local_memcpy(void *dest, const void *src, size_t n) {\n  char *d = (char *)dest;\n  const char *s = (char *)src;\n\n  if (n < 5) {\n    if (n == 0)\n      return dest;\n    d[0] = s[0];\n    d[n - 1] = s[n - 1];\n    if (n <= 2)\n      return dest;\n    d[1] = s[1];\n    d[2] = s[2];\n    return dest;\n  }\n\n  if (n <= 16) {\n    if (n >= 8) {\n      const char *first_s = s;\n      const char *last_s = s + n - 8;\n      char *first_d = d;\n      char *last_d = d + n - 8;\n      *((u64 *)first_d) = *((u64 *)first_s);\n      *((u64 *)last_d) = *((u64 *)last_s);\n      return dest;\n    }\n\n    const char *first_s = s;\n    const char *last_s = s + n - 4;\n    char *first_d = d;\n    char *last_d = d + n - 4;\n    *((u32 *)first_d) = *((u32 *)first_s);\n    *((u32 *)last_d) = *((u32 *)last_s);\n    return dest;\n  }\n\n  if (n <= 32) {\n    const char *first_s = s;\n    const char *last_s = s + n - 16;\n    char *first_d = d;\n    char *last_d = d + n - 16;\n\n    *((char16 *)first_d) = *((char16 *)first_s);\n    *((char16 *)last_d) = *((char16 *)last_s);\n    return dest;\n  }\n\n  const char *last_word_s = s + n - 32;\n  char *last_word_d = d + n - 32;\n\n  // Stamp the 32-byte chunks.\n  do {\n    *((char32 *)d) = *((char32 *)s);\n    d += 32;\n    s += 32;\n  } while (d < last_word_d);\n\n  // Stamp the last unaligned 32 bytes of the buffer.\n  *((char32 *)last_word_d) = *((char32 *)last_word_s);\n  return dest;\n}\n"
  },
  {
    "path": "src/memcpy/test_memcpy.cc",
    "content": "#include <cstring>\n#include <iostream>\n#include <vector>\n\n#include \"decl.h\"\n#include \"utils.h\"\n\n////////////////////////////////////////////////////////////////////////////////\n// This is a small program that checks if some memcpy implementation is correct.\n////////////////////////////////////////////////////////////////////////////////\n\n#define MAGIC_VALUE0 '#'\n#define MAGIC_VALUE1 '='\n\nvoid print_buffer(const char *start, const char *end, char val,\n                  const char *ptr) {\n  const char *it = start;\n  while (it != end) {\n    std::cout << *it;\n    it++;\n  }\n  std::cout << \"\\n\";\n  it = start;\n  while (it != ptr) {\n    std::cout << \" \";\n    it++;\n  }\n  std::cout << \"^\\n\";\n  std::cout << \"Filling a buffer of length \" << end - start << \".\";\n  std::cout << \" Expected \\\"\" << val << \"\\\" at index \" << ptr - start\n            << std::endl;\n}\n\nvoid print_buffer_match(const char *start0, const char *start1, size_t len,\n                        size_t error_at) {\n\n  for (size_t i = 0; i < len; i++) {\n    std::cout << start0[i];\n  }\n  std::cout << \"\\n\";\n  for (size_t i = 0; i < len; i++) {\n    std::cout << start1[i];\n  }\n  std::cout << \"\\n\";\n\n  for (size_t i = 0; i < error_at; i++) {\n    std::cout << \" \";\n  }\n  std::cout << \"^\\n\";\n  std::cout << \"Comparing buffers of length \" << len << \".\";\n  std::cout << \" Invalid value at index \" << error_at << \".\" << std::endl;\n}\n\n// Make sure that the whole buffer, from \\p start to \\p end, is set to \\p val.\nvoid assert_uniform_value(const char *start, const char *end, char val) {\n  const char *ptr = start;\n  while (ptr != end) {\n    if (val != *ptr) {\n      print_buffer(start, end, val, ptr);\n      abort();\n    }\n    ptr++;\n  }\n}\n\n// Make sure that two buffers contain the same memory content.\nvoid assert_buffers_match(const char *buff1, const char *buff2, size_t len) {\n  for (size_t i = 0; i < len; i++) 
{\n    if (buff1[i] != buff2[i]) {\n      print_buffer_match(buff1, buff2, len, i);\n      abort();\n    }\n  }\n}\n\nvoid test_impl(memcpy_ty handle, const std::string &name, unsigned chunk_size) {\n  std::vector<char> src(chunk_size + 512);\n  std::vector<char> dest(chunk_size + 512, MAGIC_VALUE0);\n\n  // Fill the buffer with a running counter of printable chars.\n  for (unsigned i = 0; i < src.size(); i++) {\n    src[i] = 'A' + (i % 26);\n  }\n\n  // Start copying memory at different offsets.\n  for (int src_offset = 0; src_offset < 32; src_offset++) {\n    for (int dest_offset = 0; dest_offset < 32; dest_offset++) {\n      const char *dest_start = dest.data();\n      const char *dest_end = dest.data() + dest.size();\n\n      const char *src_region_start = &src[src_offset];\n      char *dest_region_start = &dest[dest_offset];\n      char *dest_region_end = &dest[dest_offset + chunk_size];\n\n      void *res =\n          (handle)((void *)dest_region_start, src_region_start, chunk_size);\n      if (res != dest_region_start) {\n        std::cout << \"Invalid return value.\" << std::endl;\n        abort();\n      }\n\n      // Check the chunk.\n      assert_buffers_match(dest_region_start, src_region_start, chunk_size);\n      // Check before chunk.\n      assert_uniform_value(dest_start, dest_region_start, MAGIC_VALUE0);\n      // Check after chunk.\n      assert_uniform_value(dest_region_end, dest_end, MAGIC_VALUE0);\n\n      // Reset the dest buffer:\n      std::fill(dest.begin(), dest.end(), MAGIC_VALUE0);\n    }\n  }\n}\n\nint main(int argc, char **argv) {\n  std::cout << \"Testing memcpy... \\n\";\n\n#define TEST(FUNC, SIZE) test_impl(FUNC, #FUNC, SIZE);\n\n  for (int i = 0; i < 1024; i++) {\n    TEST(&memcpy, i);\n    TEST(&__folly_memcpy, i);\n    TEST(&local_memcpy, i);\n    TEST(&asm_memcpy, i);\n  }\n\n  std::cout << \"Done.\\n\";\n\n  return 0;\n}\n"
  },
  {
    "path": "src/memset/CMakeLists.txt",
    "content": "add_library(mem_shim SHARED\n            shims.c\n            impl.S\n            impl.c\n           )\n\nset_target_properties(mem_shim PROPERTIES\n     VERSION ${PROJECT_VERSION}\n     SOVERSION 1\n     )\n\nadd_executable(bench_memset\n                 bench_memset.cc\n                 impl.S\n                 impl.c\n                 )\n\nadd_executable(test_memset\n                 test_memset.cc\n                 impl.S\n                 impl.c\n                 )\n\ntarget_link_libraries(bench_memset PUBLIC)\ntarget_link_libraries(test_memset PUBLIC)\n\ninstall(TARGETS bench_memset DESTINATION bin)\ninstall(TARGETS test_memset DESTINATION bin)\ninstall(TARGETS mem_shim LIBRARY DESTINATION bin)\n\n"
  },
  {
    "path": "src/memset/bench_memset.cc",
    "content": "#include <algorithm>\n#include <cstring>\n#include <iomanip>\n#include <iostream>\n#include <unistd.h>\n#include <vector>\n\n#include \"decl.h\"\n#include \"utils.h\"\n\n////////////////////////////////////////////////////////////////////////////////\n// This is a small program that compares two memset implementations and records\n// the output in a csv file.\n////////////////////////////////////////////////////////////////////////////////\n\n#define ITER (1000L * 1000L * 10L)\n#define SAMPLES (20)\n\nDoomRNG RNG;\n\n/// Measure a single implementation \\p handle.\nuint64_t measure(memset_ty handle, unsigned size, unsigned align,\n                 unsigned offset, void *ptr) {\n  Stopwatch T;\n  for (unsigned i = 0; i < SAMPLES; i++) {\n    T.start();\n    for (size_t j = 0; j < ITER; j++) {\n      (handle)(ptr, 0, size);\n    }\n    T.stop();\n  }\n  return T.get_median();\n}\n\n// Allocate memory and benchmark each implementation at a specific size \\p size.\nvoid bench_impl(const std::vector<memset_ty *> &toTest, unsigned size,\n                unsigned align, unsigned offset) {\n  std::vector<char> memory(size + 256, 0);\n  void *ptr = align_pointer(&memory[0], align, offset);\n\n  std::cout << size << \", \";\n  for (auto handle : toTest) {\n    u_int64_t res = measure(handle, size, align, offset, ptr);\n    std::cout << res << \", \";\n  }\n  std::cout << std::endl;\n}\n\n/// Try to allocate buffers at random offsets and in random sizes.\n/// The sizes and the offsets are in the range 0..256.\nvoid bench_rand_range(const std::vector<memset_ty *> &toTest) {\n  std::vector<char> memory(1024, 0);\n  void *ptr = &memory[0];\n\n  for (auto handle : toTest) {\n    Stopwatch T;\n    sleep(1);\n    for (unsigned i = 0; i < SAMPLES; i++) {\n      RNG.rand_reset();\n      T.start();\n      for (size_t j = 0; j < ITER; j++) {\n        (handle)((char *)ptr + RNG.next_u8_random(), 0, RNG.next_u8_random());\n      }\n      T.stop();\n    }\n\n    std::cout 
<< T.get_median() << \", \";\n  }\n  std::cout << std::endl;\n}\n\n// To measure the call overhead.\nvoid *nop(void *s, int c, size_t n) { return s; }\n\nint main(int argc, char **argv) {\n  std::cout << std::setprecision(3);\n  std::cout << std::fixed;\n\n  std::vector<memset_ty *> toTest = {musl_memset,  libc_memset, &memset,\n                                     local_memset, asm_memset,  &nop};\n\n  std::cout << \"Batches of random sizes:\\n\";\n  std::cout << \" musl, libc@plt, libc, c_memset, asm_memset, nop,\\n\";\n  bench_rand_range(toTest);\n\n  std::cout << \"\\nFixed size:\\n\";\n  std::cout << \"size, musl, libc@plt, libc, c_memset, asm_memset, nop,\\n\";\n\n  for (int i = 0; i < 512; i++) {\n    bench_impl(toTest, i, 16, 0);\n  }\n\n  return 0;\n}\n"
  },
  {
    "path": "src/memset/impl.S",
    "content": "#if defined(__APPLE__)\n.text\n.global _libc_memset\n.p2align  4, 0x90\n_libc_memset:\n        jmp _memset\n\n#else\n\n.text\n.global libc_memset\n.p2align  4, 0x90\nlibc_memset:\n        jmp memset\n#endif\n\n#define LABEL(x)     .L##x\n#if defined(__APPLE__)\n.text\n.global _asm_memset\n.p2align  5, 0x90\n_asm_memset:\n#else\n.text\n.global asm_memset\n.p2align  5, 0x90\nasm_memset:\n#endif\n\n// RDI is the buffer\n// RSI is the value\n// RDX is length\n        vmovd  %esi, %xmm0\n        vpbroadcastb %xmm0,%ymm0\n        mov    %rdi,%rax\n        cmp    $0x40,%rdx\n        jae LABEL(above_64)\nLABEL(below_64):\n        cmp    $0x20, %rdx\n        jb LABEL(below_32)\n        vmovdqu %ymm0,(%rdi)\n        vmovdqu %ymm0,-0x20(%rdi,%rdx)\n        vzeroupper\n        retq\nLABEL(below_32):\n        cmp    $0x10, %rdx\n        jae     LABEL(in_16_to_32)\nLABEL(below_16):\n        cmp    $0x4, %rdx\n        jbe     LABEL(below_4)\nLABEL(in_4_to_16):\n        // Scalar stores from this point.\n        vmovq %xmm0, %rsi\n        cmp    $0x7, %rdx\n        jbe    LABEL(in_4_to_8)\n        // two 8-wide stores, up to 16 bytes.\n        mov    %rsi, -0x8(%rdi, %rdx)\n        mov    %rsi,(%rdi)\n        vzeroupper\n        retq\n.align 4\nLABEL(below_4):\n        test   %rdx, %rdx\n        je     LABEL(exit)\n        mov    %sil, (%rdi)\n        mov    %sil, -0x1(%rdi,%rdx)\n        cmp    $0x2, %rdx\n        jbe    LABEL(exit)\n        mov     %sil, 0x1(%rdi)\n        mov     %sil, 0x2(%rdi)\n        mov    %rdi,%rax\n.align 4\nLABEL(exit):\n        vzeroupper\n        retq\nLABEL(in_4_to_8):\n        // two 4-wide stores, upto 8 bytes.\n        mov    %esi,-0x4(%rdi,%rdx)\n        mov    %esi,(%rdi)\n        vzeroupper\n        retq\nLABEL(in_16_to_32):\n        vmovups %xmm0,(%rdi)\n        vmovups %xmm0,-0x10(%rdi,%rdx)\n        vzeroupper\n        retq\nLABEL(above_64):\n        cmp    $0xb0, %rdx\n        ja LABEL(above_192)\n        cmp    $0x80, 
%rdx\n        jbe LABEL(in_64_to_128)\n        // Fill with unaligned 32-byte stores.\n        // last_word -> rsi\n        lea    -0x20(%rdi,%rdx),%rsi\n        // rdi -> fill pointer.\n\n        // We have at least 128 bytes to store.\n        vmovdqu %ymm0,(%rdi)\n        vmovdqu %ymm0, 0x20(%rdi)\n        vmovdqu %ymm0, 0x40(%rdi)\n        add    $0x60,%rdi\n.align 8\nLABEL(fill_32):\n        vmovdqu %ymm0,(%rdi)\n        add    $0x20,%rdi\n        cmp    %rdi,%rsi\n        ja     LABEL(fill_32)\n        // Stamp the last unaligned store.\n        vmovdqu %ymm0,(%rsi)\n        vzeroupper\n        retq\nLABEL(in_64_to_128):\n        vmovdqu %ymm0,(%rdi)\n        vmovdqu %ymm0, 0x20(%rdi)\n        vmovdqu %ymm0,-0x40(%rdi,%rdx)\n        vmovdqu %ymm0,-0x20(%rdi,%rdx)\n        vzeroupper\n        retq\n\nLABEL(above_192):\n// rdi is the buffer address\n// rsi is the value\n// rdx is length\n        // Store the first unaligned 32 bytes.\n        vmovdqu %ymm0,(%rdi)\n\n        // The first aligned word is stored in %rsi.\n        mov    %rdi,%rsi\n        and    $0xffffffffffffffe0,%rsi\n        lea    0x20(%rsi),%rsi\n\n        // Compute the address of the last unaligned word into rdi.\n        lea    -0x20(%rdx), %rdx\n        add     %rdx, %rdi\n\n        // Check if we can do a full 5x32B stamp.\n        lea    0xa0(%rsi),%rcx\n        cmp    %rcx, %rdi\n        jb     LABEL(stamp_4)\n.align 8\nLABEL(fill_192):\n        vmovdqa %ymm0,(%rsi)\n        vmovdqa %ymm0,0x20(%rsi)\n        vmovdqa %ymm0,0x40(%rsi)\n        vmovdqa %ymm0,0x60(%rsi)\n        vmovdqa %ymm0,0x80(%rsi)\n        add     $0xa0, %rsi\n        lea    0xa0(%rsi),%rcx\n        cmp    %rcx, %rdi\n        ja     LABEL(fill_192)\n\nLABEL(fill_192_tail):\n        cmp    %rsi, %rdi\n        jb     LABEL(fill_192_done)\n        vmovdqa %ymm0, (%rsi)\n\n        lea    0x20(%rsi),%rcx\n        cmp    %rcx, %rdi\n        jb     LABEL(fill_192_done)\n        vmovdqa %ymm0, 0x20(%rsi)\n\n        lea    0x40(%rsi),%rcx\n        cmp    %rcx, %rdi\n        jb     LABEL(fill_192_done)\n        vmovdqa %ymm0, 0x40(%rsi)\n\n        lea    0x60(%rsi),%rcx\n        cmp    %rcx, %rdi\n        jb     LABEL(fill_192_done)\n        vmovdqa %ymm0, 0x60(%rsi)\n\nLABEL(last_wide_store):\n        lea    0x80(%rsi),%rcx\n        cmp    %rcx, %rdi\n        jb     LABEL(fill_192_done)\n        vmovdqa %ymm0, 0x80(%rsi)\nLABEL(fill_192_done):\n        // Stamp the last word.\n        vmovdqu %ymm0,(%rdi)\n        vzeroupper\n        ret\nLABEL(stamp_4):\n        vmovdqa %ymm0,(%rsi)\n        vmovdqa %ymm0,0x20(%rsi)\n        vmovdqa %ymm0,0x40(%rsi)\n        vmovdqa %ymm0,0x60(%rsi)\n        jmp     LABEL(last_wide_store)\n"
  },
  {
    "path": "src/memset/impl.c",
    "content": "#include \"types.h\"\n\n#include <stddef.h>\n#include <stdint.h>\n\n// Handle memsets of sizes 0..32\nstatic inline void *small_memset(void *s, int c, size_t n) {\n  if (n < 5) {\n    if (n == 0)\n      return s;\n    char *p = s;\n    p[0] = c;\n    p[n - 1] = c;\n    if (n <= 2)\n      return s;\n    p[1] = c;\n    p[2] = c;\n    return s;\n  }\n\n  if (n <= 16) {\n    uint64_t val8 = ((uint64_t)0x0101010101010101L * ((uint8_t)c));\n    if (n >= 8) {\n      char *first = s;\n      char *last = s + n - 8;\n      *((u64 *)first) = val8;\n      *((u64 *)last) = val8;\n      return s;\n    }\n\n    uint32_t val4 = val8;\n    char *first = s;\n    char *last = s + n - 4;\n    *((u32 *)first) = val4;\n    *((u32 *)last) = val4;\n    return s;\n  }\n\n  char X = c;\n  char *p = s;\n  char16 val16 = {X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X};\n  char *last = s + n - 16;\n  *((char16 *)last) = val16;\n  *((char16 *)p) = val16;\n  return s;\n}\n\nstatic inline void *huge_memset(void *s, int c, size_t n) {\n  char *p = s;\n  char X = c;\n  char32 val32 = {X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X,\n                  X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X};\n\n  // Stamp the first 32byte store.\n  *((char32 *)p) = val32;\n\n  char *first_aligned = p + 32 - ((uint64_t)p % 32);\n  char *buffer_end = p + n;\n  char *last_word = buffer_end - 32;\n\n  // Align the next stores.\n  p = first_aligned;\n\n  // Unroll the body of the loop to increase parallelism.\n  while (p + (32 * 5) < buffer_end) {\n    *((char32a *)p) = val32;\n    p += 32;\n    *((char32a *)p) = val32;\n    p += 32;\n    *((char32a *)p) = val32;\n    p += 32;\n    *((char32a *)p) = val32;\n    p += 32;\n    *((char32a *)p) = val32;\n    p += 32;\n  }\n\n// Complete the last few iterations:\n#define TRY_STAMP_32_BYTES                                                     \\\n  if (p < last_word) {                                                         \\\n    *((char32a *)p) = 
val32;                                                   \\\n    p += 32;                                                                   \\\n  }\n\n  TRY_STAMP_32_BYTES\n  TRY_STAMP_32_BYTES\n  TRY_STAMP_32_BYTES\n  TRY_STAMP_32_BYTES\n\n  // Stamp the last unaligned word.\n  *((char32 *)last_word) = val32;\n  return s;\n}\n\nvoid *local_memset(void *s, int c, size_t n) {\n  char *p = s;\n  char X = c;\n\n  if (n < 32) {\n    return small_memset(s, c, n);\n  }\n\n  if (n > 160) {\n    return huge_memset(s, c, n);\n  }\n\n  char32 val32 = {X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X,\n                  X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X};\n\n  char *last_word = s + n - 32;\n\n  // Stamp the 32-byte chunks.\n  do {\n    *((char32 *)p) = val32;\n    p += 32;\n  } while (p < last_word);\n\n  // Stamp the last unaligned 32 bytes of the buffer.\n  *((char32 *)last_word) = val32;\n  return s;\n}\n\n/// This is a memset implementation that was copied from musl. We only use it\n/// for benchmarking.\n\nvoid *musl_memset(void *dest, int c, size_t n) {\n  unsigned char *s = dest;\n  size_t k;\n\n  /* Fill head and tail with minimal branching. Each\n   * conditional ensures that all the subsequently used\n   * offsets are well-defined and in the dest region. */\n\n  if (!n)\n    return dest;\n  s[0] = c;\n  s[n - 1] = c;\n  if (n <= 2)\n    return dest;\n  s[1] = c;\n  s[2] = c;\n  s[n - 2] = c;\n  s[n - 3] = c;\n  if (n <= 6)\n    return dest;\n  s[3] = c;\n  s[n - 4] = c;\n  if (n <= 8)\n    return dest;\n\n  /* Advance pointer to align it at a 4-byte boundary,\n   * and truncate n to a multiple of 4. The previous code\n   * already took care of any head/tail that get cut off\n   * by the alignment. 
*/\n\n  k = -(uintptr_t)s & 3;\n  s += k;\n  n -= k;\n  n &= -4;\n\n#ifdef __GNUC__\n  typedef uint32_t __attribute__((__may_alias__)) u32;\n  typedef uint64_t __attribute__((__may_alias__)) u64;\n\n  u32 c32 = ((u32)-1) / 255 * (unsigned char)c;\n\n  /* In preparation to copy 32 bytes at a time, aligned on\n   * an 8-byte boundary, fill head/tail up to 28 bytes each.\n   * As in the initial byte-based head/tail fill, each\n   * conditional below ensures that the subsequent offsets\n   * are valid (e.g. !(n<=24) implies n>=28). */\n\n  *(u32 *)(s + 0) = c32;\n  *(u32 *)(s + n - 4) = c32;\n  if (n <= 8)\n    return dest;\n  *(u32 *)(s + 4) = c32;\n  *(u32 *)(s + 8) = c32;\n  *(u32 *)(s + n - 12) = c32;\n  *(u32 *)(s + n - 8) = c32;\n  if (n <= 24)\n    return dest;\n  *(u32 *)(s + 12) = c32;\n  *(u32 *)(s + 16) = c32;\n  *(u32 *)(s + 20) = c32;\n  *(u32 *)(s + 24) = c32;\n  *(u32 *)(s + n - 28) = c32;\n  *(u32 *)(s + n - 24) = c32;\n  *(u32 *)(s + n - 20) = c32;\n  *(u32 *)(s + n - 16) = c32;\n\n  /* Align to a multiple of 8 so we can fill 64 bits at a time,\n   * and avoid writing the same bytes twice as much as is\n   * practical without introducing additional branching. */\n\n  k = 24 + ((uintptr_t)s & 4);\n  s += k;\n  n -= k;\n\n  /* If this loop is reached, 28 tail bytes have already been\n   * filled, so any remainder when n drops below 32 can be\n   * safely ignored. */\n\n  u64 c64 = c32 | ((u64)c32 << 32);\n  for (; n >= 32; n -= 32, s += 32) {\n    *(u64 *)(s + 0) = c64;\n    *(u64 *)(s + 8) = c64;\n    *(u64 *)(s + 16) = c64;\n    *(u64 *)(s + 24) = c64;\n  }\n#else\n  /* Pure C fallback with no aliasing violations. */\n  for (; n; n--, s++)\n    *s = c;\n#endif\n\n  return dest;\n}\n"
  },
  {
    "path": "src/memset/shims.c",
    "content": "#include \"decl.h\"\n\n////////////////////////////////////////////////////////////////////////////////\n/// This is a small utility that swaps the builtin call to memset with the\n/// local implementation of memset, implemented in this project.\n/// The shared object can be loaded using LD_PRELOAD (on Linux) or\n/// DYLD_INSERT_LIBRARIES (on Mac).\n////////////////////////////////////////////////////////////////////////////////\n\nvoid *memset(void *s, int c, size_t n) { return local_memset(s, c, n); }\n"
  },
  {
    "path": "src/memset/test_memset.cc",
    "content": "#include <cstring>\n#include <iostream>\n#include <vector>\n\n#include \"decl.h\"\n#include \"utils.h\"\n\n////////////////////////////////////////////////////////////////////////////////\n// This is a small program that checks if some memset implementation is correct.\n// The tool currently checks libc, musl and the local implementation.\n////////////////////////////////////////////////////////////////////////////////\n\n#define MAGIC_VALUE0 'X'\n#define MAGIC_VALUE1 'O'\n\nvoid print_buffer(const char *start, const char *end, char val,\n                  const char *ptr) {\n  const char *it = start;\n  while (it != end) {\n    std::cout << *it;\n    it++;\n  }\n  std::cout << \"\\n\";\n  it = start;\n  while (it != ptr) {\n    std::cout << \" \";\n    it++;\n  }\n  std::cout << \"^\\n\";\n  std::cout << \"Filling a buffer of length \" << end - start << \".\";\n  std::cout << \" Expected \\\"\" << val << \"\\\" at index \" << ptr - start << \"\\n\";\n}\n\nvoid assert_uniform_value(const char *start, const char *end, char val) {\n  const char *ptr = start;\n  while (ptr != end) {\n    if (val != *ptr) {\n      print_buffer(start, end, val, ptr);\n      fflush(stdout);\n      abort();\n    }\n    ptr++;\n  }\n}\n\nvoid test_impl(memset_ty handle, const std::string &name, unsigned chunk_size) {\n  std::vector<char> memory(chunk_size + 512, MAGIC_VALUE0);\n  // Start mem-setting the array at different offsets.\n  for (int offset = 0; offset < 128; offset++) {\n    const char *buffer_start = &*memory.begin();\n    const char *buffer_end = &*memory.end();\n\n    const char *region_start = &memory[offset];\n    const char *region_end = region_start + chunk_size;\n\n    assert_uniform_value(buffer_start, buffer_end, MAGIC_VALUE0);\n\n    (handle)((void *)region_start, MAGIC_VALUE1, chunk_size);\n\n    // Check the chunk.\n    assert_uniform_value(region_start, region_end, MAGIC_VALUE1);\n    // Check before chunk.\n    assert_uniform_value(buffer_start, 
region_start, MAGIC_VALUE0);\n    // Check after chunk.\n    assert_uniform_value(region_end, buffer_end, MAGIC_VALUE0);\n\n    // Reset the buffer:\n    std::fill(memory.begin(), memory.end(), MAGIC_VALUE0);\n    assert_uniform_value(buffer_start, buffer_end, MAGIC_VALUE0);\n  }\n}\n\nint main(int argc, char **argv) {\n  std::cout << \"Testing memset... \\n\";\n\n#define TEST(FUNC, SIZE) test_impl(FUNC, #FUNC, SIZE);\n\n  for (int i = 0; i < 1024; i++) {\n    TEST(libc_memset, i);\n    TEST(local_memset, i);\n    TEST(musl_memset, i);\n    TEST(asm_memset, i);\n  }\n  std::cout << \"Done.\\n\";\n\n  return 0;\n}\n"
  },
  {
    "path": "src/utils/CMakeLists.txt",
    "content": "add_library(hist_tool SHARED\n            hist_tool.c\n           )\n\nset_target_properties(hist_tool PROPERTIES\n     VERSION ${PROJECT_VERSION}\n     SOVERSION 1\n     )\n\ntarget_compile_options(hist_tool PRIVATE \"-fno-builtin\")\n\ninstall(TARGETS hist_tool LIBRARY DESTINATION bin)\n"
  },
  {
    "path": "src/utils/hist_tool.c",
    "content": "#include <stddef.h>\n#include <stdint.h>\n#include <stdio.h>\n#include <unistd.h>\n\n////////////////////////////////////////////////////////////////////////////////\n/// This is a small utility that records calls to some methods and creates a\n/// histogram of the lengths of calls to memset. It prints the histogram when\n/// the program is terminated. The shared object can be loaded using LD_PRELOAD\n/// (on Linux) or DYLD_INSERT_LIBRARIES (on Mac).\n////////////////////////////////////////////////////////////////////////////////\n\nuint32_t memset_len_dist[32] = {\n    0,\n};\nuint32_t memcpy_len_dist[32] = {\n    0,\n};\nuint32_t align_dist[32] = {\n    0,\n};\n\n\nconst int tab32[32] = {0,  9,  1,  10, 13, 21, 2,  29, 11, 14, 16,\n                       18, 22, 25, 3,  30, 8,  12, 20, 28, 15, 17,\n                       24, 7,  19, 27, 23, 6,  26, 5,  4,  31};\n\nint log2_32(uint32_t value) {\n  value |= value >> 1;\n  value |= value >> 2;\n  value |= value >> 4;\n  value |= value >> 8;\n  value |= value >> 16;\n  return tab32[(uint32_t)(value * 0x07C4ACDD) >> 27];\n}\n\nvoid __attribute__((destructor)) print_hitograms() {\n  FILE *ff = fopen(\"/tmp/hist.txt\", \"a+\");\n  if (!ff) {\n    return;\n  }\n  pid_t pid = getpid();\n\n  fprintf(ff, \"Histogram for (%d):\\n\", pid);\n  fprintf(ff, \"size, memset, memcpy, alignment:\\n\");\n  for (int i = 0; i < 32; i++) {\n    fprintf(ff, \"%d, %d, %d, %d,\\n\", i, memset_len_dist[i], memcpy_len_dist[i], align_dist[i]);\n  }\n  fclose(ff);\n}\n\nvoid *memcpy(void *dest, const void *src, size_t len) {\n  memcpy_len_dist[log2_32(len)]++;\n  align_dist[(unsigned long)dest % 32]++;\n  align_dist[(unsigned long)src % 32]++;\n  char *d = (char *)dest;\n  char *s = (char *)src;\n  for (size_t i = 0; i < len; i++) {\n    d[i] = s[i];\n  }\n  return dest;\n}\n\nvoid *memset(void *s, int c, size_t len) {\n  memset_len_dist[log2_32(len)]++;\n  align_dist[(unsigned long)s % 32]++;\n  char *p = s;\n\n  for (int i = 
0; i < len; i++) {\n    p[i] = c;\n  }\n  return s;\n}\n\n"
  }
]