Repository: google/fchan-go
Branch: master
Commit: 9cd80472fe1f
Files: 12
Total size: 101.3 KB

Directory structure:
gitextract_sxcrcuyi/
├── CONTRIBUTING
├── LICENSE
├── README.md
├── src/
│   ├── bench/
│   │   └── bench.go
│   └── fchan/
│       ├── bounded.go
│       ├── fchan_test.go
│       ├── q.go
│       └── unbounded.go
└── writeup/
    ├── graphs.py
    ├── latex.template
    ├── refs.bib
    └── writeup.md

================================================
FILE CONTENTS
================================================

================================================
FILE: CONTRIBUTING
================================================
Want to contribute? Great! First, read this page (including the small print at the end).

### Before you contribute

Before we can use your code, you must sign the [Google Individual Contributor License Agreement](https://cla.developers.google.com/about/google-individual) (CLA), which you can do online. The CLA is necessary mainly because you own the copyright to your changes, even after your contribution becomes part of our codebase, so we need your permission to use and distribute your code. We also need to be sure of various other things—for instance that you'll tell us if you know that your code infringes on other people's patents. You don't have to sign the CLA until after you've submitted your code for review and a member has approved it, but you must do it before we can put your code into our codebase.

Before you start working on a larger contribution, you should get in touch with us first through the issue tracker with your idea so that we can help out and possibly guide you. Coordinating up front makes it much easier to avoid frustration later on.

### Code reviews

All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose.
### The small print Contributions made by corporations are covered by a different agreement than the one above, the [Software Grant and Corporate Contributor License Agreement] (https://cla.developers.google.com/about/google-corporate). ================================================ FILE: LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). 
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the 
Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. 
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. 
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: README.md
================================================
# `fchan`: Fast Channels in Go

This package contains implementations of fast and scalable channels in Go. The implementation is in `src/fchan`. To run benchmarks, run `src/bench/bench.go`. `bench.go` is very rudimentary, and modifying the source may be necessary depending on what you want to run; that will change in the future. For details on the algorithm, check out the `writeup` directory; it includes a PDF and the Pandoc markdown used to generate it.

**This is a proof of concept only**. This code should *not* be run in production.

Comments, criticisms, and bugs are all welcome!

## Disclaimer

This is not an official Google product.

================================================
FILE: src/bench/bench.go
================================================
// Copyright 2016 Google Inc.
// // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. package main import ( "flag" "fmt" "log" "os" "runtime" "runtime/pprof" "sync" "time" "../fchan" ) // chanRec is a wrapper for a channel-like object used in the benchmarking code // to avoid code duplication type chanRec struct { NewHandle func() interface{} Enqueue func(ch interface{}, e fchan.Elt) Dequeue func(ch interface{}) fchan.Elt } func wrapBounded(bound uint64) *chanRec { ch := fchan.NewBounded(bound) return &chanRec{ NewHandle: func() interface{} { return ch.NewHandle() }, Enqueue: func(ch interface{}, e fchan.Elt) { ch.(*fchan.BoundedChan).Enqueue(e) }, Dequeue: func(ch interface{}) fchan.Elt { return ch.(*fchan.BoundedChan).Dequeue() }, } } func wrapUnbounded() *chanRec { ch := fchan.New() return &chanRec{ NewHandle: func() interface{} { return ch.NewHandle() }, Enqueue: func(ch interface{}, e fchan.Elt) { ch.(*fchan.UnboundedChan).Enqueue(e) }, Dequeue: func(ch interface{}) fchan.Elt { return ch.(*fchan.UnboundedChan).Dequeue() }, } } func wrapChan(chanSize int) *chanRec { ch := make(chan fchan.Elt, chanSize) return &chanRec{ NewHandle: func() interface{} { return nil }, Enqueue: func(_ interface{}, e fchan.Elt) { ch <- e }, Dequeue: func(_ interface{}) fchan.Elt { return <-ch }, } } func benchHelp(N int, chanBase *chanRec, nProcs int) time.Duration { const nIters = 1 var totalTime int64 for iter := 0; iter < nIters; iter++ { var waitSetup, waitBench sync.WaitGroup nProcsPer := nProcs / 2 pt := N / nProcsPer 
waitSetup.Add(2*nProcsPer + 1) for i := 0; i < nProcsPer; i++ { waitBench.Add(2) go func() { ch := chanBase.NewHandle() var ( m interface{} = 1 msg fchan.Elt = &m ) waitSetup.Done() waitSetup.Wait() for j := 0; j < pt; j++ { chanBase.Enqueue(ch, msg) } waitBench.Done() }() go func() { ch := chanBase.NewHandle() waitSetup.Done() waitSetup.Wait() for j := 0; j < pt; j++ { chanBase.Dequeue(ch) } waitBench.Done() }() } time.Sleep(time.Millisecond * 5) waitSetup.Done() waitSetup.Wait() start := time.Now().UnixNano() waitBench.Wait() end := time.Now().UnixNano() runtime.GC() time.Sleep(time.Second) totalTime += end - start } return time.Duration(totalTime/nIters) * time.Nanosecond } func render(N, numCPUs int, gmp bool, desc string, t time.Duration) { extra := "" if gmp { extra = "GMP" } fmt.Printf("%s%s-%d\t%d\t%v\n", desc, extra, numCPUs, N, t) } var cpuprofile = flag.String("cpuprofile", "", "write cpu profile `file`") func main() { const ( more = 5000 nOps = 10000000 gmpScale = 1 ) flag.Parse() if *cpuprofile != "" { f, err := os.Create(*cpuprofile) if err != nil { log.Fatal("could not create CPU profile: ", err) } if err := pprof.StartCPUProfile(f); err != nil { log.Fatal("could not start CPU profile: ", err) } defer pprof.StopCPUProfile() } for _, pack := range []struct { desc string f func() *chanRec }{ {"Chan10M", func() *chanRec { return wrapChan(nOps) }}, {"Chan1K", func() *chanRec { return wrapChan(1024) }}, {"Chan0", func() *chanRec { return wrapChan(0) }}, {"Bounded1K", func() *chanRec { return wrapBounded(1024) }}, {"Bounded0", func() *chanRec { return wrapBounded(0) }}, {"Unbounded", wrapUnbounded}, } { for _, nprocs := range []int{2, 4, 8, 12, 16, 24, 28, 32} { runtime.GOMAXPROCS(nprocs) dur := benchHelp(nOps, pack.f(), more) render(nOps, nprocs, false, pack.desc, dur) dur = benchHelp(nOps, pack.f(), nprocs*gmpScale) render(nOps, nprocs, true, pack.desc, dur) } } } ================================================ FILE: src/fchan/bounded.go 
================================================
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package fchan

import (
	"runtime"
	"sync/atomic"
	"unsafe"
)

var (
	s        = 1
	sentinel = unsafe.Pointer(&s)
)

// We use a distinct defined type here (not an alias) because otherwise it
// would be possible for a user to enqueue a value of the same underlying
// type. If that happened, the type assertion in Dequeue could send on one of
// the elements passed in (which is wrong, and could potentially deadlock the
// program as well).
type waitch chan struct{}

func waitChan() waitch { return make(chan struct{}, 2) }

// Possible histories of the values of a cell:
// waitch   ::= channel that a sender waits on when it is over buffer size
// recvChan ::= channel that a receiver waits on when it has to receive a value
//   - nil -> sentinel -> value
//   - nil -> sentinel -> recvChan
//   - nil -> value
//   - nil -> recvChan
// These two may require someone to send on the waitch before transitioning:
//   - nil -> waitch -> value
//   - nil -> waitch -> recvChan

// BoundedChan is a thread-local handle onto a bounded channel.
type BoundedChan struct {
	q          *queue
	head, tail *segment
	bound      uint64
}

// NewBounded allocates a new queue and returns a handle to that queue. Further
// handles are created by calling NewHandle on the result of NewBounded.
func NewBounded(bufsz uint64) *BoundedChan {
	segPtr := &segment{}
	cur := segPtr
	for b := uint64(segSize); b < bufsz; b += segSize {
		cur.Next = &segment{ID: index(b) >> segShift}
		cur = cur.Next
	}
	q := &queue{
		H:           0,
		T:           0,
		SpareAllocs: segList{MaxSpares: int64(runtime.GOMAXPROCS(0))},
	}
	return &BoundedChan{
		q:     q,
		head:  segPtr,
		tail:  segPtr,
		bound: bufsz,
	}
}

// NewHandle creates a new handle for the given queue.
func (b *BoundedChan) NewHandle() *BoundedChan {
	return &BoundedChan{
		q:     b.q,
		head:  b.head,
		tail:  b.tail,
		bound: b.bound,
	}
}

func (b *BoundedChan) adjust() {
	// TODO: factor this out into a helper so that bounded and unbounded can
	// use the same code.
	H := index(atomic.LoadUint64((*uint64)(&b.q.H)))
	T := index(atomic.LoadUint64((*uint64)(&b.q.T)))
	cellH, _ := H.SplitInd()
	advance(&b.head, cellH)
	cellT, _ := T.SplitInd()
	advance(&b.tail, cellT)
}

// tryCas attempts to cas seg.Data[segInd] from sentinel to elt, then (if that
// fails) from nil to elt, then from sentinel to elt once more.
func tryCas(seg *segment, segInd index, elt unsafe.Pointer) bool {
	return atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])), sentinel, elt) ||
		atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])), unsafe.Pointer(nil), elt) ||
		atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])), sentinel, elt)
}

// Enqueue sends e on b. If there are already >=bound goroutines blocking, then
// Enqueue will block until sufficiently many elements have been received.
func (b *BoundedChan) Enqueue(e Elt) {
	b.adjust()
	startHead := index(atomic.LoadUint64((*uint64)(&b.q.H)))
	myInd := index(atomic.AddUint64((*uint64)(&b.q.T), 1) - 1)
	cell, cellInd := myInd.SplitInd()
	seg := b.q.findCell(b.tail, cell)
	if myInd > startHead && (myInd-startHead) > index(uint64(b.bound)) {
		// There is a chance that we have to block.
		const patience = 4
		for i := 0; i < patience; i++ {
			if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), sentinel, unsafe.Pointer(e)) {
				// Between us reading startHead and now, there were enough
				// increments to make it the case that we should no longer
				// block.
				if debug {
					dbgPrint("[enq] swapped out for sentinel\n")
				}
				return
			}
		}
		var w interface{} = makeWeakWaiter(2)
		if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), unsafe.Pointer(nil), unsafe.Pointer(Elt(&w))) {
			// We successfully swapped in w. No one will overwrite this
			// location unless they signal w first. We block.
			w.(*weakWaiter).Wait()
			if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), unsafe.Pointer(Elt(&w)), unsafe.Pointer(e)) {
				if debug {
					dbgPrint("[enq] blocked then swapped successfully\n")
				}
				return
			}
			// A dequeuer swapped its waiter into this location; we need to
			// use the slow path below.
		} else if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), sentinel, unsafe.Pointer(e)) {
			// Between us reading startHead and now, there were enough
			// increments to make it the case that we should no longer
			// block.
			if debug {
				dbgPrint("[enq] swapped out for sentinel\n")
			}
			return
		}
	} else {
		// Normal case. We know we don't have to block because b.q.H can only
		// increase.
		if tryCas(seg, cellInd, unsafe.Pointer(e)) {
			if debug {
				dbgPrint("[enq] successful tryCas\n")
			}
			return
		}
	}
	for i := 0; ; i++ { // will run at most twice
		if i >= 2 {
			panic("[enq] bug!")
		}
		ptr := atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])))
		w := (*waiter)(ptr)
		w.Send(e)
		if debug {
			dbgPrint("[enq] sending to waiter on %v\n", ptr)
		}
		return
	}
}

// Dequeue receives an Elt from b. It blocks if there are no elements enqueued
// there.
func (b *BoundedChan) Dequeue() Elt {
	b.adjust()
	myInd := index(atomic.AddUint64((*uint64)(&b.q.H), 1) - 1)
	cell, segInd := myInd.SplitInd()
	seg := b.q.findCell(b.head, cell)
	// If there are enqueuers waiting to complete due to the buffer size, we
	// take responsibility for waking up the thread that FA'ed b.q.H + b.bound.
	// If bound is zero, that is just the current thread. Otherwise we have to
	// do some extra work. The thread we are waking up is referred to in names
	// and comments as our 'buddy'.
	var (
		bCell, bInd index
		bSeg        *segment
	)
	if b.bound > 0 {
		buddy := myInd + index(b.bound)
		bCell, bInd = buddy.SplitInd()
		bSeg = b.q.findCell(b.head, bCell)
	}
	w := makeWaiter()
	var res Elt
	if tryCas(seg, segInd, unsafe.Pointer(w)) {
		if debug {
			dbgPrint("[deq] getting res from channel %v\n", w)
		}
		res = w.Recv()
	} else {
		// tryCas failed, which means that, by the "possible histories"
		// argument, this must be either an Elt, a waiter, or a weakWaiter. It
		// cannot be a waiter because we are the only actor allowed to swap
		// one into this location. Thus it must either be a weakWaiter or an
		// Elt. If it is a weakWaiter, then we must signal it before cas'ing
		// in w, otherwise the other thread could starve. If it is a normal
		// Elt we do the rest of the protocol. This also means that we can
		// safely load an Elt from seg, which is not always the case because
		// sentinel is not an Elt.
		//
		// Step 1: We failed to put our waiter into segInd. That means that
		// either our value is in there, or there is a weakWaiter in there.
		// Either way these are valid Elts and we can reliably distinguish
		// them with a type assertion.
		elt := seg.Load(segInd)
		res = elt
		if ww, ok := (*elt).(*weakWaiter); ok {
			ww.Signal()
			if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])), unsafe.Pointer(elt), unsafe.Pointer(w)) {
				if debug {
					dbgPrint("[deq] getting res from channel slow %v\n", w)
				}
				res = w.Recv()
			} else {
				// Someone cas'ed a value over the weakWaiter; it could only
				// have been our friend on the enqueue side.
				if debug {
					dbgPrint("[deq] getting res from load\n")
				}
				res = seg.Load(segInd)
			}
		}
	}
	for i := 0; b.bound > 0; i++ {
		if i >= 2 {
			panic("[deq] bug!")
		}
		// We have successfully gotten the value out of our cell. Now we must
		// ensure that our buddy is either woken up if they are waiting, or
		// that they will know not to sleep.
		//
		// If bElt is not nil, it either has an Elt in it or a weakWaiter. If
		// it has a weakWaiter then we need to signal it to wake up the buddy.
		// If it is nil then we attempt to cas sentinel into the buddy index.
		// If we fail, the buddy may have cas'ed in a weakWaiter, so we must
		// go again. However, that will only happen once.
		bElt := bSeg.Load(bInd) // could this be sentinel? I don't think so..
		if bElt != nil {
			if ww, ok := (*bElt).(*weakWaiter); ok {
				ww.Signal()
			}
			// There is a real queue value in bSeg.Data[bInd], therefore
			// buddy cannot be waiting.
			break
		}
		// Let buddy know that they do not have to block.
		if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&bSeg.Data[bInd])), unsafe.Pointer(nil), sentinel) {
			break
		}
	}
	return res
}

================================================
FILE: src/fchan/fchan_test.go
================================================
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. package fchan import ( "reflect" "sync" "testing" ) const perThread = 256 func unorderedEltsEq(s1, s2 []int) bool { readMap := func(i []int) map[int]int { res := make(map[int]int) for _, ii := range i { res[ii]++ } return res } return reflect.DeepEqual(readMap(s1), readMap(s2)) } func TestBoundedQueueElements(t *testing.T) { const numInputs = (1 << 20) bounds := []uint64{0, 1, 1024, segSize} for _, bound := range bounds { var inputs []int var wg sync.WaitGroup for i := 0; i < numInputs; i++ { inputs = append(inputs, i) } h := NewBounded(bound) ch := make(chan int, 1024) for i := 0; i < numInputs/perThread; i++ { wg.Add(1) go func(i int) { hn := h.NewHandle() for j := 0; j < perThread; j++ { var inp interface{} = inputs[i*perThread+j] hn.Enqueue(&inp) } wg.Done() }(i) wg.Add(1) go func() { hn := h.NewHandle() for j := 0; j < perThread; j++ { out := hn.Dequeue() outInt := (*out).(int) ch <- outInt } wg.Done() }() } var outs []int for i := 0; i < numInputs; i++ { outs = append(outs, <-ch) } close(ch) if !unorderedEltsEq(outs, inputs) { t.Errorf("expected %v, got %v", inputs, outs) } wg.Wait() } } func TestQueueElements(t *testing.T) { const numInputs = 1 << 20 iters := numInputs / perThread var inputs []int var wg sync.WaitGroup for i := 0; i < numInputs; i++ { inputs = append(inputs, i) } h := New() ch := make(chan int, 1024) for i := 0; i < iters; i++ { wg.Add(1) go func(i int) { hn := h.NewHandle() for j := 0; j < perThread; j++ { var inp interface{} = inputs[i*perThread+j] hn.Enqueue(&inp) } wg.Done() }(i) wg.Add(1) go func() { hn := h.NewHandle() for j := 
0; j < perThread; j++ { out := hn.Dequeue() ch <- (*out).(int) } wg.Done() }() } var outs []int for i := 0; i < numInputs; i++ { outs = append(outs, <-ch) } close(ch) if !unorderedEltsEq(outs, inputs) { t.Errorf("expected %v, got %v", inputs, outs) } wg.Wait() } func TestSerialQueue(t *testing.T) { const runs = 3*segSize + 1 h := New() var msg interface{} = "hi" for i := 0; i < runs; i++ { var m interface{} = msg h.Enqueue(&m) } for i := 0; i < runs; i++ { p := h.Dequeue() if !reflect.DeepEqual(*p, msg) { t.Errorf("expected %v, got %v", msg, *p) } } } func TestConcurrentQueueAddFirst(t *testing.T) { const runs = 3*segSize + 1 var wg sync.WaitGroup h := New() var msg interface{} = "hi" t.Logf("Spawning %v adding goroutines", runs) for i := 0; i < runs; i++ { var m interface{} = msg wg.Add(1) go func() { hn := h.NewHandle() hn.Enqueue(&m) wg.Done() }() } t.Logf("Spawning %v getting goroutines", runs) for i := 0; i < runs; i++ { wg.Add(1) go func() { hn := h.NewHandle() p := hn.Dequeue() if !reflect.DeepEqual(*p, msg) { t.Errorf("expected %v, got %v", msg, *p) } wg.Done() }() } wg.Wait() } func TestConcurrentQueueTakeFirst(t *testing.T) { const runs = 2*segSize + 1 // 4*segSize + 1 var wg sync.WaitGroup h := New() var msg interface{} = "hi" t.Logf("Spawning %v getting goroutines", runs) for i := 0; i < runs; i++ { wg.Add(1) go func() { hn := h.NewHandle() p := hn.Dequeue() if !reflect.DeepEqual(*p, msg) { t.Errorf("expected %v, got %v", msg, *p) } wg.Done() }() } t.Logf("Spawning %v adding goroutines", runs) for i := 0; i < runs; i++ { var m interface{} = msg wg.Add(1) go func() { hn := h.NewHandle() hn.Enqueue(&m) wg.Done() }() } wg.Wait() } func minN(b *testing.B) int { if b.N < 2 { return 2 } return b.N } ================================================ FILE: src/fchan/q.go ================================================ // Copyright 2016 Google Inc. 
// // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. package fchan import ( "fmt" "sync" "sync/atomic" "unsafe" ) // basic debug infrastructure const debug = false var dbgPrint = func(s string, i ...interface{}) { fmt.Printf(s, i...) } // Elt is the element type of a queue, can be any pointer type type Elt *interface{} type index uint64 type listElt *segment type waiter struct { E Elt Wgroup sync.WaitGroup } func makeWaiter() *waiter { wait := &waiter{} wait.Wgroup.Add(1) return wait } func (w *waiter) Send(e Elt) { atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)), unsafe.Pointer(e)) w.Wgroup.Done() } func (w *waiter) Recv() Elt { w.Wgroup.Wait() return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)))) } /* type weakWaiter struct { cond *sync.Cond sync.Mutex woke int64 } func makeWeakWaiter(i int32) *weakWaiter { w := &weakWaiter{} w.cond = sync.NewCond(w) return w } func (w *weakWaiter) Signal() { w.Lock() w.woke++ w.cond.Signal() w.Unlock() } func (w *weakWaiter) Wait() { w.Lock() for w.woke == 0 { w.cond.Wait() } w.Unlock() } //*/ /* // Idea to get beyond the scalability bottleneck when number of goroutines is // much larger than gomaxprocs. Have an array of channels with large buffers // (or unbuffered channels?) and group threads into these larger groups. This // means weakWaiters are attached to queue-level state. It has the disadvantage // of making ordering a bit more difficult, as later receivers could wake up // earlier senders. 
I think this is fine, but it merits some thought. type weakWaiter chan struct{} func makeWeakWaiter(i int32) *weakWaiter { var ch weakWaiter = make(chan struct{}, i) return &ch } func (w *weakWaiter) Signal() { *w <- struct{}{} } func (w *weakWaiter) Wait() { <-(*w) } //*/ //* type weakWaiter struct { OSize int32 Size int32 Wgroup sync.WaitGroup } func makeWeakWaiter(i int32) *weakWaiter { wait := &weakWaiter{Size: i, OSize: i} wait.Wgroup.Add(1) return wait } func (w *weakWaiter) Signal() { newVal := atomic.AddInt32(&w.Size, -1) orig := atomic.LoadInt32(&w.OSize) if newVal+1 == orig { w.Wgroup.Done() } } func (w *weakWaiter) Wait() { w.Wgroup.Wait() } // */ // segList is a best-effort data-structure for storing spare segment // allocations. The TryPush and TryPop methods follow standard algorithms for // lock-free linked lists. They have an inconsistent length counter: it may // underestimate the true length of the data-structure, but this allows // threads to bail out early. This is safe because the slow path of allocating // a new segment in Grow still works. type segList struct { MaxSpares int64 Length int64 Head *segLink } // segLink is a list element in a segList. Note that we cannot just re-use the // segment Next pointers without modifying the algorithm, as TryPush could // potentially sever pointers in the live queue data structure. That would // break everything. type segLink struct { Elt listElt Next *segLink } func (s *segList) TryPush(e listElt) { // bail out if list is at capacity if atomic.LoadInt64(&s.Length) >= s.MaxSpares { return } // add to length. Note that this is not atomic with respect to the append, // which means we may be under capacity on occasion. This list is only used // in a best-effort capacity, so that is okay.
atomic.AddInt64(&s.Length, 1) if debug { dbgPrint("Length now %v\n", s.Length) } tl := &segLink{ Elt: e, Next: nil, } const patience = 4 i := 0 for ; i < patience; i++ { // attempt to CAS Head from nil to tl if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)), unsafe.Pointer(nil), unsafe.Pointer(tl)) { break } // try to find an empty element tailPtr := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)))) if tailPtr == nil { // if Head was switched to nil, retry continue } // advance tailPtr until it has a nil next pointer for { next := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next)))) if next == nil { break } tailPtr = next } // try to add something to the end of the list if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next)), unsafe.Pointer(nil), unsafe.Pointer(tl)) { break } } if i == patience { atomic.AddInt64(&s.Length, -1) } if debug && i != patience { dbgPrint("Successfully pushed to segment list\n") } }
See the comments in // TryPush() if atomic.LoadInt64(&s.Length) <= 0 { return nil, false } for i := 0; i < patience; i++ { hd := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)))) if hd == nil { return nil, false } // if head is not nil, try to swap it for its next pointer nxt := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&hd.Next)))) if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)), unsafe.Pointer(hd), unsafe.Pointer(nxt)) { if debug { dbgPrint("Successfully popped off segment list\n") } atomic.AddInt64(&s.Length, -1) return hd.Elt, true } } return nil, false } // segment size const segShift = 12 const segSize = 1 << segShift // The channel buffer is stored as a linked list of fixed-size arrays of size // segsize. ID is a monotonically increasing identifier corresponding to the // index in the buffer of the first element of the segment, divided by segSize // (see SplitInd). type segment struct { ID index Next *segment Data [segSize]Elt } // Load atomically loads the element at index i of s func (s *segment) Load(i index) Elt { return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Data[i])))) } // Queue is the global state of the channel. It contains indices into the head // and tail of the channel as well as a linked list of spare segments used to // avoid excess allocations. type queue struct { H index // head index T index // tail index SpareAllocs segList } // SplitInd splits i into the ID of the segment to which it refers as well as // the local index into that segment func (i index) SplitInd() (cellNum index, cellInd index) { cellNum = (i >> segShift) cellInd = i - (cellNum * segSize) return } const spare = true // grow is called if a thread has arrived at the end of the segment list but // needs to enqueue/dequeue from an index with a higher cell ID. In this case we // attempt to assign the segment's next pointer to a new segment. 
Allocating // segments can be expensive, so the underlying queue has a 'SpareAllocs' list of // segments that can be used to grow the queue, or to store unused segments that the // thread allocates. The presence of 'SpareAllocs' complicates the protocol quite // a bit, but it is wait-free (aside from memory allocation) and it will only // return if tail.Next is non-nil. func (q *queue) Grow(tail *segment) { curTail := atomic.LoadUint64((*uint64)(&tail.ID)) if spare { if next, ok := q.SpareAllocs.TryPop(); ok { atomic.StoreUint64((*uint64)(&next.ID), curTail+1) if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)), unsafe.Pointer(nil), unsafe.Pointer(next)) { return } } } newSegment := &segment{ID: index(curTail + 1)} if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)), unsafe.Pointer(nil), unsafe.Pointer(newSegment)) { if debug { dbgPrint("\t\tgrew\n") } return } if spare { // If we allocated a new segment but failed, attempt to place it in // SpareAllocs so someone else can use it. q.SpareAllocs.TryPush(newSegment) } } // advance will search for a segment with ID cell at or after the segment in // ptr. It returns with ptr either pointing to the cell in question or to the // last non-nil segment in the list. func advance(ptr **segment, cell index) { for { next := (*segment)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&(*ptr).Next)))) if next == nil || next.ID > cell { break } *ptr = next } } ================================================ FILE: src/fchan/unbounded.go ================================================ // Copyright 2016 Google Inc. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. package fchan import ( "runtime" "sync/atomic" "unsafe" ) // Thread-local state for interacting with an unbounded channel type UnboundedChan struct { // pointer to global state q *queue // pointer into last guess at the true head and tail segments head, tail *segment } // New initializes a new queue and returns an initial handle to that queue. All // other handles are allocated by calls to NewHandle() func New() *UnboundedChan { segPtr := &segment{} // 0 values are fine here q := &queue{ H: 0, T: 0, SpareAllocs: segList{MaxSpares: int64(runtime.GOMAXPROCS(0))}, } h := &UnboundedChan{ q: q, head: segPtr, tail: segPtr, } return h } // NewHandle creates a new handle for the given Queue. func (u *UnboundedChan) NewHandle() *UnboundedChan { return &UnboundedChan{ q: u.q, head: u.head, tail: u.tail, } } // Enqueue enqueues an Elt into the channel // TODO(ezrosent): enforce that e is not nil; I think we make that assumption // here. func (u *UnboundedChan) Enqueue(e Elt) { u.adjust() // don't always do this? myInd := index(atomic.AddUint64((*uint64)(&u.q.T), 1) - 1) cell, cellInd := myInd.SplitInd() seg := u.q.findCell(u.tail, cell) if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), unsafe.Pointer(nil), unsafe.Pointer(e)) { return } wt := (*waiter)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])))) wt.Send(e) } // findCell finds a segment at or after start with ID cellID. If one does not // yet exist, it grows the list of segments.
func (q *queue) findCell(start *segment, cellID index) *segment { cur := start for cur.ID != cellID { next := (*segment)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&cur.Next)))) if next == nil { q.Grow(cur) continue } cur = next } return cur } // adjust moves u's head and tail pointers forward if H and T point to a newer // segment. The loads and moves do not need to be atomic because H and T only // ever increase in value. Calling this regularly is probably good for // performance, and is necessary to ensure that old segments are garbage // collected. func (u *UnboundedChan) adjust() { H := index(atomic.LoadUint64((*uint64)(&u.q.H))) T := index(atomic.LoadUint64((*uint64)(&u.q.T))) cellH, _ := H.SplitInd() advance(&u.head, cellH) cellT, _ := T.SplitInd() advance(&u.tail, cellT) } // Dequeue dequeues an element from the channel; it will block if nothing is there func (u *UnboundedChan) Dequeue() Elt { u.adjust() myInd := index(atomic.AddUint64((*uint64)(&u.q.H), 1) - 1) cell, cellInd := myInd.SplitInd() seg := u.q.findCell(u.head, cell) elt := seg.Load(cellInd) wt := makeWaiter() if elt == nil && atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])), unsafe.Pointer(nil), unsafe.Pointer(wt)) { if debug { dbgPrint("\t[deq] slow path\n") } return wt.Recv() } return seg.Load(cellInd) } ================================================ FILE: writeup/graphs.py ================================================ # Copyright 2016 Google Inc. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # limitations under the License. """ This is a basic script that parses the output of fchan_main and renders the graphs for goroutines=GOMAXPROCS and goroutines=5000 """ import numpy as np import matplotlib.pyplot as plt import seaborn as unused_import import re import sys class BenchResult(object): def __init__(self, name, gmp, max_hw_thr, nops, secs): self.name = name self.gmp = gmp self.max_hw_thr = int(max_hw_thr) # Millions of operations per second self.tp = float(nops) / (float(secs) * 1e6) def parse_line(line): m_gmp = re.match(r'^([^\-]*)GMP-(\d+)\s+(\d+)\s+([^\s]*)s\s*$', line) m2 = re.match(r'^([^\-]*)-(\d+)\s+(\d+)\s+([^\s]*)s\s*$', line) if m_gmp is not None: name, threads, nops, secs = m_gmp.groups() return BenchResult(name, True, threads, nops, secs) if m2 is not None: name, threads, nops, secs = m2.groups() return BenchResult(name, False, threads, nops, secs) print line, 'did not match anything' return None def plot_points(all_results, gmp): series = sorted(list({k.name for k in all_results if k.gmp == gmp})) for k in series: results = [r for r in all_results if r.gmp == gmp and r.name == k] points = sorted((r.max_hw_thr, r.tp) for r in results) plt.xlabel(r'GOMAXPROCS') plt.ylabel('Ops / second (millions)') X = np.array([x for (x, y) in points]) Y = np.array([y for (x, y) in points]) plt.plot(X, Y, label=k) plt.scatter(X, Y) plt.legend() def main(fname): with open(fname) as f: results = [p for p in (parse_line(line) for line in f) if p is not None] print 'Generating non-GMP graph' plt.title('5000 Goroutines') plot_points(results, False) plt.savefig('contend_graph.pdf') plt.clf() print 'Generating GMP graph' plt.title('Goroutines Equal to GOMAXPROCS') plot_points(results, True) plt.savefig('gmp_graph.pdf') plt.clf() if __name__ == '__main__': main(sys.argv[1]) ================================================ FILE: writeup/latex.template 
================================================ \documentclass[$if(fontsize)$$fontsize$,$endif$$if(lang)$$babel-lang$,$endif$$if(papersize)$$papersize$,$endif$$for(classoption)$$classoption$$sep$,$endfor$]{$documentclass$} $if(fontfamily)$ \usepackage{$fontfamily$} $else$ %\usepackage{lmodern} $endif$ $if(linestretch)$ \usepackage{setspace} \setstretch{$linestretch$} $endif$ \usepackage{amssymb,amsmath} \usepackage{ifxetex,ifluatex} \usepackage{fixltx2e} % provides \textsubscript \ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} $if(euro)$ \usepackage{eurosym} $endif$ \else % if luatex or xelatex \ifxetex \usepackage{mathspec} \usepackage{xltxtra,xunicode} \else \usepackage{fontspec} \fi \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase} \newcommand{\euro}{€} $if(mainfont)$ \setmainfont{$mainfont$} $endif$ $if(sansfont)$ \setsansfont{$sansfont$} $endif$ $if(monofont)$ \setmonofont[Mapping=tex-ansi]{$monofont$} $endif$ $if(mathfont)$ \setmathfont(Digits,Latin,Greek){$mathfont$} $endif$ $if(CJKmainfont)$ \usepackage{xeCJK} \setCJKmainfont[$CJKoptions$]{$CJKmainfont$} $endif$ \fi % use upquote if available, for straight quotes in verbatim environments \IfFileExists{upquote.sty}{\usepackage{upquote}}{} % use microtype if available \IfFileExists{microtype.sty}{% \usepackage{microtype} \UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts }{} $if(geometry)$ \usepackage[$for(geometry)$$geometry$$sep$,$endfor$]{geometry} $endif$ \ifxetex \usepackage[setpagesize=false, % page size defined by xetex unicode=false, % unicode breaks when used with xetex xetex]{hyperref} \else \usepackage[unicode=true]{hyperref} \fi \usepackage[usenames,dvipsnames]{color} \hypersetup{breaklinks=true, bookmarks=true, pdfauthor={$author-meta$}, pdftitle={$title-meta$}, colorlinks=true, citecolor=$if(citecolor)$$citecolor$$else$blue$endif$, urlcolor=$if(urlcolor)$$urlcolor$$else$blue$endif$, 
linkcolor=$if(linkcolor)$$linkcolor$$else$magenta$endif$, pdfborder={0 0 0}} \urlstyle{same} % don't use monospace font for urls $if(lang)$ \ifxetex \usepackage{polyglossia} \setmainlanguage[variant=$polyglossia-variant$]{$polyglossia-lang$} \setotherlanguages{$for(polyglossia-otherlangs)$$polyglossia-otherlangs$$sep$,$endfor$} \else \usepackage[shorthands=off,$babel-lang$]{babel} \fi $endif$ $if(natbib)$ \usepackage{natbib} \bibliographystyle{$if(biblio-style)$$biblio-style$$else$plainnat$endif$} $endif$ $if(biblatex)$ \usepackage{biblatex} $for(bibliography)$ \addbibresource{$bibliography$} $endfor$ $endif$ $if(listings)$ \usepackage{listings} $endif$ $if(lhs)$ \lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{} $endif$ $if(highlighting-macros)$ $highlighting-macros$ $endif$ $if(verbatim-in-note)$ \usepackage{fancyvrb} \VerbatimFootnotes $endif$ $if(tables)$ \usepackage{longtable,booktabs} $endif$ $if(graphics)$ \usepackage{graphicx,grffile} \makeatletter \def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi} \def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi} \makeatother % Scale images if necessary, so that they will not overflow the page % margins by default, and it is still possible to overwrite the defaults % using explicit options in \includegraphics[width, height, ...]{} \setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio} $endif$ $if(links-as-notes)$ % Make links footnotes instead of hotlinks: \renewcommand{\href}[2]{#2\footnote{\url{#1}}} $endif$ $if(strikeout)$ \usepackage[normalem]{ulem} % avoid problems with \sout in headers with hyperref: \pdfstringdefDisableCommands{\renewcommand{\sout}{}} $endif$ \setlength{\parindent}{0pt} \setlength{\parskip}{6pt plus 2pt minus 1pt} \setlength{\emergencystretch}{3em} % prevent overfull lines \providecommand{\tightlist}{% \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}} $if(numbersections)$ 
\setcounter{secnumdepth}{5} $else$ \setcounter{secnumdepth}{0} $endif$ $if(verbatim-in-note)$ \VerbatimFootnotes % allows verbatim text in footnotes $endif$ $if(dir)$ \ifxetex % load bidi as late as possible as it modifies e.g. graphicx $if(latex-dir-rtl)$ \usepackage[RTLdocument]{bidi} $else$ \usepackage{bidi} $endif$ \fi \ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex \TeXXeTstate=1 \newcommand{\RL}[1]{\beginR #1\endR} \newcommand{\LR}[1]{\beginL #1\endL} \newenvironment{RTL}{\beginR}{\endR} \newenvironment{LTR}{\beginL}{\endL} \fi $endif$ $if(title)$ \title{$title$$if(subtitle)$\\\vspace{0.5em}{\large $subtitle$}$endif$} $endif$ $if(author)$ \usepackage{fancyhdr} \fancypagestyle{plain}{} \pagestyle{fancy} \fancyhead[LO,RE]{\large $for(author)$$author.name$$sep$ \and $endfor$} %\author{$for(author)$$author.name$$sep$ \and $endfor$} $endif$ \date{$date$} $for(header-includes)$ $header-includes$ $endfor$ % Redefines (sub)paragraphs to behave more like sections \ifx\paragraph\undefined\else \let\oldparagraph\paragraph \renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}} \fi \ifx\subparagraph\undefined\else \let\oldsubparagraph\subparagraph \renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}} \fi \begin{document} $if(title)$ \maketitle $endif$ $if(abstract)$ \begin{abstract} $abstract$ \end{abstract} $endif$ $for(include-before)$ $include-before$ $endfor$ $if(toc)$ { \vspace{-0.9in} \hypersetup{linkcolor=$if(toccolor)$$toccolor$$else$black$endif$} \setcounter{tocdepth}{$toc-depth$} \tableofcontents } $endif$ $if(lot)$ \listoftables $endif$ $if(lof)$ \listoffigures $endif$ $body$ $if(natbib)$ $if(bibliography)$ $if(biblio-title)$ $if(book-class)$ \renewcommand\bibname{$biblio-title$} $else$ \renewcommand\refname{$biblio-title$} $endif$ $endif$ \bibliography{$for(bibliography)$$bibliography$$sep$,$endfor$} $endif$ $endif$ $if(biblatex)$ \printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$ $endif$ $for(include-after)$ $include-after$ $endfor$ 
\end{document} ================================================ FILE: writeup/refs.bib ================================================ @inproceedings{wfq, title={A wait-free queue as fast as fetch-and-add}, author={Yang, Chaoran and Mellor-Crummey, John}, booktitle={Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, pages={16}, year={2016}, organization={ACM} } @inproceedings{lcrq, title={Fast concurrent queues for x86 processors}, author={Morrison, Adam and Afek, Yehuda}, booktitle={ACM SIGPLAN Notices}, volume={48}, number={8}, pages={103--112}, year={2013}, organization={ACM} } @incollection{CSP, title={Communicating sequential processes}, author={Hoare, Charles Antony Richard}, booktitle={The origin of concurrent programming}, pages={413--443}, year={1978}, publisher={Springer} } @book{tgpl, author = {Donovan, Alan A.A. and Kernighan, Brian W.}, title = {The Go Programming Language}, year = {2015}, isbn = {0134190440, 9780134190440}, edition = {1st}, publisher = {Addison-Wesley Professional}, } @inproceedings{MSQueue, title={Simple, fast, and practical non-blocking and blocking concurrent queue algorithms}, author={Michael, Maged M and Scott, Michael L}, booktitle={Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing}, pages={267--275}, year={1996}, organization={ACM} } @article{herlihyBook, title={The Art of Multiprocessor Programming}, author={Herlihy, Maurice and Shavit, Nir}, year={2008}, publisher={Morgan Kaufmann Publishers Inc.} } @online{GoSpec, title = {The Go Programming Language Specification}, howpublished = {\url{https://golang.org/ref/spec}}, year = {2009}, urldate = {2016-10-30} } @inproceedings{FastSlow, title={A methodology for creating fast wait-free data structures}, author={Kogan, Alex and Petrank, Erez}, booktitle={ACM SIGPLAN Notices}, volume={47}, number={8}, pages={141--150}, year={2012}, organization={ACM} } @article{wfsync, title={Wait-free 
synchronization}, author={Herlihy, Maurice}, journal={ACM Transactions on Programming Languages and Systems (TOPLAS)}, volume={13}, number={1}, pages={124--149}, year={1991}, publisher={ACM} } @incollection{marlowPar, title={Parallel and concurrent programming in Haskell}, author={Marlow, Simon}, booktitle={Central European Functional Programming School}, pages={339--401}, year={2012}, publisher={Springer} } @article{herlihyLinear, title={Linearizability: A correctness condition for concurrent objects}, author={Herlihy, Maurice P and Wing, Jeannette M}, journal={ACM Transactions on Programming Languages and Systems (TOPLAS)}, volume={12}, number={3}, pages={463--492}, year={1990}, publisher={ACM} } ================================================ FILE: writeup/writeup.md ================================================
---
title: Faster Channels in Go (Work in Progress)
subtitle: Scaling Blocking Channels with Techniques from Nonblocking Data-Structures.
toc: true
link-citations: true
geometry: ['margin=1in']
fontsize: 11pt
author:
  name: Eli Rosenthal
---

# Introduction

Channels in the [Go](https://golang.org/) language are a common way to structure concurrent code. The channel API in Go is intended to support programming in the manner described by CSP [see @CSP, the original paper; also the preface of @tgpl for CSP's relationship to Go]. Channels in Go have a fixed buffer size $b$ such that only $b$ senders may return without having handed a value off to a corresponding receiver. Here is some basic pseudocode[^pseudo] for the send and receive operations[^select], though it is worth referring to the spec @GoSpec as well.
~~~~
send(c: chan T, item: T)
  atomically do:
    if the buffer is full
      block
    append an item to the buffer
    if there were any receivers blocked
      wake the first one up

receive(c: chan T) -> T
  atomically do:
  begin:
    if there are items in the buffer
      result = head of buffer
      advance the buffer head
      if there are any senders waiting
        wake the first sender up
      return result
    if the buffer is empty
      block
      goto begin
~~~~

Go channels currently require goroutines[^goroutine] to acquire a single lock before performing additional operations[^chanimp]. This makes contention for this lock a scalability bottleneck: while acquiring a mutex can be very fast, it means that only one thread can perform an operation on a queue at a time. This document describes the implementation of a novel channel algorithm that permits different sends and receives to complete in parallel. We will start with a review of recent literature on non-blocking queues. Then we will move on to describing the implementation of a fast *unbounded* channel in Go; this algorithm may be of independent interest. Finally, we will extend this design to provide the bounded semantics of Go channels. We will also report performance measurements for these algorithms.

# Non-blocking Queues

The standard data-structure closest to the notion of an unbounded channel is that of a FIFO queue. A queue supports enqueue and dequeue operations, where it is common for dequeue to be allowed to fail if there are no elements in the queue. There are myriad algorithms for concurrent queues which provide different guarantees in terms of progress and consistency [see @herlihyBook Chapter 10 for an overview], but we will focus here on *non-blocking* queues because of that literature's approach to making scalable concurrent data-structures. Informally, we say a data-structure is non-blocking if no thread can perform an operation that will require it to block any other threads for an unbounded amount of time.
As a result, no queue that requires a thread to take a lock can be non-blocking: one thread can acquire a lock and then be de-scheduled for an arbitrary amount of time, thereby blocking all other threads contending for the lock. Non-blocking algorithms generally use atomic instructions like Compare-And-Swap (CAS) to avoid different threads stepping on one another's toes [see @herlihyBook Chapter 3 for a tutorial on atomic synchronization primitives]. Non-blocking operations can exhibit a number of additional progress guarantees:

* **Obstruction Freedom** If there is only one thread executing an operation, that operation will complete in a finite number of steps.

* **Lock Freedom** Regardless of the number of threads executing an operation concurrently, at least one thread will complete the operation in a finite number of steps.

* **Wait Freedom** Any thread executing an operation is guaranteed to finish in a finite number of steps.

Non-blocking synchronization is not a panacea. The fact that there are hard upper bounds on how long it will take for a thread to complete an operation does not imply that the algorithm will perform better in practice. While wait-free data-structures are important for some embedded or real-time systems that need these strong guarantees, there are often blocking algorithms which perform better in terms of throughput than their lock-free or wait-free counterparts[^combine]. Still, non-blocking algorithms can shine in high-contention settings. A small number of CAS operations can amount to less overhead than acquiring a lock, and more fine-grained concurrency coupled with progress guarantees *can* reduce contention[^msqueue].

# Using Fetch-and-Add to Reduce Contention

The atomic Fetch-and-Add (F&A) instruction adds a value to an integer and returns the old or new value of that integer.
Here are the basic semantics of the operation in Go[^fasem]:

```go
// atomically
func AtomicFetchAdd(src *int, delta int) int {
	*src += delta
	return *src
}
```

While hardware support for an F&A instruction is not as universal as that of CAS, F&A is implemented on x86. On modern x86 machines, F&A is much faster than CAS [see @lcrq for performance measurements], and it always succeeds. This has the dual effect of allowing code making judicious use of F&A to be both efficient and easier to reason about than equivalents that rely only on CAS. A common pattern exemplifying this idea is to first use F&A to acquire an index into an array, and then to use more conventional techniques to write to that index. This is helpful because it can reduce contention on individual locations for a data-structure.

## A Non-blocking Queue From an Infinite Array

To illustrate this, we will write two non-blocking queues in pseudo-Go based on an infinite array [`Queue2` is based on the obstruction-free queue presented in pseudo-code in @wfq, `Queue1` is a CAS-ification of that design]. Both of these designs make use of the fact that head and tail pointers *only ever increase*.

~~~~ {.go .numberLines }
type Queue1 struct {
	head, tail *T
	data       [∞]T
}
func (q *Queue1) Enqueue(elt T) {
	for {
		newTail := atomic.LoadPointer(&q.tail) + 1
		if atomic.CompareAndSwapT(newTail, nil, elt) {
			atomic.CompareAndSwap(&q.tail, q.tail, newTail)
			break
		}
	}
}
func (q *Queue1) Dequeue() T {
	for {
		curHead := atomic.LoadPointer(&q.head)
		curTail := atomic.LoadPointer(&q.tail)
		if curHead == curTail {
			return nil
		}
		if atomic.CompareAndSwapPointer(&q.head, curHead, curHead+1) {
			return *curHead
		}
	}
}
~~~~

The second queue will assume that the type `T` can not only take on a `nil` value but also an unambiguous `SENTINEL` value that a user is guaranteed not to pass in to `Enqueue`. This value is used to mark an index as unusable, signalling a conflicting `Enqueue` thread that it should try again.
~~~~ {.go .numberLines startFrom="26"}
type Queue2 struct {
	head, tail uint
	data       [∞]T
}

func (q *Queue2) Enqueue(elt T) {
	for {
		myTail := atomic.AddUint(&q.tail, 1) - 1
		if atomic.CompareAndSwapT(&q.data[myTail], nil, elt) {
			break
		}
	}
}

func (q *Queue2) Dequeue() T {
	for {
		myHead := atomic.AddUint(&q.head, 1) - 1
		curTail := atomic.LoadUint(&q.tail)
		if !atomic.CompareAndSwapPointer(&q.data[myHead], nil, SENTINEL) {
			return atomic.LoadT(&q.data[myHead])
		}
		if myHead == curTail {
			return nil
		}
	}
}
~~~~

The core algorithm for both `Queue1` and `Queue2` is essentially the same. Enqueueing threads load a view of the tail pointer and try to CAS their element in one element after that pointer; dequeueing threads perform a symmetric operation to advance the head pointer. The practical (that is, practical for algorithms that require an infinite amount of memory) difference between `Queue1` and `Queue2` is that `Queue2` first has threads perform an atomic increment of a head or tail index. This means that two concurrent enqueue operations will always attempt a CAS on *different* queue elements. As a result, enqueue operations need only concern themselves with dequeue operations that increment `head` to the same value as their `myTail` (lines 33--34). A downside of this approach is that while `Queue1` is lock free, `Queue2` is merely obstruction free. For an enqueue/dequeue pair of threads, each can continually increment equal `head` and `tail` indices while the dequeuer's CAS (line 44) always succeeds before the enqueuer's (line 34), resulting in livelock[^livelockdef].

## Lessons for Channels

The `Queue2` above is the core of the implementation of a fast wait-free queue in @wfq. It is also the basic idea that we will leverage when designing a more scalable channel. The rest of their algorithm consists in solving three problems that have analogs in our setting.
(1) *Simulating an infinite array with a finite amount of memory.* Here the authors implement a linked list of fixed-length arrays (called segments, or cells); threads grow this array when more space is required. (2) *Going from obstruction freedom to wait freedom.* This involves attempting either `Dequeue` or `Enqueue` above for a constant number of iterations, followed by a slow path which involves implementing a helping mechanism[^helping] to help contending threads to finish their outstanding operations. (3) *Memory Reclamation.* Reclaiming memory in a non-blocking setting is, perhaps unsurprisingly, a very fraught task. While the solution to (3) in this paper is interesting and efficient, we will (mercifully) be relying on Go's garbage collection mechanism to solve this problem. For (1) we will employ essentially the same algorithm as the paper, but with additional optimizations for memory allocation. For (2) our slow path will implement the blocking semantics of a channel. # An Unbounded Channel With Low Contention We first consider the case of implementing an unbounded channel. While this channel is blocking --- Go channels must in some capacity be blocking as they provide a synchronization mechanism --- it only blocks when it has to (i.e. for receives that do not yet have a corresponding send), and when it does progress is impeded for at most 2 threads, the components of a send/receive pair. We will start with the types: ~~~ {.go } type Elt *interface{} type index uint64 // segment size const segShift = 12 const segSize = 1 << segShift // The channel buffer is stored as a linked list of fixed-size arrays of size // segsize. ID is a monotonically increasing identifier corresponding to the // index in the buffer of the first element of the segment, divided by segSize // (see SplitInd). 
type segment struct { ID index // index of Data[0] / segSize Next *segment Data [segSize]Elt } // Load atomically loads the element at index i of s func (s *segment) Load(i index) Elt { return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Data[i])))) } // Queue is the global state of the channel. It contains indices into the head // and tail of the channel as well as a linked list of spare segments used to // avoid excess allocations. type queue struct{ H, T index } // Thread-local state for interacting with an unbounded channel type UnboundedChan struct { // pointer to global state q *queue // pointer into last guess at the true head and tail segments head, tail *segment } ~~~~ The only data-structure-global global state that we employ is the `queue` structure which maintains the head and tail indices. Pointers into the data itself are kept locally in an `UnboundedChan` for two reasons (1) It reduces any possible contention resulting from updated shared head or tail pointers. (2) If individual threads all update local head and tail pointers, then the garbage collector will be able to clean up used segments when (and only when) all threads no longer hold a reference to them. We note that a downside of this design is that inactive threads that hold such a handle can cause space leaks by holding onto references to long-dead segments. Users interact with a channel by first creating an initial value, and later cloning that value and others derived from it using `NewHandle`. ~~~~ {.go } // New initializes a new queue and returns an initial handle to that queue. 
// All other handles are allocated by calls to NewHandle().
func New() *UnboundedChan {
    segPtr := &segment{} // 0 values are fine here
    q := &queue{H: 0, T: 0}
    h := &UnboundedChan{q: q, head: segPtr, tail: segPtr}
    return h
}

// NewHandle creates a new handle for the given channel
func (u *UnboundedChan) NewHandle() *UnboundedChan {
    return &UnboundedChan{q: u.q, head: u.head, tail: u.tail}
}
~~~~

## Sending and Receiving

The core enqueue (or send) algorithm is to atomically increment the `T` index, attempt to CAS in the item, and to wake up a blocked thread if the CAS fails. We will begin with the `Enqueue` code and then explain the code that it calls.

~~~~ {.go .numberLines}
func (u *UnboundedChan) Enqueue(e Elt) {
    u.adjust()
    myInd := index(atomic.AddUint64((*uint64)(&u.q.T), 1) - 1)
    cell, cellInd := myInd.SplitInd()
    seg := u.q.findCell(u.tail, cell)
    if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
        unsafe.Pointer(nil), unsafe.Pointer(e)) {
        return
    }
    wt := (*waiter)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd]))))
    wt.Send(e)
}

func (u *UnboundedChan) Dequeue() Elt {
    u.adjust()
    myInd := index(atomic.AddUint64((*uint64)(&u.q.H), 1) - 1)
    cell, cellInd := myInd.SplitInd()
    seg := u.q.findCell(u.head, cell)
    elt := seg.Load(cellInd)
    wt := makeWaiter()
    if elt == nil && atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
        unsafe.Pointer(nil), unsafe.Pointer(wt)) {
        return wt.Recv()
    }
    return seg.Load(cellInd)
}
~~~~

The `adjust` method (line 2) atomically loads `H` and `T`, then advances `u.head` and `u.tail` to point to their cells. The `atomic.AddUint64` on line 3 acquires an index into the queue. `SplitInd` (line 4) returns the cell ID and the index into that cell corresponding to `myInd`. As `T` can only increase, the only possible thread that could also be contending for this item is a `Dequeue`ing thread that acquired `H` as the same value as `myInd`.
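As an aside, `SplitInd` can be sketched as a shift and a mask over the constants defined earlier; this is an illustration, and the repository's actual implementation may differ in detail:

```go
// Constants and the index type repeated from above, for completeness.
const segShift = 12
const segSize = 1 << segShift

type index uint64

// SplitInd splits a logical queue index into a segment (cell) ID and an
// offset into that segment's Data array: the high bits select the segment,
// the low bits select the slot within it.
func (i index) SplitInd() (cell, cellInd index) {
    cell = i >> segShift        // i / segSize
    cellInd = i & (segSize - 1) // i % segSize
    return
}
```

For example, logical index $4101 = 1 \cdot 4096 + 5$ splits into cell 1, slot 5.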
So it comes down to the CASes on lines 6 and 21--22. If the first CAS fails, it means a `Dequeue` thread has swapped in a `waiter`; if it succeeds, an `Enqueue`r can return and a contending `Dequeue`r can simply load the value at `cellInd`.

## Blocking

So what is a `waiter`? It acts like a channel with buffer size 1, or an *MVar* in the Haskell community [see Chapter 7 of @marlowPar for an introduction], that can only tolerate one element being sent on it. We currently implement this in terms of a single value and a `WaitGroup`. `WaitGroup`s in Go's `sync` package allow goroutines to `Add` an integer value to the `WaitGroup`'s counter and to `Wait` for that counter to reach zero. If the counter goes below zero, the current `WaitGroup` implementation panics, which is helpful for debugging purposes, as there should only ever be one `Send` or `Recv` on a `waiter` here.

~~~~ {.go}
type waiter struct {
    E      Elt
    Wgroup sync.WaitGroup
}

func makeWaiter() *waiter {
    wait := &waiter{}
    wait.Wgroup.Add(1)
    return wait
}

func (w *waiter) Send(e Elt) {
    atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)), unsafe.Pointer(e))
    w.Wgroup.Done() // The Done method just calls Add(-1)
}

func (w *waiter) Recv() Elt {
    w.Wgroup.Wait()
    return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&w.E))))
}
~~~~

There are two important parts of our strategy to implement blocking. First, neither Enqueuers nor Dequeuers will block at all if Enqueuers complete before Dequeuers begin. In fact the only global synchronization they must perform is a single F&A and a single *uncontended* CAS (unless they must grow the queue; see below). Second, if an Enqueuer does not arrive soon enough and a Dequeuer must block on a `waiter`, there will be essentially no contention for the waiter, because only one other thread can ever interact with it.

## Growing the Queue and Allocation

We will now describe the implementation of the `findCell` method.
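Before doing so, here is a compact, self-contained sanity check of the `waiter` rendezvous. The definitions are condensed from the snippet above, and `rendezvous` is purely an illustrative harness, not part of the channel:

```go
import (
    "sync"
    "sync/atomic"
    "unsafe"
)

type Elt *interface{}

// waiter, makeWaiter, Send, and Recv, condensed from above.
type waiter struct {
    E      Elt
    Wgroup sync.WaitGroup
}

func makeWaiter() *waiter {
    w := &waiter{}
    w.Wgroup.Add(1)
    return w
}

func (w *waiter) Send(e Elt) {
    atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)), unsafe.Pointer(e))
    w.Wgroup.Done()
}

func (w *waiter) Recv() Elt {
    w.Wgroup.Wait()
    return Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&w.E))))
}

// rendezvous parks a receiver on a waiter until a sender completes the pair.
func rendezvous() interface{} {
    w := makeWaiter()
    go func() {
        var v interface{} = "hello"
        w.Send(Elt(&v)) // releases the Recv below
    }()
    return *w.Recv() // blocks until Send has run
}
```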
The algorithm is to start at a given segment pointer and to follow that segment's `Next` pointer until that segment's `ID` is equal to a given `cell` index. If `findCell` reaches the end of the list of segments before it reaches the correct index, it attempts to allocate a new segment and place it at the end of the list. Here is some code:

~~~~ {.go .numberLines}
func (q *queue) findCell(start *segment, cellID index) *segment {
    cur := start
    for cur.ID != cellID {
        next := (*segment)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&cur.Next))))
        if next == nil {
            q.Grow(cur)
            continue
        }
        cur = next
    }
    return cur
}

func (q *queue) Grow(tail *segment) {
    curTail := atomic.LoadUint64((*uint64)(&tail.ID))
    newSegment := &segment{ID: index(curTail + 1)}
    if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),
        unsafe.Pointer(nil), unsafe.Pointer(newSegment)) {
        return
    }
}
~~~~

Note that we can get away with performing a single CAS operation in `Grow`: if our CAS failed, we know someone else succeeded, and a new segment with an ID of `tail.ID+1` is the only possible value that could be placed there. However, there *is* a problem with this implementation: it is extremely wasteful. In a high-contention situation, it is possible for many threads to each allocate a new segment when only one of them will succeed. Any failed allocations become immediately unreachable and will hence be garbage collected. In our experiments, channel operations are fastest when segments have a size of $\geq 1024$, so any wasted allocation can have a tangible impact on throughput. This slowdown was evident in our performance measurements.

Our solution to this problem is to keep a lock-free linked list of spare segments in the `queue` structure. Threads in `Grow` first try to pop a segment off of this list, and then perform the CAS; only if the pop fails do they allocate a new segment. Symmetrically, if the CAS fails, threads attempt to push their segment onto this list.
The list keeps a best-effort counter representing its length and does not allow this counter to grow past a maximum length; this allows us to avoid a space leak in the implementation of the queue. For a full implementation of `Grow`, see Appendix A.

# Extending to the Bounded Case

Go channels do not have an unbounded variant. While the structure offered above is potentially useful, there are good reasons to prefer bounded channels in some settings[^bounded]. Moreover, unbuffered channels enable the synchronous programming style that is common in Go, in which a send/receive rendezvous synchronizes two cooperating threads; this level of synchronization is useful to have. This section describes the implementation of a bounded channel on top of the unbounded implementation above.

## Preliminaries

We re-use the `queue` and `segment` types, along with the `findCell` and `Grow` machinery. Almost all of the difference is in the new `Enqueue` and `Dequeue` operations. These are, however, significantly more complex. This complexity is the result of senders and receivers being given new responsibilities:

* Senders must decide if they should block and wait for more receivers to arrive.

* Receivers, if they succeed in popping an element off of the queue, must wake up any waiting senders that ought to wake up.

As before, this protocol is implemented in a manner that avoids blocking unless blocking is required by the channel semantics. This means the `Enqueue` and `Dequeue` methods must consider arbitrary interleavings of the unbounded channel protocol and the new blocking protocol. The `BoundedChan` has an additional integer field `bound` indicating the maximum number of senders permitted to return without having rendezvoused with a receiver. We also introduce an immutable global `sentinel` pointer used by receiving threads to signal that a sender should not block. A consequence of this design is that all places that previously required a CAS from `nil` to another value must now also attempt to CAS from `sentinel`.
We maintain the invariant that no value will transition from `sentinel` back to `nil`, so the `tryCas` function below guarantees that `seg.Data[segInd]` is neither `nil` nor `sentinel` when it returns (unless `e` is either of those). ## (Aside) Possible Histories of an Element in a Segment In the unbounded case, there were essentially two possible histories of a value in the queue: |Events | History | |----------------------|------------------------| |Sender, Receiver | `nil` $\to$ `Elt` | |Receiver, Sender | `nil` $\to$ `*waiter` | This can be viewed as the key invariant that is enforced in the implementation of unbounded channels. There are more histories in the bounded case. These (and only these) can all arise --- keeping this in mind is helpful for understanding the protocol: --------------------------------------------------------------------------------------------- Events History --------------------------------------------------- ---------------------------------------- Sender, Receiver `nil` $\to$ `Elt` Receiver, Sender `nil` $\to$ `*waiter` Waker, Sender, Receiver `nil` $\to$ `sentinel` $\to$ `Elt` Waker, Receiver, Sender `nil` $\to$ `sentinel` $\to$ `*waiter` $\textrm{Sender}^\dagger$, Waker, Sender, Receiver `nil` $\to$ `*weakWaiter` $\to$ `Elt` $\textrm{Sender}^\dagger$, Waker, Receiver, Sender `nil` $\to$ `*weakWaiter` $\to$ `*waiter` --------------------------------------------------- ---------------------------------------- Where $\textrm{Sender}^\dagger$ denotes that a sender arrives but must block for more receivers to complete, and a Waker is any thread that successfully wakes up a blocked Sender. The details of what a `weakWaiter` is and who exactly plays the role of "Waker" are covered in the following sections. 
## Enqueue

We first present the source of `tryCas` and `Enqueue`:

~~~~ {.go .numberLines}
func tryCas(seg *segment, segInd index, elt unsafe.Pointer) bool {
    return atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),
        unsafe.Pointer(nil), elt) ||
        atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),
            sentinel, elt)
}

// Enqueue sends e on b. If there are already >=bound goroutines blocking, then
// Enqueue will block until sufficiently many elements have been received.
func (b *BoundedChan) Enqueue(e Elt) {
    b.adjust()
    startHead := index(atomic.LoadUint64((*uint64)(&b.q.H)))
    myInd := index(atomic.AddUint64((*uint64)(&b.q.T), 1) - 1)
    cell, cellInd := myInd.SplitInd()
    seg := b.q.findCell(b.tail, cell)
    if myInd > startHead && (myInd-startHead) > index(uint64(b.bound)) {
        // there is a chance that we have to block
        var w interface{} = makeWeakWaiter(2)
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
            unsafe.Pointer(nil), unsafe.Pointer(Elt(&w))) {
            // we successfully swapped in w. No one will overwrite this
            // location unless they send on w first. We block.
            w.(*weakWaiter).Wait()
            if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
                unsafe.Pointer(Elt(&w)), unsafe.Pointer(e)) {
                return
            }
            // someone put a waiter into this location. We need to use the slow path
        } else if atomic.CompareAndSwapPointer(
            (*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),
            sentinel, unsafe.Pointer(e)) {
            // Between us reading startHead and now, there were enough
            // increments to make it the case that we should no longer
            // block.
            return
        }
    } else {
        // normal case. We know we don't have to block because b.q.H can only
        // increase.
        if tryCas(seg, cellInd, unsafe.Pointer(e)) {
            return
        }
    }
    ptr := atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])))
    w := (*waiter)(ptr)
    w.Send(e)
    return
}
~~~~

`Enqueue` starts by loading a value of `H` and then acquiring `myInd`. Note that this *is not* a consistent snapshot of the state of the queue, as `H` could have moved between loading it and acquiring `myInd` (lines 12--13). However, `H` will only increase! If `startHead` is within `b.bound` of `myInd`, it means that `H` is at most that far behind where `T` was when we performed the increment. In that case we can simply attempt to CAS in `e` (line 40). If that fails, it can only mean that a receiver has placed a `waiter` at this index, so we wake up the receiver and return (lines 44--46).

If there is a chance that we *do* have to block, then we allocate a new `weakWaiter`. A `weakWaiter` is like a `waiter` except that it does not carry a value, and it tolerates more than one signal being sent on it. There are many ways to implement such a construct in Go; here is an implementation in terms of a `WaitGroup`:

```go
type weakWaiter struct {
    OSize, Size int32
    Wgroup      sync.WaitGroup
}

func makeWeakWaiter(i int32) *weakWaiter {
    wait := &weakWaiter{Size: i, OSize: i}
    wait.Wgroup.Add(1)
    return wait
}

func (w *weakWaiter) Signal() {
    newVal := atomic.AddInt32(&w.Size, -1)
    orig := atomic.LoadInt32(&w.OSize)
    if newVal+1 == orig {
        // only the first Signal releases the Wait below
        w.Wgroup.Done()
    }
}

func (w *weakWaiter) Wait() {
    w.Wgroup.Wait()
}
```

In the case that we may have to block, we construct a `weakWaiter` with a buffer size of two, because it is possible for two dequeueing threads to concurrently attempt to wake up a single enqueueing thread (see below). If the sender successfully CASes `w` into the proper location (line 19), then it waits, and attempts the rest of the unbounded channel protocol when it wakes. There are two possible scenarios if this CAS fails:

(1) A receiver for `b.bound` elements forward in the channel attempted to wake up this sender, but arrived before `w` was stored.
(2) A receiver has already started waiting at this location.

The CAS on line 29 determines which case this is. If (2), the CAS will fail and the sender must now wake up the waiting receiver thread on line 46. If (1), the CAS will succeed and `e` will successfully be in the queue.

## Dequeue

The `Dequeue` implementation effectively mirrors the `Enqueue` implementation. There are, however, a few things that are especially subtle. Let's start with the implementation:

~~~~ {.go .numberLines startFrom="49"}
func (b *BoundedChan) Dequeue() Elt {
    b.adjust()
    myInd := index(atomic.AddUint64((*uint64)(&b.q.H), 1) - 1)
    cell, segInd := myInd.SplitInd()
    seg := b.q.findCell(b.head, cell)
    // If there are Enqueuers waiting to complete due to the buffer size, we
    // take responsibility for waking up the thread that FA'ed b.q.H + b.bound.
    // If bound is zero, that is just the current thread. Otherwise we have to
    // do some extra work. The thread we are waking up is referred to in names
    // and comments as our 'buddy'.
    var (
        bCell, bInd index
        bSeg        *segment
    )
    if b.bound > 0 {
        buddy := myInd + index(b.bound)
        bCell, bInd = buddy.SplitInd()
        bSeg = b.q.findCell(b.head, bCell)
    }
    w := makeWaiter()
    var res Elt
    if tryCas(seg, segInd, unsafe.Pointer(w)) {
        res = w.Recv()
    } else {
        // tryCas failed, which means that through the "possible histories"
        // argument, this must be either an Elt, a waiter or a weakWaiter. It
        // cannot be a waiter because we are the only actor allowed to swap
        // one into this location. Thus it must either be a weakWaiter or an Elt.
        // If it is a weakWaiter, then we must signal it before casing in w,
        // otherwise the other thread could starve. If it is a normal Elt we
        // do the rest of the protocol. This also means that we can safely load
        // an Elt from seg, which is not always the case because sentinel is
        // not an Elt.
        // Step 1: We failed to put our waiter into segInd. That means that either
        // our value is in there, or there is a weakWaiter in there.
        // Either way these are valid elts and we can reliably distinguish them with a type assertion
        elt := seg.Load(segInd)
        res = elt
        if ww, ok := (*elt).(*weakWaiter); ok {
            ww.Signal()
            if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),
                unsafe.Pointer(elt), unsafe.Pointer(w)) {
                res = w.Recv()
            } else {
                // someone cas'ed a value over the weakWaiter; that could only
                // have been our friend on the enqueue side
                res = seg.Load(segInd)
            }
        }
    }
    for b.bound > 0 { // runs at most twice
        // We have successfully gotten the value out of our cell. Now we
        // must ensure that our buddy is either woken up if they are
        // waiting, or that they will know not to sleep.
        // If bElt is not nil, it holds either an Elt or a weakWaiter. If
        // it holds a weakWaiter then we need to signal it to wake up the
        // buddy. If bElt is nil then we attempt to cas sentinel into the
        // buddy index. If we fail, the buddy may have cas'ed in a
        // weakWaiter, so we must go again. However, that will only happen
        // once.
        bElt := bSeg.Load(bInd)
        if bElt != nil {
            if ww, ok := (*bElt).(*weakWaiter); ok {
                ww.Signal()
            }
            // there is a real queue value in bSeg.Data[bInd], therefore
            // buddy cannot be waiting.
            break
        }
        // Let buddy know that they do not have to block
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&bSeg.Data[bInd])),
            unsafe.Pointer(nil), sentinel) {
            break
        }
    }
    return res
}
~~~~

Now the subtleties. A dequeuer may have to wake up multiple waiting send threads: the one waiting at `myInd` and the one waiting at `myInd + bound` (i.e. the slot at `bInd`). This may seem strange, because the dequeuer that receives `myInd - bound` ought to have woken up any pending sender at `myInd`. The issue is that *we have no guarantee that this dequeuer has returned*. The possibility of this occurring is remote with a large buffer size, but when `bound` is small it happens with some regularity.

The second subtlety is a peculiarity of Go.
On line 87 there is a *type assertion*, which dereferences an `Elt` to yield a value of type `interface{}`. The `interface{}` contains a pointer to some runtime information about the actual type of the pointed-to struct, and the `.(*weakWaiter)` syntax queries whether `elt` is a pointer to a `weakWaiter`. This is a safe thing to do because `weakWaiter` is a package-private type: no external caller could pass in an `Elt` that pointed to a `weakWaiter` unless we returned one from one of the public functions in the package, which we do not. This is complicated by the fact that `*waiter`s are actually stored in the queue directly, without hiding behind an interface value (e.g. at line 90). This is because the extra layer of indirection is unnecessary: it is always possible to determine whether an `Elt` or a `*waiter` is present in a given location based on which CASes have failed and which have succeeded.

# Performance

We benchmarked six separate channel configurations on enqueue/dequeue pairs:

* *Bounded0*: A `BoundedChan` with buffer size 0

* *Bounded1K*: A `BoundedChan` with buffer size 1024

* *Unbounded*: An `UnboundedChan`

* *Chan0*: An unbuffered native Go channel

* *Chan1K*: A native Go channel with buffer size 1024

* *Chan10M*: A native Go channel with buffer size $10^7$, which is the total number of elements enqueued into the channel over the course of the benchmark.

We include benchmark results for two cases: one where we allocate one goroutine per processor (where the processor count is set with the `GOMAXPROCS` function from Go's runtime package), and one where we allocate 5000 goroutines, irrespective of the current value of `GOMAXPROCS`. We include both of these for two reasons. First, it is not uncommon to have thousands of goroutines active in a running Go program, so it makes sense to consider the case where processors are oversubscribed in that manner. Second, we noticed that performance is often *better* in the cases where cores are oversubscribed.
While counter-intuitive, this is possibly due to a combination of unpredictable scheduler performance and the lower overhead of synchronizing between two goroutines executing on the same core.

These benchmarks were conducted on a machine with 2 Intel Xeon 2620 v4 CPUs, each with 8 cores clocked at 2.1GHz and two hardware threads per core. We were unable to allocate cores in an intuitive manner, so the 16-core benchmark is actually using all of a single CPU's hardware threads; only at core counts higher than 16 does the program cross a NUMA domain. The benchmarks were run on the Windows Subsystem for Linux[^wsl], an implementation of an Ubuntu 14.04 userland within the Windows 10 operating system. These benchmarks were conducted using Go version 1.6. The numbers were produced by performing 5,000,000 enqueues and dequeues per configuration, averaged over 5 iterations per setting, with a full GC between iterations.

The benchmarks show that both *Bounded* and *Unbounded* are able to increase throughput as the core count increases; native Go channels are unable to do so. When using more than 4 processors, *Unbounded* and *Bounded1K* provide much better throughput than native channels regardless of buffer size. *Unbounded* in particular is often 2-3x faster than the buffered *Chan* configurations, while *Bounded0* continues to increase throughput even after crossing a NUMA domain and dipping into using multiple hardware threads per core. At the highest core counts, all three new configurations outpace native Go channels.

![](contend_graph.pdf)

![](gmp_graph.pdf)

# Linearizability

We contend that both the bounded and unbounded queues presented in this document are *linearizable* with respect to their `Enqueue` and `Dequeue` operations. Linearizability is a strong consistency guarantee often used to specify the behavior of concurrent data-structures.
Informally, we say a structure is linearizable if, for an arbitrary (possibly infinite) history of concurrent operations on the structure beginning and ending at specific times, we can *linearize* it such that each operation occurs atomically at some point in time between its beginning and its end [See Chapter 3 of @herlihyBook for an overview; linearizability was introduced in @herlihyLinear]. This section describes linearization procedures for the bounded and unbounded channels in this document.

Both channels begin with a fetch-add on the head or tail index of the queue that determines the *logical index* that will be the subject of their send or receive. We denote by $e_i$ and $d_i$ the enqueue and dequeue operations that fetch-add to get a value of `myInd` equal to $i$. We will provide linearizations that preserve the following properties, where $\prec$ indicates precedence in the linearized sequence of events. For all $i$ we must have that

(1) $e_i \prec e_{i+1}$ (if $e_{i+1}$ occurs)

(2) $d_i \prec d_{i+1}$ (if $d_{i+1}$ occurs)

(3) $e_i \prec d_i$ (if both occur)

We take this to be a straightforward sequential specification for a channel.

## Unbounded Channels

Our linearization procedure considers two broad cases, a fast and a slow path.

* In the fast path, there is sufficient distance between enqueuers and dequeuers that the fetch-add of $e_i$ occurs before the fetch-add of $d_i$ *and* $e_i$'s CAS succeeds. In this case, linearize $e_i$ and $d_i$ at their respective fetch-adds.

* In the case where $d_i$'s fetch-add occurs before that of $e_i$ (or the CAS fails), we linearize *both* operations at $e_i$'s fetch-add, with $e_i$ occurring just before $d_i$.

Observe that both cases in this procedure linearize $e_i,d_i$ between their starting and finishing times.
The second case is guaranteed to do so because if $d_i$ must block then $e_i$ is responsible for unblocking it, and if $d_i$ does not block then we know its CAS fails, meaning that $e_i$'s fetch-add occurs after $d_i$'s fetch-add but before its failed CAS.

We will now show that the above procedure yields a history consistent with the three criteria provided above. The proof strategy is to show, for both the fast and slow paths, that we can maintain the criteria for an arbitrary $e_i,d_i,e_{i+1},d_{i+1}$. Given this, we can conclude that the criteria are satisfied for an arbitrary number of enqueue-dequeue pairs. We then consider the other possible cases.

*The Fast Path* We know that we satisfy (1) because all $e_i$, fast or slow path, linearize at their fetch-adds, and these are guaranteed to provide a total ordering on operations. We satisfy (3) by assumption. Consider $d_{i+1}$: if it hits the fast path then it is linearized at its fetch-add, which must happen after $d_i$'s fetch-add. If it hits the slow path then it will be linearized at the fetch-add of $e_{i+1}$, but by assumption we only hit the slow path if $d_{i+1}$'s fetch-add completed before that of $e_{i+1}$; $d_{i+1}$'s fetch-add certainly completed after that of $d_i$, so we satisfy (2).

*The Slow Path* The argument for (1) is the same as in the fast path, and the argument for (3) follows by assumption. Once again, the interesting case is to show that we maintain an ordering between dequeue operations. There are two possible cases:

(1) *$d_{i+1}$ blocks.* We know that $d_{i+1}$ will take the slow path, and will therefore be linearized at a later fetch-add.

(2) *$d_{i+1}$ does not block.* The only way that $d_{i+1}$ does not block is if its CAS fails, which means that another enqueuer $e_{i+1}$ has completed its fetch-add. Regardless of whether $d_{i+1}$ is linearized on a slow path or a fast path, it must be linearized after the fetch-add of $e_{i+1}$, and hence also after that of $e_i$.
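As an aside, the three criteria are mechanical enough to check on concrete histories. The toy checker below is purely illustrative (it is not part of the implementation); it assumes a linearized history given as a sequence of `("e", i)` and `("d", i)` events, with indices handed out contiguously by the fetch-adds:

```go
type op struct {
    kind string // "e" for enqueue, "d" for dequeue
    i    int    // the value of myInd acquired by the operation's fetch-add
}

// check reports whether a linearized history satisfies the three criteria:
// (1) e_i precedes e_{i+1}, (2) d_i precedes d_{i+1}, and (3) e_i precedes
// d_i whenever both occur.
func check(hist []op) bool {
    ePos, dPos := map[int]int{}, map[int]int{}
    prevE, prevD := -1, -1
    for t, o := range hist {
        if o.kind == "e" {
            if o.i != prevE+1 { // (1): enqueues must appear in index order
                return false
            }
            prevE = o.i
            ePos[o.i] = t
        } else {
            if o.i != prevD+1 { // (2): dequeues must appear in index order
                return false
            }
            prevD = o.i
            dPos[o.i] = t
        }
    }
    for i, dt := range dPos { // (3): e_i must precede d_i
        et, ok := ePos[i]
        if !ok || et > dt {
            return false
        }
    }
    return true
}
```

Both linearization cases above produce histories this predicate accepts: the fast path yields $e_0, d_0, e_1, d_1, \ldots$, and the slow path places each $e_i$ immediately before its $d_i$.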
*Small Numbers of Operations* If there is only one enqueue operation, then at most one dequeue operation will be linearized. This is fine, because at most one dequeue operation will complete, while any others will block forever. The definitions of the two cases in the linearization procedure automatically yield condition (3), while (1) and (2) are trivially satisfied, as there is only one enqueue and at most one dequeue.

*Concluding* We can conclude by induction that for any finite number of enqueues and dequeues, there is a linearization that satisfies a standard sequential specification for a channel. For infinite sequences of operations (assuming `H` and `T` can be updated with arbitrary precision) there is likely a similar co-inductive characterization of the same process; the above argument should still hold. We conclude that unbounded channels are linearizable. $\square$

## Bounded Channels

The bounded case has the same linearization procedure (and proofs) as the unbounded case, with the caveat that enqueue operations that do not return never make it into the history. This works because all operations unconditionally perform fetch-adds, even if they later have to block for an unbounded amount of time. $\square$

# Conclusion and Future Work

This document demonstrates that it is possible to have scalable unbounded and bounded queues while still satisfying a strong consistency guarantee. It leverages techniques from the recent literature on non-blocking queues to implement (to our knowledge) novel blocking constructs. There are a number of avenues for future work.

**Verification** It will be useful to model both channels in [SPIN](http://spinroot.com/spin/whatispin.html) or [TLA+](http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html) to provide further assurance that the algorithms are correct.
While it would be more involved, proving correctness in [Coq](https://coq.inria.fr/) in line with techniques mentioned in [FRAP](http://adam.chlipala.net/frap/) would also be helpful in building confidence in the algorithms.

**Implement in the Go runtime** Implementing these channels within the runtime could further reduce these algorithms' overhead. In particular, it would allow for a more efficient implementation of the blocking semantics, as the runtime can access goroutine and scheduling metadata directly, whereas the current implementation relies on `WaitGroup`s, which may be too heavyweight for our purposes.

**Improving Performance** Some variants of this algorithm still perform worse at lower core counts than their native Go equivalents. One possible reason for this is how much allocation these queues perform (Go channels need only keep a single fixed-size buffer). It could be fruitful to experiment with schemes that reduce allocation, as well as algorithms that allocate a fixed-size buffer, similar to the CRQ algorithm in @lcrq.

# Appendix A: Efficient Segment Allocation

In order to speed up allocation, we add a list to the queue state. This list is similar to standard lock-free queue designs in the literature, and bears some resemblance to `Queue1` above. The major difference here is that we only provide partial push and pop operations: Push will fail if the list may be too large or if it runs out of `patience`, and Pop will fail if its CAS fails more than `patience` times.

~~~~{.go}
type listElt *segment

type segList struct {
    MaxSpares, Length int64
    Head              *segLink
}

type segLink struct {
    Elt  listElt
    Next *segLink
}

func (s *segList) TryPush(e listElt) {
    // bail out if list is at capacity
    if atomic.LoadInt64(&s.Length) >= s.MaxSpares {
        return
    }
    // Add to Length. Note that this is not atomic with respect to the append,
    // which means we may be under capacity on occasion. This list is only used
    // in a best-effort capacity, so that is okay.
    atomic.AddInt64(&s.Length, 1)
    tl := &segLink{Elt: e, Next: nil}
    const patience = 4
    i := 0
    for ; i < patience; i++ {
        // attempt to cas Head from nil to tl
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)),
            unsafe.Pointer(nil), unsafe.Pointer(tl)) {
            break
        }
        // otherwise, walk to the end of the list
        tailPtr := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head))))
        if tailPtr == nil {
            // if Head was switched back to nil, retry
            continue
        }
        // advance tailPtr until it has a nil Next pointer
        for {
            next := (*segLink)(atomic.LoadPointer(
                (*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next))))
            if next == nil {
                break
            }
            tailPtr = next
        }
        // try to add something to the end of the list
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next)),
            unsafe.Pointer(nil), unsafe.Pointer(tl)) {
            break
        }
    }
    if i == patience {
        atomic.AddInt64(&s.Length, -1)
    }
}

func (s *segList) TryPop() (e listElt, ok bool) {
    const patience = 1
    if atomic.LoadInt64(&s.Length) <= 0 {
        return nil, false
    }
    for i := 0; i < patience; i++ {
        hd := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head))))
        if hd == nil {
            return nil, false
        }
        // if head is not nil, try to swap it for its Next pointer
        nxt := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&hd.Next))))
        if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)),
            unsafe.Pointer(hd), unsafe.Pointer(nxt)) {
            atomic.AddInt64(&s.Length, -1)
            return hd.Elt, true
        }
    }
    return nil, false
}
~~~~

Given this list implementation, we simply insert calls to `TryPush` and `TryPop` around the original implementation of `Grow` to have it take advantage of extra allocations:

~~~~ {.go}
type queue struct {
    H, T        index
    SpareAllocs segList
}

func (q *queue) Grow(tail *segment) {
    curTail := atomic.LoadUint64((*uint64)(&tail.ID))
    if next, ok := q.SpareAllocs.TryPop(); ok {
        atomic.StoreUint64((*uint64)(&next.ID), curTail+1)
        if
 atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),
            unsafe.Pointer(nil), unsafe.Pointer(next)) {
            return
        }
    }
    newSegment := &segment{ID: index(curTail + 1)}
    if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),
        unsafe.Pointer(nil), unsafe.Pointer(newSegment)) {
        return
    }
    // If we allocated a new segment but failed, attempt to place it in
    // SpareAllocs so someone else can use it.
    q.SpareAllocs.TryPush(newSegment)
}
~~~~

This scheme led to significant speedups in performance tests, but the code in `q.go` includes a constant that, if set to false, will disable any such list-based caching of allocations. This should make it easy to verify or falsify those performance measurements.

# References

[^pseudo]: Pseudocode in this document will increasingly resemble real, working Go code. While we will try to explain core Go concepts as we go, a passing familiarity with Go syntax (or at least a willingness to squint and pretend one is reading C) will be helpful.

[^select]: Our focus is send and receive; we do not cover `select` or `close` here. `Close` would be fairly simple to add; `select` could be implemented by using channels for the waiting mechanism used by receivers. While this would not be difficult, it would slow things down compared to the `WaitGroup` implementation.

[^goroutine]: Go's standard unit of concurrency is called a goroutine. Goroutines take the place of threads in a language like C, but they are generally much cheaper to create and provide faster context switches. Many goroutines are independently scheduled on top of a smaller number of native operating system threads. This scheduling is not preemptive in the standard implementation; rather, goroutines implicitly yield on function-call boundaries.

[^chanimp]: See the [Go channel source](https://golang.org/src/runtime/chan.go). In particular note calls to `lock` in `chansend` and `chanrecv`.
[^combine]: Consult the related work sections of @wfq on *combining* queues for an example of this; @lcrq has a similar survey.

[^msqueue]: Less contention is not something that you get automatically when an algorithm is lock-free. An early lock-free queue @MSQueue still suffers from bottlenecks around the head and tail pointers all being CAS-ed by contending threads. Most of these CASes will fail, and all threads whose CASes fail must retry. Exponential backoff schemes can help this state of affairs, but the bottleneck is still present; see the performance measurements in @wfq, which include the algorithm from @MSQueue.

[^fasem]: F&A is more commonly defined to return the *old* value of `src`, but returning the new value is equivalent.

[^helping]: Helping is a standard technique for making obstruction-free or lock-free algorithms wait-free. The technique goes back to @wfsync; the practice of using a weaker progress guarantee as a fast path and then falling back to a helping mechanism to ensure wait freedom was introduced in @FastSlow. An explanation of helping can be found in @herlihyBook, chapters 6 and 10.5.

[^bounded]: See, for example, [this discussion](https://mail.mozilla.org/pipermail/rust-dev/2013-December/007449.html) on the Rust mailing list regarding unbounded channels. Haskell's standard channel implementation in [Control.Concurrent](https://hackage.haskell.org/package/base-4.9.0.0/docs/Control-Concurrent-Chan.html) is unbounded, as are the STM variants.

[^wsl]: See [this blog post](https://blogs.msdn.microsoft.com/wsl/2016/04/22/windows-subsystem-for-linux-overview/) as well as the various [follow-ups](https://blogs.msdn.microsoft.com/wsl/) for an overview of this system.

[^livelockdef]: A [livelock](https://en.wikipedia.org/wiki/Deadlock#Livelock) is a scenario in which one or more threads never block (i.e. they continuously change their respective states) but still indefinitely fail to make progress.