[
  {
    "path": "CONTRIBUTING",
    "content": "Want to contribute? Great! First, read this page (including the small print at the end).\n\n### Before you contribute\nBefore we can use your code, you must sign the\n[Google Individual Contributor License Agreement]\n(https://cla.developers.google.com/about/google-individual)\n(CLA), which you can do online. The CLA is necessary mainly because you own the\ncopyright to your changes, even after your contribution becomes part of our\ncodebase, so we need your permission to use and distribute your code. We also\nneed to be sure of various other things—for instance that you'll tell us if you\nknow that your code infringes on other people's patents. You don't have to sign\nthe CLA until after you've submitted your code for review and a member has\napproved it, but you must do it before we can put your code into our codebase.\nBefore you start working on a larger contribution, you should get in touch with\nus first through the issue tracker with your idea so that we can help out and\npossibly guide you. Coordinating up front makes it much easier to avoid\nfrustration later on.\n\n### Code reviews\nAll submissions, including submissions by project members, require review. We\nuse GitHub pull requests for this purpose.\n\n### The small print\nContributions made by corporations are covered by a different agreement than\nthe one above, the\n[Software Grant and Corporate Contributor License Agreement]\n(https://cla.developers.google.com/about/google-corporate).\n"
  },
  {
    "path": "LICENSE",
    "content": "\n                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. 
For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. 
Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. (Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright [yyyy] [name of copyright owner]\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "README.md",
    "content": "# `fchan`: Fast Channels in Go\n\nThis package contains implementations of fast and scalable channels in Go.\nImplementation is in `src/fchan`. To run benchmarks, run `src/bench/bench.go`.\n`bench.go` is very rudimentary, and modifying the source may be necessary\ndepending on what you want to run; that will change in the future.  For details\non the algorithm, check out the writeup directory, it includes a pdf and the\npandoc markdown used to generate it. \n\n**This is a proof of concept only**. This code should *not* be run in\nproduction.  Comments, criticisms and bugs are all welcome!\n\n## Disclaimer\n\nThis is not an official Google product.\n"
  },
  {
    "path": "src/bench/bench.go",
    "content": "// Copyright 2016 Google Inc.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//     http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage main\n\nimport (\n\t\"flag\"\n\t\"fmt\"\n\t\"log\"\n\t\"os\"\n\t\"runtime\"\n\t\"runtime/pprof\"\n\t\"sync\"\n\t\"time\"\n\n\t\"../fchan\"\n)\n\n// chanRec is a wrapper for a channel-like object used in the benchmarking code\n// to avoid code duplication\ntype chanRec struct {\n\tNewHandle func() interface{}\n\tEnqueue   func(ch interface{}, e fchan.Elt)\n\tDequeue   func(ch interface{}) fchan.Elt\n}\n\nfunc wrapBounded(bound uint64) *chanRec {\n\tch := fchan.NewBounded(bound)\n\treturn &chanRec{\n\t\tNewHandle: func() interface{} { return ch.NewHandle() },\n\t\tEnqueue: func(ch interface{}, e fchan.Elt) {\n\t\t\tch.(*fchan.BoundedChan).Enqueue(e)\n\t\t},\n\t\tDequeue: func(ch interface{}) fchan.Elt {\n\t\t\treturn ch.(*fchan.BoundedChan).Dequeue()\n\t\t},\n\t}\n}\n\nfunc wrapUnbounded() *chanRec {\n\tch := fchan.New()\n\treturn &chanRec{\n\t\tNewHandle: func() interface{} { return ch.NewHandle() },\n\t\tEnqueue: func(ch interface{}, e fchan.Elt) {\n\t\t\tch.(*fchan.UnboundedChan).Enqueue(e)\n\t\t},\n\t\tDequeue: func(ch interface{}) fchan.Elt {\n\t\t\treturn ch.(*fchan.UnboundedChan).Dequeue()\n\t\t},\n\t}\n}\n\nfunc wrapChan(chanSize int) *chanRec {\n\tch := make(chan fchan.Elt, chanSize)\n\treturn &chanRec{\n\t\tNewHandle: func() interface{} { return nil },\n\t\tEnqueue:   func(_ interface{}, e fchan.Elt) { ch <- e },\n\t\tDequeue:   func(_ interface{}) fchan.Elt { return <-ch },\n\t}\n}\n\nfunc benchHelp(N int, chanBase *chanRec, nProcs int) time.Duration {\n\tconst nIters = 1\n\tvar totalTime int64\n\tfor iter := 0; iter < nIters; iter++ {\n\t\tvar waitSetup, waitBench sync.WaitGroup\n\t\tnProcsPer := nProcs / 2\n\t\tpt := N / nProcsPer\n\t\twaitSetup.Add(2*nProcsPer + 1)\n\t\tfor i := 0; i < nProcsPer; i++ {\n\t\t\twaitBench.Add(2)\n\t\t\tgo func() {\n\t\t\t\tch := chanBase.NewHandle()\n\t\t\t\tvar (\n\t\t\t\t\tm   interface{} = 1\n\t\t\t\t\tmsg fchan.Elt   = &m\n\t\t\t\t)\n\t\t\t\twaitSetup.Done()\n\t\t\t\twaitSetup.Wait()\n\t\t\t\tfor j := 0; j < pt; j++ {\n\t\t\t\t\tchanBase.Enqueue(ch, msg)\n\t\t\t\t}\n\t\t\t\twaitBench.Done()\n\t\t\t}()\n\t\t\tgo func() {\n\t\t\t\tch := chanBase.NewHandle()\n\t\t\t\twaitSetup.Done()\n\t\t\t\twaitSetup.Wait()\n\t\t\t\tfor j := 0; j < pt; j++ {\n\t\t\t\t\tchanBase.Dequeue(ch)\n\t\t\t\t}\n\t\t\t\twaitBench.Done()\n\t\t\t}()\n\t\t}\n\t\ttime.Sleep(time.Millisecond * 5)\n\t\twaitSetup.Done()\n\t\twaitSetup.Wait()\n\t\tstart := time.Now().UnixNano()\n\t\twaitBench.Wait()\n\t\tend := time.Now().UnixNano()\n\t\truntime.GC()\n\t\ttime.Sleep(time.Second)\n\t\ttotalTime += end - start\n\t}\n\treturn time.Duration(totalTime/nIters) * time.Nanosecond\n}\n\nfunc render(N, numCPUs int, gmp bool, desc string, t time.Duration) {\n\textra := \"\"\n\tif gmp {\n\t\textra = \"GMP\"\n\t}\n\tfmt.Printf(\"%s%s-%d\\t%d\\t%v\\n\", desc, extra, numCPUs, N, t)\n}\n\nvar cpuprofile = flag.String(\"cpuprofile\", \"\", \"write cpu profile `file`\")\n\nfunc 
main() {\n\tconst (\n\t\tmore     = 5000     // goroutine count for the oversubscribed (goroutines >> GOMAXPROCS) runs\n\t\tnOps     = 10000000 // total enqueue/dequeue pairs per measurement\n\t\tgmpScale = 1        // goroutine count multiplier for the GMP (goroutines == GOMAXPROCS) runs\n\t)\n\tflag.Parse()\n\tif *cpuprofile != \"\" {\n\t\tf, err := os.Create(*cpuprofile)\n\t\tif err != nil {\n\t\t\tlog.Fatal(\"could not create CPU profile: \", err)\n\t\t}\n\t\tif err := pprof.StartCPUProfile(f); err != nil {\n\t\t\tlog.Fatal(\"could not start CPU profile: \", err)\n\t\t}\n\t\tdefer pprof.StopCPUProfile()\n\t}\n\tfor _, pack := range []struct {\n\t\tdesc string\n\t\tf    func() *chanRec\n\t}{\n\t\t{\"Chan10M\", func() *chanRec { return wrapChan(nOps) }},\n\t\t{\"Chan1K\", func() *chanRec { return wrapChan(1024) }},\n\t\t{\"Chan0\", func() *chanRec { return wrapChan(0) }},\n\t\t{\"Bounded1K\", func() *chanRec { return wrapBounded(1024) }},\n\t\t{\"Bounded0\", func() *chanRec { return wrapBounded(0) }},\n\t\t{\"Unbounded\", wrapUnbounded},\n\t} {\n\t\tfor _, nprocs := range []int{2, 4, 8, 12, 16, 24, 28, 32} {\n\t\t\truntime.GOMAXPROCS(nprocs)\n\t\t\tdur := benchHelp(nOps, pack.f(), more)\n\t\t\trender(nOps, nprocs, false, pack.desc, dur)\n\t\t\tdur = benchHelp(nOps, pack.f(), nprocs*gmpScale)\n\t\t\trender(nOps, nprocs, true, pack.desc, dur)\n\t\t}\n\t}\n}\n"
  },
  {
    "path": "src/fchan/bounded.go",
    "content": "// Copyright 2016 Google Inc.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//     http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage fchan\n\nimport (\n\t\"runtime\"\n\t\"sync/atomic\"\n\t\"unsafe\"\n)\n\nvar (\n\ts        = 1\n\tsentinel = unsafe.Pointer(&s)\n)\n\n// we use a type synonym here because otherwise it would be possible for a user\n// to append a value of the same underlying type. If that happened then the\n// type-assertion in Dequeue could send on one of the elements passed in (which\n// is wrong, and could potentially deadlock the program as well).\ntype waitch chan struct{}\n\nfunc waitChan() waitch {\n\treturn make(chan struct{}, 2)\n}\n\n// possible history of values of a cell\n// waitch ::= channel that a sender waits on when it is over buffer size\n// recvchan ::= channel that a receiver waits on when it has to receive a value\n// - nil -> sentinel -> value\n// - nil -> sentinel -> recvChan\n// - nil -> value\n// - nil -> recvChan\n// These two may require someone to send on the waitch before transitioning\n// - nil -> waitch -> value\n// - nil -> waitch -> recvChan\n\n// BoundedChan is a thread_local handle onto a bounded channel.\ntype BoundedChan struct {\n\tq          *queue\n\thead, tail *segment\n\tbound      uint64\n}\n\n// NewBounded allocates a new queue and returns a handle to that queue. Further\n// handles are created by calling NewHandle on the result of NewBounded.\nfunc NewBounded(bufsz uint64) *BoundedChan {\n\tsegPtr := &segment{}\n\tcur := segPtr\n\tfor b := uint64(segSize); b < bufsz; b += segSize {\n\t\tcur.Next = &segment{ID: index(b) >> segShift}\n\t\tcur = cur.Next\n\t}\n\tq := &queue{\n\t\tH:           0,\n\t\tT:           0,\n\t\tSpareAllocs: segList{MaxSpares: int64(runtime.GOMAXPROCS(0))},\n\t}\n\treturn &BoundedChan{\n\t\tq:     q,\n\t\thead:  segPtr,\n\t\ttail:  segPtr,\n\t\tbound: bufsz,\n\t}\n}\n\n// NewHandle creates a new handle for the given Queue.\nfunc (b *BoundedChan) NewHandle() *BoundedChan {\n\treturn &BoundedChan{\n\t\tq:     b.q,\n\t\thead:  b.head,\n\t\ttail:  b.tail,\n\t\tbound: b.bound,\n\t}\n}\n\nfunc (b *BoundedChan) adjust() {\n\t// TODO: factor this out into a helper so that bounded and unbounded can\n\t// use the same code\n\tH := index(atomic.LoadUint64((*uint64)(&b.q.H)))\n\tT := index(atomic.LoadUint64((*uint64)(&b.q.T)))\n\tcellH, _ := H.SplitInd()\n\tadvance(&b.head, cellH)\n\tcellT, _ := T.SplitInd()\n\tadvance(&b.tail, cellT)\n}\n\n// tryCas attempts to cas seg.Data[index] from nil to elt, and if that fails,\n// from sentinel to elt.\nfunc tryCas(seg *segment, segInd index, elt unsafe.Pointer) bool {\n\treturn atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),\n\t\tsentinel, elt) ||\n\t\tatomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),\n\t\t\tunsafe.Pointer(nil), elt) ||\n\t\tatomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),\n\t\t\tsentinel, elt)\n}\n\n// Enqueue sends e on b. 
If there are already >= bound elements\n// outstanding in the channel, then Enqueue will block until sufficiently many\n// elements have been received.\nfunc (b *BoundedChan) Enqueue(e Elt) {\n\tb.adjust()\n\tstartHead := index(atomic.LoadUint64((*uint64)(&b.q.H)))\n\tmyInd := index(atomic.AddUint64((*uint64)(&b.q.T), 1) - 1)\n\tcell, cellInd := myInd.SplitInd()\n\tseg := b.q.findCell(b.tail, cell)\n\tif myInd > startHead && (myInd-startHead) > index(uint64(b.bound)) {\n\t\t// there is a chance that we have to block\n\t\tconst patience = 4\n\t\tfor i := 0; i < patience; i++ {\n\t\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\t\t\tsentinel, unsafe.Pointer(e)) {\n\t\t\t\t// Between us reading startHead and now, there were enough\n\t\t\t\t// increments to make it the case that we should no longer\n\t\t\t\t// block.\n\t\t\t\tif debug {\n\t\t\t\t\tdbgPrint(\"[enq] swapped out for sentinel\\n\")\n\t\t\t\t}\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t\t// capacity 2: both the dequeuer of this cell and the 'buddy' dequeuer\n\t\t// (see Dequeue) may signal this waiter.\n\t\tvar w interface{} = makeWeakWaiter(2)\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\t\tunsafe.Pointer(nil), unsafe.Pointer(Elt(&w))) {\n\t\t\t// we successfully swapped in w. No one will overwrite this\n\t\t\t// location unless they signal w first. We block.\n\t\t\tw.(*weakWaiter).Wait()\n\t\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\t\t\tunsafe.Pointer(Elt(&w)), unsafe.Pointer(e)) {\n\t\t\t\tif debug {\n\t\t\t\t\tdbgPrint(\"[enq] blocked then swapped successfully\\n\")\n\t\t\t\t}\n\t\t\t\treturn\n\t\t\t} // a dequeuer swapped a waiter into this location; use the slow path below\n\t\t} else if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\t\tsentinel, unsafe.Pointer(e)) {\n\t\t\t// Between us reading startHead and now, there were enough\n\t\t\t// increments to make it the case that we should no longer\n\t\t\t// block.\n\t\t\tif debug {\n\t\t\t\tdbgPrint(\"[enq] swapped out for sentinel\\n\")\n\t\t\t}\n\t\t\treturn\n\t\t}\n\t} else {\n\t\t// normal case. We know we don't have to block because b.q.H can only\n\t\t// increase.\n\t\tif tryCas(seg, cellInd, unsafe.Pointer(e)) {\n\t\t\tif debug {\n\t\t\t\tdbgPrint(\"[enq] successful tryCas\\n\")\n\t\t\t}\n\t\t\treturn\n\t\t}\n\t}\n\t// Slow path: the cell must now hold a waiter CASed in by a dequeuer; hand\n\t// e to it directly.\n\tfor i := 0; ; i++ { // the body returns on its first iteration; the guard is defensive\n\t\tif i >= 2 {\n\t\t\tpanic(\"[enq] bug!\")\n\t\t}\n\t\tptr := atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])))\n\t\tw := (*waiter)(ptr)\n\t\tw.Send(e)\n\t\tif debug {\n\t\t\tdbgPrint(\"[enq] sending to waiter on %v\\n\", ptr)\n\t\t}\n\t\treturn\n\t}\n}\n\n// Dequeue receives an Elt from b. It blocks if there are no elements enqueued\n// there.\nfunc (b *BoundedChan) Dequeue() Elt {\n\tb.adjust()\n\tmyInd := index(atomic.AddUint64((*uint64)(&b.q.H), 1) - 1)\n\tcell, segInd := myInd.SplitInd()\n\tseg := b.q.findCell(b.head, cell)\n\n\t// If there are Enqueuers waiting to complete due to the buffer size, we\n\t// take responsibility for waking up the thread that FA'ed b.q.H + b.bound.\n\t// If bound is zero, that is just the current thread. Otherwise we have to\n\t// do some extra work. 
The thread we are waking up is referred to in names\n\t// and comments as our 'buddy'.\n\tvar (\n\t\tbCell, bInd index\n\t\tbSeg        *segment\n\t)\n\tif b.bound > 0 {\n\t\tbuddy := myInd + index(b.bound)\n\t\tbCell, bInd = buddy.SplitInd()\n\t\tbSeg = b.q.findCell(b.head, bCell)\n\t}\n\n\tw := makeWaiter()\n\tvar res Elt\n\tif tryCas(seg, segInd, unsafe.Pointer(w)) {\n\t\tif debug {\n\t\t\tdbgPrint(\"[deq] getting res from channel %v\\n\", w)\n\t\t}\n\t\tres = w.Recv()\n\t} else {\n\t\t// tryCas failed, which means that through the \"possible histories\"\n\t\t// argument, this must be either an Elt, a waiter or a weakWaiter. It\n\t\t// cannot be a waiter because we are the only actor allowed to swap\n\t\t// one into this location. Thus it must either be a weakWaiter or an Elt.\n\t\t// If it is a weakWaiter, then we must signal it before casing in w,\n\t\t// otherwise the other thread could starve. If it is a normal Elt we\n\t\t// do the rest of the protocol. This also means that we can safely load\n\t\t// an Elt from seg, which is not always the case because sentinel is\n\t\t// not an Elt.\n\t\t//\n\t\t// Step 1: We failed to put our waiter into this index. That means that\n\t\t// either our value is in there, or there is a weakWaiter in there. Either\n\t\t// way these are valid elts and we can reliably distinguish them with a\n\t\t// type assertion.\n\t\telt := seg.Load(segInd)\n\t\tres = elt\n\t\tif ww, ok := (*elt).(*weakWaiter); ok {\n\t\t\tww.Signal()\n\t\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),\n\t\t\t\tunsafe.Pointer(elt), unsafe.Pointer(w)) {\n\t\t\t\tif debug {\n\t\t\t\t\tdbgPrint(\"[deq] getting res from channel slow %v\\n\", w)\n\t\t\t\t}\n\t\t\t\tres = w.Recv()\n\t\t\t} else {\n\t\t\t\t// someone cas'ed a value over the weakWaiter; that can only\n\t\t\t\t// have been the blocked enqueuer, so its element is now here\n\t\t\t\tif debug {\n\t\t\t\t\tdbgPrint(\"[deq] getting res from load\\n\")\n\t\t\t\t}\n\t\t\t\tres = seg.Load(segInd)\n\t\t\t}\n\t\t}\n\t}\n\tfor i := 0; b.bound > 0; i++ {\n\t\tif i >= 2 {\n\t\t\tpanic(\"[deq] bug!\")\n\t\t}\n\t\t// We have successfully gotten the value out of our cell. Now we\n\t\t// must ensure that our buddy is either woken up if they are\n\t\t// waiting, or that they will know not to sleep.\n\t\t// If bElt is not nil, it holds either an Elt or a weakWaiter. If\n\t\t// it holds a weakWaiter then we need to signal it to wake up the\n\t\t// buddy. If it is nil then we attempt to cas sentinel into the\n\t\t// buddy index. If we fail then the buddy may have cas'ed in a\n\t\t// weakWaiter, so we must go again. However that will only happen\n\t\t// once.\n\t\tbElt := bSeg.Load(bInd)\n\t\t// could this be sentinel? I don't think so..\n\t\tif bElt != nil {\n\t\t\tif ww, ok := (*bElt).(*weakWaiter); ok {\n\t\t\t\tww.Signal()\n\t\t\t}\n\t\t\t// there is a real queue value in bSeg.Data[bInd], therefore\n\t\t\t// buddy cannot be waiting.\n\t\t\tbreak\n\t\t}\n\t\t// Let buddy know that they do not have to block\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&bSeg.Data[bInd])),\n\t\t\tunsafe.Pointer(nil), sentinel) {\n\t\t\tbreak\n\t\t}\n\t}\n\treturn res\n}\n"
  },
  {
    "path": "src/fchan/fchan_test.go",
    "content": "// Copyright 2016 Google Inc.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//     http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage fchan\n\nimport (\n\t\"reflect\"\n\t\"sync\"\n\t\"testing\"\n)\n\nconst perThread = 256\n\nfunc unorderedEltsEq(s1, s2 []int) bool {\n\treadMap := func(i []int) map[int]int {\n\t\tres := make(map[int]int)\n\t\tfor _, ii := range i {\n\t\t\tres[ii]++\n\t\t}\n\t\treturn res\n\t}\n\treturn reflect.DeepEqual(readMap(s1), readMap(s2))\n}\n\nfunc TestBoundedQueueElements(t *testing.T) {\n\tconst numInputs = (1 << 20)\n\tbounds := []uint64{0, 1, 1024, segSize}\n\tfor _, bound := range bounds {\n\t\tvar inputs []int\n\t\tvar wg sync.WaitGroup\n\t\tfor i := 0; i < numInputs; i++ {\n\t\t\tinputs = append(inputs, i)\n\t\t}\n\t\th := NewBounded(bound)\n\n\t\tch := make(chan int, 1024)\n\t\tfor i := 0; i < numInputs/perThread; i++ {\n\t\t\twg.Add(1)\n\t\t\tgo func(i int) {\n\t\t\t\thn := h.NewHandle()\n\t\t\t\tfor j := 0; j < perThread; j++ {\n\t\t\t\t\tvar inp interface{} = inputs[i*perThread+j]\n\t\t\t\t\thn.Enqueue(&inp)\n\t\t\t\t}\n\t\t\t\twg.Done()\n\t\t\t}(i)\n\t\t\twg.Add(1)\n\t\t\tgo func() {\n\t\t\t\thn := h.NewHandle()\n\t\t\t\tfor j := 0; j < perThread; j++ {\n\t\t\t\t\tout := hn.Dequeue()\n\t\t\t\t\toutInt := (*out).(int)\n\t\t\t\t\tch <- outInt\n\t\t\t\t}\n\t\t\t\twg.Done()\n\t\t\t}()\n\t\t}\n\n\t\tvar outs []int\n\t\tfor i := 0; i < numInputs; i++ {\n\t\t\touts = append(outs, <-ch)\n\t\t}\n\t\tclose(ch)\n\t\tif !unorderedEltsEq(outs, inputs) {\n\t\t\tt.Errorf(\"expected %v, got %v\", inputs, outs)\n\t\t}\n\t\twg.Wait()\n\t}\n}\n\nfunc TestQueueElements(t *testing.T) {\n\tconst numInputs = 1 << 20\n\titers := numInputs / perThread\n\tvar inputs []int\n\tvar wg sync.WaitGroup\n\tfor i := 0; i < numInputs; i++ {\n\t\tinputs = append(inputs, i)\n\t}\n\th := New()\n\n\tch := make(chan int, 1024)\n\tfor i := 0; i < iters; i++ {\n\t\twg.Add(1)\n\t\tgo func(i int) {\n\t\t\thn := h.NewHandle()\n\t\t\tfor j := 0; j < perThread; j++ {\n\t\t\t\tvar inp interface{} = inputs[i*perThread+j]\n\t\t\t\thn.Enqueue(&inp)\n\t\t\t}\n\t\t\twg.Done()\n\t\t}(i)\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\thn := h.NewHandle()\n\t\t\tfor j := 0; j < perThread; j++ {\n\t\t\t\tout := hn.Dequeue()\n\t\t\t\tch <- (*out).(int)\n\t\t\t}\n\t\t\twg.Done()\n\t\t}()\n\t}\n\n\tvar outs []int\n\tfor i := 0; i < numInputs; i++ {\n\t\touts = append(outs, <-ch)\n\t}\n\tclose(ch)\n\tif !unorderedEltsEq(outs, inputs) {\n\t\tt.Errorf(\"expected %v, got %v\", inputs, outs)\n\t}\n\twg.Wait()\n}\n\nfunc TestSerialQueue(t *testing.T) {\n\tconst runs = 3*segSize + 1\n\n\th := New()\n\tvar msg interface{} = \"hi\"\n\tfor i := 0; i < runs; i++ {\n\t\tvar m interface{} = msg\n\t\th.Enqueue(&m)\n\t}\n\tfor i := 0; i < runs; i++ {\n\t\tp := h.Dequeue()\n\t\tif !reflect.DeepEqual(*p, msg) {\n\t\t\tt.Errorf(\"expected %v, got %v\", msg, *p)\n\t\t}\n\t}\n}\n\nfunc TestConcurrentQueueAddFirst(t *testing.T) {\n\tconst runs = 3*segSize + 1\n\tvar wg sync.WaitGroup\n\th := New()\n\tvar msg interface{} = \"hi\"\n\tt.Logf(\"Spawning %v 
adding goroutines\", runs)\n\tfor i := 0; i < runs; i++ {\n\t\tvar m interface{} = msg\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\thn := h.NewHandle()\n\t\t\thn.Enqueue(&m)\n\t\t\twg.Done()\n\t\t}()\n\t}\n\tt.Logf(\"Spawning %v getting goroutines\", runs)\n\tfor i := 0; i < runs; i++ {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\thn := h.NewHandle()\n\t\t\tp := hn.Dequeue()\n\t\t\tif !reflect.DeepEqual(*p, msg) {\n\t\t\t\tt.Errorf(\"expected %v, got %v\", msg, *p)\n\t\t\t}\n\t\t\twg.Done()\n\t\t}()\n\t}\n\twg.Wait()\n}\n\nfunc TestConcurrentQueueTakeFirst(t *testing.T) {\n\tconst runs = 2*segSize + 1 // 4*segSize + 1\n\n\tvar wg sync.WaitGroup\n\th := New()\n\tvar msg interface{} = \"hi\"\n\n\tt.Logf(\"Spawning %v getting goroutines\", runs)\n\tfor i := 0; i < runs; i++ {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\thn := h.NewHandle()\n\t\t\tp := hn.Dequeue()\n\t\t\tif !reflect.DeepEqual(*p, msg) {\n\t\t\t\tt.Errorf(\"expected %v, got %v\", msg, *p)\n\t\t\t}\n\t\t\twg.Done()\n\t\t}()\n\t}\n\n\tt.Logf(\"Spawning %v adding goroutines\", runs)\n\tfor i := 0; i < runs; i++ {\n\t\tvar m interface{} = msg\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\thn := h.NewHandle()\n\t\t\thn.Enqueue(&m)\n\t\t\twg.Done()\n\t\t}()\n\t}\n\twg.Wait()\n}\n\nfunc minN(b *testing.B) int {\n\tif b.N < 2 {\n\t\treturn 2\n\t}\n\treturn b.N\n}\n"
  },
  {
    "path": "src/fchan/q.go",
    "content": "// Copyright 2016 Google Inc.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//     http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage fchan\n\nimport (\n\t\"fmt\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"unsafe\"\n)\n\n// basic debug infrastructure\nconst debug = false\n\nvar dbgPrint = func(s string, i ...interface{}) { fmt.Printf(s, i...) }\n\n// Elt is the element type of a queue, can be any pointer type\ntype Elt *interface{}\ntype index uint64\ntype listElt *segment\n\ntype waiter struct {\n\tE      Elt\n\tWgroup sync.WaitGroup\n}\n\nfunc makeWaiter() *waiter {\n\twait := &waiter{}\n\twait.Wgroup.Add(1)\n\treturn wait\n}\n\nfunc (w *waiter) Send(e Elt) {\n\tatomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)), unsafe.Pointer(e))\n\tw.Wgroup.Done()\n}\n\nfunc (w *waiter) Recv() Elt {\n\tw.Wgroup.Wait()\n\treturn Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&w.E))))\n}\n\n/*\ntype weakWaiter struct {\n\tcond *sync.Cond\n\tsync.Mutex\n\twoke int64\n}\n\nfunc makeWeakWaiter(i int32) *weakWaiter {\n\tw := &weakWaiter{}\n\tw.cond = sync.NewCond(w)\n\treturn w\n}\n\nfunc (w *weakWaiter) Signal() {\n\tw.Lock()\n\tw.woke++\n\tw.cond.Signal()\n\tw.Unlock()\n}\n\nfunc (w *weakWaiter) Wait() {\n\tw.Lock()\n\tfor w.woke == 0 {\n\t\tw.cond.Wait()\n\t}\n\tw.Unlock()\n}\n\n//*/\n\n/*\n\n// Idea to get beyond the scalability bottleneck when number of goroutines is\n// much larger than gomaxprocs. Have an array of channels with large buffers\n// (or unbuffered channels?) and group threads into these larger groups. This\n// means weakWaiters are attached to queue-level state. It has the disadvantage\n// of making ordering a bit more difficult, as later receivers could wake up\n// earlier senders. I think this is fine, but it merits some thought.\ntype weakWaiter chan struct{}\n\nfunc makeWeakWaiter(i int32) *weakWaiter {\n\tvar ch weakWaiter = make(chan struct{}, i)\n\treturn &ch\n}\n\nfunc (w *weakWaiter) Signal() { *w <- struct{}{} }\n\nfunc (w *weakWaiter) Wait() { <-(*w) }\n\n//*/\n\n//*\ntype weakWaiter struct {\n\tOSize  int32\n\tSize   int32\n\tWgroup sync.WaitGroup\n}\n\nfunc makeWeakWaiter(i int32) *weakWaiter {\n\twait := &weakWaiter{Size: i, OSize: i}\n\twait.Wgroup.Add(1)\n\treturn wait\n}\n\nfunc (w *weakWaiter) Signal() {\n\tnewVal := atomic.AddInt32(&w.Size, -1)\n\torig := atomic.LoadInt32(&w.OSize)\n\tif newVal+1 == orig {\n\t\tw.Wgroup.Done()\n\t}\n}\n\nfunc (w *weakWaiter) Wait() {\n\tw.Wgroup.Wait()\n}\n\n// */\n\n// segList is a best-effort data-structure for storing spare segment\n// allocations. The TryPush and TryPop methods follow standard algorithms for\n// lock-free linked lists. They have an inconsistent length counter they\n// may underestimate the true length of the data-structure, but this allows\n// threads to bail out early. Because the slow path of allocating a new segment\n// in grow still works.\ntype segList struct {\n\tMaxSpares int64\n\tLength    int64\n\tHead      *segLink\n}\n\n// spmcLink is a list element in a segList. 
Note that we cannot just re-use the\n// segment Next pointers without modifying the algorithm, as TryPush could\n// potentially sever pointers in the live queue data structure. That would\n// break everything.\ntype segLink struct {\n\tElt  listElt\n\tNext *segLink\n}\n\nfunc (s *segList) TryPush(e listElt) {\n\t// bail out if list is at capacity\n\tif atomic.LoadInt64(&s.Length) >= s.MaxSpares {\n\t\treturn\n\t}\n\t// add to length. Note that this is not atomic with respect to the append,\n\t// which means we may be under capacity on occasion. This list is only used\n\t// in a best-effort capacity, so that is okay.\n\tatomic.AddInt64(&s.Length, 1)\n\tif debug {\n\t\tdbgPrint(\"Length now %v\\n\", s.Length)\n\t}\n\n\ttl := &segLink{\n\t\tElt:  e,\n\t\tNext: nil,\n\t}\n\tconst patience = 4\n\ti := 0\n\tfor ; i < patience; i++ {\n\t\t// attempt to cas Head from nil to tl\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)),\n\t\t\tunsafe.Pointer(nil), unsafe.Pointer(tl)) {\n\t\t\tbreak\n\t\t}\n\n\t\t// try to find an empty element\n\t\ttailPtr := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head))))\n\t\tif tailPtr == nil {\n\t\t\t// if Head was switched to nil, retry\n\t\t\tcontinue\n\t\t}\n\n\t\t// advance tailPtr until it has a nil next pointer\n\t\tfor {\n\t\t\tnext := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next))))\n\t\t\tif next == nil {\n\t\t\t\tbreak\n\t\t\t}\n\t\t\ttailPtr = next\n\t\t}\n\n\t\t// try and add something to the end of the list\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next)),\n\t\t\tunsafe.Pointer(nil), unsafe.Pointer(tl)) {\n\t\t\tbreak\n\t\t}\n\t}\n\tif i == patience {\n\t\t// we ran out of patience; undo the length increment\n\t\tatomic.AddInt64(&s.Length, -1)\n\t}\n\n\tif debug && i < patience {\n\t\tdbgPrint(\"Successfully pushed to segment list\\n\")\n\t}\n}\n\nfunc (s *segList) TryPop() (e listElt, ok bool) {\n\tconst patience = 8\n\t// it is possible that s has length <= 0 due to a temporary inconsistency\n\t// between the list itself and the length counter. See the comments in\n\t// TryPush()\n\tif atomic.LoadInt64(&s.Length) <= 0 {\n\t\treturn nil, false\n\t}\n\tfor i := 0; i < patience; i++ {\n\t\thd := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head))))\n\t\tif hd == nil {\n\t\t\treturn nil, false\n\t\t}\n\n\t\t// if head is not nil, try to swap it for its next pointer\n\t\tnxt := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&hd.Next))))\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)),\n\t\t\tunsafe.Pointer(hd), unsafe.Pointer(nxt)) {\n\t\t\tif debug {\n\t\t\t\tdbgPrint(\"Successfully popped off segment list\\n\")\n\t\t\t}\n\t\t\tatomic.AddInt64(&s.Length, -1)\n\t\t\treturn hd.Elt, true\n\t\t}\n\t}\n\treturn nil, false\n}\n\n// segment size\nconst segShift = 12\nconst segSize = 1 << segShift\n\n// The channel buffer is stored as a linked list of fixed-size arrays of size\n// segSize. ID is a monotonically increasing identifier corresponding to the\n// index in the buffer of the first element of the segment, divided by segSize\n// (see SplitInd).\ntype segment struct {\n\tID   index\n\tNext *segment\n\tData [segSize]Elt\n}\n\n// Load atomically loads the element at index i of s\nfunc (s *segment) Load(i index) Elt {\n\treturn Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Data[i]))))\n}\n\n// queue is the global state of the channel. 
It contains indices into the head\n// and tail of the channel as well as a linked list of spare segments used to\n// avoid excess allocations.\ntype queue struct {\n\tH           index // head index\n\tT           index // tail index\n\tSpareAllocs segList\n}\n\n// SplitInd splits i into the ID of the segment to which it refers as well as\n// the local index into that segment\nfunc (i index) SplitInd() (cellNum index, cellInd index) {\n\tcellNum = (i >> segShift)\n\tcellInd = i - (cellNum * segSize)\n\treturn\n}\n\nconst spare = true\n\n// Grow is called if a thread has arrived at the end of the segment list but\n// needs to enqueue/dequeue from an index with a higher cell ID. In this case we\n// attempt to assign the segment's next pointer to a new segment. Allocating\n// segments can be expensive, so the underlying queue has a 'SpareAllocs' list\n// of segments that can be used to grow the queue, or to store unused segments\n// that the thread allocates. The presence of 'SpareAllocs' complicates the\n// protocol quite a bit, but Grow is wait-free (aside from memory allocation)\n// and it only returns once tail.Next is non-nil.\nfunc (q *queue) Grow(tail *segment) {\n\tcurTail := atomic.LoadUint64((*uint64)(&tail.ID))\n\tif spare {\n\t\tif next, ok := q.SpareAllocs.TryPop(); ok {\n\t\t\tatomic.StoreUint64((*uint64)(&next.ID), curTail+1)\n\t\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),\n\t\t\t\tunsafe.Pointer(nil), unsafe.Pointer(next)) {\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t}\n\n\tnewSegment := &segment{ID: index(curTail + 1)}\n\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),\n\t\tunsafe.Pointer(nil), unsafe.Pointer(newSegment)) {\n\t\tif debug {\n\t\t\tdbgPrint(\"\\t\\tgrew\\n\")\n\t\t}\n\t\treturn\n\t}\n\tif spare {\n\t\t// If we allocated a new segment but failed, attempt to place it in\n\t\t// SpareAllocs so someone else can use it.\n\t\tq.SpareAllocs.TryPush(newSegment)\n\t}\n}\n\n// advance searches for a segment with ID cell at or after the segment in\n// *ptr. It returns with *ptr either pointing to the segment in question or to\n// the last non-nil segment in the list.\nfunc advance(ptr **segment, cell index) {\n\tfor {\n\t\tnext := (*segment)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&(*ptr).Next))))\n\t\tif next == nil || next.ID > cell {\n\t\t\tbreak\n\t\t}\n\t\t*ptr = next\n\t}\n}\n"
  },
  {
    "path": "src/fchan/unbounded.go",
    "content": "// Copyright 2016 Google Inc.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//     http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage fchan\n\nimport (\n\t\"runtime\"\n\t\"sync/atomic\"\n\t\"unsafe\"\n)\n\n// Thread-local state for interacting with an unbounded channel\ntype UnboundedChan struct {\n\t// pointer to global state\n\tq *queue\n\t// pointer into last guess at the true head and tail segments\n\thead, tail *segment\n}\n\n// New initializes a new queue and returns an initial handle to that queue. All\n// other handles are allocated by calls to NewHandle()\nfunc New() *UnboundedChan {\n\tsegPtr := &segment{} // 0 values are fine here\n\tq := &queue{\n\t\tH:           0,\n\t\tT:           0,\n\t\tSpareAllocs: segList{MaxSpares: int64(runtime.GOMAXPROCS(0))},\n\t}\n\th := &UnboundedChan{\n\t\tq:    q,\n\t\thead: segPtr,\n\t\ttail: segPtr,\n\t}\n\n\treturn h\n}\n\n// NewHandle creates a new handle for the given Queue.\nfunc (u *UnboundedChan) NewHandle() *UnboundedChan {\n\treturn &UnboundedChan{\n\t\tq:    u.q,\n\t\thead: u.head,\n\t\ttail: u.tail,\n\t}\n}\n\n// Enqueue enqueues a Elt into the channel\n// TODO(ezrosent) enforce that e is not nil, I think we make that assumption\n// here..\nfunc (u *UnboundedChan) Enqueue(e Elt) {\n\tu.adjust() // don't always do this?\n\tmyInd := index(atomic.AddUint64((*uint64)(&u.q.T), 1) - 1)\n\tcell, cellInd := myInd.SplitInd()\n\tseg := u.q.findCell(u.tail, cell)\n\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\tunsafe.Pointer(nil), unsafe.Pointer(e)) {\n\t\treturn\n\t}\n\twt := (*waiter)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd]))))\n\twt.Send(e)\n}\n\n// findCell finds a segment at or after start with ID cellID. If one does not\n// yet exist, it grows the list of segments.\nfunc (q *queue) findCell(start *segment, cellID index) *segment {\n\tcur := start\n\tfor cur.ID != cellID {\n\t\tnext := (*segment)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&cur.Next))))\n\t\tif next == nil {\n\t\t\tq.Grow(cur)\n\t\t\tcontinue\n\t\t}\n\t\tcur = next\n\t}\n\treturn cur\n}\n\n// adjust moves h's head and tail pointers forward if H and T point to a newer\n// segment. The loads and moves do not need to be atomic because H and T only\n// ever increase in value. 
Calling this regularly is probably good for\n// performance, and is necessary to ensure that old segments are garbage\n// collected.\nfunc (u *UnboundedChan) adjust() {\n\tH := index(atomic.LoadUint64((*uint64)(&u.q.H)))\n\tT := index(atomic.LoadUint64((*uint64)(&u.q.T)))\n\tcellH, _ := H.SplitInd()\n\tadvance(&u.head, cellH)\n\tcellT, _ := T.SplitInd()\n\tadvance(&u.tail, cellT)\n}\n\n// Dequeue receives an element from the channel; it blocks if no element is\n// available.\nfunc (u *UnboundedChan) Dequeue() Elt {\n\tu.adjust()\n\tmyInd := index(atomic.AddUint64((*uint64)(&u.q.H), 1) - 1)\n\tcell, cellInd := myInd.SplitInd()\n\tseg := u.q.findCell(u.head, cell)\n\telt := seg.Load(cellInd)\n\twt := makeWaiter()\n\tif elt == nil &&\n\t\tatomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\t\tunsafe.Pointer(nil), unsafe.Pointer(wt)) {\n\t\tif debug {\n\t\t\tdbgPrint(\"\\t[deq] slow path\\n\")\n\t\t}\n\t\treturn wt.Recv()\n\t}\n\t// either elt was already non-nil or our CAS failed; in both cases an\n\t// enqueuer has published the element into this cell.\n\treturn seg.Load(cellInd)\n}\n"
  },
  {
    "path": "writeup/graphs.py",
    "content": "# Copyright 2016 Google Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"\nThis is a basic script that parses the output of fchan_main and renders the\ngraphs for goroutines=GOMAXPROCS and goroutines=5000\n\"\"\"\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as unused_import\nimport re\nimport sys\n\n\nclass BenchResult(object):\n    def __init__(self, name, gmp, max_hw_thr, nops, secs):\n        self.name = name\n        self.gmp = gmp\n        self.max_hw_thr = int(max_hw_thr)\n        # Millions of operations per second\n        self.tp = float(nops) / (float(secs) * 1e6)\n\n\ndef parse_line(line):\n    m_gmp = re.match(r'^([^\\-]*)GMP-(\\d+)\\s+(\\d+)\\s+([^\\s]*)s\\s*$',\n                     line)\n    m2 = re.match(r'^([^\\-]*)-(\\d+)\\s+(\\d+)\\s+([^\\s]*)s\\s*$', line)\n    if m_gmp is not None:\n        name, threads, nops, secs = m_gmp.groups()\n        return BenchResult(name, True, threads, nops, secs)\n    if m2 is not None:\n        name, threads, nops, secs = m2.groups()\n        return BenchResult(name, False, threads, nops, secs)\n    print line, 'did not match anything'\n    return None\n\n\ndef plot_points(all_results, gmp):\n    series = sorted(list({k.name for k in all_results if k.gmp == gmp}))\n    for k in series:\n        results = [r for r in all_results if r.gmp == gmp and r.name == k]\n        points = sorted((r.max_hw_thr, r.tp) for r in results)\n        plt.xlabel(r'GOMAXPROCS')\n        plt.ylabel('Ops / second (millions)')\n        X = np.array([x for (x, y) in points])\n        Y = np.array([y for (x, y) in points])\n        plt.plot(X, Y, label=k)\n        plt.scatter(X, Y)\n        plt.legend()\n\n\ndef main(fname):\n    with open(fname) as f:\n        results = [p for p in (parse_line(line) for line in f)\n                   if p is not None]\n        print 'Generating non-GMP graph'\n        plt.title('5000 Goroutines')\n        plot_points(results, False)\n        plt.savefig('contend_graph.pdf')\n        plt.clf()\n        print 'Generating GMP graph'\n        plt.title('Goroutines Equal to GOMAXPROCS')\n        plot_points(results, True)\n        plt.savefig('gmp_graph.pdf')\n        plt.clf()\n\nif __name__ == '__main__':\n    main(sys.argv[1])\n"
  },
  {
    "path": "writeup/latex.template",
    "content": "\\documentclass[$if(fontsize)$$fontsize$,$endif$$if(lang)$$babel-lang$,$endif$$if(papersize)$$papersize$,$endif$$for(classoption)$$classoption$$sep$,$endfor$]{$documentclass$}\n$if(fontfamily)$\n\\usepackage{$fontfamily$}\n$else$\n%\\usepackage{lmodern}\n$endif$\n$if(linestretch)$\n\\usepackage{setspace}\n\\setstretch{$linestretch$}\n$endif$\n\\usepackage{amssymb,amsmath}\n\\usepackage{ifxetex,ifluatex}\n\\usepackage{fixltx2e} % provides \\textsubscript\n\\ifnum 0\\ifxetex 1\\fi\\ifluatex 1\\fi=0 % if pdftex\n  \\usepackage[T1]{fontenc}\n  \\usepackage[utf8]{inputenc}\n$if(euro)$\n  \\usepackage{eurosym}\n$endif$\n\\else % if luatex or xelatex\n  \\ifxetex\n    \\usepackage{mathspec}\n    \\usepackage{xltxtra,xunicode}\n  \\else\n    \\usepackage{fontspec}\n  \\fi\n  \\defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}\n  \\newcommand{\\euro}{€}\n$if(mainfont)$\n    \\setmainfont{$mainfont$}\n$endif$\n$if(sansfont)$\n    \\setsansfont{$sansfont$}\n$endif$\n$if(monofont)$\n    \\setmonofont[Mapping=tex-ansi]{$monofont$}\n$endif$\n$if(mathfont)$\n    \\setmathfont(Digits,Latin,Greek){$mathfont$}\n$endif$\n$if(CJKmainfont)$\n    \\usepackage{xeCJK}\n    \\setCJKmainfont[$CJKoptions$]{$CJKmainfont$}\n$endif$\n\\fi\n% use upquote if available, for straight quotes in verbatim environments\n\\IfFileExists{upquote.sty}{\\usepackage{upquote}}{}\n% use microtype if available\n\\IfFileExists{microtype.sty}{%\n\\usepackage{microtype}\n\\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts\n}{}\n$if(geometry)$\n\\usepackage[$for(geometry)$$geometry$$sep$,$endfor$]{geometry}\n$endif$\n\\ifxetex\n  \\usepackage[setpagesize=false, % page size defined by xetex\n              unicode=false, % unicode breaks when used with xetex\n              xetex]{hyperref}\n\\else\n  \\usepackage[unicode=true]{hyperref}\n\\fi\n\\usepackage[usenames,dvipsnames]{color}\n\\hypersetup{breaklinks=true,\n            bookmarks=true,\n            pdfauthor={$author-meta$},\n            pdftitle={$title-meta$},\n            colorlinks=true,\n            citecolor=$if(citecolor)$$citecolor$$else$blue$endif$,\n            urlcolor=$if(urlcolor)$$urlcolor$$else$blue$endif$,\n            linkcolor=$if(linkcolor)$$linkcolor$$else$magenta$endif$,\n            pdfborder={0 0 0}}\n\\urlstyle{same}  % don't use monospace font for urls\n$if(lang)$\n\\ifxetex\n  \\usepackage{polyglossia}\n  \\setmainlanguage[variant=$polyglossia-variant$]{$polyglossia-lang$}\n  \\setotherlanguages{$for(polyglossia-otherlangs)$$polyglossia-otherlangs$$sep$,$endfor$}\n\\else\n  \\usepackage[shorthands=off,$babel-lang$]{babel}\n\\fi\n$endif$\n$if(natbib)$\n\\usepackage{natbib}\n\\bibliographystyle{$if(biblio-style)$$biblio-style$$else$plainnat$endif$}\n$endif$\n$if(biblatex)$\n\\usepackage{biblatex}\n$for(bibliography)$\n\\addbibresource{$bibliography$}\n$endfor$\n$endif$\n$if(listings)$\n\\usepackage{listings}\n$endif$\n$if(lhs)$\n\\lstnewenvironment{code}{\\lstset{language=Haskell,basicstyle=\\small\\ttfamily}}{}\n$endif$\n$if(highlighting-macros)$\n$highlighting-macros$\n$endif$\n$if(verbatim-in-note)$\n\\usepackage{fancyvrb}\n\\VerbatimFootnotes\n$endif$\n$if(tables)$\n\\usepackage{longtable,booktabs}\n$endif$\n$if(graphics)$\n\\usepackage{graphicx,grffile}\n\\makeatletter\n\\def\\maxwidth{\\ifdim\\Gin@nat@width>\\linewidth\\linewidth\\else\\Gin@nat@width\\fi}\n\\def\\maxheight{\\ifdim\\Gin@nat@height>\\textheight\\textheight\\else\\Gin@nat@height\\fi}\n\\makeatother\n% Scale images if necessary, so that they 
will not overflow the page\n% margins by default, and it is still possible to overwrite the defaults\n% using explicit options in \\includegraphics[width, height, ...]{}\n\\setkeys{Gin}{width=\\maxwidth,height=\\maxheight,keepaspectratio}\n$endif$\n$if(links-as-notes)$\n% Make links footnotes instead of hotlinks:\n\\renewcommand{\\href}[2]{#2\\footnote{\\url{#1}}}\n$endif$\n$if(strikeout)$\n\\usepackage[normalem]{ulem}\n% avoid problems with \\sout in headers with hyperref:\n\\pdfstringdefDisableCommands{\\renewcommand{\\sout}{}}\n$endif$\n\\setlength{\\parindent}{0pt}\n\\setlength{\\parskip}{6pt plus 2pt minus 1pt}\n\\setlength{\\emergencystretch}{3em}  % prevent overfull lines\n\\providecommand{\\tightlist}{%\n  \\setlength{\\itemsep}{0pt}\\setlength{\\parskip}{0pt}}\n$if(numbersections)$\n\\setcounter{secnumdepth}{5}\n$else$\n\\setcounter{secnumdepth}{0}\n$endif$\n$if(verbatim-in-note)$\n\\VerbatimFootnotes % allows verbatim text in footnotes\n$endif$\n$if(dir)$\n\\ifxetex\n  % load bidi as late as possible as it modifies e.g. graphicx\n  $if(latex-dir-rtl)$\n  \\usepackage[RTLdocument]{bidi}\n  $else$\n  \\usepackage{bidi}\n  $endif$\n\\fi\n\\ifnum 0\\ifxetex 1\\fi\\ifluatex 1\\fi=0 % if pdftex\n  \\TeXXeTstate=1\n  \\newcommand{\\RL}[1]{\\beginR #1\\endR}\n  \\newcommand{\\LR}[1]{\\beginL #1\\endL}\n  \\newenvironment{RTL}{\\beginR}{\\endR}\n  \\newenvironment{LTR}{\\beginL}{\\endL}\n\\fi\n$endif$\n\n$if(title)$\n\\title{$title$$if(subtitle)$\\\\\\vspace{0.5em}{\\large $subtitle$}$endif$}\n$endif$\n\n$if(author)$\n\\usepackage{fancyhdr}\n\\fancypagestyle{plain}{}\n\\pagestyle{fancy}\n\\fancyhead[LO,RE]{\\large $for(author)$$author.name$$sep$ \\and $endfor$}\n%\\author{$for(author)$$author.name$$sep$ \\and $endfor$}\n$endif$\n\\date{$date$}\n$for(header-includes)$\n$header-includes$\n$endfor$\n\n% Redefines (sub)paragraphs to behave more like sections\n\\ifx\\paragraph\\undefined\\else\n\\let\\oldparagraph\\paragraph\n\\renewcommand{\\paragraph}[1]{\\oldparagraph{#1}\\mbox{}}\n\\fi\n\\ifx\\subparagraph\\undefined\\else\n\\let\\oldsubparagraph\\subparagraph\n\\renewcommand{\\subparagraph}[1]{\\oldsubparagraph{#1}\\mbox{}}\n\\fi\n\n\\begin{document}\n$if(title)$\n\\maketitle\n$endif$\n$if(abstract)$\n\\begin{abstract}\n$abstract$\n\\end{abstract}\n$endif$\n\n$for(include-before)$\n$include-before$\n\n$endfor$\n$if(toc)$\n{\n  \\vspace{-0.9in}\n\\hypersetup{linkcolor=$if(toccolor)$$toccolor$$else$black$endif$}\n\\setcounter{tocdepth}{$toc-depth$}\n\\tableofcontents\n}\n$endif$\n$if(lot)$\n\\listoftables\n$endif$\n$if(lof)$\n\\listoffigures\n$endif$\n$body$\n\n$if(natbib)$\n$if(bibliography)$\n$if(biblio-title)$\n$if(book-class)$\n\\renewcommand\\bibname{$biblio-title$}\n$else$\n\\renewcommand\\refname{$biblio-title$}\n$endif$\n$endif$\n\\bibliography{$for(bibliography)$$bibliography$$sep$,$endfor$}\n\n$endif$\n$endif$\n$if(biblatex)$\n\\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$\n\n$endif$\n$for(include-after)$\n$include-after$\n\n$endfor$\n\\end{document}\n"
  },
  {
    "path": "writeup/refs.bib",
    "content": "@inproceedings{wfq,\n  title={A wait-free queue as fast as fetch-and-add},\n  author={Yang, Chaoran and Mellor-Crummey, John},\n  booktitle={Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},\n  pages={16},\n  year={2016},\n  organization={ACM}\n}\n\n@inproceedings{lcrq,\n  title={Fast concurrent queues for x86 processors},\n  author={Morrison, Adam and Afek, Yehuda},\n  booktitle={ACM SIGPLAN Notices},\n  volume={48},\n  number={8},\n  pages={103--112},\n  year={2013},\n  organization={ACM}\n}\n@incollection{CSP,\n  title={Communicating sequential processes},\n  author={Hoare, Charles Antony Richard},\n  booktitle={The origin of concurrent programming},\n  pages={413--443},\n  year={1978},\n  publisher={Springer}\n}\n@book{tgpl,\n  author = {Donovan, Alan A.A. and Kernighan, Brian W.},\n  title = {The Go Programming Language},\n  year = {2015},\n  isbn = {0134190440, 9780134190440},\n  edition = {1st},\n  publisher = {Addison-Wesley Professional},\n}\n\n@inproceedings{MSQueue,\n  title={Simple, fast, and practical non-blocking and blocking concurrent queue algorithms},\n  author={Michael, Maged M and Scott, Michael L},\n  booktitle={Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing},\n  pages={267--275},\n  year={1996},\n  organization={ACM}\n}\n\n@article{herlihyBook,\n  title={The Art of Multiprocessor Programming},\n  author={Herlihy, Maurice and Shavit, Nir},\n  year={2008},\n  publisher={Morgan Kaufmann Publishers Inc.}\n}\n\n@online{GoSpec,\n    title = {The Go Programming Language Specification},\n    howpublished = {\\url{https://golang.org/ref/spec}},\n    year = {2009},\n    urldate = {2016-10-30}\n}\n\n@inproceedings{FastSlow,\n  title={A methodology for creating fast wait-free data structures},\n  author={Kogan, Alex and Petrank, Erez},\n  booktitle={ACM SIGPLAN Notices},\n  volume={47},\n  number={8},\n  pages={141--150},\n  year={2012},\n  organization={ACM}\n}\n\n@article{wfsync,\n  title={Wait-free synchronization},\n  author={Herlihy, Maurice},\n  journal={ACM Transactions on Programming Languages and Systems (TOPLAS)},\n  volume={13},\n  number={1},\n  pages={124--149},\n  year={1991},\n  publisher={ACM}\n}\n\n@incollection{marlowPar,\n  title={Parallel and concurrent programming in Haskell},\n  author={Marlow, Simon},\n  booktitle={Central European Functional Programming School},\n  pages={339--401},\n  year={2012},\n  publisher={Springer}\n}\n\n@article{herlihyLinear,\n  title={Linearizability: A correctness condition for concurrent objects},\n  author={Herlihy, Maurice P and Wing, Jeannette M},\n  journal={ACM Transactions on Programming Languages and Systems\n    (TOPLAS)},\n  volume={12},\n  number={3},\n  pages={463--492},\n  year={1990},\n  publisher={ACM}\n}\n"
  },
  {
    "path": "writeup/writeup.md",
    "content": "---\ntitle: Faster Channels in Go (Work in Progress)\nsubtitle: Scaling Blocking Channels with Techniques from Nonblocking Data-Structures.\ntoc: true\nlink-citations: true\ngeometry: ['margin=1in']\nfontsize: 11pt\nauthor: \n  name: Eli Rosenthal\n---\n\n<!--\nCompile with\npandoc -s -S --template=latex.template --latex-engine=xelatex \\\n  --bibliography refs.bib --metadata link-citations=true \\\n  --filter pandoc-citeproc  writeup.md -o writeup.pdf\n\n--\n-->\n\n\n# Introduction\n\nChannels in the [Go](https://golang.org/) language are a common way to structure\nconcurrent code. The channel API in Go is intended to support programming in the\nmanner described by CSP [see  @CSP, the original paper; also the preface of\n@tgpl for CSP's relationship to Go].  Channels in Go have a fixed buffer size\n$b$ such that only $b$ senders may return without having handed a value off to a\ncorresponding receiver. Here is some basic pseudocode[^pseudo] for the send and\nreceive operations[^select], though it is worth referring to the spec @GoSpec as\nwell.\n\n~~~~\nsend(c: chan T, item: T)                  receive(c: chan T) -> T\n  atomically do:                            atomically do:\n    if the buffer is full                       begin:\n      block                                     if there are items in the buffer\n    append an item to the buffer                  result = head of buffer\n    if there were any receivers blocked           advance the buffer head\n      wake the first one up                       if there are any senders waiting\n                                                    wake the first sender up\n                                                  return result\n                                                if the buffer is empty\n                                                  block\n                                                  goto begin\n~~~~\n\nGo channels currently require goroutines[^goroutine] to acquire a single\nlock before performing additional operations[^chanimp]. This makes contention\nfor this lock a scalability bottleneck; while acquiring a mutex can be very fast\nthis means that only one thread can perform an operation on a queue at a time.\nThis document describes the implementation of a novel channel algorithm that\npermits different sends and receives to complete in parallel.\n\nWe will start with a review of recent literature on non-blocking queues. Then we\nwill move onto describing the implementation of a fast *unbounded* channel in\nGo; this algorithm may be of independent interest. Finally we will extend this\ndesign to provide the bounded semantics of Go channels. We will also report\nperformance measurements for these algorithms.\n\n# Non-blocking Queues\n<!-- confirm definition in herlihy book for non blocking -->\n\nThe standard data-structure closest to the notion of unbounded channel is that\nof a FIFO queue. 
There are myriad algorithms for concurrent queues which provide different guarantees in terms of progress and consistency [see @herlihyBook Chapter 10 for an overview], but we will focus here on *non-blocking* queues, because that literature is a rich source of techniques for building scalable concurrent data-structures.\n\nInformally, we say a data-structure is non-blocking if no thread can perform an operation that will require it to block any other threads for an unbounded amount of time. As a result, no queue that requires a thread to take a lock can be non-blocking: one thread can acquire a lock and then be de-scheduled for an arbitrary amount of time, and thereby block all other threads contending for the lock. Non-blocking algorithms generally use atomic instructions like Compare-And-Swap (CAS) to avoid different threads stepping on one another's toes [see @herlihyBook Chapter 3 for a tutorial on atomic synchronization primitives]. Non-blocking operations are further classified by the progress guarantees they provide:\n\n* **Obstruction Freedom** If there is only one thread executing an operation, that operation will complete in a finite number of steps.\n* **Lock Freedom** Regardless of the number of threads executing an operation concurrently, at least one thread will complete the operation in a finite number of steps.\n* **Wait Freedom** Any thread executing an operation is guaranteed to finish in a finite number of steps.\n\nNon-blocking synchronization is not a panacea. The fact that there are hard upper bounds on how long it will take for a thread to complete an operation does not imply that the algorithm will perform better in practice. While wait-free data-structures are important for some embedded or real-time systems that need these strong guarantees, there are often blocking algorithms which perform better in terms of throughput than their lock-free or wait-free counterparts[^combine]. Still, non-blocking algorithms can shine in high-contention settings. A small number of CAS operations can amount to less overhead than acquiring a lock, and more fine-grained concurrency coupled with progress guarantees *can* reduce contention[^msqueue].\n\n<!-- check out http://dl.acm.org/citation.cfm?id=1122994 -->\n<!-- http://link.springer.com/chapter/10.1007/978-3-642-15291-7_16 -->\n<!-- worth linking to performance blog post about tail latency w/wait-free queue as well -->\n\n# Using Fetch-and-Add to Reduce Contention\n\nThe atomic Fetch-and-Add (F&A) instruction adds a value to an integer and returns the old or new value of that integer. Here are the basic semantics of the operation in Go[^fasem]:\n\n```go\n// atomically\nfunc AtomicFetchAdd(src *int, delta int) int {\n  *src += delta\n  return *src\n}\n```\n\nWhile hardware support for an F&A instruction is not as universal as that of CAS, F&A is implemented on x86. On modern x86 machines, F&A is much faster than CAS [see @lcrq for performance measurements], and it always succeeds. This has the dual effect of allowing code making judicious use of F&A to be both efficient and easier to reason about than equivalents that rely only on CAS. A common pattern exemplifying this idea is to first use F&A to acquire an index into an array, and then to use more conventional techniques to write to that index.\n
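\nIn real Go code, the counterpart of `AtomicFetchAdd` is the family of add functions in the standard `sync/atomic` package, which likewise return the new value. A small runnable example of the idiom used throughout this document:\n\n~~~~ {.go}\npackage main\n\nimport (\n\t\"fmt\"\n\t\"sync/atomic\"\n)\n\nfunc main() {\n\tvar tail uint64\n\t// AddUint64 atomically adds 1 to tail and returns the *new* value, so\n\t// subtracting 1 recovers the index this thread now owns.\n\tmyIndex := atomic.AddUint64(&tail, 1) - 1\n\tfmt.Println(myIndex) // prints 0\n}\n~~~~\n\n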
This pattern is helpful because it can reduce contention on the individual locations of a data-structure.\n\n## A Non-blocking Queue From an Infinite Array\n\nTo illustrate this, we will write two non-blocking queues in pseudo-Go based on an infinite array [`Queue2` is based on the obstruction-free queue presented in pseudo-code in @wfq; `Queue1` is a CAS-ification of that design]. Both of these designs make use of the fact that the head and tail pointers *only ever increase*.\n\n~~~~ {.go .numberLines }\ntype Queue1 struct {\n\thead, tail *T\n\tdata       [∞]T\n}\nfunc (q *Queue1) Enqueue(elt T) {\n\tfor {\n\t\tnewTail := atomic.LoadPointer(&q.tail) + 1\n\t\tif atomic.CompareAndSwapT(newTail, nil, elt) {\n\t\t\tatomic.CompareAndSwapPointer(&q.tail, newTail-1, newTail)\n\t\t\tbreak\n\t\t}\n\t}\n}\nfunc (q *Queue1) Dequeue() T {\n\tfor {\n\t\tcurHead := atomic.LoadPointer(&q.head)\n\t\tcurTail := atomic.LoadPointer(&q.tail)\n\t\tif curHead == curTail {\n\t\t\treturn nil\n\t\t}\n\t\tif atomic.CompareAndSwapPointer(&q.head, curHead, curHead+1) {\n\t\t\treturn *curHead\n\t\t}\n\t}\n}\n~~~~\n\nThe second queue will assume that the type `T` can not only take on a `nil` value but also an unambiguous `SENTINEL` value that a user is guaranteed not to pass in to `Enqueue`. This value is used to mark an index as unusable, signalling to a conflicting `Enqueue` thread that it should try again.\n\n<!-- If there is any change to the above pseudocode, change the startFrom line here  -->\n\n~~~~ {.go .numberLines startFrom=\"26\"}\ntype Queue2 struct {\n\thead, tail uint\n\tdata       [∞]T\n}\n\nfunc (q *Queue2) Enqueue(elt T) {\n\tfor {\n\t\tmyTail := atomic.AddUint(&q.tail, 1) - 1\n\t\tif atomic.CompareAndSwapT(&q.data[myTail], nil, elt) {\n\t\t\tbreak\n\t\t}\n\t}\n}\n\nfunc (q *Queue2) Dequeue() T {\n\tfor {\n\t\tmyHead := atomic.AddUint(&q.head, 1) - 1\n\t\tcurTail := atomic.LoadUint(&q.tail)\n\t\tif !atomic.CompareAndSwapT(&q.data[myHead], nil, SENTINEL) {\n\t\t\treturn atomic.LoadT(&q.data[myHead])\n\t\t}\n\t\tif myHead == curTail {\n\t\t\treturn nil\n\t\t}\n\t}\n}\n~~~~\n\nThe core algorithm for both `Queue1` and `Queue2` is essentially the same. Enqueueing threads load a view of the tail pointer and try to CAS their element in one element after that pointer; dequeueing threads perform a symmetric operation to advance the head pointer. The practical (that is, practical for algorithms that require an infinite amount of memory) difference between `Queue1` and `Queue2` is that `Queue2` first has threads perform an atomic increment of a head or tail index. This means that two concurrent enqueue operations will always attempt a CAS on *different* queue elements. As a result, enqueue operations need only concern themselves with dequeue operations that increment `head` to the same value as their `myTail` (lines 33--34).\n\nA downside of this approach is that while `Queue1` is lock-free, `Queue2` is merely obstruction-free. For an enqueue/dequeue pair of threads, each can continually increment equal `head` and `tail` indices while the dequeuer's CAS (line 44) always succeeds before the enqueuer's (line 34), resulting in livelock[^livelockdef].\n
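\nTo make the livelock concrete, here is one unlucky schedule for a single enqueue/dequeue pair starting from an empty `Queue2` (our illustration of the scenario just described):\n\n~~~~\nfor i = 0, 1, 2, ...:\n  enqueuer: myTail  = F&A(tail) = i        // tail is now i+1\n  dequeuer: myHead  = F&A(head) = i        // head is now i+1\n  dequeuer: curTail = load(tail) = i+1\n  dequeuer: CAS(data[i], nil, SENTINEL) succeeds\n  enqueuer: CAS(data[i], nil, elt) fails   // retry at i+1\n  dequeuer: myHead (i) != curTail (i+1)    // retry at i+1\n~~~~\n\nNeither thread ever blocks, but neither ever completes its operation.\n\n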
## Lessons for Channels\n\n`Queue2` above is the core of the implementation of the fast wait-free queue in @wfq. It is also the basic idea that we will leverage when designing a more scalable channel. The rest of their algorithm consists in solving three problems that have analogs in our setting.\n\n  (1) *Simulating an infinite array with a finite amount of memory.* Here the authors implement a linked list of fixed-length arrays (called segments, or cells); threads grow this array when more space is required.\n\n  (2) *Going from obstruction freedom to wait freedom.* This involves attempting either `Dequeue` or `Enqueue` above for a constant number of iterations, followed by a slow path which involves implementing a helping mechanism[^helping] to help contending threads finish their outstanding operations.\n\n  (3) *Memory reclamation.* Reclaiming memory in a non-blocking setting is, perhaps unsurprisingly, a very fraught task.\n\nWhile the solution to (3) in this paper is interesting and efficient, we will (mercifully) be relying on Go's garbage collector to solve this problem. For (1) we will employ essentially the same algorithm as the paper, but with additional optimizations for memory allocation. For (2), our slow path will implement the blocking semantics of a channel.\n\n# An Unbounded Channel With Low Contention\n\nWe first consider the case of implementing an unbounded channel. While this channel is blocking --- Go channels must in some capacity be blocking, as they provide a synchronization mechanism --- it only blocks when it has to (i.e. for receives that do not yet have a corresponding send), and when it does, progress is impeded for at most two threads: the components of a send/receive pair. We will start with the types:\n\n~~~~ {.go }\ntype Elt *interface{}\ntype index uint64\n\n// segment size\nconst segShift = 12\nconst segSize = 1 << segShift\n\n// The channel buffer is stored as a linked list of fixed-size arrays of\n// segSize elements. ID is a monotonically increasing identifier corresponding\n// to the index in the buffer of the first element of the segment, divided by\n// segSize (see SplitInd).\ntype segment struct {\n\tID   index // index of Data[0] / segSize\n\tNext *segment\n\tData [segSize]Elt\n}\n\n// Load atomically loads the element at index i of s\nfunc (s *segment) Load(i index) Elt {\n\treturn Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Data[i]))))\n}\n\n// queue is the global state of the channel. It contains indices into the head\n// and tail of the channel; Appendix A extends it with a linked list of spare\n// segments used to avoid excess allocations.\ntype queue struct{ H, T index }\n\n// Thread-local state for interacting with an unbounded channel\ntype UnboundedChan struct {\n\t// pointer to global state\n\tq *queue\n\t// pointer into last guess at the true head and tail segments\n\thead, tail *segment\n}\n~~~~\n\nThe only global state that we employ is the `queue` structure, which maintains the head and tail indices.\n
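\nThe operations below also rely on two helpers whose definitions the original listings omit: `SplitInd`, which splits a global index into a segment ID and an offset into that segment, and `adjust`, which is described after the `Enqueue` listing below. Here are minimal sketches under our assumptions (`findCell`, used by `adjust`, is defined in a later subsection):\n\n~~~~ {.go}\n// SplitInd returns the ID of the segment containing global index i, along\n// with i's offset inside that segment's Data array. Since segSize is a power\n// of two, this is presumably a shift and a mask; this sketch is ours.\nfunc (i index) SplitInd() (cell, cellInd index) {\n\treturn i >> segShift, i & (segSize - 1)\n}\n\n// adjust advances u.head and u.tail to the segments containing the current\n// global H and T indices. This reconstruction is ours.\nfunc (u *UnboundedChan) adjust() {\n\thCell, _ := index(atomic.LoadUint64((*uint64)(&u.q.H))).SplitInd()\n\ttCell, _ := index(atomic.LoadUint64((*uint64)(&u.q.T))).SplitInd()\n\tu.head = u.q.findCell(u.head, hCell)\n\tu.tail = u.q.findCell(u.tail, tCell)\n}\n~~~~\n\n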
Pointers into the data itself are kept locally in an `UnboundedChan`, for two reasons:\n\n  (1) It reduces any contention that could result from updating shared head or tail pointers.\n\n  (2) If individual threads all update local head and tail pointers, then the garbage collector will be able to clean up consumed segments when (and only when) all threads no longer hold a reference to them.\n\nWe note that a downside of this design is that inactive threads that hold such a handle can cause space leaks by holding onto references to long-dead segments.\n\nUsers interact with a channel by first creating an initial value, and later cloning that value and others derived from it using `NewHandle`.\n\n~~~~ {.go }\n// New initializes a new queue and returns an initial handle to that queue. All\n// other handles are allocated by calls to NewHandle()\nfunc New() *UnboundedChan {\n\tsegPtr := &segment{} // 0 values are fine here\n\tq := &queue{H: 0, T: 0}\n\th := &UnboundedChan{q: q, head: segPtr, tail: segPtr}\n\treturn h\n}\n\n// NewHandle creates a new handle for the given channel\nfunc (u *UnboundedChan) NewHandle() *UnboundedChan {\n\treturn &UnboundedChan{q: u.q, head: u.head, tail: u.tail}\n}\n~~~~\n\n## Sending and Receiving\n\nThe key enqueue (or send) algorithm is to atomically increment the \\texttt{T} index, attempt to CAS in the item, and to wake up a blocked thread if the CAS fails. We will begin with the `Enqueue` code and then explain the code that it calls.\n\n~~~~ {.go .numberLines}\nfunc (u *UnboundedChan) Enqueue(e Elt) {\n\tu.adjust()\n\tmyInd := index(atomic.AddUint64((*uint64)(&u.q.T), 1) - 1)\n\tcell, cellInd := myInd.SplitInd()\n\tseg := u.q.findCell(u.tail, cell)\n\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\tunsafe.Pointer(nil), unsafe.Pointer(e)) {\n\t\treturn\n\t}\n\twt := (*waiter)(atomic.LoadPointer(\n\t\t(*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd]))))\n\twt.Send(e)\n}\n\nfunc (u *UnboundedChan) Dequeue() Elt {\n\tu.adjust()\n\tmyInd := index(atomic.AddUint64((*uint64)(&u.q.H), 1) - 1)\n\tcell, cellInd := myInd.SplitInd()\n\tseg := u.q.findCell(u.head, cell)\n\telt := seg.Load(cellInd)\n\twt := makeWaiter()\n\tif elt == nil &&\n\t\tatomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\t\tunsafe.Pointer(nil), unsafe.Pointer(wt)) {\n\t\treturn wt.Recv()\n\t}\n\treturn seg.Load(cellInd)\n}\n~~~~\n\nThe `adjust` method (line 2) atomically loads `H` and `T`, then advances `u.head` and `u.tail` to point to their cells. The atomic add on line 3 acquires an index into the queue. `SplitInd` (line 4) returns the cell ID and the index into that cell corresponding to `myInd`. As `T` can only increase, the only possible thread that could also be contending for this item is a `Dequeue`ing thread that acquired `H` at the same value as `myInd`. So it comes down to the CASes on lines 6--7 and 22--24. If the first CAS fails, it means a `Dequeue` thread has swapped in a `waiter`; if it succeeds, the `Enqueue`r can return, and a contending `Dequeue`r can simply load the value at `cellInd`.\n\n## Blocking\n\nSo what is a `waiter`? It acts like a channel with buffer size 1, or an *MVar* in the Haskell community [see Chapter 7 of @marlowPar for an introduction], that can only tolerate one element ever being sent on it.\n
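\nFor intuition only, the contract can be expressed with a native Go channel (this sketch is expository; it is not the implementation we use, since avoiding native channel overhead is the point of the exercise):\n\n~~~~ {.go}\n// chanWaiter is a conceptual stand-in for waiter: Send may be called at most\n// once, and Recv blocks until the value arrives.\ntype chanWaiter struct{ c chan Elt }\n\nfunc makeChanWaiter() *chanWaiter { return &chanWaiter{c: make(chan Elt, 1)} }\nfunc (w *chanWaiter) Send(e Elt)  { w.c <- e }\nfunc (w *chanWaiter) Recv() Elt   { return <-w.c }\n~~~~\n\n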
We currently implement this in terms of a single value and a `WaitGroup`. `WaitGroup`s in Go's `sync` package allow goroutines to `Add` an integer value to the `WaitGroup`'s counter and to `Wait` for that counter to reach zero. If the counter goes below zero, the current `WaitGroup` implementation panics, which is helpful for debugging purposes, as there should only ever be one `Send` or `Recv` on a given `waiter`.\n\n~~~~ {.go}\ntype waiter struct {\n\tE      Elt\n\tWgroup sync.WaitGroup\n}\n\nfunc makeWaiter() *waiter {\n\twait := &waiter{}\n\twait.Wgroup.Add(1)\n\treturn wait\n}\n\nfunc (w *waiter) Send(e Elt) {\n\tatomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&w.E)), unsafe.Pointer(e))\n\tw.Wgroup.Done() // The Done method just calls Add(-1)\n}\n\nfunc (w *waiter) Recv() Elt {\n\tw.Wgroup.Wait()\n\treturn Elt(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&w.E))))\n}\n~~~~\n\nThere are two important parts of our strategy to implement blocking. First, neither Enqueuers nor Dequeuers will block at all if Enqueuers complete before Dequeuers begin; in that case the only global synchronization they must perform is a single F&A and a single *uncontended* CAS (unless they must grow the queue; see below). Second, if an Enqueuer does not arrive soon enough and a Dequeuer must block on a `waiter`, there will be essentially no contention for the waiter, because only one other thread will ever interact with it.\n\n## Growing the Queue and Allocation\n\nWe will now describe the implementation of the `findCell` method. The algorithm is to start at a given segment pointer, and to follow that segment's `Next` pointer until that segment's `ID` is equal to a given `cell` index. If `findCell` reaches the end of the list of segments before it reaches the correct index, it attempts to allocate a new segment and place it onto the end of the list. Here is some code:\n\n~~~~ {.go .numberLines}\nfunc (q *queue) findCell(start *segment, cellID index) *segment {\n\tcur := start\n\tfor cur.ID != cellID {\n\t\tnext := (*segment)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&cur.Next))))\n\t\tif next == nil {\n\t\t\tq.Grow(cur)\n\t\t\tcontinue\n\t\t}\n\t\tcur = next\n\t}\n\treturn cur\n}\nfunc (q *queue) Grow(tail *segment) {\n\tcurTail := atomic.LoadUint64((*uint64)(&tail.ID))\n\tnewSegment := &segment{ID: index(curTail + 1)}\n\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),\n\t\tunsafe.Pointer(nil), unsafe.Pointer(newSegment)) {\n\t\treturn\n\t}\n}\n~~~~\n\nNote that we can get away with performing a single CAS operation in `Grow` because, if our CAS failed, we know someone else succeeded: a new segment with ID of `tail.ID+1` is the only possible value that could be placed there. However, there *is* a problem with this implementation: it is extremely wasteful. In a high-contention situation, it is possible for many threads to all allocate a new segment, though only one of those threads can succeed. Any failed allocations will become immediately unreachable and will hence be garbage collected. In our experiments, channel operations are fastest when segments have a size of $\\geq 1024$, so any wasted allocation can have a tangible impact on throughput. This slowdown was evident in our performance measurements.\n<!-- TODO(ezr) compare benchmark performance to before/after list allocation -->\n\nOur solution to this problem is to keep a lock-free linked list of segments in the `queue` structure. Threads in `Grow` first try to pop a segment off of this list, and then perform the CAS; only if this pop fails do they allocate a new segment. Symmetrically, if the CAS fails, threads attempt to push their freshly allocated segment onto this list. The list keeps a best-effort counter representing its length and does not allow this counter to grow past a maximum value; this allows us to avoid a space leak in the implementation of the queue. For a full implementation of `Grow`, see Appendix A.\n
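\nBefore extending the design, here is a small usage sketch of the resulting API (our example; it assumes the `fmt` package is imported and that the channel package's identifiers are in scope):\n\n~~~~ {.go}\nfunc producerConsumer() {\n\th := New()          // create a channel and an initial handle\n\th2 := h.NewHandle() // handles are thread-local: one per goroutine\n\n\tgo func() {\n\t\tvar v interface{} = 42\n\t\th.Enqueue(Elt(&v))\n\t}()\n\n\t// Dequeue blocks (on a waiter) until the send above arrives.\n\tfmt.Println(*h2.Dequeue()) // prints 42\n}\n~~~~\n\n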
# Extending to the Bounded Case\n\nGo channels do not have an unbounded variant. While the structure offered above is potentially useful, there are good reasons to prefer bounded channels in some settings[^bounded]. Moreover, unbuffered channels support the more synchronous programming style that is common in Go, in which a send rendezvouses with a corresponding receive to synchronize two cooperating threads; this level of synchronization is useful to have. This section describes the implementation of a bounded channel on top of the unbounded implementation above.\n\n## Preliminaries\n\nWe re-use the `queue` and `segment` types, along with the `findCell` and `Grow` machinery. Almost all of the difference is in the new `Enqueue` and `Dequeue` operations. These are, however, significantly more complex. This complexity is the result of senders and receivers being given new responsibilities:\n\n  * Senders must decide if they should block and wait for more receivers to arrive.\n\n  * Receivers, if they succeed in popping an element off of the queue, have to wake up any senders that ought to wake up.\n\nAs before, this protocol is implemented in a manner that avoids blocking unless blocking is required by the channel semantics. This means the `Enqueue` and `Dequeue` methods must consider arbitrary interleavings of the unbounded channel protocol and the new blocking protocol. The `BoundedChan` has an additional integer field `bound`, indicating the maximum number of senders permitted to return without having rendezvoused with a receiver.\n\nWe also introduce an immutable global `sentinel` pointer used by receiving threads to signal that a sender should not block. A consequence of this design is that all places that previously required a CAS from `nil` to another value must now also attempt to CAS from `sentinel`. We maintain the invariant that no value will transition from `sentinel` back to `nil`, so the `tryCas` function below guarantees that `seg.Data[segInd]` is neither `nil` nor `sentinel` when it returns (unless `e` is either of those).\n
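\nThe declaration of `sentinel` is not shown in the original listings; any unique pointer that is never dereferenced and can never be produced by user code will do. A sketch of what we assume:\n\n~~~~ {.go}\n// sentinel is compared only by identity. It is distinct from nil and from\n// every *waiter, *weakWaiter, and user-supplied Elt.\nvar sentinelVal interface{} = struct{}{}\nvar sentinel = unsafe.Pointer(&sentinelVal)\n~~~~\n\n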
## (Aside) Possible Histories of an Element in a Segment\n\nIn the unbounded case, there were essentially two possible histories of a value in the queue:\n\n|Events                | History                |\n|----------------------|------------------------|\n|Sender, Receiver      |  `nil` $\\to$ `Elt`     |\n|Receiver, Sender      |  `nil` $\\to$ `*waiter` |\n\nThis can be viewed as the key invariant that is enforced in the implementation of unbounded channels. There are more histories in the bounded case. These (and only these) can all arise --- keeping this list in mind is helpful for understanding the protocol:\n\n---------------------------------------------------------------------------------------------\nEvents                                               History\n---------------------------------------------------  ----------------------------------------\nSender, Receiver                                     `nil` $\\to$ `Elt`\n\nReceiver, Sender                                     `nil` $\\to$ `*waiter`\n\nWaker, Sender, Receiver                              `nil` $\\to$ `sentinel` $\\to$ `Elt`\n\nWaker, Receiver, Sender                              `nil` $\\to$ `sentinel` $\\to$ `*waiter`\n\n$\\textrm{Sender}^\\dagger$, Waker, Sender, Receiver   `nil` $\\to$ `*weakWaiter` $\\to$ `Elt`\n\n$\\textrm{Sender}^\\dagger$, Waker, Receiver, Sender   `nil` $\\to$ `*weakWaiter` $\\to$ `*waiter`\n\n---------------------------------------------------  ----------------------------------------\n\nHere $\\textrm{Sender}^\\dagger$ denotes a sender that arrives but must block until more receivers complete, and a Waker is any thread that successfully wakes up a blocked Sender. The details of what a `weakWaiter` is, and who exactly plays the role of \"Waker\", are covered in the following sections.\n\n\n## Enqueue\n\nWe first present the source of `tryCas` and `Enqueue`:\n\n~~~~ {.go .numberLines}\nfunc tryCas(seg *segment, segInd index, elt unsafe.Pointer) bool {\n\treturn atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),\n\t\tunsafe.Pointer(nil), elt) ||\n\t\tatomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),\n\t\t\tsentinel, elt)\n}\n\n// Enqueue sends e on b. If there are already >= bound unreceived sends, then\n// Enqueue will block until sufficiently many elements have been received.\nfunc (b *BoundedChan) Enqueue(e Elt) {\n\tb.adjust()\n\tstartHead := index(atomic.LoadUint64((*uint64)(&b.q.H)))\n\tmyInd := index(atomic.AddUint64((*uint64)(&b.q.T), 1) - 1)\n\tcell, cellInd := myInd.SplitInd()\n\tseg := b.q.findCell(b.tail, cell)\n\tif myInd > startHead && (myInd-startHead) > index(uint64(b.bound)) {\n\t\t// there is a chance that we have to block\n\t\tvar w interface{} = makeWeakWaiter(2)\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\t\tunsafe.Pointer(nil), unsafe.Pointer(Elt(&w))) {\n\t\t\t// we successfully swapped in w. No one will overwrite this\n\t\t\t// location unless they signal w first. We block.\n\t\t\tw.(*weakWaiter).Wait()\n\t\t\tif atomic.CompareAndSwapPointer(\n\t\t\t\t(*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\t\t\tunsafe.Pointer(Elt(&w)), unsafe.Pointer(e)) {\n\t\t\t\treturn\n\t\t\t} // someone put a waiter into this location. We need to use the slow path\n\t\t} else if atomic.CompareAndSwapPointer(\n\t\t\t(*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])),\n\t\t\tsentinel, unsafe.Pointer(e)) {\n\t\t\t// Between us reading startHead and now, there were enough\n\t\t\t// increments to make it the case that we should no longer\n\t\t\t// block.\n\t\t\treturn\n\t\t}\n\t} else {\n\t\t// normal case. 
We know we don't have to block, because b.q.H can only\n\t\t// increase.\n\t\tif tryCas(seg, cellInd, unsafe.Pointer(e)) {\n\t\t\treturn\n\t\t}\n\t}\n\tptr := atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[cellInd])))\n\tw := (*waiter)(ptr)\n\tw.Send(e)\n\treturn\n}\n~~~~\n\n`Enqueue` starts by loading a value of `H` and then acquiring `myInd`. Note that this *is not* a consistent snapshot of the state of the queue, as `H` could have moved between the load and the increment (lines 12--13). However, `H` will only increase! If `startHead` is within `b.bound` of `myInd`, it means that `H` is at most that far behind where `T` was when we performed the increment. In that case we can simply attempt to CAS in `e` (line 40). If that fails, it can only mean that a receiver has placed a `waiter` in this index, so we wake up the receiver and return (lines 44--46).\n\nIf there is a chance that we *do* have to block, then we allocate a new `weakWaiter`. A `weakWaiter` is like a `waiter` except that it does not carry a value, and it tolerates more than one signal being sent on it. There are many ways to implement such a construct in Go; here is an implementation in terms of a `WaitGroup`:\n\n```go\ntype weakWaiter struct {\n\tOSize, Size int32\n\tWgroup      sync.WaitGroup\n}\n\nfunc makeWeakWaiter(i int32) *weakWaiter {\n\twait := &weakWaiter{Size: i, OSize: i}\n\twait.Wgroup.Add(1)\n\treturn wait\n}\n\n// Signal wakes the waiting thread if this is the first of the (at most OSize)\n// signals; subsequent signals are no-ops.\nfunc (w *weakWaiter) Signal() {\n\tnewVal := atomic.AddInt32(&w.Size, -1)\n\torig := atomic.LoadInt32(&w.OSize)\n\tif newVal+1 == orig {\n\t\tw.Wgroup.Done()\n\t}\n}\n\n// Wait blocks until the first Signal arrives.\nfunc (w *weakWaiter) Wait() {\n\tw.Wgroup.Wait()\n}\n```\n\nIn the case where we may have to block, we construct a `weakWaiter` that can absorb two signals, because it is possible for two dequeueing threads to concurrently attempt to wake up the same enqueueing thread (see below). If the sender successfully CASes `w` into the proper location (line 19), then it waits, and attempts the rest of the unbounded channel protocol when it wakes. There are two possible scenarios if this CAS fails:\n\n  (1) A receiver dequeueing the element `b.bound` slots behind this cell (the receiver whose 'buddy' index is ours; see `Dequeue` below) attempted to signal that this sender need not block, and arrived before `w` was stored.\n\n  (2) A receiver has already started waiting at this location.\n\nThe CAS on line 29 determines which case this is. In case (1) the cell holds `sentinel`, so the CAS succeeds and `e` is successfully in the queue. In case (2) the CAS fails, and the sender must instead wake up the waiting receiver (line 46).\n\n\n## Dequeue\n\nThe `Dequeue` implementation effectively mirrors the `Enqueue` implementation. There are, however, a few things that are especially subtle. Let's start with the implementation:\n\n~~~~ {.go .numberLines startFrom=\"49\"}\nfunc (b *BoundedChan) Dequeue() Elt {\n\tb.adjust()\n\tmyInd := index(atomic.AddUint64((*uint64)(&b.q.H), 1) - 1)\n\tcell, segInd := myInd.SplitInd()\n\tseg := b.q.findCell(b.head, cell)\n\t// If there are Enqueuers waiting to complete due to the buffer size, we\n\t// take responsibility for waking up the thread that FA'd b.q.H + b.bound.\n\t// If bound is zero, that is just the current thread. Otherwise we have to\n\t// do some extra work. The thread we are waking up is referred to in names\n\t// and comments as our 'buddy'.\n\tvar (\n\t\tbCell, bInd index\n\t\tbSeg        *segment\n\t)\n\tif b.bound > 0 {\n\t\tbuddy := myInd + index(b.bound)\n\t\tbCell, bInd = buddy.SplitInd()\n\t\tbSeg = b.q.findCell(b.head, bCell)\n\t}\n\tw := makeWaiter()\n\tvar res Elt\n\tif tryCas(seg, segInd, unsafe.Pointer(w)) {\n\t\tres = w.Recv()\n\t} else {\n\t\t// tryCas failed, which means that through the \"possible histories\"\n\t\t// argument, this must be either an Elt, a waiter or a weakWaiter. It\n\t\t// cannot be a waiter, because we are the only actor allowed to swap\n\t\t// one into this location. Thus it must either be a weakWaiter or an Elt.\n\t\t// If it is a weakWaiter, then we must signal it before CASing in w;\n\t\t// otherwise the other thread could starve. If it is a normal Elt we\n\t\t// do the rest of the protocol. This also means that we can safely load\n\t\t// an Elt from seg, which is not always the case, because sentinel is\n\t\t// not an Elt.\n\t\t// Step 1: we failed to put our waiter into this index. That means either\n\t\t// our value is in there, or there is a weakWaiter in there. Either way\n\t\t// these are valid Elts, and we can reliably distinguish them with a type assertion.\n\t\telt := seg.Load(segInd)\n\t\tres = elt\n\t\tif ww, ok := (*elt).(*weakWaiter); ok {\n\t\t\tww.Signal()\n\t\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&seg.Data[segInd])),\n\t\t\t\tunsafe.Pointer(elt), unsafe.Pointer(w)) {\n\t\t\t\tres = w.Recv()\n\t\t\t} else {\n\t\t\t\t// someone CASed a value over the weakWaiter; it can only have\n\t\t\t\t// been the blocked sender we just signalled\n\t\t\t\tres = seg.Load(segInd)\n\t\t\t}\n\t\t}\n\t}\n\tfor b.bound > 0 { // runs at most twice\n\t\t// We have successfully gotten the value out of our cell. Now we\n\t\t// must ensure that our buddy is either woken up if they are\n\t\t// waiting, or that they will know not to sleep.\n\t\t// If bElt is not nil, it holds either an Elt or a weakWaiter. If it\n\t\t// holds a weakWaiter, then we need to signal it to wake up the buddy.\n\t\t// If bElt is nil, then we attempt to CAS sentinel into the buddy\n\t\t// index. If we fail, the buddy may have CASed in a weakWaiter, so we\n\t\t// must go around again. However, that will only happen once.\n\t\tbElt := bSeg.Load(bInd)\n\t\tif bElt != nil {\n\t\t\tif ww, ok := (*bElt).(*weakWaiter); ok {\n\t\t\t\tww.Signal()\n\t\t\t}\n\t\t\t// Either the cell holds a real queue value (so the buddy cannot\n\t\t\t// be waiting), or we just signalled the buddy's weakWaiter.\n\t\t\tbreak\n\t\t}\n\t\t// Let buddy know that they do not have to block\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&bSeg.Data[bInd])),\n\t\t\tunsafe.Pointer(nil), sentinel) {\n\t\t\tbreak\n\t\t}\n\t}\n\treturn res\n}\n~~~~\n
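\nBefore turning to the subtleties, a short trace may help. Consider a channel with `bound = 1`, where three senders arrive before any receiver (our illustration):\n\n~~~~\nS1: myInd = 0, startHead = 0; gap 0 <= bound  -> CAS elt0 into cell 0\nS2: myInd = 1, startHead = 0; gap 1 <= bound  -> CAS elt1 into cell 1\nS3: myInd = 2, startHead = 0; gap 2 >  bound  -> CAS a weakWaiter into\n                                                 cell 2 and block\nR1: myInd = 0, buddy = 1; receives elt0. Cell 1 already holds an Elt,\n    so there is no sender to wake.\nR2: myInd = 1, buddy = 2; receives elt1, finds the weakWaiter in cell 2,\n    and Signals it. S3 wakes and CASes elt2 into cell 2.\n~~~~\n\n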
Now the subtleties. A dequeuer may have to wake up multiple waiting send threads: one waiting at `myInd` and another waiting at `myInd + bound` (the 'buddy' cell `bInd` of segment `bSeg`). This may seem strange, because the dequeuer that received `myInd - bound` ought to have woken up any sender pending at `myInd`. The issue is that *we have no guarantee that this dequeuer has returned*. The possibility of this occurring is remote with a large buffer size, but when `bound` is small it happens with some regularity.\n\nThe second subtlety is a peculiarity of Go. On line 87 there is a *type assertion*, which de-references an `Elt` to yield a value of type `interface{}`. The `interface{}` contains a pointer to some runtime information about the actual type of the pointed-to struct, and the `.(*weakWaiter)` syntax queries whether `elt` holds a pointer to a `weakWaiter`.\n
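\nFor readers unfamiliar with the construct, here is a tiny standalone illustration of a type assertion (unrelated to the queue code):\n\n~~~~ {.go}\npackage main\n\nimport \"fmt\"\n\ntype secret struct{ n int }\n\nfunc main() {\n\tvar v interface{} = &secret{n: 42}\n\t// Asserting the type the value actually has yields the concrete pointer\n\t// and ok == true.\n\tif s, ok := v.(*secret); ok {\n\t\tfmt.Println(\"a *secret:\", s.n)\n\t}\n\t// Asserting the wrong type reports ok == false instead of panicking.\n\tif _, ok := v.(*int); !ok {\n\t\tfmt.Println(\"not an *int\")\n\t}\n}\n~~~~\n\n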
This is a safe thing to do, because `weakWaiter` is a package-private type: no external caller could pass in an `Elt` that pointed to a `weakWaiter` unless we returned one from one of the public functions in the package, which we do not.\n\nThis is complicated by the fact that `*waiter`s are actually stored in the queue directly, without hiding behind an interface value (e.g. at line 90). This is because the extra layer of indirection is unnecessary: it is always possible to determine whether an `Elt` or a `*waiter` is present in a given location, based on which CASes have failed and which have succeeded.\n\n# Performance\n\nWe benchmarked six separate channel configurations on enqueue/dequeue pairs:\n\n  * *Bounded0*: A `BoundedChan` with buffer size 0\n\n  * *Bounded1K*: A `BoundedChan` with buffer size 1024\n\n  * *Unbounded*: An `UnboundedChan`\n\n  * *Chan0*: An unbuffered native Go channel\n\n  * *Chan1K*: A native Go channel with buffer size 1024\n\n  * *Chan10M*: A native Go channel with buffer size $10^7$, which is the total number of elements enqueued into the channel over the course of the benchmark.\n\nWe include benchmark results for two cases: one where we allocate one goroutine per processor (where the processor count is set with the `GOMAXPROCS` function from Go's `runtime` package), and one where we allocate 5000 goroutines, irrespective of the current value of `GOMAXPROCS`. We include both of these for two reasons. First, it is not uncommon to have thousands of goroutines active in a running Go program, and it makes sense to consider the case where processors are over-subscribed in that manner. Second, we noticed that performance is often *better* in the cases where cores are oversubscribed. While counter-intuitive, this is possibly due to a combination of unpredictable scheduler performance and the lower overhead of synchronizing between two goroutines executing on the same core.\n\nThese benchmarks were conducted on a machine with 2 Intel Xeon 2620 v4 CPUs, each with 8 cores clocked at 2.1GHz and two hardware threads per core. We were unable to allocate cores in an intuitive manner, so the 16-core benchmark is actually using all of a single CPU's hardware threads; only at core-counts higher than 16 does the program cross a NUMA domain. The benchmarks were run on the Windows Subsystem for Linux[^wsl], an implementation of an Ubuntu 14.04 userland within the Windows 10 operating system. These benchmarks were conducted using Go version 1.6.\n\nThese numbers were produced by performing 5,000,000 enqueues and dequeues per configuration, averaged over 5 iterations per setting, with a full GC between iterations.\n\nThe benchmarks show that both the *Bounded* and *Unbounded* configurations are able to increase throughput as the core-count increases; native Go channels are unable to do so. When using more than 4 processors, *Unbounded* and *Bounded1K* provide much better throughput than native channels regardless of buffer size. *Unbounded* in particular is often 2-3x faster than the buffered *Chan* configurations, while *Bounded0* continues to increase throughput even after crossing a NUMA domain and dipping into multiple hardware threads per core.\n
At the highest core counts, all three new configurations outpace native Go channels.\n\n![](contend_graph.pdf)\n![](gmp_graph.pdf)\n\n\n# Linearizability\n\nWe contend that both the bounded and unbounded queues presented in this document are *linearizable* with respect to their Enqueue and Dequeue operations. Linearizability is a strong consistency guarantee often used to specify the behavior of concurrent data-structures. Informally, we say a structure is linearizable if, for an arbitrary (possibly infinite) history of concurrent operations on the structure beginning and ending at specific times, we can *linearize* it such that each operation occurs atomically at some point in time between its beginning and its end [see Chapter 3 of @herlihyBook for an overview; linearizability was introduced in @herlihyLinear].\n\nThis section describes linearization procedures for the bounded and unbounded channels in this document. Both channels begin with a fetch-add on the head or tail index of the queue that determines the *logical index* that will be the subject of their send or receive. We denote by $e_i$ and $d_i$ the enqueue and dequeue operations that fetch-add to get a value of `myInd` equal to $i$. We will provide linearizations that preserve the following properties, where $\\prec$ indicates precedence in the linearized sequence of events. For all $i$ we must have that\n\n  (1) $e_i \\prec e_{i+1}$ (if $e_{i+1}$ occurs)\n  (2) $d_i \\prec d_{i+1}$ (if $d_{i+1}$ occurs)\n  (3) $e_i \\prec d_i$     (if both occur)\n\nwhich we take to be a straightforward sequential specification for a channel.\n\n## Unbounded Channels\n\nOur linearization procedure considers two broad cases: a fast and a slow path.\n\n  * In the fast path, there is sufficient distance between enqueuers and dequeuers that the fetch-add of $e_i$ occurs before the fetch-add of $d_i$ *and* $e_i$'s CAS succeeds. In this case, linearize $e_i$ and $d_i$ at their respective fetch-adds.\n\n  * In the case where $d_i$'s fetch-add occurs before that of $e_i$ (or the CAS fails), we linearize *both* operations at $e_i$'s fetch-add, with $e_i$ occurring just before $d_i$.\n\nObserve that both cases in this procedure linearize $e_i$ and $d_i$ between their start and finish times. The second case is guaranteed to do so because, if $d_i$ must block, then $e_i$ is responsible for unblocking it; and if $d_i$ does not block, then we know its CAS fails, meaning that $e_i$'s fetch-add occurs after $d_i$'s fetch-add but before its failed CAS.\n\nWe will now show that the above procedure yields a history consistent with the three criteria provided above. The proof strategy is to show, for both the fast and slow paths, that we can maintain the criteria for an arbitrary $e_i,d_i,e_{i+1},d_{i+1}$. Given this, we can conclude that the criteria are satisfied for an arbitrary number of enqueue-dequeue pairs. We then consider the other possible cases.\n\n*The Fast Path*\n\nWe know that we satisfy (1) because all $e_i$, fast or slow path, linearize at their fetch-adds, and these are guaranteed to provide a total order on operations. We satisfy (3) by assumption. Consider $d_{i+1}$: if it hits the fast path, then it is linearized at its fetch-add, which must happen after $d_i$'s fetch-add.\n
If it hits the slow path, then it will be linearized at the fetch-add of $e_{i+1}$; but by assumption we only hit the slow path if $d_{i+1}$'s fetch-add completed before that of $e_{i+1}$, and $d_{i+1}$'s fetch-add certainly completed after that of $d_i$, so we satisfy (2).\n\n*The Slow Path*\n\nThe argument for (1) is the same as in the fast path, and the argument for (3) follows by assumption. Once again, the interesting case is to show that we maintain an ordering between dequeue operations. There are two possible cases:\n\n(1) *$d_{i+1}$ blocks.*\n    We know that $d_{i+1}$ will take the slow path, and will therefore be linearized at $e_{i+1}$'s fetch-add, which occurs later.\n\n(2) *$d_{i+1}$ does not block.*\n    The only way that $d_{i+1}$ does not block is if its CAS fails, which means that there is another enqueuer $e_{i+1}$ that completed. Regardless of whether $d_{i+1}$ is linearized on a slow path or a fast path, it must be after the fetch-add of $e_{i+1}$, and hence also after that of $e_i$.\n\n*Small Numbers of Operations*\n\nIf there is only one enqueue operation, then at most one dequeue operation will be linearized. This is fine, because at most one dequeue operation will complete, while any others will block forever. The definitions of the two cases in the linearization procedure automatically yield condition (3), while (1) and (2) are trivially satisfied, as there is only one enqueue and at most one dequeue.\n\n*Concluding*\n\nWe can conclude by induction that for any finite number of enqueues and dequeues, there is a linearization that satisfies a standard sequential specification for a channel. For infinite sequences of operations (assuming `H` and `T` can be updated with arbitrary precision), there is likely a similar co-inductive characterization of the same process; the above argument should still hold. We conclude that unbounded channels are linearizable. $\\square$\n\n## Bounded Channels\n\nThe bounded case has the same linearization procedure (and proofs) as the unbounded case, with the caveat that enqueue operations that do not return never make it into the history. This works because all operations unconditionally perform fetch-adds, even if they later have to block for an unbounded amount of time. $\\square$\n\n# Conclusion and Future Work\n\nThis document demonstrates that it is possible to implement scalable unbounded and bounded queues while still satisfying a strong consistency guarantee. It leverages techniques from the recent literature on non-blocking queues to implement (to our knowledge) novel blocking constructs. There are a number of avenues for future work.\n\n**Verification**\n\nIt will be useful to model both channels in [SPIN](http://spinroot.com/spin/whatispin.html) or [TLA+](http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html) to provide further assurance that the algorithms are correct. While it would be more involved, proving correctness in [Coq](https://coq.inria.fr/) in line with techniques mentioned in [FRAP](http://adam.chlipala.net/frap/) would also be helpful in building confidence in the algorithms.\n\n**Implement in the Go runtime**\n\nImplementing these channels within the runtime could further reduce the algorithms' overhead.\n
In particular, a runtime implementation could realize the blocking semantics more efficiently by accessing goroutine and scheduler metadata directly, whereas the current implementation relies on `WaitGroup`s, which may be too heavyweight for our purposes.\n\n**Improving Performance**\n\nSome variants of this algorithm still perform worse at lower core-counts than their native Go equivalents. One possible reason for this is how much allocation these queues perform (Go channels need only keep a single fixed-size buffer). It could be fruitful to experiment with schemes that reduce allocation, as well as with algorithms that allocate a fixed-size buffer, similar to the CRQ algorithm in @lcrq.\n\n\n# Appendix A: Efficient Segment Allocation\n\nIn order to speed up allocation, we add a list to the queue state. This list is similar to standard lock-free queue designs in the literature, and bears some resemblance to `Queue1` above. The major difference here is that we only provide partial push and pop operations: `TryPush` will fail if the list may be too large or if it runs out of `patience`, and `TryPop` will fail if its CAS fails more than `patience` times.\n\n~~~~{.go}\ntype listElt *segment\ntype segList struct {           type segLink struct {\n\tMaxSpares, Length int64       \tElt  listElt\n\tHead              *segLink    \tNext *segLink\n}                               }\n\nfunc (s *segList) TryPush(e listElt) {\n\t// bail out if the list is at capacity\n\tif atomic.LoadInt64(&s.Length) >= s.MaxSpares {\n\t\treturn\n\t}\n\t// add to Length. Note that this is not atomic with respect to the append,\n\t// which means we may be under capacity on occasion. This list is only used\n\t// in a best-effort capacity, so that is okay.\n\tatomic.AddInt64(&s.Length, 1)\n\ttl := &segLink{Elt: e, Next: nil}\n\tconst patience = 4\n\ti := 0\n\tfor ; i < patience; i++ {\n\t\t// attempt to CAS Head from nil to the new link\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)),\n\t\t\tunsafe.Pointer(nil), unsafe.Pointer(tl)) {\n\t\t\tbreak\n\t\t}\n\t\t// otherwise, try to find the current end of the list\n\t\ttailPtr := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head))))\n\t\tif tailPtr == nil {\n\t\t\t// if Head was switched back to nil, retry\n\t\t\tcontinue\n\t\t}\n\t\t// advance tailPtr until it has a nil Next pointer\n\t\tfor {\n\t\t\tnext := (*segLink)(atomic.LoadPointer(\n\t\t\t\t(*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next))))\n\t\t\tif next == nil {\n\t\t\t\tbreak\n\t\t\t}\n\t\t\ttailPtr = next\n\t\t}\n\t\t// try to add the new link to the end of the list\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tailPtr.Next)),\n\t\t\tunsafe.Pointer(nil), unsafe.Pointer(tl)) {\n\t\t\tbreak\n\t\t}\n\t}\n\tif i == patience {\n\t\t// we gave up; undo the increment of Length\n\t\tatomic.AddInt64(&s.Length, -1)\n\t}\n}\n\nfunc (s *segList) TryPop() (e listElt, ok bool) {\n\tconst patience = 1\n\tif atomic.LoadInt64(&s.Length) <= 0 {\n\t\treturn nil, false\n\t}\n\tfor i := 0; i < patience; i++ {\n\t\thd := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head))))\n\t\tif hd == nil {\n\t\t\treturn nil, false\n\t\t}\n\t\t// if Head is not nil, try to swap it for its Next pointer\n\t\tnxt := (*segLink)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&hd.Next))))\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&s.Head)),\n\t\t\tunsafe.Pointer(hd), unsafe.Pointer(nxt)) {\n\t\t\tatomic.AddInt64(&s.Length, -1)\n\t\t\treturn hd.Elt, true\n\t\t}\n\t}\n\treturn nil, false\n}\n~~~~\n\n
Given this list implementation, we simply insert calls to `TryPush` and `TryPop` around the original implementation of `Grow` to have it take advantage of extra allocations:\n\n~~~~ {.go}\ntype queue struct {\n\tH, T        index\n\tSpareAllocs segList\n}\nfunc (q *queue) Grow(tail *segment) {\n\tcurTail := atomic.LoadUint64((*uint64)(&tail.ID))\n\t// First, try to re-use a segment that another thread allocated in vain.\n\tif next, ok := q.SpareAllocs.TryPop(); ok {\n\t\tatomic.StoreUint64((*uint64)(&next.ID), curTail+1)\n\t\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),\n\t\t\tunsafe.Pointer(nil), unsafe.Pointer(next)) {\n\t\t\treturn\n\t\t}\n\t}\n\tnewSegment := &segment{ID: index(curTail + 1)}\n\tif atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&tail.Next)),\n\t\tunsafe.Pointer(nil), unsafe.Pointer(newSegment)) {\n\t\treturn\n\t}\n\t// If we allocated a new segment but failed, attempt to place it in\n\t// SpareAllocs so someone else can use it.\n\tq.SpareAllocs.TryPush(newSegment)\n}\n~~~~\n\nThis scheme led to significant speedups in performance tests, but the code in `q.go` includes a constant that, if set to false, will disable any such list-based caching of allocations. This should make it easy to verify or falsify those performance measurements.\n\n# References\n\n<!-- Footnotes referenced in the text -->\n\n[^pseudo]: Pseudocode in this document will increasingly resemble real, working Go code. While we will try to explain core Go concepts as we go, a passing familiarity with Go syntax (or at least a willingness to squint and pretend one is reading C) will be helpful.\n\n[^select]: Our focus is send and receive; we do not cover `select` or `close` here. `Close` would be fairly simple to add; `select` could be implemented by using channels for the waiting mechanism used by receivers. While this would not be difficult, it would slow things down compared to the `WaitGroup` implementation.\n\n[^goroutine]: Go's standard unit of concurrency is called a goroutine. Goroutines take the place of threads in a language like C, but they are generally much cheaper to create and provide faster context switches. Many goroutines are independently scheduled on top of a smaller number of native operating system threads. This scheduling is not preemptive in the standard implementation; rather, goroutines implicitly yield on function-call boundaries.\n\n[^chanimp]: See the [Go channel source](https://golang.org/src/runtime/chan.go). In particular, note the calls to `lock` in `chansend` and `chanrecv`.\n\n[^combine]: Consult the related-work section of @wfq on *combining* queues for an example of this; @lcrq has a similar survey.\n\n[^msqueue]: Less contention is not something that you get automatically when an algorithm is lock-free. An early lock-free queue @MSQueue still suffers from bottlenecks around the head and tail pointers, which are CAS-ed by all contending threads. Most of these CASes will fail, and all threads whose CASes fail must retry. Exponential backoff schemes can help this state of affairs, but the bottleneck is still present; see the performance measurements in @wfq, which include the algorithm from @MSQueue.\n\n[^fasem]: F&A is more commonly defined to return the *old* value of `src`, but returning the new value is equivalent.\n\n[^helping]: Helping is a standard technique for making obstruction-free or lock-free algorithms wait-free. 
The technique goes back to @wfsync; the practice of using a weaker progress guarantee as a fast path and then falling back to a helping mechanism to ensure wait freedom was introduced in @FastSlow. An explanation of helping can be found in chapters 6 and 10.5 of @herlihyBook.\n\n[^bounded]: See, for example, [this discussion](https://mail.mozilla.org/pipermail/rust-dev/2013-December/007449.html) on the Rust mailing list regarding unbounded channels. Haskell's standard channel implementation in [Control.Concurrent](https://hackage.haskell.org/package/base-4.9.0.0/docs/Control-Concurrent-Chan.html) is unbounded, as are the STM variants.\n\n[^wsl]: See [this blog post](https://blogs.msdn.microsoft.com/wsl/2016/04/22/windows-subsystem-for-linux-overview/) as well as the various [follow-ups](https://blogs.msdn.microsoft.com/wsl/) for an overview of this system.\n\n[^livelockdef]: A [livelock](https://en.wikipedia.org/wiki/Deadlock#Livelock) is a scenario in which one or more threads never block (i.e. they continuously change their respective states) but still indefinitely fail to make progress.\n"
  }
]